Advanced work with dplyr, ggplot, tidyr

Load in required packages and data:

library(dplyr)
library(tidyr)
library(nycflights13)
library(ggplot2)
data(flights)

Filter out large delays (keep departure delays below 240 minutes; filter() also drops rows where dep_delay is missing):

flights <- flights %>% filter(dep_delay < 240)

Function summarise_each

Useful for applying the same summary functions to several variables. (In dplyr 1.0.0 and later, summarise_each() and funs() are deprecated in favor of across().)

flights %>% select(dep_delay, arr_delay) %>%
    summarise_each(funs(mean))
# remove missing values from the calculation
flights %>% select(dep_delay, arr_delay) %>%
    summarise_each(funs(mean(., na.rm=TRUE)))
# using the helper function matches()
flights %>%
    summarise_each(funs(mean(., na.rm=TRUE)), matches("delay"))
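
For readers on a recent dplyr, the last call above can be written with across(). This is only a sketch: it assumes dplyr 1.0.0 or later, and `toy` is a made-up stand-in for flights so that it runs without nycflights13 installed.

```r
library(dplyr)

# Hypothetical toy data standing in for flights (values are invented).
toy <- tibble(
    dep_delay = c(5, NA, 20),
    arr_delay = c(-3, 10, NA),
    distance  = c(100, 200, 300)
)

# Counterpart of summarise_each(funs(mean(., na.rm=TRUE)), matches("delay")):
# across() picks the columns, the lambda is applied to each of them.
delay_means <- toy %>%
    summarise(across(matches("delay"), ~ mean(.x, na.rm = TRUE)))
delay_means
```

Only the two delay columns are summarised; distance is left out because it does not match "delay".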

Package tidyr

The package tidyr is great for reshaping data from long to wide form, or from wide to long. The package documentation gives a broader introduction; we are going to use only the gather() and spread() functions. We start with a density plot of a single variable:

flights %>% ggplot(aes(x=dep_delay)) + geom_density()

If we would like to plot more variables on the same plot, it is best to first collect them into one with gather() and then map the type into a new dimension of the graph (say, color). Here we plot the distribution of the arrival and departure delay on the same plot.

flights %>% gather(delay, value, dep_delay, arr_delay) %>%
    ggplot(aes(x=value, fill=delay)) + geom_density(alpha = .3)
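
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below shows the same reshaping step on its own; `toy` and its values are invented for illustration, and the result can be passed to ggplot() exactly as above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for flights (values are invented).
toy <- tibble(
    flight    = c(1, 2),
    dep_delay = c(5, 20),
    arr_delay = c(-3, 10)
)

# Same reshaping as gather(delay, value, dep_delay, arr_delay):
# the two delay columns are stacked into a name column and a value column.
long <- toy %>%
    pivot_longer(c(dep_delay, arr_delay), names_to = "delay", values_to = "value")
long
```

The rows come out in a different order than with gather() (grouped by original row rather than by column), which does not matter for the density plot.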

The package can also be used to create nice summary tables. In the illustration below we first gather the variables we would like to summarise, then group the rows by variable name and apply several summary functions to each group. (It may help to run the command step by step and look at the intermediate results.)

flights %>% 
    gather(measure, value, dep_delay, arr_delay, air_time, distance) %>%
    select(value, measure) %>%
    filter(!is.na(value)) %>%
    group_by(measure) %>%
    summarise_each(funs(mean, median, min, max, sd))
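
To look at the intermediate results, the pipeline can be split into named steps. The sketch below does this on a made-up data frame (`toy` is hypothetical) with only two summary functions, and uses across() from dplyr 1.0.0 or later for the last step.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for flights (values are invented).
toy <- tibble(
    dep_delay = c(0, 10, NA),
    air_time  = c(60, 80, 100)
)

# Step 1: stack the columns into (measure, value) pairs, one row per cell.
long <- toy %>% gather(measure, value, dep_delay, air_time)

# Step 2: drop missing values, then group by the variable name.
grouped <- long %>% filter(!is.na(value)) %>% group_by(measure)

# Step 3: one row per measure, one column per summary function.
stats <- grouped %>%
    summarise(across(value, list(mean = mean, max = max)))
stats
```

Printing long, grouped and stats in turn shows how the table changes shape: six rows after gathering, five after dropping the missing value, and two in the final summary, with columns value_mean and value_max.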

For loop and ggplot

If you would like to create the same plot for different variables, you may want to use a loop instead of typing the same thing again and again. However, looping over bare variable names is tricky. It is better to loop over the names as strings and use the aes_string() function within ggplot(), as illustrated in this example. (aes_string() is deprecated since ggplot2 3.0.0, but it still illustrates the idea.)

for (var in c("dep_delay", "arr_delay")) {
    p <- flights %>%
        ggplot(aes_string(x=var)) + 
        geom_histogram()
    print(p)  # plots are not auto-printed inside a loop, so print explicitly
    # ggsave(paste0(var, "_hist.png"), p)  # you can save them within the loop
}
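
With ggplot2 3.0.0 or later, the same loop can be written without aes_string() by using the .data pronoun. The following is a sketch under that assumption; `toy` and the helper `make_hist()` are hypothetical names introduced for illustration only.

```r
library(ggplot2)

# Hypothetical toy data standing in for flights (values are invented).
toy <- data.frame(dep_delay = c(5, 20, 35), arr_delay = c(-3, 10, 40))

# Hypothetical helper: builds a histogram of the column named by `var`.
make_hist <- function(data, var) {
    force(var)  # fix the column name now; aes() is only evaluated when drawing
    ggplot(data, aes(x = .data[[var]])) +
        geom_histogram(bins = 10) +
        labs(x = var)  # set the axis label explicitly
}

plots <- list()
for (var in c("dep_delay", "arr_delay")) {
    plots[[var]] <- make_hist(toy, var)
}
# print(plots[["dep_delay"]]) draws the first histogram
```

Building each plot inside a function matters here: aes() looks up var only when the plot is drawn, so plots stored directly from the loop body and printed later would all use the last value of var.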