DA Homework 3 - SOLUTION

Task 0

Download purchases.csv from the data section. This sample data contains purchases from an online store. Load the data into R and check whether the types of your variables are correct (e.g. purchase_date should be of type “Date”).

library(tidyverse)  # provides read_csv(), the dplyr verbs and ggplot2
purchases <- read_csv("/Users/jdivenyi/teaching/BME_adat/201617/data/purchases.csv")
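
One way to check the column types, and to force them if needed, is sketched below. It reads a hypothetical three-row sample instead of the real file, and it assumes the columns are named purchase_date and sales, as in the tasks that follow. The date format is an assumption as well.

```r
library(readr)

# Hypothetical three-row sample standing in for purchases.csv
csv_text <- "purchase_date,sales\n2012-01-05,10.5\n2012-01-06,20\n2013-02-01,7.25\n"

# I() marks the string as literal data rather than a file path (readr >= 2.0)
sample_purchases <- read_csv(
    I(csv_text),
    col_types = cols(
        purchase_date = col_date(format = "%Y-%m-%d"),  # assumed date format
        sales = col_double()
    )
)

# Inspect the class of every column
sapply(sample_purchases, class)
```

If read_csv() already guesses the types correctly, the col_types argument can be left out. Spelling it out makes the expected types explicit.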

Task 1

Plot the histogram of all the log sales.

purchases %>%
    ggplot(aes(x = log(sales))) + geom_histogram()

Task 2

Plot the distributions of log sales amounts for the two years separately. Check geom_density in the documentation of ggplot2. Here is what you should get.

purchases %>%
    mutate(year = factor(year)) %>%
    ggplot(aes(x=log(sales), fill=year)) +
    geom_density(alpha=0.3)

Task 3

Plot the aggregate daily sales. Add a smoothed line to the plot (you can experiment with the span option of geom_smooth() to control the smoothness of your line).

Default:

purchases %>%
    group_by(purchase_date) %>%
    summarise(sales = sum(sales)) %>%
    ggplot(aes(x=purchase_date, y=sales)) +
        geom_line() +
        geom_smooth()

With span = 0.2:

purchases %>%
    group_by(purchase_date) %>%
    summarise(sales = sum(sales)) %>%
    ggplot(aes(x=purchase_date, y=sales)) +
        geom_line() +
        geom_smooth(span = 0.2)
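
With fewer than 1,000 observations, geom_smooth() defaults to a loess fit, and span is the fraction of the data used for each local fit. The effect of span can be seen outside ggplot2 with loess() from base R. The series below is hypothetical simulated data, not the purchases data.

```r
# Hypothetical noisy daily series standing in for aggregate daily sales
set.seed(1)
day <- 1:200
sales <- 100 + 10 * sin(day / 20) + rnorm(200, sd = 5)

# Fit loess curves with the default span (0.75) and a smaller one (0.2)
fit_default <- loess(sales ~ day, span = 0.75)
fit_small <- loess(sales ~ day, span = 0.2)

# Residual spread: the smaller span follows the data more closely
c(default = sd(residuals(fit_default)), small = sd(residuals(fit_small)))
```

A smaller span gives a wigglier line that tracks short-term changes. A larger span gives a smoother line that shows only the overall trend.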

Task 4

Which month brings the most sales? Plot a bar graph with aggregate sales per month. Look at the documentation of geom_bar() to solve this. Note the labels of the x axis (the documentation helps to reproduce them).

purchases %>%
    mutate(month = factor(month)) %>%
    group_by(month) %>%
    summarise(aggregate_sales = sum(sales)) %>%
    ggplot(aes(x = month, y = aggregate_sales)) +
    geom_bar(stat = "identity")
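
The question can also be answered without reading it off the plot, by sorting the monthly totals. This is a minimal sketch on hypothetical toy data. The names toy_purchases, monthly_sales and top_month are illustrative only.

```r
library(dplyr)

# Hypothetical toy data; the real purchases table is assumed to have
# `month` and `sales` columns, as used in the pipeline above
toy_purchases <- tibble(
    month = c(1, 1, 2, 2, 3),
    sales = c(10, 20, 5, 5, 50)
)

# Total sales per month, largest first
monthly_sales <- toy_purchases %>%
    group_by(month) %>%
    summarise(aggregate_sales = sum(sales)) %>%
    arrange(desc(aggregate_sales))

# The first row is the month with the highest aggregate sales
top_month <- monthly_sales$month[1]
top_month
```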

Task 5

Recreate the previous graph by drawing the columns separately for the years (map the year variable to the fill of the bars and see the examples in the documentation to achieve side-by-side bars).

purchases %>%
    mutate(year = factor(year)) %>%
    group_by(year, month) %>%
    summarise(aggregate_sales = sum(sales)) %>%
    ggplot(aes(x = month, y = aggregate_sales, fill = year)) +
    geom_bar(stat = "identity", position = "dodge") +
    scale_x_continuous(breaks = seq(1, 12))

Task 6

We have seen that aggregate sales are lower in 2013 than in 2012. Which year has the higher average sales amount? Plot a bar graph of average sales per year and add error bars of two standard errors, i.e. two times the standard deviation divided by the square root of the number of observations. You can add error bars with the geom_errorbar() function of ggplot2. (I use the “lightblue” color to fill the bars in order to make the error bars more visible.)

purchases %>%
    mutate(year = factor(year)) %>%
    group_by(year) %>%
    summarise(avg_sales = mean(sales), sd = sd(sales), n = n()) %>%
    ggplot(aes(x = year, y = avg_sales)) +
    geom_bar(stat = "identity", fill = 'lightblue') +
    geom_errorbar(aes(ymin = avg_sales - 2*sd/sqrt(n), ymax = avg_sales + 2*sd/sqrt(n))) 
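
The limits of the error bars can be checked by hand on a small example. The sketch below uses five hypothetical sales amounts and computes the same quantities as the summarise() and geom_errorbar() calls above.

```r
# Hypothetical sales amounts for one year
sales <- c(10, 12, 14, 16, 18)

n <- length(sales)
avg_sales <- mean(sales)
se <- sd(sales) / sqrt(n)  # standard error of the mean

# Error bar limits: the mean plus or minus two standard errors
c(ymin = avg_sales - 2 * se, ymax = avg_sales + 2 * se)
```

Dividing by sqrt(n) is what turns the spread of individual purchases into the uncertainty of the yearly average, so the bars shrink as the number of observations grows.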

Task +1

Watch this video and collect 3 positive (or negative) points about the presentation.