Homework 2 — SOLUTIONS

library(dplyr)
library(tidyr)
library(ggplot2)

SAME AS PRACTICE EXERCISES OF THE LAST CLASS

TASK 6 IS UPDATED - THANKS TO ISTVAN FOR THE NOTICE

Due 7 October 24:00. Send the hw2_<your-last-name> file to divenyi.janos@phd.ceu.edu.

Task 0

Download the purchases.csv from the data section. Load it into R.

purchases <- read.csv(
    "/home/divenyijanos/Dropbox/teaching/Programming_Tools/Fall2015/Data/purchases.csv"
    )
## Warning in file(file, "rt"): cannot open file '/home/divenyijanos/Dropbox/
## teaching/Programming_Tools/Fall2015/Data/purchases.csv': No such file or
## directory
## Error in file(file, "rt"): cannot open the connection
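
The error above arises because the absolute path exists only on the author's machine. A minimal sketch of a more portable pattern follows. It assumes nothing about the real purchases.csv: a tiny made-up file (hypothetical name purchases_toy.csv, hypothetical values) stands in for it so the snippet runs on its own.

```r
# Write a tiny stand-in file to a temporary directory
# (made-up data, not the real purchases.csv).
toy_path <- file.path(tempdir(), "purchases_toy.csv")
toy <- data.frame(
    contact_id    = c(1, 2, 1),
    purchase_date = c("2013-01-05", "2013-02-10", "2014-01-20"),
    sales         = c(100, 250, 50)
)
write.csv(toy, toy_path, row.names = FALSE)

# Check that the file exists before reading, so a wrong path
# stops with a clear message instead of a connection error.
stopifnot(file.exists(toy_path))

# stringsAsFactors = FALSE keeps text columns as character in every R version.
purchases_toy <- read.csv(toy_path, stringsAsFactors = FALSE)
str(purchases_toy)
```

With the real data, the same call would take the location where you saved purchases.csv, for example a path relative to your working directory.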

Task 1

Give the mean and the median of the individual purchases.

purchases %>%
    summarise(mean(sales), median(sales))
## Error in eval(lhs, parent, parent): object 'purchases' not found

Task 2

Tell R that your purchase_date variable is a date. You can do this by applying the as.Date() function to the original variable (similar to how we can use as.character()). Then you can get the median day of the purchases.

purchases <- purchases %>% mutate(purchase_date = as.Date(purchase_date))
## Error in eval(lhs, parent, parent): object 'purchases' not found
purchases %>%
    summarise(median(purchase_date))
## Error in eval(lhs, parent, parent): object 'purchases' not found
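
As a small illustration of why the conversion matters, the sketch below uses three made-up dates. It assumes the dates are stored in the YYYY-MM-DD layout, which as.Date() parses without further arguments; whether purchases.csv uses that layout is not stated here.

```r
# Made-up dates in the ISO layout (YYYY-MM-DD).
raw_dates <- c("2013-03-01", "2013-01-15", "2013-02-10")
dates <- as.Date(raw_dates)

class(dates)     # "Date"
median(dates)    # the middle date in chronological order

# Other layouts need an explicit format string.
other <- as.Date("15/01/2013", format = "%d/%m/%Y")
other
```

Once the variable has class Date, functions such as median(), min() and max() treat the values as points in time rather than as text.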

Task 3

List the 5 biggest buyers along with their aggregate purchases.

purchases %>%
    group_by(contact_id) %>%
    summarise(sales = sum(sales)) %>%
    arrange(desc(sales)) %>%
    head(5)
## Error in eval(lhs, parent, parent): object 'purchases' not found

Task 4

Plot the distributions of log sales amounts for the two years separately. For this, the year variable should be a factor.

purchases %>%
    ggplot(aes(x=log(sales), fill=factor(year))) +
    geom_density(alpha=0.3)
## Error in eval(lhs, parent, parent): object 'purchases' not found

Task 5

List the number of buyers in each month by year. (Hint: you might need tidyr to accomplish this.)

purchases %>%
    group_by(year, month) %>%
    summarise(n = n()) %>%
    spread(year, n)
## Error in eval(lhs, parent, parent): object 'purchases' not found
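
Note that n() counts rows, that is, purchases. If "buyers" is read as distinct contacts, n_distinct(contact_id) may be closer to the intent. The sketch below compares the two on made-up data; the toy data frame and its values are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

# Made-up purchases: contact 1 buys twice in January 2013.
toy <- data.frame(
    contact_id = c(1, 1, 2, 3, 3),
    year       = c(2013, 2013, 2013, 2014, 2014),
    month      = c(1, 1, 1, 1, 2)
)

counts <- toy %>%
    group_by(year, month) %>%
    summarise(
        purchases = n(),                     # number of rows
        buyers    = n_distinct(contact_id)   # number of distinct contacts
    ) %>%
    ungroup()
counts

# One column per year, as in the solution above.
wide <- counts %>%
    select(year, month, buyers) %>%
    spread(year, buyers)
wide
```

In January 2013 the toy data has three purchases but only two buyers, so the two counts differ whenever a contact buys more than once in a month.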

Task 6

What share of total sales in 2013 comes from the top 5 buyers in 2013? You may want to aggregate sales by contact first and then to use the cumsum() function to calculate cumulative sums.

Hint: this table is an intermediate state you may want to achieve.

middle <- purchases %>%
    filter(year == 2013) %>%
    group_by(contact_id) %>%
    summarise(sales = sum(sales)) %>%
    arrange(desc(sales)) %>%
    mutate(cumulative_sales = cumsum(sales), all_sales = sum(sales))
## Error in eval(lhs, parent, parent): object 'purchases' not found

The correct answer can be read off the previous table:

middle %>%
    head(5) %>% tail(1) %>%
    mutate(top5_share = cumulative_sales/all_sales) %>%
    select(top5_share)
## Error in eval(lhs, parent, parent): object 'middle' not found
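
To make the cumsum() logic concrete, here is a worked sketch on four made-up buyers, taking the top 2 instead of the top 5. The data and the name top2_share are hypothetical.

```r
library(dplyr)

# Made-up aggregate sales per buyer.
toy <- data.frame(
    contact_id = c("a", "b", "c", "d"),
    sales      = c(10, 40, 30, 20)
)

shares <- toy %>%
    arrange(desc(sales)) %>%
    mutate(
        cumulative_sales = cumsum(sales),   # 40, 70, 90, 100
        all_sales        = sum(sales),      # 100 in every row
        share            = cumulative_sales / all_sales
    )
shares

# Share of the top 2 buyers: read it off row 2.
top2_share <- shares$share[2]
top2_share
```

Because the rows are sorted by descending sales, row k of the cumulative sum holds the combined sales of the top k buyers, so dividing by the total gives their share directly.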

Task 7

Plot the aggregate daily sales (you should combine dplyr and ggplot statements). Note that purchase_date should be a date rather than a factor or character. Add a smoothed line to the plot (you can experiment with the span option of geom_smooth() to control the smoothness of your line).

Default:

purchases %>%
    group_by(purchase_date) %>%
    summarise(sales = sum(sales)) %>%
    ggplot(aes(x=purchase_date, y=sales)) +
        geom_line() +
        geom_smooth()
## Error in eval(lhs, parent, parent): object 'purchases' not found

With span = 0.2:

purchases %>%
    group_by(purchase_date) %>%
    summarise(sales = sum(sales)) %>%
    ggplot(aes(x=purchase_date, y=sales)) +
        geom_line() +
        geom_smooth(span = 0.2)
## Error in eval(lhs, parent, parent): object 'purchases' not found
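
For fewer than 1,000 observations, geom_smooth() fits a loess curve by default, and span sets the fraction of points used in each local fit. The effect can be sketched without plotting by calling loess() from base R on a made-up series; the series and the variable names below are assumptions for illustration.

```r
# Made-up series: a smooth wave over 100 "days".
x <- 1:100
y <- sin(x / 5)

# span is the fraction of points used in each local fit.
fit_wide   <- loess(y ~ x, span = 0.75)  # loess() default: smoother curve
fit_narrow <- loess(y ~ x, span = 0.2)   # follows the data more closely

# Residual sums of squares: the narrower span tracks the wave more tightly.
rss_wide   <- sum(residuals(fit_wide)^2)
rss_narrow <- sum(residuals(fit_narrow)^2)
c(wide = rss_wide, narrow = rss_narrow)
```

A smaller span tracks short-run movements but also picks up more noise, so the choice depends on whether the plot should show the overall trend or the finer swings.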

Task 8

Which month brings the most sales? Plot a bar graph with aggregate sales per month. Look at the documentation of geom_bar() to solve this. Note the labels of the x axis (the documentation shows how to reproduce them).

purchases %>%
    group_by(month) %>%
    summarise(sales = sum(sales)) %>%
    ggplot(aes(x=month, y=sales)) +
    geom_bar(stat="identity") +
    scale_x_continuous(breaks = seq(1, 12))
## Error in eval(lhs, parent, parent): object 'purchases' not found

Task 9

Recreate the previous graph by drawing the columns separately for the years (map the year variable to the fill colour and see the examples in the documentation to achieve side-by-side bars).

purchases %>%
    group_by(year, month) %>%
    summarise(sales = sum(sales)) %>%
    ggplot(aes(x=month, y=sales, fill=factor(year))) +
    geom_bar(stat="identity", position="dodge") +
    scale_x_continuous(breaks = seq(1, 12))
## Error in eval(lhs, parent, parent): object 'purchases' not found

Task 10

Plot a graph showing what share of all sales the top x% of buyers account for. For example, a point at x = 0.5, y = 0.8 would mean that 80% of all sales come from the top 50% of buyers. (Hint: use your intermediate dataframe from task 6.)

middle %>%
    mutate(
        id = 1,
        cumulative_sales_share = cumulative_sales/all_sales,
        cumulative_buyer_share = cumsum(id)/n()
    ) %>%
    ggplot(aes(x=cumulative_buyer_share, y=cumulative_sales_share)) +
    geom_line(size=2)
## Error in eval(lhs, parent, parent): object 'middle' not found
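
The two cumulative shares can be checked by hand on a small example. The sketch below uses four made-up buyers and row_number() in place of the cumsum(id) helper column; both give the running count of buyers. The data are hypothetical.

```r
library(dplyr)

# Made-up aggregate sales per buyer.
toy <- data.frame(
    contact_id = c("a", "b", "c", "d"),
    sales      = c(10, 40, 30, 20)
)

curve <- toy %>%
    arrange(desc(sales)) %>%
    mutate(
        cumulative_sales_share = cumsum(sales) / sum(sales),  # 0.4, 0.7, 0.9, 1.0
        cumulative_buyer_share = row_number() / n()           # 0.25, 0.5, 0.75, 1.0
    )
curve
```

Here the second row reads x = 0.5, y = 0.7: the top 50% of buyers account for 70% of all sales in the toy data.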