Homework 2 — SOLUTIONS
library(dplyr)
library(tidyr)
library(ggplot2)
SAME AS PRACTICE EXERCISES OF THE LAST CLASS
TASK 6 IS UPDATED - THANKS TO ISTVAN FOR THE NOTICE
Due 7 October 24:00. Send the hw2_<your-last-name>
file to divenyi.janos@phd.ceu.edu.
Task 0
Download purchases.csv from the data section and load it into R.
purchases <- read.csv(
"/home/divenyijanos/Dropbox/teaching/Programming_Tools/Fall2015/Data/purchases.csv"
)
## Warning in file(file, "rt"): cannot open file '/home/divenyijanos/Dropbox/
## teaching/Programming_Tools/Fall2015/Data/purchases.csv': No such file or
## directory
## Error in file(file, "rt"): cannot open the connection
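The absolute path above exists only on the author's machine, which is why the load fails here. Below is a minimal sketch of a more portable load. The temporary file and its three toy rows are made up for illustration; they only mimic the columns used in the tasks (contact_id, purchase_date, year, month, sales) and are not the real data.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV written to a
# temporary directory so that this sketch runs anywhere.
toy_path <- file.path(tempdir(), "purchases.csv")
toy <- data.frame(
  contact_id    = c(1, 2, 1),
  purchase_date = c("2013-01-15", "2013-02-03", "2014-01-20"),
  year          = c(2013, 2013, 2014),
  month         = c(1, 2, 1),
  sales         = c(100, 250, 80)
)
write.csv(toy, toy_path, row.names = FALSE)

# Build the path with file.path() instead of hard-coding a machine-specific one.
# stringsAsFactors = FALSE keeps purchase_date as character in every R version.
purchases <- read.csv(toy_path, stringsAsFactors = FALSE)
str(purchases)
```

With the real file you would replace toy_path by the location where you saved purchases.csv.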
Task 1
Give the mean and the median of the individual purchases.
purchases %>%
summarise(mean(sales), median(sales))
## Error in eval(lhs, parent, parent): object 'purchases' not found
Task 2
Tell R that your purchase_date variable is a date. You can do this by applying
the as.Date() function to the original variable (similar to how we can use
as.character()). Then you can get the median day of the purchases.
purchases <- purchases %>% mutate(purchase_date = as.Date(purchase_date))
## Error in eval(lhs, parent, parent): object 'purchases' not found
purchases %>%
summarise(median(purchase_date))
## Error in eval(lhs, parent, parent): object 'purchases' not found
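To see what the conversion buys you, here is a small base R sketch on made-up dates (the vector raw_dates is hypothetical, not taken from purchases.csv). It assumes the dates are stored in the ISO year-month-day form, which as.Date() parses without a format argument; other layouts need an explicit format string.

```r
# Made-up dates stored as text, as they would arrive from read.csv().
raw_dates <- c("2013-03-01", "2013-01-15", "2014-06-30")
class(raw_dates)

# After conversion R treats the values as points in time, so they can be
# sorted, subtracted and summarised.
dates <- as.Date(raw_dates)
class(dates)

median_day <- median(dates)
median_day

# A non-ISO layout needs its format spelled out.
other <- as.Date("15/01/2013", format = "%d/%m/%Y")
other
```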
Task 3
List the 5 biggest buyers along with their aggregate purchases.
purchases %>%
group_by(contact_id) %>%
summarise(sales = sum(sales)) %>%
arrange(desc(sales)) %>%
head(5)
## Error in eval(lhs, parent, parent): object 'purchases' not found
Task 4
Plot the distributions of log sales amounts for the two years separately. For this you should have the year variable as factor.
purchases %>%
ggplot(aes(x=log(sales), fill=factor(year))) +
geom_density(alpha=0.3)
## Error in eval(lhs, parent, parent): object 'purchases' not found
Task 5
List the number of buyers in each month by year. (Hint: you might need tidyr
to accomplish this.)
purchases %>%
group_by(year, month) %>%
summarise(n = n()) %>%
spread(year, n)
## Error in eval(lhs, parent, parent): object 'purchases' not found
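Note that n() counts rows, that is, purchases. If a buyer can appear more than once within a month, counting distinct buyers needs something like n_distinct(contact_id) instead. The sketch below shows the difference on made-up data; whether it matters for purchases.csv depends on whether buyers repeat within a month, which is not checked here.

```r
library(dplyr)
library(tidyr)

# Made-up data: buyer 1 purchases twice in January 2013.
toy <- data.frame(
  contact_id = c(1, 1, 2, 3, 1),
  year       = c(2013, 2013, 2013, 2014, 2014),
  month      = c(1, 1, 1, 1, 2)
)

# Purchases (rows) and distinct buyers side by side for each year and month.
counts <- toy %>%
  group_by(year, month) %>%
  summarise(purchases = n(), buyers = n_distinct(contact_id)) %>%
  ungroup()
counts

# Months in rows, years in columns, distinct buyers in the cells.
buyers_wide <- counts %>%
  select(year, month, buyers) %>%
  spread(year, buyers)
buyers_wide
```

In January 2013 the toy data has three purchases but only two buyers, so the two counts diverge.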
Task 6
What share of total sales in 2013 comes from the top 5 buyers in 2013? You may
want to aggregate sales by contact first and then use the cumsum() function
to calculate cumulative sums.
Hint: this table is an intermediate state you may want to achieve.
middle <- purchases %>%
filter(year == 2013) %>%
group_by(contact_id) %>%
summarise(sales = sum(sales)) %>%
arrange(desc(sales)) %>%
mutate(cumulative_sales = cumsum(sales), all_sales = sum(sales))
## Error in eval(lhs, parent, parent): object 'purchases' not found
The correct answer you should get (from the previous table) is:
middle %>%
head(5) %>% tail(1) %>%
mutate(top5_share = cumulative_sales/all_sales) %>%
select(top5_share)
## Error in eval(lhs, parent, parent): object 'middle' not found
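As a quick illustration of how cumsum() delivers a top-k share, here is a base R sketch on five made-up buyer totals (the numbers are hypothetical). After sorting in decreasing order, the k-th cumulative sum divided by the grand total is the share of the top k buyers.

```r
# Made-up aggregate sales, one value per buyer.
sales_by_buyer <- c(50, 300, 100, 500, 50)

# Biggest buyers first, then running totals.
sorted <- sort(sales_by_buyer, decreasing = TRUE)   # 500 300 100 50 50
cumulative_sales <- cumsum(sorted)                  # 500 800 900 950 1000

# Share of the top 2 buyers: 800 out of 1000.
top2_share <- cumulative_sales[2] / sum(sorted)
top2_share                                          # 0.8
```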
Task 7
Plot the aggregate daily sales (you should combine dplyr and ggplot statements).
Note that you should have purchase_date as date instead of factor or character.
Add a smoothed line to the plot (you can experiment with the span option of
geom_smooth() to control the smoothness of your line).
Default:
purchases %>%
group_by(purchase_date) %>%
summarise(sales = sum(sales)) %>%
ggplot(aes(x=purchase_date, y=sales)) +
geom_line() +
geom_smooth()
## Error in eval(lhs, parent, parent): object 'purchases' not found
With span = 0.2:
purchases %>%
group_by(purchase_date) %>%
summarise(sales = sum(sales)) %>%
ggplot(aes(x=purchase_date, y=sales)) +
geom_line() +
geom_smooth(span = 0.2)
## Error in eval(lhs, parent, parent): object 'purchases' not found
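For fewer than 1,000 observations geom_smooth() uses a loess fit by default, and span is the fraction of points that enters each local fit. The base R sketch below calls loess() directly on a simulated series (made up, not the sales data) to show the effect: a smaller span follows the data more closely. For larger data sets the default smoother is a different one, so span may have no effect unless you request method = "loess" yourself.

```r
set.seed(1)

# Simulated series: a slow wave plus noise.
x <- 1:100
y <- sin(x / 10) + rnorm(100, sd = 0.3)

fit_default <- loess(y ~ x)              # span = 0.75, the loess() default
fit_wiggly  <- loess(y ~ x, span = 0.2)  # each local fit uses 20% of the points

# Residual sum of squares: a rough measure of how closely each curve
# tracks the observed points.
rss_default <- sum(residuals(fit_default)^2)
rss_wiggly  <- sum(residuals(fit_wiggly)^2)
c(default = rss_default, span_0.2 = rss_wiggly)
```

A closer fit is not automatically a better one: a very small span starts to chase noise, which is why the task asks you to experiment.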
Task 8
Which month brings the most sales? Plot a bar graph with aggregate sales per
month. Look at the documentation of geom_bar() to solve this. Note the labels
of the x axis (the documentation helps to reproduce them).
purchases %>%
group_by(month) %>%
summarise(sales = sum(sales)) %>%
ggplot(aes(x=month, y=sales)) +
geom_bar(stat="identity") +
scale_x_continuous(breaks = seq(1, 12))
## Error in eval(lhs, parent, parent): object 'purchases' not found
Task 9
Recreate the previous graph by drawing the columns separately for the years (map the year variable to the fill colour and see the examples in the documentation to achieve side-by-side bars).
purchases %>%
group_by(year, month) %>%
summarise(sales = sum(sales)) %>%
ggplot(aes(x=month, y=sales, fill=factor(year))) +
geom_bar(stat="identity", position="dodge") +
scale_x_continuous(breaks = seq(1, 12))
## Error in eval(lhs, parent, parent): object 'purchases' not found
Task 10
Plot a graph showing what share of all sales the top x% of buyers are
responsible for. A point at x = 0.5, y = 0.8 would mean that 80% of all sales
come from the top 50% of buyers. (Hint: use your intermediate data frame from
Task 6.)
middle %>%
mutate(
id = 1,
cumulative_sales_share = cumulative_sales/all_sales,
cumulative_buyer_share = cumsum(id)/n()
) %>%
ggplot(aes(x=cumulative_buyer_share, y=cumulative_sales_share)) +
geom_line(size=2)
## Error in eval(lhs, parent, parent): object 'middle' not found
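The helper column id = 1 combined with cumsum(id) simply numbers the rows, so row_number() gives the same buyer share in one step. Below is a sketch on made-up totals; the toy data frame is hypothetical and stands in for the per-buyer aggregate from Task 6.

```r
library(dplyr)

# Made-up aggregate sales for four buyers.
toy <- data.frame(contact_id = 1:4, sales = c(600, 200, 150, 50))

shares <- toy %>%
  arrange(desc(sales)) %>%
  mutate(
    # Running share of total sales, biggest buyers first.
    cumulative_sales_share = cumsum(sales) / sum(sales),
    # Position in the ranking divided by the number of buyers.
    cumulative_buyer_share = row_number() / n()
  )
shares
```

In this toy table the second row sits at x = 0.5, y = 0.8: the top half of the buyers accounts for 80% of sales.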