DA Homework 3 - SOLUTION
Task 0
Download purchases.csv from the data section. This sample data contains
purchases from an online store.
Load the data into R and check whether the types of your variables are correct
(e.g. purchase_date should be of type “Date”).
library(tidyverse)  # provides read_csv(), %>%, the dplyr verbs and ggplot2
purchases <- read_csv("/Users/jdivenyi/teaching/BME_adat/201617/data/purchases.csv")
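One way to run the type check is sketched below on a small made-up data frame. The values, and the fact that the date column arrives as character, are assumptions for illustration and are not taken from purchases.csv.

```r
# Toy stand-in for the real data; all values are hypothetical.
toy <- data.frame(
  purchase_date = c("2012-01-15", "2013-06-30"),
  sales = c(19.9, 45.0),
  stringsAsFactors = FALSE
)

# Inspect the type of every column at once.
str(toy)

# The date column is character here, so convert it explicitly.
toy$purchase_date <- as.Date(toy$purchase_date, format = "%Y-%m-%d")
class(toy$purchase_date)
```

read_csv() usually parses ISO-formatted dates as Date on its own, so the conversion step is only needed if the check shows a different type.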
Task 1
Plot the histogram of all the log sales.
purchases %>%
ggplot(aes(x = log(sales))) + geom_histogram()
Task 2
Plot the distributions of log sales amounts for the two years separately.
Check geom_density in the documentation of ggplot2. Here is what you should get.
purchases %>%
mutate(year = factor(year)) %>%
ggplot(aes(x=log(sales), fill=year)) +
geom_density(alpha=0.3)
Task 3
Plot the aggregate daily sales.
Add a smoothed line to the plot (you can experiment with the span option of
geom_smooth() to control the smoothness of your line).
Default:
purchases %>%
group_by(purchase_date) %>%
summarise(sales = sum(sales)) %>%
ggplot(aes(x=purchase_date, y=sales)) +
geom_line() +
geom_smooth()
With span = 0.2:
purchases %>%
group_by(purchase_date) %>%
summarise(sales = sum(sales)) %>%
ggplot(aes(x=purchase_date, y=sales)) +
geom_line() +
geom_smooth(span = 0.2)
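To see what the span option changes, the sketch below fits loess() directly, which is the smoother geom_smooth() picks by default for fewer than 1,000 observations. The simulated series is a hypothetical stand-in for the daily sales, not the real data.

```r
# Hypothetical noisy series standing in for the aggregate daily sales.
set.seed(1)
day <- 1:200
sales <- 100 + 20 * sin(day / 15) + rnorm(200, sd = 10)

# The default span (0.75) versus the smaller span used above.
fit_wide <- loess(sales ~ day, span = 0.75)
fit_narrow <- loess(sales ~ day, span = 0.2)

# A smaller span uses fewer neighbours per local fit, so the curve is
# more flexible: it has more equivalent parameters.
c(wide = fit_wide$enp, narrow = fit_narrow$enp)
```

A smaller span follows short-term swings more closely, while a larger one shows only the broad trend.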
Task 4
Which month brings the most sales? Plot a bar graph with aggregate sales per
month. Look at the documentation of geom_bar() to solve this. Note the labels
of the x axis (the documentation helps to reproduce them).
purchases %>%
mutate(month = factor(month)) %>%
group_by(month) %>%
summarise(aggregate_sales = sum(sales)) %>%
ggplot(aes(x = month, y = aggregate_sales)) +
geom_bar(stat = "identity")
Task 5
Recreate the previous graph by drawing the columns separately for the years (map the year variable to the fill of the bars and see the examples in the documentation to achieve side-by-side bars).
purchases %>%
mutate(year = factor(year)) %>%
group_by(year, month) %>%
summarise(aggregate_sales = sum(sales)) %>%
ggplot(aes(x = month, y = aggregate_sales, fill = year)) +
geom_bar(stat = "identity", position = "dodge") +
scale_x_continuous(breaks = seq(1, 12))
Task 6
We have seen that aggregate sales are lower in 2013 than in 2012. Which year
has the higher average sales amount? Plot a bar graph with average sales per year and
add error bars of two times the standard deviation (do not forget to adjust by the
number of observations, i.e. use the standard error of the mean).
You can add error bars by using the geom_errorbar() function of ggplot2.
(I use the “lightblue” color to fill the bars in order to make the error bars more visible.)
purchases %>%
mutate(year = factor(year)) %>%
group_by(year) %>%
summarise(avg_sales = mean(sales), sd = sd(sales), n = n()) %>%
ggplot(aes(x = year, y = avg_sales)) +
geom_bar(stat = "identity", fill = 'lightblue') +
geom_errorbar(aes(ymin = avg_sales - 2*sd/sqrt(n), ymax = avg_sales + 2*sd/sqrt(n)))
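The adjustment by the number of observations amounts to using the standard error of the mean, sd / sqrt(n). A minimal sketch of the arithmetic behind the error bar limits, using made-up sales amounts for a single year:

```r
# Hypothetical sales amounts for one year.
sales <- c(10, 20, 30, 40)

avg_sales <- mean(sales)
n <- length(sales)
se <- sd(sales) / sqrt(n)  # standard error of the mean

# Error bar limits, matching ymin and ymax in geom_errorbar() above.
c(ymin = avg_sales - 2 * se, ymax = avg_sales + 2 * se)
```

Because the standard error shrinks with the square root of n, the bars for a year with many purchases are much narrower than the raw standard deviation would suggest.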
Task +1
Watch this video and collect 3 positive (or negative) points about the presentation.