Exploratory Data Analysis in R
First look at some data
# path relative to this document, the same one used in the dplyr section
height_weight <- read.csv("../Data/height_weight.csv")
summary(height_weight)
plot(height_weight)
Cleaning
# heights below 100 are presumably recorded in metres: convert them to centimetres
height_weight$height[height_weight$height < 100] <- height_weight$height[height_weight$height < 100] * 100
# heights above 250 are presumably typos with an extra 100: subtract it
height_weight$height[height_weight$height > 250] <- height_weight$height[height_weight$height > 250] - 100
height_weight$male <- factor(height_weight$male, labels=c("female", "male"))
summary(height_weight)
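To see what the cleaning steps do without the original file, here is a minimal sketch on a made-up data frame. The name `toy` and all of its values are hypothetical, and reading the thresholds as "metres instead of centimetres" and "an extra 100" is an assumption inferred from the code above, not something the data confirms.

```r
# Hypothetical toy data; the real height_weight.csv is not reproduced here.
toy <- data.frame(
  height = c(1.75, 168, 282, 190),  # metres, cm, cm with an extra 100, cm
  weight = c(70, 60, 85, 90),
  male   = c(1, 0, 1, 1)
)

# Heights below 100 are assumed to be in metres: convert to centimetres.
small <- toy$height < 100
toy$height[small] <- toy$height[small] * 100

# Heights above 250 are assumed to carry a spurious extra 100.
large <- toy$height > 250
toy$height[large] <- toy$height[large] - 100

# Turn the 0/1 indicator into a labelled factor (0 = female, 1 = male).
# This relies on both values being present in the column.
toy$male <- factor(toy$male, labels = c("female", "male"))

toy
```

Storing the logical index in a variable (`small`, `large`) avoids repeating the condition on both sides of the assignment.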
Package for data analysis
Base R functionality can be a little cumbersome. Some packages make
working with data easier. One of the best is dplyr, developed by
Hadley Wickham.
# install.packages("dplyr") # run this once, before the first use
library(dplyr) # load the package in every session where you want to use it
The logic of dplyr is built around the most frequent operations in data
analysis. You choose certain columns and rows, make some calculations on this
subset (sometimes group-wise), and return the result. This is the same logic
that is applied in SQL (if you do not know SQL, do not worry).
The dplyr vignette is really helpful in learning the package's basic commands.
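The "choose rows and columns, calculate, possibly by group" logic can be sketched with four dplyr functions. The data frame `toy` and its values are made up for illustration, and the intermediate names (`tall`, `narrow`, `grouped`) are hypothetical; only `filter`, `select`, `group_by` and `summarize` come from dplyr.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only.
toy <- data.frame(
  height = c(175, 168, 182, 190, 160),
  weight = c(70, 60, 85, 90, 55),
  male   = c("male", "female", "male", "male", "female")
)

tall    <- filter(toy, height > 165)      # choose rows
narrow  <- select(tall, male, height)     # choose columns
grouped <- group_by(narrow, male)         # define the groups
result  <- summarize(grouped, mean_height = mean(height))  # one row per group

# Roughly the SQL query:
#   SELECT male, AVG(height) FROM toy WHERE height > 165 GROUP BY male
result
```

Each function takes a data frame as its first argument and returns a new data frame, which is what makes the steps easy to combine.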
The same cleaning task could be accomplished as follows:
height_weight <- read.csv("../Data/height_weight.csv")
height_weight <- mutate(
  height_weight,
  height = ifelse(height < 100, height * 100, height),
  height = ifelse(height > 250, height - 100, height),
  male = factor(male, labels = c("female", "male"))
)
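One detail worth noting is that `mutate` evaluates its arguments in order, so the second `height` expression already sees the result of the first. A minimal sketch on made-up data (the names `toy` and `toy_clean` and all values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: one height in metres, one fine, one with an extra 100.
toy <- data.frame(height = c(1.75, 168, 282), male = c(1, 0, 1))

toy_clean <- mutate(
  toy,
  height = ifelse(height < 100, height * 100, height),  # 1.75 becomes 175
  height = ifelse(height > 250, height - 100, height),  # sees the updated column
  male = factor(male, labels = c("female", "male"))
)

toy_clean
```

Unlike the base R version, the original data frame `toy` is left unchanged; the cleaned copy only exists once it is assigned to a name.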
Piping operator
The package dplyr makes a new operator available: %>%. It can simplify
our life further. It works similarly to the piping operator | of Unix-type systems
(do not worry if you do not know what I am talking about).
If you would like to chain together several function calls (operations), the piping operator lets you pass the result of one call as the first argument to the next call.
Let’s calculate the average height of males. The standard logic is to filter for males and then summarize the height.
summarize(filter(height_weight, male == "male"), mean(height))
With piping, the same call can be formulated to better follow the logic: first filter, then summarize the filtered result.
height_weight %>% filter(male == "male") %>% summarize(mean(height))
Formally, f(x) is the same as x %>% f(). This can result in much more
readable code.
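The equivalence can be checked directly on a small vector. This is only a sketch: the vector `x` and the names `nested` and `piped` are made up, and base functions stand in for f.

```r
library(dplyr)  # makes %>% available

x <- c(4, 9, 16)

# f(x) and x %>% f() give the same result.
identical(sqrt(x), x %>% sqrt())

# Additional arguments stay in the call: f(x, y) is x %>% f(y).
identical(round(3.14159, 2), 3.14159 %>% round(2))

# Nested calls read inside-out, the piped version reads left to right.
nested <- round(mean(sqrt(x)), 1)
piped  <- x %>% sqrt() %>% mean() %>% round(1)
identical(nested, piped)
```

All three comparisons return TRUE. The piped version lists the operations in the order they are carried out, which is where the gain in readability comes from.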