Exploratory Data Analysis in R

First look at some data

height_weight <- read.csv("../Data/height_weight.csv")  # path relative to the working directory
summary(height_weight)
plot(height_weight)

Cleaning

# Heights below 100 were presumably recorded in metres: convert to centimetres
height_weight$height[height_weight$height < 100] <- height_weight$height[height_weight$height < 100] * 100
# Heights above 250 are implausible: subtract 100
height_weight$height[height_weight$height > 250] <- height_weight$height[height_weight$height > 250] - 100
# Turn the male indicator into a factor labelled female/male
height_weight$male <- factor(height_weight$male, labels = c("female", "male"))
summary(height_weight)
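
The cleaning steps can also be tried without the CSV file, on a small made-up data frame. This is only a sketch: the name toy and all values are invented for illustration, and it assumes the same column names (height, male) with male coded as 0/1.

```r
# Toy data (invented values): heights in mixed units, male coded as 0/1
toy <- data.frame(
    height = c(1.75, 168, 272, 1.60),
    weight = c(70, 62, 80, 55),
    male   = c(1, 0, 1, 0)
)

# Heights below 100 are assumed to be in metres: convert to centimetres
toy$height[toy$height < 100] <- toy$height[toy$height < 100] * 100
# Heights above 250 are assumed to be off by 100: subtract 100
toy$height[toy$height > 250] <- toy$height[toy$height > 250] - 100
# Turn the 0/1 indicator into a labelled factor (0 = female, 1 = male)
toy$male <- factor(toy$male, labels = c("female", "male"))

toy
```

After these steps every height in the toy data lies between 100 and 250, and male is a factor with the levels female and male.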

Package for data analysis

Base R functionality can be a little cumbersome. Several packages make working with data easier. One of the best is dplyr, developed by Hadley Wickham.

# install.packages("dplyr")  # run this once, before the first use
library(dplyr)  # load package if you want to use it

The logic of dplyr is built around the most frequent operations in data analysis. You choose certain columns and rows, make some calculations on this subset (sometimes group-wise), and return the result. This is the same logic that SQL applies (if you do not know SQL, do not worry).
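
As a rough sketch of this logic, the dplyr functions select(), filter(), group_by() and summarize() can be tried on a small made-up data frame. The name toy and its values are invented for illustration; the calls correspond loosely to SELECT, WHERE and GROUP BY in SQL.

```r
library(dplyr)

# Toy data (invented values), already cleaned
toy <- data.frame(
    height = c(175, 168, 172, 160),
    weight = c(70, 62, 80, 55),
    male   = factor(c("male", "female", "male", "female"))
)

# Choose columns
select(toy, height, male)

# Choose rows
filter(toy, height > 165)

# Group-wise calculation: mean height by sex
by_sex <- summarize(group_by(toy, male), mean_height = mean(height))
by_sex
```

Each function takes a data frame as its first argument and returns a new data frame, so the original toy is left unchanged.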

This vignette is really helpful in learning the basic commands of dplyr.

The same cleaning task could be accomplished as follows:

height_weight <- read.csv("../Data/height_weight.csv")
height_weight <- mutate(
    height_weight,
    height = ifelse(height < 100, height * 100, height),
    height = ifelse(height > 250, height - 100, height),
    male = factor(male, labels = c("female", "male"))
)
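
To see what the mutate() call does without the CSV file, here is a minimal sketch on made-up data. The names toy and cleaned and all values are invented, and male is assumed to be coded as 0/1. Note that mutate() evaluates its arguments in order, so the second height expression already sees the result of the first.

```r
library(dplyr)

# Toy data (invented values): heights in mixed units, male coded as 0/1
toy <- data.frame(
    height = c(1.75, 168, 272, 1.60),
    weight = c(70, 62, 80, 55),
    male   = c(1, 0, 1, 0)
)

cleaned <- mutate(
    toy,
    # Heights below 100 are assumed to be in metres: convert to centimetres
    height = ifelse(height < 100, height * 100, height),
    # Heights above 250 are assumed to be off by 100: subtract 100
    height = ifelse(height > 250, height - 100, height),
    # Turn the 0/1 indicator into a labelled factor
    male = factor(male, labels = c("female", "male"))
)

cleaned
```

Unlike the base R version, the column names are written without the data frame prefix, because mutate() looks them up inside the data frame passed as its first argument.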

Piping operator

The package dplyr makes a new operator available: %>%. It can simplify our lives even further. It works similarly to the pipe operator | of Unix-like systems (do not worry if you do not know what I am talking about).

If you would like to chain several function calls (operations) together, the piping operator lets you pass the result of one call as the first argument to the next call.

Let’s calculate the average height of males. The standard logic is to filter for males and then summarize the height.

summarize(filter(height_weight, male == "male"), mean(height))

With piping, the same call can be written so that it follows the logic more closely: first filter, then summarize the filtered result.

height_weight %>% filter(male == "male") %>% summarize(mean(height))

Formally, f(x) is the same as x %>% f(). This can result in much more readable code.
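
The equivalence can be checked directly. The sketch below uses an invented numeric vector x and the base R functions mean() and round(); it also shows that further arguments simply stay inside the call, so f(x, y) is the same as x %>% f(y).

```r
library(dplyr)  # makes %>% available

x <- c(160, 168, 172, 175)  # invented values

# f(x) is the same as x %>% f()
same_simple <- identical(mean(x), x %>% mean())

# Further arguments stay in the call: f(x, y) is the same as x %>% f(y)
same_args <- identical(round(mean(x), 1), x %>% mean() %>% round(1))

same_simple
same_args
```

The nested form has to be read from the inside out, while the piped form reads from left to right in the order the operations are applied.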