This is a complete tutorial on data analysis and manipulation with R. It covers dplyr, one of the most powerful R packages for data wrangling. The package was written by Hadley Wickham, a well-known R programmer who has also written many other useful R packages such as ggplot2 and tidyr.
The dplyr package comprises many functions that perform the most commonly used data manipulation operations, such as filtering rows, selecting specific columns, sorting data, adding or deleting columns and aggregating data. Another important advantage of this package is that its functions are easy to learn, use and recall. For example, filter() is used to filter rows.
This tutorial uses a dataset that contains income generated by states from 2002 to 2015. Note : This data does not contain actual income figures of the states.
What is dplyr?
dplyr is a powerful R package to manipulate, clean and summarize tabular data. In short, it makes data exploration and data manipulation easy and fast in R.
What's special about dplyr?
dplyr vs. Base R Functions
dplyr functions often run faster than their base R equivalents because they are written in a computationally efficient manner. Their syntax is also more consistent, and they are designed to work with data frames rather than vectors.
SQL Queries vs. dplyr
People have been using SQL to analyze data for decades. Modern data analysis tools such as Python, R and SAS all support SQL commands. But SQL was never designed for data analysis; it was designed for querying and managing data. Many data analysis operations are difficult or impossible in SQL, for example calculating the median of multiple variables or converting wide-format data to long format. The dplyr package, in contrast, was designed for data analysis.
The names of dplyr functions are similar to SQL commands: select() for selecting variables, group_by() for grouping data by a grouping variable, and inner_join(), left_join() and related functions for joining two data sets. It also supports chaining steps in the style of sub-queries, for which SQL is popular.
How to install and load dplyr package
To install the dplyr package, type the following command.
install.packages("dplyr")
To load dplyr package, type the command below
library(dplyr)
Important dplyr Functions to remember
dplyr Function | Description | Equivalent SQL |
---|---|---|
select() | Selecting columns (variables) | SELECT |
filter() | Filter (subset) rows. | WHERE |
group_by() | Group the data | GROUP BY |
summarise() | Summarise (or aggregate) data | - |
arrange() | Sort the data | ORDER BY |
inner_join(), left_join() etc. | Joining data frames (tables) | JOIN |
mutate() | Creating New Variables | COLUMN ALIAS |
Data : Income Data by States
This dataset contains 51 observations (rows) and 16 variables (columns).
Download the Dataset
How to load Data
Submit the following code. Change the file path in the code below.
mydata = read.csv("C:\\Users\\Deepanshu\\Documents\\sampledata.csv")
Example 1 : Selecting Random N Rows
The sample_n function selects random rows from a data frame (or table). The second parameter of the function tells R the number of rows to select.
sample_n(mydata,3)
Example 2 : Selecting Random Fraction of Rows
The sample_frac function returns randomly N% of rows. In the example below, it returns randomly 10% of rows.
sample_frac(mydata,0.1)
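In dplyr 1.0.0 and later, sample_n() and sample_frac() are superseded by slice_sample(). The sketch below shows the newer form on a hypothetical 20-row data frame (toy is an assumption for illustration, not the tutorial's income data), with set.seed() so that the random draw is repeatable.

```r
library(dplyr)

# Hypothetical stand-in for mydata
toy <- data.frame(id = 1:20, value = (1:20) * 10)

set.seed(123)                                 # make the random draw repeatable
three_rows <- slice_sample(toy, n = 3)        # counterpart of sample_n(toy, 3)
quarter    <- slice_sample(toy, prop = 0.25)  # counterpart of sample_frac(toy, 0.25)

nrow(three_rows)
nrow(quarter)
```

Without set.seed(), each run returns a different set of rows.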
Example 3 : Remove Duplicate Rows based on all the variables (Complete Row)
The distinct function is used to eliminate duplicates.
x1 = distinct(mydata)
Example 4 : Remove Duplicate Rows based on a variable
The .keep_all argument is used to retain all other variables in the output data frame.
x2 = distinct(mydata, Index, .keep_all= TRUE)
Example 5 : Remove Duplicates Rows based on multiple variables
In the example below, we are using two variables - Index, Y2010 to determine uniqueness.
x2 = distinct(mydata, Index, Y2010, .keep_all= TRUE)
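To see what the three distinct() variants keep, here is a minimal sketch on a hypothetical four-row data frame (toy and its values are made up for illustration). When duplicates exist, distinct() keeps the first matching row.

```r
library(dplyr)

# Hypothetical data with repeated Index values
toy <- data.frame(
  Index = c("A", "A", "C", "C"),
  Y2010 = c(1, 1, 2, 3),
  State = c("Alabama", "Alaska", "California", "Colorado"),
  stringsAsFactors = FALSE
)

# One row per Index; the other columns come from the first matching row
by_index <- distinct(toy, Index, .keep_all = TRUE)

# One row per (Index, Y2010) combination
by_both <- distinct(toy, Index, Y2010, .keep_all = TRUE)

# Without .keep_all only the key column is returned
only_key <- distinct(toy, Index)
```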
select( ) Function
It is used to select only desired variables.
select() syntax : select(data , ....)
data : Data Frame
.... : Variables by name or by function
Example 6 : Selecting Variables (or Columns)
Suppose you are asked to select only a few variables. The code below selects variables "Index", columns from "State" to "Y2008".
mydata2 = select(mydata, Index, State:Y2008)
Example 7 : Dropping Variables
The minus sign before a variable tells R to drop the variable.
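mydata2 = select(mydata, -Index, -State)
The result is stored in a new object so that mydata keeps the columns Index and State, which the later examples need. The above code can also be written as :
mydata2 = select(mydata, -c(Index,State))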
Example 8 : Selecting or Dropping Variables starts with 'Y'
The starts_with() function is used to select variables whose names start with a given prefix.
mydata3 = select(mydata, starts_with("Y"))
Adding a negative sign before starts_with() drops the variables whose names start with 'Y'.
mydata33 = select(mydata, -starts_with("Y"))
The following functions help you select variables based on their names.
Helpers | Description |
---|---|
starts_with() | Starts with a prefix |
ends_with() | Ends with a suffix |
contains() | Contains a literal string |
matches() | Matches a regular expression |
num_range() | Numerical range like x01, x02, x03. |
one_of() | Variables in character vector. |
everything() | All variables. |
Example 9 : Selecting Variables contain 'I' in their names
mydata4 = select(mydata, contains("I"))
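The helpers in the table can be compared side by side. This is a small sketch on a hypothetical one-row data frame whose column names mimic the tutorial's data (toy and its values are assumptions); only the selected column names are collected.

```r
library(dplyr)

# Hypothetical data frame with the same kind of column names as mydata
toy <- data.frame(Index = "A", State = "Alabama",
                  Y2002 = 1, Y2003 = 2, Y2004 = 3)

y_cols     <- names(select(toy, starts_with("Y")))
ends_3     <- names(select(toy, ends_with("3")))
regex_cols <- names(select(toy, matches("^Y200[23]$")))
range_cols <- names(select(toy, num_range("Y", 2003:2004)))
i_cols     <- names(select(toy, contains("I")))   # matching ignores case by default
not_y      <- names(select(toy, -starts_with("Y")))
```

Note that contains(), starts_with() and ends_with() ignore case unless ignore.case = FALSE is passed.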
Example 10 : Reorder Variables
The code below keeps the variable 'State' in front, followed by the remaining variables.
mydata5 = select(mydata, State, everything())
rename( ) Function
It is used to change variable name.
rename() syntax : rename(data , new_name = old_name)
data : Data Frame
new_name : New variable name you want to keep
old_name : Existing Variable Name
Example 11 : Rename Variables
The rename function can be used to rename variables.
In the following code, we are renaming 'Index' variable to 'Index1'.
mydata6 = rename(mydata, Index1=Index)
filter( ) Function
It is used to subset data with matching logical conditions.
filter() syntax : filter(data , ....)
data : Data Frame
.... : Logical Condition
Example 12 : Filter Rows
Suppose you need to subset data. You want to filter rows and retain only those values in which Index is equal to A.
mydata7 = filter(mydata, Index == "A")
Example 13 : Multiple Selection Criteria
The %in% operator can be used to select multiple items. In the following program, we are telling R to select rows against 'A' and 'C' in column 'Index'.
mydata7 = filter(mydata, Index %in% c("A", "C"))
Example 14 : 'AND' Condition in Selection Criteria
Suppose you need to apply an 'AND' condition. In this case, we are picking data for 'A' and 'C' in the column 'Index' with income of at least 1.3 million in the year 2002.
mydata8 = filter(mydata, Index %in% c("A", "C") & Y2002 >= 1300000 )
Example 15 : 'OR' Condition in Selection Criteria
The '|' operator denotes OR in the logical condition. It means that either of the two conditions may hold.
mydata9 = filter(mydata, Index %in% c("A", "C") | Y2002 >= 1300000)
Example 16 : NOT Condition
The "!" sign is used to reverse the logical condition.
mydata10 = filter(mydata, !Index %in% c("A", "C"))
Example 17 : CONTAINS Condition
The grepl function is used to search for pattern matching. In the following code, we are looking for records in which the column State contains 'Ar'.
mydata10 = filter(mydata, grepl("Ar", State))
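The filter conditions above can be checked on a small example. The sketch below uses a hypothetical four-row data frame (toy and its figures are invented for illustration) and applies the AND, OR, NOT and pattern-matching conditions in turn.

```r
library(dplyr)

# Hypothetical data; the income figures are invented
toy <- data.frame(
  Index = c("A", "C", "I", "N"),
  State = c("Arkansas", "Colorado", "Iowa", "New York"),
  Y2002 = c(1500000, 1200000, 1400000, 1100000),
  stringsAsFactors = FALSE
)

both    <- filter(toy, Index %in% c("A", "C") & Y2002 >= 1300000)  # AND
either  <- filter(toy, Index %in% c("A", "C") | Y2002 >= 1300000)  # OR
neither <- filter(toy, !Index %in% c("A", "C"))                    # NOT
has_ar  <- filter(toy, grepl("Ar", State))                         # pattern match

# Conditions separated by commas are combined with AND
both_commas <- filter(toy, Index %in% c("A", "C"), Y2002 >= 1300000)
```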
summarise( ) Function
It is used to summarize data.
summarise() syntax : summarise(data , ....)
data : Data Frame
..... : Summary Functions such as mean, median etc
Example 18 : Summarize selected variables
In the example below, we are calculating mean and median for the variable Y2015.
summarise(mydata, Y2015_mean = mean(Y2015), Y2015_med=median(Y2015))
Example 19 : Summarize Multiple Variables
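In the following example, we are calculating the number of records, mean and median for variables Y2005 and Y2006. The summarise_at function allows us to select multiple variables by their names. Note : funs() has been deprecated since dplyr 0.8.0; recent versions expect a list of functions or lambdas instead, for example list(mean = mean, median = median).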
summarise_at(mydata, vars(Y2005, Y2006), funs(n(), mean, median))
Example 20 : Summarize with Custom Functions
We can also use custom functions in the summarise function. In this case, we are computing the number of records, number of missing values, mean and median for variables Y2011 and Y2012. The dot (.) denotes each variable specified in the second argument of the function.
summarise_at(mydata, vars(Y2011, Y2012),
funs(n(), missing = sum(is.na(.)), mean(., na.rm = TRUE), median(.,na.rm = TRUE)))
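In dplyr 1.0.0 and later, the same kind of summary is usually written with across() and a named list of lambdas. The following is a sketch under that assumption, run on a hypothetical three-row data frame (toy and its values are invented); it is one possible rewrite, not the tutorial's original code.

```r
library(dplyr)

# Hypothetical data with one missing value
toy <- data.frame(Y2011 = c(10, NA, 30), Y2012 = c(1, 2, 3))

res <- summarise(
  toy,
  n = n(),                                   # number of records
  across(
    c(Y2011, Y2012),
    list(
      missing = ~ sum(is.na(.x)),            # count of missing values
      mean    = ~ mean(.x, na.rm = TRUE),
      median  = ~ median(.x, na.rm = TRUE)
    )
  )
)

names(res)
```

By default across() names each output column as the input column followed by an underscore and the function name, for example Y2011_mean.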
Example 21 : Summarize all Numeric Variables
First, store data for all the numeric variables
numdata = mydata[sapply(mydata,is.numeric)]
Second, the summarise_all function calculates summary statistics for all the columns in a data frame
summarise_all(numdata, funs(n(),mean,median))
Example 22 : Summarize Factor Variable
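We are checking the number of levels/categories and the count of missing observations in a categorical (factor) variable. Note : nlevels() returns 0 for a character column, and since R 4.0.0 read.csv() reads text columns as character by default, so load the data with stringsAsFactors = TRUE or convert Index with factor() first.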
summarise_all(mydata["Index"], funs(nlevels(.), sum(is.na(.))))
arrange() function :
Use : Sort data
Syntax
arrange(data_frame, variable(s)_to_sort)
or
data_frame %>% arrange(variable(s)_to_sort)
To sort a variable in descending order, use desc(x).
Example 23 : Sort Data by Multiple Variables
The default sorting order of arrange() function is ascending. In this example, we are sorting data by multiple variables.
arrange(mydata, Index, Y2011)
Suppose you need to sort one variable in descending order and another variable in ascending order.
arrange(mydata, desc(Index), Y2011)
Pipe Operator %>%
It is important to understand the pipe (%>%) operator before learning the other functions of the dplyr package. dplyr uses the pipe operator from another package (magrittr).
It allows you to chain steps much like sub-queries in SQL.
Note : All the functions in the dplyr package can be used without the pipe operator. So why use %>%? Because it lets you chain multiple functions together instead of nesting them.
Syntax :
filter(data_frame, variable == value)
or
data_frame %>% filter(variable == value)
The %>% is NOT restricted to the filter function. It can be used with any function.
Example :
The code below demonstrates the usage of the pipe %>% operator. In this example, we are selecting 10 random observations of two variables, "Index" and "State", from the data frame "mydata".
dt = sample_n(select(mydata, Index, State),10)
or
dt = mydata %>% select(Index, State) %>% sample_n(10)
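Because x %>% f(y) is simply another way to write f(x, y), a nested call and its piped form give the same result. A minimal sketch with a hypothetical three-row data frame (toy is an assumption for illustration) and deterministic steps, so the two results can be compared directly:

```r
library(dplyr)

# Hypothetical data
toy <- data.frame(
  Index = c("A", "A", "C"),
  State = c("Alabama", "Alaska", "California"),
  Y2015 = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Nested form: read from the inside out
nested <- arrange(filter(select(toy, Index, Y2015), Index == "A"), Y2015)

# Piped form: read from left to right
piped <- toy %>%
  select(Index, Y2015) %>%
  filter(Index == "A") %>%
  arrange(Y2015)

identical(nested, piped)
```

The piped form lists the steps in the order in which they run, which is the main reason it is easier to read once more than two functions are involved.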
group_by() function :
Use : Group data by categorical variable
Syntax :
group_by(data, variables)
or
data %>% group_by(variables)
Example 24 : Summarise Data by Categorical Variable
We are calculating count and mean of variables Y2011 and Y2012 by variable Index.
t = summarise_at(group_by(mydata, Index), vars(Y2011, Y2012), funs(n(), mean(., na.rm = TRUE)))
The above code can also be written as
t = mydata %>% group_by(Index) %>%
summarise_at(vars(Y2011, Y2012), funs(n(), mean(., na.rm = TRUE)))
do() function :
Use : Compute within groups
Syntax :
do(data_frame, expressions_to_apply_to_each_group)
Note : The dot (.) is required to refer to a data frame.
Example 25 : Filter Data within a Categorical Variable
Suppose you need to pull top 2 rows from 'A', 'C' and 'I' categories of variable Index.
t = mydata %>% filter(Index %in% c("A", "C","I")) %>% group_by(Index) %>%
do(head( . , 2))
Example 26 : Selecting 3rd Maximum Value by Categorical Variable
We are calculating the third maximum value of variable Y2015 by variable Index. The following code first selects only two variables, Index and Y2015. Then it keeps the rows where Index is 'A', 'C' or 'I', groups the data by Index and sorts Y2015 in descending order within each group. Finally, it selects the third row of each group.
t = mydata %>% select(Index, Y2015) %>%
filter(Index %in% c("A", "C","I")) %>%
group_by(Index) %>%
do(arrange(.,desc(Y2015))) %>% slice(3)
The slice() function is used to select rows by position.
Using Window Functions
Like SQL, dplyr provides window functions, which can be used to subset data within a group. A window function returns a vector of values. In the preceding example, we could instead use the min_rank() function, which calculates ranks:
t = mydata %>% select(Index, Y2015) %>%
filter(Index %in% c("A", "C","I")) %>%
group_by(Index) %>%
filter(min_rank(desc(Y2015)) == 3)
Example 27 : Summarize, Group and Sort Together
In this case, we are computing the mean of variables Y2014 and Y2015 by variable Index, then sorting the result by the calculated mean of Y2015 in descending order.
t = mydata %>%
group_by(Index)%>%
summarise(Mean_2014 = mean(Y2014, na.rm=TRUE),
Mean_2015 = mean(Y2015, na.rm=TRUE)) %>%
arrange(desc(Mean_2015))
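The same group, summarise and sort chain can be traced by hand on a tiny example. The data frame below is hypothetical (toy and its values are invented), small enough to verify each mean directly.

```r
library(dplyr)

# Hypothetical data with one missing value in Y2014
toy <- data.frame(
  Index = c("A", "A", "C", "C"),
  Y2014 = c(10, 20, 30, NA),
  Y2015 = c(1, 3, 10, 20),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(Index) %>%
  summarise(Mean_2014 = mean(Y2014, na.rm = TRUE),
            Mean_2015 = mean(Y2015, na.rm = TRUE)) %>%
  arrange(desc(Mean_2015))

res
```

Group C comes first because its Mean_2015 (15) is larger than that of group A (2); the NA in Y2014 is ignored because of na.rm = TRUE.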
mutate() function :
Use : Creates new variables
Syntax :
mutate(data_frame, expression(s) )
or
data_frame %>% mutate(expression(s))
Example 28 : Create a new variable
The following code divides Y2015 by Y2014 and names the result "change".
mydata1 = mutate(mydata, change=Y2015/Y2014)
Example 29 : Multiply all the variables by 1000
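It creates new variables and names them with the suffix "_new". Because the data also contains non-numeric columns (Index and State), the multiplication is restricted to numeric columns with mutate_if().
mydata11 = mutate_if(mydata, is.numeric, funs("new" = . * 1000))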
Example 30 : Calculate Rank for Variables
Suppose you need to calculate rank for variables Y2008 to Y2010.
mydata12 = mutate_at(mydata, vars(Y2008:Y2010), funs(Rank=min_rank(.)))
By default, min_rank() assigns 1 to the smallest value and the highest rank to the largest value. If you need to assign rank 1 to the largest value of a variable, use min_rank(desc(.)).
mydata13 = mutate_at(mydata, vars(Y2008:Y2010), funs(Rank=min_rank(desc(.))))
Example 31 : Select State that generated highest income among the variable 'Index'
out = mydata %>% group_by(Index) %>% filter(min_rank(desc(Y2015)) == 1) %>%
select(Index, Y2015)
Example 32 : Cumulative Income of 'Index' variable
The cumsum function calculates cumulative sum of a variable. With mutate function, we insert a new variable called 'Total' which contains values of cumulative income of variable Index.
out2 = mydata %>% group_by(Index) %>% mutate(Total=cumsum(Y2015)) %>%
select(Index, Y2015, Total)
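Window functions return one value per row, computed within each group. The sketch below, on a hypothetical five-row data frame (toy and its values are invented), shows how min_rank() treats ties and how cumsum() restarts for each group.

```r
library(dplyr)

# Hypothetical data; group A contains a tie at 9
toy <- data.frame(
  Index = c("A", "A", "A", "C", "C"),
  Y2015 = c(5, 9, 9, 4, 7),
  stringsAsFactors = FALSE
)

ranked <- toy %>%
  group_by(Index) %>%
  mutate(Rank  = min_rank(desc(Y2015)),   # 1 = largest value in the group
         Total = cumsum(Y2015)) %>%       # running total within the group
  ungroup()

# Tied values share the same rank, so this returns two rows for group A
top <- filter(ranked, Rank == 1)
```

Because of ties, filtering on a rank can return more or fewer rows per group than selecting by position with slice().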
Join functions :
Use : Join two datasets
Syntax :
inner_join(x, y, by = )
left_join(x, y, by = )
right_join(x, y, by = )
full_join(x, y, by = )
semi_join(x, y, by = )
anti_join(x, y, by = )
x, y - datasets (or tables) to merge / join
by - common variable (primary key) to join by.
Example 33 : Common rows in both the tables
Let's create two data frames say df1 and df2.
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
w = c('a', 'b', 'c', 'd', 'e'),
x = c(1, 1, 0, 0, 1),
y=rnorm(5),
z=letters[1:5])
df2 <- data.frame(ID = c(1, 7, 3, 6, 8),
a = c('z', 'b', 'k', 'd', 'l'),
b = c(1, 2, 3, 0, 4),
c =rnorm(5),
d =letters[2:6])
INNER JOIN returns rows when there is a match in both tables. In this example, we are merging df1 and df2 with ID as common variable (primary key).
df3 = inner_join(df1, df2, by = "ID")
If the primary key does not have same name in both the tables, try the following way:
inner_join(df1, df2, by = c("ID"="ID1"))
Example 34 : Applying LEFT JOIN
LEFT JOIN returns all rows from the left table, even if there are no matches in the right table.
left_join(df1, df2, by = "ID")
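The difference between the join types is easiest to see from which IDs survive. This sketch uses cut-down, hypothetical versions of two tables (here called left_tbl and right_tbl, names chosen for illustration) with only one extra column each.

```r
library(dplyr)

# Hypothetical tables sharing the key column ID
left_tbl  <- data.frame(ID = c(1, 2, 3, 4, 5),
                        w = c("a", "b", "c", "d", "e"),
                        stringsAsFactors = FALSE)
right_tbl <- data.frame(ID = c(1, 7, 3, 6, 8),
                        a = c("z", "b", "k", "d", "l"),
                        stringsAsFactors = FALSE)

inner <- inner_join(left_tbl, right_tbl, by = "ID")  # only IDs present in both
left  <- left_join(left_tbl, right_tbl, by = "ID")   # all rows of left_tbl
semi  <- semi_join(left_tbl, right_tbl, by = "ID")   # matching rows, left columns only
anti  <- anti_join(left_tbl, right_tbl, by = "ID")   # rows of left_tbl without a match
```

In the left join, the column a is NA for the IDs 2, 4 and 5, which have no match in right_tbl.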
Combine Data Vertically
intersect(x, y)
Rows that appear in both x and y.
union(x, y)
Rows that appear in either or both x and y.
setdiff(x, y)
Rows that appear in x but not y.
Example 35 : Applying INTERSECT
Prepare Sample Data for Demonstration
mtcars$model <- rownames(mtcars)
first <- mtcars[1:20, ]
second <- mtcars[10:32, ]
INTERSECT selects unique rows that are common to both the data frames.
intersect(first, second)
Example 36 : Applying UNION
UNION displays all rows from both tables and removes duplicate records from the combined dataset. The union_all function keeps duplicate rows in the combined dataset.
x=data.frame(ID = 1:6, ID1= 1:6)
y=data.frame(ID = 1:6, ID1 = 1:6)
union(x,y)
union_all(x,y)
Example 37 : Rows appear in one table but not in other table
setdiff(first, second)
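The set operations can be compared on two small overlapping tables. The data frames below are hypothetical (part1 and part2 are names chosen for illustration); they share the rows with ID 3 and 4.

```r
library(dplyr)

# Hypothetical tables with identical columns and two rows in common
part1 <- data.frame(ID = 1:4, x = letters[1:4], stringsAsFactors = FALSE)
part2 <- data.frame(ID = 3:6, x = letters[3:6], stringsAsFactors = FALSE)

common   <- intersect(part1, part2)   # rows in both tables
combined <- union(part1, part2)       # all rows, duplicates removed
stacked  <- union_all(part1, part2)   # all rows, duplicates kept
only1    <- setdiff(part1, part2)     # rows in part1 but not in part2
```

These functions compare whole rows, so both tables need the same columns.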
Example 38 : IF ELSE Statement
Syntax :
if_else(condition, true, false, missing = NULL)
true : Value if condition meets
false : Value if condition does not meet
missing : If not NULL, will be used to replace missing values
df <- c(-10,2, NA)
if_else(df < 0, "negative", "positive", missing = "missing value")
Create a new variable with IF_ELSE
If a value is less than 5, add 1 to it; if it is greater than or equal to 5, add 2 to it. Missing values are set to 0.
df =data.frame(x = c(1,5,6,NA))
df$newvar = if_else(df$x<5, df$x+1, df$x+2,0)
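The rule above can be checked on a plain vector. The sketch also shows case_when(), another dplyr function that is convenient when there are more than two outcomes; the labels "low", "high" and "missing" are made up for illustration.

```r
library(dplyr)

x <- c(1, 5, 6, NA)

# Two-way choice; an NA in the condition is replaced by `missing`
shifted <- if_else(x < 5, x + 1, x + 2, missing = 0)

# More than two outcomes: conditions are checked in order, first match wins
label <- case_when(
  is.na(x) ~ "missing",
  x < 5    ~ "low",
  TRUE     ~ "high"
)
```

Unlike base ifelse(), if_else() checks that the true, false and missing values have compatible types, so mixing numbers and strings raises an error instead of silently converting.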
Example 39 : Apply ROW WISE Operation
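Suppose you want to find the maximum value in each row across the variables Y2012, Y2013, Y2014 and Y2015. The rowwise() function allows you to apply functions to rows. Inside mutate(), wrap the column range in c_across() (dplyr 1.0.0 or later); a bare Y2012:Y2015 would be read as a numeric sequence from the Y2012 value to the Y2015 value and would ignore Y2013 and Y2014.
df = mydata %>%
rowwise() %>% mutate(Max= max(c_across(Y2012:Y2015))) %>%
select(Y2012:Y2015,Max)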
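A row-wise maximum can be verified on a two-row example. The sketch assumes dplyr 1.0.0 or later for c_across() and uses a hypothetical data frame (toy and its values are invented); base R's vectorised pmax() is shown as an alternative that needs no rowwise() step.

```r
library(dplyr)

# Hypothetical data: the row maxima are 7 and 9
toy <- data.frame(Y2012 = c(1, 9), Y2013 = c(7, 2),
                  Y2014 = c(3, 4), Y2015 = c(2, 5))

# Row by row: c_across() collects the selected columns of the current row
by_row <- toy %>%
  rowwise() %>%
  mutate(Max = max(c_across(Y2012:Y2015))) %>%
  ungroup()

# Vectorised alternative from base R
by_pmax <- mutate(toy, Max = pmax(Y2012, Y2013, Y2014, Y2015))
```

On large data frames the pmax() version is typically faster, because it does not process the rows one at a time.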
Example 40 : Combine Data Frames
Suppose you are asked to combine two data frames. Let's first create two sample datasets.
df1=data.frame(ID = 1:6, x=letters[1:6])
df2=data.frame(ID = 7:12, x=letters[7:12])
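The bind_rows() function combines two datasets by rows. So the combined dataset would contain 12 rows (6+6) and 2 columns.
xy = bind_rows(df1,df2)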
It is equivalent to base R function rbind.
xy = rbind(df1,df2)
The bind_cols() function combines two datasets by columns. Using the data frames x and y created in Example 36, the combined dataset would contain 4 columns and 6 rows.
xy = bind_cols(x,y)
or
xy = cbind(x,y)
Example 41 : Calculate Percentile Values
The quantile() function is used to determine Nth percentile value. In this example, we are computing percentile values by variable Index.
mydata %>% group_by(Index) %>%
summarise(Percentile_25=quantile(Y2015, probs=0.25),
Percentile_50=quantile(Y2015, probs=0.5),
Percentile_75=quantile(Y2015, probs=0.75),
Percentile_99=quantile(Y2015, probs=0.99))
The ntile() function is used to divide the data into N bins.
x= data.frame(N= 1:10)
x = mutate(x, pos = ntile(N,5))
Example 42 : Automate Model Building
This example explains the advanced usage of the do() function. We are building a linear regression model for each level of a categorical variable. There are 3 levels in the variable cyl of the dataset mtcars.
length(unique(mtcars$cyl))
Result : 3
by_cyl <- group_by(mtcars, cyl)
models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
summarise(models, rsq = summary(mod)$r.squared)
models %>% do(data.frame(
var = names(coef(.$mod)),
coef(summary(.$mod)))
)