Chapter 5 Data Manipulation with dplyr
In this chapter, we will learn a very popular package dplyr
to deal with data manipulation. We will mainly go through its main functions (BiomedicalDataScience 2019; r-project 2019).
5.1 select()
Before the lecture, install the package in your machine.
install.packages('dplyr')
Let’s first import the package and the mtcars
dataset.
library(dplyr)
data(mtcars)
If we want to select some columns from the dataset. We could use the select()
function in dplyr
. It is similar to the subset()
function, but here, you do not need to use select = c('mpg', 'disp')
. You use the names of the columns directly in the function.
<- select(mtcars, # name of the data frame
df # column names you want to select
mpg, disp) head(df, 3)
## mpg disp
## Mazda RX4 21.0 160
## Mazda RX4 Wag 21.0 160
## Datsun 710 22.8 108
You could also use the index of the columns.
<- select(mtcars, # name of the data frame
df c(1, 3)) # index of the columns you want to select
head(df, 3)
## mpg disp
## Mazda RX4 21.0 160
## Mazda RX4 Wag 21.0 160
## Datsun 710 22.8 108
The codes above is kind of a traditional way to do the work. We start with a function and put parameters in the function. However, this is not the typical way to use dplyr
.
The codes below is a more dplyr
way people use dplyr
. We start with the name of the data frame. Then, we put a special sign %>%
called pipe
after it. We continue from a new line and write the function we want to use. Besides that, we could add more functions with the pipe operator. For example, only show first three observations with head()
functions.
%>% # name of the data frame
mtcars select(mpg, disp) %>% # select the columns by their names
head(3)
## mpg disp
## Mazda RX4 21.0 160
## Mazda RX4 Wag 21.0 160
## Datsun 710 22.8 108
We will keep using this fashion in the following lecture.
Besides choosing some columns you want, you could also exclude the column you do not want by putting a negative sign -
before the variable.
%>%
mtcars select(-mpg, -disp) %>%
head(3)
## cyl hp drat wt qsec vs am gear carb
## Mazda RX4 6 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 6 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 4 93 3.85 2.320 18.61 1 1 4 1
You could use :
to select a range of variables.
%>%
mtcars select(mpg:hp) %>% # select from mpg to hp in the data frame
head(3)
## mpg cyl disp hp
## Mazda RX4 21.0 6 160 110
## Mazda RX4 Wag 21.0 6 160 110
## Datsun 710 22.8 4 108 93
5.2 filter()
In dplyr
, we could use filter()
function to select the rows satisfying some conditions.
%>%
mtcars filter(mpg > 30)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Add more conditions by using ,
to separate them.
%>%
mtcars filter(mpg > 30, qsec < 19)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
5.3 arrange()
We could arrange the order of some columns by arrange()
functions.
%>%
mtcars arrange(mpg) %>% # arrange mpg in asceding order
head(10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Or maybe we want mpg
to be in a descending order. Just put a desc()
outside the variable.
%>%
mtcars arrange(desc(mpg)) %>%
head(10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
We could put them together by using pipe operators to connect them.
%>%
mtcars select(gear, mpg) %>%
arrange(desc(gear), mpg) %>%
head(10)
## gear mpg
## Maserati Bora 5 15.0
## Ford Pantera L 5 15.8
## Ferrari Dino 5 19.7
## Porsche 914-2 5 26.0
## Lotus Europa 5 30.4
## Merc 280C 4 17.8
## Merc 280 4 19.2
## Mazda RX4 4 21.0
## Mazda RX4 Wag 4 21.0
## Volvo 142E 4 21.4
5.4 mutate()
We use mutate()
to do some calculations within the variables and create a new column to store them.
%>%
mtcars select(mpg) %>%
mutate(kmpg <- mpg * 1.609) %>%
head(10)
## mpg kmpg <- mpg * 1.609
## Mazda RX4 21.0 33.7890
## Mazda RX4 Wag 21.0 33.7890
## Datsun 710 22.8 36.6852
## Hornet 4 Drive 21.4 34.4326
## Hornet Sportabout 18.7 30.0883
## Valiant 18.1 29.1229
## Duster 360 14.3 23.0087
## Merc 240D 24.4 39.2596
## Merc 230 22.8 36.6852
## Merc 280 19.2 30.8928
%>%
mtcars select(mpg, wt) %>%
mutate(kmpg = mpg * 1.609, lbwt = wt * 1000) %>%
head(10)
## mpg wt kmpg lbwt
## Mazda RX4 21.0 2.620 33.7890 2620
## Mazda RX4 Wag 21.0 2.875 33.7890 2875
## Datsun 710 22.8 2.320 36.6852 2320
## Hornet 4 Drive 21.4 3.215 34.4326 3215
## Hornet Sportabout 18.7 3.440 30.0883 3440
## Valiant 18.1 3.460 29.1229 3460
## Duster 360 14.3 3.570 23.0087 3570
## Merc 240D 24.4 3.190 39.2596 3190
## Merc 230 22.8 3.150 36.6852 3150
## Merc 280 19.2 3.440 30.8928 3440
5.5 group_by() and summarise()
We use group_by()
to do aggregation (group the observations based the values of one or one more columns) work and summarise()
to calculate some statistics related to each group.
%>%
mtcars group_by(gear) %>%
summarise(mean_mpg = mean(mpg),
max_mpg = max(mpg),
total = n())
## # A tibble: 3 x 4
## gear mean_mpg max_mpg total
## * <dbl> <dbl> <dbl> <int>
## 1 3 16.1 21.5 15
## 2 4 24.5 33.9 12
## 3 5 21.4 30.4 5
%>%
mtcars group_by(gear, am) %>%
summarise(mean_mpg = mean(mpg),
max_mpg = max(mpg),
total = n())
## `summarise()` has grouped output by 'gear'. You can override using the `.groups` argument.
## # A tibble: 4 x 5
## # Groups: gear [3]
## gear am mean_mpg max_mpg total
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 3 0 16.1 21.5 15
## 2 4 0 21.0 24.4 4
## 3 4 1 26.3 33.9 8
## 4 5 1 21.4 30.4 5
5.6 join()
We could use join()
to do the same work of merge()
.
<- c(1:4) # create variable ID
ID <- c('A', 'B', 'C', 'D') # create variable Name
Name <- c(69.5, 77.5, 99, 90) # create variable Score1
Score1 <- data.frame(ID, Name, Score1) # combine the varibles into one data frame called df1
df1 df1
## ID Name Score1
## 1 1 A 69.5
## 2 2 B 77.5
## 3 3 C 99.0
## 4 4 D 90.0
<- c('A', 'D', 'C') # create variable Name
Name <- c(98, 46, 55) # create variable Score2
Score2 <- data.frame(Name, Score2) # combine the varibles into one data frame called df2
df2 df2
## Name Score2
## 1 A 98
## 2 D 46
## 3 C 55
%>%
df1 left_join(df2, by = 'Name')
## ID Name Score1 Score2
## 1 1 A 69.5 98
## 2 2 B 77.5 NA
## 3 3 C 99.0 55
## 4 4 D 90.0 46
%>%
df1 right_join(df2, by = 'Name')
## ID Name Score1 Score2
## 1 1 A 69.5 98
## 2 3 C 99.0 55
## 3 4 D 90.0 46
Could you tell the difference between left_join()
and right_join()
?
Besides left_join()
and right_join()
, we have inner_join()
(keep only matched observations of two data frames) and full_join()
(keep all observations of two data frames).