Chapter 5 Data Manipulation with dplyr

In this chapter, we will learn a very popular package dplyr to deal with data manipulation. We will mainly go through its main functions (BiomedicalDataScience 2019; r-project 2019).

5.1 select()

Before the lecture, install the package in your machine.

install.packages('dplyr')

Let’s first import the package and the mtcars dataset.

library(dplyr)
data(mtcars)

If we want to select some columns from the dataset. We could use the select() function in dplyr. It is similar to the subset() function, but here, you do not need to use select = c('mpg', 'disp'). You use the names of the columns directly in the function.

df <- select(mtcars, # name of the data frame
       mpg, disp) # column names you want to select
head(df, 3)
##                mpg disp
## Mazda RX4     21.0  160
## Mazda RX4 Wag 21.0  160
## Datsun 710    22.8  108

You could also use the index of the columns.

df <- select(mtcars, # name of the data frame
       c(1, 3)) # index of the columns you want to select
head(df, 3)
##                mpg disp
## Mazda RX4     21.0  160
## Mazda RX4 Wag 21.0  160
## Datsun 710    22.8  108

The codes above is kind of a traditional way to do the work. We start with a function and put parameters in the function. However, this is not the typical way to use dplyr.

The codes below is a more dplyr way people use dplyr. We start with the name of the data frame. Then, we put a special sign %>% called pipe after it. We continue from a new line and write the function we want to use. Besides that, we could add more functions with the pipe operator. For example, only show first three observations with head() functions.

mtcars %>% # name of the data frame
  select(mpg, disp) %>% # select the columns by their names
  head(3)
##                mpg disp
## Mazda RX4     21.0  160
## Mazda RX4 Wag 21.0  160
## Datsun 710    22.8  108

We will keep using this fashion in the following lecture.

Besides choosing some columns you want, you could also exclude the column you do not want by putting a negative sign - before the variable.

mtcars %>%
  select(-mpg, -disp) %>%
  head(3)
##               cyl  hp drat    wt  qsec vs am gear carb
## Mazda RX4       6 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag   6 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710      4  93 3.85 2.320 18.61  1  1    4    1

You could use : to select a range of variables.

mtcars %>%
  select(mpg:hp) %>% # select from mpg to hp in the data frame
  head(3)
##                mpg cyl disp  hp
## Mazda RX4     21.0   6  160 110
## Mazda RX4 Wag 21.0   6  160 110
## Datsun 710    22.8   4  108  93

5.2 filter()

In dplyr, we could use filter() function to select the rows satisfying some conditions.

mtcars %>%
  filter(mpg > 30)
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

Add more conditions by using , to separate them.

mtcars %>%
  filter(mpg > 30, qsec < 19)
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Honda Civic  30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

5.3 arrange()

We could arrange the order of some columns by arrange() functions.

mtcars %>%
  arrange(mpg) %>% # arrange mpg in asceding order
  head(10)
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4

Or maybe we want mpg to be in a descending order. Just put a desc() outside the variable.

mtcars %>%
  arrange(desc(mpg)) %>%
  head(10)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

We could put them together by using pipe operators to connect them.

mtcars %>%
  select(gear, mpg) %>%
  arrange(desc(gear), mpg) %>%
  head(10)
##                gear  mpg
## Maserati Bora     5 15.0
## Ford Pantera L    5 15.8
## Ferrari Dino      5 19.7
## Porsche 914-2     5 26.0
## Lotus Europa      5 30.4
## Merc 280C         4 17.8
## Merc 280          4 19.2
## Mazda RX4         4 21.0
## Mazda RX4 Wag     4 21.0
## Volvo 142E        4 21.4

5.4 mutate()

We use mutate() to do some calculations within the variables and create a new column to store them.

mtcars %>%
  select(mpg) %>%
  mutate(kmpg <- mpg * 1.609) %>%
  head(10)
##                    mpg kmpg <- mpg * 1.609
## Mazda RX4         21.0             33.7890
## Mazda RX4 Wag     21.0             33.7890
## Datsun 710        22.8             36.6852
## Hornet 4 Drive    21.4             34.4326
## Hornet Sportabout 18.7             30.0883
## Valiant           18.1             29.1229
## Duster 360        14.3             23.0087
## Merc 240D         24.4             39.2596
## Merc 230          22.8             36.6852
## Merc 280          19.2             30.8928
mtcars %>%
  select(mpg, wt) %>%
  mutate(kmpg = mpg * 1.609, lbwt = wt * 1000) %>%
  head(10)
##                    mpg    wt    kmpg lbwt
## Mazda RX4         21.0 2.620 33.7890 2620
## Mazda RX4 Wag     21.0 2.875 33.7890 2875
## Datsun 710        22.8 2.320 36.6852 2320
## Hornet 4 Drive    21.4 3.215 34.4326 3215
## Hornet Sportabout 18.7 3.440 30.0883 3440
## Valiant           18.1 3.460 29.1229 3460
## Duster 360        14.3 3.570 23.0087 3570
## Merc 240D         24.4 3.190 39.2596 3190
## Merc 230          22.8 3.150 36.6852 3150
## Merc 280          19.2 3.440 30.8928 3440

5.5 group_by() and summarise()

We use group_by() to do aggregation (group the observations based the values of one or one more columns) work and summarise() to calculate some statistics related to each group.

mtcars %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg),
            max_mpg = max(mpg),
            total = n())
## # A tibble: 3 x 4
##    gear mean_mpg max_mpg total
## * <dbl>    <dbl>   <dbl> <int>
## 1     3     16.1    21.5    15
## 2     4     24.5    33.9    12
## 3     5     21.4    30.4     5
mtcars %>%
  group_by(gear, am) %>%
  summarise(mean_mpg = mean(mpg),
            max_mpg = max(mpg),
            total = n())
## `summarise()` has grouped output by 'gear'. You can override using the `.groups` argument.
## # A tibble: 4 x 5
## # Groups:   gear [3]
##    gear    am mean_mpg max_mpg total
##   <dbl> <dbl>    <dbl>   <dbl> <int>
## 1     3     0     16.1    21.5    15
## 2     4     0     21.0    24.4     4
## 3     4     1     26.3    33.9     8
## 4     5     1     21.4    30.4     5

5.6 join()

We could use join() to do the same work of merge().

ID <- c(1:4) # create variable ID
Name <- c('A', 'B', 'C', 'D') # create variable Name
Score1 <- c(69.5, 77.5, 99, 90) # create variable Score1
df1 <- data.frame(ID, Name, Score1) # combine the varibles into one data frame called df1
df1
##   ID Name Score1
## 1  1    A   69.5
## 2  2    B   77.5
## 3  3    C   99.0
## 4  4    D   90.0
Name <- c('A', 'D', 'C') # create variable Name
Score2 <- c(98, 46, 55) # create variable Score2
df2 <- data.frame(Name, Score2) # combine the varibles into one data frame called df2
df2
##   Name Score2
## 1    A     98
## 2    D     46
## 3    C     55
df1 %>%
  left_join(df2, by = 'Name')
##   ID Name Score1 Score2
## 1  1    A   69.5     98
## 2  2    B   77.5     NA
## 3  3    C   99.0     55
## 4  4    D   90.0     46
df1 %>%
  right_join(df2, by = 'Name')
##   ID Name Score1 Score2
## 1  1    A   69.5     98
## 2  3    C   99.0     55
## 3  4    D   90.0     46

Could you tell the difference between left_join() and right_join()?

Besides left_join() and right_join(), we have inner_join() (keep only matched observations of two data frames) and full_join() (keep all observations of two data frames).