Chapter 4 Data Manipulation with Base Functions

We will introduce how to manipulate with different datasets using base functions in R.

4.1 Import and save datasets

There couple of ways to importing and saving different types of datasets (Quick-R 2019c, 2019a).

4.1.1 Import data

CSV file

mydata <- read.csv('c:/mydata.csv', # file location and name
                   header = TRUE, # read the first 
                   sep = ",") # which type of separation

EXCEL file

library(readxl)
dataset <- read_excel('c:/mydata.xlsx', # file location and name
                      sheet = 'data') # name or index of the sheet

dta STATA file

library(foreign)
mydata <- read.dta('c:/mydata.dta') # file location and name

Here is an easy way to load dataset in RStudio.

System tool Besides importing data by codes, you could also import data with the system tool. If this is your first time to use this tool, there may be a process to install the packages depending your options. But don’t worry, RStudio can do it by itself. You just need to click the button to approve the installation.

File -> Import dataset -> choose the type of dataset you want to import

There are some other options or parameters you could set in the import functions listed above (e.g., specify a variable type or try to skip some of the rows). Sometimes it is sort of complex to do it. So for me, I just do those things after I import the dataset.

4.1.2 Save file

CSV file

write.csv(df, # data
          'c:/filename.csv') # file location and name

EXCEL file

library(xlsx)
write.xlsx(mydata, # data
           "c:/mydata.xlsx") # file location and name

dta STATA file

library(foreign)
write.dta(mydata, "c:/mydata.dta")

Usually, it takes less time to save file in CSV and CSV file has a smaller size in storage.

4.1.3 File location

As you can see in the examples, you need to specify the location (or path) of the file to make sure that R could find your file in the right position. Usually you could find it by checking the system property of the file.

You could avoid this. First put your R file and dataset in the same folder. Then start the R file by double clicking. R will use the folder where the R file locates as the working folder. Then you could only specify the name of the file.

This is recommended. Since it will easier for others to check your codes since they do not need to change the path of the file.

4.2 View data

You could view the variable names and simple description in the Environment pane on the right-top position of RStudio. If you want to view more information, click the variable name and view the variable in new window. Here, we use the built-in dataset mtcars as a example. By importing this dataset, use data() function.

data(mtcars)

You could also view the data in the new window by View() function. Please pay attention that it is initial-capitalized.

View(mtcars)

View the first ten observations (rows) in the dataset. stands for double class, which is a subtype of numerical variable type.

head(mtcars, n = 10)

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

view the last five observations in the dataset.

tail(mtcars, n = 5)

##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

List the variables in the dataset (Quick-R 2019b).

names(mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

colnames(mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

List the structure of the dataset.

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

List the dimensions of the dataset

dim(mtcars)

## [1] 32 11

List the number of rows in the dataset

nrow(mtcars)

## [1] 32

List the number of columns in the dataset.

ncol(mtcars)

## [1] 11

4.3 Data selection

Select one column with 5 rows of observations.

head(mtcars$mpg, n = 5) # by name

## [1] 21.0 21.0 22.8 21.4 18.7

head(mtcars[1], n = 5) # by index

##                    mpg
## Mazda RX4         21.0
## Mazda RX4 Wag     21.0
## Datsun 710        22.8
## Hornet 4 Drive    21.4
## Hornet Sportabout 18.7

Select several columns with 5 rows of observations.

head(mtcars[c('mpg', 'disp')], n = 5) # by name

##                    mpg disp
## Mazda RX4         21.0  160
## Mazda RX4 Wag     21.0  160
## Datsun 710        22.8  108
## Hornet 4 Drive    21.4  258
## Hornet Sportabout 18.7  360

head(mtcars[c(1, 3, 5)], n = 5) # by index

##                    mpg disp drat
## Mazda RX4         21.0  160 3.90
## Mazda RX4 Wag     21.0  160 3.90
## Datsun 710        22.8  108 3.85
## Hornet 4 Drive    21.4  258 3.08
## Hornet Sportabout 18.7  360 3.15

Select one row by index

mtcars[1,] # by index number

##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4

mtcars['Valiant',] # by name of the index

##          mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Valiant 18.1   6  225 105 2.76 3.46 20.22  1  0    3    1

Select several rows

mtcars[2:3, ] # by index

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

mtcars[c(1,5,9), ] # by index

##                    mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.62 16.46  0  1    4    4
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2
## Merc 230          22.8   4 140.8  95 3.92 3.15 22.90  1  0    4    2

mtcars[c('Valiant', 'Hornet Sportabout'), ] # by name

##                    mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Valiant           18.1   6  225 105 2.76 3.46 20.22  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2

4.4 Conditional selection

mtcars[mtcars$mpg > 25, ]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

mtcars[(mtcars$mpg > 25) & (mtcars$qsec < 19), ]

##                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Honda Civic   30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Fiat X1-9     27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa  30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

mtcars$mpg[mtcars$gear == 4]

##  [1] 21.0 21.0 22.8 24.4 22.8 19.2 17.8 32.4 30.4 33.9 27.3 21.4

When you do it like this, you select elements from a vector, you do not need to use comma here.

4.5 Deal with missing values

In R, the missing values is presented as NA.

Test the existence of missing values with is.na() function. We use an revised old example here.

ID <- c(1:4) # create variable ID
Name <- c('A', NA, 'C', 'D') # create variable Name
Score <- c(69.5, 77.5, NA, 90) # create variable Score
df <- data.frame(ID, Name, Score) # combine the variables into one data frame called df
is.na(df)

##         ID  Name Score
## [1,] FALSE FALSE FALSE
## [2,] FALSE  TRUE FALSE
## [3,] FALSE FALSE  TRUE
## [4,] FALSE FALSE FALSE

Assign missing values

df$Score[df$Score == 90] <- NA
df

##   ID Name Score
## 1  1    A  69.5
## 2  2 <NA>  77.5
## 3  3    C    NA
## 4  4    D    NA

NAs will influence some functions.

mean(df$Score) # get the mean value (does not ignore NA)

## [1] NA

mean(df$Score, na.rm=TRUE) # (ignore NA)

## [1] 73.5

Test if the observations in the dataset has NAs.

complete.cases(df)

## [1]  TRUE FALSE FALSE FALSE

Find the observations with no NAs.

na.omit(df)

##   ID Name Score
## 1  1    A  69.5

4.6 Subset

subset() is another way to select the data you want.

Select observations with mpg larger than 30.

data(mtcars)
newdata <- subset(mtcars, mpg > 30)
newdata

##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

Select two columns from the dataset.

newdata <- subset(mtcars, select = c('mpg', 'cyl'))
head(newdata, n = 5)

##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Datsun 710        22.8   4
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8

4.7 Merge two datasets

merge() function does the same work as vlookup() in excel and ‘Join’ function in ArcGIS. It links two datasets based on their common variable (the variable they both have).

ID <- c(1:4) # create variable ID
Name <- c('A', 'B', 'C', 'D') # create variable Name
Score1 <- c(69.5, 77.5, 99, 90) # create variable Score1
df1 <- data.frame(ID, Name, Score1) # combine the variables into one data frame called df1
df1

##   ID Name Score1
## 1  1    A   69.5
## 2  2    B   77.5
## 3  3    C   99.0
## 4  4    D   90.0

Name <- c('A', 'D', 'C') # create variable Name
Score2 <- c(98, 46, 55) # create variable Score2
df2 <- data.frame(Name, Score2) # combine the variables into one data frame called df2
df2

##   Name Score2
## 1    A     98
## 2    D     46
## 3    C     55

merge(df1, df2,       # dataframes needs to be merged
      by = 'Name',    # name of the column/variable used for merging
      all.x = TRUE)   # keep all observations in the first dataframe after merging

##   Name ID Score1 Score2
## 1    A  1   69.5     98
## 2    B  2   77.5     NA
## 3    C  3   99.0     55
## 4    D  4   90.0     46

merge(df1, df2, 
      by = 'Name',
      all.y = TRUE) # keep all observations in the first dataframe after merging

##   Name ID Score1 Score2
## 1    A  1   69.5     98
## 2    C  3   99.0     55
## 3    D  4   90.0     46

You could keep all the observations in both two data frames by set all = TRUE in the function.

4.8 Column operation

Column operation or vector operation is a very important idea in R. It applies the operations in two columns or the function in one column directly rather than applies them on each element one by one.

ID <- c(1:4) # create variable ID
Name <- c('A', 'B', 'C', 'D') # create variable Name
Score1 <- c(69.5, 77.5, 99, 90) # create variable Score1
Score2 <- c(98, 46, 55, 70) # create variable Score2
df <- data.frame(ID, Name, Score1, Score2) # combine the variables into one data frame called df
df

##   ID Name Score1 Score2
## 1  1    A   69.5     98
## 2  2    B   77.5     46
## 3  3    C   99.0     55
## 4  4    D   90.0     70

Create a new column in the dataframe to calculate the sum of score 1 and score 2

df$totalsocre <- df$Score1 + df$Score2
df

##   ID Name Score1 Score2 totalsocre
## 1  1    A   69.5     98      167.5
## 2  2    B   77.5     46      123.5
## 3  3    C   99.0     55      154.0
## 4  4    D   90.0     70      160.0

Create a new column in the dataframe to calculate the mean of score 1 and score 2

df$meansocre <- (df$Score1 + df$Score2)/2
df

##   ID Name Score1 Score2 totalsocre meansocre
## 1  1    A   69.5     98      167.5     83.75
## 2  2    B   77.5     46      123.5     61.75
## 3  3    C   99.0     55      154.0     77.00
## 4  4    D   90.0     70      160.0     80.00