Chapter 4 Data Manipulation with Base Functions
We will introduce how to manipulate with different datasets using base functions in R.
4.1 Import and save datasets
There couple of ways to importing and saving different types of datasets (Quick-R 2019c, 2019a).
4.1.1 Import data
CSV file
<- read.csv('c:/mydata.csv', # file location and name
mydata header = TRUE, # read the first
sep = ",") # which type of separation
EXCEL file
library(readxl)
<- read_excel('c:/mydata.xlsx', # file location and name
dataset sheet = 'data') # name or index of the sheet
dta STATA file
library(foreign)
<- read.dta('c:/mydata.dta') # file location and name mydata
Here is an easy way to load dataset in RStudio.
System tool Besides importing data by codes, you could also import data with the system tool. If this is your first time to use this tool, there may be a process to install the packages depending your options. But don’t worry, RStudio can do it by itself. You just need to click the button to approve the installation.
File -> Import dataset -> choose the type of dataset you want to import
There are some other options or parameters you could set in the import functions listed above (e.g., specify a variable type or try to skip some of the rows). Sometimes it is sort of complex to do it. So for me, I just do those things after I import the dataset.
4.1.2 Save file
CSV file
write.csv(df, # data
'c:/filename.csv') # file location and name
EXCEL file
library(xlsx)
write.xlsx(mydata, # data
"c:/mydata.xlsx") # file location and name
dta STATA file
library(foreign)
write.dta(mydata, "c:/mydata.dta")
Usually, it takes less time to save file in CSV and CSV file has a smaller size in storage.
4.1.3 File location
As you can see in the examples, you need to specify the location (or path) of the file to make sure that R could find your file in the right position. Usually you could find it by checking the system property of the file.
You could avoid this. First put your R file and dataset in the same folder. Then start the R file by double clicking. R will use the folder where the R file locates as the working folder. Then you could only specify the name of the file.
This is recommended. Since it will easier for others to check your codes since they do not need to change the path of the file.
4.2 View data
You could view the variable names and simple description in the Environment
pane on the right-top position of RStudio. If you want to view more information, click the variable name and view the variable in new window. Here, we use the built-in dataset mtcars
as a example. By importing this dataset, use data()
function.
data(mtcars)
You could also view the data in the new window by View()
function. Please pay attention that it is initial-capitalized.
View(mtcars)
View the first ten observations (rows) in the dataset.
head(mtcars, n = 10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
view the last five observations in the dataset.
tail(mtcars, n = 5)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
List the variables in the dataset (Quick-R 2019b).
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
List the structure of the dataset.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
List the dimensions of the dataset
dim(mtcars)
## [1] 32 11
List the number of rows in the dataset
nrow(mtcars)
## [1] 32
List the number of columns in the dataset.
ncol(mtcars)
## [1] 11
4.3 Data selection
Select one column with 5 rows of observations.
head(mtcars$mpg, n = 5) # by name
## [1] 21.0 21.0 22.8 21.4 18.7
head(mtcars[1], n = 5) # by index
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
Select several columns with 5 rows of observations.
head(mtcars[c('mpg', 'disp')], n = 5) # by name
## mpg disp
## Mazda RX4 21.0 160
## Mazda RX4 Wag 21.0 160
## Datsun 710 22.8 108
## Hornet 4 Drive 21.4 258
## Hornet Sportabout 18.7 360
head(mtcars[c(1, 3, 5)], n = 5) # by index
## mpg disp drat
## Mazda RX4 21.0 160 3.90
## Mazda RX4 Wag 21.0 160 3.90
## Datsun 710 22.8 108 3.85
## Hornet 4 Drive 21.4 258 3.08
## Hornet Sportabout 18.7 360 3.15
Select one row by index
1,] # by index number mtcars[
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
'Valiant',] # by name of the index mtcars[
## mpg cyl disp hp drat wt qsec vs am gear carb
## Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
Select several rows
2:3, ] # by index mtcars[
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
c(1,5,9), ] # by index mtcars[
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.62 16.46 0 1 4 4
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
## Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
c('Valiant', 'Hornet Sportabout'), ] # by name mtcars[
## mpg cyl disp hp drat wt qsec vs am gear carb
## Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
4.4 Conditional selection
$mpg > 25, ] mtcars[mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
$mpg > 25) & (mtcars$qsec < 19), ] mtcars[(mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
$mpg[mtcars$gear == 4] mtcars
## [1] 21.0 21.0 22.8 24.4 22.8 19.2 17.8 32.4 30.4 33.9 27.3 21.4
When you do it like this, you select elements from a vector, you do not need to use comma here.
4.5 Deal with missing values
In R, the missing values is presented as NA
.
Test the existence of missing values with is.na()
function. We use an revised old example here.
<- c(1:4) # create variable ID
ID <- c('A', NA, 'C', 'D') # create variable Name
Name <- c(69.5, 77.5, NA, 90) # create variable Score
Score <- data.frame(ID, Name, Score) # combine the variables into one data frame called df
df is.na(df)
## ID Name Score
## [1,] FALSE FALSE FALSE
## [2,] FALSE TRUE FALSE
## [3,] FALSE FALSE TRUE
## [4,] FALSE FALSE FALSE
Assign missing values
$Score[df$Score == 90] <- NA
df df
## ID Name Score
## 1 1 A 69.5
## 2 2 <NA> 77.5
## 3 3 C NA
## 4 4 D NA
NAs will influence some functions.
mean(df$Score) # get the mean value (does not ignore NA)
## [1] NA
mean(df$Score, na.rm=TRUE) # (ignore NA)
## [1] 73.5
Test if the observations in the dataset has NAs.
complete.cases(df)
## [1] TRUE FALSE FALSE FALSE
Find the observations with no NAs.
na.omit(df)
## ID Name Score
## 1 1 A 69.5
4.6 Subset
subset()
is another way to select the data you want.
Select observations with mpg
larger than 30.
data(mtcars)
<- subset(mtcars, mpg > 30)
newdata newdata
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Select two columns from the dataset.
<- subset(mtcars, select = c('mpg', 'cyl'))
newdata head(newdata, n = 5)
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
4.7 Merge two datasets
merge()
function does the same work as vlookup()
in excel and ‘Join’ function in ArcGIS. It links two datasets based on their common variable (the variable they both have).
<- c(1:4) # create variable ID
ID <- c('A', 'B', 'C', 'D') # create variable Name
Name <- c(69.5, 77.5, 99, 90) # create variable Score1
Score1 <- data.frame(ID, Name, Score1) # combine the variables into one data frame called df1
df1 df1
## ID Name Score1
## 1 1 A 69.5
## 2 2 B 77.5
## 3 3 C 99.0
## 4 4 D 90.0
<- c('A', 'D', 'C') # create variable Name
Name <- c(98, 46, 55) # create variable Score2
Score2 <- data.frame(Name, Score2) # combine the variables into one data frame called df2
df2 df2
## Name Score2
## 1 A 98
## 2 D 46
## 3 C 55
merge(df1, df2, # dataframes needs to be merged
by = 'Name', # name of the column/variable used for merging
all.x = TRUE) # keep all observations in the first dataframe after merging
## Name ID Score1 Score2
## 1 A 1 69.5 98
## 2 B 2 77.5 NA
## 3 C 3 99.0 55
## 4 D 4 90.0 46
merge(df1, df2,
by = 'Name',
all.y = TRUE) # keep all observations in the first dataframe after merging
## Name ID Score1 Score2
## 1 A 1 69.5 98
## 2 C 3 99.0 55
## 3 D 4 90.0 46
You could keep all the observations in both two data frames by set all = TRUE
in the function.
4.8 Column operation
Column operation or vector operation is a very important idea in R. It applies the operations in two columns or the function in one column directly rather than applies them on each element one by one.
<- c(1:4) # create variable ID
ID <- c('A', 'B', 'C', 'D') # create variable Name
Name <- c(69.5, 77.5, 99, 90) # create variable Score1
Score1 <- c(98, 46, 55, 70) # create variable Score2
Score2 <- data.frame(ID, Name, Score1, Score2) # combine the variables into one data frame called df
df df
## ID Name Score1 Score2
## 1 1 A 69.5 98
## 2 2 B 77.5 46
## 3 3 C 99.0 55
## 4 4 D 90.0 70
Create a new column in the dataframe to calculate the sum of score 1 and score 2
$totalsocre <- df$Score1 + df$Score2
df df
## ID Name Score1 Score2 totalsocre
## 1 1 A 69.5 98 167.5
## 2 2 B 77.5 46 123.5
## 3 3 C 99.0 55 154.0
## 4 4 D 90.0 70 160.0
Create a new column in the dataframe to calculate the mean of score 1 and score 2
$meansocre <- (df$Score1 + df$Score2)/2
df df
## ID Name Score1 Score2 totalsocre meansocre
## 1 1 A 69.5 98 167.5 83.75
## 2 2 B 77.5 46 123.5 61.75
## 3 3 C 99.0 55 154.0 77.00
## 4 4 D 90.0 70 160.0 80.00