Chapter 4 Data Manipulation with Base Functions
We will introduce how to manipulate with different datasets using base functions in R.
4.1 Import and save datasets
There couple of ways to importing and saving different types of datasets (Quick-R 2019c, 2019a).
4.1.1 Import data from commonly-used file types
CSV file
<- read.csv('c:/minneapolis.csv', ## file location and name
mydata header = TRUE, ## read the first
sep = ',') ## which type of separation
EXCEL file
library(readxl)
<- read_excel('c:/mydata.xlsx', # file location and name
dataset sheet = 'data') # name or index of the sheet
dta STATA file
library(foreign)
<- read.dta('c:/mydata.dta') # file location and name mydata
Below is an easy way to load datasets in RStudio.
System tool
Besides importing data by codes, you could also import data with the system tool. If this is your first time to use this tool, there may be a process to install the packages depending your options. But don’t worry, RStudio can do it by itself. You just need to click the button to approve the installation.
File -> Import dataset -> choose the type of dataset you want to import
There are some other options you could set in the import functions listed above (e.g., specify a variable type or try to skip some of the rows).
4.1.2 Save file
CSV file
write.csv(df, # data
'c:/filename.csv') # file location and name
EXCEL file
library(xlsx)
write.xlsx(mydata, # data
"c:/mydata.xlsx") # file location and name
dta STATA file
library(foreign)
write.dta(mydata, "c:/mydata.dta")
Usually, it takes less time to save file in CSV and CSV file has a smaller size in storage. Note that if you do not specify file location when saving it, R will save it to the working folder.
4.1.3 File location
As you can see in the examples, you need to specify the location (or path) of the file to make sure that R could find your file in the right position. Usually you could find the location by checking the system property of the file. In window system, you need to use forward slash /
to separate the locations. However, in Mac system, you use back slash \
.
Inputting file location could be tedious, but you could avoid this. Here is how. First, put your R file and dataset in the same folder. Then start the R file by double clicking it. R will use the folder where the R file locates as the working folder. In this way, you can only specify the name of the dataset when importing it.
This is highly recommended. It would be easier for others to check your codes as they do not need to change the path of the file.
4.2 View data
You could view the variable names and simple description in the Environment
pane on the right-top position of RStudio. If you want to view more information, click the variable name and view the variable in a new window. Here, we use the Minneapolis one-year ACS data as a example. This dataset contains the demographic information of the respondents in the city of Minneapolis in the ACS survey from 2010 to 2019.
The table below lists the variable names and their descriptions.
Variable | Name | Description |
YEAR | Year | The year when the data was collected. |
SEX | Gender | Gender of the respondent. 1 = Male 2 = Female |
AGE | Age | Age of the respondent (0 indicates age is smaller than 1). |
RACE | Race | Race of the respondent. 1 = White 2 = Black/African American/Negro 3 = Other |
HISPAN | Hispanic origin | Hispanic origin of the respondent. 0 = Not Hispanic 1 = Mexican 2 = Puerto Rican 3 = Cuban 4 = Other |
EDUC | Educational attainment | Respondents’ educational attainment, as measured by the highest year of school or degree completed. 0 = N/A or no schooling 1 = Nursery school to grade 4 2 = Grade 5, 6, 7, or 8 3 = Grade 9 4 = Grade 10 5 = Grade 11 6 = Grade 12 7 = 1 year of college 8 = 2 years of college 9 = 3 years of college 10 = 4 years of college 11 = 5+ years of college |
EMPSTAT | Employment status | Whether the respondent was a part of the labor force – working or seekingwork – and, if so, whether the person was currently unemployed. 0 = N/A 1 = Employed 2 = Unemployed 3 = Not in labor force |
INCTOT | Total personal income | Each respondent’s total pre-tax personal income or losses from all sources for the previous year |
FTOTINC | Total family income | The total pre-tax money income earned by one’s family from all sources for the previous year. |
<- read.csv('minneapolis.csv') minneapolis
You could also view the data in the new window by View()
function. Please pay attention that it is initial-capitalized.
View(minneapolis)
View the first ten observations (rows) in the dataset.
head(minneapolis, n = 10)
## YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 1 2010 2 59 1 0 10 1 33100 49100
## 2 2010 2 29 1 0 10 1 16000 49100
## 3 2010 1 54 2 0 7 3 1100 NA
## 4 2010 2 47 2 0 5 2 4800 4800
## 5 2010 1 59 2 0 10 2 0 4800
## 6 2010 2 10 2 0 1 0 NA 4800
## 7 2010 1 24 1 0 5 1 15000 15000
## 8 2010 1 26 2 3 6 1 18000 18000
## 9 2010 1 51 1 0 6 1 35200 74400
## 10 2010 2 47 1 0 6 1 39200 74400
view the last five observations in the dataset.
tail(minneapolis, n = 5)
## YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 25531 2019 2 66 1 0 7 1 77000 77000
## 25532 2019 2 29 3 1 6 3 0 38000
## 25533 2019 1 28 3 1 0 1 38000 38000
## 25534 2019 1 6 3 1 1 0 NA 38000
## 25535 2019 1 1 3 1 0 0 NA 38000
List the names of variables in the dataset (Quick-R 2019b).
names(minneapolis)
## [1] "YEAR" "SEX" "AGE" "RACE" "HISPAN" "EDUC" "EMPSTAT"
## [8] "INCTOT" "FTOTINC"
colnames(minneapolis)
## [1] "YEAR" "SEX" "AGE" "RACE" "HISPAN" "EDUC" "EMPSTAT"
## [8] "INCTOT" "FTOTINC"
List the structure of the dataset.
str(minneapolis)
## 'data.frame': 25535 obs. of 9 variables:
## $ YEAR : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ SEX : int 2 2 1 2 1 2 1 1 1 2 ...
## $ AGE : int 59 29 54 47 59 10 24 26 51 47 ...
## $ RACE : int 1 1 2 2 2 2 1 2 1 1 ...
## $ HISPAN : int 0 0 0 0 0 0 0 3 0 0 ...
## $ EDUC : int 10 10 7 5 10 1 5 6 6 6 ...
## $ EMPSTAT: int 1 1 3 2 2 0 1 1 1 1 ...
## $ INCTOT : num 33100 16000 1100 4800 0 NA 15000 18000 35200 39200 ...
## $ FTOTINC: num 49100 49100 NA 4800 4800 4800 15000 18000 74400 74400 ...
List the dimensions of the dataset
dim(minneapolis)
## [1] 25535 9
List the number of rows in the dataset
nrow(minneapolis)
## [1] 25535
List the number of columns in the dataset.
ncol(minneapolis)
## [1] 9
4.3 Data selection
Select one column with 5 rows of observations.
head(minneapolis$YEAR, n = 5) # by name
## [1] 2010 2010 2010 2010 2010
head(minneapolis[1], n = 5) # by index
## YEAR
## 1 2010
## 2 2010
## 3 2010
## 4 2010
## 5 2010
Select several columns with 5 rows of observations.
head(minneapolis[c('YEAR', 'SEX')], n = 5) # by name
## YEAR SEX
## 1 2010 2
## 2 2010 2
## 3 2010 1
## 4 2010 2
## 5 2010 1
head(minneapolis[c(1, 3, 5)], n = 5) # by index
## YEAR AGE HISPAN
## 1 2010 59 0
## 2 2010 29 0
## 3 2010 54 0
## 4 2010 47 0
## 5 2010 59 0
Select one row by index
2,] # by index number minneapolis[
## YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 2 2010 2 29 1 0 10 1 16000 49100
Select several rows
2:3, ] # by index minneapolis[
## YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 2 2010 2 29 1 0 10 1 16000 49100
## 3 2010 1 54 2 0 7 3 1100 NA
c(1,5,9), ] # by index minneapolis[
## YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 1 2010 2 59 1 0 10 1 33100 49100
## 5 2010 1 59 2 0 10 2 0 4800
## 9 2010 1 51 1 0 6 1 35200 74400
4.4 Conditional selection
<- minneapolis[minneapolis$YEAR == 2010, ]
newdata head(newdata)
## YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 1 2010 2 59 1 0 10 1 33100 49100
## 2 2010 2 29 1 0 10 1 16000 49100
## 3 2010 1 54 2 0 7 3 1100 NA
## 4 2010 2 47 2 0 5 2 4800 4800
## 5 2010 1 59 2 0 10 2 0 4800
## 6 2010 2 10 2 0 1 0 NA 4800
<- minneapolis[(minneapolis$YEAR == 2010) & (minneapolis$EDUC < 5), ]
newdata head(newdata)
## YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 6 2010 2 10 2 0 1 0 NA 4800
## 11 2010 2 8 1 0 1 0 NA 74400
## 41 2010 1 16 2 0 4 3 0 7900
## 42 2010 2 85 1 0 2 3 12000 NA
## 47 2010 1 10 3 0 1 0 NA NA
## 55 2010 1 5 1 0 1 0 NA 132000
<- minneapolis$INCTOT[minneapolis$YEAR == 2010]
newdata head(newdata)
## [1] 33100 16000 1100 4800 0 NA
4.5 Deal with missing values
In R, the missing values is presented as NA
.
Test the existence of missing values with is.na()
function. We use an revised old example here.
<- c(1:4) # create variable ID
ID <- c('A', NA, 'C', 'D') # create variable Name
Name <- c(69.5, 77.5, NA, 90) # create variable Score
Score <- data.frame(ID, Name, Score) # combine the varibles into one data frame called df
df is.na(df)
## ID Name Score
## [1,] FALSE FALSE FALSE
## [2,] FALSE TRUE FALSE
## [3,] FALSE FALSE TRUE
## [4,] FALSE FALSE FALSE
Assign missing values
$Score[df$Score == 90] <- NA
df df
## ID Name Score
## 1 1 A 69.5
## 2 2 <NA> 77.5
## 3 3 C NA
## 4 4 D NA
NAs will influence some functions.
mean(df$Score) # get the mean value (does not ignore NA)
## [1] NA
mean(df$Score, na.rm=TRUE) # (ignore NA)
## [1] 73.5
Test if the observations in the dataset has NAs.
complete.cases(df)
## [1] TRUE FALSE FALSE FALSE
Find the observations with no NAs.
na.omit(df)
## ID Name Score
## 1 1 A 69.5
4.6 Subset
subset()
is another way to select the data you want.
Select observations in the Minneapolis population dataset when YEAR
is equal to 2010.
<- subset(minneapolis, YEAR == 2010)
newdata head(newdata)
## YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 1 2010 2 59 1 0 10 1 33100 49100
## 2 2010 2 29 1 0 10 1 16000 49100
## 3 2010 1 54 2 0 7 3 1100 NA
## 4 2010 2 47 2 0 5 2 4800 4800
## 5 2010 1 59 2 0 10 2 0 4800
## 6 2010 2 10 2 0 1 0 NA 4800
Select two columns from the dataset.
<- subset(minneapolis, select = c('YEAR', 'AGE'))
newdata head(newdata, n = 5)
## YEAR AGE
## 1 2010 59
## 2 2010 29
## 3 2010 54
## 4 2010 47
## 5 2010 59
4.7 Merge two datasets
merge()
function does the same work as vlookup()
in excel and ‘Join’ function in ArcGIS. It links two datasets based on their common variable (the variable they both have).
<- c(1:4) # create variable ID
ID <- c('A', 'B', 'C', 'D') # create variable Name
Name <- c(69.5, 77.5, 99, 90) # create variable Score1
Score1 <- data.frame(ID, Name, Score1) # combine the variables into one data frame called df1
df1 df1
## ID Name Score1
## 1 1 A 69.5
## 2 2 B 77.5
## 3 3 C 99.0
## 4 4 D 90.0
<- c('A', 'D', 'C') # create variable Name
Name <- c(98, 46, 55) # create variable Score2
Score2 <- data.frame(Name, Score2) # combine the variables into one data frame called df2
df2 df2
## Name Score2
## 1 A 98
## 2 D 46
## 3 C 55
merge(df1, df2, # data frames need to be merged
by = 'Name', # name of the column/variable used for merging
all.x = TRUE) # keep all observations in the first data frame after merging
## Name ID Score1 Score2
## 1 A 1 69.5 98
## 2 B 2 77.5 NA
## 3 C 3 99.0 55
## 4 D 4 90.0 46
merge(df1, df2,
by = 'Name',
all.y = TRUE) # keep all observations in the first dataframe after merging
## Name ID Score1 Score2
## 1 A 1 69.5 98
## 2 C 3 99.0 55
## 3 D 4 90.0 46
You could keep all the observations in both two data frames by set all = TRUE
in the function. We will cover more about this in our lecture about dplyr
.
4.8 Column operation
Column operation or vector operation is a very important idea in R. It applies the operations in two columns or the function in one column directly rather than applies them on each element one by one.
<- c(1:4) # create variable ID
ID <- c('A', 'B', 'C', 'D') # create variable Name
Name <- c(69.5, 77.5, 99, 90) # create variable Score1
Score1 <- c(98, 46, 55, 70) # create variable Score2
Score2 <- data.frame(ID, Name, Score1, Score2) # combine the varibles into one data frame called df
df df
## ID Name Score1 Score2
## 1 1 A 69.5 98
## 2 2 B 77.5 46
## 3 3 C 99.0 55
## 4 4 D 90.0 70
Create a new column in the data frame to calculate the sum of score 1 and score 2
$totalsocre <- df$Score1 + df$Score2
df df
## ID Name Score1 Score2 totalsocre
## 1 1 A 69.5 98 167.5
## 2 2 B 77.5 46 123.5
## 3 3 C 99.0 55 154.0
## 4 4 D 90.0 70 160.0
Create a new column in the data frame to calculate the mean of score 1 and score 2
$meansocre <- (df$Score1 + df$Score2)/2
df df
## ID Name Score1 Score2 totalsocre meansocre
## 1 1 A 69.5 98 167.5 83.75
## 2 2 B 77.5 46 123.5 61.75
## 3 3 C 99.0 55 154.0 77.00
## 4 4 D 90.0 70 160.0 80.00