Chapter 4 Data Manipulation with Base Functions

We will introduce how to manipulate with different datasets using base functions in R.

4.1 Import and save datasets

There couple of ways to importing and saving different types of datasets (Quick-R 2019c, 2019a).

4.1.1 Import data from commonly-used file types

CSV file

mydata <- read.csv('c:/minneapolis.csv', ## file location and name
                   header = TRUE, ## read the first 
                   sep = ',') ## which type of separation

EXCEL file

library(readxl)
dataset <- read_excel('c:/mydata.xlsx', # file location and name
                      sheet = 'data') # name or index of the sheet

dta STATA file

library(foreign)
mydata <- read.dta('c:/mydata.dta') # file location and name

Below is an easy way to load datasets in RStudio.

System tool

Besides importing data by codes, you could also import data with the system tool. If this is your first time to use this tool, there may be a process to install the packages depending your options. But don’t worry, RStudio can do it by itself. You just need to click the button to approve the installation.

File -> Import dataset -> choose the type of dataset you want to import

There are some other options you could set in the import functions listed above (e.g., specify a variable type or try to skip some of the rows).

4.1.2 Save file

CSV file

write.csv(df, # data
          'c:/filename.csv') # file location and name

EXCEL file

library(xlsx)
write.xlsx(mydata, # data
           "c:/mydata.xlsx") # file location and name

dta STATA file

library(foreign)
write.dta(mydata, "c:/mydata.dta")

Usually, it takes less time to save file in CSV and CSV file has a smaller size in storage. Note that if you do not specify file location when saving it, R will save it to the working folder.

4.1.3 File location

As you can see in the examples, you need to specify the location (or path) of the file to make sure that R could find your file in the right position. Usually you could find the location by checking the system property of the file. In window system, you need to use forward slash / to separate the locations. However, in Mac system, you use back slash \.

Inputting file location could be tedious, but you could avoid this. Here is how. First, put your R file and dataset in the same folder. Then start the R file by double clicking it. R will use the folder where the R file locates as the working folder. In this way, you can only specify the name of the dataset when importing it.

This is highly recommended. It would be easier for others to check your codes as they do not need to change the path of the file.

4.2 View data

You could view the variable names and simple description in the Environment pane on the right-top position of RStudio. If you want to view more information, click the variable name and view the variable in a new window. Here, we use the Minneapolis one-year ACS data as a example. This dataset contains the demographic information of the respondents in the city of Minneapolis in the ACS survey from 2010 to 2019.

The table below lists the variable names and their descriptions.

Variable	Name	Description
YEAR	Year	The year when the data was collected.
SEX	Gender	Gender of the respondent. 1 = Male 2 = Female
AGE	Age	Age of the respondent (0 indicates age is smaller than 1).
RACE	Race	Race of the respondent. 1 = White 2 = Black/African American/Negro 3 = Other
HISPAN	Hispanic origin	Hispanic origin of the respondent. 0 = Not Hispanic 1 = Mexican 2 = Puerto Rican 3 = Cuban 4 = Other
EDUC	Educational attainment	Respondents’ educational attainment, as measured by the highest year of school or degree completed. 0 = N/A or no schooling 1 = Nursery school to grade 4 2 = Grade 5, 6, 7, or 8 3 = Grade 9 4 = Grade 10 5 = Grade 11 6 = Grade 12 7 = 1 year of college 8 = 2 years of college 9 = 3 years of college 10 = 4 years of college 11 = 5+ years of college
EMPSTAT	Employment status	Whether the respondent was a part of the labor force – working or seekingwork – and, if so, whether the person was currently unemployed. 0 = N/A 1 = Employed 2 = Unemployed 3 = Not in labor force
INCTOT	Total personal income	Each respondent’s total pre-tax personal income or losses from all sources for the previous year
FTOTINC	Total family income	The total pre-tax money income earned by one’s family from all sources for the previous year.

minneapolis <- read.csv('minneapolis.csv')

You could also view the data in the new window by View() function. Please pay attention that it is initial-capitalized.

View(minneapolis)

View the first ten observations (rows) in the dataset. stands for double class, which is a sub-type of numerical variable type.

head(minneapolis, n = 10)

##    YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 1  2010   2  59    1      0   10       1  33100   49100
## 2  2010   2  29    1      0   10       1  16000   49100
## 3  2010   1  54    2      0    7       3   1100      NA
## 4  2010   2  47    2      0    5       2   4800    4800
## 5  2010   1  59    2      0   10       2      0    4800
## 6  2010   2  10    2      0    1       0     NA    4800
## 7  2010   1  24    1      0    5       1  15000   15000
## 8  2010   1  26    2      3    6       1  18000   18000
## 9  2010   1  51    1      0    6       1  35200   74400
## 10 2010   2  47    1      0    6       1  39200   74400

view the last five observations in the dataset.

tail(minneapolis, n = 5)

##       YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 25531 2019   2  66    1      0    7       1  77000   77000
## 25532 2019   2  29    3      1    6       3      0   38000
## 25533 2019   1  28    3      1    0       1  38000   38000
## 25534 2019   1   6    3      1    1       0     NA   38000
## 25535 2019   1   1    3      1    0       0     NA   38000

List the names of variables in the dataset (Quick-R 2019b).

names(minneapolis)

## [1] "YEAR"    "SEX"     "AGE"     "RACE"    "HISPAN"  "EDUC"    "EMPSTAT"
## [8] "INCTOT"  "FTOTINC"

colnames(minneapolis)

## [1] "YEAR"    "SEX"     "AGE"     "RACE"    "HISPAN"  "EDUC"    "EMPSTAT"
## [8] "INCTOT"  "FTOTINC"

List the structure of the dataset.

str(minneapolis)

## 'data.frame':    25535 obs. of  9 variables:
##  $ YEAR   : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ SEX    : int  2 2 1 2 1 2 1 1 1 2 ...
##  $ AGE    : int  59 29 54 47 59 10 24 26 51 47 ...
##  $ RACE   : int  1 1 2 2 2 2 1 2 1 1 ...
##  $ HISPAN : int  0 0 0 0 0 0 0 3 0 0 ...
##  $ EDUC   : int  10 10 7 5 10 1 5 6 6 6 ...
##  $ EMPSTAT: int  1 1 3 2 2 0 1 1 1 1 ...
##  $ INCTOT : num  33100 16000 1100 4800 0 NA 15000 18000 35200 39200 ...
##  $ FTOTINC: num  49100 49100 NA 4800 4800 4800 15000 18000 74400 74400 ...

List the dimensions of the dataset

dim(minneapolis)

## [1] 25535     9

List the number of rows in the dataset

nrow(minneapolis)

## [1] 25535

List the number of columns in the dataset.

ncol(minneapolis)

## [1] 9

4.3 Data selection

Select one column with 5 rows of observations.

head(minneapolis$YEAR, n = 5) # by name

## [1] 2010 2010 2010 2010 2010

head(minneapolis[1], n = 5) # by index

##   YEAR
## 1 2010
## 2 2010
## 3 2010
## 4 2010
## 5 2010

Select several columns with 5 rows of observations.

head(minneapolis[c('YEAR', 'SEX')], n = 5) # by name

##   YEAR SEX
## 1 2010   2
## 2 2010   2
## 3 2010   1
## 4 2010   2
## 5 2010   1

head(minneapolis[c(1, 3, 5)], n = 5) # by index

##   YEAR AGE HISPAN
## 1 2010  59      0
## 2 2010  29      0
## 3 2010  54      0
## 4 2010  47      0
## 5 2010  59      0

Select one row by index

minneapolis[2,] # by index number

##   YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 2 2010   2  29    1      0   10       1  16000   49100

Select several rows

minneapolis[2:3, ] # by index

##   YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 2 2010   2  29    1      0   10       1  16000   49100
## 3 2010   1  54    2      0    7       3   1100      NA

minneapolis[c(1,5,9), ] # by index

##   YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 1 2010   2  59    1      0   10       1  33100   49100
## 5 2010   1  59    2      0   10       2      0    4800
## 9 2010   1  51    1      0    6       1  35200   74400

4.4 Conditional selection

newdata <- minneapolis[minneapolis$YEAR == 2010, ]
head(newdata)

##   YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 1 2010   2  59    1      0   10       1  33100   49100
## 2 2010   2  29    1      0   10       1  16000   49100
## 3 2010   1  54    2      0    7       3   1100      NA
## 4 2010   2  47    2      0    5       2   4800    4800
## 5 2010   1  59    2      0   10       2      0    4800
## 6 2010   2  10    2      0    1       0     NA    4800

newdata <- minneapolis[(minneapolis$YEAR == 2010) & (minneapolis$EDUC < 5), ]
head(newdata)

##    YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 6  2010   2  10    2      0    1       0     NA    4800
## 11 2010   2   8    1      0    1       0     NA   74400
## 41 2010   1  16    2      0    4       3      0    7900
## 42 2010   2  85    1      0    2       3  12000      NA
## 47 2010   1  10    3      0    1       0     NA      NA
## 55 2010   1   5    1      0    1       0     NA  132000

newdata <- minneapolis$INCTOT[minneapolis$YEAR == 2010]
head(newdata)

## [1] 33100 16000  1100  4800     0    NA

4.5 Deal with missing values

In R, the missing values is presented as NA.

Test the existence of missing values with is.na() function. We use an revised old example here.

ID <- c(1:4) # create variable ID
Name <- c('A', NA, 'C', 'D') # create variable Name
Score <- c(69.5, 77.5, NA, 90) # create variable Score
df <- data.frame(ID, Name, Score) # combine the varibles into one data frame called df
is.na(df)

##         ID  Name Score
## [1,] FALSE FALSE FALSE
## [2,] FALSE  TRUE FALSE
## [3,] FALSE FALSE  TRUE
## [4,] FALSE FALSE FALSE

Assign missing values

df$Score[df$Score == 90] <- NA
df

##   ID Name Score
## 1  1    A  69.5
## 2  2 <NA>  77.5
## 3  3    C    NA
## 4  4    D    NA

NAs will influence some functions.

mean(df$Score) # get the mean value (does not ignore NA)

## [1] NA

mean(df$Score, na.rm=TRUE) # (ignore NA)

## [1] 73.5

Test if the observations in the dataset has NAs.

complete.cases(df)

## [1]  TRUE FALSE FALSE FALSE

Find the observations with no NAs.

na.omit(df)

##   ID Name Score
## 1  1    A  69.5

4.6 Subset

subset() is another way to select the data you want.

Select observations in the Minneapolis population dataset when YEAR is equal to 2010.

newdata <- subset(minneapolis, YEAR == 2010)
head(newdata)

##   YEAR SEX AGE RACE HISPAN EDUC EMPSTAT INCTOT FTOTINC
## 1 2010   2  59    1      0   10       1  33100   49100
## 2 2010   2  29    1      0   10       1  16000   49100
## 3 2010   1  54    2      0    7       3   1100      NA
## 4 2010   2  47    2      0    5       2   4800    4800
## 5 2010   1  59    2      0   10       2      0    4800
## 6 2010   2  10    2      0    1       0     NA    4800

Select two columns from the dataset.

newdata <- subset(minneapolis, select = c('YEAR', 'AGE'))
head(newdata, n = 5)

##   YEAR AGE
## 1 2010  59
## 2 2010  29
## 3 2010  54
## 4 2010  47
## 5 2010  59

4.7 Merge two datasets

merge() function does the same work as vlookup() in excel and ‘Join’ function in ArcGIS. It links two datasets based on their common variable (the variable they both have).

ID <- c(1:4) # create variable ID
Name <- c('A', 'B', 'C', 'D') # create variable Name
Score1 <- c(69.5, 77.5, 99, 90) # create variable Score1
df1 <- data.frame(ID, Name, Score1) # combine the variables into one data frame called df1
df1

##   ID Name Score1
## 1  1    A   69.5
## 2  2    B   77.5
## 3  3    C   99.0
## 4  4    D   90.0

Name <- c('A', 'D', 'C') # create variable Name
Score2 <- c(98, 46, 55) # create variable Score2
df2 <- data.frame(Name, Score2) # combine the variables into one data frame called df2
df2

##   Name Score2
## 1    A     98
## 2    D     46
## 3    C     55

merge(df1, df2,       # data frames need to be merged
      by = 'Name',    # name of the column/variable used for merging
      all.x = TRUE)   # keep all observations in the first data frame after merging

##   Name ID Score1 Score2
## 1    A  1   69.5     98
## 2    B  2   77.5     NA
## 3    C  3   99.0     55
## 4    D  4   90.0     46

merge(df1, df2, 
      by = 'Name',
      all.y = TRUE) # keep all observations in the first dataframe after merging

##   Name ID Score1 Score2
## 1    A  1   69.5     98
## 2    C  3   99.0     55
## 3    D  4   90.0     46

You could keep all the observations in both two data frames by set all = TRUE in the function. We will cover more about this in our lecture about dplyr.

4.8 Column operation

Column operation or vector operation is a very important idea in R. It applies the operations in two columns or the function in one column directly rather than applies them on each element one by one.

ID <- c(1:4) # create variable ID
Name <- c('A', 'B', 'C', 'D') # create variable Name
Score1 <- c(69.5, 77.5, 99, 90) # create variable Score1
Score2 <- c(98, 46, 55, 70) # create variable Score2
df <- data.frame(ID, Name, Score1, Score2) # combine the varibles into one data frame called df
df

##   ID Name Score1 Score2
## 1  1    A   69.5     98
## 2  2    B   77.5     46
## 3  3    C   99.0     55
## 4  4    D   90.0     70

Create a new column in the data frame to calculate the sum of score 1 and score 2

df$totalsocre <- df$Score1 + df$Score2
df

##   ID Name Score1 Score2 totalsocre
## 1  1    A   69.5     98      167.5
## 2  2    B   77.5     46      123.5
## 3  3    C   99.0     55      154.0
## 4  4    D   90.0     70      160.0

Create a new column in the data frame to calculate the mean of score 1 and score 2

df$meansocre <- (df$Score1 + df$Score2)/2
df

##   ID Name Score1 Score2 totalsocre meansocre
## 1  1    A   69.5     98      167.5     83.75
## 2  2    B   77.5     46      123.5     61.75
## 3  3    C   99.0     55      154.0     77.00
## 4  4    D   90.0     70      160.0     80.00