Chapter 3 Introduction to R

3.1 Why R and why not R?

Before learning R, we need to know why we use R and why we do not use R.

Advantages of R (More flexible but less formal)

  • Free and Open source
  • More advanced technique packages
  • Deal with more than one datasets (big data) at the same time
  • Deal with not only data analysis tasks (data visualization, text analysis, creating website, etc.)

Advantages of STATA (More formal but less flexible)

  • More algorithms, packages, and implementations of econometrics
  • Faster
  • It is supported by Statacorp so the result is reliable
  • It presents results in a clear format
  • Syntax is simple and standard for most data analysis
  • Help document is formal

Besides those advantages, they have a lot of overlaps with each other. People cannot say one is absolutely better than the other. People choose them based on their task requirements. Sometimes, people use both of them for their daily work (e.g., my laptop has both R and STATA).

3.2 Variable name

A variable is used to store data including value, vector, data frame, etc., which R could use to manipulate (tutorialspoint 2019b). This chapter introduces variable types, operations between variables, data structures, conditional statements, loops, and functions.

Before we start, let’s first see how to name a variable. The valid variable name could be constructed with letters, numbers, the dot character (.), and underline character (_). Besides that, a valid variable name should start with a letter or the dot character not followed by a number. Below are some examples of variable names (either good or not good).

Examples Validity Discussion
var.name
var_name
_var_name Cannot start with the underline
.var_name
var%name Cannot contain %
.2var_name Cannot use the dot followed by a number to start a variable name
2var_name Cannot start with a number

3.3 Variable types

There are several types of variables which R could recognize, including character, numeric, integer, logical, and complex (Blischak et al. 2019). The type of one variable is decided by the type of value it stores. We can use class() function to check the type of each variable.

Character (also known as strings)

v <- "Hello, world!"
class(v)
## [1] "character"

Numeric (real or decimal number/integer)

v <- 59.28
class(v)
## [1] "numeric"

Integer (L tells R that this number is an integer)

v <- 2L
class(v)
## [1] "integer"
v <-2
class(v)
## [1] "numeric"

Logical (Usually True or false)

v <- TRUE
class(v)
## [1] "logical"
v <- FALSE
class(v)
## [1] "logical"

Complex (complex number is another type of number, different with real number)

v <- 1 + 4i
class(v)
## [1] "complex"

It is important to clearly know the type of the variable since different types of variables may have different functions or operations to deal with. Another caveat is that the outlook of the variable may not show its real variable type. For example, a common situation is listed below.

v <- "59.28"
class(v)
## [1] "character"

Here, the number has quotation marks outside, which means it has been transferred to type character. Therefore, please be careful about variable types!

3.4 Operations

An operation tells R the mathematical or logical manipulations among variables (tutorialspoint 2019a).

3.4.1 Assignment operations

Assignment operators assign values to variables.

Left assignment

a <- 1
b <<- "Hello, world!"
c = c(1, 3, 4)

Right assignment

1 -> a
2 ->> b

3.4.2 Arithmetic operations

Add

1 + 1
## [1] 2

Subtract

5 - 3
## [1] 2

Multiple

3 * 5
## [1] 15

Divide

5 - 3
## [1] 2

Power

5 ^ 2
## [1] 25
5 ** 2 # you can also do power operation like this
## [1] 25

Mode (find the remainder)

5 %% 2
## [1] 1

3.4.3 Relational operations

The relational operators compare the two elements and return a logical value (TRUE or FALSE).

Larger

3 > 4
## [1] FALSE
5 > 3
## [1] TRUE

Smaller

3 < 5
## [1] TRUE
4 < 2
## [1] FALSE

Equal

4 == 4
## [1] TRUE
5 == 4
## [1] FALSE

Note that double equal sign == is relational operation and single equal sign = is assignment operation.

No less than (larger or equal to)

3 >= 4
## [1] FALSE
2 >= 2
## [1] TRUE

No larger than (smaller or equal to)

5 <= 2
## [1] FALSE
5 <= 5
## [1] TRUE

Not equal

3 != 4
## [1] TRUE
3 != 3
## [1] FALSE

3.4.4 Logical operations

Logical operators are operations only for logical, numeric, or complex variable types. Most of the time, we apply them on logical values or variables. For numeric variables, 0 is considered FALSE and non-zero numbers are taken as TRUE (DataMentor 2019). You could use T for TRUE or F for FALSE as abbreviation.

Logical And

TRUE & TRUE
## [1] TRUE
FALSE & TRUE
## [1] FALSE
FALSE & FALSE
## [1] FALSE

Logical Or

TRUE | TRUE
## [1] TRUE
FALSE | TRUE
## [1] TRUE
FALSE | FALSE
## [1] FALSE

Logical Not

! TRUE
## [1] FALSE
! FALSE
## [1] TRUE

3.5 Data structures

Variables and values could construct different data structures including vector, matrix, data frame, list, and factor (Kabacoff 2019).

Vector

You could create a vector with c() function.

a <- c(5, 9, 2, 8) # create a numeric vector
a # show the value of this vector
## [1] 5 9 2 8
b <- c('hello', 'world', '!') # character vector
b
## [1] "hello" "world" "!"
c <- c(5, 'good') # if you create a vector containing mixed variable types, such as numeric and character, R will restrict them to be the same variable type, here, character
c
## [1] "5"    "good"

You could select elements in the vector by using var_name[#]. Please pay attention on how R indexes its elements in the data structure.

a[3] # select the 3rd element
## [1] 2
b[1:3] # select from the 1st to the 3rd element
## [1] "hello" "world" "!"
c[2] # select the 2nd element
## [1] "good"

1:3 means from 1 to 3, so it actually stands for three numbers here, which are 1, 2, 3.

Matrix

You could create a matrix using matrix() function.

a <- matrix(1:6,      # the data to be put in the matrix, here we use numbers from 1 to 6
            nrow = 2, # number of rows in the matrix
            ncol = 3, # number of columns in the matrix
            byrow = FALSE) # how to arrange the data in the matrix, FALSE means by columns, TURE means by rows.
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

For variable selection, the intuitive way is using coordinates.

a[2,3] # select the elements in the 2nd row and 3rd column
## [1] 6

You could also select the entire row or column.

a[ ,2] # the 2nd column
## [1] 3 4
a[1, ] # the 1st row
## [1] 1 3 5

Data frame

Data frame is a frequently-used data type in R. It could include columns with different types of values stored in them. Let’s create a data frame with mixed variables types using data.frame() function.

ID <- c(1:4) # create ID
Name <- c('A', 'B', 'C', 'D') # create Name
Score <- c(69.5, 77.5, 81.5, 90) # create Score
df <- data.frame(ID, Name, Score) # combine the variables into one data frame called df
df
##   ID Name Score
## 1  1    A  69.5
## 2  2    B  77.5
## 3  3    C  81.5
## 4  4    D  90.0

We created a data frame storing the students’ ID, name, and their test scores. We can select elements from this data frame with couple of ways.

df[2,3] # 2nd row and 3rd column
## [1] 77.5
df['ID'] # column of variable ID
##   ID
## 1  1
## 2  2
## 3  3
## 4  4
df[c('ID', 'Score')] # column of ID and Score
##   ID Score
## 1  1  69.5
## 2  2  77.5
## 3  3  81.5
## 4  4  90.0

There is another way to select the column by its name, which is more frequently used. When you type $ after the name of the data frame, RStudio will list all the variables in that data frame.

df$Name # column of variable Name
## [1] A B C D
## Levels: A B C D

List

A list could store mixed types of values, which is different from vector.

a <- list(ID = c(1, 2), Name = c('A', 'B'), Score = c(69.5, 89))

When you want to select elements from a list, you could do it in a similar way as a vector. However, list does not define row or column, so you cannot use 2-D coordinates to select elements like a data frame.

a[1]
## $ID
## [1] 1 2
a[2:3]
## $Name
## [1] "A" "B"
## 
## $Score
## [1] 69.5 89.0

Someone might be confusing since list looks similar to data frame. Here is a good discussion about it. Due to the time limitation, we will not cover this discussion in class. The main idea is that list is more flexible than data frame, while data frame has more restrictions. However, since data frame is more similar to 2-D table structure which is more frequently used in our daily work, we use data frame more than list.

Factor

Factor is the nominal variable in R. This type will be very useful when we want to analyze data from different groups, such as gender, school, etc.

a <- c(1, 2, 1, 2, 3, 3, 1, 1)
class(a)
## [1] "numeric"
afactor <- factor(a)
class(afactor)
## [1] "factor"

Use levels() to check the categories in variable afactor.

levels(afactor)
## [1] "1" "2" "3"

3.6 Conditional statement (if)

if (test_expression){
  statement_1
} else {
  statement_2
}

If the test_expression returns TRUE, then the codes will go to statement_1, if it returns FALSE, the codes will go to statement_2. You could also omit the else part.

if (test_expression){
  statement_1
}

If the test_expression returns FALSE, the codes will continue to next line.

x <- 5
if (x > 3){
  print('x is larger than 3')
} else {
  print('x is not larger than 3')
}
## [1] "x is larger than 3"
x <- 1
if (x > 3){
  print('x is larger than 3')
} 

Some other conditional statements include switch() and which().

3.7 Loops

Loops help us repeat the codes. for loop is a commonly-used one.

for (range){
  statement
}

range will provide the range for a variable. The form could be i in 1:3, which shows that i will be 1, 2, and 3 in each loop.

for (i in 1:3){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3

You can nest conditional statement and loop together like the codes below (print the numbers (from 5 to 10) that are smaller than 7). Use the whole loop part to replace the statement in conditional statement.

for (i in 5:10) {
  if(i < 7) {
    print(i)
  }
}
## [1] 5
## [1] 6

3.8 Functions

Functions are codes that have been defined with specific usage. You only need to input some necessary variables and functions will do the tasks. To use function, you start with the name of the function followed with a pair of parentheses. Then, you input some arguments in the parentheses to give instructions to the function. For example, sum() function could help you add the all the numbers together in a vector or data frame and return the result.

sum(c(1, 4, 10, 5))
## [1] 20

Another example is mean() function, which could help you average the numbers in a vector or data frame and return the result.

mean(c(1, 4, 10, 5))
## [1] 5

In functions, some arguments must be input. For example, you need to input the dataset in mean() function. However, some arguments are not necessary to be input because they have default values. If you do not specify these arguments, then, the function will use their default values. For example, after checking the help page of mean(), you will find that there is an other argument called na.rm which decides whether the missing values should be removed. Let’s see the example below.

data <- c(1, 4, 5, NA)
mean(data)
## [1] NA

To avoid this, we need to add an argument to reset the value of na.rm in the mean() function.

mean(data, na.rm = TRUE)
## [1] 3.333333

na.rm tells the function whether missing values should be removed during the calculation. Its default value is FALSE, which means that the missing values should not be removed. Calculating the average of a list of numbers containing missing value will return a missing value. That’s why we get NA from our first try. In our second try, we set the value of na.rm to TRUE. The function removes the missing values and we have the correct result in our second try.

It is important to use the right function to do the right task. To do this, you have to be familiar with the functions you are using. It needs more practice.