# Chapter 3 Introduction to R

## 3.1 Why R and why not R?

Before learning R, it helps to know when R is a good choice and when it is not.

**Advantages of R** (More flexible but less formal)

- Free and open source
- More packages for advanced techniques
- Can work with more than one dataset (big data) at the same time
- Handles more than data analysis tasks (data visualization, text analysis, creating websites, etc.)

**Advantages of STATA** (More formal but less flexible)

- More algorithms, packages, and implementations of econometrics
- Faster
- Supported by StataCorp, so the results are reliable
- Presents results in a clear format
- Syntax is simple and standard for most data analysis
- Help documentation is formal

Besides these advantages, the two tools overlap a lot. Neither is absolutely better than the other; people choose based on the requirements of the task. Some people use both in their daily work (e.g., my laptop has both R and STATA).

## 3.2 Variable name

A variable stores data, such as a single value, a vector, or a data frame, that R can then manipulate (tutorialspoint 2019b). This chapter introduces variable types, operations between variables, data structures, conditional statements, loops, and functions.

Before we start, let’s first see how to name a variable. A valid variable name consists of letters, numbers, the dot character (`.`), and the underscore character (`_`). In addition, a valid name must start with a letter, or with a dot that is not followed by a number. Below are some examples of valid and invalid variable names.

Examples | Validity | Discussion |
---|---|---|
`var.name` | ✓ | |
`var_name` | ✓ | |
`_var_name` | ☓ | Cannot start with the underscore |
`.var_name` | ✓ | |
`var%name` | ☓ | Cannot contain % |
`.2var_name` | ☓ | Cannot use the dot followed by a number to start a variable name |
`2var_name` | ☓ | Cannot start with a number |
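The naming rules can also be checked from R itself. The sketch below is one possible way to do it, not an official validity test: it relies on base R's `make.names()`, which rewrites a string into a syntactically valid name, so a string that comes back unchanged was already valid. The names `candidates` and `is_valid` are only illustrative.

```r
# Candidate names taken from the examples in the table
candidates <- c("var.name", "var_name", "_var_name", ".var_name",
                "var%name", ".2var_name", "2var_name")

# make.names() repairs invalid names (e.g., by prepending "X"),
# so an unchanged string was a valid name to begin with
is_valid <- make.names(candidates) == candidates

# Show each candidate next to the result of the check
data.frame(name = candidates, valid = is_valid)
```

The `valid` column should agree with the Validity column of the table.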

## 3.3 Variable types

There are several types of variables which R could recognize, including character, numeric, integer, logical, and complex (Blischak et al. 2019). The type of a variable is decided by the type of value it stores. We can use the `class()` function to check the type of each variable.

**Character** (also known as strings)

```
v <- "Hello, world!"
class(v)
```

`## [1] "character"`

**Numeric** (real or decimal number/integer)

```
v <- 59.28
class(v)
```

`## [1] "numeric"`

**Integer** (`L` tells R that this number is an integer)

```
v <- 2L
class(v)
```

`## [1] "integer"`

```
v <- 2
class(v)
```

`## [1] "numeric"`

**Logical** (`TRUE` or `FALSE`)

```
v <- TRUE
class(v)
```

`## [1] "logical"`

```
v <- FALSE
class(v)
```

`## [1] "logical"`

**Complex** (complex numbers are another type of number, different from real numbers)

```
v <- 1 + 4i
class(v)
```

`## [1] "complex"`

It is important to know the type of a variable, since different types call for different functions and operations. Another caveat is that the appearance of a value may not reveal its actual type. For example, a common situation is shown below.

```
v <- "59.28"
class(v)
```

`## [1] "character"`

Here, the number is wrapped in quotation marks, so R stores it as a character string rather than a number. **Therefore, please be careful about variable types!**
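When a number has ended up stored as a character string, it can usually be converted back before doing arithmetic. The lines below are a minimal sketch using base R's `as.numeric()`; the variable names `n` and `bad` are only illustrative.

```r
v <- "59.28"
class(v)              # "character"

# Convert the character string to a number before doing arithmetic
n <- as.numeric(v)
class(n)              # "numeric"
n + 1                 # 60.28

# Text that does not look like a number becomes NA (R also warns about it)
bad <- suppressWarnings(as.numeric("hello"))
is.na(bad)            # TRUE
```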

## 3.4 Operations

An operator tells R which mathematical or logical manipulation to perform on variables (tutorialspoint 2019a).

### 3.4.1 Assignment operations

Assignment operators assign values to variables.

**Left assignment**

```
a <- 1
b <<- "Hello, world!"
c = c(1, 3, 4)
```

**Right assignment**

```
1 -> a
2 ->> b
```

### 3.4.2 Arithmetic operations

**Add**

`1 + 1`

`## [1] 2`

**Subtract**

`5 - 3`

`## [1] 2`

**Multiply**

`3 * 5`

`## [1] 15`

**Divide**

`6 / 3`

`## [1] 2`

**Power**

`5 ^ 2`

`## [1] 25`

`5 ** 2 # you can also do power operation like this`

`## [1] 25`

**Modulo** (find the remainder)

`5 %% 2`

`## [1] 1`
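The remainder operator is often used together with integer division, `%/%`, which is not covered above. As a small sketch, the lines below split a number of minutes into hours and leftover minutes; the variable names are only illustrative.

```r
total_minutes <- 135

hours   <- total_minutes %/% 60   # integer division: 2
minutes <- total_minutes %% 60    # remainder: 15

# The two parts add back up to the original value
hours * 60 + minutes == total_minutes   # TRUE
```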

### 3.4.3 Relational operations

The relational operators compare two elements and return a logical value (`TRUE` or `FALSE`).

**Larger**

`3 > 4`

`## [1] FALSE`

`5 > 3`

`## [1] TRUE`

**Smaller**

`3 < 5`

`## [1] TRUE`

`4 < 2`

`## [1] FALSE`

**Equal**

`4 == 4`

`## [1] TRUE`

`5 == 4`

`## [1] FALSE`

Note that the double equal sign `==` is a relational operator, while the single equal sign `=` is an assignment operator.

**No less than** (larger or equal to)

`3 >= 4`

`## [1] FALSE`

`2 >= 2`

`## [1] TRUE`

**No larger than** (smaller or equal to)

`5 <= 2`

`## [1] FALSE`

`5 <= 5`

`## [1] TRUE`

**Not equal**

`3 != 4`

`## [1] TRUE`

`3 != 3`

`## [1] FALSE`
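Two practical points about relational operators are sketched below. First, they work element by element on vectors. Second, `==` can give surprising results for decimal numbers because of floating-point rounding, and base R's `all.equal()` is one common way to compare with a small tolerance. The variable names are only illustrative.

```r
# Comparison is applied to each element of a vector
above_two <- c(1, 5, 3) > 2        # FALSE TRUE TRUE

# Decimal arithmetic is not always exact
x <- 0.1 + 0.2
exact_match <- x == 0.3            # FALSE

# all.equal() compares with a tolerance; wrap it in isTRUE() for a clean TRUE/FALSE
close_match <- isTRUE(all.equal(x, 0.3))   # TRUE
```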

### 3.4.4 Logical operations

Logical operators apply only to logical, numeric, or complex values. Most of the time, we apply them to logical values or variables. For numeric variables, 0 is considered `FALSE` and non-zero numbers are taken as `TRUE` (DataMentor 2019). You can use `T` for `TRUE` or `F` for `FALSE` as abbreviations.

**Logical And**

`TRUE & TRUE`

`## [1] TRUE`

`FALSE & TRUE`

`## [1] FALSE`

`FALSE & FALSE`

`## [1] FALSE`

**Logical Or**

`TRUE | TRUE`

`## [1] TRUE`

`FALSE | TRUE`

`## [1] TRUE`

`FALSE | FALSE`

`## [1] FALSE`

**Logical Not**

`! TRUE`

`## [1] FALSE`

`! FALSE`

`## [1] TRUE`
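The rule that 0 counts as `FALSE` and any non-zero number counts as `TRUE` can be seen directly. The lines below are a small sketch; the variable names are only illustrative.

```r
zero_and    <- 0 & TRUE     # 0 is treated as FALSE, so the result is FALSE
nonzero_and <- 5 & TRUE     # 5 is treated as TRUE, so the result is TRUE
not_zero    <- !0           # negating 0 gives TRUE

# as.logical() shows how each number is interpreted
as_flags <- as.logical(c(0, 1, -2.5))   # FALSE TRUE TRUE
```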

## 3.5 Data structures

Variables and values can be combined into different data structures, including vectors, matrices, data frames, lists, and factors (Kabacoff 2019).

**Vector**

You could create a vector with the `c()` function.

```
a <- c(5, 9, 2, 8) # create a numeric vector
a # show the value of this vector a
```

`## [1] 5 9 2 8`

```
b <- c('hello', 'world', '!') # character vector
b
```

`## [1] "hello" "world" "!"`

```
c <- c(5, 'good') # if a vector mixes types, such as numeric and character, R coerces all elements to one type, here character
c
```

`## [1] "5" "good"`

You can select elements of a vector with `var_name[#]`, where `#` is the position of the element. Note that R starts counting positions at 1, not 0.

`a[3] # select the 3rd element`

`## [1] 2`

`b[1:3] # select from the 1st to the 3rd element`

`## [1] "hello" "world" "!"`

`c[2] # select the 2nd element`

`## [1] "good"`

`1:3` means from 1 to 3, so it actually stands for three numbers here, which are 1, 2, 3.
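Besides a single position or a range, a few other ways of indexing a vector come up often. The sketch below shows three of them on the numeric vector from this section; the result names are only illustrative.

```r
a <- c(5, 9, 2, 8)

picked    <- a[c(1, 4)]   # several positions at once: 5 8
dropped   <- a[-1]        # a negative index drops that position: 9 2 8
over_four <- a[a > 4]     # a logical condition keeps matching elements: 5 9 8
```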

**Matrix**

You could create a matrix using the `matrix()` function.

```
a <- matrix(1:6,           # the data to be put in the matrix, here the numbers from 1 to 6
            nrow = 2,      # number of rows in the matrix
            ncol = 3,      # number of columns in the matrix
            byrow = FALSE) # how to arrange the data: FALSE fills by columns, TRUE fills by rows
a
```

```
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```

To select elements, the intuitive way is to use row and column coordinates.

`a[2,3] # select the element in the 2nd row and 3rd column`

`## [1] 6`

You could also select the entire row or column.

`a[ ,2] # the 2nd column`

`## [1] 3 4`

`a[1, ] # the 1st row`

`## [1] 1 3 5`
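To see what the `byrow` argument changes, the sketch below builds the same data row by row and checks a few cells. The matrix name `m` and the result names are only illustrative.

```r
# Same data as before, but filled row by row
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m

shape     <- dim(m)    # number of rows and columns: 2 3
first_row <- m[1, ]    # 1 2 3
cell_2_1  <- m[2, 1]   # 4
```

With `byrow = FALSE` the first row was 1, 3, 5; with `byrow = TRUE` it is 1, 2, 3.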

**Data frame**

A data frame is a **frequently used** data type in R. Its columns can store different types of values. Let’s create a data frame with mixed variable types using the `data.frame()` function.

```
ID <- c(1:4) # create ID
Name <- c('A', 'B', 'C', 'D') # create Name
Score <- c(69.5, 77.5, 81.5, 90) # create Score
df <- data.frame(ID, Name, Score) # combine the variables into one data frame called df
df
```

```
##   ID Name Score
## 1  1    A  69.5
## 2  2    B  77.5
## 3  3    C  81.5
## 4  4    D  90.0
```

We created a data frame storing the students’ IDs, names, and test scores. We can select elements from this data frame in a couple of ways.

`df[2,3] # 2nd row and 3rd column`

`## [1] 77.5`

`df['ID'] # column of variable ID`

```
##   ID
## 1  1
## 2  2
## 3  3
## 4  4
```

`df[c('ID', 'Score')] # columns ID and Score`

```
##   ID Score
## 1  1  69.5
## 2  2  77.5
## 3  3  81.5
## 4  4  90.0
```

There is another, more frequently used way to select a column by its name: the `$` operator. When you type `$` after the name of the data frame, RStudio will list all the variables in that data frame.

`df$Name # column of variable Name`

```
## [1] A B C D
## Levels: A B C D
```
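Selecting by position or name can be combined with a condition to keep only some rows. The sketch below rebuilds the student data frame and filters it. The names `high`, `Passed`, and `avg`, and the cut-offs 80 and 70, are assumptions made for illustration. `stringsAsFactors = FALSE` is set explicitly so that `Name` stays a character column regardless of the R version.

```r
df <- data.frame(ID = 1:4,
                 Name = c("A", "B", "C", "D"),
                 Score = c(69.5, 77.5, 81.5, 90),
                 stringsAsFactors = FALSE)

# Keep the rows whose Score is above 80 (all columns are kept)
high <- df[df$Score > 80, ]
high$Name                      # "C" "D"

# Add a new logical column computed from an existing one
df$Passed <- df$Score >= 70

# Summarise a single column
avg <- mean(df$Score)          # 79.625
```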

**List**

A list can store values of mixed types, which makes it different from a vector.

`a <- list(ID = c(1, 2), Name = c('A', 'B'), Score = c(69.5, 89))`

When you want to select elements from a list, you can do it in a similar way as with a vector. However, a list does not define rows or columns, so you cannot use 2-D coordinates to select elements as you can with a data frame.

`a[1]`

```
## $ID
## [1] 1 2
```

`a[2:3]`

```
## $Name
## [1] "A" "B"
##
## $Score
## [1] 69.5 89.0
```

Some readers might be confused, since a list looks similar to a data frame. Here is a good discussion about it. Due to time limitations, we will not cover this discussion in class. The main idea is that a list is more flexible, while a data frame has more restrictions. However, since a data frame resembles the 2-D table structure that we use most in our daily work, we use data frames more often than lists.
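Two details may help when working with lists. Single brackets return a smaller list, while double brackets or `$` return the element itself. A data frame is also, internally, a list of equally long columns. The sketch below illustrates both points; `small_df` and the result names are only illustrative.

```r
a <- list(ID = c(1, 2), Name = c("A", "B"), Score = c(69.5, 89))

piece_class   <- class(a[1])     # "list": single brackets keep the list wrapper
element_class <- class(a[[1]])   # "numeric": double brackets extract the element
second_score  <- a$Score[2]      # 89: `$` extracts by name, then index as a vector

# A data frame is a special kind of list
small_df <- data.frame(ID = c(1, 2), Score = c(69.5, 89))
df_is_list <- is.list(small_df)  # TRUE
```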

**Factor**

A factor represents a nominal (categorical) variable in R. This type is very useful when we want to analyze data from different groups, such as gender, school, etc.

```
a <- c(1, 2, 1, 2, 3, 3, 1, 1)
class(a)
```

`## [1] "numeric"`

```
afactor <- factor(a)
class(afactor)
```

`## [1] "factor"`

Use `levels()` to check the categories in variable `afactor`.

`levels(afactor)`

`## [1] "1" "2" "3"`
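Numeric codes are easier to read once they carry labels. The sketch below attaches labels to the same codes and counts the observations per group with `table()`. The labels "low", "mid", and "high" are invented for illustration and do not come from the data.

```r
codes <- c(1, 2, 1, 2, 3, 3, 1, 1)

# Map each code to a readable label (the labels are hypothetical)
group <- factor(codes, levels = c(1, 2, 3), labels = c("low", "mid", "high"))

group_levels <- levels(group)   # "low" "mid" "high"
counts <- table(group)          # low: 4, mid: 2, high: 2
counts
```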

## 3.6 Conditional statement (if)

```
if (test_expression){
  statement_1
} else {
  statement_2
}
```

If `test_expression` returns `TRUE`, the code runs `statement_1`; if it returns `FALSE`, the code runs `statement_2`. You can also omit the `else` part.

```
if (test_expression){
  statement_1
}
```

If `test_expression` returns `FALSE`, R skips `statement_1` and continues with the next line.

```
x <- 5
if (x > 3){
  print('x is larger than 3')
} else {
  print('x is not larger than 3')
}
```

`## [1] "x is larger than 3"`

```
x <- 1
if (x > 3){
  print('x is larger than 3')
}
```

Other tools for working with conditions include `switch()` and `which()`.
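The sketch below shows three related patterns: chaining several conditions with `else if`, choosing a value by name with `switch()`, and finding the positions that satisfy a condition with `which()`. The thresholds, labels, and variable names are assumptions made for illustration.

```r
x <- 5

# Chain several conditions; only the first matching branch runs
if (x > 10) {
  size <- "large"
} else if (x > 3) {
  size <- "medium"
} else {
  size <- "small"
}
size                                   # "medium"

# switch() picks a value by name; the unnamed last value is the fallback
kind <- "b"
chosen <- switch(kind, a = "first", b = "second", "other")   # "second"

# which() returns the positions where the condition is TRUE
positions <- which(c(2, 7, 4, 9) > 3)  # 2 3 4
```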

## 3.7 Loops

Loops help us repeat code. The `for` loop is a commonly used one.

```
for (range){
  statement
}
```

`range` provides the values that the loop variable takes. It can have the form `i in 1:3`, which means that `i` will be 1, 2, and 3 in successive iterations.

```
for (i in 1:3){
print(i)
}
```

```
## [1] 1
## [1] 2
## [1] 3
```

You can nest a conditional statement inside a loop, as in the code below, which prints the numbers from 5 to 10 that are smaller than 7. The whole conditional statement takes the place of the statement in the loop.

```
for (i in 5:10) {
  if (i < 7) {
    print(i)
  }
}
```

```
## [1] 5
## [1] 6
```
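A loop can also walk through the elements of a vector and keep a running result. The sketch below counts how many scores exceed a cut-off. The cut-off of 75 and the names `scores` and `n_high` are only illustrative.

```r
scores <- c(69.5, 77.5, 81.5, 90)
n_high <- 0

# seq_along(scores) produces the positions 1, 2, 3, 4
for (i in seq_along(scores)) {
  if (scores[i] > 75) {
    n_high <- n_high + 1   # update the running count
  }
}

n_high   # 3
```

The same count can be obtained without a loop with `sum(scores > 75)`, because `TRUE` counts as 1 when summed.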

## 3.8 Functions

Functions are pieces of code that have been defined for a specific purpose. You only need to supply the necessary inputs, and the function does the task. To use a function, write its name followed by a pair of parentheses, and put the arguments inside the parentheses to give instructions to the function. For example, the `sum()` function adds all the numbers in a vector or data frame and returns the result.

`sum(c(1, 4, 10, 5))`

`## [1] 20`

Another example is the `mean()` function, which averages the numbers in a vector and returns the result.

`mean(c(1, 4, 10, 5))`

`## [1] 5`

Some arguments of a function must be supplied. For example, you need to pass the data to the `mean()` function. Other arguments are optional because they have default values. If you do not specify them, the function uses the defaults. For example, the help page of `mean()` shows another argument called `na.rm`, which decides whether missing values should be removed. Let’s see the example below.

```
data <- c(1, 4, 5, NA)
mean(data)
```

`## [1] NA`

To avoid this, we need to pass an argument that changes the value of `na.rm` in the `mean()` function.

`mean(data, na.rm = TRUE)`

`## [1] 3.333333`

`na.rm` tells the function whether missing values should be removed during the calculation. Its default value is `FALSE`, which means that missing values are not removed. Calculating the average of a vector of numbers that contains a missing value returns a missing value. That’s why we get `NA` from our first try. In our second try, we set `na.rm` to `TRUE`. The function removes the missing values and we get the correct result.
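The same idea of a default value applies to functions you write yourself. The sketch below defines a small helper, `range_width()`, which is a hypothetical function invented for this example and not part of base R. It takes an `na.rm` argument that defaults to `FALSE` and passes it on to `max()` and `min()`.

```r
# Hypothetical helper: the distance between the largest and smallest value
range_width <- function(x, na.rm = FALSE) {
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}

values <- c(1, 4, 5, NA)

with_default <- range_width(values)                 # NA, because na.rm defaults to FALSE
with_removal <- range_width(values, na.rm = TRUE)   # 5 - 1 = 4
```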

It is important to use the right function for the right task. To do this, you have to be familiar with the functions you are using, which takes practice.