Chapter 8 Statistics in R
In this chapter, we are going to use R to add statistics to plots and to carry out statistical analysis.
We need dplyr, ggplot2, and the Minneapolis ACS dataset.
library(dplyr)
library(ggplot2)
minneapolis <- read.csv('minneapolis.csv')
8.1 Plotting statistics
8.1.1 Fitting the general trend of points
You could fit a line to see the general trend of the scatter plot by using stat_smooth() in ggplot2.
## select observations from 2017
minneapolis_2017 <- filter(minneapolis, YEAR == 2017)
p <- ggplot(minneapolis_2017, aes(AGE, INCTOT)) +
  geom_point()
p + stat_smooth(method = 'lm') # fit a linear line for the points
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 454 rows containing non-finite values (stat_smooth).
## Warning: Removed 454 rows containing missing values (geom_point).
p + stat_smooth(method = 'loess') # non-linear line
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 454 rows containing non-finite values (stat_smooth).
## Warning: Removed 454 rows containing missing values (geom_point).
It seems that the non-linear relationship makes more sense.
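If you want to hide the confidence band or control how smooth the curve is, stat_smooth() also takes the se and span arguments; a minimal sketch (the values here are only illustrative):
p + stat_smooth(method = 'loess',
                span = 0.75,  # a larger span gives a smoother curve
                se = FALSE)   # hide the shaded confidence band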
8.1.2 Plotting distribution
You could draw a histogram of the dataset with geom_histogram(). With the binwidth argument, we can set the width of the bins.
p <- ggplot(minneapolis_2017, aes(AGE))
p + geom_histogram(binwidth = 5) # the width of each category is 5
Instead of the width of the bins, we could set the number of bins with bins.
p + geom_histogram(bins = 30) # the number of bins is 30
By setting aes(y = ..density..) in geom_histogram(), the plot shows density instead of counts.
p + geom_histogram(aes(y = ..density..), bins = 30) # use density instead of count
8.1.3 Density plot of distribution
You could fit a density line to the histogram, which is a smooth line that shows the distribution of the data.
p + geom_histogram(aes(y = ..density..), binwidth = 5) +
  geom_density(alpha = 0.2, fill = 'Grey')
geom_freqpoly() can serve the same purpose.
p + geom_freqpoly(binwidth = 5)
8.1.4 Box plot of distribution
Another way to show more statistics about your data is the box plot, which we have introduced before.
p <- ggplot(minneapolis_2017, aes(factor(SEX), INCTOT))
p + geom_boxplot()
## Warning: Removed 454 rows containing non-finite values (stat_boxplot).
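The statistics that the box plot visualizes (the median and quartiles) can also be computed directly with dplyr; a minimal sketch, using na.rm = TRUE because INCTOT contains missing values:
minneapolis_2017 %>%
  group_by(SEX) %>%
  summarise(q1 = quantile(INCTOT, 0.25, na.rm = TRUE),   # lower quartile
            median = median(INCTOT, na.rm = TRUE),       # median income
            q3 = quantile(INCTOT, 0.75, na.rm = TRUE))   # upper quartile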
8.2 Annotation of statistics
We can use geom_text() to annotate the statistics in the plot.
8.2.1 Bar plot
ggplot(minneapolis_2017, aes(x = RACE)) +
geom_bar() +
geom_text(aes(label = ..count..), stat = 'count', vjust = 1.5, colour = 'white')
## count the number of respondents based on RACE
minneapolis_race <- minneapolis_2017 %>%
  group_by(RACE) %>%
  summarise(count = n())
ggplot(minneapolis_race, aes(RACE, count)) +
geom_col() +
geom_text(aes(label = count), vjust = -0.3)
8.2.2 Line plot
## calculate the average personal income for each year
minneapolis_year <- minneapolis %>%
  group_by(YEAR) %>%
summarise(AvgInc = mean(INCTOT, na.rm = T))
ggplot(minneapolis_year, aes(YEAR, AvgInc)) +
geom_line(linetype= 'dashed') +
geom_point() +
geom_text(aes(label = round(AvgInc)), vjust = -0.5)
## calculate the average personal income for each racial group in each year
minneapolis_year_race <- minneapolis %>%
  group_by(YEAR, RACE) %>%
summarise(AvgInc = mean(INCTOT, na.rm = T))
## line plot of the trend of average personal income
ggplot(minneapolis_year_race, aes(YEAR, AvgInc, linetype = factor(RACE))) +
geom_line() +
geom_text(aes(label = round(AvgInc)), vjust = -0.4)
8.3 Simple statistics
We have already covered some functions on this topic. For example, we could use mean() to compute the average value of a set of numbers.
8.3.1 Mean and median
We could use the base functions to do some simple statistical analysis directly.
mean(minneapolis_2017$AGE) # mean
## [1] 36.0889
median(minneapolis_2017$AGE) # median
## [1] 32
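Note that these functions return NA when the column contains missing values, as INCTOT does in this dataset; set na.rm = TRUE to ignore them. A quick sketch:
mean(minneapolis_2017$INCTOT)               # returns NA because INCTOT has missing values
mean(minneapolis_2017$INCTOT, na.rm = TRUE) # mean of the non-missing values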
8.3.2 Minimum and maximum value
min(minneapolis_2017$AGE) # minimum value
## [1] 0
max(minneapolis_2017$AGE) # maximum value
## [1] 95
8.3.3 Quartile
x <- quantile(minneapolis_2017$AGE)
x # list of quartiles
## 0% 25% 50% 75% 100%
## 0 20 32 52 95
x[2] # select the value by its index
## 25%
## 20
You could pass a value from 0 to 1 to quantile() to find a specific quantile, for example, 40%.
quantile(minneapolis_2017$AGE, 0.4) # 40% of the dataset
## 40%
## 28
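quantile() also accepts a vector of probabilities if you want several quantiles at once; a minimal sketch:
quantile(minneapolis_2017$AGE, probs = c(0.1, 0.4, 0.9)) # 10%, 40%, and 90% quantiles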
8.4 Linear regression
Before we start, let’s review some related knowledge first.
Regression is used to examine the linear relationships between a dependent variable and independent variables, where the dependent variable is the one to be explained and the independent variables (also called regressors, predictors, or explanatory variables) are those that may have an influence on the dependent variable. For example, the dependent variable could be personal income, and the independent variables education, gender, age, etc. Independent variables come in two types: continuous variables and categorical variables (including dummy variables). A continuous variable takes continuous values, such as income and age. A categorical variable contains levels of values, which are not continuous. A dummy variable is a categorical variable with two levels; for gender, for example, people could use 1 for male and 0 for female.
Suppose we have a dependent variable \(Y\), and two independent variables \(X_1\) and \(X_2\), the regression model in assumption could be expressed as below,
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon\]
where \(\beta_0\) is the intercept, \(\beta_1\) and \(\beta_2\) are the coefficients for \(X_1\) and \(X_2\), and \(\epsilon\) is the error term, the part of the dependent variable that cannot be explained by the intercept and the independent variables. Linear regression estimates the values of \(\beta_0\), \(\beta_1\), and \(\beta_2\), and tests their significance. A coefficient tells us that when its independent variable changes by one unit, the dependent variable changes by the value of that coefficient. For example, if \(\beta_1 = 0.5\), a one-unit increase in \(X_1\) is associated with a 0.5 increase in \(Y\), holding \(X_2\) constant.
The estimated model could be expressed as,
\[\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2\]
The terms with hats are the estimated values.
While regression provides the estimated values of the intercept and coefficients, it also reports the significance of these estimates with p-values. The smaller the p-value, the more significant the estimate. In R, the output uses marks to indicate the significance level, which describes how confident we can be that an estimate is different from zero.
| Mark | Significance level |
|------|--------------------|
| .    | 90%                |
| *    | 95%                |
| **   | 99%                |
| ***  | 99.9%              |
To quantify the fitness of the model, we use \(R^2\), which takes a value from 0 to 1. When \(R^2\) is closer to 1, the model fits the dataset better. \(R^2\) has the property that adding more independent variables to the regression model never decreases it. There is another index called adjusted \(R^2\), which accounts for the number of variables in the model.
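For reference, with \(n\) observations and \(p\) independent variables, the two indices can be written as
\[R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}, \qquad R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\]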
Our example is the Minneapolis ACS population dataset, and we want to explore the relationship between EDUC (education level) and AGE (age). Let's draw a scatter plot to see their distribution.
ggplot(minneapolis_2017, aes(AGE, EDUC)) +
geom_point()
Because many points overlap, we could add some noise to the positions of the points to help show all of them.
ggplot(minneapolis_2017, aes(AGE, EDUC)) +
geom_jitter(size = 0.5) ## add some noise to the positions of the points and set the size of the points to be 0.5
Based on the plot, it seems there is a positive relationship between these two variables. We add a linear line to fit the relationship with geom_smooth().
ggplot(minneapolis_2017, aes(AGE, EDUC)) +
geom_jitter(size = 0.5) +
geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
To quantify this linear relationship, we could use the lm() function to fit it and the summary() function to see the result. In the function, the formula indicates the assumed model. Here, our assumed model is,
\[EDUC = \beta_0 + \beta_1 \times AGE + \epsilon\]
When we code this model in R, we do
EDUC ~ AGE
We only need to write down the names of the dependent variable and independent variables and connect them with ~. There is no need to write the intercept or error term.
We also need to indicate the name of the dataset in the function.
lm_fit <- lm(EDUC ~ AGE,            # formula
             data = minneapolis_2017) # dataset
summary(lm_fit) # check result
##
## Call:
## lm(formula = EDUC ~ AGE, data = minneapolis_2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1979 -2.8243 0.3888 2.7364 5.1528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.037407 0.114684 35.20 <2e-16 ***
## AGE 0.078687 0.002741 28.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.102 on 2855 degrees of freedom
## Multiple R-squared: 0.224, Adjusted R-squared: 0.2237
## F-statistic: 824.1 on 1 and 2855 DF, p-value: < 2.2e-16
The summarized result provides details about the model results, such as the coefficients and p-values, the model’s \(R^2\), etc.
Based on the information, we know that the estimated coefficient for the intercept is 4.04; its p-value is \(< 2e-16\) with the mark \(***\), showing it is significant at the 99.9% level. The estimated coefficient for AGE is 0.08; its p-value is \(< 2e-16\) with the mark \(***\), showing it is also significant at the 99.9% level.
We could also know the \(R^2\) is 0.224, and adjusted \(R^2\) is 0.2237.
We could use the code below to check the \(R^2\) of the model directly.
summary(lm_fit)$r.squared # value of R2
## [1] 0.2239978
We can also get the values of the coefficients directly.
coefficients(lm_fit) # only check the coefficient
## (Intercept) AGE
## 4.03740708 0.07868725
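The summary object also stores the full coefficient table (estimates, standard errors, t values, and p-values) and the adjusted \(R^2\), which can be extracted in the same way; a quick sketch:
summary(lm_fit)$coefficients  # full coefficient table with standard errors and p-values
summary(lm_fit)$adj.r.squared # adjusted R2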
The model can be improved. From the scatter plot between EDUC and AGE, many respondents reach their highest level of education before the age of 25. After 25, AGE has a trivial influence on EDUC. We are therefore interested in the relationship before the age of 25.
minneapolis_age <- filter(minneapolis_2017, AGE <= 25)
ggplot(minneapolis_age, aes(AGE, EDUC)) +
geom_jitter(size = 0.5) +
geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
lm_fit <- lm(EDUC ~ AGE,
             data = minneapolis_age)
summary(lm_fit)
##
## Call:
## lm(formula = EDUC ~ AGE, data = minneapolis_age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3077 -1.0348 -0.0909 0.9091 3.9091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.615627 0.087077 -18.55 <2e-16 ***
## AGE 0.405604 0.005307 76.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.303 on 1006 degrees of freedom
## Multiple R-squared: 0.8531, Adjusted R-squared: 0.8529
## F-statistic: 5841 on 1 and 1006 DF, p-value: < 2.2e-16
The model fits the dataset better now.
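Once fitted, the model can also be used for prediction with predict(); a minimal sketch (the ages below are only illustrative):
predict(lm_fit, newdata = data.frame(AGE = c(18, 22))) # predicted EDUC at ages 18 and 22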
Most of the time, we need to examine the relationship between the dependent variable and more than one independent variable. The example below examines the relationship between EDUC and AGE, SEX, and RACE.
Education level and age.
ggplot(minneapolis_2017, aes(AGE, EDUC)) +
geom_jitter(size = 0.5)
Education level and gender.
ggplot(minneapolis_2017, aes(factor(SEX), EDUC)) +
geom_boxplot()
Education level and racial group.
ggplot(minneapolis_2017, aes(factor(RACE), EDUC)) +
geom_boxplot()
When there is more than one independent variable, we use + to connect them in the formula. RACE and SEX are both categorical variables, so we transform them to factors.
mlm_fit <- lm(EDUC ~ AGE + factor(SEX) + factor(RACE),
              minneapolis_2017)
summary(mlm_fit)
##
## Call:
## lm(formula = EDUC ~ AGE + factor(SEX) + factor(RACE), data = minneapolis_2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4126 -2.2219 0.4191 2.4941 6.5838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.976819 0.136087 36.571 <2e-16 ***
## AGE 0.070253 0.002684 26.170 <2e-16 ***
## factor(SEX)2 -0.044017 0.110679 -0.398 0.691
## factor(RACE)2 -2.413424 0.160769 -15.012 <2e-16 ***
## factor(RACE)3 -1.714259 0.159498 -10.748 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.957 on 2852 degrees of freedom
## Multiple R-squared: 0.2955, Adjusted R-squared: 0.2945
## F-statistic: 299.1 on 4 and 2852 DF, p-value: < 2.2e-16
summary(mlm_fit)$r.squared
## [1] 0.2955301
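By default, factor() treats the lowest level (here, RACE equal to 1) as the reference category, and the other coefficients are interpreted relative to it. If you want a different group as the baseline, you can relevel the factor before fitting; a sketch (mlm_fit2 is just an illustrative name):
mlm_fit2 <- lm(EDUC ~ AGE + factor(SEX) + relevel(factor(RACE), ref = '3'),
               minneapolis_2017) # use RACE == 3 as the reference group
summary(mlm_fit2)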
Without careful research design, the relationships fitted by the regression model are all correlations, not causal relationships.
8.5 Logistic regression
The above two examples both use continuous variables as their dependent variables. How about using a binomial variable (i.e., a dummy variable with 0 or 1 as its value) as the dependent variable? Then we need logistic regression. There are many functions that can accomplish this well. When interpreting the result of a logistic regression, each coefficient stands for the change in the log odds of the dependent variable being 1 when the corresponding independent variable increases by one unit.
## create a variable to indicate the poverty status of the respondents
minneapolis_poverty <- minneapolis_2017 %>%
  filter(!is.na(INCTOT)) %>%
mutate(poverty = ifelse(INCTOT <= 12228, 1, 0))
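As a quick check of how the new variable splits the sample, we can tabulate it (a sketch using dplyr's count()):
count(minneapolis_poverty, poverty) # number of respondents in and out of poverty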
Poverty level and age.
ggplot(minneapolis_poverty, aes(AGE, fill = factor(poverty))) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Poverty and education level.
ggplot(minneapolis_poverty, aes(EDUC, fill = factor(poverty))) +
geom_bar()
Poverty and gender.
ggplot(minneapolis_poverty, aes(SEX, fill = factor(poverty))) +
geom_bar()
Poverty and racial group.
ggplot(minneapolis_poverty, aes(RACE, fill = factor(poverty))) +
geom_bar()
Here, we introduce the glm() function. We need to indicate family = binomial in the function.
## logistic regression
logit_reg <- glm(poverty ~ AGE + EDUC + factor(SEX) + factor(RACE),
                 minneapolis_poverty,
                 family = binomial)
summary(logit_reg)
##
## Call:
## glm(formula = poverty ~ AGE + EDUC + factor(SEX) + factor(RACE),
## family = binomial, data = minneapolis_poverty)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3815 -0.7066 -0.5090 0.8830 2.8188
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.115161 0.233282 13.354 < 2e-16 ***
## AGE -0.031852 0.002971 -10.720 < 2e-16 ***
## EDUC -0.367547 0.022532 -16.312 < 2e-16 ***
## factor(SEX)2 0.096235 0.101052 0.952 0.341
## factor(RACE)2 0.646833 0.145549 4.444 8.83e-06 ***
## factor(RACE)3 0.168694 0.146323 1.153 0.249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2983.5 on 2402 degrees of freedom
## Residual deviance: 2391.9 on 2397 degrees of freedom
## AIC: 2403.9
##
## Number of Fisher Scoring iterations: 4
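Because the coefficients are on the log-odds scale, it is common to exponentiate them to read them as odds ratios; a minimal sketch:
exp(coef(logit_reg)) # odds ratios: values above 1 raise the odds of poverty, values below 1 lower them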
Most of the information is similar to that of linear regression, except that logistic regression does not have \(R^2\) or adjusted \(R^2\). It uses the AIC (Akaike information criterion) to indicate the goodness of the model. If one model has a smaller AIC, it is better.