Chapter 7 Data Visualization with ggplot2
In this chapter, we will learn to use ggplot2
to do data visualization tasks. Before the lecture, please install the package first.
install.packages("ggplot2")
7.1 Grammer of Graphics (gg)
We have grammar for languages. We also have grammar for graphics. That’s where gg
of ggplot2
comes from. For ggplot2
, it has seven grammatical elements listed in the table below (DataCamp 2019).
Element | Description |
---|---|
Data | The dataset being plotted. |
Aesthetics | The scales onto which we map our data. |
Geometries | The visual elements used for our data. |
Facets | Plotting small multiples. |
Statistics | Representations of our data to aid understanding. |
Coordinates | The space on which the data will be plotted. |
Themes | All non-data ink. |
Let’s take the codes below as a simple example to show how those different elements work in ggplot2
. From this example, we might have a general idea what is each element and how does it work. We will talk about each element in details later.
library(ggplot2)
data(mtcars)
ggplot(mtcars, # Data
aes(x = mpg, y = wt)) + # Aesthetics
geom_point() + # Geometries
facet_grid(. ~ gear) + # Facets
stat_smooth(method = "lm", se = FALSE, col = "blue") + # Statistics
scale_x_continuous('Miles/(US) gallon',
limits = c(0, 40)) +
scale_y_continuous('Weight (1000 lbs)',
limits = c(0, 7)) + # Coordinates
theme_bw() # Themes
7.2 Data, Aesthetics, and Geometries
Generally, if you want to draw figures with ggplot2
, you need at least three elements, which are data, aesthetics, and geometries. Data is the dataset we want to visualize. Aesthetic specifies the variables and related attributes. Geometry indicates the plot type and related attributes. Take the example above again. We want to visualize the variables of mpg
and wp
(aesthetic) of the dataset mtcars
(data) with a scatter plot (geometry).
Similar to dplyr
, ggplot2
also has its own fashion of coding. We start with the ggplot()
function. Please pay attention that there is no 2
in the name of the function. In the function, we first indicate the name of the dataset or data frame. Then we use aes()
to indicate the scales we want to map our data. Here, we map mpg
to x axis, and wt
to y axis. Then we use a plus sign +
to connect it to other functions. We are going to draw a scatter plot, so we use geom_point()
.
ggplot(mtcars, # Data
aes(x = mpg, y = wt)) + # Aesthetics
geom_point() # Geometries
We could see the result and it is following our codes. Based on this plot, we have a overall idea of the relationship between wt
and mpg
, which is wt
has a negative relationship with mpg
. This result makes sense to us. If a car is lighter and it could drive farther with the same amount of gasoline. (Please remember that this relationship is just a type of correlation, not causality.)
We could add more attributes in the aesthetic element. For example, we could use color to indicate the value of hp
by adding col = hp
.
ggplot(mtcars,
aes(x = mpg, y = wt, col = hp)) +
geom_point()
Now we have more information in the result. While the weight is heavier, the hourse power is stronger.
Here, hp
is a continuous variable, so ggplot2
uses the darkness of the color to indicate the value. However, if we use a categorical variable or a factor (e.g. binomial variable), ggplot2
will use different colors to show different types.
ggplot(mtcars,
aes(x = mpg, y = wt, col = factor(am))) +
geom_point()
Here, am
stands for the types of transmission system (0 = automatic, 1 = manual). We use factor()
to transfer this variable to a categorical one. ggplot2
uses one color for automatic transmission and another color for manual transmission.
Besides color
, there are several other parameters to show different aesthetics of the plots. There are differences between the usages of continuous variable and categorical variable. If you want to map variables in those aesthetics, you have to do it in the aes()
function.
Parameter | Description | Continuous variable | Categorical variable |
---|---|---|---|
x | x axis position | ✓ | |
y | y axis position | ✓ | |
size | Diameter of points, thickness of lines | ✓ | |
alpha | Transparency | ✓ | ✓ |
color | Color of dots, outlines of other shapes | ✓ | ✓ |
fill | Fill color | ✓ | ✓ |
labels | Text on a plot or axes | ✓ | |
shape | Shape of point | ✓ | |
linetype | Line dash pattern | ✓ |
As for geometries, there are many different types you can use for different plots. For examples, geom_point()
for scatter plot, geom_bar()
for bar plot, geom_boxplot()
for boxplot, etc. Most functions of geometries are self-explained, so you could tell what their usages easily. We all talk about those commonly used geometries such as scatter plot, bar plot, line plot, etc in the following parts.
7.2.1 Scatter plot
We use geom_point()
to plot scatter plot in R with ggplot2
. In the example below, we map mpg
to x axis, and wt
to y axis. We indicate the transparency by set alpha = hp
. The transparency of each point is decided by its value of hp
.
ggplot(mtcars,
aes(x = mpg, y = wt, alpha = hp)) +
geom_point()
7.2.2 Bar plot
In the example below, we draw a bar plot to show the number of cars with different tansmission systems. In aes()
, we only indicate the variable am
. R then will count the number for each transimission type. We use geom_bar()
to plot it.
ggplot(mtcars,
aes(x = am)) +
geom_bar()
Go back to our previous example, which is different from the example above.
<- c(1998:2003) # create variable year
year <- c(500, 600, 650, 700, 400, 550) # create variable sales
sales <- data.frame(year, sales) # combine the variables into one data frame called df
df
ggplot(df, aes(x = year, y = sales)) +
geom_bar(stat = 'identity') # you need to specify stat = 'identity' to plot the actual value for each year, not count
Or you could use another geometry called geom_col()
to do it easily.
ggplot(df, aes(x = year, y = sales)) +
geom_col()
7.2.3 Line plot
To show the usage of line plot in ggplot2
, we use a new dataset economics
from the ggplot2
package. This dataset is produced from US economic time series data. Use help()
to check more information of this dataset.
In this example, we use geom_line()
to draw a line plot.
data(economics)
ggplot(economics,
aes(x = date, y = unemploy)) +
geom_line(col = 'Blue') # indicate the color of the line by setting col = 'Blue'
In this plot, it shows clearly the temporal trend of the number of unemployed population in the US. There two increase recently starting from 2001 and 2008, which match two economic risks around 2001 and 2008.
7.2.4 Boxplot
In the example below, we draw a boxplot for each transmission type. To do this, we map am
to the x axis, and mpg
to the y axis. We use as.factor()
to transfer am
to a factor (categorical variable).
ggplot(mtcars, aes(x = as.factor(am), y = mpg)) +
geom_boxplot()
7.2.5 Pie chart
In ggplot2
, it is not as intuitive as the base function pie()
to draw a pie chart. We use the same example to see the difference.
<- c(1998:2003) # create variable year
year <- c(500, 600, 650, 700, 400, 550) # create variable sales
sales <- data.frame(year, sales) # combine the variables into one data frame called df
df
ggplot(df, aes(x = '', y = sales, fill = factor(year))) +
geom_bar(width = 1, stat = 'identity') +
coord_polar('y') # transfer the coordinate system to the polar one
Whatggplot2
does here to draw a pie plot is to create a bar plot first.
ggplot(df, aes(x = '', y = sales, fill = factor(year))) +
geom_bar(width = 1, stat = 'identity')
And then transfer the coordinate system to the polar one.
ggplot(df, aes(x = '', y = sales, fill = factor(year))) +
geom_bar(width = 1, stat = 'identity') +
coord_polar('y')
7.3 Facets
If you want to split up your data by one or more variables and plot each subset in one figure, facet
is the element you want to use.
In the following example, we draw a scatter plot for each number of forward gears. In each plot, we map mpg
to the x axis and wt
to y axis. The three plots are aligned in a row.
<- ggplot(mtcars, # store the plot result in variable p
p aes(x = mpg, y = wt))
geom_point()
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
+ facet_grid(. ~ gear) # Facets, for each number of forward gears p
If we want to align the plots in a column, exchange the position of the varialbe in the function.
+ facet_grid(gear ~ .) # Facets, pay attention to the position of gear and the dot sign p
We could put more variables to split the plots. In the following example, we put one more variable vs
in the facet_grid()
.
+ facet_grid(vs ~ gear) # add vs p
We could use the margins = T
to add more plots showing the aggregation of the plots in each column or row.
+ facet_grid(vs ~ gear, margins = T) # add aggregation plots for each row and column p
We could use labeller = label_both
to add more information in the label.
+ facet_grid(vs ~ gear, labeller = label_both) # add more info in the label p
7.4 Statistics
You could fit a line to see the general trend of the scatter plot by adding stat_smooth()
.
<- ggplot(mtcars,
p aes(x = mpg, y = wt)) +
geom_point()
+ stat_smooth(method = "lm") # linear line p
## `geom_smooth()` using formula 'y ~ x'
+ stat_smooth(method = 'loess') # non-linear line p
## `geom_smooth()` using formula 'y ~ x'
You could draw a histogram for the dataset.
<- ggplot(mtcars, aes(mpg))
p + geom_histogram(binwidth = 5) # the width of each category is 5 p
+ geom_histogram(bins = 3) # the number of bins is 3 p
You could fit a denisty line for the histogram.
+ geom_histogram(aes( y = ..density..), binwidth = 3, col = 'Black', fill = 'White') + # use density instead of count
p geom_density(alpha=.2, fill="Gray")
Another way to show more statistics about your data is boxplot, which we have been introduced before.
<- ggplot(mtcars, aes(x = factor(gear), y = mpg))
p + geom_boxplot() p
7.5 Coordinates
While there are many coordinate systems supported by ggplot2
, the most commonly used is Cartesian coordinate system, which is the combination of x axis and y axis orthogonally.
7.5.1 Zooming in and out
In the following example, we zoom in our plot to a specific area.
<- ggplot(mtcars, aes(x = mpg, y = wt)) +
p geom_point()
+ coord_cartesian(xlim = c(10, 30), ylim = c(2,4)) p
7.5.2 Ratio
We could change the ratio of the length of a y unit relative to the length of a x unit (\(\frac{\text{y unit}}{\text{x unit}}\)).
<- ggplot(mtcars, # Data
p aes(x = mpg, y = wt)) + # Aesthetics
geom_point()
+ coord_fixed(ratio = 1) # 1 means x and y axes have the same unit p
+ coord_fixed(ratio = 5) p
7.5.3 Swaping the axes
+ coord_flip() p
ggplot(mtcars, aes(x = as.factor(am), y = mpg)) +
geom_boxplot() +
coord_flip()
ggplot(mtcars, # Data
aes(gear)) + # Aesthetics
geom_bar() +
scale_y_reverse()
7.5.4 Polar coordinate system
We touched the polar coordinate system a little bit when drawing a pie chart.
ggplot(mtcars, aes(x = '', fill = factor(gear))) +
geom_bar(width = 1) +
coord_polar(theta = 'y')
We could do this in a different way.
ggplot(mtcars, aes(factor(gear))) +
geom_bar(width = 1, col = 'Black', fill = 'Grey') +
coord_polar()
7.6 Themes
ggplot2
is powerful in its flexibility of themes.
7.6.1 Add labels
Add labels with labs()
function.
<- ggplot(mtcars, aes(x = mpg, y = wt)) +
p geom_point()
+ labs(title = 'Plot 1', # title of the plot
p subtitle = 'Subplot1', # sub title
x = 'Miles/(US) gallon', # x label
y = 'Weight (1000 lbs)') # y label
Or use ggtitle()
, xlab()
, and ylab()
instead.
+ ggtitle('Plot1') +
p xlab('Miles/(US) gallon') + # x label
ylab('Weight (1000 lbs)') # y label
7.6.2 Change ticks
Here is an example of changing the ticks for a discrete variable.
<- ggplot(mtcars, aes(x = factor(am), y = mpg))
p + geom_boxplot() +
p scale_x_discrete(name = "Transmission", # name of the x axis
labels = c('Automatic', 'Manual')) # change 0 and 1 to automatic and manaual
7.6.3 theme() function
theme()
function could help to change the styles of every components of the plots. Here are a few examples about it (ggplot2 2019).
<- ggplot(mtcars, aes(x = wt, y = mpg)) +
p geom_point() +
labs(title = 'Fuel economy declines as weight increases')
# original plot p
We could use plot.title
to change the style of a title in the plot. In the following exmaple, we change the font size of the title by setting element_text()
and making size
to be twice larger as the default one.
+ theme(plot.title = element_text(size = rel(2))) p
We could also use absolute value to indicate the size directly.
+ theme(plot.title = element_text(size = 15)) p
We could change the background of the plot by setting the value of plot.background
in the theme()
function. For example, if we want to change the color, we could set element_rect()
and fill = 'red'
.
+ theme(plot.background = element_rect(fill = 'red')) p
More specifically, if we want to change the style of the panel, which is the inner part restricted by x and y axes, we could set the value of panel.background
.
+ theme(panel.background = element_rect(fill = 'green', colour = 'red')) p
We could set the line type of the panel’s border.
+ theme(panel.border = element_rect(linetype = 'dashed', fill = NA)) p
Change the attributes of the lines for the grid. element_line()
stands for the attributes of the lines.
+ theme(panel.grid.major = element_line(colour = 'black'),
p panel.grid.minor = element_line(colour = 'blue'))
Use element_blank()
to remove the themes of the target.
+ theme(panel.grid.major.y = element_blank(),
p panel.grid.minor.y = element_blank())
We could put the gridline on the top of our data by setting panel.ontop = TRUE
.
+ theme(
p panel.background = element_rect(fill = NA),
panel.grid.major = element_line(colour = 'Blue'),
panel.ontop = TRUE
)
Change the line style of the axis.
+ theme(
p axis.line = element_line(size = 3, colour = 'red')
)
Change the text style of the axis.
+ theme(
p axis.text = element_text(colour = 'Green', size = 15)
)
Change the attributes of the text for the axis ticks.
+ theme(
p axis.ticks = element_line(size = 1.5)
)
And y label.
+ theme(
p axis.title.y = element_text(size = rel(1.5), angle = 30)
)
Now, let’s see what we could do for the legend.
<- ggplot(mtcars, aes(wt, mpg)) +
p geom_point(aes(colour = factor(cyl), shape = factor(vs))) +
labs(
x = 'Weight (1000 lbs)',
y = 'Fuel economy (mpg)',
colour = 'Cylinders',
shape = 'Transmission system'
) p
Remove the legend by setting its postion as legend.position = 'none'
.
+ theme(
p legend.position = 'none'
)
+ theme(
p legend.position = 'top'
)
By setting legend.justification
for the legend, we anchor point for positioning legend inside plot (“center” or two-element numeric vector) or the justification according to the plot area when positioned outside the plot
+ theme(
p legend.justification = 'top'
)
+ theme(
p legend.position = c(.95, .95),
legend.justification = c('right', 'top'),
legend.box.just = 'right'
)
+ theme(
p legend.box.background = element_rect(),
legend.box.margin = margin(6, 6, 6, 6)
)
Set attributes for the key of the legend.
+ theme(
p legend.key = element_rect(fill = 'white', colour = 'black')
)
Set attributes for the text of the legend.
+ theme(
p legend.text = element_text(size = 8, colour = 'red', face = 'bold')
)
If you do not like to customize the theme one element by one element. ggplot2
also provides some function which give you a complete theme. For example, the theme_bw()
we used at the start of this lecture.
+ theme_bw() p
+ theme_minimal() p
+ theme_dark() p
Please check here for more complete theme.