4  Data Visualization with ggplot2

Abstract
This chapter introduces ggplot2, the tidyverse’s popular data visualization package.

Charles Minard’s famous data visualization of Napoleon’s Russian Campaign of 1812.

4.1 Motivation

  1. Data visualization is exploratory data analysis (EDA)
  2. Data visualization is diagnosis and validation
  3. Data visualization is communication

4.2 Motivation (going beyond Excel)

  • Flexibility
  • Reproducibility
  • Scalability
  • Relational data vs. positional data

4.3 Background

  • The toughest part of data visualization is data munging.
  • Data frames are the only appropriate input for library(ggplot2).

ggplot2 is an R package for data visualization that was developed during Hadley Wickham’s graduate studies at Iowa State University. ggplot2 is formalized in A Layered Grammar of Graphics (Wickham 2010).

The grammar of graphics, originally by Leland Wilkinson, is a theoretical framework that breaks all data visualizations into their component pieces. With the layered grammar of graphics, Wickham extends Wilkinson’s grammar of graphics and implements it in R. The cohesion is impressive: the theory flows into the code, which informs the data visualization process in a way not reflected in any other data viz tool.

There are eight main ingredients to the grammar of graphics. We will work our way through the ingredients with many hands-on examples.

Exercise 1
  1. Open your .Rproj.
  2. Create a new .R script in your directory called 03_data-visualization.R.
  3. Type (don’t copy & paste) the following code below library(tidyverse) in 03_data-visualization.R.
ggplot(data = storms) + 
  geom_point(mapping = aes(x = pressure, y = wind))
  4. Add a comment above the ggplot2 code that describes the plot we created.

4.4 Eight Ingredients in the Grammar of Graphics:

4.4.1 Data

Tip

Data are the values represented in the visualization.

ggplot(data = ) or data |> ggplot()

storms |>
  select(name, year, category, lat, long, wind, pressure) |>
  sample_n(10) |>
  kable()
name      year  category  lat   long   wind  pressure
Keith     1988  NA        22.4  -87.2  60    990
Fabian    2003  1         42.3  -50.7  75    972
Harvey    2011  NA        16.1  -84.9  45    995
Arlene    1987  NA        39.0  -4.0   10    1009
Florence  1994  NA        22.7  -47.0  30    1011
Ernesto   1994  NA        10.5  -30.2  25    1010
Bill      2015  NA        37.8  -87.8  15    1002
Felix     1989  NA        38.5  -47.7  60    990
Beryl     1982  NA        15.3  -27.7  45    1000
Fifteen   2007  NA        48.7  -25.2  40    999
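
The two ways of supplying data shown above should produce the same plot. The following is a minimal sketch, assuming the storms data set from dplyr and the same pressure and wind variables used in Exercise 1; the object names p1 and p2 are only for illustration.

```r
library(ggplot2)
library(dplyr)  # dplyr ships the storms data set used in this chapter

# Form 1: pass the data frame as the data argument.
p1 <- ggplot(data = storms) +
  geom_point(mapping = aes(x = pressure, y = wind))

# Form 2: pipe the data frame into ggplot().
p2 <- storms |>
  ggplot() +
  geom_point(mapping = aes(x = pressure, y = wind))

# layer_data() returns the data ggplot2 computed for a layer,
# so comparing the two results checks that both forms use the same rows.
all.equal(layer_data(p1), layer_data(p2))
```

The piped form is convenient when the data need a few dplyr steps before plotting.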

4.4.2 Aesthetic Mappings:

Tip

Aesthetic mappings are directions for how data are mapped in a plot in a way that we can perceive. Aesthetic mappings include linking variables to the x-position, y-position, color, fill, shape, transparency, and size.

aes(x = , y = , color = )

  • X or Y
  • Continuous color or fill
  • Discrete color or fill
  • Size
  • Shape
  • Others: transparency, line type
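
As a small illustration of aesthetic mappings, the sketch below links three variables to x-position, y-position, and color. It assumes the mpg data set that ships with ggplot2, which is not otherwise used in this chapter.

```r
library(ggplot2)

# Map engine displacement to x, highway mileage to y,
# and vehicle class (a discrete variable) to color.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

p
```

Because class is mapped inside aes(), ggplot2 picks one color per class and draws a legend.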

4.4.3 Geometric Objects:

Tip

Geometric objects are representations of the data, including points, lines, and polygons.

geom_bar() or geom_col()

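Plots are often named after their geometric object(s): a bar chart uses geom_bar() or geom_col(), a line chart uses geom_line(), and a scatter plot uses geom_point().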

geom_line()

geom_point()

Exercise 2
  1. Duplicate the code from Exercise 1. Add comments below the data visualization code that describe the argument or function that corresponds to each of the first three components of the grammar of graphics.

  2. Inside aes(), add color = category. Run the code.

  3. Replace color = category with color = "green". Run the code. What changed? Is this unexpected?

  4. Remove color = "green" from aes() and add it inside of geom_point() but outside of aes(). Run the code.

  5. This is a little cluttered. Add alpha = 0.2 inside geom_point() but outside of aes().

Aesthetic mappings like x and y almost always vary with the data. Aesthetic mappings like color, fill, shape, transparency, and size can vary with the data, but those arguments can also be set as fixed styles that don’t vary with the data. If you include a fixed style in aes(), ggplot2 treats it as a variable and it will show up in the legend (which can be annoying, and is also a sign that something should be changed!).
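
The difference between mapping an aesthetic and setting a style can be sketched as follows. This example assumes the mpg data set from ggplot2, and the object names mapped, fixed, and mistake are hypothetical.

```r
library(ggplot2)

# Mapped: color varies with drv, so ggplot2 builds a color scale and a legend.
mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))

# Set: color and alpha are fixed styles outside aes(), so there is no legend.
fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "green", alpha = 0.2)

# Common mistake: "green" inside aes() is treated as a one-level variable,
# so the points take a color from the default scale and a legend appears.
mistake <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "green"))

# Inspect the colors ggplot2 actually assigned to the points.
unique(layer_data(fixed)$colour)
unique(layer_data(mistake)$colour)
```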

Exercise 3
  1. Create a new scatter plot using the msleep data set. Use bodywt on the x-axis and sleep_total on the y-axis.
  2. The y-axis doesn’t contain zero. Below geom_point(), add scale_y_continuous(limits = c(0, NA)). Hint: add + after geom_point().
  3. The x-axis is clustered near zero. Add scale_x_log10() above scale_y_continuous(limits = c(0, NA)).

4.4.4 Scales:

Tip

Scales turn data values, which are continuous, discrete, or categorical, into aesthetic values. scale_*_*() functions control the specific behaviors of aesthetic mappings. This includes not only the x-axis and y-axis, but the ranges of sizes, types of shapes, and specific colors of aesthetics.

Before

scale_x_continuous()

After

scale_x_reverse()

Before

scale_size_continuous(breaks = c(25, 75, 125))

After

scale_size_continuous(range = c(0.5, 20), breaks = c(25, 75, 125))
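
The scale functions above can be combined in one plot. The following is a sketch only: it assumes the mpg data set from ggplot2, and the size range and breaks are illustrative values chosen for that data rather than the ones behind the figures in this section.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty), alpha = 0.3) +
  # Flip the direction of the x-axis.
  scale_x_reverse() +
  # Control how small and large the points can get, and which
  # values of cty appear in the legend.
  scale_size_continuous(range = c(0.5, 6), breaks = c(10, 20, 30))

p
```

Each scale_*_*() call targets one aesthetic, so the x-axis and the point sizes are adjusted independently.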

Exercise 4
  1. Type the following code in your script.
data <- tibble(x = 1:10, y = 1:10)
ggplot(data = data) +
  geom_blank(mapping = aes(x = x, y = y))
  2. Add coord_polar() to your plot.

  3. Add labs(title = "Polar coordinate system") to your plot.

4.4.5 Coordinate Systems:

Tip

Coordinate systems map scaled geometric objects to the position of objects on the plane of a plot. The two most popular coordinate systems are the Cartesian coordinate system and the polar coordinate system.

coord_polar()
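
To see that a coordinate system changes how geometric objects are drawn rather than the underlying data, here is a small sketch. It assumes the mpg data set from ggplot2; the object names are hypothetical.

```r
library(ggplot2)

bars <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

# Cartesian coordinates are the default, so this matches `bars` on its own.
cartesian <- bars + coord_cartesian()

# Polar coordinates bend the same bars around a circle.
polar <- bars + coord_polar()

# The counts behind the bars are the same in both plots.
all.equal(layer_data(cartesian)$count, layer_data(polar)$count)
```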

Exercise 5
  1. Create a scatter plot of the storms data set with pressure on the x-axis and wind on the y-axis.

  2. Add facet_wrap(~ category)

4.4.6 Facets:

Tip

Facets (optional) break data into meaningful subsets.

Faceting breaks data visualizations into meaningful subsets using a variable or variables in the data set. This type of visualization is sometimes called small multiples.

facet_wrap creates small multiples, in order, until the variable is exhausted. It does not provide much macro structure to the visualization. facet_grid creates a macro structure where each panel represents the combination of one level on the macro x-axis with one level on the macro y-axis (see footnote 1).

You can see a helpful chart illustrating these differences here.

Facet wrap

facet_wrap(~ category)

Facet grid

facet_grid(month ~ year)
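
The contrast between the two faceting functions can be sketched with one base plot. This example assumes the mpg data set from ggplot2 and uses its class, drv, and cyl variables purely for illustration.

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# One panel per vehicle class, wrapped into rows as needed.
wrapped <- base + facet_wrap(~ class)

# Rows are levels of drv and columns are levels of cyl.
# Combinations with no observations show up as empty panels.
gridded <- base + facet_grid(drv ~ cyl)

wrapped
gridded
```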

Exercise 6
  1. Add the following code to your script. Submit it!
ggplot(storms) +
  geom_bar(mapping = aes(x = category))

4.4.7 Statistical Transformations:

Tip

Statistical transformations (optional) transform the data, typically through summary statistics and functions, before aesthetic mapping.

Before transformations, each observation in data is represented by one geometric object (e.g., a point in a scatter plot). After a transformation, a geometric object can represent more than one observation (e.g., a bar in a histogram).

Note: geom_bar() performs statistical transformation. Use geom_col() to create a column chart with bars that encode individual observations in the data set.
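
A minimal sketch of the difference between geom_bar() and geom_col(), assuming the mpg data set from ggplot2 and a hypothetical summary table called class_counts:

```r
library(ggplot2)
library(dplyr)

# geom_bar() counts the rows in each class for us.
bar <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

# geom_col() plots values exactly as given, so we count first.
class_counts <- count(mpg, class)

col <- ggplot(data = class_counts) +
  geom_col(mapping = aes(x = class, y = n))

# Both plots end up with the same bar heights.
all.equal(as.numeric(layer_data(bar)$y), as.numeric(layer_data(col)$y))
```

In the first plot one bar summarizes many rows; in the second each bar encodes a single row of class_counts.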

Exercise 7
  1. Duplicate Exercise 6.

  2. Add theme_minimal() to the plot.

Exercise 8
  1. Duplicate Exercise 6.
  2. Run install.packages("remotes") and remotes::install_github("UrbanInstitute/urbnthemes") in the console.
  3. In the lines preceding the chart add and run the following code:
library(urbnthemes)
set_urbn_defaults(style = "print")
  4. Run the code to make the chart.

  5. Add scale_y_continuous(expand = expansion(mult = c(0, 0.1))) and rerun the code.

4.4.8 Themes:

Note

Themes control the visual style of plots with font types, font sizes, background colors, margins, and positioning.

Default theme

fivethirtyeight theme

urbnthemes
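
Themes can also be adjusted piece by piece. The sketch below assumes the mpg data set from ggplot2 and combines a complete theme, theme_minimal(), with a few theme() tweaks; the specific settings are illustrative only.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  labs(title = "Engine size and highway mileage")

styled <- p +
  # Start from a complete built-in theme with a larger base font size.
  theme_minimal(base_size = 14) +
  # Then override individual elements.
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

styled
```

The theme changes fonts, grid lines, and backgrounds, while the positions of the points stay the same.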

Exercise 9
  1. Add the following code to your script. Run it!
storms |>  
  filter(category > 0) |>
  distinct(name, year) |>
  count(year) |>
  ggplot() + 
  geom_line(mapping = aes(x = year, y = n))
  2. Add geom_point() after geom_line() with the same aesthetic mappings.

4.4.9 Layers (bonus!):

Note

Layers allow for multiple geometric objects to be plotted in the same data visualization.
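
As one possible illustration of layering, the sketch below draws box plots and then adds the individual observations on top. It assumes the mpg data set from ggplot2; the jitter width and transparency are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  # Layer 1: one box per vehicle class. Outlier points are hidden
  # because layer 2 already draws every observation.
  geom_boxplot(mapping = aes(x = class, y = hwy), outlier.shape = NA) +
  # Layer 2: every observation, nudged sideways to reduce overlap.
  geom_jitter(
    mapping = aes(x = class, y = hwy),
    width = 0.2, height = 0, alpha = 0.3
  )

p
```

Layers are drawn in the order they are added, so the points sit on top of the boxes.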

Exercise 10
  1. Add the following code to your script. Run it!
ggplot(data = storms, mapping = aes(x = pressure, y = wind)) + 
  geom_point() +
  geom_smooth()

4.4.10 Inheritances (bonus!):

Note

Inheritances pass aesthetic mappings from ggplot() to later geom_*() functions.

Notice how the aesthetic mappings are passed to ggplot() in Exercise 10. This is useful when using layers!
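
Inherited mappings can be extended within a single layer. The following sketch assumes the mpg data set from ggplot2 and a linear fit in geom_smooth(), both chosen only for illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # Inherits x and y from ggplot() and adds color for this layer only.
  geom_point(mapping = aes(color = drv)) +
  # Inherits x and y only, so a single line is fit to all of the points.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p
```

Moving color = drv into ggplot() would pass it to both layers, and geom_smooth() would then fit one line per drive type.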

Exercise 11
  1. Pick your favorite plot from exercises 1 through 10 and duplicate the code.

  2. Add ggsave(filename = "favorite-plot.png") on a new line without + and then save the file. Look at the saved file.

  3. Add width = 6 and height = 4 to ggsave(). Run the code and then look at the saved file.
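
ggsave() saves the most recently displayed plot by default, but a plot object can also be passed explicitly. The sketch below assumes the mpg data set from ggplot2 and writes to a temporary directory so that it does not touch the project folder; out_file is a hypothetical name.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Build a path inside the session's temporary directory.
out_file <- file.path(tempdir(), "favorite-plot.png")

# width and height are in inches unless the units argument says otherwise.
ggsave(filename = out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```

Setting width and height explicitly keeps the saved image the same size no matter how the plot window is sized.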

4.5 Summary

4.5.1 Functions

This is a summary of the functions we discussed in this chapter. While by no means comprehensive, these are an excellent starting point for visualizing data using ggplot2.

  • ggplot()
  • aes()
  • geom_*()
    • geom_point()
    • geom_line()
    • geom_col()
  • scale_*()
    • scale_y_continuous()
  • coord_*()
  • facet_*()
  • labs()

4.5.2 Theory

  1. Data
  2. Aesthetic mappings
  3. Geometric objects
  4. Scales
  5. Coordinate systems
  6. Facets
  7. Statistical transformations
  8. Themes

4.5.3 Resources


  1. facet_geo() organizes charts in a way that attempts to preserve some geographic component of the data. It is beyond the scope of this course. You can learn more at the geofacet package vignette website.