library(tidyverse)
library(viridis)
library(ggrain)08: Visualisations
Overview
This tutorial focuses on creating data visualisations using the {ggplot2} package. It covers the conceptual grammar of {ggplot2} and a variety of examples of common plots. It also covers how to customise key elements of the plots, such as colour, axis labeling, and adding statistical summaries, and finishes with some tips on using the figures you produce, both in Quarto and exporting for use elsewhere.
Data Visualisation Resources
This tutorial will only touch on basics to get you started. Check out the following indispensible resources for building more beautiful visualisations:
- From Data to Viz, a beautifully designed helper site to help you explore the options for visualising your data, with thorough links to…
- R Graph Gallery, an archive of plots in R all with reproducible code and tutorials to build them yourself.
- Tidy Tuesdays, which releases a new dataset each week, and people build and share beautiful visualisations, along with the code and frequently their process.
- Data visualisation using R, for researchers who don’t use R, a fantastic how-to from the brilliant team at PsychTeachR at Glasgow, that provides a very friendly, start-to-finish walkthrough of the whole process of data reading, manipulation, and plot-building.
What do UGs know?
Data visualisation with {ggplot2} is explicitly introduced near the end of the first term of Year 1 at Sussex, and practiced throughout the second term. In Year 2, there is less focus on constructing plots from scratch, and instead students are taught how to get a basic plot pre-made and then modify it using their {ggplot2} skills.
This tutorial includes essentially everything students are taught in Year 1, with a variety of extRas sprinkled in. When we look at specific analyses (e.g. ANOVA, mediation, and moderation), we will also see how to generate plots accompany those analyses.
RepRoducible Figures
Figures are visual representations of a study’s results, and it is therefore key to ensure they are as reproducible as the numerical results and inferential conclusions reported in the paper. In some fields like biology where images are the data themselves, researchers have found that Photoshop is sometimes used to enhance images (Rossner & Yamada, 2004). This scenario is very different from the use of scatter plots or violin plots to visualise data, but we should be wary of the possibility of manipulating digital images. To ensure that a manuscript is as trustworthy as possible, all plots representing data should be produced alongside the other statistical results reported in the paper, and the plots produced in our analysis script should look exactly the same as the plots in the manuscript.
Setup
Packages
We will be relying heavily on the {ggplot2} package, naturally, which is part of {tidyverse}. You can load {ggplot2} by itself, but since we will also make use of some other {tidyverse} functions, it’s probably most efficient to simply load {tidyverse}.
We will also be using {ggrain} for raincloud plots and {viridis} for colour palettes.
You may notice that I will abandon my usual explicit function style for {ggplot2} functions - so, for example, instead of ggplot2::ggplot(ggplot2::aes(...)) I’ll just write ggplot(aes()). In this case, it is much easier to simply load the {ggplot2} package and drop the package call, than to type the same package name over and over and over. It also makes the chonky {ggplot2} code a lot easier to read!
Data
Today we’re continuing to work with the dataset courtesy of fantastic Sussex colleague Jenny Terry. This dataset contains real data about statistics and maths anxiety. As before, we will use the version of the dataset with averaged scores for each subscale and without the individual items.
Codebook
There’s quite a bit in this dataset, so you will need to refer to the codebook below for a description of all the variables.
This study explored the difference between maths and statistics anxiety, widely assumed to be different constructs. Participants completed the Statistics Anxiety Rating Scale (STARS) and Maths Anxiety Rating Scale - Revised (R-MARS), as well as modified versions, the STARS-M and R-MARS-S. In the modified versions of the scales, references to statistics and maths were swapped; for example, the STARS item “Studying for an examination in a statistics course” became the STARS-M item “Studying for an examination in a maths course”; and the R-MARS item “Walking into a maths class” because the R-MARS-S item “Walking into a statistics class”.
Participants also completed the State-Trait Inventory for Cognitive and Somatic Anxiety (STICSA). They completed the state anxiety items twice: once before, and once after, answering a set of five MCQ questions. These MCQ questions were either about maths, or about statistics; each participant only saw one of the two MCQ conditions.
For learning purposes, I’ve randomly generated some additional variables to add to the dataset containing info on distribution channel, consent, gender, and age. Especially for the consent variable, don’t worry: all the participants in this dataset did consent to the original study. I’ve simulated and added this variable in later to practice removing participants.
| Variable | Type | Description | Values |
|---|---|---|---|
| id | Categorical | Unique ID code | NA |
| distribution | Categorical | Channel through which the study was completed, either as a preview (before real data collection) or anonymous genuine responses. Note that this variable has been randomly generated and does NOT reflect genuine responses. | "preview" or "anonymous" |
| consent | Categorical | Whether the participant read and consented to participate. Note that this variable has been randomly generated and does NOT reflect genuine responses; all participants in this dataset did originally consent to participate. | "Yes" or "No" |
| gender | Categorical | Gender identity. Note that this variable has been randomly generated and does NOT reflect genuine responses. | "female", "male", "non-binary", or "other/pnts". "pnts" is an abbreviation for "Prefer not to say". |
| age | Numeric | Age in years. Note that this variable has been randomly generated and does NOT reflect genuine responses. | 18 - 99 |
| mcq | Categorical | Independent variable for MCQ question condition, whether the participant saw MCQ questions about mathematics or statistics. | "maths" or "stats" |
| stars_test_score | Numeric | Averaged score on the Test Anxiety subscale of the Statistics Anxiety Rating Scale (STARS) | 1 (low anxiety) to 5 (high anxiety) |
| stars_int_score | Numeric | Averaged score on the Interpretation Anxiety subscale of the Statistics Anxiety Rating Scale (STARS) | 1 (low anxiety) to 5 (high anxiety) |
| stars_help_score | Numeric | Averaged score on the Asking for Help subscale of the Statistics Anxiety Rating Scale (STARS) | 1 (low anxiety) to 5 (high anxiety) |
| stars_m_test_score | Numeric | Averaged score on the Test Anxiety subscale of the Statistics Anxiety Rating Scale - Maths (STARS-M), a modified version of the STARS with all references to maths replaced with statistics. | 1 (low anxiety) to 5 (high anxiety) |
| stars_m_int_score | Numeric | Averaged score on the Interpretation Anxiety subscale of the Statistics Anxiety Rating Scale - Maths (STARS-M), a modified version of the STARS with all references to maths replaced with statistics. | 1 (low anxiety) to 5 (high anxiety) |
| stars_m_help_score | Numeric | Averaged score on the Asking for Help subscale of the Statistics Anxiety Rating Scale - Maths (STARS-M), a modified version of the STARS with all references to maths replaced with statistics. | 1 (low anxiety) to 5 (high anxiety) |
| rmars_test_score | Numeric | Averaged score on the Test Anxiety subscale of the Revised Maths Anxiety Rating Scale (R-MARS) | 1 (low anxiety) to 5 (high anxiety) |
| rmars_num_score | Numeric | Averaged score on the Numerical Task Anxiety subscale of the Revised Maths Anxiety Rating Scale (R-MARS) | 1 (low anxiety) to 5 (high anxiety) |
| rmars_course_score | Numeric | Averaged score on the Course Anxiety subscale of the Revised Maths Anxiety Rating Scale (R-MARS) | 1 (low anxiety) to 5 (high anxiety) |
| rmars_s_test_score | Numeric | Averaged score on the Test Anxiety subscale of the Revised Maths Anxiety Rating Scale - Statistics (R-MARS-S), a modified version of the MARS with all references to maths replaced with statistics. | 1 (low anxiety) to 5 (high anxiety) |
| rmars_s_num_score | Numeric | Averaged score on the Numerical Anxiety subscale of the Revised Maths Anxiety Rating Scale - Statistics (R-MARS-S), a modified version of the MARS with all references to maths replaced with statistics. | 1 (low anxiety) to 5 (high anxiety) |
| rmars_s_course_score | Numeric | Averaged score on the Course Anxiety subscale of the Revised Maths Anxiety Rating Scale - Statistics (R-MARS-S), a modified version of the MARS with all references to maths replaced with statistics. | 1 (low anxiety) to 5 (high anxiety) |
| sticsa_trait_score | Numeric | Averaged score on the Trait Anxiety subscale of the State-Trait Inventory for Cognitive and Somatic Anxiety. | 1 (not at all) to 4 (very much so) |
| sticsa_pre_state_score | Numeric | Averaged score on the State Anxiety subscale of the State-Trait Inventory for Cognitive and Somatic Anxiety, pre-MCQ. | 1 (not at all) to 4 (very much so) |
| sticsa_post_state_score | Numeric | Averaged score on the State Anxiety subscale of the State-Trait Inventory for Cognitive and Somatic Anxiety, post-MCQ. | 1 (not at all) to 4 (very much so) |
| mcq_score | Numeric | Total (summed) score on the MCQ questions. | 0 (all incorrect) to 5 (all correct) |
If you have some experience with R, you are welcome to instead use another dataset that you are familiar with or are keen to explore. However, remember that anything you upload to the training Posit Cloud workspace is visible to all workspace admins, so keep ethical and GDPR considerations in mind.
Grammar of Graphics
The {ggplot} package is a bit of a universe in its own right. Like R itself, it’s extremely powerful and versatile, and that also means there are a near-endless variety of things you could do and different ways to do them.
What we’ll cover in this tutorial is the core structure of building data visualisations in {ggplot2}, so that you have a solid base to build your own designs on. No matter what you’re trying to do, there’s nearly always a blog post, help document, or Stack Overflow post titled “How to do [whatever] in ggplot” to point you in the right direction!
Layers
Plots in {ggplot2} are built in layers. Each layer adds to or changes something about the plot; these can be big elements, like determining the type of plot to create, to small details like editing axis labels or changing colours. In {ggplot2}, each layer is created or edited by a dedicated function.
If it helps, you can think of layers like different colours in a linocut print. Each additional layer of colour adds a bit more to the overall picture, building up from big blocks of colour to small details.
To build a visualisation in {ggplot2}, it’s a good idea to build your plot in the same way, from big picture to small detail. We will use the following general structure:
dataset |>
aesthetics_mapping +
choose_type_of_plot +
add_more_elements +
edit_labels_or_colours +
apply_a_themeThese are guidelines, but the general => specific flow is for a good reason: like all code layers are evaluated sequentially from top to bottom. So, it’s best to get the big pieces in place first, then fine-tune, than to have those fiddly bits overwritten by a major change at the end of the code. This structure also makes the code easier to read which is key for reproducibility.
Notice as well that layers are added to a plot object with + and NOT with |>. This is specific to {ggplot2} (AFAIK!) and is easy to forget, but don’t worry - it’s such a common thing that {ggplot2} has a very friendly error message for fixing it.
mapping must be created by aes()
The actual error that pops up when you use a pipe instead of + isn’t super transparent. However, there is a very friendly reminder directly underneath to nudge you in the right direction. It’s a good reason to always read the error message in full!
anx_scores |>
ggplot(aes(x = mcq_sum)) |> ## Using |> to add a layer
geom_histogram()Error in `geom_histogram()`:
! `mapping` must be created by `aes()`
ℹ Did you use `%>%` or `|>` instead of `+`?
Mapping
As we saw in the error just above, mapping is created with aes(). This little function defines the aesthetics of the plot - in other words, this is how you tell R what data you want it to plot. We’ll use the following general format to set up a plot:
dataset_name |>
ggplot(aes(x = variable_on_x_axis, y = variable_on_y_axis, ...))The ... takes additional arguments to add things like colour and fill.
Geoms, etc.
So, how do we actually add layers? There are several common types of functions with shared prefixes that do particular things. We’ll meet lots of examples of them just below, but as a quick reference for some of the more common function types:
geom_*(): Draw geometric objects to represent the data.stat_*(): Add elements to the plot calculated with statistical functions.scale_*(),labs(), andlims(): Adjust the appearance of the axes (labels, title, limits, etc.) or quickly adjust the labels or limits onlyguide_*(): Make adjustments to the scales or to other interpretational elements of the plot (such as legends for categories)theme_*(): Apply a pre-made theme to the entire plot
See the {ggplot} reference documentation for a comprehensive list and detailed guide to these functions and more.
Right, the best way to get a handle on these functions is to start building plots! So let’s jump in.
Histograms and Density Plots
Let’s start with a histogram, a very common type of visualisation that represents the frequency of each value in a variable. We’re going to first create a histogram of the mcq_score variable in the anx_scores dataset.
For a basic histogram, we need three elements. First, pipe in the data; then, set the aesthetics; then, use a geom function to draw the plot.
## Pipe in the data
anx_scores |>
## Set the aesthetics
ggplot(aes(x = mcq_score)) +
## Draw the plot with a geom
geom_histogram()`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Hooray! The downside is… that’s pretty ugly. For a quick glance this isn’t really any better than hist(). Let’s make it better by doing the following:
- Adjusting the binwidth to present the values more sensibly
- Adding colour and fill
- Adjusting the axis labels and tick marks
- Applying a theme.
If you haven’t yet, you might want to pull up the help documentation for geom_histogram() to get a look at the options available.
Adjusting Binwidth
The message from the previous output gave us a helpful tip to deal with those weird bins:
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
So, let’s use binwidth = to choose a more appropriate value for this data. Because these are the summed MCQ scores, the values can only be whole numbers between 0 and 5. So, let’s try setting the binwidth to 1.
anx_scores |>
ggplot(aes(x = mcq_score)) +
geom_histogram(binwidth = 1)
This is better, but now we’ve got some new problems: one, the automatically generated axis now only has 0, 2, and 4, when it would be better to have 0 - 5; and the histogram now looks like quite a shapeless lump. Let’s work on the colour first.
Colour and Fill Manual
It’s not obvious from the help documentation unfortunately (just mentioned in passing under the ... argument) but this function, and most geom_*() functions!, will take the arguments colour and fill. For this, we give each argument a string with a colour name in it. You can just guess at colour names until you find one you like, or you can refer to this massive list of R colour names to pick one.
Here’s an example of how that looks with some very boring colours, that’s a bit easier on the eyes than the dark grey lump above.
anx_scores |>
ggplot(aes(x = mcq_score)) +
geom_histogram(binwidth = 1,
colour = "black",
fill = "lightgray")
This would be fine for formal reporting, but we’re going to be all about the aesthetics today and those colours are downright dreary. Let’s instead borrow from the Sussex official colour palette to set custom colours. Feel free to choose whatever you like. I’m choosing “Sussex Flint”1 for the outline colour and “Deep Aquamarine” for the fill for the example below. Of course, R won’t recognise those specific colour names, so instead we need to provide the “hex code” for this specific colour, in quotes and preceded by a hash (#).
You could also use an online colour picker like this one to choose any hex code for any colour you can dream of to use!
anx_scores |>
ggplot(aes(x = mcq_score)) +
geom_histogram(binwidth = 1,
colour = "#003b49",
fill = "#007a78")
What we can see here is that the “colour” in this case is the outline of the bars; and the “fill” is the colour inside the bars. This system is fairly consistent across geoms, but you may need to mix up the “colour” vs “fill” arguments depending on the exact effect you want.
With that sorted, let’s move onto the axis and ticks.
Adjusting Axes with scale_*()
You can use the labs() shortcut to adjust labels easily, but personally I (JM) prefer to go the long way round and use the scale_*() family instead. These functions have specific names depending on which axis scale you want to adjust, and the way that data is measured. In this case I want to adjust the x-axis and the data is continuous2, so the function I need is scale_x_continuous().
The advantage of going through the extra trouble to use the scale_*() functions is that they allow you to adjust everything about that scale at once. So, I can use the arguments for scale_x_continuous() to change the axis label with name =, the breaks with breaks =, the limits with limits = …well, I’m sure you get the idea!
I can also do the same with the y axis, using (as you might have guessed) scale_y_continuous().
anx_scores |>
ggplot(aes(x = mcq_score)) +
geom_histogram(binwidth = 1,
colour = "#003b49",
fill = "#007a78") +
scale_x_continuous(name = "Total MCQ Score (out of 5)",
breaks = c(0:5)) +
scale_y_continuous(name = "Frequency",
breaks = seq(from = 0,
to = 120,
by = 20),
limits = c(0, 120))
If you have a look at the arguments breaks and limits, you can see why we took some time early on learning about numeric vectors. breaks requires a vector of values indicating where the breaks should be, and limits requires a vector of two values, one for the minimum limit and the other for the maximum limit.
I’ve also thrown in an optional new function, seq(), to create the y-axis breaks. This handy little base-R function generates a sequence of values, given the first and last values in the sequence and how much you want to increment by. I’ve asked it to start at 0, end at 120, and increment by 20 each time, instead of typing c(0, 20, 40, ...) myself, to generate the numbers for labeled breaks on the y-axis.
Adding a Theme
That’s actually looking pretty snazzy at this point. The last thing we might want to do for now is to apply a theme. The default is theme_grey(), which I’m not a fan of. If you type theme_ into a code chunk, you’ll get a list of options to try out and find one you like.
However, I’m going to go straight the most extRa option, namely: an inbuilt APA-style theme from the {papaja} package, papaja::theme_apa(). Because I like how it looks, I’ll use the box = TRUE argument to draw a box around the whole plot. Teal (excuse me, Deep Aquamarine) isn’t a particularly APA-standard colour, but I like it, so there.
anx_scores |>
ggplot(aes(x = mcq_score)) +
geom_histogram(binwidth = 1,
colour = "#003b49",
fill = "#007a78") +
scale_x_continuous(name = "Total MCQ Score (out of 5)",
breaks = c(0:5)) +
scale_y_continuous(name = "Frequency",
breaks = seq(from = 0,
to = 120,
by = 20),
limits = c(0, 120)) +
papaja::theme_apa(box = TRUE)
There you have it! That’s looking pretty nice, I think.
Now that you have a final plot, make sure to save the plot into an object, so that you can refer easily to it later. I’m going to call mine anx_viz_hist.
anx_viz_hist <- anx_scores |>
ggplot(aes(x = mcq_score)) +
geom_histogram(binwidth = 1,
colour = "#003b49",
fill = "#007a78") +
scale_x_continuous(name = "Total MCQ Score (out of 5)",
breaks = c(0:5)) +
scale_y_continuous(name = "Frequency",
breaks = seq(from = 0,
to = 120,
by = 20),
limits = c(0, 120)) +
papaja::theme_apa(box = TRUE)To lock it in, let’s get some practice using these functions with a minimally different example.
Splitting by Group
Note: This gets a bit into the weeds again, so if you’d like to skip this section, feel free to jump down to the next type of plot.
To make our histogram more interesting and informative, we may want to represent the MCQ scores for the two MCQ groups separately, since this was our independent manipulation. We’ll look at two ways to do this: as an overlay, or with two separate histograms.
Overlay
A first option is to overlay the two histograms with transparency reduced so the overlap is visible. To do this, we’ll need to do a few things, described below.
1anx_scores |>
2 ggplot(aes(x = mcq_score, fill = mcq)) +
3 geom_histogram(binwidth = 1, alpha=0.5, position="identity", colour = "#003b49") +
4 scale_x_continuous(name = "MCQ Score",
limits = c(0, 6)) +
5 scale_y_continuous(name = "Frequency") +
6 scale_fill_discrete(name = "MCQ Condition",
type = c("#007a78", "#7da1c4")) +
7 papaja::theme_apa(box = TRUE)- 1
-
Take the
anx_scoresdataset, and then - 2
-
Draw a plot, using
mcq_scoreon the x-axis andmcqgroup for different fill colours - 3
- Draw a histogram, with binwidths of one, transparency (alpha) at 50%, positioned on the same axis, with a “Sussex Flint” border colour
- 4
- Adjust the x-axis set the name and limits
- 5
- Adjust the y-axis to set the name
- 6
-
Adjust the fill for the
mcqgroups to set the name and provide colours for the two different groups - 7
- Add a theme.

The result with this particular data is, I would say, uninspiring. It’s fine but it’s a bit hard to read when the bins are this big. It might be better in this case to instead split the histogram into two panels, which we can do with…
Facet Wrap
We can use the facet_wrap() function to split our single plot into two based on a grouping variable - here, again, mcq. I’ve also changed the settings on the y-axis to adjust to the new scale.
anx_scores |>
ggplot(aes(x = mcq_score)) +
geom_histogram(binwidth = 1,
colour = "#003b49",
fill = "#007a78") +
scale_x_continuous(name = "Total MCQ Score (out of 5)",
breaks = c(0:5)) +
scale_y_continuous(name = "Frequency",
breaks = seq(from = 0,
to = 70,
by = 10),
limits = c(0, 70)) +
facet_wrap(~mcq) +
papaja::theme_apa(box = TRUE)
There are a few more things I’d like to tweak about this plot. First, I’d love to have the maths and stats groups in different colours to differentiate them visually. Second, I want to format the “maths” and “stats” labels at the top of the plot.
To do the first, I’m going to actually go back up to the aes() mapping function and add mcq as the variable that determines both colour and fill, and drop the color and fill arguments from geom_histogram().
Next, I’m going to add a vector of names and values to the labeller = as_labeller(...) argument of facet_wrap(), which adds nice labels above each mini-plot.
Finally, I use two more scale_*() functions to set the colour and fill for each group, with the order of the colours in the same order as the values of the mcq variable. The guide = "none" argument removes the automatically generated legend, which I didn’t need because I added the labels.
Whew! That’s it - here’s the final product.
anx_scores |>
ggplot(aes(x = mcq_score, colour = mcq, fill = mcq)) +
geom_histogram(binwidth = 1) +
scale_x_continuous(name = "Total MCQ Score (out of 5)",
breaks = c(0:5)) +
scale_y_continuous(name = "Frequency",
breaks = seq(from = 0,
to = 70,
by = 10),
limits = c(0, 70)) +
facet_wrap(~mcq,
labeller = as_labeller(
c(`maths` = "Maths MCQs",
`stats` = "Stats MCQs"))) +
scale_color_manual(values = c("#003b49", "#1b365d"),
guide = "none") +
scale_fill_manual(values = c("#007a78", "#7da1c4"),
guide = "none") +
papaja::theme_apa(box = TRUE)
Again, once I’m happy with this, I should save it into an object for later use.
How are you doing? This is a bunch of new information, so don’t worry if you feel a bit overwhelmed. Honestly when I am constructing plots, a good portion of the time I spend to make them is either 1) digging back through my own previous documents to lift code I’ve already written or 2) searching online for help posts to figure out how to do things! I didn’t know how to do basically any of the formatting for the facet-wrapped plot above, and it took me about an hour of testing and searching to figure it out. I hope having these templates saves you a bit of time in the future.
Barplots
With some key ideas for building plots under our belts, let’s have a look at creating visualisations of categorical data.
To start with, we’ll need some (preferably interesting) categorical data to work with. Let’s begin by creating a categorical variable out of one of our continuous variables to practice what we covered a the previous tutorial. (This kind of operation is often not recommended for real analysis, but we’re learning so, y’know, it’s fine.)
Now, let’s get started on our barplot.
To keep this to a 2x2 comparison, let’s compare our new STARS help category in male vs female participants. So, we’ll need to filter before beginning the plot.
Then, we’ll put our variable of interest - here, stars_help_cat - on the x-axis. Remember that in plots like this, the y-axis is the number of responses, and as we saw with histograms just above, R can do that for us automatically. So, we’ll skip the y-axis and instead add gender as a fill variable. Then, we’ll ask R to draw the right kind of geom for us - the transparently named geom_bar() - and throw on a nice theme on the end for good measure.
anx_scores |>
dplyr::filter(gender %in% c("male","female")) |>
ggplot(aes(x = stars_help_cat, fill = gender)) +
geom_bar() +
papaja::theme_apa(box = TRUE)
Hmm. This barplot is okay, but it’s not easy to compare the gender categories directly. The bars are filled with different colours based on the gender variable, which we wanted, but they’re stacked on top of each other!
Position Dodge
To “fix” this (assuming it isn’t what you want - it might be), we can use the position argument in geom_bar() to move the bars side by side, instead of stacked on top of each other. This option is called “dodge”, so let’s see how it looks now:
anx_scores |>
dplyr::filter(gender %in% c("male","female")) |>
ggplot(aes(x = stars_help_cat, fill = gender)) +
geom_bar(position = "dodge") +
papaja::theme_apa(box = TRUE)
Nice! Already that’s much better, and looking more like we might expect a grouped barplot to look. We do need to do something about those awful colours, but first let’s fix something that’s bothering me: the order of the x-axis categories.
Reordering Categories
The stars_help_cat variable on the x-axis has automatically ordered the categories in alphabetical order, “high” and then “low”. As I am a native speaker of a language that reads left to right, I’d prefer to have “low” on the left and “high” on the right3. So, how can I change the order of these categories?
Option 1: Scale Labels
We’ve seen the scale_*() functions already - this time, we have a discrete x-axis variable, so we need scale_x_discrete(). We’ve even already seen the limits = argument used to set the upper and lower bound of the plot. Here, we can use this argument to specify the order we want the categories to be displayed in as well.
anx_scores |>
dplyr::filter(gender %in% c("male","female")) |>
ggplot(aes(x = stars_help_cat, fill = gender)) +
geom_bar(position = "dodge") +
scale_x_discrete(limits = c("low", "high")) +
papaja::theme_apa(box = TRUE)
Option 2: Factor Levels
Especially if we were planning to use this variable in multiple plots or analysis, it would be better to convert our stars_help_cat variable to a factor would let us specify an order of factor levels. We would only have to do this once, then the order would be the same for all subsequent operations.
Let’s see how that would work. Here I’m using mutate() to do the factor conversion before I pipe the data on into ggplot(). We’ve seen the factor() function before, just adding the levels = argument to reorder the levels in the order we wanted. The resulting plot then has the levels in the right order.
## Convert stars_help_cat to factor in the dataset
anx_scores <- anx_scores |>
dplyr::mutate(stars_help_cat = factor(stars_help_cat,
levels = c("low", "high"))
)
## Rerun the plot code
anx_scores |>
dplyr::filter(gender %in% c("male","female")) |>
ggplot(aes(x = stars_help_cat, fill = gender)) +
geom_bar(position = "dodge") +
papaja::theme_apa(box = TRUE)
Right, now that all the structural stuff is out of the way, let’s finish up.
Keep in mind that ggplot() is just applying the labels as strings on top of your plot. This means that if you, for instance, reorder your categories on the x-axis but then forget to update the order of axis labels in your code, your plot will be wrong. It’s always important to double-check that the labels you add do in fact correspond correctly to the order of the categories in the plot to avoid mislabeling.
As an example, the code below is identical to the solution above, except I’ve commented out the limits argument in scale_x_discrete() that reorders the categories on the x-axis. However, I’ve left in the labels argument that whacks on the axis labels in the reordered order. Comparing to the plot above, we can see by the sizes of the bars that the actual categories are in high-low order, despite what the nicely formatted labels say.
anx_scores |>
dplyr::filter(gender %in% c("male","female")) |>
ggplot(aes(x = stars_help_cat, fill = gender)) +
geom_bar(position = "dodge") +
scale_x_discrete(name = "STARS Asking for Help Anxiety",
## limits = c("low", "high"),
labels = c("Low", "High")) +
scale_y_continuous(name = "Count",
limits = c(0, 175),
breaks = seq(0, 175, by = 25))+
scale_fill_discrete(name = "Gender",
labels = c("Female", "Male"),
type = c("#1b365d", "#f2c75c")) +
papaja::theme_apa(box = TRUE)
One method to avoid this is to add formatting outside of ggplot(). In essence, instead of pasting on the labels afterwards, you edit the data just before piping into ggplot() so that the labels are already correctly formatted and in the right order. For this current situation with one plot, it’s much of a muchness whether the relabeling is in the data (via mutate) or in the axis labels. However, if you wanted to create multiple plots with these variables, relabeling the data just before plot creation would likely be more efficient than pasting in/checking the labels for each plot.
anx_scores |>
dplyr::filter(gender %in% c("male","female")) |>
## This backticked name is kind of an extreme step! But it's possible
dplyr::mutate(`STARS Asking for Help Anxiety` = factor(stars_help_cat,
levels = c("low", "high"),
labels = c("Low", "High")),
Gender = factor (gender,
levels = c("female", "male"),
labels = c("Female", "Male"))) |>
## Note the formatted names here, and the backticks for the illegal variable name
ggplot(aes(x = `STARS Asking for Help Anxiety`, fill = Gender)) +
geom_bar(position = "dodge") +
scale_y_continuous(name = "Count",
limits = c(0, 175),
breaks = seq(0, 175, by = 25))+
scale_fill_discrete(type = c("#1b365d", "#f2c75c")) +
papaja::theme_apa(box = TRUE)
Raincloud and Violin Plots
For this section, we’ll focus on making nice raincloud plots, which we teach repeatedly in UG Year 1. For both of these plots, we’re going to expand our {ggplot} vocab by explicitly including a y-axis variable.
Raincloud plots are introduced in the second half of first year for Sussex UGs. I love them especially because they pack in so much useful information in a reasonably easy-to-read plot. We teach UGs to read them and practice them several times, including in an assessment. Despite their complexity, they are extremely easy to create thanks to the recently-released {ggrain} package that does most of the heavy lifting for you.
Let’s start by creating a basic raincloud plot of STARS Test subscale score split up by MCQ group. As we already covered above, I’m going to clean up the plot a bit with some scale formatting and a theme.
1anx_scores |>
2 ggplot(aes(x = mcq, y = stars_test_score)) +
3 geom_rain() +
4 scale_x_discrete(name = "MCQ Condition", labels = c("Maths", "Stats")) +
5 scale_y_continuous(name = "Mean STARS Test Score") +
6 papaja::theme_apa(box = TRUE)- 1
-
Take the dataset
anx_scores, and then - 2
-
Set up a plot, mapping
mcqto the x-axis andstars_test_scoreto the y-axis, plus - 3
- Draw a raincloud plot, plus
- 4
- Adjust the discrete x axis by giving it a name, and relabelling the group labels, plus
- 5
- Adjust the continuous y axis by giving it a name, plus
- 6
- Add an APA-style theme.

The default plot has three parts. From left to right, they are: a scatter of the data points; a boxplot; and a density plot of the data, turned on its side. If you tilt your head to the right, you can see that this looks a bit like a (very lumpy) cloud with the data falling like rain underneath.
We already have a good amount of information here, but one thing we are missing is the means in each group. We introduce raincloud plots in the context of t-tests, where the difference in group means is the key element of interest, so let’s add them to our plot, along with some colour adjustments.
Calculating Stats
To add means and CIs to our plot, we’ll need to calculate them. However, there’s no need to create a summary first - we can use a stat_*() function to calculate the necessary statistics inside our plot.
In this case, we can use stat_summary() to produce some summary statistics. To really make it quick, we’re going to quote the function mean_cl_boot, which calculates means and bootstrapped CIs. That’s it - just that one line does all the maths and adds the result to the plot. However, the new element will show up as black by default, so let’s also add in a colour to make sure we can actually see it. (I’ve chosen something obnoxious to make it easy to find!)
Once again, I’m going to save the plot in an object for later use, here anx_viz_rain.
1anx_viz_rain <- anx_scores |>
2 ggplot(aes(x = mcq, y = stars_test_score)) +
3 geom_rain() +
4 scale_x_discrete(name = "MCQ Condition", labels = c("Maths", "Stats")) +
5 scale_y_continuous(name = "Mean STARS Test Score") +
6 stat_summary(fun.data = mean_cl_normal, colour = "red") +
7 papaja::theme_apa(box = TRUE)
anx_viz_rain- 1
-
Take the dataset
anx_scores, and then - 2
-
Set up a plot, mapping
mcqto the x-axis andstars_test_scoreto the y-axis, plus - 3
- Draw a raincloud plot, plus
- 4
- Adjust the discrete x axis by giving it a name, and relabelling the group labels, plus
- 5
- Adjust the continuous y axis by giving it a name, plus
- 6
-
Calculate a summary of the data using the function
mean_cl_normalfor means and CIs, and add to the plot in red, plus - 7
- Add an APA-style theme.

Scatterplots
Let’s now turn to continuous variables only and produce a scatterplot to visualise a linear relationship.
To start, let’s set up a basic plot using the STICSA pre- and post-MCQ state anxiety scores. The new element is geom_point(), which draws points 4. In that layer, we also have two new arguments. The first is alpha, which sets the transparency of the points between 1 (solid) and 0 (invisible). The second is position = "jitter", which introduces some random noise into the placement of the points so make it easier to see overlapping values.
1anx_scores |>
2 ggplot(aes(x = sticsa_pre_state_score, y = sticsa_post_state_score)) +
3 geom_point(alpha = 0.4, position = "jitter") +
4 scale_x_continuous(name = "STICSA Pre-MCQ State Anxiety Score",
limits = c(1, 4)) +
5 scale_y_continuous(name = "STICSA Post-MCQ State Anxiety Score",
limits = c(1, 4)) +
6 papaja::theme_apa(box = TRUE)- 1
-
Take the dataset
anx_scores, and then - 2
-
Set up a plot, mapping
sticsa_pre_state_scoreto the x-axis andsticsa_post_state_scoreto the y-axis, plus - 3
- Draw points, with 40% transparency and some random noise to separate them, plus
- 4
- Adjust the continuous x axis by giving it a name, and setting the limits to 1 and 4, plus
- 5
- Adjust the continuous x axis by giving it a name, and setting the limits to 1 and 4, plus
- 6
- Add an APA-style theme.

Not bad, eh? Right off the bat we get a very passable scatterplot. Now, assuming that this scatterplot might accompany a linear model, we might also like to draw that model on our plot.
Line of Best Fit
Unfortunately the function to add a line of best fit isn’t as intuitively named as you might expect. Both geom_smooth() and stat_smooth() will do what we want here; in first year, we teach UGs geom_smooth() for this, so we’ll use that here too. (There is a geom_line() but isn’t quite the same!)
As we did with the histogram previously, we can add colour and fill arguments to adjust the colour. Here, colour determines the colour of the line, and fill determines the colour of the shaded confidence intervals.
anx_scores |>
ggplot(aes(x = sticsa_pre_state_score, y = sticsa_post_state_score)) +
geom_point(alpha = 0.4, position = "jitter") +
1 geom_smooth(method = "lm",
colour = "darkblue",
fill = "darkcyan") +
scale_x_continuous(name = "STICSA Pre-MCQ State Anxiety Score",
limits = c(1, 4)) +
scale_y_continuous(name = "STICSA Post-MCQ State Anxiety Score",
limits = c(1, 4)) +
papaja::theme_apa(box = TRUE)- 1
-
Add a line of best fit using the “
lm” function, with a dark blue line and dark cyan CIs.

Colour and Fill Palette
In the previous Colour and Fill section we chose colours by name or hex code to add to a plot. However, there are a wide variety of pre-made colour palettes to make use of. So, instead of choosing individual colours, you can quickly apply existing colour palettes.
Looking for the right colour scheme? Try these resources:
- This blog post on R colour palettes is a great overview of some of the most popular options available, with lots of examples.
- The {paletteer} package for quick access to any R colour palette and to an overview of all the palettes available
A popular palette and the one we’ll use for this task is from the lovely {viridis} colour palette package, designed to be pretty, print-friendly, and accessible (robust to colourblindness).
To use the palettes, we start out by dropping the colour and fill arguments from within geom_smooth(), and instead setting colour and fill arguments at the beginning of the plot, in the aes() function along with x and y. This will apply the same colour palette across all elements in the plot that can have colour - namely, the points from geom_point(), and the line of best fit from geom_smooth(). By setting them both to gender, the colours will be consistent across gender categories between all of these elements. Incidentally, doing this will also split our single LM line into four lines: one for each level of the gender variable.
anx_scores |>
1 ggplot(aes(x = sticsa_pre_state_score, y = sticsa_post_state_score,
colour = gender, fill = gender)) +
geom_point(alpha = 0.4, position = "jitter") +
geom_smooth(method = "lm") +
scale_x_continuous(name = "STICSA Pre-MCQ State Anxiety Score",
limits = c(1, 4)) +
scale_y_continuous(name = "STICSA Post-MCQ State Anxiety Score",
limits = c(1, 4)) +
papaja::theme_apa(box = TRUE)- 1
-
Set up the plot, mapping
sticsa_pre_state_scoreto the x-axis andsticsa_post_state_scoreto the y-axis, and both colour and fill togender.

A start - but let’s do something about that horrible default colour palette!
Just as we had scale_x... and scale_y... functions for specifying the x and y scales, we have scale_colour... and scale_fill... for specifying the details about the way the colour and fill appears. Here, I’m using viridis, which has its own dedicated functions, scale_colour_viridis() and scale_fill_viridis()5. Within those two new elements, we have the same two elements:
name: gives a title for the colour/fill legenddiscrete: tells {viridis} whether the variable to be coloured/filled is discrete or not. Here it is (it’sgender, a categorical variable), so we set this toTRUE.labels: applies labels to the levels of the variables.
As always, with a finished plot, save into a new object, here anx_viz_scatter.
anx_viz_scatter <- anx_scores |>
ggplot(aes(x = sticsa_pre_state_score, y = sticsa_post_state_score,
colour = gender, fill = gender)) +
geom_point(alpha = 0.4, position = "jitter") +
geom_smooth(method = "lm") +
scale_x_continuous(name = "STICSA Pre-MCQ State Anxiety Score",
limits = c(1, 4)) +
scale_y_continuous(name = "STICSA Post-MCQ State Anxiety Score",
limits = c(1, 4)) +
1 scale_colour_viridis(name = "Gender",
discrete = TRUE,
labels = c("Female", "Male", "Non-Binary", "Other/Prefer Not To Say")) +
scale_fill_viridis(name = "Gender",
discrete = TRUE,
labels = c("Female", "Male", "Non-Binary", "Other/Prefer Not To Say")) +
papaja::theme_apa(box = TRUE)
anx_viz_scatter- 1
- Adjust the fill and colour by applying the viridis colour palette, giving each a name, using discrete values, and relabelling the categories.

That’s looking fairly nice now!
3D Plots
Note: 3D plots really should be used with caution, and this bit is just for fun - feel free to skip down to the reporting section if you’re not inclined. If you’re keen to have a go, the code below gets you an interactive, useable 3D scatterplot in no time at all, but think carefully about whether this is really the best way to represent your data.
In something of a departure for us, we’re going to use a package that isn’t {ggplot2} for this next bit. (Shock! Horror!) Instead, we’ll use the plot3d() function from the {rgl} package to get a snazzy interactive 3D-plot that does rotation and zoom right out of the box. The syntax might look slightly unfamiliar, but it’s got the same underlying idea as what we’ve seen so far.
In order to have points coloured by MCQ type, I’ve created a new variable in the dataset assigning colours based on the values of the mcq variable. This and the rest of the code is borrowed straight from R Graph Gallery’s very helpful mini-tutorial on {rgl}.
# Add a new column with color
anx_scores <- anx_scores |>
dplyr::mutate(
plot_colours = ifelse(mcq == "maths", "orange", "darkcyan")
)Next, we’ll use the rgl::plot3d() function to generate the plot. The x, y, and z arguments each are specified using $ subsetting, and the colours in col come from the new variable I created just above. The rest of the arguments change the appearance of the points (type and size) and set the axis labels.
rgl::plot3d(
x = anx_scores$sticsa_trait_score,
y = anx_scores$sticsa_post_state_score,
z = anx_scores$sticsa_pre_state_score,
col = anx_scores$plot_colours,
type = "s",
radius = .05,
xlab = "STICSA Trait", ylab = "STICSA Post State", zlab = "STICSA Pre State"
)
Reporting
With Quarto
As ever, I strongly recommend referring to the Quarto Guide as the first stop for possibilities and/or problem-solving with Quarto.
In a previous section, we created a lovely scatterplot for a (hypothetical) linear model. Let’s see now how we could adjust the way that figure would appear in a Quarto document.
To begin, we’ve already stored the final output for the LM plot we created above in a new object, anx_viz_scatter. We’re just going to put our plot object in a new code chunk, in the document where we want it to appear. Equally, this could be the code that produces the plot; it doesn’t matter either way for the plot itself, but I would prefer the object-name method to avoid cluttering up the writeup with lots of code.
```{r}
anx_viz_scatter
```To change or add settings for a specific plot, we will use code chunk options, which we previously encountered in Tutorial 04. These options let us determine the way the code chunk behaves. Here, we’ll also see how we can use those code chunk options to style and format our plot.
Captions and Alt Text
First, we might add a caption to the plot, using the fig-cap option. You may have noticed that we didn’t give our plot a caption/title previously, and this is why. You can of course write whatever you like, but here’s an example.
```{r}
#| fig-cap: "Scatterplot and regression lines of mean pre- and post-MCQ state anxiety scores by gender identity"
anx_viz_scatter
```
You should also add alt text to your figures to make them readable by screenreaders and other assistive technology. It is important to describe the insights from the visualisations clearly, such as the key patterns, connections, or findings, to ensure they are accessible. This Medium article on scientific alt text gives some clear guidelines for keeping alt text informative, brief, and useful.
We can add alt text to our figure using the fig-alt option:
```{r}
#| fig-cap: "Scatterplot and regression lines of mean pre- and post-MCQ state anxiety scores by gender identity"
#| fig-alt: "Scatterplot and regression lines of mean pre- and post-MCQ state anxiety scores by gender identity, with lines for female, male, and people who responded other or prefer not to say showing a very similar positive relationship, while the line for non-binary people is essentially flat."
anx_viz_scatter
```Cross-Referencing
As with tables, we can also include automatic cross-referencing for figures in our document. This means that Quarto will automatically number our figures and include links to those figures in the text of the document.
To do this, let’s first give our figure a label. Labels are essentially names for this particular figure that will allow us to refer to it in the text. In order to be recognised for cross-referencing, the label must start with the fig- prefix.
```{r}
#| label: fig-anx-scatter
#| fig-cap: "Scatterplot and regression lines of mean pre- and post-MCQ state anxiety scores by gender identity"
anx_viz_scatter
```Then, use this label with an @ symbol in the text of the document to generate the cross-reference. For instance, I could write:
The result of this analysis is visualised in
@fig-anx-scatter.
Which would appear when rendered as below - notice the caption now has “Figure 1” added to it to correspond to the in-text numbering.
The result of this analysis is visualised in Figure 1.
Exporting and Linking
If you want to export and display your data visualisations outside the document containing the code that creates them, you will first need to save them as images, then upload or insert them in the relevant document. Conveniently, {ggplot2} contains an inbuilt function for saving plots. Not-conveniently, it works with file paths slightly differently than other functions we’ve seen (like the {readr}) functions.
First we need a folder to store our images in.
Saving Plots
Besides being able to quickly print out out plots, saving them in objects also makes it easy to save them with ggsave(). By default, this function will save the most recent plot, which for us was the scatterplot anx_viz_scatter. However, I’d strongly recommend always specifying the plot to save, to avoid accidental overwriting.
ggsave(
## Name for the image file, including extension
filename = "anx_viz_scatter.jpg",
## Object containing the plot
plot = anx_viz_scatter,
## Path to the folder to save in
path = here::here("images")
)The default formatting of this plot, however, isn’t my favourite - it’s come out quite square. Instead, I’d like to adjust the size of the saved plot image, which of course I can do with height and width arguments:
ggsave(
## Name for the image file, including extension
filename = "anx_viz_scatter.jpg",
## Object containing the plot
plot = anx_viz_scatter,
## Path to the folder to save in
path = here::here("images"),
## Adjust width
width = 10
)Finally, ggsave() has a create.dir argument to create a directory simultaneous with saving. Personally I’d rather create my folders separately, but if you prefer to it this way you could.
Insert and Link
Now that we have an image file, we could insert it into a document as we could with any other image. However, there’s an issue here. What if, for example, you update your dataset - a participant withdraws, or you gather more data, etc. Now the image you’ve saved is no longer correct.
An option is to link to images, rather than insert them directly. In Microsoft Word, for instance, you can use the “Insert and Link” option rather than straightforward “Insert”.

This method has the following advantages:
- The image is linked, which means that if you re-run your code and re-save the new image, the linked image in the document will update automatically.
- The image is also inserted which means that if you send this document to someone else, upload it to Drive etc., there will still be a copy of the image in the document, whatever was most recent when it was sent.
Very well done on your hard work so far. We’ll keep using our {ggplot2} chops when we dig into analysis next week, and see how we can easily create plots to accompany those analyses using built-in functions.
Footnotes
You may notice a certain commonality to the colour theme of this website!↩︎
Yes, this is not technically true because the values can only be whole numbers between 0 and 5, but what matters here is that R thinks this is continuous data because it is numeric.↩︎
See e.g. Shaki and Fischer, 2008↩︎
Tragically,
geom_scatter()isn’t a thing, not matter how many times I try!↩︎Naturally, if you want to use palettes from a different package, you’ll need to use the functions from that package instead! Those functions may also work a bit differently, so make sure you read the help documentation if you get stuck.↩︎






