library(dplyr)
## OR
library(tidyverse)

05: Mutate
Overview
This tutorial focuses on a single essential {dplyr} function: mutate().
Hardworking, versatile, and indispensable, mutate() makes changes within a given dataset by creating new variables (columns).
Setup
Packages
We will again be focusing on {dplyr} today, which you can load by loading {tidyverse}.
Data
Today we’re continuing to work with the same dataset as last week. Courtesy of fantastic Sussex colleague Jenny Terry, this dataset contains real data about statistics and maths anxiety.
Codebook
There’s quite a bit in this dataset, so you will need to refer to the codebook below for a description of all the variables.
This study explored the difference between maths and statistics anxiety, widely assumed to be different constructs. Participants completed the Statistics Anxiety Rating Scale (STARS) and Maths Anxiety Rating Scale - Revised (R-MARS), as well as modified versions, the STARS-M and R-MARS-S. In the modified versions of the scales, references to statistics and maths were swapped; for example, the STARS item “Studying for an examination in a statistics course” became the STARS-M item “Studying for an examination in a maths course”; and the R-MARS item “Walking into a maths class” became the R-MARS-S item “Walking into a statistics class”.
Participants also completed the State-Trait Inventory for Cognitive and Somatic Anxiety (STICSA). They completed the state anxiety items twice: once before, and once after, answering a set of five MCQ questions. These MCQ questions were either about maths, or about statistics; each participant only saw one of the two MCQ conditions.
For learning purposes, I’ve randomly generated some additional variables to add to the dataset containing info on distribution channel, consent, gender, and age. Especially for the consent variable, don’t worry: all the participants in this dataset did consent to the original study. I’ve simulated and added this variable in later to practice removing participants.
General Format
The mutate() function is one of the most essential functions from the {dplyr} package. Its primary job is to easily and transparently make changes within a dataset - in particular, a tibble.
To make a single, straightforward change to a tibble, use the general format:
dataset_name |>
  dplyr::mutate(
    variable_name = instructions_for_creating_the_variable
  )

variable_name is the name of the variable that will be created by mutate(). This can be any name that follows R’s object naming rules. There are two main options for this name:
- If the dataset does not already contain a variable called variable_name, a new variable will be added to the dataset.
- If the dataset does already contain a variable called variable_name, the new variable will silently replace (i.e. overwrite) the existing variable with the same name.
Here, “silently” means that R overwrites the existing variable without flagging that it is doing this or asking you if you are sure, so it’s important to be aware of this behaviour, and to know what variables already exist in your dataset.
instructions_for_creating_the_variable tells the function how to create variable_name. These instructions can be any valid R code, from a single value or constant, to complicated calculations or combinations of other variables. You can think of these instructions exactly the same way as the vector calculations we covered earlier, and they must return either a single value or a series of values that is the same length as the existing dataset (i.e. one value per row).
Although creating or modifying variables will likely be the most frequent way you use mutate(), it has other handy features such as:
- Deleting variables
- Deciding where newly created variables appear in the dataset
- Deciding which variables appear in the output, depending on which you’ve used
See the help documentation for more by running help(mutate) or ?mutate in the Console.
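As a rough illustration of the three features listed above, here is a minimal sketch on a small made-up tibble. The object `toy` and its column names are invented for this example and are not part of the tutorial dataset; setting a variable to `NULL`, `.after`, and `.keep` are documented `mutate()` behaviours.

```r
library(dplyr)

# Hypothetical toy data, standing in for a real dataset
toy <- tibble::tibble(
  id = 1:3,
  score_a = c(2, 4, 6),
  score_b = c(1, 3, 5)
)

# Delete a variable by setting it to NULL
toy_dropped <- toy |>
  dplyr::mutate(score_b = NULL)

# Place the newly created variable directly after `id`
toy_placed <- toy |>
  dplyr::mutate(total = score_a + score_b, .after = id)

# Keep only the new variable and the variables used to create it
toy_used <- toy |>
  dplyr::mutate(total = score_a + score_b, .keep = "used")

names(toy_dropped)
names(toy_placed)
names(toy_used)
```

Comparing the three sets of names shows which columns each option keeps and where the new column lands.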
Adding New Variables
First, let’s see how to add new variables. Imagine we have found some collaborators to work with and we want to combine our datasets. To keep track of where the data came from, we want to add a lab variable at the start of our existing dataset containing the name of the university before we combine it with more data from elsewhere.
1. Take the dataset anx_data and then make the following changes:
2. Create a new variable, lab, that contains the value "Sussex"
3. Put this variable before the first variable in the existing dataset.
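The three annotated steps can be sketched as follows. This is a reconstruction from the annotations rather than the tutorial's original chunk, and `anx_toy` is a hypothetical stand-in with invented values; with the real data loaded, `anx_data` would take its place.

```r
library(dplyr)

# Hypothetical stand-in for anx_data (invented values)
anx_toy <- tibble::tibble(
  id = c("p1", "p2", "p3"),
  age = c(19, 23, 21)
)

with_lab <- anx_toy |>   # 1: take the dataset, and then
  dplyr::mutate(
    lab = "Sussex",      # 2: create `lab`, recycled across all rows
    .before = 1          # 3: put it before the first existing variable
  )

with_lab
ncol(anx_toy)    # unchanged, because the result was stored under a different name
ncol(with_lab)   # one more column than the input
```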
The new variable, lab, is added to the dataset, because anx_data doesn’t already contain a variable called lab. You can evaluate the success of this command by comparing the number of columns in anx_data in the Environment to the number in the tibble printed out above.
Because we haven’t assigned this change to the dataset, the version of anx_data in the Environment hasn’t changed.
Note that in this case, I’ve given a single value, "Sussex", as the content of the new variable. R will “recycle” this single value across all of the rows to create a constant. However, if I try to do this with a longer vector that doesn’t match the number of rows, I’ll get an error:
anx_data |>
  dplyr::mutate(
    lab = c("Sussex", "Glasgow"),
    .before = 1
  )

Error in `dplyr::mutate()`:
ℹ In argument: `lab = c("Sussex", "Glasgow")`.
Caused by error:
! `lab` must be size 465 or 1, not 2.
In this case I might need rep() (for creating vectors of repeating values), sample() (for creating random subsamples), or another helper function to generate the vector to add.
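As a minimal sketch of that idea, the code below builds a vector of the right length with `rep()` and with `sample()`. The tibble `anx_toy` and the alternating allocation of labs are assumptions made purely for illustration; `dplyr::n()` returns the number of rows in the current dataset.

```r
library(dplyr)

# Hypothetical six-row stand-in for a real dataset
anx_toy <- tibble::tibble(id = 1:6)

set.seed(1)  # makes the random draw below repeatable

labelled <- anx_toy |>
  dplyr::mutate(
    # Repeat the two labels until there is one value per row
    lab_alternating = rep(c("Sussex", "Glasgow"), length.out = dplyr::n()),
    # Draw one label at random for each row
    lab_random = sample(c("Sussex", "Glasgow"), size = dplyr::n(), replace = TRUE)
  )

labelled
```

Both helpers produce exactly one value per row, which is what `mutate()` requires when the input is longer than a single value.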
Changing Existing Variables
Next, let’s look at changing existing variables. For example, I know that gender and mcq are meant to be factors (also called “categorical data” in SPSS and elsewhere). So, let’s convert each of these two variables into the factor data type.
1. Take the dataset and make the following changes:
2. Convert the existing gender variable into a factor and overwrite the existing gender variable with the new version.
3. Convert the existing mcq variable into a factor and overwrite the existing mcq variable with the new version.
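The steps above can be sketched like this. The chunk is a reconstruction from the annotations, run on a hypothetical stand-in tibble (`anx_toy`, with invented values) so that it is self-contained.

```r
library(dplyr)

# Hypothetical stand-in for anx_data (invented values)
anx_toy <- tibble::tibble(
  gender = c("female", "male", "non-binary", "female"),
  mcq = c("maths", "stats", "stats", "maths")
)

anx_factors <- anx_toy |>        # 1: take the dataset, and then
  dplyr::mutate(
    gender = factor(gender),     # 2: overwrite gender with a factor version
    mcq = factor(mcq)            # 3: overwrite mcq with a factor version
  )

class(anx_toy$gender)      # still character in the input
class(anx_factors$gender)  # factor in the output
levels(anx_factors$mcq)
```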
Let’s look a little closer at the expression gender = factor(gender). The left side of the equals sign = is the name of the variable to be created in the dataset, gender. The right side, factor(gender), gives the instructions for how to create the information that the gender variable will contain. Since there is already a variable in the dataset called gender, the expression factor(gender) works on the existing version of the variable, then overwrites it into a variable with the same name. Here, the equals sign is working like the assignment operator <- for overwriting objects.
RepRoducibility: Overwriting variables
In most cases, you have to be wary of the implications of overwriting variables as accidentally rerunning the code out of order will lead to undetectable errors. Changing the data type of a variable is one instance in which overwriting is not going to be problematic - asking R to transform a factor variable into a factor variable will not lead to errors.
So far, we’ve written the code to create the lab variable and change the gender and mcq variables, but neither of these changes has been assigned to the dataset, so the version of anx_data in the Environment is still unchanged. As we’ve seen before, once we’ve checked the code works by examining the output, we then assign the output of these commands to the dataset to make those changes.
Composite Scores
Row-wise magic is good magic. - Hadley Wickham
A very common mutate() task is to create a composite score from multiple variables - for example, an overall trait anxiety score from our sticsa_trait items. Let’s create an overall score that contains the mean of the ratings¹ on each of the STICSA trait anxiety items, for each participant.
To do this, we need two new operations.
- The first new function, dplyr::c_across(), provides an efficient way to select multiple variables to contribute to the calculation - namely, by using <tidyselect> semantics.
- The second new function is actually a pair of functions, dplyr::rowwise() and dplyr::ungroup(). These two respectively impose and remove an internal structure to the dataset, such that each row is treated like its own group, and any operations are done within those row-wise groups.
Let’s see the combination of these functions in action.
The code below assumes a dataset structured so there is information from each participant on only and exactly one row in the dataset.
If your data has observations from the same participants on multiple rows, you will need to reshape your data or otherwise adapt the code to suit your data structure.
anx_data <- anx_data |>                                              # 1
  dplyr::rowwise() |>                                                # 2
  dplyr::mutate(                                                     # 3
    sticsa_trait_score = mean(c_across(starts_with("sticsa_trait")),
                              na.rm = TRUE)
  ) |>
  dplyr::ungroup()                                                   # 4

1. Overwrite the anx_data dataset with the following output: take the existing anx_data dataset, and then
2. Group the dataset by row, so any subsequent calculations will be done for each row separately, and then
3. Create the new sticsa_trait_score variable by taking the mean of all the values in variables that start with the string “sticsa_trait” (ignoring any missing values), and then
4. Remove the by-row grouping that was created by rowwise() to output an ungrouped dataset.
For lots more details and examples on rowwise() and row-wise operations with {dplyr} - including other scenarios in which a row-wise dataset would be useful - run vignette("rowwise") in the Console.
RepRoducibility: Invisible structures
Always ungroup a dataset at the end of a command. If you don’t, R will not say anything and will retain the invisible structure. All the code after the grouping is implemented will still run, but the structure imposed on the dataset can lead to errors in the results.
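To see what can go wrong, here is a small sketch on made-up data (the tibble `toy` and its columns are hypothetical). The same `mean()` call gives different answers depending on whether the row-wise structure is still attached.

```r
library(dplyr)

# Hypothetical toy data: two items for three participants
toy <- tibble::tibble(a = c(1, 2, 3), b = c(4, 5, 6))

# Row-wise totals, but the row-wise structure is left in place
still_rowwise <- toy |>
  dplyr::rowwise() |>
  dplyr::mutate(total = sum(dplyr::c_across(a:b)))

# The same result with the structure removed
ungrouped <- still_rowwise |>
  dplyr::ungroup()

# Intended: the mean of `total` across all participants
forgot_ungroup <- still_rowwise |>
  dplyr::mutate(grand_mean = mean(total))  # computed within each row

did_ungroup <- ungrouped |>
  dplyr::mutate(grand_mean = mean(total))  # computed across all rows

forgot_ungroup$grand_mean
did_ungroup$grand_mean
```

Neither version throws an error or warning, which is why the habit of finishing with `ungroup()` is worth building.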
Exercises
Try out the following exercises in the accompanying workbook.
In many multi-item measures, some items are reversed in the way that they capture a particular construct. In this particular example, items on the STICSA are worded so that a higher numerical response (closer to the “very much so” end of the scale) indicates more anxiety, such as item 4: “I think that others won’t approve of me”.
However, reverse-coded items are intended to capture the same ideas, but in reverse. A reversed version of item 17 might read, “I can concentrate easily with no intrusive thoughts.” In this case, a higher numerical response (closer to the “very much so” end of the scale) would indicate less anxiety. In order for these reversed items to be aligned with the other items on the scale, so that together they form a cohesive score, the coding of the response scale must be flipped: high becomes low, and low becomes high.
If the response scale is a numerical integer sequence, as this one is, then the simplest way to reverse-code the responses is to subtract every response from the maximum possible response plus one. Here, the STICSA response scale is from 1 to 4; the maximum possible response is 4, plus one is 5. So, to reverse-code the responses, we need to subtract each rating on this item from 5. A high score (4) will become a low score (5 - 4 = 1), and vice versa for a low score (5 - 1 = 4).
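The arithmetic above can be sketched as follows. The tibble `anx_toy` and its values are invented stand-ins, and `scale_max` is a helper object introduced only for this illustration; the variable names follow the item 17 variables used later in the tutorial. Note that this sketch overwrites the original variables.

```r
library(dplyr)

# Hypothetical stand-in for anx_data (invented responses on the 1-4 scale)
anx_toy <- tibble::tibble(
  sticsa_pre_state_17 = c(1, 2, 3, 4),
  sticsa_post_state_17 = c(4, 4, 2, 1)
)

scale_max <- 4  # maximum possible response on the STICSA scale

anx_toy <- anx_toy |>
  dplyr::mutate(
    # Subtract each response from (maximum + 1), i.e. from 5
    sticsa_pre_state_17 = (scale_max + 1) - sticsa_pre_state_17,
    sticsa_post_state_17 = (scale_max + 1) - sticsa_post_state_17
  )

anx_toy
```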
RepRoducibility: Overwriting variables
Unlike the previous example with changing variables to factor, this code could cause serious issues. If you were to rerun the code above a second time, you’d un-reverse the coding of these variables, which can lead to errors with the analysis. Towards the end of the tutorial, we will look at different ways to make your code more resilient to human error.
Conditionals
There are many functions out there for recoding variables (let’s wave cheerfully at dplyr::recode() as we cruise by it without stopping), but the following method, using dplyr::case_when(), is recommended because it is so versatile. It can be used to recode the values from one variable into a new one, but it can also combine information across variables and handle multiple conditionals. It essentially allows a series of if-else statements without having to actually have lots of if-else statements.
The generic format of dplyr::case_when() can be stated as follows:
dataset_name |>
  dplyr::mutate(
    new_variable = dplyr::case_when(
      logical_assertion ~ value,
      logical_assertion ~ value,
      .default = default_value
    )
  )

logical_assertion is any R code that returns TRUE and FALSE values, exactly as we saw previously with filter().
value is the value to assign to new_variable for the cases for which logical_assertion for that line returns TRUE.
The assertions are evaluated sequentially (from first to last in the order they are written in the function), and the first match determines the value. This means that the assertions must be ordered from most specific to least specific.
The .default argument gives a final value, default_value, that will be assigned to new_variable for any case that doesn’t match any of the previous assertions.
The assertions for dplyr::case_when() are the same as the ones we used previously in dplyr::filter(). In fact, if you need to test the assertion you are writing to ensure that your code will work as you want, you can try the same assertion in dplyr::filter() to check whether the cases it returns are only and exactly the ones you want to change.
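A short sketch may help show why the order of the assertions matters, and how `filter()` can be used to check an assertion first. The tibble `toy`, its `score` values, and the band labels are all invented for illustration; the `.default` argument assumes {dplyr} version 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy scores, including one missing value
toy <- tibble::tibble(score = c(1, 2.5, 3.5, NA))

# Check an assertion with filter() before using it in case_when()
toy |>
  dplyr::filter(score > 3)

banded <- toy |>
  dplyr::mutate(
    # Most specific assertion first: scores above 3 are caught before "medium"
    band_ordered = dplyr::case_when(
      score > 3 ~ "high",
      score > 2 ~ "medium",
      .default = "low"
    ),
    # Least specific first: `score > 2` also matches 3.5, so "high" is never reached
    band_misordered = dplyr::case_when(
      score > 2 ~ "medium",
      score > 3 ~ "high",
      .default = "low"
    )
  )

banded
```

The missing score does not match any assertion, so it receives the `.default` value. If missing values should stay missing, they need their own assertion (for example using `is.na()`) before the default is reached.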
Let’s look at two examples of how dplyr::case_when() might come in handy.
One-Variable Input
We’ve created our composite sticsa_trait_score variable previously, and now we may want to change this continuous score into a categorical variable indicating whether or not participants display clinical levels of anxiety. So, we can use case_when() to recode sticsa_trait_score into a new sticsa_trait_cat variable.
anxiety_cutoff <- 2.047619                                       # 1
anx_data <- anx_data |>                                          # 2
  dplyr::mutate(                                                 # 3
    sticsa_trait_cat = dplyr::case_when(                         # 4
      sticsa_trait_score >= anxiety_cutoff ~ "clinical",         # 5
      sticsa_trait_score < anxiety_cutoff ~ "non-clinical",      # 6
      .default = NA                                              # 7
    )
  )

1. Create a new object, anxiety_cutoff, containing the threshold value for separating clinical from non-clinical anxiety. This one is from Van Dam et al., 2013.
2. Overwrite the anx_data object by taking the dataset, and then…
3. Making a change to it by…
4. Creating a new variable, sticsa_trait_cat, by applying the following rules:
5. For cases where the value of sticsa_trait_score is greater than or equal to anxiety_cutoff, assign the value “clinical” to sticsa_trait_cat
6. For cases where the value of sticsa_trait_score is less than anxiety_cutoff, assign the value “non-clinical” to sticsa_trait_cat
7. For cases that don’t match any of the preceding criteria, assign NA to sticsa_trait_cat
anxiety_cutoff object
In the code above, the cutoff value is stored in a new object, anxiety_cutoff, which is then used in the subsequent case_when() conditions. Why take this extra step?
This is a matter of style, since the output of this code would be entirely identical if I wrote the cutoff value into the case_when() assertions directly (e.g. sticsa_trait_score >= 2.047619). I have done it this way for a few reasons:
- The threshold value is easy to find, in case I need to remind myself which one I used, and it’s clearly named, so I know what it represents.
- The threshold value only needs to be typed in once, rather than copy/pasted or typed out multiple times, which decreases the risk of typos or errors.
- Most importantly, it’s easy to change, in case I need to update it later. I would only have to change the value in the anxiety_cutoff object once, at the beginning of the code chunk, and all of the subsequent code using that object would be similarly updated.
In short, it makes the code easier to navigate, more resilient to later updates, and more transparent in its meaning.
Multi-Variable Input
We might also like to create a useful coding variable to help keep track of the number of cases we’ve removed, and for what reasons. We can draw on input from multiple variables to create this single new variable. Here’s the idea to get started:
anx_data |>                                                      # 1
  dplyr::mutate(
    remove = dplyr::case_when(
      distribution == "preview" ~ "preview",                     # 2
      consent != "Yes" | is.na(consent) ~ "no_consent",          # 3
      .default = "keep"                                          # 4
    )
  )

1. Take the dataset anx_data and then make a change to it by creating a new variable, remove, by applying the following rules:
2. For cases where the distribution variable contains exactly and only "preview", assign the value "preview" to remove.
3. For cases where the consent variable does not contain exactly and only "Yes", or contains an NA, assign the value "no_consent" to remove.
4. For cases that don’t match any of the preceding criteria, assign the value "keep" to remove.
Note that for this variable, each assertion is designed to identify the cases that we do NOT want to keep. The .default = "keep" line assigns the value "keep" for any case that doesn’t match any of the exclusion criteria - i.e., unless there’s a reason to drop a particular case, we keep it by default.
RepRoducibility: Managing datasets during pre-processing
Some of the code we have been using so far relies on people diligently running the entire script from top to bottom. As an example, if we rerun this code chunk a second time, we will apply the reverse-scoring again, which undoes the reverse-coding of the variables:
anx_data <- anx_data |>
  dplyr::mutate(
    sticsa_pre_state_17 = 5 - sticsa_pre_state_17,
    sticsa_post_state_17 = 5 - sticsa_post_state_17
  )

To build some resilience into the code, we could create new variables instead of overwriting the original variables:
anx_data <- anx_data |>
  dplyr::mutate(
    sticsa_pre_state_17_reverse = 5 - sticsa_pre_state_17,
    sticsa_post_state_17_reverse = 5 - sticsa_post_state_17
  )

Now, regardless of how many times we rerun the chunk above, the values of sticsa_pre_state_17_reverse and sticsa_post_state_17_reverse will not change. However, if you choose to create a new variable, you need to consider how this would affect your interactions with the dataset (e.g. when you want to use tidyselect).
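The difference between the two approaches can be checked directly. The sketch below uses a hypothetical one-column tibble (`toy`, with an invented `item` variable) and simply applies each version of the code twice.

```r
library(dplyr)

# Hypothetical toy data on a 1-4 response scale
toy <- tibble::tibble(item = c(1, 2, 4))

# Overwriting: running the same step twice flips the values back
overwrite_once <- toy |>
  dplyr::mutate(item = 5 - item)
overwrite_twice <- overwrite_once |>
  dplyr::mutate(item = 5 - item)

# New variable: running the same step twice gives the same result
new_once <- toy |>
  dplyr::mutate(item_reverse = 5 - item)
new_twice <- new_once |>
  dplyr::mutate(item_reverse = 5 - item)

identical(overwrite_twice$item, toy$item)                 # TRUE: the reversal was undone
identical(new_twice$item_reverse, new_once$item_reverse)  # TRUE: nothing changed on rerun
```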
The other solution is to create a new object to house the pre-processed data instead of overwriting the object containing the original data:
anx_data_reverse <- anx_data |>
  dplyr::mutate(
    sticsa_pre_state_17 = 5 - sticsa_pre_state_17,
    sticsa_post_state_17 = 5 - sticsa_post_state_17
  )

Now, regardless of how many times we rerun the chunk above, the values of sticsa_pre_state_17 and sticsa_post_state_17 in anx_data_reverse will remain the same. The problem with this approach is that if you create a new object for every step of your pre-processing, your environment will become cluttered with very similar looking objects and it can become difficult to keep track of all versions of your dataset.
There are no concrete rules for deciding when to save a new version of a dataset as an object, vs overwriting the existing dataset with the new version. One good starting point is to apply all pre-processing in the same command, and have only one “pre-processed” data object. If I (RB) want to do some analysis on an interim step of the pre-processing that will not be accessible once all pre-processing steps are completed, I would create a new object at that interim step as well. We didn’t use this approach in the examples above because we wanted to demonstrate the different use-cases of mutate() one at a time.
The only concrete rule to keep in mind is that you want your code to always return the same output every time you run it.
Iteration
This material will not be covered in live workshops, unless there is sufficient time and interest. The techniques in this section are not taught in core Methods modules for UG students, so they are not essential for dissertation supervisors. This section is included for anyone who wants to develop the efficiency and versatility of their coding beyond basic tasks.
If you want to skip this section, you can jump down to the next section.
The mutate() function is an amazing tool for working with your dataset, but applying the same change to multiple variables quickly becomes tedious. Imagine we wanted to change all of the character variables in this dataset to factors. We could do something like this:
anx_data |>
  dplyr::mutate(
    id = factor(id),
    distribution = factor(distribution),
    consent = factor(consent),
    gender = factor(gender),
    mcq = factor(gender),
    remove = factor(remove)
  )

If there are only a few of these variables to change, then this may be fine - but even just a few are prone to mistakes or mistyping. Did you spot the mistake in the code above? The mcq variable was overwritten by the gender variable in a copy/paste mistake. This kind of mistake is both easy to make and very difficult to detect, since the code runs without issue.
To avoid this, the general rule of thumb is: if you have to copy/paste the same code more than once, use (or write!) a function instead. To use code more efficiently, the key is to identify where the code repeats, then use a function for that repetition instead of duplicated code.
Luckily we don’t have to figure out how to do this iteration from scratch², because {dplyr} already has a built-in method for doing exactly this task, called dplyr::across(). It works like this:
dataset_name |>
  dplyr::mutate(
    dplyr::across(<tidyselect>, function_to_apply)
  )

In the first argument, we use <tidyselect> syntax to choose which variables we want to change.
In the second argument, the function or expression in function_to_apply is applied to each of the variables we’ve chosen. By default, the variables are overwritten.
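As a small sketch of both arguments in use, the code below runs `across()` on a made-up tibble (`toy`, with invented values). It also shows two optional extras that are assumptions beyond what the tutorial covers: the `where()` selection helper, which picks variables by type, and the `.names` argument of `across()`, which creates new variables instead of overwriting.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble::tibble(
  id = c("a", "b"),
  gender = c("female", "male"),
  age = c(20, 31)
)

# Overwrite every character variable with a factor version
all_chr <- toy |>
  dplyr::mutate(
    dplyr::across(where(is.character), factor)
  )

# Keep the originals and add new variables with a "_fct" suffix
kept <- toy |>
  dplyr::mutate(
    dplyr::across(c(id, gender), factor, .names = "{.col}_fct")
  )

names(all_chr)
names(kept)
```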
The task we wanted to do above was to convert all character variables to factors. So our repetitive, copy/paste command above becomes:
anx_data |>
  dplyr::mutate(
    dplyr::across(c(id, distribution, consent, gender, mcq, remove),
                  factor)
  )

Exercises
Quick Test: \(\chi^2\)
Since we’ve created some handy dichotomous variables today, we can also have a quick \(\chi^2\) test of association as a treat. Just like we did t-tests with t.test(), for \(\chi^2\) we have chisq.test().
First, you can bring up the help documentation by running ?chisq.test in the Console.
You might notice right away that this function has no data = argument, and neither does it have an option to specify a formula like we’ve used previously. Instead, we just need to provide two vectors, which we can get out of our dataset using $ subsetting.
So, for example, to compare whether there is an association between MCQ type and trait anxiety (which we would rather NOT be the case, since participants were allocated randomly to MCQ condition), we can simply run:
chisq.test(anx_data$mcq, anx_data$sticsa_trait_cat)
Pearson's Chi-squared test with Yates' continuity correction
data: anx_data$mcq and anx_data$sticsa_trait_cat
X-squared = 2.8458, df = 1, p-value = 0.09161
If we store this model output in an object, we can then subset it to easily get counts of expected and observed frequencies.
anx_chisq <- chisq.test(anx_data$mcq, anx_data$sticsa_trait_cat)
anx_chisq$observed

            anx_data$sticsa_trait_cat
anx_data$mcq clinical non-clinical
       maths      136           97
       stats      154           78

anx_chisq$expected

            anx_data$sticsa_trait_cat
anx_data$mcq clinical non-clinical
       maths 145.3118     87.68817
       stats 144.6882     87.31183
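To see where the expected frequencies come from, here is a minimal base R sketch that rebuilds them by hand from the observed counts printed above (row total × column total ÷ grand total). The object names `observed` and `expected_by_hand` are introduced for this illustration; passing a matrix of counts to `chisq.test()` is an alternative to supplying two vectors.

```r
# Observed counts copied from the output above
observed <- matrix(
  c(136, 154, 97, 78),
  nrow = 2,
  dimnames = list(
    mcq = c("maths", "stats"),
    sticsa_trait_cat = c("clinical", "non-clinical")
  )
)

# Expected count for each cell: row total * column total / grand total
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected_by_hand

# chisq.test() also accepts a matrix of counts directly
counts_chisq <- chisq.test(observed)
all.equal(unname(counts_chisq$expected), unname(expected_by_hand))
```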
Exercises
If, like me, you hate repetitive typing, you can really let R do the work for you. The example below shows the construction of a function that takes a dataset, x, with a remove variable in it as described above, and then automatically produces a paragraph detailing the exclusions and participant numbers.
report_exclusions <- function(x){
  ## Generate a tibble with counts of exclusions
  ## And add in a plain-language description of what each means
  excl_sum <- x |>
    dplyr::count(remove) |>
    dplyr::mutate(
      desc = dplyr::case_when(
        remove == "age_bad" ~ "indicated an age above 100 or otherwise impossible,",
        remove == "age_young" ~ "indicated an age below 18,",
        remove == "no_consent" ~ "did not consent,",
        .default = remove
      )
    )

  ## Extract initial number (minus previews)
  n_initial <- excl_sum |>
    dplyr::filter(remove != "preview") |>
    dplyr::pull(n) |>
    sum()

  ## Extract final number
  n_final <- excl_sum |>
    dplyr::filter(remove == "keep") |>
    dplyr::pull(n)

  ## Drop previews and keeps so the following code only itemises exclusions
  excl_sum <- excl_sum |>
    dplyr::filter(!(remove %in% c("preview", "keep")))

  ## Paste the text together
  paste("To begin,", n_initial, "cases were recorded.", "Subsequently,",
        ## Generate the sentences with paste() and then sub the last comma with a comma followed by "and"
        gsub("(.*), (.*)", "\\1, and \\2", paste(excl_sum$n, "cases", excl_sum$desc, collapse = " ")),
        "so they were excluded. This left a final sample of", n_final, "participants.")
}

The important thing here is that the remove variable must be created the same way in the original dataset every time in order for this function to work correctly. If you wanted to add in more reasons for exclusions, you would also need to update the case_when() command at the beginning of the function to add a description for the new exclusions.
Having created this custom function in my document somewhere, I could then simply write the following inline code in my Quarto text:
`r report_exclusions(anx_data)`
Which would render as follows:
To begin, 453 cases were recorded. Subsequently, 5 cases indicated an age above 100 or otherwise impossible, 22 cases indicated an age below 18, and 33 cases did not consent, so they were excluded. This left a final sample of 393 participants.
Footnotes
¹ Note that averaging Likert data is controversial (h/t Dr Vlad Costin!), but widespread in the literature. We’re going to press boldly onward anyway to not get too deep in the statistical weeds, but if you’re using Likert scales in your own research, it’s something you might want to consider.
² {purrr}, cats, scratch, get it?? I’m hilarious.