10: Qualtrics and Labelled Data

Overview

This tutorial will focus on efficient, transparent, and user-friendly techniques for working with data specifically gathered using the Qualtrics survey platform. We will cover how to import and work with labelled data from Qualtrics and how to easily produce a data dictionary straight from the dataset itself.

Acknowledgements

This tutorial was co-conceived and co-created with two brilliant PhD researchers, Hanna Eldarwish and Josh Francis, who contributed invaluable input throughout the process of developing the tutorial. This included collecting commonly asked questions and issues with Qualtrics data analysis; discussing the topics to cover and how best to cover them; and testing code and solutions. Hanna Eldarwish also provided the basis for the dataset, collected during her undergraduate dissertation at Sussex under the supervision of Dr Vlad Costin.

Setup

Packages

As usual, we will be using {tidyverse}. When {tidyverse} is installed, it also installs the {haven} package, which we will use for data importing. However, {haven} isn’t loaded as part of the core {tidyverse} group of packages, so let’s load it separately. We will also need the {labelled} package to work with labelled data, and the {sjPlot} package for a quick, Viewer-friendly data dictionary later on.

Exercise

Load the packages.

library(tidyverse)
library(haven)
library(labelled)
library(sjPlot)

Data

Today’s dataset focuses on various aspects of meaning in life (MiL), and has been randomly generated based on a real dataset kindly contributed by Hanna Eldarwish and Vlad Costin. All variables have been randomly generated, but they are based on the patterns in the original dataset. The original, bigger dataset will be made available alongside article publication in the future, so keep an eye out for it!

New File Type

You might notice that instead of the familiar readr::read_csv(), today we have haven::read_sav(). That’s because the file I’ve prepared is a SAV file, associated with the SPSS statistical analysis programme. The next section explains why we are using this data type, but otherwise, there’s nothing new about these commands.

Exercise

Read in the mil_data.sav file from the data folder, or alternatively from GitHub via URL, as you prefer.

From a folder:

mil_data <- here::here("data/mil_data.sav") |> haven::read_sav()

From URL:

mil_data <- haven::read_sav("https://raw.githubusercontent.com/drmankin/practicum/master/data/mil_data.sav")

Codebook

This codebook is intentionally sparse, because we’ll be generating our own from the dataset in just a moment. This table covers only the questionnaire measures to help you understand the variables.

| Variable Prefix | Scale | Subscale |
|---|---|---|
| global_meaning | Meaning in Life | Global Meaning |
| mattering | Meaning in Life | Mattering |
| coherence | Meaning in Life | Coherence |
| purpose | Meaning in Life | Purpose |
| sym_immortality | Symbolic Immortality | Single scale |
| belonging | Belonging | Single scale |

 

Qualtrics Data

Qualtrics is a survey-building tool very commonly used for questionnaire-type studies, as well as some experimental work. The University of Sussex has an institutional licence for Qualtrics, so all staff and students can log in with their Sussex details and easily construct and collaborate on surveys.

For help using Qualtrics itself, the Qualtrics support pages are generally excellent. This tutorial will only briefly touch on the options within Qualtrics itself.

Once the study is complete and responses have been collected, you will need to export your data from Qualtrics so that you can analyse it. Qualtrics offers a variety of export data types, including our familiar CSV type. However, we’re going to instead explore a new option: SAV data.

SAV Data

The .sav file type is associated with SPSS, a widely used statistical analysis programme. So, why are we using SPSS files when working in R?

Importing via .sav has two key advantages. First, it results in a much cleaner import format. If you try importing the same data via .csv file, you’ll find that you need to do some very fiddly and pointless cleanup first. For instance, the .csv version of the same dataset will introduce some empty rows that have to be deleted with dplyr::slice() or similar. The .sav version of the dataset doesn’t have any comparable formatting issues.

Most importantly, however, importing .sav file types into R with particular packages like {haven} gets us a dataset with a special type of data: namely, labelled data. The labels allow us to preserve important information about the questions asked and response options in Qualtrics, and to (mostly) painlessly create codebooks for datasets. We will explore these features in depth in this tutorial.
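To get a feel for what "labelled data" means before opening the real dataset, here is a minimal sketch using a made-up, three-row dataset (the `toy` object and its `fluency` column are hypothetical, not part of mil_data). It writes the toy data to a temporary .sav file and reads it back in, to show that the labels travel with the file.

```r
library(haven)

# Hypothetical toy data standing in for a Qualtrics export
toy <- tibble::tibble(
  fluency = haven::labelled(
    c(1, 2, 4),
    labels = c("Very well" = 1, "Well" = 2, "Not well" = 3, "Not at all" = 4),
    label = "How well can you speak English?"
  )
)

# Round trip: write to a temporary SAV file, then read it back in
tmp <- tempfile(fileext = ".sav")
haven::write_sav(toy, tmp)
toy_in <- haven::read_sav(tmp)

class(toy_in$fluency)           # a labelled vector, not a plain numeric one
attr(toy_in$fluency, "label")   # the variable label (question text)
attr(toy_in$fluency, "labels")  # the value labels (response options)
```

The same information would be lost, or at least flattened, in a .csv round trip, because a CSV file has nowhere to store this metadata.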

Exporting from Qualtrics

If you’d like to work with your own study data, you will need to export your data in SAV format from Qualtrics first. To do this, open your Qualtrics survey and select the “Data & Analysis” tab along the top, just under the name of your survey.

In the Data Table view, look to the right-hand side of the screen. Click on the drop-down menu labelled “Export & Import”, then select the first option, “Export Data…”

A screenshot of the Qualtrics Data & Analysis screen with red boxes indicating the steps to take to export data: Data & Analysis tab, Export & Import menu, and Export Data... option.

In the “Download a data table” menu, choose “SPSS” from the choices along the top. Make sure “Download all fields” is ticked, then click “Download”.

A screenshot of the Qualtrics Download a Data Table screen with red boxes indicating the steps to take to export SPSS data: SPSS tab, Download button.

The dataset will download automatically to your computer’s Downloads folder. You should then rename it to something sensible and move it into a data folder within your project folder. From there, you can read it in using the here::here() |> haven::read_sav() combo that we saw in the Data section previously.

Sensible Naming Conventions and Folder Structure

I know it may not seem like something anyone should care about, but sensible file and folder names will make your life so much easier for working in R (and generally).

For folder structure, make sure you do the following:

  • Always always ALWAYS use an R Project for working in R.
  • Have a consistent set of folders for each project: for example, images, data, and docs.
  • Use sub-folders where necessary, but consider using sensible naming conventions instead.

For naming conventions, your file name should make it obvious what information it contains and when it was created, especially for datasets like this. Personally, I would prefer longer and more explicit file names over brevity; this is because I prefer to navigate files using R, and that’s much easier using explicit file names than it is with file metadata.

So, for a download like this, I’d probably name it something like qtrics_diss_2023_11_08.sav. The qtrics tells me it’s a Qualtrics export, the diss tells me it’s a dissertation project, and the last bit is the full date in easily machine-readable format. Imagine if I continue to recruit participants and download a new dataset later, say a month from now, and name it qtrics_diss_2023_12_08.sav. I could easily distinguish which dataset was which by the date, but also see that they are different versions of the same thing by their shared prefix.

This is a much more reliable system than calling them, say, Qualtrics output.sav and Dissertation FINAL REAL.sav. This kind of naming “convention” contains no information about which is which or when they were exported, or even that they’re two versions of the same study dataset! It might seem like a small detail at the time, but Future You trying to figure out which dataset to use weeks or months later will feel the difference.
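As a small illustration of why the year-month-day format pays off, here is a sketch in base R. The file names are the hypothetical ones from the example above; nothing is read from or written to disk.

```r
# Build a dated file name following the prefix_project_date pattern
export_date <- as.Date("2023-11-08")
file_name <- paste0("qtrics_diss_", format(export_date, "%Y_%m_%d"), ".sav")
file_name

# Because the date is written year_month_day, sorting the names
# alphabetically also sorts them chronologically
exports <- c("qtrics_diss_2023_12_08.sav", "qtrics_diss_2023_11_08.sav")
latest <- sort(exports, decreasing = TRUE)[1]
latest
```

In a real project, you could get the vector of names from list.files() with a pattern matching the shared prefix instead of typing them out.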

 

The Plan

Our workflow for this dataset will be slightly different than previously. We’ll start by doing some basic cleanup of the dataset, and produce a codebook, or “data dictionary”, drawing on the label metadata in the SAV file. For the purpose of practice, we’ll also have a look at how to work with those labels, and manage different types of missing values.

As useful as labels are, they will get in the way when we want to work with our dataset further. So, we’ll convert the variables in the dataset into either factors, for categorical data, or numeric, for continuous data. From that point forward, we can work with the dataset using the techniques and functions we’ve learned thus far.

Cleanup and Data Dictionary

Tip

Most of the following examples are drawn from the “Introduction to labelled” vignette from the {labelled} package. If you want to do something with labelled data that isn’t covered here, that’s a good place to start!

Let’s start off by having a look at the dataset. As usual, you can call the dataset or use View() on it directly, but we’re going to take advantage of the new data type to get a more helpful summary, that emulates the “Variable View” in SPSS.

Exercise

Use the generate_dictionary() function from the {labelled} package to create a data dictionary for mil_data, then pipe it into View().

mil_data |> 
  labelled::generate_dictionary() |>
  View()

What we get is a new summary dataset that contains some useful information about each of the variables in mil_data. Along with the actual variable name in the dataset, which we see under variable, we also get the actual question that participants saw in Qualtrics under label, and the response options - where applicable - in value_labels.

Why a data dictionary? There are two key reasons to do this. First, as we’ve already seen throughout these tutorials, data dictionaries (or “codebooks”) are very useful for understanding datasets, even your own. In this case, we’ve generally named our variables usefully in Qualtrics before the export, but if you forget to (or don’t typically) do this, this reference helps you navigate unhelpful variable names like “Q42”, “Q16” etc. The second reason is for other people: if you want to share your data publicly, including a dictionary/codebook is not only a kindness to other users but also helps prevent misuse or misunderstandings.

Before we look at these labels in more depth, we’re first going to address two minor issues that commonly come up with Qualtrics data to make sure our dataset is ready to use.

Renaming Variables

If you inspected the dataset closely, you might have noticed that one of the items has a strange name: coherence_42, right between coherence_1 and coherence_3.

This wasn’t intentional - in the process of creating the questionnaire in Qualtrics, this variable came out with a weird name. It happens more easily than you think! The best-case option would be to update the Qualtrics questionnaire itself before exporting the data, but you may not be able (or want) to do this, so instead, let’s have a quick look at how to rename variables.

As (almost) always, there’s a friendly {dplyr} function to help us with our data wrangling. This time it’s sensibly-named dplyr::rename(), which renames variables using new_name = old_name arguments.

Exercise

Rename coherence_42 to coherence_2. Don’t forget to save this change to the dataset!

mil_data <- mil_data |> 
  dplyr::rename(coherence_2 = coherence_42)

Tip

Do you have lots of variables to rename? Do you like writing functions, or using regular expressions? Check out rename()’s flashier cousin rename_with(), which uses a function to rename variables.
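As a rough sketch of what that might look like, here is rename_with() applied to a made-up dataset (`toy` is hypothetical and only has three columns). The function passed to .fn receives all the selected column names at once and must return the new names.

```r
library(dplyr)

# Hypothetical toy data with the same naming glitch as mil_data
toy <- tibble::tibble(coherence_1 = 1, coherence_42 = 2, coherence_3 = 3)

toy_renamed <- toy |>
  dplyr::rename_with(
    .fn = \(x) gsub("_42$", "_2", x),  # replace a trailing "_42" with "_2"
    .cols = dplyr::everything()        # apply to every column name
  )

names(toy_renamed)
```

Names that don’t match the pattern are returned unchanged, so only the glitchy column is affected. For a single variable, plain rename() is still easier to read.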

Separating Columns

The second thing I’d like to do doesn’t concern the main dataset, but rather the data dictionary we’ve generated. For the single-item questions, the label column is reasonably helpful. However, the items with a shared prefix all come from the same matrix scale in Qualtrics, and their labels have two parts: the “question text” that usually contains directions about how to respond, and the actual item text for each individual item.

As an example, the label for belonging_1 reads:

Please rate the extent to which these statements apply to you. - When I am with other people, I feel included

Which corresponds to this in Qualtrics:

A screenshot of a Qualtrics matrix table of Likert responses, with 'Please rate the extent to which these statements apply to you' at the top, and a single item, 'When I am with others, I feel included' and a rating scale from strongly disagree to strongly agree.

To make the labels more readable, let’s split up the question text, which is repeated for all items on the same subscale and not very useful, and the item text, which contains the specific text of each item. The good news is that the two pieces are defined, or delimited, by the “ - ” symbol that Qualtrics automatically adds to link them.

Since we want to separate the label column into two - making the dataset wider - using a delimiter, the separate_wider_delim() function from the {tidyr} package should do the trick!

mil_data |>                            # Take the data, and then
  labelled::generate_dictionary() |>   # generate the data dictionary, and then
  tidyr::separate_wider_delim(         # separate wider by delimiter as follows:
    cols = label,                      # separate the label column
    delim = " - ",                     # at the " - " delimiter
    names = c("label", "item_label"),  # into two new columns, "label" and "item_label"
    too_few = "align_start"            # if there are too few pieces (no delimiter), fill in from the start
  )

The result isn’t perfect, but it’ll do for our purposes - namely, to have a quick reference for the variables in our dataset.
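To see what too_few = "align_start" actually does, here is a minimal sketch on a made-up two-row dictionary (`toy_dict` is hypothetical; the label text is borrowed from the examples above). One label contains the delimiter and one doesn’t.

```r
library(tidyr)

# Hypothetical two-row stand-in for the data dictionary
toy_dict <- tibble::tibble(
  variable = c("age", "belonging_1"),
  label = c(
    "How old are you?",
    "Please rate the extent to which these statements apply to you. - When I am with other people, I feel included"
  )
)

toy_split <- toy_dict |>
  tidyr::separate_wider_delim(
    cols = label,
    delim = " - ",
    names = c("label", "item_label"),
    too_few = "align_start"
  )

toy_split
```

For age, the only piece goes into label and item_label is filled with NA. One thing to watch for: a label containing more than one " - " would throw an error by default; separate_wider_delim() has a too_many argument to control that case if it comes up in your own data.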

Exercise

Save the final (for now) data dictionary in a new object, mil_dict, so we can refer to it as needed.

mil_dict <- mil_data |> 
  labelled::generate_dictionary() |> 
  tidyr::separate_wider_delim(
    cols = label,
    delim = " - ",
    names = c("label", "item_label"),
    too_few = "align_start"
  )

Viewer Data Dictionary

I personally like labelled::generate_dictionary() because (you will be unsurprised to learn) I like to mess about with regex to make it read just as I like. However, if you primarily need a quick reference as you’re working with your dataset, the delightful sjPlot::view_df() function makes this particularly easy.

Exercise

Put mil_data into the sjPlot::view_df() function and see what it does!

By default, the document opens in the Viewer, but you can also save the file it creates for further sharing - see the help documentation.

sjPlot::view_df(mil_data)

Data frame: mil_data

| ID | Name | Label | Values | Value Labels |
|---|---|---|---|---|
| 1 | StartDate | Start Date | | |
| 2 | EndDate | End Date | | |
| 3 | Status | Response Type | 0; 1; 2; 4; 8; 9; 12; 16; 17; 32; 40; 48 | IP Address; Survey Preview; Survey Test; Imported; Spam; Survey Preview Spam; Imported Spam; Offline; Offline Survey Preview; EX; EX Spam; EX Offline |
| 4 | Finished | Finished | 0; 1 | False; True |
| 5 | RecordedDate | Recorded Date | | |
| 6 | ResponseId | Response ID | <output omitted> | |
| 7 | DistributionChannel | Distribution Channel | <output omitted> | |
| 8 | UserLanguage | User Language | <output omitted> | |
| 9 | english_fluency_1 | Please select which box best describes your English fluency. - How well can you speak English? | 1; 2; 3; 4 | Very well; Well; Not well; Not at all |
| 10 | age | How old are you? | range: 15-82 | |
| 11 | gender | What is your gender identity? This question is optional. - Selected Choice | 0; 1; 2; 3 | Male; Female; Non-binary; Other (please state below) |
| 12 | global_meaning_1 | Please rate the extent to which you agree or disagree with these statements. - My life as a whole has meaning. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 13 | global_meaning_2 | Please rate the extent to which you agree or disagree with these statements. - My entire existence is full of meaning. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 14 | global_meaning_3 | Please rate the extent to which you agree or disagree with these statements. - My life is meaningless. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 15 | global_meaning_4 | Please rate the extent to which you agree or disagree with these statements. - My existence is empty of meaning. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 16 | mattering_1 | Please rate the extent to which you agree or disagree with these statements. - Whether my life ever existed matters even in the grand scheme of the universe. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 17 | mattering_2 | Please rate the extent to which you agree or disagree with these statements. - Even considering how big the universe is, I can say that my life matters. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 18 | mattering_3 | Please rate the extent to which you agree or disagree with these statements. - My existence is not significant in the grand scheme of things. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 19 | mattering_4 | Please rate the extent to which you agree or disagree with these statements. - Given the vastness of the universe, my life does not matter. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 20 | coherence_1 | Please rate the extent to which you agree or disagree with these statements. - I can make sense of the things that happen in my life. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 21 | coherence_2 | Please rate the extent to which you agree or disagree with these statements. - Looking at my life as a whole, things seem clear to me. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 22 | coherence_3 | Please rate the extent to which you agree or disagree with these statements. - I can’t make sense of events in my life. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 23 | coherence_4 | Please rate the extent to which you agree or disagree with these statements. - My life feels like a sequence of unconnected events. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 24 | purpose_1 | Please rate the extent to which you agree or disagree with these statements. - I have a good sense of what I am trying to accomplish in life. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 25 | purpose_2 | Please rate the extent to which you agree or disagree with these statements. - I have certain life goals that compel me to keep going. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 26 | purpose_3 | Please rate the extent to which you agree or disagree with these statements. - I don’t know what I am trying to accomplish in life. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 27 | purpose_4 | Please rate the extent to which you agree or disagree with these statements. - I don’t have compelling life goals that keep me going. | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Agree; Strongly agree |
| 28 | sym_immortality_1 | Please indicate the extent to which you believe these statements are likely to occur. - After I die, my impact on the world will continue | 1; 2; 3; 4; 5; 6; 7 | Extremely unlikely; Moderately unlikely; Slightly unlikely; Neither likely nor unlikely; Slightly likely; Moderately likely; Extremely likely |
| 29 | sym_immortality_2 | Please indicate the extent to which you believe these statements are likely to occur. - Some aspect of myself, such as my name or accomplishments, will be remembered long after I die | 1; 2; 3; 4; 5; 6; 7 | Extremely unlikely; Moderately unlikely; Slightly unlikely; Neither likely nor unlikely; Slightly likely; Moderately likely; Extremely likely |
| 30 | belonging_1 | Please rate the extent to which these statements apply to you. - When I am with other people, I feel included | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |
| 31 | belonging_2 | Please rate the extent to which these statements apply to you. - I have close bonds with family and friends | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |
| 32 | belonging_3 | Please rate the extent to which these statements apply to you. - I feel accepted by others | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |
| 33 | belonging_4 | Please rate the extent to which these statements apply to you. - I have a sense of belonging | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |
| 34 | belonging_5 | Please rate the extent to which these statements apply to you. - I have a place at the table with others | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |
| 35 | belonging_6 | Please rate the extent to which these statements apply to you. - I feel connected with others | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |
| 36 | belonging_7 | Please rate the extent to which these statements apply to you. - I feel like an outsider | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |
| 37 | belonging_8 | Please rate the extent to which these statements apply to you. - I feel as if people do not care about me | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |
| 38 | belonging_9 | Please rate the extent to which these statements apply to you. - Because I do not belong, I feel distant during the holiday season | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |
| 39 | belonging_10 | Please rate the extent to which these statements apply to you. - I feel isolated from the rest of the world | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |
| 40 | belonging_11 | Please rate the extent to which these statements apply to you. - When I am with other people, I feel like a stranger | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |
| 41 | belonging_12 | Please rate the extent to which these statements apply to you. - Friends and family do not involve me in their plans | 1; 2; 3; 4; 5; 6; 7 | Strongly disagree; Disagree; Somewhat Disagree; Neither Agree nor Disagree; Somewhat agree; Agree; Strongly agree |

Labelled Data

As we’ve just seen in the data dictionary, the SAV data we’re using has a special property: labels. Labelled data has a number of features, all of which we will explore in depth shortly:

  • Variable labels. The label associated with a whole variable will contain the text of the item that the participants responded to. This is analogous to the “Label” column of the Variable View in SPSS.

  • Value labels. The label associated with individual values within a variable will contain the text associated with individual choices, for instance the points on a Likert scale or the options on a multiple-choice question. This is analogous to the “Values” column of the Variable View in SPSS.

  • Missing values. Within value labels, you can designate particular values as indicative of missing responses, refusal to respond, etc. This is analogous to the “Missing” column of the Variable View in SPSS.

We’re first going to look at how you can work with each of these elements. The reason to do this is that once our dataset has been thoroughly checked, we’re going to generate a final data dictionary, then convert any categorical variables into factors, the levels of which will correspond to the labels for that variable. We’ll also convert any numeric variables into numeric data type, which will discard the labels; that will make it possible to do analyses with them, but that’s why we have to create the data dictionary first.
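As a preview of where we’re headed, here is a minimal sketch of the two conversions on a made-up labelled vector (`likert` is hypothetical, with only three of the seven scale points labelled to keep it short). It uses haven::as_factor() for the categorical route and as.numeric() for the numeric route; other approaches are possible.

```r
library(haven)

# Hypothetical labelled vector resembling a single Likert item
likert <- haven::labelled(
  c(1, 7, 4),
  labels = c(
    "Strongly disagree" = 1,
    "Neither agree nor disagree" = 4,
    "Strongly agree" = 7
  ),
  label = "My life as a whole has meaning."
)

# Categorical route: value labels become factor levels
as_fct <- haven::as_factor(likert)
levels(as_fct)

# Numeric route: keeps the underlying numbers for calculations
as_num <- as.numeric(likert)
as_num
```

Notice that the factor version no longer knows about the numbers behind the labels, and the numeric version no longer knows what the numbers meant. That is exactly why the data dictionary needs to be created first.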

Important

These features will work optimally only if you have set up your Qualtrics questionnaire appropriately. Make sure to refer to the Setting Up Qualtrics section of the next tutorial to get the most out of your labelled data and save yourself data cleaning and wrangling headaches later.

Variable Labels

Variable labels contain information about the whole variable, and for Qualtrics data, will by default contain either an automatically generated Qualtrics value (like “Start Date”), or the question text that that variable contains the responses to.

Getting Labels

To begin, let’s just get out a single variable label to work with using labelled::var_label().

To specify the variable we want, we will need to subset it from the dataset, using either $ or dplyr::pull() as previously.

labelled::var_label(mil_data$gender)
[1] "What is your gender identity?\n\nThis question is optional. - Selected Choice"

Creating/Updating Labels

If you’d like to edit labels, you can do it “manually” - that is, just writing a whole new label from scratch.

The structure of this code might look a little unfamiliar. For the most part, we’ve seen code that contains longer and more complex instructions on the right-hand side of the <-, and a single object being created or updated on the left-hand side. In the structure below, the left-hand side contains longer and more complex code that identifies the value(s) to be updated or created, and the right-hand side contains the value(s) to create or update. It’s the same logic, just with a different structure.

labelled::var_label(mil_data$StartDate) <- "Date and time questionnaire was started"

labelled::var_label(mil_data$StartDate)
[1] "Date and time questionnaire was started"
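If the assignment-on-the-left structure feels awkward, {labelled} also offers a pipe-friendly alternative, set_variable_labels(). Here is a minimal sketch of both styles side by side on a made-up dataset (`toy` and its columns are hypothetical).

```r
library(labelled)

# Hypothetical toy data with no labels to start with
toy <- tibble::tibble(id = c(1, 2), age = c(30, 45))

# Style 1: assignment form, with the target on the left-hand side
labelled::var_label(toy$id) <- "Participant ID"

# Style 2: pipe-friendly form, using column_name = "label" pairs
toy <- toy |>
  labelled::set_variable_labels(age = "How old are you?")

# Get all variable labels at once, as a named list
labelled::var_label(toy)
```

Both styles store the label in the same place, so which one you use is mostly a matter of taste.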

If you’re up for it, though, I’d recommend using it as an opportunity to start working with regular expressions. For example, if we want to keep only the first bit of the label for gender, then we can keep everything only up to and including the question mark, and re-assign that to the variable label. This style is a bit more dynamic and resilient to changes or updates.

labelled::var_label(mil_data$gender) <- labelled::var_label(mil_data$gender) |> 
  gsub("(.*\\?).*", "\\1", x = _)

labelled::var_label(mil_data$gender)
[1] "What is your gender identity?"

Exercise

CHALLENGE: How can you read the gsub() command above? Why might this be “more dynamic and resilient to changes or updates”?

Let’s pick apart this gsub() command a bit at a time. First, gsub() has three arguments:

  • pattern, here "(.*\\?).*", which is the regex statement representing the string to match.
  • replacement, here "\\1", which is the string that should replace the match in pattern.
  • x, the string to look in.

The pattern has essentially two parts: the bit in the rounded brackets, and the bit outside. The rounded brackets designate a “capturing group” - a portion of the string that should be grouped together as a unit. The benefit of this grouping is in the second argument of gsub(); \\1 isn’t the number 1, but rather is a pronoun referring to the first capturing group. In other words, as a whole, this gsub() command captures a subset of the incoming string, and then replaces the entire string with that captured string, essentially dropping everything outside the capturing group.

To understand the regex statement "(.*\\?).*", we need to look at the incoming text, x. In this case, x is being piped in from above and looks like this:

labelled::var_label(mil_data$gender)
[1] "What is your gender identity?\n\nThis question is optional. - Selected Choice"

As discussed in a previous Challenge task, .* is a common regex shorthand that means “match any character, as many times as possible.” It’s essentially an “any number of anything” wildcard. This wildcard appears both inside and outside the brackets. So, how does gsub() know which bit should belong in the capturing group?

The answer is \\?. This is a “literal” question mark. Some symbols, like . and ?, are regex operators, but we might want to also match the “literal” symbols full-stop “.” and question mark “?” in a string. In this case we need an “escape” character, “\”, that escapes regex and turns the symbol into a literal one. So, the capturing group ends with a literal question mark - in the target string, that’s the question mark after “identity”, which is the only one in the string.

As an aside, if you’re wondering why there are two escape characters instead of one - i.e., why is it \\? and not \? - well, you and me both. There’s an explanation in vignette("regular-expressions") that never completely makes sense to me. The short version, as far as I can tell, is that the backslash is also the escape character inside R strings, so typing "\\?" in R code creates the two-character string \?, and that is what the regex engine actually receives. Regex written outside of a quoted string, as on regex101, needs only a single escape character, so a literal question mark would be \?. If you are ever trying to adapt regex from e.g. StackOverflow or regex101 and it isn’t working, check whether the escape characters are right!
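One way to convince yourself of what R is doing with those backslashes is to inspect the pattern string directly. This is a small base-R sketch; the object names are just for illustration, and the raw string syntax r"(...)" assumes R 4.0 or later.

```r
pattern <- "\\?"

# Typed with two backslashes, but the string itself has only two characters:
# one backslash and one question mark
nchar(pattern)
writeLines(pattern)

# A raw string skips R's own escaping, so a single backslash is enough
raw_pattern <- r"(\?)"
identical(pattern, raw_pattern)

# Either way, the regex engine sees an escaped, literal question mark
grepl(pattern, "Really?")
grepl(pattern, "Really")
```

writeLines() is handy here because it prints the string as it really is, rather than the escaped version that print() shows.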

Anyway. We can now read "(.*\\?)" as “capture all characters up to and including a literal question mark” - which matches the substring “What is your gender identity?” in x. However, we don’t just want to replace that portion of the string - instead, we want to replace the whole string with that bit of it. So, the second .* outside the brackets matches the rest of the string. If we didn’t include this last bit, the capturing group would just be replaced with itself, which would result in the same string as we started with, as below:

labelled::var_label(mil_data$gender) |> 
  gsub("(.*\\?)", "\\1", x = _)
[1] "What is your gender identity?\n\nThis question is optional. - Selected Choice"

So, altogether, we can read this gsub() command as: “Capture everything up to and including the question mark, and replace the entire string with that capturing group.”

Now. Why, you might wonder, is all this faff better?

Well, it might not be. You might find it more frustrating or effortful to generate the right regex pattern than to replace the label “manually”, and in that case, there’s nothing wrong with just writing out the label you want. I said this was “more dynamic and resilient” because this command will always drop everything after the question mark, no matter what that text is. If there is no match, it won’t replace anything. So, unlike the “manual” option, there’s much less danger of accidentally mixing up labels or overwriting the wrong thing; and this regex statement can be generalised to any label that contains a question mark, rather than having to type out each label one by one.
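To see the “won’t replace anything if there’s no match” behaviour in action, here is a minimal sketch that runs the same pattern over a plain character vector of labels. The vector is made up for illustration, reusing label text from this dataset.

```r
# Hypothetical vector of labels: one with a question mark, one without
labels <- c(
  "What is your gender identity?\n\nThis question is optional. - Selected Choice",
  "Start Date"
)

# Same pattern as above, applied to every label at once
cleaned <- gsub("(.*\\?).*", "\\1", labels)

cleaned
```

The first label is trimmed back to the question, while the second has no question mark and so comes back untouched. One caveat: because .* is greedy, a label with several question marks would be kept up to the last one, not the first, so it’s worth checking the result before overwriting any labels.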

Searching Labels

A very nifty feature of variable labels and {labelled} is the ability to search through them with labelled::look_for(). With the whole dataset, look_for() returns essentially the same info as generate_dictionary(), but given a second argument containing a search term, you get back only the variables whose label contains that term.

Exercise

Use labelled::look_for() to get only the items in this questionnaire that mentioned family.

I’ve piped into tibble::as_tibble() to make the output easier to read.

labelled::look_for(mil_data, "family") |> 
  tibble::as_tibble()
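If you’d like to try look_for() without the full dataset, here is a minimal sketch on a made-up two-variable dataset (`toy` is hypothetical; the label text is borrowed from mil_data).

```r
library(labelled)

# Hypothetical toy data with variable labels attached
toy <- tibble::tibble(belonging_2 = c(1, 7), age = c(20, 30)) |>
  labelled::set_variable_labels(
    belonging_2 = "I have close bonds with family and friends",
    age = "How old are you?"
  )

# Search for the term "family"
hits <- labelled::look_for(toy, "family") |>
  tibble::as_tibble()

hits$variable
```

As far as I’m aware, look_for() checks variable names as well as labels, so a search term like "belonging" would also pick up variables by name.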

Value Labels

Value labels contain individual labels associated with unique values within a variable. It’s not necessary to have a label for every value, and indeed sometimes it’s advantageous not to.

Getting Labels

There are two functions to assist with this. labelled::val_labels() (with an “s”) returns all of the labels, while labelled::val_label() (without an “s”) will return the label for a single specified value.

labelled::val_labels(mil_data$english_fluency_1)
 Very well       Well   Not well Not at all 
         1          2          3          4 
labelled::val_label(mil_data$english_fluency_1, 3)
[1] "Not well"

Creating/Updating Labels

These two functions can also be used to update an entire variable or a single value respectively. The structure of this code is the same as we saw with variable labels previously.

Exercise

Get all the value labels for the gender variable. Then, update the last value to “Other”.

labelled::val_labels(mil_data$gender)
                      Male                     Female 
                         0                          1 
                Non-binary Other (please state below) 
                         2                          3 

The code for replacing this is much simpler manually…

labelled::val_label(mil_data$gender, 3) <- "Other"

But when has that ever stopped me?

labelled::val_label(mil_data$gender, 3) <- labelled::val_label(mil_data$gender, 3) |> 
  gsub("(.*?) .*", "\\1", x = _)
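The same idea can be tried on a toy vector. This is only a sketch: the vector gender below is invented to mirror the labels shown above, and it swaps in a slightly different pattern that drops everything from the first opening bracket onwards, rather than the lazy pattern used in the solution.

```r
library(labelled)

# Toy labelled vector mirroring the structure of the gender variable
gender <- labelled(
  c(0, 1, 3, 2),
  labels = c(
    "Male" = 0, "Female" = 1, "Non-binary" = 2,
    "Other (please state below)" = 3
  )
)

# Drop everything from the first " (" onwards in the label for value 3
val_label(gender, 3) <- sub(" \\(.*", "", val_label(gender, 3))

val_labels(gender)
```

Anchoring on the bracket rather than on the first space means a multi-word label would keep all of its words before the bracket.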

Missing Values

Labelled data also supports an extra piece of functionality from SPSS: creating user-defined “missing” values. These values aren’t missing in the sense that the participant didn’t respond at all. Rather, they might be missing in the sense that a participant selected an option like “don’t know”, “doesn’t apply”, “prefer not to say”, etc.

Let’s look at an example. As we’ve just seen, we can get out all the value labels in a variable with labelled::val_labels():

labelled::val_labels(mil_data$english_fluency_1)
 Very well       Well   Not well Not at all 
         1          2          3          4 

This variable asked participants to indicate their level of English fluency. Even for participants who have in fact responded to this question, we may want to code “Not well” and “Not at all” as “missing” so that they can be excluded easily. To do this, we can use the function labelled::na_values() to indicate which values should be considered as missing.

labelled::na_values(mil_data$english_fluency_1) <- 3:4

mil_data$english_fluency_1
<labelled_spss<double>[164]>: Please select which box best describes your English fluency. - How well can you speak English?
  [1] 2 2 2 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 1 1
 [75] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 2 2
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 3 2
Missing values: 3, 4

Labels:
 value      label
     1  Very well
     2       Well
     3   Not well
     4 Not at all

For the moment, these values are not actually NA in the data - they’re listed under “Missing Values” in the variable attributes. In other words, the actual responses are still retained. However, if we ask R which of the values in this variable are missing…

is.na(mil_data$english_fluency_1)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

…we can see one TRUE corresponding to the 3 above.

If we wanted to actually remove those values entirely and turn them into NAs for real, we could use labelled::user_na_to_na() for that purpose. In the output below, the variable has only two remaining values, and any 3s and 4s have been replaced with NA.

labelled::user_na_to_na(mil_data$english_fluency_1)
<labelled<double>[164]>: Please select which box best describes your English fluency. - How well can you speak English?
  [1]  2  2  2  1  1  1  2  1  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 [26]  1  1  1  1  1  1  1  1  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 [51]  1  1  2  1  1  1  1  1  1  1  1  2  1  1  1  2  1  1  1  1  2  1  1  1  1
 [76]  1  1  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  1
[101]  1  1  2  1  1  1  1  1  1  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[126]  1  1  1  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[151]  1  1  1  1  1  1  1  2  1  1  1  1 NA  2

Labels:
 value     label
     1 Very well
     2      Well

Tip

See the {labelled} vignette for more help on working with user-defined NAs, including how to deal with them when converting to other types.
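To see the whole user-defined NA workflow in one place, here is a minimal sketch on a toy vector. The vector fluency is invented for illustration and simply reuses the value labels shown above.

```r
library(labelled)

# Toy fluency responses with the same value labels as above
fluency <- labelled(
  c(1, 2, 3, 4, 1),
  labels = c("Very well" = 1, "Well" = 2, "Not well" = 3, "Not at all" = 4)
)

# Flag 3 and 4 as user-defined missing values
na_values(fluency) <- c(3, 4)

is.na(fluency)          # flagged values count as missing...
unclass(fluency)[3:4]   # ...but the original responses are still stored

# Convert the flagged values into real NAs
fluency_na <- user_na_to_na(fluency)
fluency_na
```

Until user_na_to_na() is called, the 3 and 4 are still in the data, so the flagging can be undone or changed later.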

Converting Variables

The labels have served their purpose helping us navigate and clean up the dataset, and produce a lovely data dictionary for sharing. However, if we want to use the data, we’ll need to convert to other data types that we can use for statistical analysis.

This will fall into two main categories. We’ll convert any variables containing numbers that we want to do maths with to numeric type, which we’ve encountered a few times before; this will strip the labels. Any variables that contain categorical data we’ll instead convert to factors, which we haven’t encountered in this series yet - so we’ll start there. Remember that variables that will be converted to factor should have labels for all of their levels, whereas variables that will be converted to numeric can have fewer labels, because we will stop using them after the numeric conversion.

Factor

Factor variables are R’s way of representing categorical data, which have a fixed and known set of possible values. Thus far we’ve mostly been using character data for this purpose, but that’s a bit of a cheat - R’s been helping us by treating the same values as the same category, but factors create this structure explicitly.

As may be familiar from SPSS, factors actually contain two pieces of information for each observation: levels and labels. Levels are the (existing or possible) values that the variable contains, whereas labels are very similar to the labels we’ve just been exploring.

As an example, take the following factor vector:

factor(c(1, 2, 1, 1, 2),
       labels = c("Male", "Female"))
[1] Male   Female Male   Male   Female
Levels: Male Female

The underlying values in the factor are numbers, here 1 and 2. The labels are applied to the values in ascending order of those values, so 1 becomes “Male”, 2 becomes “Female”, etc. Here, we haven’t needed to specify the levels; if you don’t specify otherwise, R will assume that they are the same as the unique values.

You can also supply additional possible values, even if they haven’t been observed, using the levels argument:

factor(c(1, 2, 1, 1, 1),
       levels = c(1, 2, 3),
       labels = c("Male", "Female", "Non-binary"))
[1] Male   Female Male   Male   Male  
Levels: Male Female Non-binary

Tip

Factors are so common and useful in R that they have a whole {tidyverse} package to themselves! You already installed {forcats} with {tidyverse}, but you can check out the help documentation if you’d like to learn more about working with factors.
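To make the distinction between levels and underlying values a bit more concrete, here is a short base R sketch that takes apart the second factor from above. The object name gender is just for illustration.

```r
# Same factor as above, with an unobserved third level
gender <- factor(c(1, 2, 1, 1, 1),
                 levels = c(1, 2, 3),
                 labels = c("Male", "Female", "Non-binary"))

levels(gender)      # all possible categories, observed or not
as.integer(gender)  # the underlying integer codes
table(gender)       # unobserved levels are counted as 0
```

Because “Non-binary” is a declared level, it still shows up in the table with a count of zero, which is handy for plots and summaries that should show every category.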

The useful thing about labelled data is that it’s very easy to convert into factors, which is what R expects for many different types of analysis and plotting functions. Handy!

For an individual variable, we can use labelled::to_factor() to convert to factor.

Exercise

Convert the gender variable to factor, although don’t assign this change to the dataset.

mil_data |> 
  dplyr::mutate(
    gender_fct = labelled::to_factor(gender),
    .keep = "used"
  )

If you wanted a specific order of the levels, for plotting or similar, there’s also a sort_levels = argument described in the help documentation.
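As a rough sketch of how that argument behaves, the toy vector below (invented for illustration) is converted twice: once with the defaults, and once with the levels sorted alphabetically by label.

```r
library(labelled)

# Toy labelled vector, not taken from mil_data
gender <- labelled(
  c(1, 0, 2, 1),
  labels = c("Male" = 0, "Female" = 1, "Non-binary" = 2)
)

# Default: value labels become the factor levels
gender_fct <- to_factor(gender)
levels(gender_fct)

# Sort the levels alphabetically by label instead
gender_sorted <- to_factor(gender, sort_levels = "labels")
levels(gender_sorted)
```

The observations themselves are the same in both versions; only the order of the levels differs, which in turn controls the order of categories in plots and tables.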

Numeric

For continuous variables, we don’t need anything fancy to turn them into numeric data, because they technically already are. Instead, we just need to get rid of the labels using unclass().

Exercise

Use unclass() to convert belonging_1 to numeric, although don’t assign the change to the dataset.

This example shows both the conversion to numeric, and back to labelled with labelled::labelled().

mil_data |> 
  dplyr::mutate(
    belonging_1_num = unclass(belonging_1),
    belonging_1_lab = labelled::labelled(belonging_1_num),
    .keep = "used"
  )

The nice thing about this method is that we can now do maths with the unclassed numeric variables as normal, but the labels are still there if we want to get back - just convert with labelled::labelled().
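Here is a minimal sketch of what unclass() does to a labelled vector, using a made-up rating item rather than belonging_1. It relies only on the fact that unclass() removes the class but leaves the other attributes in place.

```r
library(labelled)

# Toy rating-scale item with only the endpoints labelled
item <- labelled(
  c(1, 4, 7, 4),
  labels = c("Strongly disagree" = 1, "Strongly agree" = 7)
)

item_num <- unclass(item)

is.numeric(item_num)       # can be used in calculations
mean(item_num)
attr(item_num, "labels")   # the value labels are kept as an attribute
```

If you want a completely bare vector with no attributes at all, as.numeric(item_num) is one way to get it.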

Conditional Conversion

There are two main ways to convert variables more efficiently than one by one. The first is offered by the {labelled} package - here I’ve just copied from the vignette describing the setup.

Note

In most of cases, if data documentation was properly done, categorical variables corresponds to vectors where all observed values have a value label while vectors where only few values have a value label should be considered as continuous.

In that situation, you could apply the unlabelled() method to an overall data frame. By default, unlabelled() works as follows:

  • if a column doesn’t inherit the haven_labelled class, it will be not affected;
  • if all observed values have a corresponding value label, the column will be converted into a factor (using to_factor());
  • otherwise, the column will be unclassed (and converted back to a numeric or character vector by applying base::unclass()).
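The three rules above can be sketched on a toy data frame. Everything in toy is invented for illustration: one column is fully labelled, one has labels only on its endpoints, and one has no labels at all.

```r
library(labelled)

toy <- data.frame(id = 1:4)
# Every observed value is labelled: treated as categorical
toy$gender <- labelled(c(0, 1, 1, 0), labels = c("Male" = 0, "Female" = 1))
# Only the endpoints are labelled: treated as continuous
toy$rating <- labelled(c(1, 3, 5, 2), labels = c("Low" = 1, "High" = 5))

converted <- unlabelled(toy)

class(converted$gender)   # factor
class(converted$rating)   # plain numeric
class(converted$id)       # untouched
```

Note that the decision is made per column, based purely on whether every observed value has a label.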

If we wanted to do this, we’d have a bit more work to do. That’s because at the moment, our data doesn’t line up with this template. Having a look at our data dictionary again, we can see that our subscale variables have all of their levels labelled, so they won’t be converted as we’d like. Rather than do this now, I’d probably recommend setting up your Qualtrics like this to begin with.

Instead, we can take the second route and use what we’ve seen in previous Challenge tasks to convert variables conditionally.

Exercise

CHALLENGE: Convert categorical variables with labels to factors, and subscale variables to numeric.

Hint: Have a look back at dplyr::across() for efficient selecting and applying.

mil_data <- mil_data |>                      # (1)
  dplyr::mutate(                             # (2)
    across(global_meaning_1:last_col(),      # (3)
           unclass),
    across(c(english_fluency_1, gender),     # (4)
           labelled::to_factor)
  )

(1) Overwrite mil_data by taking the existing dataset mil_data, and then
(2) Change it as follows:
(3) Across all the variables from global_meaning_1 through the last column, convert to numeric
(4) Across the variables english_fluency_1 and gender, convert to factor.

 

Very well done today. You should now have a data dictionary to refer to, and a complete dataset to work with using the techniques we’ve already covered to clean up responses, create subscales, and so on. Next time we’ll work on solving common issues and avoiding those issues in the first place by setting things up right to begin with.

Footnotes

  1. For the purposes of simplicity, we’re going to keep pretending that Likert and similar rating scales are “continuous”.