02: IntRoduction II

Overview

In this second tutorial, we’ll continue working on foundational skills in R that we began in the first tutorial. We will cover how to organise your work in projects, and within those projects, some options for different types of documents to write your code. Jumping into the code side, we’ll also talk about installing and loading the packages you need, then we’ll focus on how to use the functions in those packages as well as their documentation. Finally, we’ll get stuck in with some common statistics functions and look at how to transparently comment code.

Setup

In each session, we will always follow the same steps to set up. We’ll walk through the key elements here in detail and then provide a brief summary in future tutorials.

Setup Steps

Projects

Projects are the main way that RStudio organises related files and information. Each project is associated with a working directory; everything in that directory, including all of its sub-folders, is part of the same project.

It is highly recommended that you create a new project for each separate thing you want to work on in R. Among other advantages, a project makes navigating folders much easier (see Reading In), allows you to easily pick up work from where you left off, and retains all the settings and options you have set each time you work on the same project.

Creating a Project

On Posit Cloud, you don’t really have a choice in the matter - you must create or open a project in the Cloud workspace in order to do anything.

On a desktop installation, you can create a new directory as a project or associate a project file with an existing directory.

See Posit’s Using RStudio Projects guide or Andy Field’s video guide to RStudio Desktop for more information.

RepRoducibility: File organisation

When uploading analysis materials and data online, it is important to make sure that your code will run with minimal intervention when another person downloads it. To help with this, the relationship between the analysis materials and the data assumed in the script needs to be preserved when the materials are uploaded on OSF.

Projects do this by setting a reference point, which is the project file; this is a .Rproj file with an icon like a little blue cube with an R in it. This project file turns the folder it’s in into a project folder, and among other benefits, means that you can navigate to other files in the same folder or any sub-folders using the project folder as a starting point. When you want to share or upload your files, it’s best to export the whole project folder, including the project file. This way, any relative file paths will continue to use the project folder as their starting point, and will still run successfully even on a different computer.

To use a convoluted analogy, this turns the project folder and its sub-folders into a terrarium - a self-contained ecosystem that can be transported as a unit from one place to another without disrupting the connections inside.

Also, avoid “hard-coding” file paths - while my work computer has a directory called “D:/reny-work/teaching/r-training/”, other people most likely don’t have the same directory and asking R to open a non-existent file path will cause an error.
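As a minimal sketch of what a relative path looks like, the code below builds a path that starts from the project folder using base R’s file.path(). The folder and file names (data, raw, scores.csv) are hypothetical and only there for illustration; substitute whatever your own project contains.

```r
# Build a path relative to the project folder (these names are hypothetical)
relative_path <- file.path("data", "raw", "scores.csv")
relative_path

# Check whether the file is really there before trying to read it;
# this is FALSE unless your project actually contains such a file
file.exists(relative_path)
```

Because the path never mentions a drive or a user name, it should keep working when the whole project folder is moved to another computer.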

Documents

As we discussed in the previous tutorial, one of the key strengths of doing your work using R (or any other programming language) is reproducibility - in other words, every step of the process from raw file to output is documented and replicable. However, this is only the case if you do in fact write your code down somewhere! To do that, you’ll need to create some kind of document to record your code. There are two main types of documents you might consider using: a script or a Quarto document.

Quarto documents

Quarto documents contain a combination of both non-code text and code. The main strength of these documents is that they can be rendered to some other output format, such as HTML, PDF, or Word, by executing all of the code and combining it with the text to create a nicely formatted document.

If you’re interested in using Quarto documents often, the Quarto help documentation is excellent. For now, use the Quarto document in your project on Posit Cloud for your work in this tutorial.

Exercise

Click the “Render” button at the top of your worksheet Quarto document to see what it produces.

Scripts

Scripts are text files that RStudio knows to read as R code. They have a .R file extension and can ONLY contain code. They are very useful for running code behind the scenes, so to speak, but not great for reviewing or presenting results.

Quarto or Script?

When deciding what kind of document to create, think about what you want to do with the output of your work.

  • Use Quarto if the document needs to contain any amount of text, or will be used to share the output of your code in a presentable way, such as notes for yourself, reports, articles, websites, etc.
  • The page you’re reading now is (or was!) a Quarto document.
  • Use a script if the document only needs to contain code and has a solely functional purpose, such as cleaning a dataset, manipulating files, defining new functions, etc.
  • I use a script to process all of the tutorial documents and generate the Quick Reference page.

In this series, we will almost always use Quarto documents, but scripts are an essential part of the development side of R, so you may encounter them in other contexts, for example working with collaborators, reviewing open-source data and code, or you may end up preferring them yourself.

Packages

In this tutorial, we’ll see that the main way R does anything is via functions. Every function belongs to a package, and packages are extensions to the R language. Packages can contain functions, documentation for those functions, datasets, sample code, and more. Some packages, like the {base} and {stats} packages that contain the mean() and t.test() functions that we saw previously, are included by default with R. However, you will often want to use functions from packages that aren’t included by default, and those packages you must install and load explicitly.

In order to utilise the functions in a package, you must do two things:

  1. Install the package (only once per device, or to update the package) using install.packages("package_name") in the Console
  2. Load the package (every time you want to use it) using library(package_name) at the beginning of each new document

Important

If you are working on these tutorials on the Posit Cloud workspace, all of the packages you need have been installed already. Please do not try to install any packages unless directed by the training lead, as this could cause unexpected conflicts or errors.

When you install R and RStudio for the first time on a device, this is like buying a new mobile phone. When you get a new phone, it comes with some apps pre-installed, like a messaging app, a camera, a calculator, etc. If you only ever wanted to take pictures and do basic maths with your phone, you could probably leave it at that. In the same way, when you install R it already comes with some basic packages for doing some operations.

Most likely, though, you’ll want to use other apps that don’t come with your new phone - like WhatsApp or Bluesky. Let’s say you’ve just got a new phone and you want to use WhatsApp. To do this, you’ll need to:

  1. Go to your phone’s app store and download WhatsApp (only once per device, or to update the app)
  2. Open the app (every time you want to use it)

As you can see, these steps correspond almost exactly to the installing vs loading steps described above. In order to use a package that doesn’t come pre-installed with R, you have to do both of these steps.
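The two steps might look like the sketch below, using {tidyverse} as the example package. The requireNamespace() check is an addition for illustration, not part of the steps above; it simply reports whether a package is already installed. The install line is commented out on purpose, in keeping with the note above about not installing packages on the Cloud workspace.

```r
# Check whether the package is already installed (TRUE or FALSE)
is_installed <- requireNamespace("tidyverse", quietly = TRUE)
is_installed

# Step 1 (once per device): run in the Console, and only if the check is FALSE
# install.packages("tidyverse")

# Step 2 (every session): load the package at the top of your document
if (is_installed) library(tidyverse)
```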

RepRoducibility: Package versions

R packages get regular updates. Some updates are minor but others can change how functions work, and in some cases functions can even be removed. Because of this, it is a good idea to note the specific versions of the packages you are using. One way to do this is to load all the packages you need with library() and then call the function sessionInfo() to see the specific versions you are using. This function can generate a lot of output. The most important information is the version of R you are using and the versions of the packages you loaded, which are listed under “other attached packages”.
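As a rough sketch, the pieces of version information mentioned above can also be pulled out individually. The object name session is an arbitrary choice, and {stats} stands in here for whichever package you actually care about.

```r
# Capture the session information in an object so we can pick out parts of it
session <- sessionInfo()

# The version of R currently running
session$R.version$version.string

# The installed version of a single package, here {stats}
packageVersion("stats")
```

Pasting this output into a comment or a README is one simple way of recording what your code was run with.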

Loading tidyverse

One of the core packages we use for UG teaching is {tidyverse}. This isn’t actually a single package, but rather a convenient shorthand to install and load a suite of interconnected packages all together. We will be using {tidyverse} packages throughout this training series, so loading {tidyverse} straightaway is a good habit to get into!

Exercise

Load the {tidyverse} package in your workbook.

Note: If you are on the Cloud workspace, {tidyverse} will already be installed. If not, you may need to install it first.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

When you load {tidyverse} for the first time, quite a lot of extra stuff gets printed along with it. All this output looks alarming, but these aren’t errors or warnings - they’re just messages. Messages are like warnings, but neutral: they just contain information that you might find helpful.

The usual {tidyverse} message contains two parts:

  • Attaching core tidyverse packages tells you which packages have just been loaded. Essentially, library(tidyverse) is a shortcut for loading all of these packages individually. Somewhat confusingly, installing {tidyverse} installs more packages than are in this list (for example, {magrittr} and {rlang}), many of which other {tidyverse} packages rely on to function. If you want to load them, you can use library() to do this - but you don’t need to unless you’re using those packages explicitly. For our purposes now, just the default {tidyverse} packages are fine.

  • Conflicts tells you about any package conflicts as a result of loading the packages.

What are conflicts? Consider that there are lots and lots of packages for R. At the time of this writing, CRAN (the repository for R packages) contains just over 20,000 R packages, with many, many more on Github and elsewhere. Although people generally try to avoid it, it is necessarily the case that sometimes, people give the same name to two different functions from two different packages.

So, if you have those packages both loaded, how does R know which one to use? This situation is called a conflict.

Resolving Conflicts

Method 1: Recency

In the absence of any other information, R will use the function from the package that was loaded most recently. This is exactly what’s happening in the {tidyverse} message above.

There are two conflicts mentioned, one of which reads:

✖ dplyr::filter() masks stats::filter()

First, briefly note the package::function() notation. You can read this notation as, for instance, “the filter function from the dplyr package”, or just “dplyr filter”. This is sort of like using the function’s “full name”, where the name of the package it belongs to is written first, followed by two colons, then the name of the function.

{stats}, you might remember, is a package that is always installed with R and is loaded by default. So, the {stats} package has a function called filter() that is already loaded whenever you start up R. When we loaded {tidyverse}, one of the newly loaded packages, {dplyr}, also contained a function called filter(). Because {dplyr} has been loaded more recently, if you write a bit of code using filter(), the one you will get is dplyr::filter(). In other words, the more recently loaded dplyr::filter() covers over, steps in front of, or (in R terminology) “masks” stats::filter().
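If you are ever unsure which version of a function you will get, one option is to ask R directly. This is only a small sketch using find(), a function from the {utils} package that comes with R.

```r
# List every loaded package that has something called "filter",
# in the order R searches them; the first entry is the one R will use
find("filter")
```

In a fresh session you would expect to see only "package:stats". After running library(tidyverse), "package:dplyr" should appear ahead of it, which is the masking described above.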

Now, what if you actually want to use filter() from {stats} instead, when {tidyverse} is loaded? Well, in that case you might want to use…

Method 2: Explicit style

Above we saw several examples of the package::function() notation, called “explicit” or “verbose” coding style. With explicit style, there isn’t actually a conflict between stats::filter() and dplyr::filter() anymore, because their package calls are clearly stated so R doesn’t have to guess which filter() you want. So, if you had loaded {tidyverse}, you could write stats::filter() in your code, and you would still get the function from the {stats} package even with {dplyr} loaded.

Another, secret benefit of explicit style is that as long as you have a package installed, you can use a function from that package without having to load it. Imagine I start a new project (so I have stats::filter() already loaded by default). If I only want to quickly use dplyr::filter(), I can use explicit style to use that function without having to run library(tidyverse).
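Here is a short sketch of explicit style in action. The numbers and the choice of the built-in mtcars dataset are assumptions made purely for illustration, and the {dplyr} part is wrapped in a check so that it only runs if {dplyr} happens to be installed.

```r
# filter() from {stats}, called by its full name: a 3-point moving average
moving_average <- stats::filter(1:5, filter = rep(1/3, 3))
moving_average

# filter() from {dplyr}, called by its full name without library(dplyr);
# this only runs if {dplyr} is installed on your device
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::filter(mtcars, cyl == 4)
}
```

The two calls do completely different jobs despite sharing a name, which is exactly why stating the package is helpful.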

The style you’ll see in these tutorials is a pretty devotedly explicit style: that is, I’ll always write dplyr::filter() instead of just filter(). I only leave out the package name in a few situations:

  1. When the function is from a default-loaded package, like {base} or {stats} (so I write mean() instead of base::mean()). This is mostly just convenience!
  2. When there are lots of functions from the same package in a row, all of which have very distinctive names, so that repeating the package name would make the code difficult to read and repetitive to write. This is the case, for example, with {ggplot2}, which we will encounter later on in the course. For cases like this, I make sure to load the relevant package and then leave off the package name.

To ensure all is well, we’ll load {tidyverse} at the start of each tutorial; regardless, though, it’s recommended to always use explicit style anyway (see below).

RepRoducibility: Explicit Style

For reproducible code, explicit style is strongly preferred. When you are working on your code, you might know which package(s) you have loaded, in which order, and which you intended to use, but this won’t necessarily be clear to anyone else. Explicit package::function() style removes the ambiguity and ensures your code always runs the same way, with the same functions you wanted.

Functions

Functions are like verbs in the R language. We’ve already started using a few functions: c(), class(), mean(), and as.numeric(). As we’ve seen, these functions perform some operation using the input inside their brackets, and produce the output of that operation. So, functions are the main way that R does anything.
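As a quick refresher, the four functions mentioned above can be chained together as in the sketch below. The values are made up for illustration.

```r
# c() combines values into a vector; here the numbers are stored as text
values <- c("1", "2", "3")

# class() reports what type of data the vector contains
class(values)

# as.numeric() converts the text into numbers, and mean() averages them
numbers <- as.numeric(values)
mean(numbers)
```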

In this section, we’ll take a systematic look at the process of using a new function - in particular, functions that take multiple inputs, or arguments. As we go, we’ll look at how to “translate” the command you want to give R into a verb (function) it can understand.

Basics and Help

Let’s look at an example of how this translation might work. For this example, I’d like to round the mean of four numbers, 45.34, 23.001, 7.3, and 16.9820, to two decimal places - a common task for reporting results in APA style.

Exercise

As a warm-up, calculate the mean of the four numbers given above and store this number in a new object called mean_to_round.

mean_to_round <- mean(c(45.34, 23.001, 7.3, 16.9820))
mean_to_round
[1] 23.15575

As a result of the calculation above, we now have the mean of the numbers of interest, 23.15575, that we want to round to two decimal places. If we want R to do this for us, we have to write a command that represents the operation we want R to perform. First, we need to know what R function corresponds to that operation. We’re lucky in this case: the function in R is also called round().

We know that we’re looking at a function in R because functions often have a name followed by brackets (and nothing else in R does). That is, they have the general form function_name(). Inside the brackets, we can add more information to the function to complete our command, although not all functions require any more information.

Exercise

Try running the round() function.

round()
Error in eval(expr, envir, enclos): 0 arguments passed to 'round' which requires 1 or 2 arguments

Unsurprisingly, R has given us an error. This is an informative error, though - that is, the error gives us some sort of intelligible clue about what’s gone wrong. Namely, it tells us that round() can’t just work without additional information (i.e. “arguments”).

What we want to do, “Round the number 23.15575 to two decimal places”, has two more important pieces of information that we need to tell R: what number we want to round (23.15575) and how many decimal places we want to round it to (2). So, how do we say this in R? To find out, let’s look at the help documentation.

Exercise

Open the help documentation for the round() function by running ?round() or help(round) in the Console.

Definition: Help Documentation

Help documentation is information, like instruction manuals, built into R about how individual functions work. Function documentation varies wildly in helpfulness and completeness, but it’s a useful place to check first if you want to find out what a function does. You can access the help documentation in a few different ways: by running ?function_name or help(function_name) in the Console, or by clicking on the “Help” tab in the Files section of RStudio and using the Find box to search for the function.

The first section, “Description”, varies quite a bit in intelligibility, depending on how complex the function is. Here, if we ignore the information about the other function included in this document, we can see that we have a useful description of round() that tells us that it rounds numbers (that’s a good sign) to a certain number of decimal places. That’s exactly what we want, so how do we use it?

Let’s scroll down to “Usage”, which gives examples of what the function looks like. You can see that the basic structure of this function is round(x, digits = 0). It seems like we need to add some more information in the brackets of our function - but how do we interpret x and digits = 0?

Arguments

The pieces of information inside a function’s brackets, which give the function what it needs to work, are called arguments. Each argument in a function is separated by a comma, so we can see from round(x, digits = 0) that the round() function can take two arguments. How many arguments a function has depends on the function; some (like Sys.Date()) don’t need any arguments to run. One of the most useful parts of a function’s help documentation is the “Arguments” section, which tells you what each of the function’s arguments is and how to use it.
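If you just want a quick reminder of a function’s arguments without opening the full help page, base R’s args() function is one option. The sketch below is an aside rather than a replacement for reading the documentation.

```r
# Show the arguments of round() along with any default values
args(round)

# Sys.Date() runs happily with nothing inside its brackets
Sys.Date()
```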

When referring to arguments, you will hear the terms “named” and “unnamed” arguments. This can be a bit confusing, because all arguments have a name - they have to, otherwise we couldn’t refer to them! The named vs unnamed distinction doesn’t refer to the arguments themselves, but rather how the person using the function chooses to write them out. There are some conventions around which arguments are named or not, so let’s have a look at that now.

The first argument to round() is simply x. Just like in maths, x is a placeholder for some number or numbers (a “numeric vector”, which should sound familiar now) that the function will work on. This is common notation in many functions: x, often the first argument in a function, typically denotes the placeholder for the information you want to use the function on. In our case, the number that we want to round is stored in an object called mean_to_round, so that’s what we’ll put into the x argument.

It’s important to note here that this argument x has no default value. In other words, we must minimally provide some information x to this function or it won’t be able to run. We know this because x appears on its own as an argument; arguments that do have a default value have an = sign followed by their default value (see the next paragraphs on digits = 0). Because we always have to provide this information, x and similar arguments containing the values or data to work on are frequently unnamed when we use them. That means that instead of round(x = 23.15575…) we can just write round(23.15575…). They are also frequently the first argument in the function. So, when you see reference to the “first unnamed argument” - especially important in {tidyverse} functions designed to work with the pipe operator, which we’ll meet next week - that simply means, “the first argument in the function for which the programmer hasn’t specified a name”, which is often, but not always or necessarily, the “data” or “information to work on” argument.

The second argument of round() is digits. You can think of arguments like this as settings that change the way a function works, often with only certain allowable values.

The help documentation tells us that digits should be an “integer indicating the number of decimal places…to be used.” We can also see in “Usage” that this argument has a default value, digits = 0. That means that if we don’t explicitly include the argument digits when we use the function, by default the round() function will round the number you give it to 0 decimal places. Arguments frequently, but not always, have a default, and it’s important to check so the function doesn’t quietly do something unexpected.

Default values of arguments are really useful, because the default is often the most frequently used or safest setting. It means you don’t have to specify every single aspect of a function every time you use it, as long as you want the function to work according to its defaults. In our case, we actually wanted round() to round to two decimal places, not 0. So, in our command, we should change the digits setting from the default, 0, to 2.
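To see a default in action, the sketch below compares leaving digits out entirely with setting it to a value of our own. One decimal place is used here as an arbitrary illustration; the object names are also just for this example.

```r
# Leaving out digits means round() falls back on its default, digits = 0
default_rounding <- round(23.15575)
default_rounding

# Supplying digits ourselves overrides the default
one_decimal <- round(23.15575, digits = 1)
one_decimal
```

The first call gives 23 and the second gives 23.2: same function, same number, different setting.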

RepRoducibility: Default values

When using a function, check what the default values for its different arguments are. The default value of a given argument may not be what you expect it to be, and that can affect your results. For example, by default, SPSS computes a type III ANOVA, while the aov() function in R computes a type I ANOVA.

Using Functions

Now that we know what both of these arguments mean, we can change them to actually translate the sentence “Round the number 23.15575 to two decimal places” into a command that R can work with. We’ll explicitly write out the name of each argument so we know what they are doing.

Exercise

Use the round() function to round 23.15575 to two decimal places.

round(x = 23.15575, digits = 2)
[1] 23.16
## Alternatively, using the object
round(x = mean_to_round, digits = 2)
[1] 23.16

If you want to, you can achieve the same result by changing the order of the arguments, as long as you pay careful attention to which argument(s) you have named.

If we have written the names of both arguments, R can still do what we want it to do with the order of arguments reversed:

round(digits = 2, x = 23.15575)
[1] 23.16

We can also, to some degree, drop the names of the arguments, as long as R can still understand what we’re trying to do. In this case, the “first unnamed argument” is still x! Even though it’s not the first argument we’ve written in the function, it’s the first one that doesn’t have an explicit name.

round(digits = 2, 23.15575)
[1] 23.16

Although I left out the x =, R can still understand this because round() only takes two arguments, and we explicitly told it what value belongs to digits, so it assumes the second number must be x.

If more than one, or all, of the arguments are unnamed, then order becomes critical:

round(23.15575, 2)
[1] 23.16

This time I dropped both argument names. R can still understand this because when you don’t specify which input goes with which argument, R will assume they should go in the default order given in the help documentation. So, R has automatically assigned 23.15575 to x and 2 to digits.
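If you would like to convince yourself that these different ways of writing the command really are equivalent, you could store each result and compare them. The object names below are arbitrary choices for this sketch.

```r
# Three ways of writing the same command
named_both <- round(x = 23.15575, digits = 2)
reversed   <- round(digits = 2, x = 23.15575)
positional <- round(23.15575, 2)

# identical() returns TRUE when two objects are exactly the same
identical(named_both, reversed)
identical(named_both, positional)
```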

As I use R more and more, I find that I name arguments more consistently, even though I know how the function works and dropping them is more efficient (at least in terms of typing). That’s because when I come back from lunch, or the next day, or six months later to revisit the same code, it’s much easier to recall what it all means when it’s well-annotated. So, I strongly recommend getting in the habit of including argument names in your code as a favour to your future self, and to avoid situations like this:

## uh oh!
round(2, 23.15575)
[1] 2

Here, since we left all the arguments unnamed, R assumed that 2 was the number we wanted to round. This isn’t what we wanted - but R has no way of knowing this. It always assumes that what we typed was precisely what we intended to ask R to do.

RepRoducibility: Unnamed arguments

Using unnamed arguments for all inputs in a function can make code hard to read. It requires you to remember all the different inputs that a function can take and the exact order in which this information needs to be put inside the function. For an example of how pain-inducing this can be, let’s look at the t.test() function which can take more than 11 input arguments. I will create two vectors of 20 numbers each, and then I will compare the two vectors using the t.test() function.

# sample 20 numbers from a normal distribution with the given mean and standard deviation
vector_1 <- rnorm(20, mean = 15, sd = 5) 

# sample 20 numbers from a different normal distribution 
vector_2 <- rnorm(20, mean = 25, sd = 5)

# compare the two vectors with a t-test
t.test(vector_1, vector_2, "less", 5, FALSE, TRUE, 0.78)

    Two Sample t-test

data:  vector_1 and vector_2
t = -8.1821, df = 38, p-value = 3.291e-10
alternative hypothesis: true difference in means is less than 5
78 percent confidence interval:
      -Inf -6.752164
sample estimates:
mean of x mean of y 
 14.38322  22.37444 

Looking at the information I have given to t.test() I can only tell what vector_1 and vector_2 correspond to. The rest is mystery and code shouldn’t be mysterious. Let’s compare that to using named arguments:

# compare the two vectors with a t-test
t.test(vector_1, vector_2, 
       alternative = "less", 
       mu = 5, paired = FALSE, 
       var.equal = TRUE, 
       conf.level = 0.78
)

    Two Sample t-test

data:  vector_1 and vector_2
t = -8.1821, df = 38, p-value = 3.291e-10
alternative hypothesis: true difference in means is less than 5
78 percent confidence interval:
      -Inf -6.752164
sample estimates:
mean of x mean of y 
 14.38322  22.37444 

This is a lot better! Some unresolved mystery still remains, so you may want to check the help documentation of the t.test() function and add a comment written in plain English that describes what is happening in that line. More on comments later…
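As one possible sketch of what that might look like, here is the same command with a plain-English comment on each argument. The comments reflect my reading of the t.test() help documentation, and set.seed() with an arbitrary seed is an addition so that the random numbers are the same each time the code runs; the results will therefore differ from the output printed above.

```r
# Fix the random number generator so the sketch is repeatable (arbitrary seed)
set.seed(2024)
vector_1 <- rnorm(20, mean = 15, sd = 5)
vector_2 <- rnorm(20, mean = 25, sd = 5)

result <- t.test(
  x = vector_1, y = vector_2,
  alternative = "less",  # one-sided test: is the difference in means less than mu?
  mu = 5,                # the difference in means assumed under the null hypothesis
  paired = FALSE,        # the two vectors are independent samples
  var.equal = TRUE,      # assume equal variances (pooled), not the Welch correction
  conf.level = 0.78      # confidence level for the reported interval
)

# Degrees of freedom: 20 + 20 - 2 = 38 for a pooled two-sample test
result$parameter
```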

Passing Multiple Values to Arguments

A last important aspect of using functions is that each argument in a function can only take a single object or value as input. For example, we saw above that we put the single value 23.15575 into the x argument of round(). But what if we wanted to round more than one number? We don’t want to have to write a new round() command for every number, even though we could do this if we particularly enjoyed doing a lot of tedious and repetitive typing:

round(23.15575, digits = 2)
[1] 23.16
round(59.5452, digits = 2)
[1] 59.55
round(0.198, digits = 2)
[1] 0.2

Exercise

Before you go on, have a go using a single round() command to round 23.15575, 59.5452, and 0.198.

Hint: Refer to Vectors.

So what happens if we try to put all of those numbers into round()? We might first try this:

round(23.15575, 59.5452, 0.198, 2)
Error in eval(expr, envir, enclos): 4 arguments passed to 'round' which requires 1 or 2 arguments

Once again, R tells us that this doesn’t work by throwing an error. R has tried to do what we wanted, but the round() function only allows a maximum of two arguments, and we’ve given it four. Behind the scenes, R has tried to run round(x = 23.15575, digits = 59.5452... and can’t proceed from there because it doesn’t know what to do with the last two numbers. So, what we need to do is find a way to put all three numbers that we want to round into the first x argument together. If only there were a way to concatenate them together…

You may have guessed where this is going: one method we could use would be to put the three numbers we want to round into a vector, and then use that vector in round() as the x argument. We already saw that we can combine any number of things together into a single vector using the c() function.

## Create an intermediate object to contain the numbers
numbers <- c(23.15575, 59.5452, 0.198)
round(numbers, digits = 2)
[1] 23.16 59.55  0.20
## Put the vector of numbers into round() directly
round(c(23.15575, 59.5452, 0.198), digits = 2)
[1] 23.16 59.55  0.20

Here we can see a good example of a function inside another function. You can stack, or “nest”, functions inside each other like this as much as you like, although it can become difficult to read the code or keep track of what it’s doing. (There’s a great solution to this problem that we’ll encounter in the next tutorial: the pipe operator.)
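Nesting can go more than one level deep. As a sketch, the warm-up calculation and the rounding from earlier could be collapsed into a single line with three functions inside one another; the object name rounded_mean is just for illustration.

```r
# Read from the inside out: c() builds the vector, mean() averages it,
# and round() rounds the result to two decimal places
rounded_mean <- round(mean(c(45.34, 23.001, 7.3, 16.9820)), digits = 2)
rounded_mean
```

This gives 23.16, the same answer as before, but notice how much harder it already is to see at a glance which brackets belong to which function.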

That’s looking like some proper R code! Very nicely done.

Help Documentation, Revisited

Before we leave the round() function altogether, let’s take a look at two more useful sections of the help documentation. Depending on what you are trying to do, the “Details” section can tell you more about how exactly the function works - how it behaves in certain situations, or how it handles unusual or difficult cases. If a function isn’t doing what you expect it to, this is a good place to look for an explanation.

Finally, at the end of the documentation you can find the “Examples” section. If you are learning to use a new function, this section can give you a template for writing your own commands. You can also click the “Run examples” link, which will run the code in the Examples section for you so you can see what the function will do.

RepRoducibility: Why Code the Little Stuff?

You may be wondering what the point is of using functions like round(). After all, it isn’t difficult to round 23.15575 to 23.16 just by looking at it. So, why go to all the trouble to learn a new function, with new arguments and (potentially) its own little quirks, for each little task?

When writing code, there are two important principles that come up again and again. In brief, these are:

Avoid hard-coding

“Hard-coding” essentially means writing information, like numbers, directly into your code as opposed to using code to produce or replicate them. This should be avoided as much as possible in favour of producing values programmatically, i.e. by using code.

To see why, take this tutorial as an example. We have been looking throughout this section at the number 23.15575. As I’m writing this tutorial, instead of typing out that number in the text each time, I have been using inline code to produce this number in the text using the same mean_to_round object we created in that earlier exercise - that is, using code instead of hard-coding, which would be typing the actual numbers into the text each time. This has a couple key benefits:

  • Because code will always work the same way each time, the number 23.15575 will always be replicated correctly - no risk of mistakes, mistyping, forgetting, etc.

  • If I decide to choose a different number for this example, all I need to do is go back to that first task and re-assign a new value to mean_to_round - and the number will be updated throughout the tutorial, wherever that object has been used. If I don’t do this, I will have to manually find and replace every instance of it. Playing spot-the-difference is not only tedious, repetitive, and time-consuming (again, these are the type of tasks that computers are great at) but relies on the programmer not forgetting or missing even one instance of the wrong number. If that number is instead stored in an object, both the repetitive updating and the risk of mistakes are massively mitigated.
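A small sketch of the idea, reusing the numbers from the warm-up exercise: the sentence below is produced from the stored object rather than typed out by hand. The object names scores and sentence are hypothetical, and paste() is used here as a simple stand-in for inline code.

```r
# Store the raw numbers once
scores <- c(45.34, 23.001, 7.3, 16.9820)
mean_to_round <- mean(scores)

# Build the reported sentence from the object instead of typing the number
sentence <- paste("The mean score is", round(mean_to_round, digits = 2))
sentence
```

If the values in scores change, rerunning the code updates the sentence automatically, with no find-and-replace needed.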

For any task that you need to do more than twice, use a function

This rule of thumb is adapted from the R for Data Science chapter on function-writing. While we’re not going to worry about writing our own functions for now, it’s still a good mental habit to get into: whenever you need to perform the same task three or more times, look for a function to do it instead.

Why? Essentially, this goes back to “avoid hard-coding”. If you only ever need to do a task a couple times, it’s fine to just get it done manually (i.e. hard-code it) and move on. However, if you are going to need to do that task again and again, it will be more efficient in the long run - and safer in terms of mistakes - to do it with code instead, even if it takes a bit longer to find and learn a new function (or write one) than it would take to copy/paste. What you lose immediately in time, you gain in skill, replicability, and code resilience for future use.
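For instance, here is a rough sketch of rounding several values with a single call to round() rather than rounding each one by hand. The vector name means_to_round is hypothetical, and the values other than 23.15575 are simply example numbers.

```r
# Several values that all need the same treatment (example numbers)
means_to_round <- c(23.15575, 40.51404, 80.62238)

# One function call handles every value in the vector
rounded_means <- round(means_to_round, digits = 2)
rounded_means
```

The same command works unchanged whether the vector holds three values or three thousand.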

Putting these two things together, we arrive at why it’s preferable to do even simple tasks like rounding with a function instead of “manually”. Using code whenever possible to store, print, and work with data of all kinds - even single values and simple operations - is always the option that is more replicable, more transparent, more resilient to changes, and (at least in the long run) less time- and memory-intensive.

Packages Revisited

We just had an in-depth look at the round() function, but say we weren’t happy with the way this function does rounding in particular situations. For instance, we might want to round a p-value to three decimal places with no leading 0 (as we teach UGs to report p), but that’s not how round() rolls:

round(.00793, digits = 3)
[1] 0.008

So, instead, we might look for a different function that does the rounding as we want, with no leading 0, for example the rd() function linked here. Looks great!

Exercise

Round the number .00793 to three decimal places using the rd() function.

rd(.00793, digits = 3)
Error in rd(0.00793, digits = 3): could not find function "rd"

Well, that didn’t go to plan!

There are two main reasons this “could not find function” error usually appears:

  1. You’ve misspelt the function name. This isn’t the case here (it’s only two letters!).
  2. The package that the function belongs to isn’t installed and/or loaded.

So, this function does exist, but in order to use it, we need to install and then load the package that contains it.
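If you’d like to check these possibilities with code rather than guesswork, one option (a sketch, not the only way) is to ask R directly. requireNamespace() and exists() are both base-R functions; the object names below are just for illustration.

```r
# Is the {weights} package installed? Returns TRUE or FALSE without erroring
weights_installed <- requireNamespace("weights", quietly = TRUE)

# Can R currently find a function called rd() without a package prefix?
rd_available <- exists("rd", mode = "function")

weights_installed
rd_available
```

If the first result is FALSE, the package still needs installing. If the first is TRUE but the second is FALSE, the package is installed but has not been loaded.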

Installing Packages

According to the documentation linked above, the package that this function belongs to is called {weights}, so let’s install it first.

Exercise

Install the {weights} package. Replace package_name in the example command below and run this command in the Console only.

install.packages("package_name")

Note: As mentioned above, please don’t regularly install packages if you’re on the Posit Cloud workspace - but in this specific case, it’s good to get the practice!

Remember to run this command in the CONSOLE and not in a code chunk!

install.packages("weights")

You should see a good bit of chat from R in (alarming) red text. Don’t worry unless you see the actual word “ERROR”. You should soon see the message package ‘weights’ successfully unpacked and MD5 sums checked, which means all is well.

Now that we know we have the package installed, let’s give it another go.

Exercise

Round the number .00793 to three decimal places using the rd() function.

rd(.00793, digits = 3)
Error in rd(0.00793, digits = 3): could not find function "rd"

Still no dice!

Loading Packages

We know that we have the right package installed - but, as you may have guessed, we haven’t loaded it yet. We can fix this error in one of two ways:

  1. Load the {weights} package using the library() function. We can then use the function as written above.
  2. Use explicit style, e.g. weights::rd(), when writing our command. In this case we don’t need to load the package as long as we have it installed.

Alright, third time lucky!

Exercise

Round the number .00793 to three decimal places using the rd() function, using either method to access the function in the {weights} package.

Using explicit style, before loading the package:

weights::rd(.00793, digits = 3)
[1] ".008"

First loading the package, then using the function without the package call:

library(weights)

rd(.00793, digits = 3)
[1] ".008"
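The same two routes work for any installed package. As a sketch that doesn’t depend on {weights}, here is the same pattern with toTitleCase() from {tools}, a package that comes with R but usually isn’t loaded by default (the input text is made up):

```r
# Explicit style: package::function(), no library() call needed
explicit_result <- tools::toTitleCase("hello world")

# Alternatively, load the package first...
library(tools)

# ...then call the function without the package prefix
loaded_result <- toTitleCase("hello world")

# Both routes give the same answer
identical(explicit_result, loaded_result)
```

Explicit style also has the side benefit of showing a reader exactly which package each function comes from.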

Exercises

Now let’s do some coding. The {base} and {stats} packages, which come with every R installation, contain a wide variety of very sensibly-named functions that calculate common descriptive statistics. These include:

  • mean() and median() (there is a function mode(), but it doesn’t do what we’d like it to here!)
  • min() for minimum value, max() for maximum value
  • range() for both minimum and maximum value in a single vector
  • sd() for standard deviation

Exercise

Run the following code in your workbook to generate the data you need for the following tasks.

practice_data <- c(15, 784, 2, NA, 956, 9, 23, 8, 326, 1, 406)

Exercise

Return the median and range of practice_data.

Hint: If you are getting back a result that isn’t particularly informative, use the help documentation to figure out how to deal with it.

A key feature of all of these stats functions is that, by default, they return NA if there are any NAs (missing values) present. (This is very sensible behaviour by default, but is frequently not the information we want when we use them.) So, they all have an argument, na.rm, which determines whether NAs should be removed. By default this argument is set to FALSE (NAs should NOT be removed), which you can see in the Usage section of the help documentation. If you want to get the calculation ignoring any NAs, you can set this argument to TRUE instead.
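You can also check these defaults with code. The sketch below uses a small made-up vector, scores, along with the base-R functions anyNA() and formals(); it is just one way to look, and the help documentation remains the main reference.

```r
# A small made-up vector with one missing value
scores <- c(4, 8, NA, 6)

# Does the vector contain any NAs?
anyNA(scores)

# What is the default value of na.rm for sd()?
formals(sd)$na.rm

# Default behaviour: the NA propagates to the result
mean(scores)

# Setting na.rm = TRUE drops the NA before calculating
mean(scores, na.rm = TRUE)
```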

Because practice_data includes a missing value, running any of these functions with the default settings will just return NA:

median(practice_data)
[1] NA

To ignore any NAs and get output for the values that are there, change the default setting of the na.rm argument for each function.

median(practice_data, na.rm = TRUE)
[1] 19
range(practice_data, na.rm = TRUE)
[1]   1 956

Exercise

Calculate a 10% trimmed mean for practice_data, but specify the trim first in the function.

This one takes a few steps.

First, we’ll need to call up the help documentation for mean(). This tells us that there is an argument, trim, that allows us to supply a fraction to be trimmed from each end before the mean is calculated.

Second, we also can see that like the other stats functions, mean() also has an na.rm argument, which we’ll need to provide if we want to get an answer other than NA.

If we wanted to do this without naming arguments, we could write our command like this:

mean(practice_data, .10, TRUE)
[1] 196.625

However, we’ve been asked to specify the trim first, and we can’t just move it or things go wrong. In the command below, R matches unnamed arguments by position, so it takes .10 as the data (x) and practice_data as the input to trim - and it doesn’t know what to do with a whole vector of numbers there.

mean(.10, practice_data, TRUE)
Error in mean.default(0.1, practice_data, TRUE): 'trim' must be numeric of length one

So, in order to make this function work and stick to our RepRoducibility best practice, let’s instead name all our arguments:

mean(trim = .10, x = practice_data, na.rm = TRUE)
[1] 196.625

Note that because the na.rm argument is in the right place, the function will run correctly without naming it - but it’s better to get in the habit!
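To see that naming arguments really does free you from worrying about their order, here is a quick check (a sketch; the object names positional and named_any_order are made up):

```r
practice_data <- c(15, 784, 2, NA, 956, 9, 23, 8, 326, 1, 406)

# Unnamed arguments are matched by position: x, then trim, then na.rm
positional <- mean(practice_data, .10, TRUE)

# Named arguments are matched by name, so any order works
named_any_order <- mean(na.rm = TRUE, trim = .10, x = practice_data)

# Both commands produce the same trimmed mean
identical(positional, named_any_order)
```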

RepRoducibility: Comments

Now that we have written some code, it is a good time to introduce comments - one of the best ways to glam up a wall of intimidating code into readable text. A comment is a line or piece of text that R will ignore when running code. R uses the ‘#’ symbol to signify the beginning of a comment:

# anything after the "#" symbol will be treated as a comment, even numbers - 1, 2, 3

a <- 5 # comments can be added on the same line after executable code

Comments which describe what code does make the code easier to read. Comments should be easy to understand and clear, but not overly long. Also, make sure you are consistent - use the same terminology and formatting throughout the whole script.

To illustrate, compare this chunk of code…

# Data of the control group consisting of people who completed the control condition of the experiment
group_1 <- rnorm(30, mean = 40, sd = 5)

# Data of the experimental group consisting of people who completed the experimental condition of the experiment 
group_2 <- rnorm(30, mean = 80, sd = 5) 

# Compare the means of the active group and the first group using a t-test assuming that they have equal variances 
t.test(group_1, group_2, var.equal = TRUE)

    Two Sample t-test

data:  group_1 and group_2
t = -35.749, df = 58, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -42.35415 -37.86253
sample estimates:
mean of x mean of y 
 40.51404  80.62238 

… to this one:

# Control group data
control_group <- rnorm(30, mean = 40, sd = 5)

# Experimental group data 
experimental_group <- rnorm(30, mean = 80, sd = 5) 

# Between-subjects t-test comparing the control and experimental groups; assumes equal variances 
t.test(experimental_group, control_group, var.equal = TRUE)

    Two Sample t-test

data:  experimental_group and control_group
t = 35.749, df = 58, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 37.86253 42.35415
sample estimates:
mean of x mean of y 
 80.62238  40.51404 

In addition to making code legible, comments specify your intentions and make it easier to verify that your code is doing exactly what you wanted it to do. For example, if you have computed a between-subjects t-test, but there is a comment saying that you intended to compute a paired-samples t-test, the error will be easier to catch than in the absence of a descriptive comment.
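As a small sketch of a comment and code that do agree, here is a paired-samples t-test on made-up scores (before and after are hypothetical data invented for this example):

```r
# Made-up scores for the same five people, measured twice
before <- c(12, 15, 11, 14, 13)
after <- c(14, 18, 12, 17, 15)

# Paired-samples t-test comparing scores after vs before
paired_result <- t.test(after, before, paired = TRUE)

# The method label confirms which test was actually run
paired_result$method
```

If the comment said “paired” but the paired argument were missing from the command, the mismatch between comment and method label would flag the mistake.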

Exercise

Go back to the code you wrote for the exercises above and add some comments to your code in plain English.

You can write your comments however you like. I (Jennifer) often use them in code that’s just for me to remind myself of why particular steps were necessary; how code I’m not very familiar with works; or just to celebrate successes or progress. Here are some examples with the code we wrote above.

# Data
practice_data <- c(15, 784, 2, NA, 956, 9, 23, 8, 326, 1, 406)

# Calculate descriptives
# Data contains NAs so need to set na.rm to TRUE
median(practice_data, na.rm = TRUE)
range(practice_data, na.rm = TRUE)

# Calculate 10% trimmed mean
mean(trim = .10, x = practice_data, na.rm = TRUE)

Footnotes

  1. As for how to pronounce “dplyr”, the official pronunciation is “dee-ply-er”, with “plier” like the tool for which it’s named. I have heard other people say “dipler”. Since code is always a bit tricky to read aloud, just go with whatever sounds good to you.

  2. Again, not necessarily - the base-R string-manipulation functions grep() and friends, for example, have x as their third argument. I know all the irregularities can be confusing, but remember that R is a massive collaborative project across decades and millions of users, so some quirks are inevitable!

  3. By “safest” setting, I mean that the function makes the fewest assumptions about what you intended.

  4. I’m playing the devil’s advocate a bit here!