An R primer for humdrumR users
Nathaniel Condit-Schultz
July 2022
Source:vignettes/RPrimer.Rmd
Humdrum\(_{\mathbb{R}}\) is a package for the R programming language. You don’t need to be an R master to use humdrum\(_{\mathbb{R}}\), but there are some basic concepts from R that you will need to learn, and ultimately, if you want to get really advanced, you’ll need to develop some R skills. This document is a short primer on R, covering what you need to know in order to make the most out of humdrum\(_{\mathbb{R}}\).
Basic Commands
R code is made up of “expressions” like `2 + 2`, `sqrt(2)`, or `(x - mean(x))^2`. As you can see, you can create very intuitive arithmetic expressions, like `5 / 2` or `3 * 3`. However, the most common elements of R expressions are “calls” to functions. A “function” in R is a pre-built bit of code that does something. Most functions take one or more input arguments, and “return” some kind of output. For example, the function `sqrt()` takes a number as an input argument, and “returns” the square root of that number.
To “call” a function, we write the function’s name, followed by parentheses (`()`). Any input arguments to the function must go inside the parentheses, separated by commas if there is more than one. Here are some examples of common functions being “called” with zero or more input arguments:
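As a minimal sketch of what such calls look like (the particular base R functions and values here are chosen only for illustration, and are not necessarily the ones the original examples used):

```r
# Zero arguments: ask R for today's date.
today <- Sys.Date()

# One argument: the square root of 16.
root <- sqrt(16)              # 4

# Two arguments, separated by a comma.
biggest <- max(3, 7)          # 7

# A call inside a call: c() builds a vector, sum() adds it up.
total <- sum(c(1, 2, 3, 4))   # 10
```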
Different functions have different arguments they recognize, with specific names. For example, the function `log()` takes two arguments, called `x` and `base`. Other functions can take any number of arguments, with any name. You can learn about a function, including the arguments it accepts, by typing `?functionName` at the command line; for example, `?sqrt` or `?mean`. If you see an argument called `...`, that tells you that the function can take any number of arguments.
We can explicitly “name” the function arguments we want by putting `argname = argument` into our calls. For example, you could say `log(10, base = 2)`. Named arguments are very useful when we are creating data, like vectors and `data.frame`s (see below).
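To make the difference between positional and named arguments concrete, here is a small sketch using `log()` and its two arguments `x` and `base`. The input value `8` and the variable names are arbitrary choices for illustration.

```r
# Positional matching: the first argument is x, the second is base.
by_position <- log(8, 2)

# Naming the second argument makes the call easier to read.
by_name <- log(8, base = 2)

# When every argument is named, their order no longer matters.
any_order <- log(base = 2, x = 8)

# All three calls compute the base-2 logarithm of 8, which is 3.
```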
Pipes
Complex expressions might involve a large number of function calls, which can get tiresome to read (or write). For example, something like
log(round(sqrt(mean(x^2))), base = 2)
calls four functions! An expression like that is a bit tricky to read, and it can be really easy to make a mistake where you put the wrong number of parentheses. As an alternative, R gives us the option of calling functions in a “pipe.” The way this works is we use the “pipe” operator `|>`, which takes an input on the left and “pipes” it into a function call on the right. For example, we can rewrite the previous command as:
x^2 |> mean() |> sqrt() |> round() |> log(base = 2)
Much better! To make things even cleaner, R will understand if you spread your expressions across multiple lines, by putting a new line after each `|>`, or function argument:
x^2 |>
  mean() |>
  sqrt() |>
  round() |>
  log(base = 2)

max(sqrt(2),
    log(2),
    exp(2),
    pi / 2)
Variables
When coding in R, you’ll often want to “save” data or other objects so you can reuse them. We do this by “assigning” something (often the result of a function) to a “variable”. This is done using the assignment operators, either `<-` or `->`. A variable name can be any combination of upper and lowercase letters.
Let’s calculate the square-root of two and save it to a variable:
tworoot <- sqrt(2)
We can then reuse that value as many times as we want:
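A minimal sketch of reusing the saved value; the particular follow-up calculations are illustrative choices, not taken from the original:

```r
tworoot <- sqrt(2)

# Use the saved value in new expressions.
squared <- tworoot * tworoot   # approximately 2
doubled <- tworoot + tworoot   # approximately 2.83

# Or pass it to another function.
rounded <- round(tworoot, 2)   # 1.41
```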
You can also assign from left to right, using `->`. This is useful in combination with pipes:
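For instance, a pipe can end by storing its result. This is only a sketch: the input numbers and the variable name `root_total` are made up for illustration, and the native pipe `|>` requires R 4.1 or later.

```r
# Compute from left to right, then store the result at the end of the pipe.
c(1, 4, 9, 16) |>
  sqrt() |>
  sum() -> root_total

root_total   # 1 + 2 + 3 + 4 = 10
```

Because assignment has lower precedence than the pipe, the whole piped expression is evaluated first and only the final result is saved.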
Note that your variable names can also include `_`, `.`, or numeric digits, as long as they aren’t at the beginning of the name. For example, `X1` or `my_name` are valid names, but not `2X`.
Basic Data Structures
In R, there are two fundamental data structures that are used all the time:
- “atomic” vectors
- data.frames
Vectors
In R, the basic units—the atoms, if you will—of information are called “atomic” vectors. There are three basic atomic data types:
- Numbers: `numeric` values.
  - Examples: `3`, `4.2`, `-13`, `254.30`
- Strings of characters: `character` values.
  - Examples: `"note"`, `"a"`, `"do, a dear, a female dear"`
- Logical Trues and Falses: `logical` values.
  - Examples: `TRUE`, `FALSE`
You might be wondering, why are we calling these basic atoms “vectors”? Well, in R, the basic atomic data types are always considered a collection of ordered values. These ordered collections are called vectors. In the simple examples above, each vector only had a single value, so it just looks like one value—single values like this are often called “scalars”. However, R doesn’t really distinguish between scalars (single values) and vectors (multiple values)—everything is always a vector. (Still, we sometimes refer to length-1 vectors as scalars.)
To make a vector from scratch in R, use `c()`, like so:
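The sketch below builds five vectors that match the description that follows. The specific values (including the composer names) are hypothetical stand-ins, since only the types and lengths are described in the text.

```r
c(1, 2, 3)                                               # numeric, length 3
composers <- c("Bach", "Mozart", "Beethoven", "Brahms")  # character, length 4
c(TRUE, FALSE)                                           # logical, length 2
c(32.3)                                                  # numeric, length 1
32.3                                                     # numeric, length 1

# The last two really are the same object.
same <- identical(c(32.3), 32.3)   # TRUE
```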
In this example, we’ve created five vectors.
- A `numeric` vector of length 3.
- A `character` vector of length 4 (composers).
- A `logical` vector of length 2.
- Two `numeric` vectors of length 1.
  - That’s right, `c(32.3)` and `32.3` are the same thing: a vector of length 1.
Notice that vectors can’t mix and match different data types, which makes sense because a vector is a single type of thing. But this means that commands like `c(3, "a")` will actually create a `character` vector, where the `3` is forced to be a character (`"3"`).
Vectorization
Having everything be a vector all the time is very useful, because it allows us to think of and use collections of data as a single thing. If I give you, say, ten thousand numbers, you don’t have to worry about manipulating ten thousand things: rather, you just work with one thing: a vector, which happens to be of length 10,000. In R, we call this vectorization. Generally, in R and in humdrum\(_{\mathbb{R}}\) we will constantly be taking advantage of vectorization to make our lives super easy!
For an example of vectorization, watch this:
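A sketch of the kind of example described next, assuming the Fibonacci sequence is started at 1, 1 (the variable names are illustrative):

```r
# The first eight Fibonacci numbers, starting from 1, 1.
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)

# Multiply the whole vector by a single number in one step.
fib_doubled <- fib * 2
fib_doubled   # 2 2 4 6 10 16 26 42
```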
We created two `numeric` vectors:
- the first eight numbers of the Fibonacci sequence
- the single number `2`

and multiplied them together! Notice that the entire Fibonacci vector is multiplied by two! We don’t have to worry about multiplying each number of the vector; it’s done for us.
There are two ideal circumstances for working with vectors.
- They are the same length.
- One vector is length 1, and the other isn’t.
In the first case, we work with multiple vectors that are all the same length, so each value in each vector is “lined up” with a value in the other vectors. If we, for example, add two such vectors together, each “lined up” pair of numbers is added:
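A minimal illustration with two made-up vectors of length 4:

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

# Each element of x is added to the element of y in the same position.
pairwise <- x + y
pairwise   # 11 22 33 44
```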
In the second case, one of the vectors is length-1 (a “scalar”). In this case, the scalar value is paired with each value in the longer vector (as in the Fibonacci example above).
What happens if we have vectors that are longer than one, but are not the same length? Well, R will generally attempt to “recycle” the shorter vector (that is, repeat it) as necessary to match the length of the longer vector. If the length of the shorter vector evenly divides the length of the longer vector, you generally won’t have a problem:
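As a sketch with made-up values, here a length-2 vector is recycled three times to match a length-6 vector, with no warning:

```r
long  <- c(1, 2, 3, 4, 5, 6)
short <- c(10, 100)

# short is treated as c(10, 100, 10, 100, 10, 100).
recycled <- long * short
recycled   # 10 200 30 400 50 600
```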
If the division is not perfect, R will still “recycle” the shorter vector, but you’ll get a warning:
c(1, 2, 3, 4) * c(2, 3, 4)
> Warning in c(1, 2, 3, 4) * c(2, 3, 4): longer object length is not a multiple
> of shorter object length
> [1] 2 6 12 8
You see the warning message R gave us?
“longer object length is not a multiple of shorter object length”
That’s R telling us that we’ve got an obvious mismatch in the lengths of our vectors.
Generally, it is best to work with vectors that are all the same length and/or scalar values (length-1 vectors), so you can avoid worrying about how exactly R is “recycling” values. This brings us to…
Factors
Factors are a useful modification of `character` vectors, which keep track of all the possible values (“levels”) you expect in your data, even when some of those levels are missing from the vector. This is mainly useful when we are counting data with `table()`.
For example, let’s consider the built-in R object called `letters`:
letters
> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
> [20] "t" "u" "v" "w" "x" "y" "z"
What happens if we call `table()` on `letters`?
table(letters)
> letters
> a b c d e f g h i j k l m n o p q r s t u v w x y z
> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Every letter appears once in the table, duh! What if we randomly sample a handful of letters and table the result?
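One way to try this, as a sketch: the sample size of 15 mirrors the later `factor()` example, and the seed is an arbitrary choice so that the result is repeatable. The native pipe `|>` requires R 4.1 or later.

```r
set.seed(1)  # arbitrary seed, only so the sample is reproducible

# Draw 15 letters (with replacement) and count how often each one appears.
counts <- sample(letters, 15, replace = TRUE) |> table()
counts

# Only letters that were actually drawn get a column in the table.
length(counts)   # at most 15, so fewer than 26
```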
Notice that not all the letters from `letters` appear in the output: if a letter never appears in the sample, it doesn’t get counted.
Let’s try something new: before sampling, I will call the command `factor()` on `letters`:
factor(letters) |> sample(15, replace = TRUE) |> table()
>
> a b c d e f g h i j k l m n o p q r s t u v w x y z
> 0 3 1 1 1 1 0 0 0 0 2 1 0 0 0 1 0 0 0 0 1 1 1 0 1 0
Ah! Now our table includes all possible letters, even though many of them appear `0` times.
So how does this work? Well, the `factor()` function looks at a `character` vector and outputs a new “factor” vector. The factor vector acts just like a `character` vector, except it remembers all the unique values, or “levels”, in the vector:
factor(letters)
> [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
Even if we remove some values from the factor vector, the vector will “remember” these levels. The factor will also remember the order of the levels, so you can make tables ordered the way you want them.
You can access, or set, the levels of a factor using the `levels()` function, or with the `levels` argument to the `factor()` function itself. Maybe we want to tabulate the letters, but put the vowels first:
factor(letters, levels = c('a', 'e', 'i', 'o', 'u',
"b", "c", "d", "f", "g", "h", "j", "k",
"l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "y", "z")) |>
sample(15) |>
table()
>
> a e i o u b c d f g h j k l m n p q r s t v w x y z
> 1 0 1 0 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 0 0 0
Note that if a `character` vector contains values that you don’t include in your `levels`, those values will show up as `NA` in the resulting factor, and you may see warnings like “`invalid factor level, NA generated`.”
Data frames
Data frames are the heart and soul of R. A `data.frame` is simply a collection of vectors that are all the same length, which is ideal for vectorized operations! The vectors in a `data.frame` are arranged as columns in a two-dimensional table. Let’s make a data frame, by feeding some vectors to the `data.frame()` function:
X <- c("C", "D", "E", "F", "G", "A", "B", "C")
Y <- c(0, 2, 4, 5, 7, 9, 11, 12)
Z <- c("P1", "M2", "M3", "P4", "P5", "M6", "M7", "P8")
df <- data.frame(X, Y, Z)
df
> X Y Z
> 1 C 0 P1
> 2 D 2 M2
> 3 E 4 M3
> 4 F 5 P4
> 5 G 7 P5
> 6 A 9 M6
> 7 B 11 M7
> 8 C 12 P8
Notice that each of your columns/vectors can be a different type, with no problem. Also notice that each column has a name; we can inspect these names using the `colnames()` function.
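A sketch of inspecting the names. To keep the snippet self-contained, it rebuilds a shortened, three-row version of the data frame above, so the row count here is an assumption of the example rather than the original data.

```r
# A shortened copy of the data frame built above (first three rows only).
df <- data.frame(X = c("C", "D", "E"),
                 Y = c(0, 2, 4),
                 Z = c("P1", "M2", "M3"))

# colnames() returns the column names as a character vector.
column_names <- colnames(df)
column_names   # "X" "Y" "Z"
```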
Or change them:
colnames(df) <- c('Letters', "Semitones", "Intervals")
df
> Letters Semitones Intervals
> 1 C 0 P1
> 2 D 2 M2
> 3 E 4 M3
> 4 F 5 P4
> 5 G 7 P5
> 6 A 9 M6
> 7 B 11 M7
> 8 C 12 P8
Finally, it’s also possible to assign the column name we want when creating the data frame:
data.frame(Letters = X, Semitones = Y, Intervals = Z)
> Letters Semitones Intervals
> 1 C 0 P1
> 2 D 2 M2
> 3 E 4 M3
> 4 F 5 P4
> 5 G 7 P5
> 6 A 9 M6
> 7 B 11 M7
> 8 C 12 P8
Remember, the vectors in a `data.frame` must all be the same length. If you try to make a `data.frame` with vectors that don’t match in length, you’ll get an error: “`arguments imply differing number of rows`.” The one exception is that you can call `data.frame()` with some scalar (single) values, which will be automatically recycled to match the length of the other vectors.
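Both behaviours can be seen in a short sketch. The column names and values are invented for illustration, and `tryCatch()` is used here only to capture the error message instead of stopping the script.

```r
# A scalar ("C major") is recycled to fill all three rows.
scale_df <- data.frame(Letters = c("C", "D", "E"),
                       Semitones = c(0, 2, 4),
                       Key = "C major")
nrow(scale_df)   # 3

# Vectors of length 3 and 2 cannot be lined up, so data.frame() fails.
mismatch <- tryCatch(
  data.frame(a = c(1, 2, 3), b = c(1, 2)),
  error = function(e) conditionMessage(e)
)
mismatch   # the captured error message
```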
With and Within You
We often want to access the columns/vectors held in a data frame. We can do this several ways. One approach is with the `$` operator, combined with the name of the column we want. For example, we can get the `Letters` column from the data frame we made above using `df$Letters`.
Often, we’ll want to write code that uses a bunch of different columns from the same `data.frame`; in fact, this is the main thing we do most of the time in R! To avoid writing `df$` over and over again, we can use the `with()` function. `with()` allows us to drop “inside” our `data.frame`, where our R commands can “see” the columns as variables:
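A minimal sketch, rebuilding the data frame from above so the snippet runs on its own. The two calculations are illustrative choices, and `paste()` is a base R function that joins strings.

```r
df <- data.frame(Letters   = c("C", "D", "E", "F", "G", "A", "B", "C"),
                 Semitones = c(0, 2, 4, 5, 7, 9, 11, 12),
                 Intervals = c("P1", "M2", "M3", "P4", "P5", "M6", "M7", "P8"))

# Without with(): every column needs the df$ prefix.
span_dollar <- max(df$Semitones) - min(df$Semitones)

# With with(): columns can be referred to by name alone.
span_with <- with(df, max(Semitones) - min(Semitones))   # 12

# Several columns can be used in the same expression.
labels <- with(df, paste(Letters, Intervals))   # "C P1" "D M2" ...
```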
Missing Data
Sometimes we’ll encounter data points which are irrelevant, meaningless, or “not applicable.” In other cases, there may be relevant data that is “missing.” R provides two distinct ways to represent missing/irrelevant data: `NULL` and `NA`.
`NULL` is a special R object/variable, which is used to represent something that is totally missing or empty. `NULL` has length zero (`length(NULL) == 0`) and no value. It cannot be usefully indexed: `NULL[1]` is just `NULL`. Many functions will give an error if passed a `NULL`.
`NA` is quite different from `NULL`. Any atomic vector can have an `NA` value at any (or all) indices; in fact, you can have vectors made up entirely of `NA` values. The `NA` values are still “values” in a vector, but they are used to indicate when there are values that are missing or problematic. Passing a vector with `NA` values to most functions does not lead to an error, though you’ll often get a warning message instead. For example, consider what happens if we apply the command `as.numeric()` to the following strings:
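A sketch of such a conversion. The strings are hypothetical stand-ins (three that look like numbers, plus `"apple"`), since the original vector is not shown.

```r
strings <- c("1", "2.5", "10", "apple")

# The numeric-looking strings convert cleanly; "apple" becomes NA,
# and R warns: NAs introduced by coercion
numbers <- as.numeric(strings)
numbers   # 1.0 2.5 10.0 NA
```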
Three of the strings in this vector are converted to numbers without a problem, but the string `"apple"` makes no sense as a number. So what does R do? It converts the three strings to numbers, just like `as.numeric()` is supposed to, but the `"apple"` string appears as `NA` in the output. We also get a warning message: `NAs introduced by coercion`. You might see that warning sometimes, so now you know what it means!
What would happen if we tried applying a different function to our vector with an `NA`?
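For instance, with a hypothetical vector of three numbers and one `NA`:

```r
numbers <- c(4, 9, 16, NA)

# sqrt() works element by element; the NA stays NA in the same position.
roots <- sqrt(numbers)
roots           # 2 3 4 NA

# The output has the same length as the input.
length(roots)   # 4
```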
The `sqrt()` function has no problem taking the square roots of the three numbers, and it simply “propagates” the `NA` value in its input through to its output. The “propagation” of missing values is a very useful feature in R: it makes sure that we keep track of what data is missing, while keeping our vectors at their original lengths.
Common Functions
- `getwd()`: Get R’s current working directory.
- `setwd()`: Set R’s working directory.
- `summary()`: Summarize the contents of an R object.
Vector functions
- `sort()`: Put the values of a vector into ascending order.
  - Set `decreasing = TRUE` for decreasing order.
- `rev()`: Reverse the order of a vector.
- `rep()`: Repeat a vector.
- `unique()`: Return only the unique values of a vector.
- `x %in% y`: Which elements of the vector `x` appear in the vector `y`?
- `length()`: How long is the vector (or `list()`)?
- `head()` and `tail()`: Return the first or last \(N\) elements of a vector.
  - Provide the `n` argument a natural number to control \(N\).
Math
Arithmetic
- `x + y`: Addition; \(x + y\).
- `x - y`: Subtraction; \(x - y\).
- `-x`: Negation; \(-x\).
- `x * y`: Multiplication; \(xy\).
- `x^y`: Exponentiation; \(x^y\).
  - Use parentheses for things like `x^(1/3)`; \(x^{\frac{1}{3}}\).
- `x / y`: Real division; \(\frac{x}{y}\).
- `x %/% y`: Euclidean division; \(\lfloor \frac{x}{y} \rfloor\).
  - E.g., whole-number division with remainder.
- `x %% y`: `x` modulo `y`; \(x \mod y\).
  - E.g., remainder after whole-number division.
- `diff(x)`: This function calculates the differences between consecutive values in a numeric vector.
  - `diff(c(5, 3))` is the same as `3 - 5`.
Other Math functions
- `sqrt(x)`: Square root of numbers; \(\sqrt{x}\).
- `abs(x)`: Absolute value of numbers; \(|x|\).
- `round(x)`: Round number to nearest integer; \(\lfloor x \rceil\).
- `log(x)`: Log of number (natural log by default); \(\log(x)\).
- `sign(x)`: Sign (1, -1, or 0) of `x`; \(\text{sgn}\ x\).
Distribution and Tendency Functions
- `sum(x)`: The sum of a numeric vector.
- `max(x)`: The maximum value in a numeric vector.
- `min(x)`: The minimum value in a numeric vector.
- `range(x)`: The minimum and maximum values of a numeric vector.
  - To get the size of the range, use `diff(range(x))`.
- `mean(x)`: The arithmetic mean of a numeric vector.
- `median(x)`: The median of a numeric vector.
- `quantile(x)`: Other distribution quantiles of a numeric vector.
Randomization functions
- `sample()`: Takes a random sample from a vector. Can also be used to randomize the order of a vector.
Analysis Functions
- `table()`: Tabulate all unique values in a vector, or cross-tabulate across multiple vectors.
  - When using humdrum\(_{\mathbb{R}}\), you should use the similar `count()` instead!
Useful tricks
- Would you like to know how many elements in a vector match a logical criterion? Take the sum of the logic:
sum((1:100) > 55)
sum(letters %in% c('a', 'e', 'i', 'o', 'u'))
- Would you like to know what proportion of values match your logical criterion? Take the mean of the logic:
mean((1:100) > 55)
mean(letters %in% c('a', 'e', 'i', 'o', 'u'))
Making your own functions
To make your own function in R, you use the `function` keyword, like so:
function(argument1, argument2) {
  # Expressions to evaluate here, involving the arguments
}
For example, let’s make a function that subtracts the mean from a
vector of numbers. We’ll have one argument, which we’ll call
numbers
.
myfunc <- function(numbers) {
mean <- mean(numbers)
numbers - mean
}
We’ve created our function, and assigned it the name `myfunc`, just like any other assignment. Let’s try it out:
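A quick trial, repeating the definition so the snippet runs on its own; the input numbers are arbitrary:

```r
myfunc <- function(numbers) {
  mean <- mean(numbers)
  numbers - mean
}

# The mean of 2, 4, 6, 8 is 5, so each value is shifted down by 5.
centered <- myfunc(c(2, 4, 6, 8))
centered   # -3 -1 1 3
```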
Notice that the last expression in your function definition is the value that gets “returned” by the function.
If you are feeling lazy, you can also define a function with a few fewer keystrokes, using the shorthand `\()` instead of `function()`.
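As a sketch, the mean-subtracting function from above could be written with the shorthand like this. The name `center` is made up for illustration, and the `\()` syntax requires R 4.1 or later.

```r
# Same behaviour as myfunc, written with the \() shorthand.
center <- \(numbers) numbers - mean(numbers)

shifted <- center(c(2, 4, 6, 8))
shifted   # -3 -1 1 3
```

When the body is a single expression, the curly braces can be left out, as here.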