This is a short introduction to the basic syntax of the R programming language. This notebook contains the basics for understanding most of the amplicon data analysis workflow.
Learn R, beyond this notebook:
But: it takes many(!) hours to become a confident R user.
# This is a comment. The # at the beginning ensures that R will ignore it.
You should have MANY comments in your code!
Comments will not only help you when you look at the code later, but also others if/when you share your code. It is always better to comment too much than too little.
The working directory is the folder that R reads files from and writes files and plots to.
It is easiest if this is the same folder as the one the script is located in.
In RStudio, set this by clicking "Session", then "Set Working Directory", then "To Source File Location".
Now R looks for data files and saves plots in the same location as your script.
You can also set the working directory with setwd:
setwd("/path/to/my/data")
And you can get the current working directory with getwd:
getwd()
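The relationship between the working directory and relative file names can be sketched as follows. This is a minimal illustration; the file name "mydata.csv" is only an example, and the file does not have to exist.

```r
# Ask R where it currently reads from and writes to
wd <- getwd()

# A relative file name is interpreted relative to the working directory,
# so the full path is the working directory followed by the file name
full_path <- file.path(wd, "mydata.csv")
print(full_path)

# file.exists() reports whether R can find a file at that location
file.exists(full_path)
```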
R understands simple arithmetic
2+2
5*2/(1-2)^3
We can save things in objects (in this case x and y)
x <- 2
y <- "Hello"
Now the names x and y are referring to some data. Working with objects is fundamental in R (and most other programming languages). You will have all your data stored in objects, and you run functions on those objects to create an output (which can be saved as another object).
The names of the objects can be whatever you like, but they cannot contain spaces or special characters, and they cannot start with a number:
YouCanNameYourObject_LikeThis1234 <- 5
# We can use the objects for arithmetic
YouCanNameYourObject_LikeThis1234 * 3
# We can run functions on the objects
toupper(y)
# We can change objects
x <- 2 * 5
print(x)
# We can copy objects
k <- x
print(k)
# Characters, such as "Hello", are always quoted
# Look what happens if we forget the quotation marks:
y <- Hello
# R looks for an object called Hello, which does not exist, and gives an error
# So quotation marks are how R tells strings apart from object names
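To make the difference between an object name and a string concrete, here is a small sketch. The object name greeting is made up for this example.

```r
greeting <- "Hello"      # the string "Hello" is stored in the object greeting

class(greeting)          # "character": the object holds a string
class(2)                 # "numeric"

exists("greeting")       # TRUE: an object with this name exists
exists("Hello")          # FALSE, unless you have created an object called Hello
```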
Functions take the form of functionName(inputToFunction).
You can always find help on how a function works by typing ?functionName
?mean
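Most functions take more than one input (called arguments), and these can be given by position or by name. The sketch below uses round() and log() from base R as examples; the object names are made up.

```r
# Arguments can be matched by position ...
rounded_by_position <- round(3.14159, 2)

# ... or by name, which is easier to read
rounded_by_name <- round(3.14159, digits = 2)

print(rounded_by_name)   # 3.14

# Arguments with default values can be left out or overridden
log(100)                 # natural logarithm (the default base is e)
log(100, base = 10)      # 2

# args() shows which arguments a function accepts
args(round)
```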
We can make vectors with more than one number, character, or logical
# Use the function c() to create a vector
nums <- c(1, 4, 6, 10, 12, 5, 2)
test <- c("Hello", "world.", "Anybody", "there?")
We can do simple operations on these vectors
# Minimum
min(nums)
# Maximum
max(nums)
# Sum
sum(nums)
# Log10
log10(nums)
# Access the third element
nums[3]
test[3]
Using square brackets we can subset a vector with another vector
# First two elements
test[c(1, 2)]
We can also subset with a logical vector, so that only the elements at the TRUE positions are in the output
test[c(TRUE, FALSE, TRUE, FALSE)]
# Append strings
paste("before", test, "after")
paste("Number", nums, sep=":")
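Most operations in R work on whole vectors at once. As a small sketch, using the same values as nums above:

```r
nums <- c(1, 4, 6, 10, 12, 5, 2)

length(nums)          # 7 elements
nums * 2              # arithmetic is applied to every element
mean(nums)            # average of all elements

# A negative index drops elements instead of selecting them
nums[-1]              # everything except the first element

# A comparison gives a logical vector, which can be used for subsetting
large_nums <- nums[nums > 5]
large_nums            # 6 10 12
```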
Factors are a special type of vector. In a factor, the strings/numbers are given "levels", which by default are sorted alphabetically. These levels determine, for example, the order in which categorical variables are plotted (as we will see later), so changing this order means changing the factor levels
x <- factor(c("A", "A", "B", "B"))
Click the > to see the levels
x
x <- factor(c("A", "A", "B", "B"), levels = c("B", "A"))
x
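The levels can be inspected and reordered directly. The following is a minimal sketch with made-up size categories, where the alphabetical default order is not the natural one:

```r
sizes <- factor(c("small", "large", "medium", "small"))

# By default the levels are sorted alphabetically
default_levels <- levels(sizes)
default_levels        # "large" "medium" "small"

# Re-create the factor with the levels in a meaningful order
sizes <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes)         # "small" "medium" "large"

# Internally each element is stored as the number of its level
as.integer(sizes)     # 1 3 2 1
```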
You can convert between types with as.type. Always check the output! Converting might do something unexpected, and is not always reversible
as.character(c(1, 2, 3))
as.numeric(c("1", "2"))
as.factor(c("A", "B", "C"))
as.numeric(c(TRUE, FALSE))
as.character(c(TRUE, FALSE))
as.character(factor(c("A", "B")))
as.numeric(factor(c("A", "B")))
as.logical(c(1, 2.1, -3, 0))
Strings that do not represent numbers cannot be converted to numbers. The result is NA (missing data), and R gives a warning:
as.numeric(c("A", "B"))
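One conversion that often surprises is going from a factor to numbers: as.numeric() returns the internal level codes, not the values shown. A small sketch with made-up values:

```r
# A factor whose labels look like numbers
f <- factor(c("10", "20", "10", "5"))

# This gives the level codes, not 10, 20, 10, 5
codes <- as.numeric(f)
codes

# Convert to character first to recover the numbers shown in the labels
recovered <- as.numeric(as.character(f))
recovered             # 10 20 10 5
```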
Missing data is represented by NA. NAs can be mixed with any data type
c(1, 2, 4, NA)
# Factors ignore NAs in the levels (click the > to see the levels)
factor(c(1, 2, 4, NA))
We can look for missing values
is.na(c(1, 2, 4, NA))
Since the output is a logical we can negate it:
!is.na(c(1, 2, 4, NA))
And we can use that output to get only values that are not missing with square brackets
x <- c(1, 2, 4, NA)
x[!is.na(x)]
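Many summary functions return NA as soon as one value is missing. As a sketch, the na.rm argument of sum() and mean() removes missing values before calculating; the object name measurements is made up.

```r
measurements <- c(1, 2, 4, NA)

sum(measurements)                  # NA, because one value is unknown
sum(measurements, na.rm = TRUE)    # 7
mean(measurements, na.rm = TRUE)   # mean of 1, 2 and 4

# Count how many values are missing
n_missing <- sum(is.na(measurements))
n_missing                          # 1
```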
We can compare variables to check if they are identical or not.
x <- c(10, 11, 12)
Test if equal to (note that there are 2 equal signs!):
x == 10
Test if not equal to:
x != 10
Larger than:
x > 11
Smaller than or equal:
x <= 11
We can count using comparisons. If you use sum on a logical it will count the TRUEs.
Count those above 10:
sum(x > 10)
Compare vectors of equal size:
c("A", "B" ,"C") == c("K", "B", "F")
Check which elements are in a vector:
c("A", "B" ,"C") %in% c("C", "D", "E", "F", "G")
All logicals can be negated:
!c("A", "B" ,"C") %in% c("C", "D", "E", "F", "G")
All logicals can be combined, with and (&) and or (|):
x == 10 | x == 11
Several logical operations can be combined with parentheses:
y <- c(9, NA, 12)
( x > 10 & y > 10 & !is.na(y)) | ( x < 10 & !is.na(y))
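A comparison involving NA gives NA, which is why !is.na() is useful in combined conditions. The sketch below also shows which(), any() and all() from base R for summarising a logical vector; the object names are made up.

```r
x_vals <- c(10, 11, 12)
y_vals <- c(9, NA, 12)

# Comparing with a missing value gives a missing result
y_vals > 10                        # FALSE NA TRUE

# Combining with !is.na() turns the NA into FALSE
y_vals > 10 & !is.na(y_vals)       # FALSE FALSE TRUE

# which() gives the positions of the TRUEs
which(x_vals > 10)                 # 2 3

# any() and all() summarise a logical vector into a single value
any(x_vals > 11)                   # TRUE
all(x_vals > 11)                   # FALSE
```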
We can make lists, which can contain numbers, strings, and anything else in the same list
# Use the function list() to create a list
mylist <- list(this = 2,
these = c("salmon", "herring"),
WhatEverYouWantToCallIt = c(TRUE, TRUE, FALSE))
# Access by name
mylist[["WhatEverYouWantToCallIt"]]
# or by the order (index)
mylist[[2]]
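Elements of a list can also be accessed with the $ sign, and single square brackets behave differently from double ones. A minimal sketch with a made-up list:

```r
fish_list <- list(count = 2,
                  species = c("salmon", "herring"))

# The names of the elements
names(fish_list)                   # "count" "species"

# $ is a shortcut for [[ ]] with a name
fish_list$species

# Double brackets return the element itself, single brackets return a list
class(fish_list[["species"]])      # "character"
class(fish_list["species"])        # "list"

# New elements can be added by assigning to a new name
fish_list$location <- "fjord"
length(fish_list)                  # 3
```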
A matrix is a two-dimensional array, a bit like an Excel spreadsheet with rows and columns. However, all the cells have to contain the same type of data: either all are numerics, all are characters, or all are logicals.
mat <- matrix(1:9, nrow = 3, ncol = 3)
mat
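As a small sketch of how a matrix is filled and indexed, and of what happens when types are mixed (the object names m and mixed are made up):

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

dim(m)        # 3 rows and 3 columns

# The matrix is filled column by column, so row 2, column 3 holds 8
m[2, 3]

# The whole first column
m[, 1]        # 1 2 3

# t() transposes the matrix (rows become columns)
t(m)

# Mixing types forces everything to the most general type, here character
mixed <- matrix(c(1, "a", TRUE, 4), nrow = 2)
class(mixed[1, 1])   # "character"
```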
Dataframes are objects with columns and rows like a matrix. However, in a data.frame the columns can contain different data types. This is a very common way to store data, where each row is a sample, and each column is a variable.
# When we read data from a text file it will be imported as a data.frame
# Let's load an external text file with data
df <- read.table("mydata.csv", header = TRUE, sep = ";")
# header = TRUE means that the first line in our text file contains the names of the columns
# sep = ";" is because semicolons separate the columns in the external file
print(df)
# Use the str() function to check the structure of the dataframe
str(df)
# Access a column in a dataframe with the $ sign
df$group
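In R versions before 4.0.0, strings are converted to factors by default when you read an external file into a data.frame, as you can see from the output above. From R 4.0.0 onwards, strings stay as characters unless you set stringsAsFactors = TRUE in read.table().
You should always check that the file has been loaded correctly, by running str() or View() on the data.frame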
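If you do not have a data file at hand, the import step can be tried on a small data.frame built by hand. The sketch below writes one to a temporary file and reads it back. All values are made up for illustration; the column names group, exp, var1 and var2 follow the ones used in this notebook.

```r
# Build a small data.frame by hand (illustrative values only)
example_df <- data.frame(
  group = c("A", "A", "B", "B"),
  exp   = c("ctrl", "treat", "ctrl", "treat"),
  var1  = c(1.2, 3.4, 2.2, 5.1),
  var2  = c(10, 100, 1000, 10)
)

# Write it to a temporary semicolon-separated text file
tmp_file <- tempfile(fileext = ".csv")
write.table(example_df, tmp_file, sep = ";", row.names = FALSE, quote = FALSE)

# Read it back; stringsAsFactors = TRUE asks for strings to become factors
df_example <- read.table(tmp_file, header = TRUE, sep = ";",
                         stringsAsFactors = TRUE)

# Check that the import worked as expected
str(df_example)
dim(df_example)              # 4 rows and 4 columns
levels(df_example$group)     # "A" "B"
```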
Rows and columns can be accessed with [rows, columns]
Nothing after the comma means that all columns are selected
Nothing before the comma means that all rows are selected
# First row, second column:
df[1, 2]
# First row, all columns:
df[1, ]
# All rows, first column
df[, 1]
# First and second row, column named 'var2'
df[c(1, 2), "var2"]
# First four rows, columns named 'var1' and 'var2'
df[1:4, c("var1", "var2")]
# 1:4 means integers 1 to 4
# New columns can be made with the $ sign
df$var2_log10 <- log10(df$var2)
# Change the data in 10th row and second column
df[10, 2] <- 50
# Add 1 to the entire first column
df[, 1] <- df[, 1] + 1
# Input 20 in the second column, only if "group" column is equal to "A"
df[df$group == "A", 2] <- 20
# There are two = signs!
df
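Rows can also be selected with a logical condition, in the same way as for vectors. A minimal sketch on a made-up data.frame called toy:

```r
toy <- data.frame(group = c("A", "B", "A", "B"),
                  var1  = c(1, 2, 3, 4),
                  var2  = c(10, 20, 30, 40))

# All columns, but only rows where group is "A"
rows_a <- toy[toy$group == "A", ]
rows_a

# Conditions can be combined, and columns selected at the same time
toy[toy$group == "B" & toy$var2 > 25, c("group", "var2")]

# Change values in one column, only for rows that match a condition
toy$var2[toy$group == "A"] <- 0
toy$var2                     # 0 20 0 40
```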
The table() function is very neat for counting the number of occurrences of different strings or numbers
table(df$group)
You can tabulate multiple variables to get counts of all possible combinations:
table(df$group, df$exp)
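The output of table() can be saved and indexed by name like other objects. A small sketch with made-up vectors grp and expt:

```r
grp  <- c("A", "A", "B", "B", "B")
expt <- c("ctrl", "treat", "ctrl", "ctrl", "treat")

# Counts for one variable
counts <- table(grp)
counts
counts[["B"]]            # 3

# Counts for all combinations of two variables
cross <- table(grp, expt)
cross
cross["B", "ctrl"]       # 2

# prop.table() converts counts to proportions
prop.table(counts)
```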
When you press Ctrl+S or go to File -> Save, you only save the script.
You don't save the data (what you see in the upper right corner of RStudio).
If you have run an analysis that took a long time to run, it is nice to save the results so you don't have to rerun the whole thing again.
# To save all data run this:
save.image("Mydata.RData")
# When you open R another day, you load it with the load function
load("Mydata.RData")
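If you only want to save a single object rather than everything, base R also has saveRDS() and readRDS(). This is a sketch of an alternative to save.image(); the object and file names are made up, and a temporary folder is used so nothing is left behind.

```r
# Some result that took a long time to compute (made-up example)
results <- list(n_samples = 12, groups = c("A", "B"))

# Save just this one object to a file
rds_file <- file.path(tempdir(), "results.rds")
saveRDS(results, rds_file)

# Later, read it back and give it any name you like
restored <- readRDS(rds_file)
identical(results, restored)   # TRUE
```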
# A simple scatter plot
plot(df$var1, df$var2)
# Histogram
hist(df$var2)
# ggplot2 is a package for making nice plots.
library(ggplot2)
We sometimes get warnings that the package was not built for our specific version of R (as above). This is usually not a problem, but if you experience problems this could be the cause, and it can be solved by installing the latest version of R and installing the package again.
Packages are usually installed with install.packages("PackageName"). This only has to be done one time.
install.packages("car")
The above only works for packages on CRAN. However, some bioinformatics packages are on Bioconductor, and have to be installed differently (see link). Some packages are only on GitHub and have to be installed from there (see for example this one).
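Because a package only has to be installed once, it can be handy to check whether it is already there before installing. The following is a minimal sketch using requireNamespace() from base R; the helper name is_installed is made up, and nothing is installed by this code.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")    # TRUE, the stats package ships with R

# Only suggest installing when the package is missing
if (!is_installed("car")) {
  message('Package "car" is not installed. Run install.packages("car") once.')
}
```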
ggplot2 is a great package for making neat plots. There is a notebook on the details of ggplot2, but the basics of making a simple plot are described below.
Let's make a scatterplot similar to the one above, but with ggplot2.
A ggplot is always made by starting with the ggplot(data, aes(...)) line
# First define and save the plot as an object:
p <- ggplot(df, aes(x = var1, y = var2)) +
geom_point()
# Then view the plot:
p
# Extra things can be added with +'s
p <- ggplot(df, aes(x = var1, y = var2)) +
geom_point() +
xlab("First variable") +
ylab("Second variable") +
theme_bw() # Remove the grey background
p
# Color by group
p <- ggplot(df, aes(x = var1, y = var2, color = group)) +
geom_point() +
xlab("First variable") +
ylab("Second variable") +
theme_bw()
p
# Save the plot
ggsave(filename = "MyFirstGGplot.png", plot = p,
width = 10, height = 6, units = "cm")