You can find the source files for this tutorial on the class GitHub repository:
https://github.com/EEOB-Macroevolution/EEOB565X-Spring2020/tree/master/practicals/01-intro-to-phylo
R is a statistical programming environment with many built-in mathematical functions and many others that are found in packages that can be installed.
Analyses in R are performed using a series of commands which are written in script files and passed to the main console. The general workflow is:
Like any programming language, one must learn its syntax. Over time one learns the commands in R, and how to string them together into meaningful operations. Here are some basic commands:
Assign a value to a variable
a <- 3
a
## [1] 3
b = 4
b
## [1] 4
5 -> c
c
## [1] 5
Combine values into a vector (i.e., array) using the c
function
b <- c(3,4,5)
b
## [1] 3 4 5
Items in a vector can be accessed by calling their position using the []
operators. Note that the first element in a vector is assigned position 1
.
b[2]
## [1] 4
Combine objects into a list using the list
function
l <- list(number = 3, values = c(3.5, 4, 12), message = "many things can go in a list")
l
## $number
## [1] 3
##
## $values
## [1] 3.5 4.0 12.0
##
## $message
## [1] "many things can go in a list"
Items in a list can be accessed by calling their position using the [[]]
operator, or by calling their name using the $
operator.
l[[3]]
## [1] "many things can go in a list"
l$message
## [1] "many things can go in a list"
The data.frame
object is another common datatype we will encounter in R. These operate much like matrices, or two dimensional vectors. data frames can also have multiple types of data in them, they can store strings of categorical variables in one and numerical data in another column. There are many ways to access elements in a data.frame
. If we have columns with names, we can use the $
operator to access specific columns. alternatively, we can use bracket notation to access specific, rows, columns, or a combination of the two. Using braket notation, you specify rows and columns with [rows,column]
; if either the rows or columns are left blank then it will return all rows or columns. For example if we have a data frame saved as my_df
then my_df$height
would return the column named ‘height’ while my_df[,2]
returns the second column and all rows since the rows portion was left blank.
iris
base dataset in R. It should already be loaded, you can make sure it is by typing iris
into the console.head()
function to look at the first few rows of the data frame. What kind of variables are there in the data set and what kind of values do they take?$
operator.R provides functions for generating random deviates from several different parametric distributions. For example, we can draw a single random value uniformly from the interval (0,1):
x <- runif(1)
x
## [1] 0.8462679
The rnorm
function lets us draw from a normal distribution. We can draw 50 values from a normal distribution with a mean of 5 and a standard deviation of 1:
a <- rnorm(50,5,1)
We can use the plot()
function to view a simple scatter plot of values in a vector.
plot(a)
And the hist()
function shows us a histogram of our vector
hist(a)
There are also several built-in functions that allow us to compute summary statistics of a vector:
mean(a)
## [1] 5.098229
median(a)
## [1] 5.170398
var(a)
## [1] 1.10515
Finally, you can look at the help for any function by calling ?
?var
## starting httpd help server ... done
The ape package provides a structure for storing a phylogenetic tree, as well as basic manipulation and plotting functions. The ape structure is used by most R packages which deal with phylogenetic trees, so it is important to understand it.
The first step is to install the package if it is not already
install.packages("ape")
and load it into the environment. ape is distributed by CRAN, the main package repository for R.
library(ape)
ape provides functions for simulating random trees with a fixed number of tips, rcoal
(for ultrametric trees) and rtree
. These functions can be useful for testing code on a small example, which is what we are going to do here.
tree <- rcoal(5)
Let’s take a look at what exactly ape created when we asked for a tree:
tree
##
## Phylogenetic tree with 5 tips and 4 internal nodes.
##
## Tip labels:
## t2, t4, t3, t1, t5
##
## Rooted; includes branch lengths.
This tree is of class phylo
:
class(tree)
## [1] "phylo"
Most phylogenetic packages require phylo
objects as input.
By opening the tree object in the RStudio environment panel, we can see that this object is in fact a list of several components. The list of these components can also be accessed using the names
function.
names(tree)
## [1] "edge" "edge.length" "tip.label" "Nnode"
Following is a short description of these components and the information they contain.
tree$edge
is a nx2 matrix, where each line represents an edge of the tree. The first element is the index of the node above the edge and the second the index of the node below the tree. For instance, here we can see that the first edge is between node 6 and 7:tree$edge[1,]
## [1] 6 1
tree$edge.length
is a vector of the lengths of all the edges of the tree, in the same order as the edge
matrix. For instance, this is the length of the previous edge 6-7:tree$edge.length[1]
## [1] 0.7076483
tree$tip.label
is a vector of tip labels, in the same order as the index of those tips in the edge
matrix (note that in the phylo
format, tips are required to have indices from 1 to ntips, where ntips is the number of tips). For instance, this is the label of the tip with index 3:tree$tip.label[3]
## [1] "t3"
tree$Nnode
is an integer and contains the number of internal nodes of the tree.tree$node.label
vector which contains labels for all the internal nodes, in the same order as their indices in the edge
matrix (so if an internal node has index ntips+5, its label is at position 5 in tree$node.label
).Most of the time, it is not necessary to directly edit the structure of phylo
objects, as the ape package and others provide functions to manipulate trees.
In order to interface with other software, or reuse data, we need to be able to input trees from a file, and in reverse to output trees to a file. ape supports two formats for input/output of trees, the Newick and the Nexus format. Let’s write our example tree to a Newick file:
write.tree(tree, file = "newick.tre")
Opening the file newick.tre
shows that the tree has been written as a Newick string. We can read that file again into the R environment:
newick_tree <- read.tree("newick.tre")
Opening both tree
and newick_tree
in the environment panel of RStudio shows that they are indeed the same tree. Similarly, we can write our tree to a Nexus file,
write.nexus(tree, file = "nexus.nex")
as well as read it back in.
nexus_tree <- read.nexus("nexus.nex")
Again, this tree is identical to the original.
Here we have a phylogeny with branch lengths denoted in green.
read.tree()
function. HINT you will want to put your string in the text
argument of the read.tree()
function.plot()
function. Does it look like the phylogeny from the imageVisualizing trees is another useful function provided by the ape package. By default, trees are plotted as phylograms, with tip labels:
plot(tree)
Other options are available, such as coloring the edges
plot(tree, edge.color = rainbow(8))
or other types of plots.
par(mfrow=c(1,2))
plot(tree, type = "cladogram")
plot(tree, type = "radial")
Note that many other packages extend the ape function or provide their own plotting tools.
Some other ape functions that can be useful are:
getMRCA
function returns the index of the most recent common ancestor of a group of tips.getMRCA(tree, c("t1", "t3"))
## [1] 9
node.depth.edgelength
function calculates the depth of each node using the edge lengths.node.depth.edgelength(tree)
## [1] 0.7076483 0.7076483 0.7076483 0.7076483 0.7076483 0.0000000 0.3276569
## [8] 0.3514894 0.5505876
Note that these are depths, i.e. the root is at d=0. In order to obtain the times, with the present at t=0, the values need to be reversed.
depths <- node.depth.edgelength(tree)
max(depths) - depths
## [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.7076483 0.3799914
## [8] 0.3561590 0.1570607
drop.tip
can also be a helpful function. In some cases we don’t have trait data for all tips on a phylogeny and may wish a get rid of some of those tips.
drop.tip()
This function removes tips from a phylogeny. It has a couple different argument but we’ll only really need to concern ourselves with two of them:
Here we will use the tree that you wrote by hand in the previous section. Let’s say we don’t have any trait data for A or B and want to remove them for our analysis.
tip
argumenttip
argumentDo these trees look the same? What happens to the root of the tree, is the tree the same length?