Practical: Introduction to Phylogenetics in R

Source

You can find the source files for this tutorial on the class GitHub repository:

https://github.com/EEOB-Macroevolution/EEOB565X-Spring2020/tree/master/practicals/01-intro-to-phylo

Working with Trees using the ape Package

The ape package provides a structure for storing a phylogenetic tree, as well as basic manipulation and plotting functions. The ape structure is used by most R packages which deal with phylogenetic trees, so it is important to understand it.

The first step is to install the package if it is not already

install.packages("ape")

and load it into the environment. ape is distributed by CRAN, the main package repository for R.

library(ape)

The tree object

ape provides functions for simulating random trees with a fixed number of tips, rcoal (for ultrametric trees) and rtree. These functions can be useful for testing code on a small example, which is what we are going to do here.

tree <- rcoal(5)

Let’s take a look at what exactly ape created when we asked for a tree:

tree

## 
## Phylogenetic tree with 5 tips and 4 internal nodes.
## 
## Tip labels:
##   t3, t4, t1, t2, t5
## 
## Rooted; includes branch lengths.

This tree is of class phylo:

class(tree)

## [1] "phylo"

Most phylogenetic packages require phylo objects as input.

By opening the tree object in the RStudio environment panel, we can see that this object is in fact a list of several components. The list of these components can also be accessed using the names function.

names(tree)

## [1] "edge"        "edge.length" "tip.label"   "Nnode"

Following is a short description of these components and the information they contain.

tree$edge is a nx2 matrix, where each line represents an edge of the tree. The first element is the index of the node above the edge and the second the index of the node below the tree. For instance, here we can see that the first edge is between node 6 and 7:

tree$edge[1,]

## [1] 6 1

tree$edge.length is a vector of the lengths of all the edges of the tree, in the same order as the edge matrix. For instance, this is the length of the previous edge 6-7:

tree$edge.length[1]

## [1] 1.034535

tree$tip.label is a vector of tip labels, in the same order as the index of those tips in the edge matrix (note that in the phylo format, tips are required to have indices from 1 to ntips, where ntips is the number of tips). For instance, this is the label of the tip with index 3:

tree$tip.label[3]

## [1] "t1"

tree$Nnode is an integer and contains the number of internal nodes of the tree.
optionally, there can also be a tree$node.label vector which contains labels for all the internal nodes, in the same order as their indices in the edge matrix (so if an internal node has index ntips+5, its label is at position 5 in tree$node.label).

Most of the time, it is not necessary to directly edit the structure of phylo objects, as the ape package and others provide functions to manipulate trees.

Reading and writing trees

In order to interface with other software, or reuse data, we need to be able to input trees from a file, and in reverse to output trees to a file. ape supports two formats for input/output of trees, the Newick and the Nexus format. Let’s write our example tree to a Newick file:

write.tree(tree, file = "newick.tre")

Opening the file newick.tre shows that the tree has been written as a Newick string. We can read that file again into the R environment:

newick_tree <- read.tree("newick.tre")

Opening both tree and newick_tree in the environment panel of RStudio shows that they are indeed the same tree. Similarly, we can write our tree to a Nexus file,

write.nexus(tree, file = "nexus.nex")

as well as read it back in.

nexus_tree <- read.nexus("nexus.nex")

Again, this tree is identical to the original.

Try it out! Newick Strings

Here we have a phylogeny with branch lengths denoted in green.

Try writing by hand the newick string representation of this phylogeny.
Now try reading in this newick string using the read.tree() function. HINT you will want to put your string in the text argument of the read.tree() function.
plot your tree with the plot() function. Does it look like the phylogeny from the image

Plotting trees

Visualizing trees is another useful function provided by the ape package. By default, trees are plotted as phylograms, with tip labels:

plot(tree)

Other options are available, such as coloring the edges

plot(tree, edge.color = rainbow(8))

or other types of plots.

par(mfrow=c(1,2))
plot(tree, type = "cladogram")
plot(tree, type = "radial")

Note that many other packages extend the ape function or provide their own plotting tools.

Other useful functions

Some other ape functions that can be useful are:

the getMRCA function returns the index of the most recent common ancestor of a group of tips.

getMRCA(tree, c("t1", "t3"))

## [1] 6

the node.depth.edgelength function calculates the depth of each node using the edge lengths.

node.depth.edgelength(tree)

## [1] 1.0345346 1.0345346 1.0345346 1.0345346 1.0345346 0.0000000 0.8286315
## [8] 0.8515468 0.9162804

Note that these are depths, i.e. the root is at d=0. In order to obtain the times, with the present at t=0, the values need to be reversed.

depths <- node.depth.edgelength(tree)
max(depths) - depths

## [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 1.0345346 0.2059031
## [8] 0.1829878 0.1182542

drop.tip can also be a helpful function. In some cases we don’t have trait data for all tips on a phylogeny and may wish a get rid of some of those tips.

drop.tip()

This function removes tips from a phylogeny. It has a couple different argument but we’ll only really need to concern ourselves with two of them:

$phy$ A phylo object
$tip$ A vector with a single tip or many tips to be removed. The vector may either contain the node number of the tips that are to be removed or a character vector with the names of the tips that you want to remove.

Try it out! Drop tips from a phylogenetic tree

Here we will use the tree that you wrote by hand in the previous section. Let’s say we don’t have any trait data for A or B and want to remove them for our analysis.

Try removing the tips by entering the tip node numbers in the tip argument
Try removing the tips by entering the tip names in the tip argument

Do these trees look the same? What happens to the root of the tree, is the tree the same length?

Estimating trees in R

This lesson is based on a tutorial written by Klaus Schliep entitled “Estimating phylogenetic trees with phangorn”.

To estimate a phylogeny from DNA sequences, we will use the package phangorn.

install.packages("phangorn")

library(phangorn)

Note that this is one of the few R packages for estimating phylogenies. Typically, researchers use other tools for this like RAxML, BEAST2, IQTree, RevBayes, PAUP*, etc. These software tools are developed as stand-alone programs and most are not accessible in the R environment. The purpose of this lesson is to walk you through how to estimate a phylogenetic tree using maximum likelihood for a simple dataset. With larger datasets, it is much more practical to use a tool that is actively developed for difficult phylogenetic problems.

Data File

We will use a common example dataset of 12 species of primates plus two outgroups (cow and mouse). This alignment was assembled by Dr. Masami Hasegawa of the Institute of Statistical Mathematics in Tokyo, from sequencing done by Kenji Hayasaka, Takashi Gojobori, and Satoshi Horai (Molecular Biology and Evolution 5: 626-644, 1988). For more details, see J. Felsenstein’s website.

Download the data file using R:

download.file("https://raw.githubusercontent.com/EEOB-Macroevolution/EEOB565X-Spring2020/master/practicals/01-intro-to-phylo/primates.dna","primates.dna")

This will download the file primates.dna into your current working directory. Now we can read in the file using the function read.phyDat().

primates <- read.phyDat("primates.dna", format = "interleaved")
primates

## 14 sequences with 232 character and 217 different site patterns.
## The states are a c g t

Parsimony

To begin, we will first estimate a tree using parsimony. To perform a heuristic search under parsimony, we can start with a tree built using a random-addition algorithm.

primates_start_tree <- random.addition(primates)

First let’s check the parsimony score of our starting tree using the parsimony() function:

parsimony(primates_start_tree, primates)

## [1] 746

Now let’s plot the tree.

plot(primates_start_tree)

You will notice that the tree does not have meaningful branch lengths. We can represent the branch lengths in units of the number of changes (under parsimony) using the acctran() function.

primates_start_tree <- acctran(primates_start_tree, primates)

The tree should now be depicted as a phylogram where the branch lengths correspond to the number of state changes along a branch.

plot(primates_start_tree)

Now we can optimize the tree topology under parsimony using the optim.parsimony() function. This performs a series of tree rearrangements under NNI.

primates_opt <- optim.parsimony(primates_start_tree, primates)

## Final p-score 746 after  0 nni operations

primates_opt <- acctran(primates_opt, primates)

What is the parsimony score of the optimized tree?

parsimony(primates_opt, primates)

## [1] 746

Plot the optimized tree:

plot(primates_opt)

What is different between the starting tree and the optimized tree?

Maximum Likelihood

Compute the likelihood of the parsimony tree and parsimony branch lengths. For this we use the pml() function of phangorn. By default, this approach computes the likelihood under the Jukes-Cantor (1969) model:

fitJC <- pml(primates_opt, data=primates)

The object fitJC contains the parameter estimates and likelihood of the JC model given the data and the tree topology and branch lengths we estimated under parsimony. We can view these values by calling the variable:

fitJC

## 
##  loglikelihood: -4501.298 
## 
## unconstrained loglikelihood: -1230.335 
## 
## Rate matrix:
##   a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
## 
## Base frequencies:  
## 0.25 0.25 0.25 0.25

We do not want to rely on parsimony for the tree topology and branch lengths, but this tree provides a reasonable starting place. We can use the optim.pml() function to optimize the tree and model parameters under JC69. This will perform NNI tree rearrangements to identify the topology that maximizes the likelihood.

fitJC <- optim.pml(fitJC, optNni=TRUE)

## optimize edge weights:  -4501.298 --> -4394.719 
## optimize edge weights:  -4394.719 --> -4394.719 
##  optimize topology:  -4394.719 --> -4307.223 
##  optimize topology:  -4307.223 --> -3682.3 
##  optimize topology:  -3682.3 --> -3488.014 
##  optimize topology:  -3488.014 --> -3210.295 
##  optimize topology:  -3210.295 --> -3201.436 
## NNI moves:  10 
## optimize edge weights:  -3201.436 --> -3201.436 
##  optimize topology:  -3201.436 --> -3197.995 
##  optimize topology:  -3197.995 --> -3197.995 
## NNI moves:  1 
## optimize edge weights:  -3197.995 --> -3197.995 
##  optimize topology:  -3197.995 --> -3197.995 
## NNI moves:  0

plot(fitJC$tree)

If we want to optimize the tree using a different model like GTR, we can use the pml object as an argument and specify the GTR model.

fitGTR <- optim.pml(fitJC, model="GTR", optNni=TRUE)

## optimize edge weights:  -3197.995 --> -3197.995 
## optimize base frequencies:  -3197.995 --> -2991.994 
## optimize rate matrix:  -2991.994 --> -2798.031 
## optimize edge weights:  -2798.031 --> -2725.055 
##  optimize topology:  -2725.055 --> -2720.394 
##  optimize topology:  -2720.394 --> -2718.486 
##  optimize topology:  -2718.486 --> -2718.486 
## NNI moves:  2 
## optimize base frequencies:  -2718.486 --> -2705.12 
## optimize rate matrix:  -2705.12 --> -2681.75 
## optimize edge weights:  -2681.75 --> -2672.949 
##  optimize topology:  -2672.949 --> -2672.945 
## NNI moves:  0 
## optimize base frequencies:  -2672.945 --> -2671.555 
## optimize rate matrix:  -2671.555 --> -2664.042 
## optimize edge weights:  -2664.042 --> -2659.952 
## optimize base frequencies:  -2659.952 --> -2659.774 
## optimize rate matrix:  -2659.774 --> -2656.292 
## optimize edge weights:  -2656.292 --> -2653.972 
## optimize base frequencies:  -2653.972 --> -2653.915 
## optimize rate matrix:  -2653.915 --> -2651.965 
## optimize edge weights:  -2651.965 --> -2650.552 
## optimize base frequencies:  -2650.552 --> -2650.53 
## optimize rate matrix:  -2650.53 --> -2649.29 
## optimize edge weights:  -2649.29 --> -2648.339 
## optimize base frequencies:  -2648.339 --> -2648.328 
## optimize rate matrix:  -2648.328 --> -2647.475 
## optimize edge weights:  -2647.475 --> -2646.795 
## optimize base frequencies:  -2646.795 --> -2646.789 
## optimize rate matrix:  -2646.789 --> -2646.17 
## optimize edge weights:  -2646.17 --> -2645.664 
## optimize base frequencies:  -2645.664 --> -2645.661 
## optimize rate matrix:  -2645.661 --> -2645.193 
## optimize edge weights:  -2645.193 --> -2619.302 
## optimize base frequencies:  -2619.302 --> -2619.268 
## optimize rate matrix:  -2619.268 --> -2618.995 
## optimize edge weights:  -2618.995 --> -2618.726 
## optimize base frequencies:  -2618.726 --> -2618.725 
## optimize rate matrix:  -2618.725 --> -2618.481 
## optimize edge weights:  -2618.481 --> -2618.255 
## optimize base frequencies:  -2618.255 --> -2618.255 
## optimize rate matrix:  -2618.255 --> -2618.04 
## optimize edge weights:  -2618.04 --> -2617.846

plot(fitGTR$tree)

We can try to reroot the tree using the two outgroup taxa. However, this will return an error if the outgroup is not monophyletic.

GTR_tree_rooted <- root(fitGTR$tree,c("Bovine","Mouse"))

## Error in root.phylo(fitGTR$tree, c("Bovine", "Mouse")): the specified outgroup is not monophyletic

View the details of the fitGTR object. What is the log-likelihood?

fitGTR

## 
##  loglikelihood: -2617.846 
## 
## unconstrained loglikelihood: -1230.335 
## 
## Rate matrix:
##            a            c            g          t
## a  0.0000000  0.521419243 27.763277194  0.3750646
## c  0.5214192  0.000000000  0.004814672 12.0830470
## g 27.7632772  0.004814672  0.000000000  1.0000000
## t  0.3750646 12.083046966  1.000000000  0.0000000
## 
## Base frequencies:  
## 0.387924 0.3855932 0.03982876 0.186654

How are the two trees different? What are possible reasons for this ?

One thing that may be bothering you is how do you decide if you should analyze your data under JC69 or under GTR. One option is to compare the models statistically. You can use a likelihood ratio test to compare the two models.

anova(fitJC, fitGTR)

## Likelihood Ratio Test Table
##   Log lik. Df Df change Diff log lik. Pr(>|Chi|)    
## 1  -3198.0 25                                       
## 2  -2617.8 33         8        1160.3  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results of this comparison show that GTR (model 2) is best supported.

Calculating the Likelihood of a Tree

In the lecture, you were given the optional challenge to calculate the likelihood of the tree pictured below assuming equal branch lengths and the JC69 model. The next part of this lesson will be to do this in R.

For a detailed explanation, you can see Josh Justison’s solution to the challenge problem here: https://jjustison.github.io/tree-likelihood