Ancestral State Estimation

Ancestral State Estimation of Discrete Characters

Reading in the Data

Here we explore methods for estimating ancestral states of discrete characters. First, we bring our data into R:

Load the gieger and phytools package
Read in the tree titled anole.gp.tre. This file can be found in the data folder of this tutorial
Use the read.csv() function to read in the discrete trait data called anole.gp.csv
- be sure to set row.names=1. This makes it so the first column in the csv get read as the rownames.
Match the phylogeny tips to the trait data.
- Be sure to set the sort argument to TRUE.
Assign the matched tree to a new variable called tree and the matched data to a new variable called group. The remainder of the tutorial will assume these variable names for the phylogeny and data frame with traits, respectively.
Try plotting the phylogeny and using the tiplabels() function to plot the character states on the tips.
- There are a lot of species this phylogeny so you may wish to add the argument type="fan" to the plot() function. This plots the phylogeny in a radial fashion and may make the plot less cluttered

## Loading required package: ape

## Warning: package 'ape' was built under R version 4.1.2

## Loading required package: maps

## Warning: package 'geiger' was built under R version 4.1.2

Here we see our phylogeny and our discrete character states for the extant taxa. This is a phylogeny of Anolis lizards and their habitat use. Species tend to concentrate their activities to particular habitat types. Thus, some interesting questions might be:

What is the ancestral Anolis habitat type?
Are species utilizing the same habitat each others’ closest relatives (i.e., are they monophyletic clades)?
How many times have species evolved to utilize a specific habitat?

Discrete Characters: Maximum Parsimony

One method for estimating ancestral character states is based on maximum parsimony (MP). Here the goal is to minimize the total number of changes between character states across the phylogeny to reconcile the data at the tips of the phylogeny with the topology of the phylogeny.

However, while MP methods are straight-forward to understand and quick to implement, they have several serious deficiencies. First, all transitions between traits are considered to be equally likely; which is related to the fact that transitions between states are equally weighted. These assumptions are clearly violated with real data. Also, and critically, branch length information is completely ignored using maximum parsimony. For these reasons, we will NOT demonstrate MP approaches in this tutorial!

Discrete Characters: Maximum Likelihood

An alternative is to use a model-based approach such as maximum likelihood (ML). Here we model the evolutionary transitions between character states, and select those ancestral values that maximize the likelihood of the data, conditioned on the phylogeny, the model, and the ancestral states. Here we perform this analysis in R using the ace() function.

ace()

This function performs ancestral character estimation for both discrete and continuous characters. First, we’ll focus on the parameters needed for discrete character estimation:

$x$ A factor or vector with character traits.
$phy$
$type$ A string with the type of data, either "discrete" or "continuous".
$method$ a string specifying the method used for estimation, can be "ML", "REML", "pic", or "GLS". NOTE: some of these options are only appropriate when type="continuous". For discrete data we’ll use "ML".
$model$ a rate matrix that specifies the parameters for transition rates. This is the same type of rate matrix that we generated in the Phylogenetic Association of Discrete Characters tutorial.

The function outputs a list with many elements, some of the elements of interest are:

$\$loglik$ The log likelihood of the model
$\$rates$ The estimates for the transition rates
$\$lik.anc$ The likelihood of each state at each node

library(corHMM)

## Loading required package: nloptr

## Loading required package: GenSA

##Create the model for trait evolution
group_formatted<-cbind(row.names(group),group[,1]) ##put the data in a specific format because the function is picky
group_model_er <- getStateMat4Dat(group_formatted,"ER")  #equal transition rates
plotMKmodel(group_model_er$rate.mat,rate.cat = 1)

## Warning in if (class(corhmm.obj) == "matrix" & is.null(rate.cat)) {: the
## condition has length > 1 and only the first element will be used

## Warning in if (class(corhmm.obj) == "corhmm") {: the condition has length > 1
## and only the first element will be used

## Warning in if (class(corhmm.obj) == "matrix") {: the condition has length > 1
## and only the first element will be used

## Warning in text.default(x, y, labels = labels, col = label.color, family =
## label.family, : font family not found in Windows font database

##Perform ancestral character estimation
anc.ML<-ace(x = group$x, phy = tree, 
            type = "discrete", method = 'ML',
            model = group_model_er$rate.mat)

anc.ML ## we can see a lot of information just by calling our object

## 
##     Ancestral Character Estimation
## 
## Call: ace(x = group$x, phy = tree, type = "discrete", method = "ML", 
##     model = group_model_er$rate.mat)
## 
##     Log-likelihood: -78.04604 
## 
## Rate index matrix:
##    CG GB TC TG Tr Tw
## CG  .  1  1  1  1  1
## GB  1  .  1  1  1  1
## TC  1  1  .  1  1  1
## TG  1  1  1  .  1  1
## Tr  1  1  1  1  .  1
## Tw  1  1  1  1  1  .
## 
## Parameter estimates:
##  rate index estimate std-err
##           1   0.0231   0.004
## 
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
##          CG          GB          TC          TG          Tr          Tw 
## 0.018197565 0.202238628 0.042841575 0.428609607 0.004383532 0.303729094

## We can see the probability of each state here, just the first few internal nodes will be shown
round(head(anc.ML$lik.anc),3)

##       CG    GB    TC    TG    Tr    Tw
## 83 0.018 0.202 0.043 0.429 0.004 0.304
## 84 0.008 0.197 0.020 0.501 0.001 0.272
## 85 0.003 0.134 0.020 0.716 0.005 0.122
## 86 0.002 0.042 0.019 0.858 0.002 0.077
## 87 0.000 0.001 0.001 0.995 0.000 0.002
## 88 0.000 0.000 0.000 0.998 0.000 0.001

# We can plot the ancestral character estimates with the nodelabels() function
colorkey <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2") ##A colorblind friendly color palette
names(colorkey) <- colnames(anc.ML$lik.anc) ##assign each character state a color
tip_cols<-colorkey[as.character(group$x)] ##putting the trait values in as.character() is important!

plot(tree,type="fan")
nodelabels(pie=anc.ML$lik.anc,piecol=colorkey,cex=0.5)
tiplabels(pch=19,col=tip_cols)
legend(x='bottomleft',legend = names(colorkey),fill=colorkey)

Again, note that internal nodes are estimated as combinations of different states. This reflects the fact that no single state is 100% likely for those nodes.

Try it out! Ancestral State Estimation with an All Rates Different Model

Here we will try an ancestral state estimation with a different model for trait evolution and compare it to our equal rates model. Do the following

Create a rate matrix for our data where all the rates are different
perform ancestral state estimation using the all rates different matrix
- How do rate estimates differ between the two models?
- What is the likelihood of each state at the root of the phylogeny?
Conduct a likelihood ratio test between the two models or compare their AICs to determine which model is a better fit. Hint: For the likelihood ratio test, how many parameters does the ‘ARD’ model have? You may want to look at the rate matrix for help with this.
- Which model would you conclude is a better fit?
Plot the ancestral state estimation on the phylogeny
- How does this compare to the previous visualization? Are any clades drastically different?

Discrete Characters: Bayesian Stochastic Mapping

An alternative to ML approaches is to utilize a Bayesian perspective. Here, a Markov process is used to model trait evolution on the phylogeny, given a model of transition rates between states. A single iteration yields what is called a stochastic map; representing one possible embodiment of evolution under the Markov process. Repeating this many times and summarizing the values provides a useful estimate of ancestral states. Here we perform this analysis in R using the make.simmap() function.

make.simmap()

This function performs stochastic mapping using several different methods. Essentially, given a model for character evolution, it simulates histories under the model. The function has many parameters and modes, we will be using the “empirical” mode that uses the following arguments:

$tree$
$x$ A factor or vector with character traits.
$model$ a rate matrix that specifies the parameters for transition rates. This is the same type of rate matrix that we generated in the Phylogenetic Association of Discrete Characters tutorial.
$nsim$ The number of times we want to simulate a character evolution history

The function outputs a simmap object that lists how long each branch spent in each state.

# simulate single stochastic character map using empirical Bayes method
gp<-setNames(group$x,rownames(group)) ##make.simmap needs a named vector for the data.
tree.smp1<-make.simmap(tree,gp,model=group_model_er$rate.mat)

## make.simmap is sampling character histories conditioned on
## the transition matrix
## 
## Q =
##             CG          GB          TC          TG          Tr          Tw
## CG -0.11570723  0.02314145  0.02314145  0.02314145  0.02314145  0.02314145
## GB  0.02314145 -0.11570723  0.02314145  0.02314145  0.02314145  0.02314145
## TC  0.02314145  0.02314145 -0.11570723  0.02314145  0.02314145  0.02314145
## TG  0.02314145  0.02314145  0.02314145 -0.11570723  0.02314145  0.02314145
## Tr  0.02314145  0.02314145  0.02314145  0.02314145 -0.11570723  0.02314145
## Tw  0.02314145  0.02314145  0.02314145  0.02314145  0.02314145 -0.11570723
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##        CG        GB        TC        TG        Tr        Tw 
## 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

## Done.

plot(tree.smp1,colorkey,type="fan")  #one run.  Not overly useful.

#Must do many times
tree.smp<-make.simmap(tree,gp,model=group_model_er$rate.mat,nsim=100)

## make.simmap is sampling character histories conditioned on
## the transition matrix
## 
## Q =
##             CG          GB          TC          TG          Tr          Tw
## CG -0.11570723  0.02314145  0.02314145  0.02314145  0.02314145  0.02314145
## GB  0.02314145 -0.11570723  0.02314145  0.02314145  0.02314145  0.02314145
## TC  0.02314145  0.02314145 -0.11570723  0.02314145  0.02314145  0.02314145
## TG  0.02314145  0.02314145  0.02314145 -0.11570723  0.02314145  0.02314145
## Tr  0.02314145  0.02314145  0.02314145  0.02314145 -0.11570723  0.02314145
## Tw  0.02314145  0.02314145  0.02314145  0.02314145  0.02314145 -0.11570723
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##        CG        GB        TC        TG        Tr        Tw 
## 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

## Done.

anc.smp<-summary(tree.smp,plot=FALSE)
plot(anc.smp, type="fan", fsize=0.8,ftype="i")

rm(list = ls())

After making many simmaps we summarize by reporting the proportion of the time that each node is in a given state. Again, note that internal nodes are estimated as combinations of different states. This reflects the fact that no single state is 100% likely for those nodes.

Try it out! Porgs Revisited

We looked at Porgs in the Discrete Character Evolution tutorial, now we will look at them again and estimate ancestral states

Read in the Porg phylogeny and character data. If you need to download them again, you can find them here:
- Porg Tree
- Porg Data
Create an ‘equal rates’ rate matrix and all ‘rates different’ rate matrix for the porg data.
use ace() to estimate ancestral states
- What is the likelihood of each state at the root for each model
Plot the ancestral state estimations for each model. How do they compare?
Use the directional model that you created in the Discrete Character Evolution tutorial to create a stochastic mapping.
Plot a summary of your stochastic mappings

Continuous Characters: Maximum Likelihood

For continuous characters, one envisions evolution under a Brownian motion model, which is the embodiment of a Markov process on a continuous scale. Fortunately, the ML algorithm above can accommodate such patterns of character evolution.

Reading in the Data

In order to do ancestral state estimation of continuous characters we need to read in some more data

Read in the tree titled anole.svl.tre. This file can be found in the data folder of this tutorial
- save the phylogeny under the variable name tree, this tutorial will assume it has that variable name
Use the read.csv() function to read in the discrete trait data called anole.svl.csv
- be sure to set row.names=1. This makes it so the first column in the csv get read as the rownames.
- save the data under the variable name data_cont, this tutorial will assume it has that variable name

Importantly, continuous character evolution can also be mathematically derived from other analytical approaches: namely squared change parsimony (SCP: minimizing squared evolutionary changes across the phylogeny) and generalized least squares (GLS).

anc.cont.ML<-fastAnc(tree,cont_data_vect,vars=TRUE,CI=TRUE)
#anc.cont.ML #This would show us the estimate and 95% CI for each internal node
anc.cont.ML$ace  #ancestral estimates

##      101      102      103      104      105      106      107      108 
## 4.065918 4.045601 4.047993 4.066923 4.036743 4.049514 4.054528 4.045218 
##      109      110      111      112      113      114      115      116 
## 3.979493 3.952440 3.926814 3.964414 3.995835 3.948034 3.955203 3.977842 
##      117      118      119      120      121      122      123      124 
## 4.213995 4.240867 4.248450 4.257574 4.239061 4.004120 4.007024 4.015249 
##      125      126      127      128      129      130      131      132 
## 4.006720 3.992617 3.925848 3.995759 4.049619 3.930793 3.908343 3.901518 
##      133      134      135      136      137      138      139      140 
## 3.885086 4.040356 4.060970 3.987789 3.951878 3.814014 3.707733 3.926147 
##      141      142      143      144      145      146      147      148 
## 3.988745 4.167983 4.157021 4.166146 4.146362 4.146132 4.159052 4.113267 
##      149      150      151      152      153      154      155      156 
## 4.133069 4.222223 4.461376 4.488496 4.437250 4.512071 4.953146 5.033253 
##      157      158      159      160      161      162      163      164 
## 4.962377 5.009025 5.018389 3.974900 3.880482 3.859268 3.859265 3.871895 
##      165      166      167      168      169      170      171      172 
## 3.899914 3.807943 3.826619 4.151598 3.760446 3.654324 3.676713 3.825559 
##      173      174      175      176      177      178      179      180 
## 3.747726 3.813974 3.803077 3.702349 3.566476 3.678236 3.675620 3.662452 
##      181      182      183      184      185      186      187      188 
## 3.563181 3.645426 4.017460 4.113689 4.229393 4.381609 4.243153 5.041281 
##      189      190      191      192      193      194      195      196 
## 5.051563 5.051719 5.057591 4.183653 4.151505 4.113279 3.969795 3.905122 
##      197      198      199 
## 4.191810 4.161419 4.092141

#PLOT as color map
tree.col<-contMap(tree,cont_data_vect,plot=FALSE)  #runs Anc. St. Est. on branches of tree
plot(tree.col,type="fan")

As shown by Martins and Hansen (1997), ML, SCP, and GLS yield identical ancestral estimates. For estimates via ML and GLS, we can use the ace() function that we used for ancestral state estimation of discrete characters. For ML estimation, we can use the function as we did before with the exception of changing type="continuous". Getting estimates via GLS should give the same estimates but is a bit more complicated because we have to provide a correlation matrix in the ace() function.

Each approach gives the same result, how neat is that?

Try it out! Ancestral state estimates of Anolis

1: Using the Anolis phylogeny and body size data, what is the estimated ancestral body size (SVL) at the ROOT of the phylogeny when using maximum likelihood?

Continuous Characters: Bayesian Estimation with Fossils

Many times, it is prudent to incorporate fossil information when available. This is critically important, as fossils provide calibration for the remaining estimates. Conceptually, fossils act much like an anchor, and allow one to condition the ancestral estimates on the phylogeny, the model, and the fossils. An example below (via simulation) shows the difference in estimates with and without ‘fossil’ priors:

tree<-pbtree(n=100,scale=1)
## simulate data with a trend
x<-fastBM(tree,internal=TRUE,mu=3)
phenogram(tree,x,ftype="off")

x.tip<-x[match(tree$tip.label,names(x))]
phenogram(tree,x.tip)  #traitgram under BM with no ancestral information

## Optimizing the positions of the tip labels...

#estimate with no prior 
a<-x[as.character(1:tree$Nnode+Ntip(tree))]
x<-x[tree$tip.label]
## let's see how bad we do if we ignore the trend
plot(a,fastAnc(tree,x),xlab="true values",
     ylab="estimated states under BM")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red")
title("estimated without prior information")

## incorporate prior knowledge
pm<-setNames(c(1000,rep(0,tree$Nnode)),
             c("sig2",1:tree$Nnode+length(tree$tip.label)))
## the root & two randomly chosen nodes
nn<-as.character(c(length(tree$tip.label)+1,
                   sample(2:tree$Nnode+length(tree$tip.label),2)))
pm[nn]<-a[as.character(nn)]
## prior variance
pv<-setNames(c(1000^2,rep(1000,length(pm)-1)),names(pm))
pv[as.character(nn)]<-1e-100
## run MCMC
mcmc<-anc.Bayes(tree,x,ngen=100000,
                control=list(pr.mean=pm,pr.var=pv,
                             a=pm[as.character(length(tree$tip.label)+1)],
                             y=pm[as.character(2:tree$Nnode+length(tree$tip.label))]))

## Control parameters (set by user or default):

## List of 7
##  $ sig2   : num 1.09
##  $ a      : Named num 0
##   ..- attr(*, "names")= chr "101"
##  $ y      : Named num [1:98] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "names")= chr [1:98] "102" "103" "104" "105" ...
##  $ pr.mean: Named num [1:100] 1000 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "names")= chr [1:100] "sig2" "101" "102" "103" ...
##  $ pr.var : Named num [1:100] 1e+06 1e-100 1e+03 1e+03 1e+03 ...
##   ..- attr(*, "names")= chr [1:100] "sig2" "101" "102" "103" ...
##  $ prop   : num [1:100] 0.0109 0.0109 0.0109 0.0109 0.0109 ...
##  $ sample : num 100

## Starting MCMC...

## Done MCMC.

anc.est<-colMeans(mcmc$mcmc[201:1001,as.character(1:tree$Nnode+length(tree$tip.label))])

plot(a,anc.est,xlab="true values",
     ylab="estimated states using informative prior")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red")
title("estimated using informative prior")

Ancestral State Estimation

Dean Adams (dcadams@iastate.edu)

2022-03-29

Overview

Learning Objectives

Download data files