Today we explore methods for estimating ancestral character states. For many macroevolutionary hypotheses, ancestral state estimation is essential. Such estimates allow us to infer the order of evolutionary transitions across the tree of life, estimate how frequently some evolutionary events have occured, and understand the initial conditions that gave rise to current patterns of phenotypic diversity.
Note, parts of this tutorial are based on and adapted from that of L. Revell: http://www.phytools.org/eqg2015/asr.html
To run the analyses in this tutorial, please download the following files from our Git-course repository:
Additionally, The R-script for running the code found in this tutorial is found here:
As before, on your local computer, place them in a folder named ‘TutorialData’, which is found in the same directory as this tutorial.
Here we explore methods for estimating ancestral states of discrete characters. First, we bring our data into R:
gieger
and phytools
packageanole.gp.tre
. This file can be found in the data folder of this tutorialread.csv()
function to read in the discrete trait data called anole.gp.csv
row.names=1
. This makes it so the first column in the csv
get read as the rownames.sort
argument to TRUE
.tree
and the matched data to a new variable called group
. The remainder of the tutorial will assume these variable names for the phylogeny and data frame with traits, respectively.tiplabels()
function to plot the character states on the tips.
type="fan"
to the plot()
function. This plots the phylogeny in a radial fashion and may make the plot less cluttered## Loading required package: ape
## Warning: package 'ape' was built under R version 4.1.2
## Loading required package: maps
## Warning: package 'geiger' was built under R version 4.1.2
Here we see our phylogeny and our discrete character states for the extant taxa. This is a phylogeny of Anolis lizards and their habitat use. Species tend to concentrate their activities to particular habitat types. Thus, some interesting questions might be:
One method for estimating ancestral character states is based on maximum parsimony (MP). Here the goal is to minimize the total number of changes between character states across the phylogeny to reconcile the data at the tips of the phylogeny with the topology of the phylogeny.
However, while MP methods are straight-forward to understand and quick to implement, they have several serious deficiencies. First, all transitions between traits are considered to be equally likely; which is related to the fact that transitions between states are equally weighted. These assumptions are clearly violated with real data. Also, and critically, branch length information is completely ignored using maximum parsimony. For these reasons, we will NOT demonstrate MP approaches in this tutorial!
An alternative is to use a model-based approach such as maximum likelihood (ML). Here we model the evolutionary transitions between character states, and select those ancestral values that maximize the likelihood of the data, conditioned on the phylogeny, the model, and the ancestral states. Here we perform this analysis in R using the ace()
function.
ace()
This function performs ancestral character estimation for both discrete and continuous characters. First, we’ll focus on the parameters needed for discrete character estimation:
"discrete"
or "continuous"
."ML"
, "REML"
, "pic"
, or "GLS"
. NOTE: some of these options are only appropriate when type="continuous"
. For discrete data we’ll use "ML"
.The function outputs a list with many elements, some of the elements of interest are:
library(corHMM)
## Loading required package: nloptr
## Loading required package: GenSA
##Create the model for trait evolution
group_formatted<-cbind(row.names(group),group[,1]) ##put the data in a specific format because the function is picky
group_model_er <- getStateMat4Dat(group_formatted,"ER") #equal transition rates
plotMKmodel(group_model_er$rate.mat,rate.cat = 1)
## Warning in if (class(corhmm.obj) == "matrix" & is.null(rate.cat)) {: the
## condition has length > 1 and only the first element will be used
## Warning in if (class(corhmm.obj) == "corhmm") {: the condition has length > 1
## and only the first element will be used
## Warning in if (class(corhmm.obj) == "matrix") {: the condition has length > 1
## and only the first element will be used
## Warning in text.default(x, y, labels = labels, col = label.color, family =
## label.family, : font family not found in Windows font database
##Perform ancestral character estimation
anc.ML<-ace(x = group$x, phy = tree,
type = "discrete", method = 'ML',
model = group_model_er$rate.mat)
anc.ML ## we can see a lot of information just by calling our object
##
## Ancestral Character Estimation
##
## Call: ace(x = group$x, phy = tree, type = "discrete", method = "ML",
## model = group_model_er$rate.mat)
##
## Log-likelihood: -78.04604
##
## Rate index matrix:
## CG GB TC TG Tr Tw
## CG . 1 1 1 1 1
## GB 1 . 1 1 1 1
## TC 1 1 . 1 1 1
## TG 1 1 1 . 1 1
## Tr 1 1 1 1 . 1
## Tw 1 1 1 1 1 .
##
## Parameter estimates:
## rate index estimate std-err
## 1 0.0231 0.004
##
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
## CG GB TC TG Tr Tw
## 0.018197565 0.202238628 0.042841575 0.428609607 0.004383532 0.303729094
## We can see the probability of each state here, just the first few internal nodes will be shown
round(head(anc.ML$lik.anc),3)
## CG GB TC TG Tr Tw
## 83 0.018 0.202 0.043 0.429 0.004 0.304
## 84 0.008 0.197 0.020 0.501 0.001 0.272
## 85 0.003 0.134 0.020 0.716 0.005 0.122
## 86 0.002 0.042 0.019 0.858 0.002 0.077
## 87 0.000 0.001 0.001 0.995 0.000 0.002
## 88 0.000 0.000 0.000 0.998 0.000 0.001
# We can plot the ancestral character estimates with the nodelabels() function
colorkey <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2") ##A colorblind friendly color palette
names(colorkey) <- colnames(anc.ML$lik.anc) ##assign each character state a color
tip_cols<-colorkey[as.character(group$x)] ##putting the trait values in as.character() is important!
plot(tree,type="fan")
nodelabels(pie=anc.ML$lik.anc,piecol=colorkey,cex=0.5)
tiplabels(pch=19,col=tip_cols)
legend(x='bottomleft',legend = names(colorkey),fill=colorkey)
Again, note that internal nodes are estimated as combinations of different states. This reflects the fact that no single state is 100% likely for those nodes.
Here we will try an ancestral state estimation with a different model for trait evolution and compare it to our equal rates model. Do the following
An alternative to ML approaches is to utilize a Bayesian perspective. Here, a Markov process is used to model trait evolution on the phylogeny, given a model of transition rates between states. A single iteration yields what is called a stochastic map; representing one possible embodiment of evolution under the Markov process. Repeating this many times and summarizing the values provides a useful estimate of ancestral states. Here we perform this analysis in R using the make.simmap()
function.
make.simmap()
This function performs stochastic mapping using several different methods. Essentially, given a model for character evolution, it simulates histories under the model. The function has many parameters and modes, we will be using the “empirical” mode that uses the following arguments:
The function outputs a simmap
object that lists how long each branch spent in each state.
# simulate single stochastic character map using empirical Bayes method
gp<-setNames(group$x,rownames(group)) ##make.simmap needs a named vector for the data.
tree.smp1<-make.simmap(tree,gp,model=group_model_er$rate.mat)
## make.simmap is sampling character histories conditioned on
## the transition matrix
##
## Q =
## CG GB TC TG Tr Tw
## CG -0.11570723 0.02314145 0.02314145 0.02314145 0.02314145 0.02314145
## GB 0.02314145 -0.11570723 0.02314145 0.02314145 0.02314145 0.02314145
## TC 0.02314145 0.02314145 -0.11570723 0.02314145 0.02314145 0.02314145
## TG 0.02314145 0.02314145 0.02314145 -0.11570723 0.02314145 0.02314145
## Tr 0.02314145 0.02314145 0.02314145 0.02314145 -0.11570723 0.02314145
## Tw 0.02314145 0.02314145 0.02314145 0.02314145 0.02314145 -0.11570723
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## CG GB TC TG Tr Tw
## 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
## Done.
plot(tree.smp1,colorkey,type="fan") #one run. Not overly useful.
#Must do many times
tree.smp<-make.simmap(tree,gp,model=group_model_er$rate.mat,nsim=100)
## make.simmap is sampling character histories conditioned on
## the transition matrix
##
## Q =
## CG GB TC TG Tr Tw
## CG -0.11570723 0.02314145 0.02314145 0.02314145 0.02314145 0.02314145
## GB 0.02314145 -0.11570723 0.02314145 0.02314145 0.02314145 0.02314145
## TC 0.02314145 0.02314145 -0.11570723 0.02314145 0.02314145 0.02314145
## TG 0.02314145 0.02314145 0.02314145 -0.11570723 0.02314145 0.02314145
## Tr 0.02314145 0.02314145 0.02314145 0.02314145 -0.11570723 0.02314145
## Tw 0.02314145 0.02314145 0.02314145 0.02314145 0.02314145 -0.11570723
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## CG GB TC TG Tr Tw
## 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
## Done.
anc.smp<-summary(tree.smp,plot=FALSE)
plot(anc.smp, type="fan", fsize=0.8,ftype="i")
rm(list = ls())
After making many simmaps we summarize by reporting the proportion of the time that each node is in a given state. Again, note that internal nodes are estimated as combinations of different states. This reflects the fact that no single state is 100% likely for those nodes.
We looked at Porgs in the Discrete Character Evolution tutorial, now we will look at them again and estimate ancestral states
ace()
to estimate ancestral states
For continuous characters, one envisions evolution under a Brownian motion model, which is the embodiment of a Markov process on a continuous scale. Fortunately, the ML algorithm above can accommodate such patterns of character evolution.
In order to do ancestral state estimation of continuous characters we need to read in some more data
anole.svl.tre
. This file can be found in the data folder of this tutorial
tree
, this tutorial will assume it has that variable nameread.csv()
function to read in the discrete trait data called anole.svl.csv
row.names=1
. This makes it so the first column in the csv
get read as the rownames.data_cont
, this tutorial will assume it has that variable nameImportantly, continuous character evolution can also be mathematically derived from other analytical approaches: namely squared change parsimony (SCP: minimizing squared evolutionary changes across the phylogeny) and generalized least squares (GLS).
anc.cont.ML<-fastAnc(tree,cont_data_vect,vars=TRUE,CI=TRUE)
#anc.cont.ML #This would show us the estimate and 95% CI for each internal node
anc.cont.ML$ace #ancestral estimates
## 101 102 103 104 105 106 107 108
## 4.065918 4.045601 4.047993 4.066923 4.036743 4.049514 4.054528 4.045218
## 109 110 111 112 113 114 115 116
## 3.979493 3.952440 3.926814 3.964414 3.995835 3.948034 3.955203 3.977842
## 117 118 119 120 121 122 123 124
## 4.213995 4.240867 4.248450 4.257574 4.239061 4.004120 4.007024 4.015249
## 125 126 127 128 129 130 131 132
## 4.006720 3.992617 3.925848 3.995759 4.049619 3.930793 3.908343 3.901518
## 133 134 135 136 137 138 139 140
## 3.885086 4.040356 4.060970 3.987789 3.951878 3.814014 3.707733 3.926147
## 141 142 143 144 145 146 147 148
## 3.988745 4.167983 4.157021 4.166146 4.146362 4.146132 4.159052 4.113267
## 149 150 151 152 153 154 155 156
## 4.133069 4.222223 4.461376 4.488496 4.437250 4.512071 4.953146 5.033253
## 157 158 159 160 161 162 163 164
## 4.962377 5.009025 5.018389 3.974900 3.880482 3.859268 3.859265 3.871895
## 165 166 167 168 169 170 171 172
## 3.899914 3.807943 3.826619 4.151598 3.760446 3.654324 3.676713 3.825559
## 173 174 175 176 177 178 179 180
## 3.747726 3.813974 3.803077 3.702349 3.566476 3.678236 3.675620 3.662452
## 181 182 183 184 185 186 187 188
## 3.563181 3.645426 4.017460 4.113689 4.229393 4.381609 4.243153 5.041281
## 189 190 191 192 193 194 195 196
## 5.051563 5.051719 5.057591 4.183653 4.151505 4.113279 3.969795 3.905122
## 197 198 199
## 4.191810 4.161419 4.092141
#PLOT as color map
tree.col<-contMap(tree,cont_data_vect,plot=FALSE) #runs Anc. St. Est. on branches of tree
plot(tree.col,type="fan")
As shown by Martins and Hansen (1997), ML, SCP, and GLS yield identical ancestral estimates. For estimates via ML and GLS, we can use the ace()
function that we used for ancestral state estimation of discrete characters. For ML estimation, we can use the function as we did before with the exception of changing type="continuous"
. Getting estimates via GLS should give the same estimates but is a bit more complicated because we have to provide a correlation matrix in the ace()
function.
Each approach gives the same result, how neat is that?
1: Using the Anolis phylogeny and body size data, what is the estimated ancestral body size (SVL
) at the ROOT of the phylogeny when using maximum likelihood?
Many times, it is prudent to incorporate fossil information when available. This is critically important, as fossils provide calibration for the remaining estimates. Conceptually, fossils act much like an anchor, and allow one to condition the ancestral estimates on the phylogeny, the model, and the fossils. An example below (via simulation) shows the difference in estimates with and without ‘fossil’ priors:
tree<-pbtree(n=100,scale=1)
## simulate data with a trend
x<-fastBM(tree,internal=TRUE,mu=3)
phenogram(tree,x,ftype="off")
x.tip<-x[match(tree$tip.label,names(x))]
phenogram(tree,x.tip) #traitgram under BM with no ancestral information
## Optimizing the positions of the tip labels...
#estimate with no prior
a<-x[as.character(1:tree$Nnode+Ntip(tree))]
x<-x[tree$tip.label]
## let's see how bad we do if we ignore the trend
plot(a,fastAnc(tree,x),xlab="true values",
ylab="estimated states under BM")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red")
title("estimated without prior information")
## incorporate prior knowledge
pm<-setNames(c(1000,rep(0,tree$Nnode)),
c("sig2",1:tree$Nnode+length(tree$tip.label)))
## the root & two randomly chosen nodes
nn<-as.character(c(length(tree$tip.label)+1,
sample(2:tree$Nnode+length(tree$tip.label),2)))
pm[nn]<-a[as.character(nn)]
## prior variance
pv<-setNames(c(1000^2,rep(1000,length(pm)-1)),names(pm))
pv[as.character(nn)]<-1e-100
## run MCMC
mcmc<-anc.Bayes(tree,x,ngen=100000,
control=list(pr.mean=pm,pr.var=pv,
a=pm[as.character(length(tree$tip.label)+1)],
y=pm[as.character(2:tree$Nnode+length(tree$tip.label))]))
## Control parameters (set by user or default):
## List of 7
## $ sig2 : num 1.09
## $ a : Named num 0
## ..- attr(*, "names")= chr "101"
## $ y : Named num [1:98] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:98] "102" "103" "104" "105" ...
## $ pr.mean: Named num [1:100] 1000 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:100] "sig2" "101" "102" "103" ...
## $ pr.var : Named num [1:100] 1e+06 1e-100 1e+03 1e+03 1e+03 ...
## ..- attr(*, "names")= chr [1:100] "sig2" "101" "102" "103" ...
## $ prop : num [1:100] 0.0109 0.0109 0.0109 0.0109 0.0109 ...
## $ sample : num 100
## Starting MCMC...
## Done MCMC.
anc.est<-colMeans(mcmc$mcmc[201:1001,as.character(1:tree$Nnode+length(tree$tip.label))])
plot(a,anc.est,xlab="true values",
ylab="estimated states using informative prior")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red")
title("estimated using informative prior")