Phylogenetic Regression and Related Topics

Overview

This tutorial is intended to provide an introduction to the basics of implementing several phylogenetic comparative methods using R. As with previous tutorials, we begin with a brief conceptual overview, which sets the stage for why this is type of analysis is crucial for correct macroevolutionary inference. Then, we go over the basics of reading data and phylogenies into R, evaluating the degree to which phylogenetic signal is present in a continuous trait, and learn three ways to examine bivariate trait associations in continuous data: phylogenetically independent contrasts (PIC), phylogenetic generalized least squares (PGLS), and phylogenetic transformation methods.

Learning Objectives

Understand why it is needed to condition trait data on the phylogeny
be able to conduct phylogenetic regression using either PIC, PGLS, or phylogenetic transformation

Download data files

To run the analyses in this tutorial, please download the following files from our Git-course repository:

Additionally, The R-script for running the code found in this tutorial is found here:

RScript

Then, on your local computer, place them in a folder named ‘TutorialData’, which is found in the same directory as this tutorial.

Conceptual Background

First, let’s recall why it is necessary to consider the phylogenetic relationships among taxa when performing comparative analyses. In macroevolutionary studies, we are often interested in determining how two or more traits covary across a broad set of taxa. In such cases, phenotypic values are often obtained at the species level, and represent the typical value for each species (e.g., mean adult body size). However, we know that species are not independent, as they share an evolutionary history as described by their phylogenetic relationships. The figure below illustrates the issue. Here we see an estimate of the phylogeny for hominins (Straight et al. 2013), which displays the evolutionary relationships among these taxa. From this figure, it is evident that sister species such as H. sapiens and H. ergaster are most closely related to one another than, say, H. sapiens and A. anamensis, which are more distantly related. Combining the relationships depicted in this phylogeny with Darwin’s principle of descent with modification, we can generate the following expectation: that H. sapiens and H. ergaster are likely to be more similar in the trait values of their phenotypes than are H. sapiens and A. anamensis. This is because H. sapiens and H. ergaster are more closely related to one another phylogenetically.

Fig 1. Phylogeny of hominins (Straight et al. 2013: A,. J. Phys. Anthropol.).

Now consider the case where we have measured two traits for a series of primate species, body weight and home range size (Fig. 2). Visually there appears to be a clear relationship between the two traits, but how can we test this empirically?

Fig 2. Data for body weight and home range size in primates (Harvey and Clutton-Brock 1981: Behav. Ecol. Siociobiol.).

As a quantitative biologist, one would naturally consider linear regression, and perhaps correlation, to be an appropriate statistical tool in this case. However, using ordinary least squares (OLS) methods, one assumes that the residual error from such models is identical and independently distributed. This expectation is expressed mathematically as: N(0,1). However, because the data points in Fig. 2 represent species values, we encounter the non-independence issue. In particular, the species are not independent, because they exhibit a shared evolutionary history as described by their phylogeny.

As such, the expected covariation between taxa is not iid: N(0,1). Instead, it is expressed as: N(0,V) where V is the expected covariance matrix described by the phylogeny and the evolutionary model under consideration (termed the phylogenetic covariance matrix). It is the non-independence expressed in V that must be accounted for during the regression analysis.

Numerous analytical methods have been developed to take phylogeny into account, which collectively are termed phylogenetic comparative methods (PCMs). DCA prefers to call them the Phylogenetic Comparative Toolkit, as a panoply of methods now exists. What links them conceptually is that they all utilize a phylogenetic perspective when evaluating patterns of trait evolution. Mathematically, these methods condition the data on the phylogeny during the analysis. Today we will learn how to implement the most commonly used methods for continuous variables: methods for implementing Phylogenetic Regression.

Illustrating the Problem

To illustrate the problem with a real-world example, let’s simulate some data phylogenetically and look the correlation of the data before and after conditioning on the phylogeny. This is also a useful skill to learn in R!

set.seed(6679881) ##The n for the largest known Cullen prime. Cullen primes have the form n*(2^n)+1 
library(phytools)

## Loading required package: ape

## Loading required package: maps

#A quick simulation
mytree<- pbtree(n=50, scale=1) #one way to simulate a tree. Note: in R, one has functions to simulate BD trees, random splits trees, etc. using different functions
plot(mytree)

X<-fastBM(tree=mytree) #simulates a continuous trait on phylogeny under Brownian motion
Y<-fastBM(tree=mytree)
cor(X,Y)

## [1] 0.5381317

Without considering phylogeny we can see a correlation of about $0.54$, this is quite large considering the two traits were simulated without any correlation!

Now let’s condition the data on the phylogeny and see look at the correlation between traits. We can use the pic() function to make them independent of the phylogeny, alleviating the issue of traits appearing highly correlated even though solely as a result shared ancestry.

pic()

This function computes the phylogenetically independent contrasts using the method described by Felsenstein (1985). pic has two primary arguments:

$x$ a numeric vector containing the trait values of each species. If the vector is named then values are matched to the labels of phy, otherwise the values are assumed to be in the same order of phy$tip.labels
$phy$ a phylo object

This function returns the phlyogenetically independent contrasts of the traits

X_pic <- pic(x=X,phy=mytree)
Y_pic <- pic(x=Y,phy=mytree)
cor(X_pic,Y_pic)

## [1] 0.07830748

Here we have estimated a correlation of approximately $0.08$, that’s notably closer to the true correlation of $0$ when compared to the correlation of the traits without taking the phylogeny into consideration.

Try it out! The better estimator

We just went over two ways of computing the correlation between two traits: one that doesn’t take the phylogeny into account and one that does. The later should generally be a better estimator of the true correlation as we are accounting for shared ancestry, however, via simulation we can show why it is a better estimator. We can simulate trait data many times and compute the correlation using both methods to get a feel for how these estimators tend to act. Try the following:

Simulate $X$ and $Y$ trait data on mytree 10,000 times
for each repetition of trait data simulation:
- compute the correlation coefficient $r$ of the $X$ and $Y$ and record the value in a vector
- compute $X_{pic}$ and $Y_{pic}$ then calculate their correlation $r_{pic}$ and record the value in a different vector
Make a histogram of the correlations from each method. Additionally, compute the mean and and variance of each vector

What do you notice about the behavior of the correlations $r$ and $r_{pic}$? What is the same? What is different and what does this mean for the quality of these estimators? See discussion in Rohlf 2006. Evol. and elsewhere for discourse on the estimates of parameters when conditioning on the phylogeny

Bringing Trait Data into R

Now, say we have our own data and wish to perform a phylogenetic comparative analysis. Our first step is to read the data and the phylogeny into R and ‘link’ them up. This step is crucial, as we need a way to connect the species trait values in the data matrix with the corresponding branches on the phylogeny. Another important step in this process is pruning. It is very common that the phylogeny has more taxa than the data matrix (because one is utilizing a phylogeny from a broad molecular study), or that the data matrix may contain data for species not represented in the phylogeny. Both types of ‘dangling’ taxa must be excised from their respective data objects.

Our first step is to load various R-libraries which contain functions for reading and manipulating phylogenetic datasets. There are many such packages, but the ‘central’ phylogenetics R package is ape. As seen earlier, this package has many functions for reading and manipulating phylogenetic trees, as well as numerous quantitative analysis performed in a phylogenetic perspective. Many other phylogenetics-R packages also load ape or some of its components.

Also critical for linking up data and a phylogeny is that one must have the species names as a component in the data matrix EXACTLY as they are found in the phylogeny (i.e. names(my_data) or rownames(my_data), depending on data structure of the trait information, match those on the phylogeny phy$tip.label). Luckily, the treedata() function can be used to match data to tree with ease, trimming both elements as nessecary.

treedata()

This function both trims a tree and pruning species from a matrix of trait data such that both elements have the same set of species treedata has two primary arguments and two quality of life arguments:

$phy$ a phylo object
$data$ a named vector or matrix of continuous trait values, for species in phy
$sort$ a logical for whether data should be sorted to match the ordering of names found in phy (i.e sort data such that the $i^{th}$ row of data corresponds to the $i^{th}$ species found in phy$tip.labels)
$warnings$ a logical for whether to report warnings of mismatched taxa between data and phy

This function returns a list containing the trimmed phylogeny and data. If the list is saved as a variable named output, then one can access the phylogeny with the command output$phy and the data with the output$data1.

NOTE: If the function is provided data containing both (continuous variables) and strings (typically categorical variables), then the output output$data will be a matrix with all values converted into strings, which creates issues for functions down the road. We will give an example for how to convert binary categorical variable saved as a string below.

library(ape)    
library(geiger)

#Here is a large time-dated molecular phylogeny (a chronogram):
ManderTree<-read.tree("../data/Mander.tre",tree.names=T)
plot(ManderTree)


#Read in data, phylogeny, and match/prune them to one another
ManderTree<-read.tree("../data/Mander.tre",tree.names=T)
plot(ManderTree)

Mander_dat<-read.csv("../data/PlethodonMns.csv", header=TRUE, row.names = 1) #Notice we read in the first column as row.names. These MUST match the names in the phylogeny

Mander_dat[,12]<-(as.numeric(Mander_dat[,12]=="N")) ##Convert the groups from strings to numerics. Treedata doesn't like different types of data.

#Now, prune the tree to match the data and vice-versa:
Pleth_matched<-treedata(phy=ManderTree,data = Mander_dat,sort=T, warnings=FALSE)
plot(Pleth_matched$phy)

#grabs appropriate column from data matrix after treedata. We'll use these in analyses later
SVL<-Pleth_matched$data[,1] 
HL<-Pleth_matched$data[,3]
BodyW<-Pleth_matched$data[,5]
Fore<-Pleth_matched$data[,6]
Hind<-Pleth_matched$data[,7]
Groups<-as.factor(Pleth_matched$data[,12]);names(Groups)<-rownames(Pleth_matched$data) ##Store the groups and name the vector according to the samples

Try it out! Hydromantes Data Matching

We were able to match our Plethodontid trait data to our salamander phylogeny, trimming both so that they contain the same species. We can now find the trimmed phylogeny in Pleth_matched$phy and trimmed data in Pleth_mathched$data. we also have a dataset of Hydromantes salamanders. Try matching the salamander tree to the Hydromantes dataset:

read in the Hydromantes dataset. The file is named HydromantesMns.csv and can be found in the data folder.
read in the salamander tree. The file is named Mander.tre and can be found in the data folder.
use treedata to sync up the two elements.

How many species were in the Hydromantes dataset before trimming? How many tips were on the salamander tree before trimming? How many species are in the matched dataset? Which species was/were removed from the Hydromantes dataset? Try plotting the trimmed Hydromantes phylogeny and using the tiplabels() function to plot head length on the phylogeny (?tiplabels may be helpful for learning how to use this function).

Methods for Implementing Phylogenetic Regression

Now it is time to put it all together and perform a phylogenetic comparative analysis. Using what we have learned, we will use the chronogram and a matrix of quantitative data Plethodontid salamanders. In this case, the data are a set of continuous linear measurements representing morphology. We will perform a bivariate regression of various traits to evaluate the evolutionary association between them, in particular we will be looking at body width (BodyW), hindlimb length (Hind),forelimb length (Fore), head length (HL), and body size (SVL). However, before we look at the different methods for phylogenetic regression, it may be useful to familiarize ourselves with some of the jargon and functions that are used across methods.

Models, Regression, and Statistical Tests in R

As we look at the various ways we can implement a phylogenetic regression there will be a lot of new functions and some strange syntax. A lot of the syntax is related to linear models created in R. There are many manipulations that can be done will modeling in R, however, we will be providing a brief and barebones introduction to some of these concepts. You can check out this free digital book for a more robust introduction to modeling and regression in R.

First, we will encounter the ~ operator when creating and fitting models in R. The tilde is used to denote the relation between variables. For example, we interpret the expression Y~X as a model where variation in Y (the dependent variable) depends on variation in X (the independent variable). This expression is then used by various functions (e.g. lm(), gls, lm.rrpp()) to fit the linear model, and estimate parameters from it.

Once we have and fit the linear model and obtained estiamtes of its parameters, we may wish to see those estimates and obtain key statistical summary values of the model. The two primary functions used for this are summary() and anova(). If you’re not familiar with these functions be sure to check out the boxes for each function.

summary()

This is a very general function for summarizing a lot of different types of R objects, when fit model objects are fed in the function will return summary statistics for parameters. For example:

##Make a linear model with preset data in R
my_linear_model <- lm(cars$dist ~ cars$speed) #Make a model to describe how car speed affects the distance traveled
summary(my_linear_model)

We can see that our model performed a regression and fit a line of the form $Y=\beta X+\epsilon$. For the coefficients we can see the y-intercept (denoted by (Intercept)) and the slope (denoted by cars$speed) that describes how increases in speed increases in distance. We can see the estimates for these parameters are $-17.5791$ and $3.9324$ respectively. We can assess whether these estimates differ significantly from $0$ by looking at the p-value denoted by Pr(>|t|). Since both have very small p-values, we might be inclined to say that both the slope and the intercept are nonzero. For the case of a nonzero slop, we then might conclude that car speed does indeed have an impact on the distance traveled.

We can also observe other summary statistics on the bottom, such as the R-squared, F-statistic, and degrees of freedom. The exact layout and statistics shown will change based on the type of object used in the function but generally they will display these statistics.

anova()

This function performs an analysis of variance for model objects. This suite of tests is often used to assess whether two groups have different means. However, in the case of linear modeling it can be used to test whether the effects (slope parameter) of covariates are nonzero.

##Make a linear model with preset data in R
my_linear_model <- lm(cars$dist ~ cars$speed) #Make a model to describe how car speed affects the distance traveled
anova(my_linear_model)

Here we see how well our coviariate of cars$speed does at explaining the response variable cars$dists. We can see the F-statistic of $89.567$ and the probability of having an F-statistic at least that extreme under the null model (slope of $0$) is $1.49\times10^{-12}$. We can actually see this is the same F-statistic presented in the summary() function. Although the two functions displayed some of the same information, this won’t always be the case for other linear models.

Using PIC, PGLS, and Phylogenetic Transformation

You can click each of the buttons to see how to perform phylogenetic regression under each method.

0: Phylogenetically-Naive Analysis

First we’ll explore the relationship between hindlimb length (Hind) and body width (BodyW).

Let’s see what happens when we erroneously don’t account for phylogeny.

#Phylogenetically-naive analyses
plot(BodyW,Hind)

cor.test(Hind,BodyW)  #correlation of two traits: pretty high

## 
##  Pearson's product-moment correlation
## 
## data:  Hind and BodyW
## t = 16.546, df = 43, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8748508 0.9609558
## sample estimates:
##       cor 
## 0.9296516

anova(lm(Hind~BodyW))

## Analysis of Variance Table
## 
## Response: Hind
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## BodyW      1 685.38  685.38  273.76 < 2.2e-16 ***
## Residuals 43 107.65    2.50                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When plotting the traits out it would appear that the two are highly correlated. In fact, a correlation test shows a correlation of $0.93$ with a very small p-value of $2.2\times10^{-16}$. Indeed, and ANOVA also leads us to believe that there is a non-trivial effect of hindlimb length on body width.

1: Phylogenetically Independent Contrasts

Now we’ll look again at the relationship between hindlimb length (Hind) and body width (BodyW), but first by conditioning on the phylogeny.

Now, let’s recall what PIC does. The approach (Felsenstein 1985) does not perform correlation or regression on the data at the tips of the phylogeny, but rather generates a set of contrast scores at each node of the phylogeny. Under Brownian motion, these are independent of one another and of the phylogeny. One uses the pruning algorithm to work down the phylogeny from tips to root, and obtains a set of N-1 contrasts for N taxa. These are then used in the comparative analysis. The approach is shown conceptually in Fig. 3.

Fig 3. Conceptual figure visualizing the phylogenetically independent contrasts procedure (from Harvey and Pagel, 1991).

Now let’s try this for our dataset:

#Generate PICs and test while conditioning on phylogeny
Hind_pic<-pic(x = Hind, phy = Pleth_matched$phy) 
BodyW_pic<-pic(x = BodyW, phy = Pleth_matched$phy) 
cor.test(Hind_pic,BodyW_pic)

## 
##  Pearson's product-moment correlation
## 
## data:  Hind_pic and BodyW_pic
## t = 8.0398, df = 42, p-value = 4.923e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6264769 0.8735296
## sample estimates:
##       cor 
## 0.7785548

plot(BodyW_pic,Hind_pic)

anova(lm(Hind_pic~BodyW_pic+0))   #regression through the origin (b/c order of taxa for contrast is arbitrary [can 'spin' on node])

## Analysis of Variance Table
## 
## Response: Hind_pic
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## BodyW_pic  1 18.743 18.7428  59.776 1.142e-09 ***
## Residuals 43 13.483  0.3135                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm(Hind_pic~BodyW_pic+0))   #coefficients of the model

## 
## Call:
## lm(formula = Hind_pic ~ BodyW_pic + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9146 -0.1847  0.1154  0.5444  1.3564 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## BodyW_pic   1.7760     0.2297   7.732 1.14e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.56 on 43 degrees of freedom
## Multiple R-squared:  0.5816, Adjusted R-squared:  0.5719 
## F-statistic: 59.78 on 1 and 43 DF,  p-value: 1.142e-09

We have just performed our first phylogenetic comparative analysis of trait correlation! Our findings are as follows: when conditioned on the phylogeny, there was a significant relationship between hindlimb length and body width. This reveals a strong pattern of evolutionary covariation between the two traits at a macroevolutionary scale.

2: Phylogenetic Generalized Least Squares

Now let’s investigate the another pattern using a different mathematical implementation, this time we’ll look at head length (HL) and hindlimb length (Hind). Recall from lecture that PICs are a wonderful algorithm for conditioning the data on the phylogeny which results in a set of contrasts that are independent of phylogenetic relationships. However, one could also consider the problem from a purely statistical perspective. Consider standard regression. Ordinary least squares (OLS) methods assume that the residual error is iid: $N(0,1)$. However, a phylogeny explicitly describes non-independence between species. This may be encoded in a phylogenetic covariances matrix $V$, which may be used to model the error covariance in a Generalized Least Squares (GLS) regression. This approach is more flexible and using PICs, as one may perform any model of the form Y~X while conditioning on the phylogeny (e.g., regression, ANOVA, multiple regression, etc.). As we will see later in the semester, the approach is also more flexible in that non-Brownian motion models may be incorporated by using a different expected covariance matrix $V$. With PGLS, the statistical procedure is the same as our familiar OLS regression with the exception that non-independence is accounted for using weighted LS approaches (Fig. 4).

Fig 4. Mathematical description of PGLS.

vcv.phylo() and corBrownian()

Both of these functions are used to generate the variance-covariance matrix that describes the shared ancestry of the phylogeny. Generally these models assume Brownian motion as a model for generating the matrix $V$, although with some effort vcv.phylo() can use different models for trait evolution. These two functions differ in the way that the variance covariance matrix $V$ is formatted. corBrownian() generates a structure that is applicable to the robust linear modeling framework already present in R while vcv.phylo() simply returns a matrix with variance covariance matrix that is often used in functions developed by folks in the phylogenetics community.

While these functions have a few different arguments, they both really only require one:

$phy$ a phylo object

NOTE: although it is not a required argument for corBrownian(), using the form argument may come in handy. This argument takes in an expression that denotes the taxa covariate and optionally some grouping factor. At a minimum it can be handy to supply ~species, where species is just a vector with all the species names, this helps the GLS framework match tips on the phylogeny to rows of the data. If nothing is given in the form argument then the function will assume that the ordering of tips in the phylogeny is the same row ordering of the data. We can get away luckily without an a form argument since the two data structures match as we set sort=TRUE in treedata() but this may not always be the case.

#To perform PGLS in R, we must first estimate the phylogenetic covariance matrix V (C in matrix form):
spc <- Pleth_matched$phy$tip.label
V<-corBrownian(phy=Pleth_matched$phy,form = ~spc)
C <- vcv.phylo(phy = Pleth_matched$phy)

#Now we run the analysis:
library(nlme)

bm_gls<-gls(HL~Hind,correlation = V, data=data.frame(Hind, HL))
summary(bm_gls)

## Generalized least squares fit by REML
##   Model: HL ~ Hind 
##   Data: data.frame(Hind, HL) 
##        AIC      BIC    logLik
##   141.1063 146.3899 -67.55317
## 
## Correlation Structure: corBrownian
##  Formula: ~spc 
##  Parameter estimate(s):
## numeric(0)
## 
## Coefficients:
##                 Value Std.Error  t-value p-value
## (Intercept) 1.8698582 1.1661266 1.603478  0.1162
## Hind        0.5623206 0.0641191 8.769939  0.0000
## 
##  Correlation: 
##      (Intr)
## Hind -0.689
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -0.95942493 -0.42396840 -0.12421551  0.08266248  1.93777971 
## 
## Residual standard error: 1.895069 
## Degrees of freedom: 45 total; 43 residual

anova(bm_gls)

## Denom. DF: 43 
##             numDF   F-value p-value
## (Intercept)     1 111.16364  <.0001
## Hind            1  76.91183  <.0001

We can see a significant relationship between head length and hindlimb. We should be able to draw the exact same conclusion from using PIC as well, so long as one performs the analysis correctly (see also Garland and Ives, 2000; Rohlf, 2001; Blomberg 2012 and others)!

3: Phylogenetic Transform

A third implementation of phylogenetic regression is to use phylogenetic transformation. We will use phylogenetic transformation to explore the relationship between hindlimb length (Hind) and forelimb length (Fore). Here we take the phylogenetic covariance matrix $V$ and from it obtain a matrix that re-expresses the expected non-independence through a series of axes representing a transformed dataspace. The $X$ and $Y$ data are then multiplied by this transformation matrix (ie., they are projected into the new dataspace). This results in data that have been conditioned on the phylogeny. As such the resulting data are independent of one another, and standard OLS methods may be used (see Garland and Ives, 2000; Adams 2014; Adams and Collyer 2018). The approach is illustrated graphically in Fig. 5.

Fig 5. Mathematical description of phylogenetic transformation.

library(RRPP)
##lm.rrpp, the function used to create the model, requires the data in a specific format. rrpp.data.frame puts the data in that format. It is essentially a data frame in a specific format.
rdf<-rrpp.data.frame(Hind=as.matrix(Hind),Fore=as.matrix(Fore),Groups=as.matrix(Groups), C=C)
res.PhyT<-lm.rrpp(Hind~Fore, data = rdf, Cov = C, print.progress = FALSE)
anova(res.PhyT)

## 
## Analysis of Variance, using Residual Randomization
## Permutation procedure: Randomization of null model residuals 
## Number of permutations: 1000 
## Estimation method: Generalized Least-Squares (via OLS projection) 
## Sums of Squares and Cross-products: Type I 
## Effect sizes (Z) based on F distributions
## 
##           Df     SS      MS     Rsq     F      Z Pr(>F)   
## Fore       1 29.887 29.8867 0.92743 549.5 9.2857  0.001 **
## Residuals 43  2.339  0.0544 0.07257                       
## Total     44 32.225                                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call: lm.rrpp(f1 = Hind ~ Fore, data = rdf, Cov = C, print.progress = FALSE)

res.PhyT$LM$gls.coefficients ## the estimated coefficients of the model

## (Intercept)        Fore 
##   0.5243724   1.1428053

Here we can see a significant relationship between hindlimb length and forelimb length. We should come to the same result, and have the same parameter estimates and statistical summary measures, regardless of whether we use this approach, PIC, or PGLS. The advantage of this last approach however, will become clear in a few weeks. This implementation may be used for examining patterns in highly-multivariate phenotypes $(Y)$, whereas parametric versions of PIC and PGLS are challenged under these conditions.

Try it out! Are they all the same?

While each method has their pros and cons (e.g. PGLS is good at accommodating various models for trait evolution while phylogenetic transformation is good for multivariate data), we should arrive at the same place regardless of the method we use. Here we will look at the relationship between body size (SVL) and head length (HL) using PIC, PGLS, and phylogenetic transformation to see whether they all give the same results. Specifically, you should:

Perform phylogenetic regression of HL on SVL using PIC, PGLS, and phylogenetic transformation
use summary() and anova() to record the F-statistic and the coefficient that describes the effect of SVL on HL for each method
Perform a regression of HL and SVLwithout conditioning on phylogeny

Are the summary statistics and coefficient estimates the same across each method? How do these summary statistics compare to the analysis that isn’t conditioned on the phylogeny? Do our conclusions change if we condition on the phylogeny? What conclusions would you make about the relationship between body size and head length?

Try it out! Where’s the signal?

We’ve fit various models to both our Hydromantes and Plethodontid datasets but it may be worth asking: do we have trait similarity as a result of phylogeny? In other words we want to evaluate phylogenetic signal to determine whether the phylogeny is an important factor when considering the evolution of various traits on these groups. Specifically, we will ask this for the SVL trait of our Plethodontid dataset and the HeadLength trait of our Hydromantes dataset. try the following:

Evaluate phylogenetic signal of SVL on Plethodontids
- compute the phylogenetic signal K
- Perform a hypothesis test using permutation to assess whether there is a significant amount of phylogenetic signal
Evaluate phylogenetic signal of HeadLength on Hydromantes
- compute the phylogenetic signal K
- Perform a hypothesis test using permutation to assess whether there is a significant amount of phylogenetic signal

Is there phylogenetic signal? which trait has more signal? NOTE: A formal test comparing the strength of phylogenetic signal requires an effect size (Z-score): see Collyer et al. 2022: MEE)

Additional Simulation Approaches

Sometimes we may wish to simulate data with a known correlation between traits under Brownian motion. In this case, we cannot simulate each trait separately, as that is tantamount to simulating data with no correlation.

In the non-phylogenetic (OLS) world, simulating such data is accomplished using an input covariance matrix. The following is an example:

####Additional Simulation Approaches
#Simulate correlated data
library(MASS)
R<-matrix(0.7, nrow=2,ncol=2); diag(R)<-1 ## This matrix desribes the covariation between traits
R

##      [,1] [,2]
## [1,]  1.0  0.7
## [2,]  0.7  1.0

dat.sim<-mvrnorm(n=10000,Sigma = R,mu = c(0,0))
plot(dat.sim, asp=1)

To simulate correlated traits under Brownian motion on a phylogeny, we use the function ‘sim.char’:

#Simulate BM correlated data on phylogeny
mytree<- pbtree(n=100, scale=1) #one way to simulate a tree
R<-matrix(0.7, nrow=2,ncol=2); diag(R)<-1 ## This matrix desribes the covariation between traits
R

##      [,1] [,2]
## [1,]  1.0  0.7
## [2,]  0.7  1.0

dat.BM<-sim.char(phy = mytree,par = R,nsim = 1)[,,1]
plot(dat.BM, asp=1)

Phylogenetic Regression and Related Topics

Dean Adams (dcadams@iastate.edu)

2022-02-15

Overview

Learning Objectives

Download data files

Conceptual Background

Illustrating the Problem

Try it out! The better estimator

Bringing Trait Data into R

Try it out! Hydromantes Data Matching

Methods for Implementing Phylogenetic Regression

Models, Regression, and Statistical Tests in R

Using PIC, PGLS, and Phylogenetic Transformation

0: Phylogenetically-Naive Analysis

1: Phylogenetically Independent Contrasts

2: Phylogenetic Generalized Least Squares

3: Phylogenetic Transform

Try it out! Are they all the same?

Try it out! Where’s the signal?

Additional Simulation Approaches

Phylogenetic Regression and Related Topics

Dean Adams (dcadams@iastate.edu)

2022-02-15

Overview

Learning Objectives

Download data files

Conceptual Background

Illustrating the Problem

Try it out! The better estimator

Bringing Trait Data into R

Try it out! Hydromantes Data Matching

Methods for Implementing Phylogenetic Regression

Models, Regression, and Statistical Tests in R

Using PIC, PGLS, and Phylogenetic Transformation

0: Phylogenetically-Naive Analysis

1: Phylogenetically Independent Contrasts

2: Phylogenetic Generalized Least Squares

3: Phylogenetic Transform

Try it out! Are they all the same?

Related Topics

Phylogenetic ANOVA

Phylogenetic Signal

Try it out! Where’s the signal?

Additional Simulation Approaches