Today we examine how one can perform phylogenetic comparative analyses on multivariate datasets. For this, the phenotypic data (Y) is a N x p matrix of phenotypic values for N species, across p trait dimensions. These p-dimensions could be a set of univariate traits (e.g., length, width, height, etc.) or they could represent a multi-dimensional trait encoded by multiple numbers (e.g., shape from geometric morphometric methods). The goal is to evaluate patterns in the response variables (Y) while conditioning the data on the phylogeny. In other words, we wish to perform macroevolutionary analyses via phylogenetic comparative methods, but do so on multivariate data.
As we discussed in class, most univariate PCMs now have a multivariate counter-part: phylogenetic regression, phylogenetic ANOVA, phylogenetic correlation, phylogenetic signal, and comparing rates of phenotypic evolution. Methods for comparing complex multivariate evolutionary models beyond BM1 and BMM are still an active area of development (see Adams and Collyer 2018a; Adams and Collyer 2019).
Today we will use data from within the package geomorph
, as well as data from the following files:
Additionally, The R-script for running the code found in this tutorial is found here:
As before, on your local computer, place them in a folder named ‘TutorialData’, which is found in the same directory as this tutorial.
One challenge with multivariate data is that visualizing patterns in such data is not always straightforward. In such cases, we have multiple phenotypic variables (\(p>1\)), for our set of \(N\) species. Geometrically, what this means is that each species is a point in a \(p\)-dimensional phenotype space. Visually characterizing patterns of dispersion in such spaces is challenging, so we rely on ordination methods from classical multivariate statistics. When phylogeny is considered, several analytical options are possible; all of which are based on principal components analysis.
The key for much of the principal components analysis that we’ll do relies on gm.prcomp()
, an extension of prcomp()
that can also account for phylogeny.
gm.prcomp
Function performs principal components analysis (PCA), phylogenetic PCA, or phylogenetically-aligned components (PaCA) on multivariate continuous data. This function has a few different arguments but we’ll primarily be concerned with the following:
FALSE
and is a logical for whether GLS-centering and covariance estimating should be used. This is where we use the information from the phylogeny in the analytical computation of our principal componentsThis function performs the principal components analysis. Most of the elements that are returned are to be used in other functions for analysis.
library(geomorph)
## Loading required package: RRPP
## Loading required package: rgl
## Loading required package: Matrix
data(plethspecies)
Y.gpa <- gpagen(plethspecies$land, print.progress = F) #GPA-alignment
Principal components analysis (PCA) is a common ordination approach that provides the best two-dimensional representation (projection), found by rotating the data to directions of the dataspace that maximize the total variation. These new directions become the new axes of our space, and are linear combinations of the original variables. They are found through a rigid rotation of the dataspace, meaning that distances, directions and angles between sets of points are preserved.
PCA <- gm.prcomp(Y.gpa$coords, phy = plethspecies$phy)
plot(PCA, main = "PCA", pch = 21, bg = "red")
A phylomorphospace is simply a principal components analysis (PCA) with the phylogeny superimposed into the plot. That is, the PCA is obtained as before, ancestral states are estimated, and then projected into the plot. An example is below.
plot(PCA, phylo = TRUE, pch = 21, bg = "red",
main = "Phylomorphospace")
Phylogenetic principal components analysis (pPCA) follows the same logic as above, only the rotation is conditioned on the phylogeny. The result of this is that the first pPCA dimension expresses variation independent of phylogenetic signal. Notice that this plot differs from the phylomorphospace above.
phylo.PCA <- gm.prcomp(Y.gpa$coords, phy = plethspecies$phy, GLS = TRUE)
plot(phylo.PCA, phylo = TRUE, pch = 21, bg="red",
main = "phylogenetic PCA")
PACA can be considered the complement to pPCA. Here, variation is rotated towards directions of phylogenetic signal, rather than away from it. Thus, PACA 1 describes variation expressing maximal phylogenetic signal in teh data. Comparisons of axis 1 from pPCA and PACA are useful in identifying the extent to which phylogenetic signal is embedded in our multivariate data.
PaCA.gls <- gm.prcomp(Y.gpa$coords, phy = plethspecies$phy,
align.to.phy = TRUE, GLS = TRUE)
plot(PaCA.gls, phylo = TRUE, pch = 21, bg = "red",
main = "PaCA: Phylogenetically-Aligned Component Analysis")
Now we will import data into R and perform a series of multivariate PCMs. For this example, we have 6 phenotypic measurements per species that serve as our response (Y) data: TailLength
, HeadLength
, Snout.eye
, BodyWidth
, Forelimb
, and Hindlimb
. We also have two predictor variables: SVL
(snout-vent length = body size), and group
(‘large’ or ‘small’ Plethodon). We will first use PCA to visualize the dataspace and then perform various hypothesis tests in a phylogenetic comparative framework. Note that in the PCA, most variation is expressed in PC1. In our case, this is generalized size: our variables are all linear distance (length) measurements, which all get larger or smaller as the animals are larger or smaller.
library(RRPP)
library(geiger)
## Loading required package: ape
library(phytools)
## Loading required package: maps
tree.best<-read.nexus("../data/Consensus of 1000 salamander trees.nex") #Maximum Credible Tree
plethdata<-read.csv("../data/meandata-CinGlutOnly.csv",row.names=1, header=TRUE )
svl<-plethdata[,2];names(svl)<-row.names(plethdata)
groups<-as.factor(plethdata[,1]); names(groups)<-row.names(plethdata)
Y<-as.matrix(plethdata[,c(3:8)]) ##This is our multivariate data
plethtree<-treedata(phy = tree.best,data = plethdata, warnings = FALSE)$phy
plot(plethtree)
axisPhylo(1)
PCA<-gm.prcomp(Y,phy=plethtree) #principal components of Y
plot(PCA, pch=21,bg="black") ##we can put the PCA directly into the plot function
Next we wish to test hypotheses describing trends of variation in our dataset. For univariate response data, the ‘workhorse’ of PCMs is undoubtedly phylogenetic regression. Three implementations are commonly used and all lead to identical results when implemented properly: phylogenetically independent contrasts (PIC), phylogenetic generalized least squares (PGLS), and phylogenetic-transformation with OLS regression. Unfortunately, for multivariate data, as trait dimensionality increases, the power of these methods decreases, because test statistics are evaluated using parametric approaches (Adams 2014a; Adams and Collyer 2015).
One solution is to use permutation methods. Specifically, phylogenetic transformation of the X and Y data is performed, and residuals from a reduced model are permuted (RRPP) to obtain significance (Adams and Collyer 2018b). The method has been shown to display appropriate type I error and has high power. It is implemented below.
To accomplish this we use lm.rrpp()
, it works much like lm()
or gls()
in that we need to provide a formula for our regression. They main difference is that there is also a Cov
argument for supplying an object covariance matrix: such as a phylogenetic covariance matrix, a spatial covariance matrix, or a temporal covariance matrix. The number of iterations iter
can also be specified for the number of permutations. However, before we begin our regression, we need to put our data in a special RRPP
dataframe with rrpp.data.frame()
.
PhyCov <- vcv.phylo(plethtree) #generate phylogenetic covariance matrix
rdf <- rrpp.data.frame(Y=Y, svl=svl,groups=groups,Cov = PhyCov) #RRPP data frame
PGLS.reg <- lm.rrpp(Y ~ svl, Cov = PhyCov, data = rdf, iter = 999,
print.progress = FALSE,) # randomize residuals
anova(PGLS.reg)
##
## Analysis of Variance, using Residual Randomization
## Permutation procedure: Randomization of null model residuals
## Number of permutations: 1000
## Estimation method: Generalized Least-Squares (via OLS projection)
## Sums of Squares and Cross-products: Type I
## Effect sizes (Z) based on F distributions
##
## Df SS MS Rsq F Z Pr(>F)
## svl 1 214.21 214.214 0.35852 19.561 3.1547 0.001 **
## Residuals 35 383.29 10.951 0.64148
## Total 36 597.50
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call: lm.rrpp(f1 = Y ~ svl, iter = 999, data = rdf, Cov = PhyCov, print.progress = FALSE)
Statistically related to regression is analysis of variance (ANOVA). It has recently been shown that for phylogenetic data, the RRPP approach described above is equally appropriate for phylogenetic ANOVA (Adams and Collyer 2018b). Recall however that the dispersion of groups across the phylogeny can affect both statistical and biological inferences. In particular, ‘clumping’ of groups on the phylogeny (phylogenetic aggregation) can lower statistical power and hamper inferences. In the extreme case of all species for a group belonging to a monophyletic sub-clade, one effectively has a single ‘transition’ of groups on the phylogeny; making it challenging to identify group differences in the response variable (this is akin to the BiSSE issues of state change replication discussed earlier in the semester).
PGLS.aov<-lm.rrpp(Y ~ groups, Cov = PhyCov, data = rdf, iter = 999, print.progress = FALSE) # randomize residuals
anova(PGLS.aov) #no difference once phylogeny considered
##
## Analysis of Variance, using Residual Randomization
## Permutation procedure: Randomization of null model residuals
## Number of permutations: 1000
## Estimation method: Generalized Least-Squares (via OLS projection)
## Sums of Squares and Cross-products: Type I
## Effect sizes (Z) based on F distributions
##
## Df SS MS Rsq F Z Pr(>F)
## groups 1 23.44 23.435 0.03922 1.4288 0.73593 0.252
## Residuals 35 574.07 16.402 0.96078
## Total 36 597.50
##
## Call: lm.rrpp(f1 = Y ~ groups, iter = 999, data = rdf, Cov = PhyCov,
## print.progress = FALSE)
Tests of association between two traits are commonly accomplished using correlation. The multivariate equivalent is partial least squares (PLS), which identifies the maximal assocation between two sets of variables. This may also be implemented in a phylogenetic context, with significance obtained via RRPP (Adams & Felice 2014).
PLS.Y <- phylo.integration(A = Y[,1:3], A2 = Y[,4:6], phy= plethtree, print.progress = FALSE)
summary(PLS.Y)
##
## Call:
## phylo.integration(A = Y[, 1:3], A2 = Y[, 4:6], phy = plethtree,
## print.progress = FALSE)
##
##
##
## r-PLS: 0.639
##
## Effect Size (Z): 3.3592
##
## P-value: 0.001
##
## Based on 1000 random permutations
plot(PLS.Y)
The degree to which trait variation associates with the phylogeny is termed phylogenetic signal. For univariate data, Blomberg et al. (2003) proposed the Kappa statistic. Its multivariate equivalent, K.mult, evaluates phylogenetic signal in multivariate data (Adams 2014b).
PS.shape <- physignal(A=Y,phy=plethtree,iter=999, print.progress = FALSE)
summary(PS.shape)
##
## Call:
## physignal(A = Y, phy = plethtree, iter = 999, print.progress = FALSE)
##
##
##
## Observed Phylogenetic Signal (K): 0.4342
##
## P-value: 0.002
##
## Effect Size: 2.6861
##
## Based on 1000 random permutations
plot(PS.shape)
One can also envision comparing rates of phenotypic evolution in multivariate traits. Here, the net rate of evolution is characterized, found as the mean of the rates of the individual trait dimensions (Adams 2014c). Two approaches for comparing evolutionary rates have been developed, which mirror methods discussed for univariate rate tests. First, one may compare rates of multivariate evolution between clades (Adams 2014c). Second, for two multivariate traits, one may compare rates of evolution between traits (Denton and Adams 2015).
ER<-compare.evol.rates(A=Y, phy=plethtree,gp=groups,iter=999, print.progress = FALSE)
summary(ER) #significantly higher rate of morphological evolution 'large' Plethodon
##
## Call:
##
##
## Observed Rate Ratio: 3.6913
##
## P-value: 0.027
##
## Effect Size: 1.7194
##
## Based on 1000 random permutations
##
## The rate for group Large is 3.27167050803742
##
## The rate for group Small is 0.886328591621753
plot(ER)
var.gp <- c("B","A","A","B","B","B") #head (A) vs. body (B)
EMR<-compare.multi.evol.rates(A=Y,gp=var.gp, Subset=TRUE, phy= plethtree,iter=999, print.progress = FALSE)
summary(EMR) #Limb traits evolve faster than head traits
##
## Call:
##
##
## Observed Rate Ratio: 30.8266
##
## P-value: 0.001
##
## Effect Size: 11.7072
##
## Based on 1000 random permutations
##
## The rate for group A is 0.128874014550266
##
## The rate for group B is 3.9727412987915
plot(EMR)
Plethodontid salamanders have a lot of diversity in their head shapes. Since quantifying head shape is better done in a multivariate setting, we have took landmark measurements of the head. We’re interested in asking whether there is a relationship between the shape of salamander head and head size.
geomorph
actually has a built in dataset that we can use for this analysis. We can load and format the data set with the following commands:
data("plethspecies")
pleth_tree<-plethspecies$phy
landmarks<-plethspecies$land
procD_landmarks <- gpagen(landmarks)
We now have our salamander tree in the saved as pleth_tree
and our head landmark data in landmark
. We then performed a Procrustes analysis on our landmark data to align it. within procD_landmarks
we can notice procD_landmarks$coords
contains our aligned landmarks while procD_landmarks$Csize
is our centroid size, or head size in this case.
geomorph.data.frame()
to format our aligned data set.