Until recently, the gllvm R-package only supported unconstrained ordination. When including predictor variables, the interpretation of the ordination would shift to a residual ordination, conditional on the predictors.

However, if the number of predictor variables is large and so is the number of species, including predictors can result in a very large number of parameters to estimate. For data of ecological communities, which can be quite sparse, this is not always a reasonable model to fit. As alternative, ecologists have performed constrained ordination for decades, with methods such as Canonical Correspondence Analysis, or Redundancy Analysis.

In this vignette, we demonstrate how to perform constrained ordination with the gllvm R-package. We load the hunting spider dataset:

library(gllvm)

data(spider)
Y <- spider$abund
X <- spider$x

#And scale the predictors
X <- scale(X)

which includes six predictor variables: “soil.dry”: soil dry mass, “bare.sand”: cover bare sand, “fallen.leaves”: “cover of fallen leaves”, “moss”: cover moss, “herb.layer”: cover of the herb layer, “reflection”: “reflection of the soil surface with a cloudless sky”.

For that, it is important to first understand what a constrained ordination is. Classical constrained ordination is statistically referred to as reduced rank regression (RRR). First, consider a multivariate Generalized Linear Model (GLM):

\[\begin{equation} \eta_{ij} = \beta_{0j} + \boldsymbol{X}_i^\top\boldsymbol{\beta}_j. \end{equation}\]

Here, \(\boldsymbol{\beta}_j\) are the slopes that represent a species responses to \(K\) predictor variables at site \(i\), \(\boldsymbol{X}_i\). In the gllvm R-package, the code to fit this model is:

MGLM <- gllvm(Y,X=X,family="poisson")

The “rank” of \(\boldsymbol{X}_i^\top\boldsymbol{\beta}_j\) is \(K\). RRR introduces a constraint on the species slopes matrix, namely on the number of independent columns in \(\boldsymbol{\beta}_j\) (a column is not independent when it can be formulated as a linear combination of another). The reduced ranks are in community ecology referred to as ecological gradients, or can be understood as ordination axes. If we define a latent variable \(\boldsymbol{z}_i = \boldsymbol{B}^\top\boldsymbol{X}_{i,lv} + \boldsymbol{\epsilon}_i\), for a \(K\times d\) matrix of slopes, we can understand RRR as a regression of the latent variable or ecological gradient, except that the residual \(\boldsymbol{\epsilon}_i\) is omitted, i.e. we assume that the ecological gradient can be represented perfectly by the predictor variables, so that the model becomes: \[\begin{equation} \eta_{ij} = \beta_{0j} + \boldsymbol{X}_i^\top\boldsymbol{B}\boldsymbol{\gamma}_j. \end{equation}\]

Where \(\boldsymbol{\gamma}_j\) is a set of reduced rank slopes for each species. This model can also be fitted in, e.g., the VGAM R-package. This parametrization is practically useful, as it drastically reduces the number of parameters compared to multivariate regression. The Rank can be determined by cross-validation, or alternatively, using information criteria. The code for this in the gllvm R-package, for an arbitrary choice of Rank 2, is:

RRGLM <- gllvm(Y,X=X,family="poisson", num.RR=2)

The reduced rank slopes are available under RRGLM$params$LvXcoef. Note: in general to improve convergence, it is good practice to center and scale the predictor variables. Unlike in other R-packages, we can now formulate a RRR with residual (because let’s face it, how often are we 100% confident that we have measured all relevant predictors?), so that we can assume that the ecological gradient is represented by unmeasured and measured predictors (the latter is how the residual can be understood). The code for this is:

CGLLVM <- gllvm(Y,X=X,family="poisson", num.lv.c = 2)

The number of reduced ranks, constrained latent variables, unconstrained latent variables can be freely combined using the num.RR, num.lv.c and num.lv arguments (but be careful not to overparameterize or overfit your model!). It is also possible to combine those arguments with full-rank predictors. If combining RRR (with or without latent variable), with full-rank predictors, the formula interface has to be used:

PCGLLVM <- gllvm(Y,X=X,family="poisson", num.lv.c = 2, lv.formula = ~bare.sand+fallen.leaves+moss+herb.layer+reflection, formula = ~soil.dry)

where RR.formula is the formula for the constrained ordination, and X.formula is the formula which informs the model which predictors should be modelled in full-rank. Note, that those two formulas cannot include the same predictor variables, and all predictor variables should be provided in the X argument. In essence, this performs a partial constrained ordination with latent variables. Reduced Rank regression can never include an (additional) intercept as it can be re-parameterized into a model with only \(\beta_{0j}\).

Though we did not do so here, information criteria can be used to determine the correct number of Reduced Ranks, or in general the correct number of constrained and unconstrained latent variables. Our recommendation is not to perform model-selection on the included predictor variables, but to mostly focus on the Ranks (if this causes convergence issues, first scale and centre predictors, and if that doesn’t help perform model-selection on the predictors). A wald-statistic with accompanying p-values can then be used to determine significance of the predictors:

summary(CGLLVM)
## 
## Call:
## gllvm(y = Y, X = X, num.lv.c = 2, family = family, n.init = 10)
## 
## Family:  poisson 
## 
## AIC:  1666.276 AICc:  1681.943 BIC:  1728.89 LL:  -786.1 df:  47 
## 
## Constrained LVs:  2 
## Reduced Ranks:  0 
## Unconstrained LVs:  0 
## Standard deviation of LVs:  0.5183 0.6907 
## 
## Formula:  ~ 1 
## LV formula:  ~0 + soil.dry + bare.sand + fallen.leaves + moss + herb.layer + reflection 
## 
## Coefficients LV predictors:
##                      Estimate Std. Error z value Pr(>|z|)    
## X.soil.dry(LV1)        0.3801     0.2462   1.544   0.1226    
## X.bare.sand(LV1)       0.2642     0.1766   1.496   0.1347    
## X.fallen.leaves(LV1)  -0.6957     0.3074  -2.263   0.0236 *  
## X.moss(LV1)            0.8177     0.1999   4.090 4.31e-05 ***
## X.herb.layer(LV1)      0.1325     0.1791   0.740   0.4595    
## X.reflection(LV1)      0.4973     0.2759   1.803   0.0714 .  
## X.soil.dry(LV2)        1.5032     0.3675   4.090 4.31e-05 ***
## X.bare.sand(LV2)      -0.3265     0.2323  -1.405   0.1599    
## X.fallen.leaves(LV2)  -0.8952     0.4039  -2.217   0.0267 *  
## X.moss(LV2)           -0.1123     0.3243  -0.346   0.7291    
## X.herb.layer(LV2)      0.5560     0.2359   2.357   0.0184 *  
## X.reflection(LV2)     -0.4909     0.3798  -1.293   0.1962    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finally, we can use all the other tools in the R-package for inference, such as creating an ordination diagram with arrows:

ordiplot(CGLLVM, biplot=TRUE)

Arrows that show as less intense red (pink), are predictors of which the confidence interval for the slope includes zero, for at least one of the two plotted dimensions. There are various arguments inlcuded in the function to improve readability of the figure, have a look at its documentation. The arrows are always proportional to the size of the plot, so that the predictor with the largest slope estimate is the largest arrow. If the predictors have no effect, the slopes \(\boldsymbol{B}\) will be close to zero.

It is also possible to use the quadratic flag to fit a quadratic response model though we will not demonstrate that here, or to partition variance per latent variable and for specific predictors.

References