vignettes/vignette4.rmd
vignette4.rmd
R package gllvm fits Generalized linear latent variable models (GLLVM) for multivariate data1.
Developed by J. Niku, W.Brooks, R. Herliansyah, F.K.C. Hui, S. Taskinen, D.I. Warton, B. van der Veen.
The package available in
Package installation:
# From CRAN
install.packages(gllvm)
# OR
# From GitHub using devtools package's function install_github
devtools::install_github("JenniNiku/gllvm")
gllvm package depends on R packages TMB and mvabund, try to install these first.
GLLVMs are computationally intensive to fit due the integral in log-likelihood.
gllvm package overcomes computational problems by applying closed form approximations to log-likelihood and using automatic differentiation in C++ to accelerate computation times (TMB2).
Estimation is performed using either variational approximation (VA3), extended variational approximation method (EVA4) or Laplace approximation (LA5) method implemented via R package TMB.
VA method is faster and more accurate than LA, but not applicable for all distributions and link functions.
Using gllvm we can fit
Additional tools: residuals, information criteria, confidence intervals, visualization.
Response | Distribution | Method | Link |
---|---|---|---|
Counts | Poisson | VA/LA | log |
NB | VA/LA | log | |
ZIP | VA/LA | log | |
ZINB | VA/LA | log | |
Binary | Bernoulli | VA/LA | probit |
LA | logit | ||
Ordinal | Ordinal | VA | probit |
Normal | Gaussian | VA/LA | identity |
Positive continuous | Gamma | VA/LA | log |
Non-negative continuous | Exponential | VA/LA | log |
Biomass | Tweedie | LA/EVA | log |
Percent cover | beta | LA/EVA | probit/logit |
Main function of the gllvm package is
gllvm()
, which can be used to fit GLLVMs for multivariate
data with the most important arguments listed in the following:
gllvm(y = NULL, X = NULL, TR = NULL, family, num.lv = 2,
formula = NULL, method = "VA", row.eff = FALSE, n.init=1, starting.val ="res", ...)
soil.dry
: Soil dry massbare.sand
: cover of bare sandfallen.leaves
: cover of fallen leaves/twigsmoss
: cover of mossherb.layer
: cover of herb layerreflection
: reflection of the soil surface with a
cloudless skyFit GLLVM with environmental variables using gllvm:
library(gllvm)
data("spider", package = "mvabund")
fitx <- gllvm(y = spider$abund, X = spider$x, family = "negative.binomial", num.lv = 2)
fitx
## Call:
## gllvm(y = spider$abund, X = spider$x, family = "negative.binomial",
## num.lv = 2)
## family:
## [1] "negative.binomial"
## method:
## [1] "VA"
##
## log-likelihood: -593.6749
## Residual degrees of freedom: 217
## AIC: 1425.35
## AICc: 1557.572
## BIC: 1879.586
X=spider$x
fitx1 <- gllvm(spider$abund, X, family = "negative.binomial", num.lv = 1)
fitx2 <- gllvm(spider$abund, X, family = "negative.binomial", num.lv = 2)
fitx3 <- gllvm(spider$abund, X, family = "negative.binomial", num.lv = 3)
AIC(fitx1)
## [1] 1393.843
AIC(fitx2)
## [1] 1425.35
AIC(fitx3)
## [1] 1445.35
E1. Load spider data from mvabund package and take a look at the dataset.
library(gllvm)
#Package **mvabund** is loaded with **gllvm** so just load with a function `data()`.
data("spider")
# more info:
# ?spider
Package mvabund is loaded
with gllvm so just load with a function
data()
.
# response matrix:
spider$abund
## Alopacce Alopcune Alopfabr Arctlute Arctperi Auloalbi Pardlugu Pardmont
## [1,] 25 10 0 0 0 4 0 60
## [2,] 0 2 0 0 0 30 1 1
## [3,] 15 20 2 2 0 9 1 29
## [4,] 2 6 0 1 0 24 1 7
## [5,] 1 20 0 2 0 9 1 2
## [6,] 0 6 0 6 0 6 0 11
## [7,] 2 7 0 12 0 16 1 30
## [8,] 0 11 0 0 0 7 55 2
## [9,] 1 1 0 0 0 0 0 26
## [10,] 3 0 1 0 0 0 0 22
## [11,] 15 1 2 0 0 1 0 95
## [12,] 16 13 0 0 0 0 0 96
## [13,] 3 43 1 2 0 18 1 24
## [14,] 0 2 0 1 0 4 3 14
## [15,] 0 0 0 0 0 0 6 0
## [16,] 0 3 0 0 0 0 6 0
## [17,] 0 0 0 0 0 0 2 0
## [18,] 0 1 0 0 0 0 5 0
## [19,] 0 1 0 0 0 0 12 0
## [20,] 0 2 0 0 0 0 13 0
## [21,] 0 1 0 0 0 0 16 1
## [22,] 7 0 16 0 4 0 0 2
## [23,] 17 0 15 0 7 0 2 6
## [24,] 11 0 20 0 5 0 0 3
## [25,] 9 1 9 0 0 2 1 11
## [26,] 3 0 6 0 18 0 0 0
## [27,] 29 0 11 0 4 0 0 1
## [28,] 15 0 14 0 1 0 0 6
## Pardnigr Pardpull Trocterr Zoraspin
## [1,] 12 45 57 4
## [2,] 15 37 65 9
## [3,] 18 45 66 1
## [4,] 29 94 86 25
## [5,] 135 76 91 17
## [6,] 27 24 63 34
## [7,] 89 105 118 16
## [8,] 2 1 30 3
## [9,] 1 1 2 0
## [10,] 0 0 1 0
## [11,] 0 1 4 0
## [12,] 1 8 13 0
## [13,] 53 72 97 22
## [14,] 15 72 94 32
## [15,] 0 0 25 3
## [16,] 2 0 28 4
## [17,] 0 0 23 2
## [18,] 0 0 25 0
## [19,] 1 0 22 3
## [20,] 0 0 22 2
## [21,] 0 1 18 2
## [22,] 0 0 1 0
## [23,] 0 0 1 0
## [24,] 0 0 0 0
## [25,] 6 0 16 6
## [26,] 0 0 1 0
## [27,] 0 0 0 0
## [28,] 0 0 2 0
# Environmental variables
spider$x
## soil.dry bare.sand fallen.leaves moss herb.layer reflection
## 1 2.3321 0.0000 0.0000 3.0445 4.4543 3.9120
## 2 3.0493 0.0000 1.7918 1.0986 4.5643 1.6094
## 3 2.5572 0.0000 0.0000 2.3979 4.6052 3.6889
## 4 2.6741 0.0000 0.0000 2.3979 4.6151 2.9957
## 5 3.0155 0.0000 0.0000 0.0000 4.6151 2.3026
## 6 3.3810 2.3979 3.4340 2.3979 3.4340 0.6931
## 7 3.1781 0.0000 0.0000 0.6931 4.6151 2.3026
## 8 2.6247 0.0000 4.2627 1.0986 3.4340 0.6931
## 9 2.4849 0.0000 0.0000 4.3307 3.2581 3.4012
## 10 2.1972 3.9318 0.0000 3.4340 3.0445 3.6889
## 11 2.2192 0.0000 0.0000 4.1109 3.7136 3.6889
## 12 2.2925 0.0000 0.0000 3.8286 4.0254 3.6889
## 13 3.5175 1.7918 1.7918 0.6931 4.5109 3.4012
## 14 3.0865 0.0000 0.0000 1.7918 4.5643 1.0986
## 15 3.2696 0.0000 4.3944 0.6931 3.0445 0.6931
## 16 3.0301 0.0000 4.6052 0.6931 0.6931 0.0000
## 17 3.3322 0.0000 4.4543 0.6931 3.0445 1.0986
## 18 3.1224 0.0000 4.3944 0.0000 3.0445 1.0986
## 19 2.9232 0.0000 4.5109 1.6094 1.6094 0.0000
## 20 3.1091 0.0000 4.5951 0.6931 0.6931 0.0000
## 21 2.9755 0.0000 4.5643 0.6931 1.7918 0.0000
## 22 1.2528 3.2581 0.0000 4.3307 0.6931 3.9120
## 23 1.1939 3.0445 0.0000 4.0254 3.2581 4.0943
## 24 1.6487 3.2581 0.0000 4.0254 3.0445 4.0073
## 25 1.8245 3.5835 0.0000 1.0986 4.1109 2.3026
## 26 0.9933 4.5109 0.0000 1.7918 1.7918 4.3820
## 27 0.9555 2.3979 0.0000 3.8286 3.4340 3.6889
## 28 0.9555 3.4340 0.0000 3.7136 3.4340 3.6889
# Plot data using boxplot:
boxplot(spider$abund)
E2. Fit GLLVM with two latent variables to spider data with a suitable distribution. Data consists of counts of spider species.
# Take a look at the function documentation for help:
?gllvm
2. Response variables in spider data are counts, so Poisson, negative binomial and zero inflated Poisson are possible. However, ZIP is implemented only with Laplace method, so it need to be noticed, that if models are fitted with different methods they can not be compared with information criteria. Let’s try just with a Poisson and NB. NOTE THAT the results may not be exactly the same as below, as the initial values for each model fit are slightly different, so the results may
# Fit a GLLVM to data
fitp <- gllvm(y=spider$abund, family = poisson(), num.lv = 2)
fitp
## Call:
## gllvm(y = spider$abund, family = poisson(), num.lv = 2)
## family:
## [1] "poisson"
## method:
## [1] "VA"
##
## log-likelihood: -845.8277
## Residual degrees of freedom: 301
## AIC: 1761.655
## AICc: 1770.055
## BIC: 1895.254
fitnb <- gllvm(y=spider$abund, family = "negative.binomial", num.lv = 2)
fitnb
## Call:
## gllvm(y = spider$abund, family = "negative.binomial", num.lv = 2)
## family:
## [1] "negative.binomial"
## method:
## [1] "VA"
##
## log-likelihood: -733.6807
## Residual degrees of freedom: 289
## AIC: 1561.361
## AICc: 1577.028
## BIC: 1740.766
Based on AIC, NB distribution suits better. How about residual analysis: NOTE THAT The package uses randomized quantile residuals so each time you plot the residuals, they look a little different.
plot(fitnb, which = 1:2)
You could do these comparisons with Laplace method as well, using the code below, and it would give the same conclusion that NB distribution suits best:
GLLVM with two latent variables can be used as a model-based approach to unconstrained ordination, as considered at the first day of the workshop. E3. Fit GLLVM with environmental variables
soil.dry
and reflection
to the data with
suitable number of latent variables.
We can extract the two columns from the environmental variable matrix or define the model using formula.
# `soil.dry` and `reflection` are in columns 1 and 6
X <- spider$x[,c(1,6)]
fitx1 <- gllvm(spider$abund, X, family = "negative.binomial", num.lv = 1)
fitx2 <- gllvm(spider$abund, X, family = "negative.binomial", num.lv = 2)
fitx3 <- gllvm(spider$abund, X, family = "negative.binomial", num.lv = 3)
AIC(fitx1)
## [1] 1449.428
AIC(fitx2)
## [1] 1486.451
AIC(fitx3)
## [1] 1491.428
# Or alternatively using formula:
fitx1 <- gllvm(spider$abund, spider$x, formula = ~soil.dry + reflection, family = "negative.binomial", num.lv = 1)
fitx1
## Call:
## gllvm(y = spider$abund, X = spider$x, formula = ~soil.dry + reflection,
## family = "negative.binomial", num.lv = 1)
## family:
## [1] "negative.binomial"
## method:
## [1] "VA"
##
## log-likelihood: -664.7139
## Residual degrees of freedom: 276
## AIC: 1449.428
## AICc: 1476.046
## BIC: 1678.454
E4. Explore the model fit. Find the coefficients for environmental covariates.
Estimated parameters can be obtained with
coef()
function. Confidence intervals for parameters are
obtained with confint()
.
coef(fitx1)
## $Species.scores
## LV1
## Alopacce 1.000000e+00
## Alopcune 1.286564e+00
## Alopfabr 4.411474e-01
## Arctlute 2.261271e+00
## Arctperi 6.558817e-06
## Auloalbi 2.499060e+00
## Pardlugu 3.973132e-01
## Pardmont 9.557377e-01
## Pardnigr 2.565877e+00
## Pardpull 2.528257e+00
## Trocterr 9.659180e-01
## Zoraspin 1.751833e+00
##
## $sigma.lv
## LV1
## 0.8536904
##
## $Intercept
## Alopacce Alopcune Alopfabr Arctlute Arctperi Auloalbi Pardlugu
## -3.010735 -7.407244 2.077617 -21.478661 -7.920069 -12.969707 4.569019
## Pardmont Pardnigr Pardpull Trocterr Zoraspin
## -6.088991 -14.644413 -16.457915 -3.712424 -9.219694
##
## $Xcoef
## soil.dry reflection
## Alopacce -0.3438061 1.5945287
## Alopcune 2.6112725 0.6246891
## Alopfabr -1.8668475 0.7468388
## Arctlute 6.3227353 0.8208664
## Arctperi -1.7566353 2.8892032
## Auloalbi 4.1826104 0.6832891
## Pardlugu -0.5456458 -1.2606137
## Pardmont 1.7920398 1.4688279
## Pardnigr 4.9045138 0.8601683
## Pardpull 5.3505939 1.2824173
## Trocterr 2.3361048 0.2234954
## Zoraspin 3.5570334 0.1199447
##
## $inv.phi
## Alopacce Alopcune Alopfabr Arctlute Arctperi Auloalbi
## 9.375940e+00 1.699565e+00 1.577636e+00 1.063992e+00 4.391997e+05 1.511603e+00
## Pardlugu Pardmont Pardnigr Pardpull Trocterr Zoraspin
## 1.543843e+00 9.784953e-01 2.518153e+00 2.339355e+00 1.470886e+01 3.624948e+00
##
## $phi
## Alopacce Alopcune Alopfabr Arctlute Arctperi Auloalbi
## 1.066560e-01 5.883857e-01 6.338598e-01 9.398569e-01 2.276869e-06 6.615492e-01
## Pardlugu Pardmont Pardnigr Pardpull Trocterr Zoraspin
## 6.477341e-01 1.021977e+00 3.971164e-01 4.274683e-01 6.798622e-02 2.758661e-01
# Coefficients for covariates are named as `Xcoef`
# Confidence intervals for these coefficients:
confint(fitx1, parm = "Xcoef")
## cilow ciup
## Xcoef1 -1.02657582 0.3389637
## Xcoef2 1.34798979 3.8745552
## Xcoef3 -2.85510118 -0.8785939
## Xcoef4 2.24760383 10.3978668
## Xcoef5 -2.87603054 -0.6372400
## Xcoef6 1.79232232 6.5728985
## Xcoef7 -1.64615225 0.5548606
## Xcoef8 0.73330963 2.8507700
## Xcoef9 2.78645009 7.0225774
## Xcoef10 3.14444174 7.5567460
## Xcoef11 1.64647443 3.0257352
## Xcoef12 1.97813993 5.1359268
## Xcoef13 0.95846116 2.2305962
## Xcoef14 0.13820904 1.1111691
## Xcoef15 -0.24255716 1.7362347
## Xcoef16 -0.34239253 1.9841252
## Xcoef17 1.63797487 4.1404316
## Xcoef18 -0.18415487 1.5507331
## Xcoef19 -1.73569020 -0.7855372
## Xcoef20 0.89971165 2.0379442
## Xcoef21 0.01749653 1.7028400
## Xcoef22 0.40749684 2.1573378
## Xcoef23 -0.08588201 0.5328728
## Xcoef24 -0.46477569 0.7046652
# The first 12 intervals are for soil.dry and next 12 for reflection
I have problems in model fitting. My model
converges to infinity or local maxima: GLLVMs are complex models
where starting values have a big role. Choosing a different starting
value method (see argument starting.val
) or use multiple
runs and pick up the one giving highest log-likelihood value using
argument n.init
. More variation to the starting points can
be added with jitter.var
.
My results does not look the same as in answers: The results may not be exactly the same as in the answers, as the initial values for each model fit are slightly different, so the results may also differ slightly.
Latent variables induce correlation across response variables, and so provide means of estimating correlation patterns across species, and the extent to which they can be explained by environmental variables.
Information on correlation is stored in the LV loadings , so the residual covariance matrix, storing information on species co-occurrence that is not explained by environmental variables, can be calculated as , where .
getResidualCor
function can be used to estimate the
correlation matrix of the linear predictor across species.
Let’s consider first the correlation matrix based on a model without predictors:
fitnb <- gllvm(spider$abund, family = "negative.binomial", num.lv = 2)
The residual correlations can be visualized either using biplot,
which shows the species ordination, or visualizing the actual
correlation matrix using, eg., a corrplot
package.
The biplot can be produced using a function
ordiplot()
with an argument
biplot = TRUE
:
fitnb <- gllvm(spider$abund, family = "negative.binomial", num.lv = 2)
ordiplot(fitnb, biplot = TRUE)
abline(h = 0, v = 0, lty=2)
corrplot()
function:
fitnb <- gllvm(spider$abund, family = "negative.binomial", num.lv = 2)
cr <- getResidualCor(fitnb)
library(corrplot);
## corrplot 0.94 loaded
corrplot(cr, diag = FALSE, type = "lower", method = "square", tl.srt = 25)
soil.dry
(soil dry mass) and reflection
(reflection of the soil surface with a cloudless sky), which shows
different environmental gradients in ordination:
rbPal <- c("#00FA9A", "#00EC9F", "#00DFA4", "#00D2A9", "#00C5AF", "#00B8B4", "#00ABB9", "#009DBF", "#0090C4", "#0083C9", "#0076CF", "#0069D4", "#005CD9", "#004EDF", "#0041E4", "#0034E9", "#0027EF", "#001AF4", "#000DF9", "#0000FF")
X <- spider$x[,c(1,6)]
par(mfrow = c(1,2), mar=c(4,4,2,2))
for(i in 1:ncol(X)){
Col <- rbPal[as.numeric(cut(X[,i], breaks = 20))]
ordiplot(fitnb, symbols = T, s.colors = Col, main = colnames(X)[i], biplot = TRUE)
abline(h=0,v=0, lty=2)
}
coefplot()
plots point estimates of the species
specific environmental coefficients
with confidence intervals.dry.soil
and
reflection
were accounted, negative correlation between
species do not seem to exist anymore, indicating that negative
correlations were explained by different environmental conditions at
sites and how species respond to them, rather than direct species
interactions.
crx <- getResidualCor(fitx1)
corrplot(crx, diag = FALSE, type = "lower", method = "square", tl.srt = 25)
If species trait variables , measuring eg. species behaviour or physical appearance, would be available, fourth corner models should be considered:
Such models can also be fitted with gllvm()
function
by including a matrix of traits with argument TR
.
Examples can be found in the gllvm package’s vignettes.
Niku, J., F.K.C. Hui, S. Taskinen, and D.I. Warton. 2019. Gllvm - Fast Analysis of Multivariate Abundance Data with Generalized Linear Latent Variable Models in R. 10. Methods in Ecology and Evolution: 2173–82↩︎
Kasper Kristensen, Anders Nielsen, Casper W. Berg, Hans Skaug, Bradley M. Bell (2016). TMB: Automatic Differentiation and Laplace Approximation. Journal of Statistical Software, 70(5), 1-21↩︎
Hui, F. K. C., Warton, D., Ormerod, J., Haapaniemi, V., and Taskinen, S. (2017). Variational approximations for generalized linear latent variable models. Journal of Computational and Graphical Statistics. Journal of Computational and Graphical Statistics, 26:35-43↩︎
Korhonen, P., Hui, F. K. C., Niku, J., and Taskinen, S. (2021). Fast, universal estimation of latent variable models using extended variational approximations, arXiv:2107.02627 .↩︎
Niku, J., Warton, D. I., Hui, F. K. C., and Taskinen, S. (2017). Generalized linear latent variable models for multivariate count and biomass data in ecology. Journal of Agricultural, Biological, and Environmental Statistics, 22:498-522.↩︎