This short document showcases how to use the gllvm package to analyse multivariate percent cover data. Namely, we show how to apply the hurdle beta GLLVM (with logistic link), as detailed in Korhonen et al. (2024), to analyse the kelp forest dataset from the Santa Barbara Coastal Long-Term Ecological Research project, available from https://doi.org/10.6073/pasta/0af1a5b0d9dde5b4e5915c0012ccf99c.

A multivariate percent cover dataset often comes in the form of an $n \times m$ matrix $\bf{Y}$, with $n$ being the number of observational units/sites and $m$ being the number of species of plants, macroalgae, sessile invertebrates, et cetera. The response $y_{ij}$ is then the recorded percentage of the relative area covered by species $j$ on unit/site $i$. Typically such datasets contain a considerable proportion of zero observations, as a given observational unit is often populated by only a small subset of the $m$ species. More rarely, one of the species might even cover an observational unit completely.

Hurdle beta GLLVM

Traditionally, beta regression has been used to model responses that take the form of percentages. If $y^*_{ij}$ lies in the open interval $(0,1)$ and follows a beta distribution, $y^*_{ij} \sim \text{Beta}(\mu_{ij}, \phi_j)$, then it has the following density:

$$
f_{\text{beta}}(y_{ij}^*; \mu_{ij}, \phi_j) = \frac{\Gamma(\phi_j)}{\Gamma(\mu_{ij}\phi_j)\Gamma(\phi_j-\phi_j\mu_{ij})} (y_{ij}^*)^{\mu_{ij}\phi_j-1}(1-y_{ij}^*)^{\phi_j(1-\mu_{ij})-1}.
$$
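The mean-precision parametrisation above corresponds to the usual shape parameters $\mu_{ij}\phi_j$ and $(1-\mu_{ij})\phi_j$ of the beta distribution. The following sketch checks this numerically in base R; the helper names `dbeta_mp` and `dbeta_manual` are made up for illustration and are not part of gllvm.

```r
# Beta density in the mean-precision parametrisation:
# shape1 = mu * phi, shape2 = (1 - mu) * phi.
dbeta_mp <- function(y, mu, phi, give_log = FALSE) {
  dbeta(y, shape1 = mu * phi, shape2 = (1 - mu) * phi, log = give_log)
}

# The same density written out with gamma functions, term by term.
dbeta_manual <- function(y, mu, phi) {
  gamma(phi) / (gamma(mu * phi) * gamma(phi - phi * mu)) *
    y^(mu * phi - 1) * (1 - y)^(phi * (1 - mu) - 1)
}

y <- c(0.05, 0.3, 0.8)
# Both versions should agree up to numerical tolerance.
all.equal(dbeta_mp(y, mu = 0.3, phi = 5), dbeta_manual(y, mu = 0.3, phi = 5))
```

A larger $\phi_j$ concentrates the density around $\mu_{ij}$, which is why $\phi_j$ is often called a precision parameter.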

The mean parameter $\mu_{ij}$ can be connected to a set of covariates and latent variables through some link function, by the equation $g(\mu_{ij})=\eta_{ij} = \beta_{0j} + \boldsymbol{\beta}_j^\top\bf{x_i} + \boldsymbol{\lambda}_j^\top\bf{u_i}$.

However, such a model is ill-suited for most percent cover datasets, because the beta distribution cannot handle zero (or $100\%$) responses. To use the beta regression model in such a scenario, one needs to apply some transformation that 'pushes' the responses away from the boundaries. This procedure might give reasonable results when the number of zeros or ones in the data is relatively low. On the other hand, by transforming the zeros and ones, we might lose some important information.
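As an illustration of such a transformation, the sketch below uses the widely cited "squeeze" $y' = \{y(n-1) + 0.5\}/n$ of Smithson and Verkuilen (2006). This particular choice is an assumption made for the example; the text above does not prescribe a specific transformation, and the function name `squeeze01` is hypothetical.

```r
# Push proportions in [0, 1] into the open interval (0, 1).
# n is the number of observations (assumed here to be length(y)).
squeeze01 <- function(y) {
  n <- length(y)
  (y * (n - 1) + 0.5) / n
}

y <- c(0, 0, 0.12, 0.5, 1)
y_squeezed <- squeeze01(y)
range(y_squeezed)  # strictly inside (0, 1)
```

Note that all zeros are mapped to the same small positive value, so the information that a species was absent is no longer distinguishable from a very small cover.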

A more sophisticated way of tackling this issue is to consider a zero-accommodating model, a couple of which have been proposed recently. One such model is the hurdle beta model, which models the two classes the response $y_{ij}$ can take, i.e., $\{0\}$ and $(0,1)$, by separate processes (this can be extended to three classes if $100\%$ covers are recorded, see Korhonen et al. (2024)). Namely, the zeros are assumed to be generated by a Bernoulli process. Conditional on $y_{ij}\in(0,1)$, the response is modeled using the standard beta distribution as presented above. The likelihood function for a response $Y_{ij}$ following the hurdle beta distribution is of the form:

$$
P(Y_{ij};\mu_{ij}, \mu_{ij}^0, \phi_j) = \begin{cases} 1-\mu_{ij}^0, & Y_{ij} = 0,\\ \mu_{ij}^0 \cdot f_{\text{beta}}(Y_{ij};\mu_{ij},\phi_j), & Y_{ij} \in (0,1), \end{cases}
$$

where $g(\mu_{ij}^0) = \eta_{ij}^0 = \beta_{0j}^0 + \bf{x_i}^\top\boldsymbol{\beta}_j^0 + \bf{u_i}^\top\boldsymbol{\lambda}_j^0$ and $g(\mu_{ij}) = \eta_{ij} = \beta_{0j} + \bf{x_i}^\top\boldsymbol{\beta}_j + \bf{u_i}^\top\boldsymbol{\lambda}_j$, and $g(\cdot)$ can be either the probit or the logistic link function. Note that here the separate linear predictors share the same environmental covariates $\bf{x}_i$ and latent variable scores $\bf{u}_i$, while the coefficients and loadings are allowed to differ.
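To make the two-part likelihood concrete, here is a minimal base R sketch of the hurdle beta density with a logistic link for both parts. The function name `dhurdle_beta` and its arguments are hypothetical and only mirror the notation above; this is not the code gllvm uses internally.

```r
# Density of a hurdle beta response (two-class version, no 100% covers).
# y    : response, either 0 or a value in (0, 1)
# eta0 : linear predictor of the Bernoulli part
# eta  : linear predictor of the beta part
# phi  : precision parameter of the beta part
dhurdle_beta <- function(y, eta0, eta, phi, give_log = FALSE) {
  mu0 <- plogis(eta0)  # probability of a non-zero cover
  mu  <- plogis(eta)   # mean of the cover, given that it is non-zero
  logdens <- ifelse(
    y == 0,
    log1p(-mu0),
    log(mu0) + dbeta(y, mu * phi, (1 - mu) * phi, log = TRUE)
  )
  if (give_log) logdens else exp(logdens)
}

# Point mass at zero plus the integral over (0, 1) should equal one.
p_zero <- dhurdle_beta(0, eta0 = 0.5, eta = -1, phi = 4)
p_pos  <- integrate(dhurdle_beta, 0, 1, eta0 = 0.5, eta = -1, phi = 4)$value
p_zero + p_pos
```

Because the two parts have separate linear predictors, a covariate can make a species more likely to be present while at the same time lowering its expected cover where it is present.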

The gllvm package implements the hurdle beta GLLVM with two estimation methods. The first, selected with the argument method="VA" when calling gllvm, is a hybrid approach: the method of variational approximations (VA) is applied to the Bernoulli part of the model (only the probit link is allowed), while the method of extended variational approximations (EVA), see Korhonen et al. (2023), is applied to the beta-distributed part. By instead specifying method="EVA", the EVA method is applied to both parts of the likelihood.

Example

In the following, we show how to fit the logistic hurdle beta GLLVM using EVA on the SBC LTER marine macroalgae (i.e., seaweed) percent cover dataset. The data were collected in 2000-2020 on 44 permanent transects along the coast of southern California. We will specify a model with two latent variables (for ordination) and will include the rockiness of the seabed (PERCENT_ROCKY) and the average number of giant kelp fronds (KELP_FRONDS) as environmental covariates in the model.

library(gllvm)
## Loading required package: TMB
## 
## Attaching package: 'gllvm'
## The following object is masked from 'package:stats':
## 
##     simulate
data("kelpforest")
Yabund <- kelpforest$Y
Xenv <- kelpforest$X
SPinfo <- kelpforest$SPinfo

# Data contains both algae and sessile invertebrates
table(SPinfo$GROUP)
## 
##  ALGAE INVERT  PLANT 
##     61     69      2
# Select only the macroalgae:
Yalg <- Yabund[,SPinfo$GROUP=="ALGAE"]

# To demonstrate the models, use only the data from the year 2016:
Yalg <- Yalg[Xenv$YEAR==2016,]
Xenv <- Xenv[Xenv$YEAR==2016,]

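# Keep only species observed on at least two transects.
# (Logical indexing is safe even when no species needs to be dropped,
# whereas -which(...) would then drop every column.)
Yalg <- Yalg[, colSums(Yalg > 0) >= 2]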

# Number of obs. and species:
dim(Yalg)
## [1] 44 42
# Specify the covariates in the linear predictor
Xformulai = ~KELP_FRONDS + PERCENT_ROCKY

After setting up the data, LV design and the covariates, the model is estimated by

fit <- gllvm(Yalg, X=Xenv, formula = Xformulai, family = "betaH", method="EVA", 
             num.lv = 2, link="logit", control=list(reltol=1e-12))

To inspect e.g., the covariate effects, use

fit$params$Xcoef
##            KELP_FRONDS PERCENT_ROCKY
## AU       -9.455406e-02 -3.102586e-03
## BF       -1.372606e+00  5.823305e-02
## BO        1.376262e-01 -2.476495e-02
## BR       -1.713636e-01  2.189629e-02
## BRA       3.248254e-01 -8.190024e-02
## CAL      -1.460250e-01  4.855651e-02
## CC        1.188441e-01 -1.115566e-02
## CF        1.530361e-01 -8.901479e-03
## CG        2.548407e-01 -7.649785e-02
## CO       -6.308269e-02  1.110347e-02
## COF      -6.410458e-02  2.787402e-02
## CP       -1.114389e-01 -6.294223e-02
## CRYP     -7.105296e-04 -4.663372e-05
## CYOS      5.478715e-02 -8.698350e-03
## DL        7.826420e-02 -1.635271e-03
## DMH      -2.017355e-01 -1.233158e-02
## DP       -3.621562e-02  7.676317e-03
## DU       -7.076633e-01 -5.883783e-02
## EAH       5.637383e+00 -6.443279e-04
## EC        3.391389e-02  1.520129e-02
## EH        6.122325e-01 -1.145803e-01
## ER       -4.458907e-02 -3.184757e-02
## FB        1.803531e-01 -8.068101e-03
## FTHR     -2.036075e-02 -2.218843e-02
## GR       -1.916974e-01 -1.522151e-01
## GS       -3.699922e-01 -1.332773e-02
## GYSP     -2.886326e-02 -1.889124e-04
## MH        9.505059e-02 -3.176209e-03
## NIE      -4.721101e-02 -5.200056e-03
## PH        1.457448e+00 -2.084187e-01
## PHSE     -4.970264e-01  7.569168e-03
## PL       -6.806346e-03  1.733389e-02
## POLA     -4.817113e-01 -3.194275e-02
## PRSP     -4.481198e-06  2.926159e-07
## R        -1.192304e-02  6.030937e-03
## RAT       1.283317e-02  2.015962e-03
## SAFU     -1.504383e+00  5.628179e-03
## SAHO     -2.371504e+00 -2.086756e-02
## SAMU     -2.168437e-01 -2.609541e-03
## SCCA      2.135574e-02  3.125122e-02
## TALE     -8.612450e-02 -2.620527e-02
## UV        7.221229e-02 -2.198163e-02
## H01_AU   -1.615009e-01 -5.316591e-02
## H01_BF   -6.786616e-01 -1.154694e-01
## H01_BO    2.921479e-01 -2.125122e-02
## H01_BR   -8.459162e-02 -2.009526e-04
## H01_BRA  -4.534394e-02 -3.474159e-02
## H01_CAL  -1.003116e-01  1.222654e-01
## H01_CC    7.228356e-02 -7.309390e-03
## H01_CF    2.242535e-02 -1.146587e-02
## H01_CG    8.491956e-03  2.194202e-01
## H01_CO    5.753283e-02 -1.559851e-02
## H01_COF  -3.874178e+00  2.607078e-01
## H01_CP   -3.139670e-01  1.821406e-01
## H01_CRYP -3.665508e-02 -2.117221e-02
## H01_CYOS  1.291399e-01 -1.766299e-02
## H01_DL    1.415808e-01 -4.510564e-02
## H01_DMH  -2.600719e-01 -2.829760e-02
## H01_DP   -1.254566e-01  1.656559e-02
## H01_DU   -4.602717e-01  2.061092e-01
## H01_EAH  -1.142915e+01  2.005793e-01
## H01_EC    2.314568e-01  5.714048e-02
## H01_EH   -2.407146e-01 -2.815340e-02
## H01_ER   -8.584950e-02  3.743826e-02
## H01_FB   -4.843342e-01 -1.131505e-02
## H01_FTHR -6.958817e-01 -5.687853e-02
## H01_GR   -1.125556e-01  2.484875e-01
## H01_GS   -4.605506e-01 -1.722882e-01
## H01_GYSP  8.646988e-02  8.165571e-03
## H01_MH    1.152026e+01 -7.457901e-02
## H01_NIE   4.986531e-01 -6.641824e-02
## H01_PH    1.000117e+00 -1.548690e-01
## H01_PHSE -8.661368e-01 -7.396483e-02
## H01_PL    5.010445e-02 -7.719896e-03
## H01_POLA  9.731101e-02 -6.570131e-02
## H01_PRSP -1.216649e-01 -5.570462e-03
## H01_R    -1.994332e-02  7.679153e-03
## H01_RAT  -9.514712e-02  2.241508e-02
## H01_SAFU -3.966692e+00 -1.714051e-01
## H01_SAHO -3.058831e+00  2.154947e-01
## H01_SAMU -1.170339e-01 -5.311298e-04
## H01_SCCA -9.417843e-02  2.281910e-02
## H01_TALE -5.648812e-01  2.160562e-01
## H01_UV   -7.738378e-01 -2.760466e-02

In the above, the prefix H01_ indicates that the coefficient relates to the Bernoulli part of the hurdle model; rows without the prefix relate to the beta part.
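Since both parts are stacked in one matrix, it can be convenient to split them by the row-name prefix. The sketch below does this on a small stand-in matrix built from the rounded values of two species (AU and BO) in the output above; with a fitted model one would presumably start from fit$params$Xcoef instead. The object names are made up for illustration.

```r
# Stand-in for fit$params$Xcoef, restricted to two species.
Xcoef <- matrix(
  c(-0.095, 0.138, -0.162, 0.292,     # KELP_FRONDS
    -0.003, -0.025, -0.053, -0.021),  # PERCENT_ROCKY
  ncol = 2,
  dimnames = list(c("AU", "BO", "H01_AU", "H01_BO"),
                  c("KELP_FRONDS", "PERCENT_ROCKY"))
)

# Rows with the H01_ prefix belong to the Bernoulli part.
is_bern <- grepl("^H01_", rownames(Xcoef))
coef_beta <- Xcoef[!is_bern, , drop = FALSE]
coef_bern <- Xcoef[is_bern, , drop = FALSE]

# Strip the prefix so that both matrices are indexed by species code.
rownames(coef_bern) <- sub("^H01_", "", rownames(coef_bern))

coef_beta
coef_bern
```

With matching row names, the two matrices can be compared species by species, for example to see whether a covariate affects presence and cover in the same direction.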

An ordination plot can then be generated as usual:

ordiplot(fit, jitter = TRUE, s.cex = .8)

Ordered beta GLLVM

Another solution for modeling percent cover data in gllvm is to use the ordered beta response model. It handles zeros and ones slightly differently from the hurdle beta model. Instead of assuming that the zeros and ones come from a process separate from the percent cover, the model assumes that there is a single underlying process $z_{ij}$ from which all observations arise.

For species $j = 1, \ldots, m$, let $z_{ij}$ denote an underlying continuous variable, and define two cutoff parameters $\zeta_{j0} < \zeta_{j1}$ such that $Y_{ij} = 0$ occurs when $z_{ij} < \zeta_{j0}$, $Y_{ij}=1$ occurs when $z_{ij}>\zeta_{j1}$, and $Y_{ij} \in (0,1)$ occurs when $\zeta_{j0} < z_{ij} < \zeta_{j1}$. Conditional on $Y_{ij} \in (0,1)$, the response variable follows a beta distribution. By assuming that $z_{ij}$ follows a logistic distribution and marginalising over $z_{ij}$, we obtain the following distribution for the percent cover responses, which characterizes the ordered beta GLLVM:

$$
P(Y_{ij};\eta_{ij}, \phi_j) = \begin{cases} \rho^0_{ij}, & \text{if } Y_{ij} = 0,\\ \left(\rho^1_{ij} - \rho^0_{ij} \right) \cdot f_{\text{beta}}(Y_{ij}; \mu_{ij}, \phi_j), & \text{if } Y_{ij} \in (0,1),\\ 1-\rho^1_{ij}, & \text{if } Y_{ij} = 1. \end{cases}
$$
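The text does not spell out the probabilities $\rho^0_{ij}$ and $\rho^1_{ij}$. A natural reading, assumed here and not confirmed by the source, is that $z_{ij}$ is logistic with location $\eta_{ij}$ and unit scale, so that $\rho^0_{ij} = \text{logit}^{-1}(\zeta_{j0} - \eta_{ij})$ and $\rho^1_{ij} = \text{logit}^{-1}(\zeta_{j1} - \eta_{ij})$. Under that assumption, the three class probabilities can be sketched as follows (the function name `ordbeta_probs` is hypothetical):

```r
# Class probabilities of the ordered beta model, ASSUMING that the
# latent z is logistic with location eta and unit scale.
ordbeta_probs <- function(eta, zeta0, zeta1) {
  stopifnot(all(zeta0 < zeta1))
  rho0 <- plogis(zeta0 - eta)  # P(z < zeta0), i.e. P(Y = 0)
  rho1 <- plogis(zeta1 - eta)  # P(z < zeta1)
  cbind(zero = rho0, interior = rho1 - rho0, one = 1 - rho1)
}

# Three values of the linear predictor, one pair of cutoffs.
probs <- ordbeta_probs(eta = c(-4, -2, 0), zeta0 = -2.8, zeta1 = 20)
probs
rowSums(probs)  # each row sums to one
```

Under this reading, a very large upper cutoff makes the probability of a complete cover negligible for any realistic value of the linear predictor.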

Adjusting ordered beta to data without ones

Let's demonstrate the ordered beta response model on the previous example. The data contain no ones:

sum(Yalg == 1)
## [1] 0

We can therefore adapt the model by fixing the upper cutoff parameters to some large value, for example 20. With gllvm this can be done by setting starting values for the cutoff parameters with `zetacutoff = c(0, 20)` and fixing the upper cutoffs with `setMap = list(zeta = factor(rbind(1:m, rep(NA, m))))`. Here we assume that the number of species/response variables is saved in the object `m`. There are `m` lower cutoff parameters $\zeta_{j0}$, which are freely estimated (the indices `1:m` in `setMap`), and `m` upper cutoff parameters $\zeta_{j1}$, which are held fixed (the `rep(NA, m)` in `setMap`).
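To see what the map looks like, the expression can be evaluated for a small number of species. The sketch below uses m = 3 purely for illustration; the NA entries mark the parameters that are held at their starting values.

```r
# Build the cutoff map for a toy example with three species.
m <- 3
zeta_layout <- rbind(1:m, rep(NA, m))  # row 1: lower cutoffs, row 2: upper cutoffs
zeta_layout

# factor() flattens the matrix column by column, so lower and upper
# cutoffs alternate: 1, NA, 2, NA, 3, NA.
zeta_map <- factor(zeta_layout)
zeta_map
```

Each distinct non-missing level corresponds to one freely estimated parameter, so the three lower cutoffs are estimated separately while the three upper cutoffs stay at the supplied starting value.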

Setting shape parameter to common across species

Some species are observed only a few times:

colSums(Yalg>0)
##   AU   BF   BO   BR  BRA  CAL   CC   CF   CG   CO  COF   CP CRYP CYOS   DL  DMH 
##   23    3   11    5    2    5   30   13    5   24    5    3    3   33   26    8 
##   DP   DU  EAH   EC   EH   ER   FB FTHR   GR   GS GYSP   MH  NIE   PH PHSE   PL 
##   14    4    3   36    2    6    6    6   11   20    7   36   13    2    5    4 
## POLA PRSP    R  RAT SAFU SAHO SAMU SCCA TALE   UV 
##    6    4   31   17    6    2    7    7    5    6

so there is not much information for estimating the shape parameter of the beta distribution separately for each species. Thus we can also set the shape parameter to be common across species with `disp.formula = rep(1, m)`. This can be applied to all beta-based models in gllvm.

Model fit

Now we are ready to fit the model:

# save the number of species to object m
m <- ncol(Yalg)
fit_ob <- gllvm(Yalg, X=Xenv, formula = Xformulai, family = "orderedBeta", 
    method="EVA", num.lv = 2, link="logit",
    disp.formula = rep(1, m), zetacutoff = c(0, 20),
    setMap = list(zeta = factor(rbind(1:m, rep(NA, m)))) )
fit_ob
## Call: 
## gllvm(y = Yalg, X = Xenv, formula = Xformulai, family = "orderedBeta", 
##     num.lv = 2, method = "EVA", link = "logit", disp.formula = rep(1, 
##         m), setMap = list(zeta = factor(rbind(1:m, rep(NA, m)))), 
##     zetacutoff = c(0, 20))
## family: 
## [1] "orderedBeta"
## method: 
## [1] "EVA"
## 
## log-likelihood:  245.7163 
## Residual degrees of freedom:  1554 
## AIC:  96.56732 
## AICc:  208.2608 
## BIC:  1719.994

If we now check the cutoff values, we see that the upper cutoffs are fixed to 20:

fit_ob$params$zeta
##         cutoff0 cutoff1
## AU   -2.8066526      20
## BF   -1.2909847      20
## BO   -2.7680134      20
## BR   -1.3784060      20
## BRA  -1.5088153      20
## CAL  -3.7936234      20
## CC   -4.6882192      20
## CF   -2.8193092      20
## CG   -3.4904803      20
## CO   -3.1259332      20
## COF  -6.0315847      20
## CP   -3.4386941      20
## CRYP -1.5338854      20
## CYOS -3.8779396      20
## DL   -4.3522017      20
## DMH  -2.7145261      20
## DP   -2.5389235      20
## DU   -2.0430645      20
## EAH  -4.9651301      20
## EC   -4.2535894      20
## EH   -2.7925201      20
## ER   -2.4001887      20
## FB   -0.9245673      20
## FTHR -2.5956086      20
## GR   -4.0362918      20
## GS   -3.7589668      20
## GYSP -1.8758024      20
## MH   -4.3894218      20
## NIE  -2.9925227      20
## PH   -1.4827602      20
## PHSE -2.4340275      20
## PL   -1.4354152      20
## POLA -2.3353322      20
## PRSP -2.5428144      20
## R    -4.4936593      20
## RAT  -3.1882598      20
## SAFU -4.3655443      20
## SAHO -7.6879426      20
## SAMU -1.5900864      20
## SCCA -2.3214241      20
## TALE -3.1525937      20
## UV   -1.5446683      20

Ordination plot:

ordiplot(fit_ob, jitter = TRUE, s.cex = .8)

References

Korhonen, P., F.K.C. Hui, J. Niku, and S. Taskinen. 2023. “Fast and Universal Estimation of Latent Variable Models Using Extended Variational Approximations.” Statistics and Computing 33: 26. https://doi.org/10.1007/s11222-022-10189-w.
Korhonen, P., F.K.C. Hui, J. Niku, S. Taskinen, and B. van der Veen. 2024. “A Comparison of Joint Species Distribution Models for Percent Cover Data.” https://doi.org/10.48550/arXiv.2403.11562.