Knowing your presenter helps you understand their lens and forge a connection - it’s almost as important as knowing your audience 😀
Today’s talk will focus on topics I’ve been researching on and off for a decade, centered on a practical example:
At UFS, students are taught
So what checks do you do for generalised non-linear mixed effects models?
In Burger et al. (2023), Divan fitted four mixed effects regression models for the vitality of a specific plant cover, given various explanatory variables.
The study was on the impact of the grass species Agrostis magellanica on the cushion-forming plant, Azorella selago, using sub-Antarctic Marion Island as a model system.
But the health varies naturally over time, so instead of the expected (mean) health, we modelled its quantiles.
So it turns out we could have just modelled the median vitality, but we didn’t know that until we had done all the quantiles.
… excluding outliers because they deviate considerably from the other observations may misrepresent vital ecological processes, ultimately leading to misleading conclusions. From a statistical perspective, it seems preferable to carry out an analysis that is robust to outliers rather than an analysis that is preceded by outlier removal.
We also implemented a simulation study to establish whether the model parameters are estimated accurately and whether the HPD intervals attain reasonable coverage.
The models compared were (1) the beta regression model, (2) the rectangular beta regression model, (3) the Kumaraswamy regression model, and (4) the Johnson-\(t\) regression model.
In summary, the estimates and HPD interval coverage of the fixed effects, which are the main parameters of interest, are acceptable under all parameter scenarios. However, the HPD intervals of the variance components are generally too conservative, particularly under small variance components, similar to the findings of Burger et al. (2020) and Burger and Lesaffre (2021). The HPD interval coverage of the degrees of freedom is slightly lenient when the data are light-tailed; in contrast, it is acceptable when the data are heavy-tailed.
Let \(\boldsymbol\Psi^{\left(k\right)}\) and \(\boldsymbol\Sigma^{\left(k\right)}\) respectively represent the \(k^\text{th}\) posterior sample drawn from \(\boldsymbol\Psi\) and \(\boldsymbol\Sigma\), \(\mathbf{u}^{\left(k,l\right)}_{{\text{rep}_i}}\) the \(l^\text{th}\) replicate sample from \(p\left(\mathbf{u}_i\left|\boldsymbol\Sigma^{\left(k\right)}\right.\right)\), and \(\mathbf{u}^{\left(m\right)}_{{\text{rep}_i}}\) the \(m^\text{th}\) replicate sample from \(p\left(\mathbf{u}_i\left|\hat{\boldsymbol\Sigma}\right.\right)\) (\(k=1,\ldots,K\), \(l=1,\ldots,L\), and \(m=1,\ldots,M\)). Accordingly,
\[mDIC=-\frac{4}{K}\sum_{k=1}^K\sum_{i=1}^I\sum_{j=1}^{J_i}\log\left\{\frac{1}{L}\sum_{l=1}^{L}f\left(y_{ij}\left|\boldsymbol\Psi^{\left(k\right)},\mathbf{u}^{\left(k,l\right)}_{{\text{rep}_i}}\right.\right)\right\}-2\sum_{i=1}^I\sum_{j=1}^{J_i}\log\left\{\frac{1}{M}\sum_{m=1}^{M}f\left(y_{ij}\left|\hat{\boldsymbol\Psi},\mathbf{u}^{\left(m\right)}_{{\text{rep}_i}}\right.\right)\right\} \]
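As a rough illustration only (not the implementation used for the paper), the Python sketch below evaluates this Monte Carlo estimate, assuming the posterior draws, the replicated random effects, and a conditional log-density function are already available; all names and the interface are hypothetical.

```python
import numpy as np

def marginal_dic(y, log_f, psi_draws, u_rep_draws, psi_hat, u_rep_hat):
    """Monte Carlo estimate of the mDIC expression above.

    Hypothetical interface (not the paper's code):
      y           : list of arrays; y[i][j] is outcome j in cluster i
      log_f       : function(y_ij, psi, u_i) -> conditional log-density
      psi_draws   : K posterior draws of Psi
      u_rep_draws : u_rep_draws[k][l][i] is the l-th replicate of u_i
                    drawn from p(u_i | Sigma^(k))
      psi_hat     : posterior estimate of Psi
      u_rep_hat   : u_rep_hat[m][i] is the m-th replicate of u_i
                    drawn from p(u_i | Sigma_hat)
    """
    K, L, M = len(psi_draws), len(u_rep_draws[0]), len(u_rep_hat)

    # First term: average over the K posterior draws of the log of the
    # replicate-averaged conditional density, times -4.
    term1 = 0.0
    for k in range(K):
        for i, y_i in enumerate(y):
            for y_ij in y_i:
                dens = np.mean([np.exp(log_f(y_ij, psi_draws[k], u_rep_draws[k][l][i]))
                                for l in range(L)])
                term1 += np.log(dens)
    term1 *= -4.0 / K

    # Second term: the same quantity evaluated at the posterior estimate
    # of Psi, times -2.
    term2 = 0.0
    for i, y_i in enumerate(y):
        for y_ij in y_i:
            dens = np.mean([np.exp(log_f(y_ij, psi_hat, u_rep_hat[m][i]))
                            for m in range(M)])
            term2 += np.log(dens)
    term2 *= -2.0

    return term1 + term2
```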
The probability density function of the Kumaraswamy regression model of Bayes, Bazán, and de Castro (2017) for a given bounded outcome \(0<y_{ij}<1\) is:
\[f\left(y_{ij}\left|\boldsymbol\beta,\mathbf{u}_i,\rho\right.\right) = -\frac{\log\left(1-q\right)\rho}{\log\left(1 - e^{-\rho}\right)\log\left(\kappa_{ij}\right)}y_{ij}^{-\frac{\rho}{\log\left(\kappa_{ij}\right)}-1}\left(1-y_{ij}^{-\frac{\rho}{\log\left(\kappa_{ij}\right)}}\right)^{\frac{\log\left(1-q\right)}{\log\left(1-e^{-\rho}\right)}-1}\]
where \(\kappa_{ij}=\frac{\exp\left({\mathbf{x}^{\prime}_{ij}\boldsymbol\beta + \mathbf{z}^{\prime}_{ij}\mathbf{u}_i}\right) }{ 1+\exp\left({\mathbf{x}^{\prime}_{ij}\boldsymbol\beta + \mathbf{z}^{\prime}_{ij}\mathbf{u}_i}\right)}\) is the \(q\) conditional quantile of \(y_{ij}\) given the \(i^\text{th}\) random effect under the Kumaraswamy model, and \(\rho>0\) is the precision parameter of the Kumaraswamy distribution.
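For concreteness, here is a minimal Python sketch of this log-density, using the shape parameters \(a=-\rho/\log\kappa_{ij}\) and \(b=\log\left(1-q\right)/\log\left(1-e^{-\rho}\right)\) implied by the formula above; the function name and arguments are illustrative only.

```python
import numpy as np
from scipy.special import expit  # inverse logit

def kumaraswamy_logpdf(y, x, z, beta, u, rho, q=0.5):
    """Log-density above for a single bounded outcome 0 < y < 1.

    q is the modelled quantile level (q = 0.5 gives median regression).
    """
    kappa = expit(x @ beta + z @ u)              # q-th conditional quantile
    a = -rho / np.log(kappa)                     # implied shape parameter a > 0
    b = np.log1p(-q) / np.log1p(-np.exp(-rho))   # implied shape parameter b > 0
    # Standard Kumaraswamy density a*b*y^(a-1)*(1 - y^a)^(b-1), on the log scale
    return (np.log(a) + np.log(b)
            + (a - 1.0) * np.log(y)
            + (b - 1.0) * np.log1p(-y ** a))
```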
The probability density function of the Johnson-\(t\) regression model of Lemonte and Moreno-Arenas (2020) for a given bounded outcome \(0<y_{ij}<1\) is:
\[f\left(y_{ij}\left|\boldsymbol\beta,\mathbf{u}_i,\rho,\nu\right.\right) = \frac{\rho\nu^{\frac{1}{2}\nu}}{y_{ij}\left(1-y_{ij}\right)}\frac{\Gamma\left(\frac{1}{2}+\frac{1}{2}\nu\right)}{\Gamma\left(\frac{1}{2}\right)\Gamma\left(\frac{1}{2}\nu\right)}\left[\nu+\left\{t_{\nu}\left(q\right)+\rho\left[\log\left(\frac{y_{ij}}{1-y_{ij}}\right)-\log\left(\frac{\kappa_{ij}}{1-\kappa_{ij}}\right)\right]\right\}^2\right]^{-\frac{\nu+1}{2}}\]
where \(\kappa_{ij}=\frac{\exp\left({\mathbf{x}^{\prime}_{ij}\boldsymbol\beta+\mathbf{z}^{\prime}_{ij}\mathbf{u}_i}\right)}{1+\exp\left({\mathbf{x}^{\prime}_{ij}\boldsymbol\beta+\mathbf{z}^{\prime}_{ij}\mathbf{u}_i}\right)}\) is the \(q\) conditional quantile of \(y_{ij}\) given the \(i^\text{th}\) random effect under the Johnson-\(t\) model, \(\rho>0\) and \(\nu>0\) are respectively the dispersion parameter and degrees of freedom of the Johnson-\(t\) distribution, and \(t_{\nu}\left(q\right)\) is the \(q\) quantile of the conventional \(t\)-distribution with degrees of freedom \(\nu\).
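A corresponding Python sketch of the Johnson-\(t\) log-density, with the same caveats about the illustrative interface:

```python
import numpy as np
from scipy.special import expit, gammaln
from scipy.stats import t as student_t

def johnson_t_logpdf(y, x, z, beta, u, rho, nu, q=0.5):
    """Log-density above for a single bounded outcome 0 < y < 1."""
    kappa = expit(x @ beta + z @ u)      # q-th conditional quantile
    t_q = student_t.ppf(q, df=nu)        # q-th quantile of the standard t-distribution
    resid = t_q + rho * (np.log(y / (1.0 - y)) - np.log(kappa / (1.0 - kappa)))
    return (np.log(rho) + 0.5 * nu * np.log(nu)
            - np.log(y) - np.log1p(-y)                        # 1 / {y (1 - y)}
            + gammaln(0.5 * (1.0 + nu)) - gammaln(0.5) - gammaln(0.5 * nu)
            - 0.5 * (nu + 1.0) * np.log(nu + resid ** 2))
```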
The Johnson-\(t\) distribution accommodates heavy-tailed data (including outliers) and is suitable for quantile regression given its closed-form quantile function.
For each of the new (replicated) samples \(m=1,\ldots,M\), compute the chosen test statistic on the replicated data.
Then report a p-value equal to the proportion of new statistics more extreme than the observed sample statistic.
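A minimal sketch of this check in Python, assuming the replicated data sets have already been drawn from the posterior predictive distribution and interpreting “more extreme” one-sidedly:

```python
import numpy as np

def posterior_predictive_pvalue(y_obs, y_rep, statistic=np.mean):
    """Posterior predictive p-value for a chosen test statistic.

    y_obs : observed outcomes, shape (n,)
    y_rep : replicated data sets from the posterior predictive
            distribution, shape (M, n) -- one row per replicate
    """
    t_obs = statistic(y_obs)
    t_rep = np.array([statistic(row) for row in y_rep])
    # Proportion of replicated statistics at least as extreme as the observed one
    return np.mean(t_rep >= t_obs)
```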
For every observation \(i\), compute the proportion of replicated values that fall below the observed value.
The resulting proportions should follow a U(0,1) distribution, so we can use a test for uniformity (e.g. the Anderson-Darling or Kolmogorov-Smirnov test) to perform goodness-of-fit testing. We can also find outliers this way!
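One possible implementation of this observation-level check, assuming a matrix of posterior predictive replicates; the outlier cut-offs are arbitrary illustrations:

```python
import numpy as np
from scipy import stats

def observation_level_check(y_obs, y_rep):
    """y_obs: observed outcomes, shape (n,); y_rep: replicates, shape (M, n)."""
    # Proportion of replicates below each observed value
    props = np.mean(y_rep < y_obs, axis=0)

    # Goodness of fit: Kolmogorov-Smirnov test against U(0, 1)
    # (an Anderson-Darling test could be used instead)
    ks_result = stats.kstest(props, "uniform")

    # Flag potential outliers: observations that sit in the extreme tails of
    # their predictive distribution (the 0.5% cut-offs are arbitrary)
    outliers = np.where((props < 0.005) | (props > 0.995))[0]
    return props, ks_result, outliers
```

The same proportions feed both the uniformity test and the outlier flags.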
From Rice Ch. 2: Let \(W=F_Y(Y)\) for continuous \(Y\), then
\[P[W\leq w]=P[F_Y(Y)\leq w] = P[Y\leq F_Y^{-1}(w)] = F_Y(F_Y^{-1}(w))=w\]
If a statistical model fits well, then the proportions computed above behave like \(W=F_Y(Y)\) and should therefore look like a sample from U(0,1).
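A quick numerical illustration of the transform (the gamma distribution below is an arbitrary choice of continuous \(Y\)):

```python
import numpy as np
from scipy import stats

# Draw from any continuous distribution, apply its own CDF, and the result
# should look Uniform(0, 1).
rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=3.0, size=10_000)
w = stats.gamma.cdf(y, a=2.0, scale=3.0)        # W = F_Y(Y)
print(stats.kstest(w, "uniform"))               # large p-value: consistent with U(0,1)
```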
Thank you for your time and attention.
This presentation was created with the Reveal.js format in Quarto, using the RStudio IDE. Fonts and line colours follow UFS branding, and background images were generated with Midjourney and combined in the GIMP image editor.
2023/08/28 - Model checking