Model comparison and goodness-of-fit of generalised parametric models

Sean van der Merwe

Introduction

Me

Knowing your presenter helps you understand their lens and forge a connection - it’s almost as important as knowing your audience 😀

  • I do free statistical consultation for staff, PhD students, and master’s students at UFS
    • driven and sponsored by top management for the purpose of improving research output and quality overall
    • I personally help with 40 to 50 projects per year
  • I lecture Bayesian statistics for honours students
    • Focused on statistical modelling as a general problem solving tool
    • Practical module for building statistical skill-set
  • I serve on various committees, like ethics committees (animals and general), IT committees (AI and faculty), etc.
  • I sometimes even find time to do my own research (too rarely)

Outline

Today’s talk focuses on topics I’ve been researching on and off for a decade, centred on a practical example:

The road to GLMs and beyond

At UFS students are taught

  • Random variable \(X\sim N(\mu,\sigma^2)\)
  • Sample \(X_1,\dots,X_n\sim i.i.d.\ N(\mu,\sigma^2)\)
  • Ordinary regression \(y_i=\mathbf{x}_i\boldsymbol\beta + \varepsilon_i\) where \(\varepsilon_i\sim i.i.d.\ N(0,\sigma^2)\)
    • At this point the idea of residual checks is stressed
  • GLMs \(Y_i\sim N/Gamma/\dots(\mu_i,\sigma^2)\) where \(g(\mu_i)=\mathbf{x}_i\boldsymbol\beta\)
    • At this point the idea of residual checks vanishes

So what checks do you do for generalised non-linear mixed effects models?

Plants on Marion Island

Source and licence

Proportions as dependent variable

In Burger et al. (2023), Divan fitted four mixed-effects regression models to the vitality of specific plant cover given various explanatory variables.

The study was on the impact of the grass species Agrostis magellanica on the cushion-forming plant, Azorella selago, using sub-Antarctic Marion Island as a model system.

  • A positive coefficient would indicate a symbiotic relationship, while a negative coefficient might indicate a parasitic relationship
  • Plant cover is necessarily a proportion and analysing it as a proportion makes sense
  • The ‘standard’ approaches of transforming the data or ignoring the boundaries altogether break down when the proportions approach 0 or 1
  • The paper showed that we need to use heavy tailed distributions if we want the model to be robust to outlying values

More technically

  • The currently available approach for modelling the quantiles of hierarchically structured continuous proportion data is the Kumaraswamy model of Bayes, Bazán, and de Castro (2017)
  • However, our application dataset contains significant outliers, and the Kumaraswamy model is not adequate for modelling heavy-tailed data (i.e., containing outliers)
  • We, therefore, considered a robust model to accommodate outliers by extending the fixed effects Johnson-\(t\) model of Lemonte and Moreno-Arenas (2020)
  • According to the mDIC statistic, the robust model fitted the cushion plant dataset better than the non-robust model
  • Furthermore, the quantile model fits differ considerably between the non-robust and robust models:
    • Under the robust models, the covariate effects on the mean and median differ
  • Based on the model adequacy checks, it is clear that the outliers in the data have a substantial effect on the quantile fits, and thus, the ecological research questions were addressed using the Johnson-\(t\) model instead of the Kumaraswamy model

Ecological questions

  • Is the health of the plants affected by other plants encroaching on their turf?
  • Is the health of the plants affected by where they are on the island, or how high they are above the water?

But the health varies naturally over time, so instead of expected health,

  • We modelled the dead stem cover using quantile regression
    • The 0.95 quantile is of particular interest as this represents a health limit
    • The 0.1, 0.25, 0.5, and 0.75 quantiles were also modelled to see the effects of the variables in the best case, typical case, and worst case (and in between)

Ecological results

  • “…the extreme quantiles of vitality in A. selago are not constrained by the cover of vascular plants and mosses.”
  • “The odds of all quantiles of final dead stem cover are about 64% higher for mid-altitude than high-altitude sites.”
  • “The effect of vascular and nonvascular plant species, altitude, and aspect on A. selago does not vary across plants of different vitality.”

So it turns out we could have just modelled the median vitality, but we didn’t know that until we did all the quantiles

Statistical results

  • While the ecological results were interesting, the paper was primarily about the statistics
  • Specifically, we compared standard models with heavy tailed models for proportions
  • The results were definitive on two fronts:
    • The heavy tailed models fitted better according to model comparison statistics
    • The heavy tailed models failed to reject tests of goodness of fit, while the standard models were rejected

… excluding outliers because they deviate considerably from the other observations may misrepresent vital ecological processes, ultimately leading to misleading conclusions. From a statistical perspective, it seems preferable to carry out an analysis that is robust to outliers rather than an analysis that is preceded by outlier removal.

Simulation study

We also implemented a simulation study to establish that

  • We were fitting the models correctly
    • Models fitted to data that meets their assumptions should fit correctly and estimate parameters with reasonable accuracy and coverage
  • Our assumptions and assertions about the model properties hold
    • Specifically, which models hold up better under data contamination

The models compared were the (1) Beta regression model, (2) Rectangular beta regression model, (3) Kumaraswamy regression model, and (4) Johnson-t regression model.
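The bookkeeping behind such a study (repeated simulation, refitting, then summarising bias and interval coverage) can be sketched in a few lines. The study itself fitted the four Bayesian models above; the sketch below uses a plain normal model as a stand-in, and all names and settings are mine:

```python
import numpy as np

rng = np.random.default_rng(1)

def bias_and_coverage(n=100, mu=0.5, sigma=1.0, reps=500, z=1.96):
    """Toy bias/coverage check: estimate mu from normal samples and count
    how often a ~95% interval covers the true value. A stand-in for the
    paper's far more complex Bayesian mixed-effects fits."""
    estimates, covered = [], 0
    for _ in range(reps):
        y = rng.normal(mu, sigma, n)            # simulate under the model
        est, se = y.mean(), y.std(ddof=1) / np.sqrt(n)  # "refit"
        estimates.append(est)
        covered += (est - z * se <= mu <= est + z * se)
    return np.mean(estimates) - mu, covered / reps

bias, coverage = bias_and_coverage()
```

The same loop structure applies to the real models: only the simulation step and the fitting step change.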

Simulation study results

  • Under each parameter scenario, the bias of the estimates of the fixed effects and of the dispersion parameter is small (close to 0). In contrast, the bias of the estimates of the variance components is large
  • The estimates of the degrees of freedom are relatively unbiased for low degrees of freedom and considerably biased for high degrees of freedom, which is acceptable: at high degrees of freedom the \(t\) distribution is nearly normal, so large changes in this parameter barely alter the fitted distribution
  • Under each parameter scenario, the coverage probability of the HPD interval of the fixed effects is close to or slightly higher than the nominal value. In contrast, the HPD intervals of the dispersion parameter and variance components are too conservative

In summary, the estimates and HPD interval coverage of the fixed effects, which are the main parameters of interest, are acceptable under all parameter scenarios. However, the HPD intervals of the variance components are generally too conservative, similar to the findings of Burger et al. (2020) and Burger and Lesaffre (2021), in particular, under small variance components. The HPD interval coverage of the degrees of freedom is slightly lenient when the data are light-tailed; in contrast, it is acceptable when the data are heavy-tailed.

Model comparison

Model comparison for mixed models

  • We want to know the best model for future plants
    • Not just the current observations
  • So we must adjust our model comparison statistics
  • Implemented the marginalised Deviance Information Criterion (mDIC) of Quintero and Lesaffre (2018)
  • The principle is to replace the fitted random effects with newly simulated random effects and then average over them in the DIC calculation
    • Requires careful and precise code

Conditional (standard) Deviance information criterion (DIC)

  • Define the deviance \(D(\theta)=-2\log p(\mathbf{y}|\theta)+c\) where \(c\) is a constant that cancels when comparing models
  • Define \(\bar{D}=E_{\theta|\mathbf{y}}[D(\theta)]\) and \(p_D=\bar{D}-D(\bar\theta)\), where \(\bar\theta=E[\theta|\mathbf{y}]\)
  • Then \(DIC=p_D+\bar{D}\) (Spiegelhalter et al. 2002)
  • \(p_D\) denotes the effective number of parameters
  • DIC is useful for hierarchical models and complex models with many correlated parameters - if two parameters do the same job then counting them both (as AIC or BIC would) doesn’t make sense
  • A superior alternative for general problems is LOOIC, which attempts to bring in a cross-validation factor for better generalisation
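A minimal sketch of the DIC calculation for a toy one-parameter normal model. The posterior draws are generated analytically here for self-containment; in practice they would come from MCMC output:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setting: y_i ~ N(mu, 1) with a flat prior on mu, so the posterior
# is mu | y ~ N(ybar, 1/n); a real analysis would use MCMC draws instead.
y = rng.normal(1.0, 1.0, 50)
n, ybar = len(y), y.mean()
mu_draws = rng.normal(ybar, 1 / np.sqrt(n), 4000)

def deviance(mu):
    # D(mu) = -2 log p(y | mu), taking the additive constant c = 0
    return -2 * np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2)

D_bar = np.mean([deviance(m) for m in mu_draws])  # posterior mean deviance
p_D = D_bar - deviance(mu_draws.mean())           # effective no. of parameters
DIC = D_bar + p_D
```

With a single free parameter, \(p_D\) comes out close to 1, as expected.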

Marginalised DIC

  • The mDIC requires integrating over the likelihood function’s random effects (for better generalisation)
  • In order to do so, Quintero and Lesaffre (2018) proposed generating replicate samples of the random effects that need to be integrated out
  • Two sources of replicate samples of the random effects are drawn as follows:

Let \(\boldsymbol\Psi^{\left(k\right)}\) and \(\boldsymbol\Sigma^{\left(k\right)}\) respectively represent the \(k^\text{th}\) posterior sample drawn from \(\boldsymbol\Psi\) and \(\boldsymbol\Sigma\), \(\mathbf{u}^{\left(k,l\right)}_{{\text{rep}_i}}\) the \(l^\text{th}\) replicate sample from \(p\left(\mathbf{u}_i\left|\boldsymbol\Sigma^{\left(k\right)}\right.\right)\), and \(\mathbf{u}^{\left(m\right)}_{{\text{rep}_i}}\) the \(m^\text{th}\) replicate sample from \(p\left(\mathbf{u}_i\left|\hat{\boldsymbol\Sigma}\right.\right)\) (\(k=1,\ldots,K\), \(l=1,\ldots,L\), and \(m=1,\ldots,M\)). Accordingly,

\[mDIC=-\frac{4}{K}\sum_{k=1}^K\sum_{i=1}^I\sum_{j=1}^{J_i}\log\left\{\frac{1}{L}\sum_{l=1}^{L}f\left(y_{ij}\left|\boldsymbol\Psi^{\left(k\right)},\mathbf{u}^{\left(k,l\right)}_{{\text{rep}_i}}\right.\right)\right\}+2\sum_{i=1}^I\sum_{j=1}^{J_i}\log\left\{\frac{1}{M}\sum_{m=1}^{M}f\left(y_{ij}\left|\hat{\boldsymbol\Psi},\mathbf{u}^{\left(m\right)}_{{\text{rep}_i}}\right.\right)\right\} \]
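A hedged Python sketch of this computation for a toy normal random-intercept model. The “posterior draws” here are placeholders centred on the true values purely so the code runs stand-alone; a real analysis would use MCMC output for \(\boldsymbol\Psi=(\beta,\sigma)\) and \(\boldsymbol\Sigma=\tau^2\). The plug-in term enters with a positive sign on the log-likelihood, so that \(mDIC = 2\bar{D}_{\text{marg}} - D_{\text{marg}}(\hat\theta)\):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy random-intercept data: y_ij ~ N(beta + u_i, sigma^2), u_i ~ N(0, tau^2)
I, J = 10, 8
beta, sigma, tau = 0.0, 1.0, 0.7
u = rng.normal(0, tau, I)
y = beta + u[:, None] + rng.normal(0, sigma, (I, J))

K, L, M = 200, 100, 2000
# Placeholder "posterior draws" centred on the truth (use MCMC in practice):
beta_k = rng.normal(beta, 0.1, K)
sigma_k = np.abs(rng.normal(sigma, 0.05, K))
tau_k = np.abs(rng.normal(tau, 0.05, K))

def norm_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# First term: for each posterior draw k, average the density over L
# replicate random effects u_rep^(k,l), then log and sum over i, j
term1 = 0.0
for k in range(K):
    u_rep = rng.normal(0, tau_k[k], (L, I))
    dens = norm_pdf(y[None, :, :], beta_k[k] + u_rep[:, :, None], sigma_k[k])
    term1 += np.sum(np.log(dens.mean(axis=0)))
term1 *= -4.0 / K

# Second term: plug-in estimates with M replicate random effects u_rep^(m)
beta_hat, sigma_hat, tau_hat = beta_k.mean(), sigma_k.mean(), tau_k.mean()
u_rep = rng.normal(0, tau_hat, (M, I))
dens = norm_pdf(y[None, :, :], beta_hat + u_rep[:, :, None], sigma_hat)
term2 = 2.0 * np.sum(np.log(dens.mean(axis=0)))

mDIC = term1 + term2
```

The careful part is that the replicate random effects are drawn fresh from their prior given each posterior draw of the variance component, not reused from the fit - exactly the “careful and precise code” warning above.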

Kumaraswamy model

The probability density function of the Kumaraswamy regression model of Bayes, Bazán, and de Castro (2017) for a given bounded outcome \(0<y_{ij}<1\) is:

\[f\left(y_{ij}\left|\boldsymbol\beta,\mathbf{u}_i,\rho\right.\right) = -\frac{\log\left(1-q\right)\rho}{\log\left(1 - e^{-\rho}\right)\log\left(\kappa_{ij}\right)}y_{ij}^{-\frac{\rho}{\log\left(\kappa_{ij}\right)}-1}\left(1-y_{ij}^{-\frac{\rho}{\log\left(\kappa_{ij}\right)}}\right)^{\frac{\log\left(1-q\right)}{\log\left(1-e^{-\rho}\right)}-1}\]

where \(\kappa_{ij}=\frac{\exp\left({\mathbf{x}^{\prime}_{ij}\boldsymbol\beta + \mathbf{z}^{\prime}_{ij}\mathbf{u}_i}\right) }{ 1+\exp\left({\mathbf{x}^{\prime}_{ij}\boldsymbol\beta + \mathbf{z}^{\prime}_{ij}\mathbf{u}_i}\right)}\) is the \(q\) conditional quantile of \(y_{ij}\) given the \(i^\text{th}\) random effect under the Kumaraswamy model, and \(\rho>0\) is the precision parameter of the Kumaraswamy distribution.
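As a sanity check, the density above is just the standard Kumaraswamy density \(a\,b\,y^{a-1}(1-y^a)^{b-1}\) with \(a=-\rho/\log\kappa_{ij}\) and \(b=\log(1-q)/\log(1-e^{-\rho})\), which guarantees \(F(\kappa_{ij})=q\). A small Python sketch verifying both properties numerically (parameter values are arbitrary illustrations):

```python
import numpy as np
from scipy import integrate

def kumaraswamy_pdf(y, kappa, rho, q):
    # Standard Kumaraswamy density a*b*y^(a-1)*(1-y^a)^(b-1), with the
    # shape parameters implied by the quantile parameterisation:
    a = -rho / np.log(kappa)                      # so that kappa^a = exp(-rho)
    b = np.log(1 - q) / np.log(1 - np.exp(-rho))  # so that F(kappa) = q
    return a * b * y ** (a - 1) * (1 - y ** a) ** (b - 1)

kappa, rho, q = 0.3, 2.0, 0.95  # arbitrary illustrative values
total, _ = integrate.quad(kumaraswamy_pdf, 0, 1, args=(kappa, rho, q))
mass_below_kappa, _ = integrate.quad(kumaraswamy_pdf, 0, kappa, args=(kappa, rho, q))
# total integrates to 1; the mass below kappa equals q
```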

Johnson-t model

The probability density function of the Johnson-\(t\) regression model of Lemonte and Moreno-Arenas (2020) for a given bounded outcome \(0<y_{ij}<1\) is:

\[f\left(y_{ij}\left|\boldsymbol\beta,\mathbf{u}_i,\rho,\nu\right.\right) = \frac{\rho\nu^{\frac{1}{2}\nu}}{y_{ij}\left(1-y_{ij}\right)}\frac{\Gamma\left(\frac{1}{2}+\frac{1}{2}\nu\right)}{\Gamma\left(\frac{1}{2}\right)\Gamma\left(\frac{1}{2}\nu\right)}\left[\nu+\left\{t_{\nu}\left(q\right)+\rho\left[\log\left(\frac{y_{ij}}{1-y_{ij}}\right)-\log\left(\frac{\kappa_{ij}}{1-\kappa_{ij}}\right)\right]\right\}^2\right]^{-\frac{\nu+1}{2}}\]

where \(\kappa_{ij}=\frac{\exp\left({\mathbf{x}^{\prime}_{ij}\boldsymbol\beta+\mathbf{z}^{\prime}_{ij}\mathbf{u}_i}\right)}{1+\exp\left({\mathbf{x}^{\prime}_{ij}\boldsymbol\beta+\mathbf{z}^{\prime}_{ij}\mathbf{u}_i}\right)}\) is the \(q\) conditional quantile of \(y_{ij}\) given the \(i^\text{th}\) random effect under the Johnson-\(t\) model, \(\rho>0\) and \(\nu>0\) are respectively the dispersion parameter and degrees of freedom of the Johnson-\(t\) distribution, and \(t_{\nu}\left(q\right)\) is the \(q\) quantile of the conventional \(t\)-distribution with degrees of freedom \(\nu\).

The Johnson-\(t\) distribution accommodates heavy-tailed data (including outliers) and is suitable for quantile regression given its closed-form quantile function.
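The same numerical sanity check applies here: the density says that \(t_\nu(q)+\rho[\operatorname{logit}(y)-\operatorname{logit}(\kappa_{ij})]\) follows a \(t_\nu\) distribution, so the density integrates to 1 and puts mass \(q\) below \(\kappa_{ij}\). A Python sketch (parameter values are arbitrary illustrations):

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import gammaln

def johnson_t_pdf(y, kappa, rho, nu, q):
    # logit(y), centred at logit(kappa) and offset so the q-quantile maps
    # to the t_nu quantile t_nu(q), follows a t distribution
    logit = lambda p: np.log(p / (1 - p))
    z = stats.t.ppf(q, nu) + rho * (logit(y) - logit(kappa))
    log_const = (0.5 * nu * np.log(nu) + gammaln(0.5 * (nu + 1))
                 - gammaln(0.5) - gammaln(0.5 * nu))
    return rho / (y * (1 - y)) * np.exp(log_const) * (nu + z ** 2) ** (-0.5 * (nu + 1))

kappa, rho, nu, q = 0.3, 1.5, 4.0, 0.95  # arbitrary illustrative values
total, _ = integrate.quad(johnson_t_pdf, 0, 1, args=(kappa, rho, nu, q), limit=200)
mass_below_kappa, _ = integrate.quad(johnson_t_pdf, 0, kappa, args=(kappa, rho, nu, q), limit=200)
```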

Model adequacy

Goodness of fit in parametric regression

  • In a parametric regression, every observation follows a different distribution
    • Usually the same type of distribution, conditional on varying explanatory variables
    • e.g. \(y_i\sim N(\mathbf{x}_i\boldsymbol\beta, \sigma^2)\)
  • Some basic goodness-of-fit approaches (e.g. Shapiro-Wilk) only work in very simple cases
    • And are never truly appropriate (e.g. the observed residuals are not i.i.d. even if the theoretical ones are)
  • There are two general solutions for goodness-of-fit that work for parametric models of almost any complexity:
    • The (double) parametric bootstrap
    • The DHARMa (Hartig 2022) approach - based on ‘standardised residuals’

Double parametric bootstrap

  1. Fit your parametric model
  2. Find a suitable goodness-of-fit statistic \((S)\) for your model and calculate it \((S^{obs})\)
  3. Simulate \(M\) (say 2000) new samples the same size as your original sample from the model using the fitted parameters

For each of the new samples \(m=1\dots M\),

  1. Fit the model as if the new sample was the original sample
  2. Calculate \(S^m\) using the new fit on the new sample

Then report a p-value equal to the proportion of new statistics more extreme than the observed sample statistic.
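The steps above can be sketched in a few lines. Here a plain normal model with the Kolmogorov-Smirnov distance as \(S\) stands in for the far more complex models of the paper; all names are mine:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def double_bootstrap_p(y, M=500):
    """Double parametric bootstrap goodness-of-fit test, sketched for a
    plain normal model with the KS distance as the statistic S."""
    # Steps 1-2: fit the model and compute S_obs on the original sample
    mu, sd = y.mean(), y.std(ddof=1)
    S_obs = stats.kstest(y, stats.norm(mu, sd).cdf).statistic
    # Step 3: simulate M new samples from the fitted model, refitting each
    S_new = np.empty(M)
    for m in range(M):
        y_m = rng.normal(mu, sd, len(y))
        mu_m, sd_m = y_m.mean(), y_m.std(ddof=1)  # refit on the new sample
        S_new[m] = stats.kstest(y_m, stats.norm(mu_m, sd_m).cdf).statistic
    # p-value: proportion of replicate statistics at least as extreme
    return np.mean(S_new >= S_obs)

p_good = double_bootstrap_p(rng.normal(0, 1, 200))   # data match the model
p_bad = double_bootstrap_p(rng.exponential(1, 200))  # data do not
```

Refitting on every simulated sample (the second “parametric bootstrap”) is what makes the p-value honest about the estimation step; skipping the refit would make the test anti-conservative.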

Standardised residuals (much easier)

For every observation \(i\),

  • obtain many samples from the predictive distribution for \(y_i|X,M\)
    • Both frequentist and Bayesian approaches will work
  • find the proportion of samples that are less than \(y_i\)

The resulting proportions should follow a U(0,1) distribution, so we can use a test for uniformity (e.g. AD or KS) to perform goodness-of-fit testing. We can also find outliers this way!

From Rice Ch. 2: Let \(W=F_Y(Y)\), then

\[P[W\leq w]=P[F_Y(Y)\leq w] = P[Y\leq F_Y^{-1}(w)] = F_Y(F_Y^{-1}(w))=w\]
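A minimal Python sketch of the procedure, using a toy normal regression whose predictive draws stand in for any model’s (frequentist or Bayesian) predictive samples; the misspecified variant shows the test rejecting:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Toy stand-in: y_i ~ N(1 + 2 x_i, 0.5^2), with S predictive draws per
# observation (here taken from the true model for simplicity)
n, S = 150, 1000
x = rng.uniform(0, 1, n)
y = 1 + 2 * x + rng.normal(0, 0.5, n)
pred_good = 1 + 2 * x + rng.normal(0, 0.5, (S, n))   # well-specified
pred_bad = 1.4 + 2 * x + rng.normal(0, 0.5, (S, n))  # biased predictions

def pit_pvalue(pred, y):
    """Proportion of predictive draws below each y_i, tested for U(0,1)."""
    u = (pred < y).mean(axis=0)
    return stats.kstest(u, "uniform").pvalue

p_good = pit_pvalue(pred_good, y)  # typically large: no evidence of misfit
p_bad = pit_pvalue(pred_bad, y)    # typically tiny: uniformity rejected
```

This is essentially what DHARMa automates for hierarchical regression models; observations whose proportion is extremely close to 0 or 1 are the outlier candidates.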

Another explanation

If a statistical model fits well then

  • 95% prediction intervals should cover the values being predicted about 95% of the time
  • Similarly, 90% prediction intervals should cover the values being predicted about 90% of the time
  • And, x% prediction intervals should cover the values being predicted about x% of the time, for any x
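This coverage statement is easy to check by simulation; a toy sketch with a well-specified normal model (all settings are mine):

```python
import numpy as np

rng = np.random.default_rng(8)

# S predictive draws per observation from a well-specified toy model
n, S = 400, 2000
y = rng.normal(0, 1, n)
pred = rng.normal(0, 1, (S, n))

coverage = {}
for level in (0.50, 0.90, 0.95):
    # Central prediction interval from the predictive draws
    lo = np.quantile(pred, (1 - level) / 2, axis=0)
    hi = np.quantile(pred, (1 + level) / 2, axis=0)
    coverage[level] = np.mean((lo <= y) & (y <= hi))
# Each empirical coverage should sit close to its nominal level
```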

Results for this study

Johnson-t model QQ Plot

Johnson-t model Residual Check

Kumaraswamy model QQ Plot

Kumaraswamy model Residual Check

Conclusion

References

Bayes, C. L., J. L. Bazán, and M. de Castro. 2017. “A Quantile Parametric Mixed Regression Model for Bounded Response Variables.” Statistics and Its Interface 10 (3): 483–93.
Burger, Divan A., and E. Lesaffre. 2021. “Nonlinear Mixed-Effects Modeling of Longitudinal Count Data: Bayesian Inference about Median Counts Based on the Marginal Zero-Inflated Discrete Weibull Distribution.” Statistics in Medicine 40 (23): 5078–95.
Burger, Divan A., Sean van der Merwe, Emmanuel Lesaffre, Peter C. le Roux, and Morgan J. Raath-Krüger. 2023. “A Robust Mixed-Effects Parametric Quantile Regression Model for Continuous Proportions: Quantifying the Constraints to Vitality in Cushion Plants.” Statistica Neerlandica n/a (n/a). https://doi.org/10.1111/stan.12293.
Burger, Divan A., R. Schall, J. T. Ferreira, and D.-G. Chen. 2020. “A Robust Bayesian Mixed Effects Approach for Zero Inflated and Highly Skewed Longitudinal Count Data Emanating from the Zero Inflated Discrete Weibull Distribution.” Statistics in Medicine 39 (9): 1275–91. https://doi.org/10.1002/sim.8475.
Hartig, Florian. 2022. DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. https://CRAN.R-project.org/package=DHARMa.
Lemonte, A. J., and G. Moreno-Arenas. 2020. “On a Heavy-Tailed Parametric Quantile Regression Model for Limited Range Response Variables.” Computational Statistics 35 (1): 379–98.
Quintero, Adrian, and Emmanuel Lesaffre. 2018. “Comparing Hierarchical Models via the Marginalized Deviance Information Criterion.” Statistics in Medicine 37 (16): 2440–54. https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.7649.
Spiegelhalter, David J., Nicola G. Best, Bradley P. Carlin, and Angelika van der Linde. 2002. “Bayesian Measures of Model Complexity and Fit.” Journal of the Royal Statistical Society. Series B (Statistical Methodology) 64 (4): 583–639. http://www.jstor.org/stable/3088806.

The end

  • Robust parametric modelling is getting easier every year
  • Checking and comparing complex models can be done via simulation
    • But checking model adequacy is a very different problem to comparing models, and addresses entirely different concerns

Thank you for your time and attention.

This presentation was created in the Reveal.js format in Quarto, using the RStudio IDE. Font and line colours follow UFS branding, and the background images were combined from Midjourney outputs using the image editor GIMP.