This presentation focusses on the problem of parametric regression where the dependent variable is a bounded integer, i.e. a count with a known upper bound, whatever its origin.
This presentation is drawn from our paper that is under review at “Statistical Modelling”.
Divan A. Burger\(^{1,2,3}\), Sean van der Merwe\(^{3}\), Emmanuel Lesaffre\(^{5}\)
A GLM revolves around two things: a distribution for the outcome, and a link function \(g\) connecting its mean to the linear predictor:
\[g(E[Y])=X\boldsymbol\beta \]
For continuous data we might assume that \(y_i \sim N(\mu_i, \sigma)\),
and that \(\mu_i=\mathbf{x}_i\boldsymbol\beta\) with \(g(\mu)=\mu\).
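A minimal R sketch of this Gaussian special case, using simulated data with illustrative parameter values (not taken from the paper):

```r
# Gaussian GLM with identity link: E[y] = beta0 + beta1 * x
set.seed(1)                      # illustrative simulated data
x <- runif(100)
y <- rnorm(100, mean = 2 + 3 * x, sd = 0.5)
fit_norm <- glm(y ~ x, family = gaussian(link = "identity"))
summary(fit_norm)$coefficients   # estimates of beta and their standard errors
```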
If \(y_i \sim Bin(m, p_i)\) then a good link function is the scaled logit: \(g(\mu)=\log \frac{\mu/m}{(1-\mu/m)}\)
It converts the bounded probability of success \(0\leq p_i \leq 1\) to an unbounded \((-\infty,\infty)\) space for the linear predictor as follows:
\(\operatorname{logit} p_i = \mathbf{x}_i\boldsymbol\beta\ \Rightarrow\ p_i = \frac{1}{1+\exp(-\mathbf{x}_i\boldsymbol\beta)}\)
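As a sketch of this standard binomial GLM in R (the sample size, number of trials, and coefficients below are assumed, purely for illustration):

```r
# Binomial GLM with logit link: logit(p) = beta0 + beta1 * x
set.seed(2)
m <- 20                                   # trials per observation (assumed)
x <- runif(100)
p <- plogis(-1 + 2 * x)                   # inverse logit of the linear predictor
y <- rbinom(100, size = m, prob = p)      # successes out of m trials
fit_bin <- glm(cbind(y, m - y) ~ x, family = binomial(link = "logit"))
coef(fit_bin)                             # estimates should be near (-1, 2)
```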
What aspect of the Gaussian (normal) model is missing here?
Can the same principle be applied in the binomial case?
A much simpler approach is to use a standard distribution for innovations/errors/residuals, such as the normal or Student-\(t\), on the logit scale. This is what we have implemented and tested.
\[\operatorname{logit} p_i \sim t(\nu, \mu_i, \sigma)\]
where \(\mu_i = \mathbf{x}_i\boldsymbol\beta\) for fixed-effects regression, with additional random-effect terms for mixed-effects models.
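A minimal simulation sketch of the fixed-effects version of this model in R; the parameter values are placeholders, and the actual fitting would use the Bayesian machinery described in the paper (not shown here):

```r
# Simulate from the binomial-logit-t model:
# logit(p_i) ~ t(nu, mu_i, sigma),  y_i ~ Bin(m, p_i)
set.seed(3)
n <- 200; m <- 20                          # assumed sample size and trials
nu <- 4; sigma <- 0.4                      # assumed t degrees of freedom and scale
beta <- c(-0.5, 1.5)                       # assumed regression coefficients
x <- runif(n)
mu <- beta[1] + beta[2] * x                # linear predictor on the logit scale
logit_p <- mu + sigma * rt(n, df = nu)     # t-distributed errors on the logit scale
p <- plogis(logit_p)                       # back-transform to probabilities
y <- rbinom(n, size = m, prob = p)         # observed bounded counts
```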
Let’s get practical…
To compare the models on the real data, the following measures were employed:
Considering the residual diagnostic test p-values below, only the binomial-logit-\(t\) model fits the data adequately.
Here we consider a simulated subject from a data set constructed to resemble the real data. The subject is fitted along with many others, but thanks to the use of random effects we obtain personalised predictions. The predictions can extend past the fitted portion of the data.
Two simulation studies were considered: first 16 scenarios of regular data generated from the proposed models, and then 3 levels of data contamination, i.e. mostly well-behaved data with extreme outliers added to try to break the models.
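A sketch of how such contaminated data could be generated in R; the contamination rate and outlier mechanism below are illustrative assumptions, not the exact settings of the study:

```r
# Mostly well-behaved binomial data with a few extreme outliers mixed in
set.seed(4)
n <- 200; m <- 20
x <- runif(n)
p <- plogis(-0.5 + 1.5 * x)
y <- rbinom(n, size = m, prob = p)
contam <- runif(n) < 0.05                    # ~5% of observations contaminated (assumed)
y[contam] <- rbinom(sum(contam), size = m,   # outliers pushed towards the opposite bound
                    prob = ifelse(p[contam] > 0.5, 0.02, 0.98))
```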
For all the coefficients \((\beta_0,\dots,\beta_3)\) the coverage of the 95% interval was between 93% and 96%, using 1000 data sets per scenario.
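Coverage here means the proportion of the 1000 simulated data sets in which the 95% interval contained the true coefficient; a sketch of that bookkeeping (the interval endpoints would come from each fitted data set, e.g. posterior quantiles, and are not shown here):

```r
# Coverage of a 95% interval over repeated simulations:
# the proportion of intervals that contain the true coefficient value
coverage <- function(lower, upper, beta_true) {
  mean(lower <= beta_true & beta_true <= upper)
}
# Usage: lower/upper are vectors of interval endpoints, one pair per simulated data set;
# a well-calibrated procedure gives coverage close to 0.95
```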
This presentation was created using the Reveal.js format in Quarto, using the RStudio IDE. Fonts and line colours follow UFS branding, and the background image was created in the GIMP image editor by compositing images from CoPilot (DALL-E 3).
2024/10/04 - Bin Reg