Exotic statistical modelling for real data: boerewors, river cleanup, etc.

Sean van der Merwe

Introduction

Outline

The key point of today’s talk is:

  • Effective statistical modelling of real data doesn’t fit in a box

The secondary point of today’s talk is:

  • Good mathematics and theoretical knowledge are king
    • Good coding is an extremely useful tool
      • But it is not the solution itself

You might be thinking, “Sean, why are you throwing shade at coding, isn’t coding like half your job?”

  • It’s actually 80% of my job, at least on a good day, but you don’t walk into a beautiful house and ask, “What screwdriver did you use to install the cupboards?”
    • I do supply my code to clients at project completion anyway though

Extreme focus

  • Maximal research output usually requires extreme focus
    • If you want to be a professor with a high rating then you are expected to specialise heavily in one field
    • Being an expert in a narrow field makes it easier to push boundaries in a direction and comes with prestige

BUT

  • Extreme focus can quickly create blind spots
    • Many professors publish paper after paper in fields that have almost no real world applications
      • They make arguments that are convincing to reviewers in that field, but not convincing to the people that have to actually implement the work
  • There are entire avenues of statistical research being pursued to this day that have been rendered redundant by work in adjacent fields, but the authors don’t read those other fields and thus are not aware

Examples of expired statistics

Disclaimer: I have personally taught all these topics recently (I’m part of the problem)

  • Random Forests and Support Vector Machines (SVMs) are almost always surpassed in real-world performance by Extreme Gradient Boosting (XGBoost), e.g. (Su et al. 2022; Fan et al. 2018)
  • Histograms are almost always surpassed in performance by kernel density plots, e.g. see Ch. 20 of (Wasserman 2004)
    • Sturges’ rule should not be part of our syllabus
  • Humans can’t interpret pie charts well; a bar chart is objectively superior in almost all cases, e.g. (Robbins 2012)
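To make the histogram point concrete, here is a minimal sketch (in Python, on illustrative simulated data) of how few bins Sturges’ rule allows for a moderate sample, next to a kernel density estimate of the same data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=500)  # illustrative sample, not real data

# Sturges' rule: the bin count grows only logarithmically with n,
# so large samples get badly oversmoothed histograms
sturges_bins = int(np.ceil(np.log2(len(x)) + 1))

# A Gaussian kernel density estimate adapts smoothly to the data
kde = stats.gaussian_kde(x)          # bandwidth via Scott's rule by default
grid = np.linspace(-4, 4, 200)
density = kde(grid)

print(sturges_bins)  # 10 bins for n = 500
```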

The need for rounded statisticians

  • Real data doesn’t just come with one issue at a time
  • Addressing practical Stats/Act.Sci. problems requires at minimum:
    • Understanding how data is captured and stored
    • Being able to manipulate data (and results), including merging, mapping, renaming, transforming, pivoting wide to long
    • Being able to visualise data (and results), and not just one variable at a time
    • Being able to use plots and tables to identify issues with data that will mess up your analysis attempts
    • Knowing enough about all the different areas of statistics to know where to start looking for solutions
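As a small illustration of one of these manipulation steps, a hypothetical wide-to-long pivot in Python/pandas (the column names are invented for the example):

```python
import pandas as pd

# Hypothetical wide-format measurements: one column per time point
wide = pd.DataFrame({
    "sample": ["A", "B"],
    "t0": [0.1, 0.2],
    "t24": [0.5, 0.7],
})

# Pivot wide to long so each row is one (sample, time, value) observation
long = wide.melt(id_vars="sample", var_name="time", value_name="value")
print(len(long))  # 4 rows: 2 samples x 2 time points
```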

Exotic modelling

Multi-issue data

Statistical software has packages/components/tools for everything

  • There are packages for multivariate time series
  • There are packages for censored data
  • There are packages for missing data
  • There are packages for outliers
  • There are packages for repeated measurements

But there are no ‘packages’ for 3 of these occurring at once, never mind all 5

Researchers usually focus heavily on a single aspect or problem at a time, narrowly within their own field. They pick data sets to suit their research argument. It is frustratingly rare to see theoretical papers that acknowledge that issues don’t come alone.

The real data: river pollution cleanup

Objective: obtain an estimate and interval for the reduction in pollution

My approach (the key point today)

I approached the problem by building a model that has two key properties:

  1. The model must have either a parameter whose value addresses the research question directly, or a clear way to create such a parameter or prediction after fitting
  2. The model must accommodate as many of the data’s key properties as it reasonably can, without excessive over-fitting

The second property is the difficult one, both practically and theoretically. It is difficult to know which properties of the data are going to materially affect the estimation of the key parameter(s) if they are not accommodated.

Model details

I formulate the model mathematically as follows:

The model for the inflows is:

\[\begin{aligned} x_i &\sim N(\mu_{t_i},\ \sigma_1^2) \\ \mu_j &\sim N(\phi\mu_{j-1},\ \sigma_3^2) \end{aligned}\]

The model for the outflows is:

\[\begin{aligned} y_i &\sim N(\nu_{t_i},\ \sigma_2^2) \\ \nu_j &= \mu_{j} + \alpha \end{aligned}\]

The priors assumed on the parameters are:

\[\begin{aligned} \phi &\sim U(-1, 1) \\ \alpha &\sim N(0, 10) \\ \end{aligned}\]

and \(\pi(\boldsymbol\sigma) = \prod \sigma_k^{-2}\).
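To make the structure concrete, here is a minimal sketch that simulates data forward from this model in Python. The parameter values are illustrative assumptions, and this only generates data from the model, it does not fit it:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative parameter values (assumptions, not fitted estimates)
phi, alpha = 0.8, -0.5          # AR(1) persistence and cleanup effect
sigma1, sigma2, sigma3 = 0.3, 0.3, 0.2
T = 100                         # number of time points

# Latent inflow mean follows an AR(1) process: mu_j ~ N(phi * mu_{j-1}, sigma3^2)
mu = np.zeros(T)
for j in range(1, T):
    mu[j] = phi * mu[j - 1] + sigma3 * rng.normal()

# Observed inflows and outflows; the outflow mean is nu_j = mu_j + alpha
x = mu + sigma1 * rng.normal(size=T)
y = (mu + alpha) + sigma2 * rng.normal(size=T)

# The difference in sample means is a crude check on alpha
print(round(y.mean() - x.mean(), 2))
```

Because the latent mean \(\mu_j\) is shared between the inflow and outflow equations, the model separates the common river dynamics from the cleanup effect \(\alpha\).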

Model results

With the resulting cleanup percentage distribution:

Sanity check

Let’s be realistic though: as fun and impressive as the fancy model is,

  • The results are roughly in line with those of the ordinary t-test or Wilcoxon test (which gives me confidence in my model fit 😀 )
  • The fundamentals of good science don’t change, just the tools
    • The mathematical and theoretical foundation was well laid a long time ago, and is quite robust

The fancy model is a more accurate approach, and it allows for clear quantification of desired quantities with uncertainty. It is more accurate because it uses all the information available, without additional approximation or guesswork.
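The sanity check above can be sketched as follows, on simulated stand-in data rather than the real river measurements. For two independent samples the Wilcoxon rank-sum test is the Mann-Whitney U test in SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in inflow and outflow pollution measurements (not the real data)
inflow = rng.normal(5.0, 1.0, size=60)
outflow = rng.normal(3.5, 1.0, size=60)

# Classical checks: two-sample t-test and Wilcoxon rank-sum (Mann-Whitney U)
t_res = stats.ttest_ind(inflow, outflow)
w_res = stats.mannwhitneyu(inflow, outflow, alternative="two-sided")

print(t_res.pvalue < 0.05, w_res.pvalue < 0.05)
```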

Another fun example

Growing boerewors

  • Has anyone else noticed that prepared meat products in shops, like boerewors and burger patties, are getting more salty? Does your favourite sausage have more artificial preservatives than it did when you were younger?
  • And yet, if you take it out to defrost too soon or leave it in the fridge too long then it starts to grow? Grey fuzz and green goo are good for you, right?
  • There is a research drive at the UFS and elsewhere to investigate alternative preservatives that provide better taste, better health properties, and better shelf life
  • Unfortunately, the equipment for testing shelf life properly is extremely expensive so we (UFS) have to make do with old machines that are malfunctioning
  • Malfunctioning machines produce problematic data

Experiment details

  • They take boerewors samples, liquidize them, and place them in prepared trays of various carbon substrates (gogga food)
  • Each tray gets a different preservative
  • The growth is measured ‘objectively’ by how dark the liquid is at each point in time
  • The first pod of each tray is just water as a control
    • It still grows though because the wors itself is food for bacteria
    • So we subtract the water growth from the growth in the substrates

Illustration of growth

Statistics on the growth

  • Primarily non-parametric methods have been used to try to quantify the results
    • Things like average growth, number of pods that grow, diversity of growth
    • Publications do not explain how they quantify uncertainty in these measures, so I ended up using resampling (bootstrap) to get intervals
  • I was asked to also fit Gompertz curves to the growth
    • Gompertz curves have parameters that directly quantify the most interesting properties
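The resampling idea can be sketched as a percentile bootstrap interval for the mean growth, here on simulated stand-in data rather than the actual measurements:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical growth measurements for one substrate (not the real data)
growth = rng.gamma(shape=2.0, scale=0.5, size=40)

# Percentile bootstrap: resample with replacement, recompute the statistic
n_boot = 5000
boot_means = np.array([
    rng.choice(growth, size=len(growth), replace=True).mean()
    for _ in range(n_boot)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(lo < growth.mean() < hi)
```

The same recipe works for any of the summary measures (number of growing pods, diversity indices), which is what makes it useful when publications give no uncertainty quantification at all.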

Gompertz curves

\[\begin{aligned} y_{ij} &\sim Student~t(\nu_j,\ \mu_{ij},\ \sigma_j) \\ \mu_{ij} &= a_j \exp\left(-b_j e^{-c_j t_{ij}}\right) \\ \log\pi(a_j,b_j,c_j,\nu_j,\sigma_j) &= -a_j-b_j-c_j-2\log\sigma_j + \log\nu_j - 3\log(\nu_j + 0.75) + k \\ \end{aligned}\]

  • \(a_j\) denotes the scale of the overall growth, or the maximum growth.
  • \(b_j\) denotes the rough position of the growth.
  • \(c_j\) denotes the explosiveness of the growth.
  • \(\mu_{ij}\) denotes the expected position of the Gompertz curve, created from the Gompertz parameters above.
  • \(\sigma_j\) denotes the scale of the variation around the curve.
  • \(\nu_j\) denotes the tail index of the variation around the curve, controlling its shape. Small values indicate the presence of extreme variations in places.
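A minimal least-squares Gompertz fit in Python, on simulated data, shows the roles of \(a\), \(b\), and \(c\). The talk’s actual model uses a Student-\(t\) likelihood with the priors above, which would require maximising a \(t\) log-likelihood or posterior simulation instead:

```python
import numpy as np
from scipy.optimize import curve_fit

def gompertz(t, a, b, c):
    """Gompertz curve: a * exp(-b * exp(-c * t))."""
    return a * np.exp(-b * np.exp(-c * t))

rng = np.random.default_rng(3)
t = np.linspace(0, 48, 25)                   # hypothetical time grid (hours)
y = gompertz(t, 1.5, 4.0, 0.15) + 0.03 * rng.normal(size=t.size)

# Ordinary least-squares fit; the robust version in the talk maximises
# a Student-t likelihood rather than assuming normal errors
params, _ = curve_fit(gompertz, t, y, p0=[1.0, 3.0, 0.1])
a_hat, b_hat, c_hat = params
print(np.round(params, 2))
```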

Fitting the curves

  • Fitting 180 curves via posterior simulation is a slow process
  • As a shortcut I tried variational Bayes, typically fitted by stochastic gradient ascent
    • This is a major area of statistical research currently with many competing approaches
  • Success was mixed, and some curves had to be refit using the slower simulation process
    • Good results came quickly when the ascent went to the global maximum of the likelihood/posterior
    • Still prone to getting stuck in local maxima, even with the stochastic approach
  • Variational Bayes is a good tool to have in a statistician’s toolbox
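As a toy illustration of the idea (not the actual model from the talk), here is variational Bayes for a single normal mean with an essentially flat prior, optimised by stochastic gradient ascent with the reparameterisation trick:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(2.0, 1.0, size=50)   # simulated data with true mean 2

# Variational family q(theta) = N(m, s^2); maximise the ELBO over (m, log s)
# by stochastic gradient ascent, using the reparameterisation theta = m + s*eps.
# With a flat prior and y_i ~ N(theta, 1): d/dtheta log p = sum_i (y_i - theta).
m, log_s = 0.0, 0.0
lr = 0.01
for step in range(2000):
    eps = rng.normal(size=32)                  # Monte Carlo batch
    theta = m + np.exp(log_s) * eps
    g = np.sum(y[:, None] - theta, axis=0)     # score at each sampled theta
    m += lr * g.mean()
    # chain rule through theta, plus +1 from the entropy term d/dlog s [log s]
    log_s += lr * ((g * eps).mean() * np.exp(log_s) + 1.0)

# The exact posterior here is N(ybar, 1/n), so m should approach y.mean()
print(round(m, 2), round(np.exp(log_s), 3))
```

In this conjugate toy problem the Gaussian family contains the true posterior, so the ascent converges reliably; the local-maxima trouble mentioned above appears once the posterior is multimodal, as with some of the Gompertz fits.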

Plant cover at the beach

Proportions as dependent variable

In (Burger et al. 2023), Divan fitted 4 mixed effects regression models to the proportion of plant cover given various explanatory variables.

The study was on the impact of the grass species Agrostis magellanica on the cushion-forming plant, Azorella selago, using sub-Antarctic Marion Island as a model system.

  • A positive coefficient would indicate a symbiotic relationship, while a negative coefficient might indicate a parasitic relationship
  • Plant cover is necessarily a proportion and analysing it as a proportion makes sense
  • The ‘standard’ approaches of transforming the data or ignoring the boundaries altogether break down when the proportions approach 0 or 1
  • The paper showed that we need to use heavy tailed distributions if we want the model to be robust to outlying values

Statistical results

  • While the ecological results were interesting, the paper was primarily about the statistics
  • Specifically, we compared standard models with heavy tailed models for proportions
  • The results were definitive on two fronts:
    • The heavy tailed models fitted better according to model comparison statistics
    • Goodness-of-fit tests failed to reject the heavy tailed models, while they rejected the standard models

Model comparison for mixed models

  • We want to know the best model for future plants
    • Not just the current observations
  • So we must adjust our model comparison statistics
  • Implemented the marginalised Deviance Information Criterion (mDIC) (Quintero and Lesaffre 2018)
    • Neither the maths nor the code fit on slides, sorry
  • The principle is to replace the observed random effects with new simulated random effects and then average over the random effects in the DIC calculation
    • Requires careful and precise code

Goodness of fit in parametric regression

  • In a parametric regression, every observation follows a different distribution
    • Usually the same type of distribution, conditional on varying explanatory variables
    • e.g. \(y_i\sim N(\mathbf{x}_i\boldsymbol\beta, \sigma^2)\)
  • Standard goodness-of-fit approaches (e.g. Shapiro-Wilk) only work in very simple cases
    • And are never truly appropriate (e.g. the observed residuals are not i.i.d. even if the theoretical ones are)
  • There are two general solutions for goodness-of-fit that work for parametric models of almost any complexity:
    • The (double) parametric bootstrap
    • The DHARMa (Hartig 2022) approach - based on standardised residuals

Standardised residuals

For every \(i\),

  • obtain many samples from the predictive distribution for \(y_i|X,M\)
    • Both frequentist and Bayesian approaches will work
  • find the proportion of samples that are less than \(y_i\)

The resulting proportions should follow a U(0,1) distribution, so we can use a test for uniformity (e.g. AD or KS) to perform goodness-of-fit testing. We can also find outliers this way!
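The procedure above can be sketched for a simple normal linear model. For clarity this sketch simulates from the predictive distribution using the true parameters; in practice one would use fitted estimates or posterior draws:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, sigma = 200, 1.0
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + sigma * rng.normal(size=n)   # data generated from the model

# For each observation, draw many samples from its predictive distribution
sims = 2.0 + 0.5 * x[:, None] + sigma * rng.normal(size=(n, 1000))

# Standardised residual u_i = proportion of simulations below y_i
u = (sims < y[:, None]).mean(axis=1)

# Under a correct model the u_i are approximately U(0, 1)
ks = stats.kstest(u, "uniform")
print(round(u.mean(), 2))   # should be near 0.5
```

Values of \(u_i\) very close to 0 or 1 flag potential outliers, since the observation sits in the far tail of its own predictive distribution.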

Proof (Rice, Ch. 2), assuming \(F_Y\) is continuous and strictly increasing: let \(W=F_Y(Y)\), then

\[P[W\leq w]=P[F_Y(Y)\leq w] = P[Y\leq F_Y^{-1}(w)] = F_Y(F_Y^{-1}(w))=w\]

Conclusion

References

Burger, Divan A., Sean van der Merwe, Emmanuel Lesaffre, Peter C. le Roux, and Morgan J. Raath-Krüger. 2023. “A Robust Mixed-Effects Parametric Quantile Regression Model for Continuous Proportions: Quantifying the Constraints to Vitality in Cushion Plants.” Statistica Neerlandica. https://doi.org/10.1111/stan.12293.
Fan, Junliang, Xiukang Wang, Lifeng Wu, Hanmi Zhou, Fucang Zhang, Xiang Yu, Xianghui Lu, and Youzhen Xiang. 2018. “Comparison of Support Vector Machine and Extreme Gradient Boosting for Predicting Daily Global Solar Radiation Using Temperature and Precipitation in Humid Subtropical Climates: A Case Study in China.” Energy Conversion and Management 164: 102–11. https://doi.org/10.1016/j.enconman.2018.02.087.
Hartig, Florian. 2022. DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. https://CRAN.R-project.org/package=DHARMa.
Quintero, Adrian, and Emmanuel Lesaffre. 2018. “Comparing Hierarchical Models via the Marginalized Deviance Information Criterion.” Statistics in Medicine 37 (16): 2440–54. https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.7649.
Robbins, Naomi B. 2012. Creating More Effective Graphs. John Wiley & Sons.
Su, Jie, Yuzhe Wang, Xiaokai Niu, Shan Sha, and Junyu Yu. 2022. “Prediction of Ground Surface Settlement by Shield Tunneling Using XGBoost and Bayesian Optimization.” Engineering Applications of Artificial Intelligence 114: 105020. https://doi.org/10.1016/j.engappai.2022.105020.
Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. Vol. 26. Springer.

Curiosity counts

  • The most valuable skill in statistics is curiosity. It is not just a personality trait; it can be learned, cultivated, and taught. Always being willing to learn new things will get you ahead more than almost anything else
  • I love learning new statistical ideas and approaches, and I really love learning new tools
    • The more tools in your (organised) toolbox the faster and better you can deal with issues
    • Learning new tools is easier than ever with AI chat being available to help you learn fast
    • But I’m careful to always remember that it is the task that matters, not the tool

The end

Thank you for your time and attention.

This presentation was created using the Reveal.js format in Quarto, using the RStudio IDE. Font and line colours according to UFS branding, and background image by logturnal on Freepik.