The key point of today’s talk is:
The secondary point of today’s talk is:
You might be thinking, “Sean, why are you throwing shade at coding, isn’t coding like half your job?”
BUT
Disclaimer: I have personally taught all these topics recently (I’m part of the problem)
Statistical software has packages/components/tools for everything
But there are no ‘packages’ for 3 of these occurring at once, never mind all 5
Researchers usually focus heavily on a single aspect or problem at a time, and narrowly within their own field. They pick the data set to suit their research argument. It is frustratingly rare to see theoretical papers that acknowledge that issues don’t come alone.
Objective: obtain an estimate and interval for the reduction in pollution
I approached the problem by building a model that has two key properties:
The second property is the difficult one, both practically and theoretically. It is difficult to know which properties of the data are going to materially affect the estimation of the key parameter(s) if they are not accommodated.
I formulate the model mathematically as follows:
The model for the inflows is:
\[\begin{aligned} x_i &\sim N(\mu_{t_i},\ \sigma_1^2) \\ \mu_j &\sim N(\phi\mu_{j-1},\ \sigma_3^2) \end{aligned}\]
The model for the outflows is:
\[\begin{aligned} y_i &\sim N(\nu_{t_i},\ \sigma_2^2) \\ \nu_j &= \mu_{j} + \alpha \end{aligned}\]
The priors assumed on the parameters are:
\[\begin{aligned} \phi &\sim U(-1, 1) \\ \alpha &\sim N(0, 10) \end{aligned}\]
and \(\pi(\boldsymbol\sigma) = \prod \sigma_k^{-2}\).
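To make the structure of the model concrete, here is a minimal simulation sketch of the inflow and outflow equations above. It is not the code used for the talk. The parameter values, the number of time points, the starting value of the latent series (\(\mu_{-1}=0\)) and the choice to observe inflows and outflows at the same times are all assumptions made purely for illustration.

```python
import random


def simulate_flows(n_times=50, n_per_time=3, phi=0.8, alpha=-2.0,
                   sigma1=1.0, sigma2=1.0, sigma3=0.5, seed=1):
    """Simulate inflows x and outflows y from the model (illustrative values)."""
    rng = random.Random(seed)

    # Latent inflow level: mu_j ~ N(phi * mu_{j-1}, sigma3^2).
    # Assumption: the series starts from mu_{-1} = 0.
    mu = [rng.gauss(0.0, sigma3)]
    for j in range(1, n_times):
        mu.append(rng.gauss(phi * mu[j - 1], sigma3))

    # Latent outflow level: nu_j = mu_j + alpha (a constant shift).
    nu = [m + alpha for m in mu]

    # Observations: x_i ~ N(mu_{t_i}, sigma1^2), y_i ~ N(nu_{t_i}, sigma2^2).
    t, x, y = [], [], []
    for j in range(n_times):
        for _ in range(n_per_time):
            t.append(j)
            x.append(rng.gauss(mu[j], sigma1))
            y.append(rng.gauss(nu[j], sigma2))
    return t, x, y, mu, nu


t, x, y, mu, nu = simulate_flows()
# The raw difference in means is a crude estimate of alpha.
crude_alpha = sum(y) / len(y) - sum(x) / len(x)
```

Simulating from the model like this is a cheap way to check that a fitting routine can recover a known \(\alpha\) before trusting it on real data.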
With the resulting cleanup percentage distribution:
Let’s be realistic though: as fun and impressive as the fancy model is,
The fancy model is a more accurate approach, and it allows for clear quantification of desired quantities with uncertainty. It is more accurate because it uses all the information available, without additional approximation or guesswork.
\[\begin{aligned} y_{ij} &\sim \text{Student-}t(\nu_j,\ \mu_{ij},\ \sigma_j) \\ \mu_{ij} &= a_j \exp\left(-b_j e^{-c_j t_{ij}}\right) \\ \log\pi(a_j,b_j,c_j,\nu_j,\sigma_j) &= -a_j-b_j-c_j-2\log\sigma_j + \log\nu_j - 3\log(\nu_j + 0.75) + k \end{aligned}\]
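As a rough sketch of how the two deterministic pieces of this model could be coded, the functions below evaluate the mean curve \(\mu_{ij}\) and the log prior up to the constant \(k\). The function names are hypothetical, and the restriction of all parameters to positive values is an assumption (the slide does not state the support explicitly).

```python
import math


def gompertz_mean(t, a, b, c):
    """Mean curve mu = a * exp(-b * exp(-c * t)); tends to a as t grows (c > 0)."""
    return a * math.exp(-b * math.exp(-c * t))


def log_prior(a, b, c, nu, sigma):
    """Log prior density up to the additive constant k.

    Assumption: every parameter must be positive; otherwise return -inf.
    """
    if min(a, b, c, nu, sigma) <= 0:
        return -math.inf
    return (-a - b - c
            - 2.0 * math.log(sigma)
            + math.log(nu)
            - 3.0 * math.log(nu + 0.75))


# Example: the curve starts at a * exp(-b) when t = 0.
start_value = gompertz_mean(0.0, a=2.0, b=1.0, c=0.5)
```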
In Burger et al. (2023), Divan fitted four mixed-effects regression models to the proportion of plant cover given various explanatory variables.
The study was on the impact of the grass species Agrostis magellanica on the cushion-forming plant, Azorella selago, using sub-Antarctic Marion Island as a model system.
For every \(i\),
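The resulting proportions should follow a U(0,1) distribution, so we can use a test for uniformity (e.g. Anderson-Darling or Kolmogorov-Smirnov) to perform goodness-of-fit testing. We can also find outliers this way: proportions very close to 0 or 1 flag observations that the model finds implausible.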
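Proof (following Rice, Ch. 2): let \(W=F_Y(Y)\), where \(F_Y\) is continuous and strictly increasing so that \(F_Y^{-1}\) exists. Then, for \(0<w<1\),
\[P[W\leq w]=P[F_Y(Y)\leq w] = P[Y\leq F_Y^{-1}(w)] = F_Y(F_Y^{-1}(w))=w\]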
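The result above can be checked numerically. The sketch below is only an illustration, not the analysis from the talk: it draws from an exponential distribution (an arbitrary choice), transforms the draws with the correct CDF and with a deliberately wrong one, and computes the Kolmogorov-Smirnov distance from U(0,1) in each case. The rate, sample size and seed are assumptions.

```python
import math
import random


def ks_statistic(u):
    """Kolmogorov-Smirnov distance between the sample u and U(0,1)."""
    u = sorted(u)
    n = len(u)
    d_plus = max((i + 1) / n - ui for i, ui in enumerate(u))
    d_minus = max(ui - i / n for i, ui in enumerate(u))
    return max(d_plus, d_minus)


rng = random.Random(42)
rate = 2.0
y = [rng.expovariate(rate) for _ in range(2000)]

# W = F_Y(Y) using the true CDF: should look uniform.
w_right = [1.0 - math.exp(-rate * yi) for yi in y]
# Same transform with a misspecified rate: should not look uniform.
w_wrong = [1.0 - math.exp(-0.5 * rate * yi) for yi in y]

d_right = ks_statistic(w_right)
d_wrong = ks_statistic(w_wrong)

# Approximate large-sample 5% critical value for the KS test.
critical_value = 1.36 / math.sqrt(len(y))
```

Under the correct model `d_right` is expected to be small relative to `critical_value`, while the misspecified transform gives a much larger distance, which is what makes the transform useful for goodness-of-fit testing.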
Thank you for your time and attention.
This presentation was created using the Reveal.js format in Quarto, using the RStudio IDE. Font and line colours follow UFS branding; background image by logturnal on Freepik.
2023/05/26 - Exotic modelling