Robust inference in the presence of censoring, skewness, and extreme values

Sean van der Merwe

Introduction

Follow along

The short link to open this interactive presentation on your own device and follow along is: https://seanvdm.co.za/present

Direct link to presentation

Slides and code can be downloaded by clicking here

TL;DR

  • Location inference is fantastic
  • Real data has issues
  • Normal theory keeps breaking down
  • Possible solution:

Replace assumption of normality with skew-t

Outliers

  • People deal with outliers in a ton of ways
    • I have personally applied at least 7 approaches
      • Yes, I have regrets
  • I want to show what I have found to be the second best general approach
    • The best is always to figure out what went wrong and fix it
    • Arguing about the worst approach would take the rest of the year

The times are changing

  • Skew-t errors are becoming more popular as a choice of residual distribution
  • Normal residuals are fantastic in so many ways, but they often do not fit reality
  • t residuals are more robust to the occasional problematic observation
  • Skew-t residuals go further in that they help with extreme values on one side and bounded domains on the other

Justification

Burger, D.A., Schall, R. & van der Merwe, S. A robust method for the assessment of average bioequivalence in the presence of outliers and skewness. Pharm Res (2021). https://doi.org/10.1007/s11095-021-03110-z

Divan analysed 130 clinical trials and performed a large simulation study, concluding:

“The robust methodology is needed in a small but non-negligible proportion of bioequivalence trials: the robust method safeguards against occasional problematic outlier cases.”

  • Simply put, when the hypothesis is about the typical cases, freak sample events that are unlikely to repeat in future are not of interest

Robustness

  • The importance and usefulness of robust statistics is not in dispute
  • The goal is to come to the correct conclusion regarding a general future case
  • However, automatically discarding or ignoring outlying values purely on a statistical basis (e.g. \(3\sigma\)) is not allowed
    • It is (rightly) seen as data manipulation to get the desired results (a form of p-hacking)
    • Only outliers where the cause is known to be completely unrelated to the treatment may be removed
  • All variation must be modelled: we need to adapt the standard models to accommodate weird observations in a sensible way

Residuals

Assuming we have a good regression model, except for those pesky outliers,

  • We can often make it more robust by replacing the assumption of normal residuals with skew-t residuals
  • This simple change can make the intercepts and slopes more accurate and more stable
  • In random effects models we can also replace the assumption that the random intercepts follow a normal distribution with the assumption that they follow a t distribution,
    • to make the model more robust to outlier subjects (e.g. a subject that deliberately tries to sabotage the project and doesn’t get caught).

The goal

We want the best of both worlds:

  • Robust inference about the location
  • Accurate modelling of the tails
    • Full spread of predictions
    • Intervals with correct real-world coverage

Skew-t densities

My favourite definition

There are many variations of skew-t distributions. My favourite is the simplest one:

Fernandez, C. & Steel, M.F.J. (1998). On Bayesian Modeling of Fat Tails and Skewness. Journal of the American Statistical Association, 93(441), 359–371. https://doi.org/10.2307/2669632

  1. Let \(y\sim \text{skew-t}(\nu, \mu, \sigma, \xi)\)
  2. Define \(z=\frac{y-\mu}{\sigma}\)
  3. Define \(w=\left\{\begin{array}{ll} z/\xi & \text{if } z \geq 0 \\ z\,\xi & \text{if } z < 0 \end{array} \right.\)
  4. Then the density of \(z\) is \(f_Z(z)=\frac{2}{\xi+\frac{1}{\xi}}\, f_t(w)\), so that the density of \(y\) is \(f_Y(y)=\frac{1}{\sigma}\cdot\frac{2}{\xi+\frac{1}{\xi}}\, f_t(w)\)
  5. Where \(f_t(w)\) is the standard t density at the point \(w\), given below.

\[f_t(w) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\ \Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{w^2}{\nu} \right)^{-\frac{\nu+1}{2}}\]
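For those following along in R, the five steps translate directly into a density function. A minimal sketch (the name dskt and its interface are my own, chosen to mirror the simulation code later in this talk):

# Skew-t density for the Fernandez & Steel (1998) variant defined above
dskt <- function(x, nu, xi = 1, mu = 0, s = 1, log = FALSE) {
  z <- (x - mu) / s                       # standardise (step 2)
  w <- ifelse(z >= 0, z / xi, z * xi)     # fold by the skewness parameter (step 3)
  logdens <- dt(w, nu, log = TRUE) +      # standard t density (step 5)
    log(2) - log(xi + 1/xi) - log(s)      # normalising constant and scale (step 4)
  if (log) logdens else exp(logdens)
}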

Basic Skew-t model in Stan

functions { // Sean van der Merwe, UFS
    real skewt_lpdf(real y, real nu, real mu, real s, real xi) {
        real z;
        if (y >= mu) {
          z = (y - mu) / xi;
        } else {
          z = (y - mu) * xi;
        }
        return student_t_lpdf(z | nu, 0, s) + log(2) - log(xi + 1/xi);
    }
}
data {
  int<lower=0> n;             // number of observations
  vector[n] x;                // observations
}
parameters {
  real<lower=0.2> v;          // degrees of freedom
  real<lower=0> s;            // scale parameter
  real mu;                    // location parameter
  real<lower=0> xi;           // skewness parameter
}
model {
  // recommended objective prior (see the conclusion)
  target += -2*log(s) + log(v) - 3*log(v + 0.75) + log(xi) - 2*log(xi^2+1);
  // the user-defined lpdf takes a real argument, so loop over observations
  for (i in 1:n) {
    x[i] ~ skewt(v, mu, s, xi);
  }
}
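A minimal sketch of running this model from R, assuming the Stan code above is saved as skewt.stan (the file name, data, and sampler settings here are illustrative):

library(rstan)
x <- rnorm(50, mean = 10, sd = 2)   # placeholder data; substitute your own observations
fit <- stan("skewt.stan",
            data = list(n = length(x), x = x),
            chains = 4, iter = 2000)
print(fit, pars = c("v", "mu", "s", "xi"))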

Location scale

The densities under consideration are all location-scale, which has many advantages.

  • Biggest advantage: Moving from a sample to a regression is straightforward

\[y_i\sim i.i.d.\ \text{skew-t}(\nu, \mu, \sigma, \xi)\]

becomes

\[y_i\sim \text{skew-t}(\nu, \mu_i, \sigma, \xi)\]

where \(\mu_i=f(\mathbf{x}_i)\).
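For instance, with a linear predictor \(\mu_i = \mathbf{x}_i^\top \beta\) the log-likelihood is just a sum of skew-t log-densities with per-observation locations. A sketch using the dskt function from earlier (the names are illustrative):

# Skew-t regression log-likelihood with a linear predictor (illustrative)
skt_reg_loglik <- function(beta, nu, s, xi, y, X) {
  mu <- as.vector(X %*% beta)              # mu_i = f(x_i), here linear
  sum(dskt(y, nu, xi, mu, s, log = TRUE))  # independent skew-t residuals
}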

Examples

What about real data?

  • Consider a company in South Africa that creates pollution as a necessary by-product of its operations (will not be named)
  • They measure the pollution in the water upstream from the plant and downstream from the plant
  • For some pollutants, the measurements are heavy-tailed to the right, while at the same time loosely bounded on the left
    • The boundaries appear random as they are changing over time due to unobserved factors outside of our control
    • The boundaries can include zero, making transformations difficult (and often counter-intuitive anyway)
  • The statistical comparison is currently done using a common non-parametric permutation test

Example comparison: Sodium

Probability of plant making it worse \(\approx\) 0, and predicted difference \(\approx\) -18.9.

Example comparison: Sulphate

Probability of plant making it worse \(\approx\) 0.129, and predicted difference \(\approx\) 15.5.

Example comparison: Total Dissolved Solids

Probability of plant making it worse \(\approx\) 0, and predicted difference \(\approx\) 135.3.

Example: Stinkbugs

They need to compare the effectiveness of control measures on wild stinkbugs. Wild stinkbugs (being wild and all) are very difficult to find, control, and study. The only data I received was how many bugs died during each of a few days.

Stinkbug Illustration

By using the interval during which each bug died as an interval censored observation, we can fit lifetime distributions and compare treatments.
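The mechanics are the standard censored-data likelihood: if bug \(i\) is observed to die between times \(a_i\) and \(b_i\), its contribution is the probability mass the fitted lifetime distribution places on that interval, so the log-likelihood is

\[\ell(\theta) = \sum_i \log\left[ F(b_i \mid \theta) - F(a_i \mid \theta) \right]\]

This is why we need the distribution function, which is derived in the censoring section later.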

Example: give me one with everything

I was asked to analyse a time series with fun features: missing values, censoring (2+ types), outliers, and of course dependencies (which had the same issues).

Yes, this is possible:

Example: curve fitting

I needed to fit curves with specific formulas:

  • Previously they would obtain a fit and assume it is perfect, but accurately estimating the standard errors of downstream statistics requires propagating the uncertainty from robust fits

Priors

Objective priors

When meaningful prior information is available then it must be incorporated. Doing so can help avoid silly conclusions in many real-world areas.

When meaningful prior information is not available, as is usually the case, then we can use objective or vague priors.

  • If you don’t want to formally admit a prior then use the EM algorithm (plain ML is unreliable for the t distribution)
    • That is essentially using uniform priors; we can do better

But what are good objective or vague priors for this skew-t?

Symmetric t

We have a good idea of what good priors for the symmetric t look like, thanks to UFS researchers Abrie van der Merwe, Michael von Maltitz, Johan Meyer, and Piet Groenewald.

Starting priors

We focus on the 3 most relevant of the 6 they investigated:

  • \(p_{2}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2} \left[\Psi'\left(\frac{\nu}{2}\right)-\Psi'\left(\frac{\nu+1}{2}\right)-\frac{2\left(\nu+3\right)}{\nu\left(\nu+1\right)^{2}} \right]^{\frac{1}{2}}\)
  • \(p_{5}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}e^{-\lambda\nu}\), where \(\lambda=0.1\)
  • \(p_{6}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}\frac{2\nu d}{\left(\nu+d\right)^{3}}\), where \(d=1.2\)

Number 2 is their proposed choice (a probability-matching prior that is also a reference prior), while 5 (exponential) and 6 (Juárez and Steel, 2010) are the most popular otherwise (although 6 is often applied with different \(d\) values, like 0.75, and 5 is often applied with different \(\lambda\) values, like 1 or 0.01).

Results for these priors

Relative root median square error of the posterior median

Coverage

Now we consider the coverage of the symmetric 95% interval (2.5% to 97.5%), as well as the median width of this interval.

Refined Priors

  • \(p_{PMP}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2} \left[\Psi'\left(\frac{\nu}{2}\right)-\Psi'\left(\frac{\nu+1}{2}\right)-\frac{2\left(\nu+3\right)}{\nu\left(\nu+1\right)^{2}} \right]^{\frac{1}{2}}\)
  • \(p_{d1.2}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}\frac{2\nu d}{\left(\nu+d\right)^{3}}\), where \(d=1.2\).
  • \(p_{d0.75}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}\frac{2\nu d}{\left(\nu+d\right)^{3}}\), where \(d=0.75\).
  • \(p_{exp1}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}e^{-\lambda\nu}\), where \(\lambda=1\)
  • \(p_{exp0.1}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}e^{-\lambda\nu}\), where \(\lambda=0.1\)
  • \(p_{exp0.01}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}e^{-\lambda\nu}\), where \(\lambda=0.01\)

Refined simulation study

Conclusion on symmetric t priors

  • The probability matching prior (of van der Merwe, et al., 2022, no relation😉), which is also a reference prior, is a good choice.
  • The hierarchical prior of Juárez and Steel (2010) with \(d=0.75\) looks even better.

But does the same apply for a skew t?

Spoiler: yes it does 😀

Skewness

Skewness parameter

  • The skewness parameter of this distribution is not directly connected to the empirical skewness
    • Just like how the scale parameter is not the standard deviation, nor just a function of it
  • The direction is the same though
  • \(\xi=1\Rightarrow\) symmetric t
  • \(0<\xi<1\Rightarrow\) left skew (mean < median)
  • \(\xi>1\Rightarrow\) right skew (mean > median)

What is a good prior for the skewness parameter?

Empirical skewness

Before we continue, let us try to relate the skewness parameter to the empirical skewness.

We can simulate a sample from this skew-t density and then calculate the skewness of the sample.

# Skew-t simulation function by Sean van der Merwe, based on code of the 'skewt'
# package, for the variant defined by Fernandez and Steel (1998)
rskt <- function (n, nu, xi = 1, mu = 0, s = 1) {
  p <- runif(n)
  result <- rep(NA_real_, n)
  probzero <- 1/(xi^2 + 1)        # P(Z <= 0), the mass below the location
  pos <- p >= probzero
  result[!pos] <- 1/xi * qt(((xi^2 + 1) * p[!pos])/2, nu)
  result[pos] <- xi * qt((1 + 1/xi^2)/2 * (p[pos] - probzero) + 1/2, nu)
  result * s + mu
}

# Sample skewness function by Sean van der Merwe
skewness <- function(x) {
  stdx <- x - mean(x)
  n <- length(x)
  (mean(stdx^3) / ((mean(stdx^2))^1.5)) * sqrt(n * (n - 1))/(n - 2)
}
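As a quick illustration of the relationship (the settings here are arbitrary), we can simulate a large sample for each value in a grid of \(\xi\) values and compute the sample skewness of each:

# Illustrative check: sample skewness across a grid of skewness parameter values
set.seed(42)
xis <- c(0.5, 0.8, 1, 1.25, 2)
sapply(xis, function(xi) skewness(rskt(1e5, nu = 10, xi = xi)))
# negative for xi < 1, near zero at xi = 1, positive for xi > 1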

Skewness Relationship

A better transformation

  • From: Arnold, B.C. & Groeneveld, R.A. (1995). Measuring skewness with respect to the mode. The American Statistician, 49(1), 34–38.
  • Let \(\gamma=\frac{\xi^2-1}{\xi^2+1}\); then we have a new skewness measure with \(-1<\gamma < 1\)
  • Now \(\gamma=0\) is symmetric, \(\gamma<0\) is left-skew and \(\gamma>0\) is right-skew, as we are used to
  • This is still not directly connected to the empirical skewness, but surprisingly close:

This scale makes more sense

Since this transformed scale seems linearly related to the sample skewness of samples from our density of choice, it makes more sense to work on this scale.

First, we derive a new prior on this scale:

  • The transformed uniform prior
    • Explicitly: Let \(\gamma\sim uniform(-1, 1)\), then
    • \(\pi(\xi) = 2\xi(\xi^2+1)^{-2}\), and
    • \(\log\pi(\xi)\propto \log(\xi) - 2\log(\xi^2+1)\)
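The prior follows from the usual change-of-variables step: with \(\pi(\gamma)=\frac{1}{2}\) on \((-1,1)\),

\[\pi(\xi) = \pi(\gamma)\left|\frac{d\gamma}{d\xi}\right| = \frac{1}{2}\cdot\frac{4\xi}{(\xi^2+1)^2} = \frac{2\xi}{(\xi^2+1)^2}\]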

Second, we will also calculate errors on this scale, for ease of interpretation.

Suggested priors

Consider some options:

Skew t simulation study

First we implement the simulation study on 1000 samples of size 30 from a standard t density (no skewness), and do this for each of the \(\nu\) options from 1 to 24.
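The data-generation step can be sketched as follows (the seed and storage format are illustrative; fitting each sample then uses the Stan model from earlier):

# Generate 1000 samples of size 30 for each degrees-of-freedom option
set.seed(1234)
nus <- 1:24
sims <- lapply(nus, function(nu) replicate(1000, rskt(30, nu)))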

Severe right skewness

Now we make \(\xi=4\):

Censoring

What does censoring have to do with anything?

  • Real data has real issues
    • A common one is measurements being censored on one or both sides
    • Sensors have lower detection limits
    • People drop out of studies or stop getting tested
  • Truncation is another issue

Addressing any of these requires being able to evaluate cumulative probabilities of the distribution, not just the density itself.

Derivation of Distribution Function and Survival Function

Fernandez and Steel (1998) define the density as follows:

\[f(z|\xi)=\frac{2\xi}{\xi^2+1}\left[ f_t(z\xi)\,I_{(-\infty,0]}(z) + f_t(z/\xi)\,I_{[0,\infty)}(z)\right]\]

Thus, the distribution function must be obtained by integrating this density.

Simple cases

When \(z\leq 0\) the integration is simple and the distribution function is just a scaled version of the Student-t CDF:

\[F(z|\xi) = \frac{ 2 {\xi} }{\xi^2+1} \int_{-\infty}^z f_t(u\xi) du = \frac{2}{\xi^2+1} F_t(z\xi) \]

And when \(z\geq 0\) the survival function is just a scaled version of the Student-t survival function:

\[\bar{F}(z|\xi) = \frac{2\xi}{\xi^2 + 1}\ \int_z^{\infty} f_t(u/\xi) du = \frac{2\xi^2}{\xi^2+1} \bar{F}_t(z/\xi) \]

Other cases

For the CDF above the location and the survival function below the location we did the far more complicated integrations, just to be sure that we do in fact arrive at the correct, simple results.

Distribution function above the location:

\[F(z|\xi) = 1 - \frac{2\xi^2\bar{F}_t(z/\xi)}{\xi^2+1}\]

Survival function below the location:

\[\bar{F}(z|\xi) = 1 - \frac{2 F_t(z\xi)}{\xi^2+1}\]
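Putting the four cases together gives a complete distribution function, which is all that censored likelihood contributions require. A sketch based directly on the results above (the name pskt is mine, mirroring rskt):

# Skew-t CDF for the Fernandez & Steel variant, using the scaled Student-t results above
pskt <- function(q, nu, xi = 1, mu = 0, s = 1) {
  z <- (q - mu) / s
  ifelse(z <= 0,
         2 / (xi^2 + 1) * pt(z * xi, nu),
         1 - 2 * xi^2 / (xi^2 + 1) * pt(z / xi, nu, lower.tail = FALSE))
}

# Censored log-likelihood contributions then follow directly, e.g.
#   left-censored at L:       log(pskt(L, nu, xi, mu, s))
#   right-censored at R:      log(1 - pskt(R, nu, xi, mu, s))
#   interval-censored (a,b):  log(pskt(b, nu, xi, mu, s) - pskt(a, nu, xi, mu, s))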

Conclusion

The end

Thank you for your time and attention.

I really hope I could inspire some people, or at least broaden minds as to what is possible.

This is the prior I’m recommending:

\[\begin{aligned}\log\pi(\mu,\sigma,\nu,\xi) = & -2\log(\sigma) + \log(\nu) - 3\log(\nu + 0.75) \\ & + \log(\xi) - 2\log(\xi^2+1) + constants\end{aligned}\]