November 2021

Enjoying this presentation

seanvdm.co.za/present

Please open the presentation on your own computer (type link above) to enjoy the following benefits:

  • Exploring the data yourself

  • Interacting with the graphs yourself

  • Moving at your own pace (press o for overview)

  • Maximum quality

    • Try f or F11 for full screen
    • Press w at any time to change to wide screen mode and back
    • Ctrl + Mouse Wheel to zoom

The presentation and source code will be available for download on my website at https://seanvdm.co.za/files/workshops/skewt2021

Introduction

The times are changing

  • Skew-t errors are becoming more popular as a choice of residual distribution.
  • Normal residuals are fantastic in so many ways;
  • but they often do not fit reality.
  • t residuals are more robust to the occasional problematic observation.
  • Skew-t residuals go further in that they help with extreme values on one side and bounded domains on the other.
  • With the best priors you can have confidence that you are making the right choice.

Justification

Burger, D.A., Schall, R. & van der Merwe, S. A robust method for the assessment of average bioequivalence in the presence of outliers and skewness. Pharm Res (2021). https://doi.org/10.1007/s11095-021-03110-z

Divan analysed 130 clinical trials and performed a large simulation study, concluding:

  • “The robust methodology is needed in a small but non-negligible proportion of bioequivalence trials: the robust method safeguards against occasional problematic outlier cases.”
  • Simply put, when the hypothesis is about the typical cases then freak sample events that are unlikely to repeat in future are not of interest.

Robustness

  • The importance and usefulness of robust statistics is not in dispute.
  • The goal is to come to the correct conclusion regarding a general future case.
  • Automatically discarding or ignoring outlying values purely on a statistical basis is not allowed.
    • Only outliers where the cause is known to be completely unrelated to the treatment may be removed.
  • Simply focussing on the median is not acceptable either; all the variation must be modelled.
  • So we need to adapt the standard regression models to accommodate weird observations in a sensible way.

Residuals

Assuming we have a good regression model, except for those pesky outliers, we can often make it more robust by replacing the assumption of normal residuals with skew-t residuals. This simple change can make the intercepts and slopes more accurate and more stable.

In random effects models we can also replace the assumption that the random intercepts follow a normal distribution with the assumption that they follow a t distribution, to make the model more robust to outlier subjects (e.g. a subject that deliberately tries to sabotage the project and doesn’t get caught).

Illustration

Skew-t densities

My favourite definition

There are many variations of skew-t distributions. My favourite is the simplest one:

Fernandez, C., & Mark F. J. Steel. (1998). On Bayesian Modeling of Fat Tails and Skewness. Journal of the American Statistical Association, 93(441), 359–371. https://doi.org/10.2307/2669632

  1. Let \(y\sim \text{skew-t}(\nu, \mu, \sigma, \xi)\)
  2. Define \(z=\frac{y-\mu}{\sigma}\)
  3. Define \(w=\left\{\begin{array}{l} \frac{z}{\xi}\ \text{ if } z \geq 0 \\ z\ \xi\ \text{ if } z < 0 \end{array} \right.\)
  4. Then \(f_Z(z)=\frac{2}{\xi+\frac{1}{\xi}} f_t(w)\)
  5. Where \(f_t(w)\) is the standard t density in the point \(w\), given below.

\[f_t(w) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\ \Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{w^2}{\nu} \right)^{-\frac{\nu+1}{2}}\]
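
To make the definition concrete, here is a minimal R sketch of this density (a helper of my own for illustration, not from any package; dskt and its argument names are hypothetical):

# Skew-t density following Fernandez & Steel (1998):
# standardise, fold by the skewness parameter, then apply the t density
dskt <- function(y, nu, mu = 0, s = 1, xi = 1, log = FALSE) {
    z <- (y - mu) / s                    # step 2: standardise
    w <- ifelse(z >= 0, z / xi, z * xi)  # step 3: fold by xi
    # steps 4-5: weighted t density; the extra -log(s) is the Jacobian
    # of the standardisation, giving the density of y rather than z
    logf <- log(2) - log(xi + 1/xi) + dt(w, df = nu, log = TRUE) - log(s)
    if (log) logf else exp(logf)
}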

Basic Skew-t model in Stan

functions { // Sean van der Merwe, UFS
    real skewt_lpdf(real y, real nu, real mu, real s, real xi) {
        real z;
        real lpdf;
        if (y >= mu) {
          z = (y - mu) / xi;
        } else {
          z = (y - mu) * xi;
        }
        lpdf = student_t_lpdf(z | nu, 0, s) + log(2) - log(xi + 1/xi);
        return lpdf;
    }
}
data {
  int<lower=0> n;                   // number of observations
  vector[n] x;                        // observed values
}
parameters {
  real<lower=0.2> v;            // degrees of freedom
  real<lower=0> s;            // scale parameter
  real mu;                    // location parameter
  real<lower=0> xi;           // skewness parameter
}
model {
  for (i in 1:n) { target += skewt_lpdf(x[i] | v, mu, s, xi);  }
  target += -2*log(s) + log(v) - 3*log(v + 0.75) + log(xi) - 2*log(xi^2+1);
}
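
Fitting the model is then a one-liner from R; a minimal sketch assuming the code above is saved as skewt.stan (hypothetical file name) and that rstan is installed:

library(rstan)
# x is the observed numeric vector; list names must match the data block above
fit <- stan(file = "skewt.stan", data = list(n = length(x), x = x),
            chains = 4, iter = 4000)
print(fit, pars = c("v", "s", "mu", "xi"))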

Location scale

The densities under consideration are all location-scale, which has many advantages.

  • Biggest advantage: Moving from a sample to a regression is straightforward

\[y_i\sim i.i.d.\ \text{skew-t}(\nu, \mu, \sigma, \xi)\]

becomes

\[y_i\sim \text{skew-t}(\nu, \mu_i, \sigma, \xi)\]

where \(\mu_i=f(\mathbf{x}_i)\) (see the sketch below).

  • It also means that we are not going to concern ourselves with the prior on \(\mu\), just assume uniform.
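
To make the "straightforward" claim concrete, a hedged R sketch of a skew-t regression log-likelihood, reusing the dskt() helper sketched earlier (names hypothetical; a linear \(f\) is assumed):

# Skew-t regression: only mu becomes observation-specific
loglik_reg <- function(par, y, x) {
    mu_i <- par["a"] + par["b"] * x   # mu_i = f(x_i), here a straight line
    sum(dskt(y, nu = par["nu"], mu = mu_i, s = par["s"], xi = par["xi"], log = TRUE))
}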

Priors

Objective priors

When meaningful prior information is available then it must be incorporated. Doing so can help avoid silly conclusions in many real-world areas.

When meaningful prior information is not available, as is usually the case, then we can use objective or vague priors.

  • Can reduce the amount of bias we introduce into the model
  • Can make the model more stable
  • Can produce more accurate parameter estimates

But what are good objective or vague priors for this skew-t?

Symmetric t

We have a good idea of what good priors for the symmetric t look like, thanks to UFS researchers Abrie van der Merwe, Michael von Maltitz, Johan Meyer, and Piet Groenewald.

  • They published a paper (not in print yet) comparing six priors in detail.
    • They also derived and discussed some nice theoretical properties in some cases.
  • I reproduced their simulation study earlier this year and will briefly show the results.
  • But first, let us look at the priors they considered.

Six starting priors

  • \(p_{1}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2} \left[\Psi'\left(\frac{\nu}{2}\right)-\Psi'\left(\frac{\nu+1}{2}\right)-\frac{2\left(\nu+5\right)}{\nu\left(\nu+1\right)\left(\nu+3\right)}\right]^{\frac{1}{2}}\)
  • \(p_{2}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2} \left[\Psi'\left(\frac{\nu}{2}\right)-\Psi'\left(\frac{\nu+1}{2}\right)-\frac{2\left(\nu+3\right)}{\nu\left(\nu+1\right)^{2}} \right]^{\frac{1}{2}}\)
  • \(p_{3}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}\left(\frac{\nu}{\nu+3}\right)^{\frac{1}{2}} \left[\Psi'\left(\frac{\nu}{2}\right)-\Psi'\left(\frac{\nu+1}{2}\right)-\frac{2\left(\nu+3\right)}{\nu\left(\nu+1\right)^{2}} \right]^{\frac{1}{2}}\)
  • \(p_{4}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-3}\left(\frac{\nu+1}{\nu+3}\right)^{\frac{1}{2}}\left(\frac{\nu}{\nu+3}\right)^{\frac{1}{2}} \left[\Psi'\left(\frac{\nu}{2}\right)-\Psi'\left(\frac{\nu+1}{2}\right)-\frac{2\left(\nu+3\right)}{\nu\left(\nu+1\right)^{2}} \right]^{\frac{1}{2}}\)
  • \(p_{5}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}e^{-\lambda\nu}\), where \(\lambda=0.1\) (written \(\lambda\) here, rather than the original \(\xi\), to avoid a clash with the skewness parameter later)
  • \(p_{6}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}\frac{2\nu d}{\left(\nu+d\right)^{3}}\), where \(d=1.2\).

Number 2 is the proposed choice, while 5 and 6 are the most popular (although 6 is often applied with different \(d\) values, like 0.75, and 5 is often applied with different \(\lambda\) values, like 1 or 0.01).
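
For concreteness, the \(\nu\) kernels of priors 2, 5 and 6 are easy to write down in R (a sketch, up to additive constants; trigamma() is R's \(\Psi'\)):

# Log-kernels of the nu parts of priors 2, 5 and 6 (constants dropped)
logp2 <- function(nu) {
    0.5 * log(trigamma(nu/2) - trigamma((nu + 1)/2) - 2*(nu + 3)/(nu*(nu + 1)^2))
}
logp5 <- function(nu, lambda = 0.1) -lambda * nu
logp6 <- function(nu, d = 1.2) log(2 * nu * d) - 3 * log(nu + d)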

Simulation procedure

Then we do the simulation as follows (a code sketch follows the list):

  1. Define the samples for each \(\nu\),
    • say 1000 samples of size 30 with \(\mu=0\) and \(\sigma=1\).
  2. Create a cluster of R sessions running in the background and load Stan on each.
  3. Send our functions and models to each R in the cluster.
  4. Define the \(\nu\) options. I used 1 to 24.
  5. Do the simulation procedure for each \(\nu\) option, evenly split over the cluster.
    • For each of the \(24*1000=24000\) samples we do a few thousand joint posterior simulations for all 3 parameters.
  6. Arrange the results neatly in an array.
    • We ‘integrate out’ \(\mu\) and \(\sigma\) by ignoring them, keeping only \(\nu\).
  7. Stop the cluster and save the combined results.
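
A hedged sketch of steps 2 to 7 with the parallel package (object names such as skewt_model are hypothetical; assumes the Stan model was compiled beforehand with stan_model()):

library(parallel)
nu_options <- 1:24                           # step 4
cl <- makeCluster(detectCores() - 1)         # step 2: R sessions in the background
invisible(clusterEvalQ(cl, library(rstan)))  # load Stan on each worker
clusterExport(cl, "skewt_model")             # step 3: send the compiled model over
results <- parLapply(cl, nu_options, function(nu) {  # step 5: split over the cluster
    sapply(1:1000, function(i) {
        x <- rt(30, df = nu)                 # sample of size 30, mu = 0, sigma = 1
        fit <- sampling(skewt_model, data = list(n = 30, x = x), refresh = 0)
        median(rstan::extract(fit, "v")$v)   # keep only nu: its posterior median
    })
})
stopCluster(cl)                              # step 7
saveRDS(simplify2array(results), "simstudy.rds")  # step 6: results as an array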

Summarising the results

  • \(\nu\) can take on any positive value
  • There is little practical difference between 1000 and 100000 degrees of freedom though
    • If you just took a mean then those values would completely dominate and destroy the meaning of your estimates
  • In their paper they addressed this by capping \(\nu\) at a rather low value in the custom Gibbs simulation procedure
  • In Stan I don’t cap \(\nu\) so I work with medians instead of means (at first)
    • I estimate \(\nu\) with the posterior median
    • And then calculate the root median square error, relative to the known values of \(\nu\) (see the sketch below)
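
As a sketch, my reading of this error measure for one \(\nu\) option (nu_hat being the vector of posterior medians across the 1000 samples):

# Root median square error, relative to the known nu
# (assumption: 'relative' means divided by the true value)
rel_rmedse <- function(nu_hat, nu_true) sqrt(median((nu_hat - nu_true)^2)) / nu_true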

Results for these priors

Relative root median square error of the posterior median

Other scales

  • When trying to understand the heaviness of the tail the focus is on small degrees of freedom.
  • The tails of the t change with \(\nu\) in the same way that the tails of the Pareto change with the tail index.
  • In extreme value theory it is considered more natural to work with the Extreme Value Index (EVI), the reciprocal of the tail index. Perhaps the same is true here?
  • Or perhaps the log scale makes the most sense? The degrees of freedom can’t go negative, unlike the EVI.

Comparison on other scales

Best Coverage

First we consider the coverage of the shortest 95% interval, as well as the median interval width.

Standard Coverage

Now we consider the coverage of the symmetric 95% interval (2.5% to 97.5%), as well as the median interval width of this interval.

Priors of Interest

  • Priors 1, 3 and 4 are middle of the road on all fronts and not very popular to begin with, so they will be dropped from the analysis from here on.
  • We will look at Prior 2 because it is the best in the area where the data is most t-like
  • We will look at Prior 6 because it is the most reliably good across a wide range of degrees of freedom
    • This is a hierarchical prior by Juárez and Steel (2010): a gamma prior with an exponential prior on its rate parameter
  • We will look at Prior 5 (exponential) because it has some accuracy in the area of most indecision, and is very easy and popular
  • For Priors 5 and 6 we will expand the hyperparameter choices as previously mentioned
  • Word of caution: while the exponential prior seems the best in many ways, it is effectively saying that the data is always t, even when the data is not t. It also doesn’t seem to be able to tell exactly how t the data is.

Refined Priors

  • \(p_{PMP}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2} \left[\Psi'\left(\frac{\nu}{2}\right)-\Psi'\left(\frac{\nu+1}{2}\right)-\frac{2\left(\nu+3\right)}{\nu\left(\nu+1\right)^{2}} \right]^{\frac{1}{2}}\)
  • \(p_{d1.2}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}\frac{2\nu d}{\left(\nu+d\right)^{3}}\), where \(d=1.2\).
  • \(p_{d0.75}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}\frac{2\nu d}{\left(\nu+d\right)^{3}}\), where \(d=0.75\).
  • \(p_{exp1}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}e^{-\lambda\nu}\), where \(\lambda=1\)
  • \(p_{exp0.1}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}e^{-\lambda\nu}\), where \(\lambda=0.1\)
  • \(p_{exp0.01}\left(\nu,\mu, \sigma^2\right) \propto \sigma^{-2}e^{-\lambda\nu}\), where \(\lambda=0.01\)

Refined simulation study

Refined Coverage

First we consider the coverage of the shortest 95% interval, as well as the median interval width.

Refined Standard Coverage

Now we consider the coverage of the symmetric 95% interval (2.5% to 97.5%), as well as the median interval width of this interval.

Conclusion on symmetric t priors

  • The probability matching prior (which is also a reference prior) is a good choice.
  • The hierarchical prior of Juárez and Steel (2010) with \(d=0.75\) looks great.

But does the same apply for a skew t?

Spoiler: yes it does 😀

Skewness

Skewness parameter

  • The skewness parameter of this distribution is not directly connected to the empirical skewness.
    • Just as the scale parameter is not the standard deviation, nor a simple function of it.
  • The direction is the same though.
  • \(\xi=1\Rightarrow\) symmetric t
  • \(0<\xi<1\Rightarrow\) left skew (mean < median)
  • \(\xi>1\Rightarrow\) right skew (mean > median)

What is a good prior for the skewness parameter?

  • Intuitively a hump at 1 (\(\xi=1\) means symmetry) might be desired
    • Big issue: how much pull towards the center is appropriate?
    • We could try something very flat like \(gamma(1.1,0.1)\) or something pronounced like \(LN(1,1)\)
  • But to have equal probability on each side you actually need much higher density between 0 and 1 than between 1 and \(\infty\), since the right side is much wider (the sketch after this list makes this concrete)
  • We don’t want too much bias to either side, even though we know in practice that right skewness is far more common
    • so maybe something symmetric on the log scale like \(LN(0,1)\)
    • or a vague gamma with mean 1 like \(gamma(0.1,0.1)\)
  • Some mathematical justification would be nice
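
The "equal probability on each side" point is easy to check numerically with base R distribution functions (a sketch over the candidates mentioned above):

# Mass each candidate prior puts on the left-skew side (xi < 1)
pgamma(1, shape = 0.1, rate = 0.1)  # vague gamma with mean 1
pgamma(1, shape = 1.1, rate = 0.1)  # very flat, mode at 1
plnorm(1, meanlog = 0, sdlog = 1)   # symmetric on the log scale: exactly 0.5
plnorm(1, meanlog = 1, sdlog = 1)   # pronounced hump, pulled to the right of 1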

Empirical skewness

Before we continue, let us try to relate the skewness parameter to the empirical skewness.

We can simulate a sample from this skew-t density and then calculate the skewness of the sample.

# Skew-t simulation function by Sean van der Merwe,
# based on code of the 'skewt' package, for the
# variant defined by Fernandez and Steel (1998)
rskt <- function(n, nu, xi = 1, mu = 0, s = 1) {
    p <- runif(n)
    result <- rep(NA_real_, n)
    probzero <- 1/(xi^2 + 1)
    pos <- p >= probzero
    result[!pos] <- 1/xi * qt(((xi^2 + 1) * p[!pos])/2,
        nu)
    result[pos] <- xi * qt((1 + 1/xi^2)/2 * (p[pos] - probzero) +
        1/2, nu)
    result * s + mu
}

# Sample skewness function by Sean van der Merwe
skewness <- function(x) {
    stdx <- x - mean(x)
    n <- length(x)
    (mean(stdx^3)/((mean(stdx^2))^1.5)) * sqrt(n * (n -
        1))/(n - 2)
}
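
For example, a quick check that the direction of the relationship matches the parameter (\(\nu>3\) so the sample skewness exists):

set.seed(42)
x <- rskt(10000, nu = 5, xi = 4)  # strong right skewness
skewness(x)                       # clearly positive, as expected for xi > 1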

Skewness Relationship

A better transformation

  • From: Arnold BC, Groeneveld RA. Measuring skewness with respect to the mode. Am Stat. 1995;49(1):34–8
  • Let \(\gamma=\frac{\xi^2-1}{\xi^2+1}\); then we have a new skewness measure with \(-1<\gamma < 1\).
  • Now \(\gamma=0\) is symmetric, \(\gamma<0\) is left-skew and \(\gamma>0\) is right-skew, as we are used to.
  • This is still not directly connected to the empirical skewness, but surprisingly close:

This scale makes more sense

Since this transformed scale seems linearly related to the sample skewness of samples from our density of choice, it makes more sense to work on this scale.

First, we derive a new prior on this scale:

  • The transformed uniform prior
    • Explicitly: Let \(\gamma\sim uniform(-1, 1)\), then
    • \(\pi(\xi) = 2\xi(\xi^2+1)^{-2}\) (via the Jacobian \(\frac{d\gamma}{d\xi}=\frac{4\xi}{(\xi^2+1)^2}\)), and
    • \(\log\pi(\xi)\propto \log(\xi) - 2\log(\xi^2+1)\)

Second, we will also calculate errors on this scale, for ease of interpretation.
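
A quick simulation check of the derivation above: push uniform draws of \(\gamma\) through the inverse map \(\xi=\sqrt{(1+\gamma)/(1-\gamma)}\) and compare with the CDF \(F(\xi)=\xi^2/(\xi^2+1)\) implied by the derived density (a sketch):

g  <- runif(1e5, -1, 1)            # gamma ~ uniform(-1, 1)
xi <- sqrt((1 + g) / (1 - g))      # invert gamma = (xi^2 - 1)/(xi^2 + 1)
mean(xi < 1)                       # about 0.5: equal mass on each side of symmetry
mean(xi < 2)                       # about 0.8, matching F(2) = 4/5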

Suggested priors

Consider some options:

Priors to test first

For the first simulation study we will try the following 3 priors:

  • The very flat but unbiased \(gamma(0.1, 0.1)\)
  • The transformed uniform prior
  • The very flat prior with a mode at 1: \(gamma(1.1, 0.1)\)

We will try each prior in combination with each of the two t priors settled on earlier, for a total of 6 joint priors again.

Skew t simulation study 1

First we implement the simulation study on 1000 samples of size 30 from a standard t density (no skewness), doing this for each of the \(\nu\) options from 1 to 24.

Severe right skewness

Now we make \(\xi=4\):

Is 1000 samples enough?

  • By accident I did another 1000 samples with \(\xi=4\); is there a difference?
  • It does not appear so.
  • Thus, we will stick to 1000 samples from here on, as that is already enough to keep a server busy for a day.

Conclusion

What about real data?

  • Consider a company in South Africa that creates pollution as a necessary by-product of its operations (will not be named).
  • They measure the pollution in the water upstream from the plant and downstream from the plant.
  • Ideally the difference would be small.
  • For some pollutants, the measurements are heavy tailed to the right; while at the same time loosely bounded on the left.
    • The boundaries appear random as they are changing over time due to unobserved factors outside of our control.
    • The boundaries can include zero, making transformations difficult (and often counter-intuitive anyway).
  • Thus, the statistical comparison is currently done using a common non-parametric permutation test.

A better way?

  • It is conceivable that a parametric comparison using skew-t fits might yield more power.
    • Checking this is on the to do list.
    • Maybe an honours student wants to try this as a master's project?
  • A better prior for the skewness can also help when we want to test the notion that a sample is skew to one side or the other.
    • Again the power needs to be formally evaluated.
  • On the next slide we look at two samples, up- and down-stream, with their fits.
  • We then calculate the probability of the down-stream location parameter being larger.

Example comparison: Sodium

Probability of plant making it worse by location parameter \(\approx\) 0, and by prediction \(\approx\) 0.209.

Example comparison: Sulphate

Probability of plant making it worse by location parameter \(\approx\) 0.129, and by prediction \(\approx\) 0.77.

Example comparison: Total Dissolved Solids

Probability of plant making it worse by location parameter \(\approx\) 0, and by prediction \(\approx\) 0.671.

The way forward

The next steps are:

  • Derive additional sensible prior(s) on the transformed scale
  • Do a final (more extensive) simulation study
  • Make a sensible recommendation for researchers wishing to use this skew-t distribution
  • Show that it works and is easy to apply using more real data examples
    • e.g. bio-equivalence clinical trial mixed effects regression

Right now the recommendation is to use:

\[\begin{aligned}\log\pi(\mu,\sigma,\nu,\xi) = & -2\log(\sigma) + \log(\nu) - 3\log(\nu + 0.75) \\ & + \log(\xi) - 2\log(\xi^2+1) + constants\end{aligned}\]
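
For use outside Stan, the same recommendation can be written as a small R helper (a sketch; constants dropped as above):

# Recommended joint log-prior, matching the Stan model block earlier
logprior <- function(mu, s, nu, xi) {
    -2*log(s) + log(nu) - 3*log(nu + 0.75) + log(xi) - 2*log(xi^2 + 1)
}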