November 2020

Introduction

Why Stan?

If all the nuts you ever have to tighten are the same one or two sizes, then you buy the best spanners in those sizes. You don’t use a monkey wrench that needs to be set each time. But if you keep encountering different sizes of nuts, fittings, and other objects that need tightening, then you get a monkey wrench. In statistical modelling, Stan is the monkey wrench.

According to https://mc-stan.org/: “Stan is a state-of-the-art platform for statistical modelling and high-performance statistical computation.”

You type out your model in the Stan language and then Stan does all the work, whether that is simulating the whole posterior distribution, approximating the posterior, or maximum likelihood estimation.

Advantages of Stan

  • It compiles your model into C++ code, so it’s very fast
  • Latest simulation methods
  • Can do full Bayes, variational Bayes or maximum likelihood as you like - you don’t change your model, just one line of code
  • Most importantly, if you want to change your model then you change only the part of the model that needs changing. No changing software, no installing new packages, no changing notation for a different ‘PROC’, etc.

Where Stan might help

Suppose you are doing a regression, but

  • Perhaps a Normal distribution is not appropriate; maybe a t, gamma, GPD, etc. model makes more sense
  • Maybe your observations belong to groups with differing unknown variances
  • Maybe you need to fit a regression function with a change point
  • Perhaps you have some prior information on one or more parameters

While tools exist for dealing with each of these issues, each issue has its own tools with their own notations and limitations. Few of these tools can handle any of the other issues at the same time.

With Stan you can handle all these issues at the same time.

Getting help

The Stan language is intuitive and sensible, but you will still need to look up most things the first time you want to use them.

The key bookmark is https://mc-stan.org/users/documentation/ which has everything you need.

I find myself visiting the language functions reference often to see the exact notation for the distribution I want to use.

For the other issues, a standard internet search gets you very far. Stan is popular free software so it has a large and active user base ready to help.

There are many helper packages (such as brms and rstanarm) that can do a lot of (or all) the work for you, including creating the model code for you to play with.

Running Stan

Interfaces

Stan has interfaces for R, Python, Julia, MATLAB, Stata, and the command line, among others. We will focus on rstan as it’s the easiest for most statisticians.

There are many ways to use Stan in R:

  • You can type your Stan model in a separate file, then tell rstan to compile that file
  • You can type your Stan model in a Stan block in your Rmarkdown document and click the play button to compile it
  • You can save the compiled model in a .Rds or .RData file and then load the compiled model each time you need it
  • You can mix and match the above as it suits you
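For example, the compile-once workflow sketched in the bullets above looks like this (the file name studentt.stan is hypothetical):

```r
library(rstan)

# Compile the model from a separate file (slow, but only done once)
mod <- stan_model("studentt.stan")

# Save the compiled model so future sessions can skip compilation
saveRDS(mod, "studentt_compiled.Rds")

# Later, or in another script: load the compiled model and sample immediately
mod <- readRDS("studentt_compiled.Rds")
```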

Rmarkdown interface

I will use Rmarkdown to demonstrate.

I use Rmarkdown for almost everything, including teaching, assessment, consultation, some research, and even making presentations like this one.

To use Rmarkdown you open RStudio, click File -> New File -> R Markdown…

To put in a Stan model you click on the Insert button at the top right of the code window, then click Stan.

On the next slide I’ll show a very basic Stan model in RStudio.

First model: t density

Notice the play button in the top right corner (circled in red).
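The screenshot itself is not reproduced here, but a basic t model consistent with the fit later on (data x of length n; parameters v, s and mu) might look like this:

```stan
data {
  int<lower=1> n;   // number of observations
  vector[n] x;      // the observations
}
parameters {
  real<lower=1> v;  // degrees of freedom
  real<lower=0> s;  // scale
  real mu;          // location
}
model {
  x ~ student_t(v, mu, s);  // likelihood; priors default to uniform
}
```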

Model breakdown

The critical components are:

  1. The data
  2. The parameters
  3. The likelihood

The optional components are:

  • Priors for some or all parameters - default is uniform on the domain of the parameter
  • Transformed Parameters - for when your model gets complicated
  • Transformed Data - better to do in R in advance
  • Generated Quantities - for creating things you want later

Installation and Loading

To be able to use rstan the way I do:

  1. Install R
  2. Install Rtools
  3. Install RStudio
  4. In the RStudio Console window, type install.packages('rstan')
  5. In your code, type library(rstan)

Number of cores

  • Optionally, set the number of cores for Stan to use when simulating in parallel
  • Easiest is to pick a number: options(mc.cores = 3)
  • On most computers you can get a big speed increase easily if you match the number of cores and chains
  • So calculate and store a good number: library(parallel); mycores <- ceiling(0.75*detectCores(logical = FALSE))
  • Then use that number everywhere: options(mc.cores = mycores) and later on when simulating set chains = mycores
  • I like 75% because then I can still use my computer while simulating (i.e. watch YouTube while I’m technically working).

Log returns

Let’s apply the t model to some data.

library(openxlsx)
sourcedata <- read.xlsx("IBM19891998.xlsx")
lret <- sourcedata$LogReturns
dates <- sourcedata$year + sourcedata$month/12 + sourcedata$day/365

library(viridisLite)
cols <- viridis(3)
par(mfrow = c(2, 1), mar = c(4, 4, 2, 0.2))
plot(dates, lret, type = "l", col = cols[1], lwd = 1, xlab = "Date", 
    ylab = "Log return", main = "IBM")
hist(lret, 50, col = cols[2], density = 20, ylab = "Frequency", 
    xlab = "Log return", main = "IBM")

IBM data

Fitting the model

stan_data <- list(x = lret, n = length(lret))  # Create the data list

fitML1 <- optimizing(studentt, stan_data)  # ML estimation

# This is the main simulation function:
fitBayes1 <- sampling(studentt, stan_data, chains = mycores)
# Type ?sampling for a lot of additional options

saveRDS(fitBayes1, "studenttIBM.Rds")  # Save model fit

summary(fitBayes1)$summary  # See results of simulation

traceplot(fitBayes1)  # See simulation paths to check for stability

list_of_draws1 <- extract(fitBayes1)  # Extract simulations

t model summary

        mean  se_mean     sd   2.5%  97.5%   Rhat
v      4.263     0.01  0.371  3.604  5.067  1.000
s      0.013     0.00  0.000  0.012  0.013  1.002
mu     0.000     0.00  0.000  0.000  0.001  1.000

The Rhat measure should always be close to 1.

Rhat compares the variation between chains to the variation within chains. If these are the same then that is a good indication of convergence in distribution.

t model trace plot

On the trace plot you should see the same patterns on the left as on the right, to indicate that you have done enough burn-in (what Stan calls warm-up).

You should also see the same sort of patterns in each chain.

When the chains are viewed together they should cover the whole posterior distribution.

Working with the results

# Posterior Predictive Distribution
postpred1 <- rt(length(list_of_draws1$mu), list_of_draws1$v) * 
    list_of_draws1$s + list_of_draws1$mu
par(mar = c(4.4, 4.4, 0.4, 0.4))
plot(density(postpred1), xlab = "Future values", ylab = "Posterior density", 
    main = "", col = "green", lwd = 2)

Non-standard distributions

We don’t need to only use standard distributions in Stan.

For example, we could use a better prior on the degrees of freedom (provided by Prof Abrie van der Merwe):

\[p(\nu)\propto \left[\psi'(0.5\nu)-\psi'(0.5(\nu+1))-2(\nu+3)\nu^{-1}(\nu+1)^{-2}\right]^{0.5}\]

where \(\psi'\) is the trigamma function.
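In Stan such a prior can be implemented as a user-defined log density; a sketch, where the function name nu_prior_lpdf is my own choice:

```stan
functions {
  // Log of the prior density for the degrees of freedom (up to a constant)
  real nu_prior_lpdf(real nu) {
    return 0.5 * log(trigamma(0.5 * nu) - trigamma(0.5 * (nu + 1))
                     - 2 * (nu + 3) / (nu * square(nu + 1)));
  }
}
```

In the model block you would then simply write v ~ nu_prior();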

Adding skewness

Regression

Ordinary regression

A standard regression model is just fitting a normal distribution with a linear function for the mean parameter. From Part 1.1 of the Stan user manual:
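That example amounts to the following model (a sketch; see the manual for the exact code):

```stan
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;           // intercept
  real beta;            // slope
  real<lower=0> sigma;  // residual standard deviation
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
```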

But with Stan we can fit any functions on any parameters.

                 (Within reason obviously, simpler is better.)

Opportunistic Potatoes

Opportunistic potatoes are a pest that affects farms in many negative ways and needs to be controlled. This experiment attempts to find the economically optimal dose of a specific treatment for this purpose.

In the original experiment 8 measures were taken for various cultivars and doses, but we will restrict ourselves to only one measure and combine the cultivars for this discussion.

This research was done by Talana Cronje, UFS, on behalf of Potatoes South Africa.

They applied the treatment at various proportions of the recommended dose. Lower proportions are cheaper to implement.

Visualise the data


Solution

  • Looking at the data, it is clear that the economically optimal dose is near 20%.
  • More than that costs more money but doesn’t yield much improvement.
  • After 40% there doesn’t appear to be any real improvement.
  • No need for a complex cost function analysis in this case.

Can we get the same answer more formally?

Bi-phasic regression

I propose the following model:

Model results

Given an unknown change point \(\gamma\), the model is formally defined as \(y_i\sim N(\beta_0+\beta_1(x_i-\gamma)I(x_i>\gamma)+\beta_2(x_i-\gamma)I(x_i<\gamma),\ \sigma^2)\).
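A sketch of this model in Stan (variable names assumed):

```stan
data {
  int<lower=1> n;
  vector[n] x;   // dose as proportion of recommended
  vector[n] y;   // response
}
parameters {
  real b0;                                  // level at the change point
  real b1;                                  // slope after the change point
  real b2;                                  // slope before the change point
  real<lower=min(x), upper=max(x)> gamma;   // change point
  real<lower=0> sigma;
}
model {
  for (i in 1:n) {
    real mu = b0 + (x[i] > gamma ? b1 : b2) * (x[i] - gamma);
    y[i] ~ normal(mu, sigma);
  }
}
```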

  • By far the easiest way to fit the model (in my opinion) is using Stan, via rstan
  • The p-value for the change point test is 0.0047, and for the test of a flat slope after the change point the p-value is 0.6583
  • Which means we assume there is a change point
    • with expected value 24% of recommended treatment
    • and 95% HPD interval of 10% to 42%

Intervals and p-values

After fitting the model, we can obtain intervals for any quantity of interest as follows:

  • Calculate the quantity for each simulated parameter vector, to get a vector of quantities
  • This vector of quantities can be seen as an approximation to the posterior distribution of the quantity of interest, let us call it sims.
  • Calculate a symmetric interval, say 95%, using the built-in quantile function: quantile(sims,c(0.025,0.975))
  • Calculate an equivalent to the traditional two-sided p-value using this function:
pvalfunc <- function(sims, target = 0) {
    2 * min(mean(sims < target), mean(sims > target))
}

Shortest interval

For a given coverage, a shorter interval provides more useful information.

If the distribution of the quantity of interest is skew then the symmetric interval is not ideal.

We can calculate the shortest interval, also known as the highest posterior density (HPD) interval, quite easily though if we have simulations.

The process is simple: to get the shortest 95% interval, just work out the length of every 95% interval and see which one is the shortest!

shortestinterval <- function(sims, alpha = 0.05) {
    # Coded by Sean van der Merwe, UFS
    sorted.sims <- sort(sims)
    nsims <- length(sims)
    gap <- round(nsims * (1 - alpha))
    widths <- diff(sorted.sims, gap)
    interval <- sorted.sims[c(which.min(widths), (which.min(widths) + 
        gap))]
    return(interval)
}

Changepoint Model with Censoring

If we take logs then the zeros are a problem. We can instead treat a zero as an observation censored below a small limit, say 1 or 0.5.
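Censoring is handled in Stan by adding the log CDF at the censoring limit for each censored observation. A simplified sketch (a plain normal model on the log scale, without the change point part; names assumed):

```stan
data {
  int<lower=0> n_obs;    // uncensored observations
  int<lower=0> n_cens;   // number of zeros (censored)
  vector[n_obs] y_obs;   // logs of the positive observations
  real U;                // log of the censoring limit, e.g. log(0.5)
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);
  target += n_cens * normal_lcdf(U | mu, sigma);  // censored contribution
}
```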

GPD regression

Here we want to do a regression on the scale parameter as it changes with time and claimant properties.
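Stan has no built-in generalised Pareto density, but it is easy to define one. A sketch for the \(\xi>0\) case, with a log-linear model on the scale parameter (names and structure assumed):

```stan
functions {
  // Generalised Pareto log density for exceedances y >= 0, xi > 0
  real gpd_lpdf(real y, real sigma, real xi) {
    return -log(sigma) - (1 / xi + 1) * log1p(xi * y / sigma);
  }
}
data {
  int<lower=1> n;
  int<lower=1> p;
  matrix[n, p] X;        // time, claimant properties, etc.
  vector<lower=0>[n] y;  // exceedances
}
parameters {
  vector[p] beta;
  real<lower=0> xi;
}
model {
  vector[n] sigma = exp(X * beta);  // regression on the scale parameter
  for (i in 1:n)
    y[i] ~ gpd(sigma[i], xi);
}
```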

LOO

Another advantage of Stan is efficient implementation of Leave-One-Out (LOO) cross-validation.

If we have multiple models we can compare their fit on the same set of data in a reliable way.

The most common uses would be to compare a complicated and simple model formulation, or to compare models with different sets of explanatory variables.

The steps are: calculate ‘log_lik’ for individual observations in your Stan models as a generated quantity, load the loo library (https://cran.r-project.org/web/packages/loo/index.html), and then compare the models.
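For the t model, the generated quantities block might look like this:

```stan
generated quantities {
  vector[n] log_lik;
  for (i in 1:n)
    log_lik[i] = student_t_lpdf(x[i] | v, mu, s);
}
```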

library(loo)  # Provides extract_log_lik, relative_eff, loo, loo_compare

log_lik_1 <- extract_log_lik(ModelFit1, merge_chains = FALSE)
log_lik_2 <- extract_log_lik(ModelFit2, merge_chains = FALSE)
r_eff_1 <- relative_eff(exp(log_lik_1), cores = mycores)
r_eff_2 <- relative_eff(exp(log_lik_2), cores = mycores)
loo_1 <- loo(log_lik_1, r_eff = r_eff_1, cores = mycores)
loo_2 <- loo(log_lik_2, r_eff = r_eff_2, cores = mycores)
comp <- loo_compare(loo_1, loo_2)
print(comp, simplify = FALSE)

Thinning

Thinning is the process of dropping some simulations from the output.

At some point a few decades ago it became popular to use high thinning values, like only using every hundredth simulated vector. In the early 2000s it became clear that this is almost always counter-productive, and yet people still keep doing it.

If your model is very complicated and you are doing massively long simulation chains then it may be useful to apply thinning of 2, i.e. keeping every second simulation vector, in order to speed up post-processing; but never more than that.
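In rstan this is just the thin argument of the sampling call:

```r
# Keep every second simulated vector (thin = 1, the default, keeps all)
fit <- sampling(studentt, stan_data, chains = mycores, thin = 2)
```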

Almost all statisticians are just fine forgetting that this is even a thing.

Art B. Owen (2017), Statistically Efficient Thinning of a Markov Chain Sampler, Journal of Computational and Graphical Statistics, 26(3), 738–744.

Last one: going over the top

The situation

We were asked to help rank 200 types of wheat according to some measures of how well they resist different pathogens.

While we did various standard analyses, the measurements were quite skew and seemed like they should fit a gamma model. The measurements were taken under varying conditions, such as site, year, and pathogen. Ultimately though we need a single measure of resistance per wheat strain under general conditions.

One possible approach is to model the measurements jointly, with wheat strain as fixed effect and the nuisance factors as random effects. We also have to account for differing variances under differing conditions.

Gamma random effects model

The general form of the model is:

\(y_i\sim Gamma(\alpha_i,\lambda_i)\)

\(\alpha_i=\mu_i^2/\sigma^2_{g_i}\)

\(\lambda_i=\mu_i/\sigma^2_{g_i}\)

\(\mu_i=\exp\left(\eta_{g_i} + \beta_{s_i}\right)\)

\(\eta_j\sim N(0,\tau^2)\)

\(g_i\) indicates the group and \(s_i\) the subject of observation \(i\), so the subject effects \(\exp(\beta_1),\exp(\beta_2),\dots\) are what we are interested in.

Stan version
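A sketch of the model above in Stan (names assumed):

```stan
data {
  int<lower=1> n;              // observations
  int<lower=1> G;              // groups (condition combinations)
  int<lower=1> S;              // subjects (wheat strains)
  int<lower=1, upper=G> g[n];  // group of each observation
  int<lower=1, upper=S> s[n];  // subject of each observation
  vector<lower=0>[n] y;        // measurements
}
parameters {
  vector[S] beta;              // fixed strain effects
  vector[G] eta;               // random group effects
  vector<lower=0>[G] sigma2;   // group variances
  real<lower=0> tau;           // random effect standard deviation
}
model {
  eta ~ normal(0, tau);
  for (i in 1:n) {
    real mu = exp(eta[g[i]] + beta[s[i]]);
    y[i] ~ gamma(square(mu) / sigma2[g[i]], mu / sigma2[g[i]]);
  }
}
```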

Conclusion

Thank you for listening

  • Stan is the flexible all-purpose solution for people who find themselves with unusual statistical modelling problems.

  • It can do much more advanced stuff than what was shown here. See their website for more information.

  • The help files are extensive. The reference documents are the most useful when you are trying to solve a specific problem.

  • There are many helper packages that can do a lot of (or all) the work for you, including creating the model code for you to play with.

  • Fitting models with Stan can be a lot of fun because of the freedom it provides.