Prior selection for estimating the number of missing extreme values

Sean van der Merwe, Jan Beirlant, Andréhette Verster

Introduction

Motivation

Sometimes you might have reason to believe that the largest values in a sample are discarded or distorted and wish to adjust for this

  • Example: The largest diamonds in a mine are the most interesting and valuable, and are studied via extreme value theory (EVT)
    • The largest diamonds are also easy to see with the eye and grab with the hand
    • The largest diamonds also face the biggest obstacles when moving to the surface and are most likely to be broken up naturally
    • It is a case of missing not at random (MNAR)

This creates distortion at the end of the tail of the distribution

The effect of missing extremes

Let us have a look at the tail of the full sample via the Pareto QQ Plot:

It is clear that we have a good fit to the top part of the observations

Missing extremes

Consider now the case where the top values are missing. We explicitly exclude the top 10 values from the sample:

Now we observe a rather distorted plot

The idea

  • The goal now is to attempt to recover the correct fit in spite of these missing observations
  • This is done using the theory of Beirlant et al. (2023ish), presented to this department in early 2023
  • It is based on the assumption that there is enough information left in the tail to estimate how much information is missing and what it might have indicated
    • Formally, if we can estimate the extreme value index (EVI) and number of missing observations \((m)\) simultaneously and accurately, then we can correct the distortion

Outline for rest of talk

  1. Model recap
  2. Regularisation / prior distribution
  3. Results
  4. Methodology,
  • i.e. how to code a simulation study to be both flexible and robust

Model recap

Extreme value index

Assume that the underlying distribution satisfies the max-domain of attraction condition, i.e. assume that the maximum of independent and identically distributed observations \(X_1, X_2, \ldots, X_n\) can be approximated by the generalized extreme value distribution. Mathematically: as \(n\to\infty\)

\[ \mathbb{P}\left( a_n^{-1} \left(\max\limits_{i=1,\ldots,n}X_i -b_n \right) \leq y \right) \hspace{2mm} \longrightarrow \hspace{2mm} G_{\gamma} (y) = \exp \left( - (1 + \gamma y)^{-1/\gamma}\right) \hspace{5mm} \text{ for } 1+\gamma\, y>0 \]

where \(b_n \in \mathbb{R}\), \(a_n >0\) and \(\gamma \in \mathbb{R}\) are the location, scale and shape parameters, respectively. The EVI \(\gamma\) is a measure of the tail-heaviness of the distribution of \(X\) with a larger value of \(\gamma\) implying a heavier tail of \(F\).

For the results I’ll show later we specifically consider only the case where \(\gamma>0\).

Spacings

First define

\[V_{j,n} := \log \frac{X_{(n-j+1)}}{X_{(n-j)}} \ j=1, \ldots,k,\]

then, if \(\gamma>0\),

\[V_{j,n}\sim Exp\left(\frac{j+m}{\gamma}\right)\]

The log-likelihood function is then given by

\[ \ell(\gamma,m) = \sum_{j=1}^k \log (j+m) - k\log \gamma - \left(\frac{j+m}{\gamma}\right)v_{j,n} \]

Generalised spacings

The standard spacings can be generalised for any reasonable EVI using the exponential regression model from Matthys and Beirlant (2003). First define

\[W_{j,n} := \log \frac{X_{(n-j+1)} - X_{(n-k)}}{X_{(n-j)} - X_{(n-k)}} \ j=1, \ldots,k-1,\]

and \(k_{\gamma}(u)= \int_u^1 v^{\gamma -1}dv = \frac{1-u^{\gamma}}{\gamma}\), then

\[W_{j,n}\sim Exp\left((j+m)k_{\gamma}\left( {j+m \over k+m}\right)\right)\]

The log-likelihood function is then given by

\[ \ell(\gamma,m) = \sum_{j=1}^k \log (j+m) + \sum_{j=1}^k \log k_{\gamma}\left( {j+m \over k+m}\right) - \sum_{j=1}^k (j+m) k_{\gamma}\left( {j+m \over k+m}\right) w_{j,n} \]

This approach needs a much larger amount of information to be accurate though

Does it work?

Since this is a simulated sample we can use the true values of the parameters and see

Not so stricly Pareto

What if we use a more general distribution that isn’t Pareto all the way, but has Pareto tails?

The same thing happens

But now we have less information with which to address it as we can’t use the whole sample

The method still works

If you’re far enough in the tail and still have enough to work with

Regularisation / prior distribution

The need for regularisation

  • The likelihoods shown are flat with respect to \(m\)
  • The estimation of \(\gamma\) and \(m\) cannot be separated as they affect each other
  • Thus, we get poor estimation using these likelihoods as they are

By pulling \(m\) towards zero in a systematic way it helps the estimation

First option

  • The paper added a term \(-\lambda m\) to the log likelihood to improve estimation with \(\lambda\) fixed at 0.1
    • Spoiler for results: this was a very good choice 😀
  • They then used maximum likelihood to estimate \(m\) and \(\gamma\) (EVI)
  • This option is equivalent to saying that \(m\sim Exp(\lambda)\) and using the posterior mode estimates
    • Mathematically and practically equivalent, but philosophically less so
  • I like and implemented the Bayesian approach
    • Not limited to mode (optimal for 0-1 loss), can use median (optimal for absolute loss), mean (optimal for squared error loss), or others
    • Simulation of posterior gives the whole picture, so direct quantification of uncertainty, including meaningful intervals
    • Simulation is slower though, takes up to 1 second for a large sample

Second option

  • As a second option we could extend the first option by say squaring \(m\)
  • More generally, we could use \(-\lambda m^a\)
  • Turns out that when you square \(m\) then \(\lambda\) needs to be much smaller
    • In all cases, a larger value of \(\lambda\) pulls \(m\) to 0 more strongly, resulting in smaller intervals and more stable estimates
    • But the intervals can quickly become too small with poor to terrible coverage
      • Spurious accuracy is obtained by systematically underestimating \(m\)

Third option

  • We could consider a different form for the log prior entirely
  • How about \(-\lambda\log(m+a)\)?
  • I just made this up because I wanted something practical that has a different shape
  • Are there more formal options?

Prof Verster suggested we try objective priors

Objective priors

  • Objective prior derivations are based on i.i.d. samples, which we don’t have, so directly applying the standard approaches doesn’t work
  • But what if we follow the spirit of these approaches?
  • Both the Maximal Data Information Prior (MDIP) and Jeffreys’ Prior try to maximise the information contributed by the sample and minimise the information contributed by the prior
  • In our sample each observation contributes a different amount of information, causing a headache
    • But what if we focus on the observation that gives the most information only?

MDIP

For the largest observed value \((j=1)\), the log MDIP is

\[\begin{aligned} \log MDIP &= E_v[\log f(v_{1,n})] + c_1 \\ &= E_v\left[\log (1+m) - k\log \gamma - \left(\frac{1+m}{\gamma}\right)v_{1,n}\right] + c_1 \\ &= \log (1+m) - k\log \gamma - \left(\frac{1+m}{\gamma}\right)E_v\left[v_{1,n}\right] + c_1 \\ &= \log (1+m) - k\log \gamma - 1 + c_1 \\ \end{aligned}\]

At least for \(m\), this is the prior I suggested, with \(a=1=j\) and \(\lambda=-1\). Unfortunately, this pulls \(m\) away from 0 😣

Jeffreys

Let \(g(v)=-\log f(v) = -\log (1+m) + k\log \gamma + \left(\frac{1+m}{\gamma}\right)v\), then

\[\begin{aligned} \frac{\partial g}{\partial m} &= -(1+m)^{-1} + v\gamma^{-1} \\ \frac{\partial^2 g}{\partial m^2} &= (1+m)^{-2} \\ E_v\left[\frac{\partial^2 g}{\partial m^2}\right] &= (1+m)^{-2} \end{aligned}\]

So the log Jeffreys prior on \(m\) alone would be \(-\log (1+m)\), which is the prior I suggested with \(\lambda=1\) and \(a=1=j\) 😀

Independence and Joint Jeffreys

\[\begin{aligned} \frac{\partial^2 g}{\partial m\partial \gamma} &= -\gamma^{-2} \\ \frac{\partial g}{\partial \gamma} &= k\gamma^{-1} - (1+m)v\gamma^{-2} \\ \frac{\partial^2 g}{\partial \gamma^2} &= -k\gamma^{-2} + 2(1+m)v\gamma^{-3} \\ \end{aligned}\]

Giving an independence Jeffreys log prior of \(-\log (1+m) -\log\gamma + c_2\)

and a joint Jeffreys prior of

\[0.5\log\left[(1+m)^{-2}(2-k) - \gamma^{-2}\right]-\log\gamma + c_3\]

Results

So which one is the best?

How do you define best?

  • Do you focus on pure accuracy, like RMSE?
  • Do you consider bias to one side or the other to be worse?
    • Which estimator do you want to use?
  • Do you value coverage?

I value balance, so I want to calculate everything and consider it all, but that results in about 40 statistics per scenario, and I have no idea what you want to see

Posterior median minus true value

Posterior mode minus true value

Posterior mean minus true value

Shortest 95% interval coverage

Symmetric 95% interval coverage

Shortest 95% interval width

Symmetric 95% interval width

Methodology

Building scenarios

base_scenarios <- rbind(
  expand.grid(
    Distribution = c("Pareto", "Frechet"),
    EVI_true = c(0.2),
    m_true = c(5, 10),
    k = c(100, 200, 300),
    N = 500, 
    lambda = c(0.5, 0.3, 0.2, 0.1, 0.05),
    model = "Exponential decay", 
    a = c(1),
    spacings = c("Standard") # vs Generalised
  ),
  expand.grid(
    Distribution = c("Pareto", "Frechet"),
    EVI_true = c(0.2),
    m_true = c(5, 10),
    k = c(100, 200, 300),
    N = 500, 
    lambda = c(0.5, 1, 2),
    model = "Inverse decay",
    a = c(0.5, 1),
    spacings = c("Standard")
  )
)

Scenario table

Each of these scenarios is then repeated a lot of times for each sample we want to generate, say 1000 samples per scenario.

Running each repetition

  • The key to a reproducible and trustworthy simulation study is to do the same thing every time in every case
  • Here we do this by creating a single function
    • that takes a single scenario as input
    • and produces a single vector of statistics as output
  • Little to no code duplication!
    • If you have multiple pieces of code that are almost identical, except that they run on different data, then create a function

Core function

run_scenario <- function(scenario) {
  # Generate sample
  if (scenario$Distribution == "Pareto") {
    actuar::rpareto1(scenario$N, 1/scenario$EVI_true, 1) |> 
      sort(decreasing = TRUE) -> y_full
  } else {
    actuar::rinvweibull(scenario$N, 1/scenario$EVI_true, 1) |> 
      sort(decreasing = TRUE) -> y_full
  }
  # Remove top values
  y_full[-(1:scenario$m_true)] -> y_reduced
  # Calculate spacings
  v <- -diff(log(y_reduced[seq_len(scenario$k+1)]))
  # Select model
  if (scenario$model == "Exponential decay") {
    compiled_model <- standardspacings
  } else {
    compiled_model <- standardspacings2
  }
  # Do simulations
  compiled_model |> 
      sampling(data = list(y = v, 
                           k = length(v), 
                           lambda = scenario$lambda,
                           a = scenario$a
                           ), 
               pars = pars_of_interest, 
               chains = 1, 
               iter = 5000
      ) |> rstan::extract(pars = pars_of_interest) -> post_sims
  # Return statistics
  list(evi_stats = get_stats(post_sims$evi, scenario$EVI_true),
       m_stats = get_stats(post_sims$m, scenario$m_true)
  )
}

Calculating statistics

shortestinterval <- function(postsims, width=0.95) { # Coded by Sean van der Merwe, UFS
  postsims |> sort() -> sorted.postsims
  round(length(postsims)*width) -> gap
  sorted.postsims |> diff(gap) |> which.min() -> pos
  sorted.postsims[c(pos, pos + gap)] }

get_stats <- function(postsims, true_value) {
  true_value <- true_value |> unname()
  dens <- density(postsims)
  interval1 <- shortestinterval(postsims)
  interval2 <- quantile(postsims, c(0.025, 0.5, 0.975)) |> unname()
  c(Mode = dens$x[which.max(dens$y)], 
    Median = interval2[2],
    Mean = mean(postsims),
    L = interval1[1],
    U = interval1[2],
    L0.025 = interval2[1],
    U0.975 = interval2[3],
    Mode_error = dens$x[which.max(dens$y)] - true_value,
    Median_error = interval2[2] - true_value,
    Mean_error = mean(postsims) - true_value,
    Coverage = (interval1[1] < true_value) && (interval1[2] > true_value),
    Coverage_symmetric = (interval2[1] < true_value) && (interval2[3] > true_value),
    Width = interval1[2] - interval1[1],
    Width_symmetric = interval2[2] - interval2[1]
  )
}

Model

\[\begin{aligned} V_{j,n} &\sim Exp\left(\lambda_j\right) \\ \lambda_j &= (j+m)\gamma^{-1} \\ \gamma &\sim U(0, 1) \\ \log\pi(m) &= -\lambda m^a + c\text{ OR } -\lambda\log(m+a) + c \\ \text{where }\lambda,a &>0\text{ are prior hyper-parameters } \\ \text{and }c&\text{ is an unknown constant} \end{aligned}\]

Stan model example

// This Stan block defines a fit using standard spacings, by Sean van der Merwe, UFS
data {
  int<lower=2> k; // number of observations in total
  real<lower = 0> y[k]; // sorted and pre-transformed observations
  real<lower = 0> lambda; // prior scale hyper-parameter on missing observations
  real<lower = 0> a; // prior shape hyper-parameter on missing observations 
}
// The parameters of the model
parameters {
  real<lower = 0> m; // missing information parameter
  real<lower = 0, upper = 1> evi; // extreme value index
}
transformed parameters {
  real<lower = 0> rate[k]; // expected values
  for (j in 1:k) {
    rate[j] = (j+m)/evi;
  }
}
model {
  y ~ exponential(rate);
  target += -lambda*(m^a);
  evi ~ uniform(0, 1);
}

Running the simulation study

# Load the parallel library and create a thread cluster
library(parallel)
cl <- makeCluster(mycores) # For office computer use 3 or 4, for HPC use 60 or 100, or auto-detect
# Load rstan on each thread
cl |> clusterEvalQ({
  library(rstan)
  options(mc.cores = 1)
}) |> invisible()
# Export the key variables to each thread
cl |> clusterExport(c(
  'shortestinterval', 'get_stats', 'run_scenario', 'pars_of_interest', 'standardspacings', 'standardspacings2'
))

# Run the scenarios in parallel in a load balanced way
system.time({
  parLapplyLB(cl, scenariolist, run_scenario) -> all_results_raw
})

# Stop the cluster and save the results
cl |> stopCluster()
all_results_raw |> saveRDS("standardspacingsresults.rds")

Conclusion

The end

  • As expected, the precise nature of the prior doesn’t matter as long as it is sensible.
    • If your data contains enough information to address the research question then minor changes to the priors won’t affect the conclusions.
  • The overall shape of the priors can affect convergence and make model fitting easier if chosen well.

Thank you for your time and attention.

This presentation was created using the Reveal.js format in Quarto, using the RStudio IDE. Font and line colours according to UFS branding, and background image by Midjouney using image editor GIMP.