Prior selection for estimating the number of missing extreme values

Sean van der Merwe, Andréhette Verster

Introduction

This work was suggested by Prof Jan Beirlant of KU Leuven, and extends his recent work on this topic.

Motivation

Sometimes you might have reason to believe that the largest values in a sample are discarded or distorted and wish to adjust for this

  • Example: The largest diamonds in a mine are the most interesting and valuable, and are studied via extreme value theory (EVT)
    • The largest diamonds also face the biggest obstacles when moving to the surface and are most likely to be broken up naturally
    • It is a case of missing not at random (MNAR)

This creates distortion at the end of the tail of the distribution

The effect of missing extremes

Let us have a look at the tail of a Pareto sample via the Pareto QQ Plot:

It is clear that we have a good fit to the top part of the observations

Missing extremes

Consider now the case where the top values are missing. We explicitly exclude the top 10 values from the sample:

Now we observe a rather distorted plot
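
As a minimal sketch of how such plots can be produced (the sample size, EVI, and seed below are illustrative choices, not those used for the figures in this talk):

# Simulate a strict Pareto sample with EVI gamma = 0.5 (illustrative values)
set.seed(42)
n <- 500
gamma_true <- 0.5
x <- sort(runif(n)^(-gamma_true)) # Pareto with EVI gamma via inverse transform

# Pareto QQ plot: log order statistics against standard exponential quantiles
pareto_qq <- function(x) {
  x <- sort(x)
  n <- length(x)
  plot(-log(1 - (1:n) / (n + 1)), log(x),
       xlab = "Standard exponential quantiles", ylab = "log(x)")
}

pareto_qq(x)               # full sample: the tail is roughly linear
pareto_qq(head(x, n - 10)) # top 10 values removed: the tail bends away from the line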

The idea

  • The goal now is to attempt to recover the correct fit in spite of these missing observations
  • This is done using the theory of Beirlant et al. (2023)
  • It is based on the assumption that there is enough information left in the tail to estimate how much information is missing and what it might have indicated
    • Formally, if we can estimate the extreme value index (EVI) and number of missing observations \((m)\) simultaneously and accurately, then we can correct the distortion

Outline for rest of talk

  1. Model
  2. Regularisation / prior distribution
  3. Methodology
  4. Results

Model

Extreme value index

Assume that the underlying distribution satisfies the max-domain of attraction condition, i.e. assume that the maximum of independent and identically distributed observations \(X_1, X_2, \ldots, X_n\) can be approximated by the generalized extreme value distribution. Mathematically: as \(n\to\infty\)

\[ \mathbb{P}\left( a_n^{-1} \left(\max\limits_{i=1,\ldots,n}X_i -b_n \right) \leq y \right) \hspace{2mm} \longrightarrow \hspace{2mm} G_{\gamma} (y) = \exp \left( - (1 + \gamma y)^{-1/\gamma}\right) \hspace{5mm} \text{ for } 1+\gamma\, y>0 \]

where \(b_n \in \mathbb{R}\), \(a_n >0\) and \(\gamma \in \mathbb{R}\) are the location, scale and shape parameters, respectively. The EVI \(\gamma\) is a measure of the tail-heaviness of the distribution of \(X\) with a larger value of \(\gamma\) implying a heavier tail of \(F\).

Log spacings of the tail

First define

\[V_{j,n} := \log \frac{X_{(n-j+1)}}{X_{(n-j)}} = \log X_{(n-j+1)} - \log X_{(n-j)},\ \ j=1, \ldots,k,\]

then, if \(\gamma\) (EVI) \(>0\), and \(m>0\) quantifies the missing information,

\[V_{j,n}\sim Exp\left(\frac{j+m}{\gamma}\right)\]

The log-likelihood function is then given by

\[ \ell(\gamma,m) = \sum_{j=1}^k \log (j+m) - k\log \gamma - \sum_{j=1}^k \left(\frac{j+m}{\gamma}\right)v_{j,n} \]
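
As a concrete illustration, this log-likelihood translates directly into a small R helper (an illustrative sketch, not code from the paper):

# Log-likelihood of the standard log spacings v[1..k], for gamma > 0 and m >= 0
loglik_standard <- function(gamma, m, v) {
  k <- length(v)
  j <- seq_len(k)
  sum(log(j + m)) - k * log(gamma) - sum((j + m) * v / gamma)
}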

Generalised spacings

The standard spacings can be generalised for any reasonable EVI using the exponential regression model from Matthys and Beirlant (2003). First define

\[W_{j,n} := \log \frac{X_{(n-j+1)} - X_{(n-k)}}{X_{(n-j)} - X_{(n-k)}},\ \ j=1, \ldots,k-1,\]

and \(k_{\gamma}(u)= \int_u^1 v^{\gamma -1}dv = \frac{1-u^{\gamma}}{\gamma}\), then

\[W_{j,n}\sim Exp\left((j+m)k_{\gamma}\left( {j+m \over k+m}\right)\right)\]

The log-likelihood function is then given by

\[ \ell(\gamma,m) = \sum_{j=1}^{k-1} \log (j+m) + \sum_{j=1}^{k-1} \log k_{\gamma}\left( {j+m \over k+m}\right) - \sum_{j=1}^{k-1} (j+m) k_{\gamma}\left( {j+m \over k+m}\right) w_{j,n} \]

This approach requires considerably more information to be accurate, though.
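
For illustration, a corresponding R sketch of the generalised-spacings log-likelihood might look as follows (again a sketch under assumptions, not the authors' code):

# k_gamma(u) = (1 - u^gamma) / gamma, with the limit -log(u) as gamma -> 0
k_gamma <- function(u, gamma) {
  if (abs(gamma) < 1e-8) -log(u) else (1 - u^gamma) / gamma
}

# Log-likelihood of the generalised spacings w[1..k-1], for any real gamma and m >= 0
loglik_generalised <- function(gamma, m, w, k) {
  j <- seq_along(w)                                  # j = 1, ..., k - 1
  rate <- (j + m) * k_gamma((j + m) / (k + m), gamma)
  sum(log(rate)) - sum(rate * w)
}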

Does it work?

Since this is a simulated sample, we can use the true parameter values and see:

Not so strictly Pareto

What if we use a more general distribution that isn’t Pareto all the way, but has Pareto tails?

The same thing happens

But now we have less information with which to address it as we can’t use the whole sample

The method still works

If you’re far enough in the tail and still have enough to work with

Regularisation / prior distribution

The need for regularisation

  • The likelihoods shown are flat with respect to \(m\)
  • The estimation of \(\gamma\) and \(m\) cannot be separated as they affect each other
  • Thus, we get poor estimation using these likelihoods as they are

Pulling \(m\) towards zero in a systematic way helps the estimation

First option

  • Jan Beirlant’s paper added a regularisation term \(-\lambda m\) to the log likelihood to improve estimation, with \(\lambda\) fixed at 0.1
    • Spoiler for results: this was a very good choice 😀
  • They then used maximum likelihood to estimate \(m\) and \(\gamma\) (EVI)
  • This option is equivalent to saying that \(m\sim Exp(\lambda)\) and using the posterior mode estimate (a code sketch follows this list)
    • Mathematically and practically equivalent, but philosophically less so
  • I implemented the Bayesian approach
    • Not limited to mode (optimal for 0-1 loss), can use median (optimal for absolute loss), mean (optimal for squared error loss), or others
    • Simulation of posterior gives the whole picture, so direct quantification of uncertainty, including meaningful intervals
    • Simulation is slower though, takes up to 1 second for a large sample 😉
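
To make the equivalence concrete, here is a minimal sketch of the posterior mode (equivalently, the regularised ML estimate) for the standard spacings under \(\log\pi(m) = -\lambda m\); the starting values and optimiser are illustrative choices:

# Posterior mode / regularised ML for (gamma, m) given standard log spacings v[1..k]
fit_mode <- function(v, lambda = 0.1) {
  k <- length(v)
  j <- seq_len(k)
  negpost <- function(par) {               # par = c(gamma, m)
    gamma <- par[1]; m <- par[2]
    if (gamma <= 0 || m < 0) return(1e10)  # keep the optimiser inside the parameter space
    -(sum(log(j + m)) - k * log(gamma) - sum((j + m) * v / gamma) - lambda * m)
  }
  optim(c(0.5, 1), negpost)$par            # returns c(gamma_hat, m_hat)
}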

Second option

  • As a second option, we could extend the first by, say, squaring \(m\)
  • More generally, we could use \(-\lambda m^a\)
  • It turns out that when \(m\) is squared, \(\lambda\) needs to be much smaller
  • In all cases, a larger value of \(\lambda\) pulls \(m\) to 0 more strongly, resulting in smaller intervals and more stable estimates
  • But the intervals can quickly become too small with poor to terrible coverage
    • Spurious accuracy is obtained by systematically underestimating \(m\)

Third option

  • We could consider a different form for the log prior entirely
  • How about \(-\lambda\log(m+a)\)?
  • After some effort we realised that this is related to Jeffreys’ prior for \(m\) (which has \(a=1\))
    • Objective priors try to maximise the information contributed by the sample and minimise the information contributed by the prior

Jeffreys

For the largest observed value \((j=1)\), let \(g(v)=-\log f(v) = -\log (1+m) + k\log \gamma + \left(\frac{1+m}{\gamma}\right)v\), then

\[\begin{aligned} \frac{\partial g}{\partial m} &= -(1+m)^{-1} + v\gamma^{-1} \\ \frac{\partial^2 g}{\partial m^2} &= (1+m)^{-2} \\ E_v\left[\frac{\partial^2 g}{\partial m^2}\right] &= (1+m)^{-2} \end{aligned}\]

So the log Jeffreys prior on \(m\) alone would be \(-\log (1+m)\), which is the prior I suggested with \(\lambda=1\) and \(a=1=j\) 😀

Independence and Joint Jeffreys

\[\begin{aligned} \frac{\partial^2 g}{\partial m\partial \gamma} &= -\gamma^{-2} \\ \frac{\partial g}{\partial \gamma} &= k\gamma^{-1} - (1+m)v\gamma^{-2} \\ \frac{\partial^2 g}{\partial \gamma^2} &= -k\gamma^{-2} + 2(1+m)v\gamma^{-3} \\ \end{aligned}\]

Giving an independence Jeffreys log prior of \(-\log (1+m) -\log\gamma + c_2\)

and a joint Jeffreys prior of

\[0.5\log\left[(1+m)^{-2}(2-k) - \gamma^{-2}\right]-\log\gamma + c_3\]

Methodology of simulation study

Model for standard spacings

\[\begin{aligned} V_{j,n} &\sim Exp\left(\lambda_j\right) \\ \lambda_j &= (j+m)\gamma^{-1} \\ \gamma &\sim U(0, 1) \\ \log\pi(m) &= -\lambda m^a + c\text{ OR } -\lambda\log(m+a) + c \\ \text{where }\lambda,a &>0\text{ are prior hyper-parameters } \\ \text{and }c&\text{ is an unknown constant} \end{aligned}\]

The model is implemented in Stan via the rstan interface.
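
A minimal rstan sketch of this model, with the exponential-decay prior on \(m\); parameter names and the commented call are illustrative and may differ from the actual implementation:

library(rstan)

stan_code <- "
data {
  int<lower=1> k;
  vector<lower=0>[k] v;          // standard log spacings
  real<lower=0> lambda;          // prior hyper-parameters
  real<lower=0> a;
}
parameters {
  real<lower=0, upper=1> evi;    // gamma (EVI); the bounds imply a U(0, 1) prior
  real<lower=0> m;               // number of missing extremes
}
model {
  for (j in 1:k)
    v[j] ~ exponential((j + m) / evi);   // V_j ~ Exp((j + m) / gamma)
  target += -lambda * pow(m, a);         // log pi(m) = -lambda * m^a, up to a constant
}
"

# fit <- stan(model_code = stan_code, data = list(k = length(v), v = v, lambda = 0.1, a = 1))
# Posterior draws of evi and m then give modes, medians, means, and credible intervals.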

Building scenarios

base_scenarios <- rbind(
  expand.grid(
    Distribution = c("GPD"), # vs c("Pareto", "Frechet"),
    EVI_true = c(0.2, 0, -0.2),
    m_true = c(5, 10),
    k = c(100, 200, 300),
    N = 500, 
    lambda = c(0.4, 0.2, 0.1, 0.05, 0.02, 0.01),
    model = "Exponential decay", # vs "Inverse decay",
    a = c(0.5, 1, 2),
    spacings = "Generalised" # vs "Standard"
  ),
  expand.grid(
    ...
  )
)

Scenario table (Generalised spacings)

Each of these scenarios is then repeated, say 1000 samples per scenario.

Running each repetition

  • The key to a reproducible and trustworthy simulation study is to do the same thing every time in every case
  • Here we do this by creating a single function
    • that takes a single scenario as input
    • and produces a single vector of statistics as output (see the sketch after this list)
  • Little to no code duplication!
    • If you have multiple pieces of code that are almost identical, except that they run on different data, then create a function
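
A structural sketch of such a function for the strict-Pareto, standard-spacings case, reusing the illustrative fit_mode helper from earlier (the scenario fields and returned statistics are assumptions):

# One repetition: a scenario goes in, a named vector of statistics comes out
run_once <- function(scenario) {
  # simulate a Pareto sample with the scenario's (positive) EVI, then drop the top m_true values
  x <- sort(runif(scenario$N)^(-scenario$EVI_true))
  x <- head(x, scenario$N - scenario$m_true)
  v <- rev(diff(log(tail(x, scenario$k + 1))))  # log spacings v[1..k], v[1] from the two largest observations
  est <- fit_mode(v, scenario$lambda)           # illustrative helper from earlier: c(gamma_hat, m_hat)
  c(EVI_error = est[1] - scenario$EVI_true,
    m_error   = est[2] - scenario$m_true)
}

# Example (illustrative scenario values):
# run_once(list(N = 500, EVI_true = 0.2, m_true = 10, k = 100, lambda = 0.1))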

Results

For brevity we show the results for the generalised spacings only.

Standard spacings are more accurate (but they require that you know for certain that the EVI is positive); the patterns are similar.

ReML / Posterior mode (estimate minus true value)

Shortest 95% interval coverage

Conclusion

The end

  • As expected, the precise nature of the prior doesn’t matter as long as it is sensible and the sample size is not too small.

For this problem we recommend either \(\log\pi(m) = -0.01 m^2\) (smallest absolute error), or \(-0.1m\) (the one in the source paper), or the Jeffreys prior (most reliable).

Thank you for your time and attention.

This presentation was created using the Reveal.js format in Quarto, using the RStudio IDE. Fonts and line colours follow UFS branding, and the background image was made with Midjourney and edited in GIMP.