Sometimes you might have reason to believe that the largest values in a sample have been discarded or distorted, and wish to adjust for this.
This creates distortion at the upper end of the tail of the distribution.
Let us have a look at the tail of the full sample via the Pareto QQ Plot:
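A minimal sketch of how such a plot can be produced (the sample `y_full` here is a stand-in, simulated from a Pareto distribution with \(\gamma=0.2\) as in the later scenarios):
# Sketch: Pareto QQ-plot of a simulated heavy-tailed sample
set.seed(42)
y_full <- actuar::rpareto1(500, 1/0.2, 1) |> sort(decreasing = TRUE)
n <- length(y_full)
i <- 1:n
# Standard exponential quantiles against log order statistics;
# a straight upper portion with slope near gamma indicates a Pareto-type tail
plot(-log(1 - i/(n + 1)), log(sort(y_full)),
     xlab = "Standard exponential quantiles", ylab = "log(X)",
     main = "Pareto QQ-plot")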
It is clear that we have a good fit to the top part of the observations.
Consider now the case where the top values are missing. We explicitly exclude the top 10 values from the sample:
Now we observe a rather distorted plot.
Assume that the underlying distribution satisfies the max-domain of attraction condition, i.e. assume that the maximum of independent and identically distributed observations \(X_1, X_2, \ldots, X_n\) can be approximated by the generalized extreme value distribution. Mathematically: as \(n\to\infty\)
\[ \mathbb{P}\left( a_n^{-1} \left(\max\limits_{i=1,\ldots,n}X_i -b_n \right) \leq y \right) \hspace{2mm} \longrightarrow \hspace{2mm} G_{\gamma} (y) = \exp \left( - (1 + \gamma y)^{-1/\gamma}\right) \hspace{5mm} \text{ for } 1+\gamma\, y>0 \]
where \(b_n \in \mathbb{R}\), \(a_n >0\) and \(\gamma \in \mathbb{R}\) are the location, scale and shape parameters, respectively. The extreme value index (EVI) \(\gamma\) is a measure of the tail-heaviness of the distribution of \(X\), with a larger value of \(\gamma\) implying a heavier tail of \(F\).
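An informal illustration of this limit, as a sketch under assumed Pareto data with \(\gamma=0.2\):
# Sketch: block maxima of simulated Pareto data against GEV quantiles
set.seed(1)
gamma_true <- 0.2
block_max <- replicate(2000, max(actuar::rpareto1(100, 1/gamma_true, 1)))
p <- (1:2000)/2001
gev_q <- ((-log(p))^(-gamma_true) - 1)/gamma_true # quantiles of G_gamma
plot(gev_q, sort(block_max),
     xlab = "GEV quantiles", ylab = "Block maxima") # roughly linear if the limit holds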
For the results I’ll show later we specifically consider only the case where \(\gamma>0\).
First define
\[V_{j,n} := \log \frac{X_{(n-j+1)}}{X_{(n-j)}}, \quad j=1, \ldots,k,\]
then, if \(\gamma>0\) and \(m\) denotes the number of missing top observations,
\[V_{j,n}\sim Exp\left(\frac{j+m}{\gamma}\right)\]
where \(Exp(\lambda)\) denotes the exponential distribution with rate \(\lambda\).
The log-likelihood function is then given by
\[ \ell(\gamma,m) = \sum_{j=1}^k \log (j+m) - k\log \gamma - \sum_{j=1}^k \left(\frac{j+m}{\gamma}\right)v_{j,n} \]
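Coded directly, this is a one-liner (a sketch; `v` is assumed to hold the \(k\) observed spacings):
# Log-likelihood of the standard spacings under the Exp((j+m)/gamma) model
loglik_standard <- function(gamma, m, v) {
  j <- seq_along(v)
  sum(log(j + m)) - length(v)*log(gamma) - sum((j + m)*v)/gamma
}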
The standard spacings can be generalised for any reasonable EVI using the exponential regression model from Matthys and Beirlant (2003). First define
\[W_{j,n} := \log \frac{X_{(n-j+1)} - X_{(n-k)}}{X_{(n-j)} - X_{(n-k)}}, \quad j=1, \ldots,k-1,\]
and \(k_{\gamma}(u)= \int_u^1 v^{\gamma -1}dv = \frac{1-u^{\gamma}}{\gamma}\), then
\[W_{j,n}\sim Exp\left((j+m)k_{\gamma}\left( {j+m \over k+m}\right)\right)\]
The log-likelihood function is then given by
\[ \ell(\gamma,m) = \sum_{j=1}^{k-1} \log (j+m) + \sum_{j=1}^{k-1} \log k_{\gamma}\left( {j+m \over k+m}\right) - \sum_{j=1}^{k-1} (j+m) k_{\gamma}\left( {j+m \over k+m}\right) w_{j,n} \]
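A sketch of this generalised log-likelihood in R (assuming `w` holds the \(k-1\) generalised spacings):
# Log-likelihood of the generalised spacings (Matthys and Beirlant, 2003)
k_gamma <- function(u, gamma) (1 - u^gamma)/gamma
loglik_generalised <- function(gamma, m, w) {
  j <- seq_along(w)       # j = 1, ..., k-1
  k <- length(w) + 1
  rate <- (j + m)*k_gamma((j + m)/(k + m), gamma)
  sum(log(rate)) - sum(rate*w)
}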
This approach needs much more information to be accurate, though.
Since this is a simulated sample, we can use the true values of the parameters and see how well the estimates recover them.
What if we use a more general distribution that isn’t Pareto all the way, but has Pareto tails?
But now we have less information with which to address it, as we can't use the whole sample.
If you're far enough into the tail, and still have enough observations to work with, estimation remains feasible.
Pulling \(m\) towards zero in a systematic way helps the estimation.
Prof Verster suggested we try objective priors.
For the largest observed value \((j=1)\), the log maximal data information prior (MDIP) is
\[\begin{aligned} \log MDIP &= E_v[\log f(v_{1,n})] + c_1 \\ &= E_v\left[\log (1+m) - k\log \gamma - \left(\frac{1+m}{\gamma}\right)v_{1,n}\right] + c_1 \\ &= \log (1+m) - k\log \gamma - \left(\frac{1+m}{\gamma}\right)E_v\left[v_{1,n}\right] + c_1 \\ &= \log (1+m) - k\log \gamma - 1 + c_1 \\ \end{aligned}\]
At least for \(m\), this is the prior I suggested, with \(a=1\) (here \(j=1\)) and \(\lambda=-1\). Unfortunately, this pulls \(m\) away from 0 😣
Let \(g(v)=-\log f(v) = -\log (1+m) + k\log \gamma + \left(\frac{1+m}{\gamma}\right)v\), then
\[\begin{aligned} \frac{\partial g}{\partial m} &= -(1+m)^{-1} + v\gamma^{-1} \\ \frac{\partial^2 g}{\partial m^2} &= (1+m)^{-2} \\ E_v\left[\frac{\partial^2 g}{\partial m^2}\right] &= (1+m)^{-2} \end{aligned}\]
So the log Jeffreys prior on \(m\) alone would be \(-\log (1+m)\), which is the prior I suggested with \(\lambda=1\) and \(a=1\) (here \(j=1\)) 😀
\[\begin{aligned} \frac{\partial^2 g}{\partial m\partial \gamma} &= -\gamma^{-2} \\ \frac{\partial g}{\partial \gamma} &= k\gamma^{-1} - (1+m)v\gamma^{-2} \\ \frac{\partial^2 g}{\partial \gamma^2} &= -k\gamma^{-2} + 2(1+m)v\gamma^{-3} \\ \end{aligned}\]
Giving an independence Jeffreys log prior of \(-\log (1+m) -\log\gamma + c_2\)
and a joint Jeffreys prior of
\[0.5\log\left[(1+m)^{-2}(2-k) - \gamma^{-2}\right]-\log\gamma + c_3\]
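For reference, the two candidate prior families on \(m\) (the "Exponential decay" and "Inverse decay" forms used in the scenarios below) are easy to encode; a sketch, with additive constants dropped:
# Candidate log-priors on m, up to additive constants (sketch)
log_prior_m <- function(m, lambda, a, model = c("Exponential decay", "Inverse decay")) {
  model <- match.arg(model)
  if (model == "Exponential decay") -lambda*m^a else -lambda*log(m + a)
}
# The m-part of the MDIP corresponds to the inverse form with lambda = -1, a = 1,
# while Jeffreys on m alone corresponds to lambda = 1, a = 1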
How do you define "best"?
I value balance, so I want to calculate everything and consider it all, but that results in about 40 statistics per scenario, and I don't know which of them you would most like to see.
# Grid of simulation scenarios: one expand.grid block per prior form
base_scenarios <- rbind(
  expand.grid(
    Distribution = c("Pareto", "Frechet"),
    EVI_true = c(0.2),
    m_true = c(5, 10),
    k = c(100, 200, 300),
    N = 500,
    lambda = c(0.5, 0.3, 0.2, 0.1, 0.05),
    model = "Exponential decay",
    a = c(1),
    spacings = c("Standard") # vs Generalised
  ),
  expand.grid(
    Distribution = c("Pareto", "Frechet"),
    EVI_true = c(0.2),
    m_true = c(5, 10),
    k = c(100, 200, 300),
    N = 500,
    lambda = c(0.5, 1, 2),
    model = "Inverse decay",
    a = c(0.5, 1),
    spacings = c("Standard")
  )
)
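The two grids combine into 132 scenarios (60 from the first `expand.grid` and 72 from the second). The parallel runner below expects a list called `scenariolist`; its construction isn't shown, but a minimal sketch (replicating the list if repeated samples per scenario are wanted) could be:
nrow(base_scenarios) # 132 scenarios in total
scenariolist <- split(base_scenarios, seq_len(nrow(base_scenarios))) # one-row scenarios
# scenariolist <- rep(scenariolist, 1000) # optionally replicate for repeated samples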
Each of these scenarios is then repeated many times, once for each sample we want to generate, say 1000 samples per scenario.
run_scenario <- function(scenario) {
  # Generate a sample, sorted in decreasing order
  if (scenario$Distribution == "Pareto") {
    actuar::rpareto1(scenario$N, 1/scenario$EVI_true, 1) |>
      sort(decreasing = TRUE) -> y_full
  } else { # Frechet, via the inverse Weibull parameterisation
    actuar::rinvweibull(scenario$N, 1/scenario$EVI_true, 1) |>
      sort(decreasing = TRUE) -> y_full
  }
  # Remove the top m_true values
  y_full[-(1:scenario$m_true)] -> y_reduced
  # Calculate the k log-spacings of the remaining top observations
  v <- -diff(log(y_reduced[seq_len(scenario$k + 1)]))
  # Select the compiled Stan model matching the prior form
  if (scenario$model == "Exponential decay") {
    compiled_model <- standardspacings
  } else {
    compiled_model <- standardspacings2
  }
  # Do the posterior simulations
  compiled_model |>
    sampling(data = list(y = v,
                         k = length(v),
                         lambda = scenario$lambda,
                         a = scenario$a),
             pars = pars_of_interest,
             chains = 1,
             iter = 5000) |>
    rstan::extract(pars = pars_of_interest) -> post_sims
  # Return summary statistics for each parameter
  list(evi_stats = get_stats(post_sims$evi, scenario$EVI_true),
       m_stats = get_stats(post_sims$m, scenario$m_true))
}
shortestinterval <- function(postsims, width = 0.95) { # Coded by Sean van der Merwe, UFS
  # Find the narrowest interval containing `width` of the posterior simulations
  postsims |> sort() -> sorted.postsims
  round(length(postsims)*width) -> gap
  sorted.postsims |> diff(gap) |> which.min() -> pos
  sorted.postsims[c(pos, pos + gap)]
}
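A quick sanity check: for a standard normal sample the shortest 95% interval should be close to \(\pm 1.96\).
set.seed(123)
shortestinterval(rnorm(1e5)) # approximately -1.96 and 1.96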
get_stats <- function(postsims, true_value) {
  true_value <- true_value |> unname()
  dens <- density(postsims)
  interval1 <- shortestinterval(postsims)
  interval2 <- quantile(postsims, c(0.025, 0.5, 0.975)) |> unname()
  c(Mode = dens$x[which.max(dens$y)],
    Median = interval2[2],
    Mean = mean(postsims),
    L = interval1[1],
    U = interval1[2],
    L0.025 = interval2[1],
    U0.975 = interval2[3],
    Mode_error = dens$x[which.max(dens$y)] - true_value,
    Median_error = interval2[2] - true_value,
    Mean_error = mean(postsims) - true_value,
    Coverage = (interval1[1] < true_value) && (interval1[2] > true_value),
    Coverage_symmetric = (interval2[1] < true_value) && (interval2[3] > true_value),
    Width = interval1[2] - interval1[1],
    Width_symmetric = interval2[3] - interval2[1] # upper minus lower quantile
  )
}
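With the helper functions in place (and the compiled Stan models shown below), a single scenario can be smoke-tested before committing to the full run; `pars_of_interest` is assumed here to name the Stan parameters:
pars_of_interest <- c("evi", "m") # assumed: the parameter names in the Stan models
run_scenario(scenariolist[[1]])$evi_stats # quick single-scenario check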
\[\begin{aligned} V_{j,n} &\sim Exp\left(\lambda_j\right) \\ \lambda_j &= (j+m)\gamma^{-1} \\ \gamma &\sim U(0, 1) \\ \log\pi(m) &= -\lambda m^a + c\text{ OR } -\lambda\log(m+a) + c \\ \text{where }\lambda,a &>0\text{ are prior hyper-parameters } \\ \text{and }c&\text{ is an unknown constant} \end{aligned}\]
// This Stan block defines a fit using standard spacings, by Sean van der Merwe, UFS
data {
  int<lower=2> k; // number of observations in total
  real<lower = 0> y[k]; // sorted and pre-transformed observations
  real<lower = 0> lambda; // prior scale hyper-parameter on missing observations
  real<lower = 0> a; // prior shape hyper-parameter on missing observations
}
// The parameters of the model
parameters {
  real<lower = 0> m; // missing information parameter
  real<lower = 0, upper = 1> evi; // extreme value index
}
transformed parameters {
  real<lower = 0> rate[k]; // exponential rates (j+m)/evi for each spacing
  for (j in 1:k) {
    rate[j] = (j + m)/evi;
  }
}
model {
  y ~ exponential(rate); // spacings likelihood
  target += -lambda*(m^a); // exponential decay prior on m
  evi ~ uniform(0, 1); // matches the declared bounds
}
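The second compiled model, `standardspacings2`, is not shown; a minimal sketch, assuming it is identical apart from using the "Inverse decay" prior \(-\lambda\log(m+a)\) on \(m\):
// Sketch (assumed): the inverse-decay variant differs only in the prior on m;
// the data, parameters and transformed parameters blocks are as above
model {
  y ~ exponential(rate); // spacings likelihood, unchanged
  target += -lambda*log(m + a); // inverse decay prior on m
  evi ~ uniform(0, 1);
}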
# Load the parallel library and create a thread cluster
library(parallel)
mycores <- max(detectCores() - 1, 1) # For office computer use 3 or 4, for HPC use 60 or 100, or auto-detect as here
cl <- makeCluster(mycores)
# Load rstan on each thread
cl |> clusterEvalQ({
  library(rstan)
  options(mc.cores = 1)
}) |> invisible()
# Export the key variables to each thread
cl |> clusterExport(c(
  'shortestinterval', 'get_stats', 'run_scenario', 'pars_of_interest', 'standardspacings', 'standardspacings2'
))
# Run the scenarios in parallel in a load balanced way
system.time({
  parLapplyLB(cl, scenariolist, run_scenario) -> all_results_raw
})
# Stop the cluster and save the results
cl |> stopCluster()
all_results_raw |> saveRDS("standardspacingsresults.rds")
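The saved list can then be flattened for analysis; a minimal sketch of one way to collect the statistics, one row per run:
# Sketch: stack the per-run statistics into matrices for summarising
evi_results <- do.call(rbind, lapply(all_results_raw, \(r) r$evi_stats))
m_results <- do.call(rbind, lapply(all_results_raw, \(r) r$m_stats))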
Thank you for your time and attention.
This presentation was created using the Reveal.js format in Quarto, using the RStudio IDE. Font and line colours follow UFS branding, and the background image is by Midjourney, edited in GIMP.
2023/08/11 - Missing Extremes