Introduction
This example is from the Random Effects Models notes of Prof. Robert Schall, UFS.
Students at a specific faculty normally undergo a 4-year degree programme. However, applicants who do not have the required minimum Admission Point are enrolled in a 5-year programme; that is, they join the regular 4-year programme after an initial year of preparatory classes.
In order to evaluate the effectiveness of the preparatory year, academic results (module marks) of students in the two programmes were compared. Marks for 10 first-year modules were available for the 20xx cohort of students.
The objective of the analysis was to compare the mean module marks of the two programmes statistically.
Notes
A naive analysis approach would be as follows: for a given module, calculate the average mark for students from the 5-year and 4-year programmes, respectively, and compare those averages using a two-sample t-test (say). However, the data set is characterised by the fact that for many students some of the module marks are missing, mostly because the student in question discontinued his or her studies, often for lack of academic success. Simply ignoring the missing data and analysing only the available data for a given module would cause the estimates of average module marks to be biased upwards (because the missing data are likely to come from the weaker students). While this upward bias applies to both the 4-year and the 5-year programme, the drop-out rate is higher among students in the 5-year programme than among students in the 4-year programme. Therefore, the upward bias is stronger for the 5-year programme than for the 4-year programme, and this differential bias carries over into the estimate of the “5-year - 4-year” difference in average module marks.
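To make the differential-bias argument concrete, here is a minimal simulation sketch (the group means, spreads and drop-out mechanism below are invented for illustration; they are not estimates from the actual data): two groups have the same true mean mark, but weaker students are more likely to be missing, and more so in the second group, so a complete-case comparison of means is biased.
set.seed(1)
N <- 10000
marksA <- rnorm(N, mean = 55, sd = 12)  # hypothetical group with low drop-out
marksB <- rnorm(N, mean = 55, sd = 12)  # hypothetical group with high drop-out, same true mean
# Probability of observing a mark increases with the mark (weak students drop out),
# and drop-out is more severe in group B:
pA <- plogis((marksA - 40) / 10)
pB <- plogis((marksB - 50) / 10)
obsA <- marksA[runif(N) < pA]
obsB <- marksB[runif(N) < pB]
mean(obsB) - mean(obsA)  # clearly positive, even though the true difference is zero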
It is plausible, and indeed clearly observed in practice, that the module marks for a given student are highly correlated. Thus the available marks for a given student provide information on the module marks that are missing for that student. In order to exploit this correlation and thus “recover” information on missing marks, we must fit a model to all the module marks jointly. Missing data problems of this type can often be handled successfully by fitting a mixed model.
Model description
We are going to fit the following explanatory effects for the marks as a starting point (a notational sketch follows the list):
- Module as fixed effect (different modules have different average marks)
- Program as fixed effect (the average marks in the 4-year and 5-year programs differ)
- Module.Program interaction as fixed effect (the differences between 4-year and 5-year programs differ for different modules)
- Student as random effect (students differ in their average module mark)
- Student as an effect on the variance (students are assumed to differ in the variability of their marks)
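In notation (the symbols are ours, not from the original notes), the full model for mark \(y_i\) of observation \(i\) is roughly
\[ y_i = \beta_{m(i),\,p(i)} + \alpha_{s(i)} + \varepsilon_i, \qquad \alpha_s \sim N(0, \tau^2), \qquad \varepsilon_i \sim N\!\left(0, \phi_{s(i)}^2\right), \]
where \(m(i)\), \(p(i)\) and \(s(i)\) denote the module, programme and student of observation \(i\). The simplified model further down replaces the student-specific standard deviations \(\phi_s\) with a single standard deviation \(s\).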
Data preparation
Start by grabbing the data set here.
The first step is to prepare the data by shaping it and removing students with too few marks (we cannot estimate a student's variance from fewer than two marks; the code below keeps only students with more than two non-missing marks).
Then we remove all the remaining rows with missing marks.
"academicplanexample.csv" |>
read.csv() -> rawdata
rawdata$subject |>
factor(levels = unique(rawdata$subject)) -> SubjectFactor
rawdata |>
by(SubjectFactor, function(d) {
sum(!is.na(d$mark)) > 2
}) -> GoodSubjectsIndicator
rawdata$subject %in% levels(SubjectFactor)[GoodSubjectsIndicator] ->
GoodRows
rawdata |>
subset(GoodRows) |>
na.omit() -> d
d$subject |>
factor(levels = unique(d$subject)) -> SubjectFactor
rm(GoodRows, GoodSubjectsIndicator)
d |>
nrow() -> n
PlanNum <- (d$plan == "A") + 1
d$module |>
factor() |>
as.numeric() -> ModuleNum
SubjectFactor |>
as.numeric() -> SubjNum
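At this point it can help to glance at the prepared objects (a quick check of our own, not in the original write-up):
n                        # number of marks retained
length(unique(SubjNum))  # number of students retained
table(PlanNum)           # marks per programme (2 corresponds to plan "A")
table(ModuleNum)         # marks per module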
Model fitting
We will use Stan, via the rstan package, to implement our models.
library(parallel)
library(rstan)
# Use roughly 75% of the physical cores for running chains in parallel
mycores <- max(1, floor(detectCores(logical = FALSE) * 0.75))
options(mc.cores = mycores)
rstan_options(auto_write = TRUE)  # cache compiled models to avoid recompilation
Full model
// This Stan block defines a mixed effects model with changing variance by subject, by Sean van der Merwe, UFS
data {
int<lower=2> nModule; // number of modules
int<lower=2> nPlan; // number of plans
int<lower=2> nSubj; // number of subjects
int<lower=2> n; // number of observations
int<lower=1, upper=nSubj> subj[n]; // subject membership
int<lower=1, upper=nPlan> plan[n]; // plan membership
int<lower=1, upper=nModule> module[n]; // module membership
real y[n]; // observations
}
parameters {
real<lower=0> phi[nSubj]; // within subject standard deviation
real alpha[nSubj]; // subject random effects
real<lower=0> tau; // between subject standard deviation
real beta[nModule, nPlan]; // module fixed effects
}
transformed parameters {
real mu[n];
real<lower=0> s[n];
for (i in 1:n) {
mu[i] = alpha[subj[i]] + beta[module[i],plan[i]];
s[i] = phi[subj[i]];
}
}
model {
alpha ~ normal(0, tau); // random effects are assumed to have zero mean
for (j in 1:nModule) {
for (k in 1:nPlan) {
beta[j,k] ~ normal(50, 100); // priors on beta parameters
}
}
tau ~ exponential(0.01); // prior on tau parameter
phi ~ exponential(0.01); // prior on phi parameters
y ~ normal(mu , s);
}
generated quantities {
vector[n] log_lik;
for (i in 1:n) {
log_lik[i] = normal_lpdf(y[i] | mu[i] , s[i]);
}
}
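The sampling call below assumes that the Stan program above has been compiled into an object called AcademicModel1 (for instance via a knitr Stan chunk with output.var = "AcademicModel1"). If you are working interactively, one way to compile it is, assuming the code is saved as AcademicModel1.stan:
AcademicModel1 <- stan_model("AcademicModel1.stan")  # compile once; auto_write caches the result
The same applies to AcademicModel2 and AcademicModel3 further down.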
stan_data <- list(n = n, nModule = max(ModuleNum), nPlan = max(PlanNum),
nSubj = max(SubjNum), subj = SubjNum, module = ModuleNum, plan = PlanNum,
y = d$mark)
modelfit1 <- sampling(AcademicModel1, stan_data, pars = c("alpha",
"beta", "tau", "phi", "log_lik", "mu", "s"), iter = 20000, chains = mycores)
Simplified model
Here we have a single variance parameter, as in an ordinary regression model. This is the basic or standard approach.
// This Stan block defines a mixed effects model, by Sean van der Merwe, UFS
data {
int<lower=2> nModule; // number of modules
int<lower=2> nPlan; // number of plans
int<lower=2> nSubj; // number of subjects
int<lower=2> n; // number of observations
int<lower=1, upper=nSubj> subj[n]; // subject membership
int<lower=1, upper=nPlan> plan[n]; // plan membership
int<lower=1, upper=nModule> module[n]; // module membership
real y[n]; // observations
}
parameters {
real<lower=0> s; // within subject standard deviation
real alpha[nSubj]; // subject random effects
real<lower=0> tau; // between subject standard deviation
real beta[nModule, nPlan]; // module fixed effects
}
transformed parameters {
real mu[n];
for (i in 1:n) {
mu[i] = alpha[subj[i]] + beta[module[i],plan[i]];
}
}
model {
alpha ~ normal(0, tau); // random effects are assumed to have zero mean
for (j in 1:nModule) {
for (k in 1:nPlan) {
beta[j,k] ~ normal(50, 100); // priors on beta parameters
}
}
tau ~ exponential(0.01); // prior on tau parameter
s ~ exponential(0.01); // prior on s parameter
y ~ normal(mu , s);
}
generated quantities {
vector[n] log_lik;
for (i in 1:n) {
log_lik[i] = normal_lpdf(y[i] | mu[i] , s);
}
}
modelfit2 <- sampling(AcademicModel2, stan_data, pars = c("alpha",
"beta", "tau", "s", "log_lik", "mu"), iter = 20000, chains = mycores)
No difference model
Here we assume that there is no difference between the programmes, and merely model the module averages.
// This Stan block defines a mixed effects model, by Sean van der Merwe, UFS
data {
int<lower=2> nModule; // number of modules
int<lower=2> nSubj; // number of subjects
int<lower=2> n; // number of observations
int<lower=1, upper=nSubj> subj[n]; // subject membership
int<lower=1, upper=nModule> module[n]; // module membership
real y[n]; // observations
}
parameters {
real<lower=0> s; // within subject standard deviation
real alpha[nSubj]; // subject random effects
real<lower=0> tau; // between subject standard deviation
real beta[nModule]; // module fixed effects
}
transformed parameters {
real mu[n];
for (i in 1:n) {
mu[i] = alpha[subj[i]] + beta[module[i]];
}
}
model {
alpha ~ normal(0, tau); // random effects are assumed to have zero mean
for (j in 1:nModule) {
beta[j] ~ normal(50, 100); // priors on beta parameters
}
tau ~ exponential(0.01); // prior on tau parameter
s ~ exponential(0.01); // prior on s parameter
y ~ normal(mu , s);
}
generated quantities {
vector[n] log_lik;
for (i in 1:n) {
log_lik[i] = normal_lpdf(y[i] | mu[i] , s);
}
}
modelfit3 <- sampling(AcademicModel3, stan_data, pars = c("alpha",
"beta", "tau", "s", "log_lik", "mu"), iter = 20000, chains = mycores)
Model comparison
Here the models are compared using approximate leave-one-out cross-validation (PSIS-LOO, via the loo package), followed by a DIC calculation.
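The loo package estimates the expected log pointwise predictive density under leave-one-out cross-validation,
\[ \mathrm{elpd}_{\mathrm{loo}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}), \]
approximated by Pareto-smoothed importance sampling applied to the pointwise log-likelihood draws, which is why each model's generated quantities block stores log_lik.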
library(loo)
list(M1 = modelfit1, M2 = modelfit2, M3 = modelfit3) -> models
models |>
  lapply(function(model) {
    model |>
      extract_log_lik(merge_chains = FALSE) -> log_lik  # iterations x chains x observations
    log_lik |>
      exp() |>
      relative_eff(cores = mycores) -> r_eff  # relative efficiency of the likelihood draws
    log_lik |>
      loo(r_eff = r_eff, cores = mycores)  # PSIS-LOO for this model
  }) |>
  loo_compare() |>
  print(simplify = FALSE)
## elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
## M2 0.0 0.0 -5427.3 34.7 176.4 7.9 10854.6 69.5
## M3 -19.0 7.6 -5446.3 34.0 167.8 7.4 10892.6 67.9
## M1 -62.8 21.0 -5490.1 31.3 303.1 11.5 10980.2 62.6
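As an additional comparison, the DIC is computed by hand from the posterior draws. Writing \(\bar{D}\) for the posterior mean deviance and \(D(\bar\theta)\) for the deviance evaluated at the posterior means of \(\mu_i\) and \(s_i\), the code below computes
\[ \mathrm{DIC} = D(\bar\theta) + 2\,p_D, \qquad p_D = \bar{D} - D(\bar\theta). \]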
models |>
  lapply(extract) -> list_of_draws
list_of_draws |>
  sapply(function(draws) {
    # Deviance at the posterior means of mu and s
    Dthetabar <- sum(dnorm(stan_data$y, colMeans(draws$mu), colMeans(as.matrix(draws$s)),
      log = TRUE)) * (-2)
    # Posterior mean deviance
    mean(draws$log_lik |>
      rowSums() * (-2)) -> muD
    # DIC = D(theta_bar) + 2 * p_D, where p_D = muD - Dthetabar
    Dthetabar + 2 * (muD - Dthetabar)
  }) -> DIC
cat("\n\n---\n\n### DIC Comparison\n\nDIC1 =", DIC[1], ", DIC2 =",
  DIC[2], ", and DIC3=", DIC[3], "\n\nTherefore we conclude that model",
  which.min(DIC), "is most parsimonious.\n\n---\n\n")
DIC Comparison
DIC1 = 10851.07 , DIC2 = 10833.11 , and DIC3= 10873.62
Therefore we conclude that model 2 is most parsimonious.
Analysis and conclusions
Firstly, we note that the model with changing variance may be overfitting slightly. We are not certain that this will be true for future samples, but it is clear that for this sample the single variance model is more parsimonious.
Further, we note that the model with differences between the programmes is more parsimonious, suggesting that there is a programme effect on the whole. It is important to consider the big picture first, to avoid spurious conclusions when looking at individual module differences.
Thus, we will investigate programme differences using the single variance (standard) mixed model with different means for each programme.
First we will consider the differences visually, and then using intervals and probabilities. The key point is that we are looking at the posterior distribution of the programme differences on a per-module basis.
# Posterior draws of the per-module difference in mean mark: plan 2 ("A") minus plan 1
diffs <- list_of_draws[[2]]$beta[, , 2] - list_of_draws[[2]]$beta[, , 1]
d$module |>
  factor() |>
  levels() -> moduleNames
library(plotly)
colnames(diffs) <- moduleNames
# Box plots of the posterior programme differences, one box per module
diffs |>
  data.frame() |>
  tidyr::pivot_longer(everything(), names_to = "Module", values_to = "Difference") |>
  plot_ly(y = ~Difference, type = "box", color = ~Module)
# Two-sided posterior "p-value": twice the smaller tail probability relative to the target value
pvalfunc <- function(sims, target = 0) {
  2 * min(mean(sims < target), mean(sims > target))
}
# Label a p-value as significant or not at level alpha
sigfunc <- function(pvalue, alpha = 0.05) {
  c("significant", "not significant")[c(pvalue <= alpha, pvalue > alpha)]
}
# Shortest interval containing a proportion (1 - alpha) of the posterior simulations
shortestinterval <- function(postsims, alpha = 0.05) {
  # Coded by Sean van der Merwe, UFS
  postsims |>
    sort() -> sorted.postsims
  round(length(postsims) * (1 - alpha)) -> gap
  sorted.postsims |>
    diff(gap) |>
    which.min() -> pos
  sorted.postsims[c(pos, pos + gap)]
}
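As a quick sanity check of the interval function (our addition), applying it to standard normal draws should give roughly the familiar \(\pm 1.96\) limits:
set.seed(1)
shortestinterval(rnorm(1e5))  # approximately -1.96 and 1.96 for the default 95% level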
pvals <- apply(diffs, 2, pvalfunc)
ints <- apply(diffs, 2, shortestinterval)  # 95% shortest intervals per module
postmean <- apply(diffs, 2, mean)
postmedian <- apply(diffs, 2, median)
results <- data.frame(Module = moduleNames, lower = ints[1, ], median = postmedian,
  mean = postmean, upper = ints[2, ], p_value = pvals, adj_p_value = p.adjust(pvals),
  Significance = sapply(p.adjust(pvals), sigfunc), row.names = NULL)
# openxlsx::write.xlsx(list(ModuleDifferences=results),
# 'AcademicPlanResults.xlsx')
results[, 2:7] <- round(results[, 2:7], 3)
knitr::kable(results)
Module | lower | median | mean | upper | p_value | adj_p_value | Significance |
---|---|---|---|---|---|---|---|
FAM114 | 2.652 | 6.807 | 6.813 | 10.962 | 0.001 | 0.009 | significant |
ILR114 | 2.221 | 5.981 | 5.984 | 9.828 | 0.002 | 0.014 | significant |
ILR124 | 3.972 | 7.867 | 7.869 | 11.734 | 0.000 | 0.001 | significant |
PSN124 | 0.674 | 4.737 | 4.738 | 8.847 | 0.024 | 0.024 | significant |
RGK114 | 1.067 | 5.020 | 5.015 | 8.728 | 0.010 | 0.019 | significant |
ROR124 | 2.127 | 6.001 | 5.991 | 9.833 | 0.003 | 0.014 | significant |
RPK112 | 2.199 | 6.096 | 6.100 | 10.143 | 0.003 | 0.014 | significant |
RPK122 | 2.017 | 6.195 | 6.215 | 10.178 | 0.002 | 0.014 | significant |
SFR114 | 5.146 | 9.009 | 9.004 | 13.057 | 0.000 | 0.000 | significant |
SFR124 | 12.697 | 16.864 | 16.855 | 20.846 | 0.000 | 0.000 | significant |
Thus, the difference in marks between students in the two programmes is significant for all modules, even after conservatively adjusting for multiple testing.
Combined effect
We can take a weighted average of the mark differences using the module credits as weights. Note that all calculations are done per simulated parameter set, and the simulations are only summarised at the final stage, for greatest accuracy, since \(E[f(x)]\ne f(E[x])\) in general.
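A tiny illustration of that point (our example, not from the original analysis): for exponential draws with mean 1, the average of the squares is about 2 while the square of the average is about 1.
x <- rexp(1e5)  # E[x] = 1
mean(x^2)       # approximately 2 = E[x^2]
mean(x)^2       # approximately 1 = (E[x])^2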
weights <- as.numeric(substring(moduleNames, nchar(moduleNames)))  # module credits are the last digit of the module code
weights <- weights/sum(weights)  # Weights should sum to 1 for easy application
diffall <- diffs %*% weights  # credit-weighted overall difference, per posterior draw
cat("Optimal 95% interval of overall mark difference:", round(shortestinterval(diffall),
  2), "\n")
## Optimal 95% interval of overall mark difference: 4.65 10.58
# plot(density(diffall), lwd=3, col='purple', xlab='Mark difference', main='')
density(diffall) -> dens
# Posterior density of the credit-weighted overall mark difference
plot_ly(x = dens$x, y = dens$y, type = "scatter", mode = "lines",
  fill = "tozeroy") |>
  layout(xaxis = list(title = "Mark difference"), yaxis = list(title = "Density"))