Introduction
This example is from the Random Effects Models notes of Prof. Robert Schall, UFS.
Students at a specific faculty normally undergo a 4-year degree programme. However, applicants who do not have the required minimum Admission Point are enrolled in a 5-year programme; that is, they join the regular 4-year programme after an initial year of preparatory classes.
In order to evaluate the effectiveness of the preparatory year, academic results (module marks) of students in the two programmes were compared. Marks for 10 first-year modules were available for the 20xx cohort of students.
The objective of the analysis was to compare the mean module marks of the two programmes statistically.
Notes
A naive analysis approach would be as follows: for a given module, calculate the average mark for students from the 5-year and 4-year programmes, respectively, and compare those averages using a two-sample t-test (say). However, the data set is characterised by the fact that for many students some of the module marks are missing, mostly because the student in question discontinued his or her studies, often for lack of academic success. Simply ignoring the missing data and analysing only the available data for a given module would cause the estimates of average module marks to be biased upwards (because the missing data are likely to come from the weaker students). While this upward bias applies to both the 4-year and the 5-year programme, the drop-out rate is higher among students in the 5-year programme than among students in the 4-year programme. Therefore, the upward bias is stronger for the 5-year programme than for the 4-year programme, and this differential bias carries over into the estimate of the “5-year - 4-year” difference in average module marks.
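To make the differential-bias argument concrete, here is a minimal simulation sketch (the group means, spreads and drop-out mechanism below are invented for illustration; they are not estimates from the actual data): two groups have the same true mean mark, but weaker students are more likely to be missing, and more so in the second group, so a complete-case comparison of means is biased.
set.seed(1)
N <- 10000
marksA <- rnorm(N, mean = 55, sd = 12)  # hypothetical group with low drop-out
marksB <- rnorm(N, mean = 55, sd = 12)  # hypothetical group with high drop-out, same true mean
# Probability of observing a mark increases with the mark (weak students drop out),
# and drop-out is more severe in group B:
pA <- plogis((marksA - 40) / 10)
pB <- plogis((marksB - 50) / 10)
obsA <- marksA[runif(N) < pA]
obsB <- marksB[runif(N) < pB]
mean(obsB) - mean(obsA)  # clearly positive, even though the true difference is zero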
It is plausible, and indeed clearly observed in practice, that the module marks for a given student are highly correlated. Thus the available marks for a given student provide information on the module marks that are missing for that student. In order to exploit this correlation and thus “recover” information on missing marks, we must fit a model to all the module marks jointly. Missing data problems of this type can often be handled successfully by fitting a mixed model.
Model description
We are going to fit the following explanatory effects for the marks as a starting point (a notational sketch follows the list):
- Module as fixed effect (different modules have different average marks)
- Program as fixed effect (the average marks in the 4-year and 5-year programs differ)
- Module.Program interaction as fixed effect (the differences between 4-year and 5-year programs differ for different modules)
- Student as random effect (students differ in their average module mark)
- Student as an effect on the variance (students are assumed to differ in the variability of their marks)
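In notation (the symbols are ours, not from the original notes), the full model for mark \(y_i\) of observation \(i\) is roughly
\[ y_i = \beta_{m(i),\,p(i)} + \alpha_{s(i)} + \varepsilon_i, \qquad \alpha_s \sim N(0, \tau^2), \qquad \varepsilon_i \sim N\!\left(0, \phi_{s(i)}^2\right), \]
where \(m(i)\), \(p(i)\) and \(s(i)\) denote the module, programme and student of observation \(i\). The simplified model further down replaces the student-specific standard deviations \(\phi_s\) with a single standard deviation \(s\).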
Data preparation
Start by grabbing the data set here.
The first step is to prepare the data by shaping it and removing students with too few marks (we cannot estimate a student's variance from fewer than two marks; the code below keeps only students with more than two non-missing marks).
Then we remove all the remaining rows with missing marks.
"academicplanexample.csv" |>
read.csv() -> rawdata
rawdata$subject |>
factor(levels = unique(rawdata$subject)) -> SubjectFactor
rawdata |>
by(SubjectFactor, function(d) {
sum(!is.na(d$mark)) > 2
}) -> GoodSubjectsIndicator
rawdata$subject %in% levels(SubjectFactor)[GoodSubjectsIndicator] ->
GoodRows
rawdata |>
subset(GoodRows) |>
na.omit() -> d
d$subject |>
factor(levels = unique(d$subject)) -> SubjectFactor
rm(GoodRows, GoodSubjectsIndicator)
d |>
nrow() -> n
PlanNum <- (d$plan == "A") + 1
d$module |>
factor() |>
as.numeric() -> ModuleNum
SubjectFactor |>
as.numeric() -> SubjNum
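At this point it can help to glance at the prepared objects (a quick check of our own, not in the original write-up):
n                        # number of marks retained
length(unique(SubjNum))  # number of students retained
table(PlanNum)           # marks per programme (2 corresponds to plan "A")
table(ModuleNum)         # marks per module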
Model fitting
We will use Stan, via the rstan package, to implement our models.
library(parallel)
library(rstan)
# Use roughly 75% of the physical cores for running chains in parallel
mycores <- max(1, floor(detectCores(logical = FALSE) * 0.75))
options(mc.cores = mycores)
rstan_options(auto_write = TRUE)  # cache compiled models to avoid recompilation
Full model
// This Stan block defines a mixed effects model with changing variance by subject, by Sean van der Merwe, UFS
data {
int<lower=2> nModule; // number of modules
int<lower=2> nPlan; // number of plans
int<lower=2> nSubj; // number of subjects
int<lower=2> n; // number of observations
int<lower=1, upper=nSubj> subj[n]; // subject membership
int<lower=1, upper=nPlan> plan[n]; // plan membership
int<lower=1, upper=nModule> module[n]; // module membership
real y[n]; // observations
}
parameters {
real<lower=0> phi[nSubj]; // within subject standard deviation
real alpha[nSubj]; // subject random effects
real<lower=0> tau; // between subject standard deviation
real beta[nModule, nPlan]; // module fixed effects
}
transformed parameters {
real mu[n];
real<lower=0> s[n];
for (i in 1:n) {
mu[i] = alpha[subj[i]] + beta[module[i],plan[i]];
s[i] = phi[subj[i]];
}
}
model {
alpha ~ normal(0, tau); // random effects are assumed to have zero mean
for (j in 1:nModule) {
for (k in 1:nPlan) {
beta[j,k] ~ normal(50, 100); // priors on beta parameters
}
}
tau ~ exponential(0.01); // prior on tau parameter
phi ~ exponential(0.01); // prior on phi parameters
y ~ normal(mu , s);
}
generated quantities {
vector[n] log_lik;
for (i in 1:n) {
log_lik[i] = normal_lpdf(y[i] | mu[i] , s[i]);
}
}
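The sampling call below assumes that the Stan program above has been compiled into an object called AcademicModel1 (for instance via a knitr Stan chunk with output.var = "AcademicModel1"). If you are working interactively, one way to compile it is, assuming the code is saved as AcademicModel1.stan:
AcademicModel1 <- stan_model("AcademicModel1.stan")  # compile once; auto_write caches the result
The same applies to AcademicModel2 and AcademicModel3 further down.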
stan_data <- list(n = n, nModule = max(ModuleNum), nPlan = max(PlanNum),
nSubj = max(SubjNum), subj = SubjNum, module = ModuleNum, plan = PlanNum,
y = d$mark)
modelfit1 <- sampling(AcademicModel1, stan_data, pars = c("alpha",
"beta", "tau", "phi", "log_lik", "mu", "s"), iter = 20000, chains = mycores)
Simplified model
Here we have a single variance parameter, as in an ordinary regression model. This is the basic or standard approach.
// This Stan block defines a mixed effects model, by Sean van der Merwe, UFS
data {
int<lower=2> nModule; // number of modules
int<lower=2> nPlan; // number of plans
int<lower=2> nSubj; // number of subjects
int<lower=2> n; // number of observations
int<lower=1, upper=nSubj> subj[n]; // subject membership
int<lower=1, upper=nPlan> plan[n]; // plan membership
int<lower=1, upper=nModule> module[n]; // module membership
real y[n]; // observations
}
parameters {
real<lower=0> s; // within subject standard deviation
real alpha[nSubj]; // subject random effects
real<lower=0> tau; // between subject standard deviation
real beta[nModule, nPlan]; // module fixed effects
}
transformed parameters {
real mu[n];
for (i in 1:n) {
mu[i] = alpha[subj[i]] + beta[module[i],plan[i]];
}
}
model {
alpha ~ normal(0, tau); // random effects are assumed to have zero mean
for (j in 1:nModule) {
for (k in 1:nPlan) {
beta[j,k] ~ normal(50, 100); // priors on beta parameters
}
}
tau ~ exponential(0.01); // prior on tau parameter
s ~ exponential(0.01); // prior on s parameter
y ~ normal(mu , s);
}
generated quantities {
vector[n] log_lik;
for (i in 1:n) {
log_lik[i] = normal_lpdf(y[i] | mu[i] , s);
}
}
modelfit2 <- sampling(AcademicModel2, stan_data, pars = c("alpha",
"beta", "tau", "s", "log_lik", "mu"), iter = 20000, chains = mycores)
No difference model
Here we assume that there is no difference between the programmes, and merely model the module averages.
// This Stan block defines a mixed effects model, by Sean van der Merwe, UFS
data {
int<lower=2> nModule; // number of modules
int<lower=2> nSubj; // number of subjects
int<lower=2> n; // number of observations
int<lower=1, upper=nSubj> subj[n]; // subject membership
int<lower=1, upper=nModule> module[n]; // module membership
real y[n]; // observations
}
parameters {
real<lower=0> s; // within subject standard deviation
real alpha[nSubj]; // subject random effects
real<lower=0> tau; // between subject standard deviation
real beta[nModule]; // module fixed effects
}
transformed parameters {
real mu[n];
for (i in 1:n) {
mu[i] = alpha[subj[i]] + beta[module[i]];
}
}
model {
alpha ~ normal(0, tau); // random effects are assumed to have zero mean
for (j in 1:nModule) {
beta[j] ~ normal(50, 100); // priors on beta parameters
}
tau ~ exponential(0.01); // prior on tau parameter
s ~ exponential(0.01); // prior on s parameter
y ~ normal(mu , s);
}
generated quantities {
vector[n] log_lik;
for (i in 1:n) {
log_lik[i] = normal_lpdf(y[i] | mu[i] , s);
}
}
modelfit3 <- sampling(AcademicModel3, stan_data, pars = c("alpha",
"beta", "tau", "s", "log_lik", "mu"), iter = 20000, chains = mycores)
Model comparison
Here the models are compared using approximate leave-one-out cross-validation (PSIS-LOO, via the loo package), followed by a DIC calculation.
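The loo package estimates the expected log pointwise predictive density under leave-one-out cross-validation,
\[ \mathrm{elpd}_{\mathrm{loo}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}), \]
approximated by Pareto-smoothed importance sampling applied to the pointwise log-likelihood draws, which is why each model's generated quantities block stores log_lik.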
library(loo)
list(M1 = modelfit1, M2 = modelfit2, M3 = modelfit3) -> models
models |>
  lapply(function(model) {
    model |>
      extract_log_lik(merge_chains = FALSE) -> log_lik  # iterations x chains x observations
    log_lik |>
      exp() |>
      relative_eff(cores = mycores) -> r_eff  # relative efficiency of the likelihood draws
    log_lik |>
      loo(r_eff = r_eff, cores = mycores)  # PSIS-LOO for this model
  }) |>
  loo_compare() |>
  print(simplify = FALSE)
## elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
## M2 0.0 0.0 -5427.3 34.7 176.4 7.9 10854.6 69.5
## M3 -19.0 7.6 -5446.3 34.0 167.8 7.4 10892.6 67.9
## M1 -62.8 21.0 -5490.1 31.3 303.1 11.5 10980.2 62.6
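As an additional comparison, the DIC is computed by hand from the posterior draws. Writing \(\bar{D}\) for the posterior mean deviance and \(D(\bar\theta)\) for the deviance evaluated at the posterior means of \(\mu_i\) and \(s_i\), the code below computes
\[ \mathrm{DIC} = D(\bar\theta) + 2\,p_D, \qquad p_D = \bar{D} - D(\bar\theta). \]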
models |>
  lapply(extract) -> list_of_draws
list_of_draws |>
  sapply(function(draws) {
    # Deviance at the posterior means of mu and s
    Dthetabar <- sum(dnorm(stan_data$y, colMeans(draws$mu), colMeans(as.matrix(draws$s)),
      log = TRUE)) * (-2)
    # Posterior mean deviance
    mean(draws$log_lik |>
      rowSums() * (-2)) -> muD
    # DIC = D(theta_bar) + 2 * p_D, where p_D = muD - Dthetabar
    Dthetabar + 2 * (muD - Dthetabar)
  }) -> DIC
cat("\n\n---\n\n### DIC Comparison\n\nDIC1 =", DIC[1], ", DIC2 =",
  DIC[2], ", and DIC3=", DIC[3], "\n\nTherefore we conclude that model",
  which.min(DIC), "is most parsimonious.\n\n---\n\n")
DIC Comparison
DIC1 = 10851.07 , DIC2 = 10833.11 , and DIC3= 10873.62
Therefore we conclude that model 2 is most parsimonious.
Analysis and conclusions
Firstly, we note that the model with changing variance may be overfitting slightly. We are not certain that this will be true for future samples, but it is clear that for this sample the single variance model is more parsimonious.
Further, we note that the model with differences between the programmes is more parsimonious, suggesting that there is a programme effect on the whole. It is important to consider the big picture first, to avoid spurious conclusions when looking at individual module differences.
Thus, we will investigate programme differences using the single variance (standard) mixed model with different means for each programme.
First we will consider the differences visually, and then using intervals and probabilities. The key point is that we are looking at the posterior distribution of the programme differences on a per-module basis.
# Posterior draws of the per-module difference in mean mark: plan 2 ("A") minus plan 1
diffs <- list_of_draws[[2]]$beta[, , 2] - list_of_draws[[2]]$beta[, , 1]
d$module |>
  factor() |>
  levels() -> moduleNames
library(plotly)
colnames(diffs) <- moduleNames
# Box plots of the posterior programme differences, one box per module
diffs |>
  data.frame() |>
  tidyr::pivot_longer(everything(), names_to = "Module", values_to = "Difference") |>
  plot_ly(y = ~Difference, type = "box", color = ~Module)
# Two-sided posterior "p-value": twice the smaller tail probability relative to the target value
pvalfunc <- function(sims, target = 0) {
  2 * min(mean(sims < target), mean(sims > target))
}
# Label a p-value as significant or not at level alpha
sigfunc <- function(pvalue, alpha = 0.05) {
  c("significant", "not significant")[c(pvalue <= alpha, pvalue > alpha)]
}
# Shortest interval containing a proportion (1 - alpha) of the posterior simulations
shortestinterval <- function(postsims, alpha = 0.05) {
  # Coded by Sean van der Merwe, UFS
  postsims |>
    sort() -> sorted.postsims
  round(length(postsims) * (1 - alpha)) -> gap
  sorted.postsims |>
    diff(gap) |>
    which.min() -> pos
  sorted.postsims[c(pos, pos + gap)]
}
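As a quick sanity check of the interval function (our addition), applying it to standard normal draws should give roughly the familiar \(\pm 1.96\) limits:
set.seed(1)
shortestinterval(rnorm(1e5))  # approximately -1.96 and 1.96 for the default 95% level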
pvals <- apply(diffs, 2, pvalfunc)
ints <- apply(diffs, 2, shortestinterval)  # 95% shortest intervals per module
postmean <- apply(diffs, 2, mean)
postmedian <- apply(diffs, 2, median)
results <- data.frame(Module = moduleNames, lower = ints[1, ], median = postmedian,
  mean = postmean, upper = ints[2, ], p_value = pvals, adj_p_value = p.adjust(pvals),
  Significance = sapply(p.adjust(pvals), sigfunc), row.names = NULL)
# openxlsx::write.xlsx(list(ModuleDifferences=results),
# 'AcademicPlanResults.xlsx')
results[, 2:7] <- round(results[, 2:7], 3)
knitr::kable(results)
Module | lower | median | mean | upper | p_value | adj_p_value | Significance |
---|---|---|---|---|---|---|---|
FAM114 | 2.652 | 6.807 | 6.813 | 10.962 | 0.001 | 0.009 | significant |
ILR114 | 2.221 | 5.981 | 5.984 | 9.828 | 0.002 | 0.014 | significant |
ILR124 | 3.972 | 7.867 | 7.869 | 11.734 | 0.000 | 0.001 | significant |
PSN124 | 0.674 | 4.737 | 4.738 | 8.847 | 0.024 | 0.024 | significant |
RGK114 | 1.067 | 5.020 | 5.015 | 8.728 | 0.010 | 0.019 | significant |
ROR124 | 2.127 | 6.001 | 5.991 | 9.833 | 0.003 | 0.014 | significant |
RPK112 | 2.199 | 6.096 | 6.100 | 10.143 | 0.003 | 0.014 | significant |
RPK122 | 2.017 | 6.195 | 6.215 | 10.178 | 0.002 | 0.014 | significant |
SFR114 | 5.146 | 9.009 | 9.004 | 13.057 | 0.000 | 0.000 | significant |
SFR124 | 12.697 | 16.864 | 16.855 | 20.846 | 0.000 | 0.000 | significant |
Thus, the difference in marks between students in the two programmes is significant for all modules, even after conservatively adjusting for multiple testing.
Combined effect
We can take a weighted average of the mark differences using the module credits as weights. Note that all calculations are done per simulated parameter set, and the simulations are only summarised at the final stage, for greatest accuracy, since \(E[f(x)]\ne f(E[x])\) in general.
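A tiny illustration of that point (our example, not from the original analysis): for exponential draws with mean 1, the average of the squares is about 2 while the square of the average is about 1.
x <- rexp(1e5)  # E[x] = 1
mean(x^2)       # approximately 2 = E[x^2]
mean(x)^2       # approximately 1 = (E[x])^2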
weights <- as.numeric(substring(moduleNames, nchar(moduleNames)))  # module credits are the last digit of the module code
weights <- weights/sum(weights)  # Weights should sum to 1 for easy application
diffall <- diffs %*% weights  # credit-weighted overall difference, per posterior draw
cat("Optimal 95% interval of overall mark difference:", round(shortestinterval(diffall),
  2), "\n")
## Optimal 95% interval of overall mark difference: 4.65 10.58
# plot(density(diffall), lwd=3, col='purple', xlab='Mark difference', main='')
density(diffall) -> dens
# Posterior density of the credit-weighted overall mark difference
plot_ly(x = dens$x, y = dens$y, type = "scatter", mode = "lines",
  fill = "tozeroy") |>
  layout(xaxis = list(title = "Mark difference"), yaxis = list(title = "Density"))