library(knitr)
opts_chunk$set(echo = TRUE, dev='png', fig.width=7, fig.height=4, out.width = "7in", out.height = "4in")

Multivariate regression is not what you think!

Under the usual regression assumptions:

When you have 1 y and 1 x variable in a regression it is called simple regression or univariate regression (sometimes other things).

When you have 1 y and 2+ x variables in a regression it is called multiple regression.

Both of the above cases are often called ordinary least squares (OLS) regression, since minimising the sum of squared errors produces the same results as maximising the likelihood under this model.
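As a quick sketch of why: under the standard assumption \(y_i = x_i^\top\beta + \varepsilon_i\) with independent \(\varepsilon_i \sim N(0,\sigma^2)\), the log-likelihood is

\[ \ell(\beta,\sigma^2) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top\beta\right)^2, \]

so for any fixed \(\sigma^2\), maximising over \(\beta\) is exactly minimising the sum of squared errors.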

Multivariate regression is when you have 2+ y variables at the same time.

Why haven’t I heard of this before?

Because it is almost always easier and better to split up a multivariate regression into multiple multiple regressions.

For each y variable you do a multiple regression start to finish, before moving on to the next y variable, which gets its own regression, and so on.

Even if you only expect slightly different relationships in each regression, splitting up completely is usually the best approach; and if you expect exactly the same relationships for each y, you can simply stack the data sets underneath each other and run a single regression.
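As a rough sketch of both options, assuming a hypothetical data frame dat with two outcomes y1 and y2 and one predictor x (none of these names come from the example later in this post):

# Option 1: separate multiple regressions, one per outcome
fit_y1 <- lm(y1 ~ x, data = dat)
fit_y2 <- lm(y2 ~ x, data = dat)

# Option 2: if the relationships are assumed identical, stack the outcomes and fit once
stacked <- data.frame(y = c(dat$y1, dat$y2), x = rep(dat$x, 2))
fit_same <- lm(y ~ x, data = stacked)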

So then why would you ever do multivariate regression?

When you can’t separate the regressions, obviously 😉

There are two clear cases for multivariate regression:

  1. When the different regressions need to have some parameters in common and some parameters not in common.
  2. When the different y values that occur together have a strict structure or joint boundaries \((e.g.\ y_1+y_2\leq1)\).

Implementing multivariate regression

Even when multivariate regression is fully justified, it can be implemented in a multiple regression framework using clever construction of dummy variables and penalty functions. Care must be taken with the error variance, though, as OLS does not allow the error variance to differ across outcomes without modification.
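For instance, here is a minimal sketch of the dummy-variable idea, again using the hypothetical dat with outcomes y1, y2 and predictor x: stack the outcomes, add an indicator for which outcome each row belongs to, and interact it with the predictors so the outcomes share or differ in whichever coefficients you choose. A plain lm() still assumes a single error variance, which is the caveat just mentioned.

# Long format: one row per (observation, outcome) pair, with a dummy marking the outcome
long <- data.frame(
  y       = c(dat$y1, dat$y2),
  x       = rep(dat$x, 2),
  outcome = rep(c("y1", "y2"), each = nrow(dat))
)
# Full interaction gives each outcome its own intercept and slope;
# dropping interaction terms forces those parameters to be shared across outcomes
fit_dummy <- lm(y ~ outcome * x, data = long)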

However, multivariate regression can be implemented directly.

library(tidyverse)
library(parallel)
library(rstan)
# Use roughly 75% of the physical cores (at least one) for parallel chains
mycores <- max(1, floor(detectCores(logical = FALSE) * 0.75))
options(mc.cores = mycores)
rstan_options(auto_write = TRUE)
library(plotly)

Bayesian MANOVA

Let’s consider an obvious example. We load and display the classic iris dataset:

data(iris)
iris |> by(iris$Species, summary)
## iris$Species: setosa
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
##        Species  
##  setosa    :50  
##  versicolor: 0  
##  virginica : 0  
##                 
##                 
##                 
## ------------------------------------------------------------ 
## iris$Species: versicolor
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
## ------------------------------------------------------------ 
## iris$Species: virginica
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    : 0  
##  versicolor: 0  
##  virginica :50  
##                 
##                 
## 
iris |> plot_ly(x = ~Petal.Length, y = ~Sepal.Length, z = ~Sepal.Width, size = ~sqrt(Petal.Width), color = ~Species, type = 'scatter3d', mode = 'markers')

Now we ask the question, “Are the species different on average?”

As discussed earlier, the typical way to address this is to consider one measurement at a time, possibly remembering to adjust for multiple testing 😉. When possible, it is better to do a multivariate test first and individual tests second.

In a Bayesian analysis we don’t test the same way as a frequentist hypothesis test; we either do a model comparison, or calculate a probability associated with a region of practical equivalence (ROPE).

In this case we will build a model where we fit a single multivariate normal distribution to the values as if they all came from one species, and models where each species gets its own distribution, and then compare them.

data {
  int n; // number of observations
  int k; // number of dimensions
  vector[k] y[n];
}
parameters {
  row_vector[k] mus; // expected values
  corr_matrix[k] rhomat;        // correlations
  vector<lower=0>[k] sds;      // standard deviations
}
model {
  y ~ multi_normal(mus, quad_form_diag(rhomat, sds));
  rhomat ~ lkj_corr(2);
  sds ~ cauchy(0, 2.5);
  mus ~ normal(0, 100);
}
generated quantities {
  vector[n] log_lik;
  for (i in 1:n) {
    log_lik[i] = multi_normal_lpdf(y[i] | mus, quad_form_diag(rhomat, sds));
  }
}
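The sampling call below assumes the Stan program above has already been compiled into a stanmodel object called OneSet; in an R Markdown document a stan chunk with output.var = "OneSet" does this automatically, or it can be done by hand roughly like this (oneset_code being a string holding the program above):

# Assumed: 'oneset_code' contains the Stan program above as a character string
OneSet <- stan_model(model_code = oneset_code)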
OneSet |> sampling(list(n = nrow(iris), k = 4, y = as.matrix(iris[,1:4]))) -> results1

For the per-species models we will first use the classic MANOVA assumption of a common covariance matrix across all groups:

data {
  int n; // number of observations
  int k; // number of dimensions
  vector[k] y[n];
  int ng; // number of groups
  int g[n]; // group membership 
}
parameters {
  row_vector[k] mus[ng]; // expected values
  corr_matrix[k] rhomat;        // correlations
  vector<lower=0>[k] sds;      // standard deviations
}
model {
  for (i in 1:n) {
    y[i] ~ multi_normal(mus[g[i]], quad_form_diag(rhomat, sds));
  }
  rhomat ~ lkj_corr(2);
  sds ~ cauchy(0, 2.5);
  for (j in 1:ng) {
    mus[j] ~ normal(0, 100);
  }
}
generated quantities {
  vector[n] log_lik;
  for (i in 1:n) {
    log_lik[i] = multi_normal_lpdf(y[i] | mus[g[i]], quad_form_diag(rhomat, sds));
  }
}
# Integer group labels (1, 2, 3) in the order the species first appear in the data
grps <- as.numeric(factor(iris$Species, levels = unique(iris$Species)))
ManyMuOneSigma |> sampling(list(n = nrow(iris), k = 4, y = as.matrix(iris[,1:4]), ng = max(grps), g = grps)) -> results2

Lastly, we can give each group its own covariance structure. This is like modelling each group separately, except that we get model statistics for the whole dataset at once, which allows direct comparison with the previous models.

data {
  int n; // number of observations
  int k; // number of dimensions
  vector[k] y[n];
  int ng; // number of groups
  int g[n]; // group membership 
}
parameters {
  row_vector[k] mus[ng]; // expected values
  corr_matrix[k] rhomat[ng];        // correlations
  vector<lower=0>[k] sds[ng];      // standard deviations
}
model {
  for (i in 1:n) {
    y[i] ~ multi_normal(mus[g[i]], quad_form_diag(rhomat[g[i]], sds[g[i]]));
  }
  for (j in 1:ng) {
    mus[j] ~ normal(0, 100);
    rhomat[j] ~ lkj_corr(2);
    sds[j] ~ cauchy(0, 2.5);
  }
}
generated quantities {
  vector[n] log_lik;
  for (i in 1:n) {
    log_lik[i] = multi_normal_lpdf(y[i] | mus[g[i]], quad_form_diag(rhomat[g[i]], sds[g[i]]));
  }
}
ManySets |> sampling(list(n = nrow(iris), k = 4, y = as.matrix(iris[,1:4]), ng = max(grps), g = grps), chains = mycores) -> results3
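Before comparing the models it is worth a quick look at the fits themselves; a minimal check (not a full convergence workup) might be:

# Posterior summaries (including Rhat) for the group means under the shared-covariance model
print(results2, pars = "mus")
# The same for the means and standard deviations of the fully separate model
print(results3, pars = c("mus", "sds"))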

Here the models are compared using cross-validated information criteria, specifically approximate leave-one-out cross-validation via the loo package.

fits <- list(OneSet = results1, ManyMuOneSigma = results2, ManySets = results3)
library(loo)
fits |> lapply(\(fit) {extract_log_lik(fit, merge_chains = FALSE)}) -> log_lik
log_lik |> lapply(\(ll) {relative_eff(exp(ll), cores = mycores)}) -> r_eff
fits |> length() |> seq_len() |> 
  lapply(\(i) {loo(log_lik[[i]], r_eff = r_eff[[i]], cores = mycores)}) |> 
  loo_compare() -> comparison
# Map loo_compare's "model1", "model2", ... labels back to the fit names
rownames(comparison) <- names(fits)[as.integer(sub("model", "", rownames(comparison)))]
comparison |> kable(digits = 1)
|               | elpd_diff| se_diff| elpd_loo| se_elpd_loo| p_loo| se_p_loo| looic| se_looic|
|:--------------|---------:|-------:|--------:|-----------:|-----:|--------:|-----:|--------:|
|ManySets       |       0.0|     0.0|    -68.1|        21.9|  39.6|      3.9| 136.1|     43.8|
|ManyMuOneSigma |     -54.1|    11.5|   -122.2|        21.7|  23.4|      2.4| 244.4|     43.4|
|OneSet         |    -326.2|    16.0|   -394.2|        17.3|  13.5|      1.4| 788.5|     34.6|

Looking at the model comparison results, it seems we have convincing evidence that the species differ both on average and in covariance structure.

Note on covariance matrices

The decomposition used above is just one of many approaches to the difficult problem of modelling covariance matrices in general. Covariance matrices have strong restrictions on their structure (being positive definite implies a lot).

The idea is that by separating the covariance matrix into a correlation matrix and a set of standard deviations, we can model each part separately using well-established approaches.

The implementations above can be further streamlined by using the Cholesky decomposed versions of each, and then reconstructing the matrices in their original forms in a generated quantities block.
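As a small R illustration of the decomposition, using made-up numbers rather than anything estimated above: quad_form_diag(rhomat, sds) in the Stan code corresponds to diag(sds) %*% rhomat %*% diag(sds), and the Cholesky route simply rebuilds the same matrices from their factors.

# Made-up 2x2 correlation matrix and standard deviations, purely for illustration
R   <- matrix(c(1, 0.5,
                0.5, 1), nrow = 2)
sds <- c(2, 3)

# Covariance built from correlation + standard deviations (mirrors quad_form_diag)
Sigma <- diag(sds) %*% R %*% diag(sds)

# Cholesky factor of the correlation matrix and reconstruction of the originals
L_R        <- t(chol(R))            # chol() returns the upper factor, so transpose for lower
R_back     <- L_R %*% t(L_R)        # recovers R
Sigma_back <- diag(sds) %*% R_back %*% diag(sds)  # recovers Sigma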

Generalising the model

The models shown so far can be generalised in many ways to allow for other kinds of multivariate regression.

For example, we could replace mus with regression equations.
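As a sketch of that generalisation in notation only (not tied to any particular implementation): instead of one mean vector per group, each observation gets its own mean built from its predictors,

\[ y_i \sim \mathcal{N}_k\!\left(\mu_i, \Sigma\right), \qquad \mu_i = \alpha + B\,x_i, \]

where \(\alpha\) is a vector of \(k\) intercepts, \(x_i\) holds the \(p\) predictors for observation \(i\), and \(B\) is a \(k \times p\) matrix of coefficients, some of which could be shared or constrained across outcomes.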

Test yourself: modify the models above to test the hypothesis that the column averages are in the ratio 12:6:8:3.