library(knitr)
opts_chunk$set(echo = TRUE, dev='png', fig.width=7, fig.height=4, out.width = "7in", out.height = "4in")
Under the usual regression assumptions:
When you have 1 y and 1 x variable in a regression, it is called simple regression or univariate regression (sometimes other things).
When you have 1 y and 2+ x variables in a regression, it is called multiple regression.
Both of the above cases are often called ordinary least squares (OLS) regression, since, under normally distributed errors, minimising the sum of squared errors produces the same estimates as maximising the likelihood of the model.
Multivariate regression is when you have 2+ y variables at the same time.
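In R's lm() the distinction is purely in the formula. A minimal sketch using the built-in mtcars data (the variable choices are arbitrary, purely to illustrate the terminology):
# Simple (univariate) regression: one response, one predictor
fit_simple <- lm(mpg ~ wt, data = mtcars)
# Multiple regression: one response, several predictors
fit_multiple <- lm(mpg ~ wt + hp + disp, data = mtcars)
# Multivariate regression: several responses modelled at once
fit_multivariate <- lm(cbind(mpg, qsec) ~ wt + hp, data = mtcars)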
It is almost always easier and better to split a multivariate regression up into several separate multiple regressions.
For each y variable you do a multiple regression from start to finish, before moving on to the next y variable, which gets its own regression, and so on.
Even if you only expect slightly different relationships in each regression, splitting up completely is usually the best approach; and if you expect exactly the same relationships for each y, you can simply stack the data sets underneath each other and do a single regression.
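A minimal sketch of that stacking idea, again with arbitrary mtcars variables chosen purely to show the mechanics (tidyr's pivot_longer does the stacking):
library(tidyr)
# Stack two responses into one long data set
long <- mtcars |>
  pivot_longer(c(mpg, qsec), names_to = "response", values_to = "value")
# A single regression on the stacked data: one wt slope shared by both responses
fit_stacked <- lm(value ~ wt, data = long)
coef(fit_stacked)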
So when do you actually need multivariate regression? When you can’t separate the regressions, obviously 😉
There are two clear cases for multivariate regression:
Even when multivariate regression is fully justified, it can be implemented in a multiple regression framework using clever construction of dummy variables and penalty functions. Care must be taken with the error variance though, as OLS does not allow for changing variances without modification.
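A rough sketch of the dummy-variable part of that construction, reusing the stacked long data from the sketch above: interacting the predictor with an indicator of which response each row came from gives every response its own intercept and slope, while lm() still assumes a single residual variance, which is exactly the caveat just mentioned.
# Each response gets its own intercept and wt slope via the interaction,
# but the residual variance is still shared across the stacked responses
fit_dummies <- lm(value ~ response * wt, data = long)
coef(fit_dummies)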
However, multivariate regression can also be implemented directly, which is the approach taken below.
library(tidyverse)
library(parallel)
library(rstan)
mycores <- max(1, floor(detectCores(logical = FALSE) * 0.75)) # use ~75% of the physical cores
options(mc.cores = mycores) # run MCMC chains in parallel
rstan_options(auto_write = TRUE) # cache compiled Stan models
library(plotly)
Let’s consider an obvious example: the classic iris dataset. We load it and summarise it by species:
data(iris)
iris |> by(iris$Species, summary)
## iris$Species: setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
## Species
## setosa :50
## versicolor: 0
## virginica : 0
##
##
##
## ------------------------------------------------------------
## iris$Species: versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
## Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
## ------------------------------------------------------------
## iris$Species: virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
## Species
## setosa : 0
## versicolor: 0
## virginica :50
##
##
##
iris |> plot_ly(x = ~Petal.Length, y = ~Sepal.Length, z = ~Sepal.Width, size = ~sqrt(Petal.Width), color = ~Species, type = 'scatter3d', mode = 'markers')
Now we ask the question, “Are the species different on average?”
As discussed earlier, the typical way to address this is to consider one measurement at a time, possibly remembering to adjust for multiple testing 😉. When possible, it is better to do a multivariate test first and individual tests second.
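For reference, the classical frequentist version of that workflow is a MANOVA followed by per-measurement ANOVAs; a quick sketch with base R (not part of the Bayesian analysis below):
# Multivariate test first: do the species differ on any of the four measurements?
fit_manova <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species, data = iris)
summary(fit_manova) # overall multivariate test (Pillai's trace by default)
summary.aov(fit_manova) # individual per-measurement follow-up tests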
In a Bayesian analysis we don’t test the same way as a frequentist hypothesis test; we either do a model comparison, or calculate a probability associated with a region of practical equivalence (ROPE).
In this case we will build three models and compare them: one where we fit a single multivariate normal distribution to the measurements as if they were all one species, one where each species gets its own mean vector but the covariance matrix is shared, and one where each species gets its own mean vector and covariance matrix. First, the single-distribution model:
data {
int n; // number of observations
int k; // number of dimensions
vector[k] y[n];
}
parameters {
row_vector[k] mus; // expected values
corr_matrix[k] rhomat; // correlations
vector<lower=0>[k] sds; // standard deviations
}
model {
y ~ multi_normal(mus, quad_form_diag(rhomat, sds));
rhomat ~ lkj_corr(2);
sds ~ cauchy(0, 2.5);
mus ~ normal(0, 100);
}
generated quantities {
vector[n] log_lik;
for (i in 1:n) {
log_lik[i] = multi_normal_lpdf(y[i] | mus, quad_form_diag(rhomat, sds));
}
}
OneSet |> sampling(list(n = nrow(iris), k = 4, y = as.matrix(iris[,1:4]))) -> results1
Next we allow each species its own mean vector, while keeping the classic MANOVA assumption of a common covariance matrix across groups:
data {
int n; // number of observations
int k; // number of dimensions
vector[k] y[n];
int ng; // number of groups
int g[n]; // group membership
}
parameters {
row_vector[k] mus[ng]; // expected values
corr_matrix[k] rhomat; // correlations
vector<lower=0>[k] sds; // standard deviations
}
model {
for (i in 1:n) {
y[i] ~ multi_normal(mus[g[i]], quad_form_diag(rhomat, sds));
}
rhomat ~ lkj_corr(2);
sds ~ cauchy(0, 2.5);
for (j in 1:ng) {
mus[j] ~ normal(0, 100);
}
}
generated quantities {
vector[n] log_lik;
for (i in 1:n) {
log_lik[i] = multi_normal_lpdf(y[i] | mus[g[i]], quad_form_diag(rhomat, sds));
}
}
grps <- as.numeric(factor(iris$Species, levels = unique(iris$Species))) # setosa = 1, versicolor = 2, virginica = 3
ManyMuOneSigma |> sampling(list(n = nrow(iris), k = 4, y = as.matrix(iris[,1:4]), ng = max(grps), g = grps)) -> results2
Lastly, we let each group have its own covariance structure. This is like modelling each group separately, except that we get model statistics for the whole dataset at once, which allows direct comparison with the previous models.
data {
int n; // number of observations
int k; // number of dimensions
vector[k] y[n];
int ng; // number of groups
int g[n]; // group membership
}
parameters {
row_vector[k] mus[ng]; // expected values
corr_matrix[k] rhomat[ng]; // correlations
vector<lower=0>[k] sds[ng]; // standard deviations
}
model {
for (i in 1:n) {
y[i] ~ multi_normal(mus[g[i]], quad_form_diag(rhomat[g[i]], sds[g[i]]));
}
for (j in 1:ng) {
mus[j] ~ normal(0, 100);
rhomat[j] ~ lkj_corr(2);
sds[j] ~ cauchy(0, 2.5);
}
}
generated quantities {
vector[n] log_lik;
for (i in 1:n) {
log_lik[i] = multi_normal_lpdf(y[i] | mus[g[i]], quad_form_diag(rhomat[g[i]], sds[g[i]]));
}
}
ManySets |> sampling(list(n = nrow(iris), k = 4, y = as.matrix(iris[,1:4]), ng = max(grps), g = grps), chains = mycores) -> results3
Finally, the models are compared using approximate leave-one-out cross-validation (PSIS-LOO) from the loo package.
fits <- list(OneSet = results1, ManyMuOneSigma = results2, ManySets = results3)
library(loo)
fits |> lapply(\(fit) {extract_log_lik(fit, merge_chains = FALSE)}) -> log_lik
log_lik |> lapply(\(ll) {relative_eff(exp(ll), cores = mycores)}) -> r_eff
fits |> length() |> seq_len() |>
lapply(\(i) {loo(log_lik[[i]], r_eff = r_eff[[i]], cores = mycores)}) |>
loo_compare() -> comparison
# loo_compare() labels its rows model1, model2, ... by input position; map them back to our model names
rownames(comparison) <- names(fits)[as.integer(sub("model", "", rownames(comparison)))]
comparison |> kable(digits = 1)
|               | elpd_diff| se_diff| elpd_loo| se_elpd_loo| p_loo| se_p_loo| looic| se_looic|
|:--------------|---------:|-------:|--------:|-----------:|-----:|--------:|-----:|--------:|
|ManySets       |       0.0|     0.0|    -68.1|        21.9|  39.6|      3.9| 136.1|     43.8|
|ManyMuOneSigma |     -54.1|    11.5|   -122.2|        21.7|  23.4|      2.4| 244.4|     43.4|
|OneSet         |    -326.2|    16.0|   -394.2|        17.3|  13.5|      1.4| 788.5|     34.6|
Looking at the model comparison results, we have convincing evidence that the species differ both in their mean vectors and in their covariance structures: the model with separate means and covariances has the best expected predictive performance, and the elpd differences are large relative to their standard errors.
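The ROPE approach mentioned earlier can also be applied to the fitted model. A minimal sketch using the posterior draws of the group means from results3, with an arbitrary, purely illustrative equivalence region of ±0.1:
# Posterior draws of the group mean vectors: iterations x group x measurement
draws <- rstan::extract(results3, pars = "mus")$mus
# Difference in mean Sepal.Length between versicolor (group 2) and setosa (group 1)
diff_sl <- draws[, 2, 1] - draws[, 1, 1]
# Posterior probability that the difference lies inside a ROPE of +/- 0.1
mean(abs(diff_sl) < 0.1)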
The decomposition used above is but one of many approaches to the difficult problem of modelling covariance matrices in general. Covariance matrices have strong restrictions on their structure (being positive definite constrains them a great deal).
The idea is that by separating the covariance matrix into a correlation matrix and a set of standard deviations, we can model each part separately using well-established approaches for each.
The implementations above can be further streamlined by working with the Cholesky factors of the correlation matrices (via lkj_corr_cholesky and multi_normal_cholesky), and then reconstructing the matrices in their original forms in a generated quantities block.
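As a small R illustration of that decomposition (the numbers are made up for the example): a covariance matrix can be rebuilt from a correlation matrix and a vector of standard deviations, which is what quad_form_diag does in the Stan code above, and the same construction works on the Cholesky factors.
# A correlation matrix and standard deviations chosen purely for illustration
R <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
sds <- c(2, 3)
# Reconstruct the covariance matrix: Sigma = diag(sds) %*% R %*% diag(sds)
Sigma <- diag(sds) %*% R %*% diag(sds)
# The same relationship holds on the Cholesky scale:
# diag(sds) %*% t(chol(R)) is a Cholesky factor of Sigma (t() because chol() returns the upper triangle)
L <- diag(sds) %*% t(chol(R))
all.equal(L %*% t(L), Sigma)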
The models shown so far can be generalised in many ways to allow for other kinds of multivariate regression.
For example, we could replace mus with regression equations.
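A rough sketch of what that might look like on the R side, assuming a hypothetically modified Stan model in which mus is replaced by a linear predictor; the predictor choice here is arbitrary and the Stan changes are only described in the comments, not shown:
# Design matrix for the linear predictor
X <- model.matrix(~ Petal.Width, data = iris)
# Data for a hypothetical modified model that declares matrix[n, p] X; in the data block,
# matrix[p, k] beta; in the parameters block, and uses y[i] ~ multi_normal(X[i] * beta, ...)
# in place of mus[g[i]]
stan_data <- list(n = nrow(iris), k = 3, y = as.matrix(iris[, 1:3]), p = ncol(X), X = X)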
Test yourself: modify the models above to test the hypothesis that the column averages are in the ratio 12:6:8:3.