2021
Rules or sayings passed down through generations that, to quote Wikipedia, contain “unverified claims with exaggerated and/or inaccurate details.”
Example from my mother: “Don’t pull faces, if the wind changes then your face will stay that way.”
Recently debunked example: “Don’t swim after eating - you can get stomach cramps.”
Often they are created for good reasons, like parents wanting to rest after eating rather than having to watch their kids in the pool.
And often those reasons are lost over time, leaving only a rule without context. Then when the context changes the rule stops being valid without people realising it.
In statistics we also have many rules that are sometimes taught as hard rules, but were actually never meant to be more than temporary rules of thumb.
Or rules that were actually twisted over time from good intentions, through misunderstandings, to end up being quite wrong or bad practice.
And of course there are rules created in the days before cheap computing power and Big Data, when the authors never conceived of the idea that I’d have a personal computer at home that can perform 100,000 t-tests per second.
But I will show you now that fussing over the equal variance assumption of the t-test is a waste of time and effort.
Just do the unequal variance test every time.
Let’s look at the power of the t-test in both cases via a simulation study. This is not a particularly neat example of a simulation study, but it will do for this presentation.
alpha <- 0.05

# Estimated power of the equal variance t-test: the proportion of 10000 simulated
# two-sample tests (true mean difference mudif, sd 1, n per group) with p < alpha
pwrequal <- function(mudif, n) {
  mean(replicate(10000,
    t.test(rnorm(n, mudif, 1), rnorm(n), var.equal = TRUE)$p.value) < alpha)
}

# The same, but with the default unequal variance (Welch) t-test
pwr <- function(mudif, n) {
  mean(replicate(10000, t.test(rnorm(n, mudif, 1), rnorm(n))$p.value) < alpha)
}

# Power curves over a grid of mean differences, for n = 15, 30, 60 per group
mudif <- seq(0, 2, l = 101)
pwrcurve15 <- sapply(mudif, pwr, n = 15)
pwrequalcurve15 <- sapply(mudif, pwrequal, n = 15)
pwrcurve30 <- sapply(mudif, pwr, n = 30)
pwrequalcurve30 <- sapply(mudif, pwrequal, n = 30)
pwrcurve60 <- sapply(mudif, pwr, n = 60)
pwrequalcurve60 <- sapply(mudif, pwrequal, n = 60)
# ...
Essentially no power is lost by doing the unequal variance test on equal variance data, while accuracy is gained when the variances really are unequal.
If the sample sizes are not the same, especially if we double the sample size of the group with the smaller variance, then the unequal variance test is much better.
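To illustrate, here is a rough sketch with assumed numbers (twice as many observations in the low-variance group, standard deviations 1 and 3, and no true mean difference), comparing the type I error rates of the two tests:

```r
# Sketch: type I error with unequal variances and unequal sample sizes.
# The low-variance group has twice the observations; the true mean difference is 0.
set.seed(1)
p_equal <- replicate(10000,
  t.test(rnorm(30, 0, 1), rnorm(15, 0, 3), var.equal = TRUE)$p.value)
p_welch <- replicate(10000,
  t.test(rnorm(30, 0, 1), rnorm(15, 0, 3))$p.value)
mean(p_equal < 0.05)  # far above the nominal 5%
mean(p_welch < 0.05)  # close to 5%
```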
Tests for normality don’t test for normality.
They assume normality and test for evidence to reject it.
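As a quick illustration (with assumed example data: small samples from an exponential distribution), a failure to reject says very little:

```r
# Sketch: Shapiro-Wilk on small, clearly non-normal (exponential) samples.
# "Not rejected" here is absence of evidence, not evidence of normality.
set.seed(3)
mean(replicate(5000, shapiro.test(rexp(10))$p.value) > 0.05)
# a large share of these skew samples "pass" the normality test
```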
Another commonly taught rule is that if you reject normality then you should do a Wilcoxon/Mann-Whitney U-test, which is often called the non-parametric alternative to the t-test.
However, the U-test has a very different null hypothesis (roughly, that a value drawn from one group is equally likely to be larger or smaller than a value drawn from the other), and so it is not an equivalent test.
To quote Wikipedia, the null hypothesis is that …
The non-parametric equivalent to the t-test is the bootstrap t-test.
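As a sketch of what that looks like (one common formulation, resampling after shifting both groups onto the pooled mean so that the null hypothesis holds; `boot_t_test` is just an illustrative name):

```r
# Sketch of a bootstrap two-sample t-test: resample each group under H0
# (groups centred on the pooled mean) and compare the Welch t statistic
# of the resamples with the observed one.
boot_t_test <- function(x, y, B = 10000) {
  t_obs <- t.test(x, y)$statistic
  pooled <- mean(c(x, y))
  x0 <- x - mean(x) + pooled
  y0 <- y - mean(y) + pooled
  t_boot <- replicate(B,
    t.test(sample(x0, replace = TRUE), sample(y0, replace = TRUE))$statistic)
  mean(abs(t_boot) >= abs(t_obs))  # two-sided bootstrap p-value
}

set.seed(4)
boot_t_test(rexp(20), rexp(25, rate = 0.8))
```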
Given hypotheses \(H_0\) and \(H_1\), and a test statistic \(S\) with observed value \(s\), we can define a p-value as
\[p=P(S\geq s | H_0)\]
From Bayes’ Theorem we have that
\[P(H_1|s)=\frac{P(s|H_1)P(H_1)}{P(s|H_1)P(H_1)+P(s|H_0)P(H_0)}\]
There is no way to directly relate these quantities, even with extreme assumptions, and yet treating the p-value as if it told us \(P(H_0|s)\) or \(P(H_1|s)\) is exactly what people love to do.
End result: a lot of published research is not reproducible (over 40% in some psychology journals at one stage).
If you do 100 hypothesis tests (I’ve seen more than 1000 in a single PhD) then the probability of at least 1 false positive is
\[1-\left[(1-0.05)^{100}\right]>99.4\%\]
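A quick check by simulation (all 100 null hypotheses true, independent tests, assumed sample size of 20 per group):

```r
# Sketch: empirical probability of at least one "significant" result
# among 100 t-tests on pure noise, repeated 2000 times.
set.seed(5)
mean(replicate(2000,
  any(replicate(100, t.test(rnorm(20), rnorm(20))$p.value < 0.05))))
# close to the theoretical 1 - 0.95^100, about 0.994
```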
People are out there talking a lot of nonsense as a result.
Bonferroni said you could decrease your significance level, but I like the Holm-Bonferroni method more, where you inflate your p-values.
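For example (with assumed raw p-values), base R’s `p.adjust()` gives both versions:

```r
# Sketch: Bonferroni vs Holm adjustment of some assumed raw p-values.
p <- c(0.001, 0.012, 0.030, 0.040, 0.200)
round(cbind(raw        = p,
            bonferroni = p.adjust(p, method = "bonferroni"),
            holm       = p.adjust(p, method = "holm")), 3)
```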
We start with a proper regression, then we add a noise variable, and then we add another 18 noise variables. In theory the regression should get worse, right?
set.seed(92340)
x <- rnorm(30)
y <- 2 * x + 30 + rnorm(30)
m <- s <- vector("list", 3)

# Model 1: the true predictor only
s[[1]] <- summary(m[[1]] <- lm(y ~ x))

# Model 2: add one pure noise variable
X <- cbind(x, rnorm(30))
s[[2]] <- summary(m[[2]] <- lm(y ~ X))

# Model 3: add a further 18 noise variables (20 predictors, 30 observations)
X <- cbind(X, matrix(rnorm(30 * 18), 30, 18))
s[[3]] <- summary(m[[3]] <- lm(y ~ X))
knitr::kable(round(data.frame(R2    = sapply(s, with, r.squared),
                              adjR2 = sapply(s, with, adj.r.squared),
                              AIC   = sapply(m, AIC),
                              BIC   = sapply(m, BIC)), 2))
|   R2 | adjR2 |   AIC |    BIC |
|-----:|------:|------:|-------:|
| 0.71 |  0.70 | 92.44 |  96.64 |
| 0.74 |  0.72 | 91.29 |  96.90 |
| 0.96 |  0.87 | 71.20 | 102.03 |
We see \(R^2\) go up a lot. Further, we see \(Adj.R^2\) and AIC improve, with 2 of the noise variables significant at 5%.
If you have explanatory variables that are very skew (not just a little) then a regression model tends to focus its efforts on the big values, and ignore the nuance in the bulk of the data. In these cases it may help to transform some of them to reduce the skewness and capture more, or just different aspects, of the relationships. Transformations particularly help with correlation analysis.
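A small sketch with assumed data, where the response is really driven by the log of a heavily skewed predictor:

```r
# Sketch: a log transform of a very skew predictor recovers a much
# cleaner (more linear) relationship.
set.seed(6)
x <- rlnorm(200, sdlog = 1.5)       # heavily right-skewed predictor
y <- log(x) + rnorm(200, sd = 0.5)  # response driven by log(x)
cor(x, y)        # modest correlation on the raw scale
cor(log(x), y)   # much stronger after the transformation
```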
The result is that a shocking amount of published research is wrong, or just not reproducible.
Good modern journals now require that data be published openly (where at all possible), and often that code be published too, and to a high standard at that.
People generally don’t like having their work questioned; that is why journals are making these things rules.
The advantages include:

- others can check and reproduce the results,
- mistakes are found and corrected sooner,
- data and code can be reused in later work.
And yet people are already getting so obsessed with these rules that they forget why they were created. Published code helps people follow the details of the methodology, but code is not methodology.
Remember to understand rules and where they come from, don’t just memorise!