What are we doing and why?
We are going to look forward. We want to predict the next thing. Companies want to know what’s going to happen with the next client, the next month’s profits, the next deal, etc.
we are thus very much in the data science paradigm. However, a lot of data scientists are narrow in their thinking: they just want to predict the values as close as they can. They don’t understand that their predictions are going to be wrong and that there is value in understanding the uncertainty.
As statisticians we expect more from you. We expect you to understand uncertainty even in complicated models. One way to understand uncertainty is to look at intervals. The three most common types of intervals are:
- Confidence intervals: which say that if you repeat an experiment many times then the interval produced each time will cover the true value of a parameter proportion \((1-\alpha)\) of the time.
- Credibility intervals: which say that the true value of a parameter will lie inside the interval with probability \((1-\alpha)\).
- Prediction intervals: which say that the interval is expected to cover a new, unobserved value proportion \((1-\alpha)\) of the time.
Specific example
In this example we have two test marks and a final mark. The goal is to predict, with uncertainty, the final mark of a new student.
Read data
First we read in the data and create named variables to work with. Download data here.
mydata <- read.csv('markregressionexample.csv')
names(mydata)
## [1] "Student" "Test1" "Test2" "Final"
attach(mydata)
Transformation of y
In this case the final marks are between 0 and 100. If we just do an ordinary regression then it fails to account for the barriers at 0 and 100, and will create intervals that don’t make sense.
Let’s do logit transform of final marks and then use OLS. The qlogis function in R makes this easier.
ytrans <- qlogis(Final/100)
(s1 <- summary(model1 <- lm(ytrans~Test1+Test2)))
##
## Call:
## lm(formula = ytrans ~ Test1 + Test2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4884 -0.1818 -0.0311 0.1263 0.5871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.472809 0.161550 -9.117 2.59e-11 ***
## Test1 0.015412 0.003135 4.916 1.55e-05 ***
## Test2 0.019497 0.002686 7.258 8.22e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2645 on 40 degrees of freedom
## Multiple R-squared: 0.8042, Adjusted R-squared: 0.7944
## F-statistic: 82.13 on 2 and 40 DF, p-value: 6.873e-15
Prediction for new observation
Now we bring in our new observation and make a prediction for it. We will do it manually to aid in understanding the model:
\(y_i\sim N(x_i\beta,\sigma^2)\), where \(y_i=log\left(\frac{Final/100}{1-Final/100}\right)\).
xnew <- c(1,97,98)
BetaHat <- coef(model1)
yhat <- BetaHat%*%xnew
(prediction <- plogis(yhat)*100)
## [,1]
## [1,] 87.35652
Always remember to reverse any transformation done at the start when you present your final predictions. In this case we used the plogis function in R in place of \(\frac{exp(x)}{1+exp(x)}\).
Prediction intervals
In order to create an interval we must account for both the uncertainty in coefficient \((\beta)\) and the uncertainty in errors \((\sigma^2)\). We do this via simulation as a general solution.
We make use of the assumption that \(\hat{\beta}\sim N_p(\beta,\Sigma_\beta)\). If you’re a Bayesian then just move the hat, the result is essentially the same.
SigmaBeta <- vcov(model1)
library(MASS)
acc <- 7500
simbetas <- mvrnorm(acc,BetaHat,SigmaBeta)
Eyhats <- t(t(xnew) %*% t(simbetas))
yhats <- rnorm(acc,Eyhats,s1$sigma)
preds <- plogis(yhats)*100
quantile(preds,c(0.025,0.975))
## 2.5% 97.5%
## 79.55136 92.36833
Note that if X is in rows in a matrix then don’t transpose it.
GLM
As an alternative, let’s consider the marks to be Binomial values instead. We then use a GLM formulation to fit the model.
y <- Final/100
n <- length(y)
summary(model2 <- glm(y~Test1+Test2,family=binomial(link='logit'),weights = rep(100,n)))
##
## Call:
## glm(formula = y ~ Test1 + Test2, family = binomial(link = "logit"),
## weights = rep(100, n))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3035 -0.7858 -0.1588 0.7010 2.3984
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.410660 0.131578 -10.721 < 2e-16 ***
## Test1 0.015180 0.002483 6.113 9.77e-10 ***
## Test2 0.018345 0.002150 8.530 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 287.795 on 42 degrees of freedom
## Residual deviance: 58.146 on 40 degrees of freedom
## AIC: 276.15
##
## Number of Fisher Scoring iterations: 4
Since no global test of lack of fit is done, we do it as a second step. We fit the intercept model and then compare the deviances of the two models.
model0 <- glm(y~1,family=binomial(link='logit'),weights = rep(100,n))
anova(model0,model2,test='Chisq')
## Analysis of Deviance Table
##
## Model 1: y ~ 1
## Model 2: y ~ Test1 + Test2
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 42 287.795
## 2 40 58.146 2 229.65 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The prediction steps are different depending on the specific model.
Here the model predicts the expected proportion \((p_i)\) where \(y_i\sim Binomial(100,p_i)\). Thus, after obtaining these predictions we use rbinom instead of rnorm to simulate new values. Note that the transformation occurs before the final simulation step.
SigmaBeta <- vcov(model2)
simbetas <- mvrnorm(acc,BetaHat,SigmaBeta)
Eyhats <- plogis(t(t(xnew) %*% t(simbetas)))
preds <- rbinom(acc,100,Eyhats)
quantile(preds,c(0.025,0.975))
## 2.5% 97.5%
## 80 94