November 2020
UFS staff (and to a lesser extent students) are privileged to have access to many premier statistical packages.
The biggest, and probably best, is SAS. Base SAS is a scripting language, like R, although with a vastly different syntax. On campus we also have the fantastic interfaces for it, like Enterprise Guide and the data science interfaces.
If you are doing standard statistics on well organised data and want robust trusted results then consider using SAS Enterprise Guide. It is installed on the computers in the teaching labs, so I recommend you go there and run what you need to. I do not recommend getting it on your computer though, not without good reason. Be sure it works for you before you get it for yourself.
Other options include Mathematica, Stata, Statistica, Matlab, and specialised programs for your field.
I have used all of the packages on this slide at some point. Now I only use R for all my work.
LinkedIn Learning is the official UFS platform for learning things and it has a lot of stuff on R. Much of it is likely to be better than what I’m doing.
The Help menu in RStudio has a ton of useful links to various sources of help and inspiration.
If you click in the Console window and type ? followed by the command you want help with then it will tell you tons about that command. If you only partly remember the command name then type ?? instead to search the help pages.
Try it now, open RStudio if you haven’t yet, click in the console window, and type ?library
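Both forms look like this (the ?? search term below is just an example):

?library      # exact help page for a command whose name you know
??regression  # fuzzy search of all installed help pages for a word you remember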
citation()
## 
## To cite R in publications use:
## 
##   R Core Team (2020). R: A language and environment for statistical
##   computing. R Foundation for Statistical Computing, Vienna, Austria.
##   URL https://www.R-project.org/.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2020},
##     url = {https://www.R-project.org/},
##   }
## 
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.
I use R Markdown for almost everything, including teaching, assessment, consultation, some research, and even making presentations like this one.
To use R Markdown you open RStudio, click File -> New File -> R Markdown…
To put in code you click on the Insert button at the top right of the code window, then click R. Type some code in the block, say sqrt(5).
Everything else is part of the text. Markdown mixes text, code, and output neatly in one document. Click on Knit to make a document now.
Type these commands and press Ctrl+Enter at the end of each line to run it, or use the play button in the top right corner of the block to run the whole block.
# The hash in R is for typing comments, you only need to
# type what's before the hash
data(iris)     # R has a lot of example data sets built in, this loads one of them
summary(iris)  # The summary command makes a summary of data or model output
View(iris)     # For viewing data in RStudio, same as double clicking on data in Environment
hist(iris$Sepal.Length[iris$Species == "setosa"], 8, col = "purple",
     xlab = "Sepal Length of Setosa", main = "")  # Histogram
Make your histogram:
PCA is used to reduce the number of variables in a block of data by creating orthogonal linear combinations of the variables, while retaining as much of the variation as possible in the first few combinations. Try the following:
myPCA <- prcomp(iris[, 1:4])
summary(myPCA)
plot(myPCA, type = "l")
ItemNumber | Item | PC1 | PC2 | PC3 |
---|---|---|---|---|
1 | Standard deviation | 2.056 | 0.493 | 0.280 |
2 | Proportion of Variance | 0.925 | 0.053 | 0.017 |
3 | Cumulative Proportion | 0.925 | 0.978 | 0.995 |
The Iris PCA said that we only need 1 dimension to capture 92% of the variation in the measurements. However, this is an unsupervised learning method, meaning we haven’t looked at the target variable — species — at all yet.
We can see that it did not separate the species because that’s not the point of PCA.
To load data from Excel files we first load the openxlsx package.
library(openxlsx)
mydata <- read.xlsx("nitrogen.xlsx", "Export")
If the package is not installed yet, then you will need to install it first, by running this command in the Console:
install.packages("openxlsx")
Remember that you install a package only once, using the console, and from then on you just load it every time you need it (by putting the library command near the start of your program that needs it).
Test yourself:
For the next session everyone must bring along a paper with some statistics.
At random times a person will be asked to explain to everyone what statistics was done in the paper they brought and why they think it was done that way.
Most of Sessions 2 and 3 will focus on statistical modelling.
What was the most common mistake in R you’ve made so far?
What was the first thing that you forgot from Session 1? What was the most important thing you forgot and wish you could remember?
How to relearn the things you forgot:
Let’s take a detour quickly and talk about vectors.
In R we put numbers together using c(). This is the concatenate function and creates a vector.
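For example (the variable name here is just an illustration):

heights <- c(1.64, 1.72, 1.58, 1.80)  # combine four numbers into one vector
heights * 100                         # arithmetic is applied element by element
length(heights)                       # how many elements the vector has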
There are lots of ways to make sequences. We’ve already seen the colon notation for integer sequences, but try a few more:
3:8
7:2
You can make more complicated sequences using the seq command:
seq(10, 40, 5)
seq(10, 100, length = 10)
But the most fun way to make sequences is with the rep command:
rep(c(2, 4, 7), each = 2, times = 3)
rep(c(1, 2, 3), c(3, 4, 5))
A histogram does not let us see what is happening with the different treatments, but a box plot might.
boxplot(mydata$N_Balance ~ mydata$Treatment)
par(mar = c(4.5, 4.5, 0.5, 0.5))
boxplot(mydata$N_Balance ~ mydata$Treatment, col = 2:7,
        xlab = "Treatment", ylab = "Nitrogen Balance")
The rest of this session will be devoted to working with data frames. A data frame behaves both like a matrix and like a list of variables (its columns), so there are many different ways of working with one.
What’s the difference between using double and single square brackets?
Hint: use the class function to find out. The class function is extremely important when learning R, because it helps you work out why things are going wrong.
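To check your answer, here is a minimal sketch (assuming the nitrogen data is still loaded as mydata, as above):

class(mydata["N_Balance"])    # single brackets give back a one-column data frame
class(mydata[["N_Balance"]])  # double brackets give back the column itself, here a numeric vector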
Try to use a command to get the variables that start with ‘Delta’.
Hint: use the names function inside the startsWith function in the square brackets.
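One possible solution, following the hint:

mydata[, startsWith(names(mydata), "Delta")]  # keep only the columns whose names start with "Delta"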
Let’s do some t tests. First we will do paired tests:
t.test(mydata$Delta_NH4plus, mydata$Delta_NO3minus, paired = TRUE)
t.test(mydata$Delta_NH4plus - mydata$Delta_NO3minus)
Then let’s make a factor that’s High when N_Balance is more than 15 and Low otherwise and do a two sample test:
NBhigh <- mydata$N_Balance > 15
t.test(mydata$Delta_NH4plus ~ NBhigh)
t.test(mydata$Delta_NH4plus[NBhigh == 0], mydata$Delta_NH4plus[NBhigh == 1])
# Alternatively:
NBhigh <- factor(mydata$N_Balance > 15)
levels(NBhigh) <- c("Low", "High")
t.test(mydata$Delta_NH4plus ~ NBhigh)
The t test assumes something roughly resembling normality (bell shape). It is robust to small violations, but not severe violations. For example, a t test works for Likert scale data that is not too far to one side or the other, but not for data that has extreme values or outliers.
You can test for normality using shapiro.test but a histogram is generally better.
If you want to use a non-parametric alternative you could try wilcox.test. It still tests location, but not the mean (average) like the t test. It is a safer test, but not as powerful.
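As a rough sketch, here are these checks applied to the nitrogen data loaded earlier:

shapiro.test(mydata$N_Balance)  # formal normality test
hist(mydata$N_Balance)          # visual check of the shape, often more informative
wilcox.test(mydata$Delta_NH4plus, mydata$Delta_NO3minus, paired = TRUE)  # non-parametric paired alternative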
?shapiro.test
?wilcox.test
Statistical modelling is my favourite topic.
We will start with linear models and ANOVAs.
To do an ANOVA you always first do a linear model with the factors listed as explanatory variables in order of importance.
Regressing numbers on text doesn’t make sense; the text factors must be turned into numbers first. However, R does this automatically for you, and does it better than I would: it correctly makes binary (0, 1) dummy variables for each category after the first (base) category of each factor.
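You can see the dummy variables R builds by looking at the model matrix. A quick sketch using the nitrogen data from earlier:

head(model.matrix(~ Treatment, data = mydata))  # one 0/1 column per treatment after the first, plus the intercept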
Models are specified using a special syntax inside the command `lm`:
Let’s run a model and store the fit:
model1 <- lm(mydata$N_Balance ~ mydata$Treatment)
# OR
model1 <- lm(N_Balance ~ Treatment, data = mydata)
# Then
summary(model1)
An ANOVA table is calculated from a linear model fit by applying the anova function:
anova(model1)
## Analysis of Variance Table
## 
## Response: N_Balance
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Treatment  5 149.077 29.8155  14.704 0.0001489 ***
## Residuals 11  22.305  2.0277                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we will load a new data set and apply the stuff we learned today.
So start a new R Markdown file, clean it up, and save it.
Download the file GlutenDatasetforR.xlsx and put it in the same folder. Note that I have not modified this file in any way from when I received it.
Normally there is a lot of work to do before we use the data, which is often easier to do in Excel; but this sheet is well constructed and nearly ready for use, so in this case we will fix it in R for extra learning.
So please load the data into R in a variable called gluten.
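A minimal sketch of the loading step (the sheet index is an assumption; adjust it if your workbook differs):

library(openxlsx)
gluten <- read.xlsx("GlutenDatasetforR.xlsx", 1)  # read the first sheet into a data frame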
View the data in RStudio.
Note that we can’t modify data directly in the viewing window, we have to use code, unlike Excel. However, code is systematic, clean, and repeatable, which is a requirement in some cases.
We can see that every second repetition of “ATILC 2000” has a space in the middle that needs to be removed. Try to fix this now.
Hint: Either search for Name == “ATILC 2000” or for the joint case of Entry == 4 and Rep == 2.
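One possible fix, assuming the offending entries are exactly "ATILC 2000" with a single space:

gluten$Name[gluten$Name == "ATILC 2000"] <- "ATILC2000"  # remove the stray space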
You will know you have done the edit correctly if you run this command and get the result given below:
levels(factor(gluten$Name))
## [1] "ALTAR84"     "ATILC2000"   "CIRNOC2008"  "JUPAREC2001" "MEXICALIC75"
## [6] "YAVAROS79"
What is the difference between the above command and unique(gluten$Name)?
If we tell R to use Treatment as an explanatory variable then R will turn it into a factor because it is stored as text.
unique(gluten$Treatment)
## [1] "Control" "Drght" "Flood" "Heat" "MeDS" "MeHS"
If we put this into a regression then, for every level of the factor except the first (that is, every treatment except the base one), R will create a variable that is 1 for that treatment and 0 for every other treatment.
The interpretation of the regression coefficients is then relative to the base treatment.
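For example, with the nitrogen model fitted earlier:

coef(model1)  # the intercept is the mean of the base treatment; the other coefficients are differences from it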
Year is not stored as text, so R thinks it is just numbers. If we put it in a regression then it will consider 2014 as being two 2013s because it is saved as 2 versus 1.
Since there are only two levels for Year in this data we could subtract 1 so that we have 1 versus 0 directly. This will not work right if there are more than two years.
The general solution though is to tell R explicitly that Year is a factor:
gluten$YearFactor <- factor(gluten$Year)
levels(gluten$YearFactor) <- c("2013", "2014")
We can now do the ANOVAs. To start with we use Treatment, YearFactor, and Name as main effects without interaction. We will use the first measurement first.
anova(glutenM1 <- lm(LUPP ~ Treatment + YearFactor + Name, data = gluten))
## Analysis of Variance Table
## 
## Response: LUPP
##             Df  Sum Sq Mean Sq F value     Pr(>F)    
## Treatment    5  2030.8  406.16  5.3546  0.0001629 ***
## YearFactor   1  1286.5 1286.46 16.9601 0.00006692 ***
## Name         5   326.1   65.23  0.8599  0.5100278    
## Residuals  132 10012.5   75.85                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can adjust the model and compare information criterion values to find the most parsimonious model:
anova(glutenM2 <- lm(LUPP ~ Treatment * YearFactor, data = gluten))
anova(glutenM3 <- lm(LUPP ~ Treatment + YearFactor, data = gluten))
AIC(glutenM1, glutenM2, glutenM3)
BIC(glutenM1, glutenM2, glutenM3)
Model | df | AIC | BIC |
---|---|---|---|
glutenM1 | 13 | 1045.471 | 1084.078 |
glutenM2 | 13 | 1039.183 | 1077.791 |
glutenM3 | 8 | 1040.086 | 1063.845 |
Let’s test the residuals for normality:
r1 <- resid(glutenM3)
shapiro.test(r1)
qqnorm(r1)
qqline(r1)
Next time we will see ways to efficiently do the above analysis for all the measurements (not just one at a time).
For now though, manually do the analysis for the FP_14 variable.
Also, try to make notes about what you understood and didn’t understand.
Bonus fake points for looking up something on one of the topics you want to learn more about.
A for loop lets you cycle through measurements or treatments or observations, and deal with them one at a time.
The notation is: for ( counting_variable in sequence_to_count_through ) { commands based on counting_variable }
# Compare
(1:5)^2
# with
for (i in 1:5) {
    print(i^2)
}
# and
for (i in 1:5) {
    cat("The square of", i, "is", i^2, "\n")
}
Let’s draw a box plot to compare treatments, but for all the measurements, one at a time.
for (measure in 8:18) {
    boxplot(gluten[[measure]] ~ gluten$Treatment, col = 1:6,
            ylab = names(gluten)[measure], xlab = "Treatment")
}
Generalised Linear Models and Mixed Linear Models use the same basis as linear models, but add some extensions.
Generalised Linear Models assume a different distribution than the normal distribution for the observed values. Common examples include the Binomial distribution (0 to fixed maximum) and Negative Binomial distribution (0 to infinity).
The most commonly used GLM is called a logistic regression. It is used when the dependent variable is a random binary outcome, to predict the probability of seeing one outcome or the other. An example would be predicting whether a plant survives long enough to produce fruit or not.
Let us try to predict whether LUPP is larger than UPP.
y <- (gluten$LUPP > gluten$UPP)
table(y)
logistic1 <- glm(y ~ exHMW + exLMW + exGLI + exAG + FP_14, data = gluten, family = binomial())
summary(logistic1)
logistic2 <- glm(y ~ exHMW + exLMW, data = gluten, family = binomial())
summary(logistic2)
HMW <- seq(2, 14, length = 101)
par(mfrow = c(2, 2))
for (LMW in seq(12, 24, 4)) {
    ypred <- predict(logistic2, newdata = data.frame(exHMW = HMW, exLMW = rep(LMW, 101)),
        type = "response")
    plot(HMW, ypred, type = "l", xlab = "exHMW", ylab = "Probability of LUPP > UPP",
        main = paste("exLMW =", LMW), ylim = c(0, 1))
}
Or maybe one plot is better
HMW <- seq(2, 14, length = 101)
plot(gluten$exHMW, y, main = "", xlab = "exHMW", ylab = "Probability of LUPP > UPP",
    col = y + 1, pch = 20, xlim = c(2, 14), ylim = c(0, 1))
for (LMW in seq(12, 24, 4)) {
    ypred <- predict(logistic2, newdata = data.frame(exHMW = HMW, exLMW = rep(LMW, 101)),
        type = "response")
    lines(HMW, ypred, col = LMW/4, lwd = 3)
}
legend("right", paste("exLMW =", seq(12, 24, 4)), col = 3:6, lwd = 4)
Sometimes experiments are done where measurements are taken repeatedly on the same subject. A subject can be a person, animal, plant, location, etc.
If there are several of them and they are a random sample of a bigger population for which you hope to do inference, then it may help to treat them as random effects.
Let us consider the model we fitted first in Session 2, but let’s take the second measurement.
We now make the item name a random effect:
library(lme4)
summary(glutenMixed1 <- lmer(UPP ~ (1 | Name) + Treatment + YearFactor, data = gluten))
The above model is called the random intercept model because we have a random intercept for each name.
These random intercepts are centred at zero, so if you make predictions you get predictions for an average future item of this type, not a specific one.
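A short sketch of both ideas using the model above (re.form = NA tells predict to ignore the random effects; the newdata values are just an illustration):

ranef(glutenMixed1)  # the estimated random intercept for each Name, centred around zero
predict(glutenMixed1, newdata = data.frame(Treatment = "Control", YearFactor = "2013"),
    re.form = NA)    # population-level prediction for an average, unspecified Name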
A more advanced option is to also have random slopes for each name based on some explanatory variable(s), but that is not applicable here.
For more advanced models you have a few options:
For an example of more advanced implementations I recommend the work I did last year for Talana Cronje on opportunistic potatoes, at https://seanvdm.co.za/post/potatoes/.
If you want to learn more about Stan then check out my video presentation for StatsCon2020, at https://seanvdm.co.za/files/video/SvdMStanForStatcon2020.mp4
The big truth behind all really interesting stats is that in most cases more time was spent on working with the raw data than on the actual statistics.
Cleaning up data is the biggest and most useful skill to learn if you want to do good analysis.
A few examples can now be discussed, time permitting.
Thank you everyone for attending, I hope you learned things.
This is a good time to discuss general topics related to analysis of data before we end.