A selection of consultation projects I’ve worked on recently

Sean van der Merwe

Introduction

This presentation focusses on some consultation projects that I worked on this year. Some are easy, some are more complex.

Outline

4 recent projects that might interest you

I will only highlight some of the results for discussion.

And we might learn something…

Students versus reading
Giraffe milk (yes, you can drink it)
Students versus AI
Giraffe reactions to sounds

Students versus reading

This analysis was done for Elani Boshoff, a lecturer in English.

The strongest result in the study was that UFS students (studying English) were incredibly anxious about how much reading they would need to do at university.

I find this strange

As I understand it, UFS reading requirements are much lighter than those at top US and EU universities.

Surveys, surveys, everywhere

At UFS a questionnaire-based survey is a very popular research instrument
Often the only way to get information about how issues affect people
If well constructed it can extract useful data at low cost
Mostly measures perceptions though

The best surveys are highly focussed, asking exactly what one wants to know to answer a specific research question.

Practically though, you can’t do a new survey every time you think of a new question. So people tend to ask a lot of questions at once.

Multiple testing

Most hypothesis tests assume they are the only test being done on the data
- Hypothesis tests have “main character syndrome”
- They think, “Oh wow, all this data collected just for me!”
- If you’re doing one test at a significance level of \(\alpha = 5\%\), a 5% chance of a false positive result is usually considered low enough
But what happens if you are doing multiple tests?

Correlations

Comparing everything to everything

With 28 Likert scale questions you have 378 possible correlations
Testing all of them results in 378*5% \(\approx 19\) expected false positives
The probability of at least 1 false positive (treating the tests as independent) is \(1 - 0.95^{378} \approx 1 - 3.8\times 10^{-9}\), which is 1 to 8 decimal places
- Even at 1% significance level it is \(1 - 0.99^{378}\approx\) 0.9776074
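
The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes the 378 tests are independent (correlations that share items are not), and the variable and function names are mine.

```python
from math import comb

n_items = 28          # Likert-scale questions in the survey
alpha = 0.05          # per-test significance level

# Number of distinct pairs of items, i.e. possible correlations
n_tests = comb(n_items, 2)

# Expected number of false positives if every null hypothesis is true
expected_false_positives = n_tests * alpha


def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


fwer_5pct = familywise_error_rate(0.05, n_tests)
fwer_1pct = familywise_error_rate(0.01, n_tests)

print(n_tests, round(expected_false_positives, 1))
print(f"{fwer_5pct:.10f}", f"{fwer_1pct:.7f}")
```

Real survey items are correlated, so the true familywise error rate will differ from these figures, but the qualitative message stays the same.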

Multiple testing visualised

The survival function and probability mass function of the number of false positives \((k)\) are approximately:
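
For anyone who wants to reproduce those two curves, here is a minimal sketch. It assumes the number of false positives follows a Binomial(378, 0.05) distribution, which again treats the tests as independent. The function names are illustrative, and the survival function is taken as \(P(K > k)\).

```python
from math import comb

N_TESTS = 378   # possible correlations among 28 items
ALPHA = 0.05    # per-test significance level


def binom_pmf(k, n=N_TESTS, p=ALPHA):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_sf(k, n=N_TESTS, p=ALPHA):
    """Survival function P(K > k) for K ~ Binomial(n, p)."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1, n + 1))


# Tabulate both functions over the range where most of the mass sits
for k in range(5, 36, 5):
    print(k, f"pmf={binom_pmf(k):.4f}", f"sf={binom_sf(k):.4f}")
```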

The interesting part

The most interesting questions are a mix of nominal and ordinal items. One approach to visualising them together is Multiple Correspondence Analysis (MCA).

Item	Description
Answer 32	I get nervous when I have to read academic texts on a digital device.
Answer 7	How do you usually read your prescribed academic texts?
Answer 8	If you could choose, what would you prefer to use to read academic texts?

MCA results

We note that those who get least nervous about reading on a digital device both read, and wanted to read, on a laptop or computer, while those most nervous about reading on a digital device indicated that they want to read printed text. The more neutral respondents indicated that they used a mobile phone but might prefer a computer.

Giraffe milk

Yes, you can drink it.

Milking giraffes

Warning

Do not try this at home!

Giraffes cannot be milked standing up; they must be tranquilised
- Dangerous for them and the people doing it
Samples were taken opportunistically as part of a larger study where the giraffes are occasionally tranquilised anyway (for blood samples, sonar, insemination, measurement, etc.)
Milk samples were tested for bacteria and other issues
- The data was provided as log CFU counts per ml sampled

Relationships and random effects

Interest is in how bacteria counts change with lactation stage (continuous) or season (nominal)
Only 5 (female) giraffes available to be milked
Each giraffe is sampled multiple times, so observations are correlated within giraffe

Solution

Mixed effects regression models
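
To see why repeated samples from the same animal are correlated, it can help to simulate the simplest mixed model, a random intercept per giraffe. The sketch below is purely illustrative: the grand mean, the standard deviations and the number of samples per animal are made-up values, not estimates from the study, and the actual models were fitted in rstanarm rather than Python.

```python
import random


def intraclass_correlation(sd_animal, sd_residual):
    """Correlation between two samples from the same animal implied by a
    random-intercept model: var_animal / (var_animal + var_residual)."""
    return sd_animal**2 / (sd_animal**2 + sd_residual**2)


def simulate_log_cfu(n_animals=5, n_samples=6, grand_mean=3.0,
                     sd_animal=0.5, sd_residual=0.5, seed=1):
    """Simulate log CFU/ml values with one random intercept per animal.
    All default values are hypothetical and chosen only for illustration."""
    rng = random.Random(seed)
    data = {}
    for animal in range(1, n_animals + 1):
        offset = rng.gauss(0.0, sd_animal)  # this animal's shift from the mean
        data[f"giraffe_{animal}"] = [
            grand_mean + offset + rng.gauss(0.0, sd_residual)
            for _ in range(n_samples)
        ]
    return data


samples = simulate_log_cfu()
print(intraclass_correlation(0.5, 0.5))
for name, values in samples.items():
    print(name, [round(v, 2) for v in values])
```

Ignoring the animal-level offset would treat all samples as independent and overstate how much information five giraffes provide.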

Comparing model options

A set of Bayesian fits will be attempted using the rstanarm interface for Stan
Models for the same dependent variable will be compared using the Leave-one-out cross-validated information criterion (LOOIC)
By considering the best model, relative to the alternatives, we will be able to address the research questions in a reliable fashion: by answering the question, “Which model most adequately resembles the underlying data generating process?”

For each dependent variable, models considered included:

InterceptOnly = only random variation;
AnimalOnly = variation between animals is captured via a random intercept for each animal;
LacAB = Lactation month as explanatory variable, with A representing the polynomial order in which it is introduced as a fixed effect and B representing the polynomial order in which it is introduced as a random effect (2 = quadratic, 1 = linear, 0 = intercept or intercepts only);
ToY1B = Time of Year as explanatory variable with 3 levels, B being a 1 for random effects interaction, 0 for just random intercepts, and blank for no random effects at all.

Results

Safety

The microbial limits (for total bacteria count) for raw milk in South Africa as provided to me:
- Less than 200 000 CFU/ml (< 5.30103 log CFU/ml) intended for further processing (according to Regulation R1555)
- Less than 50 000 CFU/ml (< 4.69897 log CFU/ml) intended for consumption (according to R1555)
- Less than 5 000 CFU/ml (< 3.69897 log CFU/ml) if the cow and equipment sanitation is sound (according to the Dairy Standards Agency)

The predictive distributions (random future sample taken from a random future animal) will be compared against these limits and the probability of exceeding each safety threshold visualised.
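
As a rough sketch of that comparison: convert each limit to the log10 scale, then take the share of posterior predictive draws above it. The limits are the ones listed above. The `toy_draws` list is a hypothetical stand-in for real predictive draws, and the function names are mine.

```python
from math import log10

# Limits for total bacteria count in raw milk, in CFU/ml, as listed above
LIMITS_CFU_PER_ML = {
    "further processing (R1555)": 200_000,
    "consumption (R1555)": 50_000,
    "sound sanitation (Dairy Standards Agency)": 5_000,
}

# The data are on the log10 scale, so convert the limits once
LOG_LIMITS = {name: log10(cfu) for name, cfu in LIMITS_CFU_PER_ML.items()}


def exceedance_probability(predictive_draws, log_limit):
    """Share of predictive draws (log10 CFU/ml) above a log10 limit."""
    above = sum(1 for draw in predictive_draws if draw > log_limit)
    return above / len(predictive_draws)


# Hypothetical draws for illustration only; real ones come from the fitted model
toy_draws = [3.0, 4.0, 5.0, 6.0]

for name, log_limit in LOG_LIMITS.items():
    prob = exceedance_probability(toy_draws, log_limit)
    print(f"{name}: limit {log_limit:.5f}, P(exceed) = {prob:.2f}")
```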

Safety limit exceedance probability

Students versus AI

In this “controlled” experiment some students were helped to brainstorm using a chatbot before starting their essay, while the control group brainstormed without AI. All the students then did their essays in what was supposed to be a controlled environment (no AI).

Did it help with their marks?

What about confounders?

What about fitting a proper model?

Treatment is not significant. Word count is barely significant, but you can’t give marks for writing if there’s no writing I suppose. The vast majority of the variation is assigned to the facilitators by the model.

Giraffe reactions to sounds

This analysis is a nightmare. Your advice would be appreciated.

The reviewers say that the analysis is not complex enough, but it is already so complicated that they can’t fully follow or understand it 😢.

Let’s talk about Feature Engineering

The best worst part of data science
One of the most important topics in applied statistics and all forms of data science

How many of you remember studying it?

Why isn’t it a major outcome in all our modules?

Because it is nearly impossible to teach!

Some universities (e.g. UCT I think) try to teach it at postgrad level, mostly via examples; but even if you work through a dozen examples it won’t really prepare you for what you’ll face at work.

The giraffe video data

Kaitlyn Taylor, a Master’s student in Animal Science, went to 3 reserves and filmed 252 videos of giraffes reacting to sounds that she played to them, then coded the behaviours (55 behaviours) for the 180 s following the sound.

The giraffe metadata

She also noted a lot of metadata for each video, such as:

Video_Number	Stimuli_Order	Wind_Direction
Location	ID	Temperature
Date	Sex	Speaker_Lat
Time_At_Start	Age	Speaker_Long
Habituation_Period	Group_Size	Observer_Lat
Sound_Type	Speaker_Distance	Observer_Long
Sound_Variation	Wind_Speed	Giraffe1_Lat

But how do we go from the raw data to something we can analyse statistically to address the research questions?

Feature engineering

Feature engineering varies dramatically; it can include:

Something simple such as a transformation
- Mean-variance scaling was used here for all continuous explanatory variables
- Location of speaker and giraffe was converted to distance between them
Something intermediate such as regrouping or remapping categories
- Combined Sex and Age variables to eliminate nonsensical combinations
Something complex such as collapsing each video into a summary of giraffe reactions
- First each behaviour was assigned an intensity
- Then the intensities were collapsed with a focus on the initial reaction
  - Matrix multiplication followed by a flipped logistic transform (think hard-coded neural network layer)
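
The three levels above can be sketched in code. The first two helpers (standardising and great-circle distance) are standard techniques. The last one is only one possible reading of "matrix multiplication followed by a flipped logistic transform": the actual weights, midpoint, scale and order of operations are not given here, so treat `flipped_logistic` and `reaction_score` and their parameters as hypothetical.

```python
from math import asin, cos, exp, radians, sin, sqrt
from statistics import mean, stdev


def standardise(values):
    """Mean-variance scaling: subtract the mean, divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * radius_m * asin(sqrt(a))


def flipped_logistic(t, midpoint=30.0, scale=10.0):
    """Weight near 1 for early seconds, decaying to 0 later.
    The midpoint and scale are made-up values for illustration."""
    return 1.0 / (1.0 + exp((t - midpoint) / scale))


def reaction_score(behaviour_by_second, intensity):
    """Collapse one video into a single number.
    behaviour_by_second: one row per second, one 0/1 column per behaviour.
    intensity: one intensity value per behaviour (hypothetical coding)."""
    # Matrix-vector product: intensity shown in each second
    per_second = [sum(b * w for b, w in zip(row, intensity))
                  for row in behaviour_by_second]
    # Weighted average that emphasises the initial reaction
    weights = [flipped_logistic(t) for t in range(len(per_second))]
    return sum(w * x for w, x in zip(weights, per_second)) / sum(weights)
```

For example, `haversine_m(speaker_lat, speaker_long, giraffe_lat, giraffe_long)` would turn the four coordinate columns into a single distance feature.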

Raw fit results

Contrasts

Contrast	Average	Median	Lower	Upper	p_value	adj_p_val
SubAdult_M - Adult_F	3.803	3.801	2.451	5.173	0.000	0.000
SubAdult_M - Adult_M	4.506	4.514	2.993	5.930	0.000	0.000
SubAdult_M - SubAdult_F	4.049	4.057	2.775	5.416	0.000	0.000
Adult_F - Adult_M	0.703	0.704	-0.353	1.704	0.183	0.548
SubAdult_F - Adult_M	0.457	0.458	-0.328	1.183	0.237	0.548
Adult_F - SubAdult_F	0.246	0.241	-0.605	1.105	0.572	0.572
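
The adj_p_val column looks consistent with a Holm step-down adjustment. That is my inference from the numbers, since the method is not named here. A minimal sketch of Holm's procedure, applied to the rounded p-values from the table:

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1), cap at 1,
        # and enforce that adjusted values never decrease
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Rounded p-values copied from the contrasts table above
p_values = [0.000, 0.000, 0.000, 0.183, 0.237, 0.572]
print([round(p, 3) for p in holm_adjust(p_values)])
```

Working from the rounded p-values gives 0.549 rather than the tabulated 0.548; the small difference is what one would expect if the table was computed from unrounded values.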

Sound type contrasts (main result)

Location	Contrast	Average	Median	Lower	Upper	p_value	adj_p_val
APGR	Drone - Dove	3.803	3.801	2.451	5.173	0.000	0.000
APGR	Vehicle - Dove	5.201	5.199	3.571	6.833	0.000	0.000
FGR	Drone - Dove	1.969	1.977	1.059	2.854	0.000	0.000
WGL	Talking - Dove	1.552	1.547	0.847	2.316	0.000	0.000
APGR	Talking - Dove	2.306	2.305	1.021	3.689	0.001	0.017
WGL	Vehicle - Dove	1.171	1.173	0.388	1.927	0.003	0.044
WGL	Drone - Dove	1.106	1.110	0.334	1.852	0.004	0.046
APGR	Vehicle - Talking	2.895	2.899	0.952	4.858	0.005	0.057
FGR	Talking - Dove	1.116	1.122	0.259	1.932	0.010	0.098
FGR	Vehicle - Dove	1.062	1.069	0.249	1.895	0.011	0.103
APGR	Drone - Talking	1.497	1.504	-0.120	3.222	0.080	0.642
FGR	Drone - Vehicle	0.907	0.909	-0.143	1.950	0.089	0.642
FGR	Drone - Talking	0.853	0.858	-0.232	1.881	0.117	0.704
APGR	Vehicle - Drone	1.398	1.394	-0.517	3.374	0.149	0.746
WGL	Talking - Drone	0.445	0.452	-0.465	1.358	0.348	1.000
WGL	Talking - Vehicle	0.380	0.375	-0.551	1.308	0.423	1.000
WGL	Vehicle - Drone	0.065	0.062	-0.867	1.006	0.900	1.000
FGR	Talking - Vehicle	0.054	0.050	-0.942	1.036	0.922	1.000

Model comparison

Since the models here have the same outcome but different explanatory variable construction, we can compare the models to determine which explanatory variable better matches the data generating process.

Model	elpd_diff	se_diff	elpd_loo	se_elpd_loo	p_loo	se_p_loo	looic	se_looic
HNR_Model	0.000	0.000	-1167.259	23.278	23.706	2.488	2334.518	46.557
Sound_Type_Model	-9.825	4.969	-1177.084	23.671	42.469	4.446	2354.168	47.342

The HNR model is more parsimonious and more likely to resemble the data generating process (actual giraffe behavioural processes).
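
The columns of the comparison table are linked by simple arithmetic, sketched below using the tabulated values. The variable names are mine; `diff_in_se` simply expresses the elpd difference in units of its standard error.

```python
# Values copied from the model comparison table above
elpd_loo = {"HNR_Model": -1167.259, "Sound_Type_Model": -1177.084}
p_loo = {"HNR_Model": 23.706, "Sound_Type_Model": 42.469}
se_diff = 4.969

# LOOIC is the elpd on the deviance scale: looic = -2 * elpd_loo
looic = {name: -2.0 * value for name, value in elpd_loo.items()}

# The preferred model has the highest elpd_loo (lowest looic)
best = max(elpd_loo, key=elpd_loo.get)

# Difference relative to the best model, and its size in standard errors
elpd_diff = elpd_loo["Sound_Type_Model"] - elpd_loo[best]
diff_in_se = abs(elpd_diff) / se_diff

print(best, looic)
print(round(elpd_diff, 3), round(diff_in_se, 2))
```

The smaller p_loo of the HNR model (about 24 effective parameters versus about 42) is what the parsimony remark refers to, and the elpd difference is roughly two standard errors.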

Conclusion

An expansive statistical toolbox is incredibly useful when encountering real data.
Real data is difficult but rewarding to work with.
So expand your statistical toolbox to expand your impact.

This presentation was created using the Reveal.js format in Quarto, using the RStudio IDE. Font and line colours follow UFS branding, and the background image was made in the image editor GIMP by compositing images from Copilot.