A selection of consultation projects I’ve worked on recently

Sean van der Merwe

Introduction

This presentation focusses on some consultation projects that I worked on this year. Some are easy, some are more complex.

Outline

4 recent projects that might interest you

I will only highlight some of the results for discussion.

And we might learn something…

  • Students versus reading
  • Giraffe milk (yes, you can drink it)
  • Students versus AI
  • Giraffe reactions to sounds

Students versus reading

This analysis was done for Elani Boshoff, a lecturer in English.

The strongest result in the study was that UFS students (studying English) were incredibly anxious about how much reading they would need to do at university.

I find this strange

UFS reading requirements are much lighter than at top US and EU universities by my understanding.

Surveys, surveys, everywhere

  • At UFS a questionnaire-based survey is a very popular research instrument
  • Often the only way to get information about how issues affect people
  • If well constructed it can extract useful data at low cost
  • Mostly measures perceptions though

The best surveys are highly focussed, asking exactly what one wants to know to answer a specific research question.

Practically though, you can’t do a new survey every time you think of a new question. So people tend to ask a lot of questions at once.

Multiple testing

  • Most hypothesis tests assume they are the only test being done on the data
    • Hypothesis tests have “main character syndrome”
    • They think, “Oh wow, all this data collected just for me!”
    • If you’re doing one test, and you assume \(\alpha=\) 5%, a 5% chance of a false positive result is low enough
  • But what happens if you are doing multiple tests?

Correlations

Comparing everything to everything

  • With 28 Likert scale questions you have 378 possible correlations
  • Testing all of them results in 378*5% \(\approx 19\) expected false positives
  • Assuming independent tests, the probability of at least 1 false positive is \(1 - 0.95^{378}\approx\) 0.999999996, which is 1 to 8 decimal places
    • Even at 1% significance level it is \(1 - 0.99^{378}\approx\) 0.9776074
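The arithmetic on this slide can be checked in a few lines. This is a minimal Python sketch that treats the 378 tests as independent, which is a simplifying assumption because correlations that share an item are not truly independent; the function name `familywise_error` is my own.

```python
from math import comb

n_items = 28
n_tests = comb(n_items, 2)  # number of distinct pairs of items

def familywise_error(alpha, m):
    """Probability of at least one false positive in m independent tests."""
    return 1 - (1 - alpha) ** m

expected_fp = n_tests * 0.05               # expected false positives at 5%
fwer_5 = familywise_error(0.05, n_tests)   # at the 5% level
fwer_1 = familywise_error(0.01, n_tests)   # at the 1% level

print(n_tests, expected_fp)
print(f"{fwer_5:.10f}", f"{fwer_1:.7f}")
```

Dependence between the tests changes the exact probabilities, but not the message that hundreds of unadjusted tests practically guarantee some false positives.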

Multiple testing visualised

  • The survival function and probability mass function of the number of false positives \((k)\) are approximately:
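If the 378 tests are treated as independent, each with a 5% false positive rate, the number of false positives is roughly Binomial(378, 0.05). The sketch below computes both functions under that assumption. The names `binom_pmf` and `prob_at_least` are mine, and "survival function" is taken here to mean \(P(K \ge k)\); some software defines it as \(P(K > k)\) instead.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_at_least(k, n, p):
    """P(K >= k): the chance of k or more false positives."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

n, p = 378, 0.05
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
mean_fp = sum(k * pk for k, pk in enumerate(pmf))
mode_fp = max(range(n + 1), key=lambda k: pmf[k])

# A few points on each curve
for k in (1, 10, 19, 30):
    print(k, f"{pmf[k]:.4g}", f"{prob_at_least(k, n, p):.4g}")
```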

The interesting part

The most interesting questions are a mix of nominal and ordinal items. One approach to visualising them together is Multiple Correspondence Analysis (MCA).

| Item | Description |
|---|---|
| Answer 32 | I get nervous when I have to read academic texts on a digital device. |
| Answer 7 | How do you usually read your prescribed academic texts? |
| Answer 8 | If you could choose, what would you prefer to use to read academic texts? |

MCA results

We note that those who get least nervous about reading on a digital device both read and wanted to read on a laptop or computer, while those most nervous about reading on a digital device indicated that they want to read printed text. The more neutral respondents indicated that they used a mobile phone but might prefer a computer.

Giraffe milk

Yes, you can drink it.

Milking giraffes

Warning

Do not try this at home!

  • Giraffes cannot be milked standing up; they must be tranquilised
    • Dangerous for them and the people doing it
  • Samples were taken opportunistically as part of a larger study where the giraffes are occasionally tranquilised anyway (for blood samples, sonar, insemination, measurement, etc.)
  • Milk samples were tested for bacteria and other issues
    • The data was provided as log CFU counts per ml sampled

Relationships and random effects

  • Interest is in how bacteria counts change over lactation stage (continuous) or season (nominal)
  • Only 5 (female) giraffes available to be milked
  • Each giraffe is sampled multiple times, so observations are correlated within giraffe

Solution

Mixed effects regression models

Comparing model options

  • A set of Bayesian fits will be attempted using the rstanarm interface for Stan
  • Models for the same dependent variable will be compared using the Leave-one-out cross-validated information criterion (LOOIC)
  • By considering the best model, relative to the alternatives, we will be able to address the research questions in a reliable fashion: by answering the question, “Which model most adequately resembles the underlying data generating process?”

For each dependent variable, models considered included:

  • InterceptOnly = only random variation;
  • AnimalOnly = variation between animals is captured via a random intercept for each animal;
  • LacAB = Lactation month as explanatory variable, with A representing the polynomial order in which it is introduced as a fixed effect and B representing the polynomial order in which it is introduced as a random effect (2 = quadratic, 1 = linear, 0 = intercept or intercepts only);
  • ToY1B = Time of Year as explanatory variable with 3 levels, B being a 1 for random effects interaction, 0 for just random intercepts, and blank for no random effects at all.

Results

Safety

  • The microbial limits (for total bacteria count) for raw milk in South Africa as provided to me:
    • Less than 200 000 CFU/ml (< 5.30103 log CFU/ml) intended for further processing (according to Regulation R1555)
    • Less than 50 000 CFU/ml (< 4.69897 log CFU/ml) intended for consumption (according to R1555)
    • Less than 5 000 CFU/ml (< 3.69897 log CFU/ml) if the cow and equipment sanitation is sound (according to the Dairy Standards Agency)

The predictive distributions (random future sample taken from a random future animal) will be compared against these limits and the probability of exceeding each safety threshold visualised.
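As a rough illustration of that comparison, the sketch below converts each limit to the base-10 log scale and estimates an exceedance probability as the share of predictive draws above it. The list `predictive_draws` is made up purely for illustration (it is not the giraffe data), and the dictionary keys are my own labels.

```python
from math import log10

# Limits in CFU/ml, as listed above
limits_cfu = {
    "further_processing": 200_000,
    "consumption": 50_000,
    "dairy_standards_agency": 5_000,
}
limits_log = {name: log10(value) for name, value in limits_cfu.items()}

# HYPOTHETICAL posterior predictive draws on the log10 CFU/ml scale
predictive_draws = [3.1, 3.9, 4.2, 4.8, 5.0, 5.4, 3.5, 4.0]

def exceedance_probability(draws, threshold):
    """Share of predictive draws above the threshold."""
    return sum(d > threshold for d in draws) / len(draws)

exceedance = {
    name: exceedance_probability(predictive_draws, limit)
    for name, limit in limits_log.items()
}
for name, prob in exceedance.items():
    print(f"{name}: limit {limits_log[name]:.5f} log CFU/ml, P(exceed) = {prob:.3f}")
```

With real output one would pass in the thousands of draws from the posterior predictive distribution rather than eight invented values.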

Safety limit exceedance probability

Students versus AI

In this “controlled” experiment some students were helped to brainstorm using a chatbot before starting their essay, while the control group brainstormed without AI. All the students then did their essays in what was supposed to be a controlled environment (no AI).

Did it help with their marks?

What about confounders?

What about fitting a proper model?

Treatment is not significant. Word count is barely significant, but you can’t give marks for writing if there’s no writing I suppose. The vast majority of the variation is assigned to the facilitators by the model.

Giraffe reactions to sounds

This analysis is a nightmare. Your advice would be appreciated.

The reviewers say that the analysis is not complex enough, but it is already so complicated that they can’t fully follow or understand it 😢.

Let’s talk about Feature Engineering

  • The best worst part of data science
  • One of the most important topics in applied statistics and all forms of data science

How many of you remember studying it?

Why isn’t it a major outcome in all our modules?

Because it is nearly impossible to teach!

Some universities (e.g. UCT I think) try to teach it at postgrad level, mostly via examples; but even if you work through a dozen examples it won’t really prepare you for what you’ll face at work.

The giraffe video data

Kaitlyn Taylor, M student in Animal Science, went to 3 reserves and filmed 252 videos of giraffes reacting to sounds that she played them, then coded the behaviours (55 of them) for 180 s from the sound.

The giraffe metadata

She also noted a lot of metadata for each video, such as:

Video_Number Stimuli_Order Wind_Direction
Location ID Temperature
Date Sex Speaker_Lat
Time_At_Start Age Speaker_Long
Habituation_Period Group_Size Observer_Lat
Sound_Type Speaker_Distance Observer_Long
Sound_Variation Wind_Speed Giraffe1_Lat

But how do we go from the raw data to something we can analyse statistically to address the research questions?

Feature engineering

Feature engineering varies dramatically; it can include:

  • Something simple such as a transformation
    • Mean-variance scaling was used here for all continuous explanatory variables
    • Location of speaker and giraffe was converted to distance between them
  • Something intermediate such as regrouping or remapping categories
    • Combined Sex and Age variables to eliminate nonsensical combinations
  • Something complex such as collapsing each video into a summary of giraffe reactions
    • First each behaviour was assigned an intensity
    • Then the intensities were collapsed with a focus on the initial reaction
      • Matrix multiplication followed by a flipped logistic transform (think hard-coded neural network layer)
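The two simple transformations in the list can be sketched as follows. This is only an illustration: the slides do not say which distance formula was used, so the haversine (great-circle) formula and the Earth radius of 6 371 km are assumptions, and the function names and example coordinates are made up.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import mean, stdev

def standardise(values):
    """Mean-variance scaling: subtract the mean, divide by the sample SD."""
    centre, spread = mean(values), stdev(values)
    return [(v - centre) / spread for v in values]

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * radius_m * asin(sqrt(a))

# Hypothetical speaker and giraffe positions (not from the study)
speaker = (-29.1000, 26.2000)
giraffe = (-29.1005, 26.2007)
distance_m = haversine_m(*speaker, *giraffe)

scaled = standardise([1.0, 2.0, 3.0])
print(round(distance_m, 1), scaled)
```

Over the tens of metres involved, a flat-earth approximation would give almost the same answer; the haversine form is used here only because it needs no projection.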

Raw fit results

Contrasts

| Contrast | Average | Median | Lower | Upper | p_value | adj_p_val |
|---|---|---|---|---|---|---|
| SubAdult_M - Adult_F | 3.803 | 3.801 | 2.451 | 5.173 | 0.000 | 0.000 |
| SubAdult_M - Adult_M | 4.506 | 4.514 | 2.993 | 5.930 | 0.000 | 0.000 |
| SubAdult_M - SubAdult_F | 4.049 | 4.057 | 2.775 | 5.416 | 0.000 | 0.000 |
| Adult_F - Adult_M | 0.703 | 0.704 | -0.353 | 1.704 | 0.183 | 0.548 |
| SubAdult_F - Adult_M | 0.457 | 0.458 | -0.328 | 1.183 | 0.237 | 0.548 |
| Adult_F - SubAdult_F | 0.246 | 0.241 | -0.605 | 1.105 | 0.572 | 0.572 |
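The slides do not name the adjustment behind adj_p_val, but the values in this table are consistent with Holm's step-down method, so the sketch below assumes Holm. It uses the rounded p-values as displayed, which is why it gives 0.549 where the table shows 0.548; `holm_adjust` is my own helper name.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: adjusted values never decrease with rank
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# Rounded p-values from the contrasts table, in table order
p_raw = [0.000, 0.000, 0.000, 0.183, 0.237, 0.572]
p_adj = holm_adjust(p_raw)
print([round(p, 3) for p in p_adj])
```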

Sound type contrasts (main result)

| Location | Contrast | Average | Median | Lower | Upper | p_value | adj_p_val |
|---|---|---|---|---|---|---|---|
| APGR | Drone - Dove | 3.803 | 3.801 | 2.451 | 5.173 | 0.000 | 0.000 |
| APGR | Vehicle - Dove | 5.201 | 5.199 | 3.571 | 6.833 | 0.000 | 0.000 |
| FGR | Drone - Dove | 1.969 | 1.977 | 1.059 | 2.854 | 0.000 | 0.000 |
| WGL | Talking - Dove | 1.552 | 1.547 | 0.847 | 2.316 | 0.000 | 0.000 |
| APGR | Talking - Dove | 2.306 | 2.305 | 1.021 | 3.689 | 0.001 | 0.017 |
| WGL | Vehicle - Dove | 1.171 | 1.173 | 0.388 | 1.927 | 0.003 | 0.044 |
| WGL | Drone - Dove | 1.106 | 1.110 | 0.334 | 1.852 | 0.004 | 0.046 |
| APGR | Vehicle - Talking | 2.895 | 2.899 | 0.952 | 4.858 | 0.005 | 0.057 |
| FGR | Talking - Dove | 1.116 | 1.122 | 0.259 | 1.932 | 0.010 | 0.098 |
| FGR | Vehicle - Dove | 1.062 | 1.069 | 0.249 | 1.895 | 0.011 | 0.103 |
| APGR | Drone - Talking | 1.497 | 1.504 | -0.120 | 3.222 | 0.080 | 0.642 |
| FGR | Drone - Vehicle | 0.907 | 0.909 | -0.143 | 1.950 | 0.089 | 0.642 |
| FGR | Drone - Talking | 0.853 | 0.858 | -0.232 | 1.881 | 0.117 | 0.704 |
| APGR | Vehicle - Drone | 1.398 | 1.394 | -0.517 | 3.374 | 0.149 | 0.746 |
| WGL | Talking - Drone | 0.445 | 0.452 | -0.465 | 1.358 | 0.348 | 1.000 |
| WGL | Talking - Vehicle | 0.380 | 0.375 | -0.551 | 1.308 | 0.423 | 1.000 |
| WGL | Vehicle - Drone | 0.065 | 0.062 | -0.867 | 1.006 | 0.900 | 1.000 |
| FGR | Talking - Vehicle | 0.054 | 0.050 | -0.942 | 1.036 | 0.922 | 1.000 |

Model comparison

Since the models here have the same outcome but different explanatory variable construction, we can compare the models to determine which explanatory variable better matches the data generating process.

| Model | elpd_diff | se_diff | elpd_loo | se_elpd_loo | p_loo | se_p_loo | looic | se_looic |
|---|---|---|---|---|---|---|---|---|
| HNR_Model | 0.000 | 0.000 | -1167.259 | 23.278 | 23.706 | 2.488 | 2334.518 | 46.557 |
| Sound_Type_Model | -9.825 | 4.969 | -1177.084 | 23.671 | 42.469 | 4.446 | 2354.168 | 47.342 |

The HNR model is more parsimonious and more likely to resemble the data generating process (actual giraffe behavioural processes).
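The columns of the comparison table are linked by simple arithmetic, which the sketch below reproduces from the reported numbers: looic is \(-2\) times elpd_loo, and elpd_diff is the difference in elpd_loo relative to the best model. Dividing elpd_diff by se_diff is a common informal way to gauge how clear the difference is; how many standard errors count as "clear" is a judgement call and not something the slides specify.

```python
# Values copied from the model comparison table
elpd_loo = {"HNR_Model": -1167.259, "Sound_Type_Model": -1177.084}
se_diff = 4.969

# LOOIC is the expected log predictive density on the deviance scale
looic = {name: -2 * elpd for name, elpd in elpd_loo.items()}

# Difference relative to the best (highest elpd_loo) model
best = max(elpd_loo, key=elpd_loo.get)
elpd_diff = elpd_loo["Sound_Type_Model"] - elpd_loo[best]

# Difference expressed in standard errors
ratio = elpd_diff / se_diff

print(best, looic)
print(round(elpd_diff, 3), round(ratio, 2))
```

Here the sound type model is about two standard errors worse, while also using more effective parameters (p_loo of 42.469 versus 23.706).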

Conclusion

  • An expansive statistical toolbox is incredibly useful when encountering real data.
  • Real data is difficult but rewarding to work with.
  • So expand your statistical toolbox to expand your impact.

This presentation was created using the Reveal.js format in Quarto, using the RStudio IDE. Font and line colours follow UFS branding, and the background image was made in the image editor GIMP by compositing images from CoPilot.