When you give students simulated data, you know the right answers up front; real data comes with its own issues. Simulated data that looks and sounds like real data works best, both for assignments and for research, and generating it need not be difficult.
deck <- paste( c(2:10, 'Jack', 'Queen', 'King', 'Ace') |> rep(times=4), 'of',
c('Spades','Diamonds','Clubs','Hearts') |> rep(each=13) )
# To draw a hand of 7 cards:
hand <- deck |> sample(7)
# Shuffling is taking a sample the same size (52) without replacement:
shuffled_deck <- deck |> sample(length(deck))
# For a bootstrap sample, sample with replacement by adding TRUE (replace = TRUE):
# boot_deck <- deck |> sample(length(deck), TRUE)
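To make the difference between a shuffle and a bootstrap sample concrete, here is a minimal sketch in base R. The seed value and the names `ranks`, `suits` and `boot_deck` are illustrative assumptions, not part of the original slides.

```r
set.seed(2024)  # assumption: any fixed seed, only so the draws can be reproduced

ranks <- c(2:10, 'Jack', 'Queen', 'King', 'Ace')
suits <- c('Spades', 'Diamonds', 'Clubs', 'Hearts')
deck  <- paste(rep(ranks, times = 4), 'of', rep(suits, each = 13))

# Shuffle: all 52 cards, without replacement, so every card appears exactly once
shuffled_deck <- sample(deck, length(deck))

# Bootstrap sample: 52 draws with replacement, so some cards repeat and others are missing
boot_deck <- sample(deck, length(deck), replace = TRUE)

anyDuplicated(shuffled_deck)  # 0, because a shuffle is a permutation of the deck
length(unique(boot_deck))     # number of distinct cards in the bootstrap sample
```

Fixing the seed is optional, but it lets you regenerate exactly the same hand later, for example when a student queries a mark.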
You might think, “Let me make it look nice manually by opening it in Excel.” But then you decide to generate new data and have to repeat those steps, again and again, until you ask yourself, “Why didn’t I just make this part of the code?”
You can go much fancier than this. The openxlsx and openxlsx2 packages allow for crazy customisation.
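As a rough sketch of what formatting in code can look like, the example below builds a workbook step by step with openxlsx instead of editing it by hand. The helper name `write_formatted`, the sheet name, the colours and the example data are all illustrative assumptions; the openxlsx package must be installed.

```r
library(openxlsx)

# Hypothetical helper: write one data frame to a formatted sheet and save the workbook
write_formatted <- function(data, file, sheet = 'Data') {
  wb <- createWorkbook()
  addWorksheet(wb, sheet)
  # Bold, centred header on a light fill (colour chosen only for illustration)
  header_style <- createStyle(textDecoration = 'bold', halign = 'center',
                              fgFill = '#DDEBF7')
  writeData(wb, sheet, data, headerStyle = header_style)
  setColWidths(wb, sheet, cols = seq_along(data), widths = 'auto')
  freezePane(wb, sheet, firstRow = TRUE)  # keep the header visible while scrolling
  saveWorkbook(wb, file, overwrite = TRUE)
  invisible(file)
}

example_data <- data.frame(Mark = c(55, 62, 71, 48, 66),
                           Height = c(1.62, 1.70, 1.81, 1.55, 1.67))
out_file <- tempfile(fileext = '.xlsx')
write_formatted(example_data, out_file)
```

Because the formatting lives in a function, regenerating the data costs nothing extra: call the function again and the new file comes out looking the same.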
Excel workbooks can have many sheets, so there is no need to write to a different file each time.
We only need about 3 more lines of code 😊
students <- c('2024000001', '2024000002', '2024000003', '2024000004',
'2024000005') # Get class list in R somehow
# You can read student numbers from class list / mark list file:
# students <- openxlsx::read.xlsx('mark_list.xlsx', startRow = 4)$Student.ID
n <- 100
students |> lapply(\(s) { # Make a list of data frames
  data.frame(
    Mark = rbinom(n, 100, 0.6),
    Group = c('🐬', '🐪') |> sample(n, replace = TRUE),
    Height = rnorm(n, 1.67, 0.1) |> round(2)
  )
}) |> setNames(students) |> # Give the data frames names
  openxlsx::write.xlsx('gen_data_3.xlsx',
    firstRow = TRUE, asTable = TRUE,
    colWidths = c(10, 11, 12) # or "auto"
  )
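If each student gets a different data set, you will probably want to be able to regenerate any one of them when marking. One possible approach, sketched below in base R, is to derive the random seed from the student number. The helper `make_student_data`, the seed rule and the `memo` table are assumptions for illustration, not part of the original code.

```r
# Hypothetical helper: one student's data, reproducible from the student number
make_student_data <- function(student, n = 100) {
  # Assumption: characters 5 to 10 of the student number are digits, used here as the seed.
  # Note that set.seed() changes the global random number state.
  set.seed(as.integer(substr(student, 5, 10)))
  data.frame(
    Mark = rbinom(n, 100, 0.6),
    Group = sample(c('🐬', '🐪'), n, replace = TRUE),
    Height = round(rnorm(n, 1.67, 0.1), 2)
  )
}

students <- c('2024000001', '2024000002')
sheets <- students |> lapply(make_student_data) |> setNames(students)

# A small memo of "right answers", one column per student
memo <- sheets |> sapply(\(d) c(mean_mark = mean(d$Mark), mean_height = mean(d$Height)))
```

Calling `make_student_data('2024000001')` again later returns the same data frame as the one written to that student's sheet, so the memo can be rebuilt without storing every file.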
Students engage better with assignments that are relatable.
What about time series? We could simulate a pair of \(VARIMA_2(1,1,1)-tGARCH(1,1)\) financial time series like so:
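n <- 120
# Correlated bivariate t innovations (4 degrees of freedom, scale matrix with 0.5 off-diagonal)
e <- mvtnorm::rmvt(n, sigma = c(1, 0.5, 0.5, 1) |> matrix(2), df = 4)
x1 <- x2 <- v1 <- v2 <- rep(1, n)
for (i in 2:n) {
  # GARCH(1,1) conditional variances
  v1[i] <- 0.5 + 0.2*e[i-1, 1]^2 + 0.3*v1[i-1]
  v2[i] <- 0.5 + 0.2*e[i-1, 2]^2 + 0.3*v2[i-1]
  # Scale the innovations by the conditional standard deviations
  e[i,] <- e[i,] * sqrt(c(v1[i], v2[i]))
  # ARMA(1,1) recursions for the (log) returns
  x1[i] <- 0.1 + 0.3*x1[i-1] + 0.3*e[i-1, 1] + e[i, 1]
  x2[i] <- 0.12 + 0.35*x2[i-1] + 0.2*e[i-1, 2] + e[i, 2]
}
# Drop the first 20 values as burn-in, then accumulate returns into price-like series
d <- data.frame(
  Month = seq_len(n-20),
  ABC = cumsum(x1[21:n]/60) |> exp() + 30,
  DEF = cumsum(x2[21:n]/50) |> exp() + 29.5
)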
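For reference, the model for the first series can be written out as follows. This is read off the code rather than taken from a separate model specification, and the symbols \(z\), \(\varepsilon\), \(v\) and \(m\) are notation introduced here. The second series has the same form with its own coefficients.

```latex
\begin{align*}
(z_{1,t},\, z_{2,t})^\top &\sim t_4\!\left(\mathbf{0},\ \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}\right) \\
\varepsilon_{1,t} &= \sqrt{v_{1,t}}\; z_{1,t} \\
v_{1,t} &= 0.5 + 0.2\,\varepsilon_{1,t-1}^{2} + 0.3\,v_{1,t-1} \\
x_{1,t} &= 0.1 + 0.3\,x_{1,t-1} + 0.3\,\varepsilon_{1,t-1} + \varepsilon_{1,t} \\
\mathrm{ABC}_m &= \exp\!\Big(\tfrac{1}{60}\sum_{s=21}^{20+m} x_{1,s}\Big) + 30, \qquad m = 1, \dots, 100
\end{align*}
```

The matrix in the first line is the scale matrix of the multivariate \(t\) distribution, not its covariance matrix. The sum starts at \(s = 21\) because the first 20 simulated values are discarded as burn-in.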
Before saving to Excel, do a plot or summary, or even a full analysis, in R. You might save yourself a lot of hassle in the long run.
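A quick check along these lines might look like the sketch below. To keep it runnable on its own in base R, it uses a stand-in data frame with the same column names as `d`; the stand-in (plain random walks in log price), the seed and the name `returns` are assumptions for illustration.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
n <- 100

# Stand-in for the simulated price series (assumption: simple random walks, not the GARCH model)
d <- data.frame(
  Month = seq_len(n),
  ABC = exp(cumsum(rnorm(n, 0, 0.02))) + 30,
  DEF = exp(cumsum(rnorm(n, 0, 0.02))) + 29.5
)

summary(d)  # ranges and quartiles: do the prices look plausible?

# Basic sanity checks before anything is written to a file
stopifnot(!anyNA(d), all(d$ABC > 0), all(d$DEF > 0))

# Log returns and their correlation
returns <- data.frame(ABC = diff(log(d$ABC)), DEF = diff(log(d$DEF)))
cor(returns)

# Plot both series on one set of axes
matplot(d$Month, d[, c('ABC', 'DEF')], type = 'l', lty = 1,
        xlab = 'Month', ylab = 'Price')
legend('topleft', legend = c('ABC', 'DEF'), col = 1:2, lty = 1)
```

If a series explodes, goes flat or shows an implausible range, it is much cheaper to notice here than after the files have gone out to students.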
This presentation was created using the Reveal.js format in Quarto, using the RStudio IDE. Font and line colours follow UFS branding; the background image was made with Midjourney, using the image editor GIMP.
2024/07/19 - Gen Data