The purpose of this guide is to highlight the many ways to read and write Excel files from R, and discuss alternatives along the way.
In Windows, R can use the file chooser to let you find and select your file.
I am going to combine it with the read.xlsx function from the openxlsx library to easily read in the first sheet of my Excel file.
file.choose() |>
openxlsx::read.xlsx() -> mydata
Another thing people like to do is put the full path to their file in the read command, primarily to bypass the ‘working directory’ problem.
"C:/Users/myusername/OneDrive - myemployer/Work_CurrentYear/Subject/Examples/ExcelFile.xlsx" |>
openxlsx::read.xlsx() -> mydata
Note that R uses ‘/’ to separate folders/directories (the Linux convention).
The above approaches are bad because:
R will look for files in the ‘working directory’. In an ordinary R script this will usually be your documents folder.
The best way around the working directory (folder) issue is to write all code in R Markdown notebooks instead of ordinary R scripts. This makes RStudio move your working directory around for you.
If you want to use an R script then include it in an R Project to help you keep everything organised. Again, RStudio knows to look in the project folder first.
You can also use a package such as ‘here’ to help you with this issue.
Alternatively, in RStudio you can right-click on the script tab at the top, then select “Set Working Directory” to set the working directory for the current session. This is again a manual approach, which is bad (but easy).
If you are only ever reading the data in once (say with the Import Dataset button) and you don’t have too many variables, then it can be handy to ‘attach’ the data set, which will make the variables in the data set directly accessible by name.
This sounds great but actually causes all sorts of problems and messes in the long run. Attaching data doesn’t actually turn the data into individual variables at all, which means that when you read in data twice, or try to change the data set, then it often doesn’t do what you might expect.
Best practice is to never attach data, there is always a better way.
From here on I’m going to assume you are working in R Markdown or an R Project.
Always include the file name and sheet name in the code:
"ExcelFile.xlsx" |>
openxlsx::read.xlsx("MySheet") -> mydata
The above code can also be written as:
mydata <- openxlsx::read.xlsx(xlsxFile = "ExcelFile.xlsx", sheet = "MySheet")
Or various other arrangements and levels of being explicit.
R is extremely case sensitive (good thing). The best way to get your file name correct is to find it in your file explorer and copy the file name from there. In Windows Explorer you click on the file and press F2 (or click again) then copy the entire file name.
The best way to get your sheet name correct is to open the file in Excel, then right-click on the sheet name, choose ‘Rename’ and then copy the name (Ctrl+C).
I recommend you close the file again before reading it into R. In most cases this is not necessary, but if the file is being live edited (e.g. in a OneDrive folder) then you will not be able to read it into R without closing it first.
The really fancy approach is to organise your project files in sub folders. For example, you can create a ‘DataSets’ folder in your project folder and read the data from that:
"DataSets/ExcelFile.xlsx" |>
openxlsx::read.xlsx("MySheet") -> mydata
The ‘openxlsx’ package is quite flexible, you can:
R can read in all popular data formats:
‘openxlsx’ is the fastest way to read Excel files (faster than Excel itself). However, fastest doesn’t always mean fast, especially with really large files. Large data science projects don’t use Excel to store their data.
As you use R more, try to learn the RDS format:
`?`(saveRDS)
`?`(readRDS)
RDS files are generally as fast as CSV files, but smaller than XLSX files, giving the best of both!
The RDS format is exclusive to R, or anything that can call on R indirectly (like SAS, PowerBI, etc.). Other programming languages have similar tools for storing data in binary formats.
Often you might produce simulated data sets, or large tables that don’t fit on a page, and want to write it to Excel for distribution. You can do this with the ‘write.xlsx’ function:
mydata |>
openxlsx::write.xlsx("NewExcelFile.xlsx")
The default overwrite behaviour was recently changed to FALSE. I usually override it to TRUE.
I also like to name my sheet, which I do my putting the data into a named list:
list(MySheetName = mydata) |>
openxlsx::write.xlsx("NewExcelFile.xlsx", overwrite = TRUE)
The advantage of a named list is that we can write a lot of sheets at once. In the next example I will simulate 100 regression data sets, one for each of 100 fictitious students, and write them to an Excel file where each student gets a sheet with their student number in the name.
student_numbers <- 2012345678:2012345777
n_students <- length(student_numbers)
n_obs <- 123
dataset_list <- lapply(seq_len(n_students), function(s) {
x1 <- rgamma(n_obs, 5, 2)
x2 <- rnorm(n_obs, 10, 3)
y <- 0.3 * x1 + 0.4 * x2 + 1.34 * rnorm(n_obs)
data.frame(Y = y, X1 = x1, X2 = x2)
})
names(dataset_list) <- paste0("St", student_numbers)
dataset_list |>
openxlsx::write.xlsx("RegAssignment1.xlsx", overwrite = TRUE)
If you would like to do more than just write out the data then the ‘openxlsx’ package allows you to take finer control, to the point that you can create extremely complex workbooks programmatically.
You might start with ‘wb <- createWorkbook()’ to make a blank canvas, add a sheet with ‘addWorksheet(wb, “SheetName”, tabColour = “#FBB4AE”)’, and then fill it in with data and formulas, before applying styles, data validation, column widths, frozen panes, conditional formatting, and so on.
For example, I use this power to create a named table in an Excel file in my OneDrive, which is then used as input for Microsoft Power Automate to perform various tasks automatically in the cloud. This could be use to perhaps send an automatic reminder every week to each person that has not yet performed a task or reached a milestone.
The possibilities are endless.
.
.