Good spreadsheets

Spreadsheetin’: making data usable for computers

6 August 2024

A colleague from the CDS days recently asked for some good resources on spreadsheets / tabular data. I used to run “spreadsheetin’” sessions where I tried (through hands-on examples) to share some of the principles or practices I’ve stumbled into over the years.¹

Important fundamental assumption alert! In assembling this assortment of links, I worked from this assumption: good spreadsheet / tabular data practices amount, at their core, to storing data in a way that’s usable for computers. That may not be the only purpose for good spreadsheets, of course, but I find it’s a frequent and worthy one. (With the caveat that sometimes we make things less intuitively usable for people by making data more immediately usable for computers, which is a tradeoff worth interrogating depending on the work at hand! See the end of the post for some links to readings on critical data studies.)

With that in mind:

The best all-in-one resource that I have easily at hand is the free course materials for “Data Organization in Spreadsheets for Social Scientists”. It does a good job of not assuming too much knowledge (though it maybe does take “spreadsheets” as a concept for granted; I don’t really have a good resource for that, and welcome any recommendations for one!).
A good quick reference for what makes good data / a good spreadsheet for computing / publication purposes (which may be different than day-to-day usability, admittedly!), sort of a best practices application of the ideas taught in that course. It leaves out the why, but definitely gives a good how.
Covering similar ground as the previous two, but from a more research management perspective, these “good enough practices” are truly that. They may not include quite the level of explanation necessary for a first read for someone not versed in this stuff, but they’re good once that reader has gone through one or two other resources, I think. The description of “analysis-friendly data” aka “tidy data” (see next item!) is a really good quick summary of what makes a good / computable spreadsheet.
A lot of these reference or indirectly explain the concept of “tidy data”, which comes from a quite readable paper by Hadley Wickham, who then went on to grow a whole ecosystem of data science tools around the tidy philosophy. It’s only mildly statistician-technical; the sections defining and applying tidy data (2 and 3 respectively) are of most interest in this context.
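
To make “tidy” a bit more concrete, here’s a minimal sketch (my own illustration, not taken from any of the resources above) of reshaping a “wide” sheet, with one column per year, into a tidy one where every row is a single observation. The site names, counts, and the `wide_to_tidy` helper are all made up for the example, and it only uses Python’s standard library.

```python
import csv
import io

# Hypothetical "wide" sheet: one row per site, one column per year.
# The values are invented purely for illustration.
WIDE_CSV = """site,2021,2022,2023
North,12,15,11
South,8,,9
"""


def wide_to_tidy(wide_text, id_column, variable_name, value_name):
    """Reshape wide CSV text so that each output row holds one observation."""
    reader = csv.DictReader(io.StringIO(wide_text))
    tidy_rows = []
    for row in reader:
        for column, cell in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row instead
            tidy_rows.append({
                id_column: row[id_column],
                variable_name: column,
                # Keep blank cells as explicit missing values rather than dropping them.
                value_name: cell if cell != "" else None,
            })
    return tidy_rows


tidy = wide_to_tidy(WIDE_CSV, "site", "year", "count")
for observation in tidy:
    print(observation)
```

The point is less the code than the shape it produces: “year” stops being a set of column headers and becomes a value a computer can filter, group, or sort on. In practice, tools like the tidyverse do this reshaping for you.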

From there, if you wanted to get more into applied data practices (using a scripting language), you could follow my path into R for Data Science and the tidyverse, or, more recently, into working with Observable notebooks (check out trending notebooks for examples of what they can do).

But good spreadsheets (and / or other forms of tabular data storage!) will get you far on your way.

BONUS CONTENT!!

If you’d like to interrogate the idea of data from a critical perspective (because who wouldn’t want to do that while gaining some practical skills!?), some good readings:

“Raw Data” Is an Oxymoron, an ~~open-access~~² edited book, discusses the contrast between “raw” and “cooked” data in the introduction by Lisa Gitelman and Virginia Jackson.
Rob Kitchin’s brief summary of his book Data Lives: How Data Are Made and Shape Our World carries on this metaphor, discussing the field of critical data studies.
Kitchin and Tracey Lauriault discuss the nature of data, and the influence of data infrastructure on data (and vice versa) in a chapter in the 2018 book Digital Geographies. This is probably a good middle ground between the previous two (a whole book and a blog post), discussing many of the same concepts in an engaging way.

At its core, I remember the advice boiled down to “don’t use column or row highlighting alone to convey meaning”, from which we can extrapolate a bunch of good practices. That is to say, these principles and practices were hardly novel!

Really, I’d been making so-so spreadsheets for years, then read R for Data Science years ago (there’s a new version, cool!). Reading the tidy data chapter and applying the tidy philosophy clarified the better practices underpinning those so-so spreadsheets, and I started making better ones. ↩
August 11, 2024 update: This book is not open access! My bad! If others know of an equivalent-ish suggestion, I’d be glad to share it here. ↩