Good spreadsheets
Spreadsheetin’: making data usable for computers
A colleague from the CDS days recently asked for some good resources on spreadsheets / tabular data. I used to run “spreadsheetin’” sessions where I tried (through hands-on examples) to share some of the principles or practices I’ve stumbled into over the years.1
Important fundamental assumption alert! In assembling this assortment of links, I worked from this assumption: good spreadsheet / tabular data practices amount, at their core, to storing data in a way that’s usable for computers. That may not be the only purpose for good spreadsheets, of course, but I find it’s a frequent and worthy one. (With the caveat that sometimes we make things less intuitively usable for people by making data more immediately usable for computers, which is a tradeoff worth interrogating depending on the work at hand! See the end of the post for some links to readings on critical data studies.)
With that in mind:
- The best all-in-one resource that I have easily at hand is the free course materials for “Data Organization in Spreadsheets for Social Scientists”. It does a good job of not assuming too much knowledge (though it maybe does take “spreadsheets” as a concept for granted; I don’t really have a good resource for that, and welcome any recommendations for one!).
- A good quick reference for what makes a good data / spreadsheet for computing / publication purposes (which may be different than day-to-day usability, admittedly!), sort of a best practices application of the ideas taught in that course. It leaves out the why, but definitely gives a good how.
- Covering similar ground to the previous two, but from a more research management perspective, these “good enough practices” are truly that. They may not include quite the level of explanation necessary for a first read by someone not versed in this stuff, but they’re good once that reader has gone through one or two other resources, I think. The description of “analysis-friendly data” aka “tidy data” (see next item!) is a really good quick summary of what makes a good / computable spreadsheet.
- A lot of these reference or indirectly explain the concept of “tidy data”, which comes from a quite readable paper by Hadley Wickham, who then went on to grow a whole ecosystem of data science tools around the tidy philosophy. It’s only mildly statistician-technical; the sections defining and applying tidy data (2 and 3 respectively) are of the most interest in this context.
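To make the “tidy data” idea a bit more concrete, here is a minimal sketch (mine, not taken from any of the resources above) of turning a “wide” spreadsheet into a tidy one, where each row holds a single observation and each column holds a single variable. The site names, years, and counts are invented for illustration, and `wide_to_tidy` is a hypothetical helper, not part of any library. It assumes every non-identifier cell is either blank or a whole number.

```python
import csv
import io

# Invented example data: one row per site, one column per year.
# This layout is easy for people to scan, but "year" is hiding in the headers.
WIDE_CSV = """site,2021,2022,2023
North,12,15,11
South,7,,9
"""


def wide_to_tidy(wide_text, id_column, variable_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row.

    Hypothetical helper for illustration. Assumes every column other than
    `id_column` holds either a blank cell or a whole number.
    """
    reader = csv.DictReader(io.StringIO(wide_text))
    tidy_rows = []
    for row in reader:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on each tidy row instead
            tidy_rows.append({
                id_column: row[id_column],
                variable_name: column,
                # Keep blank cells as explicit missing values (None).
                value_name: int(value) if value != "" else None,
            })
    return tidy_rows


tidy = wide_to_tidy(WIDE_CSV, "site", "year", "count")
for tidy_row in tidy:
    print(tidy_row)
```

The tidy version is longer and arguably less pleasant for a person to read, which is the tradeoff mentioned at the top of the post, but a computer can now filter, group, or plot by year without any special handling of the headers.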
From there, if you wanted to get more into applied data practices (using a scripting language), you could follow my path into R for Data Science and the tidyverse, or, more recently, into working with Observable notebooks (check out trending notebooks for examples of what they can do).
But good spreadsheets (and / or other forms of tabular data storage!) will get you far on your way.
BONUS CONTENT!!
If you’d like to interrogate the idea of data from a critical perspective (because who wouldn’t want to do that while gaining some practical skills!?), some good readings:
- “Raw Data” Is an Oxymoron, an open-access2 edited book, discusses the contrast between “raw” and “cooked” data in the introduction by Lisa Gitelman and Virginia Jackson.
- Rob Kitchin’s brief summary of his book Data Lives: How Data Are Made and Shape Our World carries on this metaphor, discussing the field of critical data studies.
- Kitchin and Tracey Lauriault discuss the nature of data, and the influence of data infrastructure on data (and vice versa) in a chapter in the 2018 book Digital Geographies. This is probably a good middle ground between the previous two (a whole book and a blog post), discussing many of the same concepts in an engaging way.
1. At its core, I remember the advice boiled down to “don’t use column or row highlighting alone to convey meaning”, from which we can extrapolate a bunch of good practices. That is to say, these principles and practices were hardly novel!
Really, I’d been making so-so spreadsheets for years, then read R for Data Science years ago (there’s a new version, cool!). Reading the tidy data chapter and applying the tidy philosophy clarified the better practices underpinning those so-so spreadsheets, and I started making better ones.
2. August 11, 2024 update: This book is not open access! My bad! If others know of an equivalent-ish suggestion, I’d be glad to share it here.