When talking about data publication, many of us get caught up in protracted conversations aimed at carefully anticipating and building solutions for every possible permutation and use case. Last week’s release of U.S. census data, in its raw, un-indexed form, however, supports the idea that we don’t have to have all the answers to move forward.
Genealogists, statisticians and legions of casual web surfers have been buzzing about last week’s release of the complete, un-redacted collection of scanned 1940 U.S. census data schedules. Though census records are routinely made available to the public after a 72-year privacy embargo, this most recent release marks the first time that the census data set has been made available in such a widely accessible way: by publishing the schedules online.
In the first 3-hours that the data was available, 22.5 million hits crippled the 1940census.archives.gov servers. The following day, nearly 3 times that number of requests continued to hammer the servers as curious researchers scoured the census data looking for relatives of missing soldiers; hoping to find out a little bit more about their own family members; or trying to piece together a picture of life in post-Great Depression, pre-WWII America.
For the time being, scouring the data is a somewhat laborious task of narrowing in on the census schedules for a particular district, then performing a quick visual scan for people’s names. The 3.9 million scanned images that make up the data set are not, in other words, fully indexed — in fact, only a single field (the Enumeration District number field) is searchable. Encoding that field alone took 6 full-time archivists 3-months.
The task of encoding the remaining 5.3 billion fields is being taken up by an army of volunteers. Some major genealogy websites (such as Ancestry.com and MyHeritage.com) hope the crowd-sourced effort will result in a fully indexed, fully searchable database by the end of the year.
Release day for the census has been described as “the Super Bowl for genealogists.” This excitement about data, and participation by the public in transforming the data set into a more useable, indexed form are encouraging indications that those of us interested in how best to facilitate even more sharing and publishing of data online are doing work that has enormous, widely-appreciated value. The crowd-sourced volunteer effort also reminds us that we don’t necessarily have to have all the answers when thinking about publishing data. In some cases, functionality that seems absolutely essential (such as the ability to search through the data set) is work that can (and will) be taken up by others.
So, how about your data set(s)? Who are the professional and armchair domain enthusiasts that will line up to download your data? What are some of the functionality roadblocks that are preventing you from publishing your data, and how might a third party (or a crowd sourced effort) work as a solution? (Feel free to answer in the comments section below.)