Modern Software Experience

2012-04-26

Doubly Censusable

1940 Census

On 2012 April 2, the National Archives and Records Administration (NARA) released the 1940 census. Several organisations bought a complete set of census images and started indexing.

Several parties are working together in the 1940 U.S. Census Community Project. The upside of this cooperation is that the indexing will be done quickly. The downside is that they'll all end up with exactly the same index.
The issue with that is that they'll end up with exactly the same mistakes; if an indexing error prevents you from finding the record you're looking for, you're out of luck, because the other service providers have exactly the same indexing error.

FamilySearch

It is a little known and underappreciated fact that the 1940 U.S. Census Community Project does not create a public index. The three parties working together all get a copy of the index, but the index is not put into the public domain.
FamilySearch has been remarkably silent on this issue, but I asked around to find out just how public the resulting index is, and Archives.com provided the answer: although this is advertised as a collaborative community project, FamilySearch claims copyright and ownership of the resulting index. FamilySearch has no intention of sharing the resulting index with anyone but the companies that helped advertise the indexing project. I invite FamilySearch to give back to the community what the community provided in the first place, by sharing the resulting index with everyone.

Ancestry

Ancestry.com did not join the collaborative indexing project. Ancestry started its own indexing project, in collaboration with the Minnesota Population Center at the University of Minnesota. According to their press release, they are creating the most comprehensive database of the [USA] 1940 census. While FamilySearch claims ownership of the index created by the 1940 U.S. Census Community Project, Ancestry has decided to share their database freely with the scientific community and the public. Well, they are not sharing the entire index, but they are sharing all the numerically-coded fields. Both the data and the documentation will be available through the Minnesota Population Center's Integrated Public Use Microdata Series (IPUMS) site.

double indexing

Humans make mistakes, and indexing a dataset requires knowledge many indexers don't have until they have indexed a lot. To reduce mistakes, indexing projects employ double indexing: each page gets indexed twice, by two different indexers. If both indexers agree, the indexing is done. If there are differences, a third indexer arbitrates.
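The procedure is simple enough to sketch in a few lines of Python. This is only an illustration, not how any of the indexing projects actually implement it: the per-record dictionary layout, the function name double_index and the arbitrate callback are all assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

# Hypothetical layout: one record maps field names to transcribed values.
Record = Dict[str, str]


def double_index(
    first: Record,
    second: Record,
    arbitrate: Callable[[str, str, str], str],
) -> Tuple[Record, int]:
    """Merge two independent transcriptions of the same record.

    Fields on which both indexers agree are accepted as-is.
    Fields on which they differ are passed to the arbitrator.
    Returns the merged record and the number of arbitrated fields.
    """
    merged: Record = {}
    arbitrated = 0
    for field in sorted(first.keys() | second.keys()):
        a = first.get(field, "")
        b = second.get(field, "")
        if a == b:
            merged[field] = a
        else:
            # A third indexer decides; here that is just a callback.
            merged[field] = arbitrate(field, a, b)
            arbitrated += 1
    return merged, arbitrated
```

Note what the sketch makes obvious: when both indexers make the same mistake, the values agree, and the mistake is accepted without anyone ever looking at it again.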

There is no doubt that double indexing prevents sloppy indexing and cuts down mistakes, but the result is not perfect. No one knows how many errors remain. The arbitration statistics could be used to provide an estimate, but it would be just that: an estimate.
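To show what such an estimate might look like, here is a toy model in Python. Every part of it is an assumption: that both indexers err independently with the same per-field probability p, that a fixed share q of the cases in which both err are the same mistake (and therefore go unnoticed), and that the arbitrator is always right. Under those assumptions the observed disagreement rate d equals 2p - (1 + q)p², which can be solved for p. None of the numbers come from a real indexing project.

```python
import math
from typing import Tuple


def estimate_error_rates(
    disagreement_rate: float,
    same_mistake_share: float = 0.5,
) -> Tuple[float, float]:
    """Toy estimate from arbitration statistics (assumed model, not real data).

    disagreement_rate:  fraction of fields that needed arbitration (d).
    same_mistake_share: assumed fraction q of double errors in which both
                        indexers made the *same* mistake.

    Returns (p, residual): the estimated single-indexer error rate and the
    estimated error rate left after double indexing, q * p**2.
    """
    d = disagreement_rate
    q = same_mistake_share
    discriminant = 1.0 - (1.0 + q) * d
    if not (0.0 <= d <= 1.0 and 0.0 <= q <= 1.0) or discriminant < 0.0:
        raise ValueError("inputs are outside the range this toy model covers")
    # Smaller root of (1 + q) * p**2 - 2 * p + d = 0.
    p = (1.0 - math.sqrt(discriminant)) / (1.0 + q)
    residual = q * p * p
    return p, residual
```

The weak spot is q: the arbitration statistics tell you nothing about how often both indexers make the same mistake, so the result is only as good as that guess.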

Archives.com

FamilySearch has an index and Ancestry.com has an index. If only you could somehow get both indexes and compare them, you'd have some real-life, actual data on the accuracy of double indexing.
FamilySearch is only sharing their index with the two parties that helped them draw free labour to their indexing project. Ancestry.com is sharing with everybody through IPUMS, but is only sharing the numerically-coded fields.

Yesterday, Ancestry.com announced its definitive agreement to acquire Archives.com. Archives.com is one of two parties that advertised FamilySearch's indexing project, and is getting a copy of the resulting index in return for that. Thus, once the acquisition completes, Ancestry.com will be the only party to have both the FamilySearch and the Ancestry.com index.

double double index

Ancestry.com will be able to compare both indexes, and they should have the Minnesota Population Center do so, in both their own and the community's interest. When they compare the indexes they will find differences, and they will find errors in both. They'll be able to use that to correct any errors in their own index. Their double index will become a combination of two double indexes: a double double index, an index resulting from the combination of two separate double indexing efforts. The error rate of a double double index will be lower than that of either double index.
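The comparison itself could be as simple as the following Python sketch. The data layout is an assumption made for illustration: each index is a dictionary keyed by a hypothetical (enumeration district, sheet, line) tuple, with each record mapping field names to values. Neither FamilySearch nor Ancestry.com has published their index in this form.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (enumeration district, sheet, line) -> {field: value}
Key = Tuple[str, str, str]
Index = Dict[Key, Dict[str, str]]
Difference = Tuple[Key, str, str, str]


def compare_indexes(index_a: Index, index_b: Index) -> Tuple[List[Difference], float]:
    """List every field on which two indexes disagree.

    Only records and fields present in both indexes are compared.
    Returns the differences and the fraction of compared fields that differ.
    """
    differences: List[Difference] = []
    compared = 0
    for key in sorted(index_a.keys() & index_b.keys()):
        rec_a, rec_b = index_a[key], index_b[key]
        for field in sorted(rec_a.keys() & rec_b.keys()):
            compared += 1
            if rec_a[field] != rec_b[field]:
                differences.append((key, field, rec_a[field], rec_b[field]))
    rate = len(differences) / compared if compared else 0.0
    return differences, rate
```

Every difference found this way is a field where at least one of the two indexes is wrong, which is exactly the list you would want to send back to the census image for another look.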

They'll obtain some real-world data on the accuracy of double indexing, and the improvement double double indexing provides, which I hope they'll share with us. They'll have even better data for academic research, and additionally have a very good idea just how accurate that data is.
Ancestry.com will not only have a better index, but even be able to claim, truthfully and without hyperbole, that their index is better than the one FamilySearch and FindMyPast are using. They'll even be able to quantify that claim.

One can hope that once Ancestry.com provides a double double index, some of the other parties that created their own double index will decide to work together to create another double double index. Index quality will improve all around, and genealogists consulting the double double indexes will be the real winners.

links