Modern Software Experience

2011-10-09

genealogy blogger database statistics

genealogy database size

In June of 2009, I wrote a series of articles examining genealogy database size. It all started with three genealogy articles I published at the end of May.
First, there was Medium Size Genealogy, the first ever attempt to estimate the size of a medium size genealogy: the size of your database if you investigated your ancestry thoroughly, and did not investigate anything else. It introduced a simple formula to calculate a ballpark figure; for various estimates of the key values, the medium size varies from 12.000 through 50.000 to 88.000.

The second article was My Large is Smaller than Yours, an article that poked fun at vendors using large to describe medium, small, tiny and even minuscule databases, mostly to try and hide the fact that the capacity of their application is severely limited. That 2009 article concluded with the remark that, in an age of 32-bit desktop computers and 64-bit servers, every 16-bit value is small.

The third article was Average Number of Nth Generation Descendants. Some bloggers had claimed that it is impossible to calculate an average number of nth-generation descendants for ancestors that lived n generations ago; I proved otherwise by providing a formula to do so.

Average Size Month

The first two articles prompted discussion in the geneasphere about just what is large; see the links in Average Size Month. During the month of June, I wrote two interrelated series, one looking into the average database size of various genealogy hosting sites, and one introducing social genealogy metrics: metrics for social genealogy sites.
Some unscientific polling suggested that many expected the average to be just a few thousand. The final article in the first series, Genealogy Database Average, Median and Mode, notes that the average is higher, but that the median fits the expectations; perhaps we actually estimate the median when we try to estimate the average. The ballpark figures found are an average size of about 12.500, and a median size of about 2.500 individuals.

database              size
Randy Seaver        41.324
Karen               24.978
Becky Higgins        4.483
Caroline Gurney      5.096
Pamela Wile          1.845
Carol               16.661
Celia                8.066
MNFamilyHistorian   11.922
Mel                  9.453
Jacqueline Foster   10.167
Liz Tapley           3.808
Ginger Smith         8.561
Jen Smart            1.308
Julie                4.837
GeneaPopPop          5.012
Geniaus              8.473
Nastrond            91.380
MidWestAncestree    23.476
Elizabeth Handler    4.347
Doris Wheeler        4.105
Bill West           25.971
Reba Mc              5.645
Lyn Swan            14.579
Lis K                  517
Sébastien Comeau    43.018
Tim Forsythe         6.054
Tessa Keough         5.657

number                  27
total              390.563
average             14.465
median               8.066

Saturday Night Genealogy Fun

Every week, Randy Seaver of Genea-Musings posts Saturday Night Genealogy Fun (SNGF), challenging participants to perform some random genealogy or family history related task. This week, the challenge was to post your genealogy database numbers.
Randy posted his own, and participants left theirs in comments or in a post on their own blog. I decided to collect the key value, the number of individuals in each database, and calculate some statistics.

Most numbers are copied directly from the comments on Randy's blog post, or the blog post the comment linked to. MidWestAncestree's database is split into five parts, probably because of performance or capacity issues with Ancestry Family Trees. The five numbers posted have been added together again.

There are a few tiny genealogy databases, quite a few small ones, five medium size genealogy databases, and one large one. I did include Randy's database, but I did not include my own database, simply because I do not want to skew the results towards large databases.
That leaves 27 participants who posted their database size, totalling 390.563 individuals, with an average database size of 14.465, and a median size of 8.066.
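For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation, using only the standard library. The `sizes` dictionary and the `summarize` helper are names made up for this illustration, and the values are simply the rows of the table above, typed in without the thousands separators. Run on the rows as listed, it gives the same count of 27 and the same median of 8.066; the total comes out at 390.743 rather than 390.563, so the average it prints differs by a few individuals, possibly because one of the listed values changed during the updates.

```python
from statistics import mean, median

# Database sizes (number of individuals) as listed in the table above.
# The dictionary name and layout are illustrative assumptions.
sizes = {
    "Randy Seaver": 41324, "Karen": 24978, "Becky Higgins": 4483,
    "Caroline Gurney": 5096, "Pamela Wile": 1845, "Carol": 16661,
    "Celia": 8066, "MNFamilyHistorian": 11922, "Mel": 9453,
    "Jacqueline Foster": 10167, "Liz Tapley": 3808, "Ginger Smith": 8561,
    "Jen Smart": 1308, "Julie": 4837, "GeneaPopPop": 5012,
    "Geniaus": 8473, "Nastrond": 91380, "MidWestAncestree": 23476,
    "Elizabeth Handler": 4347, "Doris Wheeler": 4105, "Bill West": 25971,
    "Reba Mc": 5645, "Lyn Swan": 14579, "Lis K": 517,
    "Sébastien Comeau": 43018, "Tim Forsythe": 6054, "Tessa Keough": 5657,
}


def summarize(values):
    """Return the count, total, average and median of the given sizes."""
    values = list(values)
    return {
        "number": len(values),
        "total": sum(values),
        "average": mean(values),
        "median": median(values),
    }


stats = summarize(sizes.values())
for name, value in stats.items():
    # Print with a dot as thousands separator, matching the table's style.
    print(f"{name:8}", f"{value:,.0f}".replace(",", "."))
```

The median is the 14th of the 27 sorted values, which is why a single very large database moves the average a lot but leaves the median untouched.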

larger

Both the average and the median for the SNGF participants are higher than the average and median calculated in 2009. Part of the difference is simply that, in the more than two years since, databases have grown larger. Part of the reason is that participants are self-selecting, and some bloggers with tiny or small databases may feel too intimidated by the medium size databases to reveal their numbers. Perhaps, but surely that effect was more than compensated for by leaving my own database out.
The main reason is probably that genealogy bloggers aren't average; the average (ahem) genealogy blogger is more enthusiastic about genealogy than most other genealogists.

updates

2011-10-09 instant update

Updated with more numbers from latest participants, and a comment on my Google+ post.

2011-10-09 instant update

Updated with more numbers from comments on Randy's Google+ post. No comments on either of his two Facebook posts.
