Modern Software Experience

2009-06-29

Genealogy Size

Medium Size Genealogy

A month ago, I wrote Medium Size Genealogy. It introduced the concept of a Medium Size Genealogy, defined as the average size of a well-researched ancestry for a (young) living person. This number obviously depends on the availability of vital records to build genealogies from, which varies from one country to another. It also increases with each generation.

With some assumptions, for a person of Western European descent, the Medium Size may well be 25.000 to 50.000 individuals. The point of that article was not to calculate an exact number, but merely to point out that a database of 25.000 to 50.000 is not large.
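The assumptions behind that range are not restated here. Purely as an illustration of the order of magnitude, and not as the original calculation, the sketch below counts the ancestor positions in a pedigree, assuming every generation doubles and ignoring pedigree collapse (the same ancestor appearing more than once). The function name ancestor_slots is made up for this example.

```python
def ancestor_slots(generations: int) -> int:
    """Count ancestor positions going back `generations` generations.

    Assumption for illustration only: each person has two parents and no
    ancestor occupies more than one position (no pedigree collapse).
    The root person is not counted.
    """
    return sum(2 ** g for g in range(1, generations + 1))


# Show how quickly the count grows with the number of generations.
for n in (10, 12, 14, 15):
    print(n, ancestor_slots(n))
```

Under these simplified assumptions, going back 14 generations gives 32766 positions, which is the same order of magnitude as the range above. A real ancestry has fewer distinct individuals because of pedigree collapse and missing records.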

meaning

I did not define Medium Size Genealogy randomly. The number has meaning. It is a rough indication of how large your ancestry database will be when you have researched and documented it completely. It is an indication of the amount of work you are up against.

It can reasonably be considered a medium because it is what most researchers are aiming for, and there are those who have moved beyond documenting their ancestors into doing one-name studies, one-place studies, research for others, etcetera.

Now, you are likely to start researching related lines while researching your ancestry, if only because your ancestors kept reusing the same names, which makes it easy to get confused and important to double check by researching related lines. Thus, in practice, your database may well exceed medium size before you are done researching your ancestry.

My Large is Smaller than Yours

My Large is Smaller than Yours highlights how vendors whose products have small limits seem to be playing a game of My Large is Smaller than Yours; they call their surprisingly small limits large, even when tiny is a more appropriate choice of word.
It concludes with the observation that in an age of 32-bit desktop computers and 64-bit servers, every 16-bit value is small.
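To put 16-bit values next to the Medium Size range, here is a small sketch. It assumes, for illustration only, a product whose record limit comes from a 16-bit counter; the names fits, UNSIGNED_16_MAX and SIGNED_16_MAX are made up for this example and do not refer to any particular product.

```python
# Largest values that fit in a 16-bit counter.
UNSIGNED_16_MAX = 2**16 - 1   # 65535
SIGNED_16_MAX = 2**15 - 1     # 32767


def fits(individuals: int, limit: int) -> bool:
    """Return True when a database of this many individuals fits under the limit."""
    return individuals <= limit


# Compare both ends of the Medium Size range against both limits.
for size in (25_000, 50_000):
    print(size, fits(size, SIGNED_16_MAX), fits(size, UNSIGNED_16_MAX))
```

A database of 50.000 individuals already exceeds a signed 16-bit limit, and an unsigned 16-bit limit leaves little room beyond it.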

Average Number of Nth Generation Descendants

Just after writing these two articles, I came across a blog post that claims that it is impossible to calculate an average number of nth-generation descendants for ancestors that lived n generations ago.

Average Number of Nth Generation Descendants shows that claim to be wrong with a simple formula for estimating the average number of descendants using population statistics.

geneasphere

The first two articles sparked some discussion in the geneasphere.

In What is "Large", Valerie Craft of Begin with 'Craft' wondered how large is the average GEDCOM of a hobbyist researcher?
In Big GEDCOMs and other genealogy software things Randy picked up Valerie’s question, opining on what he considers large and small.
In The Size of my GEDCOM John Newmark continued the discussion of how large is large, and mentioned the issue of quality versus quantity.

Although I had neither opined on average database size, nor given any opinion on when a database is large, those two questions seemed to be the focus of interest.

The question of average size had intrigued me for some time, and inspired by this interest, I decided to try and answer it.

Social Genealogy Size

I’d been writing an article series on Social Genealogy Size. It discussed why the statistics for Geni are so much better than those for We’re Related, and why average size is not a good metric for social genealogy sites. Social Genealogy Metrics introduced metrics for genealogy sites, and Social Genealogy Success contains some ideas on how vendors can improve their metrics.

I rounded the series off with an article that relates social genealogy metrics to average GEDCOM upload size. Because the answer strongly depends on the unknown fraction of users that upload GEDCOM files, its conclusion is not very conclusive.

Average Genealogy Size

I continued with an article series on Average Genealogy Size.

Figuring Out Average Genealogy Size

Figuring out Average Genealogy Size lists some thoughts about how to figure out average size. It notes various issues that should not be allowed to bias the result.
Among these thoughts are the notions that the sample should be representative and the result should be verifiable. Thus, numbers provided by public genealogy hosting services seem ideal.

However, things are not as simple as dividing the total number of records on such a site by the number of databases.

Desktop Size versus Upload Size

First of all, Desktop Size versus Upload Size makes just one point: we may be interested in the average size of the desktop databases, but such a site provides us with the average size of the uploaded databases instead. These two sizes are different but related. However, the average upload size is about data that gets shared, so it is often the right metric.

Average GEDCOM Upload Size

Average GEDCOM Upload Size discusses issues that bias the results of a genealogy hosting site, and comes up with a set of criteria to select a site that does not suffer any such bias. It ends with the remark that there is such a site and that its average genealogy size is about 12.500 individuals.

Average WebTree Size

The site that meets all criteria is FamilyLink’s WebTree and Average WebTree Size discusses how the average size was calculated from its public numbers and how an inconsistency was dealt with.

Average Genealogy Hosting Size

Average Genealogy Hosting Size puts the WebTree numbers in perspective by comparing them to the numbers for GENDEX sites and other public genealogy hosting sites.

Average GenCircles Size

That article already mentions the average for GenCircles. The next article, Average GenCircles Size, discusses how it was calculated and how some issues were dealt with.

Genealogy Database Average, Median and Mode

The last article in the series, Genealogy Database Average, Median and Mode, discusses what it all means. It mentions the possibility of a large database bias that Average GEDCOM Upload Size overlooked, some reasons why most people estimate the average to be lower, how databases of size 1 skew the results, and ends with the practical observation that vendors whose upload limit is lower than the average size are missing out on a lot of data.
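The difference between average, median and mode, and the effect of an upload limit, can be sketched with a few lines of Python. The database sizes below are invented for illustration and are not taken from WebTree, GenCircles or any other site; the helper name share_of_data_lost and the limit of 10.000 individuals are likewise hypothetical.

```python
import statistics

# Hypothetical database sizes (individuals per database); not real site data.
sizes = [1, 1, 1, 250, 800, 2_000, 5_000, 120_000]

mean = statistics.mean(sizes)      # pulled up by the single large database
median = statistics.median(sizes)  # the middle of the sorted sizes
mode = statistics.mode(sizes)      # the most common size, here the size-1 databases


def share_of_data_lost(sizes: list[int], limit: int) -> float:
    """Fraction of all individuals that sit in databases larger than `limit`."""
    total = sum(sizes)
    rejected = sum(s for s in sizes if s > limit)
    return rejected / total


print(mean, median, mode)
print(share_of_data_lost(sizes, 10_000))
```

In this made-up sample the mode is 1 and the median is a few hundred, while the average is above 16.000. A limit of 10.000 rejects just one of the eight databases, yet that one database holds most of the individuals.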

links

genealogy size articles

Social Genealogy Size series

Average Genealogy Size series

blog posts