2009-06-09 Figuring out Average Genealogy Size

Average Genealogy Size

Average

database size

What is the average size of a genealogy research database nowadays?
It may seem a simple question, but it is not so easy to come up with a reasonable answer.

social failure

So far, I’ve shown that the social genealogy numbers that vendors eagerly publish are not much help. First, “Average Size is a Statistic” calculated the average database size from social genealogy numbers, with paradoxical results. Then “Social Genealogy Metrics” explained the paradox: the average database size metric does not fit social genealogy. It also introduced social genealogy metrics that do make some sense, including the P/U ratio.

What we really want to know is the average size of the desktop research database. Interestingly, the social genealogy numbers do allow estimating that. First, “How Geni beats We’re Related” pointed out that Geni’s P/U ratio is three times that of We’re Related, apparently because Geni supports GEDCOM import and We’re Related does not. Following that observation, “Social GEDCOM Formula” provided a formula that relates the difference in P/U ratios to the average GEDCOM upload size.

Alas, to actually calculate the average upload size from the public numbers using that formula requires that we know the fraction of users that uploaded a database, and that is probably a trade secret.
So, all that formula tells us is that if the fraction of uploaders is between 1 in 10,000 and 1 in 100 users, then the average upload size is between 100 and 10,000 individuals.
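
To make the arithmetic behind that range concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the formula reduces to "extra P/U = fraction of uploaders × average upload size"; the actual formula is the one given in Social GEDCOM Formula and may differ in detail. The extra P/U of 1.0 is not a published number, it is simply the value that the range quoted above implies under that assumption, and the function name is made up for this example.

```python
def average_upload_size(extra_pu, uploader_fraction):
    """Average upload size implied by an extra P/U and an uploader fraction.

    Assumes the simplified relation
        extra_pu = uploader_fraction * average_upload_size
    which is an illustration, not necessarily the exact published formula.
    """
    if not 0 < uploader_fraction <= 1:
        raise ValueError("uploader_fraction must be in (0, 1]")
    return extra_pu / uploader_fraction


# Hypothetical value: the extra persons per user attributed to GEDCOM uploads.
EXTRA_PU = 1.0

for one_in in (100, 1_000, 10_000):
    size = average_upload_size(EXTRA_PU, 1 / one_in)
    print(f"1 in {one_in:,} users uploads: average upload of about {size:,.0f} individuals")
```

The point of the sketch is how sensitive the outcome is: every factor of ten in the unknown fraction of uploaders moves the estimated upload size by a factor of ten as well.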

That is not very informative, so we need some other numbers, a better method. Let’s take a step back and consider how we would like to calculate it if we could.

ideal

Ideally, we’d calculate the average size by averaging the size of all existing genealogy research databases. That is data we simply do not have, but there are several ways to get a significant sample.

polling

One way to collect data is by asking for it. That could be a poll on a blog or a questionnaire at a trade show. The obvious problems with this approach are that the respondents select themselves and that there is no way to verify the answers. One typo could ruin the results, and you would not even know it had happened. It is definitely better to get numbers straight from the database than to ask around.
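
A small sketch of how much damage a single typo can do to an average. The poll answers below are invented for illustration; they are not real survey data.

```python
from statistics import mean

# Hypothetical poll answers: number of individuals in each respondent's database.
answers = [350, 1200, 2500, 4000, 8000]

# The same answers, except that one respondent typed two zeros too many
# (2500 became 250000).
answers_with_typo = [350, 1200, 250000, 4000, 8000]

clean_average = mean(answers)
typo_average = mean(answers_with_typo)

print(f"average without typo: {clean_average:,.0f}")
print(f"average with one typo: {typo_average:,.0f}")
```

With only the final average in hand, there is no way to tell whether such a slip occurred, which is the argument for reading sizes directly from the databases.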

collection

Over the years, I’ve built a collection of GEDCOM files, and I could calculate some statistics for it, but I know that my collection is not representative at all. To calculate a good estimate, we need to find a representative collection.
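
For reference, getting the size straight out of a GEDCOM file is not hard: every individual is a level-0 record with the INDI tag, such as `0 @I1@ INDI`. The following is a minimal sketch in Python, not a full GEDCOM parser. The function names are made up for this example, and it assumes a single-byte-compatible encoding (ASCII, ANSEL, ANSI or UTF-8); UTF-16 files would need to be decoded differently.

```python
from pathlib import Path
from statistics import mean


def count_individuals(lines):
    """Count level-0 INDI records in an iterable of GEDCOM lines."""
    count = 0
    for line in lines:
        parts = line.split()
        # An individual record looks like: 0 @I1@ INDI
        if len(parts) >= 3 and parts[0] == "0" and parts[2] == "INDI":
            count += 1
    return count


def collection_sizes(folder):
    """Map each *.ged file name in a folder to its number of individuals."""
    sizes = {}
    for path in sorted(Path(folder).glob("*.ged")):
        # latin-1 never fails to decode, and only the ASCII structure
        # (level, cross-reference, tag) matters for counting.
        with open(path, encoding="latin-1") as handle:
            sizes[path.name] = count_individuals(handle)
    return sizes


# Tiny in-memory sample instead of a real file.
sample = [
    "0 HEAD",
    "0 @I1@ INDI",
    "1 NAME John /Doe/",
    "0 @I2@ INDI",
    "1 NAME Jane /Doe/",
    "0 @F1@ FAM",
    "0 TRLR",
]
sample_count = count_individuals(sample)
print(sample_count)
```

For a folder of files, `mean(collection_sizes(folder).values())` gives the average size of that collection, which still says nothing about how representative the collection is.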

vendors

Vendors of genealogy software have various ways to collect data. They could have the software report the database size back to them. Ancestry.com’s Family Tree Maker and MyHeritage’s Family Tree Builder try to get all your information, by making you upload your entire database.

Vendors also receive databases from users for debugging purposes, and probably keep these around for continued testing. They may have built quite a collection that way, and could calculate some statistics for it.

So, vendors are likely to have a fairly good idea of the size of their users’ databases, and it would be nice if some vendors shared that information (hint, hint), but they may not want to do so.
Worse, even if they shared the information, we would still not be happy, as we have no way of verifying it.

biased

The more interesting observation is that vendor-specific data is biased. Users may start with a relatively slow program, but as their database grows, the poor performance will be reason enough to switch to another program. Thus, slow programs lose users with large databases, and fast programs often gain users who already have a sizeable database that will only get larger.

Users may switch for other reasons too, but the fact remains that users who switch from one program to another affect the statistics for both, and that many users will often switch for the same reasons. Thus, vendor-specific numbers have a bias.
We need a sample that avoids vendor bias.
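
A toy illustration of that bias, in Python. Both the database sizes and the switching threshold are invented for this example; the only point is the direction of the effect.

```python
from statistics import mean

# Hypothetical database sizes (individuals) of ten users
# who all started out with the same slow program.
sizes = [200, 400, 600, 800, 1000, 2000, 5000, 10000, 20000, 60000]

# Assumption: once a database grows beyond this size, the slow program
# becomes unbearable and the user switches to a fast program.
SWITCH_THRESHOLD = 4000

slow_program = [s for s in sizes if s <= SWITCH_THRESHOLD]
fast_program = [s for s in sizes if s > SWITCH_THRESHOLD]

overall_average = mean(sizes)
slow_average = mean(slow_program)
fast_average = mean(fast_program)

print(f"all users:    {overall_average:,.0f}")
print(f"slow program: {slow_average:,.0f}")
print(f"fast program: {fast_average:,.0f}")
```

In this toy setup, neither vendor’s average comes close to the average over all users: the slow program’s number is far too low and the fast program’s number far too high.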

By the way, if you had all the vendor-specific numbers to compare with each other, you would know which programs users are abandoning for which other programs; vendors with a small average are losing customers, vendors with a larger average are gaining them. Then again, perhaps some vendors have a low average because they are successful at attracting new researchers. Either way, vendors may not want other vendors to know their numbers.

summary

Quickly summarising the thoughts so far:

Social Genealogy numbers are public
Alas, average size isn’t a Social Genealogy Metric
Average GEDCOM size and P/U ratio are related
We do not know the fraction of users that uploaded data
The ideal is averaging numbers for all existing databases
That ideal is not achievable; we must settle for a sample
That should be a representative sample
Polls are no good, because respondents are self-selecting
Typos can ruin results, so numbers should be processed directly
The result should be verifiable
Vendors are likely to have lots of data
Vendors have reasons to not share their data
We cannot verify vendor data
Vendor-specific data is biased anyway
We want a sample that avoids vendor bias

hosting

Sites that allow users to host their genealogy database by uploading a GEDCOM for free seem to provide an ideal sample. Many such sites boast about the number of records they have, and several publish the number of users too, thus allowing the calculation of their average size.

It is easy to just pick one such site and calculate the average, but there are many reasons why that is not good enough…