Modern Software Experience

2009-06-12

Genealogy Size

What is the size of the average genealogy research database? Figuring out average Genealogy Size gave a quick overview of the articles that addressed this question so far, looked at various possible sources of information, and concluded that genealogy hosting sites are the best source of numbers.

Desktop Size versus Upload Size pointed out that such a calculation would lead to the average upload size, and listed reasons why that is not the same as the average desktop size, but went on to argue that the upload size is more relevant than the desktop size, and often the number you really want.

However, finding a reasonable answer still is not as simple as picking any hosting site and calculating its average. There are many issues that influence the average.

issues

public numbers

Some sites publish numbers, others do not. Many boast about the total number of records, but we need a site that also reveals the number of users or databases to calculate an average.
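To make the arithmetic concrete, here is a minimal sketch of the calculation. The function name and the figures are made up for illustration; they are not taken from any real site.

```python
def average_database_size(total_individuals: int, database_count: int) -> float:
    """Return the mean number of individuals per database."""
    if database_count <= 0:
        raise ValueError("database_count must be positive")
    return total_individuals / database_count


# Hypothetical published totals, not real data.
print(average_database_size(total_individuals=3_000_000, database_count=1_500))
```

A site that only publishes the total number of records gives us the numerator but not the denominator, which is why such a site is of no use here.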

Some sites publish all the numbers we need, others do not, and yet others publish more.

We are limited to sites that do publish the necessary numbers. The numbers for sites that do not publish these numbers could be very different. We have no way to know, but no immediate reason to assume the numbers are significantly different either.

I do suspect that sites that publish numbers, and thus already advertise how popular they are, attract more users because of it, but that is still no reason to assume their average is different.

GEDCOM restriction

Using the average from a site that builds its collection through GEDCOM import does limit the uploads, and thus the calculated average, to applications that support GEDCOM export.

There are few people who maintain their research in a word processor, a spreadsheet or a database of their own making. Genealogy editors that do not support GEDCOM export are rare too, and widely advised against.

Put like that, it hardly seems an issue, just a formal observation that there is this limitation, but there are some real issues here.

GEDCOM quality

One issue is that vendors of desktop genealogy applications seem to put considerably more time into making sense of each other’s GEDCOM dialects than into fixing their own.

The GEDCOM output quality of several applications is plain horrible, and what some applications produce as GEDCOM is not even GEDCOM at all. The low output quality of these applications may make it impossible for the hosting site to import the data. A user may attempt to fix whatever problems are encountered, but may just as well decide to give up.

Thus, applications with low quality GEDCOM output are likely to be under-represented on hosting sites. These applications arguably deserve to be under-represented, but it is a factor that influences the result. How it influences the result is unknown, but it seems safe to assume that hosting vendors try to solve all known issues, so perhaps the influence is negligible.

non-GEDCOM

GEDCOM is the dominant data exchange format, but not the only one. Some desktop applications allow direct upload to particular sites.

Vendor-bias

Some vendors offer both a desktop application and a hosting site. That gives that particular hosting site a marketing advantage with those desktop users already. Moreover, some desktop applications that can directly upload their data to the vendor’s site are quite pushy in prompting the user to do so, or even bluntly do so when the user thinks they are merely printing a book…

The coupling leads to overrepresentation of the particular application already. Pushiness does not only make that worse but also leads to many databases being uploaded before the user would otherwise have done so; thus, the average size on these sites is likely to be lower than on other sites.
Users angry about the vendor holding on to their database are likely to avoid uploading their data again, and likely to switch applications. Both user decisions lead to yet lower averages for these vendor-specific hosting sites.

upload limits

Some sites have upload limits. Limited Genealogy argues that this does not only affect the average size, but also the quality of the data. In turn, small databases and low quality make the site less attractive.
That effect probably isn’t very strong yet, but the fact remains that a limit affects the average, and affects it a lot. I’ll get back to just how much it influences the average.

For now, the sensible thing to do is to avoid sites with upload limits, and calculate an average for sites without upload limits.
The number we get that way might be a little on the high side. After all, the sites that do have limits drive users with larger databases to the sites that have none.
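A small sketch shows how strongly a limit can pull the average down. The sizes and the limit are invented for illustration, and the sketch assumes that a database above the limit is simply not uploaded at all.

```python
# Hypothetical database sizes (individuals per database); not real data.
SIZES = [50, 200, 800, 3_000, 12_000, 60_000]


def average(sizes):
    """Plain arithmetic mean of a non-empty list of sizes."""
    return sum(sizes) / len(sizes)


def average_with_limit(sizes, limit):
    """Average over the databases a site with an upload limit would accept.

    Assumption: databases larger than `limit` are not uploaded at all.
    """
    accepted = [size for size in sizes if size <= limit]
    return average(accepted)


print(average(SIZES))                          # site without a limit
print(average_with_limit(SIZES, limit=5_000))  # site with a limit
```

With these made-up sizes, the few large databases dominate the unlimited average, so excluding them cuts the result by more than a factor of ten.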

paid sites

A site that offers free hosting for small databases and charges for larger ones will have a low average, but what about a site that charges all its users for hosting?

There are various ways to charge for hosting, but most contracts have a minimum fee, which you probably do not want to pay for hosting of a tiny database of just a dozen individuals. Paid hosting of small databases is relatively expensive, and paid hosting of large databases is relatively cheap. So, a paid hosting site is likely to have a relatively high average.

old versus new

An old site will have many files from many years ago, back when the average was smaller than it is today. The older the site, the stronger this effect will be. We do not know yet how strong this effect is, but we should not underestimate it. After all, just 25 years ago the average size was zero.

updates

For a first estimate, it does not matter much whether you divide the total size by the number of users or the number of databases. Most sites allow only one database per user, and most users have just one database.

replace or keep

There is another issue that matters a lot more: how does the site handle updates? Does the site allow you to upload a new version to replace the previous one, or does it keep the old one and show the new one as a separate database?

There are sites where you recognise the same errors or omissions in multiple trees, and sometimes these trees even have nearly identical names. Sometimes sites do allow uploading new versions of existing trees, but make uploading the new version as a new tree much easier. Sometimes it is too hard to retrieve a password used years ago, and much easier to create a new account. Whatever the reason, keeping old versions around affects the average size.

difference

By the way, today sites with multiple versions of the same tree tend to annoy us, as we go through the same data twice or thrice. But if these sites were to upgrade their software to show only the latest tree, and then highlight changes to data already in the previous version, we would love it, as it would be highlighting apparent errors that have been corrected since then.

So, that these sites have multiple versions can be a very Good Thing, but until they use it wisely, their presentation of two versions as two trees is annoying; it would be better to show just the latest one.

average

The old versions a site keeps around do affect the average. The older versions are smaller, possibly a lot smaller. That affects the average we calculate today.
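The effect is easy to illustrate. In the sketch below, the tree names and upload histories are hypothetical; each list holds the sizes of successive uploads of the same tree, oldest first.

```python
# Hypothetical upload histories: tree name -> sizes of successive uploads.
UPLOADS = {
    "tree_a": [300, 900, 2_500],
    "tree_b": [1_000, 4_000],
    "tree_c": [150],
}


def average_all_versions(uploads):
    """Average when every old version counts as a separate database."""
    sizes = [size for versions in uploads.values() for size in versions]
    return sum(sizes) / len(sizes)


def average_latest_only(uploads):
    """Average when each new upload replaces the previous version."""
    latest = [versions[-1] for versions in uploads.values()]
    return sum(latest) / len(latest)


print(average_all_versions(UPLOADS))
print(average_latest_only(UPLOADS))
```

For this made-up data, counting every version yields a noticeably lower average than counting only the latest version of each tree.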

deletability

Quite a few sites will hold on to your data once you’ve uploaded it. Even calls to their helpdesk will not get your data deleted. A few public complaints from users unable to delete their data are enough to make many other users hesitant about uploading theirs - perhaps less so when their database is small, and more so when their database is large. Thus, sites that do not allow users to delete their data possibly have a lower average than sites that do allow users to delete their data.

right site

We need to pick the right site.

It should have public numbers. It should not be associated with a vendor of desktop software to avoid vendor bias. It should not be free for small and paid for larger databases. It should not be a paid site either. So it has to be a free site.

It should have no upload limits. It should allow updating of existing databases. It should allow deleting your database. It should ideally be a fairly new site, to avoid the average size being influenced by many older databases.

There is such a site, and its average database size is about 12,500 individuals.
