What is the size of the average genealogy research database? Figuring out average Genealogy Size gave a quick overview of the articles that addressed this question so far, looked at various possible sources of information, and concluded that genealogy hosting sites are the best source of numbers.
Desktop Size versus Upload Size pointed out that such a calculation would lead to the average upload size, and listed reasons why that is not the same as the average desktop size, but went on to argue that the upload size is more relevant than the desktop size, and often the number you really want.
However, finding a reasonable answer still is not as simple as picking any hosting site and calculating its average. There are many issues that influence the average.
Some sites publish numbers, others do not. Many boast about the total number of records, but we need a site that also reveals the number of users or databases to calculate an average.
Some sites publish all the numbers we need, others do not, and yet others publish more.
We are limited to sites that do publish the necessary numbers. The numbers for sites that do not publish these numbers could be very different. We have no way to know, but no immediate reason to assume the numbers are significantly different either.
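The calculation itself is plain division, provided the site publishes both numbers. A minimal sketch in Python; the function name and the figures are hypothetical, chosen only for illustration, and not taken from any real site:

```python
from typing import Optional


def average_database_size(total_individuals: Optional[int],
                          database_count: Optional[int]) -> Optional[float]:
    """Return the average number of individuals per database.

    Returns None when the site does not publish one of the two numbers,
    or when the count is zero, because no average can be calculated then.
    """
    if total_individuals is None or not database_count:
        return None
    return total_individuals / database_count


# Hypothetical figures, for illustration only.
print(average_database_size(3_000_000, 1_500))  # 2000.0
print(average_database_size(3_000_000, None))   # None
```

A site that only boasts about its total number of records falls into the second case: without a user or database count there is nothing to divide by.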
I do suspect that sites that publish numbers, and thus advertise how popular they already are, attract more users because of it, but that is still no reason to assume their average is different.
Using the average from a site that builds its collection through GEDCOM import does limit the uploads, and thus the calculated average, to applications that support GEDCOM export.
There are few people who maintain their research in a word processor, a spreadsheet or a database of their own making. Genealogy editors that do not support GEDCOM export are rare too, and widely advised against.
Put like that, it hardly seems an issue, just a formal observation that there is this limitation, but there are some real issues here.
One issue is that vendors of desktop genealogy applications seem to put considerably more time into making sense of each other’s GEDCOM dialects than into fixing their own.
The GEDCOM output quality of several applications is plain horrible, and what some applications produce as GEDCOM is not even GEDCOM at all. The low output quality of these applications may make it impossible for the hosting site to import the data. A user may attempt to fix whatever problems are encountered, but may just as well decide to give up.
Thus, applications with low-quality GEDCOM output are likely to be underrepresented on hosting sites. These applications arguably deserve to be underrepresented, but it is a factor that influences the result. How it influences the result is unknown, but it seems safe to assume that hosting vendors try to solve all known issues, so perhaps the influence is negligible.
GEDCOM is the dominant data exchange format, but not the only one. Some desktop applications allow direct upload to particular sites.
Some vendors offer both a desktop application and a hosting site. That already gives that particular hosting site a marketing advantage with those desktop users. Moreover, some desktop applications that can upload their data directly to the vendor’s site are quite pushy in prompting the user to do so, or even bluntly do so when the user thinks they are merely printing a book…
The coupling alone already leads to overrepresentation of that particular application. Pushiness not only makes that worse, but also leads to many databases being uploaded before the user would otherwise have done so; thus, the average size on these sites is likely to be lower than on other sites.
Users angry about the vendor holding on to their database are likely to avoid uploading their data again, and likely to switch applications. Both user decisions lead to yet lower averages for these vendor-specific hosting sites.
Some sites have upload limits. Limited Genealogy argues that this affects not only the average size, but also the quality of the data. In turn, small databases and low quality make the site less attractive. That effect probably isn’t very strong yet, but the fact remains that a limit affects the average, and affects it a lot. I’ll get back to just how much it influences the average.
For now, the sensible thing to do is to avoid sites with upload limits, and calculate an average for sites without upload limits.
The number we get that way might be a little on the high side. After all, the sites that do have limits drive users with larger databases to the sites that have none.
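To get a feel for how strongly a limit can pull the average down, here is a small sketch in Python. The database sizes and the limit are made-up numbers, and the sketch assumes that a database above the limit is simply not uploaded at all:

```python
def average(values):
    """Plain arithmetic mean of a collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical database sizes (number of individuals), for illustration only.
sizes = [50, 200, 800, 3_000, 12_000, 60_000]

# Assumed upload limit; databases above it never reach the site.
UPLOAD_LIMIT = 5_000

without_limit = average(sizes)
with_limit = average(s for s in sizes if s <= UPLOAD_LIMIT)

print(without_limit)  # 12675.0
print(with_limit)     # 1012.5
```

In this invented sample, the few large databases carry most of the total, so excluding them cuts the average by more than a factor of ten.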
A site that offers free hosting for small databases and charges for larger ones will have a low average, but what about a site that charges all its users for hosting?
There are various ways to charge for hosting, but most contracts have a minimum fee, which you probably do not want to pay for hosting a tiny database of just a dozen individuals. Paid hosting of small databases is relatively expensive, and paid hosting of large databases is relatively cheap. So, a paid hosting site is likely to have a relatively high average.
An old site will have many files from many years ago, back when the average was smaller than it is today. The older the site, the stronger this effect will be. We do not know yet how strong this effect is, but we should not underestimate it. After all, just 25 years ago the average size was zero.
For a first estimate, it does not matter much whether you divide the total size by the number of users or the number of databases. Most sites allow only one database per user, and most users have just one database.
There is another issue that matters a lot more: how does the site handle updates? Does the site allow you to upload a new version to replace the previous one, or does it keep the old one and show the new one as a separate database?
There are sites where you recognise the same errors or omissions in multiple trees, and sometimes these trees even have nearly identical names. Sometimes sites do allow uploading new versions of existing trees, but make uploading the new version as a new tree much easier. Sometimes it is too hard to retrieve a password used years ago, and much easier to create a new account. Whatever the reason, keeping old versions around affects the average size.
By the way, today sites with multiple versions of the same tree tend to annoy us, as we go through the same data twice or thrice. But if these sites were to upgrade their software to show only the latest tree, and then highlight changes to data already in a previous version, we would love it, as it would highlight apparent errors that have been corrected since then.
So, that these sites have multiple versions can be a very Good Thing, but until they use it wisely, their presentation of two versions as two trees is annoying; it would be better to show just the latest one.
The old versions a site keeps around do affect the average. The older versions are smaller, possibly a lot smaller. That affects the average we calculate today.
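A small sketch in Python makes the effect concrete. The tree names, version numbers and sizes are all hypothetical; the sketch compares the average over every stored upload with the average over only the latest version of each tree:

```python
# Each upload is (tree name, version number, number of individuals).
# All names and numbers are hypothetical.
uploads = [
    ("tree_a", 1, 500),
    ("tree_a", 2, 2_000),
    ("tree_a", 3, 8_000),
    ("tree_b", 1, 4_000),
]


def average_all_versions(uploads):
    """Average size when every stored version counts as a database."""
    return sum(size for _, _, size in uploads) / len(uploads)


def average_latest_versions(uploads):
    """Average size when only the newest version of each tree counts."""
    latest = {}
    for name, version, size in uploads:
        if name not in latest or version > latest[name][0]:
            latest[name] = (version, size)
    return sum(size for _, size in latest.values()) / len(latest)


print(average_all_versions(uploads))     # 3625.0
print(average_latest_versions(uploads))  # 6000.0
```

With these invented numbers, counting the two older versions of one tree as separate databases lowers the calculated average from 6000 to 3625.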
Quite a few sites will hold on to your data once you’ve uploaded it. Even calls to their helpdesk will not get your data deleted. A few public complaints from users unable to delete their data are enough to make many other users hesitant about uploading theirs - perhaps less so when their database is small, and more so when their database is large. Thus, sites that do not allow users to delete their data possibly have a lower average than sites that do allow users to delete their data.
We need to pick the right site.
It should have public numbers. It should not be associated with a vendor of desktop software to avoid vendor bias. It should not be free for small and paid for larger databases. It should not be a paid site either. So it has to be a free site.
It should have no upload limits. It should allow updating of existing databases. It should allow deleting your database. It should ideally be a fairly new site, to avoid the average size being influenced by many older databases.
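These criteria amount to a simple filter. A sketch in Python, in which the record type, its field names and the three example sites are all assumptions made up for illustration; the preference for a fairly new site is left out because it is an ideal, not a hard requirement:

```python
from dataclasses import dataclass


@dataclass
class HostingSite:
    # Hypothetical record; the field names are assumptions for this sketch.
    name: str
    publishes_numbers: bool
    vendor_affiliated: bool
    free: bool                # free for all databases, whatever their size
    has_upload_limit: bool
    allows_updates: bool
    allows_deletion: bool


def is_suitable(site: HostingSite) -> bool:
    """Apply the hard criteria for a site whose average we can trust."""
    return (site.publishes_numbers
            and not site.vendor_affiliated
            and site.free
            and not site.has_upload_limit
            and site.allows_updates
            and site.allows_deletion)


# Invented example sites, for illustration only.
sites = [
    HostingSite("Site A", True, True, True, False, True, True),
    HostingSite("Site B", True, False, True, False, True, True),
    HostingSite("Site C", False, False, True, True, False, False),
]

print([s.name for s in sites if is_suitable(s)])  # ['Site B']
```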
There is such a site, and its average database size is about 12,500 individuals.
Copyright © Tamura Jones. All Rights reserved.