Modern Software Experience

2009-06-10

Average Size

calculating average

Several articles have been addressing the question of how large the average desktop genealogy database is, and whether it is possible to come up with some reasonable estimate.

Figuring out Average Genealogy Size provided an overview of approaches so far, and then considered several possible sources of information, concluding that sites that allow users to host their genealogy database by uploading a GEDCOM for free seem to provide an ideal sample.

After all, many such sites boast about the number of records they have, and several publish the number of users too, thus allowing the calculation of their average size. If only it were that simple…
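The naive calculation hinted at above is a single division. A minimal sketch, assuming a site publishes both a total record count and a user count; the function name and the figures are made up for illustration and do not describe any real site:

```python
def average_upload_size(total_records: int, total_users: int) -> float:
    """Naive average: published record count divided by published user count."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    return total_records / total_users


# Hypothetical figures, not taken from any actual hosting site.
average = average_upload_size(total_records=50_000_000, total_users=100_000)
print(average)
```

With these invented numbers, the result is 500 records per user. Whether that figure says anything about desktop databases is a separate question.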

desktop versus upload

A correct calculation of the average upload size would yield just that, the average upload size. That is not the same as the average desktop size.

large gift

Owners of large databases may be hesitant to freely give up the result of all their years of hard labour to some online service. That hesitation keeps the average low.

nothing to share

Owners of small databases may be hesitant because they feel intimidated by the larger ones they see on the hosting site, and think they have nothing to share. That hesitation keeps the average high.

no living

Many users upload their entire database, but others remove the living individuals first to respect their privacy. Thus, the desktop database is larger than their upload.

ancestors only

Some users with extensive databases upload only their direct ancestors, and ask you to contact them for more information. Their desktop database is larger than their upload.

old

All databases were uploaded some time ago. The average size was smaller back then. The current average is larger.

grown

Even if all uploads are fairly recent, the corresponding desktop databases have still grown since then. The snapshot on the site is smaller than the current desktop database.

matching

Many beginning researchers with small databases may feel that it is useful to upload their database in search of an automatic match. This leads to overrepresentation of small databases.

not the same

The average size of the desktop genealogy database and the average size of the uploaded database are related, but not the same.

desktop larger?

The upload size cannot be larger than the desktop size, but that does not imply the desktop average is larger than the upload average. If small databases are underrepresented, the desktop average may well be smaller than the upload average.
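A small worked example makes this concrete. The sketch below uses invented sizes and assumes, purely for illustration, that the three smallest owners never upload while the two largest upload 90% of their data; none of these numbers are measurements:

```python
from statistics import mean

# Hypothetical desktop database sizes (number of individuals); not real data.
desktop_sizes = [200, 300, 500, 4000, 10000]

# Assumption: only owners with at least 4000 individuals upload,
# and each of them uploads 90% of the desktop database.
uploads = {size: size * 9 // 10 for size in desktop_sizes if size >= 4000}

desktop_average = mean(desktop_sizes)
upload_average = mean(uploads.values())

print(desktop_average, upload_average)
```

Every single upload is smaller than its desktop original (3600 versus 4000, 9000 versus 10000), yet the upload average of 6300 is more than twice the desktop average of 3000, simply because the small databases are missing from the sample.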

function of time

The various issues that influence the difference are functions of time; there is no constant factor between the desktop and the upload size. Desktop databases keep growing while the uploaded database remains unchanged. The fraction of living people (removed before upload) depends on the research and the shape of the population pyramid. Willingness to share and hesitation to show unfinished work depend on constantly changing culture.

The average upload size may very well be a good approximation of the average desktop size, but it is hard to say how good an approximation. Then again, the more interesting observation is that it is often the upload size, not the desktop size, that we actually care about.

relevant

site

The owners of the hosting sites care about the upload size. Hosting sites would love to get more users to upload their data, but until those users do so, the size of these databases matters little to either the host or the visitors. The visitors care about what has been uploaded. The Social GEDCOM Formula uses upload size. Any import limits a vendor sets affect the upload size.

uploads

There is data in desktop databases that will, for whatever reason, never be uploaded. There is data in desktop databases that will eventually be uploaded - and it will be counted then. One day, someone will duplicate the research and it won’t matter anymore. We may care about all the data that is never shared, but in many ways, it is only the data that does get shared that matters to us.

shared data

Although it may be interesting to know how much research data is still private, it is much more immediately interesting to know how much research data has been made public. It is through the upload that the personal data gets shared with the larger community.

Upload size is not just an approximation of desktop size; it is often the right metric.
So, let’s look at how to calculate average upload size…

links