Modern Software Experience

2009-06-17

Average Genealogy

GenCircles

Average Genealogy Hosting Size mentioned the average size for various genealogy hosting sites, including those for WebTree and GenCircles.

The numbers for WebTree and GenCircles were not derived from totals listed on the site’s home page, but from analysis of all sizes. Average WebTree Size discussed derivation of the WebTree average. This article discusses how I got the data for GenCircles, as well as how I dealt with some issues I encountered.

GenCircles

bias

Because GenCircles and Family Tree Legends were both products of Pearl Street Software, we can expect Family Tree Legends databases to be overrepresented, and thus a bias towards the average size of Family Tree Legends databases.

When Pearl Street Software put GenCircles up for sale they claimed 135 million records and 160.000 registered users. That would work out to less than 850 records per user, but not all the registered users contributed a tree.
Many users registered because they thought that was necessary, others registered but failed to upload a database, and so on.
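The records-per-user figure follows directly from the two claimed totals. A minimal sketch of that arithmetic, using only the numbers quoted above (the variable names are mine):

```python
# Figures claimed by Pearl Street Software, as quoted in the text.
claimed_records = 135_000_000
registered_users = 160_000

# Records per registered user; below 850, as stated.
records_per_user = claimed_records / registered_users
print(records_per_user)  # 843.75
```

Since not every registered user contributed a tree, this figure understates the average size per tree.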

There is enough public data on the GenCircles site to calculate the average size of the currently more than 100.000 trees.
That average size turns out to be 1.946,98.

data pages

GenCircles has a list all files link on the home page. This leads to 27 pages with a list on each one: pages A through Z and page 0-9. Analysing that data is just a matter of copying those pages into a spreadsheet. Well, that is easier said than done.

One practical issue is that the total number of trees (more than 100.000) is more than 65.536, and even 32-bit spreadsheets may have several 16-bit limits in them.
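One possible way around such a row limit is to split the rows over several sheets. The sketch below is only an illustration of that idea, not necessarily what was done for this article; the function name and the dummy data are assumptions.

```python
MAX_ROWS = 65_536  # the 16-bit row limit mentioned in the text

def split_into_sheets(rows, max_rows=MAX_ROWS):
    """Split rows into consecutive chunks of at most max_rows each."""
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]

# Placeholder data: 100,001 dummy rows standing in for "more than 100.000" trees.
rows = list(range(100_001))
sheets = split_into_sheets(rows)

# Number of sheets needed and the row count of each.
print(len(sheets), [len(sheet) for sheet in sheets])
```

Any totals then have to be summed across the sheets rather than over a single column.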

nameless

Once I had all the data copied into a spreadsheet, I noticed a few odd things. Not only do many trees have the same name, some trees have no name. Some names are hurriedly chosen names such as My Family, which may have been chosen by anyone, but at other times it is clear that a user has uploaded multiple versions of the same database over time. I have done nothing to correct either issue.

empty databases

That GenCircles has quite some trees with just one individual in them is not a unique issue, but that GenCircles has 1.975 trees with zero records, often trees with the same name as other ones, suggests that users have regularly encountered uploading issues that were never fixed.

Although databases with just one record arguably are not trees either, only the zero-sized trees have been removed from the analysis.

That included 1.326 nameless zero-sized trees. Once the zero-sized trees were removed, there were 649 nameless trees left. After removing another 8.396 empty trees, there were 101.370 trees left.
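The filtering step can be sketched as follows. The rows are made-up sample data, not actual GenCircles entries, and the (name, size) layout and function name are assumptions for illustration; the analysis itself was done in a spreadsheet.

```python
# Made-up sample rows (tree name, record count); not actual GenCircles data.
trees = [
    ("My Family", 120),
    ("", 0),           # nameless and empty: removed
    ("My Family", 0),  # duplicate name, empty: removed
    ("", 1),           # nameless but not empty: kept
    ("Smith", 4500),
]

def drop_empty(trees):
    """Keep only trees with at least one record."""
    return [(name, size) for name, size in trees if size > 0]

kept = drop_empty(trees)

# Count the nameless trees that survive the filter.
nameless_left = sum(1 for name, _ in kept if not name.strip())
print(len(kept), nameless_left)
```

Trees with a single record pass the filter, matching the choice to remove only zero-sized trees.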

large databases

There are 140 files larger than 100.000 individuals, of which 20 are larger than 250.000 individuals, and the largest is 846.886 individuals. The name of that database makes it clear it was contributed in 2007, so it is not unlikely that the corresponding desktop database has passed 1.000.000 records already. I guess that it is currently the largest genealogical research database owned by a single individual.

average

The 101.370 databases left after removing the empty databases from consideration contain 197.365.966 records in total, so that is an average of 1.946,98 records per database.
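The average is simply the total record count divided by the number of remaining databases. A minimal check of that division, using only the two totals quoted above (the variable names are mine):

```python
# Totals quoted in the text, after removing the empty databases.
total_records = 197_365_966
tree_count = 101_370

# Mean number of records per database.
average = total_records / tree_count
print(f"{average:.3f}")
```

The result is roughly 1946.986 records per database, in line with the 1.946,98 quoted above.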

links