I’ve posted quite a few articles about the size of genealogy databases these past weeks. Spurred on by Randy Seaver, Valerie Craft and John Newmark, I have tried to find a reasonable answer to the question of average genealogy size.
The articles so far have looked at various issues in trying to answer that question and presented numbers for various sites. I am collecting numbers to look beyond today’s average, but that is for another day. To wrap up this series, it might be a good idea to consider what the numbers mean.
After discussing how genealogy hosting sites might be biased, I tried to set some criteria that would avoid any such bias. The one site that matched those criteria is FamilyLink’s WebTree, and for that site the average size of the databases in its collection is about 12.500.
The numbers for other sites are lower, typically about 2.000 or 2.500, but the numbers for the TNG Network, possibly the least biased of all the others, are of the same magnitude.
The whole point of setting criteria was to remove bias. I am wondering, though, whether I forgot to consider something. As Genealogy Software is Slow pointed out, lots of genealogy software does not handle large databases well or at all. So, to users with large databases, most genealogy software is practically unusable, and they keep looking around for something better.
Perhaps those with large databases keep trying every new site they hear
about, and always try to upload their data to find out how well it works. If so,
any new site would soon have a relatively high average that will decrease over
time.
So far, the decreasing average I found for the TNG Network seems to support that theory. Perhaps the
average is closer to 5.000 than to 10.000.
There will be more numbers, but let’s assume for now that the WebTree numbers are reasonable. The informal polling I have done for opinions on average size suggests that this number is considerably larger than many expected. That sure seems an issue worth exploring. Are our own estimates too low, or is this number too high?
There are various reasons why you may believe this number to be too high. Perhaps the number is indeed relatively high, because WebTree somehow attracts relatively large databases.
Perhaps our estimates are low because we formed them based on impressions we received time and time again visiting older sites, where the average is considerably lower because of the many older databases, and all the other reasons that tend to bias vendor-specific sites towards a relatively low average.
Perhaps you made some effort to estimate average size several years ago, when databases were smaller than they are now, and still have that relatively low estimate in the back of your mind.
These are some good reasons to explain the difference, but the informal polling I have done suggests that the most important reason that people doubt the average size is that they were not thinking about the average size at all.
Some informal and admittedly utterly unscientific polling I have done suggests that many believe that the average is a few thousand, somewhere around say 2.500, because they are under the impression that about half the databases they see are larger and about half are smaller than that.
Well, that impression is probably right, but the reasoning is not. That particular value is not the average size. That is the median size, and the distribution of database sizes is such that the average size is several times the median size.
The median size of the databases in the WebTree collection is 2.314 individuals; half the databases are smaller than that, and half are larger than that.
I’ve looked at the size distribution for two different sites, WebTree and GenCircles, and in both cases, only about one in six databases is of average size or larger.
Most of us know that there are more small than large databases, but when we try to take
that into account, we apparently underestimate the impact of the large databases
and end up estimating some number close to the median.
I see no reason to feel bad about that; estimating the median without a
calculator is no mean trick.
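To make the difference between average and median concrete, here is a minimal Python sketch. The dozen sizes in it are invented for illustration only; they are not WebTree or GenCircles data, merely chosen to give a similarly lopsided distribution. Note that Python writes 12_500 where this article writes 12.500.

```python
from statistics import mean, median

# Hypothetical database sizes (individuals per database).
# These values are made up for illustration; they are not real site data.
sizes = [1, 40, 300, 900, 1_800, 2_300, 2_700, 4_000,
         6_000, 9_000, 30_000, 93_000]

average_size = mean(sizes)    # pulled upward by the few very large databases
median_size = median(sizes)   # half the databases are smaller, half are larger

# Fraction of databases that are of average size or larger.
share_at_or_above_average = sum(1 for s in sizes if s >= average_size) / len(sizes)

print(f"average: {average_size:.0f}")
print(f"median:  {median_size:.0f}")
print(f"share at or above average: {share_at_or_above_average:.2f}")
```

With these made-up sizes the median is 2.500, the average is roughly 12.500, and only two of the twelve databases (one in six) are of average size or larger.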
The WebTree numbers suggest that the average database size is about 12.500 individuals, and the median size is about 2.500 individuals.
I have no concrete reason to assume the actual average is considerably higher or lower. So, I will just say that these numbers are not unreasonable, and leave it at that. My actual opinion is that neither the median nor the average size matters much anyway.
In his recent blog post Some Average Database Sizes, Randy Seaver got ahead of the above observations by remarking that “Perhaps the better measurement would be Median GEDCOM Size rather than average size.” That is something I agree with. If we tend to estimate the median instead of the average, then the median is the better metric for us.
He went on to remark that “Really big databases can skew the numbers badly.” That I disagree with. I understand the sentiment, but big databases do not skew the numbers in any way.
These databases are real and must be counted towards the average.
I even have a completely opposite opinion that may surprise you. It is not the big databases that skew the numbers; it is the tiny databases that skew the numbers.
Some statistics for a collection of numbers are the minimum, the maximum, the
average and the median. Another interesting one is the mode.
The mode is the value that occurs most often, and therefore the value that
impacts the average more than any other value does.
Large databases influence the numbers, but so do small ones. Of all sizes, nothing influences the average more than the mode, and the mode for both WebTree and GenCircles is 1; there are more databases of size 1 than of any other size.
When I calculated the average for GenCircles (in Average GenCircles Size), I omitted all the zero-sized databases from consideration. It seemed reasonable to disregard these cases, because these cases probably did not result from empty databases, but from upload errors.
Ignoring all zero-sized databases seemed the right thing to do, but just how much sense does it make to consider a genealogical database of size one at all?
Surely, a genealogy of size one is a contradictio in terminis. Genealogy is about relationships, and for a relationship between individuals you need at least two individuals. Databases containing just one individual do not contain trees, but merely what genealogy applications call a disconnected individual. A disconnected individual isn’t a genealogy.
Thus, the minimum size of a genealogy is two individuals, and all the databases of size one should be ignored. Well, let’s try that.
Of the 101.370 GenCircles databases, exactly 2.700 databases have size one. That is more than 2½ %. The average size of the 98.670 remaining databases is not 1.946,98 but 2.000,23, about 2½ % more.
Of the 1.015 WebTree databases, 6 have size one, that is about ½ % of the total. The average size of the 1.009 remaining databases is not 12.065,95, but 12.137,70, about ½ % more.
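The adjusted averages can be recomputed from the totals alone. The following is a small Python sketch of that arithmetic, not necessarily how I produced the numbers; the function name is my own invention. Because it starts from averages already rounded to two decimals, the last digit may differ slightly from the figures quoted above.

```python
def average_without_size_one(database_count, size_one_count, average_size):
    """Average size after dropping the databases that hold exactly one individual."""
    total_individuals = database_count * average_size
    # Each dropped database contributed exactly one individual to the total.
    remaining_individuals = total_individuals - size_one_count
    remaining_databases = database_count - size_one_count
    return remaining_individuals / remaining_databases

# Figures quoted in this article (Python writes 101_370 for 101.370).
gencircles = average_without_size_one(101_370, 2_700, 1_946.98)
webtree = average_without_size_one(1_015, 6, 12_065.95)

print(f"GenCircles: {gencircles:.2f}")
print(f"WebTree:    {webtree:.2f}")
```

The effect is small because each size-one database removes just one individual from the total while shrinking the number of databases by one.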
A genealogical database with two individuals is hardly a genealogy. It is a start, but a genealogy? That is stretching the meaning of the word genealogy into ridiculous territory. It certainly isn’t a tree; to have any semblance of a tree shape, you need to relate at least three individuals to each other.
Where to draw the line? It does not seem unreasonable to demand parents and grandparents as the minimum for an ancestral overview, but not all genealogies are ancestral studies, and any such minimum is arbitrary.
In practice, we are often forced to include all sizes. When vendors provide totals only, their totals do include all the databases of size one. They may even include a few of size zero, but it is reasonable to assume that most vendors would clean these out of their system once in a while.
When we do have all the numbers, we can exclude the databases of size one, and I’ve just shown how much that impacts the results for genealogy hosting sites. For social genealogy sites, the impact may be more dramatic.
The numbers for various sites do suggest that the average size is in the thousands, but it is hard to be sure how large. WebTree’s average of about 12.500 may seem large, but then again, its median is only 2.500.
Do any of these numbers mean anything? Well, we may wonder about average or median size out of mere curiosity, but these numbers do seem relevant to vendors who set upload limits. If the average is 12.500 and their limit is 5.000, they sure are missing out on a lot of data.
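How much data an upload limit keeps out depends on the size distribution, which vendors rarely publish. As a rough sketch only, the Python below applies a limit of 5.000 to a dozen invented sizes; these are not real site data, and the variable names are my own.

```python
# Hypothetical database sizes (individuals per database), made up for
# illustration; they are not WebTree or GenCircles data.
sizes = [1, 40, 300, 900, 1_800, 2_300, 2_700, 4_000,
         6_000, 9_000, 30_000, 93_000]
upload_limit = 5_000  # the limit used as an example in this article

# Databases that would be refused because they exceed the limit.
too_large = [s for s in sizes if s > upload_limit]

share_of_databases = len(too_large) / len(sizes)
share_of_individuals = sum(too_large) / sum(sizes)

print(f"databases over the limit:   {len(too_large)} of {len(sizes)}")
print(f"share of databases refused: {share_of_databases:.0%}")
print(f"share of individuals lost:  {share_of_individuals:.0%}")
```

With these made-up sizes, a third of the databases exceed the limit, yet they hold the large majority of all individuals. Real collections will differ, but the lopsidedness is the point.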
Copyright © Tamura Jones. All Rights reserved.