Modern Software Experience

2009-06-07

Size Matters

large interest

The publication of Medium Size Genealogy and My Large is Smaller than Yours prompted three bloggers to write about two related questions that I had not addressed: when is a genealogy large, and what is the average size of a genealogy research database anyway?

Medium Size Genealogy provides a rough estimate for the size of a well-researched ancestry. Many one-name research projects will be considerably larger than that.

The only point My Large is Smaller than Yours makes is that vendors’ usage of large for various small values and limits is anachronistic at best. The article concludes with the remark that in an age of 32-bit desktop computers and 64-bit servers, every 16-bit value is small.

Neither article discusses when it is appropriate to call a genealogical research database large, and neither provides any estimate of the average size of today’s databases.

average size?

The second question has intrigued me for some time. It seems to be a topic of some interest with quite a few researchers, but I am not aware of any attempt to provide a reasonable estimate of the average research database.
This year, 25 years after the introduction of GEDCOM, seems an excellent time to try and answer that question. So, let’s try to answer it.

What is the size of the average genealogy research database today, one generation after the introduction of genealogy software?

social paradox

What a convenient coincidence! I already calculated the average size of genealogy databases in Average Size is a Statistic just a few days ago, but with some paradoxical results. I first used public numbers for Geni and We’re Related to show that the average size is less than 10, and then used the same numbers to show that it is more than 4 million.

Social Genealogy Metrics defines several metrics for social genealogy that do make some sense, and explains the paradox: the P/U ratio is not an average size.

how not to

So, Average Size is a Statistic showed how not to calculate the average size of a genealogy database.
Simply using some number for social genealogy sites to calculate an average is actually wrong for two reasons: one is that the ratio is not an average database size, and the other is that we are not after some social genealogy metric at all, but really after an average for desktop data.

desktop size

When people wonder about average size, they do not wonder about social genealogy sites, they wonder about the size of desktop databases. They wonder about the database size of the average genealogist, eh, the average database size of genealogical researchers. The average genealogist does not exist.

Interestingly, the numbers we already have for social genealogy sites can be used to calculate a first rough estimate of the average desktop database.

GEDCOM ratio

Social genealogy sites and applications attract different users than desktop applications do. Social genealogy sites are fairly new and their users start out with nothing, and that explains their fairly low P/U ratio.

Do keep in mind that it is a P/U ratio, not an average database size, or you end up making the same silly calculations as I did in Average Size is a Statistic, and would only end up with meaningless statistics that are easily contradicted using the very same numbers.

Geni’s P/U ratio is 15, whereas the ratio for We’re Related is only 5. As How Geni beats We’re Related points out, the main difference between the two is that Geni.com allows uploading of pre-existing genealogy data through GEDCOM import and We’re Related does not.

That Geni’s P/U ratio is three times that of We’re Related not only shows how important GEDCOM support is, but also allows us to calculate an estimate for the size of the average uploaded database.

formula

If we assume the difference in P/U ratios between two sites is wholly explained by the presence or absence of GEDCOM support, then we get the following formula:

G = S + f × ( D - M )

S: the natural Social P/U ratio (what you get without GEDCOM)
G: the P/U ratio with GEDCOM uploads
f: the fraction of users that upload databases
D: the average size of uploaded Desktop databases
M: the average number of profiles Merged after each upload

For simplicity’s sake, we will assume that the number of profile merges is negligible, or merely that the merges have not happened yet, so that the formula simplifies to:

G = S + f × D
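To get a feel for how much the merge term matters, here is a minimal Python sketch of the formula. The function name pu_ratio is made up for this illustration, and so are the values for f, D and M; only S = 5 is a number quoted in this article. Exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def pu_ratio(s, f, d, m=0):
    """P/U ratio of a site with GEDCOM import: G = S + f * (D - M)."""
    if not 0 <= f <= 1:
        raise ValueError("f must be a fraction between 0 and 1")
    return s + f * (d - m)


# Illustrative values only: S = 5 is the ratio quoted for We're Related;
# f, D and M are made-up numbers, not measurements.
s = 5
f = Fraction(1, 100)
d = 1000

without_merges = pu_ratio(s, f, d)       # simplified formula, M = 0
with_merges = pu_ratio(s, f, d, m=200)   # 200 profiles merged per upload

print(without_merges, with_merges)       # prints: 15 13
```

With these made-up numbers, ignoring merges overstates the ratio by f × M, so the simplification is harmless only as long as M is small compared to D.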

average upload size

We can calculate the average desktop database size from the social genealogy metrics by rearranging that formula thus:

D = ( G - S ) ÷ f

The problem we face is that we only know G and S, two of the four variables. We can fill both of these in, but are still left with two unknowns:

D = ( G - S ) ÷ f, with G = 15 and S = 5
D = ( 15 - 5 ) ÷ f
D = 10 ÷ f

We can calculate D, but only if we know f.

assumption

If we assume that 1% of the Geni.com users bothered to upload their existing tree, then one in a hundred users is responsible for raising the number of profiles per hundred users from 500 to 1500. That would make the average size of the uploaded GEDCOM files approximately 1.000 individuals.

D = 10 ÷ f, with f = 1%
D = 10 ÷ (1 / 100)
D = 10 × 100
D = 1000

Then again, if we were to assume that 1‰ of the users uploaded their database, the average desktop database size would be 10.000.

D = 10 ÷ f, with f = 1‰
D = 10 ÷ (1 / 1000)
D = 10 × 1000
D = 10000

And, if we were to assume that only one basis point of the users uploaded their database, the average size would be 100.000.

D = 10 ÷ f, with f = 1‱ (that is Unicode character U+2031)
D = 10 ÷ (1 / 10000)
D = 10 × 10000
D = 100000
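The three calculations above can be repeated in a few lines of Python. This is only a sketch: the function name average_upload_size is made up, merges are ignored (M = 0), and the three values for f are the same guesses as above, not measurements.

```python
from fractions import Fraction

G = 15  # Geni's P/U ratio, as quoted in the text
S = 5   # We're Related's P/U ratio, as quoted in the text


def average_upload_size(g, s, f):
    """Return D = (G - S) / f, ignoring merges (M = 0)."""
    if not 0 < f <= 1:
        raise ValueError("f must be a fraction greater than 0 and at most 1")
    return (g - s) / f


# The three guesses for f used above; none of them is a measured value.
assumptions = {
    "1 percent": Fraction(1, 100),
    "1 per mille": Fraction(1, 1000),
    "1 basis point": Fraction(1, 10000),
}

upload_sizes = {
    label: average_upload_size(G, S, f) for label, f in assumptions.items()
}

for label, size in upload_sizes.items():
    print(f"f = {label}: D = {size}")
```

Every factor of ten in the guess for f shifts the result by a factor of ten, which is exactly why the outcome says more about the guess than about desktop databases.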

reverse calculation

This gives us some idea of the size of desktop databases, but also shows how much the result depends on our assumptions. It really makes more sense to turn this calculation around: to first figure out an average desktop database size and then use that to estimate the fraction of users that uploaded their data.

f = ( G - S ) ÷ D
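A minimal Python sketch of the reverse calculation follows. The function name upload_fraction is made up, and the value D = 2500 is a placeholder chosen only to make the arithmetic easy to follow; it is not an estimate of the actual average.

```python
from fractions import Fraction

G = 15  # Geni's P/U ratio, as quoted in the text
S = 5   # We're Related's P/U ratio, as quoted in the text


def upload_fraction(g, s, d):
    """Return f = (G - S) / D, ignoring merges (M = 0)."""
    if d <= 0:
        raise ValueError("D must be positive")
    f = Fraction(g - s) / Fraction(d)
    if f > 1:
        raise ValueError("D is too small: more than 100% would have to upload")
    return f


d = 2500  # placeholder average desktop database size, NOT a real estimate
f = upload_fraction(G, S, d)

percent = f * 100
per_mille = f * 1000
basis_points = f * 10000

print(f"f = {float(percent)}% = {per_mille} per mille = {basis_points} basis points")
```

The check on f > 1 makes one limit of the formula explicit: it only yields a sensible fraction when the assumed D is at least as large as the difference G - S.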

Whatever the actual percentage and average upload size, it is pretty clear that the average desktop size is considerably larger than Geni’s P/U ratio, and that is why a small fraction of the user base has such a major impact on Geni’s total database size. That underscores how important GEDCOM support is to social genealogy sites.

small fraction

That the GEDCOM import feature matters was clear without this calculation. The mere fact that Geni’s P/U ratio is three times that of We’re Related is all you need to know to draw that conclusion.

But if you have already somehow calculated an average desktop database size (as I have), and plugged that number into the formula to estimate the fraction of users that uploaded their data, then you know that this fraction is better expressed in basis points or per mille than as a percentage. That implies that Geni could really explode the size of its database if it somehow managed to convince just a few more users to upload their data.
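How sensitive the P/U ratio is to a few more uploaders can be sketched with the simplified formula G = S + f × D. In the Python below, D = 10.000 and the starting fraction of 1‰ are assumptions picked because together they reproduce Geni's ratio of 15; they are not the figures calculated elsewhere, and the function name pu_ratio is made up.

```python
from fractions import Fraction

S = 5  # P/U ratio without GEDCOM import, as quoted for We're Related
BASIS_POINT = Fraction(1, 10_000)


def pu_ratio(s, f, d):
    """Simplified formula G = S + f * D (merges ignored)."""
    return s + f * d


d = 10_000                 # ASSUMED average upload size, for illustration
f_now = 10 * BASIS_POINT   # ASSUMED: 1 per mille of users uploaded so far
f_more = 11 * BASIS_POINT  # one extra basis point of users uploads

g_now = pu_ratio(S, f_now, d)
g_more = pu_ratio(S, f_more, d)
gain_per_basis_point = g_more - g_now  # equals D / 10000

print(g_now, g_more, gain_per_basis_point)  # prints: 15 16 1
```

Under these assumptions, each additional basis point of uploading users adds D ÷ 10.000 profiles per user, so the larger the average desktop database, the more every extra uploader counts.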

metrics

Right now, Geni has GEDCOM support, and We’re Related has not. Geni’s GEDCOM support gives it a technological advantage over We’re Related.

The formula shows that GEDCOM support is a formula for success. If Geni’s marketing department were to wake up to that reality and actively start to convince more users to upload their database, then Geni would soon surpass We’re Related in every important way.

As long as We’re Related lacks GEDCOM support, FamilyLink will be unable to mount a similar data-gathering campaign. FamilyLink cannot hope to compete against any such campaign until it reinstates We’re Related’s GEDCOM import feature. GEDCOM import matters; it matters a lot. It literally makes a sizeable difference.

conclusion

I created a formula that relates average GEDCOM size to social genealogy metrics. Because it contains an unknown factor that we can only guess at, the formula seems less suited for calculating GEDCOM size from social genealogy metrics than for calculating that factor from the average GEDCOM size, or for convincing the marketing department that the GEDCOM import feature is a competitive advantage that should be exploited as much as possible.

Still, the formula gives a rough idea of the range within which the average lies using numbers we have already. Depending on your guess of the fraction of users that upload databases, somewhere between say 100 and 100.000 seems a safe bet. That is highly unsurprising and not particularly informative. Any average genealogist (pun intended) could have told you that without using a formula.

So, the next article gets back to the topic that inspired the formula: how to calculate average genealogy database size?
