Modern Software Experience

2012-11-08

multithreading for genealogy

Consistency Check Reminder

In Consistency Check Reminder, I suggested that genealogy software that offers a menu item to start consistency & plausibility checks should implement a new feature, the Consistency Check Reminder, which reminds users to perform those checks. That feature should work not unlike the fairly common back-up reminder feature, and be provided for much the same reasons; it is in our own interest, but it takes time and isn't very enjoyable, and if no one reminds us that we should do so, we're likely to forgo doing so.

I suggested the inclusion of this feature because it is a relatively easy but worthwhile addition to genealogy software that already provides an option to check the database. Providing users with the ability to check their database for obvious and possible errors is a Good Thing. Reminding users to take advantage of the possibility to perform consistency & plausibility checks makes that Good Thing even better.
Genealogy software with the Consistency Check Reminder feature is better software than genealogy software without, but it is not ideal. It still requires the user to start the database check procedure and wait for the results. Ideally, the user would not have to do that.

time

In Consistency Check Reminder, I noted that it would make logical sense to ensure the database is consistent before making reports or charts, but that there is a practical consideration for not doing so; performing the checks takes time, and making the user wait on that every time they want to generate a small report does not provide an enjoyable user experience.
That argument for not performing the checks makes sense, but isn't the last word on the matter; the argument depends on the checks causing a significant delay.

speed

Consistency & plausibility checks may take a considerable amount of time because the individual checks are slow, because the database is large, or both.

The speed at which different genealogy applications perform consistency checks varies considerably. One reason for that is that different applications have different designs and are based on different technologies. Another reason is that, more often than not, the checks are a feature that was added later, and the vendor did not consider the speed of those checks to be important anyway.

There are several ways to speed up database consistency checks, but the most significant one is checking less; for example, when a user requests a report, checks could be limited to the individuals in the report.

duplicate effort

The one thing that will slow down even the fastest checks is database size; the larger the database that needs to be checked, the longer checking that database takes. Most of the effort involved in a straightforward check of a large database is wasted repeat effort; most of it has been checked before and has not changed since then. Avoiding that duplicate effort would truly speed up consistency checks.

data entry & import

Many genealogy editors perform checks when an individual is edited. Those data entry checks focus on the consistency of the data for that individual (born before death) and the immediate relationships: partnerships (all marriages within the individual's lifetime), parents and children. Those checks prevent the most egregious errors but, because of their limited scope, do not catch all problems. Moreover, they still do not stop a database from being inconsistent; a user may deliberately enter inconsistent data, to research that problem later. Data entry checks are useful, but not sufficient. A database consistency check is still needed.
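
A minimal Python sketch of the two data entry checks mentioned above, born before death and marriages within the individual's lifetime. The `Individual` and `Marriage` types, their field names and the problem messages are assumptions made for illustration; they are not taken from any particular genealogy application.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Marriage:
    # Hypothetical record: only the marriage date matters for this sketch.
    married: Optional[date] = None


@dataclass
class Individual:
    name: str
    born: Optional[date] = None
    died: Optional[date] = None
    marriages: List[Marriage] = field(default_factory=list)


def data_entry_problems(person: Individual) -> List[str]:
    """Return the problems found for one individual; an empty list means none."""
    problems = []
    # Check 1: birth must not come after death.
    if person.born and person.died and person.born > person.died:
        problems.append("born after death")
    # Check 2: every marriage must fall within the individual's lifetime.
    for marriage in person.marriages:
        if marriage.married is None:
            continue  # an unknown date cannot be checked
        if person.born and marriage.married < person.born:
            problems.append("married before birth")
        if person.died and marriage.married > person.died:
            problems.append("married after death")
    return problems


person = Individual(
    name="Example Person",
    born=date(1850, 3, 1),
    died=date(1840, 1, 1),
    marriages=[Marriage(date(1845, 6, 1))],
)
print(data_entry_problems(person))
# ['born after death', 'married before birth', 'married after death']
```

Note that the function only reports problems; it does not refuse the data, so the user can still save a deliberately inconsistent record.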

Most genealogy editors allow importing databases. Some applications perform checks on imported data, others do not. Data import checks are not useless; it is good to inform the user. However, data import generally is an all-or-nothing affair, and informing the user about detected problems does not stop these problems from entering the database.
Moreover, even a database that seems free from problems may cause consistency problems once imported; it may contain conflicting information for duplicate individuals.

integrated consistency checking

That choosing to perform a consistency & plausibility check causes a genealogy application to recheck the entire database reflects the design of this feature as an extra, a utility for the database that was added later onto an already existing design.
Consistency checks would be much faster if they were deeply integrated into the application, if the database was designed to support consistency checks.

remember results

The key idea is to remember results, so that it is no longer necessary to recheck everything every time. For example, whenever the user edits an individual, the application should not only warn the user when there are possible problems, but always remember the results of that check as well.

The application should maintain a consistency & plausibility status for everything that gets checked. In practice, an application will probably use specific codes for specific issues (thus speeding up searches for particular problems), but the general idea is that the consistency & plausibility status of each item is either good, bad or unknown; the status is unknown when it has not been checked yet, or needs to be rechecked because of some edit. For example, when the user edits an individual, the consistency status of all the immediate relationships that individual has becomes unknown; now that the dates for the individual may have changed, the plausibility of those relationships needs to be rechecked.
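
The status bookkeeping described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a design taken from an existing product: the `StatusTracker` class, its method names and the item identifiers (`"I1"`, `"F1"`) are all hypothetical, and a real application would likely store specific issue codes rather than three plain values.

```python
from enum import Enum


class Status(Enum):
    GOOD = "good"
    BAD = "bad"
    UNKNOWN = "unknown"


class StatusTracker:
    """Remembers the consistency status of items, keyed by a hypothetical id."""

    def __init__(self):
        self._status = {}     # item id -> Status
        self._relations = {}  # individual id -> ids of its immediate relationships

    def add_relationship(self, relationship_id, individual_ids):
        # New items start out unchecked, so their status is unknown.
        self._status.setdefault(relationship_id, Status.UNKNOWN)
        for individual_id in individual_ids:
            self._status.setdefault(individual_id, Status.UNKNOWN)
            self._relations.setdefault(individual_id, set()).add(relationship_id)

    def record(self, item_id, is_consistent):
        """Remember the outcome of a check."""
        self._status[item_id] = Status.GOOD if is_consistent else Status.BAD

    def status(self, item_id):
        # Anything that was never checked is unknown.
        return self._status.get(item_id, Status.UNKNOWN)

    def individual_edited(self, individual_id, is_consistent):
        """Remember the data entry check and invalidate the relationships."""
        self.record(individual_id, is_consistent)
        for relationship_id in self._relations.get(individual_id, ()):
            self._status[relationship_id] = Status.UNKNOWN


tracker = StatusTracker()
tracker.add_relationship("F1", ["I1", "I2"])
for item_id in ("I1", "I2", "F1"):
    tracker.record(item_id, True)

# Editing I1 keeps its own fresh result but makes relationship F1 unknown again.
tracker.individual_edited("I1", True)
print(tracker.status("I1").value, tracker.status("F1").value)  # good unknown
```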

simple import

When the user imports data, the application could check the data as it is imported, but that would seriously increase the complexity of the import procedure. Integrated consistency checking allows a much simpler approach; simply set the consistency status of all imported items to unknown, to check them later.
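
As a sketch of how little the import procedure then has to do, assuming the database and the status table are plain Python dictionaries keyed by a hypothetical item id (`import_items` is an illustrative name, not an existing API):

```python
def import_items(database, status, imported_items):
    """Add imported items without checking them; mark each one as unknown."""
    for item_id, record in imported_items.items():
        database[item_id] = record
        status[item_id] = "unknown"  # to be checked later, not during import
    return len(imported_items)


database = {"I1": {"name": "Existing Person"}}
status = {"I1": "good"}

count = import_items(
    database,
    status,
    {"I2": {"name": "Imported Person"}, "I3": {"name": "Another Import"}},
)
print(count, status)
# 2 {'I1': 'good', 'I2': 'unknown', 'I3': 'unknown'}
```

The import itself performs no checks at all; the already checked item keeps its status, and the new items simply join the set of items that still need checking.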

efficient background checking

When the basic idea of remembering status codes is combined with some internal house-keeping, essentially keeping a list of items with status bad and a list of items with status unknown, checking the entire database does not even require examining the status of every item; it only requires checking the list of items with status unknown.

Integrated consistency checking naturally lends itself to a multithreaded implementation. It is possible to perform the actual consistency checks in the background. A background thread can check the list of items with consistency status unknown for new additions, checking items to update their status and then remove them from the list, to try and keep that list empty. When the list of items with status unknown is empty, all items have already been assigned either status good or status bad.
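
One way to sketch such a background checker in Python uses the standard `queue` and `threading` modules. The `BackgroundChecker` class, its method names and the toy check function are assumptions for illustration; the queue plays the role of the list of items with status unknown, and a single worker thread tries to keep it empty.

```python
import queue
import threading


class BackgroundChecker:
    """Checks items with status unknown on a background thread (a sketch)."""

    def __init__(self, check):
        self._check = check            # callable: item -> True when consistent
        self._unknown = queue.Queue()  # the list of items with status unknown
        self._lock = threading.Lock()
        self._good = set()
        self._bad = set()
        threading.Thread(target=self._run, daemon=True).start()

    def mark_unknown(self, item_id, item):
        """Call after an edit or import; the item is rechecked in the background."""
        with self._lock:
            self._good.discard(item_id)
            self._bad.discard(item_id)
        self._unknown.put((item_id, item))

    def _run(self):
        while True:
            item_id, item = self._unknown.get()  # blocks while the list is empty
            is_consistent = self._check(item)
            with self._lock:
                # Drop any earlier result before remembering the new one.
                self._good.discard(item_id)
                self._bad.discard(item_id)
                (self._good if is_consistent else self._bad).add(item_id)
            self._unknown.task_done()

    def wait_until_idle(self):
        """Block until the list of unknown items is empty."""
        self._unknown.join()

    def bad_items(self):
        """All a 'check database' menu item still needs: the bad list."""
        with self._lock:
            return set(self._bad)


checker = BackgroundChecker(lambda person: person["born"] <= person["died"])
checker.mark_unknown("I1", {"born": 1800, "died": 1870})
checker.mark_unknown("I2", {"born": 1900, "died": 1850})
checker.wait_until_idle()
print(checker.bad_items())  # {'I2'}
```

A single worker processes items in the order they were queued, so an item that is edited twice ends up with the result of its latest check. A real application would also need to persist the status and deal with items deleted while they are queued, which this sketch ignores.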

When the results of all consistency & plausibility checks are remembered, when those checks have already been performed in the background, the only action a menu item for a database consistency check still needs to trigger is presenting the list of items with status bad.
Perhaps more interesting is that it becomes relatively easy to perform a partial database check, such as a check for just the part included in a report; after all, the actual checks have already been performed, all that is left to be done is compare that partial database to the list of items with status bad.
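
With the status lists available as sets of item ids, such a partial check reduces to a set intersection. The Python sketch below is illustrative only; the function name and ids are hypothetical, and also reporting the items that are still unknown is an added assumption, for the case where the background checks have not caught up yet.

```python
def problems_in_report(report_ids, bad_ids, unknown_ids=frozenset()):
    """Return (bad, not yet checked) item ids among those used in a report."""
    report = set(report_ids)
    return sorted(report & set(bad_ids)), sorted(report & set(unknown_ids))


bad, pending = problems_in_report(
    report_ids=["I1", "I2", "I3", "F1"],
    bad_ids={"I2", "I9"},
    unknown_ids={"F1"},
)
print(bad, pending)  # ['I2'] ['F1']
```

The cost depends on the size of the report, not on the size of the database.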

better user interface

Arguably the most interesting thing about integrated consistency checking is that it enables a better user interface.
The most obvious advantage of integrated consistency checking is that it provides interactive performance levels, even for large databases. Users interested in possible problems will no longer be discouraged by a long wait on a slow process, but will instead experience quick presentation of the problem list.

A genealogy application with integrated consistency checking does not need to present Consistency Check Reminders. That isn't because the application is continually updating every consistency status field already. It remains a good idea to remind users to take advantage of the consistency checking the application offers.

The reason a genealogy application with integrated consistency checking does not need to present Consistency Check Reminders is that it can do something much better. Once the consistency status of items has been calculated anyway, highlighting that status in each and every overview becomes just as easy as colour-coding gender.
Integrated consistency checking enables a genealogy application to continually draw the user's attention to problems through colours and icons.

Update 2013-07-17: I did not come up with a separate name for this feature. RootsMagic 6.3 has implemented it and called it Problem Alerts.

shared web trees

I am not aware of any other approach to consistency checks that has reasonable CPU consumption and performance for large databases. That already makes integrated consistency checking of particular interest to shared web trees, where the largest database fragments contain millions of profiles. What makes integrated consistency checking even more worthwhile to shared web trees is that an item's known consistency status can be taken advantage of in decisions to either invite or prevent edit and merge operations.

worth doing

Tracking consistency & plausibility status involves some overhead, but that is hardly an argument against it. Tracking this status makes more genealogical sense than tracking the date and time some record was last changed, and that is something that most genealogy applications already do.

Although I tried to present integrated consistency checking as simply as possible, actually implementing it isn't a trivial exercise. The brief description of the essential ideas provided above skipped over many details that implementations will have to deal with.
Implementing integrated consistency checking as outlined takes a considerable amount of development, but it is a great way to take advantage of multithreading for genealogy, and the improved user experience makes it worth doing.

updates

2013-07-17: RootsMagic Problem Alerts

The Problem Alerts feature introduced today in the RootsMagic 6.3.0.0 upgrade aims to provide the user experience innovation presented in this article; RootsMagic uses background processing to run your entire database through its existing consistency checks, so that any problem persons can be highlighted on screen to draw attention to any issues it found.
RootsMagic is currently the only genealogy editor that aims to provide the suggested user experience for your databases.

2013-11-26: Legacy Family Tree 8

Legacy Family Tree version 8 has been released, and now includes Problem Alerts too.
