Modern Software Experience

2007-08-27

Ancestry.com’s Appropriation of Copyrighted Content

anger

There is a lot of anger on many bulletin boards about a recent addition to Ancestry.com, the Internet Biographical Collection. Ancestry.com itself describes it thus:

Source Information:

Ancestry.com. Internet Biographical Collection [database on-line]. Provo, UT, USA: The Generations Network, Inc., 2007. Original data: Biographical info taken from various English web sites. See specific website address provided with each entry.

About Internet Biographical Collection

This database contains a sampling of biographical sketches found on English language web pages throughout the entire World Wide Web. Web pages can vary greatly in the amount of information they contain about a given person, and in the number of related and unrelated people mentioned on the same page. The information source and the central topic of each page will also vary greatly. Given facts should be verified using other sources. One unique and valuable feature of this web-based collection is the number of hyperlinks leading from each page in the collection to other web pages of possible interest on related topics.

taken from various English web sites

Ancestry.com is not upfront about just which sites are included in the collection. It does not provide a list of the included sites. It does not draw attention to its rather generic description either. The unwary consumer visiting Ancestry.com is misled into thinking that Collection implies ownership or affiliation.

scripts

To use the Internet Biographical collection on ancestry.co.uk, you must allow scripts on ancestry.co.uk, ancestry.com, atdt.com, 247realmedia.com, mfcreative.com. That problem is not specific to this collection, but typical for Ancestry.com’s less than stellar Internet architecture.

search results

The widespread anger is about Ancestry.com copying entire websites without permission. To see what all the ruckus was about, I started with a search on my own name. No matches. They have not copied my website.

When I entered another name, I got a table listing various results. Weird and telling is that Ancestry.com refers to these results as records. There is a View Record link for each link on the left side. There is also a View web page link on the right side. If you try to use either, you are prompted to sign up for paid access. That is what the anger is really about.

taking sites

Ancestry.com did not take selected data from various websites; Ancestry.com took the websites themselves.

Ancestry.com has been copying websites, calls the result the Internet Biographical Collection and is now selling access to it. There is lots of anger over this move. Many webmasters are screaming foul. After all, that their website is freely accessible does not mean it is free of copyright, and that is only the first of many complaints; Ancestry.com’s actions are dishonest, unethical and unlawful. What they have done is outright theft, plain and simple.

collection quality

Lost in the public outcry is some ridicule over the quality of the collection. The inclusion of unrelated web pages makes it clear that it was created by a bot harvesting web sites, apparently without any user intervention. The bot may have some smarts, but cannot really distinguish between genealogy sites and other sites, and Ancestry.com apparently did not even bother to hire a human to scan and avoid inclusion of non-genealogical sites.

proof within the collection

This additional complaint itself is less important, but the observation is significant. It highlights that the collection is merely a bunch of copies created by a bot, not a selection created or vetted by a human. It shows that Ancestry.com’s claim that it took biographical data from various websites is not true. The examples of non-genealogical sites in the collection are proof that Ancestry.com has not been selecting data from websites at all, but has simply been copying websites.

what are they doing

There is some confusion about what exactly Ancestry.com is doing. Some people talk about copies, others about caches, and yet others about frames. Here’s what is happening: they are showing copies they made of third party websites. Ancestry refers to these as Cached Results, but that does not seem an entirely appropriate and honest use of the word Cached to me, as I’ll explain further on.

These copies are shown with a bar along the top. It looks very much like one site framing another. Now, it actually is a frame, but it is the Ancestry.com site framing content on the Ancestry.com site - how weird is that?

The framed content was copied (a.k.a. stolen) from another site, and the bar along the top includes a link to the current site, titled View live web page. Funny detail: if the original page included copyright notices, these were copied along with the rest of the page.

copy

Do note an important difference with Google from a copyright point of view: Google shows the real links (with brief snippets) first, and offers the cached view as an option. Ancestry.com shows its copy first. You do not even get to see the link to the original site unless you decide to view the copy! And you cannot view that copy until you pay a subscription.

One painful result of this scheme is that Ancestry.com is trying to charge people for viewing an out of date copy of their own site.

copyright

Many messages on message boards make it plain that Ancestry.com never obtained permission to copy these websites.

Many people are plain angry because Ancestry.com stole their content, violated their copyright. What really ticks them off is that Ancestry.com is asking money for access to their original resources, which they themselves offer for free.

There is anger about Ancestry.com using their hard work to lure visitors to Ancestry.com instead of their site. There is anger about not even offering a working link back to their original web site, but a link to subscribe to Ancestry.com instead.

search engine similarity

Some have suggested that this new collection is similar to what search engines do.

Search engines keep a cache of the original page for search purposes, and some search engines choose to display that cache. It is likely that Ancestry’s bot will keep updating pages, just as a search engine bot keeps updating the search cache. But that is where the similarity between Ancestry.com’s collection of copies and a search engine cache ends.

differences

Search engines do not present their cache as a Biographical Collection. Ancestry.com keeps its collections in a transactional database with backups to make sure it never loses a record, while search engines do not worry about losing a page or even an entire hard disk; their spider will get a new copy soon enough.

Search engines do not demand membership to view their cache. Search engines do not charge for access to their cache.

up front

Search engines always offer a link to the original site along their search results. They do not present their cache as the destination of your search. Search engines are up front about not being the copyright owner. Search engines do not present their temporary cache as a collection. They do not present selected parts of their cache as their Internet Subject Collection and then charge you for it.

no demands

Search engines do not ask you to register with your email address, so they can spam you with their great offers. They do not ask for your name. They do not ask for your credit card number. They do not demand monthly payments. They do not demand anything from you. Search engines simply provide free access to their cache, and always provide a link to the original pages.

Ancestry.com robot

robots.txt

Most search engines respect the robots.txt protocol, which tells search bots what they may and may not index. Not all bots respect it, and Ancestry.com’s bot is more aptly called a collection bot than a search bot, but it is a bot.

Some searching soon turned up the name of Ancestry.com’s bot: MyFamilyBot, and Ancestry.com’s page for that bot. On that page, Ancestry.com claims that MyFamilyBot honours the robots.txt protocol.
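
What honouring robots.txt amounts to can be sketched in a few lines of Python. This is only an illustration using the standard library's urllib.robotparser, not Ancestry.com's actual code. The robots.txt content, the example.org URL and the may_fetch helper are assumptions made up for this example; only the user agent name MyFamilyBot comes from Ancestry.com's bot page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish:
# keep MyFamilyBot out, leave every other bot free to crawl.
ROBOTS_TXT = """\
User-agent: MyFamilyBot
Disallow: /

User-agent: *
Disallow:
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/genealogy/index.html"  # placeholder URL
    for agent in ("MyFamilyBot", "Googlebot"):
        print(agent, may_fetch(ROBOTS_TXT, agent, page))
```

A bot that honours the protocol runs a check like this before every fetch. Note what the check depends on: the site owner has to know the bot's name, and the bot has to choose to obey.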

robots.txt is irrelevant

Many web site owners will not be appeased by assurances that MyFamilyBot respects the robots.txt protocol. Most websites are happy to appear in search engines, and therefore do not need to bother with robots.txt at all. For the average genealogical web site owner, robots.txt is irrelevant. These site owners did not consider that someone might create a bot to steal their content outright.

collection bot

Ancestry’s MyFamilyBot is not a search bot. It is not building a search index, it is amassing a collection. Therefore, it is a collection bot. That is not a subtle distinction, but a fundamental one.
All those angry webmasters would be happy webmasters if Ancestry.com had created a genealogical search engine. They would be celebrating.

the difference

The difference between Google and Ancestry.com is the difference between a cataloguer providing directions to your site, and a plagiarist selling your work without your permission - without informing you about it, without even informing their customers that you exist, without telling them that they could have the current original for free instead of paying for the stale copy they are being offered for sale.
No amount of sophistry can make these two different things the same.

similar but not the same

A search bot and collection bot may take the same actions when they visit your site. It is because of that similarity that both are known as web bots. It is because of their difference in purpose, because they serve different bot masters, that we call one a search bot and the other a collection bot.

the bot line

There is good reason to distinguish between search bots and collection bots. However similar they are technically, they are quite different conceptually. Most web site owners are ecstatic with more and more search bots crawling their site, but do not want a single collection bot crawling their site. That’s the bot(tom) line.

bundled copyright

robots.txt

Although MyFamilyBot is not a search bot, it would still be wise for it to respect robots.txt. Ancestry.com claims that it does, and I have heard no complaint that it does not. Of course, finding such complaints is hard, when Ancestry.com has not exactly gone out of its way to let everybody know about its collection bot, but let’s simply assume that their collection bot does respect robots.txt, as the real issue is that respecting robots.txt is not good enough. Respecting robots.txt is good enough for search bots, but not for collection bots.

copyright

The robots.txt file is not all that Ancestry.com has to respect. They have to respect the site owner’s copyright too. It is one thing to create a search index as a free or commercial service, it is another thing entirely to bundle copyrighted third party material and sell it as your own collection - without linking to the content, lest the visitor to the Ancestry.com site decides to follow the free link to the up-to-date original instead of paying for Ancestry.com’s stale copy.

Bot Info page

Ancestry.com put up a bot info page, apparently early this year. That page is less than honest about the purpose of its bot. It describes the creation of a genealogical search engine, not the creation of a collection.

MyFamily is creating an index based on a powerful person-based biographical ranking engine that gives superior results over searches done using the more general purpose internet search engines. Ancestry.com indexes the biographic text and provides a search service that points users back to the originating website

Note the bit about including pointers to the originating site: Is originating an innocent silly mistake for original, or a deliberately misleading term, that can legally be explained in unexpected ways? Whatever the case may be, I expect Ancestry.com to update this description soon.

informing webmasters

Ancestry.com may try to argue that this bot info page has been up for months and that webmasters had ample time to update their robots.txt file. I do not believe that Ancestry.com has gone out of its way to alert anyone to the existence of its bot.

In fact, even when I had figured out that their bot is called MyFamilyBot, I still had a hard time locating their bot info page. Google has many hits for MyFamilyBot, but Ancestry.com’s page is not showing on top. In fact, even today, it is still not in Google’s cache at all! Google: Your search - MyFamilyBot site:ancestry.com - did not match any documents.

Ancestry.com robots.txt

Ancestry.com’s own robots.txt does not disallow access to that page. It is a sub-page of their Learning Centre and that part of Ancestry.com is cached in Google. However, there apparently simply isn’t a single link pointing to the bot info page. The page exists in splendid isolation.

misleading

The first problem with the bot information page is that, to put it mildly, it is not exactly advertised on Ancestry.com’s front page. The second problem is that it is so dishonest and misleading, that if any webmaster had come across this page a few months ago at all, they would not have decided to block the bot, but instead have wondered how to make sure their site would be included in this genealogical search engine. It does not honestly warn webmasters that their sites are being copied, but instead tricks webmasters into letting their site be indexed under false pretences.

objections

Many website owners would object to the presentation of stale pages as a collection even if the pages were entirely free. There is the copyright issue, and how seriously others take your copyright if you allow this sort of thing, but there is more. There is the niggling matter of accuracy. Be it a small spelling mistake, a serious error in interpretation of a record or anything in between, when you find an error, you can update your own website. You will always be able to offer the best data you have.

Meanwhile, Ancestry.com has an outdated copy of your content. There is no way for you to correct the mistakes they copied onto their site, and continue to sell access to.

public domain

Bundling information and reselling it for commercial profit is not illegal. Some web content may be collected, bundled and resold again. That is true of content that is known to be public domain or labelled as such by its owners.

class action suit

Ancestry.com harvested content from many English web sites, so it may well be a matter of time before some Americans start a class action suit against Ancestry.com.

Poetic justice would be to turn the tables; to have a bot bundle information from Ancestry.com databases and sell it as the Ancestry Records Collection, and then use the proceeds to support the class action suit.

extend robots.txt?

update

It is possible to extend the robots.txt protocol with commands that allow collection bots to harvest a web site, with the understanding that the absence of that command means that the site may not be harvested for collections.

However, that is not necessary. The robots.txt protocol serves its purpose well. It may be a good idea to update its documentation to clearly state that robots.txt is meant for search bots, and that collection bots should respect additional protocols.

additional protocols

The necessary additional protocols already exist. One is a legal protocol, commonly referred to as respecting copyright, i.e. do not copy without permission.

The other is a simple technical protocol; the Creative Commons system of content labelling. The Creative Commons system is simple, has been around for years, has been specifically designed for automated processing, and the specifications are public, so it is not hard for a bot to process it.
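
To illustrate how little effort that takes, here is a minimal sketch of a bot-side check. It assumes the page carries the commonly used rel="license" link to a creativecommons.org licence URL; the LicenseFinder class, the creative_commons_license helper and the sample HTML are hypothetical names invented for this example. It only detects a label; it does not interpret the licence terms, which a real collection bot would still have to honour.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urlparse


class LicenseFinder(HTMLParser):
    """Collect href values of <a> and <link> tags marked rel="license"."""

    def __init__(self) -> None:
        super().__init__()
        self.license_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "license" in rel_tokens and href:
            self.license_urls.append(href)


def creative_commons_license(html: str) -> Optional[str]:
    """Return the first Creative Commons licence URL found, else None."""
    finder = LicenseFinder()
    finder.feed(html)
    for url in finder.license_urls:
        if urlparse(url).hostname == "creativecommons.org":
            return url
    return None  # no label: assume all rights reserved, do not collect


if __name__ == "__main__":
    labelled = ('<p>Text <a rel="license" '
                'href="https://creativecommons.org/licenses/by/4.0/">CC BY</a></p>')
    unlabelled = "<p>My family history. No label at all.</p>"
    print(creative_commons_license(labelled))
    print(creative_commons_license(unlabelled))
```

The design choice that matters is the default: when no label is found, the function returns None and the bot should leave the page alone, because unlabelled content is still copyrighted.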

The robots.txt file is for search bots, not collection bots. Collection bots need to respect additional protocols.

my bot line

My bot line has not changed. My robots.txt file still allows all search engines to index my content.

I have not even blocked MyFamilyBot; as long as it is only indexing my site for a freely accessible search engine that links back to my site, it is welcome to index my content.

collection bot changes

Collection bots are not welcome, and I am not going to fuel the backwards idea that the bot masters behind them can excuse their behaviour by pointing at my robots.txt file. The robots.txt file is for search bots, not collection bots. Collection bots need to respect additional protocols.

I have not added additional copyright statements or symbols to my content. I have not added a Creative Commons label. None of that is necessary. Every word I write is automatically copyrighted.

I don’t need to write a copyright statement. Copyright is automatic. I don’t need to explicitly allow fair use either. Fair use is automatic. This paragraph is mine and fair use allows you to quote it.

I don’t need to do a thing. That’s the bot line.

updates

2007-08-28 free service

In reply to the public outcry over their behaviour, and to stave off most lawsuits over their wholesale copying of and selling access to copyrighted material, Ancestry.com now claims they have made their Internet Biographical Collection a free service. They did not issue a press release, but merely mentioned this in their Family History Circle blog.

The claim that the database is now free is a misleading statement. To view the collection of copied content at all, you still need to register with Ancestry.com.

In its statement explaining this move, Ancestry.com makes it a point to state that they display a live link back to the source they extracted the information from. However, Ancestry.com has neither apologised for their actions, nor promised to stop copying web sites.

Ancestry.com is still not up front about where their information came from; although they promised to provide a list of sites included in the collection, they still do not do so. Ancestry.com has not announced a way to have your work removed from their collection either.

Ancestry.com has not promised to contact all owners of the copyrighted works they infringed. Their own copyright page has still not been updated to address the copyright issues of the database they thus created.

The database of copied content is still behind a registration screen. You still have to become an Ancestry member to view the copied content.

That still makes it impossible for search engines to index the collection, and thus for services like Google Alert to detect new pages containing your family name. So, even persons who have bothered to register with such services to detect new web content will still not be alerted to the copying of their pages!

2007-08-29 pull down

Ancestry.com has decided to pull down the Internet Biographical Collection. Again, Ancestry.com did not issue a press release, but merely posted a message on their Family History Circle blog.

The message does not admit that they took the copied content down in response to threats of legal action in general or a class action suit in particular.

Their brief message does make remarkably frequent use of cached and search engine, a clear attempt to colour the perception of what they really did.

Their message does not include an apology for their actions, nor a promise to contact all owners of the copyrighted works they infringed.

2007-08-30 still there

Several bloggers took Ancestry.com’s latest blog post at face value. Some reposted it verbatim, others claimed success or congratulated Ancestry.com on doing the right thing.
Alas, the Ancestry.com blog message is not correct. The database of copied content is still there.

Users soon discovered that it has not been removed, but merely been renamed from Internet Biographical Collection to Unknown, and could still be searched. It now seems Ancestry.com’s blog post preceded its actions, that the message was released several hours before the database connection was severed.

The Unknown page is still up. It offers a search form, and a database description, and prompts you to become a member if you try to search it. If you do join up, you discover that you can still search the index, but that the database connection has been severed now.

2010-01-28 Internet Extraction Group

Yahoo Hot Jobs has the following Ancestry.com job posting for a CRAWLING ENGINEER – Web Search to work in Ancestry.com's Internet Extraction Group which uses information extraction and machine learning technology to crawl and capture genealogical and historical data from the web:

Job Description:

Ancestry.com is looking for an internet crawling execution administrator. The Internet Extraction Group uses information extraction and machine learning technology to crawl and capture genealogical and historical data from the web. The success of the group hinges upon the predictable execution of the extraction algorithms against the target websites. The execution administrator will initially be responsible for training and executing the extractors. Eventually this position will also include the responsibility for managing dozens of employees who will take over the responsibility of training the extractors.

Key Responsibilities / Performance Requirements:

  • Ability to work well with a small highly skilled team of engineers.
  • Ability to independently innovate and take ownership of significant projects and resources.
  • Proven skills with predictably establishing and meeting project schedules.
  • Ability to manage small to medium sized group of entry level computer users.

Required Skills:

  • Bachelors Degree in Computer Science or equivalent work experience.
  • 5+ years experience with CSS, HTML, XML, AJAX and JavaScript required.
  • Ability to effectively use JavaScript frameworks like YUI libraries is a plus.
  • A start-up mentality, pride of ownership, outgoing personality and passion for online design and simplicity will make an ideal addition to our team.
  • Familiarity and experience working in a .NET development environment.
  • Excellent writing and communication skills.

updates

2010-01-28 links update

Checked all blog links, added site names and dates and added many additional links to the list, including a The Genealogue post from 2006. Created a separate section for GeneaBlogie’s many posts on the subject.

2011-03-11: Web Crawl Team

The Internet Extraction Group is now known as the Web Crawl Team? Ancestry.com is looking for a contractor to crawl relevant content for import into our web records collection.

2011-04-23: GenealogyBlog link

The broken links to the deleted GenealogyBlog post The Generations Network continues to tarnish Their Image and the deleted Moultriecreek blog post More Naughty than Nice have been removed.

Several other broken links to long since deleted blog posts have been removed.

2011-04-24: HotJobs link

The Taleo and Yahoo! HotJobs links are defunct and have been removed.

2011-04-24: Internet Biographical Collection

The link to the Internet Biographical Collection itself has been dead for years. The broken link has been removed.

2011-05-17: Ancestry Web Search

Ancestry.com's Web Crawl Team has introduced Ancestry Web Search.

links

Ancestry.com

Ancestry.com blog

GeneaBlogie

blog posts