There is a lot of anger on many bulletin boards about a recent addition to Ancestry.com, the Internet Biographical Collection. Ancestry.com itself describes it thus:
Source Information:
Ancestry.com. Internet Biographical Collection [database on-line]. Provo, UT, USA: The Generations Network, Inc., 2007. Original data: Biographical info taken from various English web sites. See specific website address provided with each entry.
About Internet Biographical Collection
This database contains a sampling of biographical sketches found on English language web pages throughout the entire World Wide Web. Web pages can vary greatly in the amount of information they contain about a given person, and in the number of related and unrelated people mentioned on the same page. The information source and the central topic of each page will also vary greatly. Given facts should be verified using other sources. One unique and valuable feature of this web-based collection is the number of hyperlinks leading from each page in the collection to other web pages of possible interest on related topics.
Ancestry.com is not upfront about just which sites are included in the collection. It does not provide a list of the included sites. It does not even draw attention to its rather generic description. The unwary consumer visiting Ancestry.com is misled into thinking that “Collection” implies ownership or affiliation.
To use the Internet Biographical Collection on ancestry.co.uk, you must allow scripts on ancestry.co.uk, ancestry.com, atdt.com, 247realmedia.com and mfcreative.com. That problem is not specific to this collection, but typical of Ancestry.com’s less than stellar Internet architecture.
The widespread anger is about Ancestry.com copying entire websites without permission. To see what all the ruckus was about, I started with a search on my own name. No matches. They have not copied my website.
When I entered another name, I got a table listing various results. Weird and telling is that Ancestry.com refers to these results as “records”. There is a “View Record” link for each result on the left side. There is also a “View web page” link on the right side. If you try to use either, you are prompted to sign up for paid access. That is what the anger is really about.
Ancestry.com did not take selected data from various websites, Ancestry.com took the websites themselves.
Ancestry.com has been copying websites, calls the result the “Internet Biographical Collection” and is now selling access to it. There is lots of anger over this move. Many webmasters are screaming foul. After all, that their website is freely accessible does not mean it is free of copyright, and that is only the first of many complaints: Ancestry.com’s actions are dishonest, unethical and unlawful. What they have done is outright theft, plain and simple.
Lost in the public outcry is some ridicule over the quality of the collection. The inclusion of unrelated web pages makes it clear that it was created by a bot harvesting web sites, apparently without any human intervention. The bot may have some smarts, but cannot really distinguish between genealogy sites and other sites, and Ancestry.com apparently did not even bother to hire a human to scan the results and avoid inclusion of non-genealogical sites.
This additional complaint itself is less important, but the observation is significant. It highlights that the collection is merely a bunch of copies created by a bot, not a selection created or vetted by a human. It shows that Ancestry.com’s claim that it took biographical data from various websites is not true. The examples of non-genealogical sites in the collection are proof that Ancestry.com has not been selecting data from websites at all, but has simply been copying websites.
There is some confusion about what exactly Ancestry.com is doing. Some people talk about copies, others about caches, and yet others about frames. Here’s what’s happening: they are showing copies they made of third party websites. Ancestry refers to these as “Cached Results”, but that does not seem an entirely appropriate and honest use of the word “Cached” to me, as I’ll explain further on.
These copies are shown with a bar along the top. It looks very much like one site framing another. Now, it actually is a frame, but it is the Ancestry.com site framing content on the Ancestry.com site - how weird is that?
The framed content was copied (a.k.a. “stolen”) from another site, and the bar along the top includes a link to the current site, titled “View live web page”. Funny detail: if the original page included copyright notices, these were copied along with the rest of the page.
Do note an important difference with Google from a copyright point of view: Google shows the real links (with brief snippets) first, and offers the cached view as an option. Ancestry.com shows its copy first. You do not even get to see the link to the original site unless you decide to view the copy! And you cannot view that copy until you pay for a subscription.
One painful result of this scheme is that Ancestry.com is trying to charge people for viewing an out of date copy of their own site.
Many messages on message boards make it plain that Ancestry.com never obtained permission to copy these websites.
Many people are plain angry because Ancestry.com stole their content, violated their copyright. What really ticks them off is that Ancestry.com is asking money for access to their original resources, which they themselves offer for free.
There is anger about Ancestry.com using their hard work to lure visitors to Ancestry.com instead of their site. There is anger about not even offering a working link back to their original web site, but a link to subscribe to Ancestry.com instead.
Some have suggested that this new collection is similar to what search engines do.
Search engines keep a cache of the original page for search purposes, and some search engines choose to display that cache. It is likely that Ancestry’s bot will keep updating pages, just as a search engine bot keeps updating the search cache. But that is where the similarity between Ancestry.com’s collection of copies and a search engine cache ends.
Search engines do not present their cache as a “Biographical Collection”.
Ancestry.com keeps its collections in a transactional database with backups to make sure it never loses a record, while search engines do not worry about losing a page or even an entire hard disk; their spider will get a new copy soon enough.
Search engines do not demand membership to view their cache. Search engines do not charge for access to their cache.
Search engines always offer a link to the original site alongside their search results. They do not present their cache as the destination of your search. Search engines are up front about not being the copyright owner. Search engines do not present their temporary cache as a collection. They do not present selected parts of their cache as their “Internet Subject Collection” and then charge you for it.
Search engines do not ask you to register with your email address, so they can spam you with their great offers. They do not ask for your name. They do not ask for your credit card number. They do not demand monthly payments. They do not demand anything from you. Search engines simply provide free access to their cache, and always provide a link to the original pages.
robots.txt
Most search engines respect the robots.txt protocol, which tells search bots what they may and may not index. Not all bots respect it, and Ancestry.com’s bot is more aptly called a collection bot than a search bot, but it is a bot.
Some searching soon turned up the name of Ancestry.com’s bot, MyFamilyBot, and Ancestry.com’s page for that bot. On that page, Ancestry.com claims that MyFamilyBot honours the robots.txt protocol.
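For webmasters who want to see what honouring robots.txt would mean in practice, here is a minimal sketch using Python’s standard `urllib.robotparser`. The robots.txt content and the example.org URL are made up for illustration; only the bot name MyFamilyBot comes from Ancestry.com’s own bot page. It shows what a well-behaved bot should conclude from such a file, not what MyFamilyBot actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out MyFamilyBot, allow every other bot.
ROBOTS_TXT = """\
User-agent: MyFamilyBot
Disallow: /

User-agent: *
Disallow:
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a bot honouring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/family/tree.html"  # made-up URL
    for agent in ("MyFamilyBot", "Googlebot"):
        print(agent, may_fetch(ROBOTS_TXT, agent, page))
```

A bot that respects the protocol would skip the page when it identifies itself as MyFamilyBot, while ordinary search bots remain welcome.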
robots.txt is irrelevant
Many web site owners will not be appeased by assurances that MyFamilyBot respects the robots.txt protocol. Most websites are happy to appear in search engines, and therefore do not need to bother with robots.txt at all. For the average genealogical web site owner, robots.txt is irrelevant. These site owners did not consider that someone might create a bot to steal their content outright.
Ancestry’s MyFamilyBot is not a search bot. It is not building a
search index, it is amassing a collection. Therefore, it is a collection bot.
That is not a subtle distinction, but a fundamental one.
All those angry webmasters would be happy webmasters if Ancestry.com had created a genealogical search engine. They would be celebrating.
The difference between Google and Ancestry.com is the difference between a cataloguer providing directions to your site, and a plagiarist selling your work without your permission - without informing you about it, without even informing their customers that you exist, without telling them that they could have the current original for free instead of paying for the stale copy they are being offered for sale.
No amount of sophistry can make these two different things the same.
A search bot and a collection bot may take the same actions when they visit your site. It is because of that similarity that both are known as web bots. It is because of their difference in purpose, because they serve different bot masters, that we call one a search bot and the other a collection bot.
There is good reason to distinguish between search bots and collection bots. However similar they are technically, they are quite different conceptually. Most web site owners are ecstatic with more and more search bots crawling their site, but do not want a single collection bot crawling their site. That’s the bot(tom) line.
robots.txt
Although MyFamilyBot is not a search bot, it would still be wise for it to respect robots.txt. Ancestry.com claims that it does, and I have heard no complaint that it does not. Of course, finding such complaints is hard when Ancestry.com has not exactly gone out of its way to let everybody know about its collection bot, but let’s simply assume that their collection bot does respect robots.txt, as the real issue is that respecting robots.txt is not good enough. Respecting robots.txt is good enough for search bots, but not for collection bots.
The robots.txt file is not all that Ancestry.com has to respect. They have to respect the site owner’s copyright too. It is one thing to create a search index as a free or commercial service; it is another thing entirely to bundle copyrighted third party material and sell it as your own collection - and not link to the original content, lest the visitor to the Ancestry.com site decide to follow the free link to the up-to-date original instead of paying for Ancestry.com’s stale copy.
Ancestry.com put up a bot info page, apparently early this year. That page is less than honest about the purpose of its bot. It describes the creation of a genealogical search engine, not the creation of a collection.
MyFamily is creating an index based on a powerful person-based biographical ranking engine that gives superior results over searches done using the more general purpose internet search engines. Ancestry.com indexes the biographic text and provides a search service that points users back to the originating website
Note the bit about including pointers to the originating site: is “originating” an innocent silly mistake for “original”, or a deliberately misleading term that can legally be explained in unexpected ways? Whatever the case may be, I expect Ancestry.com to update this description soon.
Ancestry.com may try to argue that this bot info page has been up for months and that webmasters had ample time to update their robots.txt file. I do not believe that Ancestry.com has gone out of its way to alert anyone to the existence of its bot.
In fact, even when I had figured out that their bot is called MyFamilyBot, I still had a hard time locating their bot info page. Google has many hits for MyFamilyBot, but Ancestry.com’s page is not showing on top. In fact, even today, it is still not in Google’s cache at all! Google: “Your search - MyFamilyBot site:ancestry.com - did not match any documents”.
robots.txt
Ancestry.com’s own robots.txt does not disallow access to that page. It is a sub-page of their Learning Centre and that part of Ancestry.com is cached in Google. However, there apparently simply isn’t a single link pointing to the bot info page. The page exists in splendid isolation.
The first problem with the bot information page is that, to put it mildly, it is not exactly advertised on Ancestry.com’s front page. The second problem is that it is so dishonest and misleading that any webmaster who had come across this page a few months ago would not have decided to block the bot, but would instead have wondered how to make sure their site would be included in this genealogical search engine. It does not honestly warn webmasters that their sites are being copied, but instead tricks webmasters into letting their site be indexed under false pretences.
Many website owners would object to the presentation of stale pages as a collection even if the pages were entirely free. There is the copyright issue, and how seriously others take your copyright if you allow this sort of thing, but there is more. There is the niggling matter of accuracy. Be it a small spelling mistake, a serious error in the interpretation of a record or anything in between, when you find an error, you can update your own website. You will always be able to offer the best data you have.
Meanwhile, Ancestry.com has an outdated copy of your content. There is no way for you to correct the mistakes they copied onto their site, and continue to sell access to.
Bundling information and reselling it for commercial profit is not illegal. Some web content may be collected, bundled and resold again. That is true of content that is known to be public domain or labelled as such by its owners.
Ancestry.com harvested content from many English web sites, so it may well be a matter of time before some Americans start a class action suit against Ancestry.com.
Poetic justice would be to turn the tables; to have a bot bundle information from Ancestry.com databases and sell it as the Ancestry Records Collection, and then use the proceeds to support the class action suit.
robots.txt?
It is possible to extend the robots.txt protocol with commands that allow collection bots to harvest a web site, with the understanding that the absence of that command means that the site may not be harvested for collections.
However, that is not necessary. The robots.txt protocol serves its purpose well. It may be a good idea to update its documentation to clearly state that robots.txt is meant for search bots, and that collection bots should respect additional protocols.
The necessary additional protocols already exist. One is a legal protocol, commonly referred to as respecting copyright, i.e. do not copy without permission.
The other is a simple technical protocol; the Creative Commons system of content labelling. The Creative Commons system is simple, has been around for years, has been specifically designed for automated processing, and the specifications are public, so it is not hard for a bot to process it.
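As a rough illustration of how little effort this takes, here is a sketch of the check a collection bot could run before copying a page. It relies on the `rel="license"` link that Creative Commons recommends for marking up a page; the class and function names are mine, and the rule "no label means do not copy" is the default this article argues for, not anything Ancestry.com is known to implement. A real bot would also have to honour the terms of the specific licence it finds, such as a non-commercial clause.

```python
from html.parser import HTMLParser


class LicenseFinder(HTMLParser):
    """Collects the href of every <a> or <link> marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "license" in rels and attr.get("href"):
            self.licenses.append(attr["href"])


def has_cc_license(html: str) -> bool:
    """True only if the page explicitly links to a Creative Commons licence."""
    finder = LicenseFinder()
    finder.feed(html)
    return any("creativecommons.org/" in href for href in finder.licenses)


if __name__ == "__main__":
    labelled = ('<p>Text is available under <a rel="license" '
                'href="https://creativecommons.org/licenses/by/4.0/">CC BY</a>.</p>')
    unlabelled = "<p>Copyright. All rights reserved.</p>"
    print(has_cc_license(labelled), has_cc_license(unlabelled))
```

A page without a label falls through to the legal default: copyrighted, all rights reserved, so the bot leaves it alone.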
My bot line has not changed. My robots.txt file still allows all search engines
to index my content.
I have not even blocked MyFamilyBot; as long as it is only indexing my site for a freely accessible search engine that links back to my site, it is welcome to index my content.
Collection bots are not welcome, and I am not going to fuel the backwards idea that the bot masters behind them can excuse their behaviour by pointing at my robots.txt file. The robots.txt file is for search bots, not collection bots.
Collection bots need to respect additional protocols.
I have not added additional copyright statements or symbols to my content. I have not added a Creative Commons label. None of that is necessary. Every word I write is automatically copyrighted.
I don’t need to write a copyright statement. Copyright is automatic. I don’t need to explicitly allow fair use either. Fair use is automatic. This paragraph is mine and fair use allows you to quote it.
I don’t need to do a thing. That’s the bot line.
In reply to the public outcry over their behaviour, and to stave off most lawsuits over their wholesale copying of and selling access to copyrighted material, Ancestry.com now claims they have made their Internet Biographical Collection a free service. They did not issue a press release, but merely mentioned this in their Family History Circle blog.
The claim that the database is now free is a misleading statement. To view the collection of copied content at all, you still need to register with Ancestry.com.
In its statement explaining this move, Ancestry.com makes it a point to state that they display a live link back to the source they extracted the information from.
However, Ancestry.com has neither apologised for their actions, nor promised to stop copying web sites.
Ancestry.com is still not up front about where their information came from; although they promised to provide a list of sites included in the collection, they still do not do so. Ancestry.com has not announced a way to have your work removed from their collection either.
Ancestry.com has not promised to contact all owners of the copyrighted works they infringed. Their own copyright page has still not been updated to address the copyright issues of the database they thus created.
The database of copied content is still behind a registration screen. You still have to become an Ancestry member to view the copied content.
That still makes it impossible for search engines to index the collection, and thus for services like Google Alert to detect new pages containing your family name. So, even persons who have bothered to register with such services to detect new web content will still not be alerted to the copying of their pages!
Ancestry.com has decided to pull down the Internet Biographical Collection.
Again, Ancestry.com did not issue a press release, but merely posted a message
on their Family History Circle blog.
The message does not admit that they took the copied content down in response to threats of legal action in general or a class action suit in particular.
Their brief message does make remarkably frequent use of “cached” and “search engine”, a clear attempt to colour the perception of what they really did.
Their message does not include an apology for their actions, nor a promise to contact all owners of the copyrighted works they infringed.
Several bloggers took Ancestry.com’s latest blog
post at face value. Some reposted it verbatim, others claimed success or
congratulated Ancestry.com on doing the right thing.
Alas, the Ancestry.com blog message is not correct. The database of copied content is still there.
Users soon discovered that it has not been removed, but merely been renamed from “Internet Biographical Collection” to “Unknown”, and could still be searched. It now seems Ancestry.com’s blog post preceded its actions, that the message was released several hours before the database connection was severed.
The “Unknown” page is still up. It offers a search form and a database description, and prompts you to become a member if you try to search it. If you do join up, you discover that you can still search the index, but that the database connection has been severed now.
Yahoo Hot Jobs has the following Ancestry.com job posting for a “CRAWLING ENGINEER – Web Search” to work in Ancestry.com's Internet Extraction Group, “which uses information extraction and machine learning technology to crawl and capture genealogical and historical data from the web”:
Job Description:
Ancestry.com is looking for an internet crawling execution administrator. The Internet Extraction Group uses information extraction and machine learning technology to crawl and capture genealogical and historical data from the web. The success of the group hinges upon the predictable execution of the extraction algorithms against the target websites. The execution administrator will initially be responsible for training and executing the extractors. Eventually this position will also include the responsibility for managing dozens of employees who will take over the responsibility of training the extractors.
Key Responsibilities / Performance Requirements:
- Ability to work well with a small highly skilled team of engineers.
- Ability to independently innovate and take ownership of significant projects and resources.
- Proven skills with predictably establishing and meeting project schedules.
- Ability to manage small to medium sized group of entry level computer users.
Required Skills:
- Bachelors Degree in Computer Science or equivalent work experience.
- 5+ years experience with CSS, HTML, XML, AJAX and JavaScript required.
- Ability to effectively use JavaScript frameworks like YUI libraries is a plus.
- A start-up mentality, pride of ownership, outgoing personality and passion for online design and simplicity will make an ideal addition to our team.
- Familiarity and experience working in a .NET development environment.
- Excellent writing and communication skills.
Checked all blog links, added site names and dates and added many additional links to the list, including a The Genealogue post from 2006. Created a separate section for GeneaBlogie’s many posts on the subject.
The Internet Extraction Group is now known as the Web Crawl Team? Ancestry.com is looking for a contractor to “crawl relevant content for import into our web records collection”.
The broken links to the deleted GenealogyBlog post The Generations Network continues to tarnish Their Image and the deleted Moultriecreek blog post More Naughty than Nice have been removed.
Several other broken links to long since deleted blog posts have been removed.
The Taleo and Yahoo! HotJobs links are defunct and have been removed.
The link to the Internet Biographical Collection itself has been dead for years. The broken link has been removed.
Ancestry.com's Web Crawl Team has introduced Ancestry Web Search.
Copyright © Tamura Jones. All Rights reserved.