Modern Software Experience

2008-11-01

image collection PDFs

scanned PDFs

Many PDFs are not real documents, but collections of images. Such scanned PDFs are relatively easy to make, and many organisations have published old books, magazines and newsletters this way.

drawbacks

This method of publication has two obvious drawbacks: an image collection PDF is not just larger than a real document PDF, you cannot search it either. Unless you want your PDFs to be obscure, that inability to search scanned PDFs largely defeats the purpose of publishing them. Without text, there is nothing to index for text-based search engines like Google.

If you had to publish documents, but did not want everyone to find them, posting a collection of images instead of a real document was a way to make sure that none of the text made it into search engines - and in some sense, if it isn’t in the search engines, it doesn’t exist. It exists all right, but no one except those who somehow know about it already is likely to find it.

solution

Google has now solved that problem: it indexes the images in PDF files. It uses OCR technology to recognise any text within the images and then indexes that text.
All your old society newsletters should show up in Google soon, and it is probably just a matter of time before Google incorporates this feature in Google Desktop Search, so you can use it to search your private collections.
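
The recognise-then-index idea can be illustrated with a small sketch. It is not Google's pipeline, which has not been published in detail; it only shows the principle under stated assumptions. The names `OcrEngine`, `build_index`, `search` and `fake_ocr` are hypothetical, and `fake_ocr` is a stand-in that simply treats the page bytes as text, where a real setup would call an actual OCR engine.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical types: an OCR engine is any callable that turns image bytes
# into text; a location is a (document name, page number) pair.
OcrEngine = Callable[[bytes], str]
Location = Tuple[str, int]


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: Iterable[Tuple[str, int, bytes]],
                ocr: OcrEngine) -> Dict[str, Set[Location]]:
    """Run OCR on every page image and map each word to where it occurs."""
    index: Dict[str, Set[Location]] = defaultdict(set)
    for doc, page_no, image in pages:
        for word in tokenize(ocr(image)):
            index[word].add((doc, page_no))
    return index


def search(index: Dict[str, Set[Location]], query: str) -> Set[Location]:
    """Return the locations that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


def fake_ocr(image: bytes) -> str:
    """Stand-in for a real OCR engine: pretends the bytes are the page text."""
    return image.decode("utf-8")


# Made-up sample data standing in for scanned newsletter pages.
pages = [
    ("newsletter-1962.pdf", 1, b"Annual meeting of the Historical Society"),
    ("newsletter-1962.pdf", 2, b"Minutes of the annual meeting"),
]
index = build_index(pages, fake_ocr)
hits = search(index, "annual meeting")
```

Because the OCR step is just a callable, the engine can be swapped without touching the indexing code. Without that step the index would stay empty, which is exactly why image collection PDFs used to be invisible to text search.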

technology

The OCR technology used is Google’s open source OCRopus, based on the Tesseract software originally developed by Hewlett-Packard. To be more precise: Tesseract is a pluggable OCR engine used by OCRopus.

links