It turns out that Googlebot’s addiction to text has reached a new level following the announcement that it can now index text from scanned documents.

By using Optical Character Recognition (OCR) technology, the search engine can now make sense of the content of scanned documents saved as PDF files.

Google has been indexing PDFs for a long time, but this is the first time it has started to ‘read’ images. Images of text, that is.

Scanned files had previously been a relatively rare sight in the search results. To understand the content of a scanned document, Google would look for titles and other metadata, where available, and take into account third-party links and references. Not exactly foolproof.

Now you can search on specific words found inside a document, so long as it’s a PDF – no doubt Google will expand the range of index-worthy formats over time.

OCR technology, which has been around for some time, needs to be pretty smart to get anywhere near to the reading ability of a human. Google's Evin Levey explains: “To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read it as a zero, the letter 'O', just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.”

Some examples of the new technology in action: repairing aluminum wiring, spin lock performance, steady success in a volatile world. Those sexy phrases were all found in scanned files.

What does this mean for SEO?

Not too much, by my reckoning. Google should easily be able to convert scanned URLs into links, so there might be a few more links coming your way if your website features in scanned documents. But there will be no anchor text, so the quality of these links isn’t going to be amazing.
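
To make the anchor text point concrete, here is a minimal sketch of how URLs might be pulled out of OCR output. The regular expression and the `extract_urls` helper are illustrative assumptions, not a description of how Google actually does it.

```python
import re

# Hypothetical pattern: matches http(s) URLs and bare "www." addresses in plain text.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def extract_urls(ocr_text: str) -> list[str]:
    """Return URLs found in OCR'd text, with trailing punctuation stripped."""
    urls = []
    for match in URL_PATTERN.finditer(ocr_text):
        # Sentence punctuation often sticks to the end of a printed URL.
        url = match.group(0).rstrip(".,;:!?)")
        # Bare "www." addresses carry no scheme; http:// is assumed for illustration.
        if not url.lower().startswith(("http://", "https://")):
            url = "http://" + url
        urls.append(url)
    return urls


sample = "For details see www.example.com/report.pdf, or write to us."
print(extract_urls(sample))
```

All that comes back is the bare address. The words around it in the scan are not tied to the link in any way, which is why there is no anchor text to speak of.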

Maybe it’s simply about feeding Google with extra food. If your organisation has archived forests of scanned documents then by making them available you may present Google with some extra content. And if these documents are already accessible via your website then expect a visit from some variant of Googlebot to make sense of them.

Cue a discussion among the link sculptors on whether to add 'nofollow' to these files...
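
For anyone who does want to go down that road: a PDF has no HTML head to carry a robots meta tag, so the closest equivalent is the `X-Robots-Tag` HTTP response header, which Google supports. How you set it depends entirely on your server. The sketch below only shows the decision logic, and the `robots_headers_for` helper is a hypothetical name, not part of any library.

```python
from pathlib import PurePosixPath


def robots_headers_for(path: str) -> dict[str, str]:
    """Return extra response headers for a requested path (hypothetical helper)."""
    if PurePosixPath(path).suffix.lower() == ".pdf":
        # X-Robots-Tag is the HTTP-header counterpart of the robots meta tag.
        return {"X-Robots-Tag": "nofollow"}
    # Everything else is served without extra robots directives.
    return {}


print(robots_headers_for("/archive/report-1998.pdf"))
```

Whether this is worth doing is a separate question, since links found in scans are unlikely to carry much weight in the first place.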

Chris Lake

Published 31 October, 2008 by Chris Lake

Chris Lake is CEO at EmpiricalProof, and former Director of Content at Econsultancy.

Comments (2)


Katherine Burke, Content consultant at Kath Burke Ltd

This is brilliant news for writers and PR companies who scan in examples of stories they've written. I have portfolio examples that I scan in and send to potential clients. For web work, this includes screenshots. But I've been conscious that they are invisible to the Googlebots.
Go Google!

almost 10 years ago


James Edgar @ Document Scanning

This will make searching for academic or business-related PDFs far more powerful. I use OCR in document scanning, converting from physical paper form to PDF. I find OCR to be highly sophisticated technology.

almost 8 years ago
