It turns out that Googlebot’s addiction to text has reached a new level following the announcement that it can now index text from scanned documents.
By using Optical Character Recognition (OCR) technology the search engine can now make sense of the content of scanned documents saved as PDF files.
Google has been indexing PDFs for a long time, but this is the first time it has started to ‘read’ images. Images of text, that is.
Scanned files had previously been a relatively rare sight in the search results. To understand the content of a scanned document, Google would look for titles and other metadata, where available, and take into account third-party links and references. Not exactly foolproof.
Now you can search on specific words found inside a document, so long as it’s a PDF – no doubt Google will expand the range of index-worthy formats over time.
OCR technology, which has been around for some time, needs to be pretty smart to get anywhere near the reading ability of a human. Google's Evin Levey explains: “To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter 'O', just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.”
What does this mean for SEO?
Not too much, by my reckoning. Google should easily be able to convert scanned URLs into links, so there might be a few more links coming your way if your website features in scanned documents. But a bare URL carries no anchor text, so the quality of these links isn’t going to be amazing.
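To give a rough idea of what "converting scanned URLs into links" might involve, here is a minimal Python sketch. It is purely illustrative and assumes the OCR step has already produced plain text; the pattern and the `extract_urls` helper are my own inventions, not anything Google has described.

```python
import re

# Hypothetical pattern: matches http(s) URLs and bare "www." addresses in plain text.
URL_PATTERN = re.compile(r"\b(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def extract_urls(ocr_text):
    """Return the URLs found in OCR output, with trailing punctuation trimmed."""
    urls = []
    for match in URL_PATTERN.finditer(ocr_text):
        # Sentence punctuation often sticks to the end of a URL in running text.
        url = match.group(0).rstrip(".,;:!?)")
        # Assume plain http for addresses printed without a scheme.
        if url.lower().startswith("www."):
            url = "http://" + url
        urls.append(url)
    return urls


sample = "For details visit www.example.com/archive, or see http://example.org."
print(extract_urls(sample))
```

Notice that all the function can hand back is the address itself. There is no descriptive link text to go with it, which is exactly why these links are unlikely to carry much weight.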
Maybe it’s simply about feeding Google with extra food. If your organisation has archived forests of scanned documents then by making them available you may present Google with some extra content. And if these documents are already accessible via your website then expect a visit from some variant of Googlebot to make sense of them.
Cue a discussion from the link sculptors on whether to apply a 'nofollow' directive to these files...