<?xml version="1.0" encoding="UTF-8"?>
<blog-post>
  <author-id type="integer">27054</author-id>
  <blog-comments-count type="integer">1</blog-comments-count>
  <blog-post-status-id type="integer">3</blog-post-status-id>
  <body-format>econsultancy_xml</body-format>
  <body-formatted>
  &lt;p&gt;Google has been indexing PDFs for a long time, but this is the first time it has started to &#8216;read&#8217; images. Images of text, that is.&lt;/p&gt;
  &lt;p&gt;Scanned files had previously been a relatively rare sight in the search results. To understand the content of a scanned document, Google would look for titles and other metadata, where available, and take into account third-party links and references. Not exactly foolproof.&lt;/p&gt;
  &lt;p&gt;Now you can search on specific words found inside a document, so long as it&#8217;s a PDF &#8211; no doubt Google will expand the range of index-worthy formats over time.&lt;/p&gt;
  &lt;p&gt;OCR technology, which has been around for some time, needs to be pretty smart to get anywhere near to the reading ability of a human. Google's Evin Levey &lt;a href="http://googleblog.blogspot.com/2008/10/picture-of-thousand-words.html"&gt;explains&lt;/a&gt;: &lt;em&gt;&#8220;To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read it as a zero, the letter 'O', just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.&#8221;&lt;/em&gt;&lt;/p&gt;
  &lt;p&gt;Some examples of the new technology in action: &lt;a href="http://www.google.com/search?q=repairing+aluminum+wiring"&gt;repairing aluminum wiring&lt;/a&gt;, &lt;a href="http://www.google.com/search?q=spin+lock+performance"&gt;spin lock performance&lt;/a&gt;, &lt;a href="http://www.google.com/search?q=Steady+success+in+a+volatile+world"&gt;steady success in a volatile world&lt;/a&gt;. Those sexy phrases were all found in scanned files. &lt;/p&gt;
  &lt;p&gt;
    &lt;strong&gt;What does this mean for SEO? &lt;br /&gt;&lt;/strong&gt;
  &lt;/p&gt;
  &lt;p&gt;Not too much, by my reckoning. Google should easily be able to convert scanned URLs into links, so there might be a few &lt;strong&gt;more links&lt;/strong&gt; coming your way if your website features in scanned documents. But there will be no anchor text, so the quality of these links isn&#8217;t going to be amazing. &lt;/p&gt;
  &lt;p&gt;Maybe it&#8217;s simply about feeding Google with extra food. If your organisation has archived forests of scanned documents then by making them available you may present Google with some &lt;strong&gt;extra content&lt;/strong&gt;. And if these documents are already accessible via your website then expect a visit from some variant of Googlebot to make sense of them. &lt;/p&gt;
  &lt;p&gt;Cue a discussion from the link sculptors on whether to apply 'nofollow' to these files...&lt;/p&gt;
</body-formatted>
  <body-unformatted>&lt;FormattedContent xmlns="http://www.e-consultancy.com/schema/formattedContent/"&gt;
  &lt;Paragraph&gt;Google has been indexing PDFs for a long time, but this is the first time it has started to &#8216;read&#8217; images. Images of text, that is.&lt;/Paragraph&gt;
  &lt;Paragraph&gt;Scanned files had previously been a relatively rare sight in the search results. To understand the content of a scanned document, Google would look for titles and other metadata, where available, and take into account third-party links and references. Not exactly foolproof.&lt;/Paragraph&gt;
  &lt;Paragraph&gt;Now you can search on specific words found inside a document, so long as it&#8217;s a PDF &#8211; no doubt Google will expand the range of index-worthy formats over time.&lt;/Paragraph&gt;
  &lt;Paragraph&gt;OCR technology, which has been around for some time, needs to be pretty smart to get anywhere near to the reading ability of a human. Google's Evin Levey &lt;Link URL="http://googleblog.blogspot.com/2008/10/picture-of-thousand-words.html" Window="Self"&gt;explains&lt;/Link&gt;: &lt;Quote&gt;&#8220;To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read it as a zero, the letter 'O', just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.&#8221;&lt;/Quote&gt;&lt;/Paragraph&gt;
  &lt;Paragraph&gt;Some examples of the new technology in action: &lt;Link URL="http://www.google.com/search?q=repairing+aluminum+wiring" Window="Self"&gt;repairing aluminum wiring&lt;/Link&gt;, &lt;Link URL="http://www.google.com/search?q=spin+lock+performance" Window="Self"&gt;spin lock performance&lt;/Link&gt;, &lt;Link URL="http://www.google.com/search?q=Steady+success+in+a+volatile+world" Window="Self"&gt;steady success in a volatile world&lt;/Link&gt;. Those sexy phrases were all found in scanned files. &lt;/Paragraph&gt;
  &lt;Paragraph&gt;
    &lt;Emphasis&gt;What does this mean for SEO? &lt;LineBreak /&gt;&lt;/Emphasis&gt;
  &lt;/Paragraph&gt;
  &lt;Paragraph&gt;Not too much, by my reckoning. Google should easily be able to convert scanned URLs into links, so there might be a few &lt;Emphasis&gt;more links&lt;/Emphasis&gt; coming your way if your website features in scanned documents. But there will be no anchor text, so the quality of these links isn&#8217;t going to be amazing. &lt;/Paragraph&gt;
  &lt;Paragraph&gt;Maybe it&#8217;s simply about feeding Google with extra food. If your organisation has archived forests of scanned documents then by making them available you may present Google with some &lt;Emphasis&gt;extra content&lt;/Emphasis&gt;. And if these documents are already accessible via your website then expect a visit from some variant of Googlebot to make sense of them. &lt;/Paragraph&gt;
  &lt;Paragraph&gt;Cue a discussion from the link sculptors on whether to apply 'nofollow' to these files...&lt;/Paragraph&gt;
&lt;/FormattedContent&gt;</body-unformatted>
  <created-at type="datetime">2008-10-31T11:23:00+00:00</created-at>
  <enabled-blog-comments-count type="integer">1</enabled-blog-comments-count>
  <expertise-level-id type="integer">1</expertise-level-id>
  <extract-format>econsultancy_xml</extract-format>
  <extract-formatted>
  &lt;p&gt;
    &lt;strong&gt;It turns out that Googlebot&#8217;s addiction to text has reached a new level following the announcement that it can now index text from scanned documents.&lt;/strong&gt;
  &lt;/p&gt;
  &lt;p&gt;By using Optical Character Recognition (OCR) technology, the search engine can now make sense of the content of scanned documents saved as PDF files.&lt;/p&gt;
</extract-formatted>
  <extract-unformatted>&lt;FormattedContent xmlns="http://www.e-consultancy.com/schema/formattedContent/"&gt;
  &lt;Paragraph&gt;
    &lt;Emphasis&gt;It turns out that Googlebot&#8217;s addiction to text has reached a new level following the announcement that it can now index text from scanned documents.&lt;/Emphasis&gt;
  &lt;/Paragraph&gt;
  &lt;Paragraph&gt;By using Optical Character Recognition (OCR) technology, the search engine can now make sense of the content of scanned documents saved as PDF files.&lt;/Paragraph&gt;
&lt;/FormattedContent&gt;</extract-unformatted>
  <featured type="boolean">false</featured>
  <id type="integer">2920</id>
  <learn-more-formatted>&lt;p&gt;Read Econsultancy's &lt;strong&gt;&lt;a href="http://econsultancy.com/reports/cms-survey-report"&gt;CMS Survey Report&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="http://econsultancy.com/reports/content-management-systems-cms-buyer-s-guide-2007"&gt;CMS Buyer's Guide&lt;/a&gt;&lt;/strong&gt; to learn more about this topic.&lt;/p&gt;</learn-more-formatted>
  <learn-more-unformatted>&lt;p&gt;Read Econsultancy's &lt;strong&gt;&lt;a href="http://econsultancy.com/reports/cms-survey-report"&gt;CMS Survey Report&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="http://econsultancy.com/reports/content-management-systems-cms-buyer-s-guide-2007"&gt;CMS Buyer's Guide&lt;/a&gt;&lt;/strong&gt; to learn more about this topic.&lt;/p&gt;</learn-more-unformatted>
  <legacy-article-id type="integer">366629</legacy-article-id>
  <name>Google indexes text from scanned files</name>
  <private type="boolean">false</private>
  <published-at type="datetime">2008-10-31T11:23:00+00:00</published-at>
  <slug>google-indexes-text-from-scanned-files</slug>
  <tweetbacks-updated-at type="datetime">2009-04-28T23:17:33+01:00</tweetbacks-updated-at>
  <unpublished-at type="datetime" nil="true"></unpublished-at>
  <updated-at type="datetime">2009-10-13T09:51:24+01:00</updated-at>
  <views-count type="integer">499</views-count>
</blog-post>
