Google’s Webmaster Tools blog has just published a useful presentation, which provides advice on getting your pages crawled and indexed by the search engine.
Basically, the Googlebot can only crawl and index a small proportion of all the content online, so streamlining your site to reduce unnecessary crawling can optimise the speed and accuracy of your indexing.
Here are some of Google’s tips; there is more detail in the full slideshow…
Remove user-specific details from URLs
Removing details that are specific to the user, such as session IDs, will reduce the number of URLs pointing to the same content and so speed up crawling and indexing.
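As a hypothetical illustration of the duplication Google means (the URLs below are made up), two visitors to the same product page can generate two different URLs if a session ID is appended, so the Googlebot sees two URLs where one would do:

```
# Hypothetical example: two URLs, one piece of content
http://www.example.com/products/widget?sessionid=A1B2C3
http://www.example.com/products/widget?sessionid=X9Y8Z7

# Preferred: a single, session-free URL
http://www.example.com/products/widget
```

Keeping session state in cookies rather than in the URL is one common way to arrive at the single clean URL.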
Look out for infinite spaces
By ‘infinite spaces’ Google means large numbers of links with little new content behind them. This could be a calendar with links to every future date, or an e-commerce website’s filtering options, which can produce thousands of unique URLs for the same set of products.
All these extra links mean the Googlebot wastes its time crawling URLs that add nothing new to the index. Google has some suggested fixes, such as adding the nofollow attribute to such links, as in the sketch below.
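A minimal sketch of the nofollow fix, assuming a hypothetical ‘next month’ calendar link:

```
<!-- Hypothetical calendar link: rel="nofollow" tells Googlebot
     not to follow it into an endless run of future months -->
<a href="/calendar?month=2010-03" rel="nofollow">Next month</a>
```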
Disallow actions that Googlebot can’t perform
The Googlebot cannot log in to pages or submit contact forms, so using the robots.txt file to disallow these URLs will stop it wasting time attempting to crawl them.
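For illustration, a minimal robots.txt along these lines would do it, assuming your login and contact-form pages live at the hypothetical paths shown:

```
# Block areas the Googlebot cannot make use of anyway
# (paths are hypothetical - substitute your own)
User-agent: Googlebot
Disallow: /login/
Disallow: /contact/
```

The file sits at the root of your domain, e.g. http://www.example.com/robots.txt.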
Watch out for duplicate content
According to Google, the closer you can get to one unique URL for each unique piece of content, the more streamlined your website will be for crawling and indexing.
This is not always possible, so indicating the preferred URL by using the rel=canonical element, as described in this video from Matt Cutts, will solve this problem.
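As a rough sketch (the URLs are hypothetical), a filtered or sorted version of a product page can declare its preferred URL with a single line in the page’s head:

```
<!-- On http://www.example.com/widget?colour=red&sort=price,
     point search engines at the one preferred URL for this content -->
<link rel="canonical" href="http://www.example.com/widget" />
```

Google then consolidates the duplicate URLs’ indexing signals onto the canonical one.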