An important part of meeting your search engine optimisation targets is nudging the search results your way wherever that is in your control. To help you achieve this, there are some links you should prevent the engines from indexing to begin with.
There are three reasons: they offer little or no benefit to users, they might get indexed instead of the content you actually want ranked, and keeping the engines from crawling unnecessary pages will reduce your bandwidth costs.
Here are a few links you really don't want indexed:
1. HTTPS versions of your pages - To check whether Google is indexing any HTTPS versions of your pages, search for [inurl:https site:examplesite.com]. Fix this by redirecting HTTPS to HTTP in your .htaccess file, excluding the log in pages.
2. Development server - Check whether Google is indexing your development server by searching for [site:development-site.com]. If it is, keep the site out of the index by using 'Disallow: /' in the robots.txt file (no guarantee that it will always work), placing the site behind a username and password, or simply granting access only from your company IP address range.
3. Affiliate links - If you're using one of the affiliate networks, chances are this won't matter much, as the link juice won't flow your way anyway but to the affiliate network. If, on the other hand, you're using off the shelf software to manage your affiliates in house, the affiliate URL might look something like www.sitename.com/?affrt1|2|3|4|etc. To find out whether you've got any affiliate URLs indexed, search for [inurl:<ID-code> site:examplesite.com].
4. Pay per click agency links - Similar to off the shelf affiliate software, some PPC agencies use proprietary software whose tracking URLs can get indexed. You can keep these pages out of the index using your robots.txt file and a wildcard rule.
5. Google URL Builder links - I am a huge fan of Google's URL Builder tool and often use it to track conversions from Google Base, newsletters and much more. My only complaint is that the tagged URL carries its tracking ID in a question mark (?) parameter, which does not prevent the page from getting indexed. The solution would have been to use a hash (#) instead of the ?, so www.examplesite.com/?url-id will get indexed, whereas www.examplesite.com/#url-id won't. Searching Google for [allinurl:utm_source utm_medium], which are parts of the tracking code, brings up some interesting results. While there are some hacks which allow you to tweak the Google Analytics tracking code to use #, the quickest solution is to use the canonical tag across the site and hope that Google will drop the other version of the page from its index.
6. Web based newsletter copies from way, way back - To check whether Google is indexing any old copies, search for the directory on Google using [site:examplesite.com/directory-name/]. As these pages might have some value, you should 301 redirect them to a more appropriate page on the site. This is often an easy way to pick low hanging fruit, as you're likely to win some more backlinks.
7. Shopping cart and log in pages - You should keep them out of the index, not necessarily because of content duplication issues, but because they offer no value to searchers.
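To make the canonical tag idea from point 5 a little more concrete, here is a rough Python sketch of how a site might work out the clean version of a tagged URL and build the matching tag. It assumes that every parameter starting with utm_ is tracking-only and safe to drop. The names canonical_url, canonical_link_tag and TRACKING_PREFIXES are made up for this illustration and don't come from any Google tool.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: anything starting with "utm_" is a tracking parameter.
# Extend this tuple if your own campaigns use other prefixes.
TRACKING_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Return the URL without tracking parameters and without a fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # Note: urlencode re-serialises the query, so a bare key such as
    # "?flag" comes back as "?flag=".
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Build the canonical link element to place in the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}" />'


if __name__ == "__main__":
    tagged = "http://www.examplesite.com/?utm_source=newsletter&utm_medium=email"
    print(canonical_url(tagged))
    print(canonical_link_tag(tagged))
```

The tag only tells the engines which version you prefer; as noted above, you are still hoping they honour it.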
What's your search engine index strategy?
Photo credit: anna maria lopez lopez via stock.xchng



Reader comments (2)
SEO Manager at Amnesia Razorfish
7:52AM on 25th April 2009
Don't forget that if you are using a blog platform such as WordPress, your old posts can rank above newer topics. You no longer want last year's event details to be shown first, or maybe ever again.
You can set a guide to each page's importance within WordPress, which is written to your XML sitemap for the search engines to read.
Search Marketing Manager at British Council
11:42AM on 27th April 2009
I have experienced high rankings with old enewsletters that were floating in the server, some of them boasting high page rank... it is definitely a good tip to 301 redirect them to other, not so popular areas of the website that may be related in some way with the enewsletters.
Another thing I have noticed with sites that host client work is that the typical client/ folder is not excluded in the robots file and is therefore indexed by the engines, e.g. www.seoagency.com/clients/client1/developemnt.htm etc...