An important part of successfully managing your search engine optimisation is nudging the search results your way when it's within your control. To help you achieve this, there are some links you should prevent from getting indexed by the engines in the first place.
Firstly, because they offer little or no user-experience benefit; secondly, because they might get indexed instead of the desired content; and lastly, because preventing the engines from crawling unnecessary pages will reduce your bandwidth costs.
Here are a few links you really don't want indexed:
1. HTTPS versions of your pages - To check if Google is indexing any HTTPS versions of your pages, simply type [inurl:https site:examplesite.com] and fix this by redirecting HTTPS to HTTP using your .htaccess file, excluding log-in pages.
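A minimal .htaccess sketch of such a redirect, assuming mod_rewrite is enabled and that your log-in pages live under a hypothetical /login/ path:

```apache
# Sketch: 301-redirect HTTPS requests to HTTP,
# leaving the (assumed) /login/ path untouched.
RewriteEngine On
RewriteCond %{HTTPS} on
RewriteCond %{REQUEST_URI} !^/login/
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1 [R=301,L]
```

Adjust the excluded path to match wherever your secure pages actually sit.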
2. Development server - Check to see if Google is indexing your development server by typing [site:development-site.com] and make sure to fix this situation by restricting the site from getting indexed, using either 'Disallow: /' in the robots.txt file (no guarantee that it will always work), placing the site behind a username and password, or simply granting access only to your company's IP address range.
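The password and IP options can be combined in .htaccess. This is only a sketch: the .htpasswd path is hypothetical and 203.0.113.0/24 is a documentation-only IP range standing in for your office's:

```apache
# Sketch: allow access with either HTTP Basic Auth
# or a match on an assumed office IP range.
AuthType Basic
AuthName "Development server"
AuthUserFile /path/to/.htpasswd
Require valid-user
Order deny,allow
Deny from all
Allow from 203.0.113.0/24
Satisfy Any
```

With `Satisfy Any`, visitors from the listed range get straight in and everyone else is prompted for credentials.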
3. Affiliate links - If you're using one of the affiliate networks, chances are this won't matter much, as the link juice won't flow your way anyway, but to the affiliate network. If, on the other hand, you're using off-the-shelf software to manage your affiliates in-house, the affiliate URL might look something like www.sitename.com/?affrt1|2|3|4|etc, so to find out if you've got any affiliate URLs indexed, search for [inurl:<ID-code> site:examplesite.com].
4. Pay per click agency links - Similar to off-the-shelf affiliate software, some PPC agencies might use proprietary software which can get indexed. You're able to restrict these pages from appearing in the index using the robots.txt file and a wildcard query.
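A robots.txt sketch of such wildcard rules, where "affrt" and "ppc_id" are hypothetical parameter names standing in for whatever your affiliate or PPC software appends:

```text
# Sketch: block any URL whose query string starts with
# the assumed tracking parameters.
User-agent: *
Disallow: /*?affrt
Disallow: /*?ppc_id
```

Swap in the actual parameter names you found via the [inurl:] searches above.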
5. Google URL Builder links - I am a huge fan of the Google URL Builder tool and often use it to track conversions from Google Base, newsletters and much more. The only complaint I have is that the URL uses a question mark (?) parameter ID, which does not prevent the page from getting indexed. The solution would have been to use a hash (#) instead of the ?, so www.examplesite.com/?url-id will get indexed, whereas www.examplesite.com/#url-id won't. Searching Google for [allinurl:utm_source utm_medium], which are parts of the tracking code, brings up some interesting results. While there are some hacks which allow you to tweak the Google Analytics tracking code to use #, the quickest solution is to use the canonical tag across the site and hope that Google will drop the other version of the page from its index.
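A sketch of what that canonical tag looks like, placed in the page's head; the landing-page URL here is an assumption:

```html
<!-- Sketch: a page reached via ?utm_source=... parameters
     points the engines at its clean URL. -->
<link rel="canonical" href="http://www.examplesite.com/landing-page/" />
```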
6. Web based newsletter copies from way, way back - To check if Google is indexing any old copies, search for the directory on Google using [site:examplesite.com/directory-name/]. As these pages might have some value, you should 301 redirect them to a more appropriate page on the site. Often this method is an easy way to pick low-hanging fruit, as you're likely to win some more backlinks.
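A one-line .htaccess sketch of such a redirect; both the old directory name and the target page are assumptions:

```apache
# Sketch: permanently redirect the old newsletter archive
# (and everything under it) to the current newsletter page.
Redirect 301 /old-newsletters/ http://www.examplesite.com/newsletter/
```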
7. Shopping cart and log in pages - You should keep them out of the index, not necessarily because of content duplication issues, but because they offer no value.
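A robots.txt sketch for keeping them out; the /cart/ and /login/ paths are assumptions about the site's structure:

```text
# Sketch: keep cart and log-in pages out of the index.
User-agent: *
Disallow: /cart/
Disallow: /login/
```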
What's your search engine index strategy?
Photo credit: anna maria lopez lopez via stock.xchng