  1. Ashley Friedlein Staff

    CEO at Econsultancy

    07 August 2007 16:02pm

    I should know this but I'm intrigued to hear from any SEO experts out there what the latest thinking / best practice is for allowing search engines to index restricted content e.g. content that sits behind a pay-access log in, or other barrier?

    We're looking at converting all our current file content (e.g. Word files, PDFs etc.), which are mostly paid-access only research and guides, into XHTML so that we can display them as HTML or allow users to convert them (e.g. to PDF) on the fly. This will also make it easier to syndicate our content, present it on other devices, "reskin" the presentation layer and so on.

    But it also would allow us to make all the contents of a report (e.g. a 200 page Word file) available for indexing by a search engine. In theory this is good because there is a lot of great, niche, content in these documents which would be great for long tail SEO and attracting high-converting traffic.

    But, of course, we wouldn't want the user to actually get access to the full content itself without first paying. Nor would we want all this pay-access content existing in Google's cache.

    I guess, in theory, we could allow the Googlebot and chosen other spiders to index this content as HTML but not allow real humans or other agents to do so. I believe we might be able to use the robots.txt protocol to prevent caching too?

    But in the case of the above we are showing Google something that we are not showing our users - and isn't this cloaking?

    However, Google's Book search seems to work in just this way so Google don't appear to be averse to indexing intellectual property / content in this way but without revealing it all?

    Any thoughts / pointers / experiences welcome...

    Thanks

    Ashley Friedlein
    CEO
    E-consultancy.com

  2. Steve Johnston

    Founder at Search:Johnston Google Consultants

    07 August 2007 17:00pm

    The principles here are very simple, and in fact you do know it really, but you just wish it weren't so:

    Any time you serve content at a specific URL to the Googlebot that is different from the content a browser-based visitor may see, you are cloaking. Google routinely tests URLs with its user agent spoofed to appear to be a browser, in order to track any differences in content that are Googlebot-dependent. Google is also inviting spam reports on a level not seen before, and they promise to investigate any submitted via a Webmaster Tools login. So it only takes a competitor to spot what you are doing and 'shop' you for Google to pay special attention to your site.

    Cloaking deliberately is therefore a huge risk to your web site, as Google's reaction to such behaviour can be draconian, because, let's face it, you are trying to manipulate Google's view of your available information to gain traffic that you don't deserve.

    Google Book search, on the other hand, does not work in the way you suggest. Google will show you the content related to the search term, but will, as you rightly say, not reveal all of the information from the rest of the book.

    Your decision, therefore, is whether you think the commercial consequences of publishing all your pages of juicy content on a page-by-page, URL-by-URL basis are likely to be positive.

    Steve Johnston
    Google Consultant

  3. Ashley Friedlein Staff

    CEO at Econsultancy

    07 August 2007 17:26pm

    I can see how this is 'cloaking' in as much as we are showing Google something the user can't access but we are not trying to deceive anyone - the content is there, we're not redirecting anyone or anything.

    Do we not 'deserve' the traffic? I presume Google would justify this (perhaps understandably) on it being a poor customer experience - you click on a search result only to go to a challenge page?

    But this happens anyway with content that is free access for a while and then becomes restricted access doesn't it? Why couldn't we expose all the content to the Googlebot and everyone else but only, say, for as long as it takes for Googlebot to index it. So like the Wall Street Journal might, or MarketingSherpa, or any number of sites, only we reduce the time window from free-to-restricted right down? This has pretty much the same effect doesn't it?

    I'm not clear, either, how Google Book search IS different to what I'm suggesting - we too would have search results relevant to the searcher's query and Google would display an extract from the document/web page but, on clicking, the user wouldn't see the whole document?

    I guess it begs the question of where the distinction lies between what is a "web page", a "book", a "document", a "file" etc. To my mind the differences are increasingly unimportant and unhelpful as a user - I want the information/content I want, in a format that I choose.

    I would emphasise that we don't plan to do anything unethical, whatever the possible business benefits, but I suppose I'm just not quite clear on Google's position; or maybe I am clear but disagree with them... ;)

    Ashley

  4. Paul Rudman

    Head of Technical Search at Barracuda Digital

    07 August 2007 17:43pm

    Hi Ashley,

    This is something we've raised with several clients, as one of them had 300,000+ pages hidden behind a login screen, all potentially very valuable data in not only increasing the volume of long-tail phrases the site would be found for but also in providing more content on important core topics that then in turn would help the home page rank higher (as I have frequently found that if you have 500 pages on a topic on your site, rather than any single one of them ranking very high, it just pushes up the rankings for the home page).

    The most obvious but difficult solution to the problem would be to be able to provide Googlebot a username and password to 'unlock' that content itself, allowing it to index content in your secure area, but then when a search engine user clicked on the link in the Google results they would be prompted to login to read the full article. However, this approach would require an association with the Google team that not many sites or companies do or would ever possess.

    Another option that still avoids the potentially risky 'redirection based on bots or users' would be to skim the first 50 words from each piece of content that is currently secure, and publish it on your front end with appropriate Meta data to encourage high rankings.

    Search engines regard the first 50 words on a page as the most important anyway, so you would satisfy them in terms of providing enough content to get the SEO benefit, while only providing a teaser to visiting people, who would still have to register and login to read the rest of it.
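
    As a rough illustration of the "first 50 words" option, here is a minimal Python sketch. The function name `teaser`, the `max_words` parameter and the trailing " ..." marker are assumptions made for the example; nothing in the thread specifies them.

```python
def teaser(text: str, max_words: int = 50) -> str:
    """Return the first `max_words` words of a protected document.

    Whitespace is normalised, and " ..." is appended only when
    something was actually cut off.
    """
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + " ..."


# The public page shows the teaser; the full text stays behind the login.
print(teaser("one two three four five", max_words=3))  # one two three ...
```

    The same extract is served to every visitor, spider or human, so there is no bot-dependent content involved.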

    If you do go down the route of redirecting bots or users and returning different content, it is definitely a gray area. Google themselves have been doing it for years with AdWords, i.e. allowing you to display ads only to people from particular towns, cities, and countries, but then again, that doesn't mean it's a license for everyone to do it.

    My view on redirection based on bot or browser has always been that if you have a good reason for doing it, you will be fine. You will never get automatically banned or penalised by an engine, only potentially flagged for a manual review of suspicious activity. Assuming that reviewer has any sense, they will realise you are not trying to get an unfair advantage or fool them; you are actually just trying to improve the quality of their search database by sharing your content with search engine users.

    However, saying that, I'd go with the first 50 words on a page option as you get the benefits of the extra traffic from search engines, there's no redirection involved so no risk there, and you still get the value of the memberships / logins from people who want to read the full article.

    I hope this helps.

    Regards,

    Paul

    Head of Search
    Barracuda Digital

  5. Ashley Friedlein Staff

    CEO at Econsultancy

    07 August 2007 17:51pm

    First 50 words on a page is an interesting idea... but I'm not quite sure how we would make this sensibly navigable from a user perspective e.g. for a 200 page guide that would be 200 pages of HTML. Perhaps it doesn't really matter as long as the spiders can crawl through it and it delivers search traffic.

    Though with only 50 words per page, buried deep where no-one is likely to link to it (not least because there is only an extract to see), then it may very well be indexed fine but never rank well enough to be of any value...

    Ashley

  6. Paul Rudman

    Head of Technical Search at Barracuda Digital

    07 August 2007 18:01pm

    I think it all really depends on your motivation for the exercise to start with. If the purpose is to drive more traffic from search engines, then this would represent a no-risk option to at least get a percentage of each document indexed. Bear in mind that the e-consultancy site is a very well respected domain with the engines. If it were "joe's blog", then yes, I'd suggest that just pumping out 1000s of documents and articles wouldn't have much of an impact. But in the same way that you can type practically any recipe name into Google and get a BBC page back, regardless of how deep it sits in the site hierarchy, you'd probably see the same on the e-consultancy site.

    There are also various ways you can then push this new section of your site, such as making sure you provide plenty of internal hyperlinks to it from your high PR pages (a home page link to this new section home page would be advisable).

    You'd also want to be sure of the following:

    - Unique Meta titles and descriptions for all the content you move to the front end
    - Unique H1 text for each new article / piece of content
    - Solid interlinking between the new contents so there is ideally no new page that has only a single link to it
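
    The uniqueness checks in the list above could be automated along these lines. This is only a sketch: the `pages` dictionary, its URLs and the field names `title` and `h1` are hypothetical stand-ins for whatever the real content system exposes.

```python
from collections import defaultdict


def find_duplicates(pages: dict[str, dict[str, str]], field: str) -> dict[str, list[str]]:
    """Group URLs by the normalised value of `field` and keep only the clashes.

    Pages that lack the field are grouped under "" so that missing
    titles or headings are reported as well.
    """
    seen: dict[str, list[str]] = defaultdict(list)
    for url, meta in pages.items():
        value = meta.get(field, "").strip().lower()
        seen[value].append(url)
    return {value: urls for value, urls in seen.items() if len(urls) > 1}


# Hypothetical sample data, not taken from the real site.
pages = {
    "/guides/seo/1": {"title": "SEO Guide: Introduction", "h1": "Introduction"},
    "/guides/seo/2": {"title": "SEO Guide: Introduction", "h1": "Keyword Research"},
    "/guides/seo/3": {"title": "SEO Guide: Link Building", "h1": "Link Building"},
}

print(find_duplicates(pages, "title"))
# {'seo guide: introduction': ['/guides/seo/1', '/guides/seo/2']}
print(find_duplicates(pages, "h1"))
# {}
```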

    In time, as people find their way to this content naturally, they will link more to it specifically, so as it beds in as static content over the following months you'll find its value increases to you, without the potential risks involved with bots / browsers being shown different content.

  7. Adam Crawford Gold

    SEO at Cheapflights Media

    07 August 2007 18:54pm

    Hi Ashley

    Google's "First Click Free" system for subscription based content on Google News can also be used with the main natural search index I believe.  The following links may be helpful:

    http://www.google.com/support/news_pub/bin/answer.py?answer=40543
    http://www.betanews.com/article/Google_Indexing_Subscription_Content/1120164520
    http://searchengineland.com/070304-231603.php

    Adam Crawford
    http://www.propellernet.co.uk

  8. Edward Cowell Enterprise

    SEO Director at Guava UK

    08 August 2007 07:25am

    I'll side with Adam on this.

    The decision is not how to completely hide your content, but how difficult you want to make it for users to get it for free, so that paying/registering becomes a no-brainer.

    It is not cloaking as long as you show the SAME content to spiders you would show to users.

    The common set up works as follows:

    • keep the site completely open to everyone including spiders.
    • set up a user detection script, NOT a spider detection script, that can push users who have not authenticated to a registration or login page. Users who have authenticated get full normal access anyway.
    • for a better user experience and to comply with Google's "First Click Free" guidelines, set that script up to allow access to a certain number of pages for free (1 or more) when people access via search engines.
    • You can set noarchive meta tags, or nowadays use the X-Robots-Tag header, to keep the content out of the Google cache so that people can't access it that way. However, sites often don't block the search engine cache, so you can still access the content that way. Google prefers it if the noarchive option is not used; I presume this is because their tools look at the way pages change over time, so if they can't cache them this won't work as well. The average internet user probably doesn't know the cache link even exists anyway.
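
    A minimal Python sketch of the set-up described in the list above. It assumes a per-session page counter kept in a cookie-backed session, and that a client which never sends the cookie back (as crawlers typically do not) simply looks like a fresh session on every request. The names `Session`, `decide`, `cache_headers` and the `FREE_PAGES` value are hypothetical, not taken from any of the sites mentioned.

```python
from dataclasses import dataclass

FREE_PAGES = 1  # hypothetical allowance; the thread only says "1 or more"


@dataclass
class Session:
    authenticated: bool = False
    free_pages_used: int = 0


def decide(session: Session) -> str:
    """Return "serve" or "login" for one page request.

    There is no user-agent check: crawlers and browsers get the same rule.
    """
    if session.authenticated:
        return "serve"
    if session.free_pages_used < FREE_PAGES:
        session.free_pages_used += 1
        return "serve"
    return "login"


def cache_headers(allow_cached_copy: bool) -> dict[str, str]:
    """Optional response header asking search engines not to keep a cached copy."""
    return {} if allow_cached_copy else {"X-Robots-Tag": "noarchive"}


browser = Session()
print(decide(browser))    # serve  (first page is free)
print(decide(browser))    # login  (second click in the same session)
print(decide(Session()))  # serve  (cookieless client: fresh session every time)
print(cache_headers(allow_cached_copy=False))  # {'X-Robots-Tag': 'noarchive'}
```

    Because the rule depends only on the session, the same content is served to spiders and to users, which is the point of detecting users rather than spiders.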

    Here's how a couple of major websites do it, NOT low level spam sites, and seem to have no issues:

    The Washington Post
    http://www.washingtonpost.com
    - all articles are accessible for free when accessing via search engine results, after which a click on any link prompts authentication.

    Webmaster World
    http://www.webmasterworld.com
    - a certain number of articles are accessible for free after which no matter which way you access the pages including via search results, authentication is prompted. Seems to be time based and the authentication prompt resets back to free access after a short period.

    Both use server-side 302 temporary redirects to push users to the login/registration pages, and then once the user authenticates they get referred back to the article.
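
    The redirect-and-return pattern can be sketched as follows. This is an illustration only: the `/login` path, the `next` query parameter and both function names are assumptions, not details of how either site actually implements it.

```python
from urllib.parse import parse_qs, urlencode, urlparse

LOGIN_PATH = "/login"  # hypothetical login page


def login_redirect(requested_path: str) -> tuple[int, dict[str, str]]:
    """Build a 302 response that remembers which article was requested."""
    location = f"{LOGIN_PATH}?{urlencode({'next': requested_path})}"
    return 302, {"Location": location}


def return_target(login_url: str) -> str:
    """After authentication, work out where to send the user back to.

    Only site-relative paths are accepted, so the parameter cannot be
    abused to bounce users to another domain.
    """
    query = parse_qs(urlparse(login_url).query)
    target = query.get("next", ["/"])[0]
    if not target.startswith("/") or target.startswith("//"):
        return "/"
    return target


status, headers = login_redirect("/reports/lead-generation?page=3")
print(status, headers["Location"])
# 302 /login?next=%2Freports%2Flead-generation%3Fpage%3D3
print(return_target(headers["Location"]))
# /reports/lead-generation?page=3
```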

    Google needs to find a balanced approach to indexing and supplying results from content behind walled gardens, because there will only be more of them in the future as publishers of quality content look to monetise it properly. "First Click Free" and various major websites' successful implementations of it show that this is possible and is a reasonable interim solution for Google, internet users and publishers.

    Edward Cowell (Teddie)
    Neutralize (**)
    http://www.neutralize.com

  9. Ashley Friedlein Staff

    CEO at Econsultancy

    08 August 2007 09:30am

    That sounds like an excellent solution, thanks Teddie.

    Now we just have to turn about 300 Word files into XHTML...

    Ashley

  10. Lawrence L

    Freelance Web Consultant at architxt.net

    08 August 2007 12:29pm

    Are you suggesting having all 32 pages of the Online Lead Generation Report, for example, accessible to search engines but limiting the number of pages a user can access before the login prompt kicks in?

    Or that 5 specific pages out of the 32 pages are free at all times?

    The former sounds like a good idea but I think one could end up downloading all pages in multiple sessions.
