I should know this but I'm intrigued to hear from any SEO experts out there what the latest thinking / best practice is for allowing search engines to index restricted content e.g. content that sits behind a pay-access log in, or other barrier?
We're looking at converting all our current file content (e.g. Word files, PDFs etc.), which are mostly paid-access only research and guides, into XHTML so that we can display them as HTML or allow users to convert them (e.g. to PDF) on the fly. This will also make it easier to syndicate our content, present it on other devices, "reskin" the presentation layer and so on.
But it also would allow us to make all the contents of a report (e.g. a 200 page Word file) available for indexing by a search engine. In theory this is good because there is a lot of great, niche, content in these documents which would be great for long tail SEO and attracting high-converting traffic.
But, of course, we wouldn't want the user to actually get access to the full content itself without first paying. Nor would we want all this pay-access content existing in Google's cache.
I guess, in theory, we could allow the Googlebot and chosen other spiders to index this content as HTML but not allow real humans or other agents to do so. I believe we might be able to use the robots.txt protocol to prevent caching too?
But in the case of the above we are showing Google something that we are not showing our users - and isn't this cloaking?
However, Google's Book search seems to work in just this way so Google don't appear to be averse to indexing intellectual property / content in this way but without revealing it all?
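On the caching question, a hedged note: as far as I know, robots.txt governs what gets crawled rather than what gets cached, and the directive normally used to suppress the cached copy is `noarchive`, sent either as a robots meta tag or as an `X-Robots-Tag` response header. Below is a minimal sketch of the header variant. The function name and the `is_paid_content` flag are made up for illustration, and whether a given engine honours the header is up to that engine.

```python
def robots_headers(is_paid_content):
    """Return extra HTTP response headers for a report page.

    Assumption (hypothetical policy): paid pages may be indexed,
    but engines should not offer a cached copy of them.
    """
    headers = {}
    if is_paid_content:
        # "noarchive" asks engines that honour it not to show a cached copy.
        headers["X-Robots-Tag"] = "noarchive"
    return headers


# Example: headers to merge into the response for a paid report page.
paid_page_headers = robots_headers(True)
```

The same directive can be placed in the page itself as `<meta name="robots" content="noarchive" />` if setting headers is awkward.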
CEO at Econsultancy
08 August 2007 13:13pm
I understand it as the former but probably limiting the number of pages anyone could view to 1 before they were prompted to log in. So, yes, someone could see the whole thing for free but would they really be bothered to try and do that?
Equally I guess we could cookie them so that even if they did come back through another search within a certain time frame they couldn't access the content without logging in. So they could delete their cookies for every page (or reject them) and go via search to cobble together the full report. But frankly there are much easier ways to steal our content than that... e.g. via illegal file swapping.
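To make the cookie idea concrete, here is a rough sketch of the metering logic only, not a finished design. The one-page limit comes from the post above; the 24-hour window, the cookie field names and the function name are all assumptions for illustration.

```python
FREE_VIEWS = 1               # one free page before the login prompt (from the post)
WINDOW_SECONDS = 24 * 3600   # assumed length of the "certain time frame"


def allow_free_view(cookie, now):
    """Decide whether an anonymous visitor gets a free page.

    cookie: the parsed metering cookie as a dict (empty if none was sent).
    now:    current time in seconds.
    Returns (allowed, updated_cookie).
    """
    first_seen = cookie.get("first_seen", now)
    views = cookie.get("views", 0)
    if now - first_seen > WINDOW_SECONDS:
        # The window has expired, so the allowance starts again.
        first_seen, views = now, 0
    if views >= FREE_VIEWS:
        return False, {"first_seen": first_seen, "views": views}
    return True, {"first_seen": first_seen, "views": views + 1}
```

A visitor who deletes or rejects the cookie arrives with an empty dict each time and is always allowed through, which is exactly the weakness described above.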
Ashley
CEO at Econsultancy
08 August 2007 14:13pm
On the "cloaking" thing there seems to be a small, but important, difference between serving different 'content' and serving different 'access'. It is wrong to serve different content to Google vs. users but apparently not wrong to serve different levels of content access?
As per Adam's link below to Google News' guidelines on allowing access to subscription content, they recommend:
"The easiest way to do this (allow Google to spider restricted content but not allow users to do so) is to configure your webservers to not serve the registration page to our crawlers (when the User-Agent is "Googlebot")"
To my uneducated eye this sounds very like 'serving something different to Googlebot than to users' (i.e. cloaking).
But is this only true for Google News and not the main index? Teddie's mainstream examples further down this thread seem to make it clear that this also applies to the main index. In fact, Teddie's answer seems to me to sum up the best way to do things currently.
Though we're now scratching our heads about the best combination of deterrents (referrer strings, user agents, cookies, sessions etc.) to actually uniquely identify a user and try to prevent as much content theft as possible whilst not over-complicating things...
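For what it's worth, the User-Agent rule quoted from Google News could be sketched roughly as below. This is an illustration under assumptions, not anyone's actual configuration: the function names are made up, and the check is a plain substring match on a header the client supplies, so anyone can claim to be Googlebot.

```python
def is_search_crawler(user_agent, crawler_tokens=("Googlebot",)):
    """True if the User-Agent header contains a known crawler token.

    The header is sent by the client, so this is a hint, not proof.
    """
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in crawler_tokens)


def should_serve_registration_page(user_agent, logged_in):
    """Logged-in users and recognised crawlers skip the registration page."""
    if logged_in:
        return False
    return not is_search_crawler(user_agent)
```

Because the match relies on a spoofable header, it would need to be combined with other signals before it protected anything valuable.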
Ashley
Chief Architect at Econsultancy
17 August 2007 10:16am
Yes, this is definitely along the lines of what we'd ideally like to do.
The relevant Google News page (http://www.google.com/support/newspub/bin/answer.py?answer=40543) strongly suggests that they actually don't mind you selectively allowing the Googlebot through a pay wall, which from a publisher's perspective provides the best of both worlds (we get our content indexed while it remains fully protected), but clearly that's not going to create a spectacular user experience for anyone clicking through from a search result.
However, I'm concerned about the technical logistics of "First Click Free" -- namely how trivial it is to circumvent it. Anyone with a rudimentary grasp of IT can install a browser extension to spoof their HTTP referer, or disable cookies, or even (with a bit more effort) make consecutive HTTP requests appear to come from different IP addresses, and faced with that combination of techniques it becomes impossible to distinguish between someone who's genuinely just landed from a search result and someone who's done 30 seconds of browser configuration and is now trawling the site downloading all the content.
There's always the old argument that publishers can do no better than provide a mild deterrent, and that anyone who's determined enough to crack the system is essentially welcome to invest the time and effort necessary to do so, but here we're talking about a really small effort versus a really big payoff. It's one thing for, say, the Washington Post to allow First Click Free on the grounds that their content is essentially ephemeral and largely advertising-funded in the first place, but E-consultancy's content has a high and lasting value that we can't afford to compromise so readily.
Does anyone have any technical insights about how to safely release useful chunks of content to users arriving from search engine results, without also making it very easy for any half-competent user to subvert this system to his advantage? The fundamentally anonymous and stateless nature of the web makes this a very difficult problem to solve reliably. We are extremely keen to provide the richest possible user experience, but preferably not at the expense of our content assets!
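To show how thin the referer test is, here is a minimal sketch of the check a "First Click Free" gate would have to rely on. The host list and function name are hypothetical; the point is that the server only ever sees the string the browser chooses to send.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of search result hosts.
SEARCH_HOSTS = {"www.google.com", "www.google.co.uk", "search.yahoo.com"}


def arrived_from_search(referer):
    """True if the Referer header names a known search host.

    The header is set by the client, so a spoofed value is
    indistinguishable from a genuine click on a search result.
    """
    host = urlparse(referer or "").hostname or ""
    return host in SEARCH_HOSTS


genuine = arrived_from_search("http://www.google.com/search?q=seo+report")
spoofed = arrived_from_search("http://www.google.com/search?q=anything")
```

Both calls give the same answer, so this test on its own cannot separate a real searcher from someone replaying a forged header on every request.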
Digital Marketing Consultant, Trainer, Author and Speaker at SmartInsights.com
22 August 2007 08:57am
I've come late to this party - holidays! and can't add much from an SEO POV, but I think there are some other alternatives to consider.
What’s not really covered is the impact on conversion rate. My gut feel is that, of the two models (#1 partial content preview, the 50 words one, and #2 time-limited preview, the Webmasterworld one), #1 would work best for conversion. It is also not subject to tech-savvy people getting around the security, as #2 is.
On the other hand, I must say I have always liked the Webmasterworld model since all their content is indexed so shows up frequently in long tail searches – and option #2 will therefore be much better for awareness / reach.
So, balancing higher conversion against reach, I think option #2 will work best.
Whichever option you choose, but especially #2, it obviously needs to be flexible enough to revert if the engines introduce a new approach / rule on cloaking or a tag for subscription content.
The other aspect not really mentioned and this is probably option #3 - is that for all your reports you have a 3 or 4 layer hierarchy, essentially chapter:topic:sub-topic. So you could have a mechanism of restricting access at the chapter/topic level – first 50 words maybe, but make all the detailed content sub-topic available, and so exploit the tail, but the full picture – horizontal navigation - isn’t available.
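One way option #3 might look in code, as a sketch only: the level names follow the chapter:topic:sub-topic hierarchy above and the 50-word cut-off is the figure mentioned, but the function name and the trailing " ..." marker are assumptions.

```python
PREVIEW_WORDS = 50  # "first 50 words maybe"


def visible_text(level, text, logged_in):
    """Return what an anonymous or logged-in visitor sees for one section.

    level is "chapter", "topic" or "sub-topic". Sub-topic pages are shown
    in full; higher levels are cut to a short preview unless logged in.
    """
    if logged_in or level == "sub-topic":
        return text
    words = text.split()
    if len(words) <= PREVIEW_WORDS:
        return text
    return " ".join(words[:PREVIEW_WORDS]) + " ..."
```

The detailed pages stay fully indexable for the long tail, while the chapter and topic pages that tie the report together only show a teaser.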
HTH Dave Chaffey
www.davechaffey.com
On 16:02:04 7 August 2007 Ashley wrote:
Social media guy at CSC
22 August 2007 09:50am
Quote: So you could have a mechanism of restricting access at the chapter/topic level – first 50 words maybe, but make all the detailed content sub-topic available, and so exploit the tail
Are you suggesting building, for want of a better term, a 'meta directory' that sits above the full content? That could actually be a very useful tool if integrated with a site map and would certainly help search prominence. You could continue to disallow access to deeper content, and really optimise the content of the directory summaries.
On 08:57:28 22 August 2007 DaveChaffey wrote:
CEO at Econsultancy
23 August 2007 10:40am
Dave's option #3 is indeed an interesting concept and not one that I've seen done anywhere yet. In theory it holds out the opportunity to get the best of both worlds.
I don't think we'd need a "meta directory" as such, nor would we necessarily have to stick to a hierarchical "visibility" rule based on chapter:topic:sub-topic though that might be the most sensible way to do it in most cases.
I'd imagine we'd just have a series of tags in our XHTML which marked up the content to say whether it is the type of content which should show only the first 50 words or whether it should all show up under the 'first click free' idea. The nice thing about this is that should we need to change our approach (as Dave points out because of a change in the way the search engines work) then we'd just change our tags (metadata) and the rest would be easy.
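As a rough illustration of the tagging idea, the sketch below reads an access rule for each section straight from the XHTML. The `access-preview` / `access-first-click-free` class names and the function name are invented for the example; the real markup could use any convention.

```python
import xml.etree.ElementTree as ET


def access_rules(xhtml, prefix="access-"):
    """Map each section id to the access rule named in its class attribute."""
    rules = {}
    for el in ET.fromstring(xhtml).iter():
        section_id = el.get("id")
        for token in (el.get("class") or "").split():
            if section_id and token.startswith(prefix):
                rules[section_id] = token[len(prefix):]
    return rules


sample = (
    '<div xmlns="http://www.w3.org/1999/xhtml">'
    '<div id="ch1" class="chapter access-preview"><p>Summary</p></div>'
    '<div id="ch1-t1-s1" class="access-first-click-free"><p>Detail</p></div>'
    '</div>'
)
rules = access_rules(sample)
```

Changing approach later would then mean editing the class values in the source documents rather than the code that serves them.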
Ashley
Digital Marketing Consultant, Trainer, Author and Speaker at SmartInsights.com
24 August 2007 07:42am
I wasn't exactly thinking of a directory, but yes, it would make sense to have a multi-level site map for this deep content, for easier nav and internal linking purposes - good idea.
Dave