Founder / Director / Co-founder at easyBacklog / Aqueduct / Econsultancy
07 September 2008 17:15pm
We are currently helping a new client of ours develop a hybrid vertical search / content aggregation site, and as part of the initial planning and consultancy phase we are looking at how the content is technically aggregated, categorised and tagged appropriately. In doing so we’ve come across a couple of issues which I believe could affect the overall feasibility of the project, and I am hoping someone with legal or relevant experience can offer some advice.
Republishing content from 3rd party sites if credited
My gut feeling tells me that without express permission from 3rd party websites, you are not legally allowed to scrape their content and then list it on your own website for the purposes of search/browse/aggregation. However, all the search engines do exactly that, so where does one draw the line? They generally list extracts of the content rather than the complete content. Having said that, you can view the cache in Google search results, which contains an exact copy of the content.
Therefore, my question specifically is whether it is legal to scrape content from 3rd party sites without any pre-agreed arrangements with them or even consideration of their terms & conditions so long as that content is publicly available on their sites? And if so, is there a ruling to say you simply need to credit them, or are there a set of laws about how you represent that content to ensure you are not presenting it as your own content but simply repurposing 3rd party content for your site visitor?
Robots.txt
Following on from the point above, surely the mechanism which websites should use to restrict others from scraping and republishing their content in any form is a robots.txt file, which allows computers to communicate explicitly what is or is not considered “crawlable”? Have there been any rulings surrounding that?
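For anyone unfamiliar with how that mechanism works in practice, here is a minimal Python sketch of a crawler consulting robots.txt rules before fetching a page. It uses the standard library's urllib.robotparser. The robots.txt contents, the bot name ExampleAggregatorBot and the publisher.example URLs are all made up for illustration, and a real crawler would download the file from the target site rather than hard-code it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher (not from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /archive/

User-agent: ExampleAggregatorBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules supplied as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A bot with no specific entry falls back to the "*" rules.
print(may_crawl(parser, "SomeOtherBot", "http://publisher.example/news/story.html"))  # True
print(may_crawl(parser, "SomeOtherBot", "http://publisher.example/members/profile"))  # False

# The publisher has singled out this (hypothetical) bot and disallowed everything.
print(may_crawl(parser, "ExampleAggregatorBot", "http://publisher.example/news/story.html"))  # False
```

Note that this only tells the crawler what the site owner has asked for; whether a well-behaved crawler is also on the right side of copyright law is the separate question raised above.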
Monetisation of content
Whilst our client is not proposing to sell the content that they aggregate for their users, they will be indirectly monetising 3rd party content for which they do not pay any fees, through advertising, paid-for advanced features and potentially even subscription services to access all content. Would this affect whether they would be legally allowed to scrape and republish content? Google certainly don’t pass back any fees for the content they scrape...
What happens if our client starts scraping and they subsequently ask our client not to do this?
Assume our client builds their business on scraping and collating content from the few large players and the long tail of smaller players, and as a result their market prevalence increases to the point where one of the few large players (who currently monopolise the market) expressly requests that our client stops scraping their site. What rights, or lack thereof, would our client have? The risk that at any point their business could be hugely impacted, because they are doing a better job than their competitors while at the same time relying on those competitors, is clearly worrying.
Whilst I have a sneaking suspicion that the law on scraping and re-publishing content on your own website favours the original publisher, I am surprised, as I can think of many large players out there relying on a business model which scrapes content and repurposes it for various uses, e.g. Google, Kayak, Kelkoo, Technorati.
What should I be advising my client with respect to their rights as a content aggregator, and how that may affect their business in the future?
Thanks
Matthew O’Riordan
Digital Services Director
Aqueduct
Copywriter at HappyCopy
07 September 2008 21:28pm
Hello Matt,
My copyright understanding comes from my work as an online journalist. I don't think you can republish content word for word, even if it is credited and freely available. For example, Sky and the BBC make their video etc. freely available online, but it would be an obvious copyright transgression to stream it from your own site.
Having said that, very few websites are likely to complain about their content being republished elsewhere, as it will not damage their SEO and they are being credited.
Basically, I do not think you would have any legal protection if you were doing this but I think it is quite unlikely you would be challenged.
I hope this was some help!
Kind regards,
Felicity
HappyCopy.co.uk
Managing Director at Free Rein Ltd
08 September 2008 08:55am
Hi Felicity, Matt
In fact, most news services would challenge you pretty quickly even if you did credit them. Where they offer an RSS feed, they expect people to use the title and story summary but then link directly through to the article on their site.
Some will allow you to publish the full article but generally at a cost.
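As a rough illustration of that "title, summary, link" pattern, here is a small Python sketch that reads an RSS 2.0 feed and keeps only those three fields for each item. The feed content and the news.example URLs are invented for the example, and a real aggregator would fetch the feed over HTTP and handle malformed XML; this only shows the shape of the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, invented for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://news.example/</link>
    <item>
      <title>Council approves new bridge</title>
      <link>http://news.example/stories/bridge.html</link>
      <description>Councillors voted on Monday to approve the scheme.</description>
    </item>
  </channel>
</rss>
"""


def feed_listings(feed_xml: str) -> list:
    """Return title, summary and link for each <item>, and nothing else."""
    root = ET.fromstring(feed_xml)
    listings = []
    for item in root.iter("item"):
        listings.append({
            "title": (item.findtext("title") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
            # The link sends the reader to the publisher's own page.
            "link": (item.findtext("link") or "").strip(),
        })
    return listings


listings = feed_listings(SAMPLE_FEED)
for entry in listings:
    print(entry["title"], "->", entry["link"])
```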
Kind regards
Tony
Founder / Director / Co-founder at easyBacklog / Aqueduct / Econsultancy
08 September 2008 12:56pm
Thanks for your comment.
It's all pretty confusing, and I suspect very much a grey area.
If you look at something like http://64.233.183.104/search?q=cache:26uO4FrnezUJ:uk.news.yahoo.com/skynews/20080906/tuk-teenager-arrested-over-stabbing-45dbed5.html+sky+news+stabbing&hl=en&ct=clnk&cd=3&gl=uk you will see that Google are caching content from Sky without express permission to reproduce this, and this is a news service.
How are they "allowed" to do this without express permission?
I have noticed that if you search on "Sky News" in Google, not all the results offer a view cache option. Could this be down to what they are allowed to cache and what they are not allowed to cache, or could it simply be an algorithm that Google employs to work out what is worth caching or not?
Matt
E-Business Consultant at Dan Barker
08 September 2008 14:42pm
hi, Matt, how's life?
I remember asking a similar question several years ago in a law lecture - there was no definite answer then. A couple of years later there was a test case where search engine caching was ruled 'fair use'.
Anyway, on to your problem...
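If I was you, I would:
Publish snippets from scraped pages, with links through to the full version.
Include full content only from feeds, or content where you've done some kind of licensing deal.
Put a mechanism in place so that you could switch off particular sites from content scraping if they requested it.
In terms of robots.txt, I would:
Avoid crawling any restricted sites.
Avoid publishing any content where any kind of 'no-cache'/'no archive' meta tags have been set (that's probably what's happening with the Sky content you mention?)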
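A rough Python sketch of how the snippet, opt-out and meta tag suggestions might fit together is below. Everything named in it (OPTED_OUT_HOSTS, may_publish, make_snippet, the example hosts) is hypothetical, and it checks only the "noarchive" directive in a robots meta tag as one example of a publisher signal; which directives to honour is a decision for the aggregator.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical opt-out list: hosts that have asked not to be scraped.
OPTED_OUT_HOSTS = {"optedout.example"}

# Illustrative choice of robots meta directives that stop republication.
BLOCKING_DIRECTIVES = {"noarchive"}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def may_publish(url: str, html: str) -> bool:
    """False if the host has opted out or the page carries a blocking directive."""
    host = urlparse(url).hostname or ""
    if host in OPTED_OUT_HOSTS:
        return False
    parser = RobotsMetaParser()
    parser.feed(html)
    return not (BLOCKING_DIRECTIVES & parser.directives)


def make_snippet(text: str, source_url: str, max_chars: int = 200) -> dict:
    """Return a short extract plus a link through to the full version."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) > max_chars:
        # Cut at the last space inside the limit so the final word is whole.
        text = text[:max_chars].rsplit(" ", 1)[0] + "..."
    return {"snippet": text, "link": source_url}


page = '<html><head><meta name="robots" content="noindex, NOARCHIVE"></head></html>'
print(may_publish("http://publisher.example/a", page))             # False
print(may_publish("http://optedout.example/a", "<html></html>"))   # False
print(make_snippet("one two three four", "http://publisher.example/a", max_chars=9))
```

Keeping the opt-out list as data rather than code means a site can be switched off without a redeploy, which is the point of having the mechanism in place before anyone asks.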
I hope that helps you - would be interested to hear more about what you're doing!
daniel
Founder / Director / Co-founder at easyBacklog / Aqueduct / Econsultancy
08 September 2008 14:57pm
Thanks Daniel.
I think using snippets and linking back to the primary site is probably not a bad approach...
Thanks for the advice, and if I hear anything further about the issue I'll be sure to post it in here.
Matt
Director at Watson Hall Ltd
08 September 2008 16:09pm
Matthew
You should look at the terms and conditions on the sites you intend to target. You will also have to consider the liability for things like defamation, encouragement or glorification of terrorism, copyright theft and other intellectual property infringements, etc if you start publishing other people's content. Take the advice of some good new media lawyers soon.
Also make sure that you don't start being used to circulate malware via infected content.
Some site owners are very aggressive with those they consider to be trying to take their content. For example:
http://www.theregister.co.uk/2008/08/13/ryanair_screen_scraping_cancellations/
Others use firewalling techniques to block scrapers, or to log what's going on for future court action.
Colin Watson
Technical Director
Watson Hall Ltd for website security
President at Salebug.com
25 August 2009 03:26am
Hi Matt and others,
Thank you for all your expert tips on the legal issues of website scraping.
I have one specific question:
I'm working on a business idea based on aggregating content from industry specific websites.
When I look at the terms of service on these websites, is there anything specific I should be looking for? Some of the terms and conditions say that I could use the content for non-commercial purposes. I'm not selling the content to my potential customers but will be making ad revenue (hopefully).
See the below example from one of the clients...
"You are hereby granted a non-exclusive, non-transferable, limited license to view this Site, and to download and/or print insignificant portions of materials retrieved from this Site provided (a) it is used only for informational, non-commercial purposes, and (b) you do not remove or obscure the copyright notice or other notices. Except as expressly provided above, no part of this Site, including but not limited to materials retrieved there from and the underlying code, may be reproduced, republished, copied, transmitted, or distributed in any form or by any means, without the express written permission of xxxx"
Would I need explicit permission from this website before I scrape its publicly available content?
Thank you, gurus. I truly appreciate your time and advice.
Best regards,
Prasad