1. Matthew O'Riordan Staff

    Founder / Director / Co-founder at easyBacklog / Aqueduct / Econsultancy

    07 September 2008 17:15pm

    We are currently helping a new client of ours develop a hybrid vertical search / content aggregation site and as part of the initial planning and consultancy phase we are looking at how the content is technically aggregated, categorised and tagged appropriately. As such we’ve come across a couple of issues which I believe could impact on the overall feasibility of the project and I am hoping someone with legal or relevant experience can offer some advice.

    • Republishing content from 3rd party sites if credited
      My gut feeling is that, without express permission from the 3rd party websites, you are not legally allowed to scrape their content and list it on your own website for the purposes of search/browse/aggregation. However, all the search engines do exactly that, so where does one draw the line? They generally list extracts rather than the complete content, although the cache link in Google search results gives you an exact copy of the page.
      Therefore, my question specifically is whether it is legal to scrape content from 3rd party sites without any pre-agreed arrangements with them or even consideration of their terms & conditions so long as that content is publicly available on their sites? And if so, is there a ruling to say you simply need to credit them, or are there a set of laws about how you represent that content to ensure you are not presenting it as your own content but simply repurposing 3rd party content for your site visitor?
       
    • Robots.txt
      Following on from the point above, surely the mechanism which websites should use to restrict others from scraping and republishing their content in any form should be through the use of a robots.txt file which allows computers to communicate explicitly on what is or is not considered “crawlable”? Have there been any rulings surrounding that? 
       
    • Monetisation of content
      Whilst our client is not proposing to sell the content that they aggregate for their users, they will be indirectly monetising 3rd party content for which they do not pay any fees by including advertising, paid-for services for advanced features and potentially even subscription services to access all content. Would this affect whether they are legally allowed to scrape and republish content? Google certainly don’t pass back any fees for the content they scrape...
       
    • What happens if our client starts scraping and they subsequently ask our client not to do this?
      Assuming our client starts their business based on the fact they scrape and collate content from the few large players and the long tail of smaller players, and as a result their market prevalence increases to the point where one of the few large players (who currently monopolise the market) expressly requests that our client stops scraping their site, what rights or lack thereof would our client have? The risk that their business could be hugely impacted at any point, because they are doing a better job than their competitors while still relying on those competitors for content, is clearly worrying.

    Whilst I have a sneaking suspicion that the law on scraping and re-publishing content on your own website is firmly in favour of the original publisher, I am surprised, as I can think of many large players out there relying on a business model which scrapes content and repurposes it for various uses, e.g. Google, Kayak, Kelkoo, Technorati.

    What should I be advising my client in respect to their rights as a content aggregator, and how that may impact on their business in the future?

    Thanks

    Matthew O’Riordan
    Digital Services Director
    Aqueduct

  2. Felicity King-Evans

    Copywriter at HappyCopy

    07 September 2008 21:28pm

    Hello Matt,

    My understanding of copyright comes from my work as an online journalist. I don't think you can republish content word for word, even if it is credited and freely available. For example, Sky and the BBC make their video etc freely available online but it would be an obvious copyright transgression to stream it from your own site.

    Having said that, very few websites are likely to complain about the content being republished elsewhere as it will not damage their SEO and they are being credited.

    Basically, I do not think you would have any legal protection if you were doing this but I think it is quite unlikely you would be challenged.

    I hope this was some help!

    Kind regards,

    Felicity

    HappyCopy.co.uk

  3. Tony Addison Silver

    Managing Director at Free Rein Ltd

    08 September 2008 08:55am

    Hi Felicity, Matt

    In fact most news services would challenge you pretty quickly even if you did credit them. Where they offer an RSS feed they expect people to use the title and story summary but then link directly through to the article on their site.

    Some will allow you to publish the full article but generally at a cost.

    Kind regards
    Tony

  4. Matthew O'Riordan Staff

    Founder / Director / Co-founder at easyBacklog / Aqueduct / Econsultancy

    08 September 2008 12:56pm

    Thanks for your comment.

    It's all pretty confusing and certainly a grey area I suspect.

    If you look at something like http://64.233.183.104/search?q=cache:26uO4FrnezUJ:uk.news.yahoo.com/skynews/20080906/tuk-teenager-arrested-over-stabbing-45dbed5.html+sky+news+stabbing&hl=en&ct=clnk&cd=3&gl=uk you will see that Google are caching content from Sky without express permission to reproduce this, and this is a news service.

    How are they "allowed" to do this without express permission?

    I have noticed that if you search on "Sky News" in Google, not all the results offer a view cache option. Could this be down to what they are allowed to cache and what they are not allowed to cache, or could it simply be an algorithm that Google employs to work out what is worth caching or not?

    Matt

  5. dan barker

    E-Business Consultant at Dan Barker

    08 September 2008 14:42pm

    hi, Matt, how's life?

    I remember asking a similar question several years ago in a law lecture - there was no definite answer then. A couple of years later there was a test case where search engine caching was ruled 'fair use'.

    Anyway, on to your problem...

    If I was you, I would:

    1. Publish snippets from scraped pages, with links through to the full version.
    2. Include full content only from feeds, or content where you've done some kind of licensing deal.
    3. Put a mechanism in place so that you could switch off particular sites from content scraping if they requested it.
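
    As a rough illustration of points 1 and 3, here is a minimal Python sketch. Every name in it (`Listing`, `make_snippet`, `build_listing`, `DISABLED_HOSTS`, the 200-character limit) is hypothetical and chosen for the example only. It shows one possible shape for a snippet-plus-link listing with a per-site off switch, and says nothing about what length of extract is legally safe.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Hypothetical opt-out list: hosts whose owners have asked not to be scraped.
# In a real system this would live in a database or config file.
DISABLED_HOSTS = set()


@dataclass
class Listing:
    title: str
    snippet: str
    source_url: str  # always link through to the full version


def make_snippet(text: str, max_chars: int = 200) -> str:
    """Collapse whitespace and cut the text at a word boundary."""
    text = " ".join(text.split())
    if len(text) <= max_chars:
        return text
    # Drop the last (possibly partial) word before adding the ellipsis.
    cut = text[:max_chars].rsplit(" ", 1)[0]
    return cut + "..."


def build_listing(title: str, body: str, source_url: str) -> Optional[Listing]:
    """Return a snippet listing, or None if the source site is switched off."""
    host = urlparse(source_url).hostname or ""
    if host in DISABLED_HOSTS:
        return None
    return Listing(title, make_snippet(body), source_url)
```

    Switching a site off is then a matter of adding its hostname to `DISABLED_HOSTS`; anything already published from that host would still need removing separately.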

    In terms of robots.txt, I would:

    1. Avoid crawling any restricted sites
    2. Avoid publishing any content where any kind of 'no-cache'/'noarchive' meta tags have been set (that's probably what's happening with the Sky content you mention?)
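
    The two robots checks above could be sketched in Python along these lines, using only the standard library (`urllib.robotparser` and `html.parser`). The crawler name `ExampleAggregatorBot` and the helper names are assumptions made up for the example, and treating only `noarchive` as blocking is a simplification; it is not a statement of what any particular site or search engine does.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleAggregatorBot"  # hypothetical crawler name

# Assumption: only 'noarchive' is treated as "do not republish" here.
BLOCKING_DIRECTIVES = {"noarchive"}


def allowed_by_robots(robots_txt: str, url: str) -> bool:
    """Check a URL against the text of a site's robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def may_republish(html: str) -> bool:
    """Return False if the page carries a blocking robots meta directive."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    return not (BLOCKING_DIRECTIVES & parser.directives)
```

    In practice you would fetch each site's robots.txt once, call `allowed_by_robots` before requesting a page, and call `may_republish` on the fetched HTML before storing or showing any of it.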

    I hope that helps you - would be interested to hear more about what you're doing!

    daniel

  6. Matthew O'Riordan Staff

    Founder / Director / Co-founder at easyBacklog / Aqueduct / Econsultancy

    08 September 2008 14:57pm

    Thanks Daniel.

    I think using snippets and linking back to the primary site is probably not a bad approach...

    Thanks for the advice, and if I hear anything further about the issue I'll be sure to post it in here.

    Matt

  7. Colin Watson

    Director at Watson Hall Ltd

    08 September 2008 16:09pm

    Matthew

    You should look at the terms and conditions on the sites you intend to target.  You will also have to consider the liability for things like defamation, encouragement or glorification of terrorism, copyright theft and other intellectual property infringements, etc if you start publishing other people's content.  Take the advice of some good new media lawyers soon.

    Also make sure that you don't start being used to circulate malware via infected content.

    Some site owners are very aggressive with those they consider are trying to take their content.  For example:

    Others use firewalling techniques to block scrapers, or to log what's going on for future court action.

    Colin Watson
    Technical Director
    Watson Hall Ltd for website security

  8. Prasad Gollapalli

    President at Salebug.com

    25 August 2009 03:26am

    Hi Matt and others,

    Thank you for all your expert tips on the legal issues of website scraping.

    I have one specific question:

    I'm working on a business idea based on aggregating content from industry specific websites.

    When I look at the terms of service on these websites, is there anything specific I should be looking for? Some of the terms and conditions say that I could use the content for non-commercial purposes. I'm not selling the content to my potential customers but will be making ad revenue (hopefully).

    See the example below from one of the clients...

    "You are hereby granted a non-exclusive, non-transferable, limited license to view this Site, and to download and/or print insignificant portions of materials retrieved from this Site provided (a) it is used only for informational, non-commercial purposes, and (b) you do not remove or obscure the copyright notice or other notices. Except as expressly provided above, no part of this Site, including but not limited to materials retrieved there from and the underlying code, may be reproduced, republished, copied, transmitted, or distributed in any form or by any means, without the express written permission of xxxx"

    Would I need explicit permission from this website before I scrape its publicly available content? 

    Thank you, gurus. I truly appreciate your time and advice.

    Best regards,

    Prasad
