How to make website scraping easy

September is here again and the kids are back to school.

We thought we’d also go ‘back to basics’ and explain how retailers can simplify their data extraction process.

Web scraping is a way of extracting data from websites. Rich data extraction captures the most comprehensive product information available on the retailer’s ecommerce site.

This ensures that the data remains accurate and up-to-date and leaves less room for error.

By Robert Durkin September 9th 2013 14:12

Why is web scraping important?

If retailers want to increase their product visibility and display their product inventory across the various channels, data extraction is essential. There are several ways of extracting site data, but one of the most common is screen scraping.

Screen scraping is carried out by a crawler that is sent onto an ecommerce site to capture specific data. This extracted data is then put together to create a product data feed.

Why are websites scraped?

Scraping makes data extraction much easier for retailers. Most of them have complex content management systems, so their website is usually the only place where all of their product information comes together.

How can retailers improve their site so it’s easy to scrape?

Use IDs and classes within tags

If a website uses IDs and classes within its page tags, it’s much easier to produce XPaths, which are used to navigate through HTML. (XPath is a query language for selecting nodes in a document.)
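As a rough illustration of why IDs and classes help, here is a minimal Python sketch using the standard library's ElementTree module, which supports a limited subset of XPath. The page fragment, the id "price" and the class "product-name" are all made up for the example; a real crawler would likely use a fuller XPath engine and a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page fragment; the id and class names are illustrative only.
PAGE = """
<html>
  <body>
    <div id="product">
      <h1 class="product-name">Blue Kettle</h1>
      <span id="price">24.99</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Because the elements carry an id or class, each path is short and
# does not depend on where the element sits in the page layout.
price = root.find(".//span[@id='price']").text
name = root.find(".//h1[@class='product-name']").text

print(name, price)
```

Both paths would keep working if the retailer moved the price elsewhere on the page, as long as the id and class stayed the same.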

Don’t use tables for structuring

If a site’s structure includes a lot of tables, it becomes more difficult to scrape. This is because there are unlikely to be IDs and classes within the table’s data.

In addition, when tables are used, the XPaths tend to rely on the position of rows and cells, so they become much longer and are more likely to break when the layout changes.
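The fragility of table-based paths can be sketched in a few lines of Python. The two page fragments below are invented for illustration, and the standard library's ElementTree stands in for whatever XPath engine a real crawler would use. The path picks out "the second row of the first table", which is the only handle available when there are no IDs or classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical table-based layout with no ids or classes.
ORIGINAL = """
<html><body>
  <table>
    <tr><td>Blue Kettle</td></tr>
    <tr><td>24.99</td></tr>
  </table>
</body></html>
"""

# The same page after a promotional row is added at the top of the table.
CHANGED = """
<html><body>
  <table>
    <tr><td>Free delivery this week</td></tr>
    <tr><td>Blue Kettle</td></tr>
    <tr><td>24.99</td></tr>
  </table>
</body></html>
"""

# Positional path: first table, second row, first cell.
PRICE_PATH = "./body/table[1]/tr[2]/td[1]"


def extract_price(page):
    """Return the text found at the positional price path."""
    return ET.fromstring(page).find(PRICE_PATH).text


before = extract_price(ORIGINAL)  # the price, as intended
after = extract_price(CHANGED)    # now the product name, not the price

print(before, "|", after)
```

The path still matches something after the change, so the crawler would not even report an error; it would simply record the product name as the price.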

Don’t use unnecessary AJAX

AJAX (asynchronous JavaScript and XML) content tends to load independently of the page’s initial HTML, meaning that it can be missed in the scraping process.

Although the initial HTML does load, some content may only appear afterwards, once the page’s scripts have run. Though a crawler can be set to wait for AJAX content to load before scraping, any AJAX can still dramatically increase the scraping time.
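To show why waiting adds time, here is a small Python sketch of a polling loop. It is not how any particular crawler works: wait_for_content and fake_fetch are hypothetical names, and fake_fetch simply pretends that the price only appears on the third check, standing in for content that arrives via AJAX.

```python
import time


def wait_for_content(fetch, timeout=5.0, interval=0.1):
    """Call fetch() repeatedly until it returns a value or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            return None  # gave up: the content never appeared
        time.sleep(interval)


# Stand-in for a page whose price is filled in late by a script.
calls = {"count": 0}


def fake_fetch():
    calls["count"] += 1
    return "24.99" if calls["count"] >= 3 else None


result = wait_for_content(fake_fetch, timeout=2.0, interval=0.01)
print(result, "after", calls["count"], "checks")
```

Every extra check is time spent on a single page, and that cost is multiplied across every product page on the site.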

Avoid using sessions

Unnecessary sessions make it difficult to deep link products and can also make the website difficult to scrape.

This is most common on travel websites, where search URLs sometimes include a session and therefore time out or expire after a period of time.
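The difference between a session-bound URL and a stable deep link can be sketched with Python's standard urllib.parse module. The parameter names "sessionid" and "sid" and the example URL are assumptions made for illustration; stripping them only yields a working deep link if the site serves the same page without them.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names for session parameters; real sites vary.
SESSION_PARAMS = {"sessionid", "sid"}


def strip_session(url):
    """Return the URL with any session parameters removed from the query string."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


url = "https://shop.example.com/product?id=123&sessionid=abc123"
clean = strip_session(url)
print(clean)
```

A site that never puts the session in the URL in the first place gives every product a link like the cleaned one, which can be stored in a feed and followed again later.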

Be consistent

Crawlers are programmed to recognise each type of webpage based on its structure; if site pages are inconsistent, the crawler will return invalid results.

For example, suppose the crawler expects to find the product price under a particular HTML tag, class or ID. If the client introduces a new product page where the price sits under an unfamiliar tag, class or ID, the product is likely to be overlooked.
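Here is a minimal Python sketch of how an inconsistent page gets overlooked. The two page fragments and the class names "price" and "cost" are invented for the example, and ElementTree from the standard library stands in for a real crawler's XPath engine.

```python
import xml.etree.ElementTree as ET

# The crawler has been told that prices live in a span with class "price".
PRICE_PATH = ".//span[@class='price']"

# Hypothetical pages: the second uses a different class name for the price.
PAGES = {
    "kettle": "<div><h1>Blue Kettle</h1><span class='price'>24.99</span></div>",
    "toaster": "<div><h1>Red Toaster</h1><span class='cost'>19.99</span></div>",
}


def scrape(pages):
    """Return the prices that were found and the pages where none was found."""
    found, missed = {}, []
    for name, markup in pages.items():
        node = ET.fromstring(markup).find(PRICE_PATH)
        if node is None:
            missed.append(name)
        else:
            found[name] = node.text
    return found, missed


found, missed = scrape(PAGES)
print("found:", found)
print("missed:", missed)
```

The toaster page has a perfectly good price on it, but because it is not where the crawler was told to look, the product drops out of the results.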

Make sure your website is compliant

All websites should comply with W3C standards, which set out the rules that developers need to adhere to. It’s also best to have well-formed HTML so that XPaths can be created easily.

For example, if an HTML tag is not closed properly, it can alter the structure of the page and break the XPaths that depend on it.
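The effect of an unclosed tag can be seen with a short Python check. ElementTree in the standard library is a strict XML parser, so it rejects the broken fragment outright. Crawlers more often use lenient HTML parsers that try to repair the markup, but the repaired structure may not be the one the page author intended. The fragments and the helper name is_well_formed are illustrative only.

```python
import xml.etree.ElementTree as ET

WELL_FORMED = "<div><p>Blue Kettle</p><span id='price'>24.99</span></div>"
# Same fragment, but the <p> tag is never closed.
BROKEN = "<div><p>Blue Kettle<span id='price'>24.99</span></div>"


def is_well_formed(markup):
    """Return True if a strict parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(WELL_FORMED))
print(is_well_formed(BROKEN))
```

A check along these lines, or a run through the W3C validator, can catch unclosed tags before a crawler ever sees the page.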

Keep your website accessible

Even in its simplest form, your website should be compatible with each of the various internet browsers.

So, even if a user has content blockers switched on, the website should still load. This also makes it much easier to scrape the product data.