How to crawl a website – Mind Your Code

The word “crawling” has become synonymous with any way of getting data from the web programmatically. But true crawling is actually a very specific method of finding URLs, and the term has become somewhat confusing.

Before we go into too much detail, let me just say that this post assumes that the reason you want to crawl a website is to get data from it, and that you are not technical enough to code your own crawler from scratch (or you’re looking for a better way). If one (or both) of those things is true, then read on, friend!

In order to get data from a website programmatically, you need a program that can take a URL as an input, read through the underlying code and extract the data into either a spreadsheet, JSON feed or other structured data format you can use. These programs – which can be written in almost any language – are generally referred to as web scrapers, but we prefer to call them Extractors (it just sounds friendlier).
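To make that a bit more concrete, here is a minimal sketch of an Extractor in Python, using only the standard library. Everything specific in it is an assumption for illustration: the sample HTML, the example.com URL, and the choice to pull out the page title and the h2 headings. A real Extractor would be built around whatever data the page you care about actually holds.

```python
import json
from html.parser import HTMLParser


class TitleAndHeadingsParser(HTMLParser):
    """Collects the page <title> and the text of every <h2> heading."""

    def __init__(self):
        super().__init__()
        self._current = None  # the tag we are currently inside, if we care about it
        self.title = ""
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h2"):
            self._current = tag
            if tag == "h2":
                self.headings.append("")  # start a new heading

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h2":
            self.headings[-1] += data


def extract(url, html):
    """Turn one page's HTML into a structured record (a plain dict)."""
    parser = TitleAndHeadingsParser()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "headings": [h.strip() for h in parser.headings],
    }


# Hypothetical page content, hard-coded so the sketch runs without a network.
SAMPLE_HTML = """
<html><head><title>Example shop</title></head>
<body><h2>Red shoes</h2><h2>Blue hat</h2></body></html>
"""

record = extract("https://example.com/products", SAMPLE_HTML)
print(json.dumps(record, indent=2))  # the "JSON feed" version of the page
```

The same record could just as easily be written out as a spreadsheet row; the point is only that a URL and its underlying code go in, and structured data comes out.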

A crawler, on the other hand, is one way of generating a list of URLs you then feed through your Extractor. But crawlers are not always the best way to build that list.

How a crawler works

Crawlers are URL discovery tools. You give them a webpage to start from and they will follow all the links they can find on that page. If the links they follow lead them to a page they haven’t been to before, they will follow all the links on that page as well. And so on, and so on, in a loop.
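That loop can be sketched in a few lines of Python. This is a simplified illustration rather than a production crawler: the `crawl` function, the `fetch` parameter and the three-page `FAKE_SITE` are all made up for this example, and the "website" lives in a dictionary so the code runs without touching the network. A real crawler would also need politeness delays, robots.txt handling and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Discover URLs breadth-first, staying on start_url's domain.

    `fetch` is any callable that takes a URL and returns the page's HTML
    as a string, or None if the page could not be loaded.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}          # every URL we have already queued
    queue = deque([start_url])  # URLs still waiting to be visited
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)

        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site, held in memory instead of fetched over HTTP.
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "https://example.com/b": '<a href="https://other.example/x">elsewhere</a>',
}

urls = crawl("https://example.com/", FAKE_SITE.get)
print(urls)
```

The `seen` set is what stops the loop from running forever: a page is only queued the first time a link to it turns up.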

The hope is that if you repeat this process enough, you will eventually wind up with a list of all the possible URLs, usually restricted to a given domain (e.g. asos.com).

The good thing about crawlers is they try to visit every page on a website, so they are very complete. The bad thing about crawlers is that they try to visit every page on a website, so they take a long time.

Crawling is very slow, and it puts a pretty heavy load on the site you are crawling. Not to mention, crawlers produce a static list of URLs, meaning that if you want new information you have to recrawl the entire website.

The final problem with crawlers is that a lot of the URLs they find won’t have data you want. Say you’re trying to build a catalogue of all the products on Asos. Asos has a lot of pages that don’t have products on them, but the crawler will visit them anyway.

When it comes to passing your crawled list of URLs through an Extractor, you can use URL patterning to try to weed out some of these unwanted pages (more on that later). You can also try to add some logic to your crawler to help guide it away from pages you know don’t have data.
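As a rough idea of what URL patterning looks like in practice, here is a small Python sketch that keeps only the URLs matching a pattern. The shop.example.com domain and the /product/name-id URL shape are assumptions invented for this example; you would work out the real pattern by looking at a handful of pages on the site you are interested in.

```python
import re

# Assumed pattern: product pages look like /product/<name>-<numeric id>.
PRODUCT_URL = re.compile(r"https://shop\.example\.com/product/[\w-]+-\d+")


def keep_product_urls(urls):
    """Return only the URLs that look like product pages."""
    return [u for u in urls if PRODUCT_URL.fullmatch(u)]


# Hypothetical output from a crawl: a mix of wanted and unwanted pages.
crawled = [
    "https://shop.example.com/",
    "https://shop.example.com/product/red-shoes-1042",
    "https://shop.example.com/help/returns",
    "https://shop.example.com/product/blue-hat-77",
    "https://shop.example.com/product/",
]

product_urls = keep_product_urls(crawled)
print(product_urls)
```

The same check could be moved inside the crawler itself, so that it never queues pages that don't match, which is one simple way of guiding it away from pages you know don't have data.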

But crawling isn’t the only way to get a list of URLs. If you’re at all familiar with the site you want data from, there are much easier (and faster) ways of getting to your data.

Crawlers vs Extractors – finding the right tool for the job

Extractors have two benefits over crawlers. The first is that they are super targeted, which makes them much, much faster. The second is that they are refreshable: no recrawling required, just run the program again.

When you build your Extractor (remember that’s the thing that gets the data from your list of URLs), you build it to a specific webpage. You’re essentially teaching the program what the data looks like on a page at a given URL.

In addition to providing you with a program you can pass other URLs through, an Extractor also provides you with a spreadsheet of all the data on that page.

So if you were to build an Extractor to a page with lots of links on it, you would wind up with a spreadsheet with a list of links that you could then feed into another Extractor to get the specific data you’re after.
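Here is a small Python sketch of that chaining idea: one Extractor pulls the links out of a listing page, and a second one pulls data out of each linked page. All the names and pages here are hypothetical (`extract_links`, `extract_product`, the shop.example.com URLs and the in-memory `PAGES` dictionary), and the "product name" is simply assumed to be the first h1 on the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _Collector(HTMLParser):
    """Records every link target and the text of the first <h1>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.h1 = None
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1" and self.h1 is None:
            self._in_h1 = True
            self.h1 = ""

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.h1 += data


def _parse(html):
    collector = _Collector()
    collector.feed(html)
    return collector


def extract_links(url, html):
    """First Extractor: one absolute URL per link on a listing page."""
    return [urljoin(url, href) for href in _parse(html).links]


def extract_product(url, html):
    """Second Extractor: one row of data per product page."""
    return {"url": url, "name": (_parse(html).h1 or "").strip()}


# Hypothetical pages, kept in memory so the sketch runs offline.
PAGES = {
    "https://shop.example.com/shoes": '<a href="/p/1">One</a><a href="/p/2">Two</a>',
    "https://shop.example.com/p/1": "<h1>Red shoes</h1>",
    "https://shop.example.com/p/2": "<h1>Blue boots</h1>",
}

listing = "https://shop.example.com/shoes"
product_urls = extract_links(listing, PAGES[listing])            # list of links
rows = [extract_product(u, PAGES[u]) for u in product_urls]      # the data itself
print(rows)
```

Only three pages get read here: the listing and the two products it links to. That is the "super targeted" part, and rerunning the last three lines is all it takes to refresh the data.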

That all sounds rather theoretical, so let’s look at some examples:

Data on a single page

Obviously, if your data sits on just one page, crawling the whole site is pretty pointless. It will be much simpler to build a single Extractor to that single page.

Data on multiple consecutive pages

This is often referred to as pagination – ie. the data you want is in a single list that is spread out across multiple pages. In this case, you need to look at the URL for a few pages to see if you can detect a pattern. For example:

In this case you can see that each URL ends in page=X, where X is the page number.

If the pattern is replicable, you should:

1. Build an Extractor to the first page
2. Generate a list of URLs using the pattern in Excel
3. Run your list of URLs through your Extractor

You should end up with a big spreadsheet of all the pages turned into data.
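If you would rather not use Excel for the URL list, the same thing can be done with a few lines of Python. This is just a sketch: the base URL and the page range are made-up values, and it assumes the page number is the only query parameter in the URL. The result is written out as CSV so it can be opened as a spreadsheet.

```python
import csv
import io


def paginated_urls(base_url, first_page, last_page):
    """Build one URL per page by filling in the page=X pattern."""
    return [f"{base_url}?page={n}" for n in range(first_page, last_page + 1)]


# Hypothetical listing URL and page range.
urls = paginated_urls("https://shop.example.com/shoes", 1, 5)

# Write the list as a one-column CSV (in memory here; a real file works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["url"])
writer.writerows([u] for u in urls)
csv_text = buffer.getvalue()

print(csv_text)
```

Each URL in the list can then be run through the Extractor you built for the first page.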

Note: This method also works if you already have a list of URLs in a spreadsheet somewhere.
