Lighter Web Scraping Using NodeJS

An alternative way to do web scraping with NodeJS

If you search for web scraping using NodeJS, Puppeteer examples and articles will probably come up. It is an awesome library to use for complex web scraping because you are actually automating a browser when using Puppeteer. With that said, I think it's overkill for simpler web scraping. So in this article, we'll look at how we can scrape data from the web without using Puppeteer.


Getting Started

To do this we need to solve two problems. The first is how to get the website's HTML code. Once that's solved, the second is how to extract the actual data we need from that HTML.

Let’s start coding! First, scaffold a new Node project by running

yarn init -y

Now that we have a project ready to use, let’s install some dependencies

yarn add axios cheerio

Axios

You might be familiar with this package because it's quite a popular choice for making HTTP requests. Nowadays we usually use it to interact with APIs and get the result as JSON, but when the response is a regular web page, axios simply hands us the raw HTML as a string.

Cheerio

Taken from their NPM package description, it's a "Fast, flexible & lean implementation of core jQuery designed specifically for the server." I think that explains it really well. Basically, with this package, we can run jQuery commands on the server.


Building The Scraper

We'll be using the https://books.toscrape.com/ website to test our scraper. First off, create a file called index.js in your project folder root. We'll use this file to build our scraper.

From the list of books on the website we'll grab a few pieces of data:

  • Title
  • Price
  • Cover Image
  • Rating
  • Availability
  • URL

Let's get coding!

First, we import both axios and cheerio, and then we create an async function called scrape.

Now let's grab the HTML code from the website using axios and load it into cheerio so we can query the data. We can do that like this:

After inspecting the website, we can see what the markup for a book listing looks like. This will help us write our selectors.

HTML Structure for the Book Item

With that information, let's grab the book elements first. We can do that with cheerio like this:

Alright, we got the books. Now it's time to grab the simple data first: the values we can read directly from each element.

After that's done, we can grab the data that's a bit more complicated: the rating, availability, and URL.

First off, for the rating we can grab the p element and read its class, because the class name encodes the rating as a word (e.g. Three). Next up, for availability we can check whether there is an element with both the .instock and .availability classes. We query for both classes together to make sure the .instock class really belongs to the availability element, and that the availability element carries .instock to show the book is in stock.

All done! This is what the complete code looks like:


Conclusion

I think this is the simplest way to do web scraping, and there are some pros and cons to doing it this way.

Pros

  • Simpler to build
  • Fewer resources needed (a library like Puppeteer needs to download Chromium to run)
  • Smaller package size

Cons

  • Cannot scrape a website where navigation is needed (sign in, scroll, etc.)
  • Cannot take a screenshot of the page

In the end, it depends on what website you want to scrape and what data you want to get. If you need something from a complex website, then yes, use something like Puppeteer! It has a powerful API and lets you interact with complex pages. But if you need something simple, then axios and cheerio might be the better choice.


Resources

Here are some resources for all the things that I've mentioned in this tutorial

Christian Dimas