Part One: Web Scraping in Node.js, Step by Step

Charlie K
3 min read · Oct 2, 2020


Web Scraping your favorite marketplace using puppeteer


Introduction

What is web scraping and how can it be of use?

The internet is a huge store of information, be it text, media or images. Unfortunately, access to this data may not be readily available. Web scraping comes in handy for extracting information from websites, mostly for data mining.

Other reasons include:

· An alternative to API calls for data retrieval. In case an API does not exist, or you don’t know how to use it, web scraping will work.

· A way of enriching your current data with up-to-date data from a website

Project setup

Before we begin, ensure you have Node and npm installed on your machine.

We will be using the following package:

· Puppeteer

npm i puppeteer

Let’s scrape!

Inspecting the page

The first step in scraping is to select the website we wish to scrape. In our case, it will be the e-commerce site Jumia.

To inspect a website, right-click anywhere on the page and choose ‘Inspect’, or in Chrome use the shortcut ‘Ctrl + Shift + I’. To view a specific element on the page, go to the ‘Elements’ tab. As you move the cursor over elements in the tab, they will be highlighted on the page, indicating their position and attributes.

Figure 1: Website to be scraped
Figure 2: Page elements

In our case, we need to extract the item image, price and URL. Taking a closer look at the elements, we notice that each product is wrapped in an <article> tag inside a <div>.

Figure 3: Element attributes

As we can see, <article class='prd _fb col '> is the tag that contains all the attributes we need from each product. Now let’s dive right into the code and do some scraping!

Connecting to the website and launching Puppeteer:

const puppeteer = require('puppeteer');

// Wrap the calls in an async function so we can use await.
(async () => {
  const searchTerm = 'shoes'; // whatever you want to search for
  const browser = await puppeteer.launch({
    headless: false,       // show the browser window while scraping
    defaultViewport: null
  });
  const page = await browser.newPage();
  await page.goto(`https://www.jumia.co.ke/catalog/?q=${encodeURI(searchTerm)}`);
})();

Extracting the required elements

We use Puppeteer’s page.evaluate() and page.waitForSelector() methods to fetch the data. The methods are asynchronous, so we await the Promise each returns before continuing with the execution.
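A sketch of this extraction step is below. The product-card selector comes from Figure 3; the inner selectors (the <img> tag, the price class '.prc', and the anchor tag) are assumptions that may need updating if Jumia changes its markup. I also use encodeURIComponent rather than encodeURI, since it is the safer choice for a single query-string value:

```javascript
// Pure helper: build the Jumia catalog search URL for a given term.
const buildSearchUrl = (searchTerm) =>
  `https://www.jumia.co.ke/catalog/?q=${encodeURIComponent(searchTerm)}`;

// Extract image, price and URL from every product card on the page.
async function extractProducts(page) {
  // Wait until at least one product card has rendered before reading the DOM.
  await page.waitForSelector('article.prd._fb.col');

  // page.evaluate() runs this callback inside the browser context,
  // so it can use the regular DOM API.
  return page.evaluate(() => {
    const cards = Array.from(document.querySelectorAll('article.prd._fb.col'));
    return cards.map((card) => ({
      image: card.querySelector('img')?.src ?? null,
      price: card.querySelector('.prc')?.textContent.trim() ?? null, // assumed price class
      url: card.querySelector('a')?.href ?? null,
    }));
  });
}
```

Anything the callback returns from page.evaluate() is serialized and handed back to Node, which is why we return a plain array of objects rather than DOM nodes.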

Saving the data

In our example, the data is stored in an array and logged to the console. In subsequent parts of this series, I plan to save the data in a database and use Nodemailer to send notifications.

The entire code:
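A minimal end-to-end sketch combining the steps above (the product selectors are assumptions based on Figure 3 and may need updating if Jumia changes its markup):

```javascript
// Search Jumia for a term and return each product's image, price and URL.
async function scrape(searchTerm) {
  // Required inside the function so the rest of the file loads without Puppeteer.
  const puppeteer = require('puppeteer');

  const browser = await puppeteer.launch({
    headless: false,       // show the browser window while scraping
    defaultViewport: null
  });
  const page = await browser.newPage();
  await page.goto(`https://www.jumia.co.ke/catalog/?q=${encodeURIComponent(searchTerm)}`);

  // Wait for product cards, then read them inside the browser context.
  await page.waitForSelector('article.prd._fb.col');
  const products = await page.evaluate(() =>
    Array.from(document.querySelectorAll('article.prd._fb.col')).map((card) => ({
      image: card.querySelector('img')?.src ?? null,
      price: card.querySelector('.prc')?.textContent.trim() ?? null, // assumed price class
      url: card.querySelector('a')?.href ?? null,
    }))
  );

  await browser.close();
  return products;
}

// Usage (uncomment to run):
// scrape('shoes').then((products) => console.log(products));
```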

Output

Figure 4: Output to console

Conclusion

Web scraping is fundamental to data mining, automating the collection of data for research.

Puppeteer as a library is very useful for data mining projects where fetching dynamic data is necessary. Luckily, headless browsers are becoming more and more accessible to handle all of our automation needs, thanks to projects like Puppeteer and the awesome teams behind them!

With this basic introduction on how to use Puppeteer to scrape websites, you can now experiment with it on different websites.

Remember to stay tuned for Part Two.

Happy scraping😎

Written by Charlie K

Learn, innovate, repeat. Writer, web developer, tech & nature enthusiast. My posts are abstract😜
