Web Scraping Your Favorite Marketplace Using Puppeteer
Introduction
What is web scraping and how can it be of use?
The internet is a huge store of information, be it text, media or images. Unfortunately, access to this data may not always be readily available. Web scraping comes in handy for extracting information from websites, mostly for data mining.
Other reasons include:
· An alternative to API calls for data retrieval. If an API does not exist, or you don’t know how to use it, web scraping will work.
· A great way to enrich your current data with up-to-date information from a website.
Project setup
Before we begin, ensure you have Node.js and npm installed on your machine.
We will be using the following package:
· Puppeteer
npm i puppeteer
Let’s scrape!
Inspecting the page
The first step in scraping is to select the website we wish to scrape. In our case it will be the e-commerce site Jumia.
To inspect a website, right-click anywhere on the page and choose ‘Inspect’, or in Chrome use the shortcut ‘Ctrl + Shift + I’. To view a specific element on the page, go to the ‘Elements’ tab. As you move the cursor over the page, the corresponding elements are highlighted in the Elements tab, showing their position and attributes.
In our case, we need to extract each item’s image, price and URL. Taking a closer look at the markup, we notice that every product on the page is wrapped in an <article>
tag.
As we can see, <article class='prd _fb col '>
is the tag that contains all the attributes that we need from the product. Now let’s dive right into the code and do some scraping!
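To make this concrete, the product markup looks roughly like the sketch below. The inner class names (core, img, prc) and the data-src attribute are assumptions based on how the site rendered at the time of writing; inspect the live page to confirm them:

```html
<article class="prd _fb col ">
  <a class="core" href="/some-product-url/">
    <!-- images are often lazy-loaded, so the URL may live in data-src -->
    <img class="img" data-src="https://example.com/product.jpg" />
    <div class="prc">KSh 1,234</div>
  </a>
</article>
```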
Connecting to the website and launching Puppeteer:
const puppeteer = require('puppeteer');

const searchTerm = 'shoes'; // whatever you want to search for

(async () => {
  // headless: false opens a visible browser window, handy for debugging
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });
  const page = await browser.newPage();
  await page.goto(`https://www.jumia.co.ke/catalog/?q=${encodeURI(searchTerm)}`);
})();
Extracting the required elements
We use Puppeteer’s page.waitForSelector()
and page.evaluate()
methods to fetch the data. Both methods are asynchronous, so we await the Promise each one returns before continuing with the execution.
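As a sketch, the extraction step could look like this. The selectors (article.prd, .prc) and the data-src attribute are assumptions based on the markup we inspected above, and the parsePrice helper is a hypothetical add-on for turning the price text into a number:

```javascript
// Helper: turn a price string like "KSh 1,234" into a number.
// (The "KSh" currency prefix is an assumption about the site's locale.)
function parsePrice(text) {
  const digits = text.replace(/[^0-9.]/g, '');
  return digits ? Number(digits) : null;
}

// Extract image, price and URL for every product card on the page.
async function scrapeProducts(page) {
  // Wait until at least one product card exists in the DOM
  // ('article.prd' is the assumed product selector).
  await page.waitForSelector('article.prd');

  // page.evaluate runs this function inside the browser context
  // and returns plain, serializable data back to Node.
  return page.evaluate(() => {
    return Array.from(document.querySelectorAll('article.prd')).map((item) => {
      const link = item.querySelector('a');
      const img = item.querySelector('img');
      const price = item.querySelector('.prc');
      return {
        url: link ? link.href : null,
        image: img ? img.getAttribute('data-src') : null,
        price: price ? price.textContent.trim() : null,
      };
    });
  });
}
```

Note that code inside page.evaluate() cannot see variables or functions from your Node script (such as parsePrice) unless you pass them in explicitly, which is why the callback only uses browser globals like document.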
Saving the data
In our example, the data is stored in an array and logged to the console. In a subsequent part of this series, I plan to save the data to a database and use Nodemailer to send notifications.
The entire code:
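In case the embedded code does not render for you, here is a minimal end-to-end sketch putting the pieces together. The 'shoes' search term and the CSS selectors are assumptions; adjust them to whatever the live site actually uses:

```javascript
// Build the Jumia search URL for a given term.
function buildSearchUrl(searchTerm) {
  return `https://www.jumia.co.ke/catalog/?q=${encodeURI(searchTerm)}`;
}

// Launch the browser, scrape the search results, and log them.
async function run(searchTerm) {
  // Lazy require so the file can be loaded without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });
  const page = await browser.newPage();
  await page.goto(buildSearchUrl(searchTerm));

  // Wait for the (assumed) product cards to appear.
  await page.waitForSelector('article.prd');

  // Collect image, price and URL from each product card.
  const products = await page.evaluate(() =>
    Array.from(document.querySelectorAll('article.prd')).map((item) => ({
      url: item.querySelector('a') ? item.querySelector('a').href : null,
      image: item.querySelector('img')
        ? item.querySelector('img').getAttribute('data-src')
        : null,
      price: item.querySelector('.prc')
        ? item.querySelector('.prc').textContent.trim()
        : null,
    }))
  );

  console.log(products);
  await browser.close();
}

// To execute: run('shoes').catch(console.error);
```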
Output
Conclusion
Web scraping is a fundamental automation technique for data mining and research.
Puppeteer is a very useful library for data mining projects where dynamically rendered data must be fetched. Luckily, headless browsers are becoming more and more accessible to handle all of our automation needs, thanks to projects like Puppeteer and the awesome teams behind them!
With this basic introduction to scraping websites with Puppeteer, you can now experiment with it on different websites.
Remember to stay tuned for Part Two!
Happy scraping😎