Web scraping is a popular technique for extracting data from websites. This is often done to derive insights for sentiment analysis, predicting user preferences, cross-selling products, and similar use cases.
Some real-life examples of web scraping include extracting data for pricing analysis, collecting user ratings for movie sentiment analysis, corporate admin tasks that read and classify HTML log files, and search bots trying to make sense of a results page. While web scraping does not provide intelligence of its own, the extracted data can, as we have seen above, be useful in many ways. A common use case is a start-up eCommerce website setting prices on its products based on market research on competitors.
While some websites provide home-grown APIs to fetch data, or offer options to download search results, most customer-facing websites are in the public domain and do not provide such features. Manually downloading the complete page might be possible, but if one is looking for a specific set of data, many times, across many websites, the task can get daunting. This is where a web scraping tool comes in handy.
There are various open-source tools available to perform this activity. Python has proven to be a good partner to most developers because of its ease of use and its arsenal of libraries and drivers for the job. So, let us discuss one of the popular tools, Selenium, for web scraping.
How does Selenium help?
Selenium is an automation tool that drives web browsers. Primarily used by QA teams for test automation, it has also proven to be a powerful tool for data-mining activities. Below, we will see how we can use this tool to collect some data from a popular e-commerce site.
To start, let us install Selenium and download the driver for our Python notebook using the commands below:
1. Open a terminal and type: $ pip install selenium
2. Install the latest version of ChromeDriver from https://chromedriver.storage.googleapis.com/index.html?path=2.42/
We will first import the required packages in our notebook.
Next, we need to create an instance of Chrome; this will enable the driver to open our URL in the browser.
Now, let us try accessing an Amazon product page using the Chrome driver.
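Navigating to the page is a single call on the driver. The URL below is a placeholder, not a real product link; substitute the product page you want to scrape:

```python
# Hypothetical product URL -- replace with a real product page.
PRODUCT_URL = "https://www.amazon.com/dp/EXAMPLE"


def open_product_page(driver, url=PRODUCT_URL):
    """Point an existing Chrome WebDriver at the page we want to scrape."""
    driver.get(url)
    return driver.title  # handy sanity check that the page actually loaded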
If you scroll down to the reviews section of the product, this is what the page looks like.
Now, let us choose the features we want to download. For example, we shall try to extract the reviewer's name, the rating, and the review text from all the customer reviews. Let us now look at the Console view of the browser: go to Browser Options -> Developer Tools to get access to the XPath.
How does an XPath play a role?
XPath (XML Path Language) provides a structure for developers or users to traverse the hierarchy of complex HTML or XML documents. As shown in the image above, to get the "profile-name" of a user one walks down the XPath from the page, to the frame, to the individual review, to the profile-name tag that holds the data. This structure is usually common across products on a website but varies across websites. So, once coded for a particular web page, it can be reused for other products on the same website.
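The idea of walking down a hierarchy can be illustrated with Python's standard library on a simplified, made-up fragment of review markup (the tag and class names here are illustrative, not Amazon's actual markup):

```python
import xml.etree.ElementTree as ET

# A simplified, made-up fragment resembling one review block.
snippet = """
<div>
  <div class="review">
    <span class="profile-name">Alice</span>
    <span class="review-text">Great product!</span>
  </div>
</div>
"""

root = ET.fromstring(snippet)
# Walk down the hierarchy: outer div -> review div -> profile-name span.
name = root.find("./div/span[@class='profile-name']").text
print(name)  # Alice
```

The same path-style addressing is what Selenium uses to locate elements on a live page.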
Now, back to the web scraping part: once we have identified the correct tags, denoted by "<" and ">", it is time to iterate and extract the data using the usual Python loops.
Selenium provides functions to locate elements on the page by their XPaths.
For example, the code below extracts the reviewer's name along with the review description and rating.
The output of this loop is then loaded into a Pandas data frame, which gives us the raw data in a simple, consumable form.
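For example, with made-up sample rows standing in for the loop's output, the data frame can be built like this:

```python
import pandas as pd

# Illustrative output of the extraction loop (made-up sample rows).
rows = [
    ("Alice", "5.0 out of 5 stars", "Great product!"),
    ("Bob", "3.0 out of 5 stars", "Average quality."),
]

df = pd.DataFrame(rows, columns=["reviewer", "rating", "review_text"])
print(df.head())
```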
Once the data is in this format, it is easy to analyze and derive further insights through reporting tools.
Selenium reads data based on the XPaths of the particular web page(s). Every website follows a particular set of XSD/HTML models for its pages, which makes it easy for Selenium to read data, submit forms, switch pages back and forth, and click buttons.
Ensure you handle the exception cases when developing or looping through the reviews: not all customers are alike, and there may be missing data or special characters!
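One illustrative way to guard against missing fields and stray characters is a small cleaning helper applied to each scraped value (the helper name and behavior here are an example, not part of Selenium):

```python
import unicodedata


def clean_text(value):
    """Normalize a scraped field: tolerate missing values and tidy up
    odd whitespace and special characters. Illustrative helper."""
    if value is None:
        return ""
    # NFKC folds characters like non-breaking spaces into plain ones.
    value = unicodedata.normalize("NFKC", value)
    # Collapse runs of whitespace (newlines, tabs) into single spaces.
    return " ".join(value.split())


print(clean_text(None))                      # ""
print(clean_text("Great\u00a0product!\n"))   # "Great product!"
```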
Note that some websites do not allow you to scrape in this fashion, or will block your IP temporarily if you scrape with too many requests.
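A simple precaution is to space out page loads with a randomized pause between requests; the delay bounds below are arbitrary illustrative values:

```python
import random
import time


def polite_get(driver, url, min_delay=2.0, max_delay=5.0):
    """Fetch a URL, then pause for a random interval so that requests
    are spaced out rather than fired in a tight loop."""
    driver.get(url)
    time.sleep(random.uniform(min_delay, max_delay))
```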
While the Selenium driver helps in scraping data safely and easily, care needs to be taken while developing the Selenium logic (especially in loops), since it runs directly on live web pages. Happy Scraping!