Data scraping is the process of extracting data from a website using scripts or automation tools. In this demo, we scrape reviews and information about doctors from various medical websites using Scrapy and Selenium.
Purpose of Web Scraping:
The list of things you can do with web scraping is almost endless. Some of the most common use cases are listed below.
There are people actively looking for jobs, and there are companies looking to hire suitable candidates. The problem is that there are many job boards, each with a large number of listings. With web scraping, you can collect the job links and titles in a single place where job seekers can find all the details.
Reviews help businesses understand their customers better. This understanding lets them improve their services.
In today’s highly competitive market, it’s a top priority to protect your online reputation. Whether you sell your products online and have a strict pricing policy that you need to enforce or just want to know how people perceive your products online, brand monitoring with web scraping can give you this kind of information.
Price monitoring is a very common yet useful technique that we can use to automate the process of checking prices on various websites.
Demo Video of Physician Reviews scraping using Scrapy and Selenium
This is a demo of a data acquisition pipeline that searches directories and extracts physician reviews, ratings, review dates, reviewer information, and physician details using Scrapy and Selenium.
Demo Medical Websites:
Scrapy is a Python crawling framework used to extract data from web pages with the help of XPath-based selectors.
Selenium is a UI automation tool that can also be used for data scraping. Scrapy is a very powerful web scraping framework, but it has some limitations. For example, if we need to extract a mobile number from healthgrades.com or a similar site, and the number is displayed only after the user clicks the "show mobile number" button, we need Selenium to execute the click event.
The scraped information (review text, date, rating, reviewer details, physician details, etc.) is stored in a NoSQL database such as MongoDB. MongoDB is an open-source document database and a leading NoSQL database. MongoDB uses the following hierarchy of artifacts to store information:
The database is a physical container for collections. Each database gets its own set of files on the file system. A single MongoDB server typically has multiple databases.
A collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A collection exists within a single database. Collections do not enforce a schema. Documents within a collection can have different fields. Typically, all documents in a collection are of similar or related purpose.
A document is a set of key-value pairs. Documents have dynamic schema. Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection’s documents may hold different types of data.
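The hierarchy above can be sketched with PyMongo. The database and collection names and the document fields below are illustrative assumptions; note that the document is just a key-value map, so other documents in the same collection could carry different fields.

```python
from datetime import datetime, timezone


def build_review_document(review_text, rating, review_date, reviewer, physician):
    # A MongoDB document is a set of key-value pairs; nested dicts
    # become sub-documents. No schema is enforced by the collection.
    return {
        "review_text": review_text,
        "rating": rating,
        "review_date": review_date,
        "reviewer": reviewer,      # sub-document with reviewer details
        "physician": physician,    # sub-document with physician details
        "scraped_at": datetime.now(timezone.utc),
    }


def save_review(doc, uri="mongodb://localhost:27017"):
    # Hypothetical database and collection names.
    from pymongo import MongoClient
    client = MongoClient(uri)
    return client["reviews_db"]["physician_reviews"].insert_one(doc).inserted_id


doc = build_review_document(
    review_text="Very attentive and thorough.",
    rating=5,
    review_date="2023-06-01",
    reviewer={"name": "Anonymous"},
    physician={"name": "Dr. Smith", "specialty": "Cardiology"},
)
```

Calling `save_review(doc)` would insert the document into the `physician_reviews` collection of the `reviews_db` database on a local MongoDB server.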
The following class diagram shows how the scraped review data is stored in MongoDB.
Data scientist with 5+ years of experience leveraging Statistical Modeling, Data Processing, Data Mining, Machine Learning, and deep learning algorithms to solve challenging business problems in Natural Language Processing (NLP), Text Analytics, Chat-bots, and Full-stack web development.