Beginner's Guide to Web Scraping Using Selenium
Data experts and researchers often need to retrieve exact data from unconventional sites to train or test algorithms and to build datasets, machine learning models, neural networks, etc. Some websites provide APIs, which are an excellent way to retrieve structured data. But what if there is no API, or you want to bypass that route?
In such situations, the data can still be gathered directly from the web page. But the manual method is highly time-consuming and cumbersome, and it becomes even more challenging when you have to deal with websites such as hotel booking, real estate, or job listings, which need to be accessed frequently. Selenium offers an automated way to fetch information from a website and extract whatever you want. But before going deep into the process, let's understand web scraping and Selenium in detail.
What Is Web Scraping?
Web scraping, also known as data scraping, is the practice of importing data from a website into a file or spreadsheet. It's one of the most efficient ways to get information from the web, making it possible to import data with few problems. Well-known uses of web scraping include:
- Web content analysis;
- Price comparison across sites;
- Finding sales leads by scraping public data sources;
- Sending data from eCommerce platforms to online vendors.
This list only scratches the surface; the concept has countless applications. It's useful in every case where data needs to be moved: an automated approach where an application parses a web page, extracts the details it needs, converts them into other formats, or saves a copy in a local database for later analysis and retrieval. A few use cases of web scraping include:
- Contact scraping;
- Online price checking and tracking;
- Weather information tracking and change detection;
- Research and analysis;
- Web data integration;
- Monitoring a brand's online activities.
What Is Selenium?
Selenium is the most popular automation testing framework for web applications. It can drive the browser to navigate a site much like a human user. It uses a WebDriver to mimic user actions, find the desired elements, and control the browser.
It's mainly used for automating applications for testing purposes, but it's not restricted to that. It lets users open any browser they want and perform tasks just as a person would, including:
- Clicking buttons;
- Searching for specific data on a page;
- Entering details into forms.
Point To Note:
Web scraping may violate the terms of service of some large sites, and your device's IP address can be blocked if you scrape too aggressively. Hence, make sure to do it responsibly.
What Are We Going to Do?
This article is about web scraping using Selenium, so we will write a program that scrapes a user profile and collects the repository names of the pinned repositories.
What Are the Requirements for the Project?
For this particular project, we need Python 3.x. You can even do it with Python 2.x with minor variations in the code. If you don't have the correct version, you can install it from python.org. Pip will come installed along with it. We will also use a driver and packages, including:
- The Selenium package;
- ChromeDriver;
- Virtualenv;
- Extras.
Selenium Setup
To install the package, run the pip command in the terminal:
pip install selenium
Selenium Driver Setup
The WebDriver lets you control the browser through OS-level interaction. It uses the browser's own automation support to drive it; for this to work, the driver must be installed and reachable via the operating system's PATH variable. Download the driver from the official site for Chrome, Safari, or whichever browser you prefer.
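As a quick sanity check, assuming the driver is already on your PATH, you can launch Chrome without passing an explicit path. This is a minimal sketch, not part of the article's project:
from selenium import webdriver

# no executable_path needed when chromedriver is resolvable via PATH
browser = webdriver.Chrome()
browser.get("https://www.python.org/")
print(browser.title)  # prints the page title if the driver launched correctly
browser.quit()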
Project Setup
Create a new directory for the project. Inside it, create a setup.py file that lists selenium. Now open a command line and create a virtual environment with the following command:
$ virtualenv selenium_example
Once the virtual environment is created, activate it and run the following command in the terminal:
$(selenium_example) pip install -r setup.py
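For reference, here is a minimal sketch of what that setup.py file might contain if it is used as a plain requirements list (the version pin is hypothetical):
selenium==3.141.0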
Import Required Modules
Once the project directory is in place, create a webscraping.py file and add the following code snippets:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
From selenium import webdriver -
Lets you create and initialize the browser.
From selenium.webdriver.common.by import By -
Lets you locate elements using specific strategies.
From selenium.webdriver.support.ui import WebDriverWait -
Lets you wait for a page to load.
From selenium.webdriver.support import expected_conditions as EC -
Lets you specify what to look for on a page, helping you detect when it has loaded.
From selenium.common.exceptions import TimeoutException -
Helps you handle a timeout situation.
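Putting these imports together, here is a minimal, self-contained sketch showing how they cooperate; the URL and locator are placeholders, not part of the article's project:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

browser = webdriver.Chrome()  # assumes chromedriver is on your PATH
browser.get("https://example.com/")
try:
    # wait up to 10 seconds for the page's <h1> to become visible
    WebDriverWait(browser, 10).until(EC.visibility_of_element_located((By.TAG_NAME, "h1")))
    print("Page loaded")
except TimeoutException:
    print("Timed out waiting for the page to load")
browser.quit()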
Create a New Instance In Incognito Mode
First, add the incognito argument to the driver options.
option = webdriver.ChromeOptions()
option.add_argument("--incognito")
Next, we create an instance of Chrome.
browser = webdriver.Chrome(executable_path='/Library/Application Support/Google/chromedriver', chrome_options=option)
Note that executable_path is the location where you downloaded and stored your ChromeDriver.
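Note that on Selenium 4 and later, executable_path and chrome_options are deprecated; a rough equivalent (the driver path below is hypothetical) looks like this:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
# point the Service at wherever you stored chromedriver
service = Service("/path/to/chromedriver")
browser = webdriver.Chrome(service=service, options=option)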
Data Extraction
Let's begin by searching for a product and downloading the CSV file using the following steps:
Create Driver Instance and Import Dependencies
The first step is to create a WebDriver object for the browser by importing it from the selenium module:
# import webdriver
from selenium import webdriver
# import Action chains
from selenium.webdriver.common.action_chains import ActionChains
# import KEYS
from selenium.webdriver.common.keys import Keys
# create webdriver object
driver = webdriver.Firefox()
# get geeksforgeeks.org
driver.get("https://www.geeksforgeeks.org/")
# create action chain object
action = ActionChains(driver)
# perform the operation
action.key_down(Keys.CONTROL).send_keys('F').key_up(Keys.CONTROL).perform()
Now open the developer settings and turn on "Enable Remote Automation." By default, this option is disabled in the browser and must be enabled for the automation session; otherwise, the driver raises a SessionNotCreatedException. So enable the Develop option under the advanced preferences of the browser you are using.
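If you are automating Safari on macOS, the same switch can also be flipped from the terminal; this assumes a macOS setup and may prompt for an administrator password:
$ sudo safaridriver --enable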
Locating WebElements
Selenium provides a wide variety of functions to locate any element on a web page; a short sketch follows the list:
find_element_by_id: Uses the id attribute to find an element;
find_element_by_name: Uses the name attribute to find an element;
find_element_by_partial_link_text: Finds an element by matching part of a hyperlink's text;
find_element_by_xpath: Uses an XPath expression to find an element;
find_element_by_link_text: Uses the text value of a link to find an element;
find_element_by_tag_name: Uses the tag name to find an element;
find_element_by_css_selector: Uses a CSS selector (id, class, etc.) to find an element;
find_element_by_class_name: Uses the value of the class attribute to find an element.
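As a quick illustration of these helpers (the underscore-style calls belong to the Selenium 3 API used throughout this article, and the commented-out values are hypothetical):
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://example.com/")

# locators that exist on this simple page
heading = driver.find_element_by_tag_name("h1")
paragraph = driver.find_element_by_xpath("//p")
link = driver.find_element_by_css_selector("a")
print(heading.text)

# the remaining helpers follow the same shape; these values are hypothetical
# driver.find_element_by_id("content")
# driver.find_element_by_name("q")
# driver.find_element_by_link_text("Sign in")
# driver.find_element_by_partial_link_text("Sign")
# driver.find_element_by_class_name("header")

driver.quit()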
Make the Request
While executing the application, you need to take care of two things:
- Request the desired website URL;
- Use try/except to handle a timeout situation.
In our illustration, we're going to use a user profile as the desired site URL:
browser.get("https://gist.github.com/kapil-varshney/")
Now it's time to specify a timeout period using try/except:
# Wait 20 seconds for the page to load
timeout = 20
try:
    WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, "//img[@class='width-full rounded-2']")))
except TimeoutException:
    print("Timed out waiting for page to load")
    browser.quit()
Make sure to wait until the final component is loaded. The premise is that if the image has loaded, the entire page has also more or less loaded, as the image is among the last things to load.
Get the Response
Once the request completes successfully, we need to handle the response. The response can be split into two parts and combined at the end: the languages and the titles of the pinned repositories of the profile you are using. Let's begin by fetching all the titles of the pinned repositories. We're not fetching the plain titles directly; we're getting Selenium objects that contain the titles.
# find_elements_by_xpath returns an array of selenium objects.
titles_element = browser.find_elements_by_xpath("//a[@class='text-bold']")
# use list comprehension to get the actual repo titles and not the selenium objects.
titles = [x.text for x in titles_element]
# print out all the titles.
print('titles:')
print(titles, '\n')
The class attribute and the <a> tag are the same for all the titles of the pinned repositories, so we can find all those elements using this structure as a reference. We can get all the languages for the pinned repositories in the same way; it's analogous to what we did before.
language_element = browser.find_elements_by_xpath("//p[@class='mb-0 f6 text-gray']")
# same thing as the list comprehension above:
languages = [x.text for x in language_element]
print("languages:")
print(languages, '\n')
The class structure and the <p> tag are the same for the languages of the pinned repositories, so we find all the elements using that structure as a reference.
Combine the Responses
The final step is to match each title with its corresponding language and print the result. You can use the zip function, which pairs up elements from two arrays, maps them into tuples, and returns an array of tuples.
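As a tiny standalone illustration of zip (the values here are made up):
titles = ["repo-a", "repo-b"]
languages = ["Python", "JavaScript"]
for title, language in zip(titles, languages):
    print(title + ": " + language)
The complete script below applies the same idea, this time pointed at a Google product search rather than the GitHub profile: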
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
option = webdriver.ChromeOptions()
option.add_argument("--incognito")
#browser = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver/', chrome_options=option)
browser = webdriver.Chrome(executable_path='/users/user_1566/downloads/chrome_driver/chromedriver', chrome_options=option)
browser.get('https://www.google.com/search?q=samsung+note10')
#items = len(browser.find_elements_by_class_name("cu-container"))
#items = len(browser.find_elements_by_class_name("mnr-c pla-unit"))
#print(items)
timeout = 20
try:
    WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='top-pla-group-inner']")))
except TimeoutException:
    print('Timed out waiting for page to load')
    #browser.quit()
titles_element = browser.find_elements_by_xpath("//div[@class='mnr-c pla-unit']")
# use list comprehension to get the actual repo titles and not the selenium objects.
titles = [x.text for x in titles_element]
# print out all the titles.
print('titles:')
print(titles, '\n')
language_element = browser.find_elements_by_xpath("//a[@class='plantl pla-unit-single-clickable-target clickable-card']")
print(language_element)
# same concept as for list-comprehension above.
languages = [x.text for x in language_element]
print("languages:")
print(languages, "\n")
for title, language in zip(titles, languages):
    print("RepoName : Language")
    print(title + ": " + language, "\n")
It's Time To Run the Program
Now it's time to execute the code by running it through your IDE, or you can use the command below to run your program:
$ (selenium_example) python selenium_example.py
When you run the program, the browser launches in incognito mode. On the terminal, you can view the output printed in the format you chose.
You're All Set to Scrape!
Web scraping has been used to extract details from websites and apps ever since the WWW was introduced. It began on static pages with tags, data, components, etc. But modern technology has changed everything, from app development to testing to importing and exporting data. As technology advances, professionals need to deal with large volumes of data to build reliable models that address different issues.
Web scraping can help you extract a significant amount of data related to your brand, products, customers, the stock market, and much more. You can use data from sites including job portals, social media, and eCommerce platforms to understand purchasing patterns, sentiment, and employee behavior; the list goes on.
Modern-day business is changing with concepts like e-delivery and e-ordering, and most business professionals are looking to such concepts for automation.
Selenium is among the most popular frameworks and libraries used on the web. Web scraping with Selenium helps you gather data and images from the web that can be used to build training data for your project.