Web Scraping With Python – A Step-By-Step Guide
What is web scraping?
Web scraping is an automation technique for extracting data from websites. It is gaining popularity as the use of machine learning algorithms grows, since those algorithms need large datasets and web scraping can pull them from websites over the HTTP protocol. Unlike the long and mind-numbing process of collecting data manually, web scraping uses automation to gather thousands or even millions of data points in a fraction of the time. Most of this data starts out as unstructured HTML and is then converted into structured data in a spreadsheet or a database so that it can be used in various applications. Python is the preferred language for writing web scrapers.
Why is Python a popular programming language for Web Scraping?
Python is the most popular language for web scraping because it handles most of the work with little effort:
- Python has many libraries dedicated to web scraping.
- Python needs less code to implement more functionality.
- Python is open source, and most of its libraries are free to use and modify.
- Python can handle large amounts of data.
- Python supports many databases.
- Python can access operating system components.
What are the use cases of web scraping?
- Price Monitoring & Market Research: Companies can scrape product data for their own products and competing products, which also powers price-comparison services. High-quality scraped data obtained in large volumes is very helpful for analyzing consumer trends.
- News Monitoring: Scraping news sites can provide detailed reports on current news and trends.
- Sentiment Analysis: Companies can scrape data from social media sites such as Facebook and Twitter and generate reports on the general sentiment about their products and services.
- Email Marketing: Companies can also use web scraping for email marketing, collecting email addresses from various sites and then sending bulk promotional and marketing emails.
Popular Python Libraries Used for Web Scraping
- BeautifulSoup
- Scrapy
- Selenium
- Requests
- urllib3
- lxml
- MechanicalSoup
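Whichever combination you pick, the core pattern is the same: fetch the raw HTML with an HTTP client, then parse it into a searchable tree. Here is a minimal sketch of the Requests + BeautifulSoup pattern used throughout this guide (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML (example.com is a placeholder URL)
response = requests.get('https://example.com')

# Parse it into a searchable tree
soup = BeautifulSoup(response.content, 'html.parser')

# Query the tree by tag name
title = soup.find('title')
print(title.text if title else 'no <title> found')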
A practical step-by-step guide to scraping data from a website
In this example, I am going to scrape data from “https://www.programmableweb.com” and store the extracted data in a CSV file.
I am going to use the following tools and libraries:
- Python 3.4
- Requests
- BeautifulSoup
- CSV (Python's built-in csv module)
The site lists many APIs, categorized into different sectors. In this example, I am scraping data for the Financial APIs; the listing URL is https://www.programmableweb.com/category/financial/api
The programmableweb.com data table we are going to scrape
From this site, we are going to grab the following information:
- API Name
- Description
- Category
- Followers
After collecting the information, we are going to store it in a CSV file.
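Python's built-in csv module handles the storage step. As a rough sketch (the row values here are made up for illustration), writing one record with csv.DictWriter looks like this:

import csv

header = ['API Name', 'API Description', 'Category', 'Followers']

with open('sample.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=header)
    writer.writeheader()
    # Illustrative values only; the real rows come from the scraper below
    writer.writerow({'API Name': 'Sample API',
                     'API Description': 'A short example description',
                     'Category': 'Financial',
                     'Followers': '42'})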
Installing the required libraries
pip install requests
pip install beautifulsoup4
Importing the required libraries and dependencies
from bs4 import BeautifulSoup
import requests
import csv
Defining a function to visit each API's sub-page and scrape the required data:
def getSubPage(link, headers):
    url = "https://www.programmableweb.com"
    response = requests.get(url + link, headers=headers, allow_redirects=False)
    # Check the API response status code
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Collect the tag links shown on the API detail page
        APITagsHtml = soup.find('div', attrs={'class': 'tags'})
        APITagsH = APITagsHtml.find_all('a')
        APItags = ''
        for atg in APITagsH:
            APItags = APItags + atg.string + ", "
        # Grab the full description from the detail page
        APIDesc = soup.find('div', attrs={'class': 'api_description tabs-header_description'}).text
        APISubData = [APItags.strip(", "), APIDesc]
        return APISubData
    else:
        return False
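To sanity-check this helper before wiring it into the main loop, you can call it directly. The path below is an assumed example; any /api/... path from the listing will do:

# Quick check of getSubPage; '/api/uber' is an assumed example path
headers = {'User-Agent': 'Mozilla/5.0'}
result = getSubPage('/api/uber', headers)
if result:
    print('Tags:', result[0])
    print('Description:', result[1])
else:
    print('Sub-page redirected or failed; fall back to the listing data')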
Python code to scrape the listing data:
url="https://www.programmableweb.com/category/financial/api?pw_view_display_id=apis_all&page="
page_num = 0
APIData = []
file = open('apidata_financial.csv', 'w+', newline ='')
with file:
# identifying the CSV header
header = ['API Name', 'API Description', 'Category', 'Sub-Category','Followers','Inner Page Status','Inner Page URL']
#Initializing the CSV Object
writer = csv.DictWriter(file, fieldnames = header)
writer.writeheader()
# Defining HTTP Request Header
headers = {
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'en-US,en;q=0.8',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
}
while True:
#generate page url with page number to scrap data in pagination
pageUrl = url+ str(page_num)
#Sending Request to fetch the entire HTML page
response=requests.get(pageUrl,headers=headers)
htmlcontent=response.content
#Parse HTML data
soup=BeautifulSoup(htmlcontent,'html.parser')
#Check if we landed the last page and the response is empty then break the loop
if soup.find('div',attrs={'class':'view-empty'}):
break
else:
page_num = page_num + 1
table = soup.find('table', attrs={'class':'views-table'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
APIName = row.find('td', attrs={'class':'views-field-title'}).text
APIFollower = row.find('td', attrs={'class':'views-field-count'}).text
APIDescription = row.find('td', attrs={'class':'views-field-field-api-description'}).text
APICategory = row.find('td', attrs={'class':'views-field-field-article-primary-category'}).text
APINameHtml = row.find('td', attrs={'class':'views-field-title'})
APILink = APINameHtml.find('a').get('href')
APIDataResponse = getSubPage(APILink, headers)
if APIDataResponse:
writer.writerow({'API Name' : APIName, 'API Description': APIDataResponse[1], 'Category': APICategory, 'Sub-Category': APIDataResponse[0], 'Followers': APIFollower})
else:
writer.writerow({'API Name' : APIName, 'API Description': APIDescription, 'Category': APICategory, 'Sub-Category': APICategory, 'Followers': APIFollower})
The Output
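The script produces apidata_financial.csv with one row per API. To spot-check the file without opening a spreadsheet, a short read-back sketch like the following works:

import csv

# Read the generated CSV back and print the first three records
with open('apidata_financial.csv', newline='') as file:
    reader = csv.DictReader(file)
    for i, row in enumerate(reader):
        if i >= 3:
            break
        print(row['API Name'].strip(), '-', row['Followers'].strip())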
The final script
from bs4 import BeautifulSoup
import requests
import csv

def getSubPage(link, headers):
    url = "https://www.programmableweb.com"
    response = requests.get(url + link, headers=headers, allow_redirects=False)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        APITagsHtml = soup.find('div', attrs={'class': 'tags'})
        APITagsH = APITagsHtml.find_all('a')
        APItags = ''
        for atg in APITagsH:
            APItags = APItags + atg.string + ", "
        APIDesc = soup.find('div', attrs={'class': 'api_description tabs-header_description'}).text
        APISubData = [APItags.strip(", "), APIDesc]
        return APISubData
    else:
        return False

url = "https://www.programmableweb.com/category/financial/api?pw_view_display_id=apis_all&page="
page_num = 0
APIData = []
with open('apidata_financial.csv', 'w+', newline='') as file:
    # Define the CSV header
    header = ['API Name', 'API Description', 'Category', 'Sub-Category', 'Followers', 'Inner Page Status', 'Inner Page URL']
    writer = csv.DictWriter(file, fieldnames=header)
    writer.writeheader()
    headers = {
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'en-US,en;q=0.8',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
    }
    while True:
        pageUrl = url + str(page_num)
        print(pageUrl)
        response = requests.get(pageUrl, headers=headers)
        htmlcontent = response.content
        soup = BeautifulSoup(htmlcontent, 'html.parser')
        if soup.find('div', attrs={'class': 'view-empty'}):
            break
        else:
            page_num = page_num + 1
            table = soup.find('table', attrs={'class': 'views-table'})
            table_body = table.find('tbody')
            rows = table_body.find_all('tr')
            for row in rows:
                APIName = row.find('td', attrs={'class': 'views-field-title'}).text
                APIFollower = row.find('td', attrs={'class': 'views-field-count'}).text
                APIDescription = row.find('td', attrs={'class': 'views-field-field-api-description'}).text
                APICategory = row.find('td', attrs={'class': 'views-field-field-article-primary-category'}).text
                APINameHtml = row.find('td', attrs={'class': 'views-field-title'})
                APILink = APINameHtml.find('a').get('href')
                APIDataResponse = getSubPage(APILink, headers)
                if APIDataResponse:
                    APIData.append([APIName, APIDataResponse[1], APICategory, APIDataResponse[0], APIFollower, 200, APILink])
                    writer.writerow({'API Name': APIName, 'API Description': APIDataResponse[1], 'Category': APICategory, 'Sub-Category': APIDataResponse[0], 'Followers': APIFollower, 'Inner Page Status': 200, 'Inner Page URL': APILink})
                else:
                    APIData.append([APIName, APIDescription, APICategory, APICategory, APIFollower, 301, APILink])
                    writer.writerow({'API Name': APIName, 'API Description': APIDescription, 'Category': APICategory, 'Sub-Category': APICategory, 'Followers': APIFollower, 'Inner Page Status': 301, 'Inner Page URL': APILink})
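Because every category listing on the site shares the same table structure, pointing the script at a different sector only takes two changes. For example, to scrape the Transportation category instead (assuming its listing accepts the same pw_view_display_id parameter), swap the listing URL and the output filename:

# Point the listing URL at the Transportation category
url = "https://www.programmableweb.com/category/transportation/api?pw_view_display_id=apis_all&page="
# And pass a matching filename to open() in the script above:
csv_filename = 'apidata_transportation.csv'  # hypothetical variable shown for clarity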
Conclusion
In this project, we used the most popular web-scraping package, Beautiful Soup, which builds a parse tree we can use to extract data from a website's HTML. From https://www.programmableweb.com/ we scraped the API Name, Category, Description, and Followers for each listed API, then wrote the data to a CSV file, apidata_financial.csv. I hope this helps you build small web scrapers of your own. If you want to build a scraper and collect data for your business needs, feel free to contact us.