Why Do You Need Web Scraping?
It all begins with data: a collection of facts. Businesses and organizations need data for market research, and it can be gathered through interviews, observations, surveys, and questionnaires, as well as from government archives and the Internet.
Web scraping is a technique for extracting large amounts of relevant data from websites and saving it to a file or database. Scraped data is usually stored in a tabular or spreadsheet format (e.g., a CSV file).
In this blog, we'll scrape the website value.today. Today is the first day of our web scraping project.
Here is an overview of the steps we will follow:
- Download the webpage using requests.
- Parse the HTML source code using beautifulsoup4.
- Extract company names, CEOs, global rankings, market capitalization, annual revenue, employee count, and company URLs.
- Using Pandas, compile the data and generate a CSV file.
How to Perform Web Scraping?
Python is a fantastic language for this task: it provides packages such as Beautiful Soup, Requests, and Pandas that extract data from HTML code and transform it into various formats (CSV, XML, JSON) depending on the application.
HTML: The code used to structure a website and its content is known as HTML (Hypertext Markup Language). It consists of tags that specify how a web browser should format and display information.
Beautiful Soup is a Python library for extracting data from HTML and XML files.
Requests is the de facto standard Python library for making HTTP requests.
HTTP is a protocol that is used to retrieve resources such as HTML documents.
Let us extract the web page listing the top insurance companies by market capitalization.
At the end of the project, we will create a CSV file in the following format:
companies_name,CEOs_name,world_ranks,market_capitalizations_in_billion_dollars,annual_revenues_in_million_dollars,number_of_employees,companies_URLs
BERKSHIRE HATHAWAY,Warren Buffett,8,543.68,286260.0,391500.0,https://www.berkshirehathaway.com/
UNITEDHEALTH GROUP,David S. Wichmann,18,332.73,255630.0,320000.0,https://www.unitedhealthgroup.com/
BANK OF AMERICA CORPORATION,Brian Moynihan,20,262.2,85530.0,208000.0,https://www.bankofamerica.com/
WELLS FARGO & COMPANY,Charles W. Scharf,65,124.78,72340.0,258700.0,https://www.wellsfargo.com/
AIA GROUP,Lee Yuan Siong,91,152.33,50360.0,23000.0,http://www.aia.com/
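As a preview of the final step, here is a minimal sketch of how a file in this format can be produced with Pandas. The dictionary below uses the first row above as sample data; the variable names and the output filename are our own choices, not fixed by the project.

import pandas as pd

# Build a DataFrame whose columns match the CSV header shown above.
companies_dict = {
    'companies_name': ['BERKSHIRE HATHAWAY'],
    'CEOs_name': ['Warren Buffett'],
    'world_ranks': [8],
    'market_capitalizations_in_billion_dollars': [543.68],
    'annual_revenues_in_million_dollars': [286260.0],
    'number_of_employees': [391500.0],
    'companies_URLs': ['https://www.berkshirehathaway.com/'],
}
companies_df = pd.DataFrame(companies_dict)

# index=False omits the row index so the file matches the format above.
companies_df.to_csv('top-insurance-companies.csv', index=False)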
Download the Webpage using requests
We'll use the requests Python library to download the web page.
Let's get started with installing and importing requests.
!pip install requests --upgrade --quiet
import requests
We can use requests.get to download a webpage.
topics_url = 'https://www.value.today/world-top-companies/insurance'
response = requests.get(topics_url)
requests.get returns a response object that contains the contents of the web page as well as some additional information. Using response.text, we can access the contents of the web page.
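Before using the contents, it's worth confirming that the request succeeded. Here is a minimal sketch; status_code and headers are standard attributes of the Requests response object, while the error message is our own.

# A status code of 200 means the page was downloaded successfully.
if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topics_url))

# The response also carries metadata, e.g. the content type of the page.
print(response.status_code)              # e.g. 200
print(response.headers['Content-Type'])  # e.g. 'text/html; charset=UTF-8'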
page_content = response.text
page_content[:1000]

'<!DOCTYPE html> \n<html lang="en" dir="ltr" prefix=" content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# schema: http://schema.org/ sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema# ">\n <head>\n <meta charset="utf-8"/> \n<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script> \n<script>(adsbygoogle=window.adsbygoogle||[]).push({google_ad_client:"ca-pub-2407955258669770",enable_page_level_ads:true});</script> <script>window.google_analytics_uacct="UA-121331115-1";(function(i,s,o,g,r,a,m){i["GoogleAnalyticsObject"]=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)})(window,document,"script","https://www'
Using requests, we successfully fetched the web page; the cell above shows the first 1,000 characters of its HTML via page_content[:1000]. We can also save the page to a file and view it locally within Jupyter by selecting "File > Open."
with open('world-insurance.html', 'w', encoding='utf-8') as file:
    file.write(page_content)
The saved page will look similar to the original page.
Parse the HTML source code using beautifulsoup4
To parse the HTML source code of the web page downloaded in the previous section, we'll use the Beautiful Soup Python library. We'll also add a helper function, sketched below.
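Here is a minimal sketch of such a helper; the function name get_page and the error handling are our assumptions, not necessarily the author's exact code.

!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup

def get_page(url):
    # Download the page and return it parsed as a BeautifulSoup document.
    # (Hypothetical helper; requests was imported in the previous section.)
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    return BeautifulSoup(response.text, 'html.parser')

doc = get_page(topics_url)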