Wednesday 22 December 2021

HOW TO SCRAPE HERTZ CAR INVENTORY USING PYTHON?



Hertz Global Holdings (HTZ) has received substantial media coverage because of its recent bankruptcy filing and an effort to raise an additional $500 million in equity after the bankruptcy announcement. Hertz filed for Chapter 11 bankruptcy expecting a reorganization. The equity raise has since been halted, as noted in Hertz's 8-K filing of 18th June 2020.

Summary

This blog presents a study that uses the Python programming language to download data from the Hertz website, scrape car data, and save it in DataFrames, which we then exported to .csv files and manipulated in Excel. The code used is given in the appendix of this blog.

To extract car API data, a search was run on the Hertz website for vehicles for sale within 10,000 miles of St. Louis, MO. The same 10,000-mile-radius download was performed for New York City, NY and San Francisco, CA; unless otherwise stated, the statistics below relate to the St. Louis search. The search yielded some descriptive details, but not a comprehensive view of the size of the fleet portfolio that is up for sale.

Search results initially reported a total of 26,054 vehicles for the St. Louis search; however, only 11,994 vehicles could actually be extracted before the site's search results started returning blank pages. This raises questions about the accuracy of the search results. Various online sources mention that Hertz originally put more than 50,000 vehicles up for sale, but without a complete universe of distinct Vehicle Identification Numbers (VINs), combined with an organized scraping approach, we cannot validate the size of the fleet for sale with certainty. Put another way, although the search reported 26,054 results, paging through them shows that the actual listings end after 11,994 vehicles: the advertised result count does not match the results that are actually displayed. This is most likely related to the programming or display structure of the used-car sales website.

Data

The data below briefly summarizes the St. Louis search. Some descriptions are longer than others. For instance, a couple of entries in the St. Louis data list '1992' as the entire description, with a mileage of '1 mile' and no other details. This is normal with larger datasets and is not a concern for us, as it is immaterial relative to the total size of the search data. There are also 215 entries for vehicles that require calling the location manually for a price quote; these 215 items were removed from the St. Louis data. Sample statistics for the St. Louis data follow:

Example (St. Louis Zip Code Search):
  • Vehicle Title & Description: “2019 CADILLAC XT5 Premium Luxury SUV”
  • Listed Price: $33,279
  • Mileage: 24,664
Descriptive Statistics Overview:
  • Total Vehicles: 11,994
  • Average Price: $18,472
  • Total Sum of Sales Value: $217,583,305
  • Maximum Price: 2019 Chevrolet Corvette Z06 3LZ Coupe: $67,995
  • Minimum Price: 2016 Kia Rio LX Sedan: $6,877
  • Average Year: 2018.70
  • Average Mileage: 33,350
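
The statistics above can be reproduced from the exported .csv file with pandas. Here is a minimal sketch, assuming the CSV produced by the appendix code (the file name output_StLouis.csv is hypothetical, and we assume Price was scraped with a leading "$" and thousands separators):

import pandas as pd

df = pd.read_csv("output_StLouis.csv")  # hypothetical file name
# Strip "$" and "," from Price and coerce both columns to numbers
df["Price"] = pd.to_numeric(df["Price"].astype(str).str.replace(r"[$,]", "", regex=True))
df["Mileage"] = pd.to_numeric(df["Mileage"])

print("Total Vehicles:", len(df))
print("Average Price: $%.0f" % df["Price"].mean())
print("Total Sum of Sales Value: $%.0f" % df["Price"].sum())
print("Maximum Price:", df.loc[df["Price"].idxmax(), "Description"], "-", df["Price"].max())
print("Minimum Price:", df.loc[df["Price"].idxmin(), "Description"], "-", df["Price"].min())
print("Average Mileage:", round(df["Mileage"].mean()))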

Image 1: Portfolio Sales Value by Make


Image 2: Vehicle Data Bucket Summary Statistics by Search Location


Closing

The data shows that average mileage, price, and age are comparable across the samples. On average, a car that Hertz is selling is worth around $18,500, average mileage is around 33,000–34,000 miles, and most of the cars for sale are 2018–2019 models. We also observe that Hertz's fleet leans significantly toward Chevrolet models. However, without a list of all the unique vehicles offered for sale, we cannot go much further.

Next Steps:

The next step is to try to get a complete picture of how much money Hertz could raise by selling the portfolio and how that might affect the company going forward. To do that, the following steps would be taken:

Accumulate a complete universe of the cars available for sale

Based on that universe, calculate the expected value of the whole sales portfolio (see the sketch after this list)

Use financial statements to determine what effect the sale might have on recoveries for Hertz debt holders as the bankruptcy proceedings continue
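
As a rough back-of-the-envelope sketch of the second step: the fleet size below is the unverified 50,000-vehicle figure discussed above, and the 10% haircut to listed prices is purely an illustrative assumption.

avg_price = 18_472    # average listed price from the St. Louis sample
fleet_size = 50_000   # reported fleet for sale (unverified, see above)
haircut = 0.10        # assumed discount to listed prices (illustrative)

expected_proceeds = fleet_size * avg_price * (1 - haircut)
print(f"Expected gross proceeds: ${expected_proceeds:,.0f}")
# -> roughly $831 million under these assumptions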

In a bankruptcy reorganization, many factors can affect recoveries, depending on where your securities sit in the company's capital structure. Generally, debt holders receive some portion of their principal in recoveries, or equity in the newly formed company. That estimated or perceived amount has a huge impact on the market value of existing securities. Modeling possible recoveries can provide insight into how the currently outstanding securities are valued.

Image 3: Example of a Sliced File Format


Image 4: Python Code

from requests import get
from bs4 import BeautifulSoup
import pandas as pd
url_insert = 0 # a variable to increment by 35 results per page
count_one = 0
count_two = 0
#create an empty dataframe, the "Split" columns are placeholders for when we split our multi-word title into
#components so in Excel we can analyze by make, model, etc.
df_middle = pd.DataFrame(columns = ['Description', 'Price', 'Mileage', 'Split 1', 'Split 2', 'Split 3', 'Split 4',
                                    'Split 5', 'Split 6', 'Split 7', 'Split 8', 'Split 9', 'Split 10', 'Split 11',
                                    'Split 12', 'Split 13', 'Split 14'])
#at 35 results per page using a p of 800 means we can search up to 28,000 (35 * 800) results.  The most results we got after searching
#was just over 26,000 even though we did not download all of them because blank pages started just before 12,000 at the time
# of our St. Louis 10,000 mile radius search
for p in range(0,800):
    url = 'https://www.hertzcarsales.com/used-cars-for-sale.htm?start=' + str(url_insert) + '&geoRadius=10000&geoZip=10007'
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    li_class = html_soup.find_all('li', class_ = 'item hproduct clearfix closed certified primary')
    count_one = count_one + 1
    print(count_one)
    print(url)
    url_insert = url_insert + 35
    for i in li_class:
        title_long = i.find('a', class_ ='url')
        title_long = title_long.text
        title_long = str(title_long)
        title_long = title_long.rstrip("\n")
        title_long = title_long.lstrip("\n")
        title_split = title_long.split()
        #our pre-split placeholder list, when split up to 14 columns are available, one for each word based on the number of
        #words in the title
        blank_list = ["none", "none", "none", "none", "none", "none", "none", "none", "none", "none", "none", "none", "none", "none"]
        for a in range(len(title_split)):
            blank_list[a]=title_split[a]
        price = i.find('span', class_='value')
        price = price.text
        price = str(price)
        price = price.rstrip("\n")
        price = price.lstrip("\n")
        #print(price)
        data = i.find('div', class_='gv-description')
        mileage = data.span.span
        mileage = mileage.text
        mileage = mileage.rstrip("\n")
        mileage = mileage.lstrip("\n")
        #the line below removes the text "miles" from our mileage column
        mileage_clean = ''.join([i for i in mileage if i.isdigit()])
        #print(mileage.text)
        #print("----------------------------------------------------")
        list_current = {'Description':[title_long], 'Price':[price], 'Mileage':[mileage_clean], 'Split 1':blank_list[0], 'Split 2':blank_list[1], 'Split 3':blank_list[2],
        'Split 4': blank_list[3], 'Split 5':blank_list[4], 'Split 6':blank_list[5], 'Split 7':blank_list[6], 'Split 8':blank_list[7], 'Split 9':blank_list[8],
        'Split 10':blank_list[9], 'Split 11':blank_list[10], 'Split 12':blank_list[11], 'Split 13':blank_list[12], 'Split 14':blank_list[13]}
        df_current = pd.DataFrame(data = list_current)
        count_two = count_two + 1
        print(count_two)
        #print(df_current)
        #DataFrame.append was removed in pandas 2.x; pd.concat is the portable equivalent
        df_middle = pd.concat([df_middle, df_current], ignore_index=True)
df_middle.to_csv('C:\\Users\\james\\PycharmProjects\\workingfiles\\Webscraping\\Hertz\\output_NewYorkCity.csv')

If you want to know more about how to scrape car data or extract car API data using Python, you can contact X-Byte Enterprise Crawling or ask for a free quote!

Monday 20 December 2021

HOW STOCK SENTIMENT ANALYSIS AND SUMMARIZATION IS CONDUCTED USING WEB SCRAPING?


For some, the stock market represents a tremendous risk because they lack the information needed to make better decisions. People spend a lot of time deciding which café to visit, but not nearly as much time deciding which stock to invest in. That is because people have far less time than they need, and this is where AI can help. Automatic summarization and web scraping can help us obtain the knowledge we need to make the best decisions.

Module References

1. Web Scraping Modules

Requests Module

For web scrapers, the requests module is a blessing. It enables developers to retrieve the target webpage's HTML code.

BeautifulSoup

Even if you're not a web developer, BeautifulSoup will come in handy: it breaks a complex HTML page down into a legible, scrapable soup object.

2. Standard Modules

Pandas Module

It's a well-known tool in a data developer's toolbox for dealing with enormous amounts of data and drawing inferences through correlation, joining, filtering, and extended data analysis.

Numpy Module

Put simply, it makes mathematical operations on data easy. The heart of this module is matrix and array computation. Pandas is also built on top of it.

Matplotlib

Consumers, of course, like to see cool images, and visuals communicate far better than text on a screen. Matplotlib takes care of that.

3. Sentiment Analyzer Module

NLTK

It works by analyzing text data and inferring sentiment from it. When it comes to Natural Language Processing, Hugging Face transformers and NLTK have a competitive advantage in the current market.

Textblob

A lightweight sentiment analyzer that can be employed during the first phase of the project.

Transformers Pipeline Sentiment

Transformers' arsenal also includes a sentiment analyzer.

4. Article Summarization

Newspaper3K

A simple Python summarization module that helps you summarize a text.

Transformers(Financial-Summarization-Pegasus)

Transformers is a deep learning toolkit primarily for NLP projects. The Pegasus financial summarization model will be used in this project.

1. Install and Import Dependencies

Install with pip. Essentially, we run a command in the background to download the appropriate packages onto our system so that we can access them in our code.

For the sake of convenience, pip will install all of the required packages for this project.
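
As a minimal sketch of this step (package names are inferred from the modules listed above, and the NLTK resource downloads are assumptions based on the analyzers used later):

# Run once in a shell or notebook cell:
#   pip install requests beautifulsoup4 pandas numpy matplotlib nltk textblob transformers newspaper3k

import nltk
nltk.download("vader_lexicon")  # lexicon for the NLTK VADER sentiment analyzer
nltk.download("stopwords")      # stop-word lists used when cleaning headlines
nltk.download("wordnet")        # data for the WordNet lemmatizer
nltk.download("punkt")          # tokenizer used by Newspaper3k's nlp()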

2. Summarization Modules

The summarizing models reduce the provided material to a logical and succinct summary.

Example: financial-summarization-pegasus (Hugging Face): it is pre-trained on financial language in order to extract the best summary from financial data.

Input:

In the largest financial buyout this year, National Commercial Bank (NCB), Saudi Arabia's top lender by assets, agreed to buy rival Samba Financial Group for $15 billion. According to a statement issued on Sunday, NCB will pay 28.45 riyals (US$7.58) each Samba share, valuing the company at 55.7 billion riyals. NCB will issue 0.739 new shares for every Samba share, which is at the lower end of the 0.736–0.787 ratio agreed upon by the banks when they signed an initial framework deal in June. The offer represents a 3.5 percent premium over Samba's closing price of 27.50 riyals on Oct. 8 and a 24 percent premium over the level at which the shares traded before the talks were made public. The merger talks were initially reported by Bloomberg News. The new bank will have total assets of more than 220 billion dollars, making it the third-largest lender in the Gulf area. The entity's market capitalization of 46 billion dollars is almost identical to Qatar National Bank's.

Output:

The NCB will pay 28.45 riyals per Samba share. The deal will create the third-largest lender in the Gulf area.
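
A minimal sketch of running this summarizer with the transformers library, assuming the publicly available human-centered-summarization/financial-summarization-pegasus checkpoint on the Hugging Face hub (the generation settings are assumptions):

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "human-centered-summarization/financial-summarization-pegasus"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

text = "In the largest financial buyout this year, National Commercial Bank (NCB) ..."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=64, num_beams=5, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))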

3. A News and Sentiment Pipeline: the Finviz Website

Finviz is the website considered in this pipeline. It's a web-based application that lists securities and the most recent stock news stories in chronological order. The goal of this pipeline is to extract the URLs, along with their headlines and dates, and perform sentiment analysis on the headlines.

User-Defined Functions Used in Pipeline 1:

1. Function: finviz_parser_data(ticker):

Using the requests library, this method collects data from the Finviz website. The download should return a response code of 200 (OK).

The HTML response is parsed and returned as soup using the BeautifulSoup class. Note that soup is a bs4.BeautifulSoup object.

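A minimal sketch of this function, assuming the standard Finviz quote URL (the browser-like User-Agent header is an assumption, added because Finviz rejects default clients with a 403):

import requests
from bs4 import BeautifulSoup

def finviz_parser_data(ticker):
    # Fetch the Finviz quote page for the ticker
    url = f"https://finviz.com/quote.ashx?t={ticker}"
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()  # the response code should be 200
    return BeautifulSoup(response.text, "html.parser")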

2. Function: correct_time_formatting(time_data)

This function converts the Finviz website's inconsistent date and time format to a standardized format.

Before Execution
0 Sep-20-21 07:53AM
1 06:48AM
2 06:46AM
3 12:01AM
4 Sep-19-21 06:45AM
5 Sep-18-21 05:50PM
6 10:34AM

After Execution
0 Sep-20-21 07:53AM
1 Sep-20-21 06:48AM
2 Sep-20-21 06:46AM
3 Sep-20-21 12:01AM
4 Sep-19-21 06:45AM
5 Sep-18-21 05:50PM
6 Sep-18-21 10:34AM
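
A minimal sketch of this logic: Finviz prints the date only on the first headline of each day, so the last seen date is carried forward onto time-only entries (the exact implementation in the screenshot may differ):

def correct_time_formatting(time_data):
    fixed, last_date = [], ""
    for entry in time_data:
        parts = entry.split()
        if len(parts) == 2:     # "Sep-20-21 07:53AM" -> remember the date
            last_date = parts[0]
            fixed.append(entry)
        else:                   # "06:48AM" -> prepend the carried date
            fixed.append(f"{last_date} {parts[0]}")
    return fixed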

3. Function: finviz_create_write_data(soup, file_name="MSFT")

The file_name is customizable: the soup is supplied as a positional argument, while the file name is passed as a keyword parameter.

finviz_create_write_data(soup, file_name="Amazon") is an example.

The code extracts the URL, time, News Reporter, and News headline, among other things.

It uses Pandas to generate a data frame, publishes it to a CSV, then returns the data frame.

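A minimal sketch of this function, reusing correct_time_formatting from above (the "news-table" element id is how Finviz marks its headline table; the exact columns in the original may differ):

import pandas as pd

def finviz_create_write_data(soup, file_name="MSFT"):
    records = []
    for row in soup.find(id="news-table").find_all("tr"):
        link = row.a                       # headline anchor: text + URL
        if link is None:
            continue
        records.append({"Time": row.td.text.strip(),
                        "News Headline": link.text.strip(),
                        "URL": link.get("href")})
    df = pd.DataFrame(records)
    df["Time"] = correct_time_formatting(df["Time"].tolist())
    df.to_csv(f"{file_name}.csv", index=False)
    return df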

4. Function: create_csv_ticker_list(ticker_list):

This function simplifies the process of adding several stocks to a ticker list.

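A minimal sketch, chaining the two functions above over the list (again an assumption about the original implementation):

def create_csv_ticker_list(ticker_list):
    for ticker in ticker_list:
        soup = finviz_parser_data(ticker)
        finviz_create_write_data(soup, file_name=ticker)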

5. Function: finviz_view_pandas_dataframe(ticker)

This function assists in the analysis process when an analyst has to perform calculations on the data frame for a certain stock.

Take Google stock as an example for analysis:

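A minimal sketch, assuming the CSV written earlier for the ticker (the Time_pdformat column name is taken from the sentiment_analyzer signature below):

import pandas as pd

def finviz_view_pandas_dataframe(ticker):
    df = pd.read_csv(f"{ticker}.csv")
    # Parse "Sep-20-21 07:53AM" strings into proper timestamps
    df["Time_pdformat"] = pd.to_datetime(df["Time"], format="%b-%d-%y %I:%M%p")
    return df

df_google = finviz_view_pandas_dataframe("GOOGL")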

6. Function: clean_data(df, column_filter="News Headline", other_column="Time")

The sentiment analyzers we employ, whether highly capable ones like transformers or less efficient analyzers, perform much better when the text is cleaned: lower-casing, eliminating punctuation marks, removing stop words, and lemmatizing the text.

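A minimal sketch of such a cleaning pass (the "lemmatized" output column name comes from the next function's signature; the exact steps in the original may differ):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def clean_data(df, column_filter="News Headline", other_column="Time"):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    def normalize(text):
        text = re.sub(r"[^a-z\s]", " ", text.lower())  # lower-case, drop punctuation
        words = [lemmatizer.lemmatize(w) for w in text.split()
                 if w not in stop_words]               # drop stop words, lemmatize
        return " ".join(words)

    df["lemmatized"] = df[column_filter].astype(str).apply(normalize)
    return df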

7. Function (optional): find_unnecessary_stop_words(df, count) & cleaning_secondry(df, apply_column="lemmatized"):

The remaining domain-specific stop words must be found manually, and these functions help with that.

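A minimal sketch of the first helper: counting token frequencies surfaces candidates (ticker names, "stock", "share", and so on) for a hand-curated stop-word list, which the second helper would then drop (both are assumptions about the original):

from collections import Counter

def find_unnecessary_stop_words(df, count):
    tokens = " ".join(df["lemmatized"]).split()
    return Counter(tokens).most_common(count)  # top candidates to review by hand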

8. Function: sentiment_analyzer(df, column_applied_df="final_sentiment_cleaned", other_column="Time_pdformat")

Taking df as input, the program applies sentiment analyzers such as NLTK VADER and TextBlob.

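A minimal sketch with both analyzers the text names (column names follow the signature above; requires the vader_lexicon download from the dependencies step):

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

def sentiment_analyzer(df, column_applied_df="final_sentiment_cleaned",
                       other_column="Time_pdformat"):
    vader = SentimentIntensityAnalyzer()
    df["vader_compound"] = df[column_applied_df].apply(
        lambda t: vader.polarity_scores(t)["compound"])
    df["textblob_polarity"] = df[column_applied_df].apply(
        lambda t: TextBlob(t).sentiment.polarity)
    return df
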
Steps to Reproduce

Step 1:

Using the user-defined functions finviz_parser_data and finviz_create_write_data, create a Tesla stock CSV file.


Step 2:

Create a ticker list of the stocks you want and pass it to the function create_csv_ticker_list as an argument.


Step 3:

To perform individual analysis on your selected stock, build a stock data frame.


Step 4:

Pandas includes a function that converts a date-time item to a timestamp: use pd.to_datetime on the data frame's Time column.

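A one-line sketch of this step (column names as in the earlier sketches, with pandas imported as pd):

df["Time_pdformat"] = pd.to_datetime(df["Time"], format="%b-%d-%y %I:%M%p")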

Step 5:

Import Stop Words in The Desired Language


Step 6:

Clean the data frame by passing it through the preset cleaning functions.


Step 7:

Conduct sentiment analysis on the last column of the cleansed data and assess the results.

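A minimal sketch of this step plus the scatter-plot review mentioned in the pipeline summary below (function and column names follow the earlier sketches):

import matplotlib.pyplot as plt

scored = sentiment_analyzer(df, column_applied_df="lemmatized")
plt.scatter(scored["Time_pdformat"], scored["vader_compound"], s=12)
plt.xlabel("Time")
plt.ylabel("VADER compound score")
plt.title("Headline sentiment over time")
plt.show()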

Step 8:

Remember that we wrote a predefined method to analyze sentiment from CSV data.


Step 9:

The next step is to extract the news article summaries from the extracted URLs. Because some articles return a 403 ERROR, not all of them can be scraped properly.

Example of one of the files:

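A minimal sketch of pulling a summary for each extracted URL with Newspaper3k (the error handling mirrors the 403 caveat above; the helper name summarize_urls is hypothetical, an assumption about the file shown here):

from newspaper import Article

def summarize_urls(urls):  # hypothetical helper name
    summaries = []
    for url in urls:
        try:
            article = Article(url)
            article.download()
            article.parse()
            article.nlp()          # builds the summary (needs nltk punkt)
            summaries.append(article.summary)
        except Exception:          # e.g. 403 on download
            summaries.append("")
    return summaries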

Summarizing Pipeline 1:

  • To download our ticker's CSV file, we passed a ticker value to the function.
  • Created a ticker list, which was then used to scrape several tickers and their related CSV files.
  • Obtained stock data for a particular ticker.
  • Cleaned the text of the News Headline column.
  • (Optional) Using the function provided, manually declared the extra stop-word list and eliminated those words.
  • Ran sentiment analysis on the cleansed News Headlines.
  • Used a basic scatter plot to analyze the sentiment.
  • Scraped news articles into a data frame and a CSV file.
Functions Used in Pipeline 2

1. Function: google_search_stocknews(ticker, num=100, site="yahoo+finance"):

The "ticker" is used as a positional argument, "num" is the number of pages to search, and "site" can be any trustworthy website.

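A minimal sketch of this search step (the Google news-search URL parameters are assumptions, and scraping Google results is fragile and rate-limited in practice):

import requests
from bs4 import BeautifulSoup

def google_search_stocknews(ticker, num=100, site="yahoo+finance"):
    url = (f"https://www.google.com/search?q={site}+{ticker}"
           f"&tbm=nws&num={num}")  # tbm=nws restricts results to news
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]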

2. Function: strip_unwanted_urls(urls):

As the name implies, it removes the dirty URLs from the list and keeps the URLs that fit the standard.

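A minimal sketch (the exclusion list is an assumption; Google result links typically also need their tracking parameters stripped):

EXCLUDE = ("google.", "accounts", "policies", "support", "maps")

def strip_unwanted_urls(urls):
    cleaned = []
    for url in urls:
        if url.startswith("https://") and not any(x in url for x in EXCLUDE):
            cleaned.append(url.split("&")[0])  # drop tracking parameters
    return list(dict.fromkeys(cleaned))        # de-duplicate, keep order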

3. Function: scrape_articles(URLs):

The method scrapes each URL for text and trims it to a maximum of 350 words.

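A minimal sketch with the 350-word cap and the 403 caveat from Step 9 (extracting text from paragraph tags is an assumption about the original):

import requests
from bs4 import BeautifulSoup

def scrape_articles(URLs):
    articles = []
    for url in URLs:
        try:
            r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
            r.raise_for_status()      # some articles come back 403
        except requests.RequestException:
            articles.append("")       # keep the list aligned with the URLs
            continue
        soup = BeautifulSoup(r.text, "html.parser")
        words = " ".join(p.text for p in soup.find_all("p")).split()
        articles.append(" ".join(words[:350]))  # cap at 350 words
    return articles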

4. Function: create_csv(summaries, scores, final_urls_lists):

This is self-explanatory: we export all of the needed information to a CSV file.
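
A minimal sketch (the column and output-file names are assumptions):

import pandas as pd

def create_csv(summaries, scores, final_urls_lists):
    df = pd.DataFrame({"Summary": summaries,
                       "Sentiment": scores,
                       "URL": final_urls_lists})
    df.to_csv("summaries_with_sentiment.csv", index=False)
    return df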

Reproducing Steps:

Step 1: Create a ticker list and pass it to function 1:


Step 2:


Step 3: To build the final URLs list, remove any unneeded URLs:

Summarizing Pipeline 2:
  • Scraped the URLs for the corresponding ticker and news agencies.
  • Removed any unwanted URLs from the URL list.
  • Scraped the news articles at the matching URLs.
  • Summarized the scraped articles using the Pegasus model.
  • Made a CSV file with all of the required fields.

For any further queries, contact X-Byte Enterprise Crawling today or request a quote!
