Wednesday, 22 December 2021

HOW TO SCRAPE HERTZ CAR INVENTORY USING PYTHON?



Hertz Global Holdings (HTZ) has received substantial media coverage because of its recent bankruptcy filing and its attempt to raise an additional $500 million in equity after the bankruptcy announcement. Hertz filed for Chapter 11 bankruptcy in expectation of a reorganization. The equity raise was halted, as noted in Hertz's 8-K filing of 18 June 2020.

Summary

This blog describes a study that used the Python programming language to download car data from the Hertz website and save it in “DataFrames”, which we exported to “.csv” files and then manipulated in Excel. The code used is given at the end of this blog.

To extract the car data, we searched the Hertz website for vehicles for sale within 10,000 miles of St. Louis, MO. The same 10,000-mile-radius download was run for New York City, NY and San Francisco, CA; unless otherwise stated, the statistics given relate to the “St. Louis” search. The search yielded a few descriptive details, but not a comprehensive view of the size of the fleet portfolio that is for sale.

The St. Louis search initially reported a total of 26,054 vehicles; however, paging through the results returned only 11,994 vehicles before the site started showing blank pages. This raises questions about the accuracy of the search function. Various online sources mention that Hertz originally put more than 50,000 vehicles up for sale, but without a complete universe of distinct Vehicle Identification Numbers (VINs), combined with an organized scraping approach, we cannot validate the size of the fleet for sale with certainty. In other words, the number of results the search reported did not match the number of results that could actually be displayed. This is most likely related to the programming or display structure of the used car sales website.
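The "page until the results go blank" behavior described above can be sketched as a loop that stops at the first empty page instead of trusting the reported total. This is only a minimal sketch: `collect_listings` and `fake_fetch` are hypothetical names, and the stand-in fetcher simply pretends that 100 listings are reachable so the example runs without touching the website.

```python
from typing import Callable, List


def collect_listings(fetch_page: Callable[[int], List[str]],
                     page_size: int = 35,
                     max_pages: int = 800) -> List[str]:
    """Page through search results and stop at the first blank page."""
    listings: List[str] = []
    for page in range(max_pages):
        # The offset plays the role of the `start` parameter in the search URL.
        batch = fetch_page(page * page_size)
        if not batch:  # blank page: nothing further can be displayed
            break
        listings.extend(batch)
    return listings


# Stand-in fetcher for illustration only (assumption: 100 reachable listings).
def fake_fetch(start: int) -> List[str]:
    reachable = 100
    return ["vehicle %d" % n for n in range(start, min(start + 35, reachable))]


found = collect_listings(fake_fetch)
print(len(found))
```

Counting what the loop actually collects, rather than what the search header reports, is what exposes a gap like the one between 26,054 and 11,994.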

Data

The sample shown gives a brief look at some of the data from the St. Louis search. Some descriptions are longer than others. For instance, a couple of entries in the St. Louis data list only ‘1992’ as the complete description, with a mileage of ‘1 mile’ and no other details. This is common in larger datasets and is not a concern for us, because it is not material relative to the total size of the data from the search. In addition, 215 entries are for vehicles that require calling the location for a price quote, so these 215 items were removed from the St. Louis data. Sample statistics for the St. Louis data are given here:
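Removing the entries that have no usable price can be sketched in pandas as follows. This is an assumption-laden illustration rather than the cleaning actually performed (which was done in Excel): `drop_unpriced` is a hypothetical helper, and the "Call for price" and empty price strings are placeholders, since the exact text shown for unpriced listings is not recorded here.

```python
import pandas as pd


def drop_unpriced(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows whose Price parses as a number."""
    cleaned = df.copy()
    # Strip "$" and "," and convert; anything non-numeric becomes NaN.
    cleaned["Price"] = pd.to_numeric(
        cleaned["Price"].astype(str).str.replace(r"[$,]", "", regex=True),
        errors="coerce",
    )
    return cleaned.dropna(subset=["Price"]).reset_index(drop=True)


# Illustrative rows; the second and third price strings are hypothetical.
sample = pd.DataFrame({
    "Description": ["2019 CADILLAC XT5 Premium Luxury SUV", "1992", "Unpriced example"],
    "Price": ["$33,279", "Call for price", ""],
    "Mileage": ["24664", "1", "30000"],
})
priced = drop_unpriced(sample)
print(priced)
```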

Example (St. Louis Zip Code Search):
  • Vehicle Title & Description: “2019 CADILLAC XT5 Premium Luxury SUV”
  • Listed Price: $33,279
  • Mileage: 24,664
Descriptive Statistics Overview:
  • Total Vehicles: 11,994
  • Average Price: $18,472
  • Total Sum of Sales Value: $217,583,305
  • Maximum Price: 2019 Chevrolet Corvette Z06 3LZ Coupe: $67,995
  • Minimum Price: 2016 Kia Rio LX Sedan: $6,877
  • Average Year: 2018.70
  • Average Mileage: 33,350
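The statistics above were produced in Excel; an equivalent pandas calculation might look like the sketch below. The `summarize` helper and the three sample rows are illustrative only (the titles and prices are taken from the figures above, but the mileage values in the second and third rows are made up), and the sketch assumes the year is the first word of the title, stored in a "Split 1" column. As a side check, dividing the total sales value by the average price gives roughly 11,779, which matches 11,994 minus the 215 unpriced entries, so the price statistics appear to cover priced vehicles only.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Compute the same kind of overview statistics as the list above."""
    year = pd.to_numeric(df["Split 1"], errors="coerce")  # assumes year is the first title word
    return {
        "total_vehicles": len(df),
        "average_price": df["Price"].mean(),
        "total_value": df["Price"].sum(),
        "max_price": df["Price"].max(),
        "min_price": df["Price"].min(),
        "average_year": year.mean(),
        "average_mileage": df["Mileage"].mean(),
    }


# Illustrative rows only; mileage for the last two rows is invented.
sample = pd.DataFrame({
    "Description": ["2019 CADILLAC XT5 Premium Luxury SUV",
                    "2019 Chevrolet Corvette Z06 3LZ Coupe",
                    "2016 Kia Rio LX Sedan"],
    "Price": [33279, 67995, 6877],
    "Mileage": [24664, 10000, 40000],
    "Split 1": ["2019", "2019", "2016"],
})
stats = summarize(sample)
print(stats)
```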

Image 1: Portfolio Sales Value in terms of Make


Image 2: Vehicle Data Buckets Summary Statistics in terms of Search Location


Closing

The data show that the average mileage, price, and age are comparable across the samples. On average, a car that Hertz is selling is priced at around $18,500. The average mileage is around 33,000–34,000 miles for the cars put up for sale, and most cars are 2018–2019 models. We also observe that Hertz significantly favors Chevrolet models. However, without a list of all unique vehicles offered for sale, we cannot draw firmer conclusions.
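A breakdown of sales value by make, like the one behind the observation about Chevrolet, could be sketched with a pandas `groupby`. This assumes the make is the second word of the title (the "Split 2" column) and upper-cases it so that "CADILLAC" and "Chevrolet" style spellings group consistently; `value_by_make` is a hypothetical helper and the last sample row is invented for illustration.

```python
import pandas as pd


def value_by_make(df: pd.DataFrame, make_column: str = "Split 2") -> pd.DataFrame:
    """Count vehicles and total/average price per make, largest value first."""
    make = df[make_column].str.upper()  # normalize mixed-case makes
    return (df.groupby(make)["Price"]
              .agg(vehicles="count", total_value="sum", average_price="mean")
              .sort_values("total_value", ascending=False))


# Illustrative rows; the fourth row is hypothetical.
sample = pd.DataFrame({
    "Split 2": ["CADILLAC", "Chevrolet", "Kia", "CHEVROLET"],
    "Price": [33279, 67995, 6877, 20000],
})
by_make = value_by_make(sample)
print(by_make)
```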

Next Steps:

The next step is to get a complete idea of how much money Hertz could raise by selling the portfolio and how that might affect the company going forward. To do that, the following steps would be tried:

  • Assemble a complete universe of the cars available for sale
  • Based on that universe, calculate an expected value of the whole sales portfolio
  • Use financial statements to determine what effect the sales might have on recoveries for Hertz debt holders as Hertz continues the bankruptcy proceedings
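Until a complete universe is available, the second step can only be approximated. One naive approach, shown here purely as a sketch, is to scale the sample's average price up to an assumed fleet size. `extrapolate_portfolio_value` is a hypothetical helper; the 50,000 figure is the unvalidated number mentioned by online sources, and the count of 11,779 priced vehicles is an assumption (11,994 results minus the 215 unpriced entries).

```python
def extrapolate_portfolio_value(sample_total_value: float,
                                sample_count: int,
                                universe_count: int) -> float:
    """Scale the sample's average price to an assumed number of vehicles."""
    average_price = sample_total_value / sample_count
    return average_price * universe_count


# Assumptions: 11,779 priced vehicles in the sample, 50,000 vehicles in the universe.
estimate = extrapolate_portfolio_value(217_583_305, 11_779, 50_000)
print("Naive portfolio estimate: ${:,.0f}".format(estimate))
```

This treats the sample as representative of the whole fleet, which is exactly what cannot be confirmed without the full list of unique vehicles.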

In a bankruptcy reorganization, many factors can affect recoveries, depending on where your securities sit in the company’s capital structure. Generally, debt holders receive some part of the total principal in recoveries, or equity in a newly formed company. That estimated or perceived amount has a large impact on the market value of the existing securities. Modeling possible recoveries can provide insight into what the currently outstanding securities are worth.

Image 3: Example of a Sliced File Format

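The sliced format, with one column per word of the vehicle title, can also be produced directly in pandas. The sketch below uses `Series.str.split(expand=True)` as an alternative to the word-by-word loop in the scraping code; the "none" filler and the "Split N" column names follow that code, and the two titles come from the St. Louis statistics.

```python
import pandas as pd

titles = pd.Series([
    "2019 CADILLAC XT5 Premium Luxury SUV",
    "2016 Kia Rio LX Sedan",
])

# One column per word; shorter titles are padded with "none".
parts = titles.str.split(expand=True).fillna("none")
parts.columns = ["Split %d" % (n + 1) for n in range(parts.shape[1])]
print(parts)
```

Unlike a fixed set of placeholder columns, this yields only as many columns as the longest title needs.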

Image 4: Python Code

from requests import get
from bs4 import BeautifulSoup
import pandas as pd

RESULTS_PER_PAGE = 35   # the site lists 35 results per page
MAX_PAGES = 800         # 35 * 800 = 28,000 results; the most the search reported was just over 26,000
MAX_TITLE_WORDS = 14    # one "Split" column per word of the title

# The "Split" columns are placeholders for the words of the multi-word title,
# so in Excel we can analyze by make, model, etc.
split_columns = ['Split ' + str(n) for n in range(1, MAX_TITLE_WORDS + 1)]
rows = []  # one dictionary per vehicle; converted to a DataFrame once at the end

# geoZip=10007 matches the New York City output file below. In our St. Louis
# 10,000 mile radius search we did not download all 26,000 results because
# blank pages started just before 12,000.
for page in range(MAX_PAGES):
    url_insert = page * RESULTS_PER_PAGE  # offset, incremented by 35 results per page
    url = 'https://www.hertzcarsales.com/used-cars-for-sale.htm?start=' + str(url_insert) + '&geoRadius=10000&geoZip=10007'
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    li_class = html_soup.find_all('li', class_='item hproduct clearfix closed certified primary')
    print(page + 1)
    print(url)
    for item in li_class:
        title_long = item.find('a', class_='url').text.strip("\n")
        title_split = title_long.split()
        # pre-split placeholder list: up to 14 columns, one for each word in the title
        blank_list = ["none"] * MAX_TITLE_WORDS
        for a in range(min(len(title_split), MAX_TITLE_WORDS)):
            blank_list[a] = title_split[a]
        price = item.find('span', class_='value').text.strip("\n")
        data = item.find('div', class_='gv-description')
        mileage = data.span.span.text.strip("\n")
        # the line below removes the text "miles" from our mileage column
        mileage_clean = ''.join([ch for ch in mileage if ch.isdigit()])
        row = {'Description': title_long, 'Price': price, 'Mileage': mileage_clean}
        row.update(zip(split_columns, blank_list))
        rows.append(row)
        print(len(rows))  # running count of vehicles collected

# DataFrame.append was removed in pandas 2.0, so the frame is built from the list of rows
df_middle = pd.DataFrame(rows, columns=['Description', 'Price', 'Mileage'] + split_columns)
df_middle.to_csv('C:\\Users\\james\\PycharmProjects\\workingfiles\\Webscraping\\Hertz\\output_NewYorkCity.csv')

If you want to know more about how to scrape car data or extract car API data using Python, you can contact X-Byte Enterprise Crawling or ask for a free quote!