Monday, 13 December 2021

HOW TO USE WEB SCRAPING AND PYTHON TO PLAN YOUR DREAM VACATION

Have you dreamed of taking a dream vacation, only to be put off by housing prices? Or do you simply not have the time to keep comparing options? If so, you will enjoy this blog.

We have created a data scraper that extracts Airbnb listing data based on user input (date, location, total guests, guest types), puts the data into a well-formatted Pandas DataFrame, filters it by price (keeping only the posts within the user's range), and finally sends the user an automatic email containing the filtered posts. You just need to run a couple of Python scripts to get the results.

Let's dive in. You can clone the project at:

https://xbyte.io/Airbnb_scrapy

Python Modules and Libraries

For this project we mainly used the following libraries:

  • Selenium: A widely used framework for testing web applications. It is also great for scraping, because it drives a real browser that can click and type as if you were browsing the site yourself.
  • BeautifulSoup: A wonderful Python library that helps you pull data out of HTML and XML files.
  • smtplib: Defines an SMTP (Simple Mail Transfer Protocol) client session object that can be used to send email.
  • Pandas: An open-source data analysis library that is extremely useful when working with data. Its key data structures are the Series and the DataFrame.
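To give a feel for the BeautifulSoup pattern used throughout this post, here is a toy sketch. The HTML and the class name `price` are made up for illustration; they are not real Airbnb markup.

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for a scraped page; the "price" class is hypothetical.
html = '<div class="price">$80 / night</div><div class="price">$120 / night</div>'

# Parse the markup, find every element with a given class, and pull out its
# text: the same find_all + list-comprehension pattern used later in this post.
soup = BeautifulSoup(html, 'html.parser')
prices = [tag.text for tag in soup.find_all('div', class_='price')]
print(prices)
```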

To get started with this project, you have to download a WebDriver (a tool that lets Selenium control a browser and navigate web pages). Since we use Chrome here, we downloaded ChromeDriver. You can download it from:

ChromeDriver: https://chromedriver.chromium.org

Source Code

Let's go through the code. Read the comments to understand more.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time
import pandas as pd

# This is the path where I stored my chromedriver
PATH = "/Users/juanpih19/Desktop/Programs/chromedriver"

class AirbnbBot:

    # Class constructor that takes location, stay (Month, Week, Weekend),
    # number of guests, and type of guests (Adults, Children, Infants)
    def __init__(self, location, stay, number_guests, type_guests):
        self.location = location
        self.stay = stay
        self.number_guests = number_guests
        self.type_guests = type_guests
        self.driver = webdriver.Chrome(PATH)

    # The 'search()' function will do the searching based on user input
    def search(self):

        # The driver will take us to the Airbnb website
        self.driver.get('https://www.airbnb.com')
        time.sleep(1)

        # This will find the location's tab xpath, type the desired location
        # and hit enter so we move the driver to the next tab (check in)
        location = self.driver.find_element_by_xpath('//*[@id="bigsearch-query-detached-query-input"]')
        location.send_keys(Keys.RETURN)
        location.send_keys(self.location)
        location.send_keys(Keys.RETURN)

        # It was difficult to scrape every number on the calendar,
        # so both the check-in and check-out dates are flexible.
        flexible = self.driver.find_element_by_xpath('//*[@id="tab--tabs--1"]')
        flexible.click()

        # Even though we have flexible dates, we can choose if
        # the stay is for the weekend or for a week or month

        # if stay is for a weekend we find the xpath, click it and hit enter
        if self.stay in ['Weekend', 'weekend']:
            weekend = self.driver.find_element_by_xpath('//*[@id="flexible_trip_lengths-weekend_trip"]/button')
            weekend.click()
            weekend.send_keys(Keys.RETURN)

        # if stay is for a week we find the xpath, click it and hit enter
        elif self.stay in ['Week', 'week']:
            week = self.driver.find_element_by_xpath('//*[@id="flexible_trip_lengths-one_week"]/button')
            week.click()
            week.send_keys(Keys.RETURN)

        # if stay is for a month we find the xpath, click it and hit enter
        elif self.stay in ['Month', 'month']:
            month = self.driver.find_element_by_xpath('//*[@id="flexible_trip_lengths-one_month"]/button')
            month.click()
            month.send_keys(Keys.RETURN)

        else:
            pass

        # Finds the guests xpath and clicks it
        guest_button = self.driver.find_element_by_xpath('/html/body/div[5]/div/div/div[1]/div/div/div[1]/div[1]/div/header/div/div[2]/div[2]/div/div/div/form/div[2]/div/div[5]/div[1]')
        guest_button.click()

        # Based on user input self.type_guests and self.number_guests

        # if type_guests are adults
        # it will add as many adults as specified in self.number_guests
        if self.type_guests in ['Adults', 'adults']:
            adults = self.driver.find_element_by_xpath('//*[@id="stepper-adults"]/button[2]')
            for num in range(int(self.number_guests)):
                adults.click()

        # if type_guests are children
        # it will add as many children as specified in self.number_guests
        elif self.type_guests in ['Children', 'children']:
            children = self.driver.find_element_by_xpath('//*[@id="stepper-children"]/button[2]')
            for num in range(int(self.number_guests)):
                children.click()

        # if type_guests are infants
        # it will add as many infants as specified in self.number_guests
        elif self.type_guests in ['Infants', 'infants']:
            infants = self.driver.find_element_by_xpath('//*[@id="stepper-infants"]/button[2]')
            for num in range(int(self.number_guests)):
                infants.click()

        else:
            pass


        # The guests tab is the last tab we need to fill in before searching
        # If I hit enter, the driver would not search,
        # so I clicked on another element first in order to find the search button's xpath
        x = self.driver.find_element_by_xpath('//*[@id="field-guide-toggle"]')
        x.click()
        x.send_keys(Keys.RETURN)


        # I find the search button and click it to search for all options
        search = self.driver.find_element_by_css_selector('button._sxfp92z')
        search.click()


    # This function will scrape all the information about every option
    # on the first page
    def scraping_aribnb(self):

        # Maximize the window
        self.driver.maximize_window()

        # Gets the current page source
        src = self.driver.page_source

        # We create a BeautifulSoup object and feed it the current page source
        soup = BeautifulSoup(src, features='lxml')

        # Find the class that contains all the options and store it
        # on list_of_houses variable
        list_of_houses = soup.find('div', class_ = "_fhph4u")

        # Type of properties list - using find_all function
        # found the class that contains all the types of properties
        # Used a list comp to append them to list_type_property
        type_of_property = list_of_houses.find_all('div', class_="_1tanv1h")
        list_type_property = [ i.text for i in type_of_property]

        # Host description list - using find_all function
        # found the class that contains all the host descriptions
        # Used a list comp to append them to list_host_description
        host_description = list_of_houses.find_all('div', class_='_5kaapu')
        list_host_description = [ i.text for i in host_description]

        # Number of bedrooms and bathrooms - using find_all function
        # bedrooms_bathrooms and other_amenities used the same class
        # Did some slicing so I could append each item to the right list
        number_of_bedrooms_bathrooms = list_of_houses.find_all('div', class_="_3c0zz1")
        list_bedrooms_bathrooms = [ i.text for i in number_of_bedrooms_bathrooms]
        bedrooms_bathrooms = list_bedrooms_bathrooms[::2]
        other_amenities = list_bedrooms_bathrooms[1::2]

        # Date - using find_all function
        # found the class that contains all the dates
        # Used a list comp to append them to list_date
        dates = list_of_houses.find_all('div', class_="_1v92qf0")
        list_dates = [date.text for date in dates]

        # Stars - using find_all function
        # found the class that contains all the stars
        # Used a list comp to append them to list_stars
        stars = list_of_houses.find_all('div', class_ = "_1hxyyw3")
        list_stars = [star.text[:3] for star in stars]

        # Price - using find_all function
        # found the class that contains all the prices
        # Used a list comp to append them to list_prices
        prices = list_of_houses.find_all('div', class_ = "_1gi6jw3f" )
        list_prices = [price.text for price in prices ]


        # putting the lists with data into a Pandas data frame
        airbnb_data = pd.DataFrame({'Type' : list_type_property, 'Host description': list_host_description, 'Bedrooms & bathrooms': bedrooms_bathrooms, 'Other amenities': other_amenities,
                'Date': list_dates,  'Price': list_prices})

        # Saving the DataFrame to a csv file
        airbnb_data.to_csv('Airbnb_data.csv', index=False)


if __name__ == '__main__':
    vacation = AirbnbBot('New York', 'week', '2', 'adults')
    vacation.search()
    time.sleep(2)
    vacation.scraping_aribnb()

A few XPaths are not displayed fully in the snippet; however, you don't need to see the full XPath to understand the project. If you want the code for personal use, you can get it from the GitHub link given here.

The code above contains two methods: search() and scraping_aribnb().

Search()

This method uses Selenium to go to the Airbnb website and fill in the tabs with the information the user passed to the constructor, which here is "New York", "week", "2", and "adults".

Here is the procedure the search() method follows, in plain English:

search:
get website address
The address will take you to Airbnb's main page
Location, Check In, Check out, Guests and Search options will be displayed

# Location
Find xpath for location
Click on it
Enter desired location
Hit Enter, by hitting enter you will move to the next option (Check In)

# Check in - Check out
Check In and Check out options are flexible
Find xpath for Flexible
Click on it
Once Flexible is clicked three options will be displayed: Weekend, Week, Month
In the constructor the user specifies if the stay is for a week, weekend or month
Click on the right option
Hit Enter, to move to the next option

# Guests
The constructor provides us with the number of guests and type of guests (Adults, Children, Infants)
Find the type of guests xpath
click on it as many times as specified in number_guests
for num in range(int(self.number_guests)):
  click()

# Search
Up to this point
We have location, flexible check in and check out date, guests
Find the xpath for search
The xpath did not work
Find the css selector that contains the search button
Click on it

Scraping_aribnb()

Once search() has taken care of entering the information and bringing us to the available options, scraping_aribnb() extracts the data for every option on the first page and saves it to a CSV file.

Let's look at all the data that a single post provides.

Six columns are filled with the data from each post (the post itself is not included in the dataset).

Let's go through the procedure that the scraping_aribnb() method follows, in plain English:

scraping airbnb posts:
  maximize the window
  src = get current page source (HTML code)
  
  soup = BeautifulSoup(src, features='lxml')
  
  list_of_houses = with the beautifulsoup object find 
                   the class containing all the posts with all 
                   its information
                        
  
  # list_type_property
  type_of_properties = list_of_houses.find (all) the class
                       containing all type of properties
  list_type_property = [i.text for i in type_of_properties]
  
  
  # list_host_description
  host_description = list_of_houses.find (all) the class
                       containing all host descriptions
  list_host_description = [i.text for i in host_description]
  
  
  # bedrooms_bathrooms, other_amenities
  number_of_bedrooms_bathrooms = list_of_houses.find (all) the class
                                  containing all the amenities
  
  There are two types of amenities: bedrooms and bathrooms, and others
  bedrooms_bathrooms = list_bedrooms_bathrooms[::2]
  other_amenities = list_bedrooms_bathrooms[1::2]
  
  
   # list_prices
   prices = list_of_houses.find (all) the class
                       containing all the prices
   list_prices = [i.text for i in prices]
   
   
   put each list into a dictionary and then put it into a data frame
   save it
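
The even/odd slicing trick above is worth isolating. When two different kinds of values share one CSS class, they alternate in the scraped list, so slicing with a step of 2 separates them. The sample strings below are made up, not real scraped values:

```python
# Hypothetical scraped values: bedroom/bathroom info and other amenities
# alternate in the list because they share the same class on the page.
scraped = ['2 beds - 1 bath', 'Wifi - Kitchen', '3 beds - 2 baths', 'Pool - Wifi']

bedrooms_bathrooms = scraped[::2]   # even indices: 0, 2, ...
other_amenities = scraped[1::2]     # odd indices: 1, 3, ...
print(bedrooms_bathrooms, other_amenities)
```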

We Have a Good Dataset. What Next?

Now we need to know how much the user is willing to pay for housing. Based on that amount, we filter the dataset and send the user the most affordable options.
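The filtering idea itself is a one-liner in Pandas: build a boolean mask and keep only the rows that satisfy it. Here is a minimal sketch with made-up rows and the same column name used in traveler.py:

```python
import pandas as pd

# Made-up listings standing in for the scraped Airbnb_data.csv.
data = pd.DataFrame({
    'Type': ['Entire apartment', 'Private room', 'Entire loft'],
    'cleaned price': [120, 65, 80],
})

budget = 80  # what the user is willing to pay

# Boolean mask: keep only the rows the user can afford.
affordable = data[data['cleaned price'] <= budget]
print(affordable)
```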

We have created a traveler.py file that takes the user's input and filters the dataset. We decided to do this in a separate file.

Just go through the code and read the comments to understand more.

import pandas as pd
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.application import MIMEApplication
from email.mime.text import MIMEText
from password import password

class Traveler:

    # Email address so the user can receive the filtered data
    # Stay: checks if it will be a week, month or weekend
    def __init__(self, email, stay):
        self.email = email
        self.stay = stay

    # This function creates a new csv file based on the options
    # that the user can afford
    def price_filter(self, amount):

        # The user will stay a month
        if self.stay in ['Month', 'month']:
            data = pd.read_csv('Airbnb_data.csv')

            # Monthly prices are usually over a $1,000.
            # Airbnb includes a comma in thousands making it hard to transform it 
            # from string to int.

            # This will create a column that takes only the digits
            # For example: $1,600 / month, this slicing will only take 1,600
            data['cleaned price'] = data['Price'].str[1:6]

            # list comp to replace every comma in every row with an empty string
            _l = [i.replace(',', '') for i in data['cleaned price']]
            data['cleaned price'] = _l

            # Once we got rid of commas, we convert every row to an int value
            int_ = [int(i) for i in data['cleaned price']]
            data['cleaned price'] = int_

            # We look for prices that are within the user's range
            # and save that to a new csv file
            result = data[data['cleaned price'] <= amount]
            return result.to_csv('filtered_data.csv', index=False)

        # The user will stay a week or a weekend
        elif self.stay in ['Weekend', 'weekend', 'week', 'Week']:
            data = pd.read_csv('Airbnb_data.csv')

            # Prices per night are usually between 2 and 3 digits. Example: $50 or $100

            # This will create a column that takes only the digits
            # For example: $80 / night, this slicing will only take 80
            data['cleaned price'] = data['Price'].str[1:4]

            # This time I used map() instead of a list comp, but it does the same thing.
            data['cleaned price'] = list(map(int, data['cleaned price']))

            # We look for prices that are within the user's range
            # and save that to a new csv file
            filtered_data = data[data['cleaned price'] <= amount]
            return filtered_data.to_csv('filtered_data.csv', index=False)

        else:
            pass

    def send_mail(self):
        # Create a multipart message
        # It takes the message body, subject, sender, receiver
        msg = MIMEMultipart()
        MESSAGE_BODY = 'Here is the list with possible options for your dream vacation'
        body_part = MIMEText(MESSAGE_BODY, 'plain')
        msg['Subject'] = "Filtered list of possible airbnb's"
        msg['From'] = 'projects.creativity.growth@gmail.com'
        msg['To'] =  self.email

        # Attaching the body part to the message
        msg.attach(body_part)

        # open and read the CSV file in binary
        with open('filtered_data.csv','rb') as file:

            # Attach the file with filename to the email
            msg.attach(MIMEApplication(file.read(), Name='filtered_data.csv'))

        # Create SMTP object
        smtp_obj = smtplib.SMTP('smtp.gmail.com', 587)
        smtp_obj.starttls()

        # Login to the server, email and password of the sender
        smtp_obj.login('projects.creativity.growth@gmail.com', password)

        # Convert the message to a string and send it
        smtp_obj.sendmail(msg['From'], msg['To'], msg.as_string())
        smtp_obj.quit()


if __name__ == "__main__":
    my_traveler = Traveler( 'juanpablacho19@gmail.com', 'week' )
    my_traveler.price_filter(80)
    my_traveler.send_mail()

The Traveler class has two methods: price_filter(amount) and send_mail().

Price_filter(amount)

This method takes the amount of money the user is willing to spend, filters the dataset down to the affordable options, and writes the results to a new CSV file.
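As a sketch of a slightly more robust alternative to the fixed-width string slicing used in price_filter(), a regular expression can pull out the dollar amount no matter how many digits it has. The sample price strings below are assumptions about Airbnb's format, not guaranteed:

```python
import re

def parse_price(price_text):
    """Extract the leading dollar amount as an int, handling thousands commas.

    For example, "$1,600 / month" becomes 1600 and "$80 / night" becomes 80.
    """
    match = re.search(r'\$([\d,]+)', price_text)
    if match is None:
        raise ValueError(f'no price found in {price_text!r}')
    return int(match.group(1).replace(',', ''))

print(parse_price('$1,600 / month'), parse_price('$80 / night'))
```

This avoids having to know in advance whether prices are two, three, or four digits long.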

Send_mail()

This method uses the smtplib library to send the user an email with the filtered CSV file attached. The body of the email is "Here is the list with possible options for your dream vacation". The email is sent automatically when you run the traveler.py file.
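The message-building half of send_mail() can be exercised offline, without an SMTP server. Here is a minimal sketch that builds the same kind of multipart message and attaches a small CSV payload; the addresses and CSV contents are made up:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.application import MIMEApplication
from email.mime.text import MIMEText

# Build the multipart container: subject, sender, receiver (all made up here).
msg = MIMEMultipart()
msg['Subject'] = "Filtered list of possible airbnb's"
msg['From'] = 'sender@example.com'
msg['To'] = 'traveler@example.com'

# Attach the plain-text body.
msg.attach(MIMEText('Here is the list with possible options for your dream vacation', 'plain'))

# Attach a tiny in-memory CSV instead of reading filtered_data.csv from disk.
csv_bytes = b'Type,Price\nPrivate room,$65 / night\n'
msg.attach(MIMEApplication(csv_bytes, Name='filtered_data.csv'))
```

Only the actual sending (smtplib.SMTP, starttls, login, sendmail) needs a live server and credentials.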

How to Run?

Ideally, you should run the airbnb_scrapy.py file first, so it collects the most current data, and then run the traveler.py file to filter the data and send the email.

python airbnb_scrapy.py

python traveler.py

Conclusion

This is another example of how powerful Python is. The project gives you a great introduction to web scraping with BeautifulSoup, data analysis and cleaning with Pandas, browser automation and testing with Selenium, and email automation with smtplib.

As a data analyst, you will not always get nicely formatted data, and the company you need data from might not have an API, so at times you will need web scraping skills to collect the data yourself.

If you have ideas about how the code could be improved, or if you wish to expand on the project, feel free to contact us!
