Have you ever dreamed of going on a dream vacation, only to be put off by housing prices? Or do you simply not have the time to keep checking the options? If so, you will enjoy this blog.
We have built a data scraper that extracts Airbnb listing data based on user input (date, location, total guests, guest type), puts the data into a well-formatted Pandas DataFrame, filters it by price (keeping only the posts whose price falls within the user's range), and finally sends the user an automatic email with the filtered posts. You just need to run a couple of Python scripts to get the results.
Let's dive into the details. You can clone the project at:
https://xbyte.io/Airbnb_scrapy

Python Modules and Libraries
For this project we mainly used the following libraries:
- Selenium: A widely used framework for automating and testing web applications. It is great for this project because it drives a real browser through a WebDriver, clicking around as if you were browsing the site yourself.
- BeautifulSoup: A wonderful Python library for pulling data out of HTML and XML files.
- smtplib: Defines an SMTP (Simple Mail Transfer Protocol) client session object that can be used to send emails.
- Pandas: An open-source data analysis tool that comes in handy whenever you work with data. Its key data structures are DataFrames and Series.
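As a tiny, self-contained illustration of the BeautifulSoup pattern used throughout this project, here is a sketch that parses an invented HTML fragment (the class names and content are made up for the example, and `html.parser` is used so no extra parser needs to be installed):

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the structure of a listing card
html = """
<div class="listing">
  <div class="type">Entire apartment</div>
  <div class="price">$80 / night</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag; .text pulls out the visible text
types = [tag.text for tag in soup.find_all("div", class_="type")]
prices = [tag.text for tag in soup.find_all("div", class_="price")]
print(types, prices)  # ['Entire apartment'] ['$80 / night']
```

The real scraper applies exactly this `find_all` + list-comprehension pattern to the page source Selenium retrieves.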
To get started with this project, you need to download a WebDriver (a tool that lets Selenium navigate web pages). Since we used Chrome, we downloaded ChromeDriver, available here:
ChromeDriver: https://chromedriver.chromium.org
Source Code
Let's go through the code. Read the comments to understand more.
```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd

# This is the path where I stored my chromedriver
PATH = "/Users/juanpih19/Desktop/Programs/chromedriver"

class AirbnbBot:
    # Class constructor that takes location, stay (Month, Week, Weekend),
    # number of guests and type of guests (Adults, Children, Infants)
    def __init__(self, location, stay, number_guests, type_guests):
        self.location = location
        self.stay = stay
        self.number_guests = number_guests
        self.type_guests = type_guests
        self.driver = webdriver.Chrome(PATH)

    # The 'search()' function does the searching based on user input
    def search(self):
        # The driver takes us to the Airbnb website
        self.driver.get('https://www.airbnb.com')
        time.sleep(1)

        # Find the location tab by xpath, type the desired location
        # and hit Enter so the driver moves to the next tab (Check In)
        location = self.driver.find_element_by_xpath('//*[@id="bigsearch-query-detached-query-input"]')
        location.send_keys(Keys.RETURN)
        location.send_keys(self.location)
        location.send_keys(Keys.RETURN)

        # It was difficult to scrape every number on the calendar,
        # so both the check-in and check-out dates are flexible
        flexible = location.find_element_by_xpath('//*[@id="tab--tabs--1"]')
        flexible.click()

        # Even though the dates are flexible, we can choose whether
        # the stay is for a weekend, a week or a month
        if self.stay in ['Weekend', 'weekend']:
            weekend = self.driver.find_element_by_xpath('//*[@id="flexible_trip_lengths-weekend_trip"]/button')
            weekend.click()
            weekend.send_keys(Keys.RETURN)
        elif self.stay in ['Week', 'week']:
            week = self.driver.find_element_by_xpath('//*[@id="flexible_trip_lengths-one_week"]/button')
            week.click()
            week.send_keys(Keys.RETURN)
        elif self.stay in ['Month', 'month']:
            month = self.driver.find_element_by_xpath('//*[@id="flexible_trip_lengths-one_month"]/button')
            month.click()
            month.send_keys(Keys.RETURN)
        else:
            pass

        # Find the guests tab by xpath and click it
        guest_button = self.driver.find_element_by_xpath('/html/body/div[5]/div/div/div[1]/div/div/div[1]/div[1]/div/header/div/div[2]/div[2]/div/div/div/form/div[2]/div/div[5]/div[1]')
        guest_button.click()

        # Based on self.type_guests and self.number_guests,
        # click the right stepper as many times as needed
        if self.type_guests in ['Adults', 'adults']:
            adults = self.driver.find_element_by_xpath('//*[@id="stepper-adults"]/button[2]')
            for num in range(int(self.number_guests)):
                adults.click()
        elif self.type_guests in ['Children', 'children']:
            children = self.driver.find_element_by_xpath('//*[@id="stepper-children"]/button[2]')
            for num in range(int(self.number_guests)):
                children.click()
        elif self.type_guests in ['Infants', 'infants']:
            infants = self.driver.find_element_by_xpath('//*[@id="stepper-infants"]/button[2]')
            for num in range(int(self.number_guests)):
                infants.click()
        else:
            pass

        # The guests tab is the last one to fill before searching.
        # Hitting Enter would not trigger the search, so I click on a
        # random element to reveal the search button
        x = self.driver.find_element_by_xpath('//*[@id="field-guide-toggle"]')
        x.click()
        x.send_keys(Keys.RETURN)

        # Find the search button and click it to search for all options
        search = self.driver.find_element_by_css_selector('button._sxfp92z')
        search.click()

    # This function scrapes all the information about every option
    # on the first page
    def scraping_aribnb(self):
        # Maximize the window
        self.driver.maximize_window()

        # Get the current page source
        src = self.driver.page_source

        # Create a BeautifulSoup object and feed it the current page source
        soup = BeautifulSoup(src, features='lxml')

        # Find the class that contains all the options
        list_of_houses = soup.find('div', class_="_fhph4u")

        # Type of property for each listing
        type_of_property = list_of_houses.find_all('div', class_="_1tanv1h")
        list_type_property = [i.text for i in type_of_property]

        # Host description for each listing
        host_description = list_of_houses.find_all('div', class_='_5kaapu')
        list_host_description = [i.text for i in host_description]

        # Bedrooms/bathrooms and other amenities share the same class,
        # so alternate slicing splits them into the right lists
        number_of_bedrooms_bathrooms = list_of_houses.find_all('div', class_="_3c0zz1")
        list_bedrooms_bathrooms = [i.text for i in number_of_bedrooms_bathrooms]
        bedrooms_bathrooms = list_bedrooms_bathrooms[::2]
        other_amenities = list_bedrooms_bathrooms[1::2]

        # Dates
        dates = list_of_houses.find_all('div', class_="_1v92qf0")
        list_dates = [date.text for date in dates]

        # Star ratings (first three characters, e.g. '4.8')
        stars = list_of_houses.find_all('div', class_="_1hxyyw3")
        list_stars = [star.text[:3] for star in stars]

        # Prices
        prices = list_of_houses.find_all('div', class_="_1gi6jw3f")
        list_prices = [price.text for price in prices]

        # Put the lists into a Pandas DataFrame
        airbnb_data = pd.DataFrame({'Type': list_type_property,
                                    'Host description': list_host_description,
                                    'Bedrooms & bathrooms': bedrooms_bathrooms,
                                    'Other amenities': other_amenities,
                                    'Date': list_dates,
                                    'Price': list_prices})

        # Save the DataFrame to a csv file
        airbnb_data.to_csv('Airbnb_data.csv', index=False)

if __name__ == '__main__':
    vacation = AirbnbBot('New York', 'week', '2', 'adults')
    vacation.search()
    time.sleep(2)
    vacation.scraping_aribnb()
```
A few xpaths are not displayed in full in the snippet, but you don't need a fully displayed xpath to understand the project. If you want the code for personal use, you can get it from the GitHub link above.
The code above contains two methods: search() and scraping_aribnb().

search()

This method uses Selenium to go to the Airbnb website and fill in the tabs with the information the user passes to the constructor, in this case 'New York', 'week', '2' and 'adults'.

Here is the procedure the search() method follows, in plain English:
```text
search:
    get website address
    The address takes you to Airbnb's main page
    Location, Check In, Check Out, Guests and Search options are displayed

    # Location
    Find the xpath for location
    Click on it
    Enter the desired location
    Hit Enter; this moves you to the next option (Check In)

    # Check In - Check Out
    Check In and Check Out options are flexible
    Find the xpath for Flexible
    Click on it
    Once Flexible is clicked, three options are displayed: Weekend, Week, Month
    In the constructor the user specifies if the stay is for a week, weekend or month
    Click on the right option
    Hit Enter to move to the next option

    # Guests
    The constructor provides the number of guests and type of guests (Adults, Children, Infants)
    Find the type-of-guests xpath
    Click on it as many times as specified in number_guests:
        for num in range(int(self.number_guests)): click()

    # Search
    Up to this point we have the location, flexible check-in/check-out dates, and guests
    Find the xpath for search
    The xpath did not work
    Find the css selector that contains the search button
    Click on it
```
scraping_aribnb()

Once search() has taken care of entering the information and taking us to the available options, scraping_aribnb() extracts the data of every option on the first page and saves it to a CSV file.

Take a look at all the data a single post provides.

Six columns are filled with data from each post (the post itself is not included in the dataset).

Let's go through the procedure the scraping_aribnb() method follows, in plain English:
```text
scraping airbnb posts:
    maximize the window
    src = get current page source (HTML code)
    soup = BeautifulSoup(src, features='lxml')
    list_of_houses = with the BeautifulSoup object, find the class containing
                     all the posts with all their information

    # list_type_property
    type_of_properties = list_of_houses.find_all the class containing all types of properties
    list_type_property = [i.text for i in type_of_properties]

    # list_host_description
    host_description = list_of_houses.find_all the class containing all host descriptions
    list_host_description = [i.text for i in host_description]

    # bedrooms_bathrooms, other_amenities
    number_of_bedrooms_bathrooms = list_of_houses.find_all the class containing all the amenities
    There are two types of amenities: bedrooms and bathrooms, and others
    bedrooms_bathrooms = list_bedrooms_bathrooms[::2]
    other_amenities = list_bedrooms_bathrooms[1::2]

    # list_prices
    prices = list_of_houses.find_all the class containing all the prices
    list_prices = [i.text for i in prices]

    put each list into a dictionary, then into a DataFrame
    save it
```
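The alternating-slice trick used to split one combined list into bedrooms/bathrooms and other amenities can be seen in isolation (the sample strings below are invented):

```python
# One flat list as scraped: amenity summary and extras alternate per listing
list_bedrooms_bathrooms = [
    "2 beds, 1 bath", "Wifi, Kitchen",
    "1 bed, 1 bath", "Wifi, Free parking",
]

bedrooms_bathrooms = list_bedrooms_bathrooms[::2]   # items at even indices
other_amenities = list_bedrooms_bathrooms[1::2]     # items at odd indices

print(bedrooms_bathrooms)  # ['2 beds, 1 bath', '1 bed, 1 bath']
print(other_amenities)     # ['Wifi, Kitchen', 'Wifi, Free parking']
```

This works because both kinds of text share the same CSS class on the page, so they arrive interleaved in a single `find_all` result.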
We Have a Good Dataset. What Next?

Now we need to know how much the user is willing to pay for housing; based on that amount, we filter the dataset and send the user the most affordable options.

We created a traveler.py file that takes the user's input and filters the dataset. We decided to do this in a separate file.

Go through the code and read the comments to understand more.
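The cleaning-and-filtering step can be sketched with pandas alone. The listings and the $80 budget below are invented, and a regex extraction is used instead of fixed-width slicing so the same sketch handles both '$80 / night' and '$1,600 / month':

```python
import pandas as pd

# Invented sample listings; the real data comes from Airbnb_data.csv
data = pd.DataFrame({
    "Type": ["Entire apartment", "Private room", "Loft"],
    "Price": ["$75 / night", "$120 / night", "$1,600 / month"],
})

# Extract the digits (with thousands commas), drop the commas, cast to int
data["cleaned price"] = (
    data["Price"]
    .str.extract(r"\$([\d,]+)")[0]
    .str.replace(",", "", regex=False)
    .astype(int)
)

# Keep only the listings within the (hypothetical) $80 budget
affordable = data[data["cleaned price"] <= 80]
print(affordable["Type"].tolist())  # ['Entire apartment']
```

The filtered frame would then be written out with `to_csv`, just as the project does.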
```python
import pandas as pd
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.application import MIMEApplication
from email.mime.text import MIMEText
from password import password

class Traveler:
    # Email address so the user can receive the filtered data
    # Stay: whether it will be a week, month or weekend
    def __init__(self, email, stay):
        self.email = email
        self.stay = stay

    # This function creates a new csv file based on the options
    # that the user can afford
    def price_filter(self, amount):
        # The user will stay a month
        if self.stay in ['Month', 'month']:
            data = pd.read_csv('Airbnb_data.csv')
            # Monthly prices are usually over $1,000, and Airbnb includes
            # a comma in thousands, making it hard to convert the string
            # to an int. This creates a column with only the digits:
            # for example, '$1,600 / month' is sliced down to '1,600'
            data['cleaned price'] = data['Price'].str[1:6]
            # Replace every comma in every row with an empty string
            _l = [i.replace(',', '') for i in data['cleaned price']]
            data['cleaned price'] = _l
            # With the commas gone, convert every row to an int
            int_ = [int(i) for i in data['cleaned price']]
            data['cleaned price'] = int_
            # Keep prices within the user's range and save to a new csv
            result = data[data['cleaned price'] <= amount]
            return result.to_csv('filtered_data.csv', index=False)
        # The user will stay a week or a weekend
        elif self.stay in ['Weekend', 'weekend', 'week', 'Week']:
            data = pd.read_csv('Airbnb_data.csv')
            # Prices per night are usually 2 or 3 digits, e.g. $50 or $100.
            # This creates a column with only the digits:
            # for example, '$80 / night' is sliced down to '80'
            data['cleaned price'] = data['Price'].str[1:4]
            # This time I used map() instead of a list comp,
            # but it does the same thing
            data['cleaned price'] = list(map(int, data['cleaned price']))
            # Keep prices within the user's range and save to a new csv
            filtered_data = data[data['cleaned price'] <= amount]
            return filtered_data.to_csv('filtered_data.csv', index=False)
        else:
            pass

    def send_mail(self):
        # Create a multipart message with body, subject, sender, receiver
        msg = MIMEMultipart()
        MESSAGE_BODY = 'Here is the list with possible options for your dream vacation'
        body_part = MIMEText(MESSAGE_BODY, 'plain')
        msg['Subject'] = "Filtered list of possible airbnb's"
        msg['From'] = 'projects.creativity.growth@gmail.com'
        msg['To'] = self.email
        # Attach the body part to the message
        msg.attach(body_part)
        # Open and read the CSV file in binary
        with open('filtered_data.csv', 'rb') as file:
            # Attach the file with its filename to the email
            msg.attach(MIMEApplication(file.read(), Name='filtered_data.csv'))
        # Create the SMTP object
        smtp_obj = smtplib.SMTP('smtp.gmail.com', 587)
        smtp_obj.starttls()
        # Log in to the server with the sender's email and password
        smtp_obj.login('projects.creativity.growth@gmail.com', password)
        # Convert the message to a string and send it
        smtp_obj.sendmail(msg['From'], msg['To'], msg.as_string())
        smtp_obj.quit()

if __name__ == "__main__":
    my_traveler = Traveler('juanpablacho19@gmail.com', 'week')
    my_traveler.price_filter(80)
    my_traveler.send_mail()
```
The Traveler class has two methods: price_filter(amount) and send_mail().

price_filter(amount)

This method takes the amount of money the user is willing to spend, filters the dataset down to the affordable results, and creates a new CSV file with them.

send_mail()

This method uses the smtplib library to send the user an email with the filtered CSV file attached. The body of the email reads "Here is the list with possible options for your dream vacation". The email is sent automatically when traveler.py runs.
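The message assembly can be exercised without actually sending anything; the addresses and the CSV bytes below are placeholders:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication

# Build the same kind of message traveler.py sends; addresses are placeholders
msg = MIMEMultipart()
msg["Subject"] = "Filtered list of possible airbnb's"
msg["From"] = "sender@example.com"
msg["To"] = "traveler@example.com"
msg.attach(MIMEText("Here is the list with possible options for your dream vacation", "plain"))

# Attach CSV bytes as if read from filtered_data.csv in binary mode
csv_bytes = b"Type,Price\nEntire apartment,$75 / night\n"
msg.attach(MIMEApplication(csv_bytes, Name="filtered_data.csv"))

# The message now has two parts: the plain-text body and the attachment
print(len(msg.get_payload()))  # 2
```

Only the final `smtplib.SMTP` / `starttls` / `login` / `sendmail` steps need network access, which keeps this part easy to test locally.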
How to Run?
Ideally, you should run airbnb_scrapy.py first so it collects the most recent data, and then run traveler.py to filter the data and send the email:

python airbnb_scrapy.py
python traveler.py
Conclusion
This is yet another example of how powerful Python is. The project gives you a good grounding in web scraping with BeautifulSoup, data cleaning and analysis with Pandas, browser automation with Selenium, and email automation with smtplib.
As a data analyst, you will not always find your data nicely formatted, and the company you need data from may not offer an API, so at times you will need web scraping skills to collect it.
If you have ideas on how the code could be improved, or if you would like to build on the project, feel free to contact us!