TDM 40100: Project 11 — 2022
Motivation: In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially some sqlite3, containerization, and analysis work as well.
Context: This is the second in a series of 4 projects with a focus on web scraping that incorporates a variety of skills we’ve touched on in previous Data Mine courses. For this second project, we continue to build our suite of tools designed to scrape public housing data.
Scope: selenium, Python, web scraping
Questions
Question 1
If you did not complete the previous project, the code for the get_links function will be provided below on Monday, November 14th.
import time

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import StaleElementReferenceException


def get_links(search_term: str) -> list[str]:
    """
    Given a search term, return a list of web links for all of the resulting properties.
    """

    def _load_cards(driver):
        """
        Given the driver, scroll through the cards
        so that they all load.
        """
        cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
        while True:
            try:
                num_cards = len(cards)
                driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])
                time.sleep(2)
                cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
                if num_cards == len(cards):
                    break
                num_cards = len(cards)
            except StaleElementReferenceException:
                # every once in a while we will get a StaleElementReferenceException
                # because we are trying to access or scroll to an element that has changed.
                # this probably means we can skip it because the data has already loaded.
                continue

    links = []
    url = f"https://www.zillow.com/homes/for_sale/{'-'.join(search_term.split(' '))}_rb/"
    firefox_options = Options()
    # Headless mode means no GUI
    firefox_options.add_argument("--headless")
    firefox_options.add_argument("--disable-extensions")
    firefox_options.add_argument("--no-sandbox")
    firefox_options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Firefox(options=firefox_options)
    with driver as d:
        d.get(url)
        d.delete_all_cookies()
        while True:
            time.sleep(2)
            _load_cards(d)
            links.extend([e.get_attribute("href") for e in d.find_elements("xpath", "//a[@data-test='property-card-link' and @class='property-card-link']")])
            next_link = d.find_element("xpath", "//a[@rel='next']")
            if next_link.get_attribute("disabled") == "true":
                break
            url = next_link.get_attribute('href')
            d.delete_all_cookies()
            next_link.click()

    return links
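Assuming Selenium and geckodriver are available in your environment, the function can be called with any search term; the term below is only an illustration.

# hypothetical search term -- use whatever area you want to scrape
links = get_links("west lafayette in")
print(len(links))
print(links[:5])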
There is a lot of rich data on a home’s page. If you want to gauge the housing market in an area or for a search_term, there are two pieces of data that could be particularly useful: the "Price history" and "Public tax history" components of the page.
Check out zillow.com links for a couple different houses.
Let’s say you want to track the date, event, and price in a price_history table, and the year, property_tax, and tax_assessment in a tax_history table.
Write 2 CREATE TABLE statements to create the price_history and tax_history tables. In addition, create a houses table where the NUMBER_zpid is the primary key, along with an html column, which will store an HTML file. You can find the id in a house’s link. For example, www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/ has the id 43641432_zpid.
Use sqlite3 to create the tables in a database called $HOME/houses.db. You can do all of this from within Jupyter Lab.
%sql sqlite:///$HOME/houses.db
%%sql
CREATE TABLE ...
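Alternatively, the same tables can be created directly from Python with the sqlite3 module. The sketch below shows one plausible design; the column types (and the choice to store the zpid as TEXT) are assumptions, not the required schema, so adjust them to match your own decisions.

import os
import sqlite3

# connect to (and create, if needed) the database in your home directory
con = sqlite3.connect(os.path.expanduser("~/houses.db"))
cur = con.cursor()

# one possible houses table: the zpid is the primary key and html holds the page as a blob
cur.execute("""
CREATE TABLE IF NOT EXISTS houses (
    zpid TEXT PRIMARY KEY,
    html BLOB
);
""")

# the history tables hold just the columns described above; you may also want a column
# (for example, the zpid) tying each row back to its house
cur.execute("""
CREATE TABLE IF NOT EXISTS price_history (
    date TEXT,
    event TEXT,
    price INTEGER
);
""")

cur.execute("""
CREATE TABLE IF NOT EXISTS tax_history (
    year INTEGER,
    property_tax INTEGER,
    tax_assessment INTEGER
);
""")

con.commit()
con.close()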
Run the following queries to confirm and show your table schemas.
PRAGMA table_info(houses);
PRAGMA table_info(price_history);
PRAGMA table_info(tax_history);
- Code used to solve this problem.
- Output from running the code.
Question 2
Write a function called link_to_blob that takes a link and returns a blob of the HTML file.
- Navigate to the page.
- Sleep 2 seconds.
- Scroll so elements load up (think "Price and tax history", clicking "See complete tax history", and clicking "See complete price history", etc.).
- Create a .html file and write the driver’s page_source to the file.
- Open the file in rb mode and use the read method to read the file into binary format. Return the binary format object.
- Delete the .html file you created.
- Quit the driver by calling driver.quit().
In addition, write a function called blob_to_html that accepts a blob (like what is returned from link_to_blob) and returns the string containing the HTML content.
Demonstrate the functions by using link_to_blob to get the blob for a link, and then using blob_to_html to get the HTML content back from the returned value of link_to_blob.
Just print the first 500 characters of the result of blob_to_html.
If you are unsure how to do any of this, please feel free to post in Piazza!
Here is some skeleton code. The structure provided here works well for the problem.
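The following is a minimal sketch of one way to structure these two functions, assuming headless Firefox as in get_links. The XPath used to find the "see more" buttons and the temporary file name are only guesses/illustrations, not the required approach; inspect the page and adjust them.

import os
import time
import uuid

from selenium import webdriver
from selenium.webdriver.firefox.options import Options


def link_to_blob(link: str) -> bytes:
    """
    Given a link to a house's page, return the page's HTML as a binary blob.
    """
    firefox_options = Options()
    firefox_options.add_argument("--headless")
    driver = webdriver.Firefox(options=firefox_options)

    driver.get(link)
    time.sleep(2)

    # scroll to and click the "see more" buttons so the full price/tax history loads;
    # this XPath is only a guess at the markup -- dig into the HTML and adjust it
    for button in driver.find_elements("xpath", "//span[contains(text(), 'See complete')]"):
        driver.execute_script("arguments[0].scrollIntoView();", button)
        time.sleep(2)
        button.click()

    # write the page source to a temporary .html file, read it back in binary, then clean up
    filename = f"{uuid.uuid4()}.html"
    with open(filename, "w") as f:
        f.write(driver.page_source)

    with open(filename, "rb") as f:
        blob = f.read()

    os.remove(filename)
    driver.quit()

    return blob


def blob_to_html(blob: bytes) -> str:
    """
    Given a blob (like what link_to_blob returns), return the HTML content as a string.
    """
    return blob.decode("utf-8")

Note that button.click() can raise an exception if another element covers the button; wrapping the click in a try/except, or clicking via execute_script, is one way to work around that.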
Use this trick: the-examples-book.com/starter-guides/data-formats/xml#write-an-xpath-expression-to-get-every-div-element-where-the-string-abc123-is-in-the-class-attributes-value-as-a-substring for finding and clicking the "see more" buttons for the two tables. If you dig into the HTML you’ll see there is some text you can use to jump right to the two tables.
Remember the goal of this problem is to click the "see more" buttons (if they exist on a given page), and then just save the whole HTML page and convert it to binary for storage.
- Code used to solve this problem.
- Output from running the code.
Question 3
Write functions that accept html content (as a string) and uses the lxml.html
package to parse the HTML content and extract the various components for our price_history
and tax_history
tables.
My functions returned a list of lists.
Here is some example output from my functions — you do not need to match this if you have a better way to do it.
Example price_history output:

[['11/9/2022', 'Price change', 275000], ['11/2/2022', 'Listed for sale', 289900], ['1/13/2000', 'Sold', 19000]]

Example tax_history output:

[[2021, 1344, 124511], [2020, 1310, 122792], [2019, 1290, 120031], [2018, 1260, 117793], [2017, 1260, 115370], [2016, 1252, 112997], [2015, 1262, 112212], [2014, 1277, 113120], [2013, 1295, 112920], [2012, 1389, 124535], [2011, 1557, 134234], [2010, 1495, 132251], [2009, 1499, 128776], [2008, 1483, 128647], [2007, 1594, 124900], [2006, 1608, 121900], [2005, 1704, 118400], [2004, 1716, 115000], [2003, 1624, 112900], [2002, 1577, 110300], [2000, 288, 15700]]
Some skeleton hints if you want extra help. See discussion: piazza.com/class/l6usy14kpkk66n/post/lalzk6hi8ark
More skeleton code help, if wanted. See discussion: piazza.com/class/l6usy14kpkk66n/post/lalzk6hi8ark
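If you want a rough idea of what such a function can look like before reading those threads, here is a sketch of a price-history parser using lxml.html. The XPath expression (and the assumption that the table lives under an element whose id contains 'price-history') is only a guess at Zillow's markup and will almost certainly need adjusting against the real HTML you saved.

import lxml.html


def parse_price_history(html_content: str) -> list[list]:
    """
    Given a house page's HTML as a string, return a list of
    [date, event, price] lists from the "Price history" table.
    """
    tree = lxml.html.fromstring(html_content)

    results = []
    # this XPath is a guess at the page's structure -- dig into the saved HTML
    # and adjust it so it matches the rows of the "Price history" table
    rows = tree.xpath("//*[contains(@id, 'price-history')]//table//tr")
    for row in rows:
        cells = [c.text_content().strip() for c in row.xpath("./td")]
        if len(cells) < 3:
            continue
        date, event = cells[0], cells[1]
        # keep only the digits so the price can be stored as an integer
        price_digits = "".join(ch for ch in cells[2] if ch.isdigit())
        price = int(price_digits) if price_digits else None
        results.append([date, event, price])

    return results

A tax-history parser can follow the same pattern, returning [year, property_tax, tax_assessment] lists instead.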
- Code used to solve this problem.
- Output from running the code.
Question 4
Write code that uses the get_links
function to get a list of links for a search_term
. Process each link in the list and insert the retrieved data into your houses.db
database.
Once complete, run a couple queries that demonstrate that the data was successfully inserted into the database.
Here is some skeleton code to assist.
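As a starting point, here is a minimal sketch of one way the loop could look. It assumes the get_links, link_to_blob, and blob_to_html functions from the earlier questions are already defined in your notebook, that your question 3 parsers are named parse_price_history and parse_tax_history (your names may differ), and that the tables use the column names from the earlier schema sketch; the search term is just an illustration.

import os
import re
import sqlite3

# assumes get_links, link_to_blob, blob_to_html, and your question 3
# parsing functions are already defined in the notebook
links = get_links("west lafayette in")

con = sqlite3.connect(os.path.expanduser("~/houses.db"))
cur = con.cursor()

for link in links:
    # the zpid is the last path component of the link, e.g. 43641432_zpid
    zpid = re.search(r"(\d+_zpid)", link).group(1)

    blob = link_to_blob(link)
    html_content = blob_to_html(blob)

    cur.execute("INSERT INTO houses (zpid, html) VALUES (?, ?)", (zpid, blob))
    cur.executemany(
        "INSERT INTO price_history (date, event, price) VALUES (?, ?, ?)",
        parse_price_history(html_content),
    )
    cur.executemany(
        "INSERT INTO tax_history (year, property_tax, tax_assessment) VALUES (?, ?, ?)",
        parse_tax_history(html_content),
    )
    con.commit()

con.close()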
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project.