TDM 40100: Project 11 — 2022
Motivation: In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially some sqlite3, containerization, and analysis work as well.
Context: This is the second in a series of 4 projects with a focus on web scraping that incorporates a variety of skills we’ve touched on in previous Data Mine courses. For this second project, we continue to build our suite of tools designed to scrape public housing data.
Scope: selenium, Python, web scraping
Questions
Question 1
If you did not complete the previous project, the code for the get_links function will be provided below on Monday, November 14th.
import time

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import StaleElementReferenceException


def get_links(search_term: str) -> list[str]:
    """
    Given a search term, return a list of web links for all of the resulting properties.
    """

    def _load_cards(driver):
        """
        Given the driver, scroll through the cards
        so that they all load.
        """
        cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
        while True:
            try:
                num_cards = len(cards)
                driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])
                time.sleep(2)
                cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
                if num_cards == len(cards):
                    break
                num_cards = len(cards)
            except StaleElementReferenceException:
                # every once in a while we will get a StaleElementReferenceException
                # because we are trying to access or scroll to an element that has changed.
                # this probably means we can skip it because the data has already loaded.
                continue

    links = []
    url = f"https://www.zillow.com/homes/for_sale/{'-'.join(search_term.split(' '))}_rb/"
    firefox_options = Options()
    # Headless mode means no GUI
    firefox_options.add_argument("--headless")
    firefox_options.add_argument("--disable-extensions")
    firefox_options.add_argument("--no-sandbox")
    firefox_options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Firefox(options=firefox_options)
    with driver as d:
        d.get(url)
        d.delete_all_cookies()
        while True:
            time.sleep(2)
            _load_cards(d)
            links.extend([e.get_attribute("href") for e in d.find_elements("xpath", "//a[@data-test='property-card-link' and @class='property-card-link']")])
            next_link = d.find_element("xpath", "//a[@rel='next']")
            if next_link.get_attribute("disabled") == "true":
                break
            url = next_link.get_attribute('href')
            d.delete_all_cookies()
            next_link.click()

    return links
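Assuming Selenium and geckodriver are available in your environment, the function can be called with any search term; the term below is only an illustration.

# hypothetical search term -- use whatever area you want to scrape
links = get_links("west lafayette in")
print(len(links))
print(links[:5])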
There is a lot of rich data on a home’s page. If you want to gauge the housing market in an area or for a search_term, there are two pieces of data that could be particularly useful: the "Price history" and "Public tax history" components of the page.
Check out zillow.com links for a couple different houses.
Let’s say you want to track the date, event, and price in a price_history table, and the year, property_tax, and tax_assessment in a tax_history table.
Write 2 CREATE TABLE statements to create the price_history and tax_history tables. In addition, create a houses table where the NUMBER_zpid is the primary key, along with an html column, which will store an HTML file. You can find the id in a house’s link. For example, www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/ has the id 43641432_zpid.
Use sqlite3 to create the tables in a database called $HOME/houses.db. You can do all of this from within Jupyter Lab.
%sql sqlite:///$HOME/houses.db
%%sql
CREATE TABLE ...
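Alternatively, the same tables can be created directly from Python with the sqlite3 module. The sketch below shows one plausible design; the column types (and the choice to store the zpid as TEXT) are assumptions, not the required schema, so adjust them to match your own decisions.

import os
import sqlite3

# connect to (and create, if needed) the database in your home directory
con = sqlite3.connect(os.path.expanduser("~/houses.db"))
cur = con.cursor()

# one possible houses table: the zpid is the primary key and html holds the page as a blob
cur.execute("""
CREATE TABLE IF NOT EXISTS houses (
    zpid TEXT PRIMARY KEY,
    html BLOB
);
""")

# the history tables hold just the columns described above; you may also want a column
# (for example, the zpid) tying each row back to its house
cur.execute("""
CREATE TABLE IF NOT EXISTS price_history (
    date TEXT,
    event TEXT,
    price INTEGER
);
""")

cur.execute("""
CREATE TABLE IF NOT EXISTS tax_history (
    year INTEGER,
    property_tax INTEGER,
    tax_assessment INTEGER
);
""")

con.commit()
con.close()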
Run the following queries to confirm and show your table schemas.
PRAGMA table_info(houses);
PRAGMA table_info(price_history);
PRAGMA table_info(tax_history);
- Code used to solve this problem.
- Output from running the code.
Question 2
Write a function called link_to_blob that takes a link and returns a blob of the HTML file.
- Navigate to the page.
- Sleep 2 seconds.
- Scroll so elements load up (think "Price and tax history", clicking "See complete tax history", and clicking "See complete price history", etc.).
- Create a .html file and write the driver’s page_source to the file.
- Open the file in rb mode and use the read method to read the file into binary format. Return the binary format object.
- Delete the .html file you created.
- Quit the driver by calling driver.quit().
In addition, write a function called blob_to_html that accepts a blob (like what is returned from link_to_blob) and returns the string containing the HTML content.
Demonstrate the functions by using link_to_blob to get the blob for a link, and then using blob_to_html to get the HTML content back from the returned value of link_to_blob.
Just print the first 500 characters of the result of blob_to_html.
If you are unsure how to do any of this, please feel free to post in Piazza!
Here is some skeleton code. The structure provided here works well for the problem.
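The following is a minimal sketch of one way to structure these two functions, assuming headless Firefox as in get_links. The XPath used to find the "see more" buttons and the temporary file name are only guesses/illustrations, not the required approach; inspect the page and adjust them.

import os
import time
import uuid

from selenium import webdriver
from selenium.webdriver.firefox.options import Options


def link_to_blob(link: str) -> bytes:
    """
    Given a link to a house's page, return the page's HTML as a binary blob.
    """
    firefox_options = Options()
    firefox_options.add_argument("--headless")
    driver = webdriver.Firefox(options=firefox_options)

    driver.get(link)
    time.sleep(2)

    # scroll to and click the "see more" buttons so the full price/tax history loads;
    # this XPath is only a guess at the markup -- dig into the HTML and adjust it
    for button in driver.find_elements("xpath", "//span[contains(text(), 'See complete')]"):
        driver.execute_script("arguments[0].scrollIntoView();", button)
        time.sleep(2)
        button.click()

    # write the page source to a temporary .html file, read it back in binary, then clean up
    filename = f"{uuid.uuid4()}.html"
    with open(filename, "w") as f:
        f.write(driver.page_source)

    with open(filename, "rb") as f:
        blob = f.read()

    os.remove(filename)
    driver.quit()

    return blob


def blob_to_html(blob: bytes) -> str:
    """
    Given a blob (like what link_to_blob returns), return the HTML content as a string.
    """
    return blob.decode("utf-8")

Note that button.click() can raise an exception if another element covers the button; wrapping the click in a try/except, or clicking via execute_script, is one way to work around that.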
Use this trick: the-examples-book.com/starter-guides/data-formats/xml#write-an-xpath-expression-to-get-every-div-element-where-the-string-abc123-is-in-the-class-attributes-value-as-a-substring for finding and clicking the "see more" buttons for the two tables. If you dig into the HTML you’ll see there is some text you can use to jump right to the two tables.
Remember the goal of this problem is to click the "see more" buttons (if they exist on a given page), and then just save the whole HTML page and convert it to binary for storage.
- Code used to solve this problem.
- Output from running the code.
Question 3
Write functions that accept html content (as a string) and uses the lxml.html
package to parse the HTML content and extract the various components for our price_history
and tax_history
tables.
My functions returned a list of lists.
Here is some example output from my functions — you do not need to match this if you have a better way to do it.
Example price_history output:

[['11/9/2022', 'Price change', 275000], ['11/2/2022', 'Listed for sale', 289900], ['1/13/2000', 'Sold', 19000]]

Example tax_history output:

[[2021, 1344, 124511], [2020, 1310, 122792], [2019, 1290, 120031], [2018, 1260, 117793], [2017, 1260, 115370], [2016, 1252, 112997], [2015, 1262, 112212], [2014, 1277, 113120], [2013, 1295, 112920], [2012, 1389, 124535], [2011, 1557, 134234], [2010, 1495, 132251], [2009, 1499, 128776], [2008, 1483, 128647], [2007, 1594, 124900], [2006, 1608, 121900], [2005, 1704, 118400], [2004, 1716, 115000], [2003, 1624, 112900], [2002, 1577, 110300], [2000, 288, 15700]]
Some skeleton hints if you want extra help. See discussion: piazza.com/class/l6usy14kpkk66n/post/lalzk6hi8ark
More skeleton code help, if wanted. See discussion: piazza.com/class/l6usy14kpkk66n/post/lalzk6hi8ark
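If you want a rough idea of what such a function can look like before reading those threads, here is a sketch of a price-history parser using lxml.html. The XPath expression (and the assumption that the table lives under an element whose id contains 'price-history') is only a guess at Zillow's markup and will almost certainly need adjusting against the real HTML you saved.

import lxml.html


def parse_price_history(html_content: str) -> list[list]:
    """
    Given a house page's HTML as a string, return a list of
    [date, event, price] lists from the "Price history" table.
    """
    tree = lxml.html.fromstring(html_content)

    results = []
    # this XPath is a guess at the page's structure -- dig into the saved HTML
    # and adjust it so it matches the rows of the "Price history" table
    rows = tree.xpath("//*[contains(@id, 'price-history')]//table//tr")
    for row in rows:
        cells = [c.text_content().strip() for c in row.xpath("./td")]
        if len(cells) < 3:
            continue
        date, event = cells[0], cells[1]
        # keep only the digits so the price can be stored as an integer
        price_digits = "".join(ch for ch in cells[2] if ch.isdigit())
        price = int(price_digits) if price_digits else None
        results.append([date, event, price])

    return results

A tax-history parser can follow the same pattern, returning [year, property_tax, tax_assessment] lists instead.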
- Code used to solve this problem.
- Output from running the code.
Question 4
Write code that uses the get_links
function to get a list of links for a search_term
. Process each link in the list and insert the retrieved data into your houses.db
database.
Once complete, run a couple queries that demonstrate that the data was successfully inserted into the database.
Here is some skeleton code to assist.
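As a starting point, here is a minimal sketch of one way the loop could look. It assumes the get_links, link_to_blob, and blob_to_html functions from the earlier questions are already defined in your notebook, that your question 3 parsers are named parse_price_history and parse_tax_history (your names may differ), and that the tables use the column names from the earlier schema sketch; the search term is just an illustration.

import os
import re
import sqlite3

# assumes get_links, link_to_blob, blob_to_html, and your question 3
# parsing functions are already defined in the notebook
links = get_links("west lafayette in")

con = sqlite3.connect(os.path.expanduser("~/houses.db"))
cur = con.cursor()

for link in links:
    # the zpid is the last path component of the link, e.g. 43641432_zpid
    zpid = re.search(r"(\d+_zpid)", link).group(1)

    blob = link_to_blob(link)
    html_content = blob_to_html(blob)

    cur.execute("INSERT INTO houses (zpid, html) VALUES (?, ?)", (zpid, blob))
    cur.executemany(
        "INSERT INTO price_history (date, event, price) VALUES (?, ?, ?)",
        parse_price_history(html_content),
    )
    cur.executemany(
        "INSERT INTO tax_history (year, property_tax, tax_assessment) VALUES (?, ?, ?)",
        parse_tax_history(html_content),
    )
    con.commit()

con.close()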
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project.