TDM 40100: Project 2 — 2023
Motivation: The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a sqlite3 database, design a schema, insert data, create indexes, etc. This series of projects is focused on SQL and sqlite3, with the opportunity to use other skills you’ve built throughout the previous years.
Context: In TDM 20100, you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with sqlite3, a powerful database engine. In this project (and the following projects), we will branch into SQL and sqlite3-specific topics and techniques that you haven’t yet had exposure to in The Data Mine.
Scope: sqlite3, lmod, SQL
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_authors.json
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_series.json
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_reviews_dedup.json
Questions
Question 1 (1 pt)
- Write a function to determine how big our 4 files are, given a unit to return the size in (e.g. bytes, megabytes, etc.).
- Run your function with bytes, kilobytes, megabytes, and gigabytes in turn.
- Approximately how many books, reviews, and authors are included in the datasets?
- Write code to get the size of one of the images in our dataset in bytes, without downloading the image.
The goodreads dataset contains a variety of files. With that being said, there are 4 files that hold the bulk of the data; the rest are mostly derivatives of those 4 files.
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_authors.json
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_series.json
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_reviews_dedup.json
Write a bash function that takes an argument indicating the unit to return the size in, and returns the size of a file in that unit by running du. For example, if I wanted to know how many bytes the goodreads_books.json file takes up using your function, I could run the following:
get_size b
In the kernel on Anvil that we are using, "shells" do not persist across different cells. What this means practically is that any bash function you define is only defined in that cell. When running a function, it must be defined in the same cell that you are running it in.
Run your function 4 times, passing in bytes, kilobytes, megabytes, and gigabytes in turn. Does your output make sense for each size? If not, double check your function to make sure you are using the correct unit of size for each du command. Don’t forget to try du --help if you are having trouble.
For this project we will be using the convention that 1 MB is 1,000,000 bytes and 1 GB is 1,000,000,000 bytes. We understand that historically this may have differed, but we will be operating off this convention for this project.
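As a reference point, here is a minimal sketch of one way such a function could look. This is an assumption-laden example, not the required solution: it assumes GNU du (so that --block-size accepts KB, MB, and GB for the 1,000-based units above) and hard-codes the four file paths.
get_size () {
    # $1 is the unit to report in: b, kb, mb, or gb
    data_dir="/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023"
    case "$1" in
        b)  unit="1"  ;;  # bytes
        kb) unit="KB" ;;  # 1 KB = 1,000 bytes
        mb) unit="MB" ;;  # 1 MB = 1,000,000 bytes
        gb) unit="GB" ;;  # 1 GB = 1,000,000,000 bytes
        *)  echo "usage: get_size b|kb|mb|gb"; return 1 ;;
    esac
    # --apparent-size reports the file size rather than disk blocks used
    du --apparent-size --block-size="$unit" \
        "$data_dir"/goodreads_book_authors.json \
        "$data_dir"/goodreads_book_series.json \
        "$data_dir"/goodreads_books.json \
        "$data_dir"/goodreads_reviews_dedup.json
}
get_size b
Remember from the note above that the function definition and the call have to live in the same cell.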
Approximately how many books, reviews, and authors are included in the datasets?
For this part of the question, I would recommend first looking at the head of each file using head -n10 filename, replacing filename with your file. How many books are on each line of the file on average? Once you know this, you can count how many lines are in the file (using wc) and extrapolate to find your solution.
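For instance, if it turns out that each line of a file holds one JSON record, a quick estimate could look like the following sketch (goodreads_books.json is used here just as an example):
books_file="/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json"
head -n10 "$books_file"    # eyeball how many records appear on each line
wc -l < "$books_file"      # number of lines; multiply by records-per-line to estimate the total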
Finally, let’s take a look at the first book.
{"isbn": "0312853122", "text_reviews_count": "1", "series": [], "country_code": "US", "language_code": "", "popular_shelves": [{"count": "3", "name": "to-read"}, {"count": "1", "name": "p"}, {"count": "1", "name": "collection"}, {"count": "1", "name": "w-c-fields"}, {"count": "1", "name": "biography"}], "asin": "", "is_ebook": "false", "average_rating": "4.00", "kindle_asin": "", "similar_books": [], "description": "", "format": "Paperback", "link": "https://www.goodreads.com/book/show/5333265-w-c-fields", "authors": [{"author_id": "604031", "role": ""}], "publisher": "St. Martin's Press", "num_pages": "256", "publication_day": "1", "isbn13": "9780312853129", "publication_month": "9", "edition_information": "", "publication_year": "1984", "url": "https://www.goodreads.com/book/show/5333265-w-c-fields", "image_url": "https://images.gr-assets.com/books/1310220028m/5333265.jpg", "book_id": "5333265", "ratings_count": "3", "work_id": "5400751", "title": "W.C. Fields: A Life on Film", "title_without_series": "W.C. Fields: A Life on Film"}
If we want to find the size of the image for this book (given by the image_url field), we could just download it and run du. However, if we wanted to do this on a whole bunch of images, it would be extremely slow and take up an enormous amount of space on our filesystem, which is far from ideal for large datasets.
Instead, use curl to get the headers of the image, and grep to isolate the Content-Length portion of the header. Take the time to figure out how this works and make your solution clean, as we will be building on this idea in the next question and it will be tough if you don’t understand it or have a very convoluted solution.
It is okay to manually copy/paste the link from the json.
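One possible shape for this, shown as a rough sketch rather than the expected solution, is to ask curl for just the response headers with -s (silent) and -I (HEAD request), then filter for the Content-Length line. The URL below is simply the image_url pasted from the record above:
image_url="https://images.gr-assets.com/books/1310220028m/5333265.jpg"
curl -sI "$image_url" | grep -i "content-length"                    # the full header line
curl -sI "$image_url" | grep -i "content-length" | cut -d' ' -f2    # just the byte count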
- A function to calculate the size of a file in a given unit.
- The size of each of the four files listed, in bytes, kilobytes, megabytes, and gigabytes.
- The approximate number of books, reviews, and authors in the dataset.
- The size of the image for the first book, in bytes.
Question 2 (2 pts)
- Write a bash function to return the average size of n random images from our data.
- Run your function 4 times, passing in 25, 50, 100, and 1000 in turn.
- Write 2-3 sentences explaining which average is likely closest to the true average of the dataset, and why. Include an explanation of why the values you got may not match the theoretical expectation.
In the last question, we got the size of a single image from our data. However, more often than not we will want to do this with many images in order to approximate how much space the images take on average, or to get any other sorts of summary statistics about the images in our data.
In the previous question we said it was okay to manually copy/paste the image link from the json. That will not be practical here, since your function needs to work with many randomly chosen images, so it should extract the links itself.
Write a function that takes an argument n indicating the number of images to download, and returns the average size of n random images from our data. An example general outline of this function is included below to point you in the right direction.
avg_img_size () {
# Record start time
start_time=$(date +%s.%N)
# set helper variables
books_file="/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_books.json"
# get a subset of _n_ image links
# initialize accumulating variables
let total_size=0 count=0
# get size of each image, then average all sizes
# THIS IS THE BIG SECTION YOU WILL NEED TO FILL IN
# print output
echo "Total Size: $total_size bytes"
echo "Num Files: $count"
echo "Average Size (N=$1): $average_size bytes"
# remove temp files
# Record end time, calculate elapsed time
end_time=$(date +%s.%N)
elapsed_time=$(echo "$end_time - $start_time" | bc)
echo "$elapsed_time sec runtime"
}
Your function can be tested as follows:
avg_img_size 25
echo " "
avg_img_size 50
echo " "
avg_img_size 100
echo " "
avg_img_size 1000
While retrieving the size of each image, you may notice that addition of the sizes is not working. If this is the case, inspect the value you extracted from the headers: HTTP header lines end with a carriage return (\r), which can break bash arithmetic until you strip it.
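If that is what is happening, the value can be cleaned up before accumulating it. The snippet below is a hedged illustration that reuses the total_size variable from the outline above; image_url is a placeholder for whichever link your function is currently processing:
size=$(curl -sI "$image_url" | grep -i "content-length" | cut -d' ' -f2 | tr -d '\r')
total_size=$((total_size + size))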
Run your function using the testing code provided above, which returns the average size of images given a subset of 25, 50, 100, and 1000 images. In a markdown cell, write 2-3 sentences explaining which average you received is theoretically closest to the 'true average' size of an image in the dataset, and why.
1000 images is a pretty large amount, so don’t expect your function to finish running instantly. My solution to this question took 86 seconds for 1000 images! If you are having a hard time getting your 1000 image test to run in a reasonable amount of time, go back to the 25 or 50 image tests and try to speed those up first.
- Function to calculate the average size of n random images from our data.
- Results of running your function on subsets of 25, 50, 100, and 1000 images.
- 2-3+ sentences explaining which subset theoretically produces the average closest to that of the whole dataset, why, and why your results may differ from the theoretical expectation.
Question 3 (2 pts)
- Create a directory called goodreads_samples somewhere in your $HOME directory.
- Create ~100 MB random subsets of our 4 main files in your goodreads_samples directory.
- Double check that your subset files are an appropriate size with du.
Okay, so roughly, in total we are looking at around 27 GB of data. At that size it will definitely be useful for us to create a database. After all, answering questions like:
- What is the average rating of Brandon Sanderson's books?
- What are the titles of the 5 books with the most ratings?
would be quite difficult with the data in its current form.
Realistically, things are not very straightforward if we hand you this data and say "get that info please". However, if we had a nice sqlite database, the same tasks would be trivial! In the rest of this project, we will set up a sqlite database and populate it with data from the goodreads dataset, with the end goal of creating a small database that makes it easy to answer questions like the ones above.
First, before we do that, it would make sense to get a sample of each of the datasets. Working with samples just makes it a lot easier to load the data up and parse through it.
Use shuf to get a random sample of the goodreads_books.json and goodreads_reviews_dedup.json datasets. Approximate how many rows you’d need in order to get the datasets down to around 100 MB each, and do so. Put the samples, along with copies of goodreads_book_authors.json and goodreads_book_series.json, in a directory called goodreads_samples anywhere inside your $HOME directory.
It just needs to be approximately 100 MB; no need to fuss, as long as it is within a 50 MB margin it should be fine.
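One rough way to pick the number of rows, sketched below under the assumptions that goodreads_books.json is the file being sampled and that 100,000,000 bytes is the target, is to scale the total line count by the ratio of the target size to the full size and hand that number to shuf:
mkdir -p "$HOME/goodreads_samples"
books_file="/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json"
total_bytes=$(du -b "$books_file" | cut -f1)
total_lines=$(wc -l < "$books_file")
# number of lines that should add up to roughly 100 MB
sample_lines=$(( total_lines * 100000000 / total_bytes ))
shuf -n "$sample_lines" "$books_file" > "$HOME/goodreads_samples/goodreads_books.json"
The same pattern works for goodreads_reviews_dedup.json, and a plain cp is enough for the two smaller files.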
- goodreads_samples directory containing our 4 subset files.
- Code to check the size of our 4 subset files.
Question 4 (1 pt)
- Write out the keys in each of the json files (excluding goodreads_reviews_dedup.json), and list the appropriate storage class to use.
Check out the 5 storage classes (which you can think of as 'data types' in languages like Python, C, and R) that sqlite3 uses: www.sqlite.org/datatype3.html
When we are looking into constructing a database, we need to think about what types of data we need to store. For example, if we wanted to store a bunch of values like 12349234, an INTEGER would likely work well. However, if we are trying to store values like 0012349234, storing them as an INTEGER would lose the 2 leading zeroes. In this case, TEXT may be more appropriate. These sorts of small technicalities can make a big difference in how well our data is stored in our database, so be sure to look at the different values in each field before assigning it a type appropriate to store those values.
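If you want to see this behavior for yourself, the throwaway demo below (an illustration only, assuming sqlite3 is available on your PATH, e.g. after the module load shown in Question 5) inserts the same value into an INTEGER column and a TEXT column of an in-memory database:
sqlite3 <<'EOF'
CREATE TABLE demo (as_int INTEGER, as_text TEXT);
INSERT INTO demo VALUES ('0012349234', '0012349234');
SELECT as_int, as_text FROM demo;
EOF
It should print something like 12349234|0012349234: the INTEGER column silently drops the leading zeroes, while the TEXT column preserves them.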
In a markdown cell, write out each of the keys in each of the json files (excluding goodreads_reviews_dedup.json), and list the appropriate storage class to use. As an example, I’ve provided a solution for goodreads_reviews_dedup.json below.
- user_id: TEXT
- book_id: INTEGER
- review_id: TEXT
- rating: INTEGER
- review_text: TEXT
- date_added: TEXT
- date_updated: TEXT
- read_at: TEXT
- started_at: TEXT
- n_votes: INTEGER
- n_comments: INTEGER
You don’t need to copy/paste the provided solution for goodreads_reviews_dedup.json.
You do not need to assign a type to the following keys in goodreads_books.json: series, popular_shelves, similar_books, and authors.
- List of the keys in the json files and the appropriate storage class to use.
Question 5 (2 pts)
- Write a CREATE TABLE statement for each of the 4 tables we planned in the previous question.
- Run the provided code snippets below to verify that your tables were created correctly.
We have done a lot of setup in the previous questions, and now we are finally ready to create our mini-database! We will do this using CREATE TABLE statements in sqlite3, sourcing data from the goodreads_samples directory we created in the previous question. While you will have to run your CREATE TABLE statements in a terminal with sqlite3 launched in order for them to work, please paste them into your notebook as well in order to receive points for this question.
Let’s start by launching sqlite3 like so:
module use /anvil/projects/tdm/opt/core
module load tdm
module load sqlite/3.39.2
sqlite3 my.db # this will create an empty database
This will put us inside our sqlite3 session, and we can start running our commands, sqlite3-specific dot functions, and SQL queries. While you work on the next portion of this question, feel free to use the following code to validate that your CREATE TABLE statements are working as expected. Additionally, the SQLite website (www.sqlite.org) has more information about sqlite for you to read up on in your extra time.
.tables
Running .tables will list the tables that currently exist in your database, which is a quick way to check that your CREATE TABLE statements worked.
For now, we will only worry about creating tables with the columns that we identified in the previous question. This means that we will be leaving out the series, popular_shelves, similar_books, and authors columns in the goodreads_books.json file.
Now that we have our sqlite3 session launched, let’s create our first table. In Question 4, we essentially created an outline of what each table’s columns would contain. Translate each of the lists of keys and storage classes you made into a CREATE TABLE statement, and run it in your bash shell. Don’t forget to paste your statements into your Jupyter notebook so we can see them! As an example, I have provided the CREATE TABLE statement for the goodreads_reviews_dedup.json file below. You should still run this statement, as you want to create this table as well.
CREATE TABLE reviews (
user_id TEXT,
book_id INTEGER,
review_id TEXT,
rating INTEGER,
review_text TEXT,
date_added TEXT,
date_updated TEXT,
read_at TEXT,
started_at TEXT,
n_votes INTEGER,
n_comments INTEGER
);
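If typing long statements at the interactive prompt gets tedious, one convenient alternative (purely optional, and create_tables.sql is a hypothetical file name) is to save your statements in a file and feed it to sqlite3 from your regular shell:
sqlite3 my.db < create_tables.sql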
While concepts like primary and foreign keys are extremely important and useful, we will not be covering them in this project. For now, just focus on building the four tables we outlined in Question 4, ensuring you are using the correct types. We will cover other restrictions, like column constraints, in a later project.
Finally, run all of the below statements in your Jupyter notebook to verify (and show us) that your tables were created correctly.
%sql sqlite:////home/x-jaxmattfair/my.db # change x-jaxmattfair to your username
%%sql
SELECT sql FROM sqlite_master WHERE name='reviews';
%%sql
SELECT sql FROM sqlite_master WHERE name='books';
%%sql
SELECT sql FROM sqlite_master WHERE name='series';
%%sql
SELECT sql FROM sqlite_master WHERE name='authors';
- SQL CREATE TABLE statements for each of the 4 tables, and the code used to create your database.
- The code snippets above and the results of running them.
Submitting your Work
Well done, you’ve finished your second project for this class and created your first database in sqlite3! Make sure to save your work and submit it to Gradescope in the correct format.
You must double check your submission after uploading it to Gradescope. You will not receive full credit if your .ipynb file does not contain all of your code and output.
- firstname-lastname-project02.ipynb
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted was what you actually submitted. In addition, please review our submission guidelines before submitting your project.
Here is the Zoom recording of the 4:30 PM discussion with students from 28 August 2023: