STAT 29000: Project 15 — Fall 2020
Motivation: We’ve done a lot of work with SQL this semester. Let’s review concepts in this project and mix and match R and SQL to solve data-driven problems.
Context: In this project, we will reinforce topics you’ve already learned, with a focus on SQL.
Scope: SQL, sqlite, R
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/movies_and_tv/imdb.db
Questions
Question 1
What is the first year where our database has > 1000 titles? Use the premiered
column in the titles
table as our year. What year has the most titles?
There could be missing values in |
-
SQL queries used to answer the questions.
-
What year is the first year to have > 1000 titles?
-
What year has the most titles?
Question 2
What, and how many, unique type
are there from the titles
table? For the year found in question (1) with the most titles
, how many titles of each type
are there?
-
SQL queries used to answer the questions.
-
How many and what are the unique
types
from thetitles
table? -
A list of
type
and and count for the year (premiered
) that had the mosttitles
.
F.R.I.E.N.D.S is a popular tv show. They have an interesting naming convention for the names of their episodes. They all begin with the text "The One …". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let’s use SQL and R to take a look at how many times each characters' names appear in the title of the episodes.
Question 3
Write a query that gets the episode_title_id
, primary_title
, rating
, and votes
, of all of the episodes of Friends (title_id
is tt0108778).
You can slightly modify the solution to question (5) in project 13. |
-
SQL query used to answer the question.
-
First 5 results of the query.
Question 4
Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to used regular expressions to search for text. For each character, how many episodes `primary_title`s contained their name?
-
R code in a code chunk that was used to find the solution.
-
The solution pasted below the code chunk.
Question 5
Create a graphic showing our results in (4) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37.
-
The R code used to generate the graphic.
-
The graphic in a png or jpg/jpeg format.
Question 6
Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see type
column from titles
table): Romance, Comedy, Animation. In the titles
table, you can find the genres in the genres
column. There may be some overlap (i.e. a movie may have more than one genre), this is ok.
To query rows which have the genre Action as one of its genres:
SELECT * FROM titles WHERE genres LIKE '%action%';
-
Any code you used to solve the problem in a code chunk.
-
The average rating of each of the genres listed for movies.