STAT 19000: Project 7 — Fall 2020
Motivation: Three bread-and-butter functions that are a part of base R are `subset`, `merge`, and `split`. `subset` provides a more natural way to filter and select data from a data.frame. `split` is a useful function that splits a dataset based on one or more factors. `merge` brings the principles of combining data that SQL uses to R.
Context: We’ve been getting comfortable working with data within the R environment. Now we are going to expand our toolset with three useful functions, all the while gaining experience and practice wrangling data!
Scope: r, subset, merge, split, tapply
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/goodreads/csv
Questions
Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions that print only a small part of the data.
Question 1
Load up the following two datasets, `goodreads_books.csv` and `goodreads_book_authors.csv`, into the data.frames `books` and `authors`, respectively. How many columns and rows are in each of these two datasets?
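A minimal sketch of what this could look like is below. It assumes the `read.csv` defaults are acceptable for these files and uses the Scholar path from the Dataset section.

```r
# Sketch: read both CSVs from the Scholar path given above.
books   <- read.csv("/class/datamine/data/goodreads/csv/goodreads_books.csv")
authors <- read.csv("/class/datamine/data/goodreads/csv/goodreads_book_authors.csv")

# dim() returns c(rows, columns); nrow() and ncol() give each piece separately.
dim(books)
dim(authors)
```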
- R code used to solve the problem.
- The result of running the R code.
Question 2
We want to figure out how book size (`num_pages`) is associated with various metrics. First, let’s create a vector called `book_size` that categorizes books into 4 categories based on `num_pages`: `small` (up to 250 pages), `medium` (250-500 pages), `large` (500-1000 pages), and `huge` (1000+ pages).
This [video and code](#r-lapply-flight-example) might be helpful.
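One way to build such a vector is with `cut`, which bins a numeric vector using breakpoints and labels. The sketch below is one possible reading of the page ranges; exactly where the boundary pages (250, 500, 1000) fall is up to you.

```r
# Sketch: bin num_pages into four labeled categories with cut().
# The breakpoints below are one interpretation of the ranges in the question.
book_size <- cut(books$num_pages,
                 breaks = c(0, 250, 500, 1000, Inf),
                 labels = c("small", "medium", "large", "huge"))

table(book_size)  # how many books fall into each size category
```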
- R code used to solve the problem.
- The result of `table(book_size)`.
Question 3
Use `tapply` to calculate the mean `average_rating`, `text_reviews_count`, and `publication_year` by `book_size`. Did any of the results surprise you? Why or why not?
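As a reminder of the `tapply` pattern, each call takes a vector of values, a grouping vector, and a function to apply within each group; extra arguments (like `na.rm = TRUE`) are passed on to that function. A sketch:

```r
# Sketch: mean of each metric within each book_size group.
# na.rm = TRUE is passed through to mean() so missing values don't produce NA.
tapply(books$average_rating,     book_size, mean, na.rm = TRUE)
tapply(books$text_reviews_count, book_size, mean, na.rm = TRUE)
tapply(books$publication_year,   book_size, mean, na.rm = TRUE)
```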
- R code used to solve the problem.
- The output from running the R code.
Question 4
Notice in (3) how we used `tapply` three times. This would get burdensome if we decided to calculate 4 or 5 or 6 columns instead. Instead of using `tapply`, we can use `split`, `lapply`, and `colMeans` to perform the same calculations.

Use `split` to partition the data containing only the following 3 columns: `average_rating`, `text_reviews_count`, and `publication_year`, by `book_size`. Save the result as `books_by_size`. What class is the result? `lapply` is a function that allows you to loop over each item in a list and apply a function. Use `lapply` and `colMeans` to perform the same calculation as in (3).
This [video and code](#r-lapply-flight-example) and also this [video and code](#r-lapply-fars-example) might be helpful.
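The general shape of this approach is sketched below: `split` returns a list with one data.frame per `book_size` level, and `lapply` applies `colMeans` to each element of that list.

```r
# Sketch: split the three columns of interest by book_size ...
books_by_size <- split(books[, c("average_rating", "text_reviews_count", "publication_year")],
                       book_size)

class(books_by_size)  # split() returns a list

# ... then loop over the list, averaging every column within each piece.
lapply(books_by_size, colMeans, na.rm = TRUE)
```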
- R code used to solve the problem.
- The output from running the code.
Question 5
We are working with a lot more data than we really want right now. We’ve provided you with the following code to filter out non-English books and only keep columns of interest. This will create a data frame called `en_books`.
```r
en_books <- books[books$language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") &
                    books$publication_year > 2000,
                  c("author_id", "book_id", "average_rating", "description", "title",
                    "ratings_count", "language_code", "publication_year")]
```
Now create an equivalent data frame of your own by using the `subset` function (instead of indexing). Use `res` as the name of the data frame that you create.

Do the dimensions (using `dim`) of `en_books` and `res` agree? Why or why not? (They should both have 8 columns, but a different number of rows.)
Since the dimensions don’t match, take a look at NA values for the variables used to subset our data.
This [video and code](#r-subset-8451-example) and also this [video and code](#r-subset-election-example) might be helpful.
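For reference, `subset` takes the data.frame, a row condition, and an optional `select=` argument naming the columns to keep. A sketch mirroring the indexing code above is shown here; keep in mind that `subset` and `[` can treat rows with `NA` in the condition differently, which is why the dimension comparison is interesting.

```r
# Sketch: the same filter as the indexing code above, written with subset().
res <- subset(books,
              language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") &
                publication_year > 2000,
              select = c(author_id, book_id, average_rating, description, title,
                         ratings_count, language_code, publication_year))

dim(en_books)
dim(res)

# If the row counts differ, NA values in the filtering variables are worth checking:
sum(is.na(books$publication_year))
```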
- R code used to solve the problem.
- Do the dimensions match?
- 1-2 sentences explaining why or why not.
Question 6
We now have a nice and tidy subset of data, called res
. It would be really nice to get some information on the authors. We can find that information in authors
dataset loaded in question 1! In question 2 of the previous project, we had a similar issue with the states names. There is a much better and easier way to solve these types of problems. Use the merge
function to combine res
and authors
in a way which appends all information from author
when there is a match in res
. Use the condition by="author_id"
in the merge. This is all you need to do:
```r
mymergedDF <- merge(res, authors, by="author_id")
```
The resulting data frame will have all of the columns that are found in either `res` or `authors`.
Although we provided the necessary code for this example, you might want to know more about the merge function. This [video and code](#r-merge-fars-example) and also this [video and code](#r-merge-flights-example) might be helpful.
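After running the provided `merge` call, a quick sanity check of the result's size and column names might look like this sketch:

```r
mymergedDF <- merge(res, authors, by="author_id")

dim(mymergedDF)    # by default, only rows with an author_id match in both data frames are kept
names(mymergedDF)  # shared column names get .x / .y suffixes, e.g. average_rating.x
```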
- The given R code used to solve the problem.
- The `dim` of the newly merged data.frame.
Question 7
For an author of your choice (that is in the dataset), find the author’s highest rated book. Do you agree with it being their highest rated book?
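One possible approach, assuming you work from the merged data frame in question 6: filter to your author’s rows and use `which.max` on the book rating column. The author name and the author-name column used below (`name`) are placeholders; check `names(mymergedDF)` for the actual column names in your data.

```r
# Sketch: "Some Author" and the column called "name" are placeholders --
# substitute the author and author-name column present in your merged data.
my_author <- mymergedDF[mymergedDF$name == "Some Author", ]

# which.max() gives the row index of the largest rating; average_rating.x is
# assumed to be the book-level rating coming from res (the first argument to merge).
my_author$title[which.max(my_author$average_rating.x)]
```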
- R code used to solve the problem.
- The title of the highest rated book (from your author).
- 1-2 sentences explaining why you do or do not agree with it being the highest rated book from that author.
OPTIONAL QUESTION
Look at the column names of the new data frame created in question 6. Notice that there are two columns for `ratings_count` and two for `average_rating`. The names with an appended `.x` are those values from the first argument to `merge`, and the names with an appended `.y` are those values from the second argument to `merge`. Rename these columns to indicate whether they refer to a book or an author.
For example, `average_rating.x` could be renamed to something like `book_average_rating`.
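One way to do the renaming is sketched below, using one possible naming scheme; any names that clearly distinguish book columns from author columns are fine.

```r
# Sketch: replace the suffixed names with more descriptive ones.
names(mymergedDF)[names(mymergedDF) == "average_rating.x"] <- "book_average_rating"
names(mymergedDF)[names(mymergedDF) == "average_rating.y"] <- "author_average_rating"
names(mymergedDF)[names(mymergedDF) == "ratings_count.x"]  <- "book_ratings_count"
names(mymergedDF)[names(mymergedDF) == "ratings_count.y"]  <- "author_ratings_count"

names(mymergedDF)
```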
- R code used to solve the problem.
- The `names` of the new data.frame.