STAT 19000: Project 13 — Fall 2020
Motivation: It’s important to be able to lookup and understand the documentation of a new function. You may have looked up the documentation of functions like paste0
or sapply
, and noticed that in the "usage" section, one of the arguments is an ellipsis (…
). Well, unless you understand what this does, it’s hard to really get it. In this project, we will experiment with ellipsis, and write our own function that utilizes one.
Context: We’ve learned about, used, and written functions in many projects this semester. In this project, we will utilize some of the less-known features of functions.
Scope: r, functions
Questions
Question 1
Read /class/datamine/data/beer/beers.csv
into a data.frame named beers
. Read /class/datamine/data/beer/breweries.csv
into a data.frame named breweries
. Read /class/datamine/data/beer/reviews.csv
into a data.frame named reviews
.
Notice that |
Do not forget to load the |
Below we show you an example of how fast the fread
function is compared to`read.csv`.
microbenchmark(read.csv("/class/datamine/data/beer/reviews.csv", nrows=100000), data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows=100000)), times=5)
Unit: milliseconds
expr
read.csv("/class/datamine/data/beer/reviews.csv", nrows = 1e+05)
data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows = 1e+05))
min lq mean median uq max neval
5948.6289 6482.3395 6746.8976 7040.5881 7086.6728 7176.2589 5
120.7705 122.3812 127.9842 128.7794 133.7695 134.2205 5
This video demonstrates how to read the |
-
R code used to solve the problem.
Question 2
Take some time to explore the datasets. Like many datasets, our data is broken into 3 "tables". What columns connect each table? How many breweries in breweries
don’t have an associated beer in beers
? How many beers in beers
don’t have an associated brewery in breweries
?
We compare lists of names using |
-
R code used to solve the problem.
-
A description of columns which connect each of the files.
-
How many breweries don’t have an associated beer in
beers
. -
How many beers don’t have an associated brewery in
breweries
.
Question 3
Run ?sapply
and look at the usage section for sapply
. If you look at the description for the …
argument, you’ll see it is "optional arguments to FUN`". What this means is you can specify additional input for the function you are passing to `sapply
. One example would be passing T
to na.rm
in the mean function: sapply(dat, mean, na.rm=T)
. Use sapply
and the strsplit
function to separate the types of breweries (types
) by commas. Use another sapply
to loop through your results and count the number of types for each brewery. Be sure to name your final results n_types
. What is the average amount of services (n_types
) breweries in IN and MI offer (we are looking for the average of IN and MI combined)? Does that surprise you?
When you have one |
We show, in this video, how to find the average number of parts in a midwesterner’s name. Perhaps surprisingly, this same technique will be useful in solving Question 3. |
-
R code used to solve the question.
-
1-2 sentences answering the average amount of services breweries in Indiana and Michigan offer, and commenting on this answer.
Question 4
Write a function called compare_beers
that accepts a function that you will call FUN
, and any number of vectors of beer ids. The function, compare_beers
, should cycle through each vector/groups of beer_id`s, compute the function, `FUN
, on the subset of reviews
, and print "Group X: some_score" where X is the number 1+, and some_score is the result of applying FUN
on the subset of the reviews
data.
In the example below the function FUN
is the median
function and we have two vectors/groups of beer_id`s passed with c(271781) being group 1 and c(125646, 82352) group 2. Note that even though our example only passes two vectors to our `compare_beers
function, we want to write the function in a way that we could pass as many vectors as we want to.
Example:
compare_beers(reviews, median, c(271781), c(125646, 82352))
This example gives the output:
Group 1: 4 Group 2: 4.56
For your solution to this question, find the behavior of compare_beers
in this example:
compare_beers(reviews, median, c(88,92,7971), c(74986,1904), c(34,102,104,355))
There are different approaches to this question. You can use for loops or |
This first video shows how to use |
This second video basically walks students through how to build this function. If you use this video to learn how to build this function, please be sure to acknowledge this in your project solutions. |
-
R code used to solve the problem.
-
The result from running the provided example.
Question 5
Beer wars! IN and MI against AZ and CO. Use the function you wrote in question (4) to compare beer_id from each group of states. Make a cool plot of some sort. Be sure to comment on your plot.
Create a vector of |
This video demonstrates an example of how to use the |
-
R code used to solve the problem.
-
The result from running your function.
-
The resulting plot.
-
1-2 sentence commenting on your plot.