STAT 19000: Project 3 — Fall 2021
Motivation: data.frames
are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a data.frame
.
Context: In the previous project we got our feet wet, ran our first R code, and learned about accessing data inside vectors. In this project we will continue to reinforce what we’ve already learned and introduce a new, flexible data structure called `data.frame`s.
Scope: r, data.frames, recycling, factors
Dataset(s)
The following questions will use the following dataset(s):
-
/depot/datamine/data/olympics/*.csv
Questions
Question 1
Use R code to print the names of the two datasets in the /depot/datamine/data/olympics
directory.
Read the larger dataset into a data.frame called olympics
.
Print the first 6 rows of the olympics
data.frame, and take a look at the columns. Based on that, write 1-2 sentences describing the dataset (how many rows, how many columns, the type of data, etc.) and what it holds.
Relevant topics: list.files, file.info, read.csv, head
-
Code used to solve this problem.
-
Output from running the code.
-
1-2 sentences explaining the dataset.
Question 2
How many unique sports are accounted for in our olympics
dataset? Print a list of the sports. Is there any sport that you weren’t expecting? Why or why not?
R is a case-sensitive language. What this means is that whether or not 1 or more letters in a word are capitalized is important. For example, the following two variables are different.
So, when you are examining a
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
|
Relevant topics: unique, length
-
Code used to solve this problem.
-
Output from running the code.
-
1-2 sentences explaining the results.
Question 3
Create a data.frame called us_athletes
that contains only information on athletes from the USA. Use the column NOC
(National Olympic Committee 3-letter code). How many rows does us_athletes
have?
Now, perform the same operation on the olympics
data.frame, this time containing only the information on the athletes from the country of your choice. Name this new data.frame appropriately. How many rows does it have?
Now, create a data.frame called both
that contains the information on the athletes from the USA and the country of your choice. How many rows does it have?
-
Code used to solve this problem.
-
Output from running the code.
-
How many rows or athletes in the
us_athletes
dataset? -
How many rows or athletes in the other country’s dataset?
-
How many rows or athletes in the
both
dataset?
Question 4
What percentage of US athletes are women? What percentage of US athletes with gold medals are women?
Answer the same questions for your "other" country from question (3).
Relevant topics: prop.table, table, indexing
-
Code used to solve this problem.
-
Output from running the code.
Question 5
What is the oldest US athlete to compete based on our us_athletes
data.frame? At what age, in which sport, and what year did the athlete compete in?
Answer the same questions for your "other" country from question (3) and question (4).
Make sure you using indexing to only print the athlete’s information (age, sport, year). |
-
Code used to solve this problem.
-
Output from running the code.
-
Age, sport, and olympics year that the oldest athlete competed in, for each of your countries.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. |