STAT 19000: Project 5 — Spring 2022
Motivation: We will pause in our series of pandas and numpy projects to learn one of the most important parts of writing programs: functions! Functions allow us to reuse snippets of code effectively. They are a great way to reduce repetition and keep code organized and readable.
Context: We are focusing on learning about writing functions in Python.
Scope: python, functions, pandas, matplotlib
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/whin/190/stations.csv
- /depot/datamine/data/whin/190/observations.csv
Questions
We are very lucky to have great partners in the Wabash Heartland Innovation Network (WHIN)! They generously provide us with access to their API (here) for educational purposes. You’ve most likely either used their API in a previous project, or you’ve worked with a sample of their data to solve some sort of data-driven problem. In this project, we will be using a slightly modified sample of their dataset to learn more about how to write functions.
Question 1
First, read both datasets into variables named stations and obs.
Second, take a look at the head of both dataframes. You will notice that the station_id column in the obs dataframe appears to correspond to the id column in the stations dataframe. This is a fairly common occurrence when data has been normalized for a database.
For our current project we will work with a single dataset.
pandas has a merge method that can be used to join two dataframes based on a common column. Here the id column from the stations dataframe matches the station_id column in the obs dataframe. Here is the explanation on merge.
Use |
Once merged, you will notice that in the new dataframe, dat, the id column from the obs dataframe is now labeled id_x, and the id column from the stations dataframe is now labeled id_y.
Use the pandas drop method to remove the id_y column.
Use the pandas rename method to rename id_x to id, and name to station_name.
Great! We have cleaned up our dataframe so it is easier to work with, while learning a variety of useful pandas methods.
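The merge, drop, and rename steps above can be sketched on toy data. This is only an illustration: the two small dataframes below are invented stand-ins for the real files, and the temperature column is a hypothetical example column, not something taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the real files; column names follow the text,
# values (and the "temperature" column) are invented for illustration.
stations = pd.DataFrame({
    "id": [1, 2],
    "name": ["Station A", "Station B"],
    "latitude": [40.4, 40.5],
    "longitude": [-86.9, -87.0],
})
obs = pd.DataFrame({
    "id": ["obs_a", "obs_b", "obs_c"],
    "station_id": [1, 2, 1],
    "temperature": [70.1, 68.3, 71.0],
})

# Join each observation to its station: obs.station_id matches stations.id.
dat = pd.merge(obs, stations, left_on="station_id", right_on="id")

# Both inputs had an "id" column, so pandas suffixes them:
# id_x comes from obs (left), id_y comes from stations (right).
dat = dat.drop(columns=["id_y"])
dat = dat.rename(columns={"id_x": "id", "name": "station_name"})

print(dat.columns.tolist())
```

Because id_y duplicates station_id after the merge, dropping it loses no information.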
- Code used to solve this problem.
- Output from running the code.
Question 2
When looking at the new dataset, you may have noticed a mix of letters and numbers in the id column. Below are a few samples of the contents of that column.
obs_1NnyYGMtAHBFDYWOBlsDlqppzVI
obs_1No0NHuqV4VjOK8p8FguPT02T5B
obs_1NqnftCklLZHBCHyykvcuc8QvE9
obs_1NqpV058q10hGNBNvYOBzzwpqOx
obs_1NqrK3mraUzaj2j7hg6VcB23RjJ
The mix of numbers and letters in this column is a variation on ksuid, a K-sortable globally unique id. Ksuids are beneficial because they are sortable by time and are unique identifiers (there is a minimal chance that any two ids would be the same). If you are interested you can read more here.
Next, write a function called get_datetime that accepts a ksuid (as a string) and returns the datetime.
You can use the
Don’t forget to remove the "obs_" from the beginning of the ksuid.
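To see what such a function has to do, here is a minimal sketch that decodes a ksuid by hand with only the standard library. It assumes the standard KSUID layout (27 base62 characters encoding 20 bytes, where the first 4 bytes are seconds counted from an offset of 1400000000). The text calls these ids a "variation" on ksuid, so this layout is an assumption and the result may differ from what a dedicated package returns. Returning a UTC-aware datetime is also a choice made here, not something the text specifies.

```python
from datetime import datetime, timezone

# Assumed standard KSUID layout: 27 base62 chars -> 20 bytes
# (4-byte timestamp followed by a 16-byte random payload).
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
KSUID_EPOCH = 1_400_000_000  # KSUID timestamps count seconds from this offset


def get_datetime(ksuid: str) -> datetime:
    """Return the (UTC) datetime embedded in a ksuid string."""
    # Strip the "obs_" prefix if it is present.
    if ksuid.startswith("obs_"):
        ksuid = ksuid[len("obs_"):]

    # Interpret the remaining characters as one big base62 number.
    number = 0
    for char in ksuid:
        number = number * 62 + BASE62.index(char)

    # The 4 most significant of the 20 bytes hold the timestamp.
    raw = number.to_bytes(20, byteorder="big")
    seconds = int.from_bytes(raw[:4], byteorder="big") + KSUID_EPOCH
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

Because the timestamp occupies the most significant bytes, ksuids with later timestamps compare as larger strings, which is what makes them sortable by time.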
The following code should result in the following output.
for k in ksuids:
    print(get_datetime(k))
2019-07-10 04:00:00
2019-07-10 04:15:00
2019-07-11 04:00:00
2019-07-11 04:15:00
2019-07-11 04:30:00
To verify the ordering claim (that is, that sorting the ksuids puts the observations in chronological order), first use the sample method to get 10 random id values from the dat dataframe.
Then sort the values, loop through the sorted list, and use your get_datetime function to print the datetime for each value.
Can you confirm that sorting the ksuids automatically sorts the observations by datetime?
- Code used to solve this problem.
- Output from running the code.
Question 3
In this dataset we are given latitude and longitude values in degrees. We want to convert the degrees to radians. Write a function called degrees_to_radians that accepts a latitude or longitude value in degrees, and returns the same value in radians.
The formula to do this is:
$\text{degrees} \cdot \operatorname{arctan2}(0, -1) / 180$
Here $\operatorname{arctan2}(0, -1)$ evaluates to $\pi$.
Make sure to convert your result from a |
To test out your function you can use:
degrees_to_radians(88.0)
1.53588974175501
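As a minimal sketch of the conversion, the version below uses math.atan2 from the standard library in place of a numpy arctan2. That swap is an assumption made to keep the example dependency-free. The float() call is there so the result is a plain Python float even if the input is a numpy value.

```python
import math


def degrees_to_radians(degrees: float) -> float:
    """Convert an angle from degrees to radians."""
    # math.atan2(0, -1) evaluates to pi, so this is degrees * pi / 180.
    return float(degrees * math.atan2(0, -1) / 180)


print(degrees_to_radians(88.0))
```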
- Code used to solve this problem.
- Output from running the code.
Question 4
Write a function called get_distance that accepts two pandas Series, each containing a latitude and longitude value, and returns the distance between the two points in kilometers.
You can do this by using the Haversine formula.
$2 r \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1)\cos(\phi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$
Where:
- $r$ is the radius of the Earth in kilometers; we can use 6367.4447 kilometers
- $\phi_1$ and $\phi_2$ are the latitude coordinates of the two points
- $\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points
In the formula above, the latitudes and longitudes need to be converted from degrees to radians. Your function from Question 3 will be perfect for this! You can even put your
It is common practice in the Python world to add an underscore as a prefix to helper functions. It is a sign that the function is just for "internal" use and should largely be ignored by the user. Follow this practice and prefix your
Test your function on the 2 rows with the following id values.
obs_1amnn4xst3O9VOawmUHFiqBVnCK
obs_1fwlznMZXXS8WBkmyTHRgWnHYYf
37.896692299010574
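The Haversine formula above translates to code almost term by term. The sketch below is one possible shape, not the required solution. It assumes each point is anything indexable by "latitude" and "longitude" (a pandas Series row works, and so does a plain dict, which is what the example uses). The underscore-prefixed helper follows the naming convention described above.

```python
import math

EARTH_RADIUS_KM = 6367.4447  # radius of the Earth given in the text


def _degrees_to_radians(degrees: float) -> float:
    # Helper for "internal" use: atan2(0, -1) evaluates to pi.
    return degrees * math.atan2(0, -1) / 180


def get_distance(point_1, point_2) -> float:
    """Haversine distance in kilometers between two points.

    Each point must be indexable by "latitude" and "longitude"
    (for example a pandas Series or a dict), with values in degrees.
    """
    phi_1 = _degrees_to_radians(point_1["latitude"])
    phi_2 = _degrees_to_radians(point_2["latitude"])
    lambda_1 = _degrees_to_radians(point_1["longitude"])
    lambda_2 = _degrees_to_radians(point_2["longitude"])

    # The expression under the square root in the formula.
    inner = (
        math.sin((phi_2 - phi_1) / 2) ** 2
        + math.cos(phi_1) * math.cos(phi_2) * math.sin((lambda_2 - lambda_1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(inner))


# Two points on the equator, half the globe apart (invented test values).
a = {"latitude": 0.0, "longitude": 0.0}
b = {"latitude": 0.0, "longitude": 180.0}
print(get_distance(a, b))
```

A quick sanity check: two points on opposite sides of the equator should be half the circumference apart, which is $\pi r$.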
-
Code used to solve this problem.
-
Output from running the code.
Question 5
Great! Make sure to note these solutions for future use…
Next, write a function called plot_stations. plot_stations should accept a dataset as an argument and produce a plot with the station locations plotted on a map.
For consistency, we will use plotly to produce the plot. This stackoverflow post will show some samples. For further understanding, here is the explanation for the function.
We want to be careful not to plot the same point over and over. To avoid that, reduce the dataset (inside the function) so that each pair of latitude and longitude values is plotted only once.
Set hover_name to "station_id" so that hovering over a point displays the station id.
Set scope to "usa" to reduce the map to the USA. Be sure to zoom in on the map so you can see the stations within Indiana!
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.