{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "


\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "Take notice:\n", "\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Week 2: Python and Metro" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A quick geopandas teaser\n", "Following our Python bootcamp last week (was it boring? exhilerating? a bit of both?), let's put that programming knowledge into action, using and creating data that reflects a real urban situation.\n", "\n", "We start by importing a new module `geopandas`. This is a pretty high level geospatial library, widely used by spatial data scientists all over the world. Don't worry about it too much for now, but know that it allows us to import a variety of spatial data formats, and plot them on a map.\n", "\n", "* [geopandas documentation](https://geopandas.readthedocs.io/en/latest/gallery/index.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import geopandas as gpd" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Next, we import some data. In this case, it is a [shapefile](https://desktop.arcgis.com/en/arcmap/latest/manage-data/shapefiles/what-is-a-shapefile.htm) I downloaded from the [LA Metro's Developer web portal](https://developer.metro.net/bus-rail-gis-data/). Notice that I am using relative paths to point to where the data is located in. \n", "\n", "* [read_file](https://geopandas.readthedocs.io/en/latest/docs/user_guide/io.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metro = gpd.read_file('data/Stations_All_0715.shp')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "Note that the reason we use `geopandas` instead of `pandas` (other than the fact that we love maps) is that `pandas` cannot read shapefiles, whereas `geopandas` can.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# what's the data type?\n", "type(metro)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# what does the data look like? \n", "metro.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Ah! Surprise, surprise. Welcome to your first look at a pandas dataframe. We will cover dataframes more extensively in later sessions, but know that a python dataframe is like an excel spreadsheet. \n", "\n", "![image.png](https://media.geeksforgeeks.org/wp-content/uploads/finallpandas.png)\n", "[(source)](https://www.geeksforgeeks.org/python-pandas-dataframe/?ref=lbp)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The `head()` command shows us the first 5 rows of the dataframe. You can also use `tail()` and `sample()`. Try these commands in the cells below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# try tail()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# try sample()\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pandas Data Types\n", "\n", "Let's look at the data types for each column. You can collectively get all the datatypes for each column in a dataframe using the `dtypes` command." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "metro.dtypes" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "But there is better command that will get you more info. Yes, the `info` command." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# dataframe info\n", "metro.info()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Wait. That looks different from what we have worked on! As it turns out, pandas datatypes are slightly different from the raw python datatypes. Check out the table below:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Pandas Type | Native Python Type | Description |\n",
"| --- | --- | --- |\n",
"| object | string | The most general dtype. Will be assigned to your column if column has mixed types (numbers and strings). |\n",
"| int64 | int | Whole numbers. 64 refers to the number of bits of memory allocated to hold each value. |\n",
"| float64 | float | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal. |\n",
"| datetime64, timedelta[ns] | N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments. |\n", "
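
To see these types appear in practice, here is a minimal sketch on a small made-up dataframe (the name `toy` and its columns are hypothetical, not part of the metro data). Under these assumptions, the whole-number column should come out as `int64`, the column with a missing value as `float64`, and the mixed column as `object`.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration (not the metro shapefile)
toy = pd.DataFrame({
    'riders': [120, 340, 95],                    # whole numbers only
    'riders_with_gap': [120, float('nan'), 95],  # whole numbers plus a missing value
    'mixed': [1, 'two', 3.0],                    # numbers and a string together
})

# dtypes lists the pandas type that was inferred for each column
print(toy.dtypes)
```

Notice that a single missing value is enough to turn a column of whole numbers into `float64`.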
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data exploration\n", "\n", "Part of data exploration is learning what is in your data. How many rows are there? What are the columns? How many rows represent a particular slice of the data?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# how many rows and columns?\n", "metro.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# what are the columns?\n", "metro.columns.to_list()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Counting unique values in a column\n", "\n", "First, learn how to get values for a single column." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# single column\n", "metro['LINE']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# another way\n", "metro.LINE" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "What are situations when one method is necessary over the other? (i.e. metro['LINE'] vs metro.LINE)\n", "
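
One way to explore this question is with a small made-up dataframe (the name `df` and its column names are hypothetical). This is only a sketch: it suggests that bracket notation is the safer choice when a column name contains a space, clashes with an existing dataframe method such as `count`, or does not exist yet.

```python
import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration
df = pd.DataFrame({
    'LINE': ['EXPO', 'RED'],
    'stop count': [19, 14],
    'count': [1, 2],
})

# For a simple column name, both notations return the same Series
same = df['LINE'].equals(df.LINE)

# A column name with a space can only be reached with brackets
spaced = df['stop count']

# df.count is the built-in DataFrame method, not the 'count' column
is_method = callable(df.count)
count_column = df['count']

# New columns must be created with brackets
df['color'] = 'gray'

print(same, is_method, list(df.columns))
```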
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This returns what is called a python series, a one dimensional array.\n", "![image.png](https://media.geeksforgeeks.org/wp-content/uploads/dataSER-1.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "But what if you want to know how many stations there are for each line?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "metro['LINE'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# save it as a variable\n", "line_count = metro['LINE'].value_counts()\n", "line_count" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# let's convert the series into a dataframe\n", "line_count = line_count.reset_index()\n", "line_count" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "type(line_count)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Rename columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# current columns as a list\n", "line_count.columns.to_list()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "To rename columns, simply give it a list of column names" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "line_count.columns = ['line', 'count']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "line_count" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### A quick bar plot" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "* [pandas plot](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/04_plotting.html#min-tut-04-plotting)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "line_count.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# give it additional arguments\n", "line_count.plot.bar(x = 'line', y = 'count', title = 'Number of stops per metro line')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# try it yourself. Create different plots using the metro dataframe\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Trimming the data\n", "Oftentimes, we import data and it has too many columns. It is always good practice to elimnate those rows that you are sure you will not use, and keep your data \"clean\" and \"mean.\"\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# output the original data's info\n", "metro.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# show a dataframe with a subset of columns\n", "metro[['LINE','LINENUM','STATION','LAT','LONG','geometry']]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Wait, why the double square brackets? 
`[[...]]`\n", "\n", "The reason for this is that we are feeding the dataframe a list of column names. Another way to do the same thing would be:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# list of desired column names\n", "desired_columns = ['LINE','LINENUM','STATION','LAT','LONG','geometry']\n", "\n", "# subset based on desired columns\n", "metro[desired_columns]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "If you now print the dataframe, what happens?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "metro.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "What happened? Why has the dataframe reverted to the original data?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "That's right. In order to preserve your new dataframe, you have to **declare** it as a new variable. And finally, whenever you make a copy of a dataframe, it is **highly recommended** to add the `.copy()` command at the end:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "metro_trimmed = metro[desired_columns].copy()\n", "metro_trimmed" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Subsetting/querying/filtering the data\n", "\n", "What if you only want to see a subset of the data? Or create a new table based on a query?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "metro_trimmed[metro_trimmed.LINE == 'EXPO']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way using `.loc`\n", "- https://www.w3resource.com/pandas/dataframe/dataframe-loc.php" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# another way .loc\n", "metro_trimmed.loc[metro_trimmed['LINE'] == 'EXPO']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# another way .query\n", "metro_trimmed.query(\"LINE == 'EXPO'\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# try it yourself. Query the dataframe for other properties of interest\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Plotting\n", "\n", "We have now imported a shapefile, trimmed it, and created a series of queried subsets. Let's visualize our data. First, simply pass it the `plot()` command to see what it looks like." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "metro_trimmed.plot()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "That's great! Very rewarding, with a single command. The reason it is able to plot the station points is because of the `geometry` column that is created from the shapefile. This is a unique geopandas feature." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "metro_trimmed.geometry" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### A prettier plot\n", "\n", "You can add additional arguments to make the plot prettier: change the size, add legends, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metro_trimmed.plot(\n", " figsize=(20,12), #size of the plot (a bit bigger than the default)\n", " column = 'LINE', # column that defines the color of the dots\n", " legend = True, # add a legend \n", " legend_kwds={\n", " 'loc': 'upper right',\n", " 'bbox_to_anchor':(1.3,1)\n", " } # this puts the legend to the side\n", ") " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Mapping with folium" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Now it's time for another module. Everybody, please welcome `folium`. Folium brings Leaflet, an open source JavaScript mapping library, into our Python environment, allowing you to create instant interactive maps. Try it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import folium" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# default folium map\n", "m = folium.Map()\n", "m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A world map is cool... but let's add arguments to the `folium.Map` command. Specifically, we can feed it a center latitude value, a center longitude value, and a default zoom level." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Get average lat/lon's" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# average latitude\n", "latitude = metro_trimmed.LAT.mean()\n", "latitude" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# average longitude\n", "longitude = metro_trimmed.LONG.mean()\n", "longitude" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Folium map with arguments\n", "\n", "Complete the code cell below with arguments to center the map based on the metro coordinates calculated above, and adjust the zoom level accordingly. Refer to the [folium documentation](https://python-visualization.github.io/folium/quickstart.html) as necessary." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# complete this code so that the map will show up \n", "# centered based on the average lat/lon calculated above\n", "# adjust the zoom level accordingly\n", "m = folium.Map()\n", "m" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Adding point markers\n", "\n", "How do you add a marker to a folium map?\n", "\n", "* [Folium quickstart](https://python-visualization.github.io/folium/quickstart.html)\n", "\n", "```\n", "folium.Marker([45.3288, -121.6625], popup='Mt. Hood Meadows', tooltip=tooltip).add_to(m)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Since we want to add a marker *for each station* in our dataframe, we do a for loop, and add the marker within the loop." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# first, note how to loop through a dataframe:\n", "for index, row in metro_trimmed.iterrows():\n", " print(row.STATION, row.LAT, row.LONG)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Using the for loop logic above, create a folium marker for each row in the dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# add the stations\n", "for index, row in metro_trimmed.iterrows():\n", " # add folium marker code\n", "\n", "m" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Color code markers\n", "That's great, but can we color code the markers so that they correspond to their metro lines?\n", "\n", "To do so:\n", "\n", "1. create a new column `color`\n", "1. 
add a color of choice based on the LINE that each row represents" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# add a new column\n", "metro_trimmed['color'] = ''" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "metro_trimmed.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Find unique values in a column" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# find unique values in the LINE column\n", "metro_trimmed.LINE.unique()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Update column based on a query on another column\n", "We now want to populate the newly created `color` column with values based on the LINE.\n", "\n", "Remember how you used the `loc` command to query the data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# display rows that match a query\n", "metro_trimmed.loc[metro_trimmed['LINE'] == 'EXPO']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The `loc` command has additional functionalities. You can use it to update a field based on a query on *another* field. 
But first, note that folium accepts only a few named colors, according to their [documentation](https://python-visualization.github.io/folium/modules.html): \n", "\n", "```\n", "['red', 'blue', 'green', 'purple', 'orange', 'darkred', 'lightred', 'beige', 'darkblue', 'darkgreen', 'cadetblue', 'darkpurple', 'white', 'pink', 'lightblue', 'lightgreen', 'gray', 'black', 'lightgray']\n", "```" ] }, { "attachments": { "image.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Using the `loc` command, we can update the color column for a single LINE:\n", "![image.png](attachment:image.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "metro_trimmed.loc[metro_trimmed['LINE'] == 'EXPO', 'color'] = 'orange'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# check your work\n", "metro_trimmed.loc[metro_trimmed['LINE'] == 'EXPO']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now it's your turn. 
Update the color column for all other metro LINEs in the cell below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "metro_trimmed.sample(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# reset the map (you need to do this to erase previous layers)\n", "# note: the hosted 'Stamen Terrain' tiles were retired in 2023, so a CartoDB basemap is used here instead\n", "m = folium.Map(location=[latitude,longitude], tiles='CartoDB positron', zoom_start=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# add the stations with color icons\n", "for index, row in metro_trimmed.iterrows():\n", " tooltip_text = row.LINE + ' Line: ' + row.STATION\n", " folium.Marker(\n", " [row.LAT,row.LONG], \n", " popup=row.STATION, \n", " tooltip=tooltip_text,\n", " icon=folium.Icon(color=row.color)\n", " ).add_to(m)\n", "\n", "# show the map\n", "m" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Saving your folium map as an HTML file" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# save the interactive maps as an html file\n", "m.save('metro.html')" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", 
"title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "757px", "left": "523px", "top": "110px", "width": "384px" }, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }