Please see my accompanying blog post.
This notebook is available for download:
pip
In a Jupyter Notebook you can run a shell command by prefixing the line with `!`. The `%matplotlib` line is a special Jupyter-only thing that tells it to display plots inline (as opposed to e.g. outputting to a file).
!pip install geopandas > /dev/null
%matplotlib notebook
import pathlib
import urllib.request
import geopandas as gpd
Data is provided by census.gov. The loop below checks whether each file has already been downloaded and, if not, downloads the zip archive.
states_filename = "tl_2017_us_state.zip"
states_url = f"https://www2.census.gov/geo/tiger/TIGER2017/STATE/{states_filename}"
states_file = pathlib.Path(states_filename)
zipcode_filename = "tl_2017_us_zcta510.zip"
zipcode_url = f"https://www2.census.gov/geo/tiger/TIGER2017/ZCTA5/{zipcode_filename}"
zipcode_file = pathlib.Path(zipcode_filename)
for data_file, url in zip([states_file, zipcode_file], [states_url, zipcode_url]):
    if not data_file.is_file():
        with urllib.request.urlopen(url) as resp, \
                open(data_file, "wb") as f:
            f.write(resp.read())
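The loop above reads each response fully into memory before writing it out, which is fine for small files but can be heavy for a large archive. As a rough sketch of an alternative, the hypothetical helper below (the name `download_if_missing` and the `.part` temporary suffix are my own choices, not part of the original notebook) streams the response to disk in chunks and only renames the file once the download has finished, so an interrupted download is not mistaken for a complete one.

```python
import pathlib
import shutil
import urllib.request


def download_if_missing(url, dest, chunk_size=1024 * 1024):
    """Download url to dest unless dest already exists.

    Hypothetical helper: returns True if a download happened,
    False if the file was already present.
    """
    dest = pathlib.Path(dest)
    if dest.is_file():
        return False
    # Write to a temporary name first so a partial download
    # never looks like a finished file on the next run.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as f:
        # Copy in chunks instead of holding the whole body in memory.
        shutil.copyfileobj(resp, f, chunk_size)
    tmp.replace(dest)
    return True
```

Assuming the variables defined earlier, it could be called as `download_if_missing(states_url, states_file)`.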
Believe it or not, this was the part that took me the longest to figure out. I kept opening up the zip file and having it read in the individual component files, which would give me either just the plot data (without the associated zip codes) or something else incomplete. It turns out you just need to prefix the file path with `zip://` and it's smart enough to do the rest for you.
zipcode_gdf = gpd.read_file(f"zip://{zipcode_file}")
states_gdf = gpd.read_file(f"zip://{states_file}")
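Some background on why reading the component files one at a time gives incomplete results: a shapefile is really a bundle of sibling files, where the `.shp` file holds the geometry, the `.dbf` file holds the attribute table, and `.shx` and `.prj` hold the index and projection. A minimal sketch for peeking inside an archive with only the standard library is below; the helper name `shapefile_parts` is made up for illustration, and I am not asserting the exact contents of the census archives.

```python
import zipfile
from pathlib import PurePosixPath


def shapefile_parts(zip_path):
    """Map each file extension found in the archive to its member name.

    Hypothetical helper for inspection only. If two members share an
    extension, the later one wins.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return {
            PurePosixPath(name).suffix.lower(): name
            for name in archive.namelist()
        }
```

Calling `shapefile_parts(zipcode_file)` should show the geometry and attribute files side by side, which is why the reader needs access to the whole bundle rather than a single member.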
Just like any other Pandas DataFrame:
- `.head()` lets you look at the first few rows (or `.tail()` for the last few)
- `.dtypes` shows you the data type of each column
- `.plot()` plots your data
- `.iloc[0, :]` selects all columns of the first row as a Series (can be abbreviated `.iloc[0]`)
- `.iloc[0, 0:2]` selects the first two columns of the first row as a Series (can be abbreviated `.iloc[0, :2]`)
- `.iloc[:, 0]` selects all rows of the first column as a Series
- `.iloc[[0], :]` selects all columns of the first row as a DataFrame (can be abbreviated `.iloc[[0]]`)
- `.iloc[[0], 0:2]` selects the first two columns of the first row as a DataFrame (can be abbreviated `.iloc[[0], :2]`)
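To make the Series versus DataFrame distinction concrete, here is a small sketch on a plain pandas DataFrame. The column names and values are made up for illustration and are not taken from the census data; the same `.iloc` calls should behave the same way on a GeoDataFrame.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame(
    {
        "code": ["A1", "B2", "C3"],
        "label": ["first", "second", "third"],
        "size": [10, 20, 30],
    }
)

first_row_series = df.iloc[0]          # integer index -> Series
first_row_frame = df.iloc[[0]]         # list index -> one-row DataFrame
first_two_cols = df.iloc[0, :2]        # first row, first two columns (Series)
first_column = df.iloc[:, 0]           # every row of the first column (Series)
first_two_cols_frame = df.iloc[[0], :2]  # same cells, kept as a DataFrame

print(type(first_row_series).__name__)
print(type(first_row_frame).__name__)
```

The list form matters in practice because a one-row DataFrame keeps its column layout, which is the shape a call like `.plot()` on a GeoDataFrame expects.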
zipcode_gdf.head()
zipcode_gdf.dtypes
Note that `GEOID10` is not an `int`, so we'll need to use a string (`"87420"`) as opposed to an `int` (`87420`).
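A quick sketch of why the type matters, using a toy Series rather than the real data (the values below are illustrative only): comparing a column of strings against an integer silently matches nothing, and converting to integers would drop the leading zero that some ZIP codes have.

```python
import pandas as pd

# Toy stand-in for the GEOID10 column, stored as Python strings.
codes = pd.Series(["87420", "02134", "87401"], dtype=object, name="GEOID10")

matches_with_int = int((codes == 87420).sum())    # string vs. int: no matches
matches_with_str = int((codes == "87420").sum())  # string vs. string: one match

# Converting to int loses the leading zero of "02134".
as_int = codes.astype(int)

print(matches_with_int, matches_with_str)
print(as_int.tolist())
```

No error is raised in the integer case, so an empty result is the only hint that the types did not line up.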
zipcode_gdf.plot();
states_gdf.head()
states_gdf.plot();
# First row as a Series
zipcode_gdf.iloc[0]
# First row as a DataFrame
zipcode_gdf.iloc[[0], :]
# Plot the first row of the dataframe
zipcode_gdf.iloc[[0], :].plot();
`df["column_name"] == "foo"` returns a Series of booleans (e.g. `[False, False, True, False]`), so in order to filter a DataFrame and show only rows where `column_name` is `"foo"`, one common strategy is to select from the dataframe using that array of booleans, e.g.:
df[df["column_name"] == "foo"]
This pattern takes a little while to get used to.
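It can help to split the pattern into two steps and look at the intermediate mask. The sketch below uses a toy DataFrame with invented values (only the `GEOID10` column name is borrowed from the data above; the `label` column is hypothetical).

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame(
    {
        "GEOID10": ["87420", "87401", "87301"],
        "label": ["first", "second", "third"],
    }
)

# Step 1: the comparison yields one boolean per row.
mask = df["GEOID10"] == "87420"

# Step 2: indexing with the mask keeps only the rows marked True.
filtered = df[mask]

print(mask.tolist())
print(filtered)
```

Writing `df[df["GEOID10"] == "87420"]` is the same two steps collapsed into one expression.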
shiprock = zipcode_gdf[zipcode_gdf["GEOID10"] == "87420"]
shiprock.plot();
Compare with: https://www.maptechnica.com/zip-code-map/87420
newmexico = states_gdf[states_gdf['NAME'] == "New Mexico"]
base = newmexico.plot(color='#FFD700')
shiprock.plot(ax=base, color='#BF0A30');