# US Median Income Over Time

Tags:

**Bottom Line:** A brief discussion of median vs mean and a look at SSA data on
the US median wage.

Download this jupyter notebook in runnable format here and view the post on my website here.

NB: There may be (hopefully minor) edits that are out of sync between the blog post and the downloaded notebook.

```
# pip install git+https://github.com/n8henrie/nb_black
# This library automatically runs the `black` code formatter in jupyter cells.
# The n8henrie/nb_black fork is slightly modified to allow specifying a custom
# line length instead of using the default.
import lab_black
lab_black.load_ipython_extension(get_ipython(), line_length=79)
```

# US Median Wage Estimates, 1990-2019

I often hear that the average US citizen has an income of around $50,000 per year. I hadn’t previously seen where this number comes from, but recently I found that the Social Security Administration provides some pretty interesting data at their website.

For example, here is the US average wage index over time (displaying a local copy below):

I recently discovered that they provide a webapp showing a summary of the US
W-2 reported income. Most interestingly, it includes binned data so one can
look at the changes in specific income brackets over time, and additionally
they provide an estimate of the *median* income as well.

An example URL for the webapp: https://www.ssa.gov/cgi-bin/netcomp.cgi?year=2003

## Median vs mean

Something I hadn’t considered (for whatever reason) is that discussion on the
state of economic affairs based solely on the *average* income likely
misrepresents how *most* of the US is doing. The *median* income is probably
more representative of what most of us have in mind.

As an example, if we have a population of 100 individuals whose income is normally distributed around 5, mostly spanning 0 to 10 (choose whatever units you like), the mean and median tend to be similar:

```
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng()
population = rng.normal(loc=5, scale=1.5, size=100)
print(
    f"""First 10 values of population: {population[:10]}
{np.median(population)=}
{np.mean(population)=}
"""
)
plt.hist(population, bins=20, range=[0, 10])
```

```
First 10 values of population: [5.37663355 3.77759921 2.8321777 4.88866131 4.02549337 7.890585
8.07219506 4.30035334 2.18794887 2.15074077]
np.median(population)=4.921754367304535
np.mean(population)=4.982034011567311
```

However, if we take *just one* member of the population and make their income
an extreme outlier, the *average* goes up, but the *median* does not
(assuming the outlier was selected from above the median to start with).

```
population.sort()
initial_median = np.median(population)
initial_mean = np.mean(population)
```

```
# Make the 49th richest (52nd poorest) individual 100 times richer
population[51] *= 100
```

```
print(
    f"""Average income before: {initial_mean}
Average income after: {np.mean(population)}
Median income before: {initial_median}
Median income after: {np.median(population)}"""
)
```

```
Average income before: 4.982034011567312
Average income after: 9.940573583180026
Median income before: 4.921754367304535
Median income after: 4.921754367304535
```

We see that the *average* income of the population has nearly doubled!

While that *sounds* great, the truth of the matter is that *most* of this
population has not actually improved at all.

As a matter of fact, we could even have a situation where all but one lucky
individual see a *decrease* in their income, but the average *still* goes up.

```
# Everyone except the lucky outlier at index 51 takes a 5% pay cut
population[:51] *= 0.95
population[52:] *= 0.95
```

```
print(
    f"""Average income before: {initial_mean}
Average income after: {np.mean(population)}
Median income before: {initial_median}
Median income after: {np.median(population)}"""
)
```

```
Average income before: 4.982034011567312
Average income after: 9.693976195516614
Median income before: 4.921754367304535
Median income after: 4.675666648939308
```

In this example, *99%* of the population is 5% *poorer* than they were to start with,
which is reflected by the dropping *median* income, but contrasts starkly with
a near **doubling** of the *average* income of the population.

Hopefully this is old news to most readers, but I think it was worth briefly
reviewing. Statistics matter! Especially when some people in our country are
worth *hundreds of billions of dollars*, while the official poverty rate
is more than 10% – some 33 million people.

## SSA data vs census data

Before going on to the SSA data, I wanted to point out that the SSA data below
differs *substantially* from the data available from the US Census (and for
fellow nerds, my understanding is that census.gov has a good API – I look
forward to exploring their data soon!). For example, their report
shows the 2019 median household income to be about $68,000 – *much* higher
than the SSA numbers below.

It seems like some of the important differences to keep in mind include:

- census data is *reported*, SSA data is *measured* (W-2)
- census data is per *household*, SSA data is per *individual* – one might
  expect two-income households to show roughly twice as much in the census
  data as compared to SSA
- AFAIK, SSA covers *only W-2 income*, meaning that capital gains and income
  from other work will not be reflected in SSA, but may be included in what
  is reported by individuals in the census data
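As a rough sanity check of the household-vs-individual point, here is the back-of-envelope arithmetic (a sketch; the exactly-two-earners assumption is mine, and real households vary widely):

```python
# Crude back-of-envelope: under the (strong) assumption of exactly two
# earners per household, the census household median implies a per-earner
# figure that can be compared against the SSA individual medians below.
census_household_median_2019 = 68_000  # approximate figure from the census report
implied_per_earner = census_household_median_2019 / 2
print(f"Implied income per earner: ${implied_per_earner:,.0f}")
```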

With that out of the way, on to the numbers!

```
# main dependencies for the notebook
import pickle
import re
import typing as t
from urllib.request import urlopen

import altair as alt
import pandas as pd
from sklearn.linear_model import LinearRegression

# NB: you will also need `pyarrow` installed to store and read dataframes from
# feather format. You don't need to import it, but uncomment below to test if
# you have it available.
# import pyarrow
```

```
# imports and helper function to run doctests
import copy
import doctest


def testme(func):
    """
    Automatically runs doctests for a decorated function.

    https://stackoverflow.com/a/49659927/1588795
    """
    globs = copy.copy(globals())
    globs.update({func.__name__: func})
    doctest.run_docstring_examples(
        func, globs, verbose=False, name=func.__name__
    )
    return func
```

```
# Build up some regular expressions to parse out the data from the
# relevant paragraphs
MONEY = r"\$[0-9,]+\.[0-9]{2}"
YEAR = r"\d{4}"

mean_re = re.compile(
    'The "raw" average wage, computed as net compensation divided by the '
    f"number of wage earners, is {MONEY} divided by [0-9,]+, or "
    rf"({MONEY})\."
)
median_re = re.compile(
    "By definition, 50 percent of wage earners had net compensation less "
    "than or equal to the <i>median</i> wage, which is estimated to be "
    rf"({MONEY}) for {YEAR}\."
)
```
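To sanity check the pattern, we can run it against a sentence shaped like the one on the SSA page (the sample sentence and dollar amounts below are constructed to match the regex, not copied from SSA.gov):

```python
import re

MONEY = r"\$[0-9,]+\.[0-9]{2}"
mean_re = re.compile(
    'The "raw" average wage, computed as net compensation divided by the '
    f"number of wage earners, is {MONEY} divided by [0-9,]+, or "
    rf"({MONEY})\."
)

# A made-up sentence in the same shape as the SSA page text
sample = (
    'The "raw" average wage, computed as net compensation divided by the '
    "number of wage earners, is $1,000,000.00 divided by 20, or $50,000.00."
)
print(mean_re.search(sample).group(1))  # $50,000.00
```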

```
@testme
def clean_number(string: str) -> float:
    """
    Turn a string of a dollar amount into a float.

    >>> clean_number("$123,456.78")
    123456.78
    """
    return float(string.lstrip("$").replace(",", ""))


def get_year(year: int) -> t.Dict[str, float]:
    """
    Get the median and mean for `year` from SSA.gov.
    """
    with urlopen(
        f"https://www.ssa.gov/cgi-bin/netcomp.cgi?year={year}"
    ) as req:
        resp = req.read().decode()
    clean_whitespace = " ".join(resp.split())
    median = median_re.search(clean_whitespace).group(1)
    mean = mean_re.search(clean_whitespace).group(1)
    return {
        "year": year,
        "median": clean_number(median),
        "mean": clean_number(mean),
    }
```

```
# If a feather file is available, load the dataframe from the local file.
# Otherwise, query ssa.gov for 1990-2019, create a dataframe, and make a
# local copy in feather format to prevent unnecessary http requests.
try:
    df = pd.read_feather("20201022_ssa_median_income.feather").set_index(
        "year"
    )
except FileNotFoundError:
    df = (
        pd.DataFrame(get_year(year) for year in range(1990, 2020))
        .set_index("year")
        .sort_index()
    )
    df.reset_index().to_feather("20201022_ssa_median_income.feather")
```

```
alt.Chart(df.reset_index().melt("year")).mark_line().encode(
    x="year:O",
    y=alt.Y("value", axis=alt.Axis(format="$,d", title="income $USD")),
    color=alt.Color("variable", legend=alt.Legend(title=None)),
).properties(title="Source: SSA.gov").interactive()
```

```
# Make a mapping of {year: dataframe} where the dataframe includes the entire
# chart of income levels. Again, once fetched, pickle the object locally to
# avoid unnecessary http requests. Please be familiar with the security
# concerns around loading *untrusted* pickled data before running this block.
try:
    with open("income_dfs.pkl", "rb") as f:
        income_dfs = pickle.load(f)
except FileNotFoundError:
    income_dfs = {}
    for year in range(1990, 2020):
        tmp_df = pd.read_html(
            f"https://www.ssa.gov/cgi-bin/netcomp.cgi?year={year}"
        )[-2]
        tmp_df.columns = tmp_df.columns.droplevel()
        tmp_df[["Aggregate amount", "Average amount"]] = tmp_df[
            ["Aggregate amount", "Average amount"]
        ].apply(
            lambda x: x.str.lstrip("$").str.replace(",", "").astype(float)
        )
        income_dfs[year] = tmp_df
    with open("income_dfs.pkl", "wb") as f:
        pickle.dump(income_dfs, f)
```

```
# Give an idea of the shape of this data
income_dfs[2003].head()
```

|   | Net compensation interval | Number | Cumulative number | Percent of total | Aggregate amount | Average amount |
|---|---|---|---|---|---|---|
| 0 | $0.01 — 4,999.99 | 26312244 | 26312244 | 17.81198 | 5.361136e+10 | 2037.51 |
| 1 | 5,000.00 — 9,999.99 | 15231616 | 41543860 | 28.12296 | 1.125687e+11 | 7390.46 |
| 2 | 10,000.00 — 14,999.99 | 13262655 | 54806515 | 37.10107 | 1.651411e+11 | 12451.59 |
| 3 | 15,000.00 — 19,999.99 | 12733058 | 67539573 | 45.72066 | 2.225614e+11 | 17479.02 |
| 4 | 20,000.00 — 24,999.99 | 12034969 | 79574542 | 53.86769 | 2.703204e+11 | 22461.24 |
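As an aside, binned tables like this one are enough to approximate a median yourself: find the bin where the cumulative count crosses 50% and interpolate linearly within it. A minimal sketch, using just the first three rows of the 2003 table above (so the result is the median of that *truncated* population only, not the real 2003 median):

```python
# Estimate the median from binned counts via linear interpolation within
# the bin that contains the 50th percentile.
# Bins are (low, high, count), taken from the first three rows above.
bins = [
    (0.0, 5_000.0, 26_312_244),
    (5_000.0, 10_000.0, 15_231_616),
    (10_000.0, 15_000.0, 13_262_655),
]

total = sum(count for _, _, count in bins)
half = total / 2

cumulative = 0
for low, high, count in bins:
    if cumulative + count >= half:
        # Fraction of this bin needed to reach the 50% mark
        fraction = (half - cumulative) / count
        median_estimate = low + fraction * (high - low)
        break
    cumulative += count

print(f"Estimated median: ${median_estimate:,.2f}")
```

Because more than half of this truncated population sits in the first two bins, the estimate lands just above the $5,000 bin edge.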

```
# Show that the bins from 1997-2019 are the same, but differ from 1996 and
# earlier
(
(
income_dfs[1996]["Net compensation interval"].eq(
income_dfs[1997]["Net compensation interval"]
)
).all(),
(
income_dfs[1997]["Net compensation interval"].eq(
income_dfs[2019]["Net compensation interval"]
)
).all(),
)
```

```
(False, True)
```

```
# Extract the $50,000,000-and-greater row for each year
more_than_50_mil = (
pd.DataFrame(
pd.concat(
[income_dfs[y].iloc[-1, [0, 1]], pd.Series(y, index=["year"])]
)
for y in income_dfs.keys()
if y > 1996
)
.set_index("year")
.sort_index()
)
```

```
chart = (
alt.Chart(more_than_50_mil.reset_index())
.mark_point()
.encode(
x="year:O",
y="Number:Q",
)
.properties(title="# of income > $50,000,000 over time")
)
coef = (
chart.transform_regression("year", "Number", params=True)
.mark_text()
.encode(
x=alt.value(120),
y=alt.value(50),
text="coef:N",
)
)
rSquared = (
chart.transform_regression("year", "Number", params=True)
.mark_text()
.encode(
x=alt.value(120),
y=alt.value(75),
text="rSquared:N",
)
)
line = chart.transform_regression("year", "Number").mark_line()
(chart + coef + rSquared + line).interactive()
```

The chart above shows the parameters for the regression, but they’re a little hard to read. The most pertinent seems like it would be the slope – the number of people per year moving into the >= $50,000,000 per year range. Let's build a little convenience function to make the slope easily accessible.

```
def get_slope(df: pd.DataFrame, column_name: str) -> float:
    """
    Returns the slope from a dataframe based on its index and the values
    from `column_name`.

    >>> slope = pd.DataFrame({"A": [3, 5, 7]}).pipe(get_slope, "A")
    >>> np.isclose(slope, 2)
    True
    """
    lr = LinearRegression()
    lr.fit(
        df.index.values.reshape(-1, 1),
        df[column_name],
    )
    return lr.coef_[0]
```

```
slope = more_than_50_mil.pipe(get_slope, "Number")
mean = more_than_50_mil["Number"].mean()
print(
    f"""{slope=:.2f}
Average number of people in this group: {mean:.2f}
Group is increasing by about: {slope / mean:.2%} per year"""
)
```

```
slope=7.59
Average number of people in this group: 112.65
Group is increasing by about: 6.74% per year
```

Now let's look at the poorest of the poor.

```
# Extract the less-than-$5,000 row for each year
less_than_5k = (
pd.DataFrame(
pd.concat(
[income_dfs[y].iloc[0, [0, 1]], pd.Series(y, index=["year"])]
)
for y in income_dfs.keys()
if y > 1996
)
.set_index("year")
.sort_index()
)
```

```
chart = (
alt.Chart(less_than_5k.reset_index())
.mark_point()
.encode(
x="year:O",
y="Number:Q",
)
.properties(title="Less than $5k")
)
coef = (
chart.transform_regression("year", "Number", params=True)
.mark_text()
.encode(
x=alt.value(120),
y=alt.value(150),
text="coef:N",
)
)
rSquared = (
chart.transform_regression("year", "Number", params=True)
.mark_text()
.encode(
x=alt.value(120),
y=alt.value(175),
text="rSquared:N",
)
)
line = chart.transform_regression("year", "Number").mark_line()
(chart + coef + rSquared + line).interactive()
```

```
slope = less_than_5k.pipe(get_slope, "Number")
mean = less_than_5k["Number"].mean()
print(
    f"""{slope=:.2f}
Average number of people in this group: {mean:.2f}
Group is decreasing by about: {-slope / mean:.2%} per year"""
)
```

```
slope=-357146.05
Average number of people in this group: 24614597.09
Group is decreasing by about: 1.45% per year
```

```
# Ratio of the average income in the top bracket (>= $50 million) to the
# average income in the bottom bracket (< $5,000) for 2019
income_dfs[2019].iloc[-1]["Average amount"] / income_dfs[2019].iloc[0][
    "Average amount"
]
```

```
42460.35824745949
```

## Summary

The SSA provides some interesting and accessible data on income in the US. The
median income reported by the SSA is *way* lower than I thought it was, and
much lower than the household-level data reported by the census.

The number of Americans making more than $50 million per year is low, and is increasing by about 6% per year (small denominator – this is fewer than 8 people per year). Each of them makes as much per year as about 42,000 people in the lowest bracket combined. The count of people in the lowest bracket is decreasing by less than 2% per year. Hopefully this represents an improvement in their standard of living, though I suppose it could also be people dropping out of the workforce completely.