I was at the Annual Meeting of the Population Association of America a few weeks ago, and while I was there, I saw a great presentation by Joseph Salvo, New York City’s Demographer-in-Chief (!). He showed some pretty incredible maps of Census mail return rates (the proportion of households who return the census without follow-up), and I’ve been trying to recreate them in R ever since. Mail return rates are important because for every household that doesn’t respond, the Census has to send fieldworkers to follow up – and because the Census has finite resources, very low return rates increase the chance of undercounting people in specific areas (which in turn leads to less funding for resources in that area, less representation in Congress, and a lot of other cascading bad effects).
The Center for Urban Research at CUNY has actually made a pretty fancy map of this already by census tract, but I find maps of census tracts a teensy bit hard to read, so I wanted to try to visualize this at a higher level of aggregation (like neighborhood), even if I lost a bit of detail.
Luckily, this was very straightforward, because the Census makes a pretty great administrative dataset available online for survey planning, and it had pretty much all of the data I needed.
Here’s the code for loading the data (along with a bit of cleaning):
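A minimal sketch of what this step can look like, assuming a tract-level CSV. The column names for the identifiers and the denominator (`GIDTR`, `State`, `County`, `MailBack_Area_Count_CEN_2010`) are my assumptions about the file layout, and the rows are invented toy values written to a temporary file so the example runs on its own:

```r
# Toy stand-in for the tract-level planning dataset (invented values).
# Column names other than the two *_FRMS_* variables are assumptions.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "GIDTR,State,County,MailBack_Area_Count_CEN_2010,FRST_FRMS_CEN_2010,RPLCMNT_FRMS_CEN_2010",
  "36005000100,36,005,1000,600,100",
  "36047000200,36,047,800,500,60",
  "36001000300,36,001,900,700,50",
  "34003000400,34,003,700,500,40"
), toy_csv)

# Read the identifier columns as character so leading zeros in FIPS codes survive.
pdb <- read.csv(
  toy_csv,
  colClasses = c(GIDTR = "character", State = "character", County = "character")
)

# Cleaning: keep only New York State (FIPS 36) and the five NYC counties
# (Bronx 005, Kings 047, New York 061, Queens 081, Richmond 085).
nyc_counties <- c("005", "047", "061", "081", "085")
nyc <- pdb[pdb$State == "36" & pdb$County %in% nyc_counties, ]
```

Reading the FIPS columns as character matters here: if `County` is parsed as a number, `"005"` becomes `5` and the filter silently matches nothing.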
The census dataset has an already calculated mail return rate variable, but since I was going to need to recalculate it at the neighborhood level, I wanted to make sure I could recreate it from the raw variables in the dataset. The denominator (number of valid, occupied addresses in the census tract) was pretty clearly identified in the codebook, but the numerator was a bit less clear, so I ended up guessing and adding up FRST_FRMS_CEN_2010 (Number of housing units that returned initial form in 2010 Census) and RPLCMNT_FRMS_CEN_2010 (Number of housing units that returned replacement form in 2010 Census). I then checked to make sure they matched the calculated rate:
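Roughly, the check looks like the sketch below. The two numerator variables are the ones named above; the names I use for the denominator and the precalculated rate (`MailBack_Area_Count_CEN_2010`, `Mail_Return_Rate_CEN_2010`) are assumptions, and the three rows are invented toy values:

```r
# Toy tract-level data (invented values); denominator and rate column
# names are assumptions, the two *_FRMS_* names come from the codebook.
tracts <- data.frame(
  GIDTR = c("36005000100", "36047000200", "36081000300"),
  MailBack_Area_Count_CEN_2010 = c(1000, 800, 500),
  FRST_FRMS_CEN_2010 = c(600, 500, 350),
  RPLCMNT_FRMS_CEN_2010 = c(100, 60, 25),
  Mail_Return_Rate_CEN_2010 = c(70.0, 70.0, 75.0)
)

# Numerator: initial forms returned plus replacement forms returned.
tracts$returned <- tracts$FRST_FRMS_CEN_2010 + tracts$RPLCMNT_FRMS_CEN_2010

# Recomputed rate, as a percentage of valid occupied addresses.
tracts$rate_check <- 100 * tracts$returned / tracts$MailBack_Area_Count_CEN_2010

# Compare against the published rate, allowing for rounding in the file.
max_diff <- max(abs(tracts$rate_check - tracts$Mail_Return_Rate_CEN_2010))
rates_match <- max_diff < 0.1
```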
And magically, they did! (This almost never happens.) Also, don’t worry: I did a much more systematic check of this, I’m just not showing it here because it is extremely boring.
The next step was to merge the dataset with the crosswalk file, which maps each census tract to a neighborhood:
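A sketch of the merge, assuming both files share a tract identifier (called `GIDTR` here, which is my assumption) and the crosswalk has a `neighborhood` column (also a made-up name). The data are toy values:

```r
# Toy tract-level counts (invented values).
tracts <- data.frame(
  GIDTR = c("36005000100", "36005000200", "36047000300"),
  returned = c(900, 100, 375),
  valid = c(1000, 200, 500)
)

# Toy crosswalk: one row per tract, giving the neighborhood it belongs to.
# It may cover more tracts than we have data for.
crosswalk <- data.frame(
  GIDTR = c("36005000100", "36005000200", "36047000300", "36081000400"),
  neighborhood = c("A", "A", "B", "C")
)

# Left join: keep every tract, attach its neighborhood where one exists.
merged <- merge(tracts, crosswalk, by = "GIDTR", all.x = TRUE)

# Tracts with no neighborhood would otherwise vanish from later totals,
# so it is worth counting them before moving on.
unmatched <- sum(is.na(merged$neighborhood))
```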
And then calculate return rates by neighborhood (instead of census tract) with the magic of the dplyr package.
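Here is a sketch of that aggregation on toy data, with column names (`neighborhood`, `returned`, `valid`) made up for the example. The idea is to add up the numerators and denominators within each neighborhood and divide afterwards, rather than averaging the tract-level rates, so that big tracts count for more than small ones:

```r
library(dplyr)

# Toy merged data (invented values): two tracts in A, one in B.
merged <- data.frame(
  neighborhood = c("A", "A", "B"),
  returned = c(900, 100, 375),
  valid = c(1000, 200, 500)
)

by_neighborhood <- merged %>%
  group_by(neighborhood) %>%
  # Pool counts across all tracts in the neighborhood.
  summarise(returned = sum(returned), valid = sum(valid)) %>%
  # Rate from pooled counts, as a percentage.
  mutate(return_rate = 100 * returned / valid)

# For comparison: the unweighted mean of tract-level rates in A.
naive_mean_A <- mean(100 * c(900, 100) / c(1000, 200))
```

In the toy data, neighborhood A's two tracts have rates of 90% and 50%. Their simple average is 70%, but the pooled rate is 1000/1200, or about 83%, because the larger tract dominates.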
Our dataset now looks like this, with one row per neighborhood:
Now we can plot! It’s very easy to do this for all of New York City:
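A minimal sketch of the plotting step, assuming ggplot2 and neighborhood boundaries already converted to a data frame of polygon vertices. To keep it runnable on its own, I'm using two invented square "neighborhoods" instead of a real boundary file, and all column names are made up for the example:

```r
library(ggplot2)
library(viridis)

# Toy boundaries: two unit squares side by side (not real geography).
shapes <- data.frame(
  neighborhood = rep(c("A", "B"), each = 4),
  vertex = rep(1:4, times = 2),
  long = c(0, 1, 1, 0, 1, 2, 2, 1),
  lat = c(0, 0, 1, 1, 0, 0, 1, 1)
)

# Toy neighborhood-level return rates (invented values).
rates <- data.frame(
  neighborhood = c("A", "B"),
  return_rate = c(83.3, 75.0)
)

# Attach the rate to every vertex, then restore vertex order,
# since merge() does not promise to preserve it.
plot_data <- merge(shapes, rates, by = "neighborhood")
plot_data <- plot_data[order(plot_data$neighborhood, plot_data$vertex), ]

p <- ggplot(plot_data,
            aes(x = long, y = lat, group = neighborhood, fill = return_rate)) +
  geom_polygon(color = "white") +
  coord_equal() +
  scale_fill_viridis(name = "Mail return rate (%)") +
  theme_void()
```

Calling `print(p)` draws the map. With real boundaries, the only change should be where `shapes` comes from.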
I’m using the viridis package here, which is colorblind friendly, and all around awesome.
It’s also easy to split these maps up by borough (note that the scales are different for each, so we can see relative differences within boroughs). I’m not reproducing the code here, because it’s a bit clunky, but all I’m doing is mapping with different subsets of the data:
I don’t necessarily have anything to say about these maps – I just think they are interesting, especially in the context of planning for the 2020 Census, which by all accounts is going to be much more difficult to collect data for (because of the proposed – and incredibly ill-advised – citizenship question, for one). People already don’t respond to the Census for all sorts of reasons – distrust of the government, confusion about who needs to fill out the form (response rates are much lower in neighborhoods with higher proportions of renters, for example), busyness – and it’s not a great idea to add another reason for people not to respond. You can also see that response rates are already incredibly variable by neighborhood – and while a lot of follow-up happens, it’s not hard to think that areas that require more follow-up have a higher risk of being undercounted (and thus underrepresented and underfunded).