A World Apart
Filippo Pellolio - Data Visualization and Visual Analytics - Viewer's Choice

About


One of the projects in CS424 is to choose an existing visualization and analyze it using what we learned about visualization in class. The visualization I chose comes from the Washington Post: it shows the median household income and college graduation rate for all the zip codes in the US.

How it works


The Map

The map shows a heatmap of all the zip codes in the US overlaid on a dark base map. The colors represent the median income and the college graduation rate in each zip code: the wealthier and more educated a zip code is, the more yellow it appears. The user can pan and zoom the map freely.

Tooltip

Hovering over a zip code area on the map brings up a simple tooltip stating the name of the area, the zip code, and its percentile.

Search Box

In the top right corner of the application there is a search box. Entering a zip code here selects it on the map and centers the view on it.

The Info

Once you click on an area of the map, detailed information about it is displayed in this box on the right side of the map. The information box shows all the data available for that zip code: the median household income and the college graduation rate. The data is shown both as overall numbers in the upper part and in detail as histograms in the lower part.

The Legend

As in every good visualization, there is a legend in the bottom right corner, explaining which value is associated with each color in the heatmap.

Comments


In this section I will analyze the application and state what, in my opinion, has been done well and what could have been done better.

The Colors

The colors are probably the major flaw of this application. The idea of a dark map is not bad in itself, and it really helps to highlight the yellow "Superzips", but as easy as it is to see these zips, it is just as difficult to see the "Subzips".
If you are a rich and educated zip you can be spotted at first glance, but if you are a zip in the first two percentile ranges you are pretty much invisible: you blend in with the zips that are not displayed. It almost seems as if the visualization is trying to hide the poor and uneducated...

Giving the background (the undisplayed zips) a color so similar to that of the first percentile range is the worst idea of all: it almost looks as if there were another range of people even poorer than those in the first one.
Another problem with this choice of colors is that it gives the user a rather biased idea of the wealth distribution. At first glance almost every zip in the US looks blue-ish, and the user can't really tell that, since we are using percentiles, there is the same number of zips in every color range (apart from the last two, which should be summed).
Would you say that there's the same number of zips in the third and fourth percentile ranges here?

The Scale

Using percentiles to divide the ranges is usually a pretty good idea. Here every range has the same span apart from the last two, and this deserves at least a comment.
The choice can be justified by the fact that the application is trying to show the so-called "Superzips", which fall in the top 5 percentiles. Usually a scale that goes 20 20 20 20 15 5 is not a good idea, but in this case (with a better choice of colors, as said before) it seems to be the best one.
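To make the 20 20 20 20 15 5 scale concrete, here is a minimal sketch of how a percentile could be mapped to one of the six color ranges. The edge values follow from the spans quoted above, but the function name and the exact boundary handling are my assumptions, not the Post's actual code.

```python
from bisect import bisect_right
from collections import Counter

# Upper bounds (exclusive) of the first five ranges; the sixth runs up to 100.
# These edges reproduce the 20/20/20/20/15/5 split (boundary handling assumed).
EDGES = [20, 40, 60, 80, 95]


def bucket(percentile: float) -> int:
    """Return the index (0-5) of the color range a percentile falls into."""
    return bisect_right(EDGES, percentile)


# Count how many of the integer percentiles 0..99 land in each range.
counts = Counter(bucket(p) for p in range(100))
shares = [counts[i] for i in range(6)]
print(shares)
```

Each of the first four ranges holds a fifth of the zips, while the last fifth is split unevenly so that the "Superzips" get a color of their own.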

The Median

Using the median to aggregate the income into a single value is a great idea, since it is far less sensitive to extreme values than the mean. If the application had used the mean as its aggregation function, the neighbourhood of, for instance, Bill Gates would have shown an unrealistic mean income.
However, since the median is not as widespread a concept as the mean, a user who is new to statistics might not appreciate this decision or even understand the value, so it would have been a good idea to explain what the median means in the box, maybe with a simple tooltip.
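A small sketch shows why the median holds up where the mean does not. The incomes below are invented for illustration and are not taken from the visualization's data.

```python
from statistics import mean, median

# Invented household incomes (USD) for a small neighbourhood; not real data.
incomes = [48_000, 52_000, 61_000, 67_000, 75_000]

# The same neighbourhood after one hypothetical billionaire household moves in.
with_outlier = incomes + [1_000_000_000]

summary = {
    "mean_before": mean(incomes),
    "median_before": median(incomes),
    "mean_after": mean(with_outlier),
    "median_after": median(with_outlier),
}
for label, value in summary.items():
    print(f"{label}: {value:,.0f}")
```

A single extreme household pushes the mean into the hundreds of millions, while the median only moves from 61,000 to 64,000.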

The Formula

As said before, percentiles are a good way to represent this data, but looking into how these percentiles are calculated we find that they are the average of the income and the college rate percentiles. This decision takes a lot of power away from the visualization: it would have been easy to let the user choose dynamically whether to show only one of them or their average, giving them the chance to discover far more.
The first thing I wanted to find out when I saw this application was whether there were zips with high income and low graduation rate (and vice versa); this design decision made that kind of discovery practically impossible. Another problem with the formula is explained in one of the findings.
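The toggle I have in mind could be sketched roughly like this. The averaging follows the formula described above, but the percentile-rank definition, the field names and the sample rows are all assumptions made for illustration.

```python
from bisect import bisect_left


def percentile_rank(value, population):
    """Percentage of values in `population` strictly below `value` (0-100).

    This is one common definition; the one used by the Post is not known.
    """
    ordered = sorted(population)
    return 100.0 * bisect_left(ordered, value) / len(ordered)


def score(zip_row, all_rows, mode="average"):
    """Score a zip by income, by college rate, or by the average of both."""
    income_pct = percentile_rank(zip_row["income"], [r["income"] for r in all_rows])
    college_pct = percentile_rank(zip_row["college"], [r["college"] for r in all_rows])
    if mode == "income":
        return income_pct
    if mode == "college":
        return college_pct
    if mode == "average":
        return (income_pct + college_pct) / 2
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical zips: median income in USD, college graduation rate in percent.
rows = [
    {"zip": "A", "income": 30_000, "college": 10},
    {"zip": "B", "income": 60_000, "college": 90},
    {"zip": "C", "income": 90_000, "college": 30},
    {"zip": "D", "income": 120_000, "college": 60},
]

for mode in ("income", "college", "average"):
    print(mode, [score(r, rows, mode) for r in rows])
```

With the `mode` switch exposed in the interface, a zip like "B" (modest income, many graduates) would stand out immediately in the college view instead of sitting unnoticed in the middle of the averaged scale.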

Lack of average

After using the application for a while, one starts to feel the need for explicit averages of household income and college graduation rate, since the displayed percentile only reflects the two combined. Showing the two averages separately would help the user a lot and would take almost no effort. An estimate can be found by looking at the areas that fall around the 50th percentile, but this is neither fast nor accurate. Some averages for the US are shown in the histograms in the lower part of the info box, but they are really small, so it is very difficult to read them precisely.

The Search

The search box only accepts zip codes. Really? Who knows the zip code of, let's say, Wichita?
I'm not an American citizen, so my situation is even worse, but I doubt that anyone knows more than a couple of zip codes, probably their home zip code and the unforgettable 90210. Would it have been so difficult to add a search by name?
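A search by name does not need much. The sketch below is a hypothetical lookup, with invented place names and zip codes, that accepts either a zip code or a place name.

```python
from collections import defaultdict


def build_name_index(rows):
    """Map a lowercased place name to the list of zip codes it covers."""
    index = defaultdict(list)
    for name, zip_code in rows:
        index[name.strip().lower()].append(zip_code)
    return index


def search(query, index):
    """Return matching zip codes for a zip-code query or a place-name query."""
    cleaned = query.strip()
    if cleaned.isdigit():
        return [cleaned]  # already a zip code, pass it through unchanged
    return index.get(cleaned.lower(), [])


# Invented (place name, zip code) pairs; not real data.
index = build_name_index([
    ("Exampleville", "00001"),
    ("Exampleville", "00002"),
    ("Sampletown", "00003"),
])

print(search(" exampleville ", index))
print(search("00003", index))
```

A real implementation would also need partial matching and a way to pick among several zips with the same name, but even this exact-match version would already answer the Wichita question.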

The Choice of Zips

This visualization shows only the zips with more than 500 adults, a decision that deserves support.
What they did wrong is not stating it more clearly: the only way to see the popup above is to search for a zip code that doesn't exist. Not exactly the most straightforward way.
A disclaimer like the one explaining the formula would have been a better idea.
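As a rough sketch of what a clearer message could look like, the lookup below tells the user why a zip is missing. The 500-adult threshold comes from the visualization; the function, the messages and the sample numbers are hypothetical.

```python
MIN_ADULTS = 500  # zips need more than this many adults to be shown


def lookup(zip_code, adults_by_zip):
    """Explain whether a zip is displayed, hidden for size, or unknown."""
    if zip_code not in adults_by_zip:
        return "unknown zip code"
    if adults_by_zip[zip_code] <= MIN_ADULTS:
        return "hidden: 500 adults or fewer"
    return "displayed"


# Invented adult populations per zip; not real data.
adults_by_zip = {"00001": 12_400, "00002": 500, "00003": 37}

for code in ("00001", "00002", "00003", "99999"):
    print(code, "->", lookup(code, adults_by_zip))
```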

Technology

The visualization is built with the Mapbox API, a very common choice for map visualizations.
The Mapbox API is a wrapper around the core Leaflet API, so Leaflet deserves a mention too.
Surprisingly, no D3 is used in the visualization. This is uncommon, since D3 goes along perfectly with the Mapbox API, but they probably didn't need it because the core of the visualization is pretty simple and static, probably consisting of a GeoJSON file with static data about every zip.
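I could not inspect the actual data file, so the following is only a guess at what one entry of such a static GeoJSON might look like. The overall Feature and FeatureCollection layout is standard GeoJSON; every property name, value and coordinate is invented.

```python
import json

# Hypothetical record for one zip area; every name and value is invented.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        # One closed ring of [longitude, latitude] pairs (first == last).
        "coordinates": [[
            [-100.0, 40.0], [-99.9, 40.0], [-99.9, 40.1],
            [-100.0, 40.1], [-100.0, 40.0],
        ]],
    },
    "properties": {
        "zip": "00000",
        "name": "Exampleville",
        "median_income": 61000,
        "college_rate": 0.42,
        "percentile": 57,
    },
}

collection = {"type": "FeatureCollection", "features": [feature]}
encoded = json.dumps(collection)
print(encoded[:60])
```

With everything precomputed like this, the map library only has to color each polygon according to its `percentile` property, which would explain why nothing more dynamic than Mapbox was needed.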

Sources

The data for this visualization comes from the ESRI American Community Survey.
Sadly, since it is not open and free data, I couldn't analyze it to see whether they could have done a better job, especially regarding the Formula discussed before.
Still, the data should be pretty accurate and unbiased, since it comes from a big company specialized in surveys that makes money selling it.

Target

The target audience for this visualization is the readers of the related Washington Post article: well-educated people living in the US. Since the majority of readers will be from the United States, the use of a choropleth will not cause major problems, because they are all fairly familiar with US geography. Since the readers are likely to be well educated, even the use of a median without an explanation can be forgiven.

Findings


Stanford

Stanford is the zip on the map with the highest percentage of graduates: almost everybody there has a degree.

College doesn't mean money

Having a high percentage of college graduates doesn't mean your median household income will be very high, because a college degree doesn't guarantee money. In this particular case, the people living in Stanford are probably mostly professors, given the nearby university, and the median income is probably around a professor's yearly salary, which is fairly good but not super high.

Money usually means college

While it was relatively easy to find zips with a high college graduation percentage and a not-so-high income, it is very difficult to find high-income zips with a low college graduation percentage. This doesn't mean that you need a college degree to make money. We need to remember that we are talking about household income here, so maybe this means that if you have money you are more likely to go to college, but not that you earned it thanks to your degree: maybe your father or mother earned it without one.

Bad Formula

Looking for high income, low college (and vice versa) zips, I stumbled across the fact that the percentile assignment here isn't very meaningful: very different zip areas can fall into similar percentiles. That is why searching for the two findings above was so difficult.
In the picture above we can see how very different areas fell into almost the same percentile. This is why the mean of the two percentiles isn't a very good estimator; perhaps they should have given the user the possibility to toggle between the two percentiles.
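The effect is easy to reproduce with made-up numbers. In the sketch below, three hypothetical zips with very different profiles end up with exactly the same combined percentile once the two values are averaged.

```python
# Hypothetical (income percentile, college percentile) pairs; not real zips.
zips = {
    "rich, few graduates": (90, 30),
    "modest, many graduates": (30, 90),
    "middling on both": (60, 60),
}

# The combined score is the plain average of the two percentiles.
combined = {name: (inc + col) / 2 for name, (inc, col) in zips.items()}

# The gap between the two percentiles is exactly what the average throws away.
gaps = {name: abs(inc - col) for name, (inc, col) in zips.items()}

distinct_scores = set(combined.values())
print(combined)
print(gaps)
```

All three zips would get the same color on the map, even though the gap between their two percentiles ranges from 0 to 60 points.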

Chicago diversity

Chicago is a city full of contradictions, and as you can see in the image above this map is no exception. The picture above refers to a zip code in the Loop and one just a few blocks south; try to guess which is which. It's incredible to think that the richest and poorest people in the country are separated by just a few blocks.
It needs to be said that this is not true only of Chicago: it is a common contradiction present in all the major American cities. New York, Los Angeles and Washington have the same issue.

Medina, Washington

Medina is one of the richest cities in the US, even though it is fairly small (around 3000 people) and not in the best part of the US weather-wise. Medina would rank even higher if we were considering the mean household income instead of the median.
"Why does Medina do so well?" one could ask. It's because Bill Gates and Jeff Bezos live there, along with many other celebrities.

Conclusions


The visualization is overall pretty good, and it makes the point it wants to make: the richest people are concentrated in the same areas, most of them inside the major US cities.
To prove this point the author made some decisions in visualizing the data, such as using the mean of the income and graduation rate percentiles and hiding the least populated areas, but the data shown is true, even if manipulated a little bit.
For an unbiased visualization the author definitely should have shown the two percentiles separately on demand. Still, this visualization is aimed at the people reading the article in order to prove a point, and it is quite good at doing that.
The use of a choropleth was really the only good way to display this data.
The major flaw is definitely the choice of colors, since the colors for non-data and for the poorest areas are almost the same; this might have been a deliberate design choice to skew the visualization.