Data Viz for analysis and discovery

Through my work helping scientists at Google, Berkeley, and Yale analyze their data, as well as in my own past experience as a data analyst at Google, I’ve identified some core concepts of visualization that apply across many projects. In 2018, I synthesized and presented these at the SciPy conference, as a Distinguished Speaker in the University of Washington’s eSciences speaker series, and at the Moss Landing Research Station. The goal was to help scientists create better charts and graphs for their own use and to show analysis tool builders, like those at SciPy, the types of features that would be good to have by default.

This set of 8 different representations of the same 2009 unemployment data by county in the United States illustrates the idea of having “small multiples, which show the same data but use different visual forms.” The only difference in these charts is the mapping from number to color, but we notice different aspects of the data based on the different mappings.

For example, in the upper left map the colors go from purple for the min value (close to 0%) to yellow for the max value in the dataset (~30%). The upper right goes from purple at 2.5% to yellow at 15%, since only a small number (4%) of the counties had an unemployment rate over 15%.

By default, most tools set the min/max of the color scale to the min/max of the data. But that can be less valuable than a more intentional color scale. By setting the max to 15%, we can more easily perceive the difference between 8% and 12% than with a max of 30%.
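To put a rough number on that, here is a minimal sketch that assumes a plain linear mapping from value to color, with out-of-range values clipped to the ends of the scale. The helper names `color_position` and `scale_separation` are hypothetical, made up for this illustration, and the limits are the approximate ones quoted above.

```python
def color_position(value, vmin, vmax):
    """Return where `value` falls on a linear color scale, as a fraction in [0, 1].

    0.0 is the first color of the scale (purple in these maps) and 1.0 is the
    last (yellow). Values outside [vmin, vmax] are clipped to the ends.
    """
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    fraction = (value - vmin) / (vmax - vmin)
    return min(1.0, max(0.0, fraction))


def scale_separation(a, b, vmin, vmax):
    """Fraction of the color scale that separates two values."""
    return abs(color_position(a, vmin, vmax) - color_position(b, vmin, vmax))


# Default scale: limits taken from the data (roughly 0% to 30%).
default_gap = scale_separation(8.0, 12.0, vmin=0.0, vmax=30.0)

# Intentional scale: limits chosen by the analyst (2.5% to 15%).
chosen_gap = scale_separation(8.0, 12.0, vmin=2.5, vmax=15.0)

print(f"default scale: {default_gap:.0%} of the color range")  # 13%
print(f"chosen scale:  {chosen_gap:.0%} of the color range")   # 32%
```

Under these assumptions, 8% and 12% sit more than twice as far apart on the chosen scale. The cost is that every county above 15% gets the same color. In matplotlib, the equivalent limits are usually set with the `vmin` and `vmax` arguments.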

The min/max default can also be misleading. For example, if you make two maps, one with 2009 data and the other with 2012 data, the same shade of blue will represent different numbers purely because the most extreme outlier (the max value) differs between the two years.
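The two-map problem can be sketched the same way. The county rates below are made up for illustration and are not the real 2009 or 2012 figures. The helper `color_position` is again hypothetical and assumes a linear color scale with clipping.

```python
def color_position(value, vmin, vmax):
    """Fraction in [0, 1] along a linear color scale, clipped at both ends."""
    fraction = (value - vmin) / (vmax - vmin)
    return min(1.0, max(0.0, fraction))


# Made-up unemployment rates (percent) for a handful of counties.
# Illustrative values only; NOT the real data.
rates_2009 = [3.1, 6.4, 9.0, 12.5, 29.0]
rates_2012 = [2.8, 5.9, 9.0, 11.0, 22.0]

rate = 9.0  # the same unemployment rate appears in both years

# Default behavior: each map takes its limits from its own data,
# so one outlier changes what every color means.
pos_2009_default = color_position(rate, min(rates_2009), max(rates_2009))
pos_2012_default = color_position(rate, min(rates_2012), max(rates_2012))

# Intentional behavior: both maps share one fixed pair of limits.
SHARED_VMIN, SHARED_VMAX = 2.5, 15.0
pos_2009_shared = color_position(rate, SHARED_VMIN, SHARED_VMAX)
pos_2012_shared = color_position(rate, SHARED_VMIN, SHARED_VMAX)

print(f"default limits: 2009 -> {pos_2009_default:.2f}, 2012 -> {pos_2012_default:.2f}")
print(f"shared limits:  2009 -> {pos_2009_shared:.2f}, 2012 -> {pos_2012_shared:.2f}")
```

With per-map defaults, the same 9% rate lands at different points on the color scale in each year. With shared limits it lands at the same point, so colors can be compared directly across the two maps.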

Goals, context, and constraints

Data visualization has lots of “rules”, like “pie charts are bad” and “don’t use rainbow color schemes”. However, these rules assume a certain set of constraints.

For example, the rule against rainbow color schemes assumes that (1) somebody who is colorblind might look at the chart, (2) it might be printed in black & white, and (3) a perceptually even color scheme is more important than having the most possible perceptual variation between colors. Yet there are plenty of instances in scientific analysis in which (1) the chart will only be viewed by one person who is not colorblind, (2) that person will only look at the chart on a digital screen, and (3) getting as much perceptual differentiation as possible is much more important than a perceptually even scheme, because the key question is “is there a difference” rather than “how large is the difference.”

As chart creators, we should identify goals/context/constraints for our charts, be aware of the assumptions behind any data viz “rule”, and learn the advantages/disadvantages of various types of charts and design decisions.