Replacing Boxplots and Histograms with Rugs, Violins & Bean Plots

Data Visualization with Tim Bock is back with his guide to the strengths and weaknesses for visualizations used to compare the distributions of numeric data.

Introductory stats courses teach about histograms and box plots. Three more recent visualizations – rugs, violins, and bean plots – are often superior. In this post I explain the strengths and weaknesses of these different types of visualizations for comparing the distributions of numeric data.

Histograms: Great in the classroom, but often problematic in the real-world

Histograms are column charts where the numeric values are automatically grouped together into bins. The histogram below shows how often people have opened a trial of a software product. The height of the columns represents the proportion of people (if you hover, you can see the values).

I really like histograms and use them a lot when examining complex models. However, I tend to find them a poor visualization when communicating with less technical audiences. The histogram above is a great example of why. It was created with default settings. The histogram algorithm has examined the data and attempted to deduce the “optimal” number of bins. In this case, three.

The obvious way of reading this histogram is to assume that it is communicating that there are three basic bands: under 20, 20 to 40, and above 40. However, such a reading is mistaken. To illustrate, I have created a different histogram below, where I have forced the software to create more bins. A completely different story emerges. Is it the correct story?

In the next chart I have told the software to use even more bins. Again, it highlights that the previous visualization was sub-optimal, as again it tells a different story.

You could be forgiven for thinking that the trick is to force histograms to have many bins. This strategy suffers from two flaws. First, it becomes hard to see and describe the pattern. How would you describe the pattern in the histogram above? Second, it makes it hard to compare histograms. How, for example, would you characterize the difference between the regions in the histograms below?
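To make the binning problem concrete, here is a minimal sketch in plain Python that counts the same values into 3 and then 12 equal-width bins. The `opens` list is made-up illustrative data, not the data behind the charts in this post, and `bin_counts` is a hypothetical helper rather than the algorithm any particular charting tool uses.

```python
def bin_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical numbers of trial opens per person (illustrative only).
opens = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 14, 15, 15, 16, 30, 45, 60]

print(bin_counts(opens, 3))   # three broad bands
print(bin_counts(opens, 12))  # a spike near zero, a gap, then a cluster
```

With three bins almost everything falls into the first band. With twelve, the cluster around 14 to 16 separates from the spike near zero, which is the kind of change in story described above.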

Box plots only emphasize a part of the story

Box plots are designed for comparing multiple distributions. They improve on histograms by emphasizing medians, quartiles, and any outliers. These box plots are only showing the top ‘whisker’, which emphasizes that the distributions are strongly skewed (i.e., not symmetrical around their median).

The median on a boxplot is shown by the line in the “middle” of each box. The most interesting conclusion from the box plots is that the median for Asia and Australia/NZ is much, much lower than for the other regions. Getting the same conclusion from the histograms is possible, but requires more work. However, only a skilled reader of box plots will immediately work out this key conclusion, as the plots themselves do not tap well into our perceptions. That is, a reader needs to be trained to know that the line through the middle of the box is the median and thus important; it is not something that we instinctively get from the plot.

We can augment the box plots by plotting all the data values on them. This is a more informative visualization. Look at Africa/ME. We can see that there are essentially no data points between 1 and 14, which is not something we could see from the traditional boxplot.

While this is a better visualization, it also highlights the basic problem with box plots: they do not reveal any natural groupings in the data (i.e., they cannot reveal multimodality).
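A small sketch of why a box plot cannot reveal groupings: two samples with different shapes can share exactly the same minimum, quartiles, median, and maximum, and so draw the same box. Both samples are invented for illustration. The `method="inclusive"` quartile convention is an assumption, since charting tools differ in how they compute quartiles.

```python
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Invented samples: one spread evenly, one clumped around 2 and 6.
evenly_spread = [0, 1, 2, 3, 4, 5, 6, 7, 8]
two_clumps = [0, 2, 2, 2, 4, 6, 6, 6, 8]

print(five_number_summary(evenly_spread))
print(five_number_summary(two_clumps))
```

Both calls return the same five numbers, so a box plot built from them would look identical even though the second sample is concentrated in two clumps.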

In my quarter-of-a-century creating visualizations, it has been my general experience that box plots are never ideal in real-world analysis and data presentation. Sure, they do a good job of showing the type of simulated data that appears in college, but in the real world – where distributions are skewed, data is often counts or rating scales, and multi-modality is a problem – they tend never to do the job.

Density plots with rugs are better

An alternative to both histograms and boxplots is to use density plots. Think of these as histograms with the corners sanded off (i.e., smoothing). They have the great advantage over histograms that the shapes that they create are more in line with shapes we see in nature, so we find them a bit easier to see. However, they still suffer from the first problem that we saw with the histograms: they oversimplify. Look, for example, at Africa/ME, and compare it to the histogram at the beginning of the post.

One solution to the problem of density plots is to adjust them so they are not so smooth. However, this has the same problem as we saw with increasing the number of bins in the histogram. A better solution is often to plot the raw data values underneath, as shown below. This is known as a rug.
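The trade-off between smoothness and detail can be sketched with a hand-rolled Gaussian kernel density estimate. This is a generic textbook estimator written in plain Python, not necessarily what any particular charting tool uses. The `values` list and both bandwidths are assumptions chosen for illustration.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function that estimates density as an average of Gaussian bumps."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Invented data with a gap between 2 and 14.
values = [0, 0, 1, 1, 2, 14, 15, 16]

smooth = gaussian_kde(values, bandwidth=5.0)  # very smooth curve
rough = gaussian_kde(values, bandwidth=0.5)   # much less smooth curve

# The rug is just the raw values, drawn as tick marks under the curve.
rug = sorted(values)

print(smooth(8), rough(8))  # density estimates in the middle of the gap
```

The smooth curve assigns noticeable density to the middle of the gap, where there is no data at all, while the rough curve drops to practically zero there. The rug shows the gap directly, whichever bandwidth is chosen.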

Combining density plots with summary statistics

While I like rugs, they suffer from a limitation when over-plotting occurs (i.e., where multiple values appear at the same position on the plot, but you cannot see that there are multiple values there). In these data sets, where lots of people have scores of 0, this is a problem. A partial solution is to augment the density plots with other summary statistics. In the visualization below, I have shown the range (as a thin black line), the quartiles (thick black line), the median (cross), and the mean, as a white dot. To my mind this is the best of the visualizations so far. It takes the strengths from the boxplot but not its weaknesses.
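The summary statistics overlaid on the density plot are cheap to compute. Here is a minimal sketch using Python's standard `statistics` module. The `overlay_stats` helper, the sample data, and the `method="inclusive"` quartile convention are all assumptions made for illustration.

```python
import statistics


def overlay_stats(values):
    """Statistics to draw on top of a density plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "range": (min(values), max(values)),  # thin line
        "quartiles": (q1, q3),                # thick line
        "median": median,                     # cross
        "mean": statistics.fmean(values),     # dot
    }


# Invented skewed data with several zeros, which would overplot on a rug.
scores = [0, 0, 0, 1, 2, 3, 5, 10, 24]

stats = overlay_stats(scores)
print(stats)
```

In this example the mean (5.0) sits well above the median (2.0), which signals the skew even where the three stacked zeros would be indistinguishable on a rug.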

Cue the violin

We can improve these plots in two ways. First, we can transpose them. The reason for doing this is that perceptually we are better at comparing heights of things (as it is a task we have evolved to do). Second, we can show the mirror of the density, which creates a symmetric plot that seems to be a bit easier on the brain (I imagine this is because the resulting shapes look more recognizable). For example, the visualizations below look to me like lava lamps (these plots are called violin plots because the first data set they were tested on created a plot that looked like a violin).

The violin plot is very similar to a combination of a density chart and a box plot, with the key difference being that I have also plotted the mean, and I have used a line to indicate the entire range of data so that the viewer can see the extent to which the tips of the plot are just extrapolations versus real data.
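Geometrically, a violin is just a density curve turned on its side and mirrored. The sketch below computes such an outline as a list of `(position, left_edge, right_edge)` triples that could be handed to any plotting library. It uses a generic Gaussian kernel density estimate, and as a simplifying assumption it clips the outline to the observed range of the data, so there are no extrapolated tips. `violin_outline` and its parameters are hypothetical names.

```python
import math


def violin_outline(values, bandwidth, grid_size=50):
    """Mirrored density outline, evaluated only between min and max of the data."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (grid_size - 1)
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    outline = []
    for i in range(grid_size):
        y = lo + i * step  # position along the (vertical) data axis
        d = norm * sum(
            math.exp(-0.5 * ((y - v) / bandwidth) ** 2) for v in values
        )
        outline.append((y, -d, d))  # mirror the density around zero
    return outline


# Invented data with two clusters.
values = [0, 0, 1, 1, 2, 14, 15, 16]
outline = violin_outline(values, bandwidth=2.0)
print(outline[0], outline[-1])
```

Each triple gives the width of the violin at one height. The widest parts correspond to the clusters in the data.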

Beans

We can get an even better understanding of the raw data values by instead plotting the unique values. This is called a bean plot. To my mind this is a prettier visualization than the violin chart, but ultimately, I find it less informative, as often I am very interested in the mean and, as mentioned before, rugs suffer from overplotting.
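Collapsing the raw data to its unique values is a one-liner with `collections.Counter`. In the sketch below each unique value becomes one line of the bean. Making the line length proportional to the number of observations at that value is an assumption borrowed from common bean plot implementations, not something specified above. The data is invented.

```python
from collections import Counter


def bean_lines(values):
    """One (value, count) pair per unique value, sorted by value."""
    return sorted(Counter(values).items())


# Invented data with repeated values.
values = [0, 0, 0, 1, 1, 2, 14]

for value, count in bean_lines(values):
    # Draw a text "line" whose length reflects how many observations share it.
    print(f"{value:>3} {'-' * count}")
```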

Bespoke is of course even better

Except for the density plots with summary statistics, all the plots above were created using point and click. However, if you have time, you can usually come up with a better mash-up. For example, below I redo the first of the histograms as a density plot, where I have reduced the bandwidth (i.e., made it less smooth), and used a heatmap coloring. I particularly like the way that the coloring shows the discreteness of the data (e.g., we can see that most people have 0, 1, or 2 trials).

Overview

The traditional thing to say when writing a post about data visualization is that all visualizations have different strengths and weaknesses, and need to be matched to purpose. While such advice is of course always true, to my mind the violin and bean plots are usually superior to boxplots, and often to histograms.

The real challenge with these visualizations is they are new to many people. I vividly remember the first time I used a violin plot in a consulting assignment, way back in 2004. My client christened it the “Mr Hankey” plot (google it if you are not familiar with South Park), and instructed me never to use it again!

How to create these visualizations yourself

If you are short on cash and have the time and inclination to write code, you can create these using the rplotly package. The histogram, boxplot, density, and bean plots can all be created using the menus in Q and Displayr, which is what I did when I wrote this post. The heated density plot at the end of the post needs to be created by writing code, and I have explained how here.
