Visualize Numerical Data Across Groups

This tutorial provides a tour of several plot types for visualizing numerical data across groups. It uses the widespread ggplot2 R plotting library. The plots produced during the tutorial are not prettied up or optimized for the data. Instead, the tutorial focuses on illustrating the concepts with as little code as possible.

Learning Objective

By the end of this tutorial, you will be able to:

Choose from a collection of plot types when having to visualize numerical data across groups
Understand typical problems of plots and know approaches to mitigate them
Quickly make basic plots in R with ggplot2

Each section comes with a set of exercises. If your objective is to quickly make basic plots, you can skip these exercises. If you want to gain a deeper understanding of visualization techniques, you should work through them. To do so, start this tutorial in an interactive environment (see top of this page) and open the index.qmd file, which allows you to modify and re-run the code cells.

Target Audience

This tutorial is designed for intermediate users who have already loaded and modified data in R but have not necessarily visualized the data.

Duration

Around half a work day to complete all exercises. Around half an hour to just read through.

Use Cases

When starting to investigate a new dataset, to check on a numerical attribute whether the values are distributed as one would expect
When preparing a visualization for a publication or report, if some effect is not clearly visible or one sees a problem with the first visualization one tried
When visualizing data, to be able to make an informed decision on which plot type to use

Computational Environment Setup

Ensure you have R (version 3.6.0 or higher) installed on your system. To ensure a smooth tutorial, first install all required packages:

install.packages("ggplot2")
install.packages("ggbeeswarm")
install.packages("maps")

The tutorial uses the widespread ggplot2 library for plotting. Load it:

library(ggplot2)

Data

This tutorial uses data of the capitals of the world taken from Wikidata. Load it and have a quick check:

capitals <- read.csv(url("https://zenodo.org/records/17804237/files/capitals-2025-12-03.csv"))
head(capitals)

           capital             country population     area  longitude  latitude
1            Kabul         Afghanistan    4273156 2.75e+08  69.165833 34.532778
2           Tirana             Albania     418495 4.18e+07  19.817778 41.328889
3          Algiers             Algeria    2364230 3.63e+08   3.058611 36.776389
4 Andorra la Vella             Andorra      24042 3.00e+07   1.522222 42.507222
5           Luanda              Angola    2487444 1.13e+08  13.234444 -8.838333
6       St. John's Antigua and Barbuda      21394 1.00e+07 -61.844722 17.121111
      continent
1          Asia
2        Europe
3        Africa
4        Europe
5        Africa
6 North America

summary(capitals)

   capital            country            population            area          
 Length:201         Length:201         Min.   :     575   Min.   :2.000e+05  
 Class :character   Class :character   1st Qu.:  223757   1st Qu.:5.200e+07  
 Mode  :character   Mode  :character   Median :  937700   Median :2.193e+08  
                                       Mean   : 2036817   Mean   :1.061e+09  
                                       3rd Qu.: 2145783   3rd Qu.:6.920e+08  
                                       Max.   :21893095   Max.   :3.000e+10  
   longitude           latitude        continent        
 Min.   :-175.202   Min.   :-41.289   Length:201        
 1st Qu.:  -7.992   1st Qu.:  4.175   Class :character  
 Median :  19.818   Median : 17.250   Mode  :character  
 Mean   :  19.673   Mean   : 19.166                     
 3rd Qu.:  47.525   3rd Qu.: 40.367                     
 Max.   : 179.117   Max.   : 64.175

For the sake of example, assume we are interested in the size of capitals across the world. To this end, we want to see the areas of the capitals grouped by the continent they are on.

Exercises:

After you finish each section of this tutorial, switch to using the population instead of the area for that section. Do your answers to the exercises change?

A note up front: no plot type is better or worse in general. Which one fits depends on the data you have and on the insight you are looking for.

Plotting Histograms

Let us start with the standard plot types. Often, these are sufficient. Especially if you just want to check whether the data is as you expect it to be, a standard plot type is most often the way to go.

For example, in preparation for this tutorial, a standard histogram check showed us that our original data used different measurement units (square meters and square kilometers), leading to vastly different numbers for the area - which we then resolved by using a normalized attribute instead.

In ggplot, we define the data points (capitals and aes(x=..., fill=...)) and then how the data points should be marked in the visualization (“geoms”). In the simplest case, this looks like this:

ggplot(capitals, aes(x=area, fill=continent)) + geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation:

The bars show the overall distribution and the colors show how much data from each continent contributes to each bar
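The split into data, aesthetic mappings, and geoms described above can be made explicit by storing the base plot in a variable and adding geoms afterwards. The following is a minimal sketch on made-up toy data (not the capitals data); the names toy, base_plot, and with_histogram are chosen for illustration only.

```r
library(ggplot2)

# Made-up toy data for illustration only (NOT the capitals data)
set.seed(123)
toy <- data.frame(
  value = c(rnorm(30, mean = 10), rnorm(30, mean = 14)),
  group = rep(c("A", "B"), each = 30)
)

# Data and aesthetic mappings only: this plot has no layers yet
base_plot <- ggplot(toy, aes(x = value, fill = group))

# Adding a geom creates a new plot object with one layer;
# base_plot itself is left unchanged and can be reused with other geoms
with_histogram <- base_plot + geom_histogram(bins = 10)

length(base_plot$layers)
length(with_histogram$layers)
```

Reusing one base plot this way makes it easy to try several geoms on the same data and mappings.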

Histograms show the frequency distribution through bars, but the choice of bins can be critical. Never test only one number of bins, or you might miss something that is not visible for the number you chose.

ggplot(capitals, aes(x=area, fill=continent)) + geom_histogram(bins=3)

ggplot(capitals, aes(x=area, fill=continent)) + geom_histogram(bins=1000)
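To see what the two extreme bin settings do to the underlying counts, one can extract the per-bin counts that ggplot2 computes. The following is a sketch on made-up, right-skewed toy values (not the capitals data); bin_counts is a hypothetical helper name, and it relies on ggplot_build() to access the computed histogram data.

```r
library(ggplot2)

# Made-up, right-skewed toy values for illustration only (NOT the capitals data)
set.seed(42)
toy <- data.frame(value = rexp(200, rate = 1))

# Hypothetical helper: per-bin counts that ggplot2 computes for n_bins bins
bin_counts <- function(n_bins) {
  p <- ggplot(toy, aes(x = value)) + geom_histogram(bins = n_bins)
  ggplot_build(p)$data[[1]]$count
}

counts_few  <- bin_counts(3)
counts_many <- bin_counts(1000)

# Very few bins: all 200 values are lumped into three bars
counts_few

# Very many bins: most bins are empty, the rest hold only a few values each
sum(counts_many == 0)
max(counts_many)
```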

Exercises:

What are problems if the number of bins is “too small” and what if it is “too large”?
In geom_histogram, use position="fill". Can you interpret this plot?

Plotting Density

To avoid choosing a bin size and still be able to see the general distribution, density plots show the smoothed distribution of the data points. Instead of stacking the counts for the different groups (in our case continents), this plot type draws the groups on top of each other, which makes it necessary to make the visuals transparent (here alpha=0.2):

ggplot(capitals, aes(x=area, fill=continent)) + geom_density(alpha=0.2)

Interpretation:

Similar curves indicate similar distributions

Exercises:

In geom_density, set alpha to change the transparency, 0 being not visible and 1 being fully opaque
In geom_density, set adjust to control the bandwidth of density estimation: the higher the smoother the plot
Add + geom_rug(alpha = 0.3) to add tick marks for individual data points
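The effect of adjust can also be inspected numerically. The sketch below uses base R's density() on made-up bimodal toy values (not the capitals data), under the assumption that the adjust parameter of geom_density acts as the same multiplicative bandwidth factor as the adjust argument of density().

```r
# Made-up bimodal toy values for illustration only (NOT the capitals data)
set.seed(1)
toy_values <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))

# Kernel density estimates with the default, a larger, and a smaller bandwidth
d_default <- density(toy_values)
d_smooth  <- density(toy_values, adjust = 4)
d_rough   <- density(toy_values, adjust = 0.25)

# adjust multiplies the automatically selected bandwidth
c(default = d_default$bw, smooth = d_smooth$bw, rough = d_rough$bw)

# A larger bandwidth flattens the curve, so its highest peak is lower
c(default = max(d_default$y), smooth = max(d_smooth$y), rough = max(d_rough$y))
```

With a large adjust, the two modes of the toy data may merge into a single bump, while a small adjust tends to produce many spurious peaks.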

Instead of using alpha, you can also display each density in a separate (aligned) plot using facet_wrap:

ggplot(capitals, aes(x=area)) + facet_wrap(~continent, ncol=1) + geom_density()

Instead of aligning the plots in one column, one can also place them upright (i.e., with x and y switched) and next to one another, leading to a so-called “violin plot”. Swap x and fill/y and replace geom_density() with geom_violin():

ggplot(capitals, aes(x=continent, y=area)) + geom_violin()

Exercises:

In which of the above plots is it easiest for you to say whether there are more small cities in Europe or North America?
In the violin plot, add + coord_flip(). Can you see differences between the faceted plot and the flipped violin plot?
geom_violin also has the adjust parameter. When should you decrease this parameter and when should you increase it? Try it with adjust=10 and adjust=0.01!

Plotting Points

A different approach is to show the single data points instead of aggregated counts, which offers a straightforward solution, especially if the number of data points is relatively low:

ggplot(capitals, aes(x=continent, y=area)) + geom_point()

However, even for a relatively low number of data points, this visualization suffers from “overplotting” in its most basic form shown above: several points are plotted on the same spot, and the viewer cannot tell how many. Several ways exist to counter this loss of information, at least to some degree.

First, you can use transparency like in the density plots:

ggplot(capitals, aes(x=continent, y=area)) + geom_point(alpha=0.2)

Second, you can shift points horizontally where they would otherwise overlap (called “Beeswarm”):

library(ggbeeswarm)
ggplot(capitals, aes(x=continent, y=area)) + geom_beeswarm()

Third, you can add a random horizontal offset to each point (called “Jitter”):

ggplot(capitals, aes(x=continent, y=area)) + geom_jitter()
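Because jitter is random, the plot looks slightly different on every run, and by default geom_jitter also adds a small vertical offset. The following is a sketch of one way to keep the offset purely horizontal and reproducible, using position_jitter with height = 0 and a fixed seed. It uses made-up toy data (not the capitals data); jitter_plot is a hypothetical helper name, and the width and seed values are arbitrary choices.

```r
library(ggplot2)

# Made-up toy data with many identical values per group (NOT the capitals data)
set.seed(7)
toy <- data.frame(
  group = rep(c("A", "B"), each = 20),
  value = round(runif(40, min = 0, max = 5))
)

# Hypothetical helper: jitter horizontally only (height = 0), with a fixed seed
jitter_plot <- function() {
  ggplot(toy, aes(x = group, y = value)) +
    geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))
}

# Building the plot twice yields the same point positions
first  <- ggplot_build(jitter_plot())$data[[1]]
second <- ggplot_build(jitter_plot())$data[[1]]

identical(as.numeric(first$x), as.numeric(second$x))
```

Keeping height at 0 means the y position of each point still shows the exact data value.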

Interpretation:

Each visual point corresponds to one data point - darker or clustered areas show regions of high density

Exercises:

In which of the above plots is it easiest for you to say whether there are more small cities in Europe or North America?
In which of the above plots is it the most difficult to say for each point to which continent it belongs? How could you change that?
In the transparent points plot, in geom_point, change alpha. What values of alpha are too small, and which too large?
In the Beeswarm plot, in aes, use shape=continent or color=continent (or both). What are the advantages of each?
In the jittered points plot, in geom_jitter, set width to different numeric values (e.g., 0.1, 1, 10, …). What values of width are too small, and which too large?

Plotting Box Plots

In scientific publications, one can often find the box plot as another option, though it takes some knowledge to interpret it:

ggplot(capitals, aes(x=continent, y=area)) + geom_boxplot()

Interpretation:

In a nutshell, the horizontal line in each box is the median and the box covers the first to third quartile. The lines that extend up and down from the box (“whiskers”) show what can be considered a regular range of the data, and all points above and below those lines are considered non-regular (“outlier”) data points, which should be inspected to see whether they are due to data errors. The Box plot Wikipedia page describes box plots in detail. The R implementation used here draws each whisker to the most extreme data point that lies within 1.5 times the interquartile range from the box.
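The elements of a box plot can be recomputed by hand with base R. The following is a minimal sketch on a made-up vector (not the capitals data), assuming the whisker rule stated above (most extreme value within 1.5 times the interquartile range of the box edges) and R's default quantile definition.

```r
# Made-up values with one obviously extreme point (NOT the capitals data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# Box: first quartile, median, third quartile
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences: the farthest a whisker is allowed to reach
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

# Everything outside the fences is drawn as an individual outlier point
outliers <- x[x < lower_fence | x > upper_fence]

c(q1 = q[1], median = q[2], q3 = q[3])  # 3.25, 5.5, 7.75
c(lower_whisker, upper_whisker)         # 1, 9
outliers                                # 50
```

Note that the whiskers end at actual data values (here 1 and 9), not at the fences themselves.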

Exercises:

Instead of showing just the outliers, it can make sense to combine the boxplot with plots showing all points. Add outlier.shape=NA to geom_boxplot to remove the outliers from the plot, then use + to add another geom from above (e.g., + geom_jitter()). What are advantages and what are disadvantages of such a combination of plots?

Can I Look at the Data in Another Way?

Finally, when visualizing your data, you should always have an open mind for exploration. Maybe there is another way to look at the data, potentially providing new insights? For example, we started with the question of how large the capitals are in different regions of the world - so, instead of grouping them by continent, it might also be sensible to plot them on a world map:

world_map <- map_data("world")
ggplot() + geom_polygon(data=world_map, aes(x=long, y=lat, group=group), color="white") + geom_point(data=capitals, aes(x=longitude, y=latitude, size=population), color="red")

Conclusion

There is no better or worse plot type in general - it depends on the data you have, but also on what insight you are looking for. But if you notice problems like overplotting or a badly chosen bin size, you should now have a general understanding of ways to address these problems.

When in doubt about which of a set of plausible plot types to choose, use the most standard one that still brings across what you want to see or to convey to others. Then you and your audience have to spend less effort on understanding the plot type and can focus more on understanding the data.
