Strategy and tactics for graphic multiples in Stata

Nicholas J. Cox, Department of Geography, Durham University, UK

Comparison Many useful graphs compare two or more sets of values, and so can be thought of as multiples. Often there can be a fine line between richly detailed graphics and busy, unintelligible graphics that lead nowhere. In this presentation I survey strategy and tactics for developing good graphic multiples in Stata.

Strategies: what to do
  superimpose (on top) or juxtapose (alongside)?
  plot different versions or reductions of the data
  transform scales for easier comparison
  linear reference patterns
  backdrops of context

Tactics: details of what to do
  over() and by() options and graph combine
  kill the key or lose the legend if you can
  annotations and self-explanatory markers

Datasets visited
  James Short’s collation from the transit of Venus
  Florence Nightingale’s data on deaths in the Crimean War
  deaths from the Titanic sinking
  Grunfeld panel data
  admissions to Berkeley
  hostility in response to insult or apology
  fluctuations in Arctic sea ice

Original programs discussed
  catplot (SSC)
  devnplot (SSC)
  qplot (Stata Journal)
  sparkline (SSC)
  spineplot (SJ)
  stripplot (SSC)
  tabplot (SSC)

Categorical comparisons

Berkeley admissions data A classic dataset covers admissions to six graduate majors by gender at UC Berkeley. At first sight, females were discriminated against. But there is an underlying interaction: major by major, females generally do well, yet their acceptance rates are worse on more popular majors. This is an example of an amalgamation paradox named for E.H. Simpson (1922–) but known to K. Pearson (1857–1936) and G.U. Yule (1871–1951).

Berkeley data references The original reference was Bickel, P.J., E.A. Hammel and J.W. O’Connell. 1975. Sex bias in graduate admissions: Data from Berkeley. Science 187: 398–404. The Berkeley data were discussed as an example for Stata in Cox, N.J. 2008. Spineplots and their kin. Stata Journal 8: 105–121.

A simple problem? The structure of the data is already well known. The challenge is how best to present it. There are three categorical variables:
  major (anonymously A, B, C, D, E, F)
  gender (male, female)
  decision (accept, reject)
so the data are just 24 frequencies.
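As a minimal sketch of that structure, the 24 frequencies can be typed in directly and reduced to admission rates. The numbers below are the frequencies as commonly tabulated for these data (for example, in R's UCBAdmissions dataset), not copied from the slides, and the variable names are my own.

```stata
* Berkeley admissions: 6 majors x 2 genders x 2 decisions = 24 frequencies
* (frequencies as commonly tabulated; variable names are illustrative)
clear
input str1 major str6 gender str6 decision int freq
A male   accept 512
A male   reject 313
A female accept  89
A female reject  19
B male   accept 353
B male   reject 207
B female accept  17
B female reject   8
C male   accept 120
C male   reject 205
C female accept 202
C female reject 391
D male   accept 138
D male   reject 279
D female accept 131
D female reject 244
E male   accept  53
E male   reject 138
E female accept  94
E female reject 299
F male   accept  22
F male   reject 351
F female accept  24
F female reject 317
end

* applicants per major and gender, then admission rate in percent
egen applicants = total(freq), by(major gender)
generate rate = 100 * freq / applicants if decision == "accept"
format rate %4.1f
list major gender applicants rate if decision == "accept", sepby(major) noobs
```

Keeping the data in this long layout makes it easy to pass the frequencies as weights to graph commands.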

Bar chart Many researchers would reach first for a bar chart. Here is a slightly non-standard example, produced by tabplot (SSC), which is for one-way, two-way or three-way bar charts. One feature here is showing numbers too in a hybrid of graph and table. A cosmetic detail is toning down the use of colour. Large blocks with strong colours are unsubtle.

Mosaic plot or spineplot The previous bar chart omitted the frequencies. We can show them using a mosaic plot or spineplot. The proportions of both variables are shown, giving marginal and conditional distributions. Areas of tiles are proportional to raw frequencies. Departures from independence are easily seen. The program here is spineplot.
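A hedged sketch of the two displays for the gross breakdown follows. The four frequencies are sums over majors of the commonly tabulated Berkeley frequencies, and the variable names are my own. It assumes that tabplot and spineplot are installed and that the percent() and showval options of tabplot behave as I recall from its help file, so check the help before relying on this.

```stata
* gross breakdown: gender x decision, summed over majors
clear
input str6 gender str6 decision int freq
male   accept 1198
male   reject 1493
female accept  557
female reject 1278
end

* numeric versions with value labels, to be safe with either command
encode gender, generate(sex)
encode decision, generate(outcome)

* bar chart with values shown: percents of each decision within gender
* (tabplot is from SSC: ssc install tabplot)
tabplot outcome sex [fweight = freq], percent(sex) showval

* spineplot: tile areas proportional to the raw frequencies
* (spineplot is from the Stata Journal: search spineplot)
spineplot outcome sex [fweight = freq]
```

With these frequencies about 45% of male and 30% of female applicants were admitted overall.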

Drilling down The bar chart and spineplot do a fair job of showing the gross breakdown with four percents. (Two are redundant.) Predictably, both would be rejected as trivial by many journal reviewers, but both could be useful for presentations. But clearly we need to drill down to see the patterns for different majors.

More detailed bar chart Stacking bars is a standard strategy, but the result is immediately much more complicated. Showing all the detail does not always help. Focusing more sharply on the response of interest is a way forward. In general there is no need for alphabetical order. Here majors A to F are already ordered by admission rate.

Dot chart Dot charts as advocated by W.S. Cleveland remain under-used by comparison with bar charts. In Stata that usually means graph dot. By using marker position alone, rather than bar length, they are less busy and thus ease more detailed comparison. Here it is easier to identify that female admission rates are higher for four majors and lower for the other two.

Details for dot charts Open symbols (e.g. ○ not ●) tolerate overlap much better than closed symbols. ○ can even be combined with + whenever nearly equal values are possible. Legends (keys) are at best a necessary evil. Self-explanatory or at least memorable symbolisation is to be prized wherever it is possible. Using blue for males and pink for females is a simple example.

A scatter plot? Many statistically-minded people find the idea of bar charts trivial, but their practice not very helpful. Where is the scatter plot, they cry? Plotting admission rate against number of applicants re-introduces a crucial aspect, size of major. This allows identification of positive correlation for males and negative correlation for females, hence the paradox. This is currently my favourite plot for these data.
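A plot of this kind might be sketched as follows, using official Stata only. The applicant and admission counts are the commonly tabulated Berkeley frequencies; the variable names, the wide layout, the marker choices and the 50% reference line are my own assumptions rather than the code behind the original slide.

```stata
* one row per major: applicants and admissions for males (_m) and females (_f)
clear
input str1 major app_m adm_m app_f adm_f
A 825 512 108  89
B 560 353  25  17
C 325 120 593 202
D 417 138 375 131
E 191  53 393  94
F 373  22 341  24
end

generate rate_m = 100 * adm_m / app_m
generate rate_f = 100 * adm_f / app_f

* admission rate against size of major, majors shown as marker labels
twoway (scatter rate_m app_m, msymbol(Oh) mcolor(blue) mlabel(major)) ///
       (scatter rate_f app_f, msymbol(+)  mcolor(pink) mlabel(major)), ///
       yline(50, lcolor(gs12)) ///
       ytitle("% admitted") xtitle("number of applicants") ///
       legend(order(1 "male" 2 "female"))

* the signs of these correlations are what drives the paradox
correlate rate_m app_m
correlate rate_f app_f
```

The two correlate calls give a quick numerical check on the pattern seen in the graph: positive for males, negative for females.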

Previously… In an earlier version of this plot I had admissions versus applications, both raw frequencies. Reference lines here are lines through the origin such as y = x and y = 0.5x for 100% and 50% admission rates. But it is simpler to plot admission rates. Then the reference lines are horizontal.

Slogans: the banal in search of the profound Focus as far as possible on the response or outcome, the variable you most want to explain. Linear reference patterns are good and horizontal patterns better. Omit what is unimportant and keep what is important. Even for a very simple problem, it is rare that a single graph meets all needs.

Continuous comparisons

Hostility change Results of an experiment reported by Atkinson, C. and J. Polivy. 1976. Effects of delay, attack, and retaliation on state depression and hostility. Journal of Abnormal Psychology 85: 570–576. Male and female subjects were made to wait and then either were insulted or received an apology. Half were given a chance to retaliate by negatively evaluating the experimenter. Hostility was measured before and after the experiment.

Variables in hostility study
  Response: change in hostility, a difference of scores and so approximately continuous
  Predictors, all binary:
    Treatment: insult, apology
    Gender: male, female
    Retaliation allowed: yes, no

ANOVA-type problems: What to plot? Change in hostility is adequately modelled by a simple linear model, using analysis of variance. What to plot for similar analyses is key here. Box plots (with medians etc.) are surprisingly common even when comparison of means is the central question. Plotting means with standard errors or confidence intervals is also common, but what about the detail omitted?

devnplot (SSC) devnplot (SSC) is named for its emphasis on plotting deviations. Deviations are measured from any level you care to specify, but deviations from means are the default. “devplot” was too ugly and “deviationplot” too long. Quantile enthusiasts will see it as a way to plot ordered quantiles side by side. Compare quantile or qplot (SJ).

devnplot syntax The syntax resembles standard modelling syntax, response named first and any predictors following. With one variable named we get in essence a quantile plot for that variable, a plot of the ordered values versus an implicit cumulative probability scale. The scaffolding emphasising that each value can be represented by a deviation from a level might seem redundant, but bear with me.

Adding predictors to the syntax You can specify either one or two predictors. The result is a quantile plot for each subset, namely a category or combination of categories. An undocumented upper limit arising from a limit in graph is 20 subsets, but more than 20 would likely be too busy anyway. A third binary predictor can be shown indirectly by a separate() option.
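The calling pattern described above might look like this. The auto dataset shipped with Stata stands in for the hostility data, which are not reproduced here, so the variables are purely illustrative; devnplot must first be installed from SSC.

```stata
* devnplot is community-contributed: ssc install devnplot
sysuse auto, clear

* one variable: in essence a quantile plot of mpg
devnplot mpg

* one predictor: a quantile plot for each category of foreign
devnplot mpg foreign

* two predictors: a quantile plot for each observed combination
devnplot mpg foreign rep78
```

The separate() option is left out of the sketch because its exact argument form should be checked in the help for devnplot.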

devnplot virtues The display serves well in showing variation within subsets as well as variation between. Interactions can be seen. The scaffolding (in subtle gray) helps to tie the values of a group together visually. The separate() option is best used to highlight a few unusual or interesting cases.

Waterfall plots Similar plots have been called waterfall plots, especially in clinical oncology. But watch out: waterfall plots (or charts) have at least two quite different meanings elsewhere, in business and physical science contexts. Sometimes the jungle of plot names is just a confounded nuisance.

James Short and the transit of Venus (1763) Short collated and corrected observations made by various astronomers during the transit of Venus in 1761. The parallax here is the angle subtended by the earth’s radius, as if viewed and measured from the surface of the sun. The data will be published and discussed in Stata Journal 13(3).

Deviation plot A deviation plot adjusts to the differing sample sizes. Here deviations are relative to 25% trimmed means (otherwise known as midmeans or interquartile means). Boxplot fans can think that they average values within the box. The context here of careful precise measurement does not rule out the occasional mild or even strong outlier.
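The "average within the box" idea can be sketched in a few lines of official Stata. This is a crude approximation to the midmean, assuming it is acceptable to average every value lying between the quartiles reported by summarize; an exact 25% trimmed mean may treat ties and values at the cut points differently. The auto data and the variable names are illustrative only.

```stata
sysuse auto, clear

* quartiles of the variable of interest
quietly summarize mpg, detail
local q1 = r(p25)
local q3 = r(p75)

* mean of the values inside the box: an approximate midmean
quietly summarize mpg if inrange(mpg, `q1', `q3'), meanonly
local midmean = r(mean)
display "approximate midmean of mpg = " %5.2f `midmean'

* deviations from that level, as a deviation plot would show them
generate deviation = mpg - `midmean'
```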

Quantile plots Deviation plots (waterfall plots, if you prefer) are in essence quantile plots. qplot from SJ can superimpose through its over() option or juxtapose through its by() option. How well does that compare?
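To try the comparison, the two qplot forms might be run side by side. This assumes qplot has been installed from the Stata Journal archive and again uses the auto data as a stand-in.

```stata
* qplot is from the Stata Journal: search qplot
sysuse auto, clear

* superimpose: both groups in one panel
qplot mpg, over(foreign)

* juxtapose: one panel per group
qplot mpg, by(foreign)
```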

devnplot or qplot? I prefer devnplot here, although qplot has useful options too, including flexibility over axis scales. For example, if we plot against standard normal quantiles, normal (Gaussian) distributions will follow straight lines.
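A normal quantile scale could be requested along these lines. Official qnorm is shown first as a baseline; the trscale() option of qplot, with @ as a placeholder for the plotting position, is how I recall the syntax, so treat it as an assumption to be checked against the qplot help.

```stata
sysuse auto, clear

* official Stata: quantiles of mpg against quantiles of a normal distribution
qnorm mpg

* qplot on a standard normal scale, groups superimposed
* (trscale() syntax is assumed, not taken from the slides)
qplot mpg, over(foreign) trscale(invnormal(@))
```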

Strip plot An alternative display is a strip plot or dot plot. (Many other names exist.) Here it takes on the flavour of a histogram but with markers or point symbols for each value. Some binning allows stacking. stripplot from SSC offers an alternative to official Stata’s dotplot.
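For comparison, the official and community-contributed commands might be called like this. The auto data are a stand-in and the bin width of 1 is an arbitrary choice for illustration; stripplot must be installed from SSC.

```stata
sysuse auto, clear

* official Stata
dotplot mpg, over(foreign)

* stripplot (ssc install stripplot): bin to width 1 and stack tied values
stripplot mpg, over(foreign) stack width(1)
```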

Histograms or box plots? Many statistical people would start almost automatically with histograms or box plots for such data. How do they compare? You can judge for yourself. A specific problem with histograms is keeping the amount of scaffolding down. It is easy to lose valuable real estate in axis and title information.
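The official counterparts are easy to produce for your own judgment. In this sketch the auto data are again a stand-in, and the bin width and the suppressed "Graphs by" note are my own choices for keeping scaffolding down.

```stata
sysuse auto, clear

* histograms stacked in one column on a shared scale, without the by() note
histogram mpg, by(foreign, cols(1) note("")) width(2) frequency

* box plots side by side
graph box mpg, over(foreign)
```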
