Visualizing Data Distribution with Violin Plots

EXPLORATORY DATA ANALYSIS – PART II Violin plots, ggplot2

Violin Plots: Visualizing Distribution and Probability Density https://blog.modeanalytics.com/violin-plot-examples/

VIOLIN PLOTS
• A Violin Plot is used to visualize the distribution of the data and its probability density.
• This chart is a combination of a Box Plot and a Kernel Density Plot that is rotated and mirrored on each side, to show the shape of the distribution.
• Box Plots are limited in their display of the data, as their visual simplicity tends to hide significant details about how values in the data are distributed. For example, with Box Plots you can't see whether the distribution is bimodal or multimodal.
• While Violin Plots display more information, they can be noisier than a Box Plot.

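VIOLIN PLOTS
• The thick black bar in the center represents the interquartile range, and the white dot is the median.
• The thin black line extending from the bar is often described as a 95% confidence interval; in common implementations (for example vioplot in R) it is drawn as the box plot whiskers, reaching to the most extreme observations within 1.5 × IQR of the quartiles.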
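The anatomy described above can be sketched by hand in base R. This is only an illustration under stated assumptions: the sample `w` is simulated (hypothetical data), and the thin line is drawn using the box plot whisker rule (1.5 × IQR), which is one common convention rather than the only one.

```r
set.seed(42)
w <- rnorm(200, mean = 250, sd = 40)    # hypothetical sample, for illustration only

d   <- density(w)                       # kernel density estimate of the sample
q   <- quantile(w, c(0.25, 0.5, 0.75))  # quartiles: Q1, median, Q3
iqr <- q[3] - q[1]

# Whisker ends: most extreme observations within 1.5 * IQR of the quartiles
lo <- min(w[w >= q[1] - 1.5 * iqr])
hi <- max(w[w <= q[3] + 1.5 * iqr])

# Scale the density so the violin has a half-width of 0.4 around x = 1
half <- d$y / max(d$y) * 0.4

plot(NA, xlim = c(0, 2), ylim = range(d$x), xaxt = "n",
     xlab = "", ylab = "value", main = "Hand-built violin")
# Mirrored density curve: left side, then right side in reverse order
polygon(c(1 - half, rev(1 + half)), c(d$x, rev(d$x)), col = "gray80")
segments(1, lo, 1, hi, lwd = 1)         # thin line: whiskers
segments(1, q[1], 1, q[3], lwd = 6)     # thick bar: interquartile range
points(1, q[2], pch = 21, bg = "white") # white dot: median
```

Drawing the density twice, once on each side of the center line, is what gives the violin its shape; the box plot elements are simply overlaid on top.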

EXAMPLE: The data contain records of 71 six-week-old baby chickens (aka chicks) and include observations on their feed type, sex, and weight. This violin plot shows the relationship of feed type to chick weight. The box plot elements show that the median weight for horsebean-fed chicks is lower than for the other feed types. The shape of the distribution (extremely skinny on each end and wide in the middle) indicates that the weights of sunflower-fed chicks are highly concentrated around the median.
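A plot along these lines can be sketched with ggplot2. The sketch assumes that R's built-in `chickwts` data frame is close to the data shown on the slide; `chickwts` has 71 chicks with `weight` and `feed` columns but no sex column, so it may not be the exact source.

```r
library(ggplot2)

# Violin per feed type with a narrow box plot overlaid
p <- ggplot(chickwts, aes(x = feed, y = weight)) +
  geom_violin(fill = "gray80") +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  labs(x = "Feed type", y = "Weight (g)")
p

# Median weight by feed type, to compare against the box plot elements
med <- sapply(split(chickwts$weight, chickwts$feed), median)
med
```

The table of medians makes it easy to check the reading of the plot: the horsebean group has the lowest median weight.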

Grouped violin plot

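VIOLIN PLOT IN R
• Use ggplot2 or vioplot.
• Check the webpage http://www.sthda.com/english/wiki/ggplot2-violin-plot-quick-start-guide-r-software-and-data-visualization

library(vioplot)
# Illustrative data (assumption): the slide does not define x and y
x <- rnorm(100)
y <- rnorm(100, mean = 3)
plot(x, y, xlim = c(-5, 5), ylim = c(-2, 8))
# horizontal violin for x, added along the bottom of the scatterplot
vioplot(x, col = "gold", horizontal = TRUE, at = -1, add = TRUE, lty = 2, rectCol = "gray")
# vertical violin for y, added along the left edge
vioplot(y, col = "blue", horizontal = FALSE, at = -4, add = TRUE, lty = 2)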

KERNEL DENSITY ESTIMATION
• A kernel is a special type of pdf with the added property that it must be even. Thus, a kernel is a function with the following properties:
• non-negative
• real-valued
• even
• its definite integral over its support set must equal 1
Some common pdfs are kernels; they include the Uniform(-1,1) and standard normal distributions.
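These properties can be checked numerically for the two kernels named above. The function names `k_unif` and `k_gauss` are introduced here for illustration only.

```r
# Uniform(-1, 1) kernel and standard normal (Gaussian) kernel
k_unif  <- function(u) ifelse(abs(u) <= 1, 0.5, 0)
k_gauss <- function(u) dnorm(u)

u <- seq(-3, 3, by = 0.25)

# Non-negative everywhere on the grid
nonneg <- all(k_unif(u) >= 0) && all(k_gauss(u) >= 0)

# Even: K(-u) equals K(u)
even <- isTRUE(all.equal(k_unif(-u), k_unif(u))) &&
        isTRUE(all.equal(k_gauss(-u), k_gauss(u)))

# Integrates to 1 over its support
int_unif  <- integrate(k_unif, -1, 1)$value
int_gauss <- integrate(k_gauss, -Inf, Inf)$value

c(nonneg = nonneg, even = even)
c(uniform = int_unif, gaussian = int_gauss)
```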

SOME KERNEL FUNCTIONS

What is Kernel Density Estimation?
• Kernel density estimation is a non-parametric method of estimating the pdf of a continuous random variable.
• It is non-parametric because it does not assume any underlying distribution for the variable.
• Essentially, at every datum, a kernel function is created with the datum at its center – this ensures that the kernel is symmetric about the datum. The pdf is then estimated by adding all of these kernel functions and dividing by the number of data, to ensure that the estimate satisfies the two properties of a pdf (it is non-negative and integrates to 1).
• Intuitively, a kernel density estimate is a sum of “bumps”. A “bump” is assigned to every datum, and the size of the “bump” represents the probability assigned to the neighborhood of values around that datum.

Constructing a Kernel Density Estimate: Step by Step
1. Choose a kernel; the common ones are normal (Gaussian), uniform (rectangular), and triangular.
2. At each datum $x_i$, build the scaled kernel function $\frac{1}{h} K\left(\frac{x - x_i}{h}\right)$, where $K()$ is the chosen kernel function and $h$ is the bandwidth (window width or smoothing parameter).
3. Add all of the individual scaled kernel functions and divide by $n$: $\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$. This places a probability of $1/n$ at each $x_i$. It also ensures that the kernel density estimate integrates to 1 over its support set.
The density() function in R computes the values of the kernel density estimate. Applying the plot() function to an object created by density() will plot the estimate. Applying the summary() function to the object will reveal useful statistics about the estimate.
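The three steps above can be followed by hand and compared with density(). This is a minimal sketch: the sample `x` is simulated for illustration, the Gaussian kernel is chosen, and the bandwidth is taken from `bw.nrd0()`, which is the default rule used by density().

```r
set.seed(1)
x <- rnorm(50)                 # hypothetical sample, for illustration only
n <- length(x)
h <- bw.nrd0(x)                # bandwidth (default rule of thumb in density())

# Evaluation grid, extended a few bandwidths beyond the data
grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 512)

# Steps 2 and 3: one scaled Gaussian kernel per datum, summed and divided by n
kde_manual <- sapply(grid, function(g) sum(dnorm((g - x) / h) / h) / n)

# Same estimate from density(), evaluated on the same grid
d <- density(x, bw = h, kernel = "gaussian",
             from = min(grid), to = max(grid), n = 512)

# The two curves should agree closely (density() uses a binned approximation)
max_diff <- max(abs(kde_manual - d$y))

# Riemann-sum check that the estimate integrates to approximately 1
area <- sum(kde_manual) * diff(grid)[1]

plot(d, main = "density() vs. hand-built estimate")
lines(grid, kde_manual, lty = 2, col = "red")
rug(x)
```

The small remaining difference between the two curves comes from the binning and FFT approximation that density() uses for speed.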

Choosing the Bandwidth
• The optimal bandwidth for a kernel density estimate is typically calculated on the basis of an estimate of the integrated squared error or the mean integrated squared error. Either criterion should be minimized to obtain a good approximation of the unknown density.
• Bandwidth describes how fast the weights fall off. If you're just using flat bins, you can think of this as choosing how wide the bins are. In practice, it turns out that bandwidth is a lot more important than kernel shape.
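The effect of the bandwidth is easy to see by re-estimating the same sample with different values. The sketch below uses a simulated bimodal sample (hypothetical data) and the `adjust` argument of density(), which multiplies the default bandwidth; `bw.SJ()` is one of the data-driven selectors shipped with R.

```r
set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))  # hypothetical bimodal sample

bw_default <- bw.nrd0(x)   # rule-of-thumb bandwidth (the density() default)
bw_sj      <- bw.SJ(x)     # Sheather-Jones plug-in bandwidth

d_small   <- density(x, adjust = 0.25)  # undersmoothed: wiggly
d_default <- density(x)                 # default bandwidth
d_large   <- density(x, adjust = 4)     # oversmoothed: modes can merge

ymax <- max(d_small$y, d_default$y, d_large$y)
plot(d_default, ylim = c(0, ymax), main = "Effect of bandwidth")
lines(d_small, lty = 2)
lines(d_large, lty = 3)
legend("topright", lty = c(2, 1, 3),
       legend = c("0.25 x default", "default", "4 x default"))

c(nrd0 = bw_default, SJ = bw_sj)
```

A bandwidth that is too small shows noise as structure, while one that is too large can hide real features such as a second mode.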

The ASH and KernSmooth packages rank high for performance. Their update histories also show that they are two of the oldest density estimation packages and that they are still regularly updated.

EXAMPLE: mpg

# Kernel Density Plot
d <- density(mtcars$mpg)  # returns the density data
plot(d)                   # plots the results

# Filled Density Plot
d <- density(mtcars$mpg)
plot(d, main="Kernel Density of Miles Per Gallon")
polygon(d, col="red", border="blue")

EXAMPLE (contd.)

# Compare MPG distributions for cars with
# 4, 6, or 8 cylinders
library(sm)
# grouping factor for the number of cylinders
cyl.f <- factor(mtcars$cyl)
# plot densities
sm.density.compare(mtcars$mpg, cyl.f, xlab="Miles Per Gallon")
title(main="MPG Distribution by Car Cylinders")
# add legend via mouse click (requires an interactive graphics device)
colfill <- c(2:(1+length(levels(cyl.f))))
legend(locator(1), levels(cyl.f), fill=colfill)

R Package: ggplot2*
• Used to produce statistical graphics; author: Hadley Wickham
• "attempt to take the good things about base and lattice graphics and improve on them with a strong, underlying model"
• Based on The Grammar of Graphics by Leland Wilkinson, 2005: "... describes the meaning of what we do when we construct statistical graphics ... More than a taxonomy ... Computational system based on the underlying mathematics of representing statistical functions of data."
• Does not limit the developer to a set of pre-specified graphics
• Adds some concepts to the grammar which allow it to work well with R
*https://opr.princeton.edu/workshops/Downloads/2015Jan_ggplot2Koffman.pdf

ggplot2 provides two ways to produce plot objects:
qplot() # quick plot
• uses some concepts of The Grammar of Graphics, but doesn’t provide full capability
• designed to be very similar to plot() and simple to use
• may make it easy to produce basic graphs but may delay understanding the philosophy of ggplot2
ggplot() # grammar of graphics plot
• provides a fuller implementation of The Grammar of Graphics
• may have a steeper learning curve but allows much more flexibility when building graphs

Grammar Defines Components of Graphics
• data: in ggplot2, data must be stored as an R data frame
• coordinate system: describes the 2-D space that data is projected onto - for example, Cartesian coordinates, polar coordinates, map projections, ...
• geoms: describe the type of geometric objects that represent data - for example, points, lines, polygons, ...
• aesthetics: describe the visual characteristics that represent data - for example, position, size, color, shape, transparency, fill
• scales: for each aesthetic, describe how the visual characteristic is converted to display values - for example, log scales, color scales, size scales, shape scales, ...
• stats: describe statistical transformations that typically summarize data - for example, counts, means, medians, regression lines, ...
• facets: describe how data is split into subsets and displayed as multiple small graphs
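To make the components concrete, here is one plot that names each of them explicitly. It is only a sketch: the choice of variables from the `mpg` data set, the log scale, and the linear smoother are arbitrary illustrations, not part of the original slides.

```r
library(ggplot2)

p <- ggplot(data = mpg,                              # data: a data frame
            aes(x = displ, y = hwy, colour = drv)) + # aesthetics: position, colour
  geom_point() +                                     # geom: points
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                          # stat: fitted regression lines
  scale_x_log10() +                                  # scale: log scale for x
  coord_cartesian() +                                # coordinate system: Cartesian
  facet_wrap(~ year)                                 # facets: one panel per year
p
```

Each `+` adds one component, so any of them can be swapped out without touching the rest of the specification.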

EXAMPLE:mpg • We consider a dataset mpg consisting of fuel economy data from 1999 and 2008 for 38 popular models of car.

EXAMPLE: mpg • Let's take a look at the dataset we're using.

EXAMPLE: mpg

> str(mpg)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 234 obs. of 11 variables:
 $ manufacturer: chr "audi" "audi" "audi" "audi" ...
 $ model       : chr "a4" "a4" "a4" "a4" ...
 $ displ       : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int 4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drv         : chr "f" "f" "f" "f" ...
 $ cty         : int 18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int 29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : chr "p" "p" "p" "p" ...
 $ class       : chr "compact" "compact" "compact" "compact" ...

EXAMPLE: mpg

# summary() gives frequency tables for categorical variables
# and mean and five-number summaries for continuous variables
> summary(mpg)
 manufacturer          model               displ            year
 Length:234         Length:234         Min.   :1.600   Min.   :1999
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999
 Mode  :character   Mode  :character   Median :3.300   Median :2004
                                       Mean   :3.472   Mean   :2004
                                       3rd Qu.:4.600   3rd Qu.:2008
                                       Max.   :7.000   Max.   :2008
      cyl           trans               drv                 cty
 Min.   :4.000   Length:234         Length:234         Min.   : 9.00
 1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00
 Median :6.000   Mode  :character   Mode  :character   Median :17.00
 Mean   :5.889                                         Mean   :16.86
 3rd Qu.:8.000                                         3rd Qu.:19.00
 Max.   :8.000                                         Max.   :35.00
      hwy             fl               class
 Min.   :12.00   Length:234         Length:234
 1st Qu.:18.00   Class :character   Class :character
 Median :24.00   Mode  :character   Mode  :character
 Mean   :23.44
 3rd Qu.:27.00
 Max.   :44.00

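EXAMPLE: mpg • Notice that the categorical variables are summarized in a different form: because they are stored as character vectors, summary() reports only their Length, Class, and Mode.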
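To get frequency tables for the categorical variables in `mpg`, one option is to convert them to factors first or to call table() directly. A minimal sketch; the copy `mpg_f` is a name introduced here for illustration.

```r
library(ggplot2)  # provides the mpg data set

mpg_f <- as.data.frame(mpg)        # work on a copy
mpg_f$drv   <- factor(mpg_f$drv)   # character -> factor
mpg_f$class <- factor(mpg_f$class)

drv_counts <- summary(mpg_f$drv)   # summary() of a factor gives level counts
drv_counts

class_counts <- table(mpg$class)   # table() works on the character vector directly
class_counts
```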

EXAMPLE: mpg

EXAMPLE: mpg
• Geom: is the "type" of plot
• Aesthetics: shape, color, size, alpha
• Faceting: "small multiples" displaying different subsets
Help is available. Try searching for examples, too.
had.co.nz/ggplot2
had.co.nz/ggplot2/geom_point.html
When certain aesthetics are defined, an appropriate legend is chosen and displayed automatically.
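As an illustration of a geom, an aesthetic, and the automatic legend, the sketch below draws a scatterplot of `mpg` twice. The choice of `displ` and `hwy` is an assumption, since the slide's own figure is not part of this transcript.

```r
library(ggplot2)

# Point geom with position aesthetics only: no legend is needed
p_basic <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Mapping class to colour adds a colour legend automatically
p_class <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()

p_basic
p_class
```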

EXAMPLE: mpg

EXAMPLE: mpg
Play with the aesthetics.
1. Assign variables to the aesthetics colour, size, and shape.
2. What's the difference between discrete and continuous variables?
3. What happens when you combine multiple aesthetics?
The behavior of the aesthetics is predictable and customizable.
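One possible way to work through these questions is sketched below; the variable choices (`drv` as a discrete variable, `cty` as a continuous one) are illustrative assumptions.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy))

# Discrete variable mapped to colour: one distinct colour per level
p_discrete <- base + geom_point(aes(colour = drv))

# Continuous variable mapped to colour: a colour gradient
p_continuous <- base + geom_point(aes(colour = cty))

# Several aesthetics combined; legends for the same variable are merged
p_multi <- base + geom_point(aes(colour = drv, shape = drv, size = cty))

n_discrete   <- length(unique(layer_data(p_discrete, 1)$colour))
n_continuous <- length(unique(layer_data(p_continuous, 1)$colour))
c(discrete = n_discrete, continuous = n_continuous)
```

Note that shape only accepts discrete variables, which is one practical difference between the two kinds of variable.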

EXAMPLE: mpg Let's see a couple of examples.

EXAMPLE: mpg

EXAMPLE: mpg
Faceting
• A small multiple (sometimes called faceting, trellis chart, lattice chart, grid chart, or panel chart) is a series or grid of small similar graphics or charts, allowing them to be easily compared.
• Typically, small multiples will display different subsets of the data.
• This is a useful strategy for exploring conditional relationships, especially for large data.
Experiment with faceting of different types. What relationships would you like to see?
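Two common types of faceting in ggplot2 are sketched below. The faceting variables are illustrative choices and may differ from the slide's original figures.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# One panel per class, wrapped into a grid
p_wrap <- p + facet_wrap(~ class)

# Rows by drive type, columns by number of cylinders
p_grid <- p + facet_grid(drv ~ cyl)

p_wrap
p_grid
```

facet_wrap() suits a single variable with many levels, while facet_grid() lays out the combinations of two variables, including empty ones.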

EXAMPLE: mpg

Improving plots • How can this plot be improved?

EXAMPLE: mpg
• Problem: points lie on top of each other, so it's impossible to tell how many observations each point represents.
• A solution: Jitter the points to reveal the individual points and reduce the opacity to 1/2 to indicate when points overlap.
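A sketch of this fix, assuming the plot in question is city mileage against highway mileage (the figure itself is not in the transcript). The jitter amounts are arbitrary illustrative values.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy))

# Original: integer-valued data, so many points coincide exactly
p_overplotted <- p + geom_point()

# Number of observations hidden behind an identical (cty, hwy) pair
n_hidden <- sum(duplicated(mpg[, c("cty", "hwy")]))
n_hidden

# Fix: small random displacement plus half opacity
p_jittered <- p +
  geom_point(position = position_jitter(width = 0.3, height = 0.3),
             alpha = 1/2)

p_overplotted
p_jittered
```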

EXAMPLE: mpg • How can this plot be improved?

EXAMPLE: mpg
• Problem: The classes are in alphabetical order, which is somewhat arbitrary.
• A solution: Reorder the class variable by the mean hwy for a meaningful ordering. Get help with ?reorder to understand how this works.
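A sketch of the reordering, assuming the plot shows `hwy` by `class`. reorder() returns a factor whose levels are sorted by a summary of the second argument, the mean by default.

```r
library(ggplot2)

# Levels of class sorted by mean hwy instead of alphabetically
class_by_hwy <- reorder(mpg$class, mpg$hwy)
levels(class_by_hwy)

# The same call can be used directly inside aes()
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy)) +
  geom_point() +
  labs(x = "class (ordered by mean hwy)")
p
```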

EXAMPLE: mpg . . . add jitter

. . . or replace with boxplots

. . . or jitter those points with reduced-opacity boxplots on top

. . . and can easily reorder by median() instead of mean() (mean is the default)
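The sequence of refinements above might look like the following in ggplot2. This is a sketch under the assumption that the plots show `hwy` by `class`; the jitter width and opacity are illustrative values.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))

# ... add jitter
p_jitter <- p + geom_point(position = position_jitter(width = 0.1))

# ... or replace with boxplots
p_box <- p + geom_boxplot()

# ... or jitter the points with reduced-opacity boxplots on top
p_both <- p +
  geom_point(position = position_jitter(width = 0.1)) +
  geom_boxplot(alpha = 0.5)

# ... and reorder by median() instead of the default mean()
p_median <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot()

p_both
p_median
```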

Interpretation of Graphical Displays for Numerical Data
• Statistical methods for inference about a population usually make assumptions about the shape of the population frequency curve. A common assumption is that the population has a normal frequency curve. In practice, the observed data are used to assess the reasonableness of this assumption. In particular, a sample display should resemble a population display, provided the collected data are a random or representative sample from the population.

#### Unimodal, symmetric, bell-shaped, and no outliers (Normal distribution)

## base graphics
# sample from normal distribution
x1 <- rnorm(250, mean = 100, sd = 15)
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x1, freq = FALSE, breaks = 20)
points(density(x1), type = "l")
rug(x1)
# violin plot
library(vioplot)
vioplot(x1, horizontal=TRUE, col="gray")
# boxplot
boxplot(x1, horizontal=TRUE)

## ggplot
library(ggplot2)
# Histogram overlaid with kernel density curve
x1_df <- data.frame(x1)
p1 <- ggplot(x1_df, aes(x = x1))
# Histogram with density instead of count on y-axis
# (after_stat(density) replaces the deprecated ..density.. notation)
p1 <- p1 + geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                          colour = "black", fill = "white")
# Overlay with transparent density plot
p1 <- p1 + geom_density(alpha = 0.1, fill = "#FF6666")
# Jittered points just below the x-axis, acting as a rug
p1 <- p1 + geom_point(aes(y = -0.001),
                      position = position_jitter(height = 0.0005),
                      alpha = 1/5)
# violin plot
p2 <- ggplot(x1_df, aes(x = "x1", y = x1))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()
# boxplot
p3 <- ggplot(x1_df, aes(x = "x1", y = x1))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()
# arrange the three ggplot panels in one column
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=1)

#### Unimodal, symmetric, heavy-tailed

# sample from normal distribution
x2.temp <- rnorm(250, mean = 0, sd = 1)
# signed square stretches the tails; then rescale and shift
x2 <- sign(x2.temp)*x2.temp^2 * 15 + 100
par(mfrow=c(3,1))

#### Bimodal (multi-modal)

# sample from a mixture of two normal distributions
x6 <- c(rnorm(150, mean = 100, sd = 15), rnorm(150, mean = 150, sd = 15))
par(mfrow=c(3,1))
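The plotting code for the heavy-tailed and bimodal cases is not included in the transcript. As one way to compare the shapes side by side, the sketch below simulates all three kinds of sample in the same way as above and draws a violin with an overlaid box plot for each. The data frame `shapes` and the seed are introduced here for illustration.

```r
library(ggplot2)

set.seed(2024)
z <- rnorm(250)  # standard normal draws used to build the heavy-tailed sample

shapes <- data.frame(
  shape = rep(c("normal", "heavy-tailed", "bimodal"), times = c(250, 250, 300)),
  value = c(
    rnorm(250, mean = 100, sd = 15),          # unimodal, bell-shaped
    sign(z) * z^2 * 15 + 100,                 # unimodal, heavy-tailed
    c(rnorm(150, mean = 100, sd = 15),
      rnorm(150, mean = 150, sd = 15))        # bimodal mixture
  )
)

p <- ggplot(shapes, aes(x = shape, y = value)) +
  geom_violin(fill = "gray50") +
  geom_boxplot(width = 0.2, alpha = 3/4) +
  coord_flip()
p
```

The box plots alone look broadly similar for the normal and bimodal samples, while the violins make the two modes visible.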

Lollipop Chart

library(ggplot2)
theme_set(theme_bw())

# Prepare data: group mean city mileage by manufacturer.
cty_mpg <- aggregate(mpg$cty, by=list(mpg$manufacturer), FUN=mean)  # aggregate
colnames(cty_mpg) <- c("make", "mileage")  # change column names
cty_mpg <- cty_mpg[order(cty_mpg$mileage), ]  # sort
cty_mpg$make <- factor(cty_mpg$make, levels = cty_mpg$make)  # to retain the order in plot.

# Plot
ggplot(cty_mpg, aes(x=make, y=mileage)) +
  geom_point(size=3) +
  geom_segment(aes(x=make, xend=make, y=0, yend=mileage)) +
  labs(title="Lollipop Chart",
       subtitle="Make Vs Avg. Mileage",
       caption="source: mpg") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

• Lollipop charts convey the same information as bar charts. By reducing the thick bars to thin lines, they reduce clutter and place more emphasis on the value. The result looks clean and modern.
