Cluster Analysis

Cluster Analysis: Classifying the Exoplanets

Cluster Analysis
• Simple idea, difficult execution
• Used for indexing large amounts of data in databases (a very marketable skill: 70/hour)
• “The best form of cluster analysis is ordination, because ordination is not a form of cluster analysis.” –Morgan Byron
• No formal definition of a cluster
• Results are descriptive and subjective

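R Commands
• data("planets", package = "HSAUR")
  • Loads the exoplanet data used throughout (mass, period, eccen)
• library("scatterplot3d")
• scatterplot3d(log(planets$mass), log(planets$period), log(planets$eccen), type = "h", angle = 55, scale.y = 0.7, pch = 16, y.ticklabs = seq(0, 10, by = 2), y.margin.add = 0.1)
  • log() takes the log of each data point
  • type = "h" draws a vertical line from each point down to the base plane
  • angle and scale.y set the viewing angle and the depth scaling so the plot looks like a box
  • pch is the symbol used for the data points
  • seq() generates the tick labels for the y axis
  • y.margin.add adds a bit of space between the y-axis tick labels and the axis label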

Interpretation
No real insight after our first view of the data, but it looks neat.

R Commands
• rge <- apply(planets, 2, max) - apply(planets, 2, min)
  • Stores the range (max minus min) of each variable
  • 2 indicates the column margin of the data matrix
• planet.dat <- sweep(planets, 2, rge, FUN = "/")
  • Divides each element of the matrix by the range of its column
• n <- nrow(planet.dat)
• wss <- rep(0, 10)
  • Creates a vector of 10 zeros
• wss[1] <- (n-1)*sum(apply(planet.dat, 2, var))
  • The sum of squares of all the points about their mean, i.e. the within-group sum of squares if we put all the data in 1 group
• for (i in 2:10) wss[i] <- sum(kmeans(planet.dat, centers = i)$withinss)
  • For each number of groups from 2 to 10, runs kmeans and stores the within-group sums of squares added up over the groups
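To see what the range scaling and the sum-of-squares loop do without needing the planets data, here is a minimal sketch on simulated data. The data frame `sim`, its values, and the `nstart = 25` setting are assumptions made for illustration; they are not part of the original analysis.

```r
# Simulated stand-in for the planets data (NOT the real exoplanet values):
# three well-separated groups in three variables on very different scales.
set.seed(42)
sim <- data.frame(
  mass   = c(rnorm(30, 1, 0.2),    rnorm(30, 5, 0.2),    rnorm(30, 10, 0.2)),
  period = c(rnorm(30, 100, 20),   rnorm(30, 500, 20),   rnorm(30, 1500, 20)),
  eccen  = c(rnorm(30, 0.1, 0.02), rnorm(30, 0.3, 0.02), rnorm(30, 0.5, 0.02))
)

# Divide every column by its range so no single variable dominates the distances.
rge <- apply(sim, 2, max) - apply(sim, 2, min)
sim.dat <- sweep(sim, 2, rge, FUN = "/")

# After rescaling, every column spans exactly one unit.
scaled_range <- apply(sim.dat, 2, max) - apply(sim.dat, 2, min)

# Within-group sum of squares for 1 to 10 groups.
n <- nrow(sim.dat)
wss <- numeric(10)
wss[1] <- (n - 1) * sum(apply(sim.dat, 2, var))
for (i in 2:10) {
  # nstart = 25 reruns kmeans from 25 random starts and keeps the best fit
  wss[i] <- sum(kmeans(sim.dat, centers = i, nstart = 25)$withinss)
}

# The curve that is inspected for an "elbow".
plot(1:10, wss, type = "b",
     xlab = "Number of groups", ylab = "Within groups sum of squares")
```

Because kmeans starts from random centers, a single run can land in a poor local solution; using several starts is one common way to make the curve less noisy.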

The K-Means Method
• K-means partitions the data into a chosen number of groups so as to minimize a numerical criterion, often based on a notion of distance.
• The criterion used in this analysis is the sum of squares of the data within each group. We compute it for several numbers of groups and compare the results; it keeps shrinking as groups are added, so we look for the point where adding groups stops helping much.
• An exhaustive search over all partitions is impractical, because the number of possible partitions grows very quickly as the number of groups and data points increases.
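The idea of minimizing the within-group sum of squares can be sketched with a bare-bones version of Lloyd's algorithm: assign each point to its nearest center, move each center to the mean of its points, and repeat. `simple_kmeans` is a hypothetical helper written for this illustration, and the data are simulated; R's own kmeans() uses the Hartigan-Wong algorithm by default, so this is not what the slides ran.

```r
# Minimal k-means sketch (Lloyd's algorithm), intended for k >= 2.
simple_kmeans <- function(x, k, iters = 50) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random starting centers
  cl <- rep(0L, nrow(x))
  for (it in seq_len(iters)) {
    # Squared distance from every point to every center (n x k matrix).
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cl <- max.col(-d, ties.method = "first")    # index of nearest center
    if (all(new_cl == cl)) break                    # assignments stable: stop
    cl <- new_cl
    for (j in seq_len(k)) {
      # Move each center to the mean of its members (leave it if the group is empty).
      if (any(cl == j)) centers[j, ] <- colMeans(x[cl == j, , drop = FALSE])
    }
  }
  # Total within-group sum of squares for the final partition.
  wss <- sum((x - centers[cl, , drop = FALSE])^2)
  list(cluster = cl, centers = centers, wss = wss)
}

# Usage on simulated data with two well-separated groups of 20 points each.
set.seed(1)
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(pts, k = 2)
table(fit$cluster)
```

Each pass can only lower (or keep) the within-group sum of squares, which is why the procedure settles down without ever enumerating all possible partitions.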

The “Elbow”
An easy way to choose a good number of partitions is to look for the “elbow”, the sharpest bend in the plot of within-group sum of squares against the number of groups. Here the sharpest bends appear to be at 3 and 5 groups.
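Reading the elbow off a plot is a judgment call. One rough way to put a number on it is to compare how much each extra group improves the sum of squares with how much the next one does. The sketch below does this on simulated data with three true groups; the data and the ratio heuristic (`gain`, `bend`) are assumptions for illustration, not a standard rule and not something the slides used.

```r
set.seed(7)
# Simulated data with three true groups of 30 points (made up for illustration).
x <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),
           matrix(rnorm(60, 3, 0.3), ncol = 2),
           matrix(rnorm(60, 6, 0.3), ncol = 2))

# Within-group sum of squares for 1 to 6 groups.
wss <- numeric(6)
wss[1] <- sum(scale(x, scale = FALSE)^2)          # one group: total sum of squares
for (k in 2:6) {
  wss[k] <- kmeans(x, centers = k, nstart = 25)$tot.withinss
}

gain <- -diff(wss)                                # improvement when going to k = 2..6
bend <- gain[-length(gain)] / gain[-1]            # improvement at k relative to k + 1
names(bend) <- 2:5

# The k with the largest ratio is where the curve bends most sharply.
elbow_k <- as.integer(names(which.max(bend)))
elbow_k
```

On data this cleanly separated the heuristic should point to 3 groups; on real data, such as the planets, the curve is much less clear-cut and the plot still deserves a look.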

Number of planets in the groups
• planet_kmeans3 <- kmeans(planet.dat, centers = 3)
  • We chose to try 3 groups
• table(planet_kmeans3$cluster)
   1  2  3
  14 53 34
• ccent <- function(cl) {
•   f <- function(i) colMeans(planets[cl == i, ])   # mean of each variable within cluster i
•   x <- sapply(sort(unique(cl)), f)                # apply f to each cluster label, in sorted order
•   colnames(x) <- sort(unique(cl))                 # label the columns by cluster number
•   return(x) }

The results
• > ccent(planet_kmeans3$cluster)

Cluster            1            2            3
mass        10.56786    1.6710566    2.9276471
period    1693.17201  427.7105892  616.0760882
eccen        0.36650    0.1219491    0.4953529
Number            14           53           34
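The per-cluster means can be cross-checked with base R's aggregate(). The sketch below uses a variant of ccent that takes the data as a second argument (an assumption; the version on the slide reads the global planets object) and runs it on simulated data rather than the real planets values.

```r
set.seed(3)
# Simulated two-variable data with two obvious groups (illustrative only).
dat <- data.frame(a = c(rnorm(10, 0),   rnorm(10, 10)),
                  b = c(rnorm(10, 100), rnorm(10, 200)))

# Cluster on standardized data, then summarize on the original scale.
cl <- kmeans(scale(dat), centers = 2, nstart = 10)$cluster

# Variant of ccent that receives the data explicitly instead of using a global.
ccent <- function(cl, data) {
  f <- function(i) colMeans(data[cl == i, , drop = FALSE])
  x <- sapply(sort(unique(cl)), f)
  colnames(x) <- sort(unique(cl))
  x
}

by_ccent <- ccent(cl, dat)                         # variables in rows, clusters in columns
by_aggregate <- aggregate(dat, by = list(cluster = cl), FUN = mean)

by_ccent
by_aggregate
```

The two summaries hold the same numbers; ccent puts clusters in columns, while aggregate() returns one row per cluster.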

Model-Based Clustering in brief
• The subjective decision, or assumption, is the number of clusters.
• Once that is fixed, clustering becomes a problem of maximizing a likelihood: we look for the partition under which the observed data are most probable.
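To make "maximizing the likelihood" concrete, here is a minimal sketch that fits a two-group normal mixture to simulated one-dimensional data with the EM algorithm. The data, the starting values, and the fixed 100 iterations are all assumptions for illustration; this shows the general idea only and is not the procedure run on the planets data.

```r
set.seed(11)
# Simulated 1-D data from two normal groups (illustrative only).
y <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 6, sd = 1))

# Starting guesses: mixing proportion, group means, group standard deviations.
p  <- 0.5
mu <- c(min(y), max(y))
s  <- c(sd(y), sd(y))
loglik <- numeric(0)

for (it in 1:100) {
  # E-step: weighted densities and the probability each point is in group 1.
  d1 <- p * dnorm(y, mu[1], s[1])
  d2 <- (1 - p) * dnorm(y, mu[2], s[2])
  loglik <- c(loglik, sum(log(d1 + d2)))   # log-likelihood at current parameters
  r <- d1 / (d1 + d2)
  # M-step: weighted updates of the parameters; these never lower the likelihood.
  p  <- mean(r)
  mu <- c(sum(r * y) / sum(r), sum((1 - r) * y) / sum(1 - r))
  s  <- c(sqrt(sum(r * (y - mu[1])^2) / sum(r)),
          sqrt(sum((1 - r) * (y - mu[2])^2) / sum(1 - r)))
}

# Assign each point to the group with the higher posterior probability.
groups <- ifelse(r > 0.5, 1, 2)
table(groups)
```

The log-likelihood stored in `loglik` climbs from one iteration to the next, which is the "maximizing" part; the final group memberships fall out of the fitted probabilities rather than from distances to centers.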

Mclust function
• Mclust finds an appropriate model AND the optimal number of groups.
• Not free?! It needs a license agreement from the University of Washington.
• R Commands:
• library("mclust")
• planet_mclust <- Mclust(planet.dat)
• plot(planet_mclust, planet.dat)
• print(planet_mclust)
• The best model has diagonal clusters of varying volume and shape, with 3 groups.

Homework
• Spend 30 minutes attempting exercise 15.1 and send me what you get done.
• Stick it to the Man!
• Then practice your air guitar
• zweihanderdawg@gmail.com
