Functional clustering Marian Scott, Ruth Haggarty NERC workshop, University of Glasgow March 2014
First- what is clustering? • We anticipate that each sampling unit (person, site) belongs uniquely to one (unknown) group; we typically have a series of measurements on each unit • We don’t know how many groups there are • Membership is defined based on measures of similarity between the sampling units
First- what is clustering? • We expect that members of a cluster or group are more similar to each other than to members of other clusters • We measure dissimilarity (often as a measure of distance between pairs of observations) • Algorithmic clustering focuses on measures of distance; hierarchical and k-means methods are the most commonly used
The dendrogram • A dendrogram shows the connections like a tree: we can see where observations merge together. The height on the y-axis is the distance between the clusters being merged
Similarity or distance • How do we measure distance? • Euclidean distance • Weighted Euclidean distance • Mahalanobis distance • Manhattan distance • …
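As an illustration (not part of the deck), a minimal Python sketch of some of these distance measures; the function names are my own, and Mahalanobis distance is omitted since it additionally needs the inverse covariance matrix:

```python
import math

def euclidean(x, y):
    # straight-line distance between two points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_euclidean(x, y, w):
    # Euclidean distance with a weight per coordinate
    return math.sqrt(sum(wi * (a - b) ** 2 for a, b, wi in zip(x, y, w)))

def manhattan(x, y):
    # city-block distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y))   # 5.0
print(manhattan(x, y))   # 7.0
```

With unit weights, the weighted Euclidean distance reduces to the plain Euclidean distance.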
Clustering methods • Hierarchical • Divisive: put everything together and split • Agglomerative: keep everything separate and join the most similar points (classical cluster analysis) • Non-hierarchical • K-means clustering
Agglomerative hierarchical • Single linkage (nearest neighbour) finds the minimum spanning tree: the shortest tree that connects all points • Chaining can be a problem
Agglomerative hierarchical • Complete linkage (furthest neighbour) gives compact clusters of approximately equal size (it makes compact groups even when none exist)
Agglomerative hierarchical • Average linkage methods lie between single and complete linkage
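To make the three linkages concrete, a minimal from-scratch Python sketch (my own illustrative code, not the deck's): the same merge loop runs in each case, and only the cluster-to-cluster distance changes:

```python
import math

def pdist(points):
    # pairwise Euclidean distances keyed by index pair (i < j)
    d = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d[(i, j)] = math.dist(points[i], points[j])
    return d

def cluster_dist(a, b, d, linkage):
    # distance between clusters a and b under the chosen linkage
    ds = [d[(min(i, j), max(i, j))] for i in a for j in b]
    if linkage == "single":
        return min(ds)       # nearest neighbour
    if linkage == "complete":
        return max(ds)       # furthest neighbour
    return sum(ds) / len(ds) # average linkage

def agglomerate(points, k, linkage="average"):
    # start with every point in its own cluster, merge until k remain
    clusters = [[i] for i in range(len(points))]
    d = pdist(points)
    while len(clusters) > k:
        a, b = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]],
                                        d, linkage),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerate(pts, 2, "single"))  # [[0, 1], [2, 3]]
```

On this toy dataset all three linkages agree; they differ once clusters of uneven shape or size appear.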
K-means • A different set-up • For a fixed number of clusters, k-means tries to find the assignment that minimises the total within-cluster sum of squares • Computationally demanding, so we need a search strategy that is feasible
K-means • Begin with k starting centres • Assign each observation to its closest centre • Recompute the centres • Re-assign • Keep repeating until convergence
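The steps above can be sketched in Python (an illustrative toy implementation of my own, not an optimised one):

```python
import math, random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # begin with k starting centres
    for _ in range(iters):
        # assign each observation to its closest centre
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        # recompute each centre as the mean of its assigned points
        new = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new.append(tuple(sum(v) / len(members)
                                 for v in zip(*members)))
            else:
                new.append(centres[c])       # keep an empty cluster's centre
        if new == centres:                   # converged: centres are stable
            break
        centres = new
    return labels, centres

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centres = kmeans(pts, 2)
print(labels)
```

In practice the result depends on the starting centres, which is why standard implementations rerun the algorithm from several random starts.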
Model based clustering • Typically the data are clustered using some assumed mixture modeling structure. • Then the group memberships are ‘learned’ in an unsupervised fashion. • Assume the data are collected from a finite collection of populations. • The data within each population can be modeled using a standard statistical model.
Model based clustering • The data within each population can be modeled using a standard statistical model, often a mixture of Normals: f(y) = Σk πk N(y; μk, Σk) • Note the πk, the cluster probabilities
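To illustrate where the cluster probabilities πk come from, a hedged toy sketch (my own, not the deck's) of fitting a two-component Normal mixture by the EM algorithm in one dimension; real software handles multivariate data and more components:

```python
import math, statistics

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_two_normals(data, iters=50):
    # crude initialisation: split the sorted data in half
    s = sorted(data)
    half = len(s) // 2
    mu = [statistics.mean(s[:half]), statistics.mean(s[half:])]
    sd = [statistics.stdev(s[:half]), statistics.stdev(s[half:])]
    pi = [0.5, 0.5]                               # cluster probabilities
    for _ in range(iters):
        # E-step: responsibility r[i][k] = P(cluster k | x_i)
        r = []
        for x in data:
            w = [pi[k] * normal_pdf(x, mu[k], sd[k]) for k in range(2)]
            tot = sum(w)
            r.append([wk / tot for wk in w])
        # M-step: update pi, mu, sd from the responsibilities
        for k in range(2):
            nk = sum(ri[k] for ri in r)
            pi[k] = nk / len(data)
            mu[k] = sum(ri[k] * x for ri, x in zip(r, data)) / nk
            sd[k] = max(1e-6, math.sqrt(
                sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, data)) / nk))
    return pi, mu, sd, r

data = [0.0, 0.2, 0.1, 5.0, 5.2, 5.1]
pi, mu, sd, r = em_two_normals(data)
print(round(r[0][0], 2))  # membership probability of the first point
```

The responsibilities r are exactly the probabilities of cluster membership that the algorithmic methods cannot provide.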
Can we tell when we reach a good k? • Elbow plot: plot the within-cluster total sum of squares against k and look for an ‘elbow’ • Gap statistic: the same idea as the elbow plot, but normalised to make comparisons easier
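A small Python sketch (illustrative, with labels assigned by hand on a toy dataset) of the quantity plotted in an elbow plot, the within-cluster total sum of squares:

```python
def wss(points, labels):
    # within-cluster total sum of squares: squared distance
    # from each point to its own cluster mean, summed over clusters
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        mean = tuple(sum(v) / len(members) for v in zip(*members))
        total += sum(sum((a - m) ** 2 for a, m in zip(p, mean))
                     for p in members)
    return total

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(wss(pts, [0, 0, 0, 0]))   # k = 1: 201.0
print(wss(pts, [0, 0, 1, 1]))   # k = 2: 1.0
```

The sharp drop from k = 1 to k = 2, followed by only small gains for larger k, is the ‘elbow’ one looks for.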
Functional clustering • Functional clustering has the same sub-types • Hierarchical • K-means • Model based • We might also want to think about the ‘typical’ curve of a cluster, the functional average curve
Example 5 from intro Hierarchical cluster analysis applied to the functional distance matrix for the 26 total nitrogen site trends yields the dendrogram in Figure 7. An average linkage is used. Sites within each dam tend to group together.
Functional clustering (1) • As always we start with the scenario that we have a set of ‘monitoring locations’, and that we measure our variable(s) of interest over time. • The temporal frequency might be rather irregular and sparse, or it might be very regular and very frequent (e.g. daily) • We fit a smooth curve to each time series, and each curve becomes the ‘data unit’. • The norm is to use B-splines to create the smooth curves. The coefficients of the splines are key going forward.
Functional clustering (2) • The coefficients of the basis functions for each time series are the important aspect • For functional hierarchical clustering, we measure the distance between two curves in terms of their coefficients and thus create a functional distance matrix • With the distance matrix in hand, we can then apply one of the usual clustering algorithms, e.g. average linkage or k-means
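A toy Python sketch of building a functional distance matrix (illustrative only: a simple polynomial basis fitted by least squares stands in for the B-spline basis, and all names are my own):

```python
import math

def polyfit(t, y, degree=2):
    # least-squares polynomial coefficients via the normal equations
    # (a crude stand-in for a B-spline basis fit)
    n = degree + 1
    A = [[sum(ti ** (i + j) for ti in t) for j in range(n)] for i in range(n)]
    b = [sum(yi * ti ** i for ti, yi in zip(t, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        coef[r] = (b[r] - sum(A[r][c] * coef[c]
                              for c in range(r + 1, n))) / A[r][r]
    return coef

def functional_distance_matrix(series):
    # series: list of (times, values); the distance between two curves
    # is the Euclidean distance between their fitted coefficient vectors
    coefs = [polyfit(t, y) for t, y in series]
    return [[math.dist(ci, cj) for cj in coefs] for ci in coefs]

# three short "site" series: two quadratic-like, one linear-like
s1 = ([0, 1, 2, 3], [0.0, 1.1, 3.9, 9.1])
s2 = ([0, 1, 2, 3], [0.1, 1.0, 4.1, 8.9])
s3 = ([0, 1, 2, 3], [0.0, 3.0, 6.0, 9.0])
D = functional_distance_matrix([s1, s2, s3])
print(D[0][1] < D[0][2])  # the two quadratic-like sites are closest
```

Note that the curves need not be observed at the same time points: each fit only uses its own (t, y) pairs, which is exactly what makes this work for irregular sampling.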
Functional clustering (3) • Good for description: we can create the functional average curve for each cluster. • Might not be such a good approach if, as is often the case in practice, the individual curves are sparsely sampled. • Not about inference: no model underpins these approaches, so we have no probability of cluster membership, hence • Model based clustering
Model based clustering (1) • Instead of treating the basis coefficients as parameters and fitting a separate spline curve for each site, we use a random effects model for the coefficients. • This allows us to borrow strength across curves (it handles sparsely or irregularly sampled curves). • Furthermore, it automatically weights the estimated spline coefficients according to their variances and is highly efficient because it requires fitting few parameters.
Model based clustering (2) • The big advantage is that we estimate the probability of cluster membership • Computationally challenging: choose the number of clusters and fit the model, repeat for different cluster numbers, compare the models
Choosing how many (1) • One benefit of model-based clustering techniques is that model selection criteria such as Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) can often be used to determine the appropriate number of clusters, but this can be computationally expensive
Choosing how many (2) • Another popular approach for selecting the number of clusters is the gap statistic proposed by Tibshirani et al. (2001), which compares the within-cluster dispersion for the observed data, Wk, with the average within-cluster dispersion under a null reference distribution that assumes no clustering among the sites. • A number of reference datasets, say B, are generated, and the same clustering technique that was applied to the observed data is applied to each.
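A toy Python sketch of the gap statistic as described (illustrative assumptions of mine: a minimal k-means as the clustering technique, and uniform sampling over the data's bounding box as the null reference):

```python
import math, random

def kmeans_wss(points, k, rng, iters=50):
    # minimal k-means returning the within-cluster total sum of squares
    centres = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        new = []
        for c in range(k):
            m = [p for p, l in zip(points, labels) if l == c] or [centres[c]]
            new.append(tuple(sum(v) / len(m) for v in zip(*m)))
        if new == centres:
            break
        centres = new
    return sum(math.dist(p, centres[l]) ** 2 for p, l in zip(points, labels))

def gap_statistic(points, k, B=20, seed=0):
    # Gap(k) = mean_b log(W*_kb) - log(W_k)
    rng = random.Random(seed)
    log_wk = math.log(kmeans_wss(points, k, rng))
    # B reference datasets: uniform over the bounding box of the data
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    log_ref = []
    for _ in range(B):
        ref = [tuple(rng.uniform(lo[d], hi[d]) for d in dims)
               for _ in points]
        log_ref.append(math.log(kmeans_wss(ref, k, rng)))
    return sum(log_ref) / B - log_wk

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(gap_statistic(pts, 2) > gap_statistic(pts, 1))
```

For these two well-separated groups the gap is much larger at k = 2 than at k = 1; one would normally compute the gap for a range of k and pick the first clear peak.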
Model based clustering (4) • Can be extended to deal with multiple curves per site, for example nitrate and phosphate • Still an active research area • Resources are becoming available in R: MFDA (model based functional clustering), Funclustering
References • James and Sugar (2003). Clustering for sparsely sampled functional data. JASA, 98(462). • Henderson, B. (2006). Exploring between site differences in water quality trends. Environmetrics, 17. • Jacques, J. and Preda, C. (2014). Model-based clustering for multivariate functional data. CSDA, 71.