Multivariate statistical methods: Cluster analysis
Multivariate methods • multivariate dataset – a group of n objects described by m variables (as a rule n > m, if possible) • confirmatory vs. exploratory analysis • confirmatory – focuses on parameter estimation and hypothesis testing • exploratory – focuses on exploring the data and finding patterns and structure
Multivariate statistical methods Classification of units: • Cluster analysis • Discriminant analysis Analysis of relations among variables: • Canonical correlation analysis • Factor analysis • Principal component analysis
Cluster analysis (CA) • the aim is to find groups of objects that are similar to each other and different from the objects in other groups • methods of cluster analysis: • hierarchical • nonhierarchical
1. Hierarchical methods • clusters are created at different levels (higher-level clusters contain lower-level clusters) • the results form a tree structure and are presented as a dendrogram • two things must be specified: • the similarity measure • the clustering algorithm
Hierarchical methods – expressing similarity • qualitative values: • number of identical values / number of all values • quantitative values: • Euclidean distance • Manhattan distance (city-block distance; equals the Hamming distance for binary data) • Chebyshev distance
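The ratio for qualitative values can be sketched in a few lines of Python. This is only an illustration: the function name `matching_similarity` and the two example objects are hypothetical and do not come from the slides.

```python
def matching_similarity(a, b):
    """Number of identical values divided by the number of all values."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of variables")
    identical = sum(1 for u, v in zip(a, b) if u == v)
    return identical / len(a)


# Hypothetical objects described by four qualitative variables.
obj1 = ["red", "small", "round", "yes"]
obj2 = ["red", "large", "round", "no"]

similarity = matching_similarity(obj1, obj2)  # 2 identical values out of 4
print(similarity)
```

A value of 1 means the two objects agree on every variable, a value of 0 means they agree on none.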
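Similarity measures • Euclidean distance: $d_E(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$ • Manhattan distance: $d_M(x_i, x_j) = \sum_{k=1}^{n} |x_{ik} - x_{jk}|$ • Chebyshev distance: $d_C(x_i, x_j) = \max_{k} |x_{ik} - x_{jk}|$ where $x_{ik}$ and $x_{jk}$ are the values of the k-th characteristic of objects i and j, whose distance is measured in n-dimensional space, and n is the number of observed characteristics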
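A minimal sketch of the three distances, assuming each object is given as a plain list of n numeric characteristics. The function names and the example points `xi` and `xj` are chosen here for illustration only.

```python
import math


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


# Two hypothetical objects measured on n = 2 characteristics.
xi = [1.0, 2.0]
xj = [4.0, 6.0]

d_euclid = euclidean(xi, xj)   # sqrt(3^2 + 4^2)
d_manhat = manhattan(xi, xj)   # 3 + 4
d_cheby = chebyshev(xi, xj)    # max(3, 4)
print(d_euclid, d_manhat, d_cheby)
```

For any pair of objects the Chebyshev distance is never larger than the Euclidean distance, which in turn is never larger than the Manhattan distance.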
Distance of objects in 2D Points at the same distance from the centre form: • a circle – Euclidean distance • the inner square – Manhattan distance • the outer square – Chebyshev distance
Other types of similarity measures • Power distance: the exponents are defined by the user; the higher p is, the more weight is given to larger differences and the less important smaller differences become; the parameter r has the opposite effect • 1 – Pearson r: unsuitable for a small number of dimensions • Percent disagreement: suitable for categorical variables
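The sketch below illustrates these three measures. The slides do not give the formulas, so the forms used here are assumptions: the power distance is taken as (Σ|x − y|^p)^(1/r), 1 – Pearson r is one minus the correlation of the two objects' coordinates, and percent disagreement is the share of variables on which the objects differ. All function names are hypothetical.

```python
import math


def power_distance(x, y, p, r):
    """Assumed form: (sum of |difference|^p) raised to the power 1/r."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)


def one_minus_pearson(x, y):
    """1 - Pearson correlation; fails if either object is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def percent_disagreement(x, y):
    """Share of variables on which the two objects differ."""
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)


# Hypothetical example values.
d_power = power_distance([1.0, 2.0], [4.0, 6.0], p=2, r=2)
d_corr = one_minus_pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
d_cat = percent_disagreement(["a", "b", "c", "d"], ["a", "x", "c", "y"])
print(d_power, d_corr, d_cat)
```

With p = r = 2 the assumed power distance reduces to the Euclidean distance, which is a quick sanity check on the formula.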
Algorithms of clustering • Nearest neighbor (single) linkage: the distance between two clusters is defined as the distance between their two nearest objects • Furthest neighbor (complete) linkage: the distance between two clusters is defined as the distance between their two furthest objects • Unweighted group average linkage: the distance between two clusters is defined as the average distance over all pairs of objects in which the first member comes from the first cluster and the second member from the second cluster • Weighted group average linkage: as above, but the cluster sizes (numbers of objects) are additionally used as weights
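The first three linkage rules can be sketched as follows, assuming Euclidean distance between objects and clusters stored as lists of coordinate tuples. The function names and the example clusters are hypothetical, and the weighted variant is left out.

```python
import math
from itertools import product


def single_linkage(cluster_a, cluster_b):
    """Distance between the two nearest objects of the two clusters."""
    return min(math.dist(a, b) for a, b in product(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Distance between the two furthest objects of the two clusters."""
    return max(math.dist(a, b) for a, b in product(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Average distance over all cross-cluster pairs of objects."""
    total = sum(math.dist(a, b) for a, b in product(cluster_a, cluster_b))
    return total / (len(cluster_a) * len(cluster_b))


# Two hypothetical clusters in 2D.
A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]

d_single = single_linkage(A, B)      # pair (0,0)-(3,0)
d_complete = complete_linkage(A, B)  # pair (0,1)-(4,0)
d_average = average_linkage(A, B)
print(d_single, d_complete, d_average)
```

The average-linkage value always lies between the single-linkage and complete-linkage values for the same pair of clusters.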
Algorithms of clustering • Unweighted centroid: the distance between two clusters is defined as the distance between their centroids. A centroid is a vector of averages (each coordinate is the average of the corresponding coordinates of the objects in the cluster) • Weighted centroid: as above, but the cluster sizes (numbers of objects) are additionally used as weights • Ward's method: unlike the previous methods, it uses an analysis-of-variance approach to evaluate the distance between clusters. Clusters are merged so that the within-cluster sum of squares is minimal
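A small sketch of the centroid distance and of the quantity Ward's method looks at. It assumes Euclidean geometry and measures the cost of a merge as the increase in the within-cluster sum of squares, which is one common way to express Ward's criterion; the function names and example clusters are hypothetical.

```python
import math


def centroid(cluster):
    """Vector of coordinate averages of the objects in the cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def within_ss(cluster):
    """Sum of squared distances of the objects from their centroid."""
    c = centroid(cluster)
    return sum(sum((v - m) ** 2 for v, m in zip(obj, c)) for obj in cluster)


def centroid_distance(cluster_a, cluster_b):
    """Unweighted centroid linkage: distance between the two centroids."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))


def ward_increase(cluster_a, cluster_b):
    """Growth of the within-cluster sum of squares caused by merging."""
    merged = cluster_a + cluster_b
    return within_ss(merged) - within_ss(cluster_a) - within_ss(cluster_b)


# Two hypothetical clusters in 2D.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(4.0, 0.0), (4.0, 2.0)]

d_centroid = centroid_distance(A, B)  # centroids (0, 1) and (4, 1)
d_ward = ward_increase(A, B)          # 20 - 2 - 2
print(d_centroid, d_ward)
```

At each step, a Ward-style procedure would merge the pair of clusters with the smallest such increase.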
2. Nonhierarchical methods • the most widely used method is K-means • the algorithm is based on moving objects among clusters • the number of clusters is defined beforehand, chosen arbitrarily or according to the analyst's experience • centroids are computed for all clusters in the same step • every object is examined: if it is nearest to the centroid of its own cluster, it stays in that cluster; if not, it is moved to the cluster whose centroid is nearest. The within-cluster sum of squares should be minimal. The procedure is repeated until no object needs to be moved; then we have the final solution • the method does not work with a distance matrix → K-means is suitable for clustering larger numbers of objects
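The procedure described above can be sketched as a simple batch loop. This is one possible reading, not necessarily the exact variant the slides have in mind: it assumes Euclidean distance, starts from k randomly chosen objects as initial centroids, and stops when no object changes its cluster. The function name `kmeans`, its parameters, and the example data are hypothetical.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Move objects among k clusters until no object changes cluster."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k distinct objects picked at random.
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assign every object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no object moved: final solution
        labels = new_labels
        # Recompute all centroids in the same step.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical data: two visually separate groups of objects in 2D.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Only the objects and the k centroids are kept in memory, never the full matrix of pairwise distances, which is why the approach scales to larger numbers of objects. The result can depend on the initial centroids, so in practice the run is often repeated with several seeds.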