Learn about statistical analysis techniques including Multivariate Descriptive Analysis, Factor Analysis, and Clustering. Discover how these methods can help identify important relationships between variables and detect underlying dimensions in a multidimensional space. Perfect for researchers, analysts, and decision-makers.
Statistical Analysis
Instructor: Prof. Louis Chauvel
Multivariate descriptive analysis
Factor analysis and clustering (PCA and HCA) + k-means
• Principal components analysis
• Hierarchical cluster analysis
This session: descriptive multidimensional analysis
• Good for detecting important relations between variables
• Not relevant for causality, net effects, confidence intervals, …
• “Heuristic” (from Greek εὑρίσκω, "I find, discover") methods
• Efficient tool for synthesis
• To put in the annexes of your thesis, or in reports
• Politicians, decision makers, CEO$, etc. like their results
• Useful if you need money
• Factor analysis finds the main dimensions in a multidimensional space
• Cluster analysis finds subgroups that are intra-homogeneous and inter-heterogeneous (“classes”)
• Part 1 = Principal Component Analysis (PCA)
  • (ex. welfare regimes)
• Part 2 = Hierarchical Cluster Analysis (HCA)
  • (ex. welfare regimes)
• Part 3 = Joint PCA and HCA
  • (ex. U.S. General Social Survey, GSS)
Principal Components
• Simplify N-dimensional tables into 2 (3 or 4) axes
• Reduce “noise” and keep “signal”
• Identify underlying dimensions or principal components of a distribution
• Helps understand the joint or common variation among a set of variables
• Commonly used method for detecting “latent dimensions”
• Rotation (= eigen-decomposition / spectral decomposition of the correlation matrix)
Principal Components
• The first principal component is the vector (or, equivalently, the linear combination of variables) onto which the most data variation can be projected
• The 2nd principal component is a vector perpendicular to the first, chosen so that it contains as much of the remaining variation as possible
• And so on for the 3rd principal component, the 4th, the 5th, etc.
Principal Components Analysis (PCA)
• Principle
  • Linear projection method to reduce the number of parameters
  • Transform a set of correlated variables into a new set of uncorrelated variables
  • Map the data into a space of lower dimensionality
  • Form of unsupervised learning
• Properties
  • It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables
  • The new axes are orthogonal and represent the directions of maximum variability
https://www.cs.princeton.edu/picasso/mats/Lecture1_jps.ppt
Computing the Components
• Data points are vectors in a multidimensional space
• The projection of a vector x onto an axis (dimension) u is $u \cdot x$
• The direction of greatest variability is the one in which the average square of the projection is greatest
  • i.e. the u such that $E[(u \cdot x)^2]$, taken over all x, is maximized
  • (for simplicity, we subtract the mean along each dimension, which centres the original axis system at the centroid of all data points)
• This direction u is the direction of the first principal component
https://www.cs.princeton.edu/picasso/mats/Lecture1_jps.ppt
Computing the Components
• $E[(u \cdot x)^2] = E[(u^T x)(u^T x)^T] = E[u^T x x^T u] = u^T C u$
• The matrix $C = E[x x^T]$ contains the correlations (similarities) of the original axes (for centred, standardized variables), based on how the data values project onto them
• So we are looking for the u that maximizes $u^T C u$, subject to u being unit-length
• It is maximized when u is the principal eigenvector of the matrix C, in which case
  • $u^T C u = u^T \lambda u = \lambda$ if u is unit-length, where $\lambda$ is the principal eigenvalue of the correlation matrix C
• The eigenvalue denotes the amount of variability captured along that dimension
https://www.cs.princeton.edu/picasso/mats/Lecture1_jps.ppt
Why the Eigenvectors?
• Maximise $u^T x x^T u$ subject to $u^T u = 1$
• Construct the Lagrangian $u^T x x^T u - \lambda u^T u$
• Set the vector of partial derivatives to zero (dropping the common factor 2): $x x^T u - \lambda u = (x x^T - \lambda I)\,u = 0$
• As $u \neq 0$, u must be an eigenvector of $x x^T$ with eigenvalue $\lambda$
https://www.cs.princeton.edu/picasso/mats/Lecture1_jps.ppt
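The eigenvector argument can be checked numerically. The sketch below is only an illustration, not the method used in the course: it uses made-up two-variable data, the sample covariance matrix as C, and power iteration (a technique not mentioned in the slides) to approximate the principal eigenvector. All function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix; rows are observations, columns are variables."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(k)] for a in range(k)]

def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def principal_eigenpair(c, iterations=500):
    """Power iteration: apply C repeatedly and rescale u to unit length."""
    u = [1.0] * len(c)
    for _ in range(iterations):
        w = mat_vec(c, u)
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(u, mat_vec(c, u)))  # u^T C u
    return lam, u

# Made-up data, two variables per observation (illustration only)
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
C = covariance_matrix(data)
lam, u = principal_eigenpair(C)
print("first eigenvalue:", lam)
print("first eigenvector:", u)
```

At convergence, C u equals λ u and u^T C u equals λ, which is the condition derived above. No other unit-length direction gives a larger projected variance.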
Singular Value Decomposition
• The first root is called the principal eigenvalue; it has an associated orthonormal ($u^T u = 1$) eigenvector u
• Subsequent roots are ordered such that $\lambda_1 > \lambda_2 > \dots > \lambda_M$, with rank(D) non-zero values
• The eigenvectors form an orthonormal basis, i.e. $u_i^T u_j = \delta_{ij}$
• The eigenvalue decomposition of $x x^T$ is $x x^T = U \Sigma U^T$, where $U = [u_1, u_2, \dots, u_M]$ and $\Sigma = \mathrm{diag}[\lambda_1, \lambda_2, \dots, \lambda_M]$
• Similarly, the eigenvalue decomposition of $x^T x$ is $x^T x = V \Sigma V^T$
• The SVD is closely related to the above: $x = U \Sigma^{1/2} V^T$
• U holds the left singular vectors, V the right singular vectors, and the singular values are the square roots of the eigenvalues
https://www.cs.princeton.edu/picasso/mats/Lecture1_jps.ppt
Computing the Components
• Similarly for the next axis, etc.
• So the new axes are the eigenvectors of the matrix of correlations of the original variables, which captures the similarities of the original variables based on how data samples project onto them
• Geometrically: centering followed by rotation
  • Linear transformation
https://www.cs.princeton.edu/picasso/mats/Lecture1_jps.ppt
PCA: General
From k original variables $x_1, x_2, \dots, x_k$, produce k new variables $y_1, y_2, \dots, y_k$:
  $y_1 = a_{11} x_1 + a_{12} x_2 + \dots + a_{1k} x_k$
  $y_2 = a_{21} x_1 + a_{22} x_2 + \dots + a_{2k} x_k$
  ...
  $y_k = a_{k1} x_1 + a_{k2} x_2 + \dots + a_{kk} x_k$
such that:
• the $y_k$'s are uncorrelated (orthogonal)
• $y_1$ explains as much as possible of the original variance in the data set
• $y_2$ explains as much as possible of the remaining variance
• etc.
http://www.cs.cmu.edu/~16385/s14/lec_slides/lec-18.ppt
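For k = 2 the coefficients can be written in closed form, which makes the two properties (uncorrelated components, maximal variance on the first one) easy to verify. This is a minimal sketch on made-up data. The closed-form rotation angle is standard for a symmetric 2 × 2 matrix, but it is not taken from the slides, and all names are hypothetical.

```python
import math

def centre(values):
    """Subtract the mean so the variable is centred on 0."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def cov(xs, ys):
    """Sample covariance of two already-centred lists."""
    return sum(x * y for x, y in zip(xs, ys)) / (len(xs) - 1)

# Made-up data: two correlated variables (illustration only)
x1 = centre([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
x2 = centre([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Covariance matrix [[a, b], [b, d]] and the angle of its principal axis
a, b, d = cov(x1, x1), cov(x1, x2), cov(x2, x2)
theta = 0.5 * math.atan2(2 * b, a - d)

# Rows of the coefficient matrix: (a11, a12) and (a21, a22)
a11, a12 = math.cos(theta), math.sin(theta)
a21, a22 = -math.sin(theta), math.cos(theta)

# New variables y1 = a11*x1 + a12*x2 and y2 = a21*x1 + a22*x2
y1 = [a11 * p + a12 * q for p, q in zip(x1, x2)]
y2 = [a21 * p + a22 * q for p, q in zip(x1, x2)]

print("var(y1), var(y2):", cov(y1, y1), cov(y2, y2))
print("cov(y1, y2):", cov(y1, y2))
```

The total variance is unchanged by the rotation (var(y1) + var(y2) = var(x1) + var(x2)); it is only redistributed so that y1 carries as much of it as possible.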
[Figure: 1st Principal Component, y1, and 2nd Principal Component, y2] http://www.cs.cmu.edu/~16385/s14/lec_slides/lec-18.ppt
PCA Scores [Figure: a point with original coordinates xi1, xi2 and scores yi,1, yi,2] http://www.cs.cmu.edu/~16385/s14/lec_slides/lec-18.ppt
PCA Eigenvalues [Figure: λ1 and λ2] http://www.cs.cmu.edu/~16385/s14/lec_slides/lec-18.ppt
PCA: Another Explanation
From k original variables $x_1, x_2, \dots, x_k$, produce k new variables $y_1, y_2, \dots, y_k$:
  $y_1 = a_{11} x_1 + a_{12} x_2 + \dots + a_{1k} x_k$
  $y_2 = a_{21} x_1 + a_{22} x_2 + \dots + a_{2k} x_k$
  ...
  $y_k = a_{k1} x_1 + a_{k2} x_2 + \dots + a_{kk} x_k$
The $y_k$'s are the Principal Components, such that:
• the $y_k$'s are uncorrelated (orthogonal)
• $y_1$ explains as much as possible of the original variance in the data set
• $y_2$ explains as much as possible of the remaining variance
• etc.
http://www.cs.cmu.edu/~16385/s14/lec_slides/lec-18.ppt
Principal Components Analysis on:
• Covariance Matrix:
  • Variables must be in the same units
  • Emphasizes the variables with the most variance
  • Mean eigenvalue ≠ 1.0
• Correlation Matrix:
  • Variables are standardized (mean 0.0, SD 1.0)
  • Variables can be in different units
  • All variables have the same impact on the analysis
  • Mean eigenvalue = 1.0
http://www.cs.cmu.edu/~16385/s14/lec_slides/lec-18.ppt
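The "mean eigenvalue" lines follow from the fact that the eigenvalues of a symmetric matrix sum to its trace. A correlation matrix of k variables has k ones on its diagonal, so its mean eigenvalue is exactly 1. A covariance matrix has the variances on its diagonal, so its mean eigenvalue is the mean variance. The sketch below checks this on made-up data in three different units. The figures and names are hypothetical and are not the Fenger dataset.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix; rows are observations, columns are variables."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(k)] for a in range(k)]

def correlation_matrix(rows):
    """Covariance matrix rescaled by the standard deviations."""
    c = covariance_matrix(rows)
    sd = [math.sqrt(c[j][j]) for j in range(len(c))]
    return [[c[a][b] / (sd[a] * sd[b]) for b in range(len(c))]
            for a in range(len(c))]

def mean_eigenvalue(matrix):
    """Sum of eigenvalues = trace, so the mean eigenvalue is trace / k."""
    return sum(matrix[j][j] for j in range(len(matrix))) / len(matrix)

# Made-up rows: (life expectancy in years, Gini on a 0-1 scale, spending in % of GDP)
rows = [(78.1, 0.25, 28.0), (80.3, 0.33, 21.5), (74.9, 0.36, 17.2),
        (81.0, 0.27, 30.1), (76.5, 0.31, 19.8)]

cov_m = covariance_matrix(rows)
cor_m = correlation_matrix(rows)
print("mean eigenvalue, covariance matrix:", mean_eigenvalue(cov_m))
print("mean eigenvalue, correlation matrix:", mean_eigenvalue(cor_m))
print("variances:", [cov_m[j][j] for j in range(3)])
```

The printed variances differ by orders of magnitude purely because of the units, which is why a covariance-based PCA on such data would be driven almost entirely by the variable with the largest scale.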
PCA: General
• $\{a_{11}, a_{12}, \dots, a_{1k}\}$ is the 1st eigenvector of the correlation/covariance matrix, and holds the coefficients of the first principal component
• $\{a_{21}, a_{22}, \dots, a_{2k}\}$ is the 2nd eigenvector of the correlation/covariance matrix, and holds the coefficients of the 2nd principal component
• …
• $\{a_{k1}, a_{k2}, \dots, a_{kk}\}$ is the kth eigenvector of the correlation/covariance matrix, and holds the coefficients of the kth principal component
http://www.cs.cmu.edu/~16385/s14/lec_slides/lec-18.ppt
PCA Summary until now
• Rotates a multivariate dataset into a new configuration which is easier to interpret
• Purposes
  • simplify data
  • look at relationships between variables
  • look at patterns of units
http://www.cs.cmu.edu/~16385/s14/lec_slides/lec-18.ppt
How Many PCs?
• For n original dimensions, the correlation matrix is n × n and has up to n eigenvectors, so there are n PCs
• Where does dimensionality reduction come from?
https://www.cs.princeton.edu/picasso/mats/Lecture1_jps.ppt
Dimensionality Reduction
We can ignore the components of lesser significance. Some information is lost, but if the eigenvalues are small, not much.
• n dimensions in the original data
• calculate n eigenvectors and eigenvalues
• choose only the first p eigenvectors, based on their eigenvalues
• the final data set has only p dimensions
https://www.cs.princeton.edu/picasso/mats/Lecture1_jps.ppt
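One common way to "choose the first p eigenvectors based on their eigenvalues" is to keep components until a chosen share of the total variance is reached. The sketch below assumes that rule. The 75% threshold, the eigenvalues, and the function name are illustrative choices, not values from the slides.

```python
def components_to_keep(eigenvalues, min_share=0.75):
    """Smallest p such that the p largest eigenvalues reach min_share of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for p, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= min_share:
            return p
    return len(ordered)

# Hypothetical eigenvalues of a 5-variable correlation matrix (they sum to 5)
eigenvalues = [2.9, 1.1, 0.6, 0.3, 0.1]
p = components_to_keep(eigenvalues)
print("components kept:", p)
```

With these values the first component alone carries 58% of the variance and the first two about 80%, so two axes are kept and three are dropped.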
https://www.cs.princeton.edu/picasso/mats/Lecture1_jps.ppt Eigenvectors of a Correlation Matrix
Graphic presentation
• Correlation circle: the variables and their correlations with axis 1, axis 2, etc.
• Principal plane: individuals or groups
• Interpretations are made in terms of directions from the center
• The center (0,0) represents the average situation (indiscriminate)
How to proceed?
• Select “active variables”: numeric or pseudo-numeric (ordinal) variables
• The “active variables” should represent the different dimensions of your “field”: not too correlated, and with no bias in how well each dimension is represented
• Begin with a correlation matrix
• IMPORTANT: be sure the variables are oriented the way you think; recode if “Height” is “1: high; 2: medium; 3: small”
• (test different variants of “active variables”)
• Then carry out the “internal analysis” (of the “active variables”)
• And then the “external analysis” (other variables and individuals)
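The recoding step can be sketched as follows, assuming an ordinal variable coded from 1 to 3 whose direction has to be flipped before it enters a correlation matrix. The function names and sample values are hypothetical. Standardizing afterwards (mean 0, SD 1) is what a correlation-based PCA does implicitly.

```python
import statistics

def reverse_code(values, low, high):
    """Flip an ordinal scale: low becomes high and high becomes low."""
    return [low + high - v for v in values]

def standardize(values):
    """z-scores: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# "Height" coded 1: high, 2: medium, 3: small (made-up answers)
height_raw = [1, 2, 3, 3, 1, 2, 2]

# After recoding, larger numbers mean taller: 3: high, 2: medium, 1: small
height = reverse_code(height_raw, low=1, high=3)
z_height = standardize(height)

print(height)
print([round(z, 2) for z in z_height])
```

Without the recoding, this variable would correlate negatively with other size-related variables and would point in the "wrong" direction on the correlation circle.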
Cluster Analysis
• Techniques for identifying separate groups of similar cases
  • Similarity of cases is either specified directly in a distance matrix, or defined in terms of some distance function
• Also used to summarise data by defining segments of similar cases
• 3 main types of cluster analysis methods
  • Descending hierarchical cluster analysis
    • Each cluster (starting with the whole dataset) is divided into two, then divided again, and so on
  • Ascending hierarchical cluster analysis
    • Individuals are iteratively merged with one another, optimally at each step, going from N individuals to a small number of clusters
  • Iterative methods
    • k-means clustering (Stata kmeans)
    • Analogous non-parametric density estimation method
Clustering Techniques: The Ward Method (iterative HCA)
We have a multidimensional cloud of N dots in M dimensions
i) Compute the matrix of (weighted) distances and find the 2 “closest” points;
ii) Merge them into a single point whose weight is the sum of their weights;
iii) Compute the change in inertia of the cloud of dots;
iv) Go back to i)
=> The N dots become a small number of groups
[Figure: cloud of dots; axis labels: Wage, education]
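The loop above can be sketched in a few lines. This is a naive illustration under stated assumptions, not an efficient or authoritative implementation. "Closest" is taken to mean the pair whose merge increases the within-cluster inertia the least, which for two clusters with weights w_a, w_b and centroids c_a, c_b is w_a w_b / (w_a + w_b) × ||c_a − c_b||² (the usual Ward criterion). The data and names are made up.

```python
def ward(points, n_clusters):
    """Naive Ward clustering: merge the pair with the smallest inertia increase."""
    clusters = [{"members": [i], "centroid": list(p), "weight": 1}
                for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                d2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
                # Increase in within-cluster inertia if a and b are merged
                cost = a["weight"] * b["weight"] / (a["weight"] + b["weight"]) * d2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        a, b = clusters[i], clusters[j]
        w = a["weight"] + b["weight"]
        merged = {
            "members": a["members"] + b["members"],
            # Weighted mean of the two centroids; weights are summed
            "centroid": [(a["weight"] * x + b["weight"] * y) / w
                         for x, y in zip(a["centroid"], b["centroid"])],
            "weight": w,
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c["members"]) for c in clusters)

# Made-up (education in years, hourly wage) dots forming three visible groups
points = [(8, 10), (9, 12), (8, 11),
          (12, 20), (13, 22), (12, 21),
          (18, 40), (17, 38), (18, 39)]
groups = ward(points, n_clusters=3)
print(groups)
```

In practice the variables would be standardized first, since otherwise the one with the largest scale (wage here) dominates the distances.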
Cluster Analysis Options
• Many definitions of distance (Manhattan, Euclidean, squared Euclidean)
• Several choices of how to form clusters in hierarchical cluster analysis
  • Single linkage
  • Average linkage
  • Density linkage
  • Ward’s method
  • Many others
• Ward’s method (like k-means) tends to form equal-sized, roundish clusters
• Average linkage generally forms roundish clusters with equal variance
• Density linkage can identify clusters of different shapes
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The number of clusters, K, must be specified
• The basic algorithm is very simple
https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.ppt
K-means Clustering – Details
• Initial centroids are often chosen randomly.
  • Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
  • Often the stopping condition is changed to ‘Until relatively few points change clusters’
• Complexity is O(n * K * I * d)
  • n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.ppt
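The basic algorithm can be sketched as follows, assuming squared Euclidean distance and centroids defined as cluster means. The initial centroids are passed in explicitly so that the run is reproducible. The function names and data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, max_iter=100):
    """Basic K-means: assign each point to its closest centroid, then update."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for each point
        new_labels = [min(range(len(centroids)),
                          key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids

# Made-up 2-D points forming three visible groups
points = [(8, 10), (9, 12), (8, 11),
          (12, 20), (13, 22), (12, 21),
          (18, 40), (17, 38), (18, 39)]
labels, centroids = kmeans(points, [points[0], points[3], points[6]])
print(labels)
print(centroids)
```

Each pass costs roughly n × K distance computations over d attributes, which is where the O(n * K * I * d) complexity comes from. Starting from three centroids drawn from the same group (for example `points[0:3]`) can lead to a different partition, which illustrates how the result depends on the initial centroids.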
Two different K-means Clusterings [Figure panels: Original Points; Optimal Clustering; Sub-optimal Clustering] https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.ppt
Importance of Choosing Initial Centroids https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.ppt
Evaluating K-means Clusters
• The most common measure is the Sum of Squared Error (SSE)
  • For each point, the error is the distance to the nearest cluster centre
  • To get the SSE, we square these errors and sum them: $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$
  • x is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$
    • one can show that $m_i$ corresponds to the center (mean) of the cluster
  • Given two clusterings, we can choose the one with the smallest error
  • One easy way to reduce the SSE is to increase K, the number of clusters
    • A good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K
https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.ppt
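The SSE and the last remark can be checked on a toy example. The sketch below assumes squared Euclidean distance and uses the cluster mean as the representative point m_i. The one-dimensional data and the three partitions are made up for illustration.

```python
def sse(clusters):
    """Sum over clusters of squared distances from each point to the cluster mean."""
    total = 0.0
    for members in clusters:
        mean = [sum(col) / len(members) for col in zip(*members)]
        for x in members:
            total += sum((a - m) ** 2 for a, m in zip(x, mean))
    return total

# Six made-up one-dimensional points: two tight groups around 2 and 12
pts = [(1,), (2,), (3,), (11,), (12,), (13,)]

one_cluster = [pts]                        # K = 1
good_two = [pts[:3], pts[3:]]              # K = 2, follows the two groups
poor_three = [[pts[0]], [pts[1]], pts[2:]] # K = 3, splits one group, mixes the rest

print("K=1:", sse(one_cluster))
print("K=2 (good):", sse(good_two))
print("K=3 (poor):", sse(poor_three))
```

The well-chosen K = 2 partition has a far lower SSE than the badly chosen K = 3 one, so a lower SSE cannot be read as "more clusters is better" without looking at how the clusters were formed.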
Solutions to Initial Centroids Problem
• Multiple runs
  • Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial centroids
• Select more than k initial centroids and then select among these initial centroids
  • Select the most widely separated
• Postprocessing
• Bisecting K-means
  • Not as susceptible to initialization issues
https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.ppt
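One possible reading of "select the most widely separated" is a farthest-first selection: start from one candidate, then repeatedly add the candidate whose distance to the closest already-chosen centroid is largest. This is an assumption about what the slide intends, not a method it spells out. The names and data below are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def widely_separated(candidates, k):
    """Farthest-first selection of k initial centroids among the candidates."""
    chosen = [candidates[0]]  # arbitrary starting candidate
    while len(chosen) < k:
        remaining = [c for c in candidates if c not in chosen]
        # Pick the candidate farthest from its nearest chosen centroid
        nxt = max(remaining, key=lambda c: min(sq_dist(c, s) for s in chosen))
        chosen.append(nxt)
    return chosen

# Made-up candidate points forming three visible groups
candidates = [(8, 10), (9, 12), (8, 11),
              (12, 20), (13, 22), (12, 21),
              (18, 40), (17, 38), (18, 39)]
initial = widely_separated(candidates, k=3)
print(initial)
```

Here the three selected starting points fall in three different groups. The same rule is sensitive to outliers, since an isolated point is by construction far from everything else.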
Limitations of K-means
• K-means has problems when clusters are of differing
  • Sizes
  • Densities
  • Non-globular shapes
• K-means has problems when the data contains outliers.
https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.ppt
Limitations of K-means: Differing Sizes [Figure panels: Original Points; K-means (3 Clusters)] https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.ppt
Limitations of K-means: Differing Density [Figure panels: Original Points; K-means (3 Clusters)] https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.ppt
Limitations of K-means: Non-globular Shapes [Figure panels: Original Points; K-means (2 Clusters)] https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.ppt
Beyond the conclusion: assessment of the typology
• How well does Gøsta Esping-Andersen’s typology of welfare regimes and countries really fit the social facts?
• Fenger (2007) assesses the GEA typology
  • http://www.louischauvel.org/fenger_2007.pdf
• We can implement this in Stata:
  • http://www.louischauvel.org/pca_fenger.do
Assessing the typology near 2005
• Fempart: Female participation (% of women in total workforce)
• Fertili: Total fertility rate (births per woman)
• Gini: Inequality (Gini coefficient; 2002 or latest available year)
• Govhealth: Government health expenditures (% of total government expenditures)
• Healthexpen: General health expenditures (% of GDP)
• Labormarkt: Spending on labor market policies (% of GDP)
• Lifeexp: Life expectancy (years)
• Oldageexpen: Spending on old age (% of GDP)
• Soccontrib: Revenues from social contributions (% of GDP)
• Socprotect: Spending on social protection (% of GDP)
• Spendedu: Spending on education (% of GDP)
• taxesgdp: Taxes on revenue (% of GDP)
• Unemployment: Unemployment rate (% of labor force)
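B. Assessment of the typology
(…)
* Ward's linkage hierarchical clustering on the z* variables, Euclidean (L2) distance
cluster wardslinkage z* , measure(L2)
* Cut the tree into 5 groups and store the membership in grp
cluster gen grp = group(5)
* Means of the z* variables by group
tabstat z* , by(grp)
* PCA on the same variables; keep the scores on the first three components
pca z*
predict f1 f2 f3
* Mean scores by ISO and by group
tabstat f* , by(ISO)
tabstat f* , by(grp)
* Dendrogram with ISO labels
cluster dendrogram _clus_2, labels(ISO) xlabel(, ///
    angle(90) labsize(*.75))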