Correspondence analysis applied to microarray data • Kurt Fellenberg, Nicole C. Hauser, Benedikt Brors, Albert Neutzner, Jörg D. Hoheisel, Martin Vingron • http://www.dkfz-heidelberg.de/funct_genome/PDF-Files/PNAS-98-(2001)-10781.pdf • www.pnas.org/cgi/doi/10.1073/pnas.181597298
Principal Component Analysis • Given N data vectors in k dimensions, find c <= k orthogonal vectors that best represent the data • The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions) • Each data vector is approximated by a linear combination of the c principal component vectors • Project onto the subspace that preserves most of the data variability
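The projection described on this slide can be sketched in plain Python. This is a minimal illustration, not the method used in the paper: it assumes the principal components are taken as the leading eigenvectors of the sample covariance matrix, found here by power iteration with deflation. The function names and the four 2-dimensional data points are hypothetical.

```python
from math import sqrt

def covariance_matrix(data):
    """Center the data and return (centered rows, sample covariance matrix)."""
    n, k = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(k)]
           for a in range(k)]
    return centered, cov

def power_iteration(matrix, iterations=500):
    """Leading eigenvalue/eigenvector of a symmetric positive semidefinite matrix.
    Assumes the start vector (1, 0, ..., 0) is not orthogonal to that eigenvector."""
    k = len(matrix)
    v = [1.0] + [0.0] * (k - 1)
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:  # nothing left to extract
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(matrix[a][b] * v[b] for b in range(k))
                     for a in range(k))
    return eigenvalue, v

def pca(data, c):
    """Return the c leading components, their variances, and the projected data."""
    centered, cov = covariance_matrix(data)
    k = len(cov)
    components, variances = [], []
    for _ in range(c):
        lam, v = power_iteration(cov)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the direction just found before looking for the next one.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(k)] for a in range(k)]
    scores = [[sum(r[j] * v[j] for j in range(k)) for v in components]
              for r in centered]
    return components, variances, scores

# Hypothetical data: N = 4 vectors in k = 2 dimensions, lying close to the line y = x.
data = [[2.0, 1.9], [0.5, 0.6], [-0.4, -0.5], [-2.1, -2.0]]
components, variances, scores = pca(data, c=2)
print(components[0], variances[0] / sum(variances))
```

With c = 1 the same call keeps only the first column of `scores`, which is the reduced data set the slide refers to.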
Correspondence analysis • CA = PCA for categorical variables • Example: dataset X with 27 dog breeds and 7 categorical variables, coded as indicator columns (Function: C = company, H = hunt, U = utility)

Name         Height     Weight     Speed      Intelligence  Affection  Aggressivity  Function
             -  +  ++   -  +  ++   -  +  ++   -  +  ++      -  +       -  +          C  H  U
1. Boxer     0  1  0    0  1  0    0  1  0    0  1  0       0  1       0  1          1  0  0
…
27. Caniche  1  0  0    1  0  0    0  1  0    0  0  1       0  1       1  0          1  0  0

• CA studies the dependence between 2 categorical variables, here Height and Function • It works on the crosstable N, given here with its marginals, in numbers (left) and in general notation (right)

Height/Function   C    H    U   | total        C     H     U    | total
-                 6    1    0   | 7            n11   n12   n13  | n1+
+                 3    2    0   | 5            n21   n22   n23  | n2+
++                1    6    8   | 15           n31   n32   n33  | n3+
total             10   9    8   | 27           n+1   n+2   n+3  | n

• Rows of the crosstable (categories of the first variable) are seen as distributions over the categories of the second variable (dimension = number of categories of the second variable) • Distance between points (distributions): mutual information (KL distance) • Projection onto the subspace that preserves most of the “variability” • Each category
Correspondence analysis • Divide each row by its total

Height/Function   C      H      U     | total
-                 6/7    1/7    0/7   | 7        n11/n1+   n12/n1+   n13/n1+
+                 3/5    2/5    0/5   | 5        n21/n2+   n22/n2+   n23/n2+
++                1/15   6/15   8/15  | 15       n31/n3+   n32/n3+   n33/n3+

• Each row becomes a point in probability space over the categories of the second variable (the conditional distribution given the category of the first variable) • Number of points = number of categories of the first variable = m • Dimension of the space = number of categories of the second variable = l • Distance between points (probabilities): weighted Euclidean distance, low when the variables are independent (the data can be transformed so that the usual Euclidean distance applies)
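The row profiles and the weighted Euclidean distance can be checked with a few lines of Python. The sketch below assumes the weighting is the usual chi-square one, where each squared coordinate difference is divided by the relative frequency of its column (n+j / n). The slide does not spell the weights out, so treat that choice and the function name as assumptions.

```python
from fractions import Fraction

# Crosstable Height (rows: -, +, ++) by Function (columns: C, H, U) from the slides.
N = [[6, 1, 0],
     [3, 2, 0],
     [1, 6, 8]]
n = sum(sum(row) for row in N)

# Row profiles: each row divided by its total (conditional distributions).
row_profiles = [[Fraction(x, sum(row)) for x in row] for row in N]

# Column masses n+j / n, used as weights in the distance (assumed chi-square weighting).
col_mass = [Fraction(sum(col), n) for col in zip(*N)]

def chi2_distance_sq(p, q, masses):
    """Squared weighted Euclidean (chi-square) distance between two profiles."""
    return sum((a - b) ** 2 / m for a, b, m in zip(p, q, masses))

d_minus_plus = chi2_distance_sq(row_profiles[0], row_profiles[1], col_mass)
d_minus_big = chi2_distance_sq(row_profiles[0], row_profiles[2], col_mass)
print(float(d_minus_plus), float(d_minus_big))
```

The distance between the "-" and "+" profiles comes out much smaller than the distance between "-" and "++", in line with the slides' remark that small and medium dogs have similar functions.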
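Correspondence analysis • Each row i is considered with weight n_i+/n (its share of the total count) • Measure of total variability: the inertia I, the weighted sum of the squared distances of the rows to the centroid g (the average profile); I equals the chi-square statistic divided by n, so it measures the dependence between the two variables • CA visualizes the cells that contribute most to the dependence: if a cell n_ih has an outstanding value, then both row i and column h lie far from g in the same direction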
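The link between the weighted variability and the chi-square statistic can be verified numerically on the Height/Function crosstable. This is a sketch under the standard definitions (row weights n_i+/n, centroid equal to the column marginals, chi-square distance to the centroid); the variable names are illustrative.

```python
from fractions import Fraction

N = [[6, 1, 0],
     [3, 2, 0],
     [1, 6, 8]]
heights, functions = ["-", "+", "++"], ["C", "H", "U"]
n = sum(sum(row) for row in N)
row_tot = [sum(row) for row in N]
col_tot = [sum(col) for col in zip(*N)]

row_mass = [Fraction(t, n) for t in row_tot]   # weight of each row
centroid = [Fraction(t, n) for t in col_tot]   # average profile g
profiles = [[Fraction(x, t) for x in row] for row, t in zip(N, row_tot)]

# Inertia: weighted sum of squared chi-square distances from each row profile to g.
inertia = sum(w * sum((a - g) ** 2 / g for a, g in zip(p, centroid))
              for w, p in zip(row_mass, profiles))

# Pearson chi-square statistic, cell by cell, with expected counts n_i+ * n_+j / n.
contrib = {}
for i, row in enumerate(N):
    for j, observed in enumerate(row):
        expected = Fraction(row_tot[i] * col_tot[j], n)
        contrib[(heights[i], functions[j])] = (observed - expected) ** 2 / expected
chi2 = sum(contrib.values())

largest_cell = max(contrib, key=contrib.get)
print(float(inertia), float(chi2), largest_cell)
```

The cell with the largest contribution is the one CA would place farthest from g; on this table it is the pair (small height, company function).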
Correspondence analysis • Dimension reduction: project (in the chi-square norm) onto the subspace that preserves most of the variability (dependence) • New variables = linear combinations of the initial ones • As in PCA, the solution is expressed in terms of eigenvalues/eigenvectors computed from N • Each eigenvalue gives the proportion of variability preserved by its axis • There are measures of how well each point is represented in the subspace • There are measures of the contribution of each point/category to determining the optimal subspace • Subspace “meaning”

Height/Function   C      H      U     | total   CA1     CA2
-                 6/7    1/7    0/7   | 7       1.10    -.92     n11/n1+   n12/n1+   n13/n1+
+                 3/5    2/5    0/5   | 5       0.85    1.23     n21/n2+   n22/n2+   n23/n2+
++                1/15   6/15   8/15  | 15      -0.84   0.02     n31/n3+   n32/n3+   n33/n3+

• Close CA1 values (with a good representation of the points in the subspace) mean similar categories, so similar categories of the first variable (Height) are easy to identify in the low-dimensional plot (the - and + heights have similar functions) • If two “identical categories” are joined, the chi-square distances do not change
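A first CA axis for the Height/Function crosstable can be sketched without any numerical library. The code assumes the standard formulation: build the matrix of standardized residuals S, take the leading eigenvector of S·Sᵀ (here by power iteration), and rescale it by the row masses to get row coordinates. The scaling convention (principal coordinates) is an assumption, so the numbers need not match the CA1 column on the slide, which may come from a different scaling or software.

```python
from math import sqrt

N = [[6, 1, 0],
     [3, 2, 0],
     [1, 6, 8]]
n = sum(sum(row) for row in N)
r = [sum(row) / n for row in N]          # row masses
c = [sum(col) / n for col in zip(*N)]    # column masses
rows, cols = range(len(r)), range(len(c))

# Standardized residuals: (p_ij - r_i c_j) / sqrt(r_i c_j).
S = [[(N[i][j] / n - r[i] * c[j]) / sqrt(r[i] * c[j]) for j in cols] for i in rows]
# Symmetric matrix S S^T; its eigenvalues are the inertias of the CA axes.
A = [[sum(S[i][k] * S[j][k] for k in cols) for j in rows] for i in rows]

def leading_eigenpair(matrix, iterations=200):
    """Power iteration; assumes (1, 0, ..., 0) is not orthogonal to the answer."""
    size = len(matrix)
    v = [1.0] + [0.0] * (size - 1)
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(size)) for a in range(size)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[a] * sum(matrix[a][b] * v[b] for b in range(size))
              for a in range(size))
    return lam, v

total_inertia = sum(A[i][i] for i in rows)   # trace = chi-square / n
lam1, u = leading_eigenpair(A)
if u[0] < 0:                                 # the sign of an axis is arbitrary
    u = [-x for x in u]

# Row coordinates on the first axis (principal-coordinate scaling, an assumption).
ca1 = [sqrt(lam1) * u[i] / sqrt(r[i]) for i in rows]
share = lam1 / total_inertia                 # proportion of variability preserved
print([round(x, 2) for x in ca1], round(share, 3))
```

The "-" and "+" heights land on the same side of the axis and "++" on the opposite side, which is the qualitative pattern reported on the slide.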
Repeat everything for transpose(N)

Height/Function   C    H    U        Function/Height   -      +      ++     CA1     CA2
-                 6    1    0        C                 6/10   3/10   1/10   1.04    -.10
+                 3    2    0        H                 1/9    2/9    6/9    -0.32   .43
++                1    6    8        U                 0/8    0/8    8/8    -0.94   -.37
total             10   9    8

• Each column becomes a point in probability space over the categories of the first variable (the conditional distribution given the category of the second variable) • Number of points = number of categories of the second variable = l • Dimension of the space = number of categories of the first variable = m • CA: new variables = linear combinations of the initial ones that best preserve the dependence • Close CA1 values (with a good representation of the points in the subspace) mean similar categories, so similar categories of the second variable (Function) are easy to identify in the low-dimensional plot (the U and H functions have similar heights)
Overlap the two plots

Function/Height   -      +      ++     CA1     CA2
C                 6/10   3/10   1/10   1.04    -.10
H                 1/9    2/9    6/9    -0.32   .43
U                 0/8    0/8    8/8    -0.94   -.37
CA1               1.10   .85    -.84
CA2               -.92   1.23   0.02

• CA values in one plot are (up to a scale factor) weighted means of the CA values in the other plot, with weights given by the conditional probabilities: 1.04 = (6/10*1.10 + 3/10*0.85 + 1/10*(-0.84)) * constant • Include “standard coordinates” = virtual rows concentrated on one column: (1 0 0), (0 1 0), (0 0 1) • Categories of different variables that lie close to the extremes of the axes and to each other are highly correlated: utility dogs are big; company dogs are small (see also gene shaving classification)
If the rows and columns are reordered by their first CA coordinate, cells with high values generally end up on the diagonal

Height/Function   C      H      U     | CA1
-                 6      1      0     | 1.10
+                 3      2      0     | .85
++                1      6      8     | -.84
CA1               1.04   -.32   -.94
Extension • Treat X as N (a crosstable for two hypothetical variables with 27 and 6 categories, respectively)

Name          Height     Function   Row divided by its total         CA1   CA2
              -  +  ++   C  H  U    -     +     ++    C     H    U
1. Boxer      0  1  0    1  0  0    0     1/2   0     1/2   0    0     .45   .88
…
27. Caniche   1  0  0    1  0  0    1/2   0     0     1/2   0    0     .91   .02
CA1                                 1.2   .85   -.84  1.04  -.32 -.4

• The plot from transpose(X) is identical to the overlapped plots above • The new plot from X has an extra point for each dog breed • Relationship between Height/Function and dog breed: the Caniche is a small company dog
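The indicator coding and the row profiles shown for the Boxer and the Caniche can be reproduced as follows. Only the two breeds visible on the slide are used; the dictionary layout and function names are illustrative assumptions, not part of the original material.

```python
from fractions import Fraction

# Categories per variable, in the column order used on the slide.
CATEGORIES = {"Height": ["-", "+", "++"], "Function": ["C", "H", "U"]}

# The two breeds whose rows are visible on the slide.
dogs = {
    "Boxer": {"Height": "+", "Function": "C"},
    "Caniche": {"Height": "-", "Function": "C"},
}

def indicator_row(record):
    """One 0/1 column per category; exactly one 1 per variable."""
    return [1 if record[var] == cat else 0
            for var, cats in CATEGORIES.items() for cat in cats]

def profile(row):
    """Row divided by its total, as in the crosstable treatment."""
    total = sum(row)
    return [Fraction(x, total) for x in row]

X = {name: indicator_row(rec) for name, rec in dogs.items()}
profiles = {name: profile(row) for name, row in X.items()}
print(X["Boxer"], [str(p) for p in profiles["Boxer"]])
```

Because every breed has exactly one category per variable, each row of X sums to the number of variables, which is why the nonzero profile entries are all 1/2 here.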
Multiple correspondence analysis • Use the whole X (all the variables) as the crosstable

Name         Height     Weight     Speed      Intelligence  Affection  Aggressivity  Function
             -  +  ++   -  +  ++   -  +  ++   -  +  ++      -  +       -  +          C  H  U
1. Boxer     0  1  0    0  1  0    0  1  0    0  1  0       0  1       0  1          1  0  0
…
27. Caniche  1  0  0    1  0  0    0  1  0    0  0  1       0  1       1  0          1  0  0

CA1: .32 .60 -.89 -35 .37 -.34 -.84 .78 .40 -.43
CA2: -1.04 .89 .37 -.81 .29 .46 -.29 .27 .19 -.21

• Discovering association rules (based on correlation): company dogs are small, with high affection; utility dogs are big, fast, aggressive; hunt dogs are very intelligent • Use them for classification
What Is Association Mining? • Association rule mining: • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. • Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93] • Motivation: finding regularities in data • What products were often purchased together? — Beer and diapers?! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents?
Basic Concepts: Frequent Patterns and Association Rules • [Figure labels: Customer buys beer; Customer buys diaper; Customer buys both] • Itemset X = {x1, …, xk} • Find all the rules X → Y with minimum confidence and support • support, s: probability that a transaction contains X ∪ Y • confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

• Let min_support = 50%, min_conf = 50%: • A → C (50%, 66.7%) • C → A (50%, 100%)
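The support and confidence figures on this slide can be recomputed directly from the four transactions. In this sketch the transaction id 40 for the last row is an assumption (the id is missing in the extracted text), and the helper names are illustrative.

```python
from fractions import Fraction

# Transactions from the slide; the id 40 for the last row is assumed.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))

def confidence(x, y):
    """Conditional probability that a transaction containing X also contains Y."""
    return support(x | y) / support(x)

for x, y in [({"A"}, {"C"}), ({"C"}, {"A"})]:
    print(sorted(x), "->", sorted(y),
          f"support={float(support(x | y)):.1%}",
          f"confidence={float(confidence(x, y)):.1%}")
```

Both rules clear the 50% thresholds, matching the (50%, 66.7%) and (50%, 100%) pairs given on the slide.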
Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases • Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested • Method: generate length-(k+1) candidate itemsets from length-k frequent itemsets, then test the candidates against the DB • Challenges: multiple scans of the transaction database; huge number of candidates; tedious workload of support counting for candidates • Construct an FP-tree from the transaction database • For each frequent item, construct its conditional pattern base, and then its conditional FP-tree • Repeat the process on each newly created conditional FP-tree • Stop when the resulting FP-tree is empty or contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
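The generate-and-test loop with Apriori pruning can be sketched as below. This is a compact illustration rather than an optimized implementation (it rescans all transactions at every level, which is exactly the cost the slide lists as a challenge). The function name and the use of an absolute minimum count are assumptions; the transactions are the four from the earlier slide.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frequent itemset: count} for all itemsets seen in >= min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Test step: count each candidate against the database.
        counts = {cand: sum(1 for t in transactions if cand <= t) for cand in current}
        level = {cand: cnt for cand, cnt in counts.items() if cnt >= min_count}
        frequent.update(level)
        k += 1
        # Join step: build length-k candidates from frequent length-(k-1) itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop any candidate that has an infrequent (k-1)-subset.
        current = {cand for cand in candidates
                   if all(frozenset(sub) in level for sub in combinations(cand, k - 1))}
    return frequent

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
result = apriori(db, min_count=2)   # 50% of 4 transactions
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a 50% minimum support only A, B, C and the pair {A, C} survive; no candidate of length 3 is generated because {A, B} and {B, C} are infrequent.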
Association-Based Classification • Several methods for association-based classification • ARCS: quantitative association mining and clustering of association rules (Lent et al’97); it beats C4.5 in (mainly) scalability and also accuracy • Associative classification (Liu et al’98): it mines high-support, high-confidence rules of the form “cond_set => y”, where y is a class label • CAEP (Classification by Aggregating Emerging Patterns) (Dong et al’99) • Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another • Mine EPs based on minimum support and growth rate
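The growth rate used to select emerging patterns can be illustrated on a toy example. The sketch assumes the usual definition (support in the target class divided by support in the other class, with 0/0 taken as 0 and a positive support over zero taken as infinite). The two small transaction sets are hypothetical and serve only to make the ratio concrete.

```python
def support(itemset, transactions):
    """Fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def growth_rate(itemset, from_class, to_class):
    """Ratio of supports from one class to another (assumed standard definition)."""
    s_from = support(itemset, from_class)
    s_to = support(itemset, to_class)
    if s_from == 0:
        return 0.0 if s_to == 0 else float("inf")
    return s_to / s_from

# Hypothetical transactions for two classes.
class_1 = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
class_2 = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]

print(growth_rate({"a", "b"}, class_1, class_2))   # support rises from 1/4 to 3/4
print(growth_rate({"c"}, class_1, class_2))        # support unchanged
```

An itemset such as {a, b}, whose support triples from the first class to the second, would be kept as an emerging pattern under a growth-rate threshold below 3, while {c} would not.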
Table 1. Cell-cycle data as used in analysis • The raw intensity data as obtained from AIS imaging software (Imaging Research, St. Catharines, ON, Canada) were normalized • The normalized data matrix was filtered for genes with positive min-max separation for at least one of the conditions under study (2) • The data were shifted to a positive range by adding the minimum + 1

          alpha0   alpha7   alpha14   alpha21   alpha28   alpha35   …
          (M/G1)   (M/G1)   (G1)      (G1)      (S)       (S)       …
YHR126C   5.81     5.73     6.01      5.48      5.37      5.23      …
YOR066W   5.62     5.81     6.02      5.28      5.02      5.23      …
hxt4      5.78     6.21     6.02      5.5       5.58      5.21      …
PCL9      4.64     5.39     4.89      5.19      4.96      5.62      …
mcm3      5.38     5.8      6.13      5.74      4.52      5.22      …
…

• 800 genes × 73 hybridizations • 4 cell-cycle arrest methods (18 alpha, 24 cdc15, 17 cdc28, 14 elu hybridizations) • Samples from each method were drawn and their cell-cycle phase was classified into 5 classes • A link to a database with information (meaning, functionality, etc.) is provided for each gene
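CA needs non-negative entries, which is the reason for the shift mentioned on this slide. The sketch below reads "adding the minimum + 1" as adding the absolute value of the smallest entry plus 1, so that the smallest shifted value becomes 1. That reading is an assumption, as are the function name and the small matrix of made-up values.

```python
def shift_to_positive(matrix):
    """Add |min| + 1 to every entry (assumed reading of 'adding the minimum + 1')."""
    minimum = min(min(row) for row in matrix)
    offset = abs(minimum) + 1
    return [[value + offset for value in row] for row in matrix]

# Hypothetical normalized values containing negatives.
normalized = [[-2.5, 0.0],
              [1.5, -0.5]]
shifted = shift_to_positive(normalized)
print(shifted)
```

After the shift every entry is at least 1, so row and column totals are positive and the profiles used by CA are well defined.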
Each cell-cycle phase is colored differently: (M/G1), (G1), (S), (G2), (M) • The hybridizations separate according to their cell-cycle phase (one phase = one region of the plot) • The G1 phase is strongly associated with the histone gene cluster • The cdc15-30 hybridization, classified yellow, behaves as green (it is located in the green region) • The positions of cdc15-30, cdc15-70 and cdc15-80 suggest improper phase classification for these samples (a check against the profiles proves this correct)