Correspondence Analysis Ahmed Rebai Center of Biotechnology of Sfax
Correspondence analysis • Introduced by Benzécri (1973) for uncovering and understanding the structure and patterns in contingency-table data. • Involves finding coordinate values that represent the row and column categories in some optimal way
Contingency tables • Table with r rows and c columns
Main idea • Develop simple indices that show the relation between rows and columns • Indices that tell us simultaneously which columns have more weight in a row category and vice versa • Reduce dimensionality, like PCA • Indices are extracted in decreasing order of importance
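Which criteria? • In a contingency table, global independence between the two variables is generally measured by a chi-square statistic ($\chi^2$), calculated as $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} (n_{ij} - E_{ij})^2 / E_{ij}$, where $n_{ij}$ are the observed counts • $E_{ij} = n_{i\cdot} n_{\cdot j} / n$ are the expected counts under independence ($n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals, $n$ is the grand total)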
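As a minimal sketch of the chi-square criterion, the snippet below computes the expected counts and the statistic with NumPy. The 4 x 4 table `N` is a hypothetical example invented for illustration; it does not come from the slides.

```python
import numpy as np

# Hypothetical 4 x 4 contingency table (illustrative counts, not from the slides)
N = np.array([[20, 10,  5,  5],
              [10, 20, 10,  5],
              [ 5, 10, 30, 10],
              [ 5,  5, 10, 25]], dtype=float)

n = N.sum()                                # grand total
row_totals = N.sum(axis=1, keepdims=True)  # shape (r, 1)
col_totals = N.sum(axis=0, keepdims=True)  # shape (1, c)

# Expected counts under independence: E_ij = (row total i) * (column total j) / n
E = row_totals @ col_totals / n

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi2 = ((N - E) ** 2 / E).sum()
print(f"chi-square = {chi2:.3f}")
```

The expected table keeps the same row and column totals as the observed one, which is a quick sanity check on `E`.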
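Decomposition of $\chi^2$ • We have a departure from independence and we want to know why • To find the factors we use the matrix C of dimension (r × c) with elements $c_{ij} = (n_{ij} - E_{ij}) / \sqrt{E_{ij}}$, where $n_{ij}$ are the observed counts and $E_{ij}$ the expected counts under independence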
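A short sketch of how C might be built, assuming its elements are the standardized residuals (observed minus expected, divided by the square root of expected), which is the choice that makes the squared elements sum to the chi-square. The table `N` is again a hypothetical example.

```python
import numpy as np

# Hypothetical 4 x 4 contingency table (illustrative counts, not from the slides)
N = np.array([[20, 10,  5,  5],
              [10, 20, 10,  5],
              [ 5, 10, 30, 10],
              [ 5,  5, 10, 25]], dtype=float)

# Expected counts under independence
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()

# Matrix C of standardized residuals: c_ij = (n_ij - E_ij) / sqrt(E_ij)
C = (N - E) / np.sqrt(E)

# The squared elements of C add up to the chi-square statistic
chi2 = ((N - E) ** 2 / E).sum()
print(np.isclose((C ** 2).sum(), chi2))
```

Large positive entries of C mark cells that are over-represented relative to independence; large negative entries mark under-represented cells.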
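How to find factors? • Singular value decomposition (SVD) of the matrix C, that is, find matrices U, D and V such that $C = U D V^T$ • The columns of U are eigenvectors of $CC^T$ • The columns of V are eigenvectors of $C^TC$ • D is a diagonal matrix of $\sqrt{\lambda_k}$, where $\lambda_k$ are the eigenvalues of $CC^T$ • $k = 1, \dots, \mathrm{Rank}(C)$, with $\mathrm{Rank}(C) \le \min(r-1, c-1)$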
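The decomposition can be sketched with `numpy.linalg.svd`. This is an illustration on a hypothetical table, not the author's own computation; the variable names (`U`, `s`, `Vt`, `lam`) are choices made for this example.

```python
import numpy as np

# Hypothetical 4 x 4 contingency table (illustrative counts, not from the slides)
N = np.array([[20, 10,  5,  5],
              [10, 20, 10,  5],
              [ 5, 10, 30, 10],
              [ 5,  5, 10, 25]], dtype=float)

E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
C = (N - E) / np.sqrt(E)

# SVD: C = U D V^T, with the singular values s on the diagonal of D
U, s, Vt = np.linalg.svd(C, full_matrices=False)
lam = s ** 2  # squared singular values = eigenvalues of C C^T

# The factorization reproduces C
print(np.allclose(U @ np.diag(s) @ Vt, C))

# Eigenvalues of C C^T, sorted in decreasing order, match lam
eig = np.sort(np.linalg.eigvalsh(C @ C.T))[::-1]
print(np.allclose(eig, lam))

# The rank of C is at most min(r - 1, c - 1)
r, c = N.shape
rank = np.linalg.matrix_rank(C)
print(rank, "<=", min(r - 1, c - 1))
```

The last singular value is numerically zero, which is why at most min(r − 1, c − 1) factors carry information.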
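$\mathrm{Tr}(CC^T) = \sum_k \lambda_k = \chi^2 = \sum_{i,j} c_{ij}^2$ • The projections of the rows and the columns are given by the eigenvectors $U_k$ and $V_k$, which are linked by $C V_k = \sqrt{\lambda_k}\, U_k$ and $C^T U_k = \sqrt{\lambda_k}\, V_k$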
How many factors? • The adequacy of the representation by the first two coordinates is measured by the percentage of explained inertia, $(\lambda_1 + \lambda_2) / \sum_k \lambda_k$ • In general, rows are displayed on $(U_1, U_2)$ and columns on $(V_1, V_2)$ • The proximity between row and column points is what is to be interpreted
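A minimal sketch of the explained-inertia calculation, together with a numerical check of the trace identity and of the link between row and column vectors. The table is hypothetical, and the coordinates are the raw singular vectors. CA software typically rescales them by row and column masses, a step that is omitted here.

```python
import numpy as np

# Hypothetical 4 x 4 contingency table (illustrative counts, not from the slides)
N = np.array([[20, 10,  5,  5],
              [10, 20, 10,  5],
              [ 5, 10, 30, 10],
              [ 5,  5, 10, 25]], dtype=float)

E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
C = (N - E) / np.sqrt(E)
chi2 = ((N - E) ** 2 / E).sum()

U, s, Vt = np.linalg.svd(C, full_matrices=False)
V = Vt.T
lam = s ** 2

# Trace identity: Tr(C C^T) = sum of eigenvalues = chi-square
print(np.isclose(np.trace(C @ C.T), lam.sum()), np.isclose(lam.sum(), chi2))

# Link between row and column vectors (C is r x c, V_k has length c):
# C V_k = sqrt(lambda_k) U_k  and  C^T U_k = sqrt(lambda_k) V_k
print(np.allclose(C @ V, U * s), np.allclose(C.T @ U, V * s))

# Share of inertia carried by each factor, and by the first two together
explained = lam / lam.sum()
two_axes = explained[:2].sum()
print(f"first two axes explain {100 * two_axes:.1f}% of the inertia")

# Coordinates for a two-dimensional display: rows on (U1, U2), columns on (V1, V2)
rows_2d = U[:, :2]
cols_2d = V[:, :2]
```

Plotting `rows_2d` and `cols_2d` on the same axes gives the kind of joint display whose proximities are interpreted.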
CA in practice • Proximity of two rows (columns) indicates similar profiles, that is, similar conditional frequency distributions: the two rows (columns) are nearly proportional • The origin is the average on each factor, so a point (row or column) close to the origin has an average profile • Proximity of a row to a column indicates that this row has a particularly important weight in this column (if both are far from the origin)
[Figure: plot labelled "Without Corsica"; visible point labels: Classical bac, Technical bac]
Properties of CA • Allows consideration of dummy variables (called ‘illustrative variables’): additional variables that do not contribute to the construction of the factorial space but can be displayed on it. • With such a representation it is possible to assess the proximity between observations and variables, and between the illustrative variables and the observations.
Tekaia and Yeramian (2006) • 208 predicted proteomes representing the three phylogenetic domains and various lifestyles (hyperthermophiles, thermophiles, psychrophiles and mesophiles, including eukaryotes) • Variables: amino-acid composition of proteomes • Illustrative variables: groups of amino acids (charged, polar, hydrophobic)
Why CA? • To analyze the distribution of species in terms of global properties and discriminated groups • Search for amino-acid signatures in groups of species • Try to understand potential evolutionary trends
Results • First axis (63%) corresponds to GC content (from Mycoplasma (23%) to Streptomyces (72%)) • Second axis (14%) corresponds to optimal growth temperature