Correspondence Analysis

Correspondence Analysis. Ahmed Rebai, Center of Biotechnology of Sfax. Correspondence analysis was introduced by Benzecri (1973) for uncovering and understanding the structure and pattern in data in contingency tables.

Presentation Transcript


  1. Correspondence Analysis • Ahmed Rebai, Center of Biotechnology of Sfax

  2. Correspondence analysis • Introduced by Benzecri (1973) for uncovering and understanding the structure and pattern in data in contingency tables • Involves finding coordinate values which represent the row and column categories in some optimal way

  3. Contingency tables • Table with r rows and c columns

  4. Main idea • Develop simple indices that show the relation between rows and columns • Indices that tell us simultaneously which columns have more weight in a row category and vice versa • Reduce dimensionality, like PCA • Indices are extracted in decreasing order of importance

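  5. Which criterion? • In a contingency table, global independence between the two variables is generally measured by a chi-square statistic ($\chi^2$), calculated as $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} (n_{ij} - E_{ij})^2 / E_{ij}$ • where $n_{ij}$ are the observed counts and $E_{ij} = n_{i\cdot} n_{\cdot j} / n$ are the expected counts under independence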
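The chi-square criterion can be sketched in a few lines of plain Python. This is only an illustration: the function names `expected_counts` and `chi_square` and the 2 x 2 table are hypothetical and do not come from the presentation; the expected counts use the usual formula (row total times column total, divided by the grand total).

```python
def expected_counts(table):
    """Expected counts under independence: E_ij = row_total_i * col_total_j / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))


# Hypothetical 2 x 2 table of counts (illustrative numbers only).
table = [[10, 20],
         [20, 10]]
chi2 = chi_square(table)
print(round(chi2, 4))  # every expected count is 15, so chi2 = 4 * 25 / 15
```

A table whose rows are exactly proportional, such as `[[10, 20], [20, 40]]`, gives a chi-square of zero, which is the "no departure from independence" case.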

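  6. Decomposition of $\chi^2$ • We have a departure from independence and we want to know why • To find the factors we use the matrix C of dimension (r × c) with elements $c_{ij} = (n_{ij} - E_{ij}) / \sqrt{E_{ij}}$, where $n_{ij}$ is the observed count in cell (i, j)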

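  7. How to find factors? • Singular value decomposition (SVD) of matrix C, that is, find matrices U, D and V such that $C = U D V^T$ • The columns of U are eigenvectors of $C C^T$ • The columns of V are eigenvectors of $C^T C$ • D is a diagonal matrix of $\sqrt{\lambda_k}$, where $\lambda_k$ are the eigenvalues of $C C^T$ • $k = 1, \dots, \mathrm{rank}(C)$, with $\mathrm{rank}(C) \le \min(r-1, c-1)$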
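The decomposition on this slide can be checked numerically with NumPy. The sketch below assumes that the elements of C are the standardized residuals (observed minus expected, divided by the square root of expected), which is the choice consistent with Tr(CCT) equalling the chi-square; the 3 x 4 table is hypothetical.

```python
import numpy as np

# Hypothetical 3 x 4 contingency table (illustrative counts only).
N = np.array([[20, 10,  5, 15],
              [10, 25, 10,  5],
              [ 5, 10, 30, 15]], dtype=float)

n = N.sum()
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / n  # expected counts under independence
C = (N - E) / np.sqrt(E)                        # assumed elements c_ij

# Thin SVD: C = U @ diag(d) @ Vt, singular values d in decreasing order.
U, d, Vt = np.linalg.svd(C, full_matrices=False)
lam = d ** 2                                    # eigenvalues of C C^T

# Columns of U are eigenvectors of C C^T; columns of V are eigenvectors of C^T C.
assert np.allclose(C @ C.T @ U, U * lam)
assert np.allclose(C.T @ C @ Vt.T, Vt.T * lam)

# Numerical rank: count singular values that are not negligible.
rank = int(np.sum(d > 1e-10 * d[0]))
print(rank, min(N.shape[0] - 1, N.shape[1] - 1))
```

The rank cannot exceed min(r-1, c-1) because the rows and columns of C each satisfy one linear constraint inherited from the table margins.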

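  8. $\mathrm{Tr}(C C^T) = \sum_k \lambda_k = \chi^2 = \sum_i \sum_j c_{ij}^2$ • The projections of the rows and the columns are given by the eigenvectors $U_k$ and $V_k$, linked by $C^T U_k = \sqrt{\lambda_k} V_k$ and $C V_k = \sqrt{\lambda_k} U_k$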
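A short numerical check of the trace identity and of the link between row and column eigenvectors, again as a sketch. It assumes C holds the standardized residuals and uses a hypothetical table; the square-root-of-eigenvalue factors in the two relations follow from the SVD itself.

```python
import numpy as np

# Hypothetical 3 x 4 contingency table (illustrative counts only).
N = np.array([[20, 10,  5, 15],
              [10, 25, 10,  5],
              [ 5, 10, 30, 15]], dtype=float)

n = N.sum()
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / n  # expected counts under independence
C = (N - E) / np.sqrt(E)                        # assumed elements c_ij
chi2 = ((N - E) ** 2 / E).sum()                 # chi-square statistic

U, d, Vt = np.linalg.svd(C, full_matrices=False)
V = Vt.T
lam = d ** 2                                    # eigenvalues of C C^T

# Trace identity: Tr(C C^T) = sum of eigenvalues = chi-square = sum of c_ij^2.
total = np.trace(C @ C.T)
assert np.isclose(total, lam.sum())
assert np.isclose(total, chi2)
assert np.isclose(total, (C ** 2).sum())

# Relations between row and column eigenvectors (column k scaled by d[k]).
assert np.allclose(C @ V, U * d)    # C V_k   = sqrt(lam_k) U_k
assert np.allclose(C.T @ U, V * d)  # C^T U_k = sqrt(lam_k) V_k
```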

  9. How many factors? • The adequacy of the representation by the first two coordinates is measured by the % of explained inertia $(\lambda_1 + \lambda_2) / \sum_k \lambda_k$ • In general, rows are displayed on $(U_1, U_2)$ and columns on $(V_1, V_2)$ • The proximity between row and column points is to be interpreted
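The share of inertia carried by the first two axes can be computed directly from the singular values. The following is a minimal sketch on a hypothetical 4 x 4 table, assuming C holds the standardized residuals; the variable names are illustrative.

```python
import numpy as np

# Hypothetical 4 x 4 contingency table (illustrative counts only).
N = np.array([[30, 10,  5,  5],
              [10, 25, 10,  5],
              [ 5, 10, 30, 10],
              [ 5,  5, 10, 20]], dtype=float)

n = N.sum()
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / n  # expected counts under independence
C = (N - E) / np.sqrt(E)                        # assumed elements c_ij

d = np.linalg.svd(C, compute_uv=False)          # singular values, decreasing
lam = d ** 2                                    # inertia carried by each axis
explained = lam / lam.sum()                     # share of total inertia per axis
two_axis_share = explained[:2].sum()            # (lam_1 + lam_2) / sum_k lam_k

print(np.round(100 * explained, 1))
print(round(100 * two_axis_share, 1))
```

A two-axis share close to 100% suggests the planar display loses little; a low share suggests looking at further axes before interpreting proximities.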

  10. CA in practice • Proximity of two rows (columns) indicates a similar profile, that is, a similar conditional frequency distribution: the two rows (columns) are proportional • The origin is the average of the factors, so a point (row or column) close to the origin indicates an average profile • Proximity of a row to a column indicates that this row has a particularly important weight in this column (if far from the origin)
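The notion of a profile can be made concrete with a small sketch. The table below is hypothetical, and its second row is deliberately twice the first, so the two rows have the same conditional frequency distribution. The average profile, which corresponds to the origin of the display, is the mass-weighted mean of the row profiles.

```python
import numpy as np

# Hypothetical table; row 1 is exactly twice row 0 (proportional rows).
N = np.array([[10, 20, 30],
              [20, 40, 60],
              [30, 10, 20]], dtype=float)

# Row profiles: each row divided by its total (conditional frequencies).
row_profiles = N / N.sum(axis=1, keepdims=True)

# Average profile: column totals divided by the grand total.
average_profile = N.sum(axis=0) / N.sum()

# Proportional rows share the same profile.
assert np.allclose(row_profiles[0], row_profiles[1])

# The average profile is the mean of the row profiles weighted by row masses.
row_masses = N.sum(axis=1) / N.sum()
assert np.allclose(row_masses @ row_profiles, average_profile)

print(np.round(row_profiles, 3))
print(np.round(average_profile, 3))
```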

  11. A first example: French Bac

  12. Eigenvalues

  13. With Corsica

  14. Without Corsica • Plot labels: Classical bac, Technical bac

  15. Coefficients for regions

  16. Coefficients for Bac Type

  17. Properties of CA • Allows consideration of dummy variables (called ‘illustrative variables’) as additional variables which do not contribute to the construction of the factorial space, but can be displayed on it • With such a representation it is possible to determine the proximity between observations and variables, and between illustrative variables and observations
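One common way to display an illustrative (supplementary) point is to project its profile onto axes computed from the active table only. The sketch below does this for a supplementary row, using the standard principal-coordinate convention of CA, which may differ in scaling from the plots in this presentation. The table, the extra row and the helper `project_supplementary_row` are all hypothetical. Supplementary columns can be handled the same way by exchanging the roles of rows and columns.

```python
import numpy as np

# Hypothetical active table (illustrative counts only).
N = np.array([[20, 10,  5, 15],
              [10, 25, 10,  5],
              [ 5, 10, 30, 15]], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)  # row masses
c = P.sum(axis=0)  # column masses

# Standardized residuals of the active table (equal to C divided by sqrt(n)).
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)
V = Vt.T

# Principal coordinates of the active rows.
F = (U * s) / np.sqrt(r)[:, None]


def project_supplementary_row(counts):
    """Project a row of counts onto the existing axes without refitting them."""
    profile = np.asarray(counts, dtype=float)
    profile = profile / profile.sum()
    return ((profile - c) / np.sqrt(c)) @ V


# Sanity check: an active row projected this way lands on its own coordinates.
assert np.allclose(project_supplementary_row(N[0]), F[0])

# A hypothetical illustrative row; it has no influence on U, s or V.
extra = project_supplementary_row([8, 12, 20, 10])
print(np.round(extra[:2], 3))
```

Because the axes come from the active table alone, adding or removing illustrative rows never changes the positions of the active points.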

  18. Tekaia and Yeramian (2006) • 208 predicted proteomes representing the three phylogenetic domains and various lifestyles (hyperthermophiles, thermophiles, psychrophiles and mesophiles, including eukaryotes) • Variables: amino-acid composition of proteomes • Illustrative variables: groups of amino acids (charged, polar, hydrophobic)

  19. Why CA? • To analyze the distribution of species in terms of global properties and to discriminate groups • Search for amino-acid signatures in groups of species • Try to understand potential evolutionary trends

  20. Results • First axis (63%) corresponds to GC content (Mycoplasma (23%) to Streptomyces (72%)) • Second axis (14%) corresponds to optimal growth temperature
