Using EP:NG Patrick Kemmeren http://www.ebi.ac.uk/expressionprofiler
Aims • Demonstrate the use of the following components in EP:NG: • Data Selection • Data Transformation • Missing Value Imputation • Principal Components Analysis • Hierarchical Clustering • Clustering Comparison … • … for the purposes of microarray data analysis (here, tumor line classification) … • … by examining the following paper: M. Crescenzi and A. Giuliani, The main biological determinants of tumor line taxonomy elucidated by a principal component analysis of microarray data, FEBS Letters 507 (2001) 114-118.
Overview • PCA: Principal Component Analysis • Unsupervised approach • Reduces complexity of data by reducing its dimensionality • Computes a new, smaller set of uncorrelated variables that best represent the original data • Orientation by Genes • Genes are the statistical variables, samples (cell lines) are the observations • Covariance matrix: records the covariance of each gene with every other gene, computed over the samples.
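As a worked form of the covariance bullet above: write x_{ij} for the log ratio of gene j in sample i, with n samples. The symbols and the 1/(n-1) normalization are assumptions made for this illustration; the slides do not state which normalization EP:NG uses.

```latex
\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij},
\qquad
C_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right)
```

The diagonal entry C_{jj} is the variance of gene j, and the principal components are the eigenvectors of the matrix C.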
PCA (cont.) • Principal Components • A principal component is a mathematical entity, computed from the data, equivalent to a characteristic vector (eigenvector) of the covariance matrix • In other words, PCA rotates the original coordinate axes to find the directions of maximum variance in the scatter of points • Summary • Each principal component has an eigenvalue (characteristic value) of the covariance matrix – the higher the eigenvalue, the more variability in the dataset that component describes • The first few components can thus describe a large proportion of the variance in the data
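The relationship between the covariance matrix, its leading eigenvector and the share of variance explained can be sketched in a few lines of dependency-free Python. This is only an illustration: it uses power iteration to find the first component, which is not necessarily how EP:NG computes PCA, and the toy `samples` matrix is invented for the example.

```python
import math

def covariance_matrix(rows):
    """Covariance between variables (columns), computed over samples (rows)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_component(cov, iterations=500):
    """Power iteration: (largest eigenvalue, unit eigenvector) of a covariance matrix.

    Assumes the matrix is non-zero and the start vector is not orthogonal
    to the leading eigenvector.
    """
    p = len(cov)
    v = [1.0 + 0.1 * i for i in range(p)]  # arbitrary, slightly asymmetric start
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for the unit vector v
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v

# Toy data (hypothetical): 5 samples, 2 strongly correlated variables
samples = [[2.0, 1.9], [0.0, 0.1], [-2.0, -2.0], [1.0, 1.1], [-1.0, -1.1]]
cov = covariance_matrix(samples)
eigenvalue, vector = leading_component(cov)

# Total variance is the trace of the covariance matrix
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print(f"first component {vector}, explains {explained:.1%} of the variance")
```

Because the two toy variables move together, the first component points along the diagonal and accounts for nearly all of the variance, which is the sense in which a few components can summarize the data.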
Materials and Methods • Data • http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt • cDNA from 60 cancer cell lines, hybridized to ~8000 individual gene cDNAs • T-matrix: 1416 variables, corresponding to selected genes of highest variance (1375) – log ratios between the gene expression level and a reference mixture • Strategy • Perform PCA • Select the top explaining components • Project the cell lines into component space and cluster them • Choose K (number of clusters) by visual observation + clustering comparison
Component: Data Upload • “Provide URL” • The data matrix URL: http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt • Data Format: Nr of columns after 1 for annotation => 3 • Species: Homo sapiens • >> Data Selection
Component: Data Selection • Select columns: • Only the cell-line columns (format XX:Cell Line) Filter: .*:.* • >> Data Transformation
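To see what the filter `.*:.*` selects, here is a small Python sketch that applies the same pattern to a list of column names. The header below is hypothetical (three annotation columns followed by names in the XX:Cell Line format), and whether EP:NG matches the whole name or only part of it is not stated in the slides; for this particular pattern the two behave the same.

```python
import re

# The filter from the slide: any name containing a colon
CELL_LINE_PATTERN = re.compile(r".*:.*")

def select_columns(header, pattern=CELL_LINE_PATTERN):
    """Return (index, name) pairs for columns whose whole name matches."""
    return [(i, name) for i, name in enumerate(header)
            if pattern.fullmatch(name)]

# Hypothetical header: annotation columns first, then cell-line columns
header = ["Gene", "Description", "Accession",
          "BR:MCF7", "CNS:SF_268", "ME:LOXIMVI"]
selected = select_columns(header)
print(selected)
```

Only the columns with a tissue prefix and a colon survive, so the annotation columns are dropped before the transformation step.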
Component: Data Transformation • KNN imputation • 10 neighbours • >> Data Transformation: transpose • >> Data Selection • >> Side menu: Ordination
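The idea behind KNN imputation can be sketched as follows: a missing value is replaced by the average of that column over the k most similar rows. This is a minimal illustration under stated assumptions, not EP:NG's implementation. Missing values are represented as `None`, similarity is a Euclidean-style distance over the columns both rows have observed, and the neighbours are averaged without weighting.

```python
import math

def knn_impute(rows, k=10):
    """Fill None entries with the mean of that column over the k nearest rows."""
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        # Root mean squared difference, so rows sharing fewer columns
        # are not favoured just because fewer terms are summed
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours must have an observed value in column j
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            if nearest:  # otherwise the entry stays None
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Toy matrix (hypothetical): the first row is missing its third value
matrix = [[1.0, 2.0, None],
          [1.1, 2.1, 3.0],
          [0.9, 1.9, 3.2],
          [9.0, 9.0, 9.0]]
filled = knn_impute(matrix, k=2)
print(filled[0])
```

With k=2 the distant fourth row is ignored and the gap is filled from the two rows that resemble the first one. The default of 10 neighbours mirrors the setting on the slide.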
Component: Ordination • Analysis Options: • Principal Components • Save 5 eigenvalues • Output: Graphs of Eigenvalues • Output: Summary and Eigenvalues, Arrays and Genes Co-ordinates • >> Output Display • Examine outputs… • Save the rows (cell lines) co-ordinates (keep top 5 eigenvalues) on the local hard drive (using original column annotations as row annotations here). • Import it into Excel (paste the original column annotations (cell lines)). • Save this file as tab-delimited and upload it again.
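The Excel step above (attach the cell-line annotations to the co-ordinates, then save as tab-delimited) could also be done with a short script. The sketch below is one possible way, assuming the annotations and co-ordinates are already in memory; the function name, the `CellLine` header and the four-decimal formatting are choices made for this example, not an EP:NG requirement.

```python
import csv
import io

def write_coordinates(annotations, coordinates, out):
    """Write one row per cell line: its annotation, then its component co-ordinates."""
    if len(annotations) != len(coordinates):
        raise ValueError("need exactly one annotation per row of co-ordinates")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    n_components = len(coordinates[0])
    writer.writerow(["CellLine"] + [f"Comp{i + 1}" for i in range(n_components)])
    for name, coords in zip(annotations, coordinates):
        writer.writerow([name] + [f"{c:.4f}" for c in coords])

# Hypothetical values, written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_coordinates(["BR:MCF7", "ME:LOXIMVI"],
                  [[0.5, -1.25], [-0.75, 2.0]],
                  buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Passing an open file handle instead of the buffer would produce a file ready to upload again.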
Component: Hierarchical Clustering • Cluster the cell lines (now in the 5-component space) • Euclidean Distance • Average Linkage • >> Output Display • How many clusters can you see? • Try to zoom in
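Average-linkage clustering with Euclidean distance repeatedly merges the two clusters whose members are, on average, closest to each other. The following is a naive sketch of that idea in plain Python; it recomputes all distances at every step, which is acceptable for a few dozen cell lines but is not meant to reflect how EP:NG implements it. The 2-D toy points stand in for cell lines in component space.

```python
import math
from itertools import combinations

def average_linkage(points, n_clusters):
    """Agglomerative clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean Euclidean distance over all cross-cluster pairs of points
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Pick the closest pair of clusters (a < b by construction)
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy points (hypothetical)
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three = average_linkage(points, 3)
two = average_linkage(points, 2)
print(three, two)
```

Cutting the merge sequence at different cluster counts corresponds to cutting the dendrogram at different heights, which is what the visual inspection on the slide asks for.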
Components: K-Groups Clustering, Clustering Comparison • Cluster the tumour cell lines in the components with the K-means algorithm several times… try K=10, K=6, K=5, K=4 • Run Clustering Comparison several times • Which K seems most fitting? • An automated method for this process is being developed
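Running K-means for several values of K and comparing the resulting partitions can be sketched as below. Two caveats: the deterministic farthest-point initialisation is a choice made so the example is reproducible, and the Rand index is used here only as a simple stand-in for a partition-agreement score. It is not the method implemented in EP:NG's Clustering Comparison component.

```python
import math
from itertools import combinations

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with farthest-point initialisation; returns labels."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two partitions agree (1.0 = identical)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

# Toy points (hypothetical): three well-separated pairs
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
labels_k3 = kmeans(points, 3)
labels_k2 = kmeans(points, 2)
reference = [0, 0, 1, 1, 2, 2]  # e.g. a cut of the hierarchical tree
print(rand_index(labels_k3, reference), rand_index(labels_k2, reference))
```

In this toy case K=3 reproduces the reference partition exactly while K=2 agrees on fewer pairs, so the agreement score points to K=3 as the better fit.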
Obtaining genes strongly correlated with components • From the PCA results screen, import the columns (genes) co-ordinates into Excel • Sort (Ascending) on the first component column (Comp1) • What are the top genes there?
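The sorting step can be reproduced outside Excel. In the sketch below the gene names and co-ordinates are invented for illustration. Note that an ascending sort lists the most negative co-ordinates first, so the genes at both ends of the sorted list are the ones with the largest absolute co-ordinates on the component.

```python
def rank_genes(gene_coords, component=0, top=3):
    """Sort genes ascending on one component; return (most negative, most positive)."""
    ordered = sorted(gene_coords.items(), key=lambda item: item[1][component])
    most_negative = ordered[:top]
    most_positive = ordered[-top:][::-1]  # largest co-ordinate first
    return most_negative, most_positive

# Hypothetical gene co-ordinates: (Comp1, Comp2)
gene_coords = {
    "GENE_A": (-0.9, 0.1),
    "GENE_B": (0.8, -0.2),
    "GENE_C": (0.05, 0.7),
    "GENE_D": (-0.4, 0.3),
}
negative, positive = rank_genes(gene_coords, component=0, top=2)
print([name for name, _ in negative], [name for name, _ in positive])
```

Changing `component` to 1 would rank the same genes on the second component (Comp2).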
Acknowledgements • Original EP Development: • Jaak Vilo (Tartu) • Patrick Kemmeren (Utrecht) • Misha Kapushesky • EP:NG Framework Development: • Patrick Kemmeren (Utrecht) • Misha Kapushesky • Visualization Components (under development): • Steffen Durinck (Leuven) • Clustering Comparison: • Aurora Torrente • Christine Körner (Leipzig) • PCA/COA/BGA: • Aedín Culhane (Cork) • Gene Ordering: • Karlis Freivalds (Riga) • Normalization (under development): • Tom Bogaert (Leuven) • Discussions: • EBI Microarray Informatics Team • Contributors from the open source community EP:NG is an open source project – if you are interested in contributing, testing or just discussing ideas, let us know!