
Using EP:NG



Presentation Transcript


  1. Using EP:NG
     • Patrick Kemmeren
     • http://www.ebi.ac.uk/expressionprofiler

  2. Aims
     • Demonstrate the use of the following components in EP:NG:
       • Data Selection
       • Data Transformation
       • Missing Value Imputation
       • Principal Components Analysis
       • Hierarchical Clustering
       • Clustering Comparison
       • …
     • … for the purposes of microarray data analysis (here, tumor line classification) …
     • … by examining the following paper: M. Crescenzi and A. Giuliani, The main biological determinants of tumor line taxonomy elucidated by a principal component analysis of microarray data, FEBS Letters 507 (2001) 114-118.

  3. Overview
     • PCA: Principal Component Analysis
       • Unsupervised approach
       • Reduces complexity of data by reducing its dimensionality
       • Computes a new, smaller set of uncorrelated variables that best represent the original data
     • Orientation by Genes
       • Genes are statistical variables, samples are statistical samples
       • Covariance matrix: records covariance of one gene vs. another, over samples.

  4. PCA (cont.)
     • Principal Components
       • A principal component is a mathematical entity, computed from the data, equivalent to a characteristic vector (eigenvector) of the covariance matrix
       • In other words, PCA rotates the original coordinate axes to find the directions of maximum variance in the scatter of points
     • Summary
       • Eigenvalues are the characteristic values of the covariance matrix, one per principal component: the higher the eigenvalue, the more of the dataset's variability that component describes
       • The first few components can thus describe a large proportion of the variance in the data
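
To make the covariance-matrix and eigenvector description above concrete, here is a minimal sketch in plain Python. It is not EP:NG's implementation: the toy data, the function names and the use of power iteration to find the leading eigenvector are all assumptions chosen for illustration.

```python
import math

def covariance_matrix(rows):
    """Sample covariance between variables (columns), computed over samples (rows)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v

# Toy data invented for illustration: 5 samples (rows) x 2 variables (columns).
data = [[2.0, 1.9], [0.5, 0.7], [3.1, 3.0], [1.2, 1.0], [2.6, 2.9]]
cov = covariance_matrix(data)
eigenvalue, direction = leading_eigenpair(cov)

# The trace of the covariance matrix is the total variance in the data.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance  # share described by the first component
```

Because the two toy variables are strongly correlated, the first component accounts for nearly all of the variance, which is the situation the summary bullets describe.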

  5. Materials and Methods
     • Data
       • http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt
       • cDNA from 60 cancer cell lines, hybridized to ~8000 individual gene cDNAs
       • T-matrix: 1416 variables, corresponding to selected genes of highest variance (1375); values are log ratios between the gene expression level and a reference mixture
     • Strategy
       • Perform PCA
       • Select the top explaining components
       • Project + cluster cell lines into component space
       • Choose K (number of clusters) by visual observation + clustering comparison

  6. Component: Data Upload
     • “Provide URL”
     • The data matrix URL: http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt
     • Data Format: Nr. of columns after column 1 used for annotation => 3
     • Species: Homo sapiens
     • >> Data Selection

  7. Component: Data Selection
     • Select columns:
       • Only the cell-line columns (format XX:Cell Line)
       • Filter: .*:.*
     • >> Data Transformation
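
The column filter above is a regular expression. The sketch below shows how such a pattern picks out the "XX:Cell Line" columns. The header names are hypothetical, and applying the pattern as a full match is an assumption; EP:NG's own matching semantics may differ.

```python
import re

def select_columns(header, pattern=r".*:.*"):
    """Return the indices of the column names that match the pattern in full."""
    regex = re.compile(pattern)
    return [i for i, name in enumerate(header) if regex.fullmatch(name)]

# Hypothetical header: annotation columns followed by three cell-line columns.
header = ["Gene", "Accession", "Cluster", "BR:MCF7", "CNS:SF-268", "LE:K-562"]
kept = select_columns(header)
kept_names = [header[i] for i in kept]
```

Only names containing a colon survive, so the annotation columns are dropped and the cell-line columns are kept.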

  8. Component: Data Transformation
     • KNN imputation
       • 10 neighbours
     • >> Data Transformation: transpose
     • >> Data Selection
     • >> Side menu: Ordination
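
As a rough illustration of what KNN imputation does, the sketch below fills each missing value with the mean of that column over the k most similar rows. This is one common variant, not necessarily the one EP:NG uses. The distance measure (root mean squared difference over the columns both rows have observed), the function name and the toy matrix are assumptions.

```python
import math

def knn_impute(rows, k=10):
    """Fill None entries with the mean of that column over the k nearest rows."""
    def distance(a, b):
        # Compare only the columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours must have an observed value in column j.
            candidates = sorted(
                (distance(row, other), idx)
                for idx, other in enumerate(rows)
                if idx != i and other[j] is not None
            )
            nearest = [idx for d, idx in candidates[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(rows[idx][j] for idx in nearest) / len(nearest)
    return filled

# Toy matrix invented for illustration; None marks a missing value.
matrix = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [0.9, 1.8, 2.9],
    [9.0, 9.5, 9.9],
]
completed = knn_impute(matrix, k=2)
```

With k=2 the missing entry is filled from the two rows that resemble it, and the distant fourth row is ignored. The slide's setting corresponds to k=10.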

  9. Component: Ordination
     • Analysis Options:
       • Principal Components
       • Save 5 eigenvalues
       • Output: Graphs of Eigenvalues
       • Output: Summary and Eigenvalues, Arrays and Genes Co-ordinates
     • >> Output Display
     • Examine outputs…
     • Save the row (cell line) co-ordinates, keeping the top 5 eigenvalues, to the local hard drive (the original column annotations serve as row annotations here).
     • Import the file into Excel and paste in the original column annotations (cell lines).
     • Save this file as tab-delimited and upload it again.
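
The Excel round trip above amounts to joining the cell-line labels to their component co-ordinates and saving the result as tab-delimited text. A minimal sketch of that step follows. The labels, the co-ordinate values and the "CellLine" header are hypothetical, and the exact file layout EP:NG expects on upload is not specified here.

```python
import csv
import io

def write_coordinates(labels, coordinates, handle):
    """Write one tab-delimited row per cell line: label, then its co-ordinates."""
    if len(labels) != len(coordinates):
        raise ValueError("need exactly one label per row of co-ordinates")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    n_components = len(coordinates[0])
    writer.writerow(["CellLine"] + [f"Comp{i + 1}" for i in range(n_components)])
    for label, coords in zip(labels, coordinates):
        writer.writerow([label] + [f"{c:.4f}" for c in coords])

# Hypothetical labels and co-ordinates in a 5-component space.
labels = ["BR:MCF7", "CNS:SF-268"]
coordinates = [[0.12, -1.5, 0.3, 0.0, 2.25], [1.0, 0.4, -0.2, 0.7, -0.05]]

buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
write_coordinates(labels, coordinates, buffer)
text = buffer.getvalue()
```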

  10. Component: Hierarchical Clustering
     • Cluster the cell lines (now in the 5-component space)
       • Euclidean Distance
       • Average Linkage
     • >> Output Display
     • How many clusters can you see?
     • Try to zoom in
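
For readers who want to see what average linkage with Euclidean distance does, here is a naive agglomerative sketch. It is far slower than a production implementation and is not EP:NG's code. The toy points are 2-dimensional for brevity, whereas the slide clusters in a 5-component space, and stopping at a fixed number of clusters replaces reading the dendrogram by eye.

```python
import math

def average_linkage(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Average linkage: mean Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Toy points invented for illustration: two tight pairs and one outlier.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]
groups = average_linkage(points, n_clusters=3)
```

Each group is a list of indices into `points`; recording the merge distances as well would give the heights of the dendrogram shown in Output Display.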

  11. Components: K-Groups Clustering, Clustering Comparison
     • Cluster the tumour cell lines in the component space with the K-means algorithm several times… try K=10, K=6, K=5, K=4
     • Run Clustering Comparison several times
     • Which K seems most fitting?
     • An automated method for this process is being developed
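
The experiment above can be sketched outside EP:NG as follows. K-means is implemented as plain Lloyd iterations with a seeded random start. The comparison uses the Rand index, which is a stand-in chosen for illustration and is not the algorithm behind EP:NG's Clustering Comparison component. The toy points and function names are assumptions.

```python
import math
import random
from itertools import combinations

def k_means(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def rand_index(a, b):
    """Fraction of point pairs that both clusterings treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Toy points invented for illustration: two well-separated groups of three.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels_k2 = k_means(points, k=2)
labels_k3 = k_means(points, k=3)
agreement = rand_index(labels_k2, labels_k3)
```

Repeating the run with different seeds and values of K, then checking which K gives the most stable agreement, mirrors the manual procedure on the slide.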

  12. Obtaining genes strongly correlated with components
     • From the PCA results screen, import the column (gene) co-ordinates into Excel
     • Sort (Ascending) on the first component column (Comp1)
     • What are the top genes there?
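
The same ranking can be done without a spreadsheet. The sketch below sorts genes by their co-ordinate on one component. The gene names and values are hypothetical.

```python
def top_genes(gene_coordinates, component=0, n=3, ascending=True):
    """Rank genes by their co-ordinate on one component and return the first n."""
    ranked = sorted(
        gene_coordinates.items(),
        key=lambda item: item[1][component],
        reverse=not ascending,
    )
    return [name for name, _ in ranked[:n]]

# Hypothetical gene co-ordinates on Comp1 and Comp2.
gene_coordinates = {
    "GENE_A": [0.8, 0.1],
    "GENE_B": [-1.2, 0.4],
    "GENE_C": [0.1, -0.9],
    "GENE_D": [-0.3, 0.2],
}
most_negative = top_genes(gene_coordinates, component=0, n=2)
most_positive = top_genes(gene_coordinates, component=0, n=2, ascending=False)
```

Since the sign of a principal component is arbitrary, it is usually worth looking at both ends of the sorted list, not only the ascending end.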

  13. Acknowledgements
     • Original EP Development:
       • Jaak Vilo (Tartu)
       • Patrick Kemmeren (Utrecht)
       • Misha Kapushesky
     • EP:NG Framework Development:
       • Patrick Kemmeren (Utrecht)
       • Misha Kapushesky
     • Visualization Components (under development):
       • Steffen Durinck (Leuven)
     • Clustering Comparison:
       • Aurora Torrente
       • Christine Körner (Leipzig)
     • PCA/COA/BGA:
       • Aedín Culhane (Cork)
     • Gene Ordering:
       • Karlis Freivalds (Riga)
     • Normalization (under development):
       • Tom Bogaert (Leuven)
     • Discussions:
       • EBI Microarray Informatics Team
       • Contributors from the open source community
     EP:NG is an open source project. If you are interested in contributing, testing or just discussing ideas, let us know!
