Using EP : NG

Using EP:NG Patrick Kemmeren http://www.ebi.ac.uk/expressionprofiler

Aims • Demonstrate the use of the following components in EP:NG: • Data Selection • Data Transformation • Missing Value Imputation • Principal Components Analysis • Hierarchical Clustering • Clustering Comparison … • … for the purposes of microarray data analysis (here, tumor line classification) … • … by examining the following paper: M. Crescenzi and A. Giuliani, The main biological determinants of tumor line taxonomy elucidated by a principal component analysis of microarray data, FEBS Letters 507(2001) 114-118.

Overview • PCA: Principal Component Analysis • Unsupervised approach • Reduces complexity of data by reducing its dimensionality • Computes a new, smaller set of uncorrelated variables that best represent the original data • Orientation by Genes • Genes are the statistical variables, samples (cell lines) are the observations • Covariance matrix: records the covariance of each gene with every other gene, computed over the samples.

PCA (cont.) • Principal Components • A principal component is a mathematical entity, computed from the data, equivalent to a characteristic vector (eigenvector) of the covariance matrix • In other words, PCA rotates the original coordinate axes so that they point along the directions of maximum variance in the scatter of points • Summary • Eigenvalues are the characteristic values associated with the principal components – the higher the eigenvalue, the more of the variability in the dataset that component describes • The first few components can thus describe a large proportion of the variance in the data
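The relationship between the covariance matrix, its leading eigenvector and the eigenvalue can be illustrated with a small sketch. This is not how EP:NG computes PCA; it uses power iteration on a toy two-variable dataset, and the function names and data are made up for illustration.

```python
import math

def covariance_matrix(samples):
    """Covariance between variables (columns), computed over samples (rows)."""
    n = len(samples)
    p = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_component(cov, iterations=200):
    """Power iteration: approximates the largest eigenvalue and its unit eigenvector.

    Assumes the start vector is not orthogonal to the leading eigenvector
    and that the covariance matrix is not all zeros.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(p))
    return eigenvalue, v

# Toy data (hypothetical): the second variable is exactly twice the first,
# so all of the variance lies along a single direction.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov = covariance_matrix(samples)
eigenvalue, direction = leading_component(cov)

# The total variance is the trace of the covariance matrix.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance
print(direction, eigenvalue, explained)
```

In this toy case the first component points along the line on which the points lie and accounts for all of the variance; with real expression data several components are usually needed.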

Materials and Methods • Data • http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt • cDNA from 60 cancer cell lines, hybridized to ~8000 individual gene cDNAs • T-matrix: 1416 variables, corresponding to selected genes of highest variance (1375) – log ratios between the gene expression level and a reference mixture • Strategy • Perform PCA • Select the top explaining components • Project + cluster cell lines into component space • Choose K (number of clusters) by visual observation + clustering comparison

Component: Data Upload • “Provide URL” • The data matrix URL: http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt • Data Format: Nr of columns after 1 for annotation => 3 • Species: Homo sapiens • >> Data Selection

Component: Data Selection • Select columns: • Only the cell-line columns (format XX:Cell Line) Filter: .*:.* • >> Data Transformation
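The column filter is a regular expression matched against each column name. A minimal sketch of the same selection outside EP:NG, assuming a header row whose annotation and cell-line names are invented here for illustration:

```python
import re

# Hypothetical header: three annotation columns followed by cell-line columns
# in the "XX:Cell Line" format.
header = ["Gene", "Description", "Cluster", "BR:MCF7", "CNS:SF_268", "LE:K_562"]

# ".*:.*" matches any name that contains a colon.
pattern = re.compile(r".*:.*")
selected = [i for i, name in enumerate(header) if pattern.fullmatch(name)]
print(selected)
```

A filter this loose would also select an annotation column whose name happens to contain a colon, so it is worth checking the selected columns before moving on.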

Component: Data Transformation • KNN imputation • 10 neighbours • >> Data Transformation: transpose • >> Data Selection • >> Side menu: Ordination
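For readers unfamiliar with KNN imputation, the idea can be sketched as follows. This is one common variant (row neighbours, unweighted mean) and may differ from what EP:NG implements; the function names and the toy matrix are assumptions.

```python
import math

def row_distance(a, b):
    """Root-mean-square difference over the columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_impute(matrix, k=10):
    """Fill each None with the mean of that column over the k nearest rows."""
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows with an observed value in column j.
            candidates = [(row_distance(row, other), other[j])
                          for m, other in enumerate(matrix)
                          if m != i and other[j] is not None]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Toy matrix (hypothetical): the first row is missing its third value.
matrix = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [9.0, 9.0, 40.0],
]
filled = knn_impute(matrix, k=2)
print(filled[0])
```

With k=2 the missing value is filled from the two rows that resemble the first row, and the distant fourth row is ignored. Transposing afterwards is a one-liner: `transposed = [list(col) for col in zip(*filled)]`.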

Component: Ordination • Analysis Options: • Principal Components • Save 5 eigenvalues • Output: Graphs of Eigenvalues • Output: Summary and Eigenvalues, Arrays and Genes Co-ordinates • >> Output Display • Examine outputs… • Save the rows (cell lines) co-ordinates (keep top 5 eigenvalues) on the local hard drive (using original column annotations as row annotations here). • Import the file into Excel (paste in the original column annotations (cell lines)). • Save this file as tab-delimited and upload it again.

Component: Hierarchical Clustering • Cluster the cell lines (in the 5 component space now) • Euclidean Distance • Average Linkage • >> Output Display • How many clusters can you see? • Try to zoom in
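What average linkage with Euclidean distance does can be sketched in a few lines. This is a naive agglomerative loop for illustration only, not EP:NG's implementation; the points stand in for cell lines in component space and are invented.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean of all pairwise distances
                # between members of the two clusters.
                d = sum(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy coordinates (hypothetical): two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
groups = average_linkage(points, n_clusters=3)
print(groups)
```

Stopping at a fixed number of clusters corresponds to cutting the dendrogram at one height; the display in EP:NG shows the whole tree, which is why the number of clusters is judged by eye.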

Components: K-Groups Clustering, Clustering Comparison • Cluster the tumour cell lines in the components with the K-means algorithm several times… try K=10, K=6, K=5, K=4 • Run Clustering Comparison several times • Which K seems most fitting? • An automated method for this process is being developed
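Trying several values of K can be sketched as below. The sketch uses plain K-means with a deterministic farthest-point initialisation and reports the within-cluster sum of squares for each K. That score is only a stand-in chosen for illustration; it is not the EP:NG Clustering Comparison method, and the data are invented.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100):
    """Plain K-means; returns (labels, within-cluster sum of squares)."""
    # Farthest-point initialisation (an assumption made for reproducibility):
    # start from the first point, then repeatedly add the point that is
    # farthest from all centres chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(squared_distance(p, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    wss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wss

# Toy coordinates (hypothetical): three well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]

wss_by_k = {}
labels_by_k = {}
for k in (2, 3, 4):
    labels_by_k[k], wss_by_k[k] = k_means(points, k)
    print(k, wss_by_k[k])
```

On this toy data the score drops sharply between K=2 and K=3, which is the kind of change that suggests a fitting K. Real data rarely separate this cleanly, so the visual comparison remains useful.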

Obtaining genes strongly correlated with components • From the PCA results screen, import the columns (genes) co-ordinates into Excel • Sort (Ascending) on the first component column (Comp1) • What are the top genes there?
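The same sort can be done outside Excel. A minimal sketch, assuming the gene co-ordinates have been read into a dictionary; the gene names and values are placeholders, not results from the dataset.

```python
# Hypothetical Comp1 co-ordinates keyed by gene name.
gene_coords = {"GENE_A": 0.82, "GENE_B": -0.91, "GENE_C": 0.05, "GENE_D": -0.40}

# Ascending sort on the first component.
ascending = sorted(gene_coords.items(), key=lambda item: item[1])

# Both ends of the sorted list are of interest: strongly negative and
# strongly positive co-ordinates each indicate a strong association.
most_negative = [gene for gene, _ in ascending[:2]]
most_positive = [gene for gene, _ in ascending[-2:][::-1]]
print(most_negative, most_positive)
```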

Acknowledgements • Original EP Development: • Jaak Vilo (Tartu) • Patrick Kemmeren (Utrecht) • Misha Kapushesky • EP:NG Framework Development: • Patrick Kemmeren (Utrecht) • Misha Kapushesky • Visualization Components (under development): • Steffen Durinck (Leuven) • Clustering Comparison: • Aurora Torrente • Christine Körner (Leipzig) • PCA/COA/BGA: • Aedín Culhane (Cork) • Gene Ordering: • Karlis Freivalds (Riga) • Normalization (under development): • Tom Bogaert (Leuven) • Discussions: • EBI Microarray Informatics Team • Contributors from the open source community • EP:NG is an open source project – if you are interested in contributing, testing or just discussing ideas, let us know!
