Targeted Projection Pursuit for Microarray Data Analysis Joe Faith Northumbria University Targeted Projection Pursuit, Joe Faith, Northumbria University, v1.1,
Outline
• Analysing High-Dimensional Array Data
• Dimension-Reduction Techniques
• Targeted Projection Pursuit
• Experimental Results
Array Data
• New array technologies are producing floods of quantitative data:
  • cDNA and oligonucleotide arrays
  • Protein arrays
  • Combinatorial chemistry arrays
  • Tissue arrays
• Typically dozens of samples x thousands of genes (or other attributes)
Array Analysis Tasks
• For classified data (samples of known diagnostic classes, e.g. cancer tumours):
  • Spot clusters in the data
  • Spot outliers
  • Classify new cases into existing classes
  • Build genetic profiles: feature selection, finding markers for particular conditions
• Similar problems arise with time series / sequential data
  • Genome-wide study of transcription and regulation
Array Analysis Techniques
• Many techniques are borrowed from statistics, machine learning and data mining
• They tend to be complicated and ‘opaque’
• Want to find ways to let the experimenter:
  • Visualise / communicate
  • Explore
  • Form hypotheses
Statistical Problems
• The nature of the data presents many statistical problems:
  • Normalisation
  • Control of variance
  • Determining significance
  • Determining reliability
  • ‘High p, low n’ (many more attributes than samples)
• Will ignore all these!
Suppose we had just 2 genes…
[Scatter plot: Gene A against Gene B]
Clusters, classifications, outliers, correlations etc. are then immediately obvious
3D Scatter Plots
[3D scatter plot: axes Gene A, Gene B, Gene C]
‘Virtual Reality’ 3D Scatter Plots (Angelova et al, 2005)
Dimension Reduction Techniques
• But what about p = 4, 5, … 1000?
• Need some way of visualising and exploring higher-dimensional ‘space’ in 2D
Hierarchical Clustering
• Produce a dendrogram based on sample/gene distances, and optimise the order for display
• But a single dimension obscures many relationships
Eisen et al, 1998
Multi-Dimensional Scaling
• Finds the 2D arrangement of the data points that best preserves the distances between them
• E.g. Sammon’s Mapping (Ewing et al, 2001)
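As a rough illustration of what "preserving distances" means, the sketch below scores a 2D layout with Sammon's stress: the squared mismatch between each original and each 2D pairwise distance, weighted by the original distance and normalised by the sum of original distances. The toy samples and both layouts are invented for the example; it evaluates a given layout and does not perform the mapping itself.

```python
import math

def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    n = len(points)
    return {
        (i, j): math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    }

def sammon_stress(high_dim, low_dim):
    """Sammon's stress of a low-dimensional layout.

    0.0 means every pairwise distance is preserved exactly.
    Assumes no two samples coincide in the original space.
    """
    d_high = pairwise_distances(high_dim)
    d_low = pairwise_distances(low_dim)
    total = sum(d_high.values())
    error = sum((d_high[k] - d_low[k]) ** 2 / d_high[k] for k in d_high)
    return error / total

# Hypothetical data: 3 samples x 3 genes, plus two candidate 2D layouts.
samples = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
faithful_layout = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]   # keeps all distances
squashed_layout = [(0.0, 0.0), (5.0, 0.0), (0.0, 1.0)]   # pulls sample 3 inwards

print(sammon_stress(samples, faithful_layout))
print(sammon_stress(samples, squashed_layout))
```

The stress is a single number for the whole layout, so on its own it does not say which samples are misplaced.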
But…
• ‘Curse of dimensionality’ spreads points
• Not projection based, so cannot visualise the position of new, unclassified samples
• No indication of where the layout distorts distances most (particular stresses)
Linear Projection Based Methods
• Find a 2D ‘window’ through which we view the multidimensional data
• The position of the window then contains useful information about, e.g., the relative significance of particular genes
• Principal Components Analysis (Yeung, 2001)
  • Find the view (window position) that best spreads the data
• Projection Pursuit
  • Find projections best suited to particular purposes, such as separating classifications (Lee, 2005)
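A minimal sketch of the 2D ‘window’ idea, assuming the view is stored as a p x 2 matrix with one row per gene: each sample's screen position is its expression vector multiplied by that matrix. The expression values and the view below are made up for illustration.

```python
def project(samples, view):
    """Project n samples x p genes through a p x 2 view matrix to n 2D points."""
    return [
        tuple(sum(x * row[axis] for x, row in zip(sample, view)) for axis in range(2))
        for sample in samples
    ]

# Hypothetical expression levels: 3 samples x 4 genes.
expression = [
    [1.0, 0.0, 2.0, 0.5],
    [0.0, 1.0, 2.0, 0.5],
    [1.0, 1.0, 0.0, 0.5],
]

# One row per gene: how far that gene pushes a sample along x and y.
view = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
    [0.0, 0.0],  # the fourth gene has no influence on this view
]

points = project(expression, view)

# The same view places a new, unclassified sample without refitting anything.
new_point = project([[2.0, 0.0, 0.0, 1.0]], view)[0]
print(points, new_point)
```

PCA and projection pursuit differ only in how they choose the entries of `view`; the projection step itself is the same.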
Grand Tours
• But each of these shows only a single view out of many
• So ‘Grand Tours’ show a video of all possible views (Asimov, 1985)
• Grand Tours in high dimensions are mostly uninformative, and make it hard to interpret the data
Manual Controls
• Try using manual controls to alter projections?
• Controls are ‘opaque’: the user has no intuition about the effect their actions will have
• E.g. XGobi (Cook and Buja, 1997)
Targeted Projection Pursuit
• The intuition:
  • Allow the user to manipulate the view of the data directly
  • The computer then tries to find the view that best matches the ‘target’
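One way to picture the "find the view that best matches the target" step is as a least-squares fit: given where the user wants each sample to appear, search for the p x 2 view whose projection lands closest. The sketch below uses plain batch gradient descent for that search. This is an assumption made for illustration, not necessarily the search used in the TPP tool, and the data, target positions, step count and learning rate are all hypothetical.

```python
def project(samples, view):
    """Project n samples x p genes through a p x 2 view matrix."""
    return [
        [sum(x * row[axis] for x, row in zip(sample, view)) for axis in range(2)]
        for sample in samples
    ]

def squared_error(samples, view, targets):
    """Total squared distance between projected points and their targets."""
    return sum(
        (p - t) ** 2
        for point, target in zip(project(samples, view), targets)
        for p, t in zip(point, target)
    )

def fit_view(samples, targets, steps=200, rate=0.1):
    """Batch gradient descent on half the squared error, starting from a zero view."""
    n_genes = len(samples[0])
    view = [[0.0, 0.0] for _ in range(n_genes)]
    for _ in range(steps):
        points = project(samples, view)  # computed once per step
        for gene in range(n_genes):
            for axis in range(2):
                gradient = sum(
                    (points[i][axis] - targets[i][axis]) * samples[i][gene]
                    for i in range(len(samples))
                )
                view[gene][axis] -= rate * gradient
    return view

# Hypothetical data: 4 samples x 3 genes, two classes.
samples = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
classes = [0, 0, 1, 1]

# The 'target': the user drags class 0 to the left and class 1 to the right.
class_targets = {0: (-1.0, 0.0), 1: (1.0, 0.0)}
targets = [class_targets[c] for c in classes]

start_error = squared_error(samples, [[0.0, 0.0]] * 3, targets)
view = fit_view(samples, targets)
final_error = squared_error(samples, view, targets)
print(start_error, final_error, view)
```

In this toy case the fitted view puts nearly all its weight on the first two genes and almost none on the third, which varies within both classes.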
Quantitative Evaluation
• Task: find a view of a data set that best shows classifications
• Data: publicly available gene expression data sets of diagnosed cancer tissues
• Method: compare the resulting views with those of standard techniques for degree of class separation
Data
• LEUK50: gene expression in two types of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) [Gol99]. 38 cases of B-cell ALL, 9 cases of T-cell ALL and 25 cases of AML. Expression levels of 7219 genes.
• SRBCT50: cDNA microarray analysis of small, round blue cell childhood tumors (SRBCT), including neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt lymphoma (BL; a subset of non-Hodgkin lymphoma) and members of the Ewing family of tumors (EWS). 6567 genes for 83 samples [Kha01].
• NCI50: 60 cell lines from the National Cancer Institute's anticancer drug screen [Sch00]: 9 breast, 5 central nervous system (CNS), 7 colon, 6 leukemia, 8 melanoma, 9 non-small-cell lung carcinoma (NSCLC), 6 ovarian, 2 prostate, 8 renal. 9703 cDNA sequences.
DR Techniques
• TPP: Targeted Projection Pursuit
• PP: Projection Pursuit (computer search for optimal view)
• SAM: Sammon Mapping
• VS: VizStruct, a non-linear projection based on radial coordinates
Metrics
• ILDA: Linear Discriminant Analysis Index (Lee 05)
• 5NN: generalisation performance of a k-nearest-neighbours classifier with k = 5
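Both metrics can be computed directly on the 2D points of a view. The sketch below takes the LDA index as 1 - |W| / |W + B|, with W the within-class and B the between-class scatter of the projected points, and estimates 5NN generalisation by leave-one-out voting. The leave-one-out protocol is an assumption, and the points and labels are invented.

```python
import math
from collections import Counter

def knn_loo_accuracy(points, labels, k=5):
    """Leave-one-out accuracy of a k-nearest-neighbours vote in the 2D view."""
    correct = 0
    for i, point in enumerate(points):
        others = sorted(
            (math.dist(point, other), labels[j])
            for j, other in enumerate(points)
            if j != i
        )
        votes = Counter(label for _, label in others[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(points)

def mean(points):
    """Centroid of a list of 2D points."""
    return (
        sum(x for x, _ in points) / len(points),
        sum(y for _, y in points) / len(points),
    )

def scatter(points, centre):
    """Symmetric 2x2 scatter matrix about centre, stored as (sxx, sxy, syy)."""
    sxx = sum((x - centre[0]) ** 2 for x, _ in points)
    sxy = sum((x - centre[0]) * (y - centre[1]) for x, y in points)
    syy = sum((y - centre[1]) ** 2 for _, y in points)
    return sxx, sxy, syy

def determinant(s):
    """Determinant of a symmetric 2x2 matrix stored as (sxx, sxy, syy)."""
    return s[0] * s[2] - s[1] ** 2

def lda_index(points, labels):
    """1 - |W| / |W + B|; W + B equals the total scatter about the grand mean."""
    total = scatter(points, mean(points))
    within = [0.0, 0.0, 0.0]
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        for idx, value in enumerate(scatter(members, mean(members))):
            within[idx] += value
    return 1.0 - determinant(within) / determinant(total)

# Hypothetical 2D view of eight samples from two classes.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
          (5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
labels = ["ALL"] * 4 + ["AML"] * 4

print(knn_loo_accuracy(points, labels))
print(lda_index(points, labels))
```

An index near 1 means the classes are tight relative to the gaps between them; relabelling the same points so the classes overlap lowers it.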
Results
• Joe Faith, Robert Mintram, Maia Angelova (2006), "Targeted Projection Pursuit for Visualising Gene Expression Data Classifications", Bioinformatics (forthcoming).
• Joe Faith, Michael Brockway (2006), "Targeted Projection Pursuit Tool for Gene Expression Visualisation", Journal of Integrative Biology (forthcoming).
LEUK Data Views
SRBCT Data Views
NCI Data Views
Classifier Construction
• Find a view in which all classes are clearly separated:
  • The components of the projection then define the combination of genes that determines the classification
  • Can order by significance to find a subset of relevant genes
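A minimal sketch of "ordering by significance", assuming a gene's significance is the length of its row in the p x 2 view matrix. That measure is an illustrative choice rather than one the slides specify, and it is only meaningful when genes are on comparable scales. The gene names and weights are hypothetical.

```python
import math

def rank_genes(view, gene_names):
    """Order genes by the length of their row in the p x 2 view matrix."""
    weights = {
        name: math.hypot(row[0], row[1])
        for name, row in zip(gene_names, view)
    }
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)

gene_names = ["gene_A", "gene_B", "gene_C", "gene_D"]  # hypothetical labels
view = [[0.1, 0.0], [-0.6, 0.8], [0.3, -0.4], [0.0, 0.0]]

ranking = rank_genes(view, gene_names)
top_two = [name for name, _ in ranking[:2]]
print(top_two)
```

Keeping only the top-ranked rows of the view gives a smaller set of genes whose combination still drives most of the separation.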
Outlier Detection
• See LEUK data: outliers between ALL/T and ALL/B
• See which potential outliers move with the rest of the samples of their class
Gene Identification
• Separate each class in turn from the remainder of the data. The most significant genes in this separation can then be found
• NCI data:
  • Human melanoma antigen recognized by T-cells (MART-1) mRNA, Chr. 9, selects for melanoma samples [Coulie 94]
  • Desmoplakin gene selects ovarian cancer cases [Adams 06]
• SRBCT data:
  • CD83 selects Burkitt's lymphoma samples [Dudziak 03]
Future Work
• Quantitatively evaluate TPP on other tasks
• Develop tool to:
  • Handle a wider range of data formats
  • Display time series / sequential data
  • Integrate with biological workflows:
    • Standard gene lists
    • Click-through to gene ontologies and DBs
• Work with biologists to trial the tool and get feedback
References
• Angelova,M., Ivanov,D. and Yasrebi,H. (2005) Classification and visualisation of E.coli genes from microarray experiments. Poster presentation, MASAMB05, March, Rothamsted Research, Harpenden, UK. www.rothamsted.bbsrc.ac.uk/bab/masamb/posters/MAngelova.pdf
• Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns. PNAS, 95(25), 14863-14868.
• Ewing,R.M. and Cherry,J.M. (2001) Visualisation of expression clusters using Sammon's non-linear mapping. Bioinformatics, 17, 658-659.
• Yeung,K.Y. and Ruzzo,W.L. (2001) Principal Components Analysis for clustering gene expression data. Bioinformatics, 17(9), 763-774.
• Lee,E.K., Cook,D., Klinke,S. and Lumley,T. (2005) Projection Pursuit for Exploratory Supervised Classification. Journal of Computational and Graphical Statistics, 14(4), 831-846.
• Asimov,D. (1985) The Grand Tour: A Tool for Viewing Multidimensional Data. SIAM Journal on Scientific and Statistical Computing, 6(1), 128-143.
• Cook,D. and Buja,A. (1997) Manual Controls for High-Dimensional Data Projections. Journal of Computational and Graphical Statistics, 6(4), 464-480.
• Golub,T.R., Slonim,D.K., Tamayo,P., Huard,C., Gaasenbeek,M., Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R., Caligiuri,M.A., Bloomfield,C.D. and Lander,E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.
• Scherf,U., Ross,D.T., Waltham,M., Smith,L.H., Lee,J.K., Tanabe,L., Kohn,K.W., Reinhold,W.C., Myers,T.G., Andrews,D.T., Scudiero,D.A., Eisen,M.B., Sausville,E.A., Pommier,Y., Botstein,D., Brown,P.O. and Weinstein,J.N. (2000) A Gene Expression Database for the Molecular Pharmacology of Cancer. Nature Genetics, 24(3), 236-244.
• Khan,J., Wei,J.S., Ringnér,M., Saal,L.H., Ladanyi,M., Westermann,F., Berthold,F., Schwab,M., Antonescu,C.R., Peterson,C. and Meltzer,P.S. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6), 673-679.
• Coulie,P.G. et al (1994) A new gene coding for a differentiation antigen recognized by autologous cytolytic T lymphocytes on HLA-A2 melanomas. J Exp Med, 180(1), 35-42.
• Dudziak et al (2003) Latent Membrane Protein 1 of Epstein-Barr Virus Induces CD83 by the NF-κB Signaling Pathway. J Virol, 77(15), 8290-8298.
• Adams et al (2006) Meningothelial meningioma in a mature cystic teratoma of the ovary. Pathologe, Mar 23.