Large-scale mining of gene expression patterns Paul Pavlidis paul@bioinformatics.ubc.ca VanBUG September 2007
Students: Leon French, Meeta Mistry, Vaneet Lotay • Postdoc: Jesse Gillis • Undergraduates: Raymond Lim, Suzanne Lane • Programmers: Kelsey Hamer, Luke McCarthy
Genome Synapse Injury Stress Disease Aging Development Signal transduction Synaptic modulation
Topics • Connectivity database and analysis • Gene expression data re-use system • Scaling up gene coexpression analysis • Applications and ongoing work
Figure (labels: Age, Genes, Samples). With JJ Mann, V Arango, E Sibille et al.
Figure (labels: Age, Genes, Samples). Data from http://national_databank.mclean.harvard.edu/
Goals for a system • Researchers should be able to put their new expression data in the wider context of previous studies without extraordinary effort. • Move the analysis of multiple microarray data sets from a niche activity to the mainstream. • Integrate other data types and domain-specific information.
Public data sources Coexpression Differential expression
Challenges to comparing data sets • Need to match genes/transcripts across platforms • Data from third parties not always easy to handle • Varying scales, normalization, etc. • Varying data quality • Varying levels of “raw data” available • Selecting appropriate data to compare
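The first and third challenges above (matching genes across platforms, varying scales) can be illustrated with a minimal sketch. It is not the procedure used in the talk, which the slides do not describe. The probe IDs, the probe-to-gene maps, the expression values and the choice to average probes and z-score each gene are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def collapse_to_genes(probe_values, probe_to_gene):
    """Average probes that map to the same gene; drop unmapped probes."""
    by_gene = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            by_gene.setdefault(gene, []).append(values)
    # zip(*rows) walks the samples column by column.
    return {g: [mean(col) for col in zip(*rows)] for g, rows in by_gene.items()}

def standardize(values):
    """Z-score one gene's profile so that platform-specific scale drops out."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]

# Hypothetical platform A: two probes for GRIN1, one unmapped probe.
platform_a = {"a_001": [1.0, 2.0, 3.0], "a_002": [3.0, 4.0, 5.0], "a_003": [7.0, 7.0, 7.0]}
map_a = {"a_001": "GRIN1", "a_002": "GRIN1"}
# Hypothetical platform B: one probe for GRIN1, on a very different scale.
platform_b = {"b_17": [100.0, 200.0, 300.0]}
map_b = {"b_17": "GRIN1"}

genes_a = collapse_to_genes(platform_a, map_a)
genes_b = collapse_to_genes(platform_b, map_b)
shared = set(genes_a) & set(genes_b)  # genes measurable on both platforms
z_a = standardize(genes_a["GRIN1"])
z_b = standardize(genes_b["GRIN1"])
```

After standardization the two GRIN1 profiles coincide even though the raw values differ by two orders of magnitude. Correlation-based comparisons are insensitive to this kind of rescaling anyway, which is one reason coexpression is convenient to compare across studies.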
With Cincinnati Children’s Hospital (D.Glass, M. Barnes et al.)
Which data sets are reasonable to compare? From too general (but lots of power) to very specific (low power): • All mouse data sets • Mouse brain data sets • Mouse neocortex data sets • Mouse neocortex data sets examining stress • Mouse neocortex data sets examining hypoxic stress • Mouse neocortex data sets examining hypoxic stress after 3 hours of hypoxia
• Array designs: 178 • Assays (i.e., chips): 20837 • Coexpression links (probe-level): >100 million
Scaling up analysis of gene coexpression • Genes that are coexpressed tend to have related function • Needed at the same place at the same time • “Guilt by association” • Reasonable to compare across studies. Eisen et al., 1998 PNAS. Figure: two ribosomal protein genes (axes: Expression, Samples).
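"Guilt by association" can be sketched in a few lines: rank candidate functions for a gene by how many of its coexpressed neighbours already carry each annotation. This is only an illustration of the idea, not the prediction method used in the talk. `GENE_X`, the links and the annotation labels are invented.

```python
from collections import Counter

def predict_function(gene, links, annotations):
    """Rank functions by the number of coexpressed neighbours annotated with them."""
    neighbours = {b if a == gene else a for a, b in links if gene in (a, b)}
    tally = Counter(f for n in neighbours for f in annotations.get(n, ()))
    return tally.most_common()

# Hypothetical coexpression links and annotations.
links = {("GENE_X", "RPL3"), ("GENE_X", "RPS5"), ("GENE_X", "GRIN1")}
annotations = {
    "RPL3": {"translation"},
    "RPS5": {"translation"},
    "GRIN1": {"synaptic transmission"},
}
ranking = predict_function("GENE_X", links, annotations)
```

Here two of the three neighbours are annotated with translation, so that label ranks first for `GENE_X`.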
Biological noise • Induced gene expression effects are often small. • Gene expression varies between “replicates” in biologically meaningful ways. • This allows us to repurpose data. Figure axis: Sample type.
Functional coexpression should be (somewhat) generalized • If two genes are coexpressed under one condition, they will probably be coexpressed under at least some other conditions (or data sets). • Coexpression seen “only once” needs special care in interpretation. • We shouldn’t expect coexpression to be perfectly reproducible (for biological and technical reasons). Figure (axes: Correlation, Correlation).
A simple approach: count recurring patterns (Genome Research, June 2004)
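Counting recurring patterns can be sketched as follows: call a "link" between two genes within each data set, then count the number of data sets in which each link appears. The fixed correlation threshold used here is an assumption for illustration, not the criterion from the talk or the cited paper, and the expression values are invented.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0

def links(dataset, threshold=0.9):
    """Gene pairs whose |correlation| within one data set meets the threshold."""
    return {
        (g1, g2)
        for g1, g2 in combinations(sorted(dataset), 2)
        if abs(pearson(dataset[g1], dataset[g2])) >= threshold
    }

def count_votes(datasets, threshold=0.9):
    """Number of data sets in which each gene pair is linked."""
    votes = Counter()
    for dataset in datasets:
        votes.update(links(dataset, threshold))
    return votes

# Toy data: gene -> expression across samples (invented numbers).
study_a = {"RPL3": [1, 2, 3, 4], "RPS5": [2, 4, 6, 8], "GRIN1": [4, 1, 3, 2]}
study_b = {"RPL3": [5, 3, 4, 1], "RPS5": [10, 6, 8, 2], "GRIN1": [1, 2, 2, 1]}
votes = count_votes([study_a, study_b])
recurring = {pair for pair, n in votes.items() if n >= 2}
```

Sorting gene names before pairing gives every pair one canonical orientation, so the same link found in different data sets lands on the same counter key. Using the absolute correlation means negative correlations also count as links.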
Proof of concept analysis • 60 human data sets, 15700 RefSeq genes. • 70% cancer data • 11 million “links” • About 9.7 million different links
GRIN1 ATP6V0A1 PLD3 Allen Brain Institute
Application: analysis of imprinted genes Laurent Journot, INSERM – Universités Montpellier
LYAR interacting proteins. Figure (labels: Correlation p-value, LYAR-interactors). Ewing et al., 2007, Molecular Systems Biology
Vote counting limitations • Weak evidence distributed across data sets will not be picked up. • This example meets strict “vote counting” criteria in only 2/23 data sets. Figure axis: Correlation.
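One standard way to pick up weak evidence spread across data sets is to pool the correlations themselves rather than count threshold crossings. The sketch below uses a fixed-effect combination of Fisher z-transformed correlations. It is a textbook technique offered as an illustration, not necessarily the "global" correlation used in the talk, and the 23 data sets with r = 0.3 and 20 samples each are invented.

```python
from math import atanh, erfc, sqrt, tanh

def combine_correlations(results):
    """Fixed-effect combination of correlations via Fisher's z-transform.

    results: list of (r, n) pairs, one per data set, where r is the
    correlation of a gene pair and n the number of samples (n > 3).
    Returns (combined r, two-sided p-value under a normal approximation).
    """
    weights = [n - 3 for _, n in results]  # inverse variance of each z
    z_mean = sum(w * atanh(r) for w, (r, _) in zip(weights, results)) / sum(weights)
    z_score = z_mean * sqrt(sum(weights))
    p_value = erfc(abs(z_score) / sqrt(2))
    return tanh(z_mean), p_value

# Invented example: a modest correlation seen in every one of 23 data sets.
single_r, single_p = combine_correlations([(0.3, 20)])
pooled_r, pooled_p = combine_correlations([(0.3, 20)] * 23)
```

In this toy case no single data set reaches conventional significance, so a strict vote count would find nothing, while the pooled estimate keeps the same effect size with a far smaller p-value. A fixed-effect model assumes one underlying correlation shared by all data sets, which is a strong assumption for heterogeneous expression studies.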
Figure (axes: Correlation (Global), Support (# of datasets))
Figure (labels: Datasets, Gene pairs). Related work: Zhou XJ et al., Nat. Biotech. 2005
Summary • Reuse of public data: ‘adding value’ • Meta-analysis of coexpression • Some applications • Functional prediction • Candidate identification • Platform evaluation
Ongoing and future work • Applications and analyses • Protein interactions and hubs • Prediction of gene function at the synapse • Differential expression analysis • Regionalization • Mouse models of brain injury • Mouse models of psychosis • Expanding our public database and software: http://www.bioinformatics.ubc.ca/Gemma (web-based tools for biologists; web services coming soon) • Integration with other information sources
Thanks • And to: • NCBI GEO team • Groups who made data available • Collaborators who provided data prior to publication • Conrad Gilliam • Abraham Palmer • Andreas Kottmann • Etienne Sibille Gemma Xiang Wan Kelsey Hamer Luke McCarthy Kiran Keshav Suzanne Lane Meeta Mistry Jesse Gillis Joseph Santos Gozde Cozen David Quigley Anshu Sinha Spiro Pantazatos Wei-Keat Lim Tmm Homin Lee Amy Hsu Jon Sajdak Jie Qin Tzu-Lin Hsaio Collaborators Barclay Morrison Joseph Gogos Michael Hayden Blair Leavitt Tony Blau Panos Papapanou
Answers to FAQs • No, they don’t have to be time course experiments. • Yes, we’re using cDNA as well as Affymetrix etc. • Yes, we see reproducible negative correlations. • Yes, we’re interested in finding differences as well as similarities between data sets. • No, we aren’t necessarily inferring regulatory relationships. • Yes, we know that RNA is just one way of measuring cell state. • No, we don’t have {worm,fly,yeast…} data, but we’d like to.