BRC 2011 Session #4 – “Omics” Data
Session #4 - Outline • Challenges and Opportunities • pathogen datasets; host datasets; integrating pathogen-host datasets • BRC approach to managing “omics” data • mRNAs, ncRNAs, RNAi, proteomics, metabolomics • systems-level analysis • Francis Ouellette – “Interesting Gene List” visualization and analysis & training approaches • Ideas from Systems Biology and DBP interactions • Talking Points • Open discussion
Session #4 – Opportunities Andrew R. Joyce & Bernhard Ø. Palsson, Nature Reviews Molecular Cell Biology 7, 198-210 (March 2006)
Session #4 – Challenges • Approach to “omics” data is somewhat pathogen specific • Host “omics” data is relevant for bacteria, viruses and parasites; less so for vectors • Pathogen “omics” relevant for bacteria, parasites and vectors; less so for viruses • What kind of “omics” data should be supported by BRCs? • Pathogen vs host • mRNA, ncRNA, RNAi, proteomics, metabolomics, lipidomics, others • Raw, minimally processed or highly interpreted (status of NCBI SRA) • Results data and metadata • What should we do with the data? • Make available for download • Make available for browsing • Make available for visualization • Make available for analysis • Current infrastructure is focused largely on genomics • Genome sequence and gene/protein annotations about the pathogens; no infrastructure for host genes (some progress on web services) • Analysis and visualization tools are focused on comparative genomics; few tools for “omics” data analysis and visualization • Standard nomenclature for naming our data sets so that they can be more easily identified and exchanged • How to acquire data sets of sufficient quality and quantity • Reliable sourcing of data, and acquisition from diverse off-site providers in real time • Availability of data and metadata in public resources – lack of standards; difficult to access • Data quality, reliability, and reproducibility • Technology/platform bias and lab-to-lab variations • Noise in data and false positives • Metadata-driven analysis requires manual curation efforts to separate signal from noise • Projection of omics data and its interpretation to closely related organisms • Use of omics data to improve annotations • Moving from data integration to knowledge integration
Session #4 – Opportunities • Currently no organized resources exist for viral pathogen host response/host factor data; these would be very useful for the virology community • Many BRC groups have extensive experience with microarray data and network analysis that could be leveraged • Host data is becoming increasingly relevant for novel drug discovery • Using networks to relate different kinds of data • Ask system-level biological questions that cannot be answered by any one “omics” data type alone • Visualization of multiple layers of information simultaneously. How many tracks can one realistically add before a new approach is needed? • Use omics data to identify/validate/correct gene models and gene functions, regulatory elements, metabolic and signaling pathways, and phenotypes • Development of simple tools and pipelines to enable high-throughput (HT) processing of omics data besides sequencing and transcriptomics
Talking points • Approach to “omics” data management • Raw vs minimally processed vs interpreted results • Facilitating relevant data capture from targeted projects • Capturing other high value related data • Adoption and use of data standards, especially for metadata • Utility of visualization and analysis of IGLs • Support for re-analysis of primary “omics” data • What to do with non-gene/protein-centric “omics” data
Francis Ouellette “Interesting Gene List” visualization and analysis & training approaches
Overview of Systems Biology & DBP Projects • Four systems biology groups funded by NIAID, including: • Systems Virology (Michael Katze group, Univ. Washington) • Influenza H1N1 and H5N1 and SARS Coronavirus • statistical models, algorithms and software, raw and processed gene expression data, and proteomics data • Systems Influenza (Alan Aderem group, Institute for Systems Biology) • various influenza viruses • microarray, mass spectrometry, and lipidomics data • ViPR Driving Biological Projects • Abraham Brass, Mass. General Hospital • Dengue virus host factor database from RNAi screen • Lynn Enquist / Moriah Szpara, Princeton University • Deep sequencing and neuronal microarrays for functional genomic analysis of Herpes Simplex Virus
Proposal for “Omics” Data • “Omics” data management (host) • Project metadata • Assay/experiment metadata • Data analysis metadata • Primary results • Derived results (e.g. “interesting gene lists” (IGLs)) • Add additional related datasets • Visualize IGLs in context of biological pathways and networks • Statistical analysis of pathway sub-network overrepresentation • Re-analysis of primary data using assembled pipeline tools
What level of data should be stored and made accessible • Primary results data • Need to define what is considered “primary” data for each platform • Microarray example: raw image files (.tiff) vs probe intensity values (.cel) • Opportunity for re-processing leading to re-interpretation • Derived/processed results • “Interesting gene lists” from microarray, RNAi, proteomics, and other experimental platforms • “Interesting metabolites lists”
Metadata (MIBBI-compliant) • Project Level Metadata • Hypothesis, rationale, study design, etc. • Publications and links pertaining to the project • Data providers - PI, other key personnel, affiliations, contact information • Assay Level Metadata • Sample source and characteristics of source • Sample type • Source/sample treatment information • Assay details • Data Processing/Analysis Level Metadata • Algorithm(s) used for transforming primary to derived data • Configuration parameters
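The three metadata levels above could be captured in a small record structure. The following is a minimal sketch, not a MIBBI schema: every class and field name (`ProjectMetadata`, `sample_source`, `missing_fields`, and so on) is a hypothetical choice made for illustration, and the example values are placeholders.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class ProjectMetadata:
    # Project level: hypothesis, study design, publications, data providers.
    hypothesis: str
    study_design: str
    publications: List[str] = field(default_factory=list)
    data_providers: List[str] = field(default_factory=list)


@dataclass
class AssayMetadata:
    # Assay level: sample source, sample type, treatment, assay details.
    sample_source: str
    sample_type: str
    treatment: str
    assay_details: str


@dataclass
class AnalysisMetadata:
    # Analysis level: algorithms that turn primary data into derived data,
    # plus their configuration parameters.
    algorithms: List[str]
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class OmicsDataset:
    # One dataset carries all three metadata levels (hypothetical grouping).
    project: ProjectMetadata
    assay: AssayMetadata
    analysis: AnalysisMetadata


def missing_fields(dataset: OmicsDataset) -> List[str]:
    """Return dotted names of text fields that were left empty."""
    missing = []
    for level, values in asdict(dataset).items():
        for name, value in values.items():
            if isinstance(value, str) and not value.strip():
                missing.append(f"{level}.{name}")
    return missing


# Placeholder values only; the treatment field is deliberately left empty.
example = OmicsDataset(
    project=ProjectMetadata("placeholder hypothesis", "placeholder design"),
    assay=AssayMetadata("placeholder source", "placeholder type", "", "placeholder assay"),
    analysis=AnalysisMetadata(["placeholder algorithm"], {"threshold": "0.05"}),
)
incomplete = missing_fields(example)
```

A completeness check of this kind is one way a submission pipeline might flag records that need manual curation before they are released.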
Interpretation of “Interesting Gene Lists” • Visualizing overrepresentation of interesting gene lists in protein-protein networks and/or biological pathways • Statistical assessment of enrichment
Visualizing Hits from Interesting Gene Lists • Select Dataset(s) of interest • Choose all (or subset) of genes on list • Intersect/Subtract between studies • Visualize selected genes as a biological network
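The intersect/subtract step and the hand-off to a network view can be sketched with plain set operations. This is an illustrative sketch only: the function names, the gene identifiers (`GENE_A` and so on) and the interaction pairs are made up, and a real tool would draw its interactions from a curated resource.

```python
def intersect(*gene_lists):
    """Genes present in every supplied list."""
    sets = [set(genes) for genes in gene_lists]
    return set.intersection(*sets) if sets else set()


def subtract(base, *others):
    """Genes in the base list that appear in none of the other lists."""
    result = set(base)
    for genes in others:
        result -= set(genes)
    return result


def induced_edges(genes, interactions):
    """Keep only interactions whose two partners are both selected genes."""
    return [(a, b) for a, b in interactions if a in genes and b in genes]


# Hypothetical gene lists from two studies and a toy interaction table.
study_1 = ["GENE_A", "GENE_B", "GENE_C"]
study_2 = ["GENE_B", "GENE_C", "GENE_D"]
interactions = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"), ("GENE_C", "GENE_D")]

common = intersect(study_1, study_2)       # genes found in both studies
only_first = subtract(study_1, study_2)    # genes unique to the first study
edges = induced_edges(common, interactions)
```

The resulting edge list is the kind of input a network viewer would take to draw the selected genes as a graph.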
“Quick & Dirty” Overrepresentation Visualization • Reactome SkyPainter • Limited to reactions and interactions found in the Reactome database • Visualizes “Big Picture” using pathway representations • Constructed using gene list from HCV study • HCV host factors residing in the nucleus • Ribonucleoprotein complex, transcription factors, kinases, protein metabolism/modification, nucleic acid binding / metabolism
Statistical Enrichment Analysis • Gene Ontology biological process overrepresentation • CLASSIFI • Protein interaction network module enrichment (PINME) analysis • Obtain all known human protein-protein interactions from BioGRID • Determine module (sub-network) structures (e.g. using dMoNet) • Identify function of modules (e.g. using CLASSIFI) • Determine overrepresentation statistics for IGLs • Visualize results
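One common way to express “overrepresentation statistics for IGLs” is a one-sided hypergeometric test: given a background of N genes, of which K belong to a category or module, how likely is it that a list of n genes contains at least k members by chance? The sketch below assumes that formulation; it is not taken from the slides and is not necessarily what CLASSIFI or the PINME analysis computes. The function name and arguments are hypothetical.

```python
from math import comb  # requires Python 3.8 or later


def overrepresentation_p(population, annotated, list_size, hits):
    """One-sided hypergeometric p-value, P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes belonging to the category or module
    list_size:  number of genes in the interesting gene list
    hits:       list genes that belong to the category or module
    """
    upper = min(annotated, list_size)
    total = comb(population, list_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 5 in the module, a 5-gene list, all 5 hit.
p_all_hits = overrepresentation_p(10, 5, 5, 5)   # 1 / 252
p_any = overrepresentation_p(10, 5, 5, 0)        # the whole distribution
```

When many categories or modules are tested against the same list, the resulting p-values would normally need a multiple-testing correction before being reported.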