1 / 37

Bioinformatics: Making sense of functional genomics data

Bioinformatics: Making sense of functional genomics data. Chris Stoeckert, Ph.D. Dept of Genetics and Penn Center for Bioinformatics, University of Pennsylvania School of Medicine GoldenHelix Symposia, Athens, Greece December 3, 2010. What is Functional Genomics?.

pisces
Download Presentation

Bioinformatics: Making sense of functional genomics data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Bioinformatics: Making sense of functional genomics data Chris Stoeckert, Ph.D. Dept of Genetics and Penn Center for Bioinformatics, University of Pennsylvania School of Medicine GoldenHelix Symposia, Athens, Greece December 3, 2010

  2. What is Functional Genomics? From http://en.wikipedia.org/wiki/Functional_genomics • Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions. Unlike genomics and proteomics, functional genomics focuses on the dynamic aspects such as gene transcription, translation, and protein-protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts, and protein products. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional “gene-by-gene” approach.

  3. The Impact of Functional Genomics • More data than can be fit in a publication, or a lab book, and often not generated by a single lab. • Microarray-based datasets are an example. • High throughput sequencing (HTS)-based datasets are becoming the “new” microarrays.

  4. What is Needed for Reproducible Research • Public archives of data • Minimum information about experiments so that the steps to generate the results can be followed • An example of what can go wrong: Potti A, Dressman HK, Bild A, Riedel RF, Chan G, Sayer R, Cragun J, Cottrill H, Kelley MJ, Petersen R, Harpole D, Marks J, Berchuck A, Ginsburg GS, Febbo P, Lancaster J, Nevins JR. Genomic signatures to guide the use of chemotherapeutics. Nat Med. 2006 Nov;12(11):1294-300. Epub 2006 Oct 22. Erratum in: Nat Med. 2008 Aug;14(8):889. Nat Med. 2007 Nov;13(11):1388. From July 21, 2010 N.Y. Times: “Last year, two biostatisticians at the University of Texas MD Anderson Cancer Center published an article in the scientific journal Annals of Applied Statistics in which they identified errors in Duke’s data analysis and said they had not been able to reproduce Duke’s results.” Keith A. Baggerly and Kevin R. Coombes Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology Ann. Appl. Stat. Volume 3, Number 4 (2009), 1309-1334.

  5. Recent examples of discoveries from integrating usable functional genomics datasets • A global map of human gene expression. Lukk et al. Nature Biotech. 2010 • Differentially Expressed RNA from Public Microarray Data Identifies Serum Protein Biomarkers for Cross-Organ Transplant Rejection and Other Conditions. R. Chen et al. PLoS Comp. Biol. 2010

  6. In the beginning there were microarrays… • And they were developed for gene expression profiling. • Microarray studies were published but only gene lists were provided. • No details, no place to get the data • In 1999, the Microarray Gene Expression Data Society (MGED) began to address the lack of verifiable and reproducible datasets by developing and promoting standards.

  7. MGED Standards • What information is needed for a microarray experiment? • MIAME: Minimal Information About a Microarray Experiment. Brazma et al., Nature Genetics 2001 • How do you “code up” microarray data? • MAGE-OM: MicroArray Gene Expression Object Model. Spellman et al., Genome Biology 2002 • MAGE-TAB Rayner et al., BMC Bioinformatics 2006 • What words do you use to describe a microarray experiment? • MO: MGED Ontology. Whetzel et al. Bioinformatics 2006

  8. labelled nucleic acid labelled nucleic acid labelled nucleic acid labelled nucleic acid labelled nucleic acid Gene expression data matrix normalization hybridisation hybridisation hybridisation hybridisation hybridisation Array design RNA extract RNA extract RNA extract RNA extract RNA extract Microarray Sample Sample Sample Sample Sample genes array array array array Protocol Protocol Protocol Protocol Protocol Protocol Experiment integration MIAME in a nutshell (ala AlvisBrazma) Stoeckert et al. Drug Discovery Today TARGETS 2004

  9. labelled nucleic acid labelled nucleic acid labelled nucleic acid labelled nucleic acid labelled nucleic acid Gene expression data matrix normalization hybridisation hybridisation hybridisation hybridisation hybridisation Array design RNA extract RNA extract RNA extract RNA extract RNA extract Microarray Sample Sample Sample Sample Sample genes array array array array Protocol Protocol Protocol Protocol Protocol Protocol Experiment integration Sequencing is replacing array technology @HWI-EAS266_0011:8:1:6:969#0/1 GTTTGCCNGTGTGTACGCTACCCCCTTCTTGTGTGTGTGTGTCT +HWI-EAS266_0011:8:1:6:969#0/1 _abb`a[DZ`aabaa_a`b]___^^aa_`aa_a^a[\\aZTZVY @HWI-EAS266_0011:8:1:7:1688#0/1 AAGATGANGGCAGGGTGCAAGATGGCAGGATGCAAGATGGCAGG +HWI-EAS266_0011:8:1:7:1688#0/1 a`^ab`^D\a]a`b``b_bbbaabb^abaa``^a_^_aa\]_VR @HWI-EAS266_0011:8:1:7:593#0/1 CAGTTCANTTCTCAGCACCACACTGGGATGCTCACACATGCCTG +HWI-EAS266_0011:8:1:7:593#0/1 abbbb_VD[bbbba_`bbbbbbbbbbbaa_`bbaabaabb_aa_ @HWI-EAS266_0011:8:1:7:139#0/1 CATGGGGNATAATTGCAATCCCCGATCCCCATCACGAATGGGGT +HWI-EAS266_0011:8:1:7:139#0/1 aab`[^YDY]Z\baa`aabaaaa`aa`a]aa```\aY]^\]ZVX @HWI-EAS266_0011:8:1:7:1390#0/1 GAATAATNGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCGGA +HWI-EAS266_0011:8:1:7:1390#0/1 _U^b_`]D\__a_a`S```Y[a__]a\aa_`]`aTVZ__\HYVX @HWI-EAS266_0011:8:1:7:1663#0/1 TGATGTTNGTGGCAATAATGGGGGTAGCGGCAATGGTGGCGGGG +HWI-EAS266_0011:8:1:7:1663#0/1 a`[_X]\DQTZ[^YYa[[aXV[PZUUYSYBBBBBBBBBBBBBBB

  10. ChiP-Seq MeDIP-Seq Etc. labelled nucleic acid labelled nucleic acid labelled nucleic acid labelled nucleic acid nucleic acid normalization hybridisation hybridisation hybridisation hybridisation hybridisation Array design RNA extract RNA extract RNA extract RNA extract Chromatin, DNA extract Microarray Sample Sample Sample Sample Sample genes array array array array Protocol Protocol Protocol Protocol Protocol Protocol Experiment integration Sequencing is replacing array technology @HWI-EAS266_0011:8:1:6:969#0/1 GTTTGCCNGTGTGTACGCTACCCCCTTCTTGTGTGTGTGTGTCT +HWI-EAS266_0011:8:1:6:969#0/1 _abb`a[DZ`aabaa_a`b]___^^aa_`aa_a^a[\\aZTZVY @HWI-EAS266_0011:8:1:7:1688#0/1 AAGATGANGGCAGGGTGCAAGATGGCAGGATGCAAGATGGCAGG +HWI-EAS266_0011:8:1:7:1688#0/1 a`^ab`^D\a]a`b``b_bbbaabb^abaa``^a_^_aa\]_VR @HWI-EAS266_0011:8:1:7:593#0/1 CAGTTCANTTCTCAGCACCACACTGGGATGCTCACACATGCCTG +HWI-EAS266_0011:8:1:7:593#0/1 abbbb_VD[bbbba_`bbbbbbbbbbbaa_`bbaabaabb_aa_ @HWI-EAS266_0011:8:1:7:139#0/1 CATGGGGNATAATTGCAATCCCCGATCCCCATCACGAATGGGGT +HWI-EAS266_0011:8:1:7:139#0/1 aab`[^YDY]Z\baa`aabaaaa`aa`a]aa```\aY]^\]ZVX @HWI-EAS266_0011:8:1:7:1390#0/1 GAATAATNGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCGGA +HWI-EAS266_0011:8:1:7:1390#0/1 _U^b_`]D\__a_a`S```Y[a__]a\aa_`]`aTVZ__\HYVX @HWI-EAS266_0011:8:1:7:1663#0/1 TGATGTTNGTGGCAATAATGGGGGTAGCGGCAATGGTGGCGGGG +HWI-EAS266_0011:8:1:7:1663#0/1 a`[_X]\DQTZ[^YYa[[aXV[PZUUYSYBBBBBBBBBBBBBBB

  11. MGED is now the Functional Genomics Data Society • The Functional Genomics Data Society - FGED (eff – jed) Society, founded in1999 as the MGED Society, advocates for open access to genomic data sets and works towards providing concrete solutions to achieve this. Our goal is to assure that investment in functional genomics data generates the maximum public benefit. Our work on defining minimum information specifications for reporting data in functional genomics papers have already enabled large data sets to be used and reused to their greater potential in biological and medical research. • We work with other organizations to develop standards for biological research data quality, annotation and exchange. • We facilitate the creation and use of software tools that build on these standards and allow researchers to annotate and share their data easily. • We promote scientific discovery that is driven by genome wide and other biological research data integration and meta-analysis.

  12. The Functional Genomics Data Society

  13. For more information see http://mged.org or http://fged.org Look here for information on a new site in the coming year …

  14. New Directions for FGED Standards • What information is needed for a UHTS experiment? • MINSEQE: Minimal Information about a high throughput SEQuencing Experiment. • http://www.mged.org/minseqe/ • How do you annotate functional genomics data? • Annotare: Tool to create MAGE-TAB. • http://code.google.com/p/annotare/ • What words do you use to describe an investigation? • OBI: Ontology for Biomedical Investigations. • http://obi-ontology.org/

  15. Minimum Information about a high-throughput Nucleotide SeQuencing Experiment – MINSEQE (April, 2008) • The description of the biological system and the particular states that are studied • The sequence read data for each assay • The 'final' processed (or summary) data for the set of assays in the study • The experiment design including sample data relationships • General information about the experiment • Essential experimental and data processing protocols

  16. Minimal Information for Biological and Biomedical Investigations 33 Projects registered as of November, 2010

  17. Related standards for exchange of phenotype data for reuse in meta analysis. • statement of informed consent that the data may (not) be used in additional analysis. • enrollment criteria / exclusion criteria • Population stratification: race/ethnicity/ strain, gender, age • tissue or cellular makeup of specimen • non-omics phenotype for each participant (subject or group) Jennifer Fostel (NIEHS): Leader Phenotypes Working group

  18. FGED design-specific criteria Different study or trial designs would have additional reporting requirements: (A) study involving a disease (A1) type and stage of disease (A2) pertinent genotype / genomic conditions (B) study involving chemical exposure (B1) dose administered, dosing frequency and duration, route of administration (B2) ID to permit unambiguous identification of chemical, i.e. Cas No (C) study involving surgical intervention (C1) details of intervention (C2) timing with respect to pertinent events (D) Observational or epidemiological study (D1) demographic factors (D2) pertinent medical history Jennifer Fostel (NIEHS): Leader Phenotypes Working group

  19. Annotare - An open source standalone MAGE-TAB editor Shankar R, Parkinson H, Burdett T, Hastings E, Liu J, Miller M, Srinivasa R, White J, Brazma A, Sherlock G, Stoeckert CJ Jr, Ball CA. Annotare - a tool for annotating high-throughput biomedical investigations and resulting data. Bioinformatics. 2010 Aug 23.

  20. MAGE-TAB Format What’s MAGE-TAB? • MAGE-TAB is a simple spreadsheet view which has two filesIDF - describing the experiment design, contact details, variables and protocols • SDRF - a spreadsheet with columns that describe samples, annotations, protocol references, hybridizations and data • Linked data files, e.g. CEL files, these are referenced by the SDRF • For single channel data one row in the SDRF = 1 hybridization, for two channel data one row = 1 channel • MAGE-TAB can also be used to annotate Next Gen Sequencing data Where can I get MAGE-TAB from? • ~10,000 MAGE-TAB files are available for download from ArrayExpress(includes GEO derived and ArrayExpressdata) • caArray also provides MAGE-TAB files for download. Who’s using MAGE-TAB? • BioConductor • GenePattern • MeV

  21. IDF file for E-TABM-34 IDF = Investigation Description Format

  22. SDRF file for E-TABM-34 SDRF = Sample and Data Relationship Format

  23. Annotare - an open source MAGE-TAB Editor Annotare is an annotation tool for high throughput gene expression experiments in MAGE-TAB format. Researchers can describe their investigations with the investigators’ contact details, experimental design, protocols that were employed, references to publications, details of biological samples, arrays, and experimental data produced in the investigation.

  24. Annotare Features • Intuitive graphical user interface forms for editing • Ontology support, an inbuilt ontology and web services connectivity to bioportal • Searchable standard templates • Design wizard • Validation module • Mac and Windows Support http://code.google.com/p/annotare/

  25. MAGE-TAB is used to load datasets into Beta Cell Genomics University of Pennsylvania Chris Stoeckert ElisabettaManduchi Chiao-Feng Lin Emily Allen Junmin Liu • Vanderbilt University • Mark Magnuson • Jean-Philippe Cartailler • Thomas Houfek • Josh Norman • Chris Howard http://genomics.betacell.org supporting the Beta Cell Biology Consortium

  26. http://genomics.betacell.org supporting the Beta Cell Biology Consortium

  27. OBI – Ontology for Biomedical Investigations Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, Malone J, Parkinson H, Peters B, Rocca-Serra P, Ruttenberg A, Sansone SA, Soldatova LN, Stoeckert CJ Jr, Turner JA, Zheng J; OBI consortium. Modeling biomedical experimental processes with OBI. J Biomed Semantics. 2010 Jun 22;1 Suppl 1:S7.

  28. Ontology • A set of concepts within a domain and the relationships between those concepts • Unambiguous description of a domain • Controlled vocabulary • Relations (logical constraints among terms) • Human readable and machine interpretable • Uses • Annotation • Text mining • Semantic web • Reasoning

  29. Gene Ontology (GO) (generated by NCBO BioPortal) The Gene ontology has been developed to provide terms to consistently describe the gene and gene product attributes across species and databases Standardized annotation facilitate data integration cross various databases and enable the ability to query and retrieve genes or proteins based on their shared biology

  30. Explosion of biomedical ontologies available http://bioportal.bioontology.org/

  31. OBI – Ontology for Biomedical Investigations • FGED is one of many communities contributing to OBI • Whereas the MGED Ontology is primarily a controlled vocabulary for use with MAGE, OBI is a well-founded ontology with logical definitions and restrictions to be used for multiple purposes (e.g., database models, text mining, file annotation) • OBI intends to be part of the OBO Foundry • Interoperable with Gene Ontology, CheBI, Phenotypic qualities (PATO), Cell Type (CL)… • OBI is available through browsers like the NCBO BioPortal

  32. Partial high level structure of OBI classes OBI and IAO (Information Artifact Ontology) classes are shown in blue. Classes imported from other external ontologies are shown in red. Some example subclasses, such as PCR product and cell culture are included to illustrate the use of the class processed material.

  33. Measuring the glucose concentration in blood From The OBI Consortium, The Ontology for Biomedical Investigations, under revision

  34. Using OBI and other OBO Foundry ontologies to capture phenotype data JieZheng, Omar Harb

  35. Enhance data mining in functional genomics resources EuPathDB: a portal to eukaryotic pathogen databases.Aurrecoechea C, Brestelli J, Brunk BP, Fischer S, Gajria B, Gao X, Gingle A, Grant G, Harb OS, Heiges M, Innamorato F, Iodice J, Kissinger JC, Kraemer ET, Li W, Miller JA, Nayak V, Pennington C, Pinney DF, Roos DS, Ross C, Srinivasamoorthy G, Stoeckert CJ Jr, Thibodeau R, Treatman C, Wang H.Nucleic Acids Res. 2010

  36. Find T. brucei genes by phenotype based on RNAi knockdown and insufficiency studies. TriTrypDB: a functional genomic resource for the Trypanosomatidae.Aslett M, Aurrecoechea C, Berriman M, Brestelli J, Brunk BP, Carrington M, Depledge DP, Fischer S, Gajria B, Gao X, Gardner MJ, Gingle A, Grant G, Harb OS, Heiges M, Hertz-Fowler C, Houston R, Innamorato F, Iodice J, Kissinger JC, Kraemer E, Li W, Logan FJ, Miller JA, Mitra S, Myler PJ, Nayak V, Pennington C, Phan I, Pinney DF, Ramasamy G, Rogers MB, Roos DS, Ross C, Sivam D, Smith DF, Srinivasamoorthy G, Stoeckert CJ Jr, Subramanian S, Thibodeau R, Tivey A, Treatman C, Velarde G, Wang H.Nucleic Acids Res. 2010

  37. Making sense of functional genomics data: Standards & Ontologies • Enhance sharing data and understanding what is shared by computers and humans. • Applications include: • Data mining of databases • Meta-analysis of multiple datasets

More Related