Contents of this Talk. [Used as intro to Genome Databases Seminar, 2002] Overview of bioinformatics Motivations for genome databases Analogy of virus reverse-eng to genome analysis Questions to ask of a genome DB. Overview of Genome Databases. Peter D. Karp, Ph.D. SRI International
Contents of this Talk • [Used as intro to Genome Databases Seminar, 2002] • Overview of bioinformatics • Motivations for genome databases • Analogy of virus reverse-eng to genome analysis • Questions to ask of a genome DB
Overview of Genome Databases Peter D. Karp, Ph.D. SRI International pkarp@ai.sri.com www-db.stanford.edu/dbseminar/seminar.html
Talk Overview • Definition of bioinformatics • Motivations for genome databases • Computer virus analogy • Issues in building genome databases
Definition of Bioinformatics • Computational techniques for management and analysis of biological data and knowledge • Methods for disseminating, archiving, interpreting, and mining scientific information • Computational theories of biology • Genome Databases is a subfield of bioinformatics
Motivations for Bioinformatics • Growth in molecular-biology knowledge (literature) • Genomics • Study of genomes through DNA sequencing • Industrial Biology
Example Genomics Datatypes • Genome sequences • DOE Joint Genome Institute • 511M bases in Dec 2001 • 11.97G bases since Mar 1999 • Gene and protein expression data • Protein-protein interaction data • Protein 3-D structures
Genome Databases • Experimental data • Archive experimental datasets • Retrieving past experimental results should be faster than repeating the experiment • Capture alternative analyses • Lots of data, simpler semantics • Computational symbolic theories • Complex theories become too large to be grasped by a single mind • The database is the theory • Biology is very much concerned with qualitative relationships • Less data, more complex semantics
Bioinformatics • Distinct intellectual field at the intersection of CS and molecular biology • Distinct field because researchers in the field should know CS, biology, and bioinformatics • Spectrum from CS research to biology service • Rich source of challenging CS problems • Large, noisy, complex data-sets and knowledge-sets • Biologists and funding agencies demand working solutions
Bioinformatics Research • algorithms + data structures = programs • algorithms + databases = discoveries • Combine sophisticated algorithms with the right content: • Properly structured • Carefully curated • Relevant data fields • Proper amount of data
Goals of Systems Biology • Catalog the molecular parts lists of cells • Understand the function(s) of each part • Understand how those parts interact to produce the behavior of a cell or organism • Understand the evolution of those molecular parts
Analogy: Genome Analysis and Virus Analysis • Given: Virus binary executable file for known machine architecture • Reverse engineer the program • Procedures • Call graph • Specifications for I/O behavior of the program and all procedures • Capture and publish an annotated analysis of the virus • Comparative analysis of related viruses
Genome Analysis • Example: M. tuberculosis genome • Given: 4.4Mbp of DNA (genome) • Infer: • Molecular parts list of Mtb • A model of the biochemical machinery of Mtb cell • DNA is a blueprint for the program of life
Start • Virus: 4.4 Mbyte binary program • Genome: 4.4 Mbp DNA sequence
Step 1 • Virus: Distinguish code from data segments; find procedure boundaries • Genome: Distinguish coding from non-coding regions (gene finding)
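The gene-finding side of Step 1 can be illustrated with a minimal sketch in Python. It assumes the simplest possible model, which is not from the talk: a candidate coding region is an open reading frame that starts at ATG and runs to the first in-frame stop codon on the forward strand. The function name `find_orfs` and the `min_codons` parameter are hypothetical, and real gene finders use far richer statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs for open reading frames on the forward strand.

    start is the index of the ATG; end is one past the stop codon.
    Simplifying assumption: only the forward strand and ATG starts are scanned.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length in codons, counting both the start and the stop codon.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)
```

Calling `find_orfs` on a sequence returns index pairs that can be sliced back out of the sequence, which corresponds to marking procedure boundaries in the binary.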
Step 2 • Virus: Predict semantics of procedures • Genome: Predict gene functions • [Diagram: nodes A, B, C, D]
Step 3 • Virus: Predict procedure call graph • Genome: Predict biochemical and gene networks • [Diagram: graphs over nodes A, B, C, D]
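A call graph and a biochemical or gene network can both be stored as a directed graph. The sketch below is only an illustration: the node names A to D are placeholders, the edges are invented, and the helper name `downstream` is hypothetical. It shows how one data structure answers the same question in both settings: which procedures (or genes and reactions) are reachable from a given starting point.

```python
from collections import deque

# Hypothetical edges: "A calls B" in the program reading,
# "A feeds into B" in the network reading.
network = {"A": ["B"], "B": ["C", "D"], "C": [], "D": []}

def downstream(graph, start):
    """Return the set of nodes reachable from start by following edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

With the edges above, `downstream(network, "A")` collects B, C, and D, while a leaf such as C has nothing downstream.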
Step 4 • Virus: Predict conditions under which procedures are invoked • Genome: Predict expression of network fragments • [Diagram: nodes A, B, C, D, Q, R, S]
Step 5 • Virus: Infer complete program specification • Genome: Formulate dynamic cellular simulation
Step 6 • Virus: Internet publishing of structured program annotation with explanations, references, commentary • Genome: Internet publishing of structured genome annotation with explanations, references, commentary
Step 7 • Virus: Comparative analysis of viruses; evolutionary relationships among viruses • Genome: Comparative analysis of genomes; evolutionary relationships among genomes
Step 8 • Virus: Identify measures to disable virus or prevent its spread • Genome: Identify target proteins for anti-microbial drug discovery • [Diagram: nodes A, B, C, D, Q, R, S]
Database of Viruses • Create a database that stores • Binaries for all viruses • All annotation of virus programs by different investigators • Comparative analyses • Support • Remote API access • Click-at-a-time browsing
Reference on Major Genome Databases • Nucleic Acids Research Database Issue • http://nar.oupjournals.org/content/vol30/issue1/ • 112 databases
What are Database Goals and Requirements? • How many users? • What expertise do users have? • What problems will database be used to solve?
What is its Organizing Principle? • Different DBs partition the space of genome information in different dimensions • Experimental methods (Genbank, PDB) • Organism (EcoCyc, Flybase)
What is its Level of Interpretation? • Laboratory data • Primary literature (Genbank) • Review (SwissProt, MetaCyc) • Does DB model disagreement?
What are its Semantics and Content? • What entities and relationships does it model? • How does its content overlap with similar DBs? • How many entities of each type are present? • Sparseness of attributes and statistics on attribute values
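The questions about entity counts and attribute sparseness can be answered with very little code once the records are available. The following is a rough sketch over invented records; the field names and the helpers `count_by_type` and `fill_rate` are hypothetical and do not describe any particular database.

```python
from collections import Counter

# Invented example records; None marks an attribute with no value.
entities = [
    {"type": "gene", "name": "geneA", "product": "protA", "function": None},
    {"type": "gene", "name": "geneB", "product": None, "function": None},
    {"type": "protein", "name": "protA", "function": "kinase"},
]

def count_by_type(records):
    """Count how many entities of each type are present."""
    return Counter(r["type"] for r in records)

def fill_rate(records, attribute):
    """Fraction of records that carry a non-empty value for attribute."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(attribute) not in (None, ""))
    return filled / len(records)
```

A low fill rate for an attribute is a signal that queries relying on that attribute will silently miss most entities.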
What are Sources of its Data? • Potential information sources • Laboratory instruments • Scientific literature • Manual entry • Natural-language text mining • Direct submission from the scientific community • Genbank • Modification policy • DB staff only • Submission of new entries by scientific community • Update access by scientific community
What DBMS is Employed? • None • Relational • Object oriented • Frame knowledge representation system
Distribution / User Access • Multiple distribution forms enhance access • Browsing access with visualization tools • API • Portability
What Validation Approaches are Employed? • None • Declarative consistency constraints • Programmatic consistency checking • Internal vs external consistency checking • What types of systematic errors might DB contain?
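The difference between declarative constraints and programmatic checking can be sketched with Python's built-in `sqlite3` module. The schema is hypothetical (tables `gene` and `protein`, and the rule that `end_pos` is never smaller than `start_pos`, are assumptions made for illustration). The `CHECK` and `REFERENCES` clauses are declarative: the DBMS rejects bad rows on its own. The `genes_without_protein` query is a programmatic check that runs after loading.

```python
import sqlite3

# Hypothetical schema; the coordinate rule is an assumed convention.
SCHEMA = """
CREATE TABLE gene (
    id        TEXT PRIMARY KEY,
    start_pos INTEGER NOT NULL CHECK (start_pos >= 1),
    end_pos   INTEGER NOT NULL,
    CHECK (end_pos >= start_pos)
);
CREATE TABLE protein (
    id      TEXT PRIMARY KEY,
    gene_id TEXT NOT NULL REFERENCES gene(id)
);
"""

def make_db():
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def try_insert(conn, sql, params):
    """Return True if the row was accepted, False if a declared constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

def genes_without_protein(conn):
    """Programmatic check: list genes that have no protein row pointing at them."""
    rows = conn.execute(
        "SELECT g.id FROM gene g LEFT JOIN protein p ON p.gene_id = g.id "
        "WHERE p.id IS NULL ORDER BY g.id"
    )
    return [row[0] for row in rows]
```

Declarative constraints catch errors at entry time for every client, while programmatic checks are better suited to rules that are warnings rather than hard errors.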
Database Documentation • Schema and its semantics • Format • API • Data acquisition techniques • Validation techniques • Size of different classes • Coverage of subject matter • Sparseness of attributes • Error rates
Relationship of Database Field to Bioinformatics • Scientists generally ignorant of basic DB principles • Complex queries vs click-at-a-time access • Data model • Defined semantics for DB fields • Controlled vocabularies • Regular syntax for flatfiles • Automated consistency checking • Most biologists take one programming class • Evolution of typical genome database • Finer points of DB research off their radar screen • Handful of DB researchers work in bioinformatics
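Two of the principles listed above, a regular syntax for flatfiles and controlled vocabularies, are cheap to enforce in code. The sketch below assumes a made-up flatfile layout (a two-letter tag, whitespace, a value, with `//` closing each record) and a made-up vocabulary of evidence terms; neither is taken from a real database format.

```python
import re

# Assumed grammar: two uppercase letters, whitespace, then the value.
LINE_RE = re.compile(r"^([A-Z]{2})\s+(.+)$")

# Hypothetical controlled vocabulary for the EV (evidence) field.
EVIDENCE_TERMS = {"experimental", "computational", "author-statement"}

def parse_flatfile(text):
    """Parse records, raising ValueError on any line that breaks the grammar."""
    records, current = [], {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.rstrip()
        if not line:
            continue
        if line == "//":  # record terminator
            records.append(current)
            current = {}
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: not valid flatfile syntax: {line!r}")
        current[match.group(1)] = match.group(2)
    if current:
        raise ValueError("last record is missing its '//' terminator")
    return records

def vocabulary_errors(records):
    """Return (ID, EV) pairs whose EV value is outside the controlled vocabulary."""
    return [
        (r.get("ID"), r.get("EV"))
        for r in records
        if r.get("EV") not in EVIDENCE_TERMS
    ]
```

Because the syntax is regular, a malformed line fails loudly at parse time instead of producing a silently wrong record, and free-text values such as "probably" are flagged rather than stored.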
Database Field • For many years, the majority of bioinformatics DBs did not employ a DBMS • Flatfiles were the rule • Scientists want to see the data directly • Commercial DBMSs too expensive, too complex • DBAs too expensive • Most scientists do not understand • Differences between BA, MS, PhD in CS • CS research vs applications • Implications for project planning, funding, bioinformatics research
Recommendation • Teaching scientists programming is not enough • Teaching scientists how to build a DBMS is irrelevant • Teach scientists basic aspects of databases and symbolic computing • Database requirements analysis • Data models, schema design • Knowledge representation, ontologies • Formal grammars • Complex queries • Database interoperability