This project aims to finish sequencing and annotating the genome of Drosophila melanogaster. It includes gene disruptions using mutagenesis, cDNA sequencing for every gene, and informatics tool development for genome analysis. The project also involves data modeling, ontology application, and linking disease genes across species for biological research advancements.
Core 2: Bioinformatics NCBO-Berkeley
Berkeley Drosophila Genome Project • Finish the sequence of the euchromatic genome of Drosophila melanogaster • Annotate biologically important features of this sequence • Produce gene disruptions using P element-mediated mutagenesis • Full-length sequencing and expression characterization of a cDNA for every gene • Develop informatics tools
Who is here from NCBO-Berkeley • Chris • Shu • Mark • Sima
Chris • GadFly database schema • GO database schema • Chado database schema • Perl libraries for all • OBD data architect
Shu • OBD dev & Data flow • AmiGO, ImaGO & database • Compute Pipeline
Mark • Apollo Genome Annotation Editor • Phenote and other OBD interfaces
Sima • Adh region annotation • Annotation of entire Drosophila Genome • Project manager and coordinator nonpareil • Associate Director
OBD Outline • Core 2 aims, refresher • Data models for OBD • phenotypes • clinical trials • others • Modeling frameworks • exchange formats • database system • SQL based vs ‘SemWeb’ dbs • Progress • Demo
Core 2 Specific Aims • Apply ontologies • Software toolkit for describing and classifying data • Capture, manage, and view data annotations • Database (OBD) and interfaces to store and view annotations • Investigate and compare implications • Linking human diseases to model systems • Maintain • Ongoing reconciliation of ontologies with annotations
Core 3 Driving Biological Projects • DBPs • phenotypes: Fly and Zebrafish to human • clinical trials • Core 2 Aims • Apply ontologies to describe data • Capture, manage, and view data annotations • Link disease genes to model systems • Reconcile annotation and ontology changes
Apply ontologies to describe data • Requirements • Data capture tools • phenote • demo tomorrow • no tool requirements from UCSF • Data model • Database (OBD) • --aim 2
data flow
user’s view
Data models • Common/shared domain specific models • Aim 3 • linking disease genes • model must support this • granularity • comparability
Domain specific data models • FB, ZFIN • genotype to phenotype • ‘EAV’ • qualities inhere in entities • orthologs • phenotype to disease • core 2 will help define common model • UCSF • clinical trials • existing ontology-friendly schema - trialbank
Phenotype data model • Qualities inhere in entities • Entity term; PATO term • brain (FBbt:00005095); fused (PATO:0000642) • gut (MA:0000917); dysplastic (PATO:0000640) • tail fin (ZDB:020702-16); ventralized (PATO:0000636) • kidney (ZDB:020702-16); hypertrophied (PATO:0000636) • midface (ZDB:020702-16); hypoplastic (PATO:0000636) • Pre-composed phenotype terms • Mammalian Phenotype Ontology • “increased activated B-cell number” MPO:0000319 • “pink fur hue” MPO:0000374
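A minimal sketch of the "qualities inhere in entities" model, written in Python purely for illustration. The class names `Term` and `Phenotype` and the `describe` method are assumptions, not part of any OBD schema; the identifiers are the ones listed on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """An ontology term: an identifier plus a human-readable label."""
    id: str
    label: str


@dataclass(frozen=True)
class Phenotype:
    """A post-composed phenotype: a quality that inheres in an entity."""
    entity: Term
    quality: Term

    def describe(self) -> str:
        # Render as "<quality> <entity>", e.g. "fused brain".
        return f"{self.quality.label} {self.entity.label}"


# Two of the entity/PATO pairs from the slide.
fused_brain = Phenotype(
    entity=Term("FBbt:00005095", "brain"),
    quality=Term("PATO:0000642", "fused"),
)
dysplastic_gut = Phenotype(
    entity=Term("MA:0000917", "gut"),
    quality=Term("PATO:0000640", "dysplastic"),
)

print(fused_brain.describe())
print(dysplastic_gut.describe())
```

A pre-composed term such as an MPO term would, in this sketch, be a single `Term` rather than an entity/quality pair.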
Extensions to simple model • What about • Relational attributes • Quantitative vs qualitative • Post-composing entity and attribute terms • Relative states/values • Variation in place, space and time • A better treatment of absence • See CSHL Pheno meeting talk • also, more detailed formal presentation (available) • Not to mention genotypes, environments, provenance, etc
Modeling clinical trials • Model already described using frame-based schema • Further modeling required? • abstraction • to integrate more with other OBD datatypes • views • to only show parts relevant to OBD/BioPortal
Future DBPs and use cases • OBD will contain a variety of general types of data • Modeling is expensive • use existing models where appropriate • but whole must be cohesive and integrated • Most of this talk focuses on the pheno DBPs for illustrative purposes
Modeling frameworks • language • technology
Modeling data: underlying formalism • Model is expressed with modeling language • Options • Relational/SQL • Semi-structured, XML • Object-centric (UML, frame-based?) • Logic based • description logic: e.g. OWL • first-order logic: e.g. CL • Natural language descriptions • Model should be independent of language it is expressed in
Data exchange language: XML • Simple • XML is suited for data exchange • XML can drive software spec • constrains programmatic data model • XSD can generate UML • closed world assumption is useful • cf Ruttenberg et al • Mature technology • well understood by developers, MODs • standards
How OBD uses XML • obd-geno-pheno-xml (aka pheno-xml) • actually multiple modular components • genotype schema • phenotype schema: ‘EAV’ • environment schema • provenance schema • used as • exchange format • cf: gene ontology association files • no need for ClinicalTrials-XML
Example pheno-xml
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
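As a rough illustration of how an exchange file like the pheno-xml example might be consumed, the sketch below flattens one genotype record into entity/quality/state rows using Python's standard `xml.etree.ElementTree`. The function name `extract_eav` and the row keys are assumptions made for this sketch. The embedded copy uses straight quotes and a closing `</genotype>` tag so that it parses; the elided quality identifier is kept exactly as it appears on the slide.

```python
import xml.etree.ElementTree as ET

# Copy of the slide's example; the quality type is elided in the source
# and is deliberately left elided here.
PHENO_XML = """
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
"""


def extract_eav(xml_text):
    """Return one dict per (entity, quality, state) nesting in a genotype record."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for entity in root.iter("entity"):
        for quality in entity.iter("quality"):
            for state in quality.iter("state"):
                stage = state.find("time_range")
                rows.append({
                    "genotype": root.get("id"),
                    "entity": entity.get("type"),
                    "quality": quality.get("type"),
                    "state": state.get("type"),
                    # The stage is optional in this sketch.
                    "stage": stage.get("type") if stage is not None else None,
                })
    return rows


rows = extract_eav(PHENO_XML)
for row in rows:
    print(row)
```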
SQL Databases • Data storage, management and querying • all MODs use SQL dbs • Lots of advantages • scalable, standard QL, mature, APIs, etc • pure relational model is reasonably formal • XML/SQL more or less compatible • low impedance mismatch
Schemas for geno-pheno data • We already have schema: Chado • Used by many MODs (eg FB) • others are ‘chado compliant’ (eg ZFIN) • Modular • ontologies • genomic • genotype • phenotype • phylogenies • …etc • Phenotype module needs updating • will be driven by pheno-xml
Problem solved? • We have two mature, complementary technologies, and can define schemas for our model in an appropriate formalism for each • Is this enough to work with?
Issues • OBD will be much more than geno-pheno • clinical trials • future DBPs, other NCBCs • any data expressed in an ontology language • Software and schema development expensive • fragility in face of schema evolution • development gets bogged down in data exchange issues
Major issue • SQL and XPath work great for ‘traditional’ data… • …but are too low level for ontology-centric data • lack of inference • no way to directly express ontology constraints
Use cases from previous experience: AmiGO • GO • “find all TF genes” (is_a closure) • “find all gene products localised to endoplasmic reticulum” (part_of closure, over is_a) • Our solution (AmiGO & go-sqldb) • pre-compute transitive closure over all relations in db • (sort of) works for GO (for now) • refresh problem • explosive for tangled DAGs
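The pre-computed closure approach can be sketched in a few lines of Python. This is not the go-sqldb implementation: the toy graph, the gene product names and the function names are all hypothetical, and the sketch collapses is_a and part_of into a single "ancestor" relation rather than tracking which relation each inferred path carries.

```python
from collections import defaultdict

# Hypothetical toy ontology fragment: (child, relation, parent).
EDGES = [
    ("rough ER", "is_a", "endoplasmic reticulum"),
    ("ER membrane", "part_of", "endoplasmic reticulum"),
    ("rough ER membrane", "is_a", "ER membrane"),
    ("endoplasmic reticulum", "part_of", "cytoplasm"),
]

# Hypothetical annotations: (gene product, term it is annotated to).
ANNOTATIONS = [
    ("geneA", "rough ER membrane"),
    ("geneB", "cytoplasm"),
    ("geneC", "endoplasmic reticulum"),
]


def transitive_closure(edges):
    """Return every (descendant, ancestor) pair reachable over one or more edges."""
    parents = defaultdict(set)
    for child, _relation, parent in edges:
        parents[child].add(parent)
    closure = set()
    for start in list(parents):
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(parents.get(node, ()))
    return closure


def annotated_to(term, annotations, closure):
    """Gene products annotated to the term itself or to any of its descendants."""
    return sorted({gene for gene, t in annotations
                   if t == term or (t, term) in closure})


CLOSURE = transitive_closure(EDGES)
print(annotated_to("endoplasmic reticulum", ANNOTATIONS, CLOSURE))
```

The closure table has to be rebuilt whenever the ontology changes, which is the refresh problem noted on the slide, and its size grows quickly when terms have many parents.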
OBD requires more ontological awareness • Other relations • ontogenic (e.g. derives_from) • transitive_over • Other types of data • Pre- versus post-composed terms • E.g. MPO versus AO+PATO • E.g. Entity+Spatial qualifier • queries over either should be interchangeable
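One way to picture interchangeable queries over pre- and post-composed terms is to normalise every annotation to an entity/quality pair before matching. The Python sketch below assumes a hand-written decomposition table; the table contents, the allele names and the function names are hypothetical and only illustrate the idea.

```python
# Hypothetical decomposition of a pre-composed MPO term into entity + quality.
DECOMPOSITION = {
    "MPO:0000374": ("fur", "pink"),
}

# Hypothetical annotations: some pre-composed (a term ID string),
# some post-composed (an (entity, quality) tuple).
ANNOTATIONS = {
    "mouse_allele_1": "MPO:0000374",
    "mouse_allele_2": ("fur", "pink"),
    "mouse_allele_3": ("tail", "kinked"),
}


def normalise(annotation):
    """Map a pre-composed term ID to its (entity, quality) pair; pass tuples through."""
    if isinstance(annotation, str):
        return DECOMPOSITION[annotation]
    return annotation


def find(entity, quality):
    """Return the subjects whose annotation matches, however it was composed."""
    return sorted(subject for subject, annotation in ANNOTATIONS.items()
                  if normalise(annotation) == (entity, quality))


print(find("fur", "pink"))
```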
Solution: more expressive formalisms • QLs and APIs should provide and abstract away common ontology operations • ease of programming, optimisation • Choices • ‘Semweb’ databases • RDF + RDFS + Owl [ lite + DL ] + extra • lots to choose from, emerging standards • compatible with Obo v1.2 spec • Deductive databases • superset of relational databases • from Prolog to full CL
Modeling phenotypes as RDF/OWL or Obo instances • [Figure: classes/terms and their instances; entity and quality]
Example query in SeRQL • find mutations affecting the shape of the wing vein:
SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
FROM {EI} rdf:type {ET} rdfs:label {EN},
     {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
     {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
WHERE label(EN) = "wing vein" AND label(TaxN) = "Arthropoda" AND label(QN) = "ShapeValue"
• results of query on OBD-sesame: one annotation to “wing vein L2”, “branched”
Advantages of ‘SemWeb’ dbs • Advantages over pure SQL • The ontology is the model • constraints encoded in ontology • e.g. certain quality types only applicable to certain entity types • agile development - fast database integration • Rich modeling constructs • transitivity, subsumption, intersection, etc • powerful QLs and APIs • More (technical) interoperation ‘for free’ • URIs • proven? • Open World Assumption (maybe a hindrance?)
Disadvantages of ‘SemWeb’ dbs • Disadvantages • speed • may be slower than SQL • ..but in-memory execution is fast • lack of maturity • new technology.. but has a LOT of momentum • foundations • are RDF triples appropriate? • inherent difficulties modeling time • SQL allows n-ary relations/predicates
Hybrid model • SemWeb dbs are commonly layered over SQL DBs • We can have the best of both worlds • Data View layers • mapping between Obo/OWL model and domain-specific relational schema • (optionally) materialized for speed • different applications use appropriate layer
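The data view layer can be pictured as an SQL view that exposes a domain-specific table as subject/predicate/object triples. The sketch below uses Python's built-in `sqlite3`; the table name, column names, predicate names and the genotype identifier are all hypothetical and do not reflect the Chado or OBD schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical domain-specific relational table.
CREATE TABLE phenotype_annotation (
    genotype_id TEXT,
    entity_id   TEXT,
    quality_id  TEXT
);
INSERT INTO phenotype_annotation
    VALUES ('GENO:0001', 'FBbt:00005095', 'PATO:0000642');

-- View layer: the same rows presented as generic triples.
CREATE VIEW triple AS
    SELECT genotype_id AS subject,
           'has_phenotype_entity' AS predicate,
           entity_id AS object
    FROM phenotype_annotation
    UNION ALL
    SELECT entity_id, 'has_quality', quality_id
    FROM phenotype_annotation;
""")

rows = conn.execute(
    "SELECT subject, predicate, object FROM triple ORDER BY predicate"
).fetchall()
for row in rows:
    print(row)
```

Copying the view's rows into a real table would be one way to materialise it for speed, at the cost of keeping the copy in sync with the source table.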
Current progress: OBD-Sesame • Sesame • open source ‘triple store’ • based on Jena • also used in Protégé-OWL • storage layer options • mysql/postgresql generic schema • in-memory • disk-based
OBD in Sesame: current datasets • Pheno • ZFIN & FB : EAV trial 2003 data • Test ortholog set • FB ‘simple phenotype’ alleles • ZFIN legacy phenotype data, automatically parsed to EAV • Ontologies: AOs, PATO, Cell, GO • Method • excel & flatfiles->pheno-xml->owl • OWL from http://www.fruitfly.org/~cjm/obo-download • Trialbank • Method: ocelot->obo-xml->owl • Soon • human orthologs and omim
Technology Evaluation: Sesame • Use case query set • Benchmarks • preliminary conclusions • SQL layering is terrible • in-memory is fast • optimisations? • other triple stores? • up to date results on wiki • http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark • Need to test OWL-DL entailment • Bigger dataset required for full evaluations • Community effort: pub-semweb-lifesci list
Parallel development: an OBD Prototype • Initiated prior to OBD-Sesame • Simple deductive database • prolog-based • chado-like schema • can be views on Obo/OWL predicates • amigo-clone user interface • Rapid prototyping • Current dataset • as obd-sesame, plus CT • trivial to drop in more
Example logic query • find mutations affecting the shape of some part of the head capsule
inheres(QI,EI) & inst(QI,QT) & label(QT,shape) & inst(EI,ETP) & part_of*(ETP,ET) & label(ET,'head capsule')
• results of query on OBD-prolog: one annotation to “arista lateral”, “irregular shape”
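To make the role of `part_of*` concrete, here is a small Python emulation of the logic query over a handful of facts. It is not the OBD-prolog code: the instance names, the anatomy hierarchy and the function names are hypothetical, labels stand in for term identifiers, and the sketch assumes that a quality instance typed as "irregular shape" also counts as a "shape" through is_a.

```python
# Hypothetical facts; labels double as identifiers for brevity.
PART_OF = {("arista lateral", "antenna"), ("antenna", "head capsule")}
IS_A = {("irregular shape", "shape")}
INST = {
    ("q1", "irregular shape"), ("e1", "arista lateral"),
    ("q2", "irregular shape"), ("e2", "wing vein"),
}
INHERES = {("q1", "e1"), ("q2", "e2")}


def ancestors(term, edges):
    """Reflexive transitive closure: the term plus everything reachable via edges."""
    found = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parent in edges:
            if child == current and parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


def query(quality_label, whole_label):
    """Find (entity type, quality type) pairs where the quality is a kind of
    quality_label and the entity type is whole_label or a part of it."""
    results = []
    for qi, ei in sorted(INHERES):
        quality_types = sorted(t for i, t in INST if i == qi)
        entity_types = sorted(t for i, t in INST if i == ei)
        for qt in quality_types:
            if quality_label not in ancestors(qt, IS_A):
                continue
            for etp in entity_types:
                if whole_label in ancestors(etp, PART_OF):
                    results.append((etp, qt))
    return results


print(query("shape", "head capsule"))
```

Here the closure is computed at query time rather than stored, which avoids the refresh problem of a pre-computed table but shifts the cost to each query.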
OBD TODO • Pheno-xml • finalise release version • finalise Obo/OWL mapping • logic specification • Data • orthologies • OBD - BioPortal integration • how will it work? • Versioning and reconciling changes • decide on ontology versioning first
OBD dependencies • PATO development • UMLS into OBO-site • Ontologies • FMA accessibility? • species-centric AO alignments (XSPAN?) • Sept meeting on AO development • Nov meeting on disease ontologies • Data • MOD pheno annotation • OMIM annotation • Bioportal
Misc • NLP for phenote • Obol • trial on evolutionary phenotype characters • cambridge NLP project • can be used to ‘prime’ phenote • Decomposing MPO • pink fur: def = fur, has_quality: pink
Discussion • Will SemWeb dbs work? • experiment • Ontology-based modeling • the ontology is the model • importance of • relations ontology • upper ontology
Demos • http://yuri.lbl.gov/amigo/ct • http://yuri.lbl.gov/amigo/obd • http://spade.lbl.gov:8080/sesame/actionFrameset.jsp?repository=mem-rdfs-db