Bioinformatics Project for Drosophila Genome Annotation

Core 2: Bioinformatics NCBO-Berkeley

Berkeley Drosophila Genome Project • Finish the sequence of the euchromatic genome of Drosophila melanogaster • Annotate biologically important features of this sequence • Produce gene disruptions using P element-mediated mutagenesis • Full-length sequencing and expression characterization of a cDNA for every gene • Develop informatics tools

Who is here from NCBO-Berkeley • Chris • Shu • Mark • Sima

Chris • GadFly database schema • GO database schema • Chado database schema • Perl libraries for all • OBD data architect

Shu • OBD dev & Data flow • AmiGO, ImaGO & database • Compute Pipeline

Mark • Apollo Genome Annotation Editor • Phenote and other OBD interfaces

Sima • Adh region annotation • Annotation of entire Drosophila Genome • Project manager and coordinator nonpareil • Associate Director

OBD Outline • Core 2 aims, refresher • Data models for OBD • phenotypes • clinical trials • others • Modeling frameworks • exchange formats • database system • SQL based vs ‘SemWeb’ dbs • Progress • Demo

Core 2 Specific Aims • Apply ontologies • Software toolkit for describing and classifying data • Capture, manage, and view data annotations • Database (OBD) and interfaces to store and view annotations • Investigate and compare implications • Linking human diseases to model systems • Maintain • Ongoing reconciliation of ontologies with annotations

Core 3 Driving Biological Projects • DBPs • phenotypes: Fly and Zebrafish to human • clinical trials • Core 2 Aims • Apply ontologies to describe data • Capture, manage, and view data annotations • Link disease genes to model systems • Reconcile annotation and ontology changes

Apply ontologies to describe data • Requirements • Data capture tools • phenote • demo tomorrow • no tool requirements from UCSF • Data model • Database (OBD) • --aim 2

data flow

user’s view

Data models • Common/shared domain specific models • Aim 3 • linking disease genes • model must support this • granularity • comparability

Domain specific data models • FB, ZFIN • genotype to phenotype • ‘EAV’ • qualities inhere in entities • orthologs • phenotype to disease • core 2 will help define common model • UCSF • clinical trials • existing ontology-friendly schema - trialbank

Phenotype data model • Qualities inhere in entities • Entity term; PATO term • brain FBbt:00005095; fused PATO:0000642 • gut MA:0000917; dysplastic PATO:0000640 • tail fin ZDB:020702-16; ventralized PATO:0000636 • kidney ZDB:020702-16; hypertrophied PATO:0000636 • midface ZDB:020702-16; hypoplastic PATO:0000636 • Pre-composed phenotype terms • Mammalian Phenotype Ontology • “increased activated B-cell number” MPO:0000319 • “pink fur hue” MPO:0000374
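The entity-plus-quality pairing on this slide can be sketched as a small data structure. This is only an illustration: the class names `Term` and `Phenotype` and the `describe` helper are assumptions made for the example, not part of OBD or pheno-xml. The identifiers are copied from the first two rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """An ontology term: an identifier plus a human-readable label."""
    id: str
    label: str


@dataclass(frozen=True)
class Phenotype:
    """A post-composed phenotype: a quality that inheres in an entity."""
    entity: Term
    quality: Term

    def describe(self) -> str:
        # Render as "<quality> <entity>", e.g. "fused brain".
        return f"{self.quality.label} {self.entity.label}"


# IDs taken from the slide; the Python names are hypothetical.
fused_brain = Phenotype(Term("FBbt:00005095", "brain"), Term("PATO:0000642", "fused"))
dysplastic_gut = Phenotype(Term("MA:0000917", "gut"), Term("PATO:0000640", "dysplastic"))

print(fused_brain.describe())
print(dysplastic_gut.describe())
```

A pre-composed term such as an MPO class would instead be a single `Term`; the pair form keeps entity and quality separately queryable.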

Extensions to simple model • What about • Relational attributes • Quantitative vs qualitative • Post-composing entity and attribute terms • Relative states/values • Variation in place, space and time • A better treatment of absence • See CSHL Pheno meeting talk • also, more detailed formal presentation (available) • Not to mention genotypes, environments, provenance, etc

Modeling clinical trials • Model already described using frame-based schema • Further modeling required? • abstraction • to integrate more with other OBD datatypes • views • to only show parts relevant to OBD/BioPortal

Future DBPs and use cases • OBD will contain a variety of general types of data • Modeling is expensive • use existing models where appropriate • but whole must be cohesive and integrated • Most of this talk focuses on the pheno DBPs for illustrative purposes

Modeling frameworks • language • technology

Modeling data: underlying formalism • Model is expressed with modeling language • Options • Relational/SQL • Semi-structured, XML • Object-centric (UML, frame-based?) • Logic based • description logic: e.g. OWL • first-order logic: e.g. CL • Natural language descriptions • Model should be independent of language it is expressed in

Data exchange language: XML • Simple • XML is suited for data exchange • XML can drive software spec • constrains programmatic data model • XSD can generate UML • closed world assumption is useful • cf Ruttenberg et al • Mature technology • well understood by developers, MODs • standards

How OBD uses XML • obd-geno-pheno-xml (aka pheno-xml) • actually multiple modular components • genotype schema • phenotype schema: ‘EAV’ • environment schema • provenance schema • used as • exchange format • cf: gene ontology association files • no need for ClinicalTrials-XML

Example pheno-xml
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
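As a rough illustration of using pheno-xml as an exchange format, the sketch below reads a fragment shaped like the example above and flattens it into EAV-style rows. It uses only the Python standard library. The function name `extract_eav` is an assumption for this example, and the quality identifier is an explicit placeholder because the slide elides it.

```python
import xml.etree.ElementTree as ET

# Fragment modeled on the slide. "PATO:PLACEHOLDER" stands in for the
# quality ID that the slide elides; it is not a real PATO identifier.
PHENO_XML = """
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:PLACEHOLDER">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
"""


def extract_eav(xml_text):
    """Return (genotype, entity, quality, state, stage) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for entity in root.iter("entity"):
        for quality in entity.iter("quality"):
            for state in quality.iter("state"):
                stage = state.find("time_range")
                rows.append((
                    root.get("id"),
                    entity.get("type"),
                    quality.get("type"),
                    state.get("type"),
                    stage.get("type") if stage is not None else None,
                ))
    return rows


rows = extract_eav(PHENO_XML)
for row in rows:
    print(row)
```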

SQL Databases • Data storage, management and querying • all MODs use SQL dbs • Lots of advantages • scalable, standard QL, mature, APIs, etc • pure relational model is reasonably formal • XML/SQL more or less compatible • low impedance mismatch

Schemas for geno-pheno data • We already have schema: Chado • Used by many MODs (eg FB) • others are ‘chado compliant’ (eg ZFIN) • Modular • ontologies • genomic • genotype • phenotype • phylogenies • …etc • Phenotype module needs updating • will be driven by pheno-xml

Problem solved? • We have two mature, complementary technologies, and can define schemas for our model in an appropriate formalism for each • Is this enough to work with?

Issues • OBD will be much more than geno-pheno • clinical trials • future DBPs, other NCBCs • any data expressed in an ontology language • Software and schema development expensive • fragility in face of schema evolution • development gets bogged down in data exchange issues

Major issue • SQL and XPath work great for ‘traditional’ data… • …but are too low level for ontology-centric data • lack of inference • no way to directly express ontology constraints

Use cases from previous experience: AmiGO • GO • “find all TF genes” (is_a closure) • “find all gene products localised to endoplasmic reticulum” (part_of closure, over is_a) • Our solution (AmiGO & go-sqldb) • pre-compute transitive closure over all relations in db • (sort of) works for GO (for now) • refresh problem • explosive for tangled DAGs
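The pre-computed closure approach can be sketched in a few lines. This is a toy illustration, not the go-sqldb implementation: the edge list is hypothetical, and the composition rule (a path made only of is_a links is is_a; any path containing a part_of link is part_of) is an assumption chosen to match the "part_of closure, over is_a" use case.

```python
from collections import defaultdict

# Hypothetical toy edges: (child, relation, parent).
EDGES = [
    ("rough ER", "is_a", "endoplasmic reticulum"),
    ("endoplasmic reticulum", "part_of", "cytoplasm"),
    ("cytoplasm", "part_of", "cell"),
    ("ER membrane", "part_of", "endoplasmic reticulum"),
]


def closure(edges):
    """Pre-compute every inferred (descendant, relation, ancestor) row."""
    direct = defaultdict(set)
    for child, rel, parent in edges:
        direct[child].add((rel, parent))
    result = set()
    for start in list(direct):
        stack = list(direct[start])
        while stack:
            rel, node = stack.pop()
            if (start, rel, node) in result:
                continue  # already derived; also guards against cycles
            result.add((start, rel, node))
            for next_rel, parent in direct.get(node, ()):
                # Assumed rule: is_a followed by is_a stays is_a, else part_of.
                combined = "is_a" if rel == next_rel == "is_a" else "part_of"
                stack.append((combined, parent))
    return result


CLOSURE = closure(EDGES)


def part_of_descendants(term):
    """Everything that is part_of `term`, directly or by inference."""
    return sorted(d for d, rel, a in CLOSURE if a == term and rel == "part_of")


print(len(EDGES), "asserted edges ->", len(CLOSURE), "closure rows")
print(part_of_descendants("cytoplasm"))
```

Even in this tiny graph the closure has more than twice as many rows as the asserted edges, which hints at why the table becomes explosive for tangled DAGs and must be rebuilt whenever the ontology changes.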

OBD requires more ontological awareness • Other relations • ontogenic (e.g. derives_from) • transitive_over • Other types of data • Pre- versus post-composed terms • E.g. MPO versus AO+PATO • E.g. Entity+Spatial qualifier • queries over either should be interchangeable

Solution: more expressive formalisms • QLs and APIs should provide and abstract away common ontology operations • ease of programming, optimisation • Choices • ‘SemWeb’ databases • RDF + RDFS + OWL [ Lite + DL ] + extra • lots to choose from, emerging standards • compatible with Obo v1.2 spec • Deductive databases • superset of relational databases • from Prolog to full CL

Modeling phenotypes as RDF/OWL or Obo instances • [Figure: classes/terms and their instances; entity and quality]

Example query in SeRQL • find mutations affecting the shape of the wing vein:
SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
FROM {EI} rdf:type {ET} rdfs:label {EN},
     {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
     {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
WHERE label(EN) = "wing vein" AND label(TaxN) = "Arthropoda" AND label(QN) = "ShapeValue"
results of query on OBD-sesame: one annotation to “wing vein L2”, “branched”
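To make the shape of this query concrete, here is a pure-Python sketch of the same pattern over a handful of triples. Everything in it is an assumption for illustration: the instance names (`vein1`, `fly1`, `q1`), the class hierarchy, and the helper functions are made up, and the subclass walk only approximates the RDFS inference a triple store would apply.

```python
# Hypothetical toy triples: (subject, predicate, object).
TRIPLES = {
    ("vein1", "rdf:type", "wing vein L2"),
    ("wing vein L2", "rdfs:subClassOf", "wing vein"),
    ("vein1", "part_of", "fly1"),
    ("fly1", "rdf:type", "Arthropoda"),
    ("vein1", "has_quality", "q1"),
    ("q1", "rdf:type", "branched"),
    ("branched", "rdfs:subClassOf", "ShapeValue"),
}


def objects(subject, predicate):
    """All objects of triples matching (subject, predicate, ?)."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def inferred_types(instance):
    """Asserted types plus all their superclasses."""
    found, todo = set(), list(objects(instance, "rdf:type"))
    while todo:
        cls = todo.pop()
        if cls not in found:
            found.add(cls)
            todo.extend(objects(cls, "rdfs:subClassOf"))
    return found


def shape_of_wing_vein():
    """Entities typed 'wing vein', part of an arthropod, with a shape quality."""
    hits = []
    for entity, pred, quality in sorted(TRIPLES):
        if pred != "has_quality" or "wing vein" not in inferred_types(entity):
            continue
        organisms = objects(entity, "part_of")
        if not any("Arthropoda" in inferred_types(o) for o in organisms):
            continue
        if "ShapeValue" in inferred_types(quality):
            hits.append((
                entity,
                sorted(objects(entity, "rdf:type")),
                sorted(objects(quality, "rdf:type")),
            ))
    return hits


print(shape_of_wing_vein())
```

The query asks for "wing vein" and "ShapeValue" but the match is found through the more specific asserted types, which is the inference step that plain SQL or XPath would not supply.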

Advantages of ‘SemWeb’ dbs • Advantages over pure SQL • The ontology is the model • constraints encoded in ontology • e.g. certain quality types only applicable to certain entity types • agile development - fast database integration • Rich modeling constructs • transitivity, subsumption, intersection, etc • powerful QLs and APIs • More (technical) interoperation ‘for free’ • URIs • proven? • Open World Assumption (maybe a hindrance?)

Disadvantages of ‘SemWeb’ dbs • Disadvantages • speed • may be slower than SQL • …but in-memory execution is fast • lack of maturity • new technology, but has a LOT of momentum • foundations • are RDF triples appropriate? • inherent difficulties modeling time • SQL allows n-ary relations/predicates

Hybrid model • SemWeb dbs are commonly layered over SQL DBs • We can have the best of both worlds • Data View layers • mapping between Obo/OWL model and domain-specific relational schema • (optionally) materialized for speed • different applications use appropriate layer
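One way to picture the data view layer is a SQL view that exposes a domain-specific table as generic triples. The sketch below uses SQLite from the Python standard library. The table layout, the view name, and the predicate names are all hypothetical; the real mapping would be defined against Chado and the Obo/OWL model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical domain-specific table (not the Chado phenotype module).
CREATE TABLE phenotype (genotype TEXT, entity TEXT, quality TEXT);
INSERT INTO phenotype VALUES ('ZFIN:tm84', 'ZDB-ANAT-010921-528', 'PATO:0000636');

-- View layer: expose each row as (subject, predicate, object) statements.
CREATE VIEW triple AS
  SELECT genotype AS subject,
         'has_phenotype_entity' AS predicate,
         entity AS object
    FROM phenotype
  UNION ALL
  SELECT entity, 'has_quality', quality FROM phenotype;
""")

triples = conn.execute(
    "SELECT subject, predicate, object FROM triple ORDER BY predicate"
).fetchall()

for triple in triples:
    print(triple)
```

Applications that want the relational schema query `phenotype` directly, while ontology-aware tools read `triple`; materializing the view as a table would trade freshness for speed.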

Current progress: OBD-Sesame • Sesame • open source ‘triple store’ • based on Jena • also used in Protégé-OWL • storage layer options • mysql/postgresql generic schema • in-memory • disk-based

OBD in Sesame: current datasets • Pheno • ZFIN & FB : EAV trial 2003 data • Test ortholog set • FB ‘simple phenotype’ alleles • ZFIN legacy phenotype data, automatically parsed to EAV • Ontologies: AOs, PATO, Cell, GO • Method • excel & flatfiles->pheno-xml->owl • OWL from http://www.fruitfly.org/~cjm/obo-download • Trialbank • Method: ocelot->obo-xml->owl • Soon • human orthologs and omim

Technology Evaluation: Sesame • Use case query set • Benchmarks • preliminary conclusions • SQL layering is terrible • in-memory is fast • optimisations? • other triple stores? • up to date results on wiki • http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark • Need to test OWL-DL entailment • Bigger dataset required for full evaluations • Community effort: pub-semweb-lifesci list

Parallel development: an OBD Prototype • Initiated prior to OBD-Sesame • Simple deductive database • prolog-based • chado-like schema • can be views on Obo/OWL predicates • amigo-clone user interface • Rapid prototyping • Current dataset • as obd-sesame, plus CT • trivial to drop in more

Example logic query • find mutations affecting the shape of some part of the head capsule:
inheres(QI,EI) & inst(QI,QT) & label(QT,shape) &
inst(EI,ETP) & part_of*(ETP,ET) & label(ET,'head capsule')
results of query on OBD-prolog: one annotation to “arista lateral”, “irregular shape”
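The logic query above can be approximated in Python to show what `part_of*` (reflexive transitive closure) does at query time, without a pre-computed table. The facts below are hypothetical: the instance identifiers and the part_of chain are invented for the example, and treating "irregular shape" as a subtype of "shape" is an assumption about how the prototype matched the label.

```python
# Hypothetical facts; labels double as type identifiers for brevity.
PART_OF = {
    ("arista lateral", "arista"),
    ("arista", "antenna"),
    ("antenna", "head capsule"),
}
IS_A = {("irregular shape", "shape")}
INST = {("q1", "irregular shape"), ("e1", "arista lateral")}  # inst(Instance, Type)
INHERES = {("q1", "e1")}  # inheres(QualityInstance, EntityInstance)


def reachable(start, edges):
    """Reflexive transitive closure: start plus everything reachable from it."""
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for child, parent in edges:
            if child == node and parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen


def shape_phenotypes_in(part_label):
    """Pairs (entity type, quality type) where a shape quality inheres in
    an entity that is part of the given structure."""
    results = []
    for qi, ei in sorted(INHERES):
        for q_inst, qt in sorted(INST):
            if q_inst != qi or "shape" not in reachable(qt, IS_A):
                continue
            for e_inst, etp in sorted(INST):
                if e_inst == ei and part_label in reachable(etp, PART_OF):
                    results.append((etp, qt))
    return results


print(shape_phenotypes_in("head capsule"))
```

Because the closure is computed when the query runs, there is no refresh problem when the ontology changes, at the cost of more work per query.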

OBD TODO • Pheno-xml • finalise release version • finalise Obo/OWL mapping • logic specification • Data • orthologies • OBD - BioPortal integration • how will it work? • Versioning and reconciling changes • decide on ontology versioning first

OBD dependencies • PATO development • UMLS into OBO-site • Ontologies • FMA accessibility? • species-centric AO alignments (XSPAN?) • Sept meeting on AO development • Nov meeting on disease ontologies • Data • MOD pheno annotation • OMIM annotation • Bioportal

Misc • NLP for phenote • Obol • trial on evolutionary phenotype characters • cambridge NLP project • can be used to ‘prime’ phenote • Decomposing MPO • pink fur • def = fur, has_quality: pink

Discussion • Will SemWeb dbs work? • experiment • Ontology-based modeling • the ontology is the model • importance of • relations ontology • upper ontology

Demos • http://yuri.lbl.gov/amigo/ct • http://yuri.lbl.gov/amigo/obd • http://spade.lbl.gov:8080/sesame/actionFrameset.jsp?repository=mem-rdfs-db
