Core 2: Bioinformatics

Core 2: Bioinformatics NCBO-Berkeley

Berkeley Drosophila Genome Project • Finished the sequence of the euchromatic genome of Drosophila melanogaster • Annotated biologically important features of this sequence • Produced gene disruptions using P element-mediated mutagenesis • Full-length sequencing and expression characterization of a cDNA for every gene • Developing informatics tools

Who is here from NCBO-Berkeley • Chris • Shu • Mark • Sima

Chris • GadFly database schema • GO database schema • Chado database schema • Perl libraries for all • OBD data architect

Shu • OBD dev & Data flow • AmiGO, ImaGO & database • Compute Pipeline

Mark • Apollo Genome Annotation Editor • Phenote and other OBD interfaces

Sima • Adh region annotation • Annotation of entire Drosophila Genome • Project manager and coordinator nonpareil • Associate Director

OBD Outline • Core 2 aims, refresher • Data models for OBD • phenotypes • clinical trials • others • Modeling frameworks • exchange formats • database system • SQL based vs ‘SemWeb’ dbs • Progress • Demo

Core 2 Specific Aims • Apply ontologies • Software toolkit for describing and classifying data • Capture, manage, and view data annotations • Database (OBD) and interfaces to store and view annotations • Investigate and compare implications • Linking human diseases to model systems • Maintain • Ongoing reconciliation of ontologies with annotations

Core 3 Driving Biological Projects • DBPs • phenotypes: Fly and Zebrafish to human • clinical trials • Core 2 Aims • Apply ontologies to describe data • Capture, manage, and view data annotations • Link disease genes to model systems • Reconcile annotation and ontology changes

Apply ontologies to describe data • Requirements • Data capture tools • phenote • demo tomorrow • no tool requirements from UCSF • Data model • Database (OBD) • --aim 2

data flow

user’s view

Data models • Common/shared domain specific models • Aim 3 • linking disease genes • model must support this • granularity • comparability

Domain specific data models • FB, ZFIN • genotype to phenotype • ‘EAV’ • qualities inhere in entities • orthologs • phenotype to disease • core 2 will help define common model • UCSF • clinical trials • existing ontology-friendly schema - trialbank

Phenotype data model • Qualities inhere in entities • Entity term; PATO term • brain FBbt:00005095; fused PATO:0000642 • gut MA:0000917; dysplastic PATO:0000640 • tail fin ZDB:020702-16; ventralized PATO:0000636 • kidney ZDB:020702-16; hypertrophied PATO:0000636 • midface ZDB:020702-16; hypoplastic PATO:0000636 • Pre-composed phenotype terms • Mammalian Phenotype Ontology • “increased activated B-cell number” MPO:0000319 • “pink fur hue” MPO:0000374
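The entity-plus-quality pairing on this slide could be held in code roughly as follows. This is a minimal sketch, not the OBD implementation: the class names `Term` and `Phenotype` are hypothetical, and the IDs are copied from the slide unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """An ontology term reference: an ID plus a human-readable label."""
    id: str
    label: str


@dataclass(frozen=True)
class Phenotype:
    """A post-composed phenotype: a quality that inheres in an entity."""
    entity: Term
    quality: Term

    def describe(self) -> str:
        # Render as "<quality> <entity>", e.g. "fused brain".
        return f"{self.quality.label} {self.entity.label}"


# First example row from the slide.
fused_brain = Phenotype(
    entity=Term("FBbt:00005095", "brain"),
    quality=Term("PATO:0000642", "fused"),
)
print(fused_brain.describe())
```

Keeping entity and quality as separate term references is what lets the same PATO quality be reused across anatomy ontologies from different species.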

Extensions to simple model • What about • Relational attributes • Quantitative vs qualitative • Post-composing entity and attribute terms • Relative states/values • Variation in place, space and time • A better treatment of absence • See CSHL Pheno meeting talk • also, more detailed formal presentation (available) • Not to mention genotypes, environments, provenance, etc

Modeling clinical trials • Model already described using frame-based schema • Further modeling required? • abstraction • to integrate more with other OBD datatypes • views • to only show parts relevant to OBD/BioPortal

Future DBPs and use cases • OBD will contain a variety of general types of data • Modeling is expensive • use existing models where appropriate • but whole must be cohesive and integrated • Most of this talk focuses on the pheno DBPs for illustrative purposes

Modeling frameworks • language • technology

Modeling data: underlying formalism • Model is expressed with modeling language • Options • Relational/SQL • Semi-structured, XML • Object-centric (UML, frame-based?) • Logic based • description logic: e.g. OWL • first-order logic: e.g. CL • Natural language descriptions • Model should be independent of language it is expressed in

Data exchange language: XML • Simple • XML is suited for data exchange • XML can drive software spec • constrains programmatic data model • XSD can generate UML • closed world assumption is useful • cf Ruttenberg et al • Mature technology • well understood by developers, MODs • standards

How OBD uses XML • obd-geno-pheno-xml (aka pheno-xml) • actually multiple modular components • genotype schema • phenotype schema: ‘EAV’ • environment schema • provenance schema • used as • exchange format • cf: gene ontology association files • no need for ClinicalTrials-XML

Example pheno-xml
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
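A consumer of this exchange format might flatten each association into a row, as in the sketch below. It uses only Python's standard `xml.etree.ElementTree`; the function name `extract_eav` and the row layout are assumptions for illustration, and the elided quality ID is kept exactly as it appears on the slide.

```python
import xml.etree.ElementTree as ET

# The slide's example, with the closing </genotype> tag supplied.
PHENO_XML = """
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
"""


def extract_eav(xml_text):
    """Return (genotype, entity, quality, state, stage) tuples."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for entity in root.iter("entity"):
        for quality in entity.iter("quality"):
            for state in quality.iter("state"):
                stage = state.find("time_range")
                rows.append((
                    root.get("id"),
                    entity.get("type"),
                    quality.get("type"),
                    state.get("type"),
                    stage.get("type") if stage is not None else None,
                ))
    return rows


rows = extract_eav(PHENO_XML)
for row in rows:
    print(row)
```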

SQL Databases • Data storage, management and querying • all MODs use SQL dbs • Lots of advantages • scalable, standard QL, mature, APIs, etc • pure relational model is reasonably formal • XML/SQL more or less compatible • low impedance mismatch

Schemas for geno-pheno data • We already have schema: Chado • Used by many MODs (eg FB) • others are ‘chado compliant’ (eg ZFIN) • Modular • ontologies • genomic • genotype • phenotype • phylogenies • …etc • Phenotype module needs updating • will be driven by pheno-xml

Problem solved? • We have two mature, complementary technologies, and can define schemas for our model in an appropriate formalism for each • Is this enough to work with?

Issues • OBD will be much more than geno-pheno • clinical trials • future DBPs, other NCBCs • any data expressed in an ontology language • Software and schema development expensive • fragility in face of schema evolution • development gets bogged down in data exchange issues

Major issue • SQL and XPath work great for ‘traditional’ data… • …but are too low level for ontology-centric data • lack of inference • no way to directly express ontology constraints

Use cases from previous experience: AmiGO • GO • “find all TF genes” (is_a closure) • “find all gene products localised to endoplasmic reticulum” (part_of closure, over is_a) • Our solution (AmiGO & go-sqldb) • pre-compute transitive closure over all relations in db • (sort of) works for GO (for now) • refresh problem • explosive for tangled DAGs
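The pre-computed closure approach can be illustrated with a toy graph. This is a sketch only, not the go-sqldb code: the edges, gene names, and function names are invented, and the closure here follows every relation type alike, which is a simplification.

```python
# Toy (child, relation, parent) edges; illustrative, not real GO content.
EDGES = [
    ("rough ER", "is_a", "endoplasmic reticulum"),
    ("ER membrane", "part_of", "endoplasmic reticulum"),
    ("rough ER membrane", "is_a", "ER membrane"),
    ("endoplasmic reticulum", "part_of", "cytoplasm"),
]

# Hypothetical (gene product, annotated term) pairs.
ANNOTATIONS = [
    ("geneA", "rough ER membrane"),
    ("geneB", "cytoplasm"),
    ("geneC", "rough ER"),
]


def transitive_closure(edges):
    """Map each term to itself plus every ancestor reachable over any relation."""
    parents = {}
    for child, _relation, parent in edges:
        parents.setdefault(child, set()).add(parent)
        parents.setdefault(parent, set())
    closure = {}
    for term in parents:
        seen, stack = {term}, [term]
        while stack:
            for parent in parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        closure[term] = seen
    return closure


# Computed once up front; it must be rebuilt whenever the ontology changes,
# which is the "refresh problem" named on the slide.
CLOSURE = transitive_closure(EDGES)


def genes_annotated_to(term):
    """Genes annotated to the term or to any of its descendants."""
    return {
        gene
        for gene, annotated in ANNOTATIONS
        if term in CLOSURE.get(annotated, {annotated})
    }


print(sorted(genes_annotated_to("endoplasmic reticulum")))
```

The closure table grows with the number of ancestor paths per term, which is why the slide calls the approach explosive for tangled DAGs.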

OBD requires more ontological awareness • Other relations • ontogenic (e.g. derives_from) • transitive_over • Other types of data • Pre- versus post- composed terms • E.g. MPO versus AO+PATO • E.g. Entity+Spatial qualifier • queries over either should be interchangeable
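One way to make queries interchangeable across pre- and post-composed annotations is to normalise both to entity/quality pairs before matching. The sketch below assumes a decomposition table; its single entry uses the deck's "pink fur hue" term (MPO:0000374), while the allele names and function names are hypothetical.

```python
# Pre-composed term ID -> (entity label, quality label).
# Assumed decomposition of "pink fur hue"; a real table would use term IDs.
DECOMPOSITION = {"MPO:0000374": ("fur", "pink")}

# Hypothetical annotations: a1 is pre-composed, a2 and a3 are post-composed.
ANNOTATIONS = {
    "a1": "MPO:0000374",
    "a2": ("fur", "pink"),
    "a3": ("fur", "white"),
}


def normalise(annotation):
    """Return an (entity, quality) pair for either annotation style."""
    if isinstance(annotation, str):
        return DECOMPOSITION[annotation]
    return annotation


def find(entity, quality):
    """Alleles whose phenotype matches, regardless of how it was composed."""
    return sorted(
        allele
        for allele, annotation in ANNOTATIONS.items()
        if normalise(annotation) == (entity, quality)
    )


print(find("fur", "pink"))
```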

Solution: more expressive formalisms • QLs and APIs should provide and abstract away common ontology operations • ease of programming, optimisation • Choices • ‘Semweb’ databases • RDF + RDFS + Owl [ lite + DL ] + extra • lots to choose from, emerging standards • compatible with Obo v1.2 spec • Deductive databases • superset of relational databases • from Prolog to full CL

Modeling phenotypes as RDF/OWL or Obo instances • [Figure: classes/terms and instances; entity, quality]

Example query in SeRQL • find mutations affecting the shape of the wing vein:
SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
FROM {EI} rdf:type {ET} rdfs:label {EN},
     {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
     {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
WHERE label(EN) = "wing vein"
  AND label(TaxN) = "Arthropoda"
  AND label(QN) = "ShapeValue"
Results of query on OBD-sesame: one annotation to “wing vein L2”, “branched”
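The reported hit is to “wing vein L2” rather than to “wing vein” itself, which suggests the store infers types through subclass links. The pure-Python sketch below imitates that behaviour on a handful of invented triples. It does not use Sesame or SeRQL, it omits the taxon constraint, and every `ex:` identifier is made up for illustration.

```python
# Toy triple store as a set of (subject, predicate, object) tuples.
TRIPLES = {
    ("ex:WingVeinL2", "rdfs:subClassOf", "ex:WingVein"),
    ("ex:WingVein", "rdfs:label", "wing vein"),
    ("ex:WingVeinL2", "rdfs:label", "wing vein L2"),
    ("ex:Branched", "rdfs:subClassOf", "ex:ShapeValue"),
    ("ex:ShapeValue", "rdfs:label", "ShapeValue"),
    ("ex:Branched", "rdfs:label", "branched"),
    ("ex:vein1", "rdf:type", "ex:WingVeinL2"),
    ("ex:q1", "rdf:type", "ex:Branched"),
    ("ex:vein1", "OBO_REL_has_quality", "ex:q1"),
}


def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return {o for (s, p, o) in TRIPLES if s == subject and p == predicate}


def superclasses(cls):
    """Reflexive transitive closure over rdfs:subClassOf."""
    seen, stack = {cls}, [cls]
    while stack:
        for sup in objects(stack.pop(), "rdfs:subClassOf"):
            if sup not in seen:
                seen.add(sup)
                stack.append(sup)
    return seen


def has_type_labelled(instance, label):
    """True if any asserted or inferred type of the instance has this label."""
    for asserted in objects(instance, "rdf:type"):
        for cls in superclasses(asserted):
            if label in objects(cls, "rdfs:label"):
                return True
    return False


def query(entity_label, quality_label):
    """Entity/quality instance pairs joined through has_quality."""
    return sorted(
        (ei, qi)
        for (ei, p, qi) in TRIPLES
        if p == "OBO_REL_has_quality"
        and has_type_labelled(ei, entity_label)
        and has_type_labelled(qi, quality_label)
    )


print(query("wing vein", "ShapeValue"))
```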

Advantages of ‘SemWeb’ dbs • Advantages over pure SQL • The ontology is the model • constraints encoded in ontology • e.g. certain quality types only applicable to certain entity types • agile development - fast database integration • Rich modeling constructs • transitivity, subsumption, intersection, etc • powerful QLs and APIs • More (technical) interoperation ‘for free’ • URIs • proven? • Open World Assumption (maybe a hindrance?)

Disadvantages of ‘SemWeb’ dbs • Disadvantages • speed • may be slower than SQL • ..but in-memory execution is fast • lack of maturity • new technology.. but has a LOT of momentum • foundations • are RDF triples appropriate? • inherent difficulties modeling time • SQL allows n-ary relations/predicates

Hybrid model • SemWeb dbs are commonly layered over SQL DBs • We can have the best of both worlds • Data View layers • mapping between Obo/OWL model and domain-specific relational schema • (optionally) materialized for speed • different applications use appropriate layer
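The view-layer idea can be sketched with SQLite from Python's standard library: a domain-specific table is exposed as generic triples through a view, which can optionally be materialized. The table, view, and predicate names below are hypothetical and are not taken from Chado or OBD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Domain-specific relational schema (hypothetical, heavily simplified).
CREATE TABLE phenotype_annotation (
    genotype TEXT,
    entity   TEXT,
    quality  TEXT
);
INSERT INTO phenotype_annotation VALUES
    ('ZFIN:tm84', 'ZDB-ANAT-010921-528', 'PATO:0000636');

-- View layer: present each row as subject/predicate/object triples.
CREATE VIEW triple AS
    SELECT genotype AS subject,
           'has_phenotype_entity' AS predicate,
           entity AS object
    FROM phenotype_annotation
    UNION ALL
    SELECT entity, 'has_quality', quality
    FROM phenotype_annotation;

-- Optional materialization for speed, at the cost of needing refreshes.
CREATE TABLE triple_materialized AS SELECT * FROM triple;
""")

triples = conn.execute(
    "SELECT subject, predicate, object FROM triple ORDER BY predicate"
).fetchall()
for t in triples:
    print(t)
```

Applications that want the relational shape query the base table, while ontology-aware tools read the triple view; both see the same data.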

Current progress: OBD-Sesame • Sesame • open source ‘triple store’ • based on Jena • also used in Protégé-OWL • storage layer options • mysql/postgresql generic schema • in-memory • disk-based

OBD in Sesame: current datasets • Pheno • ZFIN & FB : EAV trial 2003 data • Test ortholog set • FB ‘simple phenotype’ alleles • ZFIN legacy phenotype data, automatically parsed to EAV • Ontologies: AOs, PATO, Cell, GO • Method • excel & flatfiles->pheno-xml->owl • OWL from http://www.fruitfly.org/~cjm/obo-download • Trialbank • Method: ocelot->obo-xml->owl • Soon • human orthologs and omim

Technology Evaluation: Sesame • Use case query set • Benchmarks • preliminary conclusions • SQL layering is terrible • in-memory is fast • optimisations? • other triple stores? • up to date results on wiki • http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark • Need to test OWL-DL entailment • Bigger dataset required for full evaluations • Community effort: pub-semweb-lifesci list

Parallel development: an OBD Prototype • Initiated prior to OBD-Sesame • Simple deductive database • prolog-based • chado-like schema • can be views on Obo/OWL predicates • amigo-clone user interface • Rapid prototyping • Current dataset • as obd-sesame, plus CT • trivial to drop in more

Example logic query • find mutations affecting the shape of some part of the head capsule
inheres(QI,EI) &
inst(QI,QT) & label(QT,shape) &
inst(EI,ETP) & part_of*(ETP,ET) & label(ET,'head capsule')
Results of query on OBD-prolog: one annotation to “arista lateral”, “irregular shape”
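The same conjunctive query can be evaluated by hand over a few facts, with `part_of*` read as a reflexive transitive closure. This is a Python sketch rather than the Prolog prototype. The facts are invented so that the answer matches the reported result, and the sketch assumes `inst` is inferred over `is_a`, which the slide does not state.

```python
# Invented facts; identifiers and the anatomy chain are illustrative only.
INHERES = {("q1", "e1")}                                   # inheres(QI, EI)
INST = {("q1", "irregular_shape"), ("e1", "arista_lateral")}
IS_A = {("irregular_shape", "shape")}
PART_OF = {("arista_lateral", "antenna"), ("antenna", "head_capsule")}
LABEL = {
    "shape": "shape",
    "irregular_shape": "irregular shape",
    "arista_lateral": "arista lateral",
    "antenna": "antenna",
    "head_capsule": "head capsule",
}


def star(pairs, start):
    """Reflexive transitive closure: start plus everything reachable from it."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in pairs:
            if a == node and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def shape_of_head_capsule_part():
    """(entity label, quality label) pairs satisfying the slide's query."""
    results = []
    for qi, ei in sorted(INHERES):
        for q_inst, qt in sorted(INST):
            if q_inst != qi:
                continue
            # inst(QI, QT) & label(QT, shape), with inst inferred over is_a.
            if "shape" not in {LABEL[t] for t in star(IS_A, qt)}:
                continue
            for e_inst, etp in sorted(INST):
                if e_inst != ei:
                    continue
                # part_of*(ETP, ET) & label(ET, 'head capsule')
                if any(LABEL[et] == "head capsule" for et in star(PART_OF, etp)):
                    results.append((LABEL[etp], LABEL[qt]))
    return results


print(shape_of_head_capsule_part())
```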

OBD TODO • Pheno-xml • finalise release version • finalise Obo/OWL mapping • logic specification • Data • orthologies • OBD - BioPortal integration • how will it work? • Versioning and reconciling changes • decide on ontology versioning first

OBD dependencies • PATO development • UMLS into OBO-site • Ontologies • FMA accessibility? • species-centric AO alignments (XSPAN?) • Sept meeting on AO development • Nov meeting on disease ontologies • Data • MOD pheno annotation • OMIM annotation • Bioportal

Misc • NLP for phenote • Obol • trial on evolutionary phenotype characters • cambridge NLP project • can be used to ‘prime’ phenote • Decomposing MPO • pink fur: def = fur, has_quality: pink

Discussion • Will SemWeb dbs work? • experiment • Ontology-based modeling • the ontology is the model • importance of • relations ontology • upper ontology

Demos • http://yuri.lbl.gov/amigo/ct • http://yuri.lbl.gov/amigo/obd • http://spade.lbl.gov:8080/sesame/actionFrameset.jsp?repository=mem-rdfs-db
