Towards interoperability of bio-ontologies or Statistics vs Logic. Towards interoperability of bio-ontologies. Part I: Problem How to define meaning? Part II: Solution The meaning of a word is its use in language Part III: Solution The meaning of a word is OWL Part IV: Conclusion .
Towards interoperability of bio-ontologies, or Statistics vs Logic
Towards interoperability of bio-ontologies • Part I: Problem: How to define meaning? • Part II: Solution: The meaning of a word is its use in language • Part III: Solution: The meaning of a word is OWL • Part IV: Conclusion <owl:Class rdf:ID="LifeAndAllTheRest" /> <owl:DatatypeProperty rdf:ID="lifeValue"> <rdfs:domain rdf:resource="#LifeAndAllTheRest" /> <rdfs:range rdf:resource="&xsd;positiveInteger"/> </owl:DatatypeProperty>
Part I: How to define meaning • Some hints: • Aardvark • To belch • Mermaids
Part I: How to define meaning • Defining concepts is difficult • Aardvark: ambiguous definition that also matches a bee • Belch: verbal description vs. "just doing it" • Ambiguity • C = sea: "Big blue wobbly thing that mermaids live in" • Avoid negation • Dog = not a cat • Completeness (not part of video) • Johnson forgot "sausage" in the dictionary
Use in language = use in PubMed, use in UniProt, use in PDB, … • >30,000 3D structures • >1,000,000 sequences
Text mining can help • GoPubMed.org, MeshPubMed.org, Cohse, Textpresso, EbiMed, Whatizit, Termine, Vivisimo, … • BioCreative
Apoptosis vs programmed cell death • apoptosis NOT "programmed cell death": >120,000 papers • "programmed cell death" NOT apoptosis: 1,609 papers • "programmed cell death (apoptosis)": 903 papers • "apoptosis (programmed cell death)": 455 papers
Rethinking the microprocessor • Stemming: • binds, binding, bind! • Dimerization = dimer? • Organisation = organ! • Missing words: • Text "...a transcription factor that binds..." = "transcription factor binding" • 1/3 of GO terms end with "activity" • Word sense disambiguation: • "Rethinking the microprocessor" is about microRNA
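The stemming bullets above can be illustrated with a toy suffix stripper. This is a minimal sketch, not the Porter algorithm or any tool named in these slides; the suffix list and the `toy_stem` function are assumptions made for illustration. It shows both the helpful case (binds, binding, bind) and the over-stemming case (organisation collapsing to organ).

```python
# Toy suffix stripper; the suffix list is an illustrative assumption,
# not the rule set of any real stemmer.
SUFFIXES = ("isation", "ization", "ing", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


for w in ["binds", "binding", "bind", "dimerization", "organisation"]:
    print(w, "->", toy_stem(w))
```

The same rule that usefully merges "binds" and "binding" also maps "organisation" to "organ", which is the failure mode the slide points at.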
Word sense disambiguation • Tell me who your friends are and I will tell you who you are • Co-occurrence graphs and support vector machines achieve • 80%-95% for word sense disambiguation for terms like development, spindle, nucleus, envelope, … • >80% for gene identification task in BioCreative competition • (Identifying terms is much harder, though)
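The "tell me who your friends are" idea can be sketched as a context-overlap lookup. This is a deliberately simple stand-in for the co-occurrence graphs and support vector machines mentioned above, not a reproduction of them. The sense labels and context words in `SENSE_CONTEXT` are hypothetical and hand-written for illustration.

```python
# Hypothetical co-occurrence profiles for two senses of "nucleus";
# real systems would learn these from a corpus.
SENSE_CONTEXT = {
    "cell nucleus": {"chromatin", "membrane", "transcription", "cell"},
    "brain nucleus": {"neuron", "thalamus", "brain", "cortex"},
}


def disambiguate(context_words, sense_context=SENSE_CONTEXT):
    """Pick the sense whose profile shares the most words with the context."""
    words = {w.lower() for w in context_words}
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(sense_context), key=lambda s: len(words & sense_context[s]))


print(disambiguate("chromatin in the nucleus of the cell".split()))
print(disambiguate("a nucleus deep in the brain near the cortex".split()))
```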
Candidate terms • Statistics on a text corpus can reveal candidate terms • Composition: membrane < inner membrane < mitochondrial inner membrane < mitochondrial inner membrane peptidase complex • "The compositional structure of gene ontology terms" [Ogren et al., 2004] • Systems: Text2Onto, OntoLearn
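The composition chain above can be detected mechanically by checking which terms occur as whole-word phrases inside longer terms. The following is a minimal sketch under that assumption; `composition_pairs` is a hypothetical helper and not the method of Ogren et al., Text2Onto, or OntoLearn.

```python
def composition_pairs(terms):
    """Return (shorter, longer) pairs where shorter is a whole-word phrase of longer."""
    pairs = []
    for short in terms:
        for long_term in terms:
            # Padding with spaces avoids matching inside a word.
            if short != long_term and f" {short} " in f" {long_term} ":
                pairs.append((short, long_term))
    return pairs


terms = [
    "membrane",
    "inner membrane",
    "mitochondrial inner membrane",
    "mitochondrial inner membrane peptidase complex",
]
for short, long_term in composition_pairs(terms):
    print(f"{short!r} is contained in {long_term!r}")
```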
Defining concepts • Caspases are a family of cysteine proteases that cleave proteins after • HIV is a disease that affects the immune system • The liver is the largest internal organ in the body • Small GTPases are monomeric guanine nucleotide-binding proteins • Endocytosis is a process by which cells internalize ... • Endosomes are membrane-bound vesicles • See also:
Why logic is promising • If all facts are formally defined, we can reason over models • Example: glycogen storage disease • All metabolic reactions expressed as rules • Facts: glucose down, pyruvate and lactate up • Inconsistent with model • Which facts can be added to restore consistency? • Alpha-glucosidase malfunctioning (GSD II) • Amylo-alpha-1,6-glucosidase malfunctioning (GSD III)
Long history of computational logic for computational biology • Leibniz (1646-1716) • Lingua universalis and calculus ratiocinator • Idea: reasoning = prime factor computation • concepts = numbers • basic concepts = primes • complex concepts = non-primes composed of the primes of their basic concepts • Example • animal = 2, rational = 3, therefore human = 2 x 3 = 6 • Assuming monkey = 10, he concludes: monkey ≠ human, because neither 10/6 nor 6/10 is a whole number • To prove the usefulness of his calculus, he assumes "those marvelous characteristic numbers" as given
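Leibniz's scheme reduces subsumption to divisibility, which is easy to check in a few lines. The numbers below are the ones from the slide; the function names `is_a` and `related` are assumptions introduced for this sketch.

```python
ANIMAL, RATIONAL = 2, 3        # basic concepts = primes (from the slide)
HUMAN = ANIMAL * RATIONAL      # complex concept: 2 x 3 = 6
MONKEY = 10                    # characteristic number assumed on the slide


def is_a(concept: int, ancestor: int) -> bool:
    """A concept falls under an ancestor if the ancestor's number divides it."""
    return concept % ancestor == 0


def related(a: int, b: int) -> bool:
    """True if either concept subsumes the other."""
    return is_a(a, b) or is_a(b, a)


print(is_a(HUMAN, ANIMAL))     # human is an animal: 6 is divisible by 2
print(is_a(MONKEY, RATIONAL))  # monkey is not rational: 10 is not divisible by 3
print(related(MONKEY, HUMAN))  # neither 10/6 nor 6/10 is a whole number
```

The calculus itself is trivial; the hard part, as the last bullet notes, is obtaining the characteristic numbers in the first place.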
Boole (1815-1864) • "Clean beasts (x) are those which both divide the hoof (y) and chew the cud (z)": x = yz • 1. Division: z = x/y • 2. Development: z = (1/1)xy + (1/0)x(1-y) + (0/1)(1-x)y + (0/0)(1-x)(1-y) = xy + (1/0)x(1-y) + 0(1-x)y + (0/0)(1-x)(1-y) • 3. Interpretation: beasts which chew the cud [z] consist of all clean beasts (which also divide the hoof) [xy] together with an indefinite remainder (some, none, or all) [indicated by 0/0] of unclean beasts which do not divide the hoof [(1-x)(1-y)] • Note: the terms with coefficient 0 (not applicable) and 1/0 make no statement about z
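The four coefficients in the development can be cross-checked by brute force: for each 0/1 assignment of x and y, list the values of z that satisfy the premise x = yz. This is a sketch of that check, not Boole's own procedure; the function name `coefficient` and the wording of the labels are assumptions.

```python
def coefficient(x: int, y: int) -> str:
    """Classify the values of z in {0, 1} that satisfy x == y * z."""
    solutions = [z for z in (0, 1) if x == y * z]
    if solutions == [1]:
        return "1: all of this class belongs to z"
    if solutions == [0]:
        return "0: none of this class belongs to z"
    if solutions == [0, 1]:
        return "0/0: indefinite (some, none, or all)"
    return "1/0: no z satisfies the premise"


for x, y, term in [(1, 1, "xy"), (1, 0, "x(1-y)"), (0, 1, "(1-x)y"), (0, 0, "(1-x)(1-y)")]:
    print(f"{term:12s} x/y = {x}/{y} -> {coefficient(x, y)}")
```

Each row corresponds to substituting the values of x and y into x/y, which is how the coefficients 1/1, 1/0, 0/1 and 0/0 in step 2 arise.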
So, now it is OWL then <owl:Class rdf:ID="LifeAndAllTheRest" /> <owl:DatatypeProperty rdf:ID="lifeValue"> <rdfs:domain rdf:resource="#LifeAndAllTheRest" /> <rdfs:range rdf:resource="&xsd;positiveInteger"/> </owl:DatatypeProperty>
OWL is everywhere • 1600 OWL ontologies • Baker et al., Ontology Evaluation – Beauty in the eye of the beholder, Poster, 2005, NCOR inauguration
Implicit vs explicit • SNOMED uses only existential restrictions • General: • Life scientists make observations and only state facts, as life is too complex to generalise • Computer scientists make abstractions where possible • Spackman and Reynoso, Examining SNOMED from the Perspective of Formal Ontological Principles: Some Preliminary Analysis and Observations
Dealing with exceptions • Any logical approach should handle exceptions, as they are the norm in the life sciences • Non-monotonic reasoning: • Every member state of the EU is in Europe, Britain is a member state of the EU, but… • Birds fly, penguins do not, penguins are birds
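The bird and penguin example can be sketched as default inheritance with overrides: a conclusion drawn from a general rule is withdrawn once a more specific exception is added. This is a minimal illustration of non-monotonic behaviour, not a full default logic; the dictionaries and the `can_fly` function are assumptions made for this sketch.

```python
def can_fly(kind, is_a, flies):
    """Walk up the is-a chain and return the most specific default found."""
    while kind is not None:
        if kind in flies:
            return flies[kind]
        kind = is_a.get(kind)
    return None  # nothing is known


is_a = {"penguin": "bird", "bird": "animal"}
flies = {"bird": True}          # default: birds fly

print(can_fly("penguin", is_a, flies))   # inherits the bird default

flies["penguin"] = False        # add the exception
print(can_fly("penguin", is_a, flies))   # earlier conclusion is withdrawn
print(can_fly("bird", is_a, flies))      # the default still holds for birds
```

In classical (monotonic) logic, adding a fact can never invalidate an earlier conclusion, which is why exceptions of this kind need a different reasoning regime.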
Dealing with negation • Science is geared towards positive statements, but negative information is often equally important • Journal of Negative Results in BioMedicine • Defining a negative interaction dataset • Select two random proteins • Select two proteins with different localisation • Different types of negation • Explicit: "…HFR1 was shown not to interact with phyB…" • Implicit: "…Kip3 is not known to interact with Kar9…"
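The two strategies for building a negative interaction dataset can be sketched as follows. The protein identifiers and localisation annotations are hypothetical placeholders, and the resulting pairs are only presumed non-interacting, not experimentally verified negatives.

```python
import itertools
import random

# Hypothetical localisation annotations, for illustration only.
LOCALISATION = {
    "P1": "nucleus",
    "P2": "mitochondrion",
    "P3": "nucleus",
    "P4": "plasma membrane",
}


def random_pair(proteins, rng):
    """Strategy 1: pick any two distinct proteins at random."""
    return tuple(rng.sample(sorted(proteins), 2))


def different_localisation_pairs(localisation):
    """Strategy 2: keep only pairs annotated with different localisations."""
    return [
        (a, b)
        for a, b in itertools.combinations(sorted(localisation), 2)
        if localisation[a] != localisation[b]
    ]


print(random_pair(LOCALISATION, random.Random(0)))
print(different_localisation_pairs(LOCALISATION))
```

Strategy 2 excludes the pair (P1, P3) because both proteins are annotated to the nucleus and could plausibly meet.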
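Dealing with negation • Many different semantics • Figure (groups of equivalent semantics, labels as on the slide): su/a = su/d; su/u; su/sa; sa/u = sa/d = sa/a; su/su; u/a = u/d = u/sa; sa/su = sa/sa; u/su = u/u; d/su = d/u = d/a = d/d = d/sa; a/su = a/u = a/a = a/d = a/sa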
Reasoning • GONG example (Mike Bada et al.): • glycosaminoglycan biosynthesis and heparin biosynthesis were unrelated GO concepts • Using formal reasoning and other ontologies containing heparin is-a glycosaminoglycan, infer automatically • heparin biosynthesis is-a glycosaminoglycan biosynthesis • But why not use text mining? • Robert: If X is-a Y and there are concepts XZ and YZ, then suggest that XZ is-a YZ • Yves: Sugar - sugar phosphotransferase - Phosphotransferase
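The lexical rule above (if X is-a Y and both "X Z" and "Y Z" exist as concepts, suggest "X Z" is-a "Y Z") can be sketched with plain string matching. This is an illustration of the suggestion rule only, not the GONG reasoning pipeline; `suggest_is_a` is a hypothetical helper.

```python
def suggest_is_a(is_a_pairs, concepts):
    """Suggest (XZ, YZ) whenever X is-a Y and both 'X Z' and 'Y Z' are concepts."""
    suggestions = []
    for x, y in is_a_pairs:
        for concept in sorted(concepts):
            if concept.startswith(x + " "):
                z = concept[len(x) + 1:]
                candidate = f"{y} {z}"
                if candidate in concepts:
                    suggestions.append((concept, candidate))
    return suggestions


is_a_pairs = [("heparin", "glycosaminoglycan")]
concepts = {"heparin biosynthesis", "glycosaminoglycan biosynthesis"}

for child, parent in suggest_is_a(is_a_pairs, concepts):
    print(f"{child} is-a {parent}")
```

Such suggestions are only candidates: a purely lexical rule cannot tell whether the modifier Z preserves the is-a relation, so a curator or a reasoner still has to confirm them.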
Link text and ontologies • Text mining is difficult. • Thus, let authors write abstracts for machines! (Force authors to submit data) • Dietrich: Cashew prize • Kuhn et al., DILS 2006
Beware • Both statistical and logical approaches organise knowledge as a hierarchy • But hierarchies are just a means to structure data in a continuous space
Towards interoperability • Retrieval • Linking data, text, and ontologies semi-automatically • Evaluating ontologies on data/text, evaluating data/text on ontologies • Generating ontologies semi-automatically from text and data • Representation • Zebrafish vs Drosophila vs mouse anatomy • "Benign dictator vs UN" • Relation ontology, metadata • Formats: XML, OBO, OWL, … • Reasoning • Need? Expressiveness: DL, rules, non-monotonic reasoning, negation, closed/open world, …