Knowledge-Based Integration of Neuroscience Data Sources
Amarnath Gupta, Bertram Ludäscher, Maryann Martone
University of California San Diego

A Standard Information Mediation Framework
[Diagram labels: Client Query; Integrated XML View; View Definition; Mediator; three XML Views; two Wrappers; two Data Sources and one XML Data Source]
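
The division of labour between mediator and wrappers can be sketched in a few lines of Python. This is a toy illustration, not the system described in the talk: `make_wrapper`, `Mediator`, the record fields, and the two in-memory sources are all hypothetical, and plain dictionaries stand in for the XML views.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Wrapper = Callable[[Dict[str, str]], Iterable[Record]]


def make_wrapper(source_name: str, rows: List[Record]) -> Wrapper:
    """Hypothetical wrapper: exposes an in-memory source through equality queries."""
    def wrapper(conditions: Dict[str, str]) -> Iterable[Record]:
        for row in rows:
            if all(row.get(key) == value for key, value in conditions.items()):
                # Tag each record with its origin so the integrated view can show it.
                yield {**row, "source": source_name}
    return wrapper


class Mediator:
    """Hypothetical mediator: forwards a client query to every wrapper and unions the answers."""

    def __init__(self, wrappers: Iterable[Wrapper]) -> None:
        self.wrappers = list(wrappers)

    def query(self, **conditions: str) -> List[Record]:
        integrated: List[Record] = []
        for wrapper in self.wrappers:
            integrated.extend(wrapper(conditions))
        return integrated


# Made-up sample data, for illustration only.
localization = make_wrapper("localization", [{"protein": "RyR", "structure": "branchlet"}])
morphometry = make_wrapper("morphometry", [{"structure": "branchlet", "level": "10"}])
mediator = Mediator([localization, morphometry])
```

Calling `mediator.query(structure="branchlet")` returns one record from each source. A plain union like this only works when the sources already agree on what "branchlet" means.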

A Neuroscience Question
Cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
[Diagram labels: Integrated View; View Definition; Mediator; four Wrappers; sources: WWW (CaBP, Expasy), protein localization, morphometry, neurotransmission]

Integration Issues
• Structural Heterogeneity
    • Resolved by converting to a common semistructured data model
• Heterogeneity in Query Capabilities
    • Resolved by writing wrappers with binding patterns and other capability-definition languages
• Semantic Heterogeneity
    • Schema conflicts
        • Partially resolved by mapping rules in the mediator
    • Hidden Semantics?

Hidden Semantics: Protein Localization
[Image labels: Purkinje Cell layer of Cerebellar Cortex; Molecular layer of Cerebellar Cortex; Fragment of dendrite]
<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</>
    ….
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</>
        <amount name="RyR">0</>
      </>
      <structure fraction="0.2">
        <name>branchlet</>
        <amount name="RyR">30</>
      </>

Hidden Semantics: Morphometry
Annotations:
• Branch level beyond 4 is a branchlet
• Must be dendritic because Purkinje cells don't have somatic spines
<neuron name="purkinje cell">
  <branch level="10">
    <shaft> … </shaft>
    <spine number="1">
      <attachment x="5.3" y="-3.2" z="8.7" />
      <length>12.348</>
      <min_section>1.93</>
      <max_section>4.47</>
      <surface_area>9.884</>
      <volume>7.930</>
      <head>
        <width>4.47</>
        <length>1.79</>
      </head>
    </spine>
    …
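
The two annotations on this slide are domain knowledge that the morphometry data itself never states. A minimal Python sketch of making them explicit follows. The function name `annotate_spines`, the output keys, the label "branch" for levels up to 4, and the cut-down, well-formed sample document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# "Branch level beyond 4 is a branchlet": levels 5 and up count as branchlets.
BRANCHLET_MIN_LEVEL = 5

# Simplified, well-formed stand-in for the morphometry fragment (assumed sample).
SAMPLE = """
<neuron name="purkinje cell">
  <branch level="10">
    <spine number="1"><length>12.348</length></spine>
  </branch>
</neuron>
"""


def annotate_spines(xml_text: str) -> list:
    """Attach the unstated anatomical context to every spine in the document."""
    neuron = ET.fromstring(xml_text)
    # Purkinje cells have no somatic spines, so any spine found must be dendritic.
    is_purkinje = neuron.get("name", "").lower() == "purkinje cell"
    annotated = []
    for branch in neuron.iter("branch"):
        level = int(branch.get("level"))
        kind = "branchlet" if level >= BRANCHLET_MIN_LEVEL else "branch"
        for spine in branch.findall("spine"):
            annotated.append({
                "spine": spine.get("number"),
                "on": kind,
                "compartment": "dendrite" if is_purkinje else "unknown",
            })
    return annotated
```

Once spines carry an explicit "branchlet" label, they can be matched against the "branchlet" structures reported by a protein localization source.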

The Problem
• Multiple Worlds Integration
    • compatible terms not directly joinable
    • complex, indirect associations among schema elements
    • unstated integrity constraints
• Why not use ontologies?
    • typical ontologies associate terms along a limited number of dimensions
• What's needed
    • a "theory" under which non-identical terms can be "semantically" joined

Our Approach
• Modify the standard Mediation Architecture
    • Wrapper
        • Extend to encode an object-version of the structure schema
    • Mediator
        • Redesign to incorporate auxiliary knowledge sources to
            • Correlate object schema of sources
            • Define additional objects not specified but derivable from sources
• At the Mediator
    • Use a logic engine to
        • Encode the mapping rules between sources
        • Define integrated views using a combination of exported objects from the sources and the auxiliary knowledge sources
        • Perform query decomposition
• We still use the Global-as-View form of mediation

The KIND Architecture
[Diagram labels: Integrated User View; View Definition Rules; Logic Engine; Integration Logic; Auxiliary Knowledge Source 1; Auxiliary Knowledge Source 2; Schema of Registered Sources; Materialized Views; an Object Wrapper and a Structure Wrapper for each of Src 1 and Src 2]

The Knowledge-Base
• Situate every data object in its anatomical context
    • An illustration
• New data is registered with the knowledge-base
• Insertion of new data reconciles the current knowledge-base with the new information by:
    • Indexing the data with the source as part of registration
    • Extending the knowledge-base
    • Creating new views with complex rules to encode additional domain knowledge

F-Logic for the Mediation Engine
• Why F-Logic?
    • Provides the power of Datalog (with negation) and object creation through Skolem IDs
    • Correct amount of "notational sugar" and rules to provide object-oriented abstraction
    • Schema-level reasoning
    • Expressing variable arity
• F-Logic in KIND
    • Source schema wrapped into F-Logic schema
    • Knowledge-sources programmed in F-Logic
    • Definition of Integrated Views

Wrapping into Logic Objects
• Automated Part
    <!ELEMENT Studies (Study)*>
    <!ELEMENT Study (study_id, … animal, experiments, experimenters)>
    <!ELEMENT experiments (experiment)*>
    <!ELEMENT experiment (description, instrument, parameters)>

    studyDB[studies =>> study].
    study[study_id => string; … animal => animal; experiments =>> experiment; experimenters =>> string].
    …
• Non-automated Part
    • Subclasses
        mushroom_spine :: spine.
    • Rules
        S : mushroom_spine IF S : spine[head -> _; neck -> _].
    • Integrity Constraints
        ic1(S) : alert[type -> "invalid spine"; object -> S] IF S : spine[undef ->> {head, neck}].
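
The subclass rule and the integrity constraint in the non-automated part can be paraphrased in Python. This is a hedged sketch rather than a translation of the F-Logic semantics: spines are modelled as plain dictionaries, `classify_spine` and `check_spine` are hypothetical names, and the constraint is read as firing when both head and neck are undefined, which is one interpretation of the slide.

```python
from typing import Optional


def classify_spine(spine: dict) -> str:
    """A spine that has both a head and a neck is classified as a mushroom spine."""
    has_head = spine.get("head") is not None
    has_neck = spine.get("neck") is not None
    return "mushroom_spine" if has_head and has_neck else "spine"


def check_spine(spine_id: str, spine: dict) -> Optional[dict]:
    """Return an alert object when neither head nor neck is defined (assumed reading of ic1)."""
    if spine.get("head") is None and spine.get("neck") is None:
        return {"type": "invalid spine", "object": spine_id}
    return None


# Made-up spine records; the width and length echo the morphometry fragment, the rest is invented.
complete = {"head": {"width": 4.47, "length": 1.79}, "neck": {"length": 0.5}}
headless = {"neck": {"length": 0.5}}
empty = {}
```

Here `classify_spine(complete)` yields `"mushroom_spine"`, while `check_spine("s3", empty)` yields an alert. Neither check can be generated from the DTD, so both rules have to be written by hand.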

Computing with Auxiliary Sources
• Creating Mediated Classes
    Union view:
        animal[M => R] IF S : source, S.animal[M => R].
        animal[taxon => 'TAXON'.taxon].
    Association rule:
        X[taxon -> T] IF X : 'PROLAB'.animal[name -> N], words(N,[W1,W2|_]), T : 'TAXON'.taxon[genus -> W1; species -> W2].
• Reasoning with Schema
    Schema at Mediator:
        taxon[subspecies => string; species => string; genus => string; … phylum => string; kingdom => string; superkingdom => string].
        subspecies :: species :: genus :: … kingdom :: superkingdom
    Class creation by schema reasoning:
        T : TR, TR :: TR1 IF T : 'TAXON'.taxon[Taxon_Rank -> TR, Taxon_Rank1 -> TR1], Taxon_Rank :: Taxon_Rank1.
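
The association rule that links a PROLAB animal to a TAXON entry takes the first two words of the animal's name as genus and species. A small Python sketch of that matching step follows. `link_taxon`, the dictionary layout, and the sample taxon table are assumptions for illustration, and the exact, case-sensitive match is a simplification.

```python
from typing import List, Optional


def link_taxon(animal_name: str, taxa: List[dict]) -> Optional[dict]:
    """Match the first two words of an animal name against genus and species."""
    words = animal_name.split()
    if len(words) < 2:
        return None  # the name is too short to carry both genus and species
    genus, species = words[0], words[1]
    for taxon in taxa:
        if taxon["genus"] == genus and taxon["species"] == species:
            return taxon
    return None


# Illustrative stand-in for the TAXON source.
TAXA = [
    {"genus": "Rattus", "species": "norvegicus"},
    {"genus": "Mus", "species": "musculus"},
]
```

With this table, `link_taxon("Rattus norvegicus albino", TAXA)` returns the rat entry, ignoring any trailing words in the name. A name with no matching entry returns `None`.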

Integrated View Definition
• Views are defined between sources and the knowledge base
• Example: protein_distribution
    • given: organism, protein, brain_region
    • KB Anatom:
        • recursively traverse the has_a paths under brain_region and collect all anatomical_entities
    • Source PROLAB:
        • join with anatomical structures and collect the value of attribute "image.segments.features.feature.protein_amount" where "image.segments.features.feature.protein_name" = protein and "study_db.study.animal.name" = organism
    • Mediator:
        • aggregate over all parents up to brain_region
        • report distribution
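
The three steps of the protein_distribution view (traverse has_a, join with the source measurements, aggregate upward) can be sketched in Python. This is an illustration under stated assumptions, not the mediator's actual evaluation: the has_a relation is a dictionary assumed to form a tree, the flat `features` records stand in for the nested PROLAB attributes, and all sample values are made up.

```python
from collections import defaultdict
from typing import Dict, List


def protein_distribution(organism: str, protein: str, brain_region: str,
                         has_a: Dict[str, List[str]],
                         features: List[dict]) -> Dict[str, float]:
    """Total protein amount for brain_region and every entity beneath it."""
    # Source side: keep only measurements for the requested organism and protein.
    own = defaultdict(float)
    for feature in features:
        if feature["organism"] == organism and feature["protein_name"] == protein:
            own[feature["structure"]] += feature["protein_amount"]

    totals: Dict[str, float] = {}

    def total(entity: str) -> float:
        # Knowledge-base side: recurse along has_a; each parent sums its children.
        if entity not in totals:
            children = has_a.get(entity, [])
            totals[entity] = own.get(entity, 0.0) + sum(total(c) for c in children)
        return totals[entity]

    total(brain_region)
    return totals


# Hypothetical anatomy fragment and measurements.
HAS_A = {
    "cerebellum": ["cerebellar cortex"],
    "cerebellar cortex": ["purkinje cell layer", "molecular layer"],
}
FEATURES = [
    {"organism": "rat", "protein_name": "RyR", "structure": "molecular layer", "protein_amount": 30.0},
    {"organism": "rat", "protein_name": "RyR", "structure": "purkinje cell layer", "protein_amount": 10.0},
    {"organism": "mouse", "protein_name": "RyR", "structure": "molecular layer", "protein_amount": 5.0},
    {"organism": "rat", "protein_name": "RyR", "structure": "hippocampus", "protein_amount": 7.0},
]
```

Measurements outside the requested region, such as the hippocampus record, never enter the result because the traversal starts at `brain_region`. If has_a were a graph with shared children, the sums would double count and the sketch would need adjusting.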

Query Evaluation Example
• protein distribution of Human NCS-1 homologue
    • from wrapped CaBP website:
        • get the amino acid sequence for human NCS-1
    • from wrapped Expasy website:
        • submit amino acid sequence, get ranked homologues
    • at Mediator:
        • select homologues H found in rat, and homology > 0.70
    • at Mediator:
        • for each h in H
            • from previous view:
                • protein_distribution(rat, h, cerebellum, distribution)
• Construct the result as a second integrated view
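
The evaluation plan reads naturally as a small pipeline. The Python sketch below passes each source in as a function so that nothing about the real CaBP or Expasy interfaces is assumed: `get_sequence`, `find_homologues`, and `distribution_view` are hypothetical callables, and the record fields are invented for illustration.

```python
from typing import Callable, Dict, List


def homologue_distributions(get_sequence: Callable[[str], str],
                            find_homologues: Callable[[str], List[dict]],
                            distribution_view: Callable[[str, str, str], dict],
                            organism: str = "rat",
                            region: str = "cerebellum",
                            min_homology: float = 0.70) -> Dict[str, dict]:
    """Chain the source calls in the order given by the evaluation plan."""
    # Step 1: amino acid sequence of human NCS-1 from the first wrapped source.
    sequence = get_sequence("human NCS-1")
    # Step 2: ranked homologues from the second wrapped source.
    ranked = find_homologues(sequence)
    # Step 3: mediator-side selection on organism and homology threshold.
    selected = [h for h in ranked
                if h["organism"] == organism and h["homology"] > min_homology]
    # Step 4: call the previously defined view once per selected homologue.
    return {h["protein"]: distribution_view(organism, h["protein"], region)
            for h in selected}
```

Each step depends on the output of the one before it, so the mediator has to order the source calls and cannot issue them independently.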

Implementation
• System
    • Flora as F-Logic Engine
    • Communicate with ODBC databases through underlying XSB Prolog
    • XML wrapping and Web querying through XMAS, our XML query language and custom-built wrappers
• Data
    • Human Brain Project sites
    • NPACI Neuroscience Thrust sites

Work in Progress
• Architecture
    • plug-in architecture for
        • domain knowledge sources
        • conceptual models from data sources
• Functionality
    • better handling of large data
    • operations
        • expressive query language
        • operators for domain knowledge manipulation
    • query evaluation
        • query optimization using domain knowledge
• Demonstration
    • at VLDB 2000