Beyond Federation of Data Collections
Making Information Integration Service a part of NPACI Data Management Infrastructure
Amarnath Gupta, Bertram Ludäscher, Maryann Martone, Ilya Zaslavsky

Collection Federation
• In this scenario, scientific groups
  • produce data items (e.g., text data, images, simulation data …)
  • put them in collections
  • add metadata (who created it, what the data is about …)
  • make it available for sharing (on the web, in a data cache accessible with VBN, in HPSS with authorization information …)
• The Problem
  • The data may be a large number of small chunks or a small number of large chunks, so data movement is an issue
  • Heterogeneity in data types, storage technologies, networks, authentication protocols
  • Access has to be collection-based, data-item-wise, or data-fragment-wise; access may require executing data-specific functions
• Storage Resource Broker/Metadata Catalog
  • The focus is on making the data available
NPACI AHM, 2001

Information Integration
• Cross-source queries
  • What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
• Cross-source relationships are modeled
• Information-producing services can be invoked
• Data, relationships, constraints are modeled
[Figure: Integrated View; Integrated View Definition; Mediator; four Wrappers over the sources: Web, protein localization, morphometry, neurotransmission, CaBP, Expasy]

Hidden Semantics: Protein Localization
[Figure labels: Purkinje Cell layer of Cerebellar Cortex; Molecular layer of Cerebellar Cortex; Fragment of dendrite]
<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</name>
    ….
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</name>
        <amount name="RyR">0</amount>
      </structure>
      <structure fraction="0.2">
        <name>branchlet</name>
        <amount name="RyR">30</amount>
      </structure>
      …

Hidden Semantics: Morphometry
• Branch level beyond 4 is a branchlet
• A spine must be dendritic, because Purkinje cells don't have somatic spines
<neuron name="purkinje cell">
  <branch level="10">
    <shaft> … </shaft>
    <spine number="1">
      <attachment x="5.3" y="-3.2" z="8.7" />
      <length>12.348</length>
      <min_section>1.93</min_section>
      <max_section>4.47</max_section>
      <surface_area>9.884</surface_area>
      <volume>7.930</volume>
      <head>
        <width>4.47</width>
        <length>1.79</length>
      </head>
    </spine>
    …

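The two "hidden" rules on the morphometry slide can be made explicit in code. The following is a minimal sketch, not the authors' implementation: the XML fragment is a hypothetical, well-formed variant of the slide's example, and the label "main branch" for levels up to 4 as well as the function name `annotate_spines` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modeled on the morphometry slide.
MORPHOMETRY_XML = """
<neuron name="purkinje cell">
  <branch level="10">
    <spine number="1">
      <length>12.348</length>
      <volume>7.930</volume>
    </spine>
  </branch>
  <branch level="2">
    <spine number="2" />
  </branch>
</neuron>
"""

# Slide rule: a branch level beyond 4 is a branchlet.
BRANCHLET_MIN_LEVEL = 4


def annotate_spines(xml_text):
    """Attach the implicit domain knowledge to every spine record."""
    neuron = ET.fromstring(xml_text)
    is_purkinje = neuron.get("name") == "purkinje cell"
    rows = []
    for branch in neuron.iter("branch"):
        level = int(branch.get("level"))
        # "main branch" is an assumed label for levels up to 4.
        structure = "branchlet" if level > BRANCHLET_MIN_LEVEL else "main branch"
        for spine in branch.iter("spine"):
            rows.append({
                "spine": int(spine.get("number")),
                "branch_level": level,
                "structure": structure,
                # Slide rule: Purkinje cells have no somatic spines,
                # so every spine on a Purkinje cell is dendritic.
                "compartment": "dendrite" if is_purkinje else "unknown",
            })
    return rows


for row in annotate_spines(MORPHOMETRY_XML):
    print(row)
```

Once the rules are stated this way, the "branchlet" records in the morphometry source become comparable with the `branchlet` structures reported by the protein localization source.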
The Problem
• Multiple Worlds Integration
  • compatible terms not directly joinable
  • complex, indirect associations among schema elements
  • unstated integrity constraints
• What's needed?
  • a "theory" under which non-identical terms can be "semantically joined"
    => lift mediation to the level of conceptual models (CMs)
    => domain knowledge, ICs become rules over CMs
    => Model-Based Mediation

Information Integration
• What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
[Figure: Integrated View; Integrated View Definition; Mediator; four Wrappers over the sources: Web, protein localization, morphometry, neurotransmission, CaBP, Expasy]

Example Query Evaluation (I)
• Example: protein_distribution
  • given: organism, protein, brain_region
• Use DOMAIN-KNOWLEDGE-BASE:
  • recursively traverse the has_a_star paths under brain_region
  • collect all anatomical_entities
• Source PROLAB:
  • join with anatomical structures and collect the value of attribute "image.segments.features.feature.protein_amount" where "image.segments.features.feature.protein_name" = protein and "study_db.study.animal.name" = organism
• Mediator:
  • aggregate over all parents up to brain_region
  • report distribution

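The three evaluation steps above (closure over has_a, selection at the source, aggregation at the mediator) can be sketched as follows. This is an illustration under stated assumptions: the `HAS_A` edges and `PROLAB` rows are made-up sample data, the flat tuple layout stands in for the nested PROLAB attributes, and summation is assumed as the aggregate, which the slide does not specify.

```python
from collections import defaultdict

# Hypothetical domain knowledge: has_a edges (parent -> children).
HAS_A = {
    "cerebellum": ["cerebellar_cortex"],
    "cerebellar_cortex": ["molecular_layer", "purkinje_cell_layer"],
    "molecular_layer": ["spiny_branchlet"],
}

# Hypothetical PROLAB rows: (organism, protein, anatomical structure, amount).
PROLAB = [
    ("rat", "RyR", "spiny_branchlet", 30.0),
    ("rat", "RyR", "purkinje_cell_layer", 12.0),
    ("mouse", "RyR", "spiny_branchlet", 7.0),
]


def has_a_star(region):
    """All entities reachable from region via has_a, including region itself."""
    seen, stack = set(), [region]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(HAS_A.get(node, []))
    return seen


def protein_distribution(organism, protein, brain_region):
    # Domain knowledge step: collect all anatomical entities under the region.
    entities = has_a_star(brain_region)

    # Source step: select matching rows and join on the anatomical structure.
    amounts = defaultdict(float)
    for org, prot, structure, amount in PROLAB:
        if org == organism and prot == protein and structure in entities:
            amounts[structure] += amount

    # Mediator step: aggregate each entity over everything it contains
    # (sum is an assumed aggregate function).
    totals = {}
    for entity in entities:
        totals[entity] = sum(amounts.get(d, 0.0) for d in has_a_star(entity))
    return totals


distribution = protein_distribution("rat", "RyR", "cerebellum")
for entity in sorted(distribution):
    print(entity, distribution[entity])
```

Computing the closure at the mediator rather than at the source keeps the source query a plain selection, which matters when the source cannot evaluate recursive queries.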
Example Query Evaluation (II)
"How does the parallel fiber output (Yale/SENSELAB) relate to the distribution of Ryanodine Receptors (UCSD/NCMIR)?"
@SENSELAB: X1 := select output from parallel fiber;
@MEDIATOR: X2 := "hang off" X1 from Domain Map;
@MEDIATOR: X3 := subregion-closure(X2);
@NCMIR:    X4 := select PROT-data(X3, Ryanodine Receptors);
@MEDIATOR: X5 := compute aggregate(X4);

Integration Issues
• SEMANTIC Integration
• SYNTACTIC/STRUCTURAL Integration
  • Integrated Views (Src-XML => Intgr-XML)
  • Schema Integration (DTD => DTD)
  • Wrapping, Data Extraction (Text => XML)
• MIX: Mediation of Information using XML
• Distributed Query Processing
• SRB/MCAT: storage, query capabilities, protocols & services
• SYSTEM Integration
  • TCP/IP, HTTP, CORBA

The Mediator Architecture
Mediation Services API
Mediator Layer
[Components: Deductive Engine, Model Reasoner, Optimizer]
• Source model lifting:
  • domain knowledge reconciliation
  • model transformation
• Query formulation:
  • user query
  • integrated view definition
• Source registration:
  • domain knowledge
  • model & schema
  • query & computation capabilities
• Query processing:
  • view unfolding
  • semantic optimization
  • capability-based rewriting
Wrapper Layer
• Query interface (down API):
  • SDLIP, SOAP, ...
  • (subsets of) SQL, X(ML)-Query, CPL, ...
  • DOM
  • SRB-based access
• Result delivery interface (up API):
  • SDLIP, SOAP, ...
  • pull (tuple/set-at-a-time, DOM) vs. push (stream)
  • synchronous/asynchronous
  • direct data/data reference
Sources: File Sources, RDB Sources, Spatial Sources, HTML Sources, XML Sources, Digital Libraries (Collections)
[Figure labels: Boston Univ., NCMIR UCSD, Montana Univ., Yale Univ., SDLIP, ARC IMS]

Mediation Services: Source Registration-I
• A source is described along four dimensions: Data Type, Query Capability, Result Delivery, Access Protocol
[Figure labels, in slide order: ARC, XML QL, DOOD, SQL; tree, file, table; HTTP, Java, SRB; Tuple-at-a-time, Stream, Set-at-a-time; SPJ, Selections; Binary for Viewer]

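One way to picture a registration record with these four dimensions is a small data structure that the mediator can query when planning. The sketch below is purely illustrative: the class names, the `supporting` method, and the two example sources (`source_a`, `source_b`) with their attribute values are assumptions, not part of the presented system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRegistration:
    """Hypothetical registration record, one field per slide dimension."""
    name: str
    data_type: str                 # e.g. "tree", "file", "table"
    query_capabilities: frozenset  # e.g. {"SQL", "SPJ"}
    result_delivery: str           # e.g. "tuple-at-a-time", "stream"
    access_protocol: str           # e.g. "HTTP", "Java", "SRB"


class SourceRegistry:
    """Hypothetical mediator-side catalog of registered sources."""

    def __init__(self):
        self._sources = {}

    def register(self, registration):
        self._sources[registration.name] = registration

    def supporting(self, capability):
        """Names of the sources that can evaluate the given query capability."""
        return sorted(
            name
            for name, reg in self._sources.items()
            if capability in reg.query_capabilities
        )


registry = SourceRegistry()
registry.register(SourceRegistration(
    "source_a", "table", frozenset({"SQL", "SPJ"}), "set-at-a-time", "SRB"))
registry.register(SourceRegistration(
    "source_b", "tree", frozenset({"XML QL", "selections"}), "stream", "HTTP"))

print(registry.supporting("SQL"))
print(registry.supporting("selections"))
```

A lookup of this kind is what capability-based rewriting would rely on: a subquery is only shipped to a source whose registered capabilities cover it.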
Mediation Services: Source Registration-II
• Domain Model Registration
  • Here is my concept ontology
    • Keep it only as a private object
    • Merge my ontology with a pre-existing non-private ontology
      • Here are the equivalence relations
    • Detect conflicts between my ontology and a given public ontology
• Conceptual Schema Registration
  • Classes, methods
  • Constraints
  • Domain Model Reference

ANATOM Domain Map
[Figure: ANATOM domain map]

anatom_dom(X) :- (ucsd_has_a(X,_); ucsd_has_a(_,X); ucsd_isa(X,_); ucsd_isa(_,X)).
senselab_dom(X) :- (sl_has_a(X,_); sl_has_a(_,X); sl_isa(X,_); sl_isa(_,X)).

% map senselab anatom terms to equivalent ucsd anatom terms
sl2ucsd(X,X) :- senselab_dom(X), anatom_dom(X).
sl2ucsd('A',axon).
sl2ucsd('AH',axon).
sl2ucsd('Dad',spiny_branchlet).  % should REALLY map to a PATH not just the end of the path
sl2ucsd('Dam',main_branches).    % really only SOME of the main_branches based on the branch level
sl2ucsd('Dap',main_branches).
sl2ucsd('Dbd',spiny_branchlet).
sl2ucsd('Dbm',main_branches).
sl2ucsd('Dbp',main_branches).
sl2ucsd('Ded',spiny_branchlet).
sl2ucsd('Dem',main_branches).
sl2ucsd('Dep',main_branches).
sl2ucsd('T',axon).

% keep has_a edge if at least one node is known from ucsd
has_a(X,Y) :- sl2ucsd(_,X), ucsd_has_a(X,Y).
has_a(X,Y) :- sl2ucsd(_,Y), ucsd_has_a(X,Y).

% keep all and only ucsd is-a's
isa(X,Y) :- ucsd_isa(X,Y).

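For readers less familiar with Prolog, the term mapping and the has_a filter in the rules above can be restated over plain sets. This is a rough sketch under simplifying assumptions: the edge sets are hypothetical sample data, only a few of the explicit `sl2ucsd` facts are reproduced, and the domains are derived from has_a edges alone (the rules above also include is-a edges).

```python
# Hypothetical UCSD (ANATOM) and SENSELAB has_a edges, as (whole, part) pairs.
UCSD_HAS_A = {
    ("neuron", "dendrite"),
    ("dendrite", "spiny_branchlet"),
    ("dendrite", "main_branches"),
    ("neuron", "axon"),
    ("nucleus", "nucleolus"),
}
SL_HAS_A = {("neuron", "Dad"), ("neuron", "A")}

# A few of the explicit term mappings from the rules above.
EXPLICIT_SL2UCSD = {"A": "axon", "Dad": "spiny_branchlet"}


def domain(edges):
    """Every term that occurs in some edge (simplified: has_a edges only)."""
    return {node for edge in edges for node in edge}


def sl2ucsd(explicit, sl_edges, ucsd_edges):
    """Explicit mappings plus the identity mapping for shared terms."""
    shared = domain(sl_edges) & domain(ucsd_edges)
    return set(explicit.items()) | {(term, term) for term in shared}


def merged_has_a(explicit, sl_edges, ucsd_edges):
    """Keep a UCSD has_a edge if at least one endpoint is a mapped term."""
    known = {ucsd for _, ucsd in sl2ucsd(explicit, sl_edges, ucsd_edges)}
    return {(x, y) for (x, y) in ucsd_edges if x in known or y in known}


for edge in sorted(merged_has_a(EXPLICIT_SL2UCSD, SL_HAS_A, UCSD_HAS_A)):
    print(edge)
```

With this sample data the edge from `dendrite` to `main_branches` is dropped, because neither endpoint is the target of a mapping, which mirrors the "at least one node is known" condition in the rules.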
[Figure: domain map fragment. Node and edge labels: Neuron, MyNeuron, Neostriatum, Compartment, Spiny Neuron, ALL:has, Soma, Axon, Dendrite, Medium Spiny Neuron, Neurotransmitter, MyDendrite, exp, =, AND, OR, GABA, Substance P, Dopamine R, Substantia Nigra Pc, Substantia Nigra Pr, Globus Pallidus Int., Globus Pallidus Ext.]

Mediation Services: Client Registration
[Figure labels, in slide order: Client; Update Client, Query Client; Thin Result Viewer, Fat Result Viewer; Navigate/Ad-hoc Query Capability; Query on Schema; Derive Before Insert; Check Data; Merge Before Insert; Client-side Processing; Client-side Buffer; Send Full Data; Context Sensitive; Server-side Buffer; Server-Push/Client-Pull]

Mediation Services: Integrated View Definition
• For the domain data modeler
• Currently in a Logic Language (Frame-logic)

protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value) if
    I:protein_label_image[ proteins ->> {Protein};
                           organism -> Organism;
                           anatomical_structures ->>                  % from PROLAB
                               {AS:anatomical_structure[name->Anatom]}],
    %
    NAE:neuro_anatomic_entity[name->Anatom;                           % from ANATOM
                              located_in->>{Brain_region}],
    %
    AS..segments..features[name->Feature_name; value->Value].

• May be wrapped into a simpler tool

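Read operationally, the view joins PROLAB images with ANATOM entities on the anatomical structure name and then walks down to the features of each structure. The sketch below spells that reading out over nested dictionaries; it is one interpretation of the rule, and the sample records, the dictionary layout, and the function signature are all assumptions made for illustration.

```python
# Hypothetical PROLAB objects (class protein_label_image).
PROLAB_IMAGES = [
    {
        "proteins": ["RyR"],
        "organism": "rat",
        "anatomical_structures": [
            {
                "name": "spiny_branchlet",
                "segments": [
                    {"features": [{"name": "protein_amount", "value": 30.0}]},
                ],
            },
        ],
    },
]

# Hypothetical ANATOM objects (class neuro_anatomic_entity).
ANATOM_ENTITIES = [
    {"name": "spiny_branchlet", "located_in": ["cerebellum"]},
]


def protein_distribution(images, entities):
    """Yield (Protein, Organism, Brain_region, Feature_name, Anatom, Value)."""
    # Index ANATOM entities by name: the shared variable Anatom is the join key.
    regions_by_name = {}
    for entity in entities:
        regions_by_name.setdefault(entity["name"], []).extend(entity["located_in"])

    for image in images:
        for protein in image["proteins"]:
            for structure in image["anatomical_structures"]:
                anatom = structure["name"]
                for region in regions_by_name.get(anatom, []):
                    # Path expression AS..segments..features
                    for segment in structure["segments"]:
                        for feature in segment["features"]:
                            yield (protein, image["organism"], region,
                                   feature["name"], anatom, feature["value"])


for row in protein_distribution(PROLAB_IMAGES, ANATOM_ENTITIES):
    print(row)
```

The amount of nesting needed here is one argument for the slide's last point, that the view definition may be wrapped into a simpler tool for the domain data modeler.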
Mediation Services: Query Formulation Tools
• Combination of ad hoc and navigational
• Open Issues
  • Recursive queries
  • Aggregate queries
  • Combining data and service requests

Mediation Services: Data Update Tools

Some Open Issues
• Data/Knowledge Modeling
  • Extensibility: how to handle a source with new data types and operations?
  • Temporal Data: instrument readings, video microscopy
  • Spatial Data: integrating with spatial database systems
  • Image database systems
• Conflict Management
  • Grades of certainty
  • Alternate hypotheses
• Integrating Services
  • Registration and warping of my image slice to a reference
• Integrating into Larger Applications
  • M-Cell simulation
  • Telemicroscopy
  • Visualization