iFuice – Information Fusion utilizing Instance Correspondences and Peer Mappings
Erhard Rahm, Andreas Thor, David Aumueller, Hong-Hai Do, Nick Golovin, Toralf Kirsten
University of Leipzig, Germany
http://dbs.uni-leipzig.de

Motivating scenario
• Integrating ...
  • Additional relationships / attributes (Eventseer, Google Scholar)
  • Hand-picked private data (local file)
  • Sources from different domains (SwissProt, MIM)
• Example questions
  • Who published at SIGMOD as a PC member?
  • Who referenced publications of my favorite authors?
  • Who are the candidates for the SIGMOD test of time award?
  • What information system is used to support biological cancer analysis?
[Figure: data sources ACM, Citeseer, DBLP, Eventseer, local file, Google Scholar, SwissProt, PubMed, MIM, annotated with the example questions]

Schema vs. instance based integration
• Data integration using query mediator approach
  • Mediated (global) schema
  • Matching / views between global and local schemas
• Problems
  • Construction/evolution of global schema
  • Sources without a schema or with a semi-structured schema
  • Heterogeneous/dirty data, mapping to artificial schema
• Instance correspondences
  • Represent semantic relationships between instances
  • Allow integration of sources without schema
  • Can be inferred from web links

iFuice approach
• Information Fusion utilizing Instance Correspondences and Peer Mappings
• Bottom up integration
• High-level operators
  • Generic way to dynamic information fusion
• Mediator
  • Controls mapping / operator execution
  • Utilizes a domain model
• P2P-like infrastructure
  • Correspondences between autonomous data sources
  • Easy link-up of a new source "where it fits best"

Agenda
• Motivation & iFuice approach
• Meta data model
• Operators
• iFuice scripts
• Architecture
• Summary & outlook

Data sources
• Physical data source (PDS)
  • Web data (DBLP), local data (files), ...
  • Split into logical data sources
• Logical data source (LDS)
  • Refers to one object type
  • Contains object instances
• Object instance
  • Refers to real world entity
  • Set of attributes
  • One attribute is id
[Figure: PDS DBLP with the LDS Author, Conference and Publication, and one Publication instance:]
  Name: Generic schema matching with Cupid
  URL: http://vldb.org...
  Conference: VLDB 2001
  Authors: Jayant Madhavan, Philip A. Bernstein, Erhard Rahm

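The PDS / LDS / object instance hierarchy on this slide could be modelled roughly as follows. This is only a sketch: the class names, field names and the instance id are assumptions made for illustration and are not taken from the iFuice implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogicalDataSource:
    """One object type inside one physical data source, e.g. Publication@DBLP."""
    pds: str          # physical data source name, e.g. "DBLP"
    object_type: str  # object type the LDS refers to, e.g. "Publication"


@dataclass
class ObjectInstance:
    """An object instance: an id attribute plus a set of further attributes."""
    lds: LogicalDataSource
    id: str
    attributes: dict = field(default_factory=dict)


pub_dblp = LogicalDataSource("DBLP", "Publication")
cupid = ObjectInstance(
    lds=pub_dblp,
    id="cupid-2001",  # made-up id; the slide does not show the real one
    attributes={
        "Name": "Generic schema matching with Cupid",
        "Conference": "VLDB 2001",
        "Authors": ["Jayant Madhavan", "Philip A. Bernstein", "Erhard Rahm"],
    },
)
```

Making the LDS immutable and hashable lets it serve as a dictionary key, for example when looking up the mappings that start at a given LDS.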
Mappings
• Directed relationship between LDS
• Meta data: meaning of the mapping
  • Semantic mapping type
    • e.g., "publications of author"
  • Same mappings vs. association mappings
    • same = "equality" relationship between PDS
    • e.g., between DBLP publication (id) and ACM publication (id)
  • Id mappings vs. query mappings
• Instance data: instance correspondences
  • Materialized: mapping tables
  • On-the-fly: execution result (e.g., from web service)

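As an illustration of the two kinds of instance data, the sketch below represents a mapping either by a materialized table or by a function evaluated on the fly. The `Mapping` class, its fields, the mapping names and all ids are hypothetical; the function `fake_author_service` merely stands in for a web service call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Mapping:
    """A directed mapping between two logical data sources (hypothetical layout)."""
    name: str
    domain_lds: str
    range_lds: str
    semantic_type: str                                # e.g. "same" or an association
    table: Optional[dict] = None                      # materialized correspondences
    resolver: Optional[Callable[[str], list]] = None  # computed on the fly

    def correspondences(self, source_id):
        """Return the ids that correspond to source_id under this mapping."""
        if self.table is not None:
            return list(self.table.get(source_id, []))
        if self.resolver is not None:
            return list(self.resolver(source_id))
        return []


# Same mapping, materialized as a mapping table (ids are made up).
pub_same = Mapping(
    name="DBLP.PubSameACM",
    domain_lds="Publication@DBLP",
    range_lds="Publication@ACM",
    semantic_type="same",
    table={"dblp:p1": ["acm:p9"]},
)


def fake_author_service(author_id):
    """Stand-in for a web service that returns the publications of an author."""
    return {"dblp:a1": ["dblp:p1", "dblp:p2"]}.get(author_id, [])


# Association mapping ("publications of author"), evaluated on the fly.
auth_pub = Mapping(
    name="DBLP.AuthPub",
    domain_lds="Author@DBLP",
    range_lds="Publication@DBLP",
    semantic_type="publications of author",
    resolver=fake_author_service,
)
```

Callers only use `correspondences`, so a mapping can be switched from on-the-fly to materialized without changing the code that traverses it.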
Metadata model
• Used by mediator for mapping/operator execution
• Domain model indicates available object types and relationships
[Figure: source mapping model and domain model. PDS: ACM, DBLP, Google Scholar. LDS: Author, Publication, Conference. Mappings: AuthPub, PubAuth, CoAuthor, PubConf, ConfPub, extract. Legend: LDS, PDS, mapping, same mapping]

Operators
• Query language capabilities + scripting support
• Set-oriented operators
  • Input: set of object or mapping instances + parameters / query specification
  • Output: set of object / mapping instances
• Can be combined bottom-up within scripts

Operators overview
• Object instances (OI)
  • Query → OI: queryInstances, queryMatch, attrTransf
  • OI → OI: getInstances, traverse, traverseSame, map
• Aggregated objects (AO)
  • OI → AO: agg, disagg, fuseAttributes
  • AO → AO: aggregateSame, aggregateTraverse, aggregateMap
• Generic
  • union, diff, intersect
  • domain, range, compose

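The generic operators can be pictured on mapping instances represented as sets of (domain id, range id) pairs. The sketch below assumes the usual relational meaning suggested by the operator names; the slides do not define the exact semantics, and all ids are made up.

```python
def domain(mapping):
    """All objects that appear on the left-hand side of a mapping."""
    return {a for a, _ in mapping}


def range_(mapping):
    """All objects that appear on the right-hand side of a mapping."""
    return {b for _, b in mapping}


def compose(first, second):
    """Chain two mappings: (a, c) whenever (a, b) is in first and (b, c) in second."""
    return {(a, c) for a, b in first for b2, c in second if b == b2}


# Made-up correspondences: conference -> publication, publication -> author.
conf_pub = {("sigmod95", "p1"), ("sigmod95", "p2")}
pub_auth = {("p1", "a1"), ("p2", "a1"), ("p2", "a2")}

conf_auth = compose(conf_pub, pub_auth)

# union, diff and intersect correspond to Python's built-in set operators.
all_pairs = conf_pub | pub_auth
only_conf_pub = conf_pub - pub_auth
shared = conf_pub & pub_auth
```

Here `conf_auth` relates the conference directly to its authors, which is the kind of derived mapping a script could build bottom-up from two existing ones.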
Operators for object instances
• queryInstances executes a query on a peer
  • $S := queryInstances (Conf@DBLP, Series="SIGMOD") returns all SIGMOD conferences from DBLP
• map executes a mapping
  • map ($S, DBLP.ConfPubs) returns all tuples (conference, publication)
• traverse returns the range of a mapping
  • $P := traverse ($S, DBLP.ConfPubs) returns all publications
• traverseSame "navigates" to corresponding objects of another physical source
  • traverseSame ($P, GoogleScholar) returns "equal" publications at GoogleScholar

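A minimal in-memory sketch of these four operators, assuming that peers and mappings are plain Python collections. The data, the ids and the explicit `source_pds` argument of `traverse_same` are assumptions for illustration; on the slide the source is implied by the input objects.

```python
# Toy stand-ins for peers and mappings; all ids and rows are made up.
INSTANCES = {
    "Conf@DBLP": [
        {"id": "c1", "Series": "SIGMOD", "Name": "SIGMOD 1995"},
        {"id": "c2", "Series": "VLDB", "Name": "VLDB 2001"},
    ],
}
MAPPINGS = {
    "DBLP.ConfPubs": {("c1", "p1"), ("c1", "p2"), ("c2", "p3")},
}
SAME = {
    ("DBLP", "GoogleScholar"): {("p1", "g7"), ("p3", "g9")},
}


def query_instances(lds, **conditions):
    """Ids of all instances of an LDS whose attributes match the conditions."""
    return {
        row["id"]
        for row in INSTANCES[lds]
        if all(row.get(key) == value for key, value in conditions.items())
    }


def map_(objects, mapping):
    """All (domain, range) tuples of a mapping that start at the given objects."""
    return {(a, b) for a, b in MAPPINGS[mapping] if a in objects}


def traverse(objects, mapping):
    """Only the range objects reached from the given objects."""
    return {b for _, b in map_(objects, mapping)}


def traverse_same(objects, source_pds, target_pds):
    """Corresponding ("equal") objects in another physical data source."""
    return {b for a, b in SAME[(source_pds, target_pds)] if a in objects}


s = query_instances("Conf@DBLP", Series="SIGMOD")
pairs = map_(s, "DBLP.ConfPubs")
p = traverse(s, "DBLP.ConfPubs")
gs = traverse_same(p, "DBLP", "GoogleScholar")
```

Every operator takes a set and returns a set, which is what allows the results to be fed into each other as in the slide's `$S` and `$P` example.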
Instance fusion
• Aggregated object: object instances referring to the same real world object
• Auxiliary fusion operators
  • agg / disagg, fuseAttributes
[Figure: agg combines the DBLP and Google Scholar (GS) instances of one publication into an aggregated object; fuseAttributes merges their attributes]
Publication (DBLP)
  Name: Generic schema matching with Cupid
  URL: http://vldb.org...
  Conference: VLDB 2001
  Authors: Jayant Madhavan, Philip A. Bernstein, Erhard Rahm
Publication (GS)
  Name: Generic schema matching with Cupid
  URL: http://data.cs.washington.edu...
  NoOfCit: 243
  Authors: J Madhavan, PA Bernstein, E Rahm
Fused publication (source of each value in parentheses)
  Name: Generic schema matching with Cupid (DBLP, GS)
  URL: http://vldb.org... (DBLP), http://data.cs.washington.edu... (GS)
  Authors: Jayant Madhavan, Philip A. Bernstein, Erhard Rahm (DBLP); J Madhavan, PA Bernstein, E Rahm (GS)
  Conf.: VLDB 2001 (DBLP)
  NoOfCit: 243 (GS)

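The fusion shown in the figure can be approximated as follows, using the two Cupid instances from the slide. The functions and the dictionary layout are assumptions: in this sketch, fusing attributes keeps every distinct value per attribute and collapses identical ones, which is only one plausible reading of fuseAttributes.

```python
def agg(*instances):
    """Group instances that refer to the same real world object."""
    return list(instances)


def fuse_attributes(aggregated):
    """Merge attributes of all instances, keeping each distinct value once."""
    fused = {}
    for instance in aggregated:
        for attr, value in instance["attributes"].items():
            values = fused.setdefault(attr, [])
            if value not in values:
                values.append(value)
    return fused


dblp_pub = {
    "source": "DBLP",
    "attributes": {
        "Name": "Generic schema matching with Cupid",
        "URL": "http://vldb.org...",  # truncated on the slide
        "Conference": "VLDB 2001",
        "Authors": "Jayant Madhavan, Philip A. Bernstein, Erhard Rahm",
    },
}
gs_pub = {
    "source": "GS",
    "attributes": {
        "Name": "Generic schema matching with Cupid",
        "URL": "http://data.cs.washington.edu...",  # truncated on the slide
        "NoOfCit": 243,
        "Authors": "J Madhavan, PA Bernstein, E Rahm",
    },
}

fused = fuse_attributes(agg(dblp_pub, gs_pub))
```

The identical names collapse into one value, while the two URL and author spellings are both kept, matching the fused object in the figure.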
Operators for aggregated objects
• aggregateSame
  • Identify corresponding objects in another source (traverseSame)
  • Aggregate resulting objects with input objects (agg)
  • aggregateSame ($P, GoogleScholar) returns AOs of (DBLP + GoogleScholar) publications
[Figure: aggregateSame applied to the DBLP publication "Generic schema matching with Cupid": traverseSame finds the corresponding GS publication (NoOfCit: 243), and agg combines both instances into one aggregated object]

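The two steps of aggregateSame can be sketched as below. The helper `same_correspondences` is a pair-preserving variant of traverseSame introduced here so that each match stays attached to its input object. Keeping inputs without a match as single-element aggregated objects is an assumption, as are all ids.

```python
def same_correspondences(object_ids, same_mapping):
    """(input id, corresponding id) pairs for the given input objects."""
    return {(a, b) for a, b in same_mapping if a in object_ids}


def aggregate_same(object_ids, same_mapping):
    """One aggregated object per input: the input id followed by its matches."""
    aggregated = {oid: [oid] for oid in sorted(object_ids)}
    for source_id, match_id in sorted(same_correspondences(object_ids, same_mapping)):
        aggregated[source_id].append(match_id)
    return list(aggregated.values())


# Made-up same-mapping between DBLP and Google Scholar publications.
dblp_gs_same = {("dblp:p1", "gs:g7"), ("dblp:p3", "gs:g9")}

aos = aggregate_same({"dblp:p1", "dblp:p2"}, dblp_gs_same)
```

In this toy run the first publication is aggregated with its Google Scholar counterpart, while the second has no correspondence and stays on its own.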
iFuice scripts
• Batch execution of operators
• Store (intermediate) results in variables
• Scripts can be interpreted as mappings
  • Other scripts can utilize iFuice "script mappings"
• Example: SIGMOD test of time award
    $SIGMODPubs := queryTraverse (LDS=DBLP.Conf, {Name="SIGMOD 1995"}, DBLPConfPubs)
    $CombinedConfPub := aggregateSame ($SIGMODPubs, GoogleScholar)
    $CleanedPubs := fuseAttributes($CombinedConfPub)
    $Result := sort ($CleanedPubs, "NoOfCitings")

Mediator architecture
[Figure: layered architecture, top to bottom]
• Applications: Personal Information Manager, Bio navigator
• Mediator interface (request / response)
  • Web service or Java library
  • Script / batch or interactive (step by step)
• iFuice mediator
  • Fusion control unit
  • Cache, meta data model, repository (mapping results), duplicate detection (load / store)
  • Mapping handler (mapping call / mapping result)
• Mapping execution service
  • Wraps different mapping implementations: web service, SQL query, Java class, iFuice script

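One way to picture the mapping handler and the mapping execution service is a registry that hides how each mapping is implemented and caches mapping results. The sketch below is an assumption-laden illustration, not the iFuice design: the class, the mapping names and the data are hypothetical, an in-memory SQLite table stands in for an SQL-based mapping, and a plain function stands in for a web service.

```python
import sqlite3


class MappingHandler:
    """Dispatches mapping calls to wrapped implementations and caches results."""

    def __init__(self):
        self._implementations = {}  # mapping name -> callable(object_id) -> ids
        self._cache = {}            # (mapping name, object id) -> result tuple
        self.calls = 0              # number of real (uncached) executions

    def register(self, name, implementation):
        self._implementations[name] = implementation

    def execute(self, name, object_id):
        key = (name, object_id)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = tuple(self._implementations[name](object_id))
        return self._cache[key]


# Implementation 1: a mapping backed by an SQL query (made-up table and rows).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE conf_pub (conf TEXT, pub TEXT)")
db.executemany("INSERT INTO conf_pub VALUES (?, ?)", [("c1", "p1"), ("c1", "p2")])


def sql_conf_pubs(conf_id):
    rows = db.execute(
        "SELECT pub FROM conf_pub WHERE conf = ? ORDER BY pub", (conf_id,)
    )
    return [row[0] for row in rows]


# Implementation 2: a plain function standing in for a web service call.
def service_co_authors(author_id):
    return {"a1": ["a2", "a3"]}.get(author_id, [])


handler = MappingHandler()
handler.register("ConfPubs", sql_conf_pubs)
handler.register("CoAuthor", service_co_authors)

first = handler.execute("ConfPubs", "c1")
second = handler.execute("ConfPubs", "c1")  # served from the cache
co_authors = handler.execute("CoAuthor", "a1")
```

Because every implementation is reduced to the same call shape, the handler can cache and combine results without knowing whether they came from a database or a remote service.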
Summary & outlook
• iFuice: generic way to dynamic information fusion
  • Based on instance correspondences of P2P sources
  • Mediator-controlled data fusion
• Two working modes
  • Script mode: powerful operators for information fusion tasks (with source selection or transparent)
  • Explorative mode: navigation in information space
• Future work
  • Finishing prototype implementation
  • Different domains, e.g., bioinformatics and e-commerce
  • Tool-supported (semi-)automatic integration of local / private data sources