Reference Reconciliation in Complex Information Spaces Xin (Luna) Dong, Alon Halevy, Jayant Madhavan @ SIGMOD 2005 University of Washington
Semex: Personal Information Management System Homepage(1) SenderOfEmails(7595) RecipientOfEmails(8547) AuthorOfArticles(52) MentionedIn(315)
Semex: Personal Information Management System Email Contacts(1145) Co-authors(24)
Semex: Personal Information Management System Article: Reference Reconciliation in Complex Information Spaces Authors PublishedIn Cites(33) CitedBy FromFile
Semex: Personal Information Management System Xin (Luna) Dong Lab-#dong xin dong xin luna xinluna dong Names luna x. dong dongxin Emails xin dong
Semex Without Deduplication Search results for luna 23 persons luna dong SenderOfEmails(3043) RecipientOfEmails(2445) MentionedIn(94)
Semex Without Deduplication Search results for luna 23 persons Xin (Luna) Dong AuthorOfArticles(49) MentionedIn(20)
Semex Without Deduplication A Platform for Personal Information Management and Integration
Semex Without Deduplication 9 Persons: dong xin xin dong
Complex Information Space Example – An Abstract View of Personal Information • Article: a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1) a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2) • Venue: c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”) c2=(“ACM SIGMOD”, “1978”, null) • Person: p1=(“Robert S. Epstein”, null) p2=(“Michael Stonebraker”, null) p3=(“Eugene Wong”, null) p4=(“Epstein, R.S.”, null) p5=(“Stonebraker, M.”, null) p6=(“Wong, E.”, null) p7=(“Eugene Wong”, “eugene@berkeley.edu”) p8=(null, “stonebraker@csail.mit.edu”) p9=(“mike”, “stonebraker@csail.mit.edu”) • Diagram labels: Class, Reference, Atomic Attribute, Association Attribute
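The abstract view above can be written down as a small data model. The following is a minimal sketch, not the Semex schema: the class layout follows the slide, but the field names (`authored_by`, `published_in`, `location`) are assumptions chosen for illustration. Atomic attributes hold plain values; association attributes hold links to other references.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Person:
    name: Optional[str]    # atomic attribute, may be missing (null)
    email: Optional[str]   # atomic attribute, may be missing (null)

@dataclass(frozen=True)
class Venue:
    name: str
    year: str
    location: Optional[str]

@dataclass(frozen=True)
class Article:
    title: str
    pages: str
    authored_by: Tuple[Person, ...]  # association attribute (hypothetical name)
    published_in: Venue              # association attribute (hypothetical name)

# References a1, c1, p1..p3 from the slide.
p1 = Person("Robert S. Epstein", None)
p2 = Person("Michael Stonebraker", None)
p3 = Person("Eugene Wong", None)
c1 = Venue("ACM Conference on Management of Data", "1978", "Austin, Texas")
a1 = Article("Distributed Query Processing", "169-180", (p1, p2, p3), c1)
```

Each object here is one reference; several references (for example p2, p5, p8 and p9) may denote the same real-world person.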
Other Complex Information Spaces • Citation portals, e.g., Citeseer, Cora • Online product catalogs in E-commerce
Real-World Objects • Article: a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1) a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2) • Venue: c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”) c2=(“ACM SIGMOD”, “1978”, null) • Person: p1=(“Robert S. Epstein”, null) p2=(“Michael Stonebraker”, null) p3=(“Eugene Wong”, null) p4=(“Epstein, R.S.”, null) p5=(“Stonebraker, M.”, null) p6=(“Wong, E.”, null) p7=(“Eugene Wong”, “eugene@berkeley.edu”) p8=(null, “stonebraker@csail.mit.edu”) p9=(“mike”, “stonebraker@csail.mit.edu”)
Reference Reconciliation • Input: A set of references R • Output: A partitioning over R, such that • Each partition refers to a single real-world object (high precision) • Different partitions refer to different objects (high recall)
Related Work • A very active area of research in Databases, Data Mining and AI • Most current approaches assume matching tuples from a single database table • Traditional approaches (surveyed in [Cohen, et al. 2003]) • Step I. Compare attributes • Step II. Combine attribute similarities to decide tuple match/non-match • Step III. Compute transitive closures to get partitions • New approaches explore relationships between reconciliation decisions using probability models [Russell et al, 2002] [Domingos et al, 2004] • Reconciliation is harder for complex information spaces
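The three traditional steps can be sketched as follows. This is an illustrative baseline only, under stated assumptions: the string measure (`difflib.SequenceMatcher`), the plain averaging in Step II, and the 0.8 threshold are placeholders, not choices taken from the talk or from the surveyed work.

```python
from difflib import SequenceMatcher
from itertools import combinations

def attr_sim(a, b):
    """Step I: similarity of two attribute values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def tuple_match(t1, t2, threshold=0.8):
    """Step II: average the similarities of attributes present in both tuples."""
    sims = [attr_sim(x, y) for x, y in zip(t1, t2)
            if x is not None and y is not None]
    return bool(sims) and sum(sims) / len(sims) >= threshold

def partition(tuples, threshold=0.8):
    """Step III: transitive closure of pairwise matches, via union-find."""
    parent = list(range(len(tuples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(tuples)), 2):
        if tuple_match(tuples[i], tuples[j], threshold):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(tuples)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy single-table input (venue name, year); the values are made up.
venues = [("ACM SIGMOD", "1978"), ("acm sigmod", "1978"), ("VLDB", "1978")]
print(partition(venues))  # [[0, 1], [2]]
```

The sketch compares tuples of one table attribute by attribute, which is exactly the assumption that breaks down when references belong to several classes and carry few attributes.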
Challenges in Complex Information Spaces • Article: a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1) a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2) • Venue: c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”) c2=(“ACM SIGMOD”, “1978”, null) • Person: p1=(“Robert S. Epstein”, null) p2=(“Michael Stonebraker”, null) p3=(“Eugene Wong”, null) p4=(“Epstein, R.S.”, null) p5=(“Stonebraker, M.”, null) p6=(“Wong, E.”, null) p7=(“Eugene Wong”, “eugene@berkeley.edu”) p8=(null, “stonebraker@csail.mit.edu”) p9=(“mike”, “stonebraker@csail.mit.edu”) • Challenges: 1. Multiple Classes 2. Limited Information 3. Multi-value Attributes
Intuition • Complex information spaces can be considered as networks of instances and associations between the instances • Key: exploit the network, specifically, the clues hidden in the associations
Outline • Introduction and problem definition • Reconciliation algorithm • Experimental results • Conclusions
Framework: Dependency Graph • p2=(“Michael Stonebraker”, null, {p1, p3}) p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}) p8=(null, “stonebraker@csail.mit.edu”, {p7}) p9=(“mike”, “stonebraker@csail.mit.edu”, null) (p3,p7) (“Michael Stonebraker”, “stonebraker@”) Cross-attr similarity (“Michael Stonebraker”, p7) (p2, p8) Compare contacts (p1, “stonebraker@csail.mit.edu”) (p1,p7) (p3, “stonebraker@csail.mit.edu”) Legend: Reference Similarity, Attribute Similarity
Framework: Dependency Graph • p2=(“Michael Stonebraker”, null, {p1, p3}) p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}) p8=(null, “stonebraker@csail.mit.edu”, {p7}) p9=(“mike”, “stonebraker@csail.mit.edu”, null) (p3,p7) (“Michael Stonebraker”, “stonebraker@”) Cross-attr similarity (p2, p8) Compare contacts Legend: Reference Similarity, Attribute Similarity
Framework: Dependency Graph • p2=(“Michael Stonebraker”, null, {p1, p3}) p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}) p8=(null, “stonebraker@csail.mit.edu”, {p7}) p9=(“mike”, “stonebraker@csail.mit.edu”, null) (“Eugene Wong”, “Eugene Wong”) (p3,p7) (“Michael Stonebraker”, “stonebraker@”) (“Michael Stonebraker”, “mike”) (p2, p8) (p2, p9) (p8, p9) (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”) Legend: Reference Similarity, Attribute Similarity
Exploit the Dependency Graph (p1, p4) (“Distributed…”, “Distributed…”) (“Robert S. Epstein”, “Epstein, R.S.”) (“169-180”, “169-180”) (p2, p5) (a1, a2) (“Michael Stonebraker”, “Stonebraker, M.”) (c1, c2) (p3, p6) (“Eugene Wong”, “Wong, E.”) (“ACM …”, “ACM SIGMOD”) (“1978”, “1978”) Reference similarity Attribute similarity
Dependency Graph Example II (p1, p4) (“Distributed…”, “Distributed…”) (“Robert S. Epstein”, “Epstein, R.S.”) (“169-180”, “169-180”) (p2, p5) (a1, a2) (“Michael Stonebraker”, “Stonebraker, M.”) (c1, c2) Compare authored papers (p3, p6) (“Eugene Wong”, “Wong, E.”) (“ACM …”, “ACM SIGMOD”) (“1978”, “1978”) Reference similarity Attribute similarity
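The graph in this example can be built with a few lines of code. The sketch below is a hypothetical structure, assuming that each node is an unordered pair (of references or of attribute values) and that an edge source → target means the target's similarity depends on the source's. The class and method names (`DependencyGraph`, `node`, `depend`) are invented for illustration.

```python
from collections import defaultdict

class DependencyGraph:
    """Nodes are pairs; an edge source -> target means the similarity
    of target depends on the similarity of source."""

    def __init__(self):
        self.score = {}                     # node -> similarity in [0, 1]
        self.neighbors = defaultdict(set)   # node -> nodes to re-examine

    def node(self, kind, a, b):
        key = (kind,) + tuple(sorted((a, b)))  # unordered pair
        self.score.setdefault(key, 0.0)
        return key

    def depend(self, source, target, both_ways=False):
        self.neighbors[source].add(target)
        if both_ways:
            self.neighbors[target].add(source)

g = DependencyGraph()
articles = g.node("ref", "a1", "a2")
venues = g.node("ref", "c1", "c2")

# Attribute-similarity nodes feed the reference-similarity nodes.
g.depend(g.node("attr", "Distributed Query Processing",
                "Distributed query processing"), articles)
g.depend(g.node("attr", "169-180", "169-180"), articles)
g.depend(g.node("attr", "ACM Conference on Management of Data",
                "ACM SIGMOD"), venues)
g.depend(g.node("attr", "1978", "1978"), venues)

# Associations: articles and their venues inform each other.
g.depend(venues, articles, both_ways=True)

# Authors inform the article pair, and vice versa ("compare authored papers").
author_pairs = [
    ("p1", "p4", "Robert S. Epstein", "Epstein, R.S."),
    ("p2", "p5", "Michael Stonebraker", "Stonebraker, M."),
    ("p3", "p6", "Eugene Wong", "Wong, E."),
]
for r1, r2, n1, n2 in author_pairs:
    persons = g.node("ref", r1, r2)
    g.depend(g.node("attr", n1, n2), persons)
    g.depend(persons, articles, both_ways=True)
```

With this layout, a decision on (a1, a2) can reach (p2, p5) through a single edge, which is the path the later propagation slides rely on.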
Strategy I. Consider Richer Evidence • Cross-attribute similarity – Name&email • p5=(“Stonebraker, M.”, null) • p8=(null, “stonebraker@csail.mit.edu”) • Context Information I – Contact list • p5=(“Stonebraker, M.”, null, {p4, p6}) • p8=(null, “stonebraker@csail.mit.edu”, {p7}) • p6=p7 • Context Information II – Authored articles • p2=(“Michael Stonebraker”, null) • p5=(“Stonebraker, M.”, null) • p2 and p5 authored the same article
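Cross-attribute similarity compares a name with an email address rather than name with name. A minimal sketch of one possible heuristic follows; it is an assumption for illustration, not the measure used in the talk. It scores how much of the email's local part is covered by a token of the name.

```python
import re

def name_email_similarity(name, email):
    """Return a score in [0, 1]: the share of the email's local part
    covered by the longest name token it contains (illustrative heuristic)."""
    local = email.split("@", 1)[0].lower()
    tokens = [t for t in re.split(r"[^a-z]+", name.lower()) if t]
    # Ignore single-letter tokens such as initials; they match too easily.
    matched = [t for t in tokens if len(t) > 1 and t in local]
    if not local or not matched:
        return 0.0
    return max(len(t) for t in matched) / len(local)

print(name_email_similarity("Stonebraker, M.", "stonebraker@csail.mit.edu"))    # 1.0
print(name_email_similarity("Robert S. Epstein", "stonebraker@csail.mit.edu"))  # 0.0
```

A score like this lets p5 and p8 be compared even though they share no populated attribute.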
Considering Only Attribute-wise Similarities Cannot Merge Persons Well • Person references: 24076 • Real-world persons (gold standard): 1750 • Chart labels: 3159, 1409
Considering Richer Evidence Improves the Recall • Person references: 24076 • Real-world persons: 1750 • Chart labels: 1409, 346
Exploit the Dependency Graph (p1, p4) (“Distributed…”, “Distributed…”) (“Robert S. Epstein”, “Epstein, R.S.”) (“169-180”, “169-180”) (p2, p5) (a1, a2) (“Michael Stonebraker”, “Stonebraker, M.”) (c1, c2) (p3, p6) (“Eugene Wong”, “Wong, E.”) (“ACM …”, “ACM SIGMOD”) (“1978”, “1978”) Reconciled Similar
Strategy II. Propagate Information between Reconciliation Decisions • After changing the similarity score of one node, re-compute similarity scores of its neighbors • This process converges if • The similarity score is monotone in the similarity values of neighbors • Neighbor similarities are re-computed only if the similarity increase is not too small
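The propagation rule can be sketched as a worklist loop. This is a simplified illustration under stated assumptions: `recompute` stands in for whatever monotone similarity function is used, `epsilon` is a made-up cut-off for "not too small", and the toy scores below are invented.

```python
from collections import deque

def propagate(scores, neighbors, recompute, epsilon=0.01):
    """Re-compute node scores until no node changes enough to matter.

    scores:    dict node -> current similarity in [0, 1] (updated in place)
    neighbors: dict node -> iterable of nodes that depend on it
    recompute: function(node, scores) -> new similarity, assumed monotone
               in the neighbors' scores
    """
    queue = deque(scores)
    queued = set(scores)
    while queue:
        node = queue.popleft()
        queued.discard(node)
        increase = recompute(node, scores) - scores[node]
        if increase <= 0:
            continue                      # scores never decrease
        scores[node] += increase
        if increase >= epsilon:           # skip propagation of tiny increases
            for n in neighbors.get(node, ()):
                if n not in queued:
                    queue.append(n)
                    queued.add(n)
    return scores

# Toy example: the article pair and an author pair reinforce each other.
base = {"a1~a2": 0.9, "p2~p5": 0.5}
links = {"a1~a2": ["p2~p5"], "p2~p5": ["a1~a2"]}

def recompute(node, scores):
    # Hypothetical monotone rule: own evidence plus a bonus from neighbors.
    bonus = 0.3 * max(scores[m] for m in links[node])
    return min(1.0, base[node] + bonus)

result = propagate(dict(base), links, recompute)
```

Because scores are bounded by 1 and every propagated increase is at least `epsilon`, each node can trigger only a bounded number of re-computations, which is the convergence argument on the slide.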
Propagating Information between Reconciliation Decisions Further Improves Recall Person references: 24076 Real-world persons: 1750
Strategy III. Enrich References in Reconciliation • Enrich knowledge of a real-world object for later reconciliation • Naïve: Construct graph → Compute similarity → Transitive closure, repeated in passes • Problems • Dependency-graph construction is expensive • Reference enrichment does not take effect until the next pass • Solution • Instant enrichment by adding neighbors in the dependency graph
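Instant enrichment can be pictured as folding graph nodes together once two references are reconciled. The sketch below is one possible reading, assuming the graph is stored as a map from reference-pair nodes to the evidence nodes feeding them; the function `enrich` and this storage layout are hypothetical.

```python
def pair(a, b):
    """Canonical key for an unordered pair of reference ids."""
    return tuple(sorted((a, b)))

def enrich(evidence, kept, merged):
    """After `kept` and `merged` are reconciled, fold every node
    (merged, x) into (kept, x) so its evidence counts immediately.

    evidence: dict pair-node -> set of evidence labels (updated in place)
    """
    for node in [n for n in evidence if merged in n]:
        other = node[0] if node[1] == merged else node[1]
        incoming = evidence.pop(node)
        if other != kept:  # the (kept, merged) node itself just disappears
            evidence.setdefault(pair(kept, other), set()).update(incoming)
    return evidence

# p8 and p9 share an email address and are reconciled first.
evidence = {
    pair("p2", "p8"): {"name-email: Michael Stonebraker / stonebraker@"},
    pair("p2", "p9"): {"name-name: Michael Stonebraker / mike"},
    pair("p8", "p9"): {"email-email: identical addresses"},
}
enrich(evidence, kept="p8", merged="p9")
```

After the call, the node (p2, p8) carries both the name-email evidence and the name-name evidence that used to belong to (p2, p9), without rebuilding the graph or waiting for another pass.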
Enrich References by Adding Neighbors • p2=(“Michael Stonebraker”, null, {p1, p3}) p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}) p8=(null, “stonebraker@csail.mit.edu”, {p7}) p9=(“mike”, “stonebraker@csail.mit.edu”, null) (p3,p7) (“Michael Stonebraker”, “stonebraker@”) (“Michael Stonebraker”, “mike”) (p2, p8) (p2, p9) (p8, p9) (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”) Legend: Reconciled, Similar
Enrich References by Adding Neighbors • p2=(“Michael Stonebraker”, null, {p1, p3}) p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8}) p8=(null, “stonebraker@csail.mit.edu”, {p7}) p9=(“mike”, “stonebraker@csail.mit.edu”, null) (p3,p7) (“Michael Stonebraker”, “stonebraker@”) (“Michael Stonebraker”, “mike”) (p2, p8) (p8, p9) (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”) Legend: Reconciled, Similar
Reference Enrichment Improves Recall More than Information Propagation Person references: 24076 Real-world persons: 1750
Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall • Person references: 24076 • Real-world persons: 1750 • Chart labels: 1409, 346, 125
Outline • Introduction and problem definition • Reconciliation algorithm • Experimental results • Conclusions
Experiment Settings • Datasets • Four personal datasets • Cora dataset for citations • Use the same parameters and thresholds for all datasets • Measures • Precision and recall, F-measure • Precision: The percentage of correctly reconciled reference pairs over all reconciled reference pairs • Recall: The percentage of correctly reconciled reference pairs over pairs of references that refer to the same real-world object • Diversity and Dispersion • Diversity: For every result partition, how many real-world objects are included; ideally should be 1 (related to precision) • Dispersion: For every real-world object, how many result partitions include it; ideally should be 1 (related to recall)
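The measures above can be computed directly from a predicted partitioning and a gold standard. The sketch below follows the pairwise definitions on the slide; averaging diversity and dispersion over partitions and objects is an assumption, since the slide only gives the per-partition and per-object counts. The function names and the toy labels are illustrative.

```python
from itertools import combinations

def same_cluster_pairs(labels):
    """All unordered pairs of references that share a label."""
    return {(i, j) for i, j in combinations(sorted(labels), 2)
            if labels[i] == labels[j]}

def evaluate(predicted, gold):
    """predicted, gold: dict reference id -> partition id / real-world object."""
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(gold_pairs) if gold_pairs else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0

    objects_in = {}      # partition -> real-world objects it contains
    partitions_of = {}   # real-world object -> partitions that contain it
    for ref, part in predicted.items():
        objects_in.setdefault(part, set()).add(gold[ref])
        partitions_of.setdefault(gold[ref], set()).add(part)
    diversity = sum(len(s) for s in objects_in.values()) / len(objects_in)
    dispersion = sum(len(s) for s in partitions_of.values()) / len(partitions_of)
    return {"precision": precision, "recall": recall, "f": f_measure,
            "diversity": diversity, "dispersion": dispersion}

# Toy labels: the Stonebraker references are split into two partitions.
gold = {"p2": "S", "p5": "S", "p8": "S", "p9": "S",
        "p3": "W", "p6": "W", "p7": "W"}
predicted = {"p2": "A", "p5": "A", "p8": "B", "p9": "B",
             "p3": "C", "p6": "C", "p7": "C"}
metrics = evaluate(predicted, gold)
```

In this toy case every predicted pair is correct (precision 1.0, diversity 1.0), but only 5 of the 9 gold pairs are found, and the split object raises the average dispersion to 1.5.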
Recall Results on One Personal Dataset • Person references: 24076 • Real-world persons: 1750 • Chart labels: 1409, 346, 125
Results Considering All Occurrences of Person Instances Both precision and recall increase compared with attr-wise matching.
Results Considering Only Distinct Person References Precision and recall increase substantially compared with attr-wise matching.