Reference Reconciliation in Complex Information Spaces

Xin (Luna) Dong, Alon Halevy, Jayant Madhavan, University of Washington @ SIGMOD 2005

Semex: Personal Information Management System Homepage(1) SenderOfEmails(7595) RecipientOfEmails(8547) AuthorOfArticles(52) MentionedIn(315)

Semex: Personal Information Management System Email Contacts(1145) Co-authors(24)

Semex: Personal Information Management System Article: Reference Reconciliation in Complex Information Spaces Authors PublishedIn Cites(33) CitedBy FromFile

Semex: Personal Information Management System Xin (Luna) Dong Lab-#dong xin dong xin luna xinluna dong Names luna x. dong dongxin Emails xin dong

Semex Without Deduplication Search results for luna 23 persons luna dong SenderOfEmails(3043) RecipientOfEmails(2445) MentionedIn(94)

Semex Without Deduplication Search results for luna 23 persons Xin (Luna) Dong AuthorOfArticles(49) MentionedIn(20)

Semex Without Deduplication A Platform for Personal Information Management and Integration

Semex Without Deduplication 9 Persons: dong xin xin dong

Semex NEEDS Deduplication (Reference Reconciliation)

Complex Information Space Example – An Abstract View of Personal Information
• Article:
  a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
  a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
• Venue:
  c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2=(“ACM SIGMOD”, “1978”, null)
• Person:
  p1=(“Robert S. Epstein”, null)
  p2=(“Michael Stonebraker”, null)
  p3=(“Eugene Wong”, null)
  p4=(“Epstein, R.S.”, null)
  p5=(“Stonebraker, M.”, null)
  p6=(“Wong, E.”, null)
  p7=(“Eugene Wong”, “eugene@berkeley.edu”)
  p8=(null, “stonebraker@csail.mit.edu”)
  p9=(“mike”, “stonebraker@csail.mit.edu”)
• Callouts: Class, Reference, Atomic Attribute, Association Attribute

Other Complex Information Spaces • Citation portals, e.g., Citeseer, Cora • Online product catalogs in E-commerce

Real-World Objects
• The example references, grouped by the real-world object they denote (grouping as indicated by the reconciliation examples later in the talk):
• Article: {a1, a2}
• Venue: {c1, c2}
• Person: {p1, p4}, {p2, p5, p8, p9}, {p3, p6, p7}

Reference Reconciliation
• Input: a set of references R
• Output: a partitioning over R, such that
  • each partition refers to a single real-world object (high precision)
  • different partitions refer to different objects (high recall)

Related Work
• A very active area of research in databases, data mining, and AI
• Most current approaches assume matching tuples from a single database table
• Traditional approaches (surveyed in [Cohen et al., 2003])
  • Step I. Compare attributes
  • Step II. Combine attribute similarities to decide tuple match/non-match
  • Step III. Compute transitive closures to get partitions
• New approaches explore relationships between reconciliation decisions using probabilistic models [Russell et al., 2002] [Domingos et al., 2004]
• Harder for complex information spaces
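The three traditional steps can be sketched as follows. This is a minimal illustration, not the method of any cited system: the string measure (`difflib.SequenceMatcher`), the averaging rule, and the 0.8 threshold are all assumptions chosen for brevity.

```python
from difflib import SequenceMatcher
from itertools import combinations

def attr_sim(a, b):
    """Step I: similarity of two attribute values in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def tuple_match(t1, t2, threshold=0.8):
    """Step II: average the similarities of attributes present in both tuples."""
    sims = [attr_sim(x, y) for x, y in zip(t1, t2)
            if x is not None and y is not None]
    return bool(sims) and sum(sims) / len(sims) >= threshold

def partition(tuples, threshold=0.8):
    """Step III: transitive closure of pairwise matches, via union-find."""
    parent = {k: k for k in tuples}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path compression
            k = parent[k]
        return k

    for a, b in combinations(tuples, 2):
        if tuple_match(tuples[a], tuples[b], threshold):
            parent[find(a)] = find(b)

    groups = {}
    for k in tuples:
        groups.setdefault(find(k), set()).add(k)
    return sorted(sorted(g) for g in groups.values())

# Person references from the running example: (name, email)
persons = {
    "p2": ("Michael Stonebraker", None),
    "p3": ("Eugene Wong", None),
    "p7": ("Eugene Wong", "eugene@berkeley.edu"),
    "p8": (None, "stonebraker@csail.mit.edu"),
    "p9": ("mike", "stonebraker@csail.mit.edu"),
}
print(partition(persons))
```

On these five references the sketch merges p3 with p7 and p8 with p9 but leaves p2 alone, because p2 and p8 share no populated attribute. That gap is what the rest of the talk addresses.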

Challenges in Complex Information Spaces (same example references as before)
• 1. Multiple classes: Article, Venue, and Person references all need to be reconciled
• 2. Limited information: e.g., p8=(null, “stonebraker@csail.mit.edu”) has no name, and p5=(“Stonebraker, M.”, null) has no email
• 3. Multi-valued attributes: e.g., the author sets {p1,p2,p3} and {p4,p5,p6}

Intuition • Complex information spaces can be considered as networks of instances and associations between the instances • Key: exploit the network, specifically, the clues hidden in the associations

Outline • Introduction and problem definition • Reconciliation algorithm • Experimental results • Conclusions

Framework: Dependency Graph
• References (name, email, contacts):
  p2=(“Michael Stonebraker”, null, {p1, p3})
  p3=(“Eugene Wong”, null, {p1, p2})
  p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8})
  p8=(null, “stonebraker@csail.mit.edu”, {p7})
  p9=(“mike”, “stonebraker@csail.mit.edu”, null)
• Reference-similarity node (p2, p8): the first build pairs every attribute value of p2 with every attribute value of p8:
  (“Michael Stonebraker”, “stonebraker@”), (“Michael Stonebraker”, p7), (p1, “stonebraker@csail.mit.edu”), (p1, p7), (p3, “stonebraker@csail.mit.edu”), (p3, p7)
• The second build keeps only two of them as neighbors of (p2, p8):
  • (“Michael Stonebraker”, “stonebraker@”): cross-attribute similarity
  • (p3, p7): compare contacts
• Node types: Reference Similarity, Attribute Similarity

Framework: Dependency Graph
• References (name, email, contacts):
  p2=(“Michael Stonebraker”, null, {p1, p3})
  p3=(“Eugene Wong”, null, {p1, p2})
  p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8})
  p8=(null, “stonebraker@csail.mit.edu”, {p7})
  p9=(“mike”, “stonebraker@csail.mit.edu”, null)
• Reference-similarity nodes: (p3, p7), (p2, p8), (p2, p9), (p8, p9)
• Attribute-similarity nodes: (“Eugene Wong”, “Eugene Wong”), (“Michael Stonebraker”, “stonebraker@”), (“Michael Stonebraker”, “mike”), (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”)

Exploit the Dependency Graph
• Reference-similarity nodes: (a1, a2), (c1, c2), (p1, p4), (p2, p5), (p3, p6)
• Attribute-similarity nodes: (“Distributed…”, “Distributed…”), (“169-180”, “169-180”), (“Robert S. Epstein”, “Epstein, R.S.”), (“Michael Stonebraker”, “Stonebraker, M.”), (“Eugene Wong”, “Wong, E.”), (“ACM …”, “ACM SIGMOD”), (“1978”, “1978”)

Dependency Graph Example II
• Reference-similarity nodes: (a1, a2), (c1, c2), (p1, p4), (p2, p5), (p3, p6)
• Attribute-similarity nodes: (“Distributed…”, “Distributed…”), (“169-180”, “169-180”), (“Robert S. Epstein”, “Epstein, R.S.”), (“Michael Stonebraker”, “Stonebraker, M.”), (“Eugene Wong”, “Wong, E.”), (“ACM …”, “ACM SIGMOD”), (“1978”, “1978”)
• Edge label: Compare authored papers

Strategy I. Consider Richer Evidence
• Cross-attribute similarity: name & email
  • p5=(“Stonebraker, M.”, null)
  • p8=(null, “stonebraker@csail.mit.edu”)
• Context information I: contact list
  • p5=(“Stonebraker, M.”, null, {p4, p6})
  • p8=(null, “stonebraker@csail.mit.edu”, {p7})
  • p6=p7
• Context information II: authored articles
  • p2=(“Michael Stonebraker”, null)
  • p5=(“Stonebraker, M.”, null)
  • p2 and p5 authored the same article
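A cross-attribute comparison of a name against an email address might look like the sketch below. The scoring rule and the values 1.0 / 0.8 are hypothetical, chosen only to make the idea concrete; the slides do not specify the actual measure.

```python
import re

def name_email_sim(name, email):
    """Hypothetical cross-attribute score in [0, 1] between a name and an email."""
    if name is None or email is None:
        return 0.0
    local = email.split("@", 1)[0].lower()       # part before the '@'
    tokens = re.findall(r"[a-z]+", name.lower())
    words = [t for t in tokens if len(t) >= 3]   # skip initials such as "m"
    if local in words:
        return 1.0   # the local part is exactly one of the name words
    if any(w in local for w in words):
        return 0.8   # a name word occurs inside the local part
    return 0.0

# p5 and p8 share no attribute, yet the name and the email agree.
print(name_email_sim("Stonebraker, M.", "stonebraker@csail.mit.edu"))
```

With a score like this, p5 and p8 gain evidence even though an attribute-by-attribute comparison finds nothing to compare.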

Considering Only Attribute-wise Similarities Cannot Merge Persons Well
• Person references: 24076
• Real-world persons (gold standard): 1750
• Chart labels: 3159, 1409

Considering Richer Evidence Improves the Recall
• Person references: 24076
• Real-world persons: 1750
• Chart labels: 1409, 346

Exploit the Dependency Graph
• Reference-similarity nodes: (a1, a2), (c1, c2), (p1, p4), (p2, p5), (p3, p6)
• Attribute-similarity nodes: (“Distributed…”, “Distributed…”), (“169-180”, “169-180”), (“Robert S. Epstein”, “Epstein, R.S.”), (“Michael Stonebraker”, “Stonebraker, M.”), (“Eugene Wong”, “Wong, E.”), (“ACM …”, “ACM SIGMOD”), (“1978”, “1978”)
• Node states (legend): Reconciled, Similar

Strategy II. Propagate Information between Reconciliation Decisions
• After changing the similarity score of one node, re-compute the similarity scores of its neighbors
• This process converges if
  • the similarity score is monotone in the similarity values of neighbors
  • neighbor similarities are re-computed only if the similarity increase is not too small
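The propagation loop above can be sketched with a work queue. Everything concrete here is an assumption made for illustration: the scoring rule (base score plus a weighted maximum over incoming neighbors, capped at 1), the `weight` and `epsilon` values, and the tiny example graph. The rule is monotone in the neighbor scores, and neighbors are re-queued only when the increase reaches `epsilon`, matching the two convergence conditions.

```python
from collections import deque

def propagate(base, incoming, weight=0.3, epsilon=0.01):
    """
    base:     node -> initial similarity in [0, 1]
    incoming: node -> list of nodes whose similarity is evidence for it
    Returns node -> similarity after propagation.
    """
    sim = dict(base)

    # Invert the dependency lists: who must be re-scored when a node changes?
    outgoing = {}
    for node, sources in incoming.items():
        for s in sources:
            outgoing.setdefault(s, []).append(node)

    def score(node):
        # Monotone in neighbor similarities: a larger neighbor score never lowers it.
        boost = weight * max((sim[s] for s in incoming.get(node, [])), default=0.0)
        return min(1.0, base[node] + boost)

    queue = deque(base)
    queued = set(base)
    while queue:
        node = queue.popleft()
        queued.discard(node)
        new = score(node)
        increase = new - sim[node]
        if increase <= 0:
            continue
        sim[node] = new
        if increase >= epsilon:          # ignore increases that are too small
            for n in outgoing.get(node, []):
                if n not in queued:
                    queue.append(n)
                    queued.add(n)
    return sim

# Article pair and two of its author pairs, with made-up base scores.
base = {("a1", "a2"): 0.9, ("p2", "p5"): 0.6, ("p3", "p6"): 0.6}
incoming = {
    ("a1", "a2"): [("p2", "p5"), ("p3", "p6")],
    ("p2", "p5"): [("a1", "a2")],
    ("p3", "p6"): [("a1", "a2")],
}
result = propagate(base, incoming)
```

Scores only rise and are capped at 1, and every propagated step is at least `epsilon`, so the queue empties after finitely many steps.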

Propagating Information between Reconciliation Decisions Further Improves Recall Person references: 24076 Real-world persons:1750

Strategy III. Enrich References in Reconciliation
• Enrich knowledge of a real-world object for later reconciliation
• Naïve: Construct graph → Compute similarity → Transitive closure
• Problems
  • Dependency-graph construction is expensive
  • Reference enrichment does not take effect until the next pass
• Solution
  • Instant enrichment by adding neighbors in the dependency graph

Enrich References by Adding Neighbors
• References (name, email, contacts):
  p2=(“Michael Stonebraker”, null, {p1, p3})
  p3=(“Eugene Wong”, null, {p1, p2})
  p7=(“Eugene Wong”, “eugene@berkeley.edu”, {p8})
  p8=(null, “stonebraker@csail.mit.edu”, {p7})
  p9=(“mike”, “stonebraker@csail.mit.edu”, null)
• First build, reference-similarity nodes: (p3, p7), (p2, p8), (p2, p9), (p8, p9)
• Second build, reference-similarity nodes: (p3, p7), (p2, p8), (p8, p9); node (p2, p9) is gone, while its attribute-similarity neighbor (“Michael Stonebraker”, “mike”) stays in the graph
• Attribute-similarity nodes (both builds): (“Michael Stonebraker”, “stonebraker@”), (“Michael Stonebraker”, “mike”), (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”)
• Node states (legend): Reconciled, Similar
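The effect of enrichment on the references themselves can be sketched as below. This is a simplification and an assumption: the slides enrich by adding neighbors inside the dependency graph, whereas this sketch only shows the reference-level outcome, using a hypothetical `Clusters` class (union-find whose roots hold the union of their members' attribute values).

```python
class Clusters:
    """Union-find over references; each root keeps its cluster's attribute values."""

    def __init__(self, refs):
        # refs: reference id -> {attribute name: iterable of values}
        self.parent = {r: r for r in refs}
        self.values = {r: {a: set(v) for a, v in attrs.items()}
                       for r, attrs in refs.items()}

    def find(self, r):
        while self.parent[r] != r:
            self.parent[r] = self.parent[self.parent[r]]  # path compression
            r = self.parent[r]
        return r

    def merge(self, r1, r2):
        """Reconcile two references; the surviving root is enriched immediately."""
        a, b = self.find(r1), self.find(r2)
        if a == b:
            return a
        self.parent[b] = a
        for attr, vals in self.values.pop(b).items():
            self.values[a].setdefault(attr, set()).update(vals)
        return a

    def values_of(self, r):
        """Everything currently known about the object that r refers to."""
        return self.values[self.find(r)]

refs = {
    "p8": {"name": [], "email": ["stonebraker@csail.mit.edu"], "contact": ["p7"]},
    "p9": {"name": ["mike"], "email": ["stonebraker@csail.mit.edu"], "contact": []},
}
clusters = Clusters(refs)
clusters.merge("p8", "p9")
print(clusters.values_of("p8"))
```

After the merge, a later comparison against p8 also sees the name “mike”, and a comparison against p9 also sees the contact p7, without waiting for another pass.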

Reference Enrichment Improves Recall More than Information Propagation
• Person references: 24076
• Real-world persons: 1750

Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall
• Person references: 24076
• Real-world persons: 1750
• Chart labels: 1409, 346, 125

Experiment Settings
• Datasets
  • Four personal datasets
  • Cora dataset for citations
  • Same parameters and thresholds used for all datasets
• Measures
  • Precision, recall, and F-measure
    • Precision: the percentage of correctly reconciled reference pairs over all reconciled reference pairs
    • Recall: the percentage of correctly reconciled reference pairs over all pairs of references that refer to the same real-world object
  • Diversity and dispersion
    • Diversity: for every result partition, how many real-world objects it includes; ideally 1 (related to precision)
    • Dispersion: for every real-world object, how many result partitions include it; ideally 1 (related to recall)
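The measures above can be computed from a result partitioning and a gold-standard partitioning as in this sketch. Two points are assumptions: the F-measure is taken to be the usual harmonic mean of precision and recall, and diversity and dispersion are reported as averages over partitions and objects. The gold grouping in the example is illustrative.

```python
from itertools import combinations

def same_pairs(partitions):
    """All unordered reference pairs that share a partition."""
    return {frozenset(pair)
            for part in partitions
            for pair in combinations(sorted(part), 2)}

def precision_recall(result, gold):
    reconciled, truth = same_pairs(result), same_pairs(gold)
    correct = len(reconciled & truth)
    precision = correct / len(reconciled) if reconciled else 1.0
    recall = correct / len(truth) if truth else 1.0
    return precision, recall

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def diversity(result, gold):
    """Average number of real-world objects per result partition (ideal: 1)."""
    obj = {ref: i for i, part in enumerate(gold) for ref in part}
    return sum(len({obj[r] for r in part}) for part in result) / len(result)

def dispersion(result, gold):
    """Average number of result partitions per real-world object (ideal: 1)."""
    return diversity(gold, result)

gold = [{"p1", "p4"}, {"p2", "p5", "p8", "p9"}, {"p3", "p6", "p7"}]
result = [{"p1", "p4"}, {"p2", "p5"}, {"p8", "p9"}, {"p3", "p6", "p7"}]
precision, recall = precision_recall(result, gold)
```

Here the result never mixes two objects, so precision and diversity are both 1, while splitting one person across two partitions lowers recall and pushes dispersion above 1.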

Recall Results on One Personal Dataset
• Person references: 24076
• Real-world persons: 1750
• Chart labels: 1409, 346, 125

Results Considering All Occurrences of Person Instances Both precision and recall increase compared with attr-wise matching.

Results Considering Only Distinct Person References Precision and recall increase substantially compared with attr-wise matching.

Diversity and Dispersion Are Very Close to 1