Exploiting Relationships for Object Consolidation
Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine
ACM IQIS 2005
Work supported by NSF Grants IIS-0331707 and IIS-0083489
http://www.ics.uci.edu/~dvk/RelDC
http://www.itr-rescue.org (RESCUE)
Talk Overview
• Motivation
• Object consolidation problem
• Proposed approach
• RelDC: Relationship-based data cleaning
• Relationship analysis and graph partitioning
• Experiments
Why do we need “Data Cleaning”?
A motivating dialog:
• Jane Smith (fresh Ph.D.): “Hi, my name is Jane Smith. I’d like to apply for a faculty position at your university.”
• Tom (recruiter): “OK, let me check something quickly…” (looks up her publications and CiteSeer rank)
• Tom: “Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?”
What is the problem?
• Names often do not uniquely identify people
(figure: CiteSeer’s list of the top-k most cited authors, with the corresponding DBLP entries)
Comparing raw and cleaned CiteSeer
(figure: raw CiteSeer top-k list vs. cleaned CiteSeer top-k list)
Object Consolidation Problem
• Cluster representations that correspond to the same real-world object/entity
• Two instances of the problem: the set of real-world objects is known vs. unknown
(figure: representations r1, ..., rN in the database mapped to real objects o1, ..., oM)
RelDC Approach
• Exploit relationships among objects to disambiguate when the traditional approach, clustering based on feature similarity, does not work
• RelDC framework: Relationship-based Data Cleaning, combining traditional feature-based methods with relationship analysis
(figure: ARG connecting entities; features and context feed both traditional methods and relationship analysis)
Attributed Relational Graph (ARG)
View the database as an ARG.
Nodes:
• one per cluster of representations (if already resolved by the feature-based approach)
• one per representation (for “tough” cases)
Edges:
• regular: correspond to relationships between entities
• similarity: created using feature-based methods on representations
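The ARG described above can be sketched as a simple adjacency-list structure. This is a minimal illustration, not the RelDC implementation; the node names, edge kinds, and weights are hypothetical.

```python
from collections import defaultdict

class ARG:
    """A toy Attributed Relational Graph: nodes plus typed, weighted edges."""

    def __init__(self):
        # maps each node to a list of (neighbor, edge_kind, weight) triples
        self.adj = defaultdict(list)

    def add_edge(self, u, v, kind="regular", weight=1.0):
        # regular edges model relationships between entities;
        # similarity edges connect representations that feature-based
        # similarity (FBS) could not fully resolve
        self.adj[u].append((v, kind, weight))
        self.adj[v].append((u, kind, weight))

    def neighbors(self, u):
        return self.adj[u]

# illustrative usage
g = ARG()
g.add_edge("paper1", "J. Smith", kind="regular")             # authorship
g.add_edge("J. Smith", "Jane Smith", kind="similarity", weight=0.8)
```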
Context Attraction Principle (CAP)
Who is “J. Smith”: Jane or John?
• The CAP: a reference is more likely to refer to the candidate entity to which it is more strongly connected via relationships
Questions to Answer
• Does the CAP hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?
• Can we design a generic strategy that exploits the CAP for consolidation?
Consolidation Algorithm
1. Construct the ARG and identify all virtual clusters of similar representations (VCSs)
• use feature-based similarity (FBS) in constructing the ARG
2. Choose a VCS and compute the connection strength between nodes
• for each pair of representations connected via a similarity edge
3. Partition the VCS
• use a graph partitioning algorithm
• partitioning is based on connection strength
• after partitioning, adjust the ARG accordingly
4. Go to Step 2 if more potential clusters exist
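The overall loop above can be sketched as follows. The `connection_strength` and `partition` functions are stubs standing in for the components described on the next slides; the toy implementations and names are illustrative assumptions, not the paper’s code.

```python
def consolidate(vcs_list, connection_strength, partition):
    """Iteratively partition virtual clusters (VCSs) into entity clusters."""
    clusters = []
    work = list(vcs_list)
    while work:                          # repeat while potential clusters remain
        vcs = work.pop()                 # choose a VCS
        # compute c(u, v) for every unordered pair in the VCS
        weights = {(u, v): connection_strength(u, v)
                   for u in vcs for v in vcs if u < v}
        parts = partition(vcs, weights)  # graph partitioning step
        if len(parts) == 1:
            clusters.append(parts[0])    # VCS holds a single entity: done
        else:
            work.extend(parts)           # re-examine each new part
    return clusters

# toy stubs: representations of the same entity share a first letter
strength = lambda u, v: 1.0 if u[0] == v[0] else 0.0

def split(vcs, weights):
    groups = {}
    for x in vcs:
        groups.setdefault(x[0], []).append(x)
    return list(groups.values())
```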
Connection Strength c(u,v)
Models for c(u,v):
• many possibilities: diffusion kernels, random walks, etc.
• none is fully adequate: they cannot learn similarity from data
Diffusion kernels:
• λ1(x,y): “base similarity”, via direct links (paths of length 1)
• λk(x,y): “indirect similarity”, via paths of length k
• B: base similarity matrix, where Bxy = B1xy = λ1(x,y)
• Bk: indirect similarity matrix
• K: total similarity matrix, or “kernel”
Connection Strength c(u,v) (cont.)
Instantiating parameters:
• Determining λ(x,y):
• regular edges have types T1,...,Tn
• types T1,...,Tn have weights w1,...,wn
• λ(x,y) = wi: get the type Ti of a given edge and assign its weight wi as the base similarity
• Handling similarity edges:
• λ(x,y) is assigned a value proportional to the similarity (heuristic)
• an approach to learn λ(x,y) from data is ongoing work
Implementation:
• we do not compute the whole matrix K
• we compute one c(u,v) at a time
• we limit path lengths by L
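A minimal sketch of the base-similarity assignment described above. The edge-type names, weights, and the 0.5 scaling factor for similarity edges are illustrative assumptions; the slide only says the value is proportional to the FBS similarity.

```python
# hypothetical relationship types T_i and their weights w_i
EDGE_WEIGHTS = {
    "authored":   1.0,   # assumed strong relationship type
    "affiliated": 0.5,   # assumed weaker relationship type
}

def base_similarity(edge_type, similarity=None):
    """Return lambda(x, y) for one edge of the ARG."""
    if edge_type == "similarity":
        # similarity edge: weight proportional to FBS similarity (heuristic);
        # the 0.5 proportionality constant is an assumption
        return 0.5 * similarity
    # regular edge of type T_i: lambda(x, y) = w_i
    return EDGE_WEIGHTS[edge_type]
```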
Consolidation via Partitioning
Observations:
• each VCS contains representations of at least one object
• if a representation is in a VCS, then the rest of the representations of the same object are in it too
Partitioning, two cases:
• k, the number of entities in the VCS, is known
• use any partitioning algorithm that maximizes intra-cluster and minimizes inter-cluster connection strength
• we use the normalized cut of [Shi, Malik 2000]
• k is unknown
• split into two, just to see the cut
• compare the cut against a threshold
• decide “to split” or “not to split”; iterate
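The “k unknown” case above can be sketched as a recursive procedure. The two-way `bisect` function (e.g., a normalized-cut solver) is assumed, not implemented; the toy version below and the threshold value are illustrative.

```python
def partition_unknown_k(vcs, bisect, threshold):
    """Split a VCS in two, inspect the cut, and recurse only if it looks real."""
    parts_a, parts_b, cut = bisect(vcs)        # split into two, just to see the cut
    if cut >= threshold or not parts_a or not parts_b:
        return [vcs]                           # decide "not to split"
    # decide "to split": iterate on each half
    return (partition_unknown_k(parts_a, bisect, threshold) +
            partition_unknown_k(parts_b, bisect, threshold))

# toy bisector: a cheap cut separates different first letters,
# an expensive cut is reported for splits within one entity
def toy_bisect(vcs):
    a = [x for x in vcs if x.startswith("a")]
    b = [x for x in vcs if not x.startswith("a")]
    if a and b:
        return a, b, 0.0
    mid = len(vcs) // 2
    return vcs[:mid], vcs[mid:], 1.0
```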
Measuring Quality of the Outcome
• Dispersion: for an entity, the number of clusters its representations are spread across; ideal is 1
• Diversity: for a cluster, the number of distinct entities it covers; ideal is 1
• Entity uncertainty: for an entity whose m representations are assigned m1 to cluster C1, ..., mn to cluster Cn, the entropy H = −Σi (mi/m) log(mi/m)
• Cluster uncertainty: defined symmetrically for a cluster consisting of m1 representations of entity E1, ..., mn of entity En
• ideal entropy is zero
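The quality measures above can be computed directly from a representation-to-cluster assignment. This is a straightforward sketch assuming the assignment is given as a dict; the entropy is the standard Shannon entropy over the cluster counts.

```python
import math
from collections import Counter

def entropy(counts):
    """H = -sum (m_i/m) * log2(m_i/m); zero for a pure assignment."""
    total = sum(counts)
    return -sum((m / total) * math.log2(m / total) for m in counts if m)

def dispersion(entity_reprs, assignment):
    """Number of distinct clusters an entity's representations fall into (ideal: 1)."""
    return len({assignment[r] for r in entity_reprs})

def entity_entropy(entity_reprs, assignment):
    """Entity uncertainty: entropy of the cluster counts m1, ..., mn."""
    counts = Counter(assignment[r] for r in entity_reprs)
    return entropy(counts.values())
```

Cluster uncertainty is the mirror image: count how many representations of each entity a cluster contains and apply the same `entropy` function.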
Experimental Setup
RealMov dataset:
• movies (12K)
• people (22K): actors, directors, producers
• studios (1K): producing, distributing
Injecting uncertainty:
• d1,d2,...,dn are director entities; pick a fraction d1,d2,...,dm
• group entries in size k, e.g., in groups of two: {d1,d2}, ..., {d9,d10}
• make all representations within a group indiscernible by FBS
Baseline 1:
• one cluster per VCS, regardless of content
• equivalent to using only FBS
• achieves ideal dispersion and H(E) by construction
Baseline 2:
• knows the grouping statistics
• guesses the number of entities in each VCS
• randomly assigns representations to clusters
Parameters:
• L-short simple paths, L = 7, where L is the path-length limit
Note: the algorithm is applied only to the “tough cases” left after FBS has already successfully consolidated many entries.
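The uncertainty-injection step above can be sketched as follows: pick a fraction of the director entities, group them in size k, and give each group one shared name so FBS cannot tell the members apart. The function and the `group_<i>` naming scheme are illustrative assumptions.

```python
def make_indiscernible(directors, fraction, k):
    """Group a fraction of entities in size k and assign each group one shared name."""
    n = int(len(directors) * fraction)
    chosen = directors[:n]                     # the fraction made ambiguous
    groups = [chosen[i:i + k] for i in range(0, len(chosen), k)]
    # every member of a group gets the same name, so a purely feature-based
    # method sees the members as indiscernible
    shared = {d: "group_%d" % gi for gi, grp in enumerate(groups) for d in grp}
    return groups, shared
```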
The Effect of L on Quality
(figures: cluster entropy and diversity; entity entropy and dispersion, each as a function of L)
Summary
RelDC:
• domain-independent data cleaning framework
• uses relationships for data cleaning
• reference disambiguation [SDM’05]
• object consolidation [IQIS’05]
Ongoing work:
• “learning” the importance of relationships from data
• exploiting relationships among entities for other data cleaning problems
Contact Information RelDC project www.ics.uci.edu/~dvk/RelDC www.itr-rescue.org (RESCUE) Zhaoqi Chen chenz@ics.uci.edu Dmitri V. Kalashnikov www.ics.uci.edu/~dvk dvk@ics.uci.edu Sharad Mehrotra www.ics.uci.edu/~sharad sharad@ics.uci.edu
Object Consolidation Problem: Notation
• Let O = {o1,...,o|O|} be the set of entities (unknown in general)
• Let X = {x1,...,x|X|} be the set of representations; the goal is to map each xi to its corresponding entity oj in O
• d[xi]: the entity xi refers to (unknown in general)
• C[xi]: all representations that refer to d[xi], the “group set” (unknown in general; the goal is to find it for each xi)
• S[xi]: all representations that can be xi, the “consolidation set” (determined by FBS)
• we assume C[xi] ⊆ S[xi]
Connection Strength: Computation of c(u,v)
Phase 1: discover connections
• all L-short simple paths between u and v
• this phase is the bottleneck; optimizations are not covered in the IQIS’05 paper
Phase 2: measure the strength
• of the discovered connections
• many c(u,v) models exist; we use a model similar to diffusion kernels
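The two phases above can be sketched together: enumerate all L-short simple paths between u and v by depth-first search, then score each path by the product of its edge weights and sum the scores. The graph encoding and the product-of-weights scoring are illustrative; the paper's actual model is only described as similar to diffusion kernels.

```python
def l_short_simple_paths(adj, u, v, L):
    """Phase 1: all simple paths from u to v with at most L edges."""
    paths, stack = [], [(u, [u])]
    while stack:
        node, path = stack.pop()
        if node == v:
            paths.append(path)
            continue
        if len(path) - 1 >= L:         # already L edges long: stop extending
            continue
        for nxt, _w in adj.get(node, []):
            if nxt not in path:        # simple paths only: no repeated nodes
                stack.append((nxt, path + [nxt]))
    return paths

def connection_strength(adj, u, v, L):
    """Phase 2: sum over discovered paths of the product of edge weights."""
    weight = {(a, b): w for a in adj for b, w in adj[a]}
    total = 0.0
    for path in l_short_simple_paths(adj, u, v, L):
        score = 1.0
        for a, b in zip(path, path[1:]):
            score *= weight[(a, b)]    # product of base similarities on the path
        total += score
    return total

# illustrative graph: u - a - v with weight 0.5 on each edge
adj = {"u": [("a", 0.5)], "a": [("v", 0.5), ("u", 0.5)], "v": [("a", 0.5)]}
```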
Our c(u,v) Model
Relation to diffusion kernels:
• virtually identical, but we do not compute the whole matrix K
• we compute one c(u,v) at a time and limit path lengths by L
• λ(x,y) is unknown in general; the analyst assigns the weights (learning them from data is ongoing work)
Our c(u,v) model:
• regular edges have types T1,...,Tn with weights w1,...,wn
• λ(x,y) = wi: get the type Ti of a given edge and assign its weight wi as the base similarity
• paths with similarity edges might not exist; use heuristics