Efficiently Linking Text Documents with Relevant Structured Information
Prasan Roy, with Venkat Chakaravarthy, Himanshu Gupta and Mukesh Mohania
IBM India Research Lab, New Delhi
VLDB 2006, Seoul, Korea
Structured and Unstructured Information
• Information content in an enterprise can be structured or unstructured
  • Structured content: transaction data, payroll, sales orders, invoices, customer profiles, etc.
  • Unstructured content: emails, reports, web pages, complaints, etc.
• Historically, structured and unstructured data retrieval technologies have evolved separately, creating an artificial separation between these two "kinds" of information
• Enterprises are realizing the need to bridge this separation, and are demanding integrated retrieval, management and analysis of both structured and unstructured content
[Figure: structured data retrieval (SQL query, RDBMS, SQL result) vs. unstructured data retrieval (keyword search query, search engine, search result)]
EROCS: Entity RecOgnition in Context of Structured data
• Exploit the partial information contained in a document to automatically identify and link relevant structured data
Main idea
• View the structured data as a set of pre-defined "entities"
• Identify the entities from this set that best match the document, and also find embeddings of the identified entities in the document
Entity Templates
• Specify the locations (within the relational database) of
  • the candidate entities to be matched against the document, and
  • for each entity, the relevant context information to be exploited to perform the match
• Specified by a domain expert
  • Default: all tables reachable from the identified "root" node
• Future work: automatic identification of relevant templates
Example (IMDB)
Document:
"The heroine ungraciously and ineptly disturbs the respectability and silence of Manderley by breaking a treasured ceramic china cupid on the desk. She fearfully hides the pieces of the broken statue in the back of Rebecca's desk drawer. Maxim's sister Beatrice Lacy (Gladys Cooper) and brother-in-law Major Giles Lacy (Nigel Bruce) arrive for lunch.
The Tattaglia family is behind him here in New York. Sonny is in favor of pushing drugs.... Tom legally advises Corleone to consider entering the drug business because it is the wave of the future. Corleone, Fredo, Clemenza, Sonny, Tessio and Hagen meet with Sollozzo - a maverick, shifty rival gangster who is chief spokesman for the Tattaglias - another Mafia family that is headed by Phillip Tattaglia (Victor Rendina). Sollozzo is also Tattaglia's chief assassin."
Entities identified by EROCS in the IMDB database:
<movie>
  <name> Rebecca </name>
  <actor> <name> Bruce, Nigel </name> <actedas> Major Giles Lacy </actedas> </actor>
  <actor> <name> Olivier, Laurence </name> <actedas> 'Maxim' de Winter </actedas> </actor>
  <actor> <name> Collier, Constance </name> <actedas> Rebecca </actedas> </actor>
  <actor> <name> Cooper, Gladys </name> <actedas> Beatrice Lacy </actedas> </actor>
  …
  <director> Hitchcock, Alfred </director>
</movie>
<movie>
  <name> Godfather, The </name>
  <actor> <name> Brando, Marlon </name> <actedas> Don Vito Corleone </actedas> </actor>
  <actor> <name> Caan, James </name> <actedas> 'Sonny' Corleone </actedas> </actor>
  <actor> <name> Duvall, Robert (I) </name> <actedas> Tom Hagen </actedas> </actor>
  <actor> <name> Lettieri, Al </name> <actedas> Virgil Sollozzo </actedas> </actor>
  <actor> <name> Rendina, Victor </name> <actedas> Philip Tattaglia </actedas> </actor>
  …
  <director> Coppola, Francis F. </director>
</movie>
Enables Effective Search
(Document from the previous example.)
Existing search engines
• Recall is inherently limited by the terms actually present in the document
• For instance, the term "Godfather", though relevant, does not appear in the document, so existing search engines would not return this document in response to the query "Godfather"
EROCS value-add
• Automatically retrieves relevant information from a structured database and associates it with the document as additional "metadata"
• Search can exploit this metadata, improving both recall and precision
  • A search on "Godfather" would return the example document
  • "Show documents about movies with characters Rebecca and Corleone" would NOT return the example document
  • Enables more complex XML Fragments/XPath queries on the documents
• The associated metadata can also be used to gauge similarity between documents, and to enable or complement sophisticated text analysis
Enables OLAP on Structured + Unstructured Data
• Current OLAP tools are restricted to structured data
• EROCS can incorporate unstructured data in the analysis
• Example: find the store with the greatest upsurge in complaints on high-value transactions
[Schema diagram: structured data CUSTOMER(CUSTID, NAME, LEVEL, ADDRESS), STORE(STOREID, ADDRESS), TRANSACTION(TRANSID, CUSTID, STOREID, TYPE, VALUE), TRANSPROD(TRANSID, PRODID, QTY), INVENTORY(SUPID, STOREID, PRODID, QTY), SUPPLIER(SUPID, NAME, ADDRESS), PRODUCT(PRODID, NAME, PRICE, MANUFID), MANUFACTURER(MANUFID, NAME, ADDRESS); unstructured data COMPLAINTS(DOCID, AUTH, DOCBODY), linked via TRANSDOC(DOCID, TRANSID, START, END, CONFIDENCE)]
Outline
• Introduction
• Framework
• Identifying Best-Matching Entities
• Context Cache
• Exploiting the Context Cache
• Experimental Study
• Conclusion
Entity and Document Models
Entity
• Each row e of the pivot table is identified as an entity
• Context of the entity e = the set of terms present in the row e, as well as in rows of the context tables that have a path to e
Document
• A sequence of sentences, where each sentence is a bag of terms
• The actual implementation runs a parser on the document and retains only the noun phrases
• Could be enhanced by further disambiguating the terms using NER (identifying customer names, organization names, product names, etc.)
Segment
• A sequence of one or more consecutive sentences in the document
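To make the framework concrete, here is a minimal sketch of how the entity and document models could be represented in code; the class and field names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field

# Minimal sketch of the data model; names are illustrative, not from the paper.

@dataclass(frozen=True)
class Entity:
    key: int                          # primary key of the pivot-table row
    context: frozenset = frozenset()  # terms in the row and in joined context rows

@dataclass
class Document:
    sentences: list = field(default_factory=list)  # each sentence is a bag (list) of noun-phrase terms

    def segment_terms(self, i, j):
        """Terms of the segment spanning sentences i..j (1-based, inclusive)."""
        bag = []
        for sentence in self.sentences[i - 1:j]:
            bag.extend(sentence)
        return bag
```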
Entity-Document Matching
• The score of an entity e with respect to a segment d is defined as
    score(e, d) = Σ_{t ∈ T(e, d)} tf(t, d) · w(t)
  where
  • T(e, d) = the set of terms common between the entity e and the segment d
  • tf(t, d) = the number of times the term t appears in the segment d
  • w(t) = the weight of the term t
• In our implementation, w(t) is defined as
    w(t) = log( N / n(t) )
  where
  • n(t) = the number of entities that contain t in their context
  • N = the total number of entities
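As an illustration only (not the authors' code), the score could be computed along these lines; the helper names and the in-memory representation of entity contexts are assumptions.

```python
import math
from collections import Counter

# Sketch of the entity-segment score; helper names are illustrative.

def idf_weight(term, entity_contexts):
    """w(t) = log(N / n(t)); N = number of entities, n(t) = entities whose context contains t."""
    N = len(entity_contexts)
    n_t = sum(1 for ctx in entity_contexts if term in ctx)
    return math.log(N / n_t) if n_t else 0.0

def score(entity_context, segment_terms, entity_contexts):
    """score(e, d) = sum over shared terms t of tf(t, d) * w(t)."""
    tf = Counter(segment_terms)
    shared = set(entity_context) & set(tf)      # T(e, d)
    return sum(tf[t] * idf_weight(t, entity_contexts) for t in shared)
```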
Identifying the Best-Matching Annotation
• Input: a document D and a set of entities E
• An annotation of the document D is a pair (S, F), where
  • S is a set of non-overlapping segments of D, and
  • F: S → E maps each segment d ∈ S to an entity F(d) ∈ E
• Score of an annotation:
    score(S, F) = Σ_{d ∈ S} ( score(F(d), d) − λ )
  where λ ≥ 0 is a tunable parameter; subtracting λ ensures that score(F(d), d) ≥ λ for each segment d in an optimal solution
• Problem statement: find an annotation with the maximum score among all annotations of the document D
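For concreteness, a tiny sketch of the annotation objective (illustrative only; `annotation_score` is a hypothetical helper, not from the paper).

```python
# Sketch of the annotation objective; each value in segment_scores stands for score(F(d), d).

def annotation_score(segment_scores, lam):
    """score(S, F) = sum over chosen segments d of (score(F(d), d) - lambda)."""
    return sum(s - lam for s in segment_scores)

# Two matched segments scoring 7.2 and 3.1 with lambda = 4 give roughly 2.3:
# annotation_score([7.2, 3.1], 4.0)  # ~ 2.3
```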
Pruning the Search Space
• An annotation (S, F) is termed canonical iff
  • S is a partition of D, and
  • F maps each segment d ∈ S to its best-matching entity
• Claim: for any document D, there exists a canonical annotation that is an optimal annotation for D
• We can therefore restrict the search space to canonical annotations without loss of generality
• Problem statement (revised): find a canonical annotation with the maximum score among all canonical annotations of the document D
Algorithm
• Let D_{i,j} = the segment of D containing sentences i through j, 1 ≤ i ≤ j ≤ |D|
• Let e_{i,j} = the best-matching entity for D_{i,j}, with s_{i,j} = score(e_{i,j}, D_{i,j})
• Let (S_k, F_k) = the best annotation for D_{1,k}, with r_k = score(S_k, F_k)
• Then r_k = max_{0 ≤ j ≤ k−1} ( r_j + s_{j+1,k} − λ )

Procedure BestAnnot(D)
Input: document D; Output: optimal annotation
  For i = 1 to |D|
    For j = i to |D|
      Let e_{i,j} = argmax_{e ∈ E} score(e, D_{i,j})
      Let s_{i,j} = score(e_{i,j}, D_{i,j})
  Let S_0 = {}, r_0 = 0
  For k = 1 to |D|
    Let j = argmax_{0 ≤ j ≤ k−1} ( r_j + s_{j+1,k} − λ )
    Let S_k = S_j ∪ { D_{j+1,k} }
    Let r_k = r_j + s_{j+1,k} − λ
  For each d ∈ S_{|D|}
    Let F_{|D|}(d) = argmax_{e ∈ E} score(e, d)    (maps each segment to its best-matching entity)
  Return (S_{|D|}, F_{|D|})

(A Python sketch of this procedure follows below.)
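A compact Python rendering of this dynamic program might look as follows. It is a sketch, not the authors' implementation, and it assumes a helper `best_match(i, j)` that returns the best-matching entity and its score for the segment spanning sentences i..j.

```python
# Sketch of the BestAnnot dynamic program. best_match(i, j) is an assumed helper
# returning (best entity, its score) for the segment spanning sentences i..j.

def best_annot(num_sentences, best_match, lam):
    n = num_sentences

    # Precompute the best entity and score for every segment D[i..j].
    e, s = {}, {}
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            e[i, j], s[i, j] = best_match(i, j)

    # r[k] = score of the best annotation of the prefix D[1..k]; segs[k] = its segments.
    r = [0.0] * (n + 1)
    segs = {0: []}
    for k in range(1, n + 1):
        j = max(range(k), key=lambda j: r[j] + s[j + 1, k] - lam)
        segs[k] = segs[j] + [(j + 1, k)]
        r[k] = r[j] + s[j + 1, k] - lam

    # Map each chosen segment to its best-matching entity.
    return [((i, j), e[i, j]) for (i, j) in segs[n]]
```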
Issue: Performance
• The algorithm involves an entity search for every segment of the document (procedure BestAnnot on the previous slide)
• If done naively, this is likely to be a performance bottleneck
Possible solutions
• Cache the result of each entity search
  • Not effective: a document is unlikely to contain repeated segments
• Materialize and index the context of each entity
  • High computation, maintenance and storage overheads
• The remainder of the talk develops efficient, low-overhead techniques
Context Cache
• Stores associations between entities and terms in the document
• A collection of pairs of the form (e, t), meaning that the term t is contained in the context of the entity e
• Indexed both on entities and on terms
Database access primitives
• GetEntities(t)
  • Retrieves the set of entities that contain the term t in their context
  • Inserts (e, t) into the cache for each entity e in the set
• GetTerms(e)
  • Retrieves the set of terms in the context of the entity e
  • Inserts (e, t) into the cache for each term t in the set
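As a rough sketch (not the authors' code), the cache and its two primitives could look like this; `db_entities_with_term` and `db_terms_of_entity` stand in for the actual database queries over the entity template and are assumptions.

```python
# Sketch of the context cache. db_entities_with_term(t) and db_terms_of_entity(e)
# stand in for the actual database queries and are assumptions.

class ContextCache:
    def __init__(self, db_entities_with_term, db_terms_of_entity):
        self._pairs = set()             # cached (entity, term) associations
        self._fetched_terms = set()     # terms t for which GetEntities(t) has already run
        self._fetched_entities = set()  # entities e for which GetTerms(e) has already run
        self._db_entities = db_entities_with_term
        self._db_terms = db_terms_of_entity

    def get_entities(self, term):
        """GetEntities(t): hit the database only on the first call for t."""
        if term not in self._fetched_terms:
            for e in self._db_entities(term):
                self._pairs.add((e, term))
            self._fetched_terms.add(term)
        return {e for (e, t) in self._pairs if t == term}

    def get_terms(self, entity):
        """GetTerms(e): fetch the full context of e only on the first call for e."""
        if entity not in self._fetched_entities:
            for t in self._db_terms(entity):
                self._pairs.add((entity, t))
            self._fetched_entities.add(entity)
        return {t for (e, t) in self._pairs if e == entity}
```

A production version would keep two hash indexes (term to entities and entity to terms) instead of scanning the pair set, matching the "indexed both on entities and terms" point above.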
Eliminating Repeated Database Access
• Baseline approach: the most straightforward use of the cache
• Eliminates repeated database access for repeated invocations of GetEntities(t) and GetTerms(e) with the same term t or entity e
AllTerms
• Populate the cache by invoking GetEntities(t) for each term t in the document
• Determine the best-matching entity for each segment of the document using the information in the cache
• Call BestAnnot(D) to compute the best annotation
• Does not scale well, in terms of both time and space overheads
Reason
• AllTerms invokes GetEntities(t) for every term in the document, including terms that are present in a very large number of entities
  • Such terms have low weight, so they make little difference to the scores
  • Their large result sets make GetEntities computationally expensive
• Need to avoid calling GetEntities on such terms, while still ensuring that the best-matching entity for each segment is retrieved
Cache-based Score Bounds
Given a segment d
• Let TC(d) ⊆ T(d) be the set of terms of d on which GetEntities has been invoked so far
Then, for any entity e ∈ E
• Lower bound: score⁻(e, d) = Σ_{t ∈ TC(d), (e, t) in the cache} tf(t, d) · w(t)
• Upper bound: score⁺(e, d) = score⁻(e, d) + wC(d)
  where wC(d) = Σ_{t ∈ T(d) − TC(d)} tf(t, d) · w(t) is the total weight of the terms of d not yet queried
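A minimal sketch of how these bounds could be computed from the cache contents (illustrative; `queried` corresponds to TC(d), and `cached_pairs` and the weight function `w` are assumed inputs).

```python
from collections import Counter

# Sketch of the cache-based bounds. queried = TC(d); cached_pairs = the (entity, term)
# associations currently in the cache; w maps a term to its weight. All assumed inputs.

def score_bounds(entity, segment_terms, queried, cached_pairs, w):
    tf = Counter(segment_terms)
    lower = sum(tf[t] * w(t) for t in queried if t in tf and (entity, t) in cached_pairs)
    slack = sum(tf[t] * w(t) for t in tf if t not in queried)  # weight of unqueried terms of d
    return lower, lower + slack                                # (score-, score+)
```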
Cache Completeness
• Let EC(d) ⊆ E be the set of entities retrieved by at least one GetEntities() call so far
• The context cache is termed complete with respect to a segment d iff the best-matching entity for d is guaranteed to be present in EC(d)
• For any e ∉ EC(d), score⁻(e, d) = 0, and therefore score(e, d) ≤ score⁺(e, d) = wC(d)
• Hence the context cache is complete with respect to d if there exists an e′ ∈ EC(d) such that score⁻(e′, d) ≥ wC(d)
Term Pruning Strategy
• Call GetEntities on the terms t ∈ T(d) in decreasing order of tf(t, d) · w(t), stopping as soon as the cache becomes complete with respect to d (a Python sketch follows below)

Procedure BestMatchEntity(d)
Input: segment d; Output: best-matching entity
  Let EC(d) = {}
  Let wC = Σ_{t ∈ T(d)} tf(t, d) · w(t)
  Repeat
    Let t′ = argmax_{t ∈ T(d) − TC(d)} tf(t, d) · w(t)
    Let wC = wC − tf(t′, d) · w(t′)
    Let EC(d) = EC(d) ∪ GetEntities(t′)
    Let e′ = argmax_{e ∈ EC(d)} score⁻(e, d)
  Until score⁻(e′, d) > wC
  Let E(d) = EC(d), s* = 0
  Repeat
    Call GetTerms(e′)
    Let s′ = score(e′, d)
    If s* < s′ then let s* = s′, e* = e′
    Let E(d) = E(d) − { e′ }
    Let e′ = argmax_{e ∈ E(d)} score⁻(e, d)
  Until score⁻(e′, d) + wC < s*
  Return e*

AllSegments
• For each segment d of D, call BestMatchEntity(d) to compute the best-matching entity
• Call BestAnnot(D) to compute the best annotation
• Does not scale well with document size
Reason
• Computes the best-matching entity for all segments of the document
• Need to effectively prune the set of segments
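The following Python sketch approximates BestMatchEntity under the term-pruning strategy, assuming a cache object with `get_entities`/`get_terms` as sketched earlier and a term-weight function `w`; it is not the authors' implementation.

```python
from collections import Counter

# Sketch of BestMatchEntity under term pruning. cache is assumed to expose
# get_entities / get_terms as in the earlier cache sketch; w maps a term to its weight.

def best_match_entity(segment_terms, cache, w):
    tf = Counter(segment_terms)
    by_weight = sorted(tf, key=lambda t: tf[t] * w(t), reverse=True)

    # Phase 1: GetEntities on terms in decreasing weight until the cache is complete
    # for this segment, i.e. some candidate's lower bound exceeds the total weight
    # of the still-unqueried terms (wC).
    lower, candidates = {}, set()
    remaining = sum(tf[t] * w(t) for t in tf)      # wC
    for t in by_weight:
        remaining -= tf[t] * w(t)
        for e in cache.get_entities(t):
            candidates.add(e)
            lower[e] = lower.get(e, 0.0) + tf[t] * w(t)
        if candidates and max(lower.values()) > remaining:
            break

    # Phase 2: resolve exact scores (via GetTerms) for candidates whose upper bound
    # (lower bound + remaining weight) can still beat the best exact score so far.
    best_e, best_s = None, 0.0
    for e in sorted(candidates, key=lambda e: lower[e], reverse=True):
        if lower[e] + remaining < best_s:
            break
        ctx = cache.get_terms(e)
        exact = sum(tf[t] * w(t) for t in tf if t in ctx)
        if exact > best_s:
            best_e, best_s = e, exact
    return best_e, best_s
```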
Cache-based Best-Annotation Score Bounds
• Modify BestAnnot to compute the best annotation using only the cache contents, i.e. with score⁺ in place of score
• In-memory computation: no database access
• Let (S*, F*) be the best annotation for the document D, and let (SC, FC) be the annotation returned by BestAnnotC(D); then
    score⁻(SC, FC) ≤ score(S*, F*) ≤ score⁺(SC, FC)
  where, for an annotation (S, F),
    score⁻(S, F) = Σ_{d ∈ S} ( score⁻(F(d), d) − λ )  and  score⁺(S, F) = Σ_{d ∈ S} ( score⁺(F(d), d) − λ )

Procedure BestAnnotC(D)
Input: document D; Output: best annotation under score⁺
  For i = 1 to |D|
    For j = i to |D|
      Let e_{i,j} = argmax_{e ∈ E} score⁺(e, D_{i,j})
      Let s_{i,j} = score⁺(e_{i,j}, D_{i,j})
  Let S_0 = {}, r_0 = 0
  For k = 1 to |D|
    Let j = argmax_{0 ≤ j ≤ k−1} ( r_j + s_{j+1,k} − λ )
    Let S_k = S_j ∪ { D_{j+1,k} }
    Let r_k = r_j + s_{j+1,k} − λ
  For each d ∈ S_{|D|}
    Let F_{|D|}(d) = argmax_{e ∈ E} score⁺(e, d)
  Return (S_{|D|}, F_{|D|})
Iterative Cache Refinement: The EROCS Algorithm
• Iteratively "refine" the cache contents so that the slack between the lower and upper bounds of successive annotations (SC, FC) decreases with every iteration
• On termination, score⁻(SC, FC) = score⁺(SC, FC), so (SC, FC) is the best annotation (S*, F*)
• An incremental version of BestAnnotC is used to compute successive annotations efficiently

Procedure BestAnnotEROCS(D)
Input: document D; Output: best annotation
  Initialize the context cache as empty
  Let (SC, FC) = BestAnnotC(D)
  While score⁻(SC, FC) < score⁺(SC, FC)
    Call UpdateCache(SC, FC)    (cache refinement policy)
    Let (SC, FC) = BestAnnotC(D)
  Return (SC, FC)
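The outer loop can be sketched very compactly (illustrative only; `best_annot_upper` and `update_cache` are assumed callables wrapping BestAnnotC and the refinement policy).

```python
# Sketch of the EROCS outer loop. best_annot_upper() is assumed to run BestAnnotC on the
# current cache and return (annotation, lower-bound score, upper-bound score);
# update_cache(annotation) applies the cache refinement policy.

def erocs(best_annot_upper, update_cache):
    annot, lo, hi = best_annot_upper()
    while lo < hi:              # bounds not yet tight: refine the cache and recompute
        update_cache(annot)
        annot, lo, hi = best_annot_upper()
    return annot                # lo == hi, so the annotation is provably optimal
```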
Cache Refinement
• Define the slack of an annotation (SC, FC) as
    slack(SC, FC) = score⁺(SC, FC) − score⁻(SC, FC)
• Let (S1C, F1C) and (S2C, F2C) be the annotations computed by two consecutive invocations of BestAnnotC(D)
• Aim: choose the intervening cache update so that the decrease in slack, slack(S1C, F1C) − slack(S2C, F2C), is maximized
Observations
• If only GetEntities calls are allowed, the optimal policy is to invoke GetEntities on the term t* ∈ T(D) − TC(D) with the maximum tf(t*, D) · w(t*)
• If only GetTerms calls are allowed (and the context cache is complete with respect to each segment in S1C), a good heuristic is to invoke GetTerms on the best-matching entity of the segment in S1C with the maximum slack
• EROCS follows a hybrid policy
Cache Refinement Policy
• Favors GetEntities calls initially and GetTerms calls later
  • Justified because, initially, terms cause the greater decrease in slack
• Does not intermix GetEntities and GetTerms for a given annotation
  • Avoids redundant work
• Works well in practice

Procedure UpdateCache(SC, FC)
Input: current best annotation
  If the current cache is complete with respect to each d ∈ SC
    Let d* = argmax_{d ∈ SC} ( score⁺(FC(d), d) − score⁻(FC(d), d) )
    Call GetTerms(FC(d*))
  Else
    Let t* = argmax_{t ∈ T(D) − TC(D)} tf(t, D) · w(t)
    Call GetEntities(t*)
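A hedged sketch of the hybrid policy; all helper callables are assumptions carried over from the earlier sketches, not the authors' code.

```python
# Sketch of the hybrid refinement policy. annotation is a list of (segment, entity) pairs;
# complete(d), bounds(e, d), get_terms, get_entities, doc_tf, w, and queried are assumed
# to come from the earlier sketches.

def update_cache(annotation, doc_tf, w, queried, complete, bounds, get_terms, get_entities):
    if all(complete(d) for d, _ in annotation):
        # GetTerms on the matched entity of the segment with the largest slack.
        def slack(pair):
            d, e = pair
            lo, hi = bounds(e, d)
            return hi - lo
        _, e_star = max(annotation, key=slack)
        get_terms(e_star)
    else:
        # GetEntities on the heaviest document term not yet queried.
        t_star = max((t for t in doc_tf if t not in queried),
                     key=lambda t: doc_tf[t] * w(t))
        get_entities(t_star)
```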
Experimental Setup
Structured dataset
• Subset of the IMDB dataset
• Entities = movies; context = actors, directors, producers, writers, editors
• 401,660 movies, 2 GB across 8 tables
Document dataset
• Movie reviews/storylines downloaded from http://www.filmsite.org
• Noun phrases identified as the relevant terms
• The name of the movie removed from the text
• The reviews decomposed into 8-sentence segments
• Each segment classified as good/bad based on its average term weight
• Given parameters K and α, a random document is generated by
  • picking a random sequence of K distinct movies, and
  • for each movie in the sequence, including a good segment with probability α and a bad segment with probability 1 − α
• The final repository contains 50 documents for each K = 1, 2, …, 10 and α = 0.0, 0.1, …, 1.0
Accuracy metric
• Harmonic mean of the average precision and recall over the sentences of the document
Parameter setting
• λ = 4 (results are robust with respect to λ)
Experimental Results: Efficacy of Fine-Grained Entity Matching
[Accuracy plots: α = 0.8; K = 10]
• Top-K picks the top K entities that best match the entire document (considered as a single segment)
Experimental Results: Efficacy of EROCS
[Plots: α = 0.8]
Experimental Results: Efficacy of EROCS (contd.)
[Plots: K = 10]
Related Work
• Semantic integration [AI Magazine, Special Issue on Semantic Integration (2005)]
  • Identifying common concepts across heterogeneous data sources
  • Mainstay: across heterogeneous structured databases (DB), and within and across text documents (IR)
• Keyword-based search in relational databases [DBXplorer, BANKS (ICDE02), Discover (VLDB03)]
  • A few keywords, all presumed relevant to a single entity
• Named-entity recognition [Mansuri and Sarawagi, Chandel et al. (ICDE06), Agichtein and Ganti (SIGKDD04)]
  • Entities are recognized only if explicitly mentioned in the document
• Please refer to the paper for a more complete discussion
Summary
• EROCS: a system for inter-linking information across structured databases and documents
• An effective iterative-improvement algorithm that tries to keep the amount of information retrieved from the database small
• The discovered linkages can be used to enrich several techniques and applications in both database-centric and IR-centric domains