Using Latent Semantic Analysis to Find Different Names for the Same Entity in Free Text
Conley Read, cread@cs.ucr.edu, Computer Science & Engineering, University of California, Riverside
Tim Oates, Vinay Bhat, Vishal Shanbhag, Charles Nicholas, University of Maryland, Baltimore County
ACM Web Information and Data Management (WIDM), 2002:31-35
Overview • The problem and research motivation • The proposed solution: LSA • LSA alone doesn't work so well • Run LSA a second time • Two-stage LSA works! • Create your own corpus
The Problem: Mumbai / Bombay
Motivation: al Qaeda / al Qaida
Motivation: Nutrasweet / aspartame
Motivation: terms surrounding "al Qaeda" / "al Qaida": cells, network, suspects, Iraq, bin Laden, alleged, cell, warned, terrorist
An Old IR Problem … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …
Keyword Query: CAR … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …
Keyword Query: AUTOMOBILE … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …
Latent Semantic Analysis … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …
Term-Document Matrix A is an m × n matrix: m terms (rows) by n documents (columns). A(i, j) = number of times term i occurs in document j.
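A minimal sketch of building such a term-document count matrix in Python (whitespace tokenization only, for illustration; the paper's preprocessing is not specified):

```python
from collections import Counter

def term_document_matrix(docs):
    """Build A where A[i][j] is the count of term i in document j."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for term, count in Counter(doc.split()).items():
            A[index[term]][j] = count
    return vocab, A

docs = ["drove their car", "car gets good mileage", "drove their automobile"]
vocab, A = term_document_matrix(docs)
# A is m x n: m = len(vocab) terms, n = 3 documents
```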
Latent Semantic Analysis • Compute the singular value decomposition (SVD) of A: A = U Σ Vᵀ • Retain the k < n largest singular values • Set the remainder to zero • This projects terms/documents into a k-dimensional space • Compute similarity in that space
Singular Value Decomposition: A = U Σ Vᵀ
U – each row corresponds to a word
Σ – singular values of A
V – each column corresponds to a document
[Berry & Fierro 1996] Numerical Linear Algebra with Applications 3(4):301-327
Using the SVD: Aₖ = Uₖ Σₖ Vₖᵀ
Uₖ – look only at the first k columns (words)
Σₖ – set all but the k largest singular values to zero
Vₖ – look only at the first k rows (documents)
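The rank-k truncation described above can be sketched with NumPy (this is a generic illustration, not the authors' implementation):

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k approximation of A: keep only the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# toy term-document matrix
A = np.array([[2., 0., 2.], [0., 1., 0.], [1., 1., 1.]])
Uk, sk, Vtk = truncated_svd(A, 2)
Ak = Uk @ np.diag(sk) @ Vtk   # best rank-2 approximation of A
```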
Using LSA to Find Aliases • Given a name N and a document collection D • Compute the SVD of the term-document matrix • Retain the k largest singular values • Compute the similarity of every term to N • Report a rank-ordered list of terms • True aliases for N should appear high in the list
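The steps above can be sketched end to end: project each term into the k-dimensional space (rows of Uₖ scaled by the singular values) and rank all terms by cosine similarity to N. This is a minimal illustration under those assumptions, not the paper's code:

```python
import numpy as np

def rank_candidate_aliases(A, vocab, name, k):
    """Rank all terms by cosine similarity to `name` in the k-dim LSA space."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]          # one row per term in the reduced space
    target = term_vecs[vocab.index(name)]
    sims = term_vecs @ target / (
        np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(target) + 1e-12)
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != name]
```

On a toy corpus where "car" and "automobile" never co-occur but share context words, "automobile" ends up close to "car" in the reduced space while unrelated terms do not.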
Experiment: Creating Aliases • Given a name N and a document collection D • Set P, a percentage • Let S1 and S2 be two strings that do not occur in D • Replace N with S1 in P% of the documents containing it • Replace N with S2 in the remaining documents • Search for aliases of S1 • Observe the rank of S2 in the ordered list
Our Dataset • 77 documents from www.cnn.com • Shortest has 131 words, longest has 1923 • “al Qaeda” occurs in 49 documents • Others on politics, sports, entertainment • N = “al Qaeda” • S1 = “alqaeda1” • S2 = “alqaeda2” • P = 50
Algorithm Parameters • k – dimensionality of the compressed space • Small values produce spurious similarities • Large values approximate A too closely • T – threshold on TF-IDF (term frequency / inverse document frequency) values • Larger values filter more aggressively • We want to filter irrelevant words without filtering true aliases • Goal: a high retrieval rate and a low miss rate
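A minimal sketch of the TF-IDF thresholding described above, using one common TF-IDF weighting (the paper does not specify its exact formula, so relative term frequency times log inverse document frequency is assumed):

```python
import math

def tfidf_filter(docs, threshold):
    """Keep only terms whose maximum TF-IDF over documents exceeds `threshold`."""
    n = len(docs)
    tokenized = [d.split() for d in docs]
    df = {}                                  # document frequency per term
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    kept = set()
    for toks in tokenized:
        for t in set(toks):
            tf = toks.count(t) / len(toks)   # relative frequency in this doc
            idf = math.log(n / df[t])
            if tf * idf > threshold:
                kept.add(t)
    return kept
```

A term such as "the" that occurs in every document gets IDF 0 and is always filtered, while rarer, more distinctive terms survive.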
Results 1: LSA Stage 1 Figure 1: Plot of Rank as a function of t for values of k.
Results: Ontologically Dissimilar
k = 20: arrested, government, ressam, lindh, zubaydah, raids, attacks, brahim, passengers, virginia
k = 5: zubaydah, raids, ressam, pakistani, hamdi, soldier, trial, alqaeda2, pakistan, walker
k = 10: zubaydah, ressam, raids, hamdi, alqaeda2, pakistani, trial, soldier, pakistan, lindh
Problem: LSA ranks organizations and individuals as similar.
Local Context to Ontology
An Organization: … list of al Qaeda leaders … / … most senior al Qaeda member captured … / … alleged al Qaeda representative …
An Individual: … photograph showing Lindh blindfolded … / … with Lindh, the 21-year-old American … / … Lindh pleaded guilty …
Ontology: a hierarchical structuring of knowledge according to relevant or cognitive qualities.
A Second Run of LSA • For each term T in the top 250 candidates • Create a document DT • DT contains the words just before and just after each occurrence of T in the original corpus • Run LSA on the collection of all DT (the new corpus) … most senior al Qaeda member captured … … photograph showing Lindh blindfolded and …
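The construction of the context pseudo-documents DT can be sketched as follows (single-token terms and a one-word window on each side are assumed here; multiword names like "al Qaeda" would need phrase tokenization first):

```python
def context_documents(docs, candidates, window=1):
    """For each candidate term, collect the words immediately before and
    after each of its occurrences, forming one pseudo-document per term."""
    contexts = {t: [] for t in candidates}
    for doc in docs:
        toks = doc.split()
        for i, tok in enumerate(toks):
            if tok in contexts:
                contexts[tok].extend(toks[max(0, i - window):i])   # left context
                contexts[tok].extend(toks[i + 1:i + 1 + window])   # right context
    return {t: " ".join(words) for t, words in contexts.items()}
```

LSA is then run on this new corpus of pseudo-documents, so terms are compared by the company they keep rather than by the documents they appear in.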
Results 2: LSA Stage 2 Figure 2: Plot of Rank as a function of t for values of k.
Results 2: Scaled to Figure 1 Figure 3: Plot of Rank as a function of t for values of k.
Results 1 & 2: Comparison LSA-1 and LSA-2, Before and After.
Results: Contextually Similar
k = 20: tenet, suspected, warned, alqaeda2, terrorism, terrorist, anaconda, potential, operation, operations
k = 5: cells, alqaeda2, network, suspects, germany, laden, alleged, cell, terrorist, warned
k = 10: cells, network, alqaeda2, cell, terrorist, alleged, suspects, laden, singapore, germany
Solution: LSA with local context ranks terms by ontological similarity.
Applications • Create your own corpus • Submit N as a Google query • Create a corpus from the top M hits • Run two-stage LSA • Example alias search in movie titles: • Query N = "Ocean's 12" • Use Google to get the top 100 hits • Run the two-stage LSA algorithm You might retrieve: 1. GoldenEye 2. Ocean's 11 3. Die Hard: Vengeance 4. The Italian Job
Review • Find semantically related terms • Obvious solution – LSA • LSA is not so good • We ran LSA again! • LSA is great! • Create a Corpus with Google
Your Questions?
Acknowledgements: Dr. Tim Oates, oates@cs.umbc.edu
References – the math: Berry, M. W., and Fierro, R. D. 1996. Low-rank orthogonal decompositions for information retrieval applications. Numerical Linear Algebra with Applications 3(4):301-327.