Using Latent Semantic Analysis to Find Different Names for the Same Entity in Free Text
Conley Read, cread@cs.ucr.edu, Computer Science & Engineering, University of California, Riverside
Tim Oates, Vinay Bhat, Vishal Shanbhag, Charles Nicholas, University of Maryland, Baltimore County
ACM Web Information and Data Management (WIDM), 2002:31-35
Overview • The problem and research motivation • The proposed solution: LSA • LSA alone doesn't work so well • Run LSA a second time • Two-stage LSA works! • Create your own corpus
The Problem: Mumbai / Bombay
Motivation: al Qaeda / al Qaida
Motivation: Nutrasweet / aspartame
Motivation: terms surrounding "al Qaeda" / "al Qaida": cells, network, suspects, Iraq, bin Laden, alleged, cell, warned, terrorist
An Old IR Problem … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …
Keyword Query: CAR … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …
Keyword Query: AUTOMOBILE … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …
Latent Semantic Analysis … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …
Term-Document Matrix A is an m × n matrix: m terms (rows) by n documents (columns). A(i, j) = number of times term i occurs in document j.
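A minimal sketch of building such a term-document count matrix in Python (whitespace tokenization only, for illustration; the paper's preprocessing is not specified):

```python
from collections import Counter

def term_document_matrix(docs):
    """Build A where A[i][j] is the count of term i in document j."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for term, count in Counter(doc.split()).items():
            A[index[term]][j] = count
    return vocab, A

docs = ["drove their car", "car gets good mileage", "drove their automobile"]
vocab, A = term_document_matrix(docs)
# A is m x n: m = len(vocab) terms, n = 3 documents
```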
Latent Semantic Analysis • Compute the singular value decomposition (SVD) of A: A = U Σ Vᵀ • Retain the k < n largest singular values • Set the remainder to zero • This projects terms/documents into a k-dimensional space • Compute similarity in that space
Singular Value Decomposition: A = U Σ Vᵀ
U – each row corresponds to a word
Σ – singular values of A
V – each column corresponds to a document
[Berry & Fierro 1996] Numerical Linear Algebra with Applications 3(4):301-327
Using the SVD: Aₖ = Uₖ Σₖ Vₖᵀ
Uₖ – look only at the first k columns (words)
Σₖ – set all but the k largest singular values to zero
Vₖ – look only at the first k rows (documents)
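The rank-k truncation described above can be sketched with NumPy (this is a generic illustration, not the authors' implementation):

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k approximation of A: keep only the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# toy term-document matrix
A = np.array([[2., 0., 2.], [0., 1., 0.], [1., 1., 1.]])
Uk, sk, Vtk = truncated_svd(A, 2)
Ak = Uk @ np.diag(sk) @ Vtk   # best rank-2 approximation of A
```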
Using LSA to Find Aliases • Given a name N and a document collection D • Compute the SVD of the term-document matrix • Retain the k largest singular values • Compute the similarity of every term to N • Report a rank-ordered list of terms • True aliases for N should appear high in the list
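The steps above can be sketched end to end: project each term into the k-dimensional space (rows of Uₖ scaled by the singular values) and rank all terms by cosine similarity to N. This is a minimal illustration under those assumptions, not the paper's code:

```python
import numpy as np

def rank_candidate_aliases(A, vocab, name, k):
    """Rank all terms by cosine similarity to `name` in the k-dim LSA space."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]          # one row per term in the reduced space
    target = term_vecs[vocab.index(name)]
    sims = term_vecs @ target / (
        np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(target) + 1e-12)
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != name]
```

On a toy corpus where "car" and "automobile" never co-occur but share context words, "automobile" ends up close to "car" in the reduced space while unrelated terms do not.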
Experiment: Creating Aliases • Given a name N and a document collection D • Set P, a percentage • Let S1 and S2 be two strings that do not occur in D • Replace N with S1 in P% of the documents containing it • Replace N with S2 in the remaining documents • Search for aliases of S1 • Observe the rank of S2 in the ordered list
Our Dataset • 77 documents from www.cnn.com • Shortest has 131 words, longest has 1923 • “al Qaeda” occurs in 49 documents • Others on politics, sports, entertainment • N = “al Qaeda” • S1 = “alqaeda1” • S2 = “alqaeda2” • P = 50
Algorithm Parameters • k – dimensionality of the compressed space • Small values produce spurious similarities • Large values approximate A too closely • T – threshold on TF-IDF (term frequency / inverse document frequency) values • Larger values filter more aggressively • We want to filter irrelevant words without filtering true aliases • Goal: a high retrieval rate and a low miss rate
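A minimal sketch of the TF-IDF thresholding described above, using one common TF-IDF weighting (the paper does not specify its exact formula, so relative term frequency times log inverse document frequency is assumed):

```python
import math

def tfidf_filter(docs, threshold):
    """Keep only terms whose maximum TF-IDF over documents exceeds `threshold`."""
    n = len(docs)
    tokenized = [d.split() for d in docs]
    df = {}                                  # document frequency per term
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    kept = set()
    for toks in tokenized:
        for t in set(toks):
            tf = toks.count(t) / len(toks)   # relative frequency in this doc
            idf = math.log(n / df[t])
            if tf * idf > threshold:
                kept.add(t)
    return kept
```

A term such as "the" that occurs in every document gets IDF 0 and is always filtered, while rarer, more distinctive terms survive.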
Results 1: LSA Stage 1 Figure 1: Plot of Rank as a function of t for values of k.
Results: Ontologically Dissimilar
k = 20: arrested, government, ressam, lindh, zubaydah, raids, attacks, brahim, passengers, virginia
k = 5: zubaydah, raids, ressam, pakistani, hamdi, soldier, trial, alqaeda2, pakistan, walker
k = 10: zubaydah, ressam, raids, hamdi, alqaeda2, pakistani, trial, soldier, pakistan, lindh
Problem: LSA ranks organizations and individuals as similar.
Local Context to Ontology
An Organization: … list of al Qaeda leaders … / … most senior al Qaeda member captured … / … alleged al Qaeda representative …
An Individual: … photograph showing Lindh blindfolded … / … with Lindh, the 21-year-old American … / … Lindh pleaded guilty …
Ontology: a hierarchical structuring of knowledge according to relevant or cognitive qualities.
A Second Run of LSA • For each term T in the top 250 candidates • Create a document DT • DT contains the words just before and just after each occurrence of T in the original corpus • Run LSA on the collection of all DT (the new corpus) … most senior al Qaeda member captured … … photograph showing Lindh blindfolded and …
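The construction of the context pseudo-documents DT can be sketched as follows (single-token terms and a one-word window on each side are assumed here; multiword names like "al Qaeda" would need phrase tokenization first):

```python
def context_documents(docs, candidates, window=1):
    """For each candidate term, collect the words immediately before and
    after each of its occurrences, forming one pseudo-document per term."""
    contexts = {t: [] for t in candidates}
    for doc in docs:
        toks = doc.split()
        for i, tok in enumerate(toks):
            if tok in contexts:
                contexts[tok].extend(toks[max(0, i - window):i])   # left context
                contexts[tok].extend(toks[i + 1:i + 1 + window])   # right context
    return {t: " ".join(words) for t, words in contexts.items()}
```

LSA is then run on this new corpus of pseudo-documents, so terms are compared by the company they keep rather than by the documents they appear in.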
Results 2: LSA Stage 2 Figure 2: Plot of Rank as a function of t for values of k.
Results 2: Scaled to Figure 1 Figure 3: Plot of Rank as a function of t for values of k.
Results 1 & 2: Comparison LSA-1 and LSA-2, Before and After.
Results: Contextually Similar
k = 20: tenet, suspected, warned, alqaeda2, terrorism, terrorist, anaconda, potential, operation, operations
k = 5: cells, alqaeda2, network, suspects, germany, laden, alleged, cell, terrorist, warned
k = 10: cells, network, alqaeda2, cell, terrorist, alleged, suspects, laden, singapore, germany
Solution: LSA with local context ranks terms by ontological similarity.
Applications • Create your own corpus • Submit N as a Google query • Create a corpus from the top M hits • Run two-stage LSA • Example alias search in movie titles: • Query N = "Ocean's 12" • Use Google to get the top 100 hits • Run the two-stage LSA algorithm You might retrieve: 1. GoldenEye 2. Ocean's 11 3. Die Hard: Vengeance 4. The Italian Job
Review • Find semantically related terms • Obvious solution – LSA • LSA is not so good • We ran LSA again! • LSA is great! • Create a Corpus with Google
Your Questions?
Acknowledgements: Dr. Tim Oates, oates@cs.umbc.edu
References – the math: Berry, M. W., and Fierro, R. D. 1996. Low-rank orthogonal decompositions for information retrieval applications. Numerical Linear Algebra with Applications 3(4):301-327.