LATENT SEMANTIC INDEXING Hande Zırtıloğlu Levent Altunyurt
OUTLINE • Problem • History • What is LSI? • What are we really searching for? • How does LSI work? • Concept Space • Application Areas • Examples • Advantages / Disadvantages
PROBLEM • Conventional IR methods depend on • Boolean • vector space • probabilistic models • Handicap • dependent on term matching • not efficient for IR • Need to capture the concepts instead of only the words • multiple terms can contribute to similar semantic meanings • synonymy, e.g. "car" and "automobile" • one term may have various meanings depending on its context • polysemy, e.g. "apple" the fruit vs. "Apple" the computer company
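The term-matching handicap can be seen in a few lines. This is a minimal sketch (the mini-corpus and the overlap score are illustrative assumptions, not from the slides): a Boolean word-overlap score rates a clearly relevant document at zero because it uses a synonym of the query term.

```python
# Hypothetical mini-corpus: exact term matching cannot see that
# "automobile" and "car" mean the same thing (synonymy).
query = {"car"}
doc_about_cars = {"the", "automobile", "needs", "a", "new", "engine"}
doc_about_fruit = {"the", "apple", "tree", "bears", "fruit"}

def overlap(q, d):
    """Boolean word-matching relevance: number of shared terms."""
    return len(q & d)

print(overlap(query, doc_about_cars))   # 0 -- relevant, but scored zero
print(overlap(query, doc_about_fruit))  # 0
```

Both documents score 0, so pure word matching cannot even rank the relevant one above the irrelevant one.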
THE IDEAL SEARCH ENGINE • Scope • able to search every document on the Internet • Speed • better hardware and programming • Currency • frequent updates • Recall • always finds every document relevant to our query • Precision • no irrelevant documents in the result set • Ranking • most relevant results come first
What Can a Simple Search Engine Not Differentiate? • Polysemy • "monitor" the verb (monitor workflow) vs. "monitor" the device • Synonymy • car, automobile • Singular and plural forms • tree, trees • Words with similar roots • different, differs, differed
HISTORY • Mathematical technique for information filtering • Developed at Bellcore Labs (later Telcordia) • 30% more effective at filtering relevant documents than word-matching methods • A solution to "polysemy" and "synonymy" • End of the 1980s
LOOKING FOR WHAT? • Search • "Paris Hilton" • Really interested in the Hilton Hotel in Paris? • "Tiger Woods" • Searching for something about wildlife, or the famous golf player? • Simple word matching fails
LSI? • Concepts instead of words • Mathematical model • relates documents and concepts • Looks for concepts in the documents • Stores them in a concept space • related documents are connected to form the concept space • Does not need an exact match for the query
HOW DOES LSI WORK? • Given a set of documents • how do we determine the similar ones? • examine the documents • try to find concepts in common • classify the documents • LSI works the same way. • LSI represents terms and documents in a high-dimensional space, allowing relationships between terms and documents to be exploited during searching.
HOW TO OBTAIN A CONCEPT SPACE? • One possible way would be to find canonical representations of natural language • a difficult task to achieve • A much simpler way • use mathematical properties of the term-document matrix • i.e. determine the concepts by matrix computation
TERM-DOCUMENT MATRIX • Query: Human-Computer Interaction • Dataset: • c1 Human machine interface for Lab ABC computer application • c2 A survey of user opinion of computer system response time • c3 The EPS user interface management system • c4 System and human system engineering testing of EPS • c5 Relations of user-perceived response time to error measurement • m1 The generation of random, binary, unordered trees • m2 The intersection graph of paths in trees • m3 Graph minors IV: Widths of trees and well-quasi-ordering • m4 Graph minors: A survey
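The nine titles above are the classic term-document example. As a sketch (assuming plain Python with NumPy, and the usual filter of keeping only content terms that occur in more than one title), the count matrix can be built like this; the small stopword list is an assumption for illustration:

```python
import numpy as np
from collections import Counter

# The nine titles from the slide (lowercased, punctuation dropped).
docs = {
    "c1": "human machine interface for lab abc computer application",
    "c2": "a survey of user opinion of computer system response time",
    "c3": "the eps user interface management system",
    "c4": "system and human system engineering testing of eps",
    "c5": "relations of user perceived response time to error measurement",
    "m1": "the generation of random binary unordered trees",
    "m2": "the intersection graph of paths in trees",
    "m3": "graph minors iv widths of trees and well quasi ordering",
    "m4": "graph minors a survey",
}
tokenized = {name: text.split() for name, text in docs.items()}

# Keep content terms that occur in more than one title (the usual filter).
stopwords = {"a", "of", "the", "and", "for", "in", "to", "iv"}
doc_freq = Counter(t for words in tokenized.values() for t in set(words))
terms = sorted(t for t, df in doc_freq.items() if df > 1 and t not in stopwords)

# Rows = terms, columns = documents, entries = raw term counts.
A = np.array([[tokenized[d].count(t) for d in docs] for t in terms])
print(len(terms), "terms x", len(docs), "documents")  # 12 terms x 9 documents
```

This yields the familiar 12 x 9 matrix (terms such as "human", "system", "trees"); note that "system" gets a count of 2 in column c4, since the title uses it twice.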
CONCEPT SPACE • Makes semantic comparison possible • Created using matrices • Tries to detect the hidden similarities between documents • ignores weak similarities • Words with similar meanings • occur close to each other • Dimensions: terms • Points: documents • Each document is a vector • Reduce the dimensions • Singular Value Decomposition
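The dimension-reduction step can be sketched with NumPy's SVD. The toy matrix and the choice of k = 2 below are illustrative assumptions, not from the slides; the point is that a query for "car" still matches a document containing only "automobile", because both land on the same latent "vehicle" axis.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
terms = ["car", "automobile", "engine", "apple", "fruit"]
#              d1 d2 d3 d4 d5
A = np.array([[1, 0, 1, 0, 0],   # car
              [0, 1, 1, 0, 0],   # automobile
              [1, 1, 0, 0, 0],   # engine
              [0, 0, 0, 1, 1],   # apple
              [0, 0, 0, 1, 1]],  # fruit
             dtype=float)

# Truncated SVD: keep the k largest singular values (the "concepts").
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Document j's coordinates in the concept space are column j of Vtk.
# A query is "folded in" with q_hat = S_k^{-1} U_k^T q.
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
q_hat = np.diag(1.0 / sk) @ Uk.T @ q

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

sims = [cosine(q_hat, Vtk[:, j]) for j in range(A.shape[1])]
print(np.round(sims, 2))
# d2 contains "automobile" but not "car", yet scores as high as d1 and d3;
# the fruit documents d4 and d5 score near zero.
```

Exact term matching would have scored d2 at zero; in the reduced space it is as similar to the query as the documents that literally contain "car".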
APPLICATION AREAS • Dynamic advertisements placed on pages (Google AdSense) • Improving the performance of search engines • in ranking pages • Spam filtering for e-mails • Optimizing the link profile of your web page • Cross-language retrieval • Foreign language translation • Automated essay grading • Modelling of human cognitive function
GOOGLE USES LSI • Giving it increasing weight in ranking pages • a ~ sign before the search term stands for semantic search • "~phone" • the first link that appears is the page for "Nokia", although the page does not contain the word "phone" • "~humor" • retrieved pages contain its synonyms: comedy, jokes, funny • Google AdSense sandbox • check which advertisements Google would put on your page
ANOTHER USAGE • Tried on the TOEFL exam • a word is given • the word most similar in meaning must be selected from four candidates • scored 65% correct
+ / - • Improves the efficiency of the retrieval process • Decreases the dimensionality of the vectors • good for machine learning algorithms, where high dimensionality is a problem • dimensions are more semantic • Newly reduced vectors are dense • saving memory is not guaranteed • Some words have several meanings, which makes retrieval confusing • Expensive to compute; complexity is high