LING / C SC 439/539 Statistical Natural Language Processing • Lecture 27 • 4/24/2013
Recommended reading • Jia Lu’s slides on topic models • http://en.wikipedia.org/wiki/Latent_semantic_analysis • Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. 1990. “Indexing by Latent Semantic Analysis”. Journal of the American Society for Information Science 41 (6): 391–407. • Thomas Hofmann. 1999. Probabilistic Latent Semantic Analysis. • http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation • David Blei. 2011. Introduction to Probabilistic Topic Models.
Outline • Models of semantic representation • WordNet • Boolean queries in information retrieval • Vector space model of meaning • Latent Semantic Analysis • Probabilistic Latent Semantic Analysis
Semantic representation 1: word-based • Semantic network • Nodes = words • Links = different semantic relations
Semantic representation 2: concept-based, indicated by words • Semantic vector space • Vectors correspond to concepts • Distance = degree of expression of concept • Distance between words = degree of semantic similarity
Semantic representation 3: topic model • Left: probabilistic assignment of words to topics • Right: words in each topic, sorted by decreasing probability
Systems for different semantic representations • Semantic network • Lexical database: WordNet • Vector space model • Unsupervised algorithm: Latent Semantic Analysis • Topic model • Unsupervised algorithms: • Probabilistic Latent Semantic Analysis • Latent Dirichlet Allocation (popular, but too advanced for this course)
Outline • Models of semantic representation • WordNet • Boolean queries in information retrieval • Vector space model of meaning • Latent Semantic Analysis • Probabilistic Latent Semantic Analysis
WordNet • http://wordnet.princeton.edu/ • Most widely used hierarchically organized lexical database for English (Fellbaum, 1998) • Other languages: Global WordNet Association • http://www.globalwordnet.org/
Synsets in WordNet • Example synset • { chump, fish, fool, gull, mark, patsy, fall guy, sucker, schlemiel, shlemiel, soft touch, mug } • Definition: “a person who is gullible and easy to take advantage of”. • A synset defines one sense for each of the words listed in the synset • A word may occur in multiple synsets • Example: “give” has 45 senses
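A quick way to explore synsets is NLTK’s WordNet interface; a minimal sketch (assumes NLTK is installed and the wordnet corpus has been downloaded):

```python
# Sketch: exploring WordNet synsets with NLTK
# (one-time setup: pip install nltk, then nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

# A word may occur in multiple synsets, one per sense.
print(len(wn.synsets('give')))        # "give" has several dozen senses

# Each synset groups the lemmas (words) that share one sense.
syn = wn.synsets('chump')[0]
print(syn.lemma_names())              # e.g. ['chump', 'fool', 'gull', 'mark', ...]
print(syn.definition())               # "a person who is gullible and easy to take advantage of"
```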
Outline • Models of semantic representation • WordNet • Boolean queries in information retrieval • Vector space model of meaning • Latent Semantic Analysis • Probabilistic Latent Semantic Analysis
Information retrieval • Web searching, library catalog lookup • Given a query and a collection of documents, retrieve documents that are relevant to the query • Problem: determining the relevant documents
Boolean queries • Example: “paris” AND “hotel” AND NOT “hilton” • Match terms: return documents with same words as in query
Example Boolean query • Q: “Light waves” • D1: “Particle and wave models of light” • D2: “Surfing on the waves under star lights” • D3: “Electro-magnetic models for fotons” • Term table legend: x = document contains the word; x’ = document contains the word and the word appears in the query • REL = relevant document • MATCH = document is matched by the query (document contains all words in the query)
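A minimal sketch of strict Boolean AND matching over the three example documents (illustrative only; a real system would use an inverted index and term normalization):

```python
# Strict Boolean AND matching: a document matches only if it contains
# every query term (after lowercasing and naive whitespace tokenization).
def tokens(text):
    return set(text.lower().split())

docs = {
    "D1": "Particle and wave models of light",
    "D2": "Surfing on the waves under star lights",
    "D3": "Electro-magnetic models for fotons",
}
query = tokens("Light waves")

for name, text in docs.items():
    print(name, "MATCH" if query <= tokens(text) else "no match")

# No document matches exactly: D1 has "wave"/"light" but not the query's
# plural "waves", D2 has "waves" but "lights" rather than "light" --
# strict term matching misses relevant documents (low recall).
```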
Problems with Boolean queries • Too strict, only finds documents with specific words searched for • Doesn’t consider word ambiguity • Retrieval of documents isn’t probabilistic • Assigns equal importance to all words in query • Syntax is hard for non-technical users
Precision and recall in information retrieval • Retrieve documents relevant to a query • http://nltk.googlecode.com/svn/trunk/doc/images/precision-recall.png
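Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved. A minimal sketch with hypothetical document ids:

```python
# precision = |retrieved ∩ relevant| / |retrieved|
# recall    = |retrieved ∩ relevant| / |relevant|
def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d5"}            # hypothetical retrieved set
relevant = {"d1", "d3", "d5", "d7"}       # hypothetical relevant set
print(precision_recall(retrieved, relevant))   # (0.666..., 0.5)
```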
Synonymy and polysemy • Polysemy • Words with different meanings: model, python, chip • Term matching returns irrelevant documents • Lowers precision • Synonymy • Many ways to refer to the same object: car, automobile • Lowers recall: many relevant documents are not matched by the words in the query
Synonymy, polysemy, and document similarity • Synonymy: these two documents share few words but are related • Doc 1: auto, engine, bonnet, tyres, lorry, boot • Doc 2: car, emissions, hood, make, model, trunk • Polysemy: these two documents share many words, but are not related • Doc 2: car, emissions, hood, make, model, trunk • Doc 3: make, hidden, Markov, model, emissions, normalize
Questions for information retrieval • How do we identify documents that are relevant, but don’t contain words in our query? • Given a query, how do we rule out a document that has matching words, but that is irrelevant?
Topics • Documents are about topics, rather than just specific words • e.g. sports, computers, cars, politics • Many words can belong to a given topic • Variability in word use: an author chooses a particular subset of a topic’s words when writing a document
Document retrieval based on topics • Perform document retrieval according to topics, rather than just the words in queries • Input: words in a query • Use query words to determine the “topic” that the user wants to look up • Figure out “topic” of a query, then return documents on that topic • How do we get topics? • Computational problem: induction of semantic topics
Outline • Models of semantic representation • WordNet • Boolean queries in information retrieval • Vector space model of meaning • Latent Semantic Analysis • Probabilistic Latent Semantic Analysis
Word vector space model of meaning • Meaning is a high-dimensional space • Each dimension is associated with a single word • Huge number of dimensions • Bag of words model: • Count the frequency of each word in a document • Ignore location of word in documents • Ignores syntactic structure • Remove stopwords (high-frequency function words) • the, of, to, around, by, at, … • Represent a document as a point in this space
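A minimal bag-of-words sketch (the stopword list here is a toy stand-in for a real one):

```python
from collections import Counter

# Toy stopword list; real systems use a standard list of function words.
STOPWORDS = {"the", "of", "to", "around", "by", "at", "a", "and", "in", "on"}

def bag_of_words(text):
    # Count word frequencies, ignoring word order and syntax,
    # and dropping stopwords.
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return Counter(words)

print(bag_of_words("The intersection graph of paths in trees"))
# Counter({'intersection': 1, 'graph': 1, 'paths': 1, 'trees': 1})
```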
Term-document co-occurrence matrix • N documents, vocabulary size M • Generate a word-document co-occurrence matrix W of size M x N: one row per word w1 … wM, one column per document d1 … dN • Wi,j = # of times word wi occurs in document dj
Term-document matrix structure • Create a corpus from newsgroups • Build a term × document matrix • Example: • 100 documents each from 3 different newsgroups • 300 total documents, 12,418 distinct terms • Remove words from a standard stopword list • The resulting 12,418 x 300 matrix is sparse: only 8% of cells are filled
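One possible way to build such a matrix, sketched here with scikit-learn and three arbitrarily chosen newsgroups (the exact groups and counts will differ from the lecture example):

```python
# Sketch: term x document count matrix from three newsgroups.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

groups = ["rec.autos", "sci.space", "talk.politics.misc"]   # illustrative choice
corpus = fetch_20newsgroups(subset="train", categories=groups).data[:300]

vectorizer = CountVectorizer(stop_words="english")          # drops stopwords
X = vectorizer.fit_transform(corpus)                        # documents x terms, sparse
W = X.T                                                     # terms x documents
print(W.shape)                                              # (number of terms, 300)
print(W.nnz / (W.shape[0] * W.shape[1]))                    # fraction of cells filled
```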
Topic discovery • The topic of a document can be distinguished by the words in the document • However, this assumes that the topics of the documents are already known • Can we use the frequencies of words in documents to discover the topics? • Individual documents tend to be about one particular topic
Want to learn a vector space model of concepts • N-dimensional space, each dimension corresponding to an abstract concept, rather than a single word • Location of a word indicates strength of association with concept dimensions
Reduce dimension of matrix • Want a vector space model where dimensions correspond to topics instead of individual words • Dimensionality reduction • Reduce the size of the representation from O(10,000) words to O(100) topics • Algorithms: • Latent Semantic Analysis (LSA) / Singular Value Decomposition (SVD) • Probabilistic Latent Semantic Analysis (PLSA) • Latent Dirichlet Allocation (LDA)
Outline • Models of semantic representation • WordNet • Boolean queries in information retrieval • Vector space model of meaning • Latent Semantic Analysis • Probabilistic Latent Semantic Analysis
Latent Semantic Analysis (LSA) • http://en.wikipedia.org/wiki/Latent_semantic_analysis • http://lsa.colorado.edu/ • LSA is an unsupervised algorithm for discovering topics in a corpus of documents • Idea (Deerwester et al. 1990): “We would like a representation in which a set of terms, which by itself is incomplete and unreliable evidence of the relevance of a given document, is replaced by some other set of entities which are more reliable indicants. We take advantage of the implicit higher-order (or latent) structure in the association of terms and documents to reveal such relationships.”
Basic steps in LSA • Perform singular value decomposition (SVD) on the term-document co-occurrence matrix • http://en.wikipedia.org/wiki/Singular_value_decomposition • (understanding it requires a solid background in linear algebra) • Produces 3 matrices that reproduce the original when multiplied together • Each of the inner dimensions corresponds to a topic • Set all but the k highest singular values to 0 • Produces the best k-dimensional approximation of the original matrix • These are the k most relevant topics • Documents can then be represented in topic space, instead of word space
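A minimal numpy sketch of these steps on a toy matrix (illustrative only; a real run would use the term-document counts built earlier):

```python
import numpy as np

# Toy term-document matrix (t = 6 terms, d = 4 documents).
X = np.random.default_rng(0).integers(0, 5, size=(6, 4)).astype(float)

# Full SVD: X = T0 @ diag(s) @ D0t, with singular values s in decreasing order.
T0, s, D0t = np.linalg.svd(X, full_matrices=False)
print(np.allclose(T0 @ np.diag(s) @ D0t, X))     # True: SVD reconstructs X exactly

# Keep only the k largest singular values (zero out the rest).
k = 2
X_k = T0[:, :k] @ np.diag(s[:k]) @ D0t[:k, :]    # best rank-k approximation of X

# Documents represented in k-dimensional topic space instead of word space.
doc_topics = np.diag(s[:k]) @ D0t[:k, :]         # shape (k, d)
print(X_k.shape, doc_topics.shape)               # (6, 4) (2, 4)
```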
Example: technical memo titles • c1: Human machine interface for ABC computer applications • c2: A survey of user opinion of computer system response time • c3: The EPS user interface management system • c4: System and human system engineering testing of EPS • c5: Relation of user perceived response time to error measurement • m1: The generation of random, binary, ordered trees • m2: The intersection graph of paths in trees • m3: Graph minors IV: Widths of trees and well-quasi-ordering • m4: Graph minors: A survey
Negative correlations between both related terms and unrelated terms • In the raw term-document matrix, correlations are negative for a related pair and an unrelated pair alike: • r(human, user) = -.38 • r(human, minors) = -.29
Matrix multiplication • http://en.wikipedia.org/wiki/Matrix_multiplication
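As a quick reminder of how the matrix shapes compose in the factorizations below (a toy numpy example):

```python
import numpy as np

# A (t x m) matrix times an (m x m) matrix times an (m x d) matrix
# gives a (t x d) matrix -- the shapes used in the SVD factorization.
t, m, d = 5, 3, 4
A = np.ones((t, m))
B = np.eye(m)
C = np.ones((m, d))
print((A @ B @ C).shape)   # (5, 4)
```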
Variables involved • t = number of terms • d = number of documents • Term-document matrix: size t x d • m = min(t, d): rank of term-document matrix • Maximum number of topics that there could possibly be • k = number of singular values • Number of top topics, that you select
Singular Value Decomposition (SVD) • Input: • X: term-document matrix, of size t x d • Output of SVD: • T0: term-topic matrix, t x m • S0: singular value matrix, m x m • D0’: topic-document matrix, m x d • Original matrix can be exactly recovered from the factored matrices through matrix multiplication • X = T0 S0 D0’ • Size (t x m) * (m x m) * (m x d) = size t x d
Singular Value Decomposition • http://en.wikipedia.org/wiki/Singular_value_decomposition
SVD approximation • The original matrix can be exactly recovered from the factored matrices through matrix multiplication: X = T0 S0 D0’ • Approximation of the original matrix: • Select the top k singular values (top k dimensions), remove all other dimensions • Multiplication of the reduced matrices approximates the original: X ≈ T S D’, with T of size t x k, S of size k x k, D’ of size k x d • Since the top singular values were selected, this is the best rank-k approximation of X
Another picture of reduced SVD • SVD: A = U Σ Vᵀ • After dimensionality reduction: A ≈ Ũ Σ̃ Ṽᵀ
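In practice, libraries provide the truncated SVD directly. A sketch with scikit-learn’s TruncatedSVD, reusing the sparse documents × terms matrix X from the CountVectorizer sketch earlier:

```python
# Sketch: LSA via truncated SVD on the documents x terms matrix X from above.
from sklearn.decomposition import TruncatedSVD

k = 50                                    # number of topics to keep
svd = TruncatedSVD(n_components=k, random_state=0)
doc_topic = svd.fit_transform(X)          # documents x k: documents in topic space
term_topic = svd.components_.T            # terms x k: terms in topic space
print(doc_topic.shape, term_topic.shape)
```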
SVD approximation leads to smaller representation • t * d > ( t * k ) + ( k * k ) + ( k * d ) • If k << m, substantial savings in size of representation (m is the rank of original matrix) • Example: t = 5,000, d = 1000, k = 50 • t * d = 5,000,000 • But ( t * k ) + ( k * k ) + ( k * d ) = 250,000 + 2,500 + 50,000 = 302,500