Natural Language Processing Vasile Rus http://www.cs.memphis.edu/~vrus/nlp
Outline • Announcements • Latent Semantic Analysis/Indexing • Word embeddings
Announcements • 8000-level presentations • Projects due 4/24/18
Word Vectors • Goal: represent a word by an m-dimensional vector (for medium-sized m, say m = 300) • Have “similar” words be represented by “nearby” vectors in this m-dimensional space • Words in a particular domain (economics, science, sports) could be closer to one another than words in other domains • Could help with synonymy • e.g. “big” and “large” have nearby vectors • Could help with polysemy • “Java” and “Indonesia” could be close in some dimensions • “Java” and “Python” are close in other dimensions
Word Vectors • Vectors would be short and information-dense, rather than very long and information-sparse • Would require fewer weights and parameters • Fortunately, there are existing mappings that can be downloaded and used • These were trained on big corpora for a long time • Let’s understand how they were developed and trained
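For concreteness, here is a minimal sketch of downloading and querying one such pre-trained mapping with the gensim downloader; the GloVe model name below is one convenient, publicly available choice, not a set named in the slides.

```python
# A minimal sketch: load pre-trained word vectors and inspect similarities.
# The model "glove-wiki-gigaword-100" (100-dimensional GloVe vectors) is an
# illustrative choice, not the lecture's specific embedding set.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# "Similar" words get nearby vectors (high cosine similarity).
print(vectors.similarity("big", "large"))

# Nearest neighbors in the embedding space.
print(vectors.most_similar("java", topn=5))
```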
What makes two words similar? • Idea: similar words occur in similar contexts • For a given word, look at the words in a “window” around it. • Count-based methods • Latent Semantic Analysis • Prediction-based methods • Word2vec (neural networks)
What makes two words similar? • Idea: similar words occur in similar contexts • For a given word, look at the words in a “window” around it • LSA • If A occurs with B in similar contexts, they should be close to each other in the vector space • Quantify/count how many times words co-occur in the same “window” • Second-order relations: • If A co-occurs with B a lot and B co-occurs with C a lot, then A and C are somehow similar (through B), although they may not co-occur in the same window very often
What makes two words similar? • Idea: similar words occur in similar contexts • For a given word, look at the words in a “window” around it • Consider trying to predict a word given the context • This is exactly the CBOW (continuous bag of words) model • Example: “We hold these truths to be self-evident, that all men are created equal”, window size = 3 • context: [‘truths’, ‘to’, ‘be’, ‘that’, ‘all’, ‘men’], target word: ‘self-evident’
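A small sketch of how such (context, target) pairs can be generated, using the sentence and window size above (punctuation dropped):

```python
# Build (context, target) training pairs for CBOW from one sentence.
sentence = ("We hold these truths to be self-evident "
            "that all men are created equal").split()
window = 3

pairs = []
for i, target in enumerate(sentence):
    # Up to `window` words on each side of the target word.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    pairs.append((context, target))

# Pair for "self-evident":
# (['truths', 'to', 'be', 'that', 'all', 'men'], 'self-evident')
print(pairs[6])
```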
LSA • Relationship between concepts and words is many-to-many • Solve problems of synonymy and ambiguity by representing words as vectors of ideas or concepts • Could be used in information retrieval • Documents and queries are represented as vectors, and relevance is scored by the cosine similarity of those vectors
LSA • Find the latent semantic space that underlies the words/documents • Find the basic (coarse-grained) ideas, regardless of the words used to express them • A kind of co-occurrence analysis; co-occurring words act as “bridges” between words that do not co-occur • The latent semantic space has many fewer dimensions than the term space • The space depends on the documents from which it is derived • The components have no names and cannot be directly interpreted
Technical Memo Example: Titles
c1 Human machine interface for Lab ABC computer applications
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
Technical Memo Example: Terms and Documents

Terms \ Documents   c1  c2  c3  c4  c5  m1  m2  m3  m4
human                1   0   0   1   0   0   0   0   0
interface            1   0   1   0   0   0   0   0   0
computer             1   1   0   0   0   0   0   0   0
user                 0   1   1   0   1   0   0   0   0
system               0   1   1   2   0   0   0   0   0
response             0   1   0   0   1   0   0   0   0
time                 0   1   0   0   1   0   0   0   0
EPS                  0   0   1   1   0   0   0   0   0
survey               0   1   0   0   0   0   0   0   1
trees                0   0   0   0   0   1   1   1   0
graph                0   0   0   0   0   0   1   1   1
minors               0   0   0   0   0   0   0   1   1
Technical Memo Example: Query • Query: find documents relevant to “human computer interaction” • Simple term matching: matches c1, c2, and c4; misses c3 and c5
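As a concrete sketch, the term-matching baseline can be reproduced from the term-document matrix on the previous slide; the code below is illustrative and not from the lecture (note that “interaction” is not one of the index terms).

```python
# Simple term matching on the technical-memo collection:
# a document matches if it shares at least one index term with the query.
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Term-document matrix from the previous slide (rows = terms, columns = docs).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],   # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],   # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],   # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],   # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],   # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],   # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],   # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],   # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],   # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],   # minors
])

query_terms = ["human", "computer"]            # "interaction" is not an index term
q = np.array([1 if t in query_terms else 0 for t in terms])

matches = [d for d, overlap in zip(docs, q @ X) if overlap > 0]
print(matches)                                 # ['c1', 'c2', 'c4'] -- misses c3, c5
```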
Mathematical Concepts • Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents) • Singular Value Decomposition: for any matrix X with t rows and d columns, there exist matrices T0, S0, and D0 such that X = T0S0D0' • T0 and D0 are the matrices of left and right singular vectors • S0 is the diagonal matrix of singular values
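A quick sketch of computing this decomposition with numpy, on a toy random matrix (numpy names the factors U, s, and Vh rather than T0, S0, D0'):

```python
# Compute X = T0 S0 D0' with numpy and verify the reconstruction.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((12, 9))                      # any t x d matrix will do

T0, s, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s)                              # diagonal matrix of singular values

# The factorization reproduces X up to floating-point error.
print(np.allclose(X, T0 @ S0 @ D0t))         # True
```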
More Linear Algebra • A non-negative real number σ is a singular value for X if and only if there exist unit-length vectors u in K^t and v in K^d such that Xv = σu and X'u = σv • u is a left singular vector and v is a right singular vector • K is a field, such as the field of real numbers
Eigenvectors vs. Singular vectors • Eigenvector: Mv = λv, where v is an eigenvector and λ is a scalar (real number) called the eigenvalue • MV = VD, where D is the diagonal matrix of eigenvalues and V collects the eigenvectors as columns • M = VDV^-1 if V is invertible (which is the case, for example, when all eigenvalues are distinct, so the eigenvectors are linearly independent)
Eigenvectors vs. Singular vectors • M = VDV' if M is symmetric and its eigenvectors are chosen orthonormal • XX' = (TSD')(TSD')' = TSD'DS'T' = TSS'T' • So XX' has the form VDV' with V = T and eigenvalue matrix SS': the eigenvectors of XX' are the left singular vectors T, and the eigenvalues are the squared singular values
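A small numerical check of this relationship, using a random matrix as a stand-in for X:

```python
# Check that the eigenvalues of XX' equal the squared singular values of X.
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((6, 4))

_, s, _ = np.linalg.svd(X, full_matrices=False)
eigvals = np.linalg.eigvalsh(X @ X.T)        # eigenvalues of the symmetric XX'

# Compare the largest eigenvalues (sorted descending) with s squared.
print(np.allclose(np.sort(eigvals)[::-1][:len(s)], s**2))   # True
```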
Linear Algebra • X = T0S0D0' • T0 and D0 are column-orthonormal • Their columns are orthogonal unit vectors that can form a basis for a space • When square they are unitary, which means T0' and D0' are also orthonormal
More Linear Algebra • Unitary matrices have the following properties • UU' = U'U = I_n • If U has all real entries, it is orthogonal • Orthogonal matrices preserve the inner product of two real vectors: <Ux, Uy> = <x, y> • So U is an isometry, i.e. it preserves distances
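A quick numerical check of these properties for a real orthogonal matrix U (here taken from the SVD of a random square matrix):

```python
# Verify U'U = I and inner-product preservation for an orthogonal U.
import numpy as np

rng = np.random.default_rng(2)
U, _, _ = np.linalg.svd(rng.random((5, 5)))   # a 5x5 orthogonal matrix

print(np.allclose(U.T @ U, np.eye(5)))        # True: U'U = I_n
x, y = rng.random(5), rng.random(5)
print(np.allclose((U @ x) @ (U @ y), x @ y))  # True: <Ux, Uy> = <x, y>
```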
LSA Properties • The projection into the latent concept space preserves topological properties of the original space • Close vectors will stay close • The reduced latent concept space is the best approximation of the original space, in terms of distance preservation, among all choices of the same dimensionality • Both terms and documents are mapped into the same new space, where they can be compared
Dimensions of matrices • X (t × d) = T0 (t × m) S0 (m × m) D0' (m × d) • m is the rank of X, m ≤ min(t, d)
Reduced Rank • S0 can be chosen so that its diagonal elements are positive and decreasing in magnitude • Keep the first k and set the others to zero • Delete the zero rows and columns of S0 and the corresponding columns of T0 and D0; this gives X ≈ X̂ = TSD' • Interpretation: if the value of k is selected well, the expectation is that X̂ retains the semantic information from X, but eliminates noise from synonymy and recognizes dependence
Dimensionality Reduction • X (t × d) ≈ X̂ = T (t × k) S (k × k) D' (k × d) • k is the number of latent concepts (typically 300 ~ 500)
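A sketch of the truncation with numpy; k = 2 and the random X are toy choices for illustration only:

```python
# Rank-k truncation: keep the k largest singular values and the
# corresponding columns of T0 and rows of D0'.
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((12, 9))
k = 2

T0, s, D0t = np.linalg.svd(X, full_matrices=False)
T, S, Dt = T0[:, :k], np.diag(s[:k]), D0t[:k, :]

X_hat = T @ S @ Dt                           # rank-k approximation of X
print(X_hat.shape)                           # (12, 9): same shape, lower rank
print(np.linalg.matrix_rank(X_hat))          # 2
```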
Animation of SVD • [SVD animation from Wikipedia] • M is just an m × m square matrix with positive determinant whose entries are plain real numbers
Approximation Intuition • [Figure: a 200 × 300 pixel image reconstructed from the top 10 and the top 50 dimensions]
Projected Terms • XX' = (TSD')(TSD')' = TSD'DS'T' = TSS'T' = (TS)(TS)' • So term-term comparisons in the latent space can be computed from the rows of TS
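A numerical check that rows of T0S0 reproduce the term-term products XX', again on a toy random matrix:

```python
# Rows of T0 S0 serve as term vectors: (T0 S0)(T0 S0)' equals XX'.
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((12, 9))

T0, s, _ = np.linalg.svd(X, full_matrices=False)
term_vectors = T0 @ np.diag(s)               # rows = terms, columns = latent dims

print(np.allclose(term_vectors @ term_vectors.T, X @ X.T))   # True
```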
LSA Summary • Strong formal framework • Completely automatic; no stemming required; allows misspellings • Computation is expensive
word2vec: CBOW Model • Train a neural network on a large corpus of data • [Figure: the one-hot encoded context words w(t-2), w(t-1), w(t+1), w(t+2) feed a single hidden layer (with dimension m) that predicts the one-hot encoded target word w(t)]
word2vec: CBOW Model • Once the network is trained, the weights give the word vectors
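A minimal numpy sketch of the CBOW forward pass drawn above (illustration only, no training loop; the toy vocabulary, embedding dimension, and weight matrices are assumptions for the example):

```python
# CBOW forward pass: average the context-word embeddings in the hidden
# layer, then score every vocabulary word as the candidate target.
import numpy as np

vocab = ["we", "hold", "these", "truths", "to", "be", "self-evident",
         "that", "all", "men", "are", "created", "equal"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, m = len(vocab), 8                        # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, m))           # input weights: row i is word i's vector
W_out = rng.normal(0, 0.1, (m, V))          # output (prediction) weights

context = ["truths", "to", "be", "that", "all", "men"]
h = W_in[[word_to_id[w] for w in context]].mean(axis=0)   # hidden layer

scores = h @ W_out
probs = np.exp(scores) / np.exp(scores).sum()             # softmax over the vocabulary
print(vocab[int(np.argmax(probs))])         # untrained guess at the target word

# After training, the rows of W_in are taken as the word vectors.
```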
word2vec: Skip-gram Model • Same idea, except we predict the context from the target • [Figure: the one-hot encoded target word w(t) feeds a single hidden layer neural network that predicts the one-hot encoded context words w(t-2), w(t-1), w(t+1), w(t+2)]
word2vec • “Distributed Representations of Words and Phrases and their Compositionality” – Mikolov and colleagues • Uses a Skip-gram model to train on a large corpus • Lots of details to make it work better • Aggregation of multi-word phrases (e.g. Boston Globe) • Subsampling (randomly discard very frequent words, which effectively gives rarer words more weight) • Negative Sampling (give the network examples of wrong words)
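A hedged sketch of training such vectors with gensim's Word2Vec (gensim 4.x parameter names), which implements the tricks listed above; the two-sentence corpus is only a stand-in for the large corpus real training requires:

```python
# Train skip-gram vectors with negative sampling and frequent-word subsampling.
from gensim.models import Word2Vec

corpus = [
    ["we", "hold", "these", "truths", "to", "be", "self-evident"],
    ["that", "all", "men", "are", "created", "equal"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,    # embedding dimension m
    window=5,           # context window size
    sg=1,               # 1 = skip-gram, 0 = CBOW
    negative=5,         # negative sampling: 5 "wrong word" samples per example
    sample=1e-5,        # subsampling threshold for very frequent words
    min_count=1,        # keep every word in this toy corpus
)

print(model.wv["truths"].shape)             # (100,)
```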
Summary • Latent Semantic Analysis • Word2vec
Next Time • Logic Forms • Information Extraction