LING/C SC 581: Advanced Computational Linguistics Lecture 14 Feb 26th
Administrivia • Hope you sent feedback on the last lecture on Text Classification by Marcos Zampieri. • This Thursday (Feb 21st), we have another guest lecture from faculty candidate Adriana Picoral. • Her job talk on "Investigating Multilingualism through Computational Linguistics" is tomorrow at noon in CHEM 209 • Homework 6 out today (due on Friday at midnight)
Last Time • WordNet verbs and adjectives. • Also FrameNet for verb frames/senses. • bfs.perl (basic program); bfs4.perl (finds all minimal-length solutions). • @INC is hardwired into Perl: the PERL5LIB environment variable can be set to add directories to the Perl module search path, e.g.: • export PERL5LIB=/home/foobar/code • see https://perlmaven.com/how-to-change-inc-to-find-perl-modules-in-non-standard-locations
WordNet: programmed search
• Make no assumptions, e.g. chair and table:
$ perl bfs4.perl chair#n#1 table#n#1
Not found (distance 7 and 100000 nodes explored)
$ perl bfs4.perl chair#n#1 table#n#1 200000
Max set to: 200000
Not found (distance 8 and 200007 nodes explored)
$ perl bfs4.perl chair#n#1 table#n#1 300000
Max set to: 300000
Found at distance 8 (256541 nodes explored)
table#n#1 hype contents#n#1 hypo list#n#1 hype index#n#4 deri index#v#2 hypo supply#v#1 hype seat#v#5 deri seat#n#3 hype chair#n#1
Found at distance 8 (282344 nodes explored)
table#n#1 hype contents#n#1 hypo list#n#1 hype index#n#4 deri index#v#2 hypo supply#v#1 hype seat#v#4 deri seat#n#3 hype chair#n#1
WordNet: programmed search
$ perl bfs4.perl chair#n#1 table#n#1 500000
Max set to: 500000
Found at distance 8 (256541 nodes explored)
table#n#1 hype contents#n#1 hypo list#n#1 hype index#n#4 deri index#v#2 hypo supply#v#1 hype seat#v#5 deri seat#n#3 hype chair#n#1
Found at distance 8 (282344 nodes explored)
table#n#1 hype contents#n#1 hypo list#n#1 hype index#n#4 deri index#v#2 hypo supply#v#1 hype seat#v#4 deri seat#n#3 hype chair#n#1
All minimal solutions found
• But does the long chain still have meaning?
WordNet: programmed search table#n#2
WordNet: programmed search
$ perl bfs4.perl chair#n#1 table#n#2
Found at distance 2 (82 nodes explored)
table#n#2 holo leg#n#3 mero chair#n#1
All minimal solutions found
https://wordnet.princeton.edu/wordnet/man/wngloss.7WN.html
• holonym: The name of the whole of which the meronym names a part. Y is a holonym of X if X is a part of Y.
• meronym: The name of a constituent part of, the substance of, or a member of something. X is a meronym of Y if X is a part of Y.
WordNet: programmed search
$ perl bfs4.perl chair#n#1 table#n#2
Found at distance 2 (82 nodes explored)
table#n#2 holo leg#n#3 mero chair#n#1
All minimal solutions found
• Take out holo and mero from @relations
WordNet: programmed search
$ perl bfs4a.perl chair#n#1 table#n#2
Found at distance 3 (81 nodes explored)
table#n#2 hypo furniture#n#1 hype seat#n#3 hype chair#n#1
All minimal solutions found
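The bfs*.perl programs above are breadth-first searches over WordNet's relation graph, stopping once a maximum node count is exhausted. A minimal Python sketch of the same idea; the successor function here is fed a hand-made toy graph standing in for the real WordNet::QueryData lookups:

```python
from collections import deque

def bfs(start, goal, successors, max_nodes=100000):
    """Breadth-first search: return the first (hence shortest) relation
    path from start to goal, or None if max_nodes is exhausted."""
    queue = deque([(start, [start])])
    seen = {start}
    explored = 0
    while queue and explored < max_nodes:
        node, path = queue.popleft()
        explored += 1
        if node == goal:
            return path
        for rel, nxt in successors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel, nxt]))
    return None

# Toy stand-in for WordNet relations (hypothetical data, mirroring the
# chair/table example above):
graph = {
    'table#n#2': [('holo', 'leg#n#3')],
    'leg#n#3':   [('mero', 'chair#n#1'), ('mero', 'table#n#2')],
    'chair#n#1': [],
}
path = bfs('table#n#2', 'chair#n#1', lambda w: graph.get(w, []))
print(path)  # ['table#n#2', 'holo', 'leg#n#3', 'mero', 'chair#n#1']
```

Collecting all minimal-length solutions (as bfs4.perl does) would mean continuing the search until the frontier's distance exceeds that of the first hit, rather than returning immediately.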
WordNet: programmed search
• Example:
• John mended the torn dress
• what can be deduced about the state of the world (situation) after the event of "mending"?
• find the semantic relationship between mend and tear
$ perl bfs3.perl mend#v#1 tear#v#1
Found at distance 6 (58492 nodes explored)
tear#v#1 hypo separate#v#2 hype break_up#v#10 also break#v#4 ants repair#v#1 hypo better#v#2 hype mend#v#1
$ perl bfs3.perl tear#v#1 mend#v#1
Found at distance 6 (33606 nodes explored)
mend#v#1 deri mender#n#1 hypo skilled_worker#n#1 hype cutter#n#3 deri cut#v#1 hypo separate#v#2 hype tear#v#1
many more…
WordNet: programmed search • Example: • John mended the red dress • mend is a change-of-state verb (applies to its object)
WordNet: programmed search
$ perl bfs4.perl mend#v#1 red#a#1
Not found (distance 7 and 100001 nodes explored)
$ perl bfs4.perl mend#v#1 red#a#1 200000
Max set to: 200000
Found at distance 7 (116111 nodes explored)
carmine#a#1 deri carmine#n#1 deri carmine#v#1 hypo redden#v#2 hypo color#v#1 hypo change#v#1 hype better#v#2 hype mend#v#1
Found at distance 7 (116210 nodes explored)
red#a#1 deri red#n#1 hypo chromatic_color#n#1 hypo color#n#1 deri color#v#1 hypo change#v#1 hype better#v#2 hype mend#v#1
Found at distance 7 (116211 nodes explored)
red#a#1 deri red#a#1 deri red#n#1 hypo chromatic_color#n#1 hypo color#n#1 deri color#v#1 hypo change#v#1 hype better#v#2 hype mend#v#1
Found at distance 7 (116325 nodes explored)
ruddy#a#2 deri ruddiness#n#1 hypo complexion#n#1 hypo color#n#1 deri color#v#1 hypo change#v#1 hype better#v#2 hype mend#v#1
WordNet: programmed search
$ perl bfs4.perl mend#v#1 red#n#3
Found at distance 6 (49389 nodes explored)
Bolshevik#n#1 hypo radical#n#3 hypo person#n#1 hype changer#n#1 deri change#v#1 hype better#v#2 hype mend#v#1
Found at distance 6 (84143 nodes explored)
Bolshevik#n#1 hypo radical#n#3 hypo person#n#1 hype worker#n#1 hype skilled_worker#n#1 hype mender#n#1 deri mend#v#1
All minimal solutions found
Homework 6 • Question 1: • Try to find the shortest distance links between each of planet, star, eagle vs. telescope • (Make sure you have the right word sense) • How many are there? • Question 2: • Draw a (merged) graph of semantic relations found • Question 3: • Are any of the chains of semantic relations what you expect? • Question 4: • Is the chain useful? Why or why not? • Question 5: • What do you think the shortest connection linking star and telescope should look like? • How about eagle and telescope?
Cosine Similarity Using word vectors acquired from large corpora GloVe (Stanford), word2vec (Google) Python: gensim etc. • vec(‘Rome’) closest vec(‘Paris’) – vec(‘France’) + vec(‘Italy’) • Examples: • telescope: [1.5667, 1.1436, 1.6432, 0.2347, -0.57751, -0.29565, -0.78965, -0.95205, -0.097776, -0.31729, 0.82443, 0.27591, 0.70094, 1.2939, -1.1032, 1.0748, -0.21654, 0.44433, -1.854, -0.50952, -0.1966, -0.050295, -0.75702, -1.4179, 1.1795, -0.29231, -0.61232, 0.40963, -0.79731, 0.02117, 0.57397, -0.6336, -0.13071, -1.1153, -0.5656, -0.20496, 0.34324, 1.1626, 0.19703, -0.76862, 1.1381, 0.019043, 0.10676, 0.46047, -0.50555, -0.26049, 1.1725, -0.049478, -0.71014, 0.19022] • star: [-0.21025, 1.6081, 0.037375, 1.0411, 0.61061, 0.064748, -0.93674, -0.030028, -0.18348, 0.73875, 0.65025, 0.75496, -0.73316, 0.95964, 0.89172, -0.10495, 0.11496, 0.30448, -1.4942, -0.036297, -0.95949, 0.41062, -0.23896, 0.40387, -0.32893, -1.5343, -0.45627, 0.109, -0.41474, -0.57094, 2.1997, 0.47089, 0.56732, -0.16914, 0.43481, 0.40459, -0.007678, -0.22073, -0.33289, -1.0992, 0.33632, 1.3412, -0.34081, -0.50183, -0.2514, -0.10199, 0.19292, -0.48934, -0.41793, 0.18085] • potato: [-0.063054, -0.62636, -0.76417, -0.041484, 0.56284, 0.86432, -0.73734, -0.70925, -0.073065, -0.74619, -0.34769, 0.14402, 1.4576, 0.034688, 0.11224, 0.13854, 0.10484, 0.60207, 0.021777, -0.21802, 0.087613, -1.4234, 1.0361, 0.1509, 0.13608, -0.2971, -0.90828, 0.34182, 1.3367, 0.16329, 1.2374, -0.20113, -0.91532, 1.4222, -0.1276, 0.69443, -1.1782, 1.2072, 1.0524, -0.11957, -0.1275, 0.41798, -0.9232, -0.1312, 1.2696, 1.2318, 0.30061, -0.18854, 0.15899, 0.0486]
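The vec('Paris') − vec('France') + vec('Italy') ≈ vec('Rome') analogy can be illustrated with a hand-made toy embedding (the 2-d vectors below are made up for illustration, not real GloVe values):

```python
import numpy as np

# Hand-made 2-d toy vectors (illustrative only, not real embeddings)
vec = {
    'paris':  np.array([1.0, 1.0]),
    'france': np.array([1.0, 0.0]),
    'italy':  np.array([2.0, 0.0]),
    'rome':   np.array([2.0, 1.0]),
    'potato': np.array([-1.0, 0.5]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

target = vec['paris'] - vec['france'] + vec['italy']   # = [2.0, 1.0]
best = max((w for w in vec if w not in ('paris', 'france', 'italy')),
           key=lambda w: cosine(vec[w], target))
print(best)  # rome
```

With real vectors this is what gensim's most_similar(positive=..., negative=...) computes, just over a vocabulary of hundreds of thousands of words.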
Cosine Similarity
• Visualization: (figure: 2-D plot of the potato, star, and telescope vectors)
Cosine Similarity http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
Cosine Similarity • Vectors A, B and cos(θ): (wikipedia) • Python:
import numpy as np
a = np.asarray(A)  # row vector
b = np.asarray(B)
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
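A quick sanity check on the formula, using two vectors with a known angle between them (toy values, 45°):

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])          # 45 degrees from a
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_theta, 4))        # 0.7071  (= cos 45°)
```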
Cosine Similarity
• Let x = (x1,…,xn) and y = (y1,…,yn)
• Define x·y = ∑i xi yi (the dot product)
• ‖x‖ = √(∑i xi²) = √(x·x)
• For nonzero vectors a, b:
• ‖a−b‖² = ‖a‖² + ‖b‖² − 2‖a‖‖b‖ cos θ (law of cosines)
• But ‖a−b‖² = (a−b)·(a−b) = a·a − 2a·b + b·b = ‖a‖² − 2a·b + ‖b‖²
• Equating the two: a·b = ‖a‖‖b‖ cos θ
• a·b = 0 means θ = 90° (orthogonal)
(figure: triangle with sides ‖a‖, ‖b‖, ‖a−b‖ and angle θ between a and b)
Cosine Similarity
• Triangle: law of cosines: c² = a² + b² − 2ab cos θ
• Proof:
• Points: C = (0,0), B = (a,0), A = (b cos θ, b sin θ)
• By Pythagoras: c² = (a − b cos θ)² + (b sin θ)²
• c² = a² − 2ab cos θ + b² cos²θ + b² sin²θ
• c² = a² − 2ab cos θ + b² (cos²θ + sin²θ)
• c² = a² − 2ab cos θ + b²
(figure: triangle with sides a, b, c and angle θ at C; wikipedia)
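The identity can also be verified numerically, placing the triangle exactly as in the proof (the side lengths and angle below are just example values):

```python
import math

a, b, theta = 3.0, 4.0, math.radians(60)
# Place C at the origin, B at (a, 0), A at (b cos θ, b sin θ)
ax, ay = b * math.cos(theta), b * math.sin(theta)
c = math.dist((a, 0.0), (ax, ay))          # side opposite θ
lhs = c ** 2
rhs = a ** 2 + b ** 2 - 2 * a * b * math.cos(theta)
print(abs(lhs - rhs) < 1e-9)  # True
```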
Examples • Code adapted from: • https://github.com/adventuresinML/adventures-in-ml-code/blob/master/tf_word2vec.py • Training on text8: • http://mattmahoney.net/dc/textdata.html • text8 is the first 10⁸ bytes of fil9, the cleaned-up version of enwik9, which is the first 10⁹ bytes of the English Wikipedia dump of Mar. 3, 2006. • clean-up: remove meta-data, hypertext links, citations, footnotes; also case-fold, spell out numbers, convert characters outside a–z to blanks, etc. • Skip-gram model • window n = 2 or 4 (words on either side of the target)
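The (target, context) training pairs that the skip-gram model is fed can be sketched as follows (a simplified version without subsampling or negative sampling; the function name is mine):

```python
def skipgram_pairs(tokens, window):
    """Yield (target, context) pairs for every word within `window`
    positions of the target, as in the skip-gram model."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (target, tokens[j])

pairs = list(skipgram_pairs(['the', 'quick', 'brown', 'fox'], 2))
print(pairs[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```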
Examples • Embedding: 300, skip window: 4, vocab size: 20,000 (others: UNK) • Filename: text8.zip, #words: 17,005,207 • Nearest to the: • regulate, camelot, anymore, mutants, lowlands, thorn, irene, ax • and, of, a, UNK, in, to, one, nine • a, and, UNK, in, of, one, to, zero • a, UNK, and, of, in, two, is, one • a, one, and, UNK, zero, in, s, two • a, UNK, of, in, one, two, s, and • and, a, s, in, ursus, of, UNK, one • a, UNK, ursus, and, s, three, one, six • a, one, ursus, seven, three, six, four, UNK • a, ursus, UNK, s, of, this, and, in
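"Nearest to" lists like the one above come from ranking the whole vocabulary by cosine similarity against a word's embedding. A sketch with a tiny random embedding matrix (the vocabulary, dimensions, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['the', 'and', 'of', 'a', 'in', 'one', 'nine', 'ursus']
emb = rng.normal(size=(len(vocab), 4))          # (vocab_size, dim)

def nearest(word, k=3):
    """Return the k vocabulary words most cosine-similar to `word`."""
    v = emb[vocab.index(word)]
    # cosine similarity of v against every row of emb at once
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v))
    order = np.argsort(-sims)                   # descending similarity
    return [vocab[i] for i in order if vocab[i] != word][:k]

print(nearest('the'))  # three nearest neighbours (random data here)
```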
Examples • Embedding: 300, skip window: 4, vocab size: 20,000 (others: UNK) • Nearest to have: • shrink, generalization, scandinavia, cards, approval, diplomatic, bus, bog • UNK, the, and, cards, generalization, approval, to, scandinavia • and, voter, to, generalization, in, a, cards, UNK • and, that, UNK, shrink, voter, the, in, is • that, and, in, voter, coke, shrink, generalization, cards • that, and, are, in, it, is, by, two • ursus, that, and, are, in, with, it, be • ursus, that, are, be, and, with, by, has • are, ursus, be, that, has, in, with, by • are, has, that, be, ursus, had, and, with
Examples • Embedding: 300, skip window: 4, vocab size: 10,000 (others: UNK) • Nearest to nine: • dust, owner, regain, freedom, party, gained, playstation, himself • zero, in, UNK, of, and, the, one, coke • one, zero, eight, two, in, and, six, the • one, eight, zero, two, six, three, seven, five • eight, zero, one, two, six, three, seven, five • eight, one, seven, zero, two, six, three, five • eight, seven, six, four, one, three, five, zero • eight, seven, six, one, five, four, zero, three • eight, seven, six, four, one, five, three, zero • eight, seven, six, four, five, one, three, zero
Examples • Embedding: 300, skip window: 4, vocab size: 20,000 (others: UNK) • Nearest to some: • groove, bram, cavitation, wickets, respect, wtoo, sticky, anatolia • alien, a, in, the, of, wickets, and, respect • a, alien, UNK, of, zero, and, the, wickets • alien, a, the, of, and, wickets, zero, UNK • a, alien, zero, and, two, the, or, groove • or, UNK, a, alien, two, six, in, and • or, two, and, a, the, ursus, alien, are • and, or, ursus, are, that, alien, the, from • or, are, ursus, other, and, that, two, UNK • or, the, are, other, and, two, many, ursus
Examples • Embedding: 300, skip window: 4, vocab size: 20,000 (others: UNK) • Nearest to american: • ways, practitioners, hexadecimal, tito, confirming, damascus, sharply, roof • phi, ways, practitioners, legislatures, halley, whole, mughal, UNK • phi, ways, practitioners, legislatures, tito, one, halley, roof • phi, one, and, legislatures, ways, practitioners, tito, eight • phi, the, zero, UNK, and, legislatures, in, ways • one, nine, phi, UNK, two, six, by, three • UNK, and, nine, by, ursus, phi, zero, the • nine, and, UNK, callithrix, ursus, s, phi, six • nine, in, of, and, UNK, callithrix, ursus, phi • nine, UNK, in, callithrix, ursus, and, one, of