CS626/449: Speech, NLP and the Web / Topics in AI Programming (Lecture 9: Resnik’s measure of word similarity; coverage of Jiang and Conrath, 1997). Pushpak Bhattacharyya, CSE Dept., IIT Bombay.
Path length based similarity between house and lock • House belongs to 12 senses; sense 1 is used here • [Figure: has-part links connecting house, study, wall, door, doorway and lock]
Properties that a path length based measure should satisfy • Zero property: self-distance is 0, d(A,A) = 0 • Symmetric property: d(A,B) = d(B,A) • Positive property: d is always non-negative • Triangular inequality: d(A,C) <= d(A,B) + d(B,C)
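The four properties can be checked mechanically for a shortest-path distance. The sketch below is only an illustration: the edge list is a hypothetical has-part graph made up for the example (it is not claimed to match WordNet or the slide's figure), and `path_length` and `satisfies_metric_properties` are names introduced here.

```python
from collections import deque
from itertools import product

# Hypothetical undirected has-part graph; the edges are illustrative only.
EDGES = [("house", "study"), ("house", "wall"), ("wall", "doorway"),
         ("doorway", "door"), ("door", "lock")]

def build_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def path_length(graph, start, goal):
    """Number of edges on a shortest path, found by breadth-first search."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected

def satisfies_metric_properties(graph):
    """Check zero, symmetric, positive and triangle properties (connected graph assumed)."""
    nodes = list(graph)
    d = {(a, b): path_length(graph, a, b) for a, b in product(nodes, repeat=2)}
    zero = all(d[a, a] == 0 for a in nodes)
    symmetric = all(d[a, b] == d[b, a] for a, b in d)
    positive = all(v >= 0 for v in d.values())
    triangle = all(d[a, c] <= d[a, b] + d[b, c]
                   for a, b, c in product(nodes, repeat=3))
    return zero and symmetric and positive and triangle

GRAPH = build_graph(EDGES)
```

With these made-up edges, `path_length(GRAPH, "house", "lock")` counts the four links house, wall, doorway, door, lock.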
Motivating Resnik’s measure: through the hypernymy (is-a) hierarchy • Lock: sense 1 • lock -- (a fastener fitted to a door or drawer to keep it firmly closed) • => fastener, fastening, holdfast, fixing -- (restraint that attaches to something or holds something in place) • => restraint, constraint -- (a device that retards something's motion; "the car did not have proper restraints fitted") • => device -- (an instrumentality invented for a particular purpose; "the device is small enough to wear on your wrist"; "a device intended to conserve water") • => instrumentality, instrumentation -- (an artifact (or system of artifacts) that is instrumental in accomplishing some end) • => artifact, artefact -- (a man-made object taken as a whole) • => whole, unit -- (an assemblage of parts that is regarded as a single entity; "how big is that part compared to the whole?"; "the team is a unit") • => object, physical object -- (a tangible and visible entity; an entity that can cast a shadow; "it was full of rackets, balls and other objects") • => physical entity -- (an entity that has physical existence) • => entity -- (that which is perceived or known or inferred to have its own distinct existence (living or nonliving))
House: sense 1 • house -- (a dwelling that serves as living quarters for one or more families; "he has a house on Cape Cod"; "she felt she had to get out of the house") • => dwelling, home, domicile, abode, habitation, dwelling house -- (housing that someone is living in; "he built a modest dwelling near the pond"; "they raise money to provide homes for the homeless") • => housing, lodging, living accommodations -- (structures collectively in which people are housed) • => structure, construction -- (a thing constructed; a complex entity constructed of many parts; "the structure consisted of a series of arches"; "she wore her hair in an amazing construction of whirls and ribbons") • => artifact, artefact -- (a man-made object taken as a whole) • => whole, unit -- (an assemblage of parts that is regarded as a single entity; "how big is that part compared to the whole?"; "the team is a unit") • => object, physical object -- (a tangible and visible entity; an entity that can cast a shadow; "it was full of rackets, balls and other objects") • => physical entity -- (an entity that has physical existence) • => entity -- (that which is perceived or known or inferred to have its own distinct existence (living or nonliving)) • Overlap with lock (sense 1): artifact => whole => object => physical entity => entity
House: sense 2 • Sense 2 • house -- (an official assembly having legislative powers; "a bicameral legislature has two houses") • => legislature, legislative assembly, legislative, general assembly, law-makers -- (persons who make or amend or repeal laws) • => assembly -- (a group of persons gathered together for a common purpose) • => gathering, assemblage -- (a group of persons together in one place) • => social group -- (people sharing some social relation) • => group, grouping -- (any number of entities (members) considered as a unit) • => abstraction -- (a general concept formed by extracting common features from specific examples) • => abstract entity -- (an entity that exists only abstractly) • => entity -- (that which is perceived or known or inferred to have its own distinct existence (living or nonliving))
House: sense 11 • sign of the zodiac, star sign, sign, mansion, house, planetary house -- ((astrology) one of 12 equal areas into which the zodiac is divided) • => region, part -- (the extended spatial location of something; "the farming regions of France"; "religions in all parts of the world"; "regions of outer space") • => location -- (a point or extent in space) • => object, physical object -- (a tangible and visible entity; an entity that can cast a shadow; "it was full of rackets, balls and other objects") • => physical entity -- (an entity that has physical existence) • => entity -- (that which is perceived or known or inferred to have its own distinct existence (living or nonliving)) • Overlap with lock (sense 1): object => physical entity => entity
Measures of Semantic Relatedness: Resnik • The Resnik measure is an information content (IC) based relatedness measure • Higher IC values belong to specific concepts, lower IC values to more general ones • Carving fork: HIGH IC; entity: LOW IC • The idea is that two concepts are semantically related in proportion to the amount of information they share
Sense marked corpora: semcor • <s snum=3> • <wf cmd=ignore pos=PRP>He</wf> • <wf cmd=done pos=VB lemma=succeed wnsn=2 lexsn=2:41:01::>succeeds</wf> • <wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Buck_Shaw</wf> • <punc>,</punc> • <wf cmd=ignore pos=WP>who</wf> • <wf cmd=done pos=VB lemma=retire wnsn=1 lexsn=2:41:01::>retired</wf> • <wf cmd=ignore pos=IN>at</wf> • <wf cmd=ignore pos=DT>the</wf> • <wf cmd=done pos=NN lemma=end wnsn=2 lexsn=1:28:00::>end</wf> • <wf cmd=ignore pos=IN>of</wf> • <wf cmd=done pos=JJ lemma=last wnsn=1 lexsn=5:00:00:past:00>last</wf> • <wf cmd=done pos=NN lemma=season wnsn=1 lexsn=1:28:02::>season</wf> • <punc>.</punc> • </s>
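Sense-tagged lines like these are where the counts behind P(c) come from. The sketch below tallies (lemma, part of speech, sense number) triples from a few of the `<wf>` elements shown above. Because the attribute values are unquoted, it uses regular expressions rather than an XML parser; `sense_counts` is a name introduced here, and the approach is an assumption about how one might read this format, not a description of any official SemCor tool.

```python
import re
from collections import Counter

# A few lines copied from the sample sentence above.
SAMPLE = """<wf cmd=ignore pos=PRP>He</wf>
<wf cmd=done pos=VB lemma=succeed wnsn=2 lexsn=2:41:01::>succeeds</wf>
<wf cmd=done pos=NN lemma=end wnsn=2 lexsn=1:28:00::>end</wf>
<wf cmd=done pos=NN lemma=season wnsn=1 lexsn=1:28:02::>season</wf>"""

WF_TAG = re.compile(r"<wf ([^>]*)>([^<]*)</wf>")  # attribute string, surface form
ATTR = re.compile(r"(\w+)=(\S+)")                 # unquoted key=value pairs

def sense_counts(text):
    """Count (lemma, pos, wnsn) triples for tokens marked cmd=done."""
    counts = Counter()
    for attrs, _surface in WF_TAG.findall(text):
        fields = dict(ATTR.findall(attrs))
        if fields.get("cmd") == "done" and "lemma" in fields and "wnsn" in fields:
            counts[(fields["lemma"], fields.get("pos"), fields["wnsn"])] += 1
    return counts

COUNTS = sense_counts(SAMPLE)
```

Tokens marked `cmd=ignore` (such as "He") carry no sense tag and are skipped.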
Measures of Semantic Relatedness • Considers the position of nouns in the is-a hierarchy • Semantic relatedness (SR) is determined by the information content of the lowest common concept which subsumes both concepts • For example: Nickel and Dime are subsumed by Coin; Nickel and Credit card by Medium of Exchange • P(c) is the probability of encountering concept c • If a is-a b, then P(a) <= P(b) • Information content is calculated by the formula: IC(c) = -log(P(c))
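A minimal sketch of how P(c) and IC(c) behave on a tiny is-a hierarchy. The hierarchy reuses the Nickel/Dime/Coin example, but the counts are made up for illustration and `concept_probabilities` is a name introduced here. Each observation of a concept is also counted for every concept that subsumes it, which is what makes P(a) <= P(b) whenever a is-a b.

```python
import math

# Hypothetical is-a links (child -> parent) and made-up corpus counts.
PARENT = {"nickel": "coin", "dime": "coin",
          "coin": "medium_of_exchange", "credit_card": "medium_of_exchange"}
DIRECT_COUNT = {"nickel": 2, "dime": 3, "coin": 5,
                "credit_card": 6, "medium_of_exchange": 4}

def concept_probabilities(parent, direct_count):
    """Propagate each count up the hierarchy, then normalise by the total."""
    freq = dict.fromkeys(direct_count, 0)
    for concept, count in direct_count.items():
        node = concept
        while node is not None:
            freq[node] += count
            node = parent.get(node)
    total = sum(direct_count.values())
    return {c: f / total for c, f in freq.items()}

def information_content(p):
    """IC(c) = -log P(c), natural logarithm."""
    return -math.log(p)

P = concept_probabilities(PARENT, DIRECT_COUNT)
IC = {c: information_content(p) for c, p in P.items()}
```

In this toy setting the root has probability 1 and therefore information content 0, while the leaves carry the most information.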
Measures of Semantic Relatedness • Thus relatedness is given by: sim_res(c1, c2) = IC(LCS(c1, c2)), where LCS is the lowest common subsumer • Does not consider the information content of the concepts themselves, nor the path length • One problem is that many concept pairs can have the same subsumer and thus the same score • May give high scores on the basis of inappropriate word senses, e.g. tobacco and horse (horse as slang for heroin makes both narcotics) • Newer measures include Jiang-Conrath, Lin and Leacock-Chodorow
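The measure can be sketched on a toy tree as follows. The probabilities are invented for the example, the hierarchy is assumed to be a tree (WordNet allows multiple parents, which this sketch ignores), and `ancestors`, `lcs` and `sim_res` are names introduced here.

```python
import math

# Hypothetical is-a tree (child -> parent) with made-up concept probabilities.
PARENT = {"nickel": "coin", "dime": "coin",
          "coin": "medium_of_exchange", "credit_card": "medium_of_exchange"}
P = {"nickel": 0.1, "dime": 0.15, "coin": 0.5,
     "credit_card": 0.3, "medium_of_exchange": 1.0}

def ancestors(concept):
    """The concept itself followed by its subsumers, up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT.get(concept)
    return chain

def lcs(c1, c2):
    """Lowest common subsumer: first node on c2's chain that also subsumes c1."""
    seen = set(ancestors(c1))
    for node in ancestors(c2):
        if node in seen:
            return node
    return None

def sim_res(c1, c2):
    """Resnik similarity: information content of the lowest common subsumer."""
    shared = lcs(c1, c2)
    return -math.log(P[shared]) if shared is not None else 0.0
```

Here `sim_res("nickel", "dime")` and `sim_res("coin", "dime")` come out identical because both pairs share the subsumer coin, which illustrates the "same score" problem noted above.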
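In case of multiple senses, word similarity is the maximum over all sense pairs: sim(w1, w2) = max over c1 in sen(w1), c2 in sen(w2) of sim(c1, c2), where sen(w) denotes the set of possible senses for word w.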
Relevant formulae • classes(w) is the number of senses the word w has • words(c) is the set of words subsumed (directly or indirectly) by the class c
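A sketch of the frequency estimates these definitions usually feed into, assuming the slide follows the corpus estimates described by Resnik and by Jiang and Conrath. The symbols count(w) (corpus frequency of word w) and N (total number of noun tokens observed) are assumptions introduced here; the second line is the variant that splits a word's count evenly across its senses.

```latex
\begin{align*}
\mathrm{freq}(c) &= \sum_{w \in \mathrm{words}(c)} \mathrm{count}(w) \\
\mathrm{freq}(c) &= \sum_{w \in \mathrm{words}(c)} \frac{\mathrm{count}(w)}{\mathrm{classes}(w)} \\
P(c) &= \frac{\mathrm{freq}(c)}{N}, \qquad \mathrm{IC}(c) = -\log P(c)
\end{align*}
```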
Structural characteristics of a hierarchical network • Local network density (the number of child links that span out from a parent node) • In the plant/flora section of WordNet, the hierarchy is very dense • Depth of a node in the hierarchy • Distance shrinks as one descends the hierarchy, since differentiation is based on finer and finer details • Type of link • Strength of an edge link: corpus statistics have to play a role; theoretical soundness and computational efficiency are needed
Link Strength: Probability and IC theoretic • The strength of a child link is proportional to the conditional probability of encountering an instance of the child concept ci given an instance of its parent concept p: P(ci | p)
Link strength: intuition, formulation and actual formula [Figure: formulas shown as images]
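A sketch of how the link-strength idea is usually formalised, assuming the slides follow Jiang and Conrath (1997). The symbols E(p) (number of child links of p), \bar{E} (average density over the hierarchy), d(p) (depth of p), T(c,p) (link-type factor) and the parameters alpha and beta are taken from that paper rather than from this transcript, so they should be read as assumptions. The first two lines use only the fact that every instance of a child concept is also an instance of its parent.

```latex
\begin{align*}
P(c_i \mid p) &= \frac{P(c_i \cap p)}{P(p)} = \frac{P(c_i)}{P(p)} \\
LS(c_i, p) &= -\log P(c_i \mid p) = \mathrm{IC}(c_i) - \mathrm{IC}(p) \\
wt(c, p) &= \left(\beta + (1-\beta)\,\frac{\bar{E}}{E(p)}\right)
            \left(\frac{d(p)+1}{d(p)}\right)^{\alpha}
            \bigl[\mathrm{IC}(c) - \mathrm{IC}(p)\bigr]\, T(c, p) \\
\mathrm{Dist}(c_1, c_2) &= \mathrm{IC}(c_1) + \mathrm{IC}(c_2)
            - 2\,\mathrm{IC}\bigl(\mathrm{LCS}(c_1, c_2)\bigr)
            \quad \text{for } \alpha = 0,\ \beta = 1,\ T = 1
\end{align*}
```

The last line follows by summing the link strengths along the path from each concept up to the lowest common subsumer: the intermediate IC terms cancel, leaving only the two endpoints and the subsumer.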
Page Rank • Developed by Larry Page and Sergey Brin • A link analysis algorithm that assigns a numerical weight to each document in a hyperlinked set • Measures the relative importance of a page within the set • A link to a page is a vote of support which increases the rank of that page • The ranks form a probability distribution representing the likelihood that a person randomly clicking on links ends up on a specific page
Pagerank based Algorithm • Assume the universe has 4 pages: A, B, C and D • The initial value of every page is 0.25 • Now suppose B, C and D link only to A • The rank of A is then given by: PR(A) = PR(B) + PR(C) + PR(D) = 0.75 • If B, C or D also link to other pages, each passes on only a share of its rank: PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) • L(B) is the number of outbound links from B
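The two cases on this slide can be worked through in a few lines. This is a single update step only, with no damping; `simple_rank` is a name introduced here, and the second link structure (B linking to A and C, D linking to A, B and C) is an assumed example.

```python
def simple_rank(target, ranks, out_links):
    """One undamped update: each page linking to target passes on rank / out-degree."""
    return sum(ranks[v] / len(links)
               for v, links in out_links.items() if target in links)

# Four pages, each starting at 0.25.
RANKS = dict.fromkeys("ABCD", 0.25)

# Case 1: B, C and D link only to A.
ONLY_TO_A = {"B": {"A"}, "C": {"A"}, "D": {"A"}}
rank_a_case1 = simple_rank("A", RANKS, ONLY_TO_A)   # 0.25 + 0.25 + 0.25

# Case 2 (assumed link structure): B and D also link elsewhere.
MIXED = {"B": {"A", "C"}, "C": {"A"}, "D": {"A", "B", "C"}}
rank_a_case2 = simple_rank("A", RANKS, MIXED)       # 0.25/2 + 0.25/1 + 0.25/3
```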
Pagerank based Algorithm (contd.) • The page rank of U depends on the rank of each page V linking to U, divided by the number of links from V • In general: PR(U) = sum over all pages V that link to U of PR(V)/L(V) • The sum runs only over pages which link to U • Thus the page ranks of all pages in the corpus sum to 1
Pagerank based Algorithm (contd.) • Damping factor: an imaginary surfer will stop clicking on links after some time • d is the probability that the user will continue clicking • The damping factor is estimated at 0.85 here • The page rank formula is then modified to include d • To get the actual rank of a page, the formula has to be iterated many times • Problem of dangling links: links pointing to pages that have no outbound links
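A minimal sketch of the iteration with damping. It assumes the normalised form PR(U) = (1 - d)/N + d * sum of PR(V)/L(V), where N is the number of pages, so that the ranks stay a probability distribution; the slide's exact variant is not shown, so this choice is an assumption. Spreading the rank of pages without outbound links evenly over all pages is one common way to handle dangling pages, also an assumption here, and `pagerank` is a name introduced for the example.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate the damped update a fixed number of times (no convergence test)."""
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(pages)
    rank = dict.fromkeys(pages, 1.0 / n)
    for _ in range(iterations):
        # Rank held by pages with no outbound links is shared evenly by all pages.
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        new_rank = {}
        for p in pages:
            incoming = sum(rank[v] / len(out_links[v])
                           for v in pages if p in out_links.get(v, ()))
            new_rank[p] = (1 - d) / n + d * (incoming + dangling / n)
        rank = new_rank
    return rank

# B, C and D link only to A; A has no outbound links (a dangling page).
RESULT = pagerank({"B": {"A"}, "C": {"A"}, "D": {"A"}})
```

With this handling the ranks still sum to 1 after every iteration, and A ends up with the highest rank because it receives every link.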