
Universal Concept Spotter: Self-Learning Algorithm for Entity Identification

Explore the innovative Universal Concept Spotter, an unsupervised learning system that identifies any category from large corpora. Learn how to start with initial examples and context, recognize patterns in text, and enhance precision and recall over iterative cycles.

lightj



Presentation Transcript


  1. A Self-Learning Universal Concept Spotter • By Tomek Strzalkowski and Jin Wang • Presented by Iman Sen

  2. Introduction • Previously, information taggers were hand-crafted, domain-specific, and/or too reliant on lexical clues such as upper case, format, etc. • The Universal Spotter is one of the first algorithms for unsupervised learning that can identify any category in any large corpus, given some initial examples and context information on what to spot.

  3. Basic Idea • Start with some prior examples and context for the things to spot (called the seed) and a large corpus. • Exploit the redundancy of patterns in text. • Use those examples to find new items and context information to add to the original set of rules. • Initially precision is high and recall is very low. • Repeat the cycle to maximize recall while maintaining or improving precision.

  4. Seeds: What we are looking for • Initially, the seed is information provided by the user. • It is either examples or contextual information. • Examples can be highlighted in the text (“Microsoft”, “toothbrushes”). • Context information can also be specified, both internal and external: for example, “name ends with Co.” (internal) or “appears after produced” (external). • Negative examples and negative context information can be given as well, such as “not to the right of produced”.

  5. The Cyclic Process • Build rules from the initial examples and context information. • Find further examples of the concept in the corpus while trying to maximize precision and recall. • As more examples of the concept are found, more contextual information can be found. • Use the expanded context information to find more entities.

  6. Simple Example • Suppose we initially have the seeds “Co” and “Inc” and the following text: “Henry Kaufman is president of Henry Kaufman & Co., … president of Gabelli Funds Inc.; Claude N. Rosenberg is named president of Thomson S.A. …” • Use “Co” and “Inc” to pick out Henry Kaufman & Co and Gabelli Funds Inc. • Use these newly found entities to get contextual information, for example “president of” before each of them. • Use “president of” to find “Thomson S.A.”
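The example on slide 6 can be traced with a short Python sketch. Everything here is an illustrative assumption rather than the authors' implementation: the sentences are an abridged version of the slide's text, the `CANDIDATE` regular expression (capitalized phrases) stands in for the real POS-based candidate selection, and the external context is fixed at the two preceding words.

```python
import re

# Abridged version of the slide's text (elided parts left out).
SENTENCES = [
    "Henry Kaufman is president of Henry Kaufman & Co.",
    "president of Gabelli Funds Inc.",
    "Claude N. Rosenberg is named president of Thomson S.A.",
]

# Assumption: a candidate is a run of capitalized tokens, optionally joined by "&".
CANDIDATE = re.compile(r"[A-Z][\w.]*(?:\s+(?:&\s+)?[A-Z][\w.]*)*")

SEED_SUFFIXES = {"Co", "Inc"}  # the seeds from the slide


def candidates(sentence):
    """Yield (phrase, two preceding words) for each capitalized phrase."""
    for m in CANDIDATE.finditer(sentence):
        left = tuple(sentence[:m.start()].split()[-2:])
        yield m.group(), left


def spot(sentences):
    found = [c for s in sentences for c in candidates(s)]
    # Cycle 1: accept phrases whose last word is a seed suffix.
    accepted = {p for p, _ in found
                if p.split()[-1].rstrip(".") in SEED_SUFFIXES}
    # Learn the external context that precedes the accepted phrases.
    contexts = {left for p, left in found if p in accepted and left}
    # Cycle 2: accept any phrase that follows a learned context.
    accepted |= {p for p, left in found if left in contexts}
    return accepted, contexts


accepted, contexts = spot(SENTENCES)
print(sorted(accepted))
print(contexts)
```

The person names ("Henry Kaufman", "Claude N. Rosenberg") are extracted as candidates but never accepted, because they neither end in a seed suffix nor follow the learned context.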

  7. The Classification Task • The goal is to decide whether a sequence of words contains a desired entity/concept. • This is done by calculating significance weights (SW) and then combining them.

  8. The Process: In Detail • Initially some preprocessing is done, including tokenization, POS tagging, and lexical normalization or stemming. • POS tagging helps to delineate which sequences of words might contain the desired entities. • These steps reduce the amount of noise.

  9. How to calculate SW • Consider a sequence of words W1, W2, …, Wm in the text that is of interest (the central unit). A window of size n on either side of the central unit is searched for contextual information. • Then do the following for all words within the window: • Make up pairs of (word, position), where position is one of preceding context (p), central unit (s), or following context (f). • Similarly, make up pairs of (bigram, position). • Make up 3-tuples of (word, position, distance) for the same sequence of words, where distance is the distance from W1 or Wm (for units within W1 through Wm, take the distance from Wm).

  10. An SW Calculation Example • Example: “... boys kicked the door with rage ...” with window n = 2 and central unit “the door”. • The generated tuples (called evidence items) are: (boys, p), (kicked, p), (the, s), (door, s), (with, f), (rage, f), ((boys, kicked), p), ((the, door), s), ((with, rage), f), (boys, p, 2), (kicked, p, 1), (the, s, 2), (door, s, 1), (with, f, 1), (rage, f, 2), ((boys, kicked), p, 1), ((the, door), s, 1), ((with, rage), f, 1)
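As a sketch of how the evidence items on slide 10 could be generated, the Python function below reproduces the listed tuples for the "the door" example. The function name and its parameters are hypothetical. The distance convention is inferred from the slide's numbers (the word nearest the reference point gets distance 1), and taking a bigram's distance as the smaller of its two word distances is an assumption that happens to match the example.

```python
def evidence_items(tokens, start, end, n):
    """Evidence items for the central unit tokens[start:end], window size n."""
    regions = {
        "p": tokens[max(0, start - n):start],  # preceding context
        "s": tokens[start:end],                # central unit
        "f": tokens[end:end + n],              # following context
    }

    def distance(pos, i, size):
        # Following context counts outward from Wm; preceding context and
        # the central unit count backward from their last word.
        return i + 1 if pos == "f" else size - i

    word_pairs, bigram_pairs, word_triples, bigram_triples = [], [], [], []
    for pos, words in regions.items():
        size = len(words)
        for i, w in enumerate(words):
            word_pairs.append((w, pos))
            word_triples.append((w, pos, distance(pos, i, size)))
        for i in range(size - 1):
            bigram = (words[i], words[i + 1])
            # Assumption: a bigram takes the distance of its nearer word.
            d = min(distance(pos, i, size), distance(pos, i + 1, size))
            bigram_pairs.append((bigram, pos))
            bigram_triples.append((bigram, pos, d))
    return word_pairs + bigram_pairs + word_triples + bigram_triples


tokens = "boys kicked the door with rage".split()
for item in evidence_items(tokens, start=2, end=4, n=2):
    print(item)
```

With these inputs the function yields the 18 items listed on the slide, in the same order.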

  11. SW Calculation, continued • Items fall into two groups: A is the group of accepted items and R is the group of rejected items. • These groups are used to calculate SW:

      $$ SW(t) = \begin{cases} \dfrac{f(t,A) - f(t,R)}{f(t,A) + f(t,R)} & \text{if } f(t,A) + f(t,R) > s \\ 0 & \text{otherwise} \end{cases} $$

      where s is a constant used to filter noise and f(x, X) is the frequency of x in X. • SW as described here takes values between -1.0 and 1.0. • For some threshold e > 0, SW(t) > e is taken as positive evidence and SW(t) < -e as negative evidence.
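The SW formula from slide 11 translates directly into code. The sketch below is a minimal Python version; the counts in the example and the default values chosen for `s` and `e` are made up for illustration and are not taken from the paper.

```python
from collections import Counter


def significance_weight(t, accepted, rejected, s=1):
    """SW(t) = (f(t,A) - f(t,R)) / (f(t,A) + f(t,R)) if the total exceeds s, else 0."""
    fa, fr = accepted[t], rejected[t]  # Counter returns 0 for unseen items
    total = fa + fr
    if total > s:
        return (fa - fr) / total
    return 0.0


def evidence_sign(sw, e=0.1):
    """+1 for positive evidence, -1 for negative evidence, 0 if inconclusive."""
    if sw > e:
        return 1
    if sw < -e:
        return -1
    return 0


# Hypothetical frequencies of evidence items in the accepted and rejected groups.
A = Counter({("president", "p"): 9, ("the", "p"): 5})
R = Counter({("president", "p"): 1, ("the", "p"): 5})

print(significance_weight(("president", "p"), A, R))  # 0.8
print(significance_weight(("the", "p"), A, R))        # 0.0
print(significance_weight(("rage", "f"), A, R))       # 0.0 (too rare, filtered by s)
```

An item seen equally often in both groups gets weight 0, and so does an item seen too rarely to pass the noise filter `s`.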

  12. Combining SW weights • The SW weights are then combined; if the combined value exceeds a threshold, they become available during the tagging stage. • The primary scheme used by the authors for combining is:

      $$ x \oplus y = \begin{cases} x + y - xy & \text{if } x > 0 \text{ and } y > 0 \\ x + y + xy & \text{if } x < 0 \text{ and } y < 0 \\ x + y & \text{otherwise} \end{cases} $$

      Note: values still remain within [-1.0, 1.0].
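The combining scheme on slide 12 can be written as a small Python function. Folding it over a list of weights from left to right, as `combine_all` does, is an assumption of this sketch; the slide only defines the pairwise operation.

```python
from functools import reduce


def combine(x, y):
    """Pairwise combination of two SW weights in [-1.0, 1.0]."""
    if x > 0 and y > 0:
        return x + y - x * y
    if x < 0 and y < 0:
        return x + y + x * y
    return x + y


def combine_all(weights):
    """Fold combine over the weights; 0.0 is neutral for this operation."""
    return reduce(combine, weights, 0.0)


print(combine(0.5, 0.5))     # 0.75
print(combine(-0.5, -0.5))   # -0.75
print(combine(0.5, -0.25))   # 0.25
print(combine_all([0.5, 0.5, -0.5]))  # 0.25
```

Two pieces of agreeing evidence reinforce each other without ever exceeding 1.0 in magnitude, while conflicting evidence partly cancels. With mixed signs the result can depend on the order of combination: for example, `combine(0.5, combine(0.5, -0.5))` gives 0.5 rather than 0.25.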

  13. Bootstrapping • The basic bootstrapping process looks like this:
      procedure Bootstrapping
        collect seeds
        loop
          training phase (calculate SW weights, combine them, add to rules)
          tagging phase (use all accumulated rules to tag)
        until satisfied
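The control flow of the bootstrapping procedure can be sketched in Python as a loop over pluggable training and tagging phases. The function signature, the `max_cycles` guard, and the toy corpus of (context, item) pairs are all assumptions made for illustration ("Acme Corp" and "shares of" are invented). In particular, the toy `train` simply collects co-occurring contexts, where the real system would compute and combine SW weights.

```python
def bootstrap(seeds, train, tag, satisfied, max_cycles=10):
    """Alternate training and tagging until `satisfied` says to stop."""
    rules = set()
    tagged = set(seeds)
    for _ in range(max_cycles):
        rules |= train(tagged)       # training phase: add new rules
        new_tagged = tag(rules)      # tagging phase: apply all rules so far
        done = satisfied(tagged, new_tagged)
        tagged = new_tagged
        if done:
            break
    return rules, tagged


# Toy corpus: each entry is (context, item). Invented data for illustration.
CORPUS = [
    ("president of", "Gabelli Funds Inc."),
    ("president of", "Thomson S.A."),
    ("shares of", "Thomson S.A."),
    ("shares of", "Acme Corp"),
]


def train(tagged):
    # Stand-in for SW calculation: every context seen with a tagged item.
    return {ctx for ctx, item in CORPUS if item in tagged}


def tag(rules):
    return {item for ctx, item in CORPUS if ctx in rules}


def satisfied(old, new):
    return old == new  # stop when a cycle tags nothing new


rules, tagged = bootstrap({"Gabelli Funds Inc."}, train, tag, satisfied)
print(sorted(rules))
print(sorted(tagged))
```

Starting from one seed, the first cycle learns "president of" and tags Thomson S.A.; the second learns "shares of" and tags Acme Corp; the third finds nothing new and stops. Because this toy accepts every co-occurring context, it has no protection against precision loss, which is the role the SW thresholds play in the described system.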

  14. Experiments and Results • Organizations: trained on a 7 MB WSJ corpus, tested on 10 selected articles. • Initially, precision was 97% but recall only 49%. • After the 4th cycle, these reached precision 95% and recall 90%. • A similar experiment on identifying products gave worse results.

  15. Improvements • Different weighting and combining schemes. • Universal lexicon lookups: accepted items can be verified in existing online lexical databases. • The program cannot deal with conjunctions of noun phrases, because they are difficult to identify.

  16. Some Considerations • It is not clear how many initial seeds were provided. • The program is described as identifying one category of items at a time, but it could be extended to more. • A limitation is that certain contexts/examples might be impossible to spot because of noise in the data, as might entities that do not have obvious context patterns. • Errors made by the POS tagger are inherited.
