Language and Information

Language and Information. November 9, 2000. Handout #4. Course Information. Instructor: Dragomir R. Radev (radev@si.umich.edu) Office: 305A, West Hall Phone: (734) 615-5225 Office hours: TTh 3-4 Course page: http://www.si.umich.edu/~radev/760

Presentation Transcript


  1. Language and Information November 9, 2000 Handout #4

  2. Course Information • Instructor: Dragomir R. Radev (radev@si.umich.edu) • Office: 305A, West Hall • Phone: (734) 615-5225 • Office hours: TTh 3-4 • Course page: http://www.si.umich.edu/~radev/760 • Class meets on Thursdays, 5-8 PM in 311 West Hall

  3. Readings • Textbook: • Oakes Ch.3: 95-96, 110-120 • Oakes Ch.4: 149-150, 158-166, 182-189 • Oakes Ch.5: 199-212, 221-223, 236-247 • Additional readings • Knight “Statistical Machine Translation Workbook” (http://www.clsp.jhu.edu/ws99/) • McKeown & Radev “Collocations” • Optional: M&S chapters 4, 5, 6, 13, 14

  4. Statistical Machine Translation and Language Modeling

  5. The Noisy Channel Model • Source-channel model of communication • Parametric probabilistic models of language and translation • Training such models

  6. Statistics • Given f, guess e • Channel: e → encoder (E → F) → f → decoder (F → E) → e' • Decision rule: $e' = \arg\max_e P(e \mid f) = \arg\max_e P(f \mid e)\, P(e)$ • $P(f \mid e)$: translation model • $P(e)$: language model
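The decision rule on this slide, e' = argmax over e of P(f|e) P(e), can be sketched in a few lines of Python. This is only an illustration: the candidate sentences and all probabilities below are made up, and the names `LM`, `TM` and `decode` are hypothetical.

```python
import math

# Toy, made-up probabilities for illustration only (not from the handout).
# Language model P(e): how plausible each English candidate is on its own.
LM = {
    "the green glass": 0.60,
    "the glass green": 0.10,
    "green the glass": 0.05,
}

# Translation model P(f | e) for one fixed foreign sentence f.
TM = {
    "the green glass": 0.30,
    "the glass green": 0.40,
    "green the glass": 0.30,
}

def decode(candidates, lm, tm):
    """Return argmax_e P(f|e) * P(e), scored in log space."""
    def score(e):
        return math.log(tm[e]) + math.log(lm[e])
    return max(candidates, key=score)

best = decode(list(LM), LM, TM)
tm_only = max(TM, key=TM.get)  # what the translation model alone would pick
print(best)
print(tm_only)
```

In this toy setting the translation model alone prefers the word-for-word order, and the language model term is what pushes the decoder toward the fluent candidate.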

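  7. Parametric probabilistic models • Language model (LM): $P(e) = P(e_1, e_2, \ldots, e_L) = P(e_1)\, P(e_2 \mid e_1) \cdots P(e_L \mid e_1 \ldots e_{L-1})$ • Trigram approximation: $P(e_L \mid e_1 \ldots e_{L-1}) \approx P(e_L \mid e_{L-2}, e_{L-1})$ • Deleted interpolation • Translation model (TM) • Alignment: $P(f, a \mid e)$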
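A minimal sketch of a trigram language model smoothed by interpolating trigram, bigram and unigram estimates, which is the idea behind deleted interpolation. The interpolation weights are fixed by hand here as an assumption; in deleted interpolation proper they are estimated on held-out data. The toy corpus and the function names are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens):
    """Collect n-gram counts plus the context counts used as denominators."""
    return {
        "uni": Counter(tokens),
        "bi": Counter(zip(tokens, tokens[1:])),
        "tri": Counter(zip(tokens, tokens[1:], tokens[2:])),
        "ctx1": Counter(tokens[:-1]),                     # words that have a successor
        "ctx2": Counter(zip(tokens[:-2], tokens[1:-1])),  # word pairs that have a successor
        "n": len(tokens),
    }

def interpolated_trigram(w1, w2, w3, c, lambdas=(0.1, 0.3, 0.6)):
    """P(w3 | w1, w2) as a weighted mix of unigram, bigram and trigram estimates.

    The weights are hand-picked assumptions and must sum to 1.
    """
    l1, l2, l3 = lambdas
    p_uni = c["uni"][w3] / c["n"]
    p_bi = c["bi"][(w2, w3)] / c["ctx1"][w2] if c["ctx1"][w2] else 0.0
    p_tri = c["tri"][(w1, w2, w3)] / c["ctx2"][(w1, w2)] if c["ctx2"][(w1, w2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

tokens = "the board met and the board voted and the board adjourned".split()
counts = ngram_counts(tokens)
p_seen = interpolated_trigram("the", "board", "met", counts)
p_unseen = interpolated_trigram("and", "board", "met", counts)
print(round(p_seen, 4), round(p_unseen, 4))
```

The unseen trigram still receives a non-zero probability because the bigram and unigram terms back it up.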

  8. IBM’s EM trained models • Word translation • Local alignment • Fertilities • Class-based alignment • Non-deficient algorithm (avoid overlaps, overflow)
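As an illustration of EM training for the first item above (word translation), here is a minimal sketch in the style of IBM Model 1. It is a simplification under stated assumptions: there is no NULL word, no fertility or alignment-position model, and the two-sentence corpus is made up. The name `train_model1` is hypothetical.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Estimate word-translation probabilities t(f | e) with EM.

    pairs: list of (foreign_tokens, english_tokens) sentence pairs.
    """
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked at all
        # E-step: distribute each foreign word over the English words in its sentence.
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per English word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

pairs = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
]
t = train_model1(pairs)
print(round(t[("la", "the")], 3), round(t[("maison", "house")], 3))
```

Because "la" co-occurs with "the" in both sentence pairs, EM gradually shifts probability mass toward that pairing, which in turn frees "maison" to pair with "house".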

  9. Lexical Semantics and WordNet

  10. Meanings of words • Lexemes, lexicon, sense(s) • Examples: • Red, n: the color of blood or a ruby • Blood, n: the red liquid that circulates in the heart, arteries and veins of animals • Right, adj: located nearer the right hand esp. being on the right when facing the same direction as the observer • Do dictionaries give us definitions?

  11. Relations among words • Homonymy: • Instead, a bank can hold the investments in a custodial account in the client’s name. • But as agriculture burgeons on the east bank, the river will shrink even more. • Other examples: be/bee?, wood/would? • Homophones • Homographs • Applications: spelling correction, speech recognition, text-to-speech • Example: Un ver vert va vers un verre vert.

  12. Polysemy • They rarely serve red meat, preferring to prepare seafood, poultry, or game birds. • He served as U.S. ambassador to Norway in 1976 and 1977. • He might have served his time, come out and led an upstanding life. • Homonymy: distinct and unrelated meanings, possibly with different etymology (multiple lexemes). • Polysemy: single lexeme with two meanings. • Example: an “idea bank”

  13. Synonymy • Principle of substitutability • How big is this plane? • Would I be flying on a large or small plane? • Miss Nelson, for instance, became a kind of big sister to Mrs. Van Tassel’s son, Benjamin. • ?? Miss Nelson, for instance, became a kind of large sister to Mrs. Van Tassel’s son, Benjamin. • What is the cheapest first class fare? • ?? What is the cheapest first class cost?

  14. Semantic Networks • Used to represent relationships between words • Example: WordNet - created by George Miller’s team at Princeton (http://www.cogsci.princeton.edu/~wn) • Based on synsets (synonyms, interchangeable words) and lexical matrices

  15. Lexical matrix

  16. Synsets • Disambiguation • {board, plank} • {board, committee} • Synonyms • substitution • weak substitution • synonyms must be of the same part of speech

  17. $ ./wn board -hypen Synonyms/Hypernyms (Ordered by Frequency) of noun board 9 senses of board Sense 1 board => committee, commission => administrative unit => unit, social unit => organization, organisation => social group => group, grouping Sense 2 board => sheet, flat solid => artifact, artefact => object, physical object => entity, something Sense 3 board, plank => lumber, timber => building material => artifact, artefact => object, physical object => entity, something

  18. Sense 4 display panel, display board, board => display => electronic device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 5 board, gameboard => surface => artifact, artefact => object, physical object => entity, something Sense 6 board, table => fare => food, nutrient => substance, matter => object, physical object => entity, something

  19. Sense 7 control panel, instrument panel, control board, board, panel => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 8 circuit board, circuit card, board, card => printed circuit => computer circuit => circuit, electrical circuit, electric circuit => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 9 dining table, board => table => furniture, piece of furniture, article of furniture => furnishings => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something
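The hypernym chains printed by `wn` above can be treated as simple lists running from a sense up to its top-level concept. The sketch below copies three of the chains from the output and finds the lowest hypernym that two senses share. The sense labels such as `board#3` and the function name are hypothetical conventions for this example, not WordNet identifiers.

```python
# Hypernym chains copied from senses 1, 3 and 5 of the wn output above.
# Each list runs from the sense itself up to its top-level concept.
CHAINS = {
    "board#1": ["board", "committee, commission", "administrative unit",
                "unit, social unit", "organization, organisation",
                "social group", "group, grouping"],
    "board#3": ["board, plank", "lumber, timber", "building material",
                "artifact, artefact", "object, physical object",
                "entity, something"],
    "board#5": ["board, gameboard", "surface", "artifact, artefact",
                "object, physical object", "entity, something"],
}

def lowest_common_hypernym(sense_a, sense_b, chains=CHAINS):
    """Return the most specific hypernym shared by two senses, or None."""
    ancestors_b = set(chains[sense_b][1:])
    for synset in chains[sense_a][1:]:  # walk upward from sense_a
        if synset in ancestors_b:
            return synset
    return None

print(lowest_common_hypernym("board#3", "board#5"))
print(lowest_common_hypernym("board#1", "board#3"))
```

The plank and gameboard senses meet at the artifact synset, while the committee sense shares no hypernym with them because it sits under a different top-level concept.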

  20. Antonymy • “x” vs. “not-x” • “rich” vs. “poor”? • {rise, ascend} vs. {fall, descend}

  21. Other relations • Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”. • Hyponymy: {tree} is a hyponym of {plant}. • Hierarchical structure based on hyponymy (and hypernymy).

  22. Other features of WordNet • Index of familiarity • Polysemy

  23. Familiarity and polysemy • board used as a noun is familiar (polysemy count = 9) • bird used as a noun is common (polysemy count = 5) • cat used as a noun is common (polysemy count = 7) • house used as a noun is familiar (polysemy count = 11) • information used as a noun is common (polysemy count = 5) • retrieval used as a noun is uncommon (polysemy count = 3) • serendipity used as a noun is very rare (polysemy count = 1)

  24. Compound nouns advisory board appeals board backboard backgammon board baseboard basketball backboard big board billboard binder's board binder board blackboard board game board measure board meeting board member board of appeals board of directors board of education board of regents board of trustees

  25. Overview of senses 1. board -- (a committee having supervisory powers; "the board has seven members") 2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows") 3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes) 4. display panel, display board, board -- (a board on which information can be displayed to public view) 5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces") 6. board, table -- (food or meals in general; "she sets a fine table"; "room and board") 7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree") 8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities) 9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")

  26. Top-level concepts • {act, action, activity} {animal, fauna} {artifact} {attribute, property} {body, corpus} {cognition, knowledge} {communication} {event, happening} {feeling, emotion} {food} {group, collection} {location, place} {motive} {natural object} {natural phenomenon} {person, human being} {plant, flora} {possession} {process} {quantity, amount} {relation} {shape} {state, condition} {substance} {time}

  27. Information Extraction

  28. Types of Information Extraction • Template filling • Language reuse • Biographical information • Question answering

  29. MUC-4 Example • Source text: On October 30, 1989, one civilian was killed in a reported FMLN attack in El Salvador. • Filled template:
      INCIDENT: DATE                   30 OCT 89
      INCIDENT: LOCATION               EL SALVADOR
      INCIDENT: TYPE                   ATTACK
      INCIDENT: STAGE OF EXECUTION     ACCOMPLISHED
      INCIDENT: INSTRUMENT ID
      INCIDENT: INSTRUMENT TYPE
      PERP: INCIDENT CATEGORY          TERRORIST ACT
      PERP: INDIVIDUAL ID              "TERRORIST"
      PERP: ORGANIZATION ID            "THE FMLN"
      PERP: ORG. CONFIDENCE            REPORTED: "THE FMLN"
      PHYS TGT: ID
      PHYS TGT: TYPE
      PHYS TGT: NUMBER
      PHYS TGT: FOREIGN NATION
      PHYS TGT: EFFECT OF INCIDENT
      PHYS TGT: TOTAL NUMBER
      HUM TGT: NAME
      HUM TGT: DESCRIPTION             "1 CIVILIAN"
      HUM TGT: TYPE                    CIVILIAN: "1 CIVILIAN"
      HUM TGT: NUMBER                  1: "1 CIVILIAN"
      HUM TGT: FOREIGN NATION
      HUM TGT: EFFECT OF INCIDENT      DEATH: "1 CIVILIAN"
      HUM TGT: TOTAL NUMBER

  30. Language reuse • Phrase to be reused (NP): Yugoslav President [description] Slobodan Milosevic [entity]

  31. Example • NP consisting of NP Punc NP: Andrija Hebrang [entity] , The Croatian Defense Minister [description]

  32. Issues involved • Text generation depends on lexical resources • Lexical choice • Corpus processing vs. manual compilation • Deliberate decisions by writers • Difficult to encode by hand • Dynamically updated (Scott O’Grady) • No full semantic representation

  33. Named entities Richard Butler met Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work. Yitzhak Mordechai will meet Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking. Sinn Fein deferred a vote on Northern Ireland's peace deal Sunday. Hundreds of troops patrolled Dili on Friday during the anniversary of Indonesia's 1976 annexation of the territory.

  34. Entities + Descriptions Chief U.N. arms inspector Richard Butler met Iraq’s Deputy Prime Minister Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work. Israel's Defense Minister Yitzhak Mordechai will meet senior Palestinian negotiator Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking. Sinn Fein, the political wing of the Irish Republican Army, deferred a vote on Northern Ireland's peace deal Sunday. Hundreds of troops patrolled Dili, the Timorese capital, on Friday during the anniversary of Indonesia's 1976 annexation of the territory.

  35. Building a database of descriptions • Size of database: 59,333 entities and 193,228 descriptions as of 08/01/98 • Text processed: 494 MB (ClariNet, Reuters, UPI) • Length: 1-15 lexical items • Accuracy: (precision 94%, recall 55%)

  36. Multiple descriptions per entity • Profile for Ung Huot: • A senior member • Cambodia’s • Cambodian foreign minister • Co-premier • First prime minister • Foreign minister • His excellency • Mr. • New co-premier • New first prime minister • Newly-appointed first prime minister • Premier

  37. Language reuse and regeneration • CONCEPTS + CONSTRAINTS = CONSTRUCTS • Corpus analysis: determining constraints • Text generation: applying constraints

  38. Language reuse and regeneration • Understanding: full parsing is expensive • Generation: expensive to use full parses • Bypassing certain stages (e.g., syntax) • Not(!) template-based: still requires extraction, analysis, context identification, modification, and generation • Factual sentences, sentence fragments • Reusability of a phrase

  39. Redefining the relation: $\mathrm{DescriptionOf}(E, C) = \{ D_{i,C} \mid D_{i,C} \text{ is a description of } E \text{ in context } C \}$ • If named entity E appears in text and the context is C: insert DescriptionOf(E, C) in text. • Context-dependent solution

  40. Multiple descriptions per entity • Profile for Bill Clinton: • U.S. President • President • An Arkansas native • Democratic presidential candidate

  41. Choosing the right description • Entity: Bill Clinton • Description → context: • U.S. President → foreign relations • President → national affairs • An Arkansas native → false bomb alert in AR • Democratic presidential candidate → elections • Pragmatic and semantic constraints on lexical choice.
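The relation DescriptionOf(E, C) and the Bill Clinton profile above can be sketched as a lookup keyed by entity and context. This is a minimal illustration, not the system described in the handout: the dictionary layout, the function names, and the sample sentence are assumptions, and the context is passed in as a ready-made label rather than inferred from the text.

```python
# Descriptions and contexts copied from the Bill Clinton profile above.
DESCRIPTIONS = {
    "Bill Clinton": {
        "foreign relations": ["U.S. President"],
        "national affairs": ["President"],
        "false bomb alert in AR": ["An Arkansas native"],
        "elections": ["Democratic presidential candidate"],
    }
}

def description_of(entity, context, db=DESCRIPTIONS):
    """Return the set of descriptions recorded for an entity in a context."""
    return set(db.get(entity, {}).get(context, []))

def insert_description(text, entity, context, db=DESCRIPTIONS):
    """Prefix the first mention of the entity with a context-appropriate description."""
    candidates = sorted(description_of(entity, context, db))
    if not candidates or entity not in text:
        return text  # nothing known for this context: leave the text unchanged
    return text.replace(entity, f"{candidates[0]} {entity}", 1)

# Made-up sentence, used only to exercise the lookup.
print(insert_description("Bill Clinton arrived on Monday.", "Bill Clinton", "foreign relations"))
print(insert_description("Bill Clinton arrived on Monday.", "Bill Clinton", "sports"))
```

When no description is recorded for the given context, the text is returned unchanged rather than falling back to an arbitrary description.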

  42. Semantic information from WordNet • All words contribute to the semantic representation • Only the first sense of each word is used • What is a synset?

  43. WordNet synset hierarchy {00001740} entity, something {00002086} life form, organism, being, living thing {00004123} person, individual, someone, somebody, human {06950891} leader {07311393} head, chief, top dog {07063507} administrator, decision maker {07063762} director, manager, managing director

  44. Lexico-semantic matrix Profile for Ung Huot

  45. Choosing the right description • Topic approximation by context: words that appear near the entity in the text (bag) • Name of the entity (set) • Length of article (continuous) • Profile: set of all descriptions for that entity (bag) - parent synset offsets for all words wi. • Semantic information: WordNet synset offsets (bag)

  46. Choosing the right description Ripper feature vector [Cohen 1996] (Context, Entity, Description, Length, Profile, Parent) Classes

  47. Example (training)

  48. Sample rules Total number of rules: 4085 for 100,000 inputs

  49. Evaluation • 35,206 tuples; 11,504 distinct entities; 3.06 distinct descriptions per entity (DDPE) • Training: 90% of corpus (10,353 entities) • Test: 10% of corpus (1,151 entities)
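The figures on this slide are consistent with one another, as a quick check shows. The sketch assumes that DDPE is the number of tuples divided by the number of distinct entities, which the slide does not state explicitly.

```python
tuples = 35_206
entities = 11_504
train_entities, test_entities = 10_353, 1_151

# Assumption: DDPE = tuples / distinct entities.
ddpe = tuples / entities
train_share = train_entities / entities

print(f"{ddpe:.2f} descriptions per entity")
print(f"{train_share:.1%} of entities used for training")
print(train_entities + test_entities == entities)
```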

  50. Evaluation • Rule format (each matching rule adds constraints): • X [A] (evidence of A) • Y [B] (evidence of B) • X Y [A] [B] (evidence of A and B) • Classes are in $2^W$ (the power set of WordNet nodes) • Precision and recall are computed on the constraints selected by the system
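A minimal sketch of the evaluation idea on this slide: every rule whose trigger matches adds its constraints, and precision and recall are computed on the resulting constraint set. This is not Ripper itself; the rule encoding, the function names, and the gold set below are hypothetical, with X, Y, A and B taken from the rule format above.

```python
def apply_rules(context, rules):
    """Union the constraints of every rule whose trigger is contained in the context."""
    selected = set()
    for trigger, constraints in rules:
        if trigger <= context:  # all trigger items are present
            selected |= constraints
    return selected

def precision_recall(selected, gold):
    """Precision and recall of the selected constraints against a gold set."""
    if not selected or not gold:
        return 0.0, 0.0
    hits = len(selected & gold)
    return hits / len(selected), hits / len(gold)

# Rules in the slide's format: trigger items -> constraints they give evidence for.
RULES = [
    (frozenset({"X"}), {"A"}),
    (frozenset({"Y"}), {"B"}),
    (frozenset({"X", "Y"}), {"A", "B"}),
]

selected = apply_rules({"X", "Y", "Z"}, RULES)
precision, recall = precision_recall(selected, gold={"A", "C"})  # made-up gold set
print(sorted(selected), precision, recall)
```

Because the output is a set of constraints rather than a single label, each prediction is an element of the power set of nodes, which is why set-based precision and recall are the natural measures here.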
