An Attack on Data Sparseness

Presentation Transcript


  1. An Attack on Data Sparseness JHU – Tutorial, June 11 2003

  2. OVERVIEW • What is this project about? • What is GATE? • Lab assignment

  3. Basic Approach – (from RG talk) • Build linguistic patterns, e.g. "person was appointed as post of company" and "company named person to post" • Apply patterns to text and fill database
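
The two patterns on this slide can be sketched as regular expressions. This is only an illustration: the `NAME` and `POST` matchers are hypothetical stand-ins (capitalised words for names, lowercase words for posts), whereas a real system would rely on named-entity recognition to find the person and company spans.

```python
import re

# Hypothetical placeholder matchers (assumptions, not from the talk):
# names are runs of capitalised words, posts are runs of lowercase words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
POST = r"[a-z]+(?: [a-z]+)*?"

PATTERNS = [
    # person was appointed as post of company
    re.compile(
        rf"(?P<person>{NAME}) was appointed as (?P<post>{POST}) of (?P<company>{NAME})"
    ),
    # company named person to post
    re.compile(
        rf"(?P<company>{NAME}) named (?P<person>{NAME}) to (?P<post>{POST})(?=[.,;]|$)"
    ),
]


def extract(text):
    """Apply every pattern to the text and return one record per match."""
    rows = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            rows.append(match.groupdict())
    return rows


text = ("Jane Smith was appointed as chief executive of Acme Corp. "
        "Globex named John Doe to chairman.")
for row in extract(text):
    print(row)
```

Each returned dictionary corresponds to one row of the database the slide mentions; the example sentences and names are invented.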

  4. Getting these patterns … • Use training data to gather information about the contexts of the important bits of text. • Write an algorithm that automatically makes use of the contextual information to further identify new important bits and labels them.

  5. It is a difficult task • We are already pretty good at • Identifying and locating • People • Locations • Organizations • Dates • Times • What if we could do more?

  6. Would it help to tag/replace noun phrases? Astronauts aboard the space shuttle Endeavour were forced to dodge a derelict Air Force satellite Friday. HUMANS aboard SPACE_VEHICLE dodge SATELLITE TIMEREF

  7. We could transform the training data and get more HUMANS DODGE SATELLITE After parsing: HUMANS aboard SPACE_VEHICLE dodge SATELLITE TIMEREF
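
A minimal sketch of the noun-phrase replacement idea, assuming the noun phrases and their semantic classes are already known. The `SEMANTIC_CLASS` table is hand-written for this one sentence; in the project it would come from the annotated corpus or a tagger.

```python
# Hypothetical phrase-to-class table for the example sentence (an assumption).
SEMANTIC_CLASS = {
    "Astronauts": "HUMANS",
    "the space shuttle Endeavour": "SPACE_VEHICLE",
    "a derelict Air Force satellite": "SATELLITE",
    "Friday": "TIMEREF",
}


def replace_noun_phrases(sentence, classes=SEMANTIC_CLASS):
    """Replace each known noun phrase with its semantic class label."""
    # Longest phrases first, so a short phrase never rewrites part of a longer one.
    for phrase in sorted(classes, key=len, reverse=True):
        sentence = sentence.replace(phrase, classes[phrase])
    return sentence


sentence = ("Astronauts aboard the space shuttle Endeavour were forced to "
            "dodge a derelict Air Force satellite Friday.")
print(replace_noun_phrases(sentence))
```

Plain string replacement is enough for a demonstration; working from annotation offsets would be safer on real text, where the same string can occur in several roles.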

  8. Could we know these are the same? The IRA bombed a family owned shop in Belfast yesterday. FMLN set off a series of explosions in central Bogota today. ORGANIZATION ATTACKED LOCATION DATE

  9. Lexicography – Data Sparseness again .. • Sever BODYPART • Sever an arm • Sever a finger • Sever FASTENER • Sever the bond .. • Sever the links …

  10. Machine translation • Ambiguity often means that a word can be translated several ways. • Would knowing the semantic class of a word help us to choose the translation?

  11. Sometimes . . . • Crane the bird vs crane the machine • Bat the animal vs bat for cricket and baseball • Seal on a letter vs the animal

  12. SO .. • P(translation(crane) = grulla | animal) > P(translation(crane) = grulla) • P(translation(crane) = grua | machine) > P(translation(crane) = grua) • Can we show the overall effect lowers entropy?
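
To make the entropy question concrete, here is a small worked example. All probabilities are invented for illustration; only the comparison between the two entropies matters.

```python
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} distribution."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


# Invented numbers (assumptions): how often "crane" is an animal or a machine,
# and how it translates into Spanish in each case.
p_class = {"animal": 0.5, "machine": 0.5}
p_trans_given_class = {
    "animal": {"grulla": 0.9, "grua": 0.1},
    "machine": {"grulla": 0.1, "grua": 0.9},
}

# Marginal P(translation), obtained by summing over the semantic classes.
p_trans = {}
for cls, p_c in p_class.items():
    for trans, p_t in p_trans_given_class[cls].items():
        p_trans[trans] = p_trans.get(trans, 0.0) + p_c * p_t

h_marginal = entropy(p_trans)
# Conditional entropy H(translation | class): class-weighted average entropy.
h_conditional = sum(
    p_c * entropy(p_trans_given_class[cls]) for cls, p_c in p_class.items()
)

print(f"H(translation)         = {h_marginal:.3f} bits")
print(f"H(translation | class) = {h_conditional:.3f} bits")
```

With these made-up figures the uncertainty about the translation drops once the semantic class is known. Conditioning can never raise entropy on average, so the open question is how large the drop is on real data.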

  13. Language Modeling – Data Sparseness again .. • We need to estimate Pr (w3 | w1 w2) • What if we have never seen w1 w2 w3 before? • Can we instead develop a model and estimate Pr (w3 | C1 C2) or Pr (C3 | C1 C2), where Ci is the semantic class of wi?
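
A rough sketch of estimating Pr (w3 | C1 C2) from counts, assuming a word-to-class lookup is available. The `word_class` table and the training sentence are made up, words without a class stand for themselves, and the estimate is unsmoothed relative frequency.

```python
from collections import Counter


def train(sentences, word_class):
    """Count (C1, C2, w3) events and (C1, C2) contexts over tokenised sentences."""
    trigram_counts, context_counts = Counter(), Counter()
    for tokens in sentences:
        # A word with no known class acts as its own class.
        classes = [word_class.get(w, w) for w in tokens]
        for i in range(2, len(tokens)):
            context = (classes[i - 2], classes[i - 1])
            trigram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return trigram_counts, context_counts


def prob(w3, w1, w2, trigram_counts, context_counts, word_class):
    """Relative-frequency estimate of Pr(w3 | C1 C2); 0.0 for unseen contexts."""
    context = (word_class.get(w1, w1), word_class.get(w2, w2))
    if context_counts[context] == 0:
        return 0.0
    return trigram_counts[context + (w3,)] / context_counts[context]


# Hypothetical class lexicon and one-sentence training set.
word_class = {"a": "DET", "an": "DET", "arm": "BODYPART", "finger": "BODYPART"}
tri, ctx = train([["sever", "an", "arm", "completely"]], word_class)

# "a finger completely" never occurs in training, but DET BODYPART does.
print(prob("completely", "a", "finger", tri, ctx, word_class))
```

The word trigram "a finger completely" has zero count, yet the class context still yields an estimate, which is the point of the slide. A usable model would add smoothing and interpolate with word-level counts.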

  14. Overview Noun Phrases Identified Head Nouns Identified People marked Locations, dates, currencies, organizations Also marked CORPUS

  15. Overview Human Annotated with semantic tags– Noun Phrases Only

  16. Overview Training portion Test portion Machine Learning to improve this

  17. The Environment • GATE – an environment which conforms to the TIPSTER architecture • Provides many tools for processing language and a standard method for managing documents and any new information associated with the document

  18. GATE – Documents have annotations ~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~GEORGE BUSH~~~~ ~~~~~~~~~~~~~~~~~~~~~~ GEORGE BUSH at offset 104-114 is a person

  19. There may be more than one annotation ~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~ ~~~~The ruthless criminal~~~~ ~~~~~~~~~~~~~~~~~~~~~~ criminal at offset 104-122 is a human • Is a noun • Is the head of a noun phrase
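
The idea of stand-off annotations (labels attached to character offsets, with several allowed over the same span) can be sketched as a small data structure. This is not GATE's API; the `Document` and `Annotation` classes and their methods are hypothetical and exist only to illustrate the concept.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int   # offset of the first character
    end: int     # offset one past the last character
    kind: str    # e.g. "Human", "Noun", "NounPhrase"
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, phrase, kind, **features):
        """Annotate the first occurrence of phrase; the text itself is untouched."""
        start = self.text.index(phrase)
        annotation = Annotation(start, start + len(phrase), kind, features)
        self.annotations.append(annotation)
        return annotation

    def covering(self, offset):
        """All annotations whose span contains the given character offset."""
        return [a for a in self.annotations if a.start <= offset < a.end]


doc = Document("The ruthless criminal escaped.")
doc.annotate("The ruthless criminal", "NounPhrase")
doc.annotate("criminal", "Noun", head_of_np=True)
doc.annotate("criminal", "Human")

for a in doc.covering(15):
    print(a.kind, a.start, a.end, repr(doc.text[a.start:a.end]))
```

Because annotations only point into the text, any number of tools can add their own labels to the same document without interfering with each other.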

  20. Documents belong to collections (a corpus in GATE) • Collections can be loaded into GATE • New collections can be created • Documents can be added or removed • Applications can run over whole collections

  21. Applications – processing resources • Programs (tools) can be loaded into GATE • An application is a pipeline of tools • In the demo, you will see two applications

  22. ANNIE – with defaults • Sentence Splitter • POS tagger • NE recognizer • Tokenizer • Plus more

  23. Using GATE in today’s lab • To view already processed documents • To process new documents • To process documents, you must have both an application and a corpus

  24. To learn more .. • http://www.gate.ac.uk • Tutorials, slides, downloadable versions for PC, Linux, Solaris, etc.

  25. The lab • Follow the directions in /export/ws03sem/lab/gate.lab • Use the internet or Grolier to find paragraphs or documents about bats that fly and bats that hit a ball (cricket bat or baseball bat)

  26. Which bat is it? • Use the web texts as training data for the context – you can load them into GATE or use them as is. • Try a bag-of-words approach

  27. The idea Texts about flying bats Texts about movable solid ones The pitcher held the bat firmly NEW 
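
One possible bag-of-words approach to the lab is sketched below: count the words in the texts for each sense, then pick the sense whose bag makes the new sentence most likely. The add-one smoothing and the four training sentences are assumptions made for the example, not part of the lab instructions; stemming is omitted.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(texts_by_sense):
    """Build one bag of words (a Counter) per sense."""
    return {
        sense: Counter(w for text in texts for w in tokens(text))
        for sense, texts in texts_by_sense.items()
    }


def classify(sentence, bags):
    """Return the sense with the highest add-one-smoothed log-likelihood."""
    vocabulary = set().union(*bags.values())
    best_sense, best_score = None, -math.inf
    for sense, bag in bags.items():
        total = sum(bag.values())
        score = sum(
            math.log((bag[w] + 1) / (total + len(vocabulary)))
            for w in tokens(sentence)
        )
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Invented stand-ins for the texts gathered from the web or Grolier.
bags = train({
    "animal": ["Bats fly at night and eat insects.",
               "The bat hung upside down in the cave."],
    "sports": ["The batter swung the bat at the ball.",
               "The pitcher threw the ball and the bat cracked."],
})

print(classify("The pitcher held the bat firmly", bags))
```

With real training texts, stop words such as "the" would dominate the counts, so filtering them out or stemming first is a natural next step.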

  28. Resources • Porter Stemmer • GATE • Can collect trigrams or bigrams from the training data ..
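
Collecting bigrams or trigrams around the target word takes only a few lines. The helper names below are hypothetical, and whitespace tokenisation is used for brevity.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngrams_containing(tokens, n, target):
    """Count only the n-grams that include the target word."""
    return Counter(g for g in ngrams(tokens, n) if target in g)


words = "the pitcher held the bat firmly".split()
print(ngrams_containing(words, 2, "bat"))
print(ngrams_containing(words, 3, "bat"))
```

The resulting counts can serve as features in place of, or alongside, single words.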

  29. Comments • A very primitive approach to the problem • Use your work to say which kind of ‘bat’ is used in the text bat.txt • Try your same technique for ‘seal’ • There is a file called seal.txt to test on

  30. Finally • If you are very brave, can you find the semantic classes for ‘chicken’ in the chicken.txt file? • Careful – this one has a lot of metaphorical use. • Have fun!

  31. Tag Set • Longman’s Dictionary (LDOCE) • 2000 word defining vocabulary • 34 semantic categories • over subject codes • Over 5000 combination markings • Gives us 85% coverage of NPs but only contains 35% of the vocabulary

  32. WordNet • Developed at Princeton (George Miller) • About the same coverage on a sample • Defines synsets instead of senses • Arranged with ‘IS A’ relations which can serve as a semantic category • The English acts as an interlingua to EuroWordNet.

  33. Corpus • BNC – 100 million words – mostly written • POS tagged with CLAWS • English side of parallel texts • possibly 80 million words • Aligned • Some French, some Chinese, some Arabic • Or possibly UN data supplied by the MT team

  34. Evaluation • This must be decided before July • Baselines should be presented for the opening talk • The closing talk should include baseline plus as many measures of improvement as we can come up with

  35. Closing presentation • One half day for each of the three projects • Each person should plan to talk • One part of the team should be devoted to this aspect of the project

  36. Evaluation – suggested focus • We focus on showing that we can improve the entropy for MT.

  37. Techniques • Basically two possibilities • Extend techniques from disambiguation for assigning semantic category and then subject area (word focused) • Use machine learning to learn about the contexts and features of a particular semantic category – then tag those (semantic category focused)

  38. Today • 12-1 Roberto and Fabio • Machine learning • WordNet and conceptual density • LDOCE – WordNet correspondence • 1-2 Lunch • 2-3:30 Tagging texts and discussion • 3:30-5:30 GATE tutorial

  39. Tomorrow • Annotation tool • Division of labor • Plan Rome meeting • End at 1:00

  40. Why do it? • Text Extraction • Lexicography • Summarization • Machine Translation • Language Modeling