I256: Applied Natural Language Processing

Presentation Transcript


  1. I256: Applied Natural Language Processing. Marti Hearst, Sept 25, 2006.

  2. Shallow Parsing
  • Break text up into non-overlapping, contiguous subsets of tokens.
  • Also called chunking, partial parsing, or light parsing.
  • What is it useful for?
    • Entity recognition: people, locations, organizations
    • Studying linguistic patterns, e.g. the argument frames of "gave" (see the sketch after this list):
      • gave NP
      • gave up NP in NP
      • gave NP NP
      • gave NP to NP
  • Can ignore complex structure when it is not relevant
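
A sketch (not from the slides) of how chunking supports this kind of pattern study, written against the modern nltk API with Penn Treebank tags; the FRAME pattern is an assumption meant to approximate "gave NP to NP":

    import nltk
    from nltk.corpus import conll2000

    # Hypothetical pattern for "VBD + NP + to + NP" frames;
    # <NN.*> matches any noun tag (NN, NNS, NNP, ...).
    cp = nltk.RegexpParser(r"FRAME: {<VBD><DT>?<JJ>*<NN.*>+<TO><DT>?<JJ>*<NN.*>+}")
    for sent in conll2000.tagged_sents("train.txt"):
        for subtree in cp.parse(sent).subtrees(lambda t: t.label() == "FRAME"):
            if subtree.leaves()[0][0].lower() == "gave":
                print(" ".join(word for word, tag in subtree.leaves()))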

  3. A Relationship between Segmenting and Labeling
  • Tokenization segments the text
  • Tagging labels the text
  • Shallow parsing does both simultaneously

  4. Chunking vs. Full Syntactic Parsing “G.K. Chesterton, author of The Man Who Was Thursday”
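
The slide's figure is not reproduced in the transcript; roughly, the contrast looks like this (the bracketing below is an illustrative assumption, not the slide's exact analysis):

    Chunking (flat, two levels):
      [NP G.K. Chesterton] , [NP author] of [NP The Man Who Was Thursday]
    Full parsing (nested structure):
      (NP (NP G.K. Chesterton) , (NP (NP author)
          (PP (IN of) (NP The Man Who Was Thursday))))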

  5. Representations for Chunks
  • IOB tags: Inside, Outside, and Begin
  • Why do we need a Begin tag?
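
A minimal, hand-built illustration (not from the slides) of IOB tags, which also answers the question above: without a Begin tag, two adjacent chunks would be indistinguishable from one long chunk.

    # Each token carries (word, POS tag, IOB chunk tag).
    sentence = [
        ("We",     "PRP", "B-NP"),  # Begin a new NP chunk
        ("saw",    "VBD", "O"),     # Outside any chunk
        ("the",    "DT",  "B-NP"),  # Begin another NP chunk
        ("yellow", "JJ",  "I-NP"),  # Inside the same chunk
        ("dog",    "NN",  "I-NP"),
    ]
    # With only I and O tags, back-to-back chunks such as
    # "[the cat][the dog]" would merge into a single chunk.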

  6. Representations for Chunks
  • Trees
  • A chunk structure is a two-level tree that spans the entire text, containing both chunks and non-chunks
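
A sketch of the tree representation using the modern nltk API (the slides use nltk_lite, whose interface differs slightly):

    from nltk.tree import Tree

    # Two-level chunk structure: S spans the whole sentence;
    # NP subtrees are chunks, bare (word, tag) leaves are non-chunks.
    chunked = Tree("S", [
        Tree("NP", [("We", "PRP")]),
        ("saw", "VBD"),
        Tree("NP", [("the", "DT"), ("yellow", "JJ"), ("dog", "NN")]),
    ])
    chunked.pprint()  # (S (NP We/PRP) saw/VBD (NP the/DT yellow/JJ dog/NN))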

  7. CoNLL Collection
  • From the shared-task competition of the Conference on Natural Language Learning (CoNLL), 2000
  • Goal: create machine learning methods that improve on the chunking task

  8. CoNLL Collection
  • Data in IOB format, from the WSJ: each line is Word POS-tag IOB-tag
  • Training set: 8,936 sentences
  • Test set: 2,012 sentences
  • POS tags produced by the Brill tagger, using the Penn Treebank tagset
  • Evaluation measure: F-score = 2 * precision * recall / (recall + precision)
  • Baseline: pick the chunk tag most frequently associated with the POS tag (F = 77.07)
  • Best score in the contest: F = 94.13
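
The evaluation measure as a runnable check (the precision/recall pair below is hypothetical, just to exercise the formula; the F values in the slide are the task's reported scores):

    def f_score(precision, recall):
        """F-score: harmonic mean of precision and recall."""
        return 2 * precision * recall / (precision + recall)

    print(round(f_score(0.93, 0.90), 4))  # 0.9148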

  9. nltk_lite and CONLL2000
  Note that the raw view of the corpus hides the IOB format.
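
A modern-nltk sketch of the same idea (the slides' nltk_lite calls differ; the mapping onto today's conll2000 corpus reader is an assumption):

    import nltk
    from nltk.corpus import conll2000

    # nltk.download("conll2000")  # one-time corpus download

    # The raw word sequence hides the IOB annotation...
    print(conll2000.words("train.txt")[:8])

    # ...while chunked_sents() exposes it as two-level trees.
    train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])
    print(train_sents[0])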

  10. nltk_lite and CONLL2000

  11. nltk_lite and CONLL2000 pp() stands for “pretty print” and applies to the Tree data structure.

  12. nltk_lite chunks in treebank

  13. nltk_lite parses in treebank
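
For these two treebank slides, a modern-nltk approximation (assuming nltk_lite's treebank examples map onto these corpus readers):

    import nltk
    from nltk.corpus import treebank, treebank_chunk

    # nltk.download("treebank")  # one-time corpus download

    print(treebank_chunk.chunked_sents()[0])  # flat NP chunks
    print(treebank.parsed_sents()[0])         # full parse trees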

  14. Chunking with Regular Expressions
  • This time we write regexes over TAGS rather than words:
    • <DT><JJ>?<NN>
    • <NN.*>
    • <JJ|NN>+
  • Compile them with parse.ChunkRule():
    • rule = parse.ChunkRule('<DT|NN>+')
    • chunkparser = parse.RegexpChunk([rule], chunk_node='NP')
  • The resulting object is of type Tree
  • The top-level node is called S
    • This label can be changed via the third argument to RegexpChunk
  (A modern-nltk equivalent is sketched below.)
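
A sketch of the same rule in the modern nltk API, where RegexpParser replaces the nltk_lite ChunkRule/RegexpChunk pair (the mapping is an assumption; the example sentence is made up):

    import nltk

    # Same pattern as the slide's rule; the root label defaults to S
    # and can be changed with RegexpParser's root_label argument.
    chunkparser = nltk.RegexpParser("NP: {<DT|NN>+}")

    sentence = [("the", "DT"), ("dog", "NN"), ("barked", "VBD"),
                ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    print(chunkparser.parse(sentence))
    # (S (NP the/DT dog/NN) barked/VBD at/IN (NP the/DT cat/NN))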

  15. Chunking with Regular Expressions

  16. Chunking with Regular Expressions • Rule application is sensitive to order
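
A minimal sketch of the order effect: rules within a grammar apply in sequence, each seeing the previous rule's output (the grammar is a made-up illustration):

    import nltk

    tagged = [("the", "DT"), ("dog", "NN")]

    # Chunking lone nouns first leaves no DT+NN sequence for rule 2:
    g1 = nltk.RegexpParser(r"""
      NP:
        {<NN>}       # rule 1: chunk bare nouns
        {<DT><NN>}   # rule 2: too late, "dog" is already chunked
    """)
    # Reversing the order chunks the determiner and noun together:
    g2 = nltk.RegexpParser(r"""
      NP:
        {<DT><NN>}
        {<NN>}
    """)
    print(g1.parse(tagged))  # (S the/DT (NP dog/NN))
    print(g2.parse(tagged))  # (S (NP the/DT dog/NN))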

  17. Chinking
  • Specify what does not go into a chunk.
  • Analogous to defining punctuation as whatever is neither alphanumeric nor whitespace.
  • Can be more difficult to think about.
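
A minimal chinking sketch in the modern nltk API: first chunk everything, then excise ("chink") material back out; }...{ marks the chink pattern. The sentence is hypothetical:

    import nltk

    grammar = r"""
      NP:
        {<.*>+}        # first chunk the whole sentence into one NP
        }<VBD|IN>+{    # then chink out verbs and prepositions
    """
    cp = nltk.RegexpParser(grammar)
    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
                ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
                ("the", "DT"), ("cat", "NN")]
    print(cp.parse(sentence))
    # (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))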

  18. Practice Regexp Chunking
  • Write rules that produce this kind of chunking:

  19. Next Time
  • Evaluating Shallow Parsing
  • Begin Text Summarization
