School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group
Shallow Parsing • Break text up into non-overlapping contiguous subsets of tokens. • Also called chunking, partial parsing, light parsing. • What is it useful for? – semantic patterns • Finding key “meaning-elements”: Named Entity Recognition • people, locations, organizations • Studying linguistic patterns, e.g. semantic patterns of verbs • gave NP • gave up NP in NP • gave NP NP • gave NP to NP • Can ignore complex structure when not relevant
A Relationship between Segmenting and Labeling • Tokenization segments the text • Tagging labels the text • Shallow parsing does both simultaneously.
Chunking vs. Full Syntactic Parsing • “G.K. Chesterton, author of The Man Who Was Thursday”
Representations for Chunks • IOB tags • Inside, outside, and begin • In English, the start of a phrase is often marked by a function-word
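To make the IOB idea concrete, here is a minimal Python sketch that turns chunk spans into per-token tags. The function name, the example sentence and the `B-NP` / `I-NP` / `O` tag spelling are illustrative assumptions, not taken from the slides.

```python
def chunks_to_iob(tokens, chunks):
    """Label each token as Begin, Inside or Outside a chunk.

    tokens: list of words
    chunks: list of (start, end, label) spans, end exclusive (hypothetical format)
    """
    tags = ["O"] * len(tokens)            # everything is Outside by default
    for start, end, label in chunks:
        tags[start] = "B-" + label        # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label        # remaining tokens of the chunk
    return list(zip(tokens, tags))


tokens = ["We", "saw", "the", "yellow", "dog"]
np_chunks = [(0, 1, "NP"), (2, 5, "NP")]
iob = chunks_to_iob(tokens, np_chunks)
for word, tag in iob:
    print(word, tag)
```

Because chunks do not overlap, one tag per token is enough to recover every chunk boundary.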
Representations for Chunks • Trees • Chunk structure is a two-level tree that spans the entire text, containing both chunks and non-chunks
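The two-level tree can be sketched with plain Python tuples and lists: a top node whose children are either chunks or bare non-chunk tokens. The node layout `("S", [...])` and the function name are assumptions made for illustration.

```python
def iob_to_tree(tagged):
    """Build a two-level tree from (word, iob_tag) pairs.

    Returns ("S", children); each child is either a bare word (a non-chunk)
    or a (label, [words]) pair (a chunk).
    """
    children = []
    current = None                         # the chunk being built, if any
    for word, tag in tagged:
        if tag == "O":
            current = None
            children.append(word)
        elif tag.startswith("B-") or current is None or current[0] != tag[2:]:
            current = (tag[2:], [word])    # open a new chunk
            children.append(current)
        else:
            current[1].append(word)        # continue the open chunk
    return ("S", children)


tagged = [("We", "B-NP"), ("saw", "O"), ("the", "B-NP"),
          ("yellow", "I-NP"), ("dog", "I-NP")]
tree = iob_to_tree(tagged)
print(tree)
```

The tree never gets deeper than two levels, which is what separates a chunk structure from a full parse.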
CoNLL Corpus: training data for Machine Learning of chunking • From the shared task of the Conference on Computational Natural Language Learning (CoNLL) in 2000 • Goal: create machine learning methods to improve on the chunking task
CoNLL Corpus • Data in IOB format from the Wall Street Journal (WSJ): • Word POS-tag IOB-tag • Training set: 8936 sentences • Test set: 2012 sentences • Tags from the Brill tagger • Penn Treebank Tags • Evaluation measure: F-score • F = 2*precision*recall / (precision+recall) • Baseline: select the chunk tag that is most frequently associated with the POS tag, F = 77.07 • Best score in the contest was F = 94.13
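The baseline and the F-score above can be sketched in a few lines of Python. The toy sentences, the function names and the fallback to `O` for unseen POS tags are assumptions; precision and recall are counted over whole chunks, so a chunk is correct only if both its boundaries and its label match.

```python
from collections import Counter, defaultdict


def train_baseline(sentences):
    """Map each POS tag to the chunk tag it most often carries in training."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for _word, pos, iob in sentence:
            counts[pos][iob] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def chunk_spans(iob_tags):
    """Return the set of (start, end, label) chunks encoded by an IOB sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(iob_tags) + ["O"]):   # sentinel closes the last chunk
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def f_score(gold_tags, predicted_tags):
    gold, predicted = chunk_spans(gold_tags), chunk_spans(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data in (word, POS-tag, IOB-tag) form; not taken from the real corpus.
train = [
    [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "B-VP")],
    [("a", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "B-VP")],
    [("rain", "NN", "B-NP"), ("fell", "VBD", "B-VP")],
]
test = [("the", "DT", "B-NP"), ("old", "JJ", "I-NP"),
        ("man", "NN", "I-NP"), ("sat", "VBD", "B-VP")]

model = train_baseline(train)
predicted = [model.get(pos, "O") for _word, pos, _iob in test]  # unseen POS -> O
gold = [iob for _word, _pos, iob in test]
score = f_score(gold, predicted)
print(predicted, round(score, 2))
```

On this toy sentence the unseen `JJ` tag splits the noun phrase, so only one of three predicted chunks is correct (precision 1/3, recall 1/2, F = 0.4).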
Chunking with Regular Expressions • This time we write regexes over TAGS rather than characters • <DT><JJ>?<NN> • <NN.*> • <JJ|NN>+ • Compile them with parse.ChunkRule() • rule = parse.ChunkRule('<DT|NN>+') • chunkparser = parse.RegexpChunk([rule], chunk_node='NP') • Resulting object is a (sort-of) parse tree • Top-level node called S • Chunks are labelled NP
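The idea of matching a regex against tags rather than characters can be illustrated without any toolkit. The sketch below is a stand-in written with Python's `re` module, not the `parse.ChunkRule` interface named on the slide; the helper names and the example sentence are assumptions.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    """Translate a tag pattern such as '<DT><JJ>?<NN>' into a character regex."""
    def one_tag(match):
        # Inside <...>, "." means any character except an angle bracket,
        # so <NN.*> cannot run past the end of a single tag.
        body = match.group(1).replace(".", "[^<>]")
        return "(?:<(?:" + body + ")>)"
    return re.compile(re.sub(r"<([^<>]+)>", one_tag, tag_pattern))


def regexp_chunk(tagged, tag_pattern, label="NP"):
    """Group every tag sequence that matches tag_pattern into a chunk."""
    tag_string = "".join("<%s>" % tag for _word, tag in tagged)
    regex = tag_pattern_to_regex(tag_pattern)
    children, copied = [], 0               # copied = tokens already emitted
    for m in regex.finditer(tag_string):
        if m.start() == m.end():
            continue                        # ignore empty matches
        # Convert character offsets back to token indices.
        start = tag_string.count("<", 0, m.start())
        end = tag_string.count("<", 0, m.end())
        children.extend(tagged[copied:start])
        children.append((label, tagged[start:end]))
        copied = end
    children.extend(tagged[copied:])
    return ("S", children)


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
chunked = regexp_chunk(sentence, "<DT><JJ>?<NN>")
print(chunked)
```

As on the slide, the result is a shallow tree with a top-level S node and NP chunks beneath it.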
Chunking with Regular Expressions • Rule application is sensitive to order
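A small sketch of why order matters, assuming that a token claimed by an earlier rule is no longer available to later rules. The rules here are written directly as Python regexes over a `<TAG>` string; the function name and the three-tag example are illustrative.

```python
import re


def apply_rules(tags, rules):
    """Apply chunk rules in order and return the sorted (start, end) spans."""
    free = [True] * len(tags)              # tokens not yet inside any chunk
    spans = []
    for rule in rules:
        i = 0
        while i < len(tags):
            if not free[i]:
                i += 1
                continue
            j = i
            while j < len(tags) and free[j]:
                j += 1
            # tags[i:j] is a maximal run of unchunked tokens; match inside it only.
            run = "".join("<%s>" % t for t in tags[i:j])
            for m in re.finditer(rule, run):
                if m.start() == m.end():
                    continue
                start = i + run.count("<", 0, m.start())
                end = i + run.count("<", 0, m.end())
                spans.append((start, end))
                free[start:end] = [False] * (end - start)
            i = j
    return sorted(spans)


tags = ["DT", "NN", "NN"]                  # e.g. "the cat food"
det_noun = r"<DT>(<JJ>)*<NN>"              # determiner, adjectives, one noun
noun_run = r"(<NN>)+"                      # any run of nouns

first = apply_rules(tags, [det_noun, noun_run])
second = apply_rules(tags, [noun_run, det_noun])
print(first)
print(second)
```

With `det_noun` first, the determiner takes the first noun and the second noun is chunked alone; with `noun_run` first, both nouns are grouped and the determiner is left outside any chunk.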
Chinking • Specify what does not go into a chunk. • Similar to defining punctuation negatively, as whatever is not alphanumeric or whitespace. • Can be more difficult to think about.
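A minimal sketch of chinking, assuming we start with the whole sentence as one chunk and cut out every token whose tag is listed as a chink. The function name, the tag set and the sentence are illustrative choices.

```python
def chink(tagged, chink_tags):
    """Treat the sentence as one chunk, then remove tokens tagged as chinks.

    Returns the list of chunks that remain, each a list of words.
    """
    chunks, current = [], []
    for word, tag in tagged:
        if tag in chink_tags:
            if current:                    # a chink closes the open chunk
                chunks.append(current)
                current = []
        else:
            current.append(word)
    if current:
        chunks.append(current)
    return chunks


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
remaining = chink(sentence, {"VBD", "IN"})
print(remaining)
```

Here the chunks are defined only by what was excluded: nothing in the code says what a noun phrase looks like.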
Simple chink-chunk approach: function v content word-class • Regular expressions for chunks and chinks CAN get complex • BUT the whole point is to be simpler than full parsing! • SO: use a simple model which works “reasonably well” • (then tidy up afterwards…) • Chunk = nominal content-word (noun) • Chink = others (verb, pronoun, determiner, preposition, conjunction) (+adjective, adverb as a borderline category)
Example • Fruit flies like a banana • fruit\N flies\N like\V a\A banana\N • [fruit flies] like a [banana] • [S [NP fruit\N flies\N NP] [VP like\V [NP a\A banana\N NP] VP] S]
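The chink-chunk rule (nouns are chunks, everything else is a chink) is simple enough to sketch directly. The function name is hypothetical; the `word\TAG` input format follows the example above.

```python
def chink_chunk(tagged_text, chunk_tags=("N",)):
    """Bracket maximal runs of chunk-tagged words; leave chinks unbracketed."""
    out, inside = [], []
    for item in tagged_text.split():
        word, tag = item.rsplit("\\", 1)   # split "fruit\N" into word and tag
        if tag in chunk_tags:
            inside.append(word)
        else:
            if inside:                     # a chink ends the current chunk
                out.append("[" + " ".join(inside) + "]")
                inside = []
            out.append(word)
    if inside:
        out.append("[" + " ".join(inside) + "]")
    return " ".join(out)


result = chink_chunk(r"fruit\N flies\N like\V a\A banana\N")
print(result)   # [fruit flies] like a [banana]
```

The chunker never looks beyond the PoS tags, so its output is only as good as the tagging it is given.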
An alternative parse • This sentence is grammatically ambiguous: • Fruit flies like a banana • fruit\N flies\N like\V a\A banana\N → [fruit flies] like a [banana] • fruit\N flies\V like\I a\A banana\N → [fruit] flies like a [banana] • cf. “bank robbers like a chase” v “bread bakes in an oven” • [S [NP fruit\N NP] [VP flies\V [PP like\I [NP a\A banana\N NP] PP] VP] S]
Ambiguity leads to more rules • fruit\N flies\N like\V a\A banana\N → [fruit flies] like a [banana] • fruit\N flies\V like\I a\A banana\N → [fruit] flies like a [banana] • BUT what about: Time flies like an arrow - time\N, time\V • time\N flies\N like\V an\A arrow\N → [time flies] like an [arrow] • time\N flies\V like\I an\A arrow\N → [time] flies like an [arrow] • time\V flies\N like\I an\A arrow\N → time [flies] like an [arrow] • 3rd PoS-tagging gives ambiguous parse
Chunking can predict prosodic breaks • http://www.acm.org/crossroads/ • An Approach for Detecting Prosodic Phrase Boundaries in Spoken English by Claire Brierley and Eric Atwell
Summary • Shallow parsing is useful for: • Entity recognition • people, locations, organizations • Studying linguistic patterns • gave NP • gave up NP in NP • gave NP NP • gave NP to NP • Prosodic phrase breaks – pauses in speech • Can ignore complex structure when not relevant • Chink-chunk approach: “quick-and-dirty” chunking, content v function PoS • Chink-chunk parsing is simpler than context-free grammar parsing!