Human Language Technology Part of Speech (POS) Tagging II Rule-based Tagging
Acknowledgment • Most slides taken from Bonnie Dorr’s course notes: www.umiacs.umd.edu/~bonnie/courses/cmsc723-03 • Jurafsky & Martin Chapter 5 CLINT Lecture IV
Bibliography • A. Voutilainen, Morphological disambiguation, in Karlsson, Voutilainen, Heikkila, Anttila (eds) Constraint Grammar, pp. 165-284, Mouton de Gruyter, 1995. See [e-book]
EngCG Rule-Based Tagger (Voutilainen 1995) • Rules based on English Constraint Grammar • Two-stage design • Uses the ENGTWOL lexicon • Hand-written disambiguation rules
ENGTWOL Lexicon • Based on two-level morphology of English (hence the name) • 56,000 entries for English word stems • Each entry annotated with morphological and syntactic features
Sample ENGTWOL Lexicon
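The lexicon sample itself is not reproduced here, but its shape can be sketched in Python. This is an illustrative stand-in, not the actual ENGTWOL entry format: each surface form maps to its candidate readings, a lemma plus a string of morphological and syntactic features (the readings shown are taken from the "the tables" and Pavlov examples later in these slides).

```python
# Simplified stand-in for ENGTWOL-style lexicon entries (illustrative,
# not the real file format): surface form -> list of (lemma, features).
ENGTWOL_SAMPLE = {
    "the":    [("the", "<Def> DET CENTRAL ART SG/PL")],
    "tables": [("table", "N NOM PL"),
               ("table", "<SVO> V PRES SG3 VFIN")],
    "shown":  [("show", "PCP2 SVOO SVO SV")],
}

def lookup(word):
    """Return all candidate readings for a surface form ([] if unknown)."""
    return ENGTWOL_SAMPLE.get(word.lower(), [])
```

An ambiguous form such as "tables" simply carries several readings; disambiguation is left entirely to the constraints applied later.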
Examples of constraints (informal) • Discard all verb readings if to the left there is an unambiguous determiner, and between that determiner and the ambiguous word itself, there are no nominals (nouns, abbreviations etc.). • Discard all finite verb readings if the immediately preceding word is to. • Discard all subjunctive readings if to the left, there are no instances of the subordinating conjunction that or lest. • The first constraint would discard the verb reading (next slide) • There are about 1,100 constraints altogether
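The second informal constraint above can be sketched as a function over a sentence of (word, readings) pairs. This is a toy simplification under assumed data structures, not EngCG's actual rule machinery: a reading is just a feature string, and "VFIN" marks a finite verb reading as in the ENGTWOL examples.

```python
def discard_finite_verbs_after_to(sentence):
    """Sketch of: discard all finite verb (VFIN) readings if the
    immediately preceding word is 'to'. sentence is a list of
    (word, readings) pairs; readings are feature strings."""
    result = []
    for i, (word, readings) in enumerate(sentence):
        if i > 0 and sentence[i - 1][0].lower() == "to":
            kept = [r for r in readings if "VFIN" not in r]
            readings = kept or readings  # never delete the last reading
        result.append((word, readings))
    return result

# After 'to', the finite-verb reading of 'table' is discarded, while the
# noun and infinitive readings survive.
sent = [("to", ["INFMARK>", "PREP"]),
        ("table", ["N NOM SG", "V PRES SG3 VFIN", "V INF"])]
```

Note the defensive `kept or readings`: like EngCG, a constraint should reduce ambiguity but never leave a word with no reading at all.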
Actual Constraint Syntax Given input: “that” If (+1 A/ADV/QUANT) (+2 SENT-LIM) (NOT -1 SVOC/A) Then eliminate non-ADV tags Else eliminate ADV tag • This rule eliminates the adverbial sense of that, as in “it isn’t that odd”
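The rule above can be rendered as a small Python sketch, under assumed simplifications: candidate tags are a set per token, sentence limits are approximated by punctuation tokens, and `SVOC/A` is a feature on the preceding word's tag set. The names `apply_that_rule` and `SENT_LIM` are illustrative, not from EngCG.

```python
SENT_LIM = {".", "!", "?", ";"}  # crude stand-in for SENT-LIM

def apply_that_rule(words, cats, i):
    """Apply the 'that' rule at position i. words is the token list,
    cats a parallel list of candidate-tag sets; mutates cats[i]."""
    nxt = cats[i + 1] if i + 1 < len(cats) else set()     # +1 context
    at_lim = i + 2 < len(words) and words[i + 2] in SENT_LIM  # +2 context
    prev = cats[i - 1] if i > 0 else set()                # -1 context
    if nxt & {"A", "ADV", "QUANT"} and at_lim and "SVOC/A" not in prev:
        cats[i] = {"ADV"}           # Then: eliminate non-ADV tags
    else:
        cats[i].discard("ADV")      # Else: eliminate the ADV tag

# "it is n't that odd ."  -- here 'that' is an intensifying adverb
words = ["it", "is", "n't", "that", "odd", "."]
cats = [{"PRON"}, {"V"}, {"NEG"},
        {"ADV", "DET", "PRON", "CS"}, {"A"}, {"PUNCT"}]
apply_that_rule(words, cats, 3)
```

In the example, the word after "that" is an adjective and the next token ends the clause, so the Then-branch fires and only the ADV reading survives.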
ENGCG Tagger • Stage 1: Run words through the morphological analyzer to get all parts of speech. E.g. for the phrase “the tables”, we get the following output:
"<the>" "the" <Def> DET CENTRAL ART SG/PL
"<tables>" "table" N NOM PL
           "table" <SVO> V PRES SG3 VFIN
• Stage 2: Apply constraints to rule out incorrect POSs
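The two-stage design can be sketched end to end: look up every reading, then apply constraints repeatedly until no more readings are removed. The lexicon, the `tag` driver, and the `no_verb_after_det` constraint (a strict-adjacency simplification of the first informal constraint from the earlier slide) are all illustrative assumptions, not EngCG code.

```python
def tag(words, lexicon, constraints):
    """Two-stage sketch: (1) look up all readings in the lexicon,
    (2) apply constraints until they stop removing readings."""
    readings = [list(lexicon.get(w.lower(), [("?", "UNKNOWN")]))
                for w in words]
    changed = True
    while changed:
        before = sum(len(r) for r in readings)
        for constraint in constraints:
            readings = constraint(words, readings)
        changed = sum(len(r) for r in readings) < before
    return readings

# Toy lexicon in the spirit of the "the tables" example above.
LEX = {
    "the":    [("the", "DET CENTRAL ART SG/PL")],
    "tables": [("table", "N NOM PL"), ("table", "V PRES SG3 VFIN")],
}

def no_verb_after_det(words, readings):
    # First informal constraint, simplified to strict adjacency:
    # discard verb readings right after an unambiguous determiner.
    for i in range(1, len(words)):
        prev = readings[i - 1]
        if len(prev) == 1 and "DET" in prev[0][1]:
            kept = [r for r in readings[i] if not r[1].startswith("V ")]
            readings[i] = kept or readings[i]
    return readings

result = tag(["the", "tables"], LEX, [no_verb_after_det])
```

Running this on "the tables" discards the finite-verb reading of "tables" because "the" is an unambiguous determiner, leaving only the noun reading.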
Example
WORD        TAGS
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS (subord. conj.)
salivation  N NOM SG
Performance • Tested on examples from the Wall Street Journal, the Brown Corpus, and the Lancaster-Oslo-Bergen Corpus • After applying the rules, 93-97% of all words are fully disambiguated, and 99.7% of all words retain the correct reading. • At the time, this was superior performance to other taggers • However, one should not discount the amount of effort needed to create this system