Stemming, tagging and chunking

Text analysis short of parsing

Word-based analysis • Whereas parsing gives a full syntactic analysis, sometimes it is sufficient to have less detailed information • In many applications we are more interested in words • But what do we mean by “word”?

Words • Naïve definition of a word: a sequence of characters separated from other such sequences by spaces • But punctuation marks are usually attached to words • And not all punctuation marks are word-delimiters, e.g. possessive apostrophe, hyphen
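
The difference between the naïve definition and a slightly more careful one can be sketched as follows. This is a minimal illustration, not a full tokenizer: the pattern `WORD_PATTERN` and both function names are assumptions made up for this example. The pattern keeps word-internal apostrophes and hyphens but splits off other punctuation.

```python
import re

def split_on_spaces(text):
    """Naive definition: a word is whatever sits between spaces."""
    return text.split()

# Hypothetical pattern: runs of letters/digits, optionally joined by an
# internal apostrophe or hyphen; any other non-space character is a token
# of its own.
WORD_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

def split_words(text):
    """Detach punctuation, but keep possessives and hyphenated words whole."""
    return WORD_PATTERN.findall(text)

sentence = "The dog's well-known owner left, didn't he?"
print(split_on_spaces(sentence))  # "left," and "he?" keep their punctuation
print(split_words(sentence))      # "," and "?" become separate tokens
```

Whether "well-known" should count as one word or two remains a design decision; here it is kept as one.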

Words • We may want to treat hyphenated and compound words as one word, or two • By the same token we may want to treat word sequences as if they were a single word • In addition, a given “word” can have different word forms, depending on inflections, or even conventions of orthography

Tokenization • The simplest form of analysis is to reduce different word forms into tokens • Also called “normalization” • For example, if you want to count how many times a given word occurs in a text • Or you want to search for texts containing certain words (e.g. Google)
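
As a rough sketch of the counting use case, the snippet below normalizes tokens before counting them. The normalization used here (lower-casing only) is an assumption chosen for simplicity, and `normalize` and `count_words` are hypothetical names; a real system would normally do more.

```python
import re
from collections import Counter

def normalize(token):
    """Hypothetical normalization: lower-casing only."""
    return token.lower()

def count_words(text):
    """Count how often each normalized word form occurs in the text."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(normalize(t) for t in tokens)

counts = count_words("The cat saw the other cat. THE END")
print(counts["the"])  # "The", "the" and "THE" are counted together
print(counts["cat"])
```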

Stemming • Stemming is the particular case of tokenization which reduces inflected forms to a single base form or stem • (Recall our discussion of stem ~ base form ~ dictionary form ~ citation form) • Stemming algorithms are basic string-handling algorithms, which depend on rules which identify affixes that can be stripped

Stemming • As we know, morphology can be less than straightforward, so a stemmer has to “know” about rules such as consonant doubling, y→i, etc. • Also has to know about irregularities • And to avoid overgeneration • For this it probably needs a dictionary
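
A toy suffix-stripper makes these points concrete. It is a deliberately incomplete sketch, not a real stemming algorithm: the word lists `IRREGULAR` and `PROTECTED` stand in for the dictionary, and all names and thresholds are assumptions for illustration. It does not handle, for example, the silent e in "hoped".

```python
# Stand-in for a dictionary of irregular forms (hypothetical entries).
IRREGULAR = {"ran": "run", "went": "go", "mice": "mouse"}
# Words that merely look inflected; listing them guards against overgeneration.
PROTECTED = {"sing", "ring", "red", "bed", "news", "this"}
VOWELS = "aeiou"

def toy_stem(word):
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in PROTECTED:
        return word
    # y -> i alternation: "carries", "carried" -> "carry"
    for suffix in ("ies", "ied"):
        if word.endswith(suffix) and len(word) > 4:
            return word[:-3] + "y"
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undo consonant doubling: "stopp" -> "stop" (but keep "fall")
            if stem[-1] == stem[-2] and stem[-1] not in VOWELS + "lsz":
                stem = stem[:-1]
            return stem
    # Plural or third-person -s, but not words ending in -ss
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

for w in ["stopped", "running", "carries", "falling", "went", "sing", "glass"]:
    print(w, "->", toy_stem(w))
```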

Stemming • The best-known stemming algorithm for English is Martin Porter’s stemmer, published in 1980 • Original use was in information retrieval • In computational terms, it is really just a sophisticated string-handling algorithm • In linguistic terms, it is interesting in that it captures generalisations about English morphology
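
To give a flavour of the string handling involved, here is a sketch of just the first rule group of Porter's algorithm (step 1a in his paper), which deals with plural -s. The function name is made up for this example, and the later steps, which depend on a measure of the stem's consonant-vowel structure, are omitted.

```python
def porter_step_1a(word):
    """Step 1a only: the first matching suffix rule applies, longest first."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS : caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I  : ponies -> poni
    if word.endswith("ss"):
        return word        # SS   -> SS : caress -> caress
    if word.endswith("s"):
        return word[:-1]   # S    ->    : cats -> cat
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note that stems such as "poni" are not dictionary words; for retrieval it only matters that related forms end up with the same stem.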

Word categories • A.k.a. parts of speech (POSs) • Important and useful to identify words by their POS • To distinguish homonyms • To enable more general word searches • POS familiar (?) from school and/or language learning (noun, verb, adjective, etc.)

Word categories • Recall that we distinguished • Open-class categories (noun, verb, adjective, adverb) • Closed-class categories (preposition, determiner, pronoun, conjunction, …) • While the big four are fairly clear-cut, it is less obvious exactly what and how many closed-class categories there may be

POS tagging • Labelling words for POS can be done by dictionary lookup and/or some sort of process • Identifying POS can be seen as a prerequisite to parsing, and/or a result of morphological analysis in its own right • However, there are some differences: • Parsers often work with the simplest set of word categories, subcategorized by feature (or attribute-value) schemes • Indeed the parsing procedure may contribute to the disambiguation of homonyms
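
Dictionary lookup with a fallback process might be sketched like this. Everything here is an assumption for illustration: the mini-lexicon, the tag names (which do not follow any standard tagset) and the crude suffix heuristics in `guess_tag`.

```python
# Hypothetical mini-lexicon; tag names are illustrative only.
LEXICON = {
    "the": "DET", "a": "DET", "on": "PREP",
    "dog": "NOUN", "book": "NOUN", "table": "NOUN",
    "saw": "VERB",  # homonym: lookup alone cannot tell the verb from the noun
}

def guess_tag(word):
    """Fallback when lookup fails: crude suffix and capitalization cues."""
    if word.endswith("ly"):
        return "ADV"
    if word.endswith(("ing", "ed")):
        return "VERB"
    if word[:1].isupper():
        return "PROPN"
    return "NOUN"

def tag(words):
    """Tag each word by lexicon lookup, falling back to guess_tag."""
    return [(w, LEXICON.get(w.lower(), guess_tag(w))) for w in words]

print(tag("John quickly opened the book".split()))
```

Because the lexicon stores a single tag per word, a homonym such as "saw" is always tagged the same way; resolving it needs context, which is where parsing or a more elaborate tagger comes in.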

POS tagging • POS tagging, per se, aims to identify word-category information somewhat independently of sentence structure … • … and typically uses rather different means • POS tags are generally shown as labels on words: John/NPN saw/VB the/AT book/NCN on/PRP the/AT table/NCN ./PNC • We’ll return to tagging in detail, but first let’s mention …
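
The word/TAG notation is easy to read back into a program. The sketch below assumes every token contains a slash and that tokens are separated by spaces; the function names are made up for this example.

```python
def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST slash so that words containing "/" survive.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

def words_with_tag(pairs, wanted):
    """A more general word search: all words carrying a given tag."""
    return [word for word, tag in pairs if tag == wanted]

pairs = parse_tagged("John/NPN saw/VB the/AT book/NCN ./PNC")
print(pairs)
print(words_with_tag(pairs, "AT"))
```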

Chunking • Like parsing except that it aims only to identify major constituents • And does not attempt to identify structure, either internal (within the chunk) or external (between chunks) • Chunking will leave some parts of the text unanalysed • Example: [NP [NP G.K. Chesterton ], [NP [NP author ] of [NP [NP The Man ] who was [NP Thursday ] ] ] ]

Chunking • Chunks can be represented like tags or like parse trees
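
The two styles of representation can be sketched side by side. The tag-style output below uses the common IOB convention (B- for the first word of a chunk, I- for words inside it, O for words outside any chunk); that choice is an assumption, since no particular scheme is named here. The input data structure and function names are likewise hypothetical.

```python
def to_iob(chunked):
    """Tag-like view. Input: plain words (outside any chunk) mixed with
    (label, [words]) tuples (chunks)."""
    rows = []
    for item in chunked:
        if isinstance(item, str):
            rows.append((item, "O"))
        else:
            label, words = item
            for i, word in enumerate(words):
                rows.append((word, ("B-" if i == 0 else "I-") + label))
    return rows

def to_brackets(chunked):
    """Tree-like view: labelled brackets around each chunk."""
    parts = []
    for item in chunked:
        if isinstance(item, str):
            parts.append(item)
        else:
            label, words = item
            parts.append("[" + label + " " + " ".join(words) + " ]")
    return " ".join(parts)

sentence = [("NP", ["the", "cat"]), "sat", "on", ("NP", ["the", "mat"])]
print(to_iob(sentence))
print(to_brackets(sentence))  # [NP the cat ] sat on [NP the mat ]
```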

Chunk parser • A “chunk” is a continuous non-overlapping sequence of words • Chunker finds such sequences, often using tagged text as input • Chunk rules can be as simple as regular expressions • Chunkers can allow embedding, but typically only to a shallow level • Another example: (S: (NP: I) saw (NP: the big dog) . )
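
A regular-expression chunker over tagged input can be sketched in a few lines. This is one possible design, not a reference implementation: the rule `NP_RULE` (optional determiner, any adjectives, one or more nouns, or a lone pronoun) and the Penn-Treebank-style tag names are assumptions for the example. The tags are written out as a string such as `<DT><JJ><NN>`, the rule is matched against that string, and each match is mapped back to word positions.

```python
import re

# Hypothetical NP rule over tags, not over words.
NP_RULE = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+|<PRP>")

def chunk(tagged, rule=NP_RULE, label="NP"):
    """Group matching tag sequences into (label, [words]) chunks;
    words outside any match are left unanalysed."""
    tag_string = "".join("<" + tag + ">" for _, tag in tagged)
    result, pos = [], 0
    for m in rule.finditer(tag_string):
        # Number of "<" before an offset = index of the token starting there.
        start = tag_string.count("<", 0, m.start())
        end = tag_string.count("<", 0, m.end())
        result.extend(word for word, _ in tagged[pos:start])
        result.append((label, [word for word, _ in tagged[start:end]]))
        pos = end
    result.extend(word for word, _ in tagged[pos:])
    return result

def show(chunked):
    """Print in the (S: (NP: ...) ...) notation used in the example."""
    parts = ["(%s: %s)" % (item[0], " ".join(item[1]))
             if isinstance(item, tuple) else item for item in chunked]
    return "(S: " + " ".join(parts) + " )"

tagged = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"),
          ("big", "JJ"), ("dog", "NN"), (".", ".")]
print(show(chunk(tagged)))  # (S: (NP: I) saw (NP: the big dog) . )
```

The rule produces flat, non-overlapping chunks only; allowing embedding would need a second pass or a richer rule format.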
