Colloquia Linguistica

Colloquia Linguistica Part II: The Development of Automated Syntactic Taggers
Leif Grönqvist
Göteborg University

Overview
• Some basic things about corpora (quick)
  • What is a corpus
  • What can we do with it
• Part-of-speech tagging (slower)
  • What is the problem
  • Some common approaches
  • A rule-based tagger
  • A statistical tagger
• Corpus tools
  • Different tools
  • Demonstration of Multitool

What is a corpus for a computational linguist?
• Various properties are important, but the word ‘corpus’ is just Latin for ‘body’
• These properties should be considered:
  • Representativeness
  • Size
  • Form (annotation standard)
  • Standard reference

Representativeness
• A corpus used for analyzing spoken Swedish should ideally contain all utterances of Swedish ever spoken
• But this is impossible, so there are at least two strategies, depending on purpose:
  • Try to collect various dialogue types in sizes proportional to their share of the “complete corpus”
  • Collect large enough portions of each type to make sure all wanted phenomena are found
• Regardless of which strategy you use, it is important to select the samples from each type carefully, preferably at random

Corpus size: how big should it be?
• Depends on purpose!
• Some strategies:
  • Monitor corpus: as big as possible
    • Bank of English > 500 million tokens
    • Used for lexicography
  • Finite size, big enough for the current task
    • POS tagging, ~100 tags: 1 million tokens
    • Language model for automatic speech recognition: 100 million tokens

Machine-readable form
• Corpora have been used in linguistics for more than 100 years
• Now: a corpus => machine readable
• Annotations should be made in a way that makes extraction of wanted features as simple as possible

Standard reference (quick)
• Typical content of a research article: “We used the corpus XX, took 90% for training and 10% for testing with our new algorithm. We then got 97.2% correctness, which is a significant improvement over the old tagger at the 99% level”
• Exactly the same corpus XX must be available to other research groups

What to do with a corpus
• Check our linguistic intuition
• Annotate interesting features manually
• Use it for training of taggers and parsers
• Annotate new data automatically
• But be careful! A corpus is not the complete language

Text encoding
• Various encoding schemes around
• Text based
  • Human and machine readable
  • Could be difficult to check for validity
• Word processor based
  • Only human readable
  • Rarely used in computational linguistics
• XML/SGML based
  • Machine readable
  • May be transformed to human-readable form using XSLT
  • Formalisms and tools are free (well, more or less free)
  • Limitations of XML may be annoying sometimes

Some important properties (skip)
Important properties according to Geoffrey Leech:
• Possibility to extract the original corpus
• Possibility to separate the annotations
• Based on well-defined guidelines
• Make clear how the annotations were done
• Make clear that there may be errors in the corpus
• Widely agreed, theory-neutral annotation scheme
• No annotation scheme is the a priori standard scheme

Some annotation standards
TEI (Text Encoding Initiative)
• Huge standard for all types of texts and corpora, developed by the TEI Consortium since 1987
• SGML based in the beginning, but now XML
(X)CES (XML Corpus Encoding Standard)
• Highly inspired by the TEI
• Not as complicated, but only in beta version
ISLE (International Standards for Language Engineering)
• Developed by three working groups (lexicon, multimodality and evaluation)
CDIF (Corpus Document Interchange Format)
• Used by the British National Corpus
• A lot in common with the TEI

Some typical results directly extracted from a corpus
• Concordances (KWIC)
• Frequency lists
• N-gram statistics
• Probabilities

Concordances
rer, matematiker och dataloger i Göteborgsregionen
, bandavskrifter och dataloggar, skriver Feldt.|Si
, bandavskrifter och dataloggar.|Men den nya Palme
 Ahlberg, forskare i datalogi på Chalmers.|Av PER-
 Ahlberg, forskare i datalogi på Chalmers.|SIDAN 4
und blir professor i datalogi vid Umeå universitet
und blir professor i datalogi vid Umeå universitet
a fyra olika kurser: datalogi, pedagogik, teknik o
ybjer och Jan Smith, datalogi.|Sektionen för maski
atorer eller pluggar datalogi.|Så på fritiden leke
r det gäller trådlös datalogistik, nu kommer det ö
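A keyword-in-context listing like the one above can be produced with a few lines of code. The following is a minimal sketch, not the tool used for the slide: the function name `kwic`, the prefix matching and the context width are assumptions made for illustration, and the English sentence is taken from the tagging example later in the talk.

```python
def kwic(tokens, keyword_prefix, width=3):
    """Return (left context, keyword, right context) for every token
    that starts with keyword_prefix (hypothetical helper)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower().startswith(keyword_prefix):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


tokens = "you have to book a chair on deck and book a table".split()
for left, keyword, right in kwic(tokens, "book"):
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>20}  {keyword}  {right}")
```

Matching on a prefix rather than a whole word is what lets a single query find "dataloger", "datalogi" and "datalogistik" in the listing above.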

Frequency lists
77810 det     90304 .       74556 de
36843 är      56075 ,       48104 ja
35471 och     40438 och     39947 e
32404 ja      33978 i       34342 å
30439 att     26358 att     25694 så
28628 jag     25634 det     25639 att
26059 så      21830 en      22378 va
19205 som     21333 som     19134 som
18681 inte    19743 på      18679 vi
18469 har     15754 är      18084 inte
18421 vi      14333 med     17611 på
17719 på      13837 för     17214 man
17377 man     13683 av      16870 i
17343 då      13547 jag     16846 då
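Counting a frequency list of this kind takes very little code once the corpus is tokenized. A minimal sketch using Python's standard `collections.Counter`; the function name `frequency_list` and the toy token sequence are assumptions for illustration.

```python
from collections import Counter


def frequency_list(tokens, n=None):
    """Return (token, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common(n)


toy_tokens = "det är det och det är".split()
for token, freq in frequency_list(toy_tokens):
    print(freq, token)
```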

N-gram statistics
42 i stället för att      3395 det är
36 för några år sedan     2913 för att
35 men det är inte        2451 det var
34 en stor del av         1560 att det
33 på samma sätt som      1351 är det
32 det var som om         1278 i en
31 att det är en          1174 att han
30 är en av de            1003 i den
30 men det var inte        966 som en
28 vad är det för          920 men det
28 det är svårt att        889 på en
27 det är som om           884 att jag
27 att det inte var        882 är en
26 för ett år sedan        882 med en
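N-gram counts are frequency lists over sequences of n adjacent tokens. A minimal sketch under the assumption that the corpus is already a flat list of tokens; the name `ngram_counts` is hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every sequence of n adjacent tokens."""
    # zip over n shifted copies of the token list yields the n-grams.
    return Counter(zip(*(tokens[i:] for i in range(n))))


toy_tokens = "det är det är en".split()
for gram, freq in ngram_counts(toy_tokens, 2).most_common():
    print(freq, " ".join(gram))
```

In a real corpus one would normally avoid counting n-grams that cross sentence or utterance boundaries; this sketch ignores that.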

Part-of-speech tagging
• We want to assign the right part of speech (just as an example) to each word in a corpus
• Input is a tokenized corpus
• The tagset is determined in advance
• The word types in the corpus have various properties in the training data
  • Some are unambiguous
  • Some are ambiguous (typically 2-7 POS each)
  • Some are unknown (not in the training data)

An example
Tagset: noun, verb, pron, art, infmrk, prep
In:  $A: you have to book a chair on deck
Out: pron verb infmrk verb art noun prep noun
• But “book” and “chair” may be either verb or noun: the tagger has to disambiguate!
• There are several approaches to this, all based on patterns and regularities in the language

Terms used in tagging
• Tagging: put the right label (i.e. word class) on each token
• Tagset: all possible labels (word classes)
• Tokenizing: divide the corpus into tokens (words, sentence boundaries)
• Training: find the rules or probabilities needed by the tagger

Various approaches
• Rule-based tagging
  • Constraint-based tagging (SweTwol, EngTwol by Lingsoft)
  • Transformation-based tagging (Eric Brill)
• Stochastic tagging (HMM)
  • Calculate the most probable tag sequence
  • Using maximum likelihood estimation
  • Or some bootstrap-based training

Constraint-based tagging
• Basic idea:
  • Assign all possible tags to each word
  • Remove tags according to a set of rules of the type: “if word+1 is an adj, adv or quantifier and the following is a sentence boundary and word-1 is not a verb like ‘consider’ then eliminate non-adv else eliminate adv.”
  • Continue until no rule is applicable, but never remove the last tag on a word
• Typically more than 1000 hand-written rules, but the rules may also be machine learned

The example: Constraint grammar
• Tagset: nn, vb, pron, art, infmrk, prep
• First: look up all possible classes for each word
• Rules will then remove unwanted tags
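The two steps above can be sketched on the example sentence "you have to book a chair on deck". This is an illustration only: the toy lexicon and the three rules are invented for this sentence and are far simpler than the hand-written rules of a real constraint grammar such as EngTwol.

```python
# Hypothetical toy lexicon: every tag each word may carry.
LEXICON = {
    "you": {"pron"}, "have": {"vb"}, "to": {"infmrk", "prep"},
    "book": {"nn", "vb"}, "a": {"art"}, "chair": {"nn", "vb"},
    "on": {"prep"}, "deck": {"nn", "vb"},
}


def prev_is(readings, i, tag):
    """True if the previous word is unambiguously tagged `tag`."""
    return i > 0 and readings[i - 1] == {tag}


def no_verb_after_art_or_prep(readings, i):
    if prev_is(readings, i, "art") or prev_is(readings, i, "prep"):
        return {"vb"}
    return set()


def no_noun_after_infmrk(readings, i):
    return {"nn"} if prev_is(readings, i, "infmrk") else set()


def no_prep_between_verbs(readings, i):
    next_may_be_verb = i + 1 < len(readings) and "vb" in readings[i + 1]
    return {"prep"} if prev_is(readings, i, "vb") and next_may_be_verb else set()


RULES = [no_verb_after_art_or_prep, no_noun_after_infmrk, no_prep_between_verbs]


def constraint_tag(words, lexicon=LEXICON, rules=RULES):
    readings = [set(lexicon[w]) for w in words]  # step 1: all possible tags
    changed = True
    while changed:                               # step 2: until no rule applies
        changed = False
        for i in range(len(words)):
            for rule in rules:
                to_remove = rule(readings, i) & readings[i]
                # Never remove the last remaining tag on a word.
                if to_remove and readings[i] - to_remove:
                    readings[i] -= to_remove
                    changed = True
    return readings


sentence = "you have to book a chair on deck".split()
print(constraint_tag(sentence))
```

Each rule only states which tags are impossible in a given context, so the order in which the rules are listed does not change what they are allowed to remove; the loop simply repeats until nothing more can be eliminated.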

Transformation-based tagging
• Basic idea:
  • Set the most probable tag for each word as a start value
  • Change tags according to rules of the type: “if a word is tagged as a verb and the word before is an article, then change the tag to noun”. Apply the rules in a specific order!
• Training is done using a tagged corpus:
  1. Write a set of rule templates of the type: “if word-1 or word+1 is an X then change the tag for word to Y”
  2. Among the set of possible rules, find the one with the highest score
  3. Continue from 2 until a lowest score threshold is passed
  4. Keep the ordered set of rules
• Rules will make errors that are corrected by later rules
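The training loop can be sketched as a greedy search. The code below is a simplification made for illustration, not Brill's implementation: it uses a single template ("if word-1 is tagged X, change tag A to B"), scores a rule by the number of errors it removes minus the number it introduces, and all function names are assumptions.

```python
from itertools import product


def apply_rule(tags, rule):
    """Apply one rule (prev_tag, old, new) to a whole tag sequence."""
    prev_tag, old, new = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked against the tags as they were before this rule.
        if tags[i - 1] == prev_tag and tags[i] == old:
            out[i] = new
    return out


def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(initial, gold, tagset, min_gain=1):
    """Greedily pick the best-scoring rule until the gain drops below min_gain."""
    current, learned = list(initial), []
    while True:
        base = errors(current, gold)
        best_rule, best_gain = None, 0
        for rule in product(tagset, repeat=3):  # instantiate the template
            if rule[1] == rule[2]:
                continue
            gain = base - errors(apply_rule(current, rule), gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None or best_gain < min_gain:
            return learned                      # the ordered rule list
        current = apply_rule(current, best_rule)
        learned.append(best_rule)


TAGSET = ["nn", "vb", "pron", "art", "infmrk", "prep"]
# "you have to book a chair on deck", with an assumed most-probable-tag start.
initial = ["pron", "vb", "prep", "nn", "art", "nn", "prep", "nn"]
gold = ["pron", "vb", "infmrk", "vb", "art", "nn", "prep", "nn"]
print(learn_rules(initial, gold, TAGSET))
```

On this toy data the learner first finds "prep after a verb becomes infmrk" and only then "noun after infmrk becomes verb", which shows why the rules must be kept in order.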

The example: Transformation-based learning
• Tagset: nn, vb, pron, art, infmrk, prep
• First: look up the most common tag for each word
• Rules will then change to the right tags
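Applying an already learned rule list to the example sentence can be sketched as follows. The most-common-tag lexicon and the two rules are assumptions chosen for illustration; each rule reads "if the previous word is tagged X, change tag A to B".

```python
# Hypothetical most-common tag for each word of the example sentence.
MOST_COMMON_TAG = {
    "you": "pron", "have": "vb", "to": "prep", "book": "nn",
    "a": "art", "chair": "nn", "on": "prep", "deck": "nn",
}

# Ordered rules (previous tag, old tag, new tag); assumed, not learned here.
ORDERED_RULES = [("vb", "prep", "infmrk"), ("infmrk", "nn", "vb")]


def tbl_tag(words, lexicon, rules):
    tags = [lexicon[w] for w in words]       # start value: most common tag
    for prev_tag, old, new in rules:         # then apply the rules in order
        tags = [
            new if i > 0 and tags[i - 1] == prev_tag and tag == old else tag
            for i, tag in enumerate(tags)
        ]
    return tags


sentence = "you have to book a chair on deck".split()
print(tbl_tag(sentence, MOST_COMMON_TAG, ORDERED_RULES))
```

If the two rules are swapped, "book" stays a noun, because "to" has not yet been retagged as infmrk when the second rule is tried.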

An HMM tagger: uses statistics (brief)
• The problem may be formulated as:
• Which may be reformulated as:
• But the denominator is constant and may be removed, and we get:
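A sketch of the three steps in the usual textbook form, writing w for words and c for tags (word classes) as on the training slide. The exact symbols are an assumption; the steps are the application of Bayes' rule that the bullets describe.

```latex
\hat{c}_{1..n}
  = \arg\max_{c_{1..n}} P(c_{1..n} \mid w_{1..n})
  = \arg\max_{c_{1..n}} \frac{P(w_{1..n} \mid c_{1..n})\, P(c_{1..n})}{P(w_{1..n})}
  = \arg\max_{c_{1..n}} P(w_{1..n} \mid c_{1..n})\, P(c_{1..n})
```

The denominator P(w_{1..n}) does not depend on the tag sequence, so dropping it does not change which sequence is the maximum.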

HMM tagger, cont. (brief)
The Markov assumption (for n=3) and the chain rule give us:
What we need now is:
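A sketch of what these two statements usually look like for a trigram (n=3) tagger, again with w for words and c for tags. That each word depends only on its own tag is the standard HMM assumption and is assumed here.

```latex
P(c_{1..n}) \approx \prod_{i=1}^{n} P(c_i \mid c_{i-2}, c_{i-1})
\qquad
P(w_{1..n} \mid c_{1..n}) \approx \prod_{i=1}^{n} P(w_i \mid c_i)

\hat{c}_{1..n} = \arg\max_{c_{1..n}} \prod_{i=1}^{n} P(w_i \mid c_i)\, P(c_i \mid c_{i-2}, c_{i-1})
```

Under these assumptions the quantities to estimate are the lexical probabilities P(w_i | c_i) and the tag trigram probabilities P(c_i | c_{i-2}, c_{i-1}).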

The example: HMM
Select the sequence with the highest probability!

Training of an HMM tagger
• The best way is maximum likelihood estimation, but it requires a hand-tagged corpus
• A fancy name for a simple principle: expect the new data to look like the training data. Count the things there:
  • P(c) = freq(c) / Ntok
  • P(w,c) = freq(w,c) / Ntok
  • P(w|c) = P(w,c) / P(c)
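The three counts above translate directly into code. A minimal sketch, assuming the hand-tagged corpus is a list of (word, tag) pairs; the function name `train_mle` and the toy corpus are made up for illustration.

```python
from collections import Counter


def train_mle(tagged_tokens):
    """Maximum likelihood estimates from a list of (word, tag) pairs."""
    n_tok = len(tagged_tokens)
    freq_c = Counter(c for _, c in tagged_tokens)
    freq_wc = Counter(tagged_tokens)
    p_c = {c: f / n_tok for c, f in freq_c.items()}        # P(c)
    p_wc = {wc: f / n_tok for wc, f in freq_wc.items()}    # P(w,c)
    p_w_given_c = {(w, c): p / p_c[c] for (w, c), p in p_wc.items()}  # P(w|c)
    return p_c, p_wc, p_w_given_c


toy_corpus = [("a", "art"), ("book", "nn"), ("book", "vb"),
              ("a", "art"), ("chair", "nn")]
p_c, p_wc, p_w_given_c = train_mle(toy_corpus)
print(p_c["nn"], p_w_given_c[("book", "nn")])
```

Any (word, tag) pair that never occurs in the training data is simply absent from these tables, which is the sparse-data problem that smoothing addresses.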

Evaluation (skip)
• The result is compared with the so-called “Gold Standard” (manually coded)
• Accuracy typically reaches 96-97%
• This may be compared with the result for a baseline tagger, for example a tagger not using context at all
• Similarity between two gold standards may be verified with the kappa measure
• Important to note that 100% is impossible even for human annotators
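Both measures are easy to compute from two aligned tag sequences. A minimal sketch; it assumes that "the kappa measure" refers to Cohen's kappa for two annotators, which the slide does not state explicitly.

```python
from collections import Counter


def accuracy(predicted, gold):
    """Share of tokens where the two tag sequences agree."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def cohen_kappa(tags_a, tags_b):
    """Agreement between two annotations, corrected for chance agreement.
    Undefined (division by zero) if chance agreement is already 1."""
    n = len(tags_a)
    observed = accuracy(tags_a, tags_b)
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


annotator_a = ["nn", "nn", "vb", "vb"]
annotator_b = ["nn", "vb", "vb", "vb"]
print(accuracy(annotator_a, annotator_b), cohen_kappa(annotator_a, annotator_b))
```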

Problems (quick)
• Words and sequences are missing in the training data. This is cured using smoothing:
  • Additive: add one occurrence to each event frequency
  • Good-Turing estimation: try to calculate the number of unseen events to get a better estimation of their probabilities
  • Back-off and linear interpolation
• Morphology may help (-arity, -s)
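Additive smoothing is the simplest of these methods to write down. A minimal sketch, assuming the full set of possible events is known in advance; the function name and the parameter `k` (k=1 gives the add-one variant named above) are choices made for illustration.

```python
def additive_smoothing(freq, events, k=1.0):
    """Add k occurrences to every possible event before normalizing."""
    total = sum(freq.get(e, 0) for e in events) + k * len(events)
    return {e: (freq.get(e, 0) + k) / total for e in events}


observed = {"nn": 3, "vb": 1}            # "adj" was never seen
probs = additive_smoothing(observed, ["nn", "vb", "adj"])
print(probs)
```

The unseen event now gets a small non-zero probability, at the cost of taking some probability mass away from the events that were actually observed.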

The Viterbi algorithm (quick)
• Calculating the probabilities for all possible sequences of tags would take too long
• The Viterbi algorithm helps us find the most probable path in time linear in the length of the text and quadratic in the number of states, using dynamic programming
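The dynamic programming idea can be sketched as follows. For brevity this illustration uses tag bigrams (each tag depends on one previous tag) rather than the n=3 model from the earlier slides, and the toy probabilities are invented; a real tagger would also work with log probabilities to avoid underflow.

```python
def viterbi(words, tags, p_start, p_trans, p_emit):
    """Most probable tag sequence under a bigram HMM.
    p_start[t], p_trans[(prev, t)] and p_emit[(word, t)] default to 0."""
    # best[i][t] = (probability of the best path ending in t at word i, previous tag)
    best = [{t: (p_start.get(t, 0.0) * p_emit.get((words[0], t), 0.0), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            # Only the best path into each previous tag has to be considered.
            prob, prev = max((best[-1][s][0] * p_trans.get((s, t), 0.0), s)
                             for s in tags)
            column[t] = (prob * p_emit.get((word, t), 0.0), prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]


TAGS = ["art", "nn", "vb"]
P_START = {"vb": 0.5, "nn": 0.3, "art": 0.2}
P_TRANS = {("vb", "art"): 0.6, ("nn", "art"): 0.1,
           ("art", "nn"): 0.9, ("art", "vb"): 0.05}
P_EMIT = {("book", "nn"): 0.5, ("book", "vb"): 0.5, ("a", "art"): 1.0,
          ("chair", "nn"): 0.4, ("chair", "vb"): 0.1}
print(viterbi(["book", "a", "chair"], TAGS, P_START, P_TRANS, P_EMIT))
```

The outer loop runs once per word and the two inner loops run over all pairs of tags, which gives the linear and quadratic behaviour stated above.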

Examples of corpus tools at the linguistics department in Göteborg
• The Corpus Browser
  • A tool for searching (for words and expressions) and browsing in our transcriptions
• TraSA
  • A tool that counts things like number of words, utterances, overlaps, vocabulary richness, etc.
• Multitool
  • A tool for browsing and coding a transcription, with audio and video available at the same time
• Demonstration?

Thank you!
• Thank you for listening!
• Well, do we have any time left for questions?
