TAGPRO A system for ITALIAN POS TAGGING based on SVM

EVALITA 2007, Frascati, September 10th 2007
Emanuele Pianta and Roberto Zanoli, FBK-irst, Trento

TextPro
• A suite of modular NLP tools developed at FBK-irst
  • TokenPro: tokenization
  • MorphoPro: morphological analysis
  • TagPro: Part-of-Speech tagging
  • LemmaPro: lemmatization
  • EntityPro: Named Entity recognition
  • ChunkPro: phrase chunking
  • SentencePro: sentence splitting
• Architecture designed to be efficient, scalable and robust
• Cross-platform: Unix / Linux / Windows / MacOS X
• Multi-lingual models
• All modules integrated and accessible through a unified command-line interface

TagPro's architecture
To build TagPro we used YamCha, an SVM-based machine learning environment. TagPro can exploit a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.
[Architecture diagram. Training path: Training data, Feature extraction (ortho, prefix, suffix, dictionary, morpho analysis), Feature selection, Learning, models. Tagging path: Test data, Feature extraction (ortho, prefix, suffix, dictionary, morpho analysis), Feature selection, Classification. Other components shown: TagPro, YamCha, Controller, dictionary, MorphoPro.]

YamCha
• Created as a generic, customizable, open source text chunker
• Can be adapted to many other tag-oriented NLP tasks
• Uses a state-of-the-art machine learning algorithm (SVM)
• Can redefine:
  • context (window-size)
  • parsing direction (forward/backward)
  • algorithms for the multi-class problem (pairwise / one vs. rest)
• Practical chunking time (1 or 2 sec./sentence)
• Available as a C/C++ library

Support Vector Machines
Based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995)
• SVMs map input vectors to a higher-dimensional space where a maximal separating hyperplane is constructed.
• Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data.
• The separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes.
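For the linearly separable case, the bullets above correspond to the standard hard-margin formulation. The notation is an assumption introduced here, not taken from the slides: training pairs $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$, weight vector $\mathbf{w}$ and bias $b$.

```latex
\begin{aligned}
&\text{separating hyperplane:} && \mathbf{w}\cdot\mathbf{x} + b = 0 \\
&\text{parallel hyperplanes:}  && \mathbf{w}\cdot\mathbf{x} + b = \pm 1 \\
&\text{distance between them:} && \frac{2}{\lVert \mathbf{w} \rVert} \\
&\text{training problem:}      && \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
  \quad \text{s.t.} \quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \quad \forall i
\end{aligned}
```

Maximizing the distance $2/\lVert \mathbf{w} \rVert$ is equivalent to minimizing $\lVert \mathbf{w} \rVert^{2}$, which is why the training problem is stated as a minimization.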

YamCha: Setting Window Size
Default setting is "F:-2..2:0.. T:-2..-1". The window setting can be customized.
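As a rough illustration of what such a window setting selects, the Python sketch below expands a window around one token. It assumes the usual reading of the default string: `F:-2..2:0..` takes all feature columns of the tokens from two positions before to two positions after the current one, and `T:-2..-1` takes the tags already assigned to the two preceding tokens. The function name, the output format and the `__nil__` padding are illustrative choices, not YamCha's actual implementation.

```python
NIL = "__nil__"  # padding for positions outside the sentence (assumed convention)

def window_features(rows, tags, i, static_range=(-2, 2), tag_range=(-2, -1)):
    """Collect the features visible from token i under a window setting.

    rows: one list of feature columns per token
    tags: tags already assigned to the tokens before position i
    """
    feats = []
    lo, hi = static_range
    for off in range(lo, hi + 1):
        j = i + off
        cols = rows[j] if 0 <= j < len(rows) else [NIL]
        for c, value in enumerate(cols):
            feats.append(f"F:{off}:{c}={value}")
    lo, hi = tag_range
    for off in range(lo, hi + 1):
        j = i + off
        tag = tags[j] if 0 <= j < len(tags) else NIL
        feats.append(f"T:{off}={tag}")
    return feats

# Two columns per token: the word and its lower-cased form.
rows = [["l'", "l'"], ["ex", "ex"], ["leader", "leader"], ["socialista", "socialista"]]
tags = ["ART", "ADJ"]  # tags of the two tokens before position 2
feats = window_features(rows, tags, 2)
```

Narrowing the window to +1,-1 only means passing `static_range=(-1, 1)`; the rest of the pipeline is unchanged.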

Training and Tuning Set
• The Evalita development set was randomly split into 2 parts:
  • Training: 89,170 tokens
  • Tuning: 44,586 tokens
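The two sizes above add up to 133,756 tokens, so the training part is roughly two thirds of the development set. The sketch below shows one way such a random split could be made. It assumes the split is done on whole sentences with a fixed random seed, neither of which is stated in the slides; `split_by_tokens` is a hypothetical helper.

```python
import random

def split_by_tokens(sentences, train_fraction=2 / 3, seed=0):
    """Shuffle sentences, then fill the training part until it holds
    about train_fraction of all tokens; the rest becomes the tuning part."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    total = sum(len(s) for s in shuffled)
    train, tune, seen = [], [], 0
    for sent in shuffled:
        if seen < train_fraction * total:
            train.append(sent)
            seen += len(sent)
        else:
            tune.append(sent)
    return train, tune

# Toy corpus: 30 sentences of 3 tokens each.
corpus = [[f"w{i}_{k}" for k in range(3)] for i in range(30)]
train, tune = split_by_tokens(corpus)
```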

FEATURES
• For each running word a rich set of features is extracted
• WORD: the word itself (both unchanged and lower-cased)
  • e.g. Autore autore
• MORPHO: the morphological analysis (produced by MorphoPro)
  • e.g. Autore autore+n+m+sing
  • Calcio calcio calcio+n+m+sing calciare+v+indic+pres+nil+1+sing
• AFFIX: prefixes/suffixes (2, 3, 4 or 5 chars at the start/end of the word)
  • e.g. libro {li,lib,libr,libro,ro,bro,ibro,libro}
• ORTHOgraphic information (e.g. capitalization, hyphenation)
  • e.g. Oggi C (capitalized)
  • oggi L (lowercased)
• GAZETTeers of proper nouns (154,000 proper names, 12,000 cities, 5,000 organizations and 3,200 locations)
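The AFFIX and ORTHO features can be sketched in a few lines of Python. This is a minimal illustration that reproduces the `libro` example above, not TagPro's code: the function names are hypothetical, the `__nil__` padding for words shorter than the affix length and the lower-casing of affixes are assumptions modelled on the feature-extraction example slide, and the `O` fallback code is invented here.

```python
NIL = "__nil__"  # placeholder when the word is shorter than the affix length

def affixes(word, lengths=(2, 3, 4, 5)):
    """Return (prefixes, suffixes) of 2 to 5 characters, lower-cased."""
    w = word.lower()
    prefixes = [w[:n] if len(w) >= n else NIL for n in lengths]
    suffixes = [w[-n:] if len(w) >= n else NIL for n in lengths]
    return prefixes, suffixes

def ortho(word):
    """Coarse orthographic code: C = capitalized, L = lowercased."""
    if word[:1].isupper():
        return "C"
    if word[:1].islower():
        return "L"
    return "O"  # assumption: any other token shape

prefixes, suffixes = affixes("libro")
```

Here `prefixes` is `['li', 'lib', 'libr', 'libro']` and `suffixes` is `['ro', 'bro', 'ibro', 'libro']`, matching the slide; `ortho("Oggi")` gives `C` and `ortho("oggi")` gives `L`.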

Static vs Dynamic Features
• STATIC FEATURES
  • extracted for the current, previous and following word
  • WORD, MORPHO, AFFIXes, ORTHO, GAZET
• DYNAMIC FEATURES
  • decided dynamically during tagging
  • tags of the two tokens preceding the current token
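The difference matters at tagging time: static features can be computed for the whole sentence up front, while dynamic features depend on the tagger's own earlier decisions. The sketch below shows a greedy left-to-right loop under that assumption. `toy_classify` is a hypothetical rule-based stand-in for the trained SVM, included only so the example runs; its rules carry no linguistic claim.

```python
def greedy_tag(tokens, classify):
    """Tag left to right, feeding each decision back as a dynamic feature."""
    tags = []
    for i, token in enumerate(tokens):
        # Static features: known before tagging starts.
        static = {
            "w-1": tokens[i - 1] if i > 0 else "__nil__",
            "w0": token,
            "w+1": tokens[i + 1] if i + 1 < len(tokens) else "__nil__",
        }
        # Dynamic features: tags predicted for the two preceding tokens.
        dynamic = {
            "t-2": tags[i - 2] if i >= 2 else "__nil__",
            "t-1": tags[i - 1] if i >= 1 else "__nil__",
        }
        tags.append(classify({**static, **dynamic}))
    return tags

def toy_classify(feats):
    # Hypothetical stand-in for the SVM classifier.
    if feats["w0"][:1].isupper():
        return "NN_P"
    if feats["t-1"] == "ART":
        return "ADJ"
    return "ART" if feats["w0"].endswith("'") else "NN"

result = greedy_tag(["l'", "ex", "leader", "Bettino", "Craxi"], toy_classify)
```

In this toy run the tag of `ex` depends on the tag just assigned to `l'`, which is exactly the kind of information a static feature cannot provide.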

An Example of Feature Extraction
Input (token, PoS tag):
l' ART
ex ADJ
leader NN
socialista ADJ
Bettino NN_P
Craxi NN_P
Extracted features (one row per token; the last column is the PoS tag):
l' l' l' __nil__ __nil__ __nil__ l' __nil__ __nil__ __nil__ L A N N N N N N N N N N N Y N N N N N N N N Y N O O O O ART
ex ex ex __nil__ __nil__ __nil__ ex __nil__ __nil__ __nil__ L N N N N N N N N N N Y 2 N N N Y N N N N N N N O O O O ADJ
leader leader le lea lead leade er der ader eader L N N N N N N N N N N Y N N Y 0 N N N N N N N N O O O O NN
socialista socialista so soc soci socia ta sta ista lista L N N N N N N N N N N Y 2 N Y 0 N N N N N N N N O O O O ADJ
Bettino bettino be bet bett betti no ino tino ttino C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-NAM NN_P
Craxi craxi cr cra crax craxi xi axi raxi craxi C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-SUR NN_P

Finding the best features
Baseline: WORD (both unchanged and lower-cased), window-size: +1,-1

Finding the best window-size
Given the best set of features (F1=97.42), we tried to improve accuracy by changing the window-size.

Multi-class problem: pairwise vs. one vs. rest
• one vs. rest: fewer, bigger classifiers
• pairwise:
  • a classifier for each possible pair of classes
  • choose the classifier with best confidence
  • many relatively small classifiers
  • faster, less memory
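To make the "fewer vs. many" contrast concrete, the sketch below enumerates the binary sub-problems each strategy creates: K for one vs. rest and K(K-1)/2 for pairwise. The four tags are just those appearing in the feature-extraction example, not a full tagset, and the function names are hypothetical.

```python
from itertools import combinations

def one_vs_rest_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, "REST") for c in classes]

def pairwise_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

tags = ["ART", "ADJ", "NN", "NN_P"]
ovr = one_vs_rest_tasks(tags)    # K problems, each trained on all the data
pairs = pairwise_tasks(tags)     # K*(K-1)/2 problems, each on two classes only
```

Each pairwise classifier sees only the examples of its two classes, which is why the individual classifiers stay relatively small even though there are more of them.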

Evaluating the best algorithm: PKI vs. PKE
• YamCha uses two implementations of SVMs: PKI and PKE.
• Both are faster than the original SVMs.
• PKI (3-12x faster) produces the same accuracy as the original SVMs.
• PKE (10-300x faster) approximates the original SVM: slightly less accurate but much faster.

Results on the development set

Test Results

Conclusions
• A statistical approach to PoS tagging for Italian based on YamCha / SVMs.
• Results confirm that SVMs can deal with a large number of features without overfitting.
• We used the same best configuration for both tagsets.
• No specific method was applied for classifying unknown words.
• Features:
  • AFFIX+ORTHO: +8.56 over baseline
  • MORPHO: +2.13 over AFFIX+ORTHO
  • GAZETteers do not contribute any further significant improvement
• Features for unknown words:
  • AFFIX+ORTHO: +25.56
  • MORPHO: a further +7.62
• No benefit from a larger context (e.g. window-size +2,-2 and more)

TagPro
• TagPro is a system for PoS tagging based on YamCha.
• YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo)
  • is a generic, customizable, open source text chunker
  • is based on Support Vector Machines (SVMs)
• TagPro exploits a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.
• The system is part of TextPro, a suite of NLP tools developed at FBK-irst.

Confusion matrix
