ENTITYPRO: EXPLOITING SVM FOR ITALIAN NAMED ENTITY RECOGNITION. EVALITA 2007, Frascati, September 10th 2007. Roberto Zanoli and Emanuele Pianta.
TextPro
• A suite of modular NLP tools developed at FBK-irst:
  • TokenPro: tokenization
  • MorphoPro: morphological analysis
  • TagPro: Part-of-Speech tagging
  • LemmaPro: lemmatization
  • EntityPro: Named Entity recognition
  • ChunkPro: phrase chunking
  • SentencePro: sentence splitting
• Architecture designed to be efficient, scalable and robust
• Cross-platform: Unix / Linux / Windows / Mac OS X
• Multilingual models
• All modules integrated and accessible through a unified command-line interface
EntityPro's architecture
We used YamCha, an SVM-based machine learning environment, to build EntityPro, a system exploiting a rich set of linguistic features, such as orthographic features, prefixes and suffixes, and occurrence in proper noun gazetteers.
[Architecture diagram. Components: Controller, TagPro, dictionary, YamCha, models. Training data: feature extraction (ortho, prefix, suffix, dictionary, collocation bigram), feature selection, learning. Test data: feature extraction (same features), feature selection, classification.]
YamCha
• Created as a generic, customizable, open source text chunker
• Can be adapted to many other tag-oriented NLP tasks
• Uses a state-of-the-art machine learning algorithm (SVM)
• Configurable settings:
  • context (window-size)
  • parsing direction (forward/backward)
  • algorithm for the multi-class problem (pairwise / one vs. rest)
• Practical chunking time (1 or 2 sec. per sentence)
• Available as a C/C++ library
Support Vector Machines
Support vector machines are based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995) from computational learning theory. They map input vectors to a higher-dimensional space in which a maximal-margin separating hyperplane is constructed. Two parallel hyperplanes are placed on either side of the hyperplane that separates the data; the separating hyperplane chosen is the one that maximizes the distance between these two parallel hyperplanes.
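The margin description above corresponds to the standard hard-margin formulation. As a sketch in the usual textbook notation (the symbols w, b, x_i, y_i are not from the slides): the two parallel hyperplanes are w·x + b = +1 and w·x + b = −1, their distance is 2/‖w‖, and maximizing that distance is equivalent to

```latex
\begin{aligned}
\min_{\mathbf{w},\,b}\quad & \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2} \\
\text{subject to}\quad & y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n
\end{aligned}
```

where y_i ∈ {−1, +1} is the class of training vector x_i. With a kernel, x_i is replaced by its image in the higher-dimensional space.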
YamCha: Setting Window Size
• Default setting is "F:-2..2:0.. T:-2..-1"
• The window setting can be customized
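To make the default setting concrete, here is a minimal Python sketch that unpacks a window string of this shape. It assumes the usual reading of the YamCha notation: `F:` gives the token offsets and feature columns used as static features (an open-ended column range means "up to the last column"), and `T:` gives the offsets of previously assigned tags used as dynamic features. The function `parse_window` is a hypothetical helper for illustration and is not part of YamCha.

```python
def parse_window(spec):
    """Parse a window string such as 'F:-2..2:0.. T:-2..-1' (hypothetical helper)."""
    static, dynamic = None, None
    for part in spec.split():
        kind, _, rest = part.partition(":")
        if kind == "F":
            # Static features: token offsets, then feature-column range.
            rows, _, cols = rest.partition(":")
            lo, hi = rows.split("..")
            col_lo, col_hi = cols.split("..")
            static = {
                "rows": list(range(int(lo), int(hi) + 1)),
                "first_col": int(col_lo),
                # Empty upper bound: assumed to mean "up to the last column".
                "last_col": int(col_hi) if col_hi else None,
            }
        elif kind == "T":
            # Dynamic features: offsets of already-predicted tags.
            lo, hi = rest.split("..")
            dynamic = list(range(int(lo), int(hi) + 1))
    return static, dynamic


static, dynamic = parse_window("F:-2..2:0.. T:-2..-1")
print(static["rows"])  # token offsets used for static features
print(dynamic)         # tag offsets used for dynamic features
```

Under this reading, the default uses all feature columns of the two preceding tokens, the current token and the two following tokens, plus the tags of the two preceding tokens.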
Training and Tuning Set
• Evalita development set randomly split into two parts:
  • training: 92,241 tokens
  • tuning: 40,348 tokens
FEATURES (1/3)
For each running word:
• WORD: the word itself (both unchanged and lower-cased)
  • e.g. Casa casa
• POS: the part of speech of the word (as produced by TagPro)
  • e.g. Oggi SS (singular noun)
• AFFIX: prefixes/suffixes (1, 2, 3 or 4 chars at the start/end of the word)
  • e.g. Oggi {o, og, ogg, oggi; i, gi, ggi, oggi}
• ORTHOgraphic information (e.g. capitalization, hyphenation)
  • e.g. Oggi C (capitalized)
  • e.g. oggi L (lowercased)
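The WORD, AFFIX and ORTHO features listed above can be sketched in a few lines of Python. This is an illustration, not EntityPro's code: the function names are hypothetical, affixes are taken from the lower-cased word as in the Oggi example, and the `_nil_` padding for words shorter than the affix length follows the feature-extraction example later in the slides. Only the C and L orthographic values appear in the slides; the `OTHER` fallback is an assumption.

```python
def affixes(word, max_len=4, pad="_nil_"):
    """Return (prefixes, suffixes) of length 1..max_len of the lower-cased word."""
    w = word.lower()
    prefixes = [w[:n] if len(w) >= n else pad for n in range(1, max_len + 1)]
    suffixes = [w[-n:] if len(w) >= n else pad for n in range(1, max_len + 1)]
    return prefixes, suffixes


def ortho(word):
    """C = capitalized, L = lowercased; OTHER is a hypothetical catch-all."""
    if word[:1].isupper():
        return "C"
    if word.islower():
        return "L"
    return "OTHER"


def word_features(word):
    """Collect WORD, AFFIX and ORTHO features for one token."""
    prefixes, suffixes = affixes(word)
    return [word, word.lower()] + prefixes + suffixes + [ortho(word)]


print(word_features("Oggi"))
```

POS, collocation and gazetteer columns would be appended to this row by the other components (TagPro and dictionary lookups).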
FEATURES (2/3)
• COLLOCation bigrams (36,000, from Italian newspapers, ranked by MI values)
  e.g.
  l'         O
  avvocato   O
  di         O
  Rossi      O
  Carlo      B-COL
  Taormina   I-COL
  ha         O
  …
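The collocation column in the example above can be reproduced with a simple lookup. The sketch below assumes the collocation list is available as a set of lower-cased word pairs; how the 36,000 bigrams were actually stored and matched is not stated in the slides, and `tag_collocations` is a hypothetical name.

```python
def tag_collocations(tokens, collocations):
    """Mark known bigrams with B-COL / I-COL, everything else with O."""
    tags = ["O"] * len(tokens)
    for i in range(len(tokens) - 1):
        pair = (tokens[i].lower(), tokens[i + 1].lower())
        # Skip tokens already consumed as the second half of a bigram.
        if pair in collocations and tags[i] == "O":
            tags[i], tags[i + 1] = "B-COL", "I-COL"
    return tags


tokens = ["l'", "avvocato", "di", "Rossi", "Carlo", "Taormina", "ha"]
collocations = {("carlo", "taormina")}  # toy list for illustration only
for token, tag in zip(tokens, tag_collocations(tokens, collocations)):
    print(token, tag)
```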
FEATURES (3/3): GAZETTeers
• TOWNS: World (main), Italian (comuni) and Trentino's (frazioni) towns (12,000, from various internet sites)
• STOCK-MARKET: Italian and American stock market organizations (5,000, from stock market sites)
• WIKI-GEO: Wikipedia geographical locations (3,200)
• PERSONS: Person proper names or titles (154,000, from the Italian phone book and Wikipedia)

Example (one column per gazetteer):
difeso     O    O  O  O
dall'      O    O  O  O
avvocato   O    O  O  TRIG
Mario      O    O  O  B-NAM
De         O    O  O  B-SUR
Murgo      O    O  O  I-SUR
di         O    O  O  O
Vicenza    GPE  O  O  O
…
An Example of Feature Extraction

Input (token, POS, NE tag):
difeso     VSP  O
dall'      ES   O
avvocato   SS   O
Mario      SPN  B-PER
De         E    I-PER
Murgo      SPN  I-PER
,          XPW  O

Extracted feature rows (word, lower-cased word, 4 prefixes, 4 suffixes, 2 orthographic columns, 4 gazetteer columns, collocation tag, POS, NE tag):
difeso    difeso    d  di  dif    dife   o  so  eso    feso   L  N  O  O  O  O      O      VSP  O
dall'     dall'     d  da  dal    dall   '  l'  ll'    all'   L  A  O  O  O  O      O      ES   O
avvocato  avvocato  a  av  avv    avvo   o  to  ato    cato   L  N  O  O  O  TRIG   O      SS   O
Mario     mario     m  ma  mar    mari   o  io  rio    ario   C  N  O  O  O  B-NAM  O      SPN  B-PER
De        de        d  e   _nil_  _nil_  e  de  _nil_  _nil_  C  N  O  O  O  B-SUR  B-COL  E    I-PER
Murgo     murgo     m  mu  mur    murg   o  go  rgo    urgo   C  N  O  O  O  I-SUR  I-COL  SPN  I-PER
Static vs Dynamic Features
• STATIC FEATURES
  • extracted for the current, previous and following words
  • WORD, POS, AFFIX, ORTHO, COLLOC, GAZET
• DYNAMIC FEATURES
  • decided dynamically during tagging
  • tags of the 3 tokens preceding the current token
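The difference can be sketched as a left-to-right tagging loop: static features are known before tagging starts, while dynamic features are the tags the tagger has just assigned. The Python below is an illustration under stated assumptions; `tag_sequence` and `toy_classifier` are hypothetical, and the toy rule stands in for the SVM classifier that YamCha would apply.

```python
def tag_sequence(tokens, classify, n_prev=3):
    """Tag left to right, feeding the previous n_prev predicted tags back in."""
    tags = []
    for i, token in enumerate(tokens):
        prev_tags = tags[max(0, i - n_prev):i]  # dynamic features
        tags.append(classify(token, prev_tags))
    return tags


def toy_classifier(token, prev_tags):
    """Stand-in for the SVM: capitalized tokens are treated as person names."""
    if not token[:1].isupper():
        return "O"
    inside = bool(prev_tags) and prev_tags[-1] in ("B-PER", "I-PER")
    return "I-PER" if inside else "B-PER"


tokens = ["difeso", "dall'", "avvocato", "Mario", "De", "Murgo", ","]
print(tag_sequence(tokens, toy_classifier))
```

The choice between B-PER and I-PER for "De" and "Murgo" depends on the tag just assigned to the previous token, which is exactly the information a dynamic feature carries.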
Finding the best features
Baseline:
• WORD (both unchanged and lower-cased)
• AFFIX
• ORTHOgraphic
• window-size: STATIC +2,-2; DYNAMIC -2
Finding the best window-size
Given the best set of features (F1 = 79.81), we tried to improve the F1 measure by changing the window-size.
Evaluating the best algorithm: PKI vs. PKE
• YamCha offers two implementations of SVMs: PKI and PKE
• Both are faster than the original SVMs
• PKI produces the same accuracy as the original SVMs
• PKE approximates the original SVMs: slightly less accurate, but faster
Conclusion (1/2)
• A statistical approach to Named Entity Recognition for Italian based on YamCha/SVMs
• Results confirm that SVMs can deal with a large number of features and that they perform at the state of the art
• Among the features, GAZETteers seem to be the most important
  • 31% error reduction
• A large context (large values of window-size, e.g. +6,-6) causes a significant drop in recall, about 3 points, which we attribute to data sparseness
Conclusion (2/2)
• F1 values for both PER (92.12) and GPE (85.54) appear rather good, comparing well with those obtained at CoNLL-2003 for English
• Recognition of LOCs (F1: 73.04) seems more problematic: we suspect that the number of LOCs in the training data is insufficient for the learning algorithm
• ORGs appear to be highly ambiguous
Examples

Token        Gold    Prediction
è            O       O
stato        O       O
denunciato   O       O
dai          O       O
carabinieri  B-ORG   O
di           O       O
Vigolo       B-GPE   B-GPE
Vattaro      I-GPE   I-GPE

Token        Gold    Prediction
è            O       O
stato        O       O
fermato      O       O
dai          O       O
carabinieri  O       O
ed           O       O
in           O       O
seguito      O       O
ad           O       O
un           O       O
controllo    O       O
Examples 2

Token        Gold    Prediction
Fontana      B-PER   B-PER
(            O       O
Villazzano   B-ORG   B-GPE
)            O       O
,            O       O
Campo        B-PER   B-PER
(            O       O
Baone        B-ORG   B-GPE
)            O       O
,            O       O
Rao          B-PER   B-PER
(            O       O
Alta         B-ORG   B-ORG
Vallagarina  I-ORG   I-ORG
)            O       O
.            O       O

Token        Gold    Prediction
dovrà        O       O
dare         O       O
a            O       O
via          B-ORG   B-LOC
Segantini    I-ORG   I-LOC
un           O       O
ruolo        O       O
diverso      O       O
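To show how gold/prediction pairs like these turn into precision, recall and F1, here is a minimal Python sketch. It assumes exact-match, entity-level scoring in the CoNLL style (an entity counts as correct only if both its span and its type match); the slides do not state which scorer was used, and the function names are hypothetical. The data is the first sentence of the first Examples slide.

```python
def entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag list."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def prf(gold, pred):
    """Entity-level precision, recall and F1 under exact span+type match."""
    g, p = entities(gold), entities(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# "è stato denunciato dai carabinieri di Vigolo Vattaro"
gold = ["O", "O", "O", "O", "B-ORG", "O", "B-GPE", "I-GPE"]
pred = ["O", "O", "O", "O", "O", "O", "B-GPE", "I-GPE"]
print(prf(gold, pred))
```

In this sentence the system finds the GPE but misses the ORG, so the one predicted entity is correct (precision 1.0) while only one of the two gold entities is recovered (recall 0.5).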
EntityPro
• EntityPro is a system for Named Entity Recognition (NER) built on YamCha, which provides the Support Vector Machine (SVM) implementation.
• YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo) is a generic, customizable, open source text chunker.
• EntityPro can exploit a rich set of linguistic features, such as Part of Speech, orthographic features and proper name gazetteers.
• The system is part of TextPro, a suite of NLP tools developed at FBK-irst.