Named Entity Recognition based on three different machine learning techniques
Zornitsa Kozareva, zkozareva@dlsi.ua.es
JRC Workshop, September 27, 2005
Research Group on Language Processing and Information Systems (GPLSI)
Outline
• Named Entity Recognition
  • task definition
  • applications
• Machine learning approach
• Classifier combination
• Feature description and experimental evaluation
  • for NE detection
  • for NE classification
• NERUA at GeoCLEF
• Conclusions and future work
Named Entity Recognition – task definition
• Identification of proper names in text, using the BIO scheme
  • B starts an entity
  • I continues the entity
  • O marks words outside any entity
• Classification into a predefined set of categories
  • Person names
  • Organizations (companies, governmental organizations, etc.)
  • Locations (cities, countries, etc.)
  • Miscellaneous (movie titles, sport events, etc.)
Example: Adam_B-PER Smith_I-PER works_O for_O IBM_B-ORG ,_O London_B-LOC ._O
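The BIO scheme from this slide can be made concrete with a short sketch that groups tagged tokens back into entities. The helper names `parse_tagged` and `bio_to_entities` are illustrative assumptions, not part of the NERUA system; the input is the example sentence from the slide.

```python
def parse_tagged(sentence):
    """Split 'word_TAG' items on the last underscore into (word, tag) pairs."""
    return [tuple(item.rsplit("_", 1)) for item in sentence.split()]


def bio_to_entities(tagged_tokens):
    """Group (word, tag) pairs tagged with the BIO scheme into (text, type) entities."""
    entities = []
    current_words, current_type = [], None
    for word, tag in tagged_tokens:
        if tag.startswith("B-"):
            # B starts a new entity, closing any entity that was still open
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # I continues the currently open entity of the same type
            current_words.append(word)
        else:
            # O (or an I tag with nothing to continue) closes any open entity
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


example = "Adam_B-PER Smith_I-PER works_O for_O IBM_B-ORG ,_O London_B-LOC ._O"
entities = bio_to_entities(parse_tagged(example))
print(entities)
```

Detection corresponds to deciding the B/I/O part of each tag; classification corresponds to deciding the category suffix (PER, ORG, LOC, MISC).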
Named Entity Recognition – applications
• Information Extraction
• Question Answering
• Document classification
• Automatic indexing of books
• Increasing the accuracy of Internet search results (e.g. the location Clinton, South Carolina vs. President Clinton)
Machine learning approach
• Given:
  • NER task
  • tagged corpus
• Select classification methods
  • Memory-based learning
  • Maximum Entropy
  • Hidden Markov Models
• Construct feature sets
  • for the detection phase
  • for the classification phase
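[Figure: NERUA architecture. Text → Detection (HMM, TiMBL, voting) → Classification (HMM, MXE, TiMBL, voting) → NER text]
NERUA: sistema de detección y clasificación de entidades utilizando aprendizaje automático [NERUA: an entity detection and classification system using machine learning], Ferrández et al.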
Classification method 1
• Memory-based learning (k-nearest neighbours)
• toolkit
  • TiMBL package
• time performance
  • quick training phase
  • slow during testing
• features
  • handles various types of features
  • irrelevant features impede performance
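A minimal sketch of the memory-based idea, assuming a plain overlap distance (number of mismatching feature values) and unweighted features; it does not reproduce TiMBL's actual metrics or weighting, and the toy instances are invented for illustration. It shows why training is quick (instances are only stored) and testing is slow (every stored instance is compared).

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the feature positions where two instances disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(memory, instance, k=3):
    """Label an instance by the most frequent label among its k nearest stored examples."""
    # "Training" is just keeping `memory`; all the work happens at test time.
    nearest = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical instances: (anchor word, capitalization, next word) -> BIO tag
memory = [
    (("Smith", "cap", "works"), "I-PER"),
    (("IBM", "cap", ","), "B-ORG"),
    (("for", "low", "IBM"), "O"),
    (("works", "low", "for"), "O"),
]
prediction = knn_classify(memory, ("in", "low", "London"), k=3)
print(prediction)
```

Because every feature counts equally in this distance, an uninformative feature can pull the wrong neighbours closer, which is one way to read the slide's remark that irrelevant features impede performance.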
Classification method 2
• Maximum Entropy
• toolkit
  • MaxEnt
• time performance
  • slow training phase
  • slow testing phase
• feature management
  • string, missing values
Classification method 3
• Hidden Markov Models
• toolkit
  • ICOPOST
• time performance
  • quick training phase
  • quick testing phase
• feature management
  • cannot handle as many features as the other two methods
  • needs corpus or label transformation
Classifier combination
• Majority voting
  • each classifier gets one vote
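Majority voting over per-token tags can be sketched as follows. The slide does not say how ties are resolved, so falling back to one designated classifier (`tie_break_index`) is an assumption made here, and the three tag sequences are invented for illustration.

```python
from collections import Counter


def majority_vote(predictions, tie_break_index=0):
    """Combine tag sequences from several classifiers, one vote per classifier per token."""
    combined = []
    for tags in zip(*predictions):
        counts = Counter(tags)
        top_tag, top_count = counts.most_common(1)[0]
        # Assumption: when several tags share the top count, trust one chosen classifier.
        if sum(1 for c in counts.values() if c == top_count) > 1:
            top_tag = tags[tie_break_index]
        combined.append(top_tag)
    return combined


# Hypothetical outputs of the three classifiers for a four-token sentence
hmm_tags = ["B-PER", "I-PER", "O", "B-ORG"]
timbl_tags = ["B-PER", "O", "O", "B-ORG"]
maxent_tags = ["B-PER", "I-PER", "O", "B-LOC"]

voted = majority_vote([hmm_tags, timbl_tags, maxent_tags])
print(voted)
```

With three voters a tie only occurs when all three disagree, so the tie-break rule matters mostly for the multi-class classification phase.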
Features for NE detection
• Contextual
  • anchor word (i.e. the word to be classified)
  • words in a [-3,…,+3] window
• Orthographic
  • capitalization at positions 0, [-3,…,+3]
  • whole anchor word in capitals (e.g. IBM)
  • position of the anchor word in the sentence
• Substring extraction
  • 2- and 3-letter substrings from the left and right sides of the anchor word
• Gazetteer list
  • word at position 0, +1, +2, +3 seen in the list
• Trigger word list
  • word at position 0, [-3,…,+3] seen in the list
Using Language Resource Independent Detection for Spanish NER, Kozareva et al., RANLP’05
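The detection features listed on this slide could be extracted roughly as below. This is a sketch under assumptions: the feature names, the `<PAD>` filler for positions outside the sentence, and the toy gazetteer and trigger lists are all hypothetical, and the original system's exact encoding is not given in the slides.

```python
def detection_features(tokens, i, gazetteer=frozenset(), triggers=frozenset(), window=3):
    """Build a feature dictionary for the anchor word tokens[i]."""
    word = tokens[i]
    feats = {"anchor": word}
    for offset in range(-window, window + 1):
        j = i + offset
        inside = 0 <= j < len(tokens)
        ctx = tokens[j] if inside else "<PAD>"  # assumed filler outside the sentence
        if offset != 0:
            feats[f"word[{offset:+d}]"] = ctx  # contextual: words in the window
        feats[f"cap[{offset:+d}]"] = inside and ctx[:1].isupper()
        feats[f"trigger[{offset:+d}]"] = inside and ctx.lower() in triggers
        if offset >= 0:
            # gazetteer lookup only at positions 0, +1, +2, +3
            feats[f"gaz[{offset:+d}]"] = inside and ctx in gazetteer
    feats["all_caps"] = word.isupper()  # whole anchor word in capitals
    feats["position"] = i               # position of the anchor word in the sentence
    feats["prefix2"], feats["prefix3"] = word[:2], word[:3]
    feats["suffix2"], feats["suffix3"] = word[-2:], word[-3:]
    return feats


tokens = ["Adam", "Smith", "works", "for", "IBM", ",", "London", "."]
feats = detection_features(
    tokens, 4,
    gazetteer={"London"},           # hypothetical gazetteer entries
    triggers={"company", "city"},   # hypothetical trigger words
)
print(feats["anchor"], feats["all_caps"], feats["gaz[+2]"])
```

A dictionary like this would then be flattened into whatever input format each learner expects.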
Features for NE classification
• Contextual
  • whole entity
  • first word of the entity
  • second word of the entity, if present
  • words around the entity in a [-3,…,+3] window
• Orthographic
  • position of the anchor word in the sentence
  • capital, lowercase or other symbol
• Gazetteer list
  • part of the entity in the list
  • whole entity in the list
  • whole entity not in any of these lists
• Trigger lists
  • anchor word
  • words in a [-1,+1] window
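The gazetteer features for classification can be sketched as follows. Reading "part of entity in the list" as "some single word of the entity is itself a list entry" is an assumption, as are the lowercase matching and the toy lists.

```python
def gazetteer_features(entity_tokens, gazetteers):
    """Whole-entity and partial-entity gazetteer lookups, one pair per category list."""
    entity = " ".join(entity_tokens).lower()
    feats = {}
    for category, names in gazetteers.items():
        feats[f"whole_in_{category}"] = entity in names
        # Assumed meaning of "part of entity": any single word is a list entry.
        feats[f"part_in_{category}"] = any(tok.lower() in names for tok in entity_tokens)
    # Whole entity is not in any of the lists
    feats["whole_in_no_list"] = not any(feats[f"whole_in_{c}"] for c in gazetteers)
    return feats


# Hypothetical lowercase gazetteers
gazetteers = {
    "PER": {"adam", "john"},
    "LOC": {"london", "south carolina"},
}
person = gazetteer_features(["Adam", "Smith"], gazetteers)
place = gazetteer_features(["South", "Carolina"], gazetteers)
print(person)
print(place)
```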
Results for NE classification
[Table: F-score for Spanish classification]
NERUA at GeoCLEF
• For English, the feature sets constructed for Spanish were used directly
• NERUA outperformed the rule-based system Dramneri, although both consulted the same gazetteer and trigger word lists
• NERUA took more processing time
University of Alicante at GeoCLEF 2005, Ferrández et al., CLEF’05
Conclusions and future work
• We found a language-resource-independent feature set for NE detection
  • 92.96% of Spanish entities
  • 78.86% of Portuguese entities
• Classifier combination improved NE classification
• Good coverage of the PER, LOC and ORG classes is maintained
• Machine learning systems may outperform rule-based systems, but they need more processing time and hand-labeled resources, which are not available for all languages
Future work
• Find discriminative features for the MISC class
• Perform NER relying on unlabeled data
• Divide the four categories into more detailed ones
• Adapt the system to other languages
• Study ways of constructing gazetteers automatically
Thank you for your attention! Questions?