INTRODUCTION TO ARTIFICIAL INTELLIGENCE

Truc-Vien T. Nguyen
Lab: Named Entity Recognition

Download
• Slides: http://sites.google.com/site/trucviennguyen/Lab NER -- Vien.pdf
• Software: http://sites.google.com/site/trucviennguyen/Teaching/AI/SSHSecureShellClient-3.2.9.rar

Natural Language Processing (NLP)
• Main purpose of NLP
  • Build systems able to analyze, understand and generate the languages that humans use naturally
• Tasks involved
  • Automatic Summarization
  • Information Extraction
  • Speech Recognition
  • Machine Translation
  • …

Information Extraction (1)
• Mapping of texts into a fixed structure representing the key information
[Figure: News 1, News 2 and News 3 shown alongside Form 1, Form 2 and Form 3, each form with WHO, WHAT and WHEN fields]

Information Extraction (2)
Sam Brown retired as executive vice president of the famous hot dog manufacturer, Hupplewhite Inc. He will be succeeded by Harry Jones.
• EVENT: leave job
  • Person: Sam Brown
  • Position: executive vice president
  • Company: Hupplewhite Inc.
• EVENT: start job
  • Person: Harry Jones
  • Position: executive vice president
  • Company: Hupplewhite Inc.
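The two filled templates can be pictured as records with a fixed set of slots. The following is a minimal Python sketch; the `JobEvent` class and its field names are hypothetical and only mirror the slots on the slide, they are not part of any information extraction toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobEvent:
    """One filled template (hypothetical class; fields mirror the slide)."""
    event: str      # "leave job" or "start job"
    person: str
    position: str
    company: str


# The two events extracted from the example sentence.
events = [
    JobEvent("leave job", "Sam Brown", "executive vice president", "Hupplewhite Inc."),
    JobEvent("start job", "Harry Jones", "executive vice president", "Hupplewhite Inc."),
]

for e in events:
    print(f"{e.event}: {e.person}, {e.position} at {e.company}")
```

Every event has the same slots, which is what makes the output "fixed structure": downstream code can rely on the fields being present regardless of how the source text was worded.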

Entity and Relation
• Entity
  • An object in the world
  • Ex. President Bush was in Washington today
  • Example types: Person, Organization, Location, GPE (geo-political entity)
• Relation
  • A relationship between two entities
  • Ex. LocatedIn(“Bush”, “Washington”)
  • Example types: LocatedIn, Family, Employment
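One way to hold entities and relations in a program is sketched below. The `Entity` and `Relation` classes are hypothetical, and typing "Washington" as GPE is an assumption made for illustration (the slide does not assign it a type).

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """A mention of an object in the world (hypothetical class)."""
    text: str
    kind: str   # e.g. "Person", "Organization", "Location", "GPE"


class Relation(NamedTuple):
    """A typed link between two entities (hypothetical class)."""
    kind: str   # e.g. "LocatedIn", "Family", "Employment"
    arg1: Entity
    arg2: Entity


bush = Entity("Bush", "Person")
washington = Entity("Washington", "GPE")   # type assumed for illustration
located_in = Relation("LocatedIn", bush, washington)

print(f"{located_in.kind}({located_in.arg1.text}, {located_in.arg2.text})")
```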

Named Entity Recognition
• Named Entity Recognition
  • Subtask of information extraction
  • Locate and classify elements in text into predefined categories: names of persons, organizations, locations, expressions of time, etc.
• Example
  • James Clarke (Person), director of ABC company (Organization)

CoNLL2003 shared task (1)
• English and German
• 4 types of NEs:
  • LOC Location
  • MISC Names of miscellaneous entities
  • ORG Organization
  • PER Person
• Training Set for developing the system
• Test Data for the final evaluation

CoNLL2003 shared task (2)
• Data
  • Columns separated by a single space
  • One word per line
  • An empty line after each sentence
  • Tags in IOB format
• An example
    Milan NNP B-NP I-ORG
    's POS B-NP O
    player NN I-NP O
    George NNP I-NP I-PER
    Weah NNP I-NP I-PER
    meet VBP B-VP O
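The data format can be read with a few lines of Python. The sketch below is illustrative only: the helper names `read_sentences` and `entity_spans` are hypothetical, and it assumes the entity tag is the last column and that a span continues only while consecutive tokens carry `I-` tags of the same type.

```python
SAMPLE = """\
Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
"""


def read_sentences(text):
    """Split CoNLL-style text into sentences, each a list of column lists."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # an empty line closes a sentence
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line.split())
    if current:
        sentences.append(current)
    return sentences


def entity_spans(rows):
    """Collect (type, text) spans from the IOB tags in the last column."""
    spans, words, kind = [], [], None
    for row in rows:
        word, tag = row[0], row[-1]
        prefix, _, label = tag.partition("-")
        # A span continues only on an I- tag of the same type.
        if tag != "O" and prefix == "I" and label == kind:
            words.append(word)
            continue
        if words:                       # close the span that just ended
            spans.append((kind, " ".join(words)))
        words, kind = ([word], label) if tag != "O" else ([], None)
    if words:
        spans.append((kind, " ".join(words)))
    return spans


sentence = read_sentences(SAMPLE)[0]
print(entity_spans(sentence))
```

On the sample, the two tokens tagged `I-PER` are merged into a single person span, while "Milan" forms an organization span of its own.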

CoNLL2003 shared task (3)
    English     precision   recall    F
    [FIJZ03]    88.99%      88.54%    88.76%
    [CN03]      88.12%      88.51%    88.31%
    [KSNM03]    85.93%      86.21%    86.07%
    [ZJ03]      86.13%      84.88%    85.50%
    ---------------------------------------------------
    [Ham03]     69.09%      53.26%    60.15%
    baseline    71.91%      50.90%    59.61%
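The F column is consistent with the harmonic mean of precision and recall (F with β = 1). A short Python check, using a hypothetical helper `f_score` and two rows of the table as input:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F with beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the table (in percent).
print(round(f_score(88.99, 88.54), 2))  # 88.76, the [FIJZ03] row
print(round(f_score(71.91, 50.90), 2))  # 59.61, the baseline row
```

Because the harmonic mean is pulled toward the smaller of the two values, the baseline's low recall drags its F well below its precision.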

Dataset
• Italian NER: Evalita 2009 (PER/ORG/LOC/GPE)
  • Development set: 223,706 tokens
  • Test set: 90,556 tokens
• English NER: CoNLL 2003 (PER/ORG/LOC/MISC)
  • Training set: 203,621 tokens
  • Development set: 51,362 tokens
  • Test set: 46,435 tokens
• Mention Detection: ACE 2005
  • 599 documents

CRF++ (1)
• Can redefine feature sets
• Written in C++ with STL
• Fast training based on L-BFGS, a quasi-Newton algorithm for large-scale numerical optimization
• Less memory usage both in training and testing
• Encoding/decoding in practical time
• Available as open source software: http://crfpp.googlecode.com/svn/trunk/doc/index.html

CRF++ (2)
• Uses Conditional Random Fields (CRFs)
• CRF methodology: use statistically correlated features and train them discriminatively
• Simple, customizable, open source implementation
• For segmenting/labeling sequential data
• Can define
  • unigram/bigram features
  • relative positions (window size)

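Template basic
• %x[row,col] selects a token from the input: row is relative to the current token, col is the absolute column index
• An example:
    He       PRP  B-NP
    reckons  VBZ  B-VP
    the      DT   B-NP   << CURRENT TOKEN
    current  JJ   I-NP
    account  NN   I-NP

    Template            Expanded feature
    %x[0,0]             the
    %x[0,1]             DT
    %x[-1,0]            reckons
    %x[-2,1]            PRP
    %x[0,0]/%x[0,1]     the/DT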
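The expansion of `%x[row,col]` macros can be imitated in a few lines of Python. This is only a sketch of the idea, not CRF++'s own implementation: the `expand` function is hypothetical, and the `<PAD>` string returned for positions outside the sentence is a placeholder chosen here for illustration.

```python
import re

# The example sentence, one row per token, columns: word, POS tag, chunk tag.
ROWS = [
    ["He", "PRP", "B-NP"],
    ["reckons", "VBZ", "B-VP"],
    ["the", "DT", "B-NP"],      # current token in the example (index 2)
    ["current", "JJ", "I-NP"],
    ["account", "NN", "I-NP"],
]

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, pos):
    """Replace each %x[row,col] with the value at relative row, absolute column."""
    def lookup(match):
        row = pos + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        return "<PAD>"          # illustrative placeholder for out-of-range rows
    return MACRO.sub(lookup, template)


for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]", "%x[0,0]/%x[0,1]"]:
    print(template, "->", expand(template, ROWS, 2))
```

Sliding `pos` over every token of a sentence yields one expanded feature string per template per position, which is how a small template file turns into a large feature set.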

A Case Study
• Installing CRF++
• Data for Training and Test
• Making the baseline
• Training CRF++ on
  • NER dataset: English CoNLL2003, Italian EVALITA
  • Mention classification: ACE 2005 dataset
• Annotating the test corpus with CRF++
• Evaluating results
• Exercise

Installing CRF++
• First, ssh compute-0-x where x=1..10
• Unpack the lab--NER.tar.gz file (tar -xvzf lab--NER.tar.gz)
• Enter the lab--NER directory
• Unpack the CRF++-0.54.tar.gz file (tar -xvzf CRF++-0.54.tar.gz)
• Enter the CRF++-0.54 directory
• Run ./configure
• Run make

Training/Classification (1)
• Notations
  • xxx: train_it.dat / train_en.dat / train_mention.dat
  • nnn: it.model / en.model / mention.model
  • yyy: test_it.dat / test_en.dat / test_mention.dat
  • zzz: test_it.tagged / test_en.tagged / test_mention.tagged
  • ttt: test_it.eval / test_en.eval / test_mention.eval
• Note that test_it.dat already contains the correct NE tags, but the system does not use this information when tagging the data

Training/Classification (2)
• Enter the CRF++-0.54 directory
• Training
    ./crf_learn ../templates/template_4 ../corpus/xxx ../models/nnn
• Classification
    ./crf_test -m ../models/nnn ../corpus/yyy > ../corpus/zzz
• Evaluation
    perl ../eval/conlleval.pl ../corpus/zzz > ../corpus/ttt
• See the results
    cat ../corpus/ttt
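To see how the xxx/nnn/yyy/zzz/ttt placeholders are filled in for one dataset, the sketch below builds the three command lines as strings. The `lab_commands` helper is hypothetical; it only prints the commands (with a space between the training file and the model path) and does not run CRF++.

```python
def lab_commands(name):
    """Return the three shell commands for one dataset: 'it', 'en' or 'mention'."""
    train = f"../corpus/train_{name}.dat"       # xxx
    model = f"../models/{name}.model"           # nnn
    test = f"../corpus/test_{name}.dat"         # yyy
    tagged = f"../corpus/test_{name}.tagged"    # zzz
    report = f"../corpus/test_{name}.eval"      # ttt
    return [
        f"./crf_learn ../templates/template_4 {train} {model}",
        f"./crf_test -m {model} {test} > {tagged}",
        f"perl ../eval/conlleval.pl {tagged} > {report}",
    ]


for command in lab_commands("en"):
    print(command)
```

Calling `lab_commands("it")` or `lab_commands("mention")` gives the corresponding commands for the Italian and mention datasets.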

THANKS
• I used material from
  • Text Processing II: Bernardo Magnini
  • Lab Text Processing II: Roberto Zanoli
