INTRODUCTION TO ARTIFICIAL INTELLIGENCE

INTRODUCTION TO ARTIFICIAL INTELLIGENCE. Truc-Vien T. Nguyen Lab: Named Entity Recognition. Download. Slides http://sites.google.com/site/trucviennguyen/Lab NER -- Vien.pdf Software http://sites.google.com/site/trucviennguyen/Teaching/AI/SSHSecureShellClient-3.2.9.rar.


Presentation Transcript


  1. INTRODUCTION TO ARTIFICIAL INTELLIGENCE • Truc-Vien T. Nguyen • Lab: Named Entity Recognition

  2. Download • Slides http://sites.google.com/site/trucviennguyen/Lab NER -- Vien.pdf • Software http://sites.google.com/site/trucviennguyen/Teaching/AI/SSHSecureShellClient-3.2.9.rar

  3. Natural Language Processing (NLP) • Main purpose of NLP • Build systems able to analyze, understand and generate the languages that humans use naturally • Tasks involved • Automatic Summarization • Information Extraction • Speech Recognition • Machine Translation • …

  4. Information Extraction (1) • Mapping of texts into a fixed structure representing the key information • [Figure: news articles (News 1, News 2, News 3), each mapped to a form (Form 1, Form 2, Form 3) with WHO, WHAT and WHEN fields]

  5. Information Extraction (2) • Text: Sam Brown retired as executive vice president of the famous hot dog manufacturer, Hupplewhite Inc. He will be succeeded by Harry Jones.
        EVENT: leave job
          Person: Sam Brown
          Position: executive vice president
          Company: Hupplewhite Inc.
        EVENT: start job
          Person: Harry Jones
          Position: executive vice president
          Company: Hupplewhite Inc.

  6. Entity and Relation • Entity • An object in the world • Ex. President Bush was in Washington today • Example types: Person, Organization, Location, GPE • Relation • A relationship between two entities • Ex. LocatedIn(“Bush”, “Washington”) • Example types: LocatedIn, Family, Employment

  7. Named Entity Recognition • Named Entity Recognition • Subtask of information extraction • Locate and classify elements in text into predefined categories: names of persons, organizations, locations, time expressions, etc. • Example • James Clarke (Person), director of ABC company (Organization)

  8. CoNLL2003 shared task (1) • English and German languages • 4 types of NEs: • LOC Location • MISC Names of miscellaneous entities • ORG Organization • PER Person • Training set for developing the system • Test data for the final evaluation

  9. CoNLL2003 shared task (2) • Data • Columns separated by a single space • One word per line • An empty line after each sentence • Tags in IOB format • An example (columns: word, POS tag, chunk tag, NE tag):
        Milan   NNP  B-NP  I-ORG
        's      POS  B-NP  O
        player  NN   I-NP  O
        George  NNP  I-NP  I-PER
        Weah    NNP  I-NP  I-PER
        meet    VBP  B-VP  O
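The data format described on this slide can be read with a few lines of Python. The following is a minimal sketch, not part of the lab material: the function names `read_conll` and `entity_spans` are hypothetical, and the span logic assumes the IOB variant shown in the example, where an `I-` tag may open an entity and a `B-` tag only separates two adjacent entities of the same type.

```python
def read_conll(lines):
    """Group token lines into sentences; an empty line closes a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split())  # columns: word, POS, chunk, NE tag
    if current:
        sentences.append(current)
    return sentences


def entity_spans(tags):
    """Return (start, end, type) spans, end exclusive, from a sequence of IOB tags."""
    spans, start, kind = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == kind
        if kind is not None and not continues:
            spans.append((start, i, kind))
            start, kind = None, None
        if prefix in ("B", "I") and not continues:
            start, kind = i, label
    return spans


sample = """Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
"""

sentences = read_conll(sample.splitlines())
tokens = sentences[0]
words = [row[0] for row in tokens]
spans = entity_spans(row[-1] for row in tokens)
for start, end, kind in spans:
    print(kind, " ".join(words[start:end]))
```

On the slide's example this groups "George Weah" into a single PER entity, because consecutive `I-PER` tags continue the same span.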

  10. CoNLL2003 shared task (3) • Results for English:
        System      Precision   Recall    F
        [FIJZ03]    88.99%      88.54%    88.76%
        [CN03]      88.12%      88.51%    88.31%
        [KSNM03]    85.93%      86.21%    86.07%
        [ZJ03]      86.13%      84.88%    85.50%
        ---------------------------------------------------
        [Ham03]     69.09%      53.26%    60.15%
        baseline    71.91%      50.90%    59.61%
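The F column in these results is the harmonic mean of precision and recall, F = 2PR / (P + R). A short Python sketch (the helper name `f_score` is an illustrative choice) recomputes it from the precision and recall values on the slide:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported on the slide.
results = {
    "[FIJZ03]": (88.99, 88.54),
    "[CN03]": (88.12, 88.51),
    "[KSNM03]": (85.93, 86.21),
    "[ZJ03]": (86.13, 84.88),
    "[Ham03]": (69.09, 53.26),
    "baseline": (71.91, 50.90),
}

for name, (p, r) in results.items():
    print(f"{name}: F = {f_score(p, r):.2f}%")
```

Because the harmonic mean is pulled toward the smaller value, the baseline's low recall keeps its F well below its precision.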

  11. Dataset • Italian NER: Evalita 2009 (PER/ORG/LOC/GPE) • Development set: 223,706 tokens • Test set: 90,556 tokens • English NER: CoNLL 2003 (PER/ORG/LOC/MISC) • Training set: 203,621 tokens • Development set: 51,362 tokens • Test set: 46,435 tokens • Mention Detection: ACE 2005 • 599 documents

  12. CRF++ (1) • Feature sets can be redefined • Written in C++ with STL • Fast training based on LBFGS for large-scale data • Low memory usage in both training and testing • Encoding/decoding in practical time • Available as open source software http://crfpp.googlecode.com/svn/trunk/doc/index.html

  13. CRF++ (2) • Uses Conditional Random Fields (CRFs) • CRF methodology: use statistically correlated features and train them discriminatively • Simple, customizable, open source implementation • For segmenting/labeling sequential data • Can define • unigram/bigram features • relative positions (window size)

  14. Template basic • An example:
        He       PRP  B-NP
        reckons  VBZ  B-VP
        the      DT   B-NP   << CURRENT TOKEN
        current  JJ   I-NP
        account  NN   I-NP

        Template           Expanded feature
        %x[0,0]            the
        %x[0,1]            DT
        %x[-1,0]           reckons
        %x[-2,1]           PRP
        %x[0,0]/%x[0,1]    the/DT
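Each `%x[row,col]` macro selects a cell relative to the current token: `row` is the offset in tokens and `col` is the column index. The following Python sketch mimics that expansion for the slide's example. It is an illustration rather than CRF++'s own code: the function name `expand` and the `_PAD_` placeholder for positions outside the sentence are assumptions made here.

```python
import re

# Matches macros of the form %x[row,col]; row may be negative.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, position):
    """Replace every %x[row,col] in template with the addressed cell."""
    def substitute(match):
        row = position + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        return "_PAD_"  # hypothetical placeholder for out-of-sentence positions
    return MACRO.sub(substitute, template)


rows = [
    ["He", "PRP", "B-NP"],
    ["reckons", "VBZ", "B-VP"],
    ["the", "DT", "B-NP"],       # current token
    ["current", "JJ", "I-NP"],
    ["account", "NN", "I-NP"],
]
current = 2

for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]", "%x[0,0]/%x[0,1]"]:
    print(template, "->", expand(template, rows, current))
```

Combining two macros in one template, as in `%x[0,0]/%x[0,1]`, yields a single conjoined feature (`the/DT`) instead of two independent ones.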

  15. A Case Study • Installing CRF++ • Data for training and test • Making the baseline • Training CRF++ on • the NER datasets: English CoNLL2003, Italian EVALITA • mention classification: the ACE 2005 dataset • Annotating the test corpus with CRF++ • Evaluating results • Exercise

  16. Installing CRF++ • First, ssh compute-0-x where x=1..10 • Unzip the lab--NER.tar.gz file (tar -xvzf lab--NER.tar.gz) • Enter the lab--NER directory • Unzip the CRF++-0.54.tar.gz file (tar -xvzf CRF++-0.54.tar.gz) • Enter the CRF++-0.54 directory • Run ./configure • Run make

  17. Training/Classification (1) • Notations • xxx: train_it.dat / train_en.dat / train_mention.dat • nnn: it.model / en.model / mention.model • yyy: test_it.dat / test_en.dat / test_mention.dat • zzz: test_it.tagged / test_en.tagged / test_mention.tagged • ttt: test_it.eval / test_en.eval / test_mention.eval • Note that test_it.dat already contains the correct NE tags, but the system does not use this information when tagging the data

  18. Training/Classification (2) • Enter the CRF++-0.54 directory • Training: ./crf_learn ../templates/template_4 ../corpus/xxx ../models/nnn • Classification: ./crf_test -m ../models/nnn ../corpus/yyy > ../corpus/zzz • Evaluation: perl ../eval/conlleval.pl ../corpus/zzz > ../corpus/ttt • See the results: cat ../corpus/ttt
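To see how the xxx/nnn/yyy/zzz/ttt placeholders map to concrete file names for each dataset, the commands can be generated mechanically. This Python sketch only builds and prints the command lines as given on the slides; it does not run them. The helper name `lab_commands` is hypothetical, and the paths assume the directory layout of the lab archive with the CRF++-0.54 directory as the working directory.

```python
DATASETS = ("it", "en", "mention")


def lab_commands(name, template="../templates/template_4"):
    """Return the train/classify/evaluate/show commands for one dataset."""
    train = f"../corpus/train_{name}.dat"          # xxx
    model = f"../models/{name}.model"              # nnn
    test = f"../corpus/test_{name}.dat"            # yyy
    tagged = f"../corpus/test_{name}.tagged"       # zzz
    evaluation = f"../corpus/test_{name}.eval"     # ttt
    return [
        f"./crf_learn {template} {train} {model}",
        f"./crf_test -m {model} {test} > {tagged}",
        f"perl ../eval/conlleval.pl {tagged} > {evaluation}",
        f"cat {evaluation}",
    ]


for name in DATASETS:
    print(f"# dataset: {name}")
    for command in lab_commands(name):
        print(command)
```

The printed lines can be pasted into the shell one dataset at a time, which makes it easy to compare the three evaluation files afterwards.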

  19. THANKS • I used material from • Text Processing II: Bernardo Magnini • Lab Text Processing II: Roberto Zanoli