Character Gazetteer for Named Entity Recognition with Linear Matching Complexity
Dlugolinsky S., Nguyen G., Laclavik M., Seleng M.
Institute of Informatics, Slovak Academy of Sciences
giang.ui@savba.sk
Content
• Context: Big Data, Natural Language Processing (NLP), Named Entity Recognition (NER)
• Gazetteers
• Tree structures: design and realizations
• NER with linear matching complexity
• Evaluations
• Future work
Work context
• Big Data is produced daily in
  • Social media: Twitter, Google+, Facebook, Instagram, etc.
  • Wikipedia, Wikia, newspapers, …
  • Other internal sources such as transactions, logs, emails, …
• Knowledge and information are hidden in (un|semi-)structured data: text, images, audio, video
• Useful for
  • business or political sentiment analysis
  • public opinion assessment
  • emergency response, etc.
• Text → NLP → Information
• NER is an important task for gaining this information
Natural Language Processing (NLP)
• Incoming text arrives continuously from websites, portals, social media, etc.
• Need to recognize well-known NEs and their occurrences, with references
• NER is an important task for gaining information
Gazetteers
• Basic, independent and very effective NER technique for NE identification in text
• Processing approaches
  • Token-based: split the input text into a sequence of tokens (words)
  • Character-based: process the input text character by character
• NE recognition
  • Machine learning techniques
  • Finite-state machines (FSM)
Related work
• Ontotext Hash Gazetteer
  • Based on hash tables
  • Authors: “3x faster and 4x less memory than FSM equivalent”
  • Available only as part of GATE
• Ontotext Stand-Alone Gazetteer
  • Stand-alone version of the Hash Gazetteer
  • No longer available
• Ontotext Large Knowledge Base Gazetteer
  • Support for ontology-aware NLP
  • Available only as part of GATE
• Other gazetteers are implemented as proprietary look-up code or as complex solutions
Our requirements
• Standalone
  • no 3rd-party libraries needed
  • no reliance on external preprocessing, e.g. tokenization
• Linear-complexity lookup algorithm
  • fast and effective processing of input text as a stream, especially for Big Data
• Editable data structure
  • add/remove NEs between lookups
• Memory-efficient data structure
  • “learn” tens of millions of entities
• Robust
  • input texts of any size
  • any language
Named entity recognition
Character-based with linear matching complexity
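The slides do not spell out the matching procedure, so the following is only a minimal sketch of character-by-character matching over a tree of characters, not the authors' implementation. The class and method names (`CharTrieGazetteer`, `add`, `find`) are hypothetical. The sketch assumes that the longest stored entity has a bounded length L, so the inner walk costs at most L steps per start position and the total work grows linearly with the text length. It also ignores word boundaries, which a real gazetteer would likely have to check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a character tree with HashMap-based child lookup.
class CharTrieGazetteer {
    // One node per character of a stored entity.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if the path from the root spells a stored entity
    }

    private final Node root = new Node();

    // Stores one entity; empty strings are ignored.
    void add(String entity) {
        if (entity.isEmpty()) return;
        Node node = root;
        for (int i = 0; i < entity.length(); i++) {
            node = node.children.computeIfAbsent(entity.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    // Returns every match as {start, endExclusive}, including overlapping,
    // prefix and nested matches. No tokenization is performed.
    List<int[]> find(String text) {
        List<int[]> matches = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            // Walk down the tree for as long as the text keeps matching.
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) break;
                if (node.terminal) matches.add(new int[] {start, i + 1});
            }
        }
        return matches;
    }
}
```

With "New York", "York" and "New York City" stored, `find("I love New York City.")` reports all three entities, which illustrates the overlapping and prefix cases a character gazetteer has to handle.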
HMT and CST realizations
• HMT: Hash Map Tree (multi-way tree)
  • implemented with Java HashMap: constant-time O(1) average performance for the basic operations (get and put)
  • (-) consumes a lot of memory
  • (+) very fast
• CST: Child-Sibling Tree
  • pure and simple Java structure for nodes
  • (+) memory efficient (only 25% of the memory used by HMT)
  • (-) slower (approx. 10x slower than HMT for big data)
• Deals with overlapping, prefix and postfix NE cases
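To make the trade-off concrete, here is a minimal sketch of a child-sibling node layout, assuming the usual first-child/next-sibling representation; the slides do not show the actual node class, so `ChildSiblingTrie` and its fields are hypothetical. Each node holds only a character, two references and a flag, which avoids the per-node HashMap of the HMT variant. The cost is that finding a child means scanning the sibling list instead of doing a hash lookup.

```java
// Hypothetical sketch of a child-sibling character tree.
class ChildSiblingTrie {
    private static final class Node {
        final char ch;      // character stored in this node
        Node firstChild;    // head of the linked list of children
        Node nextSibling;   // next node in the parent's child list
        boolean terminal;   // true if the path from the root spells a stored entity

        Node(char ch) { this.ch = ch; }
    }

    private final Node root = new Node('\0');

    // Linear scan over the sibling list; this replaces HashMap.get in the HMT.
    private static Node findChild(Node parent, char ch) {
        for (Node n = parent.firstChild; n != null; n = n.nextSibling) {
            if (n.ch == ch) return n;
        }
        return null;
    }

    // Stores one entity; empty strings are ignored.
    void add(String entity) {
        if (entity.isEmpty()) return;
        Node node = root;
        for (int i = 0; i < entity.length(); i++) {
            char ch = entity.charAt(i);
            Node child = findChild(node, ch);
            if (child == null) {
                // Prepend the new child to the parent's child list.
                child = new Node(ch);
                child.nextSibling = node.firstChild;
                node.firstChild = child;
            }
            node = child;
        }
        node.terminal = true;
    }

    // Exact lookup of a whole entity.
    boolean contains(String entity) {
        Node node = root;
        for (int i = 0; i < entity.length(); i++) {
            node = findChild(node, entity.charAt(i));
            if (node == null) return false;
        }
        return node.terminal;
    }
}
```

The sibling scan is short for most nodes but can be long near the root, where many distinct first characters branch off. That is one plausible reason for the slowdown reported on the slide.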
Evaluation datasets
• Gazetteer datasets
  • Freebase organizations: 778,814 unique entities
  • Freebase locations: 1,256,552 unique entities
  • Freebase persons: 2,614,401 unique entities
  • Wikipedia titles and alternative names: 9,319,611 unique entities
• Incoming datasets
  • 9,909 documents acquired from the CoNLL-2003 datasets (Reuters text), approximately 29 MB of text
Next steps
• Improve the tree data structure in order to
  • decrease memory requirements
  • make traversal and matching more efficient
• A possible direction is collapsing nodes
  • PHT: Patricia Hash Map Trie
• Work completion
  • Integration into our projects and existing complex tools
  • Open source at http://ikt.ui.sav.sk/gazetteer
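The slides give no details of the planned PHT, so the following is only a generic illustration of what collapsing nodes can mean: a radix-style (Patricia-like) tree in which a chain of single-child nodes is merged into one edge labelled with a string. All names (`CollapsedTrie`, `nodeCount`) are hypothetical, and the sketch is an assumption about the general idea, not the authors' design.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a character tree with collapsed (string-labelled) edges.
class CollapsedTrie {
    private static final class Node {
        String label; // characters on the edge leading into this node
        final Map<Character, Node> children = new HashMap<>(); // keyed by first label character
        boolean terminal;

        Node(String label) { this.label = label; }
    }

    private final Node root = new Node("");
    private int nodeCount = 1; // counts the root

    int nodeCount() { return nodeCount; }

    // Length of the common prefix of label and s.substring(from).
    private static int commonPrefix(String label, String s, int from) {
        int n = Math.min(label.length(), s.length() - from);
        int i = 0;
        while (i < n && label.charAt(i) == s.charAt(from + i)) i++;
        return i;
    }

    // Stores one entity; empty strings are ignored.
    void add(String entity) {
        if (entity.isEmpty()) return;
        Node node = root;
        int pos = 0;
        while (pos < entity.length()) {
            Node child = node.children.get(entity.charAt(pos));
            if (child == null) {
                // No edge starts with this character: hang the whole rest on one new leaf.
                Node leaf = new Node(entity.substring(pos));
                leaf.terminal = true;
                node.children.put(entity.charAt(pos), leaf);
                nodeCount++;
                return;
            }
            int common = commonPrefix(child.label, entity, pos);
            if (common < child.label.length()) {
                // Only part of the edge matches: split it at the divergence point.
                Node mid = new Node(child.label.substring(0, common));
                child.label = child.label.substring(common);
                mid.children.put(child.label.charAt(0), child);
                node.children.put(mid.label.charAt(0), mid);
                nodeCount++;
                child = mid;
            }
            pos += common;
            node = child;
        }
        node.terminal = true;
    }

    // Exact lookup of a whole entity.
    boolean contains(String entity) {
        Node node = root;
        int pos = 0;
        while (pos < entity.length()) {
            Node child = node.children.get(entity.charAt(pos));
            if (child == null || !entity.startsWith(child.label, pos)) return false;
            pos += child.label.length();
            node = child;
        }
        return node.terminal;
    }
}
```

Storing "Bratislava" and "Brno" this way needs 4 nodes (root, "Br", "atislava", "no"), whereas a one-character-per-node tree needs 13 for the same two entities. This saving is what motivates collapsing.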
Thank you for your attention
Giang Nguyen
giang.ui@savba.sk
Cite: Stefan Dlugolinsky, Giang Nguyen, Michal Laclavik, Martin Seleng: "Character Gazetteer for Named Entity Recognition with Linear Matching Complexity", 3rd World Congress on Information and Communication Technologies, WICT'2013, pp. 364-368, IEEE Catalog Number: CFP1395H-ART, ISBN: 978-1-4799-3230-6