Extracting Names Using Layout Clues in Genealogical Books • Aaron Stewart • David W. Embley • March 20, 2010
Finding Names • Name recognition in genealogical texts • Focus: Lists, Directories
Finding Names • Which side was easier? • It’s easy for us to spot names… • But how does a computer do it?
Finding Names • Natural Language Processing? • Stanford Named Entity Recognizer • Apache UIMA Framework • MEMM • CRF
BYU OntoES Ontology Extraction System • Dictionary • Regular Expressions
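To make the dictionary-plus-regular-expression idea concrete, here is a minimal sketch of a baseline name extractor. It is not OntoES itself: the tiny `GIVEN_NAMES` dictionary, the `NAME_RE` pattern, and the `extract_names` function are all hypothetical names invented for illustration.

```python
import re

# Hypothetical mini-dictionary; a real lexicon would be far larger.
GIVEN_NAMES = {"John", "Mary", "William", "Sarah"}

# Assumed pattern: capitalized word, optional middle initial, capitalized word.
NAME_RE = re.compile(r"\b([A-Z][a-z]+)(?: [A-Z]\.)? ([A-Z][a-z]+)\b")


def extract_names(text):
    """Return regex matches whose first word is a known given name."""
    names = []
    for match in NAME_RE.finditer(text):
        # The dictionary check filters out capitalized non-names such as places.
        if match.group(1) in GIVEN_NAMES:
            names.append(match.group(0))
    return names


names = extract_names("John W. Smith married Mary Jones in Salt Lake.")
```

A sketch like this misses any name whose given name is absent from the dictionary, which is the kind of gap a baseline extractor leaves for layout clues to fill.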
Ancestry.com Data • Word text • Word bounding boxes • Genres: Genealogical Books, City Directories, Yearbooks, Newspapers
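As a rough illustration of how word text and word bounding boxes might be represented and grouped into text lines, consider the sketch below. The `Word` fields, the pixel units, the `group_into_lines` function, and the `max_top_gap` threshold are all assumptions; the actual Ancestry.com data format is not described in these slides.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: int    # assumed pixel coordinates of the bounding box
    top: int
    right: int
    bottom: int


def group_into_lines(words, max_top_gap=8):
    """Group words whose top edges lie within max_top_gap of a line's first word."""
    lines = []
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        if lines and abs(word.top - lines[-1][0].top) <= max_top_gap:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Order the words of each line from left to right.
    return [sorted(line, key=lambda w: w.left) for line in lines]


page = [
    Word("John", 170, 52, 210, 70),
    Word("Smith,", 100, 50, 160, 70),
    Word("b.", 130, 80, 150, 100),
    Word("1820", 155, 81, 200, 100),
]
lines = group_into_lines(page)
left_edges = [line[0].left for line in lines]  # one left edge per text line
```

The left edge of each line is the quantity a margin finder would work from.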
Margin Finder – Future Work [Figure: sample page with detected margins; key: Left, Center, Right]
Margin Finder – Future Work • ABBYY FineReader handles: paragraphs, newspaper columns • But has trouble with: hanging indents, outline indentation (possibly)
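One simple way a left-margin finder could cope with hanging indents is to cluster the left edges of text lines and number the clusters from the outermost inward. The sketch below assumes that approach; the slides do not specify the actual algorithm, and `assign_levels` and its `tolerance` parameter are hypothetical.

```python
def assign_levels(left_edges, tolerance=5):
    """Map each line's left x-coordinate to an indent level (1 = outermost)."""
    margins = []  # first x-coordinate seen for each detected margin, ascending
    for x in sorted(set(left_edges)):
        # Start a new margin only when x is clearly to the right of the last one.
        if not margins or x - margins[-1] > tolerance:
            margins.append(x)
    levels = []
    for x in left_edges:
        # Assign the line to the nearest detected margin.
        nearest = min(range(len(margins)), key=lambda i: abs(x - margins[i]))
        levels.append(nearest + 1)
    return levels


levels = assign_levels([100, 102, 100, 99, 101, 130, 100, 131])
markers = [f"LEVEL {n}" for n in levels]
```

Centered and right-aligned text would need the same idea applied to line midpoints and right edges, which this sketch does not attempt.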
Pattern Finding 1. Apply baseline name extractor (OntoES) 2. Apply margin finder and insert markers 3. Find left and right context for each name 4. Apply common context patterns to extract more names
Pattern Finding 1. Apply baseline name extractor (OntoES)
Pattern Finding 2. Apply margin finder and insert markers [Figure: sample page with each line marked LEVEL 1 or LEVEL 2]
Pattern Finding 3. Find left and right context for each name [Figure: sample page with each line marked LEVEL 1 or LEVEL 2]
Pattern Finding 4. Apply common context patterns to extract more names [Figure: sample page with each line marked LEVEL 1 or LEVEL 2]
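The four steps can be sketched end to end on a toy page. This is only an illustration under stated assumptions: each line is a token list that starts with a level marker, a context is the single token on each side of a name, and the helper names (`context_of`, `most_common_context`, `apply_context`), the marker spelling `<LEVEL1>`, and the sample data are all hypothetical.

```python
from collections import Counter


def context_of(tokens, start, end):
    """Return the tokens just left and right of the span tokens[start:end]."""
    left = tokens[start - 1] if start > 0 else "<BOL>"
    right = tokens[end] if end < len(tokens) else "<EOL>"
    return left, right


def most_common_context(lines, spans):
    """Step 3: count contexts of baseline names and keep the most frequent."""
    counts = Counter(context_of(lines[i], s, e) for i, s, e in spans)
    return counts.most_common(1)[0][0]


def apply_context(lines, context, max_len=4):
    """Step 4: extract every short span enclosed by the given context."""
    left, right = context
    found = []
    for i, tokens in enumerate(lines):
        padded = tokens + ["<EOL>"]
        for s in range(1, len(tokens)):
            if tokens[s - 1] != left:
                continue
            # Take the shortest span (at most max_len tokens) ending at `right`.
            for e in range(s + 1, min(s + max_len, len(tokens)) + 1):
                if padded[e] == right:
                    found.append((i, s, e))
                    break
    return found


# Step 2 output (assumed): token lists with margin markers already inserted.
lines = [
    ["<LEVEL1>", "John", "Smith", ",", "farmer"],
    ["<LEVEL1>", "Mary", "Jones", ",", "teacher"],
    ["<LEVEL2>", "son", "of", "Peter"],
    ["<LEVEL1>", "Ezra", "Whitlock", ",", "miller"],
]
baseline = [(0, 1, 3), (1, 1, 3)]  # Step 1 output (assumed): (line, start, end)
pattern = most_common_context(lines, baseline)
extracted = [" ".join(lines[i][s:e]) for i, s, e in apply_context(lines, pattern)]
```

In this toy case the baseline misses "Ezra Whitlock", but the shared context (a level-1 marker on the left, a comma on the right) recovers it. A pattern this loose can also pull in non-names, which is why knowing when to apply a pattern matters.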
Pattern Finding – Sample Results • Baseline Results: Precision 40%, Recall 31.25%, F1 35.09% • Results of Most Salient Pattern: Precision 51.52%, Recall 53.12%, F1 52.31% • Not all results are this good!
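For reference, precision, recall, and F1 can be computed from raw counts as in the sketch below. The counts used here (32 annotated names; 10 correct of 25 extracted for the baseline; 17 correct of 33 extracted for the pattern) are an assumption: they are not stated on the slide, but they are consistent with the reported percentages up to rounding.

```python
def prf1(correct, extracted, gold):
    """Return (precision, recall, F1) from raw counts."""
    precision = correct / extracted   # share of extracted names that are right
    recall = correct / gold           # share of true names that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Assumed counts, chosen to match the reported figures.
baseline = prf1(correct=10, extracted=25, gold=32)
salient = prf1(correct=17, extracted=33, gold=32)

for label, (p, r, f) in [("baseline", baseline), ("salient pattern", salient)]:
    print(f"{label}: P={p:.2%} R={r:.2%} F1={f:.2%}")
```

Because F1 is the harmonic mean, it sits closer to the lower of precision and recall, so the baseline's weak recall pulls its F1 down.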
Challenges • Evaluation • More aligned data • Annotation tool • Other books • Centered and right-aligned text • Knowing when to apply patterns