Sifter
for classifying books dense (or not) in family history information & for selecting 3-page sequences for evaluation
Deryle Lonsdale
1 Oct. 2013
The task
• Develop a data-rich family history text range recognizer
  • Perl
  • Machine learning
  • Mostly OTS components
  • Fully automatic
  • Arbitrary text chunk size
• Evaluate performance
Method
• Document features
  • Language identifier (and confidence)
    • We only want English (for now)
    • Used a pre-existing Perl module (Simões)
  • Type/token ratio
    • We want narrow-domain
  • % FH lexical items
    • We want to prefer FH vocabulary
    • Hand-coded, 49 words (died, married, cremation, etc.)
  • % integer words, % person words, % date words, % organization words, % location words
    • We want it to be data-rich
    • Used Stanford named entity engine
  • Average sentence length
    • Maybe sentences are shorter in FH text?
• One vector (floating-point features) per text chunk (e.g. document)
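A minimal sketch of how a few of the surface features above could be computed for one text chunk. It is written in Python for illustration only (the deck says the actual system was built in Perl). The function name `chunk_features`, the regex tokenizer, and the punctuation-based sentence splitter are all assumptions. The lexicon holds only the three example words named on the slide, not the full hand-coded list of 49. The language-identification and named-entity features are omitted because they depend on external tools.

```python
import re

# Stand-in lexicon: only the three example words given on the slide.
# The real hand-coded list has 49 entries, which are not reproduced here.
FH_LEXICON = {"died", "married", "cremation"}


def chunk_features(text):
    """Return a small feature vector (as a dict of floats) for one text chunk."""
    # Assumed tokenizer: runs of letters/apostrophes, or runs of digits.
    tokens = re.findall(r"[a-z']+|\d+", text.lower())
    # Assumed sentence splitter: break on runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = len(tokens) or 1          # guard against empty chunks
    n_sentences = len(sentences) or 1
    return {
        "type_token_ratio": len(set(tokens)) / n_tokens,
        "pct_fh_lexical": 100 * sum(t in FH_LEXICON for t in tokens) / n_tokens,
        "pct_integer": 100 * sum(t.isdigit() for t in tokens) / n_tokens,
        "avg_sentence_length": len(tokens) / n_sentences,
    }


features = chunk_features("John Smith died in 1994. He married Mary in 1950.")
for name, value in features.items():
    print(f"{name:22s}{value:.3f}")
```

Each chunk yields one such vector, so the same function could in principle be applied to a whole document or to a page range.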
Evaluation
• Gigaword corpus newswire
  • Associated Press Worldstream articles (Nov. 1994-May 1995)
  • 585 obituaries (192,000 words)
  • 649 non-obituaries (221,000 words, randomly selected from 85,000 articles)
• TiMBL machine learning
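TiMBL is a memory-based (nearest-neighbor) learner, so the classification step can be pictured roughly as below. This is a simplified stand-in, not TiMBL itself: it uses plain Euclidean distance with no feature weighting, and the function name `knn_predict` and the two-feature training vectors are made-up toy values for illustration, not data from the experiment.

```python
import math
from collections import Counter


def knn_predict(train, query, k=1):
    """Label a query vector by majority vote among its k nearest training vectors.

    train: list of (feature_vector, label) pairs
    query: feature vector of the same length
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy vectors, e.g. (share of FH words, share of integers).
toy_train = [
    ([0.08, 0.05], "obit"),
    ([0.07, 0.06], "obit"),
    ([0.01, 0.02], "nonobit"),
    ([0.00, 0.03], "nonobit"),
]

print(knn_predict(toy_train, [0.06, 0.05], k=3))
print(knn_predict(toy_train, [0.01, 0.01], k=1))
```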
Results
F-Score beta=1, microav: 0.939263
F-Score beta=1, macroav: 0.939184
AUC, microav: 0.940449
AUC, macroav: 0.940449
overall accuracy: 0.939222 (1159/1234), of which 128 exact matches
Confusion Matrix:
            nonobit   obit
          ----------------
  nonobit |     595     54
     obit |      21    564
      -*- |       0      0
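The summary scores can be re-derived from the confusion matrix, which is a useful sanity check on how to read it. The sketch below assumes rows are actual classes and columns are predicted classes (the row sums, 649 and 585, match the corpus sizes on the Evaluation slide). Reading "microav" as a class-frequency-weighted mean of the per-class F-scores, and the AUC as the mean of the two per-class recalls, are inferences from the fact that the numbers agree to six decimals. The slides do not state either.

```python
# Confusion matrix from the Results slide: matrix[actual][predicted].
matrix = {
    "nonobit": {"nonobit": 595, "obit": 54},
    "obit": {"nonobit": 21, "obit": 564},
}
classes = list(matrix)
total = sum(sum(row.values()) for row in matrix.values())


def support(cls):
    """Number of test items whose actual class is cls."""
    return sum(matrix[cls].values())


def f1(cls):
    """Per-class F-score with beta = 1."""
    tp = matrix[cls][cls]
    fp = sum(matrix[other][cls] for other in classes if other != cls)
    fn = sum(matrix[cls][other] for other in classes if other != cls)
    return 2 * tp / (2 * tp + fp + fn)


def recall(cls):
    return matrix[cls][cls] / support(cls)


accuracy = sum(matrix[c][c] for c in classes) / total
macro_f1 = sum(f1(c) for c in classes) / len(classes)
weighted_f1 = sum(f1(c) * support(c) for c in classes) / total
mean_recall = sum(recall(c) for c in classes) / len(classes)

print(f"accuracy     {accuracy:.6f}")     # reported: 0.939222
print(f"macro F1     {macro_f1:.6f}")     # reported macroav: 0.939184
print(f"weighted F1  {weighted_f1:.6f}")  # reported microav: 0.939263
print(f"mean recall  {mean_recall:.6f}")  # reported AUC: 0.940449
```

Read this way, the 54 non-obituaries labeled as obituaries are the false positives and the 21 obituaries labeled as non-obituaries are the false negatives.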
Feature ranking
• % FH lexical items
• % integers
• % person names
• % dates
• Average sentence length
• Type/token ratio
• % locations
• % organizations
Errors
False positives
• Articles about people perishing in concentration camps
• Crime stories (murders, serial killers, murder trial, terrorist acts)
• Accident stories
False negatives
• Lists of creative works
  Credits from George Abbott's stage career, compiled by his office and from theater reference books: The Misleading Lady, 1913, actor. Yeoman of the Guard, 1915, actor. The Queens Enemies, 1916, actor. Lightnin', 1918, rewrote scenes. …
• Tagging errors
  EDITORS: Two versions of Yugoslavia-Obit-Djilas moved on circuits. Please disregard the second, shorter, unbylined version. The AP
Caveats
• Obituaries, not FH data per se
• Newswire, not books
• One source
• Will it scale?
• Can it port to FSL?
• Didn’t do any ML tuning
• Binary acceptor; continuous values possible?
• Effect of OCR errors?