Citation Extractor
Nguyen Bach, Sue Ann Hong, Ben Lambert

Extraction Task
• AuthorOf(Author, Paper)
• PublishedAt(Paper, Conference)
• IsPaper, IsAuthor, IsConference
• “Citation” = <Paper, Authors, Conference>
• “Pattern” = regular expression

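Method Outline
[Diagram: bootstrapping loop around the Citation DB]
• Seed (e.g. 5 citations) → Citation DB
• Citation DB → Query → Search (WIT) → Web pages (HTML, text)
• Web pages → Extract Patterns using known citations → Page-specific Patterns
• Page-specific Patterns → Extract Citations using new patterns → Citations → Citation DB
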
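A minimal sketch of the loop in the outline, under stated assumptions: the search step is replaced by a hypothetical in-memory `corpus` of page texts, a citation is a `(title, authors, conf)` tuple, and `search`, `extract_pattern`, and `bootstrap` are illustrative names rather than the system's actual code. It only shows how known citations yield page-specific patterns, which in turn yield new citations.

```python
import re

def search(citation, corpus):
    """Stub for the search step: return pages that mention the citation's title."""
    title, _authors, _conf = citation
    return [page for page in corpus if title in page]

def extract_pattern(line, citation):
    """Turn a line containing a known citation into a page-specific regex."""
    title, authors, conf = citation
    fields = {"authors": ", ".join(authors), "title": title, "conf": conf}
    spans = []
    for name, value in fields.items():
        start = line.find(value)
        if start < 0:
            return None  # this line does not contain the whole citation
        spans.append((start, start + len(value), name))
    spans.sort()
    parts, pos = ["^"], 0
    for start, end, name in spans:
        if start < pos:
            return None  # fields overlap: too ambiguous to generalize
        parts.append(re.escape(line[pos:start]))  # literal text between fields
        parts.append(f"(?P<{name}>.+?)")          # slot where the field was
        pos = end
    parts.append(re.escape(line[pos:]) + "$")
    return re.compile("".join(parts))

def bootstrap(seeds, corpus, max_rounds=3):
    """Grow the citation DB until a round finds nothing new (or rounds run out)."""
    db = set(seeds)
    for _ in range(max_rounds):
        found = set()
        for citation in db:
            for page in search(citation, corpus):
                lines = page.splitlines()
                patterns = [extract_pattern(line, citation) for line in lines]
                for pattern in (p for p in patterns if p is not None):
                    for line in lines:
                        m = pattern.match(line)
                        if m:
                            found.add((m.group("title"),
                                       tuple(m.group("authors").split(", ")),
                                       m.group("conf")))
        if found <= db:
            break
        db |= found
    return db

# Made-up page and seed, for illustration only.
corpus = ["Ann Lee, Bob Ray: Fast Parsing. ACL 2004\n"
          "Cy Dee, Eve Fox: Slow Parsing. ACL 2005\n"]
seeds = [("Fast Parsing", ("Ann Lee", "Bob Ray"), "ACL 2004")]
result = bootstrap(seeds, corpus)
```

Here the single seed produces one pattern for the page, and that pattern picks up the second citation on the same page. A real run would also need the robustness and matching fixes discussed on the later slides.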
AUTHOR, AUTHOR: TITLE. CONF
4 Patterns:
• AUTHOR, (A-Za-z): (A-Za-z). (A-Za-z)
• (A-Za-z), AUTHOR: (A-Za-z). (A-Za-z)
• (A-Za-z), (A-Za-z): TITLE. (A-Za-z)
• (A-Za-z), (A-Za-z): (A-Za-z). CONF
Query: "multiple-goal recognition from low-level signals" "Xiaoyong Chai" "Qiang Yang" "AAAI 2005"
Page: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/y/Yang:Qiang.html

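One way to read the four patterns is that each keeps exactly one known field of the seed citation as literal text and leaves the other three as wildcards. The sketch below builds them under that assumption. The function name `one_anchor_patterns` is hypothetical, the wildcard is written as a lazy `(.+?)` rather than the slide's `(A-Za-z)` so that digits and hyphens also match, and the test line is made up.

```python
import re

WILD = r"(.+?)"  # assumed stand-in for the slide's (A-Za-z) wildcard

def one_anchor_patterns(author1, author2, title, conf):
    """Four regexes for the layout 'AUTHOR, AUTHOR: TITLE. CONF'.

    Pattern i keeps field i as literal text and makes the rest wildcards.
    Every field is a capture group, so groups 1-4 are always
    (author1, author2, title, conf).
    """
    known = [author1, author2, title, conf]
    patterns = []
    for anchor in range(len(known)):
        slots = ["(" + re.escape(value) + ")" if i == anchor else WILD
                 for i, value in enumerate(known)]
        patterns.append(re.compile(r"^{}, {}: {}\. {}$".format(*slots)))
    return patterns

patterns = one_anchor_patterns(
    "Xiaoyong Chai", "Qiang Yang",
    "Multiple-Goal Recognition from Low-Level Signals", "AAAI 2005")

# A made-up line in the same layout, sharing one author and the conference.
line = "Some Author, Qiang Yang: Another Paper Title. AAAI 2005"
hits = [bool(p.match(line)) for p in patterns]
fields = patterns[1].match(line).groups()
```

Only the patterns anchored on the shared author and on the conference fire for this line, and either one recovers the remaining fields of the new citation.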
Finding New Citations
[Diagram: the four patterns applied to the page; matched strings are labeled AUTHOR, TITLE, or CONF]
• AUTHOR, (A-Za-z): (A-Za-z). (A-Za-z)
• (A-Za-z), AUTHOR: (A-Za-z). (A-Za-z)
• (A-Za-z), (A-Za-z): TITLE. (A-Za-z)
• (A-Za-z), (A-Za-z): (A-Za-z). CONF

System Spits Out…
• 6 seeds → 60 citations
• 36 of these (partial citations)
  • "Theory and Algorithms for Plan Merging " , " Ming Li"
  • "The Expected Value of Hierarchical Problem-Solving " , " Fahiem Bacchus"
  • "Handling feature interactions in process-planning "
• 14 of these (partial strings)
  • "On D "
  • "On t " , " John Tromp", " Elizabeth Sweedyk", " Umest Vazirani"
  • "An L " , " Ronan Sleep"
  • "To D "
• No new conferences (end-token)

Bootstrapping, Short-Lived
• Highly restrictive regexes
• No recovery
• The more seeds and variety, the better
• Stupid Little Things
  • Mis-capitalization
  • Variations in titles (‘-’ vs. ‘ ’)
  • Etc., etc., etc.…

Extensions ~ Improvements
• Less strict string matching
  • Insensitive to case and punctuation
• Better boundary detection
  • Start/end tokens, HTML wrapper detection?
• Better pattern construction
  • e.g. n authors, not 2
• NER
  • Help find the right "window"
  • A source of ENTITY markers
  • Use like ‘AUTHOR’, ‘TITLE’, ‘CONF’ but with probabilities/confidence values
• Evaluation with DBLP?

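Two of these extensions are easy to illustrate. The sketch below assumes that "less strict matching" means comparing strings after lower-casing and stripping punctuation, and that "n authors" can be handled by a repeated name group. `normalize`, `NAME`, and `N_AUTHORS` are hypothetical names, and the sample line is made up.

```python
import re

def normalize(text):
    """Lower-case, turn punctuation (including hyphens) into spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

# A capitalized first token followed by one or more capitalized tokens,
# e.g. "Ann Lee" or "M. Woszczyna" (a rough assumption about name shape).
NAME = r"[A-Z][\w.\-]*(?: [A-Z][\w.\-]*)+"

# 'AUTHOR(, AUTHOR)*: TITLE. CONF' with any number of authors.
N_AUTHORS = re.compile(
    rf"^(?P<authors>{NAME}(?:, {NAME})*): (?P<title>.+?)\. (?P<conf>.+)$")

same = (normalize("Multiple-Goal Recognition from Low-Level Signals")
        == normalize("multiple-goal recognition from low-level signals "))

m = N_AUTHORS.match("Ann Lee, Bob Ray, Cy Dee: Fast Parsing. ACL 2004")
authors = m.group("authors").split(", ")
```

Comparing normalized strings would let a seed title match a page despite the capitalization and ‘-’ vs. ‘ ’ differences, while the original text can still be kept for the database.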
NER
• Baseline model (News corpus)
  <ENAMEX_TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
  <ENAMEX_TYPE="PERSON"> S. Awodey. </ENAMEX> Topological Representation of the Lambda Calculus. September <ENAMEX_TYPE="PERSON"> 1998. Math. Struct. </ENAMEX> in <ENAMEX_TYPE="LOCATION"> Comp. Sci. (2000), vol. 10, pp. 81--96. </ENAMEX>
• Adapted model (News + citation corpus)
  <ENAMEX_TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the <ENAMEX_TYPE="ORGANIZATION"> International Conference on Acoustics, Speech, </ENAMEX> and Signal Processing.
  <ENAMEX_TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of Realizability. December 1999. Proceedings of <ENAMEX_TYPE="ORGANIZATION"> LICS 2000 </ENAMEX>

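To use tagger output like the above as a source of entity markers, the tagged spans first have to be pulled out. A minimal sketch, assuming the tags appear exactly in the `<ENAMEX_TYPE="...">` form shown on the slide; `entity_spans` is a hypothetical helper, not part of any NER toolkit.

```python
import re

# Matches one tagged span in the form shown on the slide.
ENAMEX = re.compile(r'<ENAMEX_TYPE="([A-Z]+)">\s*(.*?)\s*</ENAMEX>')

def entity_spans(tagged):
    """Return (entity type, text) pairs in order of appearance."""
    return ENAMEX.findall(tagged)

# Second baseline example from the slide.
tagged = ('<ENAMEX_TYPE="PERSON"> S. Awodey. </ENAMEX> Topological Representation '
          'of the Lambda Calculus. September <ENAMEX_TYPE="PERSON"> 1998. Math. '
          'Struct. </ENAMEX> in <ENAMEX_TYPE="LOCATION"> Comp. Sci. (2000), '
          'vol. 10, pp. 81--96. </ENAMEX>')

spans = entity_spans(tagged)
# spans == [("PERSON", "S. Awodey."),
#           ("PERSON", "1998. Math. Struct."),
#           ("LOCATION", "Comp. Sci. (2000), vol. 10, pp. 81--96.")]
```

Listing the spans this way makes the baseline model's errors on citation text easy to see: only the first span is a real person name.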
Lessons Learned
Another Boring Text Slide
• Semi-structured text is surprisingly difficult to read
• Off-line training for wrappers and/or NER may help
• Need very high-confidence rules to ensure precision
• A continuously-running system needs robustness (internet/Google failure, unexpected errors, …)
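For the robustness point, one common approach (an assumption here, not something the slides describe) is to wrap each network call in a retry with exponential backoff, so that a transient search or connection failure does not stop a long-running loop. `with_retries` and `flaky_search` are hypothetical names; the sleep function is injectable so the example runs instantly.

```python
import time

def with_retries(fetch, attempts=4, base_delay=1.0, errors=(OSError,), sleep=time.sleep):
    """Call fetch(); on a listed error wait base_delay * 1, 2, 4, ... and try again."""
    for attempt in range(attempts):
        try:
            return fetch()
        except errors:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller log it and move on
            sleep(base_delay * 2 ** attempt)

# Simulated search call that fails twice, then succeeds.
calls = {"n": 0}

def flaky_search():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("search engine unreachable")
    return ["page text"]

delays = []  # record the waits instead of actually sleeping
result = with_retries(flaky_search, sleep=delays.append)
```

`ConnectionError` and `TimeoutError` are subclasses of `OSError`, so the default `errors` tuple covers typical network failures. Unexpected errors in parsing would need their own handling around each page.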