ViBRANT Virtual Biodiversity Improved Bibliographic Reference Parsing Based on Repeated Patterns Guido Sautter, Klemens Böhm

Bibliographic References - Parsing
Thor, A.U., Cond, S.E. 2012. The article title. The Journal 7: 8-15
[Field labels on the example: Author, Author, Year, Title, Journal, Volume, Pagination]
• Why parse bibliographic references?
  • Generation of BibTeX records, etc.
  • Rendering in different styles
  • Reconciliation
  • …
Absolute necessity when compiling large bibliographies

Bibliographic References - Examples
Diversity with regard to
• Reference style (order of fields, intermediate punctuation)
• Type of referenced work
Examples:
Thor, AU, SE Cond (2012) The article title. The Journal 7: 8-15
Thor, AU, Cond, SE. The article title, The Journal 7 (2012): 8-15
Thor, A.U. 2012. The paper title. Proc. ICST 2012, Location.
Thor, A.U. 2012. The book title, Publisher, Location, 151 pp.
Thor, AU, Cond, SE. 2012. The chapter title. In: Itor, ED (Ed.) The book title. Location: Publisher: 8-15
Thor, AU, SE Cond, 2012. The 3rd article title. In: Itor, ED (Ed.) The 1st special issue. The Journal 7: 8-15

Bibliographic References - Fields
• Fields present in references to (almost) all types of works:
  • Authors (can be given in different styles)
  • Year of publication (four-digit Arabic number)
  • Title
• Fields present in references to specific types of works:
  • Publisher and Location / Journal name
  • Pagination ((mostly) Arabic number or number range)
  • Volume / issue / numero number (Arabic number)
  • Volume title / Proceedings title
  • Editors (can be given in different styles)
  • URL / DOI / ISBN / ISSN

Overview
• Bibliographic References
• Previous Parsing Approaches
• The RefParse Algorithm
• Evaluation
• Summary & Outlook

Pattern Based Parsers
• Principle:
  • Patterns match individual field values
  • Meta patterns arrange field patterns
  • One meta pattern per reference style
• Most prominent: ParaCite (now offline)
• Strengths:
  • Numerical fields
  • Author names
• Weaknesses:
  • Meta patterns to be created for every single reference style
  • Combinatorial explosion with alternatives for individual fields

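To make the field pattern / meta pattern split concrete, here is a minimal sketch in Python. The regular expressions and the names `FIELD_PATTERNS`, `META_AUTHOR_YEAR` and `parse_with_meta_pattern` are illustrative assumptions, not ParaCite's actual patterns. It handles exactly one reference style, which is the weakness named above: any other style needs its own meta pattern.

```python
import re

# Field patterns: each matches one individual field value
# (hypothetical and heavily simplified).
FIELD_PATTERNS = {
    "authors": r"(?P<authors>[A-Z][a-z]+, (?:[A-Z]\.)+(?:, [A-Z][a-z]+, (?:[A-Z]\.)+)*)",
    "year": r"(?P<year>[12][0-9]{3})",
    "title": r"(?P<title>[^.]+)",
    "journal": r"(?P<journal>[A-Z][A-Za-z ]+?)",
    "volume": r"(?P<volume>[0-9]+)",
    "pages": r"(?P<pages>[0-9]+(?:-[0-9]+)?)",
}

# Meta pattern: arranges the field patterns for ONE reference style,
# here "Authors Year. Title. Journal Volume: Pages".
META_AUTHOR_YEAR = (
    "{authors} {year}\\. {title}\\. {journal} {volume}: {pages}"
).format(**FIELD_PATTERNS)


def parse_with_meta_pattern(reference):
    """Return a dict of field values, or None if the style does not match."""
    match = re.fullmatch(META_AUTHOR_YEAR, reference)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Matches the style the meta pattern was written for.
    print(parse_with_meta_pattern(
        "Thor, A.U., Cond, S.E. 2012. The article title. The Journal 7: 8-15"))
    # A different style of the same reference is not recognized at all.
    print(parse_with_meta_pattern(
        "Thor, AU, SE Cond (2012) The article title. The Journal 7: 8-15"))
```
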
Learning Based Parsers
• Learn statistical models from pre-parsed references
  • Hidden Markov Models
  • Conditional Random Fields
  • Finite State Transducers
  • etc.
• Strengths:
  • Can handle all cases covered in training set
  • No handcrafting of rules or patterns
• Weaknesses:
  • Need for training data covering all cases
  • Usually do not exploit morphology
  • Incremental training hard

Knowledge Based Parsers
• Divide references into blocks at punctuation marks
• Classify blocks by comparing them to knowledge base
• Examples: FLUX-CiM, INFOMAP
• Strengths:
  • No handcrafting of rules or patterns
  • Learn domain specific journal names, etc. very well
• Weaknesses:
  • Need for representative training data covering domain
  • Abbreviations interfere with blocking
  • Problems with numerical fields
  • Problems with highly variable fields like author names

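The blocking step, and the way abbreviations interfere with it, can be sketched as follows. This is a naive illustration under the assumption that blocks are cut at every period, comma, colon and parenthesis. The function name `split_into_blocks` is made up here and does not reflect how FLUX-CiM or INFOMAP are implemented.

```python
import re


def split_into_blocks(reference):
    """Naive blocking: cut the reference at punctuation marks."""
    blocks = re.split(r"[.,:()]", reference)
    # Drop empty fragments and surrounding whitespace.
    return [block.strip() for block in blocks if block.strip()]


if __name__ == "__main__":
    reference = "Thor, A.U. 2012. The paper title. Proc. ICST 2012, Location."
    for block in split_into_blocks(reference):
        print(repr(block))
```

On this input, the initials "A.U." fall apart into the blocks "A" and "U", and the abbreviated "Proc. ICST 2012" is cut into "Proc" and "ICST 2012", so neither the author name nor the proceedings title survives as a single block.
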
Alignment Based Parsers
• Morphologically classify words, numbers, punctuation marks
• Interpret sequence of classes as gene sequence
• Try to align this sequence with learned one
• Strengths:
  • No handcrafting of rules or patterns
  • Learn reference styles
• Weaknesses:
  • Need for representative training data covering many cases
  • Abbreviations interfere with alignment

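A small sketch of the morphological classification step, assuming a hypothetical class alphabet (N = number, A = all-uppercase word, C = capitalized word, l = other word, punctuation kept as is). The slides do not specify the classes actually used, and the alignment step itself is not shown here.

```python
import re

# Words/numbers, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def token_class(token):
    """Map a token to a one-character morphological class (assumed alphabet)."""
    if token.isdigit():
        return "N"          # number
    if token.isalnum():
        if token.isupper():
            return "A"      # all uppercase, e.g. initials "AU"
        if token[0].isupper():
            return "C"      # capitalized word
        return "l"          # any other word
    return token            # punctuation mark stands for itself


def class_sequence(reference):
    """Turn a reference string into its sequence of token classes."""
    return "".join(token_class(token) for token in TOKEN.findall(reference))


if __name__ == "__main__":
    print(class_sequence("Thor, AU. 2012. The article title. The Journal 7: 8-15"))
    print(class_sequence("Cond, SE. 2012. Another article title. Another Journal 12: 101-115"))
```

Both calls yield the same class sequence, `C,A.N.Cll.CCN:N-N`, which is what makes references of one style alignable even though their words differ.
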
RefParse: The Idea
• Observation of previous approaches:
  • For each field, some approach is strong
  • Reference styles need to be in training set or created manually
• Observation when gathering data:
  • References rarely come individually
  • Paper bibliographies are a common source
  → Lists of references following the same style
• Idea:
  • Exploit structural redundancy given in reference lists
  • Use individual approaches for fields they handle best

Exploiting Redundancy
• Get field values that patterns identify reliably:
  • Author names (all possible styles)
  • Numerical elements (year, volume, etc., pagination)
  • Ambiguous numbers become candidates for all they match
• Generate all possible field arrangements
• Compare field arrangements across reference list …
• … and pick the one that fits the best
• Align references against one another …
• … to infer meta pattern at runtime

Reference Alignment - Example
Thor, AU. The article title. The Journal 1998 (1987): 1997
Cond, SE. Another article title. Another Journal 7 (2012): 8-15
[Figure annotations: ambiguous numbers in the first reference are marked "Volume? Year? Page?"; the numbers in the second reference are marked Volume, Year, Pages]
• Only alignment with second reference disambiguates numbers in first one
• Exploiting redundancy overcomes inherent weaknesses of reference-by-reference parsers

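The example above can be replayed with a toy sketch. It assumes that every number is a candidate for each role it could fill, that an arrangement assigns one role per number in order, and that the arrangement supported by the most references in the list wins. The simple vote is a stand-in for "pick the one that fits the best", since the slides do not give RefParse's actual scoring, and all function names are hypothetical.

```python
import re
from collections import Counter
from itertools import permutations

NUMBER = re.compile(r"\d+(?:-\d+)?")
ROLES = ("volume", "year", "pages")


def candidate_roles(number):
    """Roles a number could play (assumed rules, simplified)."""
    if "-" in number:
        return {"pages"}               # a range can only be pagination
    roles = {"volume", "pages"}
    if len(number) == 4:
        roles.add("year")              # four-digit number: possible year
    return roles


def arrangements(reference):
    """All role orderings consistent with the numbers in one reference."""
    numbers = NUMBER.findall(reference)
    return [
        perm for perm in permutations(ROLES, len(numbers))
        if all(role in candidate_roles(num) for role, num in zip(perm, numbers))
    ]


def best_arrangement(reference_list):
    """Pick the arrangement that the most references in the list allow."""
    votes = Counter()
    for reference in reference_list:
        votes.update(arrangements(reference))
    arrangement, _ = votes.most_common(1)[0]
    return arrangement


def label_numbers(reference, arrangement):
    """Assign the chosen roles to the numbers of one reference."""
    return dict(zip(arrangement, NUMBER.findall(reference)))


if __name__ == "__main__":
    refs = [
        "Thor, AU. The article title. The Journal 1998 (1987): 1997",
        "Cond, SE. Another article title. Another Journal 7 (2012): 8-15",
    ]
    print(len(arrangements(refs[0])))   # first reference alone is ambiguous
    style = best_arrangement(refs)
    print(label_numbers(refs[0], style))
```

Taken alone, the first reference admits six arrangements. The second admits only (volume, year, pages), and that shared arrangement resolves the first reference to volume 1998, year 1987, page 1997.
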
Reference Alignment Result
[Pipeline diagram: Reference List → 1. Base Element Extraction → 2a. Author List Assembly → 2b. Author List Selection → 3. Reference Style Inference → 4. Volume Reference Extraction → 5. Periodical / Publisher Extraction → 6. Title Extraction → Parsed References]
• After alignment steps, RefParse has identified
  • Author lists, including style
  • Years of publication
  • Pagination (where present)
  • Volume / issue / numero numbers (where present)
  • Reference style (order of fields, intermediate punctuation)

Handling Volume References (pipeline step 4)
Thor, AU, Cond, SE. 2012. The chapter title. In: Itor, ED (Ed.) The book title. Location: Publisher: 8-15
Thor, AU, SE Cond, 2012. The 3rd article title. In: Itor, ED (Ed.) The 1st special issue. The Journal 7: 8-15
• Embedded references to books or journal volumes
• In principle, references on their own (save for the year)
• Extract and handle in recursive step

Journal / Publisher Extraction (pipeline step 5)
• Morphologically, names of journal and publisher very similar (word block in title case)
• Sometimes heavily abbreviated (dots interfere with blocking)
  • Recognize title case abbreviation blocks
  • Handle parts in brackets / quotes as single blocks
• Use patterns to find candidates (optionally, use lexicons)
• Choose candidate closest to volume number / pagination

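A rough sketch of the last two bullets, assuming a made-up candidate pattern (runs of title-case words or title-case abbreviations) and assuming the volume number is followed by a colon and the pagination. Neither assumption comes from the slides, and RefParse's real patterns are likely more elaborate.

```python
import re

# Hypothetical candidate pattern: a run of title-case words, each
# optionally ending in an abbreviation dot ("J. Appl." or "The Journal").
CANDIDATE = re.compile(r"\b[A-Z][A-Za-z]*\.?(?:\s+[A-Z][A-Za-z]*\.?)*")

# Hypothetical anchor: "volume: pages" as in "7: 8-15".
VOLUME_PAGES = re.compile(r"\d+\s*:\s*\d+(?:-\d+)?")


def closest_to_volume(reference):
    """Return the title-case block ending closest before the volume number."""
    anchor = VOLUME_PAGES.search(reference)
    if anchor is None:
        return None
    prefix = reference[:anchor.start()]
    candidates = [(m.end(), m.group()) for m in CANDIDATE.finditer(prefix)]
    if not candidates:
        return None
    # Smallest distance between the candidate's end and the volume number.
    _, text = min(candidates, key=lambda c: anchor.start() - c[0])
    return text


if __name__ == "__main__":
    print(closest_to_volume(
        "Thor, A.U. 2012. The article title. The Journal 7: 8-15"))
    print(closest_to_volume(
        "Thor, AU. 2012. The article title. J. Appl. Ref. Pars. 7: 8-15"))
```

Treating a trailing dot as part of a title-case word keeps an abbreviated name such as the invented "J. Appl. Ref. Pars." together as one block instead of splitting it at each dot.
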
Title Extraction – Finally (pipeline step 6)
• Title most important field of reference …
• … but also most variable one → pattern matching hard
• Having identified all other fields, however …
• … title is what remains in middle of reference
• Circumvents matching or aligning title

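The "title is what remains" idea can be sketched as taking the longest stretch of text not covered by any already identified field. The helpers `span_of` and `extract_title` and the choice of the longest gap are assumptions for illustration. The slides only say that the title is what is left in the middle of the reference.

```python
def span_of(reference, value):
    """Character span (start, end) of a field value inside the reference."""
    start = reference.index(value)
    return (start, start + len(value))


def extract_title(reference, field_spans):
    """Return the longest text between two neighbouring identified fields."""
    spans = sorted(field_spans)
    best = ""
    for (_, end), (start, _) in zip(spans, spans[1:]):
        # Strip separating punctuation and whitespace around the gap.
        gap = reference[end:start].strip(" .,:;()")
        if len(gap) > len(best):
            best = gap
    return best


if __name__ == "__main__":
    reference = "Thor, A.U., Cond, S.E. 2012. The article title. The Journal 7: 8-15"
    # Fields assumed to be found by the earlier steps.
    found = ["Thor, A.U., Cond, S.E.", "2012", "The Journal", "7", "8-15"]
    spans = [span_of(reference, value) for value in found]
    print(extract_title(reference, spans))
```
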
Experimental Setup
• Corpora:
  • Cora Corpus: 500 individual references
  • Plazi Corpus: ~25,000 references from ~1,000 documents
• Experiments:
  • RefParse without training (empty lexicons)
  • RefParse with training (50% / 50% data split)
  • ParsCit (model based parser for comparison)
  • FreeCite (model based parser for comparison)

Experiments with Cora Corpus
• RefParse clearly outperforms related approaches
Interestingly, accuracy is lower with training (more on this in a minute)

Experiments with Plazi Corpus
• RefParse clearly outperforms related approaches
Again, accuracy is lower with training (see next slide)

Lexicons can be Harmful ?!
• Observation in experiments: accuracy for title and journal/publisher lower with lexicons
• Totally counter-intuitive at first glance
• What happens:
  • Frequent infix of long, rare journal name found in lexicon …
  • … and is taken as journal name proper …
  • … preventing whole journal name from being found
→ Infix Match Problem

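A toy illustration of the infix match problem. The lexicon content, the long journal name and the function `lexicon_lookup` are all invented for this sketch. They only show how a substring lookup can return a frequent shorter name that sits inside a longer, rarer one.

```python
# Hypothetical lexicon learned from training data: it contains a frequent
# short journal name, but not the long, rare one used below.
LEXICON = {"Journal of Zoology"}


def lexicon_lookup(text, lexicon):
    """Return the longest lexicon entry occurring anywhere in the text."""
    hits = [entry for entry in lexicon if entry in text]
    return max(hits, key=len) if hits else None


if __name__ == "__main__":
    # Invented long journal name containing the lexicon entry as an infix.
    journal_block = "North Atlantic Journal of Zoology and Botany"
    hit = lexicon_lookup(journal_block, LEXICON)
    print(hit)
    # The lexicon hit covers only part of the real name.
    print(hit != journal_block)
```

With an empty lexicon the lookup returns nothing and the whole title-case block remains available as a candidate, which matches the observation that accuracy dropped once lexicons were filled.
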
Summary
• RefParse algorithm:
  • Combines strengths of previous approaches
  • Processes whole reference lists
  • Infers reference style by mutual alignment
  • Independent of training data
• RefParse clearly outperforms previous approaches
• Lexicon lookup phenomenon: Infix Match Problem

Outlook
• Overcome infix match problem
• Improve overall accuracy in title and journal/publisher
  • Blocking & block scoring (akin to knowledge based parsers)
  • Exploiting redundancy to find separating punctuation
• Gather experience in real-world deployment

ViBRANT Virtual Biodiversity
Guido Sautter, Klemens Böhm: Improved Bibliographic Reference Parsing Based on Repeated Patterns
Questions?