Universität Karlsruhe (TH) Research University – founded 1825
FAT – Finding All Taxa (in Text Documents)
Guido Sautter, Donat Agosti, Klemens Böhm
FAT – Basic Principle
• Generate taxon name candidates
• Find out which candidates actually are taxon names
• Divide the text into
  • Sure positives
  • Sure negatives
  • Candidates
• Use sure positives and negatives to deal with candidates
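The three-way division can be illustrated with a minimal sketch. The function name `partition` and the two lookup sets are assumptions made for illustration; the slides do not show how FAT represents these classes internally.

```python
def partition(phrases, known_taxa, known_non_taxa):
    """Split phrases into sure positives, sure negatives, and open candidates.

    known_taxa and known_non_taxa are hypothetical lookup sets standing in
    for whatever evidence makes a phrase a "sure" positive or negative.
    """
    positives, negatives, candidates = [], [], []
    for phrase in phrases:
        if phrase in known_taxa:
            positives.append(phrase)
        elif phrase in known_non_taxa:
            negatives.append(phrase)
        else:
            candidates.append(phrase)  # undecided, handled by later rules
    return positives, negatives, candidates
```

Only the third list needs further work; the first two serve as evidence when deciding on the remaining candidates.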
FAT – Detail Overview
• Find all parts of the text that might be taxon names using
  • Morphological structure (in the form of regular expressions)
  • Known taxon names (as positive gazetteer lists)
• Successively rule candidates to be taxa or not using
  • Morphological structure (in the form of regular expressions)
  • Known taxon names (as positive gazetteer lists)
  • Textual hints (name labels, e.g. “sp. nov.”)
  • Ruled-out words (as negative gazetteer lists)
  • Common dictionaries (as negative gazetteer lists)
  • Document-internal contradictions
  • User feedback (as a last resort)
FAT – Basic Benefits / Deficits
• Benefits
  • All available knowledge is used
  • Newly added knowledge is used as early as possible
  • Can learn new taxa through use of structure
  • User can avoid errors through feedback at little effort
• Deficits
  • Regular expression patterns are somewhat inflexible regarding
    • Automated adaptation to different document styles
    • Language-dependent capitalization schemes (e.g. in German)
  • Gazetteer lists are somewhat susceptible to
    • Misspellings / OCR errors
    • Unseen languages
Morphological Rules
• Exploit (Linnaean / ICZN) rules of nomenclature
• Challenges:
  • Different schemes of in-taxon-name punctuation
  • Embedded author names (differing styles, strange names)
• Implementation:
  • Editor for basic building blocks, including
    - line-broken and indented layout
    - syntax check and test facilities
  • Actual expressions assembled dynamically at runtime, so (almost) all parts are maintainable in one place
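Assembling expressions from building blocks at runtime might look like the sketch below. The block names, the patterns in `BLOCKS`, and the `assemble` helper are all assumptions for illustration; the actual FAT patterns are not shown in the slides and are certainly more elaborate.

```python
import re

# Hypothetical building blocks, each maintained in exactly one place.
BLOCKS = {
    "genus": r"[A-Z][a-z]+",
    "epithet": r"[a-z]{3,}",
    "author": r"[A-Z][a-z]+(?:,? \d{4})?",
}

def assemble(template, blocks=BLOCKS):
    """Fill {name} placeholders with the current block definitions."""
    wrapped = {name: f"(?:{pattern})" for name, pattern in blocks.items()}
    return re.compile(template.format(**wrapped))

# A binomial with an optional trailing author citation.
BINOMIAL = assemble(r"\b{genus} {epithet}(?: {author})?")
```

Changing one entry in `BLOCKS` changes every expression assembled from it afterwards, which is the maintainability benefit the slide points to.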
Gazetteer Lists
• Storage for known taxon names / epithets / authors
• Challenges:
  • Huge amount of data (main memory footprint)
  • Misspellings (source text or OCR)
• Implementation:
  • Editor for lists, including
    - import / export
    - add / intersect / subtract functions
  • Centralized access point (lists loaded and stored only once)
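The list operations and the load-once access point can be sketched as follows. The `Gazetteer` class, the `get_list` function, and the registry dictionary are hypothetical names chosen for this example, not part of FAT or GoldenGATE.

```python
class Gazetteer:
    """A case-insensitive word list with add / intersect / subtract."""

    def __init__(self, entries=()):
        self._entries = frozenset(e.strip().lower() for e in entries if e.strip())

    def __contains__(self, word):
        return word.strip().lower() in self._entries

    def __len__(self):
        return len(self._entries)

    def add(self, other):
        return Gazetteer(self._entries | other._entries)

    def intersect(self, other):
        return Gazetteer(self._entries & other._entries)

    def subtract(self, other):
        return Gazetteer(self._entries - other._entries)


_REGISTRY = {}  # central access point: each list is loaded at most once

def get_list(name, loader):
    """Return the named list, calling loader() only on first access."""
    if name not in _REGISTRY:
        _REGISTRY[name] = Gazetteer(loader())
    return _REGISTRY[name]
```

Sharing one in-memory copy per list is one simple way to limit the main memory footprint mentioned above; it does nothing for misspellings, which would need fuzzy matching.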
Running FAT (Overview)
[Figure: processing pipeline. Input: Document Text. Stages, in order: Recall Rules, Dictionary Filter, Lexicon Rules, Label Rules, Precise Rules, Known Data Rules, Author Name Rules, Negative Rules, Data Rules, Dynamic Lexicon Rules, User Feedback. Outputs: Taxon Names, Candidates, Not Taxon Names.]
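One way to read the overview is as a chain of stages, each of which may accept a candidate, reject it, or leave it open for the next stage. The sketch below assumes that reading; `run_pipeline` and the stage signature are illustrative choices, not FAT's actual interfaces.

```python
def run_pipeline(candidates, stages):
    """Pass candidates through the stages in order.

    Each stage is a callable (candidate, positives, negatives) returning
    True (taxon name), False (not a taxon name), or None (still undecided).
    """
    positives, negatives = set(), set()
    remaining = list(candidates)
    for stage in stages:
        still_open = []
        for cand in remaining:
            verdict = stage(cand, positives, negatives)
            if verdict is True:
                positives.add(cand)
            elif verdict is False:
                negatives.add(cand)
            else:
                still_open.append(cand)
        remaining = still_open
    return positives, negatives, remaining
```

Because every stage sees the positives and negatives found so far, knowledge gained early in the chain is available to all later stages.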
Recall Rules
• Create candidates:
  • morphological structure
  • filter out matches that contain stop words
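A recall-oriented candidate generator could be sketched like this. The loose pattern and the stop-word set are assumptions for illustration; they deliberately over-generate, since later stages are responsible for precision.

```python
import re

# Hypothetical loose pattern: a capitalized word followed by a lowercase word.
LOOSE = re.compile(r"\b[A-Z][a-z]+ [a-z]+\b")

# Hypothetical stop-word list; a real one would be far longer.
STOP_WORDS = {"the", "of", "and", "in", "is", "this", "was"}

def recall_candidates(text):
    """Return loose matches that contain no stop word."""
    candidates = []
    for match in LOOSE.finditer(text):
        words = match.group(0).lower().split()
        if not any(word in STOP_WORDS for word in words):
            candidates.append(match.group(0))
    return candidates
```

For the text "The ant Atta cephalotes was found. Several workers were collected." this keeps both "Atta cephalotes" and the false candidate "Several workers", which a later filter has to remove.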
Dictionary Filter
• Filter candidates:
  • gazetteer based
  • filter out candidates with common language words in epithet positions (+ stemming for English)
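A minimal sketch of such a filter follows. The tiny word list and the crude suffix-stripping `stem` function are stand-ins chosen for this example; the slides do not say which dictionary or stemmer FAT uses.

```python
# Hypothetical common-language dictionary (base forms only).
COMMON_WORDS = {"worker", "collect", "several", "found", "nest"}

def stem(word):
    """Very crude English stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def passes_dictionary_filter(candidate, common_words=COMMON_WORDS):
    """Reject candidates with a common word in any epithet position."""
    epithets = candidate.lower().split()[1:]  # everything after the genus
    return not any(w in common_words or stem(w) in common_words for w in epithets)
```

Stemming lets a dictionary of base forms catch inflected words such as "workers", at the risk of occasionally stemming a Latin epithet onto an English word.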
Lexicon Rules
• Exploit known epithets:
  • candidates → matches
  • create further candidates
Label Rules
• Exploit taxon name labels:
  • labeled candidates → matches
  • “Genus species, sp. nov.”
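A label rule of this kind can be sketched with a single regular expression. The pattern below covers only a few label spellings ("sp. nov.", "sp. n.", "n. sp.", "nov. sp.") and is an assumption for illustration, not FAT's actual rule set.

```python
import re

# Hypothetical label pattern: a binomial followed by a "new species" label.
LABELED = re.compile(
    r"\b([A-Z][a-z]+ [a-z]+),? (?:sp\. (?:nov|n)\.|(?:n|nov)\. sp\.)"
)

def labeled_names(text):
    """Return the binomials that are directly followed by a label."""
    return [match.group(1) for match in LABELED.finditer(text)]
```

A label is strong evidence, so names found this way can be treated as matches without consulting any gazetteer, which is how previously unseen taxa can be picked up.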
Precise Rules
• Exploit morphology:
  • candidates with distinctive structure → matches
  • “Genus species st. race”
Known Data Rules
• Exploit prior runs:
  • Extract epithets from candidates
  • Candidates that combine known epithets → matches
Author Name Rules
• Exclude candidates with author names in genus or subgenus position
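The check can be sketched as follows, assuming candidates have the shape "Genus species" or "Genus (Subgenus) species". The author set and the function name are hypothetical; in FAT the author names would come from a gazetteer list.

```python
import re

# Hypothetical author gazetteer, stored in lowercase.
AUTHORS = {"linnaeus", "forel", "emery", "wheeler"}

# Genus, optionally followed by a parenthesized subgenus.
HEAD = re.compile(r"^([A-Z][a-z]+)(?: \(([A-Z][a-z]+)\))?")

def has_author_in_genus_position(candidate, authors=AUTHORS):
    """True if the genus or subgenus slot holds a known author name."""
    match = HEAD.match(candidate)
    if match is None:
        return False
    return any(part and part.lower() in authors for part in match.groups())
```

A phrase such as "Forel described" matches the loose morphological pattern, but its first word is a known author, so it is excluded.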
Negative Rules
• Exclude candidates with words from negatives (all text excluded so far) in epithet positions
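A sketch of this rule, assuming negatives are kept as a collection of rejected phrases; the function name and data layout are illustrative only.

```python
def apply_negative_rule(candidates, negatives):
    """Exclude candidates whose epithet positions reuse rejected words."""
    negative_words = {w.lower() for phrase in negatives for w in phrase.split()}
    kept, excluded = [], []
    for cand in candidates:
        epithets = [w.lower() for w in cand.split()[1:]]  # skip the genus
        if any(w in negative_words for w in epithets):
            excluded.append(cand)
        else:
            kept.append(cand)
    return kept, excluded
```

Each phrase excluded here also becomes a negative, so the pool of ruled-out words grows as processing continues.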
Data Rules
• Candidates with known epithets in last position → matches
Dynamic Lexicon Rules
• Exploit matches & negatives:
  • Works like a combination of the preceding lexicon-based rules
  • But uses the matches and negatives of the current document
  • Compute transitive hull
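Computing the transitive hull amounts to repeating the lexicon-style rules until nothing changes any more. The sketch below assumes two simple rules (reject on a negative word in an epithet position, accept on a genus or last epithet already seen in a match of this document); the concrete rules FAT applies are not given in the slides.

```python
def dynamic_lexicon(matches, negatives, candidates):
    """Iterate to a fixed point over the current document's own evidence."""
    matches, negatives, open_ = set(matches), set(negatives), set(candidates)
    changed = True
    while changed:
        changed = False
        # Rebuild the document-internal lexicons from everything decided so far.
        genera = {m.split()[0].lower() for m in matches}
        epithets = {w.lower() for m in matches for w in m.split()[1:]}
        neg_words = {w.lower() for n in negatives for w in n.split()[1:]}
        for cand in sorted(open_):
            words = [w.lower() for w in cand.split()]
            if not words:
                continue
            if any(w in neg_words for w in words[1:]):
                negatives.add(cand)
            elif words[0] in genera or words[-1] in epithets:
                matches.add(cand)
            else:
                continue  # still undecided in this pass
            open_.discard(cand)
            changed = True
    return matches, negatives, open_
```

With "Atta sexdens" as the only initial match, "Atta laevigata" is accepted in the first pass through its genus, and "Acromyrmex laevigata" only in the second pass through the newly learned epithet, which is the transitive effect.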
User Feedback
• Ask user to decide on remaining candidates (displaying some context)
• Optional step, can be omitted
Questions?
• Browse Madagascar Corpus at http://plazi.org/GgSRS/search
• Download GoldenGATE from http://idaho.ipd.uka.de/GoldenGATE/