Next steps for BHL and Linked Data. John Mignault Technical Advisory Group Biodiversity Heritage Library Twitter: @jmignault. The Biodiversity Heritage Library. BHL is a consortium of natural history, botanical libraries and research institutions
Next steps for BHL and Linked Data John Mignault Technical Advisory Group Biodiversity Heritage Library Twitter: @jmignault
The Biodiversity Heritage Library • BHL is a consortium of natural history and botanical libraries and research institutions • An open access digital library for legacy biodiversity literature • An open data repository of taxonomic names and bibliographic information • An increasingly global effort • US/UK, Europe, Egypt, China, Africa
How much text are we talking? • Just hit the 40 million page mark • Tens of thousands of titles • 110,000 volumes • Internet Archive is BHL's scanning partner • In conjunction with local scanning efforts
Issues we’ve faced • OCR is a *BIG* deal • A lot of literature is pre-1923 • Expanding the range of material in BHL
OCR is a *BIG* deal • All book / literature digitization projects affected, not just BHL • Especially problematic in BHL • More than 50 languages represented in BHL • Dates of publication from the 1400s to the 2000s • Irregular typeface / typesetting • Multiple languages on one page • Botanical descriptions in Latin
2007 Name Finding Study • OCR error rate above 35% (35.16%) for names only • Of the 3,003 names, 1,056 were incorrectly transcribed by OCR • [Chart: top OCR errors] • Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
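The headline figure follows directly from the two counts reported on this slide. A quick check in Python (the variable names are illustrative and not taken from the study):

```python
# Counts reported on the slide (Wei et al., 2008).
total_names = 3003
incorrect_names = 1056

# Share of names that OCR transcribed incorrectly.
error_rate = incorrect_names / total_names
print(f"{error_rate:.2%}")  # prints 35.16%
```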
Abbild ungen und Beschreibungen der Fische Syriens, nebst einer neuen Classification und Characteristik sämmtlicher Gattungen der i JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart' sehe Verlagshandlung, 1843.
Older material • Great deal of material is pre-1923 • Irregular fonts – blackletter • Multiple languages on same page – English text with Latin scientific names • Changes in geographic names • Changes in scientific names
*E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cix bIa � S &3rn~ 41X a�m cv(f b1air�'o�et ert oiensr �; �', :�hlrfc�c wa ff�4am.diug bist a 6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn�ciblatGteaM w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl�Oiff ;Bruet wacfttc n qmcx b1a bl: bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t � B Rn "� trv W1Rt' ?Cm c blas waIwutr Ober �ci ti 1V Ces ' wt gbtiemwwajfu tpctt, afferain 9 c: b�titbfof �r f eran m rs bra wlg auig4;f aer�m *mc vrt blatcabtfm wfru an'deg~m rt blas Iaum bwWt� run f ncmai b14ianf tJobrrfan ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W�e�&mcyfbq4 Mabtt mmw rc a iiu bc Jcn ncI.end.*, blat s. a\ u:�rprd3 rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
Expanding scope • Manuscripts, field notebooks – mostly handwritten, often with drawings • Global expansion means dealing with non-Western script systems and a whole new set of OCR problems – Arabic materials from Bibliotheca Alexandrina in Egypt
Some current initiatives • Scientific name extraction • “Parts” • PDF Generator
Scientific Name Extraction • TaxonFinder algorithm in production since 2008 • More than 100 million candidate name strings • More than 1.5 million unique, verified names • Available through UI, APIs, Data Exports & Internet Archive • New collaboration with Global Names • Improved algorithm, better precision & recall • More data!
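The gap between candidate name strings and verified names can be illustrated with a toy example. This sketch is not TaxonFinder. It assumes a naive pattern (a capitalized word followed by a lowercase word) and a hypothetical three-entry genus list, `KNOWN_GENERA`, standing in for a real name authority:

```python
import re

# Hypothetical mini-lexicon of genus names; a real system would
# check candidates against a large authority list.
KNOWN_GENERA = {"Salmo", "Quercus", "Homo"}

# Naive candidate pattern: a capitalized word followed by a lowercase word.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

def find_names(text):
    """Return (candidates, verified) lists of 'Genus species' strings."""
    candidates = [f"{genus} {species}" for genus, species in CANDIDATE.findall(text)]
    verified = [c for c in candidates if c.split()[0] in KNOWN_GENERA]
    return candidates, verified

sample = "The trout Salmo trutta is common. Near the oak Quercus robur we found none."
candidates, verified = find_names(sample)
print(candidates)  # includes false positives such as 'The trout'
print(verified)    # ['Salmo trutta', 'Quercus robur']
```

Pattern matching alone over-generates, so the verification step is what turns a large pool of candidates into a much smaller set of usable names.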
Finding parts • Disambiguating and locating structural boundaries in the corpus • Done mainly by crowdsourced means • Citebank • Greatly increases usability and semantic value of the dataset • Addressing is important – it makes data addressable and thus linkable
What we’d like to do • Challenges framed as games: http://biodivlib.wikispaces.com/BHL+and+Gaming • Correcting OCR • Rekeying Tables of Contents • Researching candidate Scientific Names • Image identification & extraction • http://biodivlib.wikispaces.com/Art+of+Life • Currently funded by NEH
We need your help • “When in doubt, use humans.” • @dpatil: http://radar.oreilly.com/2012/07/data-jujitsu.html • Increase value of biodiversity domain through improved data integration • Many similarities between specimen labels and literature
Need deep intertwingling • Wider integration of biodiversity data • Normalization through controlled vocabularies and authorities • Linkages between • Specimens • Descriptions • Articles • Manuscripts
To sum up • BHL is a massive dataset useful for multidisciplinary research • Systematics • Natural Language Processing • Humanities • BHL is open • Free to use at http://biodiversitylibrary.org • Open access data for scholarly use & reuse • BHL has APIs and data exports to enable reuse • BHL data can be incorporated into other virtual research environments
Get involved • http://biodiversitylibrary.org • http://biodivlib.wikispaces.com/Developer+Tools+and+API • http://biodivlib.wikispaces.com/BHL+and+Gaming • Thanks!