
Next steps for BHL and Linked Data

Next steps for BHL and Linked Data. John Mignault, Technical Advisory Group, Biodiversity Heritage Library. Twitter: @jmignault. The Biodiversity Heritage Library. BHL is a consortium of natural history, botanical libraries and research institutions


Presentation Transcript


  1. Next steps for BHL and Linked Data John Mignault Technical Advisory Group Biodiversity Heritage Library Twitter: @jmignault

  2. The Biodiversity Heritage Library • BHL is a consortium of natural history, botanical libraries and research institutions • An open access digital library for legacy biodiversity literature • An open data repository of taxonomic names and bibliographic information • An increasingly global effort • US/UK, Europe, Egypt, China, Africa

  3. How much text are we talking? • Just hit the 40 million page mark • Tens of thousands of titles • 110,000 volumes • Internet Archive is BHL's scanning partner • In conjunction with local scanning efforts

  4. Issues we’ve faced • OCR is a *BIG* deal • A lot of literature is pre-1923 • Expanding the range of material in BHL

  5. OCR is a *BIG* deal • All book / literature digitization projects affected, not just BHL • Especially problematic in BHL • More than 50 languages represented in BHL • Dates of publication from the 1400s to the 2000s • Irregular typeface / typesetting • Multiple languages on one page • Botanical descriptions in Latin

  6. 2007 Name Finding Study • >35% OCR error rate (35.16%) for names only • Of the 3,003 names, 1,056 were incorrectly transcribed by OCR • [Chart: Top OCR errors] • Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
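The 35.16% figure on this slide follows directly from the two counts it quotes. A minimal sketch of that arithmetic, where the function name `error_rate` is only an illustrative choice and not something from the study:

```python
# Counts quoted on the slide (Wei et al., 2008).
total_names = 3003   # names examined in the study
ocr_errors = 1056    # names the OCR transcribed incorrectly


def error_rate(errors: int, total: int) -> float:
    """Return the share of incorrectly transcribed names as a percentage."""
    return 100.0 * errors / total


rate = error_rate(ocr_errors, total_names)
print(f"{rate:.2f}%")  # prints 35.16%
```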

  7. Abbild ungen und Beschreibungen der Fische Syriens, nebst einer neuen Classification und Characteristik sämmtlicher Gattungen der i JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart' sehe Verlagshandlung, 1843.

  8. Older material • Great deal of material is pre-1923 • Irregular fonts – blackletter • Multiple languages on same page – English text with Latin scientific names • Changes in geographic names • Changes in scientific names

  9. *E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cix bIa � S &3rn~ 41X a�m cv(f b1air�'o�et ert oiensr �; �', :�hlrfc�c wa ff�4am.diug bist a 6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn�ciblatGteaM w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl�Oiff ;Bruet wacfttc n qmcx b1a bl: bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t � B Rn "� trv W1Rt' ?Cm c blas waIwutr Ober �ci ti 1V Ces ' wt gbtiemwwajfu tpctt, afferain 9 c: b�titbfof �r f eran m rs bra wlg auig4;f aer�m *mc vrt blatcabtfm wfru an'deg~m rt blas Iaum bwWt� run f ncmai b14ianf tJobrrfan ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W�e�&mcyfbq4 Mabtt mmw rc a iiu bc Jcn ncI.end.*, blat s. a\ u:�rprd3 rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i

  10. Expanding scope • Manuscripts, field notebooks – mostly handwritten, often with drawings • Global expansion means dealing with non-Western script systems and a whole new set of OCR problems – Arabic materials from Bibliotheca Alexandrina in Egypt

  11. Images

  12. Some current initiatives • Scientific name extraction • “Parts” • PDF Generator

  13. Scientific Name Extraction • TaxonFinder algorithm in production since 2008 • More than 100 million candidate name strings • More than 1.5 million unique, verified names • Available through UI, APIs, Data Exports & Internet Archive • New collaboration with Global Names • Improved algorithm, better precision & recall • More data!
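The gap between "candidate name strings" and "verified names" on this slide can be illustrated with a toy example. The sketch below is not TaxonFinder or the Global Names algorithm; it assumes a naive capitalized-word-plus-lowercase-word pattern and a tiny hypothetical list of known names (`KNOWN_NAMES`) standing in for a real names authority.

```python
import re

# Hypothetical pattern: a capitalized word followed by a lowercase word,
# i.e. the rough shape of a "Genus species" binomial.
CANDIDATE = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

# Hypothetical stand-in for a names authority used to verify candidates.
KNOWN_NAMES = {"Salmo trutta", "Quercus alba"}


def find_candidates(text):
    """Return every string that merely looks like a binomial name."""
    return [f"{genus} {species}" for genus, species in CANDIDATE.findall(text)]


def verify(candidates, known=KNOWN_NAMES):
    """Keep only the unique candidates that appear in the known-names list."""
    return sorted({name for name in candidates if name in known})


page = ("The brown trout, Salmo trutta, was seen near Quercus alba. "
        "Salmo trutta is common.")
candidates = find_candidates(page)
print(candidates)          # includes the false positive "The brown"
print(verify(candidates))  # unique, verified names only
```

The false positive shows why a candidate count (over 100 million strings) is far larger than the count of unique, verified names (about 1.5 million), and why precision and recall are the measures the improved algorithm targets.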

  14. Finding parts • Disambiguating and locating structural boundaries in the corpus • Done mainly by crowdsourced means • Citebank • Greatly increases usability and semantic value of the dataset • Addressing is important: it makes data addressable and thus linkable

  15. Articles in the BHL UI

  16. Images

  17. PDF Generator

  18. What we’d like to do • http://biodivlib.wikispaces.com/BHL+and+Gaming • Challenges framed as games • Correcting OCR • Rekeying Tables of Contents • Researching candidate Scientific Names • Image identification & extraction • http://biodivlib.wikispaces.com/Art+of+Life • Currently funded by NEH

  19. We need your help • “When in doubt, use humans.” • @dpatil: http://radar.oreilly.com/2012/07/data-jujitsu.html • Increase value of biodiversity domain through improved data integration • Many similarities between specimen labels and literature

  20. Need deep intertwingling • Wider integration of biodiversity data • Normalization through controlled vocabularies and authorities • Linkages between • Specimens • Descriptions • Articles • Manuscripts
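One way to picture normalization through a controlled vocabulary is to map each variant name string to a single authority identifier, then group records of different kinds under it. The sketch below is purely illustrative: the identifiers (`name:1`), the record IDs, and the `AUTHORITY` table are hypothetical and do not correspond to any BHL or Global Names data.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: variant strings -> one authority ID.
AUTHORITY = {
    "salmo trutta": "name:1",
    "salmo fario": "name:1",   # treated as a synonym, for illustration only
    "quercus alba": "name:2",
}


def normalize(name):
    """Lowercase, collapse whitespace, and look the name up in the vocabulary."""
    key = " ".join(name.lower().split())
    return AUTHORITY.get(key)


# Hypothetical records of different kinds, each mentioning a name string.
records = [
    ("specimen", "S-001", "Salmo  trutta"),
    ("article", "A-042", "salmo fario"),
    ("manuscript", "M-007", "Quercus alba"),
    ("article", "A-043", "Unknown name"),
]


def link(records):
    """Group records by authority ID; names not in the vocabulary are skipped."""
    linked = defaultdict(list)
    for kind, record_id, name in records:
        authority_id = normalize(name)
        if authority_id is not None:
            linked[authority_id].append((kind, record_id))
    return dict(linked)


links = link(records)
print(links)
```

Once the specimen and the article resolve to the same identifier, they are linked even though their name strings differ, which is the kind of linkage the slide asks for.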

  21. To sum up • BHL is a massive dataset useful for multidisciplinary research • Systematics • Natural Language Processing • Humanities • BHL is open • Free to use at http://biodiversitylibrary.org • Open access data for scholarly use & reuse • BHL has APIs and data exports to enable reuse • BHL data can be incorporated into other virtual research environments

  22. Get involved • http://biodiversitylibrary.org • http://biodivlib.wikispaces.com/Developer+Tools+and+API • http://biodivlib.wikispaces.com/BHL+and+Gaming • Thanks!
