
Natural Language Processing for LODLAM

Natural Language Processing for LODLAM. A brief intro to machine learning & data science for Libraries. Presented at IGeLU 2014 by Corey A Harper, 2014-09-16. Context. Narrative. Storytelling. The Library's story, and the Archives' story, but also… Users’ stories. Scholars' stories.


Presentation Transcript


  1. Natural Language Processing for LODLAM: A brief intro to machine learning & data science for Libraries. Presented at IGeLU 2014 by Corey A Harper, 2014-09-16

  2. Context. Narrative. Storytelling. The Library's story, and the Archives' story, but also…

  3. Users’ stories. Scholars' stories. Adding context through recombinant metadata

  4. Scholars' & Users' Stories – Tim Sherratt (@wragge). Also: http://discontents.com.au/a-map-and-some-pins-open-data-and-unlimited-horizons/

  5. Library Authority Data: “Include links to other URIs, so that they can discover more things.” Short of providing and linking to URIs, this *is* authority data. This is what our authority files are for.

  6. Linked data is about context. Authorities provide context. And yet our controlled vocabs are nearly gone, because the interfaces to them were broken.

  7. The Death of Browse • Next-Gen Discovery Systems don't make use of Authority Control • “Browse” was/is broken as a UI Design • Rich data in Authorities, disconnected from narrative, context, search • Richer “Authority” type data outside libraries... • “Next Gen Next Gen Discovery…

  8. FuzzyWuzzy – an awesome library from SeatGeek: https://github.com/seatgeek/fuzzywuzzy and http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
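The idea behind fuzzy string matching can be sketched without installing anything. The block below is not FuzzyWuzzy itself: it is a stand-in built on Python's standard `difflib` module, and the function names `simple_ratio` and `token_sort_ratio` are chosen here for illustration only. It shows why sorting the words before comparing helps with inverted names such as "Harper, Corey".

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Compare after dropping commas and sorting words, so order is ignored."""
    def normalize(s: str) -> str:
        words = s.lower().replace(",", " ").split()
        return " ".join(sorted(words))
    return simple_ratio(normalize(a), normalize(b))


if __name__ == "__main__":
    # Inverted and direct forms of the same name.
    print(simple_ratio("Harper, Corey", "Corey Harper"))
    print(token_sort_ratio("Harper, Corey", "Corey Harper"))
```

The plain ratio penalizes the inverted form, while the word-sorted comparison treats the two forms as identical. That is roughly the behaviour one wants when matching catalog headings against free-text names.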

  9. Slide courtesy of Doug Oard, Univ. of Maryland

  10. Tools - Natural Language Processing • DBpedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki • Zemanta: http://www.zemanta.com/?wpst=1 • Open Calais: http://www.opencalais.com/ • Open Refine: http://openrefine.org/ • DataTXT: https://dandelion.eu/products/datatxt/ • AlchemyAPI: http://www.alchemyapi.com/ • FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy

  11. Where does this lead? We need new interfaces, new tools, for a new kind of cataloger, for knowledge organization experts.

  12. Linked Jazz Back End

  13. Primo PNX and Authorities • Indexing Cross References • New Browse Functionality • Authority Control from Aleph / Alma • What about non-MARC, or non-Aleph Data? • Matching Strings to Authorities

  14. Enter Open Refine http://freeyourmetadata.org/

  15. Match strings to vocabularies…

  16. Like LCNAF…

  17. Or Wikipedia

  18. Automated Authority Control?
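One way to read the question on this slide: string matching can settle the easy cases automatically and route the rest to a person. The sketch below assumes a tiny in-memory authority list and two cut-off scores. The identifiers (`auth:0001` and so on), the thresholds, and the function names are all placeholders invented for illustration; they are not LCNAF URIs, not values from the talk, and not how Open Refine reconciliation is implemented.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional

# Sample headings in name-authority style; the identifiers are placeholders.
AUTHORITY: Dict[str, str] = {
    "Armstrong, Louis, 1901-1971": "auth:0001",
    "Ellington, Duke, 1899-1974": "auth:0002",
    "Williams, Mary Lou, 1910-1981": "auth:0003",
}


def score(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def reconcile(name: str, authority: Dict[str, str] = AUTHORITY,
              accept: float = 0.9, review: float = 0.6) -> dict:
    """Find the closest heading and decide what to do with it."""
    best = max(authority, key=lambda heading: score(name, heading))
    best_score = score(name, best)
    if best_score >= accept:
        status = "matched"          # safe to link automatically
    elif best_score >= review:
        status = "needs review"     # plausible, but a cataloger should check
    else:
        status = "unmatched"        # nothing close enough
    heading: Optional[str] = best if status != "unmatched" else None
    return {
        "input": name,
        "heading": heading,
        "id": authority[best] if heading else None,
        "score": round(best_score, 3),
        "status": status,
    }


if __name__ == "__main__":
    for raw in ["Armstrong, Louis, 1901-1971", "Armstrong, Louis", "zzzz qqqq"]:
        print(reconcile(raw))
```

The middle band is the interesting design choice. Fully automated linking is only claimed above the `accept` score, and everything between the two thresholds becomes a review queue rather than a silent match.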

  19. Open Refine RDF Skeleton

  20. Proposed System Architecture

  21. Hydra Modeling & Architecture • Approaches to Provenance • Prov-O • Named Graphs • Named Datastreams • “n” nyucore “records” • Same properties defined for each • Keep data sources separate • Merge for display in Blacklight & export to Primo

  22. Separate Metadata Datastreams • source_metadata, enrich_metadata • Reload one or both without affecting other or native metadata • native_metadata • Edited only through Hydra UI • Partitioned from external sources
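The partitioning described on the last two slides can be sketched in plain Python. This is only a model of the idea, not Hydra or Fedora code: the class `ItemMetadata` and its methods are hypothetical, and only the three datastream names come from the slides. Each stream holds the same kind of record, the external streams can be reloaded independently, and the merged view keeps track of where each value came from.

```python
from typing import Dict, List, Tuple

Record = Dict[str, List[str]]  # field name -> list of values


class ItemMetadata:
    """Three separate records for one item, merged only for display."""

    def __init__(self) -> None:
        self.datastreams: Dict[str, Record] = {
            "source_metadata": {},   # loaded from the source system
            "enrich_metadata": {},   # loaded from enrichment (e.g. NER)
            "native_metadata": {},   # edited by hand only
        }

    def reload(self, name: str, record: Record) -> None:
        """Replace one external datastream without touching the others."""
        if name == "native_metadata":
            raise ValueError("native_metadata is edited by hand, not reloaded")
        if name not in self.datastreams:
            raise KeyError(name)
        self.datastreams[name] = {k: list(v) for k, v in record.items()}

    def edit_native(self, field: str, values: List[str]) -> None:
        """Stand-in for an edit made through a cataloging UI."""
        self.datastreams["native_metadata"][field] = list(values)

    def merged(self) -> Dict[str, List[Tuple[str, str]]]:
        """Merge for display, pairing each value with its datastream."""
        out: Dict[str, List[Tuple[str, str]]] = {}
        for ds_name, record in self.datastreams.items():
            for field, values in record.items():
                for value in values:
                    out.setdefault(field, []).append((value, ds_name))
        return out


if __name__ == "__main__":
    item = ItemMetadata()
    item.reload("source_metadata", {"creator": ["L. Armstrong"]})
    item.reload("enrich_metadata", {"creator": ["Armstrong, Louis, 1901-1971"]})
    item.edit_native("subject", ["Jazz"])
    print(item.merged())
```

Because provenance travels with every value, a display layer or an export can choose which source to prefer, and a bad enrichment run can be undone by reloading a single stream.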

  23. Metadata Provenance

  24. Fedora Datastreams

  25. Blacklight User Interface

  26. Where does this lead? We need new interfaces, new tools, for a new kind of cataloger, for knowledge organization experts.

  27. A Role for Ex Libris • Alma &/or Primo • Named Entity Recognition • Vocabulary Reconciliation • Provenance Management • Primo Central • Named Entity Recognition on Full Text • Auto Classification

  28. A bit louder... We need new interfaces. We need enterprise tools, integrated into our metadata management systems, for a new kind of cataloger, for knowledge organization experts.

  29. Simplified Workflow Proposal

  30. More Tools – At Programming Level • Apache OpenNLP: https://opennlp.apache.org/ • Stanford NLP Group software: http://nlp.stanford.edu/software/index.shtml • Python Tools • scikit-learn, pandas, NLTK, SciPy, NumPy • https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience • http://pandas.pydata.org/ • http://www.nltk.org/

  31. More Data Science-ey Tools http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html

  32. Data Science Techniques • Feature Extraction / Feature Engineering • Predictive Modeling • Probabilistic Classification – Large Multi-Class Problems • Text Analytics • Vectorization • Bags & Sets of Words • TF/IDF • N-Grams • Sparse Matrices
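The vectorization terms on this slide can be illustrated in a few lines of standard-library Python. This is a teaching sketch under simple assumptions (whitespace tokenization, raw term frequency, unsmoothed `log(N / df)` weighting), not the formula used by any particular library. The function names are made up for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace."""
    return text.lower().split()


def ngrams(tokens: List[str], n: int) -> List[str]:
    """All runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text: str, max_n: int = 1) -> Counter:
    """Count every 1-gram up to max_n-gram; word order is otherwise lost."""
    tokens = tokenize(text)
    bag: Counter = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(tokens, n))
    return bag


def tfidf(docs: List[str], max_n: int = 1) -> List[Dict[str, float]]:
    """One sparse vector per document: only terms that occur are stored."""
    bags = [bag_of_words(d, max_n) for d in docs]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for bag in bags for term in bag)
    n_docs = len(docs)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return vectors


if __name__ == "__main__":
    docs = ["jazz history", "jazz records"]
    for vec in tfidf(docs):
        print(vec)
```

A term that appears in every document, such as "jazz" here, gets a weight of zero, which is the point of the IDF factor. Storing each vector as a dictionary is a small-scale version of the sparse matrices mentioned on the slide.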

  33. Simple Example – Predict Yelp Star Ratings

  34. Fitting a Model – Naïve Bayes
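The slides for this example are images, so the code shown in the talk is not recoverable here. As a rough illustration of what fitting a Naïve Bayes text classifier involves, the following is a from-scratch multinomial model with add-one smoothing. The four training reviews are invented for this sketch and are not Yelp data. In practice a library implementation such as scikit-learn's would normally be used instead.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Made-up reviews with star ratings, for illustration only.
TRAIN: List[Tuple[str, int]] = [
    ("great food friendly staff", 5),
    ("amazing great service", 5),
    ("terrible food rude staff", 1),
    ("awful terrible service", 1),
]


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha                      # smoothing constant
        self.class_counts: Counter = Counter()  # documents per class
        self.word_counts: Dict[int, Counter] = defaultdict(Counter)
        self.vocab: set = set()

    def fit(self, examples: List[Tuple[str, int]]) -> "NaiveBayes":
        for text, label in examples:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def log_scores(self, text: str) -> Dict[int, float]:
        """Log prior plus summed log likelihoods, per class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    count = self.word_counts[label][word]
                    score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text: str) -> int:
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    model = NaiveBayes().fit(TRAIN)
    print(model.predict("great staff"))
    print(model.predict("terrible service"))
```

The "naïve" part is the assumption that words are independent given the rating, which is why the score is just a sum of per-word log probabilities.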

  35. Data Science Venn Diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

  36. http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323

  37. Where can we go from here? • NER is just the beginning • Feature Engineering • Hiring Statisticians • Clustering & Classification • Vocabulary Pruning and Engineering • Manageable 10-20k Class Text Classification Problems • Domain Specific • Ex Libris’ Activity in this space

  38. Thanks! corey.harper@nyu.edu 212.998.2479 @chrpr
