Natural Language Processing for LODLAM. A brief intro to machine learning & data science for Libraries. Presented at IGeLU 2014 by Corey A Harper, 2014-09-16
Context Narrative Story telling The Library's story, and the Archives story, but also…
Users’ stories Scholars' stories Adding context through recombinant metadata
Scholars & Users Stories – Tim Sherratt (@wragge) Also: http://discontents.com.au/a-map-and-some-pins-open-data-and-unlimited-horizons/
Library Authority Data “Include links to other URIs, so that they can discover more things.” Short of providing and linking to URIs, this *is* authority data. This is what our authority files are for.
Linked data is about context. Authorities provide context. And yet our controlled vocabs are nearly gone, because the interfaces to them were broken.
The Death of Browse • Next-Gen Discovery Systems don't make use of Authority Control • “Browse” was/is broken as a UI Design • Rich data in Authorities, disconnected from narrative, context, search • Richer “Authority” type data outside libraries... • “Next Gen Next Gen Discovery…
Fuzzy Wuzzy – Seat Geek Fuzzy Wuzzy – Awesome Library from SeatGeek https://github.com/seatgeek/fuzzywuzzy http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
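A minimal sketch of the idea behind fuzzy string matching, applied to matching a name string against authority headings. This is not FuzzyWuzzy itself: it uses Python's standard-library difflib as a stand-in so it runs without extra installs. The function names, the threshold of 80, and the sample headings are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, turn punctuation into spaces, and sort tokens so word order is ignored."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return " ".join(sorted(cleaned.split()))


def similarity(a, b):
    """Return a 0-100 similarity score between two strings."""
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


def best_match(query, headings, threshold=80):
    """Return (heading, score) for the closest heading, or None if it scores below threshold."""
    scored = [(heading, similarity(query, heading)) for heading in headings]
    heading, score = max(scored, key=lambda pair: pair[1])
    return (heading, score) if score >= threshold else None


# Hypothetical authority headings, for illustration only.
AUTHORITY_HEADINGS = [
    "Twain, Mark, 1835-1910",
    "Austen, Jane, 1775-1817",
    "Dickens, Charles, 1812-1870",
]

if __name__ == "__main__":
    print(best_match("Mark Twain 1835-1910", AUTHORITY_HEADINGS))
```

Sorting the tokens before comparing means "Mark Twain" and "Twain, Mark" are treated as the same string, which is the usual inverted-name problem in authority work. A real workflow would likely add human review for scores near the threshold.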
Tools - Natural Language Processing • DBpedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki • Zemanta: http://www.zemanta.com/?wpst=1 • Open Calais: http://www.opencalais.com/ • Open Refine: http://openrefine.org/ • DataTXT: https://dandelion.eu/products/datatxt/ • AlchemyAPI: http://www.alchemyapi.com/ • FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy
Where does this lead? We need new interfaces, new tools, for a new kind of cataloger, for knowledge organization experts
Primo PNX and Authorities • Indexing Cross References • New Browse Functionality • Authority Control from Aleph / Alma • What about non-MARC, or non-Aleph Data? • Matching Strings to Authorities
Enter Open Refine http://freeyourmetadata.org/
Hydra Modeling & Architecture • Approaches to Provenance • Prov-O • Named Graphs • Named Datastreams • “n” nyucore “records” • Same properties defined for each • Keep data sources separate • Merge for display in Blacklight & export to Primo
Separate Metadata Datastreams • source_metadata, enrich_metadata • Reload one or both without affecting other or native metadata • native_metadata • Edited only through Hydra UI • Partitioned from external sources
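A rough sketch of the "keep sources separate, merge for display" pattern described above, written in Python, not as Hydra code. The datastream names come from the slide; the precedence order, the record shape (property mapped to a list of values), and both function names are assumptions made for illustration.

```python
# Assumed precedence: locally edited values first, then enrichment, then source.
PRECEDENCE = ["native_metadata", "enrich_metadata", "source_metadata"]


def merge_for_display(datastreams):
    """Merge per-source records into one display record, keeping provenance.

    `datastreams` maps a datastream name to a dict of property -> list of values.
    Returns property -> list of (value, datastream_name), with duplicate values dropped.
    """
    merged = {}
    for name in PRECEDENCE:
        for prop, values in datastreams.get(name, {}).items():
            entries = merged.setdefault(prop, [])
            seen = {value for value, _ in entries}
            for value in values:
                if value not in seen:
                    entries.append((value, name))
                    seen.add(value)
    return merged


def reload_datastream(datastreams, name, new_record):
    """Replace one externally sourced datastream without touching the others."""
    if name == "native_metadata":
        raise ValueError("native_metadata is edited only through the UI")
    updated = dict(datastreams)
    updated[name] = new_record
    return updated


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    record = {
        "source_metadata": {"title": ["Letters"], "subject": ["Correspondence"]},
        "enrich_metadata": {"subject": ["Correspondence", "Personal narratives"]},
        "native_metadata": {"title": ["Letters, 1901-1910"]},
    }
    print(merge_for_display(record))
```

Because each value carries the name of the datastream it came from, a reload of source_metadata or enrich_metadata can be re-merged without losing track of which statements were made locally.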
Where does this lead? We need new interfaces, new tools, for a new kind of cataloger, for knowledge organization experts
A Role for Ex Libris • Alma &/or Primo • Named Entity Recognition • Vocabulary Reconciliation • Provenance Management • Primo Central • Named Entity Recognition on Full Text • Auto Classification
A bit louder... We need new interfaces. We need enterprise tools, integrated into our metadata management systems, for a new kind of cataloger, for knowledge organization experts
More Tools – At Programming Level • Apache OpenNLP: https://opennlp.apache.org/ • Stanford NLP software: http://nlp.stanford.edu/software/index.shtml • Python Tools • scikit-learn, pandas, NLTK, SciPy, NumPy • https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience • http://pandas.pydata.org/ • http://www.nltk.org/
More Data Science-ey Tools http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html
Data Science Techniques • Feature Extraction / Feature Engineering • Predictive Modeling • Probabilistic Classification – Large Multi-Class Problems • Text Analytics • Vectorization • Bags & Sets of Words • TF/IDF • N-Grams • Sparse Matrices
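To make the text analytics terms above concrete, here is a small standard-library sketch of vectorization: bags of words, n-grams, and TF-IDF weights stored sparsely as dicts. In practice a library such as scikit-learn would do this; the function names and the plain tf × log(N/df) weighting used here are assumptions chosen for readability, not the only formulation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, max_n=2):
    """Count every 1-gram up to max_n-gram in the text."""
    tokens = tokenize(text)
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(tokens, n))
    return features


def tfidf(documents, max_n=2):
    """Return one sparse vector (dict of term -> weight) per document."""
    bags = [bag_of_words(doc, max_n) for doc in documents]
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # each term counted once per document
    n_docs = len(bags)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
            if doc_freq[term] < n_docs  # weight would be 0, so leave it out (sparse)
        })
    return vectors


if __name__ == "__main__":
    docs = ["linked data for libraries", "linked data for archives"]
    for vector in tfidf(docs):
        print(vector)
```

Terms that appear in every document get a weight of zero and are dropped, so each vector keeps only what distinguishes that document. This is the same reason real feature matrices for text are stored as sparse matrices.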
Data Science Venn Diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323
Where can we go from here? • NER is just the beginning • Feature Engineering • Hiring Statisticians • Clustering & Classification • Vocabulary Pruning and Engineering • Manageable 10-20k Class Text Classification Problems • Domain Specific • Ex Libris’ Activity in this space
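As a toy illustration of probabilistic text classification, the sketch below implements a multinomial Naive Bayes classifier with add-one smoothing using only the standard library. It is a two-class example on made-up training strings, nowhere near a 10-20k class problem, and the class name, method names, and sample data are all assumptions; a production system would more likely use a library such as scikit-learn.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace-separated, lowercased words."""

    def __init__(self):
        self.class_doc_counts = Counter()
        self.class_word_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.class_doc_counts[label] += 1
            self.class_word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def log_scores(self, text):
        """Return label -> unnormalized log probability for the text."""
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            word_counts = self.class_word_counts[label]
            total_words = sum(word_counts.values())
            score = math.log(doc_count / total_docs)  # class prior
            for word in text.lower().split():
                if word in self.vocabulary:  # ignore words never seen in training
                    # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                    score += math.log((word_counts[word] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up training data, for illustration only.
TRAIN_TEXTS = [
    "civil war letters and diaries",
    "soldiers letters civil war",
    "protein folding simulation",
    "gene expression protein analysis",
]
TRAIN_LABELS = ["History", "History", "Biology", "Biology"]

if __name__ == "__main__":
    model = NaiveBayesTextClassifier().fit(TRAIN_TEXTS, TRAIN_LABELS)
    print(model.predict("war diaries"))
```

The same structure scales to many classes in principle, since each class only needs its own word counts, but vocabulary pruning and feature engineering matter far more once the label set reaches the thousands.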
Thanks! corey.harper@nyu.edu 212.998.2479 @chrpr