Natural Language Processing for LODLAM

A brief intro to machine learning & data science for Libraries. Presented at IGeLU 2014 by Corey A Harper, 2014-09-16

Context. Narrative. Storytelling. The Library's story, and the Archives' story, but also…

Users’ stories Scholars' stories Adding context through recombinant metadata

Scholars & Users Stories – Tim Sherratt (@wragge) Also: http://discontents.com.au/a-map-and-some-pins-open-data-and-unlimited-horizons/

Library Authority Data “Include links to other URIs. so that they can discover more things.” Short of providing and linking to URIs, this *is* authority data. This is what our authority files are for.

Linked data is about context; authorities provide context; and yet our controlled vocabs are nearly gone because the interfaces to them were broken

The Death of Browse • Next-Gen Discovery Systems don't make use of Authority Control • “Browse” was/is broken as a UI Design • Rich data in Authorities, disconnected from narrative, context, search • Richer “Authority” type data outside libraries... • “Next Gen Next Gen Discovery…

FuzzyWuzzy – an awesome library from SeatGeek https://github.com/seatgeek/fuzzywuzzy http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
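The kind of fuzzy string matching this slide points to can be sketched with Python's standard library alone. This is an illustration, not FuzzyWuzzy's own code: it scores similarity with difflib.SequenceMatcher on a 0-100 scale, and the helper names (normalize, simple_score, token_sort_score, best_match) are hypothetical. Sorting tokens before comparing is one way to make inverted name forms line up.

```python
from difflib import SequenceMatcher
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def simple_score(a, b):
    """Similarity of two strings on a 0-100 scale (illustrative helper)."""
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


def token_sort_score(a, b):
    """Score after sorting tokens, so word order does not matter."""
    sorted_a = " ".join(sorted(normalize(a).split()))
    sorted_b = " ".join(sorted(normalize(b).split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


def best_match(query, choices):
    """Return (choice, score) for the highest-scoring candidate."""
    scored = ((choice, token_sort_score(query, choice)) for choice in choices)
    return max(scored, key=lambda pair: pair[1])


# "Harper, Corey A" and "Corey A Harper" contain the same tokens,
# so the token-sorted comparison treats them as identical.
print(token_sort_score("Harper, Corey A", "Corey A Harper"))  # 100
```

With a list of candidate headings, best_match("Davis, Miles", ["Miles Davis", "John Coltrane"]) would return the first candidate with a score of 100, since the two forms differ only in word order and punctuation.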

Slide courtesy of Doug Oard, Univ. of Maryland

Tools - Natural Language Processing • DBpedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki • Zemanta: http://www.zemanta.com/?wpst=1 • Open Calais: http://www.opencalais.com/ • Open Refine: http://openrefine.org/ • DataTXT: https://dandelion.eu/products/datatxt/ • AlchemyAPI: http://www.alchemyapi.com/ • FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy

Where does this lead? We need new interfaces, new tools, for a new kind of cataloger, for knowledge organization experts

Linked Jazz Back End

Primo PNX and Authorities • Indexing Cross References • New Browse Functionality • Authority Control from Aleph / Alma • What about non-MARC, or non-Aleph Data? • Matching Strings to Authorities
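Indexing cross references and matching strings to authorities can be sketched as a lookup table in which every variant ("see from") form points back to its preferred heading. This is a minimal illustration, not the PNX, Aleph, or Alma data model: the record layout, field names, and the identifier are hypothetical, and the matching is exact after normalization only.

```python
import re
import unicodedata


def normalize(heading):
    """Fold case, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", heading)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^\w\s]", " ", no_marks.lower()).split())


def build_index(records):
    """Map every preferred and variant form to its authority record."""
    index = {}
    for record in records:
        for form in [record["preferred"], *record["variants"]]:
            index[normalize(form)] = record
    return index


def match(string, index):
    """Return the matching authority record, or None when nothing matches."""
    return index.get(normalize(string))


# Hypothetical record layout; the identifier is a placeholder.
records = [
    {
        "id": "auth:0001",
        "preferred": "Twain, Mark, 1835-1910",
        "variants": ["Clemens, Samuel Langhorne, 1835-1910"],
    }
]
index = build_index(records)
hit = match("CLEMENS, Samuel Langhorne, 1835-1910.", index)
print(hit["preferred"] if hit else "no match")
```

Strings from non-MARC sources that fail this exact lookup are the ones a fuzzy matcher or a reconciliation tool would need to handle.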

Enter Open Refine http://freeyourmetadata.org/

Match strings to vocabularies…

Like LCNAF…

Or Wikipedia

Automated Authority Control?

Open Refine RDF Skeleton

Proposed System Architecture

Hydra Modeling & Architecture • Approaches to Provenance • Prov-O • Named Graphs • Named Datastreams • “n” nyucore “records” • Same properties defined for each • Keep data sources separate • Merge for display in Blacklight & export to Primo

Separate Metadata Datastreams • source_metadata, enrich_metadata • Reload one or both without affecting other or native metadata • native_metadata • Edited only through Hydra UI • Partitioned from external sources
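The partitioning described on these two slides can be sketched in plain Python: each datastream is stored separately with the same properties, external streams can be reloaded independently, and a merged view is computed only for display or export. This is not Hydra or Fedora code. The datastream names come from the slide, while the property names, sample values, and the merge rule (union of values, tagged with the supplying datastream) are assumptions.

```python
# Datastream names follow the slide; property names and values are invented.
DATASTREAMS = ("native_metadata", "source_metadata", "enrich_metadata")


def merge_for_display(record):
    """Union values per property, remembering which datastream supplied each."""
    merged = {}
    for stream in DATASTREAMS:
        for prop, values in record.get(stream, {}).items():
            seen = merged.setdefault(prop, [])
            for value in values:
                # Keep only the first occurrence of a repeated value.
                if all(value != existing for existing, _ in seen):
                    seen.append((value, stream))
    return merged


def reload_stream(record, stream, new_metadata):
    """Replace one external datastream without touching the others."""
    if stream == "native_metadata":
        raise ValueError("native_metadata is edited only through the UI")
    updated = dict(record)
    updated[stream] = new_metadata
    return updated


record = {
    "native_metadata": {"title": ["Oral history interview"]},
    "source_metadata": {"title": ["Oral history interview"], "creator": ["Example, Ann"]},
    "enrich_metadata": {"subject": ["Jazz musicians"]},
}
merged = merge_for_display(record)
reloaded = reload_stream(record, "enrich_metadata", {"subject": ["Jazz"]})
```

Because each merged value carries the name of its datastream, a display layer could show provenance alongside the value, and re-running enrichment never overwrites hand-edited native metadata.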

Metadata Provenance

Fedora Datastreams

Blacklight User Interface

Where does this lead? We need new interfaces, new tools, for a new kind of cataloger, for knowledge organization experts

A Role for Ex Libris • Alma &/or Primo • Named Entity Recognition • Vocabulary Reconciliation • Provenance Management • Primo Central • Named Entity Recognition on Full Text • Auto Classification

A bit louder... We need new interfaces. We need enterprise tools, integrated into our metadata management systems, for a new kind of cataloger, for knowledge organization experts

Simplified Workflow Proposal

More Tools – At Programming Level • Open NLP: https://opennlp.apache.org/ • Stanford NLP software: http://nlp.stanford.edu/software/index.shtml • Python Tools • scikit-learn, pandas, NLTK, SciPy, NumPy • https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience • http://pandas.pydata.org/ • http://www.nltk.org/

More Data Science-ey Tools http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html

Data Science Techniques • Feature Extraction / Feature Engineering • Predictive Modeling • Probabilistic Classification – Large Multi-Class Problems • Text Analytics • Vectorization • Bags & Sets of Words • TF/IDF • N-Grams • Sparse Matrices
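Several of the text analytics terms above (vectorization, bags of words, n-grams, TF/IDF, sparse matrices) fit into one short sketch. It uses only the standard library and stores each vector as a dictionary, which is sparse in the sense that absent terms take no space. The weighting shown, count times log(N / document frequency), is one common TF/IDF variant among several. The sample titles are invented, and in practice a library such as scikit-learn would likely do this work.

```python
import math
import re
from collections import Counter


def tokenize(text, n=1):
    """Split text into lowercase word n-grams (n=1 gives single words)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def bag_of_words(text, n=1):
    """Sparse term-count vector: only terms that occur are stored."""
    return Counter(tokenize(text, n))


def tf_idf(documents, n=1):
    """Weight each term count by log(N / document frequency)."""
    bags = [bag_of_words(doc, n) for doc in documents]
    # Iterating a Counter yields each distinct term once per document.
    doc_freq = Counter(term for bag in bags for term in bag)
    total = len(documents)
    return [
        {term: count * math.log(total / doc_freq[term]) for term, count in bag.items()}
        for bag in bags
    ]


# Invented sample titles, for illustration only.
docs = [
    "Jazz musicians of New York",
    "Oral histories of jazz musicians",
    "Library authority data",
]
vectors = tf_idf(docs)
bigrams = tokenize(docs[0], n=2)
print(bigrams)
```

Terms that appear in fewer documents receive higher weights, so "library" (one document) outweighs "jazz" (two documents), which is what makes the weighted vectors useful as features for classification.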

Simple Example – Predict Yelp Star Ratings

Fitting a Model – Naïve Bayes
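Fitting a Naïve Bayes model to review text can be sketched as follows. This is not the code or data from the talk: it is a small multinomial Naïve Bayes with add-one (Laplace) smoothing written from scratch, trained on four invented toy reviews rather than the Yelp dataset. The class and method names are hypothetical, chosen to resemble the fit/predict convention of common Python libraries.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its words."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Log prior plus summed log likelihoods for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy reviews; labels are star ratings.
train_texts = [
    "terrible service and cold food",
    "awful food rude staff",
    "great food and friendly staff",
    "amazing service great atmosphere",
]
train_labels = [1, 1, 5, 5]
model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("great friendly service")
print(prediction)  # 5
```

The same structure scales to many classes, since each class only needs its own word counts and prior; real data would also need a held-out test set to measure accuracy.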

Data Science Venn Diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323

Where can we go from here? • NER is just the beginning • Feature Engineering • Hiring Statisticians • Clustering & Classification • Vocabulary Pruning and Engineering • Manageable 10-20k Class Text Classification Problems • Domain Specific • Ex Libris’ Activity in this space

Thanks! corey.harper@nyu.edu 212.998.2479 @chrpr
