
SKOS-2-HIVE

SKOS-2-HIVE. May 20, 2011 - Columbia University. Introduction. Craig Willis (craig.willis@unc.edu). Afternoon Session Schedule. Introduction Technical overview of the HIVE service Understanding HIVE Internals Installing and configuring HIVE Using the HIVE APIs Customizing HIVE


Presentation Transcript


  1. SKOS-2-HIVE May 20, 2011 - Columbia University

  2. Introduction Craig Willis (craig.willis@unc.edu)

  3. Afternoon Session Schedule • Introduction • Technical overview of the HIVE service • Understanding HIVE Internals • Installing and configuring HIVE • Using the HIVE APIs • Customizing HIVE • Automatic indexing approaches • Future development

  4. Background and Interests • What are you most interested in getting out of this part of the workshop? • Technical architecture overview? • Working installation? • Hands on programming? • What is your background? • Cataloging, indexing, and classification • Programming and databases • Systems administration

  5. Who’s using HIVE? In addition to the demonstration service, HIVE is being evaluated by other organizations: • Long Term Ecological Research Network (LTER) • Prototype for keyword suggestion for Ecological Metadata Language (EML) documents. • http://scoria.lternet.edu:8080/lter-hive-prototypes/ • Library of Congress Web Archives (Minerva) • Evaluating HIVE for automatic LCSH subject heading suggestion for web archives • http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html

  6. Who’s using HIVE? • Dryad Data Repository • Evaluating HIVE for suggestion of controlled terms during the submission and curation process. • Scientific names (ITIS), Spatial Coverage (TGN, Alexandria Gazetteer), Keywords (NBII, MeSH, LCSH) • http://www.datadryad.org/

  7. Technical Overview

  8. HIVE Functions • System for management of multiple controlled vocabularies in SKOS/RDF format • Supports natural language and structured (SPARQL) queries • Rich internet application (RIA) for browsing and searching • Java API and REST interfaces for programmatic access • Automatic indexing • Framework for conversion of vocabularies to SKOS

  9. HIVE Technical Overview • HIVE combines several open-source technologies to provide a framework for vocabulary services. • Java-based web services can run in any Java application server • Demonstration website (http://hive.nescent.org/) • Open-source project with mailing lists • Google Code project (http://code.google.com/p/hive-mrc/)

  10. HIVE Technologies • Tomcat: Java-based web application server • Sesame: Open-source triple store and framework for storing and querying RDF data • Triple Store: Database for storing and retrieving RDF • SPARQL: RDF query language • Lucene: Java-based full-text search engine • KEA++: Algorithm and Java API for automatic term suggestion from SKOS vocabularies. • REST: Web-based software architecture (Representational State Transfer)

  11. Architecture

  12. Tour of Google Code project • http://code.google.com/p/hive-mrc/ • Documentation • Precompiled releases • Bug tracking • Source code • Mailing lists

  13. Understanding HIVE Internals

  14. Architecture

  15. HIVE supporting technologies • Lucene http://lucene.apache.org • Sesame http://www.openrdf.org/ • KEA http://www.nzdl.org/Kea/ • H2 http://www.h2database.com/ • GWT http://code.google.com/webtoolkit/

  16. Data Directory Layout • /usr/local/hive/hive-data • vocabulary/ • vocabulary.rdf: SKOS RDF/XML • vocabularyAlphaIndex: Serialized map • vocabularyH2: H2 database (used by KEA) • vocabularyIndex: Lucene index • vocabularyKEA: KEA model and training data • vocabularyStore: Sesame/OpenRDF store • topConceptIndex: Serialized map of top concepts
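
The layout on this slide can be sketched as a small Java helper that lists the paths one would expect to find for a vocabulary. This is only an illustration: it assumes that the word "vocabulary" on the slide is a placeholder replaced by the vocabulary's short name (for example agrovoc), and the class and method names are hypothetical, not part of HIVE. It is written for a current JDK rather than the Java 1.6 used in the workshop.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of HIVE: lists the files and directories
// the slide describes for one vocabulary.
final class HiveDataLayout {

    // {file name pattern, description}; "%s" stands for the vocabulary name.
    // Assumption: "vocabulary" on the slide is a placeholder for that name.
    private static final String[][] ARTIFACTS = {
        {"%s.rdf", "SKOS RDF/XML"},
        {"%sAlphaIndex", "Serialized map"},
        {"%sH2", "H2 database (used by KEA)"},
        {"%sIndex", "Lucene index"},
        {"%sKEA", "KEA model and training data"},
        {"%sStore", "Sesame/OpenRDF store"},
        {"topConceptIndex", "Serialized map of top concepts"},
    };

    // Returns description -> expected path, in the order shown on the slide.
    static Map<String, Path> expectedPaths(String dataDir, String vocabulary) {
        Path base = Paths.get(dataDir, vocabulary);
        Map<String, Path> paths = new LinkedHashMap<>();
        for (String[] artifact : ARTIFACTS) {
            paths.put(artifact[1], base.resolve(String.format(artifact[0], vocabulary)));
        }
        return paths;
    }

    public static void main(String[] args) {
        expectedPaths("/usr/local/hive/hive-data", "agrovoc")
            .forEach((description, path) -> System.out.println(path + "  (" + description + ")"));
    }
}
```

Checking these paths after an import is a quick way to see which of the stores and indexes were actually built.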

  17. KeywordSearch

  18. Indexing

  19. HIVE Internals: Data Models • Lucene Index: Index of SKOS vocabulary (view with Luke) • Sesame/OpenRDF Store: Native/Sail RDF repository for the vocabulary • KEA++ Model: Serialized KEAFilter object • H2 Database: Embedded DB containing the SKOS vocabulary in the format used by KEA (can be queried using the H2 command line) • Alpha Index: Serialized map of concepts • Top Concept Index: Serialized map of top concepts

  20. Activity • Explore Lucene index with Luke • http://luke.googlecode.com/ • Explore Sesame store with SPARQL • http://www.xml.com/pub/a/2005/11/16/introducing-sparql-querying-semantic-web-tutorial.html • http://www.cambridgesemantics.com/2008/09/sparql-by-example/
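
For the SPARQL part of this activity, a first query might simply list concepts with their preferred labels. The sketch below builds such a query as a Java string; it assumes the store types its concepts as skos:Concept and records labels with skos:prefLabel, which is standard SKOS but has not been checked against a particular HIVE store. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of HIVE: builds a SPARQL SELECT query that
// lists SKOS concepts and their preferred labels.
final class SkosLabelQuery {

    // The W3C SKOS core namespace.
    static final String SKOS_NS = "http://www.w3.org/2004/02/skos/core#";

    static String prefLabelQuery(int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        return String.join("\n",
            "PREFIX skos: <" + SKOS_NS + ">",
            "SELECT ?concept ?label",
            "WHERE {",
            "  ?concept a skos:Concept ;",          // assumption: concepts are typed
            "           skos:prefLabel ?label .",
            "}",
            "LIMIT " + limit);
    }

    public static void main(String[] args) {
        // Print the query so it can be pasted into a SPARQL console.
        System.out.println(prefLabelQuery(10));
    }
}
```

The printed query can be pasted into whichever SPARQL console is used to explore the Sesame store.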

  21. Installing and Configuring HIVE

  22. Installing and Configuring HIVE • Requirements • Java 1.6 • Tomcat (HIVE is currently using 6.x) • Detailed installation instructions: • http://code.google.com/p/hive-mrc/wiki/InstallingHiveWeb • http://code.google.com/p/hive-mrc/wiki/InstallingHiveRestService

  23. Installing and Configuring HIVE-web • Detailed installation instructions (hive-web) • http://code.google.com/p/hive-mrc/wiki/InstallingHiveWeb • Quick start (hive-web) • Download and extract Tomcat 6.x • Download and extract latest hive-web war • Download and extract sample vocabulary • Configure hive.properties and agrovoc.properties • Start Tomcat • http://localhost:8080/

  24. Installing and Configuring HIVE REST Service • Detailed installation instructions (hive-rs) • http://code.google.com/p/hive-mrc/wiki/InstallingHiveRestService • Quick start (hive-rs) • Download and extract latest webapp • Download and extract sample vocabulary • Configure hive.properties • Start Tomcat

  25. Installing and Configuring HIVE-web from source • Detailed installation instructions (hive-web) • http://code.google.com/p/hive-mrc/wiki/DevelopingHIVE • http://code.google.com/p/hive-mrc/wiki/InstallingHiveWeb • Requirements • Eclipse IDE for J2EE Developers • Subclipse plugin • Google Eclipse Plugin • Apache Ant • Google Web Toolkit 1.7.1 • Tomcat 6.x

  26. Activity: Install HIVE Walkthrough of complete HIVE-web installation

  27. Properties files • hive.properties • Specifies enabled vocabularies and selected indexing algorithm • http://code.google.com/p/hive-mrc/source/browse/trunk/hive-web/war/WEB-INF/conf/hive.properties • <vocabulary>.properties • Specifies location of vocabulary databases/indexes on the local filesystem • http://code.google.com/p/hive-mrc/source/browse/trunk/hive-web/war/WEB-INF/conf/lcsh.properties
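
As a rough illustration of what reading such a file involves, the sketch below parses properties text with java.util.Properties and pulls out the two settings the slide mentions: the enabled vocabularies and the indexing algorithm. The key names example.vocabularies and example.tagger are invented for this example and are not HIVE's actual keys (those are in the linked hive.properties), and the fallback to "dummy" is likewise an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch, not HIVE code. The key names below are invented;
// consult the real hive.properties for the actual keys.
final class HivePropertiesSketch {

    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // Comma-separated list of enabled vocabularies (hypothetical key).
    static List<String> enabledVocabularies(Properties properties) {
        List<String> names = new ArrayList<>();
        for (String name : properties.getProperty("example.vocabularies", "").split(",")) {
            if (!name.trim().isEmpty()) {
                names.add(name.trim());
            }
        }
        return names;
    }

    // Selected indexing algorithm (hypothetical key; default is an assumption).
    static String indexingAlgorithm(Properties properties) {
        return properties.getProperty("example.tagger", "dummy");
    }

    public static void main(String[] args) {
        Properties p = parse("example.vocabularies = agrovoc, lcsh\nexample.tagger = KEA\n");
        System.out.println(enabledVocabularies(p) + " using " + indexingAlgorithm(p));
    }
}
```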

  28. Importing SKOS Vocabularies • http://code.google.com/p/hive-mrc/wiki/ImportingVocabularies • Note memory requirements for each vocabulary • http://code.google.com/p/hive-mrc/wiki/HIVEMemoryUsage • java -Xmx1024m -Djava.ext.dirs=path/to/hive/lib edu.unc.ils.mrc.hive.admin.AdminVocabularies [/path/to/hive/conf/] [vocabulary] [train]
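
The import command can also be assembled programmatically, for example to script imports of several vocabularies. The sketch below builds the argument list in the order shown on the slide. It assumes that "train" is a literal, optional final argument (the slide's brackets leave this ambiguous), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HIVE: assembles the import command
// from the slide as an argument list suitable for ProcessBuilder.
final class ImportCommandSketch {

    static List<String> buildCommand(String libDir, String confDir,
                                     String vocabulary, boolean train) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-Xmx1024m");                      // heap size from the slide
        command.add("-Djava.ext.dirs=" + libDir);      // HIVE jars, as on the slide
        command.add("edu.unc.ils.mrc.hive.admin.AdminVocabularies");
        command.add(confDir);
        command.add(vocabulary);
        if (train) {
            command.add("train");                      // assumption: literal optional flag
        }
        return command;
    }

    public static void main(String[] args) {
        List<String> command =
            buildCommand("/path/to/hive/lib", "/path/to/hive/conf/", "agrovoc", true);
        System.out.println(String.join(" ", command));
        // To actually run it: new ProcessBuilder(command).inheritIO().start();
    }
}
```

The -Djava.ext.dirs option matches the Java 1.6 requirement stated earlier; newer JDKs dropped the extension mechanism, so the jars would probably need to go on a regular classpath instead.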

  29. Activity: Import Vocabulary Import the sample thesaurus

  30. Using the HIVE APIs

  31. Using HIVE as a Service • HIVE web application • http://hive.nescent.org/ • Developed by Jose Perez-Aguera, Lina Huang • Java servlet, Google Web Toolkit (GWT) • http://code.google.com/p/hive-mrc/wiki/AboutHiveWeb • HIVE REST service • http://hive.nescent.org/rs • Developed by Duane Costa, Long-Term Ecological Research Network • http://code.google.com/p/hive-mrc/wiki/AboutHiveRestService

  32. Activity: Show HIVE-RS • Demonstrate REST API calls: • http://code.google.com/p/hive-mrc/wiki/AboutHiveRestService

  33. HIVE Core Interfaces

  34. HIVE Core Packages

  35. edu.unc.ils.mrc.hive.api • SKOSServer: Provides access to one or more vocabularies • SKOSSearcher: Supports searching across multiple vocabularies • SKOSTagger: Supports tagging/keyphrase extraction across multiple vocabularies • SKOSScheme: Represents an individual vocabulary (location of vocabulary on file system)

  36. SKOSServer • SKOSServer is the top-level class used to initialize the vocabulary server. • Reads the hive.properties file and initializes the SKOSScheme (vocabulary management), SKOSSearcher (concept searching), SKOSTagger (indexing) instances based on the vocabulary configurations. • edu.unc.ils.mrc.hive.api.SKOSServer • TreeMap<String, SKOSScheme> getSKOSSchemas(); • SKOSSearcher getSKOSSearcher(); • SKOSTagger getSKOSTagger(); • String getOrigin(QName uri);

  37. SKOSSearcher • Supports searching across one or more configured vocabularies. • Keyword queries using Lucene, SPARQL queries using OpenRDF/Sesame • edu.unc.ils.mrc.hive.api.SKOSSearcher • searchConceptByKeyword(uri, lp) • searchConceptByURI(uri, lp) • searchChildrenByURI(uri, lp) • SPARQLSelect()

  38. SKOSTagger • Keyphrase extraction using multiple vocabularies • Tagger implementation depends on the setting in hive.properties: “dummy” or “KEA” • edu.unc.ils.mrc.hive.api.SKOSTagger • List<SKOSConcept> getTags(String text, List<String> vocabularies, SKOSSearcher searcher);
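
To show the shape of a getTags call without requiring the HIVE jars, the sketch below declares small stand-in interfaces that copy only the signature shown on this slide. Everything else is an assumption made for illustration: the getPrefLabel accessor, the fixedTagger toy implementation, and the NO_SEARCHER placeholder are not taken from HIVE, and real code would obtain the tagger and searcher from a SKOSServer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only. The nested interfaces are stand-ins that mirror the
// getTags signature on the slide; they are NOT the real HIVE classes.
final class TaggerCallSketch {

    interface SKOSConcept { String getPrefLabel(); }   // accessor name is an assumption
    interface SKOSSearcher { }
    interface SKOSTagger {
        List<SKOSConcept> getTags(String text, List<String> vocabularies, SKOSSearcher searcher);
    }

    // Placeholder searcher for the toy example.
    static final SKOSSearcher NO_SEARCHER = new SKOSSearcher() { };

    // Toy tagger that ignores its input and always returns the given labels.
    static SKOSTagger fixedTagger(String... labels) {
        return (text, vocabularies, searcher) -> {
            List<SKOSConcept> concepts = new ArrayList<>();
            for (String label : labels) {
                concepts.add(() -> label);
            }
            return concepts;
        };
    }

    // The call pattern: ask the tagger for concepts, then read their labels.
    static List<String> suggestLabels(SKOSTagger tagger, SKOSSearcher searcher,
                                      String text, List<String> vocabularies) {
        List<String> labels = new ArrayList<>();
        for (SKOSConcept concept : tagger.getTags(text, vocabularies, searcher)) {
            labels.add(concept.getPrefLabel());
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(suggestLabels(fixedTagger("soil", "erosion"), NO_SEARCHER,
            "Some document text", Arrays.asList("agrovoc")));
    }
}
```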

  39. SKOSScheme • Represents an individual vocabulary, based on settings in <vocabulary>.properties • Supports querying of statistics about each vocabulary (number of concepts, number of relationships, etc.)

  40. HIVE Internals: HIVE Web • GWT Entry Points: • HomePage • ConceptBrowser • Indexer • Servlets • VocabularyService: Singleton vocabulary server • FileUpload: Handles the file upload for indexing • ConceptBrowserServiceImpl • IndexerServiceImpl

  41. HIVE Internals: HIVE-RS • Java API for RESTful Web Services (JAX-RS) • Classes • ConceptsResource • SchemesResource

  42. Customizing HIVE

  43. Obtaining Vocabularies • Several vocabularies can be freely downloaded • Some vocabularies require licensing • HIVE Core includes converters for each of the supported vocabularies • List of HIVE vocabularies: http://code.google.com/p/hive-mrc/wiki/VocabularyConversion

  44. Converting Vocabularies to SKOS • Additional information • http://code.google.com/p/hive-mrc/wiki/VocabularyConversion • Each vocabulary has different requirements

  45. Converting Vocabularies to SKOS • A Method to Convert Thesauri to SKOS (van Assem et al) • Prolog implementation • IPSV, GTAA, MeSH • http://thesauri.cs.vu.nl/eswc06/ • Converting MeSH to SKOS for HIVE • Java SAX-based parser • http://code.google.com/p/hive-mrc/wiki/MeshToSKOS

  46. HIVE for Developers • HIVE is an open-source project • Instructions for working with HIVE in Eclipse are on the Wiki • http://code.google.com/p/hive-mrc/wiki/DevelopingHIVE • Sample code is also available • http://code.google.com/p/hive-mrc/source/browse/#svn%2Ftrunk%2Fdoc%2FsampleCode

  47. Automatic Indexing

  48. Approaches to Automatic Indexing Moens, M.F. (2000). Automatic Indexing and Abstracting of Document Texts. London: Kluwer. • Knowledge base approach (rule-based) • Manual construction of rules based on expert classifications. • Machine learning approach • Use expert classifications to construct a statistical model for future classifications.

  49. About KEA++ • Machine learning approach • http://www.nzdl.org/Kea/ • Algorithm and open-source Java library for extracting keyphrases from documents using SKOS vocabularies. • Developed by Alyona Medelyan (KEA++), based on earlier work by Ian Witten (KEA) from the Digital Libraries and Machine Learning Lab at the University of Waikato, New Zealand. • Problem: How can we automatically identify the topic of documents?

  50. Automatic Indexing Medelyan, O. and Witten, I.H. (2008). “Domain independent automatic keyphrase indexing with small training sets.” Journal of the American Society for Information Science and Technology, 59(7): 1026-1040. • Free keyphrase indexing (KEA) • Significant terms in a document are determined based on intrinsic properties (e.g., frequency and length). • Keyphrase indexing (KEA++) • Terms from a controlled vocabulary are assigned based on intrinsic properties. • Controlled indexing/term assignment: • Documents are classified based on content that corresponds to a controlled vocabulary. • e.g., Pouliquen, Steinberger, and Ignat (2003)
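
The idea of assigning terms from a controlled vocabulary can be made concrete with a deliberately naive sketch: match vocabulary labels against the document text and rank them by how often they occur. This is not the KEA++ algorithm, which trains a statistical model on indexed documents; it is a toy illustration of term assignment, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy illustration of controlled term assignment. NOT the KEA++ algorithm:
// it only counts whole-word occurrences of vocabulary labels in the text.
final class TermAssignmentSketch {

    // Lowercase, replace anything that is not a letter or digit with a space.
    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }

    static List<String> assignTerms(String text, Collection<String> labels, int maxTerms) {
        String haystack = " " + normalize(text) + " ";
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : labels) {
            String normalized = normalize(label);
            if (normalized.isEmpty()) {
                continue;                         // nothing to match
            }
            String needle = " " + normalized + " ";
            int count = 0;
            int from = 0;
            while ((from = haystack.indexOf(needle, from)) >= 0) {
                count++;
                from += needle.length() - 1;      // reuse the trailing space
            }
            if (count > 0) {
                counts.put(label, count);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Most frequent first; the sort is stable, so ties keep input order.
        ranked.sort((a, b) -> counts.get(b) - counts.get(a));
        return new ArrayList<>(ranked.subList(0, Math.min(maxTerms, ranked.size())));
    }

    public static void main(String[] args) {
        String text = "Soil erosion reduces soil fertility. Erosion control matters.";
        List<String> labels = Arrays.asList("soil", "erosion control", "water", "soil erosion");
        System.out.println(assignTerms(text, labels, 2));
    }
}
```

A learned approach such as KEA++ differs mainly in how candidates are scored: instead of raw counts, a model trained on expert-indexed documents decides which candidate terms are likely to be good keyphrases.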
