
Comprehensive Workshop on HIVE Technology

Explore HIVE as a service, installation, configuration, API usage, and internals. Develop vocabularies with HIVE supporting technologies. Get hands-on with Java and SKOS.

fpederson


Presentation Transcript


  1. SKOS-2-HIVE GWU workshop

  2. Introductions • Ryan Scherle (ryan@scherle.org) • Craig Willis (craig.willis@unc.edu)

  3. Afternoon Session Schedule • Overview • Using HIVE as a service • Installing and configuring HIVE • Using HIVE Core API • Understanding HIVE Internals • HIVE supporting technologies • Developing and customizing HIVE

  4. Block 1: Introduction

  5. Workshop Overview • Schedule • Interactive, less structure • Hands-on (work together) • Activities: • Installing and configuring HIVE • Programming examples (HIVE Core API, HIVE REST API)

  6. What is your background? • Java • Tomcat/Webapps • REST • SKOS/RDF • Sesame • Lucene • What are you most interested in getting out of this workshop?

  7. HIVE Overview • HIVE Website • http://hive.nescent.org/ • Primarily for demonstration purposes • HIVE Architecture • Consists of many technologies combined to provide a framework for vocabulary services.

  8. HIVE Vocabularies • Partner vocabularies: • Library of Congress Subject Headings (LCSH) • NBII Biocomplexity Thesaurus (NBII) • Integrated Taxonomic Information System (ITIS) • Thesaurus of Geographic Names (TGN) • LTERNet Vocabulary (LTER) • Other • AGROVOC • Medical Subject Headings (MeSH)

  9. Architecture

  10. HIVE Functions • Conversion of vocabularies to SKOS • Rich internet application (RIA) for browsing and searching multiple SKOS vocabularies • Java API and REST application interfaces for programmatic access to multiple SKOS vocabularies • Support for natural language and SPARQL queries • Automatic keyphrase indexing using multiple SKOS vocabularies. HIVE supports two indexers: • KEA++ indexer • Basic Lucene indexer

  11. Block 2: Using HIVE as a service

  12. Using HIVE as a Service • HIVE web application • http://hive.nescent.org/ • Developed by Jose Perez-Aguera, Lina Huang • Java servlet, Google Web Toolkit (GWT) • http://code.google.com/p/hive-mrc/wiki/AboutHiveWeb • HIVE REST service • http://hive.nescent.org/rs • Developed by Duane Costa, Long-Term Ecological Research Network • http://code.google.com/p/hive-mrc/wiki/AboutHiveRestService

  13. Activity: Calling HIVE-RS Writing Java code to call the hive-rs web service
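
As a starting point for this activity, here is a minimal sketch of an HTTP GET against the hive-rs base URL from the previous slide, using only the JDK. The slides do not list the hive-rs resource paths or query parameters, so `example/resource` and `q` in the usage note are hypothetical placeholders; take the real ones from the AboutHiveRestService wiki page. The class and method names are illustrative as well.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class HiveRsClient {
    // Base URL of the hosted REST service, as given on the slides.
    static final String BASE_URL = "http://hive.nescent.org/rs";

    // Builds a request URL. resourcePath and paramName are supplied by the
    // caller because the real hive-rs paths are not shown in the slides.
    static String buildUrl(String resourcePath, String paramName, String paramValue) {
        try {
            return BASE_URL + "/" + resourcePath + "?"
                    + URLEncoder.encode(paramName, "UTF-8") + "="
                    + URLEncoder.encode(paramValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Performs a GET and returns the response body as text.
    static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("HTTP " + status + " for " + url);
            }
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
                return body.toString();
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

A call would look like `HiveRsClient.get(HiveRsClient.buildUrl("example/resource", "q", "soil erosion"))`, with the placeholder path and parameter replaced by ones the service actually exposes.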

  14. Block 3: Install and Configure HIVE

  15. Installing and Configuring HIVE • Requirements • Java 1.6 • Tomcat (HIVE is currently using 6.x) • Detailed installation instructions: • http://code.google.com/p/hive-mrc/wiki/InstallingHiveWeb • http://code.google.com/p/hive-mrc/wiki/InstallingHiveRestService

  16. Installing and Configuring HIVE-web • Detailed installation instructions (hive-web) • http://code.google.com/p/hive-mrc/wiki/InstallingHiveWeb • Quick start (hive-web) • Download and extract Tomcat 6.x • Download and extract latest hive-web war • Download and extract sample vocabulary • Configure hive.properties and agrovoc.properties • Start Tomcat • http://localhost:8080/

  17. Installing and Configuring HIVE-web from source • Detailed installation instructions (hive-web) • http://code.google.com/p/hive-mrc/wiki/DevelopingHIVE • http://code.google.com/p/hive-mrc/wiki/InstallingHiveWeb • Requirements • Eclipse IDE for J2EE Developers • Subclipse plugin • Google Eclipse Plugin • Apache Ant • Google Web Toolkit 1.7.1 • Tomcat 6.x

  18. Installing and Configuring HIVE REST Service • Detailed installation instructions (hive-rs) • http://code.google.com/p/hive-mrc/wiki/InstallingHiveRestService • Quick start (hive-rs) • Download and extract latest webapp • Download and extract sample vocabulary • Configure hive.properties • Start Tomcat

  19. Importing SKOS Vocabularies • http://code.google.com/p/hive-mrc/wiki/ImportingVocabularies • Note memory requirements for each vocabulary • http://code.google.com/p/hive-mrc/wiki/HIVEMemoryUsage • java -Xmx1024m -Djava.ext.dirs=path/to/hive/lib edu.unc.ils.mrc.hive.admin.AdminVocabularies [/path/to/hive/conf/] [vocabulary] [train]

  20. Block 4: Using the HIVE Core Library

  21. HIVE Core Interfaces

  22. HIVE Core Packages

  23. edu.unc.ils.mrc.hive.api • SKOSServer: • Provides access to one or more vocabularies • SKOSSearcher: • Supports searching across multiple vocabularies • SKOSTagger: • Supports tagging/keyphrase extraction across multiple vocabularies • SKOSScheme: • Represents an individual vocabulary

  24. SKOSServer • SKOSServer is the top-level class used to initialize the vocabulary server. • Reads the hive.properties file and initializes the SKOSScheme (vocabulary management), SKOSSearcher (concept searching), SKOSTagger (indexing) instances based on the vocabulary configurations. • edu.unc.ils.mrc.hive.api.SKOSServer • TreeMap<String, SKOSScheme> getSKOSSchemas(); • SKOSSearcher getSKOSSearcher(); • SKOSTagger getSKOSTagger(); • String getOrigin(QName uri);

  25. SKOSSearcher • Supports searching across one or more configured vocabularies. • Keyword queries using Lucene, SPARQL queries using OpenRDF/Sesame • edu.unc.ils.mrc.hive.api.SKOSSearcher • searchConceptByKeyword(uri, lp) • searchConceptByURI(uri, lp) • searchChildrenByURI(uri, lp) • SPARQLSelect()

  26. SKOSTagger • Keyphrase extraction using multiple vocabularies • Depends on setting in hive.properties: “dummy” or “KEA” • edu.unc.ils.mrc.hive.api.SKOSTagger • List<SKOSConcept> getTags(String text, List<String> vocabularies, SKOSSearcher searcher);

  27. SKOSScheme • Represents an individual vocabulary, based on settings in <vocabulary>.properties • Supports querying of statistics about each vocabulary (number of concepts, number of relationships, etc.)

  28. Activity • Write a simple Java class that allows the user to query for a given term • Write a Java class that can read a text file and call the tagger

  29. Block 5: Understanding HIVE Internals

  30. Architecture

  31. Data Directory Layout • /usr/local/hive/hive-data • vocabulary/ • vocabulary.rdf: SKOS RDF/XML • vocabularyAlphaIndex: Serialized map • vocabularyH2: H2 database (used by KEA) • vocabularyIndex: Lucene index • vocabularyKEA: KEA model and training data • vocabularyStore: Sesame/OpenRDF store • topConceptIndex: Serialized map of top concepts
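
The layout above can be expressed as a small path helper. This sketch assumes that "vocabulary" in the slide is a placeholder for the vocabulary's name (for example `agrovoc`), which the slide implies but does not state; the class name and map keys are illustrative and not part of HIVE.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

final class HiveDataLayout {
    // Returns the expected location of each data artifact for one vocabulary.
    // ASSUMPTION: the literal "vocabulary" on the slide stands for the
    // vocabulary name, so "agrovoc" yields agrovoc/agrovoc.rdf and so on.
    static Map<String, Path> pathsFor(String dataRoot, String vocabulary) {
        Path dir = Paths.get(dataRoot, vocabulary);
        Map<String, Path> paths = new LinkedHashMap<>();
        paths.put("SKOS RDF/XML", dir.resolve(vocabulary + ".rdf"));
        paths.put("Alpha index", dir.resolve(vocabulary + "AlphaIndex"));
        paths.put("H2 database", dir.resolve(vocabulary + "H2"));
        paths.put("Lucene index", dir.resolve(vocabulary + "Index"));
        paths.put("KEA model", dir.resolve(vocabulary + "KEA"));
        paths.put("Sesame store", dir.resolve(vocabulary + "Store"));
        paths.put("Top concept index", dir.resolve("topConceptIndex"));
        return paths;
    }
}
```

Calling `HiveDataLayout.pathsFor("/usr/local/hive/hive-data", "agrovoc")` gives a quick checklist of the files and directories to look for after an import.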

  32. KeywordSearch

  33. Indexing

  34. HIVE Internals: Data Models • Lucene Index: Index of SKOS vocabulary (view with Luke) • Sesame/OpenRDF Store: Native/Sail RDF repository for the vocabulary • KEA++ Model: Serialized KEAFilter object • H2 Database: Embedded DB containing the SKOS vocabulary in the format used by KEA (can be queried using the H2 command line) • Alpha Index: Serialized map of concepts • Top Concept Index: Serialized map of top concepts

  35. HIVE Internals: HIVE Web • GWT Entry Points: • HomePage • ConceptBrowser • Indexer • Servlets • VocabularyService: Singleton vocabulary server • FileUpload: Handles the file upload for indexing • ConceptBrowserServiceImpl • IndexerServiceImpl

  36. HIVE Internals: HIVE-RS Details of HIVE-rs

  37. Block 6: HIVE Supporting Technologies

  38. HIVE supporting technologies • Lucene http://lucene.apache.org • Sesame http://www.openrdf.org/ • KEA http://www.nzdl.org/Kea/ • H2 http://www.h2database.com/ • GWT http://code.google.com/webtoolkit/

  39. Activity • Explore Lucene index with Luke • http://luke.googlecode.com/ • Explore Sesame store with SPARQL • http://www.xml.com/pub/a/2005/11/16/introducing-sparql-querying-semantic-web-tutorial.html • http://www.cambridgesemantics.com/2008/09/sparql-by-example/
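
For the SPARQL part of this activity, a first query to try is a lookup of concepts by preferred label. The sketch below only builds the query text; it relies on the standard SKOS namespace and `skos:prefLabel`, and it assumes the store holds plain SKOS as produced by the import step. How the string is submitted (for example through the `SPARQLSelect()` method listed on the SKOSSearcher slide, whose parameters are not shown there) is left open. The class and method names are illustrative.

```java
final class SkosSparql {
    // Standard SKOS core namespace.
    static final String SKOS_NS = "http://www.w3.org/2004/02/skos/core#";

    // Builds a SELECT query returning concepts whose preferred label equals
    // the given text, ignoring any language tag on the label.
    static String conceptsByPrefLabel(String label, int limit) {
        // Escape backslashes and double quotes so the label is a valid
        // SPARQL string literal.
        String escaped = label.replace("\\", "\\\\").replace("\"", "\\\"");
        return "PREFIX skos: <" + SKOS_NS + ">\n"
                + "SELECT ?concept WHERE {\n"
                + "  ?concept skos:prefLabel ?label .\n"
                + "  FILTER (str(?label) = \"" + escaped + "\")\n"
                + "}\n"
                + "LIMIT " + limit;
    }
}
```

Swapping `skos:prefLabel` for `skos:altLabel`, or adding `?concept skos:broader ?parent`, is an easy way to explore the other relationships in the store.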

  40. Block 7: Customizing HIVE

  41. Obtaining Vocabularies • Several vocabularies can be freely downloaded • Some vocabularies require licensing • HIVE Core includes converters for each of the supported vocabularies • List of HIVE vocabularies: http://code.google.com/p/hive-mrc/wiki/VocabularyConversion

  42. Converting Vocabularies to SKOS • Additional information • http://code.google.com/p/hive-mrc/wiki/VocabularyConversion • Each vocabulary has different requirements

  43. Converting Vocabularies to SKOS • A Method to Convert Thesauri to SKOS (van Assem et al) • Prolog implementation • IPSV, GTAA, MeSH • http://thesauri.cs.vu.nl/eswc06/ • Converting MeSH to SKOS for HIVE • Java SAX-based parser • http://code.google.com/p/hive-mrc/wiki/MeshToSKOS
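
To illustrate the SAX-based approach, here is a minimal sketch that reads a simplified MeSH descriptor fragment and emits one Turtle-style `skos:prefLabel` line per record. It is not the HIVE converter: it handles only `DescriptorUI` and `DescriptorName/String`, takes the first occurrence of each within a record, does not escape literals, and mints URIs under a hypothetical `http://example.org/mesh/` base. See the MeshToSKOS wiki page for the real mapping.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class MeshSaxSketch {
    // HYPOTHETICAL base URI for minted concept identifiers.
    static final String BASE_URI = "http://example.org/mesh/";

    static List<String> toSkosLines(String xml) {
        final List<String> lines = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final Deque<String> path = new ArrayDeque<>(); // open elements
            private final StringBuilder text = new StringBuilder(); // current leaf text
            private String ui;
            private String name;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                path.push(qName);
                text.setLength(0);
                if (qName.equals("DescriptorRecord")) { // new record: reset state
                    ui = null;
                    name = null;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                path.pop();
                String parent = path.peek(); // null at the document root
                String value = text.toString().trim();
                text.setLength(0);
                if (qName.equals("DescriptorUI") && "DescriptorRecord".equals(parent) && ui == null) {
                    ui = value;
                } else if (qName.equals("String") && "DescriptorName".equals(parent) && name == null) {
                    name = value;
                } else if (qName.equals("DescriptorRecord") && ui != null && name != null) {
                    lines.add("<" + BASE_URI + ui + "> skos:prefLabel \"" + name + "\" .");
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return lines;
    }
}
```

Because SAX streams the document instead of building a tree, the same handler shape can be pointed at a large descriptor file without loading it all into memory, which is the main reason to prefer it for a vocabulary of MeSH's size.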

  44. LTER Sample Service http://scoria.lternet.edu:8080/lter-hive-prototypes

  45. Discussion • Pros and Cons • HIVE Core vs. HIVE Web vs. HIVE-RS • Brainstorm applications that could benefit from HIVE, discuss implementations

  46. Block 8: KEA++

  47. About KEA++ • http://www.nzdl.org/Kea/ • Algorithm and open-source Java library for extracting keyphrases from documents using SKOS vocabularies. • Developed by Alyona Medelyan (KEA++), based on earlier work by Ian Witten (KEA) from the Digital Libraries and Machine Learning Lab at the University of Waikato, New Zealand. • Problem: How can we automatically identify the topic of documents?

  48. Automatic Indexing • Medelyan, O. and Witten, I.H. (2008). “Domain-independent automatic keyphrase indexing with small training sets.” Journal of the American Society for Information Science and Technology, 59(7): 1026-1040. • Free keyphrase indexing (KEA) • Significant terms in a document are determined based on intrinsic properties (e.g., frequency and length). • Keyphrase indexing (KEA++) • Terms from a controlled vocabulary are assigned based on intrinsic properties. • Controlled indexing/term assignment: • Documents are classified based on content that corresponds to a controlled vocabulary. • e.g., Pouliquen, Steinberger, and Ignat (2003)

  49. KEA++ at a Glance • KEA++ uses a machine learning approach to keyphrase extraction • Two stages: • Candidate identification: Find terms that relate to the document’s content • Keyphrase selection: Uses a model to identify the most significant terms.

  50. KEA++: Candidate identification • Parse tokens based on whitespace and punctuation • Create word n-grams based on longest term in CV • Stem to grammatical root (Porter) • Stem terms in vocabulary (Porter) • Replace non-descriptors with descriptors using CV relationships • Match stemmed n-grams to vocabulary
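
The candidate identification steps above can be sketched in a few lines of Java. This is an illustration of the idea, not the KEA++ code: the Porter stemmer is replaced by a crude stand-in that lower-cases and strips one trailing "s", and the controlled vocabulary is passed in as a plain set of descriptors plus a map from non-descriptors to descriptors. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class CandidateSketch {
    // Crude stand-in for the Porter stemmer (NOT Porter): lower-case and
    // strip a single trailing "s" from words longer than three letters.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    // Step 1: split on whitespace and punctuation.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[\\s\\p{Punct}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    static String stemPhrase(String phrase) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokenize(phrase)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(stem(t));
        }
        return sb.toString();
    }

    // Returns the descriptors whose stemmed form matches an n-gram of the text.
    static Set<String> candidates(String text, Set<String> descriptors,
                                  Map<String, String> nonDescriptors) {
        // Stem vocabulary terms; non-descriptors map straight to their descriptor.
        Map<String, String> index = new HashMap<>();
        int maxLen = 1; // length in words of the longest vocabulary term
        for (String d : descriptors) {
            index.put(stemPhrase(d), d);
            maxLen = Math.max(maxLen, tokenize(d).size());
        }
        for (Map.Entry<String, String> e : nonDescriptors.entrySet()) {
            index.putIfAbsent(stemPhrase(e.getKey()), e.getValue());
            maxLen = Math.max(maxLen, tokenize(e.getKey()).size());
        }
        // Build stemmed n-grams up to maxLen words and match them.
        List<String> tokens = tokenize(text);
        Set<String> found = new LinkedHashSet<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder gram = new StringBuilder();
            for (int n = 0; n < maxLen && start + n < tokens.size(); n++) {
                if (n > 0) gram.append(' ');
                gram.append(stem(tokens.get(start + n)));
                String descriptor = index.get(gram.toString());
                if (descriptor != null) found.add(descriptor);
            }
        }
        return found;
    }
}
```

With descriptors "Soil erosion" and "Rivers" and the non-descriptor "streams" mapped to "Rivers", the text "Streams cause soil erosion." yields the candidates "Rivers" and "Soil erosion". The second stage (keyphrase selection) would then rank these candidates with the trained model.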
