Introduction to Data Ingest using the Harvester 2012 VIVO Implementation Fest
Welcome & Who are we? • Vincent Sposato, University of Florida – Enterprise Software Engineering – Primarily focused on VIVO operations and reproducible harvests • Eliza Chan, Weill Cornell Medical College – Information Technologies and Services (ITS) – Primarily focused on VIVO customization (content / branding / ontology) and data ingest • John Fereira, Cornell University – Mann Library Information Technology Services (ITS) – Programmer / Analyst / Technology Strategist
Goals of this session • Provide you with the basics of harvester functionality • Provide you with a brief overview of the standard harvest process • Answer questions
What is the harvester? • A library of ETL tools written in Java for: • Extracting data from external sources • Transforming it into RDF in the VIVO schema • Loading it into the VIVO application • A way to build automated, reproducible data ingests into your VIVO application
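To make the five steps concrete, here is a minimal sketch of a master run script in the spirit of the example scripts that ship with the harvester. The config file names, paths, and the -X config-file argument are assumptions modeled on the harvester-<tool> naming used elsewhere in these slides; a real ingest should start from one of the shipped example scripts, which also handle logging and model management.

    #!/bin/bash
    # Minimal sketch of a harvest run script (illustrative only -- config file
    # names and the -X argument style are assumptions; start from a shipped
    # example script for a real ingest).
    set -e
    HARVESTER_DIR=/usr/share/vivo/harvester    # hypothetical install location
    cd "$HARVESTER_DIR"

    bin/harvester-jdbcfetch      -X config/tasks/people.jdbcfetch.xml      # Fetch
    bin/harvester-xsltranslator  -X config/tasks/people.xsltranslator.xml  # Translate
    bin/harvester-score          -X config/tasks/people.score.xml          # Score
    bin/harvester-match          -X config/tasks/people.match.xml          # Match
    bin/harvester-transfer       -X config/tasks/people.transfer.xml       # Transfer into VIVO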
What the harvester is • A useful set of tools for data ingest and manipulation in semantic datastores • A set of examples of ingesting data into VIVO in real-life scenarios • An open-source, community-developed solution
What the harvester is not • A one-button solution to all of your data ingest problems • Perfect
Fetch • First step in the harvest process • Brings data from the external source to the harvester • Fetch classes: • OAIFetch – pull from Open Archives Initiative repositories • PubmedFetch – pull publications from the PubMed catalog utilizing a SOAP-style interface • NLMJournalFetch – pull publications from the National Library of Medicine's catalog utilizing a SOAP-style interface • JDBCFetch – pull information from a relational database over JDBC • D2RMapFetch – pull information from a relational database directly into RDF using the D2RMap library
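As a rough illustration, a JDBCFetch task file follows the same <Param name="..."> convention shown later in the troubleshooting slides. The parameter names and the surrounding element below are placeholders (assumptions), not the exact keys; copy the jdbcfetch config from a shipped example, keep its parameter names, and substitute your own connection details.

    <!-- Sketch of a JDBCFetch task config; element and parameter names are placeholders -->
    <Task>
      <Param name="driver">com.mysql.jdbc.Driver</Param>
      <Param name="connection">jdbc:mysql://dbhost.example.edu:3306/hr</Param>
      <Param name="username">harvest_user</Param>
      <Param name="password">********</Param>
      <!-- Output: a file record handler holding one XML record per fetched row -->
      <Param name="output">config/recordhandlers/people-raw.xml</Param>
    </Task>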
Translate • The most important part of the entire process, as a mistake here will result in ‘dirty’ data • The most common translator is the XSLTranslator, which uses XSL stylesheets • Translate classes: • XSLTranslator – uses XSL files to translate non-native data into VIVO RDF/XML • GlozeTranslator – uses a Gloze schema to translate data into basic RDF/XML • VCardTranslator – intended to translate a vCard into VIVO RDF (still in progress)
Score • Scores incoming data against existing VIVO data to determine potential matches • All input data is scored using a defined algorithm • The scored dataset can be limited to a given ‘harvested’ namespace • Multi-tiered scoring can be useful for narrowing the dataset before additional factors are applied
Score Algorithms • EqualityTest (most common) • Tests for exact equality • NormalizedDoubleMetaphoneDifference • Tests for phonetic equality • NormalizedSoundExDifference • Tests for misspelling distance • NormalizedDamerauLevenshteinDifference • Tests for misspelling distance accounting for transpositions • NormalizedTypoDifference • Tests for misspelling distance specific to a QWERTY keyboard • CaseInsensitiveInitialTest • Tests whether the first letter of each string is the same, case-insensitively • NameCompare • A test specifically designed for comparing names
Match • Uses the cumulative weighted scores from Scoring to determine a ‘match’ • The harvested URI is changed to the matching entity’s VIVO URI • Can either keep or delete data properties about the matched entity (e.g. extraneous name information about a known VIVO author)
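The arithmetic behind a cumulative weighted match decision is simple. The toy shell snippet below (not harvester code) shows two normalized algorithm scores being combined with weights and compared against a match threshold; the scores, weights, and threshold are made-up values for illustration.

    # Toy illustration of cumulative weighted scoring (not harvester code).
    # Assume EqualityTest on a unique ID returned 1.0 and NameCompare on the
    # rdfs:label returned 0.8, with weights 0.7 / 0.3 and a 0.9 threshold.
    id_score=1.0
    name_score=0.8
    total=$(awk -v a="$id_score" -v b="$name_score" 'BEGIN { printf "%.2f", 0.7*a + 0.3*b }')
    echo "cumulative score: $total"
    if awk -v s="$total" 'BEGIN { exit !(s >= 0.9) }'; then
      echo "match: harvested URI will be rewritten to the existing VIVO URI"
    else
      echo "no match: entity keeps its harvested-namespace URI"
    fi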
Transfer • The final step – and the one that actually puts your data into VIVO • Works directly with Jena models, and speaks RDF only • Useful for troubleshooting data issues throughout the process
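Because Transfer works directly with Jena models, it doubles as a troubleshooting tool: point it at any intermediate model and dump it to a file to inspect exactly what a step produced. A sketch, using the -i (input model config) and -d (dump file) options shown in the troubleshooting slides; the model config path is hypothetical.

    # Dump the harvested model to a file for review instead of loading it into VIVO.
    bin/harvester-transfer -i config/models/harvested-data.model.xml \
                           -d harvested-data-dump.rdf.xml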
Identify the data • Determine the resource that has the data you are seeking • Meet with data stewards (owners) to determine how to access it • Develop a listing of the data fields available
Start to build your ingest • Based on your source data, select a template from the examples • In our example we started with example-peoplesoft • Copy example-harvester-name into a separate directory to begin your work • This gives you an unadulterated copy of the original for reference, and also prevents accidental breakage when updating the harvester package • Rename files and scripts to match your actual ingest • Update all files to reflect your actual locations and data sources • Use file record handlers to view the data in a human-readable format until you have a working run of the harvest
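For example (all paths and script names below are illustrative), copying the shipped example into a separate working directory keeps the original pristine and survives harvester upgrades:

    # Copy the shipped example into your own ingest directory; the original stays
    # untouched when the harvester package is updated. Paths/names are examples.
    cp -r /usr/share/vivo/harvester/example-peoplesoft /data/ingests/uf-people
    cd /data/ingests/uf-people
    # Rename the run script to reflect the actual ingest, then update the paths,
    # connection details, and namespaces inside the copied config files.
    mv run-peoplesoft.sh run-uf-people.sh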
Build the Analytics • Write SPARQL queries that express the data you will be working with • This is helpful for determining the net changes in data resulting from the ingest • Utilize the harvester’s JenaConnect tools to execute the queries and output the results as text • Insert the queries into the master shell script so the analytics run both before and after the transfer to VIVO • Optionally set up email delivery of these analytics so they can be monitored on a daily basis
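A minimal sketch of one such analytics query, run from the command line through JenaConnect using the -j / -q options shown on the tools slides; the model config path and output location are assumptions.

    # Count foaf:Person individuals in VIVO so the net change from the ingest can
    # be reviewed; run once before and once after the transfer, then diff/email.
    QUERY='PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT (COUNT(DISTINCT ?p) AS ?people) WHERE { ?p a foaf:Person }'

    bin/harvester-jenaconnect -j config/models/vivo.model.xml -q "$QUERY" \
      > analytics/people-count-$(date +%Y%m%d-%H%M).txt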
Build your Fetch • Create model files that point to the correct locations for source data • Test connections to ensure data is accessible and arrives in the expected format • Identify any issues with connections, and determine transfer speed for overall timing
Build field mappings • Identify related VIVO properties • Build a mapping between source data and the resulting VIVO properties / classes
Build translation file • Start from the base XSL of the ingest example that you selected • The field mappings created in the previous step will help immensely here • Determine the entities that will be created from this process • In our example, Person and Department • Work through each data property and/or class • Build logic into the XSL file where necessary to accomplish your goals
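A stripped-down sketch of what such an XSL file looks like, assuming the fetch step produced XML records with a root <people> element and <person> children carrying ufid / first_name / last_name fields (all hypothetical names); the example XSL files shipped with the harvester are the real starting point.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
        xmlns:foaf="http://xmlns.com/foaf/0.1/">
      <xsl:template match="/people">
        <rdf:RDF>
          <xsl:apply-templates select="person"/>
        </rdf:RDF>
      </xsl:template>
      <xsl:template match="person">
        <!-- Mint the subject in the harvested namespace so Score/Match can work on it later -->
        <rdf:Description rdf:about="http://vivo.example.edu/harvest/person{ufid}">
          <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
          <rdfs:label><xsl:value-of select="concat(last_name, ', ', first_name)"/></rdfs:label>
          <foaf:firstName><xsl:value-of select="first_name"/></foaf:firstName>
          <foaf:lastName><xsl:value-of select="last_name"/></foaf:lastName>
        </rdf:Description>
      </xsl:template>
    </xsl:stylesheet>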
Test run through translation • Test the first two steps of your ingest by inserting an exit into your script • Verify that the source data came over and that your translated records look as expected • Make sure to have test cases for each type of data you could see • Avoid the inevitable hand-to-forehead moments • Wash, rinse, and repeat
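One low-tech way to do this is to drop a temporary exit into the run script immediately after the translate step, then read the translated records from the file record handler (config names and the -X argument are the same assumptions as in the run-script sketch above):

    bin/harvester-jdbcfetch     -X config/tasks/people.jdbcfetch.xml
    bin/harvester-xsltranslator -X config/tasks/people.xsltranslator.xml
    exit 0   # TEMPORARY: stop here so the translated RDF/XML records can be reviewed
    # ... score / match / transfer steps below are skipped while testing ...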
Set up the scoring • Determine your scoring strategy based on the unique items you have available • Approach 1 (People) – almost all institutions have some sort of non-SSN unique identifier; this is a slam dunk for an EqualityTest • Approach 2 (Publications) – we used a tiered scoring approach to successively shrink the dataset and to provide a better match • Determining the weight of each algorithm is important for cumulative scoring
Set up the matching • The bulk of the work here was done when thinking about scoring; now it is time to implement your threshold for matching • Matching is done on individual entities, and matches are called based upon meeting the threshold • All data associated with a matched entity will go over, unless you determine it is not needed
Test run through match • Test run of the process through the match step • Use all test cases from your previous tests to make sure you can account for all variations • You need matching data in your test VIVO to ensure that you actually see the matching work • Use Transfer to dump the harvested data model and verify that all data is as you expect • Still review the outputs of the previous two steps to ensure nothing has inadvertently changed
Set up the Transfer • Determine whether or not subtractions will be necessary • Will the entire dataset be provided every time? • Will only new data be provided? • Will a mix of new and old data be provided? • Make sure that the previous-harvest model gets updated to reflect these additions / subtractions
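When the source delivers a full snapshot each time, the usual pattern is to diff the new harvest against the previous one: additions are loaded into VIVO, subtractions are removed, and the previous-harvest model is updated to match what was actually loaded. A sketch of the diff step, assuming the harvester's Diff tool takes minuend (-m) and subtrahend (-s) model configs plus a dump file (-d); verify the exact flags and paths against the example scripts in your harvester version.

    # Additions: statements present in the new harvest but not in the previous one.
    bin/harvester-diff -m config/models/harvested-data.model.xml \
                       -s config/models/previous-harvest.model.xml \
                       -d data/additions.rdf.xml
    # Subtractions: statements from the previous harvest that disappeared this time.
    bin/harvester-diff -m config/models/previous-harvest.model.xml \
                       -s config/models/harvested-data.model.xml \
                       -d data/subtractions.rdf.xml
    # Apply the additions and subtractions to both VIVO and the previous-harvest
    # model so the next run diffs against what is actually in VIVO.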
Test run through entire process • This is the full dress rehearsal, and should be done in a test environment (UF calls it Development) • This is where your analytics really help, as reviewing your test cases against what actually made it into VIVO is invaluable • Check the outputs from all steps of the process to make sure that everything is firing as expected • Review the data as it appears in the VIVO application, as even the best-designed ingest can still have unintended display consequences
Full production ingest • This is the moment we all wait for, when the rest of the world gets to see the fruits of our labor • Promote the ingest to the production environment and confirm that all settings have been updated for that environment • Kick off the ingest, sit back, and watch the data move • Pat yourself and your team on the back; your VIVO is now alive with the sound of data
Additional Harvester Tools/Utilities • ChangeNamespace • Creates new nodes for harvested entities that were not matched • Smush • Combines graphs of RDF data when they share certain links • Provides the same functionality on the command line as the data ingest menu • Qualify • Used to clean and verify data before import • Also allows independent use to manipulate data via regular expressions, string replacements, and property removal • RenameResource • Takes in an old URI and a new URI, and renames any match on the old URI to the new URI • Provides the same functionality on the command line as the data ingest menu
Additional Harvester Tools/Utilities • JenaConnect • Used by the harvester to connect to Jena models • Also allows SPARQL queries to be run from the command line • harvester-jenaconnect -j pathToVIVOModel -q "Query Text" • XMLGrep • Allows for moving files that match an XPath expression • Useful for pulling XML files out of a set of data for separate processing or a different workflow
Troubleshooting • Memory Issues • Each harvester function has a corresponding setup script that gets called from the harvester/bin directory • These are set for minimal memory usage, but for large datasets they need to be adjusted • UF currently allocates a minimum of 4 GB and a maximum of 6 GB for the Score, Match, and Diff functions
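The heap sizes in those setup scripts are set with the standard JVM -Xms/-Xmx options; raising them for the heavier steps looks roughly like this (the script layout varies by harvester version, so treat the sed line as a sketch and simply edit the file by hand if in doubt):

    # Raise the JVM heap for the Score step to a 4 GB minimum / 6 GB maximum.
    # The script name and the exact -Xms/-Xmx line being edited are assumptions.
    sed -i 's/-Xms[0-9]\+[mMgG]/-Xms4g/; s/-Xmx[0-9]\+[mMgG]/-Xmx6g/' bin/harvester-score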
Troubleshooting • Unexpected Data Issues • When you are receiving unexpected results from the harvester, dump each step’s output to a file for review • harvester-transfer -i model-config-file -d path_and_name_of_file_to_dump_to • This is invaluable for reviewing each step of the process and the outputs being generated • When things are not scoring or matching correctly, check that your comparisons are set up correctly • Make sure that you are using the correct predicates and their associated predicate namespace • <Param name="inputJena-predicates">label=http://www.w3.org/2000/01/rdf-schema#label</Param> • Make sure that your harvested namespace is correct based upon your translation of the source data