Where’s My Data? Using MetriDoc to manage data integration headaches
Joe Zucca – zucca@pobox.upenn.edu
Tommy Barker – tbarker@pobox.upenn.edu
Sponsored by
The Problem
• The request seems simple but the solution is complex
  • We are generally asked “who did / used x?”, which leads to other questions
    • Where’s the data?
    • What’s the grain of the answer?
• So how do we answer these questions?
  • If lucky, run a script / query against a database and generate a report
  • If not lucky, build an application to answer the question
• This is what MetriDoc is built for
Current Solution - Datafarm
Datafarm = Crontab + Perl + CGI = Spaghetti
[Diagram labels: Gate Count, Voyager, Blackboard, Ezproxy, DLA logs, Penn Community, Borrow Direct, COUNTER; Datafarm; App 1, App 2, App 3, App n]
Datafarm Shortcomings
• Maintainability issues
• Not shareable
• Not reusable
MetriDoc = Datafarm 2.0
• As our system grew, we began creating MetriDoc to address Datafarm’s problems
  • Needed a scheduler more sophisticated than cron
  • Needed languages more maintainable than Perl
  • Needed integration tools to simplify data gathering across disparate systems
• We built prototypes and services to help us evaluate technologies
• Received a grant from IMLS to speed up development
  • Hired another programmer
MetriDoc Philosophy
• Keep it simple
  • Sometimes a script is all you need
  • Ease of use is more important than performance
• Don’t reinvent the wheel
• 100% open source
• Shareable data
MetriDoc – How it Works
• MetriDoc’s core is built around database schemas
• A MetriDoc implementation consists of loading tables and normalized tables
  • Loading tables prime the repository
    • The user is responsible for populating these tables
  • Normalized tables are built from the data in the loading tables
    • MetriDoc takes care of this
• Conforming to similar schemas opens up interesting possibilities
  • Sharing data is easy
  • Sharing a single repository is easy (think Amazon Web Services)
  • Collaboration is easier
• From a user’s perspective
  • MetriDoc has tools to get your data into the loading tables
    • But ultimately you just need to get it in there, so you can use whatever tool you like
  • Use the MetriDoc tools to manage your integration needs
    • Useful for getting, transforming / resolving, moving and loading data
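The split between loading tables and normalized tables can be sketched in a few lines of Java. This is a minimal illustration only, assuming in-memory lists in place of database tables; the class, field and method names (`LoadingRow`, `NormalizedRow`, `normalize`, `titleIds`) are hypothetical and are not MetriDoc's actual schema or API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: stands in for a loading table and a normalized table.
class LoadingToNormalized {

    // A loading-table row: raw values exactly as the user supplied them.
    static final class LoadingRow {
        final String patronId;
        final String resourceTitle;

        LoadingRow(String patronId, String resourceTitle) {
            this.patronId = patronId;
            this.resourceTitle = resourceTitle;
        }
    }

    // A normalized row: the free-text title is replaced by a key into a title lookup.
    static final class NormalizedRow {
        final String patronId;
        final int titleId;

        NormalizedRow(String patronId, int titleId) {
            this.patronId = patronId;
            this.titleId = titleId;
        }
    }

    // Builds normalized rows and fills titleIds (cleaned title -> surrogate key) as it goes.
    static List<NormalizedRow> normalize(List<LoadingRow> loading, Map<String, Integer> titleIds) {
        List<NormalizedRow> normalized = new ArrayList<>();
        for (LoadingRow row : loading) {
            String cleaned = row.resourceTitle.trim().toLowerCase(Locale.ROOT);
            Integer id = titleIds.get(cleaned);
            if (id == null) {
                id = titleIds.size() + 1; // next unused key
                titleIds.put(cleaned, id);
            }
            normalized.add(new NormalizedRow(row.patronId, id));
        }
        return normalized;
    }

    public static void main(String[] args) {
        Map<String, Integer> titleIds = new LinkedHashMap<>();
        List<LoadingRow> loading = Arrays.asList(
                new LoadingRow("p1", " Vaccine "),
                new LoadingRow("p2", "vaccine"),
                new LoadingRow("p1", "Nature"));
        for (NormalizedRow row : normalize(loading, titleIds)) {
            System.out.println(row.patronId + " -> title " + row.titleId);
        }
    }
}
```

The point of the sketch is the division of labour: the user only has to produce `LoadingRow`-style raw data, and a separate step owns the cleaning and key assignment.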
MetriDoc – Core Technologies
• JVM
  • Java is used for infrastructure
  • Groovy is the primary language
• Master Scheduler
  • Essentially the brains of MetriDoc
  • Using Hudson for now (http://hudson-ci.org/)
• Integration Tooling
  • Tooling built on top of Apache Camel (http://camel.apache.org/)
  • Helps move data from one place to another
  • Really helpful for batch processing
• Resolutions / Transformation Tools
  • Patron anonymization, text normalization, resource id to title resolutions, etc.
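To make the resolution / transformation bullet concrete, here is a hedged Java sketch of two such steps. The slides do not say how MetriDoc anonymizes patrons or normalizes text; the salted SHA-256 hash and the whitespace / case folding below are assumptions chosen for illustration, and `PatronAnonymizer`, `anonymize` and `normalizeText` are hypothetical names. A salted hash is strictly pseudonymization: the same patron always maps to the same token, which is what makes joins across data sets possible.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical sketch of two transformation steps; not MetriDoc's actual tooling.
class PatronAnonymizer {

    // Replaces a patron id with the hex SHA-256 digest of (salt + id).
    static String anonymize(String patronId, String salt) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest((salt + patronId).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 support is mandatory for every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Trims, collapses runs of whitespace to one space, and lower-cases the text.
    static String normalizeText(String text) {
        return text.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "example-salt" is a placeholder; a real salt would be kept secret.
        System.out.println(anonymize("jsmith", "example-salt"));
        System.out.println(normalizeText("  Journal   of  VACCINE "));
    }
}
```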
The MetriDoc Solution
MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
Step 1 – Fill the loading tables
[Diagram labels: Voyager, Ezproxy, COUNTER; Hudson: Load Ezproxy, Load Counter, Load Patron Info; Loading Tables]
Loading Tables
00.000.000.000||Philadelphia||PA||United States||Default+datasets+documents+pwp+vanwert||jsmith||[19/Jan/2011:00:01:44 -0500]||GET||https://proxy.library.upenn.edu:443/login?url=http://www.sciencedirect.com/science?_ob=GatewayURL&_origin=SFX&_method=citationSearch&_volkey=0264410X%2329%23266%232&_version=1&md5=8e47306a7f3a7da8a6fe7b521a7a149b||302||0||http://elinks.library.upenn.edu/sfx_local?genre=article&issn=0264410X&title=Vaccine&volume=29&issue=2&date=20101216&atitle=An+adjuvanted+pandemic+influenza+H1N1+vaccine+provides+early+and+long+term+protection+in+health+care+workers.&spage=266&sid=EBSCO:aph&pid=Madhun%2c+Abdullah+S.%3bAkselsen%2c+Per+Espen%3bSjursen%2c+Haakon%3bPedersen%2c+Gabriel%3bSvindland%2c+Signe%3bN%c3%b8stbakken%2c+Jane+Kristin%3bNilsen%2c+Mona%3bMohn%2c+Kristin%3bJul-Larsen%2c+%c3%85sne%3bSmith%2c+Ingrid%3bMajor%2c+Diane%3bWood%2c+John%3bCox%2c+Rebecca+J.5550217620101216aph||Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5 (.NET CLR 3.5.30729)]||Re07OuEIyQo8X6w||UPennLibrary=AAAAAUkQ36AAAFTaAwO7Ag==; __utma=10244330.1344196133.1295210953.1295404568.1295411821.9; __utmc=10244330; __utmz=10244330.1295411821.9.3.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn; WRUID=0; __utmv=10244330.|1=User-Type=Current%20Students=1,; __utma=94565761.447912360.1295320755.1295404584.1295411882.4; __utmc=94565761; __utmz=94565761.1295320755.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn%20blackboard; hp=/vanpelt/; __utma=261680716.1522407254.1295392237.1295404624.1295412044.3; __utmc=261680716; __utmz=261680716.1295412044.3.3.utmcsr=library.upenn.edu|utmccn=(referral)|utmcmd=referral|utmcct=/biomed/; proxySessionID=18175547; ezproxy=Re07OuEIyQo8X6w; ARPT=MWPYIPS108CWYL; EHost2=sid=49d81d33-5139-4dbd-b94f-5d76b01ffbdc@sessionmgr13&k2=dGJyMPGtr0iyqbVIrOPfgeyk44Dt6fIA&k3=dGJyMOPY8Xvt&k4=ehost&k6=en&k7=live&k8=DS:live; __utmb=10244330.4.10.1295411821; __utmb=94565761.6.9.1295413021459; __utmb=261680716.1.10.1295412044; ASPSESSIONIDCCAQQCRC=AHJAGJMDDPNIIMLMHBCPCHBL
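A raw record like the one above is delimited by `||`, so a first pass at reading it could look like the Java sketch below. This is an assumption-laden illustration rather than MetriDoc's loader: the field positions (user at index 5, timestamp at 6, URL at 8, status at 9) are inferred only from this one sample, the sample line in `main` is a shortened stand-in, and `EzproxyLineParser` is a hypothetical name.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical sketch: splits one "||"-delimited log record into fields.
class EzproxyLineParser {

    // Matches stamps such as 19/Jan/2011:00:01:44 -0500 (brackets removed first).
    static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("dd/MMM/uuuu:HH:mm:ss Z", Locale.ENGLISH);

    // Splits on the two-character delimiter; single '|' inside cookies is left alone.
    static String[] fields(String line) {
        return line.split("\\|\\|", -1);
    }

    // Parses a bracketed timestamp field into an OffsetDateTime.
    static OffsetDateTime timestamp(String bracketed) {
        String bare = bracketed.substring(1, bracketed.length() - 1);
        return OffsetDateTime.parse(bare, STAMP);
    }

    public static void main(String[] args) {
        // Shortened stand-in for the full record; positions are assumed from the sample.
        String line = "00.000.000.000||Philadelphia||PA||United States||Default||jsmith"
                + "||[19/Jan/2011:00:01:44 -0500]||GET"
                + "||https://proxy.library.upenn.edu:443/login?url=http://www.sciencedirect.com/"
                + "||302||0";
        String[] f = fields(line);
        System.out.println("user   = " + f[5]);
        System.out.println("when   = " + timestamp(f[6]));
        System.out.println("url    = " + f[8]);
        System.out.println("status = " + Integer.parseInt(f[9]));
    }
}
```

Splitting with a limit of `-1` keeps trailing empty fields, which matters when the last columns of a record are blank.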
The MetriDoc Solution
MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
Step 2 – Populate the normalized tables
[Diagram labels: Loading Tables; Hudson: Normalize Ezproxy, Normalize Counter, Normalize Patron Info; Repository]
Jenkins – Death to Cron
• Generally used for building software, but a fantastic cron replacement
• Can run arbitrary scripts locally and remotely
• Supports master / slave distribution model seamlessly
• Can be managed entirely via REST
• Extensible
• Helps with job dependencies
• It is simple and free
• Active community with a huge collection of plugins
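What "job dependencies" buys over plain cron can be illustrated with a small ordering routine: a job runs only after the jobs it depends on. The Java sketch below is a generic depth-first topological sort, not Jenkins's implementation or API; the job names are borrowed from the diagrams, and the assumption that each normalize job depends on its matching load job is ours.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: orders jobs so every job comes after its upstream jobs.
class JobOrder {

    // deps maps each job to the jobs that must finish before it starts.
    static List<String> runOrder(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String job : deps.keySet()) {
            visit(job, deps, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String job, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(job)) {
            return;
        }
        if (!inProgress.add(job)) {
            throw new IllegalStateException("Dependency cycle at " + job);
        }
        for (String upstream : deps.getOrDefault(job, Collections.<String>emptyList())) {
            visit(upstream, deps, done, inProgress, order);
        }
        inProgress.remove(job);
        done.add(job);
        order.add(job); // all upstream jobs are already in the list
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("Normalize Ezproxy", Collections.singletonList("Load Ezproxy"));
        deps.put("Load Ezproxy", Collections.<String>emptyList());
        System.out.println(runOrder(deps)); // prints [Load Ezproxy, Normalize Ezproxy]
    }
}
```

With cron, the same ordering has to be approximated by spacing start times apart and hoping the load finishes before the normalize begins.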