Karen Loughran

GEDDM: Comparisons of OGSA-DAI and GridFTP for access to and conversion of remote unstructured data in legal data mining. Karen Loughran. Introduction. Grid Enabled Distributed Data Mining. Industrial partner. Overview of GEDDM. GEDDM Common Semantic Model (CSM) objectives.

Presentation Transcript


  1. GEDDM: Comparisons of OGSA-DAI and GridFTP for access to and conversion of remote unstructured data in legal data mining Karen Loughran

  2. Introduction • Grid Enabled Distributed Data Mining • Industrial partner • Overview of GEDDM • GEDDM Common Semantic Model (CSM) objectives • Grid enabled solution

  3. Industrial Partner - Datactics • Northern Ireland based (formed 1999) • Specialising in grid enabled “data-centric” matching across multiple sectors • Datactics technology is fully parallelised • Computationally intensive - need to compare every record with every other record • Improve data quality by applying fuzzy matching techniques • Data mining software being used in the real world

  4. GEDDM Business Driver • Data sources • numerous structures, formats, locations, administrative domains… • Client • US County Court: insider trading litigation case • 45 TB • Variety of formats • Email, PDF, weblogs, DBMS, report text dumps … • How to interface to large volumes of data in a common, structured, parallel way

  5. Common Semantic Model (CSM) Objectives • Representation of unstructured data such as email, weblogs, report dumps. • Conversion to structured format. • Evaluation of Grid technologies for access and conversion. • Secure, reliable and scalable. • Exploit high bandwidth.

  6. CSM Grid Enabled Solution • Two Stages: • Represent and convert unstructured Flat File Formats (FFF) to structured Common Output Format File (COFF). • Investigate Grid technologies for the remote access and conversion of unstructured data.

  7. CSM Representation & Conversion • Data Description Language DDL - XSD • Data Description File DDF • Parser

  8. Sample FFF data source & DDF

     App      Account       Address                  Balance
     IMP      343818        Dede H Smith             8600.76
                            181 Glen Rd
                            Earls Court, London
     IMP      565777        Annie Saunders           9905.50
                            60 Newhaven St
                            Edinburgh, Scotland
     _______________________________________________________

     <datasource>
       <database>
         <header><headertext>App Account Address Balance</headertext></header>
         <rectype eorecord="\n">
           <pfield name="App" pos="1" length="3"/>
           <pfield name="Account" pos="10" length="6"/>
           <pfield name="Address" pos="24" length="23" multiline="yes"/>
           <pfield name="Balance" pos="49" length="8"/>
         </rectype>
       </database>
     </datasource>
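To make the role of the `pfield` positions concrete, here is a minimal Python sketch of fixed-width extraction driven by the four field definitions on this slide. It is not the GEDDM parser. The `PField` class, the `layout` helper and the rule for folding continuation lines into a multiline field (a line whose single-line fields are all blank continues the previous record) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PField:
    name: str
    pos: int          # 1-based start column, as in the DDF
    length: int
    multiline: bool = False


# Field definitions copied from the DDF on the slide.
FIELDS = [
    PField("App", 1, 3),
    PField("Account", 10, 6),
    PField("Address", 24, 23, multiline=True),
    PField("Balance", 49, 8),
]


def layout(cells):
    """Hypothetical helper: place (pos, text) pairs on a blank 60-column line."""
    line = [" "] * 60
    for pos, text in cells:
        line[pos - 1:pos - 1 + len(text)] = text
    return "".join(line).rstrip()


SAMPLE = "\n".join([
    layout([(1, "App"), (10, "Account"), (24, "Address"), (49, "Balance")]),
    layout([(1, "IMP"), (10, "343818"), (24, "Dede H Smith"), (49, "8600.76")]),
    layout([(24, "181 Glen Rd")]),
    layout([(24, "Earls Court, London")]),
    layout([(1, "IMP"), (10, "565777"), (24, "Annie Saunders"), (49, "9905.50")]),
    layout([(24, "60 Newhaven St")]),
    layout([(24, "Edinburgh, Scotland")]),
])


def cut(line, field):
    """Slice one field out of a line using its 1-based position and length."""
    start = field.pos - 1
    return line[start:start + field.length].strip()


def parse_fff(text, fields, header_lines=1):
    """Turn fixed-width text into a list of dicts, one per record."""
    records = []
    for line in text.splitlines()[header_lines:]:
        if not line.strip():
            continue
        values = {f.name: cut(line, f) for f in fields}
        # Assumed rule: if every single-line field is blank, this line
        # continues the multiline field(s) of the previous record.
        continuation = all(not values[f.name] for f in fields if not f.multiline)
        if continuation and records:
            for f in fields:
                if f.multiline and values[f.name]:
                    records[-1][f.name] += ", " + values[f.name]
        else:
            records.append(values)
    return records


RECORDS = parse_fff(SAMPLE, FIELDS)
```

Each dictionary in `RECORDS` is one structured record, which is the kind of output a COFF writer could then serialise.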

  9. Parser Design • Object-oriented component hierarchy • Each object represents an XML element • Encapsulates data relating to the flat file component it describes • Encapsulates the all-important “parse” method • SAX parse performed on DDF to build up internal OO representation of FFF • Parse called on top-level object.
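The design above can be sketched with Python's standard `xml.sax` module: a handler builds one object per DDF element and links it into a tree. This is only an illustration of the idea. The `Component` and `DDFHandler` classes and the `load_ddf` function are hypothetical names, the DDF text is the slide's example with attribute values quoted so that it is well-formed XML, and the real parser's class hierarchy is not shown in the slides.

```python
import xml.sax

# The DDF from the previous slide, with quoted attribute values.
DDF = r"""<datasource>
  <database>
    <header><headertext>App Account Address Balance</headertext></header>
    <rectype eorecord="\n">
      <pfield name="App" pos="1" length="3"/>
      <pfield name="Account" pos="10" length="6"/>
      <pfield name="Address" pos="24" length="23" multiline="yes"/>
      <pfield name="Balance" pos="49" length="8"/>
    </rectype>
  </database>
</datasource>"""


class Component:
    """One object per XML element; holds the element's attributes and children."""

    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = dict(attrs.items())
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Return this node and all descendants with the given tag, in document order."""
        found = [self] if self.tag == tag else []
        for child in self.children:
            found.extend(child.find_all(tag))
        return found


class DDFHandler(xml.sax.ContentHandler):
    """SAX handler that assembles Component objects into a tree."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []

    def startElement(self, name, attrs):
        node = Component(name, attrs)
        if self._stack:
            self._stack[-1].children.append(node)
        else:
            self.root = node
        self._stack.append(node)

    def characters(self, content):
        if self._stack:
            self._stack[-1].text += content

    def endElement(self, name):
        self._stack.pop()


def load_ddf(xml_text):
    """Run a SAX parse over the DDF text and return the top-level Component."""
    handler = DDFHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.root


ROOT = load_ddf(DDF)
PFIELDS = ROOT.find_all("pfield")
```

In a fuller version each `Component` subclass would carry its own `parse` method, so that calling `parse` on the top-level object walks the tree and consumes the flat file.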

  10. CSM Grid technologies • Transfer & conversion tools • OGSA-DAI (Version 4) • GridFTP (GT4.0.0) • GUI interfacing to both of these technologies.

  11. GUI interface – access & conversion • GUI interface to sample remote FFF, DDF creation and conversion. [Flowchart: View sample of unstructured FFF data → Describe (DDF) → Convert (conversion module, data conversion services) → View results (COFF) → OK? If no, repeat; if yes, complete with structured data (COFF).]

  12. Implementation under OGSA-DAI • OGSA-DAI 4.0.0 • Globus Toolkit 3.2.1 • New conversion activity designed & implemented • Calls out to Python scripts to perform conversion

  13. Implementation under GridFTP • Globus Toolkit 4.0.0 • Data Storage Interface (DSI) created to perform conversion processing at the server • Instead of the original unstructured FFF, the COFF file is sent back to the client • Set up a striped server architecture – multiple nodes working together in parallel.

  14. GridFTP Striped Architecture [Diagram: striped GridFTP servers across two sites, Belfast and London; hosts A, B, C, X, Y and Z, each attached to RAID storage.]

  15. GridFTP Machine Specifications • BELFAST • AMD4400 Dual Processor • 4 GB RAM • 1 TB hard disk, Serial ATA 2 • 1 Gigabit Ethernet • LONDON • Dual Opteron Processor • 4 GB RAM • 1 TB hard disk • 1 Gigabit Ethernet

  16. GridFTP Evaluation Tests • Attempted conversion and access to large files across the network. • File sizes: • 13Mb, 26Mb, 52Mb, 103Mb, 205Mb, 409Mb, 817Mb, 1634Mb • Buffer sizes: • Default, 4915, 409150, 785408 • MTU 1400 - 8000
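The file sizes and buffer sizes above form a test matrix. Here is a minimal Python sketch that enumerates it as `globus-url-copy` command lines. It assumes the standard `globus-url-copy` client with its `-vb` (performance report) and `-tcp-bs` (TCP buffer size) options, and it assumes the slide's buffer sizes are in bytes. The host name, file paths and naming scheme are hypothetical, and the MTU range is a network setting that is not expressed on the command line.

```python
from itertools import product

# Values as listed on the slide; sizes are in the slide's own "Mb" units.
FILE_SIZES = [13, 26, 52, 103, 205, 409, 817, 1634]
TCP_BUFFERS = [None, 4915, 409150, 785408]  # None stands for "Default"


def build_command(size, tcp_buffer, host="gridftp.example.org"):
    """Build one globus-url-copy invocation (host and paths are hypothetical)."""
    cmd = ["globus-url-copy", "-vb"]           # -vb: report transfer performance
    if tcp_buffer is not None:
        cmd += ["-tcp-bs", str(tcp_buffer)]    # assumed to be a size in bytes
    cmd.append(f"gsiftp://{host}/data/sample_{size}.fff")
    cmd.append(f"file:///tmp/sample_{size}.coff")
    return cmd


# One command per (file size, buffer size) combination.
TEST_MATRIX = [build_command(s, b) for s, b in product(FILE_SIZES, TCP_BUFFERS)]
```

Each entry is an argument list that could be passed to `subprocess.run` on a machine with the Globus client installed and valid credentials.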

  17. OGSA-DAI Benchmark Results Currently no results available: • Socket Timeout Error and Engine receives a terminate signal when Activity takes longer than approximately 10 minutes to run. • DeliverToGridFTP activity would not work in version 4. Patches required. So far, unable to get working with these patches. • Security setup issues.

  18. GridFTP Network Topology [Diagram: network path linking Queens (BESC), BBC NI and BBC London through the Queens BESC router, the Janet BAR and the BBC router; links are labelled 1 Gbit, with one 100 Mbit link.]

  19. Results – GridFTP transfer • Throughput hindered by: • Physical infrastructure/service provider: 80 Mb/s • Router/switches/NIC • 808 Mb/s CPU to CPU (London to Belfast) • 688 Mb/s disk to disk (BBC NI) • Striping with 2 BE servers: 60% improvement • Local 100 Mb/s switch: • Disk to disk: 82 Mb/s
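To put these rates in perspective, the short calculation below estimates how long the largest test file would take at each reported throughput. It is a back-of-envelope sketch with two assumptions: the 1634 "Mb" file size is read as megabytes, and the throughput figures are read as megabits per second. Protocol overhead and conversion time are ignored.

```python
def transfer_seconds(size_megabytes, rate_megabits_per_s):
    """Ideal transfer time: 8 bits per byte, no protocol or disk overhead."""
    return size_megabytes * 8 / rate_megabits_per_s


# Throughput figures taken from the slide (assumed to be megabits per second).
RATES = {
    "service-provider limit": 80,
    "disk to disk": 688,
    "CPU to CPU": 808,
}

for label, rate in RATES.items():
    print(f"{label:>22}: {transfer_seconds(1634, rate):6.1f} s")
```

Under these assumptions the largest file needs roughly 163 s at the 80 Mb/s bottleneck, against about 19 s disk to disk and about 16 s CPU to CPU.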

  20. OGSA-DAI Evaluation • DeliverToGridFTP not working in 4.0.0 • Configuring GridFTP not possible (buffer sizes, number of streams, striped transfer etc.) • Some way to go in efficient transfer of large files • Installation/runtime overheads • Need to design and code a conversion activity, and to design perform documents for access/conversion • Timeouts converting large files; threads may be a solution • Clear documentation

  21. GridFTP Evaluation • Secure, reliable, fast and scalable • Lightweight installation • Optimum use of high bandwidth networks • Extra ERET/ESTO processing allows tighter integration of conversion operations through the definition of a DSI • Striping for much improved efficiency

  22. GridFTP Evaluation • Extensive tuning required • No clear documentation for writing a DSI • gridftp-mpd@globus.org is a useful source of information • Poor performance on NFS • PVFS-like filesystem recommended for striping • 1 Gbit bandwidth difficult to achieve in practice due to problems with: • Router • NIC • Physical infrastructure

  23. Conclusions • Investigated grid technologies for remote access & conversion • OGSA-DAI disappointing due to lack of support for large file transfer • GridFTP required extensive configuration, and network infrastructure problems made optimum performance in remote transfer difficult to achieve

  24. Future work • Tighter integration of conversion services within GridFTP DSI server module. • Extend the services under GridFTP to cope with Distributed Query Processing. • COFF produced as XML, ready for XPath queries.
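As an illustration of the last point, the sketch below runs XPath-style queries over a small XML document using Python's standard `xml.etree.ElementTree`, which supports a subset of XPath. The COFF schema is not given in the slides, so the `coff`/`record` layout here is purely hypothetical; the field values are taken from the sample data on slide 8.

```python
import xml.etree.ElementTree as ET

# Hypothetical COFF layout: one <record> per flat-file record.
COFF = """<coff>
  <record>
    <App>IMP</App><Account>343818</Account>
    <Address>Dede H Smith, 181 Glen Rd, Earls Court, London</Address>
    <Balance>8600.76</Balance>
  </record>
  <record>
    <App>IMP</App><Account>565777</Account>
    <Address>Annie Saunders, 60 Newhaven St, Edinburgh, Scotland</Address>
    <Balance>9905.50</Balance>
  </record>
</coff>"""

COFF_ROOT = ET.fromstring(COFF)


def balance_for(root, account):
    """Look up one balance with an XPath predicate on a child element's text."""
    node = root.find(f"./record[Account='{account}']/Balance")
    return None if node is None else float(node.text)


def accounts_over(root, threshold):
    """Numeric comparison is done in Python, since ElementTree's XPath subset lacks it."""
    return [
        rec.findtext("Account")
        for rec in root.findall("./record")
        if float(rec.findtext("Balance")) > threshold
    ]
```

For example, `balance_for(COFF_ROOT, "565777")` looks up a single record, while `accounts_over(COFF_ROOT, 9000)` filters all records by balance.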

  25. Questions ?
