GEDDM: Comparisons of OGSA-DAI and GridFTP for access to and conversion of remote unstructured data in legal data mining
Karen Loughran
Introduction
• Grid Enabled Distributed Data Mining
• Industrial partner
• Overview of GEDDM
• GEDDM Common Semantic Model (CSM) objectives
• Grid enabled solution
Industrial Partner - Datactics
• Northern Ireland based (formed 1999)
• Specialising in grid enabled “data-centric” matching across multiple sectors
• Datactics technology is fully parallelised
• Computationally intensive: need to compare every record with every other record
• Improve data quality by applying fuzzy matching techniques
• Data mining software being used in the real world
GEDDM Business Driver
• Data sources
  • Numerous structures, formats, locations, administrative domains…
• Client
  • US County Court: insider trading litigation case
  • 45 TB
  • Variety of formats
    • Email, PDF, weblogs, DBMS, report text dumps…
• How to interface to large volumes of data with a common, structured, parallel approach
Common Semantic Model (CSM) Objectives
• Representation of unstructured data such as email, weblogs, report dumps.
• Conversion to structured format.
• Evaluation of Grid technologies for access and conversion.
• Secure, reliable and scalable.
• Exploit high bandwidth.
CSM Grid Enabled Solution
• Two stages:
  • Represent and convert unstructured Flat File Formats (FFF) to structured Common Output Format File (COFF).
  • Investigate Grid technologies for the remote access and conversion of unstructured data.
CSM Representation & Conversion
• Data Description Language (DDL): XSD
• Data Description File (DDF)
• Parser
Sample FFF data source & DDF

FFF data source:

App      Account       Address                  Balance
IMP      343818        Dede H Smith             8600.76
                       181 Glen Rd
                       Earls Court, London
IMP      565777        Annie Saunders           9905.50
                       60 Newhaven St
                       Edinburgh, Scotland

DDF:

<datasource>
  <database>
    <header><headertext>App Account Address Balance </headertext></header>
    <rectype eorecord="\n">
      <pfield name="App" pos="1" length="3"/>
      <pfield name="Account" pos="10" length="6"/>
      <pfield name="Address" pos="24" length="23" multiline="yes"/>
      <pfield name="Balance" pos="49" length="8"/>
    </rectype>
  </database>
</datasource>
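The positional fields in the sample DDF map directly onto fixed-width slicing. The following is a minimal Python sketch of that idea, not the GEDDM converter itself. The names `PField`, `convert` and `make_line` are hypothetical, and two rules are assumptions rather than anything stated on the slides: a line is treated as a continuation when every non-multiline field is blank, and continuation text is joined with a comma.

```python
from dataclasses import dataclass


@dataclass
class PField:
    """One <pfield> entry from the DDF (pos is a 1-based column)."""
    name: str
    pos: int
    length: int
    multiline: bool = False

    def extract(self, line: str) -> str:
        start = self.pos - 1
        return line[start:start + self.length].strip()


# Field layout copied from the sample DDF.
FIELDS = [
    PField("App", 1, 3),
    PField("Account", 10, 6),
    PField("Address", 24, 23, multiline=True),
    PField("Balance", 49, 8),
]


def convert(lines, fields=FIELDS, header_lines=1):
    """Turn fixed-width lines into a list of dicts, one per record."""
    records = []
    for line in lines[header_lines:]:
        if not line.strip():
            continue
        values = {f.name: f.extract(line) for f in fields}
        # Assumption: a continuation line is blank in every non-multiline field.
        continuation = bool(records) and all(
            values[f.name] == "" for f in fields if not f.multiline
        )
        if continuation:
            for f in fields:
                if f.multiline and values[f.name]:
                    records[-1][f.name] += ", " + values[f.name]
        else:
            records.append(values)
    return records


def make_line(app="", account="", address="", balance=""):
    """Lay out one line at the DDF column positions 1, 10, 24 and 49."""
    return f"{app:<9}{account:<14}{address:<25}{balance}"


SAMPLE = [
    make_line("App", "Account", "Address", "Balance"),
    make_line("IMP", "343818", "Dede H Smith", "8600.76"),
    make_line(address="181 Glen Rd"),
    make_line(address="Earls Court, London"),
    make_line("IMP", "565777", "Annie Saunders", "9905.50"),
    make_line(address="60 Newhaven St"),
    make_line(address="Edinburgh, Scotland"),
]

for record in convert(SAMPLE):
    print(record)
```

Each printed dictionary corresponds to one structured record; how the real COFF serialises such records is not shown on the slides.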
Parser Design
• Object oriented component hierarchy
• Each object represents an XML element
• Encapsulates data relating to the flat file component it describes
• Encapsulates the all-important “parse” method
• SAX parse performed on DDF to build up internal OO representation of FFF
• Parse called on top level object.
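The design above can be sketched in Python with the standard `xml.sax` module: a SAX handler builds one object per DDF element, and calling `parse` on the top-level object delegates down the hierarchy. This is an illustrative sketch only. The class names (`DataSource`, `RecType`, `PField`, `DDFHandler`) and the function `load_ddf` are hypothetical, the DDF is written with quoted attribute values so that it is well-formed XML, and multiline handling is omitted to keep the example short.

```python
import xml.sax


class PField:
    """Represents a <pfield> element: one positional field."""
    def __init__(self, attrs):
        self.name = attrs["name"]
        self.pos = int(attrs["pos"])          # 1-based start column
        self.length = int(attrs["length"])
        self.multiline = attrs.get("multiline", "no") == "yes"

    def parse(self, line):
        start = self.pos - 1
        return line[start:start + self.length].strip()


class RecType:
    """Represents a <rectype> element: an ordered set of fields."""
    def __init__(self, attrs):
        self.eorecord = attrs.get("eorecord", "\n")
        self.fields = []

    def parse(self, line):
        return {f.name: f.parse(line) for f in self.fields}


class DataSource:
    """Top-level object; parse() delegates to the objects below it."""
    def __init__(self):
        self.rectypes = []

    def parse(self, lines):
        # Assumption: a single record type, as in the sample DDF.
        return [self.rectypes[0].parse(line) for line in lines]


class DDFHandler(xml.sax.ContentHandler):
    """SAX handler that builds the object hierarchy from the DDF."""
    def __init__(self):
        super().__init__()
        self.source = None

    def startElement(self, name, attrs):
        if name == "datasource":
            self.source = DataSource()
        elif name == "rectype":
            self.source.rectypes.append(RecType(attrs))
        elif name == "pfield":
            self.source.rectypes[-1].fields.append(PField(attrs))


def load_ddf(ddf_text):
    handler = DDFHandler()
    xml.sax.parseString(ddf_text.encode("utf-8"), handler)
    return handler.source


DDF = r"""<datasource>
  <database>
    <header><headertext>App Account Address Balance</headertext></header>
    <rectype eorecord="\n">
      <pfield name="App" pos="1" length="3"/>
      <pfield name="Account" pos="10" length="6"/>
      <pfield name="Address" pos="24" length="23" multiline="yes"/>
      <pfield name="Balance" pos="49" length="8"/>
    </rectype>
  </database>
</datasource>"""

source = load_ddf(DDF)
line = f"{'IMP':<9}{'343818':<14}{'Dede H Smith':<25}8600.76"
print(source.parse([line]))
```

Keeping the slicing logic inside each field object means a new element type only needs its own class and one extra branch in the handler.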
CSM Grid technologies
• Transfer & conversion tools
  • OGSA-DAI (Version 4)
  • GridFTP (GT4.0.0)
• GUI interfacing to both of these technologies.
GUI interface – access & conversion
GUI interface to sample remote FFF, DDF creation and conversion.
[Flowchart: View sample unstructured FFF data → Describe (DDF) → Convert (data conversion services, conversion module) → View results (COFF) → OK? (No / Yes) → Complete: structured data (COFF)]
Implementation under OGSA-DAI
• OGSA-DAI 4.0.0
• Globus Toolkit 3.2.1
• New conversion activity designed & implemented
• Calls out to Python scripts to perform conversion
Implementation under GridFTP
• Globus Toolkit 4.0.0
• Data Storage Interface (DSI) created to perform conversion processing at the server
• Instead of the original unstructured FFF, the COFF file is sent back to the client
• Set up a striped server architecture: multiple nodes working together in parallel.
GridFTP Striped Architecture
[Diagram: striped GridFTP servers across Belfast and London; hosts A, B, C, X, Y and Z, with RAID storage]
GridFTP Machine Specifications
BELFAST
• AMD4400 Dual Processor
• 4 GB RAM
• 1 Terabyte hard disk, serial ATA2
• 1 Gigabit ethernet
LONDON
• Dual Opteron Processor
• 4 GB RAM
• 1 Terabyte hard disk
• 1 Gigabit ethernet
GridFTP Evaluation Tests
• Attempted conversion and access to large files across the network.
• File sizes:
  • 13Mb, 26Mb, 52Mb, 103Mb, 205Mb, 409Mb, 817Mb, 1634Mb
• Buffer sizes:
  • Default, 4915, 409150, 785408
• MTU: 1400 to 8000
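The file sizes and buffer sizes above form a test matrix of 8 × 4 runs. A minimal Python sketch of how such a matrix could be enumerated as `globus-url-copy` command lines follows; it only builds the argument lists and does not run them. Several things here are assumptions: the slides do not say whether "buffer size" means the TCP buffer (`-tcp-bs`, used below) or the block size, the host name and file paths are hypothetical, and the MTU is an interface setting that would be changed outside the client command.

```python
from itertools import product

# Values taken from the slide; the unit of the file sizes is assumed.
FILE_SIZES = [13, 26, 52, 103, 205, 409, 817, 1634]
BUFFER_SIZES = [None, 4915, 409150, 785408]   # None = client default


def build_command(size, buffer_size,
                  src_host="gridftp.example.org",   # hypothetical host
                  dst_dir="/tmp"):                  # hypothetical path
    """Build one globus-url-copy invocation as an argument list."""
    cmd = ["globus-url-copy", "-vb"]                # -vb: report throughput
    if buffer_size is not None:
        # Assumption: the slide's "buffer size" is the TCP buffer size.
        cmd += ["-tcp-bs", str(buffer_size)]
    cmd += [
        f"gsiftp://{src_host}/data/test_{size}.fff",
        f"file://{dst_dir}/test_{size}.coff",
    ]
    return cmd


def build_test_matrix():
    """One command per (file size, buffer size) combination."""
    return [build_command(s, b) for s, b in product(FILE_SIZES, BUFFER_SIZES)]


for command in build_test_matrix()[:4]:
    print(" ".join(command))
```

Generating the commands from two lists keeps the runs reproducible when a further parameter, such as the number of parallel streams, is added to the matrix.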
OGSA-DAI Benchmark Results
Currently no results available:
• Socket Timeout Error, and the Engine receives a terminate signal, when an Activity takes longer than approximately 10 minutes to run.
• DeliverToGridFTP activity would not work in version 4. Patches required. So far, unable to get it working even with these patches.
• Security setup issues.
GridFTP Network Topology
[Diagram: network path linking Queens (BESC router), BBC NI and BBC London via Janet Bar and the BBC router; link labels: 100 Mbit, 1 Gbit]
Results – GridFTP transfer
• Throughput hindered by:
  • Physical infrastructure/service provider: 80 Mbit/s
  • Router/switches/NIC
• 808 Mbit/s CPU to CPU (London to Belfast)
• 688 Mbit/s disk to disk (BBC NI)
• Striping with 2 BE servers: 60% improvement
• Local 100 Mbit/s switch:
  • Disk to disk: 82 Mbit/s
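To put these rates in context, the short Python calculation below estimates how long the client's 45 TB data set would take to move at each measured rate. It is back-of-envelope arithmetic under stated assumptions: decimal units (1 TB = 10^12 bytes, 1 Mbit = 10^6 bits), a rate sustained for the whole transfer, and no protocol or conversion overhead.

```python
def transfer_time_seconds(size_bytes, rate_mbit_s):
    """Ideal transfer time: bits to move divided by bits per second."""
    return size_bytes * 8 / (rate_mbit_s * 1e6)


DATASET_BYTES = 45e12   # 45 TB, assuming decimal terabytes

# Rates as reported on the results slide, in Mbit/s.
RATES = {
    "service provider limit": 80,
    "local 100 Mbit/s switch, disk to disk": 82,
    "disk to disk (BBC NI)": 688,
    "CPU to CPU (London to Belfast)": 808,
}

for label, rate in RATES.items():
    days = transfer_time_seconds(DATASET_BYTES, rate) / 86400
    print(f"{label:<40} {rate:>4} Mbit/s  {days:6.1f} days")
```

Under these assumptions the 80 Mbit/s limit corresponds to roughly 52 days for the full data set, against about 5 days at 808 Mbit/s, which is why the infrastructure bottleneck dominates the evaluation.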
OGSA-DAI Evaluation
• DeliverToGridFTP not working in 4.0.0
• Configuring GridFTP not possible (buffer sizes, number of streams, striped transfer etc.)
• Some way to go in efficient transfer of large files.
• Installation/runtime overheads
• Design/code conversion activity & design perform documents for access/conversion
• Timeouts converting large files. Threads may be a solution.
• Clear documentation
GridFTP Evaluation
• Secure, reliable, fast and scalable
• Lightweight installation
• Optimum use of high bandwidth networks
• Extra ERET/ESTO processing allows tighter integration of conversion operations through the definition of a DSI
• Striping for much improved efficiency
GridFTP Evaluation
• Extensive tuning required
• No clear documentation for writing a DSI.
  • gridftp-mpd@globus.org a useful source of information
• Poor performance on NFS.
  • PVFS-like filesystem recommended for striping.
• 1 Gbit bandwidth difficult to achieve in practice due to problems with:
  • Router
  • NIC
  • Physical infrastructure
Conclusions
• Investigated grid technologies for remote access & conversion
• OGSA-DAI disappointing due to lack of support for large file transfer
• GridFTP involved extensive configuration, and network infrastructure problems made optimum remote transfer performance difficult to achieve
Future work
• Tighter integration of conversion services within GridFTP DSI server module.
• Extend the services under GridFTP to cope with Distributed Query Processing.
• COFF produced as XML, ready for XPath queries.
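As an illustration of the last point, the Python sketch below runs XPath-style queries over a small XML document using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The COFF layout shown (`<coff>`, `<record>` and one element per field) is purely hypothetical, since the slides do not define the COFF schema; the field values come from the sample FFF data.

```python
import xml.etree.ElementTree as ET

# Hypothetical COFF layout; the real COFF schema is not given in the slides.
COFF = """<coff>
  <record>
    <App>IMP</App><Account>343818</Account>
    <Address>Dede H Smith, 181 Glen Rd, Earls Court, London</Address>
    <Balance>8600.76</Balance>
  </record>
  <record>
    <App>IMP</App><Account>565777</Account>
    <Address>Annie Saunders, 60 Newhaven St, Edinburgh, Scotland</Address>
    <Balance>9905.50</Balance>
  </record>
</coff>"""


def accounts_for_app(root, app):
    """Select records by child-element text using an XPath predicate."""
    return [r.findtext("Account") for r in root.findall(f"./record[App='{app}']")]


def accounts_with_balance_over(root, threshold):
    """ElementTree's XPath subset has no numeric comparison, so filter in Python."""
    return [
        r.findtext("Account")
        for r in root.findall("./record")
        if float(r.findtext("Balance")) > threshold
    ]


root = ET.fromstring(COFF)
print(accounts_for_app(root, "IMP"))
print(accounts_with_balance_over(root, 9000))
```

A full XPath engine would allow the numeric filter to be expressed in the query itself; the standard library was used here only to keep the sketch dependency-free.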