BIRN Data Management: An Introduction
Rob Schuler, Apr 8, 2009
Outline
• Data Grid Tools
• BIRN Data Grid
• Data Management Working Group
What is GridFTP?
• GridFTP is an extension of the ubiquitous FTP
  • GridFTP is a standardized protocol that extends the ubiquitous FTP protocol
• GridFTP is secure
  • GridFTP uses GSI (a public-key-based security protocol) to support secure authentication and data channel protection
• GridFTP is optimized for high performance
  • GridFTP includes extensions for supporting large-file transfer, lots-of-small-files transfer, transfer over high-bandwidth wide-area networks, and striped transfers
• Globus GridFTP Server
  • More than 5½ million file transfers per day
GridFTP Data Transfers for the Advanced Photon Source
[Figure labels: 300x speedup; 9688 miles]
“One Australian user left nearly 1TB of data on our systems that we had been struggling to transfer via standard FTP for several weeks. The typical data rate using standard FTP was ~200 KB/s. Using GridFTP we are now moving data at 60 MB/s, quite a significant boost in performance!”
Brian Tieman, Advanced Photon Source
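The 300x figure follows directly from the two quoted rates. A quick check, assuming decimal units (1 MB = 1000 KB, 1 TB = 10^9 KB) and a dataset of exactly 1 TB, neither of which the slide specifies:

```python
# Rates quoted on the slide; the decimal unit conversion is an assumption.
ftp_rate_kb_s = 200            # ~200 KB/s with standard FTP
gridftp_rate_kb_s = 60 * 1000  # 60 MB/s with GridFTP
dataset_kb = 1e9               # "nearly 1TB", rounded up to 1 TB for illustration

speedup = gridftp_rate_kb_s / ftp_rate_kb_s
ftp_days = dataset_kb / ftp_rate_kb_s / 86400          # seconds -> days
gridftp_hours = dataset_kb / gridftp_rate_kb_s / 3600  # seconds -> hours

print(f"speedup: {speedup:.0f}x")
print(f"FTP: {ftp_days:.1f} days, GridFTP: {gridftp_hours:.1f} hours")
```

Under these assumptions the same dataset takes roughly two months at the FTP rate and under five hours at the GridFTP rate, which is consistent with the "several weeks" described in the quote.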
What is the Replica Location Service?
• RLS supports replica mappings
  • A Logical Name is mapped to 1..N Physical Filenames (e.g., an FTP URL)
• RLS is a distributed registry
  • Local Replica Catalog (LRC) for storing and retrieving replica mappings and attributes
  • Replica Location Index (RLI) for indexing the contents of 1..N LRCs
• RLS is scalable
  • Supports millions of replica mappings
  • Supports 100+ concurrent clients
  • Supports bulk operations
• RLS is mature and stable
  • Several years in production use
[Figure labels: Replica Location Indexes; Local Replica Catalogs]
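The two-level lookup described above (index first, then catalog) can be illustrated with a toy in-memory model. This is only a sketch of the idea, not the RLS client API: the class names, method names, and `gsiftp://` host names are all hypothetical.

```python
from collections import defaultdict


class LocalReplicaCatalog:
    """Toy LRC: maps a logical name to one or more physical file names."""

    def __init__(self, name):
        self.name = name
        self._mappings = defaultdict(set)

    def add(self, logical_name, physical_name):
        self._mappings[logical_name].add(physical_name)

    def lookup(self, logical_name):
        return sorted(self._mappings.get(logical_name, ()))

    def logical_names(self):
        return set(self._mappings)


class ReplicaLocationIndex:
    """Toy RLI: records which LRCs hold mappings for each logical name."""

    def __init__(self):
        self._index = defaultdict(set)
        self._catalogs = {}

    def update_from(self, lrc):
        self._catalogs[lrc.name] = lrc
        for logical_name in lrc.logical_names():
            self._index[logical_name].add(lrc.name)

    def locate(self, logical_name):
        """Resolve in two steps: index -> catalogs -> physical names."""
        replicas = []
        for lrc_name in sorted(self._index.get(logical_name, ())):
            replicas.extend(self._catalogs[lrc_name].lookup(logical_name))
        return replicas


# Hypothetical sites and file names, for illustration only.
site_a = LocalReplicaCatalog("site-a")
site_b = LocalReplicaCatalog("site-b")
site_a.add("scan-0001", "gsiftp://a.example.org/data/scan-0001.nii")
site_b.add("scan-0001", "gsiftp://b.example.org/data/scan-0001.nii")

rli = ReplicaLocationIndex()
rli.update_from(site_a)
rli.update_from(site_b)
print(rli.locate("scan-0001"))
```

The index stores only logical names and catalog identities; the physical file names stay in the catalogs, which is what lets one index cover many catalogs.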
RLS Performance
• For user convenience, the server supports bulk operations
  • E.g., 1000 operations per request
• Combine adds/deletes to maintain approx. constant DB size
• For a small number of clients, bulk operations increase rates
  • E.g., 1 client (10 threads) performs 27% more queries, 7% more adds/deletes
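On the client side, using bulk operations mostly means grouping individual operations into requests of a fixed size. A minimal sketch of that grouping, with a hypothetical workload and a helper name (`batches`) that is not part of any RLS library:

```python
from itertools import islice


def batches(operations, batch_size=1000):
    """Yield lists of at most batch_size operations, preserving order."""
    iterator = iter(operations)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical workload: 2,500 logical-to-physical mappings to register.
mappings = [
    (f"lfn-{i}", f"gsiftp://host.example.org/data/{i}") for i in range(2500)
]
requests = list(batches(mappings))
print([len(r) for r in requests])
```

Each resulting list would be sent as one bulk request instead of up to 1000 single-operation requests.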
FBIRN Infrastructure
[Diagram labels: fMRI Scanner; FMRI/MRI Images; DICOM, NIFTI; Processing Pipelines; Derived Data Processing; FIPS; Results; Data Grid; HIDB(s) (Distributed); Data Provenance Information; Clinical Data Input; Multi-Site User Query]
By David Keator, FBIRN
BIRN Data Grid
[Diagram labels: Data Publishers; Data Users; Publish Metadata; Query HIDs; HIDs; Register and Upload Data; Locate and Retrieve Data; Replica Location Service; Replica Location; Data Grid; GridFTP Servers; Data Transfers and Replication]
WG Overview
• Participants
  • BIRN CC, FBIRN, Morph BIRN, Mouse BIRN, NPRC
• Scope
  • Files as atomic units of data
  • Data object naming
  • Data security
  • Data publishing methods
  • Data location and retrieval
  • Data storage services
  • Data transfer protocols
  • Replication strategies
  • Metadata Management (TBD)
WG: Current Topics
• Use Cases
  • Basic file-based data management and sharing
  • Basic approach to authN and authZ
• Metadata
  • To what extent should we be involved in metadata management?
  • Can metadata management be generalized, even at the infrastructure (not schema) level?
• Existing Tools
  • What is the relationship between BIRN CC and existing tools for image management (XNAT, HID, MBAT, etc.)?
  • Can we refactor existing tools to be more reusable, modular, and fit into a BIRN layered architecture?
GridFTP: Two Channel Protocol
• Two-channel protocol, like FTP
• Control Channel
  • Communication link (TCP) over which commands and responses flow
  • Low bandwidth; encrypted and integrity protected by default
• Data Channel
  • Communication link(s) over which the actual data of interest flows
  • High bandwidth; authenticated by default; encryption and integrity protection optional
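The security defaults of the two channels can be summarized as a small data model. This is a sketch for illustration only: the type and function names are hypothetical, the level names "clear", "safe", and "private" are borrowed from the FTP security extensions' protection levels as an assumption about how the optional data channel protection is selected, and treating the control channel as authenticated is inferred from the use of GSI rather than stated on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelSecurity:
    authenticated: bool
    integrity_protected: bool
    encrypted: bool


# Control channel: encrypted and integrity protected by default.
CONTROL_CHANNEL = ChannelSecurity(
    authenticated=True, integrity_protected=True, encrypted=True
)

# Data channel: always authenticated; integrity and encryption are opt-in.
DATA_CHANNEL_LEVELS = {
    "clear": ChannelSecurity(True, False, False),    # default
    "safe": ChannelSecurity(True, True, False),      # integrity only
    "private": ChannelSecurity(True, True, True),    # integrity + encryption
}


def data_channel(level="clear"):
    """Return the security properties for a data channel protection level."""
    return DATA_CHANNEL_LEVELS[level]


print(data_channel())
print(data_channel("private"))
```

Leaving the data channel unencrypted by default keeps the high-bandwidth path cheap, while the low-bandwidth control channel can afford full protection.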
RLS Details
[Figure labels: Replica Location Indexes (RLI, RLI); Local Replica Catalogs (LRC, LRC, LRC, LRC, LRC)]
• Local Replica Catalogs (LRCs) contain consistent information about logical-to-target mappings
• Replica Location Index (RLI) nodes aggregate information about one or more LRCs
• LRCs use soft-state update mechanisms to inform RLIs about their state: relaxed consistency of index
• Optional compression of state updates reduces communication, CPU, and storage overheads
• Membership service registers participating LRCs and RLIs and deals with changes in membership
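Soft-state updates mean the index keeps an LRC's information only as long as that LRC keeps refreshing it, so the index may briefly lag behind the catalogs. A toy sketch of that behavior, with hypothetical class and method names and with time passed in explicitly so the example is deterministic:

```python
class SoftStateIndex:
    """Toy RLI whose entries expire unless the owning LRC refreshes them."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._entries = {}  # lrc_name -> (logical names, time of last update)

    def receive_update(self, lrc_name, logical_names, now):
        """Store (or refresh) the full set of names reported by one LRC."""
        self._entries[lrc_name] = (frozenset(logical_names), now)

    def expire(self, now):
        """Drop every LRC whose last update is older than the timeout."""
        stale = [
            name
            for name, (_, updated) in self._entries.items()
            if now - updated > self.timeout_s
        ]
        for name in stale:
            del self._entries[name]

    def catalogs_for(self, logical_name, now):
        self.expire(now)
        return sorted(
            name
            for name, (names, _) in self._entries.items()
            if logical_name in names
        )


index = SoftStateIndex(timeout_s=60)
index.receive_update("lrc-1", {"scan-0001"}, now=0)
index.receive_update("lrc-2", {"scan-0001", "scan-0002"}, now=0)
index.receive_update("lrc-1", {"scan-0001"}, now=50)  # lrc-1 refreshes
print(index.catalogs_for("scan-0001", now=90))         # lrc-2 has timed out
```

Because stale entries simply age out, an LRC that leaves or fails needs no explicit deregistration from the index; the cost is that the index is only approximately consistent with the catalogs.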