
A Web-Based Data Grid

Chip Watson, Ian Bird, Jie Chen, Ying Chen, Bryan Hess, Andy Kowalski. Thomas Jefferson National Accelerator Facility. Outline: overview of a prototype JLAB data grid architecture; status of the development; expected future milestones; lessons learned so far.


Presentation Transcript


  1. A Web-Based Data Grid. Chip Watson, Ian Bird, Jie Chen, Ying Chen, Bryan Hess, Andy Kowalski. Thomas Jefferson National Accelerator Facility

  2. Outline
     • Overview of a prototype JLAB data grid architecture
     • Status of the development
     • Expected future milestones
     • Lessons learned so far

  3. JLAB Prototype Architecture Summary
     • The prototype data grid consists of:
        • Web services for information management and control
        • File daemons (like ftpd) for bulk data transfer
        • Back-end services used by the web services
     • Communication with web services is via HTTP and XML (HTTPS with an X.509 certificate for privileged operations)
     • Communication with file daemons is via a daemon-specific protocol
     • Communication with back-end services is site-specific

  4. In picture form… [Architecture diagram; labels: Client Program, Agent, Replica Catalog, Data Grid Server, File Server, RC Host, File Host]

  5. Web Services
     • Replica Catalog
        • Holds the global file namespace
        • May itself be replicated for redundancy or performance
        • For a given file, references the data grid nodes holding it (but not the physical path)
     • Data Grid Server (aka Replica Host)
        • Holds and serves files
        • May be a disk cache; may include tertiary storage
        • Translates global name to URL for retrieval if the file is cache resident (pull by client)
        • Accepts new files (push by client)
        • Supports queuing of file transfer requests between nodes (3rd party)
        • Supports policy-based file movement
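The split of responsibilities on this slide can be sketched roughly as follows. This is only an illustration, not the JLAB implementation: the class names, method names, host name, and URL layout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaCatalog:
    """Maps a global file name to the grid nodes holding a copy.

    Hypothetical sketch: the catalog knows nodes, never physical paths.
    """
    replicas: dict = field(default_factory=dict)

    def register(self, global_name, node):
        self.replicas.setdefault(global_name, set()).add(node)

    def nodes_for(self, global_name):
        return sorted(self.replicas.get(global_name, set()))


@dataclass
class DataGridServer:
    """Holds files and translates a global name to a retrieval URL."""
    host: str
    cache_root: str  # assumed URL prefix under which cached files are served
    cached: set = field(default_factory=set)

    def url_for(self, global_name, protocol="http"):
        # Only cache-resident files can be pulled directly by the client.
        if global_name not in self.cached:
            return None
        return f"{protocol}://{self.host}{self.cache_root}{global_name}"
```

A client would first ask the catalog which nodes hold the file, then ask one of those nodes for a URL; a `None` result stands in for "not cache resident", in which case a real node might stage the file from tertiary storage.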

  6. Replica Catalog Components
     • Relational database
        • Global directory name, file name, owner, size, etc.
        • Set of Data Grid Nodes holding copies of the file, and the last reported state of each replica (online, offline)
     • XML servlet
        • Directory-level services per invocation, returning rich info from the database as an XML document
        • Catalog updates
     • HTTP servlet
        • Applies style sheet(s) to the XML document, allowing easy browsing and simple interactions with just a web browser
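As a rough sketch of the database and XML servlet described above, the following uses an in-memory SQLite database and Python's standard XML library. The table names, column names, and XML element names are assumptions made for illustration; the slides do not give the actual schema or document format.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_catalog():
    """Create an empty catalog with a hypothetical two-table schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE file (
            id INTEGER PRIMARY KEY, dname TEXT, fname TEXT,
            owner TEXT, size INTEGER);
        CREATE TABLE replica (
            file_id INTEGER REFERENCES file(id), node TEXT, state TEXT);
    """)
    return db


def list_directory_xml(db, dname):
    """Return one directory's files and replicas as an XML string."""
    root = ET.Element("directory", name=dname)
    files = db.execute(
        "SELECT id, fname, owner, size FROM file "
        "WHERE dname = ? ORDER BY fname", (dname,)).fetchall()
    for file_id, fname, owner, size in files:
        entry = ET.SubElement(
            root, "file", name=fname, owner=owner or "",
            size="" if size is None else str(size))
        replicas = db.execute(
            "SELECT node, state FROM replica "
            "WHERE file_id = ? ORDER BY node", (file_id,))
        for node, state in replicas:
            ET.SubElement(entry, "replica", node=node, state=state)
    return ET.tostring(root, encoding="unicode")
```

Returning the whole directory in one document mirrors the "directory-level services per invocation" idea; a style sheet could then turn the same XML into a browsable page.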

  7. Current Status of Replica Catalog
     • A prototype exists with the following functionality
        • Database populated with ALL files from the Jefferson Lab silo (no owner, group, or file size info loaded for now)
        • XML servlet for browsing
        • HTTP servlet for browsing: http://129.57.41.138/servlet/dg.HttpReplicaCatalog?dname=/
     • Missing functionality in this prototype
        • Authentication: easy, already done for another (batch system) prototype
        • Edit catalog: in principle easy, just need to finalize scenarios
        • Extensible file properties: moderately easy, just need to add a name-value table to the db and expand the XML document for a single file to include this info

  8. Status (cont.)
     • Observations
        • Web browsing into directories with thousands of files is slow (it produces an ENORMOUS web page), but works
        • Plan to segment the listing, with a “Next Page” link
        • Probably need to allow the client to specify the number of files to retrieve, and an offset for the next retrieval
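The segmentation idea (a client-specified count plus an offset for the next retrieval) might look like the sketch below. The function name and its parameters are hypothetical; the slides only state the plan.

```python
def page_listing(names, offset=0, limit=100):
    """Return one page of a directory listing and the next offset.

    `names` is the full, ordered list of file names in the directory.
    The second return value is the offset to request next, or None
    when this page is the last one (no "Next Page" link needed).
    """
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    page = names[offset:offset + limit]
    has_more = offset + limit < len(names)
    return page, (offset + limit if has_more else None)
```

In a database-backed catalog the same effect would normally be pushed into the query itself, so that only one page of rows is read per request.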

  9. Data Grid Node Components
     • XML (and HTTP) servlets
        • File Catalog Servlet (Replica Host)
           • Translates file I/O requests to a specific URL (including protocol negotiation or selection)
           • Provides offline / online status of file
        • Transfer Request Servlet
           • Queues file transfer requests, reports status
           • Edits transfer policy for specified directory
        • Disk Cache Manager Servlet
           • Edits policy of disk cache manager
     • File Server(s)
        • ftp, bbftp, gridftp, …
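The "queues file transfer requests, reports status" role of the Transfer Request Servlet can be illustrated with a small in-memory queue. Everything here (class names, status strings, the `copy_fn` callback) is an assumption for the sake of the sketch; the slides describe the real queue as backed by a SQL database and not yet implemented.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count


@dataclass
class TransferRequest:
    request_id: int
    global_name: str
    source_node: str
    dest_node: str
    status: str = "queued"  # queued -> active -> done | failed


class TransferQueue:
    """FIFO queue of third-party transfer requests with status lookup."""

    def __init__(self):
        self._ids = count(1)
        self._pending = deque()
        self._by_id = {}

    def submit(self, global_name, source_node, dest_node):
        req = TransferRequest(next(self._ids), global_name,
                              source_node, dest_node)
        self._pending.append(req)
        self._by_id[req.request_id] = req
        return req.request_id

    def status(self, request_id):
        return self._by_id[request_id].status

    def run_next(self, copy_fn):
        """Run the oldest queued request using copy_fn(request)."""
        if not self._pending:
            return None
        req = self._pending.popleft()
        req.status = "active"
        try:
            copy_fn(req)  # e.g. would drive an ftp-style client
            req.status = "done"
        except Exception:
            req.status = "failed"
        return req.request_id
```

Here `copy_fn` stands in for whatever agent actually moves the bytes between file servers.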

  10. Data Grid Server Components (Implementation)
     • Disk Cache Manager (back-end service)
        • Java application
        • Manages disk pool, NFS mounted read-only to local users
        • SQL database to track cached files and pending transfers
        • Migrates files to / from tape (if requested and if it has a reference to a Tape Manager)
        • Interacts with a Disk Policy Agent (planned)
     • Tape Manager (back-end service)
        • Separate Java application & db (running on a different host)
        • Stages files to or from silo (has its own small disk cache)
        • NFS exports stub file system

  11. Data Grid Node Components (Implementation)
     • Disk Policy Agent (back-end service)
        • Runs in Disk Cache Manager’s VM
        • Keeps replica catalog up to date
        • Advises cache manager as to which files to delete (deleting the last globally disk-resident copy is expensive)
        • Propagates transfer policy from Replica Catalog
     • Grid Transfer Agent (back-end service)
        • Operates on queued transfer requests
        • Uses remote File Servers (e.g. is or spawns an xxftp client)
        • Runs (probably) in disk cache manager’s VM
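One way the Disk Policy Agent's deletion advice could work is sketched below: files that still have another disk-resident copy somewhere in the grid are offered for deletion first, and the last disk-resident copy is given up only when nothing else is left. The function, its arguments, and the tie-breaking rule are hypothetical; the slides state only that deleting the last globally disk-resident copy is expensive.

```python
def choose_eviction_candidates(cached, online_copies, bytes_needed):
    """Pick files to delete until at least bytes_needed would be freed.

    cached:        name -> (size_bytes, last_access_time) for this node
    online_copies: name -> number of disk-resident copies grid-wide,
                   counting this node's copy (missing names count as 1)
    """
    def cost(name):
        _size, last_access = cached[name]
        is_last_copy = online_copies.get(name, 1) <= 1
        # False sorts before True, so files with other online copies
        # come first; ties are broken by oldest access time.
        return (is_last_copy, last_access)

    chosen, freed = [], 0
    for name in sorted(cached, key=cost):
        if freed >= bytes_needed:
            break
        chosen.append(name)
        freed += cached[name][0]
    return chosen
```

The `online_copies` counts would come from the replica catalog, which is why the agent also has to keep the catalog up to date.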

  12. Current Status of Data Grid Node
     • Data Grid Servlets
        • Translation from global name to URL is hard-coded
        • Supports browsing of disk cache
        • Newest prototype allows browsing of the unmanaged node-local file system, including /home, /data, …, and copying of files within a single data node (authentication to be added soon)
     • File Servers
        • bbftp in production use at Jlab; waiting for gridFTP

  13. Back End Status
     • Disk Cache Manager
        • Simple LRU policy (pluggable), no user quotas
        • No use of policy agent yet (to sync with replica catalog)
        • Automatic migration of specified files to tape, guaranteed before deletion
        • Only 1 node operating in this mode (variant of other disk cache managers at Jlab)
     • Tape Manager
        • Fully operational, in production use at Jlab
     • File Transfer Agent
        • Just starting development
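A minimal sketch of a least-recently-used cache that migrates flagged files to tape before deleting them is shown below. It is not the JLAB Disk Cache Manager: the class, its methods, and the `migrate_to_tape` callback are assumptions used to illustrate the two behaviours named on the slide.

```python
from collections import OrderedDict


class DiskCache:
    """LRU disk cache; flagged files are migrated to tape before deletion."""

    def __init__(self, capacity_bytes, migrate_to_tape):
        self.capacity = capacity_bytes
        self.migrate_to_tape = migrate_to_tape  # hypothetical callback(name)
        self.files = OrderedDict()  # name -> (size, must_migrate), oldest first
        self.used = 0

    def touch(self, name):
        """Record an access, making the file the most recently used."""
        self.files.move_to_end(name)

    def add(self, name, size, must_migrate=False):
        if size > self.capacity:
            raise ValueError("file larger than cache")
        if name in self.files:  # replacing an existing entry
            self.used -= self.files.pop(name)[0]
        while self.used + size > self.capacity:
            self._evict_oldest()
        self.files[name] = (size, must_migrate)
        self.used += size

    def _evict_oldest(self):
        name, (size, must_migrate) = self.files.popitem(last=False)
        if must_migrate:
            # Migration happens before the space is released.
            self.migrate_to_tape(name)
        self.used -= size
```

Swapping `_evict_oldest` for another selection rule is one way to read "pluggable"; user quotas are deliberately left out, as on the slide.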

  14. Status Summary
     • Missing Functionality
        • A lot!
        • Transfer queuing
        • Advanced reservation & quotas
        • Policy-based operations
        • Automatic updates of replica catalog
     All of these are planned or in progress…

  15. Data Grid Applications: File Manager
     • File Manager Design
        • Uses Replica Catalog (XML)
        • Uses Data Grid Node (XML)
        • GUI to browse files
        • GUI to copy files (and view queues)
     • Status
        • XML communications and file GUI done
        • 3rd party transfer operations awaiting additional functionality in the data grid node
        • Currently an application, but the plan is to make it into an applet

  16. Deployment / Development
     • 2Q 01
        • 2 data grid servers running at Jlab & MIT for LQCD
        • Grid browsing (replica catalog and data grid server)
        • Retrieve file: http, bbftp & gridftp
        • Command line utility and web interface to “publish” a file (insert into grid node from co-located machine / local file system)
     • 3Q 01
        • 2nd grid running between Jlab & FSU for CLAS (Hall D prototype)
        • “Push” file into a data grid server from offsite
        • 3rd party file transfers on demand (queued)
     • 1Q 02
        • Policy-based file migration
        • Asynchronous event notification (HTTP based)
