(e)Science-Driven, Production-Quality, Distributed Grid and Cloud Data Infrastructure for the Transformative, Disruptive, Revolutionary, Next-Generation TeraGrid (now with free ponies)
Data Architecture Progress Report
December 11, 2008
Chris Jordan
Goals for the Data Architecture
• Improve the experience of working with data in the TeraGrid for the user community
   • Reliability, ease of use, performance
• Integrate data management into the user workflow
   • Balance performance goals against usability
   • Avoid overdependence on data location
• Support the most common use cases as transparently as possible
   • Move data in, run job, move data out as the basic pattern (see the sketch below)
   • Organize, search, and retrieve data from large “collections”
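A minimal sketch of the "move data in, run job, move data out" pattern as a user might script it by hand today; the hostname, paths, and batch script name are placeholders, not actual TeraGrid resource names.

```python
import subprocess

# Placeholder endpoint and paths -- not real TeraGrid hostnames.
REMOTE = "username@login.example-hpc.org"
SCRATCH = "/scratch/username/run01"

def run(cmd):
    """Run a command, echoing it, and fail loudly on a non-zero exit."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Move data in: copy the input set to the compute resource's scratch space.
run(["scp", "-r", "input_data/", f"{REMOTE}:{SCRATCH}/"])

# 2. Run the job: submit a batch script that reads and writes under scratch.
#    (In practice the user waits for the job to finish before staging out.)
run(["ssh", REMOTE, f"cd {SCRATCH} && qsub job.pbs"])

# 3. Move data out: retrieve the results.
run(["scp", "-r", f"{REMOTE}:{SCRATCH}/output/", "results/"])
```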
Some Realities
• Cannot address the issue of available storage
• Limited opportunity to improve data transfer performance at the high end
• Cannot introduce drastic changes to TG infrastructure at this stage of the project
• Remain dependent on the availability of technology and resources for wide-area file systems
Areas of Effort
• Simplifying command-line data movement
• Extending the reach of WAN file systems
• Developing a unified data replication and management infrastructure
• Extending and unifying user portal interfaces to data
• Integrating data into scheduling and workflows
• Providing common access mechanisms to diverse, distributed data resources
Extending Wide-Area File Systems
• A “wide-area” file system is available on multiple resources
• A “global” file system is available on all TeraGrid resources
• Indiana and SDSC each have a WAN-FS in production now
• PSC has promising technology for distributed storage and Kerberos integration, but it needs testing to understand best management practices
• Point of emphasis: going production
Data Capacitor-WAN (DC-WAN)
• IU has this in production on BigRed and PSC’s Pople
• Can be mounted on any cluster running Lustre 1.4 or Lustre 1.6 (see the mount sketch below)
• Ready for testing and a move to production
• Sites and resources committed:
   • TACC Lonestar, Ranger, Spur
   • NCSA Abe, possibly Cobalt and/or Mercury
   • LONI Queen Bee (testing, possible production)
   • Purdue Steele?
• This presentation is an excellent opportunity to add your site to this list.
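For a site evaluating a DC-WAN mount, the client-side step is an ordinary Lustre client mount against IU's management server. The sketch below uses the Lustre 1.6 mount form; the MGS address, file system name, and mount point are placeholders, and the real values would come from IU.

```python
import subprocess

# Placeholder values -- the real MGS NID and fsname would be supplied by IU.
MGS_NID = "mgs.example.iu.edu@tcp0"
FSNAME = "dcwan"
MOUNTPOINT = "/mnt/dc-wan"

def mount_lustre_client():
    """Mount a Lustre file system on a client node (requires root and a
    Lustre client matching the server's 1.4/1.6 version)."""
    subprocess.run(["mkdir", "-p", MOUNTPOINT], check=True)
    subprocess.run(
        ["mount", "-t", "lustre", f"{MGS_NID}:/{FSNAME}", MOUNTPOINT],
        check=True,
    )

if __name__ == "__main__":
    mount_lustre_client()
```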
PSC “Josephine-WAN”
• Two major new design features:
   • Kerberos-based identity mapping (see the credential-check sketch below)
   • Distributed data and metadata
• Kerberos is likely to work well out of the box
• Distributed data/“storage pools” will need careful configuration and management
• Technology is working well, but needs to be actively investigated and tested in various configurations
• Want to work on integration with the TG User Portal
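Since identities are mapped via Kerberos, client access presumably depends on holding a valid ticket. A sketch of the kind of pre-flight check a user or portal could run; the realm name is a placeholder, as PSC has not yet published one for this service.

```python
import subprocess

# Placeholder realm -- the actual realm for the PSC WAN file system would be
# announced by PSC once the service enters testing.
REALM = "EXAMPLE.PSC.EDU"

def have_valid_ticket():
    """Return True if klist reports a current Kerberos credential cache."""
    result = subprocess.run(["klist", "-s"])  # -s: silent, exit status only
    return result.returncode == 0

if __name__ == "__main__":
    if not have_valid_ticket():
        # Obtain a new ticket-granting ticket in the placeholder realm.
        subprocess.run(["kinit", f"username@{REALM}"], check=True)
    print("Kerberos credentials in place; file system access should map to"
          " the corresponding local identity.")
```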
Getting to Global
• No single file system technology will be compatible with, or feasible to deploy on, every system
• Hybrid solutions will be required
• The TG User Portal helps, but…
• …we need to understand the limit on simultaneous mounts, and…
• …once production DC-WAN reaches that technical limit, look at technologies to extend the file system (see the sketch below):
   • pNFS
   • FUSE/SSHFS
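Where a native Lustre client cannot be deployed, a FUSE/SSHFS mount is one fallback for reaching the same directory tree from a user's workstation or an unsupported system. A sketch under that assumption; the host and paths are placeholders, and performance would be well below a native mount.

```python
import subprocess

# Placeholder endpoint -- any host that exports the WAN file system over SSH.
DATA_HOST = "username@dtn.example.org"
REMOTE_PATH = "/dc-wan/projects/myproject"
MOUNTPOINT = "/home/username/dc-wan"

def mount_over_sshfs():
    """Mount a remote directory via SSHFS (FUSE); no kernel Lustre client
    is needed on the local machine."""
    subprocess.run(["mkdir", "-p", MOUNTPOINT], check=True)
    subprocess.run(["sshfs", f"{DATA_HOST}:{REMOTE_PATH}", MOUNTPOINT],
                   check=True)

def unmount():
    """Release the FUSE mount when finished."""
    subprocess.run(["fusermount", "-u", MOUNTPOINT], check=True)

if __name__ == "__main__":
    mount_over_sshfs()
```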
Command-line tools
• Many users are still oriented towards shell access
• GridFTP is complicated to use via globus-url-copy
   • Long URLs, many often-inconsistent options
• SSH/SCP is almost universally available and familiar to users
   • Limited usefulness for data transfer in its current configuration
• Simple changes to the SSH/SCP configuration (see the comparison sketch below):
   • Support SCP-based access to data mover nodes
   • Support simpler addressing of data resources
   • Provide resource-specific “default” configurations
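To make the contrast concrete, here is the same transfer expressed both ways: a GridFTP URL pair with tuning flags, and plain scp against a data mover node. The hostnames and paths are illustrative only, not actual TeraGrid data-mover addresses.

```python
import os
import subprocess

SRC = "results.tar"
# Illustrative endpoints only -- not real TeraGrid data-mover hostnames.
GRIDFTP_DEST = "gsiftp://gridftp.example.org:2811/scratch/username/results.tar"
SCP_DEST = "username@datamover.example.org:scratch/results.tar"

# GridFTP today: long URLs plus tuning flags the user has to remember
# (-p sets the number of parallel streams, -tcp-bs the TCP buffer size).
subprocess.run(
    ["globus-url-copy", "-p", "4", "-tcp-bs", "4194304",
     f"file://{os.path.abspath(SRC)}", GRIDFTP_DEST],
    check=True,
)

# SCP with the proposed data-mover configuration: short, familiar addressing.
subprocess.run(["scp", SRC, SCP_DEST], check=True)
```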
Unified Data Management
• Management of both data and metadata, distributed across storage resources
• Multiple sites support data collections using SRB, iRODS, databases, web services, etc.
   • This diversity is good in the abstract, but also confusing to new users
• Extend the current iRODS-based data management infrastructure to additional sites (see the sketch below)
• Expand REDDNET “cloud storage” availability
• Integrate access to as many collections as possible through the User Portal
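A sketch of how a user works with an iRODS-backed collection through the standard icommands, which is the kind of interface the extended infrastructure would present; the zone and collection names are placeholders that depend on the site's iRODS deployment.

```python
import subprocess

# Placeholder collection path -- actual zone and collection names depend on
# the site's iRODS configuration.
COLLECTION = "/tgZone/home/username/simulation-output"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Register a local file into the collection (data and metadata managed by iRODS).
run(["iput", "run01.h5", COLLECTION])

# Attach a searchable metadata attribute/value pair to the uploaded object.
run(["imeta", "add", "-d", f"{COLLECTION}/run01.h5", "experiment", "run01"])

# List the collection and retrieve an object from any iRODS-connected site.
run(["ils", COLLECTION])
run(["iget", f"{COLLECTION}/run01.h5", "local-copy.h5"])
```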
Interfaces to Data
• SSH and “ls” are not effective interfaces to large, complex datasets
• Portal and Gateway interfaces to data have proven useful and popular, but:
   • They may not be able to access all resources, and may require significant gateway developer effort
• Extend the user portal to support WAN file systems and distributed data management
• Possible to expose the user portal and other APIs to ease development of gateways? (hypothetical sketch below)
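No such portal data API exists yet; purely as an illustration of what a gateway developer might call if one were exposed, here is a hypothetical HTTP client. The endpoint, parameters, authorization scheme, and response shape are all invented for this sketch.

```python
import json
import urllib.parse
import urllib.request

# Entirely hypothetical endpoint and response shape -- no such TeraGrid User
# Portal data API existed at the time of this report; illustration only.
BASE_URL = "https://portal.example.org/api/data"

def list_directory(resource, path, token):
    """List a directory on a named storage resource through the portal."""
    url = f"{BASE_URL}/{resource}/list?path={urllib.parse.quote(path)}"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

if __name__ == "__main__":
    for entry in list_directory("dc-wan", "/projects/myproject", "EXAMPLE-TOKEN"):
        print(entry["name"], entry["size"])
```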
Integrating Data into Workflows
• Almost all tasks run on the TeraGrid require some data management and multiple storage resources
• Users should be able to include these steps as part of a job or workflow submission (see the staging sketch below)
• Port DMOVER to additional schedulers, deploy across the TeraGrid
   • Working on BigBen, ready for Kraken
   • Working on SGE and LSF
• Evaluate PetaShare and other “data scheduling” systems (Stork?)
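Until DMOVER or an equivalent is deployed on a given scheduler, the pattern can be approximated by embedding the staging steps in the batch script itself. A sketch that generates such a script; the archive host, paths, resource request, and application name are placeholders, and this is not the DMOVER interface itself.

```python
# Placeholder values -- the real archive host, paths, and resource request
# are site- and project-specific.
ARCHIVE = "username@archive.example.org"

JOB_SCRIPT = """#!/bin/bash
#PBS -N staged-run
#PBS -l nodes=4:ppn=8,walltime=02:00:00

cd $PBS_O_WORKDIR

# Stage in: pull input data onto local scratch before the computation starts.
scp -r {archive}:/archive/myproject/input ./input

# Compute step.
mpirun ./simulation ./input ./output

# Stage out: push results back to archival storage when the run completes.
scp -r ./output {archive}:/archive/myproject/output
""".format(archive=ARCHIVE)

with open("job.pbs", "w") as f:
    f.write(JOB_SCRIPT)

print("Wrote job.pbs; submit with: qsub job.pbs")
```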
Gratuitous end slide #42
• The Data-WG has many attendees, but few participants
• We need:
   • More sites committed to deploying DC-WAN in production
   • More sites committed to testing “Josephine-WAN”
   • More sites contributing to the Data Collections infrastructure
   • Help porting DMOVER, testing PetaShare and REDDNET
   • Users and projects to exercise the infrastructure
• Select one or more
• If not you, who? If not now, when?