
Distributed Data for Science Workflows


Presentation Transcript


  1. Distributed Data for Science Workflows
  Data Architecture Progress Report, December 2008

  2. Challenges and Opportunities
  • TeraGrid is larger than ever before, meaning data is more widely distributed and needs to be more mobile
  • As previously reported, the balance of FLOPS to available storage has changed drastically
  • The TeraGrid user portal and science gateways have matured, and interfaces to TG resources have diversified
  • Need greater emphasis on unified interfaces to data and on integrating data into common workflows

  3. Constraints on the Architecture
  • We cannot address the issue of available storage
  • Limited opportunity to improve data transfer performance at the high end
  • Cannot introduce drastic changes to TG infrastructure at this stage of the project
  • Remain dependent on the availability of technology and resources for wide-area file systems

  4. Goals for the Data Architecture
  • Improve the experience of working with data in the TeraGrid for the majority of users
    • Reliability, ease of use, performance
  • Integrate data management into the user workflow
  • Balance performance goals against usability
  • Avoid overdependence on data location
  • Support the most common use cases as transparently as possible
    • Move data in, run job, move data out as the basic pattern
    • Organize, search, and retrieve data from large “collections”

  5. Areas of Interest
  • Simplifying command-line data movement
  • Extending the reach of WAN file systems
  • Developing a unified data replication and management infrastructure
  • Extending and unifying user portal interfaces to data
  • Integrating data into scheduling and workflows
  • Providing common access mechanisms to diverse, distributed data resources

  6. Command-line tools
  • Many users are still oriented towards shell access
  • GridFTP is too difficult to use
  • SSH is widely known but has limited usefulness in its current configuration
  • We need a new approach and/or tool that provides common, easy-to-use data movement without compromising on performance (see the sketch below)
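
To make the usability gap concrete, here is a minimal sketch contrasting SCP with GridFTP's globus-url-copy; the host names and paths are hypothetical, and exact flags vary by installation:

    # Hypothetical hosts and paths, for illustration only.

    # SCP: familiar and simple, but single-stream and untuned
    scp results.tar.gz user@login.bigsite.teragrid.org:/scratch/user/

    # GridFTP: better WAN throughput, but certificates, URLs,
    # and tuning flags put it out of reach for many users
    grid-proxy-init                    # obtain a short-lived proxy credential
    globus-url-copy -p 4 -tcp-bs 4194304 \
        file:///home/user/results.tar.gz \
        gsiftp://gridftp.bigsite.teragrid.org/scratch/user/results.tar.gz

A tool with SCP's interface and GridFTP-class transfer performance underneath is the kind of combination this need points towards.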

  7. Extending Wide-Area File Systems
  • A “Wide-Area” file system is available on multiple resources
  • A “Global” file system is available on all TeraGrid resources
  • Indiana and SDSC each have a WAN-FS in production now
  • Need to honestly assess the potential for global file systems, while making WAN file systems available on more resources (see the mount sketch below)
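
As a sketch of what “available on multiple resources” means in practice, a wide-area Lustre file system is mounted over the network on each participating system; the MGS address and file system name below are hypothetical, and real WAN deployments involve network and authentication configuration not shown:

    # Hypothetical MGS node and fsname, for illustration only.
    # Mount a Lustre file system exported over the wide area:
    mount -t lustre mgs.datasite.teragrid.org@tcp0:/wanfs /mnt/wan-fs

    # The same paths then resolve identically on every resource that mounts it:
    ls /mnt/wan-fs/projects/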

  8. Unified Data Management
  • Management of both data and metadata, which may be stored at one or more locations in TeraGrid
  • Multiple sites support data collections using SRB, iRODS, databases, web services, etc.
  • This diversity is good, but also confusing to new users
  • Need a single service, which may utilize multiple technologies, to provide a common entry point for users (see the sketch below)
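
As a sketch of what such an entry point can look like from the command line, the iRODS i-commands below ingest a file, tag it with metadata, and later locate and retrieve it by attribute rather than by physical location; the zone, resource, and attribute names are hypothetical:

    # Hypothetical zone, resource, and metadata, for illustration only.

    # Ingest a file into a managed collection on a chosen storage resource
    iput -R archiveResc simulation-042.dat /tgZone/home/user/runs/

    # Attach a searchable attribute to the data object
    imeta add -d /tgZone/home/user/runs/simulation-042.dat experiment climate-2008

    # Later: query by attribute, then retrieve, without knowing where the bytes live
    imeta qu -d experiment = climate-2008
    iget /tgZone/home/user/runs/simulation-042.dat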

  9. Interfaces to Data
  • SSH and “ls” are not effective interfaces to large, complex datasets
  • Portal and gateway interfaces to data have proven useful and popular, but:
    • They may not be able to access all resources, and may require significant effort from gateway developers
  • Extend the user portal to support WAN file systems and distributed data management
  • Possible to expose user portal internals to ease development of gateways?

  10. Integrating Data into Workflows
  • Almost all tasks run on TeraGrid require some data management and multiple storage resources
    • Moving data into an HPC system
    • Moving results to an analysis or viz system
    • Moving results to an archive
  • Need to make these tasks less human-intensive
  • Users should be able to include these steps as part of their job submission (see the batch-script sketch below)
  • Tools such as DMOVER and PetaShare already exist but are not widely available in TeraGrid
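
Where the local scheduler supports staging directives, the “move data in, run job, move data out” pattern can live entirely inside the job submission. A minimal PBS/Torque-style sketch with hypothetical hosts and paths (directive support and staging behavior vary by site and scheduler):

    #!/bin/bash
    # Hypothetical hosts and paths, for illustration only.
    #PBS -l nodes=16,walltime=04:00:00
    # Copy input from the data host before the job starts:
    #PBS -W stagein=input.dat@datahost.teragrid.org:/archive/user/input.dat
    # Copy results back to the data host after the job ends:
    #PBS -W stageout=output.dat@datahost.teragrid.org:/archive/user/output.dat

    # Staged-in files appear in the job's home/working directory (site-dependent)
    mpirun ./simulate input.dat output.dat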

  11. Some Implementation Plans
  • Extend the current iRODS-based data management infrastructure to additional sites
  • Test the use of REDDNET for distributed data storage and access in TeraGrid
  • Provide a TGUP interface to Lustre-WAN
  • Provide a TGUP interface to distributed data and metadata management
  • Extend the current production IU Lustre-WAN and GPFS-WAN to as many compatible resources as possible

  12. More Implementation Plans
  • Port DMOVER to additional schedulers and deploy it across TeraGrid
  • Develop and execute a plan for PSC-based Lustre-WAN and GPFS/pNFS testing and eventual production deployment (already underway)
  • Work with the Gateways group to provide appropriate interfaces to data movement through the UP or other mechanisms
  • Simple changes to SSH/SCP configuration (see the sketch below):
    • Support SCP-based access to data mover nodes
    • Support simpler addressing of data resources
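
For the SSH/SCP item, much of the “simpler addressing” can be had with a short ssh_config alias for a site's data mover node; the host and user names below are hypothetical:

    # ~/.ssh/config entry (hypothetical names, for illustration only):
    Host tg-data
        HostName data-mover.bigsite.teragrid.org
        User tguser

Plain scp can then address the data mover by its short alias:

    scp results.tar.gz tg-data:/scratch/tguser/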

  13. The Cutting, not the Bleeding Edge
  • Primary goal is to improve the availability of robust, production technologies for data
  • Balancing performance, usability, and reliability will always be a challenge
  • Need to be agile in assessing new technologies and improvements on old ones
  • Data Working Group should focus on improvements to the configuration of a few production components
  • Make consistent, well-planned efforts to evaluate new components

  14. To-Do List for December
  • Understand the level of required vs. available effort
  • Work with other areas/WGs to place the Data Architecture in context (CTSS, Gateways, etc.)
  • Set priorities and order tasks
  • Develop timelines and milestones for execution
  • Present the integrated Data Architecture Description and Plan in early January
