StorNet: Co-Scheduling Network and Storage with TeraPaths and SRM Dantong Yu (BNL) Arie Shoshani (LBNL) September 28-29, 2009
Outline Project Overview Motivation Approach and Architecture Gap Between The Existing Work And Project Goals Required New Functionalities Services and Communication Flows Project Plans Backup Slides
Project Overview • Project Goals: • To design and develop integrated end-to-end resource provisioning for high-performance data transfer. • To improve co-scheduling of network and storage resources, ensuring data transfer efficiency and resource utilization. • To support end-to-end data transfer with a negotiated transfer completion timeline. • Project Participants: • LBNL: Arie Shoshani, Alex Sim, Junmin Gu, Viji Natarajan • BNL: Dantong Yu, Dimitrios Katramatos, and Xin Liu (newly hired postdoc).
Motivation • End-to-end scheduling of data movement requires: • Availability of network bandwidth on the backbone wide area network (WAN) • Availability of local area network (LAN) bandwidth from end hosts to the border routers of the WAN • But also • Availability of data to be moved out at the source • Availability of storage space at the target • Availability of bandwidth at the source storage system (i.e., disk and network cards) • Availability of bandwidth at the target storage system • Why is that hard? • Need to coordinate source and target bandwidth to match each other within available windows • Also, need to coordinate these with internal and existing network bandwidth
Approach and Architecture • Leverage existing technologies • TeraPaths on top of OSCARS (network concatenation) • Storage Resource Managers (SRMs) on top of TeraPaths • Use Berkeley Storage Manager (BeStMan) implementation of SRM
What’s missing in these tools to achieve our goals • BeStMan needs to be enhanced to: • Keep track of bandwidth commitments for multiple requests • Coordinate between source and target BeStMans for storage space and bandwidth • Provide advance reservation for future time window commitments • Communicate and coordinate with the underlying TeraPaths • TeraPaths needs to be enhanced to: • Receive bandwidth requests from BeStMan in the form of (volume, max-bandwidth, max-completion-time) • Negotiate with OSCARS for the “best” time window • “best” can be earliest completion time, or shortest transfer time • On success, commit the reservation and return it to BeStMan • On failure, find the closest solution to suggest to BeStMan
Services • SRM Services: • Processing service requests and subsequently coordinating with the network planes. • Network Services: • End-to-end circuits connecting two storage sites. • Service State/Status: • SRM data transfer progress and performance. • End-to-end circuit state and performance.
Multi-Layer Capability View • Translate the multi-layer network architecture view into our project implementation. [Figure: layered capability stack. Application/middleware layer: SRM/GridFTP applications. Layer 4: TCP, UDP, with TeraPaths control. Layer 3: QoS, MPLS, IP. Layer 2: VLANs, with Layer 2 control. Each layer is crossed by the AA, control, data, service (TeraPaths services), management, and security planes; the legend marks security as not in the implementation.]
Multiple-Layer Architecture View • BeStMan/Application Plane • AA Plane • TeraPaths Service Plane • TeraPaths Management Plane • TeraPaths Control Plane • Generic Data Plane Layer
Specific Use Case: BeStMan in “pull” mode 1) Target BeStMan (T-BeStMan) gets a request (userID (credential, priority), files/directory, maxCompletionTime) 2) T-BeStMan checks whether it has any of the files, and pins them (until maxCompletionTime) 3) T-BeStMan contacts the source BeStMan (S-BeStMan) to get volumeOfRestOfFiles and S-maxBandwidth, and receives the response 4) T-BeStMan allocates space (for volume) and finds its own T-maxBandwidth 5) T-BeStMan determines desiredMaxBandwidth = min(T-maxBandwidth, S-maxBandwidth) 6) T-BeStMan calls local TeraPaths (TPs) for “reserve and commit” (userID, desiredBeginTime=now, volume, desiredMaxBandwidth, maxCompletionTime) 7) TPs checks validity of userID, priority, and authorization, and negotiates with OSCARS 8) TPs returns (a) (reservationID, reservedBeginTime, reservedEndTime, reservedBandwidth), or (b) “can’t do it by maxCompletionTime, but here is a new (longer) completion time” 9) T-BeStMan informs the user: case (a) “here is your reservation, OK?” If yes, no action; if no, issue cancel reservation to TPs. Case (b) “can’t do it; do you wish to use the extended maxCompletionTime?” If no, cancel; if yes, accept.
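Steps 5 through 8 of the pull-mode use case can be sketched in a few lines of Python. This is a minimal illustration, not the BeStMan or TeraPaths API: the names `Offer` and `plan_transfer` are hypothetical, and the sketch assumes the network can grant the full desired bandwidth starting at the requested begin time, whereas the real system negotiates the window with OSCARS.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Hypothetical stand-in for the tuple TeraPaths returns in step 8."""
    reserved_begin: float      # seconds
    reserved_end: float        # seconds
    reserved_bandwidth: float  # bytes per second
    meets_deadline: bool       # True for case (a), False for case (b)


def plan_transfer(volume, t_max_bw, s_max_bw, begin, max_completion_time):
    """Compute the offer for a transfer of `volume` bytes.

    Assumption: the full desired bandwidth is available from `begin` onward.
    """
    # Step 5: the transfer cannot run faster than the slower end point.
    desired_max_bw = min(t_max_bw, s_max_bw)
    # Shortest possible transfer time at that rate.
    duration = volume / desired_max_bw
    end = begin + duration
    # Step 8: case (a) if the deadline is met, otherwise case (b) with the
    # longer completion time reported back in reserved_end.
    return Offer(begin, end, desired_max_bw, end <= max_completion_time)
```

When `meets_deadline` is false, `reserved_end` plays the role of the extended completion time that T-BeStMan offers to the user in step 9, case (b).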
Communication Flow: New APIs to be defined and functionality developed • Client-to-BeStMan • BeStMan-to-BeStMan • BeStMan-to-TeraPaths • Control Plane in TeraPaths [Figure: a pushing client and a pulling client talk to the source and target BeStMan, each of which performs space management and bandwidth management; control flow runs from both BeStMans to TeraPaths (bandwidth coordination and reservation), and data flows between the source and target BeStMan.] Note: Push and pull modes are needed because of security limitations
The TeraPaths project • Background of TeraPaths • Project Objective • View of the world (network) • System architecture • Establishing flow-based end-to-end QoS paths • Domain interoperation • Distributed reservation negotiation.
TeraPaths Overview • TeraPaths is a DOE/Office of Science project on end-to-end QoS (BNL, Michigan, Boston University and Stony Brook) • It provides QoS guarantees at the individual data flow level • From end host to end host; transparently • Because not all data flows are the same… • Default “best effort” network behavior treats all data flows as equal • Capacity is limited • Congestion causes bandwidth and latency variations • Performance and service disruption problems, unpredictability • Data flows have varying priority/importance • Video streams, Critical data, Long duration transfers • It schedules network utilization • Regulate and classify (prioritize) traffic accounting for policy/SLA • It’s targeted for “high-impact” domains…not intended to scale to the internet in general
View of the High Performance Network [Figure: end sites A, B, C, and D, with TeraPaths instances and domain controllers, connected through regional networks (RN) and WAN domains 1, 2, and 3, each with a WAN controller; the paths shown are MPLS tunnels and dynamic circuits.]
Establishing End-to-End QoS Paths • Multiple administrative domains • Cooperation, trust, but each maintains full control • Heterogeneous environment • Domain controller coordination through web services • Coordination models • Star • Requires extensive information for all domains • Daisy chain • Requires common flexible protocol across all domains • Hybrid (star+daisy chain, end-sites first) • Independent protocols • Direct end site negotiation
L2 vs. L3 (1/2) • MPLS tunnel starts and ends within WAN domain • Packets are admitted into the tunnel based on flow ID information (IPsrc, portsrc, IPdst, portdst) • WAN admission performed at the first router of the tunnel (ingress) [Figure: an MPLS tunnel inside the WAN between two ingress/egress routers, with an end-site border router on each side.]
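Flow-based admission at the tunnel ingress amounts to a lookup keyed on the four-field flow ID. The sketch below only illustrates that idea; `FlowID`, `admit_to_tunnel`, and the tunnel names are hypothetical, and a real router performs this match in its forwarding path rather than in application code.

```python
from typing import Dict, NamedTuple, Optional


class FlowID(NamedTuple):
    """The flow ID fields named on the slide."""
    ip_src: str
    port_src: int
    ip_dst: str
    port_dst: int


def admit_to_tunnel(flow: FlowID, tunnels: Dict[FlowID, str]) -> Optional[str]:
    """Return the tunnel reserved for this flow, or None for best effort."""
    return tunnels.get(flow)


# Example table with one reserved flow (documentation-range addresses).
reserved = {FlowID("192.0.2.10", 5000, "198.51.100.20", 2811): "tunnel-1"}
```

Packets whose flow ID is not in the table fall through to the default best-effort path.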
L2 vs. L3 (2/2) • Dynamic circuit appears as VLAN connecting end site border routers with single hop • Cannot use flow ID data directly • Flow must be directed to the proper VLAN • WAN admission performed within end site LAN • Select VLAN with Policy Based Routing (PBR) at both ends • Route can be selected on a per-flow basis [Figure: end-site switches and border routers on both sides of the WAN, joined by the dynamic circuit.]
What is needed for a reservation? • The ReservationData data structure contains all necessary information about a reservation. • Source and destination addresses and ports • Start time and duration • Bandwidth required and QoS class • Related WAN reservation identifiers • User credentials • Rescheduling criteria • Most TeraPaths web services use the ReservationData data structure to communicate with each other.
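As a rough picture of what such a structure carries, here is a Python dataclass with one field per item listed on the slide. The field names, types, and units are assumptions made for illustration; the actual TeraPaths ReservationData definition is not shown in these slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ReservationData:
    """Illustrative only: field names and units are hypothetical."""
    src_address: str
    src_port: int
    dst_address: str
    dst_port: int
    start_time: float          # seconds since some agreed epoch
    duration: float            # seconds
    bandwidth_mbps: float      # bandwidth required
    qos_class: str
    wan_reservation_ids: List[str] = field(default_factory=list)
    user_credentials: Optional[str] = None
    rescheduling_criteria: Optional[dict] = None

    @property
    def end_time(self) -> float:
        """End of the reserved window, derived from start and duration."""
        return self.start_time + self.duration

    def flow_id(self) -> Tuple[str, int, str, int]:
        """The (IPsrc, portsrc, IPdst, portdst) tuple identifying the flow."""
        return (self.src_address, self.src_port, self.dst_address, self.dst_port)
```

Deriving `end_time` instead of storing it keeps start time and duration as the only source of truth when a reservation is rescheduled.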
Distributed Reservation Negotiation • End-to-end paths comprise multiple segments • Each segment of each domain is established by a reservation • Domains have to agree on parameters and their ranges • Each domain is characterized by a resource availability graph, e.g., for bandwidth • The availability of all domains can be established by calculating the minimum availability graph • Each new reservation has to fit in the available area • Reservations that don’t fit have to be modified • If no modification makes a reservation fit, it is rejected • TeraPaths currently modifies only start time on an individual site basis and iterates with counter offers • OSCARS is tried if/after end-sites agree • Will extend to modify start time, end time, and bandwidth, using end-to-end bandwidth availability graphs (BAGs) if applicable, or a combination of BAGs and trial and error otherwise
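Fitting a reservation by modifying only its start time can be sketched as a search over an availability graph. The sketch assumes the graph is a sorted list of non-overlapping `(start, end, available_bandwidth)` steps; that representation and the names `fits` and `earliest_start` are illustrative choices, not the TeraPaths implementation.

```python
def fits(bag, start, duration, bandwidth):
    """True if `bandwidth` is available over the whole window.

    `bag` is a sorted list of non-overlapping (seg_start, seg_end, available)
    steps; time outside every step counts as unavailable.
    """
    end = start + duration
    covered = start
    for seg_start, seg_end, available in bag:
        if seg_end <= start or seg_start >= end:
            continue  # segment does not overlap the window
        if available < bandwidth or seg_start > covered:
            return False  # too little bandwidth, or a gap in the graph
        covered = seg_end
    return covered >= end


def earliest_start(bag, ts_min, ts_max, duration, bandwidth):
    """Earliest start in [ts_min, ts_max] at which the request fits, or None.

    For a step function, the earliest feasible start is either ts_min or a
    step boundary, so only those candidates are tested.
    """
    candidates = {ts_min}
    candidates.update(s for s, _, _ in bag if ts_min <= s <= ts_max)
    for ts in sorted(candidates):
        if fits(bag, ts, duration, bandwidth):
            return ts
    return None  # no start time works: modify other parameters or reject


# Example graph: 5 units free until t=10, 2 until t=20, 8 until t=40.
example_bag = [(0, 10, 5), (10, 20, 2), (20, 40, 8)]
```

A `None` result corresponds to the rejection case on the slide, or to the planned extension in which end time and bandwidth are modified as well.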
Bandwidth Reservation Requests [Figure: the top panel plots reserved bandwidth over time for six reservations (1 to 6), each with a start time ts and an end time te, below the maximum bandwidth; the bottom panel is the resulting Bandwidth Availability Graph (BAG), showing available bandwidth over time t1 to t12.]
Find Resources for New Request [Figure: (a) a new request drawn against the BAG (bandwidth vs. time, t1 to t12); (b) the new (modified) request placed at a start time TS between TSmin and TSmax.]
End-to-End Bandwidth Availability Graph [Figure: (a) the BAGs of Domain A and Domain B, with maxima max A and max B; (b) the combined BAG, bounded by max B, over time t1 to t14.]
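The end-to-end graph is the pointwise minimum of the per-domain graphs. Assuming each BAG is a sorted list of `(start, end, available_bandwidth)` steps (an illustrative representation, with `availability_at` and `combine_bags` as hypothetical names), the combination can be sketched as:

```python
def availability_at(bag, t):
    """Available bandwidth of one domain at time t (0 outside its graph)."""
    for seg_start, seg_end, available in bag:
        if seg_start <= t < seg_end:
            return available
    return 0


def combine_bags(*bags):
    """Pointwise minimum of several step-function BAGs."""
    # Every step boundary of any domain is a potential boundary of the result.
    points = sorted({p for bag in bags for s, e, _ in bag for p in (s, e)})
    combined = []
    for left, right in zip(points, points[1:]):
        level = min(availability_at(bag, left) for bag in bags)
        if combined and combined[-1][2] == level:
            # Merge with the previous step when the level is unchanged.
            combined[-1] = (combined[-1][0], right, level)
        else:
            combined.append((left, right, level))
    return combined


domain_a = [(0, 10, 10), (10, 20, 4)]
domain_b = [(0, 5, 6), (5, 20, 8)]
```

A reservation that fits in the combined graph fits in every domain along the path, which is why the minimum is the right combination.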
Storage Resource Managers (SRMs) Definition • SRMs are middleware components whose function is to provide dynamic space allocation and file management for storage components
Requirements • Grid architecture needs to include reservation & scheduling of: • Compute resources • Storage resources • Network resources • Storage Resource Managers (SRMs) role in the data grid architecture • Shared storage resource allocation & scheduling • Especially important for data intensive applications • Often files are archived on a mass storage system (MSS) • Wide area networks – need to minimize transfers by file sharing • Scaling: large collaborations (100’s of nodes, 1000’s of clients) – opportunities for file sharing • Controlled file replication and caching • Need to support non-blocking (asynchronous) requests • Storage Cleanup (garbage collection) – using lifetime
Uniformity of Interface, Compatibility of SRMs [Figure: a client (user/applications) and Grid middleware reach different storage systems through identical SRM interfaces: Enstore, dCache, JASMine, Unix-based disks, Castor, and SE; CCLRC RAL is also labeled.]
Want: Peer-to-Peer Uniform Interface [Figure: at the client’s site, a command-line client or client program talks through the uniform SRM interface to a local Storage Resource Manager with its disk cache; over the network it reaches Storage Resource Managers at Site 1, Site 2, ..., Site N, each managing disk caches and, at some sites, an MSS.]
Who’s involved… • CERN, European Organization for Nuclear Research, Switzerland • Lana Abadie, Paolo Badino, Olof Barring, Jean-Philippe Baud, Tony Cass, Flavia Donno, Akos Frohner, Birger Koblitz, Sophie Lemaitre, Maarten Litmaath, Remi Mollon, Giuseppe Lo Presti, David Smith, Paolo Tedesco • Deutsches Elektronen-Synchrotron, DESY, Hamburg, Germany • Patrick Fuhrmann, Tigran Mkrtchan • Fermi National Accelerator Laboratory, Illinois, USA • Matt Crawford, Dmitry Litvinsev, Alexander Moibenko, Gene Oleynik, Timur Perelmutov, Don Petravick • ICTP/EGRID, Italy • Ezio Corso, Massimo Sponza • INFN/CNAF, Italy • Alberto Forti, Luca Magnoni, Riccardo Zappi • LAL/IN2P3/CNRS, Faculté des Sciences, Orsay Cedex, France • Gilbert Grosdidier • Lawrence Berkeley National Laboratory, California, USA • Junmin Gu, Vijaya Natarajan, Arie Shoshani, Alex Sim • Rutherford Appleton Laboratory, Oxfordshire, England • Shaun De Witt, Jens Jensen, Jiri Menjak • Thomas Jefferson National Accelerator Facility (TJNAF), USA • Michael Haddox-Schatz, Bryan Hess, Andy Kowalski, Chip Watson
Storage Resource Managers: Main concepts • Non-interference with local policies • Advance space reservations • Dynamic space management • Pinning file in spaces • Support abstract concept of a file name: Site URL • Temporary assignment of file names for transfer: Transfer URL • Directory Management and ACLs • Transfer protocol negotiation • Peer to peer request support • Support for asynchronous multi-file requests • Support abort, suspend, and resume operations
SRM functionality • Space reservation • Negotiate and assign space to users • Manage “lifetime” of spaces • Release and compact space • File management • Assign space for putting files into SRM • Pin files in storage when requested till they are released • Manage “lifetime” of files • Manage action when pins expire (depends on file types) • Get files from remote locations when necessary • Purpose: to simplify client’s task • srmCopy: in “pull” and “push” modes
Concepts: Space Reservations • Negotiation • Client asks for space: C-guaranteed, MaxDesired • SRM returns: S-guaranteed <= C-guaranteed, best effort <= MaxDesired • Type of spaces • Specified during srmReserveSpace • Access Latency (Online, Nearline) • Retention Policy (Replica, Output, Custodial) • Subject to limits per client (SRM or VO policies) • Default: implementation and configuration specific • Lifetime • Negotiated: C-lifetime requested • SRM returns: S-lifetime <= C-lifetime • Space reference handle • SRM returns space reference handle (space-token) • Client can assign a description • User can use srmGetSpaceTokens to recover handles on the basis of ownership
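The negotiation rules above reduce to clamping each client request by what the SRM is willing to grant. The following sketch assumes a very simple policy with three limits (`free_guaranteed`, `free_total`, `max_lifetime`); these parameters and the function name are hypothetical and do not come from the SRM specification.

```python
def negotiate_space(c_guaranteed, max_desired, c_lifetime,
                    free_guaranteed, free_total, max_lifetime):
    """Return (s_guaranteed, s_total, s_lifetime) for a client request.

    Invariants from the slide: s_guaranteed <= c_guaranteed,
    s_total (best effort) <= max_desired, s_lifetime <= c_lifetime.
    """
    if max_desired < c_guaranteed:
        raise ValueError("MaxDesired must be at least C-guaranteed")
    # Best-effort total is capped by the request and by what is free.
    s_total = min(max_desired, free_total)
    # Guaranteed space is capped by the request, the guaranteed pool,
    # and the total actually granted.
    s_guaranteed = min(c_guaranteed, free_guaranteed, s_total)
    # Lifetime is capped by local policy.
    s_lifetime = min(c_lifetime, max_lifetime)
    return s_guaranteed, s_total, s_lifetime
```

A client that needs the full guarantee can compare the returned values with its request and release the space if the grant is too small.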
Concepts: Site URL and Transfer URL • Provide: Site URL (SURL) • URL known externally – e.g. in Replica Catalogs • e.g. srm://ibm.cnaf.infn.it:8444/dteam/test.10193 • Get back: Transfer URL (TURL) • Path can be different from SURL – SRM internal mapping • Protocol chosen by SRM based on request protocol preference • e.g. gsiftp://ibm139.cnaf.infn.it:2811//gpfs/sto1/dteam/test.10193 • One SURL can have many TURLs • Files can be replicated in multiple storage components • Files may be in near-line and/or on-line storage • In a light-weight SRM (a single file system on disk) • SURL may be the same as TURL except protocol • File sharing is possible • Same physical file, but many requests • Needs to be managed by SRM
Concepts: Transfer Protocol Negotiation • Negotiation • Client provides an ordered list of desired transfer protocols • SRM returns: the highest-ranked protocol in that list that it supports • Example • Protocols list: bbftp, gridftp, ftp • SRM returns: gridftp • Advantages • Easy to introduce new protocols • User controls which protocol to use • How is it returned? • As the protocol of the Transfer URL (TURL) • Example: bbftp://dm.slac.edu/temp/run11/File678.txt
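The negotiation is a first-match search through the client's ordered list. In the sketch below the function names are hypothetical, and the supported set in the example is an assumption chosen to reproduce the slide's outcome (an SRM that does not offer bbftp).

```python
def negotiate_protocol(client_preferences, supported):
    """Return the first protocol in the client's ordered list that the SRM
    supports, or None if there is no common protocol."""
    for protocol in client_preferences:
        if protocol in supported:
            return protocol
    return None


def protocol_of_turl(turl):
    """Read the negotiated protocol back from the scheme of a TURL."""
    return turl.split("://", 1)[0]


# Assumed SRM capability set for the slide's example.
srm_supported = {"gridftp", "ftp"}
chosen = negotiate_protocol(["bbftp", "gridftp", "ftp"], srm_supported)
```

Because the client's order decides the outcome, a new protocol can be introduced by adding it to the SRM's supported set and to the front of the client's list, without changing the interface.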
Summary: SRM Methods (partial list) • Space management: srmReserveSpace, srmReleaseSpace, srmUpdateSpace, srmGetSpaceTokens • FileType management: srmChangeFileType, srmChangeSpaceForFiles • Status/metadata: srmGetRequestStatus, srmGetFileStatus, srmGetRequestSummary, srmGetRequestID, srmGetFilesMetaData, srmGetSpaceMetaData • File movement: srmPrepareToGet, srmPrepareToPut, srmRemoteCopy, srmBringOnline, srmAddFilesToSpace, srmPurgeFromSpace • Lifetime management: srmReleaseFiles, srmPutDone, srmExtendFileLifeTimeInSpace • Terminate/resume: srmAbortRequest, srmAbortFile, srmSuspendRequest, srmResumeRequest
e.g. Request-to-Get Files Functional Spec srmPrepareToGet In: TUserID userID, TGetFileRequest[ ] arrayOfFileRequest, string[] TransferProtocols, string userRequestDescription, TStorageSystemInfo storageSystemInfo, Boolean streamingMode Out: TRequestToken requestToken, TReturnStatus returnStatus, TGetRequestFileStatus[ ] arrayOfFileStatus
e.g. Space Reservation Functional Spec srmReserveSpace In: TUserID userID, TSpaceType typeOfSpace, String userSpaceTokenDescription, TSizeInBytes sizeOfTotalSpaceDesired, TSizeInBytes sizeOfGuaranteedSpaceDesired, TLifeTimeInSeconds lifetimeOfSpaceToReserve, TStorageSystemInfo storageSystemInfo, Int[ ] expectedFileSize Out: TSpaceToken referenceHandleOfReservedSpace, TSpaceType typeOfReservedSpace, TSizeInBytes sizeOfTotalReservedSpace, TSizeInBytes sizeOfGuaranteedReservedSpace, TLifeTimeInSeconds lifetimeOfReservedSpace, TReturnStatus returnStatus
Berkeley Storage Manager (BeStMan) • LBNL Java implementation • Designed to work with Unix-based disk systems • As well as MSS to stage/archive from/to its own disk (currently HPSS) • Adaptable to other file systems and storage systems (e.g. NCAR MSS, VU L-Store, TTU Lustre) • Uses in-memory database (BerkeleyDB) • Multiple transfer protocols • Space reservation • Directory management (no ACLs) • Can copy files from/to remote SRMs • Can copy entire directory robustly • Large scale data movement of thousands of files • Recovers from transient failures (e.g. MSS maintenance, network down) • Local Policy • Fair request processing • File replacement in disk • Garbage collection
SRMs at work • Europe : WLCG/EGEE • 177+ deployments, managing more than 10PB • 116 DPM/SRM • 54 dCache/SRM • 7 CASTOR/SRM at CERN, CNAF, PIC, RAL, Sinica • StoRM at ICTP/EGRID, INFN/CNAF • US • Estimated at about 35 deployments • OSG • dCache/SRM from FNAL • BeStMan/SRM from LBNL • BeStMan-Gateway • Skeleton SRM for local implementation • SRM-Xrootd: using BeStMan-Gateway for Xrootd • ESG • DRM/SRM, HRM/SRM at LANL, LBNL, LLNL, NCAR, ORNL • Others • JasMINE/SRM from TJNAF • L-Store/SRM from Vanderbilt Univ. • BeStMan/SRM adaptation on Lustre file system at Texas Tech
Interoperability in SRM v2.2 [Figure: a client (user/application) accesses interoperating SRM v2.2 implementations; labels include CASTOR, dCache, DPM, BeStMan, SRB (iRODS), xrootd, disk, and a MySQL DB, at sites including BNL, SLAC, LBNL, SDSC, SINICA, and EGEE.]
Earth System Grid • Main ESG portal • 148.53 TB of data at four locations (NCAR, LBNL, ORNL, LANL) • 965,551 files • Includes the past 7 years of joint DOE/NSF climate modeling experiments • 4713 registered users from 28 countries • Downloads to date: 31 TB/99,938 files • IPCC AR4 ESG portal • 28 TB of data at one location • 68,400 files • Model data from 11 countries • Generated by a modeling campaign coordinated by the Intergovernmental Panel on Climate Change (IPCC) • 818 registered analysis projects from 58 countries • Downloads to date: 123 TB/543,500 files, 300 GB/day on average • Courtesy: http://www.earthsystemgrid.org
SRM works in concert with other Grid components in Earth System Grid (ESG) [Figure: ESG deployment across LBNL, ANL, NCAR, ORNL, LLNL, ISI, and LANL. Sites run HRM or DRM storage resource managers in front of disk and mass storage (HPSS, LAHFS), together with GridFTP servers, an FTP server, OPeNDAP-g, and RLS instances; the ESG Portal and IPCC Portal use XML data catalogs, the ESG Metadata DB, a user DB, the ESG CA, MyProxy, the Globus security infrastructure, and Monitoring and Discovery Services. Legend: MCS = Metadata Cataloguing Services; MSS = Mass Storage System; RLS = Replica Location Services; DRM and HRM = Storage Resource Management.]
STAR experiment • Data Replication from BNL to LBNL • 1 TB/10K files per week on average • In production for over 4 years • Event processing in Grid Collector • Prototype uses SRMs and FastBit indexing embedded in STAR framework • STAR analysis framework • Job-driven data movement • Use BeStMan to bring files into local disk from a remote file repository • Execute jobs that access “staged in” files in local disk • Job creates an output file on local disk • Job uses BeStMan to move the output file from local storage to remote archival location • SRM cleans up local disk when the transfer is complete • Can use any other SRMs implementing v2.2
DataMover in HENP-STAR experiment for Robust Multi-file replication over WAN [Figure: DataMover (command-line interface, run from anywhere) gets the list of files from a directory, creates equivalent directories, and issues SRM-COPY for thousands of files. BeStMan at LBNL (performs writes) issues SRM-GET one file at a time to BeStMan at BNL (performs reads); files are staged into the disk cache at BNL, moved over the network by GridFTP GET (pull mode), archived from the disk cache at LBNL, and registered in the catalog through RRS. The transfer runs in streaming mode.]
File Tracking Shows Recovery From Transient Failures Total: 45 GB
Summary • Storage Resource Management – essential for Grid • SRM is a functional definition • Adaptable to different frameworks (currently web-service) • Multiple implementations interoperate • Permits special purpose implementations for unique products • Permits interchanging one SRM product with another • SRM implementations exist and some are in production use • Particle Physics Data Grid • Earth System Grid • Medicine • Fusion • More coming … • Cumulative experience in OGF GSM-WG • Specification SRM v2.2 now accepted
TeraPaths Tasks and Schedule 0) Planning and design of TeraPaths new functionalities (0, 1) 1) Make decision on layer 2 or layer 3 reservations based on source and target sites; decide on uni-directional or bi-directional reservations (1, 2) 2) Find reservation that minimizes reservedEndTime: design, implement, test, and evaluate distributed reservation negotiation (2, 12) 3) Implement web service based API and necessary code to accommodate the new interaction between SRM and TeraPaths (11, 4) 4) Design and development of “local-null” (consultant) mode (12, 6) 5) Set up TeraPaths testbed between BNL and Michigan and integrate it with the end storage systems (SRM/BeStMan) (16, 2) 6) Functional, reliability, and performance tests under stand-alone mode and integrated mode (18, 4) 7) Plan for STAR between BNL and NERSC (12, 12) 8) User experience feedback and final project report (21, 3) Note: the numbers following tasks show: (start month, length of task in months)
BeStMan Tasks and Schedule 0) Planning and design of BeStMan new functionalities (0, 1) 1) Add to BeStMan persistent database to keep network reservation state (1, 6) 2) Provide network module to plug in TeraPaths (4, 2) 3) Develop module to register information in DB (6, 1) 4) Develop module to find out available bandwidth based on current commitment and policies (6, 2) 5) Develop server and client web-service APIs to communicate with user on reservation request and outcome (8, 4) 6) Develop server and client web-service APIs for source-BeStMan and target-BeStMan to communicate in both “pull” and “push” modes (12, 4) 7) Set up BeStMan-TP in testbed BNL-Michigan (16, 2) 8) Run basic tests on testbed (18, 2) 9) Run scalability tests on testbed (20, 2) 10) Extend BeStMan for multi-transfer coordination (15, 6) 11) Plan for STAR setup: BNL to NERSC (18, 6) Note: the numbers following tasks show: (start month, length of task in months)