OptIPuter System Software
Andrew A. Chien, Computer Science and Engineering, UCSD
January 2005 OptIPuter All-Hands Meeting
OptIPuter Software Architecture for Distributed Virtual Computers v1.1 (January 2003, OptIPuter All Hands Meeting)
[Layered architecture diagram, top to bottom:]
• OptIPuter Applications: Visualization; DVC #1, DVC #2, DVC #3
• DVC/Middleware: Higher Level Grid Services, Security Models, Data Services (DWTP), Real-Time Objects; Grid and Web Middleware (Globus/OGSA/Web Services/J2EE); Node Operating Systems
• High-Speed Transport: Layer 5 (SABUL, RBUDP, Fast, GTP); Layer 4 (XCP)
• Optical Signaling/Mgmt: λ-configuration, Net Management
• Physical Resources
OptIPuter Software Architecture
[Layered architecture diagram, top to bottom:]
• Distributed Applications/Web Services: Visualization, Telescience, SAGE, JuxtaView, Data Services, Vol-a-Tile, LambdaRAM
• DVC API / DVC Runtime Library: DVC Configuration, DVC Services, DVC Communication, DVC Job Scheduling; DVC Core Services (Resource Identify/Acquire, Namespace Management, Security Management, High Speed Communication, Storage Services)
• Grid and Storage Services: Globus XIO, GSI, RobuStore, PIN/PDC, GRAM
• Transport Protocols: GTP, XCP, UDT, CEP, LambdaStream, RBUDP
System Software/Middleware Progress
• Significant Progress in Key Areas!
• A Unified Vision of the Application Interface to the OptIPuter Middleware
  • Distributed Virtual Computer: Simpler Application Models, New Capabilities
  • 3-Layer Demonstration: JuxtaView/LambdaRAM Tiled Viz on DVC on Transports
• Efficient Transport Protocols to Exploit High Speed Optical Networks
  • RBUDP/LambdaStream, XCP, GTP, CEP, SABUL/UDT
  • Single Streams, Converging Streams, Composite Endpoint Flows
  • Unified Presentation under XIO (single application API)
• Performance Modeling
  • Characterization of Vol-a-Tile Performance on Small-Scale Configurations
• Real-Time
  • Definition of a Real-Time DVC; Components for Layered RT Resource Management (IRDRM, RCIM)
• Storage
  • Design and Initial Simulation Evaluation of LT-Code-Based Techniques for Distributed, Robust (low access variance, guaranteed bandwidth) Storage
• Security
  • Efficient Group Membership Protocols to Support Broadcast and Coordination across OptIPuters
Cross Team Integration and Demonstrations
• TeraBIT Juggling, 2-Layer Demo [SC2004, November 8-12, 2004]
  • Distributed Virtual Computer, OptIPuter Transport Protocols (GTP)
  • Move Data between OptIPuter Network Endpoints (UCSD, UIC, Pittsburgh)
  • Share Efficiently: Good Flow Behavior, Maximize Transfer Speeds (saturate all receivers)
  • Configuration: 10 Endpoints, 40+ Nodes, Thousands of Miles
  • Achieved 17.8 Gbps, a TeraBIT in Less Than One Minute!
• 3-Layer Demo [AHM2005, January 26-27, 2005]
  • Visualization, Distributed Virtual Computer, OptIPuter Transport Protocols
• 5-Layer Demo [iGrid, September 26-28, 2005 ??]
  • Biomedical/Geophysical Applications, Visualization, Distributed Virtual Computer, OptIPuter Transport Infrastructure, Optical Network Configuration
OptIPuter Software “Stack”
• Applications (Neuroscience, Geophysics)
• Visualization
• Distributed Virtual Computer (Coordinated Network and Resource Configuration)
• Novel Transport Protocols
• Optical Network Configuration
The 3-layer demo exercises Visualization, the DVC, and the Transport Protocols; the 5-layer demo exercises all five layers.
Year 3 Goals
• Integration and Demonstration of Capability
  • All Five Layers (Application, Visualization, DVC, Transport Protocols, Optical Network Control)
  • Across a Range of Testbeds
  • With Neuroscience and Geophysical Applications
• Distributed Virtual Computer
  • Integrate with Network Configuration (e.g. PIN)
  • Deploy as a Persistent OptIPuter Testbed Service
  • Alpha Release of DVC as a Library
• Efficient Transport Protocols
  • LambdaStream: Implement, Analyze Effectiveness, Integrate with XIO
  • GTP: Release and Demonstrate at Scale; Analytic Stability Modeling
  • CEP: Implement and Evaluate Dynamic N-to-M Communication
  • SABUL/UDT: Integrate with XIO; Flexible Prototyping Toolkit
  • Unified Presentation under XIO (single application API)
• Performance Modeling
  • Characterization of Vol-a-Tile and JuxtaView Performance on the Wide-Area OptIPuter
• Real-Time
  • Prototype RT DVC; Experiment with Remote Device Control within a Campus-Scale OptIPuter
• Storage
  • Prototype RobuSTore; Evaluate Using OptIPuter Testbeds and Applications
• Security
  • Develop and Evaluate High Speed / Low Latency Network-Layer Authentication and Encryption
10Gig WANs: Terabit Juggling
• SC2004: 17.8 Gbps, a TeraBIT in < 1 minute!
• SC2005: Juggle Terabytes in a Minute
[Network map: sites include UI at Chicago, StarLight Chicago, PNWGP Seattle, NIKHEF, NetherLight Amsterdam, U of Amsterdam (trans-Atlantic link), SC2004 Pittsburgh, CENIC Los Angeles, UCI, ISI/USC, CENIC San Diego, and UCSD/SDSC (SDSC, JSOE, CSE, SIO), connected mostly by 10 GE links, with 2 GE links to UCI and ISI/USC and 1 GE to SIO]
3-Layer Integrated Demonstration
• Visualization Application (JuxtaView + LambdaRAM)
• System SW Framework (Distributed Virtual Computer)
• System SW Transports (GTP, UDT, etc.)
Nut Taesombut, Venkat Vishwanath, Ryan Wu, Freek Dijkstra, David Lee, Aaron Chin, Lance Long
UCSD/CSAG, UIC, UvA, UCSD/NCMIR, etc.
January 2005, OptIPuter All Hands Meeting
3-Layer Demo Configuration
• JuxtaView at NCMIR; LambdaRAM Client at NCMIR; LambdaRAM Servers at EVL and UvA
• High Bandwidth (2.5 Gbps, ~7 streams), Long Latencies, Two Configurations
[Diagram: GTP flows to NCMIR/San Diego from EVL/Chicago over NLR/CAVEWAVE (10G, 70 msec) and from UvA/Amsterdam over a transatlantic link (4G, 100 msec); campus GE (10G, 0.5 msec) to SDSC/San Diego; output video streamed to audiences]
Distributed Virtual Computers
Nut Taesombut and Andrew Chien
University of California, San Diego
January 2005 OptIPuter All-Hands Meeting
Distributed Virtual Computer (DVC)
• Application Request: Grid Resources AND Network Connectivity
  • Redline-Style Specification, 1st-Order Constraint Language
• DVC Broker Establishes the DVC
  • Binds End Resources, Switching, Lambdas
  • Leverages Grid Protocols for Security, Resource Access
• DVC <-> Private Resource Environment, Surfaced through WSRF
Distributed Virtual Computer (DVC)
• Key Features
  • Single Distributed Resource Configuration Description and Binding
  • Simple Use of Optical Network Configuration and Grid Resource Binding
  • Single Interface to Diverse Communication Capabilities (Transport Protocols, Novel Communication Capabilities)
• Using a DVC (a usage sketch follows this list)
  • Application presents a Resource Specification, requesting Grid Resources and Lambda Connectivity
  • DVC Broker selects Resources and a Network Configuration
  • DVC Broker binds Resources, configures the Network, and returns a list of Bound Resources and their respective (newly created) IPs
  • Application uses these IPs to access the created Network Paths
  • Application selects Communication Protocols and Mechanisms among the Bound Resources
  • Application executes
  • Application releases the DVC
[Taesombut & Chien, UCSD]
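To make the request/bind/use/release cycle above concrete, here is a minimal, self-contained Python sketch. The DVCBroker class, its method names, and the stub behavior are invented for illustration; they are not the actual DVC runtime library API (the real broker works against Globus MDS, GRAM, and PIN).

```python
# Minimal sketch of the DVC cycle described above. DVCBroker is a stand-in
# stub, not the real DVC runtime library.

class DVCBroker:
    """Stand-in broker: accepts a constraint spec and 'binds' resources."""

    def create(self, spec):
        # A real broker would match the Redline spec against MDS information,
        # bind end resources via GRAM, and set up lambda paths via PIN.
        bound = {"viz": "192.168.85.12", "str1": "192.168.85.13"}  # illustrative IPs
        return DistributedVirtualComputer(bound)


class DistributedVirtualComputer:
    def __init__(self, resources):
        self.resources = resources  # name -> private IP created for this DVC

    def release(self):
        # A real implementation would tear down lambda paths and free resources.
        self.resources.clear()


# 1. Application presents a Redline-style resource specification.
spec = 'viz ISA [type == "vizcluster"]; str1 ISA [free-memory > 1700]'

# 2-3. Broker selects/binds resources and returns their newly created IPs.
dvc = DVCBroker().create(spec)
for name, ip in dvc.resources.items():
    print(f"{name} bound at {ip}")

# 4-6. The application would now select a transport (GTP, UDT, ...) over these
#      IPs, run, and finally release the DVC.
dvc.release()
```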
JuxtaView and LambdaRAM on DVC Example
(1) Application Requests a Viz Cluster, Storage Servers, and High-Bandwidth Connectivity from the DVC Manager

Physical Resources and Network Configuration (from Resource/Network Information Services, Globus MDS):
  viz1: ncmir.ucsd.sandiego
  str1: rembrandt0.uva.amsterdam
  str2: rembrandt1.uva.amsterdam
  str3: rembrandt2.uva.amsterdam
  str4: rembrandt6.uva.amsterdam
  (rembrandt0, yorda0.uic.chicago) --- BW 1, LambdaID 3
  (rembrandt1, yorda0.uic.chicago) --- BW 1, LambdaID 4
  (rembrandt2, yorda0.uic.chicago) --- BW 1, LambdaID 5
  (rembrandt6, yorda0.uic.chicago) --- BW 1, LambdaID 17

Application Requirements and Preferences (communication + end resources):
  [ viz ISA [type == "vizcluster"; InSet(special-device, "tiled display")];
    str1 ISA [free-memory > 1700; InSet(dataset, "rat-brain.rgba")];
    str2 ISA [free-memory > 1700; InSet(dataset, "rat-brain.rgba")];
    str3 ISA [free-memory > 1700; InSet(dataset, "rat-brain.rgba")];
    str4 ISA [free-memory > 1700; InSet(dataset, "rat-brain.rgba")];
    Link1 ISA [restype = "conn"; ep1 = <viz>; ep2 = <str1>; bandwidth > 940; latency <= 100];
    Link2 ISA [restype = "conn"; ep1 = <viz>; ep2 = <str2>; bandwidth > 940; latency <= 100];
    Link3 ISA [restype = "conn"; ep1 = <viz>; ep2 = <str3>; bandwidth > 940; latency <= 100];
    Link4 ISA [restype = "conn"; ep1 = <viz>; ep2 = <str4>; bandwidth > 940; latency <= 100] ]
JuxtaView and LambdaRAM on DVC Example
(2) DVC Manager Allocates End Resources and Communication
• Resource Binding (GRAM)
• Lambda Path Instantiation via the PIN Server (the current demo does not yet include this step)
• DVC IP Allocation (192.168.85.12 through 192.168.85.16 assigned to the nodes at NCMIR/San Diego and UvA/Amsterdam)
JuxtaView and LambdaRAM on DVC Example
(3) DVC Manager Creates Resource Groups
• Storage Group (storage servers at UvA/Amsterdam)
• Viz Group (viz cluster at NCMIR/San Diego)
JuxtaView and LambdaRAM on DVC Example
(4) Launch Applications
• Launch LambdaRAM Servers (Storage Group)
• Launch JuxtaView/LambdaRAM Clients (Viz Group)
OptIPuter Component Technologies
• Real-Time DVCs
• Application Performance Analysis
• High Speed Transports (CEP, LambdaStream, XCP, GTP, UDT)
• Storage
• Security
Vision: Real-Time, Tightly Coupled, Wide-Area Distributed Computing
A Real-Time Object network running over a dynamically formed Distributed Virtual Computer.
Goals
• High-Precision Timing of Critical Actions
• Tight Bounds on Response Times
• Ease of Programming (High-Level Programming, Top-Down Design)
• Ease of Timing Analysis
Source: Kim, UCI
Real-Time DVC Architecture (layers, top to bottom)
• Real-Time Application / Real-Time Object Network: the application expressed as real-time objects and links with various latency constraints
• TMO Real-Time Middleware: schedules and manages underlying resources to achieve the desired real-time behavior
• Distributed Virtual Computer: a collection of resources with known performance and security capabilities, plus control and management; provides simple resource and management abstractions and hides detailed resource management (e.g. network provisioning, machine reservation)
• High Speed Protocols / Network Management / Basic Resource Management: libraries that realize initial configuration and ongoing management; controls and manages "single" resources
Real-Time: From LAN to WAN
• RT grid (or subgrid) ::= a grid (or subgrid) facilitating
  • (RG1) message communications with easily determinable, tight latency bounds, and
  • (RG2) computing-node operations enabling easy guarantees of timely progress of threads toward computational milestones
• RG1 realized via
  • Dedicated optical-path WAN
  • Campus networks (the LAN part of the RT grid) equipped with Time-Triggered (TT) Ethernet switches (a new research task in collaboration with Hermann Kopetz)
Source: Kim, UCI
Real-Time DVC
• (RD1) Message paths with easily determinable, tight latency bounds.
• (RD2) In each computing or sensing-actuating site within the RT DVC, computing nodes must exhibit timing behaviors that differ from those of computing nodes in an isolated site by no more than a few percent. Computing nodes in an RT DVC must also enable easy procedures for assuring, with very high probability, that application processes and threads reach important milestones on time. => Computing nodes must be equipped with appropriate infrastructure software, i.e., an OS kernel and middleware with easily analyzable QoS.
• (RD3) If representative computing nodes of two RT DVCs are connected via RT message paths, then the ensemble consisting of the two DVCs and the RT message paths is also an RT DVC.
Source: Kim, UCI
Middleware for Real-Time DVC
• RGRM (RT grid resource management): on-demand creation of DVCs
• IRDRM (Intra-RT-DVC resource management): supports execution of applications (e.g. "Let us start a chorus at 2pm", "e-Science" data flows) via allocation of computing and communication resources within the DVC
• RCIM (RT communication infrastructure management): acquisition of λ's, allocation of virtual λ's, coordination of message-send timings
• Basic Infrastructure Services: IRDRM agent, RCIM agent, Globus system, λ-configuration, net management
Source: Kim, UCI
Progress
• RCIM (RT communication infrastructure management)
  • Study of TT Ethernet began with the help of Hermann Kopetz
  • The first unit is expected to become available to us by June 2005
• IRDRM (Intra-RT-DVC resource management)
  • TMO (Time-triggered Message-triggered Object) Support Middleware (TMOSM) adopted as a starting base
  • A significantly redesigned version (4.1) of TMOSM (for improved modularity, concurrency, and portability) has been developed; it runs on Linux, WinXP, and WinCE
  • An effort to extend TMOSM to fit the Jenks cluster began
• [TMO structure: the components of a C++ TMO object are time-triggered (TT) methods with AACs (autonomous activation conditions) and deadlines, plus service methods; no threads, no priorities; a high-level programming style (a conceptual sketch follows below)]
Source: Kim, UCI
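To illustrate the programming style the TMO model offers (time-triggered methods with activation periods and deadlines, and no application-visible threads or priorities), here is a minimal conceptual sketch in Python. It is only an illustration of the idea, not TMOSM or its C++ API; the class and method names are invented for this example.

```python
# Conceptual sketch of a time-triggered method with a deadline; not TMOSM.
import time

class TimeTriggeredMethod:
    """Stand-in for a TMO 'TT method': the middleware invokes it on a fixed
    schedule (a very simple AAC: "every period") and checks its deadline."""

    def __init__(self, func, period_s, deadline_s):
        self.func = func
        self.period_s = period_s      # activation period
        self.deadline_s = deadline_s  # completion deadline relative to activation

    def run_once(self):
        start = time.monotonic()
        self.func()
        elapsed = time.monotonic() - start
        if elapsed > self.deadline_s:
            print(f"deadline miss: {elapsed:.3f}s > {self.deadline_s:.3f}s")


def sample_sensor():
    # Application logic: no thread creation, no priority handling here;
    # the (conceptual) middleware owns all scheduling decisions.
    print("sampling at", time.monotonic())


# The middleware, not the application, drives execution on the global timeline.
tt = TimeTriggeredMethod(sample_sensor, period_s=0.5, deadline_s=0.1)
for _ in range(3):
    tt.run_once()
    time.sleep(tt.period_s)
```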
Progress (cont.)
• Programming model
  • An API wrapping the services of the RT middleware enables high-level RT programming (TMO) without a new compiler
  • The notion of a Distance-Aware (DA) TMO, an attractive building block for RT wide-area distributed computing applications, was created, and a study of its realization began
• Application development experiments
  • Fair and efficient distributed on-line game systems and a LAN-based feasibility demonstration
  • Application of the global-time-based coordination principle
  • A step toward an OptIPuter environment demonstration
• Publications
  • A paper on distributed on-line game systems in the IDPT2003 proceedings
  • A paper on distributed on-line game systems to appear in the ACM-Springer Journal on Multimedia Systems
  • A keynote paper on RT DVC in the AINA2004 proceedings
  • A paper on RT DVC middleware to appear in the WORDS2005 proceedings
Source: Kim, UCI
Year 3 Plan
• RCIM (RT communication infrastructure management)
  • Development of middleware support for TT Ethernet
  • The first TT Ethernet switch unit is expected to become available to us by June 2005
• IRDRM (Intra-RT-DVC resource management)
  • Extension of TMOSM to fit into clusters
  • Interfacing TMOSM to the Basic Infrastructure Services of OptIPuter
Source: Kim, UCI
Year 3 Plan (cont.)
• Application development experiments
  • An experiment in remote access and control within the UCI or UCSD campus
  • A step toward an experiment in remote access and control of electron microscopes at UCSD-NCMIR
Source: Kim, UCI
Performance Analysis and Monitoring of Vol-a-Tile
• Use the Prophesy system to instrument and study Vol-a-Tile on a 5-node system
• Evaluate the performance impact of configuration (data servers, clients, network)
Xingfu Wu <wuxf@cs.tamu.edu> [Wu & Taylor, TAMU]
Comparison of Vol-a-Tile Configuration Scenarios
Xingfu Wu <wuxf@cs.tamu.edu>
Year 3+ Plans
• Port the instrumented Vol-a-Tile to a large-scale OptIPuter testbed for analysis (3/2005)
• Analyze the performance of the JuxtaView and LambdaRAM applications (6/2005)
• Where possible, develop models of data accesses for the different visualization applications (9/2005)
• Continue collaborating with Jason's group on viz applications (12/2005)
Xingfu Wu <wuxf@cs.tamu.edu>
High Performance Transport Problem
• OptIPuter Is Bridging the Gap between High Speed Link Technologies and the Growing Demands of Advanced Applications
• Transport Protocols Are the Weak Link
• TCP Has Well-Documented Problems That Militate Against Its Achieving High Speeds
  • Slow Start Probing Algorithm
  • Congestion Avoidance Algorithm
  • Flow Control Algorithm
  • Operating System Considerations
  • Friendliness and Fairness among Multiple Connections
• These Problems Are the Foci of Much Ongoing Work
• OptIPuter Is Pursuing Four Complementary Avenues of Investigation
  • RBUDP Addresses Problems of Bulk Data Transfer
  • SABUL Addresses Problems of High Speed Reliable Communication
  • GTP Addresses Problems of Multiparty Communication
  • XCP Addresses Problems of General Purpose, Reliable Communication
OptIPuter Transport Protocols
• Composite Endpoint Protocol (CEP): efficient N-to-M communication, layered over the unicast protocols below
• Unicast over an end-to-end path / allocated lambda: RBUDP / LambdaStream
• Unicast over shared, routed paths with standard routers: SABUL / UDT
• Unicast over shared, routed paths with enhanced routers: XCP
• Managed group communication: GTP
Composite Endpoint Protocol (CEP)
Eric Weigle and Andrew A. Chien
Computer Science and Engineering, University of California, San Diego
OptIPuter All Hands Meeting, January 2005
Composite Endpoint Protocol (CEP)
• Network Transfers Faster than Individual Machines
  • A Terabit flow? A 100 Gbit flow? A 10 Gbps flow with 1 Gbps NICs?
• Clusters Are a Cost-Effective Means to Terminate Fast Transfers
• Support Flexible, Robust, General N-to-M Communication
• Manage Heterogeneity, Multiple Transfers, Data Accessibility
[Weigle & Chien, UCSD]
Example
• Move Data from a Heterogeneous Storage Cluster (N nodes)
  • Exploit heterogeneous network structure and dedicated lambdas
• Terminate in a Visualization Cluster (M nodes)
  • Render for a Tiled Display Wall
• The data flow is not easy for the application to handle
  • It may want to offload checksum/buffering work to nodes local to the storage cluster, or route around a contested link
Composite Endpoint Approach
• Transfers Move Distributed Data
  • Provides a hybrid memory/file namespace for any transfer request
  • Chooses a dynamic subset of nodes to transfer data
  • Performance management for heterogeneity and dynamic properties, integrated with fairness
• API and Scheduling (a usage sketch follows below)
  • API enables easy use
  • Scheduler handles performance, fairness, adaptation
  • Exploits many transport protocols
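As a rough illustration of what an easy-to-use composite-transfer request could look like, here is a minimal Python sketch. The Node class, plan_transfer function, node names, and capacities are invented for illustration and are not the actual CEP API; the static proportional split stands in for the real scheduler.

```python
# Hypothetical sketch of planning an N-to-M composite transfer; not the CEP API.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    gbps: float  # nominal NIC/link capacity

def plan_transfer(total_bytes, sources, sinks):
    """Statically split a transfer across source nodes in proportion to their
    capacity, assigning each chunk to a sink round-robin. A real scheduler
    would also adapt to dynamic node/link conditions and enforce fairness."""
    total_capacity = sum(s.gbps for s in sources)
    plan, offset = [], 0
    for i, src in enumerate(sources):
        share = round(total_bytes * src.gbps / total_capacity)
        if i == len(sources) - 1:          # give the remainder to the last source
            share = total_bytes - offset
        sink = sinks[i % len(sinks)]
        plan.append((src.name, sink.name, offset, share))
        offset += share
    return plan

sources = [Node("storage-0", 1.0), Node("storage-1", 1.0), Node("storage-2", 0.5)]
sinks = [Node("viz-0", 1.0), Node("viz-1", 1.0)]
for src, dst, off, length in plan_transfer(10 * 2**30, sources, sinks):
    print(f"{src} -> {dst}: bytes [{off}, {off + length})")
```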
CEP Efficiently Composes Heterogeneous and Homogeneous Cluster Nodes
• Seamless composition of performance across widely varying node performance
• High composition efficiency: demonstrated 32 Gbps from 1 Gbps nodes!
  • Efficiency increasing as the implementation improves
• Scaling suggests 1000-node composites => Terabit flows
• Next Steps: Wide Area, Dynamic Network Performance
Summary and Year 3 Plans
• Current Scheduling Mechanism Is Static
  • Selects nodes to move data
  • Handles static heterogeneity (node/link capabilities)
  • 32 Gbps in the LAN
• Simple API Specification
  • Ease of use; the scheduler takes care of the transfer
  • Allows scatter/gather with arbitrary constraints on data
• Plans: 1H2005
  • XIO implementation: use GTP, TCP, other transports
  • Tuned WAN performance
  • Dynamic transfer scheduling (adapt to network and node conditions)
• Plans: 2H2005
  • Security, code stabilization, optimization
  • Initial public release
  • 5-layer demo participation
  • Better dynamic scheduling
  • Decentralization
  • Fault tolerance
LambdaStream
Chaoyue Xiong, Eric He, Venkatram Vishwanath, Jason Leigh, Luc Renambot, Tadao Murata, Thomas A. DeFanti
January 2005 OptIPuter All Hands Meeting
LambdaStream (Xiong)
• Applications Need High Bandwidth with Low Jitter
• Idea
  • Combine loss-based and rate-based techniques
  • Predict the loss type and respond appropriately
  • => Good bandwidth and low jitter
Loss Type Prediction
• When a packet loss occurs, the average receiving interval is computed and used to distinguish among the loss types below (a classification sketch follows this list):
  • Continuous decrease in receiving capability
  • Occurrence of congestion in the link
  • Sudden decrease in receiving capability, or random loss
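As a hedged illustration of how such a classification might work, here is a small Python sketch that distinguishes the three cases from the trend of measured receiving intervals. The window, thresholds, and decision rules are invented assumptions for illustration, not the published LambdaStream algorithm.

```python
# Illustrative-only classifier for the three loss types listed above.
# Thresholds and decision rules are assumptions, not LambdaStream's actual logic.

def classify_loss(recv_intervals_us, send_interval_us, jump_factor=1.5):
    """recv_intervals_us: packet receiving intervals observed up to the loss.
    Returns one of the three loss types named on the slide."""
    if len(recv_intervals_us) < 4:
        return "random loss"  # not enough history to judge a trend

    half = len(recv_intervals_us) // 2
    early = sum(recv_intervals_us[:half]) / half
    late = sum(recv_intervals_us[half:]) / (len(recv_intervals_us) - half)

    if late > early * jump_factor and late > send_interval_us * jump_factor:
        # intervals grew steadily: the receiver keeps falling further behind
        return "continuous decrease in receiving capability"
    if late > send_interval_us * jump_factor:
        # receiving intervals exceed the sending interval: queueing in the path
        return "congestion in the link"
    return "sudden decrease in receiving capability or random loss"


print(classify_loss([100, 105, 160, 220], send_interval_us=100))  # continuous decrease
print(classify_loss([100, 102, 101, 99], send_interval_us=100))   # sudden decrease / random
```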
Avoiding Incipient Undesirable Situations (1)
• When there is no loss, a longer packet receiving interval indicates link congestion or lower receiving capability.
[Diagram: the sender emits packets w_i and w_{i+1} separated by sending interval Δt_s; after the bottleneck router, the receiver observes them separated by receiving interval Δt_r]
Avoiding Incipient Undesirable Situations (2)
• Metric: the ratio between the sending interval and the average receiving interval during one epoch (a sketch of this computation follows below)
• Methods to improve precision
  • Use a weighted addition of the receiving intervals from the previous three epochs
  • Exclude unusual samples
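A minimal Python sketch of this metric follows, assuming invented epoch weights and an invented outlier-exclusion rule; it is meant only to make the described computation concrete, not to reproduce LambdaStream's implementation.

```python
# Sketch of the metric above: sending interval / weighted average receiving
# interval. The weights and the outlier rule are assumptions for illustration.

def epoch_avg_interval(samples_us):
    """Average receiving interval for one epoch, excluding unusual samples
    (here: anything more than 3x the epoch median)."""
    ordered = sorted(samples_us)
    median = ordered[len(ordered) // 2]
    kept = [s for s in samples_us if s <= 3 * median]
    return sum(kept) / len(kept)

def congestion_ratio(send_interval_us, recent_epochs):
    """recent_epochs: receiving-interval samples for the previous three epochs,
    most recent last. Values well below 1.0 suggest intervals are stretching,
    i.e. the receiver or path is starting to fall behind."""
    weights = (0.2, 0.3, 0.5)  # assumed: favor the most recent epoch
    weighted = sum(w * epoch_avg_interval(e) for w, e in zip(weights, recent_epochs))
    return send_interval_us / weighted

epochs = [[100, 101, 99, 400], [105, 108, 110, 107], [120, 125, 130, 128]]
print(f"ratio = {congestion_ratio(100, epochs):.2f}")  # < 1.0: intervals growing
```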
Year 3 Plans
• Develop an XIO driver
• Experiment with multiple streams
• Integrate with TeraVision and SAGE
• Use formal modeling (Petri nets) to improve the scalability of the algorithm
Information Sciences Institute
Joe Bannister, Aaron Falk, Jim Pepin, Joe Touch
OptIPuter Project Progress, January 18, 2005
OptIPuter XCP Progress [Bannister, Falk, Pepin, Touch, ISI]
• Design of the Linux XCP port
  • Makes most sense for end systems only; there is little benefit in changing the OS for XCP routers
  • Strategy is to put XCP in the generic Linux 2.6 kernel, then port to Net100 (the Net100 tweaks/optimizations are largely orthogonal to XCP)
  • Technical challenges exist in extending the Linux kernel to handle the 64-bit arithmetic needed for XCP
  • The Linux port is pending conclusion of ongoing design work to eliminate line-rate divide operations from the router (see the sketch below)
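The "eliminate line-rate divides" point reflects a general implementation concern: per-packet division is expensive in a forwarding path. As a generic illustration (not the ISI XCP router design, and with invented names and numbers), the Python sketch below shows the common trick of precomputing a fixed-point reciprocal once per control interval so that each packet needs only a multiply and a shift.

```python
# Generic illustration of avoiding a per-packet divide: precompute a fixed-point
# reciprocal once per control interval, then use multiply + shift per packet.
# This is not the ISI XCP design; the variable names and values are illustrative.

SHIFT = 32

def reciprocal_fixed(divisor):
    """Precompute floor(2**SHIFT / divisor); one divide per control interval."""
    return (1 << SHIFT) // divisor

def scaled_share(numerator, recip):
    """Approximate numerator / divisor using only a multiply and a shift."""
    return (numerator * recip) >> SHIFT

total_bytes = 1_500_000                 # bytes seen this control interval (example)
recip = reciprocal_fixed(total_bytes)   # computed once, off the per-packet path

exact = (120_000 * 1500) // total_bytes           # per-packet divide (slow at line rate)
approx = scaled_share(120_000 * 1500, recip)      # multiply + shift instead
print(exact, approx)                              # the two values agree closely
```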
OptIPuter XCP Activities
Workshops
• Aaron Falk, Ted Faber, Eric Coe, Aman Kapoor, and Bob Braden. Experimental Measurements of the eXplicit Control Protocol. Second Annual Workshop on Protocols for Fast Long-Distance Networks, February 16, 2004. http://www.isi.edu/isi-xcp/docs/falk-pfld04-slides-2-16-04.pdf
• Aaron Falk. User Application Requirements, Including End-to-End Issues. NASA Optical Network Testbeds Workshop, NASA Ames Research Center, August 9-11, 2004. http://duster.nren.nasa.gov/workshop7/report.html
Papers
• Aaron Falk and Dina Katabi. Specification for the Explicit Control Protocol (XCP), draft-falk-xcp-00.txt (work in progress), October 2004. http://www.isi.edu/isi-xcp/docs/draft-falk-xcp-spec-00.txt
• Aman Kapoor, Aaron Falk, Ted Faber, and Yuri Pryadkin. Achieving Faster Access to Satellite Link Bandwidth. Submitted to Global Internet 2005, December 2004. http://www.isi.edu/isi-xcp/docs/kapoor-pep-gi2005.pdf