
Recent Developments of the Ninf Global Computing System Satoshi Matsuoka(TIT) Satoshi Sekiguchi(ETL)

This article discusses recent developments in the Ninf Global Computing System, including the establishment of a testbed for the Grid and collaborative efforts in Japan and internationally. It also highlights Ninf's modeling and simulation work, aimed at putting global computing on a sounder scientific and engineering footing.




  1. Recent Developments of the Ninf Global Computing System • Satoshi Matsuoka (TIT), Satoshi Sekiguchi (ETL) • http://ninf.etl.go.jp

  2. Today’s Talk • Global Computing Testbed Infrastructure (GCI) effort in Japan • APAN/TransPAC, Int’l Collaboration • Brief Intro to the Ninf System • Recent Developments • Modeling and Simulating Global Computing • Better Scientific & Engineering Discipline

  3. Global Computing Infrastructure • Collaborative Effort Starting in Japan to Establish Testbed for the Grid • Participants • ETL, Waseda-U, RWCP, Osaka-U, Tokyo Institute of Technology, etc. • Installation/Test Deployment of Multiple Grid Software and Apps • AppLeS/NWS, Condor, Globus, Legion, Netsolve, Ninf, Applications etc. planned • International Collab. thru APAN

  4. APAN/TransPAC • Research-dedicated network within Asia and between Asia and North America • Being launched Sep. 4 • Perpetual link between Asian participants and vBNS sites • Need Grid software?! (both apps & systems) • (map: Asia-Pacific region to North America, 35 Mbps TransPAC link from KDD Tokyo to Chicago StarTAP)

  5. APAN Participants from Japan • Agriculture, Forestry and Fisheries Research Council • Agency of Industrial Science and Technology (AIST) • Communication Research Laboratories • Electrotechnical Laboratory (ETL, w/ RWC and Tokyo Inst. Tech.) • Institute of Space and Astronautical Science (ISAS) • KDD R&D Laboratories • KEK (High Energy Accelerator Research Organization) • Medical Internet Exchange Association (MDX Association) • NASDA (National Space Development Agency of Japan) • National Cancer Center (NCC) • National Institute of Genetics (NIG) • NTT Laboratories (NTT Labs) • RIKEN (The Institute of Physical and Chemical Research) • University of Tokyo • Waseda University • Keio University et al. (WIDE)

  6. Ninf Component Architecture • (diagram) A Ninf client program uses the Ninf Client Library (e.g., Ninf_call("linpack", ...)) to issue Ninf RPCs over the Internet • Ninf Computational Servers run Ninf Executables through stub programs generated from IDL files by the Ninf Stub Generator • Ninf Register, Ninf DB Server, and Meta Servers handle registration and server resolution • Adapters bridge to other global computing systems, e.g., NetSolve

  7. Brief History of Ninf • The first draft paper (Jun.’94) • A naive implementation (Sep.’94) w/PVM • Paper POOMA’95 at Santa Fe (Mar.’95) • Cray J90 installed as Ninf server Sep.’95 • The Metaserver introduced Feb.’96 • The First package released Jun.’96 • Ninf/Netsolve Collaboration, Fall ’97 • Extensive Tools Development Early ’97~

  8. Architectural Layers of Ninf • (diagram, top to bottom) Applications: NinfCalc+, ExcelNinf, Mathematica, ..., numerical scientific computing programs, mathematical libraries • Programming tools: Ninf Client API (F77, C, Java, ...) • Services: Ninf MetaServer (resource manager), Ninf Computation Server, Ninf DB, NetSolve Server via the NetSolve Adapter • Protocols: Ninf protocol service, FTP, HTTP over TCP/IP • Hardware: Gigabit net, LAN, WAN

  9. Resource Scheduling for the “Grid” • Max “performance” under dynamic, heterogeneous env. • Computing server performance/load • Network topology/bandwidth/congestion • Multiple users at multiple sites: HPC vs. high-throughput ⇒ scheduling for the “Grid” • Existing scheduling systems: MetaServer (Ninf), Agent (NetSolve), AppLeS/NWS, Prophet; no frameworks to judge their effectiveness • Difficult to perform large-scale experiments • Benchmarks → fair reproducibility also difficult

  10. Objective of the Model • Simulation Model and Simulators for HPDS (the “Grid”) • Modeling Various Grid Environments • Large-scale Simulation • Reproducibility • Contents • Overview of the Simulation Model • Simulator and Validity of the Model • Application: Evaluation of Different Schedulers

  11. General Architecture of an HPDC System (Computing) • Clients • Computing Servers • Scheduling System • Schedulers (e.g., AppLeS, Prophet) • Performs Scheduling According to System/User Policy • Directory Service (e.g., Globus MDS) • Central Database of Resource Info • Monitors/Predictors (e.g., NWS) • Monitors and Predicts Server and Network Status

  12. Canonical Model of Typical Grid Execution • Steps: request generation → query the scheduler → assign an appropriate server → execute the request → return the computed result • The scheduling unit (directory service, monitor, scheduler) periodically monitors servers and the network • (diagram: Site 1 with Client A and Server A, Site 2 with Servers B/C and Clients B/C, connected via the Internet)

  13. The Model for Grid Simulation • Requirements for a simulation model • Various clients, servers, and network topologies • Servers: performance, load, variance over time • Network: bandwidth, throughput (congestion), variance over time → employ queuing theory • Characteristics of our model • Simulation of large-scale execution environments • Reproducible, fair evaluation of algorithms

  14. Simulating the Grid with a Queuing Model: Overview • (diagram) Each client-server path is modeled by three kinds of queues: Qns1-Qns4 (client-to-server network), Qs1-Qs3 (Servers A-C), and Qnr1-Qnr4 (server-to-client network), connecting Clients A-C at Sites 1 and 2 to the servers

  15. Simulating the Grid with a Queuing Model (2) • Arrival rate of data into Qns: λns = λns_request + λns_others (λns_request: request packets, λns_others: external perturbation) • Arrival rate of jobs into Qs: λs = λs_request + λs_others (λs_request: job requests, λs_others: external perturbation) • (diagram: Client → Qns (network) → Qs (server), with the λns and λs arrival streams)

  16. Processing at the Client • Emits a request w/ probability λrequest • Queries the scheduler for a server, providing info on the request (computing steps, amount of data transfer); the scheduler assigns an appropriate server • Emits the request to the assigned server, dividing the data into logical packets emitted into Qns • Once the server completes processing the request, the client receives the result data from Qnr
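The packetization step above (dividing a request's data into fixed-size logical packets bound for the network queue) might look like the following minimal sketch; the function name and KB units are illustrative, not from the actual simulator:

```python
def to_logical_packets(data_kb, w_packet_kb=10):
    """Split a request's data into logical packets of size Wpacket;
    the last packet may be partial (integer KB units keep this exact)."""
    full, rest = divmod(data_kb, w_packet_kb)
    packets = [w_packet_kb] * full
    if rest:
        packets.append(rest)
    return packets

# 1000 KB of request data in 10 KB logical packets
print(len(to_logical_packets(1000, 10)))
```

Each resulting packet then becomes one arrival event at Qns in the model.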

  17. Client Parameters • The probability of emitting a logical packet into Qns: λpacket = Tnet / Wpacket (Tnet: network bandwidth, Wpacket: logical packet size) • Example: Tnet = 1.0 [MB/s], Wpacket = 0.01 [MB] → λpacket = 1.0/0.01 = 100
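The packet-emission rate above is simply bandwidth divided by logical-packet size; a quick check of the slide's example (the function name is illustrative):

```python
def packet_rate(t_net_mb_s, w_packet_mb):
    """Rate of emitting logical packets into Qns, per the client-parameter
    slide: Tnet / Wpacket."""
    return t_net_mb_s / w_packet_mb

# Slide example: Tnet = 1.0 MB/s, Wpacket = 0.01 MB
print(packet_rate(1.0, 0.01))
```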

  18. Processing at the Network • Describe comm. throughput w/ λns_others ⇒ employ the M/M/1/N queue for Qns • Arrivals at Qns: both request data packets and packets of external perturbation • When the queue is full, the request data packet is retransmitted • Each data packet entering Qns is processed for [data size / bandwidth] time, then leaves Qns

  19. Network Throughput • Arrival rate of external perturbation determines network throughput: λpacket / (λns_others + λpacket) = Tact / Tnet ⇒ λns_others = (Tnet / Tact − 1) · λpacket • Length N of Qns determines latency: Wpacket · N / Tnet ≤ Tlatency ⇒ N ≤ Tlatency · Tnet / Wpacket (note: N ≥ 2)

  20. Examples of Network Parameters • Under Tnet = 1.0 [MB/s] and Wpacket = 0.01 [MB] (λpacket = 100), to simulate network throughput Tact = 0.1 [MB/s] and Tlatency = 0.1: • Arrival rate of external perturbation: λns_others = (Tnet/Tact − 1) · λpacket = (1.0/0.1 − 1) · 100 = 900 • Queue length: N ≤ Tlatency · Tnet / Wpacket = 0.1 · 1.0 / 0.01 = 10
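The two derivations above can be reproduced directly from the slide's formulas; a small sketch with illustrative function names:

```python
def perturbation_rate(t_net, t_act, packet_rate):
    """External-perturbation arrival rate into Qns that yields the
    target actual throughput Tact: (Tnet/Tact - 1) * packet rate."""
    return (t_net / t_act - 1) * packet_rate

def queue_length_bound(t_latency, t_net, w_packet):
    """Upper bound on the Qns length N realizing a target latency:
    N <= Tlatency * Tnet / Wpacket."""
    return t_latency * t_net / w_packet

# Slide example: Tnet=1.0 MB/s, Tact=0.1 MB/s, rate=100, Tlatency=0.1 s
print(perturbation_rate(1.0, 0.1, 100))    # ≈ 900
print(queue_length_bound(0.1, 1.0, 0.01))  # ≈ 10
```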

  21. Processing at the Server • Describe the response time of job execution → employ the M/M/1 queue for Qs • Receiving requests: all data packets come out of Qns and go into Qs • Each job on Qs is processed for [compute amount / server performance] time • Returning the result to the client: divides the return data into logical packets and emits them into Qnr w/ probability λpacket

  22. Server Parameters • Arrival rate of external perturbation jobs (EPJs) determines server utilization: λs_others = Tser / Ws_others · U (Tser: server performance, Ws_others: av. computing steps of an EPJ, U: server utilization) • Example: under Tser = 100 [Mflops] and Ws_others = 0.01 [Mflop], to simulate U = 0.1: λs_others = 100 / 0.01 · 0.1 = 1000
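The server-side parameter follows the same pattern as the network one; checking the slide's arithmetic (illustrative function name):

```python
def server_perturbation_rate(t_ser, w_s_others, utilization):
    """External-perturbation job arrival rate into Qs that yields the
    target server utilization U: (Tser / Ws_others) * U."""
    return t_ser / w_s_others * utilization

# Slide example: Tser = 100 Mflops, Ws_others = 0.01 Mflop, U = 0.1
print(server_perturbation_rate(100, 0.01, 0.1))  # ≈ 1000
```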

  23. Characteristics of our Simulator • OO design - Simulation Env. is pluggable • Client, Server, and Network Topologies • Scheduling Models • Processing at the Network/Server • Randomness Distribution (Poisson, etc.) (Employ Abstract Factory Pattern) • Each object can have independent (pseudo-)random number sequence • Implemented w/ Java • Parallel Simulator planned
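The pluggable design described above (an Abstract Factory over randomness distributions, with each object holding its own pseudo-random stream) might be sketched as follows; the class names are illustrative, not taken from the actual Java simulator:

```python
import random
from abc import ABC, abstractmethod

class DistributionFactory(ABC):
    """Abstract Factory over randomness distributions; each simulation
    object owns an independent, reproducibly seeded pseudo-random stream."""
    @abstractmethod
    def interarrival(self, rate):
        ...

class PoissonFactory(DistributionFactory):
    """Poisson arrivals: exponentially distributed inter-arrival times."""
    def __init__(self, seed):
        self.rng = random.Random(seed)  # per-object stream -> reproducible runs
    def interarrival(self, rate):
        return self.rng.expovariate(rate)

# Two clients with independent yet reproducible random streams
client_a = PoissonFactory(seed=1)
client_b = PoissonFactory(seed=2)
print(client_a.interarrival(100.0), client_b.interarrival(100.0))
```

Per-object streams are what make runs reproducible regardless of how many other objects draw random numbers, which supports the fair-evaluation goal stated earlier.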

  24. Evaluating Validity of the Model • Comparing simulation to actual measurement on the Ninf system • Linpack: compute 2/3·n³ + 2n² [flops], comm. 8n² + 20n [bytes] • Evaluation environment: #servers : #clients = 1 : 1 and 1 : 4 • Server: ETL [Cray J90, 4 PE] • Clients (Internet bandwidth/latency to the server): Ocha-U [SS10] (0.16 MB/s, 32 ms), U-Tokyo [Ultra1] (0.35 MB/s, 20 ms), NITech [Ultra2] (0.15 MB/s, 41 ms), TITech [Ultra1] (0.036 MB/s, 18 ms)
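The Linpack cost model quoted above is easy to evaluate for any problem size; a small sketch (function names are illustrative):

```python
def linpack_flops(n):
    """Compute cost of Linpack from the slide: 2/3 n^3 + 2 n^2 [flops]."""
    return (2 / 3) * n**3 + 2 * n**2

def linpack_bytes(n):
    """Communication volume from the slide: 8 n^2 + 20 n [bytes]."""
    return 8 * n**2 + 20 * n

# e.g., the n = 600 problem size used in the scheduling evaluation
print(linpack_flops(600), linpack_bytes(600))
```

Feeding these two numbers into the model gives the per-request compute amount and data-transfer volume for each client request.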

  25. Evaluation Model Parameters • Client: λrequest = 1 / [request time + interval]; Wpacket = 10, 50, 100 [KB] (fixed) • Network (FCFS): bandwidth Tnet = 1.5 [MB/s]; ext. perturbation: av. size = Wpacket (exp. dist.); λns_others, λnr_others: Poisson arrival • Server (FCFS), from actual measurements: performance Tser = 500 [Mflops] (Cray J90); EPJ: av. compute steps = 10 [Mflop] (exp. dist.), utilization 4 [%], Poisson arrival

  26. Evaluation of the Validity of the Simulation Model (1 : 1) • Almost the same performance as actual measurements for different packet sizes ⇒ simulation cost could be reduced • Matches actual measurements for different problem sizes as well

  27. Evaluation of the Validity of the Simulation Model (1 : 4) • Different throughputs still result in a close match to real measurements

  28. Application: Evaluating Scheduling Algorithms with Simulation • (diagram: Server A 400 Mops / 1.08 MB/s, Server B 40 Mops / 0.2 MB/s, Clients 1-4) • 3 basic scheduling algorithms: • RR: round-robin • LOAD: comp. power + load: min (L + 1) / P (L: av. load, P: server perf.) • LOTH: comp. power + load + comm. throughput: min Comp / (P / (L + 1)) + Comm / Tnet • Evaluation with Linpack / EP under a heterogeneous environment • Simple prediction in addition to LOTH
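The three policies above can be paraphrased as selection rules over a snapshot of server state. A toy sketch using the slide's two-server setup (the `load` values and dictionary layout are illustrative assumptions, not measured data):

```python
# Snapshot of the slide's heterogeneous servers (Mops, MB/s); loads assumed.
servers = [
    {"name": "A", "perf": 400.0, "load": 2.0, "thruput": 1.08},
    {"name": "B", "perf": 40.0,  "load": 0.0, "thruput": 0.2},
]

def rr(servers, i):
    """RR: round-robin, ignoring all resource information."""
    return servers[i % len(servers)]

def load_policy(servers):
    """LOAD: minimize (L + 1) / P, i.e., load-adjusted compute power."""
    return min(servers, key=lambda s: (s["load"] + 1) / s["perf"])

def loth(servers, comp, comm):
    """LOTH: minimize Comp / (P / (L + 1)) + Comm / Tnet,
    adding the communication term to LOAD."""
    return min(servers,
               key=lambda s: comp / (s["perf"] / (s["load"] + 1))
                             + comm / s["thruput"])

print(load_policy(servers)["name"])
print(loth(servers, comp=100.0, comm=50.0)["name"])
```

With these snapshot values both informed policies pick the fast server, while RR alternates blindly; the point of the simulation study is how these choices play out under dynamic load.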

  29. Parameters for Evaluating Scheduling Algorithms • Clients: problem sizes Linpack 600, EP 2^21; λrequest = 1 / (worst request time + interval) (interval: Linpack 5 [sec], EP 20 [sec]), Poisson; logical packets Wpacket = 100 [KB] (fixed), Poisson • Networks (FCFS): Tnet = 1.5 [MB/s]; ext. perturbation data: av. size = Wpacket (exp. dist.); λns_others, λnr_others are Poisson arrivals • Server (FCFS): EPJ av. computing = 10 [Mflop] (exp. dist.), utilization 10 [%], Poisson arrival

  30. Results of Evaluating Scheduling Algorithms • LOTH gives the best result: resource info is best utilized • Prediction did not work as well as expected • LOAD performs poorly on Linpack: a network bottleneck causes (falsely) low apparent server utilization [SC97] • RR performs worst

  31. Weaknesses of the Model and Simulator • Does not model • Inter-server communication • Interference between networks • Does not distinguish application-level vs. job scheduling • Need to model co-scheduling • Simulator not very fast → parallelization

  32. The Parallel Grid Simulators • Performance simulator: 33× Pentium II 400 MHz, 128 MB; 100Base-T (hub + switch); Windows NT 4.0 + Java (Ninflet) • Network simulator: 33-node Pentium II 333 MHz, 256 MB / 12 GB; switched 100Base-T; Linux / RWC SCore; 150 cm × 135 cm × 40 cm (MicroATX)

  33. Recent Ninf Developments (1) • Metaserver resource management architecture • Java-based • Could plug in predictors (e.g., NWS) and directories (MDS) • Ninf v2 development • New protocol and IDL • Numerical RPC protocol w/ NetSolve • Integration w/ CORBA • Utilize Globus DUROC in metaserver load management? • Security • Local security (priv. ports) & global security (SSL) • Preliminary performance

  34. Recent Ninf Developments (2) • Matrix Workshop • Collab. w/ Matrix Market • Automatically fetch generated matrices from Matrix Market and Matrix Workshop • MPI backend • Runs on Wiz and the RWC cluster • Automatic data distribution planned • Other backends, e.g., Condor? • Demo at SC98 • Fluid dynamics • Run on the RWC cluster and J90/cluster in Japan

  35. Summary • Proposed a simulation model and a simulator for HPDC/Grid environments • Obtained results nearly equal to real measurements, validating the effectiveness of the model • Evaluated basic scheduling algorithms using the model; results were quantitatively similar to observations from real measurements

  36. Issues • Collaboration: Bridging the gap between Grid systems • Ninf-Netsolve experience • Various Grid collaborations happening • Agreement of Interfaces • Network Protocols • IDLs and other description • Library APIs • Data Formats

  37. Standardization Due? • Interfaces are important • e.g., PVM vs. MPI • Success stories • Various Network Protocols • Programming Languages: C++, Java, etc. • SQL, HTML, etc. • Drawbacks • Standardizing “too early”

  38. Related Issue - Coping with “Industrial Standards” • CORBA example • Easy to say “CORBA is not appropriate for the Grid” • Is this true? What is exactly missing? • Need real technical underpinnings • Most Grid systems support the ORB of CORBA as a transport • Higher-level services - CORBAServices

  39. Future Work • Improve the validity of the simulation model • Account for fluctuations in real networks • Support a wider variety of job-processing schemes at compute servers • e.g., round-robin • Reduce simulation cost • Evaluate other scheduling methods for high-performance wide-area computing systems • Propose more appropriate scheduling methods
