This report presents the findings of the pilot test for the "Cache in and through the Cloud" action, assessing the feasibility and effectiveness of a cloud-based approach for data exchange. The report includes technical outcomes, questionnaire results, financial aspects, and GISC agreement.
Report on the pilot test for the “Cache in and through the Cloud” action
Submitted by: Rémy Giraud (France)
(Doc 14) Agenda item 3.2.1
ET-CTS 2016, ASECNA, Dakar, Senegal
Date: 05-08 April 2016
WMO. ET-CTS 2016
A short history
• For GISC-to-GISC communication, and in particular for the exchange of “GlobalExchange” data, the current architecture is based on an any-to-any, unicast-based solution
• This solution is:
  • Politically very challenging (some bilateral links are virtually impossible to set up)
  • Technically difficult to establish, monitor and maintain
  • Financially unattractive, as the same data is sent multiple times over an expensive network
• In 2014, at TT-GISC and ET-CTS, a solution using a cloud-based approach was presented
• It was then agreed to run a pilot in 2015 to assess whether this option was viable
• The suitability of this solution will be established based on:
  • The technical outcome of the pilot
  • The result of a questionnaire (it appears that some organisations forbid the storage of their data in the cloud)
  • The financial and contractual aspects
  • The agreement of the GISCs to proceed with such a solution
The should and the may
• What we should have now…
• What we may have tomorrow…
The pilot
• The pilot has been managed by ET-CTS (R. Giraud) with the support of H. Kiehl (DWD)
• From a technical point of view, we have been using AFD (Automatic File Distributor). The dataflow is presented later.
• Only the Internet was used
• The pilot was limited to data exchange and did not cover metadata
• The timeline of the pilot:
  • 2015
    • May:
      • ET-CTS conference call to present and discuss the plan
      • Based on the ET-CTS outcome, call to TT-GISC for volunteers
      • ECMWF, as part of its own evaluation of a cloud-based approach for its dissemination, agreed to “loan” two VMs for up to a year starting in May 2015
    • June: Installation of the system and configuration of AFD
    • July: Interested GISCs invited to join the pilot
    • October:
      • Each GISC was able to join on a piecemeal basis
      • All GISCs in the pilot were on board
    • November: Issue of the questionnaire
  • 2016
    • February: Draft reports for both the pilot and the questionnaire
The VDC by Interoute
• ECMWF is paying for the cloud servers for one year. Two virtual servers are available (one in Paris, one in Berlin). The Paris server has been used to run AFD; the Berlin server has been used to emulate a GISC
• As part of the configuration:
  • Firewall: very limited. There is no support for “established” TCP connections, so a large range of ports must be opened to allow FTP
  • NAT
  • Load balancing (not used)
• Very good online support
• Very good network access performance. In theory, unlimited 10 Gb/s access to the Internet (and, if needed, to the RMDCN) for free!
• For the time being, this solution is very cost-effective (the cost per VM is approximately 4 k€ per year) and proves to be an easy way to “multicast” from any GISC to all the others (while using a “unicast” protocol)
The two tests
• Test 1 is a pilot implementation of what could be the future service.
  • Each GISC involved sends “operationally” what it exchanges with the other GISCs. The server in the cloud is seen as an additional destination
  • The cloud server then resends the traffic to the destination GISCs, after potentially adapting the flow to the destination server (e.g. splitting grouped bulletins into individual files)
  • This pilot is very close to what could be the future configuration. It is, however, very difficult to assess the efficiency and the reliability of the solution, as it also depends on the source and destination servers
• Test 2 aims at removing the potential variability due to the source and destination GISCs. In this separate test, we have been using real data but on test systems and not the “real” GISCs
  • One day of traffic sent by 6 GISCs has been captured
  • The traffic is then replayed in various configurations (FTP, SFTP, packing, network delay, …)
The dataflow of Test 1
The file system of the VM in the cloud:
/data
  /GISC A
    /incoming
    /outgoing
    /24h cache
  /GISC B
    /incoming
    /outgoing
    /24h cache
  /GISC I
    /incoming
    /outgoing
    /24h cache
• GISC A uploads files into its /incoming; GISC B uploads files into its /incoming
• AFD copies the files into the /outgoing of the other GISCs and into the cache
• AFD pushes the files to the other GISCs
• The files in /incoming are deleted after processing.
• The files in /outgoing are available if a GISC wants to download data again.
• The files in the cache are kept as a reference of the 24h cache. A CRON job clears files older than 24h
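The slide only states that a CRON job clears cache files older than 24h; the actual command is not given. As a minimal sketch of that retention rule, assuming the file modification time is the age criterion (an assumption) and using hypothetical function names:

```python
import time
from pathlib import Path
from typing import List, Optional

CACHE_MAX_AGE_S = 24 * 3600  # the 24h retention stated on the slide


def is_expired(mtime: float, now: float, max_age_s: int = CACHE_MAX_AGE_S) -> bool:
    """True when a file last modified at `mtime` is older than the retention window."""
    return now - mtime > max_age_s


def purge_cache(cache_dir: Path, now: Optional[float] = None) -> List[Path]:
    """Delete regular files under `cache_dir` older than 24h; return the removed paths.

    Assumption: age is measured from the modification time (st_mtime).
    """
    now = time.time() if now is None else now
    removed = []
    # Materialise the listing first so deletion does not disturb the walk.
    for path in sorted(cache_dir.rglob("*")):
        if path.is_file() and is_expired(path.stat().st_mtime, now):
            path.unlink()
            removed.append(path)
    return removed
```

Such a function could be run periodically (for instance from cron) against each “24h cache” directory shown in the diagram; the scheduling interval is left open here because the slides do not state it.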
The status of Test 1
• Thirteen GISCs are pushing their data to the cloud server (Moscow, Brasilia, Offenbach, Beijing, Toulouse, Exeter, Melbourne, Tokyo, Seoul, Jeddah*, Tehran*, Washington*, Pretoria*)
• The protocol used is either FTP or SFTP
• Some are sending GFNC files (files like A_*.txt) and others are sending CCCC files.
• In all cases, bulletins are stored as individual files in the cache (CCCC files are unpacked on the cloud server)
• We have started experimenting with grouping files in .tar.bz2 (it halved the overall size to be transferred and divided the number of files by three)
• Test 1 is still ongoing
  • It gives the GISCs an idea of how such a service can be used operationally
  • It is, however, difficult to draw tangible conclusions on the efficiency of the service
(*) These GISCs are sending their data via a third party (Offenbach and Exeter)
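The slides do not describe how the .tar.bz2 grouping was implemented. The following is only a sketch of the idea using Python's standard tarfile module, with hypothetical function names and an in-memory archive so that it stays self-contained:

```python
import io
import tarfile
from typing import Dict


def pack_bulletins(bulletins: Dict[str, bytes]) -> bytes:
    """Group individual bulletin files into one bzip2-compressed tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for name, payload in bulletins.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def unpack_bulletins(archive: bytes) -> Dict[str, bytes]:
    """Restore the individual bulletin files from a .tar.bz2 archive."""
    restored = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:bz2") as tar:
        for member in tar.getmembers():
            if member.isfile():
                restored[member.name] = tar.extractfile(member).read()
    return restored
```

The size and file-count reductions reported on the slide depend on the real bulletin content, so this sketch makes no claim about compression ratios.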
Methodology for Test 2 (1)
• In order to gather statistics and eliminate as much as possible the unknown variability of the source and destination GISCs, the GISC Source and GISC Destination are one server at DWD and one VM in the cloud
• Traffic sent by 6 different GISCs to the cache in the cloud was captured for 24 hours on November 6th. The selected GISCs are:
  • Offenbach
  • Tokyo
  • Melbourne
  • Brasilia
  • Toulouse
  • Seoul
• In addition, 50 dummy bulletins marked as “urgent” are sent from GISC Source to GISC Destination without any packing/unpacking on AFD
• Considering the current 15 GISCs and their areas of responsibility, this represents a fairly reasonable sample of GISC AMDCN data.
Methodology for Test 2 (2)
• During the test, the AFD instance continued to switch the files for all the real GISCs involved in Test 1
• Traffic was “re-sent” from GISC Source to AFD and then pushed to GISC Destination
• A maximum of 40 FTP/SFTP sessions was allowed between the various hosts
• 2 sessions (using either FTP or SFTP) were dedicated to “urgent” bulletins
• For each configuration the test took one week:
  • 1 day to prepare the configuration
  • 1 day per sample GISC
• The traffic pattern follows exactly what the sample GISC sent (same distribution over 24 hours)
• In order to assess various scenarios, the following tests have been done:
  • Compare FTP and SFTP performance, delays and reliability
  • Compare packing and unpacking options on the cache server
  • Introduce network delay and packet loss to assess performance with “remote” GISCs (using the Linux tools tc and netem)
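The replay tooling itself is not described in the slides. As an illustration of what “same distribution over 24 hours” implies, the sketch below turns captured arrival times into a replay schedule that preserves the original spacing between files. The record layout (timestamp in seconds, file name) and the function names are assumptions made for this example only.

```python
from typing import List, Tuple

Record = Tuple[float, str]  # (capture timestamp in seconds, file name)


def replay_offsets(captured: List[Record]) -> List[Record]:
    """Convert capture timestamps into offsets from the first captured file."""
    if not captured:
        return []
    ordered = sorted(captured)
    start = ordered[0][0]
    return [(ts - start, name) for ts, name in ordered]


def replay_schedule(captured: List[Record], replay_start: float) -> List[Record]:
    """Shift the captured pattern so that it begins at `replay_start`."""
    return [(replay_start + offset, name) for offset, name in replay_offsets(captured)]
```

With this approach, a file captured 60 seconds after the first one is re-sent 60 seconds after the replay begins, whatever configuration (FTP, SFTP, packing, network degradation) is under test.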
Architecture
The global planning for Test 2
The network “degradation”
• As presented before, the 3 servers used during Test 2 are all in Europe
• The target architecture is not yet known. However, it is likely that some GISCs will be far away from the cloud servers and that the Internet path between a GISC and the cloud servers will potentially drop some packets.
• To include the physical distance and packet loss in our pilot network, we have introduced fictitious network delay between the servers (125 ms and 300 ms) as well as packet loss (0.1%).
• For this we have used the following parameters on the servers:
    tc qdisc add dev bond0 root handle 1: prio
    tc qdisc add dev bond0 parent 1:1 handle 2: netem delay 125ms    (or: netem delay 300ms)
    tc qdisc add dev bond0 parent 1:1 handle 2: netem loss 0.1%
    tc filter add dev bond0 parent 1:0 protocol ip pref 55 handle ::55 u32 match ip dst 213.39.4.229 flowid 2:1
• The tests done during Weeks 6, 7, 8 and 9 use various combinations of such network degradation
A weekly planning
Number of files and volume per day
The detailed results (see Annex for all results)
• Week 1 (FTP) for GISC Offenbach replay
• For each day of each week, we have something similar to this:

Informatix to cloud1
  Files delivered        : 53160 files
  Average transport time : .1614 seconds
  Highest delivery time  : 60 seconds (file name: ./A_PAAH46LOWM051445_C_EDZW_20151105144938_69731732.bin)
  Warning results        : 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 5 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1
  Average warning time   : .3000 seconds
  Highest warning time   : 5 seconds (file name: ./urgent-23)
  Total test time        : 86166 seconds
  Maximum transfer rate  : 1687352 bytes/second
  Files missing          : 0
  Checksum failures      : 0

Informatix to cloud2
  Files delivered        : 53160 files
  Average transport time : .5356 seconds (in the table on the next page we present cloud 1 to cloud 2)
  Highest delivery time  : 61 seconds (file name: ./A_PAAH46LOWM051445_C_EDZW_20151105144938_69731732.bin)
  Warning results        : 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 5 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 1 1 1 1 0 0 1 1 1
  Average warning time   : .5000 seconds
  Highest warning time   : 5 seconds (file name: ./urgent-23)
  Total test time        : 86167 seconds
  Maximum transfer rate  : 1557611 bytes/second
  Files missing          : 0
  Checksum failures      : 0
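The “Warning results” line appears to list the delivery time, in seconds, of each of the 50 “urgent” dummy bulletins. That reading is an assumption, but it is consistent with the reported figures: recomputing the average and the maximum from the listed values reproduces the “Average warning time” and “Highest warning time” above. A short check, with illustrative variable and function names:

```python
from typing import Dict, List

# Values copied from the "Warning results" lines of the Week 1 (FTP) report.
CLOUD1: List[int] = [
    0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 5, 0,
    0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
]
CLOUD2: List[int] = [
    0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 5, 0,
    0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1,
]


def summarize(times: List[int]) -> Dict[str, float]:
    """Count, average and maximum of per-bulletin delivery times (seconds)."""
    return {
        "count": len(times),
        "average": sum(times) / len(times),
        "highest": max(times),
    }


if __name__ == "__main__":
    print("cloud1:", summarize(CLOUD1))
    print("cloud2:", summarize(CLOUD2))
```

In both lists the single 5-second value sits at zero-based position 23, which matches the reported file name ./urgent-23 if the dummy bulletins are numbered from 0 (again an inference, not something the slides state).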
The summary of the average transfer time
The findings (1)
• With AFD and the FTP/SFTP parameters used (40 parallel transfers for normal files and 2 for urgent ones), the results are very good. The average transfer time is less than 1 second.
• Finding 1: Reducing the number of hops between source and destination and configuring a large number of parallel transfers is a must to achieve the “2-minute” target for transferring data
• Finding 2: Replacing a large number of destinations (the 14 other GISCs) with a small number of targets (the servers in the cloud) will allow the GISCs to use a much larger number of parallel connections without overloading the FTP/SFTP servers. Instead of, for example, 3 parallel transfers for normal bulletins and 1 transfer for urgent bulletins per destination (potentially 56 parallel streams in total), allowing 40 (normal) + 2 (urgent) parallel transfers will reduce the workload on the FTP/SFTP servers while improving the overall performance of the system.
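The stream counts quoted in Finding 2 can be checked with a few lines of arithmetic. The per-destination values (3 normal + 1 urgent) are the slide's own example and the 40 + 2 values are those used in the pilot; the function names are illustrative, and a single cloud target is assumed for the second figure.

```python
def any_to_any_streams(destinations: int, normal: int, urgent: int) -> int:
    """Parallel streams one GISC may open when pushing to every other GISC."""
    return destinations * (normal + urgent)


def cloud_streams(normal: int, urgent: int, targets: int = 1) -> int:
    """Parallel streams one GISC may open towards the cloud target(s)."""
    return targets * (normal + urgent)


if __name__ == "__main__":
    # 14 other GISCs, 3 normal + 1 urgent transfers each (slide example)
    print(any_to_any_streams(destinations=14, normal=3, urgent=1))
    # 40 normal + 2 urgent transfers towards one cloud server (pilot settings)
    print(cloud_streams(normal=40, urgent=2))
```

The comparison shows why the cloud approach helps: each file benefits from ten times more parallelism per destination (42 sessions instead of 4) while the sending server handles fewer concurrent streams overall (42 instead of 56).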
The findings (2)
• Week 1 (using FTP) and week 2 (using SFTP) show that the average transport time is always shorter for SFTP.
• Week 7 (using FTP with network degradation) and week 9 (using SFTP with network degradation) show that the average transport time is always shorter for SFTP.
• Finding 3: Including SFTP as an authorized solution in Att II.15 will improve both the security and the efficiency of the transfer
• Week 5 (packing files and bulletins in .tar.bz2) introduces a delay (the files were buffered for 5 seconds before being sent) without clear benefit in this design
• Finding 4: Unless the volume of data to transfer must be reduced (limited bandwidth), compressing files introduces a significant delay in data transfer.
The findings (3)
• The design of the pilot was not resilient. During the pilot, we had a problem on one of the VMs: the storage at Interoute had issues and we lost cloud2. It was fairly simple and quick to configure a new server. However, the data on the server was lost.
• The architecture of the solution and the procedures to support the cloud servers must take potential failures of the VMs into account
• As explained by Amazon Web Services, cloud servers offer a “good enough” service. It is the responsibility of the “user” to define the required architecture, keeping in mind that individual VMs offer a level of service that cannot be considered operational according to our “usual” criteria
• Finding 5: For typical IaaS service providers, the solution is designed to provide a “good enough” service. It will be the responsibility of the WIS community to design the solution accordingly.
The thank you slide
• The 13 GISCs taking part in the pilot
• ECMWF for supporting this pilot by offering the VMs for one year
• Holger Kiehl (DWD) has been a tremendous support:
  • Contact for GISC Offenbach
  • Configuration and hardening of the VMs
  • Configuration of AFD
  • Improving AFD
Recommended text
• ET-CTS:
  • Thanks ECMWF, the participating GISCs and the technical team in charge of the project
  • Recommends keeping the pilot up and running for the time being
  • Recognizes that the “cache in the cloud” is a very promising solution to facilitate the exchange of data for the 24h cache between all the GISCs
  • Considers that the pilot is successful and has proved that such a solution is technically very appropriate
  • Tasks the Chair of ET-CTS to present the findings to the upcoming ET-WISC and ICT-ISS meetings, with the view that the “cache in and through the cloud” (including the required additional redundancy aspects) is technically viable to ensure the required 24h cache function.
  • Considers that, given the respective Terms of Reference of the various Expert Teams, the follow-up on this action falls under the remit of ET-WISC (potentially TT-GISC).
Thank you for your attention