LHC experiments and Objectivity/DB WAN Performance
Hiroyuki Sato, Youhei Morita
KEK / CERN IT/ASD
LHC at CERN
[Figures: ATLAS Detector; Detector for LHCb experiment; Detector for ALICE experiment]
ATLAS Collaboration: ~1800 physicists from 33 countries
LHC Data Challenge
• 1 PetaByte of raw data / year / experiment
• Complexity of data: ~250 SPECint95*sec/event
• 10^9 events / year
Data Model in ATLAS CTP
• http://atlasinfo.cern.ch/Atlas/GROUPS/SOFTWARE/TDR/html/TDR-10.html#HEADING10-0
• RAW (1 MB): information coming out of Level-3
• ESD (100 KB): reconstruction information in enough detail to do event display, generate analysis objects, and redo most of the reconstruction
• AOD (<10 KB): physics objects, e.g. electrons, muons, etc., used for analysis
• Tag (<100 B): brief information allowing a rapid first-pass selection to find events of interest
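
The tier sizes can be turned into yearly volumes with a few lines of arithmetic. The sketch below assumes the sizes are per event, that the event count on the LHC Data Challenge slide reads as 10^9 per year, and that units are decimal; the function names are illustrative only.

```python
# Per-event sizes in bytes, read off the data model slide.
# AOD and Tag are upper bounds ("<10KB", "<100B"); decimal units are assumed.
EVENT_SIZE_BYTES = {"RAW": 1_000_000, "ESD": 100_000, "AOD": 10_000, "Tag": 100}

EVENTS_PER_YEAR = 10**9          # assumption: event count read as 10^9 / year
CPU_PER_EVENT = 250              # SPECint95*sec per event
SECONDS_PER_YEAR = 365 * 24 * 3600


def yearly_volume_tb(tier: str) -> float:
    """Yearly volume of one tier in terabytes (10^12 bytes)."""
    return EVENT_SIZE_BYTES[tier] * EVENTS_PER_YEAR / 1e12


def sustained_cpu_specint95() -> float:
    """SPECint95 capacity needed to process one year of events within one year."""
    return CPU_PER_EVENT * EVENTS_PER_YEAR / SECONDS_PER_YEAR


if __name__ == "__main__":
    for tier in EVENT_SIZE_BYTES:
        print(f"{tier}: {yearly_volume_tb(tier):g} TB/year")
    print(f"CPU: {sustained_cpu_specint95():.0f} SPECint95 sustained")
```

Under these assumptions RAW comes to 1000 TB per year, consistent with the 1 PetaByte figure, and each further tier is at least a factor of ten smaller.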
Analysis Use Case examples
• Experiment-wide (CPU & I/O demanding)
  • calibration
  • reconstruction (track/cluster finding, ...)
• Analysis WG-wide (I/O demanding, several iterations...)
  • physics object finding / correction algorithm
  • jet finding / parton energy correction, e/pi separation, particle ID...
  • "official" selection of physics event sets
• Individuals (many iterations, short turn-around)
  • "private" event selection with trial and error --> feedback to WG
  • background study (I/O demanding) and Monte Carlo study
  • event display for debugging/checking (access to RAW??)
Regional Centre
• Idea of distributing the data near to physicists
• Split parts of analysis tasks and computing resources world-wide
[Diagram: Model Circa 2006]
• CERN: 6×10^7 MIPS, 2000 Tbyte; Robot
• Regional Centre: 4×10^6 MIPS, 200 Tbyte; Robot
• University: n×10^6 MIPS, 100 Tbyte; Robot
• Desktops
• Network links: 622 Mbits/s; Optional Air Freight
Remote Analysis Use Case
• Remote Access (Data: CERN, JOB: Remote)
• Data Replication (Data: Remote, JOB: Remote)
• Job Submission (Data: CERN, JOB: CERN)
Remote Analysis Use Case
• We will need/utilize all possible scenarios
• Key factors are the utilization of resources and the number of job iterations
  • Type of the data
  • Number of iterations
  • CPU resource
  • Storage resource
  • Network bandwidth and Round Trip Time
Wish list for DRO
• Data replication in the HEP use case can be scheduled
• New objects always keep appearing (no realtime "updates")
• Nice to have an "asynchronized update" of objects
MONARC
Models Of Networked Analysis at Regional Centres
http://www.cern.ch/MONARC/
• Primary goal is to identify baseline Computing Models that could provide viable solutions meeting the data analysis needs of LHC experiments.
• Boundary conditions are: network bandwidth, computing power, distributed database systems, data processing capabilities available at the start of LHC (year 2005).
Objectivity Tests in MONARC
[Diagram: Simulation and Measurements; 4 CPUs Client, LAN, Raw Data DB]
Aims of the testbed
• Demonstrate the capability of world-wide distributed database (replication, WAN access, parallelised I/O etc.)
• Validate the MONARC simulation program
• Understand the performance of the database over the wide area network
• Identify the bottlenecks and possible improvements of the distributed database
Outline Topology of JEG Network
[Diagram: INTELSAT - IOR satellite link, 2 Mbps, Leuk E/S and Yamaguchi E/S. Sites shown: CNET (Paris, France), IGR (Paris, France), ESRIN (Frascati, Italy), CERN (Geneva, Switzerland), IDC Yokohama Centre, National Cancer Centre, NASDA (Hatoyama), KEK (Tsukuba), CRL (Koganei), Keio Univ.]
• Tele-medicine applications: NCC/CRL ⇔ IGR
• High-energy physics applications: KEK ⇔ CERN
• Internet protocols: Keio Univ. ⇔ Internet community
• Earth Observation applications: NASDA ⇔ ESRIN/ESA
Some Characteristics
• RTT (Round Trip Time) is about 657 msec
• 2 Mbps bandwidth is dedicated for this measurement
• Monitored TCP/IP data-link layer packets
Monitoring of the DRO
• Run a monitor program to measure the size of the database file at 100 msec intervals during the replication.
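
The monitor program itself is not shown in the slides. A minimal sketch of such a poller, assuming the replicated database is a regular file whose size can be read with a stat call, might look like this (the function name and parameters are hypothetical):

```python
import os
import time


def monitor_file_size(path, interval_s=0.1, duration_s=1.0):
    """Sample the size of `path` every `interval_s` seconds.

    Returns a list of (seconds_since_start, size_in_bytes) tuples.
    A file that does not exist yet is recorded with size 0.
    """
    samples = []
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if elapsed > duration_s:
            break
        try:
            size = os.path.getsize(path)
        except FileNotFoundError:
            size = 0
        samples.append((elapsed, size))
        time.sleep(interval_s)
    return samples
```

Plotting the returned samples against time gives a size-versus-time curve of the kind shown on the replication slides. The real sampling period is slightly longer than `interval_s` because the stat call itself takes time.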
DB Replication on dedicated satellite link (JEG 2Mbps)
[Plot: Size of DB1 file in AP3 (MB) vs. Time since DRO started (second), during “Replication”; 50K objects are transferred. Phases: (1) DRO preparation, (2) transfer, (3) closing]
DB Replication on dedicated satellite link (JEG 2Mbps)
[Plot, detail: Size of DB1 file in AP3 (MB) vs. Time since DRO started (second), during “Replication”; 50K objects are transferred]
Δt = 689 ms, Δs = 4096 bytes
Effective Bandwidth
Physical Bandwidth: B
Effective Bandwidth: Beff
T = t(transfer) + t(handshake) = unit_size / B + RTT
Beff = unit_size / T
Beff / B = unit_size / (unit_size + B * RTT)
• 2.4% (unit: 4 KByte, 2 Mbps, RTT 660 msec)
• 28% (unit: 64 KByte, 2 Mbps, RTT 660 msec)
• 50% (unit: 165 KByte, 2 Mbps, RTT 660 msec)
• 50% (unit: 83 MByte, 1 Gbps, RTT 660 msec)
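
The four percentages follow directly from the ratio Beff / B. The sketch below evaluates it, assuming binary units (1 KByte = 1024 bytes) and unit_size expressed in bits so that it is commensurate with B * RTT. It also compares the model with the 4096-byte step every 689 ms reported on the replication plot. Function names are illustrative.

```python
KB, MB = 1024, 1024**2  # assumption: binary units


def efficiency(unit_bytes, bandwidth_bps, rtt_s):
    """Beff / B = unit_size / (unit_size + B * RTT), with unit_size in bits."""
    unit_bits = 8 * unit_bytes
    return unit_bits / (unit_bits + bandwidth_bps * rtt_s)


def measured_efficiency(step_bytes=4096, step_s=0.689, bandwidth_bps=2e6):
    """Fraction of the link used when step_bytes arrive every step_s seconds."""
    return 8 * step_bytes / step_s / bandwidth_bps


if __name__ == "__main__":
    cases = [(4 * KB, 2e6), (64 * KB, 2e6), (165 * KB, 2e6), (83 * MB, 1e9)]
    for unit, bandwidth in cases:
        print(f"{unit:>9} bytes at {bandwidth:g} bps: "
              f"{100 * efficiency(unit, bandwidth, 0.66):.1f}%")
    print(f"measured on the satellite link: {100 * measured_efficiency():.1f}%")
```

With these inputs the model and the measured staircase both land near 2.4% for 4 KByte units, which suggests the replication was exchanging one 4 KByte unit per round trip.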
TCPDUMP in transfer phase
04:50:53.447110 arksol1.34074 > monarc01.6775: . 1523007:1523543(536) ack 13251 win 34840 (DF)
04:50:53.447181 arksol1.34074 > monarc01.6775: P 1523543:1524079(536) ack 13251 win 34840 (DF)
04:50:53.447222 arksol1.34074 > monarc01.6775: . 1524079:1524615(536) ack 13251 win 34840 (DF)
04:50:53.447257 arksol1.34074 > monarc01.6775: P 1524615:1525151(536) ack 13251 win 34840 (DF)
04:50:53.447292 arksol1.34074 > monarc01.6775: . 1525151:1525687(536) ack 13251 win 34840 (DF)
04:50:53.447327 arksol1.34074 > monarc01.6775: P 1525687:1526223(536) ack 13251 win 34840 (DF)
04:50:53.447362 arksol1.34074 > monarc01.6775: . 1526223:1526759(536) ack 13251 win 34840 (DF)
04:50:53.447391 arksol1.34074 > monarc01.6775: P 1526759:1527179(420) ack 13251 win 34840 (DF)
04:50:54.119317 monarc01.6775 > arksol1.34074: . ack 1524079 win 33232 (DF)
04:50:54.124542 monarc01.6775 > arksol1.34074: . ack 1525151 win 33232 (DF)
04:50:54.131474 monarc01.6775 > arksol1.34074: . ack 1526223 win 33232 (DF)
04:50:54.136635 monarc01.6775 > arksol1.34074: P 13251:13287(36) ack 1527179 win 33232 (DF)
04:50:54.137044 arksol1.34074 > monarc01.6775: . 1527179:1527715(536) ack 13287 win 34840 (DF)
04:50:54.137115 arksol1.34074 > monarc01.6775: P 1527715:1528251(536) ack 13287 win 34840 (DF)
…..
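
The cycle time and the bytes sent per cycle can be read off the trace. The sketch below parses five lines copied from the trace (first and last data segment of one burst, two replies, first segment of the next burst). The regular expression is an assumption that covers only this output shape, not tcpdump output in general.

```python
import re
from datetime import datetime

TRACE = """\
04:50:53.447110 arksol1.34074 > monarc01.6775: . 1523007:1523543(536) ack 13251 win 34840 (DF)
04:50:53.447391 arksol1.34074 > monarc01.6775: P 1526759:1527179(420) ack 13251 win 34840 (DF)
04:50:54.119317 monarc01.6775 > arksol1.34074: . ack 1524079 win 33232 (DF)
04:50:54.136635 monarc01.6775 > arksol1.34074: P 13251:13287(36) ack 1527179 win 33232 (DF)
04:50:54.137044 arksol1.34074 > monarc01.6775: . 1527179:1527715(536) ack 13287 win 34840 (DF)
"""

# time, source, destination, flags, optional "first:last(len)" data range, ack
LINE_RE = re.compile(
    r"^(\d\d:\d\d:\d\d\.\d+) (\S+) > (\S+): \S+ (?:(\d+):(\d+)\((\d+)\) )?ack (\d+)"
)


def data_segments(trace, sender):
    """Return (timestamp, first_seq, last_seq) for data segments sent by `sender`."""
    segments = []
    for line in trace.splitlines():
        m = LINE_RE.match(line)
        if m and m.group(2) == sender and m.group(4):
            stamp = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            segments.append((stamp, int(m.group(4)), int(m.group(5))))
    return segments


segs = data_segments(TRACE, "arksol1.34074")
burst_start, first_seq, _ = segs[0]
# The next burst begins with the first segment sent after a pause of > 100 ms.
next_burst = next(
    cur for prev, cur in zip(segs, segs[1:])
    if (cur[0] - prev[0]).total_seconds() > 0.1
)
cycle_s = (next_burst[0] - burst_start).total_seconds()
burst_bytes = next_burst[1] - first_seq

if __name__ == "__main__":
    print(f"cycle: {cycle_s * 1000:.0f} ms, bytes per cycle: {burst_bytes}")
```

For these lines the burst carries 4172 bytes and the next burst starts about 690 ms after the first, close to the 689 ms step seen in the file-size plot.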
DRO Port Activity
[Plot: Transfer bytes (AP1 => AP3 / AP3 => AP1) by Port # of AP1 host. Phases: Preparation (40s), Transfer (240s), Closing (90s)]
AMS Read
• AMS client reads the objects on the server
• Protocol sequence
  • Client makes a connection request to Server
  • Negotiation of the connection
  • Start of the data transfer phase
• Application layer handshaking using the same port
  sending: write(fd, buffer, sizeof(buffer))
  receiving: read(fd, buffer, sizeof(buffer))
• Unit of handshaking:
  • Client -> Server : 56 bytes
  • Server -> Client : Objy_pagesize + 44 bytes
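
The request/reply pattern described above can be mimicked on a local socket pair. This is only an illustration of the message sizes and of the stop-and-wait structure: the payloads are zero-filled placeholders, the 8192-byte page size is an assumption, and nothing here reproduces the actual AMS wire format.

```python
import socket
import threading

REQUEST_SIZE = 56            # Client -> Server
PAGE_SIZE = 8192             # assumed Objectivity page size
REPLY_SIZE = PAGE_SIZE + 44  # Server -> Client


def recv_exact(sock, n):
    """Read exactly n bytes; a single recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def serve_pages(sock, n_pages):
    """Answer each 56-byte request with one page-sized reply."""
    for _ in range(n_pages):
        recv_exact(sock, REQUEST_SIZE)
        sock.sendall(bytes(REPLY_SIZE))


def read_pages(sock, n_pages):
    """Request pages one at a time; every page costs at least one round trip."""
    received = 0
    for _ in range(n_pages):
        sock.sendall(bytes(REQUEST_SIZE))
        received += len(recv_exact(sock, REPLY_SIZE))
    return received


if __name__ == "__main__":
    server_sock, client_sock = socket.socketpair()
    worker = threading.Thread(target=serve_pages, args=(server_sock, 3))
    worker.start()
    print(read_pages(client_sock, 3), "bytes received")
    worker.join()
    server_sock.close()
    client_sock.close()
```

Because the client does not send the next request until the previous reply is complete, the number of round trips grows linearly with the number of pages.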
AMS Read (2)
• MSS (Max Segment Size) of the link is 536 bytes
• Handshaking unit is fragmented into several IP packets according to the MSS
  pagesize = 8192 --> 16 segments (8192 + 44 bytes = 536 x 15 + 196 bytes)
• Multiple segments are sent from the server with several patterns with varying window size
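
The segment count is a plain division with remainder. A small sketch (the helper name is illustrative):

```python
def segment_sizes(payload_bytes, mss=536):
    """Split a payload into full MSS-sized segments plus one shorter tail."""
    full, rest = divmod(payload_bytes, mss)
    return [mss] * full + ([rest] if rest else [])


if __name__ == "__main__":
    sizes = segment_sizes(8192 + 44)
    print(len(sizes), "segments, last one", sizes[-1], "bytes")
```

The same split applied to the 4172-byte burst in the tcpdump trace gives seven 536-byte segments and one 420-byte segment, matching the lengths shown there.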
AMS Read (3)
• At the TCP/IP level, the transfer starts from a small window size, then tries to increase it (slow start)
  Pattern 1 (window size 2 -> 6) time ~ 4 x RTT
  Pattern 2 (window size 6 -> 7) time ~ 3 x RTT
  Pattern 3 (window size 7 -> 8) time ~ 3 x RTT
  Pattern 4 (window size 8 -> 9) time ~ 2 x RTT
  Pattern 5 (window size 9 -> ?) time ~ RTT + timeout
• The 9th segment does not reach the client (packet loss) with this network configuration -> window size is reset to 2 and starts over from pattern 1
AMS Read (4)
• In a typical TCP/IP application, the TCP layer remembers the maximum window size so as to avoid packet loss
• Objectivity uses the same port for sending and receiving data, so the TCP layer "forgets" the maximum window size at each handshaking cycle
• Average handshaking cycle in this network is ~ 3 x RTT
• If the page size is smaller than the maximum window size (536 x 8 = 4288 bytes in this test), the handshaking cycle becomes ~ 1 x RTT without any packet loss
• But 4 KB handshaking is too small for large RTT networks!
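
A rough per-page estimate shows what these cycle times mean. The sketch assumes the ~657 msec RTT quoted earlier, counts only page payload, and ignores the time the bytes spend on the 2 Mbps link, so the numbers are indicative rather than measured.

```python
RTT_S = 0.657        # round trip time quoted for the satellite link
LINK_BPS = 2e6       # dedicated 2 Mbps bandwidth


def page_throughput_bps(page_bytes, cycle_rtts, rtt_s=RTT_S):
    """Payload bits delivered per second when one page takes cycle_rtts round trips."""
    return 8 * page_bytes / (cycle_rtts * rtt_s)


if __name__ == "__main__":
    for page, rtts in [(8192, 3), (4096, 1)]:
        bps = page_throughput_bps(page, rtts)
        print(f"{page}-byte pages, {rtts} x RTT per page: "
              f"{bps / 1000:.0f} kbps ({100 * bps / LINK_BPS:.1f}% of the link)")
```

Under these assumptions the smaller page is about 1.5 times faster than the 8192-byte page, yet both stay below 3% of the link, which is the point of the last bullet.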
AMS Write
• Two phases of communication
  • "Control Transfer Phase" (CTP)
    • Client -> Server 56 bytes
    • Server -> Client pagesize + 44 bytes
    • Client <- Server pagesize + 88 bytes
    • Server -> Client 36 bytes
  • "Data Transfer Phase" (DTP)
    • Client -> Server pagesize + 48 bytes
• Actual communication is a combination of CTP+DTPs
AMS Write (2)
• Objects "newed" on the client side are stored in a buffer (~1.5MB)
• Number of CTPs is a function of the number of pages stored in the buffer ("slow start" algorithm)
[Table: # of pages in the client buffer vs. # of CTPs; values as extracted: 3 1 1 16 2 5 3 4 4 2 5 2 6 2 : :]
Suggestions to Objectivity
• Suggestion 1: Use two separate ports for controlling and transferring data (1 way traffic per port) to benefit from the maximum window size set by the TCP/IP layer (e.g. FTP)
  • up to 300% throughput improvement per transaction
  • easy to fix
• Suggestion 2: Try to minimize the number of "handshaking" (e.g. combine several pages per "handshaking")
  • 200~2000% throughput improvement per transaction
  • need a middle layer between pages and TCP/IP ?
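
The effect of combining pages per handshake can be explored with the same transfer-time model as on the Effective Bandwidth slide, T = unit_size / B + RTT. The sketch below is an idealized estimate, not the derivation behind the figures quoted above: it assumes one round trip per handshake, 8192-byte pages with 44 bytes of overhead each, a 2 Mbps link, and an RTT of 660 msec.

```python
def transfer_time_s(n_pages, page_bytes=8192 + 44, bandwidth_bps=2e6, rtt_s=0.66):
    """Time to move n_pages in one handshake: serialization time plus one RTT."""
    return 8 * n_pages * page_bytes / bandwidth_bps + rtt_s


def batching_speedup(n_pages, **link):
    """Throughput gain of one n-page handshake over n single-page handshakes."""
    return n_pages * transfer_time_s(1, **link) / transfer_time_s(n_pages, **link)


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 50):
        print(f"{n:>3} pages per handshake: x{batching_speedup(n):.1f}")
```

In this model the gain is close to a factor of two for two pages per handshake and saturates once serialization time dominates the round trip, so the benefit of batching depends strongly on the bandwidth-delay product of the link.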
Summary
• LHC experiments face the challenges of truly world-wide distributed data analysis at an unprecedented scale.
• Effective utilization of the wide area network is crucial.
• Objectivity/DB provides a nice concept of a widely distributed object persistency mechanism, but...
• There are still several "rough edges" in terms of TCP/IP optimization (esp. in large RTT networks).
• ...so, let's fix them together!