
Running the multi-platform, multi-experiment cluster at CCIN2P3



  1. Running the multi-platform, multi-experiment cluster at CCIN2P3 Wojciech A. Wojcik IN2P3 Computing Center e-mail: wojcik@in2p3.fr URL: http://webcc.in2p3.fr

  2. IN2P3 Computing Center • Provides the computing and data services for the French high energy and nuclear physicists: • IN2P3 – 18 physics labs (in all the major cities in France) • CEA/DAPNIA • French groups are involved in 35 experiments at CERN, SLAC, FNAL, BNL, DESY and other sites (including astrophysics). • Specific situation: our CC is not directly attached to an experimental facility such as CERN, FNAL, SLAC, DESY or BNL.

  3. General rules • All groups/experiments share the same interactive and batch (BQS) clusters and the other types of services (disk servers, tapes, HPSS and networking). Some exceptions later… • /usr/bin and /usr/lib (OS and compilers) are local • /usr/local/* is on AFS, specific to each platform • /scratch – local temporary disk space • System, group and user profiles define the proper environment

  4. General rules • Each user has an AFS account with access to the following AFS disk spaces: • HOME – backed up by CC • THRONG_DIR (up to 2 GB) – backed up by CC • GROUP_DIR (n * 2 GB) – no backup • Data are on disks (GROUP_DIR, Objectivity), tapes (xtage system) or in HPSS • Data exchange on the following media: • DLT, 9840 • Network (bbftp) • ssh/ssf – recommended for access to/from external domains.
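The spaces above are normally reached through environment variables carrying the same names. A purely illustrative C sketch of choosing an output location (the variable names follow the slide; the directory and file names are invented, and this is not CCIN2P3-supplied code):

```c
/* Sketch: pick an output directory from the AFS spaces named on the slide.
 * HOME is small and backed up; GROUP_DIR is larger but not backed up. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *group_dir = getenv("GROUP_DIR");   /* n * 2 GB, no backup */
    const char *home      = getenv("HOME");        /* backed up by CC     */
    char path[512];

    /* Bulky, reproducible output goes to GROUP_DIR; fall back to HOME. */
    snprintf(path, sizeof(path), "%s/output/run001.dat",
             group_dir ? group_dir : home);
    printf("writing to %s\n", path);
    return 0;
}
```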

  5. Supported platforms • Supported platforms: • Linux (RedHat 6.1, kernel 2.2.17-14smp) with different egcs/gcc compilers (gcc 2.91.66, gcc 2.91.66 with a patch for Objectivity 5.2, gcc 2.95.2 – installed under /usr/local), as requested by different experiments • Solaris 2.6, with 2.7 coming soon • AIX 4.3.2 • HP-UX 10.20 – the end of this service has already been announced

  6. Support for experiments • About 35 different High Energy, Astrophysics and Nuclear Physics experiments. • LHC experiments: CMS, ATLAS, ALICE and LHCb. • Big non-CERN experiments: BaBar, D0, STAR, PHENIX, AUGER, EROS II.

  7. Disk space • Need to make the disk storage independent of the operating system. • Disk servers based on: • A3500 from Sun with 3.4 TB • VSS from IBM with 2.2 TB • ESS from IBM with 7.2 TB • 9960 from Hitachi with 21.0 TB

  8. Mass storage • Supported media (all in the STK robots): • 3490 • DLT4000/7000 • 9840 (Eagles) • Limited support for Redwood • HPSS – local developments: • Interface with RFIO: • API: C, Fortran (via cfio from CERNLIB) • API: C++ (iostream) • bbftp – secure parallel FTP using the RFIO interface
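Since the slide lists a C API for the RFIO interface, a minimal sketch of reading an HPSS-resident file through RFIO follows. The rfio_open/rfio_read/rfio_close calls are the standard RFIO client entry points; the header name and the file path are assumptions, not taken from the talk:

```c
/* Minimal sketch of reading a file through RFIO (error handling kept short). */
#include <stdio.h>
#include <fcntl.h>
#include "rfio_api.h"   /* assumed name of the RFIO client header */

int main(void)
{
    char buf[8192];
    int  n;
    /* hypothetical HPSS path, following the naming shown on the next slide */
    int  fd = rfio_open("/hpss/in2p3.fr/group/demo/run001.dat", O_RDONLY, 0);

    if (fd < 0) {
        rfio_perror("rfio_open");
        return 1;
    }
    while ((n = rfio_read(fd, buf, sizeof(buf))) > 0) {
        /* hand the n bytes just read to the analysis code */
    }
    rfio_close(fd);
    return 0;
}
```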

  9. Mass storage • HPSS – test and production services: • $HPSS_TEST_SERVER:/hpsstest/in2p3.fr/… • $HPSS_SERVER:/hpss/in2p3.fr/… • HPSS – usage: • BaBar – access via ams/oofs and RFIO • EROS II – already 1.6 TB in HPSS • AUGER, D0, ATLAS, LHCb • Other experiments testing: SNovae, DELPHI, ALICE, PHENIX, CMS

  10. Networking - LAN • Fast Ethernet (100 Mb/s full duplex) --> to interactive and batch services • Gigabit Ethernet (1 Gb/s full duplex) --> to disk servers and the Objectivity/DB server

  11. Networking - WAN • Academic public network “Renater 2” based on virtual networking (ATM) with guaranteed bandwidth (VPN on ATM) • Lyon ↔ CERN at 34 Mb/s (155 Mb/s in June 2001) • Lyon ↔ US traffic goes through CERN • Lyon ↔ ESnet (via STAR TAP) at 30–40 Mb/s, reserved for traffic to/from ESnet, except FNAL.

  12. BAHIA - interactive front-end Based on multi-processor machines: • Linux (RedHat 6.1) -> 10 Pentium II 450 MHz + 12 Pentium III 1 GHz (2 processors each) • Solaris 2.6 -> 4 Ultra-4/E450 • Solaris 2.7 -> 2 Ultra-4/E450 • AIX 4.3.2 -> 6 F40 • HP-UX 10.20 -> 7 HP9000/780/J282

  13. Batch system - BQS Batch based on BQS (a CCIN2P3 product) • In constant development, in use for 7 years • POSIX compliant, platform independent (portable) • Possibility to define resources for the job (the job's class is calculated by the scheduler as a function of): • CPU time, memory • CPU bound or I/O bound • Platform(s) • System resources: local scratch disk, stdin/stdout size • User resources (switches, counters)
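As an illustration only (this is not the actual BQS code, and the thresholds and class names are invented), the mapping from declared resources to a job class described above could look roughly like this:

```c
/* Illustrative sketch of mapping a job's declared resources to a class. */
#include <stdio.h>

struct job_request {
    long cpu_seconds;   /* requested CPU time           */
    long memory_mb;     /* requested memory             */
    long scratch_mb;    /* local scratch disk space     */
    int  io_bound;      /* 1 = I/O bound, 0 = CPU bound */
};

static const char *job_class(const struct job_request *r)
{
    if (r->io_bound)            return "io";      /* scheduled near the data servers */
    if (r->cpu_seconds > 36000) return "long";    /* more than ~10 h of CPU          */
    if (r->cpu_seconds > 3600)  return "medium";
    return "short";
}

int main(void)
{
    struct job_request r = { 7200, 256, 500, 0 };  /* example request */
    printf("class = %s\n", job_class(&r));
    return 0;
}
```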

  14. Batch system - BQS • The scheduler takes into account: • Targets for groups (declared twice a year for the big production runs) • Consumption of CPU time over recent periods (month, week, day) for each user and group • Proper aging and interleaving in the class queues • Possibility to open a worker for any combination of classes.
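Again purely as a sketch (the formula and weights are invented; the real BQS scheduler is not described in that much detail here), the factors listed above could be combined into a fair-share style priority as follows:

```c
/* Illustrative fair-share priority: under-served groups rise, heavy recent
 * users sink, and waiting jobs slowly age upward so nothing starves. */
#include <stdio.h>

struct share_state {
    double group_target;     /* agreed share of the farm for the group        */
    double group_used_week;  /* fraction of the farm the group used this week */
    double user_used_day;    /* fraction this user consumed in the last day   */
    double hours_waiting;    /* time the job has already spent queued         */
};

static double priority(const struct share_state *s)
{
    double deficit = s->group_target - s->group_used_week;
    double penalty = s->user_used_day;
    double aging   = 0.01 * s->hours_waiting;
    return deficit - penalty + aging;
}

int main(void)
{
    struct share_state s = { 0.10, 0.06, 0.01, 5.0 };  /* example numbers */
    printf("priority = %.3f\n", priority(&s));
    return 0;
}
```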

  15. Batch system - configuration • Linux (RedHat 6.1) -> 96 * dual PIII 750 MHz + 110 * dual PIII 1 GHz • Solaris 2.6 -> 25 * Ultra60 • Solaris 2.7 -> 2 * Ultra60 (test service) • AIX 4.3.2 -> 29 * RS390 + 20 * 43P-B50 • HP-UX 10.20 -> 52 * HP9000/780

  16. Batch system – CPU usage

  17. Batch system – Linux cluster

  18. Regional Center for: • EROS II (Expérience de Recherches d’Objets Sombres par effet de lentilles gravitationnelles) • BaBar • Auger (PAO) • D0

  19. EROS II • Raw data (from the ESO site in Chile) arrive on DLTs (tar format). • Restructuring of the data from DLT to 3490 or 9840; creation of metadata in an Oracle DB. • Data server (under development) – 7 TB of data at present, 20 TB by the end of the experiment – using HPSS + a web server.

  20. BaBar • AIX and HP-UX are not supported by BaBar; we provide Solaris 2.6 with Workshop 4.2 and Linux (RedHat 6.1). Solaris 2.7 is in preparation. • Data are stored in Objectivity/DB; import/export of data is done using bbftp. Import/export on tapes has been abandoned. • Ten Objectivity (ams/oofs) servers dedicated to BaBar have been installed. • HPSS is used for staging the Objectivity/DB files.

  21. Experiment PAO

  22. PAO - sites

  23. PAO - AUGER • CCIN2P3 is acting as the AECC (AUGER European CC). • Access granted to all AUGER users (AFS accounts provided). • A CVS repository for the AUGER software has been installed at CCIN2P3, accessible via AFS (from the local and foreign cells) and from non-AFS environments using ssh. • Linux is the preferred platform. • The simulation software is based on Fortran programs.

  24. D0 • Linux is one of the D0 supported platforms and is available at CCIN2P3. • The D0 software uses the KAI C++ compiler. • Import/export of D0 data (kept in the internal Enstore format) is complicated. We will try to use bbftp as the file transfer program.

  25. Import/export • Diagram: CCIN2P3 (HPSS) exchanging data with CERN (CASTOR, HPSS), SLAC (HPSS), FNAL (ENSTORE, SAM) and BNL (HPSS); the transfer mechanisms between the sites are still marked as open questions (“?”).

  26. Problems • Adding new Objectivity servers (for other experiments) is very complicated: it requires new, separate machines with modified port numbers in /etc/services. Under development for CMS. • OS versions and patch levels • Compiler versions (mainly for Objectivity, differing between experiments) • Solutions?
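To make the /etc/services dependency concrete: a server typically resolves its listening port by service name, so every additional Objectivity server instance needs its own entry. The service name below is hypothetical, used only to show the lookup:

```c
/* Sketch: look up a (hypothetical) service entry added to /etc/services. */
#include <stdio.h>
#include <netdb.h>
#include <arpa/inet.h>

int main(void)
{
    struct servent *se = getservbyname("oocms", "tcp");  /* invented name */

    if (se == NULL) {
        fprintf(stderr, "no /etc/services entry for oocms/tcp\n");
        return 1;
    }
    printf("oocms/tcp -> port %d\n", ntohs(se->s_port));
    return 0;
}
```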

  27. Conclusions • Data exchange should be done using standards (e.g. files or tapes) and common access interfaces (bbftp and RFIO are good examples). • Better coordination is needed, with similar requirements on supported system and compiler levels across experiments. • The choice of CASE technology is out of the control of our CC acting as a Regional Computer Center. • GRID will require a more uniform configuration of the distributed elements. • Who can help? HEPCCC? HEPiX? GRID?
