“Managing a farm without user jobs would be easier”
Clusters and Users at CERN, Tim Smith, CERN/IT (HEPiX Fall 2002)

Presentation Transcript


  1. “Managing a farm without user jobs would be easier” Clusters and Users at CERN, Tim Smith, CERN/IT

  2. Contents
  • The road to shared clusters
  • Batch cluster
    • Configuration
    • User challenges
    • Addressing the challenges
  • Interactive cluster
    • Load balancing
  • Conclusions

  3. The Demise of Free Choice
  [Timeline chart spanning 2000–2003]

  4. Cluster Aggregation

  5. Organisational Compromises
  • Clusters per group
    • Sized for the average: unhappy users
    • Sized for user peaks: happy users, unhappy financiers (wasted resources)
    • Invest effort in recuperating cycles for other groups
    • Configuration differences / specialities
  • Bulk production clusters
    • Production fluctuations dwarf those in user analysis
    • Complex cross-submission links

  6. Production Farm: Planning

  7. Shared Clusters
  [Diagram: 70 interactive servers (lxplus001…) reached via DNS load balancing; 750 batch servers (lxbatch001…) scheduled by LSF; 120 disk servers (disk001…) and tape servers (tape001…) accessed over rfio]

  8. Shared Cluster? Simple, Uniform!

  9. Partitioning
  • Still have identified resources
    • Uniform configuration
  • Sharing
    • Repartitioning or soak-up queues
    • If the owner experiment reclaims resources, soak-up jobs must be suspended: stranded jobs
  [Partition chart: ALICE, ATLAS, CMS, LHCb, ALEPH, DELPHI, L3, OPAL, COMPASS, nTOF, OPERA, SLAP, PARC, PARC Int, CVS BUILD, DELPHI Int, CSF, Public]

  10. LSF Fair-Share
  • Trade in the partition for a share
  • Multilevel shares
    • ATLAS 10%, CMS 12%, …
    • cmsprod 45%, HiggsWG 15%, …
    • usera 10%, userb 80%, userc 10%
  • Extra shares for productions
  • Effort shifts from juggling resources to accounting:
    • Demonstrating fairness
    • Protecting
    • Policing
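
  The multilevel share tree above can be read as nested fractions: a user's effective slice of the farm is the product of the shares along the path from the root. A minimal sketch in Python; how the slide's three levels nest under each other is an assumption, as are the names:

  ```python
  # Sketch: effective entitlement in a multilevel fair-share tree.
  # Shares are the percentages from the slide; the nesting is assumed.
  share_tree = {
      "ATLAS": (0.10, {}),
      "CMS": (0.12, {
          "cmsprod": (0.45, {}),
          "HiggsWG": (0.15, {
              "usera": (0.10, {}),
              "userb": (0.80, {}),
              "userc": (0.10, {}),
          }),
      }),
  }

  def effective_share(tree, path, acc=1.0):
      """Multiply share fractions along a path, e.g. CMS -> HiggsWG -> userb."""
      if not path:
          return acc
      frac, children = tree[path[0]]
      return effective_share(children, path[1:], acc * frac)

  # userb's slice of the whole farm: 0.12 * 0.15 * 0.80 = 1.44%
  print(effective_share(share_tree, ["CMS", "HiggsWG", "userb"]))
  ```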

  11. Facts and Figures
  • Accounting
    • LSF job records
    • Processed with a C program
    • Loaded into an Oracle DB
    • Plots/tables prepared with the Crystal Reports package
    • LSFAnalyser?
  • Monitoring
    • Poll the user access tools
    • SiteAssure?
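
  The actual pipeline (C parser, Oracle, Crystal Reports) is CERN-specific; as an illustration of the aggregation it performs, here is a hedged Python sketch that totals CPU time per user from job records already exported to CSV. The file name and column names are assumptions, not the real LSF record layout:

  ```python
  import csv
  from collections import defaultdict

  # Sketch of the accounting aggregation step: total CPU time per user.
  # Assumes job records were exported to CSV with illustrative columns
  # "user" and "cpu_seconds"; the real pipeline parsed LSF job records
  # with a C program and loaded them into Oracle.
  def cpu_time_per_user(path):
      totals = defaultdict(float)
      with open(path, newline="") as f:
          for rec in csv.DictReader(f):
              totals[rec["user"]] += float(rec["cpu_seconds"])
      return totals

  for user, secs in sorted(cpu_time_per_user("lsf_jobs.csv").items()):
      print(f"{user}: {secs / 3600:.1f} CPU hours")
  ```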

  12. CPU Time / Week
  [Chart: weekly CPU time on the merged user analysis and production farms]

  13. Performance of Batch: Job Slot Analysis
  [Chart: job slot occupancy, Thursday to Saturday, 10 min per tick]

  14. Challenging Batch (I)
  • Probing boundaries
    • Flooding
    • Concurrent starts
    • Uncontrolled status polling
  • Hitting limits
    • Disk space: /tmp /pool /var
    • Memory, swap full
    • Guarantees for other user jobs?
  • System issues
    • Queue drainers

  15. Challenging Batch (II)
  • Un-fair-share
    • Logging onto batch machines
    • Batch jobs which resubmit themselves
    • Forking sessions back to remote hosts
  • Wasting resources
    • Spawning processes which outlive the jobs
    • Sleeping processes
    • Copying large AFS trees
    • Establishing connections to dead machines

  16. Counter Measures
  • File system quotas
  • Virtual memory limits (see the sketch below)
  • Concurrent job limits per user/group
  • Restricted access through PAM
  • Instant response queues
  • Master node setup
    • Dedicated, 1 GB memory
    • Failover cluster
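
  One of these measures, virtual memory limits, can be enforced from a job wrapper with POSIX rlimits. A minimal sketch; the 1 GiB cap, the wrapper approach, and the job script name are illustrative assumptions, not CERN's actual mechanism:

  ```python
  import resource
  import subprocess

  # Sketch: cap a user job's virtual memory from a wrapper, in the
  # spirit of the "virtual memory limits" counter measure.
  VMEM_LIMIT = 1 << 30  # 1 GiB, illustrative

  def run_capped(cmd):
      def set_limits():
          # Applied in the child just before exec; allocations beyond
          # the cap fail, so a runaway job aborts instead of swamping
          # the node's memory and swap.
          resource.setrlimit(resource.RLIMIT_AS, (VMEM_LIMIT, VMEM_LIMIT))
      return subprocess.run(cmd, preexec_fn=set_limits).returncode

  run_capped(["./user_job.sh"])  # hypothetical job script
  ```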

  17. Shared Clusters
  [Same diagram as slide 7, with the interactive (lxplus) and batch (lxbatch) clusters now linked by LSF MultiCluster]

  18. Shared Clusters
  [Same diagram, with the interactive and batch machines merged into a single LSF cluster]

  19. Interactive Cluster
  • DNS load balancing (ISS)
  • Weighted load indexes
    • load, memory
    • swap rate, disk IO rate
    • # processes, # sessions, # window manager sessions
  • Exclusion thresholds
    • file systems full, nologins
  • DNS publishes 2 nodes every 30 seconds
    • chosen at random from the 5 least loaded
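
  The selection policy is concrete enough to sketch: score each node with a weighted sum of its load indexes, drop nodes that trip an exclusion threshold, then publish two picked at random from the five least loaded. The weights and metric names below are assumptions:

  ```python
  import random

  # Sketch of ISS-style node selection: weighted load score, exclusion
  # thresholds, then 2 picked at random from the 5 least-loaded nodes.
  WEIGHTS = {"load": 1.0, "mem_used": 0.5, "swap_rate": 2.0,
             "disk_io": 0.5, "sessions": 0.2}  # illustrative weights

  def score(node):
      return sum(w * node[k] for k, w in WEIGHTS.items())

  def eligible(node):
      # Exclusion thresholds: full file systems or logins disabled.
      return not node["fs_full"] and not node["nologin"]

  def publish(nodes, pool=5, publish_n=2):
      ranked = sorted((n for n in nodes if eligible(n)), key=score)
      pool_nodes = ranked[:pool]
      return random.sample(pool_nodes, min(publish_n, len(pool_nodes)))

  # Every 30 seconds the two winners' addresses would be pushed into DNS.
  ```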

  20. Daily Users
  [Chart: daily users, roughly 35 users per node]

  21. Challenging Interactive
  • Sidestepping load balancing
    • Parallel sessions across the farm
  • Running daemons
  • Brutal logouts
    • Open connections
    • Defunct processes
    • CPU-sapping orphaned processes
  • Remedies: monitoring + beniced + monthly reboots
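
  "beniced" presumably demotes misbehaving processes on a schedule; as a hedged sketch of that idea, the following scans /proc for processes reparented to init (the orphans the slide complains about) and lowers their priority. The orphan-only heuristic is an assumption; a real tool would also check CPU use and whitelist legitimate daemons:

  ```python
  import os

  # Sketch of a "benice"-style sweep: find orphaned processes
  # (reparented to init) and push them to the lowest priority.
  NICE_TARGET = 19

  def orphaned_pids():
      for pid in filter(str.isdigit, os.listdir("/proc")):
          try:
              with open(f"/proc/{pid}/stat") as f:
                  # /proc/<pid>/stat: "pid (comm) state ppid ..."
                  fields = f.read().rsplit(")", 1)[1].split()
              ppid = int(fields[1])
          except OSError:
              continue              # process exited while scanning
          if ppid == 1:             # reparented to init: likely an orphan
              yield int(pid)

  for pid in orphaned_pids():
      try:
          os.setpriority(os.PRIO_PROCESS, pid, NICE_TARGET)
      except PermissionError:
          pass                      # can only renice our own (or run as root)
  ```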

  22. Interactive Reboots

  23. Conclusions
  • Shared clusters present more user opportunities
    • Both good and bad!
  • They do not represent a panacea for sysadmins!
