
High End Computing at SDSC

High End Computing at SDSC: CSM Cluster Management. Eva Hocks, San Diego Supercomputer Center, 2007.





Presentation Transcript


  1. High End Computing at SDSC: CSM Cluster Management. Eva Hocks, San Diego Supercomputer Center, 2007

  2. Managing the HPC systems: DataStar
     • System Software:
       • AIX 5.2 ML3
       • CSM 1.3.3.1
       • RSCT 2.3.3.3
     • System Management with CSM:
       • Node setup
       • Node Groups
         • Per frame
         • Per function (NPACI, TG, POE, login, batch)

  3. CSM setup nodes
     • Configure Nodes
       • lshwinfo -p hmc -c dshmc07.sdsc.edu > /tmp/fr8_9
       • vi /tmp/fr8_9 : replace noname with cec_name
           no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF
           ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519
       • definenode -f /tmp/fr8_9 InstallOSName=AIX
       • systemid -p hmc hscroot
       • getadapters -n ds100 -z /tmp/ds100_adapters
           (write to CSM database, include Federation_switch adapters)
       • csm2nimnodes -n 'ds100' type='standalone' network_name='sdsc_net' platform='chrp' netboot_kernel='mp'
       • netboot -n ds100
       • updatenode -n ds100
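The manual vi step above could also be scripted. The following is a minimal sketch, not the presenter's method: it assumes the "::"-separated record layout shown on the slide, with the hostname in field 1 and the CEC name in field 4. The function name set_hostname and the hostname ds200 in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: fill in the hostname of an lshwinfo-style record.
# Assumption: fields are separated by "::", field 1 is the hostname
# (or the placeholder "no_hostname") and field 4 is the CEC name.
set_hostname() {
    # $1 = CEC name to match, $2 = hostname to assign; records on stdin
    awk -F'::' -v OFS='::' -v cec="$1" -v host="$2" '
        # Only touch records that still carry the placeholder.
        $1 == "no_hostname" && $4 == cec { $1 = host }
        { print }
    '
}

# Possible usage (ds200 is an invented hostname):
#   set_hostname fr9-cg13 ds200 < /tmp/fr8_9 > /tmp/fr8_9.new
```

Records that already have a hostname pass through unchanged, so the function can be applied once per CEC without disturbing earlier edits.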

  4. CSM_ADAPTERS_STANZA_FILE
     ds100:
         MAC_address=00096B34E093
         adapter_duplex=full
         adapter_speed=100
         cable_type=N/A
         install_server=192.168.236.31
         interface_name=en0
         location=U1.32-P1-H1/E1
         machine_type=install
         netaddr=
         network_type=en
         subnet_mask=
     ds100:
         machine_type=secondary
         interface_name=sn1
         network_type=sn
         netaddr=
         subnet_mask=
         location=U1.5-P1-H1/Q2
     ds100:
         machine_type=secondary
         interface_name=sn0
         network_type=sn
         netaddr=
         subnet_mask=
         location=U1.5-P1-H1/Q1
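To check a stanza file like the one above before it is used, a short awk filter can summarize one adapter per line. This is a sketch under assumptions: each stanza starts with a label line such as "ds100:" and is followed by attribute=value lines, as on the slide. The function name list_adapters is hypothetical and is not a CSM command.

```shell
#!/bin/sh
# Hypothetical helper: summarize an adapter stanza file read from stdin.
# Prints "node interface machine_type location" for each stanza.
list_adapters() {
    awk '
        function emit() {
            if (node != "") print node, ifname, mtype, loc
            ifname = ""; mtype = ""; loc = ""
        }
        # A stanza label: no "=" and a trailing colon, e.g. "ds100:".
        /^[^= \t]+:[ \t]*$/ {
            emit()
            node = $1
            sub(/:$/, "", node)
            next
        }
        # An attribute line, e.g. "    interface_name=en0".
        /=/ {
            line = $0
            sub(/^[ \t]+/, "", line)
            key = line; sub(/=.*/, "", key)
            val = line; sub(/^[^=]*=/, "", val)
            if (key == "interface_name") ifname = val
            else if (key == "machine_type") mtype = val
            else if (key == "location") loc = val
        }
        END { emit() }
    '
}

# Possible usage:
#   list_adapters < /tmp/ds100_adapters
```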

  5. Managing the HPC systems: DataStar
     • System Management with CSM:
       • Management through Command line
         • rpower
           • Power on/off, query node status
         • Install node: netboot -n ds100
         • dsh
           • Install updates on nodes (installp, rpm, emgr)
           • Monitor processes on nodes
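As an illustration of monitoring processes with dsh, the sketch below filters dsh-style output, where each line is prefixed with the node name and a colon. The function name nodes_without_process is hypothetical, and the dsh invocation in the usage comment is an assumption about how the per-node count might be gathered, not a command taken from the slides.

```shell
#!/bin/sh
# Hypothetical helper: read "node: count" lines on stdin and print the
# nodes whose count is exactly 0, i.e. where the process is not running.
nodes_without_process() {
    awk -F': *' '$2 == "0" { print $1 }'
}

# Possible usage (assumed invocation; adjust the remote command as needed):
#   dsh -a 'ps -e -o comm= | grep -c LoadL_startd' | nodes_without_process
```

Only lines whose payload is the literal 0 are reported, so error text returned by a node is not mistaken for a missing process.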

  6. Managing the HPC systems: DataStar continued…
     • System Configuration: cfmupdatenode
       • Synchronize system configuration modifications with nodes and system admins
       • Run pre/post scripts to capture security risks and send notification
     • System monitoring: Distributed Monitoring responds (GUI configured)
       • Event driven email notification for on-call personnel
       • GUI monitoring for operations personnel

  7. CSM monitoring

  8. CSM monitoring

  9. CSM Event Monitoring
     • GUI Event Monitoring
       • Critical Conditions:
         • AnyNodeTmpFull
         • AnyNodeVarSpace
         • AnyNodeSwitchResponds
         • LoadLeverProcess
         • hostResponds (see setting up ERRM Conditions)
       • Warning Conditions:
         • Processor State

  10. CSM Event Monitoring GUI

  11. CSM Event Monitoring: setting up ERRM Conditions
      • hostResponds ERRM condition (redbook SG24-6953 page 193)
        • mkcondition -r IBM.ManagedNode \
              -e "Status!=1" -E "Status==1" \
              -d "Node hostResponds down" \
              -D "Node hostResponds up" \
              -m l hostResponds
        • mkresponse -n LogStatustoFIFO \
              -s /usr/local/bin/LogStatusData \
              -E "STATUS_FILE=/var/adm/spmondata" LogStatusData
        • mkcondresp "hostResponds" "LogStatusData"
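The slides do not show the contents of /usr/local/bin/LogStatusData. As a rough idea of what such a response script might do, the sketch below appends one line per event to the file named by STATUS_FILE. It assumes ERRM passes event details to response scripts through ERRM_* environment variables (ERRM_NODE_NAME, ERRM_COND_NAME, ERRM_TYPE, ERRM_VALUE); the function name log_status and the "|"-separated record format are hypothetical.

```shell
#!/bin/sh
# Hypothetical stand-in for a response script such as LogStatusData.
# Assumptions: STATUS_FILE is set through the -E option of mkresponse,
# and ERRM exports the ERRM_* variables before running the script.
log_status() {
    status_file=${STATUS_FILE:-/var/adm/spmondata}
    # One record per event: node|condition|event type|observed value.
    printf '%s|%s|%s|%s\n' \
        "${ERRM_NODE_NAME:-unknown}" \
        "${ERRM_COND_NAME:-unknown}" \
        "${ERRM_TYPE:-unknown}" \
        "${ERRM_VALUE:-}" >> "$status_file"
}

# In a standalone script the last line would simply be:
#   log_status
```

If STATUS_FILE names a FIFO, as the response action name LogStatustoFIFO suggests, the write blocks until a reader opens the other end.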

  12. Event notification
      Warning Event email:
      =====================================
      Monday 07/26/04 19:12:34
      Condition Name: LoadLProcess
      Severity: Warning
      Event Type: Event
      Expression: Processes.CurPidCount <= 0
      Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
      Resource Class: IBM.Program
      Data Type: CT_SD_PTR
      Data Value: [0,1,{},{282654}]
      Node Name: ds243
      Node NameList: {ds243}
      Resource Type: 0
      =====================================
      Rearm email:
      =====================================
      Monday 07/26/04 19:13:32
      Condition Name: LoadLProcess
      Severity: Warning
      Event Type: Rearm event
      Expression: Processes.CurPidCount > 0
      Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
      Resource Class: IBM.Program
      Data Type: CT_SD_PTR
      Data Value: [1,0,{270492},{270492}]
      Node Name: ds243
      Node NameList: {ds243}
      Resource Type: 0
      =====================================

  13. CSM Information
      • CSM Guide for the PSSP Systems Administrator, SG24-6953
        • Useful scripts for ERRM conditions
        • Command cross reference
      • IBM CSM for AIX 5L Administration Guide, SA22-7918
        • CSM error messages
      • Web Sites
        • http://www-124.ibm.com/developerworks/oss/mailman/listinfo/csm
