High End Computing at SDSC
CSM Cluster Management
Eva Hocks
San Diego Supercomputer Center, 2007

Managing the HPC systems: DataStar
• System Software:
  • AIX 5.2 ML3
  • CSM 1.3.3.1
  • RSCT 2.3.3.3
• System Management with CSM:
  • Node setup
  • Node Groups
    • Per frame
    • Per function (NPACI, TG, POE, login, batch)

CSM setup nodes
• Configure Nodes
  • lshwinfo -p hmc -c dshmc07.sdsc.edu > /tmp/fr8_9
  • vi /tmp/fr8_9 : replace the no_hostname placeholder with the cec_name
      no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF
      ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519
  • definenode -f /tmp/fr8_9 InstallOSName=AIX
  • systemid -p hmc hscroot
  • getadapters -n ds100 -z /tmp/ds100_adapters
      write to CSM database, include Federation_switch adapters
  • csm2nimnodes -n 'ds100' type='standalone' network_name='sdsc_net' platform='chrp' netboot_kernel='mp'
  • netboot -n ds100
  • updatenode -n ds100

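When a whole frame is defined at once, the manual vi step can be scripted. The following is a minimal sketch, not the procedure used at SDSC. It assumes a hypothetical mapping file (one "partition-id hostname" pair per line) and assumes that the fourth "::"-separated field identifies the partition, as in the sample lines on this slide. The function name fill_hostnames and the mapping file are illustrative only.

```shell
#!/bin/sh
# Replace the no_hostname placeholder in saved lshwinfo output with node names.
# Usage: fill_hostnames MAPFILE HWINFO_FILE > NEW_HWINFO_FILE
# MAPFILE is a hypothetical helper file: "partition-id hostname", one per line.
fill_hostnames() {
    awk -v mapfile="$1" '
        BEGIN {
            # Load the partition-id -> hostname table from the mapping file.
            while ((getline line < mapfile) > 0) {
                split(line, pair, " ")
                name[pair[1]] = pair[2]
            }
            FS = "::"; OFS = "::"
        }
        # Rewrite only entries that still carry the placeholder and have a mapping.
        $1 == "no_hostname" && ($4 in name) { $1 = name[$4] }
        # Every line is passed through, changed or not.
        { print }
    ' "$2"
}
```

Entries that already have a hostname, and placeholder entries without a mapping, pass through unchanged, so the result can be reviewed before it is handed to definenode.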
CSM_ADAPTERS_STANZA_FILE
ds100:
    MAC_address=00096B34E093
    adapter_duplex=full
    adapter_speed=100
    cable_type=N/A
    install_server=192.168.236.31
    interface_name=en0
    location=U1.32-P1-H1/E1
    machine_type=install
    netaddr=
    network_type=en
    subnet_mask=
ds100:
    machine_type=secondary
    interface_name=sn1
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q2
ds100:
    machine_type=secondary
    interface_name=sn0
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q1

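Before the adapter data is written to the CSM database, it can help to check which adapters getadapters actually found. The sketch below prints one line per stanza (node, interface, machine type, location). It assumes that the file on disk holds a "node:" header line followed by one attribute=value pair per line; the function name list_adapters is illustrative and is not a CSM command.

```shell
#!/bin/sh
# Summarize an adapter stanza file: one output line per stanza.
# Usage: list_adapters STANZA_FILE
list_adapters() {
    awk '
        # Print the stanza collected so far, then reset the per-stanza fields.
        function emit() {
            if (node != "") print node, ifname, mtype, loc
            ifname = ""; mtype = ""; loc = ""
        }
        # A header line such as "ds100:" starts a new stanza.
        /^[^ \t#][^=]*:[ \t]*$/ {
            emit()
            node = $1
            sub(/:$/, "", node)
            next
        }
        # Attribute lines: split at the first "=" and keep the fields we report.
        {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq == 0) next
            key = substr(line, 1, eq - 1)
            val = substr(line, eq + 1)
            if (key == "interface_name") ifname = val
            else if (key == "machine_type") mtype = val
            else if (key == "location") loc = val
        }
        END { emit() }
    ' "$1"
}
```

For the stanza file on this slide, a call such as list_adapters /tmp/ds100_adapters would list en0 as the install adapter and sn0 and sn1 as secondary (switch) adapters.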
Managing the HPC systems: DataStar
• System Management with CSM:
  • Management through the command line
    • rpower
      • Power on/off, query node status
      • Install node: netboot -n ds100
    • dsh
      • Install updates on nodes (installp, rpm, emgr)
      • Monitor processes on nodes

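The command-line tasks above lend themselves to small wrapper scripts. The following sketch is an assumption about how such a wrapper might look, not a script from the talk. It uses only the -n (node) flag of rpower and dsh together with the rpower query action, and it prints the commands instead of running them unless DRY_RUN is set to 0. The names run and node_checkup are illustrative.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

# Run a command, or just show it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Query the power state of one node, then look for the LoadLeveler
# start daemon (the process monitored later in this deck) on that node.
node_checkup() {
    node=$1
    run rpower -n "$node" query
    run dsh -n "$node" "ps -ef | grep LoadL_startd"
}
```

Calling node_checkup ds100 with the default settings only echoes the two commands, which makes it safe to try on the management server before setting DRY_RUN=0.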
Managing the HPC systems: DataStar continued…
• System configuration: cfmupdatenode
  • Synchronize system configuration modifications with nodes and system admins
  • Run pre/post scripts to capture security risks and send notification
• System monitoring: Distributed Monitoring responses (GUI configured)
  • Event driven email notification for on-call personnel
  • GUI monitoring for operations personnel

CSM Event Monitoring
• GUI Event Monitoring
  • Critical Conditions:
    • AnyNodeTmpFull
    • AnyNodeVarSpace
    • AnyNodeSwitchResponds
    • LoadLeverProcess
    • hostResponds (see "setting up ERRM Conditions")
  • Warning Conditions:
    • Processor State

CSM Event Monitoring: setting up ERRM Conditions
• hostResponds ERRM condition (redbook SG24-6953 page 193)
  • mkcondition -r IBM.ManagedNode \
        -e "Status!=1" -E "Status==1" \
        -d "Node hostResponds down" \
        -D "Node hostResponds up" \
        -m l hostResponds
  • mkresponse -n LogStatustoFIFO \
        -s /usr/local/bin/LogStatusData \
        -E "STATUS_FILE=/var/adm/spmondata" LogStatusData
  • mkcondresp "hostResponds" "LogStatusData"

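The slides do not show the contents of /usr/local/bin/LogStatusData. The sketch below is therefore an assumption about what a response script of this kind could do, not the script used at SDSC. It relies on the environment variables that ERRM sets for response scripts (ERRM_TIME, ERRM_RSRC_NAME, ERRM_COND_NAME, ERRM_ATTR_NAME, ERRM_VALUE) and on the STATUS_FILE variable passed with mkresponse -E. The one-line record format is invented for illustration.

```shell
#!/bin/sh
# Append one status record per event or rearm event to $STATUS_FILE.
# ERRM_* variables are set by ERRM when it runs a response script;
# the "unknown" fallbacks only matter when the function is run by hand.
log_status() {
    status_file=${STATUS_FILE:-/var/adm/spmondata}
    printf '%s %s %s %s=%s\n' \
        "${ERRM_TIME:-unknown}" \
        "${ERRM_RSRC_NAME:-unknown}" \
        "${ERRM_COND_NAME:-unknown}" \
        "${ERRM_ATTR_NAME:-Status}" \
        "${ERRM_VALUE:-unknown}" >> "$status_file"
}

# When installed as the response script, the file would end with:
# log_status
```

Keeping the logic in a function makes it possible to exercise the record format by hand, with the variables set in the shell, before the response is linked to a condition.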
Event notification

Warning Event email:
=====================================
Monday 07/26/04 19:12:34
Condition Name: LoadLProcess
Severity: Warning
Event Type: Event
Expression: Processes.CurPidCount <= 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [0,1,{},{282654}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================

Rearm email:
=====================================
Monday 07/26/04 19:13:32
Condition Name: LoadLProcess
Severity: Warning
Event Type: Rearm event
Expression: Processes.CurPidCount > 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [1,0,{270492},{270492}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================

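When many of these notifications accumulate, a one-line summary per event is easier to scan. The sketch below condenses saved notifications into "time | condition | event type | node" lines. It assumes the messages have been saved to a plain text file with one field per line, as in the layout of the two emails on this slide. The function name summarize_events is illustrative.

```shell
#!/bin/sh
# Condense saved ERRM notification blocks into one line per event.
# Usage: summarize_events NOTIFICATION_FILE
summarize_events() {
    awk '
        # Separator lines carry no information.
        /^=+$/ { next }
        /^Condition Name:/ { sub(/^Condition Name:[ \t]*/, ""); cond = $0; next }
        /^Event Type:/     { sub(/^Event Type:[ \t]*/, "");     type = $0; next }
        # "Node Name:" closes the fields we need; "Node NameList:" does not match.
        /^Node Name:/ {
            sub(/^Node Name:[ \t]*/, "")
            print stamp " | " cond " | " type " | " $0
            next
        }
        # Timestamp line, for example "Monday 07/26/04 19:12:34".
        /^[A-Z][a-z]+day [0-9][0-9]\/[0-9][0-9]\/[0-9][0-9] / { stamp = $2 " " $3 }
    ' "$1"
}
```

Applied to the two messages on this slide, the summary shows the LoadLProcess event on ds243 followed by its rearm event about a minute later.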
CSM Information
• CSM Guide for the PSSP Systems Administrator, SG24-6953
  • Useful scripts for ERRM conditions
  • Command cross reference
• IBM CSM for AIX 5L Administration Guide, SA22-7918
  • CSM error messages
• Web Sites
  • http://www-124.ibm.com/developerworks/oss/mailman/listinfo/csm