Restartability Management in the Cisco Core Router CRS/NG
Stefan Schaeckeler (Cisco Systems, Inc.), Ashwin Narasimha Murthy (Google, Inc.)
Table of Contents
• System Overview
• CRS/NG Restartability Overview
  − Problem Definition and High Level Solution
• Concrete Example
  − Statistics Resource Manager Library
• Conclusion
System Overview
Core Router
• Extremely complex System
  − SW: 16 MLOC
  − HW: several chassis, LCs (1 CPU, 5 NPUs, chips galore), RPs (1 CPU, chips galore), fabric cards, blade cards, …
• Forms a distributed System
• 99.9...9% Uptime
System Overview
• System Manager: restarts a crashed Process
  − HW bug
  − SW bug
• Process must maintain its State (after a Crash)
• CRS/NG Approach
  − Key data structures in shared memory
  − Well-written algorithms guarantee consistency
• CRS 1, CRS 3, CRS/NG (final name?)
CRS/NG Restartability Overview
• CRS/NG runs Cisco IOS/XR
• Cisco IOS/XR Abstraction Layer on Linux
  − Sophisticated IPC
  − Sophisticated shared memory API
    − Special malloc for shared memory
    − Static configuration file
      − Mapping identifiers to fixed virtual addresses
      − STATS_RESTART 0x50000000
    − (Re)attaching to shared memory via identifier
    − Previously allocated objects always available
  − …
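The mapping from an identifier to a fixed virtual address can be illustrated with plain POSIX calls. This is only a minimal sketch, not the IOS/XR shared memory API: the table `shm_config`, the functions `window_attach` and `window_detach`, the window size, and the use of a backing file are assumptions made for illustration. Only the identifier `STATS_RESTART` and the address `0x50000000` come from the slide.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for the static configuration file:
 * identifier -> fixed virtual address (size is an assumption). */
struct shm_window {
    const char *id;
    uintptr_t   address;
    size_t      size;
};

static const struct shm_window shm_config[] = {
    { "STATS_RESTART", 0x50000000u, 1u << 16 },
};

static const struct shm_window *window_lookup(const char *id)
{
    for (size_t i = 0; i < sizeof shm_config / sizeof shm_config[0]; i++)
        if (strcmp(shm_config[i].id, id) == 0)
            return &shm_config[i];
    return NULL;
}

/* Attach, or re-attach after a restart, the window named id.
 * The window is backed by the file at path so that its content
 * outlives the process. Returns NULL on failure. */
static void *window_attach(const char *id, const char *path)
{
    assert(id != NULL && path != NULL);
    const struct shm_window *w = window_lookup(id);
    if (w == NULL)
        return NULL;

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)w->size) != 0) {
        close(fd);
        return NULL;
    }
    /* The address is passed as a hint and checked afterwards.
     * MAP_FIXED would force it, but silently replaces any mapping
     * that already occupies the range. */
    void *p = mmap((void *)w->address, w->size,
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return NULL;
    if (p != (void *)w->address) {   /* pointers stored inside would be wrong */
        munmap(p, w->size);
        return NULL;
    }
    return p;
}

static int window_detach(const char *id, void *p)
{
    const struct shm_window *w = window_lookup(id);
    return w != NULL ? munmap(p, w->size) : -1;
}
```

Because the window always appears at the same virtual address, raw pointers stored inside it stay valid across a restart, which is what makes previously allocated objects available again without any relocation step.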
CRS/NG Restartability Overview
• Process requiring Restartability
  − Key data structures in shared memory
  − Careful algorithm design to avoid
    − Temporary inconsistencies: account1 := account1+X; account2 := account2-X;
    − Pointer operations (disconnection of linked lists)
    − Crashes during IPCs
    − Crashes before a return; (caller records success)
  − Optional recovery phase
• Compromises are possible
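The account example and the optional recovery phase can be sketched with a small redo journal. This is one common way to close the window between the two updates and is not taken from the CRS/NG code: the struct `bank`, the functions `transfer` and `recover`, and the `crash_after` parameter (which simulates a crash by returning early) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state that would live in shared memory and
 * therefore survive a process restart. */
struct bank {
    long account1;
    long account2;
    /* Journal, written before the accounts are touched. */
    bool pending;     /* true while a transfer is in flight */
    long old1, old2;  /* balances before the transfer       */
    long amount;      /* X                                  */
};

/* account1 := account1 + x; account2 := account2 - x;
 * crash_after = 1..3 simulates a crash after that step, 0 = no crash.
 * A real implementation also needs memory barriers so that the
 * stores reach shared memory in this order. */
static void transfer(struct bank *b, long x, int crash_after)
{
    assert(!b->pending);          /* recovery must have run first */
    b->old1 = b->account1;
    b->old2 = b->account2;
    b->amount = x;
    b->pending = true;            /* step 1: journal is valid */
    if (crash_after == 1) return;
    b->account1 = b->old1 + x;    /* step 2 */
    if (crash_after == 2) return;
    b->account2 = b->old2 - x;    /* step 3 */
    if (crash_after == 3) return;
    b->pending = false;           /* step 4: commit */
}

/* Recovery phase, run once after a restart. It recomputes both
 * balances from the journal, so running it twice is harmless. */
static void recover(struct bank *b)
{
    if (!b->pending)
        return;
    b->account1 = b->old1 + b->amount;
    b->account2 = b->old2 - b->amount;
    b->pending = false;
}
```

A crash before step 1 leaves both accounts untouched; a crash at any later point is completed by `recover`. In either case the sum of the two accounts is the same before and after.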
Concrete Example: Statistics Resource Manager Library
• HW: Extremely simplified View on CRS/NG (figure not reproduced)
Concrete Example: Statistics Resource Manager Library
• SW: Somewhat simplified View on CRS/NG Statistics Manager (figure not reproduced)
Concrete Example: Statistics Resource Manager Library
Client Application / Library crashes → Restart
• Client Application: State is gone
  − Stats pointers are lost
  − Other state is lost
• Stats Lib
  − State is gone
  − Stats pointers are lost
• Solution for Stats Lib
  − Keep freelists in shared memory
  − Smart algorithm for keeping state consistent
Concrete Example: Statistics Resource Manager Library
Step 1: Keeping State in Shared Memory

    stats_cl_ctx_st *mstats_cl_bind (char *name) {
        char *shmem;              /* char * so that offset arithmetic is valid C */
        stats_cl_ctx_st *con;
        int i;

        /* open shmem at a predetermined address (POSIX mmap: MAP_FIXED flag) */
        shmem = shmwin_attach(SSE_STATS_RESTART_ADDRESS);
        con = (stats_cl_ctx_st *)(shmem + name_to_offset(name));

        if (strcmp(con->name, name) != 0) {
            /* first bind: init "empty" context */
            for (i = 0; i <= max; i++)
                con->freelist[i] = NULL;
            con->mutex = 0;
            /* name is written last: a crash during init looks like a first bind again */
            strcpy(con->name, name);
        } else {
            /* restart: do nothing, just return con */
        }
        return con;
    }
Concrete Example: Statistics Resource Manager Library
Step 2a: Smart Algorithm − A pragmatic Approach (chosen for CRS/NG)
A few Concepts:
• (Re-)moving nodes from freelist
  − Worst case: a page is lost (bad?)
• Requesting fresh page from server
  − Worst case: a page is lost (bad?)
• Updating bitmap: mark some pointers as allocated − client does not pick them up
  − Worst case: some pointers are lost (bad?)
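The first concept, removing a node from the freelist, can be sketched as follows. The idea is to order the stores so that a crash can only leak a page and can never leave the list broken or hand the same page out twice. The types `page_node` and `freelist`, both functions, and the `crash` flag (which simulates a crash by returning early) are hypothetical; the slides do not show the actual CRS/NG algorithm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical freelist node describing one page of stats pointers. */
struct page_node {
    struct page_node *next;
    unsigned          base;   /* first stats pointer of the page */
};

/* List head; would live in shared memory. */
struct freelist {
    struct page_node *head;
};

/* Take the first page off the list. The list changes with a single
 * pointer store, so it is consistent both before and after it.
 * crash != 0 simulates a crash right after the store, before the
 * caller ever sees the node: that page is then unreachable (leaked). */
static struct page_node *freelist_pop(struct freelist *fl, int crash)
{
    struct page_node *n = fl->head;
    if (n == NULL)
        return NULL;
    fl->head = n->next;       /* unlink first ...            */
    if (crash)
        return NULL;
    return n;                 /* ... hand out only afterwards */
}

/* Put a page back. The node is fully prepared before the single
 * store that makes it visible. */
static void freelist_push(struct freelist *fl, struct page_node *n)
{
    assert(n != NULL);
    n->next = fl->head;
    fl->head = n;
}
```

Since the page is unlinked before it is handed out, a restarted process that re-attaches to the freelist will not find the lost page there, so it cannot be given out a second time.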
Concrete Example: Statistics Resource Manager Library
Discussion of worst Case Scenarios
• A page (or a few Pointers within) is lost
  − = 256 out of 8 million stats pointers in NPU memory (about 0.003%) − no big deal
  − = 80 bytes out of several GB of CPU memory for the node structure − no big deal
• Client frees a Pointer from a lost Page
  − Error Code is returned
  − Client is irritated but has to ignore it
• We never give out the same Pointer twice
Concrete Example: Statistics Resource Manager Library
Step 2b: Smart Algorithm − A perfect Approach
• Complicated Algorithm / Very difficult Implementation
  − Further pointers in shared memory
  − Need to figure out where the process crashed and continue from there
• Requirement: interacting Libraries and Processes must be "perfect" as well
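To give an idea of what "further pointers in shared memory" could mean, the following sketch adds one extra pointer that records which node is being removed, so that a recovery step can tell where the process crashed and undo the half-finished removal. It is an illustration under assumptions, not the algorithm the authors considered: `stats_node`, `stats_ctx`, `perfect_pop`, `perfect_recover`, and the `crash_after` parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct stats_node {
    struct stats_node *next;
    int                id;
};

/* Both fields would live in shared memory. */
struct stats_ctx {
    struct stats_node *head;
    struct stats_node *in_flight;  /* extra pointer: node being removed */
};

/* crash_after = 1..2 simulates a crash after that step, 0 = no crash.
 * A real implementation also needs memory barriers to keep the
 * stores in this order. */
static struct stats_node *perfect_pop(struct stats_ctx *c, int crash_after)
{
    assert(c->in_flight == NULL);  /* recovery must have run first */
    struct stats_node *n = c->head;
    if (n == NULL)
        return NULL;
    c->in_flight = n;              /* step 1: record the intent */
    if (crash_after == 1) return NULL;
    c->head = n->next;             /* step 2: unlink */
    if (crash_after == 2) return NULL;
    c->in_flight = NULL;           /* step 3: removal complete */
    return n;
}

/* Recovery after a restart: a non-NULL in_flight means the crash hit
 * between step 1 and step 3. The node's next pointer was never
 * modified, so making it the head again restores the list exactly,
 * whether or not step 2 had happened. Running this twice is harmless. */
static void perfect_recover(struct stats_ctx *c)
{
    if (c->in_flight == NULL)
        return;
    c->head = c->in_flight;
    c->in_flight = NULL;
}
```

Even this version still loses the node if the crash happens after step 3 but before the caller has stored the returned pointer somewhere durable, which is one way to read the requirement that interacting libraries and processes must be "perfect" as well.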
Conclusion
Pragmatic Approach of CRS/NG
• + Easy to implement
• +/− Crashes: worst Case: small Memory Leak
• + No Run-time Performance Hit
Perfect Approach
• − Very difficult to implement, error prone
• + Crashes: no Memory Leak
• − Perhaps Run-time Performance Hit
Thank You