
Restartability Management in the Cisco Core Router CRS/NG

Stefan Schaeckeler (Cisco Systems, Inc.), Ashwin Narasimha Murthy (Google, Inc.)


Presentation Transcript


  1. Restartability Management in the Cisco Core Router CRS/NG Stefan Schaeckeler (Cisco Systems, Inc.) Ashwin Narasimha Murthy (Google, Inc.)

  2. Table of Contents • System Overview • CRS/NG Restartability Overview − Problem Definition and High Level Solution • Concrete Example − Statistics Resource Manager Library • Conclusion

  3. System Overview Core Router • Extremely complex System • SW: 16 MLOC • HW: several chassis, LCs (1 CPU, 5 NPUs, chips galore), RPs (1 CPU, chips galore), fabric cards, blade cards, … • Forms distributed System • 99.9...9% Uptime

  4. System Overview • System Manager: restarts crashed Process • HW bug • SW bug • Process must maintain State (after Crash) • CRS/NG Approach • Key data structures in shared memory • Well-written algorithms guarantee consistency • CRS 1 → CRS 3 → CRS/NG (final name?)

  5. CRS/NG Restartability Overview • CRS/NG runs Cisco IOS/XR • Cisco IOS/XR Abstraction Layer on Linux • Sophisticated IPC • Sophisticated shared memory API • Special malloc for shared memory • Static configuration file • Mapping identifiers to fixed virtual addresses • STATS_RESTART 0x50000000 • (Re)attaching to shared memory via identifier • Previously allocated objects always available • …
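The idea of mapping an identifier to a fixed virtual address and reattaching after a restart can be tried out with plain POSIX calls. The following is a minimal sketch, not the IOS/XR shared memory API: `region_attach`, the backing file path passed to it, and the macro `STATS_RESTART_ADDR` are hypothetical names, and only the address value 0x50000000 is taken from the slide. The address is passed to `mmap` as a hint and then checked, because `MAP_FIXED` would silently replace anything already mapped there.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical table entry: identifier -> fixed virtual address. */
#define STATS_RESTART_ADDR ((void *)0x50000000UL)

/*
 * Map `size` bytes of the file `path` at `addr`.
 * Returns the mapping, or NULL if it could not be placed exactly at `addr`.
 */
void *region_attach(const char *path, void *addr, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    /* Make sure the backing file is large enough; existing data is kept. */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }

    /* `addr` is only a hint here, so the result is verified below. */
    void *p = mmap(addr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */

    if (p == MAP_FAILED)
        return NULL;
    if (p != addr) {
        /* Pointers stored inside the region would be wrong elsewhere. */
        munmap(p, size);
        return NULL;
    }
    return p;
}
```

Because the region always lands at the same address, pointers stored inside it remain meaningful after the process is restarted and calls `region_attach` again with the same arguments.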

  6. CRS/NG Restartability Overview • Process requiring Restartability • Key data-structures in shared memory • Careful algorithm design to avoid • Temporary inconsistencies account1 := account1+X; account2 := account2-X; • Pointer operations (disconnection of linked lists) • Crashes during IPCs • Crashes before a return; (caller records success) • Optional recovery phase • Compromises are possible
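One common way to avoid the temporary inconsistency in the account example, and to give the optional recovery phase something to work with, is an intent record kept next to the data in shared memory. The sketch below illustrates that general technique and is not taken from CRS/NG; `bank_st`, `bank_transfer` and `bank_recover` are hypothetical names. Compiler and memory barriers, which a real implementation would need between the stores, are left out for brevity.

```c
#include <stdint.h>

/* Hypothetical layout of the state kept in shared memory. */
typedef struct {
    int64_t account1;
    int64_t account2;
    /* Intent record: target balances, meaningful only while pending != 0. */
    int64_t new1;
    int64_t new2;
    int     pending;
} bank_st;

/*
 * Recovery phase, run after every (re)start.
 * It writes absolute target values, so running it twice is harmless.
 */
void bank_recover(bank_st *b)
{
    if (b->pending) {
        b->account1 = b->new1;
        b->account2 = b->new2;
        b->pending = 0;
    }
}

/* Move x from account2 to account1. */
void bank_transfer(bank_st *b, int64_t x)
{
    /* A crash before the commit point leaves both balances untouched. */
    b->new1 = b->account1 + x;
    b->new2 = b->account2 - x;
    b->pending = 1;   /* commit point */
    /* A crash from here on is completed by bank_recover() after restart. */
    bank_recover(b);
}
```

The sum of both accounts is therefore the same before the transfer, after it, and after recovery from a crash at any step.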

  7. Concrete Example: Statistics Resource Manager Library • HW: Extremely simplified View on CRS/NG

  8. Concrete Example: Statistics Resource Manager Library • SW: Somewhat simplified View on CRS/NG Statistics Manager

  9. Concrete Example: Statistics Resource Manager Library Client Application / Library crashes → Restart • Client Application: State is gone • Stats pointers are lost • Other state is lost • Stats Lib • State is gone • Stats pointers are lost • Solution for Stats Lib • Keep freelists in shared memory • Smart algorithm for keeping state consistent

  10. Concrete Example: Statistics Resource Manager Library Step 1: Keeping State in Shared Memory

          #include <string.h>

          stats_cl_ctx_st *mstats_cl_bind (char *name) {
              /* open shmem at a predetermined address (POSIX mmap: MAP_FIXED flag) */
              char *shmem = shmwin_attach(SSE_STATS_RESTART_ADDRESS);
              stats_cl_ctx_st *con = (stats_cl_ctx_st *)(shmem + name_to_offset(name));

              if (strcmp(con->name, name) != 0) {
                  /* first bind: init "empty" context */
                  for (size_t i = 0; i <= max; i++)
                      con->freelist[i] = NULL;
                  con->mutex = 0;
                  /* the name is written last; it is what the check above tests */
                  strcpy(con->name, name);
              }
              /* else: restart; the context survived in shared memory, do nothing */
              return con;
          }

  11. Concrete Example: Statistics Resource Manager Library Step 2a: Smart Algorithm − A pragmatic Approach (chosen for CRS/NG) Few Concepts: • (Re-)moving nodes from freelist • Worst case: a page is lost (bad?) • Requesting fresh page from server • Worst case: page is lost (bad?) • Updating bitmap: mark some pointers as allocated − client does not pick up • Worst case: some pointers are lost (bad?)
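The first concept, removing a node from the freelist with "a page is lost" as the worst case, can be sketched as follows. This is an illustration under assumptions rather than the CRS/NG code: `page_node`, `freelist_ctx` and `take_page` are hypothetical names, and the list is assumed to be singly linked with its head in shared memory.

```c
#include <stddef.h>

/* Hypothetical page descriptor living in shared memory. */
typedef struct page_node {
    struct page_node *next;
    unsigned          first_ptr; /* first stats pointer covered by this page */
} page_node;

typedef struct {
    page_node *freelist; /* head of the list of free pages */
} freelist_ctx;

/*
 * Detach the head page from the freelist.
 * Shared state changes through a single pointer store, so a crash leaves
 * either the old list or the new list, and both are well formed.
 * If the crash happens after the store but before the caller has recorded
 * the returned page, the page is unreachable: it leaks, nothing worse.
 */
page_node *take_page(freelist_ctx *c)
{
    page_node *n = c->freelist;
    if (n != NULL)
        c->freelist = n->next; /* the only write to shared state */
    return n;
}
```

Dropping the return value of `take_page` models the crash case: the remaining list is still consistent and the next call hands out a different page.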

  12. Concrete Example: Statistics Resource Manager Library Discussion of worst Case Scenarios • A page (or a few Pointers within) is lost • = 256 out of 8 million stats pointers in NPU memory − no big deal • = 80 bytes out of several GB of CPU memory for node structure − no big deal • Client frees a Pointer from a lost Page: Error Code is returned; Client is irritated but has to ignore it • We never give out the same Pointer twice
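The two guarantees discussed here, an error code when freeing a pointer from a lost page and never handing out the same pointer twice, follow from the order of operations on the bitmap. Below is a minimal sketch under assumptions: the page size of 256 pointers comes from the slide, while `bitmap_page_st`, `bitmap_ctx`, `stats_alloc`, `stats_free`, the `known` flag and `MAX_PAGES` are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

#define PTRS_PER_PAGE 256 /* from the slide: a page holds 256 stats pointers */
#define MAX_PAGES     4   /* hypothetical, kept small for illustration */

typedef struct {
    int      known;                   /* does the library still track this page? */
    uint32_t base;                    /* first stats pointer of the page */
    uint8_t  used[PTRS_PER_PAGE / 8]; /* one bit per pointer, 1 = allocated */
} bitmap_page_st;

typedef struct {
    bitmap_page_st pages[MAX_PAGES];
} bitmap_ctx;

/* Returns 0 and stores a stats pointer in *out, or -1 if nothing is free. */
int stats_alloc(bitmap_ctx *c, uint32_t *out)
{
    for (int p = 0; p < MAX_PAGES; p++) {
        bitmap_page_st *pg = &c->pages[p];
        if (!pg->known)
            continue;
        for (int i = 0; i < PTRS_PER_PAGE; i++) {
            uint8_t bit = (uint8_t)(1u << (i % 8));
            if (!(pg->used[i / 8] & bit)) {
                pg->used[i / 8] |= bit;        /* 1: mark as allocated */
                *out = pg->base + (uint32_t)i; /* 2: hand out */
                /* A crash between 1 and 2 leaks this pointer, but it can
                 * never be handed out a second time. */
                return 0;
            }
        }
    }
    return -1;
}

/* Returns 0 on success, -1 if ptr is not an allocated pointer of a known page. */
int stats_free(bitmap_ctx *c, uint32_t ptr)
{
    for (int p = 0; p < MAX_PAGES; p++) {
        bitmap_page_st *pg = &c->pages[p];
        if (!pg->known || ptr < pg->base || ptr - pg->base >= PTRS_PER_PAGE)
            continue;
        uint32_t i = ptr - pg->base;
        uint8_t bit = (uint8_t)(1u << (i % 8));
        if (!(pg->used[i / 8] & bit))
            return -1; /* not allocated */
        pg->used[i / 8] &= (uint8_t)~bit;
        return 0;
    }
    return -1; /* pointer belongs to a lost page: the client has to ignore this */
}
```

Marking the bit before returning the pointer is the design choice that trades a possible small leak for the guarantee that no pointer is given out twice.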

  13. Concrete Example: Statistics Resource Manager Library Step 2b: Smart Algorithm − A perfect Approach • Complicated Algorithm / Very difficult Implementation • Further pointers in shared memory • Need to figure out where crashed and continue from there • Requirement: interacting Libraries and Processes must be "perfect" as well
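To give a feel for what "further pointers in shared memory" and "figure out where crashed" could mean, here is a small sketch of a leak-free page hand-out. It is one possible design, not the algorithm considered for CRS/NG: `restart_ctx`, the `in_flight` pointer and the three functions are hypothetical. It also assumes that the client has not durably recorded the page before `take_page_commit` is called, which is exactly the kind of requirement on interacting code that the slide mentions.

```c
#include <stddef.h>

typedef struct page_node {
    struct page_node *next;
} page_node;

/* Hypothetical context in shared memory with one extra pointer. */
typedef struct {
    page_node *freelist;  /* head of the list of free pages */
    page_node *in_flight; /* page being handed out right now, or NULL */
} restart_ctx;

/* Steps 1 and 2 of handing out a page. */
page_node *take_page_begin(restart_ctx *c)
{
    page_node *n = c->freelist;
    if (n == NULL)
        return NULL;
    c->in_flight = n;      /* step 1: record the intent */
    c->freelist = n->next; /* step 2: unlink the page */
    return n;
}

/* Step 3: the client has recorded the page, the hand-out is complete. */
void take_page_commit(restart_ctx *c)
{
    c->in_flight = NULL;
}

/* Recovery after a restart: roll back an unfinished hand-out. */
void take_page_recover(restart_ctx *c)
{
    page_node *n = c->in_flight;
    if (n == NULL)
        return; /* no hand-out was in progress */
    if (c->freelist != n) {
        /* Crashed after step 2. n->next was never modified and still
         * points at the current head, so relinking is a single store. */
        c->freelist = n;
    }
    c->in_flight = NULL;
}
```

Even this small case needs an extra pointer, an extra call on the normal path, and reasoning about every crash point, which hints at why the slide rates the approach as difficult to implement.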

  14. Conclusion Pragmatic Approach of CRS/NG • + Easy to implement • +/− Crashes: worst Case: small Memory Leak • + No Run-time Performance Hit Perfect Approach • − Very difficult to implement, error-prone • + Crashes: no Memory Leak • − Perhaps Run-time Performance Hit

  15. Thank You
