

1. SunFire 15K Enterprise-Grade Server
March 14, 2003
Implementation Review

2. Overview
• Introduction of SunFire 15K architecture and concepts
• Hardware RAS features
• Dynamic Domains
• System requirements
• Number of domains
• Resources for each domain
• Expansion
• Process RAS features
• Risk factors and risk mitigation
• Current Status
• Schedule

3. Sun Fire 15K: “Highly redundant, symmetric multi-processing server with a shared memory architecture”
Features:
• 1 to 18 CPU/Memory boards
• 4 CPUs/board @ 1050MHz
• 1 to 8GB/CPU = up to 576GB memory
• 1 to 18 I/O boards, 4 PCI slots/board
• Can trade I/O boards for extra CPUs
• Partition into 1 to 18 dynamic domains

4. SunFire 15K: High RAS Features
• Reliability:
  • Fully redundant CPU/Memory and I/O boards, PCI cards
  • Dual System Controllers
  • Dual System Clock
  • Dual grid power, redundant power supplies
  • Redundant fans
  • Environmental monitoring
• Serviceability:
  • Hot-swap CPU/Memory boards
  • Hot-swap I/O boards, PCI components
  • Hot-swap System Controller
  • Hot-swap power supplies
  • Hot-swap fans
  • Full remote diagnostics

5. Dynamic Domains
• CPU/Memory Boards and I/O Boards can be re-assigned on-the-fly to other domains when needed, e.g.:
  $ moveboard SB2 -d BESSIEB
• By hand:
  • Reallocate resources from the operational and development domains to the I&T domain for full-load performance testing
  • Take boards off-line for maintenance (hot swap)
• Automatically, by programs or monitoring software:
  • At times of peak load, reallocate resources from the Development and I&T domains to the operational domain (see the sketch below)
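As a hedged illustration of the peak-load case, the SMS-style session below moves one CPU/Memory board from the I&T domain into the operational domain. The board name SB4 and domain name OPSDOM are hypothetical, and the exact flag order may differ by SMS release:

  $ showboards                 # list boards and their current domain assignments
  $ moveboard SB4 -d OPSDOM    # re-home board SB4 into the operational domain
  $ showboards -d OPSDOM       # confirm SB4 now belongs to OPSDOM

Moving the board back once the peak subsides is the same moveboard command with the I&T domain as the target.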

6. Data Processing Software Systems
• Pre-archive processing & Ingest:
  • Science data receipt and processing: Science Pipelines (OPUS)
  • Engineering data receipt and processing (EDPS)
  • Archive Ingest
• Distribution:
  • Archive distribution (DADS)
  • On-the-fly reprocessing (OTFR)
• Calibration:
  • Calibration pipeline and database
• Database servers:
  • Pipeline Processing, Ingest/Distribution DB “CATLOG”
  • Archive Catalog Browsing DB “ZEPPO”
• Will not support user interfaces: StarView, Web, APT

7. Number of Domains
• High-level requirements:
  • Separate Development, Integration & Test (I&T) and Operational environments
  • Protect Ingest from Distribution
  • Respond to the user community
• Other requirements:
  • Separate pipeline computing from database servers
  • Separate the DB for external users (ZEPPO) from the internal operational DB (CATLOG)
  • Isolate OS and COTS testing and patching
• Maximum number of domains: 3 environments × (2 pipeline domains + 2 database domains) + 1 OS/COTS domain = 3 × (2 + 2) + 1 = 13
• BUT: more domains = more fragmentation = less flexibility
• Must balance flexibility with the need for isolation

8. Number of Domains (cont.)
• For databases, combine the Development and Integration & Test domains for CATLOG and ZEPPO (saves 3 domains)
• Combine Pre-archive processing & Ingest with Distribution (saves 3 domains):
  • They use similar processing pipelines
  • Protect Ingest from Distribution by dynamically adding resources when needed
  • Protect Ingest from Distribution by binding Ingest processes to dedicated CPU and memory resources (see the sketch below)
  • Protect Ingest from Distribution using new features in DADS 10.*
  • Closely monitor performance, and fall back to 2 separate domains as a contingency plan
• Number of domains = 13 − 3 − 3 = 7
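The CPU binding mentioned above could use standard Solaris processor sets; a minimal sketch, assuming hypothetical CPU ids and an Ingest process id:

  $ psrset -c 8 9 10 11    # create a processor set from CPUs 8-11
  $ psrset -b 1 12345      # bind the Ingest process (pid 12345) to set 1
  $ psrset -i              # list processor-set assignments to verify

Processes outside set 1 (e.g. Distribution) cannot run on those four CPUs, so Ingest keeps guaranteed cycles even under full Distribution load.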

9. The 7 Domains

10. Domain Resources
• CPU/Memory Boards:
  • 4 CPUs / board
  • CPUs run at 1.05GHz
  • 1-8GB of memory / CPU = 4-32GB / board
• I/O Boards:
  • Provide external connections to SAN, network, disks
  • 4 PCI slots / board: 2 slots @ 33MHz, 2 slots @ 66MHz

11. 1: Development Domain
• Supports development teams for:
  • OPUS (EDPS, OTFR, etc.)
  • DADS 10.*
  • IRAF/STSDAS
  • Calibration pipelines
  • Calibration reference data
• Today (combined with Testing), excluding desktops:
  • Tru64, Solaris: ~9-13 CPUs, <1GB/CPU, 500MHz
• Domain requirements:
  • 2 CPU Boards, 8 CPUs, 4GB/CPU
  • 1 I/O Board (not mission critical)

12. 2: DADS/OPUS/OTFR Domain
• Compare Pre-archive pipeline, Ingest and Distribution performance requirements with current performance
• Use the outcome to scale current resources to domain requirements, accounting for faster CPUs and the new architecture
• Account for the new software architecture of DADS 10.*
• Account for the lack of modeling with a safety margin
• Account for projected growth:
  • Short term: Distribution (ACS)
  • Intermediate: new algorithms
  • Longer term: Pre-archive pipelines, Ingest (SM4)
  • Overall increased use of 20%/year

13. 2: DADS/OPUS/OTFR Domain (cont.)
• Today:
  • Baseline Pre-archive processing and Ingest performance is within requirements
  • Remember: failures are addressed by the architecture
  • Baseline Distribution & OTFR performance is barely within requirements
  • Current systems are maxed out

14. 2: DADS/OPUS/OTFR Domain (cont.)
• Today:
  • Tru64 cluster: 12 CPUs @ 500MHz, 1GB/CPU
  • 1 Sun 280R: 2 CPUs @ 750MHz (EDPS)
  • 3 OpenVMS systems: 1 CPU @ 250MHz, 0.5-1.5GB
• Domain CPU/Memory requirements (all @ 1GHz, 4GB/CPU):
  • Baseline: 9 CPUs
  • New software architecture (DADS 10.*): 6 CPUs
  • Short-term growth, ACS + 20%: 3 CPUs
  • Margin: 2 CPUs
  • Total: 9 + 6 + 3 + 2 = 20 CPUs

15. 2: DADS/OPUS/OTFR Domain (cont.)
• Total domain CPU/Memory requirements:
  • 5 CPU Boards = 20 CPUs @ 1GHz, 4GB/CPU
• Total domain I/O requirements:
  • Operational, so redundant: 2 I/O Boards
  • Can be multiplexed if necessary for performance
• Remember: dynamic domains:
  • We can re-assign resources on-the-fly, especially from the I&T domain, to handle peak loads and longer-term fluctuations

16. 3: Integration & Test Domain
• Realistic end-to-end load and performance testing
• Identical to the operational DADS/OPUS/OTFR domain
• Today: non-existent
• Domain requirements:
  • 5 CPU Boards, 20 CPUs @ 1GHz, 4GB/CPU
• Remember: dynamic domains
  • Full-load performance tests happen regularly, but not daily
  • Full-load performance tests are highly controlled, discrete, scheduled events
  • I&T resources can be re-assigned, e.g. to the DADS/OPUS/OTFR domain, when not needed

17. 4, 5, 6: Database Domains
(More details in the afternoon “Databases” presentation)
• Operational DB, CATLOG:
  • Today: 4 CPUs @ 300MHz, 0.5GB total
  • Anticipate increased load because of faster pipelines and new instruments
  • Domain requirements: 2 CPU Boards, 8 CPUs @ 1GHz, 2GB/CPU; 2 I/O Boards (redundancy)
• Archive Catalog Browsing DB, ZEPPO:
  • Today: 2 CPUs @ 300MHz, 0.6GB total
  • Domain requirements: 1 CPU Board, 4 CPUs @ 1GHz, 2GB/CPU; 2 I/O Boards (redundancy)
• Development/test:
  • Today: 2×2 CPUs @ 200MHz, 1GB total
  • Domain requirements: 1 CPU Board, 4 CPUs @ 1GHz, 2GB/CPU; 1 I/O Board

18. 7: OS & COTS Testing, Patches
• Test the next version of the OS
• Test patches, COTS upgrades, system procedures
• Today: n.a. or scattered
• Domain requirements:
  • 1 CPU Board, 4 CPUs @ 1GHz, 4GB/CPU
  • 1 I/O Board (not mission critical)
• Remember: dynamic domains
  • This domain can be shut down when not needed
  • Its resources can be reassigned, e.g. to the DADS/OPUS/OTFR domain (see the sketch below)
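A hedged sketch of that idle-and-reassign step with SMS-style commands; the domain id R and board name SB17 are hypothetical, and the exact workflow may differ by SMS release:

  $ setkeyswitch -d R off      # power off the OS/COTS test domain
  $ moveboard SB17 -d OPSDOM   # lend its CPU/Memory board to the operational domain
  $ showboards -d OPSDOM       # verify the board arrived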

19. SunFire 15K Nominal Domain Layout

20. SunFire 15K Peak-Load Domain Layout

21. Future Growth
• Today (contingencies):
  • Add 1 CPU/Memory Board, 4 CPUs
  • Add 8 I/O Boards or 16 “MaxCPU” CPUs
  • Add 300GB of RAM
  • Upgrade to 1.2GHz CPUs
• One to two years:
  • Double the number of CPUs: 8 CPUs / board
  • Increased CPU clock speed
• All within the box

22. Process RAS Features
• STScI administration and software configuration RAS features
• Sun Management Center: ease of management, monitoring, and capture of system performance metrics (see the sketch below)
• Use dynamic server domains to keep the science flowing
• Ability to prioritize processing in the event of a problem
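Alongside Sun Management Center, per-domain baseline metrics can also be captured with stock Solaris tools; a minimal sketch, with illustrative interval and count values:

  $ mpstat 60 5      # per-CPU utilization, five 60-second samples
  $ prstat -a 60 5   # per-process and per-user resource usage, same cadence

Periodic captures like these provide the baseline against which domain-resizing decisions can be made.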

23. Risk Factors and Mitigation
• Schedule-slippage risk mitigation:
  • Contract imposes a penalty for late delivery
  • Decouple database migration from milestones
  • Can keep old equipment past the end of the project
  • Loaner system to get a head start
• Technical risk mitigation:
  • Use the loaner to detect issues and find solutions early
  • Extensive staff training included in the contract to mitigate new-technology risks
• Operational risks and mitigations are discussed in later presentations

24. Current Status
• Order placed Feb 3rd; expected time of arrival Mar 4
• Loaner up and running with two domains
• Training started
• Completed site survey and preparation (power, floor, environment)
• Started interviewing operations staff, engineers, support staff and scientists to refine the use model (later presentation)

25. High-Level Schedule
• Initial domain design: March 12
• System setup & integration: March 24
  • Physical setup, power
  • Network
  • Sun’s Application Readiness Process
  • System benchmarks
• Domain configuration: May 1
  • OS install, patch, institutionalize
  • Test backup/recovery, SMC, basic reporting
  • 3rd-party software
  • Documentation, review
  • Clone other domains

26. High-Level Schedule (cont.)
• Full system tests
  • Run benchmarks to establish the system baseline
• Develop procedures
  • System and account management
  • Backup/restore, disaster recovery
  • Train support staff
• Hand over the 3 Development, I&T and Operational domains to ESS: May 22
• Database domain configuration
  • Customize OS, 3rd-party applications
• Hand over the 3 DB domains to ESS: June 6

27. Schedule
