Communication Software Development and Management

Course OD601: Communication Software Development and Management
Class hours: 32; Credits: 2; Instructor: Luo Wenbin (罗文彬)

Class Subjects
• Communication Overview
• System Architecture Overview
• Performance and Reliability
• Operation, Administration, & Maintenance
• Development Methodology
• ISO9000/TL9000
• CMMI
• Project Management

Network System Characteristics
High performance and reliability are always key factors in a network system because they directly affect the economics of network operation. Reducing network downtime increases network availability, which in turn increases network revenue. To ensure network availability, network operators routinely require equipment vendors to provide products with "five nines" (99.999%) or "six nines" (99.9999%) availability.

System Performance
A system consists of hardware and software, and their combined performance determines the overall system performance.
[Figure: hardware components: CPU, memory, I/O card]

Performance Factors
Software factors
• Software logic in high-runner functions
Hardware factors
• CPU
• Memory
• I/O card
• Disk

High-Runner Software Functions
• System calls
• Message parsing
• Call logic
• Data access
• I/O
• Disk access
• Threads and processes

System Call
A process running in user mode uses the CPU continuously. A system call requires the process to enter kernel mode to access resources that only the kernel controls, and when the call completes, the process returns to user mode. This transition between user mode and kernel mode (referred to here as a "process context switch") costs extra CPU time. A process therefore uses the CPU much more efficiently when it makes fewer system calls.

Buffered I/O
[Figure: sending process, write, shared memory, read, receiving process]
Data has to pass between different processes to accomplish the desired software tasks. Buffering the data before it is passed can greatly reduce the number of inter-process communication (IPC) operations. Because IPC involves system calls, buffered I/O is critical to improving real-time performance.
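The effect of buffering on the number of system calls can be sketched as follows. This is a minimal illustration, not code from the course: the `BatchingWriter` class, the 4096-byte batch size, and the use of the null device as a stand-in for an IPC channel are all assumptions chosen to keep the example self-contained.

```python
import os


class BatchingWriter:
    """Hypothetical helper: collects small messages, writes them in batches."""

    def __init__(self, fd, batch_bytes=4096):
        self.fd = fd
        self.batch_bytes = batch_bytes  # assumed batch size
        self.pending = bytearray()
        self.write_calls = 0  # number of os.write() system calls issued

    def send(self, message):
        # Queue the message; only touch the kernel once the batch is full.
        self.pending += message
        if len(self.pending) >= self.batch_bytes:
            self.flush()

    def flush(self):
        view = memoryview(bytes(self.pending))
        while len(view) > 0:  # loop in case the kernel accepts a partial write
            written = os.write(self.fd, view)
            self.write_calls += 1
            view = view[written:]
        self.pending.clear()


def demo(message_count=1000, message=b"x" * 16):
    """Return (unbuffered_calls, buffered_calls) for the same traffic."""
    fd = os.open(os.devnull, os.O_WRONLY)  # stand-in for a pipe or socket
    try:
        unbuffered_calls = 0
        for _ in range(message_count):
            os.write(fd, message)  # one system call per message
            unbuffered_calls += 1

        writer = BatchingWriter(fd)
        for _ in range(message_count):
            writer.send(message)
        writer.flush()  # send whatever is left in the last partial batch
        return unbuffered_calls, writer.write_calls
    finally:
        os.close(fd)
```

The trade-off is latency: a message waits in the buffer until the batch fills or `flush()` is called, so a real-time system would normally also flush on a timer.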

Database Optimization
Database performance is critical to overall system performance. Commercial databases usually provide tools for optimizing database performance, and these tools should be run on a regular basis. Products that can keep data in memory, such as TimesTen, MySQL, and Berkeley DB, are commonly used as real-time databases.

Hardware Technology
[Figure: four CPUs sharing memory]
Hardware resources: CPU, memory, disk
• Under normal traffic, system resources should be evenly utilized, up to 40%.
• Under overload traffic, system resources should be evenly utilized, up to 80%.
• The high-runner processes or threads should be spread evenly across the CPUs.
• Disk I/O should be distributed evenly across all disks.
• The configuration that gives optimal throughput has to be found by testing in a lab environment with simulated traffic.

Threads/Processes & CPU
[Figure: Proc A, Proc B, and Proc C running on CPU 1 and CPU 2]
The number of threads in each process should be proportional to the CPU time that the process uses. For example, if Proc A, B, and C use CPU time in the ratio 1 : 2 : 1.5, the numbers of threads in Proc A, B, and C should be in the ratio 2 : 4 : 3. The number of threads also depends on the characteristics of the input messages: more threads are needed when each message takes longer to process. The same thread/process ratio should be replicated on every CPU. With two CPUs, for example, either run two identical copies of each process or run one copy of each process with twice as many threads, keeping the same ratio.
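The ratio arithmetic in this example can be written out as a small sketch. The function names `thread_ratio` and `scale_for_cpus` are hypothetical, and the code only reproduces the proportional rule stated above. It does not account for message characteristics.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def _lcm(a, b):
    return a * b // gcd(a, b)


def thread_ratio(cpu_time_ratio):
    """Smallest whole-number thread counts proportional to CPU time."""
    # Go through str() so that 1.5 becomes exactly 3/2.
    parts = [Fraction(str(value)) for value in cpu_time_ratio]
    common = reduce(_lcm, (part.denominator for part in parts))
    scaled = [int(part * common) for part in parts]
    divisor = reduce(gcd, scaled)
    return [count // divisor for count in scaled]


def scale_for_cpus(ratio, cpu_count):
    """Thread counts when one set of processes serves every CPU."""
    return [count * cpu_count for count in ratio]


print(thread_ratio([1, 2, 1.5]))       # [2, 4, 3]
print(scale_for_cpus([2, 4, 3], 2))    # [4, 8, 6]
```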

Memory
[Figure: four CPUs sharing memory]
• Keeping data in memory is critical to system performance.
• Optimizing memory usage keeps more data in memory and can improve performance significantly.
• Data can be compressed to reduce memory usage, but compressing and decompressing it costs CPU time.
• Memory locking prevents multiple CPUs from being fully utilized, because a memory lock is a resource shared by all CPUs in a multi-CPU system.
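The compression trade-off mentioned above can be measured directly. The sketch below uses Python's standard `zlib` module purely as an example codec. The function name `compression_tradeoff` and the sample record are assumptions, and actual savings depend entirely on the data.

```python
import time
import zlib


def compression_tradeoff(record, level=6):
    """Measure memory saved and CPU time spent for one record."""
    start = time.perf_counter()
    packed = zlib.compress(record, level)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(packed)
    decompress_seconds = time.perf_counter() - start

    if restored != record:
        raise ValueError("round trip changed the data")
    return {
        "original_bytes": len(record),
        "compressed_bytes": len(packed),
        "compress_seconds": compress_seconds,
        "decompress_seconds": decompress_seconds,
    }


# Hypothetical, highly repetitive sample data; real records compress less.
result = compression_tradeoff(b"subscriber-record;" * 1000)
print(result["original_bytes"], "->", result["compressed_bytes"], "bytes")
```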

Disk
[Figure: four CPUs sharing memory]
• Minimizing disk I/O is critical to improving real-time performance.
• Disk head movement adds 5-10 ms of delay, which is significant for real-time performance.
• Reducing disk head movement by buffering I/O can improve system performance significantly.
• Using all the disks in parallel can improve I/O throughput significantly.
• Character I/O versus block I/O.
• Disk array versus mirrored disks.

Performance Tuning
• Performance tuning is an important step in network product development.
• Profile CPU usage to identify the functions that consume the most CPU.
• Optimizing the top 10 CPU-consuming functions can improve system performance significantly.
• Performance benchmarking is a regular activity for every software release.
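As one possible way to carry out the profiling step, the sketch below uses Python's standard `cProfile` and `pstats` modules to list the ten functions with the highest own CPU time. The workload functions (`parse_message`, `handle_call`, `process_traffic`) are invented stand-ins for high-runner functions, and a production system would use a profiler suited to its own language.

```python
import cProfile
import io
import pstats


def parse_message(raw):
    # Stand-in for message parsing.
    return raw.split(",")


def handle_call(fields):
    # Stand-in for call logic.
    return sum(len(field) for field in fields)


def process_traffic(message_count=2000):
    # Simulated traffic: parse and handle a stream of messages.
    total = 0
    for i in range(message_count):
        total += handle_call(parse_message(f"INVITE,{i},alice,bob"))
    return total


def top_functions(func, limit=10):
    """Profile func() and return a report of the top `limit` functions."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "tottime" ranks functions by time spent in their own code.
    stats.sort_stats("tottime").print_stats(limit)
    return out.getvalue()


report = top_functions(process_traffic)
print(report)
```

Running the same report on every release gives a simple benchmark history to compare against.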

System Reliability
A network system consists of both software and hardware, so increasing system reliability means improving both software and hardware reliability. Software faults contribute much more system downtime than hardware faults do, so improving software reliability is the more effective way to improve system reliability.

Software Reliability (1)
Software reliability is determined by the downtime that software bugs cause. Improving it has three aspects:
1) Reduce the number of bugs.
2) Reduce the downtime caused by software bugs.
3) Reduce bug-fixing time.

Software Reliability (2)
1) Reduce the number of bugs
A disciplined software development process with quality control is the most effective way to reduce the number of software bugs and ensure software quality. The software development process is discussed in more detail later.

Software Reliability (3)
2) Reduce downtime caused by software bugs
A process can lose its heartbeat because:
• The process dies
• The process is too busy (for example, stuck in an infinite loop)
Recovery levels:
• Level 1 recovery: INIT kills the process that has lost consecutive heartbeats and re-initializes it.
• Level 2 recovery: INIT re-initializes the process and its global resources.
• Level 3 recovery: INIT restarts the whole system.
• Level 4 recovery: INIT triggers an OS reboot.
• Level 5 recovery: power off, then power on.
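The heartbeat supervision and recovery levels above could be modeled roughly as follows. The slide does not say how many heartbeats must be missed or when INIT moves from one level to the next, so the `missed_limit` threshold and the rule "escalate one level each time the same process fails again" are assumptions, as are all names in the code.

```python
from dataclasses import dataclass, field

RECOVERY_ACTIONS = {
    1: "kill and re-initialize the process",
    2: "re-initialize the process and its global resources",
    3: "restart the whole system",
    4: "reboot the operating system",
    5: "power off, then power on",
}


@dataclass
class HeartbeatMonitor:
    """Hypothetical model of INIT's heartbeat supervision."""

    missed_limit: int = 3  # assumed number of consecutive misses
    missed: dict = field(default_factory=dict)
    level: dict = field(default_factory=dict)

    def heartbeat(self, name):
        # A heartbeat arrived: the run of consecutive misses is over.
        self.missed[name] = 0

    def tick(self, name):
        """Record one interval without a heartbeat.

        Returns (level, action) when recovery is needed, else None.
        """
        self.missed[name] = self.missed.get(name, 0) + 1
        if self.missed[name] < self.missed_limit:
            return None
        self.missed[name] = 0
        # Assumed policy: each repeated failure escalates by one level.
        self.level[name] = min(self.level.get(name, 0) + 1, 5)
        return self.level[name], RECOVERY_ACTIONS[self.level[name]]

    def mark_stable(self, name):
        # Assumed policy: a process that stays healthy starts over at level 1.
        self.level[name] = 0


monitor = HeartbeatMonitor(missed_limit=2)
monitor.tick("call_proc")
print(monitor.tick("call_proc"))
```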

Software Reliability (4)
3) Reduce bug-fixing time
• Error messages should be written to the log file whenever an unexpected software event is detected, such as an unexpected incoming message or an unexpected parameter in an incoming message.
• The code should handle every logical branch of each "if ... then ... else ..." statement.
• Error messages should be written to the log file whenever an unexpected logic branch is reached.
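A minimal sketch of this logging discipline, using Python's standard `logging` module. The message types `SETUP` and `RELEASE`, the `caller` parameter, and the handler function are invented for illustration.

```python
import logging

logger = logging.getLogger("call_logic")


def configure_log_file(path):
    """Send error messages to a log file (the path is the caller's choice)."""
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)


def handle_message(msg_type, params):
    """Handle one incoming message; every branch is covered explicitly."""
    if msg_type == "SETUP":
        if "caller" not in params:
            # Unexpected parameters in an otherwise known message.
            logger.error("SETUP without 'caller' parameter: %r", params)
            return "rejected"
        return "setup"
    elif msg_type == "RELEASE":
        return "released"
    else:
        # The branch that "should never happen" still leaves a trace.
        logger.error("unexpected message type %r, params=%r", msg_type, params)
        return "ignored"
```

Logging the offending message type and parameters, rather than a bare "error", is what shortens the time needed to locate the bug.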

Hardware Reliability (1)
With today's hardware technology, hardware defects can be almost completely eliminated during the testing stage. Hardware faults are usually caused by components that fail at random for environmental reasons such as dust, static electricity, vibration, and temperature.
Ways to increase hardware reliability:
• Hardware redundancy
• Hot-swappable hardware components
• Spare-parts inventory for hardware replacement

Hardware Redundancy
Hardware redundancy is the most effective way of increasing system reliability, from both the software and the hardware perspective.
[Figure: redundant components A, B, C, and D, with availability labels 99.9%, 99.999%, and 99.9999%]
Assuming component failures are independent:
• The probability that component A fails is ~A = 0.001.
• The probability that components A and B fail together is 0.001 * 0.001.
• The probability that components A, B, and C fail together is 0.001 * 0.001 * 0.001.

N+K Redundancy
[Figure: redundant components A, B, C, and D, with availability labels 99.9%, 99.999%, and 99.9999%]
Assume each component can process X amount of network traffic, so 4 identical components can process a total of 4X.
• If one component is reserved for redundancy, the system should be able to handle 3X of traffic with a probability of ???
• If two components are reserved for redundancy, the system should be able to handle 2X of traffic with a probability of ???
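The slide leaves these probabilities open. One common way to work them out, assuming the components fail independently and all have the same availability, is the binomial "at least k out of n" formula sketched below. The function name is hypothetical, and the per-component availability of 0.999 is borrowed from the previous slide's example, so it is an assumption here.

```python
from math import comb


def capacity_availability(total, needed, component_availability):
    """Probability that at least `needed` of `total` components are up.

    Assumes independent failures and identical components.
    """
    a = component_availability
    return sum(
        comb(total, up) * a**up * (1 - a) ** (total - up)
        for up in range(needed, total + 1)
    )


A = 0.999  # assumed per-component availability

# 3+1 redundancy: 3X of traffic is handled if at least 3 of 4 are up.
print(capacity_availability(4, 3, A))
# 2+2 redundancy: 2X of traffic is handled if at least 2 of 4 are up.
print(capacity_availability(4, 2, A))
```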

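Reliability Model
[Figure: components A and B in layer 1; C, D, and E in layer 2; F and G in layer 3]
Here ~A is the probability that component A has failed.
Layer 1 availability: X = 1 - (~A * ~B)
Layer 2 availability: Y = 1 - (~C * ~D * ~E)
Layer 3 availability: Z = 1 - (~F * ~G)
System availability = X * Y * Z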
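The layered model translates directly into code. In the sketch below, the function names are hypothetical, and the failure probability of 0.001 for every component is an assumption taken from the earlier redundancy example. The slide itself gives no numbers for this model.

```python
from math import prod


def layer_availability(failure_probabilities):
    """A layer is down only if every redundant component in it has failed."""
    return 1 - prod(failure_probabilities)


def system_availability(layers):
    """The system is up only if every layer is up (layers are in series)."""
    return prod(layer_availability(layer) for layer in layers)


F = 0.001  # assumed failure probability of each component

layers = [
    [F, F],     # layer 1: A, B       -> X
    [F, F, F],  # layer 2: C, D, E    -> Y
    [F, F],     # layer 3: F, G       -> Z
]
print(system_availability(layers))
```

Because the layers are multiplied together, the least redundant layers (here the two-component ones) dominate the overall unavailability.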

Hardware Evolution
• To ensure that failed hardware can be replaced as soon as possible, a spare-parts inventory is needed.
• Today's commercial hardware usually has a mean time between failures (MTBF) of around 40,000 hours (about 4.5 years).
• Hardware technology becomes obsolete in about 5 years, and commercial hardware is discontinued within 5 years.
• The software system should therefore be easy to port to the latest hardware, to take full advantage of the hardware technology curve.

Core Chassis Features
• 19” 14-slot rack-mount, 11U
• Dedicated 15th front slot for dual shelf manager
• SA Forum OpenHPI shelf manager
• Dual-star fabric backplane
• Front-access fan trays and dust filters
• ETSI and NEBS Level 3
ATCA v2 Hardware Configuration
• 14 single-processor SBC boards (Rouzic)
• Dual-core 2.16 GHz processor
• 8 GB memory

Hardware Deployment View
• Total capacity: 8000 TPS
• 11M subscribers
• 1+1 redundancy
[Figure: chassis with PEM A and PEM B, ten BE CPU blades, two switch blades, and two pilot blades; label "1430 BE"]
