Automation of Mobile Radio Network Performance and Fault Management (Matkapuhelinradioverkon suorituskyvyn- ja vianhallinnan automatisointi). A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science. Espoo 28.2.2007, Helsinki University of Technology
Automation of Mobile Radio Network Performance and Fault Management
(Matkapuhelinradioverkon suorituskyvyn- ja vianhallinnan automatisointi)
A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science
Espoo 28.2.2007
Helsinki University of Technology, Department of Electrical and Communications Engineering
Author: Magnus Wallström, magnus.wallstrom@nokia.com
Supervisor: Timo Korhonen, timo.korhonen@tkk.fi
Instructor: Mikko Lamberg, MSc (Tech), Nokia Networks, mikko.lamberg@nokia.com
2007-02-28 / Magnus Wallström
Contents
• Introduction and background
• Literature review:
  • Architecture of Mobile Radio Access Networks
  • State of the art in management of mobile networks as defined by 3GPP
  • Performance management data
  • Functionality scenarios in UTRAN
• Methods of the practical study
• Results:
  • Results (I/IX): Current PM and FM organisation and process
  • Results (II/IX): Example on current PM and FM process (1/3)
  • Results (III/IX): Example on current PM and FM process (2/3)
  • Results (IV/IX): Example on current PM and FM process (3/3)
  • Results (V/IX): Analysis of the current organisation and process
  • Results (VI/IX): Problems in current PM and FM process, interrelationship and why-why diagrams of the process problems
  • Results (VII/IX): Summary of the analysis
  • Results (VIII/IX): Solution for automated investigation
  • Results (IX/IX): Implementation of the solution
• Conclusions of the thesis
• References
Key concepts
• 3GPP = Third Generation Partnership Project, a project that aims to develop GSM and UMTS specifications in cooperation with the vendors, operators and standardisation organisations.
• Fault management = Functions that enable the detection and location of failures in the network and the scheduling of repairs. 3GPP specifies the requirements of the concept.
• Mobile radio access network = A network that provides wireless access to users through a radio interface and allows mobile users to move between coverage areas without losing the connection, i.e. handover.
• Performance management = Functions that enable performance measurements of network services. 3GPP specifies the requirements of the concept.
Introduction and background
How to enhance the productivity of the UTRAN performance management investigations?
• Research area: Mobile Radio Network Performance and Fault Management
• Research questions:
  • What is a mobile radio access network and how is it managed?
  • What is the current performance management set-up in the organisation under study?
    • What is the organisation and communication structure?
    • What is the process to tackle performance problems in UTRAN?
  • What are the problems of the current set-up?
  • What could be solutions to the root problems found in the current set-up?
• Scope:
  • Limited to the European 3G mobile radio network = UTRAN (UMTS Terrestrial Radio Access Network)
  • Other major mobile radio network technologies are: GERAN (GSM radio network), WiMAX (WMAN) and WiFi (WLAN)
Architecture of Mobile Radio Access Networks
• General architecture
  • UE – User Equipment
    • Consists of Mobile Equipment (ME) and Subscriber Identity Module (SIM) for the end-user to access the mobile network
  • RAN – Radio Access Network
    • Currently the most popular mobile RANs are UTRAN and GERAN
    • Other radio access technologies are LTE, WiMAX and WiFi
  • CN – Core Network
    • All RANs are attached to a CN that provides switching and access to services in PSTN and any IP network
  • OSS – Operations Support System
    • All parts of the mobile network may be managed by a centralised system
• UTRAN architecture
  • Network elements
    • RNC – Radio Network Controller
    • Node B, also known as BTS – Base Transceiver Station
    • A – ATM transmission nodes
  • Interfaces
    • IuCS: RNC to Circuit Switched Core Network (voice and video calls)
    • IuPS: RNC to Packet Switched Core Network (data calls)
    • Iur: RNC to RNC
    • Iub: RNC to BTS
    • Uu: BTS to UE
    • O&M: OSS to any network element: RNC, BTS, ATM nodes and CN elements (MSC, HLR, SGSN, GGSN etc.)
State of the art in management of mobile networks as defined by 3GPP
• Network management areas relevant to RAN technical support
  • Performance Management (PM)
    • Keeps track of the network performance status and analyses the effects of configuration changes in the network [3GPP TS 32.101]
    • Is based on measurements that are continuously recorded in the network elements
  • Fault Management (FM)
    • Consists of fault detection, fault localisation, fault reporting, fault correction and fault repair [3GPP TS 32.111-1]
    • Is based mainly on alarms and system logs that the network elements produce
  • Software Management (SWM)
    • Covers software request management, installation, customer feedback and software fault management, i.e. detecting software faults and finding resolutions to them. This duty is close to and overlaps with fault management (FM) [3GPP TS 32.101]
  • Configuration Management (CM)
    • Controls the operational parameters of network elements [3GPP TS 32.600]
• Process of applying the network management:
  1. Performance monitored
  2. Faults localised
  Depending on the type of failure:
  3a. Configuration changed or
  3b. Software defect(s) corrected
  4. Monitor the performance (step 1)
[Figure: management cycle 1. PM, 2. FM, then 3a. CM (parameter) or 3b. SWM (software), and back to 1. PM]
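The management cycle described on this slide can be sketched as a small state machine. This is only an illustration: the `Step` enum, the `next_step` function and the failure-type strings are hypothetical names introduced here, not part of the 3GPP specifications.

```python
from enum import Enum


class Step(Enum):
    PM = "1. performance monitored"
    FM = "2. faults localised"
    CM = "3a. configuration changed"
    SWM = "3b. software defect(s) corrected"


def next_step(step, failure_type=None):
    """Return the step that follows `step` in the management cycle."""
    if step is Step.PM:
        return Step.FM
    if step is Step.FM:
        # The branch depends on the type of failure that was localised.
        if failure_type == "configuration":
            return Step.CM
        if failure_type == "software":
            return Step.SWM
        raise ValueError("failure_type must be 'configuration' or 'software'")
    # After either correction the performance is monitored again (step 1).
    return Step.PM


# Walk one round of the cycle for a software fault.
cycle = [Step.PM]
for failure in (None, "software", None):
    cycle.append(next_step(cycle[-1], failure))
print([s.name for s in cycle])
```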
Functionality scenarios in UTRAN
• Control plane, i.e. signalling, on the RRC (Radio Resource Control) connection
  • Major purpose: set up and release a call
• User plane, i.e. the traffic, on RAB (Radio Access Bearer) connections
  • Major purpose: define the QoS class of the call:
    • Conversational class, RT (Real Time), applications: CS voice and video calls
    • Streaming class, RT, applications: CS streaming video
    • Interactive class, NRT (Non RT), applications: PS (Packet Switched) web browsing
    • Background class, NRT, applications: emails, MMS (Multimedia Messaging Service)
• Signalling scenarios:
  • MTC (Mobile Terminated Call) scenario
    • Paging: the RNC sends an “RRC Paging Type 1” message over the Uu interface
    • RRC connection setup: the paged UE responds by starting the radio control connection establishment procedure:
      (1.) The UE sends an “RRC Connection Request” message to the RNC (“RRC Connection Setup Attempt” counter is updated).
      (2.) The RNC tries to allocate radio resources (BTS) and, if successful, responds with an “RRC: Connection Setup” message (“RRC Connection Setup Complete” counter is updated).
      (3.) Finally the UE responds with an “RRC: Connection Setup Complete” message (“RRC Access Complete” counter is updated).
    • Transaction reasoning: the RNC and CN negotiate the transaction type
    • Authentication and security procedure: the UMTS subscriber and the network authenticate each other, and other security mechanisms are activated
    • RAB setup for transaction: the actual communication resources for the transaction are allocated
    • Transaction: the UE has an active user plane bearer connection across the whole UMTS network
    • RAB release for transaction clearing: network resources related to the transaction are released, i.e. all the active RAB connections for a UE are released
    • RRC connection release: the radio control connection between the UE and the UTRAN is released
  • Mobility (handover scenario):
    • Measurement: the UE sends a radio-link measurement report to the RNC
    • Decision: the final decision to make a handover is made in the RNC by the RRM handover control algorithms. The decision is based on the handover criteria and algorithm parameters
    • Execution: handover signalling between e.g. UE and RNC, and radio resource allocation e.g. in BTS
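The three counter updates of the RRC connection setup can be illustrated with a short simulation. It is a sketch only: the counter names are taken from the slide, while the function `rrc_connection_setup` and its two boolean parameters are hypothetical simplifications of the real signalling.

```python
from collections import Counter


def rrc_connection_setup(counters, resources_available, ue_completes):
    """Simulate one RRC connection setup and update the counters it touches."""
    # (1) The UE sends "RRC Connection Request".
    counters["RRC Connection Setup Attempt"] += 1
    if not resources_available:
        return False
    # (2) The RNC allocated resources and answers with "RRC: Connection Setup".
    counters["RRC Connection Setup Complete"] += 1
    if not ue_completes:
        return False
    # (3) The UE answers with "RRC: Connection Setup Complete".
    counters["RRC Access Complete"] += 1
    return True


counters = Counter()
rrc_connection_setup(counters, resources_available=True, ue_completes=True)
rrc_connection_setup(counters, resources_available=False, ue_completes=False)
rrc_connection_setup(counters, resources_available=True, ue_completes=False)
print(dict(counters))
```

Because each counter is updated at a different stage, the differences between them show where in the procedure the setups are lost.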
Performance management data
• Performance counters
  • UTRAN collects thousands of counters, each measuring the number of occurrences of a specific event
  • E.g. RRC Setup Attempts, RRC Setup Completes, RRC Setup Attempt Failure RNC, RRC Setup Attempt Failure BTS etc.
• KPI (Key Performance Indicator)
  • Most often calculated from performance counters as a relative %-value
  • Relative %-KPIs are comparable between networks of different sizes; absolute values are not, because the amount of traffic varies
  • Form: KPI = (a formula of performance counters)
  • Examples:
    • RRC_Acc% = “RRC access complete ratio” = “RRC Access Completes” / “RRC Setup Attempts”
    • CSSR, Call Setup Success Rate (voice call) = RRC_Acc% * (RAB_voice_attempts - RAB_voice_failures) / RAB_voice_attempts
    • CCSR, Call Completion Success Rate (voice call) = RAB_active_voice_successful_completes / (RAB_active_voice_failures + RAB_active_voice_successful_completes)
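The KPI examples can be written out as a minimal calculation. The function names and all counter values below are invented for illustration. Two assumptions are made: the CSSR denominator is taken to be the voice RAB attempts, and CCSR is computed as the share of successfully completed calls, which is what "success rate" suggests.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when there was no traffic."""
    return numerator / denominator if denominator else float("nan")


def rrc_acc(rrc_access_completes, rrc_setup_attempts):
    return ratio(rrc_access_completes, rrc_setup_attempts)


def cssr(rrc_acc_ratio, rab_voice_attempts, rab_voice_failures):
    # Call setup succeeds only if both the RRC and the voice RAB setup succeed.
    return rrc_acc_ratio * ratio(rab_voice_attempts - rab_voice_failures,
                                 rab_voice_attempts)


def ccsr(rab_active_voice_failures, rab_active_voice_completes):
    # Share of active voice RABs that ended normally (assumed meaning).
    return ratio(rab_active_voice_completes,
                 rab_active_voice_failures + rab_active_voice_completes)


# Illustrative counter values for one measurement period.
acc = rrc_acc(rrc_access_completes=980, rrc_setup_attempts=1000)
setup = cssr(acc, rab_voice_attempts=500, rab_voice_failures=10)
completion = ccsr(rab_active_voice_failures=5, rab_active_voice_completes=495)
print(f"RRC_Acc% = {acc:.2%}, CSSR = {setup:.2%}, CCSR = {completion:.2%}")
```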
Methods of the practical study
• Based on the UCD (User Centered Design) process and framework
• Chronologically the practical study had three phases:
  • Study and define the current process and organisation
    • Study: interview
    • Study: focus group
    • Study: contextual enquiry
  • Analyse the current set-up
    • Analysis: brainstorming
    • Analysis: affinity diagram
    • Analysis: double teams
    • Analysis: interrelationship diagram
    • Analysis: why-why-analysis
  • Develop an enhanced process
    • Solution: brainstorming
    • Solution: SWOT analysis
    • Solution: UML diagrams
Results (I/IX): Current PM and FM organisation and process
• Organisation-wise, technical support is the communicator between the local customer contact teams and the product line R&D organisation.
  • Local teams communicate the performance status of the customer networks to technical support.
  • Technical support investigates and analyses the performance degradations and decides how to fix them in co-operation with R&D.
  • R&D’s responsibility is to develop corrections to the system if no other solution is effective.
• The process follows the three phases of the root-cause analysis methodology:
  • Investigation (maps to PM [3GPP])
  • Analysis (maps to FM [3GPP])
  • Decision (maps to SWM and CM [3GPP])
• Each phase of the process has deliverables that are utilised in the later phases.
Results (II/IX): Example on current PM and FM process (1/3)
1. Get KPIs and failure counters for the required top object (i.e. RNC)
  • Achieved by using a reporting tool that collects the needed counters from the OSS measurement database and calculates the KPI values based on the counters. The graphical output is produced by manually post-processing the data.
2. Find measurement periods where there is a dip in performance:
  • Call setup performance: at 11 the CSSR KPI had poor values. The phenomenon partly continued during the following hour.
  • Retainability: high drop call ratio at 16. The counter diagram verifies that the drop in CCSR is due to a high number of RAB active failures.
[Figures: KPIs; Failure counters]
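Finding the measurement periods with a performance dip can be sketched as a threshold check over an hourly KPI series. The threshold, the function name `find_dips` and the CSSR values are all invented for illustration; the original process relied on manual inspection of the graphs.

```python
def find_dips(kpi_by_hour, threshold):
    """Return the hours whose KPI value is below the threshold, in time order."""
    return [hour for hour, value in sorted(kpi_by_hour.items())
            if value < threshold]


# Hypothetical hourly CSSR values for one RNC.
cssr_by_hour = {9: 0.985, 10: 0.982, 11: 0.910, 12: 0.955, 13: 0.984}

dip_hours = find_dips(cssr_by_hour, threshold=0.97)
print(dip_hours)
```

A fixed threshold is the simplest possible rule; comparing each hour against the same hour on previous days would be another option.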
Results (III/IX): Example on current PM and FM process (2/3)
3. Get the KPIs and failure counters on BTS level.
  • This can be achieved using the same reporting tool as in the first phase. The output is an extensive list of all the BTSs under one RNC, with all measurement periods and counters for each BTS.
4. Find the network elements that are causing the performance dip.
  • After post-processing the data, the results are lists of the BTSs that are the main contributors to the performance dips.
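One way to pick the main contributors is to rank the BTSs by their failure count during the dip period and keep the smallest set that covers a chosen share of all failures. This is a sketch under assumptions: the 80 % share, the function name and the BTS identifiers are hypothetical, and the thesis does not state which ranking rule was actually used.

```python
def top_contributors(failures_by_bts, share=0.8):
    """Return the BTSs that together account for at least `share` of failures."""
    total = sum(failures_by_bts.values())
    if total == 0:
        return []
    ranked = sorted(failures_by_bts.items(), key=lambda item: item[1],
                    reverse=True)
    picked, covered = [], 0
    for bts, failures in ranked:
        if covered >= share * total:
            break
        picked.append(bts)
        covered += failures
    return picked


# Hypothetical RAB active failure counts per BTS during the dip period.
failures = {"BTS-101": 60, "BTS-102": 5, "BTS-103": 25, "BTS-104": 10}
main_contributors = top_contributors(failures)
print(main_contributors)
```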
Results (IV/IX): Example on current PM and FM process (3/3)
5. Gather the system logs for those network elements that are the main contributors to the RNC performance dip.
  • Achieved by connecting to the network element’s O&M unit, either by manual command procedures or by using a tool that automates the procedure. The log files are usually in binary format, so they need to be opened by a parser or converted to textual format before the analysis can take place.
6. Analyse the detailed data.
  • The format of the data is vendor specific, i.e. not defined in public specifications, so no general guidance can be set for the analysis itself.
  • Highly dependent on the individual system specialists who can handle the versatile analysis and produce reliable results.
  • In this context the analysis can be treated as a black box, whose input is the system data, i.e. logs, parameters, alarms, counters and KPIs, and whose output is a set of root causes for the performance problem.
7. Generate a solution to the root cause.
  • Needs the presence of a skilled system specialist. Depending on the type of solution, finding a working solution might need a trial and error approach.
  • Before the solution is applied to a live network, it is tested in a test bed of the vendor. Some network operators also have test beds of their own, on which they verify the solutions, e.g. SW corrections, before they are installed in the live network.
Results (V/IX): Analysis of the current organisation and process
• Main problems:
  • Problems in current organisation operation
    • 7.2.1 High travel costs
    • 7.2.2 Troubleshooting poorly controlled
  • Problems in current PM and FM process
    • 7.3.2 NE logs not available for performance dips
    • 7.3.3 Alarms not mapped to performance dips
    • 7.3.4 Configuration data not available for performance dips
    • 7.3.5 Internal failures not distinguished from external causes
    • 7.3.6 Investigation is time consuming
Results (VI/IX): Problems in current PM and FM process, interrelationship and why-why diagrams of the process problems
Results (VII/IX): Summary of the analysis
• The analysis set two general requirements for the solution:
  • Support fault management analysis conducted by system specialists. The solution should be able to collect relevant fault management (FM) data, i.e. NE logs, configuration data and alarms, for troubleshooting. The relevance of the FM data is evaluated based on the performance measurement data, which may be collected either from the OSS or from the RNC.
  • Support general reporting of performance conducted by operator and vendor performance management bodies. The solution should produce scalable reports of the performance measurement data. Reports should represent the performance data both on the whole network level and on the individual network element level, down to the level of a single cell. Other statistical requirements are aggregation over time and that the data can be averaged.
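The reporting requirement, the same KPI on network, network element and cell level, can be sketched by summing the raw counters per level and only then forming the ratio. Summing first avoids averaging percentages of cells with very different traffic. The row layout, the `LEVELS` table and the sample values are hypothetical.

```python
from collections import defaultdict

# Which fields identify an object on each reporting level.
LEVELS = {
    "network": (),
    "rnc": ("rnc",),
    "bts": ("rnc", "bts"),
    "cell": ("rnc", "bts", "cell"),
}


def access_ratio_by_level(rows, level):
    """Sum the counters per object on `level`, then compute completes/attempts."""
    sums = defaultdict(lambda: [0, 0])
    for row in rows:
        key = tuple(row[field] for field in LEVELS[level])
        sums[key][0] += row["attempts"]
        sums[key][1] += row["completes"]
    return {key: completes / attempts
            for key, (attempts, completes) in sums.items() if attempts}


# Hypothetical per-cell counters for one measurement period.
rows = [
    {"rnc": "RNC1", "bts": "BTS1", "cell": "C1", "attempts": 100, "completes": 99},
    {"rnc": "RNC1", "bts": "BTS1", "cell": "C2", "attempts": 300, "completes": 270},
    {"rnc": "RNC1", "bts": "BTS2", "cell": "C3", "attempts": 100, "completes": 100},
]

print(access_ratio_by_level(rows, "network"))
print(access_ratio_by_level(rows, "bts"))
```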
Results (VIII/IX): Solution for automated investigation
• INPUT:
  • “Connection to a live network”. The developed solution requires a working connection to the network, either remote or on-site. This removes restrictions on which parts of the network the data can be gathered from, i.e. the OSS, the NEs or some other databases in the network.
• OUTPUT:
  • System log files and other detailed data for the failures that have occurred in the live network. The root-cause analysis phase utilises this data to make decisions.
  • Overview reporting of the network performance that can be utilised in reporting the status of the network to company management and to the customer, i.e. the network operator.
Results (IX/IX): Implementation of the solution
• The distributed system consists of five separate applications:
  • RNC monitor
  • RNC static performance data fetcher
  • OSS data fetcher
  • Processor & Report (application)
  • Report (server)
Conclusions of the thesis
• Summary: the thesis studied practical problems of mobile radio network management
  • Conclusion: UTRAN vendor technical support requires a distributed system of troubleshooting tools to enhance its troubleshooting processes
  • The purpose of the troubleshooting tools is to enhance the performance investigation by automating the gathering of performance data and other relevant network behaviour data for the time periods where the network suffers from low performance
• The reasoning of the solution is based on
  • Current troubleshooting set-up study:
    • Organisation: vendor home base technical support that is a link between the local teams, which are located by the operated networks, and the vendor R&D. During special occasions, e.g. a new product release or an emergency situation in a network, the organisation may adjust itself by temporarily transferring system specialists to work locally by the operated network.
    • Process: the practical performance and fault management process consists of three phases: investigation, analysis and decision.
  • The analysis of the current set-up:
    • Currently the main problem is the inefficiency of the first phase, i.e. investigation, in the performance and fault management process.
• Generalisation of the results
  • The same principles are applicable to performance and fault management of other radio networks (e.g. GERAN)
  • Utilisation of an OSS in data gathering makes the solution more portable to other radio network systems
  • Typically an OSS uses relational SQL databases. Different radio networks have different performance indicators, so the same tools may be used after modifying the SQL queries, which is a straightforward process
• Future work
  • The scope was limited to investigation. The complex analysis phase also has demanding development needs.
  • The technical support organisation requires product processes to manage the development and maintenance of the troubleshooting tools.
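The portability argument, where the same tool is reused after only the SQL query or its parameters change, can be sketched with an in-memory SQLite database. The table `counters`, its columns and the sample rows are hypothetical. A real OSS measurement database has its own vendor-specific schema.

```python
import sqlite3

# A ratio KPI expressed as a query; only the two counter names change
# when the tool is pointed at another KPI or another radio network.
KPI_QUERY = """
SELECT period,
       1.0 * SUM(CASE WHEN name = :num THEN value ELSE 0 END)
           / SUM(CASE WHEN name = :den THEN value ELSE 0 END) AS kpi
FROM counters
WHERE name IN (:num, :den)
GROUP BY period
ORDER BY period
"""


def kpi_per_period(conn, numerator, denominator):
    """Return (period, ratio) rows for the two named counters."""
    params = {"num": numerator, "den": denominator}
    return conn.execute(KPI_QUERY, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (period TEXT, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO counters VALUES (?, ?, ?)",
    [
        ("10:00", "RRC Setup Attempts", 100),
        ("10:00", "RRC Access Completes", 98),
        ("11:00", "RRC Setup Attempts", 100),
        ("11:00", "RRC Access Completes", 90),
    ],
)
result = kpi_per_period(conn, "RRC Access Completes", "RRC Setup Attempts")
print(result)
```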
References
• Standards and Technical Specifications
  • 3GPP: GSM, 3G and LTE
  • IEEE: WiFi and WiMAX
• Commercial material
  • Nokia Multiradio: http://www.nokia.com/NOKIA_COM_1/Microsites/NokiaWorld/Press/Multiradio_Press_Backgrounder.pdf
  • Cisco WiMAX: http://www.cisco.com/en/US/netsol/ns616/networking_solutions_customer_profile0900aecd80334a23.html
• Previous theses
  • Kujala, Kimmo (2006) Expert System for Mobile Network Troubleshooting. Thesis. Diplomityö, TKK / Sähkö- ja tietoliikennetekniikan osasto, 2006. 72 p.
    • An attempt to build an automated fault analysis tool system. The result of the thesis was that automated analysis is still unreliable.
  • Utriainen, Juha (2004) UTRAN Operation System Security. Thesis. Diplomityö, TKK / Sähkö- ja tietoliikennetekniikan osasto, 2004. 64 p.
    • Gives a good overview of the UTRAN O&M (Operation and Maintenance)
• Handbooks
  • Kaaranen, Heikki (2005) UMTS Networks – Architecture, Mobility and Services. Second Edition. John Wiley & Sons. ISBN: 0470011033
  • Nielsen, Jakob (1993) Usability Engineering. Boston: Academic Press, 1993.