COMP2221 Networks in Organisations

COMP2221 Networks in Organisations Richard Henson May 2013

Week 11 – Troubleshooting & Optimisation • Learning Objectives: • Explain the principles of troubleshooting as a means of mitigating failure • Use the various tools available on a named operating system to identify potential faults and problems • Take appropriate action to stop a fault becoming a failure

“A stitch in time saves nine”

Business - Worst Possible Scenario (1) • There is an interruption in the power supply • UPS is invoked • the interruption continues… • servers all have to be shut down • Power supply restored… • but main domain controller doesn’t reboot • no other domain controllers therefore connect to it • the domain tree fails

Business - Worst Possible Scenario (2) • Organisation cannot do business with the network down… • server can’t be persuaded to boot • new main domain controller has to be commissioned • whole directory tree has to be rebuilt!!! • word spreads very rapidly… • Business loses so much custom, trust, and credibility that even when it starts doing business again customers choose to go elsewhere • without a flourishing customer base… the business folds

Analysis: This scenario shouldn’t have occurred… • Unlikely that the server would fail to boot without prior warning… • warnings would have been presented… • but were clearly not acted upon! • Disaster recovery plan!?! • not formulated? • not tested? • not effective (in the event of a domain tree controller failure…)

But it does… • Actual example (15th Feb 2010): • root domain controller [on the network] had not been backed up for 10 months, when it crashed (well… at least it had been backed up at some time…) • http://searchwindowsserver.techtarget.com/generic/0,295582,sid68_gci1381567,00.html • The consultant called in to fix it reported that: • “I had never seen a case where the forest root domain had to be recovered -- and I couldn't find anyone who had.”

Analysis: Who is to blame? (1) • In this example, the organisation said they were following Microsoft guidelines • they set up an empty root domain • the root domain controller had a RAID-5 (best) disk configuration • Was true, to some extent… • Microsoft did espouse this as best practice… (in the year 2000!) • guidelines had changed since then…

Analysis: Who is to blame? (2) • The disaster that struck was: • two RAID drives failed on the same day! • unlucky? possible to prepare for this? • The recovery process took about three weeks • most of the time was spent studying logs, doing the restore, etc. • In this case, the tree was still able to function without a root domain • business was able to continue • customer base wasn’t compromised…

Fault Tolerance and Risk Assessment • General “common sense” principle: • always have a backup • ESPECIALLY for the most important computer on the network… • Q: • How can you tell what needs backing up? • A: • Risk Assessment and Risk Management

Why not Risk Management? • Time consuming! • However, without proper risk management… • how does the organisation know what processes are most important to its functioning? • how can an organisation provide resources to protect aspects of its network?

Risk Management and Risk Assessment • Risk Assessment is an essential first step • requires putting a “value” on assets • more valuable… greater protection • Do information assets have value? • organisations still failing to acknowledge that they do… • categorisation of information assets therefore potentially problematic • need to look at the consequence to the organisation of losing that asset…
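
The idea of putting a "value" on assets can be sketched in a few lines of Python. This is only an illustration: the asset names, the 1 to 5 scoring scales and the "impact x likelihood" scoring rule are assumptions made for the example, not part of the module material.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str        # hypothetical asset name
    impact: int      # consequence of losing it (1 = minor, 5 = business-ending)
    likelihood: int  # estimated chance of loss (1 = rare, 5 = frequent)

    @property
    def risk(self) -> int:
        # Assumed scoring rule: risk = impact x likelihood
        return self.impact * self.likelihood


def rank_by_risk(assets):
    """Return assets ordered from highest to lowest risk score."""
    return sorted(assets, key=lambda a: a.risk, reverse=True)


assets = [
    Asset("root domain controller", impact=5, likelihood=2),
    Asset("staff workstation", impact=1, likelihood=4),
    Asset("customer database", impact=5, likelihood=3),
]

for asset in rank_by_risk(assets):
    print(f"{asset.name}: risk score {asset.risk}")
```

The assets at the top of the ranking are the ones that would justify the greatest protection, such as more frequent backups.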

How do you back up a Domain Controller? • The Windows “Backup” program works, and can easily be scheduled • but heavily criticised… • even the 2008 server version… • Third Party products give more flexibility and protection, e.g.: • Recovery Manager • http://www.quest.com/recovery-manager-for-active-directory • Backup Exec • http://www.symantec.com/business/products/family.jsp?familyid=backupexec

Prevention is Better than Cure • A server shouldn’t crash unexpectedly! • should be kept cool (environmental unit mustn’t break down!) • monitoring should show that unexpected things are happening • action can then (usually) be taken to take care of the unexpected • Many tools available to: • Check/monitor the system on a regular basis • Provide statistics to administrators • could also be used for security purposes • Generate alerts if something is starting to go wrong…
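
The "generate alerts" idea above might look something like the following Python sketch. The metric names, threshold values and readings are all hypothetical. A real monitoring tool would collect the readings from the server rather than from a hard-coded dictionary.

```python
def generate_alerts(readings, thresholds):
    """Return a warning message for every reading that exceeds its threshold."""
    alerts = []
    for metric, value in readings.items():
        limit = thresholds.get(metric)
        if limit is not None and value > limit:
            alerts.append(f"WARNING: {metric} is {value}, above the limit of {limit}")
    return alerts


# Hypothetical thresholds and one set of sampled readings
thresholds = {"cpu_percent": 90, "disk_used_percent": 85, "room_temp_c": 27}
readings = {"cpu_percent": 45, "disk_used_percent": 91, "room_temp_c": 29}

for alert in generate_alerts(readings, thresholds):
    print(alert)
```

Run on a schedule, a check like this gives the administrator a chance to act while the problem is still a fault rather than a failure.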

Troubleshooting Tools for a Windows Server: Task Manager • Applications tab: • shows which applications are running • enables changing of process priority • use view/update speed • used to • open new applications • shut rogue applications down

Task Manager (continued) • Processes tab: • all system processes • Memory usage of each • % CPU time for each • total CPU time since boot up • also used to close a process down • careful! (but you get a warning…)

Task Manager (continued) • Performance tab: • total no. of threads, processes, handles running • Graph: % CPU usage • User mode • Kernel mode (optional: view menu) • graph per CPU (optional: view menu) • physical memory available/usage • virtual memory (page file) available/usage

Event Viewer • Events recorded into “event log” files • System log • Auditing log (customisable) • Application log • customisable - additional files • New files recorded daily; old ones archived • time before archiving also customisable

Event Viewer • Three types of events recorded in log: • Information • Warning • Error • More information on each event obtained by double-clicking • make note of event code • heed and take action if necessary
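
As a rough illustration of "heed and take action", the Python sketch below filters a list of event records down to Warnings and Errors and counts repeated event codes. The records are sample values invented for the example. Real entries would have to be exported from Event Viewer first.

```python
from collections import Counter

# Hypothetical records; a real log would be exported from Event Viewer
events = [
    {"type": "Information", "code": 6005, "source": "EventLog"},
    {"type": "Warning", "code": 51, "source": "Disk"},
    {"type": "Error", "code": 7000, "source": "Service Control Manager"},
    {"type": "Warning", "code": 51, "source": "Disk"},
]


def needs_attention(events):
    """Keep only Warning and Error events, dropping Information events."""
    return [e for e in events if e["type"] in ("Warning", "Error")]


def count_by_code(events):
    """Count how often each (type, code) pair occurs, to spot repeating problems."""
    return Counter((e["type"], e["code"]) for e in events)


for (event_type, code), n in count_by_code(needs_attention(events)).most_common():
    print(f"{event_type} {code}: seen {n} time(s)")
```

A code that keeps recurring is a good candidate for looking up and acting on first.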

Using Event Viewer • Wise to check all event logs regularly • take the time and trouble to find out what those messages really mean… • Take whatever action is needed • sort out potential problems now • make sure they don’t become real ones later…

Auditing Further Events • Any “object” can be audited • Objects to audit, and processes audited can be set through audit (group) policy • Using MMC & relevant snap-in • Types of process audited: • access • attempt to access

Security auditing • Same principles as general auditing • Refers to “restricted” objects • Events appear in separate security log

Event Management software (SIEM) • Who’s going to look at all these log files? • in practice, often no one… • Solution – SIEM software to analyse and present information from: • network and security devices • identity & access management applications • vulnerability management/policy compliance tools • OS, database & application logs • external threat data • http://www.focus.com/briefs/how-select-security-information-and-event-management-siem
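
One small part of what SIEM software does, pulling logs from several sources into a single timeline and hiding the routine entries, can be sketched in Python as follows. The log entries, source names and severity levels are hypothetical, and a real SIEM product does far more (correlation, threat data, reporting).

```python
import heapq

# Hypothetical, already time-sorted entries: (timestamp, source, severity, message)
firewall_log = [
    ("2013-05-01T09:00:05", "firewall", "warning", "port scan detected"),
    ("2013-05-01T09:02:10", "firewall", "error", "rule update failed"),
]
os_log = [
    ("2013-05-01T09:00:01", "os", "info", "service started"),
    ("2013-05-01T09:01:30", "os", "error", "disk write failure"),
]


def merge_logs(*logs):
    """Merge several time-sorted logs into one time-ordered list."""
    return list(heapq.merge(*logs))


def summarise(entries, min_severity="warning"):
    """Return only the entries at or above min_severity."""
    order = {"info": 0, "warning": 1, "error": 2}
    return [e for e in entries if order[e[2]] >= order[min_severity]]


timeline = merge_logs(firewall_log, os_log)
for timestamp, source, severity, message in summarise(timeline):
    print(timestamp, source, severity.upper(), message)
```

The merge relies on each input log already being in time order and on timestamps written in a format that sorts correctly as text.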

Performance Monitor • Not available on disk • To obtain and download Performance Monitor Wizard (PerfWiz), visit the following Web site: • http://www.microsoft.com/downloads/details.aspx?FamilyID=31fccd98-c3a1-4644-9622-faa046d69214&displaylang=en

What if the machine doesn’t boot… • Tools available: • The boot error itself • blue screen? suspect driver software • constant reboot? suspect the motherboard • Last Known Good… • Gives machine a chance to go back to the previous (usually last but one) configuration

What if the machine doesn’t boot… (continued) • Safe Mode • includes VGA Mode or boot logging • Debugging mode also available • output difficult to decipher for non-experts • Recovery Console • “DOS-type prompt” for performing minor repairs

What if the machine doesn’t boot… (continued) • System Configuration Utility (Msconfig.exe) • automates the routine troubleshooting steps relating to Windows configuration issues • can be used to modify the system configuration and troubleshoot the problem using a process-of-elimination method
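
The process-of-elimination method can be illustrated with a short Python sketch. It is not how Msconfig.exe itself works. It only shows the logic an administrator might follow when enabling half of the startup items at a time. The item names and the boots_ok test are hypothetical, and the sketch assumes exactly one item is at fault.

```python
def find_faulty_item(items, boots_ok):
    """Narrow down a single faulty startup item by testing half the candidates at a time.

    boots_ok(enabled) is a hypothetical test that returns True when the
    machine starts cleanly with only the given items enabled.
    """
    candidates = list(items)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if boots_ok(half):
            # The first half is clean, so the fault is in the other half
            candidates = candidates[len(half):]
        else:
            candidates = half
    return candidates[0]


# Hypothetical startup items, one of which is known (for the demo) to be faulty
startup_items = ["antivirus", "print spooler helper", "old VPN client", "updater"]
faulty = "old VPN client"
print(find_faulty_item(startup_items, lambda enabled: faulty not in enabled))
```

In practice each call to boots_ok would mean a reboot, so halving the candidates keeps the number of restarts small.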

What if the machine doesn’t boot… (continued) • Emergency Repair Disk (ERD) • reboot machine using different media • e.g. floppy disk (yes… still possible) • media should be generated BEFORE it needs to be used! • option to create the ERD during the set up process…

What if the machine doesn’t boot… (continued) • Full restore • assumes a full backup has already been made • still have to: • reformat hard disk from scratch… • and then restore the backup files using backup/restore option… • but better than losing all your data!

Optimisation… • All about improving the performance of system resources… • A network manager should never have “nothing to do…”
