Design patterns for fault-tolerant systems. Supplementary slides for the course Autonomous and Fault-Tolerant Information Systems. Kocsis Imre (ikocsis@mit.bme.hu), 2010.09.20. Recap: Singleton. Recap: Facade. Recap: Observer. Architectural pattern language. Units of Mitigation.
Design patterns for fault-tolerant systems. Supplementary slides for the course Autonomous and Fault-Tolerant Information Systems. Kocsis Imre (ikocsis@mit.bme.hu), 2010.09.20.
Units of Mitigation • How can you keep the whole system from being unavailable when an error occurs? • “Design the system into parts that will contain both any errors and the error recovery. Choose the divisions that make sense for your system. Design the rest of the system around these parts that represent the basic units of error mitigation.”
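The idea can be illustrated with a minimal Python sketch. The `Unit` class and the `serve` function are hypothetical names invented for this example, not part of the slides; the sketch only assumes that each unit carries its own recovery routine, so that a failure in one unit does not make the others unavailable.

```python
from typing import Any, Callable, Dict, Optional


class Unit:
    """Hypothetical unit of mitigation: errors and their recovery stay inside it."""

    def __init__(self, work: Callable[[Any], Any], recover: Callable[[], None]):
        self.work = work
        self.recover = recover
        self.error_count = 0

    def handle(self, request: Any) -> Optional[Any]:
        try:
            return self.work(request)
        except Exception:
            # The error is contained here; recovery is local to this unit.
            self.error_count += 1
            self.recover()
            return None


def serve(units: Dict[str, Unit], request: Any) -> Dict[str, Optional[Any]]:
    # A failing unit yields None; the remaining units still answer.
    return {name: unit.handle(request) for name, unit in units.items()}
```

With one failing and one healthy unit, `serve` still returns the healthy unit's result, which is the availability property the pattern is after.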
Correcting Audits • Faulty data causes errors. • “Detect and correct data errors as soon as possible. Check related data for errors, correct and record the occurrence of the error.”
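A correcting audit might look like the following sketch. The order records, the `total` field and the `audit_totals` function are assumptions made up for illustration: the audit checks a stored value against related data it can be recomputed from, corrects it, and records the occurrence.

```python
def audit_totals(orders, audit_log):
    """Check each stored total against its items; correct and record mismatches."""
    for order in orders:
        expected = sum(order["items"])
        if order["total"] != expected:
            # Record the occurrence before correcting the data in place.
            audit_log.append((order["id"], order["total"], expected))
            order["total"] = expected
    return orders
```

The log keeps the faulty value as well as the corrected one, so recurring data errors can be traced later.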
Redundancy • How can we reduce the amount of time between error detection and the resumption of normal operation after error recovery? • “Provide redundant capabilities that support quick activation to enable error processing to continue in parallel with normal execution.”
Minimize Human Intervention • How can we prevent people from doing the wrong things and causing errors? • “Design the system in a way that it is able to process and resolve errors automatically, before they become failures. This speeds error recovery and reduces the risk of procedural errors.”
Maximize Human Participation • Should the system ignore people totally? That will reduce procedural errors. • “Know the user and their availability. Design the system to enable knowledgeable operating personnel to participate. […] Provide appropriate Maintenance Interfaces and Fault Observer capabilities […]”
Maintenance Interface • Should maintenance and application requests be intermingled on the application input and output channels? • “Provide a separate interface to the system for the (almost) exclusive use of maintenance interactions.”
Someone in Charge • Anything can go wrong, even during error processing. When this happens the system might stop doing the error processing in addition to not doing the normal processing. • “All fault tolerance related activities have some component of the system that is clearly in charge and has the ability to determine correct completion and the responsibility to take action if it does not complete correctly.”
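As a rough sketch of this responsibility, the hypothetical `run_in_charge` function below owns a single recovery activity. The three callables (`recover`, `verify`, `take_over`) are assumptions for the example: the function decides whether the recovery completed correctly and acts when it did not, including when the recovery itself raises.

```python
def run_in_charge(recover, verify, take_over):
    """Own one recovery activity: judge its completion and act if it failed."""
    try:
        recover()
        completed = bool(verify())
    except Exception:
        # The error processing itself went wrong.
        completed = False
    if not completed:
        take_over()
    return completed
```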
Escalation • What does the system do when its attempt to process an error in a component is not achieving the correct effect? • “When recovery or mitigation is failing, escalate the action to the next more drastic action.”
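One way to picture escalation is an ordered ladder of actions. In this sketch the `escalate` function, the action names and the `is_healthy` check are all hypothetical; the only assumption is that actions are listed from mildest to most drastic and that health can be checked after each one.

```python
def escalate(actions, is_healthy):
    """Try (name, action) pairs from mildest to most drastic.

    Returns the name of the action that restored health, or None if
    even the most drastic action did not help.
    """
    for name, action in actions:
        action()
        if is_healthy():
            return name
    return None
```

A typical ladder could be retry, restart the component, reboot the node; more drastic steps are only reached when the milder ones fail.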
Fault Correlation • What fault is activating? • “Look at the unique signature of the error to sort it into the fault category for which error processing steps are known.”
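A simple form of signature matching can be sketched with regular expressions. The signature table, the category names and the `correlate` function are illustrative assumptions, not taken from the slides; a real system would derive its signatures from its own error reports.

```python
import re

# Hypothetical signature table: (pattern, fault category with known processing steps).
SIGNATURES = [
    (re.compile(r"timeout|timed out", re.IGNORECASE), "network"),
    (re.compile(r"checksum|corrupt", re.IGNORECASE), "data"),
    (re.compile(r"out of memory", re.IGNORECASE), "resource"),
]


def correlate(error_message):
    """Sort an error into the first fault category whose signature matches."""
    for pattern, category in SIGNATURES:
        if pattern.search(error_message):
            return category
    return "unknown"
```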
Error Containment Barrier • What is the first thing that the system must do when it detects an error? • “Isolate the error to a unit of mitigation. Stop the error flow with a barrier, quarantine and initiate either error recovery or error mitigation.”
System Monitor • How does one part of a system keep track that another part is alive and functioning? • “Create a Monitor to study system behavior, or the behavior of specific parts of the system to make sure that they continue operating correctly. When the watched components stop, the monitor should report the occurrence to the Fault Observer and initiate corrective actions.”
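A heartbeat-based monitor is one common way to realize this. The `HeartbeatMonitor` class below is a hypothetical sketch: timestamps are passed in explicitly to keep it deterministic, and the `report` callback merely stands in for the Fault Observer mentioned on the slide.

```python
class HeartbeatMonitor:
    """Hypothetical monitor: flags components whose heartbeat is overdue."""

    def __init__(self, timeout, report):
        self.timeout = timeout
        self.report = report  # stands in for the Fault Observer
        self.last_seen = {}

    def heartbeat(self, component, now):
        self.last_seen[component] = now

    def check(self, now):
        # A component counts as stopped if it has been silent longer than the timeout.
        stopped = [c for c, t in self.last_seen.items() if now - t > self.timeout]
        for component in stopped:
            self.report(component)
        return stopped
```

In a running system `now` would come from a monotonic clock, and `check` would also trigger corrective actions rather than only reporting.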
Existing Metrics • How to measure the severity of an overload without contributing to the overload? • “Use pre-existing indicators already tied to the resource as an indicator of the system’s overload condition.” • Note: this holds not only for performance!
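For instance, the length of a request queue that the system maintains anyway can serve as the indicator, so no extra probing load is added. The `overload_level` function and its thresholds (0.6 and 0.9) are assumptions chosen only for this sketch.

```python
from collections import deque


def overload_level(queue, capacity, moderate=0.6, severe=0.9):
    """Classify overload from the fill ratio of an already existing queue."""
    fill = len(queue) / capacity
    if fill >= severe:
        return "severe"
    if fill >= moderate:
        return "moderate"
    return "normal"
```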
Routine Maintenance • How can we keep preventable errors from occurring? • “Perform routine, preventive maintenance on the system.”
Routine Exercises • How do you know that Redundant elements that will be called into service by a Failover in case of an error or failure will actually work? • “Routinely exercise, or execute the system components that will be required in an error situation. This will identify latent faults.”
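A routine exercise can be sketched as a periodic self-test of the standby components. The `exercise_standbys` function and the self-test callables are hypothetical; the sketch assumes each standby exposes some test that returns a truthy value when it works.

```python
def exercise_standbys(standbys):
    """Run each standby's self-test and return the names with latent faults."""
    latent = []
    for name, self_test in standbys.items():
        try:
            ok = bool(self_test())
        except Exception:
            # A crashing self-test also reveals a latent fault.
            ok = False
        if not ok:
            latent.append(name)
    return latent
```

Run on a schedule, this reveals a broken standby before a Failover actually needs it.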
Quarantine • How can the system prevent errors from spreading? • “Establish a barrier around the element that prevents it from both contributing to the useful work and also prevents it from propagating its error into other parts of the system.”
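A minimal sketch of quarantine in a worker pool follows. The `WorkerPool` class is a hypothetical construction for illustration: a quarantined worker receives no jobs, so it neither contributes to useful work nor passes its erroneous results on.

```python
class WorkerPool:
    """Hypothetical pool in which suspect workers can be fenced off."""

    def __init__(self, workers):
        self.workers = dict(workers)
        self.quarantined = set()

    def quarantine(self, name):
        self.quarantined.add(name)

    def dispatch(self, job):
        # Quarantined workers get no work, so their errors cannot propagate.
        return {
            name: worker(job)
            for name, worker in self.workers.items()
            if name not in self.quarantined
        }
```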