Design Patterns for Fault-Tolerant Systems. Autonomous and Fault-Tolerant Information Systems Kocsis Imre ikocsis@mit.bme.hu 2013.10.21. Review: Singleton. Review: Facade. Review: Observer. Architectural pattern language. Units of Mitigation.
Design Patterns for Fault-Tolerant Systems Autonomous and Fault-Tolerant Information Systems Kocsis Imre ikocsis@mit.bme.hu 2013.10.21.
Units of Mitigation • How can you keep the whole system from being unavailable when an error occurs? • „Design the system into parts that will contain both any errors and the error recovery. Choose the divisions that make sense for your system. Design the rest of the system around these parts that represent the basic units of error mitigation.”
Units of Mitigation • Division: • Architecture • Available recovery/mitigation techniques • … • Desirable: fail-silent • Example: three-tiered system, tier as unit • In-tier redundancy schemes… • … or request queuing • Some further problems: • How to process errors inside? • What should the blocks be? • Note: coarse-grained pattern (HW and SW blocks)
Correcting Audits • Faulty data causes errors. • „Detect and correct data errors as soon as possible. Check related data for errors, correct and record the occurrence of the error.” • Leads to a host of other patterns
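A minimal sketch of a correcting audit, assuming a record that carries a redundant `count` field next to the data it describes. The `Inventory` class, the `audit` function and the log format are hypothetical names chosen for illustration only; the slide does not prescribe any particular check.

```python
from dataclasses import dataclass, field


@dataclass
class Inventory:
    # Hypothetical record: `count` is redundant and must equal len(items).
    items: list = field(default_factory=list)
    count: int = 0
    audit_log: list = field(default_factory=list)


def audit(inv):
    """Check the redundant count against the data, correct it, record the error."""
    actual = len(inv.items)
    if inv.count != actual:
        inv.audit_log.append(f"count was {inv.count}, corrected to {actual}")
        inv.count = actual
        return True   # an error was found and corrected
    return False      # data was consistent


inv = Inventory(items=["a", "b", "c"], count=5)  # corrupted on purpose
corrected = audit(inv)
```

Running the audit a second time finds nothing to correct, which is the property that makes it safe to schedule periodically.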
Redundancy • Assumption: error processing usually stops normal execution • How can we reduce the amount of time between error detection and the resumption of normal operation after error recovery? • „Provide redundant capabilities that support quick activation to enable error processing to continue in parallel with normal execution.”
Someone in Charge • Anything can go wrong, even during error processing. When this happens the system might stop doing the error processing in addition to not doing the normal processing. • „All fault tolerance related activities have some component of the system that is clearly in charge and has the ability to determine correct completion and the responsibility to take action if it does not complete correctly.”
Someone in Charge • N.B. does not promote a global SPOF • On the contrary, see escalation • Example (Action / In Charge): • Checkpoint / each task • Rollback and roll forward / component R • Load shedding / component S • Also: voting / leader selection techniques
Minimize Human Intervention • How can we prevent people from doing the wrong things and causing errors? • „Design the system in a way that it is able to process and resolve errors automatically, before they become failures. This speeds error recovery and reduces the risk of procedural errors.”
Minimize Human Intervention • How? • Make sure all errors are reported to the Fault Observer • Individual components do not talk to the outside world • Concentrate output and input • Design automatic F/E/F, detection, processing, treatment
Maximize Human Participation • Should the system ignore people totally? That will reduce procedural errors. • „Know the user and their availability. Design the system to enable knowledgeable operating personnel to participate. […] Provide appropriate Maintenance Interfaces and Fault Observer capabilities […]”
Escalation • What does the system do when its attempt to process an error in a component is not achieving the correct effect? • „When recovery or mitigation is failing, escalate the action to the next more drastic action.”
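One way to picture escalation is as an ordered ladder of recovery actions that is climbed until a health check passes. The sketch below assumes this reading; the action names (`retry`, `restart_component`, `reboot_node`) and the `state` dictionary are hypothetical stand-ins, not taken from the slides.

```python
def escalate(actions, is_healthy):
    """Try recovery actions from least to most drastic; stop once healthy."""
    tried = []
    for name, action in actions:
        action()
        tried.append(name)
        if is_healthy():
            return tried
    # Nothing on the ladder worked: whoever is in charge must now take over.
    raise RuntimeError(f"all recovery actions failed: {tried}")


state = {"healthy": False}  # hypothetical health flag of the faulty component


def retry():
    pass  # assumed to have no effect in this scenario


def restart_component():
    pass  # assumed to have no effect in this scenario


def reboot_node():
    state["healthy"] = True  # in this scenario only the most drastic step helps


ladder = [
    ("retry", retry),
    ("restart component", restart_component),
    ("reboot node", reboot_node),
]
tried = escalate(ladder, lambda: state["healthy"])
```

The exception at the end of the ladder is a design choice of this sketch: it hands the problem to the next level rather than failing silently.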
Fault Observer • Coordinate reporting to all observers that a fault is present, reported, and recovery actions escalated.
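Since the Observer pattern was reviewed earlier in the lecture, the Fault Observer can be sketched as a single reporting point that fans events out to subscribers. The class layout, the `subscribe`/`report` method names and the two example listeners are assumptions made for illustration.

```python
class FaultObserver:
    """Central point that receives fault reports and notifies every observer."""

    def __init__(self):
        self._listeners = []
        self.history = []  # every report is also kept for later inspection

    def subscribe(self, listener):
        self._listeners.append(listener)

    def report(self, component, status):
        event = (component, status)
        self.history.append(event)
        for listener in self._listeners:
            listener(event)


# Two hypothetical observers, modelled as plain lists that collect events.
console, pager = [], []
observer = FaultObserver()
observer.subscribe(console.append)
observer.subscribe(pager.append)

observer.report("db", "fault detected")
observer.report("db", "recovery escalated")
```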
Maintenance Interface • Should maintenance and application requests be intermingled on the application input and output channels? • „Provide a separate interface to the system for the (almost) exclusive use of maintenance interactions.”
Fault Correlation • Where have we encountered this before? • What did correlation mean? • Examples? • Note: this is a large area; the pattern only states that it is necessary (~ „diagnostics are needed”) • Topology-based approaches • Dynamic modeling: automata, languages • Static modeling: propagation relations • Learning methods • … • Diagnostic theory: later
Fault Correlation • What fault is activating? • „Look at the unique signature of the error to sort it into the fault category for which error processing steps are known.”
Fault Correlation • An exercise of sorts: given a topology (DAG) and the locations of the service-level failure effects. The set of possible fault-cause locations…? • Algorithm? • Complexity? • A finite automaton with fault locations and a sentence „with faulty output”. Which (faulty) states may we have passed through?
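One possible answer to the DAG exercise, sketched under two assumptions that the slide does not state: an edge means "depends on" (so a fault propagates from a dependency to its dependents), and exactly one fault is active. Under these assumptions the candidate causes are the nodes that every failed service transitively depends on, i.e. the intersection of the dependency closures. The graph and its node names are made up for the example.

```python
def dependencies(graph, start):
    """Return `start` plus everything it transitively depends on (iterative DFS)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def candidate_causes(graph, failed_services):
    """Single-fault assumption: a cause must be able to affect every failed service."""
    closures = [dependencies(graph, service) for service in failed_services]
    return set.intersection(*closures) if closures else set()


# Hypothetical topology: key depends on each node in its list.
graph = {
    "web": ["app1", "app2"],
    "api": ["app2"],
    "app1": ["db"],
    "app2": ["db", "cache"],
}
causes = candidate_causes(graph, ["web", "api"])
```

Each closure is one depth-first search, so with k failed services, V nodes and E edges the sketch runs in O(k·(V + E)). Services observed to be healthy could narrow the set further, which this sketch does not attempt.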
What do they have in common? • „Bus Guardian” in TT architectures • Try/catch block • Desktop antivirus protection
Error Containment Barrier • What is the first thing that the system must do when it detects an error? • „Isolate the error to a unit of mitigation. Stop the error flow with a barrier, quarantine and initiate either error recovery or error mitigation.”
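The try/catch block mentioned among the examples can serve as a small-scale containment barrier. The sketch below assumes that each callable is one unit of mitigation and that returning `None` is an acceptable degraded result; `run_in_barrier`, the `quarantined` dictionary and the unit names are hypothetical.

```python
def run_in_barrier(unit_name, task, quarantined):
    """Run one unit of mitigation and stop any error at its boundary."""
    if unit_name in quarantined:
        return None                   # unit is isolated, do not run it again
    try:
        return task()
    except Exception as exc:          # the barrier: the error does not propagate
        quarantined[unit_name] = repr(exc)   # quarantine and record the cause
        return None                   # mitigation here: degrade to "no result"


quarantined = {}
results = {
    "pricing": run_in_barrier("pricing", lambda: 1 / 0, quarantined),
    "catalog": run_in_barrier("catalog", lambda: ["book"], quarantined),
}
```

The failing unit is quarantined while the healthy one still delivers its result, so the error stays inside the unit where it was detected.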
Complete Parameter Checking • How can the time from fault activation to error detection be minimized? • „Perform frequent checks on data and operations to detect errors quickly and prevent errors from propagating to the rest of the system.” • More specifically, check all the inputs and parameters rigorously. • Level/granularity: design decision
Complete Parameter Checking • How do we check this? • A = B / C ;
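One possible answer for the division above, written as a sketch rather than a definitive checklist: validate the type and finiteness of both operands, reject a zero divisor, and check the result as well. The function name `checked_divide` and the choice of exceptions are assumptions; how much of this is worth doing is the design decision about level and granularity.

```python
import math


def checked_divide(b, c):
    """Compute b / c after checking both inputs and the result."""
    for name, value in (("b", b), ("c", c)):
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be a number, got {type(value).__name__}")
        if not math.isfinite(value):
            raise ValueError(f"{name} must be finite")
    if c == 0:
        raise ValueError("c must be non-zero")
    a = b / c
    if not math.isfinite(a):          # float division can overflow to infinity
        raise OverflowError("result is not finite")
    return a
```

For example, `checked_divide(6, 3)` returns `2.0`, while `checked_divide(1, 0)` raises `ValueError` before the division is attempted.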
System Monitor • How does one part of a system keep track that another part is alive and functioning? • „Create a Monitor to study system behavior, or the behavior of specific parts of the system to make sure that they continue operating correctly. When the watched components stop, the monitor should report the occurrence to the Fault Observer and initiate corrective actions.”
System Monitor • Its purpose is essentially to detect faulty states that manifest at the system level. • The „how” remains open for both detection and repair, of course.
How do we check whether a service is working? • Potential problems, e.g.: • „sparse” workload • normal operation takes place on a different channel
Heartbeat • How does the System Monitor know that a particular monitored task is still working? • „The System Monitor should see a periodic heartbeat from the monitored task. If the monitored task does not supply a heartbeat response within the required time then recovery action should be taken.” • Variants: autonomous / request-response • After all, this too is a heartbeat in this sense:
Acknowledgement • When there is a dialog between two tasks, what’s the easiest way for one task to determine that the other task is alive and functioning? • „Send an acknowledgement for all requests. All requests should require a reply to acknowledge receipt and to indicate that the monitored system is alive and able to adhere to the protocol. […]”
Acknowledgement • A useful pattern, but one to apply with judgment. When is it contraindicated?
Realistic Threshold • How much time should elapse before the System Monitor takes action when an error is detected? • Why can this be a problem? • Terminology • Messaging latency • Detection latency
Realistic Threshold • „Set the messaging latency based upon the worst case communications time combined with the time required to process one Heartbeat message. • Set the detection latency based upon the criticality of the functionality. Make it a multiple of the messaging latency. • Set them so that the availability requirement is met, yet false triggers do not occur.” (restart time!)
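The two latencies can be combined with the Heartbeat pattern in a small monitor. This is a sketch under stated assumptions: timestamps are passed in explicitly (in seconds) so the example is deterministic, and the numeric values, the `HeartbeatMonitor` class and the task names are invented for illustration.

```python
class HeartbeatMonitor:
    """Flags tasks whose last heartbeat is older than the detection latency."""

    def __init__(self, messaging_latency, multiple):
        # Detection latency is a multiple of the messaging latency.
        self.detection_latency = messaging_latency * multiple
        self.last_seen = {}

    def beat(self, task, now):
        self.last_seen[task] = now

    def overdue(self, now):
        return sorted(
            task
            for task, seen in self.last_seen.items()
            if now - seen > self.detection_latency
        )


# Assumed figures: worst-case communication time plus the time needed
# to process one heartbeat message.
worst_case_comm = 0.2
processing_time = 0.05
monitor = HeartbeatMonitor(
    messaging_latency=worst_case_comm + processing_time,
    multiple=4,  # assumed criticality: tolerate roughly four missed heartbeats
)

monitor.beat("worker-a", now=0.0)
monitor.beat("worker-b", now=0.0)
monitor.beat("worker-a", now=0.9)   # worker-b stays silent from here on
late = monitor.overdue(now=1.5)
```

A smaller multiple detects failures sooner but risks false triggers; the restart time of the monitored task is part of that trade-off.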
Existing Metrics • How to measure the severity of an overload without contributing to the overload? • „Use pre-existing indicators already tied to the resource as an indicator of the system’s overload condition.” • Example?
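One possible example, offered as an assumption rather than the lecturer's intended answer: a bounded request queue already tracks its own depth, so reading that depth measures overload without adding work to the overloaded resource. The 0.75 shedding threshold and the names `overload_level` and `shed_load` are hypothetical.

```python
import queue


def overload_level(request_queue):
    """Read an indicator the system already maintains: the queue fill ratio."""
    # qsize() is only approximate under concurrency, which is acceptable
    # for a coarse overload indicator.
    return request_queue.qsize() / request_queue.maxsize


requests = queue.Queue(maxsize=10)
for request_id in range(8):
    requests.put_nowait(request_id)

level = overload_level(requests)
shed_load = level >= 0.75   # hypothetical threshold for starting load shedding
```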