Application of AI- and ML-Techniques to Fault-Tolerant Routing. Arjun Rao CS 717 November 16 and 18, 2004. Papers Covered. [1] Loh, Peter K.K., “Artificial Intelligence Search Techniques as Fault-Tolerant Routing Strategies”
Application of AI- and ML-Techniques to Fault-Tolerant Routing Arjun Rao CS 717 November 16 and 18, 2004
Papers Covered • [1] Loh, Peter K.K., “Artificial Intelligence Search Techniques as Fault-Tolerant Routing Strategies” • [2] Loh, Shaw., “A Genetic-Based Fault-Tolerant Routing Strategy for Multiprocessor Networks”
Papers Covered (cont.) • [3] Loh, Schröder, Hsu., “Fault-Tolerant Routing on Complete Josephus Cubes” (not AI-related but interesting nevertheless) If time permits, also: • [4] Bradley, Tyrrell., “Immunotronics: Hardware Fault Tolerance Inspired by the Immune System”
The Problem of Routing • Communication between nodes • Servers • Microprocessors • Desire shortest, most efficient paths • Multiprocessor network topologies, e.g. hypercubes, Josephus cubes, etc. • Desire availability of paths • What to do when links/nodes fail? • How to remain (close to) optimal?
Intro to Fault-Tolerant Routing • Current algorithms adaptive but non-minimal • Misrouting • Routing strategies tied to specific topologies • k-ary n-cubes, meshes, etc.: Regular structures and symmetry • Constrained by number and types of faults • More general strategies vulnerable to deadlock and livelock
“Turn Model” [Glass, Ni] • Widest application scope • k-ary n-cubes, nD-meshes, torus geometries, etc. • “West-First” algorithm (on 2D-mesh) • Messages prevented from turning “west” again • Prevents cycles, and therefore deadlocks • Routing along virtual channels in strictly decreasing or increasing order
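The west-first rule can be sketched as a function that lists the directions a message may take next. This is a minimal, fault-free variant for illustration only: the function name `west_first_moves` and the coordinate convention (x grows eastward, y grows northward) are assumptions, not taken from Glass and Ni.

```python
def west_first_moves(cur, dst):
    """Return the productive directions allowed by a minimal west-first rule.

    cur and dst are (x, y) mesh coordinates; x grows east, y grows north
    (an assumed convention for this sketch).
    """
    (x, y), (dx, dy) = cur, dst
    if dx < x:
        # All westward hops must be taken first, so no other choice is offered.
        return ["W"]
    moves = []
    if dx > x:
        moves.append("E")
    if dy > y:
        moves.append("N")
    if dy < y:
        moves.append("S")
    # Once west is no longer needed, the message never turns west again.
    return moves


print(west_first_moves((3, 1), (0, 4)))  # destination lies to the west
print(west_first_moves((0, 1), (2, 4)))  # adaptive choice among the rest
```

Because "W" is only ever returned while the destination is still to the west, the turns back into the west direction never occur, which is the property the slide relies on.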
Turn Model (cont.) • Three examples of routing • “F” = FAILURE • Full adaptation w/o deadlock and livelock requires more global info → more overhead
AI Search Techniques • Arbitrary topology → Search space • Search space → Search tree(s) • Adaptive but still non-minimal • Characteristic recursion impractical on loosely-coupled, distributed network
AI Logical Abstraction • Abstraction: • S: Problem space • O: Set of objectives • P: Search paths • $S = (O, P)$, where $o_i \in O$ and $p_j \in P$; each $p_j$ connects tuple $(o_k, o_l)$, $k \neq l$ • Abstraction used to model…
Multiprocessor Network w/ Generic Topology • Network • N: Nodes • L: Links between nodes • $G = (N, L)$, where $n_i \in N$ and $l_j \in L$; each $l_j$ connects tuple $(n_k, n_l)$, $k \neq l$ • Objective ↔ Node • Search path ↔ Link
Abstract Routing Model • Search: • $(o_s, o_t)$: $S \times S \to S^*$, where $S = (O, P)$ and $S^* = (O^*, P^*)$ • $o_x, o_y \in O$ and $o_x, o_y \in O^*$ → Successful search • $o_x, o_y \in O$ and $o_x \in O^*$, $o_y \notin O^*$ → Unsuccessful • Routing attempt R: • $R(n_s, n_d)$: $G \times G \to G^*$, where $G = (N, L)$ and $G^* = (N^*, L^*)$ • $n_i, n_j \in N$ and $n_i, n_j \in N^*$ → Complete route • $n_i, n_j \in N$ and $n_i \in N^*$, $n_j \notin N^*$ → Incomplete
Routing Analogy • AI search equivalent to routing attempt • Successful search ↔ Route between source and destination nodes • Unsuccessful search ↔ Incomplete route to destination
Caveats of Analogy • No specific search algorithm → No routing strategy • No optimality constraints • Nothing about deadlocks/livelocks • Nothing about fault tolerance!!
Fault-Tolerant Routing Model • Model considers two aspects: • Routing system configuration • Must be generic enough! • Message propagation protocols and policies • Following slides introduce what is needed for AI searches (w/ physical message backtracking)
FT Routing Model (cont.) • Eager readership of input messages • Single input buffer to avoid polling • Multiple output buffers to accommodate different delivery rates • Router process: • AI/FT routing strategy implemented here • Physical message backtracking → Increased message sizes • Increased message sizes/overhead → Requires communications router at each node
Communications Router (cont.) • Communication router constitutes router process and connections • Main components: LCM and CP • ROM: Stores link management and routing software • RAM: Stores routing table, link status table, associated link lists
CR Routing Table • For each node, up to n links • For each link: • Connected with status OK and node ID of neighbor • Not connected with status NC and node ID –1 • Link fault represented by timeout: • Status reset to NC • Processor fault represented by timeouts in neighbors
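The per-node table described above could be modeled roughly as follows. This is only a sketch: the names `LinkTable`, `connect`, `mark_timeout`, and `live_neighbors` are hypothetical, and resetting the neighbor ID to -1 on timeout is an assumption. Only the OK/NC statuses and the -1 placeholder ID come from the slide.

```python
from dataclasses import dataclass, field

OK, NC = "OK", "NC"  # link statuses named on the slide


@dataclass
class LinkTable:
    """Per-node table holding (status, neighbor ID) for up to n_links links."""
    n_links: int
    entries: list = field(default_factory=list)

    def __post_init__(self):
        # Every link starts as not connected, with neighbor ID -1.
        self.entries = [(NC, -1) for _ in range(self.n_links)]

    def connect(self, link, neighbor_id):
        self.entries[link] = (OK, neighbor_id)

    def mark_timeout(self, link):
        # A timeout is treated as a link fault: status is reset to NC.
        # Clearing the neighbor ID as well is an assumption of this sketch.
        self.entries[link] = (NC, -1)

    def live_neighbors(self):
        return [nid for status, nid in self.entries if status == OK]


table = LinkTable(n_links=4)
table.connect(0, 7)
table.connect(2, 9)
table.mark_timeout(0)          # link 0 timed out
print(table.live_neighbors())  # only the neighbor on link 2 remains
```

A processor fault would show up in this model as `mark_timeout` being called in every neighbor's table for the link that leads to the failed node.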
Message Packets • Six fields: • Router Control (4 bits): Type of message, including NORMAL and BACKTRACK • Destination Node ID (10 bits): Supports network of size up to 1024 nodes • Pending Nodes (20 bytes): Stack of node IDs that may receive packet but have not yet • Traversed Nodes (20 bytes): Stack of nodes traversed, with most recent on top
Message Packets (cont.) • Traversed Nodes Index (10 bits): Index to previous traversed nodes field. Supports simulation of physical message backtracking • Data Field (n-bit pointer): Points to information content of packet
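As a rough illustration of the six fields, the packet header could be represented like this. The class name `Packet`, the numeric encodings of NORMAL and BACKTRACK, and the use of Python lists for the two stacks are assumptions made for the sketch; the field widths in the comments are the ones given on the slides.

```python
from dataclasses import dataclass, field

NORMAL, BACKTRACK = 0, 1  # hypothetical encodings for the 4-bit Router Control


@dataclass
class Packet:
    destination: int                               # 10 bits: node IDs 0..1023
    control: int = NORMAL                          # Router Control (4 bits)
    pending: list = field(default_factory=list)    # Pending Nodes stack, top = last item
    traversed: list = field(default_factory=list)  # Traversed Nodes stack, most recent last
    traversed_index: int = 0                       # index used when simulating backtracking
    data: object = None                            # stands in for the n-bit data pointer

    def __post_init__(self):
        if not 0 <= self.destination < 2 ** 10:
            raise ValueError("destination must fit in 10 bits")


pkt = Packet(destination=42)
pkt.traversed.append(3)       # packet has passed through node 3
pkt.pending.extend([5, 8])    # nodes 5 and 8 may still receive it
next_node = pkt.pending.pop() # the top of the Pending Nodes stack is tried first
```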
(Finally) AI Search Strategies • Brute Force: • Depth-First Search • Random Climbing • Heuristic: • Hill Climbing • Best-First Search • A*
AI Search Strategies (cont.) • In presence of network faults: • Prevent cycles → No deadlocks • Prevent more than two traversals of nodes/links → No livelocks, and necessary for AI searches • Adaptations of search algorithms • Problems: • Recursion? Nope (PMB) • Overhead? Fixed (Well, mostly…)
Common Beginning
Extract header and disassemble it
IF Destination Node is reached, pass packet to host processor
ELSE IF Router Control is BACKTRACK
    IF Pending Nodes top node is directly linked
        Route packet to that node
        Set Router Control to NORMAL
    ELSE
        Backtrack packet to previous node in Traversed Nodes
Pop current node ID from Pending Nodes
Push current node ID onto Traversed Nodes
Depth-First Search • Travel as far as possible • Do not consider alternative paths just yet • If fault or dead-end, backtrack to most recent possible path
DFS (cont.)
Following common beginning:
Look for directly linked successor nodes
    IF they are already traversed, ignore
    ELSE IF they are in Pending Nodes, ignore
    ELSE push them onto Pending Nodes
Read top node of Pending Nodes
    IF directly linked (no fault), route packet to it
    ELSE set BACKTRACK and route to last traversed node
END
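The DFS steps above can be simulated in a few lines. This is a sketch under stated assumptions rather than the paper's implementation: the whole route is traced in one loop instead of hop by hop in each router, `links` contains only working links, and the names `dfs_route`, `path`, and `hops` are hypothetical.

```python
def dfs_route(links, source, dest):
    """Trace a packet routed by DFS with physical backtracking.

    links maps each node to the set of neighbors reachable over working
    links. Returns every node visit in order (backtracking included),
    or None if the destination cannot be reached.
    """
    pending = []        # Pending Nodes stack (top = last element)
    traversed = set()   # nodes the packet has already reached
    path = [source]     # current route, walked backwards when backtracking
    hops = [source]
    while path[-1] != dest:
        current = path[-1]
        if current not in traversed:
            traversed.add(current)
            # Push successors that are neither traversed nor already pending.
            for succ in sorted(links.get(current, ()), reverse=True):
                if succ not in traversed and succ not in pending:
                    pending.append(succ)
        if not pending:
            return None                  # nothing left to try
        if pending[-1] in links.get(current, ()):
            path.append(pending.pop())   # NORMAL: forward to top of stack
        else:
            path.pop()                   # BACKTRACK to last traversed node
            if not path:
                return None
        hops.append(path[-1])
    return hops


links = {0: {1, 2}, 1: {0, 3}, 2: {0, 4}, 3: {1}, 4: {2, 5}, 5: {4}}
print(dfs_route(links, 0, 5))  # → [0, 1, 3, 1, 0, 2, 4, 5]
```

The trace shows the non-minimal behavior mentioned earlier: the packet runs into the dead end at node 3 and physically travels back through nodes 1 and 0 before trying node 2.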
Random Climbing Following the common beginning: … ELSE Select a successor node randomly Push unselected successor nodes onto Pending Nodes …
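The random selection step might look like the sketch below. The function name `random_climb_step` is hypothetical, and filtering out traversed and already pending nodes is carried over from the DFS slide as an assumption, since the slide elides that part.

```python
import random


def random_climb_step(pending, traversed, successors, rng):
    """Pick one untried successor at random; push the others onto pending.

    Returns the chosen node, or None when no untried successor exists.
    """
    fresh = [n for n in successors if n not in traversed and n not in pending]
    if not fresh:
        return None
    chosen = rng.choice(fresh)
    # Unselected successors stay available for later backtracking.
    pending.extend(n for n in fresh if n != chosen)
    return chosen


rng = random.Random(0)   # seeded only to make the demo repeatable
stack = []
chosen = random_climb_step(stack, traversed={1}, successors=[1, 2, 3, 4], rng=rng)
print(chosen, stack)
```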
Hill Climbing • Heuristic: Estimated remaining distance Following common beginning: … ELSE Sort successor nodes according to est. remaining distance Push sorted nodes onto Pending Nodes …
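A sketch of the hill-climbing push step follows. The function name `hill_climb_push`, the one-dimensional distance estimate, and the filtering of traversed or pending nodes (borrowed from the DFS slide) are assumptions for illustration.

```python
def hill_climb_push(pending, traversed, successors, est_distance):
    """Push new successors so the one nearest the destination ends on top."""
    fresh = [n for n in successors if n not in traversed and n not in pending]
    # Farthest first, so the nearest candidate is pushed last (stack top).
    fresh.sort(key=est_distance, reverse=True)
    pending.extend(fresh)
    return pending


# Hypothetical 1-D example: node IDs are positions, destination is node 10.
est = lambda n: abs(10 - n)
stack = hill_climb_push([8], traversed={5}, successors=[4, 6, 9, 5], est_distance=est)
print(stack)  # → [8, 4, 6, 9]
```

Only the newly found successors are ordered. The older entry 8 stays at the bottom even though it is closer to the destination than 4 or 6, which reflects the point that hill climbing considers only immediate neighbors.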
Best-First Search • Resumes partial routes not previously considered • Looks at immediate neighbors, neighbors of predecessors • Sorts by est. remaining distance • Leads to non-minimal routes!
BFS (cont.) … ELSE Push (directly linked successor nodes) onto Pending Nodes Sort Pending Nodes according to est. remaining distance …
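The best-first variant differs from hill climbing in that the whole Pending Nodes stack is re-sorted after each push. A minimal sketch, with the hypothetical name `best_first_push` and an assumed one-dimensional distance estimate:

```python
def best_first_push(pending, traversed, successors, est_distance):
    """Push new successors, then re-sort the entire pending stack."""
    for n in successors:
        if n not in traversed and n not in pending:
            pending.append(n)
    # Farthest at the bottom, nearest to the destination on top.
    pending.sort(key=est_distance, reverse=True)
    return pending


# Hypothetical 1-D example: node IDs are positions, destination is node 10.
est = lambda n: abs(10 - n)
stack = best_first_push([8], traversed={5}, successors=[4, 6, 9, 5], est_distance=est)
print(stack)  # → [4, 6, 8, 9]
```

Here the older entry 8 moves up next to the top, so a partial route found earlier can be resumed ahead of worse immediate neighbors.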
A* • Two heuristics: • Estimated remaining distance: h • Path length traversed: g • Partial paths sorted by f = g + h • When no faults, always finds minimal route
A* (cont.) After current ID processing: Record path length traversed, g … ELSE Calculate and store f for new successor nodes Push them onto Pending Nodes sorted by f …
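The A* push step can be sketched by storing f next to each pending node. This is an illustrative fragment, not a complete A* router: the name `a_star_push`, the `(f, node)` tuple layout, and the uniform `link_cost` of one hop are assumptions.

```python
def a_star_push(pending, traversed, successors, g, h, link_cost=1):
    """Store f = g + h for new successors and keep pending sorted by f.

    pending holds (f, node) pairs; the lowest f sits on top (last element).
    g is the path length traversed so far to the current node.
    """
    known = {node for _, node in pending}
    for n in successors:
        if n not in traversed and n not in known:
            f = (g + link_cost) + h(n)   # path length so far plus estimate
            pending.append((f, n))
    pending.sort(key=lambda entry: entry[0], reverse=True)
    return pending


# Hypothetical 1-D example: node IDs are positions, destination is node 10.
h = lambda n: abs(10 - n)
stack = a_star_push([(6, 8)], traversed={5}, successors=[4, 9], g=3, h=h)
print(stack)  # → [(10, 4), (6, 8), (5, 9)]
```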
Performance Testing • Simulated 125-node multiprocessor network • Max 8 links per node (maps to many topologies) • Faulty links and processors • Pre-specified or dynamically generated • Testing: • Messages between every pair of nodes • 20 trials at 0%, 5%, 10%, 15%, 20% faulty links • 125 x 125 x 20 x 6 = 1,875,000 tests (??)
Test Results • As faults increase, heuristic strategies fare better (esp. > 15%) • A* best search technique but slow • Hill climbing and BFS do not consider nodes traversed • Hill climbing considers only immediate neighbors
Main Point Using AI search techniques, we abstract from routing in networks to searching in trees (topology-independent, quantity and type of faults irrelevant)
Next Paper • [1] Loh, Peter K.K., “Artificial Intelligence Search Techniques as Fault-Tolerant Routing Strategies” • [2] Loh, Shaw., “A Genetic-Based Fault-Tolerant Routing Strategy for Multiprocessor Networks”
Our Little Problem… • AI search techniques topology- and fault-type independent… • …but non-minimal routes utilized • Follow-up work shows how genetic algorithms (combined with heuristics) can find minimal routes in presence of network faults
Genetic Algorithms: Overview • Optimization strategy • Population of potential solutions evolve over series of generations • Each element of population is chromosome; each unit of chromosome is gene • Chromosomes undergo crossover and mutation • Most fit chromosomes selected for next generation, based upon fitness function
Abstract Model • Same as before (including definitions of S and G) • Pure abstraction suffers from same caveats as before • Basic idea: Instead of AI search for adaptive route, optimize over population of routes to find best
Message Packets • Simplified version:
Chromosome • Route ↔ Chromosome • Node on route ↔ Gene in chromosome • Length of route ↔ Size of chromosome • Chromosome size directly reflects routing performance! • Distance traversed basis of fitness
Mutation and Crossover • Mutation: Swap and/or shift • Normal crossover destroys routes, messes with source and destination; problem w/ different lengths • Use one-point random crossover
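One way to keep the source and destination fixed while combining routes of different lengths is sketched below. These operators are assumptions written for illustration, not the operators from the paper: `one_point_crossover` cuts each parent at its own random point, and `swap_mutation` exchanges two interior genes. A child produced this way may use links that do not exist or repeat nodes, so a real implementation would still need to validate or repair it.

```python
import random


def one_point_crossover(route_a, route_b, rng):
    """Join a head of route_a to a tail of route_b; endpoints are preserved."""
    i = rng.randint(1, len(route_a) - 1)  # head keeps at least the source
    j = rng.randint(1, len(route_b) - 1)  # tail keeps at least the destination
    return route_a[:i] + route_b[j:]


def swap_mutation(route, rng):
    """Swap two interior genes; source and destination never move."""
    if len(route) < 4:
        return list(route)                # fewer than two interior genes
    i, j = rng.sample(range(1, len(route) - 1), 2)
    mutated = list(route)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


rng = random.Random(7)   # seeded only to make the demo repeatable
parent_a = [0, 1, 3, 6, 9]
parent_b = [0, 2, 4, 5, 7, 9]
child = one_point_crossover(parent_a, parent_b, rng)
mutant = swap_mutation(parent_b, rng)
print(child, mutant)
```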
Fitness Function • $F = \frac{D_{max} - D_{route}}{D_{max}} + \epsilon$ • $D_{max}$: Maximum distance between source and destination • $D_{route}$: Distance traveled by specific route • $\epsilon$: Predefined value to ensure non-zero fitness • Higher value → More fit
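The fitness function can be written out directly. In this sketch the predefined constant is called `offset`, and its default value of 0.01 is an arbitrary assumption; the slide does not give a value.

```python
def fitness(d_route, d_max, offset=0.01):
    """Fitness of a route: shorter routes score higher.

    d_route: distance traveled by the route
    d_max:   maximum distance between source and destination
    offset:  predefined constant that keeps the fitness non-zero
    """
    if d_max <= 0:
        raise ValueError("d_max must be positive")
    return (d_max - d_route) / d_max + offset


short, long_, worst = fitness(4, 10), fitness(8, 10), fitness(10, 10)
print(short > long_ > worst > 0)  # → True
```

Without the offset, a route as long as the maximum distance would score exactly zero, and a selection scheme that picks chromosomes in proportion to fitness would never choose it.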