Distributed Reinforcement Learning for a Traffic Engineering Application Mark D. Pendrith, DaimlerChrysler Research & Technology Center Presented by: Christina Schweikert
Distributed Reinforcement Learning for a Traffic Engineering Problem • Intelligent cruise control system • Lane change advisory system based on traffic patterns • Optimize a group policy by maximizing utilization of the freeway as a shared resource • Introduce two new algorithms (Monte Carlo-based Piecewise Policy Iteration and Multi-Agent Distributed Q-learning) and compare their performance in this domain
Distronic Adaptive Cruise Control • A radar sensor scans the full width of a three-lane motorway over a distance of approximately 100 m and recognizes any moving vehicles ahead • The reflection of the radar impulses and the change in their frequency enable the system to calculate the distance and the relative speed between the vehicles
Distronic Adaptive Cruise Control • If the distance to the vehicle in front decreases, the cruise control system immediately reduces acceleration or, if necessary, applies the brakes • If the distance increases, it acts as a conventional cruise control system and, at speeds between 30 and 180 km/h, maintains the programmed speed • The driver is alerted in emergencies
Distronic Adaptive Cruise Control • Automatically maintains a constant distance to the vehicle in front, helping to prevent rear-end collisions • The reaction time of drivers using Distronic is up to 40 per cent faster than that of drivers without this assistance system
Distributed Reinforcement Learning • State – the agents within sensing range • Agents share a partially observable environment • Goal – integrate the agents' experiences to learn an observation-based policy that maximizes group performance • Agents share a common policy, giving a homogeneous population of agents
Traffic Engineering Problem • Population of cars, each with a desired traveling speed, sharing a freeway network • Subpopulation with radar capability to detect relative speeds and distances of cars immediately ahead, behind, and around them
Problem Formulation • Optimize the average per-time-step reward by minimizing the per-car average loss at each time step:

$$\mathrm{loss} = \frac{1}{n} \sum_{i=1}^{n} \big( v_d(i) - v_a(i) \big)$$

where $v_d(i)$ is the desired speed of car $i$, $v_a(i)$ is the actual speed of car $i$, and $n$ is the number of cars in the simulation at that time step.
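A minimal Python sketch of this per-step reward, assuming speeds in mph; the function and variable names are mine, not from the paper:

```python
def per_step_reward(desired, actual):
    """Negative of the per-car average speed loss at one time step."""
    n = len(desired)
    loss = sum(vd - va for vd, va in zip(desired, actual)) / n
    return -loss

# Example: three cars, each traveling below its desired speed
print(per_step_reward([60, 65, 55], [50, 55, 50]))  # -> -8.33...
```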
State Representation • The view of the world for each car is represented by an 8-dimensional feature vector – the relative distances and speeds of the surrounding cars
Pattern of Cars in Front of Agent • 0 – lane is clear (no car in radar range, or the nearest car is faster than the agent's desired speed) • 1 – the fastest car ahead is slower than the desired speed • 2 – slower • 3 – slower still
Pattern of Cars Behind Agent • 0 – lane is clear (no car in radar range, or the nearest car is slower than the agent's current speed) • 1 – the slowest car behind is faster than the desired speed • 2 – faster • 3 – faster still
Lane Change • 0 – lane change not valid • 1 – lane change valid. If there is not a safe gap both in front and behind, the lane change is illegal. (A sketch of this discretization appears below.)
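A hedged sketch of the per-lane discretization from the three slides above. The slides give only the ordering ("slower … slower still"), so the 5 mph speed bands and the function names below are assumptions:

```python
def front_code(fastest_ahead, desired, band=5.0):
    """Code 0-3 for the pattern of cars ahead in one lane.
    `fastest_ahead` is None when no car is in radar range (assumption)."""
    if fastest_ahead is None or fastest_ahead >= desired:
        return 0                      # lane is clear
    deficit = desired - fastest_ahead
    if deficit <= band:
        return 1                      # slower than desired speed
    return 2 if deficit <= 2 * band else 3   # slower / slower still

def behind_code(slowest_behind, current, desired, band=5.0):
    """Code 0-3 for the pattern of cars behind in one lane."""
    if slowest_behind is None or slowest_behind <= current:
        return 0                      # lane is clear from behind
    surplus = slowest_behind - desired
    if surplus <= band:
        return 1                      # faster than desired speed
    return 2 if surplus <= 2 * band else 3   # faster / faster still

# These per-lane codes, together with the lane-change validity bits,
# would make up the 8-dimensional feature vector (exact layout assumed).
```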
Monte Carlo-based Piecewise Policy Iteration • Performs approximate piecewise policy iteration, where the possible policy changes for each state are evaluated by Monte Carlo estimation • Piecewise – the policy is changed one state at a time, rather than in parallel • Searches the space of deterministic policies directly, without representing the value function
Policy Iteration • Start with an arbitrary deterministic policy for the given MDP • Generate a better policy by calculating the best single improvement possible for each state (by Monte Carlo estimation) • Combine all changes to generate the successor policy • Continue until no improvement is possible – the optimal policy (see the sketch below)
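An illustrative Python sketch of the piecewise variant described above, changing the policy one state at a time and scoring each candidate action by Monte Carlo estimation. `evaluate` is a hypothetical stand-in for running simulation trials and averaging the observed reward:

```python
def piecewise_policy_iteration(states, actions, evaluate, max_sweeps=100):
    """evaluate(policy, s, a): Monte Carlo value estimate of taking a in s."""
    policy = {s: actions[0] for s in states}      # arbitrary initial policy
    for _ in range(max_sweeps):
        improved = False
        for s in states:                          # one state at a time
            values = {a: evaluate(policy, s, a) for a in actions}
            best = max(values, key=values.get)
            if values[best] > values[policy[s]]:
                policy[s] = best                  # best single improvement
                improved = True
        if not improved:                          # no change -> done
            break
    return policy
```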
Multi-Agent Distributed Q-Learning: Q-Learning • Q-value estimates are updated after each time step, based on the state transition that follows the selected action • For each time step, only one state transition and one action are used to update the Q-value estimates • In DQL, there can be as many state transitions per time step as there are agents
Multi-Agent Distributed Q-Learning • Takes the average backup value for a state/action pair <s, a> over all agents that selected action a from state s at the last time step • The Qmax component of the backup value is calculated over the actions valid for a particular agent to select at the next time step (see the sketch below)
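A hedged sketch of the averaged DQL backup just described; the learning rate, discount factor, and table layout are assumptions, not values from the paper:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95   # assumed hyperparameters

def dql_update(Q, transitions, valid_actions):
    """Q: defaultdict(float) keyed by (state, action).
    transitions: one (s, a, r, s_next) per agent for the last time step.
    valid_actions(s_next): actions a particular agent may select next."""
    backups = defaultdict(list)
    for s, a, r, s_next in transitions:
        # Qmax is taken over this agent's valid next actions
        q_max = max(Q[(s_next, a2)] for a2 in valid_actions(s_next))
        backups[(s, a)].append(r + GAMMA * q_max)
    for (s, a), targets in backups.items():
        avg = sum(targets) / len(targets)   # average backup over agents
        Q[(s, a)] += ALPHA * (avg - Q[(s, a)])
```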
Simulation for Offline Learning Advantages: • Since the true state of the environment is known, the loss metric can be measured directly • Simulation can be run faster than real time, enabling many long learning trials • Safety. Learn policies offline, then integrate them into an intelligent cruise control system with lane advisory, route planning, etc.
Traffic Simulation Specifications • Circular 3-lane freeway, 13.3 miles long, with 200 cars • Half follow a "selfish drone" policy • The rest follow the current learnt policy plus active exploration decisions • Gaussian distribution of desired speeds, with a mean of 60 mph • Cars share low-level collision avoidance and differ only in lane-change strategy (parameters are summarized in the sketch below)
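A hypothetical parameter block mirroring the specification above; the names and the standard deviation are assumptions (the slide gives only the mean desired speed):

```python
import random

SIM_CONFIG = {
    "track_miles": 13.3,         # circular 3-lane freeway
    "lanes": 3,
    "n_cars": 200,
    "drone_fraction": 0.5,       # half follow the "selfish drone" policy
    "desired_speed_mean": 60.0,  # mph
    "desired_speed_sd": 5.0,     # assumed; not given on the slide
}

def sample_desired_speed(cfg=SIM_CONFIG):
    """Draw one car's desired speed from the Gaussian distribution."""
    return random.gauss(cfg["desired_speed_mean"], cfg["desired_speed_sd"])
```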
Experimental Results • Selfish drone policy – consistent per-step reward of -11.9 (each agent traveling, on average, 11.9 mph below its desired speed) • APPIA and DQL found policies 3-5% better • The best policies used "look ahead" only • The "look behind" model provided more stable learning • "Look behind" outperforms "look ahead" at times when a good policy is lost