Machine reconstruction of human control strategies Dorian Šuc Artificial Intelligence Laboratory Faculty of Computer and Information Science University of Ljubljana, Slovenia
Overview • Skill reconstruction and behavioural cloning • The learning problem • A problem decomposition for behavioural cloning (indirect controllers, experiments, advantages) • Symbolic and qualitative skill reconstruction • Learning qualitative strategies: QUIN algorithm • QUIN in skill reconstruction • Conclusions
Skill reconstruction and behavioural cloning • Motivation: • understanding of the human skill • development of an automatic controller • ML approach to skill reconstruction: learn a control strategy from data logged from skilled human operators (execution traces). Later called behavioural cloning (Michie, 93). • Early work: Chambers and Michie (69), learning control by imitation; also Donaldson (60, 64)
Behavioural cloning: some applications • Original approach: clones usually induced as a direct mapping from states to actions, in the form of trees or rule sets • Successfully used in domains such as: • pole balancing (Michie et al., 90) • piloting (Sammut et al., 92; Camacho, 95) • container cranes (Urbančič, 94) • production line scheduling (Kerr and Kibira, 94) • Reviews in Sammut (96) and Bratko et al. (98)
Learning problem • Execution traces used as examples for ML to induce: • a control strategy (comprehensible, symbolic) • an automatic controller (criterion of success) • Operator’s execution trace: a sequence of system states and the corresponding operator’s actions, logged to a file at a certain frequency • Reconstruction of human control skill: • Skill: “know how” at the subsymbolic level, operational • Strategy: explicitly described “know how” at the symbolic level
Container crane • Used in ports for load transportation • Control forces: Fx, FL • State: X, dX, φ, dφ, L, dL • Based on previous work of Urbančič (94) • Control task: transport the load from the start to the goal position
Learning problem, cont. Excerpt from an execution trace:

Fx     FL   X     dX    φ      dφ     L      dL
0      0    0.00  0.00   0.00   0.00  20.00  0.00
2500   0    0.00  0.00  -0.00  -0.01  20.00  0.00
6000   0    0.00  0.01  -0.01  -0.02  20.00  0.00
10000  0    0.02  0.10  -0.07  -0.27  20.00  0.00
14500  0    0.12  0.31  -0.32  -0.85  20.00  0.00
14500  0    0.35  0.59  -0.95  -1.49  20.00  0.01
…      …    …     …     …      …      …      …
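A minimal sketch (not the thesis code) of how such a logged trace can be turned into (state, action) examples for learning; the column layout follows the excerpt above and the file name is hypothetical.

```python
import numpy as np

COLUMNS = ["Fx", "FL", "X", "dX", "phi", "dphi", "L", "dL"]

def load_trace(path):
    data = np.loadtxt(path)     # one row per logged time step
    actions = data[:, :2]       # operator's control forces Fx, FL
    states = data[:, 2:]        # system state X, dX, phi, dphi, L, dL
    return states, actions

# Direct cloning would learn a mapping state -> action from these pairs;
# the indirect approach below instead learns constraints on the state trajectory.
# states, actions = load_trace("crane_trace_operator1.txt")   # hypothetical file
```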
Problems of the original approach Difficulties observed with the original approach: • No guarantee of inducing, with high probability, a successful clone (Urbančič and Bratko, 94) • Low robustness of clones • Poor comprehensibility of clones; hard to understand Michie (93, 95) suggests that a kind of problem decomposition could be helpful: “learning from exemplary performance requires more than mindless imitation” Recent approaches to behavioural cloning (Stirling, 95; Bain and Sammut, 99; Camacho, 2000)
Related work • Leech (86): probably the first goal-structured learning of control • CHURPs (Stirling, 95): separates control skills into planning and actuation phases; focuses on the planning component; assumes the goals are given • GRAIL (Bain and Sammut, 99): learning goals by decision trees and effects by abduction • Incremental Correction model (Camacho, 2000): homeostatic and achievable goals; parametrised decision trees to learn goals; wrapper approach
Our approach Our goals: • transparency of the induced strategies • robust and successful controllers Ideas: • Learning problem decomposition: (a) learning of the constraints on the operator’s trajectories, (b) learning of the system’s dynamics • Generalized trajectory as a continuous subgoal • Symbolic and qualitative constraints, use of domain knowledge Differences from related approaches: • continuous generalized trajectory • qualitative strategies
Experimental domains Container crane: • we used execution traces from (Urbančič, 94) Acrobot (DeJong, 95; Sutton, 96) • a two-link pendulum in a gravitational field; swing-up task Bicycle riding (Randløv and Alstrøm, 98) • drive the bike from the start to the goal position; requires simultaneous balancing and goal-aiming Simulators used in all experiments Measure of success: • time to accomplish the task
Operator’s trajectory • A sequence of the states from an execution trace • A path in the state space (Figure: operator’s trajectory of the trolley velocity dX in the space of X, φ and dX)
Generalized trajectory Induced constraints on operator’s trajectory • Constraints can be represented as: • trees • equations • qualitative constraints
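The thesis induces equation-form trajectory constraints with GoldHorn; the following is only an illustrative least-squares stand-in that fits the trolley velocity dX as a function of X and φ, i.e. an equation-form generalized trajectory. The particular term set is an assumption.

```python
import numpy as np

def fit_generalized_trajectory(X, phi, dX):
    # design matrix [1, X, X^2, phi]; this choice of terms is an assumption
    A = np.column_stack([np.ones_like(X), X, X**2, phi])
    coef, *_ = np.linalg.lstsq(A, dX, rcond=None)
    return coef   # coefficients of an equation-form trajectory constraint
```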
Qualitative and quantitative strategy • Quantitative strategy: given by precise numerical values or numeric constraints (decision tree, equation) • Qualitative strategy: may also use qualitative constraints. A qualitative strategy defines a set of quantitative strategies • We use qualitatively constrained functions (QCFs): monotonicity constraints as used in qualitative reasoning
Qualitatively constrained functions • M+(x): an arbitrary monotonically increasing function of x • A QCF is a generalization of M+, similar to the qualitative proportionality predicates used in QPT (Forbus, 84) Gas in a container: Pres = c Temp / Vol, c = n R > 0 QCF: Pres = M+,-(Temp, Vol) • Temp=std & Vol↑ ⇒ Pres↓ • Temp↑ & Vol↓ ⇒ Pres↑ • Temp↓ & Vol↑ ⇒ Pres↓ • Temp↑ & Vol↑ ⇒ Pres? • Temp↓ & Vol↓ ⇒ Pres?
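A small sketch of what the QCF Pres = M+,-(Temp, Vol) predicts for the qualitative change of Pres, given the qualitative changes of Temp and Vol (+1 increasing, -1 decreasing, 0 steady); it reproduces the cases listed above.

```python
def qcf_prediction(d_temp, d_vol, signs=(+1, -1)):
    # signs encode M+,-(Temp, Vol): Pres increases with Temp, decreases with Vol
    votes = {s * d for s, d in zip(signs, (d_temp, d_vol)) if d != 0}
    if votes == {+1}:
        return "Pres increases"
    if votes == {-1}:
        return "Pres decreases"
    if not votes:
        return "Pres steady"
    return "Pres: ambiguous (?)"

print(qcf_prediction(0, +1))    # Temp steady, Vol up  -> Pres decreases
print(qcf_prediction(+1, +1))   # Temp up, Vol up      -> ambiguous (?)
```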
Direct and indirect controllers • Direct controllers (a learned mapping from states to actions): the original approach, BOXES, ASE/ACE • Indirect controllers (actions computed to follow a learned subgoal or trajectory): our approach; also CHURPs (Stirling, 95), GRAIL (Bain and Sammut, 99), ICM (Camacho, 2000)
Robustness of direct and indirect controllers against learning error • Experiment: modelling learning of direct and indirect controllers with some learning error: • direct controllers: “correct action” + noise(σ) • indirect controllers: “correct trajectory” + noise(σ) • Two error models: • Gaussian noise • Biased Gaussian noise (all errors in the same direction) • Simple, deterministic, discrete-time system: • Control task: reach and maintain the goal value Xg • Performance criterion: controller error in Xg
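A minimal sketch of this kind of experiment, under an assumed integrator system x' = x + u and an assumed correct behaviour that reaches xg in ten steps and then holds it (not the exact system from the thesis). Biased noise accumulates in the direct clone's state, while the indirect clone's error stays close to the bias itself.

```python
import numpy as np

def simulate(kind, xg=1.0, sigma=0.05, bias=0.0, steps=100, seed=1):
    rng = np.random.default_rng(seed)
    # assumed "correct" behaviour: reach xg in 10 equal steps, then hold it
    x_star = np.minimum(np.arange(1, steps + 1) * xg / 10, xg)
    u_star = np.diff(np.concatenate(([0.0], x_star)))   # correct action sequence
    x = 0.0
    for t in range(steps):
        eps = bias + sigma * rng.standard_normal()
        if kind == "direct":
            u = u_star[t] + eps          # "correct action" + noise
        else:
            x_des = x_star[t] + eps      # "correct trajectory" + noise
            u = x_des - x                # action recovered from the known dynamics
        x = x + u                        # assumed system: x' = x + u
    return abs(x - xg)                   # controller error at the goal

for kind in ("direct", "indirect"):
    print(kind, simulate(kind, bias=0.02))   # biased noise hurts the direct clone
```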
Robustness of direct and indirect controllers against learning error (2) Biased noise affects direct controllers much more
Possible advantages of indirect controllers • Less prone to departing from the operator’s trajectory • More robust against changes in the system’s dynamics and small changes in the task • Generalizing the trajectory is often easier than generalizing the actions • The generalized trajectory is often easier to understand (fewer details)
Symbolic and qualitative skill reconstruction • Tools: GoldHorn (Križman, 98) for equation discovery, LWR (locally weighted regression; Atkeson et al., 97) • Experiments in the crane and acrobot domains
Experiments in the crane domain • GoldHorn induced the generalized trajectory of the trolley velocity: dXdes = 0.902 - 0.018 X² + 0.090 X + 0.050 φ • Qualitative strategy: if X < Xmid then dXdes = M+,+(X, φ) else dXdes = M-,+(X, φ)
Transforming qualitative into quantitative strategies • By concretizing qualitative parameters into real numeric values or real-valued functions • First experiment: using randomly generated functions satisfying the qualitative constraints and additional domain knowledge: • maximal and minimal values of the state variables • the trolley starts towards the goal • the trolley stops at the goal • Second experiment: using further domain knowledge
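A hedged sketch of the first experiment for the crane strategy above: sample a random quantitative controller that respects the qualitative constraints (M+,+ before Xmid, M-,+ after it) and the listed domain knowledge. The coefficient ranges, goal position and velocity bound are illustrative assumptions.

```python
import random

def sample_controller(x_goal=60.0, dx_max=1.0):
    a = random.uniform(0.005, 0.05)   # positive slope in X (qualitative constraint)
    b = random.uniform(0.1, 1.0)      # positive slope in phi (M..,+)
    c = random.uniform(0.05, 0.2)     # ensures the trolley starts towards the goal
    x_mid = x_goal / 2.0

    def dx_des(x, phi):
        if x < x_mid:
            v = c + a * x + b * phi              # M+,+(X, phi)
        else:
            v = a * (x_goal - x) + b * phi       # M-,+(X, phi), zero at the goal
        return max(-dx_max, min(dx_max, v))      # bounded trolley speed
    return dx_des

controller = sample_controller()
print(controller(0.0, 0.0), controller(60.0, 0.0))   # starts forward, stops at goal
```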
Efficiency of the qualitative strategy • The results show that the qualitative strategy is: • general (the precise choice of qualitative parameters is not crucial) • successful: leaves room for controller optimization • Similar experiments in the acrobot domain
Qualitative induction • Motivation: our experiments with qualitative strategies (crane, acrobot) • A usual classification learning problem, but learning qualitative trees: • leaves contain qualitatively constrained functions (QCFs); QCFs constrain how the class changes in response to a change in the attributes • internal nodes (splits) define a partition of the state space into areas with common qualitative behaviour of the class variable
Qualitatively constrained function (QCF) • M+(x): an arbitrary monotonically increasing function of x • A QCF is a generalization of M+, similar to the qualitative proportionality predicates used in QPT (Forbus, 84) Gas in a container: Pres = c Temp / Vol, c = n R > 0 QCF: Pres = M+,-(Temp, Vol) • Temp=std & Vol↑ ⇒ Pres↓ • Temp↑ & Vol↓ ⇒ Pres↑ • Temp↓ & Vol↑ ⇒ Pres↓ • Temp↑ & Vol↑ ⇒ Pres? • Temp↓ & Vol↓ ⇒ Pres?
Learning QCFs • Example: Pres = 2 Temp / Vol

Temp    Vol    Pres
315.00  56.00  11.25
315.00  62.00  10.16
330.00  50.00  13.20
300.00  50.00  12.00
300.00  55.00  10.90

• Learning of the “most consistent” QCF: • For each pair of examples form a qualitative change vector • Select the QCF with minimal error-cost
Learning QCFs, cont. • Qualitative change vectors of example pairs, e.g. qTemp=neg & qVol=neg ⇒ qPres=pos • Inconsistent and ambiguous example pairs for each candidate QCF:

QCF               Incons.   Amb.
M+(Temp)          3         1
M-(Temp)          2,4       1
M+(Vol)           1,2,3     /
M-(Vol)           4         /
M+,+(Temp,Vol)    1,3       2
M+,-(Temp,Vol)    /         3,4
M-,+(Temp,Vol)    1,2       3,4
M-,-(Temp,Vol)    4         2

• Select the QCF with minimal error-cost (here M+,-(Temp, Vol))
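A sketch of how such a table can be computed: form a qualitative change vector for each pair of examples and count the pairs each candidate QCF gets wrong (inconsistent) or cannot decide (ambiguous). The table above lists the indices of the offending pairs among the pairs considered on the slide; this sketch simply counts over all pairs, and the exact error-cost used by QUIN is not reproduced.

```python
from itertools import combinations

def qsign(x, eps=1e-9):
    return 0 if abs(x) < eps else (1 if x > 0 else -1)

def score_qcf(examples, signs, class_var="Pres"):
    incons = amb = 0
    for a, b in combinations(examples, 2):
        votes = {s * qsign(b[v] - a[v]) for v, s in signs.items()} - {0}
        q_class = qsign(b[class_var] - a[class_var])
        if len(votes) != 1:
            amb += 1            # attribute changes pull the class in both directions
        elif votes != {q_class}:
            incons += 1         # the QCF predicts the opposite class change
    return incons, amb

data = [dict(Temp=315, Vol=56, Pres=11.25), dict(Temp=315, Vol=62, Pres=10.16),
        dict(Temp=330, Vol=50, Pres=13.20), dict(Temp=300, Vol=50, Pres=12.00),
        dict(Temp=300, Vol=55, Pres=10.90)]
print(score_qcf(data, {"Temp": +1, "Vol": -1}))   # candidate M+,-(Temp, Vol)
```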
Learning a qualitative tree • For every possible split, divide the examples into two subsets, find the “most consistent” QCF for each subset, and select the split minimizing a tree-error cost (based on MDL) • Algorithm ep-QUIN uses every pair of examples • An improvement: the heuristic QUIN algorithm, which also considers locality and consistency of qualitative change vectors
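A schematic induction loop in the spirit of ep-QUIN: try every split, score each subset with its best QCF, and keep the split that lowers the cost. Here leaf_cost is a placeholder for fitting the best QCF and measuring its error (e.g. built on the score_qcf sketch above); the MDL-based tree-error cost from the thesis is not shown.

```python
def induce_tree(examples, attributes, leaf_cost, min_leaf=5, split_penalty=1.0):
    # leaf_cost(examples) stands in for "fit the best QCF and return its error-cost"
    best_cost, best_split = leaf_cost(examples), None
    for attr in attributes:
        values = sorted({e[attr] for e in examples})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [e for e in examples if e[attr] <= thr]
            right = [e for e in examples if e[attr] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = leaf_cost(left) + leaf_cost(right) + split_penalty
            if cost < best_cost:
                best_cost, best_split = cost, (attr, thr, left, right)
    if best_split is None:
        return ("leaf", examples)        # the best QCF for these examples goes here
    attr, thr, left, right = best_split
    return ("split", attr, thr,
            induce_tree(left, attributes, leaf_cost, min_leaf, split_penalty),
            induce_tree(right, attributes, leaf_cost, min_leaf, split_penalty))
```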
Experimental evaluation in artificial domains • On a set of artificial domains with uniformly distributed attributes; 2 irrelevant attributes • Results by QUIN better than ep-QUIN • In simple domains QUIN finds qualitative relations corresponding to our intuition
QUIN in bicycle riding Control task: drive a bike from the start to the goal position • the bike’s speed is assumed constant • difficult because balancing and goal-aiming must be performed simultaneously • Controlled by the torque applied to the handlebars • State: goalAngle, goalDist, θ, dθ, ω, dω • QUIN: θdes = f(State)
Induced qualitative strategy

goalAngle ≤ 0.015:
    goalAngle ≤ -0.027: θdes = M+,+,-(ω, dω, goalAngle)
    goalAngle > -0.027: θdes = M+,+(ω, dω)
goalAngle > 0.015: θdes = M+,+,-(ω, dω, goalAngle)

(the two M+,+,- leaves contain the same QCF)
Induced qualitative strategy, interpretation

goalAngle near zero: θdes = M+,+(ω, dω) (balancing)
otherwise: θdes = M+,+,-(ω, dω, goalAngle) (balancing and goal-aiming)

Balancing: if the bike starts falling over, turn the front wheel in the direction of the fall
Goal-aiming: turn the front wheel away from the goal
Transforming qualitative into quantitative strategies • Transform QCFs into real-valued functions by using simple domain knowledge: • maximal front wheel deflection • drive straight if the bike is aiming at the goal: f(0,0,0) = 0 • balancing is more important than aiming at the goal • 400 randomly generated quantitative strategies; 59.2% successful • Tests of robustness: • change in the start state (58% successful) • random displacement of the bicyclist from the mass center (26% successful)
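One hedged concretization of the induced qualitative strategy into a controller for the desired front-wheel angle. The gains and deflection bound are illustrative assumptions, and the two branches of the qualitative tree are collapsed into a single linear function; the thesis instead samples such functions randomly.

```python
import math

MAX_DEFLECTION = math.pi / 15     # assumed bound on the front-wheel deflection

def theta_des(omega, d_omega, goal_angle, k1=6.0, k2=1.5, k3=0.05):
    # increasing in the lean angle and its rate (balancing dominates: k1, k2 >> k3),
    # decreasing in goalAngle (steer away from the goal); f(0, 0, 0) = 0
    raw = k1 * omega + k2 * d_omega - k3 * goal_angle
    return max(-MAX_DEFLECTION, min(MAX_DEFLECTION, raw))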
QUIN in the crane domain • Crane control requires trolley and rope control • Experiments with traces of 2 operators using different control styles • Rope control • QUIN: Ldes = f(X, dX, φ, dφ, dL) • Often a very simple strategy was induced: Ldes = M+(X), i.e. bring down the load as the trolley moves from the start to the goal position
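One simple instance of the induced rope strategy Ldes = M+(X): lengthen the rope (lower the load) linearly as the trolley moves from start to goal. The start and goal positions and rope lengths below are assumptions (the trace above starts at L = 20).

```python
def rope_length_des(x, x_start=0.0, x_goal=60.0, l_start=20.0, l_goal=32.0):
    # monotonically non-decreasing in x: lower the load as the trolley advances
    frac = min(1.0, max(0.0, (x - x_start) / (x_goal - x_start)))
    return l_start + frac * (l_goal - l_start)

print(rope_length_des(0.0), rope_length_des(30.0), rope_length_des(60.0))
```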
Trolley control • QUIN: dXdes = f(X, φ, dφ) • More diversity in the induced strategies; enables reconstruction of individual differences in control styles

One operator:
    X < 20.7: dXdes = M+(X)
    20.7 ≤ X < 60.1: dXdes = M-(X)
    X ≥ 60.1: dXdes = M+(φ)

The other operator:
    X < 29.3: dXdes = M+,+,-(X, φ, dφ)
    X ≥ 29.3 and dφ < -0.02: dXdes = M-(X)
    X ≥ 29.3 and dφ ≥ -0.02: dXdes = M-,+(X, φ)
Role of human intervention • The approach facilitates the use of user knowledge • In our experiments the following types of human intervention were used: • selection of the dependent trajectory variable • disregarding some state variables • selection and analysis of induced equations • using domain knowledge when transforming qualitative into quantitative strategies • Empirically, different (sensible) choices and uses of domain knowledge also give successful strategies
Contributions of the thesis • A decomposition of the behavioural cloning problem into the learning of continuous generalized trajectory and system’s dynamics • Modelling of human skill with symbolic and qualitative constraints • QUIN algorithm for learning qualitative constraint trees • Applying QUIN to skill reconstruction • Experimental evaluation in several dynamic domains
Further work • Applying QUIN in other domains where qualitative models are preferred; QUIN improvements • Qualitative simulation to generate possible explanations of a qualitative strategy • Reducing the space of admissible controllers by qualitative reasoning • Minimizing the trajectory-constraint error in all the state variables would remove the need to select the dependent trajectory variable