Explore the machine reconstruction of human control skills using the QUIN algorithm for learning qualitative strategies. Discover the application in container cranes and other domains.
Machine reconstruction of human control strategies
Dorian Šuc
Artificial Intelligence Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia
Overview • Skill reconstruction and behavioural cloning • The learning problem • A problem decomposition for behavioural cloning (indirect controllers, experiments, advantages) • Symbolic and qualitative skill reconstruction • Learning qualitative strategies: QUIN algorithm • QUIN in skill reconstruction • Conclusions
Skill reconstruction and behavioural cloning • Motivation: • understanding of the human skill • development of an automatic controller • ML approach to skill reconstruction: learn a control strategy from data logged from skilled human operators (execution traces). Later called behavioural cloning (Michie, 93). • Early work: Chambers and Michie (69), learning control by imitation; also Donaldson (60, 64)
Behavioural cloning: some applications • Original approach: clones usually induced as a direct mapping from states to actions, in the form of trees or rule sets • Successfully used in domains such as: • pole balancing (Michie et al., 90) • piloting (Sammut et al., 92; Camacho, 95) • container cranes (Urbančič, 94) • production line scheduling (Kerr and Kibira, 94) • Reviews in Sammut (96) and Bratko et al. (98)
Learning problem • Execution traces used as examples for ML to induce: • a control strategy (comprehensible, symbolic) • an automatic controller (criterion of success) • Operator’s execution trace: • a sequence of system states and the corresponding operator’s actions, logged to a file at a certain frequency • Reconstruction of human control skill: • Skill: “know how” at the subsymbolic level, operational • Strategy: explicitly described “know how” at the symbolic level
Container crane • Used in ports for load transportation • Control forces: Fx, FL • State: X, dX, Φ, dΦ, L, dL • Based on previous work of Urbančič (94) • Control task: transport the load from the start to the goal position
Learning problem, cont. Excerpt from an operator’s execution trace:

Fx     FL   X     dX    Φ      dΦ     L      dL
0      0    0.00  0.00  0.00   0.00   20.00  0.00
2500   0    0.00  0.00  -0.00  -0.01  20.00  0.00
6000   0    0.00  0.01  -0.01  -0.02  20.00  0.00
10000  0    0.02  0.10  -0.07  -0.27  20.00  0.00
14500  0    0.12  0.31  -0.32  -0.85  20.00  0.00
14500  0    0.35  0.59  -0.95  -1.49  20.00  0.01
…      …    …     …     …      …      …      …
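To make the data format concrete, here is a minimal sketch of turning such a logged trace into (state, action) examples for learning; the file name and the whitespace-separated column layout are illustrative assumptions, not the original tooling.

```python
# A minimal sketch of reading a logged execution trace into (state, action)
# learning examples.  The file name and the column layout (Fx FL X dX Phi
# dPhi L dL) are assumptions for illustration only.

def load_trace(path):
    """Read one execution trace; every row is one sampled time step."""
    examples = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            fx, fl, x, dx, phi, dphi, length, dl = map(float, line.split())
            state = (x, dx, phi, dphi, length, dl)   # system state variables
            action = (fx, fl)                        # operator's control forces
            examples.append((state, action))
    return examples

if __name__ == "__main__":
    trace = load_trace("operator_trace.txt")
    # Direct behavioural cloning would learn a mapping state -> action from
    # these pairs; the indirect approach instead generalizes the sequence of
    # states (the operator's trajectory) and uses it as a continuous subgoal.
    print(len(trace), "examples, first:", trace[0])
```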
Problems of the original approach • Difficulties observed with the original approach: • No guarantee of inducing a successful clone with high probability (Urbančič and Bratko, 94) • Low robustness of clones • Poor comprehensibility of clones; hard to understand • Michie (93, 95) suggests that some kind of problem decomposition could help: “learning from exemplary performance requires more than mindless imitation” • Recent approaches to behavioural cloning (Stirling, 95; Bain and Sammut, 99; Camacho, 2000)
Related work • Leech (86): probably the first goal-structured learning of control • CHURPs (Stirling, 95): separates control skills into planning and actuation phases; focuses on the planning component; assumes the goals are given • GRAIL (Bain and Sammut, 99): learns goals by decision trees and effects by abduction • Incremental Correction model (Camacho, 2000): homeostatic and achievable goals; parametrised decision trees to learn goals; wrapper approach
Our approach • Our goals: • transparency of the induced strategies • robust and successful controllers • Ideas: • Learning problem decomposition: (a) learning the constraints on the operator’s trajectories, (b) learning the system’s dynamics • Generalized trajectory as a continuous subgoal • Symbolic and qualitative constraints, use of domain knowledge • Differences from related approaches: • continuous generalized trajectory • qualitative strategies
Experimental domains • Container crane: • we used execution traces from (Urbančič, 94) • Acrobot (DeJong, 95; Sutton, 96): • two-link pendulum in a gravitational field; swing-up task • Bicycle riding (Randløv and Alstrøm, 98): • drive the bike from the start to the goal position; requires simultaneous balancing and goal-aiming • Simulators used in all experiments • Measure of success: • time to accomplish the task
Operator’s trajectory • A sequence of the states from an execution trace • A path in the state space • Figure: operator’s trajectory of the trolley velocity (dX) in the space of X, Φ and dX
Generalized trajectory Induced constraints on operator’s trajectory • Constraints can be represented as: • trees • equations • qualitative constraints
Qualitative and quantitative strategy • Quantitative strategy: given by precise numerical values or numeric constraints (a decision tree, an equation) • A qualitative strategy may also use qualitative constraints; a qualitative strategy defines a set of quantitative strategies • We use qualitatively constrained functions (QCFs): monotonicity constraints as used in qualitative reasoning
Qualitatively constrained functions • M+(x): an arbitrary monotonically increasing function of x • A QCF is a generalization of M+, similar to the qualitative proportionality predicates used in QPT (Forbus, 84) • Gas in the container: Pres = c Temp / Vol, c = n R > 0 • QCF: Pres = M+,-(Temp, Vol) • Temp steady & Vol decreases ⟹ Pres increases • Temp increases & Vol decreases ⟹ Pres increases • Temp decreases & Vol increases ⟹ Pres decreases • Temp increases & Vol increases ⟹ Pres ? • Temp decreases & Vol decreases ⟹ Pres ?
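Spelled out in code, the prediction a QCF makes from the qualitative changes of its arguments looks roughly like this; a sketch of the semantics only, not QUIN's implementation.

```python
# Sketch of what a QCF such as Pres = M+,-(Temp, Vol) predicts about the sign
# of the class change, given the signs of change of its arguments.
# Qualitative changes are encoded as +1 (increase), -1 (decrease), 0 (steady).

def qcf_predict(qcf_signs, arg_changes):
    """qcf_signs: per-argument monotonicity, +1 for M+ and -1 for M-.
    arg_changes: observed qualitative changes of the arguments.
    Returns +1, -1, 0, or None when the prediction is ambiguous."""
    # Each changing argument pushes the class in the direction sign * change.
    influences = {s * c for s, c in zip(qcf_signs, arg_changes) if c != 0}
    if not influences:
        return 0          # no argument changed, so no predicted class change
    if influences == {+1}:
        return +1         # all influences push the class up
    if influences == {-1}:
        return -1         # all influences push the class down
    return None           # opposing influences: the QCF makes no prediction

# Gas-container example, Pres = M+,-(Temp, Vol):
pres_qcf = (+1, -1)
print(qcf_predict(pres_qcf, (0, -1)))    # Temp steady, Vol down -> 1  (Pres up)
print(qcf_predict(pres_qcf, (-1, +1)))   # Temp down,   Vol up   -> -1 (Pres down)
print(qcf_predict(pres_qcf, (+1, +1)))   # Temp up,     Vol up   -> None (ambiguous)
```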
Direct and indirect controllers • Indirect controllers (learn constraints on the trajectory, then derive actions): our approach; also CHURPs (Stirling, 95), GRAIL (Bain and Sammut, 99), ICM (Camacho, 2000) • Direct controllers (learn a state-to-action mapping): the original approach, BOXES, ASE/ACE
Robustness of direct and indirect controllers against learning error • Experiment: modelling the learning of direct and indirect controllers with some learning error: • direct controllers: “correct action” + noise • indirect controllers: “correct trajectory” + noise • Two error models: • Gaussian noise • Biased Gaussian noise (all errors in the same direction) • Simple, deterministic, discrete-time system • Control task: reach and maintain the goal value Xg • Performance criterion: controller error at Xg
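For intuition, here is a toy simulation in the spirit of this experiment. The plant, the operator policy and the noise magnitudes are all assumptions; it does not reproduce the thesis' actual system or numbers, only the two error models described above.

```python
# A toy model (not the exact system from the thesis) illustrating the two
# error models.  Plant: x_{k+1} = x_k + u_k; task: reach and hold the goal XG.
# The direct clone replays the operator's correct action plus noise; the
# indirect clone targets the correct next trajectory point plus noise and
# derives its action from the known plant model.
import random

XG, STEPS = 10.0, 50

def correct_action(k):
    return 1.0 if k < 10 else 0.0          # move one unit per step, then hold

def correct_next_state(k):
    return min(float(k + 1), XG)           # the operator's trajectory

def run(indirect, bias, sigma=0.02):
    x = 0.0
    for k in range(STEPS):
        e = random.gauss(bias, sigma)       # learning error (possibly biased)
        if indirect:
            x_des = correct_next_state(k) + e   # "correct trajectory" + noise
            u = x_des - x                       # action from the plant model
        else:
            u = correct_action(k) + e           # "correct action" + noise
        x += u
    return abs(x - XG)                      # controller error at the goal

for bias in (0.0, 0.05):
    for indirect in (False, True):
        err = sum(run(indirect, bias) for _ in range(300)) / 300
        kind = "indirect" if indirect else "direct"
        print(f"bias={bias:.2f}  {kind:8s}  mean goal error = {err:.3f}")
```

In this toy model a biased action error accumulates through the plant's dynamics, while a biased trajectory error only offsets the tracked subgoal, which agrees with the qualitative finding on the next slide.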
Robustness of direct and indirect controllers against learning error (2) Biased noise affects direct controllers much more
Possible advantages of indirect controllers • Less prone to departures from the operator’s trajectory • More robust against changes in the system’s dynamics and small changes in the task • Generalizing the trajectory is often easier than generalizing the actions • The generalized trajectory is often easier to understand (fewer details)
Symbolic and qualitative skill reconstruction • Tools used: GoldHorn (Križman, 98), an equation discovery system, and LWR, locally weighted regression (Atkeson et al., 97) • Experiments in the crane and acrobot domains
Experiments in the crane domain • GoldHorn induced the generalized trajectory of the trolley velocity: dX_des = 0.902 - 0.018 X² + 0.090 X + 0.050 • Qualitative strategy: if X ≤ Xmid then dX_des = M+,+(X, Φ) else dX_des = M-,+(X, Φ)
Transforming qualitative into quantitative strategies • By concretizing qualitative parameters into real, numeric values or real-valued functions • First experiment: using randomly generated functions satisfying qualitative constraints and additional domain knowledge: • maximal and minimal values of the state variables • the trolley starts towards goal • the trolley stops at goal • Second experiment: using additional domain knowledge
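One simple way to carry out the concretization step of the first experiment above is to build each QCF argument's contribution as a random piecewise-linear monotone function within given variable ranges and sum the contributions. The construction, the ranges and the bounds below are illustrative assumptions, not the thesis' procedure.

```python
# Sketch: concretize a QCF into a random real-valued function by summing
# random piecewise-linear monotone contributions, one per argument.
import bisect
import random

def random_monotone(lo, hi, out_lo, out_hi, sign=+1, knots=8):
    """Random piecewise-linear monotone map from [lo, hi] to [out_lo, out_hi]."""
    xs = sorted([lo, hi] + [random.uniform(lo, hi) for _ in range(knots)])
    steps = [random.random() for _ in range(len(xs) - 1)]
    total = sum(steps) or 1.0
    ys, acc = [out_lo], out_lo
    for s in steps:
        acc += (out_hi - out_lo) * s / total
        ys.append(acc)
    if sign < 0:                       # M-: make the function decreasing
        ys = ys[::-1]
    def f(x):
        i = min(bisect.bisect_right(xs, x), len(xs) - 1)
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return f

# Example: a random instance of dX_des = M+,+(X, Phi), with assumed ranges
# X in [0, 60] and Phi in [-0.1, 0.1] and assumed output bounds.
f_x   = random_monotone(0.0, 60.0, 0.0, 0.8, sign=+1)
f_phi = random_monotone(-0.1, 0.1, -0.2, 0.2, sign=+1)
dx_des = lambda x, phi: f_x(x) + f_phi(phi)
print(dx_des(10.0, 0.02), dx_des(40.0, 0.02))   # larger X -> larger dX_des
```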
Efficiency of the qualitative strategy • The results show that qualitative strategy is: • general (the proper selection of qualitative parameters is not crucial) • successful: offers the space for controller optimization • Similar experiments in acrobot domain
Qualitative induction • Motivation: our experiments with qualitative strategies (crane, acrobot) • A usual classification learning problem, but learning qualitative trees: • leaves contain qualitatively constrained functions (QCFs); a QCF constrains how the class changes in response to changes in the attributes • internal nodes (splits) define a partition of the state space into areas with common qualitative behaviour of the class variable
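As a representation, a qualitative tree can be pictured roughly as follows. This is only a sketch: the names, types and the example threshold are assumptions, not QUIN's actual data structures.

```python
# An illustrative sketch of a qualitative tree: internal nodes split on an
# attribute threshold, and each leaf holds a QCF given by its argument names
# and per-argument monotonicity signs.
from dataclasses import dataclass
from typing import Union

@dataclass
class QCFLeaf:
    args: tuple      # names of the QCF arguments
    signs: tuple     # +1 for M+ and -1 for M- in each argument

@dataclass
class Split:
    attr: str
    threshold: float
    left: Union["Split", QCFLeaf]    # subtree for attr <= threshold
    right: Union["Split", QCFLeaf]   # subtree for attr >  threshold

def qcf_for_state(node, state):
    """Return the QCF leaf that applies to a given state (a dict of values)."""
    while isinstance(node, Split):
        node = node.left if state[node.attr] <= node.threshold else node.right
    return node

# Example: the two-leaf crane strategy "if X <= Xmid then M+,+(X, Phi) else
# M-,+(X, Phi)", with Xmid = 30.0 assumed purely for illustration.
tree = Split("X", 30.0,
             QCFLeaf(("X", "Phi"), (+1, +1)),
             QCFLeaf(("X", "Phi"), (-1, +1)))
print(qcf_for_state(tree, {"X": 12.0, "Phi": 0.01}).signs)   # -> (1, 1)
```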
Qualitatively constrained function (QCF) • Recap of the gas-container example introduced earlier: Pres = M+,-(Temp, Vol)
Learning QCFs • Example data for Pres = 2 Temp / Vol:

Temp    Vol    Pres
315.00  56.00  11.25
315.00  62.00  10.16
330.00  50.00  13.20
300.00  50.00  12.00
300.00  55.00  10.90

• Learning the “most consistent” QCF: • For each pair of examples, form a qualitative change vector • Select the QCF with minimal error-cost
Learning QCFs, cont. • Example qualitative change vector for one pair of examples: qTemp = neg, qVol = neg, qPres = pos • Candidate QCFs with the example pairs they are inconsistent with or ambiguous on:

QCF               Incons.  Amb.
M+(Temp)          3        1
M-(Temp)          2,4      1
M+(Vol)           1,2,3    /
M-(Vol)           4        /
M+,+(Temp,Vol)    1,3      2
M+,-(Temp,Vol)    /        3,4
M-,+(Temp,Vol)    1,2      3,4
M-,-(Temp,Vol)    4        2

• Select the QCF with minimal QCF error-cost
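A sketch of this selection step in code: form a qualitative change vector for every pair of examples, then count for each candidate QCF the pairs it is inconsistent with and the pairs on which it is ambiguous. The simple weighted count below stands in for QUIN's error-cost, so the exact weighting is an assumption.

```python
# Scoring candidate QCFs from qualitative change vectors over example pairs.
from itertools import combinations

def sign(v, eps=1e-9):
    return 0 if abs(v) < eps else (1 if v > 0 else -1)

def qcf_prediction(signs, attr_changes):
    infl = {s * c for s, c in zip(signs, attr_changes) if s != 0 and c != 0}
    if not infl:
        return 0
    return 1 if infl == {1} else (-1 if infl == {-1} else None)

def qcf_cost(signs, examples, w_incons=2.0, w_amb=1.0):
    incons = amb = 0
    for (a1, c1), (a2, c2) in combinations(examples, 2):
        attr_changes = [sign(x2 - x1) for x1, x2 in zip(a1, a2)]
        class_change = sign(c2 - c1)
        pred = qcf_prediction(signs, attr_changes)
        if pred is None:
            amb += 1                  # QCF makes no prediction for this pair
        elif pred != class_change:
            incons += 1               # QCF contradicts the observed change
    return w_incons * incons + w_amb * amb

# Gas-container data (Pres = 2 Temp / Vol); attributes are (Temp, Vol).
data = [((315.0, 56.0), 11.25), ((315.0, 62.0), 10.16), ((330.0, 50.0), 13.20),
        ((300.0, 50.0), 12.00), ((300.0, 55.0), 10.90)]
candidates = {"M+,-(Temp,Vol)": (1, -1), "M-,+(Temp,Vol)": (-1, 1),
              "M+,+(Temp,Vol)": (1, 1), "M+(Temp)": (1, 0), "M-(Vol)": (0, -1)}
for name, s in sorted(candidates.items(), key=lambda kv: qcf_cost(kv[1], data)):
    print(f"{name:16s} cost = {qcf_cost(s, data):.1f}")
```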
Learning a qualitative tree • For every possible split, divide the examples into two subsets, find the “most consistent” QCF for each subset, and select the split minimizing the tree error-cost (based on MDL) • Algorithm ep-QUIN uses every pair of examples • An improvement: the heuristic QUIN algorithm, which also considers locality and consistency of qualitative change vectors
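The split-search loop can be sketched as follows, under the same caveat: best_qcf_cost is assumed to return the error-cost of the most consistent QCF on a subset (e.g. the scoring sketched above), the fixed split penalty merely stands in for the MDL-based tree error-cost, and the recursion on the resulting subsets is omitted.

```python
# A minimal sketch of the greedy split search for one level of a qualitative tree.
def best_split(examples, best_qcf_cost, split_penalty=1.0):
    """examples: list of (attribute_tuple, class_value) pairs."""
    n_attrs = len(examples[0][0])
    best_cost, best = best_qcf_cost(examples), None     # cost of a single leaf
    for a in range(n_attrs):
        values = sorted({attrs[a] for attrs, _ in examples})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0                        # candidate threshold
            left = [e for e in examples if e[0][a] <= thr]
            right = [e for e in examples if e[0][a] > thr]
            cost = best_qcf_cost(left) + best_qcf_cost(right) + split_penalty
            if cost < best_cost:
                best_cost, best = cost, (a, thr)
    return best_cost, best    # best is None if no split beats a single leaf
```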
Experimental evaluation in artificial domains • On a set of artificial domains with uniformly distributed attributes; 2 irrelevant attributes • Results by QUIN better than ep-QUIN • In simple domains QUIN finds qualitative relations corresponding to our intuition
QUIN in bicycle riding • Control task: drive the bike from the start to the goal position • the bike’s speed is assumed constant • difficult because balancing and goal-aiming must be performed simultaneously • Controlled by the torque applied to the handlebars • State: goalAngle, goalDist, ω, dω, θ, dθ • QUIN: θ_des = f(State)
Induced qualitative strategy (qualitative tree):
goalAngle ≤ 0.015?
  yes: goalAngle ≤ -0.027?
    yes: θ_des = M+,+,-(ω, dω, goalAngle)
    no:  θ_des = M+,+(ω, dω)
  no: θ_des = M+,+,-(ω, dω, goalAngle)
(the two outer leaves share the same QCF)
Induced qualitative strategy (simplified view):
goalAngle near zero?
  yes: θ_des = M+,+(ω, dω) (balancing)
  no:  θ_des = M+,+,-(ω, dω, goalAngle) (balancing and goal-aiming)
Balancing: if the bike starts falling over, turn the front wheel in the direction of the fall. Goal-aiming: turn the front wheel away from the goal.
Transforming qualitative into quantitative strategies • Transform QCFs into real-valued functions by using simple domain knowledge: • maximal front wheel deflection • drive straight if the bike is upright and aiming at the goal: f(0, 0, 0) = 0 • balancing is more important than aiming at the goal • 400 randomly generated quantitative strategies; 59.2% successful • Test of robustness: • change in the start state (58% successful) • random displacement of the bicyclist from the mass center (26% successful)
QUIN in the crane domain • Crane control requires trolley and rope control • Experiments with traces of 2 operators using different control styles • Rope control: • QUIN: L_des = f(X, dX, Φ, dΦ, dL) • Often a very simple strategy is induced: L_des = M+(X), i.e. bring down the load as the trolley moves from the start to the goal position
Trolley control • QUIN: dX_des = f(X, Φ, dΦ) • More diversity in the induced strategies; enables reconstruction of individual differences in control styles • Two induced qualitative trees (one per operator):

X < 20.7?
  yes: dX_des = M+(X)
  no:  X < 60.1?
    yes: dX_des = M-(X)
    no:  dX_des = M+(Φ)

X < 29.3?
  yes: dX_des = M+,+,-(X, Φ, dΦ)
  no:  dΦ < -0.02?
    yes: dX_des = M-(X)
    no:  dX_des = M-,+(X, Φ)
Role of human intervention • The approach facilitates the use of user knowledge • In our experiments the following types of human intervention were used: • Selection of the dependent trajectory variable • Disregarding some state variables • Selection and analysis of induced equations • Using domain knowledge in transforming qualitative into quantitative strategies • Empirically, different (sensible) choices and uses of domain knowledge also give successful strategies
Contributions of the thesis • A decomposition of the behavioural cloning problem into the learning of continuous generalized trajectory and system’s dynamics • Modelling of human skill with symbolic and qualitative constraints • QUIN algorithm for learning qualitative constraint trees • Applying QUIN to skill reconstruction • Experimental evaluation in several dynamic domains
Further work • Applying QUIN in other domains where qualitative models are preferred; QUIN improvements • Qualitative simulation to generate possible explanations of a qualitative strategy • Reducing the space of admissible controllers by qualitative reasoning • Minimizing the trajectory-constraint error over all state variables, which would remove the need to select a dependent trajectory variable