Learning from Disagreeing Demonstrators

Bruno N. da Silva, University of British Columbia, bnds@cs.ubc.ca

Motivation • Some traditional cases of Learning from Demonstration assume a human expert • In some (subjective) tasks, there might not be a single expert • How to drive from point A to B

Motivation • In general, these tasks involve more than one feature • e.g. in the driving domain, we want to minimize both travel time and number of crashes • Different contexts lead to different tradeoffs between features • Idiosyncratic demonstrators do not reflect on their routine approach to the problem

Problem definition • How can we integrate idiosyncratic (disagreeing) demonstrations to form a homogeneous and effective policy?

Solution • We extend the framework presented by Argall et al., 2007 • Traditional demonstrations in the first stage • Robot execution and human critique in the second stage • Robot collects critiques • Robot updates policy

The 1st stage of the mechanism

The 2nd stage of the mechanism

A little more concretely… • The first stage can be interpreted as a set of datapoints (p_m, a_n, c) • Perception p_m • Action a_n • Confidence c in the mapping • The criticism affects the confidence • If the critic praises the execution, increase c • If the critic knocks the execution, decrease c

But let’s not be naïve • If demonstrators “lie” in the demonstration, they would also “lie” in the criticism • Therefore, associate a reputation r_i with each demonstrator d_i • And update the confidence level carefully • c := c + r_i * f(feedback)
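The reputation-weighted update above can be sketched in Python as follows. This is only an illustration under stated assumptions: the slides do not define f, so the sketch assumes f maps praise to +1 and a knock to -1, and it clamps c to [0, 1]. The names `Datapoint`, `feedback_value`, and `update_confidence` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Datapoint:
    perception: tuple  # p_m: what the robot perceived
    action: str        # a_n: the action demonstrated for that perception
    confidence: float  # c: confidence in the perception-to-action mapping


def feedback_value(praised: bool) -> float:
    """Assumed f(feedback): +1 for praise, -1 for a knock."""
    return 1.0 if praised else -1.0


def update_confidence(dp: Datapoint, reputation: float, praised: bool) -> None:
    """Apply c := c + r_i * f(feedback); clamping to [0, 1] is an assumption."""
    c = dp.confidence + reputation * feedback_value(praised)
    dp.confidence = max(0.0, min(1.0, c))


dp = Datapoint(perception=(0.2, 0.7), action="turn_left", confidence=0.5)
update_confidence(dp, reputation=0.2, praised=True)    # high-reputation critic praises
update_confidence(dp, reputation=0.05, praised=False)  # low-reputation critic knocks
```

Because the feedback is scaled by r_i, the knock from the low-reputation critic moves c far less than the praise from the trusted one.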

Adjusting reputation ranks • Adjust r_i based on the (lack of) improvement from d_i’s feedback • r_i := r_i + α * evaluation(feedback), where α is a step size • evaluation(.) can be interpreted as checking whether the feedback led to a Pareto improvement
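One possible reading of the reputation update, sketched in Python. The slides do not specify how evaluation(.) is computed, so this sketch assumes each execution is summarized by a tuple of costs (lower is better, e.g. travel time and number of crashes) and that evaluation returns +1 for a Pareto improvement, -1 when the result is Pareto-worse, and 0 otherwise. The function names and the `step` parameter (the coefficient multiplying the evaluation) are hypothetical.

```python
def pareto_improves(before, after):
    """True if `after` is no worse on every cost and strictly better on at least one."""
    no_worse = all(a <= b for a, b in zip(after, before))
    strictly_better = any(a < b for a, b in zip(after, before))
    return no_worse and strictly_better


def evaluation(before, after):
    """Assumed scoring: +1 Pareto improvement, -1 Pareto-worse, 0 for a mixed tradeoff."""
    if pareto_improves(before, after):
        return 1.0
    if pareto_improves(after, before):
        return -1.0
    return 0.0


def update_reputation(reputation, before, after, step=0.1):
    """Apply r_i := r_i + step * evaluation(feedback)."""
    return reputation + step * evaluation(before, after)


# Costs are (travel_time_seconds, crashes), measured before and after
# the policy was updated with demonstrator d_i's feedback.
r_faster = update_reputation(0.5, before=(120.0, 2), after=(110.0, 2))  # faster, same crashes
r_mixed = update_reputation(0.5, before=(120.0, 2), after=(100.0, 3))   # faster, more crashes
```

Under this scoring, a mixed tradeoff (one feature better, another worse) leaves the reputation unchanged, since neither outcome Pareto-dominates the other.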

Current investigations • Policy convergence? • Rate of convergence? • What are the long-term effects on human demonstrators? • Frustration? • Repudiation? • Will critiques really be mindful?

Thanks! • Questions?
