Material Detail

Direct Policy Ranking with Robot Data Streams

Direct Policy Ranking with Robot Data Streams

This video was recorded at European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Athens 2011. Many machine learning approaches in robotics, based on reinforcement learning, inverse optimal control or direct policy learning, critically rely on robot simulators. This paper investigates a simulatorfree direct policy learning, called Preference-based Policy Learning (PPL). PPL iterates a four-step process: the robot demonstrates a candidate policy; the expert ranks this policy comparatively to other ones according to her preferences; these preferences are used to learn a policy return estimate; the robot uses the policy return estimate to build new candidate policies, and the process is iterated until the desired behavior is obtained. PPL requires a good representation of the policy search space be available, enabling one to learn accurate policy return estimates and limiting the human ranking effort needed to yield a good policy. Furthermore, this representation cannot use informed features (e.g., how far the robot is from any target) due to the simulator-free setting. As a second contribution, this paper proposes a representation based on the agnostic exploitation of the robotic log. The convergence of PPL is analytically studied and its experimental validation on two problems, involving a single robot in a maze and two interacting robots, is presented.


  • User Rating
  • Comments
  • Learning Exercises
  • Bookmark Collections
  • Course ePortfolios
  • Accessibility Info

More about this material


Log in to participate in the discussions or sign up if you are not already a MERLOT member.