Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, torques) as action space for continuous control tasks. We propose to lift the action space to a higher level in the form of subgoals for a motion generator (a combination of motion planner and trajectory executor). We argue that, by lifting the action space and by leveraging sampling-based motion planners, we can efficiently use RL to solve complex, long-horizon tasks that could not be solved with existing RL methods in the original action space. We propose ReLMoGen -- a framework that combines a learned policy to predict subgoals and a motion generator to plan and execute the motion needed to reach these subgoals. To validate our method, we apply ReLMoGen to two types of tasks: 1) Interactive Navigation tasks, navigation problems where interactions with the environment are required to reach the destination, and 2) Mobile Manipulation tasks, manipulation tasks that require moving the robot base. These problems are challenging because they are usually long-horizon, hard to explore during training, and comprise alternating phases of navigation and interaction. Our method is benchmarked on a diverse set of seven robotics tasks in photo-realistic simulation environments. In all settings, ReLMoGen outperforms state-of-the-art Reinforcement Learning and Hierarchical Reinforcement Learning baselines. ReLMoGen also shows outstanding transferability between different motion generators at test time, indicating a great potential to transfer to real robots.
Short on time? This video should capture the gist of our work. Enjoy!
We propose to integrate Motion Generation into a Reinforcement Learning loop to lift the action space from low-level robot commands a to subgoals for the motion generator a′. Our ReLMoGen solution maps observations and (possibly) task information to base or arm subgoals that the motion generator transforms into low-level robot commands. The mobile manipulation tasks that we are interested in can usually be decomposed into a sequence of base and arm subgoals (e.g. pushing open a door for Interactive Navigation).
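To make the lifted action space concrete, here is a minimal Python sketch of the loop described above. It is not the authors' implementation: `Subgoal`, `straight_line_motion_generator`, and `run_episode` are hypothetical names, positions are simplified to 2D points, and a straight-line interpolator stands in for a real sampling-based motion planner.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Subgoal:
    kind: str      # "base" or "arm" (hypothetical labels for this sketch)
    target: Point  # simplified to 2D; a real arm subgoal would be richer


def straight_line_motion_generator(start: Point, subgoal: Subgoal,
                                   step: float = 0.1) -> List[Point]:
    """Toy stand-in for planner + executor: interpolate waypoints to the subgoal."""
    dx = subgoal.target[0] - start[0]
    dy = subgoal.target[1] - start[1]
    n = max(1, math.ceil(math.hypot(dx, dy) / step))
    return [(start[0] + dx * i / n, start[1] + dy * i / n) for i in range(1, n + 1)]


def run_episode(policy: Callable[[Point], Optional[Subgoal]],
                start: Point, max_subgoals: int = 5) -> List[Point]:
    """The policy acts once per subgoal (a'); the generator emits many low-level steps (a)."""
    position = start
    visited = [start]
    for _ in range(max_subgoals):
        subgoal = policy(position)  # high-level action a'
        if subgoal is None:         # the policy signals it is done
            break
        for waypoint in straight_line_motion_generator(position, subgoal):
            position = waypoint     # each waypoint plays the role of a low-level command a
            visited.append(position)
    return visited
```

In this sketch the policy makes only one decision per subgoal, while the motion generator handles the many low-level steps in between, which is the sense in which the action space is lifted.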
We instantiate our ReLMoGen solution with two types of action parameterization and network architecture: SGP-D and SGP-R. SGP-D is based on DQN and uses a discrete action space. We adopt a fully convolutional network structure that predicts Q-values for the base and arm subgoals, spatially aligned with the local top-down map and the first-person RGB-D view respectively. SGP-R is based on SAC and uses a continuous action space. The actor network directly predicts the base and arm subgoals, along with a binary variable that indicates whether to use the base or the arm for this subgoal.
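The two parameterizations can be illustrated with a short Python sketch of how a subgoal might be read off the network output. This is an assumption-laden simplification, not the released code: the function names are hypothetical, each Q-value map is a single-channel 2D grid, greedy selection takes the maximum over both maps, and the continuous action layout `[selector, base_x, base_y, arm_u, arm_v]` is invented for illustration.

```python
from typing import List, Sequence, Tuple


def argmax_2d(q_map: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return the (row, col) index of the largest entry in a 2D grid."""
    best = (0, 0)
    for r, row in enumerate(q_map):
        for c, value in enumerate(row):
            if value > q_map[best[0]][best[1]]:
                best = (r, c)
    return best


def select_subgoal_discrete(q_base, q_arm) -> Tuple[str, Tuple[int, int]]:
    """SGP-D style (sketch): pick the pixel with the highest Q-value across both maps.

    q_base is assumed aligned with the local top-down map,
    q_arm with the first-person RGB-D view.
    """
    base_idx = argmax_2d(q_base)
    arm_idx = argmax_2d(q_arm)
    if q_base[base_idx[0]][base_idx[1]] >= q_arm[arm_idx[0]][arm_idx[1]]:
        return "base", base_idx
    return "arm", arm_idx


def decode_subgoal_continuous(action: Sequence[float]) -> Tuple[str, List[float]]:
    """SGP-R style (sketch): the first entry acts as the binary base/arm selector.

    Assumed layout: [selector, base_x, base_y, arm_u, arm_v].
    """
    if action[0] > 0:
        return "base", list(action[1:3])
    return "arm", list(action[3:5])
```

The discrete variant ties each candidate subgoal to a pixel in the observation, whereas the continuous variant outputs the subgoal coordinates directly and needs the extra selector to choose between base and arm.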
We stress test our method on a diverse set of seven robotics tasks, including navigation, stationary arm control, interactive navigation, and mobile manipulation. See below for a brief summary of each of our tasks.
PointNav: the goal is to navigate from a random starting location to a random goal location without collision.
TabletopReachM: the goal is to reach a random goal location (represented as a red circle) on the table.
PushDoorNav: the goal is to push a door open, and navigate to the goal location inside the room.
ButtonDoorNav: the goal is to press a button to open a door, and navigate to the goal location inside the room.
InteractiveObstaclesNav: the goal is to push away a movable obstacle (represented as a red cuboid), and navigate to the goal location behind the obstacles.
ArrangeKitchenMM: the goal is to close as many cabinets and drawers as possible in the kitchen.
ArrangeChairMM: the goal is to tuck the chairs under the table.
We showcase the qualitative results of our best performing policy in all the tasks.
Here we show reward curves for ReLMoGen and the baselines (SAC and HRL4IN). ReLMoGen achieves higher reward with the same number of environment episodes and higher overall task completion for all seven tasks while the baselines often converge prematurely to sub-optimal solutions.
Here we show that ReLMoGen is better at exploration than the SAC baseline. (a) shows the 2D projection of the latent state space: SAC traverses nearby states with low-level actions, while ReLMoGen-R jumps between distant states linked by a motion plan. (b) shows the physical locations visited by ReLMoGen-R and SAC in 100 episodes: ReLMoGen-R covers a much larger area. (c) shows a top-down map of meaningful interactions (duration ≥1s) during exploration. ReLMoGen-R interacts with the environment more than SAC does.
Here we show the visualization of ReLMoGen-D action maps during evaluation. The image pairs contain the input RGB frames on the left and normalized predicted Q-value maps on the right. The predicted Q-value spikes up at image locations that enable useful interactions, e.g. buttons, cabinet door leaves, and chairs.
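The page does not say how the Q-value maps are normalized for display. As one plausible option, the sketch below applies min-max scaling to a single map so that its values fall in [0, 1]; the function name `normalize_q_map` and the choice of min-max scaling are assumptions made for illustration.

```python
from typing import List, Sequence


def normalize_q_map(q_map: Sequence[Sequence[float]]) -> List[List[float]]:
    """Min-max scale a 2D Q-value map to [0, 1] for visualization (assumed scheme)."""
    flat = [value for row in q_map for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no spatial preference; show it as all zeros.
        return [[0.0 for _ in row] for row in q_map]
    return [[(value - lo) / (hi - lo) for value in row] for row in q_map]
```

With this scaling, the location with the highest predicted Q-value always maps to 1.0, which makes spikes at buttons or door leaves easy to see regardless of the absolute Q-value range.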
Here we show the qualitative results of transferring our policy to a new robot, Movo. After fine-tuning, our policy quickly adapts to the new embodiment and learns to set feasible subgoals accordingly.
With an appendix detailing task definitions, evaluation metrics, algorithm description, network structure, training procedure, and hyperparameters.
We thank Google for providing cloud computing credits for this project. This project is also supported by an HAI AWS grant. The authors would like to thank members of the Mobility team of Robotics at Google and the PAIR team from the Stanford Vision and Learning Group for valuable feedback on early versions of this project.
The website template was borrowed from Michaël Gharbi.