Teaching Robots Like Dogs: Learning Agile Navigation from Luring, Gesture, and Speech

Results (Audio available)

Multiple Skills with a Single Policy

Course Running with Multiple Objects

Individual Navigation Skills

Go There

Come Here

Follow Me

Come Around

Jump Over

Zigzag

Abstract

In this work, we aim to enable legged robots to interpret human social cues and produce appropriate behaviors through human guidance. The main challenge is that learning through physical engagement can place a heavy burden on users, particularly when the process requires large amounts of human-provided data. To address this, we propose a human-in-the-loop framework that enables robots to acquire navigational behaviors in a data-efficient manner and respond to natural multimodal human inputs, specifically gestural and verbal commands. We reconstruct interaction scenes in a physics-based simulator and augment real-world data to mitigate distributional shifts arising from limited demonstration data. Our progressive goal cueing strategy adaptively feeds appropriate commands and navigation goals during training, leading to more accurate navigation and stronger alignment between human input and robot behavior. We evaluate our framework across six real-world agile navigation scenarios, including jumping over and avoiding obstacles. Our experimental results show that the proposed method succeeds in almost all trials, achieving a 97.15% task success rate with less than one hour of demonstration data.
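
To make the input-output interface concrete, the sketch below shows one plausible way a multimodal cue (a verbal command plus a pointing-gesture direction) could be mapped to a navigation goal for a goal-conditioned policy. The class, field, and function names (MultimodalCommand, command_to_goal, and so on) are illustrative assumptions, not the framework's actual API.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class MultimodalCommand:
        """Illustrative container for one human cue (not the paper's actual interface)."""
        verbal: str                 # e.g. "come here", "go there", "jump over"
        gesture_dir: np.ndarray     # unit vector of the pointing direction, robot frame
        gesture_range: float        # rough distance implied by the gesture, in meters

    def command_to_goal(cmd: MultimodalCommand, human_pos: np.ndarray) -> np.ndarray:
        """Map a multimodal cue to a 2-D navigation goal in the robot frame.

        Gestures supply the spatial target, while the verbal command selects how
        that target is interpreted (move toward it, return to the human, etc.).
        """
        if cmd.verbal == "come here":
            return human_pos                            # goal is the human's position
        if cmd.verbal == "go there":
            return cmd.gesture_dir * cmd.gesture_range  # goal lies along the pointed direction
        # Default: treat the gesture as the goal direction with a nominal range.
        return cmd.gesture_dir * 1.0

    # Example usage with made-up numbers.
    cmd = MultimodalCommand("go there", np.array([0.0, 1.0]), gesture_range=3.0)
    print(command_to_goal(cmd, human_pos=np.array([1.5, 0.0])))   # -> [0. 3.]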



Q: What distinguishes this work from prior studies?

▸ Learning agile behaviors from human guidance: We demonstrate that robots can learn agile navigation behaviors, such as jumping over and avoiding obstacles, through human physical guidance.

▸ Control with multimodal human inputs: Unlike prior methods that rely on a single modality (either verbal commands or gestures), we enable robots to learn from both verbal and gestural cues within a unified framework. This multimodal design allows the robot to interpret richer human intent, where gestures provide spatial guidance and verbal commands convey explicit, high-level instructions.

▸ Data-efficient learning via scene reconstruction: A natural question is whether one could simply collect human demonstrations and apply supervised learning. However, acquiring large-scale human interaction data is expensive and burdensome. To address this challenge, we reconstruct interaction scenes in a physics-based simulator and augment the collected data with simulated variations, as sketched below.
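
As a rough illustration of the augmentation step mentioned in the last point, the sketch below perturbs the entity poses of one reconstructed scene to produce many randomized training scenes. The scene representation and function names are hypothetical placeholders, not the actual implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def augment_scene(scene: dict, num_variants: int = 50, pos_noise: float = 0.2):
        """Create perturbed copies of one reconstructed scene (illustrative only).

        `scene` maps entity names (e.g. "hurdle", "human") to 2-D positions.
        Each variant jitters every position so the policy is not trained on a
        single fixed layout recovered from the demonstration.
        """
        variants = []
        for _ in range(num_variants):
            variant = {
                name: pos + rng.normal(scale=pos_noise, size=2)
                for name, pos in scene.items()
            }
            variants.append(variant)
        return variants

    # Example: one reconstructed scene expanded into 50 randomized training scenes.
    reconstructed = {"human": np.array([2.0, 0.0]), "hurdle": np.array([1.0, 0.5])}
    training_scenes = augment_scene(reconstructed)
    print(len(training_scenes), training_scenes[0]["hurdle"])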

BibTeX

Methods

Scene Reconstruction in Physics Simulator

Original Scene with Human Demonstration

Reconstructed Scene in Physics Simulator

Data Aggregation for Distribution Shift Problems

Data Aggregation with Local Expert Policy

Recovery under Distribution Shift (Diagonal)

Recovery under Distribution Shift (Top View)
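
The data-aggregation clips above describe relabeling the learner's own rollouts with a local expert policy, which is the core of a DAgger-style recipe. The toy sketch below shows that generic loop on a 2-D point-navigation problem; the environment, expert, and linear learner are simplified stand-ins, not the components used in this work.

    import numpy as np

    rng = np.random.default_rng(1)

    def expert_action(state, goal):
        """Local expert: step directly toward the goal (stand-in for the real expert)."""
        direction = goal - state
        return direction / (np.linalg.norm(direction) + 1e-8)

    def rollout(policy_w, goal, steps=20):
        """Run the current learner and record the states it actually visits."""
        state, states = np.zeros(2), []
        for _ in range(steps):
            states.append(state.copy())
            features = np.append(goal - state, 1.0)          # simple features + bias
            state = state + 0.1 * (features @ policy_w)      # learner's action
        return states

    # DAgger-style aggregation: relabel visited states with expert actions, then refit.
    dataset_X, dataset_y = [], []
    policy_w = rng.normal(scale=0.1, size=(3, 2))            # linear policy, random init
    for iteration in range(5):
        goal = rng.uniform(-3, 3, size=2)
        for state in rollout(policy_w, goal):
            dataset_X.append(np.append(goal - state, 1.0))   # aggregate visited states ...
            dataset_y.append(expert_action(state, goal))     # ... labeled by the expert
        X, Y = np.array(dataset_X), np.array(dataset_y)
        policy_w, *_ = np.linalg.lstsq(X, Y, rcond=None)     # refit on the aggregated set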

Progressive Goal Cueing Strategy: Adaptive Command Feeding

Without Progressive Goal Cueing

With Progressive Goal Cueing
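
The comparison above does not spell out the cueing schedule, but one plausible reading of "adaptive command feeding" is a curriculum that issues nearby intermediate goals early in training and hands over to the final goal and command as training progresses. The sketch below encodes only that reading; the actual strategy may differ.

    def cue_goal(waypoints, progress: float):
        """Pick which goal to feed the policy given training progress in [0, 1].

        Early in training (progress ~ 0) the cue is a nearby waypoint along the
        demonstrated path; late in training (progress ~ 1) it is the final goal,
        matching how the full command would be issued at deployment. This is an
        illustrative guess at "progressive goal cueing", not the paper's exact rule.
        """
        index = min(int(progress * len(waypoints)), len(waypoints) - 1)
        return waypoints[index]

    # Example: three waypoints along a demonstrated trajectory.
    waypoints = [(0.5, 0.0), (1.5, 0.5), (3.0, 0.0)]
    print(cue_goal(waypoints, progress=0.1))   # early training -> first waypoint
    print(cue_goal(waypoints, progress=0.9))   # late training  -> final goal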

Additional Experiments

Ablation Study on Modalities

Verbal Command Only

Gesture Command Only

Few-Shot Adaptation to Novel Users

Come Around

Jump Over