In this work, we aim to enable legged robots to interpret human social cues and produce appropriate behaviors through human guidance. The main challenge is that learning through physical engagement can place a heavy burden on users, particularly when the process requires large amounts of human-provided data. To address this, we propose a human-in-the-loop framework that enables robots to acquire navigational behaviors in a data-efficient manner and to respond to natural multimodal human inputs, specifically gestural and verbal commands. We reconstruct interaction scenes in a physics-based simulator and augment real-world data to mitigate distributional shifts arising from limited demonstrations. Our progressive goal cueing strategy adaptively supplies appropriate commands and navigation goals during training, leading to more accurate navigation and stronger alignment between human input and robot behavior. We evaluate our framework across six real-world agile navigation scenarios, including jumping over and avoiding obstacles. Experimental results show that the proposed method succeeds in almost all trials, achieving a 97.15% task success rate with less than one hour of demonstration data.