

Examples of the A1 robot following human commands in natural language
Unlike text generation, where Large Language Models (LLMs) directly interpret the atomic elements of text (tokens), LLMs often struggle to comprehend low-level robotic commands such as joint angle targets or motor torques, especially for inherently unstable legged robots that require high-frequency control signals. In this work, we propose to use foot contact patterns as an interface that bridges human instructions in natural language and these low-level commands. The proposed approach allows the robot to take both simple, direct instructions (e.g., “Trot forward slowly”) and vague human commands (e.g., “Good news, we are going to a picnic this weekend!”) in natural language and react accordingly.

SayTap: Language to Quadrupedal Locomotion

Large language models (LLMs) have demonstrated the potential to perform high-level planning. Yet, it remains a challenge for LLMs to comprehend low-level commands, such as joint angle targets or motor torques. This work proposes an approach that uses foot contact patterns as an interface bridging human commands in natural language and a locomotion controller that outputs these low-level commands. This results in an interactive system for quadrupedal robots that allows users to craft diverse locomotion behaviors flexibly. We contribute an LLM prompt design, a reward function, and a method to expose the controller to the feasible distribution of contact patterns. The result is a controller capable of achieving diverse locomotion patterns that can be transferred to real robot hardware. Compared with other design choices, the proposed approach achieves a success rate of more than 50% in predicting the correct contact patterns and can solve 10 more tasks out of a total of 30 tasks.

Method

The core idea of our approach is to introduce desired foot contact patterns as a new interface between human commands in natural language and the locomotion controller. The locomotion controller is required not only to complete the main task (e.g., following specified velocities), but also to place the robot's feet on the ground at the right time, such that the realized foot contact patterns are as close as possible to the desired ones. The following figure gives an overview of the proposed system. To achieve this, the locomotion controller takes a desired foot contact pattern at each time step as its input, in addition to the robot's proprioceptive sensory data and task-related inputs (e.g., user-specified velocity commands). During training, a random generator creates these desired foot contact patterns; at test time, an LLM translates them from human commands.
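To make the contact-matching requirement concrete, the sketch below shows one simple way such a reward term could be written, assuming access to the desired and realized ground-contact flags of the four feet at the current step. The function name and the exact formulation are illustrative and not taken from the paper; in practice this term would be combined with the main task reward (e.g., velocity tracking).

import numpy as np

# Illustrative sketch (not the exact reward used in the paper): reward the
# controller for matching the realized foot contacts to the desired ones.
def contact_matching_reward(desired_contacts: np.ndarray,
                            realized_contacts: np.ndarray) -> float:
    """Fraction of feet whose realized contact flag matches the desired one.

    Both inputs are length-4 binary vectors, one entry per foot.
    """
    return float(np.mean(desired_contacts == realized_contacts))

# Example: two diagonal feet should be on the ground, but one is still in the air.
reward = contact_matching_reward(np.array([1, 0, 0, 1]),
                                 np.array([1, 0, 0, 0]))  # -> 0.75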

In this paper, a desired foot contact pattern is defined by a cyclic sliding window of size $L_w$ that extracts the four feet's ground contact flags between time steps $t+1$ and $t+L_w$ from a pattern template; the resulting window is of shape $4 \times L_w$. A contact pattern template is a $4 \times T$ matrix of 0s and 1s, with 0s representing feet in the air and 1s for feet on the ground. From top to bottom, each row in the matrix gives the foot contact pattern of the front left (FL), front right (FR), rear left (RL), and rear right (RR) feet. We demonstrate that, given properly designed prompts, the LLM can accurately map human commands into foot contact pattern templates in the specified format, even when the commands are unstructured and vague. During training, we use a random pattern generator to produce contact pattern templates with various pattern lengths $T$ and foot-ground contact ratios within a cycle based on a given gait type $G$, so that the locomotion controller learns from a wide distribution of movements and generalizes better.
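As a concrete illustration of the cyclic sliding window, the following minimal sketch (assuming NumPy arrays; names are illustrative and not from the SayTap codebase) extracts the $4 \times L_w$ window of flags for steps $t+1$ through $t+L_w$ from a $4 \times T$ template, wrapping around when the window runs past the end of the template.

import numpy as np

# Minimal sketch of the cyclic sliding window over a 4 x T contact template.
def extract_contact_window(template: np.ndarray, t: int, window_size: int) -> np.ndarray:
    """Return the 4 x L_w block of contact flags for steps t+1 .. t+L_w,
    wrapping around the template cyclically."""
    T = template.shape[1]
    cols = [(t + 1 + i) % T for i in range(window_size)]
    return template[:, cols]

# Example: a trotting template of length T = 10, where diagonal feet
# (FL + RR, and FR + RL) touch the ground together.
trot = np.array([
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # FL
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],  # FR
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],  # RL
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # RR
])
window = extract_contact_window(trot, t=7, window_size=5)  # wraps past the end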

Please check out our paper for more details.

Overview of the proposed approach. In addition to the robot's proprioceptive sensory data and task commands (e.g., following a desired linear velocity), the locomotion controller accepts desired foot contact patterns as input and outputs desired joint positions. The foot contact patterns are extracted by a cyclic sliding window from a pattern template, which is generated by a random pattern generator during training and translated from human commands in natural language by an LLM at test time. We show some examples of contact pattern templates at the bottom.

Videos

Following Simple/Direct Commands

Videos that show our A1 following simple/direct instructions.

Following Unstructured/Vague Commands

Videos that show our A1 following unstructured/vague commands.

Acknowledgements

The authors would like to thank Tingnan Zhang, Linda Luu, Kuang-Huei Lee, Vincent Vanhoucke and Douglas Eck for their valuable discussions and technical support in the experiments.

Any errors here are our own and do not reflect opinions of our proofreaders and colleagues. If you see mistakes or want to suggest changes, feel free to contribute feedback by participating in the discussion forum for this article.

The experiments in this work were performed on GPU virtual machines provided by Google Cloud Platform.

Dog walk icon by Laymik from Noun Project.

Citation

For attribution in academic contexts, please cite this work as:

Yujin Tang, Wenhao Yu, Jie Tan, Heiga Zen, Aleksandra Faust and Tatsuya Harada,
SayTap: Language to Quadrupedal Locomotion, 2023.

BibTeX citation

@article{saytap2023,
  author = {Yujin Tang and Wenhao Yu and Jie Tan and Heiga Zen and Aleksandra Faust and Tatsuya Harada},
  title = {SayTap: Language to Quadrupedal Locomotion},
  eprint = {arXiv:2306.07580},
  url = {https://saytap.github.io},
  note = {\url{https://saytap.github.io}},
  year = {2023}
}