Sim-to-Real Learning for Legged Locomotion

Xu Yang

Department of Automation, Tsinghua University

Table of Contents

Introduction

  • Reward function.
    How to design reward functions to obtain expected controller performance?
  • Input and output.
    What can you measure and what can you control?
  • Modelling.
    simulation fidelity (rigid-body dynamics and actuator dynamics); dynamics randomization
  • Training.
    massively parallel simulation; curriculum; teacher and student
  • Learning algorithm.
    PPO etc.
  • Network design.

Quadruped Robot: Traditional Controller Design

  • Goal: valid gait patterns (contact sequences) and commanded velocity tracking.
  • Controller: planning and tracking.
  • Trajectory optimization for a complex rigid-body model with many unspecified contact points is beyond the capabilities of current optimization techniques. Approximations are needed to reduce complexity.
    • Rigid-body model: point mass, whole body...
    • Contact points: smooth contact dynamics...

Anymal: Learning Motor Skills[Hwangbo2019]

  • Goal: commanded velocity tracking and recovery from a fall.
  • Modelling[Raisim]: rigid-body dynamics algorithms $O(n)$ + collision-detection library $O(n^2)$ + contact solver ($O(n)$ if not coupled) + actuator dynamics
  • Challenge: motors (SEAs) are hard to model.

Anymal: Control and State Estimation of Motors

  • Actuator network:
    • input: joint state history (including joint velocity and joint position error), which consists of the current state and 2 past states that correspond to $t-0.01$ and $t-0.02$ s.
    • the length of the history should comfortably exceed the sum of all communication delays and the mechanical response time.
    • output: joint torque
    • architecture: an MLP with 3 hidden layers of 32 units each; softsign activation function (12.2 $\mu$s) instead of tanh (31.6 $\mu$s); see the sketch after this list.
  • Dataset: the excitation must cover a wide range of frequency spectra.
    • Foot trajectories in the form of a sine wave with varied amplitude (5 to 10 cm) and frequency (1 to 25 Hz).
    • Constant contact with the ground.
    • The robot is disturbed manually.
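
A minimal PyTorch sketch of the actuator network described above: an MLP with three hidden layers of 32 units and softsign activations, mapping a short history of joint position errors and velocities to a joint torque. The input layout and all names are assumptions for illustration; only the layer sizes and activation come from the slide.

```python
import torch
import torch.nn as nn

class ActuatorNet(nn.Module):
    """Sketch of an actuator network: 3 hidden layers of 32 units, softsign.

    Input per joint: position error and velocity at t, t-0.01 s, t-0.02 s
    (3 time steps x 2 quantities = 6 values). Output: predicted joint torque.
    """
    def __init__(self, history_len: int = 3):
        super().__init__()
        in_dim = 2 * history_len  # position error + velocity per time step
        self.net = nn.Sequential(
            nn.Linear(in_dim, 32), nn.Softsign(),
            nn.Linear(32, 32), nn.Softsign(),
            nn.Linear(32, 32), nn.Softsign(),
            nn.Linear(32, 1),  # predicted torque for one joint
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Each joint is evaluated independently with its own state history.
net = ActuatorNet()
history = torch.randn(1, 6)  # [err_t, err_{t-0.01s}, err_{t-0.02s}, vel_t, ...]
tau = net(history)
```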

Anymal: Policy

  • Hybrid simulator:
    • nearly 50 000 time steps per second (half of the runtime is spent evaluating the actuator nets)
  • Policy network:
    • input: $o_k=\langle\Phi^g, r_z, v, \omega, \Phi, \dot\Phi, \Theta, a_{k-1}, C\rangle$
      • body orientation $\Phi^g$ (direction of the gravity vector in the IMU frame);
      • base height $r_z$; linear and angular velocities of the base; joint angles and velocities;
      • joint state history $\Theta$, sparsely sampled from $t_k-0.01$ s to $t_k-0.002$ s (enables contact detection, which could otherwise require force sensors);
      • previous action; command $C$.
    • output: joint position target (why not torques or velocities?)
    • architecture: an MLP with two hidden layers of 256 and 128 units and tanh nonlinearity (bounded activation functions yield less aggressive trajectories on the real robot when it is subjected to disturbances); a minimal sketch follows.
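
A minimal sketch of the policy network above, assuming PyTorch and a 12-joint robot; only the hidden sizes (256 and 128) and the tanh nonlinearity come from the slide, and the observation dimension is a placeholder.

```python
import torch
import torch.nn as nn

class LocomotionPolicy(nn.Module):
    """Policy MLP: two hidden layers (256, 128) with tanh, outputting
    joint position targets that a low-level PD loop then tracks."""
    def __init__(self, obs_dim: int, num_joints: int = 12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.Tanh(),
            nn.Linear(256, 128), nn.Tanh(),
            nn.Linear(128, num_joints),  # joint position targets
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

policy = LocomotionPolicy(obs_dim=60)   # 60 is a placeholder dimension
action = policy(torch.randn(1, 60))     # one position target per joint
```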

Learning Locomotion Skills: Does the Choice of Action Space Matter?[Peng2017]

  • Reward: a weighted sum of terms ($r_{pos},r_{vel}, r_{end},r_{root},r_{com}$) that encourage the policy to track a reference motion
  • Actuation model:
    • target joint angles with given $k_p$ and $k_d$;
    • target joint velocities with given $k_d$;
    • torques;
    • muscle activations.
  • A PD controller can outperform a torque controller in both training speed and final control performance (see the sketch after this list).
  • Explanation:
    • Smoother learning curve.
    • A PD controller starts as a standing controller while a torque controller doesn't.
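
A sketch of the joint-space PD actuation model referred to above: the policy outputs target joint angles, and fixed gains convert them to torques. The gains and dimensions are illustrative, not the paper's values.

```python
import numpy as np

def pd_torque(q_target, q, qdot, kp=40.0, kd=1.0):
    """PD actuation: tau = kp * (q_target - q) - kd * qdot."""
    return kp * (np.asarray(q_target) - np.asarray(q)) - kd * np.asarray(qdot)

# With targets equal to the current pose the PD controller already behaves
# like a standing controller; a torque policy outputting zeros does not.
q = np.zeros(12)
print(pd_torque(q_target=q, q=q, qdot=np.zeros(12)))   # zero torque, pose held
```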

Anymal: Training

  • Learning algorithm: Trust Region Policy Optimization.
  • Curriculum: avoid getting trapped in the local minimum (standing still) induced by the penalties on joint torque and velocity.
    • all cost terms except those related to the objective (base velocity error cost, etc.) are multiplied by a curriculum factor $k_c$, updated as $k_{c, j+1}\leftarrow (k_{c, j})^{k_d}$, with $k_{c,0}, k_d\in (0, 1)$ chosen to prevent the initial tendency to stand still (see the sketch after this list).
  • The discount factor $\gamma$ is hand-tuned.
  • Joint velocities, base velocities: uniformly distributed noise injection
  • Training time: 9 days of simulated time for command-conditioned locomotion; 79 days for recovery from a fall.
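
A small sketch of the curriculum-factor update from the list above; the initial value and exponent are placeholders, not the paper's settings.

```python
# k_{c,j+1} = k_{c,j} ** k_d with k_{c,0}, k_d in (0, 1): the factor starts
# small (penalties barely active) and monotonically approaches 1.
k_c, k_d = 0.3, 0.997          # placeholder values
for j in range(5001):
    if j % 1000 == 0:
        print(f"iteration {j:5d}: curriculum factor k_c = {k_c:.3f}")
    k_c = k_c ** k_d
```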

Anymal: Reward Setting

  • costs on the angular and linear velocity of the base $$ c_wK(|\omega_{IB}^I-\hat\omega_{IB}^I|), \quad c_w=-6\Delta t $$
  • torque cost, joint speed cost $$ k_cc_\tau\|\tau\|^2, c_\tau=0.005\Delta t $$
  • foot clearance cost and foot slip cost
  • orientation cost
  • smoothness cost $$ k_cc_s\|\tau_{t-1}-\tau_t\|^2 $$
  • $K:\mathbb{R}\rightarrow [-0.25, 0)$ $$ K(x)=-\frac{1}{e^x+e^{-x}+2} $$
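
A numeric sketch of the kernel $K$ and two of the cost terms listed above; the time step $\Delta t$ and the error/torque values are placeholders, not the paper's.

```python
import numpy as np

def K(x):
    """Bounded kernel from the slide: K(x) = -1/(e^x + e^-x + 2), in [-0.25, 0)."""
    return -1.0 / (np.exp(x) + np.exp(-x) + 2.0)

dt = 0.0025                                        # placeholder control time step
w_err = np.linalg.norm([0.1, 0.0, 0.2])            # |omega - omega_hat|
base_angular_velocity_term = -6.0 * dt * K(w_err)  # c_w * K(.), c_w = -6 * dt
tau = 5.0 * np.ones(12)                            # placeholder joint torques
torque_cost = 0.005 * dt * np.sum(tau ** 2)        # c_tau * ||tau||^2
print(K(0.0), base_angular_velocity_term, torque_cost)
```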

Anymal: Deployment

  • Inference takes 25 $\mu$s on a single CPU thread (Intel i7-5600U, 2.6–3.2 GHz, dual-core 64-bit; the TensorFlow C++ API is used), less than 0.25% of the available onboard computational resources.
  • The policy network is evaluated at 200Hz for command-conditioned locomotion and at 100 Hz for recovery from a fall.
  • The performance is insensitive to the control rate: recovery performance is identical when the rate is increased from 20 Hz to 100 Hz (possibly because joint velocities stay mostly below 6 rad/s).

Anymal - Learning Locomotion over Challenging Terrain[Lee2020]

  • Goal: follow a command (direction) and locomote over rough terrain with only proprioceptive measurements (joint encoders and an IMU).
  • Challenge: the robustness of the controller.

Anymal - Learning Locomotion over Challenging Terrain: Policy

  • Privileged learning
  • teacher policy (access to ground-truth knowledge)
    • state $s_t=(o_t, x_t)$
    • $o_t$ includes command, orientation, base twist, joint positions and velocities, periodic phase $\phi_i$, frequency offset $f_i$ of each leg, previous foot position targets, joint position errors and velocities at $t-0.01$ and $t-0.02$ s.
    • $x_t$ contains noiseless information from the physics engine, mainly related to foot-ground interactions, such as terrain profile (9 points placed along a 10-cm radius circle), foot contact state and forces, friction coefficients, external disturbance forces.

Anymal - Learning Locomotion over Challenging Terrain: Policy

  • Privileged learning
  • teacher policy (access to ground-truth knowledge)
    • action: 16-dim vector (leg frequencies $f_i$ and foot position residuals $\Delta r_{fi, T}$)
      • $\phi_i=\phi_{i,0}+(f_0+f_i)t$: contact phase if $\phi\in[0, \pi)$ and swing phase if $\phi\in[\pi, 2\pi)$, where $f_0=1.25$ Hz.
      • foot trajectory generator $F:[0, 2\pi)\rightarrow \mathbb{R}^3$ (a minimal sketch follows this list)
      • target foot position $F(\phi_i)+\Delta r_{fi,T}$
      • tracking: analytic IK and PD controller
    • architecture: MLP encoder with input $x_t$ and MLP for computing action
    • Learning algorithm: TRPO
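
A minimal sketch of the phase variable and a foot trajectory generator $F(\phi)$, assuming a simple sinusoidal swing profile; the $2\pi$ conversion of frequency to phase rate, the swing shape and the step height are assumptions, while $f_0=1.25$ Hz and the stance/swing split come from the slide.

```python
import numpy as np

def leg_phase(t, phi0=0.0, f0=1.25, f_res=0.0):
    """phi_i = phi_{i,0} + 2*pi*(f0 + f_i)*t, wrapped to [0, 2*pi).
    Stance if phi in [0, pi), swing if phi in [pi, 2*pi)."""
    return (phi0 + 2.0 * np.pi * (f0 + f_res) * t) % (2.0 * np.pi)

def foot_trajectory(phi, step_height=0.15):
    """Illustrative F(phi): foot on the ground during stance, a smooth
    vertical bump during swing (not the paper's exact generator)."""
    z = step_height * np.sin(phi) ** 2 if phi >= np.pi else 0.0
    return np.array([0.0, 0.0, z])

phi = leg_phase(t=0.6)
residual = np.array([0.01, 0.0, 0.02])          # policy-predicted residual (example)
foot_target = foot_trajectory(phi) + residual   # tracked with analytic IK + PD
print(phi, foot_target)
```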

Anymal - Learning Locomotion over Challenging Terrain: Policy

  • Privileged learning
  • student policy (uses only sensors available on the real robot; imitates the teacher)
    • state $s_t=(o_t)$
    • architecture: a temporal convolutional network (TCN) encoder and an MLP for computing action
    • Learning algorithm: supervised learning with the loss $$ \begin{aligned} L=&(\bar a_t(o_t, x_t)-a_t(o_t, H))^2+\\ &(\bar l_t(o_t,x_t)-l_t(H))^2, \end{aligned} $$ where $\bar a_t$ and $\bar l_t$ are the teacher's action and latent embedding, $H=\{h_{t-1},\cdots,h_{t-N-1}\}$, and $h_t=o_t\setminus\{f_0, \textrm{joint history, previous foot position targets}\}$ is a time series of proprioceptive observations (see the sketch below).
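
A minimal PyTorch sketch of the imitation loss above: the student matches both the teacher's action and the teacher's latent embedding of the privileged information. Tensor shapes and names are assumptions.

```python
import torch

def student_loss(teacher_action, student_action, teacher_latent, student_latent):
    """L = ||a_bar - a||^2 + ||l_bar - l||^2: action imitation plus
    reconstruction of the teacher's latent embedding from proprioception."""
    return (torch.sum((teacher_action - student_action) ** 2)
            + torch.sum((teacher_latent - student_latent) ** 2))

# Example shapes: 16-dim action, 64-dim latent (the latent size is a placeholder).
loss = student_loss(torch.randn(16), torch.randn(16),
                    torch.randn(64), torch.randn(64))
```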

Anymal - Learning Locomotion over Challenging Terrain: Policy

  • Adaptive terrain curriculum
    • The terrain is generated by a vector $c_T$.
    • Evaluate generated terrains by their traversability (a Monte-Carlo sketch follows this list): $$ \begin{aligned} &v(s_t, a_t, s_{t+1})=1\ \textrm{if}\ v_{pr}(s_{t+1})>0.2,\ \textrm{else}\ 0,\\ &Tr(c_T,\pi)=\mathbb{E}_{\xi\sim\pi}\left[v(s_t,a_t,s_{t+1})\mid c_T\right]\in[0, 1],\\ &Td(c_T,\pi)\triangleq \textrm{Pr}\left(Tr(c_T,\pi)\in[0.5, 0.9]\right) \end{aligned} $$
    • Terrain parameters $c_T$ are adapted with a sequential importance resampling particle filter.
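
A Monte-Carlo sketch of the traversability measure above; the roll-out data are random placeholders, while the 0.2 threshold and the $[0.5, 0.9]$ band follow the slide.

```python
import numpy as np

def traversability(projected_speeds, threshold=0.2):
    """Tr(c_T, pi): fraction of transitions on terrain c_T whose projected
    base speed exceeds the threshold (v = 1 if v_pr > 0.2, else 0)."""
    return float(np.mean(np.asarray(projected_speeds) > threshold))

def is_informative(tr):
    """Terrains are favored when Tr lies in [0.5, 0.9]:
    neither trivial nor hopeless for the current policy."""
    return 0.5 <= tr <= 0.9

speeds = np.random.uniform(0.0, 0.6, size=400)   # placeholder roll-out speeds
tr = traversability(speeds)
print(tr, is_informative(tr))
```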

Anymal - Learning Locomotion over Challenging Terrain: Policy

Anymal - Learning Locomotion over Challenging Terrain: Reward Setting and Training

  • Linear velocity; angular velocity; base motion; foot clearance; body collision; smoothness of target foot positions; torque penalty $$ 0.05r_{lv}+0.05r_{av}+0.04r_b+0.01r_{fc}+0.02r_{bc}+0.025r_s+2\cdot 10^{-5}r_\tau, $$
  • Training: 12 hours for teacher policy and 4 hours for student policy (i7-8700K CPU and RTX 2080 GPU)
  • The training environment features only rigid terrain, yet the controller successfully handles the diversity of field conditions.

Mini Cheetah

Cassie: Bipedal Gait Learning

  • Goal: learn all common gaits.
  • Challenge: the reward function is hard to design.
    • tracking a reference motion: deriving feasible reference trajectories is hard.
    • reference-free rewards: tend to yield inefficient, infeasible or unsafe policies.

Cassie: Bipedal Gait Learning

  • Gait: a sequence of periodic swing phases and stance phases.
  • Behavior reward: biased sum of $n$ reward components $$ \begin{aligned} R(s,\phi)&=\sum_i R_i(s, \phi)\\ R_i(s,\phi)&=c_i\cdot I_i(\phi)\cdot q_i(s), \end{aligned} $$ where $\phi\in[0, 1]$ denotes the cycle time; $I_i(\phi)$ is a binary random variable indicating whether component $i$ is active at phase $\phi$; $q_i(s)$ is the quantity being rewarded or penalized (e.g. the norm of a foot force)
  • $$ P(I_i(\phi)=x)=\left\{ \begin{aligned} &P(A_i<\phi< B_i) \quad \textrm{if}\ x=1 \\ &1-P(A_i<\phi< B_i) \quad \textrm{if}\ x=0 \\ \end{aligned} \right. $$ where $A_i\sim \Phi(2\pi a_i,\kappa), B_i\sim\Phi(2\pi b_i,\kappa)$ and $\Phi$ is the von Mises distribution (an approximation of the wrapped normal distribution); a Monte-Carlo sketch of $\mathbb{E}[I_i(\phi)]$ follows.
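
A Monte-Carlo sketch of the expected phase indicator $\mathbb{E}[I_i(\phi)]=P(A_i<\phi<B_i)$ defined above. The concentration $\kappa$ and sample count are placeholders, and the circular wrap-around at $0/2\pi$ is ignored for simplicity.

```python
import numpy as np

def expected_indicator(phi, a_i, b_i, kappa=50.0, n=100_000, seed=0):
    """E[I_i(phi)] = P(A_i < phi < B_i), with A_i ~ VonMises(2*pi*a_i, kappa)
    and B_i ~ VonMises(2*pi*b_i, kappa); phi is the cycle time in [0, 1)."""
    rng = np.random.default_rng(seed)
    A = rng.vonmises(2.0 * np.pi * a_i, kappa, n)
    B = rng.vonmises(2.0 * np.pi * b_i, kappa, n)
    phi_rad = 2.0 * np.pi * phi
    return float(np.mean((A < phi_rad) & (phi_rad < B)))

# Smoothed window covering the first 40% of the gait cycle:
print(expected_indicator(0.2, a_i=0.0, b_i=0.4))   # ~1.0, inside the window
print(expected_indicator(0.4, a_i=0.0, b_i=0.4))   # ~0.5, at the boundary
print(expected_indicator(0.7, a_i=0.0, b_i=0.4))   # ~0.0, outside the window
```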

Cassie: Bipedal Gait Learning

  • Reward coefficients: $$ \begin{aligned} \mathbb{E}[C_{frc}(\phi)]=&c_{swing\ frc}\cdot \mathbb{E}[I_{swing\ frc}(\phi)]+\\ &c_{stance\ frc}\cdot \mathbb{E}[I_{stance\ frc}(\phi)] \end{aligned} $$
  • Reward: bipedal behavior, smoothing terms and penalties on control inputs $$ \mathbb{E}[R(s,\phi)]=\mathbb{E}[R_{bipedal}(s,\phi)]+R_{smooth}(s)+R_{cmd}(s) $$ $$ \begin{aligned} \mathbb{E}[R_{bipedal}(s,\phi)]&=\mathbb{E}[C_{frc}(\phi+\theta_{left})]\cdot q_{left\ frc}(s)+\cdots\\ &+\mathbb{E}[C_{spd}(\phi+\theta_{left})]\cdot q_{left\ spd}(s)+\cdots, \end{aligned} $$ where $\theta_{left}$ and $\theta_{right}$ are the cycle offsets of the two legs.

Cassie: Bipedal Gait Learning

  • Keep the cycle offsets fixed and vary the phase ratio: transition from walking to running and hopping with more or less time in the air.
  • Vary both the cycle offsets and the phase ratio: introduce additional reward terms (behavior-specific transition penalties).
    • Hopping: commanded when the cycle offsets are very close; a cost constrains the two feet to stay close together $$ q_{hop\ sym}(s)=1-\exp\!\left(-err_{sym}\exp\!\left(-5\left|\sin\!\left(2\pi(\theta_{left}-\theta_{right})\right)\right|\right)\right). $$
    • $$ q_{standing\ cost}(s)=1-\exp(-(err_{sym}+20q_{action\ diff}(s))). $$

Cassie: Policy

  • Input:
    • robot state $\hat q, \hat{\dot q}$
    • desired velocity $\dot x_{desired}, \dot y_{desired}$
    • phase ratios and clock inputs $r, p=\{\sin(2\pi(\phi+\theta_{left})), \sin(2\pi(\phi+\theta_{right}))\}$
  • Output: 10 desired joint positions and 10 sets of PD gains
  • Architecture: LSTM with 2 recurrent hidden layers of size 128 each and a simple linear output projection of size 30.
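
A minimal PyTorch sketch of the recurrent policy described above: a two-layer LSTM with 128 hidden units and a linear projection to 30 outputs (10 joint position targets plus PD gains). The observation dimension is a placeholder.

```python
import torch
import torch.nn as nn

class GaitPolicy(nn.Module):
    """LSTM policy: 2 recurrent layers of 128 units, linear output of size 30."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, 128, num_layers=2, batch_first=True)
        self.head = nn.Linear(128, 30)

    def forward(self, obs_seq, hidden=None):
        out, hidden = self.lstm(obs_seq, hidden)  # carry hidden state across steps
        return self.head(out), hidden

policy = GaitPolicy(obs_dim=47)          # 47 is a placeholder dimension
y, h = policy(torch.randn(1, 1, 47))     # one control step at a time
```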

Cassie: Training and Deployment

  • Dynamics randomization.
  • Learning algorithm: PPO
  • Simulator: cassie-mujoco-sim based on MuJoCo
  • Policy evaluation (40 Hz) and PD tracking (2000 Hz).
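
A sketch of the two-rate control loop above (policy at 40 Hz, joint PD tracking at 2000 Hz); `policy` and `pd_tracking` are hypothetical stand-ins, and the gains and dimensions are illustrative.

```python
import numpy as np

POLICY_HZ, PD_HZ = 40, 2000
STEPS_PER_DECISION = PD_HZ // POLICY_HZ            # 50 PD updates per policy step

def policy(obs):
    """Placeholder for the network: returns joint targets and PD gains."""
    return np.zeros(10), (np.full(10, 40.0), np.full(10, 1.0))

def pd_tracking(q_target, gains, q, qdot):
    kp, kd = gains
    return kp * (q_target - q) - kd * qdot

q, qdot, obs = np.zeros(10), np.zeros(10), np.zeros(47)
q_target, gains = policy(obs)                      # 40 Hz decision
for _ in range(STEPS_PER_DECISION):                # 2000 Hz torque-level loop
    tau = pd_tracking(q_target, gains, q, qdot)
    # ... apply tau in the simulator / on the robot, then update q, qdot, obs ...
```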

Cassie: Footstep-Constrained Walking - Policy

  • Input:
    • rotational velocity and orientation of the base; robot state $\hat q, \hat{\dot q}$
    • clock inputs $p=\{\sin(2\pi(\phi+\theta_{left})), \sin(2\pi(\phi+\theta_{right}))\}$
    • swing-to-stance ratio $r$ and turn rate input
    • desired footstep constraint $U=(L_{step}, \theta_{step})$
  • Output: 10 desired joint positions
  • Architecture: LSTM with 2 recurrent hidden layers of size 128 each and a simple linear output projection of size 30.

Cassie: Footstep-Constrained Walking - Reward

  • Reward at two time scales:
    • locomotion: at every decision cycle $$ r_{pelvis}=100(\Delta d^i-\Delta d^{i-1}) $$
    • footstep-constraint: only at touchdown moment $$ r_{step}=\exp(-2f), $$ where $f$ is the Euclidean distance between the actual and target footstep locations.
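
A small sketch of the two reward terms above; only the formulas come from the slide, and the example coordinates are made up.

```python
import numpy as np

def footstep_reward(foot_xy, target_xy):
    """Sparse term, evaluated only at touchdown: r_step = exp(-2 f), where f is
    the Euclidean distance between actual and target footstep locations."""
    f = np.linalg.norm(np.asarray(foot_xy) - np.asarray(target_xy))
    return np.exp(-2.0 * f)

def pelvis_reward(delta_d_curr, delta_d_prev):
    """Dense term at every decision cycle: r_pelvis = 100 (Delta d^i - Delta d^{i-1})."""
    return 100.0 * (delta_d_curr - delta_d_prev)

print(footstep_reward([0.30, 0.10], [0.32, 0.08]))   # near-miss -> reward close to 1
```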

Conclusion

  • Reward function.
    Characterize behaviors without loss of generality: exploit periodicity (periodic reward composition)
  • Input and output.
    Task space or joint space; position, velocity or torque command?
  • Modelling.
    simulation fidelity (neural networks for motors); dynamics randomization; manual noise injection
  • Training.
    curriculum (naive or adaptive); teacher and student; massively parallel simulation [Rudin2021, Margolis2022]
  • Learning algorithm.
    PPO etc.
  • Network design.
Thanks!

Bibliography

  • [Zhao2020] Zhao, Wenshuai, Jorge Peña Queralta, and Tomi Westerlund. "Sim-to-real transfer in deep reinforcement learning for robotics: a survey." 2020 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2020.
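  • [Hwangbo2019] Hwangbo, Jemin, et al. "Learning agile and dynamic motor skills for legged robots." Science Robotics 4.26 (2019): eaau5872.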
  • [Peng2017] Peng, Xue Bin, and Michiel van de Panne. "Learning locomotion skills using DeepRL: Does the choice of action space matter?." Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation. 2017.
  • [Lee2020] Lee, Joonho, et al. "Learning quadrupedal locomotion over challenging terrain." Science Robotics 5.47 (2020): eabc5986.
  • [Hurst2021] Siekmann, Jonah, et al. "Sim-to-real learning of all common bipedal gaits via periodic reward composition." 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021
  • [Duan2022] Duan, Helei, et al. "Sim-to-Real Learning of Footstep-Constrained Bipedal Dynamic Walking." arXiv preprint arXiv:2203.07589 (2022)
  • [Dao2022] J. Dao, K. Green, H. Duan, A. Fern and J. Hurst, "Sim-to-Real Learning for Bipedal Locomotion Under Unsensed Dynamic Loads," 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 10449-10455
  • [Raisim] https://raisim.com/sections/Performance.html
  • [Rudin2021] Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Proc. Conf. Robot Learn. (CoRL), pages 91–100, London, UK, November 2021
  • [Margolis2022] Margolis, Gabriel B., et al. "Rapid Locomotion via Reinforcement Learning." arXiv preprint arXiv:2205.02824 (2022)