“…Since performing rollout on the hardware of human-scale bipedal robots is expensive, we use the zero-shot transfer method. In order to realize this, there are two widely-adopted techniques: (i) end-to-end training a policy by providing the robot with a proprioceptive short-term history [39], [45], [57] or long-term history [44], [62], [68], (ii) teacher-student training that first obtains a teacher policy with privileged information of the environment by RL, then uses this policy to supervise the training of a student policy that only has access of onboard-available observations [18], [38], [40], [42], [52], [55], which shows advantages over the end-to-end training method [38], [52], [70]. However, here we show that, for the dynamic control of bipedal robots, by training the robot in an end-to-end way with a newly-proposed policy structure, we can realize a better learning performance over the teacher-student method which separates the training process and requires more data.…”