2021 International Conference on Computer, Control and Robotics (ICCCR)
DOI: 10.1109/icccr49711.2021.9349369

Learning Ball-Balancing Robot through Deep Reinforcement Learning

Abstract: The ball-balancing robot (ballbot) is a good platform for testing the effectiveness of a balancing controller. For balancing control, conventional model-based feedback control methods have been widely used. However, contacts and collisions are difficult to model and often lead to failure in balancing control, especially when the ballbot tilts to a large angle. To explore the maximum initial tilting angle of the ballbot, the balancing control is interpreted as a recovery task using Reinforcement Learning (RL)…
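The record does not reproduce the paper's code, but the recovery-task formulation can be illustrated with a minimal sketch: a Gym-style training loop in which episodes begin from a perturbed state and the agent learns to return the system to upright. Pendulum-v1 is used here purely as a stand-in for the authors' ballbot simulator, which is not public; PPO, the timestep budget, and the rollout length are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of balancing-as-recovery with RL (assumption: Gymnasium's
# Pendulum-v1 stands in for the ballbot, since the authors' simulator is
# not public; episodes start from a randomized initial angle).
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")

# Train a policy that drives the system back to the upright equilibrium.
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=100_000)

# Evaluate: roll out one episode with the trained policy.
obs, _ = env.reset()
for _ in range(200):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        break
```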

Cited by 11 publications (6 citation statements). References 11 publications.
“…In another line of research, latent action representations are often learned to exploit the structure in the action space in reinforcement learning [3,6,7,48,49]. In particular, Deffayet et al [7] use a Variational AutoEncoder (VAE) model to pre-train latent slate space from logged data to improve recommendations.…”
Section: Related Work (mentioning)
Confidence: 99%
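As a concrete illustration of the latent-action idea this statement attributes to Deffayet et al [7], the sketch below pre-trains a small VAE on logged actions so that a policy can later act in the compact latent space and decode back to a full action. The dimensions, the placeholder training data, and the decode step are illustrative assumptions, not the cited paper's implementation.

```python
# Hedged sketch: pre-train a VAE on logged (slate) actions so an RL policy
# can act in a low-dimensional latent space. All sizes are illustrative.
import torch
import torch.nn as nn

class ActionVAE(nn.Module):
    def __init__(self, action_dim: int, latent_dim: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(action_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))

    def forward(self, a):
        h = self.enc(a)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(z), mu, logvar

vae = ActionVAE(action_dim=50, latent_dim=8)
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
logged_actions = torch.rand(256, 50)  # placeholder for real logged data
for _ in range(100):
    recon, mu, logvar = vae(logged_actions)
    recon_loss = ((recon - logged_actions) ** 2).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
    opt.zero_grad()
    (recon_loss + kl).backward()
    opt.step()

# Downstream, the policy outputs an 8-D latent z and the frozen decoder
# maps it back to a full action: action = vae.dec(z).
```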
“…Early methods pinpoint the core issue in offline RL as extrapolation error [13] and suggest using policy constraints to ensure that the learned policy remains close to the behavior policy. These constraints include adding behavior cloning (BC) loss [46] in policy training [12], using the divergence between the behavior policy and the learned policy [13], [14], [25], applying advantage-weighted constraints to balance BC and advantages [39], penalizing the prediction-error of a variational auto-encoder [41], and learning latent actions from the offline data [55]. While policy-constraint methods excel in performance on datasets derived from expert behavior policies, they struggle to discover optimal policies when confronted with datasets featuring suboptimal policies.…”
Section: Related Work (mentioning)
Confidence: 99%
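The simplest policy constraint listed in this statement, adding a behavior cloning loss to policy training [12], [46], can be sketched as a TD3+BC-style actor update. The network sizes, the placeholder batch, and the weighting constant are illustrative, not the cited papers' exact settings.

```python
# Sketch of a BC-constrained actor update in the style of TD3+BC: maximize Q
# while penalizing deviation from the dataset actions. Shapes are placeholders.
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(17, 256), nn.ReLU(), nn.Linear(256, 6), nn.Tanh())
critic = nn.Sequential(nn.Linear(17 + 6, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

# One batch from the offline dataset (random placeholders for logged data).
states = torch.randn(256, 17)
dataset_actions = torch.rand(256, 6) * 2 - 1

pi = actor(states)
q = critic(torch.cat([states, pi], dim=-1))
# lam rescales the Q term so the BC term stays on a comparable scale.
lam = 2.5 / q.abs().mean().detach()
actor_loss = -lam * q.mean() + ((pi - dataset_actions) ** 2).mean()
opt.zero_grad()
actor_loss.backward()
opt.step()
```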
“…The modular design simplifies the locomotion control problem with a fixed gait and allows for individual gait analysis [73]. To learn gait patterns, we used a learned action space that maps the output of the high-level policy to a distribution of gait parameters [74][75][76]. The generative model was trained with known gait parameters [6,13].…”
Section: Comparison of Different Architectures (mentioning)
Confidence: 99%
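A hedged sketch of the learned action space this statement describes: a pre-trained decoder maps the high-level policy's latent output to a distribution over gait parameters. The latent size, the number of gait parameters, and the Gaussian form are assumptions for illustration, not the cited work's architecture.

```python
# Sketch: a decoder mapping a latent command z to a distribution over gait
# parameters (e.g., frequency, amplitude, phase offsets). Sizes illustrative.
import torch
import torch.nn as nn

class GaitDecoder(nn.Module):
    """Maps a latent command to a Gaussian over gait parameters."""
    def __init__(self, latent_dim=4, n_gait_params=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU())
        self.mean = nn.Linear(64, n_gait_params)
        self.log_std = nn.Linear(64, n_gait_params)

    def forward(self, z):
        h = self.net(z)
        return torch.distributions.Normal(self.mean(h), self.log_std(h).exp())

decoder = GaitDecoder()            # pre-trained offline on known gait parameters
z = torch.randn(1, 4)              # output of the high-level policy
gait_params = decoder(z).sample()  # one gait for the fixed-gait controller
```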
“…Existing works have proposed using generative models such as Variational Autoencoders (VAEs) [75,76] or a normalizing flow [74] to transform the action distribution into a different, possibly multi-modal, distribution. Wenxuan et al [75] and Allshire et al [76] proposed to pre-train generative models with existing motion data for higher sample efficiency.…”
Section: Acknowledgments (mentioning)
Confidence: 99%
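Of the two generative-model options this statement mentions, the normalizing-flow variant [74] can be sketched with a single RealNVP-style affine coupling layer that warps a Gaussian base sample into an action. Real implementations stack several such layers; all sizes below are illustrative assumptions.

```python
# Sketch: one affine coupling layer (RealNVP-style) transforming a Gaussian
# base sample into an action, returning the sample and log|det J|.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim=6):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, 64), nn.ReLU(),
            nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)  # scale and shift from x1
        y2 = x2 * s.exp() + t                 # affine transform of x2
        return torch.cat([x1, y2], dim=-1), s.sum(dim=-1)

flow = AffineCoupling()
base = torch.randn(32, 6)      # samples from the base Gaussian
actions, log_det = flow(base)  # transformed, potentially richer, actions
```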