Large sequence models for sequential decision-making: a survey

Wen, Muning; Lin, Runji; Wang, Hanjing; Yang, Yaodong; Wen, Ying; Luo, Ming Ronnier; Wang, Jun; Zhang, Haifeng; Zhang, Weinan

doi:10.1007/s11704-023-2689-5

Cited by 21 publications

(2 citation statements)

References 31 publications

“…By conditioning on target returns, the policy can generate actions that closely resemble the behaviors presented in the dataset. Decision Transformers (DT) and its variants (Siebenborn et al 2022; Zheng, Zhang, and Grover 2022; Hu et al 2023; Wen et al 2023) use returns-to-go, i.e. cumulative future returns, as the conditional inputs and model trajectories with causal transformers (Vaswani et al 2017).…”

Section: Return-conditioned Supervised Learning (mentioning)

confidence: 99%

Critic-Guided Decision Transformer for Offline Reinforcement Learning

Wang, Yang, Wen et al. 2024, AAAI

Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner. However, prevailing RCSL methods largely focus on deterministic trajectory modeling, disregarding stochastic state transitions and the diversity of future trajectory distributions. A fundamental challenge arises from the inconsistency between the sampled returns within individual trajectories and the expected returns across multiple trajectories. Fortunately, value-based methods offer a solution by leveraging a value function to approximate the expected returns, thereby addressing the inconsistency effectively. Building upon these insights, we propose a novel approach, termed the Critic-Guided Decision Transformer (CGDT), which combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer. By incorporating a learned value function, known as the critic, CGDT ensures a direct alignment between the specified target returns and the expected returns of actions. This integration bridges the gap between the deterministic nature of RCSL and the probabilistic characteristics of value-based methods. Empirical evaluations on stochastic environments and D4RL benchmark datasets demonstrate the superiority of CGDT over traditional RCSL methods. These results highlight the potential of CGDT to advance the state of the art in offline RL and extend the applicability of RCSL to a wide range of RL tasks.
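The mismatch the abstract describes, between the return sampled along one trajectory and the return expected across many, can be illustrated with a minimal sketch. This is not the CGDT implementation: the function name `returns_to_go`, the discount argument `gamma`, and the two toy trajectories are assumptions made purely for illustration.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at each step: the (discounted) sum of rewards from that step onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return-to-go of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


# Hypothetical toy data: the same start state and action, but a stochastic
# transition leads to a reward of 10 in one trajectory and 0 in the other.
lucky = returns_to_go([0.0, 10.0])
unlucky = returns_to_go([0.0, 0.0])

# Each trajectory carries its own sampled return-to-go at step 0 ...
sampled_at_start = [lucky[0], unlucky[0]]
# ... while a value function would approximate the average over outcomes.
expected_at_start = sum(sampled_at_start) / len(sampled_at_start)
```

In this toy case a model conditioned on the sampled return of 10 would be trained as if the action reliably achieves it, whereas the expected return of that action is only 5. That gap is the inconsistency the abstract attributes to stochastic transitions.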

“…In the meantime, the past few years have witnessed huge success in applying sequence modeling to natural language processing (Vaswani et al 2017; Brown et al 2020). In light of the similarity between language sequences and RL trajectories, a lot of works have explored the idea of modeling RL trajectories using sequence modeling approaches (Wen et al 2023). For example, Decision Transformer (DT) (Chen et al 2021) models offline trajectories extended with the sum of the future rewards along the trajectory, namely the return-to-go (RTG).…”

Section: Introduction (mentioning)

confidence: 99%

ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning

Gao, Wu, Cao et al. 2024, AAAI

Decision Transformer (DT), which employs expressive sequence modeling techniques to perform action generation, has emerged as a promising approach to offline policy optimization. However, DT generates actions conditioned on a desired future return, which is known to bear some weaknesses such as the susceptibility to environmental stochasticity. To overcome DT's weaknesses, we propose to empower DT with dynamic programming. Our method comprises three steps. First, we employ in-sample value iteration to obtain approximated value functions, which involves dynamic programming over the MDP structure. Second, we evaluate action quality in context with estimated advantages. We introduce two types of advantage estimators, IAE and GAE, which are suitable for different tasks. Third, we train an Advantage-Conditioned Transformer (ACT) to generate actions conditioned on the estimated advantages. Finally, during testing, ACT generates actions conditioned on a desired advantage. Our evaluation results validate that, by leveraging the power of dynamic programming, ACT demonstrates effective trajectory stitching and robust action generation in spite of the environmental stochasticity, outperforming baseline methods across various benchmarks. Additionally, we conduct an in-depth analysis of ACT's various design choices through ablation studies. Our code is available at https://github.com/LAMDA-RL/ACT.
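The second step, scoring action quality with estimated advantages, can be sketched as follows. The sketch assumes that "GAE" in the abstract refers to the standard generalized advantage estimator; the abstract does not define it, and this is not the code from the ACT repository. The function name `gae`, its parameters, and the convention that `values` carries one extra bootstrap entry are illustrative assumptions.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    `values` is assumed to hold len(rewards) + 1 entries: a value estimate for
    every visited state plus a bootstrap value for the state after the last step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of the current and later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical numbers: with lam=0 the estimate reduces to the one-step TD error.
one_step = gae([1.0, 1.0], [0.5, 0.5, 0.0], gamma=1.0, lam=0.0)
```

With `lam=1`, `gamma=1` and all-zero values the estimate collapses to the plain return-to-go, so the `lam` parameter interpolates between a one-step advantage and a full Monte Carlo return. In an advantage-conditioned model these per-step estimates would take the place of returns-to-go as the conditioning signal.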

Encoding and decoding models

Senden, Kroner 2025, Encyclopedia of the Human Brain
