2021 | Preprint
DOI: 10.48550/arxiv.2104.08196

Towards Standardizing Reinforcement Learning Approaches for Stochastic Production Scheduling

Abstract: Recent years have seen a rise in interest in using machine learning, particularly reinforcement learning (RL), for production scheduling problems of varying degrees of complexity. The general approach is to break down the scheduling problem into a Markov Decision Process (MDP), whereupon a simulation implementing the MDP is used to train an RL agent. Since existing studies rely on (sometimes) complex simulations for which the code is unavailable, the experiments presented are hard, or, in the case of …
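The MDP-plus-simulation pattern described in the abstract can be illustrated with a minimal, gymnasium-style environment skeleton. The observation layout, action meaning, reward, and all numbers below are placeholder assumptions and do not reproduce the paper's simulation:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class SchedulingMDP(gym.Env):
    """Minimal sketch of a scheduling MDP used to train an RL agent.

    Placeholder state: remaining processing time per job.
    Placeholder action: which job to dispatch next.
    """

    def __init__(self, n_jobs=5):
        super().__init__()
        self.n_jobs = n_jobs
        self.observation_space = spaces.Box(0.0, np.inf, shape=(n_jobs,), dtype=np.float32)
        self.action_space = spaces.Discrete(n_jobs)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Random remaining processing times model a stochastic job mix.
        self.remaining = self.np_random.uniform(1.0, 10.0, size=self.n_jobs).astype(np.float32)
        return self.remaining.copy(), {}

    def step(self, action):
        # Dispatch the chosen job; the stochastic duration models processing uncertainty.
        duration = min(self.remaining[action], self.np_random.uniform(1.0, 3.0))
        self.remaining[action] -= duration
        reward = -float(duration)                       # placeholder: penalize elapsed time
        terminated = bool(np.all(self.remaining <= 0.0))
        return self.remaining.copy(), reward, terminated, False, {}
```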

Cited by 4 publications (5 citation statements)
References: 33 publications
“…The reward is most often proportional to the optimization target or to a value that correlates highly with it (e.g. makespan and average utilization) [24]. In our study, the reward is related to the machine loads (instLoad_j) and the completion time of the job.…”
Section: Reward Function
confidence: 86%
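As a minimal sketch of a reward along these lines (the way machine load and completion time are combined, the weights, and the function name are illustrative assumptions, not the citing study's formulation):

```python
def reward(inst_load_j, completion_time, w_load=0.5, w_time=0.5):
    """Illustrative reward relating machine load (instLoad_j) and job completion time.

    Both terms are negated so that a lower machine load and an earlier completion
    yield a higher reward; the weights are assumed, not taken from the citing study.
    """
    return -(w_load * inst_load_j + w_time * completion_time)
```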
“…(1 − α) is the probability of keeping the old Q-value, and γ is a discount factor used to balance the immediate and the future reward. max(Q(s_{t+1}, a)) represents the maximum Q-value of the next state, which is why Q-learning is an off-policy algorithm: the policy used during the evaluation stage can differ from the one used in the improvement stage, leading to more exploration at the expense of convergence speed [24]. In other words, the action a_t chosen in state s_t is not necessarily the same as the action a in the target r_t + γ max(Q(s_{t+1}, a)).…”
Section: Reinforcement Learning
confidence: 99%
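As a minimal sketch of the tabular update quoted above (the state/action space sizes and the values of α and γ are illustrative assumptions):

```python
import numpy as np

n_states, n_actions = 10, 4      # assumed sizes, for illustration only
alpha, gamma = 0.1, 0.95         # assumed learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s_t, a_t, r_t, s_next):
    """One Q-learning step: blend the old Q-value with the bootstrapped target.

    Off-policy: the target uses the maximum over actions in s_next, regardless of
    which action the behaviour policy actually takes in the next step.
    """
    target = r_t + gamma * np.max(Q[s_next])             # r_t + γ · max_a Q(s_{t+1}, a)
    Q[s_t, a_t] = (1 - alpha) * Q[s_t, a_t] + alpha * target
```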
“…The main component in RL is the agent, which takes actions to obtain the maximum cumulative reward through interactions with the environment, much like a human being. Owing to its fast computation and its ability to cope with dynamic events [15], RL has achieved outstanding success in solving the DJSSP. Q-learning is a classic RL algorithm that chooses the action with the highest Q-value stored in the Q-table.…”
Section: Introduction
confidence: 99%
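The Q-table lookup described above can be sketched as follows; the quoted text only mentions picking the highest-Q action, so the ε-greedy exploration step and the value of epsilon are assumptions added for completeness:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_action(Q, s, epsilon=0.1):
    """Pick an action for state s from a Q-table (rows = states, columns = actions).

    With probability epsilon a random action is explored; otherwise the action
    with the highest stored Q-value is exploited, as in the quote above.
    """
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))     # explore
    return int(np.argmax(Q[s]))                  # exploit: highest Q-value in state s
```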
“…Dynamic denotes that the shopfloor environment is complex, involving multiple work centres, multiple machines with different characteristics, and multiple products, coupled with technological and logistic constraints [2]. Key performance indicators (KPIs) for such a production system are not only job-oriented, but also need to be economic and sustainable in terms of resource utilization [3]. It is generally agreed that Reinforcement Learning (RL) is now the leading approach for dynamic dispatching compared to rule-based heuristics, thanks to advances in sensorisation and connectivity technology that support data extraction from real production, facilitating RL agent training and future estimation [1].…”
Section: Introduction
confidence: 99%