“…Remark: In this implementation, the value range of M(s), i.e., [16/9N 3 , 37] ≈ [0, 37], is segmented to three small ranges [0, 15], [5,25], and [10,37]. In the first 8100 episodes, the linear reward is defined in the first small range [0, 15]: metric 0 corresponds to reward −1, and 15 corresponds to reward 1; from episode 8101 to 11400, the linear reward is defined in the second small range [5,25]; after episode 11401, the linear reward is defined in the last small range [10,37].…”