Summary
As the scope of reinforcement learning broadens, the numbers of possible states and of executable actions, and hence the size of their product set, explode. Because of the physical and computational constraints imposed on agents, there are often more feasible options than allowed trials. In such cases, optimization procedures that require trying every option at least once do not work. This is the situation that the theory of bounded rationality was proposed to deal with. We formalize satisficing, the central heuristic of bounded rationality theory. Instead of the traditional formulation of satisficing at the policy level in reinforcement learning, we introduce a value function that implements the asymmetric risk attitudes characteristic of human cognition. Operated under a simple greedy policy, the RS (reference satisficing) value function enables efficient satisficing in K-armed bandit problems, and when the reference level for satisficing is set to an appropriate value, it leads to effective optimization. RS is also tested in a robotic motion learning task in which a robot (acrobot) learns to perform giant swings. While standard algorithms fail because of the coarse-grained state space, RS shows stable performance and autonomous exploration, without the randomized exploration and gradual annealing that standard methods require.
1.