Perceptual multistability has been studied for centuries using a diverse collection of approaches. Insights derived from this phenomenon range from core principles of information processing, such as perceptual inference, to high-level concerns, such as visual awareness. The dominant computational explanations of perceptual multistability are based on the Helmholtzian view of perception as inverse inference. However, these approaches struggle to account for the crucial role played by value: for example, percepts paired with reward dominate for longer periods than unpaired ones. In this study, we formulate perceptual multistability in terms of dynamic, value-based choice, employing the formalism of a partially observable Markov decision process (POMDP). We use binocular rivalry as an example, considering different explicit and implicit sources of reward (and punishment) for each percept. The resulting values are time-dependent and shaped by novelty, which acts as a form of exploration. The solution to the POMDP is the optimal perceptual policy, and we show that this can replicate and explain several characteristics of binocular rivalry, ranging from classic hallmarks, such as apparently spontaneous random switches with approximately gamma-distributed dominance periods, to more subtle aspects, such as the rich temporal dynamics of perceptual switching rates. Overall, our decision-theoretic perspective on perceptual multistability not only accounts for a wealth of unexplained data, but also brings modern conceptions of internal reinforcement learning to bear on perceptual phenomena, and on sensory processing more generally.
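To make the framing concrete, the following is a minimal, hypothetical sketch of value-based perceptual choice with a novelty bonus driving exploration. It is not the model solved in the paper (which computes the optimal POMDP policy); here a simple softmax policy, the reward values, and the novelty time constants are all illustrative assumptions. The sketch nonetheless exhibits the qualitative signatures described above: stochastic switches, longer dominance for the reward-paired percept, and dominance durations with a rising hazard (hence non-exponential, roughly gamma-like shape).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not taken from the paper):
# extrinsic reward for each of two rivalrous percepts, a novelty bonus
# that decays while a percept dominates and recovers while it is
# suppressed, and an inverse temperature for a softmax choice policy.
reward = np.array([1.0, 0.5])   # percept 0 is paired with more reward
novelty = np.array([1.0, 1.0])  # per-percept novelty bonus in [0, 1]
tau_decay, tau_recover = 20.0, 40.0
beta = 4.0                      # softmax inverse temperature

dominant = 0                    # currently dominant percept
durations = {0: [], 1: []}      # dominance durations (in time steps)
run = 0

for t in range(200_000):
    # Novelty decays for the dominant percept, recovers for the other.
    other = 1 - dominant
    novelty[dominant] -= novelty[dominant] / tau_decay
    novelty[other] += (1.0 - novelty[other]) / tau_recover

    # Each percept's value = extrinsic reward + novelty (exploration) bonus.
    value = reward + novelty

    # Noisy, value-based choice of which percept dominates next.
    p = np.exp(beta * value) / np.exp(beta * value).sum()
    choice = rng.choice(2, p=p)

    if choice == dominant:
        run += 1                          # dominance continues
    else:
        durations[dominant].append(run)   # record the finished period
        dominant, run = choice, 1         # perceptual switch

for k in (0, 1):
    d = np.asarray(durations[k], dtype=float)
    print(f"percept {k}: mean dominance {d.mean():.1f} steps, "
          f"CV {d.std() / d.mean():.2f} (n={len(d)})")
```

Running this, the reward-paired percept dominates for substantially longer on average, and the coefficient of variation of the dominance durations is below 1, as expected for gamma-like rather than exponential distributions; both features echo, in toy form, the phenomena the full POMDP treatment is built to explain.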