EU regulations on $$\textit{CO}_2$$
CO
2
limits and the trend of individualization are pushing the automotive industry towards greater flexibility and robustness in production. One approach to address these challenges is modular production, where workstations are decoupled by automated guided vehicles, requiring new control concepts. Modular production control aims at throughput-optimal coordination of products, workstations, and vehicles. For this np-hard problem, conventional control approaches lack in computing efficiency, do not find optimal solutions, or are not generalizable. In contrast, Deep Reinforcement Learning offers powerful and generalizable algorithms, able to deal with varying environments and high complexity. One of these algorithms is Proximal Policy Optimization, which is used in this article to address modular production control. Experiments in several modular production control settings demonstrate stable, reliable, optimal, and generalizable learning behavior. The agent successfully adapts its strategies with respect to the given problem configuration. We explain how to get to this learning behavior, especially focusing on the agent’s action, state, and reward design.