Optimizing routing policies in optical networks with dynamic traffic is generally very difficult. The most widely used routing policies, e.g., shortest-path routing and least-congested-path (LCP) routing, are heuristics. Although LCP is often regarded as the best-performing adaptive routing policy, it is natural to ask whether there exist routing policies that surpass these heuristics in performance. In this paper, we propose a reinforcement learning (RL) based routing framework that learns routing decisions through interaction with the environment. With a proposed self-learning method, the RL agent can improve its routing policy continuously. Simulations on a ring-topology metro optical network demonstrate that the proposed scheme outperforms the LCP routing policy.