The fifth generation (5G) radio access technology is designed to support highly delay-sensitive applications, i.e., ultra-reliable and low-latency communications (URLLC). For dynamic time division duplex (TDD) systems, real-time optimization of the radio pattern selection is vital to achieving a decent URLLC outage latency. In this study, a dual reinforcement machine learning (RML) approach is developed for online pattern optimization in 5G new radio TDD deployments. The proposed solution seeks to minimize the maximum URLLC tail latency, i.e., a min-max problem, by introducing nested RML instances. Real-time directional traffic statistics are monitored and fed to the primary RML layer, which estimates the sufficient number of downlink (DL) and uplink (UL) symbols across the upcoming radio pattern. The secondary RML sub-networks then determine the DL and UL symbol structure that best minimizes the URLLC outage latency. The proposed solution is evaluated by extensive and highly detailed system-level simulations, and the results demonstrate a considerable URLLC outage latency improvement with the proposed scheme compared to state-of-the-art dynamic-TDD proposals.