Abstract:This new algorithm which incorporates look-ahead search inside the training loop resulting in rapid improvement and precise and stable learning. By using this new search methods towards programmatic marketing, we massively improve the return of investment on Google Adwords auction by 4.6% in 1 day. We apply simple, gradient-based updates to train the next policy and value network. This appears to be much more stable than incremental, gradient-based policy improvements such as policy gradient or Q-learning that can potentially forget previous improvements.