“…Abbeel et al [2006, 2007], Atkeson and Schaal [1997], Atkeson [1998], Asada et al [1996], Bakker et al [2003], Benbrahim et al [1992], Benbrahim and Franklin [1997], Birdwell and Livingston [2007], Bitzer et al [2010], Conn and Peters II [2007], Duan et al [2007, 2008], Fagg et al [1998], Gaskett et al [2000], Gräve et al [2010], Hafner and Riedmiller [2007], Huang and Weng [2002], Ilg et al [1999], Katz et al [2008], Kimura et al [2001], Kirchner [1997], Kroemer et al [2009], Latzke et al [2007], Lizotte et al [2007], Mahadevan and Connell [1992], Mataric [1997], Nemec et al [2009, 2010], Oßwald et al [2010], Paletta et al [2007], Platt et al [2006], Riedmiller et al [2009], Rottmann et al [2007], Kaelbling [1998, 2002] first approximate a quantity called the value function, and use it to reconstruct the optimal policy. A wide variety of methods exist and can be split mainly into three classes: (i) dynamic programming-based optimal control approaches such as policy iteration or value iteration, (ii) rollout-based Monte Carlo methods, and (iii) temporal difference methods such as TD(λ), Q-learning, and SARSA.…”
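As a concrete illustration of class (iii), the following is a minimal sketch of tabular Q-learning, one of the temporal difference methods named above. It is not code from the cited works; the function name, the discrete environment interface (reset/step returning state, reward, done), and the parameter values are assumptions made for illustration. The key point is that the algorithm estimates an action-value function Q and then reconstructs a policy by acting greedily with respect to it.

    # Hypothetical sketch: tabular Q-learning on a small discrete MDP.
    # The env interface (reset/step) and all parameter values are assumed.
    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
        rng = np.random.default_rng(seed)
        Q = np.zeros((n_states, n_actions))  # tabular action-value estimate
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy exploration over the current estimate
                if rng.random() < epsilon:
                    a = int(rng.integers(n_actions))
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)
                # temporal-difference update toward the one-step bootstrapped target
                td_target = r + (0.0 if done else gamma * np.max(Q[s_next]))
                Q[s, a] += alpha * (td_target - Q[s, a])
                s = s_next
        # the (approximately) optimal policy is reconstructed greedily from Q
        policy = np.argmax(Q, axis=1)
        return Q, policy

The same "estimate a value function, then act greedily" structure underlies the dynamic programming methods in class (i) as well; they differ mainly in using a known model and full sweeps over the state space rather than sampled transitions.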