2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning
DOI: 10.1109/adprl.2007.368210

Sparse Temporal Difference Learning Using LASSO

Abstract: We consider the problem of on-line value function estimation in reinforcement learning, concentrating on which function approximator to use. To try to break the curse of dimensionality, we focus on nonparametric function approximators. We propose to fit the use of kernels into temporal difference algorithms by performing the regression via the LASSO. We introduce the equi-gradient descent algorithm (EGD), a direct adaptation of an algorithm recently introduced in the LARS family for solving the LASSO. …
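The abstract's central ingredient is LASSO-style ℓ1-regularized regression solved by a LARS-family path algorithm. The sketch below is not the authors' equi-gradient descent (EGD) algorithm; it only illustrates, using scikit-learn's LassoLars on a synthetic problem (both the library choice and the Gaussian-kernel dictionary are my assumptions), how an ℓ1 penalty drives most kernel weights to exactly zero, which is the kind of sparsity the paper exploits inside temporal difference learning.

```python
# Minimal sketch (not the paper's EGD algorithm): LASSO regression over a
# Gaussian-kernel dictionary, solved with a LARS-family path algorithm.
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(0)

# Synthetic 1-D regression problem standing in for value-function targets.
x = rng.uniform(-3.0, 3.0, size=200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)

# Kernel dictionary: one Gaussian feature centred on each sample point.
sigma = 0.5
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2.0 * sigma**2))

# LassoLars follows the LARS homotopy path to solve the LASSO;
# a larger alpha yields a sparser set of active kernels.
model = LassoLars(alpha=0.01)
model.fit(K, y)

print(f"{np.count_nonzero(model.coef_)} active kernels out of {len(x)}")
```

In the paper's on-line setting the regression targets would come from temporal-difference errors rather than from a fixed supervised target; the sparsity mechanism illustrated here is the same.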

Cited by 35 publications (29 citation statements)
References 10 publications
“…In this case, the user should determine the architecture of the network. If the user elects to use a nonparametric regularization-based method (e.g., Engel et al. 2005; Jung and Polani 2006; Loth et al. 2007; Farahmand et al. 2009b; Taylor and Parr 2009; Kolter and Ng 2009), the regularization coefficient and kernel (or other) parameters should be selected. From a general viewpoint, the decision of which method (linear vs. non-linear, parametric vs. nonparametric) to use is not different from that of how to tune a particular method.…”
“…Using an ℓ1-penalty term for value function approximation has been considered before in [16] in an approximate linear programming context, or in [13] where it is used to minimize a Bellman residual. Our work is closer to LARS-TD [12], briefly presented in Section 2.2, and both approaches are compared next.…”
Section: Discussion
“…One searches for an approximation of the value function V (being a fixed point of the Bellman operator T) belonging to some (linear) hypothesis space H, onto which one projects any function using the related projection operator Π. LSTD provides V̂ ∈ H, the fixed point of the composed operator ΠT. The sole generalization of LSTD to ℓ1-regularization has been proposed in [12] ([11] solves the same problem, [13] regularizes a biased Bellman residual and [16] considers linear programming). They add an ℓ1-penalty term to the projection operator and solve the consequent fixed-point problem; the corresponding algorithm is called LARS-TD in reference to the homotopy path algorithm LARS (Least Angle Regression) [6] which inspired it.…”
Section: Introduction
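As a reading aid for this excerpt, here is a minimal sketch of the two fixed-point problems it describes, in generic notation (λ for the regularization coefficient and Φθ for a linear parametrization of the value function are my own shorthand, not symbols taken from the cited papers):

```latex
% LSTD: fixed point of the projected Bellman operator,
% where \Pi projects onto the hypothesis space H.
\hat{V} = \Pi T \hat{V},
\qquad
\Pi g = \operatorname*{arg\,min}_{f \in \mathcal{H}} \; \| f - g \|^{2}.

% LARS-TD-style sketch: the projection step itself carries an l1 penalty,
% and one solves the resulting fixed-point problem in the weights theta.
\theta^{\star} \in \operatorname*{arg\,min}_{\theta} \;
  \bigl\| \Phi\theta - T(\Phi\theta^{\star}) \bigr\|_{2}^{2}
  + \lambda \, \| \theta \|_{1}.
```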
“…However, none of the previous works that we know of has explored in a systematic manner how regularization influences the performance of the resulting procedure. The only works that we know of that used regularization are those of Jung and Polani [11], Loth et al. [14] and Xu et al. [22]. In particular, Jung and Polani [11] explored penalizing the empirical L2-norm of the Bellman residual for finding the value function of a policy given a trajectory in a deterministic system, while L1-penalties for the same problem were considered by Loth et al. [14].…”
Section: Introduction
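The excerpt distinguishes the two regularized Bellman-residual formulations by their penalty norm. As a hedged sketch in generic notation ($\hat{T}$ for the empirical Bellman operator observed along the trajectory and $V_\theta$ for the parametrized value function are my shorthand; the cited works may use, e.g., an RKHS norm rather than a plain Euclidean one):

```latex
% L2-penalized Bellman residual (the flavour attributed to Jung and Polani [11]):
\min_{\theta} \; \| \hat{T} V_{\theta} - V_{\theta} \|_{2}^{2}
  + \lambda \, \| \theta \|_{2}^{2}

% L1-penalized Bellman residual (the flavour attributed to Loth et al. [14]),
% whose solution is sparse in the weights theta:
\min_{\theta} \; \| \hat{T} V_{\theta} - V_{\theta} \|_{2}^{2}
  + \lambda \, \| \theta \|_{1}
```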