In this paper, we develop a novel variant of the off-policy natural actor-critic algorithm with linear function approximation, and we establish a sample complexity of $O(\epsilon^{-3})$, improving upon all previously known convergence bounds for such algorithms. To overcome the divergence caused by the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs the $n$-step TD-learning algorithm with a properly chosen $n$. We present finite-sample convergence bounds for this critic under both constant and diminishing step sizes, which are of independent interest. Furthermore, we develop a variant of natural policy gradient under function approximation with an improved convergence rate of $O(1/T)$ after $T$ iterations. Combining the finite-sample error bounds of the actor and the critic, we obtain the $O(\epsilon^{-3})$ sample complexity. We derive our sample complexity bounds solely under the assumption that the behavior policy sufficiently explores all states and actions, which is a much lighter assumption compared to the related literature.

This $O(\epsilon^{-3})$ sample complexity is the best known convergence bound in the literature for AC algorithms with function approximation.

Novelty in the Critic. Off-policy TD with function approximation is well known to diverge due to the deadly triad [65]. To overcome this difficulty, we employ $n$-step TD-learning and show that a proper choice of $n$ naturally achieves convergence; we present finite-sample bounds under both constant and diminishing step sizes. To the best of our knowledge, we are the first to design a single-time-scale off-policy TD algorithm with function approximation that enjoys provable finite-sample bounds.

Novelty in the Actor. NAC under function approximation was developed in [1] by projecting the Q-values (gradients) onto a lower-dimensional space, which involves the discounted state visitation distribution, a quantity that is hard to estimate. We develop a new NAC algorithm for the function approximation setting that is instead based on the solution of a projected Bellman equation [73], which our critic is designed to solve.

Exploration through Off-Policy Sampling. We establish our convergence bounds under a minimal set of assumptions, viz., ergodicity under the behavior policy, which ensures sufficient exploration and thus resolves the challenges faced in on-policy sampling. As a result, learning can be performed using a single trajectory of samples generated by the behavior policy, and we do not require the constant resets of the system that were introduced in on-policy AC algorithms [1, 75] to ensure exploration. A similar observation about employing off-policy sampling to ensure exploration was made in the tabular setting in [34].

1.2. Related Literature. The two main approaches for learning an optimal policy in an RL problem are value-space methods, such as Q-learning, and policy-space methods, such as AC. The Q-learning algorithm proposed in [77] is perhaps the most well-known value-space method. The asymptotic convergence of Q-learning was established in [1...
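To make the critic's $n$-step construction concrete, the following is a minimal sketch of an importance-sampling-corrected $n$-step TD update for a linear critic $Q_w(s,a) = w^\top \phi(s,a)$. The names `phi`, `rho`, `trajectory`, and `alpha`, as well as the per-decision weighting, are illustrative assumptions rather than the paper's exact algorithm; the proper choice of $n$ and of the step sizes is the subject of the paper's analysis.

```python
import numpy as np

def nstep_offpolicy_td_update(w, trajectory, phi, rho, n, gamma, alpha):
    """One importance-weighted n-step TD update for a linear critic
    Q_w(s, a) = w^T phi(s, a).

    Assumed (illustrative) interface:
      trajectory : list of (state, action, reward) tuples of length n + 1,
                   generated by the behavior policy.
      phi(s, a)  : feature map returning a NumPy vector.
      rho(s, a)  : importance ratio pi(a | s) / mu(a | s) between the
                   target policy pi and the behavior policy mu.
    """
    s0, a0, _ = trajectory[0]
    correction = 0.0   # accumulated importance-weighted n-step TD error
    weight = 1.0       # running product of per-decision importance ratios
    for i in range(n):
        s, a, r = trajectory[i]
        s_next, a_next, _ = trajectory[i + 1]
        # One-step TD error at time i; the bootstrap term is corrected by the
        # importance ratio of the next (off-policy) action.
        delta = (r + gamma * rho(s_next, a_next) * (w @ phi(s_next, a_next))
                 - w @ phi(s, a))
        correction += (gamma ** i) * weight * delta
        weight *= rho(s_next, a_next)
    # Semi-gradient stochastic-approximation step along the initial feature.
    return w + alpha * correction * phi(s0, a0)
```

Iterating such an update along a single trajectory generated by the behavior policy corresponds to a single-time-scale off-policy critic of the kind described above; the role of the paper's analysis is to prescribe how $n$ and the step sizes must be chosen so that the resulting iteration converges despite the deadly triad.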