{"title": "Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting", "book": "Advances in Neural Information Processing Systems", "page_first": 5504, "page_last": 5513, "abstract": "In reinforcement learning (RL), one of the key components is policy evaluation, which aims to estimate the value function (i.e., expected long-term accumulated reward) of a policy. With a good policy evaluation method, the RL algorithms will estimate the value function more accurately and find a better policy. When the state space is large or continuous \\emph{Gradient-based Temporal Difference(GTD)} policy evaluation algorithms with linear function approximation are widely used. Considering that the collection of the evaluation data is both time and reward consuming, a clear understanding of the finite sample performance of the policy evaluation algorithms is very important to reinforcement learning. Under the assumption that data are i.i.d. generated, previous work provided the finite sample analysis of the GTD algorithms with constant step size by converting them into convex-concave saddle point problems. However, it is well-known that, the data are generated from Markov processes rather than i.i.d in RL problems.. In this paper, in the realistic Markov setting, we derive the finite sample bounds for the general convex-concave saddle point problems, and hence for the GTD algorithms. We have the following discussions based on our bounds. (1) With variants of step size, GTD algorithms converge. (2) The convergence rate is determined by the step size, with the mixing time of the Markov process as the coefficient. The faster the Markov processes mix, the faster the convergence. (3) We explain that the experience replay trick is effective by improving the mixing property of the Markov process. 
To the best of our knowledge, our analysis is the first to provide finite sample bounds for the GTD algorithms in the Markov setting.", "full_text": "Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting

Yue Wang ∗ (School of Science, Beijing Jiaotong University, 11271012@bjtu.edu.cn)
Wei Chen (Microsoft Research, wche@microsoft.com)
Yuting Liu (School of Science, Beijing Jiaotong University, ytliu@bjtu.edu.cn)
Zhi-Ming Ma (Academy of Mathematics and Systems Science, Chinese Academy of Sciences, mazm@amt.ac.cn)
Tie-Yan Liu (Microsoft Research, Tie-Yan.Liu@microsoft.com)

Abstract

In reinforcement learning (RL), one of the key components is policy evaluation, which aims to estimate the value function (i.e., expected long-term accumulated reward) of a policy. With a good policy evaluation method, the RL algorithms will estimate the value function more accurately and find a better policy. When the state space is large or continuous, Gradient-based Temporal Difference (GTD) policy evaluation algorithms with linear function approximation are widely used. Considering that the collection of the evaluation data is both time and reward consuming, a clear understanding of the finite sample performance of the policy evaluation algorithms is very important to reinforcement learning. Under the assumption that data are i.i.d. generated, previous work provided the finite sample analysis of the GTD algorithms with constant step size by converting them into convex-concave saddle point problems. However, it is well known that in RL problems the data are generated from Markov processes rather than i.i.d. In this paper, in the realistic Markov setting, we derive finite sample bounds for the general convex-concave saddle point problems, and hence for the GTD algorithms. We have the following discussions based on our bounds. (1) With various step sizes, GTD algorithms converge.
(2) The convergence rate is determined by the step size, with the mixing time of the Markov process as the coefficient. The faster the Markov process mixes, the faster the convergence. (3) We explain that the experience replay trick is effective because it improves the mixing property of the Markov process. To the best of our knowledge, our analysis is the first to provide finite sample bounds for the GTD algorithms in the Markov setting.

1 Introduction

Reinforcement Learning (RL) (Sutton and Barto [1998]) technologies are powerful tools for learning how to interact with environments, and have various important applications, such as robotics and computer games (Kober et al. [2013], Mnih et al. [2015], Silver et al. [2016], Bahdanau et al. [2016]). In an RL problem, an agent observes the current state, takes an action following a policy at the current state, receives a reward from the environment, and the environment transitions to the next state in a Markovian way; these steps then repeat. The goal of the RL algorithms is to find the optimal policy which

∗This work was done when the first author was visiting Microsoft Research Asia.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

leads to the maximum long-term reward. The value function of a fixed policy for a state is defined as the expected long-term accumulated reward the agent would receive by following the fixed policy starting from this state. Policy evaluation aims to accurately estimate the value of all states under a given policy, which is a key component in RL (Sutton and Barto [1998], Dann et al. [2014]). A better policy evaluation method will help us to better improve the current policy and find the optimal policy. When the state space is large or continuous, it is inefficient to represent the value function over all the states by a look-up table.
A common approach is to extract features for states and use a parameterized function over the feature space to approximate the value function. In applications, both linear approximation and non-linear approximation (e.g., neural networks) to the value function are used. In this paper, we will focus on the linear approximation (Sutton et al. [2009a], Sutton et al. [2009b], Liu et al. [2015]). Leveraging the localization technique in Bhatnagar et al. [2009], the results can be generalized to non-linear cases with extra effort. We leave this as future work.
In policy evaluation with linear approximation, there has been substantial work on the temporal-difference (TD) method, which uses the Bellman equation to update the value function during the learning process (Sutton [1988], Tsitsiklis et al. [1997]). Recently, Sutton et al. [2009a] and Sutton et al. [2009b] proposed Gradient-based Temporal Difference (GTD) algorithms, which use gradient information of the error from the Bellman equation to update the value function. It has been shown that GTD algorithms achieve the lower bound of the storage and computation complexity, making them powerful for handling high-dimensional big data. Therefore, GTD algorithms are now widely used in policy evaluation problems and in the policy evaluation step of practical RL algorithms (Bhatnagar et al. [2009], Silver et al. [2014]).
However, we do not have sufficient theory to describe the finite sample performance of the GTD algorithms. To be specific, will the evaluation process converge as the number of samples increases? If yes, how many samples do we need to reach a target evaluation error? Will the step size in GTD algorithms influence the finite sample error? How can we explain the effectiveness of practical tricks such as experience replay?
Considering that the collection of the evaluation data is very likely to be both time and reward consuming, a clear understanding of the finite sample performance of the GTD algorithms is very important to the efficiency of policy evaluation and of the entire RL algorithms.
Previous work (Liu et al. [2015]) converted the objective function of the GTD algorithms into a convex-concave saddle point problem and conducted the finite sample analysis for GTD with constant step size under the assumption that data are i.i.d. generated. However, in RL problems, the data are generated by an agent who interacts with the environment step by step, and the state transitions in a Markovian way, as introduced previously. As a result, the data are generated from a Markov process and are not i.i.d. In addition, that work did not study decreasing step sizes, which are also commonly used in many gradient-based algorithms (Sutton et al. [2009a], Sutton et al. [2009b], Yu [2015]). Thus, the results from previous work cannot provide satisfactory answers to the above questions about the finite sample performance of the GTD algorithms.
In this paper, we perform the finite sample analysis for the GTD algorithms in the more realistic Markov setting. To achieve this goal, first of all, as in Liu et al. [2015], we consider the stochastic gradient descent algorithms for general convex-concave saddle point problems, which include the GTD algorithms. The optimality of the solution is measured by the primal-dual gap (Liu et al. [2015], Nemirovski et al. [2009]). The finite sample analysis for convex-concave optimization in the Markov setting is challenging. On one hand, in the Markov setting, the non-i.i.d. sampled gradients are no longer unbiased estimates of the gradients. Thus, the proof technique for the convergence of the convex-concave problem in the i.i.d. setting cannot be applied.
On the other hand, although SGD converges for convex optimization problems with Markov gradients, it is much more difficult to obtain the same results for the more complex convex-concave optimization problem.
To overcome the challenge, we design a novel decomposition of the error function (i.e., Eqn (3.1)). The intuition of the decomposition and the key techniques are as follows: (1) Although samples are not i.i.d., for large enough τ, the sample at time t + τ is \"nearly independent\" of the sample at time t, and its distribution is \"very close\" to the stationary distribution. (2) We split the random variables in the objective related to the E operator and the variables related to the max operator into different terms in order to control them respectively. This is non-trivial, and we construct a sequence of auxiliary random variables to do so. (3) All the constructions above need careful treatment of measurability issues in the Markov setting. (4) We construct new martingale difference sequences and apply Azuma's inequality to derive the high-probability bound from the in-expectation bound.
By using the above techniques, we prove a novel finite sample bound for the convex-concave saddle point problem. Considering that the GTD algorithms are specific convex-concave saddle point optimization methods, we finally obtain the finite sample bounds for the GTD algorithms in the realistic Markov setting for RL. To the best of our knowledge, our analysis is the first to provide finite sample bounds for the GTD algorithms in the Markov setting.
We have the following discussions based on our finite sample bounds.

1. GTD algorithms do converge, under a flexible condition on the step size, i.e., Σ_{t=1}^T α_t → ∞ and (Σ_{t=1}^T α_t²)/(Σ_{t=1}^T α_t) → 0 as T → ∞, where α_t is the step size. Most step sizes used in practice satisfy this condition.

2. The convergence rate is O( (1 + τ(η)) √( (Σ_{t=1}^T α_t²)/(Σ_{t=1}^T α_t) ) + √( τ(η) log(τ(η)/δ) Σ_{t=1}^T α_t² ) / (Σ_{t=1}^T α_t) ), where τ(η) is the mixing time of the Markov process and η is a constant. Different step sizes lead to different convergence rates.

3. The experience replay trick is effective, since it improves the mixing property of the Markov process.

Finally, we conduct simulation experiments to verify our theoretical findings. All the conclusions from the analysis are consistent with our empirical observations.

2 Preliminaries

In this section, we briefly introduce the GTD algorithms and related work.

2.1 Gradient-based TD algorithms

Consider the reinforcement learning problem with a Markov decision process (MDP) (S, A, P, R, γ), where S is the state space, A is the action space, P = {P^a_{s,s′}; s, s′ ∈ S, a ∈ A} is the transition matrix and P^a_{s,s′} is the transition probability from state s to state s′ after taking action a, R = {R(s, a); s ∈ S, a ∈ A} is the reward function and R(s, a) is the reward received at state s if taking action a, and 0 < γ < 1 is the discount factor. A policy function µ : A × S → [0, 1] indicates the probability of taking each action at each state. The value function for policy µ is defined as V^µ(s) ≜ E[Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, µ].
In order to perform policy evaluation in a large state space, states are represented by a feature vector φ(s) ∈ R^d, and a linear function v̂(s) = φ(s)^⊤ θ is used to approximate the value function.
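As a concrete illustration of this setup, the following sketch (ours, not from the paper) computes the exact value function of a small hypothetical Markov reward process from the Bellman equation and then fits the linear approximation v̂(s) = φ(s)^⊤θ by least squares; the transition matrix, rewards, discount factor, and features are all illustrative choices.

```python
import numpy as np

# Hypothetical 3-state Markov reward process under a fixed policy.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])   # state-transition matrix (rows sum to 1)
r = np.array([1.0, 0.0, 2.0])     # expected one-step reward per state
gamma = 0.9                       # discount factor

# Exact value function solves the Bellman equation V = r + gamma * P V.
V = np.linalg.solve(np.eye(3) - gamma * P, r)

# Linear approximation v_hat(s) = phi(s)^T theta, fitted by least squares.
# One-hot features (d = |S|) recover V exactly; d < |S| gives an approximation.
Phi = np.eye(3)                   # feature matrix, one row phi(s)^T per state
theta, *_ = np.linalg.lstsq(Phi, V, rcond=None)
v_hat = Phi @ theta
```

With one-hot features the least-squares fit recovers V exactly; with fewer features than states, the same procedure returns the best linear fit, which is the regime the paper's estimation-error analysis addresses.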
The evaluation error is defined as ‖V(s) − v̂(s)‖_{s∼π}, which can be decomposed into an approximation error and an estimation error. In this paper, we focus on the estimation error with linear function approximation.
As we know, the value function in RL satisfies the following Bellman equation: V^µ(s) = E_{µ,P}[R(s_t, a_t) + γ V^µ(s_{t+1}) | s_t = s] ≜ T^µ V^µ(s), where T^µ is called the Bellman operator for policy µ. The Gradient-based TD (GTD) algorithms (including GTD and GTD2) proposed by Sutton et al. [2009a] and Sutton et al. [2009b] update the approximated value function by minimizing objective functions related to the Bellman equation error, i.e., the norm of the expected TD update (NEU) and the mean-square projected Bellman error (MSPBE) respectively (Maei [2011], Liu et al. [2015]):
GTD: J_NEU(θ) = ‖Φ^⊤ K (T^µ v̂ − v̂)‖²,   (2.1)
GTD2: J_MSPBE(θ) = ‖v̂ − Π T^µ v̂‖²_K = ‖Φ^⊤ K (T^µ v̂ − v̂)‖²_{C⁻¹},   (2.2)
where K is a diagonal matrix whose elements are π(s), C = E_π(φ_i φ_i^⊤), Π is the projection onto the linear function space, and π is a distribution over the state space S.
Actually, the two objective functions in GTD and GTD2 can be unified as below:
J(θ) = ‖b − Aθ‖²_{M⁻¹},   (2.3)
where M = I in GTD and M = C in GTD2, A = E_π[ρ(s, a) φ(s)(φ(s) − γφ(s′))^⊤], b = E_π[ρ(s, a) φ(s) r], and ρ(s, a) = µ(a|s)/µ_b(a|s) is the importance weighting factor. Since the underlying distribution is unknown, we use the data D = {ξ_i = (s_i, a_i, r_i, s′_i)}_{i=1}^n to estimate the value function by minimizing the empirical estimation error, i.e., Ĵ(θ) = ‖b̂ − Âθ‖²_{M̂⁻¹}, where Â, b̂, and M̂ are the empirical averages over the T observed samples of Â_i = ρ(s_i, a_i) φ(s_i)(φ(s_i) − γφ(s′_i))^⊤, b̂_i = ρ(s_i, a_i) φ(s_i) r_i, and Ĉ_i = φ(s_i) φ(s_i)^⊤.
Liu et al. [2015] derived that running the GTD algorithms to minimize (2.3) is equivalent to running stochastic gradient algorithms to solve the following convex-concave saddle point problem:
min_x max_y ( L(x, y) = ⟨b − Ax, y⟩ − (1/2)‖y‖²_M ),   (2.4)
with x as the parameter θ in the value function and y as the auxiliary variable used in the GTD algorithms.

Algorithm 1 GTD Algorithms
1: for t = 1, . . . , T do
2:   Update parameters:
     y_{t+1} = P_{X_y}( y_t + α_t (b̂_t − Â_t x_t − M̂_t y_t) )
     x_{t+1} = P_{X_x}( x_t + α_t Â_t^⊤ y_t )
3: end for
Output: x̃_T = (Σ_{t=1}^T α_t x_t)/(Σ_{t=1}^T α_t), ỹ_T = (Σ_{t=1}^T α_t y_t)/(Σ_{t=1}^T α_t)

Therefore, we consider the general convex-concave stochastic saddle point problem as below:
min_{x∈X_x} max_{y∈X_y} { φ(x, y) = E_ξ[Φ(x, y, ξ)] },   (2.5)
where X_x ⊂ R^n and X_y ⊂ R^m are bounded closed convex sets, ξ ∈ Ξ is a random variable with distribution Π(ξ), and the expected function φ(x, y) is convex in x and concave in y.
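To make the general scheme concrete, here is a small self-contained sketch (our illustration, not the paper's code) of the projected stochastic gradient method on a hypothetical instance of a saddle point problem of the form (2.4) with M = I: noisy primal-dual gradient steps, projection onto a bounded set, and weighted averaging of the iterates. The instance size, noise level, ball radius, and step sizes are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of min_x max_y <b - Ax, y> - 0.5*||y||^2 (form of (2.4), M = I).
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
b = rng.standard_normal(n)

def project_ball(z, radius=10.0):
    """Euclidean projection onto a centered L2 ball (the bounded set X)."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

x, y = np.zeros(n), np.zeros(n)
x_avg, y_avg = np.zeros(n), np.zeros(n)
sum_alpha = 0.0

for t in range(1, 20001):
    alpha = 0.5 / np.sqrt(t)              # decreasing step size alpha_t = O(1/sqrt(t))
    noise = 0.1 * rng.standard_normal(n)  # zero-mean noise stands in for sampled gradients
    g_x = -A.T @ y + noise                # stochastic gradient of L in x
    g_y = b - A @ x - y + noise           # stochastic gradient of L in y
    x = project_ball(x - alpha * g_x)     # projected descent step in x
    y = project_ball(y + alpha * g_y)     # projected ascent step in y
    sum_alpha += alpha                    # weighted averaging: z~ = sum(alpha_t z_t)/sum(alpha_t)
    x_avg += alpha * (x - x_avg) / sum_alpha
    y_avg += alpha * (y - y_avg) / sum_alpha

# The saddle point of this instance satisfies A x* = b and y* = 0.
x_star = np.linalg.solve(A, b)
```

The averaged pair (x_avg, y_avg) is the analogue of the output (x̃_T, ỹ_T) in Algorithm 1; it is this averaged iterate, not the last one, whose primal-dual gap the analysis bounds.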
Denote z = (x, y) ∈ X_x × X_y ≜ X, the gradient of φ(z) as g(z), and the gradient of Φ(z, ξ) as G(z, ξ). In the stochastic gradient algorithm, the model is updated as z_{t+1} = P_X( z_t − α_t G(z_t, ξ_t) ), where P_X is the projection onto X and α_t is the step size. After T iterations, we get the model z̃_1^T = (Σ_{t=1}^T α_t z_t)/(Σ_{t=1}^T α_t). The error of the model z̃_1^T is measured by the primal-dual gap error
Err_φ(z̃_1^T) = max_{y∈X_y} φ(x̃_1^T, y) − min_{x∈X_x} φ(x, ỹ_1^T).   (2.6)
Liu et al. [2015] proved that the estimation error of the GTD algorithms can be upper bounded by the corresponding primal-dual gap error multiplied by a factor. Therefore, we first derive the finite sample primal-dual gap error bound for the convex-concave saddle point problem, and then extend it to the finite sample estimation error bound for the GTD algorithms.
Details of the GTD algorithms used to optimize (2.4) are given in Algorithm 1 (Liu et al. [2015]).

2.2 Related work

The TD algorithms for policy evaluation can be divided into two categories: gradient-based methods and least-squares (LS) based methods (Dann et al. [2014]). Since LS-based algorithms need O(d²) storage and computational complexity while GTD algorithms are of O(d) complexity, gradient-based algorithms are more commonly used when the feature dimension is large. Thus, in this paper, we focus on GTD algorithms.
Sutton et al. [2009a] proposed the gradient-based temporal difference (GTD) algorithm for the off-policy policy evaluation problem with linear function approximation. Sutton et al. [2009b] proposed the GTD2 algorithm, which shows faster convergence in practice. Liu et al.
[2015] connected the GTD algorithms to a convex-concave saddle point problem and derived a finite sample bound in both on-policy and off-policy cases for constant step size in the i.i.d. setting.
In the realistic Markov setting, although finite sample bounds for LS-based algorithms have been proved for LSTD(λ) (Lazaric et al. [2012], Tagorti and Scherrer [2015]), to the best of our knowledge, there is no previous finite sample analysis work for GTD algorithms.

3 Main Theorems

In this section, we will present our main results. In Theorem 1, we present our finite sample bound for the general convex-concave saddle point problem; in Theorem 2, we provide the finite sample bounds for GTD algorithms in both on-policy and off-policy cases. Please refer to the complete proofs in the supplementary materials.
Our results are derived based on the following common assumptions (Nemirovski [2004], Duchi et al. [2012], Liu et al. [2015]). Please note that the bounded-data property in Assumption 4 guarantees in RL the Lipschitz and smoothness properties in Assumptions 5-6 (please see Proposition 1).
Assumption 1 (Bounded parameter). There exists D > 0 such that ‖z − z′‖ ≤ D for ∀z, z′ ∈ X.
Assumption 2 (Step size). The step size α_t is non-increasing.
Assumption 3 (Problem solvable). The matrices A and C in Problem (2.4) are non-singular.
Assumption 4 (Bounded data). Features are bounded by L, rewards are bounded by R_max, and importance weights are bounded by ρ_max.
Assumption 5 (Lipschitz). For Π-almost every ξ, the function Φ(x, y, ξ) is Lipschitz in both x and y, with finite constants L_{1x}, L_{1y} respectively. We denote L_1 ≜ √(L²_{1x} + L²_{1y}).
Assumption 6 (Smooth). For Π-almost every ξ, the partial gradients of Φ(x, y, ξ) are Lipschitz in both x and y, with finite constants L_{2x}, L_{2y} respectively. We denote L_2 ≜ √(L²_{2x} + L²_{2y}).
For a Markov process, the mixing time characterizes how fast the process converges to its stationary distribution. Following the notation of Duchi et al. [2012], we denote the conditional probability distribution P(ξ_t ∈ A | F_s) as P^t_{[s]}(A) and the corresponding probability density as p^t_{[s]}. Similarly, we denote the stationary distribution of the data-generating stochastic process as Π and its density as π.
Definition 1. The mixing time τ(P_{[t]}, η) of the sampling distribution P conditioned on the σ-field of the initial t samples F_t = σ(ξ_1, . . . , ξ_t) is defined as τ(P_{[t]}, η) ≜ inf{ Δ : Δ ∈ N, ∫ |p^{t+Δ}_{[t]}(ξ) − π(ξ)| dξ ≤ η }, where p^{t+Δ}_{[t]} is the conditional probability density at time t + Δ, given F_t.
Assumption 7 (Mixing time). The mixing times of the stochastic process {ξ_t} are uniform, i.e., there exists a uniform mixing time τ(P, η) < ∞ such that, with probability 1, we have τ(P_{[s]}, η) ≤ τ(P, η) for all η > 0 and s ∈ N.
Please note that any time-homogeneous Markov chain with finite state space and any uniformly ergodic Markov chain with general state space satisfies the above assumption (Meyn and Tweedie [2012]). For simplicity and without confusion, we will denote τ(P, η) as τ(η).

3.1 Finite Sample Bound for the Convex-concave Saddle Point Problem
Theorem 1. Consider the convex-concave problem in Eqn (2.5).
Suppose Assumptions 1, 2, 5, 6 hold. Then for the gradient algorithm optimizing the convex-concave saddle point problem in (2.5), for ∀δ > 0 and ∀η > 0 such that τ(η) ≤ T/2, with probability at least 1 − δ, we have
Err_φ(z̃_1^T) ≤ (1 / Σ_{t=1}^T α_t) [ A + B Σ_{t=1}^T α_t² + C τ(η) Σ_{t=1}^T α_t² + F η Σ_{t=1}^T α_t + H τ(η) + 8DL_1 √( 2τ(η) log(τ(η)/δ) ( Σ_{t=1}^T α_t² + τ(η)α_0² ) ) ],
where A = D², B = (5/2)L_1², C = 6L_1² + 2L_1L_2D, F = 2L_1D, H = 6L_1Dα_0.
Proof Sketch of Theorem 1. By the definition of the error function in (2.6) and the property that φ(x, y) is convex in x and concave in y, the error can be bounded as below:
Err_φ(z̃_1^T) ≤ max_z (1 / Σ_{t=1}^T α_t) Σ_{t=1}^T α_t (z_t − z)^⊤ g(z_t).
Denote δ_t ≜ g(z_t) − G(z_t, ξ_t), δ′_t ≜ g(z_t) − G(z_t, ξ_{t+τ}), δ″_t ≜ G(z_t, ξ_{t+τ}) − G(z_t, ξ_t). We construct a sequence {v_t}_{t≥1} which is measurable with respect to F_{t−1}: v_{t+1} = P_X( v_t − α_t(g(z_t) − G(z_t, ξ_t)) ). We have the following key decomposition of the right-hand side of the above inequality; the intuition and the explanation for this decomposition are given in the supplementary materials. For ∀τ ≥ 0:
max_z Σ_{t=1}^T α_t (z_t − z)^⊤ g(z_t)
= max_z [ Σ_{t=1}^{T−τ} α_t ( (z_t − z)^⊤ G(z_t, ξ_t)   (a)
+ (z_t − v_t)^⊤ δ′_t   (b)
+ (z_t − v_t)^⊤ δ″_t   (c)
+ (v_t − z)^⊤ δ_t )   (d)
+ Σ_{t=T−τ+1}^T α_t (z_t − z)^⊤ g(z_t) ].   (e)   (3.1)
For term (a), we split G(z_t, ξ_t) into three terms by the definition of the L2-norm and the iteration formula of z_t, and then bound its summation by Σ_{t=1}^{T−τ} ( ‖α_t G(z_t, ξ_t)‖² + ‖z_t − z‖² − ‖z_{t+1} − z‖² ). Actually, in the summation, the last two terms telescope and are eliminated except for the first and the last ones. Swapping the max and Σ operators and using the Lipschitz Assumption 5, the first term can be bounded. Term (c) includes the sum of G(z_t, ξ_{t+τ}) − G(z_t, ξ_t), which might be large in the Markov setting. We reformulate it into the sum of G(z_{t−τ}, ξ_t) − G(z_t, ξ_t) and use the smoothness Assumption 6 to bound it. Term (d) is similar to term (a) except that g(z_t) − G(z_t, ξ_t) is the gradient used to update v_t; we can bound it similarly to term (a). Term (e) is a constant that does not change much as T → ∞, and we can bound it directly through an upper bound on each of its own terms. Finally, we combine all the upper bounds of the terms, use the mixing time Assumption 7 to choose τ = τ(η), and obtain the error bound in Theorem 1.
We decompose term (b) into a martingale part and an expectation part. By constructing a martingale difference sequence and using Azuma's inequality together with Assumption 7, we can bound term (b) and finally obtain the high-probability error bound.
Remark: (1) As T → ∞, the error bound approaches 0 in order O( (A + B Σ_{t=1}^T α_t²) / Σ_{t=1}^T α_t ). (2) The mixing time τ(η) influences the convergence rate: if the Markov process has a better mixing property with smaller τ(η), the algorithm converges faster. (3) If the data are i.i.d. generated (so the mixing time τ(η) = 0, ∀η) and the step size is set to the constant c/(L_1√T), our bound reduces to Err_φ(z̃_1^T) ≤ O(L_1/√T), which is identical to previous work with constant step size in the i.i.d. setting (Liu et al. [2015], Nemirovski et al. [2009]). (4) The high-probability bound is similar to the expectation bound in the following Lemma 1 except for the last term. This is because we consider the deviation of the data around its expectation to derive the high-probability bound.
Lemma 1. Consider the convex-concave problem (2.5). Under the same assumptions as in Theorem 1, for ∀η > 0, we have
E_D[Err_φ(z̃_1^T)] ≤ (1 / Σ_{t=1}^T α_t) [ A + B Σ_{t=1}^T α_t² + C τ(η) Σ_{t=1}^T α_t² + F η Σ_{t=1}^T α_t + H τ(η) ].
Proof Sketch of Lemma 1. We start from the key decomposition (3.1), and bound each term in expectation this time.
We can easily bound each term as before, except for term (b). For term (b), since (z_t − v_t) is not related to the max operator and is measurable with respect to F_{t−1}, we can bound term (b) through the definition of the mixing time, and finally we obtain the expectation bound.

3.2 Finite Sample Bounds for GTD Algorithms

Since the GTD algorithms are specific convex-concave saddle point problems, the error bounds in Theorem 1 and Lemma 1 also provide error bounds for GTD, with the following specifications of the Lipschitz constants.
Proposition 1. Suppose Assumptions 1-4 hold. Then the objective function in the GTD algorithms is Lipschitz and smooth with the following coefficients, where λ_M is the largest singular value of M:
L_1 ≤ √2 (2D(1 + γ)ρ_max L²d + ρ_max L R_max + λ_M),
L_2 ≤ √2 (2(1 + γ)ρ_max L²d + λ_M).
Theorem 2. Suppose Assumptions 1-4 hold. Then we have the following finite sample bounds for the error ‖V − ṽ_1^T‖_π in the GTD algorithms, where ν_C and ν_(A^⊤M⁻¹A) are the smallest eigenvalues of C and A^⊤M⁻¹A respectively, λ_C is the largest singular value of C, o_1(T) = (Σ_{t=1}^T α_t²)/(Σ_{t=1}^T α_t), and o_2(T) = √(Σ_{t=1}^T α_t²)/(Σ_{t=1}^T α_t). In the on-policy case, the bound in expectation is
O( (L²d √(λ_M π_max) / ν_C) √((1 + τ(η)) o_1(T)) ),
and with probability 1 − δ it is
O( (L²d √(λ_M π_max) / ν_C) ( √((1 + τ(η)) o_1(T)) + √(τ(η) log(τ(η)/δ)) o_2(T) ) ).
In the off-policy case, the bound in expectation is
O( (L²d √(2λ_C λ_M π_max) / ν_(A^⊤M⁻¹A)) √((1 + τ(η)) o_1(T)) ),
and with probability 1 − δ it is
O( (L²d √(2λ_C λ_M π_max) / ν_(A^⊤M⁻¹A)) ( √((1 + τ(η)) o_1(T)) + √(τ(η) log(τ(η)/δ)) o_2(T) ) ).
We would like to make the following discussions for Theorem 2.
The GTD algorithms do converge in the realistic Markov setting. As in Theorem 2, up to problem-dependent constants the bound in expectation is O(√((1 + τ(η)) o_1(T))) and with probability 1 − δ it is O( √((1 + τ(η)) o_1(T)) + √(τ(η) log(τ(η)/δ)) o_2(T) ). If the step size α_t makes o_1(T) → 0 and o_2(T) → 0 as T → ∞, the GTD algorithms will converge. Additionally, in the high-probability bound, if Σ_{t=1}^T α_t² > 1, then o_1(T) dominates the order; if Σ_{t=1}^T α_t² < 1, then o_2(T) dominates.
The setup of the step size can be flexible. Our finite sample bounds for the GTD algorithms converge to 0 if the step size satisfies Σ_{t=1}^T α_t → ∞ and (Σ_{t=1}^T α_t²)/(Σ_{t=1}^T α_t) → 0 as T → ∞. This condition on the step size is much weaker than the constant step size in the previous work Liu et al. [2015], and the commonly used step sizes α_t = O(1/√t) and α_t = O(1/t) both satisfy it. To be specific, for α_t = O(1/√t) the convergence rate is O(ln(T)/√T); for α_t = O(1/t) the convergence rate is O(1/ln(T)); for a constant step size, the optimal setup is α_t = c = O(1/√T), considering the trade-off between o_1(T) and o_2(T), and the convergence rate is O(1/√T).
The mixing time matters. If the data are generated from a Markov process with a smaller mixing time, the error bound will be smaller, and we need fewer samples to achieve a fixed estimation error. This finding can explain why the experience replay trick (Lin [1993]) works. With experience replay, we store the agent's experiences (or data samples) at each step, and randomly sample one from the pool of stored samples to perform each update. By Theorems 1.19-1.23 of Durrett [2016], it can be proved that for arbitrary η > 0 there exists t_0 such that for all t > t_0, max_i |N_t(i)/t − π(i)| ≤ η, where N_t(i) is the number of visits to state i among the first t samples. That is to say, when the size of the stored samples is larger than t_0, the mixing time of the new data process with experience replay is 0. Thus, the experience replay trick improves the mixing property of the data process, and hence improves the convergence rate.
Other factors that influence the finite sample bound: (1) As the feature norm L increases, the finite sample bound increases. This is consistent with the empirical finding by Dann et al. [2014] that the normalization of features is crucial for the estimation quality of GTD algorithms. (2) As the feature dimension d increases, the bound increases.
Intuitively, we need more samples for a linear approximation in a higher-dimensional feature space.

4 Experiments

In this section, we report our simulation results to validate our theoretical findings. We consider the general convex-concave saddle point problem
$$\min_{x}\max_{y}\; L(x,y)=\langle b-Ax,\,y\rangle+\frac{1}{2}\|x\|^2-\frac{1}{2}\|y\|^2, \qquad (4.1)$$
where $A$ is an $n\times n$ matrix and $b$ is an $n\times 1$ vector; here we set $n=10$. We conduct three experiments, setting the step size to $\alpha_t=c=0.001$, $\alpha_t=O(\frac{1}{\sqrt{t}})=\frac{0.015}{\sqrt{t}}$, and $\alpha_t=O(\frac{1}{t})=\frac{0.03}{t}$, respectively. In each experiment, we sample the data $\hat{A}$, $\hat{b}$ in three ways: from one of two Markov chains with different mixing times but the same stationary distribution, or i.i.d. from the stationary distribution directly. We sample $\hat{A}$ and $\hat{b}$ from the Markov chains using the MCMC Metropolis-Hastings algorithm. Specifically, since the mixing time of a Markov chain is positively correlated with the second largest eigenvalue of its transition probability matrix (Levin et al. [2009]), we first construct two transition probability matrices with different second largest eigenvalues (both with 1001 states; the second largest eigenvalues are 0.634 and 0.31, respectively), and then use the Metropolis-Hastings algorithm to construct two Markov chains with the same stationary distribution.

We run the gradient algorithm for the objective in (4.1) based on the simulated data, without and with the experience replay trick. The primal-dual gap error curves are plotted in Figure 1.

We have the following observations. (1) The error curves converge in the Markov setting with all three setups of the step size.
(2) The error curves for data generated from the process with the smaller mixing time converge faster, and the error curves for i.i.d. generated data converge fastest. (3) The error curves for different step sizes converge at different rates. (4) With the experience replay trick, the error curves in the Markov settings converge faster than without it. All these observations are consistent with our theoretical findings.

Figure 1: Experimental Results. Panels: (a) $\alpha_t=c$; (b) $\alpha_t=O(\frac{1}{\sqrt{t}})$; (c) $\alpha_t=O(\frac{1}{t})$; (d)-(f) the same three step sizes with the experience replay trick.

5 Conclusion

In this paper, in the more realistic Markov setting, we proved finite sample bounds for convex-concave saddle point problems, both with high probability and in expectation. Since the GTD algorithms are specific convex-concave saddle point problems, we then obtained finite sample bounds for the GTD algorithms in both the on-policy and off-policy cases. Our finite sample bounds provide important theoretical guarantees for the GTD algorithms, as well as insights for improving them, including how to set up the step size and the need to improve the mixing property of the data, e.g., by experience replay. In the future, we will study finite sample bounds for policy evaluation with nonlinear function approximation.

Acknowledgment

This work was supported by A Foundation for the Author of National Excellent Doctoral Dissertation of RP China (FANEDD 201312) and National Center for Mathematics and Interdisciplinary Sciences of CAS.

References

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.

Shalabh Bhatnagar, Doina Precup, David Silver, Richard S Sutton, Hamid R Maei, and Csaba Szepesvári.
Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204-1212, 2009.

Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research, 15(1):809-883, 2014.

John C Duchi, Alekh Agarwal, Mikael Johansson, and Michael I Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549-1578, 2012.

Richard Durrett. Poisson processes. In Essentials of Stochastic Processes, pages 95-124. Springer, 2016.

Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238-1274, 2013.

Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13(1):3041-3074, 2012.

David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov chains and mixing times. American Mathematical Soc., 2009.

Long-Ji Lin. Reinforcement learning for robots using neural networks. PhD thesis, Fujitsu Laboratories Ltd, 1993.

Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar Mahadevan, and Marek Petrik. Finite-sample analysis of proximal gradient TD algorithms. In UAI, pages 504-513. Citeseer, 2015.

Hamid Reza Maei. Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta, 2011.

Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Arkadi Nemirovski.
Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229-251, 2004.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, pages 387-395, 2014.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.

Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 1609-1616, 2009a.

Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, pages 993-1000, 2009b.

Manel Tagorti and Bruno Scherrer.
On the rate of convergence and error bounds for LSTD(λ). In Proceedings of the 32nd International Conference on Machine Learning, pages 1521-1529, 2015.

John N Tsitsiklis, Benjamin Van Roy, et al. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

H Yu. On convergence of emphatic temporal-difference learning. In Proceedings of The 28th Conference on Learning Theory, pages 1724-1751, 2015.