{"title": "Information-Theoretic Confidence Bounds for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2461, "page_last": 2470, "abstract": "We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. By making a connection between information-theoretic quantities and confidence bounds, we obtain results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff. The resulting cumulative regret bound depends on the agent’s uncertainty over the environment and quantifies the value of prior information. We show applicability of this approach to several environments, including linear bandits, tabular MDPs, and factored MDPs. These examples demonstrate the potential of a general information-theoretic approach for the design and analysis of reinforcement learning algorithms.", "full_text": "Information-Theoretic Confidence Bounds for Reinforcement Learning\n\nXiuyuan Lu\nStanford University\nlxy@stanford.edu\n\nBenjamin Van Roy\nStanford University\nbvr@stanford.edu\n\nAbstract\n\nWe integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. By making a connection between information-theoretic quantities and confidence bounds, we obtain results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff. The resulting cumulative regret bound depends on the agent’s uncertainty over the environment and quantifies the value of prior information. We show applicability of this approach to several environments, including linear bandits, tabular MDPs, and factored MDPs. 
These examples demonstrate the potential of a general information-theoretic approach for the design and analysis of reinforcement learning algorithms.\n\n1 Introduction\n\nWe consider an online decision problem where an agent repeatedly interacts with an uncertain environment and observes outcomes. The agent has a reward function that specifies its preferences over outcomes. The objective of the agent is to sequentially select actions so as to maximize the long-term expected cumulative reward. One classic example is the multi-armed bandit problem, where the agent observes only the reward of the action selected during each period. Another example is episodic reinforcement learning, where the agent selects a policy at the beginning of each episode and observes a trajectory of states and rewards realized over the episode.\nThe agent’s uncertainty about the environment gives rise to a need to trade off between exploration and exploitation. Exploring parts of the environment that are poorly understood could lead to better performance in the future, while exploiting current knowledge of the environment could lead to higher reward in the short term. Thompson sampling [18] and optimistic algorithms [9, 12] are two classes of algorithms that effectively balance the exploration-exploitation tradeoff and achieve near-optimal performance in many stylized online decision problems. However, most analyses of such algorithms focus on establishing performance guarantees that exploit only the parametric structure of the model [1, 2, 3, 5, 7, 9, 10, 11]. There has been little focus on how prior information and the information gained during the learning process affect performance, with few exceptions [15, 16].\nIn our work, we leverage concepts from information theory to quantify the information gain of the agent and address the exploration-exploitation tradeoff for Thompson sampling and optimistic algorithms. 
By connecting information-theoretic quantities with confidence bounds, we are able to relate the agent’s per-period performance with its information gain about the environment during the period. The relation explicitly characterizes how an algorithm balances exploration and exploitation on a single-period basis. The information gain is represented succinctly using mutual information, which abstracts away from the specific parametric form of the model. The level of abstraction offered by information theory suggests that these information-theoretic confidence bounds may generalize to a broad class of problems. Moreover, the resulting cumulative regret bound explicitly depends on the agent’s uncertainty over the environment, which naturally exhibits the value of prior information. We present applications of information-theoretic confidence bounds on three environments: linear bandits, tabular MDPs, and factored MDPs.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOne paper that is closely related to our work is [16], which proposes an upper confidence bound (UCB) algorithm for bandit learning with a Gaussian process prior and derives a regret bound that depends on maximal information gain. Some of their results parallel what we establish in the context of the linear bandit, though our analysis extends to Thompson sampling as well. More importantly, our work generalizes the information-theoretic confidence bound approach to problems with significantly more complicated information structure, such as MDPs.\nAnother closely related paper is [15], which provides an information-theoretic analysis of Thompson sampling for bandit problems with partial feedback. The paper introduces the notion of an information ratio, which relates the one-period regret of Thompson sampling with one-period information gain towards the optimal action. 
Using this concept, the authors are able to derive a series of regret bounds that depend on the information entropy of the optimal action. While this is an elegant result, it is unclear how to extend the approach to MDPs, as information gain about the optimal policy is hard to quantify. In our paper, we consider information gain about the underlying environment rather than the optimal action or policy, which may be seen as a relaxation of their method. The relaxation allows us to leverage confidence bounds and obtain information-theoretic regret bounds for MDPs.\n\n2 Problem formulation\n\nWe consider an online decision problem where an agent repeatedly interacts with an uncertain environment and observes outcomes. All random variables are defined on a probability space (Ω, F, P). The environment is described by an unknown model parameter θ, which governs the outcome distribution. The agent’s uncertainty over the environment is represented as a prior distribution over θ. Thus, θ will be treated as a random variable in the agent’s mind. During each time period ℓ, the agent selects an action A_ℓ ∈ A and observes an outcome Y_{ℓ,A_ℓ} ∈ Y. We assume that the space of outcomes Y is a subset of a finite-dimensional Euclidean space. Conditioned on the model parameter θ, outcomes Y_ℓ are i.i.d. for ℓ = 1, 2, .... The agent has a reward function r : Y → ℝ that encodes its preferences over outcomes. We make a simplifying assumption that rewards are bounded.\nAssumption 1. There exists B ≥ 0 such that sup_{y∈Y} r(y) − inf_{y∈Y} r(y) ≤ B.\nThe objective of the agent is to maximize its long-term expected cumulative reward. Let H_ℓ = (A_1, Y_{1,A_1}, ..., A_{ℓ−1}, Y_{ℓ−1,A_{ℓ−1}}) denote the history up to time ℓ. 
An algorithm π is a sequence of functions {π_ℓ}_{ℓ≥1} that map histories to distributions over actions. For any a ∈ A, let r_θ(a) = E[r(Y_{1,a}) | θ] denote the expected reward of selecting action a under model θ. Let A* ∈ arg max_{a∈A} r_θ(a) denote the optimal action under model θ. We define the Bayesian regret of an algorithm π over L periods as\n\nE[Regret(L, π)] = ∑_{ℓ=1}^{L} E[r_θ(A*) − r_θ(A_ℓ)],\n\nwhere the expectation is taken over the randomness in outcomes, the algorithm π, and the prior distribution over θ.\nNote that episodic reinforcement learning also falls into the above formulation by considering policies as actions and trajectories as observations.\n\n3 Preliminaries\n\n3.1 Basic quantities in information theory\n\nFor two probability measures P and Q such that P is absolutely continuous with respect to Q, the Kullback-Leibler divergence between them is\n\nD(P||Q) = ∫ log(dP/dQ) dP,\n\nwhere dP/dQ is the Radon-Nikodym derivative of P with respect to Q.\nLet P(X) ≡ P(X ∈ ·) denote the probability distribution of a random variable X, and let P(X|Y) ≡ P(X ∈ ·|Y) denote the conditional probability distribution of X conditioned on Y.\nThe mutual information between two random variables X and Y,\n\nI(X; Y) = D(P(X, Y) || P(X) P(Y)),\n\nis the Kullback-Leibler divergence between their joint distribution and the product of their marginal distributions [6]. I(X; Y) is always nonnegative, and I(X; Y) = 0 if and only if X and Y are independent. 
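As a concrete illustration (our own sketch, not part of the paper's development), the mutual information of two discrete random variables can be computed directly from their joint distribution; the joint tables below are arbitrary examples:

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in nats, computed from a joint probability table p[x, y]."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal distribution of X
    py = joint.sum(axis=0, keepdims=True)   # marginal distribution of Y
    mask = joint > 0                        # 0 log 0 = 0 by convention
    return float((joint[mask] * np.log(joint[mask] / (px * py)[mask])).sum())

# A dependent pair: I(X; Y) > 0.
print(mutual_information([[0.4, 0.1], [0.1, 0.4]]))

# An independent pair (joint equals the product of marginals): I(X; Y) = 0.
print(mutual_information(np.outer([0.5, 0.5], [0.3, 0.7])))
```

The second call illustrates the "if and only if" statement above: when the joint table is exactly the product of the marginals, every log ratio vanishes.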
In our analysis, we will use I(θ; A, Y_A) to measure the agent’s information gain about θ from selecting an action and observing an outcome.\nThe conditional mutual information between two random variables X and Y, conditioned on a third random variable Z, is\n\nI(X; Y | Z) = E[D(P(X, Y | Z) || P(X | Z) P(Y | Z))],\n\nwhere the expectation is taken over Z. An elegant property of mutual information is that the mutual information between a random variable X and a collection of random variables Y_1, ..., Y_n can be decomposed into a sum of conditional mutual information terms using the chain rule.\nLemma 1. (Chain rule of mutual information)\n\nI(X; Y_1, Y_2, ..., Y_n) = ∑_{i=1}^{n} I(X; Y_i | Y_1, ..., Y_{i−1}).\n\n3.2 Notation under posterior distributions\n\nWe will use subscript ℓ on P and E to indicate quantities conditioned on H_ℓ, i.e., P_ℓ(·) ≡ P(· | H_ℓ) = P(· | A_1, Y_{1,A_1}, ..., A_{ℓ−1}, Y_{ℓ−1,A_{ℓ−1}}), and similarly for E_ℓ[·]. Let P_ℓ(X) ≡ P_ℓ(X ∈ ·) denote the conditional distribution of a random variable X conditioned on H_ℓ. We define the filtered mutual information\n\nI_ℓ(X; Y) = D(P_ℓ(X, Y) || P_ℓ(X) P_ℓ(Y)),\n\nwhich is a random variable of H_ℓ. Note that by the definition of conditional mutual information,\n\nE[I_ℓ(X; Y)] = I(X; Y | H_ℓ) = I(X; Y | A_1, Y_{1,A_1}, ..., A_{ℓ−1}, Y_{ℓ−1,A_{ℓ−1}}).\n\n3.3 Algorithms\n\nThompson sampling is a simple yet effective heuristic for trading off between exploration and exploitation. Conceptually, it samples each action according to the probability that it is optimal. The algorithm typically operates by starting with a prior distribution over θ. 
During each time period, it samples from the posterior distribution over θ and selects an action that maximizes the expected reward under the sampled model. It then updates the posterior distribution with the observed outcome.\nAnother widely studied class of algorithms that effectively trades off between exploration and exploitation is the class of upper confidence bound (UCB) algorithms, which apply the principle of optimism in the face of uncertainty. For each time period, they typically construct an upper confidence bound for the mean reward of each action based on past observations, and then select the action with the highest upper confidence bound.\n\nAlgorithm 1 Thompson Sampling\n1: Input: prior p\n2: for ℓ = 1, 2, ..., L do\n3:   Sample: θ̂_ℓ ∼ p\n4:   Act: A_ℓ = arg max_{a∈A} r_{θ̂_ℓ}(a)\n5:   Observe: Y_{ℓ,A_ℓ}\n6:   Update: p ← P(θ ∈ · | p, A_ℓ, Y_{ℓ,A_ℓ})\n7: end for\n\nAlgorithm 2 Upper Confidence Bound Algorithm\n1: Input: upper confidence functions {U_ℓ}_{ℓ=1}^{L}\n2: for ℓ = 1, 2, ..., L do\n3:   Act: A_ℓ = arg max_{a∈A} U_ℓ(H_ℓ, a)\n4:   Observe: Y_{ℓ,A_ℓ}\n5:   Update: H_{ℓ+1} ← H_ℓ ∪ {A_ℓ, Y_{ℓ,A_ℓ}}\n6: end for\n\n4 Information-theoretic confidence bounds\n\nInformation-theoretic confidence bounds are defined with the intention of capturing the exploration-exploitation tradeoff for Thompson sampling and optimistic algorithms: if the regret is large, the agent must have learned a lot about the environment. Let ∆_ℓ = r_θ(A*) − r_θ(A_ℓ) denote the regret over the ℓth period. 
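Algorithm 1 above is stated abstractly; as a hedged illustration, here is a minimal instantiation for a Bernoulli bandit with independent Beta(1, 1) priors (the environment, horizon, and prior parameters are our own example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_sampling(true_means, L=2000):
    """Algorithm 1 specialized to Bernoulli rewards with Beta(1, 1) priors.

    Returns the average realized reward over L periods."""
    n_actions = len(true_means)
    alpha = np.ones(n_actions)  # Beta posterior: 1 + observed successes
    beta = np.ones(n_actions)   # Beta posterior: 1 + observed failures
    total_reward = 0.0
    for _ in range(L):
        theta_hat = rng.beta(alpha, beta)        # Sample: one posterior draw per action
        a = int(np.argmax(theta_hat))            # Act: greedy w.r.t. the sampled model
        y = float(rng.random() < true_means[a])  # Observe: a Bernoulli outcome
        alpha[a] += y                            # Update: conjugate posterior
        beta[a] += 1.0 - y
        total_reward += y
    return total_reward / L

# With enough periods, the average reward approaches the best arm's mean (0.7 here).
print(thompson_sampling([0.3, 0.5, 0.7]))
```

The Beta-Bernoulli pair plays the role of the abstract posterior update on line 6 of Algorithm 1; any conjugate or approximate posterior could be substituted.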
We aim to obtain a per-period regret bound of the form\n\nE_ℓ[∆_ℓ] ≤ Γ_ℓ √(I_ℓ(θ; A_ℓ, Y_{ℓ,A_ℓ})) + ε_ℓ,   (1)\n\nwhere I_ℓ(θ; A_ℓ, Y_{ℓ,A_ℓ}) is the filtered mutual information between θ and the action-outcome pair during the ℓth period, Γ_ℓ is the rate at which regret scales with information gain, and we also allow for a small error term ε_ℓ. If (1) is satisfied with reasonable values for Γ_ℓ and ε_ℓ, a large expected regret on the left-hand side would imply that the right-hand side must be large as well, meaning that the agent should gain a lot of information about the environment.\nIf Γ_ℓ can be uniformly bounded over ℓ = 1, ..., L, we obtain an information-theoretic regret bound for any algorithm that satisfies (1).\nProposition 2. If (1) holds with Γ_ℓ ≤ Γ for all ℓ = 1, ..., L, then\n\nE[Regret(L, π)] ≤ Γ √(L · I(θ; A_1, Y_{1,A_1}, ..., A_L, Y_{L,A_L})) + E[∑_{ℓ=1}^{L} ε_ℓ].\n\nThe proof follows from Jensen’s and the Cauchy-Schwarz inequalities, and the chain rule of mutual information. All complete proofs can be found in the appendix.\nThe mutual information term on the right-hand side shows how much the agent expects to learn about θ over L periods. If θ already concentrates around some value, there is not much to learn, and the result would suggest that the expected regret should be small. 
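The chain behind Proposition 2 (our sketch of the argument the authors defer to the appendix) combines the tower property, Jensen's and Cauchy-Schwarz inequalities, and Lemma 1:

```latex
\begin{align*}
\mathbb{E}[\mathrm{Regret}(L,\pi)]
 &= \sum_{\ell=1}^{L} \mathbb{E}\big[\mathbb{E}_\ell[\Delta_\ell]\big]
  \le \sum_{\ell=1}^{L} \mathbb{E}\Big[\Gamma \sqrt{I_\ell(\theta; A_\ell, Y_{\ell,A_\ell})}\Big]
     + \mathbb{E}\sum_{\ell=1}^{L} \epsilon_\ell && \text{by (1)} \\
 &\le \Gamma \sqrt{L \sum_{\ell=1}^{L} \mathbb{E}\big[I_\ell(\theta; A_\ell, Y_{\ell,A_\ell})\big]}
     + \mathbb{E}\sum_{\ell=1}^{L} \epsilon_\ell && \text{Jensen + Cauchy--Schwarz} \\
 &= \Gamma \sqrt{L \, I(\theta; A_1, Y_{1,A_1}, \ldots, A_L, Y_{L,A_L})}
     + \mathbb{E}\sum_{\ell=1}^{L} \epsilon_\ell && \text{chain rule (Lemma 1).}
\end{align*}
```

The last equality uses the fact, noted in Section 3.2, that the expected filtered mutual information equals the conditional mutual information given the history.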
In general, the mutual information term can be bounded by the maximal information gain under any algorithm over L periods, though a more careful analysis specialized to the algorithm of interest might lead to a better bound.\nOne way to obtain a per-period regret bound of the form in Equation (1) is through the construction of information-theoretic confidence sets for mean rewards. For each action a, the width of the confidence set is designed to depend on the information gain about θ from observing outcome Y_{ℓ,a}.\nLemma 3. Under Assumption 1, if\n\nP_ℓ( |r_θ(a) − E_ℓ[r_θ(a)]| ≤ (Γ_ℓ/2) √(I_ℓ(θ; Y_{ℓ,a}))  ∀a ∈ A ) ≥ 1 − δ/2,\n\nthen the per-period regret of Thompson sampling and of UCB with upper confidence function U_ℓ(a) = E_ℓ[r_θ(a)] + (Γ_ℓ/2) √(I_ℓ(θ; Y_{ℓ,a})) satisfies\n\nE_ℓ[∆_ℓ] ≤ Γ_ℓ √(I_ℓ(θ; A_ℓ, Y_{ℓ,A_ℓ})) + δB.\n\nThe proof follows from the probability matching property of Thompson sampling, optimism of UCB, and properties of mutual information.\nWhen the reward function r(y) is Lipschitz continuous, we may alternatively construct confidence sets on outcomes. Further, if the observation noise is additive, we may construct confidence sets on the mean outcomes.\nLemma 4. 
If r(·) is K-Lipschitz continuous with respect to some norm ‖·‖ on Y, and if\n\nP_ℓ( ‖Y_{ℓ,a} − E_ℓ[Y_{ℓ,a}]‖ ≤ (Γ_ℓ/2) √(I_ℓ(θ; Y_{ℓ,a}))  ∀a ∈ A ) ≥ 1 − δ/2,\n\nthen the per-period regret of Thompson sampling satisfies\n\nE_ℓ[∆_ℓ] ≤ K Γ_ℓ √(I_ℓ(θ; A_ℓ, Y_{ℓ,A_ℓ})) + δB.\n\nMoreover, if there exists a function y : Θ × A → Y such that Y_{ℓ,a} − y(θ, a) is independent of θ for all a ∈ A, then it is sufficient to have\n\nP_ℓ( ‖y(θ, a) − E_ℓ[y(θ, a)]‖ ≤ (Γ_ℓ/2) √(I_ℓ(θ; Y_{ℓ,a}))  ∀a ∈ A ) ≥ 1 − δ/2\n\nfor the one-period regret bound to hold.\nAn analogous result holds for a UCB algorithm if outcomes Y_{ℓ,a} are scalar valued and the reward function is nondecreasing. We defer further details to the appendix.\n\n5 Examples\n\nIn this section, we show applications of information-theoretic confidence bounds on linear bandits, tabular MDPs, and factored MDPs. 
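As a preview of the linear bandit example below (Section 5.1), the conjugate update of a Gaussian belief over θ after a linear-Gaussian observation has a closed form; this sketch uses illustrative dimensions and noise level of our own choosing:

```python
import numpy as np

def posterior_update(mu, Sigma, a, y, sigma_w=1.0):
    """Conjugate update of the N(mu, Sigma) belief over theta after
    observing y = theta^T a + w with w ~ N(0, sigma_w^2)."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    a = np.asarray(a, dtype=float)
    Sa = Sigma @ a
    denom = a @ Sa + sigma_w**2              # predictive variance of y
    mu_new = mu + Sa * (y - a @ mu) / denom  # shift mean toward the observation
    Sigma_new = Sigma - np.outer(Sa, Sa) / denom
    return mu_new, Sigma_new

# Repeated observations along one direction shrink uncertainty in that
# direction while leaving orthogonal directions untouched.
mu, Sigma = np.zeros(2), np.eye(2)
for _ in range(10):
    mu, Sigma = posterior_update(mu, Sigma, a=np.array([1.0, 0.0]), y=0.9)
print(np.round(Sigma, 3))  # variance along e1 shrinks; variance along e2 stays 1
```

The quantity `a @ Sigma @ a` computed here is exactly the posterior variance a⊤Σℓa that enters the confidence widths in the linear bandit analysis.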
The per-period regret bounds highlight the single-period exploration-exploitation tradeoff for Thompson sampling and the corresponding optimistic algorithms, while the cumulative regret bounds show how the prior distribution over θ affects regret.\n\n5.1 Linear bandits\n\nLet A ⊂ ℝ^d be a finite action set, and assume that sup_{a∈A} ‖a‖₂ ≤ 1. We assume a N(µ₁, Σ₁) prior over the model parameter θ ∈ ℝ^d. At time ℓ, an action A_ℓ ∈ A is selected, and Y_{ℓ,A_ℓ} = θ⊤A_ℓ + w_ℓ is observed, where the w_ℓ are i.i.d. N(0, σ²_w). Conditioned on H_ℓ, the posterior distribution of θ is again normal. Let µ_ℓ and Σ_ℓ denote the posterior mean and covariance matrix conditioned on H_ℓ. We assume that r(·) is bounded as in Assumption 1, and is nondecreasing and 1-Lipschitz.\nBy Lemma 4, since the noise is additive, it is sufficient to construct confidence sets on Y^a = θ⊤a.\nLemma 5. Under the assumptions stated in Section 5.1,\n\nP_ℓ( |Y^a − E_ℓ Y^a| ≤ (Γ_ℓ/2) √(I_ℓ(θ; Y_{ℓ,a}))  ∀a ∈ A ) ≥ 1 − δ/2\n\nfor Γ_ℓ = 4 √( σ²_{ℓ,max} / log(1 + σ²_{ℓ,max}/σ²_w) · log(4|A|/δ) ), where σ²_{ℓ,max} = max_{a∈A} a⊤Σ_ℓ a.\nThus, it follows from Lemma 4 that the one-period regret of Thompson sampling and of UCB with U_ℓ(a) = E_ℓ Y^a + (Γ_ℓ/2) √(I_ℓ(θ; Y_{ℓ,a})) satisfies\n\nE_ℓ[∆_ℓ] ≤ Γ_ℓ √(I_ℓ(θ; A_ℓ, Y_{ℓ,A_ℓ})) + δB.\n\nBy Proposition 2, we have the following Bayesian regret bound for Thompson sampling and UCB by choosing δ = 1/L.\nProposition 6. Under the assumptions stated in Section 5.1, the Bayesian regret of Thompson sampling and UCB over L periods is\n\nE[Regret(L, π)] ≤ Γ √(L · I(θ; A_1, Y_{1,A_1}, ..., A_L, Y_{L,A_L})) + B,\n\nwhere Γ = 4 √( σ²_{1,max} / log(1 + σ²_{1,max}/σ²_w) · log(4|A|L) ).\nThe following lemma bounds the maximal information gain over L time periods.\nLemma 7. 
For any H_ℓ-adapted action sequence,\n\nI(θ; A_1, Y_{1,A_1}, ..., A_L, Y_{L,A_L}) ≤ (1/2) d log(1 + λ_max L / σ²_w),\n\nwhere λ_max is the largest eigenvalue of Σ₁.\nIt follows from Lemma 7 that the Bayesian regret of Thompson sampling and UCB is bounded by O(√(dL log|A| log L)), which matches the result in [16]. For large action sets, it is possible to apply a discretization argument and obtain a regret bound of order O(d√L log L).\n\n5.2 Tabular MDPs\n\nWe consider the problem of learning to optimize a random finite-horizon MDP M = (S, A, R, P, τ, ρ) in repeated episodes. S is the state space, A is the action space, and we assume that both S and A are finite. Assume that for each s, a, the reward distribution is Bernoulli with mean R(s, a), where R(s, a) follows an independent Beta prior with parameter α^R_{1,s,a} ∈ ℝ². We further assume that for each s, a, the transition distribution P(s, a, ·) follows an independent Dirichlet prior with parameter α^P_{1,s,a} ∈ ℝ^{|S|}. τ is a fixed time horizon, and ρ is the initial state distribution. We make the following simplifying assumption on the prior parameters.\nAssumption 2. For all s ∈ S and a ∈ A, α^R_{1,s,a}(i) ≥ 1 for i ∈ {1, 2}, and α^P_{1,s,a}(j) ≥ 2|S| for all j ∈ {1, ..., |S|}.\nA (deterministic) policy µ is a sequence of functions (µ₀, ..., µ_{τ−1}) that map states to actions. During each episode ℓ, the agent selects a policy µ_ℓ and observes a trajectory\n\nY_{ℓ,µ_ℓ} = (s_{ℓ,0}, a_{ℓ,0}, r_{ℓ,1}, s_{ℓ,1}, ..., s_{ℓ,τ−1}, a_{ℓ,τ−1}, r_{ℓ,τ}).\n\nWe define the value function of a policy µ under an MDP M̃ as\n\nV^{M̃}_{µ,k}(s) := E[ ∑_{t=k}^{τ−1} R(s_t, a_t) | M = M̃, µ, s_k = s ],\n\nand the expected value of a policy µ under an MDP M̃ as\n\nV^{M̃}_µ = E[ V^{M̃}_{µ,0}(s_0) | M = M̃, µ ].\n\nLet µ* denote an optimal policy for the true environment M. The Bayesian regret of an algorithm π over L episodes is\n\nE[Regret(L, π)] = ∑_{ℓ=1}^{L} E[ V^M_{µ*} − V^M_{µ_ℓ} ],\n\nwhere the expectation is taken over the randomness in observations, the algorithm π, and the prior distribution over M.\nWe will construct confidence bounds on value functions in the spirit of Lemma 3. As we will see in the following two lemmas, the structure of MDPs allows us to break down the deviation of value functions and the information gain at the level of state-action pairs. Thus, it suffices to construct confidence sets for the reward and transition functions of individual state-action pairs.\nThe following lemma decomposes the planning error into a sum of on-policy Bellman errors. The proof can be found in Section 5.1 of [13].\nLemma 8. For any MDP M̂ and policy µ,\n\nV^{M̂}_µ − V^M_µ = ∑_{t=0}^{τ−1} E[ (R̂(s_t, a_t) − R(s_t, a_t)) + (P̂(s_t, a_t) − P(s_t, a_t))⊤ V^{M̂}_{µ,t+1} | M̂, M, µ ],\n\nwhere R̂, P̂ are the reward and transition functions under M̂.\nLet H_{ℓt} = (µ_ℓ, s_{ℓ,0}, a_{ℓ,0}, ..., r_{ℓ,t}, s_{ℓ,t}) denote the history of episode ℓ up to time t together with the policy selected for episode ℓ. The chain rule of mutual information gives us the following lemma.\nLemma 9. (information decomposition)\n\nI_ℓ(M; µ_ℓ, Y_{ℓ,µ_ℓ}) = ∑_{t=0}^{τ−1} I_ℓ(M; (s_{ℓ,t}, a_{ℓ,t}, r_{ℓ,t+1}, s_{ℓ,t+1}) | H_{ℓt}).\n\nGuided by Lemmas 8 and 9, we construct information-theoretic confidence sets for reward and transition distributions individually for each state-action pair.\nLemma 10. Let r_{ℓ,t,s,a} ∼ Bernoulli(R(s, a)) | H_ℓ, H_{ℓt} and s′_{ℓ,t,s,a} ∼ P(s, a, ·) | H_ℓ, H_{ℓt}. Then,\n\nP_ℓ( |R(s, a) − E_ℓ R(s, a)| ≤ Γ^R √( min_{t̃,h_t̃} I_ℓ(R; s, a, r_{ℓ,t̃,s,a} | H_{ℓt̃} = h_t̃) ) ) ≥ 1 − δ   (2)\n\nand\n\nP_ℓ( |(P(s, a) − E_ℓ P(s, a))⊤ V^M_{µ*,t+1}| ≤ Γ^P √( min_{t̃,h_t̃} I_ℓ(P; s, a, s′_{ℓ,t̃,s,a} | H_{ℓt̃} = h_t̃) ) ) ≥ 1 − δ   (3)\n\nfor all t and all s, a such that 1⊤α^R_{ℓ,s,a} ≥ τ − 1 and 1⊤α^P_{ℓ,s,a} ≥ τ − 1, respectively, where\n\nΓ^R = √(24 log(2/δ)) and Γ^P = τ √(24 log(2/δ)).\n\nThe terms min_{t̃,h_t̃} I_ℓ(R; s, a, r_{ℓ,t̃,s,a} | H_{ℓt̃} = h_t̃) and min_{t̃,h_t̃} I_ℓ(P; s, a, s′_{ℓ,t̃,s,a} | H_{ℓt̃} = h_t̃) measure the minimum per-step information gain that the agent can obtain about the reward and transition functions of a state-action pair during the ℓth episode, conditioned on H_ℓ, where the minimum is taken over all possible values of the time step 0 ≤ t̃ ≤ τ − 1 and all possible realizations of the trajectory H_{ℓt̃}.\nLemma 10 allows us to construct a high-probability confidence set M_ℓ over M, which is discussed in more detail in the appendix. The corresponding UCB algorithm selects µ_ℓ = arg max_µ max_{M̂∈M_ℓ} V^{M̂}_µ. Combining with Lemmas 8 and 9, we are able to obtain a per-period regret bound for Thompson sampling and UCB in the form of (1), with Γ_ℓ = Õ(τ√τ). This leads to the following proposition.\nProposition 11. The Bayesian regret of Thompson sampling and UCB over L episodes is\n\nE[Regret(L, π)] = Õ( τ √( τ L · I(M; µ_1, Y_{1,µ_1}, ..., µ_L, Y_{L,µ_L}) ) ).\n\nWe show in the appendix that I(M; µ_1, Y_{1,µ_1}, ..., µ_L, Y_{L,µ_L}) = Õ(S²A), though we conjecture that a bound of Õ(SA) may be attainable under appropriate conditions. The conjecture is supported by our simulations, as discussed in the appendix.\n\n5.3 Factored MDPs\n\nFactored MDPs [4] are a class of structured MDPs whose transitions are represented by a dynamic Bayesian network [8] and can typically be encoded in a compact parametric form. Our information-theoretic approach leads to a regret bound for Thompson sampling and UCB that depends on the prior over the model parameter, which typically has dimension exponentially smaller than the state space and action space.\nWe start with some definitions common in the literature [17].\nDefinition 1. Let X = X₁ × ··· × X_d be a factored set. For any subset of indices Z ⊆ {1, 2, ..., d}, we define the scope set X[Z] := ⊗_{i∈Z} X_i. 
Further, for any x ∈ X, define the scope variable x[Z] ∈ X[Z] to be the value of the variables x_i ∈ X_i with indices i ∈ Z. If Z is a singleton, we will write x[i] for x[{i}].\nLet P(X, Y) denote the set of functions that map x ∈ X to a probability distribution on Y.\nDefinition 2. The reward function class R ⊂ P(X, ℝ) is factored over S × A = X = X₁ × ··· × X_d with scopes Z₁, ..., Z_m if and only if, for all R ∈ R, x ∈ X, there exist functions {R_i ∈ P(X[Z_i], ℝ)}_{i=1}^{m} such that\n\nE[r] = ∑_{i=1}^{m} E[r_i],\n\nwhere r ∼ R(x) is equal to ∑_{i=1}^{m} r_i with each r_i ∼ R_i(x[Z_i]) individually observed.\nDefinition 3. The transition function class P ⊂ P(X, S) is factored over S × A = X = X₁ × ··· × X_d and S = S₁ × ··· × S_n with scopes Z₁, ..., Z_n if and only if, for all P ∈ P, x ∈ X, s ∈ S, there exist functions {P_j ∈ P(X[Z_j], S_j)}_{j=1}^{n} such that\n\nP(s|x) = ∏_{j=1}^{n} P_j(s[j] | x[Z_j]).\n\nA factored MDP is an MDP with factored rewards and transitions. Let X = S × A. A factored MDP is fully characterized by\n\nM = ( {X_i}_{i=1}^{d}; {S_j}_{j=1}^{n}; {Z^R_i}_{i=1}^{m}; {R_i}_{i=1}^{m}; {Z^P_j}_{j=1}^{n}; {P_j}_{j=1}^{n}; τ; ρ ),\n\nwhere {Z^R_i}_{i=1}^{m} and {Z^P_j}_{j=1}^{n} are the scopes for the reward and transition functions, which we assume are known to the agent, τ is a fixed time horizon, and ρ is the initial state distribution. We assume that |Z^R_i|, |Z^P_j| ≤ ζ ≪ d and |X_i| ≤ K, so the domain of any reward and transition function has size at most K^ζ. Let D^R = ∑_{i=1}^{m} |X[Z^R_i]| and D^P = ∑_{j=1}^{n} |X[Z^P_j]| be the sums of the cardinalities of the domains, and let D = D^R + D^P ≪ |S||A|. To simplify exposition, let us assume that |S_j| ≤ K for all j = 1, ..., n.\nLet x^R_i denote an element in X[Z^R_i], and let x^P_j denote an element in X[Z^P_j]. We assume that the factorized rewards are Bernoulli. With a slight abuse of notation, we let R_i(x^R_i) denote the mean reward, and we assume that R_i(x^R_i) ∼ Beta(α^R_{1,i,x^R_i}), where α^R_{1,i,x^R_i} ∈ ℝ². We further assume that P_j(x^P_j, ·) ∼ Dirichlet(α^P_{1,j,x^P_j}), where α^P_{1,j,x^P_j} ∈ ℝ^{|S_j|}. Similar to the tabular case, we assume that each component of α^R_{1,i,x^R_i} is at least 1, and each component of α^P_{1,j,x^P_j} is at least 2|S_j|, for all i and j.\nLemmas 8 and 9 still hold. Since the reward and transition functions can be factorized, we will construct information-theoretic confidence bounds on these factor functions.\nLemma 12. Let r_{ℓ,t,i,x^R_i} ∼ Bernoulli(R_i(x^R_i)) | H_ℓ, H_{ℓt} and s′_{ℓ,t,j,x^P_j} ∼ P_j(x^P_j, ·) | H_ℓ, H_{ℓt}. Then,\n\nP_ℓ( |R_i(x^R_i) − E_ℓ R_i(x^R_i)| ≤ Γ^R √( min_{t̃,h_t̃} I_ℓ(R_i; x^R_i, r_{ℓ,t̃,i,x^R_i} | H_{ℓt̃} = h_t̃) ) ) ≥ 1 − δ   (4)\n\nand\n\nP_ℓ( ‖P_j(x^P_j) − E_ℓ P_j(x^P_j)‖₁ ≤ Γ^P_j √( min_{t̃,h_t̃} I_ℓ(P_j; x^P_j, s′_{ℓ,t̃,j,x^P_j} | H_{ℓt̃} = h_t̃) ) ) ≥ 1 − δ   (5)\n\nfor all s, a such that 1⊤α^R_{ℓ,i,x^R_i} ≥ τ − 1 and 1⊤α^P_{ℓ,j,x^P_j} ≥ τ − 1, respectively, with\n\nΓ^R = 2 √(6 log(2/δ)) and Γ^P_j = 4 √(6|S_j| log(2/δ)).\n\nThe lemma allows us to obtain a per-period regret bound in the form of (1) for Thompson sampling and for a UCB algorithm that uses (4) and (5) to construct confidence sets. As shown in the appendix, the scaling factor is Γ_ℓ = Õ(mτ √((m + n)Kτ)), which leads to the following proposition.\nProposition 13. The Bayesian regret of Thompson sampling and UCB over L episodes is\n\nE[Regret(L, π)] = Õ( mτ √( (m + n)Kτ L · I(M; µ_1, Y_{1,µ_1}, ..., µ_L, Y_{L,µ_L}) ) ).\n\nSimilar to the tabular case, we show that I(M; µ_1, Y_{1,µ_1}, ..., µ_L, Y_{L,µ_L}) is Õ(KD) for any algorithm, while we conjecture that this may be Õ(D) under appropriate conditions. The resulting regret bound matches the one in [14] if the conjecture proves to be true. 
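To make Definition 3 concrete, here is a hedged sketch (a toy factorization of our own devising, not an example from the paper) of evaluating a factored transition probability as a product of per-factor distributions:

```python
# Toy factored MDP: state s = (s[0], s[1]) with each factor in {0, 1},
# action a in {0, 1}, and x = (s[0], s[1], a). The scopes are an
# illustrative choice: factor 0 depends on (s[0], a), factor 1 on (s[1], a).
scopes = [(0, 2), (1, 2)]

# P[j][x_scope] is a distribution over the next value of factor j.
P = [
    {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8], (1, 0): [0.7, 0.3], (1, 1): [0.4, 0.6]},
    {(0, 0): [0.5, 0.5], (0, 1): [0.1, 0.9], (1, 0): [0.6, 0.4], (1, 1): [0.3, 0.7]},
]

def transition_prob(x, s_next):
    """P(s_next | x) = prod_j P_j(s_next[j] | x[Z_j]), as in Definition 3."""
    prob = 1.0
    for j, scope in enumerate(scopes):
        x_scope = tuple(x[i] for i in scope)  # restrict x to the scope Z_j
        prob *= P[j][x_scope][s_next[j]]
    return prob

# Because each factor distribution normalizes, probabilities over all
# next states sum to 1.
x = (0, 1, 1)  # s = (0, 1), a = 1
total = sum(transition_prob(x, (u, v)) for u in (0, 1) for v in (0, 1))
print(round(total, 10))  # 1.0
```

Note the compactness the section describes: this model stores 2 × 4 × 2 = 16 numbers, versus 4 states × 2 actions × 4 next states = 32 for the flat table, and the gap grows exponentially with the number of factors.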
Again, our bound in Proposition 13 reveals an explicit dependence on the prior uncertainty, which is not captured by previous work.

6 Conclusion

We introduce information-theoretic confidence bounds for analyzing Thompson sampling and deriving optimistic algorithms. We show that the information-theoretic approach allows us to formally quantify the agent's information gain about the unknown environment, and to explicitly characterize the exploration-exploitation tradeoff for linear bandits, tabular MDPs, and factored MDPs. This work opens up multiple directions for future research. It would be interesting to extend information-theoretic confidence bounds to a broader range of problems and to see whether a general information-theoretic framework is plausible for addressing online decision problems. It would also be interesting to investigate whether an information-theoretic perspective could lead to tighter regret bounds for Thompson sampling and optimistic algorithms. One may also consider the practical implications of these confidence bounds and how they can be used to design better reinforcement learning algorithms.

Acknowledgments

This work was generously supported by the Charles and Katherine Lin Graduate Fellowship, the Dantzig-Lieberman Operations Research Fellowship, and a research grant from Boeing. We would also like to thank Ayfer Özgür and Daniel Russo for helpful discussions.

References

[1] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), 2012.

[2] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pages 127–135, 2013.

[3] P. Auer, N. Cesa-Bianchi, and P. Fischer.
Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

[4] C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1):49–107, 2000.

[5] O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516–1541, 2013.

[6] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[7] V. Dani, T.P. Hayes, and S.M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 355–366, 2008.

[8] Z. Ghahramani. Learning dynamic Bayesian networks, pages 168–197. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.

[9] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[10] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: an asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, 2012.

[11] E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[12] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[13] I. Osband, D. Russo, and B. Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013.

[14] I. Osband and B. Van Roy. Near-optimal reinforcement learning in factored MDPs. In Z. Ghahramani, M. Welling, C. Cortes, N. D.
Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 604–612. Curran Associates, Inc., 2014.

[15] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(68):1–30, 2016.

[16] N. Srinivas, A. Krause, S.M. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, May 2012.

[17] I. Szita and A. Lőrincz. Optimistic initialization and greediness lead to polynomial time learning in factored MDPs. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1001–1008. ACM, 2009.

[18] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.