{"title": "Polynomial Cost of Adaptation for X-Armed Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 1029, "page_last": 1038, "abstract": "In the context of stochastic continuum-armed bandits, we present an algorithm that adapts to the unknown smoothness of the objective function. We exhibit and compute a polynomial cost of adaptation to the H\u00f6lder regularity for regret minimization. To do this, we first reconsider the recent lower bound of Locatelli and Carpentier, 2018, and define and characterize admissible rate functions. Our new algorithm matches any of these minimal rate functions. We provide a finite-time analysis and a thorough discussion about asymptotic optimality.", "full_text": "Polynomial Cost of Adaptation for X -Armed Bandits\n\nH\u00e9di Hadiji\n\nLaboratoire de Math\u00e9matiques d\u2019Orsay\n\nUniversit\u00e9 Paris-Sud, Orsay, France\nhedi.hadiji@math.u-psud.fr\n\nAbstract\n\nIn the context of stochastic continuum-armed bandits, we present an algorithm\nthat adapts to the unknown smoothness of the objective function. We exhibit\nand compute a polynomial cost of adaptation to the H\u00f6lder regularity for regret\nminimization. To do this, we \ufb01rst reconsider the recent lower bound of Locatelli\nand Carpentier [21], and de\ufb01ne and characterize admissible rate functions. Our new\nalgorithm matches any of these minimal rate functions. We provide a \ufb01nite-time\nanalysis and a thorough discussion about asymptotic optimality.\n\n1 Introduction\n\nMulti-armed bandits are a well-known sequential learning problem. When the number of available\ndecisions is large, some assumptions on the environment have to be made. In a vast line of work\n(see the literature discussion in Section 1.1), these assumptions show up as nonparametric regularity\nconditions on the mean-payoff function. 
If this function is Hölder continuous with constant L and exponent α, and if the values of L and α are given to the player, then natural strategies can ensure that the regret is upper bounded by

L^{1/(2α+1)} T^{(α+1)/(2α+1)} .

(1)

Of course, assuming that the player knows α and L is often not realistic. This motivates adaptive methods, which are agnostic with respect to the true regularity of the mean-payoff function. Unfortunately, Locatelli and Carpentier [21] recently showed that full adaptation is impossible, and that no algorithm can enjoy the same minimax guarantees as when the regularity is given to the player. We persevere and address the question:

What can the player achieve when the true regularity is completely unknown?

A polynomial cost of adaptation In statistics, minimax adaptation for nonparametric function estimation is a deep and active research domain. In many contexts, sharp adaptation is possible; often, an additional logarithmic factor in the error has to be paid when the regularity is unknown: this is known as the cost of adaptation. See e.g., Lepskii [20], Birgé and Massart [4], Massart [22] for adaptive methods, and Cai [9] for a detailed survey of the topic. Under some more exotic assumptions (see e.g., Example 3 of Cai and Low [10]), adapting is significantly harder: there may be a polynomial cost of adaptation.
In this paper, we show that in the sequential setting of multi-armed bandits, the necessary exploration forces a similar phenomenon, and we exhibit this polynomial cost of adaptation. 
To do so, we revisit the lower bounds of Locatelli and Carpentier [21], and design a new algorithm that matches these lower bounds.
As a representative example of our results, our algorithm can achieve, without the knowledge of α and L, an unimprovable (up to logarithmic factors) regret bound of order

L^{1/(1+α)} T^{(α+2)/(2α+2)} .

(2)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Related work

Continuum-armed bandits Continuum-armed bandits, with nonparametric regularity assumptions, were introduced by Agrawal [1]. Kleinberg [16] established the minimax rates in the Hölder setting and introduced the CAB1 algorithm. Auer et al. [3] studied the problem with additional regularity assumptions under which the minimax rates are improved. Via different roads, Bubeck et al. [7] and Kleinberg et al. [17] explored further generalizations of these types of regularity, namely the zooming dimension and the near-optimality dimension. Bull [8] exhibited an algorithm that essentially adapts to some cases when the near-optimality dimension is zero.
In all these articles, the mean-payoff function needs to satisfy simultaneously two sets of regularity conditions. The first type is a usual Hölder condition, which ensures that the function does not vary too much around (one of) its maxima. The second type is a "margin condition" that lower bounds the number of very suboptimal arms; in the literature these are defined in many technically different ways. Adapting to the margin conditions is often possible when the Hölder regularity is known. However, all these algorithms require some prior knowledge about the Hölder regularity.
In this paper, we focus on the problem of adapting to Hölder regularity. 
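For a concrete feel of the gap between rates (1) and (2), the exponents of T can be compared numerically. The following is an illustrative computation of ours, not part of the paper:

```python
from fractions import Fraction

def nonadaptive_exponent(alpha):
    # Exponent of T in the nonadaptive minimax rate (1), for a known exponent alpha.
    return Fraction(alpha + 1, 2 * alpha + 1)

def adaptive_exponent(alpha):
    # Exponent of T in the adaptive rate (2), achieved without knowing (L, alpha).
    return Fraction(alpha + 2, 2 * alpha + 2)

# The polynomial cost of adaptation is the gap between the two exponents,
# which simplifies to alpha / ((2*alpha + 2) * (2*alpha + 1)).
for alpha in (1, 2, 3):
    gap = adaptive_exponent(alpha) - nonadaptive_exponent(alpha)
    assert gap == Fraction(alpha, (2 * alpha + 2) * (2 * alpha + 1))
    print(f"alpha={alpha}: {nonadaptive_exponent(alpha)} vs {adaptive_exponent(alpha)} (gap {gap})")
```

For α = 1 the exponents are 2/3 and 3/4, so the price of not knowing the regularity is a polynomial factor of order T^{1/12}; the gap vanishes as α → ∞, where both exponents tend to 1/2.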
Accordingly, we call adaptive the algorithms that assume no knowledge of the Hölder exponent nor of the Lipschitz constant.

Adaptation for cumulative regret Bubeck et al. [5] introduced the problem of adaptation, and adapted to the Lipschitz constant under extra requirements. An important step was made in Locatelli and Carpentier [21], where it is shown that adaptation at the classical minimax rates is impossible. In the same paper, the authors exhibited some conditions under which full adaptation is achievable, e.g., with knowledge of the value of the maximum, or when the near-optimality dimension is zero.

Other settings For simple regret, the objections against adaptation do not hold, as the objective does not penalize exploration. Adaptation up to polylog factors is done with various (meta-)algorithms. Locatelli and Carpentier [21] sketch out an aggregation approach inspired by Lepski's method, while Valko et al. [24], Grill et al. [14], Shang et al. [23] describe cross-validation methods thanks to which they adapt to the near-optimality dimension with unknown smoothness. As it turns out, this last approach yields clean results with our smoothness assumptions; we write the details in Appendix E. Recently, Krishnamurthy et al. [18] studied continuum-armed contextual bandits and used a sophisticated aggregation scheme to derive an algorithm that adapts to the Lipschitz constant when L > 1.

1.2 Contributions and outline

In this paper, we fully compute the cost of adaptation for bandits with Hölder regularity. In Section 2 we discuss the adaptive (and nonadaptive) lower bounds. We take an asymptotic stance in order to precisely define the objective of adaptation. Doing so, we uncover a family of noncomparable lower bounds for adaptive algorithms (Theorem 1), and define the corresponding notion of optimality: admissibility.
Section 3 contains our main contribution: an admissible adaptive algorithm. 
We first recall the CAB1 algorithm, which is nonadaptive minimax, and use it as a building block for our new algorithm (Subsection 3.1). This algorithm works in a regime-based fashion. Between successive regimes of doubling lengths, we reset the algorithm and use a new discretization with fewer arms. In order to carry information between the different stages, we use CAB1 in a clever way: besides partitioning the arm space, we add summaries of previous regimes by allowing the algorithm to play according to the empirical distributions of past plays. This is formally described in Subsection 3.2.
A salient difference with all previous approaches is that we zoom out by using fewer and fewer arms. To our knowledge, this is unique, as all other algorithms for bandits zoom in in a way that crucially depends on the regularity parameters. Another important feature of our analysis is that we adapt both to the Hölder exponent α and to the Lipschitz constant L. On a technical level, this is thanks to the fact that we do not explicitly choose a grid of regularity parameters, which means that we implicitly handle all values (L, α) simultaneously.
We first give a regret bound in the known horizon case (Subsection 3.2), then we provide an anytime version and we show that they match the lower bounds of adaptation (Subsection 3.3). Finally, Section 4 provides the proof of our main regret bound.

2 Setup, preliminary discussion

2.1 Notation and known results

Let us reintroduce briefly the standard bandit terminology. We consider the arm space X = [0, 1]. The environment sets a reward function f : X → [0, 1]. At each time step t, the player chooses an arm X_t ∈ X, and the environment then displays a reward Y_t such that E[Y_t | X_t] = f(X_t), independently from the past. 
We assume that the variables Y_t − f(X_t) are (1/4)-subgaussian conditionally on X_t; this is satisfied if the payoffs are bounded in [0, 1], by Hoeffding's lemma.
The objective of the player is to find a strategy that minimizes her expected cumulative (pseudo-)regret. If M(f) denotes the maximum value of f, the regret at time T is defined as

R_T = T M(f) − E[ Σ_{t=1}^T Y_t ] = T M(f) − E[ Σ_{t=1}^T f(X_t) ] .

(3)

In this paper, we assume that the function f satisfies a Hölder assumption around one of its maxima:
Definition 1. For α > 0 and L > 0, we denote by H(L, α) the set of functions that satisfy

∃ x* ∈ [0, 1] s.t. f(x*) = M(f) and ∀ x ∈ [0, 1], |f(x*) − f(x)| ≤ L |x* − x|^α .

(4)

We are interested in minimax rates of regret when the mean-payoff function f belongs to these Hölder-type classes, i.e., the quantity inf_{algorithms} sup_{f∈H(L,α)} R_T.

MOSS Throughout this paper, we exploit discretization arguments and use a minimax optimal algorithm for finite-armed bandits: MOSS, from Audibert and Bubeck [2]. When run for T rounds on a K-armed bandit problem with (1/4)-subgaussian noise, and when T ≥ K, its regret is upper-bounded by 18 √(KT) (the improved constant is from Garivier et al. [12]).

Non-adaptive minimax rates When the regularity is given to the player, for any α, L and T:

0.001 L^{1/(2α+1)} T^{(α+1)/(2α+1)} ≤ inf_{algorithms} sup_{f∈H(L,α)} R_T ≤ 28 L^{1/(2α+1)} T^{(α+1)/(2α+1)} .

(5)

This is well-known since Kleinberg [16]. For completeness, we recall how to derive the upper bound in Section 3.1, and the lower bound in Section 2.2.

2.2 Lower bounds: adaptation at usual rates is not possible

Locatelli and Carpentier [21] prove a version of the following theorem; see our reshuffled and slightly improved proof in Appendix F.
Theorem (Variation on Th. 3 from [21]). Let B > 0 be a positive number. 
Let α, β > 0 and L, ℓ > 0 be regularity parameters that satisfy α ≤ β and L ≥ ℓ. Assume moreover that 2^{3−12α} B^{−1} ≤ L ≤ ℓ^{1+α} T^{α/2} 2^{−(1+α)(8β−2)}. If an algorithm is such that sup_{f∈H(ℓ,β)} R_T ≤ B, then the regret of this algorithm is lower bounded on H(L, α):

sup_{f∈H(L,α)} R_T ≥ 2^{−10} T L^{1/(α+1)} B^{−α/(α+1)} .

(6)

Remark (Bibliographical note). Locatelli and Carpentier [21] consider a more general setting where additional margin conditions are exploited. In our setting, we slightly improve their result by dealing with the dependence on the Lipschitz constant, and by removing a requirement on B. In a different context, Krishnamurthy et al. [18] show a variation of this bound where the Lipschitz constant is considered, but only in the case where α = β = 1, for ℓ = 1 and L ≥ 1.
As explained in Locatelli and Carpentier [21], this forbids adaptation at the usual minimax rates over two regularity classes; we recall how in the paragraph that follows Theorem 1. However this is not the end of the story, as one naturally wonders what is the best the player can do.
To further investigate this question, we discuss it asymptotically by considering the rates at which the minimax regret goes to infinity, therefore focusing on the dependence on T. Our main results are completely nonasymptotic, yet we feel the asymptotic analysis of optimality is clearer.

Definition 2. Let θ : [0, +∞] → [0, 1] denote a nonincreasing function. 
We say an algorithm achieves adaptive rates θ if

∀ ε > 0 , ∀ α, L > 0 , limsup_{T→∞} [ sup_{f∈H(L,α)} R_T ] / T^{θ(α)+ε} < +∞ .

We include the ε in the definition in order to neglect the potential logarithmic factors.
As rate functions are not always comparable for the pointwise order, the good notion of optimality is the standard statistical notion of admissibility (akin to "Pareto optimality" for game-theorists).
Definition 3. A rate function is said to be admissible if it is achieved by some algorithm, and if no other algorithm achieves strictly smaller rates for the pointwise order. An algorithm is admissible if it achieves an admissible rate function.
We recall that a function θ' is strictly smaller than θ for the pointwise order if θ'(α) ≤ θ(α) for all α and θ'(α₀) < θ(α₀) for at least one value of α₀.
It turns out we can fully characterize the admissible rate functions by inspecting the lower bounds (6).
Theorem 1. The admissible rate functions are exactly the family

θ_m : α ↦ max( m , 1 − m α/(α+1) ) , m ∈ [1/2, 1] .

(7)

This theorem contains two assertions. The lower bound side states that no smaller rate function may be achieved by any algorithm. This side is derived from an asymptotic rewording of lower bound (6), see Proposition 1 stated below (proofs are in Appendix A). The second statement is that the θ_m's are indeed achieved by an algorithm, which is the subject of Section 3.2.
Figure 1 illustrates how these admissible rates compare to each other, and to the usual minimax rates.

Figure 1: The lower bounds on adaptive rates: plots of the admissible rate functions α ↦ θ_m(α). 
If an algorithm has regret of order O(T^{θ(α)}), then θ is everywhere above one of these curves.

In particular, we see that reaching the nonadaptive minimax rates for multiple values of α is impossible. Moreover, at m = (β+1)/(2β+1), we have θ_m(β) = (β+1)/(2β+1), which is the usual minimax rate (1) when β is known. This yields an alternative parameterization of the family θ_m: one may choose to parameterize the functions either by their value at infinity m ∈ [1/2, 1], or by the only point β ∈ [0, +∞] at which they coincide with the usual minimax rates function (1).
Proposition 1. Assume an algorithm achieves adaptive rates θ. Then θ satisfies the functional inequality

∀ β > 0 , ∀ α ≤ β , θ(α) ≥ 1 − θ(β) α/(α+1) .

(8)

2.3 Yet can we adapt in some way?

We have described in (7) the minimal rate functions that are compatible with the lower bounds of adaptation: no algorithm can enjoy uniformly better rates. Of course, at this point, the next natural question is whether any of these adaptive rate functions may indeed be reached by an algorithm.
All previous algorithms for continuum-armed bandits require the regularity as an input in some way (see the literature discussion in Section 1.1). Such algorithms are flawed: if the true regularity is underestimated then we only recover the guarantees that correspond to the smaller regularity, which is often far worse than the lower bounds of Theorem 1. More dramatically, if the true regularity is overestimated, then, a priori, no guarantees hold at all.
We prove that all these rate functions may be achieved by a new algorithm. More precisely, if the player wishes to reach one of the lower bounds θ_m, she may select a value of the input accordingly and match the chosen θ_m. 
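The family (7) and the functional inequality (8) are easy to check numerically. The following is a small illustrative sketch of ours, not from the paper:

```python
def theta(m, alpha):
    # Admissible rate function from (7): theta_m(alpha) = max(m, 1 - m*alpha/(alpha+1)).
    return max(m, 1 - m * alpha / (alpha + 1))

# theta_m coincides with the usual minimax exponent (beta+1)/(2*beta+1)
# exactly at the beta for which m = (beta+1)/(2*beta+1).
beta = 2.0
m_star = (beta + 1) / (2 * beta + 1)
assert abs(theta(m_star, beta) - (beta + 1) / (2 * beta + 1)) < 1e-12

# Every theta_m satisfies the necessary condition (8):
# theta(alpha) >= 1 - theta(b) * alpha / (alpha + 1) for all alpha <= b.
for m in (0.5, 0.75, 1.0):
    for alpha in (0.25, 0.5, 1.0, 2.0):
        for b in (alpha, 2 * alpha, 100 * alpha):
            assert theta(m, alpha) >= 1 - theta(m, b) * alpha / (alpha + 1) - 1e-12
```

The check of (8) goes through because θ_m(b) ≥ m for every b, so the right-hand side is at most 1 − m α/(α+1), which is one of the two branches of the maximum defining θ_m(α).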
This is our main contribution and is described in the next section.

3 An admissible adaptive algorithm and its analysis

We discuss in Section 3.1 how the well-known CAB1 algorithm can be generalized for our purpose. In Section 3.2 we describe our algorithm and the main upper bound on its regret. Section 3.3 is devoted to the anytime version of the algorithm and to a discussion on optimality.

3.1 An abstract version of CAB1 as a building block towards adaptation

We describe a generalization of the CAB1 algorithm from Kleinberg [16], where we include arbitrary measures in the discretization. Although this extension is straightforward, we detail it as we will use this algorithm repeatedly further in this paper. In the original CAB1, the space of arms is discretized into a partition of K subsets, and an algorithm for finite-armed bandits plays on the K midpoints of the sets. Auer et al. [3] replace the midpoints by a random point uniformly chosen in the subset.
We introduce a generic version of this algorithm we call CAB1.1. We consider K arbitrary probability distributions over X, which we denote by (π_i)_{1≤i≤K}. Denote also by π(f) the expectation of f(X) when X ∼ π. At each time step, the decision maker chooses one distribution, π_{I_t}, and plays an arm picked according to that distribution. By the tower rule, she receives a reward such that

E[Y_t | I_t] = E[f(X_t) | I_t] = π_{I_t}(f) .

As the player uses a finite-arm algorithm A to select I_t, the regret she suffers can be decomposed as the sum of two terms (denoting by R̃_T the expected regret of the finite-armed algorithm):

R_T = T ( M(f) − max_{i=1,...,K} π_i(f) ) + R̃_T( (π_i(f))_{1≤i≤K} ; A ) .

(9)

This identity is central to the construction of our algorithm. Using terminology from Auer et al. [3], the first term measures an approximation error of the maximum of f, and the other the actual cost of learning in the approximate problem. 
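The trade-off behind (9) can be made concrete for the uniform discretization Disc(K): finer grids shrink the approximation term, while the √(KT) cost of learning grows. The payoff function below is a hypothetical example of ours, not from the paper:

```python
import math

def f(x):
    # Hypothetical mean-payoff function: Lipschitz (alpha = 1, L = 2), maximum 1 at x = 1/3.
    return 1.0 - min(2.0 * abs(x - 1.0 / 3.0), 1.0)

def approx_term(K, T, grid=2000):
    # T * (M(f) - max_i pi_i(f)) for the uniform discretization Disc(K),
    # each pi_i(f) being approximated by averaging f over grid midpoints of the cell.
    best = 0.0
    for i in range(1, K + 1):
        lo = (i - 1) / K
        cell_mean = sum(f(lo + (j + 0.5) / (grid * K)) for j in range(grid)) / grid
        best = max(best, cell_mean)
    return T * (1.0 - best)

T = 10_000
for K in (4, 16, 64):
    # Approximation error shrinks with K; the MOSS learning cost 18*sqrt(K*T) grows.
    print(K, round(approx_term(K, T), 1), round(18 * math.sqrt(K * T), 1))
```

Balancing the two terms over K is exactly how the nonadaptive choice K* of Proposition 2 below is obtained when (L, α) are known.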
Parameters are chosen to balance these two sources of error.

Algorithm 1 CAB1.1 (Continuum-Armed Bandit, adapted from Kleinberg [16])
1: Input: T the time horizon, K probability measures over X denoted by π_1, ..., π_K, discrete K-armed bandit algorithm A
2: for t = 1, 2, ..., T do
3:   Define I_t the arm in {1, ..., K} recommended by A
4:   Play X_t ∈ X drawn according to π_{I_t}, and receive Y_t such that E[Y_t | X_t] = f(X_t)
5:   Give Y_t as input to A corresponding to I_t
6: end for

The canonical example is that for which the space of arms is cut into a partition. Denote by Disc(K) the family of the uniform measures over the intervals [(i−1)/K, i/K] for 1 ≤ i ≤ K. We state this result (and prove it in Appendix A.1) to recall the non-adaptive minimax bound (1).
Proposition 2. Let α > 0 and L ≥ 1/√T be regularity parameters, and define the number of discrete arms K* = min( ⌈L^{2/(2α+1)} T^{1/(2α+1)}⌉ , T ). Algorithm CAB1.1 run with the uniform discretization Disc(K*) and A = MOSS enjoys the bound

sup_{f∈H(L,α)} R_T ≤ 28 L^{1/(2α+1)} T^{(α+1)/(2α+1)} .

3.2 Memorize past plays, Discretize the arm space, and Zoom Out: the MeDZO algorithm

To achieve adaptation, we combine two tricks: going from fine to coarser discretizations while keeping a summary of past plays in memory.
Our algorithm works in successive regimes. At each time regime i, we reset the algorithm and start over a new regime of length double the previous one (T_i = 2^{p+i}), and with fewer discrete arms (K_i = 2^{p+2−i}). While doing this, we keep in memory the previous plays: in addition to the uniform distributions over the subsets of partitions, we include the empirical measures ν̂_j of the actions played in the past regimes, for j < i.

Algorithm 2 MeDZO (Memorize, Discretize, Zoom Out)
1: Input: parameter B, time horizon T
2: Set: p = ⌈log₂ B⌉, K_i = 2^{p+2−i} and T_i = 2^{p+i}
3: for i = 1, ..., p do
4:   For T_i rounds, run algorithm CAB1.1 with the uniform discretization in K_i pieces and the empirical measures of the previous plays ν̂_j for j < i; use MOSS as the discrete algorithm.^a
5:   Set: ν̂_i the empirical measure of the plays during regime i.
6: end for
^a No ν̂ is used for i = 0.

Appendix C provides a figure illustrating the behavior of the algorithm.
Our construction is based on the following remark. Consider the approximation error suffered during regime i. Denoting by Π_i the set of measures given to the player during regime i, that is, the uniform measures over the regular K_i-partition and the empirical measures of arms played during the regimes j < i, the approximation error is bounded as follows: for any j < i,

T_i ( M(f) − E[ max_{π∈Π_i} π(f) ] ) ≤ T_i ( M(f) − E[ν̂_j(f)] ) = (T_i/T_j) Σ_{t∈Regime j} ( M(f) − E[f(X_t)] )

(10)

and this bound is proportional to the regret suffered during regime j. This means that even though we zoom out by using fewer arms, we can make sure that the average approximation error in regime i is less than the regret previously suffered. Moreover, the first discretizations are fine enough to ensure a small regret in the first regimes, thanks to the Hölder property. This argument is formalized in the proof (Lemma 1), and shows that MeDZO maintains a balance between approximation and cost of learning that yields optimal regret.
A surprising fact here is that we go from finer to coarser discretizations during the different phases. Thus, paradoxically, the algorithm zooms out as time passes. Note also that although this regime-based approach is reminiscent of the doubling trick, there is an essential difference in that information is carried between the regimes via the distribution of the previous plays.
We first state our central result, a generic bound that holds for any input parameter B. 
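As an illustration, the regime structure above can be sketched in a few lines of code. This is a minimal, simplified rendition of ours: a basic UCB1 routine stands in for MOSS, and all names are ours, not the paper's:

```python
import math
import random

def ucb1(samplers, horizon, reward):
    """Finite-armed routine (UCB1 stand-in for MOSS); samplers[i]() draws an arm in [0, 1]."""
    n = len(samplers)
    counts, sums, played = [0] * n, [0.0] * n, []
    for t in range(horizon):
        if t < n:
            i = t  # play each discrete action once first
        else:
            i = max(range(n), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t + 1) / counts[j]))
        x = samplers[i]()
        sums[i] += reward(x)
        counts[i] += 1
        played.append(x)
    return played

def medzo(B, T, reward):
    """MeDZO sketch: regime i lasts T_i = 2**(p+i) rounds and uses K_i = 2**(p+2-i)
    uniform cells, plus replays of the empirical measures of all past regimes."""
    p = math.ceil(math.log2(B))
    past, all_plays = [], []
    for i in range(1, p + 1):
        K_i, T_i = 2 ** (p + 2 - i), 2 ** (p + i)
        samplers = [lambda j=j, K=K_i: random.uniform((j - 1) / K, j / K)
                    for j in range(1, K_i + 1)]
        samplers += [lambda h=h: random.choice(h) for h in past]  # summaries of past regimes
        horizon = min(T_i, T - len(all_plays))
        if horizon == 0:
            break
        plays = ucb1(samplers, horizon, reward)
        past.append(plays)
        all_plays.extend(plays)
    return all_plays
```

For instance, with B = √T as in Corollary 3 below one may call `medzo(math.sqrt(4096), 4096, lambda x: 1 - abs(x - 0.5) + random.gauss(0, 0.1))`; the total regime length Σ_i T_i = 2^{p+1}(2^p − 1) ≥ B² exceeds T whenever B ≥ √T, so the loop always covers the horizon.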
We discuss the optimality of these adaptive bounds in the next subsection.
Theorem 2. Algorithm 2 run with the knowledge of T and input B ≥ √T enjoys the following guarantee: for all α > 0 and L > 0,

sup_{f∈H(L,α)} R_T ≤ 412 (log₂ B)^{3/2} max( B , T L^{1/(α+1)} B^{−α/(α+1)} ) .

(11)

We provide some illustrative numerical experiments in Appendix D, comparing the results of MeDZO with other non-adaptive algorithms.

3.3 Discussion: anytime version and admissibility

Anytime version via the doubling trick The dependence of Algorithm 2 on the parameter B makes it horizon-dependent. We use the doubling trick to build an anytime version of the algorithm. At each new doubling-trick regime, we input a value of B that depends on the length of the k-th regime. If it is of length T^(k), one typically thinks of B_k = (T^(k))^m for some exponent m. In that case, we get the following bound (see the proof and description of the algorithm in Appendix B).

Corollary 1 (Doubling trick version). Choose m ∈ [1/2, 1]. The doubling-trick version of MeDZO, run with m as sole input (and without the knowledge of T), ensures that for all regularity parameters α > 0 and L > 0 and for T ≥ 1,

sup_{f∈H(L,α)} R_T ≤ 4000 (log₂ T^m)^{3/2} max( T^m , T L^{1/(α+1)} (T^m)^{−α/(α+1)} ) = O( (log T)^{3/2} T^{θ_m(α)} ) .

Admissibility of Algorithm 2 The next result is a direct consequence of Corollary 1. This echoes the discussion following Theorem 1, and shows that for any input parameter m, the anytime version of MeDZO cannot be improved uniformly for all α.
Corollary 2. For any m ∈ [1/2, 1], the doubling trick version of MeDZO (see App. B) with input m achieves rate function θ_m, and is therefore admissible.

3.4 About the remaining parameter: the B = √T case

Tuning the value of B amounts to selecting one of the minimal curves in Figure 1. 
Therefore this parameter is a feature of the algorithm, as it allows the player to choose between the possible optimal behaviors. The tuning of this parameter is an unavoidable choice for the player to make.
The next example illustrates well the performance of MeDZO, as it is easily comparable to the usual minimax bounds. Looking at Figure 1, this choice corresponds to m = 1/2, i.e., the only choice of parameter that reaches the usual minimax rates as α → ∞. In other words, if the player wishes to ensure that her regret on very regular functions is of order √T, then she has to pay a polynomial cost of adaptation for not knowing α, and that price is exactly the ratio between (1) and (2).
Corollary 3. Set a horizon T and run Algorithm 2 with B = √T. Then for α > 0 and L ≥ 1/√T,

sup_{f∈H(L,α)} R_T ≤ 146 (log₂ T)^{3/2} L^{1/(α+1)} T^{(α+2)/(2α+2)} .

(12)

This is straightforward from Theorem 2, since the inequality B = √T ≤ T L^{1/(α+1)} (√T)^{−α/(α+1)} holds whenever L ≥ 1/√T. An anytime version of this result can be obtained from Corollary 1.

4 Proof of Theorem 2

Full proof of Theorem 2. Let F_t = σ(I_1, X_1, Y_1, ..., I_t, X_t, Y_t) be the σ-algebra corresponding to the information available at the end of round t. Define also the transition times T̄_i = Σ_{j=1}^i T_j, with the convention T̄_0 = 0. Let us first verify that T is smaller than the total length of the regimes. By definition of p, we have B ≤ 2^p < 2B. Thus T̄_p = 2^{p+1}(2^p − 1) ≥ 2B(B − 1) ≥ B² ≥ T, and the algorithm is indeed well-defined up to time T.
Consider the regret suffered during the i-th regime, R_{T̄_{i−1},T̄_i} := T_i M(f) − Σ_{t=T̄_{i−1}+1}^{T̄_i} E[f(X_t)]. We bound this quantity thanks to the decomposition (9), by first conditioning on the past up to time T̄_{i−1}. Since there are K_i + i discrete actions, the regret bound on MOSS ensures that

E[ ⋯ {ν(f) | ν ∈ Disc(K_i)} ∪ {ν̂_ℓ(f) | ℓ = 0, . . . , i−1} ⋯ . 
Notice that this bound\n(b\u232bj)j* B which yields, using\nB 6 2p 6 2B, that 4LB > 2\u21b5+1B\u21b5B and then L > B\u21b5/2. In that case, L1/(\u21b5+1)B\u21b5/(\u21b5+1) > 1\nand the total regret bound (11) is true as it is weaker than the trivial bound RT 6 T .\nHence we may assume that i0 > 1 is well de\ufb01ned. By comparing i to i0, we now show the inequality\n\nB +\n\ni0Xi=1\n\npXi=i0+1\n\n2(1 + 72pp)Ti L1/(\u21b5+1)B\u21b5/(\u21b5+1) .\n\nTiM (f ) E[M ?\ni ] 6\n\npXi=1\nFor all i 6 i0 the approximation error is smaller than the \ufb01rst argument of the minimum in (16), and\ni ] 6 B . In particular, this together with\nthis term is smaller than B. Therefore TiM (f ) E[M ?\n(14) and (15) implies that the total regret suffered during regime i0 is RTi01,Ti0 6 (1 + 72pp)B.\nFor the later time regimes i0 < i 6 p, we use the fact that preceding empirical measures were kept as\ndiscrete actions, and in particular the one of the i0-th regime: (16) instantiated with j = i0 yields\n\n(18)\n\nTiM (f ) E[M ?\n\ni ] 6 Ti\n\nRTi01,Ti0\n\nTi0\n\n61 + 72ppTi\n\nB\n\nTi0\n\n.\n\n(19)\n\nSolving equations LTi0/K\u21b5\n(details are in Appendix A.4). Therefore for i0 < i 6 p, using (19),\n\ni0 \u21e1 B \u21e1 4pTi0Ki0, we get B/Ti0 6 2 L1/(\u21b5+1)B\u21b5/(\u21b5+1) ,\n\nand we obtain (18) by summing over i.\n\nTiM (f ) E[M ?\n\ni ] 6 2(1 + 72pp) Ti L1/(\u21b5+1)B\u21b5/(\u21b5+1) ,\n\nConclusion We conclude with some crude boundings. First, as i0 6 p and the sum of the Ti\u2019s is\nsmaller than T , the total approximation error is less than pB+2(1+72pp)T L1/(\u21b5+1)B\u21b5/(\u21b5+1). 
Let us include the cost of learning, which is smaller than 72 √p · pB, and conclude, using a + b ≤ 2 max(a, b):

R_T ≤ 2(1 + 72√p) T L^{1/(α+1)} B^{−α/(α+1)} + pB + 72 p^{3/2} B
    = 2(1 + 72√p) T L^{1/(α+1)} B^{−α/(α+1)} + p(1 + 72√p) B
    ≤ ( 2(1 + 72√p) + p(1 + 72√p) ) max( B , T L^{1/(α+1)} B^{−α/(α+1)} )

(20)

from which the desired bound follows, using 1 ≤ p, and p ≤ 2 log₂ B and 4(1 + 72√2) ≤ 412.

5 Further considerations

Local regularity assumption Theorem 2 holds under a relaxed smoothness assumption, namely that the function satisfies the Hölder condition only in a small cell containing the maximum. By looking carefully at the proof, we observe that the condition is only required up to the i_0-th epoch (defined in (17)), at which the size of the cells in the discretization is of order 1/K_{i_0} ≈ (LB)^{−1/(1+α)}. Therefore we only need condition (4) to be satisfied for points x in an interval of size (LB)^{−1/(1+α)} around the maximum.

Higher dimension Our results can be generalized to functions [0, 1]^d → [0, 1] that are ‖·‖_∞-Hölder. For MeDZO to be well-defined, take K_i = 2^{d(p+2−i)} and T_i = 2^{d(p+i)}, with p ≈ (log B)/d. The bounds are similar to their one-dimensional counterparts, up to replacing α by α/d in the exponents, but the constants are deteriorated by a factor that is exponential in d. The bound in Theorem 2 changes to max( B , L^{d/(α+d)} T B^{−α/(α+d)} ).

Acknowledgements

We would like to thank Gilles Stoltz and Pascal Massart for their valuable comments and suggestions.

References

[1] Rajeev Agrawal. The Continuum-Armed Bandit Problem. SIAM Journal on Control and Optimization, 33(6):1926–1951, 1995. ISSN 0363-0129, 1095-7138. doi: 10.1137/S0363012992237273. URL http://epubs.siam.org/doi/10.1137/S0363012992237273.

[2] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. 
In\n\nCOLT, pages 217\u2013226.\n\n[3] Peter Auer, Ronald Ortner, and Csaba Szepesv\u00e1ri. Improved rates for the stochastic continuum-armed\nbandit problem. In International Conference on Computational Learning Theory, pages 454\u2013468. Springer.\n\n[4] Lucien Birg\u00e9 and Pascal Massart. Estimation of Integral Functionals of a Density. The Annals of Statistics,\n\n23(1):11\u201329, 1995. URL https://doi.org/10.1214/aos/1176324452.\n\n[5] S\u00e9bastien Bubeck, Gilles Stoltz, and Jia Yuan Yu. Lipschitz bandits without the Lipschitz constant. In\n\nInternational Conference on Algorithmic Learning Theory, pages 144\u2013158. Springer.\n\n[6] S\u00e9bastien Bubeck, R\u00e9mi Munos, and Gilles Stoltz. Pure exploration in \ufb01nitely-armed and continuous-\narmed bandits. Theoretical Computer Science, 412(19):1832\u20131852, 2011. ISSN 03043975. doi: 10.1016/j.\ntcs.2010.12.059. URL https://linkinghub.elsevier.com/retrieve/pii/S030439751000767X.\n\n[7] S\u00e9bastien Bubeck, R\u00e9mi Munos, Gilles Stoltz, and Csaba Szepesv\u00e1ri. X-armed bandits. Journal of Machine\n\nLearning Research, 12:1655\u20131695, 2011.\n\n[8] Adam D. Bull. Adaptive-treed bandits. Bernoulli, 21(4):2289\u20132307, 2015. doi: 10.3150/14-BEJ644. URL\n\nhttps://doi.org/10.3150/14-BEJ644.\n\n[9] T. Tony Cai. Minimax and Adaptive Inference in Nonparametric Function Estimation. Statistical Science,\n\n27(1):31\u201350, 2012. doi: 10.1214/11-STS355. URL https://doi.org/10.1214/11-STS355.\n\n[10] T. Tony Cai and Mark G. Low. On adaptive estimation of linear functionals. 33(5):2311\u20132343. doi:\n\n10.1214/009053605000000633. URL https://doi.org/10.1214/009053605000000633.\n\n[11] Pierre-Arnaud Coquelin and R\u00e9mi Munos. Bandit algorithms for tree search. In Uncertainty in Arti\ufb01cial\n\nIntelligence, 2007.\n\n[12] Aur\u00e9lien Garivier, H\u00e9di Hadiji, Pierre Menard, and Gilles Stoltz. 
KL-UCB-switch: Optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints. URL http://arxiv.org/abs/1805.05071.

[13] Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore First, Exploit Next: The True Shape of Regret in Bandit Problems. Mathematics of Operations Research, 2018. ISSN 0364-765X. doi: 10.1287/moor.2017.0928. URL https://pubsonline.informs.org/doi/10.1287/moor.2017.0928.

[14] Jean-Bastien Grill, Michal Valko, and Rémi Munos. Black-box optimization of noisy functions with unknown smoothness. In Advances in Neural Information Processing Systems, pages 667–675.

[15] Olav Kallenberg. Foundations of modern probability. Springer Science & Business Media, 2006.

[16] Robert Kleinberg. Nearly Tight Bounds for the Continuum-armed Bandit Problem. In Proceedings of the 17th International Conference on Neural Information Processing Systems, pages 697–704. MIT Press.

[17] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and Experts in Metric Spaces. abs/1312.1277. URL http://arxiv.org/abs/1312.1277.

[18] Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual Bandits with Continuous Actions: Smoothing, Zooming, and Adapting. URL http://arxiv.org/abs/1902.01520.

[19] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press (preprint), 2019.

[20] O. Lepskii. On a Problem of Adaptive Estimation in Gaussian White Noise. Theory of Probability & Its Applications, 35(3):454–466, 1991. ISSN 0040-585X. doi: 10.1137/1135065. URL https://epubs.siam.org/doi/10.1137/1135065.

[21] Andrea Locatelli and Alexandra Carpentier. Adaptivity to Smoothness in X-armed bandits. In Conference on Learning Theory, pages 1463–1492.

[22] Pascal Massart. 
Concentration Inequalities and Model Selection. Ecole d\u2019Et\u00e9 de Probabilit\u00e9s de Saint-Flour\nXXXIII - 2003. Springer Berlin Heidelberg. ISBN 978-3-540-48503-2. URL https://doi.org/10.\n1007/978-3-540-48503-2.\n\n[23] Xuedong Shang, Emilie Kaufmann, and Michal Valko. General parallel optimization without a metric. In\n\nAlgorithmic Learning Theory, volume 98. URL https://hal.inria.fr/hal-02047225.\n\n[24] Michal Valko, Alexandra Carpentier, and R\u00e9mi Munos. Stochastic Simultaneous Optimistic Optimization.\n\nIn International Conference on Machine Learning.\n\n10\n\n\f", "award": [], "sourceid": 609, "authors": [{"given_name": "Hedi", "family_name": "Hadiji", "institution": "Laboratoire de Mathematiques d\u2019Orsay, Univ. Paris-Sud,"}]}*