IEEE/ACM Transactions on Networking

Combinatorial Network Optimization With Unknown Variables: Multi-Armed Bandits With Linear Rewards and Individual Observations



Abstract

We formulate the following combinatorial multi-armed bandit (MAB) problem: There are $N$ random variables with unknown mean that are each instantiated in an i.i.d. fashion over time. At each time, multiple random variables can be selected, subject to an arbitrary constraint on weights associated with the selected variables. All of the selected individual random variables are observed at that time, and a linearly weighted combination of these selected variables is yielded as the reward. The goal is to find a policy that minimizes regret, defined as the difference between the reward obtained by a genie that knows the mean of each random variable and that obtained by the given policy. This formulation is broadly applicable and useful for stochastic online versions of many interesting tasks in networks that can be formulated as tractable combinatorial optimization problems with linear objective functions, such as maximum weighted matching, shortest path, and minimum spanning tree computations. Prior work on multi-armed bandits with multiple plays cannot be applied to this formulation because of the general nature of the constraint. On the other hand, mapping all feasible combinations to arms allows the use of prior work on single-play MAB, but results in regret, storage, and computation growing exponentially in the number of unknown variables. We present new efficient policies for this problem that are shown to achieve regret that grows logarithmically with time and polynomially in the number of unknown variables. Furthermore, these policies require only storage that grows linearly in the number of unknown parameters. For problems where the underlying deterministic problem is tractable, these policies further require only polynomial computation. For computationally intractable problems, we also present results on a different notion of regret that is suitable when a polynomial-time approximation algorithm is used.
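To make the formulation concrete, regret after $T$ rounds takes its usual form $T\,\mu^{*} - \mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$, where $\mu^{*}$ is the expected per-round reward of the best feasible action under the true means and $r_t$ is the reward the policy receives at time $t$. The sketch below is an illustration in the spirit of the abstract, not the paper's actual policy or confidence constants: keep one UCB-style index per unknown variable ($O(N)$ storage in total) and let a deterministic solver for the underlying combinatorial problem pick the action using the indices as weights. The toy "choose the $k$ largest" constraint stands in for matching, shortest-path, or spanning-tree oracles; the names `combinatorial_ucb` and `pull`, the Bernoulli instance, and the $\sqrt{2\ln t/n_i}$ bonus are all illustrative assumptions.

```python
import math
import random

def combinatorial_ucb(n, k, pull, horizon, seed=0):
    """Sketch of a combinatorial semi-bandit policy: one optimistic index
    per unknown variable, and a deterministic combinatorial oracle (here,
    top-k selection) run on those indices each round. Storage is O(n),
    rather than one statistic per feasible combination."""
    rng = random.Random(seed)
    means = [0.0] * n   # empirical mean of each unknown variable
    counts = [0] * n    # individual observations of each variable
    total = 0.0
    for t in range(1, horizon + 1):
        def index(i):
            # Unobserved variables get an infinite bonus so each is tried
            # once; the sqrt(2 ln t / n_i) term is a generic UCB bonus,
            # not the paper's exact constant.
            if counts[i] == 0:
                return float("inf")
            return means[i] + math.sqrt(2.0 * math.log(t) / counts[i])
        # Deterministic oracle for this toy constraint: the k largest
        # indices. For matching/shortest-path/MST, substitute the
        # appropriate polynomial-time solver.
        action = sorted(range(n), key=index, reverse=True)[:k]
        # Individual (semi-bandit) feedback: every selected variable is
        # observed, so its statistics update even off the best action.
        for i in action:
            x = pull(i, rng)
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]  # incremental mean
            total += x
    return total, means

# Toy instance: Bernoulli variables with hidden means; the reward is the
# unit-weighted sum over the k selected variables.
if __name__ == "__main__":
    true_means = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
    pull = lambda i, rng: 1.0 if rng.random() < true_means[i] else 0.0
    total, est = combinatorial_ucb(n=6, k=2, pull=pull, horizon=5000)
    print("estimated means:", [round(m, 2) for m in est])
```

Because the per-variable statistics are shared by every feasible combination containing that variable, this structure is what lets regret and storage scale polynomially in $N$ instead of growing with the exponentially many combinations treated as separate arms.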
