IEEE/ACM Transactions on Networking

Combinatorial Network Optimization With Unknown Variables: Multi-Armed Bandits With Linear Rewards and Individual Observations



Abstract

We formulate the following combinatorial multi-armed bandit (MAB) problem: There are $N$ random variables with unknown mean that are each instantiated in an i.i.d. fashion over time. At each time, multiple random variables can be selected, subject to an arbitrary constraint on weights associated with the selected variables. All of the selected individual random variables are observed at that time, and a linearly weighted combination of these selected variables is yielded as the reward. The goal is to find a policy that minimizes regret, defined as the difference between the reward obtained by a genie that knows the mean of each random variable and that obtained by the given policy. This formulation is broadly applicable and useful for stochastic online versions of many interesting tasks in networks that can be formulated as tractable combinatorial optimization problems with linear objective functions, such as maximum weighted matching, shortest path, and minimum spanning tree computations. Prior work on multi-armed bandits with multiple plays cannot be applied to this formulation because of the general nature of the constraint. On the other hand, mapping all feasible combinations to arms allows the use of prior work on single-play MAB, but results in regret, storage, and computation growing exponentially in the number of unknown variables. We present new efficient policies for this problem that are shown to achieve regret that grows logarithmically with time and polynomially in the number of unknown variables. Furthermore, these policies require only storage that grows linearly in the number of unknown parameters. For problems where the underlying deterministic problem is tractable, these policies further require only polynomial computation. For computationally intractable problems, we also present results on a different notion of regret that is suitable when a polynomial-time approximation algorithm is used.
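To make the formulation concrete, regret after $T$ rounds takes its usual form $T\,\mu^{*} - \mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$, where $\mu^{*}$ is the expected per-round reward of the best feasible action under the true means and $r_t$ is the reward the policy receives at time $t$. The sketch below is an illustration in the spirit of the abstract, not the paper's actual policy or confidence constants: keep one UCB-style index per unknown variable ($O(N)$ storage in total) and let a deterministic solver for the underlying combinatorial problem pick the action using the indices as weights. The toy "choose the $k$ largest" constraint stands in for matching, shortest-path, or spanning-tree oracles; the names `combinatorial_ucb` and `pull`, the Bernoulli instance, and the $\sqrt{2\ln t/n_i}$ bonus are all illustrative assumptions.

```python
import math
import random

def combinatorial_ucb(n, k, pull, horizon, seed=0):
    """Sketch of a combinatorial semi-bandit policy: one optimistic index
    per unknown variable, and a deterministic combinatorial oracle (here,
    top-k selection) run on those indices each round. Storage is O(n),
    rather than one statistic per feasible combination."""
    rng = random.Random(seed)
    means = [0.0] * n   # empirical mean of each unknown variable
    counts = [0] * n    # individual observations of each variable
    total = 0.0
    for t in range(1, horizon + 1):
        def index(i):
            # Unobserved variables get an infinite bonus so each is tried
            # once; the sqrt(2 ln t / n_i) term is a generic UCB bonus,
            # not the paper's exact constant.
            if counts[i] == 0:
                return float("inf")
            return means[i] + math.sqrt(2.0 * math.log(t) / counts[i])
        # Deterministic oracle for this toy constraint: the k largest
        # indices. For matching/shortest-path/MST, substitute the
        # appropriate polynomial-time solver.
        action = sorted(range(n), key=index, reverse=True)[:k]
        # Individual (semi-bandit) feedback: every selected variable is
        # observed, so its statistics update even off the best action.
        for i in action:
            x = pull(i, rng)
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]  # incremental mean
            total += x
    return total, means

# Toy instance: Bernoulli variables with hidden means; the reward is the
# unit-weighted sum over the k selected variables.
if __name__ == "__main__":
    true_means = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
    pull = lambda i, rng: 1.0 if rng.random() < true_means[i] else 0.0
    total, est = combinatorial_ucb(n=6, k=2, pull=pull, horizon=5000)
    print("estimated means:", [round(m, 2) for m in est])
```

Because the per-variable statistics are shared by every feasible combination containing that variable, this structure is what lets regret and storage scale polynomially in $N$ instead of growing with the exponentially many combinations treated as separate arms.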
