Partially observable environment estimation with uplift inference for reinforcement learning based recommendation

Shang Wenjie; Li Qingyang; Qin Zhiwei; Yu Yang; Meng Yiping; Ye Jieping

Abstract

Reinforcement learning (RL) aims to find the best policy model for decision making and has proven powerful for sequential recommendation. Training a policy by RL, however, requires an environment to interact with. In many real-world applications, training the policy in the real environment incurs an unbearable cost because of exploration. Estimating the environment from past data is thus an appealing way to unlock the power of RL in these applications. Environment estimation essentially means extracting a causal effect model from the data. However, real-world applications are often too complex to offer fully observable environment information, so there are quite possibly unobserved variables behind the data that can obstruct effective estimation of the environment. In this paper, by treating the hidden variables as a hidden policy, we propose a partially-observed multi-agent environment estimation (POMEE) approach to learn the partially-observed environment. To better extract the causal relationship between actions and rewards, we design a deep uplift inference network (DUIN) model to learn the causal effects of different actions. By implementing the environment model in the DUIN structure, we propose a POMEE with uplift inference (POMEE-UI) approach to generate a partially-observed environment with a causal reward mechanism. We analyze the effect of our method in both artificial and real-world environments. We first use an artificial recommender environment, abstracted from a real-world application, to verify the effectiveness of POMEE-UI. We then test POMEE-UI in the real application of Didi Chuxing. Experimental results show that POMEE-UI can effectively estimate the hidden variables, leading to a more reliable virtual environment. The online A/B testing results show that POMEE can derive a well-performing recommender policy in the real-world application.
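
To make the notion of uplift in the abstract concrete, here is a minimal sketch of the simplest possible uplift estimate: the difference in mean reward between each action and a baseline action, computed from logged data. It is not the DUIN model or any part of POMEE-UI. The function name `estimate_uplift`, the action labels, and the log records are all hypothetical. The estimate is only meaningful as a causal effect if actions were assigned independently of any hidden variables, which is exactly the assumption the paper says often fails in practice.

```python
from collections import defaultdict


def estimate_uplift(logs, baseline_action):
    """Naive uplift: mean reward of each action minus mean reward of the baseline.

    `logs` is an iterable of (action, reward) pairs. This assumes actions were
    assigned at random; it does not correct for unobserved confounders.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for action, reward in logs:
        totals[action] += reward
        counts[action] += 1
    if counts[baseline_action] == 0:
        raise ValueError("no logged samples for the baseline action")
    means = {a: totals[a] / counts[a] for a in counts}
    base = means[baseline_action]
    # Report uplift for every non-baseline action.
    return {a: m - base for a, m in means.items() if a != baseline_action}


# Hypothetical logged (action, reward) pairs, for illustration only.
logs = [
    ("none", 0.0), ("none", 1.0),
    ("coupon", 1.0), ("coupon", 1.0),
    ("push", 0.0), ("push", 1.0),
]
uplift = estimate_uplift(logs, baseline_action="none")
print(uplift)
```

When a hidden variable influences both the logged action and the reward, this difference in means mixes the action's effect with the confounder's effect, which is the motivation the abstract gives for modeling the hidden variables explicitly.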

Publication details

Source
Machine Learning, 2021, no. 9, pp. 2603-2640 (38 pages)
Authors
Shang Wenjie; Li Qingyang; Qin Zhiwei; Yu Yang; Meng Yiping; Ye Jieping
Author affiliations

Didi Chuxing AI Labs, Beijing, People's Republic of China
Didi Chuxing AI Labs, Beijing, People's Republic of China
Didi Chuxing AI Labs, Beijing, People's Republic of China
Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing 210023, People's Republic of China
Didi Chuxing AI Labs, Beijing, People's Republic of China
Didi Chuxing AI Labs, Beijing, People's Republic of China

Original format: PDF
Language of text: English
Keywords
Reinforcement learning; Environment estimation; Hidden state; Uplift modeling; Recommender system

Similar literature

1. Model-Based Reinforcement Learning for Partially Observable Games with Sampling-Based State Estimation [J]. Hajime Fujita, Shin Ishii. Neural Computation, 2007, no. 11.
2. A gradient-based reinforcement learning approach to dynamic pricing in partially-observable environments [J]. David Vengerov. Future Generation Computer Systems, 2008, no. 7.
3. Identification of partially observable environment based on on-line variational Bayes method and its application to reinforcement learning [J]. Junichiro Yoshimoto, Shin Ishii, Masa-aki Sato. 電子情報通信学会技術研究報告. ニューロコンピューティング (Neurocomputing), 2002, no. 731.
4. Effectiveness on inference of unobservable environment using Bayesian learning in partially observable environment - it's application to the Robocup rescue simulation [C]. Yuu Sakamoto, Keisuke Matunaga, Yosinobu Kajikawa. システム制御情報学会研究発表講演会, 2001.
5. Attack Detection for Cyber Systems and Probabilistic State Estimation in Partially Observable Cyber Environments [D]. Guha, Sayantan. 2016.
6. Detecting Changes and Avoiding Catastrophic Forgetting in Dynamic Partially Observable Environments [O]. Jeffery Dick, Pawel Ladosz, Eseoghene Ben-Iwhiwhu. 2020.
7. Model-based Reinforcement Learning for Partially Observable Games with Sampling-based State Estimation [O]. Hajime Fujita, Shin Ishii. 2008.
