
Partially observable environment estimation with uplift inference for reinforcement learning based recommendation



Abstract

Reinforcement learning (RL) aims to search for the best policy model for decision making, and has been shown to be powerful for sequential recommendation. Training the policy with RL, however, requires interaction with an environment. In many real-world applications, training the policy in the real environment can incur an unbearable cost due to exploration. Estimating the environment from past data is thus an appealing way to unleash the power of RL in these applications. Estimating the environment essentially amounts to extracting a causal effect model from the data. However, real-world applications are often too complex to offer fully observable environment information. Therefore, unobserved variables quite possibly lie behind the data, which can obstruct an effective estimation of the environment. In this paper, by treating the hidden variables as a hidden policy, we propose a partially-observed multi-agent environment estimation (POMEE) approach to learn the partially-observed environment. To better extract the causal relationship between actions and rewards, we design a deep uplift inference network (DUIN) model to learn the causal effects of different actions. By implementing the environment model with the DUIN structure, we propose a POMEE with uplift inference (POMEE-UI) approach to generate a partially-observed environment with a causal reward mechanism. We analyze the effect of our method in both artificial and real-world environments. We first use an artificial recommender environment, abstracted from a real-world application, to verify the effectiveness of POMEE-UI. We then test POMEE-UI in the real application of Didi Chuxing. Experiment results show that POMEE-UI can effectively estimate the hidden variables, leading to a more reliable virtual environment. Online A/B testing results show that POMEE can derive a well-performing recommender policy in the real-world application.
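To make the notion of uplift inference over actions more concrete, below is a minimal, self-contained sketch in PyTorch of a multi-head uplift network that predicts the expected reward of each candidate recommendation action relative to taking no action, and scores actions by the difference. This is an illustrative assumption only, not the DUIN architecture described in the paper; the class name `SimpleUpliftNet`, the two-head design, and all dimensions are hypothetical.

```python
# A minimal sketch of uplift-style causal effect estimation for recommendation
# actions. This is NOT the paper's DUIN model; it only illustrates the general
# idea: predict the outcome with and without an action and take the difference
# (the uplift). All names and dimensions here are assumptions.
import torch
import torch.nn as nn


class SimpleUpliftNet(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        # Shared encoder over the (partially observed) user/context state.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Control head: expected reward if no recommendation is made.
        self.control_head = nn.Linear(hidden, 1)
        # Treatment heads: expected reward for each candidate action.
        self.treatment_heads = nn.Linear(hidden, num_actions)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        z = self.encoder(state)
        y0 = self.control_head(z)        # shape: (batch, 1)
        y1 = self.treatment_heads(z)     # shape: (batch, num_actions)
        return y1 - y0                   # estimated uplift of each action over doing nothing


if __name__ == "__main__":
    net = SimpleUpliftNet(state_dim=16, num_actions=4)
    state = torch.randn(8, 16)                 # a batch of 8 synthetic states
    uplift = net(state)
    best_action = uplift.argmax(dim=1)         # choose the action with the largest estimated uplift
    print(uplift.shape, best_action.shape)     # torch.Size([8, 4]) torch.Size([8])
```

In this sketch the network would be trained on logged interactions, with each head fitted only on the outcomes observed under its corresponding action; the paper's POMEE-UI instead embeds its uplift model inside the estimated environment as the reward mechanism.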


