Rich-text document styling restoration via reinforcement learning


Abstract

Richly formatted documents, such as financial disclosures, scientific articles, and government regulations, are widespread on the Web. However, since most of these documents are published only for reading, their styling information is usually missing, which makes them ill-suited or even burdensome to display and edit across formats and platforms. In this study, we formulate document styling restoration as an optimization problem: identify the styling settings of the document elements (e.g., lines, table cells, text) such that rendering with the output settings yields a document in which each element occupies exactly, or very nearly, the same position as in the original document. Since each styling setting is a decision, the problem can be cast as a multi-step decision-making task over all document elements and solved by reinforcement learning. Specifically, Monte-Carlo Tree Search (MCTS) is used to explore the different styling settings, and the policy function is learned under the supervision of delayed rewards. As a case study, we restore the styling information of tables, where structural and functional data in documents are usually presented. Experiments show that our best reinforcement method successfully restores the styling of 87.65% of the tables, a 25.75% absolute improvement over the greedy method. We also discuss the tradeoff between inference time and restoration success rate, and argue that although the reinforcement methods cannot be used in real-time scenarios, they are suitable for offline tasks with high quality requirements. Finally, the model has been applied in a PDF parser to support cross-format display.
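
The formulation in the abstract (one styling decision per element, a reward that only arrives after the whole document is rendered) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the candidate widths in `CHOICES`, the target edges in `TARGET_EDGES`, the one-row `render` function, and the reward shape are all hypothetical. The sketch also uses plain UCT with random rollouts, whereas the paper learns a policy function from the delayed rewards.

```python
import math
import random

# Hypothetical toy instance (not from the paper): pick a width for each of
# four table columns so that the rendered column edges match the original.
CHOICES = [40, 60, 80, 100]              # assumed candidate column widths
TARGET_EDGES = [0, 60, 100, 200, 280]    # assumed edges seen in the original
N_COLUMNS = len(TARGET_EDGES) - 1


def render(widths):
    """Toy renderer: x-positions of the column edges for the given widths."""
    edges = [0]
    for w in widths:
        edges.append(edges[-1] + w)
    return edges


def reward(widths):
    """Delayed reward for a complete assignment: 1.0 on an exact match,
    otherwise a small penalty proportional to the total edge deviation."""
    deviation = sum(abs(a - b) for a, b in zip(render(widths), TARGET_EDGES))
    return 1.0 if deviation == 0 else -deviation / 1000.0


class Node:
    def __init__(self, widths, parent=None):
        self.widths = widths      # styling decisions made so far
        self.parent = parent
        self.children = {}        # width choice -> child Node
        self.visits = 0
        self.total = 0.0          # sum of rewards backed up through this node

    def is_terminal(self):
        return len(self.widths) == N_COLUMNS

    def untried(self):
        return [c for c in CHOICES if c not in self.children]

    def best_child(self, c_explore):
        # UCT score: mean reward plus an exploration bonus.
        def uct(child):
            bonus = math.sqrt(math.log(self.visits) / child.visits)
            return child.total / child.visits + c_explore * bonus
        return max(self.children.values(), key=uct)


def mcts(iterations=2000, c_explore=0.5, seed=0):
    """Search the decision tree; return the best complete assignment seen."""
    rng = random.Random(seed)
    root = Node([])
    best_widths, best_reward = None, -math.inf
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while every choice at this node has been tried.
        while not node.is_terminal() and not node.untried():
            node = node.best_child(c_explore)
        # 2. Expansion: add one untried styling decision.
        if not node.is_terminal():
            choice = rng.choice(node.untried())
            child = Node(node.widths + [choice], parent=node)
            node.children[choice] = child
            node = child
        # 3. Simulation: complete the remaining decisions at random.
        widths = list(node.widths)
        while len(widths) < N_COLUMNS:
            widths.append(rng.choice(CHOICES))
        r = reward(widths)
        if r > best_reward:
            best_widths, best_reward = widths, r
        # 4. Backpropagation: the delayed reward updates the whole path.
        while node is not None:
            node.visits += 1
            node.total += r
            node = node.parent
    return best_widths, best_reward
```

Calling `mcts()` returns a pair `(widths, reward)`; a reward of `1.0` means the toy renderer reproduces `TARGET_EDGES` exactly. A greedy baseline would instead fix each width using only the next edge, which is cheap but cannot recover from an early mistake, and this is the kind of inference-time versus success-rate tradeoff the abstract discusses.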

Bibliographic details

  • Source
    Frontiers of Computer Science | 2021, Issue 4 | 154328.1-154328.11 | 11 pages
  • Author affiliations

    Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China;

    Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China;

    Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China;

    Search Product Center, WeChat Search Application Department, Tencent, Beijing 100080, China;

    Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China;

  • Original format: PDF
  • Language: English
  • Keywords

    styling restoration; monte-carlo tree search; reinforcement learning; richly formatted documents; tables;
