Conference: International Conference on Document Analysis and Recognition

OCR-VQA: Visual Question Answering by Reading Text in Images



Abstract

The problem of answering questions about an image is popularly known as visual question answering (VQA). It is a well-established problem in computer vision. However, none of the current VQA methods utilize the text often present in the image. This "text in images" provides additional useful cues and facilitates better understanding of the visual content. In this paper, we introduce a novel task of visual question answering by reading text in images, i.e., by optical character recognition (OCR). We refer to this problem as OCR-VQA. To facilitate systematic study of this new problem, we introduce a large-scale dataset, namely OCR-VQA-200K. This dataset comprises 207,572 images of book covers and contains more than 1 million question-answer pairs about these images. We judiciously combine well-established techniques from the OCR and VQA domains to present a novel baseline for OCR-VQA-200K. The experimental results and rigorous analysis demonstrate various challenges present in this dataset, leaving ample scope for future research. We are optimistic that this new task, along with the compiled dataset, will open up many exciting research avenues for both the document image analysis and VQA communities.
