Relating Articles Textually and Visually

Abstract

Historical documents have undergone large-scale digitization in recent years, placing massive image collections online. Optical character recognition (OCR) often performs poorly on such material, which makes searching within these resources problematic and textual analysis of the documents difficult. We present two approaches to overcome this obstacle, one textual and one visual. We show that, for tasks like finding newspaper articles related by topic, poor-quality OCR text suffices. An ordinary vector-space model is used to represent articles. Further improvement is obtained by adding words with similar distributional representations. As an alternative to OCR-based methods, one can perform image-based search using word spotting. Synthetic images are generated for every word in a lexicon, and word spotting is used to compile vectors of their occurrences. Retrieval is by means of a standard nearest-neighbor search. The results of this visual approach are comparable to those obtained using noisy OCR. We report on experiments applying both methods, separately and together, to historical Hebrew newspapers, with their added problem of rich morphology.
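
The textual pipeline described in the abstract (vector-space representation, optional expansion with distributionally similar words, nearest-neighbor retrieval) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the term weighting or the similarity measure, so smoothed TF-IDF and cosine similarity are assumed here, and the function names, the toy articles, and the `similar` word map are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized article to a sparse TF-IDF vector (term -> weight).

    Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1; the abstract
    does not say which weighting the authors used.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand(doc, similar):
    """Append distributionally similar words (hypothetical map) to an article."""
    return doc + [s for t in doc for s in similar.get(t, [])]


def nearest(query_idx, vectors, k=1):
    """Return the indices of the k articles most similar to the query article."""
    scores = [
        (cosine(vectors[query_idx], v), i)
        for i, v in enumerate(vectors)
        if i != query_idx
    ]
    # Sort by descending similarity, breaking ties by index.
    return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))[:k]]


# Toy tokenized articles standing in for noisy OCR output.
articles = [
    ["election", "vote", "parliament"],
    ["vote", "ballot", "election", "results"],
    ["harvest", "wheat", "rain"],
]
vectors = tfidf_vectors(articles)
related = nearest(0, vectors, k=1)
```

In this toy collection the first article shares terms only with the second, so `related` is `[1]`. In the visual variant, the same nearest-neighbor step would run over vectors of word-spotting occurrences instead of OCR tokens.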
