Conference: International Workshop on Advanced Imaging Technology

Extraction of Distinctive Keywords and Articles from Untranscribed Historical Newspaper Images

Abstract

This paper proposes a novel approach for extracting distinctive keywords from historical newspaper images without character recognition. We convert the image of the text block on an entire newspaper page into a sequence of codes by discretizing feature vectors, which avoids the errors introduced by optical character recognition (OCR). This conversion makes it possible to analyze untranscribed newspaper images with text-processing methods. We examine the daily occurrences of every tri-gram string and extract the strings that appear densely as distinctive keywords. In addition, we highlight articles that contain distinctive keywords as distinctive articles. The proposed method was evaluated on an archive of Japanese newspaper images published in the 19th century, and the results were promising.
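The tri-gram step described in the abstract can be illustrated with a minimal sketch. It assumes each day's page has already been converted to a sequence of integer codes. The abstract does not define "dense appearance", so the criterion used here (most occurrences falling on a single day) is a hypothetical stand-in, as are the function names and the thresholds `min_total` and `min_share`.

```python
from collections import Counter, defaultdict


def trigrams(codes):
    """Return every run of three consecutive codes in a sequence."""
    return zip(codes, codes[1:], codes[2:])


def daily_trigram_counts(pages_by_day):
    """Map each tri-gram to a Counter of {day: number of occurrences}."""
    counts = defaultdict(Counter)
    for day, codes in pages_by_day.items():
        for gram in trigrams(codes):
            counts[gram][day] += 1
    return counts


def distinctive_trigrams(pages_by_day, min_total=3, min_share=0.8):
    """Return tri-grams whose occurrences cluster on a single day.

    Assumed density criterion (not taken from the paper): the tri-gram
    occurs at least `min_total` times overall, and its busiest day holds
    at least `min_share` of those occurrences.
    """
    result = []
    for gram, per_day in daily_trigram_counts(pages_by_day).items():
        total = sum(per_day.values())
        peak_day, peak = per_day.most_common(1)[0]
        if total >= min_total and peak / total >= min_share:
            result.append((gram, peak_day, peak / total))
    # Most concentrated tri-grams first.
    return sorted(result, key=lambda item: -item[2])


# Toy data: one code sequence per day (the codes are arbitrary integers).
pages = {
    "day1": [7, 1, 2, 3, 8, 1, 2, 3, 9, 1, 2, 3],
    "day2": [4, 5, 6, 0],
    "day3": [4, 5, 6, 5],
    "day4": [4, 5, 6, 2],
}
print(distinctive_trigrams(pages))
```

In this toy data the tri-gram `(1, 2, 3)` appears three times, all on `day1`, so it is reported. `(4, 5, 6)` also appears three times but is spread evenly over three days, so it is treated as a common string and not as a distinctive one.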
