Source: International Conference on Universal Digital Library

A Full Text Search Engine for Noisy OCR Document Based on Statistical Language Modeling

Abstract

Digital libraries hold large numbers of OCR (Optical Character Recognition) documents. To obtain more accurate retrieval results from noisy OCR documents, the traditional approach is to correct the erroneous OCR words in advance. The main idea is to replace each erroneous word with the most similar correct word. However, such correction does not consider context information, and the correction rate is sometimes low. As a result, much of the original information in the OCR documents is lost, and both the recall and the precision of OCR document retrieval degrade, especially for short documents. In this paper, we propose a novel OCR document retrieval approach based on statistical language models. First, instead of explicitly replacing the erroneous words, we consider the probability that each word in a correction candidate list is associated with other words in the document, both locally and contextually. Then all the words in the candidate list are indexed in an inverted index. After that, the relevant documents are retrieved using a statistical language model and ranked according to the word probabilities. Known-item search experiments on the TREC-5 confusion track data show that our approach is more effective than the OCR document correction method and the query expansion method that use approximate n-gram text matching.
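The pipeline outlined in the abstract (keep every correction candidate, index all of them, rank with a language model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `build_index` and `rank`, the representation of each OCR token as a list of `(candidate, probability)` pairs, the use of summed probabilities as expected term counts, and the choice of Jelinek-Mercer smoothing with weight `lam`. The abstract does not say how the candidate probabilities are estimated from local and contextual information, so the sketch takes them as given.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Build an inverted index over all correction candidates.

    docs maps doc_id -> list of OCR tokens, where each token is a list of
    (candidate_word, probability) pairs (hypothetical input format).
    Returns (index, doc_len): index maps term -> {doc_id: expected count}.
    """
    index = defaultdict(dict)
    doc_len = {}
    for doc_id, tokens in docs.items():
        doc_len[doc_id] = len(tokens)
        for candidates in tokens:
            for term, prob in candidates:
                # Each candidate contributes its probability as a fractional count.
                index[term][doc_id] = index[term].get(doc_id, 0.0) + prob
    return index, doc_len


def rank(query, index, doc_len, lam=0.5):
    """Rank documents by smoothed query log-likelihood (assumed scoring rule)."""
    total_len = sum(doc_len.values())
    scores = {}
    for doc_id, length in doc_len.items():
        if length == 0:
            continue  # skip empty documents to avoid division by zero
        score = 0.0
        for term in query:
            postings = index.get(term, {})
            p_doc = postings.get(doc_id, 0.0) / length
            p_coll = sum(postings.values()) / total_len
            # Jelinek-Mercer mixture of document and collection models,
            # floored so unseen terms do not produce log(0).
            p = max(lam * p_doc + (1.0 - lam) * p_coll, 1e-12)
            score += math.log(p)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

With this layout, a document whose noisy token has the query word as a likely candidate can still be retrieved even though no single correction was ever committed to, which is the contrast with correct-then-index that the abstract draws.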
