Conference: International Conference on Universal Digital Library (ICUDL 2005), October 31 to November 2, 2005, Hangzhou, China

A Full Text Search Engine for Noisy OCR Document Based on Statistical Language Modeling

Abstract

Digital libraries hold large numbers of OCR (Optical Character Recognition) documents. To obtain more accurate retrieval results from noisy OCR documents, the traditional approach is to correct erroneous OCR words in advance, replacing each erroneous word with its most similar correction. However, such correction does not consider context information, and the correction rate is sometimes low. As a result, much of the original information in OCR documents is lost, and the recall and precision of OCR document retrieval degrade, especially for short documents. In this paper, we propose a novel OCR document retrieval approach based on statistical language models. First, instead of replacing erroneous words explicitly, we consider the likelihood that each word in a correction candidate list is associated with other words in the document, both locally and contextually. Then all the words in the candidate list are indexed in an inverted index. After that, relevant documents are retrieved with a statistical language model and ranked according to word probability. Known-item search experiments on the TREC-5 confusion track data show that our approach is more effective than both the OCR document correction method and the query expansion method, which uses n-gram approximate text matching.
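
The pipeline outlined in the abstract (candidate lists, an inverted index over all candidates, language-model ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of `difflib` string similarity as a stand-in for candidate probabilities, and the Jelinek-Mercer smoothing with `lam=0.5` are all assumptions made for illustration, and the contextual weighting described in the paper is omitted.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher, get_close_matches


def candidates(token, vocabulary, n=3, cutoff=0.6):
    """Return {candidate: weight} for one OCR token; weights sum to 1.

    Assumption: string similarity stands in for the paper's
    context-aware candidate probabilities.
    """
    if token in vocabulary:
        return {token: 1.0}
    close = get_close_matches(token, sorted(vocabulary), n=n, cutoff=cutoff)
    if not close:
        return {token: 1.0}  # keep the raw token when nothing is similar
    scores = {c: SequenceMatcher(None, token, c).ratio() for c in close}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def build_index(docs, vocabulary):
    """Index every candidate of every token: term -> doc_id -> expected count."""
    index = defaultdict(lambda: defaultdict(float))
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        lengths[doc_id] = len(tokens)
        for tok in tokens:
            for cand, weight in candidates(tok, vocabulary).items():
                index[cand][doc_id] += weight
    return index, lengths


def rank(query, index, lengths, lam=0.5):
    """Rank documents by smoothed query log-likelihood (Jelinek-Mercer)."""
    total_len = sum(lengths.values())
    scores = {}
    for doc_id, doc_len in lengths.items():
        score = 0.0
        for term in query.lower().split():
            postings = index.get(term, {})
            p_doc = postings.get(doc_id, 0.0) / doc_len if doc_len else 0.0
            p_coll = sum(postings.values()) / total_len if total_len else 0.0
            p = lam * p_doc + (1 - lam) * p_coll
            score += math.log(p) if p > 0 else math.log(1e-12)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: "libraiy" is a simulated OCR error for "library".
vocabulary = {"the", "library", "holds", "digital", "books",
              "search", "engines", "rank", "web", "pages"}
docs = {
    "d1": "the libraiy holds digital books",
    "d2": "search engines rank web pages",
}
index, lengths = build_index(docs, vocabulary)
ranking = rank("library", index, lengths)
```

Because the misrecognized token is indexed under its candidates rather than overwritten, a query for the correct word can still reach the document, and no original text is discarded.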
