Published in: String Processing and Information Retrieval; Lecture Notes in Computer Science; 4209

Word-Based Correction for Retrieval of Arabic OCR Degraded Documents


Abstract

Arabic documents that are available only in print remain ubiquitous; they can be scanned and then OCR'ed to make them easier to retrieve. This paper explores how word-based OCR correction affects the effectiveness of retrieving Arabic OCR'ed documents with different index terms. The OCR correction uses an improved character-segment-based noisy channel model and is tested on both real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used, and that indexing with short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.
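To illustrate why index-term length might matter for OCR-degraded text, here is a minimal sketch of character n-gram index terms. It is not the paper's method: the helper names `char_ngrams` and `surviving_fraction` are hypothetical, and a Latin-script word with a single made-up OCR substitution stands in for Arabic text.

```python
def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word, in order."""
    if len(word) < n:
        # Words shorter than n are kept whole (an assumption of this sketch).
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def surviving_fraction(clean, degraded, n):
    """Fraction of the clean word's distinct n-grams still present after degradation."""
    clean_terms = set(char_ngrams(clean, n))
    degraded_terms = set(char_ngrams(degraded, n))
    return len(clean_terms & degraded_terms) / len(clean_terms)


clean = "retrieval"
degraded = "retrleval"  # hypothetical OCR error: 'i' misread as 'l'

# Whole-word index term: the single term no longer matches.
whole_word = surviving_fraction(clean, degraded, len(clean))

# Trigram index terms: only the trigrams overlapping the error are lost.
trigram = surviving_fraction(clean, degraded, 3)

print(whole_word, trigram)
```

In this toy case one character error destroys the whole-word term but leaves four of the seven trigrams intact, which is consistent with the abstract's observation that the benefit of correction depends on index-term length.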
