Improved string matching under noisy channel conditions

Abstract

Many document-based applications, including popular Web browsers, email viewers, and word processors, have a 'Find on this Page' feature that allows a user to find every occurrence of a given string in the document. If the document text being searched is derived from a noisy process such as optical character recognition (OCR), the effectiveness of typical string matching can be greatly reduced. This paper describes an enhanced string-matching algorithm for degraded text that improves recall while keeping precision at acceptable levels. The algorithm is more general than most approximate matching algorithms and allows string-to-string edits with arbitrary costs. We develop a method for evaluating our technique and use it to examine the relative effectiveness of each sub-component of the algorithm. Of the components we varied, we find that using confidence information from the recognition process leads to the largest improvements in matching accuracy.
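
To make the idea of string-to-string edits with arbitrary costs concrete, the following is a minimal sketch and not the paper's algorithm. It is an approximate substring search by dynamic programming in which a multi-character rewrite such as "m" to "rn" can be cheaper than the equivalent single-character edits. The function name `find_approx`, the `EDIT_COSTS` table, and every cost value are hypothetical choices made for illustration, and recognition confidence is not modeled.

```python
from typing import Dict, List, Tuple

# Hypothetical OCR confusion table: (query substring, text substring) -> cost.
# Entries and costs are illustrative assumptions, not values from the paper.
EDIT_COSTS: Dict[Tuple[str, str], float] = {
    ("m", "rn"): 0.2,
    ("l", "1"): 0.2,
    ("cl", "d"): 0.3,
}


def find_approx(
    query: str,
    text: str,
    max_cost: float,
    edit_costs: Dict[Tuple[str, str], float] = EDIT_COSTS,
    default_cost: float = 1.0,
) -> List[Tuple[int, float]]:
    """Return (end, cost) pairs: `end` is an exclusive index into `text`
    where some substring ending there matches `query` with cost <= max_cost."""
    n, m = len(query), len(text)
    inf = float("inf")
    # dp[i][j]: cheapest cost of rewriting query[:i] into a substring of
    # text that ends at position j. Row 0 is all zeros so that a match
    # may begin anywhere in the text.
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        dp[0][j] = 0.0

    for i in range(1, n + 1):
        for j in range(m + 1):
            # Query character with no counterpart in the text.
            best = dp[i - 1][j] + default_cost
            if j > 0:
                # Extra text character with no counterpart in the query.
                best = min(best, dp[i][j - 1] + default_cost)
                # Exact match (free) or single-character substitution.
                sub = 0.0 if query[i - 1] == text[j - 1] else default_cost
                best = min(best, dp[i - 1][j - 1] + sub)
            # String-to-string edits taken from the cost table.
            for (src, dst), cost in edit_costs.items():
                a, b = len(src), len(dst)
                if (a <= i and b <= j
                        and query[i - a:i] == src
                        and text[j - b:j] == dst):
                    best = min(best, dp[i - a][j - b] + cost)
            dp[i][j] = best

    return [(j, dp[n][j]) for j in range(1, m + 1) if dp[n][j] <= max_cost]


if __name__ == "__main__":
    ocr_text = "the rnodern age"  # "modern" misrecognized as "rnodern"
    print("modern" in ocr_text)  # exact search misses the word
    print(find_approx("modern", ocr_text, max_cost=0.5))
```

In this sketch, raising `max_cost` trades precision for recall: a higher threshold admits more degraded matches but also more false hits. The table is applied in one direction only (query side to text side), which is a simplification of this example.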
