EACL Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, 2012

An Unsupervised and Data-Driven Approach for Spell Checking in Vietnamese OCR-scanned Texts

Abstract

OCR (Optical Character Recognition) scanners do not always recognize text documents with 100% accuracy, and the resulting spelling errors make the texts hard to process further. This paper investigates spell checking for OCR-scanned text documents. First, we analyze in detail the characteristics of the spelling errors produced by an OCR scanner. Then, we propose a fully automatic approach that combines the error detection and error correction phases within a single scheme. The scheme is unsupervised and data-driven, which makes it suitable for resource-poor languages. Evaluated on a real Vietnamese dataset, our approach achieves acceptable performance (86% detection accuracy, 71% correction accuracy). We also analyze the results to show how accurate the approach can be.
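
The abstract does not describe the scheme in detail, so the following is only a minimal sketch of what an unsupervised, data-driven detect-then-correct pass could look like, not the authors' algorithm. It assumes that a token never seen in an unlabeled reference corpus is a likely OCR error, and that a replacement can be picked by edit distance with bigram context as a tie-breaker. The names `build_counts`, `edit_distance`, `check`, `min_count`, and `max_dist` are hypothetical and chosen for illustration.

```python
import unicodedata
from collections import Counter


def tokenize(text):
    """Lowercase, normalize to NFC, and split on whitespace (assumed tokenization)."""
    return unicodedata.normalize("NFC", text).lower().split()


def build_counts(corpus_sentences):
    """Collect unigram and bigram counts from raw, unlabeled sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        tokens = tokenize(sentence)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def check(sentence, unigrams, bigrams, min_count=1, max_dist=2):
    """Detect unseen tokens and replace each with the closest corpus token.

    Detection: a token counts as an error if it occurs fewer than
    `min_count` times in the corpus (an assumption of this sketch).
    Correction: among corpus tokens within `max_dist` edits, prefer the
    smallest edit distance, then the strongest bigram with the previous
    output token, then the highest unigram count.
    """
    output = []
    for token in tokenize(sentence):
        if unigrams[token] >= min_count:
            output.append(token)  # judged correct, keep as-is
            continue
        left = output[-1] if output else None
        candidates = [w for w in unigrams if edit_distance(token, w) <= max_dist]
        if not candidates:
            output.append(token)  # nothing close enough, leave unchanged
            continue
        best = min(
            candidates,
            key=lambda w: (edit_distance(token, w), -bigrams[(left, w)], -unigrams[w], w),
        )
        output.append(best)
    return output


if __name__ == "__main__":
    uni, bi = build_counts(["tôi đi học", "tôi đi làm", "bạn đi học"])
    print(check("tôi đi hoc", uni, bi))
```

Both phases here read only corpus counts, which is what keeps the sketch free of labeled training data; a realistic system would need a far larger corpus and an error model tuned to the OCR confusions the paper analyzes.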
