Conference: International Conference on Document Analysis and Recognition

Extraction of Spelling Variations from Language Structure for Noisy Text Correction


Abstract

We describe a novel approach for the extraction of spelling variations from a list of instances. It relates distinctive infixes to distinctive infixes of referenced words. The distinctive infixes are extracted automatically from a (multi)set of instances and a referenced dictionary, without any additional expert knowledge. Based on the spelling variations retrieved during a learning (training) phase, we develop a correction algorithm that suggests and ranks candidates for a particular noisy word. The main advantage of our approach is that it provides good corrections for unobserved noisy words while remaining almost perfect on words observed during learning. Our experimental results on the normalisation of a typical reference corpus of Early Modern English letters improve significantly on previous results obtained with VARD2. We also achieve better results than those reported in [SMM07] and [MMGRSR07] on the OCR correction of the TREC-5 Confusion Track corpus [5].
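The learn-then-correct pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: instead of distinctive infixes, it assumes a much cruder rule extractor that strips the longest common prefix and suffix from each (variant, reference) training pair, and it ranks candidates simply by how often a rule was observed. All function names, the toy training pairs, and the toy dictionary are hypothetical and chosen only for illustration.

```python
from collections import Counter


def differing_infixes(variant, reference):
    """Strip the longest common prefix and suffix and return the two middles.

    Assumption: a crude stand-in for the paper's distinctive infixes.
    """
    limit = min(len(variant), len(reference))
    p = 0
    while p < limit and variant[p] == reference[p]:
        p += 1
    s = 0
    while (s < limit - p
           and variant[len(variant) - 1 - s] == reference[len(reference) - 1 - s]):
        s += 1
    return variant[p:len(variant) - s], reference[p:len(reference) - s]


def learn_variations(pairs):
    """Count (noisy infix -> reference infix) rules over the training pairs."""
    rules = Counter()
    for variant, reference in pairs:
        if variant == reference:
            continue
        noisy, clean = differing_infixes(variant, reference)
        if noisy:  # pure insertions are skipped to keep the sketch simple
            rules[(noisy, clean)] += 1
    return rules


def suggest(word, rules, dictionary):
    """Return dictionary candidates for `word`, most frequent rule first."""
    if word in dictionary:
        return [word]
    scores = Counter()
    for (noisy, clean), count in rules.items():
        start = word.find(noisy)
        while start != -1:
            candidate = word[:start] + clean + word[start + len(noisy):]
            if candidate in dictionary:
                scores[candidate] = max(scores[candidate], count)
            start = word.find(noisy, start + 1)
    # Sort by descending rule count, then alphabetically for determinism.
    return [c for c, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


# Hypothetical toy data, invented for illustration only.
pairs = [("vertue", "virtue"), ("merth", "mirth"),
         ("certen", "certain"), ("doe", "do")]
dictionary = {"virtue", "mirth", "certain", "do", "first", "pint", "paint"}
rules = learn_variations(pairs)

print(suggest("vertue", rules, dictionary))  # word seen during training
print(suggest("ferst", rules, dictionary))   # word not seen during training
print(suggest("pent", rules, dictionary))    # several ranked candidates
```

Because the rules are stored as infix rewrites rather than whole-word mappings, a rule learned from one word can correct a different, unobserved word, which is the behaviour the abstract highlights. A realistic system would need context-sensitive rules and a better ranking model than raw counts.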
