Orthographic Case Restoration Using Supervised Learning Without Manual Annotation

Abstract

One challenge in text processing is the treatment of case-insensitive documents such as speech recognition results. The traditional approach is to re-train a language model excluding case-related features. This paper presents an alternative two-step approach in which a preprocessing module (Step 1) restores the case-sensitive form of the text before it is fed to the core system (Step 2). Step 1 is implemented as a Hidden Markov Model trained on a large raw corpus of case-sensitive documents. It is demonstrated that this approach (i) outperforms the feature exclusion approach for Named Entity tagging, (ii) leads to limited degradation for semantic parsing and relationship extraction, (iii) reduces system complexity, and (iv) has wide applicability: the restored text can feed both statistical models and rule-based systems.
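The abstract does not specify the model's states, features, or smoothing, so the following is only a minimal sketch of how Step 1 might look, not the paper's implementation. It assumes a first-order HMM whose hidden states are three hypothetical case classes (`lower`, `cap`, `upper`), whose observations are lowercased tokens, and whose probabilities use add-one smoothing. The function names (`train`, `restore`, `case_tag`, `apply_tag`) and the toy corpus are illustrative assumptions. The sketch shows why no manual annotation is needed: the case labels are read directly off the case-sensitive training text.

```python
import math
from collections import Counter, defaultdict

TAGS = ("lower", "cap", "upper")  # hypothetical case classes, not from the paper
START = "<s>"                     # pseudo-state marking the sentence start


def case_tag(token):
    """Derive the training label from the surface form of a cased token."""
    if token.isupper() and len(token) > 1:
        return "upper"
    if token[:1].isupper():
        return "cap"
    return "lower"  # mixed-case forms such as "iPhone" are not modelled here


def apply_tag(word, tag):
    """Render a lowercased word in the predicted case class."""
    if tag == "upper":
        return word.upper()
    if tag == "cap":
        return word.capitalize()
    return word


def train(sentences):
    """Count transitions and emissions from raw case-sensitive sentences."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][lowercased word]
    vocab = set()
    for sent in sentences:
        prev = START
        for tok in sent:
            tag, word = case_tag(tok), tok.lower()
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumed smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def restore(words, model):
    """Viterbi decoding: pick the most likely case-class sequence."""
    trans, emit, vocab = model
    words = [w.lower() for w in words]
    if not words:
        return []
    n_words = len(vocab) + 1  # one extra outcome reserved for unseen words
    best = {
        t: (log_prob(trans[START], t, len(TAGS))
            + log_prob(emit[t], words[0], n_words), [t])
        for t in TAGS
    }
    for w in words[1:]:
        step = {}
        for t in TAGS:
            score, path = max(
                (best[p][0] + log_prob(trans[p], t, len(TAGS)), best[p][1])
                for p in TAGS
            )
            step[t] = (score + log_prob(emit[t], w, n_words), path + [t])
        best = step
    _, path = max(best.values())
    return [apply_tag(w, t) for w, t in zip(words, path)]


# Toy corpus standing in for the "large raw corpus" (illustrative only).
corpus = [
    ["the", "meeting", "with", "IBM", "is", "in", "Paris"],
    ["she", "met", "John", "in", "Paris"],
    ["IBM", "hired", "John"],
]
model = train(corpus)
print(restore(["john", "met", "ibm", "in", "paris"], model))
# → ['John', 'met', 'IBM', 'in', 'Paris']
```

In a two-step pipeline of the kind the abstract describes, the output of `restore` would be passed unchanged to the downstream component (Step 2), whether that component is a statistical model or a rule-based system.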
