Context-aware correction of spelling errors in Hungarian medical documents

Abstract

Owing to the growing need of acquiring medical data from clinical records, processing such documents is an important topic in natural language processing (NLP). However, for general NLP methods to work, a proper, normalized input is required. Otherwise the system is overwhelmed by the unusually high amount of noise generally characteristic of this kind of text. The different types of this noise originate from non-standard language use: short fragments instead of proper sentences, usage of Latin words, many acronyms and very frequent misspellings. In this paper, a method is described for the automated correction of spelling errors in Hungarian clinical records. First, a word-based algorithm was implemented to generate a ranked list of correction candidates for word forms regarded as incorrect. Second, the problem of spelling correction was modelled as a translation task, where the source language is the erroneous text and the target language is the corrected one. A Statistical Machine Translation (SMT) decoder performed the task of error correction. Since no orthographically correct proofread text from this domain is available, we could not use such a corpus for training the system. Instead, the word-based system was used to create translation models. In addition, a 3-gram token-based language model was used to model lexical context. Due to the high number of abbreviations and acronyms in the texts, the behaviour of these abbreviated forms was further examined both in the case of the context-unaware word-based and the SMT-decoder-based implementations. The results show that the SMT-based method outperforms the first candidate accuracy of the word-based ranking system. However, the normalization of abbreviations should be handled as a separate task.
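The two stages described in the abstract can be illustrated with a toy sketch. This is not the authors' system: the edit-distance-1 candidate generator, the frequency-based ranking, the add-one-smoothed trigram model, the beam search that stands in for an SMT decoder, and the tiny English corpus are all assumptions made for illustration. Names such as `train`, `candidates` and `correct` are hypothetical.

```python
import math
from collections import Counter

# Assumption: plain ASCII letters; real Hungarian text would also need accented letters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def train(sentences):
    """Collect unigram, trigram-history and trigram counts from tokenised sentences."""
    uni, hist, tri = Counter(), Counter(), Counter()
    for sent in sentences:
        padded = ["<s>", "<s>"] + sent
        uni.update(sent)
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            hist[(u, v)] += 1
            tri[(u, v, w)] += 1
    return uni, hist, tri

def edits1(word):
    """All strings at edit distance 1 from word (delete, transpose, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, uni):
    """Word-based stage: ranked (candidate, score) list for an unknown word form."""
    if word in uni:                      # known forms are treated as correct here
        return [(word, 1.0)]
    found = [w for w in edits1(word) if w in uni]
    if not found:                        # nothing to suggest: keep the token unchanged
        return [(word, 1.0)]
    total = sum(uni[w] for w in found)
    return sorted(((w, uni[w] / total) for w in found), key=lambda x: (-x[1], x[0]))

def correct(tokens, uni, hist, tri, beam=5):
    """Context-aware stage: combine candidate scores with a trigram language model."""
    vocab = len(uni)
    hyps = [(0.0, ["<s>", "<s>"])]
    for tok in tokens:
        expanded = []
        for score, seq in hyps:
            u, v = seq[-2], seq[-1]
            for cand, p in candidates(tok, uni):
                # Add-one smoothed trigram probability of cand given the two previous words.
                lm = (tri[(u, v, cand)] + 1) / (hist[(u, v)] + vocab)
                expanded.append((score + math.log(p) + math.log(lm), seq + [cand]))
        expanded.sort(key=lambda h: -h[0])
        hyps = expanded[:beam]           # keep only the best partial hypotheses
    return hyps[0][1][2:]

# Toy corpus (hypothetical data): "pan" is more frequent overall than "pain".
CORPUS = [s.split() for s in [
    "chest pain is mild", "the pan is hot", "the pan is old", "a pan is here",
]]
UNI, HIST, TRI = train(CORPUS)

print(candidates("pai", UNI)[0][0])                          # pan
print(correct("chest pai is mild".split(), UNI, HIST, TRI))  # ['chest', 'pain', 'is', 'mild']
```

In this toy setting the context-unaware ranking prefers the more frequent "pan", while the trigram context selects "pain", which mirrors the kind of gain over first-candidate accuracy that the abstract reports for the SMT-based method.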

Bibliographic record

  • Source
    Computer speech and language | 2016, Issue 1 | pp. 219-233 | 15 pages
  • Author affiliations

    Pazmany Peter Catholic University, Faculty of Information Technology and Bionics, 50/a Prater Street, 1083 Budapest, Hungary;

    Pazmany Peter Catholic University, Faculty of Information Technology and Bionics, 50/a Prater Street, 1083 Budapest, Hungary, MTA-PPKE Hungarian Language Technology Research Group, 50/a Prater Street, 1083 Budapest, Hungary;

    Pazmany Peter Catholic University, Faculty of Information Technology and Bionics, 50/a Prater Street, 1083 Budapest, Hungary, MTA-PPKE Hungarian Language Technology Research Group, 50/a Prater Street, 1083 Budapest, Hungary;

  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA);
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification
  • Keywords

    Spelling correction; Medical text processing; Agglutinating languages;
