
A Language Modeling Approach for Acronym Expansion Disambiguation


Abstract

Nonstandard words such as proper nouns, abbreviations, and acronyms are a major obstacle in natural language text processing and information retrieval. Acronyms, in particular, are difficult to read and process because they are often domain-specific with a high degree of polysemy. In this paper, we propose a language modeling approach for the automatic disambiguation of acronym senses using context information. First, a dictionary of all possible expansions of acronyms is generated automatically. The dictionary is used to search for all possible expansions, or senses, with which a given acronym may be expanded. The extracted dictionary consists of about 17 thousand acronym-expansion pairs defining 1,829 expansions from different fields, where the average number of expansions per acronym is 9.47. Training data is collected automatically from documents downloaded from the results of search engine queries. The collected data is used to build a unigram language model that models the context of each candidate expansion. In the in-context expansion prediction phase, the relevance of each acronym expansion candidate is calculated from the similarity between the context of the specific acronym occurrence and the language model of that candidate expansion. Unlike other work in the literature, our approach has the option to decline to expand an acronym when it is not confident in the disambiguation. We have evaluated the performance of our language modeling approach and compared it with a tf-idf discriminative approach.
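The scoring step described above can be sketched in a few lines. The fragment below is a minimal illustration, not the authors' implementation: it builds an add-one-smoothed unigram model per candidate expansion, scores an occurrence context by log-likelihood under each model, and declines to expand when the best candidate does not beat the runner-up by a margin (the paper's exact similarity measure and rejection criterion are not specified in the abstract; the `reject_margin` threshold and all names here are hypothetical).

```python
from collections import Counter
import math

def train_unigram_lm(docs):
    """Build an add-one-smoothed unigram model from training documents.

    Returns (word counts, total token count, vocabulary size); unseen
    words fall back to the smoothed probability 1 / (total + vocab).
    """
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts, sum(counts.values()), len(counts)

def log_likelihood(context_words, model):
    """Log-probability of the context under a smoothed unigram model."""
    counts, total, vocab = model
    return sum(math.log((counts[w] + 1) / (total + vocab))
               for w in context_words)

def disambiguate(context, candidate_models, reject_margin=1.0):
    """Pick the candidate expansion whose model best explains the context.

    Returns None (decline to expand) when the best score does not beat
    the runner-up by reject_margin nats -- a hypothetical stand-in for
    the confidence-based rejection described in the abstract.
    """
    words = context.lower().split()
    scored = sorted(
        ((log_likelihood(words, m), exp)
         for exp, m in candidate_models.items()),
        reverse=True,
    )
    best = scored[0]
    runner_up = scored[1] if len(scored) > 1 else (float("-inf"), None)
    if best[0] - runner_up[0] < reject_margin:
        return None  # not confident: decline to expand
    return best[1]
```

For example, with one model trained on medical text and another on NLP text, the occurrence context "conference papers linguistics" would resolve "ACL" to the computational-linguistics expansion, while a context sharing no vocabulary with either model would be rejected.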
