International Conference on Intelligent Text Processing and Computational Linguistics

A Language Modeling Approach for Acronym Expansion Disambiguation



Abstract

Nonstandard words such as proper nouns, abbreviations, and acronyms are a major obstacle in natural language text processing and information retrieval. Acronyms in particular are difficult to read and process because they are often domain-specific and highly polysemous. In this paper, we propose a language modeling approach for the automatic disambiguation of acronym senses using context information. First, a dictionary of all possible expansions of acronyms is generated automatically; it is then used to look up the candidate expansions, or senses, for a given acronym. The extracted dictionary consists of about 17 thousand acronym-expansion pairs covering 1,829 acronyms from different fields, with an average of 9.47 expansions per acronym. Training data is collected automatically from documents downloaded from the results of search engine queries. The collected data is used to build a unigram language model that models the context of each candidate expansion. At the in-context expansion prediction phase, the relevance of each candidate expansion is calculated from the similarity between the context of the specific acronym occurrence and the candidate expansion's language model. Unlike other work in the literature, our approach has the option to decline to expand an acronym when it is not confident in the disambiguation. We have evaluated the performance of our language modeling approach and compared it with a tf-idf discriminative approach.
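The pipeline described in the abstract — a per-expansion unigram language model built from automatically collected documents, context-based scoring, and the option to decline a low-confidence expansion — can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the Laplace smoothing and the margin-based rejection criterion are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter

def build_unigram_lm(docs):
    """Build a Laplace-smoothed unigram language model from training documents."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    def prob(tok):
        # Add-one smoothing so unseen context words get nonzero probability.
        return (counts[tok] + 1) / (total + vocab + 1)
    return prob

def score_expansion(context_tokens, lm):
    """Average log-likelihood of the context tokens under an expansion's LM."""
    return sum(math.log(lm(t)) for t in context_tokens) / max(len(context_tokens), 1)

def disambiguate(context, models, reject_margin=0.3):
    """Pick the candidate expansion whose LM best explains the context.

    Returns None (declines to expand) when the top score does not beat
    the runner-up by at least `reject_margin` — a stand-in for the
    paper's confidence-based rejection option.
    """
    tokens = context.lower().split()
    scored = sorted(((score_expansion(tokens, lm), exp)
                     for exp, lm in models.items()), reverse=True)
    if len(scored) > 1 and scored[0][0] - scored[1][0] < reject_margin:
        return None  # not confident enough in the disambiguation
    return scored[0][1]
```

For example, with one model trained on banking documents ("Automated Teller Machine") and one on networking documents ("Asynchronous Transfer Mode"), a context such as "withdraw cash from the bank" scores higher under the banking model, while a context sharing no vocabulary with either corpus falls inside the rejection margin and is left unexpanded.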
