Overcoming Data Sparseness Problem in Statistical Corpus Based Sense Disambiguation

Abstract

Data sparseness is a common problem for statistical corpus-based Word Sense Disambiguation (WSD) approaches [1]. A corpus usually needs a large amount of data to guarantee that all senses of an ambiguous word are represented. This is not easily achieved, especially for words that occur infrequently in the training corpus. For languages that lack large amounts of digitized resources, the sparseness problem is even worse. In this paper, we present an unsupervised framework that first learns the relationships between ambiguous words and their related words to reveal their most suitable senses, based on a proposed mathematical model and a set of bilingual resources: a non-aligned Portuguese-Chinese training corpus, a dictionary, and a sense inventory. For senses not found in the learning phase, bilingual examples from the dictionary and Singular Value Decomposition (SVD) [2] techniques are applied to overcome the sparseness problem. All senses found are converted into a set of rules and stored in the Word Sense database for later use in the disambiguation and translation process. Preliminary experimental results show that these strategies allow more senses to be learned in a sparse-data environment.
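
The abstract does not say how SVD is applied, so the following is only a generic illustration of why a truncated SVD can help with sparse counts, not the authors' method. It assumes a hypothetical sense-by-context-word count matrix (the matrix, the function name `low_rank_smooth`, and the choice of rank are all illustrative) and replaces it with its best low-rank approximation, which spreads evidence onto sense/context pairs that were never observed together.

```python
import numpy as np


def low_rank_smooth(counts: np.ndarray, rank: int) -> np.ndarray:
    """Return the best rank-`rank` approximation of `counts` via truncated SVD."""
    u, s, vt = np.linalg.svd(counts, full_matrices=False)
    # Keep only the `rank` largest singular values and their vectors.
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


# Hypothetical toy data: rows are two senses of one ambiguous word,
# columns are three context words, cells are co-occurrence counts.
# Sense 0 was never seen with context word 2, and sense 1 never with word 0.
counts = np.array([
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 2.0],
])

smoothed = low_rank_smooth(counts, rank=1)
print(np.round(smoothed, 3))  # every cell is approximately 1.0
```

In this toy case the two senses share context word 1, so the rank-1 approximation assigns a nonzero score to the unseen pairs. Keeping the full rank (`rank=2` here) reproduces the original counts, zeros included, so the amount of smoothing depends on how aggressively the rank is reduced.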
