Journal: ACM Transactions on Asian Language Information Processing

Incorporating Prior Knowledge into Word Embedding for Chinese Word Similarity Measurement


Abstract

Word embedding-based methods have received increasing attention for their flexibility and effectiveness in many natural language processing (NLP) tasks, including Word Similarity (WS). However, these approaches rely on high-quality corpora and neglect prior knowledge. Lexicon-based methods draw on the human knowledge encoded in semantic resources, e.g., Tongyici Cilin, HowNet, and Chinese WordNet, but they cannot handle unknown words. This article proposes a three-stage framework for measuring Chinese word similarity by incorporating prior knowledge obtained from lexicons and statistics into word embedding. In the first stage, we use retrieval techniques to crawl the contexts of word pairs from web resources and extend the context corpus. In the second stage, we investigate three types of single similarity measurements: lexicon similarities, statistical similarities, and embedding-based similarities. In the final stage, we combine these measurements using simple strategies based on math operations and a counter-fitting strategy based on an optimization method. To demonstrate our system's effectiveness, comparative experiments are conducted on the PKU-500 dataset. Our final results are Spearman/Pearson correlation coefficients of 0.561/0.516, which, to the best of our knowledge, outperform the previous state of the art. Experimental results on the Chinese MC-30 and SemEval-2012 datasets show that our system also performs well on other Chinese datasets, which demonstrates its transferability. Moreover, our system is not language-specific and can be applied to other languages, e.g., English.
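
The abstract does not spell out the combination formulas, so the following is only a minimal sketch of what a "simple combination strategy with math operations" and the Spearman/Pearson evaluation could look like. The function names (`combine`, `cosine`, `spearman`, `pearson`), the choice of mean/max/min as strategies, and all toy numbers are assumptions made for illustration; they are not taken from the article, and the counter-fitting strategy is not covered here.

```python
from math import sqrt
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combine(scores, strategy="mean"):
    """Merge several single similarity scores into one (hypothetical strategies)."""
    if strategy == "mean":
        return mean(scores)
    if strategy == "max":
        return max(scores)
    if strategy == "min":
        return min(scores)
    raise ValueError(f"unknown strategy: {strategy}")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up toy inputs, NOT data from the article or from PKU-500.
# Each row: (lexicon_sim, statistical_sim, embedding_u, embedding_v, human_score)
toy_pairs = [
    (0.9, 0.8, [1.0, 0.0], [0.9, 0.1], 9.0),
    (0.2, 0.3, [1.0, 0.0], [0.0, 1.0], 1.5),
    (0.6, 0.5, [1.0, 1.0], [1.0, 0.5], 6.0),
    (0.4, 0.1, [0.5, 1.0], [1.0, 0.2], 3.0),
]

predicted = [combine([lex, stat, cosine(u, v)]) for lex, stat, u, v, _ in toy_pairs]
gold = [human for *_, human in toy_pairs]

print("Spearman:", spearman(predicted, gold))
print("Pearson:", pearson(predicted, gold))
```

Averaging treats the three single measurements as equally reliable; a weighted mean or a learned combination would be natural alternatives when some resources are known to be noisier than others.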

Bibliographic details

  • Source
  • Author affiliations

    Dalian Univ Technol, Sch Comp Sci & Technol, Innovat Pk Bldg A0933, Dalian, Liaoning, Peoples R China;


  • Indexing information
  • Original format: PDF
  • Language: English
  • CLC classification
  • Keywords

    Chinese word similarity; word embedding; prior knowledge;

  • Date added to database: 2022-08-18 04:03:45
