
NLPCC 2016 Shared Task Chinese Words Similarity Measure via Ensemble Learning Based on Multiple Resources

Abstract

Word similarity measurement is a fundamental issue in many natural language processing tasks, and numerous Chinese word similarity algorithms have been proposed. Previous work has focused mainly on existing semantic knowledge bases or large-scale corpora. However, knowledge bases and corpora each have limitations in coverage and in how readily they can be updated. Ensemble learning is therefore used to improve performance by combining similarities from multiple sources. This paper describes a Chinese word similarity measure that ensembles knowledge-based and corpus-based algorithms. Specifically, the knowledge-based methods are built on TYCCL and HowNet, while two corpus-based methods compute similarities via web search engine retrieval and via deep learning on large-scale corpora (news and microblog). All similarities are combined through support vector regression to obtain the final similarity. Evaluation on the test dataset suggests that the TYCCL-based method performs best; however, with appropriately tuned parameters, ensemble learning can outperform all the other algorithms. In addition, deep learning on the news corpus performs better than the other corpus-based methods.
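The ensemble step described in the abstract, combining the individual similarity scores through support vector regression, can be illustrated with a minimal sketch. The code below is not the authors' implementation; it assumes scikit-learn's SVR, and the feature order (TYCCL, HowNet, web retrieval, word embeddings) as well as the example values are hypothetical placeholders.

```python
# Minimal sketch of the ensemble step: a support vector regressor maps the
# per-method similarity scores for a word pair to one final similarity.
# Not the authors' code; feature order and sample values are hypothetical.
import numpy as np
from sklearn.svm import SVR

# One row per word pair: [tyccl_sim, hownet_sim, web_search_sim, embedding_sim]
X_train = np.array([
    [0.82, 0.75, 0.60, 0.71],
    [0.10, 0.05, 0.20, 0.15],
    [0.55, 0.60, 0.40, 0.50],
])
# Gold similarity scores for those pairs (e.g. on a 1-10 scale).
y_train = np.array([8.5, 1.2, 5.6])

# Fit the regressor that learns how to weight and combine the four sources.
ensemble = SVR(kernel="rbf", C=1.0, epsilon=0.1)
ensemble.fit(X_train, y_train)

# Predict the final similarity for an unseen word pair.
X_test = np.array([[0.70, 0.68, 0.55, 0.64]])
print(ensemble.predict(X_test))  # one combined similarity score
```

In practice, each training row would come from the shared-task training pairs, with each column produced by one of the four similarity methods described above.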
