Conference: 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery

Chinese term extraction from web pages based on expected point-wise mutual information

Abstract

Point-wise Mutual Information (PMI) has been widely used in lexicon construction, term extraction, and text mining. However, PMI has a well-known tendency to overvalue the relatedness of word pairs that involve low-frequency words. To overcome this limitation, Expected Point-wise Mutual Information (PMI^k) has been proposed empirically. In this paper, we propose an automatic term recognition system for Chinese and prove theoretically that, with the parameter k ≥ 3, the PMI^k method can overcome the bias toward low-frequency words. Experimental results on the Chinese SINA blog and Baidu Tieba corpora show that, with a proper k value of 5, the system can achieve a precision greater than 81% for the top 1000 extracted terms without decreasing recall.
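
The low-frequency bias described in the abstract can be illustrated with a minimal sketch. The abstract does not state the formula, so this assumes the commonly cited form of the k-th order measure, log(p(x,y)^k / (p(x) p(y))), where k = 1 reduces to plain PMI. The function name `pmi_k` and all counts below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def pmi_k(pair_count, x_count, y_count, total, k=1):
    """Return log(p(x,y)^k / (p(x) * p(y))).

    Assumed definition (not confirmed by the abstract); k=1 is plain PMI.
    """
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return k * math.log(p_xy) - math.log(p_x) - math.log(p_y)


# Hypothetical counts in a corpus of one million adjacent word pairs.
N = 1_000_000
rare = dict(pair_count=1, x_count=1, y_count=1, total=N)            # seen once, always together
frequent = dict(pair_count=100, x_count=200, y_count=200, total=N)  # a well-attested pair

for k in (1, 3, 5):
    print(f"k={k}: rare={pmi_k(**rare, k=k):.2f}  frequent={pmi_k(**frequent, k=k):.2f}")
```

With these made-up counts, plain PMI (k = 1) ranks the pair seen only once above the well-attested pair, while k = 3 and k = 5 reverse that order, which is the kind of behavior the abstract attributes to k ≥ 3.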