
Chinese keyword extraction based on max-duplicated strings of the documents


Abstract

Corpus analysis methods for Chinese keyword extraction treat the corpus as a single sample of a stochastic language process. However, the distribution of keywords across the whole corpus differs greatly from their distribution within each document. Extraction based only on global statistical information can therefore find only the keywords that are significant across the whole corpus. Max-duplicated strings, by contrast, contain the locally significant keywords of each document. In this paper, we design an efficient algorithm that extracts the max-duplicated strings by building a PAT-tree for the document, so that keywords can then be selected from the max-duplicated strings by their SIG values in the corpus.
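
To make the idea of max-duplicated strings concrete, here is a minimal Python sketch. It is not the paper's PAT-tree algorithm: it counts substrings by brute force, which is only practical for short documents. It also assumes a definition the abstract does not spell out, namely that a max-duplicated string is a substring that occurs at least twice and cannot be extended by one character on either side without lowering its occurrence count. The function name `max_duplicated_strings` and its parameters are hypothetical. The SIG scoring step is not shown because the abstract does not define it.

```python
from collections import Counter


def max_duplicated_strings(text, min_len=2, max_len=8, min_count=2):
    """Return {substring: count} for repeated substrings that are maximal.

    Assumed definition (not taken from the paper): a substring is maximal
    if no one-character extension to the left or right occurs equally often.
    Brute-force counting stands in for the PAT-tree used in the paper.
    """
    counts = Counter()
    n = len(text)
    # Count substrings up to max_len + 1 characters so that extensions of
    # the longest candidates can also be looked up.
    for i in range(n):
        for length in range(1, min(max_len + 1, n - i) + 1):
            counts[text[i:i + length]] += 1

    alphabet = set(text)
    result = {}
    for s, c in counts.items():
        if c < min_count or not (min_len <= len(s) <= max_len):
            continue
        # If some extension occurs exactly as often, s is never seen without
        # that extra character, so s is not maximal.
        extendable = any(
            counts.get(ch + s, 0) == c or counts.get(s + ch, 0) == c
            for ch in alphabet
        )
        if not extendable:
            result[s] = c
    return result


if __name__ == "__main__":
    doc = "关键词提取很重要，关键词提取很难"
    print(max_duplicated_strings(doc))
```

In this sketch, shorter fragments such as "关键词" are dropped because they never occur outside the longer repeated string, which is the local redundancy that a max-duplicated string is meant to collapse. Candidates returned this way would then be filtered by a corpus-level score, as the abstract describes.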
