SEMANTIC-BASED APPROACH FOR IDENTIFYING TOPICS IN A CORPUS OF TEXT-BASED ITEMS

Abstract

A method of identifying topics in a corpus that includes a plurality of text-based items begins by extracting keytext from each of the plurality of text-based items, resulting in sets of keytext. The method continues by processing the keytext sets to generate a respective semantic footprint for each of the text-based items, resulting in a plurality of semantic footprints. The semantic footprints are used to calculate similarity values for the text-based items, wherein the similarity values indicate commonality between pairs of the text-based items. The method continues by clustering the text-based items into a number of topic groups, wherein the clustering is influenced by the similarity values, and by generating a topic heading for each of the number of topic groups, resulting in a number of topic headings. Next, the text-based items are grouped into accessible topic groups associated with the topic headings.
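The steps in the abstract (keytext extraction, semantic footprints, pairwise similarity, clustering, topic headings) can be illustrated with a minimal Python sketch. The abstract does not say how any step is implemented, so every technique here is an assumption chosen for brevity: keytext is taken to be the most frequent non-stopword tokens, a footprint is a bag-of-words vector over that keytext, similarity is cosine similarity, clustering is threshold-based single linkage, and the heading is the most common keytext term in a group. All function names, the stopword list, and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not from the source).
STOPWORDS = frozenset(
    {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}
)


def extract_keytext(text, top_n=5):
    """Stand-in keytext extraction: the top_n most frequent non-stopword tokens."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(top_n)]


def footprint(keytext):
    """Stand-in semantic footprint: a sparse bag-of-words vector over the keytext."""
    return Counter(keytext)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(footprints, threshold=0.3):
    """Single-linkage clustering: merge items whose similarity meets the threshold."""
    parent = list(range(len(footprints)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(footprints)), 2):
        if cosine(footprints[i], footprints[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(footprints)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def topic_heading(group, footprints):
    """Stand-in heading: the most common keytext term across the group."""
    total = Counter()
    for i in group:
        total.update(footprints[i])
    return total.most_common(1)[0][0] if total else "(untitled)"


def identify_topics(items, threshold=0.3):
    """Run the full pipeline and return a mapping of topic heading -> items."""
    footprints = [footprint(extract_keytext(text)) for text in items]
    topics = {}
    for group in cluster(footprints, threshold):
        heading = topic_heading(group, footprints)
        topics.setdefault(heading, []).extend(items[i] for i in group)
    return topics


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat with another cat",
        "A cat and a kitten are on the mat",
        "Stocks and bonds are in the market for bonds",
    ]
    for heading, members in identify_topics(corpus).items():
        print(heading, "->", members)
```

In this sketch the similarity values influence clustering only through a fixed threshold. A fuller system would likely use richer footprints, such as embeddings or concept vectors, and a clustering method that chooses the number of topic groups from the data.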
