Journal: Procedia - Social and Behavioral Sciences

Automatize Document Topic and Subtopic Detection with Support of a Corpus

Abstract

In this article, we propose a new method, called paragraph extension, for automatically detecting the topic and subtopics of a document. In paragraph extension, a document is treated as a set of paragraphs, and a paragraph merging technique merges similar consecutive paragraphs until no similar consecutive paragraphs are left. Next, the counts of similar words in each merged paragraph are summed to construct subtopic scores, using a corpus designed so that the words related to each subtopic can be found. Paragraph vectors are therefore represented by subtopics instead of words. The subtopic of a paragraph is the most frequent one in its paragraph vector, whereas the topic of the document is the most dispersive subtopic in the document. An experimental topic/subtopic corpus was constructed for the sport and education topics and was supplemented with WordNet to obtain synonyms. We evaluate the proposed method on a data set of 40 documents randomly selected from the education and sport topics. The experimental results show that the average success ratio is about 83% for topic detection and about 68% for subtopic detection.
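
The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the similarity measure, the merge threshold, the corpus layout, or a precise definition of "most dispersive", so everything below is an assumption made for illustration: Jaccard overlap of word sets as the similarity, a hypothetical toy `CORPUS`, and "most dispersive" read as the subtopic that scores above zero in the largest number of merged paragraphs, reported through its parent topic. The WordNet synonym expansion is left out. This is not the authors' implementation.

```python
import re
from collections import Counter

# Hypothetical toy corpus (not from the paper): topic -> subtopic -> related words.
CORPUS = {
    "sport": {
        "football": {"goal", "striker", "pitch", "league"},
        "tennis": {"racket", "serve", "court", "set"},
    },
    "education": {
        "exams": {"exam", "grade", "test", "score"},
        "teaching": {"teacher", "lesson", "classroom", "curriculum"},
    },
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def jaccard(a, b):
    """Assumed similarity measure: Jaccard overlap of two token lists."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def merge_paragraphs(paragraphs, threshold=0.15):
    """Merge similar consecutive paragraphs until none are left to merge."""
    merged = [tokenize(p) for p in paragraphs if p.strip()]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged) - 1):
            if jaccard(merged[i], merged[i + 1]) >= threshold:
                # Replace the pair with one concatenated paragraph and rescan.
                merged[i:i + 2] = [merged[i] + merged[i + 1]]
                changed = True
                break
    return merged


def subtopic_vector(tokens, corpus):
    """Score each (topic, subtopic) by summing counts of its related words."""
    counts = Counter(tokens)
    return {
        (topic, sub): sum(counts[w] for w in words)
        for topic, subs in corpus.items()
        for sub, words in subs.items()
    }


def detect(paragraphs, corpus=CORPUS, threshold=0.15):
    """Return (document topic, list of subtopics per merged paragraph)."""
    merged = merge_paragraphs(paragraphs, threshold)
    spread = Counter()  # number of merged paragraphs in which a subtopic occurs
    para_subtopics = []
    for tokens in merged:
        vec = subtopic_vector(tokens, corpus)
        best = max(vec, key=vec.get)
        para_subtopics.append(best[1] if vec[best] > 0 else None)
        for key, score in vec.items():
            if score > 0:
                spread[key] += 1
    if not spread:
        return None, para_subtopics
    # Assumed reading of "most dispersive": the most widely spread subtopic.
    (topic, _subtopic), _count = spread.most_common(1)[0]
    return topic, para_subtopics


if __name__ == "__main__":
    doc = [
        "The striker scored a late goal on a wet pitch.",
        "Another goal from the striker kept the team top of the league.",
        "Her serve was strong and the racket control won the set.",
    ]
    print(detect(doc))
```

On this toy input, the first two paragraphs share enough words to be merged, so `detect(doc)` yields the topic `"sport"` with the per-paragraph subtopics `["football", "tennis"]`. Stop words are not filtered here, which inflates the overlap between unrelated paragraphs; a real corpus would likely need stop-word removal and a tuned threshold.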
