Journal: Information Sciences: An International Journal

Topic identification based on document coherence and spectral analysis

Abstract

In a world of vast information overload, well-optimized retrieval of relevant information has become increasingly important. Dividing large documents that span multiple topics into sets of coherent subdocuments facilitates the information retrieval process. This paper presents a novel technique that automatically subdivides a textual document into consistent components using a coherence quantification function. The function builds stem or term chains that link document entities, such as sentences or paragraphs, through the recurrence of stems or terms. Applying the function to a document yields a coherence graph that links the document's entities. Spectral graph partitioning techniques then divide this coherence graph into a number of subdocuments, and a novel technique is introduced to determine the most suitable number of subdocuments. Each subdocument is an aggregation of (not necessarily adjacent) entities. Performance tests are conducted in test environments based on standardized datasets to demonstrate the algorithm's capabilities. The relevance of these techniques for information retrieval and text mining is discussed.
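The pipeline described in the abstract (entities, coherence graph, spectral partitioning) can be illustrated with a minimal sketch. It is not the paper's method. The coherence weight here is simply the number of lowercase terms two entities share, standing in for the stem or term chain function, which the abstract does not specify. The number of subdocuments is fixed at two, since the paper's technique for choosing it is not described here. The function names `terms`, `coherence_graph`, `fiedler_vector`, and `bisect` are hypothetical.

```python
import re


def terms(entity):
    """Lowercased word set of one entity (no stemming; a simplifying assumption)."""
    return set(re.findall(r"[a-z]+", entity.lower()))


def coherence_graph(entities):
    """Adjacency matrix: w[i][j] = number of terms entities i and j share."""
    bags = [terms(e) for e in entities]
    n = len(bags)
    return [[len(bags[i] & bags[j]) if i != j else 0 for j in range(n)]
            for i in range(n)]


def fiedler_vector(w, iterations=200):
    """Approximate the Laplacian eigenvector of the second-smallest eigenvalue.

    Runs power iteration on (c*I - L), with L = D - W, while keeping the
    iterate orthogonal to the constant vector (the eigenvalue-0 eigenvector).
    """
    n = len(w)
    degree = [sum(row) for row in w]
    c = 2 * max(degree) + 1  # exceeds the largest eigenvalue of L
    v = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant component
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in v]
        # Multiply by (c*I - L) = (c*I - D) + W.
        v = [(c - degree[i]) * v[i] + sum(w[i][j] * v[j] for j in range(n))
             for i in range(n)]
    mean = sum(v) / n
    return [x - mean for x in v]


def bisect(entities):
    """Split entity indices into two groups by the sign of the Fiedler vector."""
    f = fiedler_vector(coherence_graph(entities))
    left = [i for i, x in enumerate(f) if x < 0]
    right = [i for i, x in enumerate(f) if x >= 0]
    return left, right


# Two interleaved topics, weakly linked by the shared word "today".
entities = [
    "Cats chase mice.",
    "Stocks fell sharply today.",
    "Mice fear cats today.",
    "Markets and stocks fell.",
]
left, right = bisect(entities)
print(left, right)
```

In this toy input the two animal sentences and the two market sentences are not adjacent, which mirrors the abstract's point that a subdocument may aggregate non-adjacent entities. A production version would more likely use a library eigensolver and a stemmer rather than hand-written power iteration and raw tokens.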
