Conference: Machine Learning and Data Mining in Pattern Recognition

Investigating Usage of Text Segmentation and Inter-passage Similarities to Improve Text Document Clustering


Abstract

Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods represent text documents using the simple Bag-of-Words (BOW) model. A document, however, is an organized structure consisting of various text segments or passages. Single-term analysis of this kind treats the whole document as a single semantic unit and thus ignores other semantic units such as sentences and passages. In this paper, we attempt to take advantage of the underlying subtopic structure of text documents and investigate whether clustering can be improved if the text segments of two documents are used when calculating the similarity between them. We concentrate on examining the effects of combining the suggested inter-document similarities (based on inter-passage similarities) with traditional inter-document similarities, following a simple approach. Experimental results on standard data sets suggest an improvement in the clustering of text documents.
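The idea of blending a whole-document similarity with a passage-based one can be illustrated with a minimal sketch. The abstract does not specify the segmentation method, the passage-matching rule, or how the two scores are combined, so everything below is an assumption made for illustration: passages are split on blank lines, each passage is matched to its best counterpart by cosine similarity, and the two scores are mixed linearly with a hypothetical weight `alpha`. The function names are likewise hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-Words term-frequency vector (whitespace tokens, lowercased)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_passages(doc):
    """Assumption: passages are separated by blank lines.
    The paper uses a text segmentation step that the abstract does not name."""
    return [bow(p) for p in doc.split("\n\n") if p.strip()]


def passage_similarity(doc_a, doc_b):
    """Assumption: average best-match passage similarity, taken in both
    directions so that the score is symmetric."""
    pa, pb = split_passages(doc_a), split_passages(doc_b)
    if not pa or not pb:
        return 0.0
    a_to_b = sum(max(cosine(x, y) for y in pb) for x in pa) / len(pa)
    b_to_a = sum(max(cosine(y, x) for x in pa) for y in pb) / len(pb)
    return (a_to_b + b_to_a) / 2


def combined_similarity(doc_a, doc_b, alpha=0.5):
    """Assumption: linear mix of whole-document and passage-based similarity.
    alpha is a hypothetical weight, not a value taken from the paper."""
    whole = cosine(bow(doc_a), bow(doc_b))
    return alpha * whole + (1 - alpha) * passage_similarity(doc_a, doc_b)
```

A pairwise matrix of `combined_similarity` values could then be passed to any clustering algorithm that accepts precomputed similarities. Which algorithm and which weighting the authors used would have to be taken from the full paper.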
