Journal of Computational Methods in Sciences and Engineering

Shape pattern matching: A tool to cluster unstructured text documents

Abstract

Research in text mining has gained importance with the rapid growth in the number of electronic news articles, books, research papers, and e-mail messages. Clustering organizes text documents in an unsupervised fashion. In this paper, we propose an algorithm for clustering unstructured text documents using shape pattern matching. The Vector Space Model is used to represent the dataset as a term-weight matrix. The high-dimensional vector space is mapped to a two-dimensional plane in which the term weights are plotted against a time axis, so that each text document is represented as a time sequence. Initially, the documents are broadly grouped into categories that are determined using domain knowledge. The relevant portion of each document vector is then clipped out, and the shape patterns present in these clipped portions are observed. The shape patterns are indexed by preparing their alphabet. Grouping the documents within a category that share the same shape pattern yields the required clusters.
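
The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, and everything the abstract leaves unspecified is an assumption here: the smoothed TF-IDF weighting, the use of the term index as the "time" axis, the three-letter alphabet (U for a rise, D for a fall, F for flat), and the hypothetical names `term_weight_matrix`, `shape_string`, and `cluster_by_shape`. The initial grouping into domain-knowledge categories is reduced to a caller-supplied `clip` range that selects the relevant portion of each document vector.

```python
import math
from collections import defaultdict


def term_weight_matrix(docs, vocabulary):
    """Return TF-IDF weights: one row per document, one column per term.

    The smoothed IDF formula is an assumption; the abstract only says
    that a term-weight matrix is built from the Vector Space Model.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocabulary}
    rows = []
    for toks in tokenized:
        row = []
        for t in vocabulary:
            tf = toks.count(t)
            idf = math.log((1 + n_docs) / (1 + df[t])) + 1.0
            row.append(tf * idf)
        rows.append(row)
    return rows


def shape_string(sequence, tol=1e-9):
    """Encode a weight sequence over an assumed alphabet {U, D, F}.

    Each symbol describes the step between two neighbouring weights:
    U = rise, D = fall, F = flat (within the tolerance `tol`).
    """
    symbols = []
    for prev, cur in zip(sequence, sequence[1:]):
        if cur - prev > tol:
            symbols.append("U")
        elif prev - cur > tol:
            symbols.append("D")
        else:
            symbols.append("F")
    return "".join(symbols)


def cluster_by_shape(docs, vocabulary, clip):
    """Group document indices whose clipped sequences share a shape string.

    `clip` is a (start, stop) pair of column indices standing in for the
    "relevant portion" of the document vector.
    """
    start, stop = clip
    clusters = defaultdict(list)
    for i, row in enumerate(term_weight_matrix(docs, vocabulary)):
        clusters[shape_string(row[start:stop])].append(i)
    return dict(clusters)


docs = [
    "goal goal match team team",
    "goal goal goal match team team",
    "goal match match team",
]
print(cluster_by_shape(docs, ["goal", "match", "team"], (0, 3)))
```

In this toy input the first two documents have different weights but the same fall-then-rise shape, so they land in one cluster, while the third forms its own. Matching on shape rather than on absolute weights is the point of the sketch; a real alphabet would likely need to be richer than three symbols.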