Journal: Computer Science & Information Technology

Punjabi Text Clustering by Sentence Structure Analysis

Abstract

Punjabi text document clustering groups documents that share a topic by analyzing their sentence structure. The prevalent algorithms in this field use the vector space model, which treats each document as a bag of words. Meaning in natural language, however, inherently depends on word sequences, which such models ignore during clustering. This paper presents a new Punjabi text clustering algorithm, Clustering by Sentence Structure Analysis (CSSA), which was applied to 221 Punjabi news articles available on news sites. Phrases are extracted by carefully analyzing the structure of each sentence using the basic grammatical rules of Karaka. Sequences formed from these phrases are used to identify the topic and to measure similarity among all documents, which results in meaningful clusters.
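The contrast the abstract draws between a bag of words and word sequences can be illustrated with a small sketch. This is not the CSSA algorithm: the Karaka-based phrase extraction is not reproduced here, and word bigrams stand in for phrase sequences as an assumption. The function names, the Jaccard measure, the greedy grouping, the threshold value, and the English toy sentences are all hypothetical choices made for illustration only.

```python
from collections import Counter


def bag_of_words(text):
    # Order-insensitive representation: word -> count.
    return Counter(text.lower().split())


def word_bigrams(text):
    # Order-sensitive stand-in for phrase sequences (assumption:
    # adjacent word pairs, not Karaka-derived phrases).
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def jaccard(a, b):
    # Set overlap in [0, 1]; two empty sets count as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.3):
    # Greedy single pass: a document joins the first cluster whose
    # first member is similar enough, otherwise it starts a new one.
    feats = [word_bigrams(d) for d in docs]
    clusters = []
    for i, f in enumerate(feats):
        for c in clusters:
            if jaccard(f, feats[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters


s1 = "the dog bit the man"
s2 = "the man bit the dog"
same_bag = bag_of_words(s1) == bag_of_words(s2)        # identical bags
seq_sim = jaccard(word_bigrams(s1), word_bigrams(s2))  # below 1.0

docs = [
    "india won the cricket match",
    "india won the cricket series",
    "rain floods the city roads",
    "heavy rain floods the city",
]
groups = cluster(docs)
```

The two example sentences have identical bags of words but different bigram sets, so only the sequence-based measure tells them apart. The greedy grouping is just the simplest way to turn pairwise similarities into clusters; a real system would likely use a more robust clustering step.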
