Source: IAENG International Journal of Computer Science

Clustering Short Text using a Centroid-Based Lexical Clustering Algorithm

Abstract

Traditional lexical clustering methods treat text as a bag of words, measuring the similarity between two text fragments on the basis of word co-occurrence. While this approach is suitable for clustering large fragments of text (e.g., documents), it performs poorly on smaller fragments such as sentences (e.g., short texts or quotations), because two sentences may be semantically similar while sharing no words. This paper proposes a new variant of the standard k-means algorithm for short text clustering, based on the notion of synonym expansion semantic vectors. These vectors represent short text using semantic information derived from a lexical database constructed to identify the correct meaning of a word based on the context in which it appears. Thus, whereas conventional application of the k-means algorithm is based on measuring the distance between patterns, the proposed approach is based on measuring the semantic similarity between patterns (e.g., sentences). This enables it to exploit more of the semantic information available within the clustered sentences. Empirical results show that the proposed variant performs favorably against other clustering techniques on two specially constructed datasets of famous quotations and on benchmark datasets in several other domains, and that incorporating a short text similarity measure based on synonym expansion leads to a significant improvement in centroid-based clustering performance. The method is therefore of potential use in a variety of knowledge discovery tasks, including text summarisation and text mining.
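The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `SYNONYMS` table is a hypothetical toy stand-in for a lexical database, no word sense disambiguation is performed, and the function names (`expand`, `cosine`, `centroid`, `cluster`) are invented for illustration. Each sentence is turned into a bag of words extended with synonyms, and a k-means-style loop then assigns each sentence to the centroid it is most similar to under cosine similarity.

```python
import math
from collections import Counter

# Hypothetical toy synonym table; an assumption standing in for a lexical database.
SYNONYMS = {
    "happy": {"glad", "joyful"},
    "glad": {"happy", "joyful"},
    "joyful": {"happy", "glad"},
    "car": {"automobile"},
    "automobile": {"car"},
    "fast": {"quick"},
    "quick": {"fast"},
}

def expand(sentence):
    """Return a sparse vector: each word of the sentence plus its listed synonyms."""
    counts = Counter()
    for word in sentence.lower().split():
        counts[word] += 1.0
        for synonym in SYNONYMS.get(word, ()):
            counts[synonym] += 1.0
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vector in vectors:
        for key, value in vector.items():
            total[key] += value
    return {key: value / len(vectors) for key, value in total.items()}

def cluster(sentences, k, iterations=10):
    """k-means-style loop that assigns by highest similarity, not smallest distance."""
    vectors = [expand(s) for s in sentences]
    centroids = [dict(vectors[i]) for i in range(k)]  # seed with the first k sentences
    labels = None
    for _ in range(iterations):
        new_labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return labels

sentences = ["happy child", "fast car", "glad kid", "quick automobile"]
labels = cluster(sentences, k=2)
```

In this toy input, "happy child" and "glad kid" share no words, so a plain bag-of-words cosine between them is zero, while their expanded vectors overlap through the synonym entries. Seeding with the first `k` sentences is a simplification chosen to keep the sketch deterministic.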
