International Conference on Research Challenges in Information Science

Multi-level K-means text clustering technique for topic identification for competitor intelligence


Abstract

The proliferation of the web as an easily accessible information resource has led many corporations to gather competitor intelligence from the internet. While collecting such information from the internet is easy, collating and structuring it for business decision makers is difficult. Text-clustering-based topic identification techniques are expected to be very useful for this application. With appropriate clustering techniques, a competitor intelligence corpus gathered from the web can be divided into topical groups, which makes the information comparatively easier for managers to analyze. This paper presents a study of the effectiveness of the standard K-means text clustering algorithm applied at multiple levels, in a top-down, divide-and-conquer fashion, to a competitor intelligence corpus created from publicly available web sources such as news, blogs, and research papers. The paper also demonstrates that the multi-level K-means (ML-KM) clustering technique can determine the optimal number of clusters as part of the clustering process. The cluster validity metric used to determine cluster quality is explained, along with the other user-controlled configuration parameters. It is found empirically that ML-KM also addresses one problem of stand-alone standard K-means (S-KM): its bias towards convex, spherical clusters, which results in bigger clusters subsuming smaller ones. This ability to detect smaller clusters makes ML-KM more suitable than stand-alone S-KM for clustering competitor intelligence text corpora, where small niche clusters can lead to important findings. Experimental results are presented for both ML-KM and stand-alone S-KM on the competitor intelligence corpus as well as the standard Reuters corpus.
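
The top-down, divide-and-conquer idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the cluster validity metric or the configuration parameters, so the `spread` function (mean distance to the centroid), the parameters `max_spread`, `min_size` and `max_depth`, and the farthest-point initialization are all assumptions chosen for illustration. Toy 2-D points stand in for document vectors such as TF-IDF.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; returns the non-empty clusters.
    Assumption: deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    clusters = [points]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [centroid(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def spread(cluster):
    """Hypothetical validity metric: mean distance to the centroid."""
    center = centroid(cluster)
    return sum(math.dist(p, center) for p in cluster) / len(cluster)

def multilevel_kmeans(points, k=2, max_spread=1.0, min_size=2,
                      depth=0, max_depth=5):
    """Re-cluster any cluster that is still too loose; tight ones become leaves."""
    if (len(points) < max(min_size, k) or depth >= max_depth
            or spread(points) <= max_spread):
        return [points]
    parts = kmeans(points, k)
    if len(parts) < 2:  # K-means could not split this cluster further
        return [points]
    leaves = []
    for part in parts:
        leaves.extend(multilevel_kmeans(part, k, max_spread, min_size,
                                        depth + 1, max_depth))
    return leaves

# Two larger groups and one small, distant group.
docs = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50), (50, 51)]
leaves = multilevel_kmeans(docs)
print([len(c) for c in leaves])  # -> [4, 4, 2]
```

In this toy run the number of leaf clusters (three) is never passed in; it emerges from the stopping rule, which is the sense in which a multi-level scheme can determine the cluster count during clustering. A single flat pass with `k=2` on the same data merges the two nearby groups into one cluster.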
