
Enhancing Clustering and Labeling for Large-Scale Information Retrieval Systems.


Abstract

Classic information retrieval (IR) systems rely on ranking algorithms to serve users with ordered lists of documents according to search queries. Sometimes, however, users do not have very specific information needs or cannot accurately articulate them in queries. Cluster-based IR systems, such as those based on the Scatter/Gather paradigm, have been used to help users clarify their information needs and promote learning via interactive document clustering and summarization. These systems have the potential to help users browse large document collections and explore topics. However, their effectiveness is often constrained by poor clustering quality, ambiguous cluster labels, and inefficiency in processing large-scale data sets.

In interactive clustering, term distributions vary across clusters or subsets of a collection. Classic TF*IDF (term frequency * inverse document frequency) term weighting, especially IDF, which counts document frequency in the overall (global) data, does not take into account the shifted term distributions in a (local) subset and is often incapable of identifying the most informative terms within that subset. To improve clustering quality with meaningful labels, we propose two novel term weighting schemes, namely TF*ICDF and DF*LIG. TF*ICDF, or Term Frequency * Inverse within-Cluster Document Frequency, integrates the local subset information into term weighting. It outperforms TF*IDF in several aspects for clustering and labeling with various configurations.

In addition, we propose Least Information Gain (LIG) based on the least information theory, which, similar to Information Gain (IG) based on KL divergence, measures the amount of information required for a probability distribution change. Based on LIG, we develop the DF*LIG method for cluster labeling. With DF*LIG, terms that carry more information in revealing the contents of clusters are chosen as labels, resulting in better performance in terms of coverage, overlap and precision in comparison to DF*IG. When integrated with TF*ICDF for term weighting and clustering, DF*LIG produces more representative, distinctive and accurate labels than when it is combined with TF*IDF.

In order to improve clustering efficiency and support data-intensive processing, we develop distributed versions of the TF*ICDF and DF*LIG algorithms as well as a parallel clustering algorithm named Pruned Affinity Propagation (PAP) in the Spark framework. The proposed algorithms efficiently process large-scale data sets by taking advantage of the computational capabilities of individual processors and nodes. The distributed TF*ICDF and DF*LIG methods scale very well: their efficiency improves significantly with an increased number of processors. Compared with the original affinity propagation algorithm, PAP achieves much higher efficiency while maintaining strong effectiveness. Results also show that the execution time of PAP is greatly reduced by increasing the number of processors and remains competitive with large numbers of documents, indicating its scalability.

With the support of these effective and scalable methods for text clustering and cluster labeling, a cluster-based IR system can be greatly improved in its ability to dynamically identify key features, to produce meaningful clusters, and to generate representative terms as labels. With the ability to accommodate large-scale data sets, such a system can help users discover important patterns in the data and help them learn and explore in a dynamic, complex information space.
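
The contrast between global IDF and a within-cluster document frequency can be illustrated with a small sketch. The abstract does not state the TF*ICDF formula, so the code below is only an assumption: it reuses the familiar tf * log(N / df) form and simply changes which set of documents N and df are counted over. The function names and the toy collection are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Count, for each term, how many documents in `docs` contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set() so a term counts once per document
    return df


def tf_idf_weights(doc, reference_docs):
    """Weight each term in `doc` as tf * log(N / df).

    N and df are counted over `reference_docs`. Passing the whole
    collection gives classic TF*IDF; passing only the documents of one
    cluster gives a within-cluster variant (an assumed stand-in for
    TF*ICDF, not the thesis's definition).
    """
    df = doc_freq(reference_docs)
    n = len(reference_docs)
    tf = Counter(doc)
    return {t: tf[t] * math.log(n / df[t]) for t in tf if df[t] > 0}


# Hypothetical toy collection: six tokenized documents.
collection = [
    ["jaguar", "car", "engine"],
    ["jaguar", "car", "speed"],
    ["jaguar", "cat", "jungle"],
    ["stock", "market", "price"],
    ["stock", "bond", "price"],
    ["weather", "rain", "cloud"],
]
# Suppose interactive clustering has grouped the three "jaguar" documents.
cluster = collection[:3]

global_weights = tf_idf_weights(cluster[0], collection)
local_weights = tf_idf_weights(cluster[0], cluster)

for term in cluster[0]:
    print(term, round(global_weights[term], 3), round(local_weights[term], 3))
```

In this toy data, "jaguar" occurs in every document of the cluster, so its within-cluster weight drops to zero even though its global weight is positive, while "car" and "engine" keep positive weights because they still separate documents inside the subset. This is the kind of shift in term distribution that global IDF does not capture.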

Bibliographic record

  • Author: Gong, Xuemei
  • Author affiliation: Drexel University
  • Degree-granting institution: Drexel University
  • Subject: Information science
  • Degree: Ph.D.
  • Year: 2015
  • Pages: 118 p.
  • Total pages: 118
  • Original format: PDF
  • Language of text: English
