Journal of information & knowledge management

Effectiveness of Heuristic Based Approach on the Performance of Indexing and Clustering of High Dimensional Data


Abstract

Data in practical applications (e.g., images, molecular biology) are mostly characterised by high dimensionality and a huge number of data instances. Although feature reduction techniques have been successful in reducing the dimensionality for certain applications, dealing with high dimensional data is still an area that receives considerable attention in the research community. Indexing and clustering of high dimensional data are two of the most challenging techniques, and both have a wide range of applications. However, these techniques suffer from performance issues as the dimensionality and size of the processed data increase. In our effort to tackle this problem, this paper demonstrates a general optimisation technique applicable to indexing and clustering algorithms which need to calculate distances and check them against some minimum distance condition. The optimisation technique is a simple calculation that finds the minimum possible distance between two points and checks this distance against the minimum distance condition, thus reusing already computed values and reducing how often a more complicated distance function must be computed. The effectiveness and usefulness of the proposed optimisation technique have been demonstrated by applying it, with successful results, to clustering and indexing techniques. We utilised a number of clustering techniques, including agglomerative hierarchical clustering, k-means clustering, and DBSCAN. Runtime for all three algorithms with this optimisation was reduced, and the clusters they returned were verified to remain the same as those of the original algorithms. The optimisation technique also shows potential for substantially reducing runtime when indexing large databases using NAQtree; in addition, it shows potential for reducing runtime as databases grow larger in both dimensionality and size.
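The abstract does not spell out how the minimum possible distance is calculated, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes Euclidean distance and uses the reverse triangle inequality on precomputed vector norms as the cheap lower bound; the function names (`norm`, `euclidean`, `neighbours_within`) and the threshold parameter `eps` are hypothetical.

```python
import math


def norm(p):
    """Euclidean length of a point; computed once per point and reused."""
    return math.sqrt(sum(x * x for x in p))


def euclidean(p, q):
    """Full Euclidean distance, the 'more complicated' function we try to skip."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def neighbours_within(points, query, eps):
    """Return (indices of points within eps of query, number of full distances computed).

    Assumption for this sketch: the lower bound is | ||p|| - ||q|| |, which can
    never exceed the true distance ||p - q|| (reverse triangle inequality).
    If even this minimum possible distance is larger than eps, the pair
    cannot satisfy the distance condition and the full computation is skipped.
    """
    norms = [norm(p) for p in points]  # already-computed values that get reused
    query_norm = norm(query)
    hits, full_computations = [], 0
    for i, p in enumerate(points):
        if abs(norms[i] - query_norm) > eps:
            continue  # pruned by the cheap lower bound
        full_computations += 1
        if euclidean(p, query) <= eps:
            hits.append(i)
    return hits, full_computations


if __name__ == "__main__":
    pts = [(0, 0, 0), (1, 0, 0), (10, 10, 10), (0, 1, 0)]
    print(neighbours_within(pts, (0, 0, 0), 1.5))
```

Because the bound never overestimates the true distance, pruning with it cannot change which points pass the condition, which is consistent with the abstract's observation that the optimised algorithms return the same clusters as the originals. A range query of this shape is the kind of step that DBSCAN performs for every point.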
