
Stream-Dashboard: A big data stream clustering framework with applications to social media streams.


Abstract

Data mining is concerned with detecting patterns in raw datasets, which are then used to unearth knowledge that might not have been discovered using conventional querying or statistical methods. This discovered knowledge has been used to empower decision makers in countless applications spanning many disciplines, including business, education, astronomy, security, and information retrieval, to name a few. Many applications generate massive amounts of data continuously and at an increasing rate. This is the case for user activity on social networks such as Facebook and Twitter. This flow of data has been termed, appropriately, a data stream, and it has introduced a set of new challenges for discovering its evolving patterns using data mining techniques. Data stream clustering is concerned with detecting evolving patterns in a data stream using only the similarities between the data points as they arrive, without the use of any external information (i.e., unsupervised learning).

In this dissertation, we propose a complete and generic framework, Stream-Dashboard, to simultaneously mine, track, and validate clusters in a big data stream. The proposed framework consists of three main components: an online data stream clustering algorithm, a component for tracking and validating pattern behavior using regression analysis, and a component that uses the behavioral information about the detected patterns to improve the quality of the clustering algorithm. As the first component, we propose RINO-Streams, an online clustering algorithm that incrementally updates the clustering model using robust statistics and incremental optimization. The second component is a methodology that we call TRACER, which continuously performs a set of statistical tests using regression analysis to track the evolution of the detected clusters, their characteristics, and quality metrics. For the last component, we propose a method to build behavioral profiles for the clustering model over time, which can be used to improve the performance of the online clustering algorithm, for example by adapting the initial values of the input parameters.

The performance and effectiveness of the proposed framework were validated using extensive experiments, and its use was demonstrated on a challenging real-world application, specifically the unsupervised mining of evolving cluster stories in one pass from Twitter social media streams.
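
The abstract does not give the update rules of RINO-Streams or the statistical tests used by TRACER, so the following is only a minimal sketch of the two general ideas it names: updating clusters one point at a time, and fitting a regression line to a cluster metric to see how it evolves. Everything in it is an assumption made for illustration, including the names `OnlineClusterer` and `trend_slope`, the fixed `radius` gate, the exponential `decay` factor, and the choice of cluster weight as the tracked metric. It is not the dissertation's method.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Cluster:
    center: List[float]
    weight: float = 1.0                                  # time-decayed point count
    history: List[float] = field(default_factory=list)   # weight after each update


class OnlineClusterer:
    """Hypothetical one-pass clusterer (illustrative only, not RINO-Streams)."""

    def __init__(self, radius: float, decay: float = 0.99) -> None:
        self.radius = radius    # assumed gate: farther points start a new cluster
        self.decay = decay      # assumed forgetting factor applied at every step
        self.clusters: List[Cluster] = []

    def update(self, point: Sequence[float]) -> None:
        # Age every cluster so that stale patterns gradually lose weight.
        for c in self.clusters:
            c.weight *= self.decay
        nearest = min(self.clusters,
                      key=lambda c: math.dist(c.center, point),
                      default=None)
        if nearest is None or math.dist(nearest.center, point) > self.radius:
            # A distant point does not pull any existing center (a crude outlier gate).
            self.clusters.append(Cluster(center=list(point)))
        else:
            # Incremental weighted-mean update of the nearest center.
            nearest.weight += 1.0
            step = 1.0 / nearest.weight
            nearest.center = [m + step * (x - m)
                              for m, x in zip(nearest.center, point)]
        # Record the metric that will be tracked over time.
        for c in self.clusters:
            c.history.append(c.weight)


def trend_slope(values: Sequence[float]) -> float:
    """Ordinary least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


# Synthetic stream: 50 points near the origin, then 50 points near (10, 10).
stream = [(0.1 if i % 2 == 0 else -0.1, 0.0) for i in range(50)]
stream += [(10.1 if i % 2 == 0 else 9.9, 10.0) for i in range(50)]

model = OnlineClusterer(radius=1.0)
for p in stream:
    model.update(p)

for i, c in enumerate(model.clusters):
    print(i, [round(x, 3) for x in c.center], trend_slope(c.history[-50:]))
```

In this toy stream, the slope over the most recent window is negative for the cluster that stopped receiving points and positive for the one that is still growing. A tracking component could use a signal of this kind to flag fading or emerging patterns, although the actual tests in TRACER may differ.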

Bibliographic record

  • Author: Hawwash, Basheer
  • Author affiliation: University of Louisville
  • Degree-granting institution: University of Louisville
  • Subject: Computer Science; Information Technology; Web Studies; Artificial Intelligence
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 229 p.
  • Total pages: 229
  • Original format: PDF
  • Language: English
