Exploring the power of heterogeneous information sources

Abstract

The big data challenge is a unique opportunity for both data mining and database research and engineering. A vast ocean of data is collected from trillions of connected devices in real time on a daily basis, and useful knowledge is usually buried in data of multiple genres, from different sources, in different formats, and with different types of representation. Many interesting patterns cannot be extracted from a single data collection, but have to be discovered through the integrative analysis of all available heterogeneous data sources. Although many algorithms have been developed to analyze multiple information sources, real applications continuously pose new challenges: data can be gigantic, noisy, unreliable, dynamically evolving, highly imbalanced, and heterogeneous. Meanwhile, users provide limited feedback, have growing privacy concerns, and ask for actionable knowledge. In this thesis, we propose to explore the power of multiple heterogeneous information sources in such challenging learning scenarios. There are two interesting perspectives in learning from the correlations among multiple information sources: explore their similarities (consensus combination), or explore their differences (inconsistency detection).

In consensus combination, we focus on the task of classification with multiple information sources. Multiple information sources for the same set of objects can provide complementary predictive power, and by combining their expertise, the prediction accuracy is significantly improved. However, the major challenge is that it is hard to obtain sufficient and reliable labeled data for effective training, because labeling requires the effort of experienced human annotators. In some data sources, we may only have a large amount of unlabeled data. Although such unlabeled information does not directly generate label predictions, it provides useful constraints on the classification task. Therefore, we first propose a graph-based consensus maximization framework to combine multiple supervised and unsupervised models obtained from all the available information sources. We further demonstrate the benefits of combining multiple models in two specific learning scenarios. In transfer learning, we propose an effective model combination framework to transfer knowledge from multiple sources to a target domain with no labeled data. We also demonstrate the robustness of model combination on dynamically evolving data.

On the other hand, when unexpected disagreement is encountered across diverse information sources, this might raise a red flag and require in-depth investigation. Another line of this thesis research is to explore differences among multiple information sources to find anomalies. We first propose a spectral method to detect objects that behave inconsistently across multiple heterogeneous information sources as a new type of anomaly. Traditional anomaly detection methods discover anomalies based on the degree of deviation from normal objects in one data source, whereas the proposed approach detects anomalies according to the degree of inconsistency across multiple sources. The principle of inconsistency detection can benefit many applications, and in particular, we show how this principle can help identify anomalies in information networks and distributed systems. We propose probabilistic models to detect anomalies in a social community by comparing link and node information, and to detect system problems from connected machines in a distributed system by modeling correlations among multiple machines.

In this thesis, we go beyond the scope of traditional ensemble learning to address challenges faced by many applications with multiple data sources. With the proposed consensus combination framework, labeled data are no longer a requirement for successful multi-source classification; instead, the use of existing labeling experts is maximized by integrating knowledge from relevant domains and unlabeled information sources. The proposed concept of inconsistency detection across multiple data sources opens up a new direction for anomaly detection. The detected anomalies, which cannot be found by traditional anomaly detection techniques, provide new insights into the application area. The algorithms we developed have proven useful in many areas, including social network analysis, cyber-security, and business intelligence, and have the potential to be applied to many other areas, such as healthcare, bioinformatics, and energy efficiency. As both the amount of data and the number of sources in our world have been exploding, there are still great opportunities as well as numerous research challenges for the inference of actionable knowledge from multiple heterogeneous sources of massive data collections.
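The idea of scoring objects by cross-source inconsistency, rather than by deviation within a single source, can be illustrated with a small sketch. This is not the spectral method proposed in the thesis, which the abstract does not specify; it is a much simpler stand-in that assumes each source has already been reduced to a cluster labeling of the same objects. The function name `inconsistency_scores` and the toy labelings are hypothetical.

```python
import numpy as np

def inconsistency_scores(labels_a, labels_b):
    """Score each object by how differently two sources group it.

    labels_a, labels_b: cluster labels for the same objects, one per source
    (an assumption of this sketch; the thesis works with richer data).
    Returns one score in [0, 1] per object: the fraction of other objects
    whose co-membership with it (same cluster or not) differs between
    the two sources.
    """
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("both labelings must be 1-D and cover the same objects")
    n = a.size
    if n < 2:
        return np.zeros(n)
    # Pairwise co-membership matrices: entry (i, j) is True when
    # objects i and j share a cluster in that source.
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    # An object is inconsistent to the extent its neighbours change.
    disagree = same_a != same_b
    return disagree.sum(axis=1) / (n - 1)

# Toy example: object 5 sits with objects 3 and 4 in source A
# but with objects 0, 1 and 2 in source B.
source_a = [0, 0, 0, 1, 1, 1]
source_b = [0, 0, 0, 1, 1, 0]
scores = inconsistency_scores(source_a, source_b)
most_inconsistent = int(np.argmax(scores))  # index of the flagged object
```

In this toy case object 5 would look normal inside either source on its own, since it belongs to a well-populated cluster in both; it stands out only when the two groupings are compared.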

Bibliographic record

  • Author: Gao Jing
  • Year: 2011
  • Original format: PDF
  • Language: English
