Twenty-ninth International Conference on Very Large Databases; Sep 9-12, 2003; Berlin, Germany

Checks and Balances: Monitoring Data Quality Problems in Network Traffic Databases


Abstract

Internet Service Providers (ISPs) use real-time data feeds of aggregated traffic in their networks to support technical as well as business decisions. A fundamental difficulty in building decision support tools on aggregated traffic data feeds is data quality. Data quality problems stem from network-specific issues (irregular polling caused by UDP packet drops and delays, topological mislabelings, etc.) and make it difficult to distinguish between artifacts and actual phenomena, rendering data analysis based on such data feeds ineffective. In principle, traditional integrity constraints and triggers may be used to enforce data quality. In practice, data cleaning is done outside the database and is ad hoc. Unfortunately, these approaches are too rigid and limited for the subtle data quality problems arising from network data, where existing problems morph with network dynamics, new problems emerge over time, and poor-quality data in a local region may itself indicate an important phenomenon in the underlying network. We need a new approach, both in principle and in practice, to face data quality problems in network traffic databases. We propose a continuous data quality monitoring approach based on probabilistic, approximate constraints (PACs). These are simple, user-specified rule templates with open parameters for tolerance and likelihood. We use statistical techniques to instantiate suitable parameter values from the data, and show how to apply them for monitoring data quality. In principle, our PAC-based approach can be applied to data quality problems in any data feed. We present PAC-Man, the system that manages PACs for the entire aggregate network traffic database in a large ISP, and show that it is very effective in monitoring data quality problems.
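To make the idea of a rule template with open tolerance and likelihood parameters more concrete, here is a minimal Python sketch. It is not the paper's PAC-Man implementation, and the abstract does not specify the rule language or the statistical technique used. The class `PAC`, the function `fit_pac`, the "value stays within a tolerance of a target" rule form, the quantile-based fitting, and the 300-second polling interval are all assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PAC:
    """Hypothetical rule: |value - target| <= tolerance should hold
    for at least a `likelihood` fraction of the observed values."""
    target: float
    tolerance: float
    likelihood: float

    def satisfied_fraction(self, values: Sequence[float]) -> float:
        # Fraction of values that fall within the tolerance band.
        ok = sum(1 for v in values if abs(v - self.target) <= self.tolerance)
        return ok / len(values)

    def holds(self, values: Sequence[float]) -> bool:
        # The constraint is approximate (tolerance) and probabilistic
        # (likelihood): it may be violated by a minority of values.
        return self.satisfied_fraction(values) >= self.likelihood


def fit_pac(history: Sequence[float], target: float, likelihood: float) -> PAC:
    """Assumed fitting step: choose the smallest tolerance such that at
    least `likelihood` of the historical values satisfy the rule."""
    deviations = sorted(abs(v - target) for v in history)
    k = max(1, math.ceil(likelihood * len(deviations)))
    return PAC(target, deviations[k - 1], likelihood)


# Made-up polling intervals in seconds; one poll was badly delayed.
history = [300, 301, 299, 302, 298, 300, 305, 295, 300, 600]
pac = fit_pac(history, target=300, likelihood=0.9)

healthy_window = [300, 299, 301, 303, 300]
degraded_window = [300, 600, 900, 300, 300]
for name, window in [("healthy", healthy_window), ("degraded", degraded_window)]:
    print(name, pac.satisfied_fraction(window), pac.holds(window))
```

In this sketch a monitor would re-evaluate `holds` on each new window of the feed and raise an alert when it returns False, rather than rejecting individual records the way a hard integrity constraint would.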
