Checks and Balances: Monitoring Data Quality Problems in Network Traffic Databases

Abstract

Internet Service Providers (ISPs) use realtime data feeds of aggregated traffic in their network to support technical as well as business decisions. A fundamental difficulty with building decision support tools based on aggregated traffic data feeds is one of data quality. Data quality problems stem from network-specific issues (irregular polling caused by UDP packet drops and delays, topological mislabelings, etc.), and make it difficult to distinguish between artifacts and actual phenomena, rendering data analysis based on such data feeds ineffective. In principle, traditional integrity constraints and triggers may be used to enforce data quality. In practice, data cleaning is done outside the database and is ad-hoc. Unfortunately, these approaches are too rigid and limited for the subtle data quality problems arising from network data where existing problems morph with network dynamics, new problems emerge over time, and poor quality data in a local region may itself indicate an important phenomenon in the underlying network. We need a new approach - both in principle and in practice - to face data quality problems in network traffic databases. We propose a continuous data quality monitoring approach based on probabilistic, approximate constraints (PACs). These are simple, user-specified rule templates with open parameters for tolerance and likelihood. We use statistical techniques to instantiate suitable parameter values from the data, and show how to apply them for monitoring data quality. In principle, our PAC-based approach can be applied to data quality problems in any data feed. We present PAC-Man, which is the system that manages PACs for the entire aggregate network traffic database in a large ISP, and show that it is very effective in monitoring data quality problems.
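
The abstract does not spell out a concrete PAC, so the Python sketch below is only an illustration under stated assumptions, not the authors' PAC-Man implementation. It assumes a hypothetical rule template "consecutive traffic readings differ by at most epsilon, with probability at least delta", where epsilon stands in for the tolerance parameter and delta for the likelihood parameter. The names `SmoothnessPAC`, `fit_pac`, `satisfied_fraction`, and `violates` are invented for this example, and using an empirical quantile to pick epsilon is one simple statistical choice among many.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SmoothnessPAC:
    """Hypothetical rule template: |x[t] - x[t-1]| <= epsilon
    should hold with probability at least delta."""
    epsilon: float  # tolerance (open parameter)
    delta: float    # required likelihood in (0, 1] (open parameter)


def abs_diffs(values: Sequence[float]) -> List[float]:
    """Absolute differences between consecutive readings."""
    return [abs(b - a) for a, b in zip(values, values[1:])]


def fit_pac(history: Sequence[float], delta: float) -> SmoothnessPAC:
    """Instantiate epsilon from historical data (an assumed method):
    the smallest observed step size such that at least a delta
    fraction of historical steps are no larger than it."""
    diffs = sorted(abs_diffs(history))
    if not diffs:
        raise ValueError("need at least two historical readings")
    k = max(1, math.ceil(delta * len(diffs)))
    return SmoothnessPAC(epsilon=diffs[k - 1], delta=delta)


def satisfied_fraction(pac: SmoothnessPAC, window: Sequence[float]) -> float:
    """Fraction of consecutive pairs in the window within tolerance."""
    diffs = abs_diffs(window)
    if not diffs:
        return 1.0  # nothing to check, treat as satisfied
    return sum(d <= pac.epsilon for d in diffs) / len(diffs)


def violates(pac: SmoothnessPAC, window: Sequence[float]) -> bool:
    """Flag the window when the constraint holds less often than delta."""
    return satisfied_fraction(pac, window) < pac.delta


if __name__ == "__main__":
    # Made-up readings for illustration only.
    history = [100, 102, 101, 103, 102, 104, 103, 105, 104, 106, 150]
    pac = fit_pac(history, delta=0.9)
    print(pac)
    print(violates(pac, [100, 101, 103, 102, 104]))
    print(violates(pac, [100, 140, 101, 160, 102]))
```

Unlike a hard integrity constraint, a check of this shape tolerates occasional outliers and only raises a flag when violations become more frequent than the fitted likelihood allows, which is the general idea the abstract attributes to PACs.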

Bibliographic details

Source: Twenty-ninth International Conference on Very Large Databases; Sep 9-12, 2003; Berlin, Germany | 2003 | pp. 536-547 | 12 pages
Conference location: Berlin (DE)
Authors: Flip Korn; S. Muthukrishnan; Yunyue Zhu
Author affiliation: ATT Labs-Research, Florham Park, NJ 02932
Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology
Date indexed: 2022-08-26 14:15:30

Similar literature

1. Error checking of large land quality databases through data mining based on low frequency associations [J]. Qiu Xiao-Qian, Zhu A-Xing, Hu Yue-Ming. Land Degradation and Development. 2020, No. 15
2. The EUROCARE-5 study on cancer survival in Europe 1999-2007: Database, quality checks and statistical analysis methods [J]. Rossi Silvia, Baili Paolo, Capocaccia Riccardo. European journal of cancer: official journal for European Organization for Research and Treatment of Cancer (EORTC) [and] European Association for Cancer Research (EACR). 2015, No. 15
3. A database quality review process with interim checks [J]. Brunelle R, Kleyle R. Drug information journal. 2002, No. 2
4. Checks and Balances: Monitoring Data Quality Problems in Network Traffic Databases [C]. Flip Korn, S. Muthukrishnan, Yunyue Zhu. International conference on very large databases. 2003
5. Assessing data quality in a sensor network for environmental monitoring [D]. Ramirez Garcia, Gesuri. 2011
6. A Comparison of Data Quality Assessment Checks in Six Data Sharing Networks [O]. Tiffany J. Callahan, Alan E. Bauck, David Bertoch
7. Data Visualization for Quality-Check Purposes of Monitored Electricity Consumption in All Office Buildings in the ESL Database [O]. Sreshthaputra A., Abushakra B., Haberl J. S. 2000
