Analyzing Quantitative Databases: Image is Everything

Abstract

Traditional statistical methods deal with corroborating given hypotheses on a given body of data. However, generating the hypothesis itself is a matter of intuition and ingenuity. It is clearly impossible to test all hypotheses on a database with millions of records and hundreds of fields. There have been attempts to bridge this gap through data mining. Association generation is a method of creating such statistical hypotheses for binary data. For quantitative databases the situation is still not good. There are a number of known methods. One is a reduction to binary data by creating intervals and then generating associations. This method is computationally expensive. Another suggested method was by generating associations that are statistically interesting. This method also was tried only on small databases and is applicable only for binary relations, e.g., in certain ranges of field X, field Y lies significantly outside its average. We suggest a method that answers some of the problems with the current techniques. Our idea is based on using visualization techniques and image processing ideas to rank subsets of fields according to the relation between them in the database. This ranking suggests the hypotheses to be statistically investigated. Our method has the following advantages: 1. It is scalable. Our algorithm is mainly based on analyzing histograms of the data set, thus is more efficient. It is also naturally suitable for sampling. 2. It is generalizable in the size of the set of fields. No current method handles more than a binary relation. 3. It affords comparability between fields over different base sets. This allows a uniform scale for different sets of fields in different databases. In this paper we present an algorithmic methodology and the results of its application to the Census Bureau databases cpsm93p and nhis93ac.
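The abstract does not spell out how the histograms are scored, so the following is only a minimal sketch of the general idea of ranking field pairs from their joint histogram. It assumes mutual information as the dependence score, which is a stand-in chosen for illustration and not the authors' image-processing measure. The function names `histogram_dependence` and `rank_field_pairs`, the bin count, and the synthetic fields `age`, `income`, and `noise` are all hypothetical.

```python
import itertools
import numpy as np


def histogram_dependence(x, y, bins=16):
    """Estimate mutual information (in nats) from a 2-D histogram of x and y.

    Assumption: mutual information is used here only as an illustrative
    score; the paper's own measure is not given in the abstract.
    """
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    joint = counts / counts.sum()                  # joint cell probabilities
    px = joint.sum(axis=1, keepdims=True)          # marginal of x (column vector)
    py = joint.sum(axis=0, keepdims=True)          # marginal of y (row vector)
    expected = px * py                             # joint under independence
    mask = joint > 0                               # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / expected[mask])))


def rank_field_pairs(table, bins=16):
    """Rank every pair of fields in `table` (dict: name -> 1-D array).

    Returns a list of ((field_a, field_b), score), highest score first.
    """
    scores = [
        ((a, b), histogram_dependence(table[a], table[b], bins))
        for a, b in itertools.combinations(table, 2)
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Synthetic data, invented for this example only.
    rng = np.random.default_rng(0)
    n = 5000
    age = rng.uniform(18, 80, n)
    table = {
        "age": age,
        "income": 1000 * age + rng.normal(0, 5000, n),  # depends on age
        "noise": rng.normal(0, 1, n),                    # unrelated field
    }
    for pair, score in rank_field_pairs(table):
        print(pair, round(score, 3))
```

Because the score is computed from bin counts alone, the same routine can be run on a random sample of the records, which is in the spirit of the scalability and sampling claims above. Pairs near the top of the ranking are candidates for a follow-up statistical test, not conclusions in themselves.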