Journal: Statistical Analysis and Data Mining

MR plot: A big data tool for distinguishing distributions



Abstract

Big data enables reliable estimation of continuous probability density, cumulative distribution, survival, hazard rate, and mean residual functions (MRFs). We illustrate that a plot of the MRF provides the best resolution for distinguishing between distributions. At each point, the MRF gives the mean excess of the data beyond the threshold. The graph of the empirical MRF, called here the MR plot, provides an effective visualization tool. A variety of theoretical and data-driven examples illustrate that MR plots of big data preserve the shape of the MRF and that complex models require bigger data. The MRF is an optimal predictor of the excess of the random variable. With a suitable prior, the expected MRF gives the Bayes risk in the form of the entropy functional of the survival function, called here the survival entropy. We show that the survival entropy is dominated by the standard deviation (SD) and that equality between the two measures characterizes the exponential distribution. The empirical survival entropy provides a data concentration statistic that is strongly consistent, easy to compute, and less sensitive than the SD to heavy-tailed data. An application uses the New York City Taxi database, with millions of trip times, to illustrate the MR plot as a powerful tool for distinguishing distributions.
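The two statistics named in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions, which the abstract does not spell out: the empirical MRF at a threshold t is the average of x - t over the observations with x > t, and the survival entropy is taken to be the functional -∫ S(x) log S(x) dx evaluated at the empirical survival function. The function names `empirical_mrf` and `empirical_survival_entropy` are hypothetical and are not taken from the paper.

```python
import numpy as np

def empirical_mrf(data, thresholds):
    """Mean excess beyond each threshold t: average of (x - t) over x > t.

    Returns NaN where no observation exceeds the threshold.
    """
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    # suffix_sums[i] = x[i] + ... + x[n-1]; suffix_sums[n] = 0
    suffix_sums = np.concatenate([np.cumsum(x[::-1])[::-1], [0.0]])
    t = np.atleast_1d(np.asarray(thresholds, dtype=float))
    idx = np.searchsorted(x, t, side="right")  # first index with x > t
    counts = n - idx                           # observations beyond t
    out = np.full(t.shape, np.nan)
    ok = counts > 0
    out[ok] = suffix_sums[idx[ok]] / counts[ok] - t[ok]
    return out

def empirical_survival_entropy(data):
    """Assumed form: -integral of S_n(x) * log(S_n(x)) dx.

    S_n is the empirical survival function, a step function equal to
    (n - i) / n between the i-th and (i+1)-th order statistics.
    """
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    gaps = np.diff(x)                    # widths of the steps
    s = (n - np.arange(1, n)) / n        # step heights, all in (0, 1)
    return float(-np.sum(gaps * s * np.log(s)))

# Illustration on simulated exponential data (not the taxi data).
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=100_000)
grid = np.quantile(sample, [0.1, 0.5, 0.9])
print(empirical_mrf(sample, grid))   # roughly flat for exponential data
print(empirical_survival_entropy(sample), sample.std())
```

Plotting `empirical_mrf` over a grid of thresholds would give an MR plot in the sense described above. For the exponential sample, the two printed concentration measures should be close to each other, in line with the characterization stated in the abstract.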
