Source: PLoS Clinical Trials

High throughput nonparametric probability density estimation


Abstract

In high throughput applications, such as those found in bioinformatics and finance, it is important to determine accurate probability distribution functions despite having only minimal information about data characteristics, and without relying on human subjectivity. Such an automated process for univariate data is implemented here by merging the maximum entropy method with single order statistics and maximum likelihood. The only required properties of the random variables are that they are continuous and that they are, or can be approximated as, independent and identically distributed. A quasi-log-likelihood function based on single order statistics for sampled uniform random data is used to empirically construct a sample-size-invariant universal scoring function. A probability density estimate is then determined by iteratively improving trial cumulative distribution functions, where better estimates are quantified by the scoring function, which identifies atypical fluctuations. This criterion resists both under- and over-fitting the data, serving as an alternative to the Bayesian or Akaike information criterion. Multiple estimates of the probability density reflect uncertainties due to statistical fluctuations in random samples. Scaled quantile residual plots are also introduced as an effective diagnostic for visualizing the quality of the estimated probability densities. Benchmark tests show that estimates of the probability density function (PDF) converge to the true PDF as sample size increases, even on particularly difficult test densities that include discontinuities, multi-resolution scales, heavy tails, and singularities. These results indicate that the method is generally applicable to high throughput statistical inference.
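The scaled quantile residual diagnostic mentioned in the abstract rests on a standard fact: if a trial CDF F matches the sampling distribution, then u_(r) = F(x_(r)) for the sorted samples behaves like the r-th uniform order statistic, with mean r/(n+1) and variance mu(1-mu)/(n+2). The sketch below is an illustration of that idea only, assuming this common scaling; the function name and details are not taken from the paper's implementation.

```python
import numpy as np

def scaled_quantile_residuals(samples, cdf):
    """Residuals of cdf(sorted samples) against uniform order-statistic
    means, scaled by the order-statistic standard deviation.

    Near-unit-magnitude fluctuations around zero suggest a good fit;
    large systematic trends flag a poor density estimate.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    r = np.arange(1, n + 1)
    u = cdf(x)                                  # trial CDF at sorted data
    mu = r / (n + 1.0)                          # mean of r-th uniform order statistic
    sigma = np.sqrt(mu * (1.0 - mu) / (n + 2.0))  # its standard deviation
    return (u - mu) / sigma

# Illustration on exponential data with a correct and an incorrect trial CDF.
rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=1000)
sqr_good = scaled_quantile_residuals(data, lambda x: 1.0 - np.exp(-x))
sqr_bad = scaled_quantile_residuals(data, lambda x: 1.0 - np.exp(-2.0 * x))
```

With the correct CDF the residuals stay of order one; with the mismatched rate they grow far beyond that, which is exactly the kind of atypical fluctuation the scoring function is meant to penalize.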
