Journal: Astronomy and Computing

Nearest neighbor density ratio estimation for large-scale applications in astronomy

Abstract

In astronomical applications of machine learning, the distribution of objects used for building a model is often different from the distribution of the objects the model is later applied to. This is known as sample selection bias, which is a major challenge for statistical inference as one can no longer assume that the labeled training data are representative. To address this issue, one can re-weight the labeled training patterns to match the distribution of unlabeled data that are available already in the training phase. There are many examples in practice where this strategy yielded good results, but estimating the weights reliably from a finite sample is challenging. We consider an efficient nearest neighbor density ratio estimator that can exploit large samples to increase the accuracy of the weight estimates. To solve the problem of choosing the right neighborhood size, we propose to use cross-validation on a model selection criterion that is unbiased under covariate shift. The resulting algorithm is our method of choice for density ratio estimation when the feature space dimensionality is small and sample sizes are large. The approach is simple and, because of the model selection, robust. We empirically find that it is on a par with established kernel-based methods on relatively small regression benchmark datasets. However, when applied to large-scale photometric redshift estimation, our approach outperforms the state-of-the-art. (C) 2015 Elsevier B.V. All rights reserved.
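
To make the re-weighting idea concrete, here is a minimal sketch of one common form of a nearest neighbor density ratio estimator. The abstract does not spell out the formula, so the form used here is an assumption: for each labeled training point, take the radius that reaches its K-th nearest training neighbor, count the unlabeled test points inside that radius, and scale the count by the sample sizes. The function name `knn_density_ratio`, the exclusion of the query point from its own neighborhood, and the brute-force search are illustrative choices, not the authors' implementation. A large-scale application would replace the brute-force search with a spatial index such as a k-d tree.

```python
import math
import random


def knn_density_ratio(train, test, k):
    """Estimate w(x) ~ p_test(x) / p_train(x) at every training point.

    Assumed estimator form (not confirmed by the abstract): let r be the
    distance from x to its k-th nearest training neighbor (x itself
    excluded) and m the number of test points within r. Then
    w(x) = (n_train / n_test) * (m / k).
    """
    n_train, n_test = len(train), len(test)
    weights = []
    for i, x in enumerate(train):
        # Distances from x to all other training points, ascending.
        dists = sorted(math.dist(x, y) for j, y in enumerate(train) if j != i)
        radius = dists[k - 1]
        # Count the test points that fall inside the same hypersphere.
        m = sum(1 for z in test if math.dist(x, z) <= radius)
        weights.append((n_train / n_test) * (m / k))
    return weights


if __name__ == "__main__":
    # Toy covariate shift in one dimension: the test data are shifted right.
    random.seed(0)
    train = [(random.gauss(0.0, 1.0),) for _ in range(300)]
    test = [(random.gauss(1.0, 1.0),) for _ in range(300)]
    w = knn_density_ratio(train, test, k=20)
    right = [v for (x,), v in zip(train, w) if x > 0.5]
    left = [v for (x,), v in zip(train, w) if x < -0.5]
    print("mean weight, x > 0.5: ", sum(right) / len(right))
    print("mean weight, x < -0.5:", sum(left) / len(left))
```

Training points in regions where test data are denser receive larger weights, so a model fitted with these weights emphasizes the part of the feature space it will later be applied to. The neighborhood size `k` is fixed by hand in this sketch. The abstract proposes choosing it by cross-validation on a model selection criterion that is unbiased under covariate shift, and that step is not shown here.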