
Multi-stage redundancy reduction


Abstract

In many important bioinformatics problems, data sets contain considerable redundancy, caused both by the evolutionary processes that generate the data and by biases in data collection procedures. Standard practice in bioinformatics is to remove redundancy so that no two sequences in a data set are more than forty percent similar. For small data sets, this can shrink already impoverished data beyond the boundary of practicality. Alternatively, one can include all available data and ensure only that the training and test samples are separated by the required redundancy gap. However, this encourages overfitting, because the model is exposed to a highly redundant training set. We outline a process of multi-stage redundancy reduction, whereby scarce data can be used effectively without compromising the integrity of the model or the testing procedure.
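The two conventional practices the abstract contrasts can be sketched roughly as follows. This is only an illustration under stated assumptions: `similarity` uses Python's `difflib.SequenceMatcher` as a toy stand-in for alignment-based sequence identity, and the function names `reduce_redundancy` and `enforce_gap` are hypothetical. The abstract does not specify the stages of the proposed multi-stage method, so that method is not implemented here.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.40  # the forty percent cut-off mentioned in the abstract


def similarity(a: str, b: str) -> float:
    """Toy stand-in for pairwise sequence identity (assumption: a real
    pipeline would use an alignment-based measure instead)."""
    return SequenceMatcher(None, a, b).ratio()


def reduce_redundancy(seqs, threshold=THRESHOLD):
    """Practice 1: greedy filter over the whole data set. A sequence is kept
    only if it is at most `threshold` similar to every sequence kept so far."""
    kept = []
    for s in seqs:
        if all(similarity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept


def enforce_gap(train, test, threshold=THRESHOLD):
    """Practice 2: keep all data, dropping only training sequences that are
    more than `threshold` similar to some test sequence. Redundancy inside
    the training set itself is left untouched."""
    return [s for s in train
            if all(similarity(s, t) <= threshold for t in test)]


if __name__ == "__main__":
    data = ["AAAAAAAAAA", "AAAAAAAAAC", "GGGGGGGGGG"]
    # Practice 1 discards the near-duplicate of the first sequence.
    print(reduce_redundancy(data))
    # Practice 2 keeps both near-duplicates in the training set.
    print(enforce_gap(data, test=["TTTTTTTTTT"]))
```

In this toy run the first practice shrinks the data set, while the second keeps the near-duplicate pair in the training set, which is the overfitting risk the abstract points to.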

