Multi-stage redundancy reduction

Abstract

In many important bioinformatics problems, data sets contain considerable redundancy, owing both to the evolutionary processes that generate the data and to biases in data collection procedures. Standard practice in bioinformatics is to remove redundancy so that no two sequences in a data set share more than forty percent similarity. For small data sets, this can thin already scarce data beyond the point of practicality. Alternatively, one can include all available data and ensure only that the training and test samples are separated by the required redundancy gap. However, this encourages overfitting by exposing the model to a highly redundant training set. We outline a process of multi-stage redundancy reduction, whereby scarce data can be used effectively without compromising the integrity of the model or the testing procedure.
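The abstract does not spell out the individual stages, so the following is only a minimal sketch of one plausible reading, not the authors' procedure. It assumes three stages: reduce the test set at the strict threshold, drop training sequences that fall within the redundancy gap of any test sequence, then thin the remaining training set at a looser threshold. The function names, the `train_threshold` value of 0.9, and the use of `difflib.SequenceMatcher` as a similarity measure are all assumptions made for illustration; a real pipeline would use an alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]. Assumption: a real pipeline would use
    alignment-based sequence identity instead of difflib."""
    return SequenceMatcher(None, a, b).ratio()


def greedy_reduce(seqs, threshold):
    """Keep a sequence only if its similarity to every already-kept
    sequence is at most `threshold` (simple greedy filtering)."""
    kept = []
    for s in seqs:
        if all(similarity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept


def multi_stage_split(train, test, gap=0.4, train_threshold=0.9):
    """Hypothetical three-stage reduction (an illustrative reading only)."""
    # Stage 1: reduce the test set at the strict threshold.
    test_kept = greedy_reduce(test, gap)
    # Stage 2: enforce the redundancy gap between training and test data.
    train_far = [
        s for s in train
        if all(similarity(s, t) <= gap for t in test_kept)
    ]
    # Stage 3: thin the training set at a looser threshold, so that near
    # duplicates are dropped without discarding most of the scarce data.
    train_kept = greedy_reduce(train_far, train_threshold)
    return train_kept, test_kept


if __name__ == "__main__":
    train = ["AAAAAAAAAA", "AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
    test = ["CCCCCCCCCC", "TTTTTTTTTT"]
    print(multi_stage_split(train, test))
```

In this toy run, the training sequence identical to a test sequence is removed by the gap check, and the duplicated training sequence is collapsed to a single copy by the looser within-training threshold.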

