In many important bioinformatics problems, data sets contain considerable redundancy, owing both to the evolutionary processes that generate the data and to biases in data collection. Standard practice in bioinformatics is to remove this redundancy so that no two sequences in a data set share more than forty percent similarity. For small data sets, this can reduce already scarce data beyond the point of practicality. Alternatively, one can include all available data and ensure only that the required redundancy gap holds between the training and test samples. However, this encourages overfitting, because the model is exposed to a highly redundant training set. We outline a process of multi-stage redundancy reduction, whereby scarce data can be used effectively without compromising the integrity of the model or the testing procedure.
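The abstract does not spell out the stages, so the following is only a minimal sketch of one possible reading, not the authors' procedure. It assumes three stages: a strict reduction to obtain a non-redundant test set, a filter that enforces the redundancy gap between training and test data, and a looser reduction of the training set. The function names, the relaxed threshold of 0.7, and the use of `difflib.SequenceMatcher` as a toy stand-in for alignment-based sequence identity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy stand-in for pairwise sequence identity, in [0, 1] (assumption)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_filter(seqs, threshold):
    """Keep a sequence only if it is at most `threshold` similar
    to every sequence already kept."""
    kept = []
    for s in seqs:
        if all(similarity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept


def multi_stage_split(seqs, strict=0.4, relaxed=0.7, n_test=2):
    """Hypothetical multi-stage reduction; thresholds are illustrative."""
    # Stage 1: strict reduction of the full set; draw the test set from it.
    core = greedy_filter(seqs, strict)
    test = core[:n_test]
    # Stage 2: enforce the redundancy gap between training and test data.
    candidates = [
        s for s in seqs
        if s not in test and all(similarity(s, t) <= strict for t in test)
    ]
    # Stage 3: thin the training set at a looser threshold so that it is
    # not dominated by near-duplicates.
    train = greedy_filter(candidates, relaxed)
    return train, test


# Made-up toy sequences, chosen only to exercise each stage.
seqs = [
    "AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC", "GGGGGGGGGG",
    "GGGGGGGGGT", "GGGGGTTTTT", "TTTTTTTTTT",
]
train, test = multi_stage_split(seqs)
```

In this toy run, a sequence that is moderately similar to another training sequence survives the relaxed stage even though it would be discarded by a single strict pass, which is the sense in which scarce data is retained while the test set stays separated from the training set.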
The University of Queensland, QLD, Australia