Journal: Proteins: Structure, Function, and Genetics

Choosing non‐redundant representative subsets of protein sequence data sets using submodular optimization

Abstract

Selecting a non‐redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non‐redundant training sets for sequence and structural models or selection of “operational taxonomic units” from metagenomics data. Previous methods for this task, such as CD‐HIT, PISCES, and UCLUST, apply a heuristic threshold‐based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
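To make the idea of submodular representative selection concrete, the following is a minimal sketch, not the authors' implementation. It assumes a facility-location objective, one commonly used monotone submodular function; the abstract does not say which objectives the paper uses. It also assumes a precomputed, non-negative pairwise similarity matrix. The names facility_location, greedy_select, and the toy matrix sim are hypothetical and chosen only for illustration. For monotone submodular objectives under a size constraint, this kind of greedy selection is known to reach at least a (1 - 1/e) fraction of the optimal objective value.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], subset: List[int]) -> float:
    """Assumed objective: f(S) = sum over items i of max_{j in S} sim[i][j]."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick up to k representatives by largest marginal gain.

    Assumes sim is square and non-negative (hypothetical input format).
    """
    n = len(sim)
    chosen: List[int] = []
    # best_cover[i] = similarity of item i to its closest chosen representative
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in chosen:
                continue
            # Marginal gain of adding j: total improvement in coverage.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        chosen.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return chosen


# Toy similarity matrix (made-up values): items 0-2 form one cluster, 3-4 another.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.85, 0.1, 0.1],
    [0.8, 0.85, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]

reps = greedy_select(sim, 2)
print(reps, facility_location(sim, reps))
```

On this toy input the two selected representatives come from different clusters, which is the behavior a non-redundant subset is meant to have. Unlike a fixed identity threshold, the subset size k is chosen directly, so sets of the same size can be compared across methods.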