Choosing non‐redundant representative subsets of protein sequence data sets using submodular optimization

Libbrecht Maxwell W.; Bilmes Jeffrey A.; Noble William Stafford


Abstract

Selecting a non‐redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non‐redundant training sets for sequence and structural models or selection of “operational taxonomic units” from metagenomics data. Previous methods for this task, such as CD‐HIT, PISCES, and UCLUST, apply a heuristic threshold‐based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
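To make the idea of selecting representatives by submodular optimization concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name a specific objective or optimizer, so both choices here are assumptions: the objective is the facility-location function, a common submodular objective for representative selection, and the optimizer is the standard greedy algorithm, which for monotone submodular functions under a size constraint is known to reach at least a (1 - 1/e) fraction of the optimal value. The function names and the toy similarity matrix are hypothetical.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], subset: List[int]) -> float:
    """Sum, over all items, of the similarity to the closest chosen representative."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick up to k representatives, maximizing facility location.

    sim[i][j] is a non-negative similarity between items i and j
    (assumed here; for sequences it could come from pairwise alignment).
    """
    n = len(sim)
    chosen: List[int] = []
    # best_cover[i] = highest similarity of item i to any representative so far
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in chosen:
                continue
            # Marginal gain of adding j: how much each item's coverage improves
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        chosen.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return chosen


# Hypothetical toy data: items 0 and 1 are near-duplicates, as are items 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
representatives = greedy_select(sim, k=2)
print(representatives, facility_location(sim, representatives))
```

On this toy input the greedy procedure takes one representative from each near-duplicate pair rather than two members of the same pair, which is the non-redundancy behavior the abstract describes. Unlike a threshold-based method, the subset size k is set directly instead of an identity cutoff.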


Bibliographic information

Source
Proteins: Structure, Function, and Genetics | 2018, Issue 4 | 13 pages
Authors
Libbrecht Maxwell W.; Bilmes Jeffrey A.; Noble William Stafford
Author affiliations

Department of Genome Sciences, University of Washington, Seattle, Washington;

Department of Electrical Engineering, University of Washington, Seattle, Washington;

Department of Genome Sciences, University of Washington, Seattle, Washington

Indexing information
Original format: PDF
Text language: English
Chinese Library Classification: Biochemistry
Keywords
discrete optimization; diversity; protein sequence analysis; redundancy; representative subsets; submodular maximization;

