Engineering Motif Search for Large Motifs

Petteri Kaski; Juho Lauri; Suhas Thejaswi

Abstract

Given a vertex-colored graph $H$ and a multiset $M$ of colors as input, the graph motif problem asks us to decide whether $H$ has a connected induced subgraph whose multiset of colors agrees with $M$. The graph motif problem is NP-complete but known to admit randomized algorithms based on constrained multilinear sieving over $\mathrm{GF}(2^b)$ that run in time $O(2^k k^2 m \cdot M(2^b))$ and with a false-negative probability of at most $k/2^{b-1}$ for a connected $m$-edge input and a motif of size $k$. On modern CPU microarchitectures such algorithms have practical edge-linear scalability to inputs with billions of edges for small motif sizes, as demonstrated by Björklund, Kaski, Kowalik, and Lauri [ALENEX'15]. This scalability to large graphs prompts the dual question whether it is possible to scale to large motif sizes. We present a vertex-localized variant of the constrained multilinear sieve that enables us to obtain, in time $O(2^k k^2 m \cdot M(2^b))$ and for every vertex simultaneously, whether the vertex participates in at least one match with the motif, with a per-vertex probability of at most $k/2^{b-1}$ for a false negative. Furthermore, the algorithm is easily vector-parallelizable for up to $2^k$ threads, and parallelizable for up to $2^k n$ threads, where $n$ is the number of vertices in $H$. Here $M(2^b)$ is the time complexity to multiply in $\mathrm{GF}(2^b)$.

We demonstrate with an open-source implementation that our variant of constrained multilinear sieving can be engineered for vector-parallel microarchitectures to yield hardware utilization that is bound by the available memory bandwidth. Our main engineering contributions are (a) a version of the recurrence for tightly labeled arborescences that can be executed as a sequence of memory-and-arithmetic coalescent parallel workloads on multiple GPUs, and (b) a bit-sliced low-level implementation for arithmetic in characteristic 2 to support (a).
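To make the two quantities in the abstract concrete, the following is a minimal Python sketch of (i) one multiplication in GF(2^b), the operation whose cost the abstract calls M(2^b), and (ii) the stated false-negative bound k/2^{b-1}. It is not the paper's bit-sliced GPU implementation. The function names `gf2b_mul` and `false_negative_bound`, the default field size b = 8, and the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B, the one used by AES) are all assumptions chosen only for illustration.

```python
from fractions import Fraction


def gf2b_mul(x: int, y: int, b: int = 8, modulus: int = 0x11B) -> int:
    """Multiply x and y in GF(2^b) by shift-and-add (hypothetical helper).

    Elements are integers below 2**b whose bits are polynomial coefficients
    over GF(2). `modulus` must encode an irreducible polynomial of degree b;
    the default 0x11B is an illustrative assumption, not taken from the paper.
    """
    acc = 0
    for _ in range(b):
        if y & 1:          # add (XOR) the current multiple of x
            acc ^= x
        y >>= 1
        x <<= 1            # multiply x by the indeterminate
        if x >> b:         # degree reached b: reduce modulo the polynomial
            x ^= modulus
    return acc


def false_negative_bound(k: int, b: int) -> Fraction:
    """Upper bound k / 2^(b-1) on the false-negative probability."""
    return Fraction(k, 2 ** (b - 1))


if __name__ == "__main__":
    print(hex(gf2b_mul(0x57, 0x83)))      # product in GF(2^8)
    print(false_negative_bound(10, 64))   # bound for k = 10, b = 64
```

Addition in characteristic 2 is a bitwise XOR, which is why arithmetic of this kind lends itself to bit-sliced, vector-parallel implementations. The bound shrinks exponentially in b, so a larger field reduces the false-negative probability at the cost of a more expensive multiplication.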

Bibliographic details

Source: LIPIcs: Leibniz International Proceedings in Informatics, 2018, issue 30, 19 pages
Authors: Petteri Kaski; Juho Lauri; Suhas Thejaswi
Original format: PDF
Keywords: algorithm engineering; constrained multilinear sieving; graph motif problem; multi-GPU; vector-parallel; vertex-localization
