An Exact Algorithm of Searching for the Largest Cluster in an Integer-Valued Problem of 2-Partitioning a Sequence

A. V. Kel’manov; S. A. Khamidullin; V. I. Khandeev; A. V. Pyatkin


Abstract

We analyze mathematical aspects of a fundamental data analysis problem: searching for (selecting) the subset with the largest number of similar elements in a collection of objects. In particular, the problem arises in the analysis of data in the form of time series (discrete signals). We consider one problem that models this task, namely, finding the cluster of the largest size (cardinality) in a partition of a finite sequence of points in Euclidean space into two clusters (subsequences) under two constraints. The first constraint restricts the choice of the indices of the elements included in the clusters; it models the set of time-admissible configurations of similar elements in the observed discrete signal. The second constraint bounds the value of a quadratic clustering function; it models the level of intracluster proximity of objects. This clustering function is the sum, over both clusters, of the intracluster sums of squared distances between the elements of a cluster and its center. The center of one cluster is unknown and is defined as the centroid (the arithmetic mean of all elements of that cluster). The center of the other cluster is the origin. Under the first constraint, the difference between any two consecutive indices of elements in the cluster with the unknown center is bounded above and below by given constants. We show that this optimization problem, which models one of the simplest significant problems of data analysis, is strongly NP-hard. We propose an exact algorithm for the case in which the input points have integer coordinates. If the dimension of the space is bounded by a constant, the algorithm is pseudopolynomial.
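The problem statement in the abstract can be illustrated with a small brute-force sketch. This is not the authors' pseudopolynomial algorithm: it enumerates index subsets in exponential time and is only meant to make the two constraints concrete on tiny inputs. The function name `largest_cluster` and the parameter names `t_min`, `t_max`, and `a` (the bound on the clustering function) are assumptions introduced here, as is the reading that the gap constraint applies to consecutive chosen indices. For a chosen index set M with coordinate sum S, the clustering function equals (sum of all squared norms) minus |S|^2 / |M|, so with integer coordinates the bound can be checked exactly in integer arithmetic.

```python
from itertools import combinations


def sq_norm(v):
    """Squared Euclidean norm of an integer vector."""
    return sum(c * c for c in v)


def largest_cluster(points, t_min, t_max, a):
    """Brute-force search (exponential time, illustration only).

    Returns the 0-based indices of a largest subsequence M such that
      * consecutive chosen indices differ by at least t_min and at most t_max,
      * sum_{j in M} |y_j - mean(M)|^2 + sum_{i not in M} |y_i|^2 <= a.
    Returns [] if no non-empty subsequence satisfies both constraints.
    """
    n = len(points)
    total = sum(sq_norm(p) for p in points)
    # Try sizes from largest to smallest; the first feasible subset is optimal.
    for size in range(n, 0, -1):
        for idx in combinations(range(n), size):
            # First constraint: bounded gaps between consecutive indices.
            if any(not (t_min <= j - i <= t_max) for i, j in zip(idx, idx[1:])):
                continue
            # Coordinate-wise sum S of the chosen points.
            s = [sum(col) for col in zip(*(points[i] for i in idx))]
            # Second constraint, rewritten without division:
            # total - |S|^2 / size <= a  <=>  size * (total - a) <= |S|^2
            if size * (total - a) <= sq_norm(s):
                return list(idx)
    return []


# Hypothetical toy input: three points near (5, 5) interleaved with two near the origin.
pts = [(5, 5), (0, 1), (5, 4), (1, 0), (4, 5)]
result = largest_cluster(pts, t_min=1, t_max=3, a=4)
```

On this toy input the points at positions 0, 2, and 4 form the largest admissible cluster; tightening `t_max` to 1 forces adjacent indices and leaves no feasible cluster for the same bound.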


Bibliographic record

Source
Pattern recognition and image analysis: advances in mathematical theory and applications in the USSR | 2018, No. 4 | 9 pages
Authors
A. V. Kel’manov; S. A. Khamidullin; V. I. Khandeev; A. V. Pyatkin
Author affiliations

Sobolev Institute of Mathematics, Siberian Branch, Russian Academy of Sciences (all four authors)

Indexing information
Original format: PDF
Language: English
Chinese Library Classification: Pattern recognition and devices
Keywords
time-series analysis; similar elements; Euclidean space; sequence; 2-partition; longest subsequence; quadratic scattering; NP-hard problem; integer coordinates; exact algorithm; fixed space dimension; pseudopolynomial running time;


Similar literature

1. An Exact Algorithm of Searching for the Largest Cluster in an Integer-Valued Problem of 2-Partitioning a Sequence [J]. A. V. Kel’manov, S. A. Khamidullin, V. I. Khandeev. Pattern recognition and image analysis: advances in mathematical theory and applications in the USSR. 2018, No. 4

2. Exact Algorithms of Search for a Cluster of the Largest Size in Two Integer 2-Clustering Problems [J]. A. V. Kel′manov, A. V. Panasenko, V. I. Khandeev. Numerical analysis and applications. 2019, No. 2

3. An exact parallel algorithm to compare very long biological sequences in clusters of workstations [J]. Azzedine Boukerche, Alba Cristina Magalhaes Alves de Melo, Edans Flavius de Oliveira Sandes. Cluster computing. 2007, No. 2

4. Exact Algorithms for Two Quadratic Euclidean Problems of Searching for the Largest Subset and Longest Subsequence [C] . Alexander Kelmanov, Sergey Khamidullin, Vladimir Khandeev, International Conference on Learning and Intelligent Optimization . 2019

5. Exact algorithms for minimum sum-of-squares clustering [D]. Aloise, Daniel. 2009

6. BiCluE - Exact and heuristic algorithms for weighted bi-cluster editing of biomedical data [O] . Peng Sun, Jiong Guo, Jan Baumbach 2013

7. Figure 4: (A) One conserved sequence, which occurs 79 times in 46,264 binding site peaks from the ChIP-seq data-set. The mutation profile of this conserved sequence is illustrated, where ’_ ’ indicates this base is unchanged; DEL indicates this base is lost; INS X indicates a new base X is inserted in front of this base. (B) Several repeated elements patterns are listed. (C) In the first column, the top five DNA motifs, mined by meme-chip tools (Machanick Bailey, 2011) are illustrated. The resemblant conserved sequences, found by the CFSP algorithm are listed in the second column. In the third column, the position-specific scoring matrices, which are transformed from mutational information are listed. The similarity between meme motif and resemblant conserved sequence with PSSM format was calculated via a stamp motif comparison tool (Mahony Benos, 2007). The E-values for the similarity of those pairs is displayed in the fourth column. (D) One motif is selected in each group clustered by gkmsvm descriptors, and the corresponding motif found by the CFSP algorithm is listed below. (E) There are additional datasets (File No: ENCFF100GRL, ENCFF616IRT, ENCFF870CER, Target: SREBF1) collected from https://www.encodeproject.org. The top two motifs are selected in each file using meme tools, and the corresponding motifs found by our algorithm are listed below. [O] . -1

8. Two Papers on Range Searching: A Survey of Algorithms and Data Structures for Range Searching. Efficient Worst-Case Data Structures for Range Searching. [R]. Bentley, Jon Louis; Friedman, Jerome H. 1978
