Gene time series data clustering based on continuous representations and an energy based similarity measure

Abstract

Clustering of temporal gene expression data has been widely used to study dynamic biological systems. However, temporal gene expression data often contain noise, missing data points, and non-uniformly sampled time points, which makes it difficult for traditional clustering methods to extract meaningful information. To improve clustering performance, we introduce a novel clustering approach based on continuous representations and an energy based similarity measure. The proposed approach models each gene expression profile as a B-spline expansion, whose spline coefficients are estimated from the observed data by a regularized least squares scheme. After fitting the continuous representations of the gene expression profiles, we use an energy based similarity measure that takes into account the temporal information and the relative changes of the time series. Experimental results show that the proposed method is robust to noise and can produce meaningful clustering results.
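
The fitting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the knot placement, the regularizer, or the parameter values, so the clamped uniform knots, the second-difference penalty on the coefficients, and the names `fit_profile`, `n_basis`, and `lam` are all assumptions made for illustration. The energy based similarity measure is not sketched because the abstract does not define it.

```python
import numpy as np
from scipy.interpolate import BSpline


def fit_profile(times, values, n_basis=6, degree=3, lam=1e-2):
    """Fit one expression profile with a penalized B-spline (illustrative).

    NaN entries in `values` are treated as missing and skipped.
    Returns a scipy BSpline that can be evaluated at any time point.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    observed = ~np.isnan(values)
    t_obs, y_obs = times[observed], values[observed]

    # Assumed knot layout: clamped ends, uniformly spaced interior knots.
    lo, hi = times.min(), times.max()
    interior = np.linspace(lo, hi, n_basis - degree + 1)
    knots = np.concatenate([[lo] * degree, interior, [hi] * degree])

    # Design matrix: column j holds basis function j at the observed times.
    basis = BSpline(knots, np.eye(n_basis), degree)
    B = basis(t_obs)

    # Assumed regularizer: second differences of neighbouring coefficients.
    D = np.diff(np.eye(n_basis), n=2, axis=0)

    # Minimise ||B c - y||^2 + lam * ||D c||^2 as one stacked
    # least-squares problem.
    A = np.vstack([B, np.sqrt(lam) * D])
    rhs = np.concatenate([y_obs, np.zeros(D.shape[0])])
    coef, *_ = np.linalg.lstsq(A, rhs, rcond=None)

    return BSpline(knots, coef, degree)


# Non-uniformly sampled profile with one missing measurement.
times = np.array([0.0, 0.5, 1.0, 2.0, 3.0, 4.5, 6.0, 7.0, 8.0, 10.0])
values = np.sin(times / 2.0)
values[3] = np.nan
curve = fit_profile(times, values)
dense = curve(np.linspace(0.0, 10.0, 50))  # continuous representation
```

Because the result is a continuous curve, profiles sampled at different time points can be compared on a common grid, which is what a similarity measure defined on the fitted curves would need.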

Bibliographic details

Source
Proceedings of the Ninth International Conference on Machine Learning and Cybernetics | 2010 | pp. 2079-2083 | 5 pages
Authors
Zhang Wei-Feng; Liu Chao-Chun; Yan Hong
Original format: PDF
Chinese Library Classification: automated reasoning, machine learning
Keywords
Gene time series; clustering; energy based measure

Similar literature

1. Clustering of temporal gene expression data by regularized spline regression and an energy based similarity measure [J]. Zhang W.-F., Liu C.-C., Yan H. Pattern Recognition: The Journal of the Pattern Recognition Society. 2010, No. 12
2. A modified correlation coefficient based similarity measure for clustering time-course gene expression data [J]. Young Sook Son, Jangsun Baek. Pattern Recognition Letters. 2008, No. 3
3. Imputing incomplete time-series data based on varied-window similarity measure of data sequences [J]. Sirapat Chiewchanwattana, Chidchanok Lursinsap, Chee-Hung Henry Chu. Pattern Recognition Letters. 2007, No. 9
4. Gene time series data clustering based on continuous representations and an energy based similarity measure [C]. Zhang Wei-Feng, Liu Chao-Chun, Yan Hong. International Conference on Machine Learning and Cybernetics. 2010
5. Robust dynamical model-based data representations and structuring of time series data for in-sequence localization [D]. Laftchiev, Emil. 2015
6. A reliable measure of similarity based on dependency for short time series: an application to gene expression networks [O]. Mônica G Campiteli, Frederico M Soriani, Iran Malavazi. 2009
7. Figure 4: (A) One conserved sequence, which occurs 79 times in 46,264 binding site peaks from the ChIP-seq data-set. The mutation profile of this conserved sequence is illustrated, where ’_’ indicates this base is unchanged; DEL indicates this base is lost; INS X indicates a new base X is inserted in front of this base. (B) Several repeated elements patterns are listed. (C) In the first column, the top five DNA motifs, mined by meme-chip tools (Machanick & Bailey, 2011) are illustrated. The resemblant conserved sequences, found by the CFSP algorithm are listed in the second column. In the third column, the position-specific scoring matrices, which are transformed from mutational information are listed. The similarity between meme motif and resemblant conserved sequence with PSSM format was calculated via a stamp motif comparison tool (Mahony & Benos, 2007). The E-values for the similarity of those pairs are displayed in the fourth column. (D) One motif is selected in each group clustered by gkmsvm descriptors, and the corresponding motif found by the CFSP algorithm is listed below. (E) There are additional datasets (File No: ENCFF100GRL, ENCFF616IRT, ENCFF870CER, Target: SREBF1) collected from https://www.encodeproject.org. The top two motifs are selected in each file using meme tools, and the corresponding motifs found by our algorithm are listed below. [O] . -1
