Journal: IEEE Transactions on Information Technology in Biomedicine

Information mining over heterogeneous and high-dimensional time-series data in clinical trials databases



Abstract

An effective analysis of clinical trials data involves analyzing different types of data, such as heterogeneous and high-dimensional time series data. Current time series analysis methods generally assume that the series at hand are long enough for statistical techniques to be applied to them. Other ideal-case assumptions are that data are collected at equal-length intervals and that, when time series are compared, their lengths are equal. However, these assumptions are not valid for many real data sets, especially clinical trials data sets. In addition, the data sources differ from each other, the data are heterogeneous, and the sensitivity of the experiments varies by source. Approaches for mining time series data need to be revisited with this wide range of requirements in mind. In this paper, we propose a novel approach for information mining that involves two major steps: applying a data mining algorithm over homogeneous subsets of data, and identifying common or distinct patterns over the information gathered in the first step. Our approach is implemented specifically for heterogeneous and high-dimensional time series clinical trials data. Using this framework, we propose a new way of utilizing frequent itemset mining, as well as clustering and declustering techniques with novel distance metrics for measuring similarity between time series data. By clustering the data, we find groups of analytes (substances in blood) that are most strongly correlated. Most of these relationships are already known and are verified by the clinical panels; in addition, we identify novel groups that need further biomedical analysis. A slight modification to our algorithm results in an effective declustering of high-dimensional time series data, which is then used for "feature selection." Using industry-sponsored clinical trials data sets, we are able to identify a small set of analytes that effectively models the state of normal health.
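The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify their distance metrics, so this sketch assumes a stand-in distance of 1 - |Pearson r| computed only over the time points two series share, which tolerates unequal lengths and irregular sampling. Step 1 finds strongly correlated analyte pairs within each homogeneous subset; step 2 keeps the pairs supported by enough subsets, in the spirit of frequent itemset counting. The function names, thresholds, site names, analyte names, and all values are hypothetical toy data.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def distance(a, b, min_overlap=3):
    """Assumed stand-in metric: 1 - |r| over shared time points.

    a, b map time point -> measured value. Series with too few
    shared time points are treated as maximally distant.
    """
    shared = sorted(a.keys() & b.keys())
    if len(shared) < min_overlap:
        return 1.0
    r = pearson([a[t] for t in shared], [b[t] for t in shared])
    return 1.0 - abs(r)


def correlated_pairs(series_by_analyte, max_dist=0.2):
    """Step 1: strongly correlated analyte pairs within one homogeneous subset."""
    return {
        (p, q)
        for p, q in combinations(sorted(series_by_analyte), 2)
        if distance(series_by_analyte[p], series_by_analyte[q]) <= max_dist
    }


def common_patterns(subsets, min_support=2, max_dist=0.2):
    """Step 2: pairs that are correlated in at least min_support subsets."""
    counts = Counter()
    for series_by_analyte in subsets.values():
        counts.update(correlated_pairs(series_by_analyte, max_dist))
    return {pair for pair, c in counts.items() if c >= min_support}


# Hypothetical toy data: two sources with different sampling times and lengths.
subsets = {
    "site_A": {
        "ALT": {0: 10, 3: 12, 7: 15, 12: 20},
        "AST": {0: 20, 3: 24, 7: 30, 12: 40, 20: 41},
        "glucose": {0: 90, 3: 85, 7: 95, 12: 88},
    },
    "site_B": {
        "ALT": {1: 8, 4: 9, 9: 14},
        "AST": {1: 15, 4: 18, 9: 27, 15: 30},
        "glucose": {1: 100, 4: 92, 9: 97},
    },
}

print(common_patterns(subsets))
```

Raising `min_support` restricts the result to patterns shared by more sources, while inverting the comparison in `common_patterns` would surface patterns distinct to a single source.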
