Journal: BMC Bioinformatics

G-Tric: generating three-way synthetic datasets with triclustering solutions


Abstract

Three-way data have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations $\times$ features $\times$ contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of providing the ground truth (triclustering solution) as output. G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics when comparing solutions, producing more reliable analyses.
A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
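
To make the idea of planting a tricluster and scoring against a known ground truth concrete, the following is a minimal illustrative sketch in Python. It is not G-Tric's code or API: the function names (generate_dataset, jaccard), the uniform background, the constant planted pattern, and the cell-level Jaccard score are all assumptions chosen for brevity, and the abstract does not specify any of them.

```python
import random
from itertools import product


def generate_dataset(n_obs, n_feat, n_ctx, tricluster, value=5.0,
                     noise_sd=0.0, missing_rate=0.0, seed=0):
    """Build a toy observations x features x contexts dataset.

    Hypothetical helper, not part of G-Tric. Returns (data, planted), where
    data[o][f][c] is a float or None (missing) and planted is the set of
    (o, f, c) cells belonging to the planted tricluster (the ground truth).
    """
    rng = random.Random(seed)
    # Background: independent draws from a uniform distribution on [0, 1).
    data = [[[rng.random() for _ in range(n_ctx)]
             for _ in range(n_feat)]
            for _ in range(n_obs)]
    obs, feats, ctxs = tricluster
    # Plant a constant pattern, optionally perturbed by Gaussian noise.
    for o, f, c in product(obs, feats, ctxs):
        data[o][f][c] = value + rng.gauss(0.0, noise_sd)
    # Blank out a random fraction of all cells to simulate missing values.
    for o, f, c in product(range(n_obs), range(n_feat), range(n_ctx)):
        if rng.random() < missing_rate:
            data[o][f][c] = None
    return data, set(product(obs, feats, ctxs))


def jaccard(cells_a, cells_b):
    """Extrinsic score: overlap between two triclusters given as cell sets."""
    union = cells_a | cells_b
    return len(cells_a & cells_b) / len(union) if union else 1.0


# Plant one tricluster (3 observations x 2 features x 2 contexts) in a
# 10 x 8 x 4 dataset, then score a hypothetical algorithm output against it.
data, truth = generate_dataset(10, 8, 4, ([0, 1, 2], [3, 4], [1, 2]))
found = set(product([0, 1], [3, 4], [1, 2]))  # pretend algorithm result
score = jaccard(truth, found)
print(score)
```

Because the planted cells are returned alongside the data, an algorithm's output can be compared directly against them, which is the kind of extrinsic evaluation that real data without ground truth does not allow.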