G-Tric: generating three-way synthetic datasets with triclustering solutions

João Lobo; Rui Henriques; Sara C. Madeira

Abstract

Three-way data have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations $\times$ features $\times$ contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of providing the ground truth (triclustering solution) as output. G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Triclustering evaluation using G-Tric makes it possible to combine both intrinsic and extrinsic metrics when comparing solutions, producing more reliable analyses.
A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
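
The idea of planting a tricluster in a background distribution and scoring a recovered solution against the ground truth can be illustrated with a minimal Python sketch. This is not G-Tric's implementation or API: the function names `plant_tricluster` and `jaccard_3d`, the standard-normal background, the constant pattern, and all parameter defaults are assumptions chosen for illustration, and the Jaccard index on covered cells is only one possible extrinsic metric.

```python
import numpy as np

def plant_tricluster(shape, rows, cols, ctxs, value=5.0,
                     noise_sd=0.1, missing_rate=0.05, seed=0):
    """Hypothetical helper (not part of G-Tric): build a 3-way dataset
    with one planted constant-pattern tricluster and return the truth."""
    rng = np.random.default_rng(seed)
    # Background: standard-normal values (observations x features x contexts).
    data = rng.normal(0.0, 1.0, size=shape)
    # Plant a constant pattern with small Gaussian noise in the chosen subspace.
    block_shape = (len(rows), len(cols), len(ctxs))
    data[np.ix_(rows, cols, ctxs)] = value + rng.normal(0.0, noise_sd, block_shape)
    # Degrade data quality: mark a random fraction of cells as missing.
    data[rng.random(shape) < missing_rate] = np.nan
    truth = {"rows": set(rows), "cols": set(cols), "ctxs": set(ctxs)}
    return data, truth

def jaccard_3d(a, b):
    """Jaccard index over the cells covered by two triclusters."""
    def size(t):
        return len(t["rows"]) * len(t["cols"]) * len(t["ctxs"])
    inter = (len(a["rows"] & b["rows"]) * len(a["cols"] & b["cols"])
             * len(a["ctxs"] & b["ctxs"]))
    union = size(a) + size(b) - inter
    return inter / union if union else 0.0

data, truth = plant_tricluster((20, 10, 5), rows=[1, 2, 3], cols=[0, 4], ctxs=[2, 3])
# A made-up "recovered" tricluster that misses one observation.
found = {"rows": {1, 2}, "cols": {0, 4}, "ctxs": {2, 3}}
print(jaccard_3d(truth, found))
```

Because the planted subspace is returned alongside the data, a candidate solution can be scored directly, which is not possible on real data that lacks a known ground truth.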


Bibliographic record

Source
BMC Bioinformatics | 2021, Issue 1 | 28 pages
Authors
João Lobo; Rui Henriques; Sara C. Madeira
Original format: PDF
Chinese Library Classification: Biological sciences
Keywords
Three-way data analysis; Three-dimensional data; Triclustering; Synthetic data generation; Unsupervised learning; Subspace clustering


