Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization

ALINA LAZAR; LING JIN; C. ANNA SPURLOCK; KESHENG WU; ALEX SIM; ANNIKA TODD

Abstract

The goal of this work is to investigate the impact of missing values on clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in a number of social science settings. However, analytical operations such as clustering life course trajectories are challenging because the data are categorical and multidimensional, mix data types, and are corrupted by missing and inconsistent values. Data quality issues were previously investigated for single-variable sequences. To understand their effects on multivariate sequence analysis, we combine a dataset with mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, and dimensionality reduction methods in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an "edit" distance based on optimal matching. Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Comparing variables with binary values against those with multiple nominal values, we find that missing data problems are harder to overcome in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can produce systematic biases in dissimilarity matrices, which in turn introduce both artificial clusters and unrealistic interpretations of the associated data domains. We demonstrate the use of t-distributed stochastic neighbor embedding (t-SNE) to visually guide mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.
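
The optimal matching "edit" distance with a tunable missing value substitution cost, as described in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation. The names `MISSING`, `state_cost`, `joint_cost`, `om_distance`, and `dissimilarity_matrix` are hypothetical, as are the cost values and the toy sequences. The sketch assumes that each time step is a tuple with one state per variable, that the joint substitution cost is the sum of the per-variable costs, and that two missing states match at zero cost.

```python
MISSING = None  # hypothetical marker for a missing state


def state_cost(a, b, sub_cost=2.0, missing_cost=1.0):
    """Substitution cost between two states of a single variable."""
    if a == b:
        return 0.0  # identical states, including missing vs. missing
    if a is MISSING or b is MISSING:
        return missing_cost  # tunable cost of matching missing with observed
    return sub_cost


def joint_cost(x, y, sub_cost=2.0, missing_cost=1.0):
    """Assumed joint cost: sum of per-variable costs at one time step."""
    return sum(state_cost(a, b, sub_cost, missing_cost) for a, b in zip(x, y))


def om_distance(seq_a, seq_b, indel=1.0, sub_cost=2.0, missing_cost=1.0):
    """Edit distance between two non-empty joint sequences (dynamic programming)."""
    n, m = len(seq_a), len(seq_b)
    gap = indel * len(seq_a[0])  # assumed: indel cost scales with variable count
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap
    for j in range(1, m + 1):
        d[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + gap,  # delete a state from seq_a
                d[i][j - 1] + gap,  # insert a state from seq_b
                d[i - 1][j - 1]
                + joint_cost(seq_a[i - 1], seq_b[j - 1], sub_cost, missing_cost),
            )
    return d[n][m]


def dissimilarity_matrix(seqs, **costs):
    """Symmetric matrix of pairwise optimal matching distances."""
    size = len(seqs)
    dist = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            dist[i][j] = dist[j][i] = om_distance(seqs[i], seqs[j], **costs)
    return dist


# Toy sequences: one binary variable and one nominal variable per time step.
s1 = [(MISSING, MISSING), (MISSING, MISSING), (1, "rent"), (1, "own")]
s2 = [(MISSING, MISSING), (MISSING, MISSING), (0, "rent"), (0, "own")]
s3 = [(1, "rent"), (1, "own"), (1, "own"), (1, "own")]

matrix = dissimilarity_matrix([s1, s2, s3])
for row in matrix:
    print(row)
```

In this sketch, `s1` and `s2` share leading missing values that align at no cost, which shows how such alignment can pull sequences together regardless of their observed states. Changing `missing_cost` alters how far sequences with missing states sit from fully observed ones such as `s3`. A matrix like `matrix` could then be passed to a t-SNE implementation that accepts precomputed distances for visual inspection.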

Bibliographic record

Source: ACM Journal of Data and Information Quality, 2019, Issue 2, 22 pages

Author affiliations:

Department of Computer Science and Information Systems, Youngstown State University, 1 University Plaza, Youngstown, OH 44555;

Energy Analysis and Environmental Impacts Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720;

Energy Analysis and Environmental Impacts Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720;

Computational Research Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720;

Computational Research Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720;

Energy Analysis and Environmental Impacts Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720

Original format: PDF
Text language: English
Chinese Library Classification: Metrology

Keywords: joint sequence analysis; optimal matching; missing values; time series clustering; data quality; t-SNE; dimensionality reduction; life trajectories
