Dimension Reduction on Open Data Using Variational Autoencoder

Abstract

The Open Data movement has led to a large number of databases being published on the web. However, accessing these databases effectively remains a challenge because of their large volume and heterogeneity. For retrieving similar domains, min-wise independent permutation locality sensitive hashing (MinHash LSH) has become a popular technique for estimating the similarity between two domains. Given the recent advances in deep learning and its impact on multiple fields of science, we explore how deep learning could improve similarity search at internet scale. To do so, we first formulate the similarity search problem as a Euclidean nearest neighbour problem by transforming the set representation into a latent representation of learned features. We then apply Variational Autoencoders (VAEs) to embed domains into a significantly smaller, continuous latent space. VAEs learn an embedding that minimizes the reconstruction error and the Kullback-Leibler divergence between the encoder distribution and the prior. Optimizing both terms allows the model to learn a dense representation in which local similarities from the original input space are preserved. We evaluate our algorithm on a subset of joint Open Data (Canada, US and UK) that contains more than 1.4 million documents with a domain size greater than 128 thousand. We demonstrate that distance in the latent space correlates significantly with the Jaccard similarity coefficient. We then show that domains embedded closer together in latent space are indeed similar. Lastly, we show that our algorithm outperforms MinHash LSH in accuracy and precision for all dimensions tested. These improvements suggest that deep learning techniques are a promising approach for internet-scale domain search.
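
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It shows the exact Jaccard coefficient, a MinHash estimate of it, and a VAE objective made of a reconstruction term plus a KL term. Several choices here are assumptions that the abstract does not state: seeded BLAKE2b hashes stand in for random permutations, the encoder is a diagonal Gaussian with a standard normal prior (which gives the closed-form KL term), and the reconstruction error is a Bernoulli cross-entropy over binary set-membership vectors. All function names are hypothetical.

```python
import hashlib
import math


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A n B| / |A u B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(seed: int, value: str) -> int:
    # Assumption: a seeded 64-bit hash stands in for one random permutation.
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(values: set, num_perm: int = 128) -> list:
    """One minimum hash value per seeded hash function (values must be non-empty)."""
    return [min(_seeded_hash(seed, v) for v in values) for seed in range(num_perm)]


def minhash_estimate(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, an estimate of the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def gaussian_kl(mu: list, log_var: list) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumption: diagonal Gaussian encoder and standard normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def bernoulli_reconstruction_error(x: list, x_hat: list, eps: float = 1e-7) -> float:
    """Binary cross-entropy between a 0/1 input vector and decoder probabilities.

    Assumption: the abstract does not specify the reconstruction likelihood.
    """
    total = 0.0
    for target, prob in zip(x, x_hat):
        p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total


def vae_loss(x: list, x_hat: list, mu: list, log_var: list) -> float:
    """Sum of the two terms the abstract says the VAE minimizes."""
    return bernoulli_reconstruction_error(x, x_hat) + gaussian_kl(mu, log_var)
```

For two domains given as sets of strings, comparing `minhash_estimate(minhash_signature(a), minhash_signature(b))` with `jaccard(a, b)` shows how closely the signature approximates the exact coefficient. Increasing `num_perm` tightens the estimate at the cost of a longer signature.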

Bibliographic record

Source
IEEE International Conference on Data Mining Workshops | 2018 | pp. 1080-1085 | 6 pages
Authors
Hyunmin Lee; Zhen Hao Wu; Zhaolei Zhang
Original format: PDF
Keywords
Search problems; Deep learning; Probability distribution; Neural networks; Indexes; Decoding; Time complexity

