Conference: IEEE International Conference on Data Mining Workshops
Dimension Reduction on Open Data Using Variational Autoencoder

Abstract

The Open Data movement has led to a large number of databases being published on the web. However, accessing these databases effectively remains a challenge because of their large volume and heterogeneity. To retrieve similar queries, min-wise independent permutation locality sensitive hashing (MinHash LSH) has become a popular technique for estimating the similarity between two domains. With the recent advancements in deep learning and its ability to revolutionize multiple fields of science, we explore how deep learning could improve similarity search at internet scale. To do so, we first formulate the similarity search problem as a Euclidean nearest neighbour problem by transforming the set representation into a latent representation of learned features. We then apply Variational Autoencoders (VAEs) to embed domains into a continuous latent space of significantly lower dimension. VAEs learn an embedding that minimizes the reconstruction error and the Kullback-Leibler divergence between the encoder distribution and the prior. Optimizing both terms allows the model to learn a dense representation that preserves local similarities from the original input space. We evaluate our algorithm using a subset of joint Open Data (Canada, US and UK) that contains more than 1.4 million documents with a domain size greater than 128 thousand. We demonstrate that the latent space correlates significantly with the Jaccard similarity coefficient. Then, we show that domains embedded closer together in latent space are indeed similar. Lastly, we show that our algorithm outperforms MinHash LSH in accuracy and precision for all dimensions tested. These improvements show that deep learning techniques can be a promising approach for internet-scale domain search.
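
To make the quantities named in the abstract concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It shows the exact Jaccard similarity between two sets, a MinHash estimate of it, and the two terms of a VAE objective. Several choices here are assumptions that the abstract does not state: the seeded `blake2b` hash standing in for a random permutation, the squared-error reconstruction term, the standard normal prior, and every function name.

```python
import hashlib
import math


def jaccard(a, b):
    """Exact Jaccard similarity |A intersect B| / |A union B| of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(seed, value):
    # Assumption: a seeded 64-bit hash stands in for one random permutation.
    digest = hashlib.blake2b(
        str(value).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """MinHash signature: the minimum hash value per seed (items non-empty)."""
    items = list(items)
    return [min(_seeded_hash(seed, x) for x in items)
            for seed in range(num_hashes)]


def minhash_similarity(sig_a, sig_b):
    """Fraction of matching signature slots; estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def gaussian_kl(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)).

    Assumption: diagonal Gaussian encoder and standard normal prior.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction error plus KL term (squared error is an assumption)."""
    recon = sum((xi - ri) ** 2 for xi, ri in zip(x, x_recon))
    return recon + gaussian_kl(mu, log_var)
```

In a setup like the one the abstract describes, each domain would be encoded to a latent mean vector, and similar domains would then be retrieved by Euclidean nearest-neighbour search over those vectors, with the MinHash estimate serving as the baseline.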
