Fast Document Similarity Computations using GPGPU


Abstract

Several Big Data problems involve computing similarities between entities, such as records and documents, in a timely manner. Recent studies indicate that similarity-based deduplication techniques are efficient for document databases. Delta encoding-like techniques are commonly used to compute similarities between documents, and operational requirements impose low-latency constraints. Previous research does not consider parallel computing as a way to deliver low-latency delta encoding solutions. This paper makes a two-fold contribution to the delta encoding problem in document databases: (1) it develops a parallel processing-based technique to compute similarities between documents, and (2) it designs a GPU-based document cache framework to accelerate the delta encoding pipeline. We experiment with real datasets and achieve a throughput of more than 500 similarity computations per millisecond. The similarity computation framework achieves a throughput in the range of 237-312 MB per second, which is up to 10X higher than that of hashing-based approaches.


Bibliographic details

Source
International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management | 2018 | 1 (CD-ROM) | 9 pages
Authors
Parijat Shukla; Arun K. Somani
Original format: PDF
Chinese Library Classification: G354-53
Keywords
Deduplication; Semi-structured data; NoSQL; Big data; Parallel processing; GPGPU; Data shaping


Similar literature


1. Chinese semantic document classification based on strategies of semantic similarity computation and correlation analysis [J]. Yang Shuo, Wei Ran, Guo Jingzhi. Journal of Web Semantics. 2020, Issue Auga
2. Semantic Document Classification based on Strategies of Semantic Similarity Computation and Correlation Analysis [J]. Shuo Yang, Ran Wei, Hengliang Tan. Computer Science & Information Technology. 2019, Issue 13
3. Secure computation of functionalities based on Hamming distance and its application to computing document similarity [J]. Ayman Jarrous, Benny Pinkas. International Journal of Applied Cryptography. 2013, Issue 1
4. Fast Document Similarity Computations using GPGPU [C]. Parijat Shukla, Arun K. Somani. International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management. 2018
5. Faster Than Real-Time GPGPU Radiation Pressure Modeling Methods [D]. Kenneally, P W. 2019
6. Optimizing Data Intensive GPGPU Computations for DNA Sequence Alignment [O]. Cole Trapnell, Michael C. Schatz
7. Efficient Pairwise Document Similarity Computation in Big Datasets [O]. Papias Niyigena, Zhang Zuping, Weiqi Li. 2015
