
Fast Document Similarity Computations using GPGPU



Abstract

Several Big Data problems involve computing similarities between entities, such as records and documents, in a timely manner. Recent studies indicate that similarity-based deduplication techniques are efficient for document databases. Delta encoding-like techniques are commonly used to compute similarities between documents, and operational requirements impose low-latency constraints. Previous research has not considered parallel computing as a way to deliver low-latency delta encoding solutions. This paper makes a two-fold contribution in the context of the delta encoding problem in document databases: (1) it develops a parallel processing-based technique to compute similarities between documents, and (2) it designs a GPU-based document cache framework to accelerate the delta encoding pipeline. We experiment with real datasets and achieve a throughput of more than 500 similarity computations per millisecond. The similarity computation framework achieves a throughput in the range of 237-312 MB per second, which is up to 10X higher than that of hashing-based approaches.
