Finding similar files in large document repositories

Abstract

Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction.

The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing.

We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
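
To make the chunk-and-cluster idea concrete, here is a minimal sketch of one way it could work. The abstract does not specify the chunking scheme, the signature function, or how the file-chunk graph is analyzed, so everything below is an assumption for illustration: boundaries are chosen by hashing a sliding window of bytes, chunk signatures are SHA-1 digests, and files are grouped by union-find whenever they share at least a given fraction of their chunks. The names `chunk_bytes`, `signatures`, `cluster_files`, and the parameters `window`, `mask`, `min_size`, and `min_shared` are hypothetical. The optional bipartite graph partitioning step is not shown.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunk_bytes(data, window=16, mask=0x3F, min_size=32):
    """Split data at content-defined boundaries (an assumed scheme)."""
    min_size = max(min_size, window)
    chunks, start = [], 0
    for i in range(len(data)):
        if i + 1 - start < min_size:
            continue
        # Hash only the trailing window, so a boundary depends on local
        # content rather than on the absolute offset in the file.
        digest = hashlib.md5(data[i + 1 - window:i + 1]).digest()
        if int.from_bytes(digest[:4], "big") & mask == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk, possibly short
    return chunks


def signatures(data):
    """Return the set of chunk signatures for one file."""
    return {hashlib.sha1(c).hexdigest() for c in chunk_bytes(data)}


def cluster_files(files, min_shared=0.5):
    """Group files (a dict of name -> bytes) that share enough chunks."""
    sigs = {name: signatures(data) for name, data in files.items()}

    # Chunk side of the bipartite file-chunk graph: signature -> files.
    chunk_to_files = defaultdict(set)
    for name, file_sigs in sigs.items():
        for sig in file_sigs:
            chunk_to_files[sig].add(name)

    # Count the chunks shared by each pair of files.
    shared = defaultdict(int)
    for names in chunk_to_files.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    # Union-find over files; link pairs above the sharing threshold.
    parent = {name: name for name in files}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), count in shared.items():
        if count / min(len(sigs[a]), len(sigs[b])) >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for name in files:
        clusters[find(name)].add(name)
    return [members for members in clusters.values() if len(members) > 1]
```

Calling `cluster_files` on a dictionary that maps file names to their raw bytes returns a list of sets, one per group of related files. Because boundaries are derived from content, a small insertion in one version should disturb only the chunks near the edit, leaving most signatures shared with the earlier version. Counting pairs per chunk is quadratic in the number of files that share a chunk, so a repository-scale version would need the kind of scalability measures the abstract alludes to.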
