Finding similar files in large document repositories

Abstract

Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction.

The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing.

We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
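
The abstract gives only the outline of the method, so the following is a minimal sketch of the general idea rather than the authors' implementation. It cuts each byte stream into content-defined chunks, hashes every chunk into a signature, links files to the signatures they contain, and reports connected groups of files as clusters. The function names (`chunk_bytes`, `cluster_files`), the CRC32 window fingerprint, the SHA-1 signature, and all size parameters are assumptions made for illustration. The optional bipartite graph partitioning step is not shown.

```python
import hashlib
import zlib
from collections import defaultdict
from typing import Dict, List, Set


def chunk_bytes(data: bytes, window: int = 16, mask: int = 0x3F,
                min_size: int = 32, max_size: int = 256) -> List[bytes]:
    """Split data at content-defined boundaries (all parameters are illustrative)."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        size = end - start
        if size < min_size:
            continue
        # Fingerprint the last `window` bytes. A boundary is declared when the
        # low bits are zero, so boundaries follow content rather than offsets.
        fingerprint = zlib.crc32(data[end - window:end])
        if fingerprint & mask == 0 or size >= max_size:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def cluster_files(files: Dict[str, bytes]) -> List[Set[str]]:
    """Group files that share at least one chunk signature (hypothetical helper)."""
    # File-chunk graph, stored as signature -> set of file names.
    chunk_to_files: Dict[str, Set[str]] = defaultdict(set)
    for name, data in files.items():
        for chunk in chunk_bytes(data):
            chunk_to_files[hashlib.sha1(chunk).hexdigest()].add(name)

    # Union-find over file names: files sharing a signature are merged.
    parent = {name: name for name in files}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for names in chunk_to_files.values():
        ordered = sorted(names)
        root = find(ordered[0])
        for other in ordered[1:]:
            parent[find(other)] = root

    clusters: Dict[str, Set[str]] = defaultdict(set)
    for name in files:
        clusters[find(name)].add(name)
    # Only groups with more than one file are interesting as near-duplicates.
    return [group for group in clusters.values() if len(group) > 1]
```

Calling `cluster_files` with a mapping from file name to file contents returns a list of sets of names. A real deployment would likely require more than one shared chunk before linking two files and would use a rolling hash instead of recomputing a fingerprint per position; both simplifications are choices of this sketch.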

Bibliographic details

Source: ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, pp. 394-400 (7 pages)
Conference location: Chicago, IL (US)
Authors: George Forman; Kave Eshghi; Stephane Chiocchetti
Original format: PDF
Chinese Library Classification: Computing technology, computer technology
Keywords: similarity
