Mining Frequent Differences in File Collections

Abstract

Collections of textual files, or documents, with substantial inter-document similarities are common in diverse domains. A practically significant class of such similarities, and the dual differences, are well characterized by edit scripts, or colloquially diffs, that use a simple sequence model for documents. The study of such diffs provides valuable insights into the inter-document relationships within a collection and can guide data integration within and across collections. This paper describes a framework for such study that is based on frequently occurring inter-document differences. It motivates and defines a general problem of mining frequent differences and outlines some specific instances. It presents the design and implementation of a prototype system for interactively discovering and visualizing frequent differences. A notable feature of this method is its use of difference-components, or deltas, to bootstrap the discovery of interesting structure in file collections. The paper describes a preliminary experimental evaluation of the method and implementation on a widely used corpus of file collections.
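
To make the idea of frequent deltas concrete, the following is a minimal sketch and not the paper's system. It assumes that a document is a sequence of lines, that a delta is one non-equal block of a pairwise edit script computed with Python's standard `difflib`, and that a delta is "frequent" when it occurs in at least `min_support` document pairs. The function names, the `min_support` parameter, and the sample `docs` collection are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def deltas(a_lines, b_lines):
    """Yield (tag, removed, added) for each non-equal block of the edit script."""
    matcher = SequenceMatcher(a=a_lines, b=b_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield (tag, tuple(a_lines[i1:i2]), tuple(b_lines[j1:j2]))


def frequent_deltas(collection, min_support=2):
    """Count each delta once per document pair; keep those with enough support.

    Assumption: pairs are taken in sorted name order, so a delta and its
    reverse (old and new swapped) are counted as different deltas.
    """
    counts = Counter()
    for name_a, name_b in combinations(sorted(collection), 2):
        counts.update(set(deltas(collection[name_a], collection[name_b])))
    return {d: n for d, n in counts.items() if n >= min_support}


# Hypothetical collection of four near-identical configuration files.
docs = {
    "a.conf": ["port=80", "debug=false", "name=alpha"],
    "b.conf": ["port=8080", "debug=false", "name=beta"],
    "c.conf": ["port=80", "debug=false", "name=gamma"],
    "d.conf": ["port=8080", "debug=false", "name=delta"],
}
result = frequent_deltas(docs)
print(result)
# {('replace', ('port=80',), ('port=8080',)): 3}
```

In this toy collection the per-file `name=` changes each occur in only one pair and are filtered out, while the `port=80` to `port=8080` replacement recurs in three pairs and hints at two groups of files. Comparing all pairs is quadratic in the number of documents, so this sketch suits small collections only.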