Mining Frequent Differences in File Collections


Abstract

Collections of textual files, or documents, with substantial inter-document similarities are common in diverse domains. A practically significant class of such similarities, and the dual differences, are well characterized by edit scripts, or colloquially diffs, that use a simple sequence model for documents. The study of such diffs provides valuable insights into the inter-document relationships within a collection and can guide data integration within and across collections. This paper describes a framework for such study that is based on frequently occurring inter-document differences. It motivates and defines a general problem of mining frequent differences and outlines some specific instances. It presents the design and implementation of a prototype system for interactively discovering and visualizing frequent differences. A notable feature of this method is its use of difference-components, or deltas, to bootstrap the discovery of interesting structure in file collections. The paper describes a preliminary experimental evaluation of the method and implementation on a widely used corpus of file-collections.
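
The idea of treating diff components (deltas) as countable items can be illustrated with a minimal sketch. This is not the paper's prototype system: it assumes documents are lists of lines, uses Python's standard `difflib.SequenceMatcher` as a stand-in for the edit-script computation, and counts a delta once for each document pair in which it occurs. The names `deltas`, `frequent_deltas`, and `min_support` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def deltas(a, b):
    """Yield the non-equal components of an edit script turning a into b.

    Each delta is (operation, removed_lines, inserted_lines). This encoding
    is an assumption made for illustration, not the paper's definition.
    """
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            yield (op, tuple(a[i1:i2]), tuple(b[j1:j2]))


def frequent_deltas(docs, min_support=2):
    """Return deltas that occur in at least min_support document pairs.

    docs maps a document name to its list of lines. Pairs are ordered by
    name, so each unordered pair is diffed once, in one direction.
    """
    counts = Counter()
    for (_, a), (_, b) in combinations(sorted(docs.items()), 2):
        # Use a set so a delta is counted at most once per pair.
        counts.update(set(deltas(a, b)))
    return {d: n for d, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    collection = {
        "a.txt": ["x", "y", "z"],
        "b.txt": ["x", "Y", "z"],
        "c.txt": ["x", "Y", "z"],
        "d.txt": ["x", "y", "z"],
    }
    for delta, support in frequent_deltas(collection).items():
        print(support, delta)
```

Diffing every pair is quadratic in the number of documents, so a sketch like this only suits small collections; the abstract does not say how the prototype selects which pairs to compare.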

Bibliographic details

Source: International Conference on Information Reuse and Integration for Data Science, 2020, pp. 357-364 (8 pages)
Author: Sudarshan S. Chawathe
Original format: PDF
Keywords: Prototypes; Semantics; Standards; Data mining; Printing; Rendering (computer graphics); Data integration
