Traditional data mining methods typically assume that data is centralized and static, an assumption that is no longer tenable. When data is dynamic, such methods waste computational and I/O resources; when data is distributed, they impose excessive communication overhead. As a result, the knowledge discovery process suffers from slow response times. Efficient implementation of incremental data mining in distributed computing environments is thus becoming crucial for ensuring scalability and facilitating knowledge discovery when data is dynamic and distributed. In this paper we address this issue in the context of frequent itemset mining, an important data mining task. Frequent itemsets are most often used to generate correlations and association rules, but more recently they have been applied in domains as far-reaching as bioinformatics and e-commerce. We first present an efficient algorithm that dynamically maintains the required information in the presence of data updates without examining the entire dataset. We then show how to parallelize the incremental algorithm so that it can mine frequent itemsets asynchronously. We also propose a distributed algorithm that imposes low communication overhead when mining distributed datasets. Several experiments confirm that our algorithm yields excellent execution time improvements.
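To make the idea of incremental maintenance concrete, the following is a minimal sketch, not the algorithm proposed in this paper. It assumes a naive scheme in which support counts for all itemsets up to a fixed size are kept in memory and updated from each new batch of transactions alone, so earlier data is never rescanned. The class name `IncrementalItemsetCounter`, its methods, and the `max_size` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


class IncrementalItemsetCounter:
    """Toy incremental support counter (hypothetical; not the paper's algorithm)."""

    def __init__(self, max_size=2):
        self.max_size = max_size      # largest itemset size tracked (assumption)
        self.counts = Counter()       # itemset (sorted tuple) -> support count
        self.num_transactions = 0

    def add_transactions(self, batch):
        """Update counts from the new transactions only; old data is not re-read."""
        for transaction in batch:
            items = sorted(set(transaction))
            for size in range(1, self.max_size + 1):
                for itemset in combinations(items, size):
                    self.counts[itemset] += 1
            self.num_transactions += 1

    def frequent(self, min_support):
        """Return itemsets whose relative support is at least min_support."""
        threshold = min_support * self.num_transactions
        return {s: c for s, c in self.counts.items() if c >= threshold}


counter = IncrementalItemsetCounter(max_size=2)
counter.add_transactions([["a", "b"], ["a", "c"]])
before = counter.frequent(0.6)   # frequent itemsets after the first batch
counter.add_transactions([["a", "b"], ["b", "c"]])
after = counter.frequent(0.6)    # refreshed without rescanning the first batch
```

Keeping counts for every small itemset trades memory for update speed; a practical method would likely track far fewer candidates, and this sketch handles insertions only, not deletions.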