Abstract In real-world dirty data, errors are often not randomly distributed. Rather, they tend to occur only under certain conditions, such as when the transaction is handled by a certain operator, or the weather is rainy. Leveraging such common conditions, or “cause conditions”, the proposed data-cleansing algorithm resolves multi-tuple conflicts with high speed, achieves higher completeness, and runs with high accuracy in realistic settings. We first present complexity analyses of the problem, pointing out two subproblems that are NP-complete. We then introduce, for each subproblem, heuristics that work in sub-polynomial time. We also raise the issue that some previous studies overlook the notion of repair-completeness, which means, having less number of unsolved conflicts in the resulting repairs. The proposed method is capable of obtaining a complete repair if we are allowed to preprocess the input set of constraints. The algorithms are tested with three sets of data and rules. The experiments show that, compared to the state-of-the-art methods for conditional functional dependencies-based and FD-based data cleansing, the proposed algorithm scales better with respect to the data size, is the only method that outputs complete repairs, and is more accurate especially when the error distribution is skewed.
展开▼