Knowledge-Based Systems

MapReduce based parallel attribute reduction in Incomplete Decision Systems



Abstract

The scale of the data collected today from real-world applications is massive. This data can also contain missing (incomplete) values, giving rise to large-scale incomplete decision systems (IDS). Parallel attribute reduction in big data is an essential preprocessing step for scalable machine learning model construction. Rough set theory has been used as a powerful tool for attribute reduction in complete decision systems (CDS), and extensions of classical rough set theory have been proposed to deal with IDS. Considerable research has been done on efficient attribute reduction in IDS using these extensions, but no parallel/distributed approaches have been proposed for attribute reduction in large-scale IDS. Processing large-scale IDS is difficult owing to two challenges: scale and incompleteness. To address these challenges, we propose MapReduce-based parallel/distributed approaches for attribute reduction in massive IDS. The proposed approaches resolve the challenge of incompleteness with the existing Novel Granular Framework (NGF), and each follows a different data partitioning strategy to handle data sets that are large in both the number of objects and the number of attributes. One approach adopts an alternative representation of the NGF and uses horizontal partitioning (division in object space) of the data across the nodes of the cluster. The other embraces the existing NGF and uses vertical partitioning (division in attribute space) of the data. Extensive experimental analysis was carried out on various data sets with different percentages of incompleteness. The results show that the horizontal-partitioning-based approach performs well for data sets with a massive object space, while the vertical-partitioning-based approach scales well for extremely high-dimensional data sets. (C) 2020 Elsevier B.V. All rights reserved.
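The horizontal-partitioning strategy described above can be sketched in miniature. The toy below is not the paper's NGF-based algorithm: it assumes a simplified model in which a missing value `"*"` is treated as an ordinary distinct value (so classes are plain equivalence classes rather than NGF tolerance granules), and all names (`map_partition`, `reduce_counts`, `positive_region_size`) are hypothetical. Each "mapper" sees one horizontal slice of the object space and emits local (attribute-signature, decision) counts; the "reducer" merges them, after which the positive-region size, a standard rough-set measure of attribute-subset quality, can be read off globally:

```python
from collections import defaultdict

def map_partition(rows, attrs):
    """Map step: one node emits (signature, decision) counts for its
    horizontal slice of the object space."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        sig = tuple(row[a] for a in attrs)  # values of the candidate attribute subset
        counts[sig][row["d"]] += 1          # "d" is the decision attribute
    return counts

def reduce_counts(partials):
    """Reduce step: merge per-partition counts into global class counts.
    Counts are additive, so horizontal slices combine exactly."""
    merged = defaultdict(lambda: defaultdict(int))
    for part in partials:
        for sig, dec_counts in part.items():
            for d, c in dec_counts.items():
                merged[sig][d] += c
    return merged

def positive_region_size(merged):
    """Number of objects in consistent classes (a single decision value)."""
    return sum(sum(dc.values()) for dc in merged.values() if len(dc) == 1)

# Two horizontal partitions; "*" marks a missing value.
p1 = [{"a": 1, "b": 0, "d": "y"}, {"a": 1, "b": "*", "d": "n"}]
p2 = [{"a": 1, "b": 0, "d": "y"}, {"a": 1, "b": 0, "d": "n"}]
attrs = ["a", "b"]

merged = reduce_counts([map_partition(p, attrs) for p in (p1, p2)])
print(positive_region_size(merged))  # class (1, 0) is inconsistent, so only 1
```

The key property making the horizontal scheme MapReduce-friendly is that the per-class counts are additive across object slices, so each node works independently and the reducer's merge is exact; the vertical scheme instead keeps all objects on every node but splits the attribute space.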
