Humankind has been collecting data since recording began, but in the last decade, with considerable advances in computing and storage technologies, the rise of cloud computing, and the development of ubiquitous connectivity and the Internet of Things, there has been an explosion in the size and variety of collected data. Nevertheless, one can be data-rich yet knowledge-poor, and this is where data analytics and the development and application of machine learning models become a necessity: for gaining insight into complex processes, proving scientific theories and discoveries, supporting decision making, and enhancing strategic planning in areas such as the economy, finance, industry, and healthcare. Recently, there has been an influx of polymorphic, unstructured and multimodal data - social media, images, audio, video, etc. - which further complicates data processing and knowledge extraction. But even traditional structured datasets present problems that must be addressed and overcome in the early stages of data pre-processing, feature extraction and feature selection. This is because they usually contain a variety of data formats (e.g., categorical, continuous, ordinal) and frequently missing data, usually the result of sensor faults, human errors, or problems in collection, transportation, or storage. The most popular approaches for dealing with missing data generally fall into three groups: deletion methods, single imputation methods, and model-based methods [1].
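To make the three groups concrete, the following is a minimal illustrative sketch (not taken from [1]); the toy two-feature matrix is hypothetical, mean substitution stands in for single imputation, and a simple linear regression stands in for the model-based family:

```python
import numpy as np

# Hypothetical toy matrix: rows are samples, columns are two features;
# np.nan marks missing entries (e.g., a faulty sensor reading).
X = np.array([
    [25.0, 40000.0],
    [np.nan, 52000.0],
    [47.0, 58000.0],
    [31.0, np.nan],
])

# 1. Deletion: keep only the fully observed rows (listwise deletion).
complete = X[~np.isnan(X).any(axis=1)]

# 2. Single imputation: fill each missing entry with its column mean.
col_means = np.nanmean(X, axis=0)
mean_imputed = np.where(np.isnan(X), col_means, X)

# 3. Model-based: regress each column on the other one, fitting on the
#    complete rows, then predict the missing entries from the observed feature.
model_imputed = X.copy()
for target in range(2):
    other = 1 - target
    slope, intercept = np.polyfit(complete[:, other], complete[:, target], deg=1)
    miss = np.isnan(X[:, target])
    model_imputed[miss, target] = slope * X[miss, other] + intercept
```

Deletion discards half of this toy sample, which illustrates its main drawback; both imputation variants retain all rows, with the model-based fill exploiting the correlation between features rather than a single column statistic.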