International Conference on Imaging, Signal Processing and Communication

A Text Preprocessing Framework for Text Mining on Big Data Infrastructure


Abstract

Many people use social media every day to comment on events and to post about their own activities. Businesses and industries can apply text mining to this social media data to improve their products. However, a single machine cannot process a large volume of data because its resources, such as CPU, main memory, and storage, are limited. Moreover, the data consists of both structured and unstructured data that must be prepared before text mining or machine learning can be applied. Text preprocessing therefore becomes one of the most important tasks, and it involves many steps. In this paper, we design and develop an efficient text preprocessing framework on a big data infrastructure to support the text preprocessing task and reduce computation time. The framework consists of four main modules: a data collection module, a storage module, a data cleaning module, and a feature extraction module. For the performance evaluation, we collected sentiment data from the Facebook page “Tasty”, which has more than 90 million members. The data was divided into three datasets of different sizes, and each dataset was tested in different cluster system environments. We also compared the computation time of the proposed framework with that of a single machine. The results show that our framework reduces the computation time significantly. The framework can therefore be applied to complex text preprocessing problems in text mining.
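
The abstract names a data cleaning module and a feature extraction module but does not describe how they work. As a rough illustration only, the sketch below shows what those two steps might look like on a single machine using the Python standard library. The function names, the stop-word list, and the bag-of-words features are all assumptions made for this example and are not taken from the paper; the paper's framework runs on a cluster, which this sketch does not attempt to reproduce.

```python
import re
from collections import Counter

# Hypothetical stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "to", "and", "of", "so"}

URL_RE = re.compile(r"https?://\S+")      # matches http/https links
NON_ALPHA_RE = re.compile(r"[^a-z\s]")    # anything except lowercase letters and whitespace


def clean(comment):
    """Cleaning step (assumed): lowercase, drop URLs and non-letters, remove stop words."""
    text = URL_RE.sub(" ", comment.lower())
    text = NON_ALPHA_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def extract_features(tokens):
    """Feature extraction step (assumed): bag-of-words term counts."""
    return Counter(tokens)


def preprocess(comments):
    """Run cleaning, then feature extraction, on each comment independently."""
    return [extract_features(clean(c)) for c in comments]


if __name__ == "__main__":
    sample = [
        "This recipe is SO good!!! https://example.com/r/1",
        "Too salty, too sweet.",
    ]
    for features in preprocess(sample):
        print(dict(features))
```

Because each comment is processed independently, a step of this shape could in principle be mapped over partitions of a dataset on a cluster, which is the kind of workload the abstract says benefits from a big data infrastructure.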
