Conference: International Conference on Imaging, Signal Processing and Communication

A Text Preprocessing Framework for Text Mining on Big Data Infrastructure

Abstract

Many people use social media every day to comment on events and to post about their own activities. Businesses and industries can use social media to improve their products by applying text mining techniques. However, a single machine cannot process a large amount of data because its resources, such as CPU, main memory, and storage, are limited. Moreover, the data consists of both structured and unstructured data, which must be prepared before text mining or machine learning can be applied. Text preprocessing is therefore one of the most important tasks, and it involves many steps. In this paper, we design and develop an efficient text preprocessing framework on a big data infrastructure, intended to support the text preprocessing task and reduce computation time. The framework consists of four main modules: a data collection module, a storage module, a data cleaning module, and a feature extraction module. For the performance evaluation, we collected sentiment data from the Facebook page “Tasty”, which has more than 90 million members. The data was split into three datasets of different sizes, and each dataset was tested in different cluster system environments. We also compared the computation time of the proposed framework with that of a single machine. The results show that our framework reduces the computation time significantly. Thus, the framework can be applied to complex text preprocessing problems in text mining.
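
The abstract names a data cleaning module and a feature extraction module but does not say how they are implemented. As a rough illustration only, the following single-machine Python sketch shows what those two steps might look like for short social media comments. Every name in it (`clean_text`, `extract_features`, `preprocess`, `STOP_WORDS`) is hypothetical, and the stop-word list and bag-of-words features are assumptions rather than the authors' choices.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the paper's actual list is not given.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "to", "of", "in", "this"}

URL_RE = re.compile(r"https?://\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_text(comment: str) -> list[str]:
    """Data cleaning (assumed steps): lowercase, strip URLs and
    non-letter characters, split on whitespace, drop stop words."""
    text = URL_RE.sub(" ", comment.lower())
    text = NON_ALPHA_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def extract_features(tokens: list[str]) -> dict[str, int]:
    """Feature extraction (assumed): bag-of-words term counts."""
    return dict(Counter(tokens))


def preprocess(comments: list[str]) -> list[dict[str, int]]:
    """Run cleaning and feature extraction on each comment independently."""
    return [extract_features(clean_text(c)) for c in comments]


if __name__ == "__main__":
    sample = [
        "This recipe is AMAZING!!! See https://example.com/recipe",
        "The cake is dry, the frosting is great.",
    ]
    for features in preprocess(sample):
        print(features)
```

Because each comment is processed independently of the others, per-record work of this kind is what a cluster can split across nodes. The abstract does not state which big data platform the authors used, so the sketch does not assume one.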