A Text Preprocessing Framework for Text Mining on Big Data Infrastructure

Abstract

Many people use social media every day to comment on events and post about their own activities. Business organizations and industries can apply text mining to social media data to improve their products. However, a single machine cannot process a large amount of data because of resource limits such as CPU, main memory, and storage. Moreover, the data consists of both structured and unstructured content that must be prepared before text mining or machine learning can be applied. Text preprocessing therefore becomes one of the most important tasks, because it involves many steps. In this paper, we design and develop an efficient text preprocessing framework on a big data infrastructure, intended to support the text preprocessing task and reduce computation time. The framework consists of four main modules: a data collection module, a storage module, a data cleaning module, and a feature extraction module. For the performance evaluation, we collected sentiment data from the Facebook page "Tasty", which has more than 90 million members. The data was split into three datasets of different sizes, and each dataset was tested on different cluster system environments. We also compared the computation time of the proposed framework with that of a single machine. The results show that our framework reduces the computation time significantly. Thus, the framework can be applied to complex text preprocessing problems in text mining.
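The abstract does not say which cleaning steps or features the framework uses, nor which big data platform it runs on. As a rough illustration only, the following single-machine sketch shows what a data cleaning step followed by a feature extraction step might look like for social media comments. The stop-word list, the function names (`clean`, `extract_features`, `preprocess`), the sample comments, and the choice of bag-of-words counts are all assumptions made for this example, not details taken from the paper.

```python
from __future__ import annotations

import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "this", "it"}

URL_RE = re.compile(r"https?://\S+")      # matches http/https links
NON_ALPHA_RE = re.compile(r"[^a-z\s]")    # anything except lowercase letters and whitespace


def clean(comment: str) -> list[str]:
    """Data cleaning: lowercase, strip URLs and non-letters, tokenize, drop stop words."""
    text = URL_RE.sub(" ", comment.lower())
    text = NON_ALPHA_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def extract_features(tokens: list[str]) -> dict[str, int]:
    """Feature extraction: bag-of-words term counts for one comment."""
    return dict(Counter(tokens))


def preprocess(comments: list[str]) -> list[dict[str, int]]:
    """Run cleaning and feature extraction over a batch of comments."""
    return [extract_features(clean(c)) for c in comments]


# Made-up sample comments, not data from the paper.
comments = [
    "This recipe is AMAZING!!! https://example.com/r/1",
    "The cake is dry, and the frosting is too sweet.",
]
features = preprocess(comments)
print(features[0])
```

Because each comment is processed independently of the others, a step like `preprocess` could in principle be distributed by splitting the comments into partitions and handling each partition on a different node. Whether the paper's framework works this way is not stated in the abstract.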

Bibliographic record

Source: International Conference on Imaging, Signal Processing and Communication | 2018 | 1 v. | 5 pages
Authors: Watcharaporn Sriyanong; Nunnapus Moungmingsuk; Nattawat Khamphakdee
Original format: PDF
Chinese Library Classification: Computing technology, computer technology
Keywords: Text mining; Big Data; Task analysis; Feature extraction; Sentiment analysis; Facebook
