Journal: Procedia Computer Science

Data Pre-processing Evaluation for Text Mining: Transaction/Sequence Model



Abstract

Data pre-processing is the most time-consuming phase in the whole process of knowledge discovery. The complexity of data pre-processing depends on the data sources used. The aim of this work is to determine to what extent the time-consuming data pre-processing is necessary when discovering sequential patterns in e-documents. We used the transaction/sequence model for text representation and sequence rule analysis as the modelling method. We compare four datasets of different quality, obtained from texts and pre-processed in different ways: data with identified paragraph sequences, data with identified sentence sequences, data with identified paragraph sequences without stop words, and data with identified sentence sequences without stop words. We assess the impact of these advanced data pre-processing techniques on the quantity and quality of the extracted rules. The results confirm some initial assumptions, but they also show that stop-word removal has a substantial impact on the quantity and quality of extracted rules in the case of paragraph sequence identification. In contrast, in the case of sentence sequence identification, removing stop words has no significant impact on the quantity and quality of extracted rules.
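The four pre-processing variants named in the abstract can be illustrated with a minimal sketch. It assumes that each paragraph or sentence is treated as one sequence and that its words are the ordered items; the abstract does not spell out the exact representation, so this is one plausible reading rather than the authors' implementation. The stop-word list, the regex-based sentence splitter, and the names `STOP_WORDS`, `split_units`, and `build_sequences` are all hypothetical.

```python
import re

# Illustrative stop-word list (an assumption; the paper's list is not given here).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens in order."""
    return re.findall(r"[a-z]+", text.lower())


def split_units(document, unit):
    """Split a document into paragraphs (blank-line separated) or sentences."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", document)
    elif unit == "sentence":
        # Naive splitter: break after ., ! or ? followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", document)
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    return [p.strip() for p in parts if p.strip()]


def build_sequences(document, unit, remove_stop_words=False):
    """Return a list of (sequence_id, ordered word list) pairs."""
    sequences = []
    for seq_id, chunk in enumerate(split_units(document, unit), start=1):
        words = tokenize(chunk)
        if remove_stop_words:
            words = [w for w in words if w not in STOP_WORDS]
        if words:
            sequences.append((seq_id, words))
    return sequences


DOC = "Data mining is fun. Text mining is harder.\n\nStop words are common."

# The four datasets compared in the abstract, keyed by (unit, stop words removed).
datasets = {
    (unit, drop): build_sequences(DOC, unit, remove_stop_words=drop)
    for unit in ("paragraph", "sentence")
    for drop in (False, True)
}
```

A sequence rule miner would then be run on each of the four datasets and the resulting rule sets compared by quantity and quality; that mining step is outside this sketch.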
