Data Pre-processing Evaluation for Text Mining: Transaction/Sequence Model

Daša Munková; Michal Munk; Martin Vozár


Abstract

Data pre-processing is the most time-consuming phase of the whole knowledge discovery process, and its complexity depends on the data sources used. The aim of this work is to determine to what extent time-consuming data pre-processing is necessary when discovering sequential patterns in e-documents. We used the transaction/sequence model for text representation and sequence rule analysis as the modelling method. We compare four datasets of different quality, obtained from texts and pre-processed in different ways: data with identified paragraph sequences, data with identified sentence sequences, data with identified paragraph sequences without stop words, and data with identified sentence sequences without stop words. We assess the impact of these advanced pre-processing techniques on the quantity and quality of the extracted rules. The results confirm some initial assumptions, but they also show that stop-word removal has a substantial impact on the quantity and quality of extracted rules in the case of paragraph sequence identification. By contrast, in the case of sentence sequence identification, removing stop words has no significant impact on the quantity or quality of the extracted rules.
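The four pre-processing variants named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the regex-based sentence splitter, the sample text, and the names `split_units` and `to_sequences` are all assumptions made for illustration. Each output row is a (sequence_id, position, word) record, one plausible way to lay text out in a transaction/sequence form.

```python
import re

# Tiny illustrative stop-word list (an assumption; the abstract does not list the stop words used).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}


def split_units(text, unit):
    """Split text into paragraphs (blank-line separated) or sentences (naive regex split)."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    elif unit == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    return [p.strip() for p in parts if p.strip()]


def to_sequences(text, unit, remove_stop_words=False):
    """Return rows (sequence_id, position, word), one sequence per paragraph or sentence."""
    rows = []
    for seq_id, chunk in enumerate(split_units(text, unit), start=1):
        # Lower-case word tokens; hyphens and apostrophes inside a word are kept.
        words = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", chunk.lower())
        if remove_stop_words:
            words = [w for w in words if w not in STOP_WORDS]
        rows.extend((seq_id, pos, w) for pos, w in enumerate(words, start=1))
    return rows


# Hypothetical sample text: two paragraphs, three sentences.
SAMPLE = "Data mining is fun. The model is simple.\n\nStop words add noise."

# The four dataset variants: {paragraph, sentence} x {with, without stop words}.
datasets = {
    (unit, drop): to_sequences(SAMPLE, unit, remove_stop_words=drop)
    for unit in ("paragraph", "sentence")
    for drop in (False, True)
}

for (unit, drop), rows in datasets.items():
    label = f"{unit} sequences, stop words {'removed' if drop else 'kept'}"
    print(f"{label}: {len(rows)} rows")
```

Rows in this shape could then be passed to a sequence rule miner of one's choice to compare how many rules each variant yields; that step is outside this sketch.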


Bibliographic details

Source: Procedia Computer Science, 2013, Issue 1, 10 pages
Authors: Daša Munková; Michal Munk; Martin Vozár
Original format: PDF
Chinese Library Classification: Computing technology, computer technology
Keywords: Data pre-processing; stop words; sequence identification; transaction/sequence model; text mining; evaluation


Similar literature

1. Evaluating the use of linguistic information in the pre-processing phase of Text Mining [J]. Cassiana Fagundes da Silva, Fernando Santos Osório, Renata Vieira. Inteligencia Artificial: Ibero-American Journal of Artificial Intelligence, 2005, Issue 26.
2. Web Usage Mining: Data Pre-processing Impact on Found Knowledge in Predictive Modelling [J]. Peter Svec, Lubomir Benko, Miroslav Kadlecik. Procedia Computer Science, 2020, Issue 5.
3. PepBank - a database of peptides based on sequence text mining and public peptide data sources [J]. Timur Shtatland, Daniel Guettler, Misha Kossodo. BMC Bioinformatics, 2007, Issue 1.
4. Data Pre-Processing Evaluation for Text Mining: Transaction/Sequence Model [C]. Daša Munková, Michal Munk, Martin Vozár. International Conference on Computational Science, 2013.
5. Extracting signal from noise in biological data: Evaluations and applications of text mining and sequence coevolution [D]. Caporaso, J. Gregory. 2009.
6. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks [O]. Kenneth Jung, Paea LePendu, Srinivasan Iyer. 2015.
7. Data Pre-processing Evaluation for Text Mining: Transaction/Sequence Model [O]. Munková Daša, Munk Michal, Vozár Martin. 2013.
8. Effects of distributed database modeling on evaluation of transaction rollbacks [R]. Mukkamala, Ravi. 1991.
