A Cleaning Algorithm for Noiseless Opinion Mining Corpus Construction

Abstract

This paper presents DyCorC, an extractor and cleaner of web forum content. Its main strengths are that the process is entirely automatic, language-independent, and adaptable to all kinds of forum architectures. The corpus is built according to user queries, using expressions or item keywords as in search engines; DyCorC then minimizes the boilerplate, gathering comments and scores for subsequent feature-based opinion mining and sentiment analysis. Such noiseless corpora are usually built by hand with the help of crawlers and scrapers, with specific containers devised for each type of forum, which requires considerable work and skill. Our aim is to cut down this preprocessing stage. Our algorithm is compared with state-of-the-art tools (Apache Nutch, BootCat, JusText) on a gold standard corpus that we released. DyCorC offers better-quality noiseless content extraction. Its algorithm is based on DOM trees with string distances; seven distances were compared on the reference corpus, and feature-distance was chosen as the best fit.
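
The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general idea of combining DOM tree paths with a string distance, not DyCorC's implementation. It assumes that forum posts share a near-identical tag path, and it uses Levenshtein edit distance as a stand-in for the seven distances the paper compares (the chosen "feature-distance" is not defined in this record). The names `PathCollector`, `extract_repeated_blocks`, `max_dist` and `SAMPLE` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record every text node together with the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, e.g. ["html", "body", "div.post"]
        self.blocks = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            cls = dict(attrs).get("class") or ""
            self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(("/".join(self.stack), text))


def levenshtein(a, b):
    """Classic edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def extract_repeated_blocks(html, max_dist=2):
    """Keep text whose tag path is close to the most frequent tag path.

    Assumption of this sketch: the most frequent path is the post template,
    and everything far from it (menus, footers) is boilerplate.
    """
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return []
    reference = Counter(path for path, _ in parser.blocks).most_common(1)[0][0]
    return [text for path, text in parser.blocks
            if levenshtein(path, reference) <= max_dist]


SAMPLE = """
<html><body>
<div class="nav"><a href="/">Home</a></div>
<div class="post"><p class="msg">Great battery life.</p></div>
<div class="post"><p class="msg">Screen is too dim.</p></div>
<div class="post"><p class="msg2">Would buy again.</p></div>
<div class="footer"><span>Copyright</span></div>
</body></html>
"""

if __name__ == "__main__":
    for comment in extract_repeated_blocks(SAMPLE):
        print(comment)
```

On the toy page in `SAMPLE`, the navigation link and the footer sit on tag paths far from the post template and are dropped, while the post with the slightly different class `msg2` is still kept because its path is within the distance threshold.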

Bibliographic details

Source
IEEE/ACS International Conference on Computer Systems and Applications, 2018, pp. 1-7 (7 pages)
Authors
Otman Manad; Anna Pappa; Gilles Bernard
Original format
PDF
Keywords
Crawlers; Cleaning; Gold; Standards; Feature extraction; Data mining; XML
