Published in: IEEE/ACS International Conference on Computer Systems and Applications

A Cleaning Algorithm for Noiseless Opinion Mining Corpus Construction

Abstract

This paper presents DyCorC, an extractor and cleaner of web forum content. Its main strengths are that the process is entirely automatic, language-independent, and adaptable to all kinds of forum architectures. The corpus is built according to user queries, using expressions or item keywords as in search engines; DyCorC then minimizes the boilerplate, gathering comments and scores for further feature-based opinion mining and sentiment analysis. Such noiseless corpora are usually built by hand with the help of crawlers and scrapers, with specific containers devised for each type of forum, which entails a great deal of work and skill. Our aim is to cut down this preprocessing stage. Our algorithm is compared to state-of-the-art models (Apache Nutch, BootCat, JusText) on a gold standard corpus that we released. DyCorC offers better-quality extraction of noiseless content. Its algorithm is based on DOM trees with string distances, seven of which have been compared on the reference corpus, and feature-distance has been chosen as the best fit.
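
The abstract does not specify how DyCorC combines DOM trees with string distances, so the following is only a minimal illustrative sketch of the general idea, not the authors' algorithm. It assumes that forum posts sit under near-identical tag paths, groups text nodes whose paths are within a small Levenshtein distance (computed over tag names, standing in for the seven distances the paper compares), and keeps the group carrying the most text. All names (`PathCollector`, `edit_distance`, `extract_repeated_text`, `SAMPLE_PAGE`) and the `max_dist` threshold are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class PathCollector(HTMLParser):
    """Record (tag path, text) for every non-empty text node in the DOM."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.nodes = []   # list of (path tuple, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close back to the last matching open tag (tolerates unclosed tags).
            idx = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[idx:]

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.stack and self.stack[-1] in ("script", "style")):
            self.nodes.append((tuple(self.stack), text))


def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: tuples of tag names)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (item_a != item_b)))  # substitution
        prev = cur
    return prev[-1]


def extract_repeated_text(html, max_dist=1):
    """Greedily cluster text nodes by tag-path distance; return the largest cluster."""
    collector = PathCollector()
    collector.feed(html)
    clusters = []  # each cluster: (representative path, list of texts)
    for path, text in collector.nodes:
        for rep, texts in clusters:
            if edit_distance(rep, path) <= max_dist:
                texts.append(text)
                break
        else:
            clusters.append((path, [text]))
    if not clusters:
        return []
    # Assumed heuristic: the cluster with the most characters is the content.
    return max(clusters, key=lambda c: sum(len(t) for t in c[1]))[1]


SAMPLE_PAGE = """
<html><body>
<ul><li><a>Home</a></li><li><a>Login</a></li></ul>
<div><p>Great battery life.</p></div>
<div><blockquote><p>Screen is too dim.</p></blockquote></div>
<div><p>Worth the price.</p></div>
<footer>Copyright notice</footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_repeated_text(SAMPLE_PAGE))
```

On this toy page the navigation links and the footer end up in their own clusters, because their tag paths differ from the post paths by more than one tag, while the quoted reply stays with the other posts. A real forum would need per-site tuning of the threshold, which is precisely the manual effort the paper aims to remove.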
