Leveraging Web 2.0 Sources for Web Content Classification

Abstract

This paper addresses practical aspects of web page classification not captured by the classical text mining framework. Classifiers are supposed to perform well on a broad variety of pages. We argue that constructing training corpora is a bottleneck for building such classifiers, and that care has to be taken if the goal is to generalize to previously unseen kinds of pages on the web. We study techniques for building training corpora automatically from publicly available web resources, quantify the discrepancy between them, and demonstrate that encouraging agreement between classifiers given such diverse sources drastically outperforms methods that ignore the different natures of data sources on the web.
