CLASSIFICATION OF WEB PAGES IN YIOOP WITH ACTIVE LEARNING


Abstract

This thesis project augments the Yioop search engine with a general facility for automatically assigning "class" meta words (e.g., "class:advertising") to web pages based on the output of a logistic regression text classifier. Users can create multiple classifiers using Yioop's web-based interface, each trained first on a small set of labeled documents drawn from previous crawls, then improved over repeated rounds of active learning using density-weighted pool-based sampling.

The classification system's accuracy when classifying new documents was found to be comparable to published results for a common dataset, approaching 82% for a corpus of advertisements to be filtered from content providers' web pages. In agreement with previous work, logistic regression was found to provide greater accuracy than Naive Bayes for training sets consisting of more than two hundred documents. Active learning with density-weighted pool-based sampling was found to offer a small accuracy boost over random document sampling for training sets consisting of fewer than one hundred documents.

Overall, the system was shown to be effective for the proposed task of allowing users to create novel web page classifiers, but the active learning component will require more work if it is to provide users with a salient benefit over random sampling.
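One round of the active learning loop described in the abstract can be illustrated with a minimal sketch. It is not Yioop's code: it assumes the common information-density scoring, in which each unlabeled document's prediction entropy is multiplied by its average cosine similarity to the pool raised to a power `beta`. The abstract does not state the thesis's exact scoring function, and every name and parameter below (`train_logreg`, `select_query`, `beta`, the learning rate) is hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    # w holds one weight per feature followed by a bias term;
    # zip stops at len(x), so the bias is added separately.
    return sigmoid(sum(wj * v for wj, v in zip(w, x)) + w[-1])

def train_logreg(X, y, lr=0.5, epochs=200):
    """Fit weights (bias last) by batch gradient descent on log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, v in enumerate(xi):
                grad[j] += err * v
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def entropy(p):
    # Binary prediction entropy; highest when p is near 0.5.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def cosine(a, b):
    dot = sum(u * v for u, v in zip(a, b))
    na = math.sqrt(sum(u * u for u in a))
    nb = math.sqrt(sum(v * v for v in b))
    return dot / (na * nb) if na and nb else 0.0

def select_query(w, pool, beta=1.0):
    """Return the index of the pool document to label next.

    Score = entropy * density**beta, where density is the mean cosine
    similarity to every pool document (itself included). Assumes
    non-negative features such as term counts, so density >= 0.
    """
    best_i, best_score = None, -1.0
    for i, x in enumerate(pool):
        density = sum(cosine(x, other) for other in pool) / len(pool)
        score = entropy(predict_proba(w, x)) * density ** beta
        if score > best_score:
            best_i, best_score = i, score
    return best_i

# Toy usage: two labeled documents over a two-term vocabulary.
weights = train_logreg([[1.0, 0.0], [0.0, 1.0]], [1, 0])
pool = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
next_index = select_query(weights, pool)
```

In a full loop, the selected document would be shown to the user for labeling, moved from the pool into the training set, and the classifier retrained before the next round. Setting `beta` to 0 reduces the score to plain uncertainty sampling, which makes the effect of the density weighting easy to compare.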

Bibliographic record

  • Author: Shawn C. Tice
  • Original format: PDF
