Conference: 19th International World Wide Web Conference 2010

A Novel Traffic Analysis for Identifying Search Fields in the Long Tail of Web Sites

Abstract

Using a clickstream sample of 2 billion URLs from many thousands of volunteer Web users, we wish to analyze typical usage of keyword searches across the Web. In order to do this, we need to be able to determine whether a given URL represents a keyword search and, if so, which field contains the query. Although it is easy to recognize 'q' as the query field in 'http://www.google.com/search?hl=en&q=music', we must do this automatically for the long tail of diverse websites. This problem is the focus of this paper. Since the names, types and number of fields differ across sites, this does not conform to traditional text classification or to multi-class problem formulations. The problem also exhibits highly non-uniform importance across websites, since traffic follows a Zipf distribution.

We developed a solution based on manually identifying the query fields on the most popular sites, followed by an adaptation of machine learning for the rest. It involves an interesting case-instances structure: labeling each website case usually involves selecting at most one of the field instances as positive, based on seeing sample field values. This problem structure and soft constraint (which we believe has broader applicability) can be used to greatly reduce the manual labeling effort. We employed active learning and judicious GUI presentation to efficiently train a classifier with accuracy estimated at 96%, beating several baseline alternatives.
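To make the case-instances structure concrete, here is a minimal Python sketch; it is not the authors' implementation. It groups sample field values by website (the "case") and field name (the "instance"), then picks at most one field per site. The function names, the 0.5 threshold, and `toy_score` are all hypothetical: `toy_score` is a crude stand-in for the trained classifier that the abstract mentions but does not specify.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def field_instances(urls):
    """Group sample values as cases[host][field_name] -> list of values."""
    cases = defaultdict(lambda: defaultdict(list))
    for url in urls:
        parts = urlsplit(url)
        for name, value in parse_qsl(parts.query, keep_blank_values=True):
            cases[parts.netloc][name].append(value)
    return cases


def toy_score(name, values):
    """Hypothetical scorer: fraction of values that look like free text.

    A real system would use a trained classifier here; this heuristic
    only exists so the sketch runs end to end.
    """
    if not values:
        return 0.0
    texty = [v for v in values if len(v) > 2 and not v.isdigit()]
    return len(texty) / len(values)


def pick_query_field(fields, score=toy_score, threshold=0.5):
    """Return at most one field per site: the top scorer, if it clears
    the (assumed) threshold; otherwise None."""
    if not fields:
        return None
    best = max(fields, key=lambda name: score(name, fields[name]))
    return best if score(best, fields[best]) >= threshold else None


urls = [
    "http://www.google.com/search?hl=en&q=music",
    "http://shop.example.com/item?id=123&page=2",  # hypothetical URL
]
cases = field_instances(urls)
picks = {host: pick_query_field(fields) for host, fields in cases.items()}
```

In this sketch the "at most one positive per site" constraint is applied as a hard arg-max at prediction time; the abstract describes it as a soft constraint used to reduce labeling effort, so this is only one simplified way to exploit it.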
