A Novel Traffic Analysis for Identifying Search Fields in the Long Tail of Web Sites


Abstract

Using a clickstream sample of 2 billion URLs from many thousand volunteer Web users, we wish to analyze typical usage of keyword searches across the Web. In order to do this, we need to be able to determine whether a given URL represents a keyword search and, if so, which field contains the query. Although it is easy to recognize 'q' as the query field in 'http://www.google.com/search?hl=en&q=music', we must do this automatically for the long tail of diverse websites. This problem is the focus of this paper. Since the names, types and number of fields differ across sites, this does not conform to traditional text classification or to multi-class problem formulations. The problem also exhibits highly non-uniform importance across websites, since traffic follows a Zipf distribution.

We developed a solution based on manually identifying the query fields on the most popular sites, followed by an adaptation of machine learning for the rest. It involves an interesting case-instances structure: labeling each website case usually involves selecting at most one of the field instances as positive, based on seeing sample field values. This problem structure and soft constraint, which we believe has broader applicability, can be used to greatly reduce the manual labeling effort. We employed active learning and judicious GUI presentation to efficiently train a classifier with accuracy estimated at 96%, beating several baseline alternatives.
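The case-instances structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' classifier: `toy_score` is a hypothetical stand-in heuristic for a trained model, and the function names, the 0.5 threshold, and the `example.com` URLs are assumptions made only for illustration. Each website is a case, each URL field is an instance, and at most one field per site is selected as the query field.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def collect_field_values(urls):
    """Group observed query-string values by (host, field name)."""
    values = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        for name, value in parse_qsl(parts.query, keep_blank_values=True):
            values[(parts.hostname, name)].append(value)
    return values


def toy_score(samples):
    """Hypothetical heuristic, NOT the paper's learned classifier:
    the share of samples that are distinct values containing a letter."""
    texty = {v for v in samples if any(c.isalpha() for c in v)}
    return len(texty) / len(samples)


def pick_query_fields(urls, threshold=0.5):
    """Select at most one field per site (the 'at most one positive' constraint)."""
    by_site = defaultdict(dict)
    for (host, name), samples in collect_field_values(urls).items():
        by_site[host][name] = toy_score(samples)
    result = {}
    for host, scores in by_site.items():
        best = max(scores, key=scores.get)
        # A site may have no search field at all, so allow None.
        result[host] = best if scores[best] >= threshold else None
    return result


if __name__ == "__main__":
    sample_urls = [
        "http://www.google.com/search?hl=en&q=music",
        "http://www.google.com/search?hl=en&q=jazz+piano",
        "http://www.google.com/search?hl=en&q=weather",
        "http://example.com/item?id=101",
        "http://example.com/item?id=102",
    ]
    print(pick_query_fields(sample_urls))
```

On the sample URLs, the sketch picks `q` for `www.google.com` (its values vary and look like free text, while `hl` is constant) and no field for `example.com`, whose only field holds numeric identifiers.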


Bibliographic details

Source: 19th International World Wide Web Conference 2010 | 2010 | pp. 361-369 | 9 pages
Authors: George Forman; Evan Kirshenbaum; Shyamsundar Rajaram
Original format: PDF
Chinese Library Classification: Computer networks
Keywords: web data mining; clickstream analysis; machine learning classification; active learning


