ACM Transactions on Information Systems

Automatic Classification of Web Queries Using Very Large Unlabeled Query Logs



Abstract

Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose Web search systems. Such classification becomes critical if the system must route queries to a subset of topic-specific and resource-constrained back-end databases. Successful query classification poses a challenging problem, as Web queries are short, thus providing few features. This feature sparseness, coupled with the constantly changing distribution and vocabulary of queries, hinders traditional text classification. We attack this problem by combining multiple classifiers, including exact lookup and partial matching in databases of manually classified frequent queries, linear models trained by supervised learning, and a novel approach based on mining selectional preferences from a large unlabeled query log. Our approach classifies queries without using external sources of information, such as online Web directories or the contents of retrieved pages, making it viable for use in demanding operational environments, such as large-scale Web search services. We evaluate our approach using a large sample of queries from an operational Web search engine and show that our combined method increases recall by nearly 40% over the best single method while maintaining adequate precision. Additionally, we compare our results to those from the 2005 KDD Cup and find that we perform competitively despite our operational restrictions. This suggests it is possible to topically classify a significant portion of the query stream without requiring external sources of information, allowing for deployment in operationally restricted environments.
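
The combination of classifiers described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the individual classifiers are combined, so the set-union rule below is an assumption, the supervised linear model is omitted, and the count-based rule mining is a simplified stand-in for selectional preferences. The `LABELED` table, the category names, and every function name are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical table of manually classified terms (illustrative data only).
LABELED = {
    "lyrics": "music",
    "mp3": "music",
    "recipes": "food",
    "airlines": "travel",
}


def exact_lookup(query, labeled=LABELED):
    """Return the category of a query that appears verbatim in the table."""
    category = labeled.get(query)
    return {category} if category else set()


def partial_match(query, labeled=LABELED):
    """Return the categories of labeled terms that occur as tokens of the query."""
    return {labeled[tok] for tok in query.split() if tok in labeled}


def mine_preferences(log, labeled=LABELED, min_count=2):
    """Learn rules x -> category from unlabeled two-word queries "x y".

    Assumption: when y is a labeled term and x is not, x is counted as
    co-occurring with y's category. A rule is kept only if its most
    frequent category was seen at least min_count times.
    """
    counts = defaultdict(Counter)
    for query in log:
        tokens = query.split()
        if len(tokens) != 2:
            continue
        x, y = tokens
        if y in labeled and x not in labeled:
            counts[x][labeled[y]] += 1
    rules = {}
    for x, category_counts in counts.items():
        category, n = category_counts.most_common(1)[0]
        if n >= min_count:
            rules[x] = category
    return rules


def preference_match(query, rules):
    """Return the categories predicted by mined rules for any token of the query."""
    return {rules[tok] for tok in query.split() if tok in rules}


def classify(query, rules):
    """Union of all classifier outputs (the combination rule is an assumption)."""
    return exact_lookup(query) | partial_match(query) | preference_match(query, rules)


# Unlabeled query log (illustrative data only).
LOG = ["beatles lyrics", "beatles mp3", "beatles tour", "pasta recipes"]
RULES = mine_preferences(LOG)
```

In this toy example, `classify("beatles tour", RULES)` is covered only by the mined rule for `beatles`, since neither token is in the labeled table. Taking the union of the classifiers can only add labels, which is one simple way a combined method could raise recall over any single classifier, at some possible cost in precision.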
