Exploring features for automatic identification of news queries through query logs

Xiaojuan ZHANG; Jian LI

Abstract

Purpose: Existing research on predicting queries with news intent has extracted classification features from external knowledge bases. This paper shows how features extracted from query logs can be used to identify news queries automatically, without any external resources.

Design/methodology/approach: First, we manually labeled 1,220 news queries from Sogou.com. Based on an analysis of these queries, we identified three features of news queries in terms of query content, time of query occurrence, and user click behavior. We then used 12 effective features proposed in the literature as a baseline and conducted experiments with a support vector machine (SVM) classifier. Finally, we compared the impact of the features used in this paper on the identification of news queries.

Findings: Compared with the baseline features, the F-score improved from 0.6414 to 0.8368 after the three newly identified features were added, among which the burst point (bst) was the most effective for predicting news queries. In addition, query expression (qes) was more useful than query terms, and among the click behavior-based features, news URL was the most effective.

Research limitations: Analyses based only on features extracted from query logs may produce limited results. In addition, the segmentation tool used in this study is more widely applied to long texts than to short queries.

Practical implications: The research will help general-purpose search engines address search intents for news events.

Originality/value: Our approach offers a new perspective on recognizing queries with news intent without relying on large news corpora such as blogs or Twitter.
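The abstract names the burst point (bst) as the most effective feature but does not say how it is computed. As an illustration only, the sketch below flags a query as bursty when its count on some day exceeds the mean daily count by k standard deviations. The function names, the threshold rule, and the default k = 2.0 are all assumptions made for this example, not the authors' definition.

```python
from statistics import mean, pstdev


def burst_days(daily_counts, k=2.0):
    """Return indices of days whose query count exceeds mean + k * std.

    Hypothetical burst rule for illustration; the paper's actual
    definition of the burst point (bst) is not given in the abstract.
    """
    if len(daily_counts) < 2:
        return []
    mu = mean(daily_counts)
    sigma = pstdev(daily_counts)  # population standard deviation
    if sigma == 0:
        # A perfectly flat series has no burst.
        return []
    threshold = mu + k * sigma
    return [i for i, count in enumerate(daily_counts) if count > threshold]


def has_burst(daily_counts, k=2.0):
    """Binary feature: 1 if the query shows at least one burst day, else 0."""
    return 1 if burst_days(daily_counts, k) else 0


# Daily occurrence counts of one query over a week (made-up numbers).
counts = [3, 2, 4, 3, 50, 5, 3]
print(burst_days(counts))
print(has_burst(counts))
```

A binary feature of this kind could sit alongside content-based and click-based features in the vector passed to a classifier such as an SVM; a real system would need to choose the time window and threshold empirically.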

