In community question answering services (cQA) such as Baidu Zhidao, question classification is one of the crucial tasks: it organizes the questions that users submit to the system. The practical demands of cQA services require a question classification algorithm with high accuracy, low computational cost, and low sensitivity to noisy data. Classification based on the Kullback-Leibler distance (KLD) has shown high accuracy on large-scale text and high-dimensional vector classification tasks. Building on that algorithm and incorporating the idea of language models, this paper proposes an improved classification algorithm, n-gram KLD. A series of experiments on a large corpus containing more than one million question-answer pairs shows that n-gram KLD achieves a significant improvement over traditional algorithms on the question classification task, and that its computational complexity and its sensitivity to noisy data both meet the requirements of question classification in cQA services.
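The paper does not include its implementation here, but the core idea it describes can be sketched as follows: estimate an n-gram distribution for each category from training questions, then assign a new question to the category whose distribution is closest in KL distance. This is a minimal illustrative sketch, not the authors' code; the toy categories, the token-bigram choice, and the epsilon smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all k-grams for k = 1..n in a token sequence."""
    counts = Counter()
    for k in range(1, n + 1):
        for i in range(len(tokens) - k + 1):
            counts[tuple(tokens[i:i + k])] += 1
    return counts

def kld(p, q, vocab, eps=1e-9):
    """KL divergence D(p || q) over a shared n-gram vocabulary,
    with epsilon smoothing so unseen grams do not yield infinities."""
    pz = sum(p.values()) + eps * len(vocab)
    qz = sum(q.values()) + eps * len(vocab)
    d = 0.0
    for g in vocab:
        pp = (p.get(g, 0) + eps) / pz
        qq = (q.get(g, 0) + eps) / qz
        d += pp * math.log(pp / qq)
    return d

def classify(question_tokens, category_models, n=2):
    """Pick the category whose n-gram model is closest in KL distance."""
    q = ngram_counts(question_tokens, n)
    vocab = set(q)
    for m in category_models.values():
        vocab |= set(m)
    return min(category_models,
               key=lambda c: kld(q, category_models[c], vocab))

# Toy training data (hypothetical categories, not the paper's corpus).
train = {
    "computers": [["how", "to", "install", "linux"],
                  ["best", "laptop", "for", "programming"]],
    "health":    [["how", "to", "lower", "blood", "pressure"],
                  ["best", "diet", "for", "weight", "loss"]],
}
models = {c: sum((ngram_counts(t, 2) for t in texts), Counter())
          for c, texts in train.items()}

print(classify(["install", "linux", "on", "laptop"], models))  # → computers
```

Using n-grams rather than bag-of-words unigrams lets the category models capture short word sequences, which is the modification the abstract attributes to n-gram KLD.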