The web contains a vast amount of Chinese text, much of which is sparse and nonstandard, making it difficult to process and mine. Baidu Baike is a rich, dynamic Chinese knowledge base that is closely tied to current social topics and internet trends. Based on Baidu Baike, this paper proposes a semantic topic extraction method for Chinese web text. First, the knowledge relations in Baidu Baike are used to map a text into a space of candidate semantic topics. The text is then classified according to training data to find its most probable category, and the candidate semantic topics belonging to that category are selected. Finally, a measure called the semantic discrete degree (SDD) is proposed to determine the final semantic topics. Experimental results on two datasets show that the method achieves good and stable semantic topic extraction performance on both nonstandard web text and standard text.
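The pipeline described above (candidate mapping, classification, filtering by category, final selection) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `KNOWLEDGE` dictionary is a toy stand-in for Baidu Baike's knowledge relations, `TRAINING` is invented labelled data, the classifier is a naive word-overlap vote, and the final ranking uses a plain support count as a placeholder because the abstract does not define how SDD is computed.

```python
from collections import Counter

# Hypothetical toy knowledge base: term -> list of (topic, category) pairs.
# In the paper this role is played by Baidu Baike's knowledge relations.
KNOWLEDGE = {
    "apple": [("Apple Inc.", "technology"), ("Apple (fruit)", "food")],
    "iphone": [("Apple Inc.", "technology"), ("Smartphone", "technology")],
    "android": [("Smartphone", "technology")],
    "pie": [("Dessert", "food")],
}

# Hypothetical labelled training data: category -> example token lists.
TRAINING = {
    "technology": [["iphone", "android", "release"], ["apple", "iphone", "launch"]],
    "food": [["apple", "pie", "recipe"], ["pie", "bake", "sugar"]],
}


def candidate_topics(tokens):
    """Map tokens to candidate topics, counting supporting tokens per topic."""
    support = Counter()
    for token in tokens:
        for topic, category in KNOWLEDGE.get(token, []):
            support[(topic, category)] += 1
    return support


def classify(tokens):
    """Pick the category whose training vocabulary overlaps the text most."""
    scores = {}
    for category, docs in TRAINING.items():
        vocab = Counter(word for doc in docs for word in doc)
        scores[category] = sum(vocab[token] for token in tokens)
    return max(scores, key=scores.get)


def extract_topics(tokens, top_k=1):
    """Return the predicted category and the top_k topics within it."""
    support = candidate_topics(tokens)
    category = classify(tokens)
    # Keep only candidate topics that belong to the predicted category.
    kept = {topic: n for (topic, cat), n in support.items() if cat == category}
    # Placeholder ranking by support count (NOT the paper's SDD measure);
    # ties are broken alphabetically so the result is deterministic.
    ranked = sorted(kept, key=lambda topic: (-kept[topic], topic))
    return category, ranked[:top_k]


if __name__ == "__main__":
    print(extract_topics(["new", "iphone", "from", "apple", "launch"]))
    print(extract_topics(["apple", "pie", "recipe"], top_k=2))
```

The classification step is what resolves the ambiguous term "apple" in this toy example: once the text is assigned to a category, candidate topics from other categories are discarded before the final ranking.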