Published in: IEEE Transactions on Emerging Topics in Computing

Wikipedia-Based Semantic Similarity Measurements for Noisy Short Texts Using Extended Naive Bayes



Abstract

This paper proposes a Wikipedia-based semantic similarity measurement method intended for real-world noisy short texts. Our method is a kind of explicit semantic analysis (ESA): it adds a bag of Wikipedia entities (Wikipedia pages) to a text as its semantic representation and uses the resulting entity vector to compute semantic similarity. Adding related entities to a whole text, rather than to a single word or phrase, is a challenging practical problem because it usually consists of several subproblems, e.g., key term extraction from texts, related entity finding for each key term, and weight aggregation of related entities. Our proposed method solves the aggregation problem using extended naive Bayes, a probabilistic weighting mechanism based on Bayes' theorem. Our method is especially effective when short texts are semantically noisy, i.e., when they contain terms that are meaningless or misleading for estimating their main topic. Experimental results on Twitter message and Web snippet clustering showed that our method outperformed ESA for noisy short texts. We also found that reducing the dimension of the vector to representative Wikipedia entities scarcely affected performance while decreasing the vector size, and hence the storage space and the processing time required to compute the cosine similarity.
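
The similarity step described in the abstract can be illustrated with a minimal sketch. It assumes each short text has already been mapped to a sparse vector of weighted Wikipedia entities; the entity weights below are hypothetical toy values, and the function names `cosine_similarity` and `top_k_entities` are illustrative, not taken from the paper. The sketch does not implement the extended naive Bayes aggregation itself. It only shows how cosine similarity over entity vectors might be computed and how keeping just the highest-weighted entities shrinks a vector.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {entity: weight} dicts."""
    # Iterate over the smaller dict so the dot product touches fewer keys.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(w * v.get(entity, 0.0) for entity, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def top_k_entities(vec, k):
    """Keep only the k highest-weighted entities (a simple dimension reduction)."""
    ranked = sorted(vec.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


# Hypothetical entity vectors for three short texts (weights are made up).
text_a = {"Apple Inc.": 0.6, "IPhone": 0.3, "Apple": 0.1}
text_b = {"Apple Inc.": 0.5, "IPad": 0.4, "Steve Jobs": 0.1}
text_c = {"Apple": 0.7, "Fruit": 0.3}

sim_ab = cosine_similarity(text_a, text_b)
sim_ac = cosine_similarity(text_a, text_c)
reduced_a = top_k_entities(text_a, 2)
```

With these toy vectors, `text_a` and `text_b` share their dominant entity and so score higher than `text_a` and `text_c`, which overlap only on a low-weight entity. Truncating with `top_k_entities` trades a small amount of information for smaller vectors, which is the kind of storage and time saving the abstract reports.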
