ACM Transactions on Asian Language Information Processing

Wikipedia-Based Relatedness Measurements for Multilingual Short Text Clustering


Abstract

Throughout the world, people can post information about their local areas in their own languages using social networking services. Multilingual short text clustering is an important task for organizing such information, and it can support applications such as event detection and summarization. However, measuring the relatedness between short texts written in different languages is a challenging problem: in addition to handling multiple languages, the semantic gaps among those languages must be bridged. In this article, we propose two Wikipedia-based semantic relatedness measurement methods for multilingual short text clustering. The proposed methods address the semantic gap problem by incorporating the inter-language links of Wikipedia into Extended Naive Bayes (ENB), a probabilistic method that measures semantic relatedness among monolingual short texts. The proposed methods represent a multilingual short text as a vector over articles (entities) of the English Wikipedia. By mapping texts into this unified vector space, the relatedness between texts in different languages with similar meanings can be increased. We also propose an approach that improves clustering performance and reduces processing time by eliminating language-specific entities from the unified vector space. Experimental results on multilingual Twitter message clustering revealed that the proposed methods outperformed cross-lingual explicit semantic analysis, a previously proposed method for measuring relatedness between texts in different languages. Moreover, the proposed methods were comparable to ENB applied to texts translated into English by a proprietary translation service. The proposed methods thus enable relatedness measurements for multilingual short text clustering without requiring machine translation.
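
The representation described above lends itself to a brief illustration. The following is a minimal sketch, not the authors' implementation: it assumes a tiny hand-made inter-language link table (INTERLANGUAGE_LINKS) and hypothetical entity weights that stand in for the probabilities an ENB-style model would produce, and it only shows how texts in different languages can be mapped onto one English-entity vector space where cosine similarity serves as a relatedness measure.

```python
# Sketch of a unified English-Wikipedia entity space for multilingual short texts.
# The link table and entity weights below are hypothetical stand-ins; the paper
# derives them from Wikipedia inter-language links and an ENB-based model.

from collections import defaultdict
from math import sqrt

# Hypothetical inter-language links: (language, local article) -> English article.
INTERLANGUAGE_LINKS = {
    ("ja", "地震"): "Earthquake",
    ("de", "Erdbeben"): "Earthquake",
    ("ja", "東京"): "Tokyo",
    ("de", "Berlin"): "Berlin",
}

def to_unified_vector(entity_weights, language):
    """Map a text's local-entity weights onto English Wikipedia entities.

    entity_weights: {local_article: weight}, e.g. scores from a monolingual
    relatedness model such as ENB (assumed here, not reproduced).
    Entities without an inter-language link to the English Wikipedia are
    dropped, a crude analogue of the language-specific entity pruning
    mentioned in the abstract.
    """
    unified = defaultdict(float)
    for article, weight in entity_weights.items():
        english = INTERLANGUAGE_LINKS.get((language, article))
        if english is not None:
            unified[english] += weight
    return dict(unified)

def cosine_relatedness(u, v):
    """Cosine similarity between two sparse entity vectors."""
    dot = sum(w * v.get(e, 0.0) for e, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

if __name__ == "__main__":
    # Two short texts about the same event, written in different languages.
    ja_text = to_unified_vector({"地震": 0.7, "東京": 0.3}, "ja")
    de_text = to_unified_vector({"Erdbeben": 0.8, "Berlin": 0.2}, "de")
    print(cosine_relatedness(ja_text, de_text))  # > 0: both map to "Earthquake"
```

In this toy example the Japanese and German texts share the unified entity "Earthquake", so their cosine relatedness is high even though they share no surface vocabulary, which is the effect the unified vector space is meant to produce.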
