Conference: International Joint Conference on Neural Networks

LSHWE: Improving Similarity-Based Word Embedding with Locality Sensitive Hashing for Cyberbullying Detection

Abstract

Word embedding methods use low-dimensional vectors to represent the words in a corpus. Such vectors can capture lexical semantics and greatly improve cyberbullying detection performance. However, existing word embedding methods have a major limitation in the cyberbullying detection task: they cannot represent "deliberately obfuscated words" well. Users substitute these words for bullying words in order to evade detection. Deliberately obfuscated words are often treated as "rare words" with little contextual information and are removed during preprocessing. In this paper, we propose a word embedding method called LSHWE to address this limitation. It is based on the idea that deliberately obfuscated words have a high context similarity with their corresponding bullying words. LSHWE has two steps: first, it generates the nearest neighbor matrix from the co-occurrence matrix and the nearest neighbor list obtained by Locality Sensitive Hashing (LSH); second, it uses an LSH-based autoencoder to learn word representations from these two matrices. In particular, the reconstructed nearest neighbor matrix generated by the LSH-based autoencoder is used to bring the representations of deliberately obfuscated words close to those of their corresponding bullying words. To improve efficiency, LSHWE uses LSH to generate both the nearest neighbor list and the reconstructed nearest neighbor list. Experiments demonstrate the effectiveness of LSHWE in cyberbullying detection, particularly on the "deliberately obfuscated words" problem. Moreover, LSHWE is highly efficient: it can represent tens of thousands of words in a few minutes on a typical single machine.
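The first step described in the abstract, finding candidate nearest neighbors among co-occurrence rows with LSH, can be illustrated with a minimal sketch. The abstract does not say which hash family LSHWE uses, so this example assumes random-hyperplane (sign) hashing, which approximates cosine similarity. The function names (`build_cooccurrence`, `signature`, `lsh_neighbors`), the parameters (`window`, `n_bits`, `n_tables`), and the toy corpus are all hypothetical, and the autoencoder step is not shown. It is not the authors' implementation.

```python
import random
from collections import defaultdict


def build_cooccurrence(sentences, window=2):
    """Return (vocab, rows): rows[w] counts the words seen within `window` tokens of w."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    rows = {w: [0.0] * len(vocab) for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    rows[w][index[s[j]]] += 1.0
    return vocab, rows


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes)


def lsh_neighbors(rows, n_bits=8, n_tables=10, seed=0):
    """Collect, for each word, the words sharing a bucket in at least one hash table.

    Assumption: random-hyperplane hashing; the paper's hash family may differ.
    """
    rng = random.Random(seed)
    dim = len(next(iter(rows.values())))
    candidates = defaultdict(set)
    for _ in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        buckets = defaultdict(list)
        for word, vec in rows.items():
            buckets[signature(vec, planes)].append(word)
        for bucket in buckets.values():
            for word in bucket:
                candidates[word].update(w for w in bucket if w != word)
    return candidates


# Toy corpus (hypothetical): "id10t" is an obfuscated spelling used in the
# same contexts as "idiot".
sentences = [
    ["you", "are", "an", "idiot"],
    ["you", "are", "an", "id10t"],
    ["have", "a", "nice", "day"],
]
vocab, rows = build_cooccurrence(sentences)
neighbors = lsh_neighbors(rows)
print("id10t" in neighbors["idiot"])
```

In this toy corpus the two spellings have identical co-occurrence rows, so they receive the same signature in every table and always land in the same bucket. With real data the rows are only similar, and the number of bits and tables trades recall against candidate-list size.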
