LSHWE: Improving Similarity-Based Word Embedding with Locality Sensitive Hashing for Cyberbullying Detection


Abstract

Word embedding methods represent the words of a corpus as low-dimensional vectors. Such vectors capture lexical semantics and can greatly improve cyberbullying detection performance. However, existing word embedding methods have a major limitation in the cyberbullying detection task: they represent "deliberately obfuscated words" poorly. Users substitute these words for bullying words in order to evade detection. Because deliberately obfuscated words are often treated as "rare words" with little contextual information, they are commonly removed during preprocessing. In this paper, we propose a word embedding method called LSHWE to address this limitation. It is based on the idea that deliberately obfuscated words have high context similarity with their corresponding bullying words. LSHWE has two steps. First, it generates the nearest neighbor matrix from the co-occurrence matrix and the nearest neighbor list obtained by Locality Sensitive Hashing (LSH). Second, it uses an LSH-based autoencoder to learn word representations from these two matrices. In particular, the reconstructed nearest neighbor matrix generated by the LSH-based autoencoder is used to bring the representations of deliberately obfuscated words close to those of their corresponding bullying words. To improve efficiency, LSHWE uses LSH to generate both the nearest neighbor list and the reconstructed nearest neighbor list. Experiments demonstrate the effectiveness of LSHWE in cyberbullying detection, particularly on the "deliberately obfuscated words" problem. Moreover, LSHWE is highly efficient: it can represent tens of thousands of words in a few minutes on a typical single machine.
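The first step described in the abstract (a co-occurrence matrix plus an LSH-derived nearest neighbor list) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the LSH family, so random-hyperplane hashing with cosine similarity is assumed here, and the function names, window size, and toy corpus are all hypothetical. The autoencoder step is omitted.

```python
import math
import random
from collections import defaultdict


def cooccurrence(sentences, window=2):
    """Count how often each word appears within `window` tokens of another."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w]][index[s[j]]] += 1
    return vocab, counts


def signature(row, planes):
    """One sign bit per random hyperplane (assumed LSH family, not from the paper)."""
    return tuple(sum(p * x for p, x in zip(plane, row)) >= 0 for plane in planes)


def lsh_candidates(rows, n_bits=4, n_tables=8, seed=0):
    """Words sharing a bucket in any hash table become neighbor candidates."""
    rng = random.Random(seed)
    dim = len(rows[0])
    candidates = defaultdict(set)
    for _ in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        buckets = defaultdict(list)
        for i, row in enumerate(rows):
            buckets[signature(row, planes)].append(i)
        for members in buckets.values():
            for i in members:
                candidates[i].update(m for m in members if m != i)
    return candidates


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(rows, k=2, **lsh_args):
    """Rank only the LSH candidates of each word by cosine similarity."""
    candidates = lsh_candidates(rows, **lsh_args)
    return {
        i: sorted(candidates[i], key=lambda j: cosine(rows[i], rows[j]), reverse=True)[:k]
        for i in range(len(rows))
    }


# Toy corpus (hypothetical): "1d10t" stands in for an obfuscated spelling.
corpus = [
    "you are such an idiot".split(),
    "you are such an 1d10t".split(),
    "have a nice day".split(),
]
vocab, counts = cooccurrence(corpus)
neighbors = nearest_neighbors(counts, k=1)
print(vocab[neighbors[vocab.index("idiot")][0]])  # 1d10t
```

In this toy corpus the obfuscated form and the original word have identical co-occurrence rows, so they land in the same bucket in every hash table and rank as each other's top neighbor. Comparing only bucket-mates avoids scoring every pair of words, which is the efficiency argument the abstract makes for LSH.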


Bibliographic record

Source: International Joint Conference on Neural Networks, 2020, pp. 1-8 (8 pages)
Authors: Zehua Zhao; Min Gao; Fengji Luo; Yi Zhang; Qingyu Xiong
Original format: PDF
Keywords: Word Embedding; Locality Sensitive Hashing; Cyberbullying Detection


Similar literature


1. Chinese Multi-Keyword Fuzzy Rank Search over Encrypted Cloud Data Based on Locality-Sensitive Hashing [J]. Yang Yang, Zhang Yu-Chao, Liu Jia. Journal of Information Recording, 2019, No. 1.
2. Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing [J]. Cao, Yiqun; Jiang, Tao; Girke, Thomas. Bioinformatics, 2010, No. 7.
3. Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing [J]. Thomas Girke. Bioinformatics, 2010, No. 7.
4. Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings [C]. Bikash Gyawali, Lucas Anastasiou, Petr Knoth. International Conference on Language Resources and Evaluation, 2020.
5. Application of Locality Sensitive Hashing to Feature Matching and Loop Closure Detection [D]. Shahbazi, Hossein. 2012.
6. Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing [O]. Yiqun Cao, Tao Jiang, Thomas Girke.
7. Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing [O]. Cao, Yiqun; Jiang, Tao; Girke, Thomas. 2010.
