Journal: Natural Language Engineering

Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation



Abstract

Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses a possibly promising but relatively uncommon application of anomaly detection to text data. Two English-language and one Polish-language Internet discussion forums devoted to psychoactive substances obtained from homegrown plants, such as hashish or marijuana, serve as text sources that are both realistic and possibly interesting on their own, due to potential associations with drug-related crime. The utility of two different vector text representations is examined: the simple bag-of-words representation and the more refined Global Vectors (GloVe) representation, which is an example of the increasingly popular word embedding approach. Both are combined with two unsupervised anomaly detection methods, one based on one-class support vector machines (SVM) and the other on dissimilarity to k-medoids clusters. The GloVe representation is found to be definitely more useful for anomaly detection, permitting better detection quality and ameliorating the curse-of-dimensionality issues with text clustering. The cluster dissimilarity approach combined with this representation outperforms one-class SVM with respect to detection quality and appears to be a more promising approach to anomaly detection in text data.
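
The cluster dissimilarity idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the dissimilarity measure, the k-medoids variant, or how a detection threshold is chosen, so the cosine dissimilarity, the simple alternating k-medoids loop, the seeding by the first k posts, the toy posts, and the `THRESHOLD` value below are all assumptions made for illustration. The sketch uses the bag-of-words representation only; a GloVe-based variant would replace `bag_of_words` with a dense vector built from pretrained word embeddings.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a post to a sparse term-count vector (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def k_medoids(vectors, k, max_iter=20):
    """Alternating k-medoids; the first k vectors seed the medoids (assumption)."""
    medoids = list(range(k))
    for _ in range(max_iter):
        # Assignment step: attach every vector to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, v in enumerate(vectors):
            nearest = min(
                medoids, key=lambda m: cosine_dissimilarity(v, vectors[m])
            )
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total dissimilarity to the other members of its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                # Only possible when duplicate vectors compete; keep the medoid.
                new_medoids.append(m)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(
                    cosine_dissimilarity(vectors[c], vectors[j]) for j in members
                ),
            ))
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [vectors[m] for m in medoids]


def anomaly_score(text, medoids):
    """Score a new post by its dissimilarity to the nearest medoid."""
    vector = bag_of_words(text)
    return min(cosine_dissimilarity(vector, m) for m in medoids)


# Toy "historical" posts (hypothetical data, two rough topics).
history = [
    "seeds need warm soil to sprout",
    "lamp hours affect plant growth",
    "warm soil helps seeds sprout faster",
    "plant growth depends on lamp hours",
]
medoids = k_medoids([bag_of_words(post) for post in history], k=2)

THRESHOLD = 0.8  # hypothetical cut-off; a real one would be tuned on data
for post in ["seeds sprout in warm soil",
             "cheap flights and hotel deals this weekend"]:
    score = anomaly_score(post, medoids)
    print(f"{score:.2f}", "anomalous" if score > THRESHOLD else "normal", post)
```

The model is built on historical posts only, and each new post is scored by how far it lies from the closest cluster representative, which matches the abstract's framing of anomaly detection as applying a model of historical data to new data. With sparse bag-of-words vectors, unrelated posts share few terms and most pairwise dissimilarities crowd toward 1, which is one way to picture the dimensionality problem the abstract says dense embeddings ameliorate.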
