Journal: Engineering Applications of Artificial Intelligence

Learning semantic information from Internet Domain Names using word embeddings


Abstract

Word embeddings are a well-known set of techniques widely used in Natural Language Processing (NLP). These techniques learn the semantics of words based on the distributional hypothesis, which states that words used in the same contexts tend to have similar meanings. This paper explores the use of word embeddings in a new scenario: creating a Vector Space Model (VSM) for Internet Domain Names (DNS). Our goal is to find semantically similar domains using only information from DNS queries, without any knowledge of the content of those domains. The results presented here have practical applications in many engineering activities, including website recommendation, identification of fraudulent or risky sites, parental-control systems, and anomaly detection in network traffic analysis, among others. We use the distributional hypothesis to learn the semantics of domain names from users' web navigation patterns, validating empirically that domain names that occur in the same web sessions tend to have similar semantics. We also test different word embedding techniques: word2vec, app2vec (which considers time intervals between DNS queries), and fastText (which includes sub-word information). Due to the characteristics of domain names, we found fastText to be the best option for building a VSM for DNS: it outperforms word2vec with Skip-Gram, the next best technique, by 10.5% on the Mean Average Precision at k (MAP@k) metric. This metric compares the most similar domains in our VSM with the most similar domains provided by a third-party source, namely the similar-sites service offered by Alexa Internet, Inc.
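
The MAP@k comparison described in the abstract can be illustrated with a short sketch. It assumes one common definition of MAP@k (average precision truncated at rank k, normalised by the smaller of k and the number of reference items), which is not necessarily the exact variant the authors used. The function names and the example domains are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence, Set


def average_precision_at_k(predicted: Sequence[str], relevant: Set[str], k: int) -> float:
    """AP@k for one query domain.

    `predicted` is the ranked list of nearest neighbours from the VSM
    (assumed to contain no duplicates); `relevant` is the set of similar
    domains given by the reference source.
    """
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, domain in enumerate(predicted[:k], start=1):
        if domain in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), k)


def mean_average_precision_at_k(
    predictions: Dict[str, Sequence[str]],
    references: Dict[str, Sequence[str]],
    k: int,
) -> float:
    """Mean of AP@k over all query domains present in both mappings."""
    queries = [q for q in predictions if q in references]
    if not queries:
        return 0.0
    total = sum(
        average_precision_at_k(predictions[q], set(references[q]), k)
        for q in queries
    )
    return total / len(queries)


# Hypothetical data: ranked neighbours from a VSM vs. a reference list.
vsm_neighbours = {
    "news-a.example": ["news-b.example", "shop-a.example", "news-c.example"],
    "shop-a.example": ["news-a.example", "shop-b.example", "shop-c.example"],
}
reference_similar = {
    "news-a.example": ["news-b.example", "news-c.example"],
    "shop-a.example": ["shop-b.example"],
}

map_at_3 = mean_average_precision_at_k(vsm_neighbours, reference_similar, k=3)
print(f"MAP@3 = {map_at_3:.4f}")
```

In this toy example the first query scores 5/6 (hits at ranks 1 and 3) and the second scores 1/2 (a single hit at rank 2), so the mean is 2/3. A higher value means the neighbours in the vector space agree more closely with the reference lists.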
