Journal: Computer Networks

Natural language processing for web browsing analytics: Challenges, lessons learned, and opportunities


Abstract

In an Internet arena where the revenues of search engines and other digital marketing firms are peaking, other actors still have open opportunities to monetize their users' data. After appropriate anonymization, aggregation, and agreement, the set of websites that users visit can become exploitable data for ISPs. Uses range from assessing the reach of advertising campaigns to reinforcing user loyalty, among other marketing approaches, as well as addressing security issues. However, sniffers based on HTTP, DNS, TLS, or flow features do not suffice for this task. Modern websites are designed to preload and prefetch some content, in addition to embedding banners, social network links, images, and scripts from other websites. This self-triggered traffic makes it hard to tell which websites users visited on purpose. Moreover, DNS caches prevent some queries for actively visited websites from even being sent. Working with this limited input, we propose to treat domains as words and sequences of domains as documents. This makes it possible to identify the visited websites by recasting the problem as text classification and applying the most promising techniques from the natural language processing and neural network fields. After applying different representation methods, such as TF-IDF, Word2vec, Doc2vec, and custom neural networks, in diverse scenarios and with several datasets, we can identify the websites visited on purpose with accuracy above 90%, with peaks close to 100%, using processes that are fully automated and free of any human parametrization.
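
The "domains as words, domain sequences as documents" idea can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a hand-rolled smoothed TF-IDF and substitutes a simple nearest-centroid cosine classifier for the neural networks mentioned in the abstract. All domain names, the training set, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: each "document" is the sequence of domains
# observed in a short traffic window; the label is the site visited on purpose.
TRAIN = [
    (["news.example", "cdn.news.example", "ads.tracker.example", "fonts.static.example"], "news.example"),
    (["news.example", "img.news.example", "ads.tracker.example"], "news.example"),
    (["shop.example", "pay.shop.example", "ads.tracker.example", "fonts.static.example"], "shop.example"),
    (["shop.example", "img.shop.example", "pay.shop.example"], "shop.example"),
]


def fit_idf(docs):
    """Smoothed inverse document frequency for each domain 'word'."""
    n = len(docs)
    df = Counter(domain for doc in docs for domain in set(doc))
    return {domain: math.log((1 + n) / (1 + count)) + 1.0 for domain, count in df.items()}


def tfidf(doc, idf):
    """L2-normalised sparse TF-IDF vector; domains unseen in training are dropped."""
    tf = Counter(domain for domain in doc if domain in idf)
    vec = {domain: count * idf[domain] for domain, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {domain: v / norm for domain, v in vec.items()} if norm else {}


def fit_centroids(train, idf):
    """Sum the TF-IDF vectors of each label (cosine ignores the scale)."""
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in train:
        for domain, value in tfidf(doc, idf).items():
            sums[label][domain] += value
    return {label: dict(vec) for label, vec in sums.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(domain, 0.0) for domain, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc, idf, centroids):
    """Return the label whose centroid is most similar to the document."""
    vec = tfidf(doc, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


IDF = fit_idf([doc for doc, _ in TRAIN])
CENTROIDS = fit_centroids(TRAIN, IDF)

# DNS-cache scenario: the query for shop.example itself was never sent,
# but the surrounding self-triggered domains still point to it.
observed = ["pay.shop.example", "img.shop.example", "ads.tracker.example"]
print(classify(observed, IDF, CENTROIDS))  # prints shop.example
```

In this toy setting the shared tracker domain receives a low IDF weight, so the site-specific embedded domains dominate the decision even when the main domain is absent from the window. A window containing only unseen domains yields an empty vector, and the sketch then falls back to the first label, which a real system would need to handle explicitly.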
