Internet Measurement Conference
Who is .com? Learning to Parse WHOIS Records

Abstract

WHOIS is a long-established protocol for querying information about the 280M+ registered domain names on the Internet. Unfortunately, while such records are accessible in a "human-readable" format, they do not follow any consistent schema and thus are challenging to analyze at scale. Existing approaches, which rely on manual crafting of parsing rules and per-registrar templates, are inherently limited in coverage and fragile to ongoing changes in data representations. In this paper, we develop a statistical model for parsing WHOIS records that learns from labeled examples. Our model is a conditional random field (CRF) with a small number of hidden states, a large number of domain-specific features, and parameters that are estimated by efficient dynamic-programming procedures for probabilistic inference. We show that this approach can achieve extremely high accuracy (well over 99%) using modest amounts of labeled training data, that it is robust to minor changes in schema, and that it can adapt to new schema variants by incorporating just a handful of additional examples. Finally, using our parser, we conduct an exhaustive survey of the registration patterns found in 102M com domains.
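
To make the modeling idea concrete, here is a minimal sketch of dynamic-programming (Viterbi) decoding for a linear-chain model over the lines of a WHOIS record. It is not the authors' implementation. It assumes that each line receives one label, and the label set, the feature function `line_features`, the sample record, and the hand-set weights in `EMIT` and `TRANS` are all hypothetical stand-ins for the hidden states, domain-specific features, and learned parameters that the abstract mentions. Parameter estimation from labeled examples is not shown.

```python
import re

# Hypothetical label set; the paper's actual hidden states are not given here.
LABELS = ["registrar", "registrant", "date", "other"]

def line_features(line):
    """Return a set of binary features for one WHOIS line (illustrative only)."""
    feats = set()
    key = line.lower().split(":", 1)[0]
    for word in re.findall(r"[a-z]+", key):
        feats.add("key=" + word)
    if ":" in line:
        feats.add("has_colon")
    if re.search(r"\d{4}-\d{2}-\d{2}", line):
        feats.add("has_date")
    return feats

# Hand-set weights; a trained CRF would estimate these from labeled records.
EMIT = {
    ("key=registrar", "registrar"): 2.0,
    ("key=registrant", "registrant"): 2.0,
    ("has_date", "date"): 2.0,
    ("has_colon", "other"): 0.5,
}
# A registrant block tends to continue onto the following line.
TRANS = {("registrant", "registrant"): 1.0}

def emit_score(feats, label):
    return sum(EMIT.get((f, label), 0.0) for f in feats)

def viterbi(lines):
    """Return the highest-scoring label sequence for the given lines."""
    if not lines:
        return []
    feats = [line_features(line) for line in lines]
    best = [{y: emit_score(feats[0], y) for y in LABELS}]
    back = []
    for t in range(1, len(lines)):
        scores, pointers = {}, {}
        for y in LABELS:
            # Best previous label for ending in y at position t.
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANS.get((p, y), 0.0))
            scores[y] = (best[-1][prev] + TRANS.get((prev, y), 0.0)
                         + emit_score(feats[t], y))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    y = max(LABELS, key=lambda lab: best[-1][lab])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    return path[::-1]

# Made-up record for illustration.
RECORD = [
    "Domain Name: EXAMPLE.COM",
    "Registrar: Example Registrar, Inc.",
    "Creation Date: 2015-01-01",
    "Registrant:",
    "  John Doe",
]

if __name__ == "__main__":
    for label, line in zip(viterbi(RECORD), RECORD):
        print(f"{label:10s} | {line}")
```

In this toy record the last line carries no informative features of its own, so the transition weight from the preceding `registrant` line is what labels it. That dependence on context is the kind of regularity a sequence model can pick up and a purely per-line rule cannot.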