
Learning to Extract Form Labels

Abstract

In this paper we describe a new approach to extract element labels from Web form interfaces. Having these labels is a requirement for several techniques that attempt to retrieve and integrate information that is hidden behind form interfaces, such as hidden Web crawlers and metasearchers. However, given the wide variation in form layout, even within a well-defined domain, automatically extracting these labels is a challenging problem. Whereas previous approaches to this problem have relied on heuristics and manually specified extraction rules, our technique makes use of a learning classifier ensemble to identify element-label mappings; and it applies a reconciliation step which leverages the classifier-derived mappings to boost extraction accuracy. We present a detailed experimental evaluation using over three thousand Web forms. Our results show that our approach is effective: it obtains significantly higher accuracy and is more robust to variability in form layout than previous label extraction techniques.
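The two stages named in the abstract, ensemble-derived element-label mappings followed by a reconciliation step, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Box` layout model, the three hand-written voters (stand-ins for trained classifiers), the `min_votes` threshold, and the reconciliation rule, which simply gives each still-unmatched element the nearest label that the ensemble has not already assigned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """A text label or form element with a rendered position (hypothetical model)."""
    name: str
    x: float
    y: float


Pair = Tuple[Box, Box]  # (element, candidate label)


# Hand-written stand-ins for trained classifiers (assumption, not the paper's
# models): each returns True when the pair looks like a match.
def label_on_left(pair: Pair) -> bool:
    element, label = pair
    return abs(label.y - element.y) < 10 and 0 < element.x - label.x < 200


def label_above(pair: Pair) -> bool:
    element, label = pair
    return abs(label.x - element.x) < 20 and 0 < element.y - label.y < 40


def label_nearby(pair: Pair) -> bool:
    element, label = pair
    return abs(label.x - element.x) + abs(label.y - element.y) < 150


VOTERS: List[Callable[[Pair], bool]] = [label_on_left, label_above, label_nearby]


def ensemble_mappings(elements: List[Box], labels: List[Box],
                      min_votes: int = 2) -> Dict[str, str]:
    """Keep, per element, the best-voted label if it reaches min_votes."""
    mappings: Dict[str, str] = {}
    if not labels:
        return mappings
    for element in elements:
        scored = [(sum(vote((element, lab)) for vote in VOTERS), lab)
                  for lab in labels]
        votes, best = max(scored, key=lambda s: s[0])
        if votes >= min_votes:
            mappings[element.name] = best.name
    return mappings


def reconcile(elements: List[Box], labels: List[Box],
              mappings: Dict[str, str]) -> Dict[str, str]:
    """Give each unmatched element the nearest label not already assigned."""
    result = dict(mappings)
    used = set(result.values())
    free = [lab for lab in labels if lab.name not in used]
    for element in elements:
        if element.name in result or not free:
            continue
        nearest = min(free, key=lambda lab: abs(lab.x - element.x)
                      + abs(lab.y - element.y))
        result[element.name] = nearest.name
        free.remove(nearest)
    return result


if __name__ == "__main__":
    labels = [Box("Title", 0, 0), Box("Author", 0, 30), Box("Year", 0, 200)]
    elements = [Box("q_title", 100, 0), Box("q_author", 100, 30),
                Box("q_year", 300, 60)]
    confident = ensemble_mappings(elements, labels)
    print(confident)
    print(reconcile(elements, labels, confident))
```

In this toy layout the voters agree on `q_title` and `q_author` but leave `q_year` unmatched. Plain proximity would pair `q_year` with "Author", whereas the reconciliation pass rules out labels that are already assigned and selects "Year". That is one simple way in which classifier-derived mappings can inform the remaining decisions.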
