Published in: ACM Conference on Information and Knowledge Management

PruSM: A Prudent Schema Matching Approach for Web Forms

Abstract

There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can fill them out automatically. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they handle only small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is gathered automatically, matching form schemata presents new challenges: data heterogeneity is compounded by Web scale and by the noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy that determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. An experimental evaluation of PruSM using widely available data sets shows that the approach is effective and able to accurately match a large number of form schemata without requiring any manual pre-processing.
