International World Wide Web Conference; Edinburgh (GB)

Interactive Wrapper Generation with Minimal User Effort

Abstract

While much of the data on the web is unstructured in nature, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. A large amount of research has focused on the problem of generating wrappers, i.e., software tools that allow easy and robust extraction of structured data from text and HTML sources. In many applications, such as comparison shopping, data has to be extracted from many different sources, making manual coding of a wrapper for each source impractical. On the other hand, fully automatic approaches are often not reliable enough, resulting in low quality of the extracted data. We describe a complete system for semi-automatic wrapper generation that can be trained on different data sources in a simple interactive manner. Our goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.