Venue: International World Wide Web Conference

Interactive Wrapper Generation with Minimal User Effort

Abstract

While much of the data on the web is unstructured in nature, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. A large amount of research has focused on the problem of generating wrappers, i.e., software tools that allow easy and robust extraction of structured data from text and HTML sources. In many applications, such as comparison shopping, data has to be extracted from many different sources, making manual coding of a wrapper for each source impractical. On the other hand, fully automatic approaches are often not reliable enough, resulting in low quality of the extracted data. We describe a complete system for semi-automatic wrapper generation that can be trained on different data sources in a simple interactive manner. Our goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.
