International Conference on Web Information Systems Engineering (WISE 2007); December 3-7, 2007; Nancy, France

Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction


Abstract

Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. Several previous works have addressed the same problem, but most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make assumptions about how data records are encoded into web pages that do not always hold in real websites. Finally, we have tested our techniques with a large number of real web sources and found them to be very effective.
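
The abstract does not describe the algorithm itself, so the following is only a minimal sketch of the general idea the title names: comparing page fragments with edit distance and grouping similar ones. It assumes each candidate fragment has already been reduced to a sequence of tag names; the helper names (`edit_distance`, `similarity`, `cluster`), the greedy clustering strategy, the 0.7 threshold, and the sample data are all illustrative assumptions, not the authors' method.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance normalized to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster(sequences: List[Sequence[str]], threshold: float = 0.7) -> List[List[int]]:
    """Greedy one-pass clustering (an assumed, simplified strategy).

    Each sequence joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns clusters as lists of indices into `sequences`.
    """
    clusters: List[List[int]] = []
    for idx, seq in enumerate(sequences):
        for members in clusters:
            if similarity(sequences[members[0]], seq) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Hypothetical tag sequences for four sibling subtrees of one result page.
subtrees = [
    ["div", "a", "span", "span"],
    ["div", "a", "span", "span", "img"],
    ["div", "a", "span", "span"],
    ["ul", "li", "li"],
]
print(cluster(subtrees))  # [[0, 1, 2], [3]]
```

In this toy input the first three fragments share a template and fall into one cluster, which is the kind of regularity a record extractor could exploit, while the unrelated list fragment is kept apart.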
