Fine-grain Web Site Structure Discovery

Abstract

Several techniques have recently been proposed to automatically derive web wrappers, i.e., programs that extract data from HTML pages and transform them into a more structured format, typically in XML syntax. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small, representative portion of it. The web site model we propose describes the structure of the site as a graph whose nodes are classes of pages that share a common structure, and whose edges represent links among instances of the page classes. Using this model, we have developed an algorithm that accepts the URL of an entry point to the target web site, visits a limited portion of the site, and produces an accurate model of the site structure. We also report on preliminary experiments performed on actual web sites, which have produced encouraging results.
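
The site model described in the abstract (nodes are page classes, edges are links among their instances) can be illustrated with a small sketch. This is not the algorithm of the paper, whose details the abstract does not give. It assumes, purely for illustration, that two pages belong to the same class when their links occur at exactly the same set of tag paths. The names `LinkPathParser`, `page_signature`, `build_site_model`, and `EXAMPLE_PAGES` are hypothetical, and the pages are held in memory rather than crawled.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they are kept off the path stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class LinkPathParser(HTMLParser):
    """Collects (tag path, href) for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def page_signature(html):
    """Return (set of tag paths that hold links, list of hrefs)."""
    parser = LinkPathParser()
    parser.feed(html)
    signature = frozenset(path for path, _ in parser.links)
    hrefs = [href for _, href in parser.links]
    return signature, hrefs


def build_site_model(pages):
    """pages maps URL -> HTML. Returns (class id -> URLs, set of class edges)."""
    class_of, outlinks, signature_ids = {}, {}, {}
    members = defaultdict(list)
    for url, html in pages.items():
        signature, hrefs = page_signature(html)
        # Assumption: identical link-path sets mean "same page class".
        class_id = signature_ids.setdefault(signature, len(signature_ids))
        class_of[url] = class_id
        members[class_id].append(url)
        outlinks[url] = hrefs
    edges = set()
    for url, hrefs in outlinks.items():
        for href in hrefs:
            if href in class_of:  # ignore links to pages we did not visit
                edges.add((class_of[url], class_of[href]))
    return dict(members), edges


# A made-up five-page site: one index, two team pages, two player pages.
TEAM = ('<html><body><div><a href="/">home</a></div><table><tr><td>'
        '<a href="/player/{0}">P</a></td></tr></table></body></html>')
PLAYER = '<html><body><div><a href="/">home</a></div></body></html>'
EXAMPLE_PAGES = {
    "/": ('<html><body><ul><li><a href="/team/1">A</a></li>'
          '<li><a href="/team/2">B</a></li></ul></body></html>'),
    "/team/1": TEAM.format(1),
    "/team/2": TEAM.format(2),
    "/player/1": PLAYER,
    "/player/2": PLAYER,
}
```

Calling `build_site_model(EXAMPLE_PAGES)` groups the five pages into three classes (index, team, player) and returns the links between those classes as edges. Exact signature equality is a deliberately crude stand-in for structural similarity; real pages that share a template rarely match this strictly.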