Conference: International Symposium on Knowledge Acquisition and Modeling

An Approach to Extracting Central URLs on Catalog Page


Abstract

Catalog pages form the intermediate layer in the architecture of a standard web site, so research on information retrieval for this kind of page can help improve web crawler efficiency. A page is called "catalog-style" if its main body is displayed as a sequence of regular entries and the central link in each entry carries the page's major information. Here, we propose a central-URL extraction approach that automatically recognizes the effective information in the main segment of a catalog page. Our approach combines machine-learning classification with DOM (Document Object Model) tree-based analysis. For each page, we represent every block node, mainly DIV and TABLE elements, by a set of content-based and structure-based features, which serve as input to a support vector machine that determines whether the node belongs to the "Main-Body". After the main semantic block is identified, a DOM tree-based algorithm that applies heuristic rules for catalogs finds the central URLs in that segment. Evaluation shows encouraging results, with high recall and precision. The approach can be applied to topic-specific search engine development and other Web applications.
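The two-stage pipeline described in the abstract (pick the main block, then pull one central link per entry) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not list the actual features or heuristic rules, so the hand-written `main_body_score` below stands in for the trained support vector machine, and "longest anchor text per entry" is an assumed heuristic. All names (`Node`, `TreeBuilder`, `central_urls`, `SAMPLE_HTML`) are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM node: tag, attributes, children, direct text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()

    def full_text(self):
        return " ".join(n.text.strip() for n in self.walk() if n.text.strip())


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest matching open tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def longest_link(entry):
    """Assumed heuristic: the central link has the longest anchor text."""
    links = [n for n in entry.walk() if n.tag == "a" and n.attrs.get("href")]
    return max(links, key=lambda a: len(a.full_text()), default=None)


def main_body_score(block):
    """Stand-in for the SVM: sum of the longest anchor text per direct child."""
    score = 0
    for child in block.children:
        best = longest_link(child)
        if best is not None:
            score += len(best.full_text())
    return score


def central_urls(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Candidate block nodes: DIV and TABLE elements, as in the abstract.
    blocks = [n for n in builder.root.walk() if n.tag in ("div", "table")]
    if not blocks:
        return []
    main = max(blocks, key=main_body_score)
    urls = []
    for entry in main.children:
        best = longest_link(entry)
        if best is not None:
            urls.append(best.attrs["href"])
    return urls


SAMPLE_HTML = """
<div id="wrap">
  <div id="nav"><a href="/">Home</a><a href="/about">About</a><a href="/contact">Contact</a></div>
  <div id="main">
    <div class="entry"><a href="/post/1">First article about web crawling</a> <a href="/u/amy">amy</a></div>
    <div class="entry"><a href="/post/2">Second article about DOM trees</a> <a href="/u/bob">bob</a></div>
    <div class="entry"><a href="/post/3">Third article about SVM features</a> <a href="/u/cat">cat</a></div>
  </div>
</div>
"""

print(central_urls(SAMPLE_HTML))
```

On this toy page the sketch selects the `main` block over the navigation bar and the outer wrapper, then returns one link per entry while skipping the short author links. A real system along the lines of the abstract would replace `main_body_score` with a classifier trained on labeled content-based and structure-based features.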
