An Approach to Extracting Central URLs on Catalog Page

Abstract

Catalog pages form the intermediate layer in the architecture of a standard web site, so research on information retrieval for this kind of page can help improve a web crawler's efficiency. A page is called "catalog-style" if its main body is displayed as a sequence of regular entries and the central link in each entry evidently carries the page's major information. Here, we propose a central-URL extraction approach that automatically recognizes the useful information in the main segment of a catalog page. Our approach combines machine-learning classification with DOM (Document Object Model) tree based analysis. For each page, we represent every block node, mainly DIV and TABLE elements, by a set of content-based and structure-based features, which serve as input to a support vector machine that decides whether the block belongs to the "Main-Body" or not. After the main semantic block is identified, a DOM tree based algorithm that applies heuristic rules for catalogs finds the central URLs in that segment. The evaluation shows that our approach obtains encouraging results, with high recall and precision. The approach can be applied in topic-specific search engine development and other Web applications.
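The two stages described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not list the actual features or heuristic rules, so the four features in `block_features`, the rule "the central link is the one with the longest anchor text in each entry", and every function name below are assumptions made for illustration. The support vector machine is also replaced by a simple stand-in (`pick_main_block`), which picks the DIV or TABLE with the most link-bearing direct children.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM node: tag, attributes, children, and direct text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text that appears directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def full_text(node):
    return "".join(n.text for n in walk(node))


def links(node):
    return [n for n in walk(node) if n.tag == "a" and "href" in n.attrs]


def block_features(node):
    """Hypothetical content- and structure-based features for one block."""
    text_len = len(full_text(node).strip())
    anchors = links(node)
    link_text_len = sum(len(full_text(a).strip()) for a in anchors)
    return {
        "text_len": text_len,
        "link_count": len(anchors),
        "link_text_ratio": link_text_len / text_len if text_len else 0.0,
        "child_count": len(node.children),
    }


def pick_main_block(root):
    """Stand-in for the SVM: the block with the most link-bearing children."""
    blocks = [n for n in walk(root) if n.tag in ("div", "table")]
    return max(blocks, key=lambda b: sum(1 for c in b.children if links(c)))


def central_urls(main_block):
    """Assumed heuristic: per entry, keep the link with the longest text."""
    urls = []
    for entry in main_block.children:
        anchors = links(entry)
        if anchors:
            best = max(anchors, key=lambda a: len(full_text(a).strip()))
            urls.append(best.attrs["href"])
    return urls


SAMPLE_HTML = """
<div id="nav"><a href="/home">Home</a><a href="/about">About</a></div>
<div id="main">
  <div class="entry"><a href="/a1">Crawler design for catalog pages</a> <a href="/u/1">bob</a></div>
  <div class="entry"><a href="/a2">Segmenting pages with DOM trees</a> <a href="/u/2">amy</a></div>
  <div class="entry"><a href="/a3">Topic-specific search engines</a> <a href="/u/3">joe</a></div>
</div>
"""

if __name__ == "__main__":
    main_block = pick_main_block(parse(SAMPLE_HTML))
    print(block_features(main_block))
    print(central_urls(main_block))
```

In a setup closer to the one the abstract describes, the dictionaries returned by `block_features` would be turned into feature vectors and passed to a trained support vector machine instead of `pick_main_block`.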

Bibliographic details

Source
International Symposium on Knowledge Acquisition and Modeling | 2008 | 5 pages
Original format: PDF
Chinese Library Classification: TP15-53
Keywords
Machine Learning; Web Information Retrieval; Web Segmentation; Web URL Extraction
