Conference: Asia-Pacific Conference on Information Processing

A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree

Abstract

With the Internet growing exponentially, web data mining has become the main method of finding relevant information. As the number of web sites and documents grows even faster and site contents are updated more and more often, focused web crawlers are becoming increasingly popular. How to order unvisited URLs has been studied in depth in the literature: existing approaches calculate the prediction score of an unvisited URL from its ancestors, but all URLs in one web page are considered to have the same score. In other words, they assume that a web page carries only one topic. We find, however, that the different parts of a web page have their own topic information while all supporting one or several major topics, so the URLs in different paragraphs should be given different scores based on the hierarchical relationship among them. In this paper, we parse every web page into a Dom-Tree, propose rules on the tree for extracting the relationships among different paragraphs, and then present a new topic-specific web crawler that calculates an unvisited URL's prediction score from the web page hierarchy and the text semantic similarity. We consider three factors. First, we calculate text similarity using the vector space model (VSM), which treats the query or paragraph as a vector whose terms are independent. Second, because the terms in a text paragraph are in fact related through their order, we use an edit distance based on term sequences to compensate for this independence assumption. Third, different paragraphs in a web page are connected according to their hierarchy in the Dom-Tree. Finally, we combine the three factors in our crawler's strategy and present our model.
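The abstract does not give the scoring formula, so the following is only a minimal sketch of how the three factors might be combined, not the authors' model. The function names, the weights `alpha` and `decay`, the whitespace tokenization, and the idea of passing in a ready-made "context chain" (the texts of the link's own paragraph and its enclosing blocks, already extracted from the Dom-Tree) are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a_terms, b_terms):
    """VSM similarity: each distinct term is an independent dimension."""
    a, b = Counter(a_terms), Counter(b_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sequence_similarity(a_terms, b_terms):
    """1 minus the Levenshtein distance over term sequences, normalized to [0, 1]."""
    if not a_terms and not b_terms:
        return 1.0
    prev = list(range(len(b_terms) + 1))
    for i, ta in enumerate(a_terms, 1):
        cur = [i]
        for j, tb in enumerate(b_terms, 1):
            cur.append(min(prev[j] + 1,                # delete ta
                           cur[j - 1] + 1,             # insert tb
                           prev[j - 1] + (ta != tb)))  # substitute or match
        prev = cur
    return 1.0 - prev[-1] / max(len(a_terms), len(b_terms))


def text_score(query, text, alpha=0.7):
    """Blend order-insensitive (VSM) and order-sensitive (edit distance) similarity.

    alpha is a hypothetical weight, not a value taken from the paper.
    """
    q, t = query.lower().split(), text.lower().split()
    return alpha * cosine_similarity(q, t) + (1 - alpha) * sequence_similarity(q, t)


def url_score(query, context_chain, decay=0.5):
    """Score a link from the texts that enclose it in the page hierarchy.

    context_chain lists texts from the link's own paragraph outward to the
    page root; each level outward is down-weighted by the assumed factor decay.
    """
    weights = [decay ** depth for depth in range(len(context_chain))]
    if not weights:
        return 0.0
    total = sum(w * text_score(query, txt) for w, txt in zip(weights, context_chain))
    return total / sum(weights)
```

Under these assumptions, two links on the same page get different scores when their enclosing paragraphs differ, which is the behavior the abstract argues for: `url_score("focused web crawler", ["focused web crawler design", "search engine notes"])` is higher than the score for a link whose nearest paragraph reads `"pasta recipes"` under the same page-level text.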