A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree


Abstract

With the Internet growing exponentially, web data mining has become the main method for finding relevant information. As the number of web sites and documents grows even faster and site contents are updated more and more often, focused web crawlers are becoming increasingly popular. The literature has studied in depth how to order unvisited URLs: existing approaches calculate a prediction score based on an unvisited URL's ancestors, but all URLs in one web page are considered to have the same score. In other words, they assume that a web page carries only one topic. We find, however, that the different parts of a web page have their own topic information while together supporting one or several larger topics, so URLs in different paragraphs should be given different scores based on the hierarchical relationship among them. In this paper, we parse every web page as a Dom-Tree, propose rules on the tree for extracting the relationships among different paragraphs, and then present a new topic-specific web crawler that calculates an unvisited URL's prediction score based on the web page hierarchy and text semantic similarity. We consider three factors. First, we calculate text similarity using the vector space model (VSM), which treats the query or paragraph as a vector whose terms are independent. Second, because the order of terms in a text paragraph also carries information that VSM ignores, we use edit distance over term sequences to compensate. Third, different paragraphs in a web page are connected according to their hierarchy in the Dom-Tree. Finally, we combine the three factors in our crawler's strategy and present our model.
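The abstract does not give the scoring formula, so the following is only a minimal Python sketch of how the three factors might be combined. The function names, the linear weighting `alpha`, and the depth-based `decay` used for the hierarchy factor are all assumptions made for illustration; they are not the authors' model.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, paragraph_terms):
    """VSM similarity: each distinct term is an independent dimension."""
    q, p = Counter(query_terms), Counter(paragraph_terms)
    dot = sum(q[t] * p[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in p.values())
    )
    return dot / norm if norm else 0.0


def term_edit_distance(a, b):
    """Levenshtein distance computed over term sequences, not characters."""
    prev = list(range(len(b) + 1))
    for i, term_a in enumerate(a, start=1):
        curr = [i]
        for j, term_b in enumerate(b, start=1):
            cost = 0 if term_a == term_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sequence_similarity(a, b):
    """Map the term-level edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 - term_edit_distance(a, b) / longest if longest else 1.0


def paragraph_score(query_terms, paragraph_terms, depth, alpha=0.6, decay=0.8):
    """Hypothetical combination of the three factors.

    alpha weights VSM against term-order similarity; decay ** depth is an
    assumed stand-in for the Dom-Tree hierarchy factor (depth = how far the
    paragraph sits below the page's main topic node).
    """
    vsm = cosine_similarity(query_terms, paragraph_terms)
    seq = sequence_similarity(query_terms, paragraph_terms)
    return (alpha * vsm + (1 - alpha) * seq) * decay ** depth
```

Under these assumptions, URLs found in a paragraph would inherit that paragraph's score, so two links on the same page can be ranked differently in the crawl frontier.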


Bibliographic details

Source: Asia-Pacific Conference on Information Processing | 2009 | 4 pages
Original format: PDF
CLC classification: G202-53
Keywords: Internet; Web sites; hypermedia markup languages; text analysis; trees (mathematics); HTML Dom-Tree; Web crawler; Web page hierarchy; World Wide Web; data mining; prediction score; text paragraph; text similarity; unvisited URL; vector space model; Dom-Tree; Edit distance; Focused web crawler; Semantic similarity


Similar literature


1. Learnable topic-specific web crawler [J]. A. Rungsawang, N. Angkawattanawit. Journal of network and computer applications. 2005, No. 2
2. Mining the web with hierarchical crawlers - a resource sharing based crawling approach [J]. Anirban Kundu, Ruma Dutta, Rana Dattagupta. International journal of intelligent information and database systems. 2009, No. 1
3. Optimized Focused Web Crawler with Natural Language Processing Based Relevance Measure in Bioinformatics Web Sources [J]. Cybernetics and information technologies: CIT. 2019, No. 2
4. A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree [C]. Asia-Pacific Conference on Information Processing. 2009
5. A Dynamic Hierarchical Web-Based Portal [D]. Spaulding, Matthew. 2011
6. An HTML5-Based Pure Website Solution for Rapidly Viewing and Processing Large-Scale 3D Medical Volume Reconstruction on Mobile Internet [O]. Liang Qiao, Xin Chen, Ye Zhang. 2017
7. Building an Efficient Web Portal for Students at Institutions of Higher Education Based on Web Crawlers [O]. 2017
