Reverse-Order DOM Tree Parsing and Web Page Main Content Extraction

Zhang Ruixue (张瑞雪); Song Mingqiu (宋明秋); Gong Yanlei (公衍磊)


Abstract

Extracting the main content of an HTML web page conventionally means parsing the page into a DOM tree, traversing that tree, and pulling out the target information according to how it is distributed in the tree. This traditional approach treats parsing the DOM tree and extracting information from it as two independent processes, which limits extraction speed. In fact, parsing the whole DOM tree on its own is unnecessary for extracting the target information accurately. This paper proposes an algorithm that parses the DOM tree in reverse order. Combined with DOM tree similarity theory and the traditional sequential parsing algorithm, it starts from part of the target information and parses the DOM tree both forward (in sequential order) and backward (in reverse order), locating and retrieving the remaining target information as it goes. Applied to main content extraction, the method parses only part of the DOM tree, which reduces the time spent building the tree structure, and it does not need to traverse the whole tree to find the target information, which saves search time; together these greatly speed up extraction. Experiments confirm the advantage of the method.
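The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea it describes: start from a known piece of the target text, scan backward to find where its enclosing element opens, then scan forward to find where that element closes, without ever building a DOM tree for the rest of the page. The names `tags_backward` and `enclosing_element`, the choice of `div` as the container tag, and the regex-based tag matching are all assumptions made for illustration; the sketch also assumes well-formed markup with no `<` characters inside comments, scripts, or attribute values, and it does not use the DOM similarity step the paper mentions.

```python
import re

# Matches an opening or closing tag.
# Group 1 is "/" for a closing tag; group 2 is the tag name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def tags_backward(html, end):
    """Yield tag matches that end at or before `end`, from right to left."""
    i = html.rfind("<", 0, end)
    while i >= 0:
        m = TAG.match(html, i)
        if m and m.end() <= end:
            yield m
        i = html.rfind("<", 0, i)


def enclosing_element(html, seed, tag="div"):
    """Return the markup of the nearest <tag> element that contains `seed`.

    Only the text between that element's opening and closing tags is
    scanned; everything before and after it is never visited.
    """
    pos = html.find(seed)
    if pos < 0:
        return None

    # Reverse-order scan: look for the first unmatched opening <tag>.
    start = None
    depth = 0
    for m in tags_backward(html, pos):
        if m.group(2).lower() != tag:
            continue
        if m.group(1):      # closing tag of an earlier sibling element
            depth += 1
        elif depth:         # opening tag of that sibling
            depth -= 1
        else:               # unmatched opening tag: the enclosing element
            start = m.start()
            break
    if start is None:
        return None

    # Forward scan: look for the first unmatched closing </tag>.
    depth = 0
    for m in TAG.finditer(html, pos):
        if m.group(2).lower() != tag:
            continue
        if not m.group(1):  # a nested element opens after the seed
            depth += 1
        elif depth:         # and closes again
            depth -= 1
        else:               # unmatched closing tag: end of the element
            return html[start:m.end()]
    return None


sample = (
    '<html><body><div id="nav"><div>menu</div></div>'
    '<div id="main"><p>First paragraph.</p><div class="q">quote</div>'
    '<p>Second paragraph.</p></div>'
    '<div id="foot">footer</div></body></html>'
)
print(enclosing_element(sample, "Second paragraph."))
```

In this sample, the backward scan stops at the opening tag of the `main` block and the forward scan stops at its closing tag, so the navigation and footer markup are never examined. A real extractor would still need a way to choose the seed text and the container tag, which this sketch leaves to the caller.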

Bibliographic details

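Source
《计算机科学》 (Computer Science) | 2011, No. 4 | pp. 213-215, 225 | 4 pages
Authors
Zhang Ruixue; Song Mingqiu; Gong Yanlei
Affiliation
Institute of Systems Engineering, Dalian University of Technology, Dalian 116023 (all three authors)
Original format PDF
Language Chinese
Keywords
DOM tree; web page main content extraction; structural similarity; reverse-order parsing
Date indexed 2022-08-18 04:38:09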

Similar literature


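1. Main content extraction based on node-path similarity in the web page DOM tree [J]. Pan Xinyu, Chen Changfu, Liu Rong. 微型机与应用. 2016, No. 019
2. A web page main content extraction method based on content features [J]. Sun Guihuang, Liu Fasheng. 现代计算机（专业版）. 2008, No. 009
3. A web page information extraction method based on visual-feature denoising and the DOM tree [J]. Chen Zhuang, Ge Bin. 山西师范大学学报（自然科学版）. 2021, No. 004
4. A main content extraction method based on web page segmentation [J]. Huang Ling, Chen Long. 计算机应用. 2008, No. 0z2
5. Research on a segmentation-based web page main content extraction algorithm [J]. Huang Wenbei, Yang Jing, Gu Junzhong. 计算机应用. 2007, No. 0z1
6. Information extraction in practice with HtmlParser web page parsing [C]. Liu Xiaoye. 第二届中国石油石化产业“互联网+”应用发展大会. 2016
7. Research on a DOM-tree-based main content extraction algorithm [A]. Meng Chuan. 2017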
