A DOM State Transition-Based Algorithm for Hidden Web Page Information Extraction

Fang Yong (房勇); Li Yinsheng (李银胜)


Abstract

Web pages increasingly rely on dynamic JavaScript, which leaves most of their content invisible to traditional web crawlers. To address this, we propose a hidden web page information extraction algorithm based on DOM state transitions. The algorithm incrementally builds a DOM state transition machine, taking DOM nodes and their click events as the machine's input events. It recursively searches the transition paths that cause the target nodes to change, and it replays the recorded click paths to capture the content of the target nodes automatically. To obtain candidate nodes, it overrides the prototype of the listener method and thereby collects every clickable node in the DOM tree. The RTDM algorithm and a custom filter compress the DOM state space to shrink the search space, and the distance in the DOM tree from a candidate node to the target node serves as the h score that guides a heuristic search. In experiments the algorithm performed well, extracting hidden web page content with 89.48% accuracy. It can be applied in areas such as automated web page testing and web crawling.
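
The abstract names the h score only as the distance in the DOM tree between a candidate node and the target node, without giving a formula. The sketch below is one possible reading, not the authors' implementation: it assumes the distance is the number of edges on the path through the lowest common ancestor, and it orders clickable candidates so that the closest one is tried first. The `Node` class and the function names are hypothetical stand-ins introduced for illustration.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """Hypothetical minimal stand-in for a DOM element."""
    tag: str
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)

    def add(self, tag: str) -> "Node":
        """Create a child element with the given tag and return it."""
        child = Node(tag, parent=self)
        self.children.append(child)
        return child


def ancestors(node: Node) -> list:
    """Return the path from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def tree_distance(a: Node, b: Node) -> int:
    """Count the edges between a and b via their lowest common ancestor.

    Assumption: this is the distance used as the h score; the abstract
    does not define it more precisely.
    """
    steps_from_a = {id(n): i for i, n in enumerate(ancestors(a))}
    for steps_from_b, n in enumerate(ancestors(b)):
        if id(n) in steps_from_a:
            return steps_from_a[id(n)] + steps_from_b
    raise ValueError("nodes are not in the same tree")


def rank_candidates(candidates: list, target: Node) -> list:
    """Order clickable candidates by ascending h score (closest first)."""
    # The index i breaks ties so that Node objects are never compared.
    heap = [(tree_distance(c, target), i, c) for i, c in enumerate(candidates)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

For a page whose `body` holds a `nav` with one link and a `main` with the target `div` and a button, the button is two edges from the target and the link is four, so the button would be clicked first under this reading.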

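Bibliographic information

Source: 《计算机应用与软件》 (Computer Applications and Software), 2015, No. 9, pp. 17-21 (5 pages)
Authors: Fang Yong (房勇); Li Yinsheng (李银胜)
Affiliations:
School of Software, Fudan University, Shanghai 201203
National Engineering Laboratory for E-Commerce Transaction Technology (电子商务交易技术国家工程实验室), Shanghai 201203
Original format: PDF
Language of text: Chinese
CLC classification: Computing technology, computer technology
Keywords: Web information extraction; hidden Web; web crawler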

