Because web pages contain large amounts of dynamic JavaScript, much of their content is invisible to traditional web crawlers. To address this, we propose a hidden web page information extraction algorithm based on DOM state transitions. The algorithm incrementally constructs a DOM state transition machine, using DOM nodes and their click events as the machine's input events. Transition paths that cause the target nodes to change are searched recursively, and the content of the target nodes is captured automatically by replaying the click path. All clickable nodes in the DOM tree are collected as candidate nodes by overriding the prototype of the listener method. To shrink the search space, the algorithm compresses the DOM state space with the RTDM algorithm and a self-defined filter, and it performs heuristic search using the distance in the DOM tree between a candidate node and the target nodes as the h score. In experiments, the algorithm achieved 89.48% accuracy in extracting hidden web page content. It can be applied in areas such as automated web page testing and web crawling.
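
The heuristic step can be illustrated with a minimal sketch. The abstract does not say how the distance between a candidate node and a target node is computed, so this example assumes it is the number of edges between the two nodes through their lowest common ancestor. The names `Path`, `tree_distance`, and `rank_candidates` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Assumption: a DOM node is identified by the child indices on the way
# down from the root, e.g. (0, 1, 2) = root -> child 0 -> child 1 -> child 2.
Path = Tuple[int, ...]


def tree_distance(a: Path, b: Path) -> int:
    """Count the edges between two nodes of the same tree.

    The shared prefix of the two paths ends at the lowest common ancestor,
    so the distance is the number of steps left over on each side.
    """
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def rank_candidates(candidates: Sequence[Path], target: Path) -> List[Path]:
    """Order clickable candidates so the ones nearest the target come first.

    The distance plays the role of the h score: a smaller value means the
    candidate is tried earlier in the search.
    """
    return sorted(candidates, key=lambda c: tree_distance(c, target))
```

Under this assumption, clicks on nodes close to the target in the DOM tree are explored before clicks on distant nodes, which is one plausible way a distance-based score could prioritize the search.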