Among existing web page extraction techniques, main-text localization methods consider only a page's text information. When the main content contains many images and little text, these methods tend to drift, and their localization accuracy is low. To address this problem, this paper takes an information-theoretic view, combines the text and image information in a web page, and designs a method for estimating the amount of information and the amount of effective information carried by the page's images. Building on these estimates, it proposes a web page main-text localization algorithm based on the information content of both images and text. Experimental results show that the algorithm achieves relatively high localization accuracy across pages with different amounts of main-body text.