This paper proposes a Web entity information extraction method based on SVM and AdaBoost. First, an SVM-based method identifies the main data region of a Web page: it segments the page into data regions according to the display characteristics of Web entity instances, then identifies the main data region in which those instances are located. Second, based on the characteristics of Web entity attribute labels, an AdaBoost-based ensemble learning method automatically extracts Web entity information from the main data region of the page. Experiments on two real data sets, together with comparisons against related work, show that the method achieves good extraction results.
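To make the second stage more concrete, the following is a minimal sketch of AdaBoost with decision stumps that classifies text nodes as attribute labels (+1) or other text (-1). It is not the authors' implementation: the three features (whether the text ends with a colon, its character length, whether it is bold), the toy training data, and all function names are assumptions chosen only for illustration, since the abstract does not list the features actually used.

```python
import math

def train_stump(X, y, w):
    """Return (error, feature, threshold, sign) with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # The stump predicts `sign` when feature j >= thr, else -sign.
                preds = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign

def adaboost_fit(X, y, rounds=3):
    """Fit `rounds` weighted stumps; returns a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the logarithm below stays finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted vote of all stumps: +1 = attribute label, -1 = other text."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical features per text node: [ends_with_colon, char_length, is_bold]
X = [
    [1, 5, 1],   # label-like, e.g. a short bold string ending in a colon
    [1, 6, 0],   # label-like, ends in a colon
    [0, 4, 1],   # label-like, short and bold but no colon
    [0, 40, 0],  # value-like, long plain text
    [0, 12, 0],  # value-like
    [0, 25, 1],  # value-like, bold (e.g. a title)
    [0, 5, 0],   # value-like, short plain text (e.g. a price)
]
y = [1, 1, 1, -1, -1, -1, -1]

model = adaboost_fit(X, y, rounds=3)
train_predictions = [adaboost_predict(model, row) for row in X]
```

No single stump separates this toy data, so the example shows the point of boosting: each round reweights the samples the previous stumps got wrong, and the weighted vote of the stumps classifies the training set where one rule alone cannot. In practice the input rows would come only from text nodes inside the main data region identified in the first stage.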