Conference: International Symposium on Computational Intelligence and Design

Content Information Extraction of Theme Web Pages Based on Tag Information


Abstract

To extract the content information of theme web pages more accurately, this paper proposes a self-learning method based on tag information that calculates the information quantity of various tag indicators. The method predefines several tag information indexes and coefficient indexes, calculates the various tag information quantities of a web page in turn, and takes the candidate content of the page to be in the tag with the largest information quantity. To improve the versatility of the method, adaptive and adjustable coefficient weights are added to the formulas for tag information quantity. As more data is processed, tag collections, index values, and information quantity results are added to a learning database, which is used to adjust the weights of the coefficient factors. Experimental results show that the extraction method with adaptive and adjustable coefficient weights reaches a recall rate of more than 99 percent. The method also does not depend on the specific structure or style of a web page and has good versatility.
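
The selection step described in the abstract (score each tag by a weighted sum of indicators, then take the tag with the largest score) can be illustrated with a minimal sketch. The abstract does not give the paper's actual indicators, formulas, or weight-update rule, so everything concrete here is an assumption made for illustration: the three indicators (text length, link text length, punctuation count), the fixed weights in `DEFAULT_WEIGHTS`, the set of container tags, and the names `TagStats`, `information_quantity`, and `extract_content`. The self-learning adjustment of weights is not modeled; fixed weights stand in for it.

```python
from html.parser import HTMLParser

# Assumed for illustration: which tags count as candidate content containers.
CONTAINER_TAGS = {"body", "div", "td", "article", "section"}
SKIPPED_TAGS = {"script", "style"}
PUNCTUATION = set(",.;:!?，。；：！？")

# Hypothetical fixed weights; the paper adjusts its weights adaptively.
DEFAULT_WEIGHTS = {"text_len": 1.0, "link_len": -2.0, "punct": 5.0}


class TagStats(HTMLParser):
    """Collects raw indicators for each container tag in a page."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # one dict of indicators per container tag
        self.stack = []      # indices of currently open container tags
        self.link_depth = 0  # > 0 while inside an <a> element
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            self.blocks.append({"tag": tag, "attrs": dict(attrs), "text": [],
                                "text_len": 0, "link_len": 0, "punct": 0})
            self.stack.append(len(self.blocks) - 1)
        elif tag == "a":
            self.link_depth += 1
        elif tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTAINER_TAGS and self.stack:
            self.stack.pop()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack or self.skip_depth:
            return
        # Text is credited to the nearest enclosing container only, so an
        # outer tag does not automatically outscore the tags nested in it.
        block = self.blocks[self.stack[-1]]
        block["text"].append(text)
        block["text_len"] += len(text)
        block["punct"] += sum(ch in PUNCTUATION for ch in text)
        if self.link_depth:
            block["link_len"] += len(text)


def information_quantity(block, weights):
    """Weighted sum of the block's indicators (an assumed scoring form)."""
    return sum(weight * block[name] for name, weight in weights.items())


def extract_content(html, weights=DEFAULT_WEIGHTS):
    """Returns the container block with the largest score, or None."""
    parser = TagStats()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return None
    return max(parser.blocks, key=lambda b: information_quantity(b, weights))


SAMPLE_HTML = (
    '<body><div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>'
    '<div id="main"><p>Tag information can locate the main text. '
    'It works, in this toy case, without site rules.</p></div>'
    '<div id="foot">Copyright</div></body>'
)

if __name__ == "__main__":
    best = extract_content(SAMPLE_HTML)
    print(best["attrs"].get("id"))  # prints: main
```

In this toy page the navigation block is penalized because all of its text is link text, while the main block gains from its length and punctuation. A version closer to the abstract would store each page's tag collection, indicator values, and scores, and use them to re-estimate the weights over time.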
