Journal: International Journal of Data Mining & Knowledge Management Process

A Study of Content Extraction From Web Pages Based on Links


Abstract

Extracting the main content from a web page is a preprocessing step for web information systems. Wrapper-based content extraction is limited to one specific information source and depends heavily on web page structure; it is seldom employed in practice. This paper therefore proposes a new content extraction method that discovers web page content according to the number of punctuation marks and the ratio of non-hyperlink characters to the characters contained in hyperlinks. It can effectively eliminate noise and extract the main content blocks from a web page. Experimental results show that the approach is accurate and suitable for most web sites.
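The two signals named in the abstract, punctuation count and the ratio of non-hyperlink characters to hyperlink characters, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the block boundaries (`BLOCK_TAGS`), the punctuation set, the thresholds `min_punct` and `min_ratio`, and the names `BlockStats` and `extract_main_content` are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumptions of this sketch (not taken from the paper):
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "ol", "table", "tr", "section", "article"}
SKIP_TAGS = {"script", "style"}
PUNCTUATION = set(",.;:!?，。；：！？、")


class BlockStats(HTMLParser):
    """Split a page into text blocks and count link / non-link characters."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.link_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self._start_block()

    def _start_block(self):
        self.current = {"text": "", "link_chars": 0, "non_link_chars": 0}

    def _flush(self):
        # Keep the block only if it holds some visible text.
        if self.current["text"].strip():
            self.blocks.append(self.current)
        self._start_block()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self.skip_depth:
            return
        count = len("".join(data.split()))  # non-whitespace characters only
        self.current["text"] += data
        key = "link_chars" if self.link_depth else "non_link_chars"
        self.current[key] += count

    def close(self):
        super().close()
        self._flush()


def extract_main_content(html, min_punct=2, min_ratio=2.0):
    """Return blocks that pass both (hypothetical) thresholds."""
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    kept = []
    for block in parser.blocks:
        punct = sum(ch in PUNCTUATION for ch in block["text"])
        # Avoid division by zero for blocks that contain no links.
        ratio = block["non_link_chars"] / max(block["link_chars"], 1)
        if punct >= min_punct and ratio >= min_ratio:
            kept.append(" ".join(block["text"].split()))
    return kept
```

Under these assumptions, a navigation bar made only of links scores zero on both signals and is dropped, while a paragraph of running prose with several punctuation marks and no links is kept. Real pages would need tuned thresholds and a more careful notion of a block.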
