
Noise Reduction of Web Pages via Feature Analysis

Abstract

Noise in web pages seriously affects studies that use web pages as datasets. As fundamental work in information retrieval, removing noise from web pages quickly and accurately has received wide attention. To address the low efficiency of traditional noise reduction algorithms, this paper proposes a noise reduction algorithm that uses the DOM (Document Object Model) to preserve the original structure of web pages. With this method, noise can be located rapidly by combining several analyzed features, e.g., Link Density and Punctuation Density. The approach is evaluated on a group of web pages selected randomly from several well-known websites. Experiments show satisfactory results.
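
The abstract names Link Density and Punctuation Density as features but does not give their formulas or how they are combined. The sketch below is one plausible reading, not the paper's algorithm: it assumes link density is the share of a block's characters that sit inside `<a>` elements, punctuation density is the share of characters that are punctuation marks, and a block is flagged as noise when either crosses a fixed threshold. The names `BlockStats`, `FeatureParser`, `score_blocks`, the tag set, and both threshold values are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumed set of block-level containers to score; the paper's choice is not stated.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "td", "nav", "section", "article", "header", "footer"}
SKIP_TAGS = {"script", "style"}
PUNCTUATION = set(",.;:!?，。；：！？、")


class BlockStats:
    """Character counts for the text directly owned by one block element."""

    def __init__(self, tag):
        self.tag = tag
        self.text_chars = 0
        self.link_chars = 0
        self.punct_chars = 0

    @property
    def link_density(self):
        return self.link_chars / self.text_chars if self.text_chars else 0.0

    @property
    def punctuation_density(self):
        return self.punct_chars / self.text_chars if self.text_chars else 0.0


class FeatureParser(HTMLParser):
    """Walks the markup and attributes text to the nearest enclosing block."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open block elements
        self.blocks = []     # every block, in document order
        self.link_depth = 0  # > 0 while inside an <a> element
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            block = BlockStats(tag)
            self.stack.append(block)
            self.blocks.append(block)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS:
            # Close back to the nearest open block with this tag,
            # tolerating children that were never closed.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i].tag == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        text = "".join(data.split())  # ignore whitespace when counting
        if not text or not self.stack or self.skip_depth:
            return
        block = self.stack[-1]
        block.text_chars += len(text)
        block.punct_chars += sum(ch in PUNCTUATION for ch in text)
        if self.link_depth:
            block.link_chars += len(text)


def is_noise(block, max_link_density=0.5, min_punct_density=0.01):
    """Assumed rule: mostly-link text or punctuation-free text is noise."""
    return (block.link_density > max_link_density
            or block.punctuation_density < min_punct_density)


def score_blocks(html):
    """Return (tag, is_noise) for each block that directly owns some text."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    return [(b.tag, is_noise(b)) for b in parser.blocks if b.text_chars]
```

Under these assumptions, a navigation bar whose text consists entirely of links has a link density of 1.0 and is flagged, while a sentence-like paragraph with commas and periods and a single inline link is kept. Real pages would likely need thresholds tuned per site and more features than these two.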
