Cleaning Web pages for effective Web content mining.

Abstract

Web pages usually contain many noisy blocks, such as advertisements, navigation bars, and copyright notices. These noisy blocks can seriously degrade web content mining because their contents are irrelevant to the main content of the page. Eliminating noisy blocks before mining is therefore important for improving both accuracy and efficiency. A few existing approaches detect noisy blocks whose contents are exactly the same, but they are weak at detecting near-duplicate blocks, such as navigation bars.

This thesis proposes a new system, WebPageCleaner, which, given a collection of web pages from a web site, eliminates noisy blocks from those pages so as to improve the accuracy and efficiency of web content mining. WebPageCleaner detects noisy blocks with exactly the same contents as well as those with near-duplicate contents. It is based on the observation that noisy blocks usually share common contents and appear frequently on a given web site. WebPageCleaner consists of three modules: block extraction, block importance retrieval, and cleaned files generation. A vision-based technique is employed to extract blocks from web pages. Each block is assigned an importance degree according to block features such as block position and the level of similarity of block contents to each other. Finally, a collection of cleaned files with a high importance degree is generated and used for web content mining. The proposed technique is evaluated using Naive Bayes text classification. Experiments show that WebPageCleaner yields more efficient and more accurate web page classification than existing approaches.
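The core observation, that noisy blocks share common contents and recur across a site, can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not give the importance formula, so the scoring below (one minus the fraction of other pages holding a near-duplicate block), the use of word shingles with Jaccard similarity, and the names `shingles`, `jaccard`, `block_importance`, `clean_pages`, `sim_threshold`, and `min_importance` are all assumptions made for illustration. Blocks are assumed to have been extracted already as plain-text strings; the vision-based extraction step and the block-position feature are not modeled.

```python
from typing import Dict, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of a block's text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def block_importance(pages: Dict[str, List[str]],
                     sim_threshold: float = 0.5) -> Dict[str, List[float]]:
    """Score each block as 1 - (fraction of other pages with a near-duplicate).

    Hypothetical scoring: a block repeated on every other page scores 0.0,
    a block unique to its page scores 1.0.
    """
    sh = {url: [shingles(b) for b in blocks] for url, blocks in pages.items()}
    scores: Dict[str, List[float]] = {}
    for url, block_sets in sh.items():
        others = [u for u in sh if u != url]
        page_scores = []
        for s in block_sets:
            hits = sum(
                1 for u in others
                if any(jaccard(s, t) >= sim_threshold for t in sh[u])
            )
            page_scores.append(1.0 - hits / len(others) if others else 1.0)
        scores[url] = page_scores
    return scores


def clean_pages(pages: Dict[str, List[str]],
                sim_threshold: float = 0.5,
                min_importance: float = 0.5) -> Dict[str, List[str]]:
    """Keep only the blocks whose importance reaches min_importance."""
    scores = block_importance(pages, sim_threshold)
    return {
        url: [b for b, s in zip(blocks, scores[url]) if s >= min_importance]
        for url, blocks in pages.items()
    }


# Made-up example site: the navigation bar on b.html differs slightly
# ("Jobs" is appended), so exact matching would miss it.
pages = {
    "a.html": [
        "Home News Sports Weather Contact About",
        "The city council approved a new budget for public transit today",
        "Copyright 2006 Example Site All rights reserved",
    ],
    "b.html": [
        "Home News Sports Weather Contact About Jobs",
        "The local team won the championship game after a dramatic overtime finish",
        "Copyright 2006 Example Site All rights reserved",
    ],
    "c.html": [
        "Home News Sports Weather Contact About",
        "Heavy rain is expected across the region through the weekend",
        "Copyright 2006 Example Site All rights reserved",
    ],
}
cleaned = clean_pages(pages)
```

In this toy data the navigation bar on `b.html` is a near-duplicate of the one on the other pages, not an exact copy, and it is still filtered out along with the copyright notice, leaving one main-content block per page. The comparison is quadratic in the number of blocks, which is fine for a sketch but would need an index (for example, hashing of shingles) on a real site.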

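Bibliographic record

  • Author: Li, Jing
  • Author affiliation: University of Windsor (Canada)
  • Degree-granting institution: University of Windsor (Canada)
  • Subject: Computer Science
  • Degree: M.Sc.
  • Year: 2006
  • Pagination: 67 p.
  • Total pages: 67
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology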
