Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media
Coping With Noise in a Real-World Weblog Crawler and Retrieval System

Abstract

In this paper we examine the effects of noise when creating a real-world weblog corpus for information retrieval. We focus on the DiffPost approach (Lee et al. 2008) to removing noise from blog pages, and examine the difficulties encountered when crawling the blogosphere to build a real-world corpus of blog pages. We introduce and evaluate a number of enhancements to the original DiffPost approach that increase the robustness of the algorithm. We then extend DiffPost by looking at the anchor-text-to-text ratio, and find that the time interval between crawls matters more to the successful application of noise-removal algorithms in the blog context than any additional improvement to the removal algorithm itself.
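
The abstract does not spell out the algorithm, so the following is only a minimal sketch under stated assumptions: that DiffPost-style filtering amounts to dropping lines that two pages from the same blog have in common, and that the anchor-text-to-text ratio is the share of visible characters that sit inside links. The function names (`diff_lines`, `anchor_text_ratio`, `clean_page`) and the `max_ratio` threshold of 0.5 are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser


def diff_lines(page_a: str, page_b: str) -> list[str]:
    """Keep the non-empty lines of page_a that do not also occur in page_b.

    Assumption: lines shared by two pages of the same blog are template
    noise (navigation, sidebars, footers) rather than post content.
    """
    seen = {line.strip() for line in page_b.splitlines()}
    kept = []
    for line in page_a.splitlines():
        stripped = line.strip()
        if stripped and stripped not in seen:
            kept.append(stripped)
    return kept


class _AnchorCounter(HTMLParser):
    """Counts visible characters, and how many of them fall inside <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0          # current nesting depth of <a> elements
        self.anchor_chars = 0   # visible characters inside links
        self.total_chars = 0    # all visible characters

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.depth > 0:
            self.anchor_chars += n


def anchor_text_ratio(fragment: str) -> float:
    """Fraction of visible text that is link text (0.0 if there is no text)."""
    counter = _AnchorCounter()
    counter.feed(fragment)
    counter.close()
    if counter.total_chars == 0:
        return 0.0
    return counter.anchor_chars / counter.total_chars


def clean_page(page_a: str, page_b: str, max_ratio: float = 0.5) -> list[str]:
    """Apply line differencing, then drop link-heavy lines (hypothetical threshold)."""
    return [
        line
        for line in diff_lines(page_a, page_b)
        if anchor_text_ratio(line) <= max_ratio
    ]
```

In this sketch, `page_b` would be a second page from the same blog. A line-differencing filter of this kind only works while the two pages still share the same template, which is one plausible reading of why the interval between crawls matters so much.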
