FoCUS: Learning to Crawl Web Forums

Abstract

In this paper, we present Forum Crawler Under Supervision (FoCUS), a supervised web-scale forum crawler. The goal of FoCUS is to crawl relevant forum content from the web with minimal overhead. Forum threads hold the content that forum crawlers target. Although forums differ in layout and style and are powered by different forum software packages, they always have similar implicit navigation paths, connected by specific URL types, that lead users from entry pages to thread pages. Based on this observation, we reduce the web forum crawling problem to a URL-type recognition problem. We then show how to learn accurate and effective regular expression patterns of implicit navigation paths from automatically created training sets, using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as five annotated forums and applied to a large set of unseen forums. Our test results show that FoCUS achieved over 98 percent effectiveness and 97 percent coverage on a large set of test forums powered by over 150 different forum software packages. In addition, the results of applying FoCUS to more than 100 community Question and Answer sites and blog sites demonstrated that the concept of implicit navigation paths could apply to other social media sites.
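To make the reduction to URL-type recognition concrete, the following is a minimal sketch of how a crawler might use regular expression patterns to decide which links to follow. It is not the FoCUS implementation: the patterns are hand-written for an imaginary forum rather than learned, and the type names (`index`, `thread`), the URL layouts, and the functions `classify_url` and `urls_to_follow` are all assumptions introduced for illustration.

```python
import re

# Hypothetical, hand-written patterns for one imaginary forum.
# FoCUS learns such patterns from automatically created training sets;
# this sketch only shows how learned patterns could be applied.
URL_PATTERNS = {
    "index": re.compile(r"^/forumdisplay\.php\?f=\d+$"),
    "thread": re.compile(r"^/showthread\.php\?t=\d+$"),
}


def classify_url(path):
    """Return the assumed URL type of a path, or "other" if none matches."""
    for url_type, pattern in URL_PATTERNS.items():
        if pattern.match(path):
            return url_type
    return "other"


def urls_to_follow(paths):
    """Keep only links that lie on the assumed entry-to-thread path."""
    return [p for p in paths if classify_url(p) != "other"]
```

Under these assumptions, a crawler would call `urls_to_follow` on the links extracted from each page and skip everything else (member profiles, login pages, and so on), which is where the overhead savings would come from.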