
Automating Blog Crawling Using Pattern Recognition

Abstract

Social media plays an important role in the propagation and dissemination of ideas and thoughts, leading to the formation of diverse online communities. Compared to a myriad of other social media sites and applications, blogs provide a convenient platform for users to post detailed information, engage in active discussions, and share content on other social media sites such as Facebook and Twitter. The blogosphere has thus become an enormous and ever-growing part of open-source intelligence. In order to track and monitor online social behavior, particularly from blogs, the first challenge is mining the vast pool of unstructured data. Several approaches have been developed to extract blog data using focused crawling, which requires considerable time, effort, and manual intervention. To scale up this process and cope with continuously changing blog structures, we propose a generic, scalable, automated blog crawler with the ability to identify different patterns in the Hypertext Markup Language (HTML) structure of blog pages and extract data, such as title, author, date, content, and tags, from different blog posts. Using this crawler, we have crawled 530 blog sites comprising 894,856 blog posts so far.
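The abstract only sketches the approach, so the following Python snippet is a rough, hypothetical illustration of the structural pattern-matching idea it describes, not the authors' implementation: it matches heuristic keyword patterns against class/id attributes in a page's HTML to recover common blog post fields. The pattern lists, function names, and sample markup are all assumptions introduced for illustration.

```python
# Minimal sketch (assumed, not from the paper): heuristic pattern matching
# over a blog page's HTML structure to extract common post fields.
from bs4 import BeautifulSoup

# Hypothetical patterns: substrings that often appear in the class/id
# attributes of post elements across different blog platforms.
FIELD_PATTERNS = {
    "title":   ["post-title", "entry-title", "headline"],
    "author":  ["author", "byline", "posted-by"],
    "date":    ["date", "published", "timestamp"],
    "content": ["post-content", "entry-content", "article-body"],
    "tags":    ["tags", "labels", "categories"],
}

def extract_post_fields(html: str) -> dict:
    """Scan every element's class/id attributes for known patterns and
    keep the text of the first matching element per field."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for element in soup.find_all(True):
        attrs = " ".join(element.get("class", []) + [element.get("id") or ""]).lower()
        for field, patterns in FIELD_PATTERNS.items():
            if field not in record and any(p in attrs for p in patterns):
                record[field] = element.get_text(" ", strip=True)
    return record

if __name__ == "__main__":
    # Toy blog post markup for demonstration only.
    sample = """
    <article>
      <h1 class="entry-title">Hello Blogosphere</h1>
      <span class="byline">Jane Doe</span>
      <time class="published">2021-03-14</time>
      <div class="entry-content">First post!</div>
      <ul class="tags"><li>intro</li><li>meta</li></ul>
    </article>
    """
    print(extract_post_fields(sample))
```

A production crawler in the spirit of the paper would presumably go further, for example by learning or caching per-site patterns and re-detecting them when a blog's template changes, but those details are not given in the abstract.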
