
Automating Blog Crawling Using Pattern Recognition

Abstract

Social media plays an important role in the propagation and dissemination of ideas and thoughts, leading to the formation of diverse online communities. Compared to a myriad of other social media sites and applications, blogs provide a convenient platform for users to post detailed information, engage in active discussions, and share content on other social media sites such as Facebook and Twitter. The blogosphere has thus become an enormous and ever-growing part of open-source intelligence. In order to track and monitor online social behavior, particularly from blogs, the first challenge is mining the vast pool of unstructured data. Several approaches have been developed to extract blog data using focused crawling, which requires considerable time, effort, and manual intervention. To scale up this process and cope with continuously changing blog structures, we propose a generic, scalable, automated blog crawler with the ability to identify different patterns in the Hypertext Markup Language (HTML) structure of blog pages and extract data, such as title, author, date, content, and tags, from different blog posts. Using this crawler, we have crawled 530 blog sites comprising 894,856 blog posts so far.
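The abstract only sketches the approach, so the following Python snippet is a rough, hypothetical illustration of the structural pattern-matching idea it describes, not the authors' implementation: it matches heuristic keyword patterns against class/id attributes in a page's HTML to recover common blog post fields. The pattern lists, function names, and sample markup are all assumptions introduced for illustration.

```python
# Minimal sketch (assumed, not from the paper): heuristic pattern matching
# over a blog page's HTML structure to extract common post fields.
from bs4 import BeautifulSoup

# Hypothetical patterns: substrings that often appear in the class/id
# attributes of post elements across different blog platforms.
FIELD_PATTERNS = {
    "title":   ["post-title", "entry-title", "headline"],
    "author":  ["author", "byline", "posted-by"],
    "date":    ["date", "published", "timestamp"],
    "content": ["post-content", "entry-content", "article-body"],
    "tags":    ["tags", "labels", "categories"],
}

def extract_post_fields(html: str) -> dict:
    """Scan every element's class/id attributes for known patterns and
    keep the text of the first matching element per field."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for element in soup.find_all(True):
        attrs = " ".join(element.get("class", []) + [element.get("id") or ""]).lower()
        for field, patterns in FIELD_PATTERNS.items():
            if field not in record and any(p in attrs for p in patterns):
                record[field] = element.get_text(" ", strip=True)
    return record

if __name__ == "__main__":
    # Toy blog post markup for demonstration only.
    sample = """
    <article>
      <h1 class="entry-title">Hello Blogosphere</h1>
      <span class="byline">Jane Doe</span>
      <time class="published">2021-03-14</time>
      <div class="entry-content">First post!</div>
      <ul class="tags"><li>intro</li><li>meta</li></ul>
    </article>
    """
    print(extract_post_fields(sample))
```

A production crawler in the spirit of the paper would presumably go further, for example by learning or caching per-site patterns and re-detecting them when a blog's template changes, but those details are not given in the abstract.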
