Journal: Entropy

Do We Really Need to Catch Them All? A New User-Guided Social Media Crawling Method

Abstract

With the growing use of popular social media services such as Facebook and Twitter, it is challenging to collect all content from these networks without access to the core infrastructure or without paying for it. If not all content can be collected, one must decide which data matter most. In this work we present a novel User-guided Social Media Crawling method (USMC) that collects data from social media, using the wisdom of the crowd to decide the order in which user-generated content should be collected so as to cover as many user interactions as possible. USMC is validated by crawling 160 public Facebook pages, containing content from 368 million users and 1.3 billion interactions, and it is compared with two other crawling methods. The results show that it is possible to cover approximately 75% of the interactions on a Facebook page by sampling just 20% of its posts, while reducing the crawling time by 53%. In addition, the social network constructed from the 20% sample contains more than 75% of the users and edges of the social network created from all posts, and it has a similar degree distribution.
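
The idea of ordering posts so that a small sample covers most interactions can be illustrated with a minimal sketch. It is not the USMC implementation: the abstract does not specify how posts are ranked, so the sketch assumes that a per-post interaction count (for example likes plus comments) is already known and simply crawls the highest-ranked posts first. The function names, the `toy_posts` data, and the resulting coverage value are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def rank_posts_by_interactions(post_interactions: Dict[str, int]) -> List[str]:
    """Return post IDs ordered from most to least interactions.

    Assumption: the interaction count per post is known up front.
    Ties are broken by post ID so the ordering is deterministic.
    """
    return sorted(post_interactions, key=lambda pid: (-post_interactions[pid], pid))


def coverage_of_sample(post_interactions: Dict[str, int], fraction: float) -> float:
    """Share of all interactions covered by the top `fraction` of ranked posts."""
    if not post_interactions:
        return 0.0
    total = sum(post_interactions.values())
    if total == 0:
        return 0.0
    ranked = rank_posts_by_interactions(post_interactions)
    # Always sample at least one post, even for very small fractions.
    k = max(1, round(len(ranked) * fraction))
    covered = sum(post_interactions[pid] for pid in ranked[:k])
    return covered / total


# Synthetic toy data (hypothetical, not from the paper): 10 posts, 1000 interactions.
toy_posts = {
    "p01": 400, "p02": 300, "p03": 100, "p04": 60, "p05": 50,
    "p06": 40, "p07": 20, "p08": 15, "p09": 10, "p10": 5,
}

if __name__ == "__main__":
    print(rank_posts_by_interactions(toy_posts)[:2])
    print(coverage_of_sample(toy_posts, 0.2))
```

Because interaction counts on social media pages tend to be heavily skewed toward a few posts, a ranking of this kind can reach high coverage with a small sample. How closely it tracks the figures reported in the abstract depends on the actual ranking signal used by USMC, which this sketch does not reproduce.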