International AAAI Conference on Weblogs and Social Media

Filtering Noisy Web Data by Identifying and Leveraging Users' Contributions



Abstract

In this paper we present several methods for collecting Web textual content and filtering noisy data. We show that knowing which user publishes which content can contribute to detecting noise. We begin by collecting data from two forums and from Twitter. For the forums, we extract the meaningful information from each discussion (texts of questions and answers, user IDs, dates). For the Twitter dataset, we first detect tweets with very similar texts, which helps avoid redundancy in further analysis. This also yields clusters of tweets that can be used in the same way as the forum discussions: both can be modeled by bipartite graphs. The analysis of the nodes of the resulting graphs shows that network structure and content type (noisy or relevant) are not independent, so studying the network can help in filtering noise.
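The pipeline the abstract describes (clustering near-duplicate tweets, then linking users to the resulting clusters in a bipartite graph) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the sample tweets, the Jaccard token-set similarity, and the 0.6 threshold are all assumptions made for the example.

```python
# Hypothetical sample data: (user_id, text) pairs standing in for tweets.
tweets = [
    ("u1", "great tips for cleaning noisy web data"),
    ("u2", "great tips for cleaning noisy web data!!"),
    ("u3", "buy cheap followers now"),
    ("u1", "buy cheap followers now fast"),
]

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two texts."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Step 1: greedily cluster near-duplicate tweets.
# A tweet joins the first cluster whose representative (first tweet)
# is similar enough; the 0.6 threshold is purely illustrative.
clusters = []  # each cluster: list of (user, text)
for user, text in tweets:
    for cluster in clusters:
        if jaccard(text, cluster[0][1]) >= 0.6:
            cluster.append((user, text))
            break
    else:
        clusters.append([(user, text)])

# Step 2: build the bipartite graph users <-> tweet clusters,
# stored as two adjacency dicts (one per side of the bipartition).
user_to_clusters = {}
cluster_to_users = {}
for cid, cluster in enumerate(clusters):
    for user, _ in cluster:
        user_to_clusters.setdefault(user, set()).add(cid)
        cluster_to_users.setdefault(cid, set()).add(user)

print(len(clusters))                   # number of tweet clusters → 2
print(sorted(user_to_clusters["u1"]))  # clusters u1 contributed to → [0, 1]
```

Degree distributions and other structural properties of the two node sets in such a graph are the kind of signal the paper reports as correlated with content type (noisy vs. relevant).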

