IEEE Transactions on Knowledge and Data Engineering

Towards Real-Time, Country-Level Location Classification of Worldwide Tweets

Abstract

Growing interest in social media as a source for research has motivated work on automatically geolocating tweets, since the majority of tweets carry no explicit location information. In contrast to much previous work, which has focused on location classification of tweets restricted to a specific country, here we undertake the task in a broader context by classifying global tweets at the country level, which is so far unexplored in a real-time scenario. We analyze the extent to which a tweet's country of origin can be determined using eight tweet-inherent features for classification. Furthermore, we use two datasets, collected a year apart, to analyze the extent to which a model trained on historical tweets can still be leveraged to classify new tweets. With classification experiments on all 217 countries in our datasets, as well as on the top 25 countries, we offer some insights into the best use of tweet-inherent features for accurate country-level classification of tweets. We find that relying on a single feature, such as tweet content alone (the most widely used feature in previous work), leaves much to be desired. Choosing an appropriate combination of tweet content and metadata can lead to substantial improvements of between 20 and 50 percent. We observe that tweet content, the user's self-reported location and the user's real name, all of which are inherent in a tweet and available in a real-time scenario, are particularly useful for determining the country of origin. We also test whether a model trained on historical tweets can be applied to new tweets, and find that choosing a combination of features whose utility does not fade over time can lead to comparable performance, avoiding the need to retrain. However, accurate classification becomes slightly more difficult for countries that have much in common, especially English-speaking and Spanish-speaking countries.
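
To make the idea of combining tweet content with metadata concrete, the following is a minimal sketch rather than the authors' method. It assumes a hypothetical tweet schema with the fields `content`, `user_location` and `user_name`, uses a small multinomial Naive Bayes classifier as a stand-in (the abstract does not name a classifier), and trains on four invented toy tweets with made-up country labels.

```python
import math
from collections import Counter, defaultdict

# Hypothetical field names: the abstract names these features but not a schema.
FIELDS = ("content", "user_location", "user_name")


def featurize(tweet, fields=FIELDS):
    """Turn the selected tweet fields into field-prefixed lowercase tokens."""
    tokens = []
    for field in fields:
        for word in tweet.get(field, "").lower().split():
            tokens.append(f"{field}={word}")
    return tokens


class NaiveBayesCountryClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self, fields=FIELDS):
        self.fields = fields
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, tweets, countries):
        for tweet, country in zip(tweets, countries):
            self.class_counts[country] += 1
            for tok in featurize(tweet, self.fields):
                self.token_counts[country][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, tweet):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best, best_score = None, -math.inf
        for country, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.token_counts[country].values()) + vocab_size
            for tok in featurize(tweet, self.fields):
                if tok in self.vocab:  # ignore tokens never seen in training
                    count = self.token_counts[country][tok]
                    score += math.log((count + 1) / denom)
            if score > best_score:  # ties keep the first country seen
                best, best_score = country, score
        return best


# Invented toy data, for illustration only.
train = [
    ({"content": "lovely morning by the thames",
      "user_location": "london uk", "user_name": "emma clarke"}, "GB"),
    ({"content": "qué calor hace hoy",
      "user_location": "madrid", "user_name": "javier garcia"}, "ES"),
    ({"content": "qué calor hace hoy",
      "user_location": "buenos aires", "user_name": "lucia fernandez"}, "AR"),
    ({"content": "traffic on the freeway again",
      "user_location": "los angeles ca", "user_name": "mike johnson"}, "US"),
]
tweets, countries = zip(*train)

model = NaiveBayesCountryClassifier().fit(tweets, countries)
content_only = NaiveBayesCountryClassifier(fields=("content",)).fit(tweets, countries)

query = {"content": "qué calor hace hoy",
         "user_location": "buenos aires", "user_name": "ana"}
print("content + metadata:", model.predict(query))
print("content only:      ", content_only.predict(query))
```

On this toy data the combined model picks `AR`, because the self-reported location breaks the tie, whereas the content-only model has no evidence to separate the two Spanish-language classes. That mirrors, in miniature, the abstract's point that content alone is weakest for countries sharing a language.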