Conference on Empirical Methods in Natural Language Processing

More Features Are Not Always Better: Evaluating Generalizing Models in Incident Type Classification of Tweets

Abstract

Social media represents a rich source of up-to-date information about events such as incidents. The sheer amount of available information makes machine learning approaches a necessity for further processing. This learning problem is often concerned with regionally restricted datasets such as data from only one city. Because social media data such as tweets varies considerably across different cities, the training of efficient models requires labeling data from each city of interest, which is costly and time consuming. In this study, we investigate which features are most suitable for training generalizable models, i.e., models that show good performance across different datasets. We re-implemented the most popular features from the state of the art in addition to other novel approaches, and evaluated them on data from ten different cities. We show that many sophisticated features are not necessarily valuable for training a generalized model and are outperformed by classic features such as plain word-n-grams and character-n-grams.
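
The classic features the abstract names, word n-grams and character n-grams, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the lowercasing, the feature-name prefixes, and the example tweet are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return contiguous word n-grams from a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return contiguous character n-grams (spaces included) from the lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_features(text, word_n=(1, 2), char_n=(3,)):
    """Count word and character n-grams into one sparse feature dictionary."""
    feats = Counter()
    for n in word_n:
        # Prefix "w<n>:" keeps word n-grams apart from character n-grams.
        feats.update(f"w{n}:{g}" for g in word_ngrams(text, n))
    for n in char_n:
        feats.update(f"c{n}:{g}" for g in char_ngrams(text, n))
    return feats


# Made-up example text, not taken from the paper's data.
tweet = "Car crash on Main Street"
features = ngram_features(tweet)
```

A dictionary like `features` could be fed to any bag-of-features classifier. Because the counts depend only on surface strings, a model trained on tweets from one city can be applied to another city's tweets without city-specific resources, which is the kind of generalization the study evaluates.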
