Conference: Annual Meeting of the Association for Computational Linguistics

Tweet2Vec: Character-Based Distributed Representations for Social Media

Abstract

Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary size for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.
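
The vocabulary problem described in the abstract can be illustrated with a small sketch. This is not the tweet2vec model itself, only a toy comparison of word-level and character-level vocabularies on noisy text. The helper names (`word_vocab`, `char_vocab`, `oov_rate`) and the example strings are hypothetical and were made up for this illustration.

```python
# Toy illustration (assumption: whitespace tokenization and lowercasing;
# the paper's actual preprocessing is not described in this abstract).

def word_vocab(texts):
    """Collect the set of lowercased, whitespace-separated tokens."""
    return {tok for t in texts for tok in t.lower().split()}

def char_vocab(texts):
    """Collect the set of lowercased characters, including spaces."""
    return {ch for t in texts for ch in t.lower()}

def oov_rate(items, vocab):
    """Fraction of items that do not appear in the vocabulary."""
    items = list(items)
    if not items:
        return 0.0
    return sum(1 for x in items if x not in vocab) / len(items)

# Hypothetical "training" posts and one noisy test post.
train = ["so happy today", "see you tomorrow", "great game tonight"]
test = "sooo happyyy 2day"

wv = word_vocab(train)
cv = char_vocab(train)

# Every noisy token is unseen at the word level, while almost every
# character has already been seen at the character level.
word_oov = oov_rate(test.lower().split(), wv)
char_oov = oov_rate(test.lower(), cv)

print(f"word-level OOV rate: {word_oov:.2f}")
print(f"char-level OOV rate: {char_oov:.2f}")
```

In this toy setting, creative spellings push the word-level out-of-vocabulary rate up sharply, while the character inventory stays small and nearly closed. That is the situation in which the abstract reports the largest gains for the character-based model.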
