
A #hashtagtokenizer for Social Media Messages

Abstract

In social media, mainly due to length constraints, users write succinct messages and use hashtags to refer to entities, events, sentiments or ideas. Hashtags carry a lot of content that can help in many text processing tasks and applications, such as sentiment analysis, named entity recognition and information extraction. However, identifying the individual words of a hashtag is not trivial, because traditional POS taggers typically treat it as a single token even though it may contain multiple words, e.g. #fergusondecision, #imcharliehebdo. In this work, we propose a generic model for hashtag tokenisation that aims to split a hashtag into several tokens corresponding to the individual words it contains (e.g. "#imcharliehebdo" would become four tokens: "#", "i", "am" and "Charlie Hebdo"). Our hashtag tokenizer is based on a machine learning segmentation method for the Chinese language and also makes use of Wikipedia as an encyclopedic knowledge base. We evaluated the inference power of our approach by comparing the tokens it produces to those produced by human taggers. The results demonstrate the good accuracy and applicability of the proposed model for general-purpose applications.
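To illustrate the segmentation task the abstract describes, the short sketch below splits a hashtag into candidate words using a simple dictionary-based word break (dynamic programming). This is only a minimal illustration under stated assumptions, not the paper's model: the paper adapts a machine learning segmentation method developed for Chinese and draws its lexical knowledge from Wikipedia, whereas the function name and the tiny vocabulary here are hypothetical.

# Minimal illustrative sketch (not the authors' model): segment a hashtag into
# words with a dictionary-based word break via dynamic programming. The paper
# instead adapts a machine learning segmentation method for Chinese and uses
# Wikipedia as a knowledge base; the vocabulary below is a hypothetical toy set.

def segment_hashtag(hashtag, vocabulary):
    """Return ['#', w1, w2, ...] where the words cover the hashtag body."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    best = [None] * (n + 1)   # best[i] = fewest-word segmentation of text[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in vocabulary:
                candidate = best[j] + [text[j:i]]
                # prefer segmentations with fewer (hence longer) words
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    words = best[n] if best[n] is not None else [text]  # fall back to whole string
    return ["#"] + words

# Hypothetical toy vocabulary; the paper derives lexical coverage from Wikipedia.
vocab = {"i", "im", "am", "charlie", "hebdo", "ferguson", "decision"}

print(segment_hashtag("#fergusondecision", vocab))  # ['#', 'ferguson', 'decision']
print(segment_hashtag("#imcharliehebdo", vocab))    # ['#', 'im', 'charlie', 'hebdo']
# A further normalisation step, as in the paper's example, would expand "im"
# into "i" + "am" and restore casing for named entities such as "Charlie Hebdo".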
