Journal: Computational Social Networks

Text normalization for named entity recognition in Vietnamese tweets



Abstract

Background: Named entity recognition (NER) is the task of detecting named entities in documents and categorizing them into predefined classes, such as person, location, and organization. This paper focuses on tweets posted on Twitter. Because tweets are noisy, irregular, and brief, and contain acronyms and spelling errors, NER in tweets is a challenging task. Many approaches have been proposed to deal with this problem for tweets written in English, German, Chinese, etc., but none for Vietnamese tweets.

Methods: We propose a method that normalizes a tweet before passing it as input to a learning model for NER in Vietnamese tweets. The normalization step detects spelling errors in a tweet and corrects them using an improved Dice's coefficient or n-grams. A Support Vector Machine learning algorithm is employed to learn a classifier using six different types of features.

Results and Conclusion: We train our method on a training set consisting of more than 40,000 named entities and evaluate it on a testing set consisting of 3,186 named entities. The experimental results show that our system achieves state-of-the-art performance, with an F1 score of 82.13%.
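To make the normalization step more concrete, the following is a minimal sketch of spelling correction with the standard Dice coefficient over character bigrams. It is not the paper's "improved" variant, whose details the abstract does not give. The function names, the tiny lexicon, and the 0.5 threshold are all hypothetical, chosen only for illustration.

```python
import unicodedata
from collections import Counter


def char_bigrams(word: str) -> Counter:
    """Return the multiset of character bigrams of `word` (lowercased, NFC-normalized)."""
    w = unicodedata.normalize("NFC", word).lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def dice_coefficient(a: str, b: str) -> float:
    """Standard Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over bigram multisets."""
    x, y = char_bigrams(a), char_bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings have fewer than two characters: fall back to exact match.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total


def correct_token(token: str, lexicon: list, threshold: float = 0.5) -> str:
    """Return the lexicon word most similar to `token`.

    The token is returned unchanged if it is already in the lexicon or if no
    candidate reaches `threshold` (an assumed cutoff, not taken from the paper).
    """
    if token.lower() in lexicon:
        return token
    best_word, best_score = token, 0.0
    for candidate in lexicon:
        score = dice_coefficient(token, candidate)
        if score > best_score:
            best_word, best_score = candidate, score
    return best_word if best_score >= threshold else token


# Hypothetical lexicon; a real system would use a full Vietnamese word list.
LEXICON = ["không", "người", "việt", "nam"]

if __name__ == "__main__":
    print(correct_token("khôngg", LEXICON))
    print(correct_token("xyz", LEXICON))
```

Here a misspelled token such as "khôngg" shares four of its five bigrams with "không" and is mapped to it, while a token with no bigram overlap with any lexicon entry is left as-is. A threshold keeps the corrector from rewriting named entities and other out-of-lexicon words, which matters when the output feeds an NER model.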
