
An empirical study on POS tagging for Vietnamese social media text



Abstract

Part-of-speech (POS) tagging is a fundamental task in natural language processing (NLP). A robust POS tagger plays an important role in most NLP problems and applications, including syntactic parsing, semantic parsing, machine translation, and question answering. Although many efficient POS taggers have been developed for general, conventional text, little work has been done on social media text. In this paper, we present an empirical study on POS tagging for Vietnamese social media text, which poses several challenges compared with tagging general text. Social media text does not always conform to formal grammar and correct spelling, and it frequently uses abbreviations, foreign words, and emoticons. A POS tagger developed for conventional text would perform poorly on such noisy data. We address this problem by proposing a tagging model based on Conditional Random Fields (CRFs) with various kinds of features for Vietnamese social media text. We also investigate the effect of features extracted from word clusters, obtained with Brown clustering and canonical correlation analysis (CCA) based clustering, in semi-supervised settings. We introduce an annotated corpus for POS tagging, which consists of more than four thousand sentences from Facebook, the most popular social network in Vietnam. Using this corpus, we performed a series of experiments to evaluate the proposed model. Our model achieved 88.26% and 88.92% tagging accuracy in the supervised and semi-supervised scenarios, respectively, an improvement of nearly 12% over vnTagger, a state-of-the-art and the most widely used Vietnamese POS tagger developed for general, conventional text. In addition, the semi-supervised model outperformed, in terms of accuracy, the version of vnTagger trained on the same Facebook dataset, showing the usefulness of word cluster features.¹

¹ This paper is an improved and extended version of.
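The abstract mentions CRF features for noisy social media tokens and features derived from Brown word clusters, but it does not list them. The following is a minimal sketch of what per-token feature extraction of this kind might look like, not the authors' feature set. The function names, the emoticon pattern, the sample tokens, and the cluster bit strings in `BROWN_PATHS` are all hypothetical and chosen only for illustration; real cluster paths would come from running Brown clustering on unlabeled text.

```python
import re

# Hypothetical word -> Brown cluster bit-string mapping (made-up values).
# In practice these paths are produced by clustering a large unlabeled corpus.
BROWN_PATHS = {
    "ko": "0110",     # informal abbreviation, placed in the same cluster as...
    "không": "0110",  # ...its standard spelling (illustrative assumption)
    "đẹp": "1011",
}

# A deliberately simple emoticon pattern such as ":)", ";-(", "=D".
EMOTICON_RE = re.compile(r"^[:;=][\-^']?[)(DPpOo/\\|]+$")


def token_features(tokens, i, brown_paths=BROWN_PATHS, prefix_lengths=(2, 4)):
    """Return a feature dictionary for the token at position i."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "lower=" + word.lower(): 1.0,
        "is_emoticon": float(bool(EMOTICON_RE.match(word))),
        "has_digit": float(any(c.isdigit() for c in word)),
        "is_title": float(word.istitle()),
    }
    # Context features: neighbouring words, with sentence-boundary markers.
    prev_word = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    feats["prev=" + prev_word] = 1.0
    feats["next=" + next_word] = 1.0
    # Word-cluster features: prefixes of the Brown bit string, so that
    # coarser and finer cluster memberships are both visible to the model.
    path = brown_paths.get(word.lower())
    if path is not None:
        for n in prefix_lengths:
            feats["brown%d=%s" % (n, path[:n])] = 1.0
    return feats


def sentence_features(tokens):
    """Feature dictionaries for every token of a pre-segmented sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    for tok, f in zip(["Ảnh", "đẹp", "ko", ":)"],
                      sentence_features(["Ảnh", "đẹp", "ko", ":)"])):
        print(tok, sorted(f))
```

A list of such dictionaries per sentence is a common input shape for CRF toolkits that accept sparse named features. Sharing cluster prefixes between an abbreviation and its standard form is one way a model trained on few labeled sentences could generalize to spellings it has not seen with a tag.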

Bibliographic record

  • Source
    Computer speech and language | 2018, Issue 7 | pp. 1-15 | 15 pages
  • Author affiliations

    Department of Computer Science, Posts and Telecommunications Institute of Technology, Machine Learning & Applications Lab, Posts and Telecommunications Institute of Technology;

    Department of Computer Science, Posts and Telecommunications Institute of Technology;

    Department of Computer Science, Posts and Telecommunications Institute of Technology, Machine Learning & Applications Lab, Posts and Telecommunications Institute of Technology;

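  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification
  • Keywords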

    Part-of-speech tagging; Social media text; Conditional random fields; Word clustering;

