A HMM POS Tagger for Micro-blogging Type Texts

Abstract

The high volume of communication via micro-blogging type messages has increased the demand for text processing tools customised for this unstructured text genre. Available text processing tools developed on structured texts have been shown to deteriorate significantly when used on unstructured, micro-blogging type texts. In this paper, we present the results of testing an HMM (Hidden Markov Model) based POS (Part-Of-Speech) tagging model customised for unstructured texts. We also evaluated the tagger against published, state-of-the-art CRF-based POS tagging models customised for Tweet messages, using three publicly available Tweet corpora. Finally, we ran cross-validation tests with both taggers by training them on one Tweet corpus and testing them on another. The results show that the CRF-based POS tagger from GATE performed approximately 8% better than the HMM model at the token level; at the sentence level, however, the performances were approximately the same. The cross-validation experiments showed that both taggers' results deteriorated by approximately 25% at the token level and by a massive 80% at the sentence level. A detailed analysis of this deterioration is presented, and the trained HMM model, including the data, has been made available for research purposes. Since HMM training is orders of magnitude faster than CRF training, we conclude that the HMM model, despite trailing by about 8% in token accuracy, is still a viable alternative for real-time applications that demand rapid as well as progressive learning.
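The abstract reports results at both the token level and the sentence level without defining either metric. As a minimal sketch, assuming token accuracy is the fraction of correctly tagged tokens and sentence accuracy is the fraction of messages in which every token is tagged correctly, the two could be computed as follows. The function name `token_and_sentence_accuracy` and the toy tags are illustrative and not taken from the paper.

```python
from typing import List, Tuple

TagSequence = List[str]  # one POS tag per token of a message


def token_and_sentence_accuracy(
    gold: List[TagSequence], pred: List[TagSequence]
) -> Tuple[float, float]:
    """Return (token accuracy, sentence accuracy) for aligned tag sequences.

    Assumed definitions (not given in the abstract):
      token accuracy    = correctly tagged tokens / all tokens
      sentence accuracy = messages with every token correct / all messages
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of messages")

    total_tokens = 0
    correct_tokens = 0
    correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each message needs one predicted tag per token")
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        total_tokens += len(gold_tags)
        correct_tokens += matches
        # A message counts as correct only if all of its tokens match.
        correct_sentences += int(matches == len(gold_tags))

    if total_tokens == 0:
        raise ValueError("no tokens to evaluate")
    return correct_tokens / total_tokens, correct_sentences / len(gold)


# Toy example with made-up tags: two single-token errors in three messages.
gold = [["PRP", "VBP", "NN"], ["UH", "NNP"], ["DT", "NN", "VBZ", "JJ"]]
pred = [["PRP", "VBP", "NN"], ["UH", "NN"], ["DT", "NN", "VBZ", "RB"]]
token_acc, sentence_acc = token_and_sentence_accuracy(gold, pred)
print(f"token accuracy: {token_acc:.3f}, sentence accuracy: {sentence_acc:.3f}")
```

Under these assumed definitions a single wrong tag invalidates the whole message, so sentence accuracy can fall much faster than token accuracy. This is consistent with the large sentence-level drop reported for the cross-corpus tests.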
