The rapid development of social media has produced a large amount of text data. The casualness and informality of social text cause notable performance degradation in traditional natural language processing (NLP) tools. Part-of-speech (POS) tagging is the most fundamental task in the NLP pipeline, and its accuracy on social text directly affects downstream NLP applications. This paper investigates the causes of the performance drop that traditional POS tagging tools exhibit on social text. We quantitatively analyze the tagging results of several taggers with different degrees of adaptation. The experimental results show that the performance drop stems mainly from the high error rate of POS inference for unknown words, and the paper also suggests directions for improving POS tagging on social text.