Journal: Pattern Recognition Letters

Improving Korean verb-verb morphological disambiguation using lexical knowledge from unambiguous unlabeled data and selective web counts

Abstract

This paper addresses verb-verb morphological disambiguation, in which two different verbs share the same inflected form. Verb-verb morphological ambiguity (VVMA) is one of the critical issues in Korean part-of-speech (POS) tagging. Recognizing the base form of an ambiguous verb depends heavily on the lexical information in its surrounding context and on the domain in which it occurs. However, current probabilistic morpheme-based POS tagging systems cannot handle VVMA adequately: most of them are limited in their ability to reflect broad word-level context, and they are trained on too little labeled data to represent the lexical information required for VVMA disambiguation. In this study, we propose a classifier based on a large pool of raw text that contains sufficient lexical information to handle VVMA. The underlying idea is to automatically generate an annotated training set for an ambiguity problem such as VVMA resolution from unlabeled, unambiguous instances that belong to the same class. This makes it possible to label ambiguous instances with knowledge induced from unambiguous ones. Because an unambiguous instance has only one possible label, an annotated corpus of such instances can be generated automatically from unlabeled data. In our problem, not all conjugations of irregular verbs lead to the spelling changes that cause VVMA, so training data for VVMA disambiguation are generated from instances of the unambiguous conjugations of each possible verb base form of an ambiguous word. This approach requires neither an additional annotation process for an initial training set nor a process for selecting good seeds to iteratively augment the labeled set, both of which are important issues in bootstrapping methods that use unlabeled data. This is a strength relative to previous related work using unlabeled data. Furthermore, the approach ensures plenty of confident seeds that are unambiguous and provide enough coverage for the learning process. We also propose a strategy that incrementally extends the context information with web counts, applied only to selected test examples that are difficult to predict with the current classifier or that differ substantially from the pre-trained data set. As a result, automatic data generation and knowledge acquisition from unlabeled text for VVMA resolution improved overall token-level tagging accuracy by 0.04%. In practice, 9-10% of verb-related tagging errors are fixed by VVMA resolution, whose accuracy was about 98% using a Naive Bayes classifier coupled with selective web counts.
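The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, the smoothing, the rule for selecting difficult examples, or how web counts are obtained. The sketch assumes bag-of-words context features, add-one smoothing, a posterior-probability threshold as the selection rule, and a caller-supplied `web_count` function standing in for a web hit-count lookup. The base-form labels, context words, and all function names are hypothetical placeholders.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_contexts):
    """Train on (base_form, context_words) pairs.

    In the setting sketched here, each pair would come from an unambiguous
    conjugation of a candidate base form, so the label is known for free.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for base, words in labeled_contexts:
        class_counts[base] += 1
        word_counts[base].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def posteriors(model, words):
    """Return normalized posterior probabilities over base forms."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for base, n in class_counts.items():
        denom = sum(word_counts[base].values()) + len(vocab)
        score = math.log(n / total)
        for w in words:
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[base][w] + 1) / denom)
        log_scores[base] = score
    top = max(log_scores.values())
    exp = {b: math.exp(s - top) for b, s in log_scores.items()}
    z = sum(exp.values())
    return {b: v / z for b, v in exp.items()}

def disambiguate(model, words, threshold=0.8, web_count=None):
    """Pick a base form; consult web counts only for low-confidence cases.

    `threshold` and `web_count(base, word) -> int` are assumptions of this
    sketch, not details taken from the paper.
    """
    probs = posteriors(model, words)
    best = max(probs, key=probs.get)
    if probs[best] >= threshold or web_count is None:
        return best, "classifier"
    counts = {b: sum(web_count(b, w) for w in words) for b in probs}
    if sum(counts.values()) == 0:
        return best, "classifier"
    return max(counts, key=counts.get), "web_counts"

# Toy training data with placeholder labels and context words.
train = [
    ("base_A", ["road", "car"]),
    ("base_A", ["road", "walk"]),
    ("base_B", ["song", "voice"]),
    ("base_B", ["song", "stage"]),
]
model = train_naive_bayes(train)

print(disambiguate(model, ["road", "car"]))
print(disambiguate(model, ["unseen"],
                   web_count=lambda b, w: 5 if b == "base_B" else 1))
```

The first call is resolved by the classifier alone because the context words were seen in training. The second has no known context words, so the posterior is uniform and the decision falls back to the stubbed counts. This mirrors the idea of spending web lookups only on examples the current classifier cannot predict confidently.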