Journal: International Journal of Computer Processing of Languages

Annotation and Classification of Three-Character Chinese Synthetic Words



Abstract

The lack of internal structural information for Chinese synthetic words has become a critical problem for Chinese morphological analysis systems, which must accommodate the differing segmentation standards required by higher-level NLP applications under development. In this paper, we first define the conceptual difference between Chinese single-morpheme words and Chinese synthetic words. We then divide Chinese synthetic words into two types, compound words and morphologically derived words, according to their internal syntactic and morphological structure, and classify them into more specific categories. After surveying three-character Chinese synthetic words against these categories, we propose a tree-based analysis method to represent the internal information of such words. Next, we apply machine learning methods to a large corpus to automatically identify the internal morphological structure of three-character synthetic words and to add syntactic tags to that structure. We believe that tree-based word-internal information is useful for specifying a segmentation standard for Chinese synthetic words, and that it can help improve morphological analysis and out-of-vocabulary (OOV) word detection in Chinese text.
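To make the idea of a tree over a word's characters concrete, the following is a minimal sketch and not the authors' implementation. It assumes binary branching, so a three-character word has exactly two candidate bracketings, [[A B] C] and [A [B C]]. The names `Leaf`, `Node`, `bracketings`, and `show`, and the placeholder label `"X"`, are hypothetical; the abstract does not give the paper's actual tag set or data structures.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Leaf:
    char: str  # a single character of the word


@dataclass(frozen=True)
class Node:
    label: str  # hypothetical structural tag; the paper's real tags are not given here
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def bracketings(word: str, label: str = "X") -> List[Node]:
    """Return the two binary trees that can span a three-character word."""
    if len(word) != 3:
        raise ValueError("expected exactly three characters")
    a, b, c = (Leaf(ch) for ch in word)
    return [
        Node(label, Node(label, a, b), c),  # [[A B] C]
        Node(label, a, Node(label, b, c)),  # [A [B C]]
    ]


def show(tree: Tree) -> str:
    """Render a tree as a bracketed string."""
    if isinstance(tree, Leaf):
        return tree.char
    return f"[{show(tree.left)} {show(tree.right)}]"


if __name__ == "__main__":
    for candidate in bracketings("图书馆"):
        print(show(candidate))
```

For 图书馆 ("library"), the first candidate, [[图 书] 馆], groups 图书 ("books") before attaching 馆 ("building"). Choosing between the candidates and assigning meaningful labels is the step the abstract assigns to machine learning over a large corpus; this sketch only enumerates the search space.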
