Published in: CCF International Conference on Natural Language Processing and Chinese Computing

A Word Segmentation Method of Ancient Chinese Based on Word Alignment

Abstract

Since no public tagged corpora are available for ancient Chinese word segmentation (CWS), state-of-the-art CWS methods cannot be applied to ancient Chinese. To address this problem, this paper proposes a word segmentation method based on word alignment (WSWA). Specifically, the method segments words according to the word alignment between modern Chinese words and ancient Chinese characters: if multiple consecutive characters in ancient Chinese align to the same modern Chinese word, they are treated as one word. Because many modern Chinese words are derived from ancient Chinese, the method also exploits the characters that co-occur in modern and ancient Chinese to extract words for CWS. Moreover, to reduce the effect of alignment errors, the method removes word alignments that are likely to cause CWS errors. We quantitatively analyze the effects of modern CWS and word alignment on the WSWA method using hand-annotated corpora. Our method outperforms the state-of-the-art methods on the WSA experiment on Shiji by a large margin, which demonstrates the effectiveness of using word alignment to perform ancient CWS.
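
The core merging rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `segment_by_alignment`, the representation of the alignment as one modern-word index per ancient character (with `None` for an unaligned character), and the toy sentence are all assumptions made for illustration. The co-occurring-character step and the removal of error-prone alignments are not modeled.

```python
from typing import List, Optional


def segment_by_alignment(chars: List[str],
                         align: List[Optional[int]]) -> List[str]:
    """Group consecutive ancient characters that align to the same modern word.

    chars: ancient Chinese characters, in sentence order.
    align: for each character, the index of the modern Chinese word it is
           aligned to, or None if it is unaligned (assumed representation).
    """
    if len(chars) != len(align):
        raise ValueError("chars and align must have the same length")

    words: List[str] = []
    prev: Optional[int] = None
    for ch, idx in zip(chars, align):
        # Extend the current word only when this character and the previous
        # one are both aligned, and to the same modern word.
        if words and idx is not None and idx == prev:
            words[-1] += ch
        else:
            words.append(ch)
        prev = idx
    return words


# Hand-made toy alignment (hypothetical, not taken from the paper):
# modern words indexed 0..3, e.g. 0 = "沛公", 1 = "驻军", 2 = "在", 3 = "霸上".
ancient = list("沛公军霸上")
alignment = [0, 0, 1, 3, 3]
segmented = segment_by_alignment(ancient, alignment)
print(segmented)
```

In this sketch an unaligned character always becomes a single-character word; whether the actual method treats unaligned characters this way is not stated in the abstract.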
