Improving Cross-Domain Chinese Word Segmentation with Word Embeddings

Abstract

Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite recent progress in neural CWS. The limited amount of annotated data in the target domain has been the key obstacle to satisfactory performance. In this paper, we propose a semi-supervised word-based approach to improving cross-domain CWS given a baseline segmenter. In particular, our model uses only word embeddings trained on raw text in the target domain, discarding complex hand-crafted features and domain-specific dictionaries. We propose new subsampling and negative sampling methods to derive word embeddings optimized for CWS. We conduct experiments on five domain-specific datasets covering novels, medicine, and patents. Results show that our model clearly improves cross-domain CWS, especially in the segmentation of domain-specific noun entities. The word F-measure increases by over 3.0% on four datasets, outperforming state-of-the-art semi-supervised and unsupervised cross-domain CWS approaches by a large margin. We make our code and data available on GitHub.
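
The word F-measure reported above is conventionally computed by matching predicted words against gold words as character spans. The following is a minimal sketch of that standard metric for a single sentence, not the authors' evaluation script; the function names `to_spans` and `word_f_measure` and the example sentence are assumptions made for illustration.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for w in words:
        end = start + len(w)
        spans.add((start, end))
        start = end
    return spans


def word_f_measure(gold_words, pred_words):
    """Return (precision, recall, F1) over word spans for one sentence.

    Hypothetical helper: a predicted word counts as correct only if both
    of its boundaries match a gold word exactly.
    """
    if "".join(gold_words) != "".join(pred_words):
        raise ValueError("gold and predicted segmentations must cover the same text")
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Illustrative example: a domain-specific noun ("阿司匹林", aspirin) is split
# incorrectly, which lowers both precision and recall.
gold = ["阿司匹林", "可以", "缓解", "头痛"]
pred = ["阿司", "匹林", "可以", "缓解", "头痛"]
p, r, f = word_f_measure(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

In this example three of the five predicted words match gold spans, so a single over-segmented entity costs two precision errors and one recall error. This is why errors on domain-specific nouns weigh heavily on the word F-measure.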
