
Improving Cross-Domain Chinese Word Segmentation with Word Embeddings



Abstract

Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite recent progress in neural CWS. The limited amount of annotated data in the target domain has been the key obstacle to satisfactory performance. In this paper, we propose a semi-supervised, word-based approach to improving cross-domain CWS given a baseline segmenter. In particular, our model uses only word embeddings trained on raw text in the target domain, discarding complex hand-crafted features and domain-specific dictionaries. We propose new subsampling and negative sampling methods to derive word embeddings optimized for CWS. We conduct experiments on five datasets in specialized domains, covering novels, medicine, and patents. Results show that our model clearly improves cross-domain CWS, especially in the segmentation of domain-specific noun entities. The word F-measure increases by over 3.0% on four datasets, outperforming state-of-the-art semi-supervised and unsupervised cross-domain CWS approaches by a large margin. We make our code and data available on GitHub.
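The abstract reports results as word F-measure but does not show how it is computed. As a rough illustration, the sketch below uses the common span-based definition: a predicted word counts as correct only if both of its boundaries match a gold word. The function names `word_spans` and `word_f_measure` and the example sentence are assumptions made for illustration. They are not taken from the authors' released code, whose evaluation script may differ in details.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        end = start + len(w)
        spans.add((start, end))
        start = end
    return spans


def word_f_measure(gold, pred):
    """Return (precision, recall, F1) over word spans for one sentence.

    Hypothetical helper: assumes both segmentations cover the same text.
    """
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and predicted segmentations must cover the same text")
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)  # words with both boundaries right
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Illustrative sentence (not from the paper's datasets).
gold = ["我们", "提出", "新", "方法"]
pred = ["我们", "提出", "新方法"]
precision, recall, f1 = word_f_measure(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```

In this example the segmenter merges the last two gold words, so two of its three predicted words are correct (precision 2/3) and two of the four gold words are recovered (recall 1/2). Corpus-level scores are usually obtained by summing the counts over all sentences before computing precision and recall, rather than by averaging per-sentence F-measures.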
