Improving Cross-Domain Chinese Word Segmentation with Word Embeddings

Abstract

Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite recent progress in neural CWS. The limited amount of annotated data in the target domain has been the key obstacle to satisfactory performance. In this paper, we propose a semi-supervised word-based approach to improving cross-domain CWS given a baseline segmenter. In particular, our model uses only word embeddings trained on raw text in the target domain, discarding complex hand-crafted features and domain-specific dictionaries. Innovative subsampling and negative sampling methods are proposed to derive word embeddings optimized for CWS. We conduct experiments on five datasets in special domains, covering novels, medicine, and patents. Results show that our model clearly improves cross-domain CWS, especially in the segmentation of domain-specific noun entities. The word F-measure increases by over 3.0% on four datasets, outperforming state-of-the-art semi-supervised and unsupervised cross-domain CWS approaches by a large margin. We make our code and data available on GitHub.
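The word F-measure reported in the abstract is conventionally computed by comparing the character spans of predicted words against those of gold-standard words. The following is a minimal sketch of that standard metric, not the authors' evaluation script; the function names `to_spans` and `word_f_measure` and the example sentence are assumptions made for illustration.

```python
def to_spans(words):
    """Convert a segmented sentence (list of words) into a set of
    (start, end) character offsets, one span per word."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def word_f_measure(gold_words, pred_words):
    """Word-level F-measure: a predicted word counts as correct only if
    both of its boundaries match a word in the gold segmentation."""
    gold_spans = to_spans(gold_words)
    pred_spans = to_spans(pred_words)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction wrongly splits one gold word in two.
gold = ["我", "喜欢", "自然语言", "处理"]
pred = ["我", "喜欢", "自然", "语言", "处理"]
score = word_f_measure(gold, pred)
print(round(score, 4))
```

In this example three of the five predicted words match gold spans, giving a precision of 0.6 and a recall of 0.75. Corpus-level scores are normally obtained by summing the counts over all sentences before computing precision and recall, rather than by averaging per-sentence scores.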


Bibliographic details

Source
Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies | 2019 | xciii p. 2102-2798 | 10 pages
Authors
Yuxiao Ye; Yue Zhang; Weikang Li; Likun Qiu; Jian Sun
Original format: PDF
Chinese Library Classification: Programming, software engineering

