Improving Cross-Domain Chinese Word Segmentation with Word Embeddings

Abstract

Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite recent progress in neural CWS. The limited amount of annotated data in the target domain has been the key obstacle to satisfactory performance. In this paper, we propose a semi-supervised word-based approach to improving cross-domain CWS given a baseline segmenter. In particular, our model uses only word embeddings trained on raw text in the target domain, discarding complex hand-crafted features and domain-specific dictionaries. Innovative subsampling and negative sampling methods are proposed to derive word embeddings optimized for CWS. We conduct experiments on five datasets in special domains, covering novels, medicine, and patents. Results show that our model clearly improves cross-domain CWS, especially in the segmentation of domain-specific noun entities. The word F-measure increases by over 3.0% on four datasets, outperforming state-of-the-art semi-supervised and unsupervised cross-domain CWS approaches by a large margin. We make our code and data available on GitHub.

Bibliographic details

Source
Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies | 2019 | pp. 2726-2735 | 10 pages
Authors
Yuxiao Ye; Yue Zhang; Weikang Li; Likun Qiu; Jian Sun
Original format: PDF

Similar literature

1. Automatic Extraction Of New Words Based On Google News Corpora For Supporting Lexicon-based Chinese Word Segmentation Systems [J]. Chin-Ming Hong, Chih-Ming Chen, Chao-Yang Chiu. Expert Systems with Applications. 2009, Issue 2p2
2. A Chinese word segmentation based on language situation in processing ambiguous words [J]. Zhang MY, Lu ZD, Zou CY. Information Sciences: An International Journal. 2004, Issue 3a4
3. Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings [J]. Herman Kamper, Aren Jansen, Sharon Goldwater. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2016, Issue 4
4. Improving Cross-Domain Chinese Word Segmentation with Word Embeddings [C]. Yuxiao Ye, Yue Zhang, Weikang Li, et al. Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019
5. Improved GloVe Word Embedding Using Linear Weighting Scheme for Word Similarity Tasks [D]. Lu, Qinglan. 2021
6. BioWordVec improving biomedical word embeddings with subword information and MeSH [O]. Yijia Zhang, Qingyu Chen, Zhihao Yang, et al. 2019
7. Pruning False Unknown Words to Improve Chinese Word Segmentation [O]. Goh Chooi-Ling, 浅原正幸, 松本裕治. 2005
