Detecting New Words from Chinese Text Using Latent Semi-CRF Models

Xiao SUN; Degen HUANG; Fuji REN


Abstract

New words and their part-of-speech (POS) tags are particularly problematic in Chinese natural language processing. With the rapid development of the internet and information technology, it is impossible to build a complete system dictionary for Chinese natural language processing, because new words outside the basic system dictionary are constantly being created. A latent semi-CRF model, which combines the strengths of the LDCRF (Latent-Dynamic Conditional Random Field) and the semi-CRF, is proposed to detect new words and their POS tags simultaneously from Chinese text that has not been pre-segmented, regardless of the type of new word. Unlike in the original semi-CRF, an LDCRF is used to generate the candidate entities for training and testing the latent semi-CRF, which speeds up training and reduces computation cost. The complexity of the latent semi-CRF can be further adjusted by tuning the number of hidden variables in the LDCRF and the number of candidate entities taken from the N-best outputs of the LDCRF. A new-word-generating framework is proposed for model training and testing, under which the definitions and distributions of the new words conform to those found in real text. Specific features called “Global Fragment Information” are adopted for new word detection and POS tagging in model training and testing. The experimental results show that the proposed method can detect even low-frequency new words together with their POS tags. The proposed model performs competitively with the state-of-the-art models presented.
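The two-stage idea in the abstract (a first model proposes candidate entities from its N-best outputs, and a segment-level model then chooses among them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the N-best list is hard-coded rather than produced by an LDCRF, the segment scores are hand-set weights rather than learned features such as “Global Fragment Information”, label-to-label transitions and latent variables are omitted, and all names (`candidate_segments`, `semi_markov_decode`, the POS labels) are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# A segment is (start, end_exclusive, pos_label) over character positions.
Segment = Tuple[int, int, str]


def candidate_segments(nbest: List[List[Segment]]) -> Set[Segment]:
    """Pool every segment that appears in any of the N-best segmentations."""
    return {seg for path in nbest for seg in path}


def semi_markov_decode(
    n: int,
    candidates: Set[Segment],
    score: Callable[[Segment], float],
) -> List[Segment]:
    """Find the highest-scoring segmentation of positions 0..n.

    Only segments in `candidates` are considered, so the search space is
    bounded by the candidate pool instead of all possible spans.
    """
    best = [float("-inf")] * (n + 1)
    back: List[Optional[Segment]] = [None] * (n + 1)
    best[0] = 0.0

    # Index candidates by their end position for the forward pass.
    by_end: Dict[int, List[Segment]] = {}
    for seg in candidates:
        by_end.setdefault(seg[1], []).append(seg)

    for end in range(1, n + 1):
        for seg in by_end.get(end, []):
            start = seg[0]
            if best[start] == float("-inf"):
                continue  # no valid segmentation reaches `start`
            total = best[start] + score(seg)
            if total > best[end]:
                best[end] = total
                back[end] = seg

    if back[n] is None:
        raise ValueError("candidates do not cover the whole sentence")

    # Follow back-pointers from the end of the sentence to the start.
    path: List[Segment] = []
    pos = n
    while pos > 0:
        seg = back[pos]
        path.append(seg)
        pos = seg[0]
    return path[::-1]


# Toy sentence of five characters: 我 喜 欢 微 博
text = "我喜欢微博"

# Hypothetical 2-best output of a first-stage tagger.
nbest = [
    [(0, 1, "r"), (1, 3, "v"), (3, 4, "a"), (4, 5, "n")],  # 微 / 博 split
    [(0, 1, "r"), (1, 3, "v"), (3, 5, "n")],               # 微博 as one word
]

# Hand-set segment weights standing in for a trained segment-level model.
weights = {(3, 5, "n"): 2.0, (3, 4, "a"): 0.5, (4, 5, "n"): 0.5}

candidates = candidate_segments(nbest)
decoded = semi_markov_decode(len(text), candidates, lambda s: weights.get(s, 1.0))
print([(text[s:e], label) for s, e, label in decoded])
```

In this sketch the decoder only visits segments from the candidate pool, which shows in miniature how limiting the candidates from the N-best list bounds the cost of segment-level decoding. A larger N would enlarge the pool and the search.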


Bibliographic details

Source: IEICE Transactions on Information and Systems, 2010, No. 6, 8 pages
Authors: Xiao SUN; Degen HUANG; Fuji REN
Original format: PDF
Chinese Library Classification: Radio electronics, telecommunication technology

