Detecting new Chinese words from massive domain texts with word embedding

Qian Yu; Du Yang; Deng Xiongwen; Ma Baojun; Ye Qiongwei; Yuan Hua

Abstract

Textual information retrieval (TIR) is based on the relationship between word units. Traditional word segmentation techniques attempt to discern the word units accurately from texts; however, they are unable to appropriately and efficiently identify all new words. Identification of new words, especially in languages such as Chinese, remains a challenge. In recent years, word embedding methods have used numerical word vectors to retain the semantic and correlated information between words in a corpus. In this article, we propose the word-embedding-based method (WEBM), a novel method that combines word embedding and frequent n-gram string mining for discovering new words from domain corpora. First, we mapped all word units in a domain corpus to a high-dimension word vector space. Second, we used a frequent n-gram word string mining method to identify a set of candidates for new words. We designed a pruning strategy based on the word vectors to quantify the possibility of a word string being a new word, thereby allowing the evaluation of candidates based on the similarity of word units in the same string. In a comparative study, our experimental results revealed that WEBM had a great advantage in detecting new words from massive Chinese corpora.
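The pipeline outlined in the abstract (embed word units, mine frequent n-gram word strings, prune candidates by the similarity of their units) can be illustrated with a minimal sketch. The abstract gives neither the scoring formula nor any thresholds, so everything concrete here is an assumption made for illustration: the function names, the mean adjacent-unit cosine score, the `min_count` and `min_sim` values, and the hand-made two-dimensional toy vectors standing in for trained embeddings. It should not be read as the authors' WEBM implementation.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def frequent_ngrams(sentences, max_n=3, min_count=2):
    """Count word-unit n-grams (2 <= n <= max_n) and keep the frequent ones."""
    counts = Counter()
    for units in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(units) - n + 1):
                counts[tuple(units[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def score(gram, vectors):
    """Assumed score: mean cosine similarity of adjacent units in the string."""
    sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
    return sum(sims) / len(sims)


def detect_new_words(sentences, vectors, max_n=3, min_count=2, min_sim=0.8):
    """Return frequent word strings whose units are mutually similar."""
    kept = []
    for gram in frequent_ngrams(sentences, max_n, min_count):
        # Skip candidates containing a unit with no vector, then prune by score.
        if all(u in vectors for u in gram) and score(gram, vectors) >= min_sim:
            kept.append("".join(gram))
    return sorted(kept)


# Toy corpus, already split into word units by a hypothetical segmenter
# that wrongly breaks the domain term "区块链" into "区块" + "链".
sentences = [
    ["区块", "链", "技术"],
    ["区块", "链", "的", "应用"],
    ["学习", "区块", "链"],
    ["技术", "的", "应用"],
]

# Hand-made 2-d vectors (assumption); real vectors would come from a
# word embedding model trained on the domain corpus.
vectors = {
    "区块": [0.9, 0.1],
    "链": [0.8, 0.2],
    "的": [0.0, 1.0],
    "应用": [1.0, 0.0],
}

print(detect_new_words(sentences, vectors))  # ['区块链']
```

In this toy run both "区块 链" and "的 应用" pass the frequency filter, but only the first survives pruning because its units have nearly parallel vectors. That illustrates why a similarity-based filter can discard frequent strings that are merely common collocations.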

Bibliographic details

Source
Journal of Information Science, 2019, Issue 2, pp. 196-211 (16 pages)
Authors
Qian Yu; Du Yang; Deng Xiongwen; Ma Baojun; Ye Qiongwei; Yuan Hua
Author affiliations

Univ Elect Sci & Technol China, Sch Management & Econ, Chengdu, Sichuan, Peoples R China;

Univ Elect Sci & Technol China, Sch Management & Econ, Chengdu, Sichuan, Peoples R China;

Univ Elect Sci & Technol China, Sch Management & Econ, Chengdu, Sichuan, Peoples R China;

Beijing Univ Posts & Telecommun, Res Ctr Big Data Management & Intelligent Decis, Sch Econ & Management, Beijing 100876, Peoples R China;

Yunnan Univ Finance & Econ, Sch Business, Kunming 650221, Yunnan, Peoples R China|Tsinghua Univ, Sch Econ & Management, Beijing, Peoples R China;

Univ Elect Sci & Technol China, Sch Management & Econ, Chengdu, Sichuan, Peoples R China;

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords
Natural language processing; new word detection; similarity measurement; textual information retrieval; word embedding
