A Word Vector Representation Based Method for New Words Discovery in Massive Text

Abstract

The discovery of new words is of great significance to natural language processing for Chinese. In recent years, training the words in a corpus into word vector representations with a neural network model has performed well at capturing the semantic relationships among words. Accordingly, this work introduces word vector representations into the discovery of new words in Chinese text. We propose a new unsupervised method for discovering new words, based on the n-gram method. We first train the words in the corpus into a word vector space, and then combine some elements in the corpus as candidates for new words. Finally, noise candidates are dropped based on the similarity between the two elements in the word vector space. Compared with classical unsupervised methods such as mutual information and adjacent entropy, the experimental results show that the proposed method has a clear performance advantage in discovering new words.
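The pipeline described in the abstract (form n-gram candidates, then drop noisy ones by the similarity of their two elements in the vector space) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the frequency cutoff `min_count`, the cosine threshold, the restriction to bigrams, and the hand-written toy vectors are all hypothetical, and the abstract does not say which similarity measure or filtering direction the paper uses. In practice the vectors would come from a trained embedding model rather than a hard-coded dictionary.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bigram_candidates(sentences, min_count=2):
    """Collect adjacent element pairs that occur at least min_count times."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return [pair for pair, count in counts.items() if count >= min_count]


def filter_candidates(candidates, vectors, threshold=0.5):
    """Keep candidates whose two elements are similar in the vector space.

    Assumption: a high cosine similarity marks a plausible new word; the
    abstract only states that similarity is the filtering criterion.
    """
    kept = []
    for left, right in candidates:
        if left in vectors and right in vectors:
            if cosine(vectors[left], vectors[right]) >= threshold:
                kept.append(left + right)
    return kept


# Toy, pre-segmented corpus (hypothetical data).
sentences = [
    ["微", "信", "很", "好"],
    ["微", "信", "很", "快"],
    ["他", "很", "好"],
]

# Hand-written 2-d vectors standing in for trained embeddings (hypothetical).
toy_vectors = {
    "微": [1.0, 0.1],
    "信": [0.9, 0.2],
    "很": [0.0, 1.0],
    "好": [1.0, 0.0],
    "快": [0.8, 0.3],
    "他": [0.2, 0.5],
}

candidates = bigram_candidates(sentences)
new_words = filter_candidates(candidates, toy_vectors)
print(new_words)
```

In this toy run, frequent but semantically unrelated pairs such as ("很", "好") pass the frequency cutoff but are dropped by the similarity filter, which is the role the abstract assigns to the vector space.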

Bibliographic details

Source
Natural language understanding and intelligent applications | 2016 | pp. 76-88 | 13 pages
Conference location: Kunming (CN)
Authors
Yang Du; Hua Yuan; Yu Qian;
Author affiliations

School of Management and Economics, University of Electronic Science and Technology of China, Chengdu 611731, China;

Conference organizer
Original format: PDF
Language: English
Chinese Library Classification
Keywords
Word embedding; New words discovery; Semantic relationship; n-gram;

Similar literature

1. Improving short text classification by learning vector representations of both words and hidden topics [J]. Zhang Heng, Zhong Guoqiang. Knowledge-Based Systems. 2016, issue juna15
2. Short Text Classification Based on Distributional Representations of Words [J]. Chenglong MA, Qingwei ZHAO, Jielin PAN. IEICE transactions on information and systems. 2016, issue 10
3. Detecting new Chinese words from massive domain texts with word embedding [J]. Qian Yu, Du Yang, Deng Xiongwen. Journal of Information Science. 2019, issue 2
4. A Word Vector Representation Based Method for New Words Discovery in Massive Text [C]. Yang Du, Hua Yuan, Yu Qian. International conference on computer processing of oriental languages. 2016
5. An exploration of the word2vec algorithm: Creating a vector representation of a language vocabulary that encodes meaning and usage patterns in the vector space structure [D]. Le, Thu Anh. 2016
6. The Fractal Patterns of Words in a Text: A Method for Automatic Keyword Extraction [O]. Elham Najafi, Amir H. Darooneh
7. Vector Representation of Words for Detecting Topic Trends over Short Texts [O]. Liyan He, Yajun Du, Lei Zhang. 2018
8. Pictures from Words, Pictures from Text: Constructing Pictorial Representations of Meaning from Text [R]. Cowie, J., Helmreich, S., Dang, H. H. 2009
