
Detecting new Chinese words from massive domain texts with word embedding



Abstract

Textual information retrieval (TIR) is based on the relationships between word units. Traditional word segmentation techniques attempt to discern word units accurately from text; however, they cannot identify all new words appropriately and efficiently. Identifying new words, especially in languages such as Chinese, remains a challenge. In recent years, word embedding methods have used numerical word vectors to retain the semantic and correlation information between words in a corpus. In this article, we propose the word-embedding-based method (WEBM), a novel method that combines word embedding with frequent n-gram string mining to discover new words in domain corpora. First, we map all word units in a domain corpus to a high-dimensional word vector space. Second, we use a frequent n-gram word string mining method to identify a set of new-word candidates. We design a pruning strategy based on the word vectors to quantify the likelihood that a word string is a new word, so that candidates are evaluated by the similarity of the word units within the same string. In a comparative study, our experimental results show that WEBM has a clear advantage in detecting new words in massive Chinese corpora.
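
The two steps described in the abstract (frequent n-gram mining over word units, then pruning candidates by the similarity of their word vectors) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the exact pruning rule, so the scoring function (mean cosine similarity of adjacent word units), the thresholds `min_count` and `min_sim`, the function names, and the hand-written two-dimensional toy vectors are all assumptions. In practice the vectors would come from an embedding model trained on the domain corpus.

```python
from collections import Counter
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def frequent_ngrams(sentences, max_n=3, min_count=2):
    """Count word-unit n-grams (n = 2..max_n) and keep the frequent ones."""
    counts = Counter()
    for tokens in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def cohesion(gram, vectors):
    """Assumed score: mean cosine similarity of adjacent word units."""
    sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
    return sum(sims) / len(sims)


def detect_new_words(sentences, vectors, max_n=3, min_count=2, min_sim=0.8):
    """Mine frequent n-grams, then prune those with low vector cohesion."""
    candidates = frequent_ngrams(sentences, max_n, min_count)
    return sorted(
        "".join(gram)
        for gram in candidates
        if all(unit in vectors for unit in gram)
        and cohesion(gram, vectors) >= min_sim
    )


# Toy corpus, already split into word units by a segmenter (hypothetical data).
sentences = [
    ["区块", "链", "技术"],
    ["区块", "链", "应用"],
    ["研究", "区块", "链"],
    ["新", "的", "技术"],
    ["旧", "的", "技术"],
]

# Hand-written toy vectors standing in for trained word embeddings.
vectors = {
    "区块": [1.0, 0.1],
    "链": [0.9, 0.2],
    "技术": [0.1, 1.0],
    "的": [-1.0, 0.0],
}

print(detect_new_words(sentences, vectors))  # prints ['区块链']
```

In this toy run both ("区块", "链") and ("的", "技术") pass the frequency filter, but only the first survives pruning because its word units have nearly parallel vectors. Scoring adjacent pairs is one simple choice; averaging over all pairs in the string would be an equally plausible reading of the abstract.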

Bibliographic record

  • Source
    Journal of Information Science | 2019, Issue 2 | pp. 196-211 | 16 pages
  • Author affiliations

    Univ Elect Sci & Technol China, Sch Management & Econ, Chengdu, Sichuan, Peoples R China;

    Univ Elect Sci & Technol China, Sch Management & Econ, Chengdu, Sichuan, Peoples R China;

    Univ Elect Sci & Technol China, Sch Management & Econ, Chengdu, Sichuan, Peoples R China;

    Beijing Univ Posts & Telecommun, Res Ctr Big Data Management & Intelligent Decis, Sch Econ & Management, Beijing 100876, Peoples R China;

    Yunnan Univ Finance & Econ, Sch Business, Kunming 650221, Yunnan, Peoples R China|Tsinghua Univ, Sch Econ & Management, Beijing, Peoples R China;

    Univ Elect Sci & Technol China, Sch Management & Econ, Chengdu, Sichuan, Peoples R China;

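  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification:
  • Keywords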

    Natural language processing; new word detection; similarity measurement; textual information retrieval; word embedding;

