Intelligent Data Analysis

Learning of indiscriminate distributions of document embeddings for domain adaptation



Abstract

Natural language processing (NLP) is an important application area for domain adaptation because the properties of a text depend on its corpus. However, textual inputs are not inherently numerical vectors, and many domain adaptation methods for NLP operate on numerical representations of texts rather than on the texts themselves. Thus, we develop a method for learning distributed representations of words and documents for domain adaptation. The developed method addresses the domain separation problem of document embeddings from different domains, that is, the problem that the supports of the embeddings are separable across domains and their distributions can be discriminated. We propose a new method based on negative sampling. The proposed method learns document embeddings by assuming that the noise distribution depends on the domain. It moves a document embedding close to the embeddings of the important words in the document and keeps it away from the embeddings of words that occur frequently in both domains. On Amazon reviews, we verified that the proposed method outperformed other representation methods in terms of the indiscriminability of the distributions of the document embeddings, through experiments such as visualizing the embeddings and computing a proxy A-distance measure. We also performed sentiment classification tasks to validate the effectiveness of the document embeddings; the proposed method achieved consistently better results than the other methods. In addition, we applied the learned document embeddings to the domain adversarial neural network method, a popular deep learning-based domain adaptation model. The proposed method obtained not only better performance on most datasets but also more stable convergence on all datasets than the other methods. Therefore, the proposed method is applicable to other domain adaptation methods for NLP that use numerical representations of documents or words.
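The core update described in the abstract can be illustrated with a standard negative-sampling objective. The following is a minimal sketch, not the authors' implementation: the function names (`sgns_step`, `sgns_loss`) are hypothetical, and the negative vectors stand in for samples drawn from the paper's domain-dependent noise distribution (words frequent in both domains), which is not specified in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(doc_vec, pos_vec, neg_vecs):
    """Negative-sampling loss: -log sigma(d.p) - sum log sigma(-d.n)."""
    return (-np.log(sigmoid(doc_vec @ pos_vec))
            - np.sum(np.log(sigmoid(-neg_vecs @ doc_vec))))

def sgns_step(doc_vec, pos_vec, neg_vecs, lr=0.01):
    """One gradient step: pull the document embedding toward the
    positive (important in-document) word vector and push it away
    from the negative samples."""
    grad = (sigmoid(doc_vec @ pos_vec) - 1.0) * pos_vec
    grad += neg_vecs.T @ sigmoid(neg_vecs @ doc_vec)
    return doc_vec - lr * grad

dim = 8
doc = rng.normal(size=dim)            # document embedding
pos = rng.normal(size=dim)            # an "important" word in the document
negs = rng.normal(size=(5, dim))      # stand-ins for domain-dependent noise samples

loss_before = sgns_loss(doc, pos, negs)
doc = sgns_step(doc, pos, negs)
loss_after = sgns_loss(doc, pos, negs)
```

With a small learning rate, each step decreases the loss, moving the document embedding toward the important word and away from the cross-domain frequent words, which is what makes the resulting document-embedding distributions harder to discriminate across domains.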


