Conference on Computational Linguistics and Speech Processing

N-best Parse Rescoring Based on Dependency-Based Word Embeddings


Abstract

Rescoring approaches for parsing aim to re-rank the parse trees produced by a general parser for a given sentence. The re-ranking quality depends on the precision of the rescoring function, yet designing an appropriate function to judge the quality of parse trees is a challenge. Whichever method is used, a treebank is a widely used resource in parsing tasks, and most approaches rely on complex features to re-estimate the tree structures of a given sentence [1, 2, 3]. Unfortunately, treebanks are generally small and insufficient, which leads to the common problem of data sparseness. Learning from large-scale unlabeled data is therefore necessary and has proved useful in previous work [4, 5, 6]; how to extract useful information from large unannotated corpora remains a research issue. Word embeddings have become increasingly popular lately, proving valuable as a source of features in a broad range of NLP tasks [7, 8, 9]. word2vec [10] is among the most widely used word embedding models today; its success is largely due to an efficient and user-friendly implementation that learns high-quality word embeddings from very large corpora. word2vec learns low-dimensional continuous vector representations for words from window-based contexts, i.e., context words within some fixed distance on each side of the target word. A different context type is used by dependency-based word embeddings [11, 12, 13], which consider syntactic contexts rather than the window contexts of word2vec. Bansal et al. [8] and Melamud et al. [11] show the benefits of such modified-context embeddings in dependency parsing. Dependency-based word embeddings can also relieve data sparseness: even when a dependency word pair never occurs in the corpus, its dependency score can still be calculated from the embeddings [12].

In this paper, we propose a rescoring approach for parsing that combines the original parsing scores with dependency word embedding scores to determine the best parse tree among the n-best parse trees. Our approach has three main steps. First, the parser produces n-best parse trees together with their structural scores; each parse tree includes words, part-of-speech (PoS) tags, and semantic role labels. Second, we extract word-to-word associations (word dependencies, where a dependency implies a close association between words from either a syntactic or a semantic perspective) from large amounts of auto-parsed data and use word2vecf [13] to train dependency-based word embeddings. Third, we build a structural rescoring method to find the best tree structure among the n-best candidates.

We conduct experiments on the standard data sets of the Chinese Treebank. We also study how different types of embeddings influence rescoring, including words, words with semantic role labels, and word senses (concepts). Experimental results show that using semantic role labels in the dependency embeddings performs best, and that our proposed approach outperforms the best parser for Chinese. Furthermore, we compare the traditional conditional-probability method with our approach; the experimental results show that the embedding scores alleviate the data sparseness problem and yield better results than the traditional approach.
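The second step, training dependency-based embeddings with word2vecf [13], relies on pairing each word with syntactic contexts rather than window contexts. Below is a minimal Python sketch of how such (word, context) pairs might be extracted from auto-parsed sentences; the Token fields, the "head/label" context format, and the "-1" inverse marker follow the general word2vecf convention but are illustrative assumptions, not the paper's exact recipe.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Token:
    form: str    # surface word
    head: int    # 0-based index of the head token, -1 for the root
    label: str   # dependency (or semantic role) label

def dependency_contexts(sent: List[Token]) -> List[Tuple[str, str]]:
    # Each dependent sees its head decorated with the relation label,
    # and each head sees the dependent with the inverse relation,
    # mirroring the word2vecf context convention.
    pairs = []
    for tok in sent:
        if tok.head < 0:
            continue
        head = sent[tok.head]
        pairs.append((tok.form, head.form + "/" + tok.label))
        pairs.append((head.form, tok.form + "/" + tok.label + "-1"))
    return pairs

# Toy parse of "the cat sleeps": det(cat, the), nsubj(sleeps, cat).
sent = [Token("the", 1, "det"), Token("cat", 2, "nsubj"), Token("sleeps", -1, "root")]
print(dependency_contexts(sent))

Feeding such pairs to word2vecf yields one vector space for words and one for labeled contexts, which is what the rescoring step consumes.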
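For the third step, the abstract states only that the original parsing scores are combined with dependency embedding scores. The sketch below shows one hypothetical combination: a cosine similarity between each dependent's word vector and its head-plus-label context vector, linearly interpolated with the parser's structural score via a weight lam. Both the cosine form and lam are assumptions, not the paper's published scoring function.

import numpy as np

def dep_score(word_vecs, ctx_vecs, arcs):
    # Sum of cosine similarities over all (dependent, head, label) arcs.
    # Defined even for head/dependent pairs never seen together in the
    # corpus, which is the data-sparseness advantage noted in [12].
    total = 0.0
    for dep, head, label in arcs:
        w = word_vecs.get(dep)
        c = ctx_vecs.get(head + "/" + label)
        if w is not None and c is not None:
            total += float(w @ c) / (np.linalg.norm(w) * np.linalg.norm(c))
    return total

def rescore(nbest, word_vecs, ctx_vecs, lam=0.5):
    # nbest: list of (parser_score, arcs); returns the highest-scoring tree.
    return max(nbest, key=lambda t: t[0] + lam * dep_score(word_vecs, ctx_vecs, t[1]))

Because the dependency score comes from vector similarity rather than observed pair counts, unseen word pairs still receive a meaningful score, in line with the comparison against the traditional conditional-probability method reported above.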
