Conference: Annual Meeting of the Association for Computational Linguistics

DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension

Abstract

We propose DuoRC, a novel dataset for Reading Comprehension (RC) that motivates several new challenges for neural approaches in language understanding beyond those offered by existing RC datasets. DuoRC contains 186,089 unique question-answer pairs created from a collection of 7,680 pairs of movie plots, where each pair in the collection reflects two versions of the same movie - one from Wikipedia and the other from IMDb - written by two different authors. We asked crowdsourced workers to create questions from one version of the plot and a different set of workers to extract or synthesize answers from the other version. This unique characteristic of DuoRC, where questions and answers are created from different versions of a document narrating the same underlying story, ensures by design that there is very little lexical overlap between the questions created from one version and the segments containing the answer in the other version. Further, since the two versions have different levels of plot detail, narration style, vocabulary, etc., answering questions from the second version requires deeper language understanding and incorporating external background knowledge. Additionally, the narrative style of passages arising from movie plots (as opposed to typical descriptive passages in existing datasets) exhibits the need to perform complex reasoning over events across multiple sentences. Indeed, we observe that state-of-the-art neural RC models which have achieved near-human performance on the SQuAD dataset (Rajpurkar et al., 2016b), even when coupled with traditional NLP techniques to address the challenges presented in DuoRC, exhibit very poor performance (F1 score of 37.42% on DuoRC vs. 86% on the SQuAD dataset). This opens up several interesting research avenues wherein DuoRC could complement other RC datasets to explore novel neural approaches for studying language understanding.