Workshop on Advances in Discourse Analysis and its Computational Aspects

Explicit and implicit discourse relations from a cross-lingual perspective - from experience in working on Chinese discourse annotation


Abstract

In computational linguistics and natural language processing, progress in discourse analysis has been relatively slow compared with syntactic parsing or semantic analysis (e.g., word sense disambiguation, semantic role labeling). In an age when statistical, data-driven approaches dominate the field, a common linguistic resource that is widely accepted by the community is key to advancing the state of the art in this area. Creating consistently annotated data for discourse analysis is particularly challenging because one has to deal with larger linguistic structures and there are few linguistic rules to follow. The key to successful discourse annotation is to identify a well-grounded linguistic theory that can be easily operationalized. In the Penn Discourse Treebank (PDTB; Prasad et al. 2008, Webber and Joshi 1998) the field may have found such a theory. In the PDTB conception, discourse relations revolve around discourse connectives, where each discourse connective is a predicate that takes two arguments. In this way, discourse annotations are anchored by discourse connectives and are thus lexicalized. In our view, lexicalization has been crucial to the success of the PDTB as an annotation project, a large-scale effort characterized by high inter-annotator agreement, a standard metric for annotation consistency. Lexicalization grounds highly abstract discourse relations in specific lexical items. In doing so, it localizes the ambiguity in discourse relations to discourse connectives: a lexical item can have either a discourse connective use or a non-discourse connective use (e.g., "when"), and one discourse connective can be ambiguous between different discourse relations (e.g., "since"). As a result, it reduces the cognitive load of the annotation task because each annotator can focus on only one discourse connective at a time instead of scores of discourse relations.
This in turn enlarges the annotator pool, because more annotators can perform the task without extensive training. The long list of annotators who worked on the PDTB annotation attests to this observation. A larger annotator pool and a shorter learning curve translate to the scalability of such an approach. If lexicalization is so important to discourse annotation, what about discourse relations that are not anchored by an explicit discourse connective? The PDTB addresses this by assuming there is an implicit discourse connective that connects its two arguments, which are typically (parts of) adjacent sentences. This is operationalized by identifying punctuation marks (e.g., periods) that serve as boundaries of two adjacent sentences as anchors of implicit discourse relations. The specific discourse relation is determined by testing which discourse connective can be plausibly inserted between these two adjacent sentences. In doing so, the PDTB assumes that (1) the range of possible discourse relations anchored by implicit discourse connectives is basically the same as the range anchored by explicit discourse connectives, and (2) discourse relations anchored by implicit discourse connectives are mostly local. The first assumption is largely borne out in the PDTB. Either a discourse connective can be inserted between two adjacent sentences, or they are related by the fact that they talk about the same entities, or there is no relation between them. The last possibility has a direct bearing on the second assumption: if there is no relation between two adjacent sentences, does that mean that these sentences have no discourse relations at all with the rest of the text, or that they are related to other discourse segments that are non-local? It is reasonable to assume that all discourse segments are related in a coherent piece of text, and a large number of such "no-relations" would call for a significant expansion of the PDTB approach.
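The lexicalized view described above can be illustrated with a minimal Python sketch. It is not the PDTB tooling or file format: the `DiscourseRelation` class, the `implicit_candidates` helper, the naive sentence splitter, and the example sense label are all assumptions made for illustration. The sketch treats a connective as a predicate over two arguments and treats each boundary between adjacent sentences as the anchor of a candidate implicit relation, leaving the connective empty for an annotator to fill in.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiscourseRelation:
    """A connective viewed as a predicate over two arguments (hypothetical schema)."""
    arg1: str
    arg2: str
    connective: Optional[str]   # None for an implicit relation not yet annotated
    kind: str                   # "Explicit" or "Implicit"
    sense: Optional[str] = None # illustrative label only


def split_sentences(text: str) -> List[str]:
    # Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def implicit_candidates(text: str) -> List[DiscourseRelation]:
    # Each boundary between two adjacent sentences anchors one candidate
    # implicit relation; an annotator would then test which connective
    # can plausibly be inserted at that boundary.
    sentences = split_sentences(text)
    return [
        DiscourseRelation(arg1=a, arg2=b, connective=None, kind="Implicit")
        for a, b in zip(sentences, sentences[1:])
    ]


# An explicit relation is anchored by the connective itself.
explicit = DiscourseRelation(
    arg1="the game was cancelled",
    arg2="it rained",
    connective="because",
    kind="Explicit",
    sense="Contingency.Cause",
)

candidates = implicit_candidates("It rained. The game was cancelled. Fans went home.")
```

Here three sentences yield two candidate implicit relations, one per sentence boundary. A candidate for which no connective can be inserted would fall into the entity-based or "no relation" cases discussed above, which this sketch does not model.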
While it might not be too much to expect that the same high-level discourse relations hold across languages, it is almost certainly too much to expect that discourse relations are lexicalized in the same way across languages. The question is whether a lexicalized approach to discourse analysis can still be maintained in languages where discourse relations are lexicalized in ways that are significantly different from English. Our experience in a pilot PDTB-style Chinese discourse annotation project shows that the lexicalized approach can be effectively adopted, although significant adaptations have to be made. Chinese has the same types of discourse connectives (subordinate conjunctions, coordinate conjunctions, and discourse adverbials) as English, but they occur much less frequently because they can often be dropped. The ratio of implicit to explicit connectives is about 80/20 (Zhou and Xue, 2012) rather than the roughly 50/50 split reported for the PDTB (Prasad et al. 2008). However, by identifying punctuation marks as boundaries of discourse segments and testing whether lexicalized discourse relations hold between adjacent comma-separated discourse segments, we are able to show that Chinese discourse annotation can be performed with very good consistency. More evidence has to be gathered from other languages to test the feasibility of lexicalized approaches to discourse annotation in a multilingual setting, and such evidence should come soon now that the approach has been adopted in discourse annotation projects for a variety of languages.
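The punctuation-based segmentation described above can be sketched in Python as follows. This is a rough illustration, not the procedure or connective inventory used in the Chinese annotation project: the two short connective lists, the function names, and the rule that a connective must appear at the start of a segment are assumptions. For simplicity the sketch also pairs segments across sentence-final punctuation, which a real annotation scheme might treat differently.

```python
import re
from typing import List, Tuple

# Tiny illustrative lists (assumptions, not the project's inventory).
ARG1_INITIAL = ("因为", "虽然", "如果")  # because, although, if
ARG2_INITIAL = ("所以", "但是")          # so, but


def split_segments(text: str) -> List[str]:
    # Treat Chinese commas and sentence-final punctuation as segment boundaries.
    parts = re.split(r"[，。；！？]", text)
    return [p.strip() for p in parts if p.strip()]


def classify_pairs(text: str) -> List[Tuple[str, str, str]]:
    # Crude heuristic: an adjacent pair counts as explicit when the first
    # segment opens with a subordinating connective or the second segment
    # opens with a coordinating one; otherwise it is an implicit candidate.
    segments = split_segments(text)
    pairs = []
    for a, b in zip(segments, segments[1:]):
        explicit = a.startswith(ARG1_INITIAL) or b.startswith(ARG2_INITIAL)
        pairs.append((a, b, "Explicit" if explicit else "Implicit"))
    return pairs


def implicit_share(pairs: List[Tuple[str, str, str]]) -> float:
    # Fraction of adjacent pairs with no explicit connective.
    if not pairs:
        return 0.0
    return sum(1 for _, _, kind in pairs if kind == "Implicit") / len(pairs)


pairs = classify_pairs("因为下雨，比赛取消了，大家回家了。")
share = implicit_share(pairs)
```

In this toy sentence the first pair is marked explicit by "因为" and the second has no connective, so half of the pairs are implicit candidates. Running the same count over an annotated corpus is one way a ratio such as the reported 80/20 could be computed, though the figures in the abstract come from human annotation rather than a string-matching heuristic like this one.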