Towards Discourse Classification for Chinese -- a Resource-poor Language.

Abstract

Discourse raises questions about semantics, especially the nature of coherence and cohesion in texts. Like part-of-speech tagging and syntactic parsing, discourse classification is fundamental to computational linguistics, yet it has been studied comparatively little. The lack of annotated corpora limits research on discourse classification for most languages other than English (e.g., Chinese), and manual annotation for discourse classification is complex, time-consuming and costly. One way around this predicament is to explore unsupervised learning methods. However, previous work on English showed that unsupervised methods could handle only coarse-grained discourse relations and suffered from low precision. Another option is to draw on discourse classification capabilities from other languages that have rich discourse corpora, but cross-language discourse classification remains very much an open problem. Using Chinese as the target, this thesis presents the first study on discourse classification for a resource-poor language. We also annotate the first open discourse treebank for Chinese, which covers 890 news articles.

We first propose a novel bootstrapping unsupervised method for discourse classification based on semantic sequential representation (SSR). SSR is a new representation for discourse instances that integrates basic bag-of-words information with lexical, semantic and word-sequence information. Our method starts with a small set of cue-phrase-based patterns to collect a large number of discourse instances, which are then converted to SSRs. We then propose an unsupervised SSR learner that generates, weighs and filters new SSRs without cue phrases for recognizing discourse relations. Experimental results showed that our method outperformed the previous unsupervised method by 7% in F-score. We also show that SSRs are effective features for supervised learning methods.

The SSR-based method (F-score = 0.63) ignores the ambiguities of discourse connectives and, as a result, suffers from low recall (recall = 0.49). To discover and eliminate these ambiguities, we further propose a cross-language framework for discourse classification. In this framework, discourse classification for Chinese is achieved in two steps: (1) discourse connective/trigger identification and (2) sense classification. The English Penn Discourse Treebank 2 (PDTB2) and Chinese-English parallel data are coupled to provide the training data for a co-training-based framework. Experimental results showed that our method achieved a significant improvement over the SSR-based method. The proposed framework is practical and effective, especially in coping with the intercommunity problem, which is common in cross-language discourse classification. Moreover, the framework does not integrate any language-specific features, which makes it theoretically applicable to other languages.

Every language has its unique characteristics, and our cross-language framework, which focuses on the characteristics that languages share, is ineffective at detecting characteristics specific to Chinese. We therefore package the corpus used in this research to form the Discourse Treebank for Chinese (DTBC). DTBC adopts the principles of PDTB2 while incorporating the linguistic characteristics of Chinese. The annotation work adds a discourse layer to 890 articles from the Penn Chinese Tree Bank 5 (CTB5). DTBC is the first open Chinese discourse treebank and will be an invaluable linguistic resource for future research on Chinese discourse.
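
As a reading aid for the SSR-based figures quoted in the abstract (F-score 0.63, recall 0.49), the short sketch below works out the precision those two numbers would imply. This is an inference, not a result reported in the abstract: it assumes the F-score is the balanced F1 measure, F = 2PR / (P + R), and that both figures come from the same evaluation without averaging across relation classes. The function name `implied_precision` is a hypothetical helper introduced only for this illustration.

```python
def implied_precision(f_score: float, recall: float) -> float:
    """Solve F = 2PR / (P + R) for P, given F and R.

    Assumes the balanced F1 definition; this is an assumption,
    not something the abstract states.
    """
    denominator = 2 * recall - f_score
    if denominator <= 0:
        # F1 can never reach 2 * recall, so such a pair is inconsistent.
        raise ValueError("no valid precision: F must be below 2 * recall")
    return f_score * recall / denominator


# Figures quoted in the abstract for the SSR-based method.
precision = implied_precision(0.63, 0.49)
print(f"{precision:.2f}")  # prints 0.88
```

Under these assumptions the method would be operating at roughly 0.88 precision against 0.49 recall, which is consistent with the abstract's description of recall as the weak point.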

Bibliographic record

  • Author

    Zhou, Lanjun

  • Author affiliation

    The Chinese University of Hong Kong (Hong Kong)

  • Degree-granting institution The Chinese University of Hong Kong (Hong Kong)
  • Discipline Computer Science
  • Degree Ph.D.
  • Year 2014
  • Pages 121 p.
  • Total pages 121
  • Original format PDF
  • Language English
