Conference: International Conference on Language Resources and Evaluation

Extraction of Unmarked Quotations in Newspapers: A Study Based on Direct Speech Extraction Systems

Abstract

This paper presents work in progress on the automatic extraction of quotation sentences from newspaper articles. The focus is the extraction and annotation of unmarked quotation sentences. A linguistic study shows that unmarked quotation sentences can be formalised into 16 patterns, which can be used to develop an extraction grammar. The question of identifying the boundaries of unmarked quotations is also raised, as these boundaries are often ambiguous. An annotation scheme is defined that describes all the elements that can occur in a quotation sentence. The paper also presents the creation of two resources required by our system. First, a dictionary of verbs that introduce quotations has been built automatically, using a grammar of marked quotation sentences to identify the verbs able to introduce quotations. Second, a grammar formalising the patterns of unmarked quotation sentences has been developed with Unitex, a tool based on finite-state machines. A short experiment performed on two of the patterns shows some promising results.
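
The verb-dictionary step can be illustrated with a minimal sketch. It is not the authors' system: a simple regular expression stands in for the Unitex grammar of marked quotation sentences, the English example sentences are invented, and the names `MARKED_QUOTE`, `harvest_verbs` and `has_introducing_verb` are hypothetical. The sketch collects the word that follows a closing quotation mark as a candidate introducing verb, then uses the resulting list to flag sentences without quotation marks that may be unmarked quotations.

```python
import re
from collections import Counter

# Hypothetical stand-in for a grammar of marked quotations: a quoted span,
# an optional comma, then one word taken as the candidate introducing verb.
MARKED_QUOTE = re.compile(r'"[^"]+"\s*,?\s*(\w+)')


def harvest_verbs(sentences):
    """Count the words that directly follow a closing quotation mark."""
    counts = Counter()
    for sentence in sentences:
        for match in MARKED_QUOTE.finditer(sentence):
            counts[match.group(1).lower()] += 1
    return counts


def has_introducing_verb(sentence, verbs):
    """Flag a sentence with no quotation marks that contains a harvested verb."""
    if '"' in sentence:
        return False  # marked quotations are handled by the first step
    words = re.findall(r"\w+", sentence.lower())
    return any(word in verbs for word in words)


# Invented example corpus (assumption: English text, straight double quotes).
corpus = [
    '"We will appeal," said the lawyer.',
    '"Prices are rising", declared the minister.',
    "The weather was mild.",
]
verbs = harvest_verbs(corpus)
candidate = has_introducing_verb("The minister declared that prices are rising.", verbs)
```

A real system would need lemmatisation and part-of-speech filtering so that only verbs are kept. A verb match alone does not settle where the unmarked quotation begins and ends, which is the boundary ambiguity the abstract mentions.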
