Conference: 9th International Conference on Language Resources and Evaluation

Design and development of an RDB version of the Corpus of Spontaneous Japanese



Abstract

In this paper, we describe the design and development of a new version of the Corpus of Spontaneous Japanese (CSJ), a large-scale spoken corpus released in 2004. CSJ contains various annotations that are represented in XML format (CSJ-XML). CSJ-XML, however, is very complicated and suffers from several problems. To overcome these problems, we developed a relational database version of CSJ (CSJ-RDB) and released it in 2013. CSJ-RDB is based on an extension of the segment and link-based annotation scheme, which we adapted to handle multi-channel and multi-modal streams. Because this scheme adopts a stand-off framework, CSJ-RDB can represent three hierarchical structures at the same time: inter-pausal-unit-top, clause-top, and intonational-phrase-top. CSJ-RDB consists of five different types of tables: segment, unaligned-segment, link, relation, and meta-information tables. The database was automatically constructed from annotation files extracted from CSJ-XML by using general-purpose corpus construction tools. CSJ-RDB enables us to easily and efficiently conduct the complex searches required for corpus-based studies of spoken language.
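The segment and link idea can be illustrated with a small sketch. The abstract does not give the actual CSJ-RDB schema, so everything below is an assumption made for illustration: the table names `segment` and `link`, all column names, the `kind` values, and the toy data. Only two of the five table types are modelled. The point is that, in a stand-off design, one word segment can be linked to both an inter-pausal unit and a clause, so two hierarchies coexist without nesting one inside the other.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the actual CSJ-RDB layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE segment (
    id         INTEGER PRIMARY KEY,
    kind       TEXT NOT NULL,   -- assumed values: 'ipu', 'clause', 'word'
    start_time REAL NOT NULL,   -- seconds from the start of the recording
    end_time   REAL NOT NULL,
    label      TEXT
);
CREATE TABLE link (
    child_id  INTEGER NOT NULL REFERENCES segment(id),
    parent_id INTEGER NOT NULL REFERENCES segment(id),
    PRIMARY KEY (child_id, parent_id)
);
""")

# Toy data (made up): two inter-pausal units, one clause spanning both,
# and three words.
conn.executemany(
    "INSERT INTO segment VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ipu", 0.0, 1.0, None),
        (2, "ipu", 1.5, 2.5, None),
        (3, "clause", 0.0, 2.5, None),
        (4, "word", 0.0, 0.5, "w1"),
        (5, "word", 0.5, 1.0, "w2"),
        (6, "word", 1.5, 2.5, "w3"),
    ],
)

# Each word is linked to its IPU and, independently, to its clause.
conn.executemany(
    "INSERT INTO link VALUES (?, ?)",
    [(4, 1), (5, 1), (6, 2), (4, 3), (5, 3), (6, 3)],
)

# For every word in clause 3, report which IPU it belongs to.
rows = conn.execute("""
    SELECT w.label, ipu.id
    FROM segment AS w
    JOIN link AS lc     ON lc.child_id = w.id
    JOIN segment AS c   ON c.id = lc.parent_id AND c.kind = 'clause'
    JOIN link AS li     ON li.child_id = w.id
    JOIN segment AS ipu ON ipu.id = li.parent_id AND ipu.kind = 'ipu'
    WHERE c.id = 3
    ORDER BY w.start_time
""").fetchall()

print(rows)
```

A query like this crosses two hierarchies with plain joins, which is the kind of search that would be awkward against a single nested XML tree where a clause spans a pause boundary.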
