Venue: 9th International Conference on Language Resources and Evaluation

Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus


Abstract

The DARPA BOLT Program develops systems capable of allowing English speakers to retrieve and understand information from informal foreign language sources. Phase 2 of the program required large volumes of naturally occurring informal text (SMS) and chat messages from individual users in multiple languages to support evaluation of machine translation systems. We describe the design and implementation of a robust collection system capable of capturing both live and archived SMS and chat conversations from willing participants. We also discuss the challenges of recruitment at a time when potential participants have acute and growing concerns about their personal privacy in the realm of digital communication, and we outline the techniques adopted to confront those challenges. Finally, we review the properties of the resulting BOLT Phase 2 Corpus, which comprises over 6.5 million words of naturally occurring chat and SMS in English, Chinese and Egyptian Arabic.
