Venue: International Conference on Computational Linguistics

100,000 Podcasts: A Spoken English Document Corpus

Abstract

Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition, they represent a noisy but fascinating collection of documents that can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. The corpus is orders of magnitude larger than previous speech corpora used for search and summarization. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. Our results show that the size and variability of this corpus open up new avenues for research.
