100,000 Podcasts: A Spoken English Document Corpus

Abstract

Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. The corpus is orders of magnitude larger than previous speech corpora used for search and summarization. Our results show that the size and variability of this corpus open up new avenues for research.

Bibliographic details

Source
International Conference on Computational Linguistics | 2020 | pp. 5903-5917 | 15 pages
Authors
Ann Clifton; Sravana Reddy; Yongze Yu; Aasish Pappu; Rezvaneh Rezapour; Hamed Bonab; Maria Eskevich; Gareth J. F. Jones; Jussi Karlgren; Ben Carterette; Rosie Jones
Original format: PDF

Similar literature

1. The Usage of Psychological Passives in Spoken and Written English: A Corpus-based Analysis and Implications for English Language Teaching [J]. Miharu Fuyuno. Procedia - Social and Behavioral Sciences. 2013, No. 2
2. A comprehensive corpus-based analysis of "X Auxiliary Subject" constructions in written and spoken English [J]. Prado-Alonso Carlos. Nature reviews neuroscience. 2019, No. 2
3. Mutual attraction between high-frequency verbs and clause types with finite verbs in early positions: corpus evidence from spoken English, Dutch, and German [J]. Kempen Gerard, Harbusch Karin. Language, cognition and neuroscience. 2019, No. 9
4. Research and Construction of Spoken English Graded Corpus for College English Majors [C]. Baiping Huang. International Conference on Innovations in Economic Management and Social Science. 2017
5. A comparative study of the abilities of native and nonnative speakers of American English to use discourse markers and conversational hedges as elements of the structure of unplanned spoken American English interactions in three subcorpora of the Michigan Corpus of Academic Spoken English [D]. Santana-Williamson, Eliana. 2005
6. Phonological and syntactic competition effects in spoken word recognition: evidence from corpus-based statistics [O]. Jie Zhuang, Barry J. Devereux
7. Designing the Radiotelephony Plain English Corpus (RTPEC): A specialized spoken English language corpus towards a description of aeronautical communications in non-routine situations [O]. Malila C.A. Prado, Patricia Tosquil Lucks. 2019
