Conference: 9th International Conference on Language Resources and Evaluation

TVD: a reproducible and multiply aligned TV series dataset



Abstract

We introduce a new dataset built around two TV series from different genres: The Big Bang Theory, a situation comedy, and Game of Thrones, a fantasy drama. The dataset has multiple tracks extracted from diverse sources, including dialogue (manual and automatic transcripts, multilingual subtitles), crowd-sourced textual descriptions (brief episode summaries, longer episode outlines) and various metadata (speakers, shots, scenes). The paper describes the dataset and provides tools to reproduce it for research purposes, provided one has legally acquired the DVD set of the series. Tools are also provided to temporally align a major subset of dialogue and description tracks, in order to combine complementary information present in these tracks for enhanced accessibility. For alignment, we consider tracks as comparable corpora and first apply an existing algorithm for aligning such corpora based on dynamic time warping and TF-IDF-based similarity scores. We improve this baseline algorithm using contextual information, WordNet-based word similarity and scene location information. We report the performance of these algorithms on a manually aligned subset of the data. To highlight the interest of the database, we report a use case involving rich speech retrieval and propose other uses.
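The baseline alignment idea named in the abstract (dynamic time warping over TF-IDF similarity scores) can be illustrated with a minimal sketch. This is not the authors' tool: the function names `tfidf_vectors`, `cosine` and `dtw_align`, the whitespace tokenization, the smoothed IDF formula and the sample sentences are all assumptions made for illustration. The contextual, WordNet-based and scene-location improvements are not covered.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document.

    Uses a smoothed IDF, log((1 + n) / (1 + df)) + 1; this weighting is an
    assumption of the sketch, not taken from the paper.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dtw_align(source, target):
    """Align two ordered lists of sentences with dynamic time warping.

    The local cost is 1 - cosine similarity of TF-IDF vectors. Returns a
    monotonic list of (source_index, target_index) pairs.
    """
    docs = [s.lower().split() for s in source + target]
    vecs = tfidf_vectors(docs)
    src, tgt = vecs[: len(source)], vecs[len(source):]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1.0 - cosine(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the final cell to recover the cheapest warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


if __name__ == "__main__":
    # Hypothetical outline and subtitle lines, invented for this sketch.
    outline = [
        "Sheldon knocks on the door of the apartment",
        "The group eats dinner and argues about physics",
    ]
    subtitles = [
        "Knock knock knock on the door",
        "Pass the dinner please",
        "Your physics argument is wrong",
    ]
    for src_idx, tgt_idx in dtw_align(outline, subtitles):
        print(outline[src_idx], "<->", subtitles[tgt_idx])
```

Because the path is monotonic and covers both sequences end to end, every outline sentence is matched to at least one subtitle line, which is what lets timing information from the subtitles carry over to the untimed descriptions.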
