
Multimodal Indexing of Presentation Videos



Abstract

This thesis presents four novel methods to help users efficiently and effectively retrieve information from unstructured and unsourced multimedia sources, in particular the growing amount and variety of presentation videos such as those in e-learning, conference recordings, corporate talks, and student presentations. We demonstrate a system to summarize, index, and cross-reference such videos, and measure the quality of the produced indexes as perceived by end users. We introduce four major semantic indexing cues: text, speaker faces, graphics, and mosaics, going beyond standard tag-based searches and simple video playback. This work aims at recognizing visual content "in the wild", where the system cannot rely on any information besides the video itself.

For text, within a scene-text detection and recognition framework, we present a novel locally optimal adaptive binarization algorithm, implemented with integral histograms. It determines an optimal threshold that maximizes the between-class variance within a subwindow, with computational complexity independent of the size of the window itself. We obtain character recognition rates of 74%, validated against ground truth from 8 presentation videos spanning 1 hour and 45 minutes, almost double the baseline performance of an open-source OCR engine.

For speaker faces, we detect, track, match, and finally select a humanly preferred face icon per speaker, based on three quality measures: resolution, amount of skin, and pose. We register 87% agreement (51 out of 58 speakers) between the face indexes automatically generated from three unstructured presentation videos of approximately 45 minutes each and human preferences recorded through Mechanical Turk experiments.

For diagrams, we locate graphics inside frames showing a projected slide, cluster them with an online algorithm based on a combination of visual and temporal information, and select and color-correct their representatives to match human preferences recorded through Mechanical Turk experiments. We register 71% accuracy (57 out of 81 unique diagrams properly identified, selected, and color-corrected) on three hours of video containing five different presentations.

For mosaics, we combine two existing stitching measures to extend video images into a world coordinate system. The set of frames to be registered into a mosaic is sampled according to the PTZ camera movement, which is computed through least-squares estimation starting from the luminance constancy assumption. A stitching algorithm based on local features is then applied to estimate the homography among a set of video frames, and median blending is used to render pixels in overlapping regions of the mosaic.

For two of these indexes, namely faces and diagrams, we present two novel MTurk-derived user data collections to determine viewer preferences, and show that the selections made by our methods match them. The net result of this thesis allows users to search, inside a video collection as well as within a single video clip, for a segment of a presentation by professor X on topic Y, containing graph Z.
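To make the text-indexing step concrete, here is a minimal Python sketch of Otsu-style thresholding inside a subwindow using per-bin integral histograms, assuming 8-bit grayscale input; the function names and structure are our own illustration, not the thesis' implementation.

```python
import numpy as np

def build_integral_histograms(gray, bins=256):
    """One integral image per gray level: integrals[b, i, j] is the
    count of pixels equal to b inside gray[:i, :j]. The one-time
    precomputation is O(bins * H * W) in time and memory."""
    h, w = gray.shape
    integrals = np.zeros((bins, h + 1, w + 1), dtype=np.int64)
    for b in range(bins):
        mask = (gray == b).astype(np.int64)
        integrals[b, 1:, 1:] = mask.cumsum(axis=0).cumsum(axis=1)
    return integrals

def window_histogram(integrals, top, left, bottom, right):
    """Histogram of gray[top:bottom, left:right] from four corner
    lookups per bin -- cost independent of the window area."""
    return (integrals[:, bottom, right] - integrals[:, top, right]
            - integrals[:, bottom, left] + integrals[:, top, left])

def otsu_threshold(hist):
    """Threshold maximizing the between-class variance of the two
    classes induced by the cut."""
    total = hist.sum()
    if total == 0:
        return 0
    levels = np.arange(len(hist))
    w0 = np.cumsum(hist)                 # class-0 pixel count per cut
    w1 = total - w0                      # class-1 pixel count per cut
    m0 = np.cumsum(hist * levels)        # cumulative intensity mass
    mu0 = np.where(w0 > 0, m0 / np.maximum(w0, 1), 0.0)
    mu1 = np.where(w1 > 0, (m0[-1] - m0) / np.maximum(w1, 1), 0.0)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return int(np.argmax(between))
```

After the single precomputation, binarizing any subwindow reduces to `window_histogram` followed by `otsu_threshold` at O(bins) per window, which is what makes the cost independent of the window size itself.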
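The luminance constancy assumption behind the PTZ motion estimate linearizes to Ix·u + Iy·v + It ≈ 0 at every pixel; stacking one such equation per pixel and solving in the least-squares sense recovers the dominant camera motion. A minimal sketch follows, assuming for simplicity a pure global translation (u, v); the motion model and gradient filters here are our assumptions, not necessarily the thesis' exact choices.

```python
import numpy as np
import cv2

def global_translation(prev_gray, curr_gray):
    """Least-squares global motion (u, v) between two grayscale frames
    from the linearized luminance constancy equation
    Ix*u + Iy*v + It = 0."""
    # Spatial gradients; scale=0.125 normalizes the 3x3 Sobel response
    # to an approximate unit central difference.
    Ix = cv2.Sobel(prev_gray, cv2.CV_32F, 1, 0, ksize=3, scale=0.125)
    Iy = cv2.Sobel(prev_gray, cv2.CV_32F, 0, 1, ksize=3, scale=0.125)
    It = curr_gray.astype(np.float32) - prev_gray.astype(np.float32)

    # One equation per pixel: [Ix Iy] @ [u, v]^T = -It, solved in the
    # least-squares sense over the whole frame.
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(u), float(v)
```

A frame-sampling rule for the mosaic can then be driven by the accumulated (u, v), for instance adding a frame whenever the camera has panned by more than a fixed fraction of the frame width; the abstract does not specify the exact rule.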

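For the stitching itself, the sketch below pairs OpenCV's RANSAC homography estimation with per-pixel median blending; ORB features stand in for whatever local features the thesis actually used, so treat the specific choices as assumptions.

```python
import numpy as np
import cv2

def pairwise_homography(img_a, img_b):
    """Homography mapping img_b into img_a's coordinate frame,
    estimated from ORB correspondences with RANSAC."""
    orb = cv2.ORB_create(2000)
    kpts_a, desc_a = orb.detectAndCompute(img_a, None)
    kpts_b, desc_b = orb.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc_a, desc_b)
    src = np.float32([kpts_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kpts_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H

def median_blend(warped_frames):
    """Per-pixel median over frames already warped into mosaic
    coordinates; exact-zero pixels are treated as uncovered and
    ignored (a simplifying assumption of this sketch)."""
    stack = np.stack(warped_frames).astype(np.float32)
    stack[stack == 0] = np.nan
    mosaic = np.nanmedian(stack, axis=0)     # ignores NaNs per pixel
    return np.nan_to_num(mosaic).astype(np.uint8)
```

Each frame would be warped with `cv2.warpPerspective` under its accumulated homography before blending; the median suppresses transient foreground (e.g. the speaker crossing the projection) that a mean blend would smear.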
Bibliographic Record

  • Author

    Merler, Michele

  • Author affiliation
  • Year: 2013
  • Total pages
  • Format: PDF
  • Language: English
  • CLC classification

