Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Modeling Latent Topics and Temporal Distance for Story Segmentation of Broadcast News


Abstract

This paper studies a strategy for modeling the latent topics and temporal distance of text blocks for story segmentation, which we call graph regularization in topic modeling (GRTM). We propose two novel approaches that consider both the temporal distance and the lexical similarity of text blocks, collectively referred to as data proximity, when learning latent topic representations; a graph regularizer is used to derive the latent topic representation while preserving data proximity. In the first approach, we extend Laplacian probabilistic latent semantic analysis (LapPLSA) by introducing a distance penalty function into the affinity matrix of the graph used for latent topic estimation. The estimated latent topic distributions replace the traditional term-frequency vectors as the data representation of the text blocks and are used to measure the cohesive strength between them. In the second approach, we apply Laplacian eigenmaps, which use the graph regularizer for dimensionality reduction, to latent topic distributions estimated by conventional topic modeling. We conduct experiments on automatic speech recognition transcripts of the TDT2 English broadcast news corpus. The experiments show that the proposed strategy outperforms conventional techniques; LapPLSA performs best, with the highest F1-measure of 0.816. We also study the effects of the penalty constant in the distance penalty function, the number of latent topics, and the size of the training data on segmentation performance.
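To make the idea of data proximity more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not give the form of the distance penalty function, so the sketch assumes an exponential penalty, `penalty ** |i - j|`, multiplied into the cosine similarity of two blocks; the function names `affinity_matrix`, `laplacian_eigenmaps`, and `adjacent_cohesion`, and the default `penalty=0.9`, are hypothetical. The embedding step is the standard Laplacian eigenmaps construction on the normalized graph Laplacian, and the cohesion score is plain cosine similarity between neighbouring blocks.

```python
import numpy as np

def affinity_matrix(features, penalty=0.9):
    """Affinity between text blocks given in temporal order.

    features: (n_blocks, n_dims) rows, e.g. term-frequency vectors or
    topic distributions. ASSUMPTION: the distance penalty has the form
    penalty ** |i - j|; the abstract does not specify it.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    cosine = unit @ unit.T                      # lexical similarity
    idx = np.arange(features.shape[0])
    gap = np.abs(idx[:, None] - idx[None, :])   # temporal distance in blocks
    W = cosine * penalty ** gap
    np.fill_diagonal(W, 0.0)                    # no self-loops
    return W

def laplacian_eigenmaps(W, dim=2):
    """Standard Laplacian eigenmaps: solve L y = lambda D y via the
    symmetric normalized Laplacian and drop the trivial eigenvector."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)             # eigenvalues in ascending order
    return d_inv_sqrt[:, None] * vecs[:, 1:dim + 1]

def adjacent_cohesion(rep):
    """Cosine similarity between each block and the next one."""
    norms = np.linalg.norm(rep, axis=1)
    dots = np.sum(rep[:-1] * rep[1:], axis=1)
    return dots / np.maximum(norms[:-1] * norms[1:], 1e-12)

# Toy input: three blocks on one topic followed by three on another.
blocks = np.array([[4, 3, 0, 0, 1]] * 3 + [[0, 0, 4, 3, 1]] * 3, dtype=float)
W = affinity_matrix(blocks)
embedding = laplacian_eigenmaps(W, dim=2)
cohesion = adjacent_cohesion(blocks)
weakest_link = int(np.argmin(cohesion))         # candidate story boundary
```

In this toy setup, the position with the lowest adjacent cohesion is treated as a candidate story boundary; that boundary-picking rule is an illustrative assumption, since the abstract only says that cohesive strength between blocks is measured.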
