Conference: Annual Conference of the International Speech Communication Association

Spoken Document Clustering Using Word Confusion Networks

Abstract

In this paper, we propose a word confusion network (WCN) based approach to clustering spoken documents and analyze its ability to handle the influence of speech recognition errors. A WCN compactly represents multiple confidence-weighted recognition hypotheses. It therefore offers scope for improving clustering accuracy, because the correct transcription is likely to be present among the alternative hypotheses in cases where the 1-best transcript is erroneous. On the other hand, several of the remaining hypotheses are incorrect and could therefore pose a challenge during clustering. In our approach, we extract TF-IDF vectors from the WCNs and cluster them using the K-means algorithm. The components of the TF-IDF vectors are further weighted by the word posterior probabilities, to potentially down-weight the components contributed by incorrect hypotheses with low posterior probabilities. Experimental results on Switchboard data illustrate the usefulness of the rich information in the WCN for clustering, showing up to 4% absolute improvement in the normalized mutual information metric.
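The posterior weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so everything below is an assumption made for illustration: a WCN is modeled as a list of slots, each holding (word, posterior) alternatives; the term frequency is taken to be the sum of a word's posteriors over all slots; and a smoothed IDF is used. The function names `expected_counts`, `weighted_tfidf`, and `cosine` and the toy documents are hypothetical, not taken from the paper.

```python
import math
from collections import defaultdict

# Assumed layout: a WCN is a list of slots, and each slot is a list of
# (word, posterior) alternatives whose posteriors sum to 1.

def expected_counts(wcn):
    """Sum each word's posterior over all slots (a soft term frequency)."""
    counts = defaultdict(float)
    for slot in wcn:
        for word, posterior in slot:
            counts[word] += posterior
    return dict(counts)

def weighted_tfidf(wcns):
    """Return one L2-normalized, posterior-weighted TF-IDF dict per WCN."""
    tfs = [expected_counts(wcn) for wcn in wcns]
    n_docs = len(tfs)
    doc_freq = defaultdict(int)
    for tf in tfs:
        for word in tf:
            doc_freq[word] += 1
    vectors = []
    for tf in tfs:
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1.
        vec = {
            w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
            for w, c in tf.items()
        }
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())

# Toy data (invented): in the second document the 1-best word is "fright",
# but the WCN still carries "flight" with posterior 0.4.
wcns = [
    [[("flight", 0.9), ("fright", 0.1)], [("ticket", 0.8), ("thicket", 0.2)]],
    [[("fright", 0.6), ("flight", 0.4)], [("ticket", 0.7), ("picket", 0.3)]],
    [[("recipe", 0.9), ("receipt", 0.1)], [("butter", 1.0)]],
]
vectors = weighted_tfidf(wcns)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this sketch the low-posterior alternatives contribute only small components, so an incorrect hypothesis such as "thicket" has limited influence, while the correct word "flight" still contributes to the second document's vector despite not being its 1-best hypothesis. The resulting vectors could then be passed to any K-means implementation, and the clusters scored against reference labels with normalized mutual information.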