INTERSPEECH 2012

Spoken Document Clustering Using Word Confusion Networks

Abstract

In this paper, we propose a word confusion network (WCN) based approach to clustering spoken documents and analyze its ability to handle the influence of speech recognition errors. A WCN compactly represents multiple confidence-weighted recognition hypotheses. It therefore offers scope for improving clustering accuracy, because the correct transcription is likely to be present among the alternative hypotheses in cases where the 1-best transcript is erroneous. On the other hand, several of the remaining hypotheses are incorrect and could therefore pose a challenge during clustering. In our approach, we extract TF-IDF vectors from the WCNs and cluster them using the K-means algorithm. The components of the TF-IDF vectors are further weighted with the word posterior probabilities, which potentially down-weights those components contributed by incorrect hypotheses with low posterior probabilities. Experimental results on Switchboard data illustrate the usefulness of the rich information in the WCN for clustering, showing up to 4% absolute improvement in the normalized mutual information metric.
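The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so everything below is an assumption made for illustration: a WCN is modeled as a list of slots, each slot a list of (word, posterior) pairs; the term count of a word is taken to be the sum of its posteriors; the IDF uses a common smoothed form; and K-means is a plain implementation with a deterministic initialization. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict

def expected_tf(wcn):
    """Posterior-weighted term counts: sum each word's posterior over all slots."""
    tf = defaultdict(float)
    for slot in wcn:
        for word, posterior in slot:
            tf[word] += posterior
    return dict(tf)

def tfidf_vectors(wcns):
    """Build L2-normalized TF-IDF vectors from posterior-weighted counts."""
    tfs = [expected_tf(w) for w in wcns]
    vocab = sorted({word for tf in tfs for word in tf})
    n = len(tfs)
    # Assumption: a word counts toward document frequency if it appears
    # in any hypothesis of that document, regardless of its posterior.
    df = {w: sum(1 for tf in tfs if w in tf) for w in vocab}
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for tf in tfs:
        v = [tf.get(w, 0.0) * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vocab, vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    """Plain K-means; the first k vectors serve as initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy WCNs (invented): two sports-like and two music-like documents,
# each slot holding competing recognition hypotheses with posteriors.
wcns = [
    [[("football", 0.9), ("foot", 0.1)], [("game", 0.8), ("gain", 0.2)]],
    [[("guitar", 0.7), ("qatar", 0.3)], [("music", 0.9), ("muse", 0.1)]],
    [[("game", 0.6), ("name", 0.4)], [("football", 0.8), ("futile", 0.2)]],
    [[("music", 0.6), ("muse", 0.4)], [("guitar", 0.9), ("qatar", 0.1)]],
]
vocab, vectors = tfidf_vectors(wcns)
labels = kmeans(vectors, k=2)
print(labels)
```

In this sketch a low-posterior alternative such as "qatar" still enters the vector, but with a proportionally small weight, which is the down-weighting effect the abstract describes. A real system would read WCNs from a speech recognizer and would likely use a tested K-means implementation with several random restarts.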
