Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings

Herman Kamper; Aren Jansen; Sharon Goldwater

Abstract

In settings where only unlabeled speech data is available, speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modelling text. A similar problem is faced when modeling infant language acquisition. In these cases, categorical linguistic structure needs to be discovered directly from speech audio. We present a novel unsupervised Bayesian model that segments unlabeled speech and clusters the segments into hypothesized word groupings. The result is a complete unsupervised tokenization of the input speech in terms of discovered word types. In our approach, a potential word segment (of arbitrary length) is embedded in a fixed-dimensional acoustic vector space. The model, implemented as a Gibbs sampler, then builds a whole-word acoustic model in this space while jointly performing segmentation. We report word error rates in a small-vocabulary connected digit recognition task by mapping the unsupervised decoded output to ground truth transcriptions. The model achieves around 20% error rate, outperforming a previous HMM-based system by about 10% absolute. Moreover, in contrast to the baseline, our model does not require a pre-specified vocabulary size.
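
The abstract does not say how a segment of arbitrary length is mapped into a fixed-dimensional space. As a minimal sketch of one simple option, assuming uniform downsampling of per-frame features (an illustrative choice, not necessarily the authors' method), the following picks a fixed number of evenly spaced frames and concatenates them. The function name `downsample_embed`, the parameter `n_keep`, and the toy segments are all hypothetical.

```python
from typing import List, Sequence


def downsample_embed(frames: Sequence[Sequence[float]], n_keep: int = 10) -> List[float]:
    """Map a variable-length sequence of feature frames to a fixed-length vector.

    Picks n_keep frames at evenly spaced positions (nearest-frame sampling)
    and concatenates them, so the output always has n_keep * frame_dim
    entries, however many frames the segment contains.
    """
    if not frames:
        raise ValueError("segment must contain at least one frame")
    n = len(frames)
    embedding: List[float] = []
    for k in range(n_keep):
        # Centre of the k-th of n_keep equal-width bins, mapped to a frame index.
        idx = min(n - 1, int((k + 0.5) * n / n_keep))
        embedding.extend(frames[idx])
    return embedding


# Hypothetical toy segments: 3-dimensional frames with different durations.
short_segment = [[float(t), 0.0, 1.0] for t in range(7)]
long_segment = [[float(t), 0.0, 1.0] for t in range(53)]

short_vec = downsample_embed(short_segment, n_keep=5)
long_vec = downsample_embed(long_segment, n_keep=5)
print(len(short_vec), len(long_vec))  # both segments yield 15-dimensional vectors
```

Because every candidate segment ends up with the same dimensionality, segments of different durations can be compared and clustered directly in one vector space, which is the property a whole-word acoustic model of this kind relies on.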

Bibliographic details

Source: Audio, Speech, and Language Processing, IEEE/ACM Transactions on | 2016, Issue 4 | pp. 669-679 | 11 pages
Authors: Herman Kamper; Aren Jansen; Sharon Goldwater
Author affiliation: Herman Kamper is with the School of Informatics, University of Edinburgh, UK (e-mail: kamperh@gmail.com).
Original format: PDF
Language: English
Keywords: Speech segmentation; unsupervised learning; unsupervised speech processing; word acquisition; word discovery

Similar literature

1. A Lexicon-Corpus-based Unsupervised Chinese Word Segmentation Approach [J]. Lu Pengyu, Pu Jingchuan, Du Mingming. International Journal on Smart Sensing and Intelligent Systems. 2014, Issue 1
2. Automatic Extraction Of New Words Based On Google News Corpora For Supporting Lexicon-based Chinese Word Segmentation Systems [J]. Chin-Ming Hong, Chih-Ming Chen, Chao-Yang Chiu. Expert Systems with Applications. 2009, Issue 2p2
3. Incorporating word embeddings in unsupervised morphological segmentation [J]. Ahmet Üstün, Burcu Can. Natural Language Engineering. 2021, Issue Pta5
4. Dimensional Sentiment Analysis for Chinese words Based on synonym lexicon and Word Embedding [C]. Wei Cheng, Yuansheng Song, Yue Zhu. International Conference on Asian Language Processing. 2016
5. Hypernym Discovery over WordNet and English Corpora - Using Hearst Patterns and Word Embeddings [D]. Vallabhajosyula, Manikya Swathi. 2018
6. Wide range screening of algorithmic bias in word embedding models using large sentiment lexicons reveals underreported bias types [O]. David Rozado. 2020
7. Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings [O]. Kamper, Herman; Jansen, Aren; Goldwater, Sharon. 2016
