In this study, we extend our previous work on automatically detecting text-dependent segmental mispronunciations by Cantonese (L1) learners of American English (L2) through modeling the L2 production. We address the problem of segmental mispronunciation modeling with joint-sequence models. Specifically, a grapheme-to-phoneme model is built to convert the prompted words directly to their possible mispronunciations, rather than characterizing phonological processes as a transfer from the canonical phonetic transcription, as in our previous approach. Experiments show that this approach captures the mispronunciations better than both knowledge-based and data-driven phonological rules.