Corpora, in their different forms for different purposes, have been the basis of modern natural language processing technology. Taiwanese (MinNan), like other members of the Sino-Tibetan language family, has been marginalized for many reasons. One consequence of this marginalization is that no standard written script exists, and thus collecting corpora for these languages has been extremely difficult. Even after (almost) arbitrarily selecting the hanlor written script (a mixture of hanzi and roman characters), we still face the problem that only a few people are capable of phonetically transcribing a given Taiwanese text. On the other hand, reading a Taiwanese text is easier because many commonly used hanzi exist. We therefore record a person reading a Taiwanese text aloud and use a continuous speech recognizer for Taiwanese to transcribe the text automatically, ending up with two kinds of corpora, one of text and one of speech. The accuracy of the automatic phonetic transcription is about 96.05%.