This paper addresses spoken document retrieval (SDR) from Japanese lectures. A lecture retrieval test collection for an ad-hoc SDR task was recently designed in Japan; it consists of 2,702 audio lectures from the Corpus of Spontaneous Japanese and 39 retrieval queries. Appropriate indexing is important for an ad-hoc task. Index terms are produced by automatic speech recognition (ASR) and therefore inevitably contain recognition errors, so index terms that are robust to ASR errors need to be studied. In addition, Japanese text has no spaces between words, which makes word boundaries ambiguous, so the choice of indexing unit is also important. Against this background, we investigate indexing units for Japanese SDR: morphemes, character N-grams, and combinations of the two. Morpheme-based indexing cannot handle cases in which only part of a word is misrecognized, which motivates indexing units based on character N-grams. Although retrieval performance improved for some queries, we did not achieve an overall improvement, and combining character N-grams with morpheme units did not work well. We did confirm that introducing stop-word criteria is important in character N-gram-based indexing.
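As an illustration of character N-gram indexing, the following minimal Python sketch builds an inverted index over overlapping character bigrams and ranks documents by the number of query N-grams they share. It is not the system evaluated here: the function names, the toy documents, the overlap-count scoring, and the document-frequency threshold used as a stop criterion (`max_df_ratio`) are all assumptions made for the example.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text (whitespace removed)."""
    chars = "".join(text.split())
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


def build_index(docs, n=2, max_df_ratio=1.0):
    """Map each n-gram to the set of document ids that contain it.

    N-grams whose document frequency exceeds max_df_ratio * len(docs) are
    dropped. This threshold is a hypothetical stop criterion, not the one
    used in the study.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            postings[gram].add(doc_id)
    limit = max_df_ratio * len(docs)
    return {g: ids for g, ids in postings.items() if len(ids) <= limit}


def search(index, query, n=2):
    """Rank documents by how many distinct query n-grams they contain."""
    scores = defaultdict(int)
    for gram in set(char_ngrams(query, n)):
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1
    # Highest overlap first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy stand-ins for ASR transcripts (assumed data).
docs = {
    "d1": "音声認識の誤り",
    "d2": "音声検索の評価",
    "d3": "講義の検索",
}
index = build_index(docs)
print(search(index, "音声認識"))
```

Because matching happens at the N-gram level, a document that shares only part of a query word (here `d2`, which shares only 音声) still receives a nonzero score, which is the kind of partial match that whole-morpheme indexing misses. Lowering `max_df_ratio` removes N-grams that occur in many documents from the index.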