Data Description

The document data (speech data) of the NTCIR-9 SpokenDoc-1 test collection is a subset of the Corpus of Spontaneous Japanese (CSJ) released by the National Institute for Japanese Language. For this test collection, two kinds of automatic transcriptions of the CSJ were prepared. Each is provided in textual form as an N-best list of word or syllable sequences, depending on which of the two background ASR systems produced it, together with the corresponding lattice and confusion network representations.

  • Word-based transcriptions obtained by using a word-based ASR system, that is, a system whose language model is a word n-gram model. The vocabulary list used by the ASR system is provided along with the textual representation.
  • Syllable-based transcriptions obtained by using a syllable-based ASR system. Its language model is a syllable n-gram model whose vocabulary consists of all Japanese syllables.
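
As a rough illustration of what an N-best list is, the sketch below holds one in memory and picks the top hypothesis. It does not reflect the actual file format of the distributed transcriptions, which is not described here. The `Hypothesis` class, the `score` field, the helper functions, and the romanized sample tokens are all hypothetical and made up for this example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Hypothesis:
    """One entry of an N-best list (hypothetical structure, not the real format)."""
    tokens: List[str]  # word or syllable sequence, depending on the ASR system
    score: float       # assumed recognition score; higher means more likely


def best_hypothesis(nbest: List[Hypothesis]) -> Hypothesis:
    """Return the highest-scoring hypothesis in the N-best list."""
    return max(nbest, key=lambda h: h.score)


def token_set(nbest: List[Hypothesis]) -> Set[str]:
    """Collect every distinct token that appears in any hypothesis."""
    return {tok for h in nbest for tok in h.tokens}


# Made-up sample data for a single utterance (not taken from the CSJ).
nbest = [
    Hypothesis(["kyou", "wa", "hare"], -12.3),
    Hypothesis(["kyou", "wa", "hari"], -13.1),
    Hypothesis(["kyo", "wa", "hare"], -14.0),
]

print(best_hypothesis(nbest).tokens)
print(sorted(token_set(nbest)))
```

Collecting tokens from all N hypotheses rather than only the best one is a common way to make use of recognition alternatives; whether that suits a given task is left to the user.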

To obtain the data

Users of the data are required to purchase the CSJ themselves. Before we can provide the reference automatic transcriptions of the CSJ, we need a written oath from you stating that you possess the CSJ. Please contact the following e-mail address.

