NTCIR-9 Core Task: "IR for Spoken Documents (SpokenDoc)"


The growth of the internet and falling storage costs are driving a rapid increase in multimedia content. The text-based tag information available for retrieving this content is limited. Spoken Document Retrieval (SDR) is a promising technology for retrieving such content using the speech data it contains. In NTCIR-9 SpokenDoc (IR for Spoken Documents), we will evaluate SDR under realistic ASR conditions, where the target documents are spontaneous speech data with a high word error rate and a high out-of-vocabulary rate.

Task Overview

We have already developed prototypes of two SDR test collections: the CSJ Spoken Term Detection test collection and the CSJ Spoken Document Retrieval test collection. The target documents of both test collections are spoken lectures in the Corpus of Spontaneous Japanese (CSJ). Two subtasks will be conducted by using (and extending) these test collections.

  • Spoken Term Detection (STD): Within spoken documents, find the positions at which a queried term occurs. The evaluation will consider both efficiency (search time) and effectiveness (precision and recall).
  • Spoken Document Retrieval (SDR): Among spoken documents, find the passages that contain information relevant to the query. This is like an ad-hoc text retrieval task, except that the target documents are speech data. The results of STD may be used to accomplish this task.
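
As a rough illustration of the effectiveness measures named for the STD subtask, the following sketch computes precision and recall for one queried term. It assumes, purely for illustration, that each detection is identified by a (lecture id, utterance index) pair and counts as correct only when the same pair appears in the reference; the matching criterion, the function name, and the identifiers are hypothetical and are not taken from the task definition.

```python
def precision_recall(detected, reference):
    """Return (precision, recall) for one queried term.

    detected and reference are collections of (lecture_id, utterance_index)
    pairs. Exact pair matching is an assumption made for this sketch.
    """
    detected = set(detected)
    reference = set(reference)
    hits = len(detected & reference)  # detections that are also in the reference
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical data: three true occurrences, four system detections.
reference = {("lecture_A", 12), ("lecture_A", 40), ("lecture_B", 7)}
detected = [("lecture_A", 12), ("lecture_B", 7), ("lecture_B", 9), ("lecture_C", 3)]

p, r = precision_recall(detected, reference)
print(f"precision={p:.3f} recall={r:.3f}")
```

Here two of the four detections are correct and two of the three true occurrences are found, so precision is 0.5 and recall is about 0.667.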

Data Set

Our target document collection is the Corpus of Spontaneous Japanese (CSJ) released by the National Institute for Japanese Language. Of the CSJ, 2702 lectures are used as the target documents for both the STD and SDR subtasks. A subset of 177 of these lectures, called CORE, is also used as a target for the STD subtask.

Participants are required to purchase the data themselves. See the CSJ website.


Standard STD and SDR methods first transcribe the audio signal into its textual representation by using Large Vocabulary Continuous Speech Recognition (LVCSR), followed by text-based retrieval. The participants can use the following three types of transcriptions.

  1. Manual transcription

    Included in the CSJ. It is mainly used for evaluating the upper-bound performance.

  2. Reference Automatic Transcriptions

    The organizers are going to prepare two reference automatic transcriptions. These allow those who are interested in SDR but not in ASR to participate in our tasks. They also allow IR methods to be compared on the basis of the same underlying ASR performance. Participants can also use both transcriptions at the same time to boost performance.

    Each transcription will be provided both as an n-best list of word or syllable sequences, depending on which of the two background ASR systems produced it, and as a lattice.

    1. Word-based transcription

      Obtained by using a word-based ASR system, that is, one whose language model is a word n-gram model. Along with the textual representation, the vocabulary list used by the ASR system is provided; it determines the distinction between in-vocabulary (IV) and out-of-vocabulary (OOV) query terms in our STD subtask.

    2. Syllable-based transcription

      Obtained by using a syllable-based ASR system. A syllable n-gram model is used as the language model, and the vocabulary consists of all Japanese syllables. Using this transcription avoids the OOV problem in spoken document retrieval. Participants who want to focus on open-vocabulary STD and SDR can use this transcription.

  3. Participant's own transcription

    Participants can use their own ASR systems for the transcription. To share the same IV and OOV conditions, word-based ASR systems are recommended, though not required, to use the same vocabulary list as our reference transcription. Participants who use their own transcription are encouraged to provide it to the organizers for future SpokenDoc test collections.
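
The IV/OOV distinction that the ASR vocabulary list induces on query terms can be sketched as follows. This is a minimal illustration under stated assumptions: a query term is given as a whitespace-separated sequence of words, and it is labeled IV only when every one of its words appears in the vocabulary list. The function name, the segmentation convention, and the English example words are hypothetical; the actual query terms are Japanese and the organizers' exact rule is not specified here.

```python
def classify_query_terms(terms, vocabulary):
    """Label each query term as "IV" or "OOV" against an ASR vocabulary.

    Assumption for this sketch: a term is IV only if all of its
    whitespace-separated words are in the vocabulary.
    """
    vocab = set(vocabulary)
    labels = {}
    for term in terms:
        words = term.split()
        labels[term] = "IV" if all(w in vocab for w in words) else "OOV"
    return labels


# Hypothetical vocabulary and query terms, in English for readability.
vocabulary = ["speech", "recognition", "lecture"]
terms = ["speech recognition", "lattice", "lecture lattice"]

labels = classify_query_terms(terms, vocabulary)
for term, label in labels.items():
    print(f"{term}: {label}")
```

With this vocabulary, "speech recognition" is labeled IV, while "lattice" and "lecture lattice" are labeled OOV because "lattice" is missing from the list.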

Task Description


2010-10: First call for participation
2011-02-23 (originally 2010-10~12): Document set release
2011-01-20 (originally 2010-12): Task registration due
2011-05-09~16 (originally 2011-1~3): Dry run
2011-06-30~07-10 (originally 2011-4~6): Formal run
2011-8: Evaluation results release
2011-11: Camera-ready copy of participant paper due
2011-12: NTCIR-9 Workshop Meeting


  • Tomoyosi Akiba (Toyohashi University of Technology)
  • Hiromitsu Nishizaki (University of Yamanashi)
  • Kiyoaki Aikawa (Tokyo University of Technology)
  • Tatsuya Kawahara (Kyoto University)
  • Tomoko Matsui (The Institute of Statistical Mathematics)


The registration form is available on the official NTCIR-9 page.


Last-modified: 2011-06-24 (Fri) 21:00:01