Large Language Models (LLMs) have demonstrated remarkable capabilities in a wide range of downstream tasks, serving as powerful foundation models for language understanding and generation. There has also been significant interest in applying LLMs to speech and audio processing tasks such as Automatic Speech Recognition (ASR) and Audio Captioning, as well as to emerging areas like Spoken Dialogue Models.
Real-world conversational speech data is critical for the development of robust LLM-based Spoken Dialogue Models, as it captures the complexity of human communication, including natural pauses, interruptions, speaker overlaps, and diverse conversational styles. However, such data is scarce, especially in multilingual settings, which poses a significant challenge to advancing the field.
The importance of real-world conversational speech extends beyond technological advancement: it is essential for building AI systems that can understand and respond naturally in multilingual, dynamic, and context-rich environments. This is especially crucial for next-generation human-AI interaction systems, where spoken dialogue serves as a primary mode of communication.
This workshop therefore aims to bridge the gap by hosting a challenge on building multilingual conversational speech language models, together with the release of a real-world multilingual conversational speech dataset.
The event consists of two tasks, both of which require participants to explore the development of speech language models:
Task 1: Multilingual Conversational Speech Recognition
Participants will be provided with oracle segmentation for each conversation.
Objective: Develop a multilingual LLM-based ASR model.
This task focuses on optimizing transcription accuracy in a multilingual setting.
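As a minimal sketch of what working with oracle segmentation can look like, the snippet below cuts a conversation into per-utterance audio chunks before they are passed to a recognizer. The `Segment` structure, the field names, and the `slice_utterances` helper are hypothetical illustrations, not the challenge's actual annotation format; the 16 kHz rate follows the sampling rate listed for the dataset.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

SAMPLE_RATE = 16_000  # assumed from the 16k sampling rate listed for the dataset


@dataclass
class Segment:
    """Hypothetical oracle segment: times in seconds plus a speaker label."""
    start: float
    end: float
    speaker: str


def slice_utterances(
    samples: Sequence[float],
    segments: List[Segment],
    sample_rate: int = SAMPLE_RATE,
) -> List[Tuple[Segment, Sequence[float]]]:
    """Return (segment, audio chunk) pairs, clipping each segment to the recording."""
    utterances = []
    for seg in segments:
        if seg.end <= seg.start:
            raise ValueError(f"segment has non-positive duration: {seg}")
        lo = max(0, round(seg.start * sample_rate))
        hi = min(len(samples), round(seg.end * sample_rate))
        utterances.append((seg, samples[lo:hi]))
    return utterances


# Two seconds of silence standing in for a real recording.
audio = [0.0] * (2 * SAMPLE_RATE)
chunks = slice_utterances(audio, [Segment(0.0, 0.5, "A"), Segment(1.0, 2.5, "B")])
chunk_lengths = [len(chunk) for _, chunk in chunks]
print(chunk_lengths)
```

Each chunk can then be transcribed independently, so that Task 1 systems can concentrate on transcription accuracy rather than on finding utterance boundaries.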
Task 2: Multilingual Conversational Speech Diarization and Recognition
No prior or oracle information will be provided during evaluation (e.g., no pre-segmented utterances or speaker labels).
Objective: Develop a system for both speaker diarization (identifying who is speaking when) and recognition (transcribing speech to text).
Both pipeline-based and end-to-end systems are encouraged, providing flexibility in system design and implementation.
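For the pipeline-based option, one common way to combine the two stages is to assign each recognized word to the diarization turn it overlaps most. The sketch below illustrates that idea only; the tuple layouts, the `attribute_words` helper, and the availability of word-level timestamps from the recognizer are all assumptions, not part of the challenge specification or baseline.

```python
from typing import List, Tuple

Turn = Tuple[float, float, str]  # hypothetical diarization turn: (start, end, speaker)
Word = Tuple[float, float, str]  # hypothetical ASR word: (start, end, text)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def attribute_words(turns: List[Turn], words: List[Word]) -> List[Tuple[str, str]]:
    """Label each word with its best-overlapping speaker and merge consecutive words."""
    transcript: List[Tuple[str, str]] = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = "unknown", 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(w_start, w_end, t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        if transcript and transcript[-1][0] == best_speaker:
            # Joining with a space is a simplification; languages written
            # without spaces (e.g. Japanese, Thai) would need different handling.
            transcript[-1] = (best_speaker, transcript[-1][1] + " " + text)
        else:
            transcript.append((best_speaker, text))
    return transcript


turns = [(0.0, 1.0, "spk1"), (1.0, 2.0, "spk2")]
words = [(0.1, 0.4, "hello"), (0.5, 0.9, "there"), (1.1, 1.5, "hi")]
result = attribute_words(turns, words)
print(result)
```

An end-to-end system would instead produce speaker-attributed text directly, without the separate diarization and alignment steps shown here.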
Participants are encouraged to submit research papers and system descriptions that showcase innovative findings, practical case studies, and forward-looking ideas. Topics of interest include, but are not limited to:
Novel architectures and algorithms for training speech language models.
Novel pipelines for processing raw audio data, which are useful for collecting diverse internet data for training speech language models.
Algorithms designed to generate more natural and emotionally rich conversational speech for dialogue systems.
Approaches to leverage multi-turn conversational history to improve recognition and diarization results.
Innovative evaluation techniques or benchmarks for speech language models.
New datasets (real and synthetic) for training speech and audio language models.
The challenge dataset comprises 11 languages: English (en), French (fr), German (de), Italian (it), Portuguese (pt), Spanish (es), Japanese (jp), Korean (ko), Russian (ru), Thai (th), and Vietnamese (vi).
Each set consists of two-speaker conversational speech on randomly assigned topics.
Conversations are natural and fluent, with speakers engaging in meaningful dialogues on each topic.
Recordings were made in quiet indoor environments using devices such as iPhones.
The English dataset comprises approximately 500 hours of recordings covering several regional varieties: British, American, Australian, Indian, and Philippine English. The other languages contribute around 100 hours each, resulting in a total of approximately 1500 hours of multilingual conversational speech data.
| Language | Data Volume (h) | Language Classification | Sampling Rate | Description |
|---|---|---|---|---|
| English | 500 | | 16 kHz | Covers five English accents, with speakers from the United States, the United Kingdom, the Philippines, Australia, and India. Speakers vary in gender and age, the conversation style is natural, and the accuracy of the annotated words is 98%. |
| | 100 | American English | 16 kHz | |
| | 100 | British English | 16 kHz | |
| | 100 | Filipino English | 16 kHz | |
| | 100 | Australian English | 16 kHz | |
| | 100 | Indian English | 16 kHz | |
| French | 100 | | 16 kHz | Recorded on a mobile phone. The speakers select several familiar topics and record a smooth, natural conversation on each. Speakers vary in gender and age, and the accuracy of the annotated words is 98%. |
| German | 100 | | 16 kHz | Recorded on a mobile phone. The speakers select several familiar topics and record a smooth, natural conversation on each. Speakers vary in gender and age, and the accuracy of the annotated words is 95%. |
| Italian | 100 | | 16 kHz | Recorded on a mobile phone. The speakers select several familiar topics and record a smooth, natural conversation on each. Speakers vary in gender and age, and the accuracy of the annotated words is 98%. |
| Japanese | 100 | | 16 kHz | Recorded on a mobile phone. The speakers select several familiar topics and record a smooth, natural conversation on each. Speakers vary in gender and age, and the accuracy of the annotated words is 95%. |
| Korean | 100 | | 16 kHz | Recorded on a mobile phone. The speakers select several familiar topics and record a smooth, natural conversation on each. Speakers vary in gender and age, and the accuracy of the annotated words is 95%. |
| Portuguese (Europe) | 100 | | 16 kHz | Recorded on a mobile phone. The speakers select several familiar topics and record a smooth, natural conversation on each. Speakers vary in gender and age, and the accuracy of the annotated words is 98%. |
| Russian | 100 | | 16 kHz | Recorded on a mobile phone. The speakers select several familiar topics and record a smooth, natural conversation on each. Speakers vary in gender and age, and the accuracy of the annotated words is 98%. |
| Spanish (Spain) | 100 | | 16 kHz | Recorded on a mobile phone. The speakers select several familiar topics and record a smooth, natural conversation on each. Speakers vary in gender and age, and the accuracy of the annotated words is 98%. |
| Thai | 100 | | 16 kHz | Recorded on a mobile phone. The speakers select several familiar topics and record a smooth, natural conversation on each. Speakers vary in gender and age, and the accuracy of the annotated words is 97%. |
| Vietnamese | 100 | | 16 kHz | Recorded on a mobile phone. The speakers select several familiar topics and record a smooth, natural conversation on each. Speakers vary in gender and age, and the accuracy of the annotated words is 98%. |
This dataset is designed to provide a rich resource for training and evaluating multilingual conversational speech language models, addressing the challenges of linguistic diversity, speaker variability, and contextual understanding.
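The hour figures quoted in the dataset description can be cross-checked with a short tally. This is only an arithmetic sketch: the dictionary names are illustrative, and the values are the approximate per-language hours stated above rather than exact measured durations.

```python
# Approximate hours as quoted in the dataset description (not measured values).
ENGLISH_ACCENT_HOURS = {
    "American": 100,
    "British": 100,
    "Filipino": 100,
    "Australian": 100,
    "Indian": 100,
}
OTHER_LANGUAGE_HOURS = {
    "French": 100, "German": 100, "Italian": 100, "Japanese": 100,
    "Korean": 100, "Portuguese": 100, "Russian": 100, "Spanish": 100,
    "Thai": 100, "Vietnamese": 100,
}


def total_hours():
    """Return (English hours, other-language hours, overall total)."""
    english = sum(ENGLISH_ACCENT_HOURS.values())
    others = sum(OTHER_LANGUAGE_HOURS.values())
    return english, others, english + others


english, others, total = total_hours()
n_languages = 1 + len(OTHER_LANGUAGE_HOURS)
print(english, others, total, n_languages)  # prints: 500 1000 1500 11
```

The tally agrees with the stated totals: five English accents at about 100 hours each give 500 hours, and the ten other languages add about 1000 hours.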
February 20, 2025: Registration opens
March 10, 2025: Training data release
March 17, 2025: Development set and baseline system release
May 15, 2025: Evaluation set release and leaderboard open
June 1, 2025: Leaderboard freeze and submission portal opens (CMT system)
June 20, 2025: Submission deadline
July 10, 2025: Notification of acceptance
August 22, 2025: Workshop date
Lei Xie, Professor, Northwestern Polytechnical University (China)
Shinji Watanabe, Associate Professor, Carnegie Mellon University (USA)
Eng Siong Chng, Associate Professor, Nanyang Technological University (Singapore)
Junlan Feng, IEEE Fellow & Chief Scientist, China Mobile (China)
Khalid Choukri, Secretary General, European Language Resources Association (France)
Qiangze Feng, Co-founder & Data Scientist, Nexdata (USA)
Daliang Wang, Data Scientist, Nexdata (USA)
Pengcheng Guo, PhD Student, Northwestern Polytechnical University (China)
Bingshen Mu, PhD Student, Northwestern Polytechnical University (China)