Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of downstream tasks, serving as powerful foundation models for language understanding and generation. Recently, there has been significant interest in applying LLMs to speech and audio processing tasks, including Automatic Speech Recognition (ASR), Audio Captioning, and emerging areas such as Spoken Dialogue Models.
However, the development of robust LLM-based Spoken Dialogue Models relies heavily on real-world conversational speech data, which encapsulates the complexity of human communication, including natural pauses, interruptions, speaker overlaps, and diverse conversational styles. The scarcity of such data, especially in multilingual contexts, poses a significant challenge to advancing the field.
The importance of real-world conversational speech extends beyond technological advancement—it is essential for building AI systems that can understand and respond naturally in multilingual, dynamic, and context-rich environments. This is especially crucial for next-generation human-AI interaction systems, where spoken dialogue serves as a primary mode of communication.
Thus, this workshop aims to bridge the gap by hosting the challenge of building multilingual conversational speech language models (MLC-SLM) together with the release of a real-world multilingual conversational speech dataset.
The challenge consists of two tasks, both of which require participants to explore the development of speech language models (SLMs):
Task I: Multilingual Conversational Speech Recognition
Objective: Develop a multilingual LLM-based ASR model.
Participants will be provided with oracle segmentation and speaker labels for each conversation; a sketch of slicing recordings with these labels appears after the task descriptions.
This task focuses on optimizing recognition accuracy in a multilingual conversational setting.
Task II: Multilingual Conversational Speech Diarization and Recognition
Objective: Develop a system for both speaker diarization (identifying who is speaking when) and recognition (transcribing speech to text).
No prior or oracle information will be provided during evaluation (e.g., no pre-segmented utterances or speaker labels).
Both pipeline-based and end-to-end systems are encouraged, providing flexibility in system design and implementation.
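For Task I, a natural first step is to cut each long conversation into utterances using the provided oracle segments. Below is a minimal sketch assuming segments arrive as (start_sec, end_sec, speaker_id) tuples and using the soundfile package; the tuple layout, file names, and the `asr_model` reference are illustrative assumptions, not the challenge's actual annotation format.

```python
# Minimal sketch: slice a conversation into per-utterance clips using
# oracle segmentation. The (start, end, speaker) tuple layout is an
# assumption; adapt it to the released annotation format.
import soundfile as sf

def slice_utterances(wav_path, segments):
    """Yield (speaker_id, waveform) pairs for each oracle segment."""
    audio, sr = sf.read(wav_path)
    for start, end, speaker in segments:
        yield speaker, audio[int(start * sr):int(end * sr)]

# Example: feed each clip to your LLM-based ASR model.
for spk, clip in slice_utterances("conversation.wav",
                                  [(0.0, 3.2, "S1"), (3.4, 7.1, "S2")]):
    pass  # hyp = asr_model.transcribe(clip)  # `asr_model` is hypothetical
```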
For Task I, system performance will be evaluated using Word Error Rate (WER) or Character Error Rate (CER) across different languages.
For Task II, performance will be assessed based on the Diarization Error Rate (DER) and the concatenated minimum-permutation WER or CER, referred to as cpWER or cpCER. The DER is used to determine the best speaker-ID permutation between the oracle annotation and the diarization results. The recognition results and references belonging to the same speaker within a recording are then concatenated to compute the cpWER or cpCER. All submissions will be ranked according to the cpWER or cpCER.
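Concretely, WER = (S + D + I) / N, where S, D, and I count substitutions, deletions, and insertions against a reference of N words. The following self-contained sketch shows how cpWER can be scored, assuming reference and hypothesis transcripts have already been concatenated into one string per speaker; it is illustrative only, not the official scoring tool (for cpCER, split into characters instead of words).

```python
from itertools import permutations

def word_errors(ref, hyp):
    """Levenshtein edit distance over word lists (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution / match
        prev = cur
    return prev[-1]

def cp_wer(ref_by_spk, hyp_by_spk):
    """Concatenated minimum-permutation WER over speaker assignments."""
    refs = [r.split() for r in ref_by_spk.values()]
    hyps = [h.split() for h in hyp_by_spk.values()]
    n = max(len(refs), len(hyps))          # pad if speaker counts differ
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    total_ref = sum(len(r) for r in refs)
    best = min(
        sum(word_errors(refs[i], hyps[p[i]]) for i in range(n))
        for p in permutations(range(n))
    )
    return best / total_ref if total_ref else 0.0

# Speaker labels need not match; the best permutation is found automatically.
print(cp_wer({"A": "hello there", "B": "good morning"},
             {"spk0": "good morning", "spk1": "hello there"}))  # -> 0.0
```

Since recordings here are two-speaker conversations, the permutation search is trivial; the brute-force enumeration above would only become costly for recordings with many speakers.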
March 10, 2025: Registration opens
March 15, 2025: Training data release
April 1, 2025: Development set and baseline system release
May 15, 2025: Evaluation set release and leaderboard open
May 30, 2025: Leaderboard freeze and paper submission portal opens (CMT system)
June 15, 2025: Paper submission deadline
July 1, 2025: Notification of acceptance
August 22, 2025: Workshop date
The training set (Train) covers 11 languages: English (en), French (fr), German (de), Italian (it), Portuguese (pt), Spanish (es), Japanese (jp), Korean (ko), Russian (ru), Thai (th), Vietnamese (vi).
Each recording consists of two-speaker conversational speech on randomly assigned topics.
Conversations are natural and fluent, with speakers engaging in meaningful dialogues on each topic.
Recorded in quiet indoor environments using devices such as iPhones.
Each recording will provide the oracle segmentation and speaker label for the development of speech recognition and speaker diarization systems.
Both Task I and Task II share the same training set.
The English dataset comprises approximately 500 hours of recordings from various regions, including British, American, Australian, Indian, and Philippine English. Other languages contribute around 100 hours each, resulting in a total of approximately 1500 hours of multilingual conversational speech data.
This dataset is designed to provide a rich resource for training and evaluating multilingual conversational speech language models (MLC-SLM), addressing the challenges of linguistic diversity, speaker variability, and contextual understanding.
Language | Data Volume (h) | Language Classification | Sampling Rate | Description |
---|---|---|---|---|
English | 500 | 5 accents (see below) | 16 kHz | Covers five accents of English, with speakers from the United States, the United Kingdom, the Philippines, Australia, and India. Diverse genders and ages; natural conversation style. The word error rate is lower than 2%. |
 | 100 | American English | 16 kHz | |
 | 100 | British English | 16 kHz | |
 | 100 | Filipino English | 16 kHz | |
 | 100 | Australian English | 16 kHz | |
 | 100 | Indian English | 16 kHz | |
French | 100 | | 16 kHz | Recorded on a mobile phone; the recorder selects several familiar topics and records a smooth, natural conversation on each. Speakers span various genders and ages. The word error rate is lower than 2%. |
German | 100 | | 16 kHz | Recorded on a mobile phone; the recorder selects several familiar topics and records a smooth, natural conversation on each. Speakers span various genders and ages. The word error rate is lower than 2%. |
Italian | 100 | | 16 kHz | Recorded on a mobile phone; the recorder selects several familiar topics and records a smooth, natural conversation on each. Speakers span various genders and ages. The word error rate is lower than 2%. |
Japanese | 100 | | 16 kHz | Recorded on a mobile phone; the recorder selects several familiar topics and records a smooth, natural conversation on each. Speakers span various genders and ages. The sentence error rate is lower than 5%. |
Korean | 100 | | 16 kHz | Recorded on a mobile phone; the recorder selects several familiar topics and records a smooth, natural conversation on each. Speakers span various genders and ages. The sentence error rate is lower than 5%. |
Portuguese (Europe) | 100 | | 16 kHz | Recorded on a mobile phone; the recorder selects several familiar topics and records a smooth, natural conversation on each. Speakers span various genders and ages. The word error rate is lower than 2%. |
Russian | 100 | | 16 kHz | Recorded on a mobile phone; the recorder selects several familiar topics and records a smooth, natural conversation on each. Speakers span various genders and ages. The word error rate is lower than 2%. |
Spanish (Spain) | 100 | | 16 kHz | Recorded on a mobile phone; the recorder selects several familiar topics and records a smooth, natural conversation on each. Speakers span various genders and ages. The word error rate is lower than 2%. |
Thai | 100 | | 16 kHz | Recorded on a mobile phone; the recorder selects several familiar topics and records a smooth, natural conversation on each. Speakers span various genders and ages. The word error rate is lower than 3%. |
Vietnamese | 100 | | 16 kHz | Recorded on a mobile phone; the recorder selects several familiar topics and records a smooth, natural conversation on each. Speakers span various genders and ages. The word error rate is lower than 2%. |
The development set (Dev) has the same setting as the training set but contains approximately 4 hours of recordings for each language. Both Task I and Task II share the same development set.
Different evaluation sets are employed for each task, designated as Eval_1 and Eval_2. Specifically, Eval_1 includes oracle timestamps and speaker labels, which are evaluated using WER/CER. Eval_2 does not provide timestamps or speaker labels, necessitating a speaker diarization (SD) system to segment the longer recordings before recognition.
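For Eval_2, a simple pipeline-based system could first run an off-the-shelf diarizer and then transcribe each hypothesized speaker turn. A hedged sketch using pyannote.audio follows; the diarizer is one possible choice rather than a challenge requirement, and the model name, authentication token, and `transcribe` stub are assumptions standing in for your own components.

```python
# Sketch of a pipeline approach to Task II / Eval_2: diarize first, then
# recognize each turn. pyannote.audio is one possible diarizer; the model
# name and token are assumptions, and `transcribe` is a placeholder for
# your own LLM-based ASR front end.
from pyannote.audio import Pipeline

def transcribe(wav_path: str, start: float, end: float) -> str:
    """Placeholder: run your ASR model on the given time span."""
    return ""

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # hypothetical token placeholder
)
diarization = pipeline("long_recording.wav")

hyp_by_spk = {}
for turn, _, speaker in diarization.itertracks(yield_label=True):
    text = transcribe("long_recording.wav", turn.start, turn.end)
    hyp_by_spk.setdefault(speaker, []).append(text)

# Concatenate per speaker, then score with cpWER/cpCER as described above.
hyp_by_spk = {spk: " ".join(texts) for spk, texts in hyp_by_spk.items()}
```

An end-to-end system that jointly predicts speaker turns and transcripts is equally valid; only the final per-speaker transcripts matter for scoring.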
Participants can access the dataset by signing the Data use agreement and submitting the registration form. After submission, the data download link will be sent to your email.
All participants must adhere to the following rules to be eligible for the challenge.
In addition to challenge system descriptions, participants are encouraged to submit research papers that showcase innovative findings, practical case studies, and forward-looking ideas. Topics of interest include, but are not limited to:
Registered participants will be given access to the training and testing datasets. They must sign a data use agreement (see below), agree to confidentiality and comply with the data protection agreement. The datasets will only be used for the purpose of the workshop challenge, and redistribution or any other use is strictly prohibited. It is the responsibility of the participant to protect the data from unauthorized access.
To participate, registration is required. Please upload the signed Data use agreement and complete the registration form. The challenge begins on March 10, 2025.
For any other information about registration, please send an email to: [email protected]
Will be released soon.
TOTAL PRIZE FUND: $20,000
Prizes for top-ranking teams in this competition (each task):
Rotterdam Ahoy Convention Centre, Rotterdam, Netherlands
Non-member registration: € 60
Non-member student registration: € 45
ISCA member registration: € 50
ISCA student registration: € 35
Official Email: [email protected]
Slack: https://join.slack.com/t/mlc-slm-challenge/shared_invite/zt-314nfsmhz-QjOJjhjK3OHYUtJyBRtPxA
Lei Xie, Professor, Northwestern Polytechnical University (China)
Shinji Watanabe, Associate Professor, Carnegie Mellon University (USA)
Eng Siong Chng, Professor, Nanyang Technological University (Singapore)
Junlan Feng, IEEE Fellow & Chief Scientist, China Mobile (China)
Khalid Choukri, Secretary General, European Language Resources Association (France)
Qiangze Feng, Co-founder & Data Scientist, Nexdata (USA)
Daliang Wang, Data Scientist, Nexdata (USA)
Hexin Liu, Postdoctoral Researcher, Nanyang Technological University (Singapore)
Pengcheng Guo, PhD Student, Northwestern Polytechnical University (China)
Bingshen Mu, PhD Student, Northwestern Polytechnical University (China)
Zhaokai Sun, Master Student, Northwestern Polytechnical University (China)