
Motivation

Large Language Models (LLMs) have demonstrated remarkable capabilities in a wide range of downstream tasks, serving as powerful foundation models for language understanding and generation. Furthermore, there has been significant attention on utilizing LLMs in speech and audio processing tasks such as Automatic Speech Recognition (ASR), Audio Captioning, and emerging areas like Spoken Dialogue Models.

Real-world conversational speech data is critical for the development of robust LLM-based spoken dialogue models, as it encapsulates the complexity of human communication, including natural pauses, interruptions, speaker overlaps, and diverse conversational styles. However, the limited availability of such data, especially in multilingual settings, poses a significant challenge to advancing the field.

The importance of real-world conversational speech extends beyond technological advancement—it is essential for building AI systems that can understand and respond naturally in multilingual, dynamic, and context-rich environments. This is especially crucial for next-generation human-AI interaction systems, where spoken dialogue serves as a primary mode of communication.

Thus, this workshop aims to bridge the gap by hosting the challenge of building multilingual conversational speech language models together with the release of a real-world multilingual conversational speech dataset.

Task Setting

The event consists of two tasks, both of which require participants to explore the development of speech language models:

Task 1: Multilingual Conversational Speech Recognition

Participants will be provided with oracle segmentation for each conversation.

Objective: Develop a multilingual LLM-based ASR model.

This task focuses on optimizing transcription accuracy in a multilingual setting.
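Transcription accuracy of this kind is typically reported as word error rate (WER), the word-level edit distance between reference and hypothesis divided by the reference length. The exact challenge metric is defined by the organizers; the `wer` helper below is only an illustrative sketch of the standard computation, not a challenge-provided tool.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / #reference words,
    computed via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(r)][len(h)] / len(r)

print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # 1 deletion / 6 words -> 0.167
```

For languages without whitespace word boundaries (e.g., Japanese, Thai), character error rate or a tokenizer-based variant is often used instead; the same edit-distance machinery applies at the character level.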

Task 2: Multilingual Conversational Speech Diarization and Recognition

No prior or oracle information will be provided during evaluation (e.g., no pre-segmented utterances or speaker labels).

Objective: Develop a system for both speaker diarization (identifying who is speaking when) and speech recognition (transcribing speech to text).

Both pipeline-based and end-to-end systems are encouraged, providing flexibility in system design and implementation.
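Diarization quality is commonly scored with diarization error rate (DER): missed speech, false-alarm speech, and speaker confusion, summed and divided by total reference speech time. The sketch below is a toy frame-based version that assumes reference and hypothesis speaker names are already aligned and that at most one speaker is active per frame; real scorers additionally find the optimal speaker mapping, apply a forgiveness collar, and handle overlapped speech. The `der` and `frame_labels` helpers are hypothetical, not part of any official toolkit.

```python
def frame_labels(segments, n_frames, step=0.1):
    """Map (start, end, speaker) segments onto a discrete frame grid.
    Assumes at most one active speaker per frame (no overlap)."""
    labels = [None] * n_frames
    for start, end, spk in segments:
        for f in range(round(start / step), min(round(end / step), n_frames)):
            labels[f] = spk
    return labels

def der(ref_segs, hyp_segs, duration, step=0.1):
    """Toy diarization error rate:
    (missed + false alarm + confusion) / total reference speech,
    with speaker names assumed pre-aligned between ref and hyp."""
    n = round(duration / step)
    ref = frame_labels(ref_segs, n, step)
    hyp = frame_labels(hyp_segs, n, step)
    miss = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    fa   = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    conf = sum(1 for r, h in zip(ref, hyp)
               if r is not None and h is not None and r != h)
    speech = sum(1 for r in ref if r is not None)
    return (miss + fa + conf) / speech

ref = [(0.0, 4.0, "A"), (4.0, 8.0, "B")]
hyp = [(0.0, 5.0, "A"), (5.0, 8.0, "B")]
print(der(ref, hyp, 8.0))  # 1 s of speaker confusion over 8 s of speech = 0.125
```

A pipeline system would feed the diarized segments into the Task 1 recognizer, while an end-to-end system predicts speaker-attributed transcripts directly; both are then scored jointly on diarization and transcription quality.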

Other Topics

Participants are encouraged to submit research papers and system descriptions that showcase innovative findings, practical case studies, and forward-looking ideas. Topics of interest include, but are not limited to:

    Novel architectures and algorithms for training speech language models.

    Novel pipelines for processing raw audio data, which are useful for collecting diverse internet data for training speech language models.

    Algorithms designed to generate more natural and emotionally rich conversational speech for dialogue systems.

    Approaches to leverage multi-turn conversational history to improve recognition and diarization results.

    Innovative evaluation techniques or benchmarks for speech language models.

    New datasets (real and synthetic) for training speech and audio language models.

Dataset Description

The challenge dataset covers 11 languages: English (en), French (fr), German (de), Italian (it), Portuguese (pt), Spanish (es), Japanese (jp), Korean (ko), Russian (ru), Thai (th), and Vietnamese (vi).

    Each set consists of two-speaker conversational speech on randomly assigned topics.

    Conversations are natural and fluent, with speakers engaging in meaningful dialogues on each topic.

    Recorded in quiet indoor environments using consumer devices such as iPhones.

    The English dataset comprises approximately 500 hours of recordings from various regions, including British, American, Australian, Indian, and Philippine English. Other languages contribute around 100 hours each, resulting in a total of approximately 1500 hours of multilingual conversational speech data.

| Language | Data Volume (h) | Language Classification | Sampling Rate | Description |
|---|---|---|---|---|
| English | 500 | | 16 kHz | Covers five English accents, with speakers from the United States, the United Kingdom, the Philippines, Australia, and India; diverse genders and ages; natural conversational style; 98% word-level annotation accuracy. |
| | 100 | American English | 16 kHz | |
| | 100 | British English | 16 kHz | |
| | 100 | Filipino English | 16 kHz | |
| | 100 | Australian English | 16 kHz | |
| | 100 | Indian English | 16 kHz | |
| French | 100 | | 16 kHz | Recorded on mobile phones; speakers choose several familiar topics and hold a smooth, natural conversation on each; varied genders and ages; 98% word-level annotation accuracy. |
| German | 100 | | 16 kHz | Same recording setup; 95% word-level annotation accuracy. |
| Italian | 100 | | 16 kHz | Same recording setup; 98% word-level annotation accuracy. |
| Japanese | 100 | | 16 kHz | Same recording setup; 95% word-level annotation accuracy. |
| Korean | 100 | | 16 kHz | Same recording setup; 95% word-level annotation accuracy. |
| Portuguese (Europe) | 100 | | 16 kHz | Same recording setup; 98% word-level annotation accuracy. |
| Russian | 100 | | 16 kHz | Same recording setup; 98% word-level annotation accuracy. |
| Spanish (Spain) | 100 | | 16 kHz | Same recording setup; 98% word-level annotation accuracy. |
| Thai | 100 | | 16 kHz | Same recording setup; 97% word-level annotation accuracy. |
| Vietnamese | 100 | | 16 kHz | Same recording setup; 98% word-level annotation accuracy. |

This dataset is designed to provide a rich resource for training and evaluating multilingual conversational speech language models, addressing the challenges of linguistic diversity, speaker variability, and contextual understanding.

Important Dates

    February 20, 2025: Registration opens

    March 10, 2025: Training data release

    March 17, 2025: Development set and baseline system release

    May 15, 2025: Evaluation set release and leaderboard open

    June 1, 2025: Leaderboard freezes and submission portal opens (CMT system)

    June 20, 2025: Submission deadline

    July 10, 2025: Notification of acceptance

    August 22, 2025: Workshop date

Organizers

    Lei Xie, Professor, Northwestern Polytechnical University (China)

    Shinji Watanabe, Associate Professor, Carnegie Mellon University (USA)

    Eng Siong Chng, Associate Professor, Nanyang Technological University (Singapore)

    Junlan Feng, IEEE Fellow & Chief Scientist, China Mobile (China)

    Khalid Choukri, Secretary General, European Language Resources Association (France)

    Qiangze Feng, Co-founder & Data Scientist, Nexdata (USA)

    Daliang Wang, Data Scientist, Nexdata (USA)

    Pengcheng Guo, PhD Student, Northwestern Polytechnical University (China)

    Bingshen Mu, PhD Student, Northwestern Polytechnical University (China)

Sponsors
