m.nexdata.datatang.com

High-Quality Training Datasets

Boost the performance of your AI models with our high-quality, ready-to-use training datasets.

Face Anti-Spoofing & Liveness Detection Dataset – 70 People (2D & 3D)

This dataset covers 70 subjects of multiple races, collected in both indoor and outdoor scenes. Subjects include males and females aged 18 to 50. Capture devices include cellphones, cameras, and multiple iPhone models (iPhone X or more advanced models). The data varies across devices, actions, facial poses, anti-spoofing samples, lighting conditions, and scenes. It can be used for tasks such as 2D liveness detection, 3D liveness detection, face anti-spoofing, 2D face recognition, and 3D face recognition.
face liveness detection dataset face anti spoofing dataset liveness detection dataset face spoofing dataset 3D face dataset 2D face dataset

500 Hours - Japanese (Japan) 48 kHz Full-Duplex Spontaneous Dialogue Smartphone Speech Dataset

Japanese (Japan) 48 kHz full-duplex spontaneous dialogue smartphone speech dataset, collected from dialogues on given topics. Transcriptions include text content, speaker ID, gender, age, and other attributes. The data was collected from a large, geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All of our datasets comply with GDPR, CCPA, and PIPL.
Japanese Japan Dialogue Full-Duplex 48khz

Chinese Expressive Narration Speech Synthesis Dataset – 4 Speakers

This Chinese expressive narration speech synthesis dataset was recorded by 4 professional character voice actors. Working from book-based content, the speakers read in a highly expressive narration style suitable for audiobook-like TTS generation. The dataset supports expressive TTS, storytelling voice models, audiobook synthesis, and emotion-rich speech generation.
Chinese speech synthesis dataset Mandarin speech dataset expressive narration TTS dataset Chinese expressive speech dataset narration speech corpus Chinese audiobook dataset character voice speech dataset

Chinese Emotional Speech Dataset – 5 Speakers, Multi-Style Voices

This is a Chinese speech synthesis dataset recorded by 5 professional character-voice actors. It covers multiple speaking styles (e.g. authoritative female boss, straightforward prince, nimble maid, kind elderly woman) and a range of emotions and vocal expressions: disdain, anger, happiness, concern, surprise, gasp of fear, cold snort (disdain), sympathy, laughter, inner thoughts, seriousness, disgust, puzzlement, sadness, and neutrality. The dataset is suited to building expressive text-to-speech (TTS), voice acting, character-based narration, emotion-aware speech generation, and related AI voice applications.
Chinese speech synthesis dataset Mandarin speech dataset narration speech corpus Chinese expressive speech dataset character voice speech dataset Chinese TTS dataset

601 Hours - Spanish (Argentina) Real-World Casual Conversation and Monologue Speech Dataset

Spanish (Argentina) real-world casual conversation and monologue speech dataset, covering self-media, conversation, variety shows, and other generic domains that mirror real-world interactions. Transcriptions include text content, speaker ID, gender, and other attributes. The data was collected from a large, geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All of our datasets comply with GDPR, CCPA, and PIPL.
Spanish Casual Conversation ASR Argentina

10 Hours - Brazilian Portuguese Speech Synthesis Corpus

A 10-hour Brazilian Portuguese speech synthesis corpus recorded by native Brazilian speakers. The content is drawn from the customer service domain, and phoneme coverage is balanced. Annotation is done with professional phonetician involvement. The corpus is designed to match the research and development needs of speech synthesis.
Brazilian Portuguese TTS Female

288 Million 3D Models & Scenes Dataset for AI and Simulation

This massive 3D models and scenes dataset includes 270 million sets of 3D models and 18 million 3D scenes. The 3D models cover conventional models, interactive models, and physics-enhanced models of various objects in indoor residential environments. The 3D scenes cover indoor home decoration scenarios and commercial space environments. The dataset can be used for tasks such as 3D asset generation, virtual environment simulation, AI model training, and industrial design applications.
3D models dataset 3D scenes dataset indoor 3D environment dataset commercial 3D space dataset physics-enhanced 3D models interactive 3D models dataset 3D assets generation dataset simulation training environment dataset virtual environment 3D data large-scale 3D AI dataset

INTERSPEECH 2025 MLC-SLM Challenge Dataset

The INTERSPEECH 2025 MLC-SLM Challenge Dataset, curated by Nexdata, is derived from fifteen proprietary conversational speech corpora. Distinguished by exceptional annotation accuracy and operational reliability, this dataset is engineered to address critical challenges in multilingual automatic speech recognition (ASR) and long-context comprehension. It meticulously replicates real-world complexities including spontaneous interruptions and speaker overlaps across 11 languages (1500 hours total duration), thereby providing robust training resources for developing world-ready ASR systems. All data collection and processing strictly comply with international privacy regulations including GDPR, CCPA and PIPL, with rigorous protocols ensuring participant anonymity and ethical data usage throughout the lifecycle.
workshop audio dataset mlc-slm dataset ASR speech recognition data

4600 Hours - Mandarin Full-Duplex Multi-Channel Speech Dataset

This 4600-hour Mandarin full-duplex multi-channel speech dataset was collected from dialogues on given topics. Transcriptions include text content, speaker ID, gender, age, and other attributes. The data was collected from a large, geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All of our datasets comply with GDPR, CCPA, and PIPL.
Mandarin speech dataset multi-stream Mandarin audio data conversational Mandarin corpus Chinese voice dataset full-duplex speech dataset multi-stream speech dataset multi-channel audio dataset

1.5 Million English STEM Test Questions Dataset – Science and Engineering Subjects

This dataset contains 1.5 million English science and engineering test questions, covering mathematics, physics, chemistry, biology, and other STEM subjects at the university level. Each question contains the fields title, answer, parse, type, subject, and grade. The dataset can be used to enhance the subject knowledge of large models.
Question-answer dataset Question processing dataset Labeled STEM exam dataset Large-scale test question dataset English STEM test question dataset

119 Hours Greek Speech Dataset - Scripted Monologue for ASR & TTS

This dataset contains 119 hours of Greek monologues read from given scripts, transcribed with text content. The data was collected from a large, geographically diverse pool of speakers (95 people in total, from Greece), which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All of our datasets comply with GDPR, CCPA, and PIPL.
Greek scripted monologue dataset Greek speech dataset Greek audio dataset Greek speech data Greek speech synthesis data

280 Hours Norwegian Speech Dataset - Scripted Monologue for ASR & TTS

This dataset contains 280 hours of Norwegian monologues read from given scripts, transcribed with text content. The data was collected from a large, geographically diverse pool of speakers (157 people in total, from Norway), which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All of our datasets comply with GDPR, CCPA, and PIPL.
Norwegian scripted monologue dataset Norwegian speech dataset Norwegian speech synthesis data Norwegian NLP corpus Norwegian speech data Norwegian audio dataset