20 Hours Japanese TTS Dataset – Native Japanese Voice Corpus

Japanese speech dataset

Japanese TTS dataset

Japanese speech synthesis corpus

Japanese voice dataset for AI

native Japanese speech dataset

Japanese text-to-speech dataset

balanced phoneme Japanese corpus

This dataset contains recordings from 2 native Japanese speakers with authentic accents; each speaker contributes 10 hours of audio. The corpus covers news and colloquial-style general text, with balanced phoneme coverage. A professional phonetician participated in the annotation. It is suited to building Japanese text-to-speech systems, speech synthesis research, and AI voice applications.

This is a paid dataset for commercial use, research purposes, and more. Licensed, ready-made datasets help jump-start AI projects.

Sample

あなた達#3、市役所の人#4！
a(L) . n a(H) . t a(L) # t a(L) . ch i(L) / sh i(L) . ya(H) . k u(H) . s yo(H) # n o(H) # h i(L) . t o(H)

何か#3、お経みてえな#1歌だな#4。
n a(H) . N(L) . k a(L) / o(L) . k yo:(HH) # m i(H) . t e:(LL) # n a(H) / u(L) . t a(H) # d a(L) # n a(L)

この人の#1遺品の#1中から#3、父の#1手帳は#1見つかったの#4。
k o(L) . n o(H) # h i(L) . t o(L) # n o(L) / i(L) . h i(H) . N(H) # n o(H) / n a(H) . k a(L) # k a(L) . r a(L) / ch i(L) . ch i(H) # n o(L) / t e(L) . c yo:(HH) # w a(H) / m i(L) . ts u(H) . k a(H) . T(H) . t a(H) # n o(L)

はい#3、こちら#3、お願いします#4。
h a(H) . i(L) / k o(L) . ch i(H) . r a(H) / o(L) # n e(H) . g a(H) . i(H) # sh i(L) . m a(H) . s u(L)
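The samples above pair a text line carrying `#N` break markers with a phoneme line carrying pitch tags. The page does not document the notation, so the following Python sketch rests on assumptions inferred only from the four samples: `/` separates phrases, `#` separates words, `.` separates morae, and each mora ends in a pitch tag such as `(L)`, `(H)`, `(HH)` or `(LL)`. The function names `break_levels` and `parse_transcription` are hypothetical and are not part of any tooling shipped with the dataset.

```python
import re

# Assumed mora shape: space-separated phonemes followed by a pitch tag,
# e.g. "k yo:(HH)" -> phonemes ("k", "yo:"), pitch "HH".
MORA_RE = re.compile(r"^(?P<phones>.+?)\((?P<pitch>[HL]+)\)$")


def break_levels(text):
    """Return the numeric break markers (#1, #3, #4, ...) found in a text line."""
    return [int(level) for level in re.findall(r"#(\d)", text)]


def parse_transcription(transcription):
    """Parse a phoneme line into phrases -> words -> (phonemes, pitch) morae.

    Separator meanings ("/" phrase, "#" word, "." mora) are assumptions
    inferred from the samples, not a documented specification.
    """
    phrases = []
    for phrase in transcription.split("/"):
        words = []
        for word in phrase.split("#"):
            morae = []
            for mora in word.split("."):
                mora = mora.strip()
                if not mora:
                    continue
                match = MORA_RE.match(mora)
                if match is None:
                    raise ValueError(f"unrecognized mora: {mora!r}")
                phones = tuple(match.group("phones").split())
                morae.append((phones, match.group("pitch")))
            if morae:
                words.append(morae)
        if words:
            phrases.append(words)
    return phrases


if __name__ == "__main__":
    text = "はい#3、こちら#3、お願いします#4。"
    phones = ("h a(H) . i(L) / k o(L) . ch i(H) . r a(H) / "
              "o(L) # n e(H) . g a(H) . i(H) # sh i(L) . m a(H) . s u(L)")
    print(break_levels(text))
    for phrase in parse_transcription(phones):
        print(phrase)
```

Raising an error on an unrecognized mora, rather than skipping it, is a deliberate choice here: if the real corpus uses symbols that do not appear in these samples, the parser fails loudly instead of silently dropping data.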

Recommended Dataset

10.4 Hours – Japanese Female Voice TTS Dataset

This dataset contains 10.4 hours of Japanese female voice recordings, recorded by a native Japanese speaker with an authentic accent. The phoneme coverage is balanced. A professional phonetician participated in the annotation. This corpus is suited to Japanese text-to-speech (TTS) training, speech synthesis research, and AI voice model development.

Japanese speech synthesis dataset Japanese tts dataset Japanese text-to-speech dataset female female japanese tts dataset

20 Hours - American English Male Voice TTS Dataset

This dataset contains 20 hours of American English male voice recordings, recorded by Americans (native English speakers) with authentic accents. The phoneme coverage is balanced. A professional phonetician participated in the annotation. It is suitable for text-to-speech (TTS) model training, phoneme recognition research, and AI voice development.

TTS english dataset speech synthesis dataset TTS male voice dataset male voice dataset for tts American English speech synthesis dataset

19.46 Hours - American English Female Voice TTS Dataset

This dataset contains 19.46 hours of American English female voice recordings, recorded by an American native English speaker with an authentic accent and a clear, sweet tone. The phoneme coverage is balanced. Professional phoneticians participated in the annotation. It is suitable for text-to-speech (TTS) model training, phoneme recognition, and AI voice development requiring natural-sounding female speech.

American English speech synthesis dataset female voice dataset for TTS American English female voice corpus speech synthesis training data female TTS dataset American English female speaker speech synthesis dataset TTS english dataset

40 People - Multi-level Control Multi-emotional Paralanguage Annotated Speech Synthesis Corpus

This speech synthesis corpus was recorded by 40 native professional voice actors and actresses. The recordings cover multi-level control, multi-emotional, single-emotional, single-tone, emotional-shift, and paralanguage content. A professional phonetician participated in the annotation. It is suited to speech synthesis research and development.

TTS Multi-level Control Multi-emotional Paralanguage emotional shift

Chinese Expressive Narration Speech Synthesis Dataset – 4 Speakers

This Chinese expressive narration speech synthesis dataset was recorded by 4 professional character voice actors. Because the content is book-based, the speakers read in a highly expressive narration style suited to audiobook-like TTS generation. The dataset supports expressive TTS, storytelling voice models, audiobook synthesis, and emotion-rich speech generation.

Chinese speech synthesis dataset Mandarin speech dataset expressive narration TTS dataset Chinese expressive speech dataset narration speech corpus Chinese audiobook dataset character voice speech dataset

Chinese Emotional Speech Dataset – 4 Speakers, Multi-Style Voices

This is a Chinese speech synthesis dataset recorded by 4 professional character-voice actors. It covers multiple speaking styles (e.g. authoritative female boss, straightforward prince, nimble maid, kind elderly woman) and emotions including disdain, anger, happiness, concern, surprise, gasp of fear, cold snort (disdain), sympathy, laughter, inner thoughts, seriousness, disgust, puzzlement, sadness, and neutrality. The dataset is suited to building expressive text-to-speech (TTS), voice acting, character-based narration, emotion-aware speech generation, and related AI voice applications.

Chinese speech synthesis dataset Mandarin speech dataset narration speech corpus Chinese expressive speech dataset character voice speech dataset Chinese TTS dataset

100 Speakers Chinese Speech Synthesis Dataset & Multi-Emotion

This dataset was recorded by 100 professional Chinese voice actors. It includes sentences rich in modal particles that reflect everyday speech, as well as free conversation on given topics. Each speaker's audio is stored in a separate track. All recordings are annotated by professional phoneticians with text, timestamps, and prosody details, supporting speech synthesis, emotion recognition, and prosody modeling research.

Chinese emotional speech data Chinese conversational speech corpus Chinese natural conversation dataset Chinese prosody dataset

Mandarin Chinese Multi-Stream Speech Dataset – 294 Speakers, 203 Hours

This Mandarin Chinese speech synthesis dataset features 294 speakers and 203 hours of audio in total. It is gender-balanced (144 females and 150 males), with speakers aged 18 to 60. Each speaker records free-form dialogues on given topics, and in each conversation each person's audio is stored in a separate WAV file. Professional linguists have annotated 16 types of paralanguage, along with text, timestamps, and other information, to support speech synthesis and paralanguage research.

paralanguage speech dataset Mandarin speech synthesis corpus Chinese speech synthesis dataset spontaneous dialogue speech synthesis annotated speech synthesis dataset dialogue speech synthesis dataset multi-stream speech synthesis dataset Chinese paralanguage dataset spontaneous dialogue dataset multi-stream speech corpus
