From: Nexdata Date: 2024-08-15
In data-driven intelligent algorithms, the quality and quantity of data determine the learning efficiency and decision-making precision of AI systems. Unlike traditional programming, machine learning and deep learning models rely on massive amounts of training data to “self-learn” patterns and rules. Building and maintaining datasets has therefore become a core mission in AI research and development. By continuously enriching data samples, AI models can handle more complex real-world problems, which improves the practicality and applicability of the technology.
Speech synthesis, commonly known as text-to-speech (TTS), is a technology that converts input text into the corresponding speech. It is one of the indispensable modules in human-computer voice interaction.
Traditional Speech Synthesis
Traditional speech synthesis systems usually include two modules: a front end and a back end. The front-end module analyzes the input text and extracts the linguistic information required by the back-end module. For Chinese synthesis systems, the front-end module generally includes sub-modules such as Text Normalization (TN), polyphonic word disambiguation, and prosody prediction. The back-end module then generates a speech waveform from the front-end analysis results.
The front end depends on a large amount of basic data, such as TN annotation, polyphonic word annotation, and prosody annotation, to output accurate results.
The back end requires high-quality voice libraries recorded by professional speakers. To serve a variety of scenarios, a large number of voice libraries covering diverse timbres and languages are required.
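The front-end stages described above can be sketched as a small pipeline. This is a minimal illustration, not Nexdata's or any production system's implementation: the lookup tables, the `FrontEndResult` type, and all function names are hypothetical, the normalization only reads single digits, and the polyphone and prosody stages use hand-written toy rules where a real front end would use models trained on annotated data.

```python
from dataclasses import dataclass, field

# Toy lookup table (assumption): a real TN module handles dates, units, etc.
DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

# Hypothetical context rule for one polyphonic character: 行 is read
# "hang2" inside 银行 (bank) and falls back to "xing2" otherwise.
POLYPHONE_RULES = {"行": [("银行", "hang2")]}
POLYPHONE_DEFAULT = {"行": "xing2"}


@dataclass
class FrontEndResult:
    normalized: str
    polyphone_readings: dict = field(default_factory=dict)
    prosody_breaks: list = field(default_factory=list)


def normalize(text: str) -> str:
    """Text normalization: read each Arabic digit as a Chinese numeral."""
    return "".join(DIGITS.get(ch, ch) for ch in text)


def disambiguate(text: str) -> dict:
    """Pick a reading for each known polyphonic character in the text."""
    readings = {}
    for index, ch in enumerate(text):
        if ch not in POLYPHONE_DEFAULT:
            continue
        reading = POLYPHONE_DEFAULT[ch]
        for word, word_reading in POLYPHONE_RULES.get(ch, []):
            # Window covering every occurrence of `word` that overlaps `index`.
            start = max(0, index - len(word) + 1)
            if word in text[start:index + len(word)]:
                reading = word_reading
        readings[index] = (ch, reading)
    return readings


def predict_prosody(text: str) -> list:
    """Toy prosody prediction: a phrase break at each punctuation mark."""
    return [i for i, ch in enumerate(text) if ch in "，。！？"]


def run_front_end(text: str) -> FrontEndResult:
    normalized = normalize(text)
    return FrontEndResult(normalized,
                          disambiguate(normalized),
                          predict_prosody(normalized))


result = run_front_end("银行旁有3个行人。")
print(result.normalized)
print(result.polyphone_readings)
print(result.prosody_breaks)
```

The output of `run_front_end` stands in for the linguistic information that a back end would consume. Each toy rule here marks a place where the annotated data discussed above would be used to train a real component.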
Personalized Speech Synthesis
Personalized speech synthesis usually takes a small amount of speech from the target speaker, possibly of low quality, and uses methods such as transfer learning to train a model capable of synthesizing that speaker’s voice. The usual approach is to train a general speech synthesis model on a large number of different speakers and then fine-tune it on a small amount of data from the target speaker.
Personalized speech synthesis is becoming increasingly mature in practice. For example, Baidu Maps lets users record 9 sentences to generate a complete personal voice package and use it in all scenarios of the map.
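The pretrain-then-fine-tune idea can be illustrated with a deliberately tiny example. This is only a sketch under strong assumptions: a two-parameter linear model stands in for a speech synthesis model, the "speakers" are synthetic data series that share a slope but differ in offset, and every name and number below is invented for illustration.

```python
# Toy stand-in for "pretrain on many speakers, then fine-tune on a few
# target samples". The model is y = w * x + b; all data is synthetic.

def mse(params, data):
    """Mean squared error of the linear model on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(params, data, lr, steps):
    """Plain batch gradient descent on the MSE, starting from `params`."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
# Three synthetic "speakers" share a slope of 2 but differ in offset.
multi_speaker = [(x, 2 * x + offset) for offset in (-1.0, 0.0, 1.0) for x in xs]
# The target speaker has offset 3 and contributes only two samples.
target_few = [(0.5, 4.0), (1.5, 6.0)]
target_heldout = [(x, 2 * x + 3.0) for x in (0.0, 1.0, 2.0)]

# Step 1: general model trained on the pooled multi-speaker data.
pretrained = train((0.0, 0.0), multi_speaker, lr=0.1, steps=500)
# Step 2: fine-tune, starting from the general model, on the few target samples.
finetuned = train(pretrained, target_few, lr=0.1, steps=200)

error_before = mse(pretrained, target_heldout)
error_after = mse(finetuned, target_heldout)
print(f"held-out error before fine-tuning: {error_before:.4f}")
print(f"held-out error after fine-tuning:  {error_after:.4f}")
```

The general model captures what the speakers have in common (the slope), so the two target samples only need to shift it toward the new speaker. That is the role a multi-speaker corpus plays when the target speaker provides very little data.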
Personalized speech synthesis relies on a multi-speaker average model library as important data support. Nexdata’s speech synthesis data for general scenarios is divided into three categories:
Single-Speaker Synthesis Library
A sound library recorded in a professional recording studio by a single speaker.
•American English Speech Synthesis Corpus-Female
Female audio data of American English. It is recorded by a native American English speaker with an authentic accent and a sweet voice. The phoneme coverage is balanced, and professional phoneticians participated in the annotation.
•American English Speech Synthesis Corpus-Male
Male audio data of American English. It is recorded by native American English speakers with authentic accents. The phoneme coverage is balanced, and professional phoneticians participated in the annotation.
•Japanese Synthesis Corpus-Female
Female audio data of Japanese. It is recorded by a native Japanese speaker with an authentic accent. The phoneme coverage is balanced, and professional phoneticians participated in the annotation.
Multi-speaker Average Model Library
A sound library recorded in a professional recording studio by multiple speakers.
•Chinese Mandarin Average Tone Speech Synthesis Corpus, General
Recorded by native Chinese speakers, this corpus covers news, dialogue, audiobooks, poetry, advertising, news broadcasting, and entertainment. The phonemes and tones are balanced, and professional phoneticians participated in the annotation.
•Chinese Average Tone Speech Synthesis Corpus-Three Styles
Recorded by native Chinese speakers, this corpus includes customer service, news, and story styles. The syllables, phonemes, and tones are balanced, and professional phoneticians participated in the annotation.
Frontend Text
•200,475 Sentences — Chinese Text Normalization Data
The dataset covers novels, articles, news, and other categories. The special symbols and Arabic numerals in the sentences are annotated with their Chinese-character forms, with a total of 199,652 sentences and 454,638 annotations.
•319,977 Sentences — Mandarin Polyphone Corpus Data
The dataset covers news, spoken language, and other categories. It includes 603 pronunciations of 266 polyphonic words across a total of 319,977 sentences.
•200,955 Sentences — Mandarin Prosodic Corpus Data
Texts from news and daily chats, annotated with four levels of prosody.
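To illustrate why sentence-level polyphone annotation matters, the sketch below fits a most-frequent-reading baseline on a handful of annotated examples and scores it on a few more. The record layout (sentence, character index, reading) and all example sentences are assumptions made for illustration; they do not reflect the actual format or content of the Nexdata corpora.

```python
from collections import Counter, defaultdict

# Hypothetical annotation records: (sentence, character index, reading).
TRAIN = [
    ("他在银行工作", 3, "hang2"),
    ("这家银行很大", 3, "hang2"),
    ("行人很多", 0, "xing2"),
    ("音乐很好听", 1, "yue4"),
    ("他很快乐", 3, "le4"),
    ("乐队在演出", 0, "yue4"),
]
TEST = [
    ("银行门口", 1, "hang2"),
    ("不行", 1, "xing2"),
    ("音乐会", 1, "yue4"),
    ("快乐", 1, "le4"),
]


def fit_baseline(records):
    """Count annotated readings per character, keep the most frequent one."""
    counts = defaultdict(Counter)
    for sentence, index, reading in records:
        counts[sentence[index]][reading] += 1
    return {ch: c.most_common(1)[0][0] for ch, c in counts.items()}


def accuracy(model, records):
    """Share of annotated characters whose predicted reading matches."""
    correct = sum(model.get(sentence[index]) == reading
                  for sentence, index, reading in records)
    return correct / len(records)


baseline = fit_baseline(TRAIN)
score = accuracy(baseline, TEST)
print(baseline)
print(f"baseline accuracy: {score:.2f}")
```

Because the baseline ignores context, it always assigns the same reading to a character and misses the less frequent one. Annotated sentences supply the surrounding context that a disambiguation model needs to tell the readings apart.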
As the world’s leading AI data service provider, Nexdata has rich sample sound resources, strong technical capabilities, and extensive data processing experience, and supports personalized collection services for designated languages, timbres, ages, and genders. Nexdata also supports data customization services such as audio segmentation, phoneme boundary segmentation (with a segmentation accuracy of 0.01 seconds), phonetic tagging, prosody tagging, part-of-speech tagging, pitch proofreading, rhythm tagging, and musical score production to meet customers’ diverse requirements.
End
If you need data services, please feel free to contact us: info@nexdata.ai.
Standing at the forefront of the technology revolution, we are well aware of the power of data. In the future, by continuously improving data collection and annotation processes, AI systems will become more intelligent. All walks of life should actively embrace data-driven innovation to stay ahead in fierce market competition and bring more value to society.