From: Nexdata | Date: 2024-08-15
In data-driven intelligent algorithms, the quality and quantity of data determine the learning efficiency and decision-making precision of AI systems. Unlike traditional programming, machine learning and deep learning models rely on massive training data to “self-learn” patterns and rules. Therefore, building and maintaining datasets has become a core mission in AI research and development. By continuously enriching data samples, AI models can handle more complex real-world problems, which improves the practicality and applicability of the technology.
Speech synthesis is an important part of the field of voice interaction, and its technology is constantly developing. In recent years, there has been growing interest in and demand for emotion synthesis. Emotional speech synthesis allows a machine to communicate with us like a real person: it can express different emotions such as anger, happiness, and sadness, and even the same emotion at different intensities.
Emotional speech conversion technology converts speech from one emotional state to another while keeping the speaker's identity and the linguistic content unchanged. Simply put, it transfers the emotional expression of an emotional source speaker to the target speaker while preserving the target speaker's timbre.
Emotional Speech Synthesis Technology
Emotional speech synthesis systems can be built with speaker and emotion embeddings. Emotion is used as a label: an emotion label is added on top of the original network, and the information about these emotions is learned by the network.
Speaker embedding obtains a speaker vector through a neural network, which requires a multi-speaker database of a certain scale for training.
Emotion embedding combines emotional data with the speaker vectors to implement an emotional speech synthesis model, so high-quality, multi-emotion data is required.
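As a rough illustration of this conditioning idea, the sketch below (PyTorch) adds learned speaker and emotion embeddings to a text encoder's output before it reaches the decoder. All module names, dimensions, and label counts (such as EmotionalTTSConditioner, num_speakers, num_emotions) are hypothetical; this is a minimal sketch of the embedding approach, not the implementation of any particular system.

```python
import torch
import torch.nn as nn

class EmotionalTTSConditioner(nn.Module):
    """Minimal sketch: condition a TTS acoustic model on speaker and emotion.

    Assumes a text encoder that outputs frames of size `hidden_dim`;
    the numbers of speakers/emotions and all dimensions are illustrative.
    """

    def __init__(self, num_speakers=50, num_emotions=6, hidden_dim=256):
        super().__init__()
        # Speaker embedding: one vector per speaker, learned from a
        # multi-speaker corpus.
        self.speaker_emb = nn.Embedding(num_speakers, hidden_dim)
        # Emotion embedding: one vector per emotion label (neutral, happy,
        # angry, sad, ...), learned from multi-emotion recordings.
        self.emotion_emb = nn.Embedding(num_emotions, hidden_dim)

    def forward(self, encoder_out, speaker_id, emotion_id):
        # encoder_out: (batch, time, hidden_dim) text-encoder states
        spk = self.speaker_emb(speaker_id).unsqueeze(1)   # (batch, 1, hidden)
        emo = self.emotion_emb(emotion_id).unsqueeze(1)   # (batch, 1, hidden)
        # Broadcast-add both conditioning vectors to every encoder frame;
        # the decoder then generates spectrograms conditioned on who is
        # speaking and which emotion to express.
        return encoder_out + spk + emo

# Usage sketch with fake tensors
conditioner = EmotionalTTSConditioner()
enc = torch.randn(2, 120, 256)                      # fake encoder output
spk_ids = torch.tensor([3, 17])                     # speaker labels
emo_ids = torch.tensor([1, 4])                      # emotion labels
conditioned = conditioner(enc, spk_ids, emo_ids)    # would feed the decoder
```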
For example, cross-speaker emotion transfer can use emotion and timbre perturbation to learn speaker-related and emotion-related spectra respectively, and to provide explicit emotion features for the final speech generation. The speaker-related part maintains the timbre of the target speaker, while the emotion-related part captures the emotional expression of the source speaker. Therefore, multi-speaker emotional data and multi-speaker neutral data are needed for joint training.
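One way to set up the data side of such joint training, sketched below with made-up file paths and field names, is to pool a multi-speaker emotional corpus with a multi-speaker neutral corpus and tag every utterance with both a speaker ID and an emotion ID (neutral for the purely multi-speaker data). This is only an assumption-laden illustration of how the two corpora could be combined, not the perturbation method itself.

```python
from dataclasses import dataclass
import random

@dataclass
class Utterance:
    wav_path: str     # path to the audio file
    text: str         # transcript
    speaker_id: int   # speaker label
    emotion_id: int   # emotion label; 0 = neutral by convention here

def build_joint_training_set(emotional_utts, neutral_utts, seed=0):
    """Pool multi-speaker emotional data with multi-speaker neutral data.

    The emotional corpus supplies emotion-related variation; the neutral
    corpus widens speaker (timbre) coverage so the model can transfer an
    emotion to target speakers who were never recorded emotionally.
    """
    pool = list(emotional_utts) + list(neutral_utts)
    random.Random(seed).shuffle(pool)
    return pool

# Illustrative corpora (paths and labels are hypothetical)
emotional = [Utterance("emo/spk01_happy_001.wav", "...", speaker_id=1, emotion_id=2)]
neutral = [Utterance("neutral/spk42_000.wav", "...", speaker_id=42, emotion_id=0)]
training_set = build_joint_training_set(emotional, neutral)
```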
Application Scenarios of Emotional Speech Synthesis
Avatars: It can give virtual characters a certain degree of emotional expressiveness.
Short video dubbing: It can dub short video content to make it more lively and interesting.
Game characters: It allows players to have a better experience in the game.
Film and television animation: It can deliver vivid narration.
Intelligent customer service: It can improve the human-computer interaction experience and make interactions more engaging.
Nexdata Emotional Speech Synthesis Data Solution
As the world’s leading artificial intelligence data service provider, Nexdata can provide customers with rich emotional voice data. Models trained on these data can synthesize voices that are richer in emotion and expression, making the synthesized speech more natural and realistic.
Chinese Mandarin Synthesis Corpus-Female, Emotional
13.3 Hours — Chinese Mandarin Synthesis Corpus-Female, Emotional. It is recorded by a native Chinese speaker reading emotional text, with balanced coverage of syllables, phonemes and tones. Professional phoneticians participate in the annotation. It precisely matches the research and development needs of speech synthesis.
American English Speech Synthesis Corpus-Male
Male audio data of American English, recorded by native American English speakers with an authentic accent. The phoneme coverage is balanced. Professional phoneticians participate in the annotation. It precisely matches the research and development needs of speech synthesis.
Japanese Synthesis Corpus-Female
10.4 Hours — Japanese Synthesis Corpus-Female. It is recorded by a native Japanese speaker with an authentic accent. The phoneme coverage is balanced. Professional phoneticians participate in the annotation. It precisely matches the research and development needs of speech synthesis.
American English Speech Synthesis Corpus-Female
Female audio data of American English, recorded by a native American English speaker with an authentic accent and a sweet voice. The phoneme coverage is balanced. Professional phoneticians participate in the annotation. It precisely matches the research and development needs of speech synthesis.
Chinese Average Tone Speech Synthesis Corpus-Three Styles
50 People — Chinese Average Tone Speech Synthesis Corpus-Three Styles. It is recorded by native Chinese speakers, and the corpus includes customer service, news, and story styles. The syllables, phonemes and tones are balanced. Professional phoneticians participate in the annotation. It precisely matches the research and development needs of speech synthesis.
If you would like more details about the datasets or how to acquire them, please feel free to contact us at info@nexdata.ai.
In the future, data-driven intelligence will profoundly change how all industries operate. To ensure the long-term development of AI technology, high-quality datasets will remain an indispensable foundational resource. By continuously optimizing data collection technology and building more sophisticated datasets, AI systems will bring more opportunities and challenges to all walks of life.