Build Minority Language ASR Models With High-Quality Reading Speech Data

From: Nexdata Date: 2024-08-15

➤ Nexdata's speech data for ASR models

The rapid development of artificial intelligence is inseparable from the support of high-quality data. Data is not only the fuel that drives AI model learning but also the core factor in improving model performance, accuracy and stability. Especially in automated tasks and intelligent decision-making, deep learning algorithms based on massive data have shown their potential. Well-structured, rich datasets have therefore become a top priority for engineers and developers who need AI systems to perform well across a variety of scenarios.

In the past ten years, driven by deep learning, speech recognition technology and its applications have developed rapidly. Products and services equipped with speech recognition, such as voice search, voice input methods, smart speakers, smart TVs, smart wearables, intelligent customer service and robots, are now widely used in all aspects of our lives.

However, today's commercial ASR models are mainly trained on English datasets and thus have higher accuracy for English speech interactions. Very little training data in minority languages is available on the market, and what exists covers few scenarios and poses little challenge, so it cannot reflect how well a model generalizes across large data volumes and complex scenarios.

To let people all over the world enjoy the convenience that new technologies such as artificial intelligence, big data and cloud computing bring to work and life, Nexdata has launched 100,000 hours of reading speech data across multiple application scenarios, covering more than 60 languages and dialects, to help train ASR models in minority languages.

831 Hours — British English Speech Data by Mobile Phone

This dataset contains 831 hours of British English speech recorded on mobile phones by 1,651 native British speakers. The recorded content covers categories such as generic, interactive, in-car and smart home. The texts are manually proofread to ensure high accuracy. The dataset matches Android and iOS systems.

1,796 Hours — German Speech Data by Mobile Phone

German audio data captured by mobile phone, 1,796 hours in total, recorded by 3,442 German native speakers. The recorded text is designed by linguistic experts, covering generic, interactive, on-board, home and other categories. The text has been proofread manually with high accuracy; this data can be used for automatic speech recognition, machine translation, and voiceprint recognition.

769 Hours — French Speech Data by Mobile Phone

The data volume is 769 hours, recorded by 1,623 native French speakers. The recording text is designed by linguistic experts and covers general, interactive, in-car and home categories. The texts are manually proofread with high accuracy. Recording devices are mainstream Android phones and iPhones.

435 Hours — Spanish Speech Data by Mobile Phone

The data volume is 435 hours, recorded by 989 native Spanish speakers. The recording text is designed by linguistic experts and covers general, interactive, in-car and home categories. The texts are manually proofread with high accuracy. Recording devices are mainstream Android phones and iPhones.

762 Hours — Non-Hispanic Spanish Speech Data by Mobile Phone

1,630 native Spanish speakers of non-Spanish nationality, such as Mexicans and Colombians, participated in the recording with authentic accents. The recorded script is designed by linguists and covers a wide range of topics including generic, interactive, in-vehicle and home. The text is manually proofread with high accuracy. It matches mainstream Android and Apple phones.

1,044 Hours — Brazilian Portuguese Speech Data by Mobile Phone

The data volume is 1,044 hours, recorded by 2,038 native Brazilian speakers. The recording text is designed by linguistic experts and covers general, interactive, in-car and home categories. The texts are manually proofread with high accuracy. Recording devices are mainstream Android phones and iPhones.

234 Hours-Japanese Speech Data by Mobile Phone

The data was collected from 799 Japanese locals and recorded in quiet indoor places, streets and restaurants. The recordings include 210,000 commonly used written and spoken Japanese sentences. The sentence error rate of the transcribed text is less than 5%. Recording devices are mainstream Android phones and iPhones.

211 people — Korean Speech Data by Mobile Phone_Guiding

The data was collected from 211 Korean locals (99 female, 112 male) and recorded in a quiet indoor environment. Recording devices are mainstream Android phones and iPhones.

1,441 Hours — Italian Speech Data by Mobile Phone

The data was recorded by 3,109 native Italian speakers with authentic Italian accents. The recorded content covers a wide range of categories such as general purpose, interactive, in-car commands and home commands. The recorded text is designed by a language expert and manually proofread with high accuracy. It matches mainstream Android and Apple phones.

759 Hours — Hindi Speech Data by Mobile Phone

The data is 759 hours long and was recorded by 1,425 Indian native speakers. The accent is authentic. The recording text is designed by language experts and covers general, interactive, car, home and other categories. The text is manually proofread, and the accuracy is high. Recording devices are mainstream Android phones and iPhones. It can be applied to speech recognition, machine translation, and voiceprint recognition.

1,002 Hours — Russian Speech Data by Mobile Phone

1,960 native Russian speakers participated in the recording with authentic accents. The recorded script is designed by linguists and covers a wide range of topics including generic, interactive, in-vehicle and home. The text is manually proofread with high accuracy. It matches mainstream Android and Apple phones.
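To compare the listed datasets at a glance, the hour and speaker counts stated above can be tallied in a few lines. The following is only a minimal sketch: the names `DATASETS`, `total_hours` and `minutes_per_speaker` are illustrative and not part of any Nexdata tooling, and the figures are simply copied from the listings in this article.

```python
# Hours and speaker counts copied from the dataset listings above.
# The Korean set is left out: its listing gives 211 speakers but no hour count.
DATASETS = [
    ("British English", 831, 1651),
    ("German", 1796, 3442),
    ("French", 769, 1623),
    ("Spanish", 435, 989),
    ("Non-Hispanic Spanish", 762, 1630),
    ("Brazilian Portuguese", 1044, 2038),
    ("Japanese", 234, 799),
    ("Italian", 1441, 3109),
    ("Hindi", 759, 1425),
    ("Russian", 1002, 1960),
]


def total_hours(datasets):
    """Sum the recorded hours across all listed datasets."""
    return sum(hours for _, hours, _ in datasets)


def minutes_per_speaker(hours, speakers):
    """Average recording time per speaker, in minutes."""
    return hours * 60 / speakers


if __name__ == "__main__":
    for name, hours, speakers in DATASETS:
        avg = minutes_per_speaker(hours, speakers)
        print(f"{name:<22}{hours:>6} h {speakers:>6} speakers {avg:>6.1f} min/speaker")
    print(f"Total: {total_hours(DATASETS)} hours")
```

The per-speaker average is a rough indicator of speaker diversity: for a fixed number of hours, fewer minutes per speaker means more distinct voices in the training data.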

If you want to know more about the datasets or how to acquire them, please feel free to contact us: info@nexdata.com.

Data is the key to the success of artificial intelligence. We must strengthen data collection methods and data security to achieve more intelligent and efficient technical solutions. In a rapidly developing market, only by continuously innovating and optimizing artificial intelligence can we build a safer, more efficient and more intelligent society. If you have data requirements, please contact Nexdata.ai.
