From: Nexdata Date: 2024-08-15
In building intelligent systems, the quality of the training data matters more than the algorithm itself. To cope with the challenges of complex scenarios, researchers need to collect and annotate different types of data to improve the capabilities of AI systems. Today, every industry is exploring how to use data-driven technology to build smarter business processes and decision-making systems.
According to statistics, 528 million people in India spoke Hindi as their mother tongue, accounting for 43% of India's total population at the time. In addition, 163 million people use Hindi as a second or third language and have some conversational ability. Millions more use Hindi in the United States, South Africa, Singapore and other places. As India's international standing rises, the influence of Hindi is growing as well.
In speech technology, building a speech recognizer for a language requires enough training data. Because Hindi speech recognition resources are extremely scarce, the development of Hindi speech recognition has faced great challenges.
This scarcity shows up mainly in two ways:
1) Hindi speech recognition data is difficult to obtain, and the few open-source datasets each cover only a single domain.
2) Little code is publicly available, and the results reported in the literature are difficult to reproduce.
To address the scarcity of Hindi speech recognition data, Nexdata has developed multiple Hindi speech recognition datasets. We hope this training data helps more Hindi speech recognition applications reach deployment and improves the accuracy of Hindi speech recognition.
240 Hours - Hindi Speech Recognition Data by Mobile Phone_R
The dataset contains 240 hours of speech recorded by 401 Indian speakers. It was recorded in both quiet and noisy environments, which better matches real application scenarios. The recording content is rich, covering economics, entertainment, news, spoken language and more. All text is manually transcribed with high accuracy. The data can be applied to speech recognition, machine translation and voiceprint recognition.
397 People - Hindi Speech Recognition Data by Mobile Phone_Guiding
The dataset was recorded by 397 Indian speakers with authentic accents, 50 sentences per speaker, 8.6 hours in total. The recording content covers in-car scenarios, smart home and intelligent voice assistants. The data can be used for corpus construction for machine translation, and for model training and algorithm research for voiceprint recognition.
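As a rough sanity check on these figures, the short sketch below works out the implied number of utterances and their average length. It assumes that every speaker recorded exactly 50 sentences and that the 8.6 hours is the total audio duration across all speakers; neither assumption is confirmed beyond the summary above.

```python
# Figures taken from the dataset description.
speakers = 397
sentences_per_speaker = 50
total_hours = 8.6

# Assumption: every speaker recorded exactly 50 sentences.
total_sentences = speakers * sentences_per_speaker

# Assumption: 8.6 hours is the total audio across all speakers.
avg_seconds = total_hours * 3600 / total_sentences

print(f"utterances: {total_sentences}")
print(f"average utterance length: {avg_seconds:.2f} s")
```

Under those assumptions the set holds 19,850 short utterances averaging roughly a second and a half each, which is consistent with command-style prompts for in-car and smart home use.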
759 Hours - Hindi Speech Recognition Data by Mobile Phone
The data is 759 hours long and was recorded by 1,425 Indian native speakers. The accent is authentic. The recording text is designed by language experts and covers general, interactive, car, home and other categories. The text is manually proofread, and the accuracy is high. Recording devices are mainstream Android phones and iPhones. It can be applied to speech recognition, machine translation, and voiceprint recognition.
750 Hours - Hindi Conversational Speech Data by Mobile Phone
The 750 Hours - Hindi Conversational Speech Data was collected by mobile phone from more than 1,000 native speakers, with a balanced gender ratio. Speakers chose a few familiar topics from a given list and started conversations, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed with the text content, the start and end time of each effective sentence, and speaker identification. The sentence accuracy rate is ≥ 95%.
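Before training on audio like this, it can be useful to confirm that each file matches the stated format. The following is a minimal sketch using Python's standard `wave` module; the function name `matches_spec` and its parameters are illustrative and are not part of any Nexdata tooling.

```python
import wave


def matches_spec(path, rate=16000, sample_width_bytes=2):
    """Return True if the WAV file is uncompressed PCM at the expected
    sample rate and bit depth (2 bytes per sample = 16-bit)."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getsampwidth() == sample_width_bytes
            and wav.getcomptype() == "NONE"  # "NONE" means uncompressed PCM
        )
```

Calling `matches_spec("sample.wav")` on a file (the path here is a placeholder) returns `True` only for 16 kHz, 16-bit uncompressed audio. Note that the `wave` module raises `wave.Error` for files it cannot parse, so a caller may want to catch that exception and treat it as a mismatch.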
Data is not only the foundation of artificial intelligence systems but also the driving force behind future technological breakthroughs. As every field becomes more dependent on AI, we need to innovate in data collection and annotation methods to meet growing demand. In the future, data will continue to lead AI development and bring more possibilities to all walks of life.