From: Nexdata Date: 2024-08-15
The application fields of artificial intelligence are expanding rapidly, driven by the richness and diversity of available datasets. Whether in medical image analysis, autonomous driving, or smart home systems, the accumulation of large amounts of data opens up new AI application scenarios.
According to India's 2011 census, 528 million people in India spoke Hindi as their mother tongue, accounting for 43% of the country's total population that year. A further 163 million people use Hindi as a second or third language and have some conversational ability. Millions more use Hindi in the United States, South Africa, Singapore, and other places. As India's international standing rises, the influence of Hindi is growing with it.
In speech technology, building speech recognition for a language requires enough training data. Because Hindi speech recognition resources are extremely scarce, the development of Hindi speech recognition has faced great challenges.
This scarcity shows up mainly in two ways:
1) Hindi speech recognition data is difficult to obtain, and the few open-source datasets that exist each cover only a single domain.
2) Little code is publicly available, and the results reported in the literature are difficult to reproduce.
To address the scarcity of Hindi speech recognition data, Nexdata has developed multiple Hindi speech recognition datasets. We hope this training data helps more Hindi speech recognition applications reach deployment and improves their accuracy.
240 Hours - Hindi Speech Recognition Data by Mobile Phone_R
This dataset contains 240 hours of speech recorded by 401 Indian speakers. It was recorded in both quiet and noisy environments, which brings it closer to real application scenarios. The recording content is varied, covering economics, entertainment, news, spoken language, and more. All texts are manually transcribed with high accuracy. The data can be applied to speech recognition, machine translation, and voiceprint recognition.
397 People - Hindi Speech Recognition Data by Mobile Phone_Guiding
This dataset was recorded by 397 Indian speakers with authentic accents, 50 sentences per speaker, for a total of 8.6 hours. The recording content covers in-car scenarios, smart home, and intelligent voice assistants. The data can be used for corpus construction for machine translation, and for model training and algorithm research for voiceprint recognition.
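As a quick sanity check on the figures above, the short Python sketch below derives the number of utterances and the average utterance length. It assumes that every speaker contributed exactly 50 sentences and that the 8.6 hours is the total audio duration; the function name is hypothetical and used only for illustration.

```python
def summarize_corpus(speakers, sentences_per_speaker, total_hours):
    """Derive simple per-utterance and per-speaker statistics from corpus totals."""
    utterances = speakers * sentences_per_speaker
    total_seconds = total_hours * 3600
    return {
        "utterances": utterances,
        "avg_seconds_per_utterance": total_seconds / utterances,
        "avg_seconds_per_speaker": total_seconds / speakers,
    }


# Figures taken from the dataset description: 397 speakers, 50 sentences each, 8.6 hours.
stats = summarize_corpus(speakers=397, sentences_per_speaker=50, total_hours=8.6)
print(stats["utterances"])
print(round(stats["avg_seconds_per_utterance"], 2))
print(round(stats["avg_seconds_per_speaker"], 1))
```

Under these assumptions, the corpus works out to 19,850 utterances averaging roughly a second and a half each, which is consistent with short command-style prompts for in-car and smart home use.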
759 Hours - Hindi Speech Recognition Data by Mobile Phone
This dataset contains 759 hours of speech recorded by 1,425 native speakers from India with authentic accents. The recording text was designed by language experts and covers general, interactive, in-car, home, and other categories. The text is manually proofread with high accuracy. The recording devices are mainstream Android phones and iPhones. The data can be applied to speech recognition, machine translation, and voiceprint recognition.
750 Hours - Hindi Conversational Speech Data by Mobile Phone
This dataset contains 750 hours of Hindi conversational speech collected by mobile phone from more than 1,000 native speakers, with a balanced gender ratio. Speakers chose a few familiar topics from a given list and then started conversations, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed with the text content, the start and end time of each effective sentence, and speaker identification. The sentence accuracy rate is ≥ 95%.
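For readers who want to confirm that an audio file matches the stated format before training, here is a minimal sketch using Python's standard `wave` module. The function and constant names are hypothetical, and the file path in the usage note is a placeholder; the dataset description does not state the channel count, so the sketch reports it without checking it.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # 16 kHz, as stated in the dataset description
EXPECTED_SAMPLE_WIDTH = 2     # 16-bit samples are 2 bytes wide


def describe_wav(path):
    """Read the header fields of a WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "compression": wav.getcomptype(),  # "NONE" means uncompressed PCM
            "duration_seconds": frames / rate if rate else 0.0,
        }


def matches_spec(info):
    """Return True if the file is 16 kHz, 16-bit, uncompressed audio."""
    return (
        info["sample_rate"] == EXPECTED_SAMPLE_RATE
        and info["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH
        and info["compression"] == "NONE"
    )
```

A typical use would be `matches_spec(describe_wav("sample.wav"))`, where `sample.wav` stands in for any file from the corpus. Files that fail the check can be resampled or excluded before feature extraction.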
The future of AI depends heavily on data. As technology develops and application scenarios expand, high-quality datasets will become key to improving AI performance. In this data-driven revolution, a constant focus on data quality and stronger data security management will help us meet the opportunities and challenges of technological development.