From: Nexdata Date: 2024-08-15
As speech recognition technology is deployed in more natural scenarios such as smart customer service and smart meetings, models trained on read-aloud speech data no longer perform satisfactorily.
In daily life, speakers pronounce words more naturally, so their speech contains a lot of slurring (legato), swallowed syllables, distorted pronunciation, and unclear articulation. Speakers usually do not deliberately control their voice or pronunciation habits, and several people may talk at the same time. Complex speech phenomena such as interrupted sentences, rushed words, and overlapping speech may also occur, so recognition accuracy for this natural dialogue style is still far from ideal.
Data is the foundation of artificial intelligence. For an AI system to achieve higher accuracy, it needs a training data set that better matches the application scenario. Natural dialogue speech data has therefore become urgently needed in the industry.
Nexdata has nearly 40,000 hours of natural dialogue speech data, covering Mandarin Chinese, dialects, English, Japanese, Korean, Hindi, Vietnamese, Arabic, Spanish, French, German, Italian, and more. The speakers come from different regions and cities, with balanced age and gender coverage. All audio has undergone strict manual transcription and quality inspection, with annotations for the text content, the start and end times of valid sentences, the identity of the speaker, and so on. The sentence accuracy rate is as high as 95%.
1,136 Hours - American English Conversational Speech Data by Mobile Phone
This 1,136-hour data set of natural American English conversations was collected by phone from more than 1,000 native English speakers in the United States, with a balanced gender ratio and geographical distribution. Speakers chose a few familiar topics from a given list and started conversations on them, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed, with annotations for the text content, the start and end time of each effective sentence, and speaker identification. The sentence accuracy rate is ≥ 95%.
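Before training on audio like this, it can be useful to confirm that each file really matches the stated format. The following is a minimal sketch using only Python's standard `wave` module. The function names and the demo files are illustrative assumptions, not part of any Nexdata tooling, and the channel count is not checked because the description above does not state it.

```python
import os
import tempfile
import wave


def check_wav_format(path, expected_rate=16000, expected_bits=16):
    """Return True if the file is uncompressed PCM WAV at the expected rate and bit depth."""
    with wave.open(path, "rb") as wav:
        rate_ok = wav.getframerate() == expected_rate
        bits_ok = wav.getsampwidth() * 8 == expected_bits  # sample width is in bytes
        pcm_ok = wav.getcomptype() == "NONE"  # "NONE" means uncompressed PCM
    return rate_ok and bits_ok and pcm_ok


def write_silence(path, rate, sample_width, seconds=1):
    """Write a silent mono PCM file; used here only to create demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (rate * sample_width * seconds))


# Demo: one file in the stated format, one at a different sample rate.
demo_dir = tempfile.mkdtemp()
good = os.path.join(demo_dir, "good.wav")
bad = os.path.join(demo_dir, "bad.wav")
write_silence(good, rate=16000, sample_width=2)
write_silence(bad, rate=8000, sample_width=2)

print(check_wav_format(good), check_wav_format(bad))
```

In this demo the first file passes and the second fails because of its 8 kHz sample rate. A check like this is cheap to run over a whole corpus before feature extraction.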
607 Hours - Cantonese Conversational Speech Data by Mobile Phone and Voice Recorder
This 607-hour Cantonese conversational speech data set involved 995 native speakers. Speakers chose a few familiar topics from a given list and started conversations on them, which keeps the dialogues fluent and natural. The recording devices are various mobile phones and professional audio recorders. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed. The annotations cover the text content, the start and end time of each effective sentence, speaker identification, and other attributes. The sentence accuracy rate is ≥ 95%.
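To illustrate how annotations of this kind (text, start and end time, speaker identification) might be used, here is a small sketch. The `Segment` record and its field names are hypothetical and do not reflect the actual delivery format of the data set; the sample values are made up for the demo. It totals annotated speech time per speaker and finds places where two different speakers overlap, one of the natural-dialogue phenomena mentioned earlier.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One annotated sentence (hypothetical layout)."""
    speaker_id: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording
    text: str


def speech_seconds_by_speaker(segments):
    """Sum the annotated sentence durations for each speaker."""
    totals = defaultdict(float)
    for seg in segments:
        if seg.end <= seg.start:
            raise ValueError(f"segment has non-positive duration: {seg}")
        totals[seg.speaker_id] += seg.end - seg.start
    return dict(totals)


def overlapping_pairs(segments):
    """Return pairs of segments from different speakers that overlap in time."""
    ordered = sorted(segments, key=lambda s: s.start)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break  # later segments start even later, so none can overlap
            if second.speaker_id != first.speaker_id:
                pairs.append((first, second))
    return pairs


# Made-up sample annotations for illustration only.
segments = [
    Segment("S1", 0.0, 2.5, "hello"),
    Segment("S2", 2.0, 4.0, "hi there"),
    Segment("S1", 4.5, 6.0, "how are you"),
]

totals = speech_seconds_by_speaker(segments)
overlaps = overlapping_pairs(segments)
print(totals)
print(len(overlaps))
```

With these sample values, speaker S1 has 4.0 seconds of annotated speech, S2 has 2.0 seconds, and there is one overlapping pair (the first two segments).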
500 Hours - Korean Conversational Speech Data by Mobile Phone
This 500-hour Korean conversational speech data set was collected by phone from more than 700 native speakers, with a balanced gender ratio. Speakers chose a few familiar topics from a given list and started conversations on them, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed, with annotations for the text content, the start and end time of each effective sentence, and speaker identification. The sentence accuracy rate is ≥ 95%.
500 Hours - Italian Conversational Speech Data by Mobile Phone
This 500-hour Italian conversational speech data set involved more than 700 native speakers, with a balanced gender ratio. Speakers chose a few familiar topics from a given list and started conversations on them, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed, with annotations for the text content, the start and end time of each effective sentence, and speaker identification. The word accuracy rate is ≥ 98%.
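The data sets above quote either a sentence accuracy rate or a word accuracy rate, and the two are not directly comparable. The article does not say how these figures are computed, so the sketch below uses common definitions as an assumption: sentence accuracy as the share of transcripts that match a reference exactly, and word accuracy as one minus the word-level edit distance divided by the number of reference words. The sample sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(references, transcripts):
    """1 - (total word errors / total reference words), under the assumed definition."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, transcripts):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    return 1 - errors / total


def sentence_accuracy(references, transcripts):
    """Share of transcripts whose words match the reference exactly."""
    correct = sum(r.split() == h.split() for r, h in zip(references, transcripts))
    return correct / len(references)


# Made-up example: one wrong word in the first of three sentences.
references = [
    "the cat sat on the mat",
    "good morning everyone",
    "see you tomorrow at noon",
]
transcripts = [
    "the cat sat on a mat",
    "good morning everyone",
    "see you tomorrow at noon",
]

word_acc = word_accuracy(references, transcripts)
sent_acc = sentence_accuracy(references, transcripts)
print(f"word accuracy: {word_acc:.3f}, sentence accuracy: {sent_acc:.3f}")
```

In this example a single wrong word out of 14 leaves word accuracy at 13/14 while sentence accuracy drops to 2/3, which shows why a sentence-level threshold is the stricter of the two under these definitions.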
100 Hours - Russian Conversational Speech Data by Mobile Phone
This 100-hour Russian conversational speech data set involved more than 130 native speakers, with a balanced gender ratio. Speakers chose a few familiar topics from a given list and started conversations on them, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed, with annotations for the text content, the start and end time of each effective sentence, and speaker identification.