From: Nexdata  Date: 2024-08-15
The quality and diversity of datasets determine the intelligence level of AI models. Whether they are used for smart security, autonomous driving, or human-machine interaction, the accuracy of datasets directly affects the performance of the model. With the development of data collection technology, all types of customized datasets are constantly being created to support the optimization of AI algorithms. Through in-depth research on these types of datasets, AI technology's application prospects will become broader.
From the perspective of the current data industry, most speech recognition data is read (scripted) training data. Read speech data can cover relatively simple human-machine interaction scenarios such as mobile phone voice assistants, in-vehicle voice assistants, smart speakers, and smart home appliances.
In these scenarios, the dialogue or command exchange between the user and the machine usually takes the form of single short sentences. The user often has to pay attention to their own speaking speed and pronunciation, which is essentially unnatural speech. Here, read speech data can meet the training needs of the speech recognition algorithm.
However, as speech recognition technology is deployed in more natural scenarios such as intelligent customer service and intelligent conferencing, the training effect of read speech data has become unsatisfactory. Speakers' pronunciation habits in daily life are far more natural: there is extensive slurring, swallowed syllables, distorted pronunciation, and unclear articulation, along with unconscious fillers such as "um," "ah," and "uh." Speakers rarely control their voice or pronunciation deliberately. When several people talk at the same time, complex phenomena such as interrupted sentences, turn-grabbing, and overlapping speech also occur. As a result, recognition accuracy on this natural, conversational style of speech remains unsatisfactory.
Data is the foundation of artificial intelligence. To make AI technology more accurate, a training dataset that better matches the application scenario is required, and natural dialogue speech data has become one of the most urgently needed datasets in the industry. When Nexdata collects natural dialogue speech data, there is no preset corpus at all; only a list of topics is given. The recorders select topics they are familiar with and start a conversation, which ensures that the dialogue speech is natural and fluent.
Right now, Nexdata has 200,000 hours of off-the-shelf speech data, including nearly 40,000 hours of natural dialogue speech data covering English, Japanese, Korean, Hindi, Vietnamese, Arabic, Spanish, French, German, Italian, and more. The speakers come from different regions and cities, with balanced age and gender coverage. All the audio has undergone strict manual transcription and quality inspection. The text content, the start and end time points of valid sentences, and the identity of the recorder are labeled, and the sentence accuracy rate is over 95%.
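To make the labeling scheme concrete, the minimal sketch below models one such annotation record in Python: each valid sentence carries its transcribed text, start and end time points, and the recorder's identity, plus a simple exact-match sentence accuracy check. The field names and the accuracy metric are illustrative assumptions for this sketch, not Nexdata's actual delivery schema or QA procedure.

```python
from dataclasses import dataclass

@dataclass
class UtteranceLabel:
    """One labeled sentence in a conversational recording (illustrative schema, not the vendor format)."""
    speaker_id: str   # identity of the recorder
    start_sec: float  # start time point of the valid sentence, in seconds
    end_sec: float    # end time point of the valid sentence, in seconds
    text: str         # manually transcribed content

def sentence_accuracy(labels, reference_texts):
    """Fraction of sentences whose transcription exactly matches a reference (toy metric)."""
    if not labels:
        return 0.0
    correct = sum(1 for lab, ref in zip(labels, reference_texts) if lab.text == ref)
    return correct / len(labels)

# Hypothetical example: two labeled segments from one mobile-phone conversation
segments = [
    UtteranceLabel("SPK_001", 0.42, 3.18, "so what did you think of the movie"),
    UtteranceLabel("SPK_002", 3.40, 6.05, "um honestly I fell asleep halfway through"),
]
print(sentence_accuracy(segments, [s.text for s in segments]))  # 1.0
```

In practice a delivered dataset would pair such segment-level labels with the audio files and speaker metadata (region, age, gender); the dataclass above only shows the sentence-level fields named in the text.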
American English Conversational Speech Data by Mobile Phone
2,000 speakers participated in the recording and communicated face-to-face in a natural way. They held free discussions on a number of given topics covering a wide range of fields; the speech is natural and fluent, matching real dialogue scenarios. The text was transcribed manually with high accuracy.
Spanish Conversational Speech Data by Mobile Phone
About 700 speakers participated in the recording and communicated face-to-face in a natural way. They held free discussions on a number of given topics covering a wide range of fields; the speech is natural and fluent, matching real dialogue scenarios. The text was transcribed manually with high accuracy.
German Conversational Speech Data by Mobile Phone
About 750 speakers participated in the recording and communicated in a natural way. They held free discussions on a number of given topics covering a wide range of fields; the speech is natural and fluent, matching real dialogue scenarios. The text was transcribed manually with high accuracy.
Korean Conversational Speech Data by Mobile Phone
About 700 Korean speakers participated in the recording and communicated face-to-face in a natural way. They held free discussions on a number of given topics covering a wide range of fields; the speech is natural and fluent, matching real dialogue scenarios. The text was transcribed manually with high accuracy.
Minnan Dialect Conversational Speech Data by Mobile Phone
This 500-hour dataset was recorded by about 1,000 native Hokkien speakers from Quanzhou, Zhangzhou, and Xiamen. The gender ratio is balanced and multiple age groups are covered. As with our other conversational collections, there is no preset corpus: to keep the dialogue smooth and natural, the recorders choose topics they are familiar with, start the conversation, and record it.
If you want to know more details about the datasets or how to acquire them, please feel free to contact us: info@nexdata.ai.
Data-driven AI transformation is deeply affecting the way we live and work. The dynamic nature of data is key for artificial intelligence models to maintain high performance. By constantly collecting new data and expanding existing datasets, we can help models better cope with new problems. If you have data requirements, please contact Nexdata at info@nexdata.ai.