Multilingualism and Code-Switching: Challenges for Future Speech Recognition Technology

From: Nexdata Date: 08/15/2024

➤ ASR development and its English focus

In modern artificial intelligence, the success of an algorithm depends on the quality of its data. As data plays an increasingly prominent role in AI models, collecting high-quality data and making full use of it becomes crucial. This article will help you better understand the core role of data in AI programs.

In recent years, Automatic Speech Recognition (ASR) has made significant progress in commercial use, and several enterprise-level ASR models based on neural networks have been successfully launched, such as Alexa, Rev, AssemblyAI, and ASAPP.

Back in 2016, Microsoft Research published an article announcing that its system had reached human-level performance, measured by Word Error Rate (WER), on Switchboard, a dataset that was about 25 years old at the time. ASR accuracy is still improving, reaching human level on more datasets and in more use cases.
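Since WER is the metric behind that claim, a short illustration may help. WER is commonly defined as the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal version of that calculation; the function name `word_error_rate` and the example sentences are illustrative assumptions, and it splits on whitespace only, without the text normalization that real evaluations usually apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch: tokens are split on whitespace and compared
    exactly, with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A lower value is better, and because insertions count as errors the value can exceed 1.0 when the output is much longer than the reference.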

However, today’s commercial ASR models are mainly trained on English datasets and therefore achieve higher accuracy on English input. Academia and industry have long focused on English because of data availability and market demand. Although recognition accuracy is reasonable for commercially popular languages such as French, Spanish, Portuguese, and German, there is clearly a long tail of languages with limited training data and relatively low ASR output quality.

➤ Multilingualism in business systems

Furthermore, most business systems are built around a single language and cannot be applied to the multilingual scenarios found in many societies. Multilingualism can take the form of back-to-back languages, such as media programming in bilingual countries. Amazon has recently made strides on this problem with a product that integrates language identification (LID) and ASR. In contrast, code-switching (also known as cross-language speech) is a way of speaking in which an individual combines the words and grammar of two languages in the same sentence.

As a world-leading AI data service provider, Nexdata has developed 200,000 hours of speech data covering more than 60 languages and dialects. This data helps developers build applications that work for users in any language, truly bringing the power of speech recognition to the world.

Japanese Speech Data By Mobile Phone

The dataset was recorded by 1,245 local Japanese speakers with authentic accents. The recorded texts are rich in content and cover general, interactive, in-car, home, and other categories.

Brazilian Portuguese Speech Data by Mobile Phone

The data volume is 1,044 hours, recorded by 2,038 native Brazilian speakers. The recording text was designed by linguistic experts and covers general, interactive, in-car, and home categories. The texts are manually proofread with high accuracy. The recording devices are mainstream Android phones and iPhones.

Italian Speech Data by Mobile Phone

The data were recorded by 3,109 native Italian speakers with authentic Italian accents. The recorded content covers a wide range of categories such as general purpose, interactive, in-car commands, and home commands. The recorded text was designed by a language expert and manually proofread with high accuracy. The data matches mainstream Android and Apple phones.

➤ Datasets for speech recognition

Mixed Speech with Korean and English Data by Mobile Phone

The data was recorded by native Korean speakers. The recorded text is a mixture of Korean and English sentences, covering general scenes and human-computer interaction scenes. It is rich in content and accurately transcribed. It can be used to improve a speech recognition system's performance on read speech that mixes Korean and English.

Mixed Speech with Chinese and English Data by Mobile Phone

The data was recorded by 1,113 native Chinese speakers with accents covering seven major dialect areas. The recorded text is a mixture of Chinese and English sentences, covering general scenes and human-computer interaction scenes. It is rich in content and accurately transcribed. It can be used to improve a speech recognition system's performance on read speech that mixes Chinese and English.

End

If you want to know more about the datasets or how to acquire them, please feel free to contact us: info@nexdata.ai.

On the road to an intelligent future, data will always be an indispensable driving force. The continuous expansion and optimization of all kinds of datasets will open up broader application space for AI algorithms. By constantly exploring new data collection and annotation methods, all industries can better handle complex application scenarios. If you have data requirements, please contact Nexdata.ai at info@nexdata.ai.