Chinese Dialects Data

From: Nexdata Date: 08/14/2024

➤ Chinese dialects in speech recognition

Data is the “fuel” that drives AI systems toward continuous progress, but building high-quality datasets isn’t easy. Collecting, cleaning, and annotating data, and protecting privacy along the way, are all challenging. Researchers need to collect targeted data for the complex problems that arise in different fields so that trained models are robust and generalize well. With rich datasets, AI systems can make intelligent decisions in more complex scenarios.

Chinese, with its rich linguistic heritage, is a language that boasts a multitude of dialects. From Mandarin to Cantonese, Shanghainese to Hokkien, these dialects reflect the diverse cultural and regional identities across China. However, this linguistic diversity poses a significant challenge when it comes to speech recognition technology.

Speech recognition is the process of converting spoken words into written text using advanced algorithms and machine learning. It has become increasingly prevalent in our daily lives, from virtual assistants like Siri and Alexa to voice-controlled devices. However, the complexity of Chinese dialects complicates the development and implementation of accurate speech recognition systems.

➤ Challenges in Chinese dialect speech recognition

One of the primary challenges lies in the vast differences in pronunciation and vocabulary among Chinese dialects. Mandarin, the official language of China, serves as a common standard, but even within Mandarin, there are variations across different regions. For example, the pronunciation of certain sounds may differ between northern and southern dialects. This variability makes it difficult for speech recognition systems to accurately interpret and transcribe spoken words, leading to errors and misinterpretations.

Furthermore, the lack of standardized written forms for some Chinese dialects adds another layer of complexity. While Mandarin has a unified system of characters, dialects like Cantonese are predominantly spoken languages with limited written representation. This lack of standardized characters makes it challenging for speech recognition systems to match spoken words with written equivalents accurately.

Another hurdle is the limited availability of training data for Chinese dialects. Speech recognition systems rely heavily on vast amounts of labeled data to learn and improve their accuracy. However, compared to Mandarin, there is significantly less data available for other Chinese dialects. This scarcity hinders the training of speech recognition models for these dialects, impeding their development and accuracy.

➤ Nexdata Chinese Dialects Data

500 Hours – Minnan Dialect Conversational Speech Data by Mobile Phone

The 500 Hours – Minnan Dialect Conversational Speech Data was collected by mobile phone from more than 1,000 native speakers, with a balanced gender ratio and geographical distribution. Speakers chose a few familiar topics from a given list and held conversations on them, which keeps the dialogue fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all the speech data was recorded in quiet indoor environments. All the speech audio was manually transcribed, and the start and end timestamps of each effective sentence and the speaker identification, including gender, were also annotated. The sentence accuracy rate is ≥ 95%.
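As a minimal sketch of how a recording could be checked against the audio format described above (16 kHz, 16-bit, uncompressed WAV), the following uses Python's standard `wave` module. The function names `describe_wav` and `matches_spec` and the file path in the usage note are hypothetical; they are not part of any Nexdata tooling.

```python
import wave


def describe_wav(path):
    """Read basic format facts from an uncompressed (PCM) WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,  # bytes -> bits
            "channels": wav.getnchannels(),
            "duration_s": frames / rate if rate else 0.0,
        }


def matches_spec(info, sample_rate_hz=16000, bits_per_sample=16):
    """True if the file has the sample rate and bit depth stated in the text."""
    return (
        info["sample_rate_hz"] == sample_rate_hz
        and info["bits_per_sample"] == bits_per_sample
    )
```

Calling `matches_spec(describe_wav("example.wav"))` on a file (the path here is a placeholder) returns `True` only when both the sample rate and bit depth agree with the stated format. Summing `duration_s` over all files is one way to sanity-check a dataset's advertised hours.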

799 Hours - Sichuan Dialect Conversational Speech Data by Mobile Phone

The 799 Hours - Sichuan Dialect Conversational Speech Data was collected by mobile phone from 1,730 native speakers. Speakers held conversations with no restriction on topic, which keeps the dialogue fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all the speech data was recorded in quiet indoor environments. All the speech audio was manually transcribed, and the start and end times of each effective sentence, the speaker identification, and other attributes were annotated. The sentence accuracy rate is ≥ 95%.
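To make the annotation description above more concrete, here is a hedged sketch of what one annotated sentence might look like as a record, together with one plausible reading of "sentence accuracy rate": the share of sentences whose transcription exactly matches a second-pass review. The field names, the `Segment` class, and the exact-match definition are all assumptions for illustration; the actual delivery format and quality metric are not specified in the text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One annotated sentence; the field names are hypothetical."""
    start_s: float    # start timestamp of the effective sentence, in seconds
    end_s: float      # end timestamp, in seconds
    speaker_id: str   # speaker identification
    gender: str
    text: str         # manual transcription


def sentence_accuracy(delivered, reviewed_texts):
    """Fraction of sentences whose text exactly matches the reviewed text."""
    if len(delivered) != len(reviewed_texts):
        raise ValueError("both lists must cover the same sentences")
    if not delivered:
        return 0.0
    correct = sum(seg.text == ref for seg, ref in zip(delivered, reviewed_texts))
    return correct / len(delivered)


# Made-up example data, not taken from the dataset.
segments = [
    Segment(0.00, 2.35, "spk_001", "female", "sentence one"),
    Segment(2.80, 5.10, "spk_002", "male", "sentence two"),
]
print(sentence_accuracy(segments, ["sentence one", "sentence 2"]))  # 0.5
```

Under this reading, a dataset meets the stated threshold when the function returns at least 0.95 on a reviewed sample.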

➤ Speech data in Taiwan and Guangdong

203 People - Taiwanese Mandarin Speech Data by Mobile Phone_Guiding

The data was collected from 203 speakers in Taiwan, covering Taipei, Kaohsiung, Taichung, Tainan, and other areas: 137 females and 66 males. It was recorded in quiet indoor environments. It can be used for speech recognition, machine translation, and voiceprint recognition model training and algorithm research.

1,652 Hours – Cantonese Dialect Speech Data by Mobile Phone

The data was collected from 4,888 speakers from Guangdong Province and recorded in quiet indoor environments. The recorded content covers 500,000 commonly used spoken sentences, including high-frequency words in weico and everyday expressions. The average number of repetitions is 1.5 and the average sentence length is 12.5 words. The recording devices are mainstream Android phones and iPhones.
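The figures above can be combined into rough per-recording and per-speaker estimates. The sketch below assumes that "average number of repetitions is 1.5" means each of the 500,000 sentences was recorded 1.5 times on average, and that the full 1,652 hours consist of those recordings; both are interpretations, not statements from the dataset description, so the results are back-of-the-envelope estimates only.

```python
# Figures quoted in the dataset description.
TOTAL_HOURS = 1652
SPEAKERS = 4888
UNIQUE_SENTENCES = 500_000
AVG_REPETITIONS = 1.5  # assumed: average recordings per unique sentence


def corpus_estimates(total_hours, speakers, unique_sentences, avg_repetitions):
    """Derive rough averages from the headline numbers."""
    recordings = unique_sentences * avg_repetitions
    return {
        "recordings": recordings,
        "seconds_per_recording": total_hours * 3600 / recordings,
        "minutes_per_speaker": total_hours * 60 / speakers,
        "recordings_per_speaker": recordings / speakers,
    }


estimates = corpus_estimates(TOTAL_HOURS, SPEAKERS, UNIQUE_SENTENCES, AVG_REPETITIONS)
for name, value in estimates.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the corpus holds about 750,000 recordings averaging roughly 8 seconds each, with about 20 minutes of audio per speaker.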

Data-driven AI transformation is deeply affecting how we live and work. The dynamic nature of data is key for artificial intelligence models to maintain high performance. By constantly collecting new data and expanding existing datasets, we can help models better cope with new problems. If you have data requirements, please contact Nexdata.ai at [email protected].
