From: Nexdata Date: 2024-08-14
The quality and diversity of datasets determine the intelligence level of an AI model. Whether a model is used for smart security, autonomous driving, or human-machine interaction, the accuracy of its datasets directly affects its performance. With the development of data collection technology, all types of customized datasets are constantly being created to support the optimization of AI algorithms. Through in-depth research on these types of datasets, AI technology’s application prospects will become broader.
In the realm of technological advancement, speech recognition has emerged as a groundbreaking innovation that has revolutionized human-computer interaction. From voice assistants to transcription services, this technology has become an indispensable part of our daily lives. However, one of the significant challenges in this domain lies in accurately recognizing and processing the Malay language, a complex and diverse language with unique linguistic features.
Malay, spoken in Southeast Asia, is one of the most widely spoken languages globally, with diverse dialects and variations across different regions. Its intricate linguistic structure poses a challenge for speech recognition systems to accurately interpret and comprehend spoken Malay. The language is known for its agglutinative nature, where multiple affixes and particles are added to root words to create new meanings, making it difficult for algorithms to disentangle and recognize individual words.
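To illustrate why affixation complicates word recognition, here is a minimal Python sketch of naive affix stripping. The affix lists and the helper name strip_affixes are illustrative assumptions, not a complete description of Malay morphology, and the sketch ignores the sound changes that occur at prefix boundaries (for example, menulis is built on the root tulis).

```python
# Toy illustration only: a small, incomplete set of common Malay affixes.
PREFIXES = ("mem", "men", "me", "ber", "di", "ter", "pe")  # longer forms first
SUFFIXES = ("kan", "an", "i")

def strip_affixes(word: str, min_root: int = 3) -> str:
    """Naively remove at most one prefix and one suffix from a word."""
    w = word.lower()
    for p in PREFIXES:
        # Only strip if at least min_root characters would remain.
        if w.startswith(p) and len(w) - len(p) >= min_root:
            w = w[len(p):]
            break
    for s in SUFFIXES:
        if w.endswith(s) and len(w) - len(s) >= min_root:
            w = w[:-len(s)]
            break
    return w

for word in ("membacakan", "dibaca", "berjalan"):
    print(word, "->", strip_affixes(word))
```

The first two words reduce to the root baca, but berjalan is cut down to jal even though its root is jalan, because the root itself happens to end in the letters of a suffix. Rule-based stripping of this kind breaks down quickly, which is one reason affix-rich vocabulary is hard for recognition systems.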
Another significant obstacle in Malay speech recognition is the wide range of phonetic and pronunciation variation among its speakers. This variation can be attributed to factors such as geographical location, cultural influences, and individual speaking styles. The presence of different accents and dialects further complicates the process of building a robust and accurate speech recognition system for Malay.
Limited Data Availability
Data scarcity is a common hurdle faced by developers working on speech recognition technology for languages with fewer speech resources, and Malay is no exception. Compared to more prominent languages like English or Mandarin, there is little speech data available for training Malay-specific speech recognition models, and what exists is not very diverse. The shortage of quality data can hinder the system's ability to understand and adapt to various speech patterns, leading to reduced accuracy and reliability.
Lack of Standardization
The absence of a single standardized form of Malay adds to the complexity of speech recognition. While Bahasa Malaysia serves as the official language in Malaysia, different regions and communities have their own variations of the language. This lack of standardization makes it challenging to develop a one-size-fits-all model and calls for region-specific adaptations.
Noise and Environmental Factors
Speech recognition systems also encounter difficulties when dealing with background noise and environmental factors that can interfere with the clarity of spoken Malay. In real-world scenarios, users may interact with speech recognition technology in various settings, such as crowded streets, noisy offices, or public transportation. Robust systems that can handle such noisy environments and still accurately recognize Malay speech are essential but challenging to develop.
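One common way to prepare a model for such conditions is to mix recorded noise into clean training audio at a controlled signal-to-noise ratio (SNR). The sketch below is a minimal, standard-library-only illustration of that idea; the function name mix_at_snr and the synthetic tone and Gaussian noise are assumptions made for the example, not part of any Nexdata tooling.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")
    # Choose gain so that speech_power / (gain**2 * noise_power) == 10**(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(speech, noise)]

# Synthetic stand-ins: a 440 Hz tone as "speech" and Gaussian noise, one second at 16 kHz.
rng = random.Random(0)
speech = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10)
```

In practice the noise would come from recordings of streets, offices, or vehicles rather than a random generator, and the SNR would typically be varied across training examples.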
Nexdata Malay Speech Data
134 Hours - Malay Speech Data by Mobile Phone_Reading
156 Speakers - Mobile Telephony Malay Speech Data_Reading was recorded by native Malay speakers in a quiet environment. The recordings are rich in content, covering multiple categories such as economy, entertainment, news, oral language, numbers, and letters. Each speaker recorded around 450 sentences, and the effective duration is 134 hours. All texts are manually transcribed to ensure high accuracy.
155 People - Malay Speech Data by Mobile Phone_Guiding
155 local Malaysian speakers participated in the recording, and the recording environment is quiet. The recordings cover various categories such as in-car scenes, home, and speech assistant. Each speaker recorded 50 sentences, and the valid duration is 7 hours. All texts are manually transcribed with high accuracy.
370 Hours - Malay Speech Data by Mobile Phone
675 native Malaysian speakers with authentic accents participated in the recording. The recorded script was designed by linguists and covers a wide range of topics, including generic, interactive, on-board, and home. The text is manually proofread with high accuracy. The data is compatible with mainstream Android and Apple phones. The dataset can be applied to automatic speech recognition and machine translation scenarios.
198 Hours - Malaysian English Speech Data by Mobile Phone
423 native Malay speakers were involved, balanced for gender. The recording corpus is rich in content and covers a wide range of domains, such as generic command and control, human-machine interaction, smart home, and in-car. The transcription corpus has been manually proofread to ensure high accuracy.
100 Hours - Malay Conversational Speech Data by Mobile Phone
The 100 Hours - Malay Conversational Speech Data by Mobile Phone dataset was collected by phone from about 130 native speakers with a balanced gender ratio. Speakers chose a few familiar topics from a given list and started conversations, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all the speech data was recorded in quiet indoor environments. All the speech audio was manually transcribed, with annotations for the text content, the start and end time of each effective sentence, and speaker identification.
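When working with audio delivered in a stated format like this, a quick sanity check of each file's header can catch mismatches before training. The following is a minimal sketch using Python's standard wave module; the helper names matches_spec and make_silent_wav are hypothetical, and the in-memory silent file only stands in for a real recording. The channel count is not checked because the description above does not state it.

```python
import io
import wave

def matches_spec(wav_file, rate=16000, sample_width_bytes=2):
    """Return True if the WAV is uncompressed PCM at the expected rate and bit depth."""
    with wave.open(wav_file, "rb") as wf:
        return (
            wf.getframerate() == rate
            and wf.getsampwidth() == sample_width_bytes  # 2 bytes == 16-bit samples
            and wf.getcomptype() == "NONE"               # "NONE" means uncompressed PCM
        )

def make_silent_wav(rate, sample_width_bytes, seconds=1):
    """Build an in-memory mono WAV of silence as a stand-in for a real recording."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(sample_width_bytes)
        wf.setframerate(rate)
        wf.writeframes(b"\x00" * (rate * sample_width_bytes * seconds))
    buf.seek(0)
    return buf

print(matches_spec(make_silent_wav(16000, 2)))  # file written at 16 kHz, 16-bit
print(matches_spec(make_silent_wav(8000, 2)))   # file written at 8 kHz, 16-bit
```

The same function accepts a file path instead of an in-memory buffer, so it can be run over a directory of delivered recordings.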
The future of AI is highly dependent on the support of data. With the development of technology and the expansion of application scenarios, high-quality datasets will become the key to improving AI performance. In this data-driven revolution, we will be better able to meet the opportunities and challenges of technology development if we constantly focus on data quality and strengthen data security management.