From: Nexdata  Date: 2024-08-15
In the process of building an intelligent future, datasets play a vital role. From autonomous vehicles to smart security systems, high-quality datasets give AI models massive amounts of learning material, making them more adaptable to varied real-world scenarios. By continuously improving the efficiency of data collection and annotation, companies and researchers can accelerate the deployment of AI technology and help every industry achieve digital transformation.
We need a large volume of speech data to train and continuously optimize speech recognition models. In this article, I introduce 10 datasets commonly used in the field of speech analysis.
1. Common Voice
Mozilla claims Common Voice is the largest publicly available human speech dataset. The current release spans 29 languages, including Chinese, with nearly 2,454 hours of recorded voice data (1,965 hours of which are verified) collected from over 40,000 contributors. The project carries an open commitment: to make the high-quality speech data it collects available to startups, researchers, and anyone else interested in speech technology.
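If you download a Common Voice language pack, it typically unpacks into a clips/ folder of MP3 files plus TSV metadata files. Here is a minimal sketch of inspecting one; the path and exact column names are assumptions that can vary between releases:

```python
import pandas as pd

# Assumed Common Voice layout: a clips/ directory of MP3 files plus TSV
# metadata; validated.tsv lists the clips whose transcripts contributors
# have verified. Column names may vary slightly between releases.
cv = pd.read_csv("cv-corpus/zh-CN/validated.tsv", sep="\t")

print(f"{len(cv)} validated clips")
print(cv[["path", "sentence"]].head())

# Keep only clips whose transcript received more upvotes than downvotes.
trusted = cv[cv["up_votes"] > cv["down_votes"]]
print(f"{len(trusted)} clips with net-positive votes")
```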
2. Tatoeba
Tatoeba, started in 2006, is a large database of sentences, translations, and spoken audio for language learning. The website collects example sentences for foreign-language learners, and anyone can search for example sentences containing a given word without registering; if an example sentence has a recorded pronunciation, you can click to listen to it. Registered users can add, translate, adopt, improve, and discuss sentences, and can talk with other users on the message boards. On the message boards all languages are equal, and registered users can communicate with others in whichever language they prefer.
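Tatoeba also publishes its database as downloadable exports. The sketch below assumes the tab-separated sentences.csv layout (sentence id, ISO 639-3 language code, text) and pulls out the English sentences:

```python
import csv

# Assumed export format: despite the .csv name, sentences.csv is
# tab-separated with three fields per row: id, language code, text.
english = []
with open("sentences.csv", encoding="utf-8", newline="") as f:
    for sent_id, lang, text in csv.reader(f, delimiter="\t"):
        if lang == "eng":
            english.append((int(sent_id), text))

print(f"{len(english)} English sentences")
print(english[:3])
```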
3. VOiCES
This dataset was collected in complex acoustic environments. Recordings were made in real rooms of different sizes, capturing each room's particular background sounds and reverberation, and they include various types of distractor noise (television, music, or babble). Twelve microphones placed throughout the room record the audio at a distance, each producing 120 hours of audio. To mimic human behavior in conversation, the foreground speech is played from a motorized platform that rotates through a range of angles during recording.
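You do not need the original rooms to experiment with this kind of data. As a generic illustration (not the VOiCES recording pipeline itself), distractor noise can be mixed into clean speech at a chosen signal-to-noise ratio:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target SNR in dB (generic augmentation sketch)."""
    noise = np.resize(noise, speech.shape)  # loop or trim the noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10 * log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for 1 s of 16 kHz speech
noise = rng.standard_normal(8000)    # stand-in for recorded TV/music/babble
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```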
4. LibriSpeech
LibriSpeech is an audiobook dataset containing both text and speech: a corpus of approximately 1,000 hours of 16 kHz read English speech prepared by Vassil Panayotov. The data is derived from audiobooks in the LibriVox project, carefully segmented and aligned, and organized into text-annotated audio files of about 10 seconds each, which makes it ideal for getting started.
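If you work in PyTorch, torchaudio ships a ready-made LibriSpeech loader, so getting a first batch of aligned audio and text takes a few lines:

```python
import torchaudio

# Download the 100-hour "clean" training subset and index it.
dataset = torchaudio.datasets.LIBRISPEECH(
    "./data", url="train-clean-100", download=True
)

# Each item is (waveform, sample_rate, transcript, speaker, chapter, utterance).
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate)  # 16000
print(transcript)   # the text read in this ~10-second clip
```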
5. VoxForge
A dataset of clean English speech spoken with a variety of accents, useful for improving robustness to different accents and intonations. VoxForge was created to collect annotated recordings for free and open-source speech recognition engines (on Linux/Unix, Windows, and Mac platforms).
6. VoxCeleb
VoxCeleb is a large-scale speaker recognition dataset. It contains about 100,000 utterances from 1,251 celebrities, extracted from YouTube videos. The data is roughly gender-balanced (55% male), and the speakers cover a wide range of accents, occupations, and ages. There is no overlap between the development and test sets.
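VoxCeleb is most often used for speaker verification, where systems are scored by the equal error rate (EER) over trial pairs. The following self-contained sketch shows the metric itself; the scores and labels are made up for illustration:

```python
import numpy as np

def eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate: point where false-accept and false-reject rates meet."""
    order = np.argsort(-scores)  # sweep the decision threshold from high to low
    labels = labels[order]
    frr = 1.0 - np.cumsum(labels) / labels.sum()       # false-reject rate
    far = np.cumsum(1 - labels) / (1 - labels).sum()   # false-accept rate
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2

# Toy verification trials: label 1 = same speaker, 0 = different speakers.
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
print(f"EER = {eer(scores, labels):.2f}")
```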
7. TIMIT
TIMIT (The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus) is an acoustic-phonetic continuous speech corpus constructed by Texas Instruments, MIT, and SRI International. Speech is sampled at 16 kHz, and the corpus contains a total of 6,300 sentences: 10 sentences spoken by each of 630 speakers drawn from the eight major dialect regions of the United States, with every sentence manually segmented and labeled at the phone level. TIMIT includes time-aligned orthographic, phonetic, and word transcriptions, as well as a 16-bit, 16 kHz speech waveform file for each utterance.
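Those phone-level alignments live in per-utterance .PHN files, one segment per line in the form "begin_sample end_sample phone". A minimal parser (the path below is illustrative; adjust it to your own copy):

```python
def read_phn(path: str):
    """Parse a TIMIT .PHN file: each line is 'begin_sample end_sample phone'."""
    segments = []
    with open(path) as f:
        for line in f:
            begin, end, phone = line.split()
            segments.append((int(begin), int(end), phone))
    return segments

# Illustrative path; adjust to wherever your TIMIT copy is unpacked.
for begin, end, phone in read_phn("TIMIT/TRAIN/DR1/FCJF0/SA1.PHN"):
    # Sample indices at 16 kHz, so dividing by 16000 converts to seconds.
    print(f"{phone:>4s}: {begin / 16000:.3f}-{end / 16000:.3f} s")
```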
8. AudioSet
AudioSet is a large-scale audio dataset released by Google in 2017. It contains an ontology of 632 audio event categories and 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos (covering 527 of the labels). The ontology is organized as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and everyday environmental sounds.
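That ontology is distributed as a JSON file (ontology.json in Google's audioset ontology repository) in which each category node carries an id, a human-readable name, and a child_ids list. A short sketch that prints the hierarchy:

```python
import json

# ontology.json is a flat list of nodes; "child_ids" links them into a
# hierarchical graph of audio event categories.
with open("ontology.json") as f:
    nodes = {n["id"]: n for n in json.load(f)}

def print_tree(node_id: str, depth: int = 0) -> None:
    node = nodes[node_id]
    print("  " * depth + node["name"])
    for child_id in node.get("child_ids", []):
        print_tree(child_id, depth + 1)

# Roots are the nodes that no other node lists as a child.
children = {c for n in nodes.values() for c in n.get("child_ids", [])}
for root_id in nodes.keys() - children:
    print_tree(root_id)
```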
9. GigaSpeech
GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, plus 40,000 hours of total audio suitable for semi-supervised and unsupervised training. The roughly 40,000 hours of transcribed audio are first collected from audiobooks, podcasts, and YouTube, covering both read and spontaneous speaking styles and a variety of topics such as arts, science, and sports.
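One common access path is the gated mirror on the Hugging Face Hub. Assuming you have accepted the dataset's terms and logged in, loading the smallest ("xs", roughly 10-hour) subset might look like this; the repository, subset, and field names follow the speechcolab/gigaspeech dataset card:

```python
from datasets import load_dataset

# Gated dataset: accept the terms on the Hub and run `huggingface-cli login` first.
gs = load_dataset("speechcolab/gigaspeech", "xs", split="train")

example = gs[0]
print(example["text"])                    # transcript
print(example["audio"]["sampling_rate"])  # decoded audio + sampling rate
```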
10. CMU Wilderness Multilingual Speech Dataset
The CMU Wilderness Multilingual Speech Dataset covers over 700 languages, providing audio, aligned text, and word pronunciations. On average, each language comes with around 20 hours of sentence-length transcribed audio.
In the future, data-driven intelligence will profoundly change how every industry operates. To ensure the long-term development of AI technology, high-quality datasets will remain an indispensable foundational resource. By continuously optimizing data collection techniques and building more sophisticated datasets, AI systems will bring new opportunities, and new challenges, to all walks of life.