From: Nexdata    Date: 2024-08-14
Visual and auditory information are the primary sources through which humans perceive the external world. The human brain integrates heterogeneous multimodal information to form a holistic understanding of the surrounding environment. For example, in a cocktail-party scene with multiple speakers, we can enhance our reception of the speech of the person of interest by watching the movements of their lips. Audiovisual learning is therefore indispensable for exploring the perceptual capabilities of human-like machines.
Each sense provides unique information about the surrounding environment. Although the information received by the different senses differs, the resulting representation of the environment is a unified experience rather than a set of unrelated sensations. A representative example is the McGurk effect: when mismatched visual and auditory speech signals are presented together (for instance, audio of /ba/ paired with lip movements for /ga/), they fuse into a single percept, often /da/. Such phenomena suggest that in human perception, signals from multiple senses are routinely integrated.
Humans can predict information in one modality under the guidance of another, known modality. For example, even in the absence of sound, visual information about lip movements allows us to roughly infer what a person is saying. The semantic, spatial, and temporal consistency between audio and video makes human-like cross-modal generation by machines possible. Cross-modal generation tasks now cover a range of directions, including mono audio generation, stereo (spatial) audio generation, video/image generation, and depth estimation.
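As a hedged illustration of what such cross-modal generation can look like in code, the sketch below maps a short clip of mouth-region frames to mel-spectrogram frames with a small PyTorch encoder-decoder. The architecture, feature sizes, and frame counts are assumptions chosen for illustration, not a specific published model or a Nexdata pipeline.

```python
# Minimal sketch of video-to-audio cross-modal generation (illustrative shapes):
# a small 3D-CNN encodes a clip of lip-region frames, and a GRU head predicts
# one mel-spectrogram frame per video frame.
import torch
import torch.nn as nn

class LipToMel(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        # Spatio-temporal visual encoder: (B, 1, T, H, W) -> (B, 64, T, 1, 1)
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool spatial dims, keep time
        )
        # Temporal decoder maps per-frame visual features to mel frames
        self.rnn = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_mels)

    def forward(self, lips: torch.Tensor) -> torch.Tensor:
        # lips: (batch, 1, time, height, width), grayscale mouth crops
        feats = self.encoder(lips)             # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)  # (B, 64, T)
        feats = feats.transpose(1, 2)          # (B, T, 64)
        out, _ = self.rnn(feats)               # (B, T, 2*hidden)
        return self.head(out)                  # (B, T, n_mels) predicted mel frames

# Example: 25 grayscale 64x64 mouth-crop frames -> 25 mel-spectrogram frames
model = LipToMel()
clip = torch.randn(2, 1, 25, 64, 64)
mel = model(clip)
print(mel.shape)  # torch.Size([2, 25, 80])
```

In practice, such a model would be trained against mel-spectrograms extracted from time-aligned audio, which is exactly where precisely synchronized audio-visual data matters.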
In addition to cross-modal generation, the semantic consistency between the audio and visual modalities suggests that learning in one modality can be aided by semantic information from the other, which is the goal of audiovisual transfer tasks. This same consistency also drives the development of cross-modal retrieval tasks.
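As a hedged sketch of how semantic consistency can support cross-modal retrieval, the example below trains a CLIP-style shared embedding space over precomputed audio and video features with a symmetric contrastive loss, then ranks video clips against an audio query by cosine similarity. The encoders, feature dimensions, and temperature value are illustrative assumptions rather than a particular published method.

```python
# Minimal sketch of an audio-visual contrastive embedding for cross-modal retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualEmbedder(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, embed_dim=256):
        super().__init__()
        self.audio_proj = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.video_proj = nn.Sequential(
            nn.Linear(video_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, audio_feats, video_feats):
        # L2-normalize so dot products are cosine similarities
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return a, v

def contrastive_loss(a, v, temperature=0.07):
    # Paired audio/video clips are positives; other pairs in the batch are negatives.
    logits = a @ v.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# One training step on a batch of precomputed clip-level features (shapes illustrative)
model = AudioVisualEmbedder()
audio = torch.randn(16, 128)   # e.g. pooled spectrogram features
video = torch.randn(16, 512)   # e.g. pooled frame features
a, v = model(audio, video)
loss = contrastive_loss(a, v)
loss.backward()

# Retrieval: rank all video clips by cosine similarity to one audio query
scores = a[0] @ v.t()          # higher score = better match
best = scores.argmax().item()
```

The same shared space supports transfer in the other direction: labels attached to one modality can supervise the other through their aligned embeddings.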
Nexdata Audio-Visual Training Datasets
155 Hours - Lip Sync Multimodal Video Data
Speech audio and matching lip-movement video were recorded from 249 people on multiple devices simultaneously and aligned precisely by a pulse signal, achieving high synchronization accuracy. The data can be used for multimodal learning research in the speech and image fields.
1,998 People - Lip Language Video Data
The data covers multiple scenes, age groups, and collection time periods. Each video captures the lip movements of a speaker reading an 8-digit Arabic numeral sequence. The dataset contains 41,866 videos with a total duration of 86 hours, 56 minutes, and 1.52 seconds. It can be used for tasks such as face anti-spoofing and lip-language (lip-reading) recognition.
1,178 Hours - American English Colloquial Video Speech Data
The 1,178-hour American English Colloquial Video Speech Data is a collection of video clips gathered from the internet, covering multiple topics. The audio is transcribed into text, and speaker identity and other attributes are annotated. This dataset can be used for training voiceprint recognition models, building machine translation corpora, and algorithm research.
500 Hours - German Colloquial Video Speech Data
500 Hours - German Colloquial Video Speech Data, collected from real websites and covering multiple domains. Attributes such as text content and speaker identity are annotated. This dataset can be used for training voiceprint recognition models, building machine translation corpora, and algorithm research.