Dataset Name | Data Type | Data Size | Content Description |
1,000 Images Caption Data of Diverse Scenes | Image | 1000 images | Image caption dataset of diverse scenes. The scene distribution includes natural scenery, urban streets, exhibitions, home environments, etc. Each image includes an English description of 3-5 sentences. |
1,000 Images Caption Data of OCR in Natural Scenes | Image | 1000 images | OCR caption dataset covering 14 languages. The image subjects include bus stops, posters, road signs, etc. Each image includes an English description of 3-5 sentences. |
1,000 Images Caption Data of Human Face | Image | 1000 images | Human face image caption dataset covering various head postures, facial expressions, etc. Each image includes an English description of 3-5 sentences. |
1,000 Images Caption Data of Gestures | Image | 1000 images | Gesture image caption dataset covering different angles and gesture categories. Each image includes an English description of 3-5 sentences. |
1,000 Images Human Facial Skin Defects Data | Image | 1000 images | Facial skin defect dataset, including acne, acne scars, dark spots, wrinkles and dark circles. |
1,000 Videos Caption Data of Human Motion | Video | 1000 videos | Human motion video caption dataset in CCTV and non-CCTV scenes. Human motions include walking, drinking, yawning, fitness, etc. Each video includes an English caption. |
1,000 People Multi-race 7 Expressions Recognition Data | Image | 1000 people | Dataset of 7 facial expressions: normal, happy, amazed, sad, angry, disgusted, and scared. |
1,000 Videos Multi-race Micro-expression (FACS) Data | Video | 1000 videos | Dataset of 57 facial micro-expressions, including inner brow raiser (AU1), outer brow raiser (AU2), upper lid raiser (AU5), etc. |
50 People- DMS Data | Video | 50 people | DMS dataset of dangerous behavior, fatigue behavior and visual movement behavior. The dataset diversity includes various subject age periods, time periods, vehicle types and camera positions. |
50 People-2D Face Anti-Spoofing Data | Image&Video | 50 people | 2D face anti-spoofing dataset. Real face data includes facial action videos, facial images and lip language videos. Anti-spoofing data includes fake facial action videos, fake lip language videos and fake facial images. |
1,000 Images Gesture Recognition Data | Image | 1000 images | Gesture recognition dataset of 18 gesture categories. The gesture categories include number 1, OK, LOVE, etc. The annotations consist of 21 hand landmarks and multiple gesture labels. |
3,000 Images Natural Scene OCR Data | Image | 3000 images | Natural scene OCR dataset of Asian languages (Japanese, Korean, etc.) and European languages (French, German, etc.). The annotations consist of line-level quadrilateral bounding boxes and text transcriptions. |
500 Images Handwriting OCR Data | Image | 500 images | Handwriting OCR dataset of English and Japanese. The annotations consist of line-level quadrilateral bounding boxes and text transcriptions. |
50 People- 3D Face Anti-Spoofing Data | Image | 50 people | 3D face anti-spoofing dataset. Real face data includes facial images. Anti-spoofing data includes fake facial images. Each image corresponds to a depth image, a depth values file and a camera parameters file. |
1,000 People Multi-race and Multi-pose Face Images Data | Image | 1000 people | Facial recognition dataset of multiple races. Each subject has 29 facial images, including 14 indoor multi-pose images, 14 outdoor multi-pose images and 1 id image. The annotations include labels of race, gender, age, and facial pose. |
Dataset Name | Recording Device | Data Size | Specifications |
2 Hours- 4 Countries English Speech Synthesis Corpus | Microphone | 2 hours, 4 people | Speakers: 4 people from the United States, the United Kingdom, Australia, and New Zealand; Format: 48,000 Hz, 24-bit, uncompressed WAV, mono channel; Recording environment: professional recording studio |
20 Hours - France French Reading & Conversational Speech Data by Mobile Phone | Mobile Phone | 20 hours | Format: 16 kHz, 16-bit, uncompressed WAV, mono channel; Recording condition: low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android smartphone, iPhone; Country: France; Language: French; Features of annotation: transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97% |
20 Hours - German Reading & Conversational Speech Data by Mobile Phone | Mobile Phone | 20 hours | Format: 16 kHz, 16-bit, uncompressed WAV, mono channel; Recording condition: low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android smartphone, iPhone; Country: Germany; Language: German; Features of annotation: transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97% |
20 Hours - Italian Reading & Conversational Speech Data by Mobile Phone | Mobile Phone | 20 hours | Format: 16 kHz, 16-bit, uncompressed WAV, mono channel; Recording condition: low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android smartphone, iPhone; Country: Italy; Language: Italian; Features of annotation: transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97% |
20 Hours - Spain Spanish Reading & Conversational Speech Data by Mobile Phone | Mobile Phone | 20 hours | Format: 16 kHz, 16-bit, uncompressed WAV, mono channel; Recording condition: low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android smartphone, iPhone; Country: Spain; Language: Spanish; Features of annotation: transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97% |
20 Hours - European Portuguese Reading & Conversational Speech Data by Mobile Phone | Mobile Phone | 20 hours | Format: 16 kHz, 16-bit, uncompressed WAV, mono channel; Recording condition: low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android smartphone, iPhone; Country: Portugal; Language: Portuguese; Features of annotation: transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97% |
20 Hours - Japanese Reading & Conversational Speech Data by Mobile Phone | Mobile Phone | 20 hours | Format: 16 kHz, 16-bit, uncompressed WAV, mono channel; Recording condition: low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android smartphone, iPhone; Country: Japan; Language: Japanese; Features of annotation: transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97% |
20 Hours - Korean Reading & Conversational Speech Data by Mobile Phone | Mobile Phone | 20 hours | Format: 16 kHz, 16-bit, uncompressed WAV, mono channel; Recording condition: low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android smartphone, iPhone; Country: Korea; Language: Korean; Features of annotation: transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97% |
10 Hours - Pashto Conversational Speech Data by Telephone | Telephone | 10 hours | Format: 8 kHz, 8-bit, a-law/u-law PCM, mono channel; Content category: dialogue based on given topics; Recording condition: low background noise (indoor); Recording device: telephone; Country: Afghanistan (AFG); Language (region) code: ps-AF; Language: Pashto; Speakers: 224 people in total, 92% male and 8% female; Features of annotation: transcription text, timestamp, speaker ID, gender; Accuracy rate: Word Accuracy Rate (WAR) is at least 95% |
Interspeech_ Accented English Speech Recognition Competition Data | Mobile Phone | 200 hours, 528 people | / |
Note: Please apply only for datasets relevant to your research field. You may apply for at most 6 Computer Vision datasets.
Note: Please apply only for datasets relevant to your research field. You may apply for at most 4 speech recognition datasets.
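
The speech entries above list an audio format (for example 16 kHz, 16-bit, uncompressed WAV, mono) and a minimum Word Accuracy Rate (WAR). The following is a minimal sketch of how a recipient might sanity-check a delivered file against those figures, using only the Python standard library. The function names `wav_matches_spec` and `word_accuracy_rate` are hypothetical, and the WAR formula used here (1 minus the word-level edit distance divided by the reference word count) is an assumption, since the catalog does not define how WAR is computed. The `wave` module reads uncompressed PCM only, so the a-law/u-law telephone data is not covered by this check.

```python
import io
import wave


def wav_matches_spec(source, sample_rate=16000, bit_depth=16, channels=1):
    """Return True if the WAV header matches the expected uncompressed format.

    `source` is a file path (str) or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return (
            wav.getframerate() == sample_rate
            and wav.getsampwidth() * 8 == bit_depth  # sample width is in bytes
            and wav.getnchannels() == channels
            and wav.getcomptype() == "NONE"          # "NONE" means uncompressed PCM
        )


def word_accuracy_rate(reference, hypothesis):
    """Assumed definition: WAR = 1 - (word-level edit distance / reference length)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return 1 - prev[-1] / len(ref)


# Demo on an in-memory file, so the sketch runs without any dataset files.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                 # 2 bytes = 16 bit
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 160)  # 10 ms of silence
buf.seek(0)
print(wav_matches_spec(buf))                                  # True
print(word_accuracy_rate("the cat sat down", "the cat sat"))  # 0.75
```

For the 48,000 Hz, 24-bit synthesis corpus, the same check would be called with `sample_rate=48000, bit_depth=24`.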