Open Datasets for Academic Research

Computer Vision

Speech Recognition

Dataset Name	Data Type	Data Size	Capture Content
1,000 Images Caption Data of Diverse Scenes	Image	1000 images	Image caption dataset of diverse scenes, including natural scenery, urban streets, exhibitions, home environments, etc. Each image has a 3-5 sentence English description.
1,000 Images Caption Data of OCR in Natural Scenes	Image	1000 images	OCR caption dataset covering 14 languages. Image subjects include bus stops, posters, road signs, etc. Each image has a 3-5 sentence English description.
1,000 Images Caption Data of Human Face	Image	1000 images	Human face image caption dataset covering various head postures, facial expressions, etc. Each image has a 3-5 sentence English description.
1,000 Images Caption Data of Gestures	Image	1000 images	Gesture image caption dataset covering different angles and gesture categories. Each image has a 3-5 sentence English description.
1,000 Images Human Facial Skin Defects Data	Image	1000 images	Facial skin defect dataset, including acne, acne scars, dark spots, wrinkles and dark circles.
1,000 Videos Caption Data of Human Motion	Video	1000 videos	Human motion video caption dataset in CCTV and non-CCTV scenes. Human motions include walking, drinking, yawning, fitness, etc. Each video includes an English caption.
1,000 People Multi-race 7 Expressions Recognition Data	Image	1000 people	Dataset of 7 facial expressions: normal, happy, amazed, sad, angry, disgusted, and scared.
1,000 Videos Multi-race Micro-expression (FACS) Data	Video	1000 videos	Dataset of 57 facial micro-expressions, including inner brow raiser (AU1), outer brow raiser (AU2), upper lid raiser (AU5), etc.
50 People- DMS Data	Video	50 people	DMS dataset of dangerous behavior, fatigue behavior, and visual movement behavior. The data covers various subject age groups, time periods, vehicle types, and camera positions.
50 People-2D Face Anti-Spoofing Data	Image&Video	50 people	2D face anti-spoofing dataset. Real face data includes facial action videos, facial images and lip language videos. Anti-spoofing data includes fake facial action videos, fake lip language videos and fake facial images.
1,000 Images Gesture Recognition Data	Image	1000 images	Gesture recognition dataset of 18 gesture categories, including number 1, OK, LOVE, etc. Annotations consist of 21 hand landmarks and multiple gesture labels.
3,000 Images Natural Scene OCR Data	Image	3000 images	Natural scene OCR dataset of Asian languages (Japanese, Korean, etc.) and European languages (French, German, etc.). Annotations consist of line-level quadrilateral bounding boxes and text transcriptions.
500 Images Handwriting OCR Data	Image	500 images	Handwriting OCR dataset of English and Japanese. Annotations consist of line-level quadrilateral bounding boxes and text transcriptions.
50 People- 3D Face Anti-Spoofing Data	Image	50 people	3D face anti-spoofing dataset. Real face data includes facial images. Anti-spoofing data includes fake facial images. Each image corresponds to a depth image, a depth values file and a camera parameters file.
1,000 People Multi-race and Multi-pose Face Images Data	Image	1000 people	Facial recognition dataset of multiple races. Each subject has 29 facial images, including 14 indoor multi-pose images, 14 outdoor multi-pose images and 1 id image. The annotations include labels of race, gender, age, and facial pose.
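
The OCR rows above describe line-level quadrilateral bounding boxes paired with transcriptions. As a rough illustration of how such an annotation might be held in memory, here is a minimal Python sketch. The record layout, the class name TextLineAnnotation, and its field names are assumptions made for illustration; the actual annotation file format of these datasets is not described on this page.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextLineAnnotation:
    # Hypothetical record layout; the datasets' real schema is not given here.
    corners: List[Point]  # four (x, y) corners, listed in order around the box
    transcription: str    # text content of the line

    def area(self) -> float:
        """Area of the quadrilateral, computed with the shoelace formula."""
        total = 0.0
        n = len(self.corners)
        for i in range(n):
            x1, y1 = self.corners[i]
            x2, y2 = self.corners[(i + 1) % n]
            total += x1 * y2 - x2 * y1
        return abs(total) / 2.0

    def enclosing_rect(self) -> Tuple[float, float, float, float]:
        """Axis-aligned rectangle (x_min, y_min, x_max, y_max) around the box."""
        xs = [x for x, _ in self.corners]
        ys = [y for _, y in self.corners]
        return min(xs), min(ys), max(xs), max(ys)


# A slanted text line: a parallelogram 4 units wide and 2 units tall.
line = TextLineAnnotation(
    corners=[(0.0, 0.0), (4.0, 1.0), (4.0, 3.0), (0.0, 2.0)],
    transcription="EXIT",
)
```

A quadrilateral, unlike an axis-aligned rectangle, can follow slanted or perspective-distorted text; the enclosing rectangle is still handy when cropping the image region before recognition.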

Dataset Name	Recording Device	Data Size	Specifications
2 Hours- 4 Countries English Speech Synthesis Corpus	Microphone	2 hours, 4 people	People: 4 people from America, Britain, Australia, and New Zealand; Format: 48kHz, 24bit, uncompressed wav, mono channel; Recording environment: professional recording studio
20 Hours - France French Reading & Conversational Speech Data by Mobile Phone	Mobile Phone	20 hours	Format: 16kHz, 16bit, uncompressed wav, mono channel; Recording condition: Low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android Smartphone, iPhone; Country: France; Language: French; Features of annotation: Transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97%
20 Hours - German Reading & Conversational Speech Data by Mobile Phone	Mobile Phone	20 hours	Format: 16kHz, 16bit, uncompressed wav, mono channel; Recording condition: Low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android Smartphone, iPhone; Country: Germany; Language: German; Features of annotation: Transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97%
20 Hours - Italian Reading & Conversational Speech Data by Mobile Phone	Mobile Phone	20 hours	Format: 16kHz, 16bit, uncompressed wav, mono channel; Recording condition: Low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android Smartphone, iPhone; Country: Italy; Language: Italian; Features of annotation: Transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97%
20 Hours - Spain Spanish Reading & Conversational Speech Data by Mobile Phone	Mobile Phone	20 hours	Format: 16kHz, 16bit, uncompressed wav, mono channel; Recording condition: Low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android Smartphone, iPhone; Country: Spain; Language: Spanish; Features of annotation: Transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97%
20 Hours - European Portuguese Reading & Conversational Speech Data by Mobile Phone	Mobile Phone	20 hours	Format: 16kHz, 16bit, uncompressed wav, mono channel; Recording condition: Low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android Smartphone, iPhone; Country: Portugal; Language: Portuguese; Features of annotation: Transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97%
20 Hours - Japanese Reading & Conversational Speech Data by Mobile Phone	Mobile Phone	20 hours	Format: 16kHz, 16bit, uncompressed wav, mono channel; Recording condition: Low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android Smartphone, iPhone; Country: Japan; Language: Japanese; Features of annotation: Transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97%
20 Hours - Korean Reading & Conversational Speech Data by Mobile Phone	Mobile Phone	20 hours	Format: 16kHz, 16bit, uncompressed wav, mono channel; Recording condition: Low background noise (indoor), without echo; Content category: Reading, Conversation; Recording device: Android Smartphone, iPhone; Country: Korea; Language: Korean; Features of annotation: Transcription text; Accuracy rate: Word Accuracy Rate (WAR) is at least 97%
10 Hours - Pashto Conversational Speech Data by Telephone	Telephone	10 hours	Format: 8kHz, 8bit, a-law/u-law pcm, mono channel; Content category: Dialogue based on given topics; Recording condition: Low background noise (indoor); Recording device: Telephony; Country: Afghanistan (AFG); Language (Region) Code: ps-AF; Language: Pashto; Speaker: 224 people in total, 92% male and 8% female; Features of annotation: Transcription text, timestamp, speaker ID, gender; Accuracy rate: Word Accuracy Rate (WAR) is at least 95%
Interspeech_ Accented English Speech Recognition Competition Data	Mobile Phone	200 hours, 528 people	Audio format: 16kHz, 16bit, mono wav; Audio content: mainly daily communication, including scenes such as human-computer interaction; Recording environment: relatively quiet indoor, mobile phone recording; Duration: about 20 hours for each accent, a total of 8 accents; Language types: Russian, Korean, American, Portuguese, Japanese, Indian, British; Speakers: 40-110 speakers for each language
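
Most rows above state an exact audio format, such as 16kHz, 16bit, mono wav. After receiving a dataset, it can be useful to confirm that the files match the listed specification. The following is a minimal sketch using only Python's standard wave module; the function names wav_matches_spec and make_silent_wav are illustrative and not part of any dataset tooling.

```python
import io
import wave


def wav_matches_spec(source, sample_rate=16000, sample_width_bytes=2, channels=1):
    """Return True if the WAV header matches the expected rate, bit depth, and channel count.

    `source` may be a file path or a binary file-like object.
    A sample width of 2 bytes corresponds to 16bit audio, 3 bytes to 24bit.
    """
    with wave.open(source, "rb") as wav:
        return (
            wav.getframerate() == sample_rate
            and wav.getsampwidth() == sample_width_bytes
            and wav.getnchannels() == channels
        )


def make_silent_wav(sample_rate, sample_width_bytes, channels, n_frames=160):
    """Build a small in-memory WAV of silence, used here only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00" * (n_frames * sample_width_bytes * channels))
    buffer.seek(0)
    return buffer


# 16kHz, 16bit, mono: the format listed for the mobile phone datasets.
ok = wav_matches_spec(make_silent_wav(16000, 2, 1))

# An 8kHz file does not match the default 16kHz expectation.
mismatch = wav_matches_spec(make_silent_wav(8000, 2, 1))
```

For the 48kHz, 24bit synthesis corpus, the same check would be called with sample_rate=48000 and sample_width_bytes=3. The wave module is designed for uncompressed PCM, so the a-law/u-law telephone recordings would likely need a different reader.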

Note: Please apply only for datasets relevant to your research field. You may apply for at most 6 Computer Vision datasets.

Note: Please apply only for datasets relevant to your research field. You may apply for at most 4 Speech Recognition datasets.

Application Process and Instruction

Apply for Sponsored Dataset

Cooperation Institution