From: Nexdata Date: 2024-08-15
In modern artificial intelligence, the success of an algorithm depends heavily on the quality of its data. As data plays an increasingly prominent role in AI models, collecting high-quality data and making full use of it becomes crucial. This article will help you better understand the core role of data in artificial intelligence programs.
Computer Vision
● Real-World Masked Face Dataset
The Real-World Masked Face Dataset (RMFD) is a face recognition dataset released in early March 2020 by the National Multimedia Software Technology Research Center of Wuhan University. It includes nearly 100,000 masked and unmasked facial images, plus 500,000 simulated masked faces.
Link: https://github.com/X-zhangyang/Real-World-Masked-Face-Dataset
● Hypersim
For many fundamental scene-understanding tasks, it is difficult or impossible to obtain a ground-truth label for each pixel from a real image. Apple addresses this problem with Hypersim, a photorealistic synthetic dataset of indoor scenes. To create it, Apple drew on a large repository of synthetic scenes created by professional artists and generated 77,400 images of 461 indoor scenes, with detailed per-pixel labels and corresponding ground-truth geometry.
Link: https://github.com/apple/ml-hypersim
● OASIS
This dataset covers 140,000 Internet images with manually annotated, pixel-level 3D surface reconstructions. It can be used for depth estimation, 3D surface reconstruction, edge detection, instance segmentation, and related tasks.
Link: https://oasis.cs.princeton.edu/
● Visual Genome
Visual Genome is a richly detailed computer vision dataset with captions for about 100,000 images. Compared with ImageNet, each image in this dataset carries richer information, and the relationships between objects and attributes are annotated.
● Audi Autonomous Driving Dataset
The dataset was released in 2020. Its annotation types include 3D object bounding boxes, semantic segmentation, instance segmentation, and data extracted from the car. The labeled, non-sequential portion comprises 41,277 frames with semantic segmentation annotations and point cloud labels, of which 12,497 frames also have 3D bounding boxes for objects in the field of view of the front-facing camera. In addition, the dataset includes 392,556 consecutive frames of unlabeled sensor data. All license plates and faces in the images are blurred.
Link: https://www.a2d2.audi/a2d2/en.html
Speech
● Common Voice
The Common Voice dataset covers 18 languages and has accumulated nearly 1,400 hours of voice data from more than 42,000 contributors.
● ainexdata_1505zh
The ainexdata_1505zh dataset contains 1,505 hours of speech and is part of Nexdata's Mandarin Chinese speech database. Recordings were collected across 34 provincial-level administrative regions of China from 6,408 participants, and the recorded content exceeds 300,000 colloquial sentences. Sentence annotation accuracy exceeds 98%.
Link: https://www.nexdata.ai/opensource
● CN-Celeb
The dataset contains 130,000 speech segments from 1,000 Chinese celebrities, totaling 274 hours.
Link: http://www.openslr.org/82/
NLP
● WikiText
The WikiText Long Term Dependency Language Modeling Dataset is an English corpus of about 100 million words extracted from high-quality Wikipedia articles. It comes in two versions, WikiText-2 and WikiText-103. WikiText-103 contains about 110 times as many words as the Penn Treebank (PTB).
● SQuAD
SQuAD is a reading comprehension dataset released by Stanford University. All of its articles are drawn from Wikipedia, and it is dozens of times larger than other comparable datasets. It contains a total of 107,785 questions based on 536 supporting articles.
Link: https://rajpurkar.github.io/SQuAD-explorer/
In addition to the ten datasets above, Nexdata has offered its Open Source Research Datasets to universities and academic institutions around the world since 2020 to support artificial intelligence research. Applicants who fill in the relevant application materials can receive an AI dataset worth about US$100,000 free of charge.
One of these datasets covers conference-scene slides (PPT) in French, Korean, Japanese, Spanish, German, Italian, Portuguese, and Russian, as well as natural-scene text such as posters, road signs, packaging instructions, and menus in Chinese and English. Natural scenes are labeled with line-level rectangular boxes, PPT scenes are labeled with quadrilateral boxes, and the text content is transcribed.
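The two box styles mentioned above can be pictured with a minimal sketch. The class names, field names, and coordinate values below are hypothetical and chosen only for illustration; they are not Nexdata's actual delivery format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class RectBox:
    """Axis-aligned box for one text line: top-left corner plus size."""
    x: float
    y: float
    width: float
    height: float
    text: str  # transcription of the line


@dataclass
class QuadBox:
    """Four corner points listed in order around a (possibly skewed) text region."""
    points: List[Point]
    text: str  # transcription of the region

    def to_rect(self) -> RectBox:
        """Return the smallest axis-aligned rectangle enclosing the quadrilateral."""
        xs = [p[0] for p in self.points]
        ys = [p[1] for p in self.points]
        return RectBox(min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys), self.text)


# Hypothetical slide title photographed at a slight angle.
slide_title = QuadBox(points=[(10, 20), (110, 25), (108, 55), (8, 50)], text="Agenda")
enclosing = slide_title.to_rect()
print(enclosing)
```

A quadrilateral can follow text that appears skewed in a projected slide, while a rectangle is usually enough for a roughly horizontal text line; converting the former to the latter loses the skew but keeps the covered area.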
● Multi-race Face Recognition Datasets
The data covers Asian, Caucasian, Indian, and Black participants, with a 1:1 ratio of men to women. Images were collected in both indoor and outdoor scenes using mobile phones and cameras.
● Mandarin Chinese Conversational Speech Data by Mobile Phone
The data was recorded by 440 participants speaking naturally in casual conversation, with a balanced gender ratio. Recordings were made in a relatively quiet indoor environment where ambient noise does not exceed 50 dB. The text, speaker, and start and end times of each valid sentence are annotated, and sentence accuracy exceeds 97%.
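As a rough illustration of what a sentence-level annotation with text, speaker, and start and end times could look like, here is a minimal sketch. The `Segment` record, its field names, and the sample values are assumptions made for this example, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """One valid sentence in a conversation recording (hypothetical schema)."""
    speaker: str
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcription


def speaking_time(segments: List[Segment]) -> Dict[str, float]:
    """Sum the duration of all segments for each speaker."""
    totals: Dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


# Made-up sample conversation between two speakers.
conversation = [
    Segment("A", 0.0, 2.5, "你好"),
    Segment("B", 3.0, 4.5, "你好"),
    Segment("A", 5.0, 6.0, "今天天气不错"),
]
totals = speaking_time(conversation)
print(totals)
```

With timestamps stored per sentence, statistics such as per-speaker talk time or the gaps between turns can be derived without reprocessing the audio.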
If you would like more details about the datasets or how to acquire them, please feel free to contact us: info@nexdata.ai
Standing at the forefront of the technology revolution, we are well aware of the power of data. In the future, continuous improvement of data collection and annotation processes will make AI systems more intelligent. All industries should actively embrace data-driven innovation to stay ahead in fierce market competition and bring more value to society.