From: Nexdata    Date: 2024-08-15
When building intelligent systems, the quality of the training data matters as much as the algorithm itself. To cope with the challenges of complex scenarios, researchers need to collect and annotate diverse types of data to improve the capabilities of AI systems. Today, every industry is exploring how to use data-driven technology to build smarter business processes and decision-making systems.
Optical character recognition (OCR) is the task of an electronic device, such as a scanner or digital camera, examining the characters in an image and using character recognition methods to translate their shapes into machine-encoded text. Applications of OCR include automated data entry for business documents, translation apps, online databases, security cameras that automatically recognize license plates, and more.
In this article, I have sorted out some commonly used datasets in OCR research.
1. COCO-Text
The COCO-Text dataset contains 63,686 images with 145,859 cropped text instances. It is the first large-scale dataset for text in natural images and also the first dataset to annotate scene text with attributes such as legibility and type of text. However, no lexicon is associated with COCO-Text.
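As an illustration of how such attribute annotations might be consumed, here is a minimal sketch that filters the legible text instances out of a COCO-Text-style annotation dictionary. The field names (`anns`, `utf8_string`, `legibility`, `class`) follow the published COCO-Text format as I understand it, but treat the exact structure here as an assumption rather than the official schema, and the sample data is made up.

```python
import json

# Hypothetical COCO-Text-style annotation snippet (not real data; field
# names are assumed to match the published COCO-Text format).
sample = json.loads("""
{
  "anns": {
    "1": {"utf8_string": "EXIT", "legibility": "legible", "class": "machine printed"},
    "2": {"utf8_string": "",     "legibility": "illegible", "class": "machine printed"},
    "3": {"utf8_string": "cafe", "legibility": "legible", "class": "handwritten"}
  }
}
""")

def legible_words(dataset):
    """Collect the transcriptions of all instances marked legible."""
    return [a["utf8_string"]
            for a in dataset["anns"].values()
            if a["legibility"] == "legible"]

print(legible_words(sample))  # the sample above yields ['EXIT', 'cafe']
```

Filtering on legibility like this is a common preprocessing step, since illegible instances usually have no usable transcription.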
2. SynthText
SynthText (ST) could be called the ImageNet of OCR. The dataset is fully synthetic: roughly 8 million text instances are rendered onto 800,000 background images. The synthesis is not a blunt overlay; the text is processed to blend with the scene so that it looks natural in the picture.
3. IIIT5K
The IIIT5K dataset contains 5,000 text instance images: 2,000 for training and 3,000 for testing. It contains words from street scenes and from born-digital images. Every image is associated with a 50-word lexicon and a 1,000-word lexicon; each lexicon consists of the ground-truth word plus randomly picked distractor words.
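To show how such lexicons are typically used, here is a minimal sketch of lexicon-constrained decoding: a raw, possibly noisy recognition output is snapped to the closest lexicon entry by edit distance. The recognizer output and the tiny lexicon below are made-up examples, and `edit_distance` is a plain dynamic-programming Levenshtein implementation, not part of any dataset toolkit.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def constrain_to_lexicon(raw: str, lexicon: list) -> str:
    """Replace a noisy prediction with the nearest lexicon word."""
    return min(lexicon, key=lambda w: edit_distance(raw, w))

# Hypothetical noisy output and a tiny stand-in for a 50-word lexicon.
lexicon = ["coffee", "street", "market", "hotel"]
print(constrain_to_lexicon("c0ffee", lexicon))  # -> coffee
```

This is why reported accuracy on IIIT5K is usually higher with the 50-word lexicon than with the 1,000-word one: fewer candidates mean fewer chances for a wrong word to lie closer to the noisy prediction.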
4. SVT
The SVT dataset contains 350 images: 100 for training and 250 for testing. Some images are severely corrupted by noise, blur, or low resolution. Each image is associated with a 50-word lexicon.
5. CUTE80
The CUTE80 dataset contains 80 high-resolution images with 288 cropped text instances. It focuses on curved text recognition. Most images in CUTE80 have a complex background, perspective distortion, and poor resolution. No lexicon is associated with CUTE80.
6. SVHN
The SVHN dataset contains more than 600,000 digits cropped from house numbers in natural scenes. It was obtained from a large number of street view images using a combination of automated algorithms and the Amazon Mechanical Turk (AMT) framework. SVHN is typically used for scene digit recognition.
7. RCTW-17
The RCTW-17 dataset contains 12,514 images: 11,514 for training and 1,000 for testing. Most are natural images collected with cameras or mobile phones, while the others are born-digital. Text instances are annotated with labels, fonts, languages, and other attributes.
8. MLT (ICDAR 2019 MLT competition)
The MLT-2019 dataset contains 20,000 images: 10,000 for training (1,000 per language) and 10,000 for testing. The dataset includes ten languages, representing seven different scripts: Arabic, Bangla, Chinese, Devanagari, English, French, German, Italian, Japanese, and Korean. The number of images per script is equal.
While pushing the boundaries of technology, we need to be aware of the potential and importance of data. By streamlining the collection and annotation of datasets, AI technology can better handle diverse application scenarios. In the future, as datasets accumulate and improve, we have reason to believe that AI will bring more innovation to fields such as medicine, education, and transportation.