OCR Datasets

Instantly enhance AI model performance with high quality off-the-shelf datasets.

Data Type

All (31)
Document (1)
General Scenario (12)
Handwriting (15)
Internet image (4)
Invoice (3)
Others (4)
Test paper (2)
Table (1)

Language

All (31)
Chinese (9)
English (5)
Hindi (3)
Japanese (6)
Korean (6)
Others (22)
Vietnamese (3)

500,000 Images - Natural Scenes and Documents OCR Data

The dataset consists of 500,000 images of natural scenes and documents from multiple countries for OCR, covering 20 languages including Traditional Chinese, Japanese, Korean, Indonesian, Malay, Thai, Vietnamese and Polish. The data covers a variety of natural scenes and shooting angles. It can be used for multi-language OCR tasks.
Natural scenes Documents OCR

30,000 Images - Natural Scenes OCR Data in Southeast Asian Languages

30,000 natural scene OCR images in minority languages of Southeast Asia, including Khmer (Cambodia), Lao and Burmese. The collection covers a variety of natural scenes and shooting angles. This dataset can be used for Southeast Asian language OCR tasks.
OCR Southeast Asian Languages Natural Scenes

14,980 Images PPT OCR Data of 8 Languages

14,980 Images PPT OCR Data of 8 Languages. This dataset covers 8 languages, multiple scenes, and different photographic angles, photographic distances and light conditions. The text is annotated with line-level quadrilateral bounding boxes and transcriptions. The dataset can be used for tasks such as multi-language OCR.
PPT OCR meeting room conference room different photographic angles different photographic distances different light conditions line-level quadrilateral bounding box annotation and transcription for the texts

101 People - 4,538 Images Japanese Handwriting OCR Data

101 People - 4,538 Images Japanese Handwriting OCR Data. The text carrier is A4 paper. The content covers society and livelihood, entertainment, travel, sports, movies, composition and other fields. Annotation consists of character-level and line-level rectangular bounding boxes, each with text transcription. The dataset can be used for tasks such as Japanese handwriting OCR.
Japanese handwriting OCR character-level rectangular bounding box annotation text transcription calligraphy scribble manuscript Japanese ocr data

5,147 Images Japanese Handwriting OCR Data

5,147 Images Japanese Handwriting OCR Data. The text carriers include A4 paper, lined paper and quadrille paper. The images were collected with cellphones at an eye-level angle. The content includes Japanese composition, poetry, prose, news and stories. The text is annotated with line-level quadrilateral bounding boxes and transcriptions. The dataset can be used for tasks such as Japanese handwriting OCR.
Japanese Handwriting OCR line-level annotation line-level text transcription

100 People - Handwriting OCR Data of Japanese and Korean

100 People - Handwriting OCR Data of Japanese and Korean. This dataset was collected from 100 subjects, including 50 Japanese, 49 Koreans and 1 Afghan. Each subject wrote a different corpus. The data diversity includes multiple cellphone models and different corpora. This dataset can be used for tasks such as Japanese and Korean handwriting OCR.
Japanese Korean Handwriting OCR Trace of handwriting

105,941 Images Natural Scenes OCR Data of 12 Languages

105,941 Images Natural Scenes OCR Data of 12 Languages. The data covers 12 languages (6 Asian languages and 6 European languages), multiple natural scenes and multiple photographic angles. The text is annotated with line-level quadrilateral bounding boxes and transcriptions. The data can be used for tasks such as multi-language OCR.
Japanese Korean Indonesian Malay Vietnamese Thai French German Italian Portuguese Russian Spanish OCR natural scenes multiple photographic angles line-level quadrilateral bounding box annotation and transcription for the texts

497 Images – English Invoice Data

497 Images – English Invoice Data. The images were collected against a solid-color background, and personal information is desensitized. The dataset includes various types of invoices and can be used for tasks such as bill recognition and text recognition.
OCR bill annotation multiple types of bills

71,535 Images English OCR Data in Natural Scenes

71,535 Images English OCR Data in Natural Scenes. The images were collected from real scenes in Britain and the United States. The data diversity includes multiple scenes, multiple photographic angles and multiple light conditions. Annotation consists of line-level, word-level and character-level rectangular or quadrilateral bounding boxes, together with text transcription. The dataset can be used for English OCR tasks in natural scenes.
English natural scenes OCR multiple scenes multiple photographic angles multiple light conditions line-level & word-level & character-level bounding box text transcription

Tailor Your Data Now

Why off-the-shelf Datasets

  • Copyright

    Clear Copyright and Ready to Check
  • Security

    Properly Authorized, Secure to Use
  • Professional

    Designed and produced by AI data experts
  • Diversity

    Collected from a variety of real scenes
  • Cost Effective

    More Cost-Efficient Than Tailored Data
  • Efficiency

    Ready to Go, Delivered in Seconds