
104,320 Images - Korean and Hindi OCR Data in Natural Scenes

Multiple natural scenes

Multiple shooting angles

Multiple light conditions

This dataset contains 104,320 images of Korean and Hindi text in natural scenes for OCR. The collection scenes include packaging, posters, tickets, reminders, menus, building signs, etc. The data covers multiple scenes, multiple shooting angles and multiple light conditions. Line-level text and vertical text are each annotated with a polygon bounding box (or a tetragon or rectangle bounding box), a transcription and a text attribute (language type). The dataset can be used for Korean and Hindi OCR tasks in natural scenes.

This is a paid dataset for commercial use, research and other purposes. Licensed ready-made datasets help jump-start AI projects.

Specifications

Data size

76,861 Korean images with 555,913 bounding boxes; 27,459 Hindi images with 200,453 bounding boxes

Collecting environment

including packaging, posters, tickets, reminders, menus, building signs, etc.

Data diversity

multiple natural scenes, multiple shooting angles, multiple light conditions

Device

cellphone

Collecting angle

upward angle, downward angle, eye-level angle

Language distribution

Korean, Hindi, and a small amount of English

Data format

images are in .jpg format; annotation files are in .json format

Bounding box shape distribution

Korean: 315,822 tetragon bounding boxes and 240,091 polygon bounding boxes; Hindi: 780 tetragon bounding boxes, 199,671 polygon bounding boxes and 2 rectangle bounding boxes
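The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated in this specification; the variable names are illustrative and are not part of the dataset.

```python
# Counts copied from the "Data size" and "Bounding box shape distribution" entries.
images = {"Korean": 76_861, "Hindi": 27_459}
boxes = {"Korean": 555_913, "Hindi": 200_453}
shapes = {
    "Korean": {"tetragon": 315_822, "polygon": 240_091},
    "Hindi": {"tetragon": 780, "polygon": 199_671, "rectangle": 2},
}

# Total image count across both languages (should match the dataset title).
total_images = sum(images.values())

# Per-language sum of the shape counts (should match the stated box totals).
shape_totals = {lang: sum(counts.values()) for lang, counts in shapes.items()}

# Average number of bounding boxes per image, and share of polygon boxes.
boxes_per_image = {lang: boxes[lang] / images[lang] for lang in images}
polygon_share = {lang: shapes[lang]["polygon"] / boxes[lang] for lang in boxes}

print(total_images)
print(shape_totals)
for lang in images:
    print(f"{lang}: {boxes_per_image[lang]:.2f} boxes/image, "
          f"{polygon_share[lang]:.1%} polygon boxes")
```

The totals agree with the stated sizes, and the ratios show that Hindi text is annotated almost entirely with polygons while Korean text is split between tetragons and polygons.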

Annotation content

line-level text and vertical text are each annotated with a polygon bounding box (or a tetragon or rectangle bounding box), a transcription and a text attribute (language type)
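As a rough illustration of what one annotation record carries, the sketch below models a text region in Python. The actual .json schema is not published on this page, so the class name, field names and the shape rule (four axis-aligned vertices count as a rectangle, any other four vertices as a tetragon, everything else as a polygon) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextAnnotation:
    """Hypothetical record for one annotated text region (not the vendor schema)."""
    vertices: List[Point]   # bounding box vertices in pixel coordinates
    transcription: str      # the text inside the box
    language: str           # text attribute, e.g. "Korean", "Hindi", "English"
    vertical: bool = False  # True for vertical text, False for line-level text

    def shape(self) -> str:
        """Classify the box by its vertices (assumed rule, see lead-in)."""
        if len(self.vertices) != 4:
            return "polygon"
        xs = {x for x, _ in self.vertices}
        ys = {y for _, y in self.vertices}
        # Four vertices using exactly two x values and two y values
        # form an axis-aligned rectangle.
        if len(xs) == 2 and len(ys) == 2:
            return "rectangle"
        return "tetragon"


sign = TextAnnotation([(0, 0), (10, 0), (10, 5), (0, 5)], "메뉴", "Korean")
tilted = TextAnnotation([(0, 0), (10, 2), (11, 7), (1, 5)], "मेनू", "Hindi")
curved = TextAnnotation(
    [(0, 0), (5, 1), (10, 0), (10, 5), (5, 6), (0, 5)], "EXIT", "English",
    vertical=True,
)
print(sign.shape(), tilted.shape(), curved.shape())
```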

Accuracy

A bounding box counts as a qualified annotation when the error of each vertex is within 5 pixels; bounding box accuracy is not less than 95%, and text transcription accuracy is not less than 95%.
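The bounding box criterion can be read as a per-box pass/fail test followed by a ratio. The sketch below is one possible interpretation, not the vendor's acceptance procedure: it assumes the error is the Euclidean distance between an annotated vertex and a reference vertex, and that vertices are matched by position in the list. The function names are hypothetical.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, float]


def box_is_qualified(annotated: Sequence[Point],
                     reference: Sequence[Point],
                     tol: float = 5.0) -> bool:
    """True if every annotated vertex lies within `tol` pixels of its reference vertex."""
    if len(annotated) != len(reference):
        return False
    return all(math.dist(a, r) <= tol for a, r in zip(annotated, reference))


def box_accuracy(pairs: Iterable[Tuple[Sequence[Point], Sequence[Point]]],
                 tol: float = 5.0) -> float:
    """Fraction of (annotated, reference) box pairs that are qualified."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no boxes to evaluate")
    qualified = sum(box_is_qualified(a, r, tol) for a, r in pairs)
    return qualified / len(pairs)


reference: List[Point] = [(0, 0), (100, 0), (100, 40), (0, 40)]
good: List[Point] = [(3, 4), (101, 1), (99, 38), (0, 40)]    # max vertex error is 5.0 px
bad: List[Point] = [(0, 0), (100, 0), (100, 40), (0, 48)]    # one vertex is 8 px off

accuracy = box_accuracy([(good, reference), (bad, reference)])
print(accuracy)  # 0.5, which would fall short of the 95% threshold
```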

Sample

Recommended Datasets

71,535 Images English OCR Data in Natural Scenes

The images were collected in real scenes in Britain and the United States. The data diversity includes multiple scenes, multiple photographic angles and multiple light conditions. Rectangular or quadrilateral bounding box annotation was adopted at line level, word level and character level, together with text transcription. The dataset can be used for English OCR tasks in natural scenes.

OCR English Natural scenes

500,000 Images - Natural Scenes and Documents OCR Data

The dataset consists of 500,000 images for multi-country natural scenes and document OCR, including 20 languages such as Traditional Chinese, Japanese, Korean, Indonesian, Malay, Thai, Vietnamese, Polish, etc. The diversity includes various natural scenarios and multiple shooting angles. This set of data can be used for multi-language OCR tasks.

Natural scenes Documents OCR

30,000 Images - Natural Scenes OCR Data in Southeast Asian Languages

30,000 natural-scene OCR images for minority languages in Southeast Asia, including Khmer (Cambodia), Lao and Burmese. The collection covers a variety of natural scenes and shooting angles. The data can be used for Southeast Asian language OCR tasks.

OCR Southeast Asian Languages Natural Scenes

5,000 Images of Turkish Natural Scene OCR Data

5,000 images of Turkish natural-scene OCR data, covering a variety of natural scenes and multiple shooting angles. The texts are annotated with quadrilateral or polygon bounding boxes and transcriptions. The data can be used for tasks such as Turkish OCR.

OCR, Turkish, Natural scenes

8,604 Images of Arabic Natural Scene OCR Data

8,604 images of Arabic natural-scene OCR data, covering a variety of natural scenes and multiple shooting angles. The texts are annotated with quadrilateral or polygon bounding boxes and transcriptions. The data can be used for tasks such as Arabic OCR.

Arabic Multiple natural scenes Multiple shooting angles

57,645 Images - Vertical OCR Data in Text Scenes

The collection scenes include street scenes, plaques, billboards, posters, decorations, art lettering, magazine covers, etc. The language distribution is Chinese with a small amount of English. Both vertical and non-vertical texts are annotated with a rectangular bounding box (or a polygonal or parallelogram bounding box) and a transcription. The dataset can be used for tasks such as OCR in multiple vertical text scenes.

OCR Multiple scenes Multiple fonts

105,941 Images Natural Scenes OCR Data of 12 Languages

The data covers 12 languages (6 Asian languages, 6 European languages), multiple natural scenes and multiple photographic angles. The texts are annotated with line-level quadrilateral bounding boxes and transcriptions. The data can be used for tasks such as multi-language OCR.

12 languages Multiple photographic angles Multiple scenes Line-level quadrilateral bounding box annotation and transcription

4,995 Vietnamese OCR Images Data - Images with Annotation and Transcription

The data includes 258 images of natural scenes, 2,553 Internet images and 2,184 document images. Line-level content is annotated with line-level quadrilateral bounding boxes and text transcription; column-level content is annotated with column-level quadrilateral bounding boxes and text transcription. The data can be used for tasks such as Vietnamese recognition in multiple scenes.

Vietnamese OCR Multiple scenes Multiple angles Different light conditions

Copyright © 2023 NEXDATA TECHNOLOGY INC
