LLM Datasets

Instantly enhance AI model performance with high-quality, off-the-shelf datasets.

Type

All (31)
Image Caption (14)
SFT Datasets (8)
Pre-training Text (13)

1.51M Instruction-Based Image Editing Dataset for Generative AI Training

This dataset contains 1.51 million annotated image editing pairs. Editing types include 500,000 sets of portrait/object consistency editing, 300,000 sets of structural editing, 210,000 sets of mixed editing, 450,000 sets of spatial editing, and 50,000 sets of style transfer editing. The editing targets cover people, animals, goods, plants, and landscapes. For annotation, the targets in each image are edited according to the editing instructions. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.
generative AI image dataset, image editing dataset, AI image editing dataset, image editing training data, AI image manipulation dataset, image editing pairs dataset, image inpainting dataset, style transfer dataset

2.4M Korean Exam Question Dataset for AI Training

This dataset contains 2.4 million structured Korean exam questions covering primary, middle, and high school subjects including Korean, Mathematics, English, Social Studies, Science, Physics, Chemistry, Biology, History, and Geography. Each record includes question type (multiple-choice, fill-in-the-blank, true/false, short answer), the question itself, standard answers, and detailed explanations. The data is professionally annotated and categorized by subject and academic level, making it ideal for training AI models in educational applications such as question answering systems, tutoring bots, academic reasoning, and subject-level knowledge enhancement. It is widely applicable for natural language processing tasks involving structured QA, exam-style NLP training, and educational content generation. All data is collected and processed in compliance with GDPR, CCPA, and PIPL standards, ensuring privacy and legal integrity throughout the lifecycle.
Korean exam dataset, education dataset, test question dataset, multiple choice QA dataset, K-12 school question data, AI training dataset for education, NLP exam data, structured Korean question dataset, school subject QA dataset

50,000 Image Editing Sets – Object Removal, Addition & Modification Dataset for AI Training

This dataset contains 50,000 sets of image editing data. Editing types include human attribute editing, image semantic editing, and image structure editing. The editing targets cover people, animals, goods, plants, and landscapes. For annotation, the targets in each image are edited according to the editing instructions. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.
image editing dataset, image synthesis data, object removal dataset, object addition data, AI image generation dataset, virtual scene dataset, annotated image editing data, inpainting dataset, AI training data for image manipulation, generative image dataset

32M Science QA Dataset – Answers & Parsing for LLMs

This dataset contains 32 million structured science questions covering mathematics, physics, chemistry, and biology across primary, middle, high school, and university levels. Each question entry includes a title, an answer, a solution parse (explanation), a question type, a subject category, and the corresponding grade level. The dataset is designed to support AI training tasks such as large language model development, subject-specific knowledge enhancement, machine reading comprehension, and question-answering systems. It provides a rich resource for educational NLP applications and has been validated for quality and completeness. All data complies with global data protection standards including GDPR, CCPA, and PIPL.
science question dataset, STEM QA dataset, math physics chemistry biology questions, education NLP dataset, AI training data, structured question answer dataset, academic QA dataset, question parsing dataset, K-12 science dataset, university level questions dataset
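
The six fields listed for each science question entry can be pictured as a simple record. The following is a minimal Python sketch under stated assumptions: the listing does not publish a schema, so the class name `QuestionRecord`, the attribute names, the helper `select`, and the two sample entries are all hypothetical and only illustrate how such records might be filtered by subject or grade level.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class QuestionRecord:
    # Attribute names are assumptions mirroring the six fields in the listing.
    title: str          # the question text
    answer: str         # the reference answer
    parse: str          # the solution explanation ("parsing")
    question_type: str  # e.g. "short answer"
    subject: str        # e.g. "mathematics"
    grade: str          # e.g. "primary"


def select(records: Iterable[QuestionRecord],
           subject: Optional[str] = None,
           grade: Optional[str] = None) -> List[QuestionRecord]:
    """Keep records matching the given subject and/or grade (None = no filter)."""
    return [
        r for r in records
        if (subject is None or r.subject == subject)
        and (grade is None or r.grade == grade)
    ]


# Invented sample entries, not taken from the dataset.
records = [
    QuestionRecord("What is 7 x 8?", "56", "7 x 8 = 56.",
                   "short answer", "mathematics", "primary"),
    QuestionRecord("Water boils at 100 °C at sea level. True or false?", "True",
                   "At standard pressure the boiling point of water is 100 °C.",
                   "true/false", "physics", "middle"),
]

subset = select(records, subject="mathematics")
print(len(subset), subset[0].answer)  # 1 56
```

Filtering by subject and grade in this way is one possible route to the subject-specific knowledge enhancement use case mentioned in the description.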

1M Chinese Coding Questions Dataset – Python/Java/C++

This dataset contains 1 million Chinese programming questions with corresponding answers, detailed parses (explanations), and programming language labels. It includes a wide range of questions in C, C++, Python, Java, and JavaScript, making it ideal for training large language models (LLMs) on multilingual code understanding and generation. The questions cover fundamental to advanced topics, supporting AI applications such as code completion, bug fixing, and programming reasoning. This structured dataset enhances model performance in natural language programming tasks and helps reinforce code logic skills in AI systems. All data complies with international privacy regulations including GDPR, CCPA, and PIPL.
Chinese coding questions dataset, programming QA data, parsed coding problems, Python Java C++ dataset, code generation LLM dataset, Chinese code questions

Long Context Reasoning Dataset – Multi-Language (EN/CH/KR) Benchmark for LLM Evaluation

This dataset is designed to tackle the core weaknesses of today's large language models when it comes to processing long documents and performing complex reasoning. It consists of 7,500 high-quality training examples across three languages—Chinese, English, and Korean. Each instance is built around a long-text passage and includes questions that require synthesizing information across paragraphs and documents, while following multi-step logical chains. The goal is to offer a thorough and rigorous evaluation framework that tests a model's ability to perceive long-range context, retrieve relevant information, construct sound reasoning paths, and trace evidence back to its source.
long context dataset, long context reasoning dataset, LLM long context dataset, long document QA dataset, multi hop reasoning dataset, reasoning dataset for LLM, multi step reasoning dataset

6.9 million - Chinese Multi-disciplinary Questions Text Parsing And Processing Data

This dataset contains 6.9 million Chinese multi-disciplinary questions for text parsing and processing, covering multiple disciplines at the primary school, middle school, high school, and university levels. Each question contains a title, answer, parse (explanation), type, subject, and grade. The dataset can be used for subject knowledge enhancement tasks for large models.
Chinese, multi-disciplinary, Questions, LLM, Text

89,007 Sets of Japanese–Arabic Image-Text Construction Data

The product contains a total of 89,007 data samples, with each sample consisting of one image and one JSON document. The JSON document may contain an image caption, a visual question-answering pair, OCR results extracted from the image, or a visual question-answering pair based on the OCR results. The dataset covers Arabic and Japanese and spans six domains: ① Business and Finance, ② Coding and Computer Science, ③ Law, Government, and Politics, ④ Science, Technology, Engineering, and Mathematics (STEM), ⑤ Society, Culture, Humanities, and Religion, ⑥ Sports, Lifestyle, and Leisure. The accuracy of image domain classification (per-image accuracy) is above 95%; the matching degree between image and text description is greater than 95%; and OCR recognition accuracy (per-sentence accuracy) exceeds 95%. The dataset is suitable for multilingual OCR, multimodal LLM training, image captioning, and multilingual VQA tasks.
Japanese, Arabic, Visual Question Answering (VQA), Image Captioning, Optical Character Recognition (OCR)
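
Since each sample pairs one image with one JSON document that may carry any of four kinds of content, a loader would likely need to check which kinds are present. The sketch below shows one way to do that in Python. It rests on assumptions: the listing does not give the JSON schema, so the key names `caption`, `vqa`, `ocr`, and `ocr_vqa`, the helper `content_types`, and the sample document are hypothetical.

```python
import json
from typing import List

# Hypothetical key names for the four content kinds named in the listing;
# the real schema is not published there.
CONTENT_KEYS = ("caption", "vqa", "ocr", "ocr_vqa")


def content_types(json_text: str) -> List[str]:
    """Return which of the four content kinds a sample's JSON document carries."""
    doc = json.loads(json_text)
    return [key for key in CONTENT_KEYS if key in doc]


# Invented sample document, not taken from the dataset.
sample = json.dumps({
    "image": "sample_0001.jpg",
    "language": "ja",
    "domain": "Business and Finance",
    "ocr": ["営業中"],
    "ocr_vqa": {"question": "What does the sign say?", "answer": "営業中"},
}, ensure_ascii=False)

print(content_types(sample))  # ['ocr', 'ocr_vqa']
```

Grouping samples by the returned content kinds would make it straightforward to route them to OCR, captioning, or VQA training tasks.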

20,011 Image Caption Data of OCR in Natural Scenes

This dataset contains 20,011 image captions for OCR in natural scenes, covering 14 Asian and European languages. Collection environments include shop plaques, stop signs, posters, road signs, and other scenes, captured from a variety of shooting angles. The captions are written in English and mainly describe the text arrangement, text content, color, and other information.
AIGC, English caption, OCR caption, Multiple shooting angles, Multinational scenes

Tailor Your Data Now

Why off-the-shelf Datasets

  • Copyright

    Clear copyright, ready to check
  • Security

    Properly authorized and secure to use
  • Professional

    Designed and produced by AI data experts
  • Diversity

    Collected from a variety of real scenes
  • Cost Effective

    More cost-efficient than tailored data
  • Efficiency

    Ready to go, delivered in seconds