LLM Datasets

Instantly enhance AI model performance with high-quality, off-the-shelf datasets.

Type

All (32)
Image Caption (14)
SFT Datasets (8)
Pre-training Text (13)

1.51M Instruction-Based Image Editing Dataset for Generative AI Training

This dataset contains 1.51 million annotated image editing pairs. Editing types include 500,000 sets of portrait/object consistency editing, 300,000 sets of structural edits, 210,000 sets of mixed editing, 450,000 sets of spatial editing, and 50,000 sets of style transfer editing. The editing targets cover subjects such as people, animals, goods, plants, and landscapes. For annotation, the target regions in each image are edited according to the editing instructions. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.
generative AI image dataset image editing dataset AI image editing dataset image editing training data AI image manipulation dataset image editing pairs dataset image inpainting dataset style transfer dataset

50,000 Image Editing Datasets – Object Removal, Addition & Modification Dataset for AI Training

This dataset contains 50,000 sets of image editing data. The editing types include human attribute editing, image semantic editing, and image structure editing. The editing targets cover subjects such as people, animals, goods, plants, and landscapes. For annotation, the target regions in each image are edited according to the editing instructions. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.
image editing dataset image synthesis data object removal dataset object addition data AI image generation dataset virtual scene dataset annotated image editing data inpainting dataset AI training data for image manipulation generative image dataset

32M Science QA Dataset – Answers & Parsing for LLMs

32 million structured science questions covering mathematics, physics, chemistry, and biology across primary, middle, high school, and university levels. Each question entry includes a title, answer, solution parsing, question type, subject category, and corresponding grade level. The dataset is designed to support AI training tasks such as large language model development, subject-specific knowledge enhancement, machine reading comprehension, and question-answering systems. It provides a rich resource for educational NLP applications and has been validated for quality and completeness. All data complies with global data protection standards including GDPR, CCPA, and PIPL.
science question dataset STEM QA dataset math physics chemistry biology questions education NLP dataset AI training data structured question answer dataset academic QA dataset question parsing dataset K-12 science dataset university level questions dataset

Bilingual Image Caption Dataset - 2.4 Million Pairs

This dataset consists of about 2.4 million image–text pairs. The images cover various categories, including landscapes, animals, flowers and trees, people, cars, sports, industry, and architecture, along with an aesthetic subset. Each image is paired with descriptive captions in both English and Chinese, covering overall scene understanding, local visual details, and high-level emotional context.
image caption data image captioning dataset image text dataset multimodal dataset vision language dataset

Japanese Q&A Dataset from OKWAVE – 8.4M Questions

This dataset is collected from the Japanese OKWAVE Q&A platform and includes large-scale parsed and processed text data suitable for LLM training and Japanese natural language understanding. It contains structured fields such as questions, answers, categories, timestamps, user metadata, and supplementary explanations. As of April 2025, the dataset includes 8.4 million questions with 2.3 billion words, 27 million answers totaling 7.6 billion words, 15.5 million thank-you messages (1.7 billion words), and 2.1 million supplementary replies (360 million words). Continuously updated and rich in user-generated content, this dataset is ideal for building Japanese conversational AI, ChatGPT fine-tuning, question answering systems, text summarization, and semantic parsing models. All data complies with relevant data usage and privacy regulations.
Japanese Q&A dataset OKWAVE forum data Japanese language corpus Japanese dialogue dataset ChatGPT Japanese fine-tuning user-generated content question answer dataset

288 Million 3D Models & Scenes Dataset for AI and Simulation

This dataset includes 270 million 3D models and 18 million 3D scenes. The 3D models cover conventional models, interactive models, and physics-enhanced models of various objects in indoor residential environments. The 3D scenes cover indoor home decoration scenarios and commercial space environments. This dataset can be used for tasks such as 3D asset generation, virtual environment simulation, AI model training, and industrial design applications.
3D models dataset 3D scenes dataset indoor 3D environment dataset commercial 3D space dataset physics-enhanced 3D models interactive 3D models dataset 3D assets generation dataset simulation training environment dataset virtual environment 3D data large-scale 3D AI dataset

122,147 Questions - Logical Reasoning Question Data

This dataset comprises 122,147 logical reasoning test questions, covering question types such as diagrammatic reasoning, IQ tests, logical thinking puzzles, visual-spatial reasoning, knowledge-based image inference, and detective-style reasoning problems. The dataset includes transcribed questions, answers, and detailed explanations, and is designed to enhance large language models' logical reasoning capabilities. Throughout data collection, storage, and usage, we strictly comply with data protection regulations and privacy laws, including the GDPR, CCPA, and PIPL, to ensure that the privacy rights and legitimate interests of users are fully protected.
Logical COT VLM

Multilingual Grammar Correction Dataset – 480K Parallel Texts (DE, ES, FR, IT)

This dataset focuses on four major European languages (French, German, Spanish, Italian) and contains 480,000 pairs of original and corrected text. Each record is presented in JSON format with two fields: input (raw text) and output (corrected text). The data can support natural language processing, machine translation, and language teaching research.
Multilingual Grammar Correction Dataset Grammar Correction Dataset
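
As a rough illustration of the record layout described above, the following Python sketch parses one JSON record and lists the word-level corrections. Only the field names input and output come from the listing; the sample sentence, the one-record-per-line layout, and the helper names load_pair and changed_tokens are assumptions made for this example, not part of the dataset's documentation.

```python
import json
from difflib import SequenceMatcher

# Hypothetical record: the listing only states the two field names,
# "input" (raw text) and "output" (corrected text). The sentence is made up.
sample_line = (
    '{"input": "Ich habe gestern ins Kino gegangen.", '
    '"output": "Ich bin gestern ins Kino gegangen."}'
)


def load_pair(line):
    """Parse one JSON record and return (raw_text, corrected_text)."""
    record = json.loads(line)
    return record["input"], record["output"]


def changed_tokens(raw, corrected):
    """Return (before, after) pairs for whitespace-token spans that differ."""
    a, b = raw.split(), corrected.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


raw, corrected = load_pair(sample_line)
edits = changed_tokens(raw, corrected)
print(edits)
```

Splitting on whitespace keeps the sketch simple; a real pipeline would likely use a language-aware tokenizer for each of the four languages.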

Riddles and Brain Teasers Dataset

This dataset contains more than 100,000 riddles and more than 3,000 brain teasers. It can be applied to LLM training, phone assistants, and other scenarios.
Riddles Brain teasers

Tailor Your Data Now

Why off-the-shelf Datasets

  • Copyright

    Clear Copyright and Ready to Check
  • Security

    Properly Authorized and Secure to Use
  • Professional

    Designed and Produced by AI Data Experts
  • Diversity

    Collected from a Variety of Real Scenes
  • Cost Effective

    More Cost-Efficient Than Tailored Data
  • Efficiency

    Ready to Go, Delivered in Seconds