LLM Datasets

Instantly enhance AI model performance with high-quality off-the-shelf datasets.

Type

All (16)
Image Caption (10)
SFT Datasets (1)
Pre-training Text (5)

300 Million Pairs - High-Quality Image-Caption Dataset

300 million images, each paired with a description. All images are genuine works published by photographers. The vast majority of descriptions are in English, with a small number in Chinese.
multimodal image description

7 Million Sets - High-Quality Video Caption Dataset

multimodal video description caption

10 Million - English Test Questions Text Parsing And Processing Data

10 Million - English Test Questions Text Parsing And Processing Data. Each record contains title, answer, parse, subject, grade, and question type. The educational stages cover primary school, middle school, high school, and university. Subjects cover mathematics, biology, accounting, etc.
English test questions text data LLM Large Language Model Large Model chatgpt data

100,000 Instruction-Following Evaluation SFT for Chinese LLM Text Data

100,000 Instruction-Following Evaluation SFT for Chinese LLM Text Data. Prompts are between 50 and 400 words long, with no fewer than 3 constraints in each. All prompts are manually written to ensure diverse coverage.
LLM Instruction-Following SFT

Large Language Model content safety considerations text data

Large Language Model content safety considerations text data, about 570,000 in total. This dataset can be used for tasks such as training LLMs and ChatGPT-style models.
Large Language Model content safety considerations text data LLM Large Language Model Large Model chatgpt data

203,029 Groups - Chinese Medical Question Answering Data

The data contains 203,029 groups of Chinese question-answering exchanges between doctors and patients, covering different diseases.
Medical question answering disease

2 Million Pairs Image Caption Data Of General Scenes

2 million pairs of images and descriptions. The pictures cover various categories, including landscapes, animals, flowers and trees, people, cars, sports, industry, and architecture, along with an aesthetic subset. The descriptions cover the overall scene of the image, the details within the scene, and the emotions conveyed by the image, and are provided in both English and Chinese.
Text description multi-modality general scene data set English caption Chinese caption

830,276 Groups - Multi-Round Interpersonal Dialogues Text Data

This database is a corpus of text interactions between real users on mobile phones. It has been desensitized so that it contains no private user information: A and B are codes that replace the sender and receiver, and sensitive information such as cellphone numbers and user names is replaced with '* * *'. The database can be used for tasks such as natural language understanding.
Interactive text corpus database text corpus database

90,000 sets – Multi-domain Customer Service Dialogue Text Data

Multi-domain Customer Service Dialogue Text Data, 90,000 sets in total, spanning multiple domains, including telecommunications, e-commerce, finance, lifestyle, business, education, healthcare, and entertainment. Each set consists of single-turn or multi-turn conversations. This dataset can be used for tasks such as training LLMs and ChatGPT-style models.
Customer Service Dialogue text data telecommunications topics data commerce topics data finance topics data LLM data Large Language Model data chatgpt data

Tailor Your Data Now

Why Off-the-Shelf Datasets

  • Copyright

    Clear Copyright and Ready to Check
  • Security

    Properly Authorized, Secure to Use
  • Professional

    Designed and produced by AI data experts
  • Diversity

    Collected from a variety of real scenes
  • Cost Effective

    More Cost-Efficient Than Tailored Data
  • Efficiency

    Ready-To-Go, Delivered in Seconds