LLM Datasets

Instantly enhance AI model performance with high-quality, off-the-shelf datasets.

Type

  • All (17)
  • Image Caption (10)
  • SFT Datasets (1)
  • Pre-training Text (6)

31 million Southeast Asian language news text dataset

This dataset contains multilingual news data from Southeast Asia in four languages: Indonesian, Malay, Thai, and Vietnamese. The total amount of data exceeds 31 million and is stored in JSONL format, with each record on its own line for efficient reading and processing. The data is drawn from a wide range of sources and news topics, and reflects social developments, cultural trends, and economic trends in Southeast Asia. This dataset can help large models improve their multilingual capabilities, enrich their cultural knowledge, optimize performance, expand industry applications in Southeast Asia, and support cross-lingual research.
Minor languages Southeast Asia NEWS Journalism
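
Because the description above says the data is stored as JSONL with one record per line, it can be streamed without loading the whole file into memory. The following is a minimal sketch of that idea, not the vendor's tooling. The field names `lang` and `text`, the sample records, and the helper names are assumptions made for illustration; the actual schema is not given in the listing.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def count_by_field(path: str, field: str) -> Counter:
    """Count records per value of `field`, reading one line at a time."""
    return Counter(record.get(field, "unknown") for record in iter_jsonl(path))


def write_sample(path: str) -> None:
    """Write a tiny stand-in file so the sketch can run end to end."""
    # Hypothetical records: "lang" and "text" are assumed field names.
    sample = [
        {"lang": "id", "text": "placeholder news text 1"},
        {"lang": "id", "text": "placeholder news text 2"},
        {"lang": "th", "text": "placeholder news text 3"},
        {"lang": "vi", "text": "placeholder news text 4"},
    ]
    with open(path, "w", encoding="utf-8") as handle:
        for record in sample:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample_path = str(Path(tmp) / "sample.jsonl")
        write_sample(sample_path)
        print(count_by_field(sample_path, "lang"))
```

Using a generator keeps memory use flat regardless of file size, which matters for a file with tens of millions of lines.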

21,998 Image Caption Data of Vehicles

21,998 image captions of vehicles, covering various types of cars, SUVs, MPVs, trucks, and buses. The images were collected by surveillance cameras on outdoor roads over multiple time periods. The captions mainly describe vehicle type, color, vehicle orientation, scene, and similar information. The description language is English.
multi-modality vehicle attribute data security data intelligent monitoring data intelligent traffic data smart city data

1 Million Pairs Image Caption Data Of General Scenes

1 million pairs of images and descriptions. The pictures cover various categories, including landscapes, animals, flowers and trees, people, cars, sports, industry, and architecture, along with an aesthetic subset. The descriptions cover the overall scene of the image, the details within the scene, and the emotions conveyed by the image. Descriptions are provided in both English and Chinese.
Text description multi-modality general scene data set English caption Chinese caption

10,000 Image Caption Data of Diverse Scenes

10,000 image captions of diverse scenes, including natural scenes, urban street scenes, exhibitions, family environments, and others. The images were shot with different brands of cameras across multiple time periods and shooting angles. The description language is English. Each caption mainly describes the main scene in the image, usually including the foreground and background.
multi-modality natural scene data set scene information data

10,100 Image Caption Data of Human Face

10,100 image captions of human faces, covering multiple races and four age groups: under 18, 18 to 45, 46 to 60, and over 60. The collection scenes include both indoor and outdoor settings. The image content is varied, including masks, glasses, headphones, facial expressions, gestures, and adversarial examples. The text descriptions are in English and mainly describe race, gender, age, shooting angle, lighting, and diversity content.
multi-modal multi-pose face image data face dataset

11,000 Image & Video Caption Data of Human Action

11,000 image and video captions of human action, containing 10,000 images and 10,000 videos of various human behaviors in different seasons and from different shooting angles, including indoor and outdoor scenes. The description language is English, and the captions mainly describe the gender, age, clothing, behavior, and body movements of the people shown.
AIGC human behavior data behavior recognition data human behavior recognition data human detection data

90,000 sets – Multi-domain Customer Service Dialogue Text Data

Multi-domain customer service dialogue text data, 90,000 sets in total, spanning multiple domains including telecommunications, e-commerce, finance, lifestyle, business, education, healthcare, and entertainment. Each set consists of a single-turn or multi-turn conversation. This dataset can be used for tasks such as LLM training, chatgpt
Customer Service Dialogue text data telecommunications topics data commerce topics data finance topics data LLM data Large Language Model data chatgpt data

300 Million Pairs - High-Quality Image-Caption Dataset

300 million images, each corresponding to a description. All are genuine image works published by photographers. The vast majority of descriptions are in English, with very few in Chinese.
multimodal image description

7 Million Sets - High-Quality Video Caption Dataset

7 million genuine, high-quality videos from around the world, all released by photographers. 6 million of them are described in English and 1 million in Chinese. They cover a variety of categories such as people, landscapes, and animals. The resolution is above 1080p.
multimodal video description caption LLM dataset

Tailor Your Data Now

Why Off-the-Shelf Datasets

  • Copyright

    Clear Copyright and Ready to Check
  • Security

    Properly Authorized, Secure to Use
  • Professional

    Designed and Produced by AI Data Experts
  • Diversity

    Collected from a Variety of Real Scenes
  • Cost Effective

    More Cost-Efficient Than Tailored Data
  • Efficiency

    Ready-to-Go, Delivered in Seconds