NLU Datasets

Instantly enhance AI model performance with high quality off-the-shelf datasets.

Type

  • All (34)
  • Entity Identification (4)
  • Dialogue Text (1)
  • Intention Understanding (1)
  • Others (2)
  • Parallel Corpus (23)

75 Dictionaries of Different Chinese Fields

75 Chinese domain dictionaries covering a wide range of content, including data for a certain year. Each line in the data file contains a term and its Chinese pinyin, and the terms are sorted alphabetically. This dataset can be used for tasks such as natural language understanding and knowledge base building.
Chinese domain dictionary data text data NLU data Entity Identification data
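
As a rough illustration of how a file in this layout might be consumed, the sketch below parses lines that each hold a term and its pinyin. The tab delimiter, the function name `parse_dictionary_lines`, and the sample entries are assumptions made for illustration; the actual delimiter and layout of the dataset files are not specified here.

```python
from typing import Dict, Iterable


def parse_dictionary_lines(lines: Iterable[str], delimiter: str = "\t") -> Dict[str, str]:
    """Map each term to its pinyin, skipping blank or malformed lines.

    Assumption: one entry per line, with term and pinyin separated by `delimiter`.
    """
    entries: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank line
        term, sep, pinyin = line.partition(delimiter)
        if not sep or not term.strip() or not pinyin.strip():
            continue  # no delimiter found, or one of the two fields is empty
        entries[term.strip()] = pinyin.strip()
    return entries


# Hypothetical sample lines, not taken from the dataset.
sample = ["北京\tbei jing", "", "上海\tshang hai", "malformed line"]
lexicon = parse_dictionary_lines(sample)
```

A plain term-to-pinyin mapping is enough for lookups; if the real files use a different separator, only the `delimiter` argument would need to change.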

84,516 Sentences - English Intention Annotation Data in Interactive Scenes

84,516 English sentences from interactive scenes, annotated with intent classes as well as slot and slot value information. The intent domains include music, weather, date, schedule, home equipment, etc. The data can be applied to intent recognition research and related fields.
English Intent-type Intention
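
To make the annotation layers concrete, here is a minimal sketch of how one annotated sentence (intent class plus slots and slot values) could be represented in code. The class `IntentExample`, its field names, and the sample sentences are hypothetical; they are not the dataset's actual schema or contents.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IntentExample:
    """One annotated sentence (hypothetical schema, not the dataset's own)."""
    text: str
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)  # slot name -> slot value


def group_by_intent(examples: List[IntentExample]) -> Dict[str, List[IntentExample]]:
    """Collect examples under their intent class, preserving input order."""
    grouped: Dict[str, List[IntentExample]] = {}
    for ex in examples:
        grouped.setdefault(ex.intent, []).append(ex)
    return grouped


# Hypothetical samples, not taken from the dataset.
examples = [
    IntentExample("play some jazz", "music", {"genre": "jazz"}),
    IntentExample("will it rain tomorrow", "weather", {"date": "tomorrow"}),
    IntentExample("play the next song", "music"),
]
by_intent = group_by_intent(examples)
```

Grouping by intent class like this is a common first step before checking class balance or splitting data for an intent recognition model.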

12,820,000 Groups - Chinese-Korean Parallel Corpus Data

12,820,000 groups of Chinese-Korean parallel translation corpus, stored in txt files. The data covers many fields, including spoken language, travel, news, and finance. Data cleaning, desensitization, and quality inspection have been carried out. It can be used as a basic corpus for text data analysis and in fields such as machine translation.
Chinese Korean Chinese-Korean Parallel Corpus

980,000 Groups - Chinese-Urdu Parallel Corpus Data

980,000 groups of Chinese-Urdu parallel translation corpus, stored in txt format. Data cleaning, desensitization, and quality inspection have been carried out. The data can be used as a basic corpus for text data analysis and in fields such as machine translation.
Chinese Urdu Chinese-Urdu Parallel Corpus

1,990,000 Groups - Chinese-Czech Parallel Corpus Data

1,990,000 groups of Chinese-Czech parallel translation corpus, stored in txt format. Data cleaning, desensitization, and quality inspection have been carried out. The data can be used as a basic corpus for text data analysis and in fields such as machine translation.
Chinese Czech Parallel

1,980,000 Groups - Chinese-Polish Parallel Corpus Data

1,980,000 groups of Chinese-Polish parallel translation corpus, stored in txt format. Data cleaning, desensitization, and quality inspection have been carried out. The data can be used as a basic corpus for text data analysis and in fields such as machine translation.
Chinese Polish Parallel

100,000 Groups - Chinese-Uighur Parallel Corpus Data

100,000 groups of Chinese-Uighur parallel translation corpus, stored in txt format. Translation fluency and fidelity are above 80%. Data cleaning, desensitization, and quality inspection have been carried out. The data can be used as a basic corpus for text data analysis and in fields such as machine translation.
Chinese-Uighur Parallel Corpus

1,080,000 Groups - English-Russian Parallel Corpus Data

English-Russian parallel corpus, 1,080,000 groups in total. Political, pornographic, personal, and other sensitive content has been excluded. The data can serve as a basic corpus for text data analysis and be used in machine translation and other fields.
English-Russian parallel corpus

7,440,000 Groups - Chinese-Hindi Parallel Corpus Data

7,440,000 Chinese-Hindi parallel sentence pairs, stored in text format. The data covers multiple fields such as tourism, medical treatment, daily life, and news. Data desensitization and quality checking have been carried out. It can be used as a basic corpus for text data analysis and in fields such as machine translation.
Chinese-Hindi parallel corpus

Tailor Your Data Now

Why off-the-shelf Datasets

  • Copyright

    Clear Copyright and Ready to Check
  • Security

    Properly Authorized, Secure to Use
  • Professional

    Designed and Produced by AI Data Experts
  • Diversity

    Collected from a Variety of Real Scenes
  • Cost Effective

    More Cost-Efficient Than Tailored Data
  • Efficiency

    Ready to Go, Delivered in Seconds