31 million Southeast Asian language news text dataset
Minor languages
Southeast Asia
News
Journalism
This dataset contains multilingual news text from Southeast Asia in four languages: Indonesian, Malay, Thai, and Vietnamese. It comprises more than 31 million records stored in JSONL format, with one record per line for efficient reading and processing. The data is drawn from a wide range of sources and news topics, reflecting social developments, cultural trends, and economic news across Southeast Asia. It can help large models improve their multilingual capabilities, enrich their cultural knowledge, optimize performance, expand industry applications in Southeast Asia, and advance cross-lingual research.
This is a paid dataset for commercial use, research purposes, and more. Licensed, ready-made datasets help jump-start AI projects.
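Because the data is delivered as JSONL with one record per line, it can be streamed without loading a whole file into memory. The following is a minimal sketch only, assuming each line is a JSON object keyed by the listed fields (URL, title, published_time, article_content, category). The exact key spelling, the value formats, and the two sample records are assumptions made for illustration, not taken from the dataset.

```python
import io
import json
from collections import Counter


def iter_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for a real file. The key names follow the
# listed fields; the values and their formats are assumptions.
sample = io.StringIO(
    '{"URL": "https://example.com/a", "title": "Judul", '
    '"published_time": "2023-01-01", "article_content": "Isi berita", '
    '"category": "ekonomi"}\n'
    '{"URL": "https://example.com/b", "title": "Tieu de", '
    '"published_time": "2023-01-02", "article_content": "Noi dung", '
    '"category": "ekonomi"}\n'
)

records = list(iter_records(sample))

# Count records per category as a simple example of per-record processing.
category_counts = Counter(record["category"] for record in records)
print(category_counts)
```

For a real file, the same generator could be fed an open file handle, for example `open("news.jsonl", encoding="utf-8")`, where `news.jsonl` is a hypothetical file name. Iterating over the generator instead of calling `list()` keeps memory use flat for files with millions of lines.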
Specifications
Languages
Indonesian, Malay, Thai, Vietnamese
Data volume
14,447,771 Indonesian, 1,239,420 Malay, 6,467,564 Thai, and 8,942,813 Vietnamese records, over 31 million in total