Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
| dc.contributor.author | Cahyawijaya S. | |
| dc.contributor.author | Lovenia H. | |
| dc.contributor.author | Moniz J.R.A. | |
| dc.contributor.author | Wong T.H. | |
| dc.contributor.author | Farhansyah M.R. | |
| dc.contributor.author | Maung T.T. | |
| dc.contributor.author | Hudi F. | |
| dc.contributor.author | Anugraha D. | |
| dc.contributor.author | Habibi M.R.S. | |
| dc.contributor.author | Qorib M.R. | |
| dc.contributor.author | Agarwal A. | |
| dc.contributor.author | Imperial J.M. | |
| dc.contributor.author | Patel H.L. | |
| dc.contributor.author | Feliren V. | |
| dc.contributor.author | Nasution B.I. | |
| dc.contributor.author | Rufino M.A. | |
| dc.contributor.author | Winata G.I. | |
| dc.contributor.author | Rajagede R.A. | |
| dc.contributor.author | Catalan C.R. | |
| dc.contributor.author | Imam M.F. | |
| dc.contributor.author | Pattnayak P. | |
| dc.contributor.author | Pranida S.Z. | |
| dc.contributor.author | Pratama K. | |
| dc.contributor.author | Bangera Y. | |
| dc.contributor.author | Na-Thalang A. | |
| dc.contributor.author | Monderin P.N. | |
| dc.contributor.author | Song Y. | |
| dc.contributor.author | Simon C. | |
| dc.contributor.author | Ng L.H.X. | |
| dc.contributor.author | Sapan R.L. | |
| dc.contributor.author | Rafi T.H. | |
| dc.contributor.author | Wang B. | |
| dc.contributor.author | Supryadi | |
| dc.contributor.author | Veerakanjana K. | |
| dc.contributor.author | Ittichaiwong P. | |
| dc.contributor.author | Roque M.T. | |
| dc.contributor.author | Vincentio K. | |
| dc.contributor.author | Kreangphet T. | |
| dc.contributor.author | Artkaew P. | |
| dc.contributor.author | Palgunadi K.H. | |
| dc.contributor.author | Yu Y. | |
| dc.contributor.author | Hastuti R.P. | |
| dc.contributor.author | Nixon W. | |
| dc.contributor.author | Bangera M. | |
| dc.contributor.author | Lim A.X.W. | |
| dc.contributor.author | Khine A.H. | |
| dc.contributor.author | Zhafran H.M. | |
| dc.contributor.author | Ferdinan T. | |
| dc.contributor.author | Izzani A.A. | |
| dc.contributor.author | Singh A. | |
| dc.contributor.author | Evan | |
| dc.contributor.author | Krito J.A. | |
| dc.contributor.author | Anugraha M. | |
| dc.contributor.author | Ilasariya F.A. | |
| dc.contributor.author | Li H. | |
| dc.contributor.author | Daniswara J.A. | |
| dc.contributor.author | Tjiaranata F.A. | |
| dc.contributor.author | Yulianrifat E.P. | |
| dc.contributor.author | Udomcharoenchaikit C. | |
| dc.contributor.author | Ansori F.R. | |
| dc.contributor.author | Ihsani M.K. | |
| dc.contributor.author | Nguyen G. | |
| dc.contributor.author | Barik A.M. | |
| dc.contributor.author | Velasco D.J. | |
| dc.contributor.author | Genadi R.A. | |
| dc.contributor.author | Saha S. | |
| dc.contributor.author | Wei C. | |
| dc.contributor.author | Flores I. | |
| dc.contributor.author | Chen K.K.H. | |
| dc.contributor.author | Santos A.G. | |
| dc.contributor.author | Lim W.S. | |
| dc.contributor.author | Phyo K.S. | |
| dc.contributor.author | Santos T. | |
| dc.contributor.author | Dwiastuti M. | |
| dc.contributor.author | Luo J. | |
| dc.contributor.author | Cruz J.C.B. | |
| dc.contributor.author | Hee M.S. | |
| dc.contributor.author | Hanif I.A. | |
| dc.contributor.author | Alif Al Hakim M. | |
| dc.contributor.author | Sya'ban M.R. | |
| dc.contributor.author | Kerdthaisong K. | |
| dc.contributor.author | Miranda L.J.V. | |
| dc.contributor.author | Koto F. | |
| dc.contributor.author | Fatyanosa T.N. | |
| dc.contributor.author | Aji A.F. | |
| dc.contributor.author | Rosal J.J. | |
| dc.contributor.author | Kevin J. | |
| dc.contributor.author | Wijaya R. | |
| dc.contributor.author | Kampman O.P. | |
| dc.contributor.author | Zhang R. | |
| dc.contributor.author | Karlsson B.F. | |
| dc.contributor.author | Limkonchotiwat P. | |
| dc.contributor.correspondence | Cahyawijaya S. | |
| dc.contributor.other | Mahidol University | |
| dc.date.accessioned | 2025-11-18T18:18:42Z | |
| dc.date.available | 2025-11-18T18:18:42Z | |
| dc.date.issued | 2025-01-01 | |
| dc.description.abstract | Despite Southeast Asia's (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method's effectiveness in capturing cultural relevance. We found that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million SEA culturally relevant images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions. | |
| dc.identifier.citation | Proceedings of the Annual Meeting of the Association for Computational Linguistics Vol.1 (2025), 18685-18717 | |
| dc.identifier.issn | 0736587X | |
| dc.identifier.scopus | 2-s2.0-105021028710 | |
| dc.identifier.uri | https://repository.li.mahidol.ac.th/handle/123456789/113066 | |
| dc.rights.holder | SCOPUS | |
| dc.subject | Computer Science | |
| dc.subject | Social Sciences | |
| dc.subject | Arts and Humanities | |
| dc.title | Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia | |
| dc.type | Conference Paper | |
| mu.datasource.scopus | https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105021028710&origin=inward | |
| oaire.citation.endPage | 18717 | |
| oaire.citation.startPage | 18685 | |
| oaire.citation.title | Proceedings of the Annual Meeting of the Association for Computational Linguistics | |
| oaire.citation.volume | 1 | |
| oairecerif.author.affiliation | University of Toronto | |
| oairecerif.author.affiliation | University of Illinois Urbana-Champaign | |
| oairecerif.author.affiliation | The University of Manchester | |
| oairecerif.author.affiliation | National University of Singapore | |
| oairecerif.author.affiliation | Monash University | |
| oairecerif.author.affiliation | Tianjin University | |
| oairecerif.author.affiliation | New York University | |
| oairecerif.author.affiliation | Carnegie Mellon University | |
| oairecerif.author.affiliation | Brown University | |
| oairecerif.author.affiliation | Auburn University | |
| oairecerif.author.affiliation | Hanyang University | |
| oairecerif.author.affiliation | Chulalongkorn University | |
| oairecerif.author.affiliation | University of Bath | |
| oairecerif.author.affiliation | Universitas Indonesia | |
| oairecerif.author.affiliation | Polytechnique Montréal | |
| oairecerif.author.affiliation | Universitas Gadjah Mada | |
| oairecerif.author.affiliation | Institut Teknologi Bandung | |
| oairecerif.author.affiliation | Nara Institute of Science and Technology | |
| oairecerif.author.affiliation | Thammasat University | |
| oairecerif.author.affiliation | Institut Teknologi Sepuluh Nopember | |
| oairecerif.author.affiliation | Macau University of Science and Technology | |
| oairecerif.author.affiliation | Brawijaya University | |
| oairecerif.author.affiliation | Siriraj Hospital | |
| oairecerif.author.affiliation | King Mongkut's University of Technology Thonburi | |
| oairecerif.author.affiliation | Bina Nusantara University | |
| oairecerif.author.affiliation | Indian Statistical Institute, Kolkata | |
| oairecerif.author.affiliation | Ton-Duc-Thang University | |
| oairecerif.author.affiliation | Seoul National University of Science and Technology | |
| oairecerif.author.affiliation | A-Star, Institute for Infocomm Research | |
| oairecerif.author.affiliation | Singapore University of Technology and Design | |
| oairecerif.author.affiliation | Srinakharinwirot University | |
| oairecerif.author.affiliation | Universitas Islam Indonesia | |
| oairecerif.author.affiliation | Ateneo de Manila University | |
| oairecerif.author.affiliation | University of New Haven | |
| oairecerif.author.affiliation | Mohamed Bin Zayed University of Artificial Intelligence | |
| oairecerif.author.affiliation | Montreal Institute for Learning Algorithms | |
| oairecerif.author.affiliation | Universitas Pelita Harapan | |
| oairecerif.author.affiliation | Oracle Corporation | |
| oairecerif.author.affiliation | University of the Philippines | |
| oairecerif.author.affiliation | Vidyasirimedhi Institute of Science and Technology | |
| oairecerif.author.affiliation | Singapore Polytechnic | |
| oairecerif.author.affiliation | National University, Philippines | |
| oairecerif.author.affiliation | Sony Group Corporation | |
| oairecerif.author.affiliation | Graphcore Limited | |
| oairecerif.author.affiliation | MOH Office for Healthcare Transformation | |
| oairecerif.author.affiliation | Allen Institute for AI | |
| oairecerif.author.affiliation | AI Singapore | |
| oairecerif.author.affiliation | Beijing Academy of Artificial Intelligence (BAAI) | |
| oairecerif.author.affiliation | Wroclaw Tech | |
| oairecerif.author.affiliation | Cohere | |
| oairecerif.author.affiliation | SCB 10X | |
| oairecerif.author.affiliation | Meta | |
| oairecerif.author.affiliation | Samsung R&D Institute Philippines | |
| oairecerif.author.affiliation | Capital One | |
| oairecerif.author.affiliation | Works Applications Lab | |
| oairecerif.author.affiliation | SEACrowd | |
| oairecerif.author.affiliation | Dataxet:Sonar | |
| oairecerif.author.affiliation | IndoNLP |
