Author: Yuenyong S.
Affiliation: Mahidol University
Dates: 2023-06-18; 2023-06-18; 2022-01-01
Citation: 6th International Conference on Information Technology, InCIT 2022 (2022), 207-210
URI: https://repository.li.mahidol.ac.th/handle/123456789/84301
Abstract: Person description search is the task of matching a textual description of a person with an image of the same person. This is a multimodal image-text task, in which the model generally has two branches: image and text. The objective is for the two branches to embed their respective inputs into a joint space, where the embeddings should be near each other if the image and text form a matching pair, and far apart if they do not. The image branch can simply use pretrained vision models off the shelf without any modification, because 'person' is a common class in large image datasets. For the text branch, on the other hand, person descriptions are not part of the datasets commonly used to train large language models (LMs). Recent deep learning language models are based on the transformer architecture and are commonly trained on large text corpora with a masked language model loss. In this paper we propose finetuning the transformer-based LM in an unsupervised manner on the person description text before supervised training on the actual task. The results show that unsupervised LM finetuning is beneficial for Thai person description search.
Subject: Computer Science
Title: Finetuning Language Model for Person Description Search in Thai
Type: Conference Paper
Source: SCOPUS
DOI: 10.1109/InCIT56086.2022.10067683
Scopus EID: 2-s2.0-85151633056
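The masked language model objective mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state the masking scheme, so the 15% masking rate, the `[MASK]` token string, and the helper names `mask_tokens` and `mlm_loss` are all assumptions made for illustration, and an English sentence stands in for a tokenized Thai person description.

```python
import math
import random

MASK = "[MASK]"  # assumed mask token string; real tokenizers define their own


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide each token independently with probability mask_prob (assumed rate).

    Returns (masked_tokens, labels). labels[i] holds the original token at a
    masked position and None elsewhere, so the loss ignores unmasked positions.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def mlm_loss(predicted_probs, labels):
    """Mean negative log-likelihood of the original tokens at masked positions.

    predicted_probs[i] is a dict mapping candidate tokens to probabilities for
    position i (a stand-in for the LM's softmax output at that position).
    """
    losses = []
    for i, label in enumerate(labels):
        if label is None:
            continue  # unmasked position: contributes nothing to the loss
        p = max(predicted_probs[i].get(label, 0.0), 1e-12)  # avoid log(0)
        losses.append(-math.log(p))
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical description; an English stand-in for a tokenized Thai sentence.
description = "a man wearing a red shirt and black trousers".split()
masked, labels = mask_tokens(description, rng=random.Random(0))
print(masked)
```

Because the targets are just the hidden tokens of the input itself, this objective needs no image-text pairing or other annotation, which is why the finetuning step can run in an unsupervised manner on the description text alone before the supervised matching task.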