Multimodal fusion of visual and auditory biomarkers: An epoch-wise stochastic modality masking framework for stroke detection using a CNN–GRU network

dc.contributor.author: Alif S.S.
dc.contributor.author: Arafat M.Y.
dc.contributor.author: Nibir M.M.I.
dc.contributor.author: Tusti F.M.
dc.contributor.author: Zereen A.N.
dc.contributor.correspondence: Alif S.S.
dc.contributor.other: Mahidol University
dc.date.accessioned: 2026-04-11T18:24:55Z
dc.date.available: 2026-04-11T18:24:55Z
dc.date.issued: 2026-07-01
dc.description.abstract: The automated identification of neurovascular impairment, particularly ischemic or hemorrhagic stroke, remains challenging because symptoms such as facial asymmetry and dysarthria are inherently multimodal. This study introduces a multimodal diagnostic framework that integrates visual and auditory streams for accurate and interpretable stroke detection, with a focus on efficient use of the available data. On the visual side, facial landmarks derived from MediaPipe's Face Mesh are used to identify unilateral muscular deficiencies. On the auditory side, speech recordings are transformed into Mel-Frequency Cepstral Coefficients (MFCCs). Notably, the facial and audio data come from two separate sources and lack patient-level correspondence. A primary innovation is an epoch-wise stochastic modality masking approach applied within the training loop of a CNN–GRU architecture. In each epoch, a new batch of training data is created by dynamically combining samples from three categories: complete multimodal input (both facial landmarks and audio MFCCs), facial landmarks only, and audio MFCCs only. This compels the model to learn both unimodal and cross-modal representations, demonstrating that multiple modalities can be leveraged even when true multimodal correspondence across datasets is unavailable. Empirically, the framework achieves 95.33% accuracy in the multimodal setting. While the unimodal baselines (facial-only: 86.87%; audio-only: 97.41%) perform strongly on their own, the proposed modality masking approach enables the model to learn robust representations from cross-source data, demonstrating its potential for stroke detection in preliminary evaluation scenarios.
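The epoch-wise stochastic modality masking described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: for each sample in an epoch, one of three modes is drawn at random (both modalities, facial only, audio only), and the features of any masked modality are zeroed before being passed to the downstream CNN–GRU. The function name, mode encoding, and use of zero-filling for the absent modality are all assumptions for illustration.

```python
import numpy as np

def stochastic_modality_mask(facial_batch, audio_batch, rng):
    """Apply per-sample stochastic modality masking for one epoch.

    Modes (assumed encoding): 0 = both modalities kept,
    1 = facial only (audio zeroed), 2 = audio only (facial zeroed).
    Zeroed features stand in for the missing modality, letting a
    single model train on unimodal and pseudo-multimodal samples.
    """
    n = facial_batch.shape[0]
    modes = rng.integers(0, 3, size=n)      # one random mode per sample
    f = facial_batch.copy()
    a = audio_batch.copy()
    f[modes == 2] = 0.0                     # audio-only: drop facial landmarks
    a[modes == 1] = 0.0                     # facial-only: drop MFCC features
    return f, a, modes

# Example: 8 samples, 10 landmark features, 20 MFCC features.
rng = np.random.default_rng(0)
facial = np.ones((8, 10))
audio = np.ones((8, 20))
f_masked, a_masked, modes = stochastic_modality_mask(facial, audio, rng)
```

Because the facial and audio datasets lack patient-level correspondence, a pairing step (e.g. matching samples by class label) would be needed upstream to form the "both modalities" inputs; that step is omitted here.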
dc.identifier.citation: Array Vol.30 (2026)
dc.identifier.doi: 10.1016/j.array.2026.100787
dc.identifier.eissn: 2590-0056
dc.identifier.scopus: 2-s2.0-105034593088
dc.identifier.uri: https://repository.li.mahidol.ac.th/handle/123456789/116135
dc.rights.holder: SCOPUS
dc.subject: Computer Science
dc.title: Multimodal fusion of visual and auditory biomarkers: An epoch-wise stochastic modality masking framework for stroke detection using a CNN–GRU network
dc.type: Article
mu.datasource.scopus: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105034593088&origin=inward
oaire.citation.title: Array
oaire.citation.volume: 30
oairecerif.author.affiliation: Mahidol University
oairecerif.author.affiliation: North South University
oairecerif.author.affiliation: BRAC University
