Multimodal fusion of visual and auditory biomarkers: An epoch-wise stochastic modality masking framework for stroke detection using a CNN–GRU network
Issued Date
2026-07-01
Resource Type
Article
eISSN
2590-0056
Scopus ID
2-s2.0-105034593088
Journal Title
Array
Volume
30
Rights Holder(s)
SCOPUS
Bibliographic Citation
Array Vol.30 (2026)
Suggested Citation
Alif S.S., Arafat M.Y., Nibir M.M.I., Tusti F.M., Zereen A.N. Multimodal fusion of visual and auditory biomarkers: An epoch-wise stochastic modality masking framework for stroke detection using a CNN–GRU network. Array Vol.30 (2026). doi:10.1016/j.array.2026.100787 Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/116135
Title
Multimodal fusion of visual and auditory biomarkers: An epoch-wise stochastic modality masking framework for stroke detection using a CNN–GRU network
Author(s)
Alif S.S.; Arafat M.Y.; Nibir M.M.I.; Tusti F.M.; Zereen A.N.
Author's Affiliation
Corresponding Author(s)
Other Contributor(s)
Abstract
The automated identification of neurovascular impairment, particularly ischemic or hemorrhagic stroke, remains challenging due to the multimodal nature of symptoms such as facial asymmetry and dysarthria. This study introduces a multimodal diagnostic framework that integrates visual and auditory streams for accurate and interpretable stroke detection, with a focus on efficient utilization of available data. On the visual side, facial landmarks are derived from MediaPipe's Face Mesh to identify unilateral muscular deficiencies. On the auditory side, speech recordings are transformed into Mel-Frequency Cepstral Coefficients (MFCCs). Notably, the facial and audio data come from two separate sources and lack patient-level correspondence. A primary innovation is the incorporation of an epoch-wise stochastic modality masking approach within the training loop, built on a CNN–GRU architecture. In each epoch, a new training batch is composed by dynamically combining samples from three categories: complete multimodal input (both facial landmarks and audio MFCCs), facial landmarks only, and audio MFCCs only. This compels the model to learn both unimodal and cross-modal representations, demonstrating that multiple modalities can be leveraged even when true multimodal correspondence across datasets is unavailable. Empirical results show that the framework is effective, achieving 95.33% accuracy in the multimodal setting. While the unimodal baselines perform strongly on their own (facial-only: 86.87%; audio-only: 97.41%), the proposed modality masking approach enables the model to learn robust representations from cross-source data, demonstrating its potential for stroke detection in preliminary evaluation scenarios.
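The abstract names the two feature extractors but not their parameters. The following is a minimal sketch of how such features might be obtained, assuming the 468-point MediaPipe Face Mesh and 13 MFCCs computed with librosa at 16 kHz; the sampling rate, MFCC count, and helper names (extract_face_landmarks, extract_mfcc) are illustrative assumptions, not details taken from the paper.

```python
import librosa
import mediapipe as mp
import numpy as np

def extract_face_landmarks(rgb_frame):
    """Return a flat (468 * 3,) array of Face Mesh landmark coordinates,
    or None if no face is detected. rgb_frame: H x W x 3 uint8 RGB image."""
    with mp.solutions.face_mesh.FaceMesh(
            static_image_mode=True, max_num_faces=1) as face_mesh:
        result = face_mesh.process(rgb_frame)
    if not result.multi_face_landmarks:
        return None
    points = result.multi_face_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in points],
                    dtype=np.float32).ravel()

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Return an (n_frames, n_mfcc) MFCC matrix for one speech recording."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return mfcc.T.astype(np.float32)
```

For video input, extract_face_landmarks would be applied per frame to yield a landmark time series; constructing the FaceMesh object once per call is kept here only for self-containment.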

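The epoch-wise stochastic modality masking is described only at the level of its three batch categories. Below is a minimal PyTorch sketch of one way to realize it, assuming zero-masking of the disabled branch and a uniform draw over the three categories; the layer sizes and the names CNNGRU, StrokeNet, sample_masks, and train are assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class CNNGRU(nn.Module):
    """One modality branch: 1-D CNN over the feature axis, GRU over time."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, x):                     # x: (batch, time, in_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        _, h_n = self.gru(h)                  # h_n: (1, batch, hidden)
        return h_n.squeeze(0)                 # (batch, hidden)

class StrokeNet(nn.Module):
    """Two CNN-GRU branches fused by concatenation before a linear head."""
    def __init__(self, face_dim, audio_dim, hidden=64, n_classes=2):
        super().__init__()
        self.face_branch = CNNGRU(face_dim, hidden)
        self.audio_branch = CNNGRU(audio_dim, hidden)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, face, audio, face_mask, audio_mask):
        # A masked-out modality contributes a zero vector, so the head must
        # stay accurate under unimodal as well as multimodal input.
        f = self.face_branch(face) * face_mask.unsqueeze(1)
        a = self.audio_branch(audio) * audio_mask.unsqueeze(1)
        return self.head(torch.cat([f, a], dim=1))

def sample_masks(batch_size, device):
    """Draw one of three regimes per sample: 0 = both modalities,
    1 = face only, 2 = audio only (uniform over the three categories)."""
    regime = torch.randint(0, 3, (batch_size,), device=device)
    face_mask = (regime != 2).float()         # kept in regimes 0 and 1
    audio_mask = (regime != 1).float()        # kept in regimes 0 and 2
    return face_mask, audio_mask

def train(model, loader, epochs, lr=1e-3):
    """Masks are redrawn for every batch of every epoch, so each sample
    cycles through unimodal and multimodal presentations over training."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for face, audio, label in loader:     # label-paired cross-source data
            fm, am = sample_masks(face.size(0), face.device)
            opt.zero_grad()
            loss = loss_fn(model(face, audio, fm, am), label)
            loss.backward()
            opt.step()
```

Because the face and audio corpora lack patient-level correspondence, the "both modalities" regime would have to pair samples of the same class label across sources; the sketch assumes the loader already yields such label-matched pairings.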