Multimodal fusion of visual and auditory biomarkers: An epoch-wise stochastic modality masking framework for stroke detection using a CNN–GRU network
Issued Date
2026-07-01
Resource Type
Article
eISSN
2590-0056
Scopus ID
2-s2.0-105034593088
Journal Title
Array
Volume
30
Rights Holder(s)
SCOPUS
Bibliographic Citation
Array Vol.30 (2026)
Suggested Citation
Alif S.S., Arafat M.Y., Nibir M.M.I., Tusti F.M., Zereen A.N. Multimodal fusion of visual and auditory biomarkers: An epoch-wise stochastic modality masking framework for stroke detection using a CNN–GRU network. Array Vol.30 (2026). doi:10.1016/j.array.2026.100787 Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/116135
Title
Multimodal fusion of visual and auditory biomarkers: An epoch-wise stochastic modality masking framework for stroke detection using a CNN–GRU network
Author(s)
Alif S.S.; Arafat M.Y.; Nibir M.M.I.; Tusti F.M.; Zereen A.N.
Author's Affiliation
Corresponding Author(s)
Other Contributor(s)
Abstract
The automated identification of neurovascular impairment, particularly ischemic or hemorrhagic stroke, remains challenging due to the multimodal nature of symptoms such as facial asymmetry and dysarthria. This study introduces a multimodal diagnostic framework that integrates visual and auditory streams for accurate and interpretable stroke detection, with a focus on efficient utilization of available data. On the visual side, facial landmarks are derived from MediaPipe's Face Mesh to identify unilateral muscular deficiencies. On the auditory side, speech recordings are transformed into Mel-Frequency Cepstral Coefficients (MFCCs). Notably, the facial and audio data come from two separate sources and lack patient-level correspondence. A primary innovation is the incorporation of an epoch-wise stochastic modality masking approach within the training loop, built on a CNN–GRU architecture. In each epoch, a new training batch is composed by dynamically combining samples from three categories: complete multimodal input (both facial landmarks and audio MFCCs), facial landmarks only, and audio MFCCs only. This compels the model to learn both unimodal and cross-modal representations, demonstrating that multiple modalities can be leveraged even when true multimodal correspondence across datasets is unavailable. Empirical results show that the framework is effective, achieving 95.33% accuracy in the multimodal setting. While the unimodal baselines perform strongly on their own (facial-only: 86.87%; audio-only: 97.41%), the proposed modality masking approach enables the model to learn robust representations from cross-source data, demonstrating its potential for stroke detection in preliminary evaluation scenarios.
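The abstract names the two feature extractors but not their parameters. The following is a minimal sketch of how such features might be obtained, assuming the 468-point MediaPipe Face Mesh and 13 MFCCs computed with librosa at 16 kHz; the sampling rate, MFCC count, and helper names (extract_face_landmarks, extract_mfcc) are illustrative assumptions, not details taken from the paper.

```python
import librosa
import mediapipe as mp
import numpy as np

def extract_face_landmarks(rgb_frame):
    """Return a flat (468 * 3,) array of Face Mesh landmark coordinates,
    or None if no face is detected. rgb_frame: H x W x 3 uint8 RGB image."""
    with mp.solutions.face_mesh.FaceMesh(
            static_image_mode=True, max_num_faces=1) as face_mesh:
        result = face_mesh.process(rgb_frame)
    if not result.multi_face_landmarks:
        return None
    points = result.multi_face_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in points],
                    dtype=np.float32).ravel()

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Return an (n_frames, n_mfcc) MFCC matrix for one speech recording."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return mfcc.T.astype(np.float32)
```

For video input, extract_face_landmarks would be applied per frame to yield a landmark time series; constructing the FaceMesh object once per call is kept here only for self-containment.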

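The epoch-wise stochastic modality masking is described only at the level of its three batch categories. Below is a minimal PyTorch sketch of one way to realize it, assuming zero-masking of the disabled branch and a uniform draw over the three categories; the layer sizes and the names CNNGRU, StrokeNet, sample_masks, and train are assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class CNNGRU(nn.Module):
    """One modality branch: 1-D CNN over the feature axis, GRU over time."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, x):                     # x: (batch, time, in_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        _, h_n = self.gru(h)                  # h_n: (1, batch, hidden)
        return h_n.squeeze(0)                 # (batch, hidden)

class StrokeNet(nn.Module):
    """Two CNN-GRU branches fused by concatenation before a linear head."""
    def __init__(self, face_dim, audio_dim, hidden=64, n_classes=2):
        super().__init__()
        self.face_branch = CNNGRU(face_dim, hidden)
        self.audio_branch = CNNGRU(audio_dim, hidden)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, face, audio, face_mask, audio_mask):
        # A masked-out modality contributes a zero vector, so the head must
        # stay accurate under unimodal as well as multimodal input.
        f = self.face_branch(face) * face_mask.unsqueeze(1)
        a = self.audio_branch(audio) * audio_mask.unsqueeze(1)
        return self.head(torch.cat([f, a], dim=1))

def sample_masks(batch_size, device):
    """Draw one of three regimes per sample: 0 = both modalities,
    1 = face only, 2 = audio only (uniform over the three categories)."""
    regime = torch.randint(0, 3, (batch_size,), device=device)
    face_mask = (regime != 2).float()         # kept in regimes 0 and 1
    audio_mask = (regime != 1).float()        # kept in regimes 0 and 2
    return face_mask, audio_mask

def train(model, loader, epochs, lr=1e-3):
    """Masks are redrawn for every batch of every epoch, so each sample
    cycles through unimodal and multimodal presentations over training."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for face, audio, label in loader:     # label-paired cross-source data
            fm, am = sample_masks(face.size(0), face.device)
            opt.zero_grad()
            loss = loss_fn(model(face, audio, fm, am), label)
            loss.backward()
            opt.step()
```

Because the face and audio corpora lack patient-level correspondence, the "both modalities" regime would have to pair samples of the same class label across sources; the sketch assumes the loader already yields such label-matched pairings.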