Multimodal fusion of visual and auditory biomarkers: An epoch-wise stochastic modality masking framework for stroke detection using a CNN–GRU network

dc.contributor.author: Alif S.S.
dc.contributor.author: Arafat M.Y.
dc.contributor.author: Nibir M.M.I.
dc.contributor.author: Tusti F.M.
dc.contributor.author: Zereen A.N.
dc.contributor.correspondence: Alif S.S.
dc.contributor.other: Mahidol University
dc.date.accessioned: 2026-04-11T18:24:55Z
dc.date.available: 2026-04-11T18:24:55Z
dc.date.issued: 2026-07-01
dc.description.abstract: The automated identification of neurovascular impairment, particularly ischemic or hemorrhagic stroke, remains challenging because symptoms such as facial asymmetry and dysarthria are inherently multimodal. This study introduces a multimodal diagnostic framework that integrates visual and auditory streams for accurate and interpretable stroke detection, with a focus on efficient use of the available data. On the visual side, facial landmarks derived from MediaPipe's Face Mesh are used to identify unilateral muscular deficiencies. On the auditory side, speech recordings are transformed into Mel-Frequency Cepstral Coefficients (MFCCs). Notably, the facial and audio data come from two separate sources and lack patient-level correspondence. A primary innovation is an epoch-wise stochastic modality masking approach applied within the training loop of a CNN–GRU architecture. In each epoch, a new batch of training data is created by dynamically combining samples from three categories: complete multimodal input (both facial landmarks and audio MFCCs), facial landmarks only, and audio MFCCs only. This compels the model to learn both unimodal and cross-modal representations, demonstrating that multiple modalities can be leveraged even when true multimodal correspondence across datasets is unavailable. Empirically, the framework achieves 95.33% accuracy in the multimodal setting. While the unimodal baselines (facial-only: 86.87%; audio-only: 97.41%) perform strongly on their own, the proposed modality masking approach enables the model to learn robust representations from cross-source data, demonstrating its potential for stroke detection in preliminary evaluation scenarios.
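The epoch-wise stochastic modality masking described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: for each sample in an epoch, one of three modes is drawn at random (both modalities, facial only, audio only), and the features of any masked modality are zeroed before being passed to the downstream CNN–GRU. The function name, mode encoding, and use of zero-filling for the absent modality are all assumptions for illustration.

```python
import numpy as np

def stochastic_modality_mask(facial_batch, audio_batch, rng):
    """Apply per-sample stochastic modality masking for one epoch.

    Modes (assumed encoding): 0 = both modalities kept,
    1 = facial only (audio zeroed), 2 = audio only (facial zeroed).
    Zeroed features stand in for the missing modality, letting a
    single model train on unimodal and pseudo-multimodal samples.
    """
    n = facial_batch.shape[0]
    modes = rng.integers(0, 3, size=n)      # one random mode per sample
    f = facial_batch.copy()
    a = audio_batch.copy()
    f[modes == 2] = 0.0                     # audio-only: drop facial landmarks
    a[modes == 1] = 0.0                     # facial-only: drop MFCC features
    return f, a, modes

# Example: 8 samples, 10 landmark features, 20 MFCC features.
rng = np.random.default_rng(0)
facial = np.ones((8, 10))
audio = np.ones((8, 20))
f_masked, a_masked, modes = stochastic_modality_mask(facial, audio, rng)
```

Because the facial and audio datasets lack patient-level correspondence, a pairing step (e.g. matching samples by class label) would be needed upstream to form the "both modalities" inputs; that step is omitted here.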
dc.identifier.citation: Array Vol.30 (2026)
dc.identifier.doi: 10.1016/j.array.2026.100787
dc.identifier.eissn: 2590-0056
dc.identifier.scopus: 2-s2.0-105034593088
dc.identifier.uri: https://repository.li.mahidol.ac.th/handle/123456789/116135
dc.rights.holder: SCOPUS
dc.subject: Computer Science
dc.title: Multimodal fusion of visual and auditory biomarkers: An epoch-wise stochastic modality masking framework for stroke detection using a CNN–GRU network
dc.type: Article
mu.datasource.scopus: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105034593088&origin=inward
oaire.citation.title: Array
oaire.citation.volume: 30
oairecerif.author.affiliation: Mahidol University
oairecerif.author.affiliation: North South University
oairecerif.author.affiliation: BRAC University
