TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG

Annisaa Fitri Nurfirdausi, Eleonora Mancini, Paolo Torroni
DISI, University of Bologna, Italy
[TBD]

*Indicates Equal Contribution

Abstract

Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modeling strategies across EEG, speech, and text. We evaluate handcrafted features versus pretrained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyze fusion strategies with attention to the complementary role of EEG. Consistent subject-independent splits ensure reproducible benchmarking. Our results show that incorporating EEG enhances multimodal detection, pretrained embeddings outperform handcrafted features, and carefully designed trimodal models achieve state-of-the-art performance. Our work serves as a robust benchmark for future research in multimodal depression detection.

Baseline Works

In this study, we re-implemented two multimodal baselines from scratch for comparison, both combining EEG and audio signals on the MODMA dataset. Yousufi et al. (2024) used DenseNet-121 as a feature extractor over EEG and audio spectrograms, while Qayyum et al. (2023) applied Vision Transformer (ViT) models to the same spectrogram inputs.
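To make the feature-extraction stage concrete, below is a minimal sketch of wrapping torchvision's DenseNet-121 as a 2D feature extractor over spectrogram images. The input resolution, pooling, and pretrained weights are our assumptions, not necessarily the exact baseline configuration.

```python
# Minimal sketch: DenseNet-121 as a spectrogram feature extractor (assumed
# configuration; the original baseline may differ in input size and pooling).
import torch
import torch.nn as nn
from torchvision import models

class DenseNetFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
        self.features = backbone.features    # convolutional trunk, classifier dropped
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling

    def forward(self, x):                    # x: (B, 3, H, W) spectrogram images
        h = self.features(x)                 # (B, 1024, h', w') feature maps
        return self.pool(h).flatten(1)       # (B, 1024) embeddings

extractor = DenseNetFeatureExtractor().eval()
dummy = torch.randn(2, 3, 224, 224)          # two RGB spectrogram images
print(extractor(dummy).shape)                # torch.Size([2, 1024])
```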

Data Preparation for Baseline Works

EEG preprocessing follows Yousufi et al. (2024), including a 0.4–45 Hz FIR bandpass filter, a 50 Hz notch filter, and average referencing. We selected 29 channels (FP2, FP1, Fz, F7, F3, F4, F8, FT7, FC3, FCz, FC4, FT8, C3, C4, T3, CP3, CPz, CP4, T4, TP7, P3, Pz, P4, TP8, T5, T6, O1, Oz, O2) for analysis. Since Qayyum et al. (2023) provide limited details, we applied the same preprocessing pipeline to both baselines for consistency. For audio, both studies generate spectrograms directly from the original 44 kHz sampling rate, so we assume that no additional preprocessing was performed on the speech signals prior to spectrogram extraction.
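As a sketch, the pipeline above can be expressed with MNE-Python as follows; the file path, recording format, and exact channel naming are hypothetical and may differ from the MODMA release.

```python
# Sketch of the baseline EEG preprocessing (assumed MNE-Python workflow).
import mne

CHANNELS = ["FP2", "FP1", "Fz", "F7", "F3", "F4", "F8", "FT7", "FC3", "FCz",
            "FC4", "FT8", "C3", "C4", "T3", "CP3", "CPz", "CP4", "T4", "TP7",
            "P3", "Pz", "P4", "TP8", "T5", "T6", "O1", "Oz", "O2"]

raw = mne.io.read_raw_edf("subject01.edf", preload=True)  # hypothetical file
raw.pick(CHANNELS)                                 # keep the 29 selected channels
raw.filter(l_freq=0.4, h_freq=45.0, method="fir")  # 0.4-45 Hz FIR bandpass
raw.notch_filter(freqs=50.0)                       # suppress 50 Hz mains interference
raw.set_eeg_reference("average")                   # average referencing
```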

EEG and Speech Spectrogram Generation

Following Yousufi et al. (2024) and Qayyum et al. (2023), we generate STFT spectrograms for EEG and mel-spectrograms for audio using librosa. We apply n_fft=1024, hop_length=512, and n_mels=64 (for audio). The resulting spectrograms are saved as .png images for use with 2D models such as DenseNet-121 and ViT.
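A sketch of this step with librosa and matplotlib follows; the rendering details (figure size, removing the axes) are our assumptions, since neither baseline paper specifies them.

```python
# Sketch: render STFT / mel-spectrograms to .png images (assumed rendering).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def save_spectrogram(y, sr, out_png, mel=False):
    if mel:  # mel-spectrogram for the speech signal
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                           hop_length=512, n_mels=64)
        S_db = librosa.power_to_db(S, ref=np.max)
    else:    # STFT magnitude spectrogram for a single EEG channel
        S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))
        S_db = librosa.amplitude_to_db(S, ref=np.max)
    plt.figure(figsize=(4, 4))
    librosa.display.specshow(S_db, sr=sr, hop_length=512)
    plt.axis("off")                          # image only, no axes or colorbar
    plt.savefig(out_png, bbox_inches="tight", pad_inches=0)
    plt.close()

y, sr = librosa.load("speech.wav", sr=None)  # hypothetical file; keep native 44 kHz
save_spectrogram(y, sr, "speech_mel.png", mel=True)
```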

Data Splitting

We use stratified 5-fold cross-validation with subject-level splitting to prevent data leakage. In each fold, 10% of the training data is set aside for validation (using a fixed random seed), resulting in train, validation, and test splits that preserve class balance.
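A sketch of this scheme with scikit-learn, splitting over subject IDs (not segments) so that no subject appears in more than one split; the subject and label arrays are illustrative.

```python
# Sketch: subject-level stratified 5-fold splits with a held-out validation set.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

subjects = np.arange(50)                    # illustrative subject IDs
labels = np.array([0] * 25 + [1] * 25)      # one diagnosis label per subject

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(subjects, labels)):
    # Hold out 10% of the training subjects for validation, stratified by
    # label and with a fixed seed, as described above.
    train_subj, val_subj = train_test_split(
        subjects[train_idx], test_size=0.10,
        stratify=labels[train_idx], random_state=42)
    test_subj = subjects[test_idx]
    # Every EEG/speech/text segment then inherits the split of its subject,
    # so no subject leaks across train, validation, and test.
```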

Methodology

EEG Preprocessing

EEG preprocessing follows two pipelines depending on the feature extraction method.

Pipeline 1: We select 29 depression-related EEG channels [FP2, FP1, Fz, F7, F3, F4, F8, FT7, FC3, FCz, FC4, FT8, C3, C4, T3, CP3, CPz, CP4, T4, TP7, P3, Pz, P4, TP8, T5, T6, O1, Oz, O2]. Signals are bandpass filtered (0.5–50 Hz), notch filtered at 50 Hz, re-referenced to the average electrode, and segmented into 10-second windows.
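Given a filtered, re-referenced Raw object (as in the MNE sketch above), the 10-second segmentation can be done with MNE's fixed-length epoching; this is a sketch rather than the exact implementation.

```python
# Sketch: cut a preprocessed recording into non-overlapping 10-second windows.
# `raw` is the filtered, re-referenced Raw object from the earlier sketch.
import mne

epochs = mne.make_fixed_length_epochs(raw, duration=10.0, preload=True)
X = epochs.get_data()  # (n_windows, 29 channels, 10 s * sampling rate)
```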

Pipeline 2: We replicate the preprocessing steps of the CBraMod model for the MUMTAZ depression dataset. Signals are resampled to 200 Hz, bandpass filtered between 0.3–75 Hz, and notch filtered at 50 Hz. We select 19 channels [FP2, FP1, F7, F3, F4, F8, FCz, C3, C4, T3, CPz, T4, P3, Pz, P4, T5, T6, O1, O2]. The signals are segmented into 5-second windows (1000 samples at 200 Hz), and each window is further split into five non-overlapping patches of 200 samples each.
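The windowing and patching reduce to a pair of array reshapes; the sketch below assumes a toy 60-second, 19-channel recording already resampled to 200 Hz.

```python
# Sketch: CBraMod-style segmentation into windows and per-channel patches.
import numpy as np

FS, WIN_SEC, PATCH = 200, 5, 200     # sampling rate, window length (s), patch size
eeg = np.random.randn(19, 60 * FS)   # toy (channels, samples) recording

win = WIN_SEC * FS                   # 1000 samples per 5-second window
n_win = eeg.shape[1] // win
windows = eeg[:, :n_win * win].reshape(19, n_win, win).transpose(1, 0, 2)
patches = windows.reshape(n_win, 19, win // PATCH, PATCH)
print(patches.shape)                 # (12, 19, 5, 200): 5 patches per window
```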

Experiments and Results

Data Splitting

We perform all experiments at the subject level using stratified 5-fold cross-validation to preserve class balance. The same subject-level splits are used across all experiments, ensuring reproducibility and fair comparability; results are reported as the mean and standard deviation across folds.

Subject split configurations

Baseline Results

Below, we present the results of the baseline models, along with the improvements obtained after hyperparameter tuning and the addition of the text modality.

Hyperparameter details for baseline works
Classification performance of Vision Transformer for different pooling strategies
Classification performance of DenseNet-121 for different pooling strategies
Performance comparison of ViT before and after manual tuning
Performance comparison of DenseNet-121 before and after manual tuning
Performance comparison of baseline works with and without the text modality

Preliminary Studies

The framework supports various combinations of feature types, deep encoders, and late-fusion strategies. To structure the investigation and ensure feasibility, we first performed unimodal experiments for each modality (EEG, speech, and text) to identify the most effective model for each stream. These best-performing unimodal models were then integrated into the final multimodal pipeline. The results of the preliminary studies are shown below; a sketch of the late-fusion setup follows the tables.

Performance Results on EEG Modality
Performance Results on Speech Modality
Performance Results on Text Modality
Performance Results on Multimodal Fusion Strategies
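For reference, here is a minimal late-fusion sketch in PyTorch: the unimodal embeddings are concatenated and passed through a small classification head. The embedding dimensions and the MLP head are illustrative assumptions, not the tuned configuration reported above.

```python
# Sketch: late fusion of EEG, speech, and text embeddings (assumed dimensions).
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, dims=(128, 128, 128), n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(sum(dims), 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, n_classes))

    def forward(self, z_eeg, z_speech, z_text):
        # Concatenate the per-modality embeddings, then classify jointly.
        return self.head(torch.cat([z_eeg, z_speech, z_text], dim=-1))

model = LateFusionClassifier()
z = [torch.randn(4, 128) for _ in range(3)]  # batch of 4 embeddings per modality
logits = model(*z)                           # (4, 2) class logits
```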

Proposed Model

Based on our extensive experiments, we identified the best-performing models; their overall architectures are illustrated below:

Illustration of the best-performing architectures observed
Best Architecture.