<strong>Paper Title</strong><br>

Design and Evaluation of Lightweight and Multi-Feature Models for Emotion Recognition from Speech<br>

<br>


<strong>Abstract</strong><br>

Accurate speech emotion recognition (SER) depends both on which acoustic cues are captured and on how a network combines them. This study benchmarks four complementary configurations on the IEMOCAP, RAVDESS, MELD, and SAVEE corpora to identify the most effective design choices. The first configuration combines a broad set of descriptors (MFCCs, RMS energy, zero-crossing rate (ZCR), chroma, spectral contrast, tonnetz, and HuBERT embeddings), each handled by a dedicated encoder. The encoder outputs are merged by a multi-dimensional, multi-scale convolution block and refined with layered cross-attention; this configuration produces the strongest scores. The second pipeline uses only two streams: a compact low-level feature vector processed by an MLSTM-FCN and a log-Mel image analyzed by a squeeze-and-excitation CNN. The two latent representations are combined in a fusion layer that is fine-tuned with sparse learning, which retrains only under-performing neurons and freezes the rest. The third, edge-oriented design uses an Inception-style CNN to capture patterns at multiple resolutions, a dual cepstral-temporal attention module to highlight salient regions, and a lightweight GRU to track temporal dynamics, delivering competitive accuracy with just 0.82 M parameters. The fourth configuration is a conventional VGG-like CNN fed with MFCC spectrograms, which serves as a straightforward baseline. Across extensive trials, the multi-feature fusion model of the first configuration leads (97%, 92%, 91%, and 84% on the four datasets), closely followed by the sparse-learning variant, while the lightweight and baseline networks trail by one to two percentage points. The results indicate that combining complementary representations with targeted attention mechanisms yields greater gains than simply deepening or widening networks. These insights can guide the design of robust, data-efficient SER systems for affect-aware assistants, call-center analytics, and healthcare monitoring.

Keywords: speech emotion recognition, Mel spectrogram, deep learning, MFCC, CNN, MCA-TCA
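
The sparse fine-tuning idea mentioned in the abstract (retrain only under-performing neurons, freeze the rest) can be illustrated with a minimal, framework-free sketch. Everything specific here is an assumption made for illustration: the abstract does not say how "under-performing" is measured, so the per-neuron `scores`, the `fraction` of neurons selected, the helper names `select_trainable` and `masked_sgd_step`, and the use of plain SGD are all hypothetical and not the authors' procedure.

```python
from typing import List


def select_trainable(scores: List[float], fraction: float) -> List[bool]:
    """Flag the lowest-scoring `fraction` of neurons as trainable.

    Assumption: a lower score means a more under-performing neuron.
    """
    k = max(1, round(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    chosen = set(order[:k])
    return [i in chosen for i in range(len(scores))]


def masked_sgd_step(
    weights: List[List[float]],
    grads: List[List[float]],
    trainable: List[bool],
    lr: float,
) -> List[List[float]]:
    """Apply one SGD step to trainable rows; copy frozen rows unchanged."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if train else list(w_row)
        for w_row, g_row, train in zip(weights, grads, trainable)
    ]


# Toy fusion layer: 4 neurons (rows), 2 incoming weights each.
scores = [0.9, 0.1, 0.5, 0.05]          # hypothetical per-neuron quality scores
weights = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
grads = [[0.5, 0.5] for _ in weights]   # hypothetical gradients

mask = select_trainable(scores, fraction=0.5)
updated = masked_sgd_step(weights, grads, mask, lr=0.5)
```

In this toy run, only the two lowest-scoring neurons (rows 1 and 3) receive an update, while rows 0 and 2 keep their original weights. In a deep learning framework the same effect is usually obtained by masking gradients before the optimizer step.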