Author(s)

Prof. Japan M. Mavani, Prof. Shwetaba B. Chauhan

  • Manuscript ID: 140010
  • Volume: 1
  • Issue: 1
  • Pages: 121–148

Subject Area: Computer Science

DOI: https://doi.org/10.64643/JATIRV1I1-140010-001

Abstract

Parkinson’s disease (PD) causes characteristic changes in a person’s voice, such as reduced loudness, monotone pitch, and irregular speech patterns. This paper presents an original deep learning framework that automatically detects PD from voice recordings by leveraging spatial audio cues in time-frequency representations of speech. A dataset of voice samples from PD patients and healthy controls was assembled, including sustained vowels, spoken numbers, words, and short sentences. Mel-frequency cepstral coefficients (MFCCs) and other acoustic features (pitch, jitter, shimmer, harmonicity) were extracted to capture subtle dysphonic markers of PD. These features were used to train and evaluate several models, including conventional classifiers and novel deep neural networks. Our proposed architecture combines a Convolutional Neural Network (CNN), which learns local spatial patterns in spectrograms, with a Long Short-Term Memory (LSTM) network, which captures temporal dynamics in speech. Experimental results using 5-fold cross-validation show that the deep learning model achieves an accuracy of approximately 94%, with precision, recall, and F1-score in the 92–95% range and an area under the ROC curve (AUC) above 0.95. It outperforms baseline machine learning methods (e.g., support vector machines) in distinguishing PD from non-PD voices. We also provide an error analysis and compare model variants (CNN alone, LSTM alone, CNN-LSTM, and transformer-based models). The findings indicate that spatial audio features derived from voice, when analyzed with deep learning, offer a promising, non-invasive tool for early PD detection. This approach could enable convenient screening and monitoring of PD progression through vocal biomarkers, complementing clinical assessments and improving personalized care.
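
As an illustration of the evaluation protocol named in the abstract, the following minimal Python sketch shows how 5-fold splits and the reported metrics (accuracy, precision, recall, F1-score, AUC) could be computed. The function names (k_fold_indices, binary_metrics, roc_auc) and the label convention (1 = PD, 0 = healthy control) are assumptions made for this example. They are not taken from the paper's implementation, and the sketch does not reproduce the paper's data or results.

```python
from typing import Dict, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous (train, test) index folds.

    Hypothetical helper: plain, unshuffled folds with no stratification.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if i < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [j for j in range(n_samples) if j < start or j >= stop]
        folds.append((train, test))
        start = stop
    return folds


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels 1 (PD) and 0 (control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (PD, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a cross-validated setting, these metrics would typically be computed on each test fold and then averaged. The abstract does not state whether folds were stratified or split by speaker, so this sketch makes no attempt to do either.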

Keywords
Parkinson’s disease; voice analysis; dysphonia; spatial audio; deep learning; MFCC; CNN; LSTM; biomedical signal processing