Towards Efficient Deep Learning Based Siren Detection
Abstract
This thesis presents the development and evaluation of a real-time, neural-network-based audio classification system, built on an NXP hardware board, that distinguishes emergency response vehicles from other vehicles by their sirens. At the core of the system is a deep learning model that classifies audio captured by a microphone according to whether a siren is present. The audio is first transformed into mel-spectrograms, which represent the frequency spectrum over time using a fixed analysis window. The neural network runs inference on these mel-spectrogram features, and a post-processing stage refines its output to detect sirens accurately.
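The mel-spectrogram feature extraction described above can be illustrated with a minimal, dependency-free sketch. It is not the thesis implementation: the function names, the HTK-style mel formula, the Hann window, and the parameter values (`n_fft=256`, `hop=128`, `n_mels=16`) are all assumptions chosen for illustration, and the naive DFT is used only to keep the example self-contained.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style mel scale (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    # Naive DFT of a Hann-windowed frame; returns power for bins 0..N/2.
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        spec.append(abs(acc) ** 2)
    return spec


def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose edges are equally spaced on the mel scale.
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1))
                for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_spectrogram(signal, sample_rate, n_fft=256, hop=128, n_mels=16):
    # Slide a fixed-size window over the signal and project each frame's
    # power spectrum onto the mel filterbank; values are returned in dB.
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        power = power_spectrum(signal[start:start + n_fft])
        energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
        frames.append([10.0 * math.log10(e + 1e-10) for e in energies])
    return frames  # indexed as [frame][mel_band]
```

Each inner list is one time step of the feature matrix a classifier would consume. In practice an FFT-based library routine would replace the naive DFT, which is far too slow for real-time use on an embedded target.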
Deploying the system involves several steps. First, the model is trained on diverse audio data containing both siren and non-siren sounds. At run time, incoming audio is converted to mel-spectrograms, classified by the neural network, and post-processed to improve detection accuracy. The system is tested in real-world scenarios, where it shows a turnaround time of less than 3 seconds even under high noise conditions. Various trade-offs are evaluated to improve efficiency, reduce memory footprint, and minimize latency against the requirements for model size, latency, and compute cycles. The custom dataset comprises 280 hours of audio and draws on well-known, publicly available datasets such as ESC-50, AudioSet, and UrbanSound. It is enriched with original and augmented siren sounds and with non-siren audio to improve the model’s learning efficacy and robustness. The system achieves a test accuracy of 96.19% in identifying sirens and remains suitable for real-world deployment at SNRs as low as -30 dB. Although the system meets most requirements for model size, latency, and compute cycles, its false positive rate needs improvement, which could be achieved by expanding the dataset and retraining the model.
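To put the -30 dB figure in perspective: an SNR of -30 dB means the noise power is 1000 times the siren power. The sketch below shows one common way to build such a test or augmentation sample by scaling a noise recording to a target SNR before mixing. The function name `mix_at_snr` and the mean-power definition of SNR are assumptions for illustration; the abstract does not state how SNR was measured or how noisy samples were produced.

```python
import math


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power matches snr_db, then mix.

    Hypothetical helper; assumes equal-length lists and mean-power SNR.
    """
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    # snr_db = 10 * log10(p_signal / (gain**2 * p_noise)), solved for gain.
    gain = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    mixed = [s + gain * n for s, n in zip(signal, noise)]
    return mixed, gain
```

For inputs of equal power, a target of -30 dB gives a gain of about 31.6 on the noise amplitude, which corresponds to the factor of 1000 in power.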