Voice activity detection (VAD) is the prevailing approach to separating meaningful speech from the pervasive noise of physical environments. Deep neural networks (DNNs) are now widely used as the classifier component in VAD systems. However, conventional DNNs, such as fully connected (FC) networks, suffer from excessive computational complexity, which reduces computing efficiency, wastes hardware resources, and increases power consumption. To address this inefficiency, this study introduces a novel neural network architecture named DeltaFC. The architecture achieves a latency of less than 1 ms for each 30 ms voice segment, a 54% reduction in latency compared with the baseline FC model. In software, this study compresses and encodes time-series information using the Delta algorithm in order to introduce temporal sparsity. In the software results, the network surpasses both the baseline FC and LSTM models in area under the curve (AUC) and accuracy at a lightweight parameter scale. In hardware, this study ports the software design to an FPGA RTL implementation, producing a lightweight digital IP core. This IP core accelerates neural network operations in hardware by deploying the Delta and compressed sparse row (CSR) algorithms. Compared with the same design without temporal sparsity, computing efficiency increases by approximately 85% at the cost of a 0.5% loss in accuracy.
These results indicate that, among lightweight neural networks with fewer than 30,000 parameters, the proposed DeltaFC network is better suited to VAD than FC, LSTM, and other baseline network architectures.
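
To make the idea of temporal sparsity concrete, the following is a minimal sketch of delta encoding applied to one fully connected layer. It is not the DeltaFC implementation described in this study: the function names `delta_encode` and `delta_fc_step`, the threshold value, the weights, and the input frames are all hypothetical, and the bias and activation function are omitted. The sketch assumes that an input element is transmitted only when it differs from its last transmitted value by at least a threshold, so that the layer updates its pre-activation using only the columns of the weight matrix that correspond to changed inputs.

```python
from typing import List, Tuple

Update = Tuple[int, float]  # (input index, change in value)


def delta_encode(frame: List[float], reference: List[float],
                 threshold: float) -> Tuple[List[Update], List[float]]:
    """Keep only the elements that changed by at least `threshold`.

    Returns the sparse list of (index, delta) pairs and the updated
    reference, i.e. the last value transmitted for each input element.
    """
    updates: List[Update] = []
    new_reference = list(reference)
    for i, (x, r) in enumerate(zip(frame, reference)):
        d = x - r
        if abs(d) >= threshold:
            updates.append((i, d))
            new_reference[i] = x  # the reference moves only on transmission
    return updates, new_reference


def delta_fc_step(weights: List[List[float]], pre_activation: List[float],
                  updates: List[Update]) -> List[float]:
    """Accumulate y += W[:, i] * d for each transmitted delta.

    The work is proportional to the number of updates, not to the
    full input width, which is where the saving comes from.
    """
    out = list(pre_activation)
    for i, d in updates:
        for j in range(len(out)):
            out[j] += weights[j][i] * d
    return out


# Hypothetical 2x3 weight matrix and two consecutive input frames.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]
THRESHOLD = 0.5  # assumed value, chosen only for illustration

reference = [0.0, 0.0, 0.0]
y = [0.0, 0.0]

updates1, reference = delta_encode([1.0, 2.0, 0.0], reference, THRESHOLD)
y1 = delta_fc_step(W, y, updates1)

updates2, reference = delta_encode([1.1, 2.0, 3.0], reference, THRESHOLD)
y2 = delta_fc_step(W, y1, updates2)
```

In the second frame only the third input crosses the threshold, so a single weight column is touched. The result `y2` is `[7.0, 5.0]`, whereas a dense product over the full frame would give approximately `[7.1, 5.0]`; the small discrepancy illustrates the kind of accuracy loss that is traded for fewer operations. The `(index, value)` pairs used here play a role loosely similar to the compressed storage that CSR provides, but this sketch does not implement CSR.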