Open Access System for Information Sharing


Thesis

Efficient Architectures for Voice Activity Detection and Keyword Spotting on Mobile Devices
Pohang University of Science and Technology

Title
Efficient Architectures for Voice Activity Detection and Keyword Spotting on Mobile Devices
Authors
채화병
Date Issued
2024
Abstract
By using Voice Activity Detection (VAD) and Keyword Spotting (KWS) as a preprocessing step, hardware-efficient implementations become possible for speech applications that need to run continuously in severely resource-constrained environments. This thesis proposes efficient architectures that enable VAD and KWS to operate in a mobile environment with low memory usage and minimal energy consumption. The thesis addresses the VAD and KWS domains separately; the following summarizes the effectiveness of the proposed methods in each domain.

For the VAD domain, this thesis proposes TinyVAD, a new convolutional neural network (CNN) model that executes extremely efficiently with a small memory footprint. TinyVAD uses an input pixel matrix partitioning method, termed patchify, to downscale the resolution of the input spectrogram. The hidden layers use a sequence of specialized convolutional structures with bypass links, referred to as CSPTiny layers. The proposed model is evaluated and compared with previous VAD methods on a diverse set of noisy environmental datasets. TinyVAD executes 3.13 times faster, requires only 12.5% as many multiply-accumulate operations (MACs), and uses only 13.0% as many parameters as the previous state of the art.

For the KWS domain, this thesis introduces a novel architecture that strategically employs VAD to classify input frames as speech or non-speech, triggering KWS only when speech is detected. By reusing the features already computed by VAD, the KWS model can be implemented with significantly reduced computational overhead. The proposed model is compared with recurrent neural network (RNN) approaches based on Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models. On both standard clean and noisy datasets, the comparisons show a 51% to 58% reduction in the number of training parameters and a 46% reduction in processing time, together with a significant increase in KWS accuracy on clean and noisy datasets spanning heavy to light noise levels. The increases in model size and processing time from using both VAD and KWS, as opposed to KWS only, were found to be 23% and 58%, respectively, with significant improvements in KWS accuracy in all noisy environments and a 0.94% reduction in KWS accuracy in a clean environment.

In summary, the proposed methods in each domain demonstrate higher accuracy, lower memory usage, and faster processing times compared to existing state-of-the-art models. They are expected to facilitate the operation of complex speech applications such as Automatic Speech Recognition (ASR) in real mobile environments, imposing minimal hardware load and conserving the energy of mobile devices.
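The abstract describes two architectural ideas: patchify downscaling of the input spectrogram feeding a small CNN with bypass links (TinyVAD), and a KWS stage that reuses the VAD features and runs only when speech is detected. The PyTorch sketch below is a hypothetical illustration of those ideas, not the thesis implementation: PixelUnshuffle stands in for the patchify step, a single residual convolution block stands in for the CSPTiny layers, and plain linear heads stand in for the VAD and KWS classifiers; the class names (TinyVADSketch, VADGatedKWS) and all layer sizes are illustrative assumptions.

```python
# Hypothetical sketch of a patchify front end, a tiny CNN VAD with a bypass link,
# and a KWS head that reuses the VAD features and runs only on detected speech.
import torch
import torch.nn as nn


class Patchify(nn.Module):
    """Partition the spectrogram into p x p pixel patches and fold each patch
    into channels: (N, 1, H, W) -> (N, p*p, H/p, W/p)."""
    def __init__(self, patch_size: int = 4):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.unshuffle(x)


class TinyVADSketch(nn.Module):
    """Stand-in for TinyVAD: patchify, a conv block with a bypass (residual)
    link, and a speech/non-speech head. Sizes are illustrative only."""
    def __init__(self, patch_size: int = 4, channels: int = 16):
        super().__init__()
        self.patchify = Patchify(patch_size)
        self.stem = nn.Conv2d(patch_size * patch_size, channels, 3, padding=1)
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.vad_head = nn.Linear(channels, 2)  # speech / non-speech

    def forward(self, spec: torch.Tensor):
        x = self.stem(self.patchify(spec))
        feats = x + self.block(x)               # bypass link around the conv block
        feats = self.pool(feats).flatten(1)
        return self.vad_head(feats), feats      # expose features for reuse by KWS


class VADGatedKWS(nn.Module):
    """Trigger KWS only when the VAD detects speech, reusing the VAD features."""
    def __init__(self, num_keywords: int = 10, channels: int = 16):
        super().__init__()
        self.vad = TinyVADSketch(channels=channels)
        self.kws_head = nn.Linear(channels, num_keywords)

    def forward(self, spec: torch.Tensor):
        vad_logits, feats = self.vad(spec)
        if vad_logits.argmax(dim=-1).any():     # speech detected
            return vad_logits, self.kws_head(feats)
        return vad_logits, None                 # skip KWS work on non-speech input


# Example: one 1-channel log-mel spectrogram (64 mel bins x 64 frames).
model = VADGatedKWS()
vad_out, kws_out = model(torch.randn(1, 1, 64, 64))
```

The gating reflects the design choice summarized in the abstract: the comparatively expensive KWS head is skipped entirely on non-speech frames, and when it does run it operates on features the VAD has already computed rather than re-extracting them.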
URI
http://postech.dcollection.net/common/orgView/200000732625
https://oasis.postech.ac.kr/handle/2014.oak/123438
Article Type
Thesis
Files in This Item:
There are no files associated with this item.

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
