DENG et al.: DISTRIBUTED SPEECH PROCESSING IN MiPad’S MULTIMODAL USER INTERFACE
SPLICE, the overall deviation that takes into account the whole sequence of frames and the mismatch of slopes will be reduced compared with the basic SPLICE.
We have implemented the above idea of “dynamic SPLICE” through temporally smoothing the bias vectors obtained from the basic, static SPLICE.
We have achieved significant performance gains using an efficient heuristic implementation. In our specific implementation, the filter has a low-pass characteristic, with a system transfer function of
is undone by adding the estimated noise back to the MMSE estimate.
Our research showed that the effectiveness of the above noise-normalized SPLICE is highly dependent on the accuracy of the noise estimate. We have investigated several ways of automatically estimating the nonstationary noise in the Aurora2 database. We describe below one algorithm that has given by far the highest accuracy in noise estimation and at the same time by far the best noise-robust speech recognition results.
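To make the dependence on the noise estimate concrete, the following is a minimal sketch of noise-normalized denoising for a single cepstral frame. It assumes the usual SPLICE form, in which the clean estimate is the noisy frame plus a posterior-weighted sum of per-codeword correction vectors, and a Gaussian mixture codebook with a shared spherical variance. The function names, the codebook layout, and the numbers in the usage note are illustrative assumptions, not the implementation evaluated here.

```python
import numpy as np

def posteriors(v, means, var, priors):
    """Posterior p(s | v) for a Gaussian mixture with a shared spherical variance (assumed)."""
    # Log-likelihood of v under each codeword, up to a constant shared by all codewords.
    log_p = np.log(priors) - 0.5 * np.sum((v - means) ** 2, axis=1) / var
    log_p -= log_p.max()  # stabilize the exponentials
    p = np.exp(log_p)
    return p / p.sum()

def noise_normalized_splice(y, noise_est, means, var, priors, corrections):
    """Denoise one cepstral frame y given a noise estimate for that frame."""
    y_norm = y - noise_est                         # noise normalization
    post = posteriors(y_norm, means, var, priors)  # codebook selection on normalized data
    x_norm = y_norm + post @ corrections           # MMSE estimate in the normalized domain
    return x_norm + noise_est                      # undo the normalization

# Hypothetical two-codeword codebook, for illustration only.
means = np.array([[0.0, 0.0], [4.0, 4.0]])
corrections = np.array([[-1.0, -1.0], [1.0, 1.0]])
priors = np.array([0.5, 0.5])
y = np.array([4.2, 3.9])

with_good_estimate = noise_normalized_splice(y, np.array([4.0, 4.0]), means, 1.0, priors, corrections)
with_zero_estimate = noise_normalized_splice(y, np.zeros(2), means, 1.0, priors, corrections)
```

In this toy setup the two noise estimates select different codewords, so the same noisy frame receives opposite corrections. This is one way to see why the accuracy of the noise estimate matters so much.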
E. Nonstationary Noise Estimation by Iterative Stochastic Approximation
This transfer function is the result of defining an objective function as the posterior probability of the entire sequence of the (hidden) true correction vectors given the entire sequence of the observed speech vectors. The posterior probability is
After a first-order Markov assumption is applied to the correction vector sequence, the conditional distribution becomes
Each term in the product is given by
The quantities in this term are the covariance matrix for the time differential of the correction vector, the unsmoothed SPLICE output at the current frame, given by (7), and the covariance matrix of the SPLICE output at that frame.
In , a novel algorithm is proposed, implemented, and evaluated for recursive estimation of parameters in a nonlinear model involving incomplete data. The algorithm is applied specifically to time-varying deterministic parameters of additive noise in a mildly nonlinear model that accounts for the generation of the cepstral data of noisy speech from the cepstral data of the noise and clean speech. For computer recognition of speech that is corrupted by highly nonstationary noise, different observation data segments correspond to very different noise parameter values. It is thus strongly desirable to develop recursive estimation algorithms, since they can be designed to adaptively track the changing noise parameters. One such design, based on the novel technique of iterative stochastic approximation within the recursive-EM framework, is developed and evaluated. It jointly adapts the time-varying noise parameters and the auxiliary parameters introduced to piecewise linearly approximate the nonlinear model of the acoustic environment. The accuracy of the approximation is shown to improve progressively with more iterations.
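The role of iteration in approximating a nonlinear environment model can be illustrated with a toy example. The sketch below assumes the commonly used log-spectral model y = log(exp(x) + exp(n)) for noisy speech y, clean speech x, and noise n. That model, the function names, and the numbers are assumptions made for illustration. The loop is a plain Newton-style re-linearization, not the recursive-EM algorithm described above. It only shows how re-linearizing around the latest estimate tightens the approximation.

```python
import numpy as np

def noisy_log_spectrum(x, n):
    """Assumed environment model: y = log(exp(x) + exp(n)), elementwise."""
    return np.logaddexp(x, n)

def estimate_noise(y, x, n_init, iterations):
    """Solve y = g(x, n) for n by repeatedly linearizing g around the current n."""
    n = np.array(n_init, dtype=float)
    for _ in range(iterations):
        # Slope of g with respect to n at the current expansion point.
        slope = 1.0 / (1.0 + np.exp(x - n))
        # Solve the linearized model y = g(x, n) + slope * (n_new - n) for n_new.
        n = n + (y - noisy_log_spectrum(x, n)) / slope
    return n

# Hypothetical values: clean level 2.0, true noise level 1.0, initial guess 0.0.
x_true, n_true = 2.0, 1.0
y_obs = noisy_log_spectrum(x_true, n_true)
errors = [abs(float(estimate_noise(y_obs, x_true, 0.0, k)) - n_true) for k in (1, 2, 4)]
```

Here `errors` holds the absolute estimation error after one, two, and four linearization passes. With these values it shrinks at every step.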
Optimization of the objective function in (10) gives the MAP estimate for the smoothed correction vector sequence, which takes the form of a second-order difference equation whose input is the unsmoothed correction vectors computed from the SPLICE algorithm described above. This second-order difference equation can be equivalently put in the form of (9) in the z-domain, where the constants are functions of the variances related to both the static and dynamic quantities of the correction vectors. These variances were assumed to be time invariant, leading to the two constant parameters in (9). These two parameters have been empirically adjusted.
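The link between the MAP objective and a second-order difference equation can be sketched numerically. The code below assumes scalar, time-invariant variances shared across all cepstral dimensions, and minimizes the sum of squared deviations from the unsmoothed vectors (scaled by a static variance) plus squared frame-to-frame differences (scaled by a dynamic variance). Setting the gradient to zero gives, for interior frames, -r[t-1] + (2 + ratio) r[t] - r[t+1] = ratio u[t], where ratio is the dynamic variance divided by the static variance. The name `ratio` and the batch linear solve are illustrative assumptions. The filter constants actually used in (9) were tuned empirically and are not reproduced here.

```python
import numpy as np

def smooth_correction_vectors(raw, ratio):
    """MAP-smooth a (T, D) sequence of unsmoothed correction vectors.

    ratio = dynamic variance / static variance (hypothetical scalar).
    A small ratio smooths heavily; a large ratio leaves the input almost unchanged.
    """
    raw = np.asarray(raw, dtype=float)
    T = raw.shape[0]
    # First-difference operator: (diff @ r)[t] = r[t + 1] - r[t].
    diff = np.diff(np.eye(T), axis=0)
    # Normal equations of the quadratic objective: (ratio * I + diff^T diff) r = ratio * raw.
    A = ratio * np.eye(T) + diff.T @ diff
    return np.linalg.solve(A, ratio * raw)

# Usage on a hypothetical, rapidly alternating one-dimensional sequence.
raw = np.array([[0.0], [1.0], [0.0], [1.0], [0.0]])
smoothed = smooth_correction_vectors(raw, 0.5)
```

The solve uses the whole utterance at once, so it is a non-causal batch smoother. It is meant only to show the low-pass effect implied by the objective, not to stand in for the recursive filter.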
D. Enhancing SPLICE by Noise Estimation and Noise Normalization
In this enhancement of SPLICE, different noise conditions between the SPLICE training set and test set are normalized. The procedure for noise normalization and for denoising is as follows. Instead of building codebooks for the noisy speech in the training set, they are built from the noisy speech after subtraction of a noise estimate obtained from it. Then the correction vectors are estimated from the
The essence of the algorithm is the use of iterations to achieve close approximations to a nonlinear model of the acoustic environment while at the same time employing the “forgetting” mechanism to effectively track nonstationary noise. The algorithm requires no latency, since only the present and past noisy speech observations are needed to compute the current frame’s noise estimate. Using a number of empirically verified assumptions that simplify the implementation, the algorithm has been made efficient enough to run close to real time for noise tracking. The mathematical theory, algorithm, and implementation detail of this iterative stochastic approximation technique can be found in , .
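The forgetting and zero-latency properties can be illustrated in isolation with a much simpler estimator than the one described above. The sketch below is a causal recursive mean with an exponential forgetting factor, a generic technique substituted here for illustration. The function name, the forgetting value, and the step-change test signal are all assumptions, and no nonlinear environment model is involved.

```python
import numpy as np

def track_with_forgetting(frames, forgetting=0.9):
    """Causal recursive mean with exponential forgetting.

    The estimate at frame t uses frames 0..t only, so no look-ahead is needed.
    """
    frames = np.asarray(frames, dtype=float)
    estimates = np.empty_like(frames)
    weighted_sum = np.zeros_like(frames[0])
    weight = 0.0
    for t, frame in enumerate(frames):
        # Discount old evidence by the forgetting factor, then add the new frame.
        weighted_sum = forgetting * weighted_sum + frame
        weight = forgetting * weight + 1.0
        estimates[t] = weighted_sum / weight
    return estimates

# Hypothetical noise level that jumps from 0 to 10 halfway through the utterance.
level = np.concatenate([np.zeros(50), np.full(50, 10.0)])
tracked = track_with_forgetting(level, forgetting=0.9)
```

Because old frames are discounted, the final estimate ends close to the new level of 10, whereas the plain average over all frames would be 5. Each estimate also depends only on frames already observed.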
Figs. 3–5 show the results of noise-normalized SPLICE denoising using the iterative stochastic algorithm for tracking nonstationary noise in an utterance of the Aurora2 data, where the SNR is 10 dB, 5 dB, and 0 dB, respectively. From top to bottom we can see noisy speech, clean speech, and denoised speech, all in the same spectrogram format. Most of the noise has been effectively removed, except for some strong noise bursts located around frames 150–158 in Fig. 5, where the instantaneous SNR is significantly lower than 0 dB.
training set using the noise-normalized stereo data.
The correction vectors trained in this new SPLICE will be different from those in the basic version of SPLICE. This is because the codebook selection will be different, since it now operates on the noise-normalized noisy speech. For denoising in the test data, the noise-normalized noisy cepstra are used to obtain the noise-normalized MMSE estimate, and then the noise normalization
The nonstationary noise estimation algorithm discussed here and its use in the noise-normalized SPLICE are critical factors for the noise-robust speech recognition results presented in the next section. We have recently extended the algorithm to represent the noise as time-varying random vectors in order to exploit the variance parameter and new prior information. The