IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 8, NOVEMBER 2002

distortion, joint additive and convolutional distortion, and nonlinear distortion (in time-domain) because the stereo data provides accurate estimates of the bias or correction vectors without the need for an explicit noise model. One key requirement for the success of the basic version of SPLICE described here is that the distortion conditions under which the correction vectors are learned from the stereo data must be similar to those corrupting the test data. Enhanced versions of the algorithm described later in this section will relax this requirement by employing a noise estimation and normalization procedure.

We assume a general nonlinear distortion of a clean cepstral vector, $\mathbf{x}$, into a noisy one, $\mathbf{y}$. This distortion is approximated in SPLICE by a set of linear distortions. The probabilistic formulation of the basic version of SPLICE is provided below.

1) Basic Assumptions: The first assumption is that the noisy speech cepstral vector follows a mixture distribution of Gaussians

$$p(\mathbf{y}) = \sum_{s} p(\mathbf{y} \mid s)\, p(s), \quad \text{with} \quad p(\mathbf{y} \mid s) = \mathcal{N}(\mathbf{y};\, \boldsymbol{\mu}_s, \boldsymbol{\Sigma}_s) \tag{1}$$

where $s$ denotes the discrete random variable taking the values 1, 2, ..., $N$, one for each region over which the piecewise linear approximation between the clean cepstral vector and distorted cepstral vector is made. This distribution, one for each separate distortion condition (not indexed for clarity), can be thought of as a "codebook" with a total of $N$ codewords (Gaussian means) and their variances.

The second assumption made by SPLICE is that the conditional probability density function (PDF) for the clean vector $\mathbf{x}$ given the noisy speech vector, $\mathbf{y}$, and the region index, $s$, is a Gaussian with the mean vector being a linear function of the noisy speech vector $\mathbf{y}$. In this paper, we take a simplified form of this (piecewise) function by setting the rotation matrix to the identity matrix, leaving only the bias or correction vector. Thus, the conditional PDF has the form

$$p(\mathbf{x} \mid \mathbf{y}, s) = \mathcal{N}(\mathbf{x};\, \mathbf{y} + \mathbf{r}_s, \boldsymbol{\Gamma}_s) \tag{2}$$

where the correction vector is $\mathbf{r}_s$ and the covariance matrix of the conditional PDF is $\boldsymbol{\Gamma}_s$.

2) SPLICE Training: Since the noisy speech PDF $p(\mathbf{y})$ obeys a mixture-of-Gaussian distribution, the standard EM algorithm is used to train $\boldsymbol{\mu}_s$ and $\boldsymbol{\Sigma}_s$. Initial values of the parameters can be determined by a VQ clustering algorithm.

The parameters $\mathbf{r}_s$ and $\boldsymbol{\Gamma}_s$ of the conditional PDF $p(\mathbf{x} \mid \mathbf{y}, s)$ can be trained using the maximum likelihood criterion. Since the variance of the distribution is not used in cepstral enhancement, we only give the ML estimate of the correction vector below:

$$\mathbf{r}_s = \frac{\sum_{n} p(s \mid \mathbf{y}_n)\,(\mathbf{x}_n - \mathbf{y}_n)}{\sum_{n} p(s \mid \mathbf{y}_n)} \tag{3}$$

where

$$p(s \mid \mathbf{y}_n) = \frac{p(\mathbf{y}_n \mid s)\, p(s)}{\sum_{s'} p(\mathbf{y}_n \mid s')\, p(s')} \tag{4}$$

and $n$ denotes the time-frame index of the feature vector.

This training procedure requires a set of stereo (two channel) data. One channel contains the clean utterance, $\mathbf{x}_n$, and the other channel contains the same utterance with distortion, $\mathbf{y}_n$. The two-channel data can be collected, for example, by simultaneously recording utterances with one close-talking and one far-field microphone. Alternatively, it has been shown in our research [10] that a large amount of synthetic stereo data can be effectively produced to approximate the realistic stereo data with virtually no loss of speech recognition accuracy in MiPad tasks.

3) SPLICE for Cepstral Enhancement: One significant advantage of the above two basic assumptions made in SPLICE is the inherent simplicity in deriving and implementing the rigorous MMSE estimate of clean speech cepstral vectors from their distorted counterparts. Unlike the FCDCN algorithm [1], no approximations are made in deriving the optimal enhancement rule. The derivation is outlined below.

The MMSE estimate is the following conditional expectation of the clean speech vector given the observed noisy speech:

$$\hat{\mathbf{x}} = E[\mathbf{x} \mid \mathbf{y}] = \sum_{s} p(s \mid \mathbf{y})\, E[\mathbf{x} \mid \mathbf{y}, s] \tag{5}$$

Due to the second assumption of SPLICE, the above codeword-dependent conditional expectation of $\mathbf{x}$ (given $\mathbf{y}$ and $s$) is simply the bias-added noisy speech vector

$$E[\mathbf{x} \mid \mathbf{y}, s] = \mathbf{y} + \mathbf{r}_s \tag{6}$$

where the bias $\mathbf{r}_s$ has been estimated from the stereo training data according to (3). This gives the simple form of the MMSE estimate as the noisy speech vector corrected by a linear weighted sum of all codeword-dependent bias vectors already trained

$$\hat{\mathbf{x}} = \mathbf{y} + \sum_{s} p(s \mid \mathbf{y})\, \mathbf{r}_s \tag{7}$$

While this is already efficient to compute, more efficiency can be achieved by approximating the weights according to

$$\hat{p}(s \mid \mathbf{y}) \approx \begin{cases} 1, & s = \arg\max_{s'} p(s' \mid \mathbf{y}) \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$

This approximation turns the MMSE estimate into an approximate MAP estimate that consists of two sequential steps: first, finding the optimal codeword $s$ using the VQ codebook based on the parameters ($\boldsymbol{\mu}_s$, $\boldsymbol{\Sigma}_s$), and then adding the codeword-dependent vector $\mathbf{r}_s$ to the noisy speech vector. We have found empirically in many of our initial experiments that the above VQ approximation does not appreciably affect recognition accuracy while improving computational efficiency.

# C. Enhancing SPLICE by Temporal Smoothing

In this enhanced version of SPLICE, we not only minimize the static deviation from the clean to noisy cepstral vectors (as in the basic version of SPLICE), but also seek to minimize the dynamic deviation.
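As a point of reference for the smoothing extension, the basic frame-by-frame SPLICE rule can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes diagonal covariance matrices, assumes the mixture parameters of the noisy-speech PDF have already been trained by EM, and uses hypothetical function names (`posteriors`, `train_corrections`, `enhance`). It covers the codeword posteriors of (4), the ML correction vectors of (3), the MMSE estimate of (7), and the hard-decision approximation of (8).

```python
import numpy as np

def log_gauss_diag(y, means, variances):
    """Log N(y_n; mu_s, diag(var_s)) for every frame n and codeword s -> (T, S)."""
    diff = y[:, None, :] - means[None, :, :]            # (T, S, D)
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2.0 * np.pi * variances), axis=1))

def posteriors(y, weights, means, variances):
    """Codeword posteriors p(s | y_n), computed in the log domain for stability."""
    log_joint = log_gauss_diag(y, means, variances) + np.log(weights)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # avoid underflow in exp
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

def train_corrections(x, y, weights, means, variances):
    """ML correction vectors r_s: posterior-weighted mean of (x_n - y_n)."""
    post = posteriors(y, weights, means, variances)     # (T, S)
    num = post.T @ (x - y)                              # (S, D)
    den = post.sum(axis=0)[:, None]                     # (S, 1)
    return num / np.maximum(den, 1e-12)                 # guard empty codewords

def enhance(y, weights, means, variances, r, hard=False):
    """Enhanced cepstra: soft MMSE rule, or the hard (approximate MAP) rule."""
    post = posteriors(y, weights, means, variances)
    if hard:
        return y + r[post.argmax(axis=1)]               # best codeword only
    return y + post @ r                                 # weighted sum of biases

# Toy stereo data (assumed, for illustration only): two well-separated
# codewords, each with its own constant bias between clean and noisy frames.
rng = np.random.default_rng(0)
means = np.array([[-5.0, -5.0], [5.0, 5.0]])
variances = np.ones((2, 2))
weights = np.array([0.5, 0.5])
true_bias = np.array([[1.0, 1.0], [-2.0, -2.0]])
labels = rng.integers(0, 2, size=200)
y_train = means[labels] + rng.standard_normal((200, 2))  # noisy channel
x_train = y_train + true_bias[labels]                    # clean channel

r = train_corrections(x_train, y_train, weights, means, variances)
x_soft = enhance(y_train, weights, means, variances, r)
x_hard = enhance(y_train, weights, means, variances, r, hard=True)
```

In this toy setting the codewords barely overlap, so the soft and hard rules give nearly identical outputs; the two differ mainly for frames that fall between codewords, where the hard rule trades a small approximation error for a cheaper lookup.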

The basic SPLICE optimally processes each frame of noisy speech independently. An obvious extension is to jointly process a segment of frames. In this way, although the deviation from the clean to noisy speech cepstra for an individual frame could be undesirably greater than that achieved by the basic, static