DENG et al.: DISTRIBUTED SPEECH PROCESSING IN MiPad’S MULTIMODAL USER INTERFACE
under the constraint $\sum_{k=1}^{3} b_k = B$, where $b_k$ is the number of bits assigned to the $k$th subvector, $B$ is the total number of bits to be assigned, and $\mathrm{WER}(b_1, b_2, b_3)$ is the WER obtained using the assignment of the $b_k$ bits to the subvectors. A full search requires us to run a separate WER experiment for each one of the possible combinations, which is computationally prohibitive, so we use a greedy bit allocation technique. At each stage we add one bit to each of the subvectors in turn and keep the combination with the minimal WER. We repeat the procedure for the next stage by starting at the best combination of the previous stage. With $B$ total bits to assign, the number of combinations evaluated is reduced from the full search's $\binom{B+2}{2}$ to the greedy search's $3B$.
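The greedy allocation procedure can be sketched as follows; `evaluate_wer` is a hypothetical stand-in (a toy cost model) for running a full recognition experiment with a given per-subvector bit allocation:

```python
def evaluate_wer(bits):
    """Hypothetical stand-in for running a full recognition experiment
    with the given per-subvector bit allocation.  A real evaluation
    would decode a test set; here a toy cost model keeps the sketch
    runnable (fewer bits -> higher 'WER')."""
    return sum(2.0 ** (-b) for b in bits)

def greedy_bit_allocation(num_subvectors, total_bits):
    """Greedy search: at each stage, try adding one bit to each
    subvector, keep the candidate with the minimal WER, then
    continue from that best combination."""
    alloc = [0] * num_subvectors
    for _ in range(total_bits):
        candidates = []
        for k in range(num_subvectors):
            trial = list(alloc)
            trial[k] += 1
            candidates.append((evaluate_wer(trial), trial))
        _, alloc = min(candidates)  # lowest WER wins this stage
    return alloc

print(greedy_bit_allocation(3, 12))  # -> [4, 4, 4] under the toy cost
```

With three subvectors and 12 bits, the greedy search runs 36 WER evaluations, versus 91 possible allocations for an exhaustive search.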
The experiments carried out to evaluate the above phone-dependent coder use as the baseline system a version of Microsoft's continuous-density HMM Whisper system. The system uses 6000 tied HMM states (senones), 20 Gaussians per state, Mel-cepstrum, delta cepstrum, and delta–delta cepstrum. The recognition task is 5000-word-vocabulary continuous speech recognition on Wall Street Journal data sources. A fixed bigram language model is used in all the experiments. The training set consists of a total of 16 000 female sentences, and the test set of 167 female sentences (2708 words). The word accuracy with no coding for this test set was 95.7%. With a perfect phone classifier, coding with the bit allocation of (4, 4, 4) for the three subvectors gives a word accuracy of 95.6%. Using a very simple phone classifier with a Mahalanobis distance measure and the same bit allocation of (4, 4, 4), the recognition accuracy drops only to 95.0%. For this high-performance coder, the bandwidth has been reduced to 1.6 Kbps with the required memory being under 64 Kbytes.
B. Error Protection
A novel channel coder has also been developed to protect MiPad's Mel-cepstral features based on the client–server architecture. The channel coder assigns unequal amounts of redundancy among the different source bits, giving a greater amount of protection to the most important bits, where importance is measured by the contribution of those bits to the word error rate in speech recognition. A quantifiable procedure to assess the importance of each bit is developed, and the channel coder exploits this utility function for the optimal forward error correction (FEC) assignment. The FEC assignment algorithm assumes that packets are lost according to a Poisson process. Simulation experiments are performed in which the bursty nature of loss patterns is taken into account. When combined with the new source coder, the new channel coder is shown to provide considerable robustness to packet losses even under extremely adverse conditions.
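The unequal-protection idea can be illustrated with the following sketch. The utility values, the loss probability, and the simple greedy allocation rule are all illustrative assumptions, not the paper's actual FEC assignment algorithm:

```python
import heapq

def assign_fec(utilities, budget, loss_prob=0.1):
    """Greedy unequal FEC assignment (illustrative model): one more
    redundancy unit for source bit i is assumed to multiply its
    residual loss probability by loss_prob, so each unit in the
    budget goes to the bit with the largest expected-cost reduction.
    utilities[i] measures bit i's contribution to the WER."""
    redundancy = [0] * len(utilities)
    # Expected WER cost of losing each bit at its current protection.
    residual = [u * loss_prob for u in utilities]
    heap = [(-r * (1.0 - loss_prob), i) for i, r in enumerate(residual)]
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)  # bit with the biggest gain
        redundancy[i] += 1
        residual[i] *= loss_prob
        heapq.heappush(heap, (-residual[i] * (1.0 - loss_prob), i))
    return redundancy

# The most important bits end up with the most protection.
print(assign_fec([8.0, 4.0, 2.0, 1.0], budget=6))  # -> [2, 2, 1, 1]
```

Under any model of this shape, redundancy concentrates on the bits whose loss hurts recognition most, which is the essential property of the utility-driven FEC assignment.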
Some alternatives to FEC coding are also explored, including the use of multiple transmissions, interleaving, and interpolation. We conclude from this preliminary work that the final choice of channel coder should depend on the relative importance among delay, bandwidth, and burstiness of noise.
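As one illustration of the interpolation alternative, lost frames can be reconstructed from their nearest received neighbors. This is a sketch under the simplifying assumption of scalar features (real Mel-cepstral frames are vectors):

```python
def interpolate_lost_frames(frames):
    """Reconstruct lost frames (None) by linear interpolation between
    the nearest received frames on each side; edge gaps repeat the
    nearest received frame.  Scalar 'frames' stand in for the
    Mel-cepstral vectors of a real stream."""
    out = list(frames)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1  # j is the first received frame after the gap
            left = out[i - 1] if i > 0 else None
            right = out[j] if j < n else None
            for k in range(i, j):
                if left is None:
                    out[k] = right  # gap at the start of the stream
                elif right is None:
                    out[k] = left   # gap at the end of the stream
                else:
                    w = (k - i + 1) / (j - i + 1)
                    out[k] = left + w * (right - left)
            i = j
        else:
            i += 1
    return out

print(interpolate_lost_frames([1.0, None, None, 4.0]))  # -> [1.0, 2.0, 3.0, 4.0]
```

Unlike FEC, this adds no bandwidth and no transmission delay, at the cost of accuracy over long bursty gaps.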
Our preliminary work on the compression and error protection aspects of distributed speech recognition has provided clear insight into the tradeoffs we need to make among source coding, delay, computational complexity, and resilience to packet losses. Most significantly, the new algorithms developed have brought down the Mel-cepstra compression rate to as low as 1.6 Kbps with virtually no degradation in word error rate compared with no compression. These results are currently being incorporated into the next version of MiPad.
IV. CONTINUOUS SPEECH RECOGNITION AND UNDERSTANDING
While the compressed and error-protected Mel-cepstral features are computed on the MiPad client, the major computation for continuous speech recognition (decoding) resides on the server. The language model, hidden Markov models (HMMs), and lexicon used for speech decoding all reside on the server, which processes the Mel-cepstral features transmitted from the client. Denoising operations such as SPLICE, which extract noise-robust Mel-cepstra, can reside either on the server or on the client, though we implemented them on the server for convenience.
MiPad is designed to be a personal device. As a result, speech recognition uses speaker-adaptive acoustic models (HMMs) and a user-adapted lexicon to improve recognition accuracy. The continuous speech recognition engine and its HMMs are a hybrid that combines the best features of Microsoft's Whisper and HTK. Both MLLR and MAP adaptation are used to adapt the speaker-independent acoustic model to each individual speaker. We used 6000 senones, each with a 20-component mixture Gaussian density. The context-sensitive language model is used for relevant semantic objects driven by the user's pen tapping action, as described in Section IV. As speech recognition accuracy remains a major challenge for MiPad usability, most of our recent work on MiPad's acoustic modeling has focused on noise robustness, as described in Section II. The work on language modeling for improving speech recognition accuracy has focused on language model portability, which is described in this section.
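To make the adaptation step concrete, the following is a deliberately simplified, bias-only caricature of speaker adaptation; actual MLLR estimates full affine transforms of the Gaussian means by maximum likelihood, which this sketch does not attempt. Here all means in one regression class are shifted by a single offset estimated from aligned adaptation frames:

```python
def bias_only_adapt(means, frames, alignment):
    """Bias-only caricature of mean adaptation: estimate one shared
    offset b (least squares, identity covariances assumed), where
    alignment[t] gives the Gaussian that adaptation frame t is
    aligned to, then shift every mean in the class by b."""
    dim = len(means[0])
    b = [0.0] * dim
    for t, g in enumerate(alignment):
        for d in range(dim):
            b[d] += (frames[t][d] - means[g][d]) / len(alignment)
    return [[m[d] + b[d] for d in range(dim)] for m in means]

speaker_independent = [[0.0, 0.0], [1.0, 1.0]]
adaptation_frames = [[0.5, 0.5], [1.5, 1.5]]
print(bias_only_adapt(speaker_independent, adaptation_frames, [0, 1]))
# -> [[0.5, 0.5], [1.5, 1.5]]
```

The point of sharing one transform across many Gaussians, in MLLR proper as in this toy version, is that a small amount of adaptation data can still update every mean in the class.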
The speech recognition engine in MiPad uses the unified language model that takes advantage of both rule-based and data-driven approaches. Consider two training sentences:
“Meeting at three with John Smith.” versus “Meeting at four PM with Derek.” Within a pure n-gram framework, we need to estimate probabilities such as P(John | three, with) and P(Derek | PM, with) individually. This makes it very difficult to capture the obviously needed long-span semantic information in the training data. To overcome this difficulty, the unified model uses a set of context-free grammars (CFGs) that captures the semantic structure of the domain. For the example listed here, we may have CFGs for NAME and TIME, respectively, which can be derived from factoid grammars of smaller sizes. The training sentences now look like:
“Meeting at three:TIME with John Smith:NAME,” and “Meeting at four PM:TIME with Derek:NAME.” With the parsed training data, we can now estimate the n-gram probabilities as usual. For example, the replacement of P(John | three, with) by P(NAME | TIME, with) makes such an “n-gram” representation more meaningful and more accurate.
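The class-replacement step can be sketched as follows; the regex stand-ins for the TIME and NAME grammars and the bigram counting are illustrative assumptions, not the actual CFG machinery:

```python
import re
from collections import Counter

def tag_to_classes(sentence, grammars):
    """Replace any span matched by a class grammar (approximated here
    with regexes; real CFGs are richer) by its class token, mimicking
    the parsed training data."""
    for cls, pattern in grammars.items():
        sentence = re.sub(pattern, cls, sentence)
    return sentence

# Toy stand-ins for the TIME and NAME grammars (illustrative only).
grammars = {
    "TIME": r"\b(?:three|four PM)\b",
    "NAME": r"\b(?:John Smith|Derek)\b",
}

sentences = ["meeting at three with John Smith",
             "meeting at four PM with Derek"]

bigrams = Counter()
for s in sentences:
    words = tag_to_classes(s, grammars).split()
    bigrams.update(zip(words, words[1:]))

# Both sentences now contribute counts to the same class-level bigrams.
print(bigrams[("with", "NAME")])  # -> 2
```

After the replacement, the two sentences collapse onto the same class-level word sequence, so their counts pool instead of fragmenting across individual names and times.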