THE FIVE SENSES, AND BEYOND
puter. The bigger the corpus, the better the system can recognize a range of utterances. Each speech sound in the corpus is broken down into a soundprint or acoustic spectrum: a list of the frequencies that make up the sound and their strengths. When the system hears a voice, that, too, is analyzed, in real time. By comparing the incoming soundprints with the stored ones, the computer assigns a probability that each sound has been correctly recognized. Further information comes from knowing the probabilities of the myriad other sounds that might follow the recognized one. The system also uses a “dictionary,” a set of soundprints for words in the language, and a “grammar,” which tells it the probability of finding a particular word once the preceding word is known. Then all these factors are manipulated by extremely sophisticated statistics, resulting in highly accurate word recognition. Compared to this complex process, speech synthesis is relatively simple.
But merely recognizing and saying words is not enough. As researcher Sylvie Mozziconacci of Leiden University writes,
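The way the “dictionary” and “grammar” probabilities combine can be pictured with a small sketch. What follows is only an illustration under stated assumptions, not the method of any particular recognition system: the function name best_word_sequence, the candidate words, and every probability are invented for the example, and the search used is the standard Viterbi procedure, which the passage above does not name. Each time slot carries the probability that the incoming sound matched a stored soundprint, and the grammar table gives the probability of a word once the preceding word is known.

```python
import math


def best_word_sequence(acoustic, bigram, floor=1e-6):
    """Return the most probable word sequence for a non-empty utterance.

    acoustic: list of dicts, one per time slot, mapping a candidate word to
              the (nonzero) probability that the sound matched its soundprint.
    bigram:   dict mapping (previous_word, word) to the probability of
              `word` given `previous_word` (the "grammar").
    floor:    small probability assumed for word pairs missing from the table.
    """
    # best[w] holds (log-probability, path) for the best sequence ending in w.
    best = {w: (math.log(p), [w]) for w, p in acoustic[0].items()}
    for slot in acoustic[1:]:
        new_best = {}
        for word, p in slot.items():
            candidates = []
            for prev, (score, path) in best.items():
                grammar_p = bigram.get((prev, word), floor)
                total = score + math.log(grammar_p) + math.log(p)
                candidates.append((total, path + [word]))
            # Keep only the best way of arriving at this word.
            new_best[word] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]


# Hypothetical numbers: the sounds alone cannot separate "speech" from "beach".
acoustic = [
    {"recognize": 0.6, "wreck a nice": 0.4},
    {"speech": 0.5, "beach": 0.5},
]
bigram = {
    ("recognize", "speech"): 0.30,
    ("recognize", "beach"): 0.01,
    ("wreck a nice", "speech"): 0.01,
    ("wreck a nice", "beach"): 0.20,
}

result = best_word_sequence(acoustic, bigram)
print(result)
```

In this toy case the second sound is equally likely to be either word, so the choice is settled by the grammar table together with the acoustic score of the first word. Working in logarithms keeps long products of small probabilities from underflowing.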
Communication is not merely an exchange of words . . . variations in pitch, intensity, speech rate, rhythm and voice quality are available to speaker and listener in order to encode and decode the full spoken message.
Recognizing words is one thing. Interpreting them, or speaking them with natural meaning and delivery, is something else.
To make a synthetic voice sound better than the mechanical monotone of a movie robot requires prosody. To poets, prosody means the study of meter, alliteration, and rhyme scheme that contribute to the flow and impact of a poem. For those who design machines that speak and listen, prosody means the differences in intonation that people use in speech, adding meaning or emotion to the literal significance of the words, or, as Elizabeth Shriberg and Andreas Stolcke of SRI International write, it is “the rhythm and melody of speech.” These intonational variations are put into synthesized voices by careful adjustment of pitch, pacing, and so on to copy the natural sound of people talking.
The other side of the prosody coin is the problem of ensuring that an artificial being can fully interpret what humans say. That helps