Hybrid Concatenated-Formant Expressive Speech Synthesizer For Kinesensic Voices
Advisor: Younan, Nicholas H.
Committee: Fowler, James E.
Du, Jenny Q.
Marple, Gary A.
Traditional and commercial speech synthesizers are incapable of synthesizing speech with proper emotion or prosody. Conveying prosody in artificially synthesized speech is difficult because of the extreme variability in human speech. An arbitrary natural language sentence can have different meanings depending upon the speaker, speaking style, context, and many other factors. Most concatenated speech synthesizers use phonemes, which are phonetic units defined by the International Phonetic Alphabet (IPA). The 50 phonemes in English are standardized and unique units of sound, but not of expression. An earlier work proposed an analogy between speech and music: "speech is music, music is speech." The speech data obtained from master practitioners trained in kinesensic voice is marked on a five-level intonation scale, which is similar to the musical scale. From this speech data, 1324 unique expressive units, called expressemes®, are identified. The expressemes consist of melody and rhythm, which, in digital signal processing, are analogous to the pitch, duration, and energy of the signal. The expressemes have less acoustic and phonetic variability than phonemes, so they better convey prosody. The goal is to develop a speech synthesizer that exploits the prosodic content of expressemes in order to synthesize expressive speech with a small speech database. Creating a reasonably small database that captures multiple expressions is a challenge because a complete set of speech segments may not be available to create an emotion. Methods are suggested whereby acoustic mathematical modeling is used to create missing prosodic speech segments from the base prosody unit. A new concatenated-formant hybrid speech synthesizer architecture is developed for this purpose. A pitch-synchronous, time-varying, frequency-warped wavelet transform-based prosody manipulation algorithm is developed for transformation between prosodies.
A time-varying frequency-warping transform is developed to smoothly concatenate the temporal and spectral parameters of adjacent expressemes to create intelligible speech. Additionally, issues specific to expressive speech synthesis using expressemes are resolved, for example, ergodic hidden Markov model-based expresseme segmentation, model creation for F0 and segment duration, and target and join cost calculation. The performance of the hybrid synthesizer is measured against a commercially available synthesizer using objective and perceptual evaluations. Subjects consistently rated the hybrid synthesizer better in five different perceptual tests. In these tests, 70% of speakers rated the hybrid synthesis as more expressive, and 72% preferred it over the commercial synthesizer. The hybrid synthesizer also achieved a comparable mean opinion score.
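The abstract mentions target and join cost calculation but does not give the formulas used in the thesis. As a rough illustration only, the sketch below shows how such costs are commonly combined in concatenative unit selection: a target cost measures how far a candidate unit is from the desired prosody, a join cost measures the mismatch between adjacent units, and a dynamic-programming search picks the sequence with the lowest total. The `Unit` fields, the cost definitions, the weights, and the function names are all assumptions made for this example, not the thesis's actual models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical prosodic unit: a label plus mean F0 and duration."""
    label: str
    f0_hz: float
    duration_ms: float


def target_cost(target, cand, w_f0=1.0, w_dur=0.5):
    # Assumed cost: weighted relative error in F0 and duration.
    f0_err = abs(target.f0_hz - cand.f0_hz) / target.f0_hz
    dur_err = abs(target.duration_ms - cand.duration_ms) / target.duration_ms
    return w_f0 * f0_err + w_dur * dur_err


def join_cost(prev, nxt):
    # Assumed cost: relative F0 mismatch between adjacent units.
    return abs(prev.f0_hz - nxt.f0_hz) / max(prev.f0_hz, nxt.f0_hz)


def select_units(targets, candidates):
    """Return (best unit sequence, total cost) by dynamic programming.

    candidates[i] is the non-empty list of database units available
    for targets[i].
    """
    # best[i][j] = (lowest cost of a path ending in candidates[i][j],
    #               index of the predecessor in candidates[i - 1])
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_best] + target_cost(targets[i], c), k_best))
        best.append(row)

    # Backtrack from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

Calling `select_units(targets, candidates)` with one candidate list per target returns the chosen units and their accumulated cost. In a real system the join cost would normally also compare spectral parameters at the unit boundary, not only F0.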