
Copyright (c) 1997 Jouji Miwa. All rights reserved.

International Symposium on Simulation, Visualization and Auralization
for Acoustic Research and Education
pp.271-278
2-4 April, 1997
International Conference Center, Waseda University
Tokyo, JAPAN

INTERACTIVE VISUALIZATION AND AURALIZATION OF SPEECH PRODUCTION
USING VARIABLE VOCAL AND NASAL AREA FUNCTION

Jouji Miwa

Department of Computer and Information Science, Iwate University
4-3-5 Ueda, Morioka, 020 Japan
miwa@cis.iwate-u.ac.jp

ABSTRACT
I developed a system for the visualization and auralization of speech that uses interactively variable vocal and nasal tract area functions, intended for research on the speech production mechanism. For visualization, the variable vocal and nasal area functions, the power spectrum, and the synthesized speech waveform are displayed graphically in a multi-window system. For auralization, the synthesized digital waveform is converted to an analog signal within the system.
For dialect speech such as "Zuzuben" in Japanese, the interactive visualization system newly shows that not only the constriction position of the neutralized tongue but also the degree of mouth opening is an important feature. For nasalized vowels, the system also newly shows that an anti-formant frequency in the range from 1000 Hz to 2500 Hz depends on the degree of velum opening.

INTRODUCTION
Visualizing the relation between the articulatory domain, such as the area function, and the frequency domain, such as the spectrum, is one of the important objectives in the study of speech production. I therefore developed a system for the visualization and auralization of speech that uses interactively variable vocal and nasal tract area functions, intended for research on the speech production mechanism.
For visualization, the variable vocal and nasal area functions, the power spectrum, and the speech waveform are displayed graphically in a multi-window system. The spectrum is calculated immediately from the area function using Sondhi's method [1]; the speech waveform is synthesized from the calculated spectrum using an inverse FFT. For auralization, the synthesized digital waveform is converted to an analog signal within the system.
Each section of the displayed area function is varied interactively in a graphical window, in the manner of a graphic equalizer in an audio control system. Vocal and nasal tract shapes can now be measured easily and accurately with MRI (magnetic resonance imaging). The initial vowel area functions in the system are taken from MRI data of isolated vowels uttered by a male speaker [2]. The initial nasal area functions are taken from isolated nasals uttered by a male speaker [3].
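The step from a calculated spectrum to a waveform can be illustrated with a minimal sketch. It assumes the transfer function has been sampled at equally spaced frequencies from 0 Hz to the Nyquist frequency, and it uses a direct inverse DFT as a stand-in for the inverse FFT named above (same result, slower). The function name inverse_dft_real is hypothetical and is not taken from the system.

```python
import cmath

def inverse_dft_real(half_spectrum):
    """Return a real signal from spectrum bins 0..N/2 (N = 2 * (len - 1)).

    The missing negative-frequency bins are filled in by conjugate
    symmetry, which is what makes the time-domain result real.
    """
    m = len(half_spectrum) - 1
    n_total = 2 * m
    # Bins 0..m as given, then the conjugates of bins m-1..1.
    full = list(half_spectrum) + [
        half_spectrum[m - i].conjugate() for i in range(1, m)
    ]
    signal = []
    for n in range(n_total):
        acc = sum(
            full[k] * cmath.exp(2j * cmath.pi * k * n / n_total)
            for k in range(n_total)
        )
        signal.append(acc.real / n_total)
    return signal

# A flat spectrum corresponds to a unit impulse at time index 0.
impulse = inverse_dft_real([1, 1, 1, 1, 1])
print([round(v, 6) for v in impulse])
```

In practice the list passed in would hold samples of H(omega), and the returned sequence would be the impulse response of the tract, which can then be convolved with a glottal source.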

INTERACTIVE SYSTEM OF VISUALIZATION AND AURALIZATION
Fig. 1 shows an example of the colored graphical multi-window display of the interactive visualization and auralization system. The system is programmed mainly in C, with Tcl/Tk [4] for the graphical interface of the variable vocal and nasal area functions and with Mesa, an OpenGL-like library [5], for the display of the three-dimensional vocal tract shape. It runs on a personal computer under a free UNIX operating system such as Linux.
The top of the figure shows the variable vocal and nasal tract area functions, in which the area of each section is varied interactively with a sliding bar, in the manner of a graphic equalizer in an audio control system. The middle left shows the power spectrum of the synthesized transfer function; the middle right shows the synthesized speech waveform. Formant frequencies and bandwidths are picked from the power spectrum calculated using Equation (1). The bottom left shows an area function; the bottom right shows the three-dimensional vocal tract shape, which can be rotated interactively with mouse clicks to change the viewpoint.



Fig. 1 An example of graphical multi-windows of the system.



TRANSFER FUNCTION FROM VARIABLE AREA FUNCTION
The vocal tract shape is approximated with a cylinder or wire-frame model, as shown in Fig. 2. In an articulatory speech synthesizer, the transfer function is therefore calculated as a product of 2 x 2 chain matrices whose elements are determined by the area function [2]. In the synthesizer, wave propagation in the tract is assumed to be planar and linear.
Fig. 2 A three-dimensional model of cylinder or wire-frame for vowel /a/ (panels: cylinder, wire-frame).

A general chain matrix of a portion of the tract relates the output pressure PL and volume velocity UL at the lips or nostrils to the input pressure PG and volume velocity UG on the glottal side, as follows. All symbols in the following equations are capitalized to denote variables in the frequency domain.
Eq.1

where the matrix elements Ai, Bi, Ci and Di (i=1,...,n) are defined as follows from the area ai of the i-th section of the tract, the wave velocity c, and the length Delta l of a homogeneous cylindrical tube, and the element ZVN is the input impedance of the nasal branch at the velum.
Eq.2 Eq.3

Eq.4 Eq.5

where the complex variables gamma and sigma are defined as functions of the angular frequency omega. The transfer function H(omega) is calculated as follows.
Eq.6
where Atract and Ctract are elements of the total chain matrix from the glottis to the lips in Equation (3), and ZL is the radiation impedance at the lips.
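The equations of this section survive only as labels in this text. As a hedged reconstruction, the following form is consistent with the definitions given here and with the chain-matrix formulation of Sondhi and Schroeter [1]. The air density \rho, the ordering of the matrix product, the shunt matrix K_{VN} for the nasal branch, and the sign conventions are assumptions and may differ from the original equations and their numbering.

```latex
\begin{align*}
\begin{bmatrix} P_L \\ U_L \end{bmatrix}
  &= K_{tract} \begin{bmatrix} P_G \\ U_G \end{bmatrix},
  \qquad
  K_{tract} =
  \begin{bmatrix} A_{tract} & B_{tract} \\ C_{tract} & D_{tract} \end{bmatrix}
  = K_n K_{n-1} \cdots K_1,
  \qquad
  K_i = \begin{bmatrix} A_i & B_i \\ C_i & D_i \end{bmatrix} \\
A_i &= D_i = \cosh\!\left(\sigma \,\Delta l / c\right) \\
B_i &= -\frac{\rho c}{a_i}\,\gamma\,\sinh\!\left(\sigma \,\Delta l / c\right) \\
C_i &= -\frac{a_i}{\rho c}\,\frac{1}{\gamma}\,\sinh\!\left(\sigma \,\Delta l / c\right) \\
K_{VN} &= \begin{bmatrix} 1 & 0 \\ -1/Z_{VN} & 1 \end{bmatrix}
  \quad \text{(inserted in the product at the velum section)} \\
H(\omega) &= \frac{U_L}{U_G} = \frac{1}{A_{tract} - C_{tract}\, Z_L}
\end{align*}
```

The last line follows from P_L = Z_L U_L and from the fact that every factor above has unit determinant. In the lossless limit, \gamma = 1 and \sigma = j\omega, so that A_i = D_i = \cos(\omega \Delta l / c), and B_i and C_i become purely imaginary.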


CALCULATED TRANSFER FUNCTION FOR VOWEL

Table 1 shows the relation between analyzed (natural) and synthesized (calculated) formant frequencies. Most of the errors for F1 and F2 are within about 5 %; the main exception is /i/, where the F1 error is about 13 %. For /i/, the assumption of planar, linear wave propagation is difficult to justify, and a finite element method (FEM) model is necessary. The left panel of Fig. 3 shows a periodic relation between the constriction position and a formant frequency. The right panel of Fig. 3 shows an example of two different vocal tract shapes [2] that have the same synthesized formant frequencies, F1 = 700 Hz, F2 = 1100 Hz, and F3 = 2250 Hz.
Table 1 Error between analyzed and synthesized formant frequencies (frequencies in Hz).

Vowel              F1      F2      F3      F4
/a/  Analysis     703    1105    2420    3475
     Synthesis    694    1129    2654    3624
     Error      -1.3%    2.2%    9.7%   14.9%
/i/  Analysis     306    2007    2482    3257
     Synthesis    345    2174    2579    3392
     Error      12.7%    8.3%    3.9%    4.1%
/u/  Analysis     449    1268    2257    3337
     Synthesis    434    1227    2584    3519
     Error      -3.3%   -3.2%   14.5%    5.5%
/e/  Analysis     557    1702    2286    3574
     Synthesis    559    1636    2389    3617
     Error       0.4%   -3.9%    4.5%    1.2%
/o/  Analysis     534     839    2470    3354
     Synthesis    514     884    2747    3947
     Error      -3.7%    5.4%   11.2%   17.7%

Error = (Synthesis - Analysis)/Analysis x 100 (%)
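As a small worked example of the error definition, the following sketch recomputes the /i/ row of Table 1 from its listed analysis and synthesis values. The function name percent_error is hypothetical.

```python
def percent_error(analysis_hz, synthesis_hz):
    """Relative error of a synthesized formant, in percent."""
    return (synthesis_hz - analysis_hz) / analysis_hz * 100.0

# Formant values F1..F4 for /i/, copied from Table 1.
analysis_i = [306, 2007, 2482, 3257]
synthesis_i = [345, 2174, 2579, 3392]

errors_i = [
    round(percent_error(a, s), 1) for a, s in zip(analysis_i, synthesis_i)
]
print(errors_i)  # [12.7, 8.3, 3.9, 4.1]
```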
Fig. 3 Periodic relation between constriction point and formant frequency, and an example of different shapes with the same formant frequencies.
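How formants follow from an area function can be sketched under simplifying assumptions: lossless sections, an ideal open end at the lips (ZL = 0), and no nasal branch. The constants RHO and C_SOUND and all function names are assumptions for illustration, not the system's code. Formants are located as local minima of |Atract - Ctract ZL|, the denominator of the transfer function.

```python
import math

RHO = 1.14e-3    # air density in g/cm^3 (assumed value)
C_SOUND = 3.5e4  # sound velocity in cm/s (assumed value)

def tract_matrix(areas, length, freq):
    """Total lossless chain matrix; areas (cm^2) run from glottis to lips."""
    k = 2.0 * math.pi * freq / C_SOUND
    cos_kl, sin_kl = math.cos(k * length), math.sin(k * length)
    a, b, c, d = 1.0 + 0j, 0j, 0j, 1.0 + 0j
    for area in areas:
        z0 = RHO * C_SOUND / area          # characteristic impedance
        sa, sb = cos_kl, -1j * z0 * sin_kl
        sc, sd = -1j * sin_kl / z0, cos_kl
        # Left-multiply: the new section sits on the lip side.
        a, b, c, d = (sa * a + sb * c, sa * b + sb * d,
                      sc * a + sd * c, sc * b + sd * d)
    return a, b, c, d

def find_formants(areas, length, z_lips=0.0, f_max=4000.0, step=5.0):
    """Frequencies (Hz) where |A - C * Z_L| has a local minimum."""
    freqs = [step * i for i in range(1, int(f_max / step))]
    mags = []
    for f in freqs:
        a, _, c, _ = tract_matrix(areas, length, f)
        mags.append(abs(a - c * z_lips))
    return [freqs[i] for i in range(1, len(freqs) - 1)
            if mags[i] < mags[i - 1] and mags[i] <= mags[i + 1]]

# Uniform 17.5 cm tube: odd quarter-wave resonances near
# 500, 1500, 2500 and 3500 Hz.
uniform = [3.0] * 20
print(find_formants(uniform, 0.875))
```

Replacing a few entries of the area list with a small value and moving that constriction along the tube is one way to explore a relation of the kind plotted in Fig. 3.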


ARTICULATORY SHAPE IN DIALECT SPEECH

With the interactive visualization and auralization system, we can not only easily view various vocal tract shapes and the corresponding power spectra but also immediately hear the corresponding synthesized waveform. In dialect speech such as "Zuzuben" in Japanese, /i/ and /u/ are neutralized vowels, so that the second formant frequency F2 of /i/ decreases toward 1500 Hz and F2 of /u/ increases toward 1500 Hz. Fig. 4 shows several vocal tracts in which the constriction point of the tongue for /i/ is neutralized to different degrees, together with the corresponding power spectra. In the left figure, the left side is the glottis and the right side is the lips. The right figure shows that F2 decreases toward 1500 Hz as the vowel is neutralized. The synthesized sound is also heard as a neutralized vowel. In dialect speech, therefore, the constriction point of the tongue is an important feature.
Fig. 4 Different neutralized vocal tracts and corresponding power spectra.


Fig. 5 shows several vocal tracts with different degrees of lip opening and the corresponding power spectra. The right figure shows that F2 decreases toward 1500 Hz as the lips close. In dialect speech, the degree of lip opening may therefore also be an important feature.
Fig. 5 Different lips-open vocal tracts and corresponding power spectra.


NASAL AND NASALIZED VOWEL

Fig. 6 shows an example of the graphical window for the variable nasal tract area function in the interactive system. The initial values of the area function are loaded from a text file, which also specifies the sampling frequency and display options such as whether the lips appear on the left or the right. The effects of loss and of the sinuses can be included in the calculation with the loss and sinus check buttons shown in Fig. 6. In the example, the total numbers of sections n in Equation (6) are 21 for the vocal tract and 15 for the nasal tract, the length Delta l of each section in Equation (3) is 0.8 cm, and the nasal tract branches at the 11th section of the vocal tract, which is the velum position.
Fig. 6 An example of graphical window of a variable nasal tract area function.


Fig. 7 shows the synthesized power spectra of the nasals /m/ and /n/. In these spectra, the effects of loss and of the sinuses are not included in the calculation of the transfer function from the chain matrix. From the anti-formants of the power spectra, it is newly found that the 3rd anti-formant frequency is distinctive between /m/ and /n/, whereas the 1st and 2nd anti-formants are not. From the formants of the power spectra, it is newly found that each of the first three formants is split into two peaks.
Fig. 7 Synthesized power spectra of nasal /m/ and /n/.
Fig. 8 is an example of a cephalogram and a spectrum of /i/ uttered by a patient after cleft palate surgery. The smoothed line of the spectrum shows the result analyzed with the Analysis-by-Synthesis technique for pole-zero speech [7]. In the left figure, the area at the velum is clearly not large but is not zero. The speech of the patient is usually nasalized because the velum opens easily. In the right figure, the spectrum of the patient therefore has several anti-formants.
Fig. 9 shows the relation between the area of the velum opening and the anti-formant frequency in the synthesized power spectrum. From the figure, it is also newly found that an anti-formant frequency in the range from 1000 Hz to 2500 Hz depends on the degree of velum opening and is positively correlated with it. The degree of velum opening can therefore be estimated from the anti-formant frequency measured from speech alone. This safe, non-X-ray method of measuring the degree of velum opening may thus be important.
Fig. 8 An example of cephalogram and A-b-S spectrum of /i/ uttered by a patient after cleft palate surgery.
Fig. 9 Relation between the area value of velum opening and anti-formant frequency in synthesized power spectrum.
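The positive correlation between velum opening and anti-formant frequency can be illustrated with a simplified sketch. It assumes a lossless nasal tract without sinuses, an ideal open end at the nostrils, and output taken at the lips only, so that an anti-formant appears where the nasal input impedance seen from the velum, -B/A of the nasal chain matrix, becomes zero. The area values, constants, and function names are hypothetical and are not the system's data or code.

```python
import math

RHO = 1.14e-3    # air density in g/cm^3 (assumed value)
C_SOUND = 3.5e4  # sound velocity in cm/s (assumed value)

def nasal_b_imag(areas, length, freq):
    """Imaginary part of element B of the lossless nasal chain matrix.

    areas (cm^2) run from the velum to the nostrils; B is purely
    imaginary when the sections are lossless.
    """
    k = 2.0 * math.pi * freq / C_SOUND
    cos_kl, sin_kl = math.cos(k * length), math.sin(k * length)
    a, b, c, d = 1.0 + 0j, 0j, 0j, 1.0 + 0j
    for area in areas:
        z0 = RHO * C_SOUND / area
        sa, sb = cos_kl, -1j * z0 * sin_kl
        sc, sd = -1j * sin_kl / z0, cos_kl
        a, b, c, d = (sa * a + sb * c, sa * b + sb * d,
                      sc * a + sd * c, sc * b + sd * d)
    return b.imag

def first_antiformant(areas, length, f_lo=50.0, f_hi=4000.0, step=10.0):
    """Lowest zero of B (Hz), found by a grid scan and bisection."""
    f = f_lo
    while f + step <= f_hi:
        if nasal_b_imag(areas, length, f) * nasal_b_imag(areas, length, f + step) < 0:
            lo, hi = f, f + step
            for _ in range(40):
                mid = 0.5 * (lo + hi)
                if nasal_b_imag(areas, length, lo) * nasal_b_imag(areas, length, mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            return 0.5 * (lo + hi)
        f += step
    return None

# 15 sections of 0.8 cm; the first section is the velum opening.
for velum_area in (0.2, 0.5, 1.0, 2.0):
    nasal = [velum_area] + [2.0] * 14
    print(velum_area, round(first_antiformant(nasal, 0.8)))
```

With these toy values the lowest anti-formant rises as the velum area grows, in the same direction as the relation shown in Fig. 9; the actual frequencies depend on the measured nasal area function, losses, and sinuses.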


CONCLUSION
I developed a system for the visualization and auralization of speech that uses interactively variable vocal and nasal tract area functions, intended for research on the speech production mechanism. For visualization, the variable vocal and nasal area functions, the power spectrum, and the synthesized speech waveform are displayed graphically in a multi-window system. For auralization, the synthesized waveform is converted to an analog signal within the system. For dialect speech such as "Zuzuben" in Japanese, the interactive visualization system newly shows that not only the constriction position of the neutralized tongue but also the degree of mouth opening is an important feature. For nasalized vowels, the system also newly shows that an anti-formant frequency in the range from 1000 Hz to 2500 Hz depends on the degree of velum opening. This dependence is important as the basis of a new non-X-ray measurement method. More details of the system are available from my WWW server. The URL of the server is "http://sp.cis.iwate-u.ac.jp/asva97/".


ACKNOWLEDGMENTS
The author greatly thanks Miss Atsuko Fujimura and Mr. Masaru Sasaki for graphical programming of the system, Dr. Chang-Sheng Yang and Dr. Hideki Kasuya for useful discussion, and Dr. Jianwu Dang for application of the area function of nasal tract.


REFERENCES
  1. Man Mohan Sondhi and Juergen Schroeter: "A Hybrid Time-Frequency Domain Articulatory Speech Synthesizer", IEEE Trans. Acoust., Speech, Signal Process., ASSP-35(7) (July 1987).
  2. Chang-Sheng Yang and Hideki Kasuya: "Dimensional differences in the vocal tract shapes measured from MR images across boy, female and male subjects", J. Acoust. Soc. Jpn., 16(1), 41-44 (1995).
  3. Jianwu Dang and Kiyoshi Honda: "Morphological and acoustical analysis of the nasal and the paranasal cavities", J. Acoust. Soc. Am., 96(4) (Oct. 1994).
  4. John K. Ousterhout: "Tcl and the Tk Toolkit", Addison-Wesley (1994).
  5. Jackie Neider, Tom Davis and Mason Woo: "OpenGL Programming Guide", Addison-Wesley (1995).
  6. Man Mohan Sondhi: "Estimation of vocal-tract areas: The need for acoustical measurements", IEEE Trans. Acoust., Speech, Signal Process., ASSP-27(3), 268-273 (Mar. 1979).
  7. Jouji Miwa and Driszal Fryantoni: "Estimation of Formant and Anti-Formant of Pole-Zero Speech Using Analysis-by-Synthesis Techniques", Speech technical report of IEICE Japan, SP95-129 (Feb. 1996). (in Japanese)
