| I developed a system for the visualization and auralization of speech, based on an interactively variable vocal and nasal tract area function, for research on the speech production mechanism. For visualization, the variable vocal and nasal area functions, the power spectrum, and the synthesized speech waveform are displayed graphically in a multi-window system. For auralization, the synthesized digital waveform is converted to an analog signal within the system. |
| In dialect speech such as Japanese "Zuzuben", the interactive visualization system newly revealed that not only the tongue constriction position, which becomes neutral, but also the degree of mouth opening is an important feature. In nasalized vowel speech, the system also newly revealed that an anti-formant frequency in the range from 1000 Hz to 2500 Hz depends on the degree of velum opening. |
| Visualizing the relation between the articulatory domain (for example, the area function) and the frequency domain (for example, the spectrum) is one of the important objectives of speech production research. I therefore developed a system for the visualization and auralization of speech, based on an interactively variable vocal and nasal tract area function, for research on the speech production mechanism. |
| For visualization, the variable vocal and nasal area functions, the power spectrum, and the speech waveform are displayed graphically in a multi-window system. The spectrum is calculated immediately from the area function using Sondhi's method [1]; the speech waveform is synthesized from the calculated spectrum using an inverse FFT. For auralization, the synthesized digital waveform is converted to an analog signal within the system. |
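The spectrum-to-waveform step can be illustrated with a minimal sketch. It assumes that the transfer function is sampled at N/2 + 1 equally spaced frequencies from 0 Hz to half the sampling frequency, and it uses a direct inverse DFT in place of an optimized FFT. The function name `waveform_from_spectrum` and the flat test spectrum are illustrative assumptions, not the system's C code.

```python
import cmath
import math


def waveform_from_spectrum(half_spectrum):
    """Return one period of a real waveform from spectrum bins 0..N/2.

    Assumption: the bins are equally spaced from 0 Hz to the Nyquist
    frequency, so the missing upper half is the complex-conjugate mirror.
    """
    half = [complex(v) for v in half_spectrum]
    n = 2 * (len(half) - 1)
    # Hermitian extension makes the inverse transform real-valued.
    full = half + [half[n - k].conjugate() for k in range(len(half), n)]
    wave = []
    for t in range(n):
        acc = sum(full[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n))
        wave.append(acc.real / n)
    return wave


# A flat spectrum corresponds to a unit impulse at t = 0.
impulse = waveform_from_spectrum([1.0] * 5)
```

Feeding the calculated transfer function into such a routine yields the impulse response of the tract; how the system applies a voice source to it is not specified here.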
| Each section of the displayed area function can be varied interactively in a graphical window, much like a graphic equalizer in an audio control system. Recently, vocal and nasal tract shapes have become easy to measure accurately with MRI (magnetic resonance imaging). The initial vowel area functions in the system are taken from MRI data of isolated vowels uttered by a male speaker [2]. The initial nasal area functions are taken from isolated nasals uttered by a male speaker [3]. |
| Fig. 1 shows an example of the colored graphical multi-window display of the interactive visualization and auralization system. The system is programmed mainly in C, with Tcl/Tk for the graphical interface of the variable vocal and nasal area function and Mesa, an OpenGL-like library, for displaying a three-dimensional vocal tract shape. It runs on a personal computer under a free UNIX operating system such as Linux. |
| The top of the figure shows the variable vocal and nasal tract area function, in which the area of each section is varied interactively with a slider, much like a graphic equalizer in an audio control system. The middle left shows the power spectrum of the synthesized transfer function; the middle right shows the synthesized speech waveform. Formant frequencies and bandwidths are picked from the power spectrum calculated using Equation (1). The bottom left shows an area function; the bottom right shows a three-dimensional vocal tract shape, which can be rotated interactively with mouse clicks to change the viewpoint. |
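How formant frequencies and bandwidths are picked from the power spectrum is not detailed here. The following minimal sketch assumes simple local-maximum picking on a linear-scale power spectrum, with the bandwidth taken as the width at half the peak power. The function name `pick_formants` and the synthetic single-resonance spectrum are hypothetical.

```python
def pick_formants(freqs, power):
    """Return (peak_frequency, half_power_bandwidth) for each local maximum.

    Assumption: `power` is on a linear scale and sampled on the grid `freqs`.
    """
    results = []
    for i in range(1, len(power) - 1):
        if power[i - 1] < power[i] >= power[i + 1]:
            half = power[i] / 2.0
            lo = i
            while lo > 0 and power[lo] > half:
                lo -= 1  # walk down the left skirt to the half-power point
            hi = i
            while hi < len(power) - 1 and power[hi] > half:
                hi += 1  # walk down the right skirt to the half-power point
            results.append((freqs[i], freqs[hi] - freqs[lo]))
    return results


# Synthetic resonance (hypothetical values): centre 500 Hz, bandwidth 60 Hz.
freqs = [float(f) for f in range(0, 1001)]
power = [1.0 / ((f - 500.0) ** 2 + 30.0 ** 2) for f in freqs]
peaks = pick_formants(freqs, power)
```

The bandwidth resolution of this sketch is limited by the frequency grid spacing; closely spaced formants whose skirts overlap would need a more careful method.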

| The vocal tract shape is approximated with a cylinder or wire-frame model, as shown in Fig. 2. In an articulatory speech synthesizer, the transfer function is therefore calculated as a product of 2 x 2 chain matrices derived from the area function [2]. In the synthesizer, wave propagation in the tract is assumed to be planar and linear. |
| A general chain matrix of a portion of the tract relates the planar output pressure PL and volume velocity UL at the lips or nostrils to the input pressure PG and volume velocity UG on the glottal side, as follows. All symbols in the following equations are capitalized to denote variables in the frequency domain. |

| where the matrix elements Ai, Bi, Ci and Di (i=1,...,n) are defined as follows from the area function ai of the i-th section of the tract, the wave velocity c, and the length Delta l of a homogeneous cylindrical tube; the element ZVN is the input impedance of the nasal branch at the velum.


where the complex variables gamma and sigma are defined as functions of the angular frequency omega.
The transfer function H(omega) is calculated as follows.
|
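The chain-matrix product can be sketched numerically under strong simplifying assumptions: lossless sections, no nasal branch, and an ideal open end at the lips (PL = 0), so that H = UL / UG = 1 / A, where A is the upper-left element of the overall matrix. This is the textbook lossless special case, not the system's full model with loss and the impedance ZVN. The constants `RHO` and `C`, the function names, and the uniform 3 cm^2 area are assumed for illustration.

```python
import cmath
import math

RHO = 1.14e-3  # air density in g/cm^3 (assumed value)
C = 3.5e4      # sound velocity in cm/s (assumed value)


def section_matrix(area, length, omega):
    """Lossless chain matrix (A, B, C, D) of one cylindrical section."""
    theta = omega * length / C
    z0 = RHO * C / area  # characteristic impedance of the section
    return (cmath.cos(theta), -1j * z0 * cmath.sin(theta),
            -1j * cmath.sin(theta) / z0, cmath.cos(theta))


def tract_matrix(areas, length, omega):
    """Multiply section matrices from the glottis (first) to the lips (last)."""
    a, b, c, d = 1.0, 0.0, 0.0, 1.0
    for area in areas:
        sa, sb, sc, sd = section_matrix(area, length, omega)
        a, b, c, d = (sa * a + sb * c, sa * b + sb * d,
                      sc * a + sd * c, sc * b + sd * d)
    return a, b, c, d


def formants(areas, length, f_max=4000.0, step=10.0):
    """Frequencies where A = 0, i.e. the poles of H = 1 / A when PL = 0."""
    found = []
    prev_f, prev_a = None, None
    f = step
    while f <= f_max:
        a = tract_matrix(areas, length, 2.0 * math.pi * f)[0].real
        if prev_a is not None and prev_a * a < 0.0:
            # Linear interpolation of the zero crossing between grid points.
            found.append(prev_f + step * prev_a / (prev_a - a))
        prev_f, prev_a = f, a
        f += step
    return found


# Uniform tube: 21 sections of 0.8 cm (16.8 cm), area 3 cm^2 (hypothetical).
uniform = formants([3.0] * 21, 0.8)
```

For a uniform tube the resonances fall at odd multiples of c / (4 L), which gives a quick sanity check of the matrix product before non-uniform area functions are tried.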
|
Table 1 shows the relation between analyzed (natural) and synthesized (calculated) formant frequencies. Most of the errors for F1 and F2 are within 5 %, with the exception of /i/, where the error is about 13 %. For /i/, the assumption of planar, linear wave propagation is hard to justify, and an FEM (finite element method) model is necessary. The left panel of Fig. 3 shows a periodic relation between the constriction position and the formant frequencies. The right panel of Fig. 3 shows an example of two different vocal tract shapes [2] that have the same synthesized formant frequencies: F1 = 700 Hz, F2 = 1100 Hz, and F3 = 2250 Hz. |
| Vowel | | F1 | F2 | F3 | F4 |
| /a/ | Analysis (Hz) | 703 | 1105 | 2420 | 3475 |
| | Synthesis (Hz) | 694 | 1129 | 2654 | 3624 |
| | Error | 1.3% | 2.2% | 9.7% | 14.9% |
| /i/ | Analysis (Hz) | 306 | 2007 | 2482 | 3257 |
| | Synthesis (Hz) | 345 | 2174 | 2579 | 3392 |
| | Error | 12.7% | 8.3% | 3.9% | 4.1% |
| /u/ | Analysis (Hz) | 449 | 1268 | 2257 | 3337 |
| | Synthesis (Hz) | 434 | 1227 | 2584 | 3519 |
| | Error | -3.3% | -3.2% | 14.5% | 5.5% |
| /e/ | Analysis (Hz) | 557 | 1702 | 2286 | 3574 |
| | Synthesis (Hz) | 559 | 1636 | 2389 | 3617 |
| | Error | 0.4% | 3.9% | 4.5% | 1.2% |
| /o/ | Analysis (Hz) | 534 | 839 | 2470 | 3354 |
| | Synthesis (Hz) | 514 | 884 | 2747 | 3947 |
| | Error | -3.7% | 5.4% | 11.2% | 17.7% |
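The error rows of Table 1 can be checked with a short calculation. The sketch below assumes that the error is defined as (synthesis - analysis) / analysis, expressed in percent; this definition is inferred from the /i/ row and is not stated explicitly. The function name `percent_error` is illustrative.

```python
def percent_error(analysis_hz, synthesis_hz):
    """Relative deviation of the synthesized formant from the analyzed one, in percent."""
    return 100.0 * (synthesis_hz - analysis_hz) / analysis_hz


# Formant frequencies of /i/ (F1..F4) taken from Table 1.
analysis_i = [306, 2007, 2482, 3257]
synthesis_i = [345, 2174, 2579, 3392]
errors_i = [round(percent_error(a, s), 1) for a, s in zip(analysis_i, synthesis_i)]
# errors_i -> [12.7, 8.3, 3.9, 4.1]
```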

|
With the interactive visualization and auralization system, one can not only easily view various vocal tract shapes and the corresponding power spectra but also immediately hear the corresponding synthesized waveform. In dialect speech such as Japanese "Zuzuben", /i/ and /u/ are neutralized vowels: the second formant frequency F2 of /i/ decreases toward 1500 Hz, and F2 of /u/ increases toward 1500 Hz. Fig. 4 shows several vocal tract shapes in which the tongue constriction point for /i/ is progressively neutralized, together with the corresponding power spectra. In the left figure, the glottis is on the left and the lips are on the right. The right figure shows that F2 decreases toward 1500 Hz as the vowel is neutralized. The synthesized sound is also heard as a neutralized vowel. Thus, in dialect speech, the constriction point of the tongue is an important feature. |

|
Fig. 6 shows an example of the graphical window for the variable nasal tract area function in the interactive system. The initial values of the area function are loaded from a text file, which also specifies the sampling frequency and the display format, such as whether the lips appear on the left or the right. The effects of loss and of the sinuses can be included in the calculation with the loss and sinus check buttons shown in Fig. 6. In this example, the total number of sections n in Equation (6) is 21 for the vocal tract and 15 for the nasal tract, the length Delta l of each section in Equation (3) is 0.8 cm, and the nasal tract branches off at the 11th section of the vocal tract, which is the velum position. |
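The section counts above fix the overall geometry of the example. The following minimal sketch records them in a small data structure and derives the tract lengths. The class name `TractConfig`, its field names, and the assumption that sections are counted from the glottis are illustrative and do not describe the system's text file format.

```python
from dataclasses import dataclass


@dataclass
class TractConfig:
    vocal_sections: int = 21        # number of vocal tract sections
    nasal_sections: int = 15        # number of nasal tract sections
    section_length_cm: float = 0.8  # length Delta l of each section
    velum_section: int = 11         # 1-based; assumed counted from the glottis

    def validate(self):
        """Check that the branch point lies inside the vocal tract."""
        if not 1 <= self.velum_section <= self.vocal_sections:
            raise ValueError("velum section must lie inside the vocal tract")

    def vocal_length_cm(self):
        return self.vocal_sections * self.section_length_cm

    def nasal_length_cm(self):
        return self.nasal_sections * self.section_length_cm


config = TractConfig()
config.validate()
lengths = (config.vocal_length_cm(), config.nasal_length_cm())
```

With the values of the example, the vocal tract is 16.8 cm long and the nasal tract 12 cm long.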

|
Fig. 7 shows the synthesized power spectra of the nasals /m/ and /n/. In these spectra, the effects of loss and of the sinuses are not included when the transfer function is calculated from the chain matrix. From the anti-formants of the power spectra, it is newly found that the 3rd anti-formant frequency is distinctive, whereas the 1st and 2nd anti-formants are not. From the formants of the power spectra, it is newly found that the first three formants are split into two peaks. |
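One textbook source of anti-formants in nasal consonants is the closed oral cavity, which acts as a side branch at the velum. For a lossless uniform tube closed at its far end, the input impedance is proportional to cot(omega L / c) and vanishes at odd multiples of c / (4 L), and each such frequency produces a zero in the transfer function to the nostrils. The sketch below evaluates only this idealized case. The branch length of 8 cm, the sound velocity, and the function name `closed_branch_zeros` are assumptions; the spectra in Fig. 7 come from measured, non-uniform area functions.

```python
def closed_branch_zeros(length_cm, f_max_hz, c_cm_per_s=3.5e4):
    """Zero frequencies of a lossless uniform side branch closed at its far end."""
    zeros = []
    n = 1
    while True:
        f = (2 * n - 1) * c_cm_per_s / (4.0 * length_cm)
        if f > f_max_hz:
            break
        zeros.append(f)
        n += 1
    return zeros


# Hypothetical 8 cm closed oral cavity, zeros searched up to 5 kHz.
zeros = closed_branch_zeros(8.0, 5000.0)
```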

| Fig. 8 shows an example of a cephalogram and a spectrum of /i/ uttered by a patient who had undergone cleft palate surgery. The smooth line in the spectrum shows the result of analysis with the Analysis-by-Synthesis technique for pole-zero speech \cite{abs}. In the left figure, the area at the velum is clearly small but not zero. The patient's speech is usually nasalized because the velum opens easily. Accordingly, in the right figure, the patient's spectrum has several anti-formants. Fig. 9 shows the relation between the area of the velum opening and the anti-formant frequency in the synthesized power spectrum. From this figure, it is also newly found that the anti-formant frequency in the range from 1000 Hz to 2500 Hz depends on the degree of velum opening and is positively correlated with it. The degree of velum opening can thus be estimated from the anti-formant frequency measured from speech alone. This safe, non-X-ray method of measuring the degree of velum opening may therefore be important. |


| I developed a system for the visualization and auralization of speech, based on an interactively variable vocal and nasal tract area function, for research on the speech production mechanism. For visualization, the variable vocal and nasal area functions, the power spectrum, and the synthesized speech waveform are displayed graphically in a multi-window system. For auralization, the synthesized waveform is converted to an analog signal within the system. In dialect speech such as Japanese "Zuzuben", the interactive visualization system newly revealed that not only the tongue constriction position, which becomes neutral, but also the degree of mouth opening is an important feature. In nasalized vowel speech, the system also newly revealed that an anti-formant frequency in the range from 1000 Hz to 2500 Hz depends on the degree of velum opening. This finding is important as the basis for a new non-X-ray measurement method. More details of the system are available from my WWW server at "http://sp.cis.iwate-u.ac.jp/asva97/". |
| The author sincerely thanks Miss Atsuko Fujimura and Mr. Masaru Sasaki for the graphical programming of the system, Dr. Chang-Sheng Yang and Dr. Hideki Kasuya for useful discussions, and Dr. Jianwu Dang for the use of the nasal tract area function. |