E14-SFE0000187
Bamini, Praveen Kumar.
FPGA-based implementation of concatenative speech synthesis algorithm
[electronic resource] /
by Praveen Kumar Bamini.
[Tampa, Fla.] :
University of South Florida,
Thesis (M.S.Cp.E.)--University of South Florida, 2003.
Includes bibliographical references.
Text (Electronic thesis) in PDF format.
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
Title from PDF of title page.
Document formatted into pages; contains 68 pages.
ABSTRACT: The main aim of a text-to-speech synthesis system is to convert ordinary text into an acoustic signal that is indistinguishable from human speech. This thesis presents an architecture to implement a concatenative speech synthesis algorithm targeted to FPGAs. Many current text-to-speech systems are based on the concatenation of acoustic units of recorded speech. Current concatenative speech synthesizers are capable of producing highly intelligible speech. However, the quality of speech often suffers from discontinuities between the acoustic units, due to contextual differences. This is the easiest method to produce synthetic speech. It concatenates prerecorded acoustic elements and forms a continuous speech element. The software implementation of the algorithm is performed in C whereas the hardware implementation is done in structural VHDL. A database of acoustic elements is formed first with recording sounds for different phones. The architecture is designed to concatenate acoustic elements corresponding to the phones that form the target word. Target word corresponds to the word that has to be synthesized. This architecture doesn't address the form discontinuities between the acoustic elements as its ultimate goal is the synthesis of speech. The Hardware implementation is verified on a Virtex (v800hq240-4) FPGA device.
Adviser: Katkoori, Srinivas
Computer Engineering
USF Electronic Theses and Dissertations.
FPGA-based Implementation of Concatenative Speech Synthesis Algorithm

by

Praveen Kumar Bamini

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Srinivas Katkoori, Ph.D.
Murali Varanasi, Ph.D.
Sanjukta Bhanja, Ph.D.

Date of Approval: October 29, 2003

Keywords: Processing, VHDL, Verification, Sequential Circuits, Waveforms

© Copyright 2003, Praveen Kumar Bamini
DEDICATION

To all my family members and friends who always believed me.
ACKNOWLEDGEMENTS

My sincere thanks to Dr. Srinivas Katkoori, who provided me an opportunity to work with him, and for the encouragement and necessary support he has rendered throughout my master's program. I would also like to thank Dr. Varanasi and Dr. Sanjukta Bhanja for being on my committee. I also acknowledge my friends Ramnag, Sarath, Sudheer Lolla, Sunil, Venu, Jyothi, Babu, and Hariharan for their support. I really appreciate the support of my friends in the VCAPP group, especially Saraju Prasad Mohanty, who constantly helped me during the course of this research work. I also thank Daniel Prieto for the technical help he provided during my research.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
    1.1 Introduction to speech synthesis
    1.2 Applications of text-to-speech synthesis
    1.3 Attributes of text-to-speech synthesis
    1.4 Concatenative speech synthesis
    1.5 Summary
CHAPTER 2 RELATED LITERATURE
    2.1 Review of prior work
    2.2 Review of proposed approaches to improve concatenation
        2.2.1 Spectral smoothing
        2.2.2 Optimal coupling
        2.2.3 Waveform interpolation
        2.2.4 Removing linear phase mismatches
    2.3 Summary
CHAPTER 3 PROPOSED SYNTHESIS ALGORITHM AND ARCHITECTURE
    3.1 Acoustic library
    3.2 Parsing unit
    3.3 Concatenating unit
    3.4 Speech synthesis algorithm
    3.5 Architectural design of the synthesis system
    3.6 Pseudo-code of the algorithm
        3.6.1 Description of the pseudo-code
    3.7 Design of the controller
        3.7.1 Procedure for the controller design
        3.7.2 Design of the state diagram and state table
        3.7.3 Flip-flop input and output equations
    3.8 Datapath design for the system
        3.8.1 Components of the datapath
            3.8.1.1 Input module
            3.8.1.2 RAM
            3.8.1.3 Comparator
            3.8.1.4 Register counter K
            3.8.1.5 Output module
        3.8.2 Functional description of the datapath
    3.9 Schematic diagram of the synthesis system
    3.10 Summary
CHAPTER 4 IMPLEMENTATION OF THE ARCHITECTURE AND EXPERIMENTAL RESULTS
    4.1 Structural VHDL design for the architecture
    4.2 Functional description of the synthesis system
    4.3 Synthesis and simulation results
        4.3.1 Device utilization data summary
CHAPTER 5 CONCLUSION
REFERENCES
APPENDICES
    Appendix A
LIST OF TABLES

Table 3.1 State table for the controller
Table 4.1 The utilization data summary of the system
Table 4.2 The utilization data summary of the system with audioproject
LIST OF FIGURES

Figure 1.1 Representation of a sinusoidal periodic sound wave
Figure 1.2 Quality and task independence in speech synthesis approaches
Figure 1.3 Spectrogram of a word `when'
Figure 1.4 Spectrogram of a concatenated word `when' formed by concatenating segments `whe' and `en'
Figure 2.1 Block diagram of the text-to-speech synthesis process
Figure 3.1 Block diagram of the proposed speech synthesis algorithm
Figure 3.2 The top-level block diagram of general digital system consisting controller and datapath
Figure 3.3 Design procedure for the finite state machine
Figure 3.4 State diagram for the controller
Figure 3.5 Pin diagram of the controller
Figure 3.6 RTL schematic of the datapath
Figure 3.7 Pin diagram of the datapath
Figure 3.8 Pin diagram of the synthesis system
Figure 3.9 RTL schematic diagram of the synthesis system
Figure 4.1 Hierarchy of the implementation of the system architecture in VHDL
Figure 4.2 Output waveform of the audio player
Figure 4.3 Sample output waveform of the controller
Figure 4.4 Sample output waveform 1 of the synthesized word `bamite'
Figure 4.5 Sample output waveform 3 of the synthesized word `gemini'
Figure 4.6 Sample output waveform 2 of the synthesized word `devote'
Figure 4.7 Spectrogram of a spoken word `goose'
Figure 4.8 Spectrogram of synthesized word `goose'
Figure 4.9 Spectrogram of a spoken word `byte'
Figure 4.10 Spectrogram of synthesized word `byte'
CHAPTER 1
INTRODUCTION

1.1 Introduction to speech synthesis

Speech is the primary means of communication between people. Speech synthesis and recognition, often called voice input and output, have a very special place in the world of man-machine communication. The keyboard, the touch-sensitive screen, the mouse, the joystick, image processing devices, etc., are well-established media of communication with the computer. As language is the most natural means of communication, speech synthesis and recognition are the best means for communicating with the computer [1].

A sound wave is caused by the vibration of molecules. Speech is a continuously varying sound wave which links speaker to listener. Sound travels in a medium such as air, water, glass, or wood; the most usual medium is air. When a sound is made, the molecules of air nearest to our mouths are disturbed in a manner which sets them oscillating about their point of rest. A chain reaction begins as each molecule propagates this effect when it collides with other molecules in its surroundings. This chain reaction eventually dissipates some distance from the speaker. The distance traveled depends solely on the energy initially imparted by the vocal cords. The maximum distance a vibrating molecule moves from its point of rest is known as the amplitude of vibration. In one complete cycle of motion a molecule starts from its point of rest, goes to maximum displacement on one side, then on the other side, and finally returns to its point of rest.
The time taken for one complete cycle is called the period. The number of times this complete cycle occurs in one second is termed the frequency. Sounds which have the same period in successive cycles are called periodic sounds, and those that do not are aperiodic sounds. A common example of a periodic sound is the note produced by striking a tuning fork, where the sound stays fairly constant throughout its span. The shape of a sound wave can be influenced by the presence of harmonics. Fig. 1.1 represents a sinusoidal periodic sound waveform from point A to point B.

[Figure 1.1. Representation of a sinusoidal periodic sound wave (labels: amplitude, frequency, period, points A and B)]

Sound is a common form of multimedia used for many day-to-day applications; games, presentations, and even operating system feedback provide an audio source to a computer user. Analog-to-Digital (A/D) converters and Digital-to-Analog (D/A) converters are often used to interface such an audio source to a computer.

Speech synthesis, the automatic generation of speech waveforms, has been under development for several decades. Intelligibility, naturalness, and variability are the three criteria used to distinguish human speech and synthetic speech. Intelligibility is the measure of the understandability of speech. Naturalness is the measure of the human factor in the speech, that is, how close the synthetic speech is to human speech. Recent progress in speech synthesis has produced synthesizers with very high intelligibility, but sound quality and naturalness still remain a major problem to be addressed. However, for several applications such as multimedia and telecommunications the quality of present synthesizers has reached an adequate level.

Text-to-speech is a process through which text is rendered as digital audio and then "spoken". Continuous speech is a set of complicated audio signals. This complexity makes it difficult to produce them artificially. Speech signals are usually considered as voiced or unvoiced, but in some cases they are something between these two.
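The definitions of amplitude, period, and frequency given in this section can be checked numerically. The following Python sketch is an illustration only and is not part of the thesis implementation; the function names `sample_sine` and `estimate_period_s`, the 8000 Hz sampling rate, and the 100 Hz test tone are assumptions chosen for the example.

```python
import math


def sample_sine(frequency_hz, amplitude, sample_rate_hz, duration_s):
    """Return samples of amplitude * sin(2*pi*f*t) taken at the given rate."""
    count = int(round(sample_rate_hz * duration_s))
    return [
        amplitude * math.sin(2.0 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(count)
    ]


def estimate_period_s(samples, sample_rate_hz):
    """Estimate the period from the average spacing of upward zero crossings."""
    crossings = [
        n for n in range(1, len(samples))
        if samples[n - 1] < 0.0 <= samples[n]
    ]
    if len(crossings) < 2:
        raise ValueError("need at least two upward zero crossings")
    span_in_samples = crossings[-1] - crossings[0]
    return span_in_samples / ((len(crossings) - 1) * sample_rate_hz)


# Hypothetical test tone: 100 Hz, amplitude 0.5, sampled at 8000 Hz for 0.5 s.
tone = sample_sine(100.0, 0.5, 8000.0, 0.5)
period_s = estimate_period_s(tone, 8000.0)
frequency_hz = 1.0 / period_s              # frequency is the reciprocal of the period
peak_amplitude = max(abs(s) for s in tone)  # maximum displacement from rest
```

The estimate is only as fine as the sampling grid allows, so the recovered period is approximate; averaging over many cycles keeps the error small for a steady tone like this one.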
1.2 Applications of text-to-speech synthesis

Text-to-speech transforms linguistic information stored as text into speech [3]. It is widely used in audio reading devices for the disabled, such as blind people.

- Text-to-speech is most useful for situations when prerecording is not practical.
- Text-to-speech can be used to read dynamic text.
- Text-to-speech is useful for proofreading. Audible proofreading of text and numbers helps the user catch typing errors missed by visual proofreading.
- Text-to-speech is useful for phrases that would occupy too much storage space if they were prerecorded in digital-audio format.
- Text-to-speech can be used for informational messages. For example, to inform the user that a print job is complete, an application could say "Printing complete" rather than displaying a message and requiring the user to acknowledge it.
- Text-to-speech can provide audible feedback when visual feedback is inadequate or impossible. For example, the user's eyes might be busy with another task, such as transcribing data from a paper document.
- Users that have low vision may rely on text-to-speech as their sole means of feedback from the computer.
- It is always possible to use concatenated word/phrase text-to-speech to replace recorded sentences. The application designer can easily pass the desired sentence strings to the text-to-speech engine.

1.3 Attributes of text-to-speech synthesis

Output quality is the most basic and important attribute of any speech synthesis mechanism. It is common for a single system to sound good on one sentence and pretty bad on the next. This makes it essential to consider both the quality of the best sentences and the percentage of sentences for which maximum quality is achieved. Here we consider four different families of speech generation mechanisms [4].

- Limited-domain waveform concatenation. This mechanism can generate very high quality speech with only a small number of recorded segments. This method is used in most interactive voice response systems.
- Concatenative synthesis with no waveform modification. This approach can generate good quality on a large set of sentences, but the quality can be bad when poor concatenation takes place.
- Concatenative synthesis with waveform modification. This approach has more flexibility and less mediocrity, as waveforms can be modified to allow a better prosody match. The downside is that waveform modification degrades the overall quality of the output speech.
- Rule-based synthesis. This mechanism provides uniform sound across different sentences, but the quality, when compared to the other mechanisms, is poor.

The characteristics of the above methods are illustrated in Figure 1.2.

[Figure 1.2. Quality and task independence in speech synthesis approaches (axes: best quality, low to high; percentage of sentences with maximum quality, low to high; plotted approaches: limited-domain concatenation, concatenative with no wave modification, concatenative with wave modification, rule-based)]

According to the speech generation model used, speech synthesis can be classified into three categories [4].
- Articulatory synthesis
- Formant synthesis
- Concatenative synthesis

Based on the degree of manual intervention in the design, speech synthesis can be classified into two categories [4].

- Synthesis by rule
- Data-driven synthesis

Articulatory synthesis and formant synthesis fall in the synthesis-by-rule category, whereas concatenative synthesis comes under the data-driven synthesis category. Formant synthesis uses a source-filter model, where the filter is characterized by slowly varying formant frequencies of the vocal tract. An articulatory synthesis system makes use of a physical speech production model that includes all the articulators and their mechanical motions [1, 4]. It produces high quality speech as it tries to model the human vocal cords. Strictly speaking, concatenative synthesis is not exactly a speech synthesis mechanism, but it is one of the most commonly used text-to-speech approaches. This thesis mainly deals with concatenative speech synthesis.

Basically there are two techniques that most speech synthesizers use for speech synthesis.

- Time domain technique
- Frequency domain technique

In the time domain technique the stored data represent a compressed waveform as a function of time. The main aim of this technique is to produce a waveform which may not resemble the original signal spectrographically but will be perceived to be the same by the listener. Compared to the frequency domain technique, its implementation needs relatively simple equipment, such as a D/A converter and a post-sampling filter when Pulse Code Modulation (PCM) is used. The quality of speech can be controlled by the sampling rate. Frequency domain synthesis is based on the modeling of human speech and requires more circuitry [4].

The text-to-speech (TTS) synthesis-by-rule procedure involves two main phases.

- Text analysis phase
- Speech generation phase

The first phase is text analysis, where the input text is transcribed into a phonetic or some other linguistic representation. The second is the generation of speech waveforms, where the acoustic output is produced from this phonetic and prosodic information. These two phases are usually called high-level synthesis and low-level synthesis. The input text might be, for example, data from a word processor, standard ASCII from e-mail, a mobile text message, or scanned text from a newspaper. The character string is then preprocessed and analyzed into a phonetic representation, which is usually a string of phonemes with some additional information for correct intonation, duration, and stress. Speech sound is finally generated by the low-level synthesizer from the information supplied by the high-level one.

1.4 Concatenative speech synthesis

The simplest way of producing synthetic speech is to play long prerecorded samples of natural speech, such as single words or sentences. In a concatenated word engine, the application designer provides recordings for phrases and individual words. The engine pastes the recordings together to speak out a sentence or phrase. If you use voice mail then you have heard one of these engines speaking, "[You have] [three] [new messages]". The engine has recordings for "You have", all of the digits, and "new messages". In contrast, a text-to-speech engine that uses synthesis generates sounds similar to those created by the human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape, and tongue position. The concatenation method provides high quality and naturalness. However, the downside is that it usually requires more memory capacity than other methods. The method is very suitable for some announcing and information systems. However, it is quite clear that we cannot create a database of all words and common names in the world.
It is perhaps even inappropriate to call this speech synthesis, because it contains only recordings. Thus, for unrestricted speech synthesis (text-to-speech) we have to use shorter pieces of speech signal, such as syllables, phonemes, diphones, or even shorter segments.

A text-to-speech engine that uses subword concatenation links short digital-audio segments together and performs intersegment smoothing to produce a continuous sound. Subword segments are acquired by recording many hours of a human voice and painstakingly identifying the beginning and ending of phonemes. Although this technique can produce a more realistic voice, it is laborious to create a new voice. This approach requires neither rules nor manual tuning. An utterance is synthesized by concatenating together several speech segments [5, 6]. When two speech segments that were not adjacent to each other in the original recording are concatenated, there can be spectral discontinuities. Spectral and prosodic discontinuities result from mismatches of formants and pitch, respectively, at the concatenation point [6, 4]. This may be due to lack of identical context for the units, intra-speaker variation, or errors in segmentation. The problem can be reduced by techniques such as spectral smoothing, manual adjustment of unit boundaries, use of longer units to reduce the number of concatenations, manual reselection of units, etc. The downside of using spectral smoothing is that it decreases the naturalness of the resulting synthesized speech [6]. This problem is clearly illustrated in Figures 1.3 and 1.4.

[Figure 1.3. Spectrogram of a word `when' (axes: time, frequency)]
Figure 1.3 is the spectrogram of the sample word `when' spoken as a whole. There we can observe a continuous spectral arrangement.

[Figure 1.4. Spectrogram of a concatenated word `when' formed by concatenating segments `whe' and `en' (axes: time, frequency)]

Figure 1.4 represents the spectrogram of the concatenated word `when'. When the word `when' is concatenated from the two audio segments `whe' and `en', discontinuous spectral lines can be observed. This is because of the time gap between the boundaries of the formants of `whe' and `en'. Synthesized text-to-speech inevitably sounds unnatural and weird. However, it is very good for character voices that are supposed to be robots, aliens, or maybe even foreigners. For an application which cannot afford to have recordings of all the possible dialogs, or whose dialogs cannot be recorded ahead of time, text-to-speech remains the only alternative [7].
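The concatenated word engine described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: recordings are modeled as plain lists of PCM sample values, and the names `concatenate_recordings`, `recordings`, and the tiny sample lists are hypothetical.

```python
def concatenate_recordings(recordings, phrase_keys):
    """Paste prerecorded sample lists together in the order requested.

    recordings  -- dict mapping a phrase to its list of PCM samples (assumed layout)
    phrase_keys -- phrases to speak, in order
    Returns the joined samples and the sample offsets where one recording
    ends and the next begins; these are the concatenation points at which
    discontinuities may be heard.
    """
    output = []
    join_offsets = []
    for index, key in enumerate(phrase_keys):
        if key not in recordings:
            raise KeyError(f"no recording available for {key!r}")
        if index > 0:
            join_offsets.append(len(output))
        output.extend(recordings[key])
    return output, join_offsets


# Hypothetical three-entry library with very short stand-in recordings.
recordings = {
    "you have": [1, 2],
    "three": [3],
    "new messages": [4, 5, 6],
}
samples, joins = concatenate_recordings(
    recordings, ["you have", "three", "new messages"]
)
```

Nothing here smooths the joins; the sketch simply shows why the approach needs a recording for every unit it may be asked to speak.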
1.5 Summary

This thesis presents an architecture and implementation of a concatenative speech synthesis algorithm on an FPGA device, and its performance is measured. Chapters 1 and 2 introduce speech synthesis and discuss the related literature, concatenative speech synthesis systems, techniques to improve the quality of speech, and techniques to obtain phase coherence in concatenative speech synthesis. Chapter 3 discusses in detail the algorithm used as part of this thesis work. This chapter also describes the target architecture developed to implement the proposed algorithm. It also contains the top-level digital design as well as the modular VHDL design for the synthesis process. Chapter 4 discusses the implementation of the synthesis algorithm and the experimental results. Chapter 5 presents conclusions and suggestions.
CHAPTER 2
RELATED LITERATURE

2.1 Review of prior work

In this section a brief overview of existing literature and research work in speech synthesis is presented. As the central topic is concatenative speech synthesis, the literature review is limited to it.

Text-to-speech synthesis is a process in which input text, after natural language processing and digital signal processing, is converted to output speech. The general text-to-speech synthesis system is shown as a block diagram in Figure 2.1.

[Figure 2.1. Block diagram of the text-to-speech synthesis process (blocks: text analysis, letter-to-sound, and prosody, with dictionary and rules; assemble units that match input targets, with store of sound units; synthesizer. Signals: message text as alphabetic characters; phonetic symbols and prosody targets; speech)]

The first block represents a text analysis module that takes text and converts it into a set of phonetic symbols and prosody targets. The input text is first analyzed and then transcribed. The transcribed text goes through a syntactic parser which, with the help of a pronunciation dictionary, generates a string of phonemes. Using the transcribed text and the syntactic and phonological information, the prosody module generates targets. The second block searches for the input targets in the store of sound units and assembles the units that match the input targets. The assembled units are then fed to a synthesizer that generates the speech waveform [8, 9].

There are three important characteristics that define a good speech synthesis system.

- Good concatenation method.
- Good database of synthesis units.
- Ability to produce natural prosody across the concatenated units.

An important factor in concatenative speech synthesis is the availability of efficient search techniques, which enable us to search a very large number of available sound segments in real time for the optimal output word [10]. Concatenative speech synthesis also faces problems such as storage and intelligibility.

Becker and Poza [11] proposed a model of a concatenative speech synthesis system called the dyad synthesis system. A dyad is defined as the representation of the last half of one sound and the first half of the next sound. On theoretical grounds, Wang and Peterson [12] calculated that an infinitely large set of messages could be produced using approximately 800 dyads for monotonic speech and 8500 dyads for intonated speech. The intonation of the speech can be improved by enlarging the inventory of audio segments. With this dyad synthesis system any English message can be constructed from a stored inventory of audio segments that fits in commercially available RAMs.

Many approaches have been proposed to improve speech synthesis. These are described in detail in Section 2.2. Bulut, Narayanan, and Syrdal [2] carried out an experiment on synthesizing speech with various emotions. They tried to synthesize four emotional states (anger, happiness, sadness, and neutral) using a concatenative speech synthesizer. As part of that experiment they constructed a search database of sounds from separately recorded inventories of the different emotions.
In their results they classified anger as inventory dominant and sadness and neutral as prosody dominant, whereas the results were not conclusive enough for happiness. They concluded that various combinations of inventory and prosody of basic emotions may synthesize different emotions.
2.2 Review of proposed approaches to improve concatenation

The common problems in concatenative speech synthesis are the following.

- Occurrence of unseen text. This is a commonplace occurrence in text-to-speech synthesis, as it is practically impossible to obtain acoustic representations for all the contexts that could occur in speech, given the vast number of text combinations, memory constraints, etc.
- Spectral discontinuity at concatenation points. This is explained in detail in the previous chapter.

Based on Hidden Markov Models, prosodic parameters, and contextual labels, Low and Vaseghi [13] proposed a method for selecting speech segments for the database. They also proposed a method for synthesizing unseen text from the existing speech units. Combining these methods solves the unseen text problem while also satisfying the memory constraints. They also discussed the use of linear interpolation techniques to obtain spectral smoothing at the concatenation point.

Chappell and Hansen [7] discussed performing effective concatenative speech synthesis with a limited database of audio segments by smoothing the transitions between the speech segments. They describe the use of various techniques for spectral smoothing, such as optimal coupling, waveform interpolation, linear predictive pole shifting, and psychoacoustic closure. After the smoothing is done, the output speech can be approximated to the desired speech characteristics.

2.2.1 Spectral smoothing

This technique is used to smooth the transitions between concatenated speech segments. The general approach is to take one frame of speech from the edge of each segment and interpolate between them [7]. Determining the circumstances in which smoothing is to be performed is an important issue for this technique.
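The general approach of Section 2.2.1, taking one frame from the edge of each segment and interpolating between them, can be sketched as follows. This is a hedged illustration, not the method of [7] or of this thesis: a frame is assumed to be a plain list of spectral parameters, linear interpolation is one possible choice, and `interpolate_frames` is a hypothetical name.

```python
def interpolate_frames(left_frame, right_frame, steps):
    """Return `steps` frames that move linearly from left_frame to right_frame.

    left_frame  -- last frame of the first segment (list of floats, assumed)
    right_frame -- first frame of the second segment (same length)
    The two edge frames themselves are not included in the result; the
    returned frames are meant to be inserted between the segments.
    """
    if len(left_frame) != len(right_frame):
        raise ValueError("frames must have the same number of parameters")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    frames = []
    for k in range(1, steps + 1):
        weight = k / (steps + 1)  # fraction of the way toward right_frame
        frames.append([
            (1.0 - weight) * a + weight * b
            for a, b in zip(left_frame, right_frame)
        ])
    return frames


# Two hypothetical two-parameter edge frames bridged by three inserted frames.
bridge = interpolate_frames([0.0, 4.0], [4.0, 0.0], 3)
```

Deciding when to smooth at all, which the text identifies as an important issue, is outside the scope of this sketch.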
2.2.2 Optimal coupling

This is a very cost-efficient spectral smoothing technique. It allows the boundaries of speech segments to be moved to provide the best fit with adjacent segments. Basically, a measure of mismatch is tested at a number of possible segment boundaries until the closest fit is found.

2.2.3 Waveform interpolation

In this technique a waveform is interpolated in either the time or the frequency domain. The waveform is interpolated between the frames at the edges of speech segments, so that smoothed data can be inserted between them. Lukaszewicz and Karjalainen [14] experimented with many ways to overcome difficulties in concatenating speech sample waveforms. They introduced a new method of speech synthesis called the microphonemic method. The fundamental idea of this method is to apply pitch changes for intonations and transitions by mixing parts of neighboring phoneme prototypes.

2.2.4 Removing linear phase mismatches

One important issue in concatenative speech synthesis is the synchronization of acoustic units. In the context of concatenative speech synthesis there are two types of phase mismatch [15].

- Linear phase mismatch: This refers to the interframe mismatch.
- System phase mismatch: This occurs around the concatenation point. It introduces noise between harmonic peaks, thus affecting the output quality.

Stylianou [10, 15] proposed a neural network based Harmonics plus Noise Model (HNM) for the modification of the fundamental frequency and duration of speech signals. The author extensively discussed the issues that arise when this model is used to concatenate and smooth between acoustic units in a speech synthesizer. The HNM represents speech as a sum of sinusoids plus a filtered and time-modulated noise. O'Brien and Monaghan [16, 17] proposed an alternative harmonic model that uses a bandwidth expansion technique, which corrects pitch as well as time scale, to represent the aperiodic elements of the speech signal. They applied this technique to concatenative speech synthesis, and the results obtained compare favorably to implementations of other common speech synthesis methods. They also presented a smoothing algorithm that corrects phase mismatches at unit boundaries.

Wouters and Macon [18] discuss minimizing audible distortions at the concatenation point in order to attain smooth concatenation of speech segments. Even with the largest sample space of speech segments, there is a possibility that the end of one segment may not match the beginning of the next segment. Therefore the residual discontinuities have to be smoothed to the largest extent possible to attain quality output speech. Wouters and Macon [18] proposed a technique called unit fusion, which controls the spectral dynamics at the concatenation point between two speech segments by fusing the segments together. They also proposed a signal processing technique based on sinusoidal modeling that modifies spectral shape, thus enabling quality resynthesis of units.

Tatham and Lewis [19, 20] discussed a high-level modular text-to-speech synthesis system called SPRUCE. SPRUCE is based on the naturalness factor in synthesis, which is normally obtained by modeling intonation, rhythm, and variability.

2.3 Summary

A survey of various approaches proposed in text-to-speech synthesis is presented. An overview is given of the techniques proposed to improve concatenative speech synthesis, such as optimal coupling, waveform interpolation, and spectral smoothing.
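As an illustration of the optimal coupling idea reviewed in Section 2.2.2, the sketch below tests a mismatch measure at a set of candidate boundaries and keeps the closest fit. It is an assumption-laden example rather than the published technique: frames are plain lists of spectral parameters, the mismatch measure is a squared Euclidean distance chosen for simplicity, and `frame_mismatch` and `best_boundary` are hypothetical names.

```python
def frame_mismatch(frame_a, frame_b):
    """Squared Euclidean distance between two frames (an assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(frame_a, frame_b))


def best_boundary(left_frames, right_frames, left_candidates, right_candidates):
    """Search candidate cut points for the pair with the smallest mismatch.

    left_candidates  -- indices in left_frames where the first segment may end
    right_candidates -- indices in right_frames where the second may begin
    Returns (mismatch, left_index, right_index) for the closest fit.
    """
    best = None
    for i in left_candidates:
        for j in right_candidates:
            cost = frame_mismatch(left_frames[i], right_frames[j])
            if best is None or cost < best[0]:
                best = (cost, i, j)
    if best is None:
        raise ValueError("no candidate boundaries were supplied")
    return best


# Hypothetical frames: ending the left segment at frame 2 and starting the
# right segment at frame 0 gives the closest fit among the candidates.
left = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
right = [[2.1, 2.0], [5.0, 5.0], [6.0, 6.0]]
cost, left_cut, right_cut = best_boundary(left, right, [2, 3], [0, 1])
```

The exhaustive search is quadratic in the number of candidates, which is acceptable only because a handful of boundaries near each segment edge would normally be considered.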
CHAPTER 3
PROPOSED SYNTHESIS ALGORITHM AND ARCHITECTURE

Concatenative speech synthesis is capable of producing natural and intelligible speech whose quality closely matches the voice quality of the speaker who recorded the audio segments from which the concatenation units are drawn. Concatenative synthesis has the advantages of being simple to implement and requiring a relatively small database of speech.

This chapter discusses the proposed speech synthesis algorithm. The speech synthesis algorithm described here is part of the text-to-speech synthesis. The block diagram of the algorithm is shown in Figure 3.1. The first block in the diagram represents the acoustic library, where all the audio recordings are stored. The second block represents the parsing unit, which breaks an input word (target word) into small segments to be searched in the acoustic library. The third and final block represents the concatenating unit, which concatenates the matched audio segments to synthesize the output word. The detailed explanation of the speech synthesis system and the proposed speech synthesis algorithm is given in the following sections.

Figure 3.1. Block diagram of the proposed speech synthesis algorithm (input word, parsing unit, library of acoustic representations for small word segments, concatenating unit, output speech)

The speech synthesis system primarily consists of three units:

- Acoustic library
- Parsing unit
- Concatenation unit

3.1 Acoustic library

This library contains the utterances from a single speaker recorded for the small word segments of interest. As it is not possible to account for all the possible combinations of words, we have recorded a few selected word segments of four, three, and two letters. The library consists of around 250 such representations. The duration of each audio segment is standardized to 0.50 seconds. There are some important considerations to be taken care of during the construction of the acoustic library:

- Since recording is often done in several sessions, it is necessary to keep the recording conditions constant to maintain spectral and amplitude continuity. The recording devices, such as the microphone and sound card, should remain the same throughout all the recording sessions. Changes in recording conditions may cause spectral or amplitude discontinuity [21].
- High quality speech synthesis can be obtained with a sample set containing a large number of utterances, but it requires a large memory. If the text read by the target speaker is restricted to be representative of the text used in the application, the memory requirement can be kept down without sacrificing the quality of speech synthesis.

3.2 Parsing unit

This unit breaks the input word into smaller segments to be searched for matching acoustic representations in the acoustic library. The parser first takes the first four letters of the input word. If no corresponding audio segment is found, it takes just the first three letters and goes through the search process again. If the search for the three-letter segment fails, the process is repeated for the first two letters of the input word.
3.3 Concatenating unit

This unit concatenates the matched segments into a single unit that forms the synthesized output. The synthesized output can be further processed and smoothed to get a more refined output that closely resembles the target word. This unit does not perform the smoothing process.

3.4 Speech synthesis algorithm

- The first step in the speech synthesis algorithm is parsing an input word. The input word is fed to the parsing unit. The parser takes the input word and breaks it into smaller segments depending upon the acoustic library, as explained below.
- The parser first takes the first four letters of the input word and then searches for the acoustic unit corresponding to that segment. If the corresponding acoustic unit is found in the library, it is sent to the concatenating unit. If no match is found, the parser takes just the first three letters and repeats the same procedure. When the search fails for the three-letter segment, the process is carried out for the first two letters as well, as explained in the following example. Let us assume `stampede' as the input word. The parsing unit first takes the first four letters of the input word and forms the segment `stam'. It searches for the corresponding audio segment in the acoustic library. If it is found, the parser proceeds with the rest of the input word in the same manner. If no corresponding audio segment is found, it searches for the segment `sta' and proceeds as explained above. If no corresponding acoustic unit is found at all, it sends a message that the library does not contain the required segment.
- Once the above step is finished for a single segment, the same procedure is repeated for the rest of the segments of the input word.
- When all the audio segments corresponding to the input word are available in the library, the concatenating unit concatenates them to form the output word.
- All the matched audio segments are first read as .wav files by a MATLAB program, which then concatenates all the matched segments and synthesizes the output word.
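The parsing and concatenation steps above can be sketched in a few lines of Python. This is a minimal illustration and not the C/MATLAB implementation from Appendix A: the names `parse_word` and `concatenate`, the dictionary-based library, and the choice to stop at the first unmatched position are all assumptions made for the example, and audio is modeled as plain lists of samples instead of .wav files.

```python
def parse_word(word, library, sizes=(4, 3, 2)):
    """Split `word` into library segments, trying 4, then 3, then 2 letters.

    Returns (segments, missing). `missing` is None on success, otherwise the
    remaining text for which no library entry of any size was found.
    """
    segments = []
    pos = 0
    while word[pos:]:
        for size in sizes:
            candidate = word[pos:pos + size]
            if len(candidate) == size and candidate in library:
                segments.append(candidate)
                pos += size
                break
        else:
            # Hypothetical policy: report the unmatched remainder and stop.
            return segments, word[pos:]
    return segments, None


def concatenate(segments, library):
    """Join the sample lists of the matched segments into one output signal."""
    output = []
    for segment in segments:
        output.extend(library[segment])
    return output


# Toy library: segment text mapped to a (very short) list of audio samples.
toy_library = {"stam": [1, 2], "pe": [3], "de": [4, 5]}
matched, missing = parse_word("stampede", toy_library)
print(matched, missing, concatenate(matched, toy_library))
```

With the toy library shown, `stampede` is split into `stam`, `pe`, and `de`; a word whose tail has no entry returns that tail as `missing`, which plays the role of the "library does not contain the required segment" message.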
The algorithm was implemented in the C language. It makes use of a MATLAB program to implement the concatenation of the audio segments. The implementation of the algorithm in C is provided in Appendix A.

3.5 Architectural design of the synthesis system

A digital system is basically a sequential circuit which consists of interconnected flip-flops and various combinational logic gates. Digital systems are always designed with a hierarchical and modular approach. The digital system is divided into modular subsystems, each of which performs some function. The modules are designed from functional blocks such as registers, counters, decoders, multiplexers, buses, arithmetic elements, flip-flops, and primary logic gates such as OR, AND, and inverter. Interconnecting all the modular subsystems through control and data signals forms the digital system [22]. The architectural body of any digitally designed system consists of two main modules:

- Controller
- Datapath

The top-level block diagram of a general digital system, which shows the relationship between a controller and a datapath, is given in Figure 3.2. The first block represents the controller and the second block represents the datapath. The datapath performs the data-processing operations, whereas the controller determines the sequence of those operations. Control signals activate the various data-processing operations, and the controller communicates with the datapath through these control signals. The controller first sends a sequence of control signals to the datapath and in turn receives status signals from the datapath. The controller uses different control signal sequences for different operations. The controller and datapath also interact with the other parts of the digital system, such as memory and I/O units, through control inputs, control outputs, data inputs, and data outputs. The datapath performs many register transfer operations and microoperations.
Register transfer operations correspond to the movement of the data stored in the registers and the processing performed on that data. Microoperations, on the other hand, correspond to operations such as load, add, count, and subtract performed on the data stored in the registers. The controller provides the sequence for the microoperations [22].
Figure 3.2. The top-level block diagram of a general digital system consisting of a controller and a datapath (external control inputs, control signals, status signals, input signals, output signals)

In this section, the details of the speech synthesis system designed as part of this thesis are presented. The architectural design hierarchy of the speech synthesis system consists of three steps:

- Pseudo-code representation of the algorithm
- State diagram and controller design
- Datapath design

The pseudo-code representation is one of the most useful ways of representing the algorithm. We formulate the pseudo-code for the algorithm presented in Section 3.4; the pseudo-code for the implementation of the algorithm is given in Section 3.6.

The controller design is the first step in the architecture design of any digital system. First, a state flow graph is developed from the pseudo-code explained in Section 3.6. From the state flow graph we design the finite state machine, or controller. The controller controls the datapath at specific control points in the datapath. The detailed design process for the controller is explained in Section 3.7.

The last step in the architectural design hierarchy is the design of the datapath. The datapath consists of all combinational and sequential logic units such as counters and logic gates, as well as memory elements and I/O elements. The detailed datapath design is presented in Section 3.8.

3.6 Pseudo-code of the algorithm

(1) Initialize and load the acoustic library, which consists of all the audio recordings;
(2) Load the target or input word;
(3) Initialize the register counter K;
(4) While (for all the characters in target word) do
(5) {
(6)     Divide the target word into phones of size two starting from the first character;
(7) } /* end for (4) */
(8) For all the phones of size two
(9) {
(10)     search the acoustic library for all the phones of size two;
(11)     if (matching phones are found) then
(12)     {
(13)         found = 1;
(14)         Read the output data for all the phones;
(15)         {
(16)             send the start signal to audioplayer when player_done = 0;
(17)             when player_done = 1 implies audioplayer has finished playing the phone;
(18)         } /* end for (14) */
(19)         reset the register counter K and go on to the next phone;
(20)     } /* end if (11) */
(21)     if (no matching phones are found) then
(22)     {
(23)         dfound = 1;
(24)         reset the register counter K (reset_K = 1);
(25)     } /* end if (21) */
(26)     repeat the procedure for all the phones of the target word;
(27) } /* end for (8) */
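As a rough software analogue of the pseudo-code, the sketch below walks through the same steps in Python. It is an illustration under stated assumptions, not the thesis implementation: the function name `synthesize`, the list-of-pairs library layout, and the `play` callback standing in for the audio player are hypothetical, and the target word is assumed to have an even number of letters so that it divides cleanly into two-letter phones.

```python
def synthesize(target_word, library, play):
    """Search `library` for each two-letter phone of `target_word`.

    `library` is a list of (phone, recording) pairs scanned with counter K.
    `play` is called with the recording of every matched phone.
    Returns one (phone, found, dfound) tuple per phone.
    """
    if len(target_word) % 2 != 0:
        raise ValueError("target word must have an even number of letters")

    # Lines (4)-(7): divide the target word into phones of size two.
    phones = [target_word[i:i + 2] for i in range(0, len(target_word), 2)]

    status = []
    for phone in phones:                      # line (8)
        found, dfound = 0, 0
        # Counter K scans the library entries one by one (line 10).
        for k, (entry, recording) in enumerate(library):
            if entry == phone:                # line (11)
                found = 1                     # line (13)
                play(recording)               # lines (14)-(18)
                break                         # line (19): K restarts at 0
        if not found:                         # line (21)
            dfound = 1                        # line (23)
        status.append((phone, found, dfound))
    return status


toy_library = [("st", "rec_st"), ("am", "rec_am"), ("pe", "rec_pe")]
played = []
print(synthesize("stampede", toy_library, played.append))
print(played)
```

In this toy run the phone `de` has no library entry, so its tuple carries `dfound = 1` and nothing is played for it.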
3.6.1 Description of the pseudo-code

All the audio recordings (phones) are loaded into the acoustic library (line 01). After the acoustic library has been compiled, the target word (input word) to be synthesized is loaded (line 02). Line 03 initializes the register counter K, which is used to count through the phones loaded in the acoustic library. Once the target word is loaded, it is parsed into small segments of two letters (lines 04 to 06); here we assume that the target word has an even number of letters. After the target word is divided into small segments, a search for the corresponding phones in the acoustic library is performed for every segment of the target word (lines 08 and 10). While searching for the audio recording corresponding to a target word segment, if a matching phone is found (lines 11 to 13), a start signal is sent to the audio player (line 16), which plays the entire audio recording of that particular phone. When the audio player finishes its operation, it sets the signal player_done to 1 (line 17). This indicates that the play operation for a single phone is finished. The register counter K is then reset to 0 (line 19), and the process is repeated for the other segments of the target word (line 26). If no matching phone for a segment is available in the acoustic library, the signal dfound is set to 1 (line 23). Once the operation for a single segment is done, the process is repeated for all the remaining segments of the target word.

3.7 Design of the controller

A state machine design usually starts with a general word description of what we want the machine to do. The following step is to build the state table, transition table, and excitation table. The excitation equations, that is, the equations used for the inputs of the flip-flops, are then derived from the excitation table, and the output equations are derived from the transition table.
Once the output and excitation equations are derived, the rest of the controller design procedure resembles a normal combinational circuit design, followed by connecting the resulting logic to the flip-flops. There are three methods to design a finite state machine [23]:

- Random logic. It is the easiest method for the design of a finite state machine. Reassignment of the states or state machine optimization is not permitted.
- Use directives to guide the logic synthesis tool to improve or modify the state assignment.
- Use a special state machine compiler to optimize the state machine. It is the most accurate and the most difficult method to use for finite state machine design.

State encoding is the primary and most important step in finite state machine design. A poor choice of state codes results in too much logic and makes the controller too slow. In digital design the state encoding can be performed in different ways, such as:

- Adjacent encoding
- One-hot encoding
- Random encoding
- User-specified encoding
- Moore encoding

The encoding style for state machines depends on the synthesis tool being used. Some synthesis tools encode better than others, depending on the device architecture and the size of the decode logic. We can declare the state vectors or let the synthesis tool determine the vectors. As we are using Xilinx tools, the states are specified with the enumerated type of encoding, and the Finite State Machine (FSM) extraction commands are used to extract and encode the state machine as well as to perform state minimization and optimization.

3.7.1 Procedure for the controller design

A sequential circuit depends on the clock as well as on a set of specifications which form the logic for the design. The design of the sequential circuit consists of choosing the flip-flops and finding a combinational circuit structure which, along with the flip-flops, produces a circuit that meets the specifications. The sequence of steps to be followed to derive the finite state machine is described below and shown in Figure 3.3.

- From the statement of the problem the state diagram is obtained. The state diagram, given in Figure 3.4, was derived from the pseudo-code explained in Section 3.6.
Figure 3.3. Design procedure for the finite state machine (flowchart: specify the problem; derive the state diagram; derive the state table from the state diagram; determine the number of flip-flops to be used and choose the type of flip-flops; derive the flip-flop input equations and output functions; simplify the flip-flop input equations and output functions with Karnaugh maps or any algebraic method; draw the logic diagram)
- Just as a combinational circuit is fully specified by a truth table, a sequential circuit is specified by a state table. The state table, consisting of all the inputs and outputs of the controller and their behavior, is formulated from the state diagram given in Figure 3.4.
- Since a synchronous sequential circuit is a combination of flip-flops and combinational gates, it is necessary to obtain the flip-flop input equations. The flip-flop input equations are derived from the next state entries in the encoded state table. Positive-edge-triggered D-type flip-flops are used in the controller design, as they ignore the clock pulse while it is constant and trigger only during a positive transition of the clock pulse. Some flip-flops trigger on the negative (1-to-0) transition and some on the positive (0-to-1) transition. As part of the controller design, the input equations 3.1, 3.2, and 3.3 are derived.
- Similarly, the output equations are derived from the output entries in the state table. Output equations are obtained for all the outputs of the controller. In this design, equations 3.4, 3.5, 3.6, 3.7, and 3.8 are the output equations, which are given later.
- If possible, simplify the flip-flop input equations as well as the output equations. Karnaugh maps can be used to simplify the equations and avoid a cumbersome logic circuit structure.
- As the last step, the logic diagram with the flip-flops and combinational gates is drawn according to the requirements stated in the input and output equations.

3.7.2 Design of the state diagram and state table

The state diagram is given in Figure 3.4. It consists of eight states, namely S0, S1, S2, S3, S4, S5, S6, and S7. The state S0 is the initial state. The system starts when the signal ready is set to 1. In state S1 the target word and the acoustic library are loaded into the system for synthesis. When the load signal is 1, the target word is loaded and the system goes to the next state.
The loaded target word is parsed into small segments externally, depending on the control signal. In state S2 the first segment of the target word is assigned to X. In state S3 the library is checked. Here we assign a counter K, which is used to count the number of phones existing in the library. In the following state, S4, the search process is performed: the target word segment X is compared to the elements in the library.
If a matching phone is found, the found signal goes high and the system enters state S6; otherwise it goes to state S5, which increments the register counter K and continues the search process. In state S6 the audio recording corresponding to the matched phone is read through an audio player program. Once this task is finished, the signal player_done is set to 1 and the system goes to state S7, where the register counter K is reset and the segment count I is incremented. The next segment of the target word is assigned to X and the same procedure is followed again. This procedure is followed for all the segments of the target word.

A state table is defined as the enumeration of the functional relationships between the inputs, outputs, and flip-flop states of a sequential circuit [22]. It primarily consists of four sections, labeled inputs, present state, next state, and outputs. The inputs section provides all the possible values of the input signals for each possible present state. The present state section shows the states of flip-flops A, B, and C at any given time $t$. The next state section provides the states of flip-flops A, B, and C one clock cycle later, at time $t+1$. The outputs section provides the values of all the output signals at time $t$ for every combination of present state and inputs. In general, a sequential circuit with $m$ flip-flops and $n$ inputs needs $2^{m+n}$ rows in the state table. The binary values for the next state are directly derived from the flip-flop input equations. The state table is given in Table 3.1. Some unused states have been removed, so the table has fewer rows than the formula above gives. The detailed procedure for the remaining part of the controller design is explained in the later sections.

It is important to review all the controller signals involved and their functionality.
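The state sequence described above can be mimicked by a small next-state function. The Python sketch below is only an illustration of the controller's behavior, not the VHDL design: the function names `step` and `trace` are hypothetical, and both the kvalue = 1 branch out of S3 and the attachment of outputs to transitions are read from the labels of Figure 3.4, so they should be treated as assumptions.

```python
def step(state, ready=0, load=0, kvalue=0, cmp=0, player_done=0):
    """Return (next_state, asserted_outputs) for one clock cycle."""
    if state == "S0":                       # idle until ready = 1
        return ("S1", {"reset_K"}) if ready else ("S0", set())
    if state == "S1":                       # load target word and library
        return ("S2", set()) if load else ("S1", set())
    if state == "S2":                       # assign current segment to X
        return "S3", set()
    if state == "S3":                       # all library entries compared?
        if kvalue:
            return "S2", {"reset_K", "incr_I", "dfound"}
        return "S4", set()
    if state == "S4":                       # compare library entry with X
        return ("S6", {"found"}) if cmp else ("S5", set())
    if state == "S5":                       # move to the next library entry
        return "S3", {"incr_K"}
    if state == "S6":                       # wait for the audio player
        return ("S7", set()) if player_done else ("S6", set())
    if state == "S7":                       # advance to the next segment
        return "S2", {"reset_K", "incr_I"}
    raise ValueError(f"unknown state {state!r}")


def trace(inputs_per_cycle, state="S0"):
    """Apply one input dictionary per clock cycle and list the states visited."""
    visited = [state]
    for inputs in inputs_per_cycle:
        state, _ = step(state, **inputs)
        visited.append(state)
    return visited


# A segment that matches the first library entry it is compared with.
print(trace([{"ready": 1}, {"load": 1}, {}, {"kvalue": 0},
             {"cmp": 1}, {"player_done": 1}, {}]))
```

Writing the controller as a pure function of the present state and inputs mirrors the state-table view: each `if` branch corresponds to one or two rows of the table.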
Brief information about the input and output signals and their purpose is given below.

- ready is an input signal. When ready is 0, the system is in the idle state. The system starts only when ready is set to 1.
- load is an input signal. When load is 0, the system is in the process of loading the target word as well as the phones that exist in the acoustic library. When load is 1, the target word has been loaded into the system.
Figure 3.4. State diagram for the controller (states: S0 Idle, S1 Load, S2 X = R(i), S3 K = 16?, S4 Y(K) = X?, S5 Increment K, S6 Read O/P block, S7 Increment I; transition conditions on ready, load, kvalue, cmp, and player_done; outputs reset_K, incr_K, incr_I, found, and dfound)
- kvalue is also an input signal. When it is set to 1, it denotes that all the acoustic library elements have been compared to a particular target word segment. When it is 0, there are still entries in the library to be compared.
- cmp is an input signal. It is the output of the comparator, which compares the target word segments and the acoustic library units. When it is 1, it indicates that a matching phone has been found for that particular target word segment. When it is 0, it implies that no matching phone is available in the library.
- player_done is also an input signal. It is the output of the audio player program, which plays the audio recording for a particular phone when a matching phone is found in the library. When it is 0, the recording is still being played. When it is 1, the audio recording has finished and the system goes to the next state.
- reset_K is an output signal. It is used to reset the register counter K. Initially, when the system enters state S1, the counter is reset. The signal also goes high when a search is finished, irrespective of the result of the search. When a matching phone is found for a particular segment, the counter is reset to 0 and the search continues for the other segments. If there is no fit available in the library, the system skips that particular segment and the counter is set to 0 to begin the search process for the other segments of the target word.
- incr_I is an output signal. This signal is used to replace the target word segment for which the search has finished with the next target word segment to be searched. When it is 1, the search for one segment is done and the search for the next segment begins.
- incr_K is also an output signal. It is the increment signal which increments the register counter K.
When a particular entry in the library is compared to the target word segment and is not the right match, the register counter is incremented so that the next available entry in the library can be compared to the target word segment. Counter K is incremented until all the available entries in the acoustic library have been compared.
- found is an output signal. It is the output of the comparator, which compares the target word segments and the acoustic library entries. When there exists a phone in the library that matches a particular target word segment, it is set to 1.
- dfound is also an output signal. It is set to 1 when there exists no matching phone in the library for a particular target word segment. dfound is set high if kvalue is 1 and there is no right match for a particular target word segment.

3.7.3 Flip-flop input and output equations

The three input equations for the D flip-flops can be derived from the next state values in the state table. If possible, we simplify the equations using Karnaugh maps or any algebraic method. The input equations for the state flip-flops are

[Equations (3.1), (3.2), and (3.3)]

The output equations can be obtained from the binary values of the output entries in the state table. If possible, the output equations are simplified as well. The output equations are

[Equations (3.4), (3.5), (3.6), (3.7), and (3.8)]

The next step, after obtaining all the input and output equations, is building the logic diagram using those equations as building blocks. Along with all the input signals we also have external signals such as clk and reset. The reset signal is used to initialize the state of the flip-flops. This signal is different from the output signal reset_K, which is used to reset the register counter K. This master reset avoids the system starting in an unused state. In most cases the flip-flops are initially reset to 0, but depending on the requirements they may in some cases be reset to 1.

Table 3.1. State table for the controller

Inputs                                | Present | Next  | Outputs
ready  load  kvalue  cmp  player_done | state   | state | reset_K  incr_I  incr_K  found  dfound
0      -     -       -    -           | 000     | 000   | -        -       -       -      -
1      -     -       -    -           | 000     | 001   | 1        -       -       -      -
-      0     -       -    -           | 001     | 001   | -        -       -       -      -
-      1     -       -    -           | 001     | 010   | -        -       -       -      -
-      -     -       -    -           | 010     | 011   | -        -       -       -      -
-      -     0       -    -           | 011     | 100   | -        -       -       -      -
-      -     -       0    -           | 100     | 101   | -        -       -       -      -
-      -     -       1    -           | 100     | 110   | -        -       -       1      -
-      -     -       -    -           | 101     | 011   | -        -       1       -      -
-      -     1       -    -           | 011     | 010   | 1        1       -       -      1
-      -     -       -    0           | 110     | 110   | -        -       -       -      -
-      -     -       -    1           | 110     | 111   | -        -       -       -      -
-      -     -       -    -           | 111     | 010   | 1        1       -       -      -

The port interface diagram for the controller is given in Figure 3.5.

Figure 3.5. Pin diagram of the controller (inputs: clk, reset, ready, load, kvalue, cmp, player_done; outputs: reset_K, incr_K, incr_I, found, dfound)

3.8 Datapath design for the system

A datapath consists of digital logic that implements various microoperations. This digital logic involves buses, multiplexers, decoders, logic gates, and processing circuits [24]. Datapaths are best defined by their registers and by the operations, such as shift, count, and load, performed on the binary data stored in them. These operations are called register transfer operations. They are defined as the movement of data stored in the registers and the processing performed on the data [22]. The datapath for the synthesis system is given in Figure 3.6.

3.8.1 Components of the datapath

The datapath of this synthesis system is divided into five main functional blocks:

- Input module
- RAM
- 16-bit comparator
- Register counter K
- Output module

3.8.1.1 Input module

The basic functionality of this module is to fetch the target word and break it into two-letter segments, so that they can be searched in the acoustic library for their respective audio recordings. Initially, when the signal load is 1, the target word (input) is loaded into the system. The input module has two signals, control and incr_I. When control is 1, the input module asserts the first segment to be searched for the corresponding phone in the library. Once the search for this segment is finished, the incr_I signal goes high. When incr_I is 1, the input module assigns the next segment. This process continues until all the segments of the target word have been searched in the acoustic library.

3.8.1.2 RAM

This is basically a register file which stores the phones together with the starting addresses of the audio recordings stored in the RAM of the FPGA device. The register file contains 16 registers, and each register holds 35 bits of information. The address width of the audio recordings in the RAM of the FPGA is 19 bits. The phones in the acoustic library have a width of 16 bits. To scan through all the registers in the RAM we make use of a counter K. During the search process, counter K provides the address of the register in the register file whose contents are to be compared with the target word segment. The RAM has a signal sel which implements the read/write operation of the register file. When it is 0, data is loaded into the register file, and when it is 1, data is read. When there exists a matching phone for a target word segment, the 35-bit data stored in the register is split into two chunks of 16 bits and 19 bits respectively. The 19-bit chunk provides the address information about the corresponding audio recording for that particular phone stored in the RAM of the FPGA. This chunk is provided to the output module. The 16-bit chunk contains the phone itself, which is provided to the 16-bit comparator.

Figure 3.6. RTL schematic of the datapath (blocks: input module, RAM (library), register counter K, comparator, output module; signals: input, control, sel, incr_I, reset_K, incr_K, cmp, found, kvalue, output; bus widths shown: 48, 16, 16, 19, 4, 19)

3.8.1.3 Comparator

This block compares the target word segments with the phones stored in the acoustic library. As described above, the input module breaks the target word into segments of 16-bit data width. The register file, which contains all the phones, provides the phone to be compared with the target word segment. If they match, the signal cmp is set to 1 and the remaining 19 bits of the data stored in the same register are sent to the output module. The output module provides this output data as the starting address to the audio player module. The audio player plays the audio data starting from the address provided by the output module to the end of that particular audio recording.

3.8.1.4 Register counter K

This counter provides the address location of a particular phone when it is compared to a target word segment. It is used to scan through the library. The counter has two input signals, reset_K and incr_K. reset_K is used to reset the counter and incr_K is used to increment the counter. The counter is reset when there exists no matching phone for a particular target word segment. It is also reset when there is a matching phone for a particular segment, before the system proceeds to the next target word segment. When the comparison of a particular target word segment with an acoustic library entry (phone) fails, the counter is incremented so that the phone available in the next register of the register file is compared to the target word segment.

3.8.1.5 Output module

This module provides the starting address of the audio recording stored in the RAM of the FPGA device, which we use as the target device for this experiment.
When there exists a matching phone for a particular target word segment, the found signal goes high, as discussed in the controller design section. When found is high, the starting address is loaded into the output module, which provides the starting address to the audio player, which in turn plays out the audio recording. To put it briefly, the starting address of an audio recording is stored in the register file along with the
corresponding phone. When that particular phone is the right match for the target word segment, the starting address stored with that phone is loaded to the player and the recording is played out as an audio signal.

3.8.2 Functional description of the datapath

The five functional blocks above constitute the datapath of the synthesis system. The datapath functionality is described briefly in this subsection. The functionality of the datapath can be divided into three phases:

- Loading phase
- Searching phase
- Output phase

First, in the loading phase, the target word to be synthesized is loaded into the input module along with the acoustic library. When loading is done, load is set to 1. This indicates that the loading phase is finished.

The next phase is the searching phase. In this phase, the target word is parsed into segments of size 2 (16-bit binary). Once loading is done, an external signal control is set to 1, so that the first segment of the target word is loaded for the search process. The comparator compares the target word segment to the entries in the acoustic library. Because of memory constraints we assume there are just 16 entries in the library; depending upon the size of the memory, the library can be expanded. Counter K is used to scan through all the available phones in the library. If there exists an entry matching the target word segment, the signal cmp goes high. This implies that a matching phone is available in the library for the target word segment, which also sets the signal found to 1. The register file also contains the information about the audio recording corresponding to each of the entries stored in the library.

In the output phase, when found goes high, the starting address of the audio file stored in the FPGA memory is provided to the output module. The output module then provides the starting address to the audio player program, which plays the audio file. The pin diagram of the datapath is given in Figure 3.7.
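The register-file lookup performed by the datapath can be illustrated with integer arithmetic. The Python sketch below is a behavioral model under assumptions, not the RTL: the helper names are hypothetical, each two-letter phone is assumed to be stored as two 8-bit ASCII codes, and the phone is assumed to occupy the upper 16 bits of the 35-bit register with the 19-bit starting address in the lower bits.

```python
PHONE_BITS = 16                          # two 8-bit ASCII letters (assumed)
ADDR_BITS = 19                           # width of the audio start address
REGISTER_BITS = PHONE_BITS + ADDR_BITS   # 35-bit register
ADDR_SPAN = 2 ** ADDR_BITS


def encode_phone(two_letters):
    """Pack exactly two ASCII letters into one 16-bit integer."""
    first, second = two_letters.encode("ascii")
    return first * 256 + second


def pack_entry(two_letters, start_addr):
    """Build a 35-bit register value from a phone and its start address."""
    if start_addr not in range(ADDR_SPAN):
        raise ValueError("start address does not fit in 19 bits")
    return encode_phone(two_letters) * ADDR_SPAN + start_addr


def unpack_entry(entry):
    """Split a 35-bit register value into (phone_code, start_addr)."""
    return divmod(entry, ADDR_SPAN)


def lookup(register_file, segment):
    """Scan the register file with index k (the role of counter K).

    Returns (k, start_addr) on a match (the found case) or None when every
    entry has been compared without a match (the dfound case).
    """
    target = encode_phone(segment)
    for k, entry in enumerate(register_file):
        phone_code, start_addr = unpack_entry(entry)
        if phone_code == target:         # comparator output cmp = 1
            return k, start_addr
    return None


register_file = [pack_entry("st", 0), pack_entry("am", 4096)]
print(lookup(register_file, "am"), lookup(register_file, "zz"))
```

The index `k` stands in for register counter K, and the returned address is what the output module would hand to the audio player.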
Figure 3.7. Pin diagram of the datapath (pins: clk, sel, control, input[47:0], incr_I, incr_K, reset_K, found, cmp, kvalue, output[18:0])

Figure 3.8. Pin diagram of the synthesis system (pins: CLK, RESET, READY, LOAD, CONTROL, DONE, TARGET WORD[47:0], STARTING_ADDRESS[18:0], DFOUND)
Figure 3.9. RTL schematic diagram of the synthesis system (controller and datapath connected through incr_K, reset_K, incr_I, found, cmp, and kvalue; external pins CLK, RESET, READY, LOAD, CONTROL, DONE, TARGET WORD[47:0], STARTING ADDRESS[18:0], DFOUND)
3.9 Schematic diagram of the synthesis system

In the previous sections, the designs of the controller and the datapath were discussed. The schematic design of the synthesis system is given in Figure 3.9. It is formed by combining the controller and datapath explained in the previous sections. The pin diagram of the synthesis system is given in Figure 3.8. As shown in the figure, it has only two output signals, STARTING_ADDRESS and DFOUND. DFOUND is mostly unused in the implementation of the architecture. The implementation of the synthesis system is done in VHDL, which is discussed extensively in the next chapter.

3.10 Summary

A detailed explanation of the algorithm is presented. The detailed architectural design procedure, as well as the architecture to implement the algorithm, is described. The software implementation of the algorithm is presented in Appendix A. The hardware implementation is presented in the next chapter.
CHAPTER 4
IMPLEMENTATION OF THE ARCHITECTURE AND EXPERIMENTAL RESULTS

This chapter discusses how the system architecture is implemented, as well as the experimental results compiled during the course of this research work. The complexity of contemporary integrated circuits has made design automation an essential part of VLSI design methodologies [25]. Designing an integrated circuit and ensuring that it operates to specification is practically impossible without the use of computer-aided design (CAD) tools. CAD tools perform various functions such as synthesis, implementation, analysis, verification, and testability, depending upon their design specifications. Analysis and verification tools examine the behavior of the circuit. Synthesis and implementation tools generate and optimize the circuit schematics or layout. Testability tools verify the functionality of the design. For the design proposed in the previous chapter, we make use of the Xilinx Foundation tools to perform the implementation and verification.

Structural VHDL modules have been designed for the entire architecture explained in the previous chapter. The hierarchy of the implementation is given in Figure 4.1.

Figure 4.1. Hierarchy of the implementation of the system architecture in VHDL (TTS synthesis system: input_module.vhd, system.vhd; system: controller.vhd, datapath.vhd; datapath: comparator.vhd, counter.vhd, ram.vhd, inv.vhd, queue.vhd)
4.1 Structural VHDL design for the architecture

As shown in Figure 4.1, the text-to-speech synthesis system has two basic components.

The input module (input_module) provides the target word to the synthesis system. This module basically performs a multiplexer operation.

The system module (system) takes the input word from the input module and performs the rest of the synthesis process. It is further divided into two modules, the controller and the datapath. The datapath module consists of five other components:

- comparator
- counter
- ram
- inv
- queue

The controller is designed using the state editor of the Xilinx Foundation tools. It uses enumerated state encoding for the states. It interacts with the datapath and ensures that the functionality of the system is maintained. Its functional simulation is provided in Figure 4.3.

The comparator compares two 16-bit binary values. Its main function is to compare the target word segments with the acoustic library elements stored in the RAM.

The counter is an 8-bit behavioral counter module. It provides the addresses of the acoustic library elements stored in the RAM.

The RAM is a behavioral module of size 8 x 35: its address is 8 bits wide and its data is 35 bits wide. Each entry stores a phone of width 16 bits and the starting address of its audio representation stored in the RAM of the FPGA board. It can store 256 (2^8) phones. Due to memory constraints, the RAM of the FPGA (XSV-800) board can only store 10 seconds of data. Because of this constraint, we experimented with 20 audio representations for 20 selected phones.

The inverter (inv) is a basic inverter module, which inverts the input signal.

The queue takes the target word from the input module and breaks it into segments of 16-bit width. Depending upon the controller signals, it sends the 16-bit data segments to be compared with the phones stored in the memory.
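The 35-bit RAM word can be pictured with the following C sketch. It assumes that the 35 bits split into the 16-bit phone code and a 19-bit starting address, which agrees with the 19-bit output[18:0] of the datapath. The field order (phone code in the upper bits) and all names are hypothetical and are not taken from the VHDL sources.

```c
#include <stdint.h>

#define PHONE_BITS 16u
#define ADDR_BITS  19u
#define ADDR_MASK  ((1u << ADDR_BITS) - 1u)   /* 0x7FFFF */

/* Assumed layout: bits 34..19 hold the phone code,
 * bits 18..0 hold the starting address of the recording. */
uint64_t ram_word_pack(uint16_t phone, uint32_t start_addr)
{
    return ((uint64_t)phone << ADDR_BITS) | (uint64_t)(start_addr & ADDR_MASK);
}

/* Extract the 16-bit phone code that the comparator would examine. */
uint16_t ram_word_phone(uint64_t word)
{
    return (uint16_t)((word >> ADDR_BITS) & 0xFFFFu);
}

/* Extract the 19-bit starting address that would be sent to the player. */
uint32_t ram_word_addr(uint64_t word)
{
    return (uint32_t)(word & ADDR_MASK);
}
```

With an 8-bit address, at most 256 such words can be stored, and a 19-bit starting address can select one of 2^19 = 524288 locations in the audio memory.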
All the components of the synthesis system have been individually verified and tested for their respective functionality at various stages.

4.2 Functional description of the synthesis system

Initially, the input module provides the target word to the queue module. When the controller sets its control signal to 1, the queue module provides the first segment of the target word to be searched in the library. The 16-bit target word segment is compared with the phones stored in the RAM module using the comparator and the counter; the counter provides the addresses of the phones. When a matching phone exists for the target word segment, a signal indicating the match is set to 1.

To play the audio recordings corresponding to the phones, we make use of a VHDL project called "audioproject" [26]. This project allows recording and playback of just over 10 seconds of sound, and has options to play back the sound at doubled speed, reversed, etc. It is intended for use with the XSV board v1.0, produced by XESS Corp. This board contains a Virtex FPGA from Xilinx Inc. and support circuitry for communicating with a wide variety of external devices. The project was developed internally by students working for the School of Computer Science and Electrical Engineering at the University of Queensland, Australia [26].

When the match is signaled, the starting address of the phone is passed to the player module of the audioproject. The player module then reads the entire data of the audio recording corresponding to that phone. Once the player finishes its operation, it sets its done signal to 1, indicating that it has finished playing the data. The functionality of the player module is shown in Figure 4.2. The player plays the data until it reaches the ending address of the phone. Once the player finishes its operation, the next task is to search for the next target word segment.
When the player's done signal is set to 1, the controller module sets its control signal to 1, and the queue module in turn sends the next target word segment to be searched in the acoustic library. The controller operation is shown in the waveform of Figure 4.3. This procedure is repeated until all the target word segments have been searched for in the acoustic library.
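The handshake described in this section can be summarized as a small state machine. The C sketch below is only an illustration of that sequence: the state names, the input flags, and the decision to skip a segment that has no matching phone are assumptions, and they do not reproduce the states drawn in the Xilinx state editor.

```c
/* Hypothetical controller states for the search-and-play sequence. */
typedef enum {
    ST_IDLE,      /* waiting for the target word and library to load */
    ST_FETCH,     /* ask the queue for the next 16-bit segment       */
    ST_SEARCH,    /* counter scans the library, comparator checks    */
    ST_PLAY,      /* player is reading out the recording             */
    ST_FINISHED   /* all segments processed                          */
} ctrl_state;

/* Hypothetical status flags sampled by the controller each clock cycle. */
typedef struct {
    int load_done;     /* loading phase finished                    */
    int match;         /* comparator found the segment              */
    int library_end;   /* counter reached the last library entry    */
    int player_done;   /* player finished the current recording     */
    int last_segment;  /* current segment is the final one          */
} ctrl_inputs;

/* Next-state function: one call corresponds to one clock edge. */
ctrl_state ctrl_next(ctrl_state state, const ctrl_inputs *in)
{
    switch (state) {
    case ST_IDLE:
        return in->load_done ? ST_FETCH : ST_IDLE;
    case ST_FETCH:
        return ST_SEARCH;
    case ST_SEARCH:
        if (in->match) {
            return ST_PLAY;
        }
        if (in->library_end) {
            /* Assumption: a segment without a match is skipped. */
            return in->last_segment ? ST_FINISHED : ST_FETCH;
        }
        return ST_SEARCH;
    case ST_PLAY:
        if (!in->player_done) {
            return ST_PLAY;
        }
        return in->last_segment ? ST_FINISHED : ST_FETCH;
    default:
        return ST_FINISHED;
    }
}
```

Writing the controller as a pure next-state function keeps the sketch close to an enumerated-state hardware description, where the outputs would be derived from the current state.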
Figure 4.2. Output waveform of the audio player
4.3 Synthesis and simulation results

Simulation is performed by setting test values in the testbench. The design is synthesized using Xilinx tools and simulated with the ModelSim simulator. The simulation results are given in Figures 4.4, 4.5, and 4.6 for three different target words. In the output waveforms, we can see that when the player module has finished playing the data for a particular phone (its done signal is 1), the system moves to the next segment. If another phone is found (the found signal is 1), the system gets the starting address of that phone from the RAM and sends it to the player module. The player module then repeats the procedure performed earlier.

Once the functionality of the design is verified, the design is synthesized targeting the Virtex FPGA (v800hq240-4). The design mapping, place, and route operations are performed as part of the implementation step of the flow engine of the Xilinx tools. In the next step, the bitstream file to be downloaded onto the FPGA on the XSV board is generated. We make use of XSTOOLs version 4.0.2 to download the bitstream file onto the FPGA. These tools are provided by XESS Corporation and comprise the utilities that download and exercise the XS40, XS95, XSA, and XSV boards under Win95, Win98, WinME, WinNT, Win2000, and WinXP.

Figure 4.3 shows the behavior of the controller. It can be observed that when the controller's input signal goes to 1, the corresponding output signal goes to 1. Once a particular phone is found, a signal goes high, indicating that the next target word segment is to be searched for in the acoustic library.

Figures 4.4, 4.5, and 4.6 are the output waveforms generated by ModelSim for the three target words 'bamite', 'gemini', and 'devote', respectively. In Figure 4.4, the target word 'bamite' is broken into the segments 'ba', 'mi', and 'te'. The system first searches for 'ba' in the library; if it finds the matching phone, it provides the starting address 7EE00 to the player.vhd module of the audioproject. The player module plays out the audio data corresponding to the phone 'ba' and sets its done signal to 1.
Then 'mi' is searched for in the library; when it is found, the starting address 50540 is returned to the audioproject. This process continues for the segment 'te', which returns the starting address 00000. The search is performed similarly for the segments of the other target words. Various simple words such as bone, when, gyration, vote, byte, dose, bane, mine, and nine have been synthesized with this mechanism. Spectrograms of the words were obtained in order to observe the difference between the spoken and the synthesized words. There is a distinct spectral mismatch
observed between them. The spectrograms of the spoken and synthesized words 'goose' and 'byte' are given in Figures 4.7, 4.8, 4.9, and 4.10.

4.3.1 Device utilization data summary

The design is synthesized targeting the Virtex FPGA (v800hq240-4) on the XSV-800 board. The device utilization data is given in Tables 4.1 and 4.2. Table 4.1 gives the utilization data of the synthesis system without the audioproject module, and Table 4.2 gives the utilization data of the system with the audioproject.

Table 4.1. The utilization data summary of the system

Resource                      Used   Available   Utilization
Number of Slices                69        9408            0%
Number of Slice Flip Flops      69       18816            0%
Number of 4 input LUTs         127       18816            0%
Number of bonded IOBs           27         170           15%
Number of BRAMs                  6          28           21%
Number of GCLKs                  1           4           25%

Table 4.2. The utilization data summary of the system with audioproject

Resource                      Used   Available   Utilization
Number of Slices               235        9408            2%
Number of Slice Flip Flops     264       18816            1%
Number of 4 input LUTs         332       18816            1%
Number of bonded IOBs           91         170           53%
Number of BRAMs                  6          28           21%
Number of GCLKs                  1           4           25%
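The percentages in Table 4.1 are consistent with integer truncation of used/available. For example, 27 of 170 bonded IOBs is about 15.9%, which is listed as 15%. The helper below is a minimal sketch of that arithmetic. The function name is hypothetical, and truncation is inferred from the table values rather than from the tool documentation.

```c
/* Utilization as a whole-number percentage, truncated toward zero.
 * Returns 0 when no resources of that kind are available. */
unsigned utilization_percent(unsigned used, unsigned available)
{
    if (available == 0u) {
        return 0u;
    }
    return (unsigned)((100ull * used) / available);
}
```

Applied to the rows of Table 4.1, utilization_percent(69, 9408) gives 0, utilization_percent(27, 170) gives 15, and utilization_percent(6, 28) gives 21.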
Figure 4.3. Sample output waveform of the controller

Figure 4.4. Sample output waveform 1 of the synthesized word 'bamite'
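The search sequence reported for 'bamite' can be replayed in software with a linear scan that plays the role of the counter and comparator. The sketch below is illustrative only. The three-entry library, the encoding of a phone as two 8-bit character codes, and all identifiers are assumptions; the starting addresses are the ones quoted in the text for 'ba', 'mi', and 'te'.

```c
#include <stddef.h>
#include <stdint.h>

/* One library entry: a 16-bit phone code and its starting address. */
typedef struct {
    uint16_t phone;
    uint32_t start_addr;
} phone_entry;

/* Build a 16-bit phone code from two characters (assumed encoding). */
#define PHONE2(a, b) \
    ((uint16_t)((((unsigned)(unsigned char)(a)) << 8) | (unsigned char)(b)))

/* Hypothetical library holding only the phones needed for 'bamite'. */
static const phone_entry demo_library[] = {
    { PHONE2('b', 'a'), 0x7EE00u },
    { PHONE2('m', 'i'), 0x50540u },
    { PHONE2('t', 'e'), 0x00000u },
};

/* Scan the library entry by entry, as the counter does in hardware.
 * Returns 1 and writes the starting address on a match, 0 otherwise. */
int find_phone(const phone_entry *lib, size_t count,
               uint16_t segment, uint32_t *addr_out)
{
    for (size_t k = 0; k < count; k++) {
        if (lib[k].phone == segment) {
            *addr_out = lib[k].start_addr;
            return 1;
        }
    }
    return 0;
}
```

Looking up the segments 'ba', 'mi', and 'te' in this library returns the addresses 7EE00, 50540, and 00000 in turn, while a segment that is not stored returns 0 and leaves the address untouched.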
Figure 4.5. Sample output waveform 3 of the synthesized word 'gemini'

Figure 4.6. Sample output waveform 2 of the synthesized word 'devote'

Figure 4.7. Spectrogram of a spoken word 'goose'

Figure 4.8. Spectrogram of synthesized word 'goose'

Figure 4.9. Spectrogram of a spoken word 'byte'

Figure 4.10. Spectrogram of synthesized word 'byte'
CHAPTER 5
CONCLUSION

We conclude from this work that the proposed concatenative speech synthesis algorithm is an effective and easy method of synthesizing speech. Strictly speaking, this method is not exactly a speech synthesis system, but it is one of the most commonly used text-to-speech approaches. In a concatenative synthesizer, the designer provides recordings for phrases and individual words. The engine pastes the recordings together to speak out a sentence or phrase. The advantage of this mechanism is that it retains the naturalness and understandability of the speech. The downside is that it does not address the problem of spectral discontinuity. This problem can be reduced by techniques such as spectral smoothing, manual adjustment of unit boundaries, and the use of longer units to reduce the number of concatenations. The downside of spectral smoothing is that it decreases the naturalness of the resulting speech. We do not address this problem in this implementation, which can be improved upon in the future.

Controlling prosodic features has been found to be very difficult, and synthesized speech still usually sounds synthetic or monotonic. Techniques such as Artificial Neural Networks and Hidden Markov Models have been applied to speech synthesis. These methods have been useful for controlling synthesizer parameters such as duration and fundamental frequency.

The concatenative speech synthesis process has been modeled as a sequential circuit, and it is synthesized and implemented in VHDL. We made use of the Xilinx tools and XS tools to download the design onto the XSV-800 board, which contains an FPGA (v800hq240-4). Its functionality has been verified and illustrated with waveforms. In the future, visual text-to-speech, which is defined as the synchronization of a facial image with synthetic speech, will be the state of the art.
REFERENCES

[1] E. J. Yannakoudakis and P. J. Hutton, Speech Synthesis and Recognition Systems, Ellis Horwood Limited, 1987.
[2] Murtaza Bulut, Shrikanth S. Narayanan, and Ann K. Syrdal, "Expressive speech synthesis using a concatenative synthesizer," in Proceedings of International Conference on Spoken Language Processing, Sept 2002.
[3] M. H. O'Malley, "Text-to-Speech conversion Technology," IEEE Computer Journal, vol. 23, no. 8, pp. 17-23, Aug 1990.
[4] A. Acero, H. Hon, and X. Huang, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, 2001.
[5] J. P. Olive, A. Greenwood, and J. S. Coleman, Acoustics of American English Speech: a Dynamic Approach, Springer-Verlag, 1993.
[6] M. Plumpe, A. Acero, H. Hon, and X. Huang, "HMM-Based Smoothing For Concatenative Speech Synthesis," in The 5th International Conference on Spoken Language Processing, Nov-Dec 1998.
[7] David Chappell and John H. L. Hansen, "Spectral Smoothing for Concatenative Speech Synthesis," in International Conference on Spoken Language Processing, Dec 1998, pp. 1935-1938.
[8] Richard V. Cox, Candace A. Kamm, Lawrence R. Rabiner, and Juergen Schroeter, "Speech and Language Processing for Next-Millennium Communications Services," Proceedings of the IEEE, vol. 88, no. 8, pp. 1314-1337, Aug 2000.
[9] J. M. Pickett, J. Schroeter, C. Bickley, A. Syrdal, and D. Kewelyport, The Acoustics of Speech Communication, Allyn and Bacon, Boston, MA, 1998.
[10] B. H. Juang, "Why speech synthesis? (in memory of Prof. Jonathan Allen, 1934-2000)," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 1, pp. 1-2, Jan 2001.
[11] R. Baker and Fausto Poza, "Natural speech from a computer," in Proceedings of ACM National Conference, 1968, pp. 795-800.
[12] W. S. Y. Wang and G. E. Peterson, "A study of the building blocks in speech," Journal Acoustical Society of America, vol. 30, pp. 743, 1958.
[13] Phuay Hui Low and Saeed Vaseghi, "Synthesis of unseen context and spectral and pitch contour smoothing in concatenated text to speech synthesis," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2002, pp. 469-472.
[14] K. Lukaszewicz and M. Karjalainen, "A Microphonemic method of speech synthesis," in International Conference on Acoustics, Speech, and Signal Processing, Apr 1987, pp. 1426-1429.
[15] Y. Stylianou, "Removing linear phase mismatches in concatenative speech synthesis," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 232-239, Mar 2001.
[16] D. O'Brien and A. I. C. Monaghan, "Concatenative synthesis based on a harmonic model," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 1, pp. 11-20, Jan 2001.
[17] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, Signal Processing, vol. 34, pp. 744-754, Aug 1986.
[18] J. Wouters and M. W. Macon, "Control of spectral dynamics in concatenative speech synthesis," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 1, pp. 30-38, Jan 2001.
[19] Mark Tatham and Eric Lewis, "Improving text-to-speech synthesis," in International Conference on Spoken Language Processing, Oct 1996, pp. 1856-1859.
[20] E. Lewis and M. A. A. Tatham, "SPRUCE - A New Text-to-Speech Synthesis System," in Proceedings 2nd European Conference on Speech Communication and Technology, 1991, pp. 1235-1238.
[21] Y. Stylianou, "Assessment and correction of Voice Quality Variabilities in Large Speech Databases for Concatenative Speech Synthesis," in International Conference on Acoustics, Speech and Signal Processing, Mar 1999, pp. 377-380.
[22] M. Morris Mano and Charles R. Kime, Logic and Computer Design Fundamentals, Prentice Hall, Upper Saddle River, NJ 07458, 2001.
[23] M. J. Smith, Application Specific Integrated Circuits, Addison-Wesley, 1997.
[24] John L. Hennessy and David A. Patterson, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1994.
[25] Jan M. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice-Hall of India Private Limited, 2002.
[26] Jorgen Pedderson, Audio Project, http://www.itee.uq.edu.au/peters/xsvboard/audio/audio.htm, 2001.
Appendix A

/* The program to implement the speech synthesis algorithm */
#include <string.h>
#include <strings.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>   /* exit, EXIT_FAILURE */
#include <engine.h>   /* MATLAB engine interface */

#define MAXPHONEMES 500
#define MAXIDLEN 32

/* Main program */
int main()
{
    int i;
    Engine *ep;
    char S[MAXIDLEN];
    char answer[MAXIDLEN];
    char main_word[MAXIDLEN];
    char *init_word;
    char temp2[MAXIDLEN];
    char *ptr;
    char temp[MAXIDLEN];
    char token[10*MAXIDLEN], token2[10*MAXIDLEN];
    char newsubstr[MAXIDLEN];
    char lib[MAXPHONEMES][MAXIDLEN];
    int count=0, index=0, libsize=0, found=0, charcount=0, base=0;
    int phoneme_size=0;
    FILE *fp;

    /* Read the phonemes.txt file: one library entry per line, written as
       a main phoneme optionally followed by comma-separated alternatives */
    fp = fopen("phonemes.txt", "r");
    if (fp == NULL)
    {
        fprintf(stderr, "\nCan't open phonemes.txt\n");
        return EXIT_FAILURE;
    }
    index = 0;
    while (index < MAXPHONEMES && fscanf(fp, "%31s", lib[index]) == 1)
    {
        index++;
    }
    fclose(fp);
    libsize = index;
    for (index = 0; index < libsize; index++)
    {
        printf("%s\n", lib[index]);
    }

    /* ask for the input word */
    printf("\nEnter the string to be parsed: ");
    scanf("%31s", S);

    /* Calling the MATLAB program */
    if (!(ep = engOpen("")))
    {
        fprintf(stderr, "\nCan't start MATLAB engine\n");
        return EXIT_FAILURE;
    }

    count = 1;
    charcount = 0;
    base = 0;
    while (charcount < (int)strlen(S))
    {
        found = 0;
        /* Try candidate lengths from 4 characters down to 1
           (the lower bound of 1 is assumed) */
        for (phoneme_size = 4; phoneme_size >= 1; phoneme_size--)
        {
            /* Skip candidates that would run past the end of the word */
            if (base + phoneme_size > (int)strlen(S)) continue;

            for (i = 0; i < phoneme_size; i++)
            {
                temp[i] = S[base+i];
            }
            temp[i] = '\0';
            printf("Looking in Library for string: %s of length %d\n", temp, (int)strlen(temp));
            found = 0;
            for (i = 0; i < libsize; i++)
            {
                /* Work on a copy, because the commas are overwritten */
                strcpy(temp2, lib[i]);
                init_word = temp2;
                ptr = strchr(init_word, ',');
                if (ptr) *ptr = '\0';
                strcpy(main_word, init_word);
                /* if (strncmp(temp,main_word,strlen(temp))==0) */
                if (strcmp(temp, main_word) == 0)
                {
                    strcpy(answer, main_word);
                    printf("main_word = %s", main_word);
                    printf("The output is %s\n", answer);
                    found = 1;
                }
                if (found == 0)
                {
                    if (ptr)
                    {
                        /* Check the alternatives that follow the main phoneme */
                        init_word = ptr + 1;
                        ptr = strchr(init_word, ',');
                        while (ptr)
                        {
                            *ptr = '\0';
                            /* if (strncmp(temp,init_word,strlen(temp))==0) */
                            if (strcmp(temp, init_word) == 0)
                            {
                                strcpy(answer, main_word);
                                found = 1;
                            }
                            if (found == 1) break;
                            init_word = ptr + 1;
                            ptr = strchr(init_word, ',');
                        }
                    }
                }
                if (found == 1)
                {
                    printf("The answer is %s\n", answer);
                    printf("Found a phoneme in the library : %s\n", lib[i]);
                    /* Load the recording of this phoneme into MATLAB as X<count> */
                    sprintf(token, "[X%d,fs, nbits] = wavread('sounds/%s.wav');", count, answer);
                    printf("token = %s", token);
                    engEvalString(ep, token);
                    count++;
                    found = 1;
                    base = base + phoneme_size;
                    charcount = charcount + phoneme_size;
                    break;
                }
            }
            if (found == 1) break;
        }
        if (found == 0)
        {
            printf("Failed to find a matching phoneme for %s\n", temp);
            printf("Aborting!!");
            engClose(ep);
            exit(1);
        }
    }

    /* engEvalString(ep, "[X,fs,nbits] = wavread('ca.wav');");
       engEvalString(ep, "[Y,fs,nbits] = wavread('ar.wav');");
       engEvalString(ep, "C = [X;Y];"); */

    /* Build the MATLAB expression [X1;X2;...] that concatenates the recordings */
    printf("count = %d\n", count);
    strcpy(token, "[");
    for (i = 1; i < count; i++)
    {
        if (i == 1)
        {
            sprintf(newsubstr, "X%d", i);
        }
        else
        {
            sprintf(newsubstr, ";X%d", i);
        }
        printf("%s", newsubstr);
        strcat(token, newsubstr);
        printf("%s", token);
    }
    strcat(token, "]");
    printf("token = %s", token);
    sprintf(token2, "C = %s;", token);
    printf("token2 = %s", token2);
    engEvalString(ep, token2);
    engEvalString(ep, "wavwrite(C,fs,nbits, 'word.wav');");
    engEvalString(ep, "close;");
    engClose(ep);
    return 0;
}
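The matching strategy of the program above (try the longest candidate first, then shorter ones) can be isolated from the MATLAB calls. The following is a minimal sketch under stated assumptions: the function name, the in-memory library array, and the return convention are hypothetical, and the comma-separated alternatives handled by the original program are left out.

```c
#include <stddef.h>
#include <string.h>

#define MAX_UNIT_LEN 4   /* longest candidate tried, as in the program above */

/* Split `word` into library units, longest match first.
 * On success, stores pointers to the matched library strings in
 * `units_out` and returns how many were found. Returns -1 if some
 * position has no matching unit or `max_units` is too small. */
int segment_word(const char *word,
                 const char *const *lib, size_t lib_size,
                 const char **units_out, size_t max_units)
{
    size_t len = strlen(word);
    size_t pos = 0;
    int count = 0;

    while (pos < len) {
        const char *hit = NULL;
        size_t hit_len = 0;

        for (size_t size = MAX_UNIT_LEN; size >= 1 && hit == NULL; size--) {
            if (pos + size > len) {
                continue;           /* candidate would overrun the word */
            }
            for (size_t i = 0; i < lib_size; i++) {
                if (strlen(lib[i]) == size &&
                    strncmp(word + pos, lib[i], size) == 0) {
                    hit = lib[i];
                    hit_len = size;
                    break;
                }
            }
        }
        if (hit == NULL || (size_t)count >= max_units) {
            return -1;
        }
        units_out[count++] = hit;
        pos += hit_len;
    }
    return count;
}
```

With a library containing "ba", "mi", and "te", the word "bamite" is split into those three units in order. A word containing a stretch that no library entry covers yields -1, which corresponds to the abort branch of the original program.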