MARC catalog record (MARCXML, MARC21slim schema)
Control number: 001670395
Local identifier: E14-SFE0001269
Glasbrenner, Merete Møller.
Vowel identification by monolingual and bilingual listeners [electronic resource] : use of spectral change and duration cues / by Merete Møller Glasbrenner.
[Tampa, Fla.] : University of South Florida, 2005.
Thesis (M.S.)--University of South Florida, 2005.
Includes bibliographical references.
Text (Electronic thesis) in PDF format.
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
Title from PDF of title page.
Document formatted into pages; contains 122 pages.
ABSTRACT: Recent studies have shown that even highly-proficient Spanish-English bilinguals, who acquired their second language (L2) in childhood and have little or no foreign accent in English, may require more acoustic information than monolinguals in order to identify English vowels and may have more difficulty than monolinguals in understanding speech in noise or reverberation (Mayo, Florentine, and Buus, 1997; Febo, 2003). One explanation that may account for this difference is that bilingual listeners use acoustic cues for vowel identification differently from monolinguals (Flege, 1995). In this study, we investigated this hypothesis by comparing bilingual listeners' use of acoustic cues to vowel identification to that of monolinguals for six American English vowels presented under listening conditions created to manipulate the acoustic cues of vowel formant dynamics and duration. Three listener groups were tested: monolinguals, highly proficient bilinguals, and less proficient bilinguals. Stimulus creation included recording of six target vowels (/i, I, eI, E, ae, A/) in /bVd/ context, spoken in a carrier phrase by four American monolinguals (two females, two males). Six listening conditions were created: 1) whole word, 2) isolated vowel, 3) resynthesized with no change, 4) resynthesized with neutralized duration, 5) resynthesized with flattened formants, and 6) resynthesized with flattened formants and neutralized duration. The resynthesized stimuli were created using high-fidelity synthesis procedures (Straight; Kawahara, Masuda-Katsuse, and Cheveigné, 1998) and digital manipulation. A six-alternative forced choice listening task was used. The main experiment was composed of 240 isolated vowel trials and 48 whole word trials.
Adviser: Catherine L. Rogers.
Speech-Language Pathology
USF Electronic Theses and Dissertations.
Vowel Identification by Monolingual and Bilingual Listeners: Use of Spectral Change and Duration Cues

by

Merete Møller Glasbrenner

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science
Department of Communication Sciences and Disorders
College of Arts and Sciences
University of South Florida

Major Professor: Catherine L. Rogers, Ph.D.
Stefan A. Frisch, Ph.D.
Jean Krause, Ph.D.
Joseph Constantine, Ph.D.

Date of Approval: 07.26.2005

Keywords: speech perception, synthesized vowels, formant flattening, duration neutralization

Copyright 2005, Merete Møller Glasbrenner
To my parents: You always encouraged me to pursue my dreams, even if it meant going abroad. My husband, Andrew: You initially suggested pursuing this field. You motivated me not to reach for the closest apple but for the star farthest away. Most of all, thank you for all your love and support. To all of my friends in the Speech-Language Pathology program: Thank you for keeping me real and putting up with me in both times of happiness and sadness. What a ride this has been! To Patti and Katherine, my two externship supervisors: Thank you for all your support and understanding. To Dr. Rogers: For being a wonderful mentor and role model. I would not have missed a minute of it.
Acknowledgments

I would like to express my gratitude to a number of people who have been of great help in the course of writing my thesis, and who have also inspired and motivated me. To Sylvie Wodzinski, Teresa DeMasi, and Michelle Bianchi: Thank you for all of your help in the lab and your friendship. I will truly miss you. To my committee members, Dr. Jean Krause, Dr. Stefan Frisch, and Dr. Joseph Constantine: I'm honored to have had the opportunity to work with you. I appreciate your opinions and your flexibility. I still think this: great defense questions! To Dr. Diane Kewley-Port, for mentoring me and reinforcing my interest in speech science. Most importantly, I would like to thank Dr. Catherine Rogers. You not only inspired me and sparked my interest in speech science three years ago, but I also had the honor and privilege of working closely with you and learning from you. Thank you for giving me this great opportunity. I am grateful for your support, confidence, and advice in both good and bad times. Thank you for your patience with me and for dedicating numerous hours to preparing data and proofreading my convoluted "Danish" thoughts. You opened up a door to a world of science that intrigues me so much. I hope the journey will continue.
Table of Contents

List of Tables ............................................................ iv
List of Figures ............................................................ v
Abstract .................................................................. vi
Chapter 1: Introduction .................................................... 1
    Vowel Acoustics ........................................................ 4
    Synthesis .............................................................. 6
        Formant Speech Synthesis ........................................... 6
        Speech Resynthesis ................................................. 8
    Speech Perception and Synthesis ....................................... 10
        Silent Center ..................................................... 10
        Formant Contours and Synthesis .................................... 11
    Second Language Acquisition and Foreign Accent ........................ 15
    Perceptual Assimilation Model ......................................... 17
    Speech Learning Model ................................................. 19
    American Vowels ....................................................... 25
    Studies on L2 Acquisition ............................................. 27
    Purpose of Study ...................................................... 33
Chapter 2: Method ......................................................... 35
    Participants .......................................................... 35
        Speakers .......................................................... 35
        Potential Listener Screening and Language Background Questionnaire 35
        Listeners ......................................................... 36
    Materials and Instrumentation for Speaker Recording ................... 40
        Instrumentation ................................................... 40
        Speech materials .................................................. 41
        Stimulus Creation Procedures ...................................... 42
        Word Extraction ................................................... 42
        Vowel Isolation ................................................... 44
        Preparatory Acoustic Measurements for Formant Flattening .......... 45
        Resynthesis Method for Natural Preserved Vowel Tokens ............. 47
        Resynthesis of Altered Vowel Conditions ........................... 48
Chapter 3: Procedure ...................................................... 51
    Testing Procedure of Subjects ......................................... 51
        Calibration ....................................................... 51
        Testing ........................................................... 51
        Trials ............................................................ 53
        Data Manipulation ................................................. 54
Chapter 4: Results ........................................................ 55
    Explanation of Percent Correct Analysis ............................... 55
        Whole Word versus Original Vowel .................................. 55
        Original versus Natural Preserved Vowel ........................... 57
        Natural Preserved Vowel versus Natural Neutral Vowel .............. 58
        Natural Preserved versus Flat Preserved Vowel ..................... 58
        Flat Preserved versus Flat Neutral ................................ 58
        Overall Patterns .................................................. 58
    Analysis of Vowels by Condition ....................................... 60
    Statistical Analyses .................................................. 62
        Main Effects ...................................................... 63
        Two-way Interaction of Group and Listening Condition .............. 64
        Two-way Interaction of Vowel and Listening Condition .............. 65
        Three-way Interaction ............................................. 67
    Confusion Matrices .................................................... 72
        Whole Word Confusion Matrix ....................................... 73
        Original Vowel Confusion Matrix ................................... 75
        Natural Preserved Vowel Confusion Matrix .......................... 76
        Natural Neutral Vowel Confusion Matrix ............................ 78
        Flat Preserved and Flat Neutral Vowel Confusion Matrix ............ 79
        Confusion Patterns ................................................ 82
Chapter 5: Discussion ..................................................... 83
    Question 1: Effects of Vowel Isolation ................................ 83
    Question 2: Effects of Straight Resynthesis ........................... 85
    Question 3: Effects of Formant and Duration Cues for HP and LP Bilingual Listeners ... 86
    Question 4: Effects of Vowel and Listening Condition .................. 88
    Question 5: Confusion Patterns of Listener Groups ..................... 91
    Summary ............................................................... 92
References ................................................................ 95
Appendix A. Monolingual Language Questionnaire ........................... 100
Appendix B. Bilingual Language Questionnaire ............................. 101
Appendix C. Steady State Replication ..................................... 104
Appendix D.1. Results for the vowel / / by listening conditions by listener group ... 107
Appendix D.2. Results for the vowel / / by listening conditions by listener group ... 108
Appendix D.3. Results for the vowel / / by listening conditions by listener group ... 109
Appendix D.4. Results for the vowel / / by listening conditions by listener group ... 110
Appendix D.5. Results for the vowel / / by listening conditions by listener group ... 111
Appendix D.6. Results for the vowel / / by listening conditions by listener group ... 112
List of Tables

Table 1. Demographic and language background information for the highly proficient bilingual group. Notes: AOLI = age of learning English intensively. E = English; S = Spanish; B = both English and Spanish rated equally. Origin = country of birth, or country of birth of parents for participants born in the U.S. .......... 39

Table 2. Demographic and language background information for the less proficient bilingual group. Notes: AOLI = age of learning English intensively. E = English; S = Spanish; B = both English and Spanish rated equally. Origin = country of birth, or country of birth of parents for participants born in the U.S. .......... 40

Table 3. Statistical analysis table. The table depicts main effects, two-way interactions, and the three-way interaction. .......... 63

Table 4. Significant effects for six contrasts of interest in the three-way interaction. Each cell lists the vowels for which the contrast was found to be significant and the size (in RAUs) and direction of effect, with the associated p value in parentheses. .......... 69

Table 5. Confusion matrix for the whole word condition, depicting performance of the monolingual, highly-proficient bilingual, and less-proficient bilingual listeners. .......... 74

Table 6. Confusion matrix for the original vowel (OV) condition, depicting performance of the monolingual, highly proficient bilingual, and less proficient bilingual listeners. .......... 76

Table 7. Confusion matrix for the natural preserved vowel (NP) condition, depicting performance of the monolingual, highly proficient bilingual, and less proficient bilingual listeners. .......... 77

Table 8. Confusion matrix for the natural neutral vowel condition, depicting performance of the monolingual, highly proficient bilingual, and less proficient bilingual listeners. .......... 79

Table 9. Confusion matrix for the flattened preserved vowel (FP) condition, depicting performance of the monolingual, highly proficient bilingual, and less proficient bilingual listeners. .......... 80

Table 10. Confusion matrix for the flattened neutral vowel (FN) condition, depicting performance of the monolingual, highly proficient bilingual, and less proficient bilingual listeners. .......... 82
List of Figures

Figure 1. The natural vowel / / when synthesized (NP): spectral and duration cues were not modified. .......... 49

Figure 2. An example of the vowel / / with flattened formants and neutralized duration. .......... 50

Figure 3. Stimulus screen. .......... 53

Figure 4. Mean percent-correct word-identification scores for each of the three speaker categories, monolingual native (NA), highly-proficient bilingual (HP), and lower-proficiency bilingual (LP), at each of the six listening conditions. Error bars indicate two standard errors of the mean. .......... 57

Figure 5. Mean percent-correct word-identification scores for each of the six target vowels and each of the six listening conditions. Error bars indicate two standard errors of the mean. .......... 62
Vowel Identification by Monolingual and Bilingual Listeners: Use of Spectral Change and Duration Cues

Merete Møller Glasbrenner

ABSTRACT

Recent studies have shown that even highly-proficient Spanish-English bilinguals, who acquired their second language (L2) in childhood and have little or no foreign accent in English, may require more acoustic information than monolinguals in order to identify English vowels and may have more difficulty than monolinguals in understanding speech in noise or reverberation (Mayo, Florentine, & Buus, 1997). One explanation that may account for this difference is that bilingual listeners use acoustic cues for vowel identification differently from monolinguals (Flege, 1995).

In this study, we investigated this hypothesis by comparing bilingual listeners' use of acoustic cues to vowel identification to that of monolinguals for six American English vowels presented under listening conditions created to manipulate the acoustic cues of vowel formant dynamics and duration. Three listener groups were tested: monolinguals, highly proficient bilinguals, and less proficient bilinguals.

Stimulus creation included recording of six target vowels (/i, I, eI, E, ae, A/) in /bVd/ context, spoken in a carrier phrase by four American monolinguals (two females, two males). Six listening conditions were created: 1) whole word, 2) isolated vowel, 3) resynthesized with no change, 4) resynthesized with neutralized duration, 5) resynthesized with flattened formants, and 6) resynthesized with flattened formants and neutralized duration. The resynthesized stimuli were created using high-fidelity synthesis procedures (Straight; Kawahara, Masuda-Katsuse, & Cheveigné, 1998) and digital manipulation. A six-alternative forced choice listening task was used. The main experiment was composed of 240 isolated vowel trials and 48 whole word trials.

Data from 17 monolinguals, 25 highly proficient bilinguals, and 18 less proficient bilingual listeners indicate a consistent but relatively small decrease in performance for the proficient bilinguals compared to the monolinguals, a substantially greater decrease in performance for the less proficient bilinguals compared to the proficient bilinguals, and a greater decrease in performance due to formant flattening than to duration neutralization for all groups. In support of the hypothesis of differing cue use by bilinguals, the data showed significantly different patterns of performance across vowels and listening conditions for the three listener groups.
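The duration-neutralization manipulation described above can be illustrated with a minimal conceptual sketch: resampling every vowel token to a common length by linear interpolation. This is only a stand-in for illustration; it is not the Straight algorithm used in the study (raw waveform resampling of this kind also shifts pitch and formant frequencies, which is exactly why the study used Straight's source-filter decomposition instead). The function name and values are illustrative, not from the thesis.

```python
def neutralize_duration(samples, target_len):
    """Rescale a vowel token to a fixed number of samples by linear interpolation.

    Conceptual stand-in for duration neutralization; not the Straight algorithm.
    """
    n = len(samples)
    if n == target_len:
        return list(samples)
    out = []
    for i in range(target_len):
        # Map each output index onto a (possibly fractional) input position.
        pos = i * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# A short token stretched to a common length shared by all tokens.
short = [0.0, 1.0, 0.0, -1.0, 0.0]
fixed = neutralize_duration(short, 9)
```

After this step, every token in a condition has identical duration, so duration can no longer serve as an identification cue.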
Chapter 1

Introduction

Previous research has indicated that monolingual listeners use dynamic spectral information to identify vowels. While static spectral information is often sufficient to identify most vowels, monolingual listeners also benefit from the inherent spectral change that occurs during the course of vowel production (Hillenbrand & Nearey, 1999; Strange, Jenkins, & Johnson, 1983). Data suggest that removal or modification of inherent vowel information affects intelligibility negatively. Nevertheless, although we know that listeners rely on certain cues to identify sounds, their relative importance and acquisition order still remain somewhat unclear.

The field of speech perception encompasses many areas, all of which contribute to an improved understanding of auditory processing. For instance, aural rehabilitation treatment of a hard-of-hearing client builds on a hierarchy going from easy to difficult. In the easy approach, print, topic, visual, and auditory cues may all be available to the client. These cues will additively help the listener to identify the spoken stimuli. A difficult therapy approach may exclude print, topic, or visual cues, so that the only factor the listener relies on is hearing the stimuli. Depending on the client's hearing and processing level, stimuli may be presented using many or few cues. Thus, speech perception relates not only to what is heard but also to how well a person remembers a sequence: one three-word sentence is easier to memorize, word for word, than ten long embedded sentences.
The point of these hierarchies is that we rely upon our senses, experiences, and memory to hear, store, process, and extract meaning from what we hear (Major, 2001; Tye-Murray, 1998). Normal-hearing listeners use similar strategies to identify sounds: many redundant cues are used, and all work together. However, the way in which bilingual listeners use these same cues is less clear. One reason bilingual speech perception is of importance is the increase of bilinguals and polyglots in the United States. This fact raises the importance of studies that examine production and perception in speakers who acquire English as their second language (L2).

The significance of the demographic changes in the U.S. is apparent in U.S. Census data. The release of the 2000 U.S. Census revealed significant growth in the non-English-speaking population in only 10 years. Analysis of the data showed that 47 million persons, or 18% of the American population aged 5 and above, were reported to speak a language other than English in their homes. Spanish ranked first among these languages, with Spanish speakers composing 10% of the total American population (approximately 28 million). In only ten years the Spanish-speaking population increased from 17 to 28 million nationwide (Shin & Bruno, 2003).

Noteworthy for the 2000 Census was a detailed analysis of speakers' English skills. Speakers identified themselves as speaking English "very well," "well," "not well," or "not at all." About 20 million of the Spanish-speaking population rated their English skills within the levels of "well" to "very well." However, nearly 8 million rated their abilities as "not well" or "not at all" (Shin & Bruno, 2003).
A noticeable increase in Spanish speakers was also seen within the state of Florida, where the bilingual population increased from 12 to 15 million for people aged 5 and over from 1990 to 2000. Of the 15 million, 10% reported that they spoke English less than "very well" (Shin & Bruno, 2003).

Returning to the issue of speech perception, a recent study by Febo (2003) found that even early Spanish-English bilingual listeners performed more poorly than monolingual English listeners when presented with monosyllabic English words in different levels of background noise and reverberation. Results showed that even bilingual listeners who acquired their second language before age 6 and were rated as having little or no foreign accent in English identified fewer words correctly than monolinguals when the words were presented in noisy and reverberant conditions, although both groups showed identical (perfect) performance in quiet. This study indicates that even early acquisition of an L2 appears to impact speech recognition negatively in certain listening conditions (Febo, 2003).

The Febo study suggests that there is indeed a difference in speech processing abilities between monolingual and bilingual listeners. However, the data do not explain why these differences occur. In the present study, we compared bilingual listeners' use of acoustic cues to American English vowel identification with that of monolinguals for vowels presented in various distorted listening conditions. The next sections are dedicated to explaining speech acoustics and speech perception, as well as current research within the field of second language acquisition.
Vowel Acoustics

To understand the perception of vowels, it is essential to first summarize the physics of speech. Vowels are best explained with what is known as the "source-filter model" of speech production. The production of a vowel begins at the level of the larynx, where a periodic or cyclic signal is generated; thus, the vocal folds are referred to as the periodic source. The sound then travels through the vocal tract, which behaves like a variable filter that changes in response to tongue and jaw position. The fundamental frequency is typically modulated by increasing or decreasing tension of the vocal folds, but it is also dependent on the length of the vocal tract (Borden, Harris, & Raphael, 1994; Fry, 1979).

Vowels are formed by shaping the vocal tract in a manner in which certain frequencies are attenuated and others amplified. The amplified points are called formants, or resonance points, and these are what make vowels so salient. Vowels can be characterized by their first, second, and third formants (F1, F2, and F3) (Borden et al., 1994; Fry, 1979). However, when a vowel is isolated, meaning no consonant precedes or follows it, F1 and F2 are often sufficient to identify most vowels (Pickett, 1999; Strange, 1999). Height of the jaw, position of the tongue constriction (back vs. front), and lip rounding generate vowel-specific formant patterns (Pickett, 1999; Strange, 1999). F1 is influenced most by jaw and tongue height: if the jaw, and thus the tongue, is raised, F1 tends to be low (Borden et al., 1994; Fry, 1979; Pickett, 1999; Strange, 1999). The second formant (F2) is influenced more by the location of the tongue constriction: if the constriction is towards the front, F2 increases (Borden et al., 1994; Fry, 1979). Thus, the vowel /i/, which is articulated with the tongue high and forward, has low F1
and high F2 values, whereas / /, which is articulated with the tongue low and back, has high F1 and low F2 values.

Two additional acoustic cues that vary across vowels are duration and spectral change. In terms of duration, American English vowels can be divided into tense and lax vowels; lax vowels are typically shorter than tense vowels (e.g., lax / / versus tense /i/) (Borden et al., 1994; Fry, 1979; Pickett, 1999; Strange, 1999).

Spectral change, or vowel drift (Hillenbrand, Getty, Clark, & Wheeler, 1995; Hillenbrand & Nearey, 1999), becomes crucial when describing the dynamic formant change of diphthongs or even monophthongs. Unlike /i/, for which formant contours remain fairly static, the formant frequencies of the diphthong / / move from / / formant values to / / values, according to Hillenbrand (1999). The physical movement of the jaw and lips illustrates the change of this diphthong: the jaw moves upwards, as does the tongue, while the location of the target constriction moves forward in the mouth. This type of change also occurs for many American English vowels that are typically described as monophthongs, although to a lesser degree than for a diphthong such as / /.

Consonantal environments also influence vowel formant patterns, causing consonant-vowel (CV) or vowel-consonant (VC) formant transitions to differ at the onset and offset of the vowel depending on the consonant context. An example is the effect of stop consonants on vowel formant transitions. Stops, or plosives, are sounds formed by stopping airflow from the mouth at some point and suddenly releasing it. Vowels that precede or follow a stop will, depending on the place of articulation of the stop, have different formant transitions. For instance, a bilabial plosive elongates the vocal tract
forming lower "burst" frequencies. The formant transitions following the burst, specifically of F2 and F3, will have a low-to-high, or rising, contour. For the velar plosive /k/, in contrast, the vocal tract is shortened and the resonance patterns become more complex, producing falling F2 and rising F3 CV transitions (Strange, 1999). Another effect of the consonant-vowel (CV) context is the elongating effect of word-final voicing: a vowel that is followed by a voiceless consonant has a shorter duration than one followed by a voiced consonant (e.g., /bit/ versus /bid/) (Strange, 1999).

Synthesis

Synthesis and speech acoustics are related in the sense that traditional synthesis cannot be performed unless a thorough knowledge of speech physics is available. Creating artificial speech sounds requires an in-depth understanding of source generation, the filter, and the transfer function to achieve the desired final output or signal. In speech perception studies the Klatt synthesis, or formant synthesis, method is most often used. However, in recent years a new approach has been winning popularity: the high-fidelity resynthesis of speech created by Kawahara et al. (1998). Both synthesis methods will be discussed in the upcoming paragraphs. An advantage of using synthesis in perceptual studies, regardless of the method, is the ability to control variables. By the same token, the drawback of speech synthesis is that perception may be compromised due to a lack of natural-like sound quality.

Formant Speech Synthesis. Formant speech synthesis uses the principles of source, filter, and transfer function to compute and generate signals. A number of synthesis software programs exist, some more sophisticated than others. One of the more frequently used synthesis programs in perception studies was developed by Dennis Klatt.
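The source-filter computation behind formant synthesis can be sketched in a few lines: an impulse-train "glottal" source is passed through second-order digital resonators connected in cascade, one per formant. This is a minimal illustration of the cascade principle, not Klatt's actual software; the formant and bandwidth values below are illustrative assumptions, not values from the thesis.

```python
import math

def resonator(signal, freq, bw, fs):
    # Second-order digital resonator of the form used in formant synthesizers:
    # y[n] = A*x[n] + B*y[n-1] + C*y[n-2].
    c = -math.exp(-2.0 * math.pi * bw / fs)
    b = 2.0 * math.exp(-math.pi * bw / fs) * math.cos(2.0 * math.pi * freq / fs)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def synth_vowel(formants, bandwidths, f0=120.0, dur=0.3, fs=16000):
    # Impulse train as a crude periodic (glottal) source.
    n = int(dur * fs)
    period = int(fs / f0)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Cascade: the output of each resonator feeds the next one.
    for freq, bw in zip(formants, bandwidths):
        signal = resonator(signal, freq, bw, fs)
    return signal

# Illustrative /i/-like formant targets (assumed values, not from the thesis).
samples = synth_vowel(formants=[270.0, 2290.0, 3010.0], bandwidths=[60.0, 90.0, 150.0])
```

Because each resonator's output feeds the next, the relative formant amplitudes fall out of the filtering itself, which is the property the cascade arrangement is valued for in vowel synthesis.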
Using Fant's and Lawrence's hardware principles, Klatt created a phoneme synthesis-by-rule program: a hybrid of cascade and parallel formant synthesis. Cascade and parallel synthesis were originally different types of hardware, each of which was better adapted to synthesis of different sounds (Klatt, 1987).

Parallel formant synthesis by Lawrence employed the principles of the vocal tract transfer function using anti-formants and formant resonators, with separate independent filters used to fabricate each specific resonant frequency (described in Klatt, 1987). The output signals of these filters were then added together to form one sound. Filters allowed buzzing (voiced), hissing (voiceless), or a combination of the two noise types, realizing obstruent sounds (e.g., /s/ versus /z/). Fant's cascade approach allowed the output of one filter to feed into the next; the resulting relation of amplitude among the formants is more like real speech for vowels generated in this way. Thus, cascade formant synthesis was found to be better adapted for vowel synthesis, while parallel synthesis produced better-quality consonants (Klatt, 1987).

Klatt combined these two principles, creating a hybrid synthesis system which allowed the advantages of both cascade and parallel techniques to be exploited for synthesis of consonants and vowels (Klatt, 1987). To increase the naturalness of the speech, Klatt proposed a number of duration rules for different contexts (see Klatt, 1987). An advantage of Klatt synthesis is that variables can easily be shifted; for example, signals can be changed from a synthetic male speaker to a female one by software commands.

The disadvantage of creating signals by hand using the Klatt formant synthesis software is the cumbersome process of generating signals. Formant synthesis requires extensive knowledge of individual plosive voice onset time (VOT), locus of frication
noise, locus of burst energy, and formant transitions to obtain the desired voicing and noise features. Additional parameters needed to create synthetic sentences include intensity, duration, and F0 patterns. These factors help convey a number of linguistic attributes: syllabic structure, vocal effort, stress, speaking rate, syntactic structure, intonation, and gender. Consequently, even a single word requires precision work for quality to be optimal (Klatt, 1987). Another difficulty with Klatt formant synthesis is that some features are harder to generate than others. Klatt and Klatt (1990) noted that Klatt-synthesized female breathiness, as well as female pitch contours, sounded artificial. Producing natural-sounding plosives was also difficult to achieve, complicated by the locus of the burst energy and the CV or VC formant transitions. Nevertheless, many speech perception studies have used Klatt formant synthesis principles to gain an understanding of listeners' use of speech cues.

Speech Resynthesis. A method of high-fidelity resynthesis developed by Kawahara and colleagues has recently been introduced into speech perception studies (Kawahara, Masuda-Katsuse, & de Cheveigné, 1998). The uniqueness of this type of resynthesis is its close resemblance to natural speech. The difference between formant speech synthesis and the high-fidelity speech resynthesis developed by Kawahara and colleagues is that the latter resynthesizes already-recorded speech. One may immediately see the disadvantage of high-fidelity resynthesis: without a speech source or speaker, resynthesis is very difficult to perform. The creation of high-fidelity resynthesis can be credited to Kawahara and colleagues, who developed Straight as a speech manipulation tool. Straight uses input
speech signals and decomposes them into source and spectral characteristics (Kawahara et al., 1998). Straight is also designed to enable alterations of pitch, duration, and amplitude. Straight builds on Dudley's vocoder (see Dudley, 1939) and Klatt's synthesis principles (see Klatt, 1980). In the user interface developed for Straight, the user typically begins with an existing speech signal, which is resynthesized in Straight; that is, the Straight parameters are set to model the existing speech sample. Broad parametric changes (such as shifting the fundamental frequency, the duration, or the frequencies of all formants by a specified proportion) are accomplished relatively easily through the Straight user interface. More focused changes, such as changing the frequency of one formant without changing the others, are less easily accomplished. One way of accomplishing such a goal is to change by hand the spectral matrix in which the amplitude and frequency values are stored. Another is to set initial (source) and target frequency points at various times and to morph from source to target. However, initial testing of this method by the author and her mentor for the generation of static vowels resulted in formant trajectories that were flattened but not perfectly flat. For optimal resynthesis, a filter shape, or analyzing wavelet, is superimposed on the formant spectrum, or spectral envelope, of a complex sound. The area of the spectrum covered by the wavelet determines both the signal-to-noise ratio (SNR) and the frequency resolution, with greater coverage resulting in better SNR and resolution (Kawahara et al., 1998). Thus, it is important to cover as much of the formant spectrum as possible to obtain high frequency resolution and a high signal-to-noise ratio. A spectral envelope of the signal is extracted instantaneously and inserted into the speech
manipulation system, which converts the spectrum into amplitude, frequency, and time values, respectively.

Speech Perception and Synthesis

Fry and colleagues were among the early pioneers of speech perception research using synthetic stimuli. In 1962, they conducted a study in which vowels were converted from [ ] to [ ] to [ ] in a 13-step continuum (Fry, Abramson, Eimas, & Liberman, 1962). Using formant synthesis, F1 was incrementally changed from low to high frequencies while F2 was moved in the opposite direction. Results of identification tests indicated that the vowel categories were not clearly defined. Instead, the percent identification of the vowels sloped gradually from one to the other, which is also referred to as continuous perception (Fry et al., 1962). Similar studies have been conducted with comparable results. Although these studies have provided an understanding of the continuous perception of vowels, most of them have treated vowel formants as static entities.

Silent Center. Strange, Jenkins, and Johnson (1983) conducted a study testing lax and tense vowels in spoken /bVd/ words. A designated center portion of the vowel duration was silenced (50% for lax vowels and 65% for tense vowels). To test whether listeners relied on temporal information, the duration of the silenced portion was equalized across stimuli in two ways: by shortening the silent interval to 57 ms for all stimuli and by lengthening it to 163 ms for all stimuli. Strange and colleagues also tested whether listeners' performance was influenced by the amount and type of information provided in four conditions: initial CV only, final VC only, silent center, and center alone (or isolated vowel). American monolingual listeners were presented with the stimuli and asked to choose which word they heard from a list of 10 alternatives. Results indicated
that listeners make increasingly more errors as more acoustic information is removed (e.g., CV only versus both CV and VC) (Strange et al., 1983). Whole words were identified, on average, only about 10 percentage points better than the silent-center stimuli, suggesting that listeners can in fact use CV and VC formant transitions for vowel identification, because identification was not substantially reduced when only the CV and VC transitions were available. Temporal manipulations resulted in little change when the silent portion was shortened; lengthening the silent portion, however, resulted in an increase in errors. Isolated vowel centers (shorter than the silenced portion) showed a significant increase in errors compared to the silent-center syllables. The study concluded that listeners depend highly on dynamic spectral cues, somewhat on temporal cues, and greatly on the completeness of information: vowels in the CV-only and VC-only conditions were very poorly identified, indicating that both initial and final information was needed for good identification of silent-center vowels (Strange et al., 1983).

Formant Contours and Synthesis. Hillenbrand and colleagues (1999) conducted a study to examine the effects of vowel formant contour movement on vowel identification. Twelve American English vowels were recorded in /hVd/ context, spoken by men, women, and children. Three listening conditions were created for each vowel: the natural /hVd/ vowel (NAT), an original-formant (OF) synthesized /hVd/ vowel, and a flat-formant (FF) synthesized vowel. The OF and FF conditions were both created using Klatt synthesis (Hillenbrand & Nearey, 1999). To create the synthetic stimuli, acoustic measurements of F0 and F1-F4 were made from LPC spectra extracted every 8 ms (Hillenbrand & Nearey, 1999). Frequency
and temporal information were noted for F0 and F1 through F4 at the onset of the vowel, at the offset of the vowel, and at the steady-state portion. Formant frequencies were measured at the 20% and 80% points of the vowel duration to avoid measurements from the CV and VC transitions. Using the Klatt (formant) synthesis method, OF and FF stimuli were generated. To create the OF stimuli, synthesis parameters were simply set to match the F0 and F1-F4 values measured every 8 ms using the LPC formant extraction methods detailed above. The FF stimuli were created by identifying a vowel steady point, at which F1 and F2 were judged to be changing minimally, and setting the F1-F4 values for the entire vowel to the values measured at that steady point. CV and VC transitions were altered accordingly to match these steady-point values. Monolingual listeners identified the stimuli (/hVd/) in a closed-set identification task, in which 10 alternative /hVd/ words were presented and listeners were asked to choose the one they had heard. A total of 900 test signals were presented to each listener in random order (300 original signals, 300 OF signals, and 300 FF signals). Results showed that listeners excelled when presented with the natural /hVd/ stimuli (see below). Listeners performed more poorly when the signals were synthesized, and performance decreased even further when the formant contours were flattened (NAT: 95.4%, OF: 88.5%, FF: 73.8%) (Hillenbrand & Nearey, 1999). Analysis of the vowel categories revealed interesting patterns. For example, performance for the vowel /i/, which has little vowel drift, decreased only minimally when the formants were flattened. On the other hand, a significant decrease in identification was noted for / /, which is known to have more spectral change due to its diphthongized features. Noteworthy was that identification patterns appeared to differ across vowel categories.
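The flat-formant manipulation described above can be sketched as follows. This is a simplified illustration rather than Hillenbrand and Nearey's actual procedure: the sampled formant track is synthetic, and the steady point is approximated here as the frame at which F1 and F2 jointly change least.

```python
import numpy as np

def flatten_formants(tracks, t):
    """Flatten formant contours at a 'steady point', approximated as the
    frame where the summed absolute slopes of F1 and F2 are smallest.
    `tracks` is an (n_frames, n_formants) array of frequencies in Hz,
    sampled at the times in `t` (e.g., every 8 ms); returns a same-shaped
    array with every formant held constant at its steady-point value."""
    f1, f2 = tracks[:, 0], tracks[:, 1]
    slope = np.abs(np.gradient(f1, t)) + np.abs(np.gradient(f2, t))
    steady = int(np.argmin(slope))
    flat = np.tile(tracks[steady], (tracks.shape[0], 1))
    return flat, steady

# Hypothetical 8-ms-sampled F1/F2 track: transitions at vowel onset and
# offset, flattest mid-vowel (25 frames over 200 ms)
t = np.arange(0, 0.20, 0.008)
x = t / t[-1]
f1 = 400 + 100 * np.sin(np.pi * x)    # rises into and falls out of the vowel
f2 = 2200 - 300 * np.sin(np.pi * x)   # mirror-image F2 movement
tracks = np.column_stack([f1, f2])
flat, steady = flatten_formants(tracks, t)
```

In the study itself, the flattened values then drove the synthesizer for the entire vowel, with the CV and VC transitions altered to match the steady-point values.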
The second part of the experiment tested whether listeners' performance changed as the information presented decreased. Four listening conditions were generated: the natural /hVd/ utterance, the natural vowel only, the original-formant synthesized /hVd/ utterance, and the original-formant synthesized vowel only. Identification scores for the naturally produced stimuli revealed that listeners performed slightly but significantly worse when the consonant context was removed. The difference between the synthesized OF /hVd/ utterances and the OF isolated vowels was too small to be significant. Furthermore, the naturally spoken stimuli (both /hVd/ and isolated vowels) were significantly more intelligible than the original-formant synthetic signals (NAT /hVd/: 96.7%, NAT vowel alone: 94.4%, OF /hVd/: 91.0%, and OF vowel alone: 90.3%). The findings of this study suggest that spectral change plays an important role in vowel identification. Vowel intelligibility also increased when the vowels were in consonant context rather than in isolation. Maintaining the original dynamic formant patterns of the vowels in the synthetic stimuli still resulted in some decrease in intelligibility compared to natural speech, suggesting that formant synthesis fails to reproduce naturally spoken vowels with complete accuracy. Lastly, listeners were noted to perform differently for individual vowels in the formant-flattened conditions, implying that listeners use formant dynamic information more for some vowels than for others (e.g., /i/ versus / /). Although Hillenbrand and colleagues demonstrated that CV and VC information, synthesis, and the flattening of formants all influence vowel identification negatively, this experiment was performed with monolingual listeners only and used formant synthesis. One question to be answered is whether high-fidelity resynthesis of speech differs from
the Klatt synthesis. Another question is whether bilinguals would be affected differently by these changes than monolinguals. Liu and Kewley-Port (2004) investigated whether Straight-resynthesized speech is recognized differently from Klatt-synthesized speech by comparing listeners' ability to discriminate vowels under Klatt and Straight synthesis, respectively. Listeners were tested on discriminating several vowels (/ , /) in syllable, phrase, and sentence contexts. Shifts of the formant peaks of the target vowels were performed by determining the peaks' spectral locations through analysis of matrices generated by Straight, copying and pasting the associated spectral-peak intensity values to a higher frequency location, and replicating the values for the spectral troughs. Fourteen-step continua, from smaller to larger formant frequency changes (0.9 to 10% for most vowels), were generated separately for F1 and F2 using this method. Liu and Kewley-Port found thresholds for formant discrimination for Straight-synthesized stimuli similar to those previously found for Klatt-synthesized stimuli, indicating that the authors' method of altering the more natural-sounding Straight-synthesized stimuli yielded valid data, in that listeners responded in the discrimination task much as they had in previous studies with Klatt-synthesized stimuli. Nevertheless, the results for all three listening conditions showed a slight inflation of the vowel discrimination scores when the stimuli were synthesized with Straight rather than with Klatt synthesis. Generating a large number of stimuli using the methodology employed by Liu and Kewley-Port (2004) would be at least as time-consuming as using Klatt synthesis; the advantage of Straight lies in the naturalness of the resulting stimuli, not in the time taken to create them. Because stimuli generated using Straight are more similar to
natural speech, it follows that the use of better-quality synthesis may result in patterns of listener performance that are more representative of listeners' responses to natural speech. Assmann and Katz (2005) conducted a study to test perceptual differences between stimuli created using formant synthesis and stimuli created using Straight. Using the Hillenbrand and Nearey (1999) protocol, Assmann and Katz replicated that study to re-examine the role of formant contours for vowels synthesized using Straight. The results of the Assmann and Katz (2005) study were compared to those of the Hillenbrand et al. (1999) study to evaluate the differences between Straight and Klatt formant synthesis. Twelve vowels were presented in /hVd/ context in three listening conditions: a natural speaking condition, formant-synthesized conditions, and Straight-resynthesized conditions. Comparison of natural speech, Straight, and Klatt synthesis revealed that listeners showed significantly better vowel identification performance for Straight-synthesized stimuli than for Klatt-synthesized stimuli. While these studies have shown that native listeners use the cues of duration and vowel formant dynamics, less is known about how bilinguals use these cues. That is, the cues they use may be different, or the same cues may be used differently. It also remains to be answered whether more natural-sounding stimuli lessen the effect of synthesis for isolated vowels.

Second Language Acquisition and Foreign Accent

Proficiency in a second language is frequently associated with a speaker's reduction of foreign accent. However, proficiency is truly determined by a second language (L2) speaker's linguistic abilities in the areas of L2 syntax, lexicon, pragmatics,
phonology, and phonetics (Major, 2001). Nevertheless, L2 proficiency is often most apparent in a speaker's spoken or written language (Major, 2001). Additionally, an L2 learner's production is commonly characterized by a foreign accent, which can be defined as the difference between the pronunciation patterns of a non-native speaker of a language and those of a native speaker (Flege, 1995; Major, 2001). Thus, foreign accent is primarily influenced by phonetic and phonological factors, meaning that vowels and consonants may be distorted, substituted, or even omitted, as compared to the standard productions of native speakers of the target language (Flege, 1995; Major, 2001). As with production, differences may also exist in the way that native and non-native speakers perceive the sounds of a language (Flege, 1995). Possible factors affecting speakers' degree of accent, or their difficulty in correctly identifying second-language speech sounds, are differences in the age of onset of learning (AOL) the second language, differences between the vowel and consonant inventories of the first language (L1) and the L2, and differences in the degree of linguistic experience with and exposure to the L2 (Flege, 1995; Major, 2001). Returning to the issue of foreign accent, it has been suggested that brain plasticity plays a crucial role in learning an L2. Older learners (i.e., adolescents and adults) may have decreased flexibility of sensorimotor neural wiring; that is, speech movements become automated and harder to change with increasing age (Flege, 1995; Major, 2001). As a consequence, adults learning a second language are more likely to speak with a foreign accent because of their previously wired neuromotor patterns. Thus, with late L2 acquisition, previously learned L1 sound production patterns tend to influence and distort intended L2 sound production, causing decreased speech intelligibility (Flege, 1995).
To what degree L1 sounds affect L2 sound perception and production skills, and whether other aspects may account for different proficiency levels, remain key questions. Two prominent explanations of the L1-L2 relationship are Best's Perceptual Assimilation Model (PAM) and Flege's Speech Learning Model (SLM). These two models each offer hypotheses regarding L2 learners' potential acquisition of non-native sounds.

Perceptual Assimilation Model

The Perceptual Assimilation Model (PAM) offers predictions of how an L2 learner will identify and discriminate non-native sounds. According to Best (1995), non-native sounds are assimilated into the already established L1 sound system. Best explains that patterns of identification of single non-native sounds can be narrowed down using three perceptual principles. Non-native or L2 sounds can be assimilated to a native category, or they can be perceived as uncategorizable speech sounds that are still within the native phonological space. Finally, L2 sounds may be perceived as non-speech, in which case they are not assimilated into the native phonological space but fall outside it. If an L2 listener is presented with pairs of different L2 sounds, PAM predicts that discrimination of the L2 sounds will fall into one of the following categories: Two-Category Assimilation (TC Type), Category-Goodness Difference (CG Type), Single-Category Assimilation (SC Type), Both Uncategorizable (UU Type), Uncategorizable versus Categorized (UC Type), and Nonassimilable (NA Type) (Best, 1995). Two-Category (TC) assimilation occurs when the two L2 sounds are distinctively different within the L1 sound system. That is, the sounds are both perceived as speech sounds, and each falls within a different native category. The listener is expected to
perceive the presented L2 sounds as different and to assimilate them into already existing L1 sound categories. When two L2 sounds are less distinct, meaning that both lie close to a common L1 sound, the L2 sounds may in fact merge and be perceived as only one L1 sound, which is referred to as SC Type assimilation. According to the PAM model, two-category sounds are discriminated better than single-category sounds because of this difference factor. A Category-Goodness (CG) difference arises when two sounds are both perceived to match one L1 sound, but one is perceived as a better exemplar of the category than the other. The listener will perceive a slight difference between the sounds. Thus, the production of one L2 sound will be closer to the ideal, while the other L2 sound is perceived as a relatively poor example of the intended L1 sound. Consequently, the listener will approximate the L2 sounds to one L1 sound but will still perceive a slight difference between them. Assuming the principles of PAM to be true, and given a bilingual learner of Spanish and English (Spanish L1), discrimination of English vowels is expected to be compromised, since English contains 11 monophthongal vowels (Crystal, 1997), compared with 5 in Spanish (Dalbor, 1969). Furthermore, the English vowel quadrilateral is more densely populated with front and back vowels than the Spanish one. If an L2 listener is presented with two front vowels, both produced with a raised jaw (e.g., /i/ vs. / /), the sounds may merge and be perceived as only one sound, /i/ (e.g., SC type or CG type). The Spanish listener may perceive only a temporal difference between /i/ and / / and may produce these as a long and a short /i/ (Dalbor, 1969). Although Best's model of the assimilation of non-native sounds suggests that perception depends on the perceived dissimilarity between L2 sounds and their "goodness" in
comparison to L1 sounds, PAM does not address the role of AOL in the bilingual learner's speech perception and production patterns. The Speech Learning Model (SLM) by Flege addresses the relationship between production and perception as well as the impact of ongoing L2 exposure (Flege, 1995, 1996; Flege, Munro, & MacKay, 1995).

Speech Learning Model

Flege investigated the question of why, after a certain age of onset of L2 acquisition (AOL), children demonstrate less native-like production in their L2 (Flege, 1981). Previous studies had indicated that in order to acquire an L2 with little to no accent, the L2 had to be acquired before a certain age (Flege, 1995; Major, 2001). This reasoning is linked to the idea that the plasticity of neurosensory wiring decreases in adolescence, also known as the Critical Period Hypothesis (CPH) (Flege, 1995; Major, 2001). The CPH was originally formulated to explain L1 acquisition (Lenneberg, described in Major, 2001). Evidence for the CPH has been found in victims of traumatic brain injury (TBI). Comparisons of young children and adolescents with TBI found that the children recovered nearly completely, whereas the adolescents sustained some cognitive deficit (Major, 2001). However, in the 1981 study, Flege found indications that the CPH alone did not explain accentedness. The study tested children who presumably were within the critical period and had increased sensorimotor abilities compared with younger children (Flege, 1981). However, the data revealed that despite these optimal conditions, the children demonstrated difficulties learning both vowels and consonants (Flege, 1981). This led to a belief that the "learning" cutoff age was either ill-defined or nonexistent, or that a third factor accounted for the learning barrier. Conversely, several perceptual studies indicated that there is a
correlation between the age of onset of learning an L2 and accentedness, which will be discussed in detail later. To account for this paradox of increased sensorimotor skills coexisting with increased foreign accent and decreased perceptual performance, Flege and colleagues created a set of hypotheses called the Speech Learning Model (Flege, 1995). The objective of the Speech Learning Model (SLM) is to explain L2 speakers' learning process and the importance of AOL for foreign accent. The four postulates of the SLM suggest that an L2 learner's L1 system remains adaptive over the life span and can be applied in learning an L2 (postulates 1 and 2). The L1 phonetic system reorganizes itself upon the introduction of new L2 sounds (Flege, 1995). Further, the SLM suggests that bilinguals strive to establish distinct categories for separate L1 and L2 sounds (Flege, 1995). Flege and colleagues created the SLM as a result of a number of speech perception and production studies involving L2 learners. A collection of seven hypotheses comprises the SLM; all of these are believed to be important for second language learning. The principles of the SLM and the PAM are similar to the degree that predictions regarding L2 perception and production depend on the perceived similarities and differences of L2 phonemes from L1 phonemes. However, the SLM specifies that small differences between an L1 and an L2 sound can be discerned at the allophonic level; that is, duration may be what sets an L1-L2 pair apart (Flege, 1995). For example, the English vowel / / does not exist in the Spanish language. A Spanish listener may perceive the sound as the phoneme / / or / /, because both share similar spectral properties with / /, but the sound may be perceived as a longer (or otherwise different) variant of one of these L1 phonemes.
Also, according to the SLM, early onset of L2 learning will result in improved phonetic realization, or production, as well as improved perception; that is, the learner will be able to tell L1 and L2 sounds apart even if the cues differ minimally. However, with increasing age of onset of the L2, a bilingual listener-speaker will experience increasing difficulty discerning the small differences between L1 and L2 sounds that may be perceived as allophonic differences (Flege, 1995). The SLM further asserts that L1 and L2 sounds are grouped together in a common phonetic space, although they have diaphonic realization; that is, L1 and L2 sounds will be produced in the appropriate language context (e.g., L1 sounds in the L1 and L2 sounds in the L2). The theory also asserts that this perceptual merging of the L1 and L2 sound inventories will lead to a merging of the production of two L2 sounds that are both similar to a single L1 sound. Thus, when L1 and L2 categories are too similar, the realization of the L2 categories, and thus perception and production, may be altered from native-speaker norms. A hypothetical example of sound merging can be found in a Spanish speaker's production of the two American English vowels /e/ and / /. A Spanish listener may be able to discern the differences between the two sounds; however, when integrating the sounds into the Spanish vowel system, the Spanish speaker might encounter difficulty mapping the sounds to separate sound categories. Spanish has the five vowels /i, e, a, u, o/, whereas American English has 11 monophthongal vowels, not counting rhoticized vowels and diphthongs, depending on dialect (/ , , , , , /) (Crystal, 1997). Considering the front vowels, American English /e/ and / / may both be perceived as exemplars of Spanish /e/, although they may be heard as distinct from one
another. Since they are both identified as members of a single category, the theory suggests that the two sounds will not be produced distinctively, even if the listener can perceptually discriminate between the two sounds when they are presented in pairs. The SLM also stresses the dynamic nature of the learning process. Flege suggests that both the L1 and L2 sound systems influence each other over time and that this bidirectional influence may actually contribute to shifts in perception in both the L1 and the L2 (Flege, 1995). Initially, this may mean that increased language experience improves L2 perception as well as production. However, research has shown that L1 sounds may undergo spectral changes in production for highly experienced bilinguals (Flege, 1995). Moreover, the model suggests that for L1 and L2 sounds that are very similar, an experienced bilingual may produce a single vowel that is intermediate between the native-speaker norms for the L1 and the L2. That is, the experienced bilingual may produce the same intermediate sound in both languages, and the sound's production will differ from that of native speakers of either language (Flege, 1995). The complexity of perceptual assimilation patterns was demonstrated by Rochet (1995), who showed how foreign accent and perceptual biases may relate to each other. Rochet argued that foreign accent is rooted in biased perception and, consequently, that imitation of vowels will be distorted. The study comprised a production portion and a perception portion. For the production portion, Portuguese and English listeners were asked to repeat single-syllable words containing the French vowels /y, u/, and / /. The productions were rated by native French listeners. Analysis of the rating data revealed that when Portuguese speakers' production of /y/ was inaccurate, the native French listeners mostly rated their productions as / /. The English
speakers' production of /y/, however, was rated mostly as /u/. The perception portion of the experiment required the participants to identify / / and /u/. Using formant synthesis, the second formant of the vowel was incrementally changed in 100 Hz steps from 500 to 2500 Hz. Comparison of the identification performance of the French, Portuguese, and English listeners revealed that the vowel boundaries were located at different frequencies for each group. For the English listeners, the /u/-/i/ boundary was located at about 1900 Hz, while for the Portuguese listeners it was located at about 1600 Hz (Rochet, 1995). Analysis of the identification patterns of the native listeners of French showed that the vowel boundary between /u/ and /y/ was centered around 1200 Hz, whereas the /y/-/ / boundary was located at about 2100 Hz. It is noteworthy that the native listeners of French (which has a relatively large vowel inventory) were able to reliably identify (i.e., with 100% identification performance) tokens as /y/ and /u/ within an F2 range only slightly larger than that needed by the native listeners of Portuguese (which has a relatively small vowel inventory) to reliably identify only two phonemes (/u/ and /i/). That is, the Portuguese listeners showed larger regions of inconsistent performance than did the French, so that performance for the French listeners was more categorical. To illustrate, the decrease in vowel identification from 100% /u/ to 0% /u/ (100% /i/) for the Portuguese listeners stretched from 1200 to 2200 Hz; for the French listeners, however, the decrease for /u/ began at 900 Hz, and 100% identification as /i/ began at 2300 Hz. For native listeners of English (which, like French, has a relatively dense vowel inventory), performance was also more categorical (i.e., inconsistent performance extended over a smaller frequency range). From these data, Rochet concluded that not only do different
languages differ in the frequency boundaries of the "same" phonemes, but perceptual vowel category boundaries extend to adjacent categories, leaving no uncommitted space (as evidenced by the broader range of inconsistent performance for the Portuguese listeners). Thus, the Rochet study shows that vowel production and perception patterns in a second language cannot be predicted on the basis of the phonemic level alone; rather, acoustic measurements of the vowels in both languages may be necessary to predict vowel assimilation patterns. This complexity of interaction between L1 and L2 phoneme inventories is also seen in consonants. A comparison of the number of consonants in the two languages (L1 and L2) has important implications for which phonemes will be produced with a foreign accent; of equal importance, however, is a comparison of the languages' use of spectral cues concerning place, manner, and voicing (Flege, 1995, 1996). These distinct spectral features not only make consonants unique in isolation, but an interaction between vowels and consonants is often seen at the point of transition (e.g., the VOT of plosives or the locus of frication noise) (Strange, 1999). Even at the level of syllable shape or word position, the variation of consonants across contexts appears to account for compromised perception or production skills; e.g., syllable-initial production and perception may be more native-like than syllable-final production or perception (Flege, 1995; Major, 2001; Strange, 1999). The reason for this perceptual limitation may lie in coarticulation (e.g., "key" versus "cooh"), assimilation patterns (e.g., 'would you' becomes 'woucha'), or simply the increased amount of new information that must be processed and stored (e.g., in phonological short-term memory) (Flege et al., 1995; Major, 2001).
Flege also suggests that listeners will be more likely to detect and reproduce an L2 phonetic difference when they encounter it early in life and when their exposure to that difference is lengthy. One study supporting this prediction is Flege et al. (1995), in which Italian-English bilinguals whose contact with English was early and lengthy were better able to produce the word-initial consonants /r, / in English, and were better able to produce distinctive contrasts in the word-final consonants /b/-/p/, /t/-/k/, and /k/-/g/ in English, than later bilinguals with less extensive contact with the L2 (Flege et al., 1995).

American Vowels. The Speech Learning Model has been the starting point for many cross-linguistic studies. Most studies using the SLM have, however, shown greater concern with consonant perception and production than with vowels (Flege, 1995). As mentioned earlier, data from previous studies suggest that monolinguals' ability to identify vowels depends on formant transitions, duration, formant frequency, and formant dynamics. Vowel perception by bilinguals or L2 speakers is assumed to be influenced by the vowel inventory. Cross-language comparisons of phoneme inventories have indicated that languages' vowel and consonant repertoires, and the interactions between them, have a crucial influence on L2 acquisition, alongside the importance of L1 and L2 use of acoustic cues noted above (Flege, 1995). One observation is that languages with sparsely populated vowel spaces (i.e., languages with a relatively small number of vowels in their inventories) may give the learner limited examples of vowels for production and perception (Flege, 1995, 1996). Vowels that differ from the few vowels in the inventory may be perceived as poor examples of one of the small number of vowels in the inventory, rather than as separate vowels (cf. Best, 1995). An example of a
language with a relatively sparse vowel inventory is Spanish, which is typically described as having five vowels (/i, e, a, u, o/) (Dalbor, 1969). American English, on the other hand, is typically described as having a relatively dense vowel inventory, with 11 monophthongal vowels (/ /) (Crystal, 1997). Thus, a person whose first language is Spanish and who is learning English as a second language must create at least six new vowel categories in their vowel space. Because American English vowels are, acoustically speaking, closely spaced, neighboring vowel categories for men, women, and children may overlap with one another (cf. Peterson & Barney, 1952). Consequently, the number of vowels in the inventory is not the only essential factor in misinterpretation; the spectral properties of the sounds may also play a role. Although a two-dimensional formant diagram (depicting the first and second formants) for a single talker or talker group may suggest that vowels are distinctly separated, a depiction of vowel productions across age and gender groups shows much more overlap of the acoustic features of the sounds (e.g., / / produced by an adult female may overlap with a child's / / production) (Peterson & Barney, 1952). If spectral change from the beginning to the end of the vowel is plotted as a third dimension, however, it becomes apparent that English vowels bear a significant dynamic property, also referred to as "vowel drift," which helps to separate vowels in the space (Hillenbrand et al., 1995). Whether vowel drift is found in other languages has been less investigated. What has been found, though, is that second language learners are perceived better by native-English listeners if spectral and duration cues are mastered (Flege et al., 1995; Kewley-Port, Akahane-Yamada, & Aikawa, 1996). If a second language learner were unable to use the vowel drift
information, however, American English vowels might seem much more difficult to discriminate than if the vowel drift information were available.

Summarizing the previous sections on the SLM, PAM, and American vowels: L2 learners are faced with a number of potential obstacles that increase the difficulty of perception. In the following section, speech perception and production studies are discussed to delineate similarities and perceptual differences between bilinguals and monolinguals.

Studies on L2 Acquisition

As mentioned earlier, Strange and colleagues (1983) found evidence that monolingual listeners depend on dynamic vowel information (CV and VC transitions) and also that increased vowel information leads to better performance than less information (e.g., both CV and VC transitions versus the CV transition alone). Similar findings have been obtained with bilingual listeners. Mayo and colleagues (1997) examined whether the age of onset of L2 learning influences speech perception in different listening conditions. Monolingual and bilingual listeners were tested. Three bilingual groups were used: 1) bilingual since infancy (BSI), 2) bilingual since toddlerhood (BST), and 3) bilingual post-puberty (BPP). All participants listened to short sentences ending with a target word. One set of sentences had high context predictability, whereas the context of the second set was manipulated to reduce predictability of the target word. The practice portion of the sentences was presented without noise. Step two of the study was to determine the listeners' speech recognition threshold when stimuli were presented in noise. Once the threshold was established, several signal-to-noise ratios (SNRs) were selected for the next portion, encompassing
SNRs for which 15%-85% correct performance was obtained in part 2. Stimuli were then presented at each of these SNRs (part 3) (Mayo et al., 1997). Data analysis of responses to part 3 included percent of correct responses, level of predictability of words, and noise level. Results of the study suggested that the bilingual post-puberty listeners performed more poorly on both high- and low-predictability targets. The early-AOL bilingual groups (i.e., BSI and BST) performed similarly to monolinguals on predictable sentences; however, low-predictability targets were identified more poorly by early bilinguals than by monolinguals. Z-scores for the noise levels indicated that late bilingual learners (BPP) require higher thresholds for good identification. Ultimately, late AOL of the L2 appears to result in decreased perception skills, both in that the threshold is increased and when target words are unrelated to the context. The findings of this study were supported by Febo's study (2003), in which bilinguals were likewise shown to need more audible information for speech recognition.

Lopez (2004), using the methodology of Strange et al. (1983), investigated whether silent-center vowel perception differed between monolingual English speakers and Spanish-English bilinguals. The bilingual participants were divided into two proficiency groups: high and low. Results revealed that monolingual speakers excelled in identifying silent-center vowels. Highly proficient bilinguals performed better than less proficient bilinguals, but not as well as the monolingual speakers. These data suggest that even proficient bilinguals may need more information for vowel identification than monolinguals.

Sebastian-Galles and Soto-Faraco (1999) tested whether AOL affects bilinguals' discrimination of two vowel pairs and two consonant pairs, presented in CV.CV (e.g., /gesi/-/gezi/) or CVC.CV (e.g., /nesku/-/nɛsku/) non-word
context. Gating procedures were used, in which about 30 ms of the word was presented at the initial gate and 10 additional ms were presented at each successive gate. The entire non-word was presented at the tenth gate. Non-words were presented in a two-alternative forced-choice format; that is, the listener was asked to choose which of the alternatives they had heard at each gate. Two groups of bilingual speakers were tested; both had acquired Catalan and Spanish at an early age. One group had been exposed only to Spanish up until 4 years of age, after which both Catalan and Spanish were used. The other group had been exposed to both Spanish and Catalan from birth. Comparisons of the two groups found that the Spanish-dominant, or "later," learners needed more information than those who were bilingual from birth to discriminate between the phonemes presented.

Meador, Flege, and MacKay (2000) also examined the importance of age of arrival (AOA) in the country for word identification skills. In addition, the effects of percentage of L1 use and length of residence (LOR) on the recognition of English words by native-Italian bilinguals were examined. Since Italian and English differ in both consonant and vowel inventories (see Meador et al., 2000), the SLM predicts that age of onset of learning (AOL) as well as exposure to the L2 may be integral factors for optimal perception. For comparison, bilinguals with different AOL and different percentages of L1 use, as well as monolingual listeners, were tested by having them repeat English sentences with low semantic predictability presented in noise (Meador et al., 2000). Results showed that longer LOR correlated positively with correct word identification. Since identification performance was measured by having the listeners repeat the target sentences, foreign accentedness of the bilingual participants was also studied. Results showed that
L2 speakers with increased LOR demonstrated reduced accentedness as well as increased word identification abilities. A further finding indicated that early bilinguals who used their L1 less demonstrated increased word recognition ability. Thus, age of arrival might be taken as a guide to the learner's state of neurological development at the time L2 learning begins. This study suggests that accentedness is affected by both AOA and AOL, as well as by continued use of the L1.

MacKay, Meador, and Flege (2001) elaborated on the Meador et al. study by examining whether age of arrival (AOA) in the country is important for L2 perception, whether use of the L1 impairs L2 production and perception, and lastly whether L1 and L2 participants would differ in phonological short-term memory (PSTM). Using CVC words in semantically unpredictable sentences, participants were asked to repeat target words (the last word of the sentence). Italian-English bilinguals (Italian being the participants' L1) were purposefully chosen by the authors due to the phonetic and phonotactic differences between the two languages. For instance, Italian does not have many words that end in consonants, and the Italian consonant inventory is smaller than that of English (see MacKay et al., 2001). In terms of the SLM, this difference may cause certain English acoustic cues to be ignored by an Italian listener, while other cues may be weighted more heavily. Bilingual participants were grouped according to AOA, length of residence (LOR), and percent use of Italian. Four groups were formed: early-low (early AOA mean = 7 years, LOR mean = 40 years, low percent use of Italian = 8%), early (early AOA mean = 7 years, LOR mean = 40 years, high percent use of Italian = 32%), mid (AOA mean = 14 years, LOR mean = 34 years, percent use of Italian
= 20%), and late (AOA mean = 19 years, LOR mean = 28 years, percent use of Italian = 41%). Data analysis for the identification task showed no overall AOA effect among the four native-Italian speaker groups. However, an interaction between consonant position and L1 use was found: early Italian-English bilinguals who used their L1 seldom were noted to have fewer difficulties identifying initial consonants than final consonants. Early Italian learners with greater use of the L1 demonstrated increased errors in both initial and final word position compared to early low-L1 users. Furthermore, late bilinguals were noted to demonstrate increased errors in initial consonant identification compared to monolingual English speakers and early learners. In contrast to the Sebastian-Galles and Soto-Faraco study, MacKay and colleagues attributed compromised identification skills primarily to the degree of L1 use and not to the age of onset of the L2.

Bohn and Flege (1999) examined predictions of the SLM in adults learning new L2 vowels, with regard to both production and perception by bilinguals. In the study, Bohn and Flege addressed two questions: 1) whether adults can learn to produce and perceive a second-language vowel category for which no counterpart exists in their native language, and 2) whether there is a relation between their production and perception skills for this new category. The vowel /æ/ was chosen because it is not in the German vowel inventory. To test the questions, monolingual speakers of English and bilingual speakers of German and English were asked to produce the two vowels /ɛ/ and /æ/ in bVt context in a sentence (e.g., "I will say ___") (see Bohn and Flege, 1999, for further details). Two proficiency groups were recruited on the basis of their AOL, LOR in the U.S., and L2 training: experienced versus inexperienced bilinguals. Measurements of duration and the
frequencies of F0 and F1-F3 were performed for each vowel and compared across speaker groups. Acoustic analysis for the three speaker groups showed that some of the experienced bilinguals' vowel patterns resembled the monolinguals' production, while inexperienced native German speakers' productions of /ɛ/ and /æ/ were noted to overlap (Bohn & Flege, 1999). Analysis of duration patterns revealed experienced L2 speakers' production to be similar to that of the native speakers. Inexperienced speakers, on the other hand, showed overall reduced duration for both vowels (Bohn & Flege, 1999).

Using Klatt formant synthesis, 33 stimuli were created with differing F1-F3 frequencies and durations (11 stimuli per duration × 3 durations: 150 ms, 200 ms, and 250 ms). The three listener groups identified the stimuli as "bet" or "bat." Analysis of the percent of words identified as "bet" for the three durations indicated that both groups of L2 speakers performed differently from the native listeners; greater differences from monolinguals were found for the inexperienced than for the experienced L2 learners. A spectral effect was found for the experienced listeners, in which lowering F1 and raising F2 resulted in more "bet" responses. The inexperienced listeners' performance, in contrast, was affected more by duration information than by spectral information: lengthening the stimuli from 200 ms to 250 ms resulted in more /æ/ responses regardless of spectral information. Further analysis compared production and perception. Spectral and duration changes were arrayed to compare perception results to production details. The main pattern found was that all experienced L2 participants who showed more native-like production patterns also showed native-like perception patterns. The reverse, however, was not true; many experienced L2 participants who showed native-like perception
patterns did not show native-like production patterns. None of the less experienced L2 participants showed native-like production patterns. A suggestion to be drawn from this study is that adult learners (after age 30) appear to be able to form new sound categories when L2 exposure is intensive. However, as the SLM states, category formation is easiest when new L2 sounds are dissimilar to any of the L1 sounds. Further, these results appear to support Flege's hypothesis that accurate perception tends to precede accurate production.

The studies described in this section provide evidence that even proficient bilinguals depend on more acoustic information than monolinguals in difficult listening conditions (Febo, 2003; Lopez, 2004; Mayo et al., 1997; Sebastian-Galles & Soto-Faraco, 1999). Furthermore, data indicate that increased accentedness and decreased perceptual abilities are associated with decreased L2 exposure and increased use of the L1 (Bohn & Flege, 1999; Flege et al., 1995; MacKay et al., 2001; Meador et al., 2000). Studies have also shown that spectral and duration cues may be used differently by bilinguals than by monolinguals (Bohn & Flege, 1999). However, no studies that we are aware of have investigated the role of both duration and vowel formant dynamic cues in vowel identification, and how the use of these cues may differ across bilinguals of different L2 proficiencies. It also remains to be answered whether bilingual listeners' performance is affected by Straight synthesis.

Purpose of Study

This study focuses on the perception of American English vowels. The performance of monolingual American speakers and Spanish-English bilingual speakers will be compared. The study compares monolingual and bilingual
listeners' identification of the vowels /i, ɪ, eɪ, ɛ, æ, ɑ/ when eliminating the consonantal environment from /bVd/ syllables, reducing temporal cues, and flattening formant contours using resynthesis. The performance of monolingual American English speakers, highly-proficient Spanish-English bilinguals, and low-proficiency Spanish-English bilinguals will be compared. Thus, the present study will address five main research questions.
1. Are vowels perceived differently in isolation than in whole words by these groups?
2. Are vowels synthesized using high-fidelity synthesis (Straight) perceived differently than naturally produced vowels?
3. Do higher- and lower-proficiency bilinguals use vowel formant dynamic and duration cues differently than monolinguals?
4. Do the effects of vowel isolation, synthesis, formant dynamic cues, and duration cues differ across the six vowels studied?
5. Do patterns of confusion differ across the listener groups?
Chapter 2

Method

Participants

Speakers. Five monolingual native English speakers (3 males and 2 females) participated as speakers in this experiment. All were screened to exclude persons with a history of speech or hearing impairment. A trained phonetician and native English speaker determined that all five spoke English without a strong regional accent. Speakers ranged from 25 to 35 years of age. Two speakers originated from Florida, whereas the remaining three were from New York, Illinois, and Pennsylvania.

Potential Listener Screening and Language Background Questionnaire. Potential listeners, both monolingual and bilingual, were screened for accent and voice quality during a telephone interview prior to participating in the study. Persons with poor voice quality or a strong regional accent in English were excluded from the study. Potential listeners were also screened to exclude those with a history of speech or hearing impairment. During the telephone interview, potential bilingual participants were also given a preliminary classification as higher or lower proficiency based on the screener's perception of their degree of accentedness in English, but final classification was based on age of onset of learning English. The two telephone screeners were graduate students in speech-language pathology.

Upon their arrival at the study site, all participants completed a language background questionnaire. For monolinguals, demographic and basic language
background data were collected (see Appendix A). Persons who indicated fluency in a second language were eliminated from the study. Data from the language background questionnaire were also used to further screen monolingual participants for dialect. Speakers of Southern varieties of English were excluded due to the vowel shifts and mergers typical of these dialects. Speakers from the Tampa area or other urban regions of south Florida were preferred. According to Labov (2005), monolingual English speakers from South Florida, specifically urban areas, do not typically demonstrate the vowel mergers characteristic of some Southern dialects.

A more detailed language background questionnaire was used for the bilingual listeners (see Appendix B). In addition to basic demographic information, one set of questions probed the participants' languages spoken, dialect background in Spanish, age of onset of first learning English, the context in which English was first learned, number of years in the United States, and the age of onset of learning English in a context in which the language was used extensively, or age of onset of learning intensively (AOLI). AOLI is defined as the time at which L2 learners are exposed to English, or their L2, intensively, typically indicated by immersion in an L2 culture. In an additional set of questions, participants were asked to state their percent of daily use of their two languages in work, home, and other contexts, as applicable, and to compare their abilities and indicate which of their two languages they felt most comfortable using in the domains of speaking, listening, reading, and writing. Answers to these questions were used to estimate self-rated language dominance for each domain.

Listeners. Sixty persons between the ages of 18 and 59 participated as listeners in this experiment. Seventeen monolingual listeners (females = 15, males = 2) were selected
based on lack of a strong regional accent in English. Basic demographic and language background data are as follows. Fourteen of the participants originated from Florida, of whom all but one resided in the Tampa Bay area. The remaining three participants were from New York, Ohio, and California, respectively. The mean age of the participants was 21.24 years, with a range from 18 to 38 years.

A total of 43 Spanish-English bilinguals were included as listeners for this study. Potential bilingual listeners were identified as persons who described themselves as speaking Spanish and English only; persons who indicated fluency in a third language were excluded. Speakers of Castilian varieties of Spanish were also excluded from the study, but speakers of any American variety of Spanish were included. Although there are many differences in pronunciation patterns across American varieties of Spanish, they are typically viewed as being more similar to one another in pronunciation than to Castilian varieties of Spanish (Dalbor, 1969). A variety of dialect backgrounds was permitted because this diversity is representative of the population of Spanish-English bilingual persons in south Florida.

Based on the potential participants' age of onset of learning English intensively (AOLI) and the preliminary accentedness classification by the two screeners, 25 participants were classified as highly-proficient early learners of English and 18 were classified as less-proficient later learners. Tables 1 and 2 present selected demographic and language background data for each subject in the highly-proficient bilingual and less-proficient bilingual groups, respectively. As shown in Tables 1 and 2, all of the highly-proficient
bilinguals began learning English intensively at age 12 or earlier, while all of the less-proficient bilinguals began learning English intensively at age 15 or later.

Of the 25 highly-proficient bilinguals, 24 rated themselves as equally proficient in Spanish and English, or more proficient in English, in at least three of the four domains (speaking, listening, reading, and writing); the remaining participant rated herself as English dominant in two domains (speaking and writing) and Spanish dominant in two domains (listening and reading). Additionally, thirteen of these participants indicated that they were born and raised in the U.S., began learning English when they started school or preschool, and were educated exclusively in English. All were categorized by the screeners as having only a slight or no foreign accent in English.

As shown in Table 2, all of the less-proficient bilinguals began learning English intensively at age 15 or later. Of these 18 participants, 12 rated themselves as Spanish dominant in all four domains; two rated themselves as balanced or Spanish dominant in three domains; two rated themselves as Spanish dominant in two domains and English dominant in two domains; and two rated themselves as English dominant in three domains. All were categorized by the screeners as having a noticeable foreign accent in English.
Code  Age  Gender  Origin              AOLI  Yrs in U.S.  % English used (Work/Home/Other)  Most comfortable (Speak/Listen/Read/Write)
HI02  18   F       Dominican Republic  7     17           100 / 0 / 75                      E / E / E / E
HI04  18   F       Mexico              6     13           50 / 50 / -                       E / E / E / E
HI05  19   F       Cuba                4     19           80 / 80 / 85                      E / E / E / E
HI06  19   F       Mexico              5     16           - / 75 / 80                       B / E / E / E
HI07  21   M       Costa Rica          1     4            100 / 100 / 99                    B / B / E / E
HI08  19   F       Nicaragua           8     11           100 / 50 / 70                     E / E / E / E
HI09  18   M       Mexico              4     18           - / 99 / 99                       B / E / E / E
HI10  19   F       Nicaragua           6     19           40 / 20 / 60                      B / B / B / B
HI11  20   F       Cuba                6     20           95 / 70 / 80                      E / E / E / E
HI12  24   F       Puerto Rico         10    14           100 / 5 / 100                     E / E / E / E
HI13  19   M       Peru & El Salvador  3     19           100 / 80 / 95                     E / E / E / E
HI14  20   F       Cuba                4     20           100 / 0 / 100                     E / E / E / E
HI16  19   F       Mexico              6     19           - / 50 / 50                       S / E / E / E
HI17  19   F       Cuba                4     19           - / 95 / 98                       E / E / E / E
HI18  35   F       Venezuela           9     5            100 / 10 / 100                    E / S / S / E
HI19  18   F       Cuba                4     18           - / 50 / -                        E / E / E / E
HI20  27   F       Venezuela           4     8            60 / 0 / -                        B / E / E / B
HI22  18   M       Peru / El Salvador  5     18           100 / 90 / 95                     E / S / E / E
HI23  29   F       Puerto Rico         9     20           100 / 80 / 50                     E / E / E / E
HI24  26   F       Colombia            5     26           95 / 40 / 100                     E / E / E / E
HI25  21   F       Colombia            11    10           100 / 10 / 90                     E / E / E / E
HI26  26   F       Venezuela           12    14           98 / 0 / 90                       B / B / E / E
HI29  19   F       Cuba                2     19           - / 45 / 55                       B / B / E / E
HI30  19   F       Venezuela           8     11           100 / 30 / 85                     B / B / B / E
HI31  22   F       Mexico              6     22           80 / 80 / 100                     E / E / E / E
Avg./Summary: age 21.3; 21 F, 4 M; 6 Cuba, 5 Mexico, 4 Venezuela, 10 other; AOLI 6.0; 16.0 yrs in U.S.; % English 89.4 / 48.4 / 84.4; most comfortable: speak 16 E, 8 B, 1 S; listen 18 E, 5 B, 2 S; read 22 E, 2 B, 1 S; write 23 E, 2 B, 0 S.

Table 1. Demographic and language background information for the highly proficient bilingual group. Notes: AOLI = age of onset of learning English intensively. E = English; S = Spanish; B = both English and Spanish rated equally. Origin = country of birth, or country of birth of parents for participants born in the U.S.
Code  Age  Gender  Origin       AOLI  Yrs in U.S.  % English used (Work/Home/Other)  Most comfortable (Speak/Listen/Read/Write)
LO01  30   F       Panama       21    9            70 / 30 / 75                      E / S / B / B
LO02  41   M       Peru         24    15           100 / 50 / 100                    E / E / S / E
LO03  19   F       Colombia     15    4            - / 2 / 80                        S / S / S / S
LO04  23   M       Peru         17    5            50 / 99 / 50                      S / S / S / S
LO06  19   F       Colombia     18    1            - / 70 / 100                      S / S / S / S
LO07  50   F       Colombia     45    5            98 / 100 / 100                    S / S / S / S
LO09  21   F       Colombia     20    <1           - / 20 / 40                       S / S / E / S
LO10  28   F       Colombia     28    <1           - / 0 / 100                       S / S / S / S
LO11  22   F       Colombia     22    <1           100 / 60 / 40                     S / S / S / S
LO12  35   F       Colombia     35    <1           - / 15 / 90                       S / S / S / S
LO13  19   F       Puerto Rico  16    3            70 / 10 / 30                      S / S / S / S
LO14  25   M       Peru         22    3            100 / 0 / 100                     S / S / S / S
LO15  22   F       Colombia     18    4            90 / 0 / 20                       S / S / S / S
LO16  49   F       Colombia     46    10           100 / 0 / 20                      S / S / S / S
LO17  26   M       Colombia     16    10           50 / 10 / -                       S / S / E / E
LO18  57   F       Colombia     30    27           85 / 0 / 100                      S / S / S / S
LO19  22   F       Cuba         19    3            100 / 10 / 100                    S / S / E / E
LO20  29   M       Colombia     23    6            100 / 0 / 80                      S / E / E / E
Avg./Summary: age 29.8; 13 F, 5 M; 12 Colombia, 3 Peru, 3 other; AOLI 24.2; 5.9 yrs in U.S.; % English 85.6 / 26.4 / 72.1; most comfortable: speak 16 S, 2 E; listen 16 S, 2 E; read 13 S, 4 E, 1 B; write 13 S, 4 E, 1 B.

Table 2. Demographic and language background information for the less proficient bilingual group. Notes: AOLI = age of onset of learning English intensively. E = English; S = Spanish; B = both English and Spanish rated equally. Origin = country of birth, or country of birth of parents for participants born in the U.S.

Materials and Instrumentation for Speaker Recording

Instrumentation. Audio recordings were made using the following equipment: 1) an Audio-Technica AT4033a transformerless capacitor studio microphone; 2) a Roland Digital Workstation model VS890EX; 3) an Applied Research and Technology professional tube microphone pre-amplifier; 4) a Dell OptiPlex CX110 personal computer; and
5) a sound-treated, single-walled booth. Speakers were seated individually in the sound-treated booth. The speech was processed through the pre-amplifier and digitized at a sample rate of 44.1 kHz by the digital workstation, then transferred via digital input to the computer's M-Audio Audiophile 2496 sound card. CoolEdit (Johnston, 2000) was used to record the digital input to a sound file. Sound files for each speaker were saved for later stimulus preparation.

Speech materials. Each speaker was recorded speaking the six target vowels in /bVd/ context in a carrier phrase ("Say ______ again"). The /bVd/ items were "bead, bid, bayed, bed, bad," and "bod." Speakers read a list with the target words aloud to the experimenter prior to recording to ensure they used the target pronunciations and to avoid reading errors. The carrier phrase was used to maintain the final /d/ release. Speakers read an 18-item list two times; each target word was represented on the list three times, for a total of six repetitions of each target word. Each speaker read the list from a sheet of paper presented on a reading stand at eye level.

Recording Procedure. The microphone was positioned at a 45-degree angle approximately ten centimeters from the speaker's mouth to avoid peak clipping and popping. Instructions were given to read the sentences avoiding flapping of the final /d/ release. To prevent glottal fry, speakers were instructed to take breaths between phrases. Pausing to breathe between sentences also aided speakers in maintaining similar pitch and intensity levels across items. Before recording the stimulus items, a test recording was made to adjust the sound levels on the workstation and on the computer to rule out peak clipping.
Stimulus Creation Procedures. For each target word, one repetition was chosen for each speaker as the best exemplar. Next, the following six stimulus versions were created for each of 24 words (6 target items × 4 speakers):
1) whole word (WH), for which the word was simply extracted from the carrier phrase without modification;
2) original vowel (OV), for which the vowel was isolated from the word by removing CV and VC transitions without additional modification;
3) natural preserved (NP), for which the isolated vowel stimulus was resynthesized using Straight without modification of any parameters;
4) natural neutral (NN), for which the isolated vowel stimulus was resynthesized using Straight and its duration was adjusted within Straight to match the average vowel duration measured across the four speakers;
5) flat preserved (FP), for which pitch pulse replication and resynthesis were used to create stimuli with static formant frequency values across the entire duration of the vowel;
6) flat neutral (FN), for which the manipulations in 4) and 5) above were combined to create neutral-duration stimuli with static formant frequency values.
The process for creating each of these versions of each word is detailed below.

Word Extraction. Each of the six tokens of each target word produced by the five speakers was isolated from the carrier phrase. First, separate sound files were created for each carrier phrase. Next, target words were isolated from the carrier phrase using CoolEdit2000 (Johnston, 2000) software by visually identifying and noting the time of
the onset of the release of the initial /b/ and the end of the burst for the final /d/. Using CoolEdit2000 (Johnston, 2000), the surrounding carrier phrase ("say" and "again") was silenced out, leaving the target word plus 10 ms of silence preceding the release of the initial /b/ and following the end of the burst for the final /d/. When pre-voicing was present, the amplitude of pre-voicing of the initial /b/ was linearly ramped up (from 0% to 100% amplitude) over 3 ms following the transformation of the preceding 10 ms to silence. Similarly, energy following the release burst of the final /d/ was linearly ramped down (from 100% to 0% amplitude) over 3 ms following the transformation of the following 10 ms to silence for each stimulus.

For resynthesis, target words were downsampled in Praat to 11,050 Hz, the sample rate required for input to Straight. Using CoolEdit2000 (Johnston, 2000), all stimuli were amplitude-adjusted so that the root-mean-square (RMS) amplitude would equal 25 dB less than the maximum amplitude allowed without peak clipping (i.e., +/-32,768, or 90.3 dB) to avoid large differences in intensity across stimuli.

As stated above, one best exemplar was chosen from among the six repetitions of each target word for each speaker for use in the experiment. The best exemplar token for each target word was selected based on vocal quality, vowel quality, presence of the final /d/ release, and similarity in voice quality to the other target vowels in the set. Target words containing peak clipping, distortions, glottal fry, or nasalization were not considered for further use. Using the software Praat 4.2 (Boersma & Weenink, 2003), the times of the onset of voicing for the vowel and the onset of closure for the /d/ were noted in an Excel file (Microsoft, 2000).
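The two amplitude operations above (short linear ramps and RMS equalization to 25 dB below the clipping ceiling) can be sketched in plain Python. This is an illustrative re-implementation, not the CoolEdit2000 procedure itself; samples are assumed to be floats in the range [-1, 1] rather than 16-bit integers, and the function names are invented for the sketch:

```python
import math

def linear_ramp(samples, ramp_ms, rate_hz, fade_in=True):
    """Apply a linear amplitude ramp over the first (or last) ramp_ms."""
    n = int(rate_hz * ramp_ms / 1000)
    out = list(samples)
    for i in range(min(n, len(out))):
        g = i / n                     # gain rises 0 -> 1 across the ramp
        if fade_in:
            out[i] *= g               # ramp up at the start
        else:
            out[-1 - i] *= g          # ramp down at the end
    return out

def equalize_rms(samples, target_db_below_peak=25.0, peak=1.0):
    """Scale so the RMS sits target_db_below_peak dB under the clipping ceiling."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    target_rms = peak * 10 ** (-target_db_below_peak / 20)
    return [s * (target_rms / rms) for s in samples]
```

Applying `linear_ramp` at both edges before `equalize_rms` mirrors the order described above: ramps remove edge discontinuities first, then all stimuli are brought to a common RMS level.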
Target words produced by four of the speakers (two males, two females) were used to create stimuli for the main listening experiment. Target words produced by the remaining male speaker were used to create example and practice trial stimuli. Only the whole word (WH) and original vowel (OV) conditions were used in the example and practice trials, so only these conditions were created for the fifth talker. The selected target words for each talker served as stimuli for the whole word (WH) condition, and additional modifications to these items were made to create the stimuli for the remaining conditions. Thus, 24 stimuli (six tokens × 4 talkers) were created for each condition.

Vowel Isolation. For the original vowel (OV) condition, the CV and VC transitions for the vowel were identified and the vowel was isolated from the consonant context. The CV and VC transitions were defined as the first and last 40 ms of the vowel duration, following the onset of voicing for the vowel and preceding the onset of closure for the /d/, respectively. Thus, the time point corresponding to the vowel onset plus 40 ms of the vowel duration was identified in Praat 4.2, and all preceding portions of the signal were silenced. Similarly, the time point corresponding to the onset of closure for the /d/ minus 40 ms of the vowel duration was identified, and all following portions of the signal were silenced. All but 10 ms of the resulting silence was deleted at the beginning and end of each item. Next, the initial and final 3 ms of the signal were linearly ramped on and off, respectively, in CoolEdit2000 (Johnston, 2000) to avoid audible clicks from sudden changes in amplitude at the silenced CV and VC transitions. Amplitude equalization to an RMS amplitude of negative 25 dB, relative to the maximum, was again performed using CoolEdit2000 (Johnston, 2000).
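The isolation steps above (trimming the 40-ms CV and VC transitions and retaining 10 ms of silence at each edge) can be sketched as follows. This is a simplified illustration on a plain sample list, with the silence-then-delete sequence of the original Praat/CoolEdit workflow collapsed into a single slice; the function name and parameters are assumptions:

```python
def isolate_vowel(samples, rate_hz, vowel_on_s, closure_s,
                  transition_ms=40, pad_ms=10):
    """Keep only the vowel center: drop the first and last transition_ms
    of the vowel (the CV and VC transitions, measured from voicing onset
    and /d/ closure onset) and pad each edge with pad_ms of silence."""
    start = round((vowel_on_s + transition_ms / 1000) * rate_hz)
    end = round((closure_s - transition_ms / 1000) * rate_hz)
    pad = [0.0] * round(rate_hz * pad_ms / 1000)
    return pad + list(samples[start:end]) + pad
```

The 3-ms edge ramps and RMS equalization described above would then be applied to the returned samples, exactly as for the whole-word stimuli.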
Preparatory Acoustic Measurements for Formant Flattening. Formants 1 through 5 were identified using Linear Prediction Coding (LPC, Burg method), narrow-band spectral slices, and wide-band spectrograms. Using Praat 4.2 (Boersma & Weenink, 2003), a wide-band spectrogram (5 ms analysis window) was displayed for each isolated vowel token. The view range was set to 5,000 Hz with a dynamic range of 50.0 dB. Formant settings for the wide-band spectrogram reading used a maximum formant frequency of 5512 Hz.

Automatic formant extraction procedures based on Linear Predictive Coding (LPC; Markel, 1972) were used to overlay formant tracks on the wide-band spectrogram. By default, 10 poles were specified within the 5,000 Hz range analyzed. A 20-ms analysis window was used, updated at 5-ms intervals. Pre-emphasis was set to 50.0 Hz. The match between the tracks and the formants shown on the spectrogram was inspected, and if the formant tracks were not well aligned with the formants, the number of poles was increased or decreased as needed to achieve a better match. At each desired location, estimates of the center frequencies of F1-F5 were generated from the overlaid formant tracks using Praat query procedures.

To verify the automatic formant estimates, narrow-band spectral displays were generated at the desired time point and Monsen & Engebretson's (1983) method of estimating the location of spectral peaks was used. The spectrogram setting for the narrow-band spectral slices used a window length of 0.029 s. In most cases, automatic estimates were used, but in cases of disagreement, the method described by Monsen & Engebretson (1983) for obtaining measurements from a narrow-band spectral slice was used for the formant frequency measurements.
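Estimating a spectral peak from harmonic amplitudes in a narrow-band slice amounts to a small decision rule: place the peak at the strongest harmonic, between two equally strong harmonics, or shifted toward the stronger neighbour. The sketch below is a hypothetical illustration of such a rule (the function, thresholds, and interpolation are illustrative assumptions, not the thesis's actual procedure):

```python
def estimate_peak(freqs, amps):
    """Infer a formant peak from harmonic frequencies (Hz) and amplitudes (dB)
    around the formant; a hypothetical sketch, not the published procedure."""
    # Index of the largest interior harmonic.
    k = max(range(1, len(amps) - 1), key=lambda i: amps[i])
    lo, hi = amps[k - 1], amps[k + 1]
    step = freqs[k] - freqs[k - 1]
    if abs(lo - hi) < 0.5:                   # neighbours equal: peak at largest harmonic
        return freqs[k]
    if abs(amps[k] - max(lo, hi)) < 0.5:     # two equally strong harmonics: peak between them
        other = k - 1 if lo > hi else k + 1
        return (freqs[k] + freqs[other]) / 2.0
    # Unequal neighbours: shift toward the stronger one, more for a bigger difference.
    shift = 0.5 * step * (hi - lo) / (amps[k] - min(lo, hi))
    return freqs[k] + shift
```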
In Monsen & Engebretson's (1983) method for obtaining measurements from a narrow-band spectral slice, the measurement is made by inferring the spectral peak from the intensity values of the harmonics in the vicinity of the formant in question. According to Monsen & Engebretson, "most researchers construct a type of pyramid or triangle above the group of harmonics which is thought to constitute a formant and then make a frequency measurement of the peak of that pyramid" (p. 90). Using this method, three main scenarios of relative harmonic amplitudes in the vicinity of the formant are possible, each resulting in a different measurement decision. Most simply, if there is one largest harmonic, with the two surrounding harmonics approximately equal in amplitude, then the formant frequency should be located at the frequency of the largest harmonic. Second, if the amplitudes of two harmonics are approximately equal and greater than those of the surrounding harmonics, which are approximately equal in amplitude, then the formant frequency should be located between the frequencies of these two harmonics. The third scenario is the most complex. If the amplitudes of the two harmonics surrounding the largest harmonic are unequal, the formant frequency should be shifted away from the frequency of the largest harmonic and towards the frequency of the larger of the two surrounding harmonics; the shift should be greater when there is a greater difference between the amplitudes of the two surrounding harmonics.

Measurements were made by two student raters. In cases where the F1, F2, and F3 measurements of the two raters differed by more than 50 Hz, 150 Hz, or 250 Hz (Strange et al., 1998), respectively, a third rater (supervisor) also measured formant values. If the third rater's measurements agreed within the specified amount with those of one of the
two original raters' measurements, then that rater's measurements were used. Otherwise, the measurements of the third rater were used.

Within the whole word file, vowel formant frequencies were measured at vowel onset plus 40 ms, at vowel offset minus 40 ms, and at the vowel stable point, or the point at which the absolute value of the slope of log(F1/F2) is minimized over several analysis frames (Hillenbrand et al., 1995; Hillenbrand & Nearey, 1999). The vowel stable point was computed by first transferring the automatically generated formant tracks to an Excel file. Formant estimates were generated approximately every 4.7 ms and formant tracks were corrected if needed. Next, log(F1/F2) was computed for each frame. A seven-frame (approximately 33 ms) window was used to compute the slope of log(F1/F2) from the first frame to the last frame of the window; the slope was generated for each eligible window (from the seventh frame to the seventh frame from the end). As suggested by Hillenbrand (1995), the minimum slope that was not in the offglide section of the vowel was computed and compared with the plotted formant tracks to verify that the selected minimum coincided visually with a maximally stable portion of the vowel formants in the F1-F2 region.

Resynthesis Method for Natural Preserved Vowel Tokens. Using the extracted vowel, the natural preserved vowel tokens were generated by resynthesis, maintaining spectral information and duration cues. The Straight graphical user interface (GUI) (Kawahara et al., 1998) was initiated from within MatLab (The MathWorks, 2002). The stimulus file was accessed from within the GUI, the item was resynthesized by Straight with no changes specified, and the synthesized item was saved to file. Sound quality and vowel duration were rechecked in Praat. The duration of silence prior to and after the vowel
was checked, as well as the RMS amplitude (-25 dB). Ramping and amplitude equalization as described above were performed for vowels with temporal or amplitude deviations.

Resynthesis of Altered Vowel Conditions. The mean vowel duration for each word was used to resynthesize the natural neutral (NN) vowel condition. Mean vowel duration was computed as the average duration of the six isolated vowel stimuli across the four speakers used to create the experimental stimuli (i.e., the average across 24 tokens). In the Straight-GUI, the file was loaded and the appropriate lengthening or shortening factor was specified to produce a vowel of the desired duration (approximately 193 ms). The output file was saved and the duration of each resynthesized vowel was rechecked in Praat. On some occasions, the Straight output contained small-amplitude voicing beyond the desired duration; in these cases, the stimulus was edited to match the target average duration. The Straight duration morphing routines automatically preserved the formant trajectories while either stretching or compressing the overall duration.

The two flat formant conditions (flat preserved [FP] and flat neutral [FN]) were both generated by replicating the measured stable portion of the vowel in Praat (Boersma & Weenink, 2003). The original vowel file was used to generate these two conditions. A total of approximately 30 ms around the measured stable (steady state) time was selected. To replicate the pitch periods, 5 or 6 entire pitch periods were selected, whichever was closer to 30 ms. These pitch periods were copied and pasted into a file of 1 s of silence in Praat; this procedure was repeated carefully by zooming in and pasting the pitch pulses at a zero point until the total vowel duration matched the desired vowel duration for the original and neutral-duration vowels, respectively. Excess duration was deleted as needed until the duration of the target vowel version was matched. Vowel
quality and waveform matching were carefully monitored to avoid unnatural jumps in amplitude at points where pitch pulses were pasted. Visual inspection of spectrograms showed all formants to be flat. Below, the formants of the vowel / / are shown after resynthesis with no cue modification and with both cues neutralized (see Figures 1 and 2).

Figure 1. The natural vowel / / when synthesized (NP); spectral and duration cues were not modified.
Figure 2. An example of the vowel / / with flattened formants and neutralized duration.

Both the original-duration and duration-neutral flat formant vowels created by pitch pulse replication were loaded into the Straight-GUI (Kawahara et al., 1998) for resynthesis. Following synthesis, the duration of the vowels was checked in Praat, assuring a pre- and post-vocalic silence of 0.03 s (see Appendix C for detailed methodology).
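The duration-neutralization step above reduces to simple arithmetic: the neutral target is the mean of the 24 token durations, and each token's Straight lengthening or shortening factor is the ratio of that target to the token's own duration. A hypothetical sketch (names are illustrative):

```python
def neutral_duration_factors(durations_ms):
    """Mean duration across the isolated-vowel tokens serves as the neutral
    target; each token's time-scaling factor maps its duration onto it.
    Hypothetical sketch, not the thesis's actual script."""
    target = sum(durations_ms) / len(durations_ms)
    return target, [target / d for d in durations_ms]
```

With the thesis's stimuli the target came out near 193 ms, so, for example, a 240 ms token would receive a shortening factor of roughly 0.8.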
Chapter 3

Procedure

Testing Procedure of Subjects

Calibration. The level for the presentation of stimuli was adjusted by playing a 10-second, 1000-Hz tone, amplitude-adjusted to -25 dB from the maximum possible amplitude, through the headphones to a sound level meter (Brüel & Kjær Type 2235 Precision Sound Level Meter). Based on the readings of the sound level meter, the attenuation of the programmable attenuators (PA5) on the Tucker-Davis Technologies (TDT) Psychoacoustics System III hardware ("TDT System III," 2001) was adjusted until the measured intensity of the calibration tone was equal to 68 dB. These attenuation levels were then recorded and set within EcosWin (1999). Because the root-mean-square (RMS) amplitude level of the calibration tone was matched to that of the amplitude-adjusted stimuli, this method assured 1) that the presentation level was approximately equal across stimuli and 2) that the average presentation level for the stimuli was approximately 68 dB.

Testing. Each subject completed an informed consent form and a language background form in a session prior to this experiment. A basic hearing screening on a Beltone AudioScout Audiometer calibrated to ANSI 1989 standards was administered in this prior session to rule out hearing impairment (25 dB at 500, 1000, 2000, and 4000 Hz).
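Because the calibration tone's RMS matches that of the stimuli, the attenuator setting that brings the tone to 68 dB also fixes the average stimulus level; the adjustment itself is a one-line computation, sketched hypothetically below (names are illustrative):

```python
def attenuation_setting(measured_db, current_atten_db, target_db=68.0):
    """If the calibration tone measures `measured_db` at the current
    attenuator setting, add the surplus (measured - target) in dB to land
    on the target level. Hypothetical sketch of the calibration bookkeeping."""
    return current_atten_db + (measured_db - target_db)
```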
Seated in a quiet room, subjects were presented the stimuli binaurally over headphones (Sennheiser HD 265). Prior to the main experiment, subjects completed several example and practice tasks. First were 12 whole-word example trials, followed by 12 isolated-vowel example trials. On the example trials, stimuli were presented to the listeners in a predetermined order and feedback was provided on each trial. For a correct response, the target lit up green, whereas an incorrect response was indicated by the color pink. Next, each listener completed 3 blocks of 18 practice trials for the isolated-vowel stimuli. On the practice trials, stimuli were presented in random order and no feedback was provided. Stimuli for the example and practice tasks were the six whole word (WH) and six isolated vowel (OV) tokens created for the fifth (male) talker, whose speech was not used for the main experimental task.

The main experiment tasks consisted of two 120-trial blocks for isolated vowels, followed by a 48-trial whole word task. Stimuli for the isolated vowel portion of the main experiment were the stimuli created for the five isolated vowel conditions (OV, NP, NN, FP, and FN) for the four talkers and six target vowels (6 vowels X 5 conditions X 4 talkers = 120). The 120 stimuli were presented in random order in one trial block. Following a five-minute break, the same 120 stimuli were presented again in random order in a second trial block. This task lasted approximately 30 minutes.

Following another 5-minute break, the experiment was completed with a whole word task (1 block of 48 trials) lasting 10 minutes. The stimuli for this trial block were the whole word (WH) stimuli created for the four talkers for the six vowels, with each
stimulus presented two times. Each subject was compensated $10 per hour of listening. The average duration of this experiment was less than one hour.

Trials. For each task (example, practice, and experimental), written directions on a computer screen and verbal instructions were provided. Subjects were instructed to respond as quickly yet as accurately as possible. Upon presentation of the target word, the subject selected which word (or vowel) was heard from six alternatives presented on a computer screen (see Figure 3).

Figure 3. Stimulus screen.

Targets were presented as the individual /bVd/ words adjacent to a rhyming word (e.g., 'beed' and 'feed'). Responses were indicated by the subjects using a mouse to select the target word they heard. Responses were recorded by EcosWin (1999) as correct or
incorrect, generating an individual Excel file for each listening task. The files of the main task for both isolated vowels and whole words were saved to Excel files for data analysis.

Data Manipulation. Data from EcosWin (1999) (isolated vowel conditions and whole word) were loaded into Excel (Microsoft, 2000) for data analysis. A macro script was generated to sort the data separately for isolated-vowel and whole-word trials. The macro scripts were generated by two persons (supervisor and author) to avoid errors in the programming. The macro script for the isolated vowels was used to sort correct and incorrect responses, to sort these in alphabetical order (i.e., "bad, bayed, bead, bed, bid," and "bod"), and to sort by condition (original vowel, flat formant, etc.). The number correct for each word and condition was automatically tallied and recorded for each subject. Percent-correct scores were computed from the number correct for each condition. Within the macro, confusion matrices were generated for each listening condition (i.e., for the five isolated vowel conditions). In the confusion matrices, intended targets were depicted vertically (in rows) and actual responses horizontally (in columns). Raw scores were collected in a summary Excel file for all subjects for reliability purposes and statistical analyses. The macro script for the whole word task was designed similarly; however, only one confusion matrix was generated for each listener (whole word).
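The tallying performed by the Excel macro amounts to building a target-by-response count table and summing its diagonal; a hypothetical Python equivalent (names are illustrative, not from the thesis) is:

```python
def confusion_matrix(trials, labels):
    """Tally intended targets (rows) against actual responses (columns);
    a hypothetical sketch of what the Excel sorting macro computes."""
    index = {w: i for i, w in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for target, response in trials:
        matrix[index[target]][index[response]] += 1
    return matrix

def percent_correct(matrix):
    # Diagonal cells hold the correct responses.
    total = sum(sum(row) for row in matrix)
    return 100.0 * sum(matrix[i][i] for i in range(len(matrix))) / total
```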
Chapter 4

Results

Explanation of Percent Correct Analysis

Figure 4 below shows the mean percent correct for each listening condition by listener group: native (NA), highly-proficient (HP) bilinguals, and less-proficient (LP) bilinguals. The data will be discussed by condition by group first, with reference to Figure 4. The six condition contrasts of interest will be discussed in turn: 1) whole word versus original vowel (WH-OV), representing the effect of vowel isolation; 2) original vowel versus natural preserved (OV-NP), representing the effect of resynthesis; 3) natural preserved versus natural neutral (NP-NN), representing the effect of duration neutralization alone; 4) natural preserved versus flat preserved (NP-FP), representing the effect of formant flattening alone; 5) flat preserved versus flat neutral (FP-FN), representing the effect of duration neutralization on already formant-flattened stimuli; and 6) natural preserved versus flat neutral (NP-FN), representing the effect of removing both formant dynamic and duration cues. Next, listening condition by vowel effects will be discussed. Following this section, confusion matrices showing each group's performance for each listening condition will be presented to elucidate patterns of error distribution for the listener groups.

Whole Word versus Original Vowel. Native listeners (NA) performed the best, with a mean of 98.9% correct for the whole word condition. The highly-proficient
listeners (HP) performed slightly lower, by 2.8 percentage points (96.0% correct). The less-proficient bilinguals (LP) demonstrated difficulties identifying vowels at the word level; the LP mean was 81.9% correct, or 16.6 percentage points lower than the NA listeners' performance. At this point, it may be suggested that all listener groups perform best when identifying whole words; however, the less-proficient bilinguals appear somewhat challenged even at this whole word level.

Native listeners demonstrated a negligible decrease of 3.06 percentage points when the consonant context was removed (WH-OV contrast). Highly proficient bilinguals demonstrated a 6.25-percentage-point drop from the WH to the OV condition, whereas less-proficient bilinguals' performance decreased by 8.91 percentage points from the WH to the OV condition. The suggestion to be made from these results is that highly proficient listeners may rely more on consonant information than monolinguals, but less than the less-proficient bilinguals.
[Figure 4 appears here.]

Figure 4. Mean percent-correct word-identification scores for each of the three listener groups, monolingual native (NA), highly-proficient bilingual (HP), and lower-proficiency bilingual (LP), at each of the six listening conditions (WH, OV, NP, NN, FP, FN). Error bars indicate two standard errors of the mean.

Original versus Natural Preserved Vowel. The effect of resynthesis (OV-NP contrast) was noted to be small to negligible for all of the groups. Native listeners demonstrated the greatest decrease of the three groups (i.e., 2.45 percentage points) and HP listeners demonstrated the smallest decrease, of 0.67 percentage points. However, native listeners still performed better than both bilingual groups.
Natural Preserved Vowel versus Natural Neutral Vowel. Neutralization of vowel duration information alone in the resynthesized stimuli (NP-NN) showed little decrease for native and highly proficient listeners (less than 2 percentage points for both). Less proficient listeners' performance was characterized by a 4.86-percentage-point decrease from the NP to the NN condition.

Natural Preserved versus Flat Preserved Vowel. Substantial decreases were noted for all of the listening groups between resynthesized vowels with all cues preserved (NP) and those with flattened formants (FP). Highly proficient bilinguals presented the largest decrease in percent-correct identification performance (20.42 percentage points) from the NP to the FP condition, whereas native and less proficient listeners averaged decreases of 18.75 and 17.36 percentage points, respectively. The difference between dynamic-preserved vowels and flattened preserved vowels represents the effect of the removal of formant information alone. A similar pattern of decreases in performance was also seen between natural neutral vowels and flattened neutralized vowels (NA: 18.26, HP: 21.67, and LP: 12.27 percentage points for NN-FN).

Flat Preserved versus Flat Neutral. Minor drops were noted in percent-correct identification performance when the durations of flat preserved vowels were neutralized (FP-FN). The difference was greatest for the highly-proficient bilingual group (2.50 percentage points), with differences of less than one percentage point, in the positive and negative directions respectively, for the native and less-proficient bilingual groups.

Overall Patterns. As shown in Figure 4, all three groups showed fairly similar patterns of performance across the listening conditions, with performance for the whole word condition averaging higher than that for the other listening conditions. The highly-proficient bilinguals performed better than the less-proficient bilingual listeners in
all conditions. However, native listeners performed consistently higher than both the highly- and the less-proficient listeners. Additionally, all three listener groups performed best on the whole word condition, followed by the original vowel condition. Noteworthy is the fact that resynthesis appears to have little effect on identification relative to the original vowel; only a slight decrease in performance is noted for all of the listening groups. Results for the duration-neutralized vowel with preserved formant dynamics (NN) show a noticeable (about 5 percentage points) decrease in performance relative to the natural-duration stimuli with preserved formant dynamics (NP) for the less-proficient bilinguals only. Neutralizing duration does not appear to substantially affect performance negatively for the other two groups (see Figure 4). Similar for all three groups is the sizeable drop in performance when formant dynamic information is removed by flattening (NP-FP and NN-FN).

Nevertheless, some cross-group differences in performance across the different listening conditions are apparent. Most notably, the effect of vowel isolation (WH-OV) appears to be greater for the two bilingual groups than for the native listener group. Second, the effect of duration neutralization alone appears to be greater for the LP bilingual group than for the HP bilingual and native listener groups.

Figure 4 also enables us to view the differences in performance and provides preliminary evidence that experience improves one's perception (Flege, 1995, 1996; Flege et al., 1995; MacKay et al., 2001; Mayo et al., 1997), as can be seen in the highly proficient versus the less proficient bilinguals. However, this figure does not explain whether the removal of formant dynamic and duration cues affects different vowels differently.
Data from Figure 4 also suggest that native and highly-proficient bilingual listeners are able to identify Straight-synthesized isolated vowels quite well, even when duration information is removed (about 92 and 88% correct, respectively, for the NP condition), and reasonably well (about 74 and 66% correct, respectively) even when both formant dynamic and duration cues are removed. Performance is more than 20 percentage points lower for the LP group in the NP condition, however. Although the mean percent correct appears to give us an impression of the cues used by the different listener groups, it needs to be stressed that the data for all listening conditions were averaged across vowels. Consequently, it is necessary to analyze how performance by vowel, listening condition, and group was distributed. Confusion matrices will also be used to clarify similar and different patterns between groups.

Analysis of Vowels by Condition

As stated previously, Figure 4 only provides information averaged across vowels. We cannot from these average patterns determine how different vowels were affected by changes in listening condition. Figure 5 depicts percent-correct identification performance by condition and vowel, averaged across the listener groups. Of interest are the different patterns across vowels, which indicate whether some vowels are affected more by some conditions than others. As can be seen, performance for / / stays relatively stable even after formant dynamic information is removed. This pattern may be attributable to its stable vowel drift (Hillenbrand et al., 1995; Hillenbrand & Nearey, 1999). The lax vowel / / is characterized by greater decreases for removing duration information than for removing formant dynamic information, in that the decreases for NP-NN and FP-FN are greater than those for NP-FP and NN-FN. As expected, the diphthong / / stands out with
the largest decrease in performance of all the vowels when formant dynamic information is removed (NP-FP). Before removing this information, / / is the best-perceived vowel (about 95% correct for NP), whereas after removing the formant dynamic information it is perceived least accurately of all (about 9% correct for FP). The vowel / / is of interest because duration neutralization appears to cause reduced accuracy when formant dynamic information is also removed (approximately a 7-percentage-point decrease for FP-FN), but not when formant dynamic information is present (approximately a 1-percentage-point increase for NP-NN). Similarly, removing formant dynamic information has little effect when duration information is preserved (approximately a 1-percentage-point decrease for NP-FP). Thus, listeners appear to identify / / similarly when one cue (either formant dynamic information or duration information) is removed, but perform less well when both cues are removed. The greatest decrease in performance for / /, however, is seen for the effect of vowel isolation (WH-OV), suggesting that listeners may rely more heavily on consonant context for this vowel; a similar but smaller effect of vowel isolation is shown for / /. The effect of formant flattening appears to be moderately strong for both / / and / /. Removing vowel duration information appears to have little effect for / /, whereas / / appears to show some effect of duration neutralization alone (about a 7-percentage-point decrease for NP-NN). While these patterns highlight potential differences in the effects of the listening conditions across listener groups and vowels, they are only trends in the data. Statistical analyses are needed to confirm the significance of these trends. In the section below, parametric statistical analyses and post-hoc test results will be described for main effects and the two combinations of variables discussed thus far (listening condition by group and listening
condition by vowel). The three-way interaction will also be examined and the results of post-hoc tests for the three-way interaction will be presented. Following this, confusion matrices will be presented to explore any differences in effects within the three-way interaction (i.e., differences in the effects of listening condition across the vowels for the different listener groups).

[Figure 5 appears here.]

Figure 5. Mean percent-correct word-identification scores for each of the six target vowels (/i/, /I/, /ei/, /E/, /ae/, /a/) and each of the six listening conditions (WH, OV, NP, NN, FP, FN). Error bars indicate two standard errors of the mean.

Statistical Analyses

Before performing parametric data analysis, number-correct data were converted to rationalized arcsine transform units (RAUs) (Studebaker, 1985). According to
Studebaker (1985), converting raw data into RAUs is applied to proportional data to eliminate ceiling effects. In doing so, negative skewness of the data and correlation of the variance with the means are avoided (Studebaker, 1985).

Table 3. Statistical analysis results: main effects, two-way interactions, and the three-way interaction.

Effect                                    F value   Degrees of freedom   p value
Listener group                            44.41     2, 57                < .001
Listening condition                       247.55    5, 285               < .001
Vowel                                     16.29     5, 285               < .001
Listener group by listening condition     2.38      10, 285              .01
Vowel by listening condition              90.43     25, 1425             < .001
Listener group by vowel                   3.41      10, 285              < .001
Three-way interaction                     1.27      50, 1425             .100

As can be seen from Table 3, all main effects and all of the two-way interactions were significant; most were highly significant (p < .001), with the exception of the listener group by listening condition interaction (p = .01).

Main Effects. For the between-subjects main effect of listener group, a Tukey HSD post-hoc test was performed. All listener groups were found to differ significantly from one another in vowel identification performance: for native vs. highly-proficient bilingual, p = .045; for native vs. low-proficiency and high-proficiency vs. low-proficiency bilingual, p < .001.

For the main effect of listening condition, six comparisons probing six effects of interest were made: 1) the effect of vowel isolation (WH-OV); 2) the effect of Straight resynthesis (OV-NP); 3) the effect of duration neutralization alone (NP-NN); 4) the effect of
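The RAU conversion applied before these analyses can be stated compactly. Below is a hypothetical Python sketch of the commonly cited Studebaker (1985) formula (the function name is an illustrative assumption), which maps a count of `correct` responses out of `n` trials onto a roughly linear scale running from about -23 to about +123 RAU:

```python
import math

def rau(correct, n):
    """Rationalized arcsine units (Studebaker, 1985); sketch of the commonly
    cited formula. Stabilizes variance and counters skew near 0% and 100%."""
    theta = math.asin(math.sqrt(correct / (n + 1.0))) \
          + math.asin(math.sqrt((correct + 1.0) / (n + 1.0)))
    return (146.0 / math.pi) * theta - 23.0
```

Mid-range scores map almost onto themselves (50 of 100 correct gives roughly 50 RAU), while perfect and zero scores are pushed beyond 100 and below 0, which is what removes the ceiling compression.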
formant flattening alone (NP-FP); 5) the effect of duration neutralization for stimuli with already flattened formants (FP-FN); and 6) the effect of removing both duration and formant dynamic cues (NP-FN). These effects were explored as simple main effects of condition. A Bonferroni adjustment to the significance level of each comparison was made by dividing the alpha level by six, for the six comparisons made (note that 15 comparisons were possible, but the significance level was not adjusted for this number because only six comparisons were explored). The effect of vowel isolation (WH-OV) was found to be significant (p < .006), as were the effects of formant flattening alone (NP-FP, p < .006) and of removing both duration and formant dynamic cues (NP-FN, p < .006). The effect of Straight resynthesis approached significance (OV-NP, p = .06). The effect of duration neutralization on flattened formants (FP-FN) did not approach significance. The main effect of vowel identity was significant but was not considered of primary interest in relation to the research questions and will not be discussed further. Similarly, the significant two-way interaction between group and vowel will not be discussed.

Two-way Interaction of Group and Listening Condition. For the significant two-way interaction between listener group and listening condition described in the text relating to Figure 4, the six contrasts of interest were compared at each level of the listener group variable. Bonferroni adjustment was again performed for the six comparisons made at each level of the group variable by multiplying the p value by six.

For the native (monolingual) listener group, the effect of formant flattening alone was found to be significant (NP-FP, p < .006), as was the effect of removing both formant dynamic and duration cues (NP-FN, p < .006). The effect of Straight resynthesis
approached significance (OV-NP, p = .084), while the other two contrasts (WH-OV and NP-NN) did not approach significance.

For both the highly-proficient and less-proficient bilingual listener groups, the following contrasts were found to be significant, all at p < .006: 1) the effect of vowel isolation (WH-OV); 2) the effect of formant flattening alone (NP-FP); and 3) the effect of removing both formant dynamic and duration cues (NP-FN). The remaining effects did not approach significance for the highly-proficient listener group; for the less-proficient bilingual listener group, the effect of duration neutralization alone (NP-NN) approached significance (p = .12). Considering the effects across groups, the post-hoc analyses confirmed some of the trends noted in the patterns of performance across groups and listening conditions, but not others. That is, the effects of formant flattening alone (NP-FP) and of removing both cues (NP-FN) were found to be large and highly significant for all three listener groups. Furthermore, the effect of vowel isolation (WH-OV) was found to be significant for the two bilingual groups but not for the monolingual group, confirming the relatively larger effect of vowel isolation for the two bilingual groups. The data lend only weak support to an increased effect of duration neutralization alone for the lower-proficiency bilingual group compared to the other groups, in that the effect approached significance for the lower-proficiency group but did not for the other two groups. Similarly, the post-hoc tests weakly support a greater effect of Straight resynthesis (OV-NP) for the native group compared to the other groups, in that the effect approached significance for the native group but did not for the other two groups.

Two-way Interaction of Vowel and Listening Condition. For the significant two-way interaction between vowel and listening condition described in the text relating to
Figure 5, the six contrasts of interest were compared at each level of the vowel variable. Bonferroni adjustment was again performed for the six comparisons made at each level of the vowel variable. The effects of vowel isolation and Straight resynthesis will be considered first; then the effects of the duration neutralization and formant flattening manipulations will be considered.

The effect of vowel isolation was significant for three of the six vowels (/ , /) at p = .006 or less for all. The effect of vowel isolation approached significance for two of the remaining vowels (/ /, p = .108; / /, p = .06). These effects confirm the more substantial effect of vowel isolation for the vowels (/ , /) noted in the text relating to Figure 4 and suggest that the main effect of vowel isolation was primarily due to the effects for these three vowels.

The effect of Straight resynthesis (OV-NP) was significant only for the target vowel / / (p = .006). The effect did not approach significance for any of the other vowels. This restriction of the effect to a single vowel is surprising, considering that the effect neared significance in the main effect of listening condition. However, it may help to explain why the effect did not reach significance for any single listener group, although it neared significance for the native listener group.

For / /, the pattern of relatively little effect of listening condition was largely confirmed: none of the effects of cue manipulation approached significance. For the vowel / /, however, three effects reached significance: 1) the effect of duration neutralization alone (NP-NN, p = .03); 2) the effect of duration neutralization for already flattened stimuli (FP-FN, p = .006); and 3) the effect of removing both duration and
formant dynamic cues (NP-FN, p<.006). Thus, these results confirm the patterns noted relating to Figure 4 of greater effects of duration neutralization than of formant flattening for this vowel.

For / /, the effects of formant flattening alone (NP-FP) and of removing both formant dynamic and duration cues (NP-FN) were both large and significant (p<.006). The effect of duration neutralization of already flattened stimuli (FP-FN) was also significant (p=.03) but was not in the expected direction, in that percent correct performance was higher for the FN than for the FP condition.

For the target vowel / /, the effect of removing both duration and formant dynamic cues (NP-FN) was significant (p=.024), as was the effect of duration neutralization of already flattened stimuli (p=.006). These results confirm the pattern noted in relation to Figure 4 of a duration neutralization effect mainly for the already flattened stimuli and little effect of formant flattening alone. For the target vowel / /, on the other hand, both duration neutralization and formant flattening effects were found to be significant (p=.03 for NP-NN and p<.006 for NP-FP). Similarly, the effect of removing both duration and formant dynamic cues was also significant (NP-FN, p<.006). Finally, for the vowel / / the effect of formant flattening alone approached significance (NP-FP, p=.06), while the effect of removing both duration and formant dynamic cues just reached significance (NP-FN, p=.042).

Three-way Interaction. Although the three-way interaction was not significant, it did approach significance (p = .100). Furthermore, only six of the possible 15 comparisons in the listening condition factor were considered in the two-way
interactions. Thus, it was determined that because Bonferroni adjustment in the post-hoc comparisons would be for six rather than 15 comparisons at each level of group and vowel, it was possible that significant three-way interaction effects might be found (i.e., significant differences in listening condition effects across the different vowels and listener groups). For this reason, the three-way interaction was explored in post-hoc analyses and significant three-way interaction effects were indeed found. These effects are presented in Table 4 and will be described below, with each of the six effects of interest considered in turn.
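The post-hoc logic just described (six planned contrasts rather than all 15 pairwise comparisons of the six listening conditions, with Bonferroni adjustment accordingly) can be sketched as follows. This is an illustrative sketch, not code from the thesis; the condition labels follow the abbreviations used in the text.

```python
# Sketch of the Bonferroni-adjusted post-hoc comparisons described above.
# Condition labels are the thesis' abbreviations; everything else is illustrative.
from itertools import combinations

conditions = ["WH", "OV", "NP", "NN", "FP", "FN"]
all_pairs = list(combinations(conditions, 2))       # all 15 possible comparisons
planned = [("WH", "OV"), ("OV", "NP"), ("NP", "NN"),
           ("NP", "FP"), ("FP", "FN"), ("NP", "FN")]  # the six contrasts of interest

alpha = 0.05
alpha_all = alpha / len(all_pairs)      # ~.0033 if every pair were tested
alpha_planned = alpha / len(planned)    # ~.0083 for the six planned contrasts


def significant(p, n_comparisons, alpha=0.05):
    """Bonferroni criterion: a raw p value is significant if p < alpha / n."""
    return p < alpha / n_comparisons
```

Under six comparisons a contrast with raw p = .006 passes the adjusted criterion; under all 15 it would not, which is why restricting the post-hoc tests to the six planned contrasts could reveal effects the omnibus three-way test did not.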
Table 4. Significant effects for six contrasts of interest in the three-way interaction. Each cell lists the vowels for which the contrast was found to be significant and the size (in RAUs) and direction of effect, with the associated p value in parentheses.

Vowel isolation (WH-OV):
  Native: none
  HP bilingual: / / 11.61 (.018)
  LP bilingual: / / 10.20 (.030); / / 23.83 (<.006); / / 13.84 (.018)

Synthesis (OV-NP):
  Native: / / 17.24 (.012)
  HP bilingual: none
  LP bilingual: none

Duration neutralization alone (NP-NN):
  Native: none
  HP bilingual: none
  LP bilingual: / / 15.62 (.018)

Formant flattening alone (NP-FP):
  Native: / / 94.83 (<.006); / / 22.36 (<.006)
  HP bilingual: / / 96.86 (<.006); / / 23.48 (<.006); / / 15.05 (.018)
  LP bilingual: / / 88.07 (<.006); / / 16.17 (<.006)

Duration neutralization of flattened stimuli (FP-FN):
  Native: none
  HP bilingual: / / 12.93 (.006)
  LP bilingual: none

Both formant flattening & duration neutralization (NP-FN):
  Native: / / 85.08 (<.006); / / 11.71 (.048); / / 15.09 (.018)
  HP bilingual: / / 15.46 (<.006); / / 93.50 (<.006); / / 10.17 (.036); / / 21.43 (<.006); / / 17.38 (<.006)
  LP bilingual: / / 82.98 (<.006); / / 15.29 (.012)

As can be seen from Table 4, different target words showed significant effects of vowel isolation (WH-OV) for the different listener groups. No target words showed a significant effect for the native (monolingual) listener group and only one vowel (/ /) showed a significant effect of vowel isolation for the HP bilingual group. For the LP
bilingual group, on the other hand, three vowels (/ , /) showed a significant effect of vowel isolation. These effects are consistent with the evidence of greater vowel isolation effects for the HP and LP bilingual groups discussed with regard to the two-way interaction between group and listening condition. These data also demonstrate that the large effect of vowel isolation for the vowel / / found in the two-way interaction of vowel by listening condition was most influenced by data for the LP bilingual group.

The effect of Straight resynthesis was significant only for the target vowel / / for the native listener group. These data demonstrate that the nearly significant effect of Straight synthesis for both the main effect of listening condition and for the native group in the group by listening condition interaction was most strongly influenced by performance of the native listener group for the target vowel / /. This pattern is consistent with the sole significant effect of Straight resynthesis (OV-NP) for the vowel / / in the vowel by listening condition interaction.

The effect of duration neutralization alone (NP-NN) was significant only for the vowel / / for the LP bilingual group. The size of the effect was relatively large for this target vowel, however (15.62 RAUs). These data demonstrate that the significant effect of duration neutralization alone in the vowel by listening condition interaction was largely due to performance for the LP bilingual group, also explaining the near-significant effect of duration neutralization alone for the LP group in the group by listening condition interaction. These data suggest that the LP group makes more use of the duration cue in vowel identification than the other two groups (at least for the target vowel / /).
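The effect sizes in Table 4 are reported in rationalized arcsine units (RAUs). The thesis does not restate the transform in this section; the sketch below follows Studebaker's (1985) rationalized arcsine transform as it is commonly cited, so the constants are an assumption of this sketch rather than quoted from the thesis.

```python
import math


def rau(correct: int, total: int) -> float:
    """Rationalized arcsine transform (Studebaker, 1985, as commonly cited).

    Maps a proportion correct onto an approximately interval scale running
    from about -23 (0% correct) to about 123 (100% correct), with 50%
    correct mapping to roughly 50 RAUs. An effect "in RAUs" is then the
    difference between two such transformed scores.
    """
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146.0 / math.pi) * theta - 23.0
```

The transform stabilizes variance near the floor and ceiling of the percent-correct scale, which is why differences such as the 15.62-RAU NP-NN effect for the LP group are comparable across conditions whose raw percent-correct scores sit at very different points on the scale.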
The effect of formant flattening alone (NP-FP) was significant for two vowels (/ /) for the native group, three vowels for the HP bilingual group (/ , /), and two vowels for the LP bilingual group (/ /). The effect was, of course, largest for the target vowel / / (ranging from 88 to 97 RAUs); however, substantial effects were obtained for the vowels / / and / /, ranging from 16 to 23 RAUs. These data demonstrate that although the formant flattening effect was significant for all three groups in the group by listening condition interaction and was dominated by the effects for the vowel / /, the effects did indeed vary across vowels for the different groups. It should also be noted that the size of the formant flattening effect was largest for the HP bilingual group for the two vowels (/ / and / /) that showed a significant effect for all three groups. These data suggest that the HP bilingual group may rely more heavily on formant dynamic information than either the native or LP bilingual groups.

The effect of duration neutralization of already formant flattened stimuli (FP-FN) was only significant for the vowel / / for the HP bilingual group. These data confirm the trend noted in the discussion of Figure 4, suggesting a larger effect of duration neutralization of already formant flattened stimuli for the HP group only. These data suggest that although both the HP bilingual and native listeners' performance appears to be robust to formant flattening effects for / /, the HP bilingual listeners may rely on duration information more heavily than native listeners for this vowel when formant dynamic information is removed. That is, it may be that the native listeners are able to use only static target formant frequencies for accurate vowel identification, while the HP bilinguals appear to perform similarly when either duration or formant dynamic
information is removed, but cannot achieve peak performance when only static formant frequencies are available.

Finally, the effect of removing both formant dynamic and duration cues was significant for three vowels for the native listener group (/ , /), five vowels for the HP bilingual listener group (all except / /), and two vowels for the LP bilingual group (/ /). For / / and / / the effect was largest for the HP bilingual group. Again, these data suggest that the HP bilingual group is less able than the native group to reliably identify target English vowels based only on static vowel formant targets. The LP group shows fewer significant effects when both cues are removed (/ / and / /), but shows lower performance overall (about 15% lower on average than the HP group).

Confusion Matrices. To examine differences in the use of cues and their order across groups and target vowels, it is useful to perform an analysis that provides not only information for each vowel by condition and listener group, but also information as to the identity of the misperceptions that occur. To illustrate listener-group-specific performance across conditions and vowels, confusion matrices were created to show the average percent distribution of the intended vowel and the actual response for each target word. Tables 5-10 depict percent correct for the listening conditions for the native, highly-proficient, and less-proficient bilingual listeners. The target words are arrayed vertically in rows whereas the responses are listed horizontally in columns. Results for the native (NA), highly-proficient (HP), and less proficient bilingual listeners (LP) are arranged in separate rows within each table, in the mentioned order, to enable comparison within and between
groups. The orange colored boxes signify the intended target. Boldfaced black numbers signify the highest score for each group out of all the words. Red boldfaced numbers indicate the lowest score of each group (see tables). In describing the data, this author established the following criterion: differences of less than 5 percentage points would not be discussed. Further, "confusion event" or "confusion" is defined as the vowel that is incorrectly identified (incorrect response) for a given target. Thus, if / / is the target, five potential "confusions" are possible (i.e., / / and / /). For each listener group a total of 30 different confusion events is possible (five for each of the six target vowels).

Whole Word Confusion Matrix. As illustrated by Table 5, native listeners demonstrated perfect identification of the vowels / / and / /. Percent scores for the remaining vowels, with the exception of / /, indicate near-perfect identification scores (error percent < 1%). The percent correct score for the vowel / / reads 95.59%. Further analysis of / / shows that native listeners identified target / / as / / 3.68% of the time.
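The construction of the confusion matrices described above (average percent distribution of responses for each target word, per group and condition) can be sketched as follows. The data and word labels here are illustrative only, not taken from the experiment.

```python
# Sketch of confusion-matrix construction; the trial data are illustrative.
from collections import defaultdict

WORDS = ["Bead", "Bid", "Bayed", "Bed", "Bad", "Bod"]


def confusion_matrix(trials):
    """trials: iterable of (target, response) pairs for one group/condition.

    Returns {target: {response: percent}}, each row summing to ~100, so the
    diagonal is percent correct and off-diagonal cells are confusions.
    """
    counts = {t: defaultdict(int) for t in WORDS}
    for target, response in trials:
        counts[target][response] += 1
    matrix = {}
    for target in WORDS:
        n = sum(counts[target].values())
        matrix[target] = {r: (round(100 * counts[target][r] / n, 2) if n else 0.0)
                          for r in WORDS}
    return matrix


# Illustrative data: "Bad" heard as "Bed" once in four trials -> 25% confusion.
demo = [("Bad", "Bad")] * 3 + [("Bad", "Bed")] + [(w, w) for w in WORDS if w != "Bad"]
m = confusion_matrix(demo)
```

Each cell in Tables 5-10 is such a row percentage averaged over the listeners in a group, which is why rows can sum to slightly more or less than 100 after rounding.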
Table 5. Confusion matrix for whole word condition, depicting performance of monolingual, highly-proficient bilingual, and less-proficient bilingual listeners.

                               Response
Target   Group    Bead     Bid     Bayed    Bed     Bad     Bod
Bead     NA      100        0       0        0       0       0
         HP       97        3       0        0       0       0
         LP       84.72    14.58    0        0.69    0       0
Bid      NA        0.74    99.26    0        0       0       0
         HP        2       96.5     1        0       0.5     0
         LP       24.31    61.81    0       13.89    0       0
Bayed    NA        0        0     100        0       0       0
         HP        0.5      0      99        0.5     0       0
         LP        1.39     0.69   97.92     0       0       0
Bed      NA        0        0       0.74    99.26    0       0
         HP        0        4.5     0       95       0.5     0
         LP        2.78     4.86    0.69    81.25    6.94    3.47
Bad      NA        0        0       0        3.68   95.59    0.74
         HP        0        0       0.5      9.5    90       0
         LP        0        0       0        3.47   88.19    8.33
Bod      NA        0        0       0        0       0.74   99.26
         HP        0        0       0        0       1      99
         LP        0        0       0        0      22.22   77.78

Highly-proficient listeners' mean performance showed no perfect score for any of the vowels. The vowels / / and / / were noted to have the best percent correct scores (99% for both). A decrease of five percentage points was noted for / /, with two confusions: / / at 4.5% and / / at 0.5%. A large decrease of 10 percentage points was noted for the vowel / / and two confusions were noted. HP listeners showed the greatest confusion for / / (9.5%).

The response distribution of the less proficient (LP) bilingual listeners was as follows. Overall performance of the LP group was noted to be lower than for the native and HP groups for all vowels. The range of percent correct for the vowels varied from 61.92% (/b d/) to 97.92% (/b d/).
All three groups demonstrated strength in identifying / /. NA and HP both performed most poorly for target / /, with two confusions. Their greatest confusion for / / was noted to be / / (3.68% and 9.5% for the NA and HP groups, respectively). However, their second confusion differed, though both were less than one percent of responses (NA / / and HP / /). The LP group demonstrated lower performance for / / and, as for NA and HP, two confusions were noted. However, the LP listeners identified target / / as / / (8.33%) most often, with target / / identified as / / less often (3.47%). LP listeners showed a generally consistent confusion pattern, meaning errors were noted for only one or two other vowels in the whole word condition. Noteworthy is the LP group's confusion pattern for target / /: / / was identified as / / 24.31% of the time and as / / 13.89% of the time. Although all groups demonstrated the same first confusion for target / /, with much lower error percentages for the NA and HP groups than for the LP group, the HP group demonstrated an additional confusion that varied from LP's responses (i.e., HP's second confusion was / / and third confusion was / /). LP listeners demonstrated a notably inconsistent confusion pattern for the target vowel / /; five confusions were noted, of which / / had the highest confusion percentage (6.94%) and / / the second highest (4.86%).

Original Vowel Confusion Matrix. None of the listener groups demonstrated perfect identification for any of the vowels when consonant context information was removed (see Table 6). The vowel / / was identified best by native listeners (98.53%). Highly and less proficient bilinguals' vowel performance was noted to be best for the
vowel / / (HP: 95.5% and LP: 91.67%). The HP and LP groups performed most poorly on the vowels / / and / /, respectively, whereas native listeners demonstrated lowest performance for two target vowels (/ / and / / at 94.85%). Although the HP and LP listener groups showed similar patterns for best vowel, their numbers and identity of confusions did not match up. An example is / /, where LP displayed five confusions, whereas HP only had four.

Table 6. Confusion matrix for the original vowel (OV) condition, depicting performance of the monolingual, highly proficient bilingual, and less proficient bilingual listeners.

                               Response
Target   Group    Bead     Bid     Bayed    Bed     Bad     Bod
Bead     NA       94.85     1.47    0.74     2.94    0       0
         HP       89        9.5     0        1       0       0.5
         LP       79.17    19.44    0.69     0       0.69    0
Bid      NA        0       95.59    0        2.94    1.47    0
         HP        3       90       0.5      6.5     0       0
         LP       23.61    65.97    3.47     6.25    0.69    0
Bayed    NA        0        0.74   95.59     1.47    2.21    0
         HP        0.5      0      95.5      0       4       0
         LP        4.17     1.39   91.67     2.78    0       0
Bed      NA        0.74     3.68    0       95.59    0       0
         HP        0.5      8       0.5     88.5     2.5     0
         LP        0.69    21.53    9.03    59.03    7.64    2.08
Bad      NA        0        0       0.74     3.68   94.85    0.74
         HP        0        0       2       13      85       0
         LP        0        1.39    0.69     6.94   77.78   13.19
Bod      NA        0        0       0        0       0.74   98.53
         HP        0        2.5     0.5      0       5.5    91
         LP        0.69     0       1.39     0      33.33   64.58
Natural Preserved Vowel Confusion Matrix. The percent correct for the targets were noted to be highest for the native listener group for all vowels except / / (range: 82.35% to 97.79%; see Table 7); for / /, the performance of the HP bilingual group was slightly higher than that of the NA group (97% vs. 96.32%, respectively). Performance was lowest for the LP bilingual group for all vowels. Noteworthy for the NP condition is that all three listener groups performed most poorly for the target vowel / / (NA: 82.35%, HP: 82%, and LP: 57.64%). The groups' largest confusion for target / / was noted to be / /. Furthermore, the HP and LP groups both revealed five confusions rather than only two (NA) for target / /. The range for HP bilingual listeners was noted as 82% (/ /) to 97% correct (/ /), while performance for the LP bilingual listeners ranged from 57.64% (/ /) to 89.58% correct (/ /).

Table 7. Confusion matrix for the natural preserved vowel (NP) condition, depicting performance of the monolingual, highly proficient bilingual, and less proficient bilingual listeners.

                               Response
Target   Group    Bead     Bid     Bayed    Bed     Bad     Bod
Bead     NA       93.38     5.88    0        0.74    0       0
         HP       86.5     11       0.5      2       0       0
         LP       76.39    20.14    1.39     1.39    0.69    0
Bid      NA        0       94.12    0        5.88    0       0
         HP        1       85       1       13       0       0
         LP       18.75    66.67    2.78    11.11    0.69    0
Bayed    NA        0        0      96.32     0       3.68    0
         HP        0        0.5    97        0.5     2       0
         LP        1.39     2.08   89.58     6.25    0       0.69
Bed      NA        0       14.71    0       82.35    2.94    0
         HP        1       13.5     0.5     82       2.5     0.5
         LP        1.39    23.61    4.17    57.64   10.42    2.78
Bad      NA        0        0       1.47     2.21   96.32    0
         HP        0        0.5     0        7      92.5     0
         LP        0        0.69    1.39     4.86   79.17   13.89
Bod      NA        0        0       0        0.74    1.47   97.79
         HP        0        0.5     1        0       6.5    92
         LP        0        0       0        0      39.58   60.42
Natural Neutral Vowel Confusion Matrix. Patterns for the natural neutral vowel condition are characterized by native speakers performing best for all six vowels except /e/ (see Table 8). As for the NP condition, the performance of the HP bilingual group for target /e/ was slightly higher than that of the NA group (99.5% vs. 97.06%, respectively). Native speakers' best vowel was / / (98.53%) followed by /e/ (97.06%). Native listeners demonstrated most difficulties identifying the target vowel / / (86.76%), while the HP bilinguals experienced the most difficulty correctly identifying targets / / and / / (82.5%). Further, the vowel / / was most often confused for the target / / by both native and HP bilingual listeners (NA: 10.29% and HP: 12%). Less proficient bilingual listeners exhibited an even lower percent correct (56.25%) than native and HP bilingual listeners for target / /, but the target vowel showing the lowest performance for this group was / / (50.69%). Nevertheless, LP bilingual listeners also demonstrated the highest confusion of the target / / with / / (19.44%). Highly and less proficient bilinguals performed best on the diphthongized target vowel / / (HP: 99.5% and LP: 88.89%); however, both groups displayed different numbers and patterns of confusion: HP heard only / / for target / /, whereas LP demonstrated four confusions for target / /, of which / / was the most frequent (6.94%). The LP bilingual listeners exhibited greatest difficulties identifying / / (50.69%). Moreover, four confusions were made for this target vowel, of which / / (21.53%) and / / (22.92%) showed most errors.
Table 8. Confusion matrix for the natural neutral vowel condition, depicting performance of the monolingual, highly proficient bilingual, and less proficient bilingual listeners.

                               Response
Target   Group    Bead     Bid     Bayed    Bed     Bad     Bod
Bead     NA       92.65     3.68    0.74     2.94    0       0
         HP       86.5     11       0        2.5     0       0
         LP       70.83    26.39    2.08     0.69    0       0
Bid      NA        0.74    88.24    0.74     9.56    0.74    0
         HP        1.5     82.5     0.5     15.5     0       0
         LP       21.53    50.69    4.17    22.92    0       0.69
Bayed    NA        0        0      97.06     0       2.94    0
         HP        0        0      99.5      0       0.5     0
         LP        1.39     2.08   88.89     6.94    0.69    0
Bed      NA        0       10.29    0       86.76    2.94    0
         HP        1       12       1.5     82.5     3       0
         LP        1.39    19.44    6.94    56.25   11.81    4.17
Bad      NA        1.47     0       0       10.29   88.24    0
         HP        0        0       0       14      86       0
         LP        0.69     2.78    2.08     8.33   72.92   13.19
Bod      NA        0        0.74    0        0.74    0      98.53
         HP        0        1       0.5      0       8      90.5
         LP        0        0       0.69     0      38.19   61.11

Flat Preserved and Flat Neutral Vowel Confusion Matrices. The flat preserved vowel confusion matrix depicts a more uniform pattern of identification for all three groups (see Table 9). For the FP condition, native listeners performed better than the other two listener groups for all of the target vowels. Highly proficient listeners also performed better than the less-proficient listeners, and nearly as well as the native listeners for target vowels / / and / /. The vowel / / was identified best by all three groups (NA: 96.32%, HP: 90.5%, LP: 73.61%). Also, the native and HP bilingual groups showed the next highest performance for target / /. The vowel / / was perceived least accurately by all of the groups (NA: 9.56%, HP: 8%, LP: 8.33%). Other patterns
included similar confusion responses; that is, all three groups heard / / most often for target / /, and / / most often for the target / /. Highly proficient bilingual listeners exhibited a substantial increase from the NN condition in the total number of vowels confused across all target vowels. HP listeners showed a total of 14 confusion events across all the vowels in the NN condition, compared to 23 confusions across all the vowels in the FP condition. This number of confusions by HP equals the figure misperceived by LP (i.e., 23) for the FP condition.

Table 9. Confusion matrix for the flattened preserved vowel (FP) condition, depicting performance of the monolingual, highly proficient bilingual, and less proficient bilingual listeners.

                               Response
Target   Group    Bead     Bid     Bayed    Bed     Bad     Bod
Bead     NA       96.32     2.94    0.74     0       0       0
         HP       90.5      6.5     0.5      2       0       0.5
         LP       73.61    22.92    2.78     0.69    0       0
Bid      NA        0       91.18    0.74     8.09    0       0
         HP        1.5     83.5     0       15       0       0
         LP       19.44    61.11    2.78    15.97    0       0.69
Bayed    NA        1.47    49.26    9.56    36.03    3.68    0
         HP        3       39       8       49       1       0
         LP        8.33    34.03    8.33    46.53    2.78    0
Bed      NA        0       13.97    0       80.88    5.15    0
         HP        0.5     12.5     1       80       5.5     0.5
         LP        2.78    16.67    2.08    59.72   15.28    3.47
Bad      NA        0        0       3.68    16.18   80.15    0
         HP        0.5      0.5     1.5     24.5    72.5     0.5
         LP        0        3.47    4.17    21.53   64.58    6.25
Bod      NA        0        0       0        0      10.29   89.71
         HP        0        0       0.5      0.5    20.5    78
         LP        0.69     0       0        0.69   40.28   58.33

The confusion patterns for the flattened neutral vowel condition are similar to those for the FP condition matrix, in that the strongest and weakest vowels were the same for all three groups for the two conditions (see Tables 9 and 10). Native, highly proficient
bilingual, and less proficient bilingual listeners demonstrated most accurate identification performance for target /i/ (NA: 92.65%, HP: 92%, LP: 75.69%) and least accurate performance for target / / (NA: 16.91%, HP: 11.5%, LP: 11.81%). The groups showed consistent confusion events for four vowels (/ , /, and / / were mistaken mostly for / , /, and / /, respectively). Further, the total number of confusions made by native listeners increased dramatically from the FP to the FN listening condition (FP: 13 and FN: 20). However, native listeners continued to perform better than both groups of bilingual listeners for all target vowels, although their performance was only slightly better than that of the HP bilingual listeners for the target vowels /i/ and / /.
Table 10. Confusion matrix for the flattened neutral vowel (FN) condition, depicting performance of the monolingual, highly proficient bilingual, and less proficient bilingual listeners.

                               Response
Target   Group    Bead     Bid     Bayed    Bed     Bad     Bod
Bead     NA       92.65     2.94    0        4.41    0       0
         HP       92        6.5     0.5      1       0       0
         LP       75.69    20.83    2.78     0.69    0       0
Bid      NA        0.74    84.56    1.47    13.24    0       0
         HP        4.5     72       1.5     21.5     0.5     0
         LP       17.36    56.94    4.86    20.14    0       0.69
Bayed    NA        0       34.56   16.91    44.12    4.41    0
         HP        1.5     35.5    11.5     48       3       0.5
         LP        4.17    33.33   11.81    45.83    4.17    0.69
Bed      NA        0.74    17.65    0.74    72.79    7.35    0.74
         HP        1       16.5     0       72       9.5     1
         LP        2.78    13.19    2.78    56.25   16.67    8.33
Bad      NA        0        0       0.74    11.76   86.76    0.74
         HP        0        0.5     1       23.5    74.5     0.5
         LP        0        2.08    1.39    18.75   63.89   13.89
Bod      NA        0        0.74    0.74     0.74    9.56   88.24
         HP        0        0       0.5      0.5    23.5    75.5
         LP        0        0       0        0      37.5    62.5

Confusion Patterns. In discussing the confusion matrices, general patterns were noted for identification of the "strongest" and "weakest" vowel and the number of confusions made by condition. All three groups identified / / correctly most frequently for the WH condition. The vowel / / remained the strongest vowel for HP from the WH through NN listening conditions. All three listener groups heard / / most accurately for the FP and FN conditions. Similarly, all of the listener groups had most difficulties with / / for the FP and FN conditions.
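The "strongest"/"weakest" vowel summaries and the confusion-event tallies used throughout this section (e.g., 14 vs. 23 events for the HP group in the NN vs. FP conditions) can be read mechanically off a confusion matrix. A minimal sketch, assuming a matrix represented as {target: {response: percent}}; the demo data are illustrative, not from the experiment.

```python
def strongest_weakest(matrix):
    """Return the targets with the highest and lowest percent correct,
    i.e. the largest and smallest diagonal cells of the confusion matrix."""
    diag = {t: row[t] for t, row in matrix.items()}
    return max(diag, key=diag.get), min(diag, key=diag.get)


def confusion_events(matrix):
    """Count nonzero off-diagonal cells: the number of distinct confusion
    events, as tallied in the text when comparing conditions."""
    return sum(1 for t, row in matrix.items()
               for r, pct in row.items() if r != t and pct > 0)


# Illustrative three-word matrix (percent responses per target).
demo = {"Bead":  {"Bead": 97,  "Bid": 3,    "Bayed": 0},
        "Bid":   {"Bead": 2,   "Bid": 96.5, "Bayed": 1.5},
        "Bayed": {"Bead": 0.5, "Bid": 0,    "Bayed": 99.5}}
```

Applied to the full 6×6 matrices of Tables 5-10, these two summaries reproduce the per-condition patterns reported above without any manual inspection of the tables.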
Chapter 5

Discussion

In this study we explored monolingual and bilingual listeners' (highly proficient and less proficient bilinguals) use of perceptual cues for vowel identification. Six listening conditions were generated to examine use of consonant context, duration, and formant cues in different listening conditions (i.e., whole word, isolated vowel, natural preserved vowel [resynthesized], natural neutral vowel [resynthesized], flattened formant vowels [resynthesized], and flattened neutral vowels [resynthesized]). The following questions were addressed: 1) Are vowels perceived differently in isolation than in whole words for these groups? 2) Are vowels synthesized using high fidelity synthesis (Straight) perceived differently than naturally produced vowels? 3) Do highly- and lower-proficiency bilinguals use vowel formant dynamic and duration cues differently than monolinguals? 4) Do the effects of isolation, synthesis, and of formant dynamic and duration cues differ across the six vowels studied? 5) Do patterns of confusions differ across the listener groups?

Question 1: Effects of Vowel Isolation

Interpretations of the group by listening condition results (see Figure 4) imply that both highly and less proficient bilinguals rely more on consonant transitions (CV and VC) than native listeners (see performance for WH-OV conditions). Further, the three-way interaction analysis revealed a significant decrease in percent correct vowel
identification performance for the vowel / / for highly-proficient bilinguals and for / / and / / for the LP bilinguals when consonant transitions were removed (see Table 4). Although removing consonant transitions was not found to result in a significant decrease in performance for any of the vowels for native listeners (according to the three-way interaction analyses, see Table 4), a small but consistent decrease in performance (about 3% on average) was noted even for the native speakers.

An issue for further investigation may be why the LP bilinguals appear to rely so heavily on consonant context, especially for the target vowel / /, and, to a lesser extent, / /. The OV confusion matrix data suggest that the LP listeners hear / / for target / / (21.5% of target / / perceived as / /, see Table 6). However, LP listeners also hear / / (9.0%) and / / (7.6%) for / /. Negligible confusions were also noted for / / and / / as responses for the target word / /. According to Hillenbrand and Nearey (1999), the vowel drift of / /, i.e., measurements of F1 and F2 from the onset to the offset of the steady state portion of the vowel, was noted to show a decrease for both F1 and F2 at the offglide relative to the onset. If this is taken into account, then the F1 offglide of / /, as perceived by Spanish-English bilinguals, should move towards / / and / /. Given this spectral change, LP bilinguals may, according to the PAM, assimilate / / as a poor example of adjacent English vowels, which in this case might be / / and / /, although / / would appear to be closer in the vowel space. As seen by the results of Table 6, this indeed seems to be the case in this study. Interestingly, both native and highly-proficient bilingual listeners appear to follow similar patterns; both listener groups confuse / / for
/ / most frequently after the intended target vowel. One explanation for the particular difficulty experienced by the LP bilingual listeners may be the particular density of the vowel space in this region. Whereas no vowels occur between /e/ and /a/ in Spanish, two additional vowels (/ / and / /) occur between the vowels /e/ and / / in English. Thus, two of the three apparently "new" vowels for Spanish speakers from among the six vowels studied in the present investigation are located between /e/ and / /. Confusions may occur for the LP bilinguals when consonant context is removed because categories for these vowels may be so poorly defined for the LP bilinguals that identification relies on specification of context-specific allophones.

Question 2: Effects of Straight Resynthesis

Effects of Straight resynthesis were only significant for the vowel / / for the native listeners (see Table 4). This finding suggests that for most vowels, identification performance for isolated vowels synthesized using high fidelity resynthesis (Straight) is comparable to that for naturally produced isolated vowels (OV), regardless of a listener's proficiency. However, the fact that native listeners showed a significant decrease in identification performance for the vowel / / raises the possibility that synthesis procedures may have been faulty for the vowel / /. It is interesting that the native listeners alone demonstrated this decrease for / /. Investigations of the resynthesis effects (OV to NP) for the remaining vowels show either equivalent scores (e.g., / / for the native and HP bilingual listeners only), a slight increase in identification performance
(e.g., / / for all three groups), or only a slight decrease in identification performance (e.g., / / for NA and HP only, / / for all three groups).

Further, comparisons with the results obtained by Assmann and Katz (2005) show a slightly larger overall decrease in performance from natural to Straight-synthesized stimuli by native listeners for the present study than for Assmann and Katz's study. This difference may be related to the particularly poor performance for / / in the present study; however, percent correct performance is not broken down by vowel for natural vs. Straight-synthesized vowels in the Assmann and Katz (2005) data.

Question 3: Effects of Formant and Duration Cues for HP and LP Bilingual Listeners

As seen in Figure 4, native, highly proficient, and less proficient bilingual listeners showed similar and relatively small decreases in percent correct identification performance for duration neutralization alone (NP-NN); all three groups' NN performance was poorer than their NP performance, but only by approximately 1-5%. Overall, this effect was not significant for any of the listener groups, although it approached significance for the LP group. In the analysis of the three-way interaction, the effect of duration neutralization alone was found to be significant for the vowel / / for the LP bilingual listener group only.

In the two-way interaction of listener group and listening condition, formant flattening (NP-FP) and the removal of both formant dynamic and duration cues (NP-FN) resulted in significant decreases in percent correct identification performance for all three listener groups. Inspection of the three-way interaction data, when post-hoc analyses were applied for only six comparisons (see Chapter 4), found that highly
proficient bilingual listeners appear to use acoustic cues differently from both native and less proficient listeners. Highly proficient bilinguals demonstrated a significant effect of formant flattening alone (NP-FP) for the three vowels / / and / /, whereas native and LP bilingual listeners only showed an effect for / / and / /. Of interest for the HP bilingual listener group is that the size of effect (in RAUs) was noted to be greater for these two vowels for the HP bilinguals than for the native or the LP bilingual listeners. The HP bilingual listeners demonstrated a significant decrease in performance for the five vowels / , / and / / when both formant and duration cues were neutralized (NP-FN). Native listeners, on the other hand, showed a significant decrease in percent correct identification performance when both cues were removed for only three vowels (/ , /). The increased size of this effect for the HP bilingual listeners for most vowels, in comparison to the native listeners (see Table 4), implies that HP bilinguals identify vowels less consistently when static formant cues alone are available.

Less proficient bilingual listeners were noted to show significant decreases in performance when both formant dynamic and duration cues were removed for some vowels (i.e., / / and / /), although the size of effect was not as great as for the HP bilinguals. Given that the less proficient bilingual listener group showed the greatest significant decreases in performance of all the groups for the effects of vowel isolation and duration neutralization alone, and less extreme effects of formant flattening alone, it appears that LP bilinguals rely more heavily on CV and VC transitions and duration cues for vowel identification than the native and HP bilingual groups. The HP bilingual listener group,
on the other hand, appears to rely more heavily than the native and LP bilingual groups on formant dynamic cues.

Question 4: Effects of Vowel and Listening Condition

As shown in Figure 5, average performance of all three groups by vowel and by condition reveals various perception patterns for the vowels across listening conditions. What is of interest is that each vowel carries at least one effect that is distinct. Common to all of the vowels is the effect of consonant transition (WH-OV): all six vowels show considerable decreases. Although a decrease in performance is apparent to the eye for each vowel, only / / and / / showed significant drops in performance from WH to OV conditions (according to the two-way interaction analysis). Noteworthy is that LP bilinguals showed significant decreases in performance from WH to OV conditions for all of these three vowels. For HP bilinguals, however, removing consonant transitions was found to significantly decrease percent correct identification performance only for the vowel / /, and the size of effect for this vowel was not as great as that for the LP bilingual listener group.

According to Figure 5, vowel isolation (WH-OV) appears to account for the greatest decrease in percent correct performance for the vowel /i/, although the effect was not found to be significant. The fact that no significant effect was evident for any of the listening condition comparisons helps to illustrate how different the effects of the various listening conditions are across the different vowels.

The target / /, on the other hand, showed significant decreases in percent correct identification performance for duration neutralization alone (NP-NN),
duration neutralization of flattened stimuli (FP-FN), and the removal of both duration and formant dynamic cues (NP-FN). Note that the pattern of group by listening condition for this vowel is one of the most interesting (see Figure 5). It appears that the listener groups rely mostly on duration for / /. Tables 7 and 8 allow us to view the two bilingual listener groups' performance for / / (see Appendices D.1-6). Confusion matrices for the NP and NN conditions reveal that LP bilinguals alone rely greatly on duration when no other vowel cue is modified. HP bilingual listeners, on the other hand, use duration cues more heavily when dynamic formant information is no longer present (FP-FN and NP-FN). Moreover, the confusion matrices reveal that HP and LP bilinguals consistently hear /i/ and / / for target / / in the NN, FP, and FN listening conditions.

One reason for this observed pattern for the vowel / / may be its close proximity to / / in terms of its formant frequencies at vowel onset. This proximity would suggest, according to Flege's SLM, that / /, a new vowel for Spanish learners of English, would be assimilated as a reasonably good exemplar of the Spanish / / category, making it a particularly difficult category to acquire for Spanish-English bilinguals. This hypothesis is also partially supported by the overall relatively poor performance of the LP bilingual group for this vowel (around 50-60% correct) and the LP group's apparently heavy reliance on duration cues for it. Note that the LP group achieves only 60% correct performance for this vowel even in consonant context; for all other vowels the LP listeners achieve at least 75% in consonant context. Further investigation is needed to confirm or disconfirm this hypothesis.
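The effect sizes discussed in this chapter are expressed in rationalized arcsine units (RAUs; Studebaker, 1985), which make differences near 0% and 100% correct comparable to differences in the mid-range. A minimal sketch of the transform (the function name is ours, not from the thesis):

```python
import math

def rau(correct: int, total: int) -> float:
    """Rationalized arcsine transform (Studebaker, 1985).

    Converts `correct` responses out of `total` trials to rationalized
    arcsine units, which stretch the scale near 0% and 100% so that
    effect sizes at the extremes remain comparable.
    """
    # Arcsine transform with Studebaker's +1 corrections
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    # Linear rescaling so mid-range RAUs roughly track percent correct
    return (146.0 / math.pi) * theta - 23.0
```

On this scale, 50 of 100 correct falls near 50 RAU, while 0% and 100% map slightly below 0 and above 100, which is what keeps floor and ceiling effects measurable.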
Returning to Figure 5, the most obvious effect for the vowel / / is undoubtedly formant flattening (NP-FP and NP-FN): a very large decrease in percent correct identification performance was seen upon the introduction of formant flattening (NP-FP) for all three listener groups. No duration effect was evident for any of the groups. Of interest is the fact that native listeners perform as poorly as the HP and the LP listeners, which suggests that both monolingual and bilingual listeners rely heavily on spectral information for the target vowel / /. In the Question 5 section, / / will be discussed in further detail with regard to FP-FN.

The resynthesis effect for / /, which relates to the previously discussed question, is of special interest. These results reveal that only the native listeners demonstrated a significant drop in percent correct identification from the OV to the NP condition, and only for this vowel. That a significant effect of Straight resynthesis occurs only for the vowel / / is intriguing and warrants further investigation, perhaps of the quality of synthesis for this vowel. Imprecise measurements may have produced overly distorted output, causing native listeners to mistake this vowel for neighboring categories.

Percent correct identification of the vowel / / decreased significantly for duration neutralization alone (NP-NN) and for formant flattening alone (NP-FP). However, of the two effects, that of formant flattening was larger and was the only one of the two found to be significant in the three-way interaction. The results for the target vowel / / showed significant effects of formant flattening alone (NP-FP) and of removing both duration and formant dynamic cues (NP-FN) for HP bilingual listeners only.
The point to be made from Figure 5 is that all six vowels appear to be affected differently by the various listening conditions. However, to determine whether the listener groups show different confusion patterns for each vowel in the different conditions, it is necessary to review the confusion matrices.

Question 5: Confusion Patterns of Listener Groups

Analysis of the confusion matrices (Tables 5-10) and the three-way interaction results (Table 4) revealed that performance differed substantially across vowels, listening conditions, and listener groups. Tracking one vowel at a time through all of the conditions allows us to see that each group's performance fluctuates. For instance, for the target / /, HP bilinguals, but especially LP bilinguals, depend more heavily on consonant cues (CV and VC transitions) than native listeners. HP bilingual listeners appeared to use formant information (compare NP and FP condition performance) more heavily than LP bilinguals and natives, but their performance deteriorated even more when both cues were removed (NP to FN).

The confusion matrices enabled us to examine the breakdown and confusion events of each group by vowel and by listening condition. Counting the number of confusion events by listener group and listening condition revealed a larger number of events for the LP bilinguals than for the native and HP bilingual listeners from the whole word (WH) through the natural neutral (NN) listening condition. Upon removal of formant information (FP), the HP bilingual listeners increased their number of confusion events drastically. When both cues were removed, all three groups demonstrated a similar number of confusion events. Although this does not reveal differences in the identity of the confusions, the number of confusion events does imply that HP bilinguals become
increasingly inconsistent with the removal of formant and duration cues (NP-FP and NP-FN). The previously discussed vowel / / showed an increase in performance when both cues were removed for all the listener groups, in comparison with the FP condition. The target vowel / / remained the strongest item until the NN listening condition. The fact that all three listener groups demonstrated an increase in performance in the FN condition, relative to the flattened preserved condition (FP), is curious and gives reason for further investigation.

Of the remaining vowels, / / and / / are the only vowels for which all three listener groups demonstrated a decrease in performance from the FP to the FN condition. Other vowels displayed patterns in which only the HP and LP bilingual listeners, or only the HP bilingual and native listeners, showed decreased performance; the native and less proficient listeners, however, never shared this pattern alone. It is also of interest that the HP bilinguals either follow the native listeners' perception pattern or that of the LP bilinguals. Five out of six times, HP bilinguals followed the FP-FN tendency of the native listeners for / , , / (see Tables 9 and 10).

Summary

Parallels can be drawn between the previously discussed results and studies such as Sebastian-Galles and Soto-Faraco (1999), Lopez (2004), and Mayo et al. (1997), whose data suggested that bilinguals will be more challenged when 1) listening conditions are difficult, 2) fewer context cues are available, and 3) age of onset of learning the L2 is later. Our study supports these previous conclusions in that bilingual listeners appear to
be more challenged than native listeners when consonant, formant, and duration cues are manipulated. The data also suggest that increased L2 proficiency (or earlier AOL) appears to correlate positively with vowel identification. This can be seen in the less proficient listeners' consistently lower performance in all of the conditions, compared to that of the native and HP bilingual listeners (see Figure 4). The mean for the HP bilinguals was higher for all of the listening conditions than that of the less proficient listeners, but it was lower than that of the native listeners in most conditions. Nevertheless, it should be noted that the difference in performance between the native and LP bilingual listeners was, when averaged across vowels, between three and six times greater than the difference in performance between the native and HP bilingual listeners, depending on the listening condition.

Recapitulating Figure 4, the most striking tendencies are that 1) native listeners perform consistently better than the bilingual listeners, and less proficient bilinguals performed the poorest in all of the conditions; 2) highly proficient, but especially less proficient, bilinguals rely more heavily on consonant transitions than native listeners (WH-OV); 3) all listener groups demonstrate a significant drop in performance when both formant flattening and duration neutralization (NP-FN) are applied; 4) less proficient bilinguals appear to depend more heavily on duration cues than native and highly proficient bilinguals; and 5) highly proficient bilinguals use dynamic formant cues more heavily for some vowels than do native listeners.

Although the findings provide evidence that highly proficient bilinguals use vowel cues differently than both native and less proficient bilingual listeners, further
investigation of production of the vowels is needed to determine whether improved perception of a vowel is reflected in improved production as well. A potential future study might examine bilingual listeners' perception of L1 vowels, compared to that of monolinguals, when the same acoustic cues are altered as in the present study. Comparison between L1 and L2 perception may reveal whether the L2 listener demonstrates similar patterns for their L1 vowels.

One limitation of this study is that listeners were presented isolated words and vowels only. The ideal study would examine listeners' perception of words in connected speech under the previously mentioned listening conditions. However, for this to materialize, more advanced speech resynthesis methodology would be required. The present study thus offers only a narrow window on how listeners use acoustic cues in everyday life, and conclusions about perception and production must for now be limited to one constituent: the vowel.

However, if adequate information regarding L2 listeners' use of vowel cues is available, then Straight resynthesis may be a promising tool for a number of therapy and education purposes: 1) improved second language acquisition training, 2) accent modification therapy, and 3) listening training for the hard-of-hearing population. Simplified resynthesis software might enable educators or therapists to record stimuli and target words of special interest. It is imaginable that modifying the cues through Straight may help strengthen an L2 learner's attention to, and perception of, not-yet-mastered vowel cues.
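The confusion-event counts discussed under Question 5 amount to tallying off-diagonal responses in each confusion matrix. A minimal sketch with a hypothetical 3-vowel matrix (rows = presented vowel, columns = response; the labels and counts are illustrative, not the thesis data):

```python
def count_confusions(matrix, labels):
    """Count confusion events per presented vowel: responses in which
    the chosen vowel differs from the presented one (off-diagonal cells)."""
    events = {}
    for i, presented in enumerate(labels):
        # Sum every response that is not the correct (diagonal) one
        events[presented] = sum(n for j, n in enumerate(matrix[i]) if j != i)
    return events

# Hypothetical counts for three vowels, 40 trials each
labels = ["i", "I", "eI"]
matrix = [
    [36, 3, 1],   # presented /i/: 4 confusion events
    [5, 30, 5],   # presented /I/: 10 confusion events
    [2, 2, 36],   # presented /eI/: 4 confusion events
]
print(count_confusions(matrix, labels))  # {'i': 4, 'I': 10, 'eI': 4}
```

Comparing such tallies across the WH through FN conditions is what reveals, for example, the sharp rise in HP bilingual confusion events once formant information is flattened.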
References

Assmann, P. F., & Katz, W. F. (2005). Synthesis fidelity and time-varying spectral change in vowels. Journal of the Acoustical Society of America, 117(2), 886-895.

Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171-204). Timonium, MD: York Press.

Boersma, P., & Weenink, D. (2003). Praat [Computer software] (Version 4.2). Amsterdam, The Netherlands: Institute of Phonetic Sciences, University of Amsterdam.

Bohn, O. S., & Flege, J. E. (1999). Perception and production of a new vowel category by adult second language learners. In A. James & J. Leather (Eds.), Second-Language Speech: Structure and Process (pp. 53-74). Berlin: Mouton de Gruyter.

Borden, G. J., Harris, K. S., & Raphael, L. J. (1994). Speech Science Primer: Physiology, Acoustics, and Perception of Speech (3rd ed.). Baltimore, MD: Lippincott Williams & Wilkins.

Crystal, D. (1997). The Cambridge Encyclopedia of Language (2nd ed.). Cambridge, UK: Cambridge University Press.

Dalbor, J. B. (1969). Spanish Pronunciation: Theory and Practice. New York: Holt, Rinehart & Winston.

Dudley, H. (1939). Remaking speech. The Journal of the Acoustical Society of America, 11, 169-177.

ECoS/Win (Version 1.3) [Computer software] (1999). London, Ontario: AVAAZ Innovations, Inc.

Febo, D. M. (2003). Effects of Bilingualism, Noise, and Reverberation on Speech Perception by Listeners with Normal Hearing. Unpublished Doctor of Audiology dissertation, University of South Florida, Tampa, Florida.

Flege, J. E. (1981). The phonological basis of foreign accent: A hypothesis. TESOL Quarterly, 15(4), 443-455.
Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech Perception and Linguistic Experience: Issues in Cross-Language Research (pp. 233-272). Timonium, MD: York Press.

Flege, J. E. (1996). English vowel production by Dutch talkers: More evidence for the "similar" vs. "new" distinction. In A. James & J. Leather (Eds.), Second-Language Speech: Structure and Process (pp. 11-52). Berlin: Mouton de Gruyter.

Flege, J. E., Munro, M. J., & MacKay, I. R. A. (1995). Effects of age of second-language learning on the production of English consonants. Speech Communication, 16, 1-26.

Fry, D. B. (1979). The Physics of Speech. Cambridge, UK: Cambridge University Press.

Fry, D. B., Abramson, A. S., Eimas, P. D., & Liberman, A. M. (1962). The identification and discrimination of synthetic vowels. Language and Speech, 1, 35-38.

Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America, 97(5), 3099-3111.

Hillenbrand, J. M., & Nearey, T. M. (1999). Identification of resynthesized /hVd/ utterances: Effects of formant contour. Journal of the Acoustical Society of America, 105(6), 3509-3523.

Johnston, D. (2000). CoolEdit [Computer software] (Version 1.1). Phoenix, AZ: Syntrillium Inc.

Kawahara, H., Masuda-Katsuse, I., & de Cheveigné, A. (1998). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27, 187-207.

Kewley-Port, D., Akahane-Yamada, R., & Aikawa, K. (1996). Intelligibility and acoustic correlates of Japanese-accented English vowels. Paper presented at the Proceedings of ICSLP 96, Philadelphia, PA.

Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 67(3), 971-995.

Klatt, D. H. (1987).
Review of text-to-speech conversion for English. The Journal of the Acoustical Society of America, 82(3), 737-793.
Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87(2), 820-856.

Labov, W. (2005). Atlas of North American English. Retrieved June 21, 2005, from http://www.ling.upenn.edu/phono_atlas/home.html

Liu, C., & Kewley-Port, D. (2004). Vowel formant discrimination for high-fidelity speech. Journal of the Acoustical Society of America, 116(2), 1224-1233.

Lopez, A. S. (2004). Silent-Center Vowel Perception by Spanish-English Bilinguals and Monolingual English Speakers. University of South Florida, Tampa.

MacKay, I. R. A., Meador, D., & Flege, J. E. (2001). The identification of English consonants by native speakers of Italian. Phonetica, 58, 103-125.

Major, R. C. (2001). Foreign Accent: The Ontogeny and Phylogeny of Second Language Phonology. Mahwah, NJ: Lawrence Erlbaum Associates.

Markel, J. D. (1972). Digital inverse filtering: A new tool for formant trajectory estimation. IEEE Transactions on Audio and Electroacoustics, 20(2), 129-137.

Mayo, L. H., Florentine, M., & Buus, S. (1997). Age of second-language acquisition and perception of speech in noise. Journal of Speech, Language, and Hearing Research, 40, 686-693.

Meador, D., Flege, J. E., & MacKay, I. R. A. (2000). Factors affecting the recognition of words in a second language. Bilingualism: Language and Cognition, 3(1), 55-67.

Microsoft (2000). Microsoft Excel 2000 [Computer software]: Microsoft.

Monsen, R. B., & Engebretson, A. M. (1983). The accuracy of formant frequency measurements: A comparison of spectrographic analysis and linear prediction. Journal of Speech and Hearing Research, 26, 89-97.

Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175-184.

Pickett, J. M. (1999).
The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Boston: Allyn and Bacon.

Rochet, B. L. (1995). Perception and production of second-language speech sounds by adults. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 379-410). Timonium, MD: York Press.
Sebastian-Galles, N., & Soto-Faraco, S. (1999). Online processing of native and non-native phonemic contrasts in early bilinguals. Cognition, 72, 111-123.

Shin, H. B., & Bruno, R. (2003). Language Use and English-Speaking Ability: 2000 (Census 2000 Brief). United States Census Bureau.

Strange, W. (1999). Perception of vowels: Dynamic constancy. In J. M. Pickett (Ed.), The acoustics of speech communication: Fundamentals, speech perception theory, and technology (pp. 153-165). Boston: Allyn and Bacon.

Strange, W., Akahane-Yamada, R., Kubo, R., Trent, S. A., Nishi, K., & Jenkins, J. E. (1998). Perceptual assimilation of American English vowels by Japanese listeners. Journal of Phonetics, 26, 311-344.

Strange, W., Jenkins, J. J., & Johnson, T. L. (1983). Dynamic specification of coarticulated vowels. Journal of the Acoustical Society of America, 74(3), 695-705.

Studebaker, G. (1985). A "rationalized" arcsine transform. Journal of Speech and Hearing Research, 28, 494-509.

TDT System III [Computer hardware and software] (2001). Gainesville, FL: Tucker-Davis Technologies, Inc.

The MathWorks, Inc. (2002). MATLAB [Computer software] (Version 6.5.0): The MathWorks, Inc.

Tye-Murray, N. (1998). Foundations of Aural Rehabilitation: Children, Adults, and Their Family Members. San Diego, CA: Singular Publishing Group.
Appendix A. Monolingual Language Questionnaire

Participant Background Questionnaire (Form A)

Name: ______________ Age: _______
Address (town & state): _________
Phone (optional). Home: _______ Office: _______
Email address (optional): ______________________

1. Is English your first (native) language? Circle one: Yes No
   a. If you answered "No" to (1) above, list your first language here: __________
2. Did you speak any languages other than English while growing up (other than classroom instruction)? Circle one: Yes No
   a. If you answered "Yes" to (2) above, list those languages here: ___________
3. List any languages you speak other than English and rate your degree of proficiency on a scale from "1" to "5" for each (1=beginner, can't have a conversation; 5=like a native speaker): _______________________________________
4. Have you ever been diagnosed with a speech or hearing disorder or had speech or hearing difficulties? Circle one: Yes No
   a. If you answered "Yes" to (4) above, please explain in the space provided below (or on the back if you need more room): _________________________________________________________
5. How long have you lived in Florida (or your current state)? ___________________
6. What state were you born in and how long did you live there? ____________
   (Don't answer #'s 7 or 8 if you've lived all your life in one state.)
7. What state have you lived in the longest? _________________
   a. How many years did you live there? _________________
8. List any other states that you've lived in for over a year (if more than 3, list the top three): ________________________________________________
9. On a scale from "1" to "7", rate your experience with listening to speakers with a foreign accent (1=little or no experience; 7=every day or very frequent): ______
Appendix B. Bilingual Language Questionnaire

Participant Background Questionnaire (Form B)

Name: ________________ Age: _______
Address (town & state): _________
Phone (optional). Home: _____ Office: ______
Email address (optional): _____________________

1. How many years have you lived in your current area (town & state)? _______
2. Have you ever been diagnosed with a speech or hearing disorder or had speech or hearing difficulties? Circle one: Yes No
   a. If you answered "Yes" to (2) above, please explain in the space provided below (or on the back if you need more room): ___________________________________________________________
3. What language(s) did your parents speak with you? _______________
   a. If you answered with more than one language in (3) above, which language(s) did each parent speak with you? __________________________________________________________
4. Where were you born (give city, state, country)? __________________________
   a. How many years did you live there? ______
   b. List other cities or regions you've lived in for more than one year and note the number of years you lived in each. __________________________________________________________
   c. What city and country are your parents from? Mother: _____________ Father: _____________
5. How old were you when you began learning English? ___________
   a. Why did you begin learning English? _______________________
6. If you moved to the United States from another country, how much did you speak English before moving here (describe years of study, whether you learned English in a classroom, and percent of time spent speaking English): ________________________________________________________
7. If you moved to the United States from another country, how long have you lived here? _______ years, ________ months.
8. On a typical day, what percent of your time do you spend speaking English at work? _____% At home? ______% Other (shopping, etc.)? ______%
9. On a typical day, what percent of your time do you spend speaking a language other than English at work? _____% At home? ______% Other (shopping, etc.)? ____% (If more than one, answer for each language.)
10. What percent of the day do you spend with people who speak both (or more) of the languages that you do? _______%
11. What language are you most comfortable speaking? ___________
    a. How much more comfortable are you speaking that language, on a scale of 1 to 5? (1=equal or nearly equal comfort; 5=much more comfortable) ______
12. What language are you most comfortable listening in? ___________
    a. How much more comfortable are you listening in that language, on a scale of 1 to 5? (1=equal or nearly equal comfort; 5=much more comfortable) ______
13. What language are you most comfortable reading in? ________
    a. How much more comfortable are you reading in that language, on a scale of 1 to 5? (1=equal or nearly equal comfort; 5=much more comfortable) ________
14. What language are you most comfortable writing in? _______
    a. How much more comfortable are you writing in that language, on a scale of 1 to 5? (1=equal or nearly equal comfort; 5=much more comfortable) ______
15. Do you think your ability in the language you are less comfortable in is still improving for any of the skills in questions 11-14? Circle one: Yes No
    a. If you answered "Yes" to (15) above, indicate which abilities you believe are still improving. Circle any that apply: speaking listening reading writing
16. What academic degrees have you earned? (List the language of education for each.) ______________________________
17. For all languages that you speak, rate your level of ability on a scale of 1 to 5 (1=not proficient, like a child or beginner; 5=very proficient, like a well-educated native speaker) for each of the following areas:
    a. Comprehension: ____________________________
    b. Fluency (ease of expression): __________________
    c. Vocabulary: ________________________________
    d. Pronunciation: ______________________________
    e. Grammar: __________________________________
Appendix C. Steady State Replication

1. Open Praat.
   a. Find the original resampled sound file.
   b. Find the steady state time (the steady state time can be found in vowedit_meas_goodCR.xls).
   c. Zoom in around the steady state time.
   d. Select --> Move cursor to --> enter the steady state time.
   e. Zoom in on a total of 30 ms around the steady state time:
      Select --> Move begin of selection by --> enter -0.015 s
      Select --> Move end of selection by --> enter 0.015 s
      ** Count the number of cycles (x) within the selected frame:
      - Find the first big negative or positive peak (whichever has more of a zero cross).
      - Count the number of complete cycles plus one extra peak.
      - Enter the number of complete cycles in the Excel spreadsheet.
   f. Select --> Move begin of selection to nearest zero crossing. Enter the time (F5) in Excel.
   g. Select --> Move end of selection to nearest zero crossing. Enter the time (F7) in Excel.
2. Praat Objects:
   a. New --> Sound --> Create Sound.
      Change the name of the file to "Silence".
      Change the sample rate to 11050 Hz.
      Formula: delete the end of the equation (+ randomGauss(0,0.1)) and change the amplitude to zero.
      ** This will provide one second of silence.
3. Go back to the opened Praat sound file and copy the selected measure into the "Silence" sound file.
   a. Paste in one set of the selected measure.
   b. Zoom in and place the cursor at the end of the last complete cycle at the zero cross (before the extra peak).
   c. Select --> Move cursor to nearest zero crossing.
   d. Continue to paste in sets of the selected measure until there are enough cycles to edit for the full vowel time (the result should be longer than the full vowel duration or the largest average).
      - Paste in at least 12 complete cycles.
      ** In Excel, look at the time measures for the original word, the average time of all words for that subject, and the overall average time for all subjects.
   e. Find the last complete cycle and move the cursor to the nearest zero crossing.
   f. Select the remaining unnecessary extra information (extra peaks).
      - Edit --> Set selection to zero (this will silence the extra bit).
   g. Remove extra silence before saving the file (leave only 30 ms of time at the beginning and end of the vowel).
4. Praat Objects window:
   a. Write --> Write to WAV file. Save in the word edit folder as "name_FLRepLong.wav".
5. Repeat steps 3f-g and 4 for the full vowel and natural vowel times.
   a. Select --> Move cursor to --> enter 0.03 s.
   b. Select --> Move cursor by --> enter the appropriate time from the Excel file.
   c. Zoom in on the new time.
   d. Select --> Move cursor to nearest zero crossing.
   e. Select the remaining extra peaks and silence.
      - Edit --> Set selection to zero.
   f. Edit for 30 ms of silence at the end.
   g. Save the full vowel as "name_FLRepFulen.wav".
   h. Save the natural vowel as "name_FLRepEdlen.wav".
6. Copy the files from the name edit directory to the Straight directory.
7. Matlab --> Straight:
   a. Start Straight by entering the word "straight".
   b. A GUI window will pop up. Select Initialize, then select Read from file and find "name_FLRepEdlen.wav".
   c. Continue through the selection prompts down the left-hand side.
   d. Play the original and resynthesized versions.
   e. Save the new resynthesized file as "name_FLRepEdlenResyn.aiff".
   f. Repeat steps b-e for "name_FLRepFulen.wav" and save the new resynthesized file as "name_FLRepFulenResyn.aiff".
8. Copy the files from the Straight directory to the "name Edit" directory.
9. Praat Objects:
   a. Open the resynthesized sound files.
      - Read --> Read from file --> select the resynthesized files.
   b. Select --> Move cursor to --> enter 0.03 s (zoom in on the cursor).
   c. Select --> Move cursor to nearest zero crossing.
   d. Select the waveform to the left of the cursor by using SHIFT and left-clicking the mouse simultaneously.
   e. Edit --> Set selection to zero.
   f. Zoom in on the beginning.
   g. Select --> Move cursor to --> enter 0.03 s.
   h. Select --> Move cursor by --> enter the appropriate vowel duration time (Fulen or Edlen, as appropriate, from Excel).
   i. Hit F6 and copy the time (Ctrl-C) that you moved to.
   j. Zoom in and move the cursor to the end vowel time.
      - Select --> Move cursor to --> Ctrl-V (this will paste in the copied time).
   k. Select --> Move cursor to nearest zero crossing.
   l. Use SHIFT and a left mouse click to select the remaining end time.
   m. Edit --> Set selection to zero.
      - Make sure to leave 30 ms of time at the end (if the difference from 30 ms is greater than 1 ms, then back into the file a little and edit at a new zero crossing).
10. Praat Objects:
    a. Write --> Write to WAV file. Save as "name_FLRepEdlenResynGdDur.wav" and "name_FLRepFulenResynGdDur.wav".
Appendix D.1. Results for the vowel /i/ ("bead") by listening condition and listener group. [Figure: percent correct for the NA, HP, and LP groups across the WW, OV, NP, NN, FP, and FN listening conditions.]

Appendix D.2. Results for the vowel /ɪ/ ("bid") by listening condition and listener group. [Figure: percent correct for the NA, HP, and LP groups across the WW, OV, NP, NN, FP, and FN listening conditions.]

Appendix D.3. Results for the vowel /æ/ ("bad") by listening condition and listener group. [Figure: percent correct for the NA, HP, and LP groups across the WW, OV, NP, NN, FP, and FN listening conditions.]

Appendix D.4. Results for the vowel /eɪ/ ("bayed") by listening condition and listener group. [Figure: percent correct for the NA, HP, and LP groups across the WW, OV, NP, NN, FP, and FN listening conditions.]

Appendix D.5. Results for the vowel /ɛ/ ("bed") by listening condition and listener group. [Figure: percent correct for the NA, HP, and LP groups across the WW, OV, NP, NN, FP, and FN listening conditions.]

Appendix D.6. Results for the vowel /ɑ/ ("bod") by listening condition and listener group. [Figure: percent correct for the NA, HP, and LP groups across the WW, OV, NP, NN, FP, and FN listening conditions.]