Speech Recognition Software for Language Learning: Toward an Evaluation of Validity and Student Perceptions
by
Deborah Cordier
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of World Languages, College of Arts and Sciences
and
Department of Secondary Education, College of Education
University of South Florida
Co-Major Professor: Linda Evans, Ph.D.
Co-Major Professor: James White, Ph.D.
Jacob Caflisch III, Ph.D.
Roger Brindley, Ph.D.
Date of Approval: July 14, 2009
Keywords: CALL, foreign language software, French pronunciation and speaking, foreign language raters, mixed method
Copyright 2009, Deborah Cordier
Dedication

To my family, Nicole Cordier, my daughter and inspiration, my parents Mr. Jean J. and Marjorie L. Cordier, my three brothers, David, Derek and Douglas, grandparents Harry and Gladys Cordier (deceased), my uncle Robert Cordier, M.D. (deceased), and adopted grandparents Drs. Anna Zeldin and Petr Stoler, thank you for your years of continued support and encouragement. Without the deep awareness of your love, I would not have been able to continue on this long road to a happy destiny.
Acknowledgements

Thanks to Alain De Connick and Tom Hunsaker, Educational Consultants, Auralog (2004) for accepting my suggestions, and to Tim O'Hagan and Peter Lucas for helping to make the suggestions reality; Dr. Tucker, French grammarian par excellence; my colleague, fellow academic and friend Rob Cooksey, who has continuously inspired me to reach and search, always beyond my limitations; to Rob Summers and the AC staff, Rob Oates, Zac Carter and Krystal Wolfe, who guided my vision of the project and technology skills throughout. To my committee members, Roger Brindley, who early on and at critical periods, inspired me to think creatively, and Dr. Caflisch, who allowed me to follow behind him for several years, listening and absorbing every detail of conversation and exchange (I still have only scratched the surface structure!). Finally, to my co-major professors, Dr. Evans and Dr. White, who believed in my scholarship, research skills and who both helped me beyond measure, to successfully navigate a series of processes through to the end. With sincere gratitude, thanks to all of you.
Table of Contents

List of Tables
List of Figures
Abstract
Chapter One: Introduction
    Background
        CALL Evaluation and FL Tasks
    Statement of the Problem
    Research Questions
    Significance of the Study
    Definitions of Terms
Chapter Two: Literature Review
    Introduction
    Foreign Language: Learning, Teaching and Speaking
        Automatic Speech Recognition Technology
        Human Raters for FL Evaluation
        CALL Research Using Speech Technologies and ASR
        CALL Evaluation
        Foreign Language Research and TeLL me More (TMM)
Chapter Three: Method
    Introduction
    Research Questions
    Design
    Participants
    Procedure
        Sentence Pronunciation and ASR Scores
        Digital Data
        Human Raters
        Rater Site
        Rater Training
        Participant Survey
    Data Collection and Analysis
        Statistical Measures: ASR Scores and Rater Data
        Inter-Rater Reliability
        Intra-Rater Reliability
        ASR Score Validity
    Qualitative Data: Participant Survey
        Survey Data Coding and Analysis
Chapter Four: Results
    Introduction
    Analysis
    Quantitative Analysis: ASR Scores and Rater Scores
        Descriptive Statistics
            N Explained
            Mean (M) and Extreme Scores
            ASR and Rater Score Distributions
            Important Considerations: Rater C
        Correlation Analysis
            Relationship Between ASR Scores and Rater Scores
            Rater Reliability
            Inter-Rater Reliability
            Intra-Rater Reliability
        Survey Questionnaire
            Results: Survey Questionnaire
        Automatic Speech Recognition (ASR) Features
            Native Speaker Model
            Listening to Self
            Visual Graphs
            More Impressions About the ASR Experience
            Pronunciation Practice: Initial Impressions
    ASR Scores: Perceptions of Usefulness
    Emergent Themes
        Visual Meaning
        Pronunciation Meaning
        Score Meaning
    Perceptions of Effectiveness of the ASR Software
        Final Survey Comments
        Researcher Bias
        Other Emergent Survey Data and Researcher Reflections
            FL Interaction and Speaking
            FL Learning and Speaking
            FL Teaching and Speaking
Chapter Five: Discussion
    CALL Research: ASR Software and Speech Technologies
        Mixed Method Design and Issues of Significance and Meaningfulness
        ASR Sentence Pronunciations and Score Validity
        Possible Confounding Variability
        ASR for Pronunciation Practice
        ASR and Pronunciation Performance
        FL Raters and Rating Online
        Positive Impact and Usability
        Limitations and Directions for Future Research
            ASR Score Validity
            ASR and Rater Correlations
            Questions Related to FL Raters and Training
    Conclusion
References
Appendices
    Appendix A: Pilot Projects Using TeLL me More French
    Appendix B: French Sentences for Pronunciation
    Appendix C: Numerical/Descriptive Rating Scale
    Appendix D: Survey Questionnaire
    Appendix E: Survey Question 10 Qualitative Coding Example
About the Author (End Page)
List of Tables

Table 1 Descriptive Statistics for ASR, ra, rb and rc Scores
Table 2 Correlation Coefficients: ASR and Raters a, b and c Scores
Table 3 Intra-Rater Reliability Correlation Coefficients: Time = 1 and Time = 2 for Rater a, b and c
Table 4 Frequencies for Gender and Duration of French Study
Table 5 Frequency Responses for Survey Q 3 and Q 4
Table 6 Question 7 (a, b, c) Frequency Responses
Table 7 Frequency and Percentages for Survey Question 8(a) and (b)
Table 8a Survey Q 9: ASR Compared to Other Feedback
Table 8b Perceived Helpfulness According to Length of French Study
Table 9 Survey Question 10: Coded Responses for French Teacher Rating
Table 10 Rater ASR Scores for Sentences s1-s10
Table 11 Magnitude: ASR and Rater Correlations
List of Figures

Figure 1. P25 Sentence 1-10 ASR Scores
Figure 2. Three Rater Scores for P16 Sentence 1 (s1), s2 and s3
Figure 3. Organization of ASR and Rater Scores for Analysis
Figure 4. Collated Participant Responses for Survey Question Six
Figure 5. ASR Software Score Box
Figure 6. Participant Screen View of French Sentence Groupings
Figure 7. The Rater's View of the Rating Screen
Figure 8. ASR Software Score Box
ABSTRACT

A renewed focus on foreign language (FL) learning and speech for communication has resulted in computer-assisted language learning (CALL) software developed with Automatic Speech Recognition (ASR). ASR features for FL pronunciation (Lafford, 2004) are functional components of CALL designs used for FL teaching and learning. The ASR features available with the TeLL me More French software provide pronunciation, intonation and speaking practice and feedback. ASR features are examined quantitatively through French student performance of recorded ASR-scored speech and compared with human raters of the same produced speech samples. A comparison of ASR scores with human ratings considers the validity of ASR-scored feedback for individualized and FL classroom instruction. Qualitative analyses of student performances and perceptions of ASR are evaluated using an online survey linked to individual pronunciations and performance and examined for positive impact (Chapelle, 2001) and usability.
Chapter One: Introduction

Background

Foreign language (FL) acquisition and teaching have experienced a renewed and vibrant focus, particularly in recent years with explosive growth toward a unified Europe. Globally, there has been attention focused on foreign languages in African nations, Asia and China at both regional and national levels. English acquired as an International Language (Jenkins, 2000) has highlighted the realities of a global economy in which numerous native languages are spoken by people of various nationalities. Jenkins (2000) has researched the pronunciation difficulties experienced by English language learners from diverse backgrounds and found that many were learning English as a third language, in an FL environment. In many countries, people are required to acquire or use English for on-the-job communicative purposes.

Theoretically an academic domain, foreign language learning and teaching has evolved to include the area of Second Language Acquisition (SLA) or the acquisition of a language other than the native language (L1). SLA is a recent evolution from within the disciplines of both theoretical and applied linguistics and has incorporated theoretical foundations from earlier research. For example, the groundbreaking works of the linguist Noam Chomsky (1957) theorized L1 acquisition in children and Roger Brown (1968) studied and documented how young children acquire a first language. Fundamental SLA incorporates both the idea of acquisition of an L1 and, depending on the age of onset,
type and amount of second language (L2), includes both the concepts of acquisition and learning. Indeed, SLA includes important theoretical foundations, methodologies and skills for the acquisition and the teaching of second languages (L2).

The advent of computer technologies and their subsequent applications in various educational disciplines has resulted in the use of computer-mediated communication (CMC) for foreign language acquisition and teaching. Moreover, the technologies used for FL acquisition have paralleled the technological developments within other domains. Foreign language technologies have also been developed, implemented and researched by FL specialists (Plass, 1998). Researchers have been eager to investigate claims that the use of technology can contribute to positive student outcomes.

Foreign language educators, specializing in applied linguistics, SLA and computer-mediated environments, have contributed to a body of work, including theory, processes, products, and research which comprise a sub-area of SLA called computer-assisted language learning (CALL). In the 1990s, the application of foreign language teaching principles informed early CALL which incorporated evolving technologies (Chun, 1994; Warschauer, 1996).

CALL Evaluation and FL Tasks

Today, CALL designs include state-of-the-art software, such as Automatic Speech Recognition (ASR) and internet technologies, and are routinely evaluated by FL professionals for their pedagogical usefulness. CALL evaluation must center on the FL tasks designed for software and computer-mediated environments. In her seminal work, Chapelle (2001) sets forth theoretical foundations and principles to consider for the
design and evaluation of CALL tasks and states that CALL should be evaluated through two perspectives: judgmental analysis of software and planned tasks, and empirical analysis of learners' performance (p. 52). Ideally both types of analyses, judgmental and empirical, need to be conducted.

Chapelle (2001) suggests several evaluation criteria for CALL tasks, and an important feature is task appropriateness. For example, CALL designs that include appropriate FL tasks are evaluated for their language learning potential (focus on the form of the language), ability to engage learners and positive impact (positive effects of participation). The task appropriateness criteria can assist developers, researchers and other FL specialists to establish whether the computer-mediated tasks are appropriate for students' FL learning.

Based on early foreign language teaching methods (Delattre, 1967; Valette, 1965) and technological developments in communications, CALL tasks that have been relatively easy to adapt and develop for the computer-mediated environments were 1) FL pronunciation and 2) speaking skill practice. Interestingly, the science of linguistics was founded on the analysis of phonetic symbols, or the sounds of spoken language, and linguists studied the sounds of a language and their arrangement in words to communicate meaning. Phonology remains an important branch of linguistics that is the foundation for understanding speech in any language used for human communication.

Beginning in the 1950s, and used predominately for speech therapy, recording speech and pronunciation had been used for speech analysis and by speech labs to identify, record and correct individual sounds. Today, foreign language computer-assisted pronunciation training (CAPT) is a type of CALL software that includes basic FL
pronunciation training (O'Brien, 2006). Basic CAPT was originally modeled after what students experienced in language labs. In the lab setting, a student could listen to a native speaker model and then try to repeat and record produced speech, or output, based on the voiced features and pronunciation heard in the model.

More recent developments in CALL have included courseware with more sophisticated features. Automatic Speech Recognition (ASR), or the capability of the software to capture, recognize, and react in some way to human speech, is extremely popular (O'Brien, 2006). Visualization techniques have also been used in courseware in combination with ASR. The most common techniques include the presentation of a pitch curve and a waveform, visual forms that represent voiced speech.

Statement of the Problem

The TeLL me More (TMM) foreign language software products, which include automatic speech recognition (ASR), have been in use throughout Europe and globally for several years. For example, Lafford (2004) published a comprehensive review of TeLL me More Spanish and outlined the pedagogical strengths of the ASR software. Later, Burston (2005) recommended the TeLL me More German version and commented on the strengths of the speech and pronunciation features. Several recent conferences have included participant presentations (ACTFL, November 2005) that reported positive impact (Chapelle, 2001) regarding use of the FL software in education.

The development of CALL software that includes automatic speech recognition used for FL pronunciation and speech practice has contributed to a renewed focus on the development of FL speaking skills. Through the provision of recorded native speaker
models, repetition, voice recording and feedback, ASR has significantly contributed to software designs. FL learners have been able to both learn and practice FL speech and pronunciation in settings where the software has been available for students' use. Current CALL courseware includes ASR and the entire range of pedagogical skills and activities, several languages and online learning versions.

Foreign language study and interest have expanded within the European Union (EU) community, globally and through technological advances in telecommunications. The ACTFL 2005 Year of the Languages (American Council on the Teaching of Foreign Languages) highlights the importance of concurrent developments in the United States. Important technological developments have significantly added to CALL, FL tasks and the computer-mediated tools made available to foreign language learners.

Following from the recent attention to and visibility of the ASR features of the software, several universities abroad and throughout the US are using TMM versions either as distance learning courses or in blended learning formats (Bonk, 2004) with foreign language classes. University language labs (Rollins College, FL) are providing TeLL me More as stand-alone FL course content or as additions to the in-class curriculum (Bryant University, RI).

Numerous software reviews have documented the use and efficacy of several language versions of the TMM software (Lafford, 2004). Articles submitted to international peer-reviewed journals (CALICO Journal; Language Learning and Technology) are subject to rigorous CALL evaluation (Hubbard, 1992), including review of both pedagogical and technological features. In several reviews, the speech recognition software used in TeLL me More has been rated a strong and contributory feature to FL
speaking and pronunciation skill practice, by both faculty and students (Barr, Leakey, & Ranchoux, 2005).

Several recent reviews (Lafford, 2004; Burston, 2005) have reported on the positive contributions of ASR to CALL software. Studies have focused on the quality and interpretation of ASR and visualization feedback provided to learners (Chun, 1998, 2002; Hincks, 2003). Researchers (Pennington, 1999; Hardison, 2004) have suggested that ASR and CALL courseware are ideally suited for research data collection. However, and perhaps due to the relatively recent awareness of ASR among FL educators, there have been no reported research studies on the evaluation of ASR software for improving FL pronunciation and speech with students who have used ASR software.

ASR software is becoming an increasingly important feature of CALL designs and software (O'Brien, 2006; Chun, 2002) and there is a need within the FL community to continue to support and develop technology that contributes to FL teaching and positive student outcomes (Chapelle, 2001). Consequently, there is a definite need for an investigation of FL student learner outcomes and perceptions of ASR for foreign language learning and enhanced pronunciation.

Research Questions

The use of ASR software for FL learning has been reported in numerous studies. The potential efficacy of the ASR as an aid to FL speech and pronunciation has been documented and observed. However, how can the effectiveness of ASR for FL learning be evaluated and can a more conclusive measure or statement of improved FL speech be
determined? What is the contribution of ASR to CALL research and to student learning and performance? Can the results of student performance be compared to qualified, established performance measures? Importantly, how do students view ASR and what is their evaluation of its purpose and effectiveness for their learning? Does the use of ASR software for pronunciation and speech practice empirically result in improved FL student pronunciation and speech? Does the feedback given actually help students to improve? Are the ASR scores provided to students a valid measure of their performance? How do students interpret the scores and how do they use the feedback for practicing their pronunciation? How does the feedback compare to traditional feedback given by raters and how do ASR scores compare to human ratings of the same speech recordings?

The following investigative questions are proposed:

1) When using the same ASR-produced speech for human rating, how do human ratings compare to the ASR scores provided to students and how valid are the ASR scores?
2) How useful did the students find the ASR features for assisting them with their French pronunciation practice?
3) How effective was the ASR software for helping students improve their mastery of French?

Significance of the Study

The teaching, learning or acquisition of foreign language assumes a full range of activities for improving all aspects of language development, including a focus on the four basic skills of reading, writing, listening and speaking. The skill that is often of
interest to students learning a language is the ability to speak the language. Pronunciation and conversation are important aspects of student language development. Without a comprehensible production (Derwing et al., 1997) of language and speech, learners lack the ability to communicate and to be understood.

The rapid technological developments throughout the world have contributed to the unprecedented expectation of learners for an increased level of sophistication and access to learning. The CALL software currently developed for FL learning may be, in some cases, incapable of meeting the needs of our current generation of learners who have the desire to collaborate and communicate with other learners and their peers in a virtual world.

Automatic speech recognition (ASR), as a tool for FL speech and pronunciation practice, has the potential to meet the needs of FL learners. By providing individualized FL speech practice and instant feedback, ASR can give students access to a form of learning in a simulated, immersive environment.

An investigation of ASR features and feedback scores and an examination of FL learner perceptions of ASR as tools for learning could all contribute to the significance of incorporating ASR into CALL designs. Further, the plausibility of using ASR-designed scores for recorded speech and feedback could be documented. A comparison of ASR-scored recordings and human ratings of the same ASR-scored recordings could provide further insights into the applicability of ASR for FL speech and pronunciation as well as for programmatic and individualized learning outcomes.
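As an illustration only, the comparison of ASR scores with human ratings of the same recordings can be sketched as a simple correlation. The sketch below assumes a Pearson product-moment coefficient and uses invented score lists; the function name, the score values and the scales are hypothetical and are not data, procedures or results from this study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five recorded sentences (illustrative only, not study data).
asr_scores = [3, 5, 4, 6, 2]
rater_scores = [2, 5, 4, 7, 3]

r = pearson_r(asr_scores, rater_scores)
print(f"r = {r:.2f}")
```

A coefficient near 1 would suggest that the software and the rater rank the recordings similarly; a coefficient near 0 would suggest little agreement. A correlation alone does not establish score validity.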
Definitions of Terms

1) ASR: Automatic Speech Recognition
2) CALL: Computer-Assisted Language Learning
3) CAPT: Computer-Assisted Pronunciation Training
4) L1: First language, Mother Tongue (MT) and Native Language (NL)
5) L2: Second language, also Target Language (TL)
6) Native speaker (NS): Speaker of the home country language
7) Intonation: The pitch contour of a phrase or sentence. Pitch movement in spoken utterances that is not related to differences in word meaning.
8) Pitch: In speech, the rate at which the vocal cords vibrate (F-zero) is perceived by the listener as pitch. The auditory property of a sound placed on a scale from low to high.
9) Pronunciation: The utterance or articulation of a word or speech sound, including accurate pronunciation of neighboring sounds (allophones).
10) Rhythm: The patterned and recurring alternations of contrasting elements of sound.
11) Speech features: Features are the smallest unit of analysis of phonological structure.
12) Speech synthesis: Uses a computer and graphic representation for the process of generating human-like speech.
13) Stress: Sounds perceived as relatively more prominent as a result of the combined effects of pitch, loudness and length.
14) Visualization techniques: Graphic representations of a pitch curve and waveform, as used in ASR where the computer decodes a representation of the spoken language.
Chapter Two: Literature Review

Introduction

Foreign language learning and teaching has historically been attached to ideas about language and how languages are acquired. Ideas have been put forth by teachers and researchers regarding the L1 or native language, and more recently the field of second language acquisition (SLA) has opened doors for understanding L2s and multilingual environments. Teaching foreign languages has evolved from the science of linguistics, foreign languages and the development of methods designed to facilitate the learning process for students. Methods continue to evolve and improve, often as a result of testing methods and findings about the student learning process.

Technology, in the form of digital and multimedia resources, has been a relatively recent addition to teaching methods and foreign language learning. Computer-assisted language learning (CALL) researchers investigate materials designed to enhance FL learning. Automatic speech recognition (ASR) is one feature that has been added to materials in recent years to assist FL learners with pronunciation and speaking practice. Studies designed to investigate the ASR feature of CALL products used by students to assist with their FL learning are underrepresented. There is no record, to date, of studies specifically investigating the interpreted meaning of an ASR scoring feature provided by commercial ASR and presented to students using the software.
Foreign Language: Learning, Teaching and Speaking

In the early 19th century, foreign language was an integral part of US immigrant community life as newly arrived families struggled to learn English and to assimilate into a new and vibrant culture. The view of the US as a melting pot certainly hinges on the idea of foreign languages and cultures becoming a part of the fabric of a distinctly American language and heritage. Foreign languages of immigrants, US participation in foreign wars and academic interest in understanding and using foreign language texts for literature, industry and commerce were only a few of the factors responsible for the learning of foreign languages within the context of an American English environment.

Teaching of English, and other foreign languages, became an integral part of American education out of a need to understand and communicate with students, their families, teachers and friends. The concept of the old schoolhouse, with a few students and a local teacher, can still be found in small, remote communities within the rural and expansive western US. The Modern Language Association (MLA) was one of the first groups of educators involved in developing and disseminating foreign language and translated materials for use within the schools and universities. MLA writing style and guidelines are still used for academic works within a foreign language, social humanities framework.

With the introduction of foreign languages into the communities and schools came a need for trained teachers, teaching methods and materials. Early foreign language teachers, lacking developed methods, were often native speakers of the language or bilinguals capable of managing the FL within the classroom. Still today, many students and foreign language educators feel that the native speaker (NS) model (Long, 1991) is
the model to be aspired to for learning or teaching. William James, an American psychologist, was educated in France in order to become fluent in French (Proudfoot, 2004). He later translated his learning into a universal treatise on the pragmatics of living (Pragmatism). Many of the early learning methods were developed from a need to understand, communicate, or interpret language and meaning for a distinct purpose.

Linguistics and applied linguistics were the first disciplines, besides the direct contact with native speakers, to inform FL teaching methods. The science of linguistics has a long historical tradition with numerous schools of thought. De Saussure (Course in General Linguistics, 1959) is credited with writing the first linguistics text still used and referenced today. Ironically, the seminal work was compiled from the copious notes of several of his students, as the maître was known for speaking his lectures and not writing them down! As was De Saussure, linguists were mainly concerned with the spoken language and concentrated heavily on phonology and phonetics, or generally the sounds of language.

Applied linguists of the 1960s and early 1970s, predominantly from the UK, were interested in the applicability of the linguistic frameworks to FL teaching. One of the earliest methods, grammar-translation (GT), was concerned specifically with a direct one-to-one relationship between, for example, a term or word in a first language (L1) and the translated meaning in the second language (L2).

An interest in first language learning (Roger Brown, 1968) and a subsequent shift to more psychological and introspective methods for examining the acquisition of language resulted in a convergence of linguistics, applied linguistics and foreign language acquisition, learning and teaching. The sub-field of second language acquisition (SLA) is a direct outcome of the convergence, whereby the acquisition of a
second language, other than the mother tongue or native language, has come to be viewed as distinctly different from the acquisition of the L1. FL learning and teaching, from both a theoretical and practical framework, were central to the direction, developments and research within SLA.

In the 1980s, the developments within SLA resulted in a renewed focus on foreign language for communication. The communicative language theories and teaching method (CLT) (Canale and Swain, 1980) were influenced by a need to use language for practical purposes and realistic needs. Students wanted to be able to use the language they learned and were often motivated to study and acquire a language for the distinct purpose of using the language with their families or for traveling abroad to visit relatives or friends. FL teachers wanted to be able to motivate, teach and inspire in their students a love for language, a passion that often motivated the teachers themselves.

Dr. Caflisch, Linguistics Professor Emeritus (USF, 2005), has often been praised for his French accent when using his French. He is quick to comment on how he acquired and mastered the accent in seventh grade when his French teacher, a non-native speaker, asked for pronunciation assistance from a native French-speaking student in the class. Dr. Caflisch credits the teacher for having the insight to recognize the value of using a native-speaker model for pronunciation practice.

Within the classroom, FL teaching has developed and incorporated or discarded techniques, but the theme of communicative language use has remained a steady framework since the early 1980s. Early FL classroom teaching involved four skills: reading, writing, listening and speaking. Speaking, a main objective for many learners, was difficult for students within a classroom setting, as most students felt anxious
(Horowitz, 1980) and self-conscious about speaking in front of the teacher and fellow students. Teaching speaking, especially with beginners and lower FL level students, was usually in the form of memorized practice, recitation, pronunciation drills and conversation exercises with partners. When speaking was tested, indirect methods were used for discrete pronunciation features or for listening for correct pronunciation (Valette, 1967). Direct methods, whereby the teacher speaks with a student or students, were time-consuming, and recording techniques were unsophisticated and costly. As a result of these and other factors, pronunciation and speaking were de-emphasized in the FL classroom.

Speaking as a form of interaction with authentic, real-life materials and native speakers, a main feature of CLT, was re-emphasized and re-vitalized within SLA by the technological developments within society, generally, and specifically through computer-assisted language learning (CALL). Chun (1994) was one of the first SLA researchers to examine interaction with her German students using computers to speak by writing to each other and thus communicatively interacting within the FL classroom. Throughout the 1990s, rapid technological advances in hardware, software, audio and video have contributed to CALL tools. The explosion of the internet as a worldwide web of global communication has inspired a desire for social communication and virtual connection. The resurgence of a sense of community has provided new and interesting tools for FL teachers to use. Some teachers have been eager to respond, others have been hesitant and less eager. Regardless, FL students have experienced a bigger world by the time they arrive at the university FL classroom. The FL student of today has the opportunity to explore and learn in unparalleled ways.
Automatic Speech Recognition Technology

One area of computer-assisted language learning that has presented particular difficulties for researchers and teachers and, at the same time, significantly contributed to the learning materials for speech and pronunciation is the use of Automatic Speech Recognition (ASR). The most common and widely recognized example is the TeLL me More (TMM) software by Auralog that has embedded ASR software within a pedagogically-based language learning software. The benefits of the ASR software for students are many and have been documented by numerous researchers and teachers who have used the software (Hincks, 2003; Hemard, 2004). FL students can practice pronunciation and speech samples, receive feedback, repeat, listen and hear a native model in an individualized setting.

The limitations of ASR are often misunderstood by FL teachers who are unfamiliar with the details of how ASR works. There is no need to explain or understand a sophisticated technology; however, there is a need on the part of the CALL researcher to understand the process in order to re-interpret the limitations for users.

Davies (2006) has provided an excellent reference for language teachers and, using a module approach, has succeeded in explaining very complex terms in clear and understandable language. According to Davies, the history of how ASR was developed for CALL, where ASR fits into CALL and the state-of-the-art are all traced to the initial efforts in the 1940s-1950s with machine translation (MT). Machine translation as a research area was abandoned, but has contributed to the concepts of artificial intelligence (AI) or the ability of the computer to produce responses equivalent to human, intelligent
response. AI has evolved from the early MT work on the mathematical reconstruction of language, using rules to reconstruct pieces of grammatical structure (Parsing). Intelligent CALL or parser-based CALL (Davies, 2006), incorporating theoretical linguistic concepts and speech synthesis, has been at the origin of newer domains such as corpus-based linguistics.

Speech synthesis, or speech analysis, although similar to ASR, is often confusing for linguists or FL teachers who had previously used speech synthesis for studying sounds of language (Molholt and Pressler, 1987; Ladefoged, 2004). Speech synthesis takes already produced or spoken language speech sounds and analyzes the speech textually, graphically or in other visual formats. A visible form of speech synthesis and transcription is used for producing speech on the TV screen, most commonly for the hearing impaired. Yet, English language beginning learners, who may have only a basic understanding of pronunciation, are assisted in their understanding when they can read the words on the screen. ASR fundamentally attempts to reproduce speech by essentially re-creating the words by recognizing or comparing sounds for the purpose of providing some spoken information to a user. Speech recognition used for commercial customer service prompts is a common use of ASR.

The incorporation of ASR into CALL, with direct benefits for learners, came with the development of speech recognition engines that could handle continuous speaker-independent speech. The flexibility of this type of ASR allowed for: easy and convenient applications for standard personal computer (PC) hardware, production of complete sentences, no training periods, immediate practice, a focus on fluency, word order and native models, as well as leniency in the recognition of a variety of accents (Davies, 2006). It is important to note that there are also limitations imposed by the system's
flexibility: greater flexibility allows these programs to recognize sentences spoken by students with a wide variety of accents which also limits the accuracy of the programs, especially for similar sounding words or phrases. Some errors may be accepted as correct (p. 13). According to Davies (2006), native speakers may be more bothered than students by the trade-off of accuracy for flexibility, although accuracy for most products still ranges between 90-95% (Alwang, 1999 in Davies, 2006). Davies feels that learners are more likely than native speakers to accept the limitations of ASR because of the satisfaction learners experience when the ASR understands imperfect pronunciation, especially that of beginners.

The different types of ASR and the tools needed for development, as well as the cost of speech recognition programs and products, make their use in CALL especially difficult. An ASR-patented system by Auralog, Inc., called S.E.T.S (Speech Error Tracking System), is fundamentally a recognition system that incorporates the three main components of a speech recognition system: recognizer, acoustic model and language model. The recognizer is the engine, the computer programmed driver of the system. It is designed by computer programmers using programming and mathematical language, using pre-designed algorithms, or specific rule-based programmed commands. With a recognizer, the acoustic and language models can be developed for a specific foreign language, hence the development of several language versions. For example, Auralog provides ASR with nine language versions.

Many commercial software companies purchase the recognizer component and then build the other components. Several academic institutions have been able to design
and develop their own systems. For example, Carnegie Mellon University has the computer programming power, financial assets and a large global interest, and has developed its own speech recognizer (Carnegie Mellon, 1999), a system that has been used and tested by CALL researchers (Eskenazi, 1999). Several other university research-based systems have evolved within the past ten years.

In a technical paper obtained from Fifth Generation Computer (FGC) Corporation entitled Speaker Independent Connected Speech Recognition (2008), several speech recognition systems that employ different techniques for the recognition process are described. The process of ASR could be summarized as follows: a recognizer algorithm initially takes input sequences of phonemes (sound parts) or words and generates acoustic templates. These templates are compared to the previously acoustically modeled templates (acoustic model); a template of each word in the vocabulary is stored, and input is evaluated for patterned matches. Close or exact matches are then selected. The basic process of ASR, in a simple representation, can be outlined as follows:

Input Speech -> Front-end processing -> Stored Word Patterns -> Pattern Matching -> Decision Rule (decides closest match) -> Recognized Word (FGC, p. 12).

In a speaker-independent system: there is no training of the system to recognize a particular speaker and so the stored word patterns must be representative of a collection of speakers expected to use the system. The word templates are derived by first obtaining a large number of sample patterns from a cross-section of speakers of different sex, age-group and language or dialect, and then clustering these to form a representative pattern for each word. A representative pattern can be created by averaging all the patterns in a
word cluster. Because of the great variability in speech, it is generally impossible to represent each word cluster with a single pattern. So each cluster is sub-divided into sub-clusters and a number of tokens for each word (i.e. up to twelve) is stored in the system. All tokens of each word are matched against the input word. (FGC, p. 12).

Intelligent CALL, or parser-based CALL (Davies, 2006), as a subfield of CALL and of which ASR is a part, is not yet prepared to address the issue of spontaneous, natural foreign language speech. However, FL speech, speaking and pronunciation can be effectively practiced using ASR software. Davies (2006) suggests several purposes for ICALL including: communicative practice, reported increased confidence from students, and research.

Research using CALL software, where the software tracks whole sentences (such as ASR), detects and records errors and includes other types of visual feedback, has been suggested by other researchers (Chun, 1994, 1998, 2002). For example, Chun (1998) suggested using the tracking features of the software for research and for investigating the visual waveforms and pitch curves produced by the software. Chun (1998) proposed training students to interpret the visuals provided with ASR software and then investigating whether the training and assistance contributed to their understanding during practice.

Studies using ASR have found ASR contributes to FL pronunciation practice where the students used the software (Hincks, 2003; Hemard, 2004; Barr et al., 2005). However, there is a lack of studies examining the actual impact of the use of the CALL software generally and ASR software specifically. Part of the reason lies with the difficulty of using the software for data collection purposes. Although student tracking is
provided in the software and in fact research using these features has been suggested (Chun, 1998), the difficulties encountered in extracting the data make studies using this design difficult and time consuming. In many cases the student data can be tracked, but cannot be exported for analysis.

For example, TeLL me More (TMM) Automatic Speech Recognition (ASR) provides a visual score to students after each production. The score, or green bars, provided indicate a range of pronunciation between 3 and 7 (1 or 2 bars are highlighted as gray, indicating a production below the recording threshold). There is no information on what the score means, and it is assumed that more green bars indicate a better pronunciation. Several reviewers of the TMM ASR software, of both French (Reeser, 2001) and American English versions (Zahra, R. & Zahra, R., 2007), have commented on the lack of information about the ASR score. Students using the software are responsive and excited to see their pronunciation represented by 7 highlighted green bars and will repeat sentences in order to re-confirm, or try again for a good pronunciation.

What do these scores mean and, more importantly, what do they represent, in terms of feedback for pronunciation practice or performance, to FL students? How are the scores generated by the ASR software? Is there a special algorithm that produces this product as part of an internally-programmed process of comparison? Fifth Generation (2008) has one short comment that includes the issue of scores, which suggests this algorithmic comparative process may play a part: "In the one-stage algorithm each word pattern is matched against a first portion of the input pattern using a dynamic time-warping algorithm (DTW). Then some of the best scores together with their corresponding ending position in the input pattern are recorded" (p. 13). How can this ASR score information be obtained and how can it be interpreted for the FL researchers and the community of users?
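The template-matching process summarized from the FGC paper (stored word patterns, pattern matching, and a decision rule that picks the closest match) can be illustrated with a minimal sketch. This is not Auralog's or FGC's algorithm, neither of which is documented in the sources reviewed here; it is a textbook whole-pattern dynamic time warping (DTW) comparison, assuming toy one-dimensional features in place of real acoustic features. The names `dtw_distance`, `recognize` and `TEMPLATES` are hypothetical and introduced only for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors.

    Each sequence is a list of equal-length tuples of floats. A smaller
    distance means the two patterns are more alike after time alignment.
    """
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            # Allow a frame to be stretched, compressed, or matched one-to-one
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(input_pattern, templates):
    """Decision rule: return (word, distance) for the closest stored token.

    `templates` maps each vocabulary word to a list of tokens, mirroring the
    idea that several tokens per word are stored and all are matched.
    """
    best_word, best_distance = None, inf
    for word, tokens in templates.items():
        for token in tokens:
            distance = dtw_distance(input_pattern, token)
            if distance < best_distance:
                best_word, best_distance = word, distance
    return best_word, best_distance


# Made-up toy templates (assumption): two tokens for "oui", one for "non".
TEMPLATES = {
    "oui": [[(1.0,), (2.0,), (3.0,)], [(1.0,), (3.0,)]],
    "non": [[(5.0,), (4.0,), (5.0,)]],
}

# A slower production of the first "oui" token: the first frame is repeated.
spoken = [(1.0,), (1.0,), (2.0,), (3.0,)]
print(recognize(spoken, TEMPLATES))  # prints ('oui', 0.0)
```

In this sketch the distance itself is the only "score" available. How a commercial product maps such a distance onto a display of 3 to 7 bars is exactly the information that is not published, so no such mapping is attempted here.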
Communication and research with Auralog's representatives have been ongoing since 2005. The company has endorsed and been supportive of university-based research. The Auralog website has expanded and includes references and resources, as well as scholarly articles. An article by Wachowicz and Scott (1999) is particularly insightful and informative. Both research-based and commercial CALL speech products are evaluated and reviewed. In some products, pronunciation can be scored (Vocabulary Builder), but only relative to expected or anticipated responses. In reference to pronunciation scoring, Wachowicz et al. (1999) comment: "Based on the performance of ASR in commercial CALL products, it is our opinion that it is too early for recognizers to grade pronunciation as a background process secondary to word-learning games and other response-selection activities in CALL" (p. 268).

Wachowicz and Scott (1999) also reviewed an earlier product by Auralog (Auralang 1995) and their evaluative findings are important for this study. Auralang offers activities focusing on word boundaries and pauses in a sentence (Auralog, 1995). The authors continue: "Speech interactive courseware allows the learner to focus precisely on the desired portion of the sentence. The system also allows the learner to see his or her voice in the form of a sound wave and compare it with the waveform of a model native speaker. We found this kind of activity interesting and potentially instructive. It is not subject to the errors made by ASR when attempting to categorize utterances as this or that word or phrase." The authors caution readers, "However, we did not test the Auralang activity with students, nor did we find teachers familiar with the software who could give an impressionistic assessment" (p. 268).
Wachowicz et al. (1999) contend, as does Chun (1998), that the waveform is a valuable tool and although teachers may be hesitant, there is a new audience of technologically sophisticated learners (p. 268). Although the article is somewhat dated (ten years have passed since it was published), the realities of conducting foreign language research with commercial speech-interactive CALL, despite, or because of, technological advances, are daunting for CALL researchers and consequently for teachers whose methods they seek to inform.

Auralog's educational representatives have been helpful and provided information about the accuracy rate of the speech recognition engine used for their ASR products:

The error rates that are usually used in speech recognition engines are related to native speakers speaking in their own language. The speech recognition engine that Auralog is using has an accuracy rate of 99% (error rate of 1%). However this does not reflect the reality of the language learning domain where the speaker is not a native. If a sentence that is extremely poorly pronounced is misrecognized, is it an error from the speech recognition engine (that should account in the error rate) or the consequence of the bad pronunciation? We do not have a simple error rate that would show our performance in our specific domain. We use multiple different indicators that measure the various possible problems: an average pronunciation that gets a good score, a good pronunciation
that gets a poor score, etc. We try to reduce them all knowing that they are contradictory and that the final tuning is based on a good balance between all of them. (Personal communication from Tim O'Hagan, Auralog, May 2009).

CALL researchers attempting to investigate how ASR, and the components designed into the system, can be used for instructional FL speech-interactive purposes are faced with considerable challenges. The challenges continue to emerge from several domains such as computer speech recognition designs and advances (Rodman, 1999), the renewed interest in natural language processing (NLP) and computational linguistics, and a focus on foreign languages and consequent sociolinguistic developments worldwide. CALL researchers are challenged as never before to make sense of the rapid and sometimes daunting changes occurring within the CALL domain (Chapelle, 2007).

Human Raters for FL Evaluation

Early foreign language classroom teachers could be considered the first human raters as they were charged with evaluating the speech samples produced by their students. Valette (1965), in a seminal work on modern language testing written for FL classroom teachers, describes the design and evaluation of speaking tests. Referencing Lado (1964), considered the father of language testing, Valette describes how teachers can conduct speaking tests and how students' pronunciation, intonation and other speech features can be evaluated. In each instance, the teacher is considered the judge or evaluator and is cautioned to understand the specific problems faced by students. For
example, English-speaking students often have problems with multiple-syllable word stress, as the English stress system allows word stress on the first syllable or the second syllable, as well as weak stresses, depending on the word. Students, in their early FL learning, apply English word stress patterns to FL words. Misplaced stress can result in a word that is incomprehensible to or misunderstood by a native speaker (Valette, p. 93).

Valette (1965) addressed the issues of teacher-rater reliability with regard to speaking tests and provided a clear discussion of ways for increasing score reliability and consistency by limiting the rating to a specific feature or aspect of an utterance. Suggestions are made on how to design scoring systems or evaluation categories and forms for rating. The issues of rater reliability that were pertinent to FL teachers in early FL classrooms remain important, if not paramount, to teachers and human raters of FL speech today.

Valette's (1965) discussion highlights the problems that are inherent in human rating, problems which have remained basically the same over the years, despite the development of CALL materials and technological advances. Standard protocols for ensuring rater reliability and consistency in testing situations, that is, where human raters are used to evaluate FL speech production and outcomes, have been developed. Protocols for human rating are routinely used in any case where human error has the potential to contribute to the standard error of measurement (SEM) related to the test items or test format, in any testing situation (Weir, 2005).

Rater training is considered necessary to ensure that raters are instructed in the features of the rating scale and practice using the scale to rate actual speech samples. Rater training is one way believed to increase the consistency of the ratings raters assign
to the speaking samples (McNamara, 1996; Weir, 2005). Practice rating and training are suggested at all levels of FL speaking rating and may be especially important for beginner levels where speech may be undeveloped. Targeting specific speech or pronunciation features, in short sentences designed to elicit those features, as suggested by Valette (1965), can assist in training raters, and in their actual rating of FL speech, in both direct and indirect methods.

Although Valette (1965) suggested limiting the features of an utterance for teacher evaluation, many language test developers proceeded to use interviews as a direct method of rating FL speech (Underhill, 1987). Any direct method of rating speech involves having the rater, or raters, present with the examinee, and rating is done within the context of the speaking test situation. Indirect speaking test methods involve audio or video recordings of FL speech or pronunciation that are evaluated outside of the test environment. Indirect methods were available in the 1950s-1960s. However, the lack of mechanical and technological sophistication made for a time consuming process for recording and playback. This in turn made listening and re-listening difficult and frustrating for raters.

The American Council on the Teaching of Foreign Languages (ACTFL) (1985) was an early proponent of the direct method, and the development of the ACTFL-Oral Proficiency Interview (OPI) was an attempt, on a large scale, to standardize a form of direct speaking test. The ACTFL-OPI met with considerable controversy (Lantolf et al., 1987) and the methods and evaluation criteria were severely criticized by foreign language researchers. Despite the early criticism, the ACTFL-OPI remains a regularly used instrument for evaluation. As well, ACTFL has established itself as the preeminent
FL organization for foreign language teachers and researchers. ACTFL has developed a sophisticated, time and cost intensive process for training human raters, with the purpose of standardizing the rater training and scoring processes.

ACTFL has also been important in the development of scoring rubrics and descriptors used for evaluating FL tasks. Initially, ACTFL (1982) borrowed and adapted the numerical scoring system developed by the Interagency Language Roundtable (ILR), a system originally developed for the Defense Language Institute to teach foreign languages to military personnel in an immersive and accelerated program. ACTFL, however, addressed different needs: those of professors teaching FL at the university level. University students were learning under very different conditions from military personnel. ACTFL dropped the numerical scale (1-7) and used the ILR foundational framework to develop descriptors marking attainment levels. Eventually, ACTFL collapsed levels and assigned low, mid and high to both the Novice and Intermediate Levels to accommodate learning levels observed in university FL courses. In 1999, in response to a defined interest and more diversified need for evaluating speaking, ACTFL developed the ACTFL Guidelines-Speaking 1999. A rubric and descriptors were developed to address the need for a direct method of evaluating FL speech and speaking performance that was not specific to an interview setting.

In the 1980s, with the work of Canale and Swain (1985) and their proposition of a communicative approach to language teaching, the shift toward language for communication resulted in a focus on authentically-based texts and resources. Throughout the 1990s, technological advances in multimedia, including audio and video, significantly contributed to the development of computer-assisted language learning
(CALL) materials developed at the time. The resources were designed to provide a venue for delivering an authentic, communicatively driven foreign language curriculum which supported the theoretical foundations proposed by Canale et al. (1985) and expanded by other FL professionals (Savignon, 1997; Bachman, 1990).

CALL developments also contributed to the expansion of materials that included considerations for ways to include not just recording and playback of speaking skills, but also spontaneous speech, intonation and related features such as pitch and stress, pronunciation, conversation and discourse. Chun (1994), in her early study, concluded that the words students wrote using the computer were a form of spoken communication or dialogue. Subsequently Chun proposed a more inclusive view of FL speech that included intonation and discourse features. Chun (1998) focused on CALL and the possibilities for FL student learning. Chun's vision and suggestions were significant to the direction of the current study. Her critical review of the nature and functions of discourse and intonation (Chun, 2002) is a significant contribution to the direction of CALL, FL speaking skills, discourse and testing.

Chun (2002) critically reviewed the ACTFL guidelines and evaluated the current descriptors. Chun found a distinct lack of focus on evaluating and rating speech, intonation, pronunciation and related features. The descriptors did not address features critical to speaking except, marginally, at the higher attainment levels. The novice and intermediate levels, levels most often found in students studying FL at the university level, were disturbingly incomplete and provided few or no descriptors and guidelines for FL educators to use to evaluate students at these more beginner levels.
Chun (1998; 2002; 2007) suggested a need for more research using CALL materials and for more complete scales and descriptors developed for evaluating speech by raters, as well as for teachers. If the state-of-the-art for human raters has remained essentially the same for the past ten years, given the established and prevalent elements of human error in rating, research needs to be directed toward addressing the lack of methods and paucity of qualified rubrics for scoring and evaluating speech data. Technological advances have provided ways for CALL and multimedia materials to collect speech data; however, the components necessary for evaluating speech are more complicated and require understanding and expertise in areas such as natural language processing (NLP) and automated scoring. Previously, these areas have been unexplored by CALL developers and FL researchers (Chapelle et al., 2008).

Advances in multimedia and the advent of the internet in the 1990s have contributed significantly to foreign language speech and speaking, both learning and teaching. Automated and instantaneous feedback, speech recognition, voice-over internet protocols (VoIP) such as Skype and online communities such as Second Life have all impacted advances in learning and have contributed to a more participatory and inclusive virtual world. Global educational and environmental developments in Asia and Africa and the emergence of a unified European community have all contributed to the multicultural and multilingual online world of learners. The thrust toward a more social and cultural world fabric has resulted in a renewed focus on understanding cultural similarities and differences, regional languages and dialects as well as national and regional boundaries. Online communication and transaction are commonplace exchanges at all levels of society and among all ages.
Internet-based and online FL tests that can record, provide feedback and evaluate speaking skills are used routinely for placement purposes, in courses, for certifications, work-related training and admissions. In high stakes testing, such as the Educational Testing Service (ETS) Test of English as a Foreign Language (TOEFL-ETS) internet-based test, a speaking section was added in 2005. During the test, foreign language examinees produce task-related speech samples that are recorded and then sent to ETS raters for rating. The raters are ETS certified and trained, and examinees are notified of their speaking score about 7-10 days after the actual test. Speaking scores are then reported as part of the total test score.

Within the high-stakes testing arena, recent work by Cambridge ESOL researchers into the IELTS (International English Language Testing System) speaking test resulted in the first-time addition of a pronunciation criterion and a pronunciation scale (Develle, 2008). Early studies using raters revealed problems with raters interpreting the pronunciation scale. The findings during a two-phase mixed method study resulted in revisions that included a larger point range as well as half point values being added. The revised pronunciation scale became fully operational in August 2008 and monitoring of Speaking test performance data, including functioning of the pronunciation scale, is part of an ongoing research agenda. Using both correlation analysis and feedback questionnaires, ESOL researchers were able to refine a pronunciation scale used at the international level. Cambridge ESOL has been at the cutting edge of speaking and pronunciation testing.

Interestingly, even at a high stakes level of testing such as the iBT-TOEFL, raters are used for rating non-native speaking tasks. The speech is produced by examinees; the subsequent sound files, or recorded speech samples, are sent via the internet to the raters
for rating. Unfortunately, the quality of the test audio samples is to some extent dependent on the hardware used to collect them: the headset/microphone used for recording and listening and the internal computer sound card. The audio samples from the six TOEFL speaking tasks rated by ETS raters are between 45 seconds and 2 1/2 minutes long. Likewise, the raters are limited by hardware specifications. Efforts are made to minimize or eliminate any reduction in sound quality.

Researchers at ETS have been exploring the possibility of using automated speech scoring. In a recent document, Xi (2008) reported on the development of SpeechRater v1.0, a speech recognizer (a main component of an automatic speech recognition system) used for automated rating of speech produced by students using TOEFL speaking practice software. To date, this is the first automated speech scoring instrument that has been developed or used by ETS. The Xi (2008) report was directed toward the need to establish scoring validity criteria before deploying the instrument in an actual test situation.

An established international language testing body such as ETS and its eminent researchers, as well as many foreign language researchers and language test specialists, are directing attention and resources to developing the measurement and evaluation of FL speaking and speech. Using raters to evaluate speech remains an important feature of the developments. However, automated scoring and the use of speech recognition to evaluate speech are, as reported by numerous researchers, relatively new and unexplored terrain.

Given the potential benefits to FL students and teachers in an informal instructional or testing context such as the classroom or language lab, how does FL speech rated by trained raters compare to an automatic speech recognition (ASR) score?
Is the score produced a face validity score for evaluation purposes? What does the ASR score indicate to students and how do students interpret the score? What form of feedback does the score represent for students? The current study proposes to investigate the scores produced by automatic speech recognition software by providing FL raters with the same recorded speech samples participants produced with the ASR software.

CALL Research Using Speech Technologies and ASR

Second language acquisition (SLA) and applied linguistics research have expanded greatly on foreign language teaching and learning theories and methods. Communicative competence in foreign language (FL) learning (Savignon, 1997), that is, authentic language use for communication, and the more recent use of computer-mediated communication (CMC) in foreign language teaching (Warschauer, 1996) have resulted in a more inclusive view of FL learning. Warschauer interpreted this view as one that included empowerment with a distinct focus on learner control and language production. Computer-assisted language learning (CALL) frameworks (Warschauer, 1997), multimedia software applications (Lafford, 2004), and developments in speech recognition technology (Eskenazi, 1999) have all contributed to enhanced FL communication, listening and speaking skills and student motivation (Hémard, 2004).

The concurrent and residual developments within SLA are those experienced in many other domains, where a lack of research, or an inability of researchers to proceed as quickly as the technological developments, has hindered our ability to conduct research that can inform methods and materials. Many areas of SLA research are of interest. However, as CALL is the domain of SLA researchers specifically interested in
technology and in using technology for teaching, it appears that the rapid technological developments most concretely affect CALL research. Several researchers, over the past ten years, have suggested using specific CALL products and materials for research (Chun, 1998; Levis, 2007; Xi, 2008), to inform teaching, to contribute to student learning and to design better CALL materials. Materials that have contributed to CALL development are multimedia for teaching and practicing pronunciation (Clear Speech, 1995, 2003; TeLL me More, 1995, 2003) and improving speaking skills (Streaming Speech, 2005). Other FL multimedia has incorporated video, audio and numerous learning activities incorporating all four skills.

Many of the technological features included in CALL materials are the domain of computer scientists and instructional technologists, as well as programmers and information technologists and trainers. It is this overlap of domains within CALL that is most problematic as regards rapid technological developments. CALL researchers need the technological expertise to interpret the developments, both academically and for the foreign language community of learners and teachers. As well, CALL researchers are called upon to create a bridge of understanding and communication with the other domain specialists. Commercial FL software developers need information regarding actual users of their software (Nielsen, 2006) to inform future iterations. However, research-based investigations are procedurally defined, process-oriented and time-consuming. The area entitled human-computer interaction (HCI) has grown from the need to understand how users interact with the computer, specific programs or software and to observe and research the interactions. As well, commercial products are confined by patents and other legally exclusive contracts that can make it difficult for researchers
to conduct investigations. Likewise, the internet, with its vast array of diverse resources, has produced another challenge for CALL researchers willing to provide teachers and students with quality and timely learning materials.

Speech technology, and specifically automatic speech recognition (ASR), was designed to recognize produced foreign language speech and is used in CALL for teaching foreign language intonation and pronunciation to second language learners. Pronunciation and intonation are features of FL speech and discourse that have gained renewed emphasis both in FL teaching and research. Chun (1998) outlined several reasons for the recent developments in speech recognition software for teaching pronunciation. For example, Chun (1998) cites the sophistication of the visual and digital audio technology for the production of speech sounds, more sophisticated models for intonation and speech processing, and contextualization of speech and learner performance feedback. Chun's (2002) recent work with discourse and CALL has centered on the shifts in communicative language technology and research that have resulted in a renewed focus on speech intonation as a fundamental part of discourse fluency and FL proficiency.

Chun (1998) noted the development of contextual, discourse-based FL multimedia software and suggested a need for research on the effectiveness of computer training materials (p. 79). In her work, Chun theorized that FL communicative methods and computer-mediated communication (CMC) can contribute to more discourse-based teaching of pronunciation and intonation. Increasingly, the view of intonation as an integral component of language communication (Chun, 1998) has contributed to CALL software designs developed to
include the practice of speech discourse beyond the sentence. Constructed oral dialogues used in CALL designs, as models for conversation and authentic speech, have enhanced the productive possibilities available to FL students. The development and addition of ASR to CALL designs has further enhanced the content and quality of performance feedback. Chun (1998) suggests more inclusive CALL research to assess the types and effective combinations of relationships between oral feedback and production. The essential question in any CALL evaluation (Chapelle, 2001), including speaking skills and speech production, centers on student usefulness and effectiveness for production of FL communicative discourse.

Levis and Pickering (2004) demonstrated how intonation functions in discourse using speech technology and discussed the implications for teaching elements of intonation. Brazil's (1992) model was used to categorize degree of engagement (1 = low, 5 = high), where each level was characterized by different tonal qualities. Levis and Pickering found that text-based readings were associated with higher levels of engagement than sentence-based readings and demonstrated how the context of discourse affects intonation.

Research on speech recognition software used to evaluate foreign language pronunciation and intonation demonstrates that students' production of communicative discourse can benefit from skills practice. McIntosh et al. (2003) used an online web-based product (WIMBA) to record FL student speech samples and oral activities. Students were evaluated by the teacher on discrete pronunciation features and spontaneous speech. Responses to the use of the tool were positive and students reported feeling more confident with FL language use.
Automatic speech recognition (ASR) includes production, practice and evaluation of FL speech. Student perceptions of speech production and ASR have contributed to evaluations of the effectiveness of speech recognition software for the improvement of FL intonation and pronunciation. Hincks (2003) evaluated several speech recognition software programs for student perceptions on feedback and pronunciation features. As ASR is designed, using underlying algorithms, to compare recognized speech to an underlying model, it is only an effective evaluation tool when produced speech is compared to the underlying model. ASR algorithms are effective for determining how much speech samples differ but cannot determine the ways speech samples differ from the models.

Feedback evaluation methods using ASR are still in development phases. However, Eskenazi's (1999) FLUENCY project at Carnegie Mellon University (CMU) and her pilot experiments have demonstrated the use of speech technology for effectively recording prosody, or pronunciation features of speech responsible for intonation and rhythm. Contextualized ASR for meaningful FL communication is one of the many practical recommendations, as the development of speech prosody is considered an important feature of communicative discourse.

Eskenazi's (1999) work highlights the importance of research on ASR and the practical applications for FL pronunciation, for CALL designs and for FL teaching and student performance. The development of ASR can significantly contribute to the FL curriculum and serve as an adjunct to the teacher's role of FL speaker model for discourse and conversation (Eskenazi, 1999).
CALL Evaluation

In an effort to determine the effectiveness of foreign language CALL materials, Chapelle (2001) developed appropriateness criteria considered fundamental for CALL design. Jamieson, Chapelle and Preiss (2004) demonstrated the use of the criteria in an evaluation of the design of Longman English Online. Judgmental evaluation was used to examine several instructional variables and results indicated that in an evaluation of interactive features, interaction between people and between people and the computer (p. 412) were good. However, the intrapersonal interaction, which includes learner-initiated choices, was weak. Can CALL that includes ASR for FL oral skill practice contribute to the intrapersonal interaction Jamieson et al. (2004) found weak in the English program? Warschauer (1996) concluded that student choice and empowerment were the definitive features required for effective foreign language learning in a computer-mediated communicative (CMC) environment.

CALL evaluation (Chapelle, 2001) includes, among other features, positive impact, or questions related to the tasks designed for CALL and student responses. Chapelle (2001) suggests that CALL must support evidence of positive impact through an examination of questions related to student learning and experiences. Research designed to qualitatively investigate FL learner observations and experiences, within the context of task use, is essential to an evaluation of the positive impact of CALL.

Hémard (2004) designed a CALL program for use with French language learners and evaluated the effectiveness of the audio and interactive features from the perspective of the user interface design and student experience. Using user-centered evaluation
methods such as questionnaires, user walk-throughs and focus groups, Hémard demonstrated student participation in the interactive learning experience. Hémard concluded that evaluating usability in the context of use is critical for generating new and effective materials for inclusion in CALL.

Plass (1998) created criteria for a domain-specific (FL reading) user interface evaluation and used a cognitive approach to design. Using CALL software, Plass applied his criteria to an evaluation of Cyberbuch/Ciberteca (Plass, 1998) and demonstrated how the model of cognitive design included a FL domain-specific, user-centered approach. Through an identification of the mental processes involved in SLA-related reading activities, Plass demonstrated how a cognitive design contributed to the effectiveness of multimedia tools that function as aids to cognitive processes involved in the development of particular linguistic and pragmatic competency (p. 45).

Handley and Hamel (2005) support evaluation of speech synthesis systems and believe research on FL student user satisfaction can contribute to the inclusion and use of speech synthesis in CALL. Building on the evaluation criteria developed by Chapelle (2001), Handley et al. (2005) evaluated several products in an effort to establish benchmarks, or essential requirements, for the use of speech synthesis as an aid to acquisition of spoken language. Using the criteria of comprehensibility, naturalness and accuracy (Handley et al.), as indications of whether speech recognition software effectively manipulated speech output, Handley concluded that different language functions require different elements of speech recognition. Consequently, some elements could be more useful for pronunciation development than for conversation. Handley et al. suggest that understanding various
speech features and language skill modes through user-centered evaluative criteria can help to determine the acceptability of computer training materials for CALL inclusion.

Smith and Gorsuch (2004) used a university usability lab (UL) and related technologies to examine synchronous CMC (SCMC) within English second language pairs. The capabilities for data collection provided by the UL allowed a comprehensive examination of audio, video, and computer interaction combined with student interaction and response. The interactive, communicative nature of speech was documented and analyzed. In audio recordings of screen captures, Smith et al. (2004) found evidence for verbalizations or private, barely audible self-speech that occurred among a few participants (p. 564). Smith et al. (2004) defined the findings as verbalized private speech, cross-modality uptake, or practice (p. 564). Smith suggests that future research on verbalizations, private speech samples, may provide insights into meaning-based tasks and into what participants have learned or are learning through an interaction (p. 572). Could ASR used in CALL software help researchers to evaluate the productive or oral nature of private speech? Did Smith et al. (2004) find examples in their study for the phenomenon Jamieson et al. (2004) described as intrapersonal or within-person interaction?

In a discussion of the tutor/tool model of CALL software, Hubbard and Siskin (2004) provide evidence for a return to tutorial CALL as an effective tool for learner-directed language learning. Central to the model is the issue of whether FL software is used as a teacher replacement or a tool for learning. Hubbard et al. (2004) convincingly document a case for convergence of the dichotomous model and no longer see the categories as mutually exclusive. A more inclusive view of tutorial CALL, and one that
includes learner-centered designs, includes one of the most current and active areas in tutorial software development, pronunciation (p. 459). Further, Hubbard describes the features of the multimedia models from the user-learner perspective and states these programs allow the user to record and compare, and can provide various forms of sound visualization including wave forms and pitch contours (p. 459). Hubbard believes that feedback from users and the results of informed research will be combined to enhance future CALL applications.

FL student perceptions of learning and performance are essential to an effective design and evaluation of software developed for learning. Barr et al. (2005) cite research from Burnage (2001) that supports students' recognition and appreciation of CALL software designed with FL pedagogical value. Barr (2005) found that students understood the limited nature of ASR for practice of conversational speech. The findings resulted in the conclusion that technology is perhaps best kept out of free conversation and integrated more into pronunciation drilling and the development of associated skills (p. 72).

Research studies have used CALL methods (Jamieson, Chapelle & Preiss, 2004) to design and investigate multimedia product effectiveness and learning. Technological developments and refined tools, such as speech recognition software, have contributed to more sophisticated methods for collecting data on FL speech production. Research using speech recognition software to provide native speaker examples, record and score student speech, provide feedback and collect FL produced speech could contribute to enhanced or improved student learning and performance.
Foreign Language Research and TeLL me More (TMM)

Auralog, a French software company, has been a forerunner in the development of speech recognition software. Several researchers (Handley, 2005; Hincks, 2003) have used TMM versions for various tests or evaluations. Several studies examined and evaluated CALL from a FL user perspective. Several researchers have used TMM foreign language software in studies either as content, activities or resources.

For example, Barr et al. (2005) conducted a pilot project at the University of Ulster (Ireland) using TeLL me More (version 5, Auralog). The oral activities were conducted and data collected using the speech recognition features available in the software. The researchers suggested that the design of the CALL software (TMM French) seemed to reflect a progression of practice stages beginning with pronunciation drilling and progressing to simulated interactive role-plays (p. 60).

The pilot project conducted by Barr et al. (2005) is a significant contribution to CALL and the fields of second language acquisition (SLA) and applied linguistics research. First, the pilot study was the first part of a longitudinal research project designed to examine the use of multimedia software for oral language learning and conversation over the course of several years. Initial conclusions reported that the multimedia lab was an effective venue for tutorial, practice and assessment phases of oral work. Barr et al. made a case for the use of technology for certain phases of oral skills teaching and learning (p. 61).

Secondly, in Barr et al.'s (2005) initial CALL experiments both quantitative and qualitative measures were used. The quantitative research focused specifically on an evaluation of oral language learning (French), with task and skill comparisons.
Qualitative analysis documented student support for learning using technology. Students made little reference to problems with learning or with technology use. The findings are inconclusive, but students reported positive impact (Chapelle, 2001) with the use of computer technology to enhance their oral skills (p. 72).

Finally, Barr et al. (2005) expect to expand, further refine and develop the study following a thorough consideration of the initial findings. The TMM software is adaptable and flexible enough to be used for specifically collecting FL speech data. In the future design, Barr et al. plan to include individualized learning paths as diagnostic tools and to allow for student differences in foreign language proficiency.

Hincks (2003) used Talk to Me: The Conversation Method (an Auralog ASR program, without the full course features of TMM) as an additional pronunciation resource for adult learners of English in a second language environment in Sweden. The program was found to be beneficial for students with strongly accented (intrusive) pronunciation as measured on a standardized oral assessment, PhonePass. Ordinate designed the assessment to be administered by telephone. The student responses were recorded and rated. Hincks used Chapelle's (2001) CALL criteria as a framework to evaluate the efficacy of the CALL tasks designed into the Talk to Me software. Further, Hincks (2003) evaluated the ASR and described several problems that are a result of the inherent limitations of ASR, one of which concerns the ability to recognize and reproduce spontaneous speech. However, the results indicated that ASR-based training was useful to beginning learners (p. 17).

Handley and Hamel (2005) designed a study to examine the possibility of developing a method for adequacy tests (benchmarks) for speech synthesis software used
in CALL. For this case study, a conversation corpus was designed and a simulated ASR-based dialogue from Talk to Me French was used for the 20 consecutive turns corpus (p. 109). Although the study did not specifically investigate features of the ASR, participants' ratings of accuracy and appropriateness of the corpus and simulations demonstrated that contextualization of ASR for oral activities could assist with an understanding of which roles and utterances are acceptable for inclusion in CALL applications (Handley et al.). The recent studies using automatic speech recognition, and specifically the ASR software used in TeLL me More, point to the sophisticated nature of the software and its perceived potential for improving FL learner oral performance.

The research literature points to a distinct interest in the development of computer-mediated tools and activities designed to assist FL learners at all levels of skill learning. Students have through time consistently expressed an interest in learning to speak the language and to use their language for useful purposes. Teachers have a responsibility to assist them by providing resources and materials that are directed toward the development of these skills.

Speech and pronunciation practice for communicative and authentic interaction is an important goal in a progression toward effective language use. Documented research has demonstrated that FL students are receptive to and empowered by the development of their ability to use the language, especially in the early stages of learning. Automatic speech recognition (ASR) can be used as a powerful tool to assist students with speech production and use, both within academic contexts and in the real and virtual world.

Through an examination of scores produced in a computer-mediated environment using ASR software, scores that are then given to students as feedback on their produced
speech, teachers can further meet the needs of their students. Teachers have the responsibility to provide information on the meaning of the scores, whether in a testing situation, online or in the classroom. Information on the meaning of the scores can assist teachers and researchers with the interpretation of the feedback and scores for the FL community as well as for materials developers.

Human raters have been used for many aspects of research and especially in rating FL speech, where their language skills and expertise have been and continue to be needed. When human raters rate FL speech, in the form of audio samples produced with ASR software, their insights can provide an unprecedented dimension to a previously unexplored aspect of their work.

By combining human rater perspectives, ASR scores and audio produced by FL students, as well as considering student perspectives, a more complete examination of the value and usefulness of the ASR software as a contributory factor in FL pronunciation and speech learning can be made. As CALL continues to incorporate and use features that are technologically inventive and responsive to the needs of users, ASR will continue to add to the resources and materials available to all participants in a multilingual and global community of learners.

In response to the need for an investigation into the value and interpretation of the ASR score feature, and by using expert human raters, this study was designed to challenge the limits of CALL research within the constraints of current technological practices. The difficulties of design and administration using ASR software to investigate French learner input and outcomes are detailed in order to assign value to a numerical representation and for future replication studies. Participant-learner feedback is presented as evidence for
consideration in future CALL evaluation and designs. Ideally, the study will serve as a model for CALL researchers, further our understanding of the usefulness of ASR for future designs and assist with an understanding of ASR within the FL domain. Importantly, it is hoped that the study will be interpreted as a call for attention from FL student-users, to FL teachers and researchers, to address their documented interest in learning to pronounce and 'speak the language'.
Chapter Three: Method

Introduction

Several pilot studies have examined FL student use of the Automatic Speech Recognition (ASR) software. Each study provided insights, observations and direction for the current method design. The pilot studies have significantly contributed to the final design of the study and actual testing of the instrument, as well as preliminary data collection and analysis. The pilot studies, the design features, considerations and conclusions that have directly influenced the methods for the current study are described, where necessary and appropriate, in the following sections. An account of the pilot studies can be found in Appendix A.

The use of Automatic Speech Recognition (ASR) software for FL learning has been reported in numerous studies and the efficacy of the ASR as an aid to FL speech and pronunciation has been reported and observed (Hincks, 2003; Levis & Pickering, 2004; Levis, 2007; O'Brien, 2006). However, many questions have remained unanswered and empirically unexplored. For example, what would the results of student performance using ASR reveal when compared to qualified FL raters using established performance measures? Importantly, how do students view ASR and what is their evaluation of its purpose and effectiveness for their foreign language practice and learning? Finally, what is the contribution of ASR to CALL research and to student learning and performance?
Research Questions

The following investigative questions were proposed and have been used as the research questions to conduct an analysis of the actual and potential pedagogical value of ASR for student FL practice, performance and learning:

1) When using the same ASR-produced sentences for human expert rating, how do human ratings compare to the ASR scores provided to students and how valid are the scores?
2) How useful did the students find the ASR features for assisting them with their French pronunciation practice?
3) How effective did French students perceive the ASR software to be for improving their mastery of French?

Design

A hybrid method, incorporating both quantitative and qualitative measures, was used (Tashakkori & Teddlie, 2003). A quantitative analysis (Hatch and Lazaraton, 1991; Mackey and Gass, 2005) using descriptive statistics and correlation was used to analyze the variability and relationship of the ASR scores and human ratings of recorded French sentences. Qualitative measures (Brown, 2001; Dörnyei, 2003; Patton, 2002; Strauss and Corbin, 2008) were used to analyze the qualitative data obtained from an online survey. The survey questions were designed to elicit responses from participants regarding their experiences using the ASR software to record and listen to French sentences.
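
As an illustration of the quantitative step described above, the following minimal sketch computes descriptive statistics and a correlation coefficient for paired ASR scores and human ratings. It is not the analysis procedure used in the study: the choice of Pearson's r, the function names `describe` and `pearson_r`, and all score values are assumptions made here for illustration only.

```python
from math import sqrt
from statistics import mean, stdev


def describe(scores):
    """Return basic descriptive statistics for a list of numeric scores."""
    return {
        "n": len(scores),
        "mean": mean(scores),
        "sd": stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two paired lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two paired lists with at least two values")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    if ssx == 0 or ssy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(ssx * ssy)


# Hypothetical paired scores for eight recordings (not data from the study).
asr_scores = [3, 4, 5, 6, 7, 5, 4, 6]
human_ratings = [3, 4, 4, 6, 7, 5, 5, 6]

print(describe(asr_scores))
print(describe(human_ratings))
print(round(pearson_r(asr_scores, human_ratings), 3))
```

Because both scales are short ordinal scales, a rank-based coefficient could be substituted for `pearson_r` without changing the rest of the sketch.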
Participants

Participation, using IRB-approved informed consent protocols, was sought from 30 undergraduate French language students. Students in the beginner levels (I and II), enrolled in or having completed one semester of French, were included in the sample. Participants attended an orientation to the ASR software at the foreign language computer lab prior to the data collection session. There are 15 computer stations with the software available at the FL lab. However, in an attempt to ensure an environment as free as possible of extraneous noise, approximately ten students participated per data collection session.

During the data collection all participants were randomly assigned a number and a password to access the pre-designed learning path of ten French sentences. The same participant number was used for access to the qualitative online survey administered at the end of the session. In this way, the individual participant sentence recordings and scores were cross-checked with online survey responses. Additionally, participant composite data, including scores and perceptions, are available and provide the basis for post-hoc individual case analysis.

Procedure

Using the ASR features of the multimedia foreign language software TeLL me More, the participants used an assigned password to enter the pre-designed learning path leading to ten French sentences for oral production and repetition. Logitech headsets with attached microphones were used by each participant for listening and recording. Following the oral production and recording, participants were given web access to enter a secured, online survey.
The survey was created online using the survey tool provided through the university academic computing department. Upon completing the survey, participants submitted the survey online for data collection. A sample of the survey is included in Appendix D.

Sentence Pronunciation and ASR Scores

Using a pre-designed learning path consisting of 10 French sentences (Appendix B) taken from the 540 sentence pronunciation corpus provided by the software, each participant orally produced and recorded each sentence at least one time. Each sentence repetition is ASR scored; however, the software was preset to save only one recorded repetition of each sentence as a separate audio file. The data collection session included ASR scores, one score for each of ten sentences for a total of ten ASR scores, and ten accompanying audio recordings of the same ASR scored sentences, per participant. The 30 participants' data sets resulted in ten ASR scores and ten audio recordings per participant, for a total of 300 ASR scores and 300 accompanying audio files.

The ten sentences selected from the 540 beginner level corpus were chosen for pronunciation features identified for beginning learners. Several sentences include the same pronunciation features, for example the r found in the word français. Several textbooks and workbooks were consulted for the purpose of aligning the sentences found in the software with comparable sentences for beginner level French learners. The ten French sentences were selectively chosen from several of the beginner units to reflect a generalized French language learning theme.
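
The size of the data set described above follows from simple counting: 30 participants times ten sentences gives 300 scored recordings. The sketch below enumerates the expected (participant, sentence) pairs and reports any that are missing from a collected set. The function names and the tuple representation are hypothetical and are not part of the software or the study's procedure.

```python
N_PARTICIPANTS = 30  # from the text
N_SENTENCES = 10     # from the text


def expected_items(n_participants=N_PARTICIPANTS, n_sentences=N_SENTENCES):
    """List every (participant, sentence) pair the design should produce."""
    return [
        (p, s)
        for p in range(1, n_participants + 1)
        for s in range(1, n_sentences + 1)
    ]


def missing_items(collected, n_participants=N_PARTICIPANTS,
                  n_sentences=N_SENTENCES):
    """Return the expected pairs that do not appear in `collected`."""
    expected = set(expected_items(n_participants, n_sentences))
    return sorted(expected - set(collected))


# Hypothetical check: pretend one recording (participant 7, sentence 3) is absent.
collected = [pair for pair in expected_items() if pair != (7, 3)]
print(len(expected_items()))
print(missing_items(collected))
```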
As the sentences were designed into the beginner level software corpus, it was necessary to design the study using sentences from this group. However, the following detailed selection process was designed to identify the ten sentences, from the 540 sentence group, that were subsequently used for the study.

1) Review all 540 sentences within the beginner level.
2) Identify sentences from specific units that locate the French r in several positions within sentences and words.
3) Choose 25-30 sentences that include the r in varying positions, and identify a generalized overarching theme.
4) Consult with both a French language and a linguistics expert for input on the first draft of 25 sentences.
5) Reduce the list to 10-15 sentences based on sentence examination and review of suggestions, and finalize a general theme. Review the final ten sentence choices and translations with experts.
6) Using the TeLL me More software Tutor Tools, set up and troubleshoot the pre-defined learning path of the ten chosen sentences for the study.

Hardison (2004) also used French sentences for her pronunciation study. However, the sentences used in the study had been previously recorded in a presentation by learners and were later extracted from the recorded student productions. Unlike the French sentences used in Hardison's (2004) seminal study, the ASR software contains corpora from which the sentences were chosen.
Digital Data

During the data collection session the sound files and ASR scores were saved in the software on the main computer at the lab. The main lab computer served as a server for the 15 computer stations that contain the software. Following the data collection sessions, all participant audio files were identified, copied, extracted and saved to CD-ROM. The participant numbers identified the sound files associated with the participant recordings. Because the audio files are saved in an open source format (.ogg files), a conversion program was used to batch convert the .ogg files to the more accessible MP3 format. Once converted, the files were coded according to participant and sentence recording (e.g., p1s1 = participant 1, sentence 1).

During an earlier pilot study (April 2008), the audio files collected were used to test the identification, copying, extraction and secure saving of audio files. During this pilot testing the coding system (p1s1, p1s2, etc.) was developed and the audio files were coded using this system. The open source files were also converted to MP3 format using an accessible conversion program. The conversion process digitally compresses the audio to the MP3 format so that the audio files can be played on an embedded Flash Player. The Flash Player was embedded into the design of the Rater Site so that the raters were able to quickly and conveniently listen to and rate the participant sentences. After the sound files were extracted from the lab server computer, converted to MP3 and coded, the files were saved to a web space, connected to the Rater Site and made accessible to the human raters for rating.

Technically, the process involves a small amount of HTML code written and input to the survey to direct each sentence to be rated to the corresponding audio file. The title
Rater Site is used as the name given to the online model the raters used for listening to and rating the ASR sentence recordings. Raters input their ratings of the individual audio recordings to the Rater Site using a mouse. An adapted numerical descriptor scale (3 novice to 7 advanced) was used for rating and a paper copy of the scale was available during the rating session. Raters used headsets to listen to the sentences and a mouse to input the ratings to the Rater Site.

The ASR scores are assigned one of seven numbers, low to high, 1 through 7. However, among the ASR scores assigned to the sentence samples in the pilot studies (120 audio samples), there was only one ASR score = 1 and only three ASR scores = 2. In these low-scoring cases it appeared that the audio files were poor recordings, not that the student pronunciation was poor or incomplete. For this reason the ASR scores of 1 and 2 were collapsed into 3 for raters. Several researchers (Luoma, 2004; McNamara, 1996; Weir, 2005) have suggested that raters are only able to reliably handle rating scales that contain four to six levels. Additionally, the adapted rating scale of 3 to 7 was found to be adequate for the purposes of this study.

Data collected included: participant identifier number, sentence identifier number, ASR scores per sentence and human rater scores for each sentence. All data collected was transferred to an Excel data file for analysis. In order to obtain ASR scores for the raters, the three expert French professor raters (two native speakers and one near-native speaker) produced the same ten sentences as the participants. The raters' ASR scores verified their expert status as raters, with an ASR score = 7 for 27 of the total = 30 ASR scores.
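The collapsing of ASR scores 1 and 2 into the lowest point of the adapted 3 to 7 scale, as described above, can be stated as a simple rule. The following is a minimal Python sketch of one reading of that rule; the function name `collapse_to_rater_scale` and the constants are assumptions introduced here for illustration and are not part of the ASR software or the Rater Site.

```python
# Illustrative sketch only: names and constants are hypothetical.
ASR_MIN, ASR_MAX = 1, 7   # ASR score range as described in this section
RATER_SCALE_MIN = 3       # lowest point on the adapted rater scale


def collapse_to_rater_scale(asr_score: int) -> int:
    """Map an ASR score (1-7) onto the adapted 3-7 scale.

    Scores of 1 and 2 are collapsed into 3; scores of 3-7 are unchanged.
    """
    if not ASR_MIN <= asr_score <= ASR_MAX:
        raise ValueError(f"ASR score out of range: {asr_score}")
    return max(asr_score, RATER_SCALE_MIN)


collapsed = [collapse_to_rater_scale(s) for s in range(ASR_MIN, ASR_MAX + 1)]
```

Under this reading, the seven ASR values reduce to five scale points, which falls within the four to six levels that the cited researchers suggest raters can handle reliably.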
Human Raters

Three French professor raters, including two native and one near-native speaker adults, listened to and rated the recorded sentence audio files for each participant, for a total = 300 ratings per rater. Two of the raters had rated during the pilot study (April 2008) and had been previously introduced to the rating procedures. A third rater, also a native speaker, offered to rate for the study.

Raters used a scale (Appendix C) adapted from both the earlier ILR (Interagency Language Roundtable) scale (Appendix C) and the ACTFL Guidelines: Speaking 1999 (Appendix C). Currently, the ACTFL Guidelines are a widely used and accessible foreign language descriptive scale for rating FL speaking. Seven levels of the Speaking 1999 guidelines were adapted and used for rating according to the following descriptive levels: Novice: no response or low = (1-2) 3, mid, high = 4; Intermediate: Low = 5, Mid, High = 6; and Advanced = 7. A copy of the adapted rating scale descriptors was available to the raters during the rating session. The 7-level descriptive rating scale was designed into the Rater Site using buttons, so that the raters needed simply to select and click on their numerical choice for each participant audio sample.

The raters were trained, accessed the sites, and rated the audio sentences using a laptop (Dell, Windows XP, 2006) and headphones, in a quiet, comfortable space in the university library. The researcher was present for the training and during rating to answer questions and to observe. The same process was used during an earlier pilot study and proved efficient and comfortable for the raters. In order to provide an intra-rater consistency check for each rater, raters were asked to re-rate ten random participant
sentences following the rating session. A random sample of sentences was pre-selected and put into a separate Rater Site entitled Re-Rate Sentences. The following random sequence of ten participant sentences (N = 10) was used to check intra-rater consistency, where P = participant and s# = sentence number: P16s10, P18s8, P23s6, P26s4, P29s2, P34s1, P37s3, P40s5, P43s7, P45s9.

Rater Site

The Rater Site was a model created for the purpose of ordering and labeling the sound files recorded by the participants. Further, the site design provided a simple, direct and consistent way for raters to listen to the sentence pronunciations and then input the ratings for each participant. The survey was designed with the university-based survey tool, was accessed online, and was designed for rater ease of use, consistency and efficient organization. In this way, the design, layout and organization of the Rater Site, as a model for rating FL speaking, assisted the raters with the concentration and rating tasks required of them for listening to and rating numerous pronunciations.

In a previous pilot study (April 2008) the Rater Site prototype was developed using actual sentence sound files obtained from beginner French students. The pilot study was designed for the sole purpose of field testing the sentence audio file tracking, copying, extraction and saving. Although the pilot sentences and audio files were not the focus of the current study, the sound files collected proved indispensable for developing and testing the Rater Site prototype.
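The participant and sentence coding system (e.g., P16s10) lends itself to a simple parsing routine. The following minimal Python sketch shows how the ten re-rate codes listed above might be split into participant and sentence numbers; the function name `parse_code` and the regular expression are assumptions introduced here for illustration and were not part of the study's procedures.

```python
import re

# Hypothetical pattern for codes such as "P16s10" or "p1s1".
CODE_PATTERN = re.compile(r"^[Pp](\d+)s(\d+)$")


def parse_code(code: str) -> tuple[int, int]:
    """Split a code like 'P16s10' into (participant, sentence) numbers."""
    match = CODE_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"Unrecognized code: {code!r}")
    return int(match.group(1)), int(match.group(2))


# The ten re-rate codes reported in the text.
RE_RATE_CODES = [
    "P16s10", "P18s8", "P23s6", "P26s4", "P29s2",
    "P34s1", "P37s3", "P40s5", "P43s7", "P45s9",
]

parsed = [parse_code(code) for code in RE_RATE_CODES]
sentences_covered = sorted(sentence for _, sentence in parsed)
participants_covered = [participant for participant, _ in parsed]
```

Parsed this way, the re-rate sample can be seen to include each of the ten sentences exactly once, each produced by a different participant.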
Rater Training

The rater training was conducted for approximately fifteen minutes, prior to the rating session. During the training raters were oriented to the Rater Site, the location of the sentence sound files, how to input ratings and use of the headset for listening. The content and descriptors of the rating scale were discussed and raters were presented a paper copy to read and review. Using a separately prepared rater site entitled Rater Training, the raters practiced using the scale to rate sentence pronunciation samples.

The Rater Training drew on an initial fourteen participants, each of whom produced all, or several, of the sentences and audio files and completed the online participant survey. In some cases, due to an initial problem with the software options settings for saving the audio files, only one sentence, sentence 10, was recorded. Thus, the Rater Training consisted of actual audio files for rating practice. To complete the Rater Training, raters were given a printed page with the participant/sentence codes and the ASR scores and were instructed to review, compare and write in their ratings. Ratings were subsequently reviewed with the researcher. Any questions regarding the ASR scores, sentence audio samples or the individual raters' concerns regarding their ratings were answered and discussed with the raters prior to the rating session. On average, the rater training lasted approximately 15 to 20 minutes.

Following the training the raters were instructed to enter the Rater Site, to rate the audio files, and to ask questions if necessary regarding any aspect of the Rater Site, audio files or rating scale. Raters had access to the rating scale for consultation during the rating period. After completing the rating, raters were asked to enter the Re-Rate Sentences site and re-rate ten participant samples. At the end of the session, raters were asked to answer
five questions relating to their rating experience. The entire time period for training, rating, re-rating and answering five questions was approximately an hour and a half.

The quality of the sound files produced for the pilot studies was not considered important. Nonetheless, it was determined that the quality and clarity of the audio recording was significantly reduced when students used a headset together with a separate, external microphone. The audio recorded with a headset (earphones with an attached microphone) is far superior to the external microphone set-up. External microphones available at the lab do not filter extraneous noise, which can interfere with the ASR recordings, an observation that highlights the critical reason for using headsets with attached microphones for the study. Because the microphone is connected to the sound card located on the computer, both the quality of the microphone and the quality of the sound card can affect the quality of the audio recording. The lab computers where the software was installed were Pentium IIIs (Windows 2000) with limited speed and storage, and consequently it was expected that the quality of the audio would be compromised. However, the use of headsets by the participants helped minimize these detrimental effects.

The procedures described above and the instruments tested in the pilot study (April 2008) were repeated and used for the current study. Specific modifications included: 1) fourteen separate participants for the Rater Training, including sentence sound files for practice rating; 2) thirty French student participants producing ten sentences, selected from the software corpus, for a total of 300 audio files for rating; 3) expanding and further developing the Rater Site to accommodate the larger number of participants and audio files; 4) collecting ASR scores for ten sentences from the three
raters; and 5) a pre-chosen random sample of ten participant sentences and creation of a Re-Rate Sentences site used for collecting intra-rater data for reliability calculations.

Participant Survey

Qualitative measures (Patton, 2002) included a descriptive online participant survey designed to elicit responses relative to: the ASR-recording experience; repetition and pronunciation of the sentences; ASR scores; listening to native models; use of the waveform and pitch curve visuals; and listening to self-produced speech. The university-based survey tool was used to design the survey. Participants accessed the survey using a link provided at the end of the sentence recording session. Each participant's randomly assigned number identified and linked the qualitative responses to the ASR scores and ratings for the same participant.

Several online surveys have been conducted (2005, 2006, 2007) to discover different facets of French student software users. The online survey used for this study (Appendix D) is an adaptation and modification of earlier piloted surveys. The qualitative survey questions were further developed for a distinct focus on the participant experience using the ASR software for recording French sentences. As well, the survey questions were designed to elicit the participants' perceptions of using ASR software for practicing their French pronunciation. The descriptive survey components were designed to: 1) collect information regarding student perceptions, and experience, of the ASR software features; 2) provide deeper insight into the participants' experience; and 3) investigate the ways in which the participant descriptions triangulate with the ASR scores they obtained for the sentence pronunciations.
Data Collection and Analysis

Statistical Measures

ASR Scores and Rater Data

The ASR scores were analyzed using descriptive statistics to summarize the basic characteristics of the scores based on the participant group. Variability in ASR scores was captured by calculating the standard deviation, and the typical ASR score was assessed by calculating a group average ASR score. Participant-level mean scores were calculated as well. The same statistical procedures were conducted for human rater scores. Next, correlation coefficients were used to examine the consistency of, or relationship between, ASR scores and scores recorded by human raters. All descriptive statistics were calculated using SAS v9.1.3.

Inter-Rater Reliability

Inter-rater reliability was calculated using SAS and was used to examine the consistency and reliability of the human ratings across raters for the recorded sentence samples. Correlation calculations were used to calculate the inter-rater consistency.

Intra-Rater Reliability

The ratings from the Rater Site (first scoring) and the Re-Rate Sentences ratings (second scoring) were checked for intra-rater, or within-rater, reliability. Due to the small
sample size (only ten scores were collected and compared), correlation coefficients were calculated but were not expected to be significant. The rater training, questions addressed during the training review, researcher observations and the rater responses to the five questions provide insights into the rating experience. These insights are discussed in the analysis, as well as under directions for future research, and may help to explain the first and second score differences.

ASR Score Validity

Depending on the outcomes of the rater scores and following a determination of inter-rater and intra-rater reliability, it may be possible to consider the ASR scores as valid, where it can be demonstrated that the ASR scores are consistent with the human raters (personal communication, White, 2008). A case for ASR score validity as a measure of pronunciation performance and practice will be discussed.

Qualitative Data

Participant Survey

The online survey completed by and collected from the participants at the end of the data collection session was saved in an electronic format for data organization. The online survey tool provides data export features and the ability to systematically organize survey data. Based on the topics and questions included in the survey (Appendix D), content coding of questionnaire responses (Brown, 2001; Dornyei, 2003; Strauss and
Corbin, 2008) and further data analysis (Miles and Huberman, 1994) proceeded as outlined below.

Survey Data Coding and Analysis

Specific attention was given to coding responses as related to content features investigated in the research questions. The survey questions addressed participant use and perceptions of the features related to the ASR software. Response formats provided for both closed and open-ended responses (Dornyei, 2003). The closed questions included participant data and demographics, naturally were more limiting, and in many cases did not allow for participant comments.

Initially, all descriptive survey responses were grouped and transferred verbatim to an electronic document (Brown, 2001). Responses were openly coded (Strauss and Corbin, 2008) by question, for emergent patterns and categories. Content themes, relative to individual student responses, were categorized as they relate to the ASR features. Codes were assigned to categories. Specific attention was given to axial coding (Strauss et al.) of responses as related to subsequent emergent themes investigated in the research questions.

Miles and Huberman (1994) outline three iterative and interactive processes for qualitative analysis: data reduction, display, and triangulation for verification and drawing conclusions or identifying patterns. For data reduction, coding was also used for confirmation of observed themes within and across survey responses. A data display matrix, in the form of an adapted Checklist Matrix, was used to systematically summarize
the clustered coded responses (Patton, 2002). The matrix included coded content, informative emergent patterns and response themes used for analysis (see Appendix E).

A readily available word count software tool and the word count feature in the qualitative software ATLAS.ti were used, in some cases, to cross-check word codes, categories and themes. Brown (2001) used a similar approach in a study where a survey was developed and used to investigate language researchers. Both high and low word counts were used successfully for cross-checking and analysis. The actual word count tool Brown used in his study is no longer available.

Miles and Huberman (1994) have suggested both within-case and cross-case displays, and these displays were applicable for describing and explaining individual participant survey responses as compared with the individual ASR scores. Additionally, where applicable, the human raters' ratings were included for comparison and explanation of individual survey responses.

Triangulation, or the use of multiple data collection methods, was used to cross-check and examine multiple data sources including participant ASR scores, perceptions, rater scores given for the same participant audio sentences, and observations. Discussions with other research colleagues, in the form of researcher bias checks, proved helpful and provided further insights for the analysis.
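The word count cross-check described above can be illustrated with a short Python sketch. This is not the word count tool or the ATLAS.ti feature used in the study; it is a hypothetical stand-in that tallies word frequencies across a list of open-ended responses so that high and low counts can be compared against hand-assigned codes. The function name `word_counts`, the tokenizing rule and the two sample responses (short answers of the kind collected for survey question 6) are assumptions chosen for illustration.

```python
import re
from collections import Counter


def word_counts(responses: list[str]) -> Counter:
    """Tally lowercase word frequencies across open-ended responses."""
    counts: Counter = Counter()
    for text in responses:
        # Treat runs of letters and apostrophes as words (an assumption).
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


sample_responses = [
    "It was very helpful and useful.",
    "It was very easy to use and I would like to use it again.",
]

counts = word_counts(sample_responses)
most_frequent = counts.most_common(3)  # candidates for cross-checking codes
```

A tally of this kind only counts surface word forms, so it can confirm or question a hand-assigned code but cannot replace the coding itself.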
Chapter Four: Results

Introduction

Several types of data were collected for the study in order to answer the research questions. As the study applied a mixed method, two distinct data analysis processes were applied. A quantitative analysis was conducted to answer Research question 1: When using the same automatic speech recognition (ASR) produced sentences for human expert rating, how do human ratings compare to the ASR scores provided to students and how valid are the scores? The audio sentence data was collected and both the ASR scores and the subsequent rater scores were analyzed using correlation. As well, the inter- and intra-rater reliability indices were calculated. The score results are discussed and address questions of validity, given that no validity evidence exists relative to the ASR-generated scores.

A qualitative analysis was conducted and a ten-question survey questionnaire was used to collect data regarding student perceptions for Research question 2: How useful did the students find the automatic speech recognition (ASR) features were for assisting them with their French pronunciation practice? Several survey questions addressed the utility of the software for assisting with pronunciation practice. Other survey questionnaire items were designed to elicit descriptive participant data that was collected for Research question 3: How effective did French students perceive the automatic speech
recognition (ASR) software to be for improving their mastery of French? Responses were evaluated to determine whether participants perceived that the software could be effective for improving their mastery of French.

In order to conduct the quantitative analysis of ASR and rater scores, the score collection and analysis organization process was conducted in three stages. First, the ASR scores for each participant (P) were extracted from the software and inserted into an Excel file. The Figure 1 screen shot is an example of the software interface view of the ten French sentence audio recordings and ASR scores. In this view, at the top of the screen, beside Student's audio recordings, the P25 indicates that this is participant 25. Under Recording and Sentence Pronunciation, a speaker icon indicates that these are sentence recordings. Beside each icon we see a French sentence, a date and time (called a date-time stamp) and, under Result, an ASR number score out of 7 total.

Figure 1. P25 Sentence 1-10 ASR Scores
Stage 2 involved extracting the rater scores for each sentence from the rater website where they were recorded and inserting the rater scores into the same Excel file. The Figure 2 screen shot illustrates the rater website and screen view of the rater scores. In this screen, the entry beside the numbers 1, 2 and 3, "(delete)" with a date and time, identifies the rater (to the researcher only). The speaker icon with the right-pointed arrow indicates that the audio could be played in this view (by the researcher). The codes p16s1, p16s2 and p16s3 indicate that these are rater scores for participant 16, sentence 1 (s1), s2 and s3. For example, rater 1 assigned p16s1 = 5 (= score), p16s2 = 5, and p16s3 = 6.

Figure 2. Three Rater Scores for P16 Sentence 1 (s1), s2 and s3

In stage 3 the ASR scores and rater scores for each participant were organized and input to an Excel spreadsheet for quantitative analysis. Figure 3 illustrates the Excel spreadsheet organization, where id indicates the participant; score data for participants 16 through 25 is displayed, with scores reported for sentences 1 through 3 for asr and raters a, b and c. For example, P16 s1asr = 7, s1ra = 6, s1rb = 5 and s1rc = 5.
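The spreadsheet organization described above, one row per participant with an ASR column and three rater columns for each sentence, can be sketched in Python. The column names follow the labels used in the text (id, s1asr, s1ra, s1rb, s1rc); the helper names `column_names` and `make_row` are assumptions introduced here for illustration and do not describe how the Excel file was actually built.

```python
RATERS = ("ra", "rb", "rc")  # rater a, rater b, rater c


def column_names(n_sentences: int) -> list[str]:
    """Build the wide-format header: id, s1asr, s1ra, s1rb, s1rc, s2asr, ..."""
    names = ["id"]
    for s in range(1, n_sentences + 1):
        names.append(f"s{s}asr")
        names.extend(f"s{s}{rater}" for rater in RATERS)
    return names


def make_row(participant_id: int, scores: list[tuple[int, int, int, int]]) -> dict:
    """Pair one participant's (asr, ra, rb, rc) scores with the column names.

    `scores` holds one 4-tuple per sentence, in sentence order.
    """
    values = [participant_id]
    for sentence_scores in scores:
        values.extend(sentence_scores)
    return dict(zip(column_names(len(scores)), values))


# The example given in the text: P16, sentence 1.
row = make_row(16, [(7, 6, 5, 5)])
```

With three sentences the header has 13 columns, and with all ten sentences it would have 41; keeping one participant per row makes it straightforward to link each row to the same participant's survey responses.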
Figure 3. Organization of ASR and Rater Scores for Analysis

id  s1asr  s1ra  s1rb  s1rc  s2asr  s2ra  s2rb  s2rc  s3asr  s3ra  s3rb  s3rc
16  7      6     5     5     6      4     5     5     4      5     6     5
17  7      6     6     6     6      5     6     6     6      6     6     5
18  6      7     7     7     7      6     6     7     6      7     7     6
19  7      6     6     7     7      5     6     7     5      5     5     6
20  6      5     5     7     5      5     6     7     5      3     4     6
21  5      6     5     5     6      6     5     6     5      6     5     6
22  7      6     6     6     6      5     5     5     7      4     5     5
23  7      4     4     6     6      4     4     7     4      4     4     7
24  5      4     5     7     5      3     6     7     4      5     5     7
25  5      6     5     7     4      5     5     5     3      4     3     6

In each of the three stages used for extracting, organizing and then analyzing the quantitative score data, crosschecking of scores was done at least three times by the researcher, with assistance from two colleagues.

For the qualitative analysis, the survey questionnaire was organized by participant survey to begin initial coding. Each participant survey was saved and printed. The survey tool used to create the questionnaire allowed for a print-out of responses, according to survey question. The training participant surveys (P1-P14) were reviewed and initial codes were assigned. All surveys, that is, the 14 surveys collected from the training participants and the thirty study participant surveys (= 44 surveys), were collected, reviewed and coded.

Subsequently, a separate electronic Excel file was created for each survey question, where only the thirty study participant responses were organized and collated. Codes and categories were assigned and final analysis was conducted based on a question-by-question organization of survey response data. Figure 4 is an illustration of the electronic file created by collating the thirty participant responses for survey question 6.
As illustrated in Figure 4, the initial survey questions that included biographical data such as participant number, gender, French level and years of study were summarized under each question in order to identify each set of descriptive data by participant.

The bio data collected in the first two survey questions define the participant sample and indicate (summarized in Table 5) that the thirty participants were predominantly female (male = 8), beginning French students who had been studying for a year or less at the university level. Survey questions 3 and 4 requested experiential information related to using computers and new software, specifically speech recognition software (summarized in Table 6). The reported experiences revealed that participants were comfortable using computers; however, they had not previously used automatic speech recognition (ASR) software.
Figure 4. Collated Participant Responses for Survey Question Six

P | M/F | Fr I-II | Sem-Yr | Q6 Feelings about experience recording Fs with ASR today?
16 | m | I | 1-2 s | I liked the software very much. I hope to use the software to help me enunciate the French words properly. The time frame for each sentence seemed so short for me, but that helped me understand the proper 'flow' of each sentence.
17 | f | I | >1s | I enjoyed it because it allowed me to listen to a native speacker and right after try to pronounce the same words on my own. I believe if I worked on it longer, my pronunciation woukd shurly improve.
18 | m | I | 1-2 yr | I though my pronunciaion was awful.
19 | f | II | 1-2 s | Frustrated!!! but I like that you can go back and repeat it until you get it right I felt that the sentences were a little hard for me. they spoke so fast and I felt like my tounge couldn't keep up with theirs.
20 | f | II | 1-2 yr | It was interesting, i had a hard time following what the people where saying when they read aloud the sentences. I found it really fast.
21 | m | I | 1-2s | Software was very good and actually being able to see the sound waves helped tremendously. I could hear the "ups and downs" of the language, what had more emphasis, and what was not really pronounced. Great help!
22 | f | I | >1s | I like the software. It really helped having someone say the sentence first and then repeating it. I also liked how it instantly showed your progress to the side. It helped me to compare how my tries were going. I really liked this software. It was easy to use.
23 | f | other | >1s | It was interesting. I was kind of neverous because I have not had french for many years and I know that my pronunciation is not the best. It was weird to see that my words were so off from that of a native speaker. However, I think that this software would benifit me in the furture.
24 | f | other | <2yr | It was neat to be able to hear how I was saying things.
25 | f | other | >1s | I wish I could do this more often, it helps to learn the correct pronunciations, the way the pitch of our voice is recorded and the correct way is recorded above it is very helpful and a really cool idea!!
26 | m | I | 1-2yr | It was very helpful and useful.
27 | m | other | >1s | Not very confortable sometimes
28 | f | other | 1-2yr | I was nervous since other people were in the room and could hear me, but I thought it was fun and helped me alot
29 | f | I | 1-2s | I felt like it was very neat and a good learning tool. A few of the sentences moved a little fast though so I felt like I needed to speak just as fast which made my words slur together some.
30 | f | I | 1-2s | I enjoyed using the software but I felt that the time to get accustomed to the sentece was very short.
31 | f | I | 1-2yr | I enjoyed. I was a little intimidated at first but I eventually got past that once everyone else started talking.
32 | m | I | 1-2yr | nervious at first, then got used to it.
33 | m | I | 1-2s | Oddly, I did feel slightly uncomftorable using this because I realized just how bad my pronounciation is. I could see this being beneficial and enabling me to be more comftorable speaking with native french speakers. At some points, when the words were unfamiliar to me, I felt the recordings didn't allow enough time for me to get the full recording in. My mouth just didnt want to move that fast.
34 | f | other | <2yr | I didn't know how loud I needed to speak in order for my voice to be understood, so I made the assumption it would only need to be slightly over a whisper so that I would not be heard by the few other individuals in the room. Once I realized it was not working very well, and I started to get more comfortable, I was engaged and excited to speak as loud as needed without considering others around me. This session was very useful because I realized a lot of what is holding me back is my intimidation, but that it is possible for me to speak.I need to realize I read much more than I speak, thus my speaking is naturally going to suffer. There were a few words within the sentences that I knew the meaning to but had never read aloud. When I read it in my head I thought it would have been pronouced in a certain way, but was proven wrong by the native speaker.
35 | f | II | 1-2yr | it was a good experience.. i was impressed with the software
36 | f | I | 1-2s | I thought it was fun and it really helped my with pronunciation.
37 | f | II | <2yr | It was interesting, I wish I had more time to practice with it more.
38 | f | other | >1 | It was easy and simple enough. Since we were using headphones, I forgot that there were others who might hear me, so I wasn't too anxious.
39 | f | II | 1-2s | I feel that the software has done its purpose but it was a bit too fast for me.
40 | f | II | 1-2s | I was a little frustrated with how poorly I was doing but eager to experiment with ways the software could potentially help my French pronunciation.
41 | m | II | <2yr | see survey Q6 comments
42 | f | I | 1-2s | It was alright, it will be better to get feedback and learn what I need to work on in my pronunciation. The red highlighting helped as I was going through becuase I knew what I needed to focus on.
43 | f | I | 1-2yr | It was a new experience that I have not done before. I enjoyed being able to record my voice and understand how different I sound compared to a real french speaker. I am excited to further my education in the subject and work on my sentence pronunciation.
44 | f | I | 1-2s | It was very easy to use and I would like to use it again.
45 | f | II | 1-2s | easy to understand
The organization of survey questionnaire data was critical for the iterative hand coding of categories and themes. It is important to note that the data coding, categorizing, journaling, researcher reflections, frequency counts and word counts were done by hand. Both a freely available word count tool (Bonkenc) and ATLAS.ti were used experimentally with a few survey questions to cross-check word counts with assigned codes and categories. However, using the word count tools proved redundant and time-consuming, particularly once the majority of the coding and analysis had been done by hand.

Organizing the data, both ASR scores and qualitative survey data, was a complex process because the different types of data needed to be carefully extracted from various electronic sources. Using Excel files to organize both types of data, that is, score data and survey data, provided a consistent and complete score and survey data set for each participant. Setting up the data and creating a database in this way provided for individual case observation and analysis, future combined case analysis and a myriad of other post hoc studies.

Analysis

Research Question 1: When using the same ASR-produced sentences for human expert rating, how do human ratings compare to the ASR scores provided to students and how valid are the scores?

To answer the first part of the question, that is, how do human ratings compare to the ASR scores provided to students, the ASR scores produced by the ASR software for the participants' sentence recordings and the rater-assigned scores for the same recordings
were subjected to a number of statistical tests. Descriptive statistics were used to examine the ASR score and rater a (ra), rater b (rb) and rater c (rc) score distributions. Correlation analysis was used to determine the relationship between ASR scores and rater scores. Inter- and intra-rater reliability was also calculated using correlation.

In order to examine the second part of the research question, how valid are the scores, and following from the statistical analysis of the ASR scores and rater scores, the validity of the ASR scores is discussed. As there has been no known analysis of the ASR scores produced by the software prior to this study, any validity statements are made solely on the basis of the findings presented.

Quantitative Analysis: ASR Scores and Rater Scores

Descriptive Statistics

For the purposes of descriptive data analysis, each set of scores (ASR scores for 30 participants = 300 ASR scores, and three sets of rater scores: ra = 300, rb = 300 and rc = 300) is considered a separate variable, resulting in four variables: ASR, ra, rb, rc. SAS was used to examine the distributions of each variable, individually, prior to examining the relationship between the variables. The descriptive statistics for the four variables can be found in Table 1.
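The kinds of calculations named above (mean, standard deviation and a correlation coefficient between two sets of scores) can be sketched in a few lines of Python. This is an illustrative sketch only: the study's statistics were calculated in SAS, the function names are assumptions introduced here, the Pearson product-moment coefficient is assumed as the correlation measure, and the two short score lists are placeholder values rather than study results.

```python
import math


def mean(xs: list[float]) -> float:
    """Arithmetic mean of a set of scores."""
    return sum(xs) / len(xs)


def sample_sd(xs: list[float]) -> float:
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sets of scores."""
    if len(xs) != len(ys):
        raise ValueError("Score lists must have the same length")
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return numerator / denominator


# Placeholder scores for illustration only (not study data).
asr_scores = [7, 5, 6, 4, 3]
rater_scores = [6, 5, 6, 4, 4]

summary = {
    "asr_mean": mean(asr_scores),
    "asr_sd": sample_sd(asr_scores),
    "asr_rater_r": pearson_r(asr_scores, rater_scores),
}
```

The same `pearson_r` function could be applied to two raters' scores for an inter-rater check, or to one rater's first and second scorings for an intra-rater check.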
Table 1
Descriptive Statistics for ASR, ra, rb and rc Scores

Statistics   ASR scores   ra scores   rb scores   rc scores
N            300          296         299         298
Mean         4.45         4.46        4.47        5.74
SD           1.44         1.14        1.09        1.95
Skewness     -0.13        -0.003      -0.14       -1.46
Kurtosis     -0.38        -0.25       0.43        3.52
Range        7.0          6.0         6.0         6.0

N explained. The slight reduction in N from the total of 300 scores (ra: 300 - 296 = 4; rb: 300 - 299 = 1; rc: 300 - 298 = 2) is the result of several audio files not opening during the raters' scoring period. Although all files were checked, double-checked and confirmed to be working audio files, it appeared that the files could not be opened for technical reasons associated with the wireless network system. Missing data are recorded as follows: P19s4ra, s4rb, s4rc; P25s9ra; P37s3rc; P44s8ra, s9ra (N = 7 missing scores).
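The missing-score codes follow a compact pattern (participant, sentence, rater, with the participant carried forward inside a semicolon-separated group). As a rough sketch of how the reduced N per rater follows from those codes, the Python below parses the list as printed and subtracts the missing counts from 300. The function names and the regular expression are assumptions made for illustration, not part of the study's procedure.

```python
import re
from collections import Counter

# Missing-score codes exactly as listed in the text.
MISSING = "P19s4ra, s4rb, s4rc; P25s9ra; P37s3rc; P44s8ra, s9ra"

def parse_missing(text):
    """Return (participant, sentence, rater) tuples from the coded list."""
    records = []
    for group in text.split(";"):
        participant = None
        for token in group.split(","):
            match = re.fullmatch(r"(P\d+)?s(\d+)(ra|rb|rc)", token.strip())
            if match is None:
                raise ValueError(f"unrecognized code: {token!r}")
            if match.group(1):
                # A new participant ID applies to the rest of the group.
                participant = match.group(1)
            records.append((participant, int(match.group(2)), match.group(3)))
    return records

def n_per_rater(records, total=300):
    """Subtract each rater's missing scores from the total possible."""
    missing = Counter(rater for _, _, rater in records)
    return {rater: total - missing[rater] for rater in ("ra", "rb", "rc")}

records = parse_missing(MISSING)
print(len(records))          # number of missing scores
print(n_per_rater(records))  # remaining N for each rater
```

The counts obtained this way agree with the N row of Table 1 (ra = 296, rb = 299, rc = 298) and with the stated total of 7 missing scores.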
Mean (M) and extreme scores. An average score for each set of scores is commonly reported in Second Language (SL) studies (Mackey and Gass, 2005) and describes the typical score of the set. As an average, the M can be misinterpreted if extreme scores are present. There were a small number of low, extreme ASR scores (ASR score = 0 or 1), reported for a few specific audio files that had been inadequately recorded due to the poor quality of the specific computer sound card. The extreme ASR scores are listed by participant and sentence as follows: P19s4asr = 0; P20s9asr = 1; P27s2asr = 1, s4asr = 1, s8asr = 1; P31s4asr = 1. In one case, the rater scores were also low as a result: P27s2asr = 1, s2rc = 1; s4asr = 1, s4rb = 1 and rc = 1. In a few cases, although the ASR score was > 3, rater scores were reported as extreme. The participant, sentence and ASR score, as well as the extreme rater scores, are as follows: P25s6rc = 1; P34s2asr = 5, ra, rb and rc = 1; s6asr = 6, ra and rc = 1; P35s6asr = 5, rc = 1. Although statistically these scores are considered extreme, they were not removed from the analysis. A total of 15 extreme scores are reported.

It is important to note, however, that the extreme scores in these cases were the result of several factors. In some cases the ASR score was a direct result of the poor recording and functioning of the individual computer station. Sentence recordings were on average less than 3 seconds in duration. Any delay, hesitation or interruption, whether created by the computer, the microphone or the participant interacting with the features of the ASR software, was difficult to identify. The audio sentence recordings were contributed as part of the participant data set of ten sentences, where each sentence
resulted in an ASR score and a separate audio file. In order to maintain the integrity of the participant data set for rating by the raters, no extreme scores were removed.

On several occasions during the rater scoring sessions, an audio file would not play. The only explanation was that there was a temporary interruption in the wireless network. All files were checked, re-checked and accessible. Also, the temporarily unavailable audio file was different for each rater, except for P19s4, where audio data was missing entirely (P19 may not have recorded, or may have skipped or missed, sentence 4). Again, given the unexplained variability relating to the low, extreme scores, it was decided to include all scores as reported and collected.

ASR and rater score distributions. The means (M) for the ASR, ra and rb scores are close. However, M = 5.74 for rc is considerably higher. Rater c also shows the largest standard deviation, S = 1.95, and the strongest skewness, sk = -1.46. Clearly, the score distribution for rater c is non-normal and indicates that rc, on average, rated the participant sentences higher than ra or rb. On the rating scale (1 = low to 7 = high), rc rated participant sentences on average 1.29 points higher than the ASR, and 1.28 and 1.27 points higher than ra and rb respectively.

The kurtosis value for rater c, k = 3.52, indicates a peaked, very leptokurtic, non-normal distribution. The highly peaked distribution indicates rater c scored within a very narrow point range and significantly violates the underlying normality assumption. It is questionable whether rater c scores are reliable measures.
Important considerations: rater c. Rater c was the third rater, a native speaker (NS) who had not participated in the pilot study, whereas both rater a and rater b had rated previously (April 2008). Rater c received the same rater training as the other raters; however, it is uncertain whether having rated previously gave rater a and rater b an advantage. It is possible that ra and rb were more experienced raters in the ASR context. Clearly, ra and rb had the experience of practice rating several months earlier (although they did not rate the same participant audio).

Variability in the scores is most reliably indexed by the Standard Deviation (S). However, it is important to examine the S and the means in relation to each other in order to determine how the scores are spread. A small S in relation to the mean generally demonstrates that the mean has captured the scoring behavior of the ASR and raters. For example, rb M = 4.47, where S = 1.09, demonstrates a proportionally small spread. The rater c distribution showed a larger spread, S = 1.95.

The shape of the distribution of each set of scores, as indexed by the skewness and kurtosis for each variable, is an important contribution toward an understanding of the degree of normality displayed by the distribution. The skewness values indicate that the ASR, ra and rb distributions are slightly negatively skewed, where the scores are concentrated at the higher end of the rating scale. Rater c (sk = -1.469) indicates a high negative skewness value.
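One simple way to express S "in relation to the mean" is the ratio S / M (the coefficient of variation). The sketch below applies that ratio to the means and standard deviations reported in Table 1. The choice of this particular ratio is an assumption made for illustration; the text does not state that it was computed in the study.

```python
# Means and standard deviations as reported in Table 1.
TABLE_1 = {
    "ASR": {"mean": 4.45, "sd": 1.44},
    "ra": {"mean": 4.46, "sd": 1.14},
    "rb": {"mean": 4.47, "sd": 1.09},
    "rc": {"mean": 5.74, "sd": 1.95},
}

def relative_spread(mean, sd):
    """Standard deviation as a proportion of the mean (coefficient of variation)."""
    return sd / mean

for name, stats in TABLE_1.items():
    ratio = relative_spread(stats["mean"], stats["sd"])
    print(f"{name}: S / M = {ratio:.3f}")
```

On this measure rb has the proportionally smallest spread and rc the largest, although the ratio for the ASR scores is close to that of rc.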
The range of scores was based on the ASR scores assigned and on the rubric developed and used by the raters to evaluate the sentences, which was designed on a rating scale of 1 = low through 7 = high. The ASR scores reported a range of 7. The range for raters a, b and c was 6.

In a discussion of statistical analyses and language tests, Brown (2001) suggests that violations of the normality assumption are not as problematic for criterion-referenced measures. Rater c's skewed distribution might indicate, for example, that participants pronounced the sentences successfully. Also, Brown (2001) suggests reporting both the mean and the median for skewed distributions, because a few skewed scores (representing a few participant sentences) can significantly affect the mean but have less effect on the median (the middle score, with 50% of scores above and 50% below, or the midpoint between the two middle scores). Violations of the assumption of normality that underlies correlation are obviously problematic for rater c in particular.

A comparison of the variable distributions, that is, ASR scores and ra, rb and rc scores, was used as the basis for the correlation and other measures. Bachman (2005) notes that a common misunderstanding is to interpret the means of score groups as suggesting a relationship between the scores, and that researchers often interpret such findings incorrectly. The description of the distributions is necessary for a correlation analysis, and statements about the strength and degree of relationships can only be made subsequent to conducting a correlation.

Correlation Analysis

In correlation research, attempts are made to determine the relationship both between and among variables. Usually, it is practical to use scatter plots in order to
graphically observe the degree of the relationship between the variables. Scatterplots of the ASR scores with each rater's scores were not visually helpful, as no relationships were apparent. However, correlations were calculated for the four variables (ASR, ra, rb and rc scores) in order to examine the relationship between ASR and rater scores. The correlation coefficients for the ASR scores and rater a, b and c scores are presented in Table 2.

Relationship between ASR scores and rater scores. Results report positive, statistically significant, yet low-to-moderate relationships between variables.

Rater a (ra) is a non-native, female French literature professor responsible for French teaching assistants (TAs) and graduate students. Rater a had been oriented to rating ASR-produced audio during the pilot and, although not entirely comfortable with technology, felt comfortable and unhindered by the rating required for the study. As a non-native speaker, rater a has a command of French pronunciation and considers pronunciation enhancement and practice a necessary element of teaching and learning for non-native TAs and her students.

The relationship between ASR scores and rater b (rb) scores is the strongest, where r = 0.452. Rater b had scored ASR-produced audio during the pilot and had been previously exposed to the scoring procedures and rating scales, which lends some support for the stronger relationship observed with the ASR scores.
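The coefficient behind a correlation analysis of this kind is Pearson's r. The Python sketch below computes r for two score lists and skips any pair in which either score is missing, since a few rater scores could not be recorded. Pairwise deletion is an assumption here (the text does not say how SAS was configured), the function name is hypothetical and the example scores are invented.

```python
import math

def pearson_r(x, y):
    """Pearson correlation using only pairs where both scores are present.

    Returns (r, n), where n is the number of complete pairs.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    # Sums of cross-products and squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy), n

# Invented scores on the 1-7 scale; None marks a score that was not recorded.
asr_scores = [4, 5, 3, 6, 2, 5]
rater_scores = [4, 4, None, 6, 3, 5]
r, n = pearson_r(asr_scores, rater_scores)
print(f"r = {r:.2f} based on n = {n} complete pairs")
```

Running the function once per pair of variables (ASR with each rater, and each rater with the others) fills a correlation matrix of the kind shown in Table 2.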
Table 2
Correlation Coefficients: ASR and Raters a, b and c Scores

Correlations   ASR                  Rater a              Rater b              Rater c
ASR            r = .99, p < .0001   r = .36, p < .0001   r = .45, p < .2027   r = .30, p < .0001
Rater a        r = .36, p < .0001   r = .42, p < .2260   r = .57, p < .0001   r = .40, p < .0001
Rater b        r = .45, p < .2027   r = .57, p < .0001   r = .71, p < .0193   r = .49, p < .0001
Rater c        r = .30, p < .0001   r = .40, p < .0001   r = .49, p < .0001   r = .46, p < .0001

Rater b (rb) is a native French speaker with ten years' teaching experience in the U.S. and has worked with French second language learners and homogeneous groups of
American English university classes. As well, rb has worked with university-trained French student-teachers in local American high schools. Another factor that could have contributed to the relationship found between ASR and rb scores is the fact that, of the three raters, rb was the most highly trained technology user. Rater b was also the most comfortable with technology and had even recently taught a graduate-level FL technology course.

Rater c (rc) was also a native speaker, a younger female graduate student who had limited experience teaching in the L2 environment. Rater c had also not rated ASR-produced audio during the pilot and questioned the rating scale prior to rating for the study. All three raters were aware that they were using a scale to rate French sentences, student-produced audio that had been previously scored by the ASR. Raters did not have access to the ASR-produced scores, except during the training using separate participants (P1-P14).

Rater reliability. Critical to the analysis of ASR scores is the fact that, to date, a relationship between ASR scores and any other type of FL scoring has not been established. In this investigation ASR scores were examined for French sentence pronunciations. Raters a, b and c scored exactly the same participants and their French sentence audio pronunciation files. In order to establish rater reliability, an examination of consistency between raters (inter-rater) and within individual raters (intra-rater) was necessary for raters a, b and c.

Inter-rater reliability. Using correlations as indicators of the degree of consistency between the three raters (rater a = ra, rater b = rb and rater c = rc), Table 2
indicates the correlation coefficients calculated. The total N = 300 French sentence pronunciations, that is, 30 participants x 10 sentences = 300 audio files for rating by each rater. The correlations between raters a, b and c are listed in Table 2.

All of the correlations between raters were positive, and the relationship between rater a and rater b was the strongest, where r = .57. Both ra and rb had rated French audio files in the pilot study. It is reasonable to assume that their previous rating experience may explain the observation that their scoring patterns, although only moderately related, were the most strongly related of the three pairs.

Previous experience rating similar ASR audio (ra and rb, r = .57) was associated with a higher correlation between raters than was being a native French speaker (rb and rc, r = .49). Based on the correlation coefficient between ASR scores and rb scores, r = .45, rb correlated more highly with the ASR than the two raters ra and rc correlated with each other. Rater c and rater a had the weakest relationship, where r = .40. It is important to note that these two raters were the most dissimilar. Rater c had not rated before and was a native French speaker, while rater a was a non-native speaker and academic who had previously rated.

Intra-rater reliability. In order to check the consistency of each individual rater, that is, the consistency with which a rater scored the same sentence pronunciations from the initial rating period (time = 1) to a later time period (time = 2), ten random participant sentences were selected for re-rating.
The N = 10 is an insufficient number for reliable statistical coefficient measures. However, it is clear from the rater b results for time 1 and time 2, r = .718, that rater b rated the most reliably when compared to ra and rc. The intra-rater reliability for rb, even given a small N, lends support for rater b's training, experience and expertise when rating French audio sentences in the context delineated in the study.

The ten random sentences were accessible through the Re-Rate Site and were rescored following the initial rating session. Note from Table 3 that rater c initially had problems opening one audio file for Time = 1 and thus N = 9. However, during the rescore, when the rater entered the Re-Rate Site, the file could be opened and rescored. Again, the explanation for this discrepancy is that there was interference from the wireless network in the library during the scoring time period. Interestingly, this was not the case for raters a and b, who both scored several days prior to rater c, when all of the files could be accessed. Thus for ra and rb, N = 10 consistently.

The small N = 10 for the intra-rater reliability correlations is a drawback to actually finding consistency within individual raters. There were several reasons such a small number of audio files was chosen and input to the Re-Rate Site. First, raters were asked to attend a pre-scoring training session. During the training session, raters scored approximately 14 training participants. Several training participants included all ten sentences, and some contained only one sentence, for a total of 80 training audio files. The rater training participant set (P1 through P14) was not part of the data set of 30 participants (consecutively P16 through P45).

Second, the rater training included a review of the scores the individual rater assigned to the training participants and a comparison to the ASR scores. Questions the
raters had regarding individual training participants, sentences, audio files or scoring were discussed during the training. The training process lasted approximately a half hour. Once the training scores had been reviewed with the rater by the researcher, the raters entered the Rater Site and began rating the 30 participants (300 audio sentences). Each individual rating period lasted approximately 1-1 hours.

Third, once the rating period ended, the raters were asked to enter the Re-Rate Site and to re-rate ten sentences. By this time the raters had been rating for a considerable time. As well, the ten sentences were repetitive and the raters had begun to experience an overload. Cognitive overload is common in foreign language speaking rating and results from several factors, both individual and situational: rater concentration and fatigue, physical discomfort, and repetitive factors. Clearly, based on researcher observation, the raters had become fatigued at this point.

The ten participant sentences were re-rated by raters a, b and c and correlation coefficients were calculated. The results for time 1 and time 2 for each of ra, rb and rc are presented in Table 3.
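With so few re-rated sentences, it helps to see how a time 1 to time 2 correlation relates to its significance test. For a Pearson r based on n pairs, the usual test statistic is t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The sketch below applies this standard formula to the intra-rater coefficients reported for the three raters. Whether SAS used exactly this test is an assumption, and the function name is hypothetical.

```python
import math

def t_from_r(r, n):
    """t statistic for testing a Pearson r against zero; df = n - 2."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Intra-rater coefficients and N as reported in the study.
reported = {
    "ra": (0.42069, 10),
    "rb": (0.71808, 10),
    "rc": (0.46916, 9),
}

for rater, (r, n) in reported.items():
    print(f"{rater}: r = {r:.3f}, n = {n}, t = {t_from_r(r, n):.2f}, df = {n - 2}")
```

Converting t to a p value requires the t distribution with n - 2 degrees of freedom, which the Python standard library does not provide. The point of the sketch is only that, with n near 10, r has to be large before it stands out from chance, which is why only rb's coefficient comes with a small reported p value.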
Table 3
Intra-Rater Reliability Correlation Coefficients: Time = 1 and Time = 2 for Rater a, b and c

Time 1 and Time 2   N    r        p
ra1 and ra2         10   .42069   0.2260
rb1 and rb2         10   .71808   0.0193
rc1 and rc2         9    .46916   0.2027

All of the correlations conducted for the quantitative analysis are summarized in Table 11 (p. 112). The individual Tables 1-3 report the correlations of each variable grouping (ASR scores, rater scores, inter- and intra-rater reliability). The summary table was created in order to further the discussion relative to: the validity of the ASR scores as generated by the software; consideration of the ASR scores and rater-assigned scores as valid and as practical indicators of pronunciation ability; and inter- and intra-rater considerations.

Survey Questionnaire

In order to obtain information for answering research questions 2 and 3, data and descriptive responses were elicited through the use of a survey questionnaire. The survey
was linked to the participant sentence recordings (the same participant number was used) and was administered online, immediately following the sentence recording period. While the complete questionnaire results were analyzed and provided valuable data, not all of the survey questions were directly applicable to research questions 2 and 3.

Research question 2: How useful did the students find the ASR features were for assisting them with their French pronunciation practice? Survey responses were examined and analyzed as they related to student perceptions of the utility of the software for assisting with pronunciation practice.

Research question 3: How effective did French students perceive the ASR software to be for improving their mastery of French? Descriptive participant data was collected to evaluate whether participants perceived the software could be effective for improving student mastery of French.

All of the survey question results are presented and reported. However, where the survey item results relate directly to the research questions, the survey results are organized under and support the respective research question. The complete survey questionnaire is attached in Appendix D and is available for reference.

Results

Survey Questionnaire

The survey questionnaire consisted of ten questions, and the organization of the survey questions was important for eliciting specific experiential qualities and for triangulation to other responses. A number of questions consisted of several parts. Survey questions 1 and 2 were used to collect bio data. They were closed questions and inquired
about gender and length of French study. The frequencies for gender and duration of French study are outlined in Table 4.

Survey question 2 also gave participants the opportunity to elaborate or explain further about their French study experiences. Twenty-one of the thirty participants felt it important enough to explain further about their experiences. Nine participants related experiences of studying French in high school and explained gaps in their study and learning periods. For example, the explanation provided by one participant was a common remark: "This is actually my first semester taking a french class since 10th grade in High School" (P23). Three participants commented that they had spent time in or had traveled to France. One participant commented: "i grew up in canada where I was taught french from 4th grade on" (P20). Finally, participants added comments specifically about their past and future French learning, and the following comments reflect the responses of participants 36 and 38: "I studied french in high school but do not remember a thing" (P36); "... rarely got to practice the speaking aspect during high school and haven't taken a French course for two years on top of that" (P38).

Survey questions 3 and 4 inquired about comfort using technology and experience using language learning or speech recognition software, respectively. For question 3, the responses were coded yes = comfortable, no = uncomfortable and other = sometimes, together with comments on comfort level other than yes or no. For question 4, answers were coded yes = have used software before, no = have not used before and Conditions = comments beyond yes or no regarding conditions for learning. Frequencies and conditions for Q3 and Q4 are listed in Table 5.
Table 4
Frequencies for Gender and Duration of French Study

Survey Responses        Frequency
Male                    8
Female                  22
French Study
  French I              15
  French II             8
  Other/Unspecified     7
Semester/Year
  < 1 semester          6
  1-2 semesters         12
  1-2 years             8
  > 2 years             4

Survey questions 5 and 6 were designed as a warm-up to encourage participants to reflect on past French learning experiences. The intended purpose of survey Q5 was to have participants describe how they perceive their affective responses to learning situations and to relate their interactions. This question was designed to allow participants
to express both positive and negative learning experiences, in the hope that they would more freely relate to survey Q6, which followed.

Table 5
Frequency Responses for Survey Q3 and Q4

Frequency Responses Survey Q3 and Q4                              Yes   No   Conditions
Q3  Comfortable using computers/new software?                     23    1    Yes, if or as long as = 4; Sometimes = 2
Q4  Used language learning/speech recognition software before?    7     23   Rosetta Stone software = 3; Online Spanish = 1; Other software = 3

Survey question 6 specifically asked participants to describe, affectively, the short experience (in many cases 15 minutes or less) of recording their pronunciation of French sentences using the automatic speech recognition (ASR) software. This question was designed to capture the overall quality of the participant experience and to provide a means for triangulating to the individual experience with the specific features of the software and the pronunciation recording period.

The results for survey Q6 revealed positive user experiences. Initial coding resulted in two basic categories: 1) comments relating to the software and 2) how the
features (for example the native speaker) or a feature of the software related to the individual. For example, the following quote by one participant demonstrates the 'positive experience' theme: "I enjoyed it because it allowed me to listen to a native speaker and right after try to pronounce the same words on my own." (P17). Another participant commented: "I enjoyed being able to record my own voice and understand how different I sound compared to a real French speaker." (P43). The same theme was expressed about the graphic display by another participant: "Software was very good and actually being able to see the sound waves helped tremendously I could hear the ups and downs of the language, what had more emphasis, and what was not really pronounced." (P21).

On a continuum of general 'quality of experience' for Q6, from exciting to frustrating, twenty-seven participants related that they found the software fun, exciting, interesting, helpful or enjoyable. Of the twenty-seven participants, seven were coded more toward the middle of the continuum, where participants reported feeling initially uncertain or nervous, yet were positive overall about the experience. Participant 28's response was characteristic: "I was nervous since other people were in the room and could hear me, but I thought it was fun and helped me a lot." (P28). Three participants were coded at the frustrating end of the continuum and indicated they were not comfortable; for example, participant 31 reflected simply: "Not very comfortable sometimes." (P31).

The 'quality of experience' continuum related directly to the participants' impressions of the software. Where the experience was positive, the software was perceived as easy to understand or easy to use, and features such as listening to the native speaker, being able to repeat and being able to say each word separately were viewed as helpful. Where participants had expressed frustration with the experience, the overriding
comment was that the speed of the speech and pronunciation of the native speaker was very fast; as one participant described: "They spoke so fast and I felt like my tongue couldn't keep up with theirs." (P19).

Survey Q6 was an open-ended question, the first question that requested a description by the participant relating to the ASR experience. Only following the analysis of later questions (especially Q7 and Q8) did it become clear that participants had provided their initial descriptions and impressions of the software in Q6. After coding open questions Q6 and Q7(d), it became clear that participants had provided complete descriptive responses about their impressions and experience in Q6. This is understandable, as the focus and wording of each question was similar, yet Q6 could be considered a first request for an open, individualized evaluation of the experience. Regardless, participant responses, comments and descriptions were consistent with the individual's described experience.

Automatic Speech Recognition (ASR) Features

Research question 2: How useful did the students find the ASR features were for assisting them with their French pronunciation practice? Survey questions 7 and 8 were designed to address participant perceptions regarding the specific features of the ASR software. Responses were examined and analyzed as they related to perceived usefulness of the software for assisting with pronunciation practice.

Survey question 7 consisted of four parts directly relating to the features of the ASR software. Data responses for three parts were quantitatively oriented and the final part was an open-ended question. Survey Q7 was organized as follows: Based on the
French pronunciations that you recorded today did you find it helpful: a) To listen to your own pronunciations? b) To listen to the native speaker model? c) To look at the visual graphs? d) Please tell us more about your impressions and your experience with these software features.

Table 6
Question 7 (a, b, c) Frequency Responses

Helpful                                     Yes   %      No   %
a) to listen to your own pronunciations?    27    90%    3    10%
b) to listen to the native speaker model?   30    100%   0    0%
c) to look at the visual graphs?            23    76%    7    24%
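The percentages in a yes/no frequency table of this kind are simply each count divided by the number of respondents. The short Python sketch below recomputes them from the reported counts; the function name is hypothetical. Note that conventional rounding of 23/30 and 7/30 gives 77% and 23%, whereas Table 6 reports 76% and 24%, so the table may reflect truncation or a different rounding choice.

```python
def yes_no_percent(yes, no):
    """Return (yes %, no %) rounded to whole numbers."""
    total = yes + no
    if total == 0:
        raise ValueError("no responses")
    return round(100 * yes / total), round(100 * no / total)

# Counts as reported for survey Q7 (a), (b) and (c).
q7_counts = {
    "a) listen to own pronunciations": (27, 3),
    "b) listen to native speaker model": (30, 0),
    "c) look at visual graphs": (23, 7),
}

for item, (yes, no) in q7_counts.items():
    yes_pct, no_pct = yes_no_percent(yes, no)
    print(f"{item}: yes {yes} ({yes_pct}%), no {no} ({no_pct}%)")
```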
For survey Q7 (a, b, c), frequency responses were coded yes = helpful and no = not helpful for 'perceptions of usefulness'. The pronunciation features of the ASR software examined were: 1) listening to self, 2) listening to the NS model and 3) looking at the visual graphs. The frequency responses for each feature are listed in Table 6.

Native speaker model. Participants responded positively to the native speaker model presented for each of the ten sentences: all 30 participants reported it was helpful to have the NS model. The native speaker (NS) model is based on a standard French speaker from the northern, Parisian region of France and is referred to as Parisian French (Valdman, 1976). The native models used in the ASR software are representative of the French spoken in that region of France and include males and females, young and older adult speakers. Although the participants could read each French sentence presented to them on the computer screen, they felt that being able to listen to the sentence pronounced by the NS, any number of times, either before or after recording, was very helpful. Although some participants may have been hesitant about pronouncing the sentences, seeing the words on screen and listening to the pronunciation by the native speaker contributed to their confidence and ability to orally repeat the sentence pronunciation.

Listening to self. Participants reported that listening to their own pronunciations was helpful. However, it is unclear whether they listened to a pronunciation and then reproduced the sentence based on the NS and their own pronunciation or based on the score they received. In other words, when participants listened to their own recordings,
were they attempting to improve the sentence pronunciation or their ASR score? Or did they see both their sentence pronunciation and the ASR score as the same? For example, one participant commented: "I wish that it would save your recordings so you can shut down and come back in later to listen to yourself." (P37).

Visual graphs. The visual graphs were also reported as being helpful; however, there were participants who clearly felt they were not as helpful and were perhaps even confusing. The visual graphs are presented as a waveform and a pitch curve in the ASR software, and a screen shot was put into the survey for clarity. The visual graphs are sophisticated representations of sound, and researchers (Chun, 1998; 2002) have commented that where they are used in software for language learning, students need to be trained to understand and interpret their meaning. Otherwise, it has been suggested they not be used.

Interestingly, one of Chun's (1998) early suggestions for research was just this topic: train some students using visual graphs for speech synthesis and then leave others to interpret on their own, an excellent suggestion for a research study. However, the TeLL me More (TMM) software options do not allow turning off (or making unavailable) the visual graphs for some participants and not others. It is considered a global option, meaning it is an option that applies throughout the program. Nonetheless, Chun's work (1998; 2002) is an important contribution to FL speech synthesis and CALL.

Participants were impacted by the visual graphs and in the following case examples did understand and interpret the graphs for themselves: "I really liked the ability to go back and listen to what I can improve on, the fact that the proper way of
speaking the sentence can be accessed, and the graphical display to show the stressed part of the word(s)" (P16); "The graphs were extremely helpful, it would tell you where or when you need to fix your pronunciation of a word or phrase, loved it!!" (P25); and "The visual graphs were an important part in pronouncing the words right" (P17).

Of the participants who responded that the visual graphs did not help (P = 7), only two commented specifically about the visual graphs: "The program would give me a great score, but the visual graph looked totally different than the native speaker" (P32); "The visual graphs were not helpful...I felt like they were tracking voice inflection, not syllable accuracy" (P40).

More impressions about the ASR experience. The open-ended part of survey question 7 (d) asked participants to "tell us more about your impressions and your experience with these software features." Participants responded by describing more about their interactive experiences with the features of the ASR software. Comments from participants 28, 39 and 44 related specifically to the NS model and how participants felt it helped them, for example: "I liked the native speakers were the voices heard on the software. It really helps hearing the correct accent." (P39); "I really enjoyed seeing how I sounded compared to the model." (P44); and "It is very fun and challenging and helped me pronounce words I didn't know how to pronounce before" (P28). Participants were able to see and hear the French sounds and make phonetic connections for their own French pronunciation.

Although all of the participants reported that they found it helpful to listen to the native speaker model, some comments in Q7 part (d) raised a concern about the NS model. Participants
reported they felt the NS spoke too fast and suggested they would have liked the speaker to slow down, or to speak slower. Participants 27 and 20 gave definitive responses: "great and they speak too fast" (P27) and "It is interesting, but they need to speak a little slower" (P20). The NS model was perceived to be helpful, maybe even necessary, and participants suggested how they perceived it could be more helpful for their French pronunciation learning.

Pronunciation practice: initial impressions. Given the short time span for recording, participants responded positively to the features of the software and their interaction with the ASR. Although not all commented specifically on the features, comments were directed to how they felt the software could help them practice or improve: "Its good to help us practice and listen to how we speak" (P26) and "It was amazing! I didn't realize how off I was when I spoke" (P23). Participants 37 and 33 commented on ways to improve or change the software to be more helpful to them, for example: "I wish that it would save your recordings so you can shut down and come back in later to listen to yourself." (P37) and "I would have liked the speaker to slow down, or perhaps have an option where you can click on individual words for pronunciation" (P33). Interestingly, the software does provide the latter option. However, it was not an included option for the study sentences.
ASR Scores: Perceptions of Usefulness

Survey question 8 consisted of several parts addressing the ASR score box with green bars. Survey Q8 was organized as follows:

TeLL me More ASR provides a score (represented by 'green bars') for each of your sentence pronunciations:
a) Were the ASR scores helpful for your pronunciation?
b) Do you think the ASR scores reflect your French pronunciation ability?
c) We are most interested in the ASR score as it is presented to you. Could you please comment on exactly what this score meant to you?

The first two parts of Q8 were closed (yes/no) questions. The frequency and percentages for both parts (8a and b) are presented in Table 7.

Emergent Themes

Survey question 8(c) was open-ended and asked participants to comment on exactly what the score meant. A screen shot of the box marked 'Score' was also presented as a visual aid for reference, next to the question. Three significant themes emerged from the analysis of 8(c) and included three separate meanings participants assigned to, or associated with, the score feature: 1) visual, 2) pronunciation and 3) score meanings.
Table 7
Frequency and Percentages for Survey Question 8(a) and (b)

Survey Question 8                                       Yes    %    No    %
a) Were the ASR scores helpful?                          28   93     2    7
b) Do you think the ASR scores reflect your ability?     25   83     5   17

Visual Meaning

The visual aid presented with the question was assigned a meaning and represented a common reference. A visual marked with the name 'Score' was assigned an evaluative value and was considered some form of evaluation. Participants used terms such as: bars, score chart, scale and guideline. The colors gray and green were also mentioned, where green was equated with good. In the ASR software screen the score is visually represented by small, rectangular and vertically positioned bars. Figure 5 is a screenshot of the Score box, with green bars in the ASR software and as seen on the Speech Recognition activity screen.
The positioning of the score box and the green bars that appear within the box (following a recording) are important from a visual graphics standpoint. The score box uses color, rectangular bars and vertical movement on the screen to create the appearance of addition, more or better. Comments included: "It felt good when I looked and I had five bars, since the last time it was in the gray which was not very motivating" (P42) and "I never wanted to get a gray color bar, green meant good to me" (P44).

Figure 5. ASR Software Score Box

Pronunciation Meaning

In terms of pronunciation, participants referenced features of pronunciation and associated meaning to specific qualities of pronunciation. Two distinct and important features of pronunciation are speed and accuracy, along with related features such as timing and clarity. One participant commented: "This score showed me that I need to work on my speed and accuracy" (P39); another commented: "The scores made me realize that some sentences were easier to pronounce that others" (P17). One participant (French II)
commented on the vocabulary used for the sentence pronunciations: "I was unfamiliar with the vocabulary and would get tongue-tied by the words I had never seen before" (P40). These participant comments indicated a sophisticated understanding of the qualities necessary for interpreting pronunciation meaning.

It is appropriate that speed of pronunciation was identified, and it triangulates to comments referenced earlier (Q7) regarding the fast NS speaker model. Also, in survey Q8, the NS was a reference point for comprehensibility (Munro & Derwing, 1997) or understanding of French pronunciation. Participant 32 interpreted the score to be a comparison to the NS. For example, "A score above the bar was acceptable. Meaning that a native French speaker would understand what I said. Maybe not perfectly, but understandable" (P32). The perception expressed (P32) indicated that the participant had linked their pronunciation performance to being comprehensible to a NS.

Score Meaning

The word 'Score' on the top of the score box was the most concrete example of associations with grades, and participants quantified and qualified the score with meanings. The familiar dichotomies associated with evaluation were apparent and several categories emerged such as: good/bad, pass/fail, high/low, easier/harder. As well, participants interpreted a higher score (or more green bars) as better, or more acceptable.

More important, perhaps, was the fact that participants interpreted the score to indicate progress or need for improvement, and an indication to work or to try harder. This self-evaluative view of the score meaning took the form of statements about how they viewed their speaking skill, or where they saw themselves, "It meant I need a lot of help
speaking" (P28), or their goals, for example, "It gave me a better idea of how well I spoke certain sentences. I enjoyed knowing where I was at and where I should be" (P31); and "It showed me that I need to work on certain areas of pronunciation and some areas I do well on" (P43).

It is apparent from the analysis of survey Q8 that the feature of the ASR software entitled 'Score' impacts the students' feelings about their pronunciation. It appears from the responses that the green bars appearing and indicating a positive result, or pronunciation, can spontaneously result in a good feeling and provide motivation to continue recording the same sentence or to move on to the following sentence.

Participants wanted to know how well they were producing, pronouncing and speaking the language and how they could work to improve. Even in a short period of interaction, participant responses described their perceptions, indicating that the ASR software was a tool that did and would help them practice their French pronunciation. Participant 23's comment was particularly instructive and informative: "It actually made me want to work to get the bars higher. It was amazing how much it affected me to want to get a little green bar to rise. I think that this tool will make student work harder because it actually shows you where you are and where you could be" (P23).

Participant responses and the subsequent analysis demonstrate that clear perceptions of the usefulness of the software for assisting with pronunciation practice emerged. Not only did participants perceive the software to be useful for pronunciation practice, they also reported a perception that the software engaged their motivation and stimulated an interest, particularly for their current FL learning and for future practice. Given the elements of positive feedback, motivation and interest
expressed by participants in response to their brief use of the software, it appears imperative to consider how and in what ways the software can be made available to students for their use with their FL pronunciation and speaking skills.

Perceptions of Effectiveness of the ASR Software

Research question 3: How effective did French students perceive the ASR software to be for improving their mastery of French?

Descriptive participant data was collected to evaluate whether participants perceived the software could be effective for improving student mastery of French. Participant responses to survey questions 9 and 10 were analyzed and responses were examined for participant perceptions of the effectiveness of the ASR software for improving French speaking skill. To some extent, evidence of participant 'perceptions of effectiveness' has been documented in earlier survey findings. However, the range of the final questions allowed for a broader and deeper analysis of responses by providing the opportunity for comparison and triangulation to earlier participant descriptions and reports.

Survey question 9 consisted of two parts and was organized as follows: "How helpful is the ASR software feedback (scores, native model, visuals) compared to other feedback you have received on your French pronunciation and speaking? Could you please comment about how this software's feedback has been more or less helpful to you?" The first part was a quantified statement of degree, and asked for comments on how helpful the ASR feedback was compared to other feedback. The second part was a request for a descriptive comment regarding the specific feature of the ASR software feedback that was or was not helpful. The findings for Q9, 'ASR compared to other feedback,' are
listed in Table 8a. Although not a specific part of Q9, a second table (Table 8b) was designed in order to compare the perceived helpfulness according to length of time studying French.

Table 8a
Survey Q9: ASR Compared to Other Feedback

Q9 Responses: helpfulness of ASR compared to other feedback    N = 30      %
1. Much more helpful                                               14    47%
2. Somewhat more helpful                                           13    43%
3. About the same                                                   3    10%
4. Not as helpful                                                   0     0%

The rationale for the comparison of 'Perceived Helpfulness according to length of French study' (Table 8b) is based on the supposition that longer length of study would result in more experiences with feedback, or a broader variety or type of feedback. Students who have studied French longer may have more feedback experiences available for comparison, which raises several interesting questions. Does the ASR software
feedback benefit beginning learners more than students with more or longer learning experiences? Or are students who have studied longer more experienced and more qualified to compare their feedback experiences?

The literature provides some support for the use of ASR software with beginning learners (Hincks, 2003). However, the general lack of FL pronunciation and speaking practice, particularly within a classroom setting (Eskenazi, 1999), is a common and ongoing concern in the field. One participant commented that they felt the teacher had limited time for interaction.

In survey Q9 the helpfulness of the ASR compared to other feedback resulted in two feedback types or themes: 1) feedback from a person and 2) feedback from the ASR software. The comparisons resulted in the ASR software always being compared to a person, or evaluated based on a person giving feedback. The person listed as giving feedback was the professor, French teacher, 'real' teacher and native speaker. Other qualifiers such as in-person, one-on-one, someone and 'your voice' were used to signify a human giving feedback vs. a machine or, as one participant commented, a 'sound device'. This finding triangulates with the concept of human raters being the gold standard (Eskenazi, 2009) for FL pronunciation and speaking evaluation, particularly for students within a classroom. Students expect the human teacher-rater to be available and expect that they are qualified to assist them, if only during class periods.
Table 8b
Perceived Helpfulness According to Length of French Study

Q9 Responses                 Length of French study (N = 30)
                             <1 sem.   1-2 sem.   1-2 yr   >2 yr   Totals
1. Much more helpful            2         5          4        3       14
2. Somewhat more helpful        3         5          4        1       13
3. About the same               1         2          0        0        3
4. Not as helpful               0         0          0        0        0

The feedback from the ASR software was described based on comparative advantages of using a computer and headset for pronunciation recording and practice. These advantages included it being easier to 'click' and faster feedback. Specific advantages related to the ASR software included several components such as: features that can compare more aspects of pronunciation, red highlighting as indicators for sounds or words that need attention, the ability to pronounce unfamiliar words when presented by the NS, as well as the ability to reproduce pronounced words more clearly.
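The question of whether helpfulness ratings vary with length of study can be examined descriptively by converting the Table 8b counts into within-group percentages. The following is a minimal sketch, not part of the original analysis: the counts are transcribed from Table 8b, while the group labels, variable names and function names are assumptions introduced for illustration. With groups of only four to twelve participants, the resulting percentages are descriptive only and do not support statistical inference.

```python
# Counts transcribed from Table 8b (rows: Q9 response; columns: length of French study).
# The group labels below are an assumed reading of the table's four column headings.
GROUPS = ["under 1 semester", "1-2 semesters", "1-2 years", "over 2 years"]
COUNTS = {
    "Much more helpful":     [2, 5, 4, 3],
    "Somewhat more helpful": [3, 5, 4, 1],
    "About the same":        [1, 2, 0, 0],
    "Not as helpful":        [0, 0, 0, 0],
}


def column_totals(counts):
    """Number of participants in each length-of-study group."""
    return [sum(column) for column in zip(*counts.values())]


def column_shares(counts):
    """Percentage of each group giving each response, rounded to whole percent."""
    totals = column_totals(counts)
    return {
        response: [round(100 * n / total) if total else 0
                   for n, total in zip(row, totals)]
        for response, row in counts.items()
    }


if __name__ == "__main__":
    totals = column_totals(COUNTS)
    shares = column_shares(COUNTS)
    for response, row in shares.items():
        cells = ", ".join(f"{group}: {pct}%" for group, pct in zip(GROUPS, row))
        print(f"{response} -> {cells}")
    print("Group sizes:", dict(zip(GROUPS, totals)))
```

Under these assumptions the group sizes are 6, 12, 8 and 4, so a single participant shifts a group's percentage by 8 to 25 points, which is why the comparison is best read alongside the participant comments rather than on its own.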
Pronunciation practice, or the ability to repeat individual pronunciations, was seen as a comparative advantage relative to a teacher. One can "listen over and over," "repeat," and can "go back and hear yourself," and these were qualities of the ASR software that were seen as helpful and were valued: "One on one feedback doesn't allow you to compare as many aspects of pronunciation, also, its easier to just click a button and have the phrase repeated to you than it is to have your professor repeat it over and over" (P25).

Being able to repeat French sentence pronunciations was associated with not only practice but improvement, as well. Participants 31 and 32 stated they felt being able to repeat and practice meant they could work harder and the work would result in a better outcome or pronunciation: "The play back of what I recorded against a native speaker helps me see what words I need improvement on" (P32). "I think it is more helpful because it allows you to listen over and over constantly try to better improve you better score" (P31).

The concept of practice, reported descriptively as being able to repeat pronunciations, was seen as an advantage. One participant creatively equated the ability to record and listen with speaking the language: "It is nice to be able to go back and hear myself speak the language" (P30). Another participant felt that the software was unbiased: "The feedback is unbiased to me, which I value" (P16), and another was perhaps overly trusting of the software: "It actually shows you where you are. This software can tell you exactly where you are and it isn't just the opinion of a person. This is actually physical proof of how you are speaking and how you should be speaking. It's incredible." (P23).
Of the thirteen participants who viewed the software as somewhat more helpful, the speed of the NS pronunciations was the focus of concern and the participants saw this as a setback for their pronunciation. For example, "It depended on the sentence, there were times where the sentence was said to fast, so I had a hard time making out a word" (P20).

There were three participants who rated the software as similar to, or about the same as, other forms of feedback. Two of these three comments contained reference to a person: "in person you get a lot more feedback and more quickly" (P42), and another commented, "I think it does almost as well as a teacher since it is basically the same concept as listening to and repeating after her. The advantage though is I can practice more with this than I can with the teacher" (P38). The concept of using the ASR software for practice, despite uncertainty about the comparison to other forms of feedback in the above cited cases, was seen as better for pronunciation practice.

Overall, twenty-seven of the thirty participants reported that they perceived the ASR was more helpful than a teacher rater for providing feedback. The analysis indicated that participants were qualified to differentiate between the relative advantages of the various forms of rater, teacher-based feedback based on their experiences. Where a FL teacher-rater has provided input and feedback to students, they valued and trusted the teacher and at the same time employed their technological savvy to interpret the contribution of technologically-enhanced features such as ASR. The participant responses provide insight into, and support for, students' ability to perceive and distinguish variable feedback methods and still assign value to their own learning experiences.
Survey question 10 addressed the issue of teacher ratings of French pronunciation and speech and was organized as follows:

Could you describe a time when you had your French speaking or pronunciation rated or evaluated by your French teacher?
a) How do you feel about having your French speech or pronunciation rated by an expert human rater (either known or unknown to you)?
b) Would you want the rater to be a native French speaker?

The responses and assigned codes represent the most commonly used form of FL evaluation for speaking and pronunciation. The coded responses are presented in Table 9.

The responses to teacher-rated speaking or pronunciation were indicative of the expectations students have of teachers, generally. Students expect teachers to rate FL speaking and to correct their pronunciation. Not only do students want to be rated, they want to be corrected and given feedback on how to improve. Participants expressed that, although aware of limited class time for learning, they view the teacher as rater each class and especially when they are asked, during class, to speak or pronounce words or sentences.

Although this is a common FL classroom practice for pronunciation exercises, students did not report feeling uncomfortable and wanted to be corrected: "During class if we pronunciate a word wrong our teacher will correct us. After I try to repeat the word correctly" (P32). In another example, a participant spoke of having a conversation with the teacher: "Well we had a conversation in French and she would correct my pronunciation as I went. It was helpful but a little distracting from the conversation" (P19).
Table 9
Survey Question 10: Coded Responses for French Teacher Rating

Coded responses: Informal / Formal; Comfortable / Uncomfortable; Yes / No

Teacher-rated experiences    Informal = French classes; projects and presentations
                             Formal = Exams; oral interviews
a) Expert rater              Comfortable = 25      Uncomfortable = 5
b) Native speaker rater      Yes = 27              No = 3

Students expect to be evaluated and they want and accept specific feedback, both on informal and formal speech ratings. In one case a participant reported that the teacher was responsive to pronunciation at the level of syllables: "They pointed out a couple of instances where I had pronounced a final syllable. That is usually the only commentary I get in oral interviews" (P34). Another commented: "I was pretty relaxed because the instructor helped me through any hard pronunciations" (P30).

Originally, the oral proficiency interview (OPI) was developed by ACTFL to evaluate FL conversational ability. Modified and adapted versions of oral interviews have been incorporated into FL curricula as a way to quickly evaluate spoken language. The oral interview students referred to in response to Q10, and especially at the beginner
level, is relatively short, less than five minutes. A French I student reported that they had not been rated orally yet: "I don't believe my teacher has listened to just me yet" (P38).

In reference to survey Q10, where participants were asked to comment on their feelings about having an expert rater rate their sentences, twenty-five participants were comfortable with the idea. The responses that were indifferent ("It doesn't bother me" (P20)) were also coded as comfortable, and in some cases participants stated they were glad not to know the person. Only five were coded as being uncomfortable: "It freaks me out a little" (P29) and "I would feel anxious" (P30), and one participant commented they were more concerned about their performance on the sentences than about the expert rater: "It made me a little self-conscious but I soon forgot about it" (P36).

The responses were positive regarding a native speaker rater for rating their sentences. Twenty-seven of thirty participants reported that they would want the rater to be a native speaker. The NS model is used by the ASR and participants had the model available during the recording period. Participants perceived the functionality of the ASR to be a comparison of their pronunciation to the model, where the NS pronunciation was the best score. Three participants responded "no"; that is, these three would not want the rater to be a NS. However, they were not uncomfortable with an expert rater.

While coding survey question 10, several interesting questions about FL rating emerged. For example, how are FL rating and scoring related, for students? It appears that students (or at least these participants) are less concerned about an expert rater and had very little affective response (indifference, don't care, don't mind, doesn't bother me, etc.) but had a much more definitive response to a NS rater. It might be possible to
suggest that a NS is an expert, yet do students see the NS and expert as the same person? What is the relationship between rating and scoring? Does the question of how these two evaluations are related have a basis in how students are impacted, that is, how it affects them? These questions arose when looking at the "no" responses to 10(a) and (b) and, although rhetorical, may provide a basis for future analysis.

An interesting aspect of survey question 10 was the reported positive impact FL teachers have had on the participants. There was a clear understanding and acceptance of teacher feedback, especially when referring to correction. Participants referred to the teacher being 'not too strict' or 'more lenient' on pronunciation. One comment sums up the sentiment: "I would like to know what I can do to sound comfortable speaking conversational French in a relaxed atmosphere" (P45).

Final Survey Comments

At the end of the survey, a final request for comments on the experience at the lab or the ASR software resulted in 20 comments (10 = no answers). The responses were grouped according to comments relative to the experience: 1) at the lab and 2) responses about the ASR software. The lab experience was described as: enjoyable, fun, a good or great experience. Three comments included participants being distracted by others present while recording in the lab. However, in each of the three cases, the participants stated clearly that the overall experience was positive.

As regards the ASR software, participant final comments were positive and related that the software was easy to use. Participants suggested they would like to see the software integrated into the FL curriculum, into their language classes and expressed the
hope they could use the software again. One comment was particularly informative and summarizes one participant's overall experience: "I wish the sentences had been recorded slower. I also wish there was a way for the software to give more specific feedback on a particular recording...I do think the software has the potential to be very useful in the future, but at its present state, I would hope that this program would not be used as a method of evaluation for a course grade. Just a useful tool for practice!" (P40).

The results of the survey questionnaire contributed meaningfully to an understanding of the participant experience of recording French sentences using ASR software. The participants provided positive suggestions for their learning and useful insights into the features of the ASR software. Their interpretations of what this means for their French learning included how and in what ways they perceived the ASR could effectively contribute to their mastery of French.

Through a triangulation of methods, itemized survey questions and a final analysis of the qualitative findings, participants provided consistent support for their views and perceptions concerning the actual and potential use of ASR to contribute to their French learning. Student perceptions of the usefulness of the ASR software for assisting with pronunciation and speaking practice, as reported and analyzed, are meaningful and contributory qualitative evidence for the usefulness of ASR software.

The insights that emerged in the qualitative findings suggest, definitively, that participants perceive the ASR software as effective for their mastery of French. There was a clear sense that participants perceived their participation in the study as a confirming statement of their support for the use of ASR software for future FL
courses. In a final comment, one participant expressed this view: "I think that this program has good potential and it should be integrated in French classrooms" (P44).

Researcher Bias

Researchers are aware of their relationship to their data, whether participants or numbers. Empirical researchers seek to have the data speak in numbers, qualitative researchers to examine data in words. As a researcher, my relationship with the software began over a six week period in the summer of 2004 at the University of Perpignan (south of France). During several afternoon sessions in the multimedia center I was introduced to the TeLL me More software and the ASR features. As an FL researcher, I found the software, and its ability to produce and repeat speech, a fascinating adventure. Upon returning to the US, I learned the FL community was unaware of the tool. Barbara Lafford's (2004) review was a refreshing contribution as she too perceived similar qualities in the software and its potential to assist students.

A positive initial experience with the software, leading to research and final reporting of results, given these were the experiences of the researcher, may have contributed to researcher bias. However, knowing the impact the experiences with the software had imparted, an awareness of, and sensitivity to, the potential for bias was consciously present during analysis. An effort was made to maintain a researcher's distance from the data. In this way the nature of the participant experiences was reported at a distance by the researcher, especially because the researcher's experiences may have been similar to the participants' experiences using the software. Feedback from
colleagues and further reflection suggest that the expressed perceptions of the participant voices have been analyzed and reported.

Other Emergent Survey Data and Researcher Reflections

During the early coding of all the surveys several interesting concepts and ideas emerged, and while they are not directly applicable to the issues addressed by the research questions, there are important connections that were identified. Three of these connections will be outlined, as they are perceived to be supportive of specific topics related to 1) FL interaction and speaking, 2) student learning experiences, and 3) FL teaching.

FL interaction and speaking. Following the initial coding, the first run-through of all the surveys, and after arriving at the final survey question 10, the process and duration of time emerged as a significant contribution to the concept of 'interaction' (see coding example, Appendix E). Participants had used the word "during" to qualify and quantify an experience, either an earlier learning experience or the experience with the software. The word "during" implies process, movement, start/end, motion, that is, a process of something happening in time, while time is passing. The question that emerged was: How does this concept of time as having a flow, a movement, a beginning and ending describe a student's view or perception of FL speaking? Or does it?

There were very few mentions of computers or software. It appeared that students might be expressing the concept of FL speaking as 'during interaction as process'. Students describe teacher input and expect, desire and want teacher feedback. Participants then describe their responses where they correct themselves, take in the feedback, plan or
produce a spoken thought or message. In this view of interaction as process, interaction occurs with the student, teachers or materials, language, and the action, end, outcome or situation is produced by the student.

FL learning and speaking. This concept of time, in FL terminology, is considered an interaction process. Students expressed the distinct feeling that they accept, for the most part, the process; that is, they expect to interact and to experience an outcome. Even if they are not happy with their input (often the case with FL pronunciation), they expect to gain, to improve, to learn something about themselves.

What emerged was a clear perception that students want to learn to speak, to pronounce, to learn to use the language. Is it possible they conceive of speaking as using the language? Are speaking and using the language synonymous in their learning experiences? This idea might explain why computers and software are both important and irrelevant to the process. Important because they are a feature of the interaction process and may be an important one for pronunciation learning; irrelevant because students ultimately feel responsible, towards themselves, for their production.

FL teaching and speaking. Teachers were, overwhelmingly, seen as the source of input for FL speaking and pronunciation. Students viewed their teachers as resources, contributors and evaluators. Students are generally uncomfortable in speaking situations and often this relates directly to their feeling of a lack of control over what they want to produce or need to produce in the language. This is especially difficult in the area of FL speaking when students want to do well and desire to be prepared. Pronunciation and
speaking are areas where a teacher can have considerable influence on the student's level of comfort, even in perceived high stakes situations. The ability of teachers to accept a student's discomfort or fear, whether the fear is about not being understood or not being heard, appears to override the teacher's actual ability to produce fluent, native or near-native language.

Again, the theme that emerged is that it is the student's expectation of the self that drives their learning. This theme was expressed repeatedly in comments such as: to hear and be able to respond, to say what is meant or what one is thinking, to not offend a native speaker, to have conversations, and finally to have a good accent.

Numerous ideas and researcher reflections emerged throughout the entire coding, categorizing, analyzing and survey questionnaire documenting process. Many appeared at first to be unrelated. However, the underlying thread, given such a short period of data collection, is that students are perceptive and that they desire to learn and understand. FL pronunciation and speaking present a considerable challenge for their learning. Yet, this is a challenge they appear willing to undertake given helpful resources, materials, and practice situations with reliable and understanding teachers.
Chapter Five: Discussion

CALL Research: ASR Software and Speech Technologies

CALL studies, given the rapid advances within the domain (advances from CD-ROM to DVD in terms of speed and storage are two simple examples), are technically challenging. Studies reported less than ten years ago may no longer be feasible, depending on the software design. For example, TMM Education (2003), the version used for this study, was designed for a computer lab and serves networked computers within the lab from a server computer. The ASR software saves the produced audio recordings on the networked server. So each time a participant records or pronounces an audio file, it is saved for the 'tutor' or teacher to listen to, or score, later. However, the internet version, TMM Campus (2006), does not save audio from the ASR once the student has left the screen.

In a recent text designed for FL teachers, Blake (2008) evaluated two tutorial CALL products (LeLoup and Ponterio, 2005) that included FL vocabulary glosses for reading. Both products provide glosses in a large number of languages, yet only a limited number include audio recordings of pronunciations. As Blake notes: "...but not every language database has extensive audio recordings to accompany the word definitions" (p. 56). Although FL audio is useful for pronunciation, especially when native examples are provided, dealing with audio files, from a technological perspective, presents distinct challenges.
Chun (1998), in her seminal article, suggested research designs that, although impossible at the time, would become possible given continued technological developments in speech analysis, synthesis, recognition and foreign language. Chun envisioned that more and varied research was necessary. Chun's vision and research inspired the design of the TMM pilot studies and of the current study.

Overcoming technological challenges to conduct CALL research, sometimes the result of inherent software design, can be daunting or even untenable. Nowhere has this been more the case than in the area of Automatic Speech Recognition (ASR), where CALL, foreign language, natural language processing (NLP), software design and basic computer science converge. Neri et al. (2008) and others (O'Brien, 2006) comment on the rapidly emerging interest in speech recognition technology used for computer-assisted pronunciation training (CAPT). ETS researchers are working on speech recognition software that could potentially be used for analyzing and rating speech produced by examinees in the speaking section of the TOEFL (Xi, 2008). Rodman (1999), in his excellent work Computer Speech Technology, is one of the early applied linguists to bridge the divide between the branches of FL speech, phonology, phonetics and an understanding of the technology and speech science.

Chapelle (2007), an early and eminent CALL researcher, has suggested a more expansive view of CALL is on the horizon, one needed in order to incorporate developments in natural language processing. Chapelle comments that a broader view is necessary for understanding the type of research needed to further CALL designs, especially designs that incorporate speech features. More is being asked of CALL researchers today than ever before, for many reasons. Automatic speech
recognition and related areas of study provide great promise for collaborative research and can encourage developments that will enhance foreign language speaking and pronunciation learning resources for our students.

Numerous CALL researchers (Eskenazi, 1999; Hincks, 2003; Barr et al., 2005; Levis & Pickering, 2004) have explored the use of FL speech-interactive and ASR features in both commercial and research-oriented products. Several researchers (Chapelle, 2001) have made meaningful contributions to our theoretical and practical understanding of how these features and products can contribute to CALL software designs and student learning with multimedia. Likewise, the limitations of current technologies and of the CALL and ASR studies have provided a solid framework for future investigations.

Mixed Method Design and Issues of Significance and Meaningfulness

The purpose of a mixed method study is to provide a broader view of a topic and to shed light on findings and the issues investigated from a variety of perspectives. Importantly, a mixed method was designed for this study to investigate a quantity and quality related to automatic speech recognition (ASR) software. The quantitative findings provided meaningful information, although no statistically significant relationships were determined. The qualitative results indicated significant positive feedback from participants. Both methods and the related findings contributed to our understanding of ASR and its position within the larger domain of CALL and foreign language teaching and learning.
Bachman (2005) suggests that investigating relationships among two or more variables is fundamental to two of the qualities of test usefulness: reliability and construct validity (p. 113). In this regard, using correlations as a means of analysis for ASR scores, raters' scores and relationships among these variables is practical and informative. However, correlations and tests of significance are often difficult measures to employ in foreign language research, and linear relationships as representations of FL learning can be incomplete. Investigating the ASR score construct by comparison to the raters' scores was an attempt to establish the validity of the undefined ASR score construct. Examining the reliability of the raters' scores also contributes to our understanding of the ASR score value and its potential or practical usefulness as an evaluation tool.

Mackey and Gass (2005) discuss the use of correlation in FL studies and comment that in second language research we generally deal with much smaller sample sizes, making it difficult to get statistical significance (p. 267). FL qualitative studies, particularly longitudinal and case studies, are difficult to design, and following participants, even if there are only a few, can be time intensive and costly. Mackey and Gass (2005) argue for meaningfulness in foreign language studies and suggest:

In considering the difference between meaningfulness and significance (in the statistical sense), we need to recognize that second language learning is a slow and complex process often involving a period of production of correct forms only to be followed by a later period of production of incorrect forms. Therefore we need longer periods of observation, but
the exigencies of research do not often allow long periods of time. Thus, it may be that meaningful trends are worthy of discussion, independent of statistical significance. (p. 268)

A balanced FL study, such as the investigation into the assigned meaning of the ASR score, has contributed to meaningfulness, if not significance in the statistical sense, and has shed light on meaningful trends as suggested by Mackey and Gass (2005).

Several CALL researchers (Chapelle, 2001; Weir, 2005) have suggested that research involving several methods is lacking within our domain. Where empirical analyses are conducted, complete reports of findings are neglected or are unreported because the results are considered statistically insignificant. Others (Bachman, 2005) suggest that complete reporting is necessary in order to advance our profession and responsibly assist other researchers with more complete study designs or replication. With an interest in responding to Bachman's suggestions, and a desire to appeal to a larger audience of CALL and FL researchers, a complete report of the results has been provided.

By conducting both quantitative/empirical and qualitative data collection and analyses, a more complete picture has been presented of an area of CALL multimedia with automatic speech recognition (ASR) software. ASR is often misunderstood or misinterpreted by the FL community (Davies, 2006), and clarification or simplification is necessary for furthering our understanding. This study presented a view which employed not only more inclusive measures for a broader investigation, but a view presented from
multiple perspectives: 1) ASR software design and multimedia; 2) FL professionals and FL raters as evaluators; and 3) the French learners and student users. This investigation and its reported findings have contributed to a more complete understanding of ASR and its usefulness as a tool for FL practice, performance and learning.

ASR Sentence Pronunciations and Score Validity

One of the first questions posed by FL users and teachers is: What does the ASR Score mean and what do the green bars = score represent? Most important, what do they indicate to the student users? However, if there is no definitive explanation, as is the case with the ASR Score, then how is the score interpreted both empirically and qualitatively? What is the interpreted meaning? As instantiations of the interpreted meaning construct, how valid are the ASR and rater scores?

The ASR scores assigned to the French sentence pronunciations were generated by the ASR software through a complex process using a speech recognizer and related components, with French as the language component. The TMM French version of the ASR software uses a core beginner-level vocabulary of approximately 2,000 words. Through an algorithm designed to compare incoming speech with a language model, a sound comparison-type score is generated. A simplistic explanation for the very complex process of independent speech recognition can explain, generally, the basis for an assigned score. The score is recorded in the software on a point scale from 1 (low) through 7 (high), yet in the user interface and on the viewed screen the score is shown as gray bars (below a threshold of 2) and as green bars for scores from 3 (3 green bars) through 7.
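The relationship between the recorded score and what the student sees can be sketched in a few lines of Python. This is only an illustration of the description above, not the software's actual logic: the function name is hypothetical, and the sketch assumes that scores of 1 and 2 are the ones shown as gray bars and that a score of n from 3 through 7 is shown as n green bars.

```python
def displayed_bars(score: int) -> tuple[str, int]:
    """Return (bar color, number of green bars) for a recorded ASR score.

    Assumption (not confirmed by the software documentation): scores 1-2
    fall under the gray-bar threshold; scores 3-7 show that many green bars.
    """
    if not 1 <= score <= 7:
        raise ValueError("ASR scores are recorded on a 1 (low) to 7 (high) scale")
    if score <= 2:
        return ("gray", 0)
    return ("green", score)


if __name__ == "__main__":
    for s in range(1, 8):
        print(s, displayed_bars(s))
```

Under these assumptions the student never sees the number itself, only the color and count of bars, which is why the interpretation of the bars has to be studied separately from the numeric score.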
Thus students using the software do not actually see the number scores, only green bars. As reported in chapter 4, participants viewed more green bars as better and worked to improve, or increase the number of green bars attained, as they progressed through the ten sentences. It appears that the score assigned by the ASR as a number and the green bars viewed by the participants were, on average, interpreted as having similar, if not the same, meanings or values. As the qualitative findings indicated, participants expressed that the green bars resulted in a positive or 'need-to-improve' self-evaluation of their pronunciation, depending on the number of green bars.

Although the ASR-generated scores cannot be considered actual test scores, inferences and generalizations, cornerstones of validity research, about a participant's pronunciation have been made by both the ASR software and human raters. Reliability, or repeatability of the inferences made about scores, presents problems when tests involve FL productive skills, such as speaking and pronunciation. McNamara (2006) emphasized that scoring involves human judgment, and issues of repeatability here involve estimating the extent of agreement between judges; in other words, would the candidate get the same score next time with a different judge? (p. 33). Further, McNamara (2006) has encouraged researchers to consider new perspectives when engaging in validation studies and comments that input from non-measurement traditions is leading to the exploration of new insights into the limitations of such inferences (p. 27).
Possible Confounding Variability

The correlation analysis of the ASR scores relative to the rater scores did not reveal a relationship between the ASR and rater scores. There was an undefined and indiscriminate amount of variability that does not allow an identifiable relationship to be established empirically. The sources of confounded variability, despite controlled variables, are assumed to be the result of several factors.

One factor is the rubric used by the raters. The rubric was a numerical scale adapted from the ACTFL Speaking Guidelines 1999. The guidelines for speaking are descriptive in nature and have been criticized (Tatton, 2007) for being difficult to interpret reliably, especially for evaluating pronunciation components. The adapted numerical scale used for the pilot (April 2008) was initially confusing for the raters and was subsequently revised for the study. For the scale revisions, the ILR (Interagency Language Roundtable) scale was reviewed, as it was the precursor to the ACTFL guidelines and included numerical values. Generally, the ACTFL Speaking Guidelines 1999 have been difficult for FL raters to apply and interpret, especially when evaluating the pronunciation components of speaking skills.

Another factor contributing to score variability relates to the problems the raters encountered using the adapted scale to interpret a value for such a short sentence pronunciation sample. As well, perhaps the varied background of each rater and their rater training experiences also contributed to variations in the scores. For example, one rater related feeling stricter early in the rating and becoming more lenient as the rating continued. Another felt the opposite and stated feeling more lenient and less experienced early on in the rating process. Examining rater scoring trends and patterns
would help identify specific problems with both the rating scale and sentence length and clarity.

The backgrounds of the raters were varied and diverse in many aspects: age, length and type of teaching experience, and the educational systems in which they had taught. Even though there were two native French speaker raters, their individual experiences within the US and working with French language learners were different. How much did the diverse backgrounds and experiences contribute to the human factors, and thus possible variability, during the rating period? The adapted numerical scale, the identified problems the raters encountered and the human rater factors were sources of variability for the study.

Neri et al. (2008) used a research-based automatic speech recognition (ASR) system for their study. The researchers modeled the ASR system after TeLL me More KIDS (Parling) and found similar difficulties with their ASR scale and scores. The participants were 11-year-old Italian native speakers learning English, and raters were unable to consistently rate the young children's word pronunciations. Eventually, the ASR evaluation component was adapted to a simple accept/reject, as the short length of the recording and young speaker/voice variations were consistent problems in the early test studies.

Another factor that may have contributed to variability is the fact that the French ASR sentences were not presented to the participants as sentences 1, 2, 3, ...10. The software, at the beginner level, contained a corpus of 540 sentences assigned to learning units. Where selected sentences (of the ten) came from the same unit, they could only be grouped together. If the sentence was the only one chosen from the unit
then it was presented alone (for example, sentence 4). The screenshot (Figure 6) presents the screen as seen by the participants, where the sentences are grouped accordingly. It would be interesting to see how the grouped sentences versus the sentences presented alone were scored, according to participant. Preliminary analysis indicates there is variability in the ASR scores of the sentences presented alone. However, the individual sentence pronunciation features would need further analysis as well before any practical concerns regarding sentence groupings could be determined.

Future research into the actual ASR scores assigned for sentences 1 through 10 and across participants may indicate whether participant ASR scores improved as they progressed through the sentence pronunciations and the ASR recording experience. An examination of specific sentence scores could reveal more information about each sentence and how it functioned as a French pronunciation sample. A cursory glance revealed that several sentences presented pronunciation problems, and one participant commented that they were unfamiliar with the vocabulary. Another commented that the sentence pronunciations were more difficult than they had originally anticipated. Although relatively short, and few, the French sentences used common and standard beginner-level vocabulary and pronunciations.
Figure 6. Participant Screen View of French Sentence Groupings

ASR for Pronunciation Practice

FL speaking and pronunciation are one aspect of language skill that is generally not analyzed separately from other skills, such as listening. In fact, FL tests of different skills are often given together. Researchers are required to analyze test scores accordingly, and a considerable degree of covariance (or overlapping of skills) in the scores can be expected. Although not a test, the ASR software provides a form of performance evaluation of pronunciation. Clearly, other aspects of FL language skills are implicated. For example, participant feedback indicated that they listened to their pronunciations and the native speaker (NS) model. Listening is an FL skill that is usually
associated with measures of comprehension. One participant mentioned having a problem with vocabulary and how the NS example allowed them to listen to (hear) how to pronounce the words in the sentence. In this case, listening to the NS assisted the participant with producing the sentence pronunciation, but perhaps without comprehension of the individual meaning of the French words.

The French sentences were written on several screens for reading as participants entered the software and recorded. As well, each individual sentence was written on the ASR screen. During the pilot (2008), the question arose as to whether the sentences should be written and available for reading during the ASR recording. Consultation with French professors resulted in comments regarding the level of understanding of French pronunciation and speech by French I and II students. It was decided that not only should the written sentences be available but that they were necessary for the students' pronunciation production. In this case, reading, as a skill, supported pronunciation and again, perhaps comprehension.

It is important to emphasize that speaking and pronunciation practice, for which ASR seems especially well suited, is not necessarily concerned with a confounding of skills for evaluative purposes. The ability of the task to identify and selectively target the skill for practice is generally the primary consideration. ASR as a component of CALL for speaking and pronunciation task practice fills a need for both the student and the teacher. The student can practice speaking and pronunciation on their own or during a lab session without evaluation by the teacher. The teacher has a limited amount of class time with students to practice speaking or pronunciation, yet participants value the teacher's input. CALL-multimedia with embedded ASR can compensate for FL teachers' lack of time
during class for pronunciation practice where students need to practice in order to improve their pronunciation and speaking.

Initially CALL software was designed for FL practice as a result of technological limitations. The designs resulted in software that would allow only certain actions, and interaction was almost impossible. With advances, multimedia software has grown to incorporate multiple, interactive processes. In one sense, ASR software is where early CALL software began, as a practice tool for FL pronunciation. Yet pronunciation and speaking practice remain the most difficult skill areas to incorporate in the classroom. For various reasons teachers lack the classroom time and sometimes the FL language expertise. At the university level, teachers are under pressure to introduce (what are often viewed as) more difficult aspects of the FL curriculum.

FL pronunciation is also one of the most difficult skills to acquire. There is some research to indicate that after a critical period (Krashen, 1985), before the age of fifteen, generally the L1 or native language will continue to exert an influence on the pronunciation of the second language. FL speaking and pronunciation practice is of value at any age and may be more important for young adults who have had earlier FL experience and want to advance or improve upon their abilities. Study participants related that they had studied French in high school or earlier, sometimes for two or three years, but felt they had learned very little and expressed frustration that they did not learn to speak the language. The study demonstrated that ASR software was perceived, even in the short recording period, as a valuable pronunciation practice tool.
ASR and Pronunciation Performance

Foreign language performance, usually evaluated within a course or classroom setting, is relatively low stakes compared to other tests, such as an oral proficiency interview (OPI). For the student who is asked to model learning and produce language, the FL evaluation experience, especially as it relates to FL speaking and pronunciation, can be troubling. When a course grade and a required level of performance are expected, the stakes, for the student, can be relatively high.

FL performance as expressed in ASR scores collected on the French sentences was productively received by the participants. The majority related a positive overall experience. Many expressed insightful and self-reflective comments regarding their interest, motivation and desire to learn. One participant described their overall experience in a final comment: "It was a great experience to hear my own voice because it is something I have never done before" (P43). Another participant summed up the experience with: "Wish there were more sentences. It was a fun experience" (P37). Regardless of the ASR scores obtained, participants expressed an interest in continuing to learn, described feeling as though they could learn and wanted to improve their French pronunciation.

Teachers and professors were cited as key in promoting participants' interest in learning a foreign language, and participants accepted teacher critique, correction and feedback. One participant commented regarding the ASR feedback, "However sometimes there is nothing better than having a real teacher there with you, though this is a great substitute."
Interestingly, Andrews (1991) in her dissertation investigated whether the organization of French language in textbooks mirrored the way in which FL learners learn to speak. Not only did she find that it did not, she also found no difference between French I and French II for speaking as measured by the production of morphological features. Students were at the same level for speaking French after one year and again after two years of study at the university level. Within the academic course framework, we need to make changes for student French learning and use, especially in the areas of pronunciation and speaking. McNamara (2006) extends this view of needed change: "We urgently need to explore ways in which assessment concepts can be better used in classrooms to benefit learning and teaching" (p. 39).

Participant feedback on all aspects of the interactive features of the ASR software was positive and definitive. Students perceived the value of being able to use technology to assist them with their pronunciation practice. Participants understood the limitations of the ASR software for demonstrating performance. When speaking and pronunciation are the goals, they know they need more resources, more time to work to improve, and continued motivation to learn. We are responsible as CALL researchers and FL teachers for contributing to our students' resources for learning.

FL Raters and Rating Online

Rating foreign language speaking and pronunciation, especially recorded speech, has been a time-consuming, physically demanding and often exhausting task. When teachers or professors are rating high stakes tests, training and monitoring are necessary
to ensure reliable and consistent ratings. The rubric or scale used for rating also needs to undergo testing to ensure that a valid scale or measure is used.

The Center for Applied Linguistics (CAL) has developed a multimedia rater training package (CD-ROM) for French (2007) and, much like the TOEFL speaking section rater training (2005), various speech samples at the required levels are provided for training. Following from suggested foreign language rater training protocols, the study involved training raters using participant ASR-produced French speech samples. Further, a separate site had been prepared that included fourteen training participants and several complete 1-10 sentence sets. Raters underwent the training, then proceeded to rate the thirty participants, and afterwards they re-rated ten random participant sentences. The entire rating period was less than two hours, for a grand total of approximately 360 audio samples per rater.

In essence, and similar to the CAL rater training model, raters for the study were trained on actual samples and then rated in the same multimedia environment. When cassette tapes were the standard, rating could take hours or days. Earlier, when recording was untenable, students would provide written responses, usually in the form of dictation or translation (Valette, 1965). Today raters are trained and then sent speech samples to rate (often through a secured internet link) in their own environment and on their own time.

In this study the purpose of the rater training was to ensure consistency and reliability among raters. Not only has the study added the vital quality of time efficiency, but the rating environment prepared was also consistent for each rater. Each rater provided invaluable input to the entire rating process. The raters have expanded the
view of how and in what ways raters can significantly contribute to rating FL speaking and pronunciation in the future.

The raters expressed that the website, created using a university-sponsored survey creation tool, significantly reduced the amount of time needed to rate a large number of audio files. Using the site, rater fatigue was also reduced, and raters commented on the efficiency and ease of rating compared to other forms they had experienced. The raters commented that the design and layout significantly contributed to their concentration and focus when rating.

The website was designed to make the stressful job of rating easier. Raters were presented with each participant and ten sentences on one screen. The ratings were input by the rater simply clicking on a numbered circle after listening to a sentence. The rater could re-listen to any sentence, at any time. The screenshot (Figure 7) illustrates the rater's view of the rating screen where the Participant 16 sentences are presented. For P16, French Sentence 1 is written, and underneath it the speaker icon with the right-pointing arrow indicates where the rater must click the mouse in order to hear P16's audio recording of sentence 1. The 1-7 buttons located under the sentence are where the rater inputs the assigned score. Raters appeared to need the print copy of the rating scale early on, but later used it for troubling or confusing recordings only. They also rarely needed to re-listen.

From the feedback provided by the raters, there is support for not only consistency, but also efficiency in rating, at least from a practical point of view. A modest empirical example of the raters' consistency is the outcome of the inter-rater correlations for rater b, and where raters a, b and c assigned similar ratings to the participant sentences. The fact that there was an observable relationship between what
rater a, rater b and rater c were rating signifies that the rater training, the rater site and the rating scale worked well enough across raters to be meaningful. The intra-rater reliability was based on a very small random sample. In retrospect, it would have been advisable to increase the intra-rater sample. However, at the time there was concern for the raters' time constraints, given that the training, rating and repeat rating were to occur within the same time period.
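For readers who wish to replicate the inter-rater and intra-rater comparisons, the computation can be sketched as follows. This is a minimal sketch, assuming a Pearson product-moment correlation over each pair of raters' scores for the same audio samples; the function names and the dictionary layout are illustrative choices, not part of the study's procedure, and no study data are included.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length score lists.

    Returns None when either list has no variance (for example, a rater
    who gave every sample the same score), because r is undefined then.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(scores_by_rater):
    """Correlate every pair of raters, e.g. (a, b), (a, c), (b, c)."""
    return {
        (r1, r2): pearson_r(scores_by_rater[r1], scores_by_rater[r2])
        for r1, r2 in combinations(sorted(scores_by_rater), 2)
    }
```

The same `pearson_r` function could be applied to a rater's first and second ratings of the re-rated sentences to estimate intra-rater consistency, although with a sample as small as ten sentences any coefficient should be read cautiously.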
Figure 7. The Rater's View of the Rating Screen

Positive Impact and Usability

Impact as a concept can be traced to Bachman and Palmer's (1996) adaptation of a framework for language test validation. Language testing and validation have been complex issues for FL and test developers. FL researchers (Bachman, 1990) felt a need to define and interpret construct validity in order to evaluate language competence. With the development of concepts of communicative competence and FL production, the tasks used for language test development and validation became even more challenging to evaluate. Bachman et al. designed a framework based on 'test usefulness' that included aspects representative of communicative competence and ability such as authenticity,
interactiveness and impact. In the context of test usefulness, impact can be considered a quality of validity.

Chapelle (2001) has interpreted Bachman's framework for a number of earlier studies and later applied the concept, successfully, to the design of a CALL evaluation framework. Chapelle suggests that CALL tasks must be evaluated, both empirically and judgmentally, for usefulness and impact. She outlines criteria for CALL designs and suggests qualities necessary for a term she refers to as positive impact on learning. Where CALL multimedia includes features that are designed with sound theoretical foundations of SLA and methodological rigor, evidence of FL learning may be possible. Student responses, in terms of their views on designs and in what ways they perceive learning can occur, are valued. These types of responses, when measured either empirically or qualitatively (Jamieson, Chapelle & Preiss, 2004), or using both methods, can be interpreted as having positive impact on student learning and include factors such as interest, motivation, ability to improve skills and to provide practice.

Usability, a technological term employed at various stages of software design and development (Neilsen, 2004, 2006) and as formative evaluation of a final product, defines many of the qualities of positive impact from a technology viewpoint. Usability refers to the quality of the design as regards users and all factors relating to users, such as ease of use, intuitiveness, clarity or transparency and feedback, among others. All of the factors inherent in FL or CALL software that includes user-oriented features are, theoretically, expected to contribute to or enhance student learning or practice.

The ASR software, a feature of the multimedia TMM product, can be interpreted, based on the findings from this study, both empirical and qualitative, as meeting a test of
usefulness and usability. Despite the fact that the study was conducted on second generation computers, the software was determined, by student participant users, to provide a learning environment that did meet the criteria for positive impact. Students were responsive to the software and perceived that it could be useful as an FL resource and, further, that it could provide incentive and support for their French language learning, pronunciation and speaking abilities.

Limitations and Directions for Future Research

ASR Score Validity

The main focus of the inquiry into the ASR scores generated by the software was to investigate whether it is possible to make comparisons between rater scoring and the ASR software scoring and thus to determine if the ASR scores are, in any way, valid measures or statements of French sentence pronunciation. The significant problem with the comparison is that while raters can be correlated with each other and themselves, the ASR can only truly be correlated with the rater scoring. The ASR scores, in this study, could not be cross-checked from time 1 to time 2 to establish a correlation for the ASR as a measure of the internal consistency of the ASR software. In order to check this consistency, the ASR audio participant sentences (or a random selection) would have to be re-entered and re-processed through the ASR software for re-ASR scoring. This idea, while technically possible, was abandoned based on practical considerations, as it is unlikely that the same conditions, with the same audio files, could be re-created for consistent measures. For example, the audio files were converted from .ogg to .mp3 format for insertion into the rater website. Any further conversion would
certainly result in a reduction in the quality of the audio and could thus only create more unexplained variability.

One way used to establish a consistent and reliable comparative measure for the ASR scores was to have the expert raters produce the ten French sentences and record the ASR scores assigned to their pronunciations. Table 10 is a record of the rater scores for each of the ten sentences. While not a direct one-to-one relationship, the consistency of the expert rater ASR scores does lend credibility to the high ASR score = 7 for a near-native pronunciation.

Table 10
Rater ASR Scores for Sentences s1-s10

Rater      s1  s2  s3  s4  s5  s6  s7  s8  s9  s10
ra = NNS    7   7   5   6   7   7   7   6   7    7
rb = NS     7   7   7   7   7   7   7   7   7    7
rc = NS     7   7   7   7   7   7   7   7   7    7

Note: NNS = Non-native speaker

Several factors are important to note relative to the ASR software and are factors that may play a part in or impact the validity of the scores, strictly from the software perspective. First, the ASR software consists, generally, of a three-part system: a speech recognizer, an acoustic model and a language model. The speech recognizer's judged effectiveness is based on an error rate, when working together with the other two
separately designed features, in the case of foreign language: the acoustic and language models. In response to a question regarding error rate (What is the speech recognition error rate for the recognizer used for the ASR in TeLL me More?), Auralog developers responded with the following information:

The error rates that are usually used in speech recognition engines are related to native speakers speaking in their own language. The speech recognition engine that Auralog is using has an accuracy rate of 99% (error rate of 1%). However this does not reflect the reality of the language learning domain where the speaker is not a native. If a sentence that is extremely poorly pronounced is misrecognized, is it an error from the speech recognition engine (that should account in the error rate) or the consequence of the bad pronunciation? We do not have a simple error rate that would show our performance in our specific domain. We use multiple different indicators that measure the various possible problems: an average pronunciation that gets a good score, a good pronunciation that gets a poor score, etc. We try to reduce them all knowing that they are contradictory and that the final tuning is based on a good balance between all of them. (Personal communication: Auralog Developers and Tim O'Hagan, May 2009).
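The two indicators named in the Auralog response can be illustrated with a short sketch. This is not Auralog's method, which was not disclosed; it assumes that a human judgment of each pronunciation is available on the same 1-7 scale as the ASR score, and the function name and the `good` and `poor` cut-offs are hypothetical values chosen only for illustration.

```python
def mismatch_rates(pairs, good=6, poor=3):
    """Proportion of samples showing each kind of ASR/human mismatch.

    pairs: list of (human_score, asr_score) tuples on a 1-7 scale.
    good, poor: hypothetical cut-offs; a score >= good counts as "good",
    a score <= poor counts as "poor".
    """
    if not pairs:
        raise ValueError("at least one (human_score, asr_score) pair is required")
    n = len(pairs)
    # A less-than-good pronunciation (per the human) that the ASR scores as good.
    over_scored = sum(1 for human, asr in pairs if human < good and asr >= good)
    # A good pronunciation (per the human) that the ASR scores as poor.
    under_scored = sum(1 for human, asr in pairs if human >= good and asr <= poor)
    return {"over_scored": over_scored / n, "under_scored": under_scored / n}
```

Making the recognizer stricter would tend to lower the first rate and raise the second, and vice versa, which is one way to read the developers' remark that the indicators are contradictory and must be balanced.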
As described by Auralog, error rate and accuracy rate are relative and technical terms when applied to ASR, and interpretation involves variable and complex factors. Accuracy rate is, at best, based on a combination of diverse factors: programmed recognizer algorithms, modeling, human voices and languages used for testing. Foreign languages and component language models have continually created obstacles for speech recognition researchers (Rodman, 1999). Researchers are consistently working to test and reduce the error rates, and in some products the reported error rate is very low. Thus, error rate is generally a given feature of ASR software depending on numerous and complex variables. The recognizer error rate is usually reported, and determined, by the developer and distributor, in the case of commercial software products (Rodman, 1999).

The question remains: how is the ASR score generated from a participant's sentence pronunciation to the interface screen? Context is an important factor, as the commercial software has the ASR embedded within, or as a feature of, the product. Contextual factors such as type of computer, sound card efficiency, noise, headset with microphone and location all affect the recording environment simultaneously, and apparently the ASR scoring. There is a reported two-second delay from the actual time the sentence is pronounced to the visual on-screen representation of the score. Is this the time needed for the ASR algorithm to produce an evaluation for the sentence pronunciation score? Realistically, and based on the feedback from Auralog, it was reasonable to hypothesize a correlation for the ASR scores of r = .99 within the context of the study.
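To make the hypothesized coefficient concrete, the following minimal sketch shows how a Pearson r between two scoring passes over the same audio files might be computed. The function name `pearson_r` and both score lists are hypothetical illustrations; the values are not data from the study, and the sketch does not describe how the TeLL me More software itself produces or compares scores.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson product-moment correlation of two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant list (for example, every score equal to 7) has no variance,
        # so r is undefined rather than 0 or 1.
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical ASR scores on the 3-7 scale for the same ten audio files,
# scored on two occasions (illustrative values only, not study data).
asr_time1 = [7, 5, 3, 6, 4, 7, 3, 5, 6, 4]
asr_time2 = [7, 5, 3, 6, 4, 7, 3, 5, 6, 5]

print(round(pearson_r(asr_time1, asr_time2), 2))
```

A coefficient near 1 in such a comparison would indicate that the second scoring pass reproduces the first almost exactly. The explicit check for a constant list matters for score sets in which every sentence receives the same value, since no correlation can be computed for them.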
In an early overview of speech-interactive CALL, Wachowicz and Scott (1999) evaluated one of Auralang's (Auralog, 1995) ASR software activity features. The researchers found that the listen-and-imitate style allowed for word boundaries and pauses in sentences and, in their estimation, eliminated the possibility of ASR errors. To their credit, Wachowicz and Scott (1999) caution reviewers: "However, we did not test the Auralang activity with students, nor did we find teachers familiar with the software who could give an impressionistic assessment" (p. 268).

Other important factors that can affect ASR scoring are ambient or background noise and microphone quality. Microphone quality is such a significant factor that companies suggest the type of microphone to use or even send an actual microphone with the product. Although to some extent these features can be controlled, they still impart variability. For this reason, when testing any equipment related to ASR software, the same exact conditions must apply for measurement purposes (Rodman, 1999). Another factor, posited by Wachowicz and Scott (1999), that realistically and practically may make the most sense is as follows: "To put recognizer errors in perspective, we would note that even human teachers cannot guarantee 100% accurate recognition of students' utterances -- particularly in noisy, crowded classrooms, as anyone with language teaching experience knows" (p. 270).

Finally, and particularly for this study, the second generation computers and sound cards used for collecting the audio files were clearly technologically inefficient and undoubtedly contributed to cases of poor quality audio. It can be assumed that re-scoring
audio files using an already inferior system would not have added to the consistency of the ASR scores. On the other hand, it seems logical to assume that a computer-based system, internally programmed (algorithmically), similar to the commercially produced TMM ASR software, given the same audio file recording circumstances and optimally reproduced conditions, would produce a highly, if not completely, correlated pronunciation score. This was an assumption made for this study and was based on the ASR score support of the expert raters and the explained details relative to the ASR and related technologies.

According to a report by Balough et al. for Harcourt Assessment (2006) on their Versant for English test, an ASR-scored speaking test, the ASR score validation process used by Harcourt employed three metrics, one of which included a correlation to human scores. The correlations reported for fluency and pronunciation (two of four sub-scores) were strong (r = .89 for both score types) and, according to Harcourt, suggest that machine ASR-generated scores for Versant for English systematically correspond to human-generated scores (p. 8).

In order to strengthen the validity analysis for evidence relating to the ASR scores, it would be necessary to also increase the statistical power by increasing both the total number of audio files produced and the number produced by each participant. For example, increasing the total participants to 50 with 20 sentences each would result in 50 x 20 = 1,000 audio samples (20 audio samples per participant). In the Versant English Harcourt Assessment (2006) study the total sample number reported was over 1,000,
although it was not clear whether this number represented total participants, audio files, sub-scores or overall scores.

ASR and Rater Correlations

The correlations conducted for the study are summarized in Table 11 and are organized in descending order of magnitude. The summary indicates that the ASR could be assumed to be, for practical purposes, almost completely correlated. Rater b moderately correlated on an intra-rater correlation and in this study rater b was the most consistent compared to all other measures. While all of the correlations are significant (p value), the degree of relationship as evidenced by the consistency of raters' scores and the ASR scores indicated that the raters were more consistent amongst themselves as compared to evidence of their scoring consistency relative to the ASR scores.

From a quantitative perspective, given the parameters of correlation, the ASR scores and the rater scores would not appear to be related. Although statistically significant at a p < .0001 value, the correlations indicate that the ASR scores with the rater scores are below r = .50. For this study it is important to contextualize these coefficient measures, especially because no previous measures comparing a commercial ASR automated scoring product with expert raters have appeared in the literature. However, numerous studies are in press and many more are yet to be reported. What the studies (Eskenazi, 2009; Zechnar et al., 2009) are reporting are correlations, and in some cases correlations with expert or human raters. For example, the Versant Harcourt Assessment automated scoring and speech recognition system, using speech collected through a telephone and
directed to the ASR system, reports a correlation of r = .89 between human raters and the automated scores. The reported correlation was arrived at through a comparison to a combination of sub-scores. Would the correlations have been lower had only the individual pronunciation sub-scores been correlated? It is unclear as to the actual sample used for the reporting.

The reporting of correlations for FL studies is standard, yet how can these correlations be evaluated within the context of ASR and FL human rating? In a recent study conducted by the Educational Testing Service (ETS) (in press), researchers were interested in testing the ASR-automated scoring system developed to evaluate the TOEFL-iBT speaking test data. ETS's speech recognition system, SpeechRater v1.0 with automated scoring, generated scores for the TOEFL-iBT speaking practice test that were then compared to human raters. A correlation of r = .57 was considered acceptable and reasonable for a practice test and practice environment. ETS had thousands of speech data samples available for testing their system and for the study.

ETS has an interest in developing the automated system and providing valid test measures for the high-stakes testing environment and users. ETS employs and contracts renowned and expert FL researchers who work collaboratively on developing and testing the automated scoring systems. Beginning with the release of the new TOEFL-iBT in 2005, the inclusion of the speaking test section requires ETS to train and employ expert raters for rating the speaking test samples. The expected rating score turnaround of seven days to score users is more than acceptable given the time and expense needed to employ and train a large pool of expert raters.
If ETS researchers are reporting correlations for ASR and automated scoring of speech or speaking of r = .57, how will the profession interpret the correlations? Perhaps more important for our profession, how will ASR studies be replicated? For the current study, the parameters of the correlations have been defined and are replicable. The correlations reflect the exact ASR-produced raw scores, produced by beginner students from within a lab-networked commercial system, correlated or compared to expert rater raw scores, without averaging or excluding scores on any dimension. Given these considerations and these examples, the correlations for this study can be considered sufficient and acceptable for a pronunciation practice environment. Where research-based, in-house designed ASR systems are used, an advantage is that researchers have large amounts of readily available speech data to both model and test systems.

The purpose of ordering the correlations (Table 11) is to demonstrate a meaningful conclusion while pointing to the current limitations of using correlation as a determinant of score relationships. The ASR scores were not compared from time 1 to time 2 either functionally or practically across participants. The rater-assigned scores across participants, when compared to the ASR score evidence, demonstrated that there was no identifiable relationship. Raters, when compared to each other and themselves, were likewise found not to correlate on the ASR assigned scores, rater b presenting an exception. Clearly, rater b was the most reliable and perhaps experienced rater across measures.

Many situations where raters are used involve using only two raters, where a third rater is used as an arbitrator for questionable ratings, depending on the study design. Where more than two raters are used (10 raters) the third rater is usually designated for
the same purpose. Given that three raters were used for the study, it might be of interest for future research to work with the current rater (ra, rb and rc) scores and identify a third rater arbitrator, then correlate ratings of adjusted ra and rb scores with ASR scores. Another consideration would be to train two or three new French expert raters (NS only or NNS only) and have them rate the 300 audio samples.

Eskenazi (in press, 2009), in an article entitled An Overview of Automated (ASR) Scoring in Education, suggests that communication and the development of domain understanding among diverse groups of researchers is necessary. A clear, concise presentation of findings can assist professionals with interpreting results and designing practical applications.
However, are there other contextual factors impinging on the actual numerical correlations? Mackey and Gass (2005) argue for interpreting meaningfulness in foreign language studies where language may be in a specific developmental stage associated more with a regressive period rather than identifiable progressive learning stages.

Table 11
Magnitude: ASR and Rater Correlations

Correlation      Pearson r    p value
ASR and ASR      r = .99      p < .0001
rb and rb        r = .71      p < .0193
rb and ra        r = .57      p < .0001
rb and rc        r = .49      p < .0001
rc and rc        r = .46      p < .0001
ASR and rb       r = .45      p < .2027
ra and ra        r = .42      p < .2260
rc and ra        r = .40      p < .0001
ASR and ra       r = .36      p < .0001
ASR and rc       r = .30      p < .0001

Even
before actual FL learning theories were espoused, linguists were grappling with the mysteries of language learning. Bloomfield (1933), an early American linguist, in his seminal work Language, described simple qualities of language such as word stress that can affect not only pronunciation, but comprehension as well. In English, for example, contrasts can be used to express stress at the beginning of a word, as in a name and an aim. French uses no stress phonemes, and cannot in this way mark its word units (Bloomfield, p. 182).

Could something as simple as word stress used by English learners have been applied to sentence pronunciations by beginning French learners? In this case would the misplaced stress have affected the ASR scores students received? Would expert raters, having had experiences with French learners, have accommodated, intuitively, for these factors in their rating? Would these factors, human factors, be accounted for in a correlation?

The human factors in this study, as expressed through the rater scores and correlations of the scores among the raters, point to the fact that raters were more consistent among themselves than they were with the machine-generated scores. Generally the problems for foreign language raters are: 1) it is difficult to use scales for rating fluency, 2) it is difficult to train raters to use the scales, 3) where raters are trained effectively, it is difficult to ensure consistent use of the scales for rating FL pronunciation, 4) raters are conscientious and want to rate effectively and judiciously, but are often fatigued, 5) over an extended time raters may become distracted or report rating more strictly or more leniently. Given what we know to be true of the problems for FL raters, especially for rating pronunciation, coupled with what we have learned about ASR scoring, added to what students have reported about their use of the ASR software, it is
reasonable to conclude that ASR scoring was, on average, comparable to human raters at scoring participants' French sentence pronunciations.

What is needed is a more robust study as regards statistical power, that is, a larger N, and a statistical tool that can account for more of the variability observed in the study. McNamara (1996) suggested the use of Rasch measurement tools where the individual aspects of each dimension investigated can be independently measured and compared. For example, the raters could be evaluated for degree of leniency or harshness in scoring participants. Individual French sentences could be analyzed by participant for high (7) or low (2-3) ASR scores. With the computing power available for large numbers of operations and comparison measures, Rasch is a tool suggested by numerous researchers (Weir, 2005; Bachman, 2005). However, there is a learning curve for a study designed using Rasch methods, and special statistical tools are necessary. Likewise, the numerical parameters of each measurement are important and even trained measurement researchers collaborate on Rasch-designed studies. Depending on the research orientation there are other tools that can concretely address the limitations existing in an ASR study.

Questions Related to FL Raters and Training

Using expert raters to evaluate FL speech, referred to as the gold standard (Eskenazi, 2009) where human judgments are used, has been a consistently difficult task to validate, given human factors. However, it is the one that has been and continues to be used for FL evaluation of pronunciation and speech, especially within the classroom
setting. In some cases, tests have been designed around the use of expert raters. The ACTFL Oral Proficiency Interview (OPI) is one example.

As noted earlier in chapter 5, there were questions about the adapted scale used for rating. In a future study or replication, the adapted rating scale needs revision based on suggestions made by the raters, especially if sentence-length pronunciations are to be rated. The short length of the audio recording contributed to the difficulty of listening quickly and with extreme concentration. Another challenge for the raters included using a numerical scale and deciding whether the sentence included the sound or phoneme variables heard in the pronunciation.

The findings indicate that future studies using human raters would benefit from having the raters trained together to encourage interaction among the raters. Raters interacting with each other during a training session by providing feedback regarding the technology, the rating scale and practice rating sample audio files together would provide a collaborative and perhaps more consistent training environment across raters. Each of these features was included in the rater training provided to raters a, b and c, but they were done individually with each rater. Determining the time factors and constraints from real-time rater experiences was meaningful and can provide useful insights for future research.

It is also a question whether raters could be trained separately from the rating experience, where two or three separate sessions would be scheduled: 1) a training session, 2) a rating session and 3) a re-rating session. As suggested, raters a and b may have had an advantage as they rated for a pilot study (2008). In cases where more than two raters are used, pilot study training would be useful for each rater.
The rating scale adapted for the study could also be further scrutinized and tested by the raters. Although the scale was adapted after the pilot, more work needs to be done to validate adapted scales, especially where FL pronunciation and speaking are concerned. High-stakes testing and the resources available to large testing bodies (ETS and IELTS) will hopefully contribute to rating scales for use by researchers and teachers in the future. The International English Language Testing System (IELTS, 2008) has developed a scale for pronunciation that is being tested and validated, which could significantly contribute to high-stakes FL pronunciation evaluation.

Conclusion

If participant perceptions and feedback from users are considered as evidence for positive impact (Chapelle, 2001) and usability (Neilsen, 2006), then a strong case for using CALL designs that include ASR for student pronunciation and speaking practice has been made. Participant voices and perceptions were expressed clearly through their audio productions and their descriptive comments regarding their own learning experiences.

Not only participant perceptions but also statements of their needs for pronunciation and speech learning support the need to include FL language learning ASR software as an integral component of speaking and pronunciation skill practice. Students need to have tools available for FL learning in the classroom, lab or at home. This becomes even more critical for students engaged in distance or hybrid learning, where students need immediate access to FL native models and examples.

In the interest of addressing the larger FL community and CALL researchers in particular, it is reasonable to conclude that much work still needs to be done in the area of
speech-interactive products using ASR. While the findings of the study indicated that there are inherent limitations to CALL designs using ASR features, the results point to the fact that the long-standing FL tradition of using trained expert human raters for rating speech or pronunciation may, in this case, be comparable to the ASR-produced scores. Using machine-scored ASR software to assist FL learners with pronunciation, given the problems experienced by expert raters, would: 1) eliminate the need for creating, testing and training raters to use scales, 2) ensure that a consistent measure was used for each student, 3) eliminate worries about rater fatigue, 4) reduce distractions and human factors that could interfere with reliable rating.

What we know to be important in any form of feedback, and especially feedback for beginner French learners, applies to the ASR-generated feedback. Students want to engage in interaction, to practice and have consistent feedback for improvement. From this perspective the machine can undertake these tasks much more reliably and consistently. Based on conclusive evidence, it is almost inexcusable not to make ASR available to FL learners. ASR can only add to the speech-interactive CALL resources, when they can be made accessible to learners, a technologically sophisticated group of foreign language students.
References

Alderson, J.C., Clapham, C. & Wall, D. (2003). Language Test Construction and Evaluation. Cambridge Language Teaching Library. Cambridge: Cambridge University Press.
Andrews, D. (1991). Developmental sequences in the acquisition of French L2: A study of adult learners. Thesis (PhD. French), University of Illinois, Urbana-Champaign. UM Press: Ann Arbor, MI.
Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. & Palmer, A. (1996). Language testing in practice. Oxford: Oxford University Press.
Bachman, L. (2000). Modern language testing at the turn of the century: assuring that what we count counts. Language Testing, 17 (1), 1-42.
Bachman, L. (2002). Alternative Interpretations of Alternative Assessments: Some validity issues in educational performance assessments. Education Measurement: Issues and practice. Fall 2002, p. 5-18.
Bachman, L. (2005). Statistical Analyses for Applied Linguistics. Cambridge: CUP.
Balough, J., Bernstein, J., Suzuki, M., Subbarayan, P. and Lennig, M. (2006). Automatically scored spoken language tests for air traffic controllers and pilots. Harcourt Assessment, Menlo Park, CA.
Barr, D., Leakey, J., & Ranchoux, A. (2005). Told like it is! An evaluation of an integrated oral development pilot project. Language Learning and Technology, 9 (3), 55-78.
Blake, R. (2008). Brave new digital classroom: Technology and foreign language learning. Georgetown University Press: Washington D.C.
Bonk, C.J. & Graham, C. R. (Eds.) (2006). Handbook of blended learning: Global perspectives, local designs. San Francisco, CA: Pfeiffer Publishing.
Bloomfield, L. (1933). Language. NY, NY: Holt.
Brown, J.D. (2001). Using surveys in language programs. Cambridge: CUP.
Brown, Roger. (1968). Words and Things. NY: US Free Press.
Burston, J. (2005). International Association of Language Learning Technology (IALLT) Listserv comment.
Canale, M. & Swain, M. (1980). Theoretical Bases of Communicative Approaches to Second Language Teaching and Testing. Applied Linguistics 1 (1). 1-47.
Chapelle, C. (2001). Computer applications in second language acquisition. Cambridge: Cambridge University Press.
Chapelle, C. and Douglas, D. (2006). Assessing Language through Computer Technology. Cambridge Language Assessment Series. Editors: Alderson, J.C. & Bachman, L. CUP.
Chapelle, C. (2007). Technology and Second Language Acquisition. ARAL, 27, 98-114.
Chapelle, C., Chung, Y-U. & Xu, J. (Eds.) (2008). Towards adaptive CALL: Natural Language Processing for diagnostic language assessment (pp. 102-114). Ames, IA: Iowa State University.
Chomsky, Noam. (1957). Syntactic Structures. The Hague, Netherlands: Mouton.
Chun, D. (2002). Discourse Intonation in L2: From theory and research to practice. Philadelphia, PA: John Benjamins.
Chun, D. (1998). Signal analysis software for teaching intonation. Language Learning and Technology, 2 (1), 74-93.
Chun, D. (1994). Using computer networking to facilitate the acquisition of interactive competence. System 22 (1): 17-31.
Davies, G. (2006). Information and Communication Technologies for Language Teachers (ICT4LT). Slough, Thames Valley University [Online]. Available from: http://www.ict4lt.org/en/enmod 3-5 .htm [Accessed: 2, February, 2008].
Delattre, R. (1967). Modern language testing. NY: Harcourt-Brace, Inc.
Derwing, T. & Munro, Murray. (1997). Accent, Intelligibility, and Comprehensibility. Studies in Second Language Acquisition 19. 1-16.
De Saussure, F. (1959). Course in General Linguistics. Bally, C. & Sechehaye, A. (Eds.) (W. Baskin, Trans.). New York: McGraw-Hill. (Original work published 1915)
Develle, S. (2008). The Revised IELTS Pronunciation Scale. Cambridge ESOL Research Notes: Issue 34, November 2008.
Dornyei, Z. (2003). Questionnaires in second language research. Construction, administration, processing. Mahwah, NJ: Erlbaum.
Doughty, C. and Long, M. (2003). The Handbook of Second Language Acquisition. Blackwell. Malden: MA.
Ducate, L. & Arnold, N. (eds.). (2006). Calling on CALL: From Theory and Research to New Directions in Foreign Language Teaching. CALICO Monograph Series, Volume 5. San Marcos, TX: CALICO.
Ellis, R. (1994). The study of second language acquisition. Oxford: Oxford University Press.
Eskenazi, M. (in press, 2009). Overview of spoken language technology in education. Speech Communication.
Eskenazi, M. (1999). Using a computer in foreign language pronunciation training: What advantages? CALICO Journal, 16 (3), 447-469.
Fifth Generation Computer Corporation (2008). Speaker Independent Connected Speech Recognition [Online]. Available from: http://www.fifthgen.com [Accessed: February 2, 2008].
Fulcher, G. (2003). Testing Second Language Speaking.
Fulcher, G. and Davidson, F. (2007). Language Testing and Assessment: An advanced resource book. Routledge: UK.
Gass, S. and Mackey, A. (2007). Data Elicitation for Second and Foreign Language Research. Lawrence Erlbaum: NJ.
Gupta, P. & Schulze, M. (2007). Human Language technologies (HTL). Module 3.5 in Davies, G. (ed.) Information and Communication Technologies for Language Teachers (ICT4LT). Slough, Thames Valley University [Online]. Available from: http://www.ict4lt.org/en/index.htm [Accessed: 2, February, 2008].
Handley, Z., & Hamel, M. J. (2005). Establishing a methodology for benchmarking speech Language Learning and Technology, 9 (3), 99-120.
Hardison, D. (2004). Generalization of computer-assisted prosody training: Quantitative and qualitative findings. Language Learning & Technology, 8, 34-52.
Hatch, E. & Lazaraton, A. (1991). Statistics for applied linguistics. Heinle & Heinle.
Hémard, D. (2004). Enhancing online CALL design: The case for evaluation. ReCALL, 16 (2), 502-519.
Hincks, R. (2003). Speech technologies for pronunciation feedback and evaluation. ReCALL, 15 (1), 3-20.
Horowitz, M.B., Horowitz, E. K. & Cope, J.A. (1986). Foreign language classroom anxiety. Modern Language Journal 70 (2), 125-132.
Hubbard, P., & Siskin, C. (2004). Another look at tutorial CALL. ReCALL 16 (2), 448-461.
Hughes, A. (2002). Testing for language teachers. Oxford: Cambridge University Press.
Jamieson, J., Chapelle, C., & Preiss, S. (2004). Putting principles into practice. ReCALL, 16 (2), 396-415.
Jenkins, J. (2000). The Phonology of English as an International Language. Oxford Applied Linguistics Series. Oxford: Oxford University Press.
Jonassen, D. (2004). The Handbook on Educational Communications and Technology. 2nd Ed. Erlbaum. Mahwah: NJ.
Jurafsky, D. and Martin, J. (2000). Speech and Language Processing: An Introduction to Natural Language processing, Computational Linguistics and Speech Recognition. Prentice Hall: NJ.
Lafford, B. (2004). Review of TELL ME MORE Spanish. Language Learning and Technology, 8 (3), 21-34.
Ladefoged, P. (2001). Vowels and Consonants: An Introduction to the Sounds of Language. Oxford, UK: Blackwell.
Lado, R. (1964). Language Testing. NY: McGraw-Hill.
Leloup, J. & Ponterio, R. (December, 2003). Second language acquisition technology: A review of the research. Center for Applied Linguistics EDOFL-003-11, Washington, D.C.
Lepage, S. and North, B. (2005). Guide for the organisation of a seminar to calibrate examples of spoken performance in line with the scales of the Common European Framework of Reference for Languages. Council of Europe, Language Policy Division. Strasbourg: FR.
Levis, J., & Pickering, L. (2004). Teaching intonation in discourse using speech visualization technology. SYSTEM, 32, 505-524.
Levis, J. (2007). Computer technology in teaching and researching pronunciation. ARAL 27, 184-202.
Levy, M. & Stockwell, G. (2006). CALL dimensions. Options and issues in computer-assisted learning. Mahwah, NJ: Erlbaum.
Luoma, S. (2004). Assessing speaking. Cambridge: CUP.
Mackey, A. and Gass, S. M. (2005). Second Language Research: Methodology and Design. Mahwah, NJ: Erlbaum.
Malone, M. (October, 2007). Oral proficiency assessment: The use of technology in test development and rater training. CALdigest, Center for Applied Linguistics: Washington, D.C.
McNamara, T. (2006). Validity and values. In M. Chalhoub-Deville, C. Chapelle and P. Duff (Eds.), Inference and Generalizability in Applied Linguistics: Multiple Perspectives (pp. 27-45). Philadelphia, PA: John Benjamins.
McNamara, T. and Roever, C. (2006). Language Testing: The Social Dimension. Language Learning Monograph Series. Blackwell: Oxford, UK.
McNamara, T. (1996). Measuring second language performance. NY, NY: Longman.
McIntosh, S., Braul, B., & Chao, T. (2003). A case study in asynchronous voice conferencing for language instruction. Education Media International 40 (1-2), 63-74.
Miles, M. and Huberman, A.M. (1994). Qualitative data analysis. Thousand Oaks, CA: SAGE.
Molholt & Pressler (1986). Correlation between human and machine rating of tests of spoken English reading passages. In C. Stansfield (ed.) Technology and Language Testing. VA: TESOL.
Neilsen, J. (2000). Designing web usability. Indianapolis: New Riders.
Neilsen, J. & Loranger, H. (2006). Prioritizing web usability. Berkeley, CA: New Riders.
Neri, A., Mich, O., Gerosa, M. and Guiliani, D. (2008). The effectiveness of computer assisted pronunciation training for foreign language learning by children. Computer Assisted Language Learning, Vol. 21, No. 5, December 2008, 393-408.
O'Brien, M. G. (2006). Teaching pronunciation and intonation with computer technology. In Ducate, L. & Arnold, N. (eds.) Calling on CALL: From Theory and Research to New Directions in Foreign Language Teaching. CALICO Monograph Series, Volume 5. San Marcos, TX: CALICO.
Oller, J. (1979). Language tests at school. London: Longman.
Patton, M. (2002). Qualitative research and evaluation methods. Thousand Oaks, CA: Sage.
Pennington, M.C. (Ed.) (2007). Phonology in context (pp. 135-158). New York: Palgrave Macmillan.
Plass, J.L. (1998). Design and Evaluation of the User Interface of Foreign Language Multimedia Software: A Cognitive Approach. Language Learning and Technology 2 (1). 40-53.
Proudfoot, W. (2004). Varieties of religious experience by William James: Introduction and Notes. Barnes and Noble Classics Revised Edition (1902).
Raphael, L., Borden, G. and Harris, K. (2003, 4th Ed.). Speech Science primer: Physiology, Acoustics and Perception of Speech. Lippincott: Baltimore, MD.
Reeser, T. (2001). CALICO Software Review: TeLL me More French. March, 2001.
Rodman, R. (1999). Computer Speech Technology. Artech: MA.
Savignon, S. (1997). Communicative competence: Theory and classroom practice. 2nd ed. New York: McGraw-Hill.
Smith, B., & Gorsuch, G. (2004). Synchronous computer mediated communication captured by usability lab technologies: new interpretations. SYSTEM, 32, 553-575.
Stansfield, C. (September, 1992). ACTFL Speaking proficiency guidelines. CALdigest, Center for Applied Linguistics: Washington, D.C.
Strauss, A. & Corbin, J. Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory. 3rd Edition. London: Sage.
Tashakkori, A. and Teddlie, C. Editors, (2003). Handbook of mixed methods in social and behavioral research. Thousand Oaks, CA: SAGE.
Tatton, H. (2007). Evaluating the ACTFL Proficiency Guidelines Speaking Section. Teacher's College, Columbia University Working papers in TESOL & Applied Linguistics, 2007, Vol. 7, No. 2, The Forum.
Underhill, N. (1987). Testing spoken language: Handbook of oral testing techniques. Oxford, UK: Cambridge University Press.
Valdman, A. (1976). Introduction to French Phonology and Morphology. Newbury House: MA.
Valette, P. (1965). Comparing the phonetic features of English, French, German and Spanish. London: G. Harrap.
Wachowicz, K. and Scott, B. (1999). Software that Listens: It's not a Question of Whether, It's a Question of How. CALICO Journal, 16 (3), 253-276.
Warschauer, M., Turbee, L., & Roberts, B. (1996). Computer learning networks and student empowerment. SYSTEM, 24, 1-14.
Warschauer, M. (1997). Computer-mediated collaborative learning: Theory and practice. Modern Language Journal, 81 (3), 470-481.
Weinberg, A. & Knoerr, H. (2003). Learning French pronunciation: Audiocassettes or multimedia. CALICO Journal, 20 (2), 315-336.
Weir, C. (2005). Language testing and validation: An evidence-based approach. London: Palgrave Macmillan.
Xi, X. & Mollaun, P. (April, 2006). Investigating the utility of analytic scoring for the TOEFL Academic Speaking Test (TAST). TOEFL iBT Research Report, RR-06-07. Princeton, NJ: ETS.
Xi, X. (2008). What and how much evidence do we need? Critical considerations in validating an automated scoring system. In C.A. Chapelle, Y.-R. Chung & J. Xu (Eds.), Towards adaptive CALL: Natural Language Processing for diagnostic language assessment (pp. 102-114). Ames, IA: Iowa State University.
Yoffe, L. (1997). An overview of ACTFL proficiency interview: A test of speaking ability. JALT Testing & Evaluation SIG Newsletter, 1 (2), September 1997, pp. 39.
Zahra, R. & Zahra, R. (2007). CALICO Software Review: TeLL me More American English Advanced Ver 6. February, 2007.
Zechner, K., Bejar, I. & Hemat, R. (February, 2007). Toward an understanding of the role of speech recognition in nonnative speech assessment. TOEFL iBT Research Report, RR-07-02. Princeton, NJ: ETS.
Zechner, K., Higgins, D., Xi, X., & Williamson, D. (in press, 2009). Automatic Scoring of Non-Native Spontaneous Speech in Tests of Spoken English. Speech Communication, 51 (10), 883-895.
Appendix A
Pilot Projects Using TeLL me More French

Design Considerations and Features

Chun (1998) made several recommendations for future software designs for teaching intonation, and many of the features suggested are technically possible with current versions of ASR. Her suggestion to provide learners with visualization of their intonation patterns and specific contrastive feedback (p. 81) is available with pitch-tracking software. With consideration of the suggestions for research provided by Chun (1998) and by using TeLL me More with ASR software, many of the research tools are available: current software tools can provide audio and visual feedback to the learner and can serve as data collectors, and utterances produced can be saved, ordered and compared over time (Chun, p. 87). However, analysis of natural, spontaneous speech, specifically with ASR, is still limited.

A research design suggested by Chun (1998) addresses the question of the type of ASR feedback provided. The ASR software of TMM can provide corrective feedback for pronunciation errors, through native speaker audio models and visual models of waveforms describing pitch curves. Student performance can be recorded aurally and visually. As well, learner performance can be scored on a scale (3-7), and students can compare their pronunciation to the model provided.

With the permission of Auralog's educational division, a preliminary investigation was conducted to examine the usefulness of the TeLL me More (French) speech recognition software for teaching French pronunciation. The oral productions of French students and an examination of the features of the ASR were evaluated by several
volunteers during a one-semester period, Fall 2005. A short pilot questionnaire was designed to evaluate student perceptions and satisfaction with the TMM learning activities and ASR features. The preliminary findings and feedback from the volunteers were used to inform the procedures and design of the Pilot Study, Spring 2006.

Pilot Project Spring 2006

In Spring 2006, using TeLL me More (TMM) French software for research purposes, a project was conducted in order to examine the Automatic Speech Recognition (ASR) component. The software was installed on fifteen of the twenty-five computers in the foreign language computer lab. University-enrolled French students were oriented to the ASR and the software was available during lab hours. Students were free to use the software to practice their French pronunciation, or to engage with other TMM multimedia activities.

A research group was established; its members cut across disciplines and contributed a variety of background knowledge in French language, teaching and instructional technology, including research in multimedia CALL. A research study was designed to investigate the effectiveness of the speech recognition features of the software, with a specific focus on the efficacy of the ASR for providing students with French pronunciation and speaking practice. Also of interest were student satisfaction and perceptions, as well as language learning, and software design and usability.
Three learning paths: sounds, words, sentences. The tutor tools available with TMM allow for the design of learning paths for practicing specific speech and language features. Pre-designed paths can be directed to specific student accounts. In this way participants were only given access to pre-designed learning paths.

Three learning paths were designed to provide students with samples of French speech for pronunciation practice. Learning path 1 provided pronunciation practice with distinct sounds. For example, if the sound practiced was /l/ then path 1 provided the French word, belle, which includes the sound. Path 2 included words, including sounds practiced in path 1. Path 3 included sentences that provided practice using combinations of previous sounds and words as well as punctuation. Sentences provide a more diverse speech sample because with word/sound orders and punctuation students can practice and demonstrate discourse features of speech such as intonation. Intonation is expressed through the voiced, affective dynamics of a sentence. For example, an exclamation in French can result in a rising, assertive tone at the end of the sentence.

Speech data and ASR scoring. The scores produced by the TeLL me More ASR and assigned to the recorded speech are internally generated by algorithms. The ASR scores are part of the computer-programmed design of the ASR software. Auralog has proprietary rights to the ASR design algorithms and has developed a speech recognition system entitled S.E.T.S, or Speech Error-Tracking System. The ASR S.E.T.S system designed and owned by Auralog is now only part of the multimedia software, an
important feature, but originally (1993) the French company designed and sold only the speech recognition software.

The speech recognition features of the software are able to produce scores (seen on the screen as bars) based on the following scale: 1-2 not rated (considered a threshold level for recording and recognition); 3-4-5-6-7. The ASR scores are produced as a result of an internal, algorithmic comparison of the French sounds to an ASR corpus-base of sounds. Students can also choose to listen to a native speaker example (recorded in the software). A score of 7 represents the most complete score. Figure 8 is a screen shot of the scores or bars in the ASR software as seen on the Speech Recognition activity screen.

Using the three paths outlined above, produced speech samples were collected and recorded for forty participants. Two separate data collection sessions, approximately three weeks apart, were conducted for each participant. For each session, a minimum of five ASR-scored recordings, per example, per path were collected. For example, if for path 1 the sound was /l/ and used in the word belle, then five repetitions of the sound resulted in five ASR scores for belle.
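To make the scale concrete, the following is a minimal sketch of how recorded bar scores could be screened before analysis. It is not Auralog's S.E.T.S algorithm, which is proprietary and not described here. The function names `classify_asr_score` and `rated_scores`, the label strings, and the sample values are hypothetical; the only details taken from the description above are the 1-7 range, the not-rated threshold at 1-2, and 7 as the most complete score.

```python
def classify_asr_score(score: int) -> str:
    """Label a TMM bar score using the scale described above (hypothetical helper)."""
    if not 1 <= score <= 7:
        raise ValueError(f"ASR score must be between 1 and 7, got {score}")
    if score <= 2:
        return "not rated"  # threshold level for recording and recognition
    if score == 7:
        return "highest"    # the most complete score
    return "rated"          # scores 3 to 6


def rated_scores(scores):
    """Keep only the scores that count as rated (3 to 7)."""
    return [s for s in scores if classify_asr_score(s) != "not rated"]


# Five repetitions of "belle" give five ASR scores (values invented for illustration).
belle_scores = [2, 4, 5, 5, 7]
print(rated_scores(belle_scores))
```

Separating the not-rated band from the 3-7 band in this way is one possible design choice; whether scores of 1-2 were excluded from the pilot analyses is not stated in the text.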
Figure 8. ASR Software Score Box

Participant survey. After each of the two data collection sessions an online survey was completed by each participant. The date and participant number were used as identifiers for both the recordings and the survey. The two online surveys provided a qualitative component and an informative and direct link between individual participant speech recordings and related ASR scores. The survey included several categories of questions designed to elicit participant responses on experiences with learning French and the TMM software. Participant perceptions of the speech recognition features and experience using the ASR were of particular interest.

Data collection and analysis. The data have been collected from the FL lab computer that serves as the server for the site licenses. Data cannot be exported or
electronically transferred, which is an acknowledged limitation of the software. A significant drawback for research purposes has been the manually intensive data collection process. As of December 2007, all of the data have been manually transferred, recorded and organized in an Excel data file.

In June 2006 a case analysis of three participants, representing three levels of French language based on years of learning, was conducted. In the three cases the participants' scores and survey data were analyzed and examined to address the following questions: 1) Did a participant's produced speech scores improve across path 1, path 2, and path 3? From session 1 to session 2? 2) What were the identifiable and observed features and interesting characteristics of the scores by learning path? 3) How does the survey submitted for session 1 correspond to and differ from session 2? Are there relationships between the survey reports and the produced speech for each session?

The Spring 2006 performance data continue to be analyzed empirically and qualitatively. The design features of the pilot study have informed the current study design and method. The methodological considerations that were used to examine features of the ASR, speech production and oral performance have been thoroughly considered for the method and corresponding questions.

Pilot Project April 2008

In April 2008 a pilot project was conducted for the purpose of pilot testing the rater site instrument to be used in the current study. Using the sound files of sentences
extracted from the software and collected from participants in a pilot study conducted August 2007, the rater site was pilot tested. The sound files were converted from the open .ogg format to MP3 format and the survey was designed using the audio recordings of 12 participants x 5 sentences x 2 recordings each sentence for a total of 120 sound files for rating.

Human raters. Two French-speaking raters (one native speaker and one French Professor) were trained using a Rater Site (Pilot). Three participants were rated for the training. Raters were instructed in the use of the rubric, which was designed and adapted from the ACTFL Guidelines-Speaking 1999, and practiced rating the three participants. Following the rater training, the raters entered the Rater Site (Beta) and proceeded to rate the 120 audio recordings of sentences.

Following the rating session, raters were asked to enter a final site to answer five questions regarding the rating experience. The training, rating session and question period lasted approximately forty-five minutes.

Analyses. The Rater Site (Beta) data were put into an Excel file for analysis. Using SAS code for a correlation calculation, an analysis was conducted to compare the consistency of the raters' ratings (Rater A and Rater B) as well as the ASR scores provided for the sentences. Preliminary findings indicated that the raters correlated well with each other, that is, their ratings using the rubric were consistent, and that the rubric can be expected to assist the raters in providing an evaluation of the participant audio
recordings. The human ratings of the ASR-scored audio, when compared to the ASR-generated scores, did not show a correlation.

The data revealed interesting patterns that could be considered for future analysis. For example, it was noted that the ASR scores for the first recording were, in many cases, the same for the second recording. As well, the increase in participants and the number of data points, that is an increase of about 180 data points (audio files), contributed to the effect size and the numerical significance of the analyses for the current study.

April 2008 Pilot Instruments

Rater Site (Pilot) location: http://survey.acomp.usf.edu/survey/entry.jsp?id=1207155217939
Rater Site (Beta) location: http://survey.acomp.usf.edu/survey/entry.jsp?id=1204735687244
Rater Questions location: http://survey.acomp.usf.edu/survey/entry.jsp?id=1207772730008
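The consistency check described under Analyses above was run in SAS, and the code is not reproduced in this appendix. As an illustration only, the sketch below shows the same kind of comparison in Python. It assumes a Pearson correlation, which the text does not specify; the function name `pearson_r` and all rating values are invented for the example and are not study data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Invented ratings on the 1-7 scale for six recordings (not study data).
rater_a = [3, 4, 5, 6, 4, 7]
rater_b = [3, 5, 5, 6, 4, 6]
asr = [5, 3, 6, 4, 5, 4]

print("Rater A vs Rater B:", round(pearson_r(rater_a, rater_b), 2))
print("Rater A vs ASR:    ", round(pearson_r(rater_a, asr), 2))
print("Rater B vs ASR:    ", round(pearson_r(rater_b, asr), 2))
```

A coefficient near 1 for the two raters alongside a coefficient near 0 for rater-versus-ASR pairs would correspond to the pattern reported in the preliminary findings. Because the ratings are ordinal, a rank-based coefficient such as Spearman's could reasonably be substituted.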
Appendix B
French Sentences for Pronunciation
Produced using TeLL me More French-ASR software

Beginner Level 10 French Sentences

1) Le R français est tellement charmant !
2) Je ne sais pas comment le dire en français.
3) J'ai encore des progrès à faire.
4) Je fais de gros efforts en ce moment, tu sais !
5) Je ne parle pas très bien le français.
6) J'espère qu'ils parlent anglais !
7) Pouvez-vous répéter ?
8) Oh non, mon vocabulaire est très réduit !
9) Je viens de terminer un cours intensif.
10) En réalité, je voulais améliorer mon français.

Data Set: 10 Sentences
Data Points: Totals: 300 audio files and 300 ASR scores
10 sentences repeated 1x = 10 ASR scores per participant (300 ASR scores in total)
1 sound file per sentence = 10 ASR sound files per participant
10 sound files x 30 participants = 300 sound files/sentence pronunciations for rating for Rater A, Rater B and Rater C.
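The totals above follow from simple multiplication, which the short sketch below spells out. It assumes one ASR score per recorded sound file; the function name `data_points` is hypothetical, and the April 2008 pilot figures (12 participants, 5 sentences, 2 recordings each) are taken from Appendix A.

```python
def data_points(participants: int, sentences: int, recordings_per_sentence: int) -> int:
    """Total sound files (and ASR scores), assuming one score per recording."""
    return participants * sentences * recordings_per_sentence


# Current study, as listed above: 30 participants, 10 sentences, 1 recording each.
current_total = data_points(30, 10, 1)

# April 2008 pilot (Appendix A): 12 participants, 5 sentences, 2 recordings each.
pilot_total = data_points(12, 5, 2)

print(current_total, pilot_total, current_total - pilot_total)
```

The difference between the two totals, 300 - 120 = 180, appears to correspond to the increase of about 180 data points noted in Appendix A.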
English sentence translations:

1) The French R is so delightful!
2) I don't know how to say that in French.
3) I need to make more progress.
4) You know, I am really working hard!
5) I don't speak French very well.
6) I hope they speak English!
7) Could you repeat that please?
8) Oh no, my vocabulary is very limited!
9) I just finished a very difficult course.
10) Honestly, I would like to improve my French.
Appendix C
Numerical/Descriptive Rating Scale
(Adapted from: ACTFL Proficiency Guidelines-Speaking 1999 level descriptors and the Interagency Language Roundtable (ILR) skill level descriptions for Speaking.)

Novice: Ability to speak minimally with only learned material.
1-2 Low/Mid: Undecipherable, unintelligible speech, difficult production and/or poor recording (recording may be chopped off). Difficult to rate the sentence pronunciation sample.
3 Mid/High: Oral production of the sentence is limited and includes frequent errors and faulty sound pronunciation and word stress.

Intermediate: Combining learned elements primarily in a reactive mode.
4 Low: Elementary sentence pronunciation, stress and intonation are generally poor and may be influenced by an L1. Fluency is strained and sentence flow is difficult to maintain.
5 Mid: Pronunciation may be foreign but individual sounds are accurate. Stress and intonation may lack control and exhibit irregular flow. Longer sentences may be marked by hesitations.
6 High: Able to pronounce sentences with a degree of precision. Sound, word and sentence pronunciation and stress are obvious, and effective.
Advanced: Ease of speech, expressed fluency and sentence pronunciation is marked by competent sound and word pronunciation; complete sentence stress and intonation.
7 Low/Mid: Sentence is comprehensible with no discernable errors in sound or word pronunciation. Speech range is native-like and accurate.

Please note: In some cases the sentence audio recording is incomplete. This may be due to factors other than the participant's pronunciation. For example, if there was a longer hesitation (more than 2 seconds) the Automatic Speech Recognition (ASR) may not have recorded the sample completely. It is assumed that the rater training and the good recordings, along with your French expertise and pronunciation ability, will give you a gauge for identifying whether these factors are the result of pronunciation errors or audio recording errors.
Appendix D
Survey Questionnaire

Question 1
If you have studied French, at the university, for only one semester please choose: French I.
If you have studied French, at the university, for 2 semesters choose: French II.
If neither of these categories apply choose: Other

Question 2
How many years have you studied French? [One semester = .5 years]
Is there anything about your French studies that you would like to explain further?

Question 3
Do you usually feel comfortable using computers and new software?

Question 4
Have you used language learning software or speech recognition software before?

Questions 5a and 5b
Many people are anxious when speaking aloud in any situation. Could you please describe two situations:
a) One situation where you are or have felt comfortable speaking French.
b) A second situation where you feel or have felt anxious, self-conscious or uncomfortable speaking French.

Question 6
How would you describe your feelings about the experience recording your pronunciation of the French sentences with the ASR software today?
Question 7 a, b, c, d
Based on the French pronunciations that you recorded today did you find it helpful:
a) To listen to your own pronunciations?
b) To listen to the native speaker model?
c) To look at the visual graphs?
d) Please tell us more about your impressions and your experience with these software features.

Question 8 a, b, c
TeLL me More ASR provides a score (represented by 'green bars') for each of your sentence pronunciations:
a) Were the ASR scores helpful for your pronunciation?
b) Do you think the ASR scores reflect your French pronunciation ability?
c) We are most interested in the ASR score as it is presented to you. Could you please comment on exactly what this score meant to you?

Question 9
How helpful is the ASR software feedback (scores, native model, visuals) compared to other feedback you have received on your French pronunciation and speaking?
Could you please comment about how this software's feedback has been more or less helpful to you?
Question 10a and 10b
Could you describe a time when you had your French speaking or pronunciation rated or evaluated by your French teacher?
a) How do you feel about having your French speech or pronunciation rated by an expert human rater (either known or unknown to you)?
b) Would you want the rater to be a native French speaker?

Final Comment:
Please take a minute or two to add anything more you want to say about your experience at the lab today or about the ASR software.
Appendix E
Survey Question 10 Qualitative Coding Example

The notes for the coding of survey Q10 were handwritten and scanned to a PDF document. The PDF is attached to the electronic document. The printed documents are attached with this document.
a) Coding memo notes: page 64
b) Memo draft diagram: page 65
c) Diagram: page 66
d) Theoretical note: page 67
About the Author

Deborah J. Cordier, Ph.D. in Second Language Acquisition/Instructional Technology (SLAIT), University of South Florida, Tampa (2009). Research interests include CALL and interactive technologies for foreign language speaking, pronunciation and FL testing; automatic speech recognition (ASR), developing research and related trends in NLP and human-computer interaction (HCI). Online community involvement (2005) working worldwide with researchers and academics to develop awareness and materials for Open Education Resources (OER). Previous experience includes French, eleven years in K-12 international education and work with IB programs; Translation studies (Fr-Eng) certificate and RSA-CTEFLA certificate.
E14-SFE0003172
Speech recognition software for language learning: toward an evaluation of validity and student perceptions [electronic resource] / by Deborah Cordier.
[Tampa, Fla] : University of South Florida,
Title from PDF of title page.
Document formatted into pages; contains 176 pages.
Dissertation (Ph.D.)--University of South Florida, 2009.
Includes bibliographical references.
Text (Electronic dissertation) in PDF format.
ABSTRACT: A renewed focus on foreign language (FL) learning and speech for communication has resulted in computer-assisted language learning (CALL) software developed with Automatic Speech Recognition (ASR). ASR features for FL pronunciation (Lafford, 2004) are functional components of CALL designs used for FL teaching and learning. The ASR features available with the TeLL me More French software provide pronunciation, intonation and speaking practice and feedback. ASR features are examined quantitatively through French student performance of recorded ASR-scored speech, and the ASR scores are compared with human ratings of the same produced speech samples. The comparison of ASR scores to human ratings considers the validity of ASR-scored feedback for individualized and FL classroom instruction. Student performances and perceptions of ASR are evaluated qualitatively using an online survey linked to individual pronunciations and performance and examined for positive impact (Chapelle, 2001) and usability.
Mode of access: World Wide Web.
System requirements: World Wide Web browser and PDF reader.
Co-advisor: Linda Evans, Ph.D.
Co-advisor: James White, Ph.D.
Foreign language software
French pronunciation and speaking
Foreign language raters
Second Language Acquisition and Instructional Technology
USF Electronic Theses and Dissertations.