Education Policy Analysis Archives
Volume 10 Number 7, January 25, 2002, ISSN 1068-2341

A peer-reviewed scholarly journal
Editor: Gene V Glass, College of Education, Arizona State University

Copyright 2002, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold.

Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Exito en California? A Validity Critique of Language Program Evaluations and Analysis of English Learner Test Scores

Marilyn S. Thompson
Kristen E. DiCerbo
Kate Mahoney
Jeff MacSwan
Arizona State University

Citation: Thompson, M.S., DiCerbo, K.E., Mahoney, K., and MacSwan, J. (2002, January 25). Exito en California? A validity critique of language program evaluations and analysis of English learner test scores. Education Policy Analysis Archives, 10(7). Retrieved [date] from http://epaa.asu.edu/epaa/v10n7/.

Abstract

Several states have recently faced ballot initiatives that propose to functionally eliminate bilingual education in favor of English-only approaches. Proponents of these initiatives have argued that an overall rise in standardized achievement scores of California's limited English proficient (LEP) students is largely due to the implementation of English immersion programs mandated by Proposition 227 in 1998; hence, they claim Exito en California (Success in California). However, many such arguments presented in the media were based on flawed summaries of
these data. We first discuss the background, media coverage, and previous research associated with California's Proposition 227. We then present a series of validity concerns regarding use of Stanford-9 achievement data to address policy for educating LEP students; these concerns include the language of the test, alternative explanations, sample selection, and data analysis decisions. Finally, we present a comprehensive summary of scaled-score achievement means and trajectories for California's LEP and non-LEP students for 1998-2000. Our analyses indicate that although scores have risen overall, the achievement gap between LEP and English proficient (EP) students does not appear to be narrowing.

Education policy concerning the instruction of limited English proficient (LEP) students in the United States has been debated for a number of decades, and considerable attention has been given to the best method of instruction for these students. In recent years, the controversy regarding how to best educate LEP students has surfaced in the form of political legislation. According to the United States Department of Education (1994), the term "limited English proficient" refers to individuals who (1) were not born in the U.S. and whose native language is other than English, or (2) come from environments in which a language other than English is dominant. The education of LEP students is important, especially given that during the 1996-1997 academic year U.S. school districts reported an enrollment of approximately 3.5 million LEP students, accounting for 7.4% of the total reported enrollment (Macías, Nishikawa, & Venegas, 1998).
The proportional rate of increase in LEP students from 1995 to 2020 is projected to be 96%, compared to an expected increase of 22% for native-English speakers (Campbell, 1994).

The controversy regarding the education of LEP students generally focuses on the amount of instruction provided in the children's native language. English immersion programs provide instruction almost exclusively in English, which the teacher attempts to make accessible to LEP students. Bilingual education programs provide a substantial amount of content area instruction in the students' native language, while some time each day is spent developing English skills. It should be noted, however, that the actual implementation of programs varies across states, districts, schools, and even classrooms (August & Hakuta, 1998; Berliner, 1988).

Proponents of bilingual education argue that without support in their native language, LEP students will fall behind academically while they are learning English (Crawford, 1999; Krashen, 1996). They also argue that if students first learn to read in the language in which they are fluent, they can then transfer those skills to reading in English (Krashen, 1996). Proponents of English immersion argue that instructional time devoted to intensive learning of English will more likely benefit children's academic achievement in a second language environment (Rossell & Baker, 1996).

Recently, the debate surrounding the education of LEP students has shifted to the political arena. Several states, including California, Colorado, and Arizona, have faced ballot initiatives that propose to restrict the types of educational methods and programs that may be used to instruct LEP students. Specifically, these restrictions functionally eliminate bilingual education programs in favor of an English immersion approach. The first such proposition was California's Proposition 227, which passed by a majority vote in 1998.
In November of 2000, voters in Arizona approved a similar, but even more restrictive measure, Proposition 203. In early 2001, measures similar to these propositions were introduced in the state legislatures of Massachusetts, Oregon, and Rhode Island.

In this article, we first provide an overview of the media coverage surrounding the implementation and evaluation of California's Proposition 227 and then review scholarly analyses related to its initiation, including both qualitative studies of implementation and quantitative evaluations of student achievement scores. We then discuss several methodological problems we have observed with the use of the California Department of Education's Standardized Testing and Reporting (STAR; California Standardized Testing and Reporting, 2000) data to support arguments about the effects of Proposition 227. We limit our discussion to Stanford-9 scores released in 1998, 1999, and 2000 because we are concerned with the validity of claims about the success of Proposition 227 which purport to derive from these specific data. See Hakuta (2001) for remarks on the 2001 Stanford-9 scores of California's English learners. We frame these problems in the context of specific threats to validity and inappropriate approaches to data analysis. Finally, we present a comprehensive reanalysis of the STAR data and summarize achievement trajectories for LEP and non-LEP students. In interpreting our reanalysis, we discuss policy issues we feel cannot be adequately examined based on these data.

All the News That's Fit to Print? Media Accounts of California's Stanford-9 Scores and Proposition 227

California implemented Proposition 227 during the 1998-1999 school year. During the previous year, California also began statewide administration of the Stanford Achievement Test, 9th edition (Stanford-9). These test results are publicly available, aggregated by grade level for each school, through the STAR system (California Standardized Testing and Reporting, 2000).
In the past two years, many educators, media sources, and political stakeholders have reported summaries of these data as evidence of the effectiveness of Proposition 227.

The New York Times, which is widely recognized as one of the most influential newspapers in the United States, published a news story focusing on Stanford-9 achievement scores of LEP children in California on August 20, 2000. Times reporter Jacques Steinberg claimed that the increase in scores "at the very least" represented "a tentative affirmation" of the vision of Ron Unz (Steinberg, 2000, p. A1), who had sponsored the California initiative that banned bilingual education two years earlier. The Times story appeared as front-page news, running 1,744 words, and opened with the following statement:

Two years after Californians voted to end bilingual education and force a million Spanish-speaking students to immerse themselves in English as if it were a cold bath, those students are improving in reading and other subjects at often striking rates, according to standardized test scores released this week. (p. A1)

Steinberg concluded that the test results provide tentative evidence that Proposition 227's prescribed "cold bath" of English immersion is responsible for the increase in scores, and characterized the results as "remarkable." The Times piece also included an extensive anecdote of a school superintendent from the Oceanside district who
converted from an advocate of bilingual education to a proponent of structured English immersion.

To present a contrasting view, Steinberg included a 59-word paragraph in which he suggested alternative explanations for the increase in scores, citing class-size reduction in particular. However, these alternatives were introduced by the suggestion that Proposition 227 was at least in part responsible for the increase, which Steinberg found to be "remarkable given predictions that scores of Spanish-speaking children would plummet" (p. A1). Steinberg also quoted Stanford Professor Kenji Hakuta, who had conducted an analysis of the test scores and posted it on the World Wide Web the same day they were released. Rather than discussing Hakuta's study, Steinberg briefly summarized that it was Hakuta's view that "few conclusions could be drawn from the results, other than that 'the numbers didn't turn negative,' as many had feared" (p. A1). Steinberg appeared to use Hakuta's quote essentially to sustain his main point, namely, that the increase in test scores is a tentative affirmation of Proposition 227, and a clear indication that educators were wrong to predict children would suffer.

The Times story was syndicated in the Milwaukee Journal Sentinel (Steinberg, 2000b, p. 3A), where the opposing viewpoint was cut by half, and in the Baltimore Sun (New York Times News Service, 2000, p. 3A) and Cleveland's Plain Dealer (Steinberg, 2000c, p. 21A), where it was entirely eliminated. After the New York Times story appeared, the idea that California's Stanford-9 gains for LEP students resulted from the implementation of Proposition 227 was cited in 24 major U.S. newspapers, frequently without any question of the accuracy of the claim. Of these, 17 (or 71%) gave no voice to an opposing viewpoint at all (see Table 1). (Note 1)

Table 1
News Stories in Major Newspapers (Aug. 20, '00 to June 9, '01) Mentioning the Increase in California's Stanford-9 Test Scores as Evidence of the Success of Structured English Immersion (Proposition 227)

Newspaper                            Date of publication   Length of article (in words)   Length of opposing view (in words)
The Plain Dealer                     8/20/00               712                            0
Milwaukee Journal Sentinel           8/20/00               839                            41
The Baltimore Sun                    8/20/00               848                            0
New York Times                       8/20/00               1744                           78
The Arizona Republic                 8/22/00               679                            46
The Christian Science Monitor        8/23/00               1172                           0
The Houston Chronicle                8/28/00               261                            105
Star Tribune                         8/28/00               540                            0
USA Today                            8/28/00               996                            0
Newsday                              9/09/00               450                            0
The Arizona Republic                 9/22/00               844                            0
The Christian Science Monitor        9/27/00               862                            0
The San Diego Union Tribune          10/06/00              574                            0
The Arizona Republic                 10/29/00              1283                           0
Los Angeles Times                    11/07/00              98                             46
The Arizona Republic                 11/08/00              678                            0
New York Times                       11/15/00              600                            0
The Boston Globe                     12/31/00              1125                           0
The Atlanta Journ. & Constitution    1/04/01               647                            0
The Boston Globe                     1/14/01               436                            0
The Arizona Republic                 3/01/01               431                            0
The Arizona Republic                 3/02/01               448                            40
The Denver Post                      3/28/01               811                            29
New York Times                       4/01/01               228                            0
Averages:                                                  721.1                          16.0

Interestingly, the Associated Press (AP), which writes stories circulated to its 1,550 clients, including the New York Times, wrote a considerably more balanced story a week before the publication of the Times piece. AP reporter Jennifer Kerr's story opened as follows: "Two years after voters ended most bilingual education in California, statewide test scores for non-English speakers jumped about as much as scores for their fluent fellow students" (Kerr, 2000). Kerr's story noted that test scores had risen for all students in the state about equally and that the Stanford-9 was not written for English learners and is arguably inappropriate, and it provided much stronger objections from the research community.

Only three news stories appeared in major U.S. newspapers before the New York Times story, one in the Los Angeles Times (Groves, 2000, p. A3) and two in the San Diego Union-Tribune (Moran & Spielvogel, 2000, p. B1). Like the AP story, these papers presented a much more balanced account. Groves' Los Angeles Times story began,

California students who are not proficient in English improved their scores on the Stanford 9 standardized test at about the same rate as their fluent classmates, but new state data released Monday continue to show an immense disparity between the two groups. (p.
A3)

The main San Diego Union-Tribune story (Moran & Spielvogel, 2000) opened this way:

Celebrated gains in student state test scores are spread among all students, whether advantaged or disadvantaged, whether they speak English or not, according to data released today. (p. B1)

However, the view appearing in the New York Times, due to the paper's enormous influence on the national press, strongly predominated. The Times was cited as an authority on the issue in 56 published letters and editorials, and in one story appearing in the Arizona Republic (Gonzalez, 2000, p. EX1). Following the appearance of the Times article, numerous television and radio news shows, including those of the major television networks, broadcast the story that rising scores in California indicated Proposition 227 was a success in that state. A story in Newsday (Willen & Kowal, 2000, p. A10) said that the conclusion followed from "a recent California study."

Inaccuracies in scientific and technical reporting are known to occur widely in journalistic writing (Simon, Fico, & Lacy, 1989; Singer & Endreny, 1993; Tankard & Ryan, 1974; Weiss & Singer, 1987). However, what is particularly disturbing about the New York Times story is that conclusions were drawn based on claims that disregarded basic principles of scientific research design and educational measurement. Undermining the credibility of the story were inadequate consideration of alternative explanations and improper interpretation and use of Stanford-9 scores. Further, the Times failed to discuss controlled studies comparing bilingual education to all-English instructional approaches (Ramirez et al., 1991; Willig, 1985) or recent comprehensive research syntheses prepared by the National Research Council (August & Hakuta, 1998; Meyer & Fienberg, 1992). These errors and exclusions are particularly grievous given the high-stakes nature of the inferences drawn regarding an extremely complex educational issue.

Brief Background on Proposition 227 Implementation

In this section, we provide some background on Proposition 227 and summarize briefly recent research addressing the implementation of the initiative.
The full text of the California law can be reviewed online (English Language Education for Immigrant Children, 2001; http://www.leginfo.ca.gov/calaw.html); however, the general mandate of Proposition 227 is the following: "Children who are English learners shall be educated through sheltered English immersion during a temporary transition period not normally intended to exceed one year" (Section 305, 2001). The law applies to English learners, defined as "a child who does not speak English or whose native language is not English and who is not currently able to perform ordinary classroom work in English, also known as a Limited English Proficiency or LEP child" (Section 306, 2001). The law further defines sheltered English immersion as "an English language acquisition process for young children in which nearly all classroom instruction is in English but with the curriculum and presentation designed for children who are learning the language" (Section 306, 2001). The implementation of sheltered English immersion (SEI; equivalently referred to as 'structured' English immersion) as mandated by Proposition 227 has been addressed in the educational research literature. Most of these studies may be described as either qualitative studies of the implementation of SEI in California schools or quantitative summaries of standardized achievement scores pre- and post-implementation of Proposition 227.

Studies of Proposition 227 Implementation

Although districts, schools, and teachers did not ignore Proposition 227, there was not a
"sea of change" in programs for English learners apparent in the schools (García & Curry-Rodríguez, 2000). In fact, prior to implementation of Proposition 227, only 29% of English learners were in programs that included native language instruction, and 12% of students were still in those programs following implementation (Gándara et al., 2000). Maxwell-Jolly (2000) studied the interpretation and implementation of 227 in seven different school districts and found that although district interpretation of 227 set the tone, responses to and implementation of district policy regarding 227 varied widely. Further research indicated that when district administrators set a strong tone for eliminating native language instruction or providing alternatives to SEI, schools followed suit. However, when district leadership was lacking, implementation of the proposition varied across schools (Gándara et al., 2000).

Gándara (2000) documented the impact 227 had on instructional services, classroom pedagogy, and distribution of teachers, concluding that the greatest impact of Proposition 227 was on classroom instruction. For example, teachers reported leaving out much of their normal literacy instruction, such as storytelling and story sequencing, to focus on English word recognition. Instructional challenges presented by Proposition 227 included a lack of instructional materials and the need to teach students with a wider linguistic range (Schirling, Contreras, & Ayala, 2000). Teachers reported that even for programs in which parental waivers were obtained for native language instruction, they were required to include 30 days of English instruction before the waiver could take effect. Because schools did not know how many waivers they would receive, orders for instructional materials were delayed or made in insufficient quantities.
Hayes and Salazar (2001) evaluated instructional services offered to English learners enrolled in SEI in first, second, and third grade classes in Los Angeles Unified School District. They noted uneven implementation of SEI, with programs generally adopting one of two general approaches: use of primary language for clarification only and use of primary language for concept development. The effects of the proposition on teachers varied based on what the teachers had done prior to the passage of 227 and on teachers' education, skills, experience, and views on student learning (Gándara et al., 2000). For example, teachers who were certified to teach bilingual education were more likely to continue some level of native language support in their classrooms.

Studies offering various other perspectives on implementation of Proposition 227 have been published, many of them in a special issue of the Bilingual Research Journal devoted to the topic (e.g., Dixon, Green, Yeager, Baker, & Fránquiz, 2000; Palmer & Garcia, 2000; Paredes, 2000; Schirling, Contreras, & Ayala, 2000; Stritikus & Garcia, 2000). A California Research Bureau report by de Cos (1999) presented issues surrounding implementation of 227 in a historical context of language policy issues. The effects of STAR on English learners were also discussed, and the author warned against using these publicly available test scores to evaluate SEI programs. We now review some published quantitative analyses of these STAR data that have been used to support arguments for or against Proposition 227.

Analyses of California Achievement Data

When the publicly available aggregated standardized test scores of California children were released following implementation of Proposition 227, they were quickly analyzed in an attempt to determine the effects of the initiative. It is well-documented in the
literature that LEP students made gains in test scores, as did all students in the state (Butler, Orr, Gutierrez, & Hakuta, 2000; Gándara, 2000; García & Curry-Rodríguez, 2000). Butler et al. (2000) reported that schools maintaining strong bilingual programs had scores that equaled or exceeded those of schools that had dropped bilingual programs. In addition, Butler et al. (2000) noted that due to regression to the mean, scores of lower performing students are more likely to improve than those at the middle of the scale. Finally, they emphasized there was significant variation in test scores across schools in both the bilingual and English-only categories. García and Curry-Rodríguez (2000) studied a random sample of districts and found no specific patterns of test scores across schools with different 227 implementation strategies.

Amselle and Allison (2000) examined percentile rank increases for LEP students and found that LEP students made "significant gains in reading and writing in English as well as math" (p. 1). They went on to examine percentile rank improvements in four school districts that reported to be in strict compliance with Proposition 227 and four school districts that reported maintenance of a bilingual program. They found greater score improvements in the select districts reporting compliance with the initiative. Finally, they pointed to Los Angeles Unified School District as a district that openly defied Proposition 227 and had percentile rank test scores below "the state average for LEP students" (p. 12). Unfortunately, Amselle and Allison (2000) failed to note the variability within districts. In addition, their focus on select districts did not allow them to examine the variability across districts that reported similar implementations of Proposition 227.
Finally, they inappropriately used summaries of national percentile ranks to determine academic growth, a problem we discuss in greater depth later in this paper.

A noteworthy limitation of the publicly available data is the lack of student-level information (Gándara, 2000). However, Gutiérrez, Asato, and Baquedano-Lopez (2000) acquired and utilized student-level data for LEP students from an urban unified school district. Over three years, they tracked student scores in this predominantly English-only, phonics-based literacy district. They found the percentage of LEP students scoring at or above the 50th percentile decreased dramatically over the three years. Disaggregation of these data by language group showed that the percentage of Spanish-speaking children reading at or above the 50th percentile dropped from 32% in the first grade to 30% in the second grade to 15% in the third grade. Other language groups (Cantonese, Russian, Hmong, and Mien) also experienced sharp declines between first and third grade (Gutiérrez et al., 2000). The specific causes of these declines were not explored.

In sum, published qualitative evaluation reports of Proposition 227 generally conclude the overall effect of the new law on the education of language minority students has been negative. Furthermore, with the exception of Amselle and Allison's (2000) report, quantitative analyses of the Stanford-9 data to date reveal comparable gains for English learners and their fluent English-speaking peers. Many arguments and quantitative summaries based on the STAR data have been replete with improper statistical analyses and fail to acknowledge the many limitations of these highly aggregated standardized achievement data (e.g., Amselle & Allison, 2000). We now discuss multiple validity concerns as they apply to use of the California Stanford-9 data for evaluating language policy.
Validity Issues

Of utmost concern when using assessment data in research should be the validity of the assessment for the intended purpose. A large literature exists addressing the conceptualization of validity in educational and psychological testing and research (see Messick, 1989, for a comprehensive discussion of validity); therefore, we do not attempt a comprehensive review of validity, but rather concentrate on those validity issues that appear to be most problematic in our research context. To help focus our discussion, we borrow from Messick (1989) a definition of validity as a unified concept with multiple facets: "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment" (Messick, 1989, p. 13, emphasis added). As such, validity is not merely about the meaning of test scores. Validity encompasses "... the interpretability, relevance, and utility of scores, the import or value implications of scores as a basis for action, and the functional worth of scores in terms of social consequences of their use" (Messick, 1989, p. 13). The measurement context and the inferential context are both vitally important in forming validity judgments. We focus our discussion in the following section on issues that pose major threats to the validity of inferences based on the STAR data concerning LEP students: language of the test, alternative explanations, and sample selection.

Language of the Test

In this section, we consider the meaning of scores in the context of the assessment. Of particular concern is the administration of English-language standardized achievement tests to evaluate the academic achievement of students who are not proficient in English. Testing students in a language in which they are not yet proficient is problematic for multiple reasons.
The Standards for Educational and Psychological Testing warn that when testing a non-native speaker in English, the test results may not reflect accurately the abilities and competencies being measured if test performance depends on the test takers' knowledge of English (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). The Stanford-9 is a test of academic achievement, not a test of language proficiency, and the test developers have not conducted any specific studies to establish validity of Stanford-9 scores for children who have limited ability in the language of the test (Harcourt Brace Educational Measurement, 1997c). Therefore, limited English proficiency should be regarded as a likely source of measurement error in the Stanford-9 test scores intended to reflect academic achievement.

Language proficiency in general has been shown to influence performance on achievement tests (Ulibarri, Spencer, & Rivas, 1981). Pilkington, Piersel, and Ponterotto (1988) reported that the home language of a child influenced the predictive validity of kindergarten achievement measures. These studies suggest language proficiency plays a role in young children's performance on achievement tests. This relationship may continue in high school children, where LEP status was shown to be a significant predictor of both language arts and mathematics scores on the California Assessment of Progress, although a poverty measure was a stronger predictor (Wright & Michael, 1989).
Alternative Explanations

An important consideration in interpreting trajectories of achievement scores in California is the acknowledgement of potential confounding conditions and alternative explanations. In this section, we address five such considerations: simultaneous changes in educational policy and practice, inconsistent implementation of immersion programs, increasing test familiarity and preparation, the limitations of using aggregated data, and regression to the mean.

Simultaneous Policy Implementations

Proposition 227 was introduced concurrently with other changes in educational policy and practice. In fact, Gándara (2000) explained Proposition 227 was enacted in what has been the most active period of education reform in recent times. Statewide initiatives include class size reductions from an average of 30 to 20 in early elementary classrooms and a switch from a whole language approach to a phonics-based method of reading instruction for poor readers. Gutiérrez et al. (2000) specifically noted that class size reduction, the new state standardized testing program, new reading and accountability initiatives, and the new language arts standards had all been implemented concurrently. In addition, other reforms have likely occurred at the district, school, and classroom level. Any or all of these may be important contributors to student gains.

Inconsistencies in Language Programs

Evaluating language program policy is further complicated by inconsistent implementation of English immersion programs, with instructional practices varying widely across districts, schools, and even classrooms (Berliner, 1988; Gándara et al., 2000). There are many different variations of English-only programs, as well as of bilingual programs. Therefore, it is not clear exactly which programs are being compared when simply examining changes in test scores from 1998 to 1999.
The state education system is terrifically diverse, and educational practices are far from uniform.

Test Familiarity and Coaching

Because the aggregated scores are publicly released, schools and districts feel pressure to achieve high test scores and thus encourage test preparation in varying degrees. As the stakes become higher, test preparation often becomes a high-profit industry, as documented in Texas (McNeil, 2000; Sacks, 1999). The California test was tied to different, but motivating, rewards and sanctions for schools, teachers, and students. Schools and teachers could receive bonuses based on increased test scores. For example, each of the 1,000 certificated staff in underachieving schools with the largest growth in California receives $25,000 (Public Schools Accountability Act, 1999). On the other hand, schools that do not meet goals for academic improvement in 24 months may be taken over by the state Superintendent for Public Instruction. The Superintendent may then take a variety of actions, up to and including closing the school (Public Schools Accountability Act, 1999). Bilingual and English-only teachers alike, in the presence of so many rewards and sanctions, may feel pressure to specifically teach to the test or focus disproportionately on test preparation, as Moran (2000) has discussed.
Even without these extensive consequences, a dramatic and consistent rise in test scores is frequently observed in the first few years following implementation of a new testing program (Linn, Graue, & Sanders, 1990), such as occurred in California following implementation of Stanford-9 testing in 1998. There are several possible explanations for this trend. Coaching and teaching to the test, often at the expense of more desirable teaching and learning activities, can contribute to striking rises in test scores. A meta-analysis of 30 studies revealed that coaching for standardized tests increases test scores in the typical study by .25 standard deviations (Bangert-Drowns, Kulik, & Kulik, 1983). Coaching may refer to a variety of test preparation activities, including general test-taking strategies (e.g., guessing, underlining main ideas, time management), test-specific strategies (e.g., methods useful for quirks of a particular test), and academic instruction tied closely to the content and skills on the assessment (Anastasi, 1981; Bond, 1989).

As teachers and administrators become more familiar with the tests, coaching strategies may become more effective. Butler et al. (2000) suggested that since there is a trend for test scores to rise for all students in California, these broad patterns of improvement may result largely from "teaching to the test." Teachers in all types of classrooms, including bilingual and SEI classrooms, report having modified their teaching practices substantially, with a greater emphasis on preparing students to answer English standardized test-like questions (Alamillo & Viramontes, 2000; Gándara, 2000).

Unknown Student and School Characteristics

Another factor limiting conclusions based on the STAR data is the nature of the data themselves. We have noted the California STAR data are available to the public and the research community only in aggregated form, by grade level within school.
The lack of student-level information makes these data insufficient for thoroughly exploring relations between student achievement and language program for LEP students. Although indicators of the dominant language program and enrollment numbers are available at the school level (California Language Census Data Files, 2000), this information cannot be tied to individual students or even grade-level averages. Further, relevant student-level information such as socioeconomic status, level of English proficiency, and the previous year's score cannot be used to control for potentially relevant individual differences.

A problem of particular relevance for studying the effects of language programs on LEP students is the variability among districts in the criteria for defining LEP students, as well as who will be tested. As noted by Gándara (2000) and Butler et al. (2000), redesignation of students with borderline English proficiency could have a profound effect on aggregated scores in LEP and non-LEP groups. Although the Stanford-9 is not a measure of English language proficiency, some districts redesignate LEP students achieving a certain score on the Stanford-9 to EP status for the following year. This skimming effect may result in depressed scores for the LEP group. In contrast, Gándara (2000) pointed out that some districts are not reclassifying students on the basis of high scores on the Stanford-9, perhaps because they have not performed well on language proficiency tests. Differences in redesignation policies and rates, in the absence of student-level data, blur the meaning ascribed to LEP and EP score means.
Additionally, use of these aggregated data induces two distinct problems related to school size. First, grade-level averages of achievement scores for each academic subject were included only if there were at least ten students in the summary category represented. Schools were therefore unable to report summaries of any subgroup, such as LEP students, if that subgroup had fewer than ten students in the grade. It follows that the scores of many students are not represented in these subgroup aggregates. If, for example, schools with fewer LEP students tended to be schools having higher socioeconomic status (SES), systematic omission of these schools due to insufficient numbers of LEP students may introduce bias in estimated LEP group means related to average school SES.

A second problem related to school size is that the oft-reported statewide averages of the grade-level means do not account for the drastically varying numbers of students represented by the available grade-level within-school score aggregates. Certainly, the number of students in each grade varies across schools overall and for particular subgroups, yet computation of unweighted averages gives equal influence to small and large schools. This presents a unit of analysis problem when trying to make inferences about achievement at the student level. Unweighted averages of the grade-level means do not provide appropriate estimates of the statewide student averages.

As we later address, weighted means might provide better estimates of student scores. However, even weighted means do not allow representation of students excluded from subgroup summaries resulting from too few students in a category. Further, they do not address problems associated with the lack of relevant student covariates or redesignation of language proficiency status. Using aggregated rather than student-level data severely limits the nature and strength of generalizations that can be made based on these data.
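The unit of analysis problem described above can be illustrated with a small sketch. The school names, mean scores, and student counts below are hypothetical values chosen only for illustration; they are not taken from the STAR data.

```python
# Hypothetical grade-level records: (school, mean scaled score, students tested).
# These values are illustrative assumptions, not STAR data.
schools = [
    ("Small School A", 640.0, 12),
    ("Small School B", 650.0, 15),
    ("Large School C", 600.0, 300),
]


def unweighted_mean(records):
    """Average of the school means; every school counts equally."""
    return sum(mean for _, mean, _ in records) / len(records)


def weighted_mean(records):
    """Average of the school means weighted by students tested.

    This approximates the mean that would be obtained from student-level data.
    """
    total_students = sum(n for _, _, n in records)
    return sum(mean * n for _, mean, n in records) / total_students


print(f"Unweighted mean of school means: {unweighted_mean(schools):.2f}")
print(f"Student-weighted mean:           {weighted_mean(schools):.2f}")
```

In this made-up case the two small, higher-scoring schools pull the unweighted average well above the student-weighted mean, even though they enroll fewer than a tenth of the students.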
It is imperative that researchers realize the implicit limitations and fallacies associated with using such grossly aggregated data.

Regression to the Mean

Another explanation for score gains we must briefly consider is regression to the mean, a topic discussed thoroughly in many classic statistical textbooks (e.g., Campbell & Stanley, 1966; Glass & Hopkins, 1996) but often ignored by researchers as a genuine threat to validity. Regression to the mean refers to the tendency for student scores that are extreme upon initial testing (relative to the overall mean) to drift toward the population mean upon subsequent testing. Regression to the mean is particularly important when gains of extreme groups are of interest, such as in the comparison of low-scoring LEP students to other groups. For all available years of Stanford-9 score reports, LEP student scores are markedly lower than those for non-LEP students. In a district such as Oceanside, whose mean scores for LEP students were extremely low in 1998, scores might be expected to rise upon retesting, even without intervention, due to regression to the mean. Butler et al. (2000) compared low-scoring schools with mostly non-LEP students to low-scoring schools with mostly LEP students and demonstrated that schools with both compositions increased similarly from 1998-2000.

Failure to acknowledge or correctly address regression to the mean has been an issue in other large-scale policy analyses. For example, Camilli and Bulkley (2001), in a critique of Greene's (2001) evaluation of Florida's A-Plus accountability system, argued
convincingly for policy analysts to be aware of regression to the mean and to use statistical models that take regression to the mean into account. They noted such approaches have been recently employed in North Carolina's development of growth standards for the state. A detailed discussion of regression artifacts, particularly regression to the mean, may be found in a recent book devoted to this subject by Campbell and Kenny (1999).

Sample Selection

Many, if not most, of the published reports of analyses of California's Stanford-9 scores have been based on the consideration of a small sample of schools. While it is perhaps more feasible to focus on a few schools when attempting a descriptive study of the policy implementation process, it is dangerous to make inferences based on quantitative differences in mean achievement across a small number of select schools. The presence of school and classroom effects on student achievement is well documented. Students sharing a common classroom and/or school environment tend to perform more similarly on achievement tests than students sampled from multiple sites (Muthén, 1991; Thompson, 2000). Demographic similarities, as well as collective experiences of students sharing an educational environment, contribute to these classroom and school effects.

Oceanside initially became the district held up as representative of schools that strictly implemented Proposition 227. Beginning with the emphasis on Oceanside's Stanford-9 gains in press releases from Ron Unz and English-only supporters (e.g., English for the Children, 2000), score gains in this district have been repeatedly cited as evidence of the success of the proposition (Amselle & Adams, 2000; Steinberg, 2000). In response to these claims of success due to SEI, opponents of Proposition 227 pointed out marked gains seen in specific districts maintaining bilingual education. For example, Butler et al.
(2000) chose schools nominated by Californians Together, a bilingual advocacy group, for analysis (e.g., Fresno Unified School District; Californians Together, 2000, August 21). They compared these to English-only districts held up by advocates of Proposition 227 as the most successful English-immersion schools.

Schools that maintained bilingual programs likely had administrators and teachers who were highly committed to these programs, given the effort needed to obtain parental waivers for participating children. This degree of commitment to bilingual programs may also suggest exceptionally strong and effective programs. Similarly, it can be argued districts such as Oceanside had atypically strong SEI programs. While comparing what seem to be the most successful districts of each program type is informative, the results of such a comparison should not be used to suggest that the same outcomes would be observed in districts with different characteristics. A characteristic unique to a school or district may contribute substantially to a rise in test scores; this characteristic may or may not be related to the language program. We should not be surprised to see contradictory results and inferences regarding program effects from studies that employ selective sampling of a few specific schools and districts. While comparisons of select schools and districts are informative, we urge caution in making generalizations based on such samples.

Data Analysis Decisions
The remaining issues we address are data analytic problems in summarizing the STAR data and using these summaries to support inferences. Our focus here is on analyses that manipulate students' scores in manners incongruent with the intended purpose of the assessment, and therefore these data analysis problems should also be considered threats to the validity of judgments based on STAR data. First, we discuss the misinterpretation and misuse of scores reported in the form of percentile ranks. Problems in using percentile ranks as a basis for longitudinal inferences result from incongruent norm group compositions, unequal score intervals, and difficulties in computing gains. We then discuss unit of analysis issues associated with using aggregated data to make inferences at the student level.

Misinterpretation and Misuse of Percentile Ranks

Individual student reports of performance on standardized achievement tests, including the Stanford-9, frequently feature percentile ranks. National percentile rank (NPR) scores indicate percentile ranks for a subtest relative to the national within-grade norm group. The NPR scores reported for students taking the Stanford-9 are derived from distributions of scaled scores broken down by grade and subject (Harcourt Brace Educational Measurement, 1997c). For example, a 2nd-grade student estimated to be at the 56th percentile on the math test would score higher than 56% of the students in the 2nd-grade norm group. Similarly, he or she would score lower than 44% of the students in the 2nd-grade norm group. The popularity of percentile rank scores is likely due to the ease with which they are understood at a practical level: parents are comfortable with the notion of a percentile scale on which their child's relative standing can be located.
However, there often are hidden validity problems in utilizing a "relative" comparison group in interpreting achievement scores.

Consider a statement from the earlier-mentioned New York Times article describing increasing Stanford-9 achievement scores of LEP students in California: "In second grade, ... average score in reading of a student classified as limited in English increased 9 percentage points over the last two years, to the 28th percentile from the 19th percentile in national rankings, according to the state" (Steinberg, 2000, p. 1A). While this statement may seem quite clear on the surface, there are multiple assumptions about the meaning and comparison of percentile ranks that may cloud the perception of true academic growth. We briefly develop several points, well-known to psychometricians, that discourage the use of percentile ranks as measures for assessing academic gains for a collective group of students.

NPRs Are Norm-referenced Scores

For any one student, percentile ranks indicate only relative standing within a norm group. It follows that NPR scores should always be interpreted with the characteristics of the norm group in mind. The norm sample for the Stanford-9 was balanced to generally represent the U.S. population according to socioeconomic status, ethnicity, and urbanicity, with nonpublic schools oversampled to facilitate a separate norm group. Sampled schools were asked to test students who would typically be tested with other students in regular education classrooms, except those classified as trainable mentally handicapped or severely/profoundly mentally handicapped. Individual districts and schools were therefore able to include or exclude LEP students
and some classifications of special education students according to local policy. The tested student population in California contains a much greater proportion of LEP students than does the Stanford-9 norm group. Specifically, the Stanford-9 spring norm sample contained only 1.8% LEP students (Harcourt Brace Educational Measurement, 1997b). In contrast, California estimates approximately 25% of its students are LEP (Macías et al., 1998). The incongruence between the makeup of the reference group and California with respect to LEP students calls into question the validity of generalizations based on NPR scores for LEP students.

Due to their normative nature, NPRs are not a measure of academic achievement as defined by a level of knowledge or skill. It is possible that a true academic gain may appear as a decline according to the change in NPR across years. For example, a student could display greater mastery than in the previous year, but have a lower percentile rank if students in the norm group scored proportionally higher than the tested student in the second year. It again follows that changes in percentile ranks across multiple years are not well suited for demonstrating improvements in academic knowledge or skill.

NPR Score Increments Represent Unequal Achievement Intervals

To understand more thoroughly the pitfalls of manipulating NPR scores, we consider how NPR scores are derived from the students' raw scores. A raw score is simply the total number of items a student answers correctly on a test. The test publisher determines NPRs through a two-step score conversion process (Harcourt Brace Educational Measurement, 1997b). On a specific test, such as the Stanford-9 5th-grade reading test, the original raw scores from the norm group are first transformed into scaled scores by applying item response theory (IRT).
The IRT model employed for the Stanford-9 takes item difficulties into account to estimate a proficiency level, or scaled score, that is both independent of the specific items to which the student responds (i.e., the form and level of the subtest may vary) and independent of the group of students to whom the test is administered. These scaled scores are on a single scale for a subject area, so they can be compared across different test forms and grade levels. Scaled scores also have the convenient property of an equal-interval scale that supports comparisons of proficiency level across time for a specific subject test (i.e., a one-unit increase from 1998-1999 on a subject test represents the same amount of achievement growth as a one-unit increase from 1999-2000, regardless of grade level).

To convert scaled scores into NPRs, the cumulative distribution of scaled scores from the norm group is transformed into a roughly uniform distribution of percentile ranks ranging from 1 to 99. When a new group of students is administered the exam, their raw scores are first converted to scaled scores. NPRs are then determined for students in the testing group such that the percentile rank for a specific scaled score reflects the percentage of the norm group scoring at or below that level. Table 2 illustrates conversions among raw, scaled, and percentile rank scores on the 5th-grade reading subtest of the Stanford-9, using the spring national norm sample (Intermediate 2 Reading Scores, Form T; Harcourt Brace, 1997b). This reading subtest had 84 items. Because each item has 4 answer choices, we might expect pure guessing to yield correct answers to about 21 of the 84 items (one quarter of 84), which would place the score at about the 3rd percentile on the NPR scale.

Table 2
Conversions of Select Raw, Scaled, and National Percentile Rank Scores for the Stanford-9 5th-grade Reading Subtest

Raw score   Scaled score (Form T)   National percentile rank
80          742                     99
75          710                     93
70          691                     82
65          676                     72
60          663                     59
55          652                     48
50          642                     38
45          632                     29
40          622                     22
35          612                     15
30          599                      8
25          591                      6
20          579                      3
15          564                      1

(Note: Adapted from Harcourt Brace Educational Measurement, 1997b. Intermediate 2 (grade 5) Total Reading, Form T, spring norm sample.)

This transformation of the distribution of scaled scores into percentile ranks is nonlinear, resulting in a loss of the equal-interval scale property of scaled scores. Equal differences in percentile ranks do not reflect equal differences in achievement or skill. Because the relative frequency of the original scaled scores is typically greater in the middle range than in the upper and lower ranges, conversion to percentile ranks results in spreading of the mid-range scaled scores and condensing of the upper- and lower-range scaled scores (the tails of the distribution).

This lack of an equal-interval scale has several important implications. An achievement gain of 1 scaled score point does not result in a consistent gain in percentile ranks throughout the range of the scale. Specifically, a gain of 1 percentile rank will reflect a greater achievement difference (reflected by the scaled score difference) in the upper or lower range than in the middle range. A difference in percentile ranks between 10 and 20 or between 80 and 90 may reflect a greater achievement gain than a difference between 50 and 60. This can be seen from the example in Table 3. Student 1 scored lower than most other students taking the test and Student 2 scored near the middle of students taking the test. Both students improved 50 scaled score points from 3rd grade to 4th grade, representing equivalent achievement gains. Student 1, at the low end of the scale, improved from the 1st percentile to the 7th percentile, an increase of 6 percentiles.
However, Student 2, in the middle of the scale, improved from the 41st to the 67th percentile, an increase of 26 percentiles. In addition, the accuracy of percentile ranks differs across the range of scores (Rogosa, 1999). Comparisons of percentile gains for students at different skill levels are not transparent and may be regarded as misleading at best.

Table 3
Example of Percentile Rank Gains at Different Points in the Distribution

                      Student 1             Student 2
                  Grade 3   Grade 4     Grade 3   Grade 4
Scaled score        525       575         605       655
Percentile rank       1         7          41        67

(Note: Adapted from Harcourt Brace Educational Measurement, 1997b. Primary 3 (grade 3) & Intermediate 1 (grade 4) Total Reading, Form T, spring norm sample.)

NPRs Should Not Be Averaged or Used to Compute Gains

Another important implication of the lack of an equal-interval scale is that means and gain scores of NPRs should not be computed (Crocker & Algina, 1986; Cronbach, 1960). The California Stanford-9 data are only publicly available as means for each grade level within school. Additionally, these grade-level within-school means often are averaged further in an attempt to summarize subgroup means. Due to the unequal intervals created in deriving the NPR scale, such averaging of data may drastically propagate errors in estimating true achievement.

Although both the California STAR website (California Standardized Testing and Reporting, 2000) and Stanford-9 Technical Manual (Harcourt Brace Educational Measurement, 1997c) state explicitly that NPRs should not be used to determine true academic change across years, many reports have focused on changes in percentile ranks (e.g., Amselle & Allison, 2000; Butler et al., 2000; English for the Children, 8/14/2000; Steinberg, 2000, p. 1A).
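The caution against averaging NPRs can be made concrete with the Table 2 conversions. The sketch below is only an illustration: the linear interpolation between tabled values and the two example students are assumptions, and the publisher's actual conversion tables are finer-grained.

```python
# (scaled score, national percentile rank) pairs taken from Table 2
# (Stanford-9 5th-grade reading, Form T, spring norm sample).
TABLE_2 = [
    (564, 1), (579, 3), (591, 6), (599, 8), (612, 15), (622, 22), (632, 29),
    (642, 38), (652, 48), (663, 59), (676, 72), (691, 82), (710, 93), (742, 99),
]


def npr_from_scaled(score):
    """Approximate an NPR by linear interpolation between Table 2 rows.

    Interpolation is an assumption made for this sketch only.
    """
    if score <= TABLE_2[0][0]:
        return float(TABLE_2[0][1])
    for (s_lo, p_lo), (s_hi, p_hi) in zip(TABLE_2, TABLE_2[1:]):
        if s_lo <= score <= s_hi:
            return p_lo + (score - s_lo) * (p_hi - p_lo) / (s_hi - s_lo)
    return float(TABLE_2[-1][1])


# Two hypothetical students whose scaled scores appear in Table 2 (NPRs 1 and 59).
scaled_scores = [564, 663]
mean_of_nprs = sum(npr_from_scaled(s) for s in scaled_scores) / len(scaled_scores)
npr_of_mean = npr_from_scaled(sum(scaled_scores) / len(scaled_scores))

print(f"Mean of the two NPRs:         {mean_of_nprs:.1f}")
print(f"NPR of the mean scaled score: {npr_of_mean:.1f}")
```

Under these assumptions the average of the two percentile ranks is 30, whereas the mean scaled score (613.5) corresponds to roughly the 16th percentile, so the two summaries tell different stories about the same pair of students.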
The major argument against using changes in NPRs to assess achievement gain for a student is that a gain of 1 scaled score point translates into different NPR intervals at different points on the scale, thereby obscuring measurement of true achievement gains. Further, because scores within each year have already been averaged, the collective gains of a group are even more problematic to assess and compare.

Gains have been computed from two perspectives: within-grade changes across years or cohort gains across years. Within-grade changes, which have been more commonly reported in the context of California STAR data, compare means for a grade level across subsequent years (e.g., 2nd-graders in 1998 to 2nd-graders in 1999). Even when using scaled scores, these within-grade changes are not true achievement gains in the sense that they are based on different groups of students. In contrast, cohort gains compare scores for a cohort of students across subsequent years (e.g., 2nd-graders in 1998 to
3rd-graders in 1999). However, we caution that when working with aggregated data, individual students cannot be tracked and therefore student mobility introduces some uncertainty to within-school cohort gains. Further, even if we assume the student group to be relatively consistent across years, the norm group used to determine the NPRs is different. As earlier described, it is therefore possible for slight gains in true achievement to appear as declines if the relative standing in the norm group is lower in the second year of testing, and vice versa.

In summary, the use of NPRs for tracking and comparing student achievement trajectories is problematic from multiple perspectives. The utility of percentile ranks is limited to informing judgments of how well a student does relative to the norm group for a subject and grade, gaining a view of a student's relative score profile across subjects, and evaluating whether a student has improved standing relative to the norm group from one year to the next. Such judgments are only valid if the norm group is an appropriate basis for comparison, which it quite clearly is not for LEP students. The characteristics of percentile ranks, including norm group inconsistencies and an unequal interval scale, make this score form unsuitable for large-scale longitudinal policy analysis. We now treat one additional data analysis problem we have observed in reports of California achievement results: inappropriate averaging of data.

Averaging Scores Across Subjects and Grades

Regardless of the score form reported, it is incorrect to average scores across different subjects and different grades (California Standardized Testing and Reporting, 2000; Harcourt Brace Educational Measurement, 1997c). Yet we have observed multiple citations of score improvements that involve averages across both grades and subjects.
For example, consider the following statement, from a press release on the English for the Children website, that attempts to summarize the academic progress of English learners in California:

    From 1998 to 2000, California English learners in elementary grades (2-6) ... raised their mean percentile scores by 35% in reading, 43% in mathematics, 32% in language, and 44% in spelling, with an average increase of 39% across all subjects (English for the Children, 2000, August 14).

The same site also displays tables showing mean percentile ranks across elementary grades 2-6 and across multiple subjects. Even prior to computing percentage improvements, consider the layers of averages implied in these numbers:

1. LEP student scores are first aggregated to grade-level, within-school means (before release of data);
2. LEP grade-level, within-school means are averaged across grades (2 through 6) and schools;
3. LEP grade-level, within-school means are averaged across subjects (reading, mathematics, language, and spelling) and schools; and
4. LEP grade-level, within-school means are then simultaneously averaged across grades and subjects and schools.

We therefore see that "average increase of 39% across all subjects" relies on a notion of gains based on means of means of means of means. What do these mean?
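A further difficulty with percentage increases in percentile ranks can be sketched using the two students from Table 3, who made identical 50-point scaled-score gains. The helper function name is an assumption introduced only for this illustration.

```python
def percent_increase(before, after):
    """Relative change from `before` to `after`, expressed as a percentage."""
    return 100.0 * (after - before) / before


# Scaled scores and percentile ranks from Table 3 (grade 3 -> grade 4).
students = {
    "Student 1": {"scaled": (525, 575), "npr": (1, 7)},
    "Student 2": {"scaled": (605, 655), "npr": (41, 67)},
}

for name, s in students.items():
    scaled_gain = s["scaled"][1] - s["scaled"][0]
    npr_gain = s["npr"][1] - s["npr"][0]
    npr_pct = percent_increase(*s["npr"])
    print(f"{name}: scaled gain {scaled_gain}, NPR gain {npr_gain}, "
          f"NPR percent increase {npr_pct:.0f}%")
```

Both students gain the same 50 scaled score points, yet the percentage increase in percentile rank is 600% for Student 1 and about 63% for Student 2, which suggests how strongly such percentages depend on the starting point rather than on the achievement gain itself.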
These overall averages are not meaningful or defensible from a measurement perspective and, further, they may obscure important differences in means that exist across grades and subjects. Such data summaries are psychometric nightmares, and are particularly haunting when used to support arguments for educational policy that may strongly impact students' educational opportunities.

An Analysis of the California STAR Data

Motivated by the validity problems we have observed in other summaries of these data, we conducted a reanalysis of the California STAR data. Although we have argued that the Stanford-9 has significant limitations as a measure of achievement for LEP students and that these aggregated data lack information necessary to inform language program policy, it is apparent from our review of press and research reports that trends observed in these data will continue to be cited as evidence for arguments on both sides of the language policy spectrum. Here we attempt to analyze differences in score means and trends for California LEP and EP students and interpret them thoughtfully, without leaping to unwarranted inferences about language program effects. We present a comprehensive summary of means and gains for all grades and subjects tested; however, we focus our discussion on reading, language (Note 2), and mathematics scores for the elementary grades 2-6.

Methods

Data

We compiled data from three publicly available data sources. First, Stanford-9 scores were obtained from the California STAR website (California Standardized Testing and Reporting, 2000). This dataset provided within-school grade-level means on subtests of the Stanford-9 for reading, mathematics, and language for grades 2 to 11, spelling for grades 2 through 8, and science and social studies for grades 9 through 11. In addition to the within-school grade-level means, STAR also reports subgroup means for EP and LEP students.
However, as noted previously, data were not reported for groups of fewer than 10 students for reasons of confidentiality. For example, the grade-level aggregate scores for the LEP subgroup were not included in the data report if a grade had fewer than 10 LEP students. Additionally, we obtained supplemental demographic information from the language census website (California Language Census Data Files, 2000) and from the academic performance index data website (California Academic Performance Index Data Files, 2000).

Statistical Procedures

The outcome scores used in these analyses were in the form of subject-area scaled scores. Recall that scaled scores are academic proficiency estimates that can be compared across time and across different levels of a subject-area test. Weighted means were computed for each subject area and grade level for three groups. With weighted means, schools with more students are weighted more heavily than schools with fewer students in computing the overall mean, providing a closer approximation to the student-level mean. The three groups for which means were computed were: all
students, LEP students, and EP students. Schools reporting overall scores were used in computing weighted means for the group of all students. For LEP and EP subgroup means, we included all schools that reported subgroup means for both LEP and EP students. In 1998, however, the STAR dataset did not include aggregate scores for EP students separately, so we were unable to compare these groups in 1998.

In order to determine changes in scores across years, we computed both within-grade changes and cohort gains. First, within-grade changes were computed by subtracting within-school grade-level means from one year to the next for a single grade; an example is subtracting 4th-grade reading scores in 1998 from 4th-grade reading scores in 1999. The weighted means of these within-grade changes were then computed for each grade level in each subject area. Second, cohort gains were computed by subtracting within-school grade-level scores from one year to the next in consecutive grades; an example is subtracting 3rd-grade reading scores in 1998 from 4th-grade reading scores in 1999. We regard this as a loose cohort because we do not have evidence regarding which students remained in the same school from one year to the next. Again, the weighted mean of these gains was computed in each subject area.

Results

Descriptive Statistics

In order to examine how grade-level means change for each academic subject over the years 1998, 1999, and 2000, we computed weighted means and standard deviations for each grade in each subject (see Appendix A). Our main finding from these means is that over the three-year period, scores for LEP students remain substantially below the scores for EP students in schools that reported aggregate scores for both LEP and EP students. And, with few exceptions, the gap in LEP and EP students' scores does not appear to be narrowing.
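The two kinds of change defined under Statistical Procedures can be sketched as follows. The school identifiers and scores are hypothetical, and the choice to weight each school's change by its later-year student count is an assumption; the text does not state which count was used as the weight.

```python
# Hypothetical within-school grade-level mean scaled scores in reading.
# Key: (school, year, grade) -> (mean scaled score, students tested).
# All values are illustrative assumptions, not STAR data.
means = {
    ("A", 1998, 3): (590.0, 40), ("A", 1998, 4): (620.0, 45),
    ("A", 1999, 3): (596.0, 42), ("A", 1999, 4): (624.0, 41),
    ("B", 1998, 3): (570.0, 200), ("B", 1998, 4): (600.0, 190),
    ("B", 1999, 3): (575.0, 210), ("B", 1999, 4): (603.0, 195),
}


def weighted_change(year0, grade0, year1, grade1):
    """Weighted mean across schools of (later mean - earlier mean).

    Weights are the later year's student counts (an assumption).
    """
    num = den = 0.0
    for school in {key[0] for key in means}:
        earlier = means.get((school, year0, grade0))
        later = means.get((school, year1, grade1))
        if earlier is None or later is None:
            continue  # school did not report both aggregates
        num += (later[0] - earlier[0]) * later[1]
        den += later[1]
    return num / den


# Within-grade change: 4th grade in 1998 vs. 4th grade in 1999 (different students).
within_grade = weighted_change(1998, 4, 1999, 4)
# Loose cohort gain: 3rd grade in 1998 vs. 4th grade in 1999 (mostly the same students).
cohort = weighted_change(1998, 3, 1999, 4)

print(f"Within-grade change: {within_grade:.2f}")
print(f"Loose cohort gain:   {cohort:.2f}")
```

The same function serves both definitions; only the pairing of year and grade differs, which is why the two quantities answer different questions about the same aggregated records.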
We describe the score trends for 2nd through 6th grades in reading, mathematics, and language in greater depth in the following section. We summarize mean score differences by examining within-grade changes, followed by cohort gains.

Within-grade changes

Reading. Within-grade changes for each grade in reading across consecutive years are shown in Table 4 for grades 2-6 and in Appendix B for all grades. Figure 1 shows that for 2nd grade, all student groups improved substantially from 1998 to 2000. For example, from 1999 to 2000, 2nd-grade LEP students gained an average of 4.20 scaled score points, EP students gained an average of 4.40 scaled score points, and all students gained an average of 5.39 scaled score points. (Recall that the group termed all students consists of a larger number of schools; LEP and EP means are based on schools reporting scores in both of these subgroups.) It is also interesting to note that although the overall means increased, gains varied considerably among schools, and not all schools experienced improvement. For example, from 1999 to 2000, 27.2% of schools experienced declines in mean second-grade reading for LEP students and 28.6% experienced declines for EP students.

Table 4
Weighted Mean Within-Grade Gains in Reading for Grades 2-6

                 1998-1999               1999-2000       1998-2000
Grade          LEP     ALL     LEP      EP     ALL     LEP     ALL
2      M      7.84    5.88    4.20    4.40    5.39   12.90   11.21
       SD     9.82    8.89   10.54    9.67    8.31   10.88    9.66
3      M      8.30    5.02    3.12    4.72    4.57   11.58    9.60
       SD    10.87    8.36   10.98    9.41    7.80   10.24    8.71
4      M      6.47    2.69    2.16    3.58    3.79    7.60    6.48
       SD    10.27    8.02   10.07    8.69    7.48    9.65    8.44
5      M      3.81    1.65    1.27    1.88    1.83    4.54    3.44
       SD     7.68    7.09    7.92    7.75    6.90    8.88    7.56
6      M      3.56    2.12    1.96    1.97    1.51    4.18    3.68
       SD     8.01    5.93    6.84    6.52    5.52    7.35    6.23

Figure 1. Mean within-grade changes for 2nd-grade reading.

A comparison of reading scores for LEP and EP students from 1999 to 2000 (EP students were not reported separately in 1998) across grades 2-6 indicated that EP students made slightly larger gains than LEP students in all 5 grades. In no instance,
however, was the difference in gains more than two scaled score points. This pattern is illustrated in graphs of reading means for grades 2 and 4 (see Figures 1 and 2, respectively), which show nearly parallel lines for LEP, EP, and all students.

Figure 2. Mean within-grade changes for 4th-grade reading.

Language. Within-grade changes in language were similar to those in reading (see Appendix B). Students in each group displayed gains in scores from 1998 to 1999, 1999 to 2000, and 1998 to 2000. A comparison of LEP and EP students from 1999 to 2000 again revealed that EP students made slightly larger gains than LEP students in grades 2-6, although the overall improvement was never greater than three scaled score points.

Mathematics. An examination of the within-grade mean increases in mathematics scores revealed that LEP students again improved slightly less than EP students in all grades from 1999 to 2000, as reported in Table 5 for grades 2-6 and Appendix B for all grades. For example, in 2nd grade, the mean change for LEP students was 6.91 scaled score points and the mean change for EP students was 7.63 scaled score points. In 4th grade, LEP students had an average increase of 4.95 scaled score points, while EP students had an average increase of 7.33 points. Figures 3 and 4 illustrate within-grade improvements for 2nd and 4th grades, respectively. A visual inspection suggests that increases were similar across groups; however, the cumulative effect of greater improvements for EP students over multiple years makes the trend worth noting. We again caution that these within-grade changes are average score improvements across years based on different groups of students.

Table 5
Weighted Mean Within-Grade Gains in Mathematics for Grades 2 through 6
            1998-1999           1999-2000           1998-2000
Grade        LEP     ALL     LEP      EP     ALL     LEP     ALL
2     M    10.08    7.72    6.91    7.63    7.29   14.37   15.04
      SD   12.82   10.02   14.14   12.01   10.18   13.47   11.66
3     M    11.82    8.17    7.35    9.60    8.38   16.43   16.56
      SD   13.23    9.87   12.92   10.59    9.38   13.08   11.03
4     M     7.84    4.88    4.95    7.33    6.91   11.30   11.83
      SD   10.33    8.71   11.39    9.51    8.35   11.08    9.69
5     M     4.90    3.78    4.49    5.68    5.34    8.44    9.09
      SD    9.10    8.31   10.03    8.84    8.18   10.37    9.39
6     M     4.66    4.30    3.67    4.89    4.11    7.55    8.49
      SD   10.06    7.54    9.66    8.61    7.51    9.51    8.51

Figure 3. Mean within-grade changes for 2nd-grade mathematics.
Figure 4. Mean within-grade changes for 4th-grade mathematics.

Cohort gains

Reading. We examined the gains made by cohorts of students across the three years of the test (see Table 6). Figures 5 and 6 show cohort gains for grades 2-4 and grades 4-6, respectively, in reading. Both LEP and EP cohorts improved substantially from 1999 to 2000; however, there was not a clear pattern with respect to which group gained more. From 2nd to 3rd grade, LEP students gained less (28.70 scaled score points) than EP students (34.21). The two groups' gains were similar to each other from 3rd to 4th grade and from 4th to 5th grade, while the LEP students gained more than EP students from 5th to 6th grade. Another interesting trend is that for all groups, cohort gains across grades are much greater in early elementary grades (2-4) than upper elementary grades (4-6).

Table 6
Weighted Mean Cohort Gains in Reading for Grades 2 through 6

             1998-1999           1999-2000                   1998-2000
Grades        LEP     ALL     LEP      EP     ALL    Grades     LEP     ALL
2 to 3 M    30.59   34.00   28.70   34.21   32.66    2 to 4   58.33   62.19
       SD    9.51    8.48    9.83    9.36    8.61             11.01    9.20
3 to 4 M    30.27   29.54   27.87   27.76   28.35    3 to 5   47.99   47.05
       SD    8.89    7.28    8.97    7.85    7.11             10.06    8.93
4 to 5 M    19.51   18.31   17.91   16.75   17.53    4 to 6   38.75   33.53
       SD    8.43    7.06    8.35    7.49    6.79             10.94    9.77
5 to 6 M    19.75   15.72   18.88   14.91   15.92
       SD   7.422.214.171.1247.00

Figure 5. Mean cohort gains for 2nd through 4th graders in reading.
Figure 6. Mean cohort gains for 4th through 6th graders in reading.

Language. The cohort gains in language were similar to those in reading (see Appendix C). There was not a consistent pattern of gains for LEP, EP, and all students. From 1999 to 2000, EP students gained more (22.48 scaled score points) than LEP students (21.02) from 2nd to 3rd grade, although not by much. The two groups gained similarly across the other grade ranges, with EP students gaining slightly more than LEP students from 4th to 5th grade. LEP students gained slightly more than EP students from 3rd to 4th grade and 5th to 6th grade.

Mathematics. The cohort gains for mathematics, displayed in Table 7, reveal a pattern consistent with that seen in the within-grade changes. From 1999 to 2000, LEP students gained less than EP students in every cohort. For example, from 2nd to 3rd grade, LEP students gained an average of 31.76 scaled score points, while EP students gained an average of 35.30 scaled score points. These patterns are suggested in Figures 7 and 8 as well, as the LEP line diverges slightly from both the EP and all lines.

Table 7
Weighted Mean Cohort Gains in Mathematics for Grades 2 through 6

             1998-1999           1999-2000                   1998-2000
Grades        LEP     ALL     LEP      EP     ALL    Grades     LEP     ALL
2 to 3 M    31.61   33.74   31.76   35.30   34.37    2 to 4   55.96   60.70
       SD   12.23   11.03   12.61   12.02   11.20             13.21   11.64
3 to 4 M    26.81   28.30   24.38   27.49   27.07    3 to 5   52.83   57.00
       SD   11.02    9.65   11.12   10.74    9.76             12.85   11.17
4 to 5 M    27.13   28.26   26.07   28.91   28.73    4 to 6   50.46   53.54
       SD    9.43    8.16    9.53    9.03    8.08             12.65   10.95
5 to 6 M    22.79   24.58   22.60   25.67   25.43
       SD    9.36    8.96    9.64    9.32    8.80

Figure 7. Mean cohort gains for 2nd through 4th graders in mathematics.
Figure 8. Mean cohort gains for 4th through 6th graders in mathematics.

Comparison of Weighted and Unweighted Means

To determine whether the use of weighted means yields substantially different results than unweighted means, we computed a limited number of unweighted means. The unweighted scaled score means for reading, language, and mathematics for grades 2 through 4 are displayed in Appendix D. The results for reading are quite interesting. The unweighted means for the LEP students are higher than the weighted means for all three grades in 1999 and 2000. In contrast, unweighted means for the EP students are lower than the weighted means for all three grades in these years. This indicates that when using unweighted means to summarize reading scores, the gap between EP and LEP students appears to be less than it is when using weighted means to estimate the average score at the student level. In other words, the gap in reading scores between LEP and EP students is wider when taking into account the number of students in each subgroup for a grade level within a school. This phenomenon is also present in the language scores, but to a lesser extent. For mathematics, the unweighted means were consistently higher than the weighted means for both LEP and EP groups.

Discussion

Our concern for basing educational policy on valid evidence of academic success motivated this commentary and analysis. We first sought to provide a summary and validity critique of writings citing Stanford-9 scores in arguments regarding the success of Proposition 227 in California. Multiple issues have threatened the validity of inferences based on the California data concerning LEP students: testing LEP students in English, failing to consider myriad alternative explanations for score trends, and generalizing from a limited and nonrandom sample of schools. In addition, we have observed errors in
quantitatively summarizing this large dataset of standardized scores. In the context of the California data, the misuse and misinterpretation of percentile rank scores, the inappropriate averaging of data across years and grades, and the failure to consider the unit of analysis when using aggregated data are common problems. As validity arguments must include judgments about the appropriateness of interpretations made from test scores (Messick, 1989), this critique should raise many doubts regarding conclusions that have previously been drawn from LEP students' scores on the Stanford-9. In addition, the topics discussed in this paper generalize readily to other applications in which large standardized datasets are cited in educational policy debates.

Our analysis of the STAR dataset differed in three important ways from previous summaries of these data: a) we used scaled scores to assess academic gain across years; b) we computed weighted means to account for the number of students represented by an aggregate score; and c) we were modest with respect to the meaning our results hold for informing language program policy. As previously reported (e.g., Butler et al., 2000), means improved for both LEP and EP students over the three-year period. Our examination of weighted means revealed that from 1998 to 2000, scores for LEP students remained substantially below the scores for EP students in schools that reported aggregate scores for both LEP and EP students, and that with few exceptions this gap is not narrowing.

A within-grade comparison of reading scores for LEP and EP students across grades 2-6 indicated that EP students made slightly larger gains than LEP students in all grades. Loose cohort gains for LEP and EP students in reading were similar; however, for some grade intervals LEP students gained slightly more, while for other intervals EP students gained slightly more.
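As a minimal sketch of the weighting described in point (b) above, the following Python fragment contrasts a weighted mean (each school-level mean weighted by the number of students it represents) with an unweighted mean of school-level means. The three schools, their means, and their group sizes are hypothetical values chosen only for illustration; they are not drawn from the STAR data, and the function names are assumptions rather than the procedure actually used in the analysis.

```python
def weighted_mean(means, counts):
    """Estimate a student-level mean from school-level means and group sizes."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must sum to a positive number")
    return sum(m * n for m, n in zip(means, counts)) / total


def unweighted_mean(means):
    """Average the school-level means, treating every school equally."""
    return sum(means) / len(means)


# Hypothetical schools: (LEP mean, LEP n, EP mean, EP n). Illustrative only.
schools = [
    (545.0, 200, 575.0, 50),
    (565.0, 20, 590.0, 400),
    (555.0, 80, 585.0, 150),
]

lep_means = [s[0] for s in schools]
lep_n = [s[1] for s in schools]
ep_means = [s[2] for s in schools]
ep_n = [s[3] for s in schools]

# EP minus LEP gap under each summary method.
gap_unweighted = unweighted_mean(ep_means) - unweighted_mean(lep_means)
gap_weighted = weighted_mean(ep_means, ep_n) - weighted_mean(lep_means, lep_n)

print(f"Unweighted gap: {gap_unweighted:.2f}")
print(f"Weighted gap:   {gap_weighted:.2f}")
```

In this made-up example the larger LEP groups have lower means and the larger EP groups have higher means, so the weighted gap exceeds the unweighted gap. That is the same direction as the reading pattern described under the comparison of weighted and unweighted means, but the magnitudes here carry no empirical meaning.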
The results for language arts subtest scores were similar to those in reading. In mathematics, an examination of the within-grade mean gain scores revealed that LEP students gained slightly less than EP students in all grades from 1999 to 2000. For 1999-2000 cohort gains, LEP students gained less than EP students in every cohort. Because it was impossible to follow an individual student's growth across multiple years, the comparison of groups from year to year undoubtedly involved the comparison of different students. This problem was complicated by redesignation of students from LEP status to EP status. Not only was it unclear how many students were redesignated, but redesignation criteria differed across districts. Finally, there are many factors that have been repeatedly shown to influence student achievement. For example, LEP students may differ, on average, from their EP peers with respect to socioeconomic status and mobility, but there was no way to control for such differences using these data. These findings should be regarded as descriptive summaries of the California STAR data, and we caution that these must be interpreted in light of the substantial limitations of these data for research purposes.

Whatever the score differences between LEP and EP students, judgments of the effects of language program policy on LEP student achievement are not warranted by these data. To further address the question of performance differences between language programs on a large scale, we attempted to use language census data (California Language Census Data Files, 2000) and academic performance index (API; California Academic Performance Index Data Files, 2000) data to tie schools to specific program types. Most schools reported having students in nearly every program type. Because we could not identify individual students, we could not parse the data from schools into program type. 
We then attempted to compare schools that reported 100% of their LEP students in bilingual programs in 1998, 1999, and 2000 with schools that
reported 100% of their LEP students in English immersion programs in those years; however, this resulted in such a drastic reduction in data that we did not feel quantitative comparisons were warranted (only six schools reported 100% of LEP students in bilingual programs over the three years).

Further investigation is also needed to explore the differences in score trends observed when using unweighted versus weighted means, most notably the underestimation of EP and LEP mean differences with unweighted means. Factors associated with school size may offer meaning to these patterns. We attempted to use API data to investigate the relationship between score trends and mobility, socioeconomic status, and class size. However, due to the aggregated nature of these data we were only able to reach very general and well-known conclusions, such as that schools with lower average SES tended to have lower test scores.

The evaluation of policy outcomes is a high-stakes activity requiring more thoughtful and detailed analyses than computing overall group NPR means. In the context of school accountability systems, Camilli and Bulkley (2001) summarized that tying accountability to single achievement outcomes does not automatically shed light on why certain changes were noted. They also argued that appropriate and informative use of statistical models for evaluating policy outcomes requires appreciable technical sophistication. We concur with these notions and find them relevant for our context of evaluating language program effects for LEP students. There is a strong need for research that is well-planned and well-executed that seeks to evaluate language program effects with better controls.

We offer several conclusions based on our validity critique and analysis of the Stanford-9 data. First, the scores of LEP students are not catching up to those of their English-proficient peers in any consistent manner across grades and subjects. 
Second, the success or failure of programs to remedy the disparity between LEP and EP students should be judged by means other than a single academic achievement test administered in English. The construct of language on an achievement test is qualitatively different from language proficiency as measured on an assessment of English as a second language. Using test scores for any purpose requires that we consider the appropriateness of the scores for the intended use and provide evidence to justify this use. In all assessments, not only should the psychometric validity of the tests be considered, but the potential consequences of the test's use must also be judged. Given the changing demographics of the United States, educators, researchers, and policymakers must join forces to establish policy that will provide maximal opportunity for LEP students to learn.

Notes

1. We conducted a full-text search of the NEXIS/LEXIS Academic Universe archive of major U.S. newspapers using the search terms "bilingual education, test scores, California." "Major U.S. newspapers" are defined by NEXIS/LEXIS as U.S. newspapers listed in the top 50 in circulation in Editor & Publisher Year Book. In a manual inspection of the results, we excluded any publication that did not mention the increase in Stanford-9 test scores in relation to Proposition 227 and also included a Newsday article (Willen & Kowal, 2000, p. A10) that refers to the event as "a recent California study."

2. The "language" subtest of the Stanford-9 measures comprehensive language arts proficiency and is intended for use with English-proficient students; therefore it should not
be regarded as an assessment of English language proficiency.

References

Alamillo, L., & Viramontes, C. (2000). Reflections from the classroom: Teacher perspectives on the implementation of Proposition 227. Bilingual Research Journal, 24, 155-168.
American Educational Research Association, American Psychological Association, National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.
Amselle, J., & Allison, A. C. (2000, August). Two years of success: An analysis of California test scores after Proposition 227. READ Abstracts [On-line]. Available: http://www.ceousa.org/html/227rep.html
Anastasi, A. (1981). Coaching, test sophistication, and developed abilities. American Psychologist, 36, 1086-1093.
August, D., & Hakuta, K. (Eds.). (1998). Educating language-minority children. Washington, DC: National Academy Press.
Bangert-Drowns, R. L., Kulik, J. A., & Kulik, C. C. (1983). Effects of coaching programs on achievement test performance. Review of Educational Research, 53, 571-585.
Berliner, D. C. (1988). Meta-comment: A discussion of critiques of L. M. Dunn's monograph Bilingual Hispanic Children on the U.S. Mainland. Hispanic Journal of Behavioral Sciences, 10, 273-300.
Berliner, D., & Biddle, B. (1995). The manufactured crisis: Myths, fraud, and the attack on America's schools. Reading, MA: Addison-Wesley.
Bond, L. (1989). The effects of special preparation on measures of scholastic ability. In R. L. Linn (Ed.), Educational Measurement (3rd ed.) (pp. 13-103). New York: Macmillan Publishing Company.
Butler, Y. G., Orr, J. E., Gutierrez, M. B., & Hakuta, K. (2000). Inadequate conclusions from an inadequate assessment: What can SAT-9 scores tell us about the impact of Proposition 227 in California. Bilingual Research Journal, 24, 141-154.
California Academic Performance Index Data Files. (2000). Sacramento, CA: California Department of Education. 
[On-line database, 8/15/2000] Available: http://api.cde.ca.gov/datafiles.html
California Education Code (2001). English language education for immigrant children. Division 1, Part 1, Chapter 3. [On-line]. Available: http://www.leginfo.ca.gov/calaw.html
California Language Census Data Files. (2000). Sacramento, CA: California Department of Education. [On-line database, 8/15/2000] Available: http://www.cde.ca.gov/demographics/files/census.htm
California Standardized Testing and Reporting (2000). Sacramento, CA: California
Department of Education. [On-line database, 8/15/2000] Available: http://star.cde.ca.gov
California State Board of Education. (2000). State Monetary Awards Programs Based on the Academic Performance Index. [On-line] Available: http://www.cde.ca.gov/ope/ae/pages/certstaffact.html
Californians Together. (2000, August 21). Schools with large enrollments of English learners and substantial bilingual instruction are effective in teaching English [On-line]. Available: http://www.bilingualeducation.org/news.htm
Camilli, G., & Bulkley, K. (2001, March 4). Critique of "An evaluation of the Florida A-Plus Accountability and School Choice Program." Educational Policy Analysis Archives, 9 [On-line journal] Available: http://epaa.asu.edu/
Campbell, D. T., & Kenny, D. A. (1999). A primer on regression artifacts. New York: Guilford Press.
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Campbell, P. R. (1994). Population projections for states by age, race, and sex: 1993 to 2020 (U.S. Bureau of the Census Current population reports, pp. 17-23). Washington, DC: Government Printing Office.
Crawford, J. (1999). Bilingual education: History, politics, theory, and practice. Los Angeles: Bilingual Education Services.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Holt, Rinehart, and Winston.
Cronbach, L. J. (1960). Essentials of psychological testing (2nd ed.). New York: Harper.
de Cos, P. (1999). Educating California's immigrant children: Overview of bilingual education (CRB-99-010). California Research Bureau.
Duran, R. P. (1989). Testing of linguistic minorities. In R. L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 13-103). New York: Macmillan Publishing Company.
Eastin (2000). Eastin releases additional STAR 2000 test results [On-line] Available: http://www.cde.ca.gov/statetests/star/pressrelease2000b.pdf
English for the Children. 
(2000, August 14). After two years of Prop. 227 English immersion, a huge rise in California's immigrant scores; Pro-bilingual education districts lag behind. [On-line]. Available: http://www.yeson227.org/releases.html
Gándara, P. (2000). In the aftermath of the storm: English learners in the post-227 era. Bilingual Research Journal, 24, 1-14.
Gándara, P., Maxwell-Jolly, J., Garcia, E., Asato, J., Gutierrez, K., Stritikus, T., & Curry, J. (2000, April). The initial impact of Proposition 227 on the instruction of English learners. [On-line]. Available: http://lmri.ucsb.edu/RESDISS/prop227effects.pdf
Garcia, E. E., & Curry-Rodriguez, J. E. (2000). The education of Limited English Proficient students in California schools: An assessment of the influence of Proposition 227 in selected districts and schools. Bilingual Research Journal, 24, 15-36.
Genesee, F. (1984). On Cummins' theoretical framework. In C. Rivera (Ed.), Language proficiency and academic achievement (pp. 20-27). Clevedon, UK: Multilingual Matters.
Glass, G. V, & Hopkins, K. D. (1996). Statistical methods in education and psychology (3rd ed.). Boston: Allyn & Bacon.
Gonzalez, D. (2000, November 8). Bilingual education gets rebuke from state voters. Arizona Republic, p. EX1.
Greene, J. P. (2001). An evaluation of the Florida A-Plus Accountability and School Choice Program. New York: The Manhattan Institute.
Groves, M. (2000, August 15). English skills still the key in test scores; Stanford 9: Results show vast divide in achievement between students who have language fluency and those who don't. Los Angeles Times, p. A3.
Gutiérrez, K. D., Asato, J., & Baquedano-Lopez, P. (2000). "English for the Children": The new literacy of the old world order, language policy, and educational reform. Bilingual Research Journal, 24, 87-112.
Hakuta, K. (2001). Silence from Oceanside and the future of bilingual education. Stanford University manuscript. Available at http://www.stanford.edu/~hakuta/SAT9/Silence%20from%20Oceanside.htm
Harcourt Brace Educational Measurement. (1997a). Sample questions for the Stanford Achievement Test, Ninth Edition. San Antonio, TX: Author. [On-line]. Available: http://www.cde.ca.gov/statetests/star/stanford9.pdf
Harcourt Brace Educational Measurement. (1997b). Spring norms book. San Antonio, TX: Author.
Harcourt Brace Educational Measurement. (1997c). Technical data report. San Antonio, TX: Author.
Hayes, K., & Salazar, J. (2001). 
Evaluation of the Structured English Immersion Program Final Report: Year I. Language Acquisition and Literacy, Program Evaluation and Research Branch, Los Angeles Unified School District.
Juffs, A., & Harrington, M. (1995). Parsing effects in second language sentence processing: Subject and object asymmetries in wh-extraction. Studies in Second Language Acquisition, 17(4), 483-516.
Juffs, A., & Harrington, M. (1996). Garden path sentences and error data in second language sentence processing. Language Learning, 46(2), 283-326.
Kerr, J. (2000, August 15). English learners show gains. Associated Press.
Krashen, S. (1996). Under attack: The case against bilingual education. Culver City, CA: Language Education Associates.
Linn, R., Graue, E., & Sanders, N. (1990). Comparing state and district test results to national norms: The validity of claims that "everyone is above average". Educational Measurement: Issues and Practice, 5-14.
Macías, R. F., Nishikawa, S., & Venegas, J. (1998). Summary report of the survey of the states' limited English proficient students and available educational programs and services, 1996-97. Washington, DC: National Clearinghouse for Bilingual Education.
Maxwell-Jolly, J. (2000). Factors influencing implementation of mandated policy change: Proposition 227 in seven Northern California school districts. Bilingual Research Journal, 24, 37-56.
McNeil, L. (2000). Contradictions of school reform: Educational costs of standardized testing. New York: Routledge.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed.) (pp. 13-103). New York: Macmillan Publishing Company.
Meyer, M., & Fienberg, S. (Eds.). (1992). Assessing evaluation studies: The case of bilingual education strategies. Washington, DC: National Academy Press.
Moran, C. (2000, July 23). School test gains may be illusory, critics say. San Diego Union-Tribune, p. A1.
Moran, C., & Spielvogel, J. (2000, August 15). Students share in test gains; however, the gap between advantaged, disadvantaged has widened, data show. San Diego Union-Tribune, p. B1.
Muthén, B. O. (1991). Multilevel factor analysis of class and student achievement components. Journal of Educational Measurement, 28, 338-354.
Myles, F. (1996). The acquisition of interrogatives by English learners of French: The role played by structural distance. Jyvaskyla Cross Language Studies, 17, 195-208.
New York Times News Service. (2000, August 20). Spanish-speaking pupils gain after cored move to English; students in California improve strikingly with end of bilingual education. 
The Baltimore Sun, p. 3A.
Pilkington, C. L., Piersel, W. C., & Ponterotto, J. G. (1988). Home language as a predictor of first-grade achievement for Anglo- and Mexican-American children. Contemporary Educational Psychology, 13, 1-14.
Public Schools Accountability Act. (1999). [On-line] Available: http://www.cde.ca.gov/psaa/
Ramirez, D., Pasta, D., Yuen, S., Billings, D., & Ramey, D. (1991). Final report. Longitudinal study of structured English immersion strategy, early-exit and late-exit transitional bilingual education programs for language-minority children (Vols. 1 & 2).
San Mateo, CA: Aguirre International.
Rogosa, D. R. (1999). How accurate are the STAR national percentile rank scores for individual students? An interpretive guide. Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing [On-line]. Available: http://cresst96.cse.ucla.edu/CRESST/Reports/drrguide.html
Rosenthal, A., Milne, A., Ellman, F., Ginsburg, A., & Baker, K. (1983). A comparison of the effects of language background and socioeconomic status on achievement among elementary-school students. In K. Baker & E. de Kanter (Eds.), Bilingual education: A reappraisal of federal policy. Lexington, MA: Lexington Books.
Rossell, C., & Baker, K. (1996). The educational effectiveness of bilingual education. Research in the Teaching of English, 30, 7-74.
Sacks, P. (1999). Standardized minds: The high price of America's testing culture and what we can do to change it. Cambridge: Perseus Books.
Schirling, E., Contreras, F., & Ayala, C. (2000). Proposition 227: Tales from the schoolhouse. Bilingual Research Journal, 24, 127-140.
Simon, T. F., Fico, F., & Lacy, S. (1989). Covering conflict and controversy: Measuring balance, fairness, defamation. Journalism Quarterly, 66, 427-434.
Singer, E., & Endreny, P. (1993). Reporting on risk: How the mass media portray accidents, diseases, and other hazards. New York: Russell Sage Foundation.
Stabler, E. P., Jr. (1986). Possible contributing factors in test item difficulty: Research memorandum. Princeton, NJ: Educational Testing Service.
Stabler, E. P., Jr. (1994). The finite connectivity of linguistic structure. In C. Clifton, L. Frazier, & K. Rayner (Eds.), Perspectives on sentence processing. New York, NY: Erlbaum.
Steinberg, J. (2000a, August 20). Increase in test scores counters dire forecasts for bilingual ban. New York Times, p. A1.
Steinberg, J. (2000b, August 20). English-only classes appear a success; after ending bilingual programs, California schools see boosts in test scores. 
Milwaukee Journal Sentinel, p. 3A.
Steinberg, J. (2000c, August 20). Rising test scores shock critics of English-only law. The Plain Dealer, p. 21A.
Tankard, J., & Ryan, M. (1974). News source perceptions of accuracy in science coverage. Journalism Quarterly, 51, 219-225, 334.
Thompson, M. S. (April, 2000). Investigating dependency at the classroom and school levels in educational research: A multilevel approach. Presented at the Annual Meeting of the American Educational Research Association, New Orleans, Louisiana.
Ulibarri, D. M., Spencer, M. L., & Rivas, G. A. (1981). Language proficiency and
academic achievement: A study of language proficiency tests and their relationship to school ratings as predictors of academic achievement. NABE Journal, 5, 47-80.
United States Department of Education. (1994). Summary of the bilingual education state educational agency program survey of states' limited English proficient persons and available educational services (1992-1993): Final report. Arlington, VA: Development Associates.
United States Department of Education. (1997). 1993-94 schools and staffing survey: A profile of policies and practices for limited English proficient students: Screening methods, program support, and teacher training. Washington, D.C.: U.S. Department of Education. [On-line]. Available: http://nces.ed.gov/pubs/97472.pdf
Weiss, C., & Singer, E. (1987). Reporting of social science in the national media. New York: Russell Sage Foundation.
Willen, L., & Kowal, J. (2000, September 9). Study: Bilingual ed falling short. Newsday, p. A10.
Willig, A. (1985). A meta-analysis of selected studies on the effectiveness of bilingual education. Review of Educational Research, 55, 269-318.
Wright, C. R., & Michael, W. B. (1989). The relationship of average scores of seniors in 172 southern California high schools on a statewide standardized achievement test to percentages of families receiving federal aid and to percentages of limited-English proficient speakers. Educational and Psychological Measurement, 49, 937-944.

About the Authors

Marilyn S. Thompson

Marilyn Thompson is Assistant Professor of Measurement, Statistics, and Methodological Studies in the Division of Psychology in Education at Arizona State University. Her research interests include methodological techniques for large data set analysis, applications of structural equation modeling and multilevel modeling, and policy pertaining to the assessment of English learners. Dr. 
Thompson is Director of the EDCARE laboratory at ASU, providing measurement, statistics, and evaluation services to the educational community. She previously taught physics and chemistry and was science department chair in an urban high school, giving her a first-hand perspective on the role of classroom assessment. She may be reached at m.thomp email@example.com.

Kristen E. DiCerbo

Kristen DiCerbo is a Ph.D. student in the School Psychology program in the Division of Psychology in Education, College of Education at Arizona State University. Her research interests include psychoeducational assessment, especially the assessment of minority children, and internalizing problems in children. She is currently a psychology intern in a district serving a large population of English learners. Ms. DiCerbo was formerly Editor of Current Issues in Education. She may be reached at firstname.lastname@example.org.

Kate Mahoney

Kate Mahoney is a Ph.D. student in Curriculum and Instruction with a concentration in bilingual education at Arizona State University. Her teaching experience includes teaching
37 of 48 mathematics with English language learners. Her res earch interests include psychometrics and validity issues concerned with testing English language learners. Jeff MacSwanJeff MacSwan is an Assistant Professor of Language and Literacy in the College of Education at Arizona State University. He has publi shed articles in Hispanic Journal of Behavioral Sciences, Bilingual Review, Bilingual Re search Journal, Bilingualism: Language and Cognition and Southwest Journal of Linguistics He is the author of A Minimalist Approach to Intrasentential Code Switchi ng (Garland, 1999), and editor of the forthcoming volume Grammatical Theory and Bilingual Codeswitching (MIT Press). He is chair of the organizing committee of the Fourth International Symposium on Bilingualism to be held at Arizona State University in April, 2003. He may be reached at email@example.com.Appendices Appendix A Weighted Mean Scaled Scores199819992000 SubjectGradeLEPAllLEPEPAllLEPEPAll Reading 2M542.93570.14550.11578.98575.35555.91584.00580.64 SD15.1323.3515.2318.5523.2315.4018.2523.17 N23884843255725573319274127413356 3M567.19599.53573.53606.94604.06578.52611.32608.33 SD13.5125.9713.0019.1525.1712.5018.3724.57 N24454874260826084936278827884979 4M593.31626.58597.35630.15629.20601.07634.35632.77 SD11.9125.1012.0818.1924.5211.8417.4724.03 N22954837242224224894260626064958 5M610.03643.35612.81646.04644.92615.14647.72646.84 SD9.9622.4210.0515.6621.7910.0015.0521.36 N21474798229322934868243624364924 6M624.30655.79627.47660.78657.78628.96662.54659.37 SD9.0419.559.1214.7018.999.4914.6418.98 N14493395152215223319164616463356 7M633.17670.85635.71679.00672.17638.05680.54674.06 SD10.3420.319.7515.4919.6110.8515.7219.74
38 of 48 N885177694194117699979971797 8M649.52684.52651.42692.03686.13653.21693.55687.63 SD9.0918.328.3713.8117.759.0414.1917.79 N860180391991918199829821851 9M650.43683.79651.81690.86684.25652.97691.93685.67 SD7.81116.917.5213.9616.837.3114.1016.73 N602128864164112886936931321 10M654.50689.40656.25697.05689.85656.93697.56691.05 SD8.6516.747.8113.8416.507.8814.2916.70 N602144061461414357047041479 11M662.13697.18663.57703.39696.76664.35703.50697.81 SD9.3516.358.2613.2615.838.2913.9116.07 N571139559859813936616611446 199819992000 SubjectGradeLEPAllLEPEPAllLEPEPAll Math 2M547.65564.58555.79572.58571.91562.12579.37579.14 SD15.7520.9615.9719.2021.0716.4719.0521.38 N25404875255525574914273727414969 3M571.11590.43579.49598.43598.36587.14606.81606.51 SD15.5822.3615.3019.4521.9515.2519.0721.66 N24934883260226084938278327884980 4M592.14614.08597.92618.28618.95603.46625.57625.71 SD13.0321.5613.4218.0921.2413.4317.7721.23 N23854846241924204902260426064959 5M614.86638.55619.22641.50642.36623.74646.64647.75 SD11.8421.0911.9916.8520.7312.1516.6320.92 N22094808228622924871243124344927 6M629.16656.14633.79661.83660.34637.24665.71664.57 SD12.4521.6712.9918.0821.6513.6218.3022.03 N14863400152015223322164416463361 7M643.68667.56646.47674.23670.16648.62676.83673.15 SD10.9019.1610.6117.0918.8511.4417.6119.61 N886177794194117729969971793
39 of 48 8M652.85676.41655.18682.99679.36656.96685.46681.71 SD11.7018.7410.8316.6518.8011.4317.0919.10 N868180391291718109799811847 9M667.00688.24668.78698.79689.71670.22696.90692.25 SD11.2016.9810.2814.9716.769.9115.4217.08 N617128563864112956906931326 10M677.47694.70679.93701.79696.78680.26702.34698.09 SD12.1715.6611.1714.5315.7610.8115.0215.98 N609143860961414336967041478 11M680.76699.80684.06707.38702.07684.69708.36703.99 SD13.6617.5213.0516.1817.5712.8416.8918.03 N575139759459813926586611439 199819992000 SubjectGradeLEPAllLEPEPAllLEPEPAll Language 2M558.26580.56563.58588.20585.12568.34592.11589.54 SD13.0920.7913.3717.1120.7513.7817.1520.93 N24834868254625564910273027404969 3M571.63596.13578.35603.84602.17584.12609.41607.56 SD13.3322.5713.4918.1422.2713.6417.6321.92 N24424873258626014933277127864977 4M595.30620.61598.68623.71622.82602.67628.02626.78 SD11.9220.8912.2516.6020.6211.9315.9520.11 N23674839240724134895259626044960 5M607.68634.34610.14636.88636.29613.07639.29639.03 SD10.7620.4910.9615.8820.2911.0215.5920.15 N21904805227822914871242424331927 6M617.70643.43620.59648.11645.75622.65650.80648.31 SD9.6417.6810.1114.3617.7510.5514.5617.96 N14723394151315213314163416463354 7M626.52655.69628.79663.36657.72631.26665.77660.41 SD9.7417.659.3414.5617.5710.1814.7717.78 N883176893894117689919961790 8M632.29661.89633.81669.37664.12635.63671.49666.25
           SD  9.07    17.89   8.74    14.86   18.05   9.28    15.21   18.22
       N   866     1799    913     918     1802    978     982     1843
9      M   642.68  668.38  643.96  676.22  669.97  644.84  678.01  672.07
       SD  9.29    15.47   8.98    13.19   15.67   9.07    13.76   16.08
       N   608     1285    633     640     1280    688     693     1318
10     M   639.13  669.03  640.67  678.03  670.48  641.48  679.24  672.59
       SD  9.64    17.34   8.64    14.64   17.43   8.95    15.26   17.67
       N   597     1431    606     613     1425    693     703     1463
11     M   650.25  678.16  652.09  686.12  679.64  652.70  686.90  681.33
       SD  9.42    15.69   8.83    13.37   15.99   8.91    14.46   16.44
       N   565     1386    588     597     1383    658     661     1438

Subject: Spelling
           1998            1999                    2000
Grade      LEP     All     LEP     EP      All     LEP     EP      All
2      M   533.17  558.19  542.89  568.99  565.13  550.35  575.46  572.04
       SD  19.86   23.42   19.94   19.41   23.26   20.75   19.27   23.49
       N   2502    4872    2553    2553    4912    2739    2739    4970
3      M   567.11  589.48  574.74  598.86  595.70  582.39  604.80  601.92
       SD  17.49   20.30   16.97   16.15   19.96   16.67   15.92   19.38
       N   2487    4883    2603    2603    4936    2785    2785    1981
4      M   583.53  612.64  588.39  617.46  615.69  593.67  623.30  620.95
       SD  14.27   23.21   14.58   17.86   22.60   14.21   17.27   22.14
       N   2384    4849    2420    2420    4900    2604    2604    4961
5      M   603.56  629.88  606.59  634.13  632.42  610.18  636.86  635.52
       SD  11.46   19.07   11.63   14.77   18.90   11.73   14.16   18.61
       N   2200    4810    2286    2286    4874    2430    2430    4929
6      M   611.54  642.94  615.15  649.85  645.85  618.38  653.33  649.06
       SD  11.76   20.18   12.39   16.11   20.07   12.68   16.17   20.33
       N   1486    3400    1521    1521    3323    1645    1645    3361
7      M   623.16  657.22  625.20  665.34  658.31  627.47  667.90  660.98
       SD  11.08   18.26   10.91   14.55   17.92   12.25   14.86   18.40
       N   887     1775    938     938     1771    995     995     1798
8      M   641.29  668.60  642.20  675.44  670.07  643.64  677.14  671.75
       SD  8.00    14.65   7.87    12.03   14.68   8.27    12.32   14.82
           N   870     1807    916     916     1816    978     978     1848

Subject: Science
           1998            1999                    2000
Grade      LEP     All     LEP     EP      All     LEP     EP      All
9      M   651.24  670.30  652.79  675.05  671.21  653.62  675.96  672.39
       SD  6.60    12.30   6.05    10.35   12.00   5.95    10.47   11.98
       N   616     1281    639     640     1290    689     693     1325
10     M   656.39  676.94  657.96  682.40  678.00  657.81  682.15  678.20
       SD  7.20    13.22   6.04    11.54   13.07   6.03    11.85   13.15
       N   606     1428    609     614     1434    691     704     1471
11     M   659.77  681.99  661.60  687.48  683.23  662.08  687.60  683.98
       SD  7.41    13.66   6.79    12.15   12.00   6.47    12.50   13.80
       N   571     1388    593     598     1387    658     661     1436

Subject: Social Studies
9      M   632.42  649.34  633.57  652.97  649.61  634.83  653.88  650.91
       SD  5.09    11.28   4.54    9.69    10.89   4.01    9.54    10.65
       N   614     1279    638     641     1283    689     693     1323
10     M   633.48  652.44  634.03  656.65  652.61  634.04  656.44  652.87
       SD  5.75    11.72   4.93    10.24   11.51   4.71    10.40   11.46
       N   604     1425    608     614     1428    693     704     1466
11     M   645.77  665.91  647.20  671.00  666.65  647.86  670.94  667.45
       SD  6.56    12.49   5.81    10.39   12.00   5.70    10.74   12.02
       N   573     1384    590     598     1379    657     661     1434

Appendix B
Weighted Mean Within-grade Gains

Subject: Reading
           1998-1999       1999-2000               1998-2000
Grade      LEP     ALL     LEP     EP      ALL     LEP     ALL
2      M   7.84    5.88    4.20    4.40    5.39    12.90   11.21
       SD  9.82    8.89    10.54   9.67    8.31    10.88   9.66
3      M   8.30    5.02    3.12    4.72    4.57    11.58   9.60
           SD  10.87   8.36    10.98   9.41    7.80    10.24   8.71
4      M   6.47    2.69    2.16    3.58    3.79    7.60    6.48
       SD  10.27   8.02    10.07   8.69    7.48    9.65    8.44
5      M   3.81    1.65    1.27    1.88    1.83    4.54    3.44
       SD  7.68    7.09    7.92    7.75    6.90    8.88    7.56
6      M   3.56    2.12    1.96    1.97    1.51    4.18    3.68
       SD  8.01    5.93    6.84    6.52    5.52    7.35    6.23
7      M   3.64    1.37    3.94    2.46    1.96    4.26    3.30
       SD  7.44    5.10    8.65    5.61    4.77    8.39    5.45
8      M   3.55    1.62    2.35    2.37    1.52    3.46    3.15
       SD  6.97    4.59    7.85    5.94    4.57    8.08    5.25
9      M   1.07    0.41    1.29    1.32    1.35    2.61    1.81
       SD  6.07    4.11    5.62    4.23    3.99    6.50    4.20
10     M   1.30    0.44    0.8126.96.36.1991.59
       SD  6.67    4.44    6.60    4.68    4.42    7.99    4.56
11     M   1.52    -0.36   0.91    0.98    0.92    2.31    0.60
       SD  7.28    5.15    7.38    5.30    5.26    7.65    5.23

Subject: Math
           1998-1999       1999-2000               1998-2000
Grade      LEP     ALL     LEP     EP      ALL     LEP     ALL
2      M   10.08   7.72    6.91    7.63    7.29    14.37   15.04
       SD  12.82   10.02   14.14   12.01   10.18   13.47   11.66
3      M   11.82   8.17    7.35    9.60    8.38    16.43   16.56
       SD  13.23   9.87    12.92   10.59   9.38    13.08   11.03
4      M   7.84    4.88    4.95    7.33    6.91    11.30   11.83
       SD  10.33   8.71    11.39   9.51    8.35    11.08   9.69
5      M   4.90    3.78    4.49    5.68    5.34    8.44    9.09
       SD  9.10    8.31    10.03   8.84    8.18    10.37   9.39
6      M   4.66    4.30    3.67    4.89    4.11    7.55    8.49
       SD  10.06   7.54    9.66    8.61    7.51    9.51    8.51
7      M   3.80    2.60    2.04    3.13    3.08    4.61    5.69
       SD  7.73    5.25    7.72    6.30    5.37    7.01    5.97
8      M   4.09    2.97    1.92    2.73    2.40    4.19    5.38
       SD  7.16    5.04    7.88    6.85    5.27    7.03    5.93
9      M   1.71    1.46    1.68    2.53    2.52    3.52    3.97
       SD  5.75    4.30    4.96    4.60    4.23    6.05    4.54
10     M   2.36    2.14    0.6188.8.131.52.40
       SD  5.82    4.01    5.28    4.48    3.99    6.50    4.33
11     M   3.42    2.35    0.88    1.92    1.83    4.30    4.22
       SD  6.64    4.68    6.30    5.23    4.82    6.69    5.14

Subject: Language
           1998-1999       1999-2000               1998-2000
Grade      LEP     ALL     LEP     EP      ALL     LEP     ALL
2      M   5.69    5.85    3.82    4.07    4.55    9.95    9.46
       SD  9.25    8.24    11.01   9.05    8.07    10.34   9.26
3      M   8.34    6.42    3.54    6.02    5.62    12.88   12.00
       SD  10.92   8.32    10.58   9.77    7.96    10.59   9.02
4      M   5.56    2.24    1.83    3.43    4.15    7.30    6.35
       SD  9.62    7.56    10.09   8.41    7.43    9.93    8.20
5      M   3.25    1.99    2.02    2.13    2.66    4.85    4.59
       SD  8.57    7.66    8.71    8.25    7.56    9.74    8.17
6      M   3.08    2.37    1.99    3.19    2.49    4.50    4.93
       SD  8.78    6.62    7.80    6.77    6.31    7.92    6.96
7      M   3.48    2.04    3.11    2.75    2.76    4.24    4.80
       SD  7.50    5.15    8.82    5.84    5.02    7.66    5.58
8      M   2.92    2.24    2.56    2.49    2.18    3.31    4.41
       SD  5.97    4.96    8.55    7.37    5.10    7.55    5.63
9      M   1.01    1.5184.108.40.206.413.67
       SD  5.85    4.21    5.43    4.54    4.24    6.72    4.52
10     M   1.24    1.46    1.24    2.06    2.05    2.53    3.53
       SD  6.08    4.89    6.08    5.36    4.95    7.38    5.06
11     M   2.00    1.56    0.61    1.68    1.58    2.62    3.21
       SD  6.30    4.67    6.59    5.22    5.03    7.01    5.09

Subject: Spelling
           1998-1999       1999-2000               1998-2000
Grade      LEP     ALL     LEP     EP      ALL     LEP     ALL
2      M   9.49    7.46    7.70    5.89    6.96    16.91   14.36
       SD  12.38   9.80    11.84   25.10   9.46    13.83   10.96
3      M   7.80    6.61    8.06    5.55    6.41    15.76   13.01
       SD  10.99   8.59    10.66   20.02   8.50    12.16   9.46
4      M   4.44    3.12    5.68    4.91    5.48    10.14   8.60
       SD  9.68    8.45    9.91    20.82   8.17    10.43   9.14
5      M   2.80    2.5220.127.116.116.125.56
       SD  9.04    7.36    8.60    21.36   7.21    9.80    7.91
6      M   3.05    2.98    2.94    2.418.104.22.168
       SD  8.75    7.17    8.71    24.24   6.99    9.50    7.75
7      M   1.65    1.06    1.98    2.42    2.78    3.68    3.86
       SD  7.25    4.89    7.67    14.15   4.92    8.21    5.42
8      M   0.83    1.44    1.23    0.97    1.70    2.20    3.14
       SD  5.91    4.49    5.99    22.11   4.57    6.99    5.25

Subject: Science
9      M   1.38    0.87    1.06    0.89    1.15    2.56    2.03
       SD  5.21    3.60    4.59    12.65   3.28    5.53    3.69
10     M   1.38    1.10    0.01    -0.35   0.14    1.55    1.24
       SD  5.34    4.03    4.60    17.65   3.76    5.99    4.14
11     M   1.86    1.33    0.46    0.20    0.65    2.34    2.02
       SD  5.25    4.42    5.10    16.08   4.39    5.65    4.56

Subject: Social Studies
9      M   1.08    0.22    1.46    0.93    1.26    2.58    1.49
       SD  4.53    3.10    3.77    12.45   2.97    4.36    3.20
10     M   0.35    0.14    0.09    -0.19   0.23    0.61    0.37
       SD  4.56    3.43    4.05    15.78   3.22    5.09    3.43
11     M   1.58    0.77    0.78    0.39    0.71    2.16    1.52
       SD  5.46    4.21    5.30    14.81   3.95    5.57    4.26

Appendix C
Weighted Mean Cohort Gains

Subject: Reading
             1998-1999       1999-2000                       1998-2000
Grades       LEP     ALL     LEP     EP      ALL     Grades  LEP     ALL
2 to 3   M   30.59   34.00   28.7    34.21   32.66   2 to 4  58.33   62.19
         SD  9.51    8.48    9.83    9.36    8.61            11.01   9.20
3 to 4   M   30.27   29.54   27.87   27.76   28.35   3 to 5  47.99   47.05
         SD  8.89    7.28    8.97    7.85    7.11            10.06   8.93
4 to 5   M   19.51   18.31   17.91   16.75   17.53   4 to 6  38.75   33.53
         SD  8.43    7.06    8.35    7.49    6.79            10.94   9.77
5 to 6   M   19.75   15.72   18.88   14.91   15.92
         SD  7.422.214.171.1247.00

Subject: Math
             1998-1999       1999-2000                       1998-2000
Grades       LEP     ALL     LEP     EP      ALL     Grades  LEP     ALL
2 to 3   M   31.61   33.74   31.76   35.30   34.37   2 to 4  55.96   60.70
         SD  12.23   11.03   12.61   12.02   11.20           13.21   11.64
3 to 4   M   26.81   28.30   24.38   27.49   27.07   3 to 5  52.83   57.00
         SD  11.02   9.65    11.12   10.74   9.76            12.85   11.17
4 to 5   M   27.13   28.26   26.07   28.91   28.73   4 to 6  50.46   53.54
         SD  9.43    8.16    9.53    9.03    8.08            12.65   10.95
5 to 6   M   22.79   24.58   22.60   25.67   25.43
             SD  9.36    8.96    9.64    9.32    8.80

Subject: Spelling
             1998-1999       1999-2000                       1998-2000
Grades       LEP     ALL     LEP     EP      ALL     Grades  LEP     ALL
2 to 3   M   41.78   37.55   40.24   35.15   36.66   2 to 4  61.12   62.30
         SD  11.00   9.19    10.68   9.88    9.17            12.79   9.55
3 to 4   M   21.57   26.03   19.51   26.15   24.82   3 to 5  43.41   45.79
         SD  10.08   8.04    10.13   8.91    8.21            11.73   8.38
4 to 5   M   23.27   19.82   22.04   18.34   19.77   4 to 6  37.95   37.41
         SD  8.70    7.91    9.09    8.46    7.72            11.58   9.49
5 to 6   M   13.88   17.32   13.88   18.61   18.04
         SD  8.89    7.47    9.28    7.96    7.54

Subject: Language
             1998-1999       1999-2000                       1998-2000
Grades       LEP     ALL     LEP     EP      ALL     Grades  LEP     ALL
2 to 3   M   20.06   21.70   21.02   22.48   22.29   2 to 4  44.52   45.78
         SD  9.40    8.53    9.64    9.64    8.52            10.63   9.00
3 to 4   M   27.13   26.50   24.71   23.62   24.30   3 to 5  41.50   42.56
         SD  9.21    7.92    9.35    8.61    7.77            10.98   9.03
4 to 5   M   14.83   15.69   14.54   15.84   16.14   4 to 6  31.03   29.07
         SD  8.73    6.82    8.62    7.56    6.74            10.67   8.85
5 to 6   M   15.37   12.84   15.01   12.89   13.72
         SD  8.08    7.70    8.56    8.11    7.58

Appendix D
Unweighted Mean Scaled Scores

Subject: Reading
           1998            1999                    2000
Grade      LEP     All     LEP     EP      All     LEP     EP      All
2      M   547.69  572.91  554.73  570.92  578.67  560.34  578.08  583.94
       SD  17.95   23.06   17.70   19.18   22.82   17.47   19.08   22.65
       N   2388    4843    2557    2557    3319    2741    2741    3356
3      M   571.02  602.21  577.13  596.33  607.21  581.95  605.26  611.58
       SD  15.33   25.72   15.24   19.47   24.87   14.44   18.95   24.2
       N   2445    4874    2608    2608    4936    2788    2788    4979
4      M   596.17  628.96  600.18  616.57  631.78  604.38  624.12  635.56
       SD  13.98   24.82   13.71   18.02   24.31   13.74   17.46   23.72
       N   2295    4837    2422    2422    4894    2606    2606    4958

Subject: Math
2      M   550.59  566.41  558.44  576.57  573.95  565.5   582.29  581.36
           SD  18.48   21.18   18.75   18.52   21.26   18.98   18.1    21.31
       N   2540    4875    2555    2557    4914    2737    2741    4969
3      M   574.02  591.96  582.39  604.47  600.25  590.49  609.41  608.48
       SD  17.78   22.41   17.83   19.09   22.09   17.56   18.15   21.48
       N   2493    4883    2602    2608    4938    2783    2788    4980
4      M   594.48  615.47  600.48  627.87  620.52  606.75  632.41  627.5
       SD  15.32   21.45   15.29   17.9    21.18   15.43   17.15   20.98
       N   2385    4846    2419    2420    4902    2604    2606    4959

Subject: Language
2      M   561.73  582.3   566.96  586.13  587.17  571.59  590.57  591.59
       SD  15.57   20.55   15.57   17.03   20.31   15.7    17.1    20.41
       N   2483    4868    2546    2556    4910    2730    2740    4969
3      M   574.98  597.75  581.86  601.61  604.16  587.39  607.58  609.49
       SD  15.5    22.33   15.78   18.01   21.93   15.71   17.47   21.53
       N   2442    4873    2586    2601    4933    2771    2786    4977
4      M   597.9   622     601.72  621.92  624.21  605.9   626.58  628.25
       SD  13.9    20.48   13.97   16.46   20.22   13.68   15.68   19.61
       N   2367    4839    2407    2413    4895    2596    2604    4960

Copyright 2002 by the Education Policy Analysis Archives

The World Wide Web address for the Education Policy Analysis Archives is epaa.asu.edu

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, firstname.lastname@example.org or reach him at College of Education, Arizona State University, Tempe, AZ 85287-2411. The Commentary Editor is Casey D. Cobb: email@example.com.

EPAA Editorial Board

Michael W. Apple, University of Wisconsin
Greg Camilli, Rutgers University
John Covaleskie, Northern Michigan University
Alan Davis, University of Colorado, Denver
Sherman Dorn, University of South Florida
Mark E. Fetler, California Commission on Teacher Credentialing
Richard Garlikov, firstname.lastname@example.org
Thomas F. Green, Syracuse University
Alison I. Griffith, York University
Arlen Gullickson, Western Michigan University
Ernest R. House, University of Colorado
Aimee Howley, Ohio University
Craig B. Howley, Appalachia Educational Laboratory
William Hunter, University of Calgary
Daniel Kallós, Umeå University
Benjamin Levin, University of Manitoba
Thomas Mauhs-Pugh, Green Mountain College
Dewayne Matthews, Education Commission of the States
William McInerney, Purdue University
Mary McKeown-Moak, MGT of America (Austin, TX)
Les McLean, University of Toronto
Susan Bobbitt Nolen, University of Washington
Anne L. Pemberton, email@example.com
Hugh G. Petrie, SUNY Buffalo
Richard C. Richardson, New York University
Anthony G. Rud Jr., Purdue University
Dennis Sayers, California State University-Stanislaus
Jay D. Scribner, University of Texas at Austin
Michael Scriven, firstname.lastname@example.org
Robert E. Stake, University of Illinois-UC
Robert Stonehill, U.S. Department of Education
David D. Williams, Brigham Young University

EPAA Spanish Language Editorial Board

Associate Editor for Spanish Language: Roberto Rodríguez Gómez, Universidad Nacional Autónoma de México, email@example.com

Adrián Acosta (México), Universidad de Guadalajara, adrianacosta@compuserve.com
J. Félix Angulo Rasco (Spain), Universidad de Cádiz, felix.firstname.lastname@example.org
Teresa Bracho (México), Centro de Investigación y Docencia Económica-CIDE, bracho dis1.cide.mx
Alejandro Canales (México), Universidad Nacional Autónoma de México, canalesa@servidor.unam.mx
Ursula Casanova (U.S.A.), Arizona State University, casanova@asu.edu
José Contreras Domingo, Universitat de Barcelona, Jose.Contreras@doe.d5.ub.es
Erwin Epstein (U.S.A.), Loyola University of Chicago, Eepstein@luc.edu
Josué González (U.S.A.), Arizona State University, josue@asu.edu
Rollin Kent (México), Departamento de Investigación Educativa-DIE/CINVESTAV, rkent@gemtel.com.mx, email@example.com
María Beatriz Luce (Brazil), Universidad Federal de Rio Grande do Sul-UFRGS, lucemb@orion.ufrgs.br
Javier Mendoza Rojas (México), Universidad Nacional Autónoma de México, javiermr@servidor.unam.mx
Marcela Mollis (Argentina), Universidad de Buenos Aires, mmollis@filo.uba.ar
Humberto Muñoz García (México), Universidad Nacional Autónoma de México, humberto@servidor.unam.mx
Angel Ignacio Pérez Gómez (Spain), Universidad de Málaga, aiperez@uma.es
Daniel Schugurensky (Argentina-Canadá), OISE/UT, Canada, dschugurensky@oise.utoronto.ca
Simon Schwartzman (Brazil), Fundação Instituto Brasileiro de Geografia e Estatística, firstname.lastname@example.org
Jurjo Torres Santomé (Spain), Universidad de A Coruña, jurjo@udc.es
Carlos Alberto Torres (U.S.A.), University of California, Los Angeles, torres@gseisucla.edu
Educational policy analysis archives. Vol. 10, no. 7 (January 25, 2002).
Tempe, Ariz.: Arizona State University; Tampa, Fla.: University of South Florida. January 25, 2002.
Exito en California?: a validity critique of language program evaluations and analysis of English learner test scores / Marilyn S. Thompson, Kristen E. DiCerbo, Kate Mahoney, [and] Jeff MacSwan.
Arizona State University. University of South Florida.
Education Policy Analysis Archives (EPAA)