Education Policy Analysis Archives
Volume 8 Number 49    October 26, 2000    ISSN 1068-2341

A peer-reviewed scholarly electronic journal
Editor: Gene V Glass, College of Education, Arizona State University

Copyright 2000, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

What Do Test Scores in Texas Tell Us?

Stephen P. Klein
Laura S. Hamilton
Daniel F. McCaffrey
Brian M. Stecher
RAND

Related articles: Haney: Vol. 8 No. 41; Camilli: Vol. 8 No. 42

Abstract

We examine the results on the Texas Assessment of Academic Skills (TAAS), the highest-profile state testing program and one that has recorded extraordinary recent gains in math and reading scores. To investigate whether the dramatic math and reading gains on the TAAS represent actual academic progress, we have compared these gains to score changes in Texas on another test, the National Assessment of Educational Progress (NAEP). Texas students did improve significantly more on a fourth-grade NAEP math test than their counterparts nationally. But, the size of this gain was smaller than their gains on
TAAS and was not present on the eighth-grade math test. The stark differences between the stories told by NAEP and TAAS are especially striking when it comes to the gap in average scores between whites and students of color. According to the NAEP results, that gap in Texas is not only very large but increasing slightly. According to TAAS scores, the gap is much smaller and decreasing greatly. Many schools are devoting a great deal of class time to highly specific TAAS preparation. While this preparation may improve TAAS scores, it may not help students develop necessary reading and math skills. Schools with relatively large percentages of minority and poor students may be doing this more than other schools. We raise serious questions about the validity of those gains, and caution against the danger of making decisions to sanction or reward students, teachers and schools on the basis of test scores that may be inflated or misleading. Finally, we suggest some steps that states can take to increase the likelihood that their test results merit public confidence and provide a sound basis for educational policy.

Introduction

During the past decade, several states have begun using the results on statewide tests as the basis for rewarding and sanctioning individual students, teachers, and schools. Although testing and accountability are intended to improve achievement and motivate staff and students, concerns have been raised in both the media and the professional literature (e.g., Heubert & Hauser, 1999; Linn, 2000) about possible unintended consequences of these programs. The high-stakes testing program in Texas has received much of this attention in part because of the extraordinarily large gains the students in this state have made on its statewide achievement tests, the Texas Assessment of Academic Skills (TAAS).
In fact, the gains in TAAS reading and math scores for both majority and minority students have been so dramatic that they have been dubbed the "Texas miracle." However, there are concerns that these gains were inflated or biased as an indirect consequence of the rewards and sanctions that are attached to the results. Thus, although there is general agreement that the gains on the TAAS are attributable to Texas' high-stakes accountability system, there is some question about what these gains mean. Specifically, do they reflect a real improvement in student achievement or something else?

We conducted several analyses to examine the issue of whether TAAS scores can be trusted to provide an accurate index of student skills and abilities. First, we used scores on the reading and math tests that are administered as part of the National Assessment of Educational Progress (NAEP) to investigate how much students in Texas have improved and whether this improvement is consistent with what has occurred nationwide. NAEP scores are a good benchmark for this purpose because they reflect national content standards and they are not subject to the same external pressures to boost scores as there are on the TAAS.

Next, we assessed whether the gains in TAAS scores between 1994 and 1998 were comparable to those on NAEP. We did this to examine how much confidence can be placed in the TAAS score gains. Similarly, we measured whether the differences in scores between whites and students of color on the TAAS were consistent with the
differences between these groups on NAEP. Specifically, is the gap on TAAS credible given the gap on NAEP? And finally, we investigated whether TAAS scores are related to the scores on a set of three other tests that we administered to students in 20 Texas elementary schools.

Our findings from this research raise serious questions about the validity of the gains in TAAS scores. More generally, our results illustrate the danger of relying on statewide test scores as the sole measure of student achievement when these scores are used to make high-stakes decisions about teachers and schools as well as students. We anticipate that our findings will be of interest to local, state, and national educational policymakers, legislators, educators, and fellow researchers and measurement specialists.

Readers also may be interested in a RAND study by Grissmer et al. (2000) that compared the NAEP scores of different states across the country. Grissmer and his colleagues found that after controlling for various student demographic characteristics and other factors, Texas tended to have higher NAEP scores than other states, and there was some speculation as to whether this was due to the accountability system in Texas. Thus, while the Grissmer et al. (2000) report and the research presented in this issue paper both used NAEP scores, these studies differed in the questions they investigated, the data they analyzed, and the methodologies they employed. A forthcoming RAND issue paper will discuss some of the broader policy questions about high-stakes testing in schools.

Background

Scores on achievement tests are increasingly being used to make decisions that have important consequences for examinees and others. Some of these "high-stakes" decisions are for individual students--such as for tracking, promotion, and graduation (Heubert & Hauser, 1999).
Some states and school districts also are using test scores to make performance appraisal decisions for teachers and principals (e.g., merit pay and bonuses) and to hold schools and educational programs accountable for the success of their students (Linn, 2000). Although the policymakers who design and implement such systems often believe they lead to improved instruction, there is a growing body of evidence which indicates that high-stakes testing programs can also result in narrowing the curriculum and distorting scores (Koretz & Barron, 1998; Koretz et al., 1991; Linn, 2000; Linn, Graue, & Sanders, 1990; Stecher, Barron, Kaganoff, & Goodwin, 1998). Consequently, questions are being raised about the appropriateness of using test scores alone for making high-stakes decisions (Heubert & Hauser, 1999).

In this issue paper, we examine score gains on one statewide test in an effort to assess the degree to which they provide valid information about student achievement in that state and about improvements in achievement over time. This investigation is the latest in a decade-long series of RAND studies of high-stakes testing (e.g., Koretz & Barron, 1998). We believe that this work will provide lessons to help policymakers understand some of the challenges that arise in the context of high-stakes accountability systems.

Our interest in Texas was prompted by an unusual empirical relationship we observed between scores on TAAS and tests we administered to students in a small sample of schools as part of a larger study on teaching practices and student achievement. Because our set of schools was small and not representative of the state,
we decided to explore statewide patterns of achievement on TAAS and on NAEP. In addition, Texas provides an ideal context in which to study high-stakes testing because its accountability system has received attention from the media and from the policy community, and it has been cited as possibly contributing to improved student achievement (e.g., Grissmer & Flanagan, 1998; Grissmer et al., 2000). TAAS scores are a central component of the accountability system. For example, students must pass the TAAS to graduate from high school, and TAAS scores affect performance evaluations (and, in some cases, compensation) for teachers and principals.

The TAAS program has been credited not only with improving student performance, but also with reducing differences in average scores among racial and ethnic groups. For example, a recent press release announced a record high passing rate on the TAAS. According to Commissioner of Education Jim Nelson, "Texas has justifiably gained national recognition for the performance gains being made by our students." Nelson also stated that Texas has "been able to close the gap in achievement between our minority youngsters and our majority youngsters, and we've again seen how we're progressing in that regard" (Jim Nelson as quoted by Mabin, 2000).

The unprecedented score gains on the TAAS have been referred to as the "Texas miracle." However, some educators and analysts (e.g., Haney, 2000) have raised questions about the validity of these gains and the possible negative consequences of high-stakes accountability systems, particularly for low-income and minority students. For example, the media have reported concerns about excessive teaching to the test, and there is some empirical support for these criticisms (Carnoy, Loeb, & Smith, 2000; McNeil & Valenzuela, 2000; Hoffman et al., in press). For instance, teachers in Texas say they are spending especially large amounts of class time on test preparation activities.
Because the length of the school day is fixed, the more time that is spent on preparing students to do well on the TAAS often means there is less time to devote to other subjects.

There are also concerns that score trends may be biased by a variety of formal and informal policies and practices. For example, policies about student retention in grade may affect score trends (McLaughlin, 2000). States may vary in the extent to which their schools promote students who fail to earn acceptable grades and/or statewide test scores. Eliminating these so-called "social promotions" would most likely raise the average scores at each grade level in subsequent years while lowering it at each age level. This is likely to occur because although the students who are held back may continue to improve, they are likely to do so at a slower rate than comparable students who graduate with their classmates (Heubert & Hauser, 1999). Another concern is inappropriate test preparation practices, including outright cheating. There have been documented cases of cheating across the nation, including in Texas. If widespread, these behaviors could substantially distort inferences from test score gains (Hoff, 2000; Johnston, 1999).

The pressure to raise scores may be felt most intensely in the lowest-scoring schools, which typically have large populations of low-income and minority students. Students at these schools may be particularly likely to suffer from overzealous efforts to raise scores. For example, Hoffman et al. (in press) found that teachers in low-performing schools reported greater frequency of test preparation than did teachers in higher-performing schools. This could lead to a superficial appearance that the gap between minority and majority students is narrowing when no change has actually occurred.
Evidence regarding the validity of score gains on the TAAS can be obtained by investigating the degree to which these gains are also present on other measures of these same general skills. Specifically, do the score trends on the TAAS correspond to those
on the highly regarded NAEP? The NAEP tests are generally recognized as the "gold standard" for such comparisons because of the technical quality of the procedures that are used to develop, administer, and score these exams. Of course, NAEP is not a perfect measure. For example, there are no stakes attached to NAEP scores, and therefore student motivation may differ on NAEP and state tests, such as TAAS. However, it is currently the best indicator available.

There are several other reasons why score gains on the TAAS are not likely to have a one-to-one match with those on NAEP if these tests assess different skills and knowledge. However, the specifications for the NAEP exams are based on a consensus of a national panel of experts, including educators, about what students should know and be able to do. Hence, NAEP provides an appropriate benchmark for measuring improvement. As Linn (2000) notes, "Divergence of trends does not prove that NAEP is right and the state assessment is misleading, but it does raise important questions about the generalizability of gains reported on a state's own assessment, and hence about the validity of claims regarding student achievement" (p. 14).

Questions for Our Research

Understanding the source and consequences of the impressive score gains on the TAAS would require an extensive independent study. We have not done that. Instead, the analyses described below address the following questions about student achievement in Texas:

1. Have the reading and math skills of Texas students improved since the full statewide implementation of the TAAS program in 1994 (e.g., are fourth graders reading better today than fourth graders a few years ago); and, if their skills did improve: (a) how much improvement occurred and (b) was the amount of improvement in reading the same as it was in math?
2. Are the gains in reading and math on the TAAS consistent with what would be expected given NAEP scores in Texas and the rest of the country?
3. Has Texas narrowed the gap in average reading and math skills between whites and students of color?
4. Do other tests given in Texas at a sample of 20 schools produce results that are consistent with those obtained with the TAAS?

We begin by describing certain important features of the TAAS and NAEP exams. We then answer the first three questions through analyses of publicly available TAAS and NAEP data and discuss the findings. Next, we answer the fourth question by reporting the results from a study that administered other tests to about 2,000 Texas students. Finally, we present our conclusions.

Description of the TAAS

TAAS was initiated in 1990 to serve as a criterion-referenced measure of the
state's mandated curriculum. It is intended to be comprehensive and to measure higher-order thinking skills and problem-solving ability (Texas Education Agency, 1999). Since the full implementation of the TAAS program in 1994, it has been administered in reading and mathematics in grades 3, 4, 5, 6, 7, 8, and 10. Other subjects are also tested at selected grade levels. Last year, for example, a writing test was given at grades 4, 8, and 10. Science and social studies were tested at grade 8. The TAAS tests consist primarily of multiple-choice items, but the writing test includes questions that require written answers.

Teachers administer the TAAS tests to their own students. Answers are scored by the state. The questions are released to the public after each administration of the exam, and a new set of TAAS tests is administered each year. However, the format and content of the questions in one year are very similar to those used the next year. Each form of the TAAS contains items that are being field-tested for inclusion in the forms to be used in subsequent years. These items are also used to link test scores from one year to the next to help ensure consistent difficulty over time. These experimental items are not used to compute student scores, nor are they released to the public. This practice is consistent with that employed in many other large-scale testing programs.

The TAAS is administered only in Texas. Thus, there are no national norms or benchmarks against which to compare the performance of Texas students on this test. However, the Texas Education Agency administered the Metropolitan Achievement Tests to a sample of Texas students to determine how well these students performed relative to a national norm group. We discuss this study in a later section of this issue paper.

Description of NAEP

The national portion of NAEP is mandated by Congress and is administered through the National Center for Education Statistics.
It is currently the only assessment that provides information on the knowledge and skills of a representative sample of the nation's students. The content of NAEP tests is based on test specifications that were developed by educators and others, and is intended to reflect a consensus about what students should be learning at a given grade level. Hence, the questions are not tied to standards of a single state or district. (Note 1) Like TAAS, NAEP is designed to assess problem-solving skills in addition to content knowledge. A national probability sample of schools is invited to participate in NAEP. Schools that decline are replaced with schools where the student characteristics are similar to those at the schools that refused to participate.

Most states, including Texas, also arrange to have the NAEP exams administered to another (and larger) group of their schools to allow for the generation of reliable state-level results. This state-level testing utilizes the same general procedures as the national NAEP program does; e.g., third-party selection of the participating schools and having a cadre of trained consultants (rather than classroom teachers) administer the tests. However, unlike the national program, these consultants may be local district personnel.

In both the national and state-level programs, a given student is asked a sample of all the questions that are used at that student's grade level. This permits a much larger sampling of the content domain in the available testing time than would be feasible if every student had to answer every item. Different item formats (including
multiple-choice, short-answer, and essay) are used in most subjects. The breadth of content and item types, as well as the consensus of a national panel of experts that is reflected in NAEP frameworks, makes NAEP a useful indicator of achievement trends across the country.

The validity of NAEP scores is enhanced by the procedures that are used to give the exams and ensure test security (e.g., test administrators do not have a stake in the outcomes). However, the utility of NAEP scores is limited by some of the other features of this testing program. For instance, NAEP is not administered every year, and when it is administered, not every subject is included, only a few grade levels are tested, and individual student, school, and district scores are not available. These features preclude examining year-to-year trends in a particular subject or tracking individual student progress over time. The motivation to do well on the NAEP tests is intrinsic rather than driven by external stakes. However, any reduction in student effort or performance that may stem from NAEP being a relatively low-stakes test should be fairly consistent over time and therefore not bias our measurement of score improvements across years.

How We Report Results

NAEP and TAAS results are typically reported to the public in terms of the percentage of students passing or meeting certain performance levels (or "cut" scores). Although this type of reporting seems easier to understand, it can lead to erroneous conclusions. For example, the difficulty of achieving a passing status or a certain level of performance (such as "proficient") may vary between tests as well as within a testing program over time. Making comparisons based on percentages reaching certain levels also does not account for score changes among students who perform well above or below the cut score.
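The concern above can be made concrete with a small sketch. The scores below are hypothetical (not actual TAAS or NAEP data): every student gains exactly the same 5 points, yet the apparent change in "percent passing" ranges from zero to dramatic depending on where the cut score happens to sit, whereas a standardized mean difference reports the change consistently.

```python
import statistics

# Hypothetical scores: every student gains exactly 5 points from year 1 to
# year 2, so the "true" improvement is the same for everyone.
year1 = [50, 52, 54, 68, 69, 71, 90]
year2 = [s + 5 for s in year1]

def percent_passing(scores, cut):
    """Share of students at or above the cut score, as results are often reported."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)

def effect_size(later, earlier):
    """Standardized mean difference between two score distributions."""
    return (statistics.mean(later) - statistics.mean(earlier)) / statistics.pstdev(earlier)

# The apparent gain depends entirely on where the cut score sits:
print(percent_passing(year2, 60) - percent_passing(year1, 60))  # 0.0
print(percent_passing(year2, 70) - percent_passing(year1, 70))  # about 28.6
# The effect size reports the same uniform change regardless of any cut:
print(round(effect_size(year2, year1), 2))  # 0.38
```

With a cut of 60, the same students pass in both years and the reported passing rate does not move at all; with a cut of 70, the passing rate jumps by almost 29 percentage points, even though the underlying improvement is identical in both cases.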
To avoid these and other problems with percentages, we adopted the research community's convention of reporting results in terms of "effect" sizes. The effect size is the difference in mean scores (between years or groups) divided by the standard deviation of those scores. In other words, it is the standardized mean difference. The major advantage of using effect sizes is that they provide a common metric across tests.

As a frame of reference for readers who are not familiar with this metric, the effect size for the difference in achievement between white and black students has ranged from 0.8 to 1.2 across a variety of large-scale tests (Hedges & Nowell, 1998). The effect size for the difference in third grade student reading scores between large and small classes in Tennessee was approximately 0.25 (Finn & Achilles, 1999). (Note 2)

Have Reading and Math Skills Improved in Texas?

NAEP data have been cited as evidence of the effectiveness of educational programs in Texas (e.g., Grissmer & Flanagan, 1998). For instance, within a racial or ethnic group, the average performance of the Texas students tends to be about six percentile-points higher than the national average for that group (Grissmer et al., 2000; Reese et al., 1997).

These results are consistent with the findings obtained by the Texas Education Agency in its 1999 Texas National Comparative Data Study, in which a sample of Texas students took the Metropolitan Achievement Tests, Seventh Edition (MAT-7). Texas students at every grade level scored slightly higher than the national norming sample in
most subjects (Texas Education Agency, 1999). However, it is difficult to draw conclusions from this study because, according to the sampling plan for this research, each participating school selected the classrooms and students that would take the MAT. Moreover, Texas did not report the mean TAAS scores of the students who took the MAT. Under the circumstances, the TAAS data are vital for determining whether those who took the MAT were truly representative of their school or the state. For example, the interpretation of the MAT findings would no doubt change if it was discovered that the mean TAAS scores of the students who took the MAT were higher than the corresponding state mean TAAS scores.

Data from a single year cannot tell us whether achievement has improved over time or whether trends in TAAS scores are reflected in other tests. To answer the question of whether performance improved, we compared the scores of Texas fourth graders in one year with the scores of Texas fourth graders four years later. We did this in both reading and mathematics. We also did this for eighth graders in mathematics (NAEP's testing schedule precluded conducting a similar analysis for eighth graders in reading). We then contrasted these results with national trends to assess whether the gains in Texas after the full statewide implementation of the TAAS differed from those in other states.

Figures 1 through 3 present the results of these analyses. The main finding is that over a four-year period, the average test score gains on the NAEP in Texas exceeded those of the nation in only one of the three comparisons, namely: fourth grade math.

Figure 1 shows that the Texas fourth graders in 1998 had higher NAEP reading scores than did Texas fourth graders in 1994. The size of the increase was .13 standard deviation units for white students and .15 units for students of color.
However, these increases were not unique to Texas. The national trend was for all students to improve. In fact, only among white fourth graders was the improvement in Texas greater than improvement nationally, and then only slightly (the difference in the effect sizes between Texas and the United States was .08). We discuss the implications of this difference in score gains between groups when we discuss the question of whether Texas has narrowed the gap in performance among racial and ethnic groups.

The TAAS data tell a radically different story (see Figure 1). They indicate there was a very large improvement in TAAS reading scores for all groups (effect sizes ranged from .31 to .49). Figure 1 also shows that on the TAAS, black and Hispanic students improved more than whites. The gains on TAAS were therefore several times larger than they were on NAEP. And, contrary to the NAEP findings, the gains on TAAS were greater for students of color than they were for whites.

Figure 2 shows that fourth graders in Texas in 1996 had substantially higher NAEP math scores than did fourth graders in 1992 (effect sizes ranged from .25 to .43).
Moreover, this improvement was substantially greater than the increase nationwide. This was especially true for white students. Nevertheless, the gains on TAAS were much larger than they were on NAEP, especially for students of color. (Note 3)

Figure 3 shows that Texas eighth graders in 1996 had higher NAEP scores than did Texas eighth graders in 1992, but these differences were only slightly larger than those observed nationally. Thus, as with fourth grade reading, there was nothing remarkable about the NAEP scores in Texas, and students of color did not gain more than whites. In contrast, there were huge improvements in eighth grade math scores on the TAAS during a similar four-year period, and these increases were much larger for students of color than they were for whites. The same was true for eighth grade TAAS reading scores during this period (effect sizes for whites, blacks, and Hispanics were .28, .45, and .37, respectively).

To further examine the question of whether there has been an improvement in reading and math skills of Texas students, we compared the NAEP scores of fourth graders in one year with the NAEP scores of eighth graders four years later. Because of the way NAEP samples students for testing, this is analogous (but not equivalent) to following the same cohort of students over time. In fact, the redesign of NAEP in 1984, which established a practice of testing grade levels four years apart and conducting the assessment in the core subjects every four years, was intended in part to support this type of analysis (Barton & Coley, 1998). We present results for Texas and the nation so readers can see the extent to which Texas students are progressing relative to students in other states.

Table 1 shows that the average NAEP math scale score for white Texas fourth graders in 1992 was 229. Four years later, the mean score for white eighth graders was 285, i.e., a 56-point improvement.
However, there was a 54-point improvement nationally for whites during this same period. There was a similar pattern for minority students, and these trends held for both math and reading (Table 2). In short, the score increases in Texas were almost identical to those nationwide (we could not conduct the corresponding analysis with TAAS data because TAAS does not convert scores to a common scale across grade levels).
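The cohort-gain arithmetic described above is simple enough to sketch directly. The snippet below reproduces the math comparison from the scale scores reported in Table 1 (4th graders in 1992, 8th graders in 1996); the values come from the published table, and the code merely restates the subtraction.

```python
# NAEP math scale scores from Table 1:
# group: (Texas 4th, Texas 8th, U.S. 4th, U.S. 8th)
scores = {
    "White":    (229, 285, 227, 281),
    "Black":    (199, 249, 192, 242),
    "Hispanic": (209, 256, 201, 250),
}

for group, (tx4, tx8, us4, us8) in scores.items():
    tx_gain, us_gain = tx8 - tx4, us8 - us4
    # A positive difference means the Texas cohort gained more than the nation.
    print(f"{group:8s} Texas gain: {tx_gain}, U.S. gain: {us_gain}, "
          f"difference: {tx_gain - us_gain:+d}")
```

Running this yields gains of 56 vs. 54 for whites, 50 vs. 50 for blacks, and 47 vs. 49 for Hispanics, i.e., Texas cohort gains within a couple of points of the national ones in every group.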
Table 1
Mean NAEP Math Scores for 4th Graders in 1992 and 8th Graders in 1996

            Texas               United States        Texas - U.S.
Group       4th   8th   Gain    4th   8th   Gain     (Gain)
White       229   285    56     227   281    54        2
Black       199   249    50     192   242    50        0
Hispanic    209   256    47     201   250    49       -2

Table 2
Mean NAEP Reading Scores for 4th Graders in 1994 and 8th Graders in 1998

            Texas               United States        Texas - U.S.
Group       4th   8th   Gain    4th   8th   Gain     (Gain)
White       227   273    46     223   270    47       -1
Black       191   245    54     186   241    55       -1
Hispanic    198   252    54     188   243    55       -1

Is Texas Closing the Gap Between Whites and Students of Color?

In 1998, the mean fourth grade NAEP reading score for whites in Texas was one full standard deviation higher than the mean for blacks. To put this in perspective, the average black student was at roughly the 38th percentile among all Texas test takers whereas the average white student was at about the 67th percentile. This gap was slightly larger than the difference between these groups in 1994. In other words, the black-white reading gap actually increased during this four-year period. The same pattern was present in fourth and eighth grade math scores (see Figure 4a).

In contrast, the difference in mean TAAS scores between whites and blacks was initially smaller than it was on NAEP, and it
decreased substantially over a comparable four-year period. Consequently, by 1998, the black-white gap on TAAS was about half what it was on NAEP. In other words, whereas the gap on NAEP was large to begin with and got slightly wider over time, the gap on TAAS started off somewhat smaller than it was on NAEP and then got substantially smaller.

The same radically disparate NAEP and TAAS trends were also present for the Hispanic-white gap; i.e., the gap got slightly wider on NAEP but substantially smaller on TAAS over comparable four-year periods (see Figure 4b). In addition, although fourth grade math was the subject on which Texas showed the largest gains over time relative to the nation, the white-Hispanic NAEP gap grew in Texas but not nationally, and the white-black gap remained constant in Texas but actually shrank nationally. In short, gap sizes on NAEP were moving in the opposite direction than they were on TAAS.

It is worth noting that even the relatively small NAEP gains we observed might be somewhat inflated by changes in who takes the test. As mentioned earlier, Haney (2000) provides evidence that exclusion of students with disabilities increased in Texas while decreasing in the nation, and Texas also showed an increase over time in the percentage of students dropping out of school and being held back. All of these factors would have the effect of producing a gain in average test scores that overestimates actual changes in student performance.

Why Do TAAS and NAEP Scores Behave So Differently?

The large discrepancies between TAAS and NAEP results raise serious questions about the validity of the TAAS scores. We do not know the sources of these differences. However, one plausible explanation, and one that is consistent with some of the survey and observation results cited earlier, is that many schools are devoting a great deal of class time to highly specific TAAS preparation.
It is also plausible that the schools with relatively large percentages of minority and poor students may be doing this more than other schools.

TAAS questions are released after each administration. Although there is a new
version of the exam each year, one version looks a lot like another in terms of the types of questions asked, terminology and graphics used, content areas covered, etc. Thus, giving students instruction and practice on how to answer the specific types of questions that appear on the TAAS could very well improve their scores on this exam. For example, in an effort to improve their TAAS scores, some schools have retained outside contractors to work with teachers, students, or both.

If the discrepancies we observed between NAEP and TAAS were due to some type of focused test preparation for the TAAS, then this instruction must have had a fairly narrow scope. With the possible exception of fourth grade math, it certainly did not appear to influence NAEP scores. In short, if TAAS scores were affected by test preparation for the TAAS, then the effects of this preparation did not appear to generalize to the NAEP exams. This explanation also raises questions about the appropriateness of what is being taught to prepare students to take the TAAS.

A small but significant percentage of students may have "topped out" on the TAAS. In other words, their TAAS scores may not reflect just how much more proficient they are in reading and math than are other students. If that happened, it would artificially narrow the gap on the TAAS between whites and students of color (because majority students tend to earn higher scores than minority students). Thus, the reduced gap on the TAAS relative to NAEP may be an artifact of the TAAS being too easy for some students. (Note 4) If so, it also would deflate the gains in TAAS scores over time. In short, were it not for any topping-out, the TAAS gain scores in Figures 1 through 3 would have been even larger, which in turn would further increase the disparity between TAAS and NAEP results.

What Happens on Other Tests?

We collected data on about 2,000 fifth graders from a mix of 20 urban and suburban schools in Texas.
This study was part of a much larger project that included administering different types of science and math tests to students who also took their state's exams. The 20 schools were from one part of Texas. They were not selected to be representative of this region, let alone of Texas as a whole. Nevertheless, some of the results at these schools also raised questions about the validity of the TAAS as a measure of student achievement.

Test Administration

In the spring of 1997, our Texas students took the English language version of the TAAS in reading and math. A few weeks later, we administered the following three tests to these same students: the Stanford 9 multiple-choice science test, the Stanford 9 open-ended (OE) math test, and a "hands-on" (HO) science test developed by RAND (Stecher & Klein, 1996). The Stanford 9 OE math test asked students to construct their own answers and write them in their test booklets. In the HO science test, students used various materials to conduct experiments. They then wrote their answers to several open-ended questions about these experiments in a simulated laboratory notebook. Table 3 shows the means and standard deviations on each measure.

Some Expected and Unexpected Findings

We analyzed the data in two ways. First, we investigated whether the students who
earned high scores on one test tended to earn high scores on the other tests. Next, we examined whether the schools that had a high average score on one test tended to have high average scores on the other tests. We also looked at whether the results were related to type of test used (i.e., multiple-choice or open-ended), subject matter tested (reading, math, or science), and whether a student was in a free or reduced-price school lunch program. The latter variable serves as a rough indicator of a student's socioeconomic status (SES). For the school-level analyses, SES was indicated by the percentage of students at the school who were in the subsidized lunch program.

Table 3
Means and Standard Deviations on Supplemental Study Measures by Unit of Analysis

                                        Students            Schools
Variable                            Mean      SD        Mean      SD
TAAS math                          37.97    13.62      38.84     3.80
TAAS reading                       29.33    10.61      29.61     2.59
Stanford 9 science                 29.01     5.40      28.55     1.94
Stanford 9 OE math                 15.14     5.21      14.84     1.44
HO science                         11.78     6.00      11.44     1.83
Percentage in lunch program (SES)  67.84    46.7       76.10    22.3

Notes: TAAS math had 52 items and TAAS reading had 40 items. Stanford 9 science had 40 items. The maximum possible scores on Stanford 9 OE math and HO science were 27 and 30, respectively.

Some of our results were consistent with those in previous studies. Others were not. We begin with what was consistent and then turn to those that were anomalous.

The first column of Table 4 shows the correlation between various pairs of measures when the student (N approx. 2,000) is the unit of analysis. (Note 5) The second column shows the results when the school (N = 20) is the unit of analysis. The first set of rows shows that the measures we administered correlated about .55 with each other when the student was the unit of analysis. These correlations were substantially higher when the school was the unit.
For example, the correlation between Stanford 9 science and Stanford 9 OE math was .55 when the student was the unit, but it was .78 when the school was the unit. These results are very consistent with the general findings of other research on student achievement.

Table 4
Correlations Between Measures
                                                  Unit of Analysis
Correlations between:                            Students   Schools

Non-TAAS tests
  Stanford 9 science and HO science                .57        .88
  Stanford 9 science and Stanford 9 OE math        .55        .78
  Stanford 9 OE math and HO science                .53        .71

SES and non-TAAS tests
  SES and Stanford 9 science                      -.17       -.76
  SES and Stanford 9 OE math                      -.10       -.72
  SES and HO science                              -.18       -.66

SES and TAAS tests
  SES and TAAS math                                .08        .13
  SES and TAAS reading                             .14        .21

TAAS and non-TAAS tests
  TAAS math and Stanford 9 science                 .220       .07
  TAAS math and Stanford 9 OE math                 .127       .02
  TAAS math and HO science                         .116       .03
  TAAS reading and Stanford 9 science              .11        .10
  TAAS reading and Stanford 9 OE math              .42        .21
  TAAS reading and HO science                      .53        .13

TAAS math and TAAS reading                         .81        .85

The second set of rows in Table 4 shows a strong negative correlation between the percentage of students at a school who were in the lunch program and that school's mean on the tests we administered. In other words, schools with more affluent students tended to earn higher mean scores on the non-TAAS tests than did schools with less wealthy students. This relationship is present regardless of test type (multiple-choice or open-ended) and subject matter (math or science). Again, these findings are very consistent with those found in other testing programs.

The correlation between SES and our test scores is much stronger when the school is used as the unit of analysis than when the student is the unit. This is a common finding and stems in part from the fact that it is difficult to get a high correlation with a dichotomous variable (i.e., in program versus not in program). The school-level analyses do not suffer from this problem because SES at the school level is measured by the percentage of students at the school who are in the program (i.e., a continuous rather than a dichotomous variable). School-level analyses also tend to produce higher correlations than individual-level analyses because aggregation of scores to the school level reduces
the percentage of error in the estimates.

The anomalies appear in the third and fourth sets of rows. In the third set, SES had an unusually small (Pearson) correlation with both of the TAAS scores even when the school was used as the unit of analysis. (Note 6) This result (which is opposite to the one we found with the non-TAAS tests) was due to a curvilinear relationship between SES and TAAS scores. Specifically, schools with a relatively low or high percentage of students in the lunch program tended to have higher mean TAAS math scores than did schools with an average percentage of students in this program (see Figure 5). Thus, the typical relationship between SES and test scores disappeared on the TAAS even though this relationship was present on the tests we administered a few weeks after the students took the TAAS. Figure 6 illustrates the more typical pattern by showing the negative, linear relationship between Stanford 9 math test scores and the percentage of students in the free or reduced-price lunch program.

The fourth set of rows in Table 4 shows that when the student is the unit of analysis, TAAS math and reading scores correlate well with the scores on the tests we gave. Although the correlations are somewhat lower than would be expected from experience with other tests (especially the .46 correlation between the two math tests), these differences do not affect the conclusions we would make about the relationships among different tests. However, the correlation between TAAS and non-TAAS tests essentially disappears when the school is the unit of analysis. This result is contrary to the one that would be expected from other studies and the results in the first block of rows.

The last row of Table 4 shows that TAAS math has a very high correlation with TAAS reading (despite being a different subject). In fact, TAAS math correlates much higher with TAAS reading than it does with another math test (namely, Stanford 9 OE math).
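The two statistical points above can be illustrated with a small simulation on synthetic data (the variance components below are assumptions for illustration, not estimates from the TAAS study): averaging scores to the school level cancels much of each student's individual noise, so school-level correlations run higher than student-level ones; and collapsing a continuous background variable into a dichotomous lunch-program indicator attenuates its student-level correlation with test scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools, n_per_school = 20, 500
school_id = np.repeat(np.arange(n_schools), n_per_school)

# A school-level effect shared by all measures, plus student-level noise.
school_effect = rng.normal(0, 1, n_schools)

def simulate_test():
    noise = rng.normal(0, 2, n_schools * n_per_school)
    return school_effect[school_id] + noise

test_a, test_b = simulate_test(), simulate_test()

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

# Student-level correlation is diluted by individual noise.
r_student = corr(test_a, test_b)

# Averaging within schools cancels most of that noise, so the
# school-level correlation between the same two tests is far higher.
mean_a = np.array([test_a[school_id == s].mean() for s in range(n_schools)])
mean_b = np.array([test_b[school_id == s].mean() for s in range(n_schools)])
r_school = corr(mean_a, mean_b)

# Dichotomizing a continuous background variable (in the lunch program
# vs. not) attenuates its student-level correlation with scores.
ses = school_effect[school_id] + rng.normal(0, 2, n_schools * n_per_school)
in_program = (ses < np.median(ses)).astype(float)  # 1 = lower-SES student
r_continuous = corr(ses, test_a)
r_dichotomous = corr(in_program, test_a)

print(f"student-level r = {r_student:.2f}, school-level r = {r_school:.2f}")
print(f"continuous SES r = {r_continuous:.2f}, "
      f"dichotomous indicator r = {r_dichotomous:.2f}")
```

Under these assumed variances the school-level correlation comes out far above the student-level one, and the dichotomous indicator's correlation is smaller in magnitude than the continuous variable's, mirroring the direction (though not the magnitudes) of the patterns discussed above.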
To sum up, the non-TAAS tests correlated highly with each other and with SES; and, as expected, this correlation increased when the school was used as the unit of analysis. Also as anticipated, the two TAAS tests had a moderate correlation with the non-TAAS tests, but unexpectedly, this only occurred when the student was used as the unit of analysis. Rather than getting larger, the correlation between TAAS and non-TAAS tests essentially evaporated when the school was the unit. And finally, regardless of the unit of analysis, the two TAAS tests had an extremely high correlation with each other, but both had a virtually zero correlation with SES.

One of the reasons we were surprised that the TAAS and non-TAAS scores
behaved so differently is that the latter tests were designed to measure some of the same kinds of higher-order thinking skills that the TAAS is intended to measure. However, our results could be due to the unique characteristics of the 20 schools in our study or other factors. We are therefore reluctant to draw conclusions from our findings with these schools or to imply that these findings are likely to occur elsewhere in Texas. Nevertheless, they do suggest the desirability of periodic administration of external tests to validate TAAS results. This procedure, which is sometimes referred to as "audit testing," could have been incorporated into the study of the Metropolitan Achievement Test discussed previously.

Conclusions

We are now ready to answer the questions that we posed at the beginning of this issue paper. Specifically, we found that the reading and math skills of Texas students improved since the full implementation of the TAAS program in 1994. However, the answers to the questions of how much improvement occurred, whether the improvement in reading was comparable to what it was in math, and whether Texas reduced the gap in scores among racial and ethnic groups depend on whether you believe the NAEP or TAAS results. They tell very different stories.

According to NAEP, Texas fourth graders were slightly more proficient in reading in 1998 than they were in 1994. However, the country as a whole also improved to about the same degree. Thus, there was nothing remarkable about reading score gains in Texas. In contrast, the increase in fourth grade math scores in Texas was significantly greater than it was nationwide. However, the small improvements in NAEP eighth grade math scores were consistent with those observed nationally. The gains in scores between fourth and eighth grade in Texas also were consistent with national trends.
In short, except for fourth grade math, the gains in Texas were comparable to those experienced nationwide during this time period.

In all the analyses, including fourth grade math, the gains on the TAAS were several times greater than they were on NAEP. Hence, how much a Texas student's proficiency in reading and math actually improved depends almost entirely on whether the assessment of that student's skills relies on NAEP scores (which are based on national content frameworks) or TAAS scores (which are based on tests that are aligned with Texas' own content standards and are administered by the classroom teacher).

The huge disparities between the stories told by NAEP and TAAS are especially striking in the assessment of (1) the size of the gap in average scores between whites and students of color and (2) whether these gaps are getting larger or smaller. According to NAEP, the gap is large and increasing slightly. According to TAAS, the gap is much smaller and decreasing greatly. We again quote Linn (2000, p. 14): "Divergence of trends does not prove that NAEP is right and the state assessment is misleading, but it does raise important questions about the generalizability of gains reported on a state's own assessment, and hence about the validity of claims regarding student achievement." Put simply, how different could "reading" and "math" be in Texas than they are in the rest of the country?

The data available for this report were not ideal. Limitations in the way NAEP is administered make it difficult to do the kinds of comparisons that would be most
informative. For example, NAEP is not given every year, and individual student or school scores are not available. And the supplemental study described above was limited to 20 schools in just one part of a very large state. Nevertheless, the stark differences between TAAS and NAEP (and other non-TAAS tests) raise very serious questions about the generalizability of the TAAS scores.

These concerns about TAAS do not condemn all efforts to increase accountability, nor should they be interpreted as being opposed to testing. On the contrary, we believe that some form of large-scale assessment, when properly implemented, is an essential tool to monitor student progress and thereby support state efforts to improve education. Moreover, the possible problems with the TAAS discussed earlier in this issue paper are probably not restricted to this test or state. For example, score inflation and unwanted test preparation have been found in a number of jurisdictions (Koretz & Barron, 1998; Linn, 2000; Stecher et al., 1998; Heubert & Hauser, 1999).

To sum up, states that use high-stakes exams may encounter a plethora of problems that would undermine the interpretation of the scores obtained.
Some of these problems include the following: (1) students being coached to develop skills that are unique to the specific types of questions that are asked on the statewide exam (i.e., as distinct from what is generally meant by reading, math, or the other subjects tested); (2) narrowing the curriculum to improve scores on the state exam at the expense of other important skills and subjects that are not tested; (3) an increase in the prevalence of activities that substantially reduce the validity of the scores; and (4) results being biased by various features of the testing program (e.g., if a significant percentage of students top out or bottom out on the test, it may produce results that suggest that the gap among racial and ethnic groups is closing when no such change is occurring).

There are a number of strategies that states might try to lessen the risk of inflated and misleading gains in scores. They can reduce the pressure to "raise scores at any cost" by using one set of measures to make decisions about individual students and another set (employing sampling and third-party administration) to make decisions about teachers, schools, and educational programs. States can replace their traditional paper-and-pencil multiple-choice exams with computer-based "adaptive" tests that are tailored to each student's abilities, that draw on "banks" of thousands of questions, and that are delivered over the Internet into the school building (for details, see Bennett, 1998; Hamilton, Klein, & Lorie, 2000). States can also periodically conduct audit testing to validate score gains. They can study the positive and negative effects of the testing program on curriculum and instruction, and whether these effects are similar for different groups of students. For instance, what knowledge, skills, and abilities are and are not being developed when the focus is concentrated on preparing students to do well on a particular statewide, high-stakes exam?
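The "topping out" artifact noted in problem (4) can be reproduced in a short, purely illustrative simulation (the group means, spreads, and ceilings below are invented numbers, not TAAS or NAEP data): censoring scores at a test's ceiling shrinks the observed gap between two groups even though the underlying proficiency gap never changes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical "true" proficiency for two groups with a 10-point gap.
group_a = rng.normal(70, 10, n)
group_b = rng.normal(60, 10, n)

def observed_gap(ceiling):
    # A test cannot register proficiency above its maximum possible
    # score, so observed scores are censored at the ceiling.
    return (np.minimum(group_a, ceiling).mean()
            - np.minimum(group_b, ceiling).mean())

gap_hard_test = observed_gap(100)  # high ceiling: almost no one tops out
gap_easy_test = observed_gap(75)   # low ceiling: many group-A students top out

print(f"high-ceiling test: observed gap = {gap_hard_test:.1f}")
print(f"low-ceiling test:  observed gap = {gap_easy_test:.1f}")
```

On the high-ceiling test the observed gap is close to the true 10 points; on the low-ceiling test it is visibly smaller, because the stronger group's highest scorers all pile up at the maximum. The same mechanism also compresses measured gains over time for any group approaching the ceiling.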
However, given the findings reported above for Texas, it is evident that something needs to be done to ensure that high-stakes testing programs, such as the TAAS, produce results that merit public confidence and thereby provide a sound basis for educational policy decisions.

Notes

RAND issue papers explore topics of interest to the policymaking community. Although issue papers are formally reviewed, authors have substantial latitude to express provocative views without doing full justice to other perspectives. The views and conclusions expressed in issue papers are those of the authors and do not necessarily represent those of RAND or its research sponsors.
1. It was beyond the scope of this issue paper to identify the specific similarities and differences in content coverage between NAEP and TAAS.

2. This estimate includes students who spent one to four years in small classes.

3. In Figures 2 and 3, the NAEP and TAAS trends cover different but overlapping years, due to the testing schedules of these measures.

4. The results in the 20-school study discussed later in this issue paper suggest that some topping-out occurred on the TAAS. For example, although about two-thirds of the 2,000 students in this study were in a free or reduced-price lunch program, 7 percent answered 95 percent of the TAAS reading questions correctly and 9 percent did so on the math test. Only a few students were able to do this on any of the tests we gave.

5. The correlation coefficient, which can range from -1.00 to +1.00, is a measure of the degree of agreement between two tests. A high positive correlation is obtained when the students (or schools) that have high scores on one test also tend to have high scores on the other test.

6. We also examined the relationships by splitting the schools into two groups, according to whether they had relatively high versus low percentages of students in the lunch program (e.g., those that had more than 70 percent versus those with less than 70 percent). This analysis produced results that were consistent with the data in Figures 5 and 6. Specifically, schools with a high percentage of students in the lunch program had much lower scores on the three tests we gave than did schools with a relatively low percentage of students in this program, whereas that was not the case with the TAAS scores.

Acknowledgments

The preparation of this issue paper benefited greatly from the many thoughtful suggestions and insights of our RAND colleagues, Dr. David Grissmer, Dr. Daniel Koretz, and Dr.
James Thomson, and our external reviewers, Professor Richard Jaeger of the University of North Carolina at Greensboro and Professor Robert Linn of the University of Colorado at Boulder. We are also grateful to Miriam Polon and Christina Pitcher for editorial suggestions.

References

Barton, P. E., & Coley, R. J. (1998). Growth in school: Achievement gains from the fourth to the eighth grade (ETS Policy Information Report). Princeton, NJ: Educational Testing Service.

Bennett, R. E. (1998). Reinventing assessment. Princeton, NJ: Educational Testing Service.

Carnoy, M., Loeb, S., & Smith, T. L. (2000). Do higher state test scores in Texas make for better high school outcomes? Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans.

Finn, J. D., & Achilles, C. M. (1999). Tennessee's class size study: Findings, implications, misconceptions. Educational Evaluation and Policy Analysis, 21, 97-109.
Grissmer, D., & Flanagan, A. (1998). Exploring rapid achievement gains in North Carolina and Texas. Washington, DC: National Education Goals Panel.

Grissmer, D., Flanagan, A., Kawata, J., & Williamson, S. (2000). Improving student achievement: What state NAEP test scores tell us. Santa Monica, CA: RAND, MR-924-EDU.

Hamilton, L. S., Klein, S. P., & Lorie, W. (2000). Using web-based testing for large-scale assessment. Santa Monica, CA: RAND, IP-196-EDU.

Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis Archives, 8(41). Available at http://epaa.asu.edu/epaa/v8n41.

Hedges, L. V., & Nowell, A. (1998). Black-white test score convergence since 1965. In Jencks, C., & Phillips, M. (Eds.), The Black-White Test Score Gap (pp. 149-181). Washington, DC: Brookings.

Heubert, J. P., & Hauser, R. M. (Eds.) (1999). High stakes: Testing for tracking, promotion, and graduation. A Report of the National Research Council. Washington, DC: National Academy Press.

Hoff, D. J. (2000). As stakes rise, definition of cheating blurs. Education Week, June 21.

Hoffman, J. V., Assaf, L., Pennington, J., & Paris, S. G. (in press). High stakes testing in reading: Today in Texas, tomorrow? The Reading Teacher.

Johnston, R. C. (1999). Texas presses districts in alleged test-tampering cases. Education Week, March 17.

Koretz, D., & Barron, S. I. (1998). The validity of gains on the Kentucky Instructional Results Information System (KIRIS). Santa Monica, CA: RAND, MR-1014-EDU.

Koretz, D., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991). The effects of high-stakes testing: Preliminary evidence about generalization across tests. In R. L. Linn (chair), The Effects of High Stakes Testing, symposium presented at the annual meetings of the American Educational Research Association and the National Council on Measurement in Education, Chicago, April.

Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4-16.

Linn, R. L., Graue, M.
E., & Sanders, N. M. (1990). Comparing state and district test results to national norms: The validity of claims that "everyone is above average." Educational Measurement: Issues and Practice, 9, 5-14.

Mabin, C. (2000). State's students again improve on TAAS scores. Austin American-Statesman, May 18.

McLaughlin, D. (2000). Protecting state NAEP trends from changes in SD/LEP inclusion rates. Palo Alto, CA: American Institutes for Research.
McNeil, L., & Valenzuela, A. (2000). The harmful impact of the TAAS system of testing in Texas: Beneath the accountability rhetoric. Cambridge, MA: Harvard University Civil Rights Project.

Reese, C. M., Miller, K. E., Mazzeo, J., & Dossey, J. A. (1997). NAEP 1996 report card for the nation and the states. Washington, DC: National Center for Education Statistics.

Stecher, B. M., Barron, S., Kaganoff, T., & Goodwin, J. (1998). The effects of standards-based assessment on classroom practices: Results of the 1996-97 RAND Survey of Kentucky Teachers of Mathematics and Writing (CSE Technical Report 482). Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.

Stecher, B. M., & Klein, S. P. (Eds.) (1996). Performance assessments in science: Hands-on tasks and scoring guides. Santa Monica, CA: RAND, MR-660-NSF.

Texas Education Agency (1999). Texas Student Assessment Program: Technical digest for the academic year 1998-1999. Available at http://www.tea.state.tx.us/student.assessment/techdig.htm.

Texas Education Agency (2000). 1999 Texas National Comparative Data Study. Available at http://www.tea.state.tx.us/student.assessment/researchers.htm.

Texas Education Agency (2000). Texas TAAS passing rates hit seven-year high; four out of every five students pass exam. Press release, May 17.

About the Authors

Dr. Stephen P. Klein is a Senior Research Scientist at RAND, where for the past 25 years he has led studies on health, criminal justice, military manpower, and educational issues. His current projects include analyzing licensing examinations in teaching and other professions, delivering computer adaptive performance tests over the Web, and measuring the effects of instructional practices and curriculum on student performance.

Dr. Laura Hamilton is an Associate Behavioral Scientist at RAND where she conducts research on educational assessment and the effectiveness of educational reform programs.
Her current projects include a study of systemic reforms in math and science, an investigation of the validity of statewide assessments for students with disabilities, and an analysis of the effectiveness of private governance of public schools.

Dr. Daniel F. McCaffrey is a Statistician at RAND where he works on studies of health and educational issues. His work on education includes studies on teaching practices and student achievement, the effects of class size reduction on the test scores of California's students, and the properties of hands-on performance measures of achievement in science.

Dr. Brian Stecher is a Senior Social Scientist in the Education program at RAND. Dr. Stecher's research emphasis is applied educational measurement, including the implementation, quality, and impact of state assessment and accountability systems and the cost, quality, and feasibility of performance-based assessments in mathematics and science.