Educational policy analysis archives


Material Information

Educational policy analysis archives
Physical Description:
Arizona State University
University of South Florida
Arizona State University
University of South Florida.
Place of Publication:
Tempe, Ariz
Tampa, Fla
Publication Date:


Subjects / Keywords:
Education -- Research -- Periodicals   ( lcsh )
non-fiction   ( marcgt )
serial   ( sobekcm )

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E11-00190
usfldc handle - e11.190
System ID:



Education Policy Analysis Archives
Volume 8 Number 46, September 17, 2000, ISSN 1068-2341

A peer-reviewed scholarly electronic journal. Editor: Gene V Glass, College of Education, Arizona State University

Copyright 2000, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold.

Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Should Achievement Tests be Used to Judge School Quality?

Scott C. Bauer
University of New Orleans

Abstract

This study provides empirical evidence to answer the question of whether student scores on standardized achievement tests represent reasonable measures of instructional quality. Using a research protocol designed by Popham and the local study directors, individual test items from a nationally-marketed standardized achievement test were rated by educators and parents to determine the degree to which raters felt that the items reflect important content that is actually taught in schools, and the degree to which raters felt that students' answers to the questions would be likely to be unduly influenced by confounded causality. Three research questions are addressed: What percentage of test items are considered suspect by raters as indicators of school instructional quality? Do educators and parents of school-age children differ in their ratings of the appropriateness of test items? Do educators and parents feel that standardized achievement test scores should be used as an indicator of school instructional quality? Descriptive statistics show that on average, raters felt that the content reflected in test questions measured material
that is important for students to know. However, for reading and language arts questions, between about 20% and 40% of the items were viewed as suspect in terms of the other criteria.

Introduction

Since publication of A Nation at Risk in 1983, issues associated with accountability have been at the forefront of educational reform in the United States. Kirst (1990) estimated that in the 1980s alone, 40 states created or amended their accountability systems. Stecher and Barron (1999) note that the number of states with a mandated student testing program rose from 29 in 1980 to 46 in 1992. Presidents Bush and Clinton both proposed the creation of a voluntary national test that would allow the reporting of student performance in relation to national standards (Carnevale & Kimmel, 1997).

The emergence of high-stakes accountability policies has intensified the debate over whether state-mandated assessment is a useful instrument for changing educational practice (Firestone, Mayrowetz, and Fairman, 1998; Ginsberg and Berry, 1998; Sheldon and Biddle, 1998). Proponents of high-stakes testing assume that poor performance in American schools results from a lack of attention to school performance. "To solve such problems, according to this view, we need to set high standards for students, assess students' performance with standardized tests, and reward or punish students, their teachers, and their schools, depending on whether those standards are met" (Sheldon and Biddle, 1998, p. 165):

Forty-nine states and a number of urban districts have set standards for what students should know and be able to do at various points in their school careers. Half the states hold schools accountable and apply sanctions to those whose students fail to meet the standards. At least a third – with more soon to follow – require students to score at designated levels on tests to get promoted and/or graduate. (Wolk, 1998, p. 48)

A recent survey by the Council of Chief State School Officers (1998) shows that while the states are increasingly introducing less traditional performance measures like portfolios into their assessment programs, 31 states use norm-referenced tests to measure student achievement in language arts, reading and mathematics. Tests are generally a part of the accountability system because they are inexpensive and quick to implement, and they are socially accepted as indicators of student performance (Linn, 1999).

At the heart of the debate over the use of high-stakes testing policies as a reform is the assumption that introducing new assessments will result in changes in teacher behavior in the classroom. As Firestone, Mayrowetz and Fairman (1998) observed, there is in fact a good deal of evidence that testing changes patterns of teaching, "if only by promoting 'teaching to the test'" (p. 96). There is evidence that school-based performance and reward programs such as Kentucky's produce desired results (Kelley and Protsik, 1997), and research supports the notion that school leaders take high-stakes testing very seriously (Mitchell, 1995). However, research also suggests that high-stakes testing programs do not necessarily provide valid data on students and schools (Stecher & Barron, 1999), and these systems tend to produce a high level of stress for teachers and principals. Critics argue that high-stakes testing may encourage teachers to consider test scores as ends in themselves:

Evidence...reveals various perils associated with rigid standards, narrow
accountability, and tangible sanctions that can debase the motivations and performances of teachers and students. Teachers faced with reforms that stress such practices may become controlling, unresponsive to individual students, and alienated. Test- and sanction-focused students may lose intrinsic interest in subject matter, learn at only a superficial level, and fail to develop a desire for future learning. (Sheldon and Biddle, 1998, p. 164)

Opponents of these measures conclude that they result in dumbing down the curriculum (e.g., Corbett and Wilson, 1991), while others argue that they deny the reality of the situation faced by students, particularly those in urban districts, who are not well prepared to meet harsh standards (Wolk, 1998). Still others question whether policy is an effective instrument for shaping instructional practice at all (e.g., Cohen, 1995). Newmann, King and Rigdon argue that high-stakes accountability programs are doomed to failure because insufficient attention is paid to increasing schools' capacity for change, and Mayer (1998) raises the question of whether pursuing standards-based reform while leaving testing policy largely unchanged undermines reform. Wallace (2000, p. 66) concludes, "Provincial achievement exams create undue pressure on students, teachers, and schools. Even worse, the tests fail to assess what students will need to know in the next century."

Nevertheless, rating school performance based on the results of state testing programs has become an increasingly popular feature of state accountability programs (Watts, Gaines & Creech, 1998). The CCSSO survey referenced earlier indicates, in fact, that standardized achievement tests generally serve as summative indicators of elementary, middle, and high school performance, at least in part. For instance, in my home state of Louisiana, the new testing program is used to produce a school performance score that includes scores from the state's criterion-referenced test (60% of score), a nationally-marketed norm-referenced test (30% of score), and student attendance and dropout rates (10% of score). The school performance score will be used to establish 10-year goals, and schools will be held accountable for reaching two-year targets that represent progress toward these goals. A series of corrective actions are spelled out for schools that fail to meet their targets (Louisiana's School and District Accountability System, 1999).

At the 1998 Annual Meeting of the Mid-South Educational Research Association, W. James Popham raised the following question: Is it appropriate to use norm-referenced tests to evaluate instructional quality? Specifically, he challenged participants to consider whether norm-referenced tests measure knowledge that is taught and learned in schools. Popham then invited researchers to participate with him in a study to answer the question: Should student scores on standardized achievement tests be used to evaluate instructional quality in local schools?

In a subsequent paper, Popham (1999) laid out the basic argument that frames this study. While standardized achievement tests are useful tools to provide evidence about a specific student's mastery of knowledge and skills in certain content domains, "Employing standardized achievement tests to ascertain educational quality is like measuring temperature with a tablespoon" (p. 10). There are several difficulties with using aggregate measures from norm-referenced tests to judge the performance of a school.

First, there is considerable diversity across states and school systems with regard to content standards, and therefore test developers produce "one-size-fits-all assessments" which do not adequately align with what's supposed to be taught in schools.

Second, because norm-referenced tests must provide a mechanism to differentiate between students based on a relatively small number of test items, test
developers select "middle difficulty" items. As Popham put it,

As a consequence of the quest for score variance in a standardized achievement test, items on which students perform well are often excluded. However, items on which students perform well often cover the content that, because of its importance, teachers stress. Thus the better the job that teachers do in teaching important knowledge and/or skills, the less likely it is that there will be items on a standardized achievement test measuring such knowledge and skills (p. 12).

Finally, scores on standardized achievement tests may not be attributable to the instructional quality of a school. Student performance may be caused by any number of factors, including what's taught in schools, a student's native intelligence, and out-of-school learning opportunities that are heavily influenced by a student's home environment. Popham terms this last issue the problem of "confounded causality."

Here we report the results of one of several local studies designed to provide empirical evidence to answer the question of whether student scores on standardized achievement tests represent reasonable measures of instructional quality. Using a research protocol designed by Popham and the local study directors, individual test items from a nationally-marketed standardized achievement test were rated by educators and parents to determine the degree to which raters felt that the items reflect important content that is actually taught in schools, and the degree to which raters felt that students' answers to the questions would be likely to be unduly influenced by confounded causality. Three research questions are addressed:

1. What percentage of test items are considered suspect by raters as indicators of school instructional quality?
2. Do educators and parents of school-age children differ in their ratings of the appropriateness of test items?
3. Do educators and parents feel that standardized achievement test scores should be used as an indicator of school instructional quality?

Methods

The investigation consisted of a series of three separate item-review studies designed to secure evidence regarding the appropriateness of using students' scores on standardized achievement tests as evidence of instructional quality. All sections of a nationally-marketed standardized achievement test were studied at the third grade level. The test covers mathematics, reading and language arts content areas. The test was secured by the local study director, who also took responsibility for security.

Participants

Participants were solicited from two sources. First, principals associated with the School Leadership Center of Greater New Orleans (SLC-GNO) were invited to put together teams of teachers and parents to host an item-rating session. Two principals were able to put together groups of ten and eleven raters. Of these 21 participants, 10 were parents and 11 were educators. These rating sessions were held at the participants' schools after school hours. Additionally, nine teachers enrolled in a graduate level course dealing with testing and measurement at the University of New Orleans formed a
third group. This rating session was held on campus. In sum, then, 30 reviewers served as item raters, including two principals, 18 teachers, and 10 parents of elementary school children.

Procedures

Reviewers were provided with a description of the goals and procedures associated with the study prior to the actual rating session. In addition to signing a standard human subjects protocol outlining the responsibilities and risks associated with participation, reviewers signed a test-confidentiality form prior to their participation, and the item reviews were carried out under the scrutiny of the local director so that no security violations could occur. All test booklets were retained by the study director. Data were recorded on forms that do not reveal the specific test reviewed or any test questions.

Reviewers were asked to make their item-by-item judgments individually on summary rating sheets (see Exhibit 1 for a sample of the rating sheet), without group discussion, using a protocol that asked them to examine test items and judge their appropriateness in terms of five criteria:

1. IMPORT: Is the skill or knowledge measured by this item truly important for children to learn?
2. TAUGHT: Is the skill or knowledge measured by this item likely to be taught if teachers follow the prescribed curriculum?
3. SES: Is this item free of qualities (form or content) that will make the likelihood of a student's answering correctly be dominantly influenced by the student's socioeconomic status?
4. INHERITED CAPABILITIES: Is this item free of qualities (form or content) that will make the likelihood of a student's answering correctly be dominantly influenced by the student's inherited academic capabilities?
5. VALIDITY: Will a student's response to this item contribute to a valid inference about the student's status regarding whatever the test is supposed to be measuring?

Exhibit 1. Sample item rating sheet

Item   Import?   Taught?   SES?      IQ?       Validity?
1      Y ? N     Y ? N     Y ? N     Y ? N     Y ? N
2      Y ? N     Y ? N     Y ? N     Y ? N     Y ? N
3      Y ? N     Y ? N     Y ? N     Y ? N     Y ? N
4      Y ? N     Y ? N     Y ? N     Y ? N     Y ? N
5      Y ? N     Y ? N     Y ? N     Y ? N     Y ? N

Exhibit 2. The five item-review questions

1. IMPORT: Is the skill or knowledge measured by this item truly important for children to learn?
2. TAUGHT: Is the skill or knowledge measured by this item likely to be taught if teachers follow the prescribed curriculum?
3. SES: Is this item free of qualities (form or content) that will make the likelihood of a student's answering correctly be dependent on the student's socioeconomic status? WOULD A STUDENT FROM A WELL-OFF HOME BE UNLIKELY TO GET THE ITEM CORRECT JUST BECAUSE HE OR SHE IS MORE "ADVANTAGED?"
4. IQ: Is this item free of qualities (form or content) that will make the likelihood of a student's answering correctly be dependent on the student's inherited academic capabilities? WOULD A STUDENT WITH GREATER NATIVE INTELLIGENCE (IQ) BE UNLIKELY TO GET THE ITEM CORRECT JUST BECAUSE OF THIS INBORN QUALITY?
5. VALIDITY: Will a student's response to this item contribute to a valid conclusion about the student's ability relating to whatever the test is supposed to be measuring? IS THIS ITEM A VALID MEASURE OF THE ABILITY THE TEST IS MEASURING IN THIS SECTION OF THE TEST?

During an orientation phase, prior to item review, the local study director practiced reviewing a selection of test items from a test booklet's sample items and/or from a different test to clarify item reviewers' understanding of the five item-review questions. During a pre-test of the procedure, it became clear that respondents might have difficulty with the questions related to SES, IQ, and validity; thus some clarifying language was added, and a summary sheet was provided to raters which allowed them to access the definitions as they performed the ratings. (Exhibit 2 shows the summary sheet.)

Each rating session was held in the afternoon, and took approximately three hours. Because of the time of day and the considerable investment of time and energy, participants were provided with a light dinner after each rating session. They also participated in a short debriefing session, during which they answered questions about the methodology and their ability to sensibly rate the test items.

Analysis

Response sheets were collected and numbered after each session. The numbers of items rated yes, no, or with a question mark (not sure) were tallied for each content area of the test, and the numbers of no and "not sure" (question mark) ratings were entered into an SPSS 9.0 for Windows system file. To address the question of what percentage of test items raters considered suspect as indicators of school instructional quality, the mean percentages of items rated "no" or "not sure" were computed for each of the rating criteria and for each content area of the test. Descriptive statistics related to the raters' judgments of items in each content area of the test and for each of the criteria are presented. Additionally, a summary statistic indicating the mean percentage of items rated as suspect on at least one criterion was computed. For purposes of discussion, the percentages of items rated as either "no" or "not sure" are combined; given the high stakes involved in the state accountability programs, if raters cannot determine whether an item meets the criteria used in this study, we will consider it suspect. The full breakdown of ratings is presented in the Appendices.

To see if educators and parents of school-age children differ in their ratings of the
appropriateness of test items, analysis of variance was computed to test whether the difference in mean ratings between the two groups is statistically significant. Eta-squared is also reported; Stevens (1996) recommends that, to interpret the effect size, an eta-squared of .01 be treated as a small effect, .06 a medium effect, and .14 a large effect.

To address whether educators and parents feel that standardized achievement test scores should be used as an indicator of school instructional quality, the frequency distribution is reported for a summary question which asked respondents to answer yes, no, or "not sure" in regard to this question. Chi-square was computed to see if there is a statistically significant association between the answer to this summary question and group membership.

As a final portion of the study, answers to questions posed during debriefing sessions were analyzed to determine whether raters felt confident in their ability to assess test items on these criteria. In an exploratory study such as this, raters' sense of their ability to render reliable judgments in terms of these criteria is an important question. These data may shed some light on whether the methodology provides a valid assessment of the usefulness of the test to judge school quality.

Results

Table 1 displays the mean percentage of test items rated as suspect by respondents. As mentioned earlier, the percentage reflects the number of items rated as either "no" or "not sure" on each of the five criteria for each content area of the test. Overall, the mean percentage of items rated as suspect varies widely; only 2% of the items were rated as suspect in importance for math procedures, whereas 41% of the vocabulary items were rated as suspect because the likelihood seemed great that a student's answering correctly would be dependent on the student's inherited academic capabilities (IQ). Overall, raters felt that the items dealing with reading and language arts were more often suspect as indicators of school quality, especially in terms of the likelihood that students' answering these items correctly would be unduly influenced by native intelligence (IQ) or socio-economic status (SES). Raters were somewhat more comfortable with measures relating to mathematics problem-solving and reasoning, and considerably more comfortable with the items measuring mathematics procedures.

Table 1
Mean percentage of items rated as "suspect" for each content area

Content Area                       Important?   Taught?   SES?   IQ?    Valid?
Vocabulary                         11%          26%       38%    41%    26%
Reading comprehension              14%          26%       38%    40%    26%
Grammar & language                 8%           24%       37%    38%    21%
Math problem solving & reasoning   11%          24%       19%    33%    21%
Math procedures                    2%           7%        11%    22%    10%
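
The tallying described in the Analysis section can be sketched in a few lines of Python. This is only an illustrative sketch, not the SPSS procedure used in the study: the function names and the two tiny rating sheets below are hypothetical, made up to show the arithmetic, and they do not reproduce any of the study's data. The sketch assumes each rating sheet records one 'Y', '?', or 'N' per item and criterion, and it treats both "no" and "not sure" as suspect, as the text specifies.

```python
from statistics import mean

# The five rating criteria from the item-review protocol.
CRITERIA = ("import", "taught", "ses", "iq", "validity")

# "No" (N) and "not sure" (?) ratings are both counted as suspect.
SUSPECT = {"N", "?"}


def percent_suspect_by_criterion(sheet):
    """Percentage of items one rater marked suspect, per criterion.

    `sheet` is a list with one dict per test item, mapping each
    criterion name to 'Y', '?', or 'N'.
    """
    n_items = len(sheet)
    return {
        c: 100.0 * sum(item[c] in SUSPECT for item in sheet) / n_items
        for c in CRITERIA
    }


def percent_suspect_any(sheet):
    """Percentage of items one rater marked suspect on at least one criterion."""
    flagged = sum(any(item[c] in SUSPECT for c in CRITERIA) for item in sheet)
    return 100.0 * flagged / len(sheet)


def mean_across_raters(sheets):
    """Mean, over raters, of each rater's per-criterion suspect percentage."""
    per_rater = [percent_suspect_by_criterion(s) for s in sheets]
    return {c: mean(r[c] for r in per_rater) for c in CRITERIA}


# Hypothetical rating sheets for two raters and four items (illustration only).
rater_a = [
    {"import": "Y", "taught": "Y", "ses": "N", "iq": "?", "validity": "Y"},
    {"import": "Y", "taught": "Y", "ses": "Y", "iq": "Y", "validity": "Y"},
    {"import": "Y", "taught": "N", "ses": "Y", "iq": "N", "validity": "Y"},
    {"import": "Y", "taught": "Y", "ses": "Y", "iq": "Y", "validity": "Y"},
]
rater_b = [{c: "Y" for c in CRITERIA} for _ in range(4)]

print(percent_suspect_by_criterion(rater_a))    # iq -> 50.0, taught -> 25.0
print(percent_suspect_any(rater_a))             # 50.0
print(mean_across_raters([rater_a, rater_b]))   # taught -> 12.5, iq -> 25.0
```

Averaging each rater's percentage, rather than pooling all ratings, mirrors the "mean percentage" wording used in the text; with equal numbers of items per rater the two approaches give the same result.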


Viewing the data in Table 1 in terms of criteria instead of content area, one sees that from among the various criteria used to rate test items, raters judged the test items more likely to be suspect in terms of SES and IQ. That is, from among the five possible reasons a test item might be inappropriate to assess school quality, raters felt the greatest threat to validity was the likelihood that a student might answer the item correctly because of socioeconomic advantage or because of native intelligence rather than because of what he or she learned in school. In fact, for the reading and language arts content areas, between 30 and 40% of the items were rated as suspect in these regards. Considerably fewer items were rated as suspect because they were deemed unimportant for students to know, and for most content areas between 20 and 30% of the items were deemed unacceptable because raters felt that the material was not a part of the standard curriculum at that grade level.

The above-mentioned data show the mean percentage of items rated as suspect on each of the five criteria; a final summary statistic was computed to show the mean percentage of items in each section of the test that were rated as suspect on at least one of the five criteria. Table 2 shows that for all areas of the test, approximately 50% of the items were deemed inappropriate as indicators of instructional quality on at least one criterion. The table also shows that the range of ratings is considerable – for most areas, at least one rater felt that nearly all of the items were all right as indicators of instructional quality on all criteria, and at least one rater felt that all items were suspect on at least one of the five criteria.

Table 2
Mean percentage of items deemed suspect on at least one criterion

Content area                       Mean %   High    Low
Vocabulary                         57%      100%    15%
Reading comprehension              52%      100%    3%
Grammar & language                 55%      100%    13%
Math problem solving & reasoning   48%      100%    3%
Math procedures                    46%      100%    0%

To address the question of whether educators and parents rated the test items differently, analyses of variance were computed to test the null hypothesis that the mean percentages do not differ between the two groups of respondents. These data, presented in Table 3, show that the only statistically significant differences between the mean percentages of items rated as suspect by parents and educators exist for the criterion dealing with whether the content measured by the test item is taught in the regular school curriculum (taught). Parents consistently felt that a greater percentage of the items on the test covered material that would not be a part of the standard curriculum. An examination of eta-squared shows that for most of the content areas, the effect size of the difference in means for this criterion (taught) is large (eta-squared for vocabulary=.16, for reading comprehension=.16, for math problem-solving=.19) or moderate (eta-squared for
grammar and language=.10, for math procedures=.11).

Table 3
Mean ratings by respondent group

Vocabulary: F(1, 28): IMP .72, TAU 5.40*, SES 1.88, IQ .47, VAL 2.89; eta-squared .09
Reading comprehension: F(1, 28): IMP 1.99, TAU 5.40*, SES 1.88, IQ .50, VAL 2.89
Grammar and language: F(1, 28): IMP .69, TAU 2.
Math problem-solving & reasoning: F(1, 28): IMP .01, TAU 6.36*, SES .04, IQ .25, VAL 1.08
Math procedures: F(1, 28):
* p<.05

Table 4 shows the results for the summary item that asked raters to judge whether they would recommend using standardized achievement tests as an indicator of instructional quality. Results show that approximately a quarter of the educators and 30% of the parents felt that standardized achievement tests ought to be used as an indicator of school quality, whereas about two-thirds of the educators and 40% of the parents felt that they should not. Another 30% of the parents and 11% of the educators were not sure, and one respondent left the question blank. The chi-square test of association showed that there is not a statistically significant association between the answer to this question and role [chi-square (2, n=29) = 2.11, p>.05].

Table 4
Should standardized tests be used to measure instructional quality?

            Yes        Not sure   No
Educator    5 (26%)    2 (11%)    12 (64%)
Parent      3 (30%)    3 (30%)    4 (40%)
Total       8 (28%)    5 (17%)    16 (55%)

The final data collected in this study had to do with the methodology itself. A formal debriefing was held after each item rating session. Respondents were asked a short series of questions in writing about their ability to rate test items and about the kinds of factors they felt influenced their ratings. Raters also discussed their experiences and any difficulties they perceived with the rating process. These data provide us with some sense of the threats to validity present in the ratings.

Respondents were asked to rate how easy they felt it was to make judgments about the test items, on a scale of 1 = "very easy" to 10 = "very difficult."
On average, these data show that respondents felt that it was relatively easy to assess whether an item measured important material for students to know (2.1) and whether the item was likely to be taught as a part of the regular curriculum (2.9). Raters found it most difficult to rate whether an item would be more likely to be answered correctly because of a child's inherited capabilities (IQ) or socio-economic status (5.0 and 4.5, respectively). Respondents also found it relatively more difficult to judge whether an item was a valid measure of the skill it was intended to measure (4.7). Overall, then, on a ten-point scale raters found their job moderately easy (i.e., lower than the midpoint between very easy and very difficult), though some criteria were more difficult to apply than others.

Respondents also answered open-ended questions that probed into the kinds of factors that they felt might threaten their ability to render reliable judgments about the test items. These answers show that most of the parents felt at least a bit unsure about what was in the regular or "official" curriculum; thus they were not sure about the reliability of their judgments on the criterion labeled "taught." One respondent pointed
11 of 18out that SES and IQ were tough to assess because th ese relate to a subjective assessment of the fairness of an item, and several other respo ndents noted that SES was likely influenced by their own socio-economic status. That is, they questioned whether relatively well-off parents or teachers could rende r a valid judgment on this criterion. Some teachers questioned whether their beliefs abou t teaching would "get in the way" of their ability to rate the items, and several raters simply said that they found it tough – "speculative" – to assess the degree to which a stu dents' answer on a test item would relate more to native intelligence than knowledge g ained in school.Summary and Conclusions The purpose of this study was to attempt to amass credible evidence concerning whether student scores on standardized achievement tests should be used to evaluate instructional quality in local schools. Using a fra mework developed by Popham (1999) and a research protocol collaboratively devised by Popham and local study directors, educators and parents of school-age children rated all items contained on a commercially-marketed standardized achievement test that covered third grade content in reading, language arts, and mathematics. Descrip tive statistics show that on average, raters felt that the content reflected in test ques tions measured material that is important for students to know. However, for reading and lang uage arts questions, between about 20% to 40% of the items were viewed as suspect in t erms of the other criteria. Raters saw fewer problems with questions dealing with math ematics problem-solving and reasoning, and they felt the fewest problems existe d with questions on mathematical procedures. Overall, though, raters felt that about half of all items they appraised were suspect on at least one of the criteria used to ass ess the test. 
Educators and parents did not differ statistically on their ratings on most c riteria, though about two-thirds of the educators felt that tests should not be used to jud ge instructional quality whereas only 40% of the parents felt this way. The range of rati ngs across respondents was considerable for all content areas and for each of the rating criteria; some respondents saw very few problems with any questions, while oth ers felt that the vast majority of items were suspect on at least one criterion. This study was prompted by the realization that while standardized achievement tests are useful tools to provide evidence about st udents' mastery of knowledge and skills in tested content domains, it does not logically fo llow that they should be useful as indicators of school performance. As reflected in t he rating scheme used in this study, student performance on standardized tests may be ca used by any number of factors, including what's taught in schools, a student's nat ive intelligence, and out-of-school learning opportunities that are heavily influenced by a students' home environment. The question that follows, then, is whether this confounded causality poses a problem in terms of using standardized test scores as measures of instructional quality. In a critique of Popham's argument regarding confou nded causality, Schmoker (2000) argues that it does not. What happens in classrooms can "significantly mitigate and even overcome environmental and genetic factors" (p. 64) and standardized tests give schools focus and empower teachers by providing specific da ta on students' needs. "Standardized test results have provided the essential focus and urgency for schools to improve and refine instructional programs in reading, writing, and math practices" (p. 64). This argument misses the point. 
There is no question that norm-referenced tests are exceedingly valuable in their intended purpose: to identify knowledge and skills that individual students need to improve, thus providing professional educators with essential data with which they can craft programs and practices. It does not follow, however, that aggregate average scores on standardized tests serve as a good indicator of school quality. To say that norm-referenced tests can help teachers identify areas in need of attention does not rely on an assumption that school programs alone caused a deficiency; instead, as Schmoker observed, it relies on the belief that schools can do something to overcome the deficiency regardless of cause.

The notion that aggregate scores on standardized tests should serve as an indicator of school quality relies on an assumption of causality. The underlying logic is that the scores are predominantly caused by something the school does or has some control over. For this assumption to hold, at a minimum we must be willing to believe that student performance on standardized tests is related to school quality, that the tests measure the skills and abilities stressed in school programs, and that there are no antecedent factors that might otherwise explain aggregate student performance on the tests. If the data presented here are credible, the soundness of this assumption must be questioned. On average, about half of the items on the rated test suffer from "confounded causality" on at least one of these criteria.

The question of whether the data presented here are, in fact, "credible" deserves attention. The data collected from the debriefing presented earlier barely scratch the surface of the potential threats to validity. Perhaps the biggest issues stem from the fact that the study was purposefully constructed to include both educators and parents. The fact that parents felt less knowledgeable about what should be in the regular school curriculum may have resulted in an exaggeration of the percentage of items that were deemed suspect on this criterion.
Additionally, some respondents felt it difficult to judge whether items might be unduly influenced by a student's native intelligence (where do you draw the line between native intelligence and knowledge learned in school?), and some felt that their own social standing made it hard for them to determine if a student's socio-economic background would greatly influence the likelihood of answering a test item correctly. Regardless of criterion, the rating process asked for a judgment, that is, a subjective assessment of an item's appropriateness. These are difficult judgments to make.

Yet, in terms of the message to policy-makers, this is precisely the point. Aggregate average scores on standardized tests are at best a gross approximation of the instructional quality of a school, and any number of factors may have more to do with the production of this number than the quality of educational services delivered. We should be questioning what these numbers mean, especially considering the fact that in many states the numbers are being used to reward or punish school staff and students. By design, policy-makers have raised the stakes. As this analysis shows, though, when you get beneath the summary number and ask whether the test items that go into producing that number are sensible measures of knowledge and skills learned in school, the answer is far from clear. This would suggest, at a minimum, that policy-makers should consider eliminating or de-emphasizing their use of norm-referenced achievement tests as a barometer of how well a school is doing.

Note

Research presented here was supported by a grant from the School Leadership Center of Greater New Orleans. An earlier version of this work was presented at the Annual Meeting of the Mid-South Educational Research Association, Point Clear, AL, November 1999.

References


Carnevale, A. & Kimmel, E. (1997). A national test: Balancing policy and technical issues. Princeton, NJ: Educational Testing Service.

Cohen, D. (1995). What is the system in systemic reform? Educational Researcher, 24, 11-17.

Corbett, H. and Wilson, B. (1991). Testing, reform and rebellion. Norwood, NJ: Ablex.

Council of Chief State School Officers. (1998). Key state education policies on K-12 education: Standards, graduation, assessment, teacher licensure, time and attendance, a 50-state report. Washington, DC: Author.

Firestone, W., Mayrowetz, D., and Fairman, J. (1998). Performance-based assessment and instructional change: The effects of testing in Maine and Maryland. Educational Evaluation and Policy Analysis, 20, 95-113.

Ginsberg, R. and Berry, B. (1998). The capability for enhancing accountability. In R. MacPherson (Ed.), The politics of accountability: Educative and international perspectives (pp. 43-61). Newbury Park, CA: Corwin.

Kelley, C. and Protsik, C. (1997). Risk and reward: Perspectives on the implementation of Kentucky's school-based performance award program. Educational Administration Quarterly, 33, 474-505.

Kirst, M. (1990). Accountability: Implications for state and local policymakers. Washington, DC: OERI.

Linn, R. (1999). Standards-based accountability: Ten suggestions (CRESST Policy Brief). Los Angeles: National Center for Research on Evaluation, Standards and Student Testing.

Mayer, D. (1998). Do new teaching standards undermine performance on old tests? Educational Evaluation and Policy Analysis, 20, 53-74.

Mitchell, K. (1995). Reforming and conforming: NASDC principals talk about the impact of accountability systems on school reform (Technical Report 143). Santa Monica, CA: Rand Corp.

Popham, W. J. (1999, March). Why standardized tests don't measure educational quality. Educational Leadership, 57, 8-15.

Schmoker, M. (2000, February). The results we want. Educational Leadership, 58, 62-65.

Sheldon, K. and Biddle, B. (1998). Standards, accountability and school reform: Perils and pitfalls. Teachers College Record, 100(1), 164-180.

Stecher, B. & Barron, S. (1999). Test-based accountability: The perverse consequences of milepost testing. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada, April 1999.

Stevens, J. (1996). Applied multivariate statistics for the social sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Wallace, D. (2000, February). Results, results, results? Educational Leadership, 58, 66-68.

Watts, J., Gaines, G. & Creech, J. (1998). Getting results: A fresh look at school accountability. Atlanta: Southern Regional Education Board.

Wolk, R. (1998). Education's high-stakes gamble. Education Week, 18(15), 48.

About the Author

Scott C. Bauer
348J Bicentennial Education Bldg.
University of New Orleans
New Orleans, LA 70148
504-280-6446 (voice)
504-280-6453 (fax)

Scott C. Bauer (Ph.D., Cornell University) is an assistant professor in the Educational Leadership, Counseling, and Foundations Department in the College of Education at the University of New Orleans, and Director of Research at the School Leadership Center of Greater New Orleans. His research and teaching focus on the application of the principles of organizational behavior and development to the study of school leadership, organizational change and restructuring. Dr. Bauer's most recent work deals with designing and implementing site-based decision making systems in schools.

Appendix A
Descriptive statistics: mean percentages, standard deviations and range of all ratings

Skill area                Criteria     Rating     Mean   sd     high   low

Vocabulary                importance   not sure   .07    .11    .40    0
                                       no         .04    .06    .20    0
                          taught       not sure   .21    .22    1.00   0
                                       no         .06    .09    .40    0
                          SES          not sure   .14    .15    .75    0
                                       no         .24    .23    1.00   0
                          IQ           not sure   .15    .12    .35    0
                                       no         .27    .24    1.00   0
                          Validity     not sure   .12    .14    .45    0
                                       no         .14    .15    .75    0

Reading comprehension     importance   not sure   .07    .12    .53    0
                                       no         .07    .10    .37    0
                          taught       not sure   .16    .23    1.00   0
                                       no         .07    .11    .30    0
                          SES          not sure   .08    .09    .30    0
                                       no         .20    .25    .93    0
                          IQ           not sure   .13    .19    .93    0
                                       no         .28    .27    .97    0
                          Validity     not sure   .12    .13    .57    0
                                       no         .15    .17    .77    0

Grammar and Language      importance   not sure   .05    .07    .23    0
                                       no         .03    .05    .23    0
                          taught       not sure   .17    .21    1.00   0
                                       no         .07    .16    .77    0
                          SES          not sure   .13    .14    .57    0
                                       no         .24    .27    1.00   0
                          IQ           not sure   .12    .15    .47    0
                                       no         .26    .32    1.00   0
                          Validity     not sure   .10    .11    .40    0
                                       no         .12    .11    .40    0

Math problem-solving      importance   not sure   .06    .11    .50    0
and reasoning                          no         .05    .06    .27    0
                          taught       not sure   .17    .22    1.00   0
                                       no         .07    .14    .57    0
                          SES          not sure   .05    .07    .27    0
                                       no         .13    .25    .97    0
                          IQ           not sure   .07    .08    .33    0
                                       no         .27    .30    .97    0
                          Validity     not sure   .10    .10    .30    0
                                       no         .11    .14    .67    0

Math Procedures           importance   not sure   .01    .03    .15    0
                                       no         .01    .03    .15    0
                          taught       not sure   .05    .10    .40    0
                                       no         .02    .04    .15    0
                          SES          not sure   .01    .02    .05    0
                                       no         .10    .27    1.00   0
                          IQ           not sure   .03    .05    .20    0
                                       no         .20    .36    1.00   0
                          Validity     not sure   .03    .05    .15    0
                                       no         .08    .15    .50    0

Appendix B
Descriptive statistics: mean percentages, standard deviations and range of combined ratings

Skill area                Criteria     Mean   sd     high   low

Vocabulary                importance   .11    .14    .50    0
                          taught       .26    .23    1.00   0
                          SES          .38    .27    1.00   0
                          IQ           .41    .24    1.00   0
                          validity     .26    .20    .80    0

Reading comprehension     importance   .14    .16    .57    0
                          taught       .26    .23    1.00   0
                          SES          .38    .27    1.00   0
                          IQ           .40    .30    .97    0
                          validity     .26    .20    .80    0

Grammar and Language      importance   .08    .10    .33    0
                          taught       .24    .25    1.00   0
                          SES          .37    .27    1.00   0
                          IQ           .38    .32    1.00   0
                          validity     .21    .18    .77    0

Math problem-solving      importance   .11    .12    .50    0
and reasoning             taught       .24    .22    1.00   0
                          SES          .19    .24    .97    0
                          IQ           .33    .30    1.00   0
                          validity     .21    .17    .80    0

Math Procedures           importance   .02    .05    .20    0
                          taught       .07    .10    .40    0
                          SES          .11    .27    1.00   0
                          IQ           .22    .38    1.00   0
                          validity     .10    .16    .60    0


