Education Policy Analysis Archives
Volume 9 Number 7, March 4, 2001. ISSN 1068-2341
A peer-reviewed scholarly journal. Editor: Gene V Glass, College of Education, Arizona State University. Copyright 2001, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Critique of "An Evaluation of the Florida A-Plus Accountability and School Choice Program"

Gregory Camilli, Rutgers University
Katrina Bulkley, Rutgers University

Abstract

In 1999, Florida adopted the "A-Plus" accountability system, which included a provision that allowed students in certain low-performing schools to receive school vouchers. In a recently released report, An Evaluation of the Florida A-Plus Accountability and School Choice Program (Greene, 2001a), the author argued that early evidence from this program strongly implies that the program has led to significant improvement on test scores in schools threatened with vouchers. However, a careful analysis of Greene's findings and the Florida data suggests that these strong effects may be largely due to sample selection,
regression to the mean, and problems related to the aggregation of test score results.

One of the most closely watched state reforms in recent years is the use of school vouchers as a part of the accountability system for Florida's public schools. This program is of particular interest because of its strong similarities with proposals put forward by President George W. Bush. As a New York Times article noted, "Gov. Jeb Bush's educational program in Florida has been held up as a model for its combination of aggressive testing of schools' performance, backed by taxpayer-financed vouchers, which his brother President Bush is proposing for the nation as a whole" (Schemo, 2001).

A recently published report purports to show a convincing link between the threat of school vouchers for students in certain low-performing schools in Florida and achievement gains in those schools. An Evaluation of the Florida A-Plus Accountability and School Choice Program (Greene, 2001a) documents gains in achievement on the Florida Comprehensive Assessment Test (FCAT) in the areas of reading, mathematics, and writing. (This evaluation will be referred to as Evaluation of Florida's A-Plus Program for short.) These findings, not surprisingly, have received a substantial amount of attention in the popular press (cf. Schemo, 2001; Lopez, 2001; Greene, 2001b). The gains reported are attributed to incentives implemented under Title XVI (section 229.0535, "Authority to enforce school improvement") of the 2000 Florida Statutes:

It is the intent of the Legislature that all public schools be held accountable for students performing at acceptable levels.
A system of school improvement and accountability that assesses student performance by school, identifies schools in which students are not making adequate progress toward state standards, institutes appropriate measures for enforcing improvement, and provides rewards and sanctions based on performance shall be the responsibility of the State Board of Education.

In the A-Plus accountability system, schools are evaluated and assigned one of five grades (A, B, C, D, F) based primarily on FCAT scores, and to a lesser extent, the percent of eligible students tested and dropout rates (Florida Department of Education, 2001). If a school receives two grades of "F" in any four-year period, it becomes eligible for state board action. Contrary to the implication in Greene's title, such action is not limited to school choice; rather, actions may include providing additional resources, implementing a school plan or reorganization, hiring a new principal or staff, and other unspecified remedies designed to improve performance. However, the possibility of public schools losing children to either private schools or higher-performing public schools is clearly the area of most interest and controversy. In the 1999-2000 school year, two Pensacola elementary schools met the eligibility criteria (Note 1), and as a result, lost 53 children to private schools and 85 to other public schools.

Greene argued that his report "shows that the performance of students on academic tests improves when public schools are faced with the prospect that their students will receive vouchers" (p. 2). At the center of his argument is the fact that all 78 schools that received an "F" in 1999 received a higher grade in 2000. His claim that the threat of vouchers was responsible for the improvement of "F" schools (from the 1998-1999 to the 1999-2000 school year) includes several important elements.
First, an attempt was made to show the validity of the FCAT by showing a strong correlation to another test (the Stanford-9) given in Florida in 2000. Given this evidence, he then proceeded to show the average gains for each school receiving a particular grade. Based on the latter results, it was concluded that:

The most obvious explanation for these findings is that an accountability system with vouchers as the sanction for repeated failure really motivates schools to improve. (p. 9)

However, Greene also wrote:

While the evidence presented in the report supports the claims of advocates of an accountability system and advocates of choice and competition in education, the results cannot be considered definitive. (p. 9)

The A-Plus accountability system was duly noted as being relatively new, with the voucher options used in only two schools in the state, and possible (though not likely) manipulation of FCAT scores. It is an additional alternative that Greene mentions, commonly known as regression to the mean, that is one main concern of this report. This paper also examines three other issues: (1) sample selection, (2) the combining of gain scores across grade levels, and (3) the use of the school as the unit of analysis. Below, we subsume the latter two items under the category of "aggregation."

The potential policy importance of the findings Greene reports places a heavy burden on his study to demonstrate that the improved scores in schools that had previously received one "F" are in fact meaningful improvements and a result of school changes linked to the threat of vouchers. We argue here that the evidence does not support this conclusion. We show that there may have been some small achievement gains in Florida from 1999-2000, but these effects were vastly overestimated in Greene's analysis. However, even if these modest outcomes withstand further investigation, it is not at all clear that they resulted from the threat of vouchers as opposed to other aspects of the accountability program.

Background

Several recent reforms have similar components to the Florida effort.
It is not the purpose of this report to review that literature, but two well-known reforms deserve mention. One of these, which Greene specifically addresses, is the Texas accountability system and its use of the Texas Assessment of Academic Skills (TAAS). Another is the public voucher program in the city of Milwaukee. Comparisons between each of these reforms and Florida's A-Plus accountability system are limited for a variety of reasons. The accountability system in Texas varies in critical ways from the model in Florida, especially in the use of vouchers as a sanction in the latter state but not the former. Greene did, however, address an important methodological concern (discussed below) that arose in a recent study of the TAAS (Klein, Hamilton, McCaffrey, and Stecher, 2000). In the area of publicly funded vouchers, students in Milwaukee who meet certain income requirements are eligible to receive vouchers allowing them to attend local private schools. Several evaluations have been done of this program (e.g., Witte, 1996; Greene, Peterson and Du, 1998). These evaluations are not comparable to the Florida evaluation because they examined the test scores of individual students who either received vouchers or applied for vouchers but did not receive one; the Greene study focuses on the impact of the threat of vouchers on school test scores, not the actual
provision of vouchers.

Summary of the Evaluation of Florida's A-Plus Program

In Evaluation of Florida's A-Plus Program (Greene, 2001a, Table 2), the main results were obtained by aggregating across grades for school types A, B, C, D, and F. These results are reproduced in Table 1 below.

Table 1
FCAT Reading and Mathematics 1999-2000 Gains from Greene's "An Evaluation of the Florida A-Plus Accountability and School Choice Program"

Grade   Reading   Math    Writing
A         1.90    11.02     .36
B         4.85     9.30     .39
C         4.60    11.81     .45
D        10.62    16.06     .52
F        17.59    25.66     .87

To obtain the overall reading and writing gains, gains at the 4th, 8th, and 10th grade levels were pooled, while for mathematics, gains at the 5th, 8th, and 10th grade levels were pooled. School means for standard curriculum students were used to compute gains, not individual student scores. It can be seen that the average gains for "F" schools "are more than twice as large as those experienced in schools with higher state-assigned grades" (Greene, 2001a, p. 6). These gains for "F" schools were then translated into effect sizes for reading (.80), mathematics (1.25), and writing (2.23) (Greene, 2001a, endnotes 12-14). No doubt, as computed, these gains are statistically significant. They are also among the highest gains ever recorded for an educational intervention. Results like these, if true, would be nothing short of miraculous, far outpacing the reported achievement gains in Texas and North Carolina. This may have moved Greene to conclude:

While one cannot anticipate or rule out all plausible alternative explanations for the findings reported in this study, one should follow the general advice to expect horses when one hears hoof beats, not zebras. The most plausible interpretation of the evidence is that the Florida A-Plus system relies upon a valid system of testing and produces the desired incentives to failing schools to improve their performance. (p. 14)

Critique of the Evaluation of Florida's A-Plus Program

Our critique of Greene's evaluation focuses primarily on two problematic issues: aggregation and regression to the mean. We do not examine in detail Greene's validation argument for the FCAT based on its correlations with the Stanford-9 (the latter given in
2000). Greene's correlational analysis was conducted partly in response to concerns raised by Klein and his colleagues (2000) about the validity of the TAAS in Texas. However, it is worth noting that while the two tests have substantial correlations (in the range .85-.95), correlation coefficients computed on aggregate scores typically have much higher values than those computed with student scores. For example, school means on the reading and mathematics sections of the FCAT in 8th grade have a correlation of about .96. This correlation should not be interpreted as meaning that the FCAT reading and mathematics tests are statistically indistinguishable, but rather that correlations on aggregate scores tend to be much higher than those for individual scores.

Sample Selection

Greene (2001a) used the school means of "standard curriculum" students to obtain school-level gain scores. Here "standard" defines a subset of students who tend to score higher on the FCAT (i.e., it does not include certain types of students with disabilities). An alternative method of choosing a sample is to use the results for all curriculum groups, and these data are available on the Florida Department of Education web pages. While there is nothing intrinsically wrong with using standard curriculum students, for the purposes of evaluation it would seem preferable to look at the potential impact of the A-Plus program on all curriculum groups. Florida administrative statutes allow for (or require) nontrivial variation in the populations selected for determining school grades (Note 2).

Aggregation

In the analyses below, we disaggregate results by grade.
This is useful because overall state gains (Florida Department of Education, 2001) vary by grade, as shown in Table 2.

Table 2
FCAT Score Gains from School Year 1998-1999 to 1999-2000

Grade   Reading   Math   Writing
4          5.0     N/A     0.0
5          N/A    11.0     N/A
8         -5.0     7.0     0.0
10        -4.0     3.0     0.1

The data in Table 2 suggest several problems with aggregation across grades. First, the results of a policy implementation may be different at different grades, even if this is not an a priori expectation. Second, in order to fine-tune a successful policy, or weed out an unsuccessful one, suitable diagnostic information is critical. Furthermore, a subtle problem arises when mixing the scales of two different instruments given at different grades. How can we be sure that this isn't the old apples-and-oranges problem? To be safe, the best advice is to conduct separate analyses and
then to combine them while making explicit the assumptions involved.

A more subtle problem involves the computation of effect size (Hedges, 1985), which is typically taken to be

Delta = (x - E[x]) / sigma

This formula can be read as the difference between an observed value and an expectation, divided by the standard deviation. In practice, the expectation E[x] could be a school's average test score for the prior year, and x could be taken as the score for the current year. It is also typical practice to use a measure of individual student variation in the denominator for sigma, to facilitate a standard interpretation. For example, Delta = 1 means that the average student in the "treatment" population scores at the 84th percentile of the "control" population. Likewise, Delta = 2 means that the average student in the "treatment" population scores at the 98th percentile of the "control" population. So the interpretation is anchored in individual student achievement.

In contrast, Greene computed effect sizes relative to the standard deviation (SD) of schools, and though this is technically defensible, it must be recognized that such an effect size does not have the usual interpretation. In fact, we have estimated that the individual-level standard deviations (SD) are about 70 score points for reading and mathematics, and about .85 point for writing, while the school-level SDs are about 20 points for reading and mathematics, and about .39 point for writing. Thus, an effect size for reading based on the school-level SD would be about 3.5 times as large as one based on the individual-level SD. At face value, the effect sizes computed by Greene, ranging from .80 to 2.23, are implausible because many studies have found that even especially large educational effects (produced under laboratory conditions) fall into the range of .4 to .7.
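The dependence on the choice of denominator can be made concrete with a short sketch. The numbers below are the rounded estimates quoted above (the "F"-school reading gain from Table 1 and the approximate individual-level and school-level SDs), not the actual school-level data, so the results are illustrative only:

```python
# Effect size Delta = (x - E[x]) / sigma computed two ways: against the
# individual-level SD (the conventional interpretation) and against the
# school-level SD (Greene's choice). Same raw gain, very different Delta.
from statistics import NormalDist

def effect_size(gain, sd):
    """Delta = gain / sigma."""
    return gain / sd

raw_gain = 17.59        # reading gain reported for "F" schools (Table 1)
individual_sd = 70.0    # approximate student-level SD estimated in the text
school_sd = 20.0        # approximate school-level SD estimated in the text

d_individual = effect_size(raw_gain, individual_sd)
d_school = effect_size(raw_gain, school_sd)

# Anchoring the interpretation in individual achievement: Delta = 1 places
# the average treated student at the 84th percentile of the control group.
percentile = NormalDist().cdf(1.0)

print(round(d_individual, 2))  # 0.25
print(round(d_school, 2))      # 0.88
print(round(percentile, 2))    # 0.84
```

The same 17.59-point gain looks modest against student-level variation but large against the much tighter distribution of school means, which is the inflation at issue here.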
But even if Greene's effect sizes are rescaled for comparability, they are still inflated by other factors, including regression to the mean (see below) and an inappropriately selected definition of the expectation E[x]. In regard to the latter issue, the effect of a treatment is usually defined as the net effect above and beyond average growth (the latter is referred to by statisticians as the grand mean). Thus, gain is defined as the net effect above average, and loss as the net effect below average. In this case, the average is the overall state gain, and the deviation from the grand mean represents the unique effect of a particular treatment or intervention. For example, take the average state gain for 4th grade reading in Table 2 of 5 points. If an intervention is to register as positive, it should produce more than 5 points, since 5 points is what could be expected with no intervention whatsoever. It is not very useful to apply this correction to Greene's Table 1 because the results are aggregated across grades. However, in our analyses below, we build in this correction. We also use the individual-level standard deviation to facilitate the comparability of effect sizes to the general research literature.

Regression to the mean

Campbell and Stanley (1966), in their classic volume Experimental and Quasi-Experimental Designs for Research, defined the internal validity of an experiment as:

The basic minimum without which any experiment is uninterpretable: Did in fact the experimental treatments make a difference in this specific experimental instance? (p. 5)
In a very simple investigation, there are only two measurements taken: the pretest (O1) and, after the experimental intervention, the posttest (O2). Campbell and Stanley (1966) listed five definite weaknesses of this "One-Group Pretest-Posttest Design" and one potential concern which is of central importance to Greene's evaluation: regression to the mean or, alternatively, regression artifacts. They explained:

If, for example, in a remediation experiment, students are picked for a special experimental treatment because they do particularly poorly on an achievement test (which becomes for them O1), then on a subsequent testing using a parallel form or repeating the same test, O2 for this group will almost surely average higher than O1. This dependable result is not due to any genuine effect of [the intervention], test-retest practice effect, etc. It is a rather tautological aspect of the imperfect correlation between O1 and O2. (p. 10)

In short, experimental units chosen on the basis of extreme scores tend to drift toward the mean upon posttest: low scores drift upward and high scores drift downward. Campbell and Stanley (1966) then gave an extended treatment to this topic because "errors of inference due to overlooking regression effects have been so troublesome in educational research," and "the fundamental insight into their nature is so frequently missed" (p. 10). The regression phenomenon emerged from Francis Galton's studies of inheritance in biology, and this subject provides the most common phrasing of the regression to the mean effect: tall fathers tend to have tall sons, but not as tall on average as the fathers, while short fathers have short sons, but not as short on average as the fathers. It can be seen in Table 1 for all three FCAT subjects that the trend is for higher achieving schools to gain less and lower achieving schools to gain more.
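A small simulation illustrates the mechanism Campbell and Stanley describe. Every number below is invented for illustration (the means, SDs, and cutoff bear no relation to the FCAT data); the point is only that units selected for extreme pretest scores show a "gain" with no treatment at all:

```python
# Regression-artifact simulation: schools are measured twice with noisy,
# imperfectly correlated tests and receive NO intervention. Those selected
# for the lowest pretest scores still post higher scores the second time.
import random

random.seed(7)
N = 1000
true_score = [random.gauss(300, 20) for _ in range(N)]
# Two parallel, imperfect measurements of the same underlying true score.
pretest  = [t + random.gauss(0, 10) for t in true_score]
posttest = [t + random.gauss(0, 10) for t in true_score]

# Select roughly the bottom 5% on the pretest, as a grade cutoff would.
cutoff = sorted(pretest)[N // 20]
selected = [i for i in range(N) if pretest[i] <= cutoff]

pre_mean  = sum(pretest[i] for i in selected) / len(selected)
post_mean = sum(posttest[i] for i in selected) / len(selected)

# The selected group's posttest mean drifts up toward the grand mean even
# though nothing was done to the selected units: a pure regression artifact.
print(round(post_mean - pre_mean, 1))  # a positive "gain" with no treatment
```

Raising the measurement-error SD relative to the true-score SD (i.e., lowering the pretest-posttest correlation) makes the artificial "gain" larger, which is exactly the "imperfect correlation between O1 and O2" in the quoted passage.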
This is a tell-tale sign of statistical regression; that is, scores in the tails of the distribution tend to drift toward the mean. Higher scores drift downward and lower scores drift upward relative to average gains. Greene (2001a) did consider this possibility, but rejected it as a potential explanation, arguing that:

Regression to the mean is not a likely phenomenon for the exceptional improvement made by the F schools because the scores for those schools were nowhere near the bottom of the scale for possible results. The average F school reading score was 254.70 in 1999, far above the lowest possible score of 100.

Likewise, the average FCAT mathematics and writing scores of the F schools were 272.5 on a scale of 100-500 and 2.40 on a scale from 1-6, respectively. Greene thus concluded that regression to the mean was not a problem because the scores of the F schools were not at all extreme.

This is an inaccurate notion of regression to the mean because "extremeness" should be evaluated in terms of distance (in standard deviation units) below the overall group mean, rather than relative to the lowest possible score. A good measure of "distance below the mean" can be given in z-score units, which are interpreted as "standard deviations below the mean" in the distribution of school means; z-scores of -3.00 and lower generally indicate substantial distance below the mean. To check for extremeness, we calculated the z-scores of the lowest performing school in 4th, 8th, and 10th grade reading, and 5th, 8th, and 10th grade mathematics. These z-scores ranged
from a high of -3.2 to a low of -4.5, indicating a strong likelihood of obtaining a regression artifact in simple difference scores; however, the writing scores tended to be less extreme for the "F" schools.

In North Carolina, it was recognized that "Students who are proficient may grow faster" and "students who score low one year may score higher the next year, partly due to 'regression to the mean'" (Public Schools of North Carolina, 2000, p. 2). Both influences on achievement are explicitly taken into account in the North Carolina system when computing expected growth for schools. As noted by Campbell and Stanley (1966), the incorrect interpretation of regression effects has plagued educational research for decades. To give an example, consider a study by Glass and Robbins (1967) in which the SAT was given to a group of students, and researchers then took the high scorers as the control group and the low scorers as the treatment group. Predictably, the treatment showed a positive effect that disappeared when regression effects were taken into account (Glass & Robbins, 1967).

Methods

Data Sources

The state of Florida has an exceptional policy of granting the public full access to state, district, and school level test scores, and other variables such as class size, per pupil expenditures, and the like. These data files containing school means for all curriculum students can be downloaded in the form of Excel spreadsheets at the Florida Department of Education website. For the present analysis, reading, mathematics, and writing FCAT scores at the school level were downloaded for both the 1998-1999 and 1999-2000 school years. Department staff provided a spreadsheet containing school grades, with district and school identification numbers, for the 1998-1999 school year.
Residual gain score analysis

Since we strongly suspected that the statistics in Table 1 were affected by at least two sources of error (regression to the mean and an incorrect definition of net effect), we reanalyzed the data using the technique of residual gain scores. Glass and Hopkins (1996) described the context for residual gains:

Administering parallel forms of the achievement test before [O1] and after [O2] instruction, then subtracting the pretest score from the posttest score [O2 - O1] for each student produces a measure that is far closer to the researcher's notion of a measurement of an achievement gain. One difficulty remains: Such a posttest-minus-pretest measure, [O2 - O1], is contaminated by the regression effect, usually correlating negatively with the pretest scores [O1] ... A better method to measure gain or change is to predict posttest scores [O2'] from pretest scores [O1] and use the deviation [O2 - O2'] as a measure of gain, above and beyond what is predictable by the pretest alone. (p. 167)

In the present case of the FCAT scores, O1 is the pretest and O2 is the posttest. Everything else in the present case is the same as in Glass and Hopkins's recommendation. By using residual gains, two goals are accomplished. First, the regression effect is removed because the predicted score takes into account movement toward the mean. Second, the predicted value takes into account the average state gain; it will lead to unique net (policy) effects for any particular accountability grade.

Results

Average residual gains for the FCAT reading, mathematics, and writing tests, disaggregated by grade, are given in Tables 3 (reading), 4 (mathematics), and 5 (writing) below.

Table 3
Average Residual Gains for FCAT Reading

Table 4
Average Residual Gains for FCAT Mathematics
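The residual-gain computation described under Methods can be sketched in a few lines. The data here are fabricated school means (the slope, intercept, and noise values are purely illustrative, not estimates from the Florida files); the sketch shows only the mechanics of predicting O2 from O1 and taking deviations:

```python
# Residual gain scores: regress posttest school means on pretest means,
# then take each school's deviation from its predicted posttest score.
# The prediction absorbs both regression to the mean and the average
# statewide gain, so residuals measure gain net of those influences.
import random

random.seed(3)
n = 200
pre = [random.gauss(300, 20) for _ in range(n)]
# Fabricated posttest: imperfectly related to pretest, plus an overall gain.
post = [0.8 * p + 60 + 5 + random.gauss(0, 10) for p in pre]

# Ordinary least squares slope and intercept for predicting post from pre.
mean_pre = sum(pre) / n
mean_post = sum(post) / n
slope = (sum((p - mean_pre) * (q - mean_post) for p, q in zip(pre, post))
         / sum((p - mean_pre) ** 2 for p in pre))
intercept = mean_post - slope * mean_pre

# Residual gain = observed posttest minus predicted posttest (O2 - O2').
residual_gain = [q - (intercept + slope * p) for p, q in zip(pre, post)]

# With an intercept in the model, residuals average zero across all schools
# by construction; a policy effect would appear as a nonzero mean residual
# within a subgroup (e.g., the schools assigned a particular grade).
mean_residual = sum(residual_gain) / n
print(round(mean_residual, 6))
```

In the actual analysis, the regression would be fit separately by grade level and subject, and the quantity of interest is the average residual within each accountability-grade category.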
In Tables 3 and 4, the largest effects are in the 8th grade, but in terms of standard deviation (SD) units, these effects are small (Note 3). Using the individual student SD of about 70 (versus the school SD of about 23), the effect size for 8th grade reading is Delta = .10, and for 8th grade math is about Delta = .13. We think it is not worthwhile to persevere on whether these effects are statistically significant because they are relatively small and other sources of possible bias cannot be plausibly ruled out as causes. For example, slight nonlinearities in the regressions might account for the higher effect sizes for the 8th grade F schools. In addition, the average effect for this group of only 7 schools is accompanied by a relatively high standard deviation. This means the overall positive effect is highly variable.

The results for FCAT writing are somewhat different from those in reading and mathematics. It can be seen in Table 5 that at the 4th grade level the average residual gain was .20 point on a scale that ranges from 1-6, and this effect is statistically significant. We estimated the individual-level SD to be about .88 point, and consequently the latter gain translates into an effect size of about .23. The average gains are also positive at 8th and 10th grade, but much smaller. Greene also found an effect for writing, but estimated it to have an effect size of 2.23.

Table 5
Average Residual Gains for FCAT Writing
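The writing effect size quoted above is simple arithmetic on the two estimates given in the text, and the contrast with Greene's school-level figure is worth making explicit:

```python
# Effect size for 4th-grade writing, per Delta = gain / SD.
# Both inputs are the approximations stated in the text.
writing_gain = 0.20    # average residual gain, on the 1-6 writing scale
individual_sd = 0.88   # estimated individual-level SD, in scale points

effect = writing_gain / individual_sd
print(round(effect, 2))  # 0.23, versus Greene's school-level figure of 2.23
```

The order-of-magnitude gap between .23 and 2.23 comes almost entirely from the choice of denominator, as discussed in the effect-size section above.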
Greene attempted to control for regression effects by comparing higher-scoring "F" schools to lower-scoring "D" schools: "These gains made by the higher-scoring F schools in excess of what were produced by the lower-scoring D schools are what we can reasonably estimate as the effect of the unique motivation that vouchers posed to those schools with the F designation" (p. 8). Using residual scores, we repeated this analysis using 40 schools in each of the above categories, aggregated across grade for reading and mathematics (though we do not suggest this as an analytic strategy). The estimates of effect were small and nonsignificant.

Discussion

The A-Plus accountability system in Florida, with its inclusion of school vouchers as one possible repercussion for low-performing schools, is a significant policy shift in the use of high-stakes assessment. Findings from evaluations of this program may thus play an important role in policy making in other states and at the federal level. Unfortunately, the Greene evaluation does not meet the methodological demands for such an evaluation. It is clear that Greene's analysis failed to account for both regression to the mean and obtaining a unique net effect of being labeled an "F" school. Sample selection is a debatable issue, and we have argued in this report that indicators based on all curriculum groups better satisfy the demands of evaluation.

Some have argued that information and research must be central to the improvement of schools:

Schools that consistently fail to educate poor children should not receive federal dollars, and states should be accountable to Washington for ensuring that this does not happen. Federal programs that can't demonstrate results should themselves be replaced by different strategies. Though innovation and experimentation should always be encouraged, rigorous evaluation is vital and federal funds should not flow to activities that do not yield results for children.
(Finn, Bruno & Ravitch, 2000)

In reply, we would argue that it is not always easy to demonstrate results given the kinds of data and accountability models that are readily available. As seen in Florida, the accountability model itself may cause some difficulty (Note 4). If schools in the lowest classification "F" improve, and yet this "improvement" is a regression artifact, then
teachers and principals and others may seize upon wholly irrelevant events as the causes of this improvement. Likewise, "D" schools that move down to the "F" classification may seize upon wholly irrelevant causes for their demise. While it is true that true "F" schools will tend to bounce up and down, and thus be more likely to become eligible for intervention, it is also true that the accountability system as currently structured may provide them with unreliable signs of their progress (or lack thereof).

Positive results are more helpful if they can be shown (by means of high quality evaluations) to be internally consistent with the policy mechanisms that presumably stimulated change. One can learn better from negative outcomes if it can be shown in some detail how the policy levers failed. In other words, learning more about how schools made improvements, or the reasons for slippage, is important, as is having confidence that the measures of loss or gain are both reliable and valid. Tying accountability to a single (or even a few) achievement outcomes has several downsides: (1) it does not automatically increase our knowledge about why things happened the way they did; (2) the use of statistical models for monitoring policy outcomes is technically demanding and requires obscure policy tools such as adjustments for regression to the mean. Moreover, it is problematic to conflate evaluation and accountability: program evaluation is intrinsically important to the mission of schools and should not be equated with establishing "results" as defined by Washington.

We can agree that hard-nosed evaluation is necessary, but it is useful to expand on what such evaluation activities should include:

Technical considerations. The state of Florida should consider methods that are used elsewhere (e.g., North Carolina) to stabilize the indicators that are used to designate school classifications.
Such models use past achievement data to estimate expected growth, and designate exemplary growth in a manner that controls for some statistical artifacts such as regression. Though there are costs associated with a more complex model, the decision to focus accountability on test scores requires a more sophisticated statistical apparatus within the accountability model. Moreover, a focus on a small set of indicators accompanied by significant sanctions can force schools to employ instructional methods that are optimized for short-term payoffs. Consequently, additional accountability components may be required to monitor for negative consequences such as an increase in the number of remedial classes, a focus on test preparation, curricular materials that are substantially similar to test preparation material, and increases in drop-out rates.

Policy considerations. One of the most important roles of policy evaluation is to inform policymakers not only about whether or not a program is working, but why it is having the noted effects. Evaluations that provide little or no information about the mechanisms that have led to reported changes are both less compelling and more subject to criticism. In Florida, there is currently little information about what schools are doing that would lead one to expect that scores would improve. This information is crucial for the future development of the accountability program and might include, for example, an evaluation of capacity within schools identified as needing intervention, or an analysis of how administrative rules are interpreted by local staff. Policy makers should also receive evaluation information regarding the accountability model or system itself, as well as the behavior that is the object of the model.

In the case of Florida, this report suggests that it is simply not clear whether or not the threat of vouchers is having a positive impact on student test scores.
There is some evidence of a small effect at 8th grade in reading and mathematics, and in writing at 4th grade. These findings should be investigated in a more thorough analysis (taking into account, for example, exclusion rates). If these findings withstand further analysis, it
would also be important to examine a number of potential causes including resources (e.g., professional development or teaching materials), school intervention plans, staffing changes, and other remedies taken to improve performance. In other words, it is overly simplistic to assume that the voucher threat was the only active agent, or that other causes were contingent on the voucher threat.

Conclusion

We offer an alternative to Greene's generous and simplistic reading of the evidence. At face value, the large gains (as seen in effect sizes of .80, 1.25, and 2.23 for reading, mathematics, and writing) were implausible and should have been submitted to additional methodological scrutiny. Upon such an examination, we have raised serious questions regarding the validity of Greene's empirical results and conclusions. Indeed, one should follow the general advice to expect horses, not unicorns, when one hears hoof beats.

Notes

1. These two schools were chosen for the voucher plan in the first year of the accountability policy, implemented in 1999. These schools did not meet the "2-out-of-4" policy, but had received an "F" in 1999 and appeared on a 1998 list of low-performing schools (Sandham, 1999).

2. It could be argued that the group of students who were used to determine the school grade might also be the appropriate sample. It appears that "standard curriculum" designates eligible students. According to the State Board of Education Administrative Rules (6A-1.09981):

(3)(a) For the purpose of calculating state and district results, the scores of all students enrolled in standard curriculum courses shall be included. This includes the scores of students who are speech impaired, gifted, hospital homebound, and Limited English Proficient (LEP) students who have been in an English for Speakers of Other Languages (ESOL) program for more than two (2) years.
To receive a grade of "D" or higher, schools are required to test at least 90% of their eligible students. There are additional restrictions on student inclusion for determining the school grade in 6A-1.09981:

(3)(b) For the purpose of designating a school's performance grade, only the scores of those students used in calculating state and district results who are enrolled in the second period and the third period full-time equivalent student membership survey as specified in Rule 6A-1.0451, FAC., shall be included.

Because these criteria, fairly applied, may create inconsistencies across schools, the group of all students tested may provide a school average better suited for the purposes of evaluation. It would also be useful to have the school median and exclusion rates.

3. The frequencies in Tables 3 and 4 differ slightly from the actual number of schools in each category. For example, 5 high schools received grades of "F," yet
there are only 4 in our study. In checking this result, we found that the 5th high school was no longer listed in official documents in 1999-2000. Other than this difference, however, our data agree with state data in terms of the numbers of "F" schools for elementary, middle, and high schools.

4. We note that only two of the schools on the 1998 list of critically low-performing schools received an "F" in 1999. Likewise, none of the 78 schools receiving an "F" in 1999 also received an "F" in 2000; however, only 4 schools received an "F" in 2000.

References

Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.

Finn, C. E., Jr., Manno, B. V., & Ravitch, D. (2000). Education 2001: Getting the Job Done. A Memorandum to the President-Elect and the 107th Congress. Washington, D.C.: Thomas B. Fordham Foundation.

Florida Department of Education. (2001). FCAT Briefing Book. Tallahassee: author.

Glass, G. V, & Hopkins, K. D. (1996). Statistical Methods in Education and Psychology (3rd ed.). Boston: Allyn and Bacon.

Glass, G. V, & Robbins, M. P. (1967). A critique of experiments on the role of neurological organization in reading performance. Reading Research Quarterly, 3, 5-51.

Greene, J. P. (2001a). An Evaluation of the Florida A-Plus Accountability and School Choice Program. New York: The Manhattan Institute.

Greene, J. P. (2001b). Bush's School Plan: Why We Know It'll Work. New York Post (February 21).

Greene, J. P., Peterson, P. E., & Du, J. (1998). School Choice in Milwaukee: A Randomized Experiment. In P. E. Peterson & B. C. Hassel (Eds.), Learning from School Choice. Washington, D.C.: Brookings Institution Press.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.

Klein, S. P., Hamilton, L. S., McCaffrey, D. F., & Stecher, B. M. (2000). What Do Test Scores in Texas Tell Us? Education Policy Analysis Archives, 8(49).
Lopez, K. J. (2001). Bush Ed Plan Rates an "A." National Review Online (February 22).

Public Schools of North Carolina. (2000). Setting Annual Growth Standards: "The Formula." North Carolina Department of Public Instruction, North Carolina: author.

Sandham, J. L. (1999). Schools Hit by Vouchers Fight Back. Education Week (September 15).
Schemo, D. J. (2001). Threat of Vouchers Motivates Schools to Improve, Study Says. New York Times (February 16).

Witte, J. F. (1996). Who Benefits from the Milwaukee Choice Program? In B. Fuller & R. F. Elmore (Eds.), Who Chooses? Who Loses? New York: Teachers College Press.

About the Authors

Gregory Camilli
Professor
Rutgers Graduate School of Education
Email: email@example.com

Gregory Camilli is Professor of Educational Psychology at the Rutgers Graduate School of Education and a Senior Researcher in the Center for Educational Policy Analysis. His areas of research interest include psychometric issues in educational policy, meta-analysis, and differential item functioning. Examples of recent publications include "Values and state ratings: An examination of the state-by-state education indicators in Quality Counts" (Educational Measurement: Issues and Practice, 2000), "Application of a method of estimating DIF for polytomous test items" (Journal of Educational and Behavioral Statistics, 1999), "Standard error in educational programs: A policy analysis perspective" (Educational Policy Analysis Archives, 1996), and Methods for Identifying Biased Items (Sage, 1994). Camilli has been or is currently a member of the editorial boards of Educational Measurement: Issues and Practice, Educational Policy Analysis Archives, and Education Review. He is a regular reviewer for Applied Measurement in Education, Journal of Educational Measurement, Psychometrika, and Psychological Methods, among others. As a member of the Technical Advisory Committee of the New Jersey Basic Skills Assessment Council, he provides expertise on testing and measurement issues to the New Jersey state assessment program.

Katrina Bulkley
Assistant Professor
Rutgers Graduate School of Education
Email: firstname.lastname@example.org

Katrina Bulkley is an Assistant Professor of Educational Policy at the Rutgers University Graduate School of Education.
Much of her work has focused on issues involving school choice and charter schools. Recent articles include "Charter School Authorizers: A New Governance Mechanism?" in Educational Policy (November 1999), and "'New Improved' Mayors Take Over City Schools" (with Michael Kirst) in Phi Delta Kappan (March 2000). She is currently working with the Consortium for Policy Research in Education on a literature review of research on charter schools and a study of for-profit management companies and charter schools, and with the Center for Education Policy Analysis, located at Rutgers University, on two studies of the impact of standards, testing, and professional development on instructional practices in New Jersey. Bulkley has reviewed articles for Educational Evaluation and Policy Analysis, Educational Policy, and Policy Studies Journal.
Copyright 2001 by the Education Policy Analysis Archives

The World Wide Web address for the Education Policy Analysis Archives is epaa.asu.edu

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, email@example.com, or reach him at College of Education, Arizona State University, Tempe, AZ 85287-0211. (602-965-9644). The Commentary Editor is Casey D. Cobb: firstname.lastname@example.org.

EPAA Editorial Board

Michael W. Apple, University of Wisconsin
Greg Camilli, Rutgers University
John Covaleskie, Northern Michigan University
Alan Davis, University of Colorado, Denver
Sherman Dorn, University of South Florida
Mark E. Fetler, California Commission on Teacher Credentialing
Richard Garlikov, email@example.com
Thomas F. Green, Syracuse University
Alison I. Griffith, York University
Arlen Gullickson, Western Michigan University
Ernest R. House, University of Colorado
Aimee Howley, Ohio University
Craig B. Howley, Appalachia Educational Laboratory
William Hunter, University of Calgary
Daniel Kallós, Umeå University
Benjamin Levin, University of Manitoba
Thomas Mauhs-Pugh, Green Mountain College
Dewayne Matthews, Western Interstate Commission for Higher Education
William McInerney, Purdue University
Mary McKeown-Moak, MGT of America (Austin, TX)
Les McLean, University of Toronto
Susan Bobbitt Nolen, University of Washington
Anne L. Pemberton, firstname.lastname@example.org
Hugh G. Petrie, SUNY Buffalo
Richard C. Richardson, New York University
Anthony G. Rud Jr., Purdue University
Dennis Sayers, Ann Leavenworth Center for Accelerated Learning
Jay D. Scribner, University of Texas at Austin
Michael Scriven, email@example.com
Robert E. Stake, University of Illinois-UC