E11-00256
Educational policy analysis archives. Vol. 10, no. 8 (January 28, 2002).
Tempe, Ariz. : Arizona State University ; Tampa, Fla. : University of South Florida.
January 28, 2002
Reaction to Bolon's Significance of test-based ratings for metropolitan Boston schools / Stephan Michelson.
Education Policy Analysis Archives (EPAA)
Education Policy Analysis Archives

Volume 10 Number 8
January 28, 2002
ISSN 1068-2341

A peer-reviewed scholarly journal
Editor: Gene V Glass
College of Education
Arizona State University

Copyright 2002, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold.

Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Reactions to Bolon's "Significance of Test-based Ratings for Metropolitan Boston Schools"

Stephan Michelson
Longbranch Research Associates

Citation: Michelson, S. (2002, January 28). Reactions to Bolon's "Significance of test-based ratings for metropolitan Boston schools." Education Policy Analysis Archives, 10(8). Retrieved [date] from http://epaa.asu.edu/epaa/v10n8/.

Abstract

Several concerns are raised about the procedures used and conclusions drawn in Craig Bolon's article "Significance of Test-based Ratings for Metropolitan Boston Schools" published in this journal as Number 42 of Volume 9.

Craig Bolon introduces his article about the Massachusetts Comprehensive Assessment System (MCAS), tenth grade mathematics tests, with a non-sequitur. "The state is treating scores and ratings as though they were precise educational measures of high significance," he tells us, but "statistically they are not." Nothing Bolon does leads to this conclusion. We do not know how "precise" these measures are, by any criterion. We do not know how "significant" they are, by any criterion. One wants the conclusions of an article to represent its contents. Bolon's do not.

Whether scores measure something consistently is called "reliability," and Bolon's exposition implies highly reliable tests. Whether scores measure something else, something besides themselves, is called "validity."
Bolon has not said what concept success on these tests is supposed to imply, or that they don't. He has not determined that these scores are invalid against some criterion. He finds them correlated with town income, a not surprising result that we have lived with for 35 years, since the publication of The Coleman Report. Nothing in that correlation informs us about the validity of these test scores, their ability to predict some other attribute of the students who take them.

Bolon concludes that student characteristics, such as limited English proficiency, "all failed to associate substantial additional variance" over and above that accounted for by town income. As I will explain, that criterion is not helpful in determining if a variable should be in a model explaining test scores. And at any rate, it is incorrect. Bolon reaches that conclusion by dropping Boston schools from his data, weighting observations by size of school, and failing to take advantage of the very information he provides. He also claims that the data cannot display "performance trends," though he has not established that there are trends that the data fail to capture. The incentive systems the state will associate with school performance are not yet in place. Bolon cannot therefore say that schools have failed to react to them.

Finally, Bolon tells us that per pupil school expenditure is not well related to school performance. Or perhaps he is saying that is true over and above town income. It appears to be the lack of correlation between test scores and school expenditure that leads Bolon to conclude that schools cannot raise themselves when found to be at a low level. In fact, per pupil expenditure is highly correlated with test score, and with income, also.

This comment is only about statistical analysis, about deriving information from data.
I say "information" rather than "fact" because all statistical results are probabilistic, associations one can choose to accept or not on many grounds. On the surface that is what Bolon offers, statistically derived information. But his conclusions do not follow from his data. His procedures are suspect and his interpretations are incorrect. Surely the discussion requires better statistical analysis.

Data

Bolon has put together data from various sources, describing 47 schools in 32 towns around Boston. For this he deserves much credit, although one should not expect much from these data. Per pupil expenditure, for example, is presumably over an entire school system, and includes many items that no one would think to associate with a mathematics test score. Only if all students taking the test have always been in the school system in which we see them, at the tenth grade--and assuming that last year's expenditures are representative of the last ten years of expenditures--would system-wide expenditure be a reasonable measure of resources expended on these students.

Within his data, Bolon recognizes that three Boston schools, entrance to which is predicated on an exam, "draw away many Boston students who tend to score well on achievement tests...." But he neither includes that fact in his statistical study, nor does he investigate the effect of this "creaming" on other Boston schools. Let us do so quickly.

Boston schools, on average, do not score differently from non-Boston schools. Non-Boston schools do score higher (226.6 to 212.5 in 1999), but a difference this large or larger would be expected to occur by chance more than forty-four times in one hundred in a world in which the actual difference were zero. In Boston, the three exam schools out-score the ten non-exam
schools (238 to 204.9), a difference that we would expect to occur by chance, assuming they were equal, fewer than two times in one hundred. So a statistical examination of test scores should account for both those Boston schools that contain the cream (the variable I call "exam"), and those that contain only the remaining skimmed milk (the variable "bosnoex").

Bolon includes schools that "provide vocational education in the same facilities as academic programs." He notes that "most students in vocational programs receive lower MCAS scores than students in academic programs" when they are in separate schools, but nonetheless takes no account of vocational students when mixed with academic students. Bolon indicates this mix with an asterisk attached to twelve schools, none of them in Boston. I created the variable "voc," with a value 1 for schools that have a vocational program, 0 for schools that do not. A better variable would be the percent of students in each school who were "vocational." An even better variable would be the percentage of test takers in each school who could be categorized that way. If the indicator variable has a negative effect and is statistically unlikely from chance, when we assume it is structurally unrelated, then perhaps it is telling us something about these data that no other variable captures. It would not be new information, but how would one justify not taking into account old, correct information in asking what new information these data contain?

Replication

Furthermore, there is the problem of "specification error." Although this seems like a technical term, it is part of a larger inquiry about the language we use to describe our findings. The "specification" of a single equation regression is simply the variables, and their forms (linear, logarithmic, quadratic, etc.), including weights.
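As a minimal sketch of how part of such a specification might be written down, the three indicator variables described in the Data section ("exam," "bosnoex" and "voc") can be derived from simple school attributes. The records below are hypothetical placeholders, not Bolon's data, and the field names (`in_boston`, `exam_school`, `has_vocational`) are assumptions introduced only for illustration.

```python
# Hypothetical school records; names and attributes are illustrative only.
schools = [
    {"name": "School A", "in_boston": True, "exam_school": True, "has_vocational": False},
    {"name": "School B", "in_boston": True, "exam_school": False, "has_vocational": False},
    {"name": "School C", "in_boston": False, "exam_school": False, "has_vocational": True},
    {"name": "School D", "in_boston": False, "exam_school": False, "has_vocational": False},
]


def indicator_row(school):
    """Return the 0/1 indicators exam, bosnoex and voc for one school."""
    exam = 1 if school["in_boston"] and school["exam_school"] else 0
    bosnoex = 1 if school["in_boston"] and not school["exam_school"] else 0
    voc = 1 if school["has_vocational"] else 0
    return {"exam": exam, "bosnoex": bosnoex, "voc": voc}


# One row of indicators per school, keyed by school name.
rows = {school["name"]: indicator_row(school) for school in schools}
for name, row in rows.items():
    print(name, row)
```

A non-Boston school without a vocational program gets zeros on all three indicators, so it serves as the reference category against which the other coefficients are read.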
Some variables might be important to specification even though their associated probability is high, because their inclusion changes the coefficients of other variables. One variable, though Bolon does not use it, does show some specification effect.

Bolon implies that school size is such a variable, but rather than including it, he weights observations by it. He is wrong on both counts: School size is not only not important for specification, weighting by it is incorrect. Weighting implies that there is more information in a large school than a small school. If size affects score, we should determine that relationship directly (as a variable in the equation). Otherwise Bolon's observations are of schools, each one telling us as much as any other.

To this point it appears that Bolon has not taken advantage of his own data, failing to define three variables (exam, bosnoex and voc) that his own discussion implies should be important. If I am going to explain the consequences of these omissions, I should first replicate his results. Only if I do so can I assert that differences between us are due solely to the changes in specification that I will offer. It turns out that there is more amiss here than just the variables. I placed my coefficients and standard errors into Excel, and let it round them to three places.

Bolon uses what to me is awkward language for R², the most common statistic emanating from a regression equation. He says that a model "associates X percent of the statistical variance." Perhaps he is avoiding the more usual term, "explains." I am sympathetic to that desire, but "associates" is usually followed by the word "with." I find his phrasing unintelligible. If one were asked to estimate the average school test score of each school, knowing nothing about schools, the best strategy is to estimate the mean of all schools every time.
R² essentially tells you how much better than that you can do (how much closer to real values your estimates are) if you generate an estimated score from the regression equation. R² can vary from zero to 1.00.

In an exercise in which Bolon will use different numbers of explanatory variables, he should report the adjusted R², which makes the statistic pay a penalty for using more variables. The easy strategy to increase R² is to use all the variables you have. R² itself can never decrease when variables are added, an important item to remember as Bolon changes data on us. Adjusted R² can decrease, telling you that you have paid too high a price for whatever information your last variable has added. It can even become negative. I will return, below, to how Bolon appears to be using R² to create his specification. In Table 1, a replication of his Table 2-6, to the three places Bolon reports, we agree exactly.

Table 1

The prefix "per" represents "percent." I believe we are looking at the percent black, Asian and Hispanic in the school, not necessarily taking the exam. Similarly, "perlimit" is percent of the school with limited English fluency (whether defined consistently from school to school I cannot say), and "perlunch" is percent receiving free or reduced cost lunch. A variable we will encounter below, "percapy," is not a percentage. It is per capita income in 1989 (1990 Census data). Short of administering a questionnaire to each student--and then dealing with the accuracy of the responses--these are the kinds of variables we have in education research, somewhat distant from what we would like to measure.

I will present R² and adjusted R² from left to right in the following tables, without including the word "adjusted." I could show a similar table (replicating Bolon's) for his Table 2-7. In Table 2-9, Bolon and I disagree in the last decimal place shown of four coefficients. It is possible that Bolon has data to more decimal places than he has reproduced. This possibility is suggested by coefficients for per capita income in his equation in Table 2-10.
Mine is 1.108, his is 1.104. Not consequential, but not a rounding difference, either.

A Change of Method

I think we can agree that my results are what Bolon would derive from the data he has provided. Just after his Table 2-13, Bolon makes a turn that he does not justify, and that I see no reason for. He tells us in Table 2-13 that the unadjusted R² for the three factor equation in Table 2-10, in 1999, is .80. That is, the residual variance of the difference between school scores and his estimate thereof is 20 percent of the value it was when he estimated every school to have the mean score of all schools. He has, in more traditional language, "explained" 80 percent of the variance in scores. Yet he will tell us, in Table 2-15, that a two factor model produces an R² of
.86, and that he needs only per capita income to generate an R² of .84. Higher unadjusted R² from fewer variables? Something else has changed.

Table 2

One difference lies in the statement just before Table 2-14, that from now on regressions are calculated "with schools weighted by numbers of test participants." Bolon does not tell us why he has changed his specification that way. If this is how regressions should be run on these data, why did he not start out doing so? Nor does this change in procedure alone allow me to replicate his results.

Table 3

In Tables 2 and 3 I have weighted by the adjusted number of test takers, as listed in Bolon's Appendix Table A3-1. Weighting makes little difference in the coefficients or the R². Weighting surely should be justified, which I believe it cannot be in this context, but neither is it of any consequence.

As implied by my table titles, and R² far below the values Bolon reports, what Bolon has done is drop some schools. Boston schools. He does so, partly, "because the students who score well on school-based standard tests are selected for admission to the three exam schools," and for other reasons mostly confined to a footnote. As we know, the exam schools can easily be represented by a variable. He notes that attributing the same per capita income to all ten schools in Boston does not allow that variable to explain differences in test scores among those schools. True, but deleting variation because you cannot explain it is guaranteed to both raise R² and
leave you ignorant. For Bolon to conclude, later, that per capita income is the only factor needed to explain most variation in test scores, is to ignore that he has deleted 21 percent of his data to achieve that result. It is not true of his initial data set. He does not feel compelled to make the same adjustment in Lynn, Newton or Quincy, all of which have the same per capita income applied to more than one school. Nor is it necessary to adjust the data at all.

One adjustment Bolon might have made is to have one observation per town, by creating a weighted average of scores (as he tells us how many test takers there are per school). In fact, I think there is no school effect to be found in these data--though there are pupil effects--and therefore a per town analysis would have been best. I will return to that view. Another answer is to use the variables I suggested above, which assume that exam schools will have higher than average scores because they have selected students on that basis, and that Boston's non-exam schools will have lower than average scores because the students they would ordinarily enroll have been snatched from them by the exam schools. First, I will show that dropping the Boston schools, not weighting, is the key to Bolon's analysis:

Table 4

Table 5

As earlier, here in Tables 4 and 5, comparing equations vertically on the left, there is little difference between weighted and non-weighted results, again raising the question why Bolon bothered to change his method at this point. Reading the weighted equations left to right on the
bottom, there are some differences between us in the last decimal place that cannot be attributed to output rounding. Essentially I replicate Bolon, which means that he has decided that Boston schools do not contribute to our knowledge about test scores in Boston-area schools.

He is quite wrong about that. Consider this five factor "model," which I present unweighted:

Table 6

I have added a column. I do not find the standard error a very informative statistic. Without the t-distribution at hand, one cannot translate the coefficient and standard error to probability. Many people report the t statistic, which is the ratio of the coefficient to its standard error, but it suffers from the same problem. I prefer to report the probability itself, as that is what everyone receiving other information is trying to derive.

Specification

Bolon is much too attached to reporting "significance," and eliminating variables "not significant at p < .05." That is what I call a "mechanical" approach to statistical inference, one a machine can do as well as a human. Probability is not like an on-off light switch. It is more like a dimmer. There is more or less of it. p = .06 should not be set aside as "not significant." It requires interpretation. I would accept a probability for the vocational variable higher than for other variables, because it masks the variation (the percent of exam takers who are vocational) that it should have. The equation in Table 6 does not need this explanation. It would have been a superior model with higher probabilities, but its probabilities meet Bolon's apparent criterion.

The three classification variables included here, derived from Bolon's data but not presented as such by him, are extremely powerful explainers of variation in school test scores. My equation achieves a higher R², even adjusted, than any unadjusted R² Bolon reports.
I don't say that should be the single criterion for choosing a model but it does seem to be one of Bolon's, and is a good starting point. Retaining all the data and achieving a higher R² than Bolon's model did, after his dropping data for the sole purpose of increasing R², surely argues that my model is more informative than his.

If statistical comparisons are too ornery--I often think they are--a picture should make my point. Let's compare the ability of Bolon's "one-factor model" and my "five factor model" to predict the data, to estimate the actual data points from applying the equation to input data. In Figure 1, test scores from 1999 are plotted for all 47 schools, against per capita income. Triangles represent data points. Estimated points are implied by the lines that join them. The solid line is the one Bolon would have us believe best describes the data, a regression on per
capita income alone. The dashed line follows estimates made from my five factor equation, Table 6. Does Bolon's one factor model capture the essence of this data set? I don't think so.

Figure 1

Boston schools lie on top of each other, because per capita income is defined only by town. Bolon's regression line, of course, cannot distinguish among them. However, my regression distinguishes all except Boston Latin very well. It is true that the non-exam Boston schools cannot vary in per capita income, but neither do they vary much in test score. Although not always in the same place, the ten Boston non-exam schools had the ten lowest school averages in all three years of Bolon's data. Boston Latin always had the highest score. Bolon's equation--or, to be more precise, my equation using Bolon's specification on all schools--under-estimates scores of low income towns and over-estimates scores of high income towns. In short, he presents not only an inaccurate but a biased view of the net relationship between per capita income and test score.

Besides looking to coefficient probability and R², how does Bolon (apparently) select his model? In Table 2-12 he presents R² for each combination of three factors. In Table 2-15 he asks how much increase in R² he can obtain from adding alternative other variables to his two factor model (now based on only 34 schools, weighted). He tells us that one factor, per capita income, gives him an R² of .84, so why bother with limited English proficiency when it increases R² to only .86? This is a political decision. Bolon wants to say that a model with no variable describing school or child characteristics tells us as much as we can know about these data. He wants to argue that school test data that relate only to town income cannot be valid, though this is not how validity is measured. There is no statistical justification for dropping other variables.

Table 7
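The R² comparison Bolon makes for his reduced data set can be checked against the usual adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - k - 1). The sketch below applies it to the rounded figures quoted above (R² of .84 with one factor and .86 with two, on 34 schools). It assumes an intercept is fitted and ignores the weighting, so the results are only approximate.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a regression that includes an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Rounded R-squared values as quoted in the text, on 34 schools.
one_factor = adjusted_r2(0.84, n_obs=34, n_predictors=1)
two_factor = adjusted_r2(0.86, n_obs=34, n_predictors=2)

print(round(one_factor, 3), round(two_factor, 3))  # roughly 0.835 and 0.851
```

On these rounded inputs the second variable still raises the adjusted figure, which suggests it more than pays for the degree of freedom it uses.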
In Table 7, I show, first, how much the addition of each variable in my Five Factor Model adds to R². The number after "drop" is the R² from the remaining four variables. Thus, for example, per capita income adds .1476 to R², because R² is .7575 without it, and .9051 with it. Per capita income alone generates an R² of .4734, from the bottom half of Table 7. In other words, .3258 of the variation "explained" by per capita income is also explained by other variables. That is, 69 percent of the apparent explanatory power of per capita income may or may not be due to per capita income. It is "shared" with other variables. One might say that the variable "exam" explains little--an R² = .10 when it is the only variable in the equation--but only 15 percent of that is shared with other explanatory factors. Limited English proficiency explains more than any other single variable, but shares 97 percent of that variation with other variables. Is this a reason to drop this variable?

Not statistically. When Bolon drops perlimit, he allocates its shared explanatory variation to the remaining variables, in his case to percapy, by fiat. In the regression context, we just do not know whether it is per capita income or English proficiency, or both, or neither (they may both be proxies for something else) that is associated with variation in test scores. We let the multiple regression statistics tell us whether adding perlimit is worth the price of using up a degree of freedom. The variable describing limited English proficiency belongs in the model, even though 97 percent of the variation in test score that it explains can be explained by other variables.
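The arithmetic behind these shares can be written out explicitly. The sketch below uses only the three R² values quoted above for per capita income. The function name is a hypothetical label, not a term from the article.

```python
def unique_and_shared(r2_full, r2_without, r2_alone):
    """Split a variable's stand-alone R-squared into unique and shared parts.

    unique: what the variable adds when entered last (full minus without).
    shared: the part of its stand-alone R-squared also claimed by others.
    """
    unique = r2_full - r2_without
    shared = r2_alone - unique
    return unique, shared, shared / r2_alone


# Values quoted in the text for per capita income (percapy).
unique, shared, share = unique_and_shared(0.9051, 0.7575, 0.4734)
print(f"unique={unique:.4f} shared={shared:.4f} share={share:.0%}")
# unique=0.1476 shared=0.3258 share=69%
```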
We can believe, with about as much confidence as one ever can get from data such as these, that where there are more students with limited English proficiency, the school's average test score will be lower.

The only exception to this arithmetic, where adding the R² of the equation with just one variable to the R² of the equation with all other variables produces a larger R² than we know all five produce, is vocational education. Adding its R² alone to the R² without it falls short of the R² we know we get from five factors. Vocational education displays a specification effect. Its presence in the equation increases the contribution to R² made by other variables. Bolon should have defined and utilized a "voc" variable regardless of its contribution to R².

Interpretation

We might infer that it is the students with limited English proficiency themselves who are causing a decline in the average, through their low scores. That is an inference Bolon is careful
not to make. There is such a thing as the "ecological fallacy," or the "Simpson paradox," in which group data lead to incorrect inferences about individuals. But fear of that mistake should not deter someone from making an informed judgment. Studies in the 1960s and 1970s showed that states with more blacks had lower income, and states with higher education levels had higher income. It was not wrong to infer that blacks earned less than whites, and that higher educated people earned more than lower educated people, even though it was wrong to infer (at that time, though many did) that if black education levels rose they would earn considerably higher incomes. I think Bolon's data is quite sufficient to ask if the MCAS test is fair to persons with limited ability in English (and, presumably, Spanish, as tests are available in that language).

The same is true with vocational students. There could be many reasons why schools with vocational programs score lower than strictly academic schools, without the vocational students being the direct cause. But, armed with his other study of strictly vocational schools, Bolon surely can infer that it is the vocational students themselves who are scoring lower. This raises the question: What purpose is served by vocational students taking this academic test? Bolon's study does not answer this question, because it is not a validity study. But this is an example of the kind of useful inquiry Bolon could have provided.

We can conclude that two student descriptive variables appear important, an interesting if expected finding from aggregate data of this sort: limited English language ability and a vocational curriculum. In addition, we can conclude that creaming does matter, in the sense that concentrating all the best students in three schools concentrates the worst students in the remaining schools available to them.
Ultimately, we can interpret these data as telling us nothing about schools, but much about students.

It was conventional wisdom in 1955, when I graduated from Brookline High, that aside from Boston Latin, one could find the best greater Boston area students in Brookline and Newton, and maybe Lexington, largely because their parents had moved to these towns for that reason. Ten years later research declared what we knew to be true, though it was "controversial" in professional ranks: high scoring schools were so because they had high scoring students. And high scoring students had two characteristics: By and large they came from wealthier households, and they associated with each other in wealthy towns. Bolon's finding to this effect is surely correct, and welcome, but not new information, and hardly impugning the MCAS test on which it is based.

Residuals

Bolon estimates his one-factor model in three years and concludes "that schools scoring higher than predicted tend to increase scores in successive years, and schools scoring lower than predicted tend to decrease scores." This is not a correct interpretation from his own data and results. Residuals do not show the highest scoring schools scoring even higher from year to year, or the lowest scoring schools scoring even lower. The correlation of residuals means that they are in the same place every year. Furthermore, 1998 scores do not predict 1999 scores for the lowest scoring schools. The lower scoring schools have randomly higher or lower scores the next year. Although 1998 scores appear to predict higher 1999 scores for the highest scoring schools, that result is due to Boston Latin only. Except for Boston Latin, the higher scoring schools in 1998 have somewhat lower scores in 1999.

Figure 2 is a plot of residuals in 1999 against residuals in 1998, from my 5-factor model:
Figure 2

As I did also in Figure 1, I have named some schools, the principle being to inform but avoid clutter. In general, schools line up the same way every year. The residuals from 1998, for the top fifteen schools, predict the residuals from those schools in 1999 with a very small constant and a coefficient of .995. The residuals from the bottom fifteen schools in 1998 predict their 1999 residuals with a zero constant and a coefficient of 1.055. That is, we can predict the residual for the top and bottom schools almost precisely from the prior year data. The same does not hold true in the middle. Overall the residuals are correlated as we would expect them to be:

Table 8

More importantly, raw scores are also correlated across time, as indicated above in noting that the Boston non-exam schools always score at the bottom. Figure 3 is a picture of the ranks of the scores, from lowest to highest, in 1999 and 1998:

Figure 3
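One conventional way to summarize a rank plot of this kind is a Spearman rank correlation between the two years. The sketch below uses invented scores for six unnamed schools purely to show the calculation. It is not Bolon's data, and it assumes no tied scores.

```python
def ranks(values):
    """Rank values from lowest (1) to highest (n); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for untied data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical school averages: top and bottom hold their places,
# while two schools in the middle swap ranks.
scores_1998 = [203, 207, 221, 225, 230, 241]
scores_1999 = [205, 206, 226, 222, 233, 244]

rho = spearman(scores_1998, scores_1999)
print(ranks(scores_1998), ranks(scores_1999), round(rho, 3))
```

In this toy example the extremes keep their ranks and only the middle moves, so the correlation stays high even though individual middle schools change position.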
The best schools--which is to say the best students--are consistently best, and the worst are consistently worst, and in the middle there is some moving around, but the middle schools seldom appear best or worst. This is startlingly true, and largely unrecognized, of most reliability measures. A test with a high reliability can be very poor in the middle, where it is used to hire or not, to pass or to fail. It is likely that this same variability applies to individual students, and that reliability figures offered in support of the MCAS test do not describe the unreliability at exactly the pass-fail scores where it is most important to be correct.

The Town View

As noted above, another solution to Bolon's dilemma that some variables described towns, not schools, would have been to aggregate all data by town. The assumption that would allow one to do this is that there is no school effect, for example that Boston Latin students score highest because they were pre-selected, not because they were better educated. Nothing in these data argues against that view.

Bolon chose a different route:

    My summary analysis is based on a one-factor model for metropolitan Boston communities that each operate only a single academic high school, weighted by numbers of students tested. The effects study showed that "Per-capita community income (1989)" was the dominant factor in predicting 1999 school-averaged tenth-grade MCAS mathematics test scores. All other factors made only small contributions to predictions with much lower significance.

That is, Bolon's final comments are based on 60 percent of his initial data, 28 observations. It is hard to take Bolon's conclusions as seriously generalizing the results of a study of 47 schools
from 32 towns. I will suggest below that only one town--not one selected by Bolon--might have been dropped to increase our understanding of existing relationships.

One of the interesting results one can obtain from analyzing towns is that property value (measured per capita in Bolon's data) appears to be a function of both income and educational expenditure, but not MCAS test score. This finding sort of follows conventional wisdom, that people pay a premium for property where there is a higher focus on education, though usually we expect a more direct relationship with test scores.

Table 9

Following Bolon's main concern, the best model from town data is quite familiar. I show its coefficients in Table 9. It is my five factor model less "exam" and "bosnoex," which have been aggregated into a single Boston observation.

I contend that the highest scoring students are exactly who we would expect them to be, those from highest income places, modified by those student characteristics that we expect to generate lower scores, such as vocational curriculum and limited ability in English. Boston fits right in. The most "influential" observation, by far, is Marblehead. Without that observation, the adjusted R² = .9221, the coefficient for perlimit is -2.73 (p = .011), and the coefficient for percapy is 1.247 (p = .000). The coefficient for voc is close to that in Table 9. Thus the general effect of limited English ability (as well as income) on MCAS score is quite a bit larger than Bolon suggests, or that my model on all towns suggests. If we are going to delete data, let us do so to improve the generalizability of the results. When we do that, we emphasize the importance of the points I have made (that "voc" and "perlimit" need to be in any model explaining test score in these data).

Final Remarks

There is nothing wrong with analyzing a set of data and concluding "Nothing new here, looks like these relationships usually do."
Unfortunately, such findings are seldom published. So we systematically lose confirming information, or broadening information (that relationships found elsewhere apply here) that Bolon could have provided. There is something wrong with saying one has analyzed data and found that "community income swamped the influence of the other social and school factors examined" when that conclusion is drawn from failing to utilize the information at hand, and dropping two-fifths of the observations. Bolon concludes: "Large uncertainties in residuals of school-averaged scores, after subtracting predictions based on community income, tend to make the scores ineffective for rating performance of schools." What are these "uncertainties?" Why should the state care about the residual from Bolon's model? That his model produces residuals that do not correlate perfectly from year to year implies nothing about the effectiveness of rating schools that, as shown here, do pretty much line up the same every time we look.
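The distinction drawn above, between school ratings that line up the same way each year and model residuals that do not, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic (they are not Bolon's data or MCAS results), the income and noise parameters are arbitrary, the helper names pearson, ranks and ols_residuals are hypothetical, and the regression is unweighted, unlike Bolon's student-weighted model.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Rank positions (0 = lowest); ties are not handled specially."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result


def ols_residuals(x, y):
    """Residuals from an unweighted one-factor least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Synthetic "schools": scores depend on income plus independent yearly noise.
rng = random.Random(0)
n_schools = 47  # same count as in the text; the values themselves are made up
income = [rng.gauss(20.0, 6.0) for _ in range(n_schools)]
score_y1 = [200.0 + 1.2 * inc + rng.gauss(0.0, 2.0) for inc in income]
score_y2 = [200.0 + 1.2 * inc + rng.gauss(0.0, 2.0) for inc in income]

# How similarly do the schools line up in the two years?
rank_corr = pearson(ranks(score_y1), ranks(score_y2))

# How similar are the residuals after removing the income prediction?
resid_corr = pearson(
    ols_residuals(income, score_y1), ols_residuals(income, score_y2)
)

print("year-to-year rank correlation of scores:", round(rank_corr, 3))
print("year-to-year correlation of residuals: ", round(resid_corr, 3))
```

Under these assumptions the rank correlation of scores should be high while the residual correlation should sit near zero, because the residuals are pure noise by construction. The sketch does not show which situation holds for the MCAS data; it shows only that unstable residuals and stable ratings are compatible.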
He goes on: "Large uncertainties in year-to-year score changes tend to make the score changes ineffective for measuring performance trends." Does he mean "large variation?" Variation may not be uncertainty. His attempt to define "large" from published student reliability data is innovative and bold but, though I have not dwelled on it here, not convincing in a study of school averages. Wouldn't "trends" be measured by consistency of direction? Unless Bolon can tell us that there are trends which these data fail to pick up, what he is saying is that "trend" studies should account for random variation, that to establish a "trend" means to exceed variation from changes in the students and tests and random factors from year to year. Well yes, of course. Is he saying no school could meet this criterion? Quite the contrary, he says that too many do. Where in all other respects, when Bolon finds a result inexplicable by chance, he accepts it as "significant," in his "trend" study he rejects the results just because they appear to be inexplicable by chance.

When he concludes that "tenth-grade MCAS mathematics tests mainly appear to provide a complex and expensive way to estimate community income," Bolon is ignoring the fact that MCAS tests are designed to measure individual achievement. If they do that job, then Bolon will find that individuals who do well traditionally are found with others who do well. His finding of a correlation of score with income does not argue that the test fails to measure understanding of mathematics. Does Bolon not believe that more of the "best" academic students, in general, attend Boston Latin, or Brookline, Newton, Lexington and a few other elite schools, than other schools? Does he not believe that the three Boston exam schools "cream" the best students from other schools, leaving these other schools the lowest scoring in the area?
What is wrong with finding that high scores in attributes that produce wealth are correlated with the wealth they produce? This is one reason my parents moved to Brookline, and surely one reason Bolon lives there, too. Why does Bolon not want to emphasize the more important findings that students who have difficulty with English, or with an academic environment, do not do well on this academic test? I do not know why the Massachusetts Department of Education wants to impose this test on all students, or why it wants to deny a high school graduation to those who fail it. However, nothing in Bolon's article argues against them. Bolon is free to disagree with the state (as would I), but saying that there is a high correlation between test score and community income, not news in the twenty-first century, does not support his position.

Notes

Bolon would disagree. He questions their reliability in Tables 2-1 and 2-2 based on year to year changes in school averages. I discuss reliability to some extent below. I agree that Bolon raises good questions, but I disagree that Bolon can criticize test reliability, as understood by test makers, in this manner.

Coleman, James S., E.Q. Campbell, C.J. Hobson, J. McPartland, A.M. Mood, F.D. Weinfeld, and R.L. York, Equality of Educational Opportunity (Washington, DC: U.S. Department of Health, Education and Welfare, 1966).

I see no rationale for utilizing land value as an independent variable predicting test scores. By the middle 1970s, urban planners had determined that property values reflect school scores. People want higher scoring friends for their children. No one has shown the research that says people act this way, or their behavior, to be wrong. Bolon has the direction of causality reversed, even as a hypothesis.
Bolon, last paragraph of Section 1, Part B.

The probabilities are from the t distribution applied to the difference in scores divided by the joint standard deviation (the square root of the sum of the variances). However, the Boston-other comparison is two-tail, because we have no hypothesis about which schools would score higher, whereas the exam-not exam comparison is one-tail, where a prior hypothesis about the direction of the difference is clear.

Third paragraph of Section 1, Part B. He also says "I choose to include such schools in these studies while noting their special character." In no part of his analysis is their "special character" noted, although it could have been.

First quote from Bolon p. 4, second from p. 6.

I was plaintiffs' statistical expert in the "Comparable Worth" litigation in the state of Washington in 1983. I weighted the salary structure of jobs by the number of employees per job, because salaries in larger jobs were in fact set more carefully than those in smaller jobs. The larger jobs contained more information about whether gender appeared to be a factor in determining salary, so weighting was appropriate.

As Bolon says, "the data generally available for test score research fail to capture much of the critical information needed to understand development of cognitive abilities and educational achievement in the settings of public schools." Section 1, Part D, first paragraph.

Craig Bolon kindly sent his data to me. I did not examine each item, but I did assure that the correlation between his data and mine, for each variable, was 1.0000. It is conceivable, but unsettling if true, that his program, Statistica, produces different results from mine, Stata.

Bolon, second paragraph after Table 2-13.

Bolon equates R² with "accuracy." "In an attempt to improve accuracy of the model in Table 2-14, schools with residuals from the two-factor model for 1999 that were greater than two standard deviations were dropped." (at Table 2-18). Influence, not residual size, can be used to delete variables to increase the generalizability of results. See below. All that is accomplished by deleting large residuals is configuring the data to support the model and report higher than real R². For example: "Community income has been found strongly correlated with tenth-grade MCAS mathematics test scores and associated more than 80 percent of the variance in school-averaged 1999 scores for a sample of Boston-area communities." (Bolon Section 3, Part B, "Conclusions.") From Table 7, below, one can see that he would have had to report "associated more than 47 percent of the variance ..." from his original sample.

When my 5-factor model is run on 2000 data, all variables have probability .006 or less. On 1998 data, the "voc" variable has a probability .064, perlimit .041, and all other variables .002 or less. Would Bolon delete "voc" from only the 1998 model for its "insignificance?" I hope not.

Bolon, just before Section 2, Part C, "Observations."
Section 2, part D, opening sentences.

About the Author

Stephan Michelson

Stephan Michelson is co-founder and President of Longbranch Research Associates (LRA). Formed in Washington, DC in 1979, now with offices in Maryland, North Carolina and Oregon, LRA provides statistical analysis in litigation. Dr Michelson has been on the faculties of Reed College and the Harvard Graduate School of Education, and had minor associations with Stanford University (summer) and University of Maryland (Adjunct Professor). He has been on the staffs of The Brookings Institution, Center for Law and Education, Center for Community Economic Development and The Urban Institute (Senior Fellow).

Copyright 2002 by the Education Policy Analysis Archives

The World Wide Web address for the Education Policy Analysis Archives is epaa.asu.edu

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, firstname.lastname@example.org or reach him at College of Education, Arizona State University, Tempe, AZ 85287-2411. The Commentary Editor is Casey D. Cobb: email@example.com.