Educational policy analysis archives


Material Information

Educational policy analysis archives
Physical Description:
Arizona State University
University of South Florida
Arizona State University
University of South Florida.
Place of Publication:
Tempe, Ariz
Tampa, Fla
Publication Date:


Subjects / Keywords:
Education -- Research -- Periodicals   ( lcsh )
non-fiction   ( marcgt )
serial   ( sobekcm )

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E11-00053
usfldc handle - e11.53
System ID:

Education Policy Analysis Archives

Volume 4 Number 4    March 15, 1996    ISSN 1068-2341

A peer-reviewed scholarly electronic journal. Editor: Gene V Glass, Glass@ASU.EDU. College of Education, Arizona State University, Tempe AZ 85287-2411

Copyright 1996, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article provided that EDUCATION POLICY ANALYSIS ARCHIVES is credited and copies are not sold.

Standard Errors in Educational Assessment: A Policy Analysis Perspective

Greg Camilli
Rutgers

Abstract: In many educational settings, educational gains are measured and evaluated rather than absolute levels of achievement. Gains might be estimated for individual students, teachers, schools, districts, and so forth. In some educational programs, schools are required to make "statistically significant" progress over the course of one school year. This would typically require an estimate of the standard error (SE for short) of the gain, which is a number representing the precision of the gain, similar to the "margin of error" in polls. Because SEs can be used to define educational targets, it is important to understand precisely what a standard error is, and this requires going beyond the simple textbook definition. Statistical methods are tools for understanding social processes, but there is no necessary connection between a statistical method and an empirical outcome. A policy analyst must ask how closely features of the statistical theory correspond to aspects of the measured outcomes for a given purpose. For example, how much does it matter if the assumption of random sampling is violated in certain ways? Can one assume that the children or educators at a particular school during a given year constitute a random sample of some population that is perhaps spread across time, space, as well as cultural and institutional dimensions?

INTRODUCTION

In many educational settings, educational attainment is measured and evaluated.
The units on which these measurements are taken might be students, but also teachers or school districts. The TVAAS program (Tennessee Value-Added Assessment System) is a statewide assessment
program that incorporates features such as these. Specifically, it is a "gain-oriented" statistical tool for collecting and analyzing student achievement test score data; that is, gains are the focus rather than absolute levels of achievement. The information provided by the statistical model is used in the evaluation of teachers, schools, districts, and the like.

It is not the purpose of this document to evaluate the TVAAS program itself. However, during discussion of this program on EDPOLYAN (Educational Policy Analysis Forum, an electronic forum in which discussion is conducted on the Internet), the issue of standard errors (or SEs) arose (SEs are computed with the TVAAS multi-level model). In particular, some schools are required to make "statistically significant" progress, which entails a 1.5 to 2.0 SE gain over the course of one school year. Because SEs are used to define educational targets, it is important to understand precisely what a standard error is, and this requires going beyond the simple textbook definition.

In most introductory statistics courses, the terms "population," "random sample," and "sampling distribution" are taught. No two samples give the same result, e.g., the average height of a sample will always differ to a greater or lesser extent across random samples. This is why polls always append the "margin of error" to reported percentages. For example, the polls report that 51% of respondents would have voted for Bill Clinton with a margin of error of plus or minus 3%. [The margin of error is the product of two components: the standard error (SE) and the critical value (CV). The former represents how much a result may vary from sample to sample, while the latter is used in conjunction with the former to place a band of confidence around the obtained result. For the given example, 3% = SE*CV, and the band of confidence extends from 51%-3% (48%) to 51%+3% (54%).
It is common to interpret the latter by saying that even though there might be variation from sample to sample, 95 out of 100 samples would give a result between 48% and 54%.]

Why should the standard error (SE) serve as a standard against which gains are evaluated? This question must be answered at both a technical and a policy level:

Technical. If there were 10 people in a room and you wanted to know their average income, you could ask all 10 people and calculate the mean. But now suppose that a town has 500,000 people in it. Obviously the same strategy is not going to work. Rather, you must economize by sampling a representative cross-section and calculating the mean of this group. This method doesn't guarantee an accurate result all the time, but it does well most of the time, especially with larger samples. Thus a sample result is not an exact answer to one's overriding question. From a statistical point of view, numbers are fuzzy rather than precise creatures, and a statistician's concern is to keep the amount of fuzziness to a minimum. By metaphor, an exact number is a pin prick (or puncture) whereas a fuzzy number is a bruise. Two pin pricks would be easy to discriminate visually, but if two bruises were large and overlapping, they might be difficult to distinguish. Now imagine the radius of a bruise as the standard error: as the radius decreases, it becomes clear whether there is one bruise or whether there are two. And as the radius decreases to near zero, the bruise becomes a pin prick. In short, the standard error is the statistician's criterion for the separability of two numbers, and two numbers are conventionally thought of as separable if they are at least 1.5 or 2 SEs apart. This is equivalent to requiring that the confidence bands around two numbers not overlap.

Policy. Statistical methods are tools for understanding social processes.
There is no necessary connection between a statistical process and an empirical outcome, so policy
analysts must ask how closely features of the statistical theory correspond to aspects of the measured outcomes of an educational program. An important part of this analysis concerns how well sentences typical of the statistical theory support actions based on the separability criterion of 1.5 or 2.0 or some other number of SEs.

For example, a typical sentence of classical statistical inference might read "A random sample is taken from a population." To which the analyst might respond "What is the population?" Furthermore, if the population can't be defined, one might conclude that it is not possible to determine whether the sample was indeed random. Thus, the language of the statistical theory might not satisfactorily explain the SE criterion, in which case more analysis is necessary to arrive at a pragmatic understanding.

These issues are explored in the following discussion among members of EDPOLYAN. The discussants are, in alphabetical order, Greg Camilli, Sherman Dorn, Gene Glass, Harvey Goldstein, Bill Hunter and Leslie McLean. Passages have been edited to focus on the issue of standard errors. The original postings contained more ancillary issues as well as parenthetical comments. However, the participants have reviewed the following text for accuracy and completeness. In addition, further summary comments were provided by Harvey, and these are given at the end of the discussion section. Original messages were posted in late December, 1994, through January, 1995.

EDPOLYAN DISCUSSION

Goldstein: I have come in on what I gather is the tail-end of a discussion of missing data in the analysis of TVAAS system data to produce estimates of school effects. Apologies therefore if the issue has been discussed already, and also because I'm from a different educational system, but one where we have had quite a lot of debate about value added analysis using longitudinal data. ...
in the UK the value added debate has been looking at problems with the sampling errors (standard errors) of value added gain. It turns out that these are typically so large that you cannot make any statistically significant comparisons between most of your schools...only those at opposite extremes of a ranking. Is this also the case in Tennessee? If so, what do you do about it when reporting?

Camilli: I've been wondering what the standard errors mean. Usually, I have in mind that a sample is drawn from a population, and an effect (say gain score) is estimated from the sample data. The standard error then conveys how precise this estimate is (much like the "margin of error" that pollsters use). For TVAAS, what are the sample and population?

McLean: The purpose of this post is to focus on two aspects of the TVAAS that I feel have received too little attention: validity and standard errors. This is not to say that the political nature of any evaluation is not important or to take anything away from the discussion of formative vs. summative evaluation.

Harvey Goldstein stated on Dec. 19, 1994, "... turns out that these [standard errors] are typically so large that you cannot make any statistically significant comparisons between most of your schools...only those at opposite extremes of a ranking. Is this also the case in Tennessee? If so, what do you do about it when reporting?"

Below are listed the mean gains for math with their standard errors for schools within one of the larger school systems in Tennessee. These means are three year averages and were calculated from the TVAAS mixed model process. This should give an idea of the sensitivity of the process.


TYPE OF SCHOOL   GRADE   RANK   MEAN GAIN   STD. ERR
INTERMEDIATE       3       1       71.6       4.9
                   3       2       71.2       3.7
                   3       3       67.0       2.6
MIDDLE             6       1       22.6       1.0
                   6       2       20.1       0.8
                   6       3       15.6       1.0
                   6       4       13.9       1.0
                   6       5       13.1       1.2
                   6       6       12.5       1.0
                   6       7       11.1       0.9
                   6       8        9.8       1.1
                   6       9        9.3       1.3
                   6      10        8.4       0.9
                 ......................
                   8      12       13.5       6.6
                   8      13       13.2       1.7
                   8      14       11.0       0.9
                 ......................

The problem for those of us who have calculated, pondered and puzzled over such results as these, in national and international assessments, is that the reported standard errors are unbelievable (impossibly small). We can't say they are wrong, of course, because we lack the details of the calculations, but Harvey Goldstein has analyzed at least as much data and written several books and taken the lead in multilevel modeling (sometimes called, by others, hierarchical linear modeling), and his informed and experienced "opinion" is not to be taken lightly. The standard errors remind me of those Richard Wolfe found faulty in the first International Assessment of Educational Progress--the fault being that the estimates of error failed to include all the components reasonable people agree should be included. Moreover, the Std. Errors above are clearly proportionate to the mean scores, not a desirable outcome. There must be at least one error (three lines from the bottom of those displayed above). I, too, will leave to later a comment on the statement below from TVAAS, except to say that whatever it is they do is not "certainly sufficient":

Camilli: I thought that some of you might want to take a look at some statistics regarding the metric of the scores that TVAAS uses. Below, I've given the mean, median and standard deviation of the IRT metric for fall reading comprehension as reported in the CTBS/4 Technical Bulletin 1 (1989). (I hope this isn't too far out of date.)
Grade   mean   median   STD
  1     473     481     84.3
  2     593     606     81.1
  3     652     657     59.6
  4     685     694     53.6
  5     707     714     48.6
  6     725     730     43.8
  7     733     738     43.6
  8     745     750     43.1
  9     760     764     38.6
 10     770     774     39.6
 11     776     780     38.2
 12     780     782     38.0
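
One way to read the table above without plotting it is to look at the step in mean scale score from each grade to the next. The short Python sketch below does only that, using the numbers exactly as tabled; the variable and function names are illustrative, and the steps are differences between cross-sectional norm groups, which is only a rough stand-in for the growth of actual students.

```python
# Fall reading comprehension norms copied from the table above
# (CTBS/4 Technical Bulletin 1, 1989). Names are illustrative only.
grades = list(range(1, 13))
means = [473, 593, 652, 685, 707, 725, 733, 745, 760, 770, 776, 780]
medians = [481, 606, 657, 694, 714, 730, 738, 750, 764, 774, 780, 782]
sds = [84.3, 81.1, 59.6, 53.6, 48.6, 43.8, 43.6, 43.1, 38.6, 39.6, 38.2, 38.0]


def adjacent_differences(values):
    """Difference between each value and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Step in the mean from grade g to grade g + 1, in scale points.
mean_steps = adjacent_differences(means)

# The same step expressed in units of the lower grade's standard deviation.
steps_in_sd = [step / sd for step, sd in zip(mean_steps, sds)]

# True if the median is above the mean in every grade (scores skewed low).
median_above_mean = all(md > m for md, m in zip(medians, means))

for grade, step, rel in zip(grades, mean_steps, steps_in_sd):
    print(f"grade {grade:2d} -> {grade + 1:2d}: {step:4d} points ({rel:.2f} SD)")
```

The step between adjacent grade means falls from 120 scale points (grade 1 to 2) to 4 (grade 11 to 12), while the standard deviation shrinks from 84.3 to 38.0, so both the size of a typical "gain" and the spread it is judged against depend heavily on where in the scale a student sits.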


If you plot these data by grade, some interesting possibilities emerge. For example, one wonders why students below average gain as much as students above average. The explanation I see is that there is much less room for growth at higher grade levels, but this is a function of the scoring metric. A transformation of scale might lead to different results.

Goldstein: I see that Les McLean and one or two others have taken up my query about how well schools can be separated taking the estimated standard errors into account. I don't yet know how the standard errors have been calculated, but based upon a table Sandra Horn xxx sent me, I would say that the results (e.g. for grade 3, based upon a 3 year average in one of the larger school systems) are in line with our own results. What you do (roughly) is multiply each standard error by about 1.5, use this to place an interval (i.e. +-1.5 s.e.'s) about each gain estimate, and judge whether two schools are significantly different at the 5% level by whether or not the intervals overlap. Most intervals do! BUT if you average over 3 years then you get smaller standard errors, so fewer do.

McLean: My observation that the gains were proportional to the standard errors does NOT seem to be true--within grades. If you lump all grades together, the correlation is over 0.5, but within grades (the correct plot, IMHO) it is essentially zero. Grade six shows a substantial NEGATIVE correlation, but there are only 12 observations.

What are these standard errors anyway? In a separate post to me, Greg Camilli points out that if all students are tested, then the "sampling error" has to be zero. What we need to make sense of this is, as I have said already, a technical report. How are they (TVAAS) modelling the error in their multilevel models? What explanatory variables do they use? Do they include covariance terms? Is the "standard error" an estimate of measurement error?
Just how much data are missing?

Goldstein: Re Les McLean's message about standard errors. He quotes Camilli as stating that the standard errors given are 'sampling errors' and that if all students are tested then these are zero. I am confused! The usual standard errors quoted in this context are those relating to the accuracy of the estimated school effects, where there is a conceptually infinite population of students of whom those measured (whether they are all those in the school at a particular time or not) are a random sample. If they are not this, then what are they?

Hunter: [In response to Goldstein's last question] I cannot say what they are precisely, but I am quite confident that those in any particular school at any particular time are NOT a random sample.

Camilli: Harvey Goldstein is wondering what I mean about standard errors. TVAAS probably doesn't test random samples of students, or samples at all, given what I know about the program. Given that it's not a random sample, one could always imagine that this were the case anyway (a counterfactual): imagination is required by all theories of statistical inference. However, without some sensible restrictions any set of numbers whatsoever can become a "random sample from a population." And once the "population" is in place, it can be of any size at least as great as the "sample." Now if it is imagined as infinite, then we can go on to imagine our "sample" as one realization of an infinite number of possible samples. Thus, we arrive at an estimate of "sampling error" called the "standard error."

Here's a fictitious dialogue between an Educator and a Statistician regarding this point:

E: What is this standard error associated with a gain score?


S: It's like this. Suppose you have a conceptually infinite population of 8th graders, and from this population you took an infinite number of random samples and computed the gain for each sample. You'd want these to all give about the same result; it's like witnesses at a trial corroborating a claim made by the defendant. Small standard errors are analogous to a high degree of corroboration, while high SEs indicate a lot of uncertainty.

E: What if the tested group isn't a sample?

S: Then you just imagine it's a sample.

E: OK, I'll imagine it's a sample, but what's the population?

S: Just imagine that the population is pretty much like the sample.

E: So if I get a small SE, then I can be confident in the gain score because most of the samples from the population will have similar gains, because the population is pretty much like my sample?

S: Basically.

E: What if the standard error is large? Does this mean that I shouldn't be confident because most of the samples from the population that is similar to my sample will give substantially different estimates of gain?

S: Well, you've got the basic idea.

E: Okay, but just one more question. If the SE is large, doesn't it mean that the population isn't similar to my sample? If so, how can I imagine an infinite number of samples from that population?

S: Look, SEs are really theoretical quantities. They're things that are defined by equations, and the equations can be explained in different ways. Population/sample is the easiest way, but don't get bogged down. Most statisticians agree that they are useful and that small ones are better than large ones.

E: OK, but just one more question. How small do they have to be to be good?

S: That depends on your question. Suppose you want to test whether one teacher's gain is larger than another's.
If the difference is on the order of 1.5 or 2.0 SEs, then you can have confidence in it.

E: Why should I have confidence?

S: Because the difference is large relative to the sampling error ... er, I mean, standard error.

E: I see. Well, I have to go now. By the way, could you write something up that I could give to parents that explains this? Thanks.

Goldstein: Well, I enjoyed Greg Camilli's imaginary conversation, but of course the reality is that standard errors are not things statisticians invented to make life difficult. Most non-statisticians have little difficulty in understanding that if you only have a measurement on 1 student there ain't
much to be said about the rest. The bigger the sample the more confident you become that what you have observed is a good guide to what you would get on repeated samples with also suitably large numbers...assuming of course that you adopted a sensible randomly based sampling strategy.

Now we come to the philosophical bit. Social statisticians are pretty much forced to adopt the notion of a 'superpopulation' when attempting to generalise the results of an analysis. If you want to be strict about things, then the relationship you discovered between parental education and student achievement back in 1992 from a sample of 50 elementary schools in Florida can only give you information about the physically real population of Florida schools in 1992. Usually we are not interested just in such history, but in rather more general statements that pertain to schools now and in the future...we may be wrong of course, and that is why we strive to replicate over time and place etc. BUT the point is that, getting back to value added estimates for a school, if we want to make a general statement about an institution we do have to make some kind of superpopulation assumption....what we happen to observe for the students we have studied is a reflection of what the school has done, and would have done, for a bunch of students, given their measured characteristics such as initial achievement. The more students we measure the more accurate we can be, and that's why we need an estimate of uncertainty (standard error).

Glass: Greg answered with a hypothetical conversation between an educator and a statistician.
I think Greg exposed some key problems with this notion of standard errors, and it is no more a problem with TVAAS than it is a problem with most applications of inferential statistics in education.

Harvey asks, in effect, what is wrong with regarding standard errors as being measures of the accuracy of samples as representations of "conceptually infinite populations" from which the samples might "conceivably have been drawn at random."

After more than thirty years of calculating, deriving, explaining and publishing "standard errors" and their ilk, I have come to the conclusion that I don't know what they mean, and I doubt seriously that they mean anything like what they are portrayed as meaning.

Consider this: if the population to which inference is made is one that is conceptually like the sample, then the population is just the sample writ large and the "standard error" is much larger than it ought to be. If you show me 25 adolescent, largely Anglo-Saxon boys who love sports and ask me the population from which they could conceivably have been sampled, I'll conceive of an "infinite" population of such boys. If no population has actually been sampled and all I know about the situation before me is the sample, then I will conceive of a population like the sample. This is surely the very opposite of inference, and standard errors are surely beside the point.

Consider something even more troubling: I present you with a sample: Florida, Alabama, Tennessee, South Carolina. N=4. I calculate the state high school graduation rates, average them and calculate a standard error. What is the population? States in the Southern U.S.? Fine; that's certainly conceivable, even if not "infinite." But suppose that someone else conceives of "States in the U.S." Well, that's conceivable too. But it is surely ridiculous to think that these four states can be used to infer to both of these conceivable populations with equal accuracy (standard errors).
Or to make matters worse, suppose that I suddenly produce a fifth "state": Alberta. Now it raises the question whether the conceptual population is "geo-political units in North America" or the entire Western Hemisphere.

I can't imagine that there is much wisdom in attaching a number accurate to two decimal places
when we can't even be certain whether it is referring to an "inference" to the Southern U.S., North America or the Western Hemisphere. Now, if you think I am playing with your head and will suggest a way out of this dilemma that rescues the business of statistical inference for us, let me assure you that I have no solution. In spite of the fact that I have written stat texts and made money off of this stuff for some 25 years, I can't see any salvation for 90% of what we do in inferential stats. If there is no ACTUAL probabilistic sampling (or randomization) of units from a defined population, then I can't see that standard errors (or t-tests or F-tests or any of the rest) make any sense.

Does any of this apply to TVAAS? Just this. If one is worried about "stability" (in any of the many senses in which the word could be interpreted), then why not simply compare teachers' scores across all years for which data are available? That would answer in very straightforward ways whether the ranking of teachers jumps around wildly for whatever reasons or is relatively steady. (I hasten to add that I don't approve of such things as ranking teachers with respect to their students' test scores.)

Goldstein: Gene Glass also takes me to task on standard errors and raises the interesting question of when a sample should be considered as having a reference population and when not. There is no general answer; it depends on what you want to do. As I said in my response to Greg, I cannot easily see how you can have empirical social science without assuming that the units (people, schools etc) you happen to have measured are representative (in the usual statistical sense) of a (yes) hypothetical population whose members exhibit relationships you want to estimate. Such populations must (I think) be hypothetical because they have to embrace the present and future as well as the past when the data were collected.
The issue is therefore the general philosophical issue and not a statistical one; statisticians simply try to provide tools for making inferences about such populations.

Camilli: Harvey replied to my previous post with "The bigger the sample the more confident you become that what you have observed is a good guide to what you would get on repeated samples with also suitably large numbers...assuming of course that you adopted a sensible randomly based sampling strategy."

Bigger is better, I agree. Another issue is whether it is the correct standard error, and still another is whether the SE has a meaningful referent. If the sample consists of all kids in the system, how can imagining a larger group possibly create more information? If I want to understand the behavior of my three cars (I wish), how would it benefit me to imagine I had a fourth?

This is not a statistical issue at all. "Population" has always been a heuristic device. Generalizing beyond known populations is risky business, and requires more than statistical knowledge. This was the focus of the long and interesting dialogue between Lee Cronbach and Don Campbell. Standard errors have something to do with the precision of estimates. Perhaps they convey something about how well a model fits certain data. You might want to argue, on this basis, that the model is likely (or not) to generalize; but model fit at one instant does not logically imply model fit one second later. This, I think, is the difference between induction and inference.

The standard errors will apparently be used to measure whether statistically significant progress is being made by schools that fail to meet the standard (whatever that turns out to be), so it is important to be clear about what SEs mean. I find it fascinating that they are being used as policy tools with legal implications. In this regard, it is important to understand what drives the SEs.
I'm guessing that missing data will add to SEs (it really would be helpful if the TVAAS staff would
respond), and am sure that unit size will decrease SEs. Thus, standard errors will typically be smaller for districts than for schools, than for teachers, than for students. As far as I can tell, only certain districts are required to make statistically significant progress; this may turn out to be a pretty easy criterion to satisfy.

Goldstein: When you try to enshrine complex technicalities in the law you certainly ask for trouble, especially, as would appear here, when those drafting the law have a rather meagre understanding of the technicalities. My interpretation of §49-1-601 is that it requires (say from one year to the next) that the difference in value added scores for a school between two years is statistically significantly different from zero (at 5%?). If each year's scores are on the same metric, then this question can certainly be asked and one can even think of a suitable interpretation. The problem arises if we require this to be the case for all those schools below the mean (note that the legislation does not say STATISTICALLY SIGNIFICANTLY BELOW the mean). If the schools are successful then the mean for all schools inevitably goes up!! And it isn't difficult to envisage a scenario where every school makes a real (even statistically significant) gain, leaving the ranking of all schools the same! This raises the issue of the measurements used. Are these standardised each year on the Tennessee population? If so, then not only is the ranking the same, so are the actual scores! All this needs some careful unpicking, I would have thought, and raises very serious issues for the interpretation of TVAAS.

McLean: The discussion of standard errors has gotten so involved that a look at the Tennessee legislation should tell us where standard errors are needed and what interpretations reasonable people ought to be able to put on them.
Below, the text from Sherman Dorn's post [who is quoting and paraphrasing from TVAAS statutes] and Les McLean's responses are indicated by "-->".

--> Dorn: The goal is for all school districts to have mean gain for each measurable academic subject within each grade greater than or equal to the gain of the national norms.

--> McLean: How will anyone decide whether the mean gain is greater than or equal to the gain of the national norms? "Publication of standard errors" must mean that an error bound will be established around the national norms--perhaps 1.5 times the median std. error per grade--one "harvey", or 2.0 std. errors--one "dorn".

--> Dorn: If school districts do not have mean rates of gain equal to or greater than the national norms based upon the TCAP tests (or tests which measure academic performance which are deemed appropriate), each school district is expected to make statistically significant progress toward that goal.

--> McLean: OK, gang, the veil is lifted from our eyes--there is no such thing as "statistically significant progress" without standard errors and the assumption of samples from some population.

--> Dorn: schools or school districts which do not achieve the required rate of progress may be placed on probation as provided in §49-1-602. If national norms are not available then the levels of expected gain will be set upon the recommendation of the commissioner with the approval of the state board.

--> McLean: Yo, commish! I do not envy you your task.

--> Dorn: value added assessment means: (1) a statistical system for educational
outcome assessment which uses measures of student learning to enable the estimation of teacher, school, and school district statistical distributions; and (2) the statistical system will use available and appropriate data as input to account for differences in prior student attainment, such that the impact which the teacher, school and school district have on the educational progress of students may be estimated on a student attainment constant basis.

--> McLean: I could write a rationale for a "statistical system" that did not need standard errors, given that they test all the students. It would contain careful, modern descriptive statistics that would gladden John Tukey's heart.

--> Dorn: On or before July 1, 1995, and annually thereafter, data from the TCAP tests, or their future replacements, will be used (notice the 'will'; the language is not just permissive here) to provide an estimate of the statistical distribution of teacher effects on the educational progress of students within school districts for grades three (3) through eight (8).

--> McLean: Here we are again--these gains are to be interpreted as "teacher effects". Peace, TVAAS, but I do not believe that anyone's models and techniques are yet good enough to isolate the teacher effect from all the other effects on standardized test scores in schools with all their complexity. Next to this concern (it is a concern about validity and is not vague or complex), the definition and estimation of standard errors is too small a matter to take our time.

Goldstein: Les McLean's comments have inspired some more thoughts. In the simplest value added model, an outcome score is regressed on an input score so that generally each school will have a different regression line, perhaps with varying slopes, but in the basic model with parallel slopes, so that schools can then be ranked on the resulting regression intercepts.
(The actual analysis is a bit more complex but this simple model captures the essence.) We find, typically, that the variation among these intercepts is relatively small compared to the residual variation of student scores about the regression lines for each school (5%-30%, depending on which educational system you are studying). In addition, the regression itself will account for quite a lot of the variation in outcome...maybe as much as 50-60%. This means that there is a substantial remaining variation (among students) unaccounted for, and it is this residual variation which determines the standard error values. Thus, for example, if this residual variation was zero, we would exactly predict each school's (relative) mean and the standard error of that prediction would be zero. This would mean also that once we knew each student's input score (and anything else we were able to put into our regression model) and the school that student was in, we would have a perfect prediction of the student's outcome. Of course, we are nowhere near that situation, and it is this uncertainty about the individual prediction that translates into uncertainty about the school mean (think of the mean roughly as the average of the student residuals about the regression line for each school). If you took another bunch of students with exactly the same set of intake scores you would NOT therefore expect to get the same set of outcome scores--this is what the uncertainty implies--nor the same mean for the school. In the absence of being able to predict with certainty we have to postulate some underlying value for each school's mean (otherwise we are pretty well lost), which we can think of as the limit of a series of conceptual allocations of students to the school.
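The simple model just described can be made concrete with a small simulation. What follows is a minimal sketch in Python, not the TVAAS procedure. It assumes a common slope with one intercept per school, normally distributed residuals, and made-up values for the number of schools, class sizes, and variances. The function names (simulate_schools, fit_parallel_slopes, share_separable) are hypothetical, and the "norm" is treated as a fixed, known number, which is itself an assumption. The sketch shows that the standard error of each school's adjusted mean is driven by the residual variation among students and shrinks to zero when that variation is zero.

```python
import math
import random
from statistics import mean


def simulate_schools(n_schools=20, n_students=50, slope=0.7,
                     school_sd=2.0, residual_sd=8.0, seed=0):
    """Simulate {school: [(input_score, outcome_score), ...]} from
    outcome = school_effect + slope * input + residual (all values assumed)."""
    rng = random.Random(seed)
    data = {}
    for j in range(n_schools):
        effect = rng.gauss(0.0, school_sd)
        pupils = []
        for _ in range(n_students):
            x = rng.gauss(50.0, 10.0)
            y = effect + slope * x + rng.gauss(0.0, residual_sd)
            pupils.append((x, y))
        data[j] = pupils
    return data


def fit_parallel_slopes(data):
    """Least-squares fit of a common slope with one intercept per school.

    Returns (slope, residual_sd, {school: (adjusted_mean, standard_error)}),
    where the adjusted mean is the school's predicted outcome at the overall
    average input score.
    """
    grand_x = mean(x for pupils in data.values() for x, _ in pupils)
    sxx = sxy = 0.0
    school_means = {}
    for j, pupils in data.items():
        xbar = mean(x for x, _ in pupils)
        ybar = mean(y for _, y in pupils)
        school_means[j] = (xbar, ybar)
        sxx += sum((x - xbar) ** 2 for x, _ in pupils)
        sxy += sum((x - xbar) * (y - ybar) for x, y in pupils)
    slope = sxy / sxx  # pooled within-school slope

    # Residual variation of students about their own school's line.
    ssr = 0.0
    n_total = 0
    for j, pupils in data.items():
        xbar, ybar = school_means[j]
        ssr += sum((y - ybar - slope * (x - xbar)) ** 2 for x, y in pupils)
        n_total += len(pupils)
    resid_var = ssr / (n_total - len(data) - 1)

    results = {}
    for j, pupils in data.items():
        xbar, ybar = school_means[j]
        adjusted = ybar - slope * (xbar - grand_x)
        # The SE comes from the residual variance, not from school differences.
        se = math.sqrt(resid_var * (1.0 / len(pupils)
                                    + (xbar - grand_x) ** 2 / sxx))
        results[j] = (adjusted, se)
    return slope, math.sqrt(resid_var), results


def share_separable(results, norm=None, z=1.96):
    """Fraction of schools whose interval (adjusted mean +/- z * SE)
    excludes the norm; the norm defaults to the average adjusted mean."""
    if norm is None:
        norm = mean(adj for adj, _ in results.values())
    flagged = sum(1 for adj, se in results.values() if abs(adj - norm) > z * se)
    return flagged / len(results)


if __name__ == "__main__":
    slope, resid_sd, results = fit_parallel_slopes(simulate_schools())
    print("estimated slope:", round(slope, 3))
    print("residual SD:", round(resid_sd, 2))
    print("share of schools separable from the norm:", share_separable(results))
```

Rerunning the sketch with a larger residual_sd, or with fewer students per school, widens every interval and makes fewer schools separable from the norm. That is the sense in which the residual variation, rather than any sampling of students, drives these standard errors.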
Thus an estimate of uncertainty, conventionally supplied by calculating the appropriate standard error, is important if you want to make any inference about whether the underlying means (that is, the population means) are different and, more importantly, to set limits (confidence intervals) around the estimated difference for any two schools or around the difference between a school's estimate and some national norm. Hence my original remark some time ago that when you did just that you found that most institutions could not statistically be separated, and I suspect also for TVAAS that very many cannot statistically be separated from a national norm, whether they are actually above or below it. It would be good to hear from the TVAAS people on this issue.

Camilli : Harvey continues the standard error saga, and I want to reiterate: if you had all the students in the school there wouldn't be any uncertainty at all; you'd know the mean. I think we need a "superpopulation" to get us out of this predicament. Harvey said "If you took another bunch of students with exactly the same set of intake scores you would NOT therefore expect to get the same set of outcome scores--this is what the uncertainty implies--nor the same mean for the school." This bunch of students is from the superpopulation, no? They are students who might exist, but don't, who are substantially like the students in the sample. I'll say it again, Harvey, this is a heuristic. It simply doesn't convey any additional meaning regardless of how many times it is repeated. I think we're lost when we accept statistical inferences based on data that weren't observed, and moreover, do not exist conceptually. If "all the students in the school" doesn't really have that meaning, then we are playing a game with language.

If we can get away from the superpopulation for a moment, we can begin to analyze what drives the standard error. It certainly isn't sampling error; nonetheless, it is a quantity that exists in a real sense. As you've implied above, SEs have something to do with model fit. Thus, we should be interested in those things that cause models to fit more loosely to the data. District size is certainly one factor; but correlation of effects within the model will also inflate SEs.
Effects like teachers within schools, teachers with school, schools with district might be some examples. As Gene implies, separating these effects may take some doing.

McLean : Harvey Goldstein's exposition on standard errors (17 Jan, "Standard Errors: yet again") may have been more than some wanted, but I found it instructive and thought-provoking. If you deleted without reading, reconsider--it gets at the heart of the matter of TVAAS.

While still wanting to retain the concept of the sample from some (unspecified) population, Harvey's main lesson for us was to highlight the crucial role of the model adopted by the statistician in estimating scores--gain scores, in the case of TVAAS. A model is a formula that the statistician considers a reasonable try at relating the desired quantity, the 'gain' in achievement (not directly measurable because of nuisances such as social class and prior learning), to aspects of schooling, such as teacher competence.

Advised by statisticians with wide experience outside of education (and maybe in education--we have not been told), the policy-makers decide to give the statisticians their head and to accept their estimate of 'gain', knowing that the formula will be complex and the procedures well beyond the understanding of all but a very few. The statisticians make a persuasive case that their formula and their procedures will provide the policy-makers with an estimate of gain that will distinguish the bad teachers from the poor from the average from the good from the excellent. "National norms" are invoked, unspecified, but responsibility is given to the Commissioner of Education to provide norms if the national government lets the side down.

All this tedious repetition is needed to give a context for Harvey Goldstein's description of standard errors. In essence (correction, Harvey, please, if needed) the errors are S&E, not SE--errors of Specification & Estimation, not of sampling.
A 'specification' error is made when our model, our formula, does not accurately link the target (the gain) with the data (the item responses or scale scores plus proxies for prior learning and social class and the like). We ALWAYS make a specification error--the only question is how large. If we limit ourselves, as in the TVAAS, to linear models, and we try to estimate gains across big, complex societies such as states, the error can be huge--and there is no consensus on how to estimate the size of the error. Here is a source of error.

Even though they do not sample students and schools, sampling cannot be avoided--people are absent, times of testing vary, the tests cannot possibly cover all the content (hence content sampling), items are omitted, test booklets get lost, some teachers do not cover the material on the test, ..., and so on and so on and so on. This is why we do not use a very simple formula:

Gain = (Avg. score end - Avg. score beginning)

After all, when we test everyone, and when the goal is to measure gains by THESE students THIS year in THESE places with THESE teachers, who needs an error term? With well-constructed tests, the measurement errors will cancel out when we calculate school and class means. Oh--there is measurement error in individual pupil scores, but we can report that (from the test publisher's manual) and besides, these scores don't count in the student's grade--the teacher does not get them in time, and even if they do they do not use them.

OK, so I seem to have lost the tenuous thread of the argument--NOT SO! We have learned over the years that the simple formula is more likely to mislead than to lead--to distort our view of gain rather than to clarify it. Raw score comparison tables (called 'League Tables' in the UK, after the rankings of sports teams), however compelling they seem, are statistically invalid, immoral, racist, sexist and stupid. Apart from those few flaws, they are fine. But would Tennessee put up with such poor procedures? Not on your life--scaling, imputation, hierarchical linear models and prayer are brought into play.
Here is another source of error.

All this talk of standard errors and models and politics keeps coming back to one key aspect: VALIDITY. Do those numbers represent gains in achievement? The formulas and procedures are complex enough that evidence is needed. Even if they do, how accurate are they--and I mean how much do they tell us about better learning, class-by-class, teacher-by-teacher; or has the TVAAS traded in science for voodoo? Without a better explanation, the use of these scores to label teachers as competent or incompetent seems a lot like sticking pins in dolls.

It is possible to validate the numbers--but it would take a lot of thinking, a lot of hard work and maybe 0.01 of the budget of TVAAS.

Glass : Harvey, and are these future batches of students "random" or "probabilistic" samples from that "conceptual" superpopulation? It seems highly doubtful. So what sense can possibly be made out of probability statements that surely assume random sampling? None that I can imagine. I think Les had it right last night. The "errors" in these teacher measurement schemes are model specification errors and not sampling errors. And the important questions to ask about them are not "will they be different in some conceivable 'population'?" but "what do they contain: ability differences in students, effects from previous teachers, etc.?"

Camilli : Les, I think your distinction between SE and S&E is a clear and elegant statement. It is a must-read for anyone interested in how statistical models are likely to behave in policy contexts. I'd like to throw in two additional cents:

1. I think TVAAS is certain to encounter a related problem with its "linear metric." How is it, the press may ask, that gains are so much larger in the earlier than the later grades? Does this mean that students aren't learning very much in high school? Moreover, because the standard errors are likely to be different across districts, larger districts might have to achieve smaller gains to be consistent with the law. Does this imply different standards for different districts? (I recognize that larger districts have to pull up more kids to achieve a SE's worth of gain--but I'm not sure this type of argument would wash, since a SE may be only a baby step toward the national average.)

2. The "natural" sample that exists on any given day does, I suppose, give rise to a superpopulation of the sort that Harvey Goldstein writes of. However, this is not the population about which most people think when evaluating gains since, as Bill Hunter points out, it is not a random sample from the school's student body.

Hunter : Per Camilli, who wrote "The "natural" sample that exists on any given day does, I suppose, give rise to a superpopulation of the sort that Harvey Goldstein writes of. However, this is not the population about which most people think of when evaluating gains since, as Bill Hunter points out, it is not a random sample from the school's student body." I need to clarify a bit. I think it is not the case that a sample of convenience "gives rise to" or "implies" a population of any sort (unless one chooses to regard the sample _as_ a population). As far as I can tell this thinking is exactly backwards--samples derive their meaning and existence from populations: I cannot see that the reverse order has any meaning at all. I also question the utility of Harvey G.'s conception of such samples as samples from a population in time.
This _might_ make sense in a time/space of great stability, but I see little reason to believe that children four or five years from now will have experiences of the world (especially the world of information) that are comparable to those of children of today (or five years past). The kinds of changes that required revision and re-norming of intelligence tests every 15 or 20 years half a century ago now take place in five years or less--probably about the same time scale that would be required to conscientiously develop and renorm the test.

Moreover, I think it is not just that such a sample is not a random sample from some _specific_ population (as Greg suggests above), but that it is not a random sample of ANY population, for two reasons: 1) the process of selection did not ensure equal and independent likelihood of selection for all members of the population and, more importantly, 2) no population was specified (to which the above process was not applied).

Goldstein : Brief response to Greg. The point about imagining another bunch of students like the ones you used to compute the school mean is that this seems to me just what one always has to do. The information about the students whose data you analysed may be of historical interest, but most people really want to assume that, given no evidence to the contrary, if and when a fresh set of similar students passes through the school (as is happening by the time they get to read the report) they would expect a similar outcome. The superpopulation is not just a heuristic device; it is a reality in the sense that further batches of students are samples from it. How else would you make sense of anything?

Now to Les' points: Specification error actually, I think, sits on top of what I mean by standard errors; the latter assume that the specified statistical model is a good description. This raises what I think is perhaps the more important issue. Are we using the right measures?
Have we adjusted for all the confounding factors? Have we adjusted properly for measuring errors (unreliability)? On this side of the Pond we have, I believe, won the intellectual (not the political--we are used to losing that one) argument against Les' RAW league tables and are beginning to make people aware of the limitations of value added ones. The standard error argument is only one point of reference, but it is quite important because it does, I believe, point out the inherent scientific limits to any kind of institutional comparison in terms of how finely ranked you can get. There is a kind of uncertainty principle operating; you can establish that there are institutional variations without being able to determine exactly which institutions are actually different from each other. That's perhaps difficult to live with but does seem to be a fact of life.

McLean : On January 18, Bill Sanders wrote (via Rick Garlikov--and along with many other topics):

--> Sanders : To Leslie McLean, your plots of standard errors as calculated make no sense. Middle schools in the example school system we provided have more students than intermediate schools in almost every case. Thus, their standard errors tend to be lower. Middle schools also have smaller expected nominal gains. Therefore, your attempt to show a relationship over grades is nonsense.

--> McLean : It was indeed the point I was making--that the plot (or correlation) over grades made no sense. That is why I argued that the within-grade correlations were the ones to look at--and that they were around 0.0. BTW, if means in a table are based on widely different Ns, you would do your readers a good turn to say so, don't you think? Your remark that "middle schools also have smaller expected nominal gains" is ambiguous and interesting. In what sense "expected"; in what sense "nominal"?

Camilli : About superpopulations: these are entities that don't exist, except in the imagination. Yet it is contended that it is a "reality in the sense that further batches of students are samples from it. How else would you make sense of anything?" A lot of people have sought to answer this question, among them Alan Birnbaum, who paraphrased the likelihood principle as the "irrelevance of outcomes not actually observed."
He went on to write of the "immediate and radical consequences for the everyday practice as well as the theory of informative inference." As for the superpopulation, it exists in one's mind as a vehicle for generalization. But generalization itself requires more worldly knowledge. For example, consider the standard error of a statistic calculated from a poll during an election. You might say a population exists, but only for a limited amount of time. Experience with the rate of change in public sentiment (and the way the question is asked) is required for a valid generalization. Happily, however, we are in full agreement on the role of specification error, as masterfully articulated by Les.

SUMMARY COMMENTS BY PARTICIPANTS

Goldstein : I am a bit confused by the TVAAS requirement to make a gain of 1.5-2.0 STANDARD ERRORS. Shouldn't this refer to STANDARD DEVIATIONS? The standard error is a measure of the accuracy with which a statistic (e.g. mean gain score) is estimated. The standard deviation is a measure of population spread and is the appropriate unit to use.

Camilli : Sherman, a question has come my way from Harvey Goldstein. He asks whether "STANDARD ERROR" should be "STANDARD DEVIATION". It's my recollection that the law specifically states that SEs are to be used for assessing gain, not SDs. Could you send me the relevant section?

Dorn : Okay, here is the relevant section of the TN law, and the answer's "none"--at least explicitly:

§49-1-601. (c) If school districts do not have mean rates of gain equal to or greater than the national norms based upon the TCAP tests (or tests which measure academic performance which are deemed appropriate), each school district is expected to make statistically significant progress toward that goal.

But statistically significant is a strange concept for TVAAS, since there is no random sampling--it's supposedly everyone in the relevant universe. Does it mean statistically significant considering test-retest reliability? Does it mean statistically significant considering the norming population? Does it mean statistically significant considering a hypothetical "let's pretend this is a random sample" thought experiment? Yeesh. In point of fact, courts have not had a chance to even consider this, since probation is not a question until this fall, and the legislature has delayed individual teacher reports for an additional year, at least (Nashville TENNESSEAN, 31 May 1995). I find it amusing that a state court will decide what statistical significance is here.

Goldstein : The discussion has certainly been interesting and useful for me in forcing me to be explicit about a number of 'taken for granted' assumptions. There seem to me to be three separate issues being debated.

1. If we have a collection (sample) of individuals on whom we make measurements, is there some sense in which we can and should regard these as members of a larger collection or population of individuals? Does this population have to exist in reality (i.e. it can be enumerated in principle) or can we think of a hypothetical 'superpopulation', and when might this be useful?

2. If we accept that there is a population about which we may wish to say something (e.g. what is the mean gain score among ALL 6-7 year old boys), how can we obtain a RANDOM probability sample so that we can then apply the statistical techniques which require such samples in order to draw valid inferences?
3. Any member of a sample of human (social) beings simultaneously belongs to more than one recognisable population; thus a child belongs to a particular social background grouping and a neighbourhood of residence and a school etc. Which is the appropriate population for inference?

Let me tackle 1) first. There are clearly some real enumerable populations which we can sample and make statements about. Surveys of voting intentions are a case in point, where we wish to say something about how the whole (voting) population thinks, based on a suitable (preferably random) sample. A great deal of statistical sampling theory exists to help us do just this. At the other extreme you have something like what has been called 'generalisability theory' in educational testing, which chooses to regard a set of chosen test items as being 'sampled' from some (conceptually) infinite population of such items contained within a 'domain'. I personally have great difficulty with this concept since, as Gene Glass points out, what people seem to be doing here is to imagine the population as just a larger version of the items they happen to have (unless they really have sampled, for example, words from a dictionary for a spelling test). This then tends to come down to a sleight of hand whereby you choose your own test items, declare that they allow you to make inferences about an undefined domain, and then use statistical procedures to describe how accurately you have been able to describe that domain. And don't believe them when they tell you they have rules for generating the items and the rules implicitly define the domain--it doesn't work!

There are, however, other cases where I simply don't see how you can make any substantial progress without the notion of a hypothetical population of which your sample is a realisation. In effect this is nothing more than saying that you want, on the basis of what you observe on one group of individuals, to make some statements about other, unobserved individuals. If you are doing ethnographic, case study research, you are interested in what you find for what it may tell you about other (similar?) cases. Likewise, if you observe a relationship between race and school achievement you are concerned to make a more general statement and set of speculations about the relationship as it may apply to other children. It seems to me that without this there is no empirical social science possible. This is a philosophical, not a statistical, issue. If you can't make inferences about future individuals then all social science is just descriptive history. The notion of a superpopulation is simply a formalisation which allows us to use the tools of statistical inference. It is the ONLY formalisation I know of which allows a satisfactory method of generalising from the observed to the unobserved.

This leads to the next issue, which is how one can conceive of drawing a random sample from such a population. Gene's example of the four US states as a sample is instructive. Suppose, instead of merely calculating the mean graduation rate across States, you compared the probability of graduating between Florida and Tennessee. In 1994 you found a moderate difference. You might be happy to stop there and leave it at that. On the other hand, as a social scientist you might want to contextualise the difference, noting that Florida and Tennessee had different social compositions, and you wondered whether these might 'explain' the observed differences. You might go on to look at other factors, and soon you would be constructing quite sophisticated statistical 'models'. The point about these models is, in general, that they don't explain all the differences--there is residual variation between children in whether or not they graduate.
We might, in principle, be able to explain everything, but in practice this is extremely rare, and Les McLean's discussion of specification errors is relevant here. So the unexplained variation is assumed to be random--a reflection of our ignorance, if you like. It is at this point, when we invoke the statistical assumption of random variation, that we are forced to assume some kind of sampling (or exchangeability if you insist on being a Bayesian) from a population. Whether you wish to confine inferences to Florida and Tennessee or wish to make some tentative inference about the factors which 'explained' graduation variations in general, across time and space, is a matter of debate and presumably disagreement, but generalise we surely wish to do?

This gets into my third issue about the appropriate population of reference. In brief then, I am not arguing that we always need a superpopulation notion which then leads on to the statistical apparatus of standard errors, etc., but I am saying that to make sense of school comparisons (as with State comparisons), adjusting for those factors extrinsic to schools (gain scores and much more than these of course, such as race and class and gender), the notion of a superpopulation is really indispensable for us to make any progress. Let me ask the question again which I don't think anyone answered: If you have two schools, each with 2 students following a particular course, who would stake their academic reputation on reporting a moderate difference in average (over 2 cases) gain score as a judgement that the schools were REALLY differentially effective? Or suppose there was only 1 student in each school? Of course this is extreme--I merely wish to pursue the logic of refusing to recognise a superpopulation to an absurdity.

Camilli : Harvey frames the discussion with three questions:

1. When should we think of a sample as members of a superpopulation?
Clearly, there is a way to draw a sample in which it makes sense to think of a sampling distribution. Frequency does have meaning in this situation. But Harvey thinks that you can "make progress" by imagining a sampling distribution when the population is poorly defined and sampling isn't random, and "In effect this is nothing more than saying that you want, on the basis of what you observe on one group of individuals, to make some statements about other, unobserved individuals. If you are doing ethnographic, case study research, you are interested in what you find for what it may tell you about other (similar?) cases." But the logic here is circular: I will assume my sample is similar to other nonrandom samples, then I will assume that the results in these other samples (similar by assumption) will yield similar results. In short, I think generalization is possible, but classical frequency theory is a lazy metaphor. You can make inferences about the future, but don't think statistical theory provides a formal basis for this. Ian Hacking in The Emergence of Probability recounts how Hume demolished this notion in 1739 (see p. 181).

2. Models are proposed by scientists to account for variation, and few if any models fit perfectly. In this case, a measure of misfit is a measure of ignorance, but whoa! How does one equate ignorance with random variation? It seems to me that this is an attempt to reify frequency theory. I agree that to generalize is what we surely wish to do, but what is happening here is 1) a statistical theory is adopted which is a mathematical formalization, 2) a strict correspondence between the terms of the theory and real events is assumed, and 3) results and manipulations of the theory are presumed to have counterparts in the real world. That is, real world events are now assumed to follow statistical laws. (Marx is spinning like a top.) We will make progress when we can usefully distinguish descriptive theory from observed covariation.

3. Suppose one has two students from each of two schools with gain scores, yet knows nothing of how these students were encountered. Does one want to determine whether these students are representative of the schools to which they belong, or assume that they ARE representative of Population X? In the latter case, we are 100% certain that we have a valid sample; this is an easily recognized tautology. In the former case, we have more work to do.
Generalization isn't impossible, but we must make an argument for doing so and defend its validity. The argument is based on evidence, completeness, and persuasiveness. None of these qualities is based in statistical theory.

References

Hacking, Ian. (1975) The Emergence of Probability. Cambridge University Press.

Copyright 1996 by the Education Policy Analysis Archives

EPAA can be accessed either by visiting one of its several archived forms or by subscribing to the LISTSERV known as EPAA at (To subscribe, send an email letter to whose sole contents are SUB EPAA your-name.) As articles are published by the Archives, they are sent immediately to the EPAA subscribers and simultaneously archived in three forms. Articles are archived on EPAA as individual files under the name of the author and the Volume and article number. For example, the article by Stephen Kemmis in Volume 1, Number 1 of the Archives can be retrieved by sending an e-mail letter to LISTSERV@a and making the single line in the letter read GET KEMMIS V1N1 F=MAIL. For a table of contents of the entire ARCHIVES, send the following e-mail message to INDEX EPAA F=MAIL, that is, send an e-mail letter and make its single line read INDEX EPAA F=MAIL.

The World Wide Web address for the Education Policy Analysis Archives is Education Policy Analysis Archives are "gophered" in the directory Campus-Wide Information at the gopher server INFO.ASU.EDU.


To receive a publication guide for submitting articles, see the EPAA World Wide Web site or send an e-mail letter to and include the single line GET EPAA PUBGUIDE F=MAIL. It will be sent to you by return e-mail. General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, Glass@asu.edu or reach him at College of Education, Arizona State University, Tempe, AZ 85287-2411. (602-965-2692)

Editorial Board
John Andrew Coulson Alan Davis Mark E. Thomas F. Alison I. Arlen Gullickson Ernest R. Aimee Craig B. Howley u56e3@wvnvm.bitnet William Richard M. Jaeger Benjamin Thomas Dewayne Mary P. Les Susan Bobbitt Anne L. Hugh G. Richard C. Anthony G. Rud Dennis Jay Robert Robert T.
