A peer-reviewed scholarly journal
Editor: Gene V Glass
College of Education
Arizona State University

Copyright is retained by the first or sole author, who grants right of first publication to the EDUCATION POLICY ANALYSIS ARCHIVES. EPAA is a project of the Education Policy Studies Laboratory. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Volume 11 Number 31
September 1, 2003
ISSN 1068-2341

Performance Standards: Utility for Different Uses of Assessments

Robert L. Linn
University of Colorado at Boulder & National Center for Research on Evaluation, Standards, and Student Testing

Citation: Linn, R. L. (2003, September 1). Performance standards: Utility for different uses of assessments. Education Policy Analysis Archives, 11(31). Retrieved [Date] from http://epaa.asu.edu/epaa/v11n31/.

Abstract

Performance standards are arguably one of the most controversial topics in educational measurement. There are uses of assessments such as licensure and certification where performance standards are essential. There are many other uses, however, where performance standards have been mandated or become the preferred method of reporting assessment results where the standards are not essential to the use. Distinctions between essential and nonessential uses of performance standards are discussed. It is argued that the insistence on reporting in terms of performance standards in situations where they are not essential has been more harmful than helpful. Variability in the definitions of proficient academic achievement by states for purposes of the No Child Left Behind Act of 2001 is discussed and it
is argued that the variability is so great that characterizing achievement is meaningless. Illustrations of the great uncertainty in standards are provided.

Measurement specialists generally conceive of achievement as a continuum, and they prefer to work with scale scores with many gradations rather than with a small number of categorical scores. It is recognized that there are a number of purposes for which the scores need to be lumped into a small number of categories that require the identification of one or more cut scores. Some leading measurement specialists, however, have suggested that it is best to avoid setting performance standards and associated cut scores if possible. For example, Shepard (1979) advised that it is best to avoid setting standards whenever possible (p. 67) and Green (1981) concluded that fixed cutting scores should be avoided whenever possible (Green, 1981, p. 1005). There obviously are some purposes where the identification of one or more cut scores cannot be avoided because they are essential to the use of a test. Tests used to make licensure and certification decisions must have a cut score identified that will collapse the score scale into only two categories: pass and fail. In other situations, the scores are collapsed into 4 or 5 categories. The College Board, for example, converts a weighted combination of scores on the multiple-choice and the constructed-response sections of their Advanced Placement (AP) Examinations to a final grade that is reported on a five-point scale:

5 extremely well qualified
4 well qualified
3 qualified
2 possibly qualified
1 no recommendation

The pass-fail dichotomy is required for the decision to be made in the case of a licensure or certification test.
The use of 5 categories on AP examinations, on the other hand, also supports dichotomous decisions about whether or not a student will receive college credit based on his or her AP grade, but allows colleges and universities to determine the grade required to be awarded credit. Certification tests and AP examinations are just two of many situations where the primary use of test scores is to determine whether the test taker has met a performance standard. The performance of the person who barely met the standard is more similar to that of the person who barely failed to meet the standard than to that of someone who exceeded the standard by a comfortable margin. Indeed, due to measurement error, there is a substantial probability that the person who barely met the standard may be a false positive while the person who barely failed may be a false negative. The two could easily switch places if they took an alternate form of the test. Although such classification errors are of concern and should be minimized to the extent possible, they cannot be avoided altogether and there are legitimate practical reasons that require that a decision be made. Thus, Shepard's (1979) and Green's (1981) advice to avoid the use of a fixed standard or cut score cannot always be followed because standards are an essential element of the use of the test
results. In the last 10 or 15 years, however, performance standards have been mandated and used with increasing frequency in situations where they are not essential to the use being made of test results.

Nonessential Uses of Performance Standards

Performance standards began to be introduced in uses of tests where they were not really essential as the result of the criterion-referenced testing movement that was spawned by Glaser's (1963) classic article. Ironically, Glaser's conceptualization of criterion-referenced measurement did not require the establishment of a fixed standard or cut score, but the use of cut scores to determine that a student either did or did not meet a performance standard became associated with criterion-referenced tests (Hambleton & Rogers, 1991; Linn, 1994). Glaser's later discussions of criterion-referenced testing (e.g., Glaser & Nitko, 1971) recognized the use of criterion in the sense of a fixed standard, noting that "[a] second prevalent interpretation of the term criterion in achievement measurement concerns the imposition of an acceptable score magnitude as an index of achievement" (p. 653). Nonetheless, the setting of a fixed standard or cut score is not essential to criterion-referenced measurement.

There are a number of reasons to question the wisdom of setting a performance standard for a test if the standard is not essential to the use of the test results. Green's (1981) desire to avoid setting fixed performance standards whenever possible was based on the recognition that a single item provides very little information by itself but one item may well make the difference between passing and failing (p. 1005). Others who have been critical of the use of performance standards have focused on limitations of the standards. Based on his review of standard setting methods, Glass (1978), for example, concluded that standards setters cannot determine criterion levels or standards other than arbitrarily.
The consequences of arbitrary decisions are so varied that it is necessary either to reduce the arbitrariness, and hence the unpredictability, or to abandon the search for criterion levels altogether in favor of ways of using test data that are less arbitrary, and hence safer (p. 237). Respondents to Glass's article noted that although standards are arbitrary in the sense that they are set judgmentally, they need not be capricious (Hambleton, 1978; Popham, 1978).

For a licensure test there is a clear context for standard setters, who have a responsibility for thinking about the minimal skills required to protect the public from incompetent practitioners. The broader good of requiring some minimal level of performance on a test to be certified and therefore allowed to practice justifies the judgmentally established cut score. Moreover, the need to make certification decisions provides judges with a clear context for setting the standard. Standard setters may also have a clear sense of the proportion of candidates who have passed the examination in the past. Similarly, for an AP test there is a clear link to the college grades assigned in courses for which credit may be awarded. Judgments regarding minimal acceptable performance when standards are set on tests that must be passed to graduate from high school or for promotion to the next grade have a more generalized context than the one that applies to setting standards for licensure examinations. The need
to establish a cut score is essential to the use of a test as a high school graduation requirement, and at least some of the consequences of the use are known to the standard setters, who can weigh the potential benefits (e.g., restoring the meaning of a high school diploma or motivating greater effort by teachers and students, Mehrens & Cizek, 2001, p. 480) against the potential negative consequences (e.g., that the minimums will become the maximum or that more students will drop out of school, Mehrens & Cizek, 2001, p. 481). In many instances the standards that are set are not used to make any pre-specified decisions about individual students. Instead the performance standards are used for reporting the performance of groups of students and for tracking progress of achievement for schools, states, or the nation. Examples include the setting of performance standards for the National Assessment of Educational Progress (NAEP) or a state assessment. Standard setters choosing a cut score to determine proficient, basic, or advanced performance for NAEP or a state assessment have, until recently, lacked any clear context other than a sense of an aspiration for high levels of student achievement.

Standards and Aspirations

Six broad educational goals, two of which concerned student achievement, were agreed to by Governors and the President at the Education Summit held in Charlottesville, Virginia in 1989. The National Educational Goals Panel was created and given the responsibility of monitoring progress toward the goals set at the Education Summit. Five years after the Charlottesville summit, the Goals 2000: Educate America Act of 1994 was signed into law (Public Law 103-227). Goals 2000 encouraged standards-based reporting of student achievement. As defined by a technical planning group for the Goals Panel, performance standards specify "how good is good enough" (National Educational Goals Panel, 1993, p. 22).
Unanswered is the question: good enough for what? It is clear that the performance standards are expected to be absolute rather than normative. (Although normative comparisons have been eschewed by many proponents of criterion-referenced tests and standards-based assessments, norms have considerable utility for providing comparisons of relative performance across content areas and even for a single measure are often more readily interpreted than are criterion-referenced reports or reports of results in terms of performance standards (see, for example, Hoover, 2003).) In keeping with the zeitgeist of the time, it is also clear that they were expected to be set at world class levels.

The high aspirations of the 1989 Education Summit, which were encouraged by Goals 2000 and the Goals Panel, provided the primary context for the setting of standards on NAEP and on many state assessments. As might have been expected, the performance standards set on NAEP and on a number of state assessments were set at high levels. In support of the judgment that the performance standards were set at ambitious levels, consider the fact that in 1990, the first year that NAEP results were reported in terms of achievement levels, the proficient level on the mathematics assessment corresponded to the 87th percentile for 4th grade students, the 85th percentile for 8th grade students, and the 88th percentile for 12th grade students (see, for example,
Braswell, Lutkus, Grigg, Santapau, Tay-Lim & Johnson, 2001). Linkages of NAEP to the Third International Mathematics and Science Study (TIMSS) grade 8 mathematics results, which revealed that no country is anywhere close to having all of their students scoring at the proficient level or higher (Linn, 2000), also attest to the conclusion that the NAEP achievement levels are set at very ambitious levels.

Performance standards set by a number of states for their state assessments during the past decade have also been set at quite high levels in many cases. The high levels that were set by many states are evident from the linkage of state assessments to NAEP (e.g., McLaughlin & Bandeira de Mello, 2002; see also Linn, in press, for a discussion of the McLaughlin & Bandeira de Mello results). It is not unusual for a state to have the proficient performance standard set at the 70th percentile or even higher. Since performance standards on NAEP and on a number of state assessments have, in most cases in the past, had no real consequences for students and there is no requirement for actually achieving the aspiration of having all students at the proficient level or above, one might conclude that there is no harm in having high standards. There may be reason for concern that reports that, say, less than half the students are proficient may paint an unduly negative picture for the public, but many would argue that it is good to have ambitious goals even if they are never achieved. "If you reach for the stars, you may not quite get them but you won't come up with a handful of mud either" (Samuel Butler, as quoted by Applewhite, Evans, and Frothingham, 1992, p.
22).

It is one thing to set performance standards at the height of stars when there are no requirements of achieving them, but it is another matter altogether when there are serious sanctions for falling short such as those that have recently been put in place by the No Child Left Behind (NCLB) Act of 2001 (Public Law 107-110). NCLB sets the goal of having all children at the proficient level or higher in both reading/language arts and mathematics by 2013-2014. It also requires schools, districts, and states to meet adequate yearly progress (AYP) targets in intermediate years that would assure that they are on track to having 100% of the students at the proficient level or above by 2013-2014. There are severe sanctions for schools that fall short of AYP targets for two or more years in a row.

The percentage of students who are at the proficient level or above for purposes of NCLB is to be determined by state assessments using performance standards established by each state. "Each State shall demonstrate that the State has adopted challenging academic content standards and challenging student academic achievement standards that will be used by the State" (P.L. 107-110, Section 1111(b)(1)(A)).

All states had to submit plans explaining how they were going to meet the accountability requirements of NCLB by January 31, 2003. States that were in the process of introducing new assessments or that had not yet set performance standards will be setting standards in quite a different context than existed prior to the enactment of NCLB. In light of the new context provided by NCLB, it is reasonable to expect that they will set the standards at less ambitious levels than they would have been set a couple of years earlier. The standards recently
set by Texas for their new assessment, the Texas Assessment of Knowledge and Skills (TAKS), are consistent with the expectation that states may set their sights a little lower in the context of NCLB.

States that already had their assessments and performance standards in place prior to the enactment of NCLB faced a dilemma. They confronted the question of whether they should stay the course, recognizing that their performance standards were set at levels that are unrealistic for all children to achieve within the next 12 years. Or should they lower their performance standards and risk being accused of dumbing down their standards? Some states, e.g., Colorado and Louisiana, redefined their performance levels for purposes of NCLB. Colorado, for example, has reported results on the Colorado Student Assessment Program (CSAP) in terms of four levels: unsatisfactory, partially proficient, proficient, and advanced. Colorado will continue to use all four levels for reporting to parents, schools, and districts. For purposes of NCLB, however, Colorado has collapsed the partially proficient and proficient levels into one level called proficient.

NCLB Starting Points for States

In order to track their AYP toward the goal of 100% proficient or above by 2013-2014, states have to define percentage proficient starting points. The starting point for each subject (reading/language arts and mathematics) is defined to be equal to the higher of the following two values: (1) the percentage of students in the lowest scoring subgroup who achieve at the proficient level or above and (2) the school at the 20th percentile in the State, based on enrollment, among all schools ranked by the percentage of students at the proficient level (P.L. 107-110, Sec. 1111(b)(2)(E)(ii)). In most cases the latter value will be the higher one and define the starting point.
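The starting-point rule described above can be sketched in a few lines of Python. This is only an illustrative reading of the rule, not the statutory procedure: the function names, the (enrollment, percent proficient) data layout, and the cumulative-enrollment interpretation of "the school at the 20th percentile, based on enrollment" are all assumptions, and the school figures are made up for the example.

```python
def school_at_enrollment_percentile(schools, percentile=20.0):
    """Return the percent proficient of the school at the given enrollment percentile.

    `schools` is a list of (enrollment, pct_proficient) pairs. Schools are ranked
    from lowest to highest percent proficient, and enrollment is accumulated until
    the requested share of total enrollment is reached. This cumulative reading of
    "based on enrollment" is an assumption made for illustration.
    """
    ranked = sorted(schools, key=lambda school: school[1])
    total_enrollment = sum(enrollment for enrollment, _ in ranked)
    threshold = total_enrollment * percentile / 100.0
    running = 0
    for enrollment, pct_proficient in ranked:
        running += enrollment
        if running >= threshold:
            return pct_proficient
    raise ValueError("no schools supplied")


def nclb_starting_point(lowest_subgroup_pct, schools):
    """Starting point = the higher of the two values named in the rule."""
    return max(lowest_subgroup_pct, school_at_enrollment_percentile(schools))


# Hypothetical state: four schools as (enrollment, percent proficient).
example_schools = [(500, 30.0), (300, 45.0), (200, 60.0), (1000, 50.0)]
print(nclb_starting_point(22.0, example_schools))  # school value is the higher one
print(nclb_starting_point(35.0, example_schools))  # subgroup value is the higher one
```

In the first call the school at the 20th percentile defines the starting point, which is the situation the text describes as the most common one.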
Because states have their own assessments and set their own performance standards, it should not be at all surprising that state NCLB starting points are quite variable. Some states are yet to define their performance standards and/or starting points, and some states have expressed their starting points in terms of scale scores that make comparisons difficult. Percentage proficient or above for reading/language arts starting points are available for 34 states at grades 4 and 8 (Olson, 2003). At grade 4, the starting percentages range from a low (i.e., most stringent) of 13.6% for California to a high (i.e., most lenient) of 77.5% for Colorado, with a median of 49.35%. At grade 8, the starting points range from a low of 13.6% to a high of 74.6% with a median of 46.2%. As at grade 4, California and Colorado define the extremes at grade 8. At grade 4, eight states have starting point percentages of 34% or less and eight states have starting points of 64% or more. The corresponding percentages at grade 8 are 35% and 60%. State NAEP results indicate that states do vary in terms of student achievement, but not nearly enough to explain the huge variability in NCLB percentage proficient starting points. For the 43 states that participated in the 2002 NAEP 4th grade reading assessment, for example, the percentage of students who were at the proficient level or above ranged from a low of 14% in Mississippi to a high of 47% in Massachusetts (Grigg, Daane, Jin & Campbell, 2003).
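To make the practical meaning of these starting points concrete, the short Python sketch below computes the average annual gain in percentage proficient that a state would need in order to reach 100% from its grade 4 reading starting point. It assumes equal annual steps over 12 years, which is a simplification introduced here for illustration (the text says only that intermediate AYP targets must keep schools on track for 2013-2014); the function name and the 12-year horizon are likewise assumptions.

```python
def annual_gain_needed(starting_pct, years=12, goal_pct=100.0):
    """Average gain in percentage points per year to move from start to goal.

    Equal annual increments and a 12-year horizon are simplifying assumptions.
    """
    return (goal_pct - starting_pct) / years


# Grade 4 reading/language arts starting points reported in the text.
starting_points = {
    "California (lowest)": 13.6,
    "Median state": 49.35,
    "Colorado (highest)": 77.5,
}

for label, start in starting_points.items():
    gain = annual_gain_needed(start)
    print(f"{label}: start {start:.2f}%, about {gain:.2f} points per year")
```

Under these assumptions the required pace differs by a factor of nearly four between the most stringent and most lenient starting points, even though every state faces the same nominal goal.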
The variability in the starting points is of similar magnitude for mathematics at grades 4 and 8 as that found for reading/language arts. The range for mathematics at grade 4 is from 8.3% in Missouri to 79.5% in Colorado, and at grade 8 the range is from 7% in Arizona to 74.6% in North Carolina. On the 2000 NAEP mathematics assessment, North Carolina students did perform somewhat better than Arizona students. Thirty percent of the North Carolina students were at the proficient level or above on the grade 8 mathematics assessment compared to 21% in Arizona (Braswell, Lutkus, Grigg, Santapau, Tay-Lim & Johnson, 2001). The grade 8 mathematics achievement of students in Arizona and North Carolina appears much more similar on NAEP, however, than is suggested by the starting points of 7% and 74.6%.

Controversy Regarding Performance Standards

Performance standards have been the subject of considerable controversy. The performance standards, called achievement levels, set on the NAEP assessments, for example, have been subjected to harsh criticism. Reviews by panels of both the National Academy of Education (NAE) (Shepard, Glaser, Linn, & Bohrnstedt, 1993) and the National Research Council (NRC) (Pellegrino, Jones, & Mitchell, 1998) concluded that the procedure used to set the achievement levels was "fundamentally flawed" (Shepard, et al., 1993, p. xxii; Pellegrino, et al., 1998, p. 182). The conclusions of the NAE and NRC panels were controversial, and several highly regarded measurement experts have defended the procedure used to set the NAEP achievement levels as well as the resulting levels (e.g., Cizek, 1993; Kane, 1993, 1995; Mehrens, 1995). There is an abundance of methods for setting standards, but there is no agreed upon best method.
This point was made repeatedly at the Joint Conference on Standard Setting for Large-Scale Assessments sponsored by the National Assessment Governing Board and the National Center for Education Statistics held October 5-7, 1994 in Washington, DC. The Joint Conference included 18 presentations by scholars representing multiple perspectives. The papers dealt with a variety of issues ranging from technical to policy and legal issues. Crocker and Zieky (1995) prepared an executive summary of the conference which included the following summary conclusion.

Even though controversies and disagreements abounded at the conference, there were some areas of general agreement. Authors agreed that setting standards was a difficult, judgmental task and that procedures used were likely to disagree with one another. There was clear agreement that the judges employed in the process must be well trained and knowledgeable, represent diverse perspectives, and that their work should be well documented (p. ES-13).

Variability in Standards

As was indicated by Crocker and Zieky, there is a broad consensus in the field that different methods of setting standards will yield different standards. This consensus is consistent with Jaeger's (1989) summary of the literature on the comparability of standard setting methods. "Different standard-setting procedures generally produce markedly different standards when applied to the
same test, either by the same judges or by randomly parallel samples of judges" (Jaeger, 1989, p. 497). It is also agreed that different groups of judges will set different standards when using the same method, especially when the groups represent different constituencies (e.g., teachers, administrators, parents, the business community, or the general public). Moreover, there is general agreement that there is NO true standard that the application of the right method, in the right way, with enough people, will find (Zieky, 1994, p. 29). Given that there is no true standard or best method for setting a standard, it is reasonable to ask what should be made of the variability in results as the function of choice of methods or choice of judges. If one wants to make generalizations across methods or groups of judges, then it would seem reasonable to treat the variability in results as error variance. In doing so, we would at least acknowledge that there is a high degree of uncertainty associated with any performance standard.

Variability Due to Judges

Attempts are often made to estimate the error variability due to judges as part of the standard setting process. A difficulty that is encountered, however, is that standard setting methods usually involve group discussion following an initial set of judgments which may be made independently. Group discussion obviously makes judgments in subsequent rounds dependent, which makes it impossible to estimate the error variability due to judges in the final round of judgments. Furthermore, there are good reasons to believe that the person who leads the standard setting exercise may have an important influence on the outcome. Thus, what would be desired is something akin to the duplicate-construction experiment that Cronbach (1971) proposed as a way to evaluate the content validity of a test. The duplicate-construction experiment would require that two teams of equally competent writers and reviewers (p.
456) independently construct alternate tests. The parallel in standard setting would involve the use of independent panels of comparably qualified judges to set the standards under the direction of equally competent leaders using the same method and instructions. The variance in the standards for the two panels would provide an estimate of the amount of error due to the panel of judges and standard setting leader.

Since the bigger sources of variability in standards are apt to come from the way in which judges are identified and the method that is used to set standards, even the parallel of Cronbach's duplicate-construction experiment would greatly underestimate the real degree of uncertainty in the standards. Some idea of the degree of variability due to the identification of judges is provided by results of a study conducted by Jaeger, Cole, Irwin, and Pratto (1980). Jaeger and his colleagues had three panels, consisting of samples of teachers, school administrators, and counselors, respectively, independently set passing standards on a North Carolina test. The differences in the standards set by the different panels can be gauged by the magnitude of the differences in the proportion of students who would have failed the test according to the different groups of judges. On the reading test the proportion who would have failed ranged from a low of 9% to a high of 30%. The variability in failure rates was even greater for the mathematics test, ranging from a low of 14.4% to a high of
71.1%.

Variability Due to Method

Variability due to choice of method can be evaluated based on results of several different studies that were reviewed by Jaeger (1989). One of those studies where multiple methods were used, for example, was conducted by Poggio, Glasnapp and Eros (1981). They had independent samples of teachers set standards using one of four different standard setting methods: the Angoff (1971) method, the Ebel (1972) method, the Nedelsky (1954) method, or the contrasting groups (see, for example, Jaeger, 1989) method. Teachers set standards for tests at grades 2, 4, 6, 8, and 11. There was substantial variability in the standards set by the four different methods at every grade. At grade 8, for example, the four different methods would set the minimum passing score on a 60 item reading test at 28, 39, 43, and 48 items correct. If the most lenient standard had been used, just over 2% of the students would have failed, whereas approximately 29% would have failed if the most stringent standard had been used.

In his summary of 32 contrasts of standards set by different methods, Jaeger (1989) found that the ratios of the percentages of examinees who would fail range from a low of 1.00 to a high of 29.75 with a median of 2.74. That is, the typical consequence of using a different method to set standards would be to alter the failure rate by a factor of almost 3. As Jaeger (1989) concluded, the choice of a standard setting method is critical (p. 500). He went on to endorse earlier suggestions by Hambleton (1980), Koffler (1980) and Shepard (1980; 1984) that it might be prudent to use several methods in any given study and then consider all the results, together with extrastatistical factors, when determining a final cutoff score (Jaeger, 1989, p. 500).
Because of practical cost considerations, the use of multiple methods as input to a final standard setting decision is rare in operational practice, but that was the approach recently taken in Kentucky for the assessments introduced in the state in 2000. As was recently reported by Green, Trimble and Lewis (2003), the Kentucky Department of Education used multiple methods as input to their final standard setting when the state introduced a new testing system, the Kentucky Core Content Test (KCCT), in 2000. First, three different methods, the bookmark procedure (Lewis, Green, Mitzel, Baum & Patz, 1998; Mitzel, Lewis, Patz, & Green, 2001), the Jaeger-Mills procedure (Jaeger & Mills, 2001), and the contrasting groups method (see, for example, Jaeger, 1989), were used to set cut scores to distinguish four levels of performance (novice, apprentice, proficient and distinguished) on each of 18 different tests used in the KCCT system for various grade levels and content areas. The results of the bookmark, the Jaeger-Mills, and the contrasting groups standard setting efforts were input to a synthesis process where the results were considered by teacher committees that recommended cut scores to the Kentucky State Board of Education (Green, Trimble, & Lewis, 2003; see also CTB/McGraw-Hill, 2001; and Kentucky Department of Education, 2001, for more detailed descriptions). Table 1 displays the percentage of students at the proficient level or above on each of six KCCT subject area tests administered at elementary school grades
according to the three standard setting methods. Also shown are summary statistics across the three methods (mean, standard error, minimum, maximum, and range) as well as the percentage proficient or above according to the standard set by the synthesis panel. The standard error is simply the standard deviation of the percentages for the three different standard setting methods, since the standard deviation may be interpreted as a standard error if the goal is to generalize across standard setting methods. As can be seen, the standard errors are quite large, indicating that there is considerable uncertainty about the percentage of students proficient or above due to standard setting method.

Table 1
Percentage of Students at the Proficient Level or Above on KCCT For Elementary School Grade Tests According to Standard Setting Method

Method or Statistic    Reading       Mathematics   Science       Social Studies  Arts & Humanities  Practical Living/Vocational Studies  Mean
Bookmark               56.5%         35.2%         35.4%         48.4%           15.3%              44.8%                                39.3%
Jaeger-Mills           15.3          [illegible]   4.7           4.5             11.0               16.4                                 12.1
Contrasting Groups     29.4          19.5          24.5          27.2            24.8               24.7                                 25.0
Methods Mean           33.7          25.1          21.5          26.7            17.0               28.6                                 25.5
Standard Error         20.94         8.74          15.56         21.95           7.06               14.60                                14.81
Maximum                56.5          35.2          35.4          48.4            24.8               44.8                                 40.85
Minimum                15.3          19.5          4.7           4.5             11.0               16.4                                 11.9
Range                  41.2          15.7          30.7          43.9            13.8               28.4                                 29.0
Synthesis Standard     [illegible]   [illegible]   [illegible]   [illegible]     13.3               45.4                                 37.1

Tables 2 and 3 display results parallel to those in Table 1 for the KCCT tests administered at the middle school and high school grades, respectively. The results for the 12 subject area by grade combinations shown in Tables 2 and 3 are similar to those shown in Table 1 for the tests administered at the elementary school grades. The mean standard error across the 18 subject area by grade combinations in Tables 1 through 3 is 10.82, and they range from a low of 4.26 for the middle school mathematics test to a high of 26.35 for the middle school reading test.
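The summary statistics described for Table 1 can be reproduced with a short Python sketch. It uses the three elementary school reading percentages reported for the bookmark, Jaeger-Mills, and contrasting groups methods, and it takes the standard error to be the sample standard deviation (n - 1 in the denominator) across methods, which is an assumption that happens to match the tabled value. The function and variable names are illustrative only.

```python
from statistics import mean, stdev


def summarize_methods(pct_by_method):
    """Summarize percent proficient or above across standard setting methods.

    The "standard error" is the sample standard deviation across methods,
    treating methods as the facet one wants to generalize over.
    """
    values = list(pct_by_method.values())
    return {
        "mean": mean(values),
        "standard_error": stdev(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
    }


# Elementary school reading percentages from Table 1.
reading = {"Bookmark": 56.5, "Jaeger-Mills": 15.3, "Contrasting Groups": 29.4}
summary = summarize_methods(reading)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With only three methods the estimate is itself very imprecise, so the sketch is best read as a check on the arithmetic rather than as a refined estimate of method-related error.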
Even in the best case there is a good deal of uncertainty about the percentage of students who are at the proficient level or above as the consequence of the method used to set the performance standards. A standard error of even 4 points is large relative to the annual change in percentage proficient or above that is likely to be required to meet the AYP target for NCLB. A standard error of 26 points is gigantic in that same context.

Table 2
Percentage of Students at the Proficient Level or Above on KCCT For Middle School Grade Tests According to Standard Setting Method

Method or Statistic    Reading   Mathematics   Science   Social Studies  Arts & Humanities  Practical Living/Vocational Studies  Mean
Bookmark               61.0%     19.8%         50.6%     28.6%           41.6%              40.6%                                40.4%
Jaeger-Mills           10.5      17.7          10.4      12.8            14.6               16.2                                 13.7
Contrasting Groups     22.7      25.9          23.7      22.8            23.9               30.9                                 25.0
Methods Mean           31.4      21.1          28.2      21.4            26.7               29.2                                 26.4
Standard Error         26.35     4.26          20.48     7.99            13.72              12.29                                14.18
Maximum                61.0      25.9          50.6      28.6            41.6               40.6                                 41.4
Minimum                10.5      17.7          10.4      12.8            14.6               16.2                                 13.7
Range                  50.5      8.2           40.2      15.8            27.0               24.4                                 27.7
Synthesis Standard     51.0      25.2          27.3      28.3            35.9               35.3                                 33.8

Table 3
Percentage of Students at the Proficient Level or Above on KCCT For High School Grade Tests According to Standard Setting Method

Method or Statistic    Reading   Mathematics   Science   Social Studies  Arts & Humanities  Practical Living/Vocational Studies  Mean
Bookmark               27.1%     10.9%         19.0%     22.5%           21.0%              47.8%                                30.7%
Jaeger-Mills           12.0      10.9          2.6       14.4            11.0               17.5                                 11.4
Contrasting Groups     26.3      25.9          30.9      24.0            24.5               33.5                                 27.5
Methods Mean           21.8      15.9          17.5      20.3            18.8               32.9                                 21.2
Standard Error         8.50      8.66          14.21     5.16            7.00               15.16                                9.78
Maximum                27.1      25.9          30.9      24.0            24.5               47.8                                 30.3
Minimum                12.0      10.9          2.6       14.4            11.0               17.5                                 11.4
Range                  15.1      15.0          28.3      9.6             13.5               30.3                                 18.6
Synthesis Standard     27.5      26.3          27.3      24.0            19.5               48.4                                 28.8

Of course, Kentucky did not use any of the three methods to set the performance standards for operational use. Rather, the results of the three methods were used as input to the synthesis panels that provided the final
recommendations to the State Board of Education. Hence, it might be argued that the variability due to method is not relevant to judging the uncertainty of the final performance standards. Thus, a reasonable question is: what is the degree of uncertainty for operational standards in Kentucky or any other state? Results recently reported by Hoover (2003) are relevant to this question. Hoover compared the percentage of students labeled proficient or advanced according to three national test batteries and NAEP. Hoover's results show that performance standards that are finally adopted after much care and expense by national test publishers for their tests, or by the National Assessment Governing Board for NAEP, also have a great deal of uncertainty. According to the three national tests, the percentage of grade 4 students who are proficient or above in reading is 24% according to one test publisher, 40% according to another publisher, and 55% according to the third publisher. According to NAEP the figure is 31%. For grade 4 mathematics, the corresponding four numbers are 15%, 34%, 44%, and 26% (Hoover, 2003, p. 11). Hoover's comparisons also show that, whereas there is apparently a substantial decline in the percentage of students who are proficient or above from grade 8 to 9 (33% vs. 11%) according to one publisher, there is no such decline according to the other two publishers. While one publisher's performance standards show a fairly steady decline in the percentage of students who are proficient or advanced in mathematics from grades 1 through 12 (from 41% to 5%), another publisher's performance standards show that slightly more than twice as many students (27% vs.
12%) are proficient or advanced at grade 12 than at grade 1.

Conclusion

The variability in the percentage of students who are labeled proficient or above due to the context in which the standards are set, the choice of judges, and the choice of method to set the standards is, in each instance, so large that the term "proficient" becomes meaningless. Insistence on standards-based reporting of achievement test results where such reporting serves no essential purpose is more harmful than helpful. This is particularly true in the context of NCLB, where schools, districts, and states are subject to substantial sanctions based on the progress that is made against arbitrary performance standards that lack any semblance of comparability from state to state.

Several years ago I (Linn, 1995) described four uses of performance standards: exhortation, exemplification of goals, accountability, and the certification of student achievement. Performance standards and associated cut scores are essential only for the fourth use. Although reporting results in terms of performance standards is often done to exhort teachers and students to exert greater effort, standards-based reporting is not essential to that use. Nor are performance standards essential for the purpose of exemplifying goals. NCLB and a number of state accountability systems depend on the performance standards, but that would not have to be the case.

One of the purposes of introducing performance standards is to provide a means of reporting results in a way that is more meaningful than a scale score. Certainly, reporting that a student performed at the proficient level appears
more understandable than saying that the student has a scale score of 215. Parents familiar with student performance in school in terms of grades of, say, A, B, C, D, and F might naturally assume that proficient is like a grade of B or C. Given the huge inconsistencies in definitions of proficient achievement and in the associated stringency of cut scores, however, it seems clear that performance levels cannot carry the same interpretation across states where standards vary so radically in their stringency. It would be better to find another way of dealing with these non-essential uses of performance standards and cut scores.

There obviously is a legitimate interest in being able to measure progress in student achievement. There are many ways of measuring progress and setting AYP targets that do not depend upon the reporting of results in terms of performance standards. Effect-size statistics that measure the year-to-year difference in mean achievement scores in terms of the standard deviation of scores in the base year are one obvious way that progress could be measured. Holland's (2002) proposal to measure progress in student achievement by comparing cumulative distribution functions is another approach. Comparisons of cumulative distribution functions provide a means of monitoring changes in student performance throughout the range of performance. Changes in the percentage of students exceeding any selected score level can be readily determined, rather than just focusing on one arbitrary cut score that corresponds to the proficient performance standard. Comparisons might also be made to norms for a base or reference year. If improvement in performance in State A was large enough that three quarters of the students in 2013-2014 performed above the median level in 2002-2003, that would represent a large improvement in student achievement.
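The alternatives described above can be made concrete with a small calculation. The sketch below uses invented scale scores for a base year and a later year to compute an effect size in base-year standard deviation units, the percentage of students above the base-year median, and the percentage above several selected score levels (the complement of the cumulative distribution function). The function names and all scores are hypothetical, and the use of the sample standard deviation for the base year is an assumption.

```python
from statistics import mean, median, stdev


def effect_size(base_scores, later_scores):
    """Difference in mean scores, expressed in base-year standard deviation units."""
    return (mean(later_scores) - mean(base_scores)) / stdev(base_scores)


def percent_above(scores, cut):
    """Percentage of scores strictly above the score level `cut`."""
    return 100 * sum(score > cut for score in scores) / len(scores)


# Hypothetical scale scores, invented for illustration only.
base_year = [200, 210, 220, 230, 240]
later_year = [215, 225, 235, 245, 255]

gain = effect_size(base_year, later_year)

# Norm-based cut score: the median of the base-year distribution.
base_median = median(base_year)
above_base_median = percent_above(later_year, base_median)

# Percentage above several score levels in each year, which traces
# 100 * (1 - CDF) across the score range instead of at a single cut.
cuts = [205, 225, 245]
curve = [(c, percent_above(base_year, c), percent_above(later_year, c)) for c in cuts]

print(f"effect size: {gain:.2f}")
print(f"percent above base-year median: {above_base_median:.1f}")
for cut, base_pct, later_pct in curve:
    print(f"cut {cut}: base year {base_pct:.0f}%, later year {later_pct:.0f}%")
```

None of these statistics requires a judgmentally set performance standard; each depends only on the observed score distributions in the two years.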
It would also be readily understood that students in State A generally had better achievement than students in State B, where 55% of the students in 2013-2014 scored above the 2002-2003 median. Furthermore, the norm-based results for States A and B would be much more interpretable than a statement that three-fourths of the students in State C were proficient or above in 2013-2014 compared to only 55% in State D, because the meaning of proficient might be radically different for States C and D. Indeed, if the stringency of the proficient performance standard were as variable from state to state as it is now, it might well be that the achievement of students in State D was actually better than that in State C.

Finally, if it is decided that the best way to track progress is in terms of the percentage of students scoring above a fixed cut score, sometimes referred to as PAC for percent above cut, then it would be better to pick the cut score based on norms in a base year than to use an arbitrary definition of proficient performance that bears little similarity to the definition of proficient performance in another state. The median or some other percentile rank in a base year might be used as the cut score. This would provide a clear and consistent meaning that does not seem possible to achieve for the proficient performance standard. Using PAC statistics to track progress would provide a reasonable alternative to tracking progress in achievement for different states in terms of percent proficient or above. PAC statistics based on a cut score at, say, the median achievement level in a base year would also be much more interpretable than
percentages proficient or above when comparisons are made across states. Reports of individual student assessment results in terms of norms have more consistent meaning across different assessments than reports in terms of proficiency levels based on uncertain standards. Furthermore, criterion-referenced reports of results can be provided by illustrating the types of items that the student can answer correctly and the types that the student cannot. A fixed cut score is not essential to criterion-referenced interpretations of achievement.

Postscript

For the reasons discussed above, I believe it would be desirable to shift away from standards-based reporting for uses where performance standards are not an essential part of the test use. I recognize, however, that existing state and federal laws require the setting of performance standards and the reporting of results in terms of those standards. Thus, at least until the laws are changed, there is no choice but to work to make performance standards as reasonable as possible. Assuring that judges on standard setting panels understand the context in which the standards will be used is a minimal requirement for obtaining reasonable performance standards. Normative information needs to be made part of the process so that judges can anchor their absolute judgments with an understanding of current levels of student performance and of the likely consequences. As Zieky (2001) has noted, considering both absolute and normative information in setting a cutscore can help avoid the establishment of unreasonably high or low values (p. 38). In addition to knowing the percentile rank corresponding to particular cut scores, it would also be desirable to have some means of providing judges with comparative information about the relative stringency of their standards in comparison to standards set in other states before judgments are finalized.
Normative information would be one way of making comparisons to standards in other states.

It is critical that the context in which the standards will be used be made as clear as possible to panels of judges who set the standards. The use of the standards for purposes of NCLB, with its expectation that all students will be at the proficient level or higher by 2014 and its sanctions for schools that do not meet AYP targets, is an important part of the current context that standard setters need to consider.

Finally, while there is no agreed upon best method for setting standards, the literature does provide useful indications of the differences among different methods in their relative stringency and ease of use. Jaeger's (1989) advice that multiple standard setting methods be used and the results of the different methods be considered together with extrastatistical factors when determining the cutoff score (p. 500) seems as sound now as it did in 1989. The experience in Kentucky with the use of multiple methods as input to a synthesis panel provides an excellent example where that advice was taken seriously in practice.

A number of authors have suggested helpful criteria to consider in selecting a method. Hambleton (2001), for example, identifies 20 criteria that he presents in
the form of questions to be considered in evaluating a standard-setting process. Raymond and Reid (2001) provide useful advice on the selection and training of judges, Kane (2001) provides a good discussion of considerations in the validation of performance standards and cut scores, and Bond (1995) has identified five principles intended to help ensure fairness. Although such considerations cannot eliminate the arbitrariness, they can help make the standards and cut scores more reasonable and more defensible.

Acknowledgment

Based on a paper presented at the College Board, New York City, July 16, 2003. Work on the paper was partially supported under the Educational Research and Development Center Program PR/Award # R305B60002, as administered by the Institute of Education Sciences, U.S. Department of Education. The findings and opinions expressed in this paper do not reflect the position or policies of the National Center for Education Research, the Institute of Education Sciences, or the U.S. Department of Education.

References

Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.
Applewhite, A., Evens, W. R. III, & Frothingham, A. (1992). And I quote. New York: St. Martin's Press.
Bond, L. (1995). Ensuring fairness in the setting of performance standards. In Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments (pp. 311-324). Washington, DC: National Assessment Governing Board and National Center for Education Statistics.
Braswell, J. S., Lutkus, A. D., Grigg, W. S., Santapau, S. L., Tay-Lim, B. S.-H., & Johnson, M. S. (2001). The nation's report card: Mathematics 2000. Washington, DC: National Center for Education Statistics.
Cizek, G. J. (1993).
Reactions to National Academy of Education report, Setting performance standards for student achievement. Washington, DC: National Assessment Governing Board.
Crocker, L. & Zieky, M. (1995). Executive summary: Joint conference on setting standards for large-scale assessments. In Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments (pp. ES-1 to ES-17). Washington, DC: National Assessment Governing Board and National Center for Education Statistics.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507).
CTB/McGraw-Hill. (2001). Kentucky standard setting technical report. Monterey, CA: Author.
Ebel, R. L. (1972). Essentials of educational measurement. Englewood Cliffs, NJ: Prentice-Hall.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes. American Psychologist, 18, 519-521.
Glaser, R. & Nitko, A. J. (1971). Measurement in learning and instruction. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 625-670). Washington, DC: American Council on Education.
Glass, G. V. (1978). Standards and criteria. Journal of Educational Measurement, 15, 237-261.
Goals 2000: Educate America Act of 1994, Public Law 103-227, Sec. 1 et seq. 108 Stat. 125 (1994).
Green, B. F. (1981). A primer of testing. American Psychologist, 36, 1001-1011.
Green, D. R., Trimble, C. S., & Lewis, D. M. (2003). Interpreting the results of three different standard-setting procedures. Educational Measurement: Issues and Practice, 22(1), 22-32.
Grigg, W. S., Daane, M. C., Jin, Y. & Campbell, J. R. (2003). The nation's report card: Reading 2002. Washington, DC: National Center for Education Statistics.
Hambleton, R. K. (1978). On the use of cut-off scores with criterion referenced tests in instructional settings. Journal of Educational Measurement, 15, 277-290.
Hambleton, R. K. (1980). Test score validity and standard-setting methods. In R. A. Berk (Ed.), Criterion-referenced testing: Motives, models, measures and consequences (pp. 80-123). Baltimore, MD: Johns Hopkins University Press.
Hambleton, R. K. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 89-116). Mahwah, NJ: Lawrence Erlbaum.
Hambleton, R. K. & Rogers, J. H. (1991). Advances in criterion-referenced measurement. In R. K. Hambleton & J. N. Zall (Eds.), Advances in educational and psychological testing (pp. 3-44). Boston: Kluwer Academic.
Hoover, H. D. (2003). Some common misconceptions about testing. Educational Measurement: Issues and Practice, 22(1), 5-14.
Holland, P. (2002). Measuring progress in student achievement: Changes in scores and score gaps over time. Report of the Ad Hoc Committee on Confirming Test Results. Washington, DC: National Assessment Governing Board.
Jaeger, R. M. (1989). Certification of student competence. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 485-514). New York: Macmillan.
Jaeger, R. M., Cole, J., Irwin, D. M., & Pratto, D. J. (1980).
An interactive structure judgment process for setting passing scores on competency tests applied to the North Carolina high school competency tests in reading and mathematics. Greensboro, NC: Center for Education Research and Evaluation, University of North Carolina at Greensboro.
Jaeger, R. M. & Mills, C. (2001). An integrated judgment procedure for setting standards on complex, large-scale assessments. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 313-318). Mahwah, NJ: Lawrence Erlbaum.
Kane, M. (1993). Comments on the NAE evaluation of NAGB achievement levels. Washington, DC: National Assessment Governing Board.
Kane, M. (1995). Examinee-centered and task-centered standard setting. In Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments (pp. 119-141). Washington, DC: National Assessment Governing Board and National Center for Education Statistics.
Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 53-88). Mahwah, NJ: Lawrence Erlbaum.
Kentucky Department of Education. (2001). Standard setting: Synthesis of three procedures: Procedures and findings. Frankfort, KY: Kentucky Department of Education. Available at www.kde.state.ky.us.
Koffler, S. L. (1980). A comparison of approaches for setting proficiency standards. Journal of Educational Measurement, 17, 167-178.
Lewis, D. M., Green, D. R., Mitzel, H. C., Baum, K. & Patz, R. J. (1998). The Bookmark standard setting procedure: Methodology and recent implementations. Paper presented at the 1998 annual meeting of the National Council on Measurement, San Diego, CA.
Linn, R. L. (1994). Criterion-referenced measurement: A valuable perspective clouded by surplus
meaning. Educational Measurement: Issues and Practice, 13(4), 12-14.
Linn, R. L. (1995). The likely impact of performance standards as a function of uses: From rhetoric to sanction. In Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments (pp. 267-276). Washington, DC: National Assessment Governing Board and National Center for Education Statistics.
Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4-16.
Linn, R. L. (in press). Accountability: Responsibility and reasonable expectations. Educational Researcher.
McLaughlin, D. & Bandeira de Mello, V. (2002). Comparison of state elementary school mathematics achievement standards using NAEP 2000. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA, April.
Mehrens, W. A. (1995). Methodological issues in standard setting for educational exam. In Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments (pp. 221-263). Washington, DC: National Assessment Governing Board and National Center for Education Statistics.
Mehrens, W. A. & Cizek, G. J. (2001). Standard setting and the public good: Benefits accrued and anticipated. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 477-485). Mahwah, NJ: Lawrence Erlbaum.
Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The Bookmark procedure: Psychological perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 249-281). Mahwah, NJ: Lawrence Erlbaum.
National Educational Goals Panel. (1993). Report of the Goals 3 and 4 Technical Planning Group on the Review of Education Standards. Washington, DC: Author.
No Child Left Behind Act of 2001. Public Law 107-110.
Olson, L. (2003). Approved is relative term for Ed. Dep.: 11 states fully meet ESEA requirements. Education Week, Vol. XXII, No. 43, August 6, pp. 1, 34-36.
Pellegrino, J. W., Jones, L. R., & Mitchell, K. J. (Eds.). (1999). Grading the nation's report card: Evaluating NAEP and transforming the assessment of educational progress. Washington, DC: National Academy Press.
Poggio, J. P., Glasnapp, D. R., & Eros, D. S. (1981). An empirical investigation of the Angoff, Ebel, and Nedelsky standard setting methods. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, April.
Popham, W. J. (1978). As always, provocative. Journal of Educational Measurement, 15, 297-300.
Raymond, M. R. & Reid, J. B. (2001). Who made thee judge? Selecting and training participants for standard setting. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 119-157). Mahwah, NJ: Lawrence Erlbaum.
Shepard, L. A. (1979). Setting standards. In M. A. Bunda & R. Sanders (Eds.), Practices and problems in competency-based measurement (pp. 72-88). Washington, DC: National Council on Measurement in Education.
Shepard, L. A. (1980). Technical issues in minimum competency testing. In D. C. Berliner (Ed.), Review of research in education: Vol. 8 (pp. 30-82). Washington, DC: American Educational Research Association.
Shepard, L. A. (1984). Setting performance standards. In R. A. Berk (Ed.), Criterion-referenced testing: Motives, models, measures and consequences (pp. 169-198). Baltimore, MD: Johns Hopkins University Press.
Shepard, L., Glaser, R., Linn, R. & Bohrnstedt, G. (1993). Setting performance standards for student achievement. Stanford, CA: National Academy of Education.
Zieky, M. J. (1995). A historical perspective on setting standards. In Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments (pp. 1-38). Washington, DC: National Assessment Governing Board and National Center for Education Statistics.
Zieky, M. J. (2001). So much has changed: How the setting of cutscores has evolved since the 1980s. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 19-51). Mahwah, NJ: Lawrence Erlbaum.

About the Author

Robert L. Linn
School of Education
University of Colorado at Boulder
Email: Robert.email@example.com

Robert L. Linn is Distinguished Professor of Education at the University of Colorado at Boulder and Co-Director of the National Center for Research on Evaluation, Standards, and Student Testing. He has published over 200 journal articles and chapters in books dealing with a wide range of theoretical and applied issues in educational measurement and has received several awards for his contributions to the field, including the ETS Award for Distinguished Service to Measurement, the E. L. Thorndike Award, the E. F. Lindquist Award, the National Council on Measurement in Education Career Award, and the American Educational Research Association Award for Distinguished Contributions to Educational Research. He is past president of the American Educational Research Association, past president of the National Council on Measurement in Education, past president of the Evaluation and Measurement Division of the American Psychological Association, and past vice-president for the Research and Measurement Division of the American Educational Research Association. He is a member of the National Academy of Education, a Lifetime National Associate of the National Academies, and serves on two Boards of the National Academy of Sciences.