
Educational policy analysis archives


Material Information

Educational policy analysis archives
Physical Description:
Arizona State University
University of South Florida
Arizona State University
University of South Florida.
Place of Publication:
Tempe, Ariz
Tampa, Fla
Publication Date:


Subjects / Keywords:
Education -- Research -- Periodicals   ( lcsh )
non-fiction   ( marcgt )
serial   ( sobekcm )

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E11-00160
usfldc handle - e11.160
System ID:



Education Policy Analysis Archives
Volume 8 Number 16, March 8, 2000, ISSN 1068-2341

A peer-reviewed scholarly electronic journal
Editor: Gene V Glass, College of Education, Arizona State University

Copyright 2000, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold.

Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

The Relationship between the Reliability and Cost of Performance Assessments

Jay Parkes
The University of New Mexico

Abstract

Performance assessments have come upon two major roadblocks: low reliability coefficients and high cost. Recent speculation has posited that the two are directly related such that cost must rise in order to increase reliability. This understanding may be an oversimplification of the relationship. Two empirical demonstrations are offered to show that more than one combination of sources of error may result in a desired generalizability coefficient and that it is possible to increase the number of observations while also decreasing cost.

The movement toward performance assessments for large-scale assessment purposes has encountered two major obstacles: first, such assessments have difficulty demonstrating highly reliable scores, and second, they tend to be very expensive. How these two problems are thought to be related influences the proposed solutions. This in turn will directly affect policies about the use of such assessments.

The problem of poor reliability in performance assessment scores stems from the lack of agreement among tasks, raters, and other sources of measurement error. This is exhibited in a variety of types of performance assessments by several concurrent lines of inquiry, including those by Shavelson and colleagues (e.g., Shavelson and Baxter, 1992; Shavelson, Baxter, and Gao, 1993); those from the Vermont Portfolio Assessment program (e.g., Koretz, Klein, McCaffrey, and Stecher, 1994; Koretz, Stecher, Klein, and McCaffrey, 1994; Koretz, Stecher, Klein, McCaffrey, and Deibert, 1994); and one by McWilliam and Ware (1994).

Shavelson and colleagues have worked primarily with performance assessments in elementary level general science. By using the framework of generalizability theory, they have demonstrated that the greatest contributing facet to low generalizability coefficients is the task (e.g., Shavelson, Baxter, and Gao, 1993). Furthermore, they project that increasing the number of tasks will result in a higher generalizability coefficient. Koretz and colleagues have worked with portfolio assessments of math and writing and identified raters and tasks as sources of error variance (Koretz, Stecher, Klein, McCaffrey, and Deibert, 1994). They, too, explore the possibility of increasing the number of tasks and the number of raters to achieve a more acceptable estimate of reliability. McWilliam and Ware (1994) examined the assessment of young children's engagement and identified the number of sessions or observations as a large source of error variance. They estimated the minimum number of sessions that would be necessary to create an acceptably reliable assessment.

A second major concern with performance assessments is their high cost (Picus, 1994). Performance assessments are widely believed to be more expensive than multiple-choice testing (Catterall and Winters, 1994; Hardy, 1996; Linn, Baker, and Dunbar, 1991; U.S. General Accounting Office, 1993), though the costs of performance assessments will vary considerably based on the exact nature of the assessment (Monk, 1996; U.S. General Accounting Office, 1993).
Reckase (1995) demonstrated that it is possible to produce a writing portfolio assessment procedure that meets current standards of psychometric quality; but such a procedure, compared to current multiple-choice methods, would be a "very expensive alternative" (p. 14). White (1986), however, holds that, when designed properly, a direct assessment of writing can be conducted at an expense comparable to that of multiple-choice assessment. This divergence notwithstanding, White (1986) recognized that the expenses are different for the two forms, the money being used mostly for raters in a direct assessment of writing. Hoover and Bray (1995) to some extent validated this claim by showing that the Iowa Writing Test could be conducted for approximately the same cost as the Iowa Tests of Basic Skills, although the former covered a much smaller domain than the latter.

These two problems of low reliability coefficients and high cost in performance assessment are often directly linked. If the solution to low generalizability is to increase the number of tasks, raters, etc., then the cost must also increase (e.g., Picus, 1994). There are a number of issues, however, that make this more complicated than it first appears.

The first issue is the automatic acceptance of a direct relationship between the number of observations in an assessment and the reliability of scores from that assessment. This acceptance is reinforced by a long history with the Spearman-Brown Prophecy Formula, which is used to address this issue with objective item assessments. In a multiple-choice test, it is possible to estimate the number of items necessary to reach a desired reliability coefficient. For example, if a test contains 50 multiple-choice items and the reliability coefficient for scores from that test is 0.76, the Spearman-Brown Prophecy Formula can be employed to estimate how many items would need to be added to increase the reliability estimate to 0.85.
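The 50-item example can be worked through directly. The following is a minimal sketch, assuming the standard Spearman-Brown relation and that any added items are parallel to the existing ones; the function and variable names are illustrative and do not come from the article.

```python
import math


def length_factor(current_rel: float, target_rel: float) -> float:
    """Spearman-Brown: factor by which the test must be lengthened."""
    return (target_rel * (1.0 - current_rel)) / (current_rel * (1.0 - target_rel))


def projected_reliability(current_rel: float, factor: float) -> float:
    """Spearman-Brown: reliability after multiplying test length by `factor`."""
    return (factor * current_rel) / (1.0 + (factor - 1.0) * current_rel)


# Values taken from the example in the text: 50 items, reliability 0.76, target 0.85.
current_items = 50
factor = length_factor(0.76, 0.85)

# Item counts must be whole numbers, so round the required length up.
needed_items = math.ceil(current_items * factor)
items_to_add = needed_items - current_items

print(f"lengthening factor: {factor:.3f}")
print(f"items needed: {needed_items} (add {items_to_add})")
print(f"projected reliability: "
      f"{projected_reliability(0.76, needed_items / current_items):.3f}")
```

Under these assumptions the test would need to grow from 50 to about 90 items, that is, roughly 40 additional items, to reach a projected reliability of 0.85.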
There is a direct (though asymptotic) relationship between the number of items used and the magnitude of the reliability coefficient. In a performance assessment, however, the relationship between a reliability estimate and the number of observations is more complicated because there are more sources of error. In a multiple-choice test, the items represent the only source of error. In a performance assessment, tasks, raters, occasions, and potentially many other sources of error are possible. The implication is twofold. First, there may be more than one combination of raters, tasks, etc. that will result in a reliability estimate of a given magnitude. Second, it is possible that fewer observations could lead to a larger estimate of the reliability of scores from a performance assessment. Therefore, it is no longer axiomatic that increasing reliability means adding more observations.

The second issue is that cost and reliability are seldom addressed simultaneously. By and large this is due to the methodologies employed for such projections. In an assessment procedure with multiple sources of error, the most common projective technique is a liberalization of the Spearman-Brown Prophecy Formula: the decision study, or d-study, from the generalizability theory framework. The d-study approach to addressing the joint issues of cost and reliability is less than desirable in a couple of ways.

D-studies are often done one at a time by considering different combinations of sources of error. That means that the process stops as soon as one combination reaches the desired reliability estimate. If there are several combinations of sources of error that would satisfy the desired reliability threshold, they probably would not be uncovered in this manner.

The d-study approach does not take cost information into consideration, which leaves the presumed direct relationship between the number of observations and cost to dictate the best combination of sources of error. Even if d-studies are conducted in such a manner that multiple combinations of sources of error are identified, all meeting a minimum reliability estimate, the one with the fewest total observations is likely to be selected for implementation.
It might be possible that more total observations could actually be less expensive. Without explicitly examining cost information, there is no way to know for sure.

The goal should be an optimal assessment design, where optimal is defined as the most reliable and least expensive. There is a technique that allows all of these issues to be handled simultaneously in one analysis. Sanders, Theunissen, and Baas (1989, 1991, 1992) proposed the use of a branch-and-bound integer programming algorithm which searches for and identifies the optimal number of levels for each facet while taking into account each facet's contribution to the generalizability coefficient and each facet's cost, as well as any other practical constraint. This technique appears to be promising. It can exhaustively search all possible combinations of levels of facets, within given parameters, something that could be a daunting task to perform "by hand" using only psychometric constraints. Thus it gives reasonable assurance that the optimal solution has been located.

A second advantage of this technique is that it can accommodate a wide variety of logistical, economic, or other constraints. So cost data and reliability data, as well as other relevant issues, can be used simultaneously to define an optimal assessment design.

These issues and procedures will now be demonstrated using two different studies. The first study concludes that, depending on the definition of "optimal," there are many possible best combinations of facets to produce a predetermined generalizability coefficient. The second study produces data supporting the Sanders et al. (1991) statement that it is possible to decrease the number of observations and/or the total cost while increasing the generalizability coefficient. Both studies are based on the same set of data.

The Optimization Studies

Subjects. Fifty subjects enrolled in an undergraduate educational psychology class participated in the study.
Twenty-eight percent of the sample were males and seventy-two percent were females. The sample also contained a mix of White, Asian-American, and Hispanic subjects. By class, the sample consisted of freshmen (20%), sophomores (52%), juniors (21%), and seniors (5%), with the remainder unidentified. The sample had taken an average of 1.26 writing courses, with a range from 0 to 3.

Procedures. Each subject read three articles (one about instructional approaches and two about performance assessments) prior to attending the first of two 2.5-hour sessions. During the first session, subjects filled out a demographic questionnaire and wrote a separate 300 to 500 word essay about each of two prompts. During the second session, subjects wrote essays on the other two prompts. In total, they wrote an expressive piece and a persuasive piece about the instructional approaches and an expressive piece and a persuasive piece about performance assessments. Four different orders of the prompts were counterbalanced to allow investigation of practice effects or other effects that may arise from writing the essays in a particular order.

Scoring the essays. Three graduate students in Educational Psychology served as raters and were trained. These raters were given the scoring rubric and discussed it; then, they scored a sample paper as a group. Using a slightly modified version of the Diederich scale (Diederich, 1974), each rater then read all 200 pieces of writing. The seven items on the scale were summed to achieve each subject's score on each piece of writing.

The Variance Models

The studies are based on a three-facet mixed design: mode of discourse (m), writing prompt (p), and rater (r). The object of measurement is the student's overall writing ability (s). In the data collection design, prompts are nested within mode (i.e., p:m) and both cross raters and students. In the generalizability framework, the variance model is:

[Equation 1: not preserved in the source text]

The variance components for the sample in this study were estimated using the GENOVA software program (Crick and Brennan, 1983).
Based on a review of the literature on modes of discourse (Crusius, 1989), there are at most five modes in existence. Therefore, for the estimation of variance components, the universe of modes was defined as having 5 levels. For all other facets, the universes were defined as infinite. The variance components estimated are shown in Table 1.

Table 1
Estimated Variance Components for Studies One and Two

Source of variation     Variance component
Subject (s)              5.8275728
Mode (m)                 0*
Prompt:mode (p:m)        0*
Rater (r)                5.6756912
sm                       0*
s(p:m)                   2.6025238
sr                       0.6714422
smr                      0.3008503
sr(p:m)                 11.8791415

*Note. Negative variance components were set equal to zero, following Brennan (1992).

For all subsequent optimization analyses, the relative model of measurement was used, wherein the relative error variances were estimated through:

[Equation 2: not preserved in the source text]

where nr, nm, and np are the numbers of raters, modes, and prompts, respectively. The G-coefficient of interest was therefore:

[Equation 3: not preserved in the source text]

Study One

In this study, results of a generalizability study and data describing the number of person-hours necessary to score the assessment have been used. Four different scenarios are presented, each with a different set of constraints and each producing a different optimal solution. The first scenario optimized the problem using only psychometric constraints; the second took a relative human factor constraint into consideration; the third used a specific human factor constraint; and the fourth used specific economic constraints.

The Optimization Scenarios

A branch-and-bound integer programming algorithm, a linear programming technique, was employed to estimate the optimal combination of raters, prompts within modes, and modes themselves. This investigation used the solver function of Microsoft EXCEL, version 5.0, to execute the algorithm. For all four scenarios, the variance components from Table 1 were entered into the worksheet.

All four scenarios investigated shared a common objective function and a common set of constraints. In Scenarios 2, 3, and 4, additional constraints were considered. The common problem to be solved across all scenarios is:

Objective Function:
    Minimize L = nm × np:m × nr, (4)
Subject to:
    g-coefficient ≥ 0.80, (5)
    nm ≤ 5, (6)
    nm, np:m, and nr are integers, (7)
    and nm, np:m, and nr ≥ 1. (8)

The objective function is to minimize the total number of observations needed. Constraint (5) specifies the minimal acceptable level of the generalizability coefficient. Constraint (6) specifies that there are no more than 5 possible modes of discourse. Constraints (7) and (8) ensure that solutions will be positive whole numbers.

In Scenario 1, the objective function defined in (4), subject to constraints (5) through (8), was submitted to the branch-and-bound search algorithm. The results of this search can be found in Table 2, which shows that, to attain a g-coefficient of at least 0.8, the minimum numbers are 4 modes with 2 prompts each, with two raters scoring each prompt in each mode. Based on data obtained from the sample, the average time needed to rate each prompt in each mode in this study was 0.092 hour (approximately 5.5 minutes). The total amount of time needed to rate the writings from ns subjects under any given scenario is then:

    Total person-hours = nm × np:m × nr × ns × 0.092 (9)

Applying Equation (9), the total person-hours needed for Scenario 1 for 50 subjects is 73.6.

Table 2
Results of Study One: Number of Cases Needed to Meet the Constraints

                 Actual   Scenario 1   Scenario 2   Scenario 3   Scenario 4
Mode               2          4            1            1            4
Prompt:Mode        2          2            4            6            2
Rater              3          2            5            3            2
Obj. Function     12         16           20           18           16
Person-hours      55.2       73.6         92           82.8         73.6
G Coefficient      0.75       0.80         0.80         0.80         0.80

Additional constraints:
    Scenario 2: nr ≥ nm(np:m)
    Scenario 3: nm(np:m) ≤ 6
    Scenario 4: nm(np:m)nr(50)(.092) ≤ 60 person-hours; nm(np:m)nr(50)(.092) ≤ 70 person-hours
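The relative error variance and G-coefficient equations are not available in this copy of the article, so the Scenario 1 search can only be illustrated under an assumption. The sketch below assumes the usual relative-error form for a design with prompts nested in modes and crossed with raters and students, in which each interaction component is divided by the number of levels of the facets it involves; with that assumption it reproduces the G coefficients reported in Table 2 to two decimals. It also swaps the branch-and-bound solver for a plain exhaustive enumeration, which is practical because the search space is small. The names `g_coefficient`, `person_hours`, `search`, and the `max_level` bound of 30 are hypothetical choices for illustration.

```python
from itertools import product

# Variance components from Table 1 (negative estimates already set to zero).
VAR = {"s": 5.8275728, "sm": 0.0, "s(p:m)": 2.6025238,
       "sr": 0.6714422, "smr": 0.3008503, "sr(p:m)": 11.8791415}


def g_coefficient(n_m, n_p, n_r):
    """Relative G coefficient under the ASSUMED error-variance formula."""
    rel_error = (VAR["sm"] / n_m
                 + VAR["s(p:m)"] / (n_m * n_p)
                 + VAR["sr"] / n_r
                 + VAR["smr"] / (n_m * n_r)
                 + VAR["sr(p:m)"] / (n_m * n_p * n_r))
    return VAR["s"] / (VAR["s"] + rel_error)


def person_hours(n_m, n_p, n_r, n_s=50, hours_per_rating=0.092):
    """Equation (9): total rating time for n_s subjects."""
    return n_m * n_p * n_r * n_s * hours_per_rating


def search(min_g=0.80, max_modes=5, max_level=30, extra=lambda m, p, r: True):
    """Enumerate designs; return (fewest observations, designs achieving it).

    `extra` is an optional additional constraint on (modes, prompts, raters).
    `max_level` is an arbitrary upper bound on prompts and raters (assumption).
    """
    feasible = [(m, p, r)
                for m, p, r in product(range(1, max_modes + 1),
                                       range(1, max_level + 1),
                                       range(1, max_level + 1))
                if g_coefficient(m, p, r) >= min_g and extra(m, p, r)]
    if not feasible:
        return None, []
    best = min(m * p * r for m, p, r in feasible)
    return best, [d for d in feasible if d[0] * d[1] * d[2] == best]


# Scenario 1: psychometric constraints only.
best, designs = search()
print(best, designs, [round(person_hours(*d), 1) for d in designs])

# Scenario 4: add a cap of 70 person-hours for 50 subjects.
print(search(extra=lambda m, p, r: person_hours(m, p, r) <= 70))
```

Under the assumed formula, the enumeration returns 16 observations with 4 modes, 2 prompts per mode, and 2 raters for Scenario 1, and no feasible design once rating time is capped at 70 person-hours, which agrees with the results reported in the text.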


An apparent practical problem with Scenario 1 is the demand on the examinee. A better solution might be one in which the burden of reliability is shifted away from the examinee and onto the number of ratings per piece of writing. In Scenario 2, a new constraint was added to shift this demand to ratings. The additional constraint and the results can be found in Table 2. To attain a g-coefficient of at least 0.8 while minimizing the burden on the examinee, the minimal design is one in which each examinee responds to 4 different prompts in a single mode of discourse. Each piece of writing needs to be rated by 5 raters. Under this scenario, the total number of writings from each examinee is only four. However, the total time needed to rate 50 subjects increases to 92 person-hours.

In Scenario 3, a compromise between Scenarios 1 and 2 was investigated by constraining the total number of pieces to six or fewer (see Table 2). Under this scenario, each examinee must produce 6 pieces of writing in a single mode. On the other hand, only 3 raters are needed for each piece to attain a g-coefficient of 0.8 or higher. The total person-hours for 50 subjects in this case is 82.8.

Scenario 4 investigated the cost factor. The lowest number of person-hours so far has been 73.6 in Scenario 1. Scenario 4 attempted to explore the possibility of a person-hour estimate lower than that. Table 2 lists the two constraints attempted, neither of which produced a feasible solution. In other words, it is not possible to expend fewer than 70 person-hours of rating activity to rate the writings used in this study for 50 subjects and still maintain a minimum g-coefficient of 0.8.

Conclusions from Study One

In a single-facet measurement situation, a multiple-choice exam for example, there is only one source of error to draw on to increase a reliability coefficient: items.
So a one-to-one relationship exists between the number of levels of the facet and the reliability coefficient: as the number of items increases, so does the reliability coefficient, although the relationship is asymptotic. Also, there is a unique minimum number of items that will satisfy the desired reliability coefficient. For example, if a 50-item exam has a reliability coefficient of 0.69, the Spearman-Brown Prophecy Formula indicates that about 203 items are needed in order to achieve a coefficient of 0.90. In a multi-faceted situation like the one represented here, the relationships are not so clear. With multiple facets, each contributing unequally to the generalizability coefficient in proportion to the size of its variance component, there is no simple one-to-one relationship. Scenario 1 uses psychometric constraints alone (as the Spearman-Brown Prophecy Formula or other projective techniques would), yet relative to the actual design the number of modes increases by 2, the number of prompts within mode does not change, and the number of raters decreases by one. Thus, in multi-faceted situations using only psychometric criteria, the relationship between the facets and the generalizability coefficient is not straightforward or simple.

Nor, in a multi-faceted situation, is there a single combination that will uniquely fulfill the predetermined generalizability coefficient. The first step is to define optimal in some way. The optimization procedure allows a great deal of latitude in doing so. The four scenarios taken together demonstrate that there are many optimal combinations that will fulfill the predetermined generalizability coefficient.

Study Two

The second study is similar to the first except that, instead of using person-hours as the economic constraint, it employs dollar figures. Second, instead of minimizing the total number of observations in order to constrain costs, it uses total cost as the objective function. The variance model in Study Two is the same as that in Study One.

The Cost Data

The cost data for this study are taken from Hoover and Bray (1995), who report cost information for an administration of the Iowa Writing Assessment. The assessment tested the writing skills of 30,000 school students from grades three to twelve, each of whom wrote two pieces of writing. Each sample was scored twice holistically and twice analytically. For this assessment, Hoover and Bray estimate that $138,000 was spent in developing the 40 writing prompts; $174,410 was spent to score the prompts; and $30,000 was spent for materials. This breakdown is consistent with a framework for examining costs explained by Hardy (1996). In order to use this information in the optimization procedure, base units of development, scoring, and materials need to be developed. That is, figures need to be obtained that indicate how much adding one rating (for example) to the scenario will change scoring costs, or how much adding one prompt will change development and scoring costs.

The cost of development hinges on the total number of prompts developed, which in Hoover and Bray (1995) was 40; therefore, each prompt costs $3,450 to develop ($138,000/40). In that study, each examinee wrote two prompts. Had each written only one prompt, presumably only 20 prompts would have been developed. Therefore, the $3,450 is divided by 2, the number of prompts each examinee responded to, producing a cost per prompt required of an examinee of $1,725. That represents the base unit cost for development. Therefore, the development cost function is $1725 × np:m × nm, where np:m is the number of prompts each person must write per mode and nm is the number of modes.
To obtain the base unit cost for scoring, the total scoring cost ($174,410) was divided by the number of subjects (30,000), the number of pieces per subject (2), and the number of raters or readings per piece (2) to produce a unit scoring cost of $1.43 per piece, per rater, per subject. The materials were estimated to cost $1.00 per subject. For the purposes of these analyses, the number of subjects was held constant at 50. Therefore, the total cost function, combining development, scoring, and material costs, is:

    Total Cost = $1725 × np:m × nm + $1.43 × np:m × nm × nr × ns + $1.00 × ns (10)

The Optimization Problem

The variance components from Table 1, the cost function given in Equation (10), and the number of prompts within modes, modes, raters, and subjects were entered into the EXCEL worksheet, and the following optimization problem was submitted for analysis.

Objective Function:
    Minimize L = Total Cost = $1725 × np:m × nm + $1.43 × np:m × nm × nr × ns + $1.00 × ns, (11)
subject to constraints (5) through (8) given in Study One.

The results are given in Table 3. Since the procedure was minimizing cost rather than the number of observation points, the optimal design includes more observation points (27 versus 12) but at less cost and with a higher generalizability coefficient.
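The cost-minimization problem can be sketched in the same spirit. The block below is a minimal illustration, not the EXCEL solver procedure used in the study: it assumes the usual relative-error formula for this design (the article's own equation is not available in this copy), uses the unit costs exactly as reported in the text, and replaces branch-and-bound with exhaustive enumeration. The names `total_cost`, `cheapest_designs`, and the `max_level` bound of 30 are hypothetical.

```python
from itertools import product

# Variance components from Table 1 (negative estimates already set to zero).
VAR = {"s": 5.8275728, "sm": 0.0, "s(p:m)": 2.6025238,
       "sr": 0.6714422, "smr": 0.3008503, "sr(p:m)": 11.8791415}

DEV_COST = 138_000 / 40 / 2   # $1,725 per prompt required of an examinee
SCORE_COST = 1.43             # per piece, per rater, per subject, as reported
MATERIAL_COST = 30_000 / 30_000  # $1.00 per subject
# Note: 174,410 / (30,000 * 2 * 2) evaluates to about 1.45; the reported
# $1.43 is used here because it is the figure behind Table 3.


def g_coefficient(n_m, n_p, n_r):
    """Relative G coefficient under the ASSUMED error-variance formula."""
    rel_error = (VAR["sm"] / n_m
                 + VAR["s(p:m)"] / (n_m * n_p)
                 + VAR["sr"] / n_r
                 + VAR["smr"] / (n_m * n_r)
                 + VAR["sr(p:m)"] / (n_m * n_p * n_r))
    return VAR["s"] / (VAR["s"] + rel_error)


def total_cost(n_m, n_p, n_r, n_s=50):
    """Equation (10): development + scoring + materials, in dollars."""
    tasks = n_m * n_p  # pieces of writing required of each examinee
    return DEV_COST * tasks + SCORE_COST * tasks * n_r * n_s + MATERIAL_COST * n_s


def cheapest_designs(min_g=0.80, max_modes=5, max_level=30):
    """Enumerate designs meeting constraints (5)-(8); return (min cost, designs)."""
    feasible = [d for d in product(range(1, max_modes + 1),
                                   range(1, max_level + 1),
                                   range(1, max_level + 1))
                if g_coefficient(*d) >= min_g]
    best = min(total_cost(*d) for d in feasible)
    return best, [d for d in feasible if total_cost(*d) == best]


print(f"actual design (2, 2, 3): ${total_cost(2, 2, 3):.2f}")
best_cost, best = cheapest_designs()
print(f"minimum cost: ${best_cost:.2f} for designs {best}")
```

With these assumptions the actual design costs $7,808 and the cheapest feasible design costs $7,155.50, which rounds to the $7,156 reported in Table 3 for 1 mode, 3 prompts, and 9 raters. The enumeration also surfaces a second design (3 modes, 1 prompt each, 9 raters) that ties at the same cost under the assumed formula, since Equation (10) depends on modes and prompts only through their product.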


Table 3
Results of Study Two: Number of Cases Needed to Meet the Constraints

                 Actual   Optimal
Mode               2         1
Prompt:Mode        2         3
Rater              3         9
Obj. Function     12        27
Total Cost       $7808     $7156
G Coefficient      0.75      0.80

Conclusions from Study Two

This second study provides empirical support for the claim made by Sanders, Theunissen, and Baas (1989) that it is possible to decrease cost while increasing the generalizability coefficient even when the total number of observation points increases.

Discussion

These studies serve as illustrations of the issues raised in the introduction. The first study demonstrates that it is possible to have many combinations of facets in an assessment design meet some predetermined level of reliability coefficient. The second study demonstrates the advantages of simultaneously considering cost and reliability data in the same analysis, namely, that it is possible to achieve a more reliable but less costly design.

Both of these points need to be taken into consideration during discussions about the cost implications of various solutions to the low reliability problem associated with performance assessment scores. If we assume that the only way to increase the reliability is to increase the number of observations and/or we assume that increasing reliability will automatically increase cost, these stumbling blocks will not be removed. Policy makers will continue to be very reluctant to choose performance assessments as parts of their assessment plans.

These demonstrations represent a narrow perspective, though, and were designed to demonstrate only the two issues already mentioned. They are narrow in two ways. First, they may oversimplify the estimation of true costs of performance assessments. Second, they address only reliability and cost and not other concerns.

The costs associated here with performance assessments are expressed in dollars and cents and are rather simple.
For example, development costs would change depending on the number of examinees (Parkes, 1996). More examinees would require that more prompts be developed, and the cost would probably change in some exponential fashion. This relationship is held constant by assuming the same number of examinees in each scenario. There are also many other ways to conceptualize cost, some of which would be very difficult to quantify. Monk (1996) and Picus (1994) describe the difficulties in determining the actual "costs" of a performance assessment. There are, of course, the financial expenditures associated with an assessment system. But more nebulously, there will be expenditure of time by students, teachers, and administrators to conduct these assessments. There is also cost in terms of what curriculum changes are made to accommodate the testing, that is, what students would otherwise be learning in the time taken for assessment.

The studies reported here are also narrow in that they address only reliability and cost and not other concerns. There are plenty of other considerations besides reliability and cost that are equally or more important in the design of a performance assessment. The content sampling issue is one of these. Deciding how many tasks should constitute an assessment should probably be addressed in terms of content coverage first. Though certain constraints could be added to an optimization problem to account for content coverage issues, it is probably not best to handle the issue in that manner. The optimization approach treats each facet of the design equally or weights it based on its contribution to error variance. It therefore works on the implicit assumption that one rater means essentially the same thing as one task, which means essentially the same thing as one occasion, etc. But raters, tasks, and occasions all serve different purposes in the assessment and contribute different things to the construct validity of the scores. So to trade three tasks for five ratings is, at best, contrived.

These issues provide a necessary context for the studies reported here but should not distract attention from the two central issues of this paper. First, more than one combination of sources of error may result in a desired generalizability coefficient.
Second, it is possible to increase the number of observations while also decreasing cost.

Conclusion

The notion that only one design will generate a g-coefficient of a given value is not accurate. There are many possible combinations of facets, depending on how the optimal solution is defined, that will meet a desired g-coefficient value. The relationship between an assessment design and a corresponding generalizability coefficient needs to be more broadly understood.

The inference that generalizability coefficients and the number of observations are directly related is inappropriate. It is possible that several different designs would achieve acceptable generalizability coefficients. Similarly, a direct relationship between cost and reliability is not exact. Study Two shows that it is possible to increase the generalizability coefficient and the number of observations while decreasing the total cost of the assessment.

The bottom line for policymakers and those involved in performance assessment programs is that it is theoretically possible to have both a reliable and cost-effective performance assessment system. Assuming that low cost is the "line in the sand," those developing performance assessments should not assume that this means they must minimize the number of ratings or the number of pieces in an assessment. Indeed, increasing certain aspects, like ratings, might actually end up being cheaper and still produce more reliable scores.

References

Catterall, J. S., and Winters, L. (1994, August). Economic analysis of testing: Competency, certification, and "authentic" assessments (CSE Technical Report 383). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.


Crick, J. E., and Brennan, R. L. (1982). GENOVA: A generalized analysis of variance system (FORTRAN IV computer program and manual). Iowa City: ACT.

Crusius, T. W. (1989). Discourse: A critique and synthesis of major theories. New York: Modern Language Association of America.

Diederich, P. B. (1974). Measuring Growth in English. Urbana, IL: National Council of Teachers of English.

Hardy, R. (1996). Performance assessment: Examining the costs. In M. B. Kane and R. Mitchell (Eds.), Implementing performance assessment: Promises, problems, and challenges (pp. 107-117). Mahwah, NJ: Lawrence Erlbaum Associates.

Hoover, H. D., and Bray, G. (1995, April). The research and development phase: Can a performance assessment be cost effective? Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA.

Koretz, D., Klein, S., McCaffrey, D., and Stecher, B. (1994). Interim report: The reliability of the Vermont portfolio scores in the 1992-93 school year (RAND/RP-260). Santa Monica, CA: RAND.

Koretz, D., Stecher, B., Klein, S., and McCaffrey, D. (1994). The Vermont portfolio assessment program: Findings and implications. Educational Measurement: Issues and Practice, 13 (3), 5-16.

Koretz, D., Stecher, B., Klein, S., McCaffrey, D., and Deibert, E. (1994). Can portfolios assess student performance and influence instruction? (RAND/RP-259). Santa Monica, CA: RAND.

Linn, R. L., Baker, E. L., and Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20 (8), 15-21.

Monk, D. H. (1996). Conceptualizing the costs of large-scale pupil performance assessment. In M. B. Kane and R. Mitchell (Eds.), Implementing performance assessment: Promises, problems, and challenges (pp. 119-137). Mahwah, NJ: Lawrence Erlbaum Associates.

McWilliam, R. A., and Ware, W. B. (1994). The reliability of observations of young children's engagement: An application of generalizability theory. Journal of Early Intervention, 18 (1), 34-47.

Parkes, J. (1996, April). Optimal designs for performance assessments: The subject factor. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New York, NY.

Picus, L. O. (1994, August). A conceptual framework for analyzing the costs of alternative assessment (CSE Technical Report 384). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.

Reckase, M. D. (1995). Portfolio assessment: A theoretical estimate of score reliability. Educational Measurement: Issues and Practice, 14 (1), 12-14.


Sanders, P. F., Theunissen, T. J. J. M., and Baas, S. M. (1989). Minimizing the number of observations: A generalization of the Spearman-Brown formula. Psychometrika, 54, 587-598.

Sanders, P. F., Theunissen, T. J. J. M., and Baas, S. M. (1991). Maximizing the coefficient of generalizability under the constraint of limited resources. Psychometrika, 56(1), 87-96.

Sanders, P. F., Theunissen, T. J. J. M., and Baas, S. M. (1992). The optimization of decision studies. In M. Wilson (Ed.), Objective measurement: Theory into practice (Vol. 1). Norwood, NJ: Ablex.

Shavelson, R. J., and Baxter, G. P. (1992). What we've learned about assessing hands-on science. Educational Leadership, 49(8), 20-25.

Shavelson, R. J., Baxter, G. P., and Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30(3), 215-232.

U.S. General Accounting Office. (1993). Student testing: Current extent and expenditures, with cost estimates for a national examination (GAO/PEMD-93-8). Washington, DC: U.S. General Accounting Office, Program Evaluation and Methodology Division.

White, E. M. (1986). Pitfalls in the testing of writing. In K. L. Greenberg, H. S. Wiener, and R. A. Donovan (Eds.), Writing assessment: Issues and strategies. New York: Longman.

About the Author

Jay Parkes
Educational Psychology Program
128 Simpson Hall
University of New Mexico
Albuquerque, NM 87131-1246

Jay Parkes is an Assistant Professor of Educational Psychology at the University of New Mexico. He received his Ph.D. in Educational Psychology from the Pennsylvania State University in 1998, specializing in applied measurement and statistics. His research interests include generalizability in performance assessment and the cognitive and motivational aspects of performance assessments.

Copyright 2000 by the Education Policy Analysis Archives

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, at the College of Education, Arizona State University, Tempe, AZ 85287-0211 (602-965-9644). The Commentary Editor is Casey D. Cobb.



