E11-00066
Educational policy analysis archives. Vol. 4, no. 17 (November 11, 1996).
Tempe, Ariz. : Arizona State University ; Tampa, Fla. : University of South Florida. November 11, 1996.
Includes EPAA commentaries.
What does the psychometrician's classroom look like? : reframing assessment concepts in the context of learning / Catherine S. Taylor [and] Susan Bobbitt Nolen.
Arizona State University.
University of South Florida.
Education Policy Analysis Archives (EPAA)
ISSN 1068-2341. Volume 4, issue 17. Date issued: 1996-11-11.

Education Policy Analysis Archives
Volume 4 Number 17
November 11, 1996
ISSN 1068-2341

A peer-reviewed scholarly electronic journal. Editor: Gene V Glass, Glass@ASU.EDU. College of Education, Arizona State University, Tempe AZ 85287-2411

Copyright 1996, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article provided that EDUCATION POLICY ANALYSIS ARCHIVES is credited and copies are not sold.

What Does the Psychometrician's Classroom Look Like?: Reframing Assessment Concepts in the Context of Learning

Catherine S. Taylor
University of Washington
ctaylor@u.washington.edu

Susan Bobbitt Nolen
University of Washington
sunolen@u.washington.edu

Abstract

We question the utility of traditional conceptualizations of validity and reliability, developed in the context of large scale, external testing, and the psychology of individual differences, for the context of the classroom. We compare traditional views of validity and reliability to alternate frameworks that situate these constructs in teachers' work in classrooms. We describe how we used these frameworks to design an assessment course for preservice teachers, and present data that suggest students in the redesigned course not only saw the course as more valuable in their work as teachers, but developed deeper understandings of validity and reliability than did their counterparts in a traditional tests and measurement course. We close by discussing the implications of these data for the teaching of assessment, and for the use and interpretation of classroom assessment data for purposes of local and state accountability.

More than ever before, pressure is being placed on teachers to create high quality assessments of their students' learning. Work is underway in Kentucky, New Mexico, Vermont, Washington, and in the eighteen states that are members of the New Standards Project (Resnick and Resnick, 1991) to explore the viability of classroom-based assessments, projects, and portfolios as sources of state or national accountability data about student learning. These initiatives emerge from a growing belief that the teacher can be one of the best sources of information about student learning. However, there is growing evidence that teachers have not been adequately prepared to create and conduct valid assessments. Even teacher education programs that include an assessment course may not help teachers develop the concepts and skills necessary to meet these assessment demands.

To address this problem, districts, states, and national organizations have invested considerable resources in in-service training for teachers. Organizations such as the National Council on Measurement in Education (NCME) and the Association for Supervision and Curriculum Development (ASCD) have developed training modules and training materials for classroom teachers. Groups such as the National Council of Teachers of Mathematics (NCTM) have developed documents such as Mathematics Assessment: Myths, Models, Good Questions, and Practical Suggestions (NCTM, 1991) and Assessment Standards for School Mathematics (NCTM, 1995) in an attempt to help teachers incorporate more appropriate assessments into their teaching practices. Still, these efforts may not be successful if the models used to educate teachers in the concepts and skills of assessment do not fit the reality of classrooms.

An example of the confusion caused by the mismatch between models based on test theory and the demands of the classroom context illustrates this problem. Preservice teachers in an assessment class had read Smith's (1991) article on the meanings of test preparation. Smith lists a number of ways teachers prepare for external standardized tests, including teaching the specific content covered on the test. Students were surprised to find that psychometricians considered this to be cheating. Were they not being admonished, by both the instructor and the course textbook, to do just that--assess to see whether students were learning what had been taught? To these students, if vocabulary words were to be tested, they should be taught.
If science or social studies concepts and facts were to be tested, they should be taught. Even if the test expected students to generalize a concept or skill to a new situation, the concept or skill should have been taught first! In the words of one puzzled student, "What does the psychometrician's classroom look like?"

This apparent discrepancy between the idea of "domain sampling" central to test theory and the notion that classroom assessment is intended to assess whether students learn what they are taught arises from a clash of contexts. The world of large scale external tests is very different from the world of the classroom. In this paper, we will argue that traditional tests and measurement courses and most assessment textbooks for teachers present measurement concepts in ways that better fit the world of external tests designed to measure individual differences. When teachers are taught traditional measurement concepts and expected to apply them to the context of teaching and learning, they have little chance of developing the skills and concepts they need to assess their students. We will also argue that the meanings of assessment in the context of the classroom must be considered carefully when large scale assessment programs decide to use classroom assessments for the purposes of district, state, or national accountability.

We begin by challenging traditional notions of testing and measurement in terms of their fit to the classroom. While we recognize that the principles of classical test theory may be appropriate for some contexts (e.g., administering and interpreting standardized, norm-referenced tests), we see a need for more clarity in how these models, their applications and limitations, are presented to teachers. We discuss the theoretical underpinnings of traditional measurement concepts and why they must be reframed in light of the classroom context. We examine the ways in which reliability and validity are presented in eight recently published assessment texts designed for teacher preparation and discuss why definitions of validity and reliability presented in most educational assessment textbooks fit the context of external testing better than that of the classroom.

Next we present frameworks for validity and reliability that situate these constructs in the world of the classroom teacher, and discuss how these frameworks might be used in teacher education. We then present an overview of the assessment course we developed to help preservice teachers understand the concepts of validity and reliability as they are reframed in this paper. The work of the course was designed to help preservice teachers develop a deep understanding of the potential relationship between classroom assessment practices, subject-area disciplines, and instructional methods so that they would see valid and reliable assessment as central to their work as teachers. Evidence for the effectiveness of basing our assessment course on these frameworks is provided in the form of three studies comparing the responses of students in the redesigned course to those taking a traditional tests and measurement course in the same teacher education program.

We discuss the need for the measurement community to acknowledge the differences between the methods appropriate for external measurements and the measurement of the learning targeted by classrooms and schools. We suggest that those who prepare assessment textbooks for the preparation of teachers, as well as instructors of assessment courses, clarify the philosophical positions underlying different assessment purposes and present assessment concepts in ways that are consistent with those differing purposes rather than attempting to blend frameworks that come from different philosophies about the purposes of assessment. Finally we discuss what these classroom-based conceptions of reliability and validity suggest in terms of what constitutes appropriate classroom-based evidence for large scale assessment programs.

The Misfit of the Measurement Paradigm

The classroom context is one of fairly constant formal and informal assessment (Airasian, 1993; Stiggins, Faires-Conklin, & Bridgeford, 1986). However, few teacher preparation programs provide adequate training for the wide array of assessment strategies used by teachers (Schafer & Lissitz, 1987; Stiggins & Bridgeford, 1988).
Further, teachers do not perceive the information learned in traditional tests and measurement courses to be relevant to their tasks as classroom teachers (Gullickson, 1993; Schafer & Lissitz, 1987; Stiggins & Faires-Conklin, 1988). Wise, Lukin, and Roos (1991) found that teachers do not believe they have the training needed to meet the demands of classroom assessment. At the same time, teachers' ability to develop appropriate classroom-based assessments is seen as one of the six core functions of teachers (Gullickson, 1986).

Several authors have outlined what they believe are the essential understandings about assessment teachers must have in order to confront the ongoing assessment demands in the typical classroom (Airasian, 1991; Linn, 1990; Schafer, 1991; Stiggins, 1991). Many of these concepts and skills, as well as those presented in measurement textbooks (e.g., Hanna, 1993; Linn & Gronlund, 1995; Mehrens & Lehmann, 1991; Nitko, 1996; Oosterhof, 1996; Salvia & Ysseldyke, 1995; Worthen, Borg, & White, 1993), are derived from a model of measurement that began in the late 1800s. Rooted in scientific thinking of the nineteenth century, test theory is based on a model of the scientific method.

With classroom instruction as the equivalent of a treatment, test theory would suggest that tools of assessment are designed to carefully assess the success of instruction for different examinees. From the perspective of Galton (1889), students differ in their inherent capacity to learn the content of various disciplines. The assessor is the scientist who must dispassionately assess and record each student's attainment of the defined outcomes of instruction. Students are the focus of observation and the measurement model presumes them to behave like passive objects. As Cronbach (1970) noted,

A distinction between standardized and unstandardized procedures grew up in the early days of testing. Every laboratory in those days had its own method of measuring . . . and it was difficult to compare results from different laboratories. . . . Standardization attempts to overcome these problems. A standardized test is one in which the procedure, apparatus, and scoring have been fixed so that precisely the same testing procedures can be followed at different times and places. . . . If standardization of the test is fully effective, a man will earn very nearly the same score no matter who tests him or where. (pp. 26-27, italics added)

The classroom teacher, however, is not a dispassionate observer of students' learning. Classroom teachers have a vested interest in the outcomes of instruction--many believing that student failure is a reflection on their teaching. Both the popular press and current legislation in states such as Kentucky would suggest that the public agrees with this view of the relationship between teaching and learning. The classroom teacher, in contrast to the experimental scientist, is more like a "participant observer" (Whyte, 1943). Using the words of Vidich and Lyman (1994), the teacher is much like an ethnographic researcher. In the following quote, the authors' use of the term "ethnographic researcher" has been replaced by the term "teacher."

The [teacher] enters the world from which he or she is methodologically required to have become detached and displaced. . . . [T]his [teacher] begins work as a self-defined newcomer to the habitat and life world of his or her [students]. He or she is a citizen-scholar as well as a participant observer. (Vidich & Lyman, 1994, p. 41)

Teachers adjust instruction for the needs of students, adapt instruction for the needs of diverse students, and bring a wide range of evidence to bear on decision-making about students, extending beyond the evidence from standardized tests to observations of students' classroom behaviors, attitudes, interests, and motivations (Airasian, 1994). The purpose of classroom assessment is to find out whether students have benefited from instruction.
However, unlike the dispassionate observer, the good teacher regularly adjusts the treatment, in response to ongoing assessments, in order for learning to be successful.

While the participant observer may be required to use certain methods to increase their "objectivity," they must both observe and participate in the world of the classroom. They "make their observations within a mediated framework, that is, a framework of symbols and cultural meanings given to them by those aspects of their life histories that they bring to the observational setting" (Vidich & Lyman, 1994, p. 24). The teacher's decision to attend to one source of assessment information over another reveals as much about the "value-laden interests" of the teacher as it does about the subject of her/his assessments (Vidich & Lyman, 1994, p. 25). While this may be seen by measurement professionals as the reason objective measures are needed, qualitative researchers would respond that "The more you function as a member of the everyday world of the researched, the more you risk losing the eye of the uninvolved outsider; yet, the more you participate, the greater your opportunity to learn" (Glesne & Peshkin, 1992, p. 40, italics added). Qualitative researchers would say that the very choice of what items to include in a test reflects the values and biases of the teacher. Hence the job of those who prepare teachers for classroom assessment must include an awareness of the context in which teachers teach, the goals of instruction and schooling, and the complex demands of the work of a participant observer.

If teachers are not dispassionate observers, neither are students passive objects. They are influenced by assessment processes and products (Bricklin & Bricklin, 1967; Butler, 1987; Covington & Beery, 1976; Deci & Ryan, 1987). They adapt their approach to learning and preparation for assessment in order to gain the highest possible scores (Toom, 1993). They may take on personas that will afford them the grace of teachers. Hence, neither teachers nor students fit the scientific model of standardized measurement used to frame the measurement concepts and strategies taught to teachers.

Assessment and Teacher Preparation Programs

Despite the importance of assessment in the experience of students and in teachers' ability to determine the success of instruction in terms of student learning, assessment instruction is peripheral in many teacher education programs. In programs that do include assessment courses, assessment is usually treated as a foundational course focused on a set of generalizable concepts and skills. In most programs, all prospective teachers, from the kindergarten teacher, to the AP calculus teacher, to the middle school vocal music teacher, are taught in a single group. In others, assessment instruction is relegated to a 1-2 week unit in an omnibus educational psychology course. In response to the formidable range of assessment content teachers need to know, instructors may design courses that result in intellectual awareness of key concepts rather than actual competency in applying them. Research on the professional development of teachers (e.g., Cohen & Ball, 1990; Grossman, 1991) suggests that intellectual awareness is not sufficient to overcome the "apprenticeship of observation" (Lortie, 1975) that dominates pre-service teachers' learning. Without significant intervention, pre-service teachers typically adopt the practices that were used with them as students or those that are used by their cooperating teachers.

Assessment textbooks generally reflect a view of assessment courses as survey courses, intended to present a range of assessment ideas and leaving to instructors (or the students themselves) the task of constructing a coherent picture of assessment. As Anderson et al. (1995) have noted, survey approaches to the preparation of teachers do not allow for a "rich and grounded" understanding. Ironically, textbook authors' attempts to acknowledge the classroom context may contribute to teachers' confusion and antipathy. Many textbooks (e.g., Hanna, 1993; Linn & Gronlund, 1995; Mehrens & Lehmann, 1991; Oosterhof, 1996; Salvia & Ysseldyke, 1995; Worthen, Borg, & White, 1993) combine presentations of assessment in the classroom with traditional presentations of the principles of testing and basic concepts of measurement.

As we will argue in the next section, the notions of validity and reliability used in large scale external testing must be recast before they can be useful in the context of classroom teaching and learning. With the increased emphasis on appropriate assessment practices in the classroom, we must take seriously the gulf between what classroom teachers believe they need to know about assessment and what measurement professionals believe teachers need to know. In the next sections, we provide frameworks for bridging this gulf.

Definitions of Validity

Traditional Presentations of Validity

All of the assessment textbooks reviewed for this article acknowledged the contextual issues in the classroom; however, chapters on validity generally used the language of scientific methodology to describe this construct. Most of these texts (e.g., Hanna, 1993; Linn & Gronlund, 1995; Nitko, 1996; Salvia & Ysseldyke, 1995; Oosterhof, 1996; Worthen, Borg, & White, 1993) presented three or four "types" of validity (construct validity, content validity, and criterion-related (predictive and/or concurrent) validity) and recommended that evidence for each type of validity should be obtained when using a test. Measurement professionals generally agree that for assessments to be valid, they should (a) measure the construct they are intended to measure, (b) measure the content taught, (c) predict students' performance on subsequent assessments, and (d) provide information that is consistent with other, related sources of information. Consequences of test interpretation and use, a validity issue recently raised by Messick (1989), is addressed by few published classroom assessment texts (for examples, see Hanna, 1993; Linn & Gronlund, 1995; Nitko, 1996). In fact, some would disagree that "consequential validity" is a component of the construct of validity at all (see Stuck, 1995).

Traditional presentations of these types of validity often define evidence for validity in terms of: (a) correlations between tests measuring the same construct or between a test and the criterion behavior of interest (Hanna, 1993; Linn & Gronlund, 1995; Nitko, 1996; Worthen, Borg, & White, 1993), (b) tables of specification to determine whether the content of a test measures the breadth of content targeted (Linn & Gronlund, 1995; Mehrens & Lehmann, 1991; Oosterhof, 1996), and (c) using a range of strategies to build a logical case for the relationship between scores from the assessment and the construct the assessment is intended to measure (Linn & Gronlund, 1995; Nitko, 1996; Oosterhof, 1996).

These types of validity evidence are based on two different notions of what makes an assessment valid. The evidence for the validity of an assessment is provided if (a) students perform consistently across different measures of the same construct (a notion that comes from a theory of individual differences (Galton, 1889)) and (b) there are links between what is measured and some framework or context external to the test (Linn & Gronlund, 1995; Messick, 1989). Taken individually, these two prongs of validity theory do not have equal value in the classroom. Classroom teachers are less interested in the consistency of student performance across similar measures than they are in whether students learn what they are teaching (the targeted constructs). Learning, especially of skills and strategies that are taught throughout schooling, is expected to change rather than remain consistent over time.

Consistency with other, related performances is also problematic for teachers as they teach each new group of students. Given the option of looking over prior school records, teachers often claim that they do not want to be prejudiced by others' views (Airasian, 1991, p. 54).
Over the course of a year, inconsistent performance may be attributed to many factors other than the validity of assessments. Students who begin to perform more poorly than expected may be informally assessed through interviews with the students and reviews of their work. Teachers may become alarmed and contact school support staff and/or parents to see if the cause lies outside the classroom. On the other hand, when poorly performing students begin to dramatically improve performance, teachers may see this as evidence of student learning and of their own success as teachers. Consistent performance across assessments is only desirable when performance is consistently good or when the content taught is constantly changing (e.g., spelling lists).

As Moss's (1996) paper suggests, the notion of the assessor as "objective observer" does not fit the context of educational assessment as well as it does the work of experimental science. Teachers see students as the focus of purposeful action (Bloom, Madaus, & Hastings, 1981). Tests and other assessments provide information, not only about how well students have learned, but about how well teachers are presenting the targeted content and concepts (Airasian, 1993; Mehrens & Lehmann, 1991; Nitko, 1996; Oosterhof, 1996), and how students are feeling about school, themselves, and their worlds (Airasian, 1993). Hence it is the responsibility of measurement professionals to help teachers learn how to choose and create assessment tools that will do the best job possible to make appropriate decisions about students' learning. This requires teachers to have a clear notion of validity that fits the work and the world view of teachers.

Validity in the Classroom Context

In this section, we situate Messick's (1989) dimensions of validity in the context of classroom teachers' decision-making. Messick claimed that construct validity is the core issue in assessment, and stated that all inferences based upon, and uses of, assessment information require evidence that supports the inferences drawn between test performance and the construct an assessment is intended to measure.

We can look at the content of the test in relation to the content of the domain of reference. We can probe the ways in which individuals respond to the items or tasks. We can examine the relationships among responses to the tasks, items, or parts of the test, that is, the internal structure of test responses. We can survey relationships of test scores with other measures and background variables, that is, the test's external structure. We can investigate differences in these test processes and structures over time, across groups and settings, and in response to . . . interventions such as instructional . . . treatment and manipulation of content, task requirements, or motivational conditions. Finally, we can trace the social consequences of interpreting and using test scores in particular ways, scrutinizing not only the intended outcomes, but also the unintended side effects. (Messick, 1989, p. 16)

Validity, then, is a multidimensional construct that resides, not in tests, but in the relationship between any assessment and its context (including the instructional practices and the examinee), the construct it is to measure, and the consequences of its interpretation and use. Translated to the classroom, this means that validity encompasses (a) how assessments draw out the learning, (b) how assessments fit with the educational context and instructional strategies used, and (c) what occurs as a result of assessments, including the full range of outcomes from feedback, grading, and placement, to students' self concepts and behaviors, to students' constructions about the subject disciplines.

Messick stated that multiple sources of evidence are needed to investigate the validity of assessments.
In the classroom context, this means that teachers must know how to look at their own assessments and assessment plans for evidence of their validity, they must know where to look for alternative explanations of student performance, and they must consider the consequences of assessment choices on their students and themselves. In short, teachers should develop a "habit of mind" related to their assessment processes. After situating each dimension in the context of teachers' work, we suggest general approaches that assessment instructors might use to help teachers use that dimension in their own assessment practice.

Validity Dimension 1: Looking at the content of the assessment in relation to the content of the domain of reference. Before teachers can look at their assessments in this way, they must be able to think clearly about their disciplines, understanding both the substantive structure (critical knowledge and concepts) and the syntactic structure (essential processes) of the disciplines they teach (Schwab, 1978). They must be able to determine which concepts and processes are most important and which are least important in order to adequately reflect the breadth and depth of the discipline in their teaching and assessments. As Messick (1989) states, one of the greatest sources of construct invalidity is over- or under-representation of some dimension of the construct. Once they have clearly conceptualized the disciplines they teach, teachers must know how to ascertain the degree to which the types of assessment tasks used in the classroom are representative of the range and relative importance of the concepts, skills, and thinking characteristic of subject disciplines.

In addition, because the process of assessment is as much a function of how assessments are scored as it is a function of whether the tasks elicit student learning related to the structure of the discipline, teachers must examine the degree to which the rules for scoring assessments and strategies for summarizing grades reflect the targeted learnings. As with breadth and depth of coverage within assessments, teachers must be able to evaluate whether scoring rules give too little or too much value to certain skills, concepts, and knowledge, leading to questions about the validity of the interpretations teachers make from resulting scores.

To obtain evidence for this dimension of validity, teachers can be taught to stand back from their teaching, frame the learning targets of instruction carefully, and plan instruction and assessment together, in light of the overall targets of instruction. Without a clear picture of what is to be accomplished in a course or subject area, teachers cannot adequately assess whether their assessments (selected or self-developed) are valid. Once teachers develop a framework of learning targets (learning goals and objectives), they can learn how to carefully analyze whether assessment and instructional decisions link back to this framework. They can be given opportunities to look at scoring rules developed for open-ended student work and determine whether these rules relate directly to these targets of learning.

Validity Dimension 2: Probing the ways in which individuals respond to the items or tasks and examining the relationships among responses to the tasks and items. Teachers do not often have the luxury of "item tryouts" when developing their assessments. Before giving students an assessment, teachers must examine the degree to which the assessments have the potential to elicit the learning the students are expected to achieve. This means they must examine the assessment tasks and task directions to determine whether students are really being asked to show the learning related to the targets. Teachers must know to ask themselves, "Have the directions for the task or the wording of the items limited my students' understanding of the expectations of the task?"

Teachers should be encouraged to use assessment strategies that will allow them to probe their students' thinking and processes. This becomes increasingly important as the emphasis on higher level thinking and processes increases (Stiggins, Griswold, & Wikelund, 1989). In performance assessments, for example, examinees are often asked to explain their thinking and reasoning as part of the assessment task. Teachers commonly ask students to show their work in mathematics and science assessments. These classroom assessment practices lend themselves to probing the ways in which individuals are responding.
This probing not only provides information about the validity of the assessments, but can provide better pictures of students' learning. Teachers must know how to look across students' res ponses to a variety of assessment tasks to determine whether patterns of students' re sponses support the use of the assessments. The mechanisms for this type of examination have hi storically been quantitative item analysis techniques. However, few teachers use these quantit ative techniques in actual classroom practice (Stiggins & Faires-Conklin, 1988). Teachers can be shown how to scrutinize student work qualitatively, looking for patterns in responses th at reveal positive and negative information about the assessments. If items and tasks have not yet been used with students, teachers must know how to examine the demands of a range of items and tasks and ask themselves, "Are students who can show understanding of a concept in one assessment format (e.g., an essay), likely to show equal understanding in a different f ormat (e.g., a multiple-choice test)?" In order to probe examinee performance within and a cross different measures, teachers can learn to develop multiple measures of the same targ eted learning. They may not only discover different ways to assess a given construct, but the y may discover for themselves that particular types of assessment are more or less suited to cert ain learning targets. Validity Dimension 3: Investigating differences in assessment processes and structures over time, across groups and settings, in response to instructional interventions. To investigate these validity issues, teachers must know how to ex amine the relationship between the instructional practices used and the assessments th emselves. They must ask themselves, "Did I or will I actually teach these concepts well enough fo r students to perform well?" They must also evaluate the adequacy of various assessment strateg ies for the unique needs of their students. 
They must be able to judge whether an assessment can be used in many different contexts or whether differing contexts, groups, and instructional strategies require the development of different assessments.

This dimension of validity can be examined when teachers are asked to look carefully at the relationship between an instructional plan and the demands of an assessment. If the work demanded in an assessment was not an adequate focus of instruction, teachers can decide ahead of time whether to adjust instruction to fit the learning targeted in the assessment or whether to adjust assessments to fit the learning targeted in the instruction.

Validity Dimension 4: Surveying relationships between assessments and other measures or background variables. Teachers must know how to judge the degree to which performance on the assessment and the score resulting from the assessment are directly attributable to the targeted learning. They must determine whether performance is influenced by factors irrelevant to the targeted learning such as assessment format, response mode, gender, or language of origin. This becomes increasingly critical as classrooms become more diverse and whole group teaching becomes more difficult. In general terms, teachers must know how to adapt an assessment format to meet the needs of diverse students while still obtaining good evidence about student learning related to the targets of instruction. Finally, teachers must know how to create scoring mechanisms for open-ended performances that are clearly related to the learning targets and that are precise enough to prevent biased scoring.

When teachers develop assessments, they can be asked to examine whether factors other than the targeted learning will influence students' performances. They can be asked to examine scoring rules to see whether the rules provide an unfair advantage or disadvantage to students who have certain strengths or weaknesses unrelated to the targeted learning.

Validity Dimension 5: Tracing the social consequences of interpreting and using test scores in particular ways, scrutinizing not only the intended outcomes, but also the unintended side effects. Teachers must consider the influence of classroom assessments on the learners themselves.
The nature of the assessments, feedback and grading can all influence student learning, students' self concepts and motivation (Butler & Nisan, 1986; Covington & Omelich, 1984), and their perceptions of the disciplines being taught. Teachers who assess their students' knowledge of science by giving them only multiple-choice tests of isolated facts, for example, communicate that science is a collection of facts about which everyone agrees. Those who assess students' inquiry strategies and their ability to make generalizations from observations or to systematically test their own hypotheses communicate something different about the structure of the discipline of science.

To examine this dimension of validity, teachers can be asked to assess whether a given assessment reflects the syntactic and/or substantive structure of the discipline they teach (Schwab, 1978). Does the assessment target students' deep understanding of important concepts within the discipline or does it test surface knowledge? Does the assessment ask students to show their ability to use the processes through which professionals within the discipline construct new knowledge and ideas?

Teachers also can be asked to determine whether methods used to summarize grades for a marking period give adequate weight to those performances most directly related to the learning targeted. Teachers can be asked to look at their methods of feedback (formative assessments) and determine whether they are likely to motivate learning or to stifle learning; to assess whether feedback will lead to improvement, be largely insubstantial (Sommers, 1991), or be perceived by students as too late to make a difference in their grades (Canady & Hotchkiss, 1989).

The five dimensions of validity described here can be taught in ways that emphasize their importance and usefulness in teachers' everyday work. Later we will briefly describe a course designed for this purpose and present evidence of its effectiveness. We recognize, however, that validity rests, in part, on teachers' ability to gather reliable information about student learning. Traditional presentations of reliability, based on test theory, are not immediately transferable to the work teachers do. In the next section, we describe traditional treatments of reliability in assessment textbooks and present an alternative framework.

Dimensions of Reliability
Traditional Presentations of Reliability

Measurement professionals place most of their emphasis in assessment on reliability--often at the expense of the validity of assessments. A common claim in test theory is that "for an inference from a test to be valid, or truthful, the test first must be reliable" (Mehrens & Lehmann, 1991, p. 265). This assumption is based on a mathematical model of test theory wherein observed scores are composed of true scores and measurement error. The less error in a test (i.e., the more reliable), the more truthful the test score. Hence, an unreliable assessment is automatically less valid.

Textbooks usually discuss reliability in terms of consistency (Airasian, 1993; Hanna, 1993; Linn & Gronlund, 1995; Mehrens & Lehmann, 1991; Nitko, 1996; Oosterhof, 1996; Salvia & Ysseldyke, 1995; Worthen, Borg, & White, 1993). When gathering evidence for the reliability of tests, the focus on consistency is related to either score reliability or rater reliability. Score reliability means that if a test were administered to an examinee a second time, the examinee would receive the same or about the same score. One way that measurement specialists try to ensure score reliability is through the standardization of tests. When assessments are standardized, all examinees complete the same items and/or tasks. If examinees are retested, they should complete the exact same tasks under exactly the same conditions. This would help to ensure consistency of performance.

Another element of score reliability discussed in textbooks is that of generalizability. The longer the test (the more items and tasks), the more opportunities students have to show their learning. If students do better than they should on one item or task, they are just as likely to do more poorly than they should on another item or task. If a test is long enough, positive measurement error should cancel negative measurement error. Hence, the student is likely to earn a score that would be replicated if s/he took a parallel test. Writers who have expanded their discussion of reliability to include performance-based assessments focus on the number of performances necessary to obtain scores for examinees that can be generalized to the domain of interest (Linn & Burton, 1994).

Discussions of reliability in many textbooks, however, are based on the notion that assessment takes place at a single time and that summary decisions are made about examinees based on single testing events. In the classroom, teachers are engaged in on-going assessment over time and across many dimensions of behavior (Airasian, 1993; Stiggins, Faires-Conklin, & Bridgeford, 1986). Like motivation researchers, teachers see giving students choices about assignments as a way to increase student motivation and engagement (Deci & Ryan, 1985; Nicholls, 1989; Nicholls & Nolen, 1993). While individualization of instruction may result in better achievement and motivation, it means that standardization is very difficult. In addition, few teachers have the time or the inclination to administer parallel test forms to see whether students' scores are consistent; and psychometric techniques developed for looking at internal consistency of exams are not appropriate for many forms of classroom assessment. Some teachers give students opportunities to revise their work after feedback, both for the purposes of assessment and to enhance student learning (Wolf, 1991). Hence, the notion of a test with multiple items is only one of many possible assessment episodes in the classroom. Teachers do, however, collect many sources of information about student learning--not only through tests but through a range of formal and informal assessments: homework, classroom work, projects, quizzes. If this information is relevant to their learning targets, teachers could make reasonable generalizations about student learning.
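
For readers who want the algebra behind the traditional argument summarized above, the true-score model and the test-length claim are conventionally written as follows in classical test theory. This is a minimal sketch, not a formulation taken from the textbooks cited: the symbols X, T, E, \rho, and k are notation assumed here for illustration, and the variance decomposition assumes that measurement error is uncorrelated with true scores.

```latex
% Observed score X as true score T plus measurement error E
X = T + E

% Reliability as the share of observed-score variance due to true scores
% (assumes E is uncorrelated with T)
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}

% Spearman-Brown formula: reliability of a test lengthened by a factor k
% with parallel items, given the original reliability \rho
\rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho}

% Worked example: \rho = 0.5 and k = 4 (a test four times as long)
\rho_4 = \frac{4(0.5)}{1 + 3(0.5)} = \frac{2}{2.5} = 0.8
```

The worked example shows why a larger pool of comparable evidence raises reliability under this model: quadrupling the amount of parallel evidence moves a reliability of 0.5 to 0.8.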
The second dimension of reliability relates to the judgments made about students' work. Rater reliability refers to the degree to which raters agree when assessing a given student's work. Studies have documented that when raters are well trained and scoring criteria are well developed, raters can score student work with a high degree of consistency across raters (e.g., Hieronymus & Hoover, 1987; Shavelson & Baxter, 1992). In the classroom, however, a single judge (the teacher or a teaching assistant) is often responsible for evaluating all student work. Teachers rarely exchange student work or have another evaluator look at student work.

Reliability in the Classroom Context

For reliability to have meaning for teachers, the concept has to make sense for the classroom and school context. Two dimensions of reliability relevant to the classroom are:

Reliability Dimension 1: Determining the dependability of assessments made about students. The concept of reliability can be reframed to fit the classroom context if the reality of the classroom and a broader and inclusive meaning of reliability are acknowledged. The American Heritage Dictionary (Houghton Mifflin Company, 1981) definition of reliable is "dependable." While measurement professionals have equated dependable with consistent, the former term is more appropriate for the classroom. Assessment may occur frequently in the classroom using measures that could not stand up to psychometric standards of reliability (e.g., research reports, written essays); however, it is possible that grading decisions made at the end of a marking period can be much more reliable than the individual assessments themselves. Even writers who are fairly cautious about performance-based assessments and portfolios admit that the classroom context could provide more reliable assessment information simply because teachers have more information from which to make judgments (Dunbar, Koretz, & Hoover, 1991). Hence, for assessments to be reliable, teachers must ensure that they have sufficient information from which to make dependable decisions about students.

Given this framework, evidence for the validity of assessments used to make decisions should be the foremost consideration for teachers. Reliability of assessment decisions depends on the quality of the assessments. If attention is given to evidence for validity, then teachers can begin to ask themselves whether there is sufficient information from which to make dependable decisions. A wide range of assessments can serve the purpose of a long test--the more sources of assessment information, with demonstrable evidence for validity, the more likely it is that dependable decisions can be made.

Teachers can be asked to look across diverse sources of assessment information planned for a given unit of instruction and determine whether there is sufficient information from which to make dependable judgments about students' learning related to the learning targets for the unit. Teachers can use grading policies to organize their thinking about the sources of information available for making judgments about student learning. Rather than using "averaging" techniques in grading, teachers can be shown how to use their professional judgments to look at the range of evidence about student learning and make a "holistic, integrative interpretation of collected performances" (Moss, 1994, p. 7). Reliability, then, becomes a judgment based on sufficiency of information rather than test-retest consistency.

Teachers can also be taught to develop public performance criteria that all students must apply to their work, even if they make their own choices about what work to do (see Figure 1 for an example). This level of standardization can allow for individual choice in projects and other types of performances while still ensuring that students' work will demonstrate their learning related to the targets of instruction. This will also help with rater consistency, the second dimension of reliability.

Figure 1
Directions and Criteria for Literature Project
This project will give you a chance to do some literary analysis. You will be working as a literary critic. In doing so, you will show your understanding of:

- how authors communicate major themes through their writing
- how authors communicate their perspective or purpose in their writing
- how authors use language to create images, mood, and feelings
- how to judge the effectiveness of an author's work

You may choose a short story or a collection of three or more poems by a single author. In your writing be sure to include:

- a main message or theme you see in the story or poems
- what you believe is the author's purpose or perspective
- a description of at least two figurative language strategies the author used to communicate mood, images, and/or feelings
- specific examples from the story or poems to support your claims about theme, purpose, perspective, and figurative language
- an overall judgment about whether the author was effective in communicating themes and his/her perspective/purpose and in using figurative language strategies
- at least three reasons to support your overall judgment

If you choose to use poems, make certain that the poems share a similar theme or message. Remember that authors often have more than one theme or message in their work, but be sure to focus your thinking on only one. Begin your paper by introducing the story or poems and the author. Organize your writing so that it will build a case for your positions and ideas about the writing. Look back at the literary reviews we have studied in class to give you ideas about how to organize your writing.

You must tell me what story or poems you have chosen to write about on _____________. You will turn in an outline or web for your paper on ______________. The first draft is due on ______________. Your final draft, the outline/web, and marked first draft are due on _______________. Be sure to give the source of the literary work(s) at the beginning of the paper.
Reliability Dimension 2: Determining the degree of consistency in making decisions across students and across similar types of work. Teachers generally use three types of assessment that could be affected by the consistency of their judgments about students' learning. They create short answer and essay items for tests; they assign projects and performances; they give several similar assignments (such as writing prompts) for which they have the same expectations. In these three situations, consistency of teachers' judgments depends on (a) whether the rules for scoring short answer items and essays are consistently applied across students, (b) whether the rules for scoring extended performances are applied consistently across students, and (c) whether rules for scoring frequently occurring types of assessment are applied consistently across similar tasks.

Teachers can be taught to develop public scoring criteria that they then apply to all students' performances. This can assist them in making consistent judgments across different students' performances. Teachers can be taught how to create generic scoring rules that can apply to multiple similar short answer or essay items (see Figure 2) so that they assess a range of responses to short answer or essay items based on the same criteria.

Figure 2
Generic Scoring Rules for Historical Essay

Performance Criteria

- Essay is clearly and logically organized.
- Position is clearly stated near the beginning of the essay.
- At least three arguments are given for the position.
- Arguments clearly support position.
- Specific supporting evidence is given for each argument.
- All supporting evidence is accurate and supports arguments.

Scoring Rubric

4 points: The essay is clear and logical in taking a position on a historical issue and in supporting the position with arguments and evidence. The essay thoroughly and effectively presents the position, arguments, and supporting evidence such that the reader can understand and entertain the writer's point of view. All supporting evidence is accurate.

3 points: The essay is clear and logical in taking a position on a historical issue and in supporting the position with arguments and evidence, although more evidence is needed for at least one argument. The essay presents the position, arguments, and evidence such that the reader can understand the writer's point of view. All supporting evidence is accurate.

2 points: The essay takes a position on a historical issue and supports the position with arguments and evidence, although more and/or stronger arguments and evidence are needed. The essay could be organized more effectively to communicate the position, arguments, and evidence. Some information presented may be inaccurate.

1 point: The essay takes a position on a historical issue but provides little or no support for the position. Organization may or may not communicate the writer's ideas clearly. Some information presented may be inaccurate.

If teachers learn how to frame the items and tasks given to students in a way that allows them to make consistent assessments and if they use scoring rules consistently across students and similar tasks, they are more likely to ensure that their evaluations of students' responses are consistent.

We have claimed in this paper that the frameworks we have set forth can be used to design assessment courses for teachers that not only better prepare them for the assessment tasks they will face, but that help teachers develop habits of mind in which valid and reliable assessment is seen as central to the teaching-learning process. To support this claim, we briefly describe a course based on the validity and reliability frameworks and present evidence of its effectiveness.
Assessment Frameworks in Action

The assessment course described here was taught at a large northwestern university that provided a certification program for approximately 250 elementary and secondary teachers per year. Courses were ten weeks in length and a given class included pre-service teachers from all academic subjects and the arts for kindergarten through twelfth grade. During the quarter in which the assessment course was taught, students spent at least 20 hours per week in their field placement sites in addition to their course work as a transition into full time student teaching the following quarter.

During the summer of 1991, the decision was made to redesign the tests and measurement course for the teacher preparation program. Prior to that time, didactic procedures were used to cover standardized test interpretation, item writing and item analysis techniques, and statistical procedures for obtaining estimates of validity and reliability of tests. Students were assessed on their ability to write test items in various formats, and tested on their knowledge of measurement principles and concepts.

The redesign of the course was part of an overall restructuring of the teacher preparation program and was based on exit surveys indicating that students did not value the course (R. Olstadt, personal communication, May, 1991) as well as recommendations from the literature about what assessment courses for teachers should address (Airasian, 1991; Linn, 1990; Stiggins, 1991). In redesigning the course, the two most significant shifts were that (a) all assessment concepts were to be taught in the context of instructional practices and (b) the major emphasis of the course was to be on assessment validity and reliability rather than simply assessment techniques and memorization of abstract concepts.

We began with a model proposed by Linn (1990), and expanded it to include the use of process portfolios (Valencia, 1990; Wolf, 1991). We chose process portfolios because they are an interactive teaching tool in which successive iterations of work build upon one another to create a "prepared accomplishment" (Wolf, 1991), in this case a well developed plan that integrates instructional planning and assessment development using clearly defined learning objectives as the unifying force. We then planned assignments that would give students the opportunity to develop specific assessment literacy skills and strategies and that would require students to examine their own work in terms of validity.

In what follows we briefly discuss the work of the course and how the requirements of the assignments were designed to help teachers develop the classroom-based definitions of validity and reliability given above. A more thorough description of the course is presented in Taylor and Nolen (1996) and Taylor (in press). In Taylor and Nolen (1996), each classroom course assignment is discussed in terms of its function in helping students think about one or more of the dimensions of validity, including excerpts from the students' self-evaluations that highlight the depth of their learning. In Taylor (in press), the types of decisions that had to be made to effectively use portfolios as an instructional and assessment tool are presented.

The Process Portfolio

The portfolio provided both a means for instruction and learning during the course (process portfolio), and the product used to assess students' learning at the end of the course (showcase or assessment portfolio). The use of process portfolios allowed students to benefit from peer and teacher feedback (formative assessment) on the first draft of each assignment prior to its submission for grading purposes. Instructor feedback was intended to focus their thinking so that subsequent versions of their work reflected a better understanding of the course objectives. With better understanding, students could improve the quality of their own work.
At the end of the course, students pulled all of their work together in an assessment portfolio that "showcased" their learning for the course. They then wrote self-evaluations of their learning. In what follows the components of the portfolio are described.

The Structure of the Assignments for the Course

To teach all five dimensions of validity and both dimensions of reliability, it was necessary to help students investigate assessment concepts in a meaningful context. The centerpiece of the course was a set of related assignments designed to guide students through the development of a unit of instruction so that they could engage in the thinking and skills necessary to make valid connections between learning objectives and instruction, instruction and assessment, and learning objectives and assessment.

For their assignments, students described a plan for a subject they would be likely to teach, and produced documents that were reasonable representations of the types of work good teachers do. Table 1 shows the assignments for the course and the dimensions of reliability and validity each was intended to help students learn.

Table 1
Configuration of the Portfolio Components for the Assessment Course

| Title | Description | Validity Dimension | Reliability Dimension |
|---|---|---|---|
| Subject Area Description | A description of the content foci and the instructional units in a subject area for an 8 to 12 week period including content coverage and major concepts targeted. | 1 | |
| Subject Area Goals and Objectives | 4-6 discipline-based goals; 4-6 objectives for each goal with discipline-based rationale for a subject the student planned to teach | 1 | |
| Instructional Unit Description | A description of instructional activities that would target 4-6 of the subject area objectives for 2-4 weeks of the period; with activities rationale indicating how each activity would help students learn the relevant objective(s) | 1, 3 | 1 |
| Item Sets: Checklist or Rating Scale; Performance Assessment; Essay Items; Traditional Items | Four separate item sets as examples of the various types of assessment items and tasks that are used in classroom assessment (observational checklist or rating scale, performance assessment, essay items, traditional items (multiple choice, true-false, completion, matching, short-answer)); each with the validity rationale | All | All |
| Sample Feedback | Mocked-up student work for one unit assessment with written or dialogue of oral feedback; philosophy and rationale about giving feedback | 5 | |
| Grading Policy | A description of the types of work that would be included in the grade, how different work would be evaluated, and how absences and late work would be handled; also included an example grade summary for one student | 1 | 1, 2 |
| Self Evaluation | Description of own learning of selected course objectives, including discussion of concepts of validity, reliability, bias, and fairness referring to own work to show examples of own learning | All | All |

Students were required to write rationales for all assessment decisions made during the development of components of the plan. Writing rationales forced students to articulate the validity and reliability issues that arose within each component of the plan, as well as giving the instructors a means to assess the conceptual learning that complemented the technical work displayed. The process of writing rationales also seemed to lead to deeper understanding of the concepts (Taylor and Nolen, 1996).

When all components were completed, students collected them into a final showcase portfolio. They wrote a single page reflection on each document and a self-evaluation of their learning in the course. In addition to these core assignments, other assignments were given to broaden students' understanding of assessment concepts. They included:

1. "Thought papers" in which they discussed their thoughts about collections of course readings (from the text book and a course reader).
2. A letter to parents explaining norm-referenced testing and score types.
3. A written interpretation of one student's scores from a norm-referenced test.

The assignments listed above formed the core of the course as it evolved over the next twelve quarters.
Based on student work and feedback we adjusted the portfolio components, norm-referenced test interpretation assignments, and the number of thought papers required. We clarified instructions and experimented with various scoring schemes for the final portfolios. The focus of this paper is on the classroom assessment components of the portfolio; therefore, the latter three assignments are not discussed further here.

In what follows, we briefly discuss each of the components of the portfolio in the order the components were assigned. We also discuss the links between components and their links to the validity and reliability frameworks.

Subject area description, goals and objectives. Students began by writing a brief (one page) description of a subject area they planned to teach the quarter following the assessment course. The description included a general outline for one quarter or trimester, including the units of study and the major concepts and processes to be taught. The purpose of this component of the plan was to help students envision a subject area as a whole rather than as piece-meal units or text-book chapters. From this vision of the subject area, they were better able to articulate the overall learning goals of the course.
Once the general description was completed, students wrote four to six learning goals and four to six relevant objectives for those goals. We hoped that this level of objective writing would lead our students to clarify, for themselves, the most central learnings in the disciplines they planned to teach. This conceptual clarity is necessary if teachers are to develop assessments that reflect the disciplines studied (Validity Dimension 1).

Finally, students wrote a rationale describing how their goals and objectives reflected the substantive and syntactic structures of the discipline they intended to teach. This requirement built upon the educational psychology course they had taken the previous quarter in which they explored the concepts of disciplinary structure (Schwab, 1978) and pedagogical content knowledge (Grossman, Wilson, & Shulman, 1989). Students revisited this component throughout the quarter as they developed a deeper understanding of their goals and objectives through the assessment development process.

Unit description. Once students had completed their subject area descriptions, they described a brief unit of study that would fit within the quarter or trimester they had described in the subject area description. This component proved vital to students' understanding of how to establish the validity of assessments. Without the instructional unit as an anchor, it would be difficult to address aspects like the validity of methods of assessment for the methods of teaching used (Validity Dimension 3). Students developed units that were unique to their individual interests and that they were likely to use; therefore, the units were also a "hook" that kept students engaged in the work of the course.

Students selected up to six subject area objectives as the focus for the instructional unit.
Then they wrote a brief narrative of the activities they would use to teach the objectives each day of the unit, linking the objectives to each activity, and providing a rationale for why the given activity or activities would lead to the targeted learning. This helped them to judge the fit of the assessments to the discipline as well as the fit of assessments to the unit of instruction (Validity Dimensions 1 and 3).

Unit Assessments. For the next part of the portfolio, students used a variety of techniques to create assessments for their instructional units. Students fully developed four different types of assessment for their units:

1. An observational checklist or rating scale. The assignment for the observational checklist or rating scale required students to identify one or more unit objectives and one or more situations from the unit for which observation would be an appropriate form of assessment. The checklist or rating scale was to have at least 10 items that described clearly observable behaviors that could show the learning described in the objective(s).
2. A performance-based assessment. This assignment included a description of a performance that was either an integral part of the instructional unit or that could be used for students to show the learning objectives that were the target of the instructional unit. Students wrote directions (oral or written) that were sufficient for their students to complete the performance and show the learning, as well as a checklist, rating scale, or rubric(s) to evaluate the performance.
3. Two essay items. The assignment for the essay items required students to think about two essay prompts that could be written in the instructional unit through which students could show learning related to one or more of the unit objectives. Essay prompts had to be explicit enough that students would know what they were to do to successfully write the essays. Essays were to be brief (extended essays were considered performance assessments). Students also had to write scoring rules (checklists, rating scales, and/or rubrics) for each essay.
4. A set of "traditional" test items. This assignment was for a set of at least 10 items that assessed one or more unit objectives. The set had to include at least three multiple-choice items, one matching item, two completion items, two true-false items, and two short answer items. The item set could be organized as a quiz, part of a unit test, or into one or more daily worksheets (for younger students). Students had to develop a scoring key for the select items and scoring rules (key words, checklists, rating scales, or rubrics) for the supply items.

Students were asked to develop assessments that fit with their instructional methods and that assessed their unit objectives. Students then had to write a rationale for each item or task that answered several questions:

1. How will the item/task elicit students' learning related to the targeted unit objective(s)? (Validity Dimensions 1 and 2)
2. How does the item/task reflect concepts, skills, and processes that are essential to the discipline? (Validity Dimensions 1 and 5)
3. How does the item/task fit with the instructional methods used in the unit? (Validity Dimension 3)
4. How do the rules for scoring the item/task relate to the targeted unit objective(s)? (Validity Dimension 1)
5. Is the mode of assessment such that all students who understand the concepts will be able to demonstrate them through the assessment? (Validity Dimension 4)

By thinking about each item or task and its relationship to the discipline and the unit methods, students went beyond simply practicing item or task writing techniques and had to consider whether the assessment represented the construct (Validity Dimension 1) and whether the assessment was appropriate for the instructional context (Validity Dimension 3). By examining whether items and tasks clearly asked for the learning targeted, students could examine whether assessments were presented in a way that allowed their students to demonstrate learning (Validity Dimension 2; Reliability Dimension 1).
By carefully examining the rules for scoring the item/task and how these rules relate to the objective(s) the item/task is intended to measure, students had to think about whether the scores used to represent student performance related to the construct (Validity Dimension 1) and whether their scoring rules would help them be more consistent across students (Reliability Dimension 2). By having to discuss whether all of their students would be able to show their learning through the mode of assessment, our students could begin to explore issues related to bias (Validity Dimension 4). By considering the link between the assessments and the disciplines, students could also begin to grapple with whether assessments were likely to provide appropriate representations of the disciplines for students (Validity Dimension 5). Finally, by creating several assessments in different modes for the same unit and unit objectives, they were able to compare different methods of assessment in terms of their demands for students (Validity Dimension 2).

Feedback. This assignment required students to choose one of their assessments and either try it out with one of their students or mock-up/describe one of their students' responses. They then showed what they would do (either by marking on the paper or by describing a dialogue with their student) to give feedback. Finally, they wrote a rationale for the feedback, including both a discussion about the influence of the feedback on the learner's motivation and self-esteem and a discussion about how the feedback could help their student improve future performance related to the learning target(s). This gave students another opportunity to explore the consequences of assessment interpretation and use (Validity Dimension 5).

Grading Policy. For the grading policy, we had students use the assessment ideas derived from their unit plans and write a grading policy that applied to the entire subject area description.
They had to choose a grading philosophy (norm-referenced or criterion-referenced) and explain why they had chosen it. They explained what types of work would contribute to the grades (e.g., essays, reports, projects, tests, homework, daily seatwork) and why this work was important to learning the discipline (Validity Dimension 1), the general strategies they would use to assess various kinds of work (Reliability Dimension 2 [e.g., a generic four point rubric for all homework assignments based on completeness and accuracy of work]), how they would weight the various sources of assessment information, and how they would summarize across assessments to assign a grade. They also had to prepare a sample grade summary for one student using the information from the various assessment sources.

Students had to decide how much weight to give to attendance, timeliness, oral participation, and attitude when making judgments about their students' learning of the targeted objectives. By validity standards, some of these variables would be considered sources of irrelevant variance that lead to invalid inferences about student academic learning (Validity Dimension 1). They also had to think about the multiple sources of evidence needed to make reliable decisions about learners (Reliability Dimension 1). Finally, by creating a set of scores for a hypothetical examinee, they were able to look at the impact of various sources of assessment information on overall grades (Validity Dimension 5).

Reflection and Self-evaluation. The final component of the portfolio was the self-evaluation. This component gave students an opportunity to bring closure to the course and to organize their thinking about a few central assessment concepts using the work required in the course as the anchor. In these self-evaluations, they wrote about their understanding of major assessment concepts for the course.
They were required to:

1. Discuss their current understanding of the concepts of validity, reliability, bias, and fairness, with reference to specific work in the course that helped them understand these concepts and an explanation of how that work had helped.

2. Select at least six of the assessment course objectives and discuss what they had learned related to each objective, what aspect of the course had helped them to learn it, and how.

The self-evaluations were evaluated for the students' ability to demonstrate their understanding of the assessment concepts using their work as examples. It was not sufficient to provide a textbook definition of a term or to explain the impact of assessment in general terms; specific and credible examples were required. In the following discussion, excerpts from the self-evaluations of the Spring 1994 students are used to demonstrate, in their own words, what students thought about as they reflected on their own learning. Selected excerpts represent common thoughts among students.

In the self-evaluations, when students discussed their understandings of validity, most references were made to the unit assessments (Validity Dimensions 1 through 5). Discussions of reliability and fairness usually focused on the use of rubrics and observational checklists and rating scales (Reliability Dimension 1 and Validity Dimension 4). Rarely did students bring up consistency of ratings across students and performances as an element of reliability (Reliability Dimension 2). In discussions of fairness and bias, students often indicated how helpful it had been to use a standardized scoring scheme to evaluate essays or performances in class, and how such rules had given them a way to be fair and unbiased in their assessments (Validity Dimension 4). For example:

"The students in my placement are intentionally given vague criteria. The teacher considers it her right to use her personal judgments of the student's attitude and behavior to influence the grade. If the criteria (are) not spelled out she has the leeway to insert her prejudice. Students realize what is going on and they become cynical and resigned. Few of them try to fight it. This lack of fairness is so widespread that they have come to expect it."
When students chose which component of the portfolio had most influenced their learning, each component was selected by someone. For some, the clarification of their disciplines was seen as the most critical element (Validity Dimension 1).

"The best part of the course for me was the subject area description and goals because it forced me to stop and think about why I want to teach biology. . . . Being a good teacher is a difficult task. The best way to overcome this is going through the process we went through during the development of subject description, goals, objectives, and rationale. . . . It will help me down the road as a teacher."

Some students wrote about the importance of developing a unit of instruction in order to help them conceptualize the role of assessment (Validity Dimension 3).

"It made me focus on what I really wanted my students to learn, and then I had to find different and appropriate ways to assess whether or not the students learned these things. If one of my unit objectives was to view the American Revolution and its effects from a variety of perspectives, then an assessment that only deals with one perspective is not a valid assessment. It does not tell me if they have learned what . . . I want them to learn."

Many students chose to focus on one or more of the unit assessments, discussing what they had discovered as they developed a given type. A very common observation was about the need for clear directions for performances so that their students would actually provide performances that showed the targeted learning (Validity Dimension 2).

"Giving the criteria for successful work helps make an assessment valid, as it insures that a student's essay demonstrates the student's conceptual and/or procedural understanding rather than his/her ability to read the teacher's mind."
Another common focus was on the fit between various forms of assessment and either the discipline or the learning objectives, as well as on what assessments communicate to students about a discipline (Validity Dimensions 1 and 5).

"Assessments are not neutral! . . . Assessments send messages about a discipline; they communicate to students in a direct, concrete, and powerful way about what is really important to know in this subject."

Students also wrote about grading policies. They typically reflected back on readings about the influences of grading practices on motivation and self-esteem (Covington & Beery, 1976; Canady & Hotchkiss, 1989), discussing the assumptions often made about the motivating power of grades and considering the potential consequences of various ethical and unethical grading practices (Validity Dimension 5). Some students indicated that in being forced to think about the relative weight of each aspect of the grade, they had to look again at the discipline to decide which sources of evidence were best and most important in making judgments about their students' learning (Validity Dimension 1).

These and other comments showed us, as instructors, the power of the work assigned in the course in terms of helping our students understand important assessment concepts. Comments from students suggested that the assignments done for the course, as well as the rationales and self-evaluation, enhanced their learning.

Comparative Studies of the Traditional Tests and Measurement Assessment Course and the Portfolio-Based Course

In an effort to evaluate the effectiveness of the revised course, three studies were conducted that compared data available from students who had taken one of the two versions of the course: the portfolio-based course and the traditional tests and measurement course. The classroom assessment component of the original assessment course covered item writing and item analysis techniques (some later sections of this course also covered performance assessment), and statistical procedures for obtaining estimates of validity and reliability of tests. Instructors used a combination of lectures and discussions to teach assessment content. Instructors relied heavily on midterm and final examinations (primarily multiple-choice), which counted for 60 to 70 percent of the final grade (depending on the instructor). Up to 25 percent of the final grade was based on students' development of behavioral objectives (based on Bloom's taxonomy) and tests or sets of items to measure those objectives. Tests or sets of items were independent of any context except that of the behavioral objectives.

Study 1 compared course evaluations across teaching faculty for the two versions of the course. Study 2 compared evaluations of relevant components of an exit survey given to all students graduating from the teacher education program. Study 3 involved analyses of data from follow-up surveys sent to teacher education students in the quarter following their enrollment in the assessment course--the time during which most were engaged in full-time student teaching. In the survey, the pre-service teachers were asked to discuss assessment issues, validity dilemmas, and reliability dilemmas that arose in their teaching. Each of these studies is described more fully below.
The designs of the three studies reflect the natural development of curricular revision, rather than the carefully-controlled world of laboratory studies or field experiments. The research opportunity was presented by the decision to redesign the course. Thus, comparisons of the two versions of the course presented in Studies 1 and 2 depended on existing institutional data. The data for Study 3 were collected as part of an evaluation of the course revision, but the decision of one instructor to revert to the traditional format for two sections provided an opportunity for comparison on the follow-up measure.

Study 1: Course Evaluations

Data Source. The university's Office of Educational Assessment provided course evaluation results for each quarter from the summer quarter of 1988 through the spring quarter of 1994. Course evaluations are required for every course for assistant professors and at least once a year for senior faculty. Student participation is voluntary; however, most students complete the form. Results of the course evaluation are not given to the instructor until after grades are submitted.

Data representing 12 quarters of the traditional tests and measurement version of the course and 12 quarters of the revised course were available. The number of respondents from the traditional tests and measurement course ranged from 15 to 55 across different sections, with a mean of 32.25. The number of respondents from the revised course ranged from 17 to 74, with a mean of 32.58. Because responses were anonymous, it was not possible to determine the exact number of males and females in the sections nor the number of students who were to be certified in elementary, secondary, or music education. Academic ranks for the instructors in the traditional tests and measurement course ranged from graduate student instructor to full professor.
Academic ranks for the instructors in the revised course ranged from graduate student instructor to assistant professor. There were 8 different instructors for the traditional tests and measurement course and 3 different instructors for the revised course.

Only those items common to evaluation forms used in all sections of the course were included in the analysis. Items common to all forms are given in Appendix A. Each item was rated on a six-level scale: "excellent" (5), "very good" (4), "good" (3), "fair" (2), "poor" (1), and "very poor" (0). Four items from this common set assessed students' ratings of the content and relevance of the course.

Results. Mean item scores were averaged across classes for each version of the course. Only those items specifically related to the content of the course and the relevance of the course were included in the analyses. Two analyses were performed on a selected set of the items. In the first analysis, data from four items from the course evaluation forms were used: (a) course as a whole, (b) course content, (c) amount you learned in the course, and (d) relevance and usefulness of course content. These items were summed to obtain an overall score for the general content of the course; the mean score for the traditional tests and measurement course was 12.09 (SD = 2.04), and for the revised course was 16.48 (SD = 1.62). In the second analysis, relevance and usefulness was analyzed alone, with means of 2.92 (SD = .57) for the traditional tests and measurement course and 4.29 (SD = .38) for the revised course.

T-tests were performed to compare mean ratings for these data. There were significant differences between students' perceptions of the general content of the course (t(22) = 5.85, p < .001) and the relevance and usefulness of the course (t(22) = 7.00, p < .001). Students in the revised course saw the course as more relevant to their needs and rated the content of the course between "very good" and "excellent." Students in the traditional tests and measurement course rated the course as "good" on both general content and relevance and usefulness.

These differences might have been due to differences in the effectiveness of individual instructors.
However, even instructors of the traditional tests and measurement course who received high ratings for instructor's effectiveness received lower ratings on relevance and usefulness of course content and on course content. Two instructors from the traditional tests and measurement course had high ratings for instructor's effectiveness (mean ratings of 4.38 and 4.25), comparable to the average ratings for the two revised course instructors with the highest effectiveness ratings (mean ratings of 4.20 and 4.54). When only these four instructors are compared, the mean ratings for relevance and usefulness were 3.52 and 3.83 for the traditional tests and measurement course and 4.81 and 4.71 for the revised course. The mean ratings for course content were 3.90 and 3.64 for the traditional tests and measurement course and 4.71 and 4.54 for the revised course. This suggests that whether students saw the content of the assessment course as relevant to their needs was not merely a function of their perceptions of the effectiveness of an instructor.

Study 2: Teacher Education Program Exit Surveys

Subjects. As part of the ongoing evaluation process of the teacher education program, exit surveys were administered in the last quarter of the program to all students. We obtained 153 of these surveys from three years just prior to the change in the assessment course (1989-91) and 145 from two years after the change (1992 and 1994). In the summer of 1992 an outside instructor taught a traditional tests and measurement course. Since it was not possible to tell which 1993 exit surveys came from students who had taken the revised course, data from that year were not used. All responses were anonymous; therefore, the demographic characteristics of the respondents were unavailable.
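As a rough consistency check on the Study 1 comparisons, the t statistics can be recomputed from the reported means and standard deviations. The sketch below assumes an equal-variance (pooled) independent-samples t-test on 12 class-level means per course version, which is what the reported 22 degrees of freedom imply; the paper does not state which variant was used, and the function name `pooled_t` is illustrative only.

```python
import math


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Equal-variance independent-samples t statistic and its degrees of freedom."""
    df = n_a + n_b - 2
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


# Summary statistics as reported for Study 1 (assumed: 12 quarters per version).
t_content, df_content = pooled_t(16.48, 1.62, 12, 12.09, 2.04, 12)
t_relevance, df_relevance = pooled_t(4.29, 0.38, 12, 2.92, 0.57, 12)

print(f"general content: t({df_content}) = {t_content:.2f}")
print(f"relevance and usefulness: t({df_relevance}) = {t_relevance:.2f}")
```

Under these assumptions the recomputed values come out at roughly 5.8 and 6.9, close to the reported 5.85 and 7.00; gaps of this size are what rounding of the published means and standard deviations would produce.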
Instruments. Exit surveys were general program review instruments and asked a variety of questions about students' experiences in the teacher education program, including both course work and field work. There were several items which provided some information about students' perceptions of assessment course effectiveness. First, a set of items asked students to rate how well the program as a whole had prepared them in a number of areas corresponding to the state requirements for teacher education programs. One of these items was "How well has this program prepared you to evaluate student work," which students rated on a scale from 1 ("not at all prepared") to 5 ("thoroughly prepared").

A set of open-ended questions asked students to comment on various program aspects. Three of these questions were coded for comments related to the assessment course.

The first of the open-ended questions asked for comments about any of the courses in the program. Coding schemes for this item were as follows:

1. Comments specifically directed at the assessment course, and related to value or worth of the course or its content, were coded (0) if they suggested eliminating the course altogether; (1) if they stated the course was worthless, not valuable, or not useful for teachers; and (2) if they stated the course was valuable, applicable, or useful.

2. General comments (not referring to value) were coded (1) negative or (2) positive.

A second item asked students to list aspects of the teacher education program that were particularly valuable or worthwhile. Raters counted the number of students listing the assessment course here. A third item asked what important material was left out or not sufficiently covered. Raters counted any mention of an assessment-related topic (e.g., setting up grade books, portfolios, informal observation). Finally, negative comments regarding the work load related to the assessment course mentioned anywhere in the survey were counted.

All coding was completed by the authors and one graduate student who was unfamiliar with the purpose of the research. There was 98% agreement among the three raters. Final counts for each code assigned to each response were based on absolute agreement among the raters.

Results. Ratings of how well students thought the program prepared them to do assessment were compared across courses using a one-way ANOVA.
Students who took the revised course rated the teacher education program as preparing them more thoroughly to do assessment (Mean = 4.07, SD = 0.87) than did students who took the traditional tests and measurement course (Mean = 3.22, SD = 1.04; F(1, 296) = 58.36, p < .001).

Frequencies of responses for each open-ended item appear in Table 2. In general, the comments were more positive for the revised course, though not uniformly so. Typical comments for the traditional tests and measurement course included "[The assessment course] was a useless class. Testing and evaluation are essential, but I learned almost nothing in this class" and "Did not relate to the real world." Typical comments for the revised course included "[The assessment course] provided me with the information that I considered most valuable in my field experience" and "[The assessment course] was the most valuable class overall for my teaching." Eight students in the revised course (5.2%) stated that the work load in the revised course was excessive, while none of the students taking the traditional tests and measurement course did so.

Table 2
Frequency of responses to each item for the traditional tests and measurement course and the revised course

Comments (Value)
Course               N of Cases   Valuable   Not Valuable   Eliminate Course
Revised Course       145          19         2              0
Traditional Course   153          1          17             9

Comments (General)
Course               N of Cases   Positive   Negative   Negative Work Load
Revised Course       145          22         4          8
Traditional Course   153          0          9          0

What aspects of the program were...
Course               N of Cases   Particularly Valuable   Not Sufficiently Covered
Revised Course       154          28                      3
Traditional Course   129          1                       11

Each comment was coded into only one category, but some students mentioned the assessment course in more than one way. Therefore a new variable was created by counting the number of students in each group who had responded in some way that the assessment course was valuable and the number of students who had indicated that the course was not valuable. Students who had taken the revised course were much more likely to mention it as a valuable part of the program (31%) than to say it was not (2%), while those taking the traditional tests and measurement course were more likely to see the course as not valuable (17%) than as valuable (1%) (chi-square(1) = 61.8, p < .001).

Study 3: Follow-up Survey

Study 3 aimed to assess the impact of the assessment courses on pre-service teachers' work in their field placement classrooms. We were primarily concerned with their ability to describe assessment issues they faced in teaching, and with their understanding of validity and reliability. We were also interested in the extent to which they could use the assessments (and other components of their work for the course) in their field placement classrooms.

Subjects. Students from six different quarters were asked to be part of an anonymous mail survey during the quarter following the one in which they took the assessment course. Most of the students were engaged in full-time student teaching. Two classes of students (N = 112) who had taken the traditional tests and measurement course during the summer of 1992 were surveyed. Twenty-one percent (n = 23) of these students completed and returned the surveys.
Five classes of students (N = 195) who had taken the revised version of the course between the summer of 1991 and the autumn of 1992 were surveyed. Twenty-five percent (n = 50) of those enrolled completed and returned the surveys.

Results. The follow-up questionnaire addressed a number of assessment and programmatic issues. A complete list of items is shown in Appendix B. There were few differences between groups on the assessment methods used in their field placement, the proportion of planning time spent on assessment, or the amount of time they reported thinking about assessment. Students in the revised course reported spending slightly more time planning assessments (7% of planning time, SD = 4%) than traditional tests and measurement course students (3%, SD = 4%; t(65) = 9.54, p < .01).
Ninety-two percent of the students in the revised course reported using all or part of the work developed in the course, while only 8% of students in the traditional tests and measurement course reported using any of the work developed in their course (chi-square(1) = 9.03, p < .01). Students who reported using materials developed in the course rated the helpfulness of the planning process on a 5-point scale from 1 ("not at all helpful") to 5 ("very helpful"), with a mean of 4.17 (SD = .81).

Three items provided information on students' post-course understanding of assessment issues, validity, and reliability. Responses to items 4, 6, and 7 (the influence of assessment, validity issues, and reliability issues) were independently coded by three full professors with strong measurement and statistics backgrounds who had previously taught classroom assessment courses. They were not aware of the purposes of the study or the type of course in which students were enrolled. Coding was based on the degree to which the students' responses showed understanding of general assessment concepts.
Table 3 provides the scheme used to code student responses.

Table 3
Coding scheme for relevant items of the post-course survey

Item 4: Influence of course on teaching
  Code 1:
    1 = yes
    2 = no
  Code 2:
    2 = shows clear, unambiguous understanding of appropriate uses of assessment
    1 = shows partial understanding of appropriate uses of assessment
        describes delivery of instruction; may have assessment links
        uses assessment terms without examples
    0 = shows little or no understanding of appropriate uses of assessment in instruction
    NS = not scorable (off task or omitted)

Item 6: Validity issues
    2 = gives good example of validity issue
    1 = possible example of validity issue, somewhat unclear
        may confuse validity with reliability
    0 = gives example that is neither reliability nor validity
    NS = not scorable (off task or omitted)

Item 7: Reliability issues
    2 = gives good example of reliability issue
    1 = possible example of reliability issue, somewhat unclear
        may confuse validity with reliability
    0 = gives example that is neither reliability nor validity
    NS = not scorable (off task or omitted)
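When three raters apply a scheme like Table 3 independently, their codes have to be combined into a single final code for each response. The following is a minimal sketch of a majority rule for doing so. The function name, the example ratings, and the choice to return None when all three raters disagree are assumptions made for illustration; the paper does not describe how such cases were handled.

```python
from collections import Counter


def majority_code(codes):
    """Return the code chosen by more than half of the raters, or None if there is no majority."""
    code, count = Counter(codes).most_common(1)[0]
    return code if count > len(codes) / 2 else None


# Hypothetical codes from three raters for four responses to one survey item.
ratings = [
    (2, 2, 1),        # two raters agree on 2
    (0, 0, 0),        # unanimous
    ("NS", 1, "NS"),  # two raters judged the response not scorable
    (2, 1, 0),        # three-way split: no majority
]

final_codes = [majority_code(response) for response in ratings]
print(final_codes)
```

A rule of this kind is more lenient than the absolute agreement required in Study 2, because one dissenting rater does not prevent a code from being assigned.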
The final code assigned to each item for each examinee was based on a majority agreement among the raters. For students from the traditional tests and measurement group, 35% indicated that the course had no effect on their teaching. For the students in the revised course, 2% indicated that the course had no effect on their teaching.

For influence of assessment course, 70 percent of students from the revised course showed a clear understanding of the appropriate uses of assessment, as judged by the raters, as compared to 44 percent of the students in the traditional tests and measurement course (chi-square(3) = 9.96, p < .02). For validity issues, 70 percent of students from the revised course gave good examples of validity issues as compared to 22 percent of the students from the traditional tests and measurement course (chi-square(3) = 15.01, p < .001). For reliability issues, 22 percent of students from the revised course gave good examples of reliability issues as compared to 13 percent of the students from the traditional tests and measurement course (chi-square(3) = 8.74, p < .03); however, a fairly large proportion of both groups gave no examples at all (61% of the traditional tests and measurement course students and 42% of the revised course students). In addition, a fairly large percentage of the students in the revised course (32%) received a score of 1 for this item, indicating that while the students in the revised course may have been better prepared to address issues related to reliability than were the students in the traditional tests and measurement version of the course, they were still not sufficiently prepared.

Discussion

The results of these three studies suggest that the revision of the assessment course was beneficial to preservice teachers.
Students taking the revised course were more likely to see the course as useful and relevant to their own work as teachers than students in the traditional tests and measurement course, both at the end of the quarter in which they took the course, and at the end of their teacher education program (following full-time student teaching). Students taking the revised course felt better prepared to deal with classroom assessment than similar students in the traditional tests and measurement course by nearly a full standard deviation. Nearly a third of those responding listed the revised course as one of the most useful parts of the teacher education program; only 1% of students listed the traditional tests and measurement course.

Although student ratings are valuable, they do not bear on whether students actually learned central concepts in assessment and could use those concepts in their own classrooms. The results of the follow-up survey (Study 3) suggest that students in the revised course were indeed able to use the notion of validity generatively. The concept of reliability, however, was not as clearly understood by the majority of students in either version of the course.

Post-course questionnaires showed that, while students in the revised version of the course had a better understanding of reliability (as it applied to the assessments used in their field placements) than did students in the traditional tests and measurement course, their understanding of reliability was still inadequate. This could be due to the intense focus on a broad understanding of validity and inadequate attention to reliability issues. Many of the examples of reliability issues given by students who had taken the revised version of the course were actually validity issues. These comments, while inaccurate representations of the concept of reliability, did show an understanding of the difference between appropriate and inappropriate assessment practices.
On the other hand, survey comments from students who had taken the traditional tests and measurement class indicated that they were very confused about the meanings of reliability and validity. Several of these students responded to the questionnaire items about reliability and validity with: "I don't understand the concept. I only memorized it for class."
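The earlier statement that students in the revised course felt better prepared "by nearly a full standard deviation" can be illustrated with the Study 2 summary statistics. The sketch below computes a standardized mean difference using a pooled standard deviation (Cohen's d); treating the comparison this way is an assumption, since the paper does not name the effect-size measure it had in mind. It also squares the equal-variance two-group t statistic, which for two groups equals the one-way ANOVA F.

```python
import math


def pooled_sd(sd_a, n_a, sd_b, n_b):
    """Pooled standard deviation of two independent groups."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return math.sqrt(pooled_var)


# Exit-survey preparedness ratings as reported in Study 2.
revised = {"mean": 4.07, "sd": 0.87, "n": 145}
traditional = {"mean": 3.22, "sd": 1.04, "n": 153}

sd = pooled_sd(revised["sd"], revised["n"], traditional["sd"], traditional["n"])
difference = revised["mean"] - traditional["mean"]

# Standardized mean difference (Cohen's d with a pooled SD).
d = difference / sd

# Equal-variance t statistic; its square is the two-group one-way ANOVA F.
t = difference / (sd * math.sqrt(1 / revised["n"] + 1 / traditional["n"]))
f = t**2

print(f"d = {d:.2f}, F(1, {revised['n'] + traditional['n'] - 2}) = {f:.1f}")
```

With the published values, d comes out near 0.88, consistent with "nearly a full standard deviation," and F comes out near 58.2, close to the reported 58.36 given rounding in the reported means and standard deviations.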
It appears from these data that the revised assessment course was effective in helping students understand appropriate assessment practices in the context of the classroom and in helping them develop a generalizable understanding of the concept of validity. What was lacking was a deep understanding of reliability and how it transferred to the world of teaching. Subsequent to these analyses, the course was revised in order to help students focus more carefully on the dimensions of reliability. Follow-up studies are planned to determine whether these adjustments accomplished the course goals.

Conclusion

The assessment course outlined here has been designed to engage students in tasks relevant to their own work as preservice teachers and demand that they consider assessment in the context of disciplinary structures and instructional practices. Each component of the portfolio gave students an opportunity to address one or more of the dimensions of validity and reliability highlighted in this paper. The focus on validity guided student learning from the initial subject area description and concomitant goals and objectives (which helped students develop clearer definitions of their disciplines for themselves), to the unit assessments (which helped students explore all five dimensions of validity), to the grading policy (which helped them address issues of multiple sources of evidence, appropriateness of evidence, and potential consequences of assessment interpretations and use).

One powerful aspect of this course may have been that it was a model of the concepts students were learning. In contrast to a course in which teachers act as impartial observers of students' learning, the instructors were engaged as participant observers--using feedback and guidance to help ensure learning for as many students as possible.
Multiple sources of information were used to determine whether students were learning the concepts and skills of the course, from the components of the portfolio to the reflections and self-evaluations at the end of the course. Students had more than one opportunity to return to their work and revise based on feedback from the instructor and later learning. As such, the instructors had multiple opportunities to observe students' growth over time. Public criteria were used to communicate the expectations of performances and scoring rules were consistently applied across students' work and across similar performances.

Another powerful aspect of the course was that it was carefully focused on tasks and reflective writing designed to help students grapple with each of the dimensions of reliability and validity described in this paper. The learning that resulted from the course--in terms of students' transfer of ideas from the course to their own teaching as judged by three full professors with substantial knowledge of assessment concepts--suggests that the validity framework used to organize the work of the students is one that teachers can internalize and understand. A stronger focus on the sufficiency of assessment information and ways of ensuring scoring consistency in students' work was needed if students were to better understand the concept of reliability.

The success of this course in reaching teachers has implications not only for the preparation of teachers, but for the ways in which we present measurement theory in textbooks and instruction and for how classroom assessments are used in large scale assessment programs. While there may be a place for external assessments that provide accountability data to taxpayers, legislators, and state boards of education, the measurement model developed for these external tests does not fit the rich and complex environment in which learning takes place.
If we are to adequately prepare teachers in the area of assessment, clearer thinking is needed about the assessment concepts, types of textbooks, and the methods of teaching that are used. Measurement professionals often lament the widespread lack of understanding about measurement concepts. Quite possibly we have created this problem ourselves. The problems seen may be due to the fact that the philosophical foundations of test theory don't fit the classroom context well. Although textbook authors may be trying, in their individual ways, to construct textbooks that will force a fit where one does not exist, we may need to admit that a test theory that fits the modernist notions of the impartial observer is not appropriate for the context in which the teaching and learning occur.

It is likely that two different frames are needed for educational assessment constructs: one for the context of school and one for the context of external norm-referenced tests. Textbooks could acknowledge the differences between these contexts and frame concepts, procedures, and skills as appropriate for each context. Courses could be designed to help teachers internalize and grapple with these differences. Textbooks and teacher educators could regularly bring teachers back to classroom-relevant dimensions of validity and reliability within chapters that address various assessment problems, skills, decision-making issues, and processes for the classroom. They could ask students to think deeply about why very different frameworks and methodologies apply to external assessments. As measurement professionals and teacher educators, we could do a better job of preparing good "participant observers," as well as helping teachers understand the paradigm shifts between the two perspectives on assessment. Most importantly, we should frame our preparation of teachers in such a way that they are clear about their own tasks as teachers: to promote students' ongoing learning.
At this point in time, while we have standards for educational and psychological testing (AERA, APA, NCME, 1985), standards for assessment competencies for teachers (AFT, NCME, NEA, 1990), and standards for various professional groups in the interpretation and use of tests (e.g., American Association for Counseling and Development, 1989; APA Committee on Children, Youth and Families, Committee on Testing and Assessment and Committee on Ethnic Minority Affairs, 1992; American Speech-Language-Hearing Association, 1991), we do not have standards for the preparation of teachers related to assessment or for the materials used in that preparation. In addition, as AERA, APA, and NCME revise the testing standards, it is critical that they look carefully at the contexts in which assessments apply as well as the philosophies underlying the use of assessments within those contexts rather than attempting to create omnibus standards that apply to all assessment circumstances.

Related to this, as large scale assessment programs look at the viability of incorporating classroom-based assessments into statewide accountability information, the nature of the classroom context, and the proposed validity and reliability frameworks, should be considered. Some might say that, given the unstandardized and progress-oriented nature of classroom assessments, the information derived from these sources is too unstable to use for large scale assessment purposes. On the other hand, the richness and breadth of the assessment information that arises from classrooms could give us more and better information if we more appropriately develop teachers' assessment skills.

As state and national programs attempt to incorporate classroom assessment information when reporting on students' learning, the focus must be on the validity and reliability frameworks that fit the classroom rather than ones that fit external tests.
The dimensions of validity and reliability presented here make sense to teachers because they make sense in a classroom context of teaching and learning. Large scale assessment programs that use classroom-based evidence should consider the dimensions of validity and reliability relevant to the classroom when making decisions about how to incorporate classroom-based information into large-scale programs.

If, in order to obtain assessment information from classrooms, large scale programs create top-down standardized tasks or tests to be administered by teachers, the validity of such assessments for the classroom context is suspect. Given the validity framework presented here, top-down classroom assessments could not provide valid classroom assessment information because they would not follow from instruction (Validity Dimension 3). They would simply be extensions of external, standardized tests. If teachers are admonished to use standardized administration directions that do not allow for the unique needs of students, top-down classroom assessments should be suspect because they may prevent some students from showing their learning in ways that accommodate their unique needs (Validity Dimension 4). If standardized, top-down tasks are closely circumscribed in order to strengthen reliability, they limit the capacity of the assessments to assess students' understandings of the subject area disciplines (Validity Dimension 1). This would not only limit fit with the content and constructs to be measured, but would rob the classroom of the opportunity to use important assessments to accurately represent the structures of disciplines (Validity Dimension 5). Limiting classroom assessment information to a few, standardized, top-down assessments would also limit the range of evidence and counter-evidence that teachers could present about student learning--a threat to Validity Dimension 2.

If, on the other hand, several generic outlines for tasks, scoring rules, and tests are created (e.g., Rekase, 1995), and teachers are allowed to configure these assessment outlines to fit their own instructional methods, content focuses, and timelines, classroom assessments could fit all of the dimensions of validity relevant to the classroom context. Guidelines for adaptation of the assessments to instructional contexts, strategies for evaluating the validity of these adapted assessments, and ideas about what would constitute a reasonable range of assessment information for decision-making could help teachers develop useful assessments, first for themselves and their students and secondly for large scale programs. State programs could provide powerful professional development materials to practicing teachers through such materials.

For too long, rules for creating and evaluating external tests have been seen as the ideal for obtaining valid and reliable information about learning in the classroom.
This has led to a lack of fit between the needs of teachers and the notions of assessment professionals. With the current awareness of the importance of assessment among teachers, school administrators, and policy-makers, the classroom has the potential to be a much more powerful and complete source of assessment information. To achieve this potential, however, we must begin with frameworks for measurement constructs that fit the classroom context, teach teachers how to use these frameworks to improve the quality of their assessments, and ensure that external uses of classroom assessment information attend to these frameworks when deciding how to incorporate classroom assessments into large scale programs.

References

Airasian, P. (1991). Perspectives on measurement instruction for pre-service teachers. Educational Measurement: Issues and Practice, 10 (1), 13-16, 26.
Airasian, P. (1991). Classroom Assessment, First Edition. New York: McGraw-Hill.
Airasian, P. (1993). Classroom Assessment, Second Edition. New York: McGraw-Hill.
American Association for Counseling and Development (1989, May 11). The responsibilities of test users. Guidepost, 12, 16, 18, 27.
American Federation of Teachers, National Council on Measurement in Education, & National Education Association. (1990). Standards for teacher competence in educational assessment of students. Washington, DC: Author.
American Psychological Association Committee on Children, Youth and Families, Committee on Testing and Assessment & Committee on Ethnic Minority Affairs. (1992). Psychological testing of language minority and culturally different children. Washington, DC: Author.
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
American Speech-Language-Hearing Association. (1991). Code of ethics of the American Speech-Language-Hearing Association. Rockville, MD: Author.
Anderson, L., Blumenfeld, P., Pintrich, P. I., Clark, C., Marx, R., & Peterson, P. (1995). Educational psychology for teachers: Reforming our courses, rethinking our roles. Educational Psychologist, 30, 143-157.
Bloom, B. S., Madaus, G. F., & Hastings, J. T. (1981). Evaluation to Improve Learning. New York: McGraw-Hill Book Company.
Bricklin, B., & Bricklin, P. M. (1967). Bright child--poor grades. New York: Dell.
Butler, R., & Nisan, M. (1986). Effects of no feedback, task related comments, and grade on intrinsic instruction and performance. Journal of Educational Psychology, 78, 210-216.
Canady, R. L., & Hotchkiss, P. R. (1989). It's a good score! Just a bad grade. Phi Delta Kappan, 71 (1), 68-71.
Cohen, D., & Ball, D. L. (1990). Relations between policy and practice: A commentary. Educational Evaluation and Policy Analysis, 12 (3), 331-338.
Covington, M. V., & Beery, R. G. (1976). Self-worth and school learning. New York: Holt, Rinehart, 42-63, 77-87.
Covington, M. V., & Omelich, C. L. (1984). Task-oriented versus competitive learning structures: Motivational and performance consequences. Journal of Educational Psychology, 76, 1199-1213.
Cronbach, L. J. (1970). Essentials of Psychological Testing, Third Edition. New York: Harper & Row, Publishers.
Deci, E., & Ryan, R. (1985). Intrinsic motivation and self-determination in human behavior. New York: Plenum.
Deci, E., & Ryan, R. (1987). The support of autonomy and the control of behavior. Journal of Personality and Social Psychology, 53, 1024-1037.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4 (4), 289-303.
Galton, F. (1889). Natural inheritance. New York: Macmillan.
Glesne, C. & Peshkin, A. (1992). Being there: Developing understanding through participant observation. Becoming Qualitative Researchers: An Introduction. White Plains, NY: Longman, 39-61.
Grossman, P. L. (1991). Overcoming the apprenticeship of observation in teacher education coursework. Teaching and Teacher Education, 7 (4), 345-357.
Grossman, P. L., Wilson, S. M., & Shulman, L. S. (1989). Teachers of substance: Subject matter knowledge for teaching. In M. C. Reynolds (Ed.), Knowledge base for the beginning teacher. New York: Pergamon.
Gullickson, A. R. (1986). Teacher education and teacher-perceived needs in educational measurement and evaluation. Journal of Educational Measurement, 23 (4), 347-354.
Gullickson, A. R. (1993). Matching measurement instruction to classroom-based evaluation: Perceived discrepancies, needs, and challenges. In S. L. Wise & J. C. Conoley (Eds.), Teacher training in measurement and assessment skills. Lincoln, NE: Buros Institute of Mental Measurements, University of Nebraska.
Hanna, G. S. (1993). Better Teaching Through Better Measurement. Fort Worth, TX: Harcourt Brace Jovanovich College Publishers.
Hieronymus, A. N., & Hoover, H. D. (1987). Iowa Tests of Basic Skills: Writing Supplement Teacher's Guide. Chicago: Riverside Publishing Company.
Houghton Mifflin Company, American Heritage Dictionary of the English Language, Morris, W. (Ed.). Boston: Author, 1098.
Linn, R. L. (1990). Essentials of student assessment: From accountability to instructional aid. Teachers College Record, 91 (3), 422-436.
Linn, R. L., & Burton, E. (1994). Performance-based assessment: Implications of task specificity. Educational Measurement: Issues and Practice, 13 (1), 5-8, 15.
Linn, R. L., & Gronlund, N. E. (1995). Measurement and Assessment in Teaching, Seventh Edition. Englewood Cliffs, NJ: Merrill, an imprint of Prentice Hall.
Lortie, D. (1975). Schoolteacher: A sociological study. Chicago: University of Chicago Press.
Mehrens, W. A., & Lehmann, I. J. (1991). Measurement and Evaluation in Education and Psychology, Fourth Edition. Fort Worth, TX: Holt, Rinehart, and Winston, Inc.
Messick, S. (1989). Validity. In Educational Measurement. Robert Linn (Ed.). Washington, DC: American Council on Education.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23 (2), 5-12.
Moss, P. A. (1996). Enlarging the dialogue in educational measurement: Voices from interpretive research traditions. Educational Researcher, 25 (1), 20-28.
National Council of Teachers of Mathematics (1991). Mathematics Assessment: Myths, Models, Good Questions, and Practical Suggestions. Reston, VA: Author.
National Council of Teachers of Mathematics (1995). Assessment Standards for School Mathematics. Reston, VA: Author.
Nicholls, J. G. (1989). The competitive ethos and democratic education. Cambridge, MA: Harvard University Press.
Nitko, A. J. (1996). Educational Assessment of Students. Englewood Cliffs, NJ: Merrill, an imprint of Prentice Hall.
Nolen, S. B., & Nicholls, J. G. (1994). A place to begin again in research on student motivation: Teachers' beliefs. Teaching and Teacher Education, 10, 57-69.
Oosterhof, A. (1996). Developing and Using Classroom Assessments. Englewood Cliffs, NJ: Merrill, an imprint of Prentice Hall.
Rekase, M. (1995, December). Using portfolios for high stakes assessments. Presented at the Washington State Assessment Conference, Seattle, WA.
Resnick, L. B. & Resnick, D. P. (1991). Assessing the thinking curriculum: New tools for educational reform. In B. R. Gifford & M. C. O'Connor (Eds.), Changing assessments: Alternative views of aptitude, achievement, and instruction. Boston: Kluwer.
Salvia, J., & Ysseldyke, J. E. (1995). Assessment, Sixth Edition. Boston: Houghton Mifflin Company.
Schafer, W. D. (1991). Essential assessment skills in professional education of teachers. Educational Measurement: Issues and Practice, 10 (1), 3-6, 12.
Schafer, W. D., & Lissitz, R. W. (1987). Measurement training for school personnel: Recommendations and reality. Journal of Teacher Education, 38 (3), 57-63.
Schon, D. A. (1987). Educating the reflective practitioner: Toward a new design for teaching and learning in the professions. San Francisco: The Jossey-Bass Higher Education Series.
Schwab, J. J. (1978). Science, curriculum, and liberal education. Chicago: University of Chicago Press.
Shavelson, R. J. & Baxter, G. P. (1992, May). What we've learned about assessing hands-on science. Educational Leadership, 20-25.
Smith, M. L. (1991). Meanings of test preparation. American Educational Research Journal, 28 (3), 521-542.
Sommers, N. (1982). Responding to student writing. College Composition and Communication, 33 (2), 148-156.
Stiggins, R. J. (1991). Relevant training for teachers in classroom assessment. Educational Measurement: Issues and Practice, 10 (1), 7-12.
Stiggins, R. J. (1994). Student centered classroom assessment. New York: Merrill, an imprint of Macmillan College Publishing Company.
Stiggins, R. J., & Bridgeford, N. J. (1988). The ecology of classroom assessment. Journal of Educational Measurement, 22 (4), 271-286.
Stiggins, R. J., & Faires-Conklin, N. (1988). Teacher training in assessment. Portland, OR: Northwest Regional Educational Laboratory.
Stiggins, R. J., & Faires-Conklin, N. (1992). In teachers' hands: Investigating the practices of classroom assessment. Albany, NY: SUNY Press.
Stiggins, R. J., Faires-Conklin, N., & Bridgeford, N. J. (1986). Classroom assessment: A key to effective education. Educational Measurement: Issues and Practice, 5 (2), 5-17.
Stiggins, R. J., Griswold, M. M., & Wikelund, K. R. (1989). Measuring thinking skills through classroom assessment. Journal of Educational Measurement, 26 (3), 233-246.
Stuck, I. (1995, April). Heresies of the new unified notion of test validity. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.
Taylor, C. S. (in press). Using portfolios to teach teachers about assessment: How to survive. Educational Assessment.
Taylor, C. S. & Nolen, S. B. (1996). A contextualized approach to teaching teachers about classroom-based assessment. Educational Psychologist, 31 (1), 77-88.
Toom, A. (1993). A Russian teacher in America. Journal of Mathematical Behavior, 12, 117-139.
Valencia, S. (1990). A portfolio approach to classroom reading assessment: The whys, whats, and hows. Reading Teacher, 43 (4), 338-340.
Vidich, A. J., & Lyman, S. M. (1994). Qualitative methods: Their history in sociology and anthropology. In Handbook of Qualitative Research, Denzin, N. K. & Lincoln, Y. S. (Eds.). Thousand Oaks, CA: SAGE Publications, Inc., 23-59.
Whyte, W. F. (1943). Street corner society: The social structure of an Italian slum. Chicago: University of Chicago Press.
Wise, S. L., Lukin, L. E., & Roos, L. L. (1991). Teacher beliefs about training in testing and measurement. Journal of Teacher Education, 42 (1), 37-42.
Wolf, D. P. (1991). Assessment as an episode of learning. In R. Bennett and W. Ward (Eds.), Construction versus choice in cognitive measurement. Hillsdale, NJ: Lawrence Erlbaum Associates.
Worthen, B. R., Borg, W. R., & White, K. R. (1993). Measurement and Evaluation in the Schools. New York: Longman.

Appendix A

Course evaluation items common across all evaluation forms

Section                            Item  Stem
1: General Evaluation              1     Course as a whole
                                   2     Course content
                                   3     Instructor's contribution to the course
                                   4     Instructor's effectiveness in teaching the subject matter
2: Feedback to Instructor          1     Course organization
                                   3     Explanations by instructor
                                   4     Instructor's ability to present alternative explanations
                                   5     Instructor's use of examples and illustrations
                                   7     Student confidence in instructor's knowledge
                                   8     Instructor's enthusiasm
                                   11    Availability of extra help when needed
3: Information to Other Students   1     Use of class time
                                   2     Instructor's interest in whether students learned
                                   3     Amount you learned in the course
                                   4     Relevance and usefulness of course content
                                   5     Evaluative and grading techniques (tests, papers, projects)
                                   6     Reasonableness of assigned work
                                   7     Clarity of student responsibilities

Appendix B

Post-course survey items:

1. Please check the methods of assessment you are using in your field placement (list of 12 types of assessment, including worksheets, lab write-ups, observational records, paper-pencil tests, written reports, portfolios, peer evaluations)
2. Use the pie chart below to estimate the portion of your planning time you use each week to do the following activities (various planning activities, including planning lessons, assessments, units, writing objectives, etc.)
3. For each of the following situations, how often do you think about assessment issues? (3-point scale: frequently, sometimes, rarely); list of ten situations, including teaching, grading, planning instruction, observing other teachers, riding to and from work.
4. Thinking back on (the course) have any ideas or other aspects of the course influenced your teaching? If so, what part of (the course) has influenced your teaching the most? How has this influenced your teaching?
5. Have you had any new thoughts, questions, or understandings about assessment this quarter? If so, what are they?
6. Have you wrestled with any validity issues in your field placement this quarter? If so, please describe one such issue.
7. Have you wrestled with any reliability issues in your field placement this quarter? If so, please describe one such issue.
8. Have you taught all or part of the unit you designed for EDPSY 308? (For traditional course students: Have you used any of the materials or assessments you developed?)
9. If so, how helpful was the original plan or planning process? (5-point scale)

About the Authors

Catherine S. Taylor
Assistant Professor of Educational Psychology
312 Miller Hall, Box 353600
University of Washington
Seattle, WA 98195-3600
Voice phone: 206-543-1139
FAX: 206-543-8439
E-mail: firstname.lastname@example.org
EDUCATION
Ph.D. University of Kansas, 1986: Educational Psychology and Research
M.S.E. University of Kansas, 1978: Counseling Psychology
B.S.E. University of Kansas, 1974: Language Arts Education

EMPLOYMENT
1991- Assistant Professor, University of Washington, Educational Psychology-Research and Measurement
1986-1991 Senior Editor/Senior Project Manager, CTB/McGraw-Hill
1984-1986 Psychometrician, Psychological Corporation

RESEARCH INTERESTS
My main research focuses are large scale assessment development issues, validity theory, test theory, and research in the preparation of teachers. Current projects include studies of different scoring methods for performance-based tests in mathematics, reading, and writing, and a study of the philosophical foundations for and the social consequences of tests and testing practices.

Susan Bobbitt Nolen
Associate Professor of Educational Psychology
322 Miller Hall, Box 353600
University of Washington
Seattle, WA 98195-3600
Voice phone: 206-543-4011 ('96-'97 only) 206-543-1846
Fax: 206-543-8480 ('96-'97 only) 206-543-8439
email@example.com

EDUCATION
Ph.D. Purdue University, 1986: Educational Psychology
M.Ed. Lewis & Clark College, 1976: Education of the Hearing-Impaired
B.A. Portland State University, 1975: Speech Pathology & Audiology

EMPLOYMENT
1990- Associate Professor, University of Washington, Educational Psychology-Human Development & Cognition
1986-90 Assistant Professor, Arizona State University West, Educational Psychology
1978-80 Teacher, Oregon School for the Deaf, Salem, OR, High School English and Reading
1976-77 Teacher, Lacey Elementary School, Lacey, WA, North Thurston Regional Program for the Hearing-Impaired

RESEARCH INTERESTS
My main research focus is the relationship between motivation and learning, and how this relationship develops over time. Current projects include investigations of how motivation develops differently depending on the learners' interpretation of their social context for learning. A second interest is in assessment in schools, and the effects of various policies and practices on teacher and student motivation.

Copyright 1996 by the Education Policy Analysis Archives

EPAA can be accessed either by visiting one of its several archived forms or by subscribing to the LISTSERV known as EPAA at LISTSERV@asu.edu. (To subscribe, send an email letter to LISTSERV@asu.edu whose sole contents are SUB EPAA your-name.) As articles are published by the Archives they are sent immediately to the EPAA subscribers and simultaneously archived in three forms. Articles are archived on EPAA as individual files under the name of the author and the Volume and article number. For example, the article by Stephen Kemmis in Volume 1, Number 1 of the Archives can be retrieved by sending an e-mail letter to LISTSERV@asu.edu and making the single line in the letter read GET KEMMIS V1N1 F=MAIL. For a table of contents of the entire ARCHIVES, send the following e-mail message to LISTSERV@asu.edu: INDEX EPAA F=MAIL, that is, send an e-mail letter and make its single line read INDEX EPAA F=MAIL.

The World Wide Web address for the Education Policy Analysis Archives is http://seamonkey.ed.asu.edu/

Education Policy Analysis Archives are "gophered" in the directory Campus-Wide Information at the gopher server INFO.ASU.EDU.

To receive a publication guide for submitting articles, see the EPAA World Wide Web site or send an e-mail letter to LISTSERV@asu.edu and include the single line GET EPAA PUBGUIDE F=MAIL. It will be sent to you by return e-mail.
Contributed Commentary on Volume 4 Number 17: Taylor & Nolen Reframing Assessment Concepts

19 December 1996

Jonathan A. Plucker
University of Maine
plucker@maine.maine.edu

I read with considerable interest Taylor and Nolen's (1996) recent article on reconceptualizing assessment concepts. As a faculty member who is responsible for teaching undergraduate and graduate education students these concepts and who is very concerned with the shortcomings of traditional methods for teaching such concepts, the description of efforts at the University of Washington was quite helpful and thought-provoking. However, the theoretical underpinnings of the course disappointed me - I find the end agreeable but not the means.

A few caveats are worth mentioning before I share my comments. First, I hesitated to respond to the original article because I am afraid that any commentary will be perceived as a blind defense of traditional psychometrics - a perception with which I am not comfortable. But I believe that educators and psychologists criticize standardized testing much too harshly and promote the advantages of alternative assessments far too enthusiastically. For example, alternatives to traditional psychometric approaches are often fraught with important problems (Plucker & Renzulli, in press; Ruiz-Primo & Shavelson, 1996), students from certain ethnic groups may be as culturally biased in favor of standardized testing as others are biased against it (Plucker, 1996), and both theoretical (Sternberg, 1994) and empirical evidence (e.g., Bridgeman & Morgan, 1996) suggests that individuals with specific learning styles may prefer standardized testing over alternative assessments. However, these reservations do not prevent me from researching the use of alternative assessments or working with my students to develop nontraditional assessments.
Indeed, in two of my ma jor areas of interest -creativity and gifted education -teacher checklists, performance-based assessments, and teacher/parent/peer nominations have been used for decades by educators and researchers. My views are heavily influenced by my work in both of these areas, and i t is through this lens that my comments should be viewed. The following five points are meant to be a spring board for future discussion, since the ideas raised by Taylor and Nolen (1996) are certain ly timely and very important. First, is the growth of alternative assessment due to "a growing belief that the teacher can be one of the best sources of information about student learning" or i s it due to a lack of satisfaction with traditional (i.e., standardized) assessment? Research on teache r accuracy in the assessment of students calls the opening statement into question (Guskin, Peng, & Simon, 1992; Hocevar & Bachelor, 1989; Holland, 1959; Pegnato & Birch, 1959; Plucker, Call ahan, & Tomchin, 1996). This is a minor point, but historically an important one. For examp le, if the purpose of alternative assessment adoption is to create assessments that are less bia sed toward students from specific ethnic groups, then the bias inherent in alternative assessments b ecomes a stumbling block and focus of future
development efforts.

Second, the overarching issue may be that the techniques we use to teach measurement concepts, not the content, need to be improved. Taylor and Nolen note that teachers do not "perceive the information learned in traditional tests and measurement courses to be relevant" to classroom contexts, that "few teacher preparation programs provide adequate training for the wide array of assessment strategies used by teachers", and that "teachers do not believe they have the training needed to meet the demands of classroom assessment." They also discuss the ways in which measurement and assessment texts fail to aid the teaching of measurement concepts. Most individuals responsible for preparing future teachers would agree with the authors' summary. But rather than argue that our efforts and the texts fail due to insufficient theoretical foundations, why not argue that the content and text are merely passive objects that are actively manipulated by teachers to create learning experiences for students? The same argument is used by critics of the way we instruct future teachers to foster creativity, apply knowledge of motivation and behavior management, and even construct a realistic lesson plan. All of these areas are marked by a call for greater curricular application and application of principles of situated cognition, but not by a call for a complete revision of content. Why not? Because it may largely be unnecessary.

In the interest of brevity, I will not analyze the authors' reconceptualization of reliability and validity and the resulting description of the courses in detail. But readers should be aware that many of the underlying characteristics of the authors' work are really not any different than those addressed by traditional psychometricians. "Validity Dimension 1" is content validity, Dimension 2 is item or task analysis and construct validity, Dimension 3 is content validity again but from the perspective of the assessment, Dimension 4 is the detection of bias and criterion-related validity, and Dimension 5 is a consideration of the social implications of assessment and score interpretation. All of these concepts are certainly worthwhile (I especially like the emphasis on social implications), and most educators could provide dozens of studies that reinforce the inclusion of each dimension. And while the notion of the objective observer has held too much importance in the past, modern conceptions of reliability and validity (such as intra- and inter-rater agreement and criterion-related validity) tacitly acknowledge the fallibility and bias associated with assessment and evaluation. In most of the examples and discussion provided by Taylor and Nolen, familiar psychometric concepts and ideas are merely recast in postmodern terminology.

Fourth, abstract concepts (e.g., standard error of measurement [SEM]) and traditional standards of psychometric quality still need to be taught. Most, if not all, students take standardized tests as they progress through the educational system, and many of our future teachers will administer these tests and/or advise students who are about to take them. A case in point, and one that I use with my undergraduates, is the importance of standard errors of measurement. The students find this topic to be quite dry and lacking in application when I introduce the topic, but they begin to see the importance of SEM after we discuss several high-stakes applications (including school-by-school test score comparisons, which are known to influence parental decision-making in climates of school choice [Maddaus & Marion, 1995]).
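To make the standard error of measurement concrete, here is a minimal sketch. It assumes the classical test theory formula, SEM = SD x sqrt(1 - reliability), which the commentary itself does not spell out. The function names and the numbers (a scale standard deviation of 15, a reliability of 0.91, an observed score of 100) are hypothetical illustrations, not values from Maddaus and Marion (1995) or from Taylor and Nolen.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed, sem, z=1.0):
    """Return the (low, high) band of z SEMs around an observed score."""
    return observed - z * sem, observed + z * sem


# Hypothetical values chosen only for illustration.
sem = standard_error_of_measurement(sd=15.0, reliability=0.91)
low, high = score_band(observed=100.0, sem=sem)
print(f"SEM = {sem:.1f}; band = {low:.1f} to {high:.1f}")
```

With these made-up values the SEM comes to about 4.5 points, so an observed score of 100 is better read as a band of roughly 95.5 to 104.5 than as a single point. That is the kind of caution that matters when small score differences feed high-stakes comparisons.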
The question becomes one less of replacing traditional concepts and more of modifying our coursework and course sequences to include additional concepts. If qualitative inquiry has taught us nothing else, it has shown that multiple perspectives can be taught successfully within the framework of a single course.

Fifth, the most important issue may be the distinction between norm-referenced and criterion-referenced measurement. As Taylor and Nolen note, "classroom teachers are less interested in the consistency of student performance across similar measures than they are in whether students learn what [teachers] are teaching (the targeted constructs)." Alternative assessments used for high stakes (i.e., norm-referenced) purposes should be required to meet traditional standards of reliability and validity. As the authors state, "the meanings of assessment
in the context of the classroom must be considered carefully when large scale assessment programs decide to use classroom assessments for the purposes of district, state, or national accountability." At the same time, classroom-based assessment and evaluation used for primarily criterion-referenced purposes should be held to slightly different standards. Attention has been focused on the type of assessment and not how it is used (which is the most important aspect of measurement and evaluation).

Finally, given the validity concerns surrounding the use of alternative assessments (e.g., Ruiz-Primo & Shavelson, 1996), educators must avoid the appearance of calling for new conceptions of reliability and validity because they cannot produce high quality alternative assessments as judged by traditional standards. While this was almost certainly not the authors' intent, it is not altogether impossible to understand why critics of alternative assessment infer this logic from our reasoning. Arguing that teachers and future teachers do not learn measurement concepts because of the way in which they are taught is reasonable; arguing that they do not learn measurement concepts because of how they are taught AND because the concepts are not applicable is more of a stretch.

In conclusion, I find much within the Taylor and Nolen article with which to agree. Indeed, if they had simply described their course, which was "designed to engage students in tasks relevant to their own work as preservice teachers and demand that they consider assessment in the context of disciplinary structures and instructional practices," I would have filed the article away in a folder that was easily accessible for myself, my colleagues, and my students. But the authors' proposed reconceptualization of psychometric concepts is merely a presentation of the wolf in postmodernism's clothing.
Educators need to begin questioning whether we need to replace our conceptualizations of and standards for psychometric quality or expand the conceptualizations and the teaching of them to incorporate fresh perspectives. The latter course is more reasonable and more feasible than the former.

References

Bridgeman, B., & Morgan, R. (1996). Success in college for students with discrepancies between performance on multiple-choice and essay tests. Journal of Educational Psychology, 88, 333-340.

Guskin, S. L., Peng, C.-Y. J., & Simon, M. (1992). Do teachers react to "multiple intelligences"? Effects of teachers' stereotypes on judgments and expectancies for students with diverse patterns of giftedness/talent. Gifted Child Quarterly, 36, 32-37.

Hocevar, D., & Bachelor, P. (1989). A taxonomy and critique of measurements used in the study of creativity. In J. A. Glover, R. R. Ronning, & C. R. Reynolds (Eds.), Handbook of creativity (pp. 53-75). New York: Plenum Press.

Holland, J. L. (1959). Some limitations of teacher ratings as predictors of creativity. Journal of Educational Psychology, 50, 219-223.

Maddaus, J., & Marion, S. F. (1995). Do standardized test scores influence parental choice of high school? Journal of Research in Rural Education, 11, 75-83.

Pegnato, C. W., & Birch, J. W. (1959). Locating gifted children in junior high schools: A comparison of methods. Exceptional Children, 25, 300-304.

Plucker, J. A. (1996). Gifted Asian American students: Curricular and counseling concerns.
Journal for the Education of the Gifted, 19, 315-343.

Plucker, J. A., Callahan, C. M., & Tomchin, E. M. (1996). Wherefore art thou, multiple intelligences? Alternative assessments for identifying talent in ethnically diverse and economically disadvantaged students. Gifted Child Quarterly, 40, 81-92.

Plucker, J. A., & Renzulli, J. S. (in press). Psychometric approaches to the study of creativity. In R. J. Sternberg (Ed.), Handbook of human creativity.

Ruiz-Primo, M. A., & Shavelson, R. J. (1996). Rhetoric and reality in science performance assessments: An update. Journal of Research in Science Teaching, 33, 1045-1063.

Sternberg, R. J. (1994, Nov.). Allowing for thinking styles. Educational Leadership, 52(3), 36-40.

Taylor, C. S., & Nolen, S. B. (1996). "What does the psychometrician's classroom look like?": Reframing assessment concepts in the context of learning. Education Policy Analysis Archives, 4(17). (WWW URL: http://olam.ed.asu.edu/epaa)
Contributed Commentary on Volume 4 Number 17: Taylor & Nolen Reframing Assessment Concepts

19 December 1996

Rick Garlikov email@example.com
Catherine S. Taylor firstname.lastname@example.org
Susan Bobbitt Nolen email@example.com

The following is an e-mail exchange that Susan B. Nolen and Catherine S. Taylor had with Rick Garlikov in response to follow-up questions Garlikov had about their article "What Does the Psychometrician's Classroom Look Like?: Reframing Assessment Concepts in the Context of Learning" (EPAA Volume 4 Number 17).

Garlikov: I have read "through" (more than cursorily, but less than thoroughly) your EPAA paper, and I have some questions.

1) If teachers and pre-service teachers are evaluating their own assessment instruments, how can they see the flaws that they didn't see when they made up the instruments? Isn't it almost impossible to evaluate your own evaluation efforts very accurately? E.g., a friend made up a test one time where her students had to rank the developmental order of some five or more stages of development. But she took off for each one that was not in the correct stage, i.e., first, second, third, etc. The problem was that if a kid missed the first one and then had the right order of the rest, but each one slid back one notch, they missed them all. But a kid who had no clue as to the order might get two or more in the right slot by just guessing. My point was that my friend didn't see the problem with this test question. She said each part was worth only five points, but actually, each part was worth much more, because any wrong answer COULD throw off the other ones. If SHE were evaluating her test, she would have said it was a good test. Yes, no? How does your course deal with this aspect of judging one's own evaluations of students?

Nolen: This is why the draft-feedback-revision loop is so critical.
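The scoring problem in Garlikov's first question can be made concrete with a small sketch. It assumes five stages labeled A to E, a hypothetical response in which the first stage is misplaced and every other stage shifts by one slot, and a hypothetical guessed response. The slot-by-slot scorer mirrors the rule he describes; the pairwise scorer is one possible alternative added here for contrast, not something proposed in the exchange.

```python
from itertools import combinations


def positional_score(key, response):
    """Count items placed in exactly the slot the key assigns them."""
    return sum(1 for k, r in zip(key, response) if k == r)


def pairwise_score(key, response):
    """Count pairs of items whose relative order matches the key."""
    position = {item: i for i, item in enumerate(response)}
    return sum(1 for a, b in combinations(key, 2) if position[a] < position[b])


key = ["A", "B", "C", "D", "E"]      # correct developmental order (hypothetical labels)
shifted = ["B", "C", "D", "E", "A"]  # one stage misplaced, the rest in the right relative order
guess = ["E", "B", "A", "D", "C"]    # a hypothetical blind guess

for name, response in (("shifted", shifted), ("guess", guess)):
    print(name, positional_score(key, response), pairwise_score(key, response))
```

Under slot-by-slot scoring, the student who knows the relative order of four of the five stages earns 0 of 5, while the guesser earns 2 of 5. Counting correctly ordered pairs reverses that ranking (6 of 10 versus 4 of 10).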
Our job as instructors is to find the flaws and point them out. Often this entails telling the intern how students might interpret a misleading or unclear question, playing the role of their students. In the course of the term, most students are not able to actually try out items and assessments, though some do. More try them out the following quarter when they are responsible for teaching a unit of this size.

The fact that we require students to construct model responses and scoring rules for their items, as well as write rationales for assessments, scoring, and objectives (and their relationship), provides another way for them to see the problems with their initial work. This comes through pretty clearly in their self-evaluations.

Garlikov: 2) It SEEMS to me, at least today, that if you cover all the things in your
course that you discuss in the paper, that kids will be hard-pressed to get an intuitive understanding of testing or evaluating students, though they will probably get an intuitive understanding of the problems of testing or evaluating students. It seems to me there is too much specific, technical detail involved for understanding to become likely. It may be that I just tried to read too much in too short a time, but I felt the quantity and nature of the material you discussed made the concept of evaluating more complex than it had to be for a student. Your article seems good for someone who already understands general forms and limits of evaluations (it gives nice details in a systematic and thorough way), but I worry about how much a pre-service teacher could absorb of it all. Or do you teach about this in a way very different from the way you constructed this particular article?

Nolen: In the course we try to teach both by presenting some information through readings and lecture-ettes, and by having students construct a cohesive unit plan. Although the presentation part is necessary, they don't really begin to *learn* it until they try to put the theories and methods into action. This learning (we think) continues long after the course is over as they try to assess fairly and informatively in their own classrooms. There is a lot of information in the course, but I think it is made learnable in several important ways:

First, all of the activities are embedded in their unit plans, rather than merely appearing as unrelated activities. The unit description and goals lead to the learning targets (objectives), which along with their knowledge of the subject-matter disciplines lead to appropriate methods of instruction and assessment. Because the most important mode of instruction for us is also our mode of assessment, we model for them how assessment can be an instructional tool.
Second, we consciously draw on what they have studied in their other teacher prep courses, and expect them to use their knowledge from those courses to justify their assessment decisions. Thus the assessment content is seen, in part, as a natural extension of what they have been learning in the program.

Finally, remember that students have been assessed a lot by the time we see them, and most have also seen and/or assisted with assessment in their field experiences. Most are quite concerned about being able to fairly and accurately assess their own students. From our experience, this (along with subject-matter preparation) is a sufficient base on which to build their knowledge of assessment. I'm not sure what you mean by intuitive understanding, exactly; I know that several students have told us that they can no longer think about planning instruction without thinking about assessment as part of instruction. It seems, for many, to have become part of their instructional schema.

As you say, the article seems good for those who already know something about assessment: That is to whom it is directed. We don't have our students read the article; we have them learn by doing with feedback.

Taylor: We know that learning in a University context is imperfect and that our students will continue to learn after they take our course. At this time, students take the assessment course during the 2nd or 3rd quarter of a 5 quarter sequence. Our students have an opportunity to use their work in the field after the course is completed and to give us feedback about how things are going. We literally have students tell us "thank you" in the halls during the quarters following our course. They claim that the thinking they learned to do in the course helps them develop strategies to focus their teaching better and to plan their teaching better. One even
said, "My students even thank you." As Susan said above, rather than this course resulting in superficial coverage of assessment concepts and skills disconnected from other elements of teaching, the course is set up so that they grapple with the meanings of assessment concepts in a meaningful context: their own plans for teaching subject matter. I guess I'd have to say that I don't know what "intuitive" means in the sense you are using, but a habit of mind about how to stay clear about the goals of instruction, how instruction helps students reach those goals, and how assessment actually assesses for students' learning of those goals seems to me to be a powerful "intuitive" process.

As a parent, much of what I see in my own children's school experiences is random or textbook driven. My children are learning more about how to please the idiosyncrasies of teachers than they are substantive conceptual or procedural understandings (or even social understandings). The kinds of assessments they have reinforce a notion that science (or social studies, or English) is a list of facts to be memorized or is a teacher whim. I ask them "what did you learn from that experience" fairly regularly and am dismayed by the responses.

Garlikov: As Susan knows from my EDPOLYAN/EDPOLICY writing, this is one of the things about schools that frustrates me the most. Some of it is due, of course, as you say, to assessment techniques that give the impressions they do, but I also suspect many teachers (and many adults in general) really DO think that these subject matters really are some specific body of facts that need to be learned or memorized. So they may actually be assessing in ways that reinforce what they intend to be teaching. Which, if so, is, of course, disappointing to me.
Taylor: I hope that what we give our students is a way of thinking that helps them not only be technically better assessors, but better, more focused, more fair teachers who use assessment as ways to assess student learning and to communicate to students about what is important to learn. Because our students' first passion is helping students learn, many can embrace this view of assessment more easily than they can a view that portrays assessments as tools for dispassionate observers and graders.

The spin off is also surprising. Today, in reviewing files for a teacher education scholarship, the winner was one whose cooperating and supervising teachers both stated that his teaching was focused, his goals for instruction were clear to students, and his assessments were "eagerly embraced" by students because they "knew what the purpose was."

Garlikov: Thanks for answering my questions. I think you pretty much addressed what I asked, but I have one more simple question, and then a much more important question.

Simple question: Do you think that your students understand that the kind of crucial feedback you give them when you go over their assessment plans, etc. is what they also need to see their students as doing if and when their students complain about particular items or scoring, etc.? That is, can your students generalize from the kind of thing you are doing in this regard, or is it that they pay attention to you because a) you are the teacher so they pay attention to your claims of invalidity (or some other sort of flaw), or b) you can articulate the flaws in their assessments extremely well and cogently? In short, are you able to get them to see that whenever one of their students might complain about the reliability or validity
(or fairness or whatever) of a particular evaluation tool, they need to really listen and try to figure out whether there is any merit to the claim?

Nolen: For some, yes. Probably for most in theory. We talk some about the power dynamics involved in these things, and in both the preceding ed psych course and this course we emphasize listening hard to students. (In fact, their major project for the previous course is to do just that: Listen hard to two students talking about what they learned in two consecutive class periods, and try to explain why that's what they learned.) We are not the only ones in their program who model this, though not all profs do as much to encourage revision and rethinking because of the time required.

Garlikov: Difficult question: *I* don't know how to construct good "formal", individual exams (or paper assignments) about philosophical/conceptual/logical issues that measure what I wanted students to learn. I only get some ideas about what students might have learned by continuing back-and-forth dialogue that further probes answers they give, questions they ask, and comments they make as we go along. This almost never ends up giving me the impression that their initial answers gave me; and often I am left with the vague feeling that if we carried it even further, they would either change or they would give me an even different/better understanding of what they know. So I am never happy with a time-slice assessment except in those few cases where students either seem to have learned nothing about a particular issue or where they seem to express remarkably perceptive, genuinely independently discovered, reflective views. Since much of what I think is important in schools IS of a philosophical or conceptual or logical nature, how can teachers in general design formal assessments that can be said, or shown, to accurately reflect what students really know or understand? And how can one tell such assessments do that?
An example of the problem is almost any discussion on EDPOLICY [an educational listserv forum to which Susan Nolen and Rick Garlikov subscribe], where it takes a number of responses back and forth for everyone to even be clear about what is being asked or responded to, if it occurs even then. The initial tendency is to feel that someone is dead wrong or terribly misguided in some way, perhaps in some cases not even very bright. But it usually turns out that they were making a fairly cogent interesting point that was just difficult to express in some way that everyone could understand it. Yet teachers or assessment "instruments" often don't give students a chance to clarify, discuss, argue, clarify some more, etc. Or do you have a way around that?

Nolen: We just try to model and talk about listening to students, especially trying to understand the responses that seem off the wall. We give examples from our own teaching (both in the u. and in public schools) of times we found great insight lurking in what seemed initially to be a wacky answer. And we try to model this during the (many) discussions we have in our classes. Capturing some part of the essence of a discipline, including the habits of mind or approaches as well as the big ideas and questions, is something our students are encouraged (required) to struggle with in several courses. It often seems to come to a head in the assessment course, where they have to be very clear about what is important to learn and how they will know when their students have learned it, AND how that learning (and assessment) is what Bruner calls "intellectually honest" or what Schwab calls "true to the discipline." They will continue (we hope) to struggle with this throughout their
careers, as we do.

Garlikov: You both mentioned you were not certain what I meant by "intuitive" understanding when I wrote:

    It SEEMS to me ... that if you cover all the things in your course that you discuss in the paper, that kids will be hard-pressed to get an intuitive understanding of testing or evaluating students, though they will probably get an intuitive understanding of the problems of testing or evaluating students.

You did answer it in a way that pleases me, however. Catherine's additional comments particularly helped in light of what Susan had already written about the "details". I take it now that, although when your students leave your course they have some particular techniques and methods for assessing students in ways that are valid and instructionally reinforcing and useful, the most important thing you give them is a sense for how assessment needs to work in classroom teaching, and what the essential pitfalls and problems of assessment are. That is why I distinguished between an intuitive understanding of testing on the one hand, and an intuitive understanding of THE PROBLEMS (and issues) involved in testing. The latter is I think (1) the most important thing you could give your students, and (2) the only ingrained ("intuitive") thing you were likely to be able to give them. I didn't think that in one course you could teach pre-service teachers how to intuitively make up flawless assessments for their students.

I had had the FEELING as I had read the EPAA paper that there was a possibility you were claiming that you taught your students how to make up wonderful or perfect tests with no problem at all. But from your responses, I see that what you do (which is what I had hoped you were doing) is to give your students a really good understanding of what KINDS of things need to be done in evaluating students, and a really good understanding of why those KINDS of things are important.
Insofar as you helped them learn how to design some specific assessments, that is good; but it is better that you have helped them understand the concept of assessment, since the specifics may change for them as they get into situations different from any you may have anticipated, or as their instructional ideas and principles change.

Thanks for answering. And 'Good For You Guys' for doing such a good job teaching this sort of thing, and for writing the EPAA paper about it.