USF Libraries
USF Digital Collections

Educational policy analysis archives


Material Information

Educational policy analysis archives
Physical Description:
Arizona State University
University of South Florida
Place of Publication:
Tempe, Ariz
Tampa, Fla
Publication Date:


Subjects / Keywords:
Education -- Research -- Periodicals   ( lcsh )
non-fiction   ( marcgt )
serial   ( sobekcm )

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E11-00382
usfldc handle - e11.382
System ID:


Full Text


A peer-reviewed scholarly journal
Editor: Gene V Glass
College of Education
Arizona State University

Copyright is retained by the first or sole author, who grants right of first publication to the EDUCATION POLICY ANALYSIS ARCHIVES. EPAA is a project of the Education Policy Studies Laboratory. Articles published in EPAA are indexed in the Directory of Open Access Journals.

Volume 12, Number 33    July 20, 2004    ISSN 1068-2341

Reasonable Decisions in Portfolio Assessment: Evaluating Complex Evidence of Teaching

Aaron Schutz
University of Wisconsin-Milwaukee

Pamela A. Moss
University of Michigan

Citation: Schutz, A., & Moss, P. A. (2004, July 20). Reasonable decisions in portfolio assessment: Evaluating complex evidence of teaching. Education Policy Analysis Archives, 12(33). Retrieved [Date] from

A central dilemma of portfolio assessment is that as the richness of the data available to readers increases, so do the challenges involved in ensuring acceptable reliability among readers. Drawing on empirical and theoretical work in discourse analysis, ethnomethodology, and other fields, we argue that this dilemma results, in part, from the fact that readers cannot avoid forming the data of a portfolio into a pattern—a coherent "story" or "stories"—in order to evaluate it. Our article presents case studies of readers independently evaluating the same portfolios. We show that even readers who hold a shared vision of effective teaching and who cite much the same evidence can, nonetheless, develop significantly different "stories." Our analysis illustrates that some portfolios are more ambiguous than others and are thus more likely to result in such divergent readings. We argue that more fine-grained understandings of portfolio ambiguities and disagreements between readers over "stories" can help us respond to the challenges posed by the rich data of portfolio assessments. (Note 1)

Introduction

Imagine that you are sitting in the back of a classroom. Every passing moment is rich with events. Students fidget, make marks on paper, and whisper side comments to each other; the teacher asks questions, draws on the board, spreads her arms to indicate the size of an elephant; the principal makes an announcement over the intercom; a book falls from a desk; a passing cloud dims the sunlight coming through the windows.

Then imagine it is your job to assess the teacher's performance in this classroom. You immediately face a wide range of challenges and dilemmas. For your interpretation to be well-warranted, you must attend to a variety of relevant evidence. But interpretations like these always involve some level of selective abstraction and pattern-making. Some aspects must be foregrounded from the near infinite range of the noticeable, and this means other aspects will fade into the background. Further, as this paper will show, even when you decide what evidence matters, you must draw this together into a comprehensible pattern. And no matter how much evidence you collect, it represents only a small sample of the information that could be engaged with, always reflecting a particular understanding of what is important and what is not. In this article, we seek to illuminate some of these challenges and dilemmas involved in moving from evidence like this to well-warranted interpretations in consequential assessments of teaching.

The practice of assessment in American schools has been largely guided by the discipline of educational measurement.
Over time, measurement scholars have developed a shared set of strategies for drawing and warranting assessment-based interpretations--for "reasoning from evidence to inference" (Mislevy, Almond, and Steinberg, 2003). As Mislevy and colleagues noted, the general approach reflected in these strategies has been driven, in part, by the need to make conclusions about large numbers of individuals. By standardizing their approach, by collecting the same data from each person under the same conditions and using the same criteria and procedures to evaluate it, assessment developers can build a common argument for validity rather than having to reinvent the argument anew for each case. When an assessment system is operational, then, the set of possible scores (and the intended interpretations associated with them) is essentially predetermined: the goal is to use the available evidence to determine the most appropriate or likely score/interpretation for each individual. Importantly, however, assessment scholars understand that the common argument developed for any particular test will not necessarily hold for all individuals. Thus Mislevy and colleagues noted that assessors actually have dual responsibilities in their efforts to construct arguments for validity. Not only must they establish "the credentials of the evidence in the common argument," they must also detect "individuals for whom the common argument does not hold" (p. 15).


In this paper we focus on one crucial component of a common validity argument for performance assessments: the role of readers (raters, assessors, or judges) in interpreting/evaluating the available evidence about an individual. In particular, we focus on the role of readers in portfolio assessments that contribute to consequential decisions about individual teachers. (Note 2)

Most assessment systems seek to maximize the consistency with which readers are applying the scoring rubric and to minimize threats to validity that might be introduced as readers score. Developers work to ensure readers attend only to evidence relevant to the construct of interest (minimizing "construct irrelevant variance") and capture all relevant evidence and criteria (reducing "construct under-representation"), developing ongoing monitoring programs to ensure the system is actually functioning as intended. [Messick (1989) provided extended descriptions of these general threats to validity, which AERA, APA, NCME (1999) echoed.] Consistency is typically promoted through training and monitoring procedures that ensure, to the extent possible, that readers are using the same appropriate criteria to attend to the same relevant features of the available evidence (see, e.g., Engelhard, 1994, 2001; Myford and Engelhard, 2001; Myford and Mislevy, 1995; Wilson and Case, 1997). As Kane, Crooks, and Cohen (1999) noted, however, "the more complex and open-ended the task, the more difficult it becomes to anticipate the range of possible responses and to develop fair, explicit scoring criteria that can be applied to all responses" (p. 9), and thus the more difficult it is to prepare readers able to consistently apply their rubric.
Teaching portfolios of the kind we discuss contain some of the most "open-ended" data encountered in large scale assessments, including contextualizing commentary, videotapes of teaching, instructional artifacts, samples of student work, and the teacher's reflective commentary. A central dilemma of portfolio assessment is that as the richness of the data available to readers increases, so do the challenges involved in ensuring acceptable reliability between readers. The very thing that would seem likely to improve the validity of an assessment--more information--also appears to threaten its validity.

We engage with this dilemma, here, by seeking to better understand the kinds of interpretive challenges that portfolios seem to raise. These are often illuminated in disagreements between readers. However, the sorts of ambiguities that underlie disagreements can also be present in portfolios even when particular readers agree on a score, especially when the scoring system constrains or reduces the complexity of the judgments that readers are required to make. We argue that more fine-grained understanding of such disagreements and ambiguities can help us improve our general arguments for validity, identify cases where the argument does not fit, and point us to new strategies for ethically taking account of such issues in the context of large scale assessment.

While we focus in this paper on portfolio assessments of teachers, the issues we raise are just as relevant for other forms of assessment, including essay exams and multiple choice tests. In multiple choice tests, for example, the task of making "sense" of the data collected is given over to a key which entails assumptions about what students mean when they answer each question and about how the combined picture provided by all of a student's answers together should be understood (e.g., Hill and Larsen, 2000). We examine portfolios, then, not because they are inherently more challenging or problematic from an interpretive point of view than simpler forms of assessment, but because the burdens they place on readers tend to illuminate challenges otherwise obscured by less open-ended assessment contexts. Thus, the crucial difference between a multiple choice test and a teaching portfolio is not the complexity of the performance they represent--the reality they are attempting to describe--but the complexity and richness of the data available to readers about that performance. Portfolio assessments like these allow us to see what happens when much of the complexity of a performance has not been eliminated by the assessment instrument itself.

Our data are drawn from field tests of a portfolio assessment developed by the Interstate New Teacher Assessment and Support Consortium (INTASC) and adapted for use in Connecticut. Building on the pioneering work of the National Board for Professional Teaching Standards, INTASC is developing subject specific standards and portfolio assessments to help participating states support the professional development of beginning teachers and make licensure decisions. Of the 10 INTASC states that participated in the development of the portfolio assessment, only Connecticut is currently using portfolios to inform licensure decisions. The portfolios under consideration were prepared by beginning English/language arts teachers as part of a special field test in Connecticut. In this field test, readers (experienced teachers in the subject area) evaluated portfolios in pairs, seeking consensus on a final score. We present two case studies of three pairs of readers each reading the same portfolio.
The three pairs of readers disagreed quite widely on the first portfolio (scores of 1, 3, and 4 on a scale of 4, with 2 considered a passing score) but agreed (on a score of 1) on the second portfolio.

Our analysis indicates that many of the disagreements between reader pairs that might initially seem to be about "evidence" (are readers attending to the relevant features of the performance?) or "values" (are they applying the same criteria?) actually represent disagreements over what we term the "story" of the portfolio. We use the term "story," drawn from narrative theory and studies of the interaction of narrative and human understanding (e.g., Bruner, 1986; Davies and Harre, 1990; Kroeber, 1992), to indicate the patterns people develop to make coherent sense of situations and texts. In many cases, although readers appeared to agree on what was "there" in the portfolio and largely concurred on their vision of what counted as effective teaching, they nonetheless constructed different "stories" that fit much the same data into very different patterns. Sometimes, for example, a divergence appeared to involve one pair foregrounding an aspect of the performance that the other pair tended to downplay. For example, in the Civil Rights Movement (CRM) portfolio, discussed below, while one pair (Robert and Sandra) acknowledged that most of the dialogue in the Response to Literature (RTL) video involved simple "question and answer," they stressed particular examples where the teacher went beyond this. Another pair, Charlene and Iris, focused on the opposite characteristics. They stressed the "question and answer" aspect of the dialogue, even though they noted as exceptions aspects the first pair had emphasized. Here, the readers appear to have viewed much the same "evidence" and to have held similar "values" about what counted as better and worse teaching. Yet they still constructed fundamentally different pictures of what was going on in the classroom. And each pair made convincing arguments for their points of view.

Most operational assessment systems depend upon evidence of discrepant readings to illuminate problems like these for further review. And yet most large scale assessment systems also work hard to minimize the number of discrepant readings--to improve the consistency with which readers are applying scoring criteria. When inadequate levels of interreader reliability are found (as is more frequent with portfolio assessments), the advice for improving reliability typically includes disaggregating the portfolio into components that can be separately scored, having different readers score each component (Nystrand et al., 1993; Klein et al., 1995; Swanson et al., 1995), and "introducing separate scales to disentangle confounding aspects of performance" (Myford and Mislevy, 1995, p. 55). However, these are all practices that are more likely to gloss over rather than illuminate the sort of ambiguities in evidence we will demonstrate. Thus, our concern is not only that routine practices for building a reliable system fail to illuminate ambiguous cases, but that advice for improving the system to make it more reliable is likely to make the problem even harder to detect.

In the next section we provide a theoretical foundation for exploring disagreement over "stories," accompanied by empirical evidence from the limited set of studies that have examined actual reader processes in the context of large scale performance assessments. Then we turn to two case studies of multiple pairs of readers independently reading the same two portfolios.
In our concluding comments, we speculate about the ways large scale assessment systems might feasibly and ethically cope with the sorts of problems we have raised.

Disagreements over "Stories": A Theoretical Discussion

Our conception of disagreements over what we call "stories" draws on assumptions about how human beings make sense of the world from a diverse range of fields, including analytic philosophy (Grice in Davies, 2000), narrative theory in literary studies (Kroeber, 1992), research on reading processes (Ruddell and Unrau, 1994; Smagorinsky, 2001), historiography (Mink, 1987), linguistics (Gee, 1990), psychology (Bruner, 1986, 1990; Gibbs, 1993), hermeneutics (Gadamer, 1975, 1987), and ethnomethodology and conversation analysis in sociology (Arminen, 1999; Garfinkel, 1967, 2002; Schegloff, 1999). Despite differences, these scholars have converged, often somewhat independently, on a description of human understanding that involves the active construction of coherent meaning in our everyday interactions in the world. In this section, we explore aspects of the arguments that lie behind this theoretical vision and illustrate these arguments with evidence from the relatively small body of literature that examines, inductively, readers' processes in large scale assessment (e.g., Heller, Sheingold, and Myford, 1998; Moss, Schutz, and Collins, 1998; Myford and Mislevy, 1995). We primarily focus on ethnomethodology, especially the efforts of Harold Garfinkel (1967, 2002), because, along with other work on narrative, its empirical findings anticipate with remarkable accuracy the documented inclinations and practices of readers when they encounter complex performances in large scale assessments.


Furthermore, it provides a fruitful theoretical resource for illuminating particular problems with existing assessment systems and for designing assessment systems that better accommodate complex performance data.

Garfinkel and colleagues (1967, 2002) argued that the process of sense-making, of imputing meaning to data, intention to actions, and the like, is a constant and permanent aspect of our efforts to orient ourselves in the world. We do this through what he called the "documentary method." The documentary method "consists of treating an actual appearance of something"--for example, a view of an object, a statement in an ongoing dialogue, an action, a piece of text from a portfolio, or the portfolio itself--"as 'pointing to,' as 'standing on behalf of' a presupposed underlying pattern" (p. 78). In other words, we make sense of what something is or means by referring to an assumed underlying pattern of which it is a part. Conversation analysts (see Heritage, 1984), for example, have often shown that we are able to understand quite cryptic statements made by our conversation partners in everyday dialogue only because we make assumptions about what our partners are talking "about." As we are leaving a movie, when my friend says "I didn't like it," I need to know enough about her biography, the context, and our past history of conversations to know whether she is referring to the popcorn, the movie, her job, or any of an infinite number of possibilities. What is interesting to these discourse analysts is something we generally take for granted: that despite the seemingly complex interpretive work needed to make sense of such ordinary interactions, people generally understand each other and their environments quite well.
Garfinkel (1967, 2002) argued that all statements, no matter how detailed we try to make them, are essentially "indexical." In other words, like my friend's statement about what she didn't like, they point to something never entirely contained within the statement itself. And this is not just a problem with pronouns like "it"; this is a pervasive feature of human interaction. In experiments, Garfinkel (1967, 2002) found that no matter how hard one tries to specify exactly what any statement "means," there is always some ambiguity left. His students, for example, when trying to answer a question like "What do you mean you 'had a flat tire'?" would invariably throw up their hands at some point and declare that the meaning in a particular example was something "anyone can see." Understanding, he argued, always involves reference to assumptions not contained explicitly in a dialogue or a text.

Importantly, Garfinkel (1967) found that this process of sense-making is recursive. Not only is "the underlying pattern derived from its individual documentary evidences," but "the individual documentary evidences, in their turn, are interpreted on the basis of 'what is known' about the underlying pattern" (1967, p. 39). Or, as he notes elsewhere, "you can speak either of seeing the detail-in-the-pattern or the pattern-in-the-detail" (2002, p. 203). This is analogous to Gadamer's (1975, 1987) "hermeneutic circle," in which we discover what a text means by analyzing the parts, and give meaning to the parts by understanding the whole. (Note 3)

Consider the following artificial example (taken from Wittgenstein, in Heritage, 1984, p. 87). What do you see?


Figure 1: Line Drawing (Wittgenstein, in Heritage, 1984, p. 87)

In this very simple drawing, two fundamentally different conclusions seem possible. Either the picture is of a duck, or it is of a rabbit. And each story about the meaning of the picture involves a gestalt switch in the meaning of all of the data in the picture. In the case of the duck, for example, the indent on the right is merely a dent in its head. While we all may "see" it, it probably isn't important enough to be given much attention. The duck would still be a duck without it. On the other hand, when we see a rabbit, the indent becomes the rabbit's mouth. Suddenly it is one of the most important features in the picture. In fact, if you put your finger over the indent, suddenly a duck is the only reasonable interpretation. The only reason you still see the possibility that there could be a rabbit is because you can't forget about the indent. If you show the picture to someone else with the indent covered, they will almost invariably tell you that, as "anyone can see," it is a duck.

Actually, you can try a third approach and attempt to see the picture as simply a collection of curved lines, trying not to see it either as a rabbit or a duck. You will probably find this is a difficult task—you cannot easily take "time out" from interpretation, from seeing "what everyone can see," even when you try. Not interpreting feels like the active stance, since the largely unseen process of sense-making is our default mode of interaction with the world.
We'll revisit the issue of ambiguous evidence--of the potential to support multiple stories--that this example raises shortly.

Returning to the world of human interaction, our present analysis and previous work shows that such recursive efforts to "make" sense occur continually in readers' efforts to understand portfolios, even though, like all of us, they generally seem not to notice the constant, moment-to-moment judgments they are making (Moss, Schutz, and Collins, 1998; see Goodwin and Goodwin, 1992). These natural inclinations to make sense have also been illustrated repeatedly in the small body of literature that looks inductively at how raters reason in evaluating open-ended assessments, both single performances (Freedman and Calfee, 1983; Wolfe, 1997) and portfolios (Heller et al., 1998; Moss et al., 1998; Myford and Mislevy, 1995). All of these studies found readers working recursively between interpretations/evaluations and chunks of text. As described by Heller and colleagues (1998), Freedman and Calfee found that bits of text are judged as they are comprehended and that "the evaluation of one section of text may change as one comprehends a subsequent section of text" (p. 4). They noted that Wolfe (1997) "confirmed that some raters exhibit iterative processing and intermingle reading, evaluating, and articulating rating processes when judging essays" (p. 4). Heller and colleagues (1998), listening to think alouds while readers rated student portfolios, found readers drawing on whatever they could find that was "remotely useful" to fill in the gaps in their understanding (p. 17):

For example, raters referred to the dates on which pieces of student work were created when deciding how to interpret qualities of performance across versions of a piece that had been revised . . . or across different pieces in a portfolio over time (p. 17)

They noted that "repeated exposure to portfolios from the same class was also useful. For instance, it allowed them to draw inferences about how the teacher has structured the assignments" and "what the child has brought to the work" (p. 18; see also Moss et al., 1992).

In fact, ethnomethodologists have shown that it is very difficult to prevent people from engaging in this ongoing process of sense-making/story construction. For example, in a range of experiments Garfinkel (1967, 2002) attempted to "breach" participants' understanding of the world. In one famous case, students met and talked with a counselor about an issue that was important to them. The counselor then left for a separate room and students asked a series of questions of the counselor through an intercom, receiving either "yes" or "no" answers. After each question, students shut off the intercom and talked into a tape recorder about their understanding of the answer. What is interesting is that the answers were given in a completely random order. Yet in nearly every case the students had little difficulty making sense of them.
They noted that they "saw in a glance" "what the advisor had in mind." They were able to manage incongruous answers by "imputing knowledge and intent to the advisor," although sometimes they had to wait to understand the meaning of a particular answer. (Note 4) And each new answer might change the sense of earlier answers and even the sense of the student's own questions. Thus "the sense of the problem [a student was focusing on] was progressively accommodated to each . . . answer, while [each] answer motivated progressively fresh aspects of [the student's] underlying problem" (1967, p. 90; see Arminen, 1999).

A few key lessons can be taken from Garfinkel's (1967) counseling experiment. First, as later work in conversation analysis showed in detail (Arminen, 1999), understanding in such an exchange is built up over time as people seek to build a coherent story. Sometimes the meaning of a statement or of a piece of data changes in response to new information, and sometimes the meaning of a particular piece of data remains vague until further information is forthcoming. In conversation, for example, "every next conversational move renews our understanding of the prior move, so that each turn at talk . . . recreates the context anew" (Arminen, 1999, p. 41). Once a particular pattern has been identified, subsequent pieces of data will be interpreted in the light of this pattern. Even if the subsequent data disconfirms the initial pattern, it is read to some extent in response to the pattern it disconfirms. Once any descriptive work has been done, then, interpreters are moved from total uncertainty to the progressive development of a few anchor points from which to work. Thus, while readers, for instance, change their views as they read, and reinterpret previous evidence in the light of subsequent evidence, the sequence of their interpretations has a crucial effect on their subsequent readings (Arminen, 1999).

Our previous analyses of reader processes have shown much the same thing: that the more a reader "sees" a particular pattern, the more she may be inclined to interpret further evidence as "pointing to" this pattern—even if, to an outside observer with access to multiple interpretations, the data might seem more equivocal. Further, while a reader may hold a few pieces of evidence in a state of uncertainty as to their meaning, like the students in the counseling example, it would seem difficult for them to hold very many in this state given the time pressures they are under and the amount of data they are required to examine. In fact, in an earlier study of this approach to portfolio assessment, we found that initial interpretations reached by a reader pair often "cascade" through their subsequent reading of a portfolio, influencing how they "see" future pieces of data (Moss, Schutz, and Collins, 1998). As we note below, however, and as this discussion indicates, such a "cascade" does not necessarily involve a problematic rush to "prejudgment" on the part of evaluators. As the theory we have cited suggests, readers cannot avoid interpreting parts in light of some understanding of the whole. Nonetheless, readers in our case studies often sought out evidence that might contradict the patterns they were building.
Second, students in Garfinkel's counseling experiment expended a great deal of interpretive energy to eliminate contradictions and to ascribe a coherent "sense" to all of the responses they received, something he saw in his later work as well (Garfinkel, 2002). Even when students were told the nature of the experiment, they had great difficulty treating the advisor's responses as "random." "Many subjects," Garfinkel (1967) noted, "saw the sense of the answer 'anyway'" (p. 91). As Heritage (1984) pointed out, "it is striking that, despite the ingenuity of many of the experimental procedures [Garfinkel and his colleagues engaged in], his goal of creating situations that were literally unintelligible was rarely achieved" (p. 94).

Finally, students in the counseling experiment had difficulty not imputing intention and motivation to the counselor's statements. In fact, in additional experiments, when seemingly senseless actions were taken—like when Garfinkel (1967) replaced one white pawn with another white pawn while playing chess—participants immediately sought to understand the motivation behind the action within the goals of the game, even after it was explained to them that there was no such motivation. Readers of portfolio assessments have, similarly, sought to understand students' intent. Heller and colleagues (1998), listening to think alouds while readers rated student portfolios in different subject areas, found them trying to understand what the student had done or was asked to do by the teacher: "Before I can evaluate the student's work, I have to know what the problem is that they're attacking" (p. 16). Along the same lines, Myford and Mislevy (1995), interviewing readers about difficult-to-score portfolios in the Advanced Placement Art program, found readers raising concerns about not knowing enough about the students' intent, especially in the case of uneven portfolios. For instance, they described readers

expressing their frustration over not being able to talk to certain students directly about the work they had submitted. They felt that such conversations would help them answer crucial questions they had about the students' work, and would enable them to feel more confident about their judgments, particularly in cases in which they were vacillating between ratings. (p. 29)

At this point, readers might expect us to make a broad argument about the impossibility of valid portfolio interpretation, something like the simple relativist argument sometimes made by some recent readers of postmodernism. However, such an argument would ignore a basic truth of human existence. Most of the time we do understand each other sufficiently to get along, and our interpretations of the world generally serve us well (Gee, 1990). It is, in fact, unreasonable to say to someone "what do you mean, 'a flat tire'?" Of course we know what they mean, at least enough to go on with the conversation. Understanding our world and other people is what we do moment to moment, sitting in chairs because they are chairs, deciding not to cross the street right now because it is too dangerous, replying to questions because they make sense. This is not to deny that we sometimes make mistakes, or are sometimes confused, but most of the time we can make what Garfinkel called "reasonable" judgments, which he defined as "outcomes of documentary work, decided under the circumstances of common sense situations of choice," where "common sense" in the case of portfolio evaluation involves the use of tools of judgment provided to readers through training (Garfinkel, 1967, p. 99).
Of course, there is always the possibility that an individual will come to a relatively idiosyncratic interpretation in any particular case. The distinction, here, between “reasonable” and “unreasonable” is a qualitative one. It relies on interpreters holding basically the same set of values or general “methods” for interpreting. Every person has a unique biography and cannot help but make “sense” of the world in their own unique way to some extent, regardless of how they are trained. Thus, perfect reliability of interpretation is never achievable. Whether a particular interpretation is “reasonable” cannot be proved, as “reasonable” is not a scientific but a social designation achieved in the same way as all other “documentary” efforts.

Sometimes, however—as in the Duck/Rabbit example described above—we find it difficult to come to “reasonable” decisions that “everyone can see.” In fact, we argue that in these specific circumstances, it is sometimes “reasonable” not to be able to decide on a particular interpretation. For example, Heritage (1984) discussed a group that was contracted by the government to decide whether particular cases of death were examples of suicide. Such an effort seems inherently fraught with uncertainty. People who commit suicide may not have particularly clear intentions, they may have motivations for hiding their true intentions, and a characteristic of some suicidal people is that they cannot communicate what is bothering them very well. They may even appear happier just prior to suicide. Heritage made reference to a particular example that is illuminating. In this case, a woman was found asphyxiated in her apartment with the gas on and towels pushed against all the windows and spaces under the


door. On first glance, this looks very much like a classic case of suicide. But then Heritage gave some extra information: it was extremely cold that day, and her neighbors and friends said she was always a very happy person. This seems to open up contradictory possibilities. On the one hand, the woman may have turned the gas on to commit suicide, pushing towels against any drafts to make sure the gas stayed in her apartment. On the other hand, the woman may have been cold that day, and when it didn’t get warmer (perhaps because the pilot light had blown out) she pushed towels against any drafts to keep the heat in and accidentally asphyxiated herself.

What we have, here, is a case where at least two fundamentally different “stories” appear to make adequate sense of the same collection of data. And each story (or “pattern”) gives fundamentally different meanings to the different pieces of evidence available to us. In the first case, the woman used the towels to kill herself; in the second case she used them to keep warm. In the first case she intentionally blew out the pilot light; in the second case, she didn’t know it was out. Each story requires a fundamental gestalt shift in the meaning of all of the data. Of course, there may be other equally convincing stories that could be told about the event as well. And it is not clear that more data will make judgment easier. In the suicide case, for example, more data has already made it more difficult to decide whether the woman committed suicide—and it is quite possible that additional data will only muddy the waters more. (Note 5) As in the Duck/Rabbit example, sometimes different stories can “reasonably” fit a particular collection of data. And the Duck/Rabbit example shows how different explanatory frameworks can even alter what counts as a “piece” of data.
In one, the indent is not worth mentioning; it is not really a separate aspect of the duck. In the other, it is a crucial and separately identifiable body part, although we “see” it in both cases.

In fact, reader process studies indicate that readers often face challenges of interpretation and story construction similar to those just noted, especially when faced with inconsistent or ambiguous evidence. In Myford and Mislevy’s (1995) example of a portfolio assessment that included art work and commentary on that work, for example, disjunctions between the two data sources created interpretive problems. When insightful commentaries did not seem to reflect what could be seen in the actual work, or when weak commentaries accompanied a series of high quality interrelated pieces, readers struggled to develop a coherent interpretation of the portfolio. In our own work on INTASC portfolios, we have repeatedly encountered similar challenges, for example, when relating teachers’ commentary to classroom videos (Moss et al., 1998, 2003). Heller et al.’s (1998) study presented similar examples of readers struggling with discrepant information. Interestingly, Heller et al. noted that when readers couldn’t find evidence to resolve the situation, they resorted either to quantitative solutions (e.g., averaging scores from separate pieces) or made individual decisions about how to weigh criteria and evidence. These strategies appeared to serve as fallback solutions when the evidence did not allow them to develop a single coherent representation (or “story” in our terms). We draw on this conception of “stories” in our two case studies, illustrating the kinds of interpretive problems that complex portfolio assessments present for readers.


Assessment Context and Case Study Methodology

The INTASC portfolio assessments discussed here are intended for teachers in their first, second, or third year of teaching. To guide the development and evaluation of such portfolios, INTASC has developed a set of general and subject-specific standards based on INTASC's Principles for Beginning Teachers and standards from the relevant professional communities. The standards and related assessments are intended to provide a coherent developmental trajectory pointing toward those of the National Board. The assessments ask candidates for licensure to prepare a portfolio documenting their teaching practice with entries that include: a description of the contexts in which they work, goals for student learning with plans for achieving those goals, lesson plans, videotapes of actual lessons, assessment activities with samples of evaluated student work, and critical analysis and reflection on their teaching practices. Unlike the National Board portfolios, which contain multiple separate entries (see, e.g., ETS, 1998), these entries are organized around one or two units (8–10 hours) of instruction such that the portfolio cannot easily be broken into parts for separate evaluation. Judges evaluate the portfolios in terms of a series of “guiding questions” focused on the portfolio but based on the standards described above; they record evidence relevant to each guiding question and develop interpretive summaries or “pattern statements” that respond to the question; then they determine an overall decision about the candidate. (Note 6)

As developed by INTASC, the portfolio assessment was intended both for professional development and for informing decisions about licensure. As implemented in Connecticut in 2000, there were four possible levels to the overall decision: conditional, basic, proficient, and advanced, where basic constituted a pass.
The assessment occurred as part of a 2-3 year induction program in which beginning teachers who had an initial three-year license were provided with a mentor in the first year and the opportunity to attend state-sponsored workshops to prepare them for the assessment. When fully operational, teachers who did not pass the portfolio assessment in their second year would continue in the program for another year. If they did not pass in the third year, they would be required to reapply for the initial license after successfully completing additional course work or a state-approved field placement.

During training, readers (experienced teachers in the relevant subject area) worked through a series of “benchmark” portfolios: portfolios that had been selected by assessment developers to represent particular score points. Using these as training portfolios, readers were taught to evaluate portfolios using the same evidence, criteria, and procedures that the assessment developers used. Before being allowed to score new portfolios independently, they were tested on two additional benchmark portfolios. During actual scoring, readers worked first individually to evaluate the portfolios, and then collaboratively in pairs to resolve discrepancies both at the level of evidence and pattern statements and at the level of the overall score.

In the case studies we present, each portfolio was evaluated, completely


independently, by three pairs of readers. The data that inform our case studies include all the written materials they produced, transcripts of their dialogues as they resolved discrepancies, and individual interviews with each of the readers.

While we draw from the insights produced by ethnomethodology, our analysis is in no way an example of an ethnomethodological project. Instead of examining how a single pair constructed order over time, which would have been a more ethnomethodological approach, we have taken a thematic approach, exploring the kinds of differences that emerged among them. (Note 7)

To address our questions about how different pairs of readers made sense of portfolios, we condensed and ordered our rich corpus of data through a series of steps. First, a read-through of the data generated a series of themes. These themes were developed inductively: while they relate to the guiding questions that readers were given, we allowed themes to emerge from the notes and dialogue. A file was created for each reader with all the relevant data organized under these themes. Drawing from the data files for each reader, we generated the side-by-side comparison tables that appear below. As we worked, we looked across the data for all three pairs, ensuring that we identified issues mentioned by one pair in the corpus of the other pairs.

In our two case studies we note only a few disagreements between readers in the same pair. In fact, very few clear disagreements emerged within pairs in our data, something that we saw in our earlier work as well (Moss, Schutz, and Collins, 1998). While certainly there were subtle (and perhaps sometimes important) differences of opinion, these were generally difficult to detect and define in any definitive way.
Although readers were told in their training that they were to honor any disagreements between them, in fact our analyses of a number of different efforts to evaluate portfolios using this paired approach have shown that for pragmatic reasons of limited time, among numerous other reasons, readers very quickly accommodate themselves to each other’s ways of seeing. (Note 8) Thus, except where there was clear evidence to the contrary, we treated each pair as a unit and integrated comments from both members into a single “opinion.” However, we do return to this issue of differences between readers within the same pair at the end of the first case study.

Before we move to the case studies, it is important to acknowledge that in qualitative research it is always a challenge to give enough data to illustrate the particular points central to one's argument without overloading the reader with information. One always draws selectively from a much larger data set. Even in our most extensive case study, #1, we address less than one-half of the issues that emerged in our complete analysis. Given the rich data we have provided about each theme, it should be possible for readers to evaluate our conclusions about how readers generally made sense of the portfolios. We have not provided enough data, however, to fully illuminate why each pair gave the final score it did. Our goal is to illuminate the (different) ways readers made sense of the portfolios, even when they attended to the same evidence and seemed to value the same criteria, and to consider the implications of this for assessment practice.

Case Study #1


Our first case study examines readers’ evaluations of a portfolio that generated quite a significant difference among the pairs in their final scores. Pair 1, Charlene and Iris, scored this portfolio a “one,” which was described as a “conditional” performance. As planned for the operational system, a teacher who received a “one” would not receive a license until she repeated and received a higher score on the assessment. Pair 2, Robert and Sandra, scored this portfolio a “three,” a “proficient” performance. And Pair 3, Burt and Dani, scored the portfolio a “four,” an “advanced” performance. It is important to note that we chose this portfolio for data collection before it was scored because the trainers informed us that it contained characteristics that they felt would make it challenging to score. We return to this issue later on.

We begin with a brief overview of the major components of what we call the Civil Rights Movement (CRM) portfolio and of the general process readers used to evaluate portfolios. We then give an overview of the perspective each reader pair took on the portfolio, discussing key areas of agreement and disagreement. This is followed by side-by-side comparisons of the pairs’ views on three different topic areas—“Pedagogy in the Response to Literature (RTL) Section,” “Teacher Control and Classroom Dialogue,” and “Supporting Different Ways of Learning.” (Note 9) These topic areas were chosen because they illuminate most of the key issues that divided the readers. Again, they represent only a sample of a larger number of topic areas that emerged in our analysis.

Portfolio Overview

On the cover sheet, the male teacher who constructed this portfolio noted that his school community was in an urban setting, and that the class he focused on was a seventh grade reading class with 18 students.
The portfolio focused on a unit he constructed on the Civil Rights Movement (CRM), and he provided students with a range of non-fiction materials, including pictures, videos, and written texts. The materials presented in this portfolio differed from most portfolios encountered by readers, which tended to focus on literature. Like all of the English/language arts portfolios, the portfolio was divided into two main sections. The first, entitled Response to Literature (RTL), was meant to focus on student engagement with literature and included a relatively small composition component. The second, entitled Writing, was meant to focus directly on a teacher’s process of teaching composing. In some portfolios these two groups of lessons were not related to each other, but in this portfolio they were both a part of the same unit on the CRM.

In this case, the content covered in the RTL section included two informational videos and readings on school desegregation. For their assignments, students answered questions about the readings, drew pictures of a scene from the CRM, and wrote a “newspaper article” on something the class had discussed. In the Writing section students wrote a persuasive letter about the civil rights movement to the mayor of their city.

Overview of Each Pair’s Perspective on the Portfolio

Pair 1: Charlene and Iris (Final Score of 1). Both readers felt the teacher had


what Iris called “impressive” and what Charlene called “lofty” goals, especially in the RTL section, but that, with the exception of aspects of the Writing section, he could not meet these goals. They both read the early pages of the portfolio with “high expectations” (Charlene) that were not fulfilled in the rest of the portfolio. Although they didn’t think the teacher’s activities fit together well in the RTL section, they thought the Writing section was very coherent, almost a “different teacher.” Iris noted that “if anything, [his] . . . unit had one thing going through it,” the Civil Rights Movement, “as badly as it was delivered,” although it was “not a theme by any means.” Ultimately, the readers agreed that the teacher was basically “task oriented,” and was focused on “skills development and literal comprehension of text.” The pair seemed to feel, as Charlene stated, however, that the portfolio was a “high one”: that he had good ideas, and that he was “headed in the right direction.” Some strong mentoring, she argued, could help him “really bring together the strengths of his work that we see and change some of his overly structured processes.” However, Iris, at least, seemed to question whether the teacher could have actually completed all the activities with his class that his logs said he did: “I couldn’t do it as a veteran teacher with [inaudible] amount of years. I’m not sure this played out. And I don’t want to read between the lines, or investigate. That’s not my job. It’s simply to assess. But . . . it was totally amazing. He seemed to pack a lot [in] during the day [and] I wondered how the information was addressed in a 45 minute period of time.”

Pair 2: Robert and Sandra (Final Score of 3). Both readers liked the teacher’s general goals.
Robert argued that although they were not especially “lofty” or “thought provoking,” they went beyond what a “solid basic teacher [score level of 2] would attempt.” They stated that the teacher organized his teaching around the concept of the Civil Rights Movement, although Sandra noted that he didn’t articulate this well in the beginning, so that she thought his focus was Martin Luther King until she got to the Writing section. Both pointed out that he drew from the students’ interests, and noted that the teacher’s pedagogy was “structured and comfortable.” As Sandra said, “in two more years [his classroom] will become clearly a structured and comfortable environment. He’s growing!” Nonetheless, as Robert said, they felt that the candidate did “have moments of that basic two range.” Sandra was impressed by the way the teacher “tied the entire thing together for these kids,” creating an educational experience that was “as good/tight as [one] would expect from a second year portfolio.” She also noted that it was laudable that the teacher had students read newspapers in the classroom, something she didn’t “usually” see.

Pair 3: Burt and Dani (Final Score of 4). This pair seemed to feel, as Dani said, that the teacher “accomplished what he promised to accomplish” and had “clearly stated goals that we felt, I felt, he achieved.” They gave the teacher some credit for teaching a social studies unit and not a literature unit. “I’ve only taught literature,” Dani noted, stating that “when his material wasn’t just faithful to literary text but faithful to historical text, I found that I respected him and had a high regard.” She also “cut him slack” because he was teaching 7th graders. Both Burt and Dani thought the unit was very well tied together. And they felt that the classroom was extremely “comfortable.”

Initial Analysis


Already one can see important differences in how the readers were responding to the portfolio. Pair 1 framed the teacher’s goals as “lofty” and complained that he did not achieve them, whereas the other two seemed to think that the goals were simply reasonable (not “lofty,” in Pair 2’s terms), and that he had largely achieved them. Pair 1 felt that much of the teacher’s pedagogy did not fit together coherently, whereas the other two stated that it was tied well together for students. It is easy to see how each of the pairs’ initial responses to the teacher’s goals might affect their later interpretations. Importantly, the second disagreement, about pedagogy, seemed to clearly reflect a difference in how readers constructed a “story” about the coherence of the teacher’s pedagogy: two pairs were able to make coherent “sense” of what the teacher presented in the RTL section, and one pair was not.

Other differences are also evident. For example, the latter two pairs noted that the classroom was “comfortable,” something not noted by Pair 1. Of course, this involves a vision both of “comfortable” and of the kinds of student actions that would indicate “comfort.” Also, early in their reading of the portfolio, Pair 1 seemed to raise issues of trust, questioning whether the teacher really did what he said he did, an issue that returns below and something not stressed at all by the other two pairs. Finally, Pair 3 seemed to give the teacher credit both for teaching non-fiction and for teaching seventh graders, something the other pairs did not mention (although Sandra, in Pair 2, did say the use of newspapers was unusual).

In the following sections, we show how differences in the “stories” the readers were constructing about the portfolio emerged as increasingly important sources of disagreement across the pairs.
Each section focuses on one of three themes that emerged from the dialogue and interviews as key to the issues dividing readers: Pedagogy in the Response to Literature Section, Teacher Control and Classroom Dialogue, and Supporting Different Ways of Learning. Within each section, we present multiple (numbered) side-by-sides showing the three pairs of readers using the same or similar evidence to develop their own perspectives.

A: Pedagogy in the Response to Literature (RTL) Section
(Pair 1: Charlene and Iris; Pair 2: Robert and Sandra; Pair 3: Burt and Dani)

A1: Student Engagement with the Content

Pair 1 felt that while the teacher said he would get his students to an interpretive level in RTL, the evidence provided little support that he did so. As Iris said, “he didn’t quite take these kids to the next step.” She saw “very little in the way of student interpretation.” Charlene felt that there were “many missed opportunities” for the teacher to engage students in interpretation. Both readers continually noted “little” deviations, but focused on the larger pattern.

Pair 2 acknowledged that the teacher, as Sandra stated, focused on helping students find information and “established a relationship between responding and composing that was literal in nature.” There was, Sandra stated, only “some evidence of elementary analysis and some creative response,” mostly framed around “students interpreting quotes in reference to historical events and what they would do.” She noted that “students read texts and responded to a variety of questions which guided them to finding, and in a few cases interpreting, the facts.” “He gave them case studies about history,” she said, “and research and let them glean the information.”

In Pair 3, Burt stated that the teacher focused on acquisition of knowledge in RTL. Dani felt that the “many prereading and prewriting activities [were used] to empower students with information—background.” She noted that the teacher was “always mindful of the . . . student need to select and grasp and interpret information that’s significant in the text,” and stated repeatedly that the material the teacher was presenting was challenging for his middle school students.

From the evidence above, it is clear that the three pairs agreed on much of the basic “evidence” in the portfolio, even though they often framed it in different ways. They all agreed that the teacher focused on acquisition of knowledge. However, Pair 1 seemed especially concerned about this, noting the teacher’s missed opportunities and the fact that he thought he was teaching critical thinking when he wasn’t. In fact, Pair 1 focused on negatives that they then qualified, while the other two pairs pointed specifically to positive exceptions to the negative pattern while acknowledging, in Pair 2, their limited number, and in Pair 3, the teacher’s focus on finding information (also see row A2, below).
Pair 3, however, framed the teacher’s efforts to impart knowledge to students in a very positive light, in that he was helping students pull out the important information, “empowering” them with information, and Dani linked this to the fact that they were engaged with very challenging material. In fact, Pairs 2 and 3 both noted that the teacher helped students to, as Sandra said, “glean . . . information,” a more active engagement with the material than that described by Pair 1.


A2: Opportunity for Students’ Personal Responses

Iris noted that there was “little” opportunity for personal response on the part of students. Charlene pointed to the video where the teacher had an opportunity “to take the visual and turn it into feelings,” but “took it to a literal stance.” There were, she said, “many missed opportunities . . . [to] elicit student feelings.”

Sandra noted that “a few questions asked what students would do,” and while students were “directed to pull information out” there were at least a “few questions” that asked them to explore how they “felt.” On the video, for example, “up until question three he was being factual in Eisenhower’s speech, but four and five ask the students to walk in his shoes.”

Pair 3 felt that there were too few opportunities for students to make “personal connections” to the material. They thought this was the key limitation in the teacher’s pedagogy in RTL. Dani praised the teacher for asking students how others in history might feel, but “it wasn’t bring the civil rights issue into your own neighborhood . . . into the student’s own world.” Dani was reassured, however, when the teacher asked the students to put themselves in Eisenhower’s shoes on the video. From the perspective of his larger performance she stated that “I would presume he would do something with it [personal response] later. Or he was just loosening them up to get them to think and feel what those students felt like, or . . . But he alluded to it, so I know he’s mindful that it’s necessary. I just didn’t see many instances [in his pedagogy].”

The pattern from row A1 is repeated somewhat in row A2. Pair 1 stated that there was “little” opportunity for personal response, but repeatedly indicated that they did see some exceptions, even if these did not seem significant to them.
Furthermore, Pair 1 focused on a particular moment when the teacher could have elicited personal responses but failed to do so, one of his “many missed opportunities.” Pair 2 agreed with Pair 1 in essence, but did not mention the “missed opportunity” and, again, specifically pointed to the “few” questions that did ask for a personal response. Finally, Pair 3 was also concerned that there were few opportunities for personal responses to the material, and actually went into more detail about the specific kinds of personal response opportunities that they felt were missing than either of the others. In doing so, however, they both stressed the kind of responses that were encouraged (imagining how others in history might feel) and pointed to a key example where the teacher did in fact encourage a personal response, the Eisenhower example (which was also noted by Pair 2). Furthermore, Pair 3 noted approvingly that the teacher was “mindful” of the need to encourage personal responses, even though he didn’t


do it. Contrast this with Pair 1’s criticism in row A1 that even though the teacher said he would get the students to an interpretive level, he didn’t do it. For Pair 1, his failure to do what he said is used as evidence against him. This pattern repeats, below.

A3: Student Understanding of the CRM

Pair 1 stated that the teacher basically led his students to a literal understanding of the material he presented. Both were especially concerned, as Charlene noted, that “he actually thought he was using critical thinking” when he wasn’t.

Pair 2 seemed to agree that the teacher led the students, as Robert said, to “develop a critical and analytical response to the material . . . leading students to an understanding of the CRM.” The teacher had students integrate the many different materials he gave them through empathy, helping them gain, Charlene said, a “perspective on being black in America then.”

Dani stressed that the teacher gave “many . . . opportunities [to] apply students’ higher order thinking,” and that numerous strategies were used “to develop students as critical/analytical thinkers,” producing “independent thinkers who respond and interpret non-fiction with a critical thinking stance and work supported opinion into various activities.” She was impressed that instead of the usual “band-aid approach to black history . . . what a way to introduce kids to the reality of a time and scope, the whole history of a country, and really bring them to see where we are now.”

In row A3, where the pairs discussed the kind of learning taking place in the RTL section, the first evidence emerges in this topic area that the pairs focused on significantly different aspects of the portfolio. But these differences seem tightly connected to their different interpretations of what seemed like very similar data in rows A1 and A2. Pair 1 argued that the teacher’s pedagogy led students only to a literal understanding of his material. Again, they were concerned that he “thought” he was teaching critical thinking when he wasn’t.
In contrast, Pair 2 felt that the students were led to a critical understanding of the CRM, but focused on what they had learned—the information they gained. They argued that the teacher helped the students bring this complex material together through a process of “empathy.” Here, Pair 2 seemed to have a somewhat different meaning for “critical” than Pair 1, and this different meaning seemed to have developed in the context of their response to this particular portfolio. The students gained a “critical” understanding of the material by gaining a new perspective—that of being “black in America.” Engagement with and gaining an understanding of this complex material seemed, for them, to have involved the development of a particular kind of critical perspective. As usual, Pair 3’s stance was the farthest from Pair 1’s, with Dani’s conviction that


the students engaged in “higher order thinking” and became “independent thinkers.” It is hard to tell how Pair 3 came to this conclusion given their statements in the first two rows. However, their reasoning may have been similar to Pair 2’s, given their earlier focus on the difficulty level of this material for seventh graders and the way the teacher “empowered” students with information. Again, the disagreement in this row may relate to the pairs’ earlier differences over whether the pedagogy in the RTL section was coherent or not: Pair 1 seeing it as incoherent, and Pairs 2 and 3 seeing it as tied together well and thus able to help students achieve a coherent understanding.

In summary, while there are quite significant differences between how each of the pairs evaluated the RTL section, most of the differences between them appeared to result from the ways they constructed “stories” about what was happening in the teacher’s classroom: focusing on different aspects of the portfolio and framing teacher actions in different ways. Further, it is already possible to see in this initial example the ways in which a particular framing of the classroom with respect to one aspect can affect the way a pair frames other aspects of the portfolio they encounter.

B: Teacher Control and Classroom Dialogue

B1: Teacher Control

Pair 1: Iris and Charlene argued that the discussion video was crucial evidence showing the teacher overcontrolled the classroom. For Charlene, for example, seeing “question and answer” in the video “colored all the daily logs where he said ‘discussion’ . . . I began to wonder whether or not those were discussions.” She noted that the teacher asked questions and then answered them himself, and that he took students’ literal answers and elaborated them “into perhaps a more interpretive level” himself. Charlene was especially concerned that the teacher didn’t understand that question and answer didn’t count as a discussion.

Pair 2: Robert and Sandra felt that the teacher was too directive in the classroom. Sandra noted that the discussion on the videotape was primarily “Q and A,” and stated that the teacher did not “fully understand, as evidenced on the video, that discussion is [not supposed to be] teacher-directed.”

Pair 3: Dani noted that the “discussion” was really just a “question and answer session,” something that really “annoyed” her: “It’s like, does anybody know how to have a real discussion?” “How,” she wondered, “is this empowering them [his students] as . . . members of a discussion?” But while she noted that the teacher didn’t have “discussions,” “he at least engaged most of the class, and they’re ‘what would you do?’ questions, they’re ‘what about . . . ?’ questions. They’re not just literal comprehension. So I had to give that though too.” Pair 3 specifically noted the exception of questions that asked students to “walk in Eisenhower’s shoes.”

In row B1, the pairs seemed generally to agree that the teacher overcontrolled the classroom and that he didn’t understand what “discussion” meant, given the conceptual model of this assessment. Again, however, Pair 3 went farther than Pairs 1 or 2, noting specific strengths of the teacher’s dialogue that the others didn’t. For Pair 1, which already had questions about the accuracy of the teacher’s logs, the teacher’s apparently different understanding of discussion raised even more questions (a good example of what we referred to in an earlier paper [Moss, Schutz, and Collins, 1998] as a “cascade”). While the lack of “dialogue” in the video must have also colored how the other pairs read the term “discussion” in the teacher’s logs, they did not explicitly raise this as an issue. And, again, Pair 1 was more specific about what the teacher was doing in his “discussion” that didn’t fit with their conception of a discussion, while Pairs 2 and 3 had brought up the positive Eisenhower example that Pair 1 didn’t mention.

B2: Student Interaction in Peer Critiques

Pair 1 felt that the video where students critiqued each other’s papers in the Writing section “showed tremendous insight.” But both were concerned that even here, he didn’t seem to be able to let go. In fact, Iris questioned “whether he had to go to that level of facilitation in March if this had been a pattern

Pair 2 thought that the peer critique video showed, as Sandra said, that the students were “comfortable making suggestions.” They were impressed with how the students critiqued each other, noting that their language was more sophisticated than most seventh graders’: “They said ‘you could make this better by doing this,’ not ‘this stunk.’” Sandra acknowledged that some people might not agree with her, but stated that she liked that the teacher had read the students’ papers ahead of time and

Pair 3: Dani was especially impressed by the teacher’s ability to “stay out of their territory” in the peer editing video. “When a teacher can take part in that editing process and not take the pen . . .” she noted. She felt the editing process “would empower [students]. And he had the proof” in the peer editing video. “They were being honest about their [opinions] . . . for such young students they were being very evaluative.”


22 of 47 ingrained in thosekids from a veryearly timeline. SoI don’t want to getsuspicious aboutmotives. I don’twant to go there.” helped to facilitate the peerconference. “He knewspecific questions hewanted to ask about eachstudents’ paper to draw thepeer editor towards makingconclusions about thework” which is “somethingyou rarely see in teachers.”They noted, however, asRobert said, that “he letsthe kids peer edit forthemselves and then hewon’t let [go]. . It’s amixed message on that.” She noted that “heexhibited a lot of trustin the writing processwith them. I have nodoubt that’s whatmade me buy inbecause I saw somuch trust there intheir process ofwriting.” It was clear toPair 3, as Dani said,that “this isn’t the firsttime” his students hadengaged in peercritique. “I thought thatwas quite a height [theteacher] had climbed .. and [he] deservedthat regard.” In Row B2, the readers agreed that the peer critiqu e video was extremely impressive, although Pair 1 was much less specific in their praise. However, Pair 1 emphasized that the teacher’sfacilitation, w hile effective, showed the his inability to give up control. Furthermore, even tho ugh Pair 1 seemed to agree with the others that the actual activity of the stu dents seemed to show that they had learned how to engage in peer critique, the tea cher’s facilitation appeared to raise questions for Iris about whether the teach er really expected the students to be able to do it. Thus, this raised que stions for her about whether the teacher really had students engage in peer crit ique regularly in his classroom. Although she didn’t want to get “suspici ous,” she nonetheless saw potential issues with his “motives” in facilitating Pair 2 also noted that the teacher’s facilitation was an indication of his nee d for control. 
But while Sandra acknowledged that it would be possible to downgrade the teacher for actively facilitating what would normally be a more independ ent peer activity, she felt that in this specific case he facilitated very effe ctively in ways that showcased some of his skills as a teacher. Finally, Pair 3 fr amed the teacher’s activity in an entirely positive light--extending on Sandra’s posi tive statements by focusing on the teacher’s ability to resist taking control even though he was facilitating. Thus, in contrast to the others, for Pair 3 what ap pears to be essentially the same data from the peer critique video gave direct evidence of the opposite of what the other two pairs saw, of the teacher’s abil ity to trust his students and give up control even when he was in a position where one m ight expect him not to. Variety and Quality of ActivitiesB3Iris and Charlene were Pair 2 noted, asRobert said, that Although Pair 3 was impressedby the variety of activities the


23 of 47 impressed bythe variety ofactivities theteacherprovided, butfelt that hisstudents had“come toover-rely onhim,” andnoted that hewas unable to“break awayas a center ofcontrol” eventhough “hehimselfadmitted togiving toomuch ofhimself.” while the teacher“provided a varietyof opportunities” forlearning, “they werevery teacherdirected.” And theyagreed that, asRobert noted, he“provided a positiveand challenginglearningenvironment,except that hestepped in andgave them theanswer. . Herealized they havefrustrations, so hesteps in and says,let me do this foryou.” teacher provided, Dani noted that“it is his choice in how they wouldshow off that material.” “Eventhough he was controlling in theprocess,” Dani noted, however, “Idid feel that students wereempowered through the process.”She was especially impressed bythe “million strategies” he gavethem, noting that he made surethe students had “all of the thingsthat would help [them] build to thegoals required in class.” Burtagreed that the “teacher . .frequently directed.” Repeatedly,Dani focused on the distinctionbetween the teacher’s controlover the structure of theassignment, and his workingtowards their “independence” “inthe reading . the process ofwriting, [. .] the process ofgetting to an end result in a pieceof my writing. I have the power tobe in control of it. And that’s whatI think he gave those students.”They agreed, however, that hiswriting process “was not a flexibleplan,” noting that it was“articulated, but still controlled.”Finally, both felt that the teacherwould use less control if he taughta novel instead of non-fictionmaterial. Row B3 extendson the pairs’ concerns about teacher over-control. Again, the pairs seemed to use very similar evidence to very d ifferent ends. For Pair 1, the teacher’s statements about his need to pull back we re “admissions” of his inability to do so. 
While Pair 2 acknowledged the t eacher’s directiveness, they saw his “frustration,” the well-meaning intentions beh ind his difficulty in giving up control. And, while Pair 3 also saw the teacher’s d irectiveness, Dani stressed aspects of the teacher’s pedagogy where the student s did have some control. Finally, Pair 3 argued that the non-fiction materia l used in this particular portfolio may have required more control than fictional mater ial, and that the “strategies” he gave them helped them “build to the goals requir ed in class”--thus aspects of the teacher’s control in this case may have been appropriate. In this topic area, then, the pairs tended to cite much the same data (although there were differences in focus). Further, they app eared to agree on the definition of an effective discussion, and on the i mportance of teachers’ giving


24 of 47 up control of aspects of their classroom. Most of t he disagreements, then, appeared to result from the different “stories” the readers were constructing about what was happening in the classroom.C: Supporting Different Ways of Learning Pair 1Pair 2Pair 3Teacher’s Expectations of StudentsC1These readers agreed that the teacher had lowexpectations for hisstudents. As Charlenesaid, he “reallyunderestimated” them,not because his goalswere low, but “in the wayhe taught. . They meethis expectations becausehe doesn’t do anything toraise them.” As Iris noted,“he expected the low levellearner was going to be alittle overwhelmed, nomatter what theopportunity.” And asCharlene stated, “my gutresponse was all of thesejudgments about ‘theycan’t get it,’ or ‘I pairedthem with somebody elseso that they can just sitback.’ That just kind ofconcerned me.” This pair felt theteacher was veryresponsive to thedifferent ways hisstudents learned.Sandra “found hisinstructionalstrategies as areading teacher . .[to be] designed tosupport the[students’]development asreaders . . Hementions . theslow learners, lowlevel readers, whichindicates that he hassome IEPs that he’sworking with in theclass. . I thoughthe did a really goodjob.” They especiallynoted his effort topair stronger andweaker students. Dani noted that “hismind’s eye is alwayson his slower reader,the less achievingstudent, alwaysworking the crowd toempower thosereaders into beingequal.” “He alwayswent . towardthose students whomight have slippedthrough the cracks.And he wasconstantly committedto ensuring that they .. didn’t.” Burt notedthat “he says a lothere about slowerkids and he is verycognizant of theirneeds and trying tofind ways that he canhelp them. But yethe’s also looking atthe brighter kids.”Both saw, as Danisaid, that he was“mindful of the wholeclass . asindividuals.” In Row C1, Pair 1 seemed to interpret the teacher’s repeated mentions of low level students as indications that he was “underest imating” them. 
In contrast, Pairs 2 and 3 felt that many of the same statements showed his real concern for the very same students. As usual, Pair 3’s statemen ts appeared more glowing than Pair 2’s. Teacher Knowledge About Students


Pair 1: Despite their discomfort with the teacher's low expectations, Pair 1 acknowledged, as Iris noted, that "he really did hit on the kids' interest levels [and] their innate ability, the effort produced, and the thing that I bought into was the fact that he said he was helping these children in the classroom and then having them stay after school and work with him." As Charlene noted, "he's aware of student needs." And Iris stated that "teacher's knowledge of student interest levels, ability, and effort produces some teacher recognition of [the need to] adjust instruction [and] support student needs." Iris acknowledged that the teacher "drew useful conclusions and made decisions regarding future practice." Both readers noted the teacher's individual work with the ESL student, which Charlene said showed potential and that "some serious mentoring" could "really bring together the strengths of his work that we see and change some of his overly structured practices." At the same time, however, Charlene wondered "whether he [conferences with students] regularly," later stating, "I won't even go into the fact that I think she [the ESL student] came after school, not that he suggested it." Nonetheless, she noted "in his defense [that he made a] lot of comments regarding the students . . . and he mentions pairing [the ESL student] up with a friend of hers who works with her after school. [So] I'm not saying he has no expectations, or that he is damaging students, but it seems like there is a pervasive low expectation of students." There was, the two agreed, "some opportunity to learn for some."

Pair 2: The readers agreed that, as Sandra said, the teacher "used his knowledge of student needs to inform the types of activities he selects, his monitoring and pairing/grouping of students," actually modifying instruction "so that all students are successful on some level." Robert noted, "he does all kinds of things . . . that invite other than just what we consider perhaps that traditional type of learner into the picture." They specifically noted his work with the ESL student.

Pair 3: Both readers cited a wide range of evidence indicating that the teacher was paying attention to the differing needs and ability levels of his students. They were impressed with how he dealt with his ESL student, and thought he was, as Burt said, "very knowledgeable about the students." Dani noted that "he was always mindful of making sure that the student who most likely would not build to the right end result had all of the things that would help build." In his planning, she said, he was "always mindful of the groups' impact on their time and individual students to understand and interpret."

In Row C2, all three pairs acknowledged the teacher's knowledge of his students and the efforts he made to help them. However, in row C1, Pair 1 had already classified some of the teacher's pairing efforts as evidence of his low expectations, so Pair 1 had fewer examples of positive teacher efforts to cite in row C2 than the other pairs. And, again, Pair 1's lack of trust led Charlene to raise questions about whether the teacher generally helped all of his students, given the way the teacher phrased his description of his efforts to help his ESL student, downgrading the importance of what was a key example for the other two pairs. Nonetheless, Pair 1 did use much of the same evidence that the other two pairs used to show that the teacher did an effective job helping his students, and this mitigated and nuanced their statements in row C1 about his low expectations. Thus, Pair 1 noted that there was "some opportunity for some" in this classroom.

Again, most of the disagreements in this topic area appeared to arise not out of differences in criteria or in what the pairs saw, but out of differences in the "stories" they constructed.

Differences on the Final Score Among Readers Within the Pairs

Up to this point in this paper we have avoided discussing differences within the pairs over interpretations of the portfolio. However, we know from the readers' comments in individual interviews, separate from their partners, that there were some differences in what they said they focused on in reaching their final interpretations. Given limited space, we will not go into these in detail. What is interesting, however, is that four of the six readers acknowledged in their individual interviews that the CRM portfolio might deserve a different score than the one they had given it. Importantly, these differences seemed to reflect not disagreements with partners but instead individual readers' uncertainties about the correct score for the portfolio. In Figure 2, we plot out the range of different scores members of the pairs said they might be willing to give the portfolio.


Figure 2: Differences within pairs on Civil Rights Portfolio (Case Study #1) (Note 10)

In Pair 1, Iris argued that she didn't think "enough evidence was given." If she had been given more evidence showing that the teacher's "goals had been met," then she might have given him a two, "maybe even a three, depending on the level of the evidence." Her partner, Charlene, noted that the teacher "had good ideas and was headed in the right direction," and acknowledged that she felt the portfolio was a high "one," or almost into the "two" category. In this pair, Charlene was clearly the junior member, with much less teaching experience than Iris, and it is possible that in a more equally matched situation she might have been willing to go higher. In Pair 2, Robert seemed to stand fairly solidly on a score of "three" for the portfolio, even though his statements in dialogue seemed to indicate more flexibility than this. Sandra explicitly noted that she struggled between a three and a four, and at one point seemed even to be playing with the possibility of giving the portfolio a two. In Pair 3, Burt stated that he felt very "confident" about his score of four, but Dani had argued in their dialogue over whether it was a three or a four, finally becoming convinced by Burt's evidence that it was a four (although her interview indicated she still may have had some questions).

Thus, most of the evaluators of this portfolio were not very confident that they had arrived at the correct final score. We return to this issue in our discussion of the second case study portfolio, where we found very few differences between the readers.

Overall Analysis of the CRM Case Study

If our presentation and discussion of the data, above, has been at all convincing, it should be clear by now that most of the differences among the pairs involved "stories." Again, it is important to emphasize that the aim of this case study was not to determine the direct cause of the pairs' final score differences. Instead, we seek only to show that "story" issues can "reasonably" be said to have reflected a pervasive source of differences among the pairs. And while we cannot know whether or not disagreements over stories caused the final discrepancy among the pairs on the score they gave a portfolio, our goal is only to show that it is an important enough source of problems that one could imagine it frequently resulting in such a difference in final scores. With respect to this particular question, then, what we have tried to do is crack open what is often treated as the "black box" of reader judgment processes.

We argue that the three pairs of readers in this case study seemed relatively well trained in a conventional measurement sense, in that they generally seemed to share a perspective on what counted as effective teaching and they were generally able to cite much the same "evidence" from an extremely complex portfolio under relatively severe time constraints. In fact, it is hard to imagine readers doing much better. Given the complexity of portfolio assessment, it seems impossible that readers will always cite exactly the same data, even if they are generally constructing the same "stories." While readers could always be trained better, the skills these readers exhibited seemed at the very least what we can expect from adequate readers.

Nonetheless, these pairs disagreed quite substantially about the performance level of this portfolio. Their final paired scores covered the entire possible range of scores, from 1 to 4. If we are right that these readers seemed at least adequately prepared to evaluate these portfolios, then it is reasonable to assume that the problem of differences over "stories" may be significant for other evaluators who face similar challenges when encountering similar kinds of portfolios.

Importantly, the evidence does not seem to indicate that any of the readers made an inappropriate rush to judgment early in their evaluations of this portfolio. In fact, as the data we present indicate, all three pairs were careful to acknowledge conflicting aspects of this teacher's performance, finding a range of aspects in the same areas. Instead, it seems clear that their continuing observations were influenced in a range of ways by observations they had already made as they moved between their vision of the whole and their understanding of the meaning of a particular piece of data. Further, because the readers had limited time to complete their evaluation, they had limited attention to focus as they went through the portfolio. Instead of a rush to judgment, differences in focus appeared to reflect evidence-based decisions about where readers would focus their limited energy, representing perhaps unavoidable choices whenever evaluators must choose what to foreground and background in their reading of a collection of material.

In our next case study, we present an example of a portfolio that did not seem to elicit significant differences over the correct "story" for the portfolio. First, however, we address a concern raised by others who read initial drafts of this paper: reader inferences about teacher intentions and motivations.

Discerning Motivation and Intention

Some of the "story" disagreements visible between the pairs involved interpretations about the motivations or intentions of the teacher. But is this a reasonable area for readers to investigate? Shouldn't they be evaluating only what they can "see" directly and not what they have to infer? Conversely, is it possible (given what we have said about the assumptions of ethnomethodology, above) for readers not to raise these issues?


First of all, we argue that these inferences are not fundamentally different from the other kinds of judgments readers are asked to make. Recall Garfinkel's (1967, 2002) argument that all language is "indexical," pointing to some state of affairs that it cannot entirely capture. And we showed these kinds of inferences about the teacher's classroom being made continually by the readers in the CRM example. Readers had to decide whether the teacher's pedagogy was "tied together," how important exceptions to a pattern were, and what should "count" as "critical thinking" in the context of a particular unit. In these and many other cases, the readers had to infer, from the limited data they had, the general state of the larger classroom. Other studies of reader processes have shown much the same thing.

Even if we could eliminate questions of motivation and intent, however, it is important to understand that this would make it impossible for readers to ask why a teacher made a particular move or statement. Yet current teacher standards make processes of teacher decision-making a central part of good teaching. Thus, readers often struggled to understand why a teacher was acting in a particular way. A good example came in an interaction between Robert and Sandra in a discussion of why the teacher in the CRM portfolio had failed to grade some assessments. Robert said,

I don't know whether he's consciously saying 'they're going to think this is bad because I don't assess [these papers], but I'm going to do it anyway,' or, 'I said this is what I'm going to assess, this is what I'm going to assess, and it's got to stand on its own' . . . I don't know, but it's either gutsy or stupid.

To which Sandra responded, "Maybe he's worn out by this . . . class." In other words, Robert was trying to decide whether the teacher had made a conscious, calculated decision based on his pedagogical approach, or whether he just didn't care. And Sandra offered a third option: that this specific case was an exception for this teacher because at this point he was just "worn out." The pair ultimately agreed that the teacher had made a conscious decision not to grade the assessments based on his pedagogical strategy, and thus gave the teacher credit for his process even though they did not agree with his final decision. In more subtle ways, issues like this frequently arose amidst readers' efforts to evaluate. It is difficult for us to see how readers could avoid making such judgments and still operate under the standards for effective teaching that currently predominate in the field.

On a more basic level, assumptions about intentionality are required when we judge whether actors are responsible for particular actions or events. For example, remember when Charlene, in the CRM case study, questioned whether the teacher made a practice of working with students after school. In that case, she pointed to evidence that the ESL student approached the teacher seeking extra help, and not vice versa; in other words, she emphasized that the student, not the teacher, was the agent in this case. Deciding between these options involved understanding the teacher's motivations for his actions. More broadly, what would happen to the validity of the assessment if we did not ask readers to differentiate, for example, between an accident and an intentional act?


In fact, evidence from ethnomethodological and other studies (e.g., Gibbs, 1993; Schegloff, 1999) indicates that it would be difficult to prevent readers from making inferences about teacher veracity, intentions, and the like, even if we wished to. As we noted above, even when Garfinkel attempted to create situations where there was no motivation for his actions internal to the context he was acting in (replacing one white pawn with another one), people had great difficulty in believing that no motivation was involved. In Garfinkel's work, "deviations from . . . [normal] sense-making procedures were instantly interpreted as 'motivated' departures on the part of experimenters who were treated as acting from 'special' if presently undisclosed motives" (Heritage, 1984, p. 99).

Furthermore, ethnomethodology argues that the very meaning of any statement always involves "grasping the purpose or the motive for its being produced" (p. 101). Even the seemingly simple act of describing a situation raises questions about intent. As Schegloff (1999) showed, there is no such thing as mere description, since there are a potentially infinite number of ways any particular situation can be described. A description always involves some answer to the question "why that now?" Thus, Heritage (1984) argued that any description of a complex setting or interaction is necessarily "selective in relation to the state of affairs it depicts, . . . [so that] part of the process of" understanding an action or a statement necessarily involves "grasping the purpose or motive for its being produced at a particular moment" (p. 151). Other experiments in psychology also show the centrality of interpretations of intention to human interaction. For example, studies indicate that our memory processes are highly influenced by what we understand as an intention behind a statement. As Gibbs (1993) noted, it is often the case that "the [presumed] intention behind the speaker's utterance is encoded and represented in memory, not the sentence or utterance meaning" (p. 187). And "experiments show that infants as young as nine months begin reliably to interpret certain behaviors as intentional (e.g., pointing)" (Richards, 2002, p. 2). When we cannot grasp someone's purposes or motives, Garfinkel's and others' work indicates, we often find ourselves disoriented. (Note 11)

Ultimately, studies like these show that in our moment-to-moment interactions, when people act we treat them as agents and "see" motivations. And we often saw readers "seeing" such motivations as an apparently automatic part of reading the portfolios. For example, in one case Sandra first thought the CRM teacher's unit was about Martin Luther King, but then saw that it was about the Civil Rights Movement. In another case, she first thought the teacher had used multiple activities to support slower learners, but then "realized" that he was using multiple activities to engage the multiple intelligences of all of his students. Another fascinating example arose when Iris, discussing the peer critique video in the CRM case study, questioned "whether he [the teacher] had to go to that level of facilitation in March if this had been a pattern ingrained in those kids from a very early timeline. So I don't want to get suspicious about motives. I don't want to go there." She didn't want to get suspicious, but she was; she didn't want to discuss motives, but she saw them nonetheless. As with the Duck/Rabbit example, such active sense-making is our default approach.


31 of 47 We now move to a discussion of our second (abbrevia ted) case study.Case Study #2For our second case study, we provide an example of a portfolio that contrasts with the CRM example, representing far fewer disagr eements among the pairs. Unlike the CRM portfolio, which the trainers had id entified as a difficult-to-score portfolio, this portfolio was chosen by the trainer s as a “benchmark” portfolio, as a clear example of a score level. And, in fact, the three pairs of readers agreed on a score of “one,” constructing strikingly simila r stories about the portfolio. (Note 12) The possibility that readers can be prepared to di stinguish between more and less ambiguous portfolios is an issue to w hich we return in our conclusion.Because there were few examples of “story” disagree ments in readers’ evaluations of our second case study portfolio, we provide only a short, schematic discussion of it, here. It is important t o note, however, that the same careful process of data analysis used in the CRM ca se study was used here as well.The teacher in this second case study portfolio org anized his instruction around a famous young adult novel, engaging his students i n a range of activities, including a mock trial of one of the characters. Th e similarities in the stories constructed across reader pairs included the follow ing: The pairs felt that the teacher in this portfolio controlled the classroom to the point that the children had little or no choice in their activities. They w orried that the teacher seemed concerned about getting his students to do exactly what he told them to with little focus on whether the students were actually learning anything. And they felt that dialogue in the classroom largely fit wit h the teacher’s pattern of control, involving simple question and answer, with little r eal “discussion.” In general, then, the readers agreed that while the teacher’sac tivities had potential,he had no clear purposes for them. 
The readers also though t the teacher had low expectations of his students.As with the CRM portfolio, we did see places in the data where divergence between the pairs might have been possible. For example, at one point Sand ra seemed to treat the teacher’s mistakes in his own w riting in the portfolio as an “editing” problem, whereas the other readers interp reted his mistakes from the beginning as an indicator of problems with his know ledge of English grammar. This raised the possibility that Sandra might infer that the teacher did not have problems, himself, with issues of grammar. However, she saw similar grammar mistakes on the videos when he was working with stu dents. Thus, along with the others, she eventually read the teacher’s mista kes in the portfolio as part of a pattern of evidence that the teacher lacked some basic knowledge of English. In this and in other examples, in contrast with the CRM example, the preponderance of evidence seemed to prevent initial ambiguities from leading to significant disagreements.On only a few generally subtle issues did disagreem ents between the pairs seem to emerge. In fact, the only major disagreemen t about the teacher’s


32 of 47 performance involved interpretations of the teacher ’s ability to control his classroom. Pair 3 saw “strengths in classroom contr ol,” while the other two pairs specifically noted his lack of ability in this area In addition, while Pair 2 simply noted his inability to control his class, Pair 1 we nt further and pointed out that he “didn’t alter his approach when” faced with a “chao tic” environment and when “the kids were not doing things.” For Pair 1, then, the teacher’s response to the “chaos” in his classroom also indicated the teacher ’s inability to change his instruction in response to challenges, something no t noted by the other pairs. Other more subtle divergences between the pairs inc luded their interpretations of the “mock trial” and of the teacher’s understand ing of his students. First, with respect to the mock trial, Pair 1 indicated that th e teacher didn’t know how a real trial works (and should have done some researc h to find out). Pair 3, in contrast, didn’t see the accuracy of the trial proc edure as an issue, and in fact indicated that the teacher should have added some “ bells and whistles” from fictional examples of trial procedures (like Perry Mason). F inally, Pair 2 did not mention the accuracy or the richness of the trial a ctivity specifically at all, instead noting more generally that the teacher’s ac tivities focused on helping students learn details that did not relate to the “ real world.” Second, regarding his understanding of his students, Pair 1 acknowled ged that the teacher had some awareness of where students needed to improve. 
Pair 2 seemed to go beyond this, however, arguing that he "does know his students and their abilities." And Pair 3 went the farthest, noting that he made "astute observations of student behavior."

None of these specific differences in interpretation, however, seemed to trigger particularly significant disagreements in the general "stories" that readers constructed to make sense of this portfolio. We interpret this to mean that the overall "story lines" for this portfolio were robust enough to absorb a few specific differences in inference about the teacher and his classroom.

In fact, we noted a number of examples in the CRM case study, above, where readers from Pairs 1 and 2 seemed to struggle to decide what the correct story line was. Thus, in the CRM case study, many readers indicated at points that the evidence was contradictory enough to lend itself to a range of different patterns. Examples like these tended not to occur in the YAN case study. Furthermore, we noted that many readers did not seem very confident about their final score for the CRM portfolio. And, in fact, the CRM portfolio was explicitly chosen for us by the trainers as difficult to score. On the YAN portfolio, however, readers generally fell solidly in the "one" category, (Note 13) and the trainers chose the portfolio as a "benchmark" to be used for reader certification purposes: a portfolio especially chosen to lie clearly within a particular score range. Thus, both the readers and the trainers seemed able to tell the difference, at least in these cases, between portfolios that can and cannot be straightforwardly scored. The YAN portfolio appeared to represent one of many portfolios that tend to produce a single "reasonable" story, and thus a single "reasonable" score, in independent reads. In contrast, the CRM portfolio appeared to lack a single "reasonable" story or set of stories.
While more reads would help establish the status of each portfolio more definitively, our ultimate goal here is not to arrive at confident answers for


these particular portfolios. Instead, we seek to show that our distinction between "reasonably" scoreable and unscoreable portfolios is one that makes sense more generally for the assessment field.

Implications

In this paper, we have discussed two case studies that illustrate how portfolio readers engage with evidence in their efforts to understand a teacher's performance. Our findings are consistent with the small body of other literature on readers' processes in large-scale assessments. We have grounded our analysis in a large body of research in narrative theory, discourse analysis, ethnomethodology, and other fields indicating that such efforts to construct coherent "sense" are an unavoidable and ongoing part of our everyday activity. While we have not presented a large number of cases, we argue that our examples, in conjunction with relevant work in assessment and on processes of human understanding, make it difficult to imagine how readers could avoid constructing "stories" in their efforts to evaluate portfolios. In what follows we discuss possible implications that this challenge of "stories" might raise for assessment.

As the CRM portfolio example indicates, even when readers generally agree on evidence and on relevant criteria, they can construct different "stories" about the teacher's practice. Conventional assessment practices, focusing on interreader reliability, seem unlikely to illuminate these problems. For example, two out of three reader pairs agreed that the CRM portfolio reflected a relatively strong performance.
In our in-depth reading of field test portfolios we have encountered other portfolios that resemble the CRM portfolio in their apparently conflictual evidence, yet these cases only sometimes produce discrepant final scores.

Further, as we noted at the outset, the usual approach to resolving such discrepancies, focusing on techniques likely to improve interreader reliability on final scores, risks further masking what we argue represents the underlying challenge of ambiguous cases (e.g., Kane et al., 1999; Swanson et al., 1995; Klein et al., 1995; Nystrand et al., 1993). These techniques include: further standardizing the tasks, thus constraining the responses that readers encounter; disaggregating the portfolio into separate tasks that can be scored one at a time; having different readers score the different tasks; developing more explicit criteria, including stipulating or reaching a priori agreement on how particular issues should affect scores (e.g., weak commentary but strong performance); and separating criteria into separate scales that can be separately scored. Because all of these practices work to fix or strip context from the information available to readers, they insulate them from uneven, conflicting, or otherwise ambiguous responses. While all of these practices can and have improved interreader reliability, they don't make the problem of ambiguous evidence go away. They simply relegate it to a priori decisions, standardized responses, or


statistical machinery that combines scores according to predetermined algorithms.

It is important to acknowledge that high quality assessment programs do generally have statistical routines that attempt to detect unusual patterns of scores (for raters, for tasks, and for students) and to flag them for additional attention (e.g., Engelhard and Myford, 1994; Engelhard, 1994, 2001; Wilson and Case, 1997; Myford and Mislevy, 1995). These are important and useful tools. But the routines are only as effective as the information on which they are based. More needs to be done to flag ambiguous cases. In fact, we are increasingly convinced that it is crucial for large scale assessment programs to build in information-gathering processes that are more likely to illuminate the kinds of ambiguities we have illustrated.

The need for such processes is made only more imperative by the fact that most studies comparing alternate scoring practices use consistency between readers' scores and efficiency as criteria for deciding among these practices. Few studies look at broader questions of validity, including the relationship between the assessment in question and other indicators of similar performance. Thus we should be alarmed, but not surprised, by the findings of a study reported by Swanson, Norman, and Linn (1995) from the health-related professions. Swanson and colleagues reported that the scoring method which produced higher reliability actually led to lower correlations with the criterion of interest, indicating lower validity. The paucity of studies like these results, in part, from the fact that routine and legally defensible practice does not require assessors to conduct them, even though many measurement professionals wish it were otherwise (NRC, 2001; Madaus, 1990; Haertel, 1991).
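To make the limits of such routines concrete, consider a minimal sketch of the simplest kind of score-pattern flag they implement. This is purely illustrative and of our own construction; the function names, the two-reader score sets, the score scale, and the thresholds are hypothetical and are not drawn from any program discussed in this paper.

```python
# Illustrative sketch only. Operational systems typically flag a
# performance when independent readers' scores differ by more than an
# allowed spread (triggering adjudication or a third read).

def flag_discrepant(scores, max_spread=1):
    """Flag a set of independent reader scores whose spread exceeds
    max_spread -- the conventional trigger for additional attention."""
    return (max(scores) - min(scores)) > max_spread

def flag_straddles_cut(scores, cut=2):
    """Flag scores that straddle a consequential cut score even when
    they are 'adjacent' -- a spread-based flag alone misses these."""
    return min(scores) < cut <= max(scores)
```

The point of the second function is the gap the text describes: a portfolio scored (1, 2) on a hypothetical four-point scale with a cut at 2 is "adjacent" and so passes a conventional spread check, yet the readers have effectively disagreed about the consequential outcome. Flags built only on score spread cannot surface the deeper ambiguity of a portfolio such as CRM, where readers citing the same evidence construct different "stories."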
Without access to such comparisons, it is impossible to know, for many large scale assessment programs, how the choices made by developers affect the meaningfulness of interpretations produced by their system. In other words, we don't know how these decisions affect validity.

In what follows, we argue that it is possible to develop procedures for routinely illuminating ambiguous portfolios and that there are ways that this information could usefully serve efforts to achieve ethical, valid assessments in high stakes settings. It is important to note that most readers, as well as the trainers, appeared to know that the CRM portfolio was difficult to score. In fact, it was the trainers' initial identification of it as challenging that led us to collect data on it in the first place. Much the same could be said of the YAN portfolio. It was specifically chosen because the trainers, who selected it as a benchmark portfolio, informed us that it was relatively straightforward, something also indicated by the response of the readers. In fact, there are a range of indications that evaluators and assessment developers can distinguish, at least in some cases, between portfolios that are more or less easy to evaluate. Indeed, the National Board routinely selects both benchmarks--clearly illustrating a score point--and "training portfolios"--intended to illuminate particular kinds of problems (ETS, 1998).

Controversially, for us, the National Board assessor training script indicates that all of these portfolios, including the problematic "training" ones, are given predetermined scores to use in the training. (Note 14) The goal of


using training portfolios, it appears, is to teach readers how to score portfolios like these in a consistent fashion. Of course, this makes perfect sense if one assumes that (almost) all complete portfolios have or embody a particular score that can be determined. But if we are right and some portfolios contain ambiguities and contradictions that make it difficult or impossible to "reasonably" assign them a single score, efforts to always train readers toward "correct" scores may actually be counter-productive.

Instead, we think it is important to at least explore the possibility that distinguishing, in some cases, between portfolios that are easier or more difficult to "reasonably" score may provide an opportunity for changes in the ways readers approach portfolio evaluation. In fact, the pressure our readers felt to find the "correct" score for the CRM portfolio may have generated some of the struggles they faced. We wonder what would happen if, instead of being trained to find a "correct" score, readers were also prepared (as trainers are) to identify ambiguous portfolios. Readers might engage in such an effort, for example, by purposefully seeking disjunctions between pieces of evidence or by trying out alternative interpretations of the existing evidence. Such a search for counter-interpretations is fully consistent with good practice in validity research (e.g., AERA et al., 1999; Messick, 1989) and in qualitative research as well (e.g., Erickson, 1986). In our terms this would be akin to seeking out alternate yet equally convincing "stories" for a particular portfolio. Documentation of unevenness or other sorts of ambiguity within a particular portfolio might be institutionalized by providing opportunities for indicating this on the scoring document. Evidence and general observations placed in this new category wouldn't necessarily preclude giving a final score that represents readers' best interpretation of the preponderance of evidence.
But it would allow them to indicate when issues of ambiguity made it difficult for them to arrive at a final score, and it would allow the portfolio to be flagged for further study--if not during the operational scoring process, perhaps at least as part of ongoing research intended to improve the system.

What, then, might be done with portfolios that are flagged by readers or researchers as ambiguous in an operational system? We can imagine a number of options, including the (not unreasonable) option of doing nothing. First, such portfolios might be put into a process that involves a deeper, more comprehensive review than conventional scoring allows. In this supplementary process, readers might engage in a deeper reading to see if they can surface a coherent interpretation or story that could support a clear final score.

Another option might allow the readers to request more data from a candidate. Perhaps, for example, candidates with particularly difficult portfolios might be asked to interview directly with the readers. Or additional documentation of classroom practice might be required. Of course, this raises all kinds of difficult challenges with respect to time, resources, and simple feasibility. Further, Garfinkel's work, for example, indicates that it is not at all clear that more data would actually help--as the example from Heritage (described above) illustrated, it could just complicate the issue more. (Note 15) At some point, we simply need to agree that we have "enough" data, since no amount of data is ever going to adequately answer all the questions we have. Furthermore, unless we are careful, additional data may actually be misleading or problematic, since


its collection may not be framed as carefully as it is in the larger established system. Despite these challenges, however, in some cases more data might allow readers to achieve more valid evaluations (see Moss et al., in press, for examples of how case studies might complement portfolio-based evidence).

A third approach to this problem would involve shifting portfolios identified as difficult to reasonably score to a larger committee that would take on the responsibility for making an ethical decision given the available evidence, the standards, and the potential consequences of a decision for the candidate and the educational system (similar to what we have argued in Moss and Schutz, 2001). Instead of asking committee members to see exactly as the trainers have asked them to see, they might take on the role of representatives of the larger professional community. Members could take into account the risks and possibilities entailed in the particular performance they are evaluating and make a judgment as proxies for this larger community. Indeed, in another ongoing study with beginning teacher portfolios, we've presented portfolios to expert readers who have not been constrained (as evaluators within an ongoing assessment system are) in the kinds of interpretations they can draw. Under these broadly open conditions, we have asked them to make a pass (achieves regular license) or fail (repeats assessment) decision on each portfolio based on the evidence they see. We have then brought them together to discuss their decisions.
In all three cases where we have attempted this, these unconstrained discussions of the risks and benefits of a particular decision, or of trying to reach a decision at all, have expanded to include the needs of the larger system as well as the candidate (Moss, Schutz, Haniford, Coggshall, and Miller, in preparation).

A final alternative we can imagine is that ambiguities noted within portfolios might be ignored in the operational system and used only to inform an ongoing research agenda designed to improve the operational system. There is no question that efforts to note ambiguities will flag more performances as problematic than are currently identified by discrepant or otherwise atypical scores alone. Clearly, if they must all be resolved through a more time consuming process, costs will increase, perhaps more than the system can afford. No assessment system (whether large scale or based in an intense, deeply contextualized study of a single candidate's practice) can eliminate all "error" or bad decisions. Thus, assessment developers and policy makers must decide how much (and what kind of) "error" they are willing to tolerate in light of the consequences to the candidate and to the larger system. That decision, however, will be better informed if our efforts to improve the operation of assessment systems do not obscure these problems and if the ongoing research agenda undertakes to better illuminate the nature of the interpretive problem we have described here.

Large scale assessment systems must cope with the ambiguity that is always present, to some degree, in all efforts to understand human action. If we are to feasibly evaluate large numbers of cases, interpretive practices must be routinized, and this effort at routinization may always generate a tension among efficiency, reliability, and validity. These challenges have led developers to processes that carefully control what raters can see and how they can interpret it.
In fact, as we have repeatedly noted, all testing practices (not just


those associated with raters) place artificial constraints on how interpreters work. To one extent or another this is probably unavoidable and thus not, in and of itself, a criticism. Instead, this paper argues that the routine indicators of technical quality generally employed in current large scale assessment systems do not sufficiently illuminate the challenges of ambiguous portfolios, and that routine practices aimed at improving the quality of such systems may actually make these problems harder to find. The challenge we pose for assessment developers, then, is threefold. First, we encourage them to develop better tools for revealing the kind of challenging portfolios we have discussed here. Second, we argue that it is imperative that better ways be found for illuminating the consequences of the decisions developers make, especially around efforts to improve reader consistency. Finally, we recommend the exploration of new strategies for engaging with problematic portfolios for which "reasonable" decisions cannot effectively or ethically be made under current practices.

Notes

1. Authors' note: This research was supported by a grant from the Spencer Foundation. We gratefully acknowledge their support. We are also grateful to the Interstate New Teacher Assessment and Support Consortium (INTASC) and to the Connecticut State Department of Education (CSDE) for their generosity in providing the materials and opportunities that made our analysis possible. Both authors are members of INTASC's technical advisory committee (TAC). We gratefully acknowledge comments on an earlier draft from INTASC staff and TAC members: Mary Diez, Jim Gee, Anne Gere, Bob Linn, Jean Miller, David Paradise, Ray Pecheone, and Diana Pullin. Ray Pecheone has generously read multiple drafts.
Opinions, findings, conclusions, and recommendations expressed are those of the authors and do not necessarily reflect the views of INTASC, its technical advisory committee, or its participating states.

2. The role of teaching performance in licensure decisions was highlighted in a recent report of the National Research Council's committee on Assessment and Teacher Quality. The authors of the report concluded that "paper and pencil tests provide only some of the information needed to evaluate the competencies of teacher candidates" and called for "research and development of broad based indicators of teaching competence," including "assessments of teaching performance in the classroom" (NRC, 2001, p. 172). As evidenced in the work of the National Board for Professional Teaching Standards (NBPTS), portfolio assessment provides one credible means for the large scale, high stakes assessment of teaching performance.

3. For Gadamer (1975, 1987), the hermeneutic circle had two important aspects: (a) the dialectic between the parts of a text and the whole and (b) the dialectic between the reader's preconceptions and the text.

4. Recent work by Peng and Nisbett (1999) indicated that this search for coherence and the attempt to eliminate contradictions from analysis may be a somewhat western phenomenon. Peng and Nisbett compared the responses of a group of white American and Chinese college students to contradictory proverbs and propositions. They found that Chinese participants were much more comfortable with ambiguity and


contradiction. "Furthermore," they noted, "when two apparently contradictory propositions were presented, American participants polarized their views, and Chinese participants were moderately accepting of both propositions" (p. 741).

5. There may also be issues arising from the requirement that the "answer" fall in one of two starkly different categories that may not capture the complexity of human motivation and situatedness. In fact, "suicide" implies a clear intention which may, in fact, not have existed. Our analysis of portfolio reader processes indicates that the same challenges arise for them as well. We examine this issue in more detail in Moss, Schutz, Haniford, Coggshall, and Miller (in preparation).

6. The use of guiding questions that integrate standards into dimensions directed at a particular teaching performance to produce an interpretive summary was developed by Tony Petrosky, Ginette Delandshere, Steve Koziol, Penny Pence, Ray Pecheone, and Bill Thompson in their leadership of one of the first two National Board assessment development labs. This has informed the work at both INTASC and NBPTS. See, for instance, Delandshere and Petrosky (1994) and Koziol, Burns, and Brass (2003).

7. Although they are not examples of ethnomethodology, we have examined aspects of the readers' processes over time in earlier work (Moss, Schutz, and Collins, 1998).

8. It is also important to note that in these particular cases, while the two readers within each pair initially evaluated the portfolio individually, these within-pair readings were not independent: readers worked in the same room at the same time, watching the videotape together, and making comments occasionally to one another as they worked.

9. Quotes were edited for sense and grammar. The tense of statements was sometimes altered to fit with the paper.

10. Based on interviews, we have roughly estimated and depicted the range of potential scores each reader reported considering.

11.
In fact, Gibbs (1993) reported a study showing that "people find it easier to understand written language if it is assumed to have been composed by intentional agents (i.e. people) rather than by computer programs which lack intentional agency. . . . Readers presume that poets [for example] have specific communicative intentions in designing their utterances, an assumption that does not hold for unintelligent computer programs. . . . These data testify to the powerful role of authorial intentions in people's understanding of isolated, written expressions" (pp. 190-191). In this article, Gibbs also argued that this tendency to impute intentions to others is a cross-cultural phenomenon, although it may appear in very different ways in different cultural contexts.

12. The dialogue between readers for Pair 3 of Case Study #2 was not recorded for some reason (the tape was blank), so our evidence is based only on their written notes and individual interviews. One of the readers was also different. These differences did not seem to raise significant issues for the case study as a whole, but the lack of one tape did make it more difficult to discern differences on selected issues.

13. Sandra, from Pair 1, and Phil, from Pair 3, did mention subtly different score possibilities (Sandra, a "one plus," and Phil, one of those ". . . portfolios" that is on the "line" between a one and a two) in their interviews


around the YAN portfolio, but both also indicated that they understood these opinions were grounded in interpretations of the portfolio that were not strictly well supported by the evidence.

14. According to the technical manual, a response makes "a good training sample if a final score can be agreed on and the scoring issues presented by the response are worthy of training time" (ETS, 1998, p. 60). Trainers are told during discussion to "announce the true score," elicit "misconceptions" but "not [to] let errant assessors dominate the discussion with arguments for why your score is wrong," and "to reiterate the true score" (pp. A33-34). Assessors are encouraged to look for the "preponderance of evidence" across "unevenness." Unscorable portfolios are described as "weird stuff: If in live scoring, you find a case that is so incomplete or weird that you can't score it, call the trainer to discuss it. The cases you will be seeing as training samples have all been pre-selected, so this will not be a problem for now." [We have selected these phrases to document our assertion that assessors are not encouraged to illuminate ambiguous cases. Doing this presents a lopsided picture of the National Board's scoring practices, which are the most explicit and thoughtful we have seen within the bounds of a conventional assessment program.]

15. One of Myford and Mislevy's readers, speaking about a portfolio exhibit that included contextual information, commented that it was easier to score without knowledge of the context.

References

AERA, APA, & NCME (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Arminen, I. (1999). Conversation analysis: A quest for order in social interaction and language use. Acta Sociologica, 42, 251-257.

Bruner, J. (1986). Actual minds, possible worlds. Cambridge: Harvard University Press.

Bruner, J. (1990). Acts of meaning. Cambridge: Harvard University Press.

Collins, K.
C., Moss, P. A., & Schutz, A. (1997). INTASC candidate interviews: Final summary report to INTASC. Unpublished manuscript, University of Michigan.

Coulon, A. (1995). Ethnomethodology (J. Coulon & J. Katz, Trans.). Thousand Oaks: SAGE.

Davies, B. (2000). Grice's cooperative principle: Getting the meaning across. In D. Nelson & P. Foulkes (Eds.), Leeds Working Papers in Linguistics, 8, 1-26.

Davies, B. & Harre, R. (1990). Positioning: The discursive production of selves. Journal for the Theory of Social Behavior, 20(1), 43-63.

Delandshere, G., & Petrosky, A. R. (1994). Capturing teachers' knowledge. Educational Researcher, 23(5), 11-18.

Educational Testing Service (ETS) (1998). NBPTS technical analysis report, 1996-97 administration. Southfield, MI: NBPTS.

Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93-112.

Engelhard, G. (2001). Monitoring raters in performance assessments. In Tindal & Haladyna (Eds.), Large scale assessment programs for all students (pp. 261-287). Mahwah, NJ: Lawrence Erlbaum.


Erickson, F. (1986). Qualitative methods in research on teaching. In M. C. Wittrock (Ed.), Handbook of research on teaching (pp. 119-161). New York: Macmillan.

Freedman, S. W. & Calfee, R. C. (1983). Holistic assessment of writing: Experimental design and cognitive theory. In P. Mosenthal, L. Taymor & S. A. Walmsley (Eds.), Research on writing: Principles and methods (pp. 75-98). New York: Longman.

Gadamer, H. G. (1994/1975). Truth and method. New York: Seabury.

Gadamer, H. G. (1987). The problem of historical consciousness. In P. Rabinow & W. M. Sullivan (Eds.), Interpretive social science: A second look (pp. 82-140). Berkeley: University of California Press.

Garfinkel, H. (1967). Studies in ethnomethodology. Englewood Cliffs: Prentice Hall.

Garfinkel, H. (2002). Ethnomethodology's program: Working out Durkheim's aphorism (A. W. Rawls, Ed.). Boulder: Rowman & Littlefield.

Gee, J. P. (1990). Social linguistics and literacies. London: Taylor and Francis.

Gee, J. P. (1999). An introduction to discourse analysis: Theory and method. London: Routledge.

Gibbs, R. W. (1993). The intentionalist controversy and cognitive science. Philosophical Psychology, 6(2), 181-206.

Goodwin, C. & Goodwin, M. H. (1992). Assessments and the construction of context. In A. Duranti & C. Goodwin (Eds.), Rethinking context: Language as an interactive phenomenon (pp. 147-190). Cambridge: Cambridge University Press.

Habermas, J. (1993). Justification and application (C. P. Cronin, Trans.). Cambridge: MIT Press.

Habermas, J. (1996). Between facts and norms: Contributions to a discourse theory of law and democracy (W. Rehg, Trans.). Cambridge: MIT Press.

Haertel, E. H. (1991). New forms of teacher assessment. Review of Research in Education, 17, 3-29.

Heller, J. I., Sheingold, K. & Myford, C. M. (1998). Reasoning about evidence in portfolios: Cognitive foundations for valid and reliable assessment. Educational Assessment, 5(1), 5-40.

Heritage, J. (1984).
Garfinkel and ethnomethodology. Albany: SUNY Press.

Hill, C. & Larsen, E. (2000). Children and reading tests. Stamford, CT: Ablex.

Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5-17.

Klein, S. P., McCaffrey, D., Stecher, B., & Koretz, D. (1995). The reliability of mathematics portfolio scores: Lessons from the Vermont experience. Applied Measurement in Education, 8(3), 243-260.

Koziol, S. M., Jr., Burns, L., & Brass, J. (2003). Four lenses for the analysis of teaching: Supporting beginning teachers' practice. Working paper, Michigan State University.

Kroeber, K. (1992). Rereading/retelling: The fate of storytelling in modern times. New Brunswick: Rutgers University Press.

Madaus, G. F. (1990). Legal and professional issues in teacher certification testing: A psychometric snark hunt. In J. V. Mitchell, Jr., S. L. Wise, & B. S. Plake (Eds.), Assessment of teaching (pp. 209-261). Hillsdale, NJ: Erlbaum.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education/Macmillan.

Mislevy, R. J., Almond, R., & Steinberg, L. (2003). On the structure of educational assessment. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3-62.


Mink, L. O. (1987). Historical understanding. Ithaca: Cornell University Press.

Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62(3), 229-258.

Moss, P. A., Beck, J. S., Ebbs, C., Herter, R., Matson, B., Muchmore, J., Steele, D., & Taylor [Clark], C. (1992). Portfolios, accountability, and an interpretive approach to validity. Educational Measurement: Issues and Practice, 11(3), 12-21.

Moss, P. A. & Schutz, A. (2001). Educational standards, assessment, and the search for "consensus." American Educational Research Journal, 38(1), 37-70.

Moss, P. A., Schutz, A. M., & Collins, K. M. (1998). An integrative approach to portfolio evaluation for teacher licensure. Journal of Personnel Evaluation in Education, 12(2), 139-161.

Moss, P. A., Schutz, A. M., Haniford, L., Coggshall, J., & Miller, R. (in preparation). High stakes assessment as ethical decision making. University of Michigan.

Moss, P. A., Sutherland, L. M., Haniford, L., Miller, R., Johnson, D., Geist, P. K., Koziol, S. M., Star, J., & Pecheone, R. L. (in press). Interrogating the generalizability of portfolio assessments of beginning teachers: A qualitative study. Education Policy Analysis Archives.

Myford, C. M. (1993). Formative studies of Praxis III: Classroom Performance Assessments--An overview. The Praxis Series: Professional Assessments for Beginning Teachers. Princeton, NJ: Educational Testing Service.

Myford, C. M., & Engelhard, G. (2001). Examining the psychometric quality of the National Board for Professional Teaching Standards Early Childhood/Generalist assessment system. Journal of Personnel Evaluation in Education, 15(4), 253-285.

Myford, C. M., & Mislevy, R. J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Research Report). Princeton, NJ: Educational Testing Service.

National Research Council (2001a).
Knowing what students know: The science and design of educational assessment. Washington, D.C.: National Academy Press.

National Research Council (2001b). Testing teacher candidates: The role of licensure tests in improving teacher quality. Washington, D.C.: National Academy Press.

Nystrand, M., Cohen, A. S., & Dowling, N. M. (1993). Addressing reliability problems in the portfolio assessment of college writing. Educational Assessment, 1(1), 53-70.

Peng, K. & Nisbett, R. E. (1999). Culture, dialectics, and reasoning about contradiction. American Psychologist, 54(9), 741-755.

Pinker, S. (2002). The blank slate: The modern denial of human nature. New York: Viking.

Pollner, M. (1991). Left of ethnomethodology: The rise and decline of radical reflexivity. American Sociological Review, 56(3), 370-381.

Richards, R. J. (2002). "'The blank slate': The evolutionary war." New York Times, October 13. www.nytimes.com, accessed 10/17/02.

Ruddell, R. B., & Unrau, N. J. (1994). Reading as a meaning-construction process: The reader, the text, and the teacher. In R. B. Ruddell, M. R. Ruddell, & H. Singer (Eds.), Theoretical models and processes of reading (4th ed., pp. 996-1056). Newark, DE: IRA.

Schegloff, E. A. (1999). 'Schegloff's texts' as 'Billig's data': A critical reply. Discourse and Society, 10(4), 558-572.

Smagorinsky, P. (2001). If meaning is constructed, what is it made from? Toward a cultural theory of reading. Review of Educational Research, 71(1), 133-169.

Swanson, D., Norman, G. R. & Linn, R. L. (1995). Performance-based assessment: Lessons from the health professions. Educational Researcher, 24(5), 5-11, 35.


About the Authors

Aaron Schutz
Associate Professor
Educational Policy and Community Studies
University of Wisconsin-Milwaukee
P. O. Box 413
Milwaukee, WI 53201
Voice: 414-229-4150
Fax: 414-229-3700
E-mail: schutz@uwm.edu

Pamela A. Moss
Associate Professor
4220 School of Education
University of Michigan
Ann Arbor, MI 48109-1259
Voice: 734-647-2461
Fax: 734-936-1606
E-mail:

Pamela A. Moss is an Associate Professor in the School of Education at the University of Michigan. Her areas of specialization are at the intersections of educational assessment, validity theory, and interpretive social science.

The World Wide Web address for the Education Policy Analysis Archives is

Editor: Gene V Glass, Arizona State University
Production Assistant: Chris Murrell, Arizona State University

General questions about the appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, or reach him at College of Education, Arizona State University, Tempe, AZ 85287-2411. The Commentary Editor is Casey D. Cobb: .

EPAA Editorial Board

Michael W. Apple, University of Wisconsin
David C. Berliner, Arizona State University


Greg Camilli, Rutgers University
Linda Darling-Hammond, Stanford University
Sherman Dorn, University of South Florida
Mark E. Fetler, California Commission on Teacher Credentialing
Gustavo E. Fischman, Arizona State University
Richard Garlikov, Birmingham, Alabama
Thomas F. Green, Syracuse University
Aimee Howley, Ohio University
Craig B. Howley, Appalachia Educational Laboratory
William Hunter, University of Ontario Institute of Technology
Patricia Fey Jarvis, Seattle, Washington
Daniel Kallós, Umeå University
Benjamin Levin, University of Manitoba
Thomas Mauhs-Pugh, Green Mountain College
Les McLean, University of Toronto
Heinrich Mintrop, University of California, Los Angeles
Michele Moses, Arizona State University
Gary Orfield, Harvard University
Anthony G. Rud Jr., Purdue University
Jay Paredes Scribner, University of Missouri
Michael Scriven, University of Auckland
Lorrie A. Shepard, University of Colorado, Boulder
Robert E. Stake, University of Illinois—UC
Kevin Welner, University of Colorado, Boulder
Terrence G. Wiley, Arizona State University
John Willinsky, University of British Columbia

EPAA Spanish & Portuguese Language Editorial Board

Associate Editors
Gustavo E. Fischman, Arizona State University
& Pablo Gentili, Laboratório de Políticas Públicas, Universidade do Estado do Rio de Janeiro

Founding Associate Editor for Spanish Language (1998—2003)
Roberto Rodríguez Gómez, Universidad Nacional Autónoma de México

Argentina


Alejandra Birgin, Ministerio de Educación, Argentina
Mónica Pini, Universidad Nacional de San Martín, Argentina
Mariano Narodowski, Universidad Torcuato Di Tella, Argentina
Daniel Suárez, Laboratorio de Políticas Públicas-Universidad de Buenos Aires, Argentina
Marcela Mollis (1998—2003), Universidad de Buenos Aires

Brazil

Gaudêncio Frigotto, Professor, School of Education and Graduate Program in Education, Universidade Federal Fluminense, Brazil
Vanilda Paiva
Lilian do Valle, Universidade Estadual do Rio de Janeiro, Brazil
Romualdo Portella do Oliveira, Universidade de São Paulo, Brazil
Roberto Leher, Universidade Estadual do Rio de Janeiro, Brazil
Dalila Andrade de Oliveira, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Nilma Lino Gomes, Universidade Federal de Minas Gerais, Belo Horizonte
Iolanda de Oliveira, School of Education, Universidade Federal Fluminense, Brazil
Walter Kohan, Universidade Estadual do Rio de Janeiro, Brazil
Maria Beatriz Luce (1998—2003), Universidade Federal do Rio Grande do Sul-UFRGS
Simon Schwartzman (1998—2003), American Institutes for Research–Brazil

Canada


Daniel Schugurensky, Ontario Institute for Studies in Education, University of Toronto, Canada

Chile

Claudio Almonacid Avila, Universidad Metropolitana de Ciencias de la Educación, Chile
María Loreto Egaña, Programa Interdisciplinario de Investigación en Educación (PIIE), Chile

Spain

José Gimeno Sacristán, Professor, Department of Didactics and School Organization, Universidad de Valencia, Spain
Mariano Fernández Enguita, Professor of Sociology, Universidad de Salamanca, Spain
Miguel Pereira, Professor, Universidad de Granada, Spain
Jurjo Torres Santomé, Universidad de A Coruña
Ángel Ignacio Pérez Gómez, Universidad de Málaga
J. Félix Angulo Rasco (1998—2003), Universidad de Cádiz
José Contreras Domingo (1998—2003), Universitat de Barcelona

Mexico

Hugo Aboites, Universidad Autónoma Metropolitana-Xochimilco, México
Susan Street, Centro de Investigaciones y Estudios Superiores en Antropología Social Occidente, Guadalajara, México
Adrián Acosta, Universidad de Guadalajara
Teresa Bracho, Centro de Investigación y Docencia Económica-CIDE
Alejandro Canales


Universidad Nacional Autónoma de México
Rollin Kent, Universidad Autónoma de Puebla, Puebla, México
Javier Mendoza Rojas (1998—2003), Universidad Nacional Autónoma de México
Humberto Muñoz García (1998—2003), Universidad Nacional Autónoma de México

Peru

Sigfredo Chiroque, Instituto de Pedagogía Popular, Perú
Grover Pango, General Coordinator, Foro Latinoamericano de Políticas Educativas, Perú

Portugal

Antonio Teodoro, Director of the Undergraduate and Master's Programs in Education Sciences, Universidade Lusófona de Humanidades e Tecnologias, Lisbon, Portugal

USA

Pia Lindquist Wong, California State University, Sacramento, California
Nelly P. Stromquist, University of Southern California, Los Angeles, California
Diana Rhoten, Social Science Research Council, New York, New York
Daniel C. Levy, University at Albany, SUNY, Albany, New York
Ursula Casanova, Arizona State University, Tempe, Arizona
Erwin Epstein, Loyola University, Chicago, Illinois
Carlos A. Torres, University of California, Los Angeles
Josué González (1998—2003), Arizona State University, Tempe, Arizona


EPAA is published by the Education Policy Studies Laboratory, Arizona State University.