Educational policy analysis archives


Material Information

Educational policy analysis archives
Physical Description:
Arizona State University
University of South Florida
Place of Publication:
Tempe, Ariz
Tampa, Fla
Publication Date:


Subjects / Keywords:
Education -- Research -- Periodicals   ( lcsh )
non-fiction   ( marcgt )
serial   ( sobekcm )

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E11-00381
usfldc handle - e11.381
System ID:


A peer-reviewed scholarly journal
Editor: Gene V Glass
College of Education
Arizona State University

Copyright is retained by the first or sole author, who grants right of first publication to the EDUCATION POLICY ANALYSIS ARCHIVES. EPAA is a project of the Education Policy Studies Laboratory. Articles published in EPAA are indexed in the Directory of Open Access Journals.

Volume 12 Number 32
July 20, 2004
ISSN 1068-2341

Interrogating the Generalizability of Portfolio Assessments of Beginning Teachers: A Qualitative Study

Pamela A. Moss
LeeAnn M. Sutherland
Laura Haniford
Renee Miller
David Johnson
University of Michigan

Pamela K. Geist
Denver, Colorado

Stephen M. Koziol, Jr.
University of Maryland

Jon R. Star
Michigan State University

Raymond L. Pecheone
Stanford University

Citation: Moss, P.A., Sutherland, L.M., Haniford, L., Miller, R., Johnson, D., Geist, P.K., Koziol, S.M., Star, J.R., Pecheone, R.L. (2004, July 20). Interrogating the generalizability of portfolio assessments of beginning teachers: A qualitative study, Education Policy Analysis Archives, 12(32). Retrieved [Date] from a/v12n32/.


Abstract

This qualitative study is intended to illuminate factors that affect the generalizability of portfolio assessments of beginning teachers. By generalizability, we refer here to the extent to which the portfolio assessment supports generalizations from the particular evidence reflected in the portfolio to the conception of competent teaching reflected in the standards on which the assessment is based. Or, more practically, “The key question is, ‘How likely is it that this finding would be reversed or substantially altered if a second, independent assessment of the same kind were made?’” (Cronbach, Linn, Brennan, and Haertel, 1997, p. 1). In addressing this question, we draw on two kinds of evidence that are rarely available: comparisons of two different portfolios completed by the same teacher in the same year and comparisons between a portfolio and a multi-day case study (observation and interview completed shortly after portfolio submission) intended to parallel the evidence called for in the portfolio assessment. Our formative goal is to illuminate issues that assessment developers and users can take into account in designing assessment systems and appropriately limiting score interpretations. (Note 1)

Introduction

A growing number of states are using some form of standardized assessment to assist in the licensure decisions about beginning teachers. Among the 42 states requiring such tests in 2000, the most widely used were paper-and-pencil tests assessing varied combinations of basic skills, content knowledge, or pedagogical knowledge (NRC, 2001b). The National Research Council's "Committee on Assessment and Teacher Quality" concluded that "paper and pencil tests provide only some of the information needed to evaluate the competencies of teacher candidates" (NRC, 2001b, p. 69). The committee called for additional research into the development of licensure systems that include assessment of teaching performance.
As evidenced in the work of the National Board for Professional Teaching Standards (NBPTS), portfolio assessment provides one credible means for the large-scale high-stakes assessment of teaching performance. The Interstate New Teacher Assessment and Support Consortium (INTASC) is building on the pioneering work of the NBPTS to develop subject-specific portfolio assessments of beginning teachers. Their work provides the basis for this study.

This qualitative study is intended to illuminate the factors that affect the generalizability of this portfolio assessment of beginning teachers. By generalizability, we refer here to the extent to which the portfolio assessment supports generalizations from the particular evidence reflected in the portfolio to the conception of competent teaching reflected in the standards on which the assessment is based. Or, more practically, “The key question is, ‘How likely is it that this finding would be reversed or substantially altered if a second, independent assessment of the same kind were made?’” (Cronbach, Linn, Brennan, and Haertel, 1997, p. 1). In addressing this question, we draw on two kinds of evidence that are rarely available: comparisons of two different portfolios completed by the same teacher in the same year and comparisons between a portfolio and a multi-day case study (observation and interview completed shortly after portfolio submission) intended to parallel the evidence called for in the portfolio assessment. The case studies lasted 3-5 days, depending on each teacher's schedule. Consistent with Cronbach’s (1988, 1989) “strong” program of validity, this study is explicitly disconfirmatory; it is intended to illuminate potential problems with assumptions about generalizability. Our formative goal is to raise issues that assessment developers and users can take into account in designing assessment systems and appropriately limiting score interpretations.

Conceptions of Generalizability

Messick (1989, 1996) characterized generalizability as "an aspect of construct validity" that is meant to "ensure that the score interpretation not be limited to the sample of assessed tasks but be generalizable to the construct domain more broadly" (1996, p. 250; see also 1989). He noted that generalizability has two important senses: (a) "generalizability as reliability … refers to the consistency of performance across the tasks, occasions, and raters of a particular assessment which might be quite limited in scope" (p. 250) and (b) "generalizability as transfer … refers to the range of tasks that performance on the assessed tasks is predictive of" (1996a, p. 250). Thus, inferences about the broader domain (in our case, competent teaching performance as defined by a set of standards) from a particular sample of evidence (as contained in a portfolio) can be productively conceived of in at least two distinct steps: from the observed performance to the more limited scope of what we will call the assessment domain (reliability) and then from the assessment domain to the outcome or standards domain (transfer or extrapolation).
This distinction between kinds or levels of generalization is drawn by others as well, albeit with somewhat different language (e.g., Brennan and Johnson, 1995; Haertel, 1985; Haertel and Lorie, in press; Kane, Crooks, and Cohen, 1999). (Note 2)

Within psychometrics, generalizability has typically been evaluated in terms of quantitative indicators of reliability or transfer. These concepts from psychometrics will be useful--even though this is a qualitative study--for helping us frame and learn from the results of our comparisons. The comparisons we offer will, in turn, suggest the limitations of conventional theory for illuminating the complexity of variations involved in teaching practice and making well warranted decisions that accommodate that variation.

This first level of inference (reliability) involves generalization from a set of representative observations to a well specified assessment domain (or universe of generalization) consisting of similar observations (Kane et al., 1999; Brennan, 2001). We are not simply interested, for instance, in how an examinee performed on a particular set of tasks on a particular occasion; rather, we are interested in estimating how an examinee would perform on tasks/occasions like these. Further, we want some assurance that the score is not based on the idiosyncrasies of a particular judge but that similarly qualified judges would likely interpret the performance in the same way.


Reliability is appropriately conceptualized and investigated as a faceted concept that encompasses multiple sources of “error” or variations over which we want to generalize (differences in tasks, raters, occasions, and so on that are intended as samples from the same assessment domain). A set of scores can have multiple reliabilities and errors of measurement depending on which sources of variation are taken into account. The appropriate domain of generalization, including the sources of variation over which we want to generalize, depends on the decision to be made (Cronbach et al., 1997). For those sources of variation over which we want to generalize, empirical studies that examine these variations (across tasks, occasions, raters, etc.) are required to support the generalization. As Brennan (2001) argued, the notion of “replication” is central to an understanding of reliability.

Generalizability theory (Cronbach, Gleser, Nanda, and Rajaratnam, 1972; Brennan, 1983; Shavelson and Webb, 1991) is, perhaps, the most commonly used theoretical model that enables the effects of various sources of error to be “disentangled” and estimated simultaneously, although other models, especially those based on Item Response Theory (IRT) (e.g., Engelhard, 1994, 2002; Myford and Mislevy, 1995; Wilson and Case, 1997) are becoming more widely used (see Mislevy, Wilson, Ercikan, and Chudowski, 2002, and NRC, 2001a, for a discussion of alternative models). (Note 3) With generalizability theory, reliability is idealized as a statistical generalization based on “random” samples from the assessment domain. Brennan (2001) acknowledged that the notion of random sampling is an “idealization that is not fully supported”, but noted that “the central conceptual distinction is not so much between fixed and random in the literal sense of ‘random,’ as it is between fixed and ‘not fixed’” (p. 302).
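The dependence of reliability on which facets are counted as error can be made concrete with a small numerical sketch. The Python below assumes a fully crossed persons by tasks by raters design and uses the standard relative generalizability coefficient from generalizability theory. The variance components, the numbers of tasks and raters, and the function name `g_coefficient` are hypothetical, chosen only for illustration; they are not estimates from this study or from any assessment discussed here.

```python
def g_coefficient(var_p, var_pt, var_pr, var_ptr_e,
                  n_tasks, n_raters, tasks_fixed=False):
    """Relative generalizability coefficient for a crossed p x t x r design.

    var_p      -- person (universe-score) variance
    var_pt     -- person-by-task interaction variance
    var_pr     -- person-by-rater interaction variance
    var_ptr_e  -- residual (three-way interaction confounded with error)
    All inputs are hypothetical values supplied by the caller.
    """
    task_term = var_pt / n_tasks
    rater_term = var_pr / n_raters + var_ptr_e / (n_tasks * n_raters)

    if tasks_fixed:
        # Tasks treated as fixed: the person-by-task term is absorbed
        # into universe-score variance, so only rater terms count as error.
        universe = var_p + task_term
        error = rater_term
    else:
        # Tasks treated as a sample from a broader domain of tasks:
        # the person-by-task term counts as error.
        universe = var_p
        error = task_term + rater_term

    return universe / (universe + error)


# Hypothetical variance components (illustrative only).
components = dict(var_p=0.30, var_pt=0.40, var_pr=0.02, var_ptr_e=0.28)

random_tasks = g_coefficient(**components, n_tasks=4, n_raters=2)
fixed_tasks = g_coefficient(**components, n_tasks=4, n_raters=2,
                            tasks_fixed=True)

print(f"tasks as a sampled facet: {random_tasks:.2f}")  # 0.67
print(f"tasks treated as fixed:   {fixed_tasks:.2f}")   # 0.90
```

With these invented components, the same scores yield a noticeably higher coefficient when tasks are treated as fixed than when they are treated as a sample from a broader domain of tasks, which is one way a single set of scores can have more than one reliability.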
Reliability estimates can be quite misleading if a facet that varies in the assessment domain (possible essay prompts, for instance) is not included in the estimated error of measurement. An unfortunate practice is reporting reliability estimates for performance assessments based on differences among readers but ignoring potential differences among tasks even though the intended generalization is to a broader domain of tasks like these. This can seriously overestimate the quality of the generalization to the intended assessment domain.

Turning to the second level of generalization (transfer or extrapolation), this involves generalization from the more limited and carefully specified assessment domain to a broader outcome domain, which includes the full range of performances about which we would like to generalize. As Kane and colleagues noted, most educational concepts are quite broad; rarely are we interested simply in how examinees perform on other (test) items like these. Using reading comprehension as an (often cited) example, the outcome domain of interest might include a wide range of types and genres of text (e.g., newspapers, magazines, novels, instructional manuals, technical reports, textbooks, friendly letters, business letters, signs, forms, lists, tests), read for a variety of purposes, in many different contexts, requiring various kinds and depths of background knowledge to understand, with readings represented in multiple ways (writing, conversation, mental images or concepts, drawing, marks on answer sheets, and the like).

This level of generalization clearly spills over the bounds of reliability into validity more generally and typically involves a more tenuously warranted set of inferences. Warrants for transfer generalizations include logical or theoretical arguments about the relationship between the assessment domain and outcome domain. A common approach “is to argue that the skills needed for good performances in the universe of generalization (e.g., problem definition, problem solving) are essentially the same as, or are a critical subset of, those needed in the full target domain” and “that anyone who performs well on the assessment should also be able to perform well in the target domain and anyone who performs poorly on the assessment should also perform poorly in the target domain…” or at least that “the skills being assessed are necessary (if not sufficient) for effective performance in the target domain” (Kane et al., 1999, p. 11).

Empirical studies supporting transfer generalizations might involve “criterion studies,” examining the relationship between test performance and some “especially thorough (and representative)” sample from the outcome domain (Kane et al., 1999, p. 10) or, more practically, a “series of small experiments regressing various outcomes on test performance” (Haertel, 1985, p. 35). Given the near infinite range of possible studies, some means of deciding which are most important to undertake given limited resources is necessary. As Kane and colleagues noted, “in practice, the argument for extrapolation is likely to be a negative argument.”

A serious effort is made to identify differences between the universe of generalization and the target domain that would be likely to invalidate the extrapolation. If no major differences are found, the extrapolation is likely to be accepted. If the impact of some differences on the plausibility of extrapolation is unclear, it may be necessary to check on their importance empirically. (Kane et al., 1999, p. 11)

Empirical Evidence of Generalizability

With Performance Assessments, In General

With performance assessments, the most commonly examined sources of error are those due to raters and tasks. Empirical studies of reliability or generalizability with performance assessments are quite consistent in their conclusions that (a) reader reliability, defined as consistency of evaluation across readers on a given task, can reach acceptable levels when carefully trained readers evaluate responses to one task at a time, and (b) adequate task or "score" reliability, defined as consistency in performances across tasks intended to address the same capabilities, is far more difficult to achieve (e.g., Breland et al., 1987; Brennan and Johnson, 1995; Dunbar, Koretz, and Hoover, 1991; Gao and Colton, 1997; Gao, Shavelson, and Baxter, 1994; Lane, Liu, Ankemann, and Stone, 1996; Linn and Burton, 1994; McBee and Barnes, 1998; Swanson, Norman, and Linn, 1995). In the case of portfolios, where the tasks may vary substantially from student to student and where multiple tasks may be evaluated simultaneously, inter-reader reliability may drop below acceptable levels for consequential decisions about individuals or programs (e.g., Koretz, McCaffrey, Klein, Bell, and Stecher, 1992; Nystrand, Cohen, and Martinez, 1993). Adequate levels of score (reader and task) reliability have typically been achieved by further standardizing the task directions, choosing tasks with higher intercorrelations, disaggregating the portfolio into separate tasks that can be scored one at a time, and then estimating generalizability as one would with any collection of performance tasks.

Brennan (2001) cautioned that tasks and raters are only some of the sources of error that are likely to matter. He cited other sources of variation that should likely be taken into account. These included different occasions, both occasions of testing as well as occasions of scoring, and different methods of testing as sources of error. He noted that some of these, such as different methods, are better conceptualized as convergent validity studies (rather than as reliability studies per se). (Note 4) Of course, certain types of estimate are often deemed not feasible, including parallel forms reliabilities with portfolios and assessments of performance in different contexts (ETS, 1998; Harris, 1997; NRC, 2001b; Porter et al., 2003).

Special studies involving performance assessments have looked at relationships among methods of assessment: between multiple choice and performance assessment (e.g., Lane et al., 1996; Crehan, 2001); (Note 5) between different methods of performance assessment such as direct observation of scientific experiments and analysis of students' notebooks (Shavelson et al., 1991); and between on-demand and school based tasks (Gentile, 1992, in Brennan and Johnson, 1995). The general conclusion is that different methods appear to be getting at somewhat different constructs (e.g., Brennan, 2001; Brennan and Johnson, 1995). Fewer operational assessments in education undertake this sort of empirical research, relying instead on empirical evidence of reliability and logical arguments about content-relevance and representativeness.
And, indeed, while the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999) require at least some sort of empirical evidence about reliability, they mention “external validity” as only one potential source of evidence, but leave the choice of validity evidence up to the assessment developer and user.

Some authors note a tradeoff between these two levels of generalization. Strengthening the faithfulness with which the assessment represents the outcome domain often undermines the reliability of assessment (as reflected in the many technical problems with performance assessment) and enhancing reliability, for example by employing a larger number of shorter tasks, undermines fidelity (e.g., Kane et al., 1999).

With Teaching Performances, in Particular

Research into the generalizability of performance assessment of teaching has tended to emphasize much the same sort of evidence described above, focusing primarily on consistency among tasks and judges. There are two major programs of research that are most relevant to our study, the portfolio assessments of the National Board for Professional Teaching Standards and the observation/interview assessments of Praxis III. Both of these assessments are developed by the Educational Testing Service.

National Board’s standards-based assessments are designed to certify the accomplishment of experienced teachers with at least three years of service. Assessments are developed or underway for over thirty different certificates (differentiated by subject area and age of students taught). The ten performance tasks that comprised the assessment in each certificate area (when the research described here was undertaken) are divided into two parts: a portfolio completed by candidates in their home schools across a year and a one-day assessment-center experience. The school based portfolio consists of (a) four tasks that ask candidates to document their practice, through videotapes and samples of student work, and to provide "extensive analytical and reflective commentary" (Pearlman, in Jaeger, 1998, p. 191), and (b) two tasks that ask candidates to document their accomplishments outside the classroom and explain why they are important. The four assessment-center tasks provide candidates with materials such as student work samples, assessment records, instructional resource materials, or professional reading and ask them to use the materials to diagnose the status of student learning, plan instruction, and so on (Pearlman, in Jaeger, 1998, p. 191). (Note 6) Each exercise is scored independently by two reviewers. The resulting scores for each exercise are weighted and aggregated to form an overall composite score for each candidate. This composite is then compared to a predetermined passing score.

The National Board’s Technical Analysis Report (ETS, 1998) described four relevant sources of error:

Assessors: Would a candidate, given a different set of assessors, fare similarly on the assessment?
Exercise Sampling: Would candidates perform similarly on a different set (sample) of exercises?
Assessment occasions: Would candidates fare similarly if they took the same assessment on a different occasion?
School context: Would candidates fare similarly if they happened to teach in a different school?

They noted that it is not feasible for them to provide evidence of reliability across school contexts or assessment occasions.
With assessment occasions, they argued that there is likely to be a learning effect such that one would expect a candidate to fare differently (better) and so reliability may not feasibly be assessed.

They provided empirical evidence with respect to assessors and exercise sampling, concluding that both are adequate to support the assessment for its intended use (ETS, 1998, p. 125; see also Myford and Engelhard, 2001). (Note 7) With respect to exercise sampling, they cautioned readers about the limitation of such evidence since the set of tasks was explicitly designed to represent a multidimensional domain:

Whether an assessment with the current design can be considered to allow for alternative forms in a traditional measurement sense is debatable. It is possible to argue that the exercises are but one possible sample from a larger domain of accomplished teaching or that the exercises, for all intents and purposes, comprise a fixed assessment of accomplished teaching. (ETS, 1998, pp. 107-108)

This is, in fact, typical of the way in which task generalizability is investigated with portfolio assessments (e.g., Klein et al., 1995; Koretz et al., 1992; Reckase, 1995; Nystrand et al., 1993); what we have is an estimate of internal consistency (based on tasks that were designed to access quite different elements of teaching practice) and that treats as fixed a wide range of factors that may in fact vary. Following Brennan (2001), this is not really a replication “using two full length operational forms” (p. 313).

With respect to "transfer," Bond, Smith, Baker, and Hattie (2000) examined the relationship between scores on the National Board's assessment (in two certificate areas for 65 teachers) and 1-3 hour observations of teaching accompanied by interviews with teachers and some students. The casebooks produced from the visits were scored according to thirteen dimensions of accomplished teaching identified in an extensive literature search. Using discriminant analysis, they were able to correctly classify 84% of teachers as to whether they had been certified using the National Board’s assessment. Other studies are currently underway (see ). While the National Board’s goal was primarily documenting consistency across the sources in support of the validity of the NBPTS assessment, our purpose is to illuminate both similarities and differences at the level of particularity that qualitative methods allow.

ETS’s PRAXIS series, which is intended for use with beginning teachers, involves three sets of assessments: PRAXIS I focuses on basic skills, PRAXIS II on content knowledge and general pedagogical knowledge, and PRAXIS III on teaching performance.
The PRAXIS III assessment involves direct observations of classroom performance over a series of “assessment cycles.” An assessment cycle consists of a preliminary description of the context, the students, and the lesson-to-be-observed, prepared by the beginning teacher; an observation of a lesson of instruction by a trained assessor (experienced teacher); and pre and post semi-structured interviews. The assessor’s notes are then scored on a list of nineteen criteria (that were developed through an extensive literature review and job analysis survey) and an overall score given. “Summative decisions are made based on cumulated data from two or more assessors based on two or more assessment cycles” (Dwyer, 1998, p. 8). In addition to the obvious differences in methods, PRAXIS III is intended for use across grade levels and subject areas, and the criteria for classroom observation have not been tailored to particular subject areas as with INTASC and the National Board. Although this leads to a somewhat different emphasis, Porter et al. noted the similarity of the PRAXIS criteria to the general principles of the National Board and INTASC. While there are multiple studies of assessor reliability, there are no reports of generalizability across assessment occasions that we could locate (Dwyer, 1998; Myford and Lehman, 1993; NRC, 2001b; Porter, Youngs, and Odden, 2003; Myford, personal communication, 3/5/03; Wylie, personal communication, 5/2/03). With respect to generalizability across occasions, assessment developers caution:

“The purpose and consequences of the assessment, particular local circumstances, and the beginning teacher's level of performance (both absolute and in terms of improvement) are factors that determine how many assessment cycles will be carried out. Guidelines governing Praxis validity and use prohibit decision-making on the basis of a single assessment cycle or on the judgment of a single assessor (Educational Testing Service, 1993b).” (Dwyer, 1998, p. 171)

Thus, the comparisons in the study reported here--which involve full length replications of portfolio assessments, methods of performance assessment, and classroom contexts in which the same tasks can be implemented--begin to address an important gap in our understanding of the generalizability of portfolio assessments of teaching and, perhaps, of performance assessments more generally.

Research Design

Our study draws on qualitative methods to address questions of portfolio generalizability through comparative content analyses across different portfolios and different methods of assessment for the same teachers. Consistent with Kane and colleagues’ (1999) conception of a negative argument, built from a serious effort to disconfirm, our goal is to illuminate differences that challenge assumptions about generalizability. Where to locate these comparisons in terms of the level of generalizability described in the previous section is an open question. At face value, one might argue that the portfolio-portfolio comparison is a reliability issue (different occasions on which same tasks are performed), and the portfolio-case comparison is a transfer issue (different methods and different occasions). And yet, as we return to this issue after sharing our findings, the nature of variations that the different occasions afford makes this problem far more complex--as occasion is confounded with uncontrollable aspects of context--and raises important questions about the nature of the assessment domain to which we can appropriately generalize.
These are the variations that can be invisible when portfolio reliability is examined via intercorrelations among tasks and readers.

We begin with a brief description of the INTASC portfolio assessment system and then describe data collection for the two comparative studies--portfolio-portfolio comparison and case-portfolio comparison--which were replicated in secondary English Language Arts (ELA) and Mathematics (Math). Since the comparative content analyses for both studies follow a similar pattern, we describe those activities in a fourth section. While the data sets are small from a quantitative perspective (29 comparative cases across the two studies and two subject areas), our goal was to understand each comparative case in depth and to illuminate issues for assessment developers and policy makers to consider.

INTASC Portfolio Assessment

The portfolio assessments are intended for teachers in their first, second, or third year of teaching. To guide the portfolio assessment, INTASC has developed a set of general and subject specific standards based on INTASC's Principles for Beginning Teachers and standards from the relevant professional communities. The standards and related assessments are intended to provide a coherent developmental trajectory with those of the National Board. The assessments ask candidates for licensure to prepare a portfolio documenting their teaching practice with entries that include: a description of the contexts in which they work, goals for student learning with plans for achieving those goals, lesson plans, video tapes of actual lessons, assessment activities with samples of evaluated student work, and critical analysis and reflection on their teaching practices. Unlike the National Board portfolios (which contain four separate entries), these entries are organized around one or two units (8-10 hours) of instruction such that the portfolio cannot easily be broken into parts for separate evaluation. Judges evaluate the portfolios in terms of a series of “guiding questions” focused on the portfolio but based on the standards described above; they record evidence relevant to each guiding question and develop interpretive summaries or “pattern statements” that respond to the question; then they determine an overall decision about the candidate. (Note 8)

As developed by INTASC, the portfolios were intended both for professional development and for informing decisions about licensure. Of the 10 INTASC states that participated in the development of the portfolio assessment, only Connecticut is currently using it to inform licensure decisions. For this study, participants were recruited from field tests in multiple INTASC states in 1998-2000. Because our interest in this paper is in the generalizability of portfolios for licensure decisions, we chose to evaluate the portfolios using the guiding questions and decision guide as they were used by Connecticut for field tests in 1999-2000, even though the participating teachers were recruited from multiple states. As it was implemented in Connecticut in 2000, there were four possible levels to the overall decision: conditional, basic, proficient, and advanced. Judges also completed a "feedback rubric" on which they selected performance levels that best characterized the portfolio with respect to each guiding question.
The assessment occurred as part of a 2-3 year induction program in which beginning teachers who had an initial three-year license were provided with a mentor in the first year and the opportunity to attend state-sponsored workshops to prepare them for the assessment. When fully operational, teachers who did not pass the portfolio assessment in their second year would continue in the program for another year. If they did not pass in the third year, they would be required to reapply for the initial license after successfully completing additional course work or a state approved field placement.

Portfolio-Portfolio Comparison Data Collection

A small sample of secondary beginning teachers in math (n=7) and ELA (n=6) were recruited to complete two portfolios during the same year, choosing classes and units of instruction that differed as much as possible within their routine teaching assignments. They were compensated for the second portfolio. Not surprisingly, it was very hard to find beginning teachers willing to assume the burden of two portfolios, and it is impossible to fully understand how these stalwart volunteers might have differed from their colleagues. We can say that their portfolios do reflect a range of performance levels, teaching practices, and school contexts and that their paired portfolios do illuminate an instructive array of differences, consistent with the goals of the study.

Case-Portfolio Comparison Data Collection

Another small sample of secondary teachers in math (n=8) and ELA (n=8) was asked to allow case studies of their teaching shortly after they submitted their portfolios. The sample was recruited to include differences in gender, ethnicity, school context, and performance level (based upon a quick read through by the portfolio developers). The case studies took place over 3-5 days (depending on the teacher’s schedule) during which researchers observed classes; conducted entry, exit, and brief daily interviews with the teacher; and interviewed the school principal and, if possible, a mentor, regarding the support available to the teacher. [See Ball, Gere, and Moss, 1998; Moss, Rex and Geist, 2000a, 2000b for fieldwork and case write-up guidelines.] Case study researchers observed two classes: the class used in the preparation of the portfolio and a second class. As with the portfolio/portfolio comparison, we asked for a class that differed as much as possible within the teachers' routine teaching assignment (but sometimes we were only able to observe a different section of the same class). Our intent was to parallel the information collected in the portfolio as closely as possible and to gather additional information about the teacher's background, school context, and experience preparing the portfolio to address additional questions of fairness. Teachers were given a small honorarium for participating in the case study. As before, it is not possible to know how these volunteers differed from the larger population of beginning teachers.

Case study researchers, all experienced teachers in the appropriate field, were taken through an abbreviated course of study (with practice and feedback) in taking fieldnotes and conducting interviews relevant to the project. Tape recordings and artifacts were used as back-up.
Field and interview notes were read by a senior researcher and questions of clarification and elaboration were raised to guide revisions (which could be supported with audio-recordings and artifacts). Case study researchers were then asked to draw on their notes in responding to the Guiding Questions used to evaluate the portfolios. Again, a senior researcher reviewed the responses (with fieldnotes at hand) and raised questions to facilitate revision.

Comparative Analyses

The comparative content analyses for both studies were undertaken in a similar fashion. Research assistants (experienced teachers in the content area with graduate research training) used the guiding questions (and the dimensions contained in the related feedback rubric) to develop a coding scheme for the two sources of evidence. Videotapes were roughly scripted for coding. Then, answers to each of the guiding questions were developed for each source based upon a comprehensive review of the evidence, including the search for counter examples to challenge developing interpretations. Similarities and differences were then noted, organized by guiding question and overall. Justifications for perceived differences in performance level with respect to the criteria were developed. For the portfolio-portfolio comparisons in ELA, each pair of portfolios was read twice, in reverse order by two research assistants, who then met to develop a consensus on any differences. (Note 9) For the portfolio-case comparisons and the portfolio-portfolio comparisons in math, a single comparison document was developed, and the process was audited by another researcher. The comparative content analyses typically took 3-5 days
per teacher and generated 30-70 pages of text each. These comparisons were then condensed into 2-3 page versions that highlighted substantial differences both at the level of the guiding question and overall.

It is important to note that we have, for the purposes of this paper, bracketed questions about consistency among readers. Elsewhere we address concerns about differences in the way knowledgeable readers evaluate portfolios in different social settings when trained to reach consistent decisions and when allowed to draw on their own criteria of competent teaching (Moss and Schutz, 2001; Moss, Schutz, Haniford, and Miller, in preparation; Schutz and Moss, in press). Here, we present findings whose validity is based upon in-depth analyses, in which relevant differences in perspective between readers were resolved through consensus-seeking dialogue. The issue for us is not the validity of a specific score; rather it is the validity of an interpretation of difference between two portraits of teaching and an argument for whether the observed differences are likely to matter in light of the evaluation criteria. We present our evidence for which differences are likely to matter in sufficient detail that readers can reconsider these judgments for themselves.

Structural Differences Between Data Sources and Asymmetrical Questions of Comparison

By structural differences, we mean those differences between data sources that could be anticipated in light of the different methods and which are, in fact, typically present in our data. With respect to the portfolio-case comparisons, beyond the obvious differences in data collection methods, it is important to note the following. While we attempted to have case study researchers present on days when teaching consistent with what is expected in the portfolio was occurring, it was not always possible to observe all the aspects of teaching called for in the portfolio.
For instance, while the ELA portfolio required evidence of students' response to literature and students’ processes in writing, the lessons observed in the case study might not cover both areas. The case-based evidence is typically weak with respect to formal assessment procedures since often no formal assessment was occurring. However, the case study provides substantially more evidence about daily classroom interactions. The case also provides rich information about the context in which the teacher worked and about teaching practices not foregrounded in the portfolio evidence. With respect to the portfolio/portfolio comparisons, the portfolio completed second is invariably shorter, often considerably so. It typically contains fewer artifacts and shorter commentary (sometimes with reference back to the first portfolio). This caused us to develop an asymmetrical comparison and research question: To what extent does the second portrait (case study or second portfolio) cause us to reconsider the evaluation of the teacher's performance in (what we'll call) the primary portfolio?

Findings

Our comparative analyses were set up to uncover differences in the two portraits of beginning teachers and to evaluate whether the differences were
likely to result in different decisions in light of the INTASC standards (as instantiated in the guiding questions and the decision guide as adapted and used in Connecticut in 2000). We make no attempt to estimate the frequency with which these sorts of differences are occurring; our evidence is not appropriate for that purpose. Again, our formative goal is to illuminate issues for assessment developers and users to consider in designing an assessment system, characterizing the appropriate domains of inference, and limiting interpretations appropriately.

We present our findings in the following sections: We begin with an overview of the variations in context of the classes and units selected by these teachers. Then, we illustrate our comparison methodology in substantial detail with comparisons in both math and ELA. In the first comparison, we provide an example of a case in which the differences observed do not seem to matter in terms of the relevant criteria (which was, we should note, true in the majority of cases). In the second comparison, we provide an example in which the evidence in the portfolio is, we’ll argue, ambiguous, because the artifacts (videotapes, handouts) only partially support the written representation of the class; the case appears to clarify the ambiguity. Whether this is a difference that would “matter” depends on how the portfolio readers weigh the partially conflicting evidence. Thus this comparison, even more so than the first, illustrates some of the interpretive problems we encounter with these sorts of data, problems that we have tried to address through far more in-depth readings than would be possible in operational use. Then, we present a series of briefer vignettes that describe situations in which the second portrait caused us to question the conclusions we drew from the primary portrait.
Contextual Variations

As indicated above, for both sets of comparisons, we asked teachers to choose a second class that differed as much as possible within their routine teaching assignments. The classes they selected are presented in Tables 1 and 2 for secondary ELA and math teachers respectively. Given their selections, it is important to note the many different kinds of (often intersecting) contextual variations that are present in the comparisons we examine. These include: different sections of the same course (which entail differences in time of day and whether the teacher has taught the lesson before); differences in (perceived) ability levels and groupings of students, including those designated by the school directly (remedial, AP, and the like) or indirectly (scheduling in ELA resulting from math assignments) and those perceived by the teacher; different courses; different grade levels; different units within the same course; differences in (mix of) cultural backgrounds of students; different times of year (which involves differences in teachers' knowledge of and relationship to students); differences in class sizes; differences in availability of curriculum and support materials; differences in extent to which these materials are consistent with the standards. These are all variations that are fixed for a given portfolio assessment of a teacher and are unexamined when all we have is the single set of performances from a given class. (Note 10)

Illustration of Analysis in ELA with a "Complementary" Portfolio/Case Comparison


To illustrate our comparison methodology in ELA, we focus on one portfolio/case comparison, "Ms. Bertram" (Note 11), in which the activities we observed differed substantially, and yet we found the portrait of the teacher conveyed in the case study provided quite consistent evidence with respect to the general evaluation criteria. We illustrate this comparison in some detail both to document our practices of analysis, and to show how two quite different activity contexts can nevertheless support similar conclusions about the teacher. We begin with a discussion of the ELA portfolio guidelines, the guiding questions (developed by INTASC and revised by Connecticut) to evaluate completed portfolios, and the way in which we applied them for this study. Then we return to the specific case of Ms. Bertram.

The ELA portfolio handbook asks candidates to complete two distinct entries: one each in teaching response to literature (RL) and processes of writing (PW). Teachers may choose the same class or different classes for these two components. Across these two exhibits, we have the following sources of evidence from each teacher: (a) the teacher's rationale for her choice of literature and writing assignment, (b) the teacher's daily logs for 10 lessons in which she describes the activities she and the students engaged in (providing copies of instructional artifacts) and writes brief reflections about how the day’s lesson went, (c) videotapes of two to three activities reflecting different participation structures, (d) teacher's reflections on the videos, (e) five samples of student writing, including multiple drafts, with teacher's comments on the writing, (f) the teacher's reflections on the students’ writing, and (g) the teacher's general reflections on her teaching in the unit.

In the case study, we have fieldnotes depicting the activities and the discourse in the classroom across three days for each of the two classes.
Through notes from a series of interviews, we learn about the teacher’s goals, her specific plans for daily lessons, her reflections on how the lessons went (in general and for particular students), and her goals for professional development.

The guiding questions for ELA (initially prepared by INTASC and revised by Connecticut for use in 1999-2000) are organized into four separate categories. (1) Questions about literacy focus on "connections among responding, interpreting, and composing" with an emphasis on the extent to which students develop their own meanings. (2) Questions about instruction focus on how the teacher organizes students' learning--including questions about alignment between goals and instructional strategies, about integration of activities within and across lessons, and about materials--with an emphasis on the extent to which instruction provides learning opportunities (challenges) for all students and promotes independence. (3) Questions about analysis of learning focus on formal and informal assessment of students’ work--how the teacher monitors students’ progress, communicates with them about their learning, and uses that information to inform instruction. (4) Finally, questions about analysis of teaching focus on how the teacher reflects on student learning and uses that reflection to inform her practice. (Note 12) While some of the guiding questions were quite specific and descriptive in nature (e.g., "Describe how the teacher helps students use a writing process, including context, purposes, and conventions of standard written English.");
others involved much higher levels of inference that required integrating multiple types of evidence (e.g., "Describe the ways in which the teacher creates a learning environment that provides all students with opportunities to develop as readers, writers, and thinkers."). In analyzing the portfolios, we found ourselves following a multi-step process. We began with describing the various sources of evidence (e.g., describing the teacher's goals, outlining the progression of lessons, scripting the videotape, characterizing the artifacts in terms of the nature of students’ responses and any written comments by the teacher; illustrating the ways in which teachers reflected on their students' work in their commentary). Then we developed interpretations that coordinated various sources of evidence (e.g., considering the relationship between the teacher’s goals and the progression of lessons to evaluate alignment and scaffolding or between the teacher's commentary on the video and what we had observed to evaluate quality of reflection). Finally, we moved to the level of responding to some of the higher-inference guiding questions (e.g., "Describe how the teacher uses knowledge about students to meet their needs in instruction and provide them with opportunities to learn" or "Describe the ways in which the teacher creates a learning environment that provides all students with opportunities to develop as readers, writers, and thinkers."). For the case studies, the fieldnotes from the classroom observation and notes from a series of interviews with the teacher allowed us to engage in much the same process. Our task was somewhat easier as the case study writer had constructed responses to the guiding questions that drew on evidence from the field and interview notes. We nevertheless reviewed the field and interview notes in light of the case study writer’s conclusions and often included the additional detail in our comparisons.
The Appendix provides brief excerpts from the 70-plus page portfolio/case comparison document prepared by LeeAnn Sutherland. It shows brief examples of the sort of evidence we have from the two methods and illustrates the way we have combined the evidence to develop interpretations and comparisons relevant to the guiding questions. Below we offer some general conclusions based on the comprehensive evidence in the longer document, for which the Appendix provides only brief examples.

Ms. Bertram teaches sixth grade English Language Arts in a middle school located in what the case study writer describes as a small, relatively affluent suburban community. In the portfolio and the case, we see a "reading and writing" class of 24 students who meet daily for two periods. In addition to this reading and writing class, the case study writer observed another section that covers only writing.

For the response to literature exhibit in her portfolio, Ms. Bertram selected a series of lessons based on the study of a novella commonly used with this age group. Three separate tasks require students to (a) identify the character traits of the main characters, (b) compose a written response citing which character they felt they were most similar to/could relate to best/liked the best and why, and (c) use that reflection as the foundation for creating a simulated journal. In the processes of writing exhibit, we see Ms. Bertram guide students through the development of a poem using metaphors to describe their mothers in preparation for Mother's Day.


The case study describes three days of parallel lessons in two classes where the teacher focuses on having students select three best pieces of writing from their notebooks, complete evaluation sheets about each one, exchange with a partner who would name his or her choice for the writer’s best piece on a ‘Nomination Ballot,’ revise that piece of writing, and publish it on a web page. Even though the activities are substantially different, there is nothing in the case that would cause us to question our evaluation of the primary portfolio. Both portraits show the teacher using a variety of activities to help students use literature to make connections, take others’ perspectives, and explore concepts, scaffolding their learning through the activities she creates and the discussion she guides. She also uses a variety of activity structures (e.g. small group, whole class). We have ample evidence of similar classroom interaction wherein the teacher poses questions to which students respond initially (to begin an activity or class session), and she then builds from students’ responses to guide subsequent questions, consistently validating their contributions. In both portraits, we see the teacher employ a variety of strategies to guide students in developing as readers, writers, and thinkers. Either portrait would tell us that this is a highly reflective teacher who uses that reflection to shape practice immediately and to think about changes in her future practice. She consistently addresses both the strengths and weaknesses of each lesson as well as their relationship to the larger unit. Thus the evidence in the case reinforces the conclusions from the portfolio in somewhat different contexts of teaching.

Illustration of Analysis in Math with an Ambiguous Portfolio and Clarifying Case

Complex evidence of the sort contained in the portfolio and case often presents substantial interpretive problems to readers.
While this is not the focus of this paper, reading problems do impact the nature of the conclusions we draw. Here we illustrate our analytic practices with a math comparison and present a situation where the evidence in the portfolio is somewhat ambiguous and where the additional evidence in the case appears to support one potential portfolio interpretation over the other.

The math portfolio handbook focuses on a single 8-12 hour unit in mathematics and requests artifacts similar to those requested in the ELA handbook. Here the portfolio contains a description of the classroom context, descriptions of a series of lessons with instructional artifacts (e.g., handouts, assignments); videotapes, student work, and reflections on two featured lessons; a cumulative evaluation of student learning with accompanying reflection; a focus on three students across the featured lessons and cumulative evaluation of learning; and analysis of teaching and personal growth. As with the ELA handbook, then, we have partially independent artifacts (including the videotape, instructional artifacts, and samples of students’ work) against which to evaluate (some parts of) the teacher’s description and reflection/evaluation on what happened.

The guiding questions in mathematics are organized into five categories (as initially prepared by INTASC and revised by Connecticut for use in 1999-2000). (1) Tasks focuses on the appropriateness (variety, richness, challenge,
accessibility) of the tasks selected by the teacher and on how effectively they are implemented (clarity, accuracy, alignment, and responsiveness to students’ interests, styles and experiences). (2) Discourse focuses on how effectively the teacher orchestrates discourse, uses tools and materials to support discourse, and promotes discourse among students in which powerful kinds of thinking predominate (defined as students exploring a variety of approaches to problems and explaining their reasoning with evidence). (3) Learning environment focuses on how effectively the teacher manages the physical, time, and social aspects of the classroom and encourages participation and engagement by all students. (4) Analysis of learning focuses on how effectively the teacher assesses students’ learning (accuracy, variety, and alignment with objectives and tasks) and communicates with students about expectations and feedback. (5) Finally, analysis of teaching focuses on how the teacher learns from and improves teaching. The comparison methodology was similar to that presented in ELA. (Note 13)

The mathematics teacher in this portfolio/case comparison works at a large urban high school, in which 68% of students receive free/reduced lunch. The portfolio presents an 11th grade Integrated Geometry course. The teacher, Ms. Fleming, explains that this course is the lowest level geometry course offered by the school and that she closely follows the text. The unit presented in the portfolio concerns tessellations and triangles. The case study follows the same Integrated Geometry class and a 9th-10th grade Math I class. Ms. Fleming reports that Math I is the lowest level math course offered by this school with the exception of remedial math. At the beginning of the course the students shared textbooks with another class; however, Ms. Fleming indicates that these texts soon disappeared. Ms.
Fleming uses worksheets left by a previous teacher and generates her own curriculum worksheets. She reports that approximately 75% of Math I students are failing. The lessons presented in the Math I class focus on basic arithmetic, naming of geometric objects and measurements. The Integrated Geometry lessons observed for the case focus on triangles, angles, and parallel lines. The case was conducted late in the year when both classes were reviewing material for a final exam.

We begin with an extended discussion of the portfolio because it alone raises a complex interpretive problem when the partially independent artifacts (videotapes, handouts) are compared to the teacher’s descriptions of what is happening. In the interest of space, we focus on connections across lessons, nature of mathematical tasks, and the implementation of tasks in classroom discourse. Then we turn to the case where the portrait of the teacher is substantially different from what is portrayed in the written portion of the portfolio. [Both descriptions draw heavily on Pamela Geist’s extended comparison document.]

The evidence in the portfolio creates a picture of a teacher who sees how the mathematics of a unit connects across ideas and to prior and later learning. Early in the portfolio, Ms. Fleming describes some of the mathematical connections she believes are important for students to understand. She writes,

Knowing the properties of triangles is important to the student of mathematics because it is the starting point for learning the
properties for special triangles and enclosed figures, namely polygons. For example, the Triangle Angle-sum Theorem can be used to derive the sum of the interior angles of a quadrilateral and convex polygons. It also lays the foundation for students to learn about pyramids and other three dimensional figures.

Ms. Fleming describes in detail how the seven lessons across the unit connect mathematically and what students will learn across the unit to accomplish learning goals and objectives. For example, she explains,

It was important to show the relationship between the exterior angle of a triangle with the adjacent interior and remote interior angles. Once the properties of a single triangle were established, it was necessary to establish the relationship between a pair of congruent triangles and how to use the postulates to establish congruence. In order to establish this, students had to learn how to make congruence correspondence and congruence statement. Of course this also leads us to establishing proof, but my department recommended not to introduce proofs with this level class.

In the portfolio, Ms. Fleming develops a strong case for the predominance of discovery-type tasks and learning. She explains that hands-on discovery type tasks dominate her practice and that students learn best in these types of lessons. She writes,

The tasks that are most effective are of a ‘discovery’ or ‘hands on’ type… [explaining] when my students “see it” and “find it” the learning is retained …. I try to let my students have the experience of discovery even when it seems small. I have found that using the discovery method works best for my students so I have tried to use it often…

Using this method students get to see ideas. For example, when students put together the angles from a triangle and actually saw that it made a straight line, they knew that adding the interior angles of a triangle would equal 180.
They saw that it worked for all triangles regardless of the size and shape.

In the portfolio artifacts, we see a range of tasks including tasks that appear to offer opportunities for discovery and those that focus more on recall and application of definitions and facts. Consistent with the teacher’s description, a series of problems presented in one of the instructional artifacts asks students to work at drawing and measuring the various angles within the triangles and record their data. The questions that follow ask students to detect a pattern in the data and develop statements or conclusions about various relationships within the triangles. There is much writing in the portfolio explaining that these tasks and others like them are selected because they support students’ opportunity to formulate conjectures, reason about mathematical ideas, and justify results. The teacher also provides evidence of other types of tasks that are designed to check on students’ general understanding of geometric shapes and their properties. For instance, they ask students to classify shapes, recall definitions and theorems, use definitions and properties to find other measures,
and to justify answers with a known theorem or definition. The tasks appear on daily activity worksheets, homework assignments, and on assessments such as tests or quizzes.

We turn next to the videotape to see how these sorts of tasks are implemented in classroom interaction. Here, we focus in detail on one of two videotaped lessons providing excerpts from our rough transcript of the videotape and Ms. Fleming’s reflections on what occurred. The videotape is less effective in making the teacher’s case for discovery-type learning. We see Ms. Fleming guiding the discourse with students responding in short statements that restate a definition or fact. The excerpt we’ve selected begins about 4 minutes into the tape after Ms. Fleming has finished reviewing, through brief question and answer segments, the previous day’s lesson for students.

T Okay. So let’s look at a triangle (she has an example on the overhead (4:37 into the tape)). We have remote interior angles, we have exterior. We have adjacent interior. Let’s look in relationship to this one angle (points to image on the screen). Here we have an exterior angle. It’s outside the triangle. The adjacent interior is the one that is what?
S Sharing the same sides.
T Sharing the same sides. So it’s adjacent. Adjacent means?
S Next to.
T Next to, okay? Remote means, we said?
S Far away.
T So these two are far away from this exterior angle. Right? These are going to be your remote interior.
S (unintelligible)
T Now look at this, if I said 2 is your exterior angle, what is the adjacent angle for 2. Where would it be located?
S (A student is asked to come up and point out the specified angle. Other students were calling out some helpful comments as well as “I know, I know” as he points to various angles the teacher asks him to identify remote interior. She asks him to confirm that he is pointing to remote interior or remote exterior. He confirms remote interior).
T Very good.
So depending on which angle you pick, your remote interior angles will be switching sides. Okay, this is my exterior angle 1, right (she points)? So what angle is adjacent to that and inside the triangle?
S What’s adjacent?
T Adjacent means next to, it’s touching. It’s sharing the same side. So angle A over here would be CAB. Correct? CAB is adjacent to angle ...? I’m looking
at angle 1. What is it?
T I’m looking at angle 1. What kind of angle is angle 1?
S Exterior
T What is the adjacent angle to angle 1?
S 4
T What are the remote interior angles?
S 5 and 6
T 5 and 6. Much better.
T & S (At 7:28 in the tape) (Some students need clarification so students and teacher have a brief discussion on the different types of angles and their relationships to each other).
T (Teacher moves around the triangle she has on the overhead and asks for students to quickly identify remote, adjacent, and exterior angles).
T Now, today you’re going to look at the relationships between these angles. Okay? I’m going to hand out a worksheet and you’re going to do that.
[Break in sequence. Students are now working together in small groups on the assigned worksheet. The teacher walks around to answer questions and check their work. Students are comparing work. They use rulers and protractors to measure angles and help each other construct the various triangles on the worksheet. (It’s hard to hear what students are saying to one another but the teacher’s voice can be heard from time to time.)]

In reflecting on this lesson, Ms. Fleming describes how she interacted with students to arrive at a solution:

I did not offer ‘answers’ for the students, but guided them using questions to arrive at a solution. For example, when the male student attempted to identify the angles on the transparency, I realized that he was trying to ‘bluff’ his way out of it. I guided him by repeating the names of the angles, emphasizing the words adjacent and interior.

And about the small group time, she writes:

When a student asked me if she measured an acute angle correctly, (she did not by reading the protractor incorrectly) I asked her if her angle was greater or less than 90, and if her answer made sense. When student A asked me about the measures of her angles, I asked her how she could check them. Once she ‘got it’, she proceeded to help another member of her group.


This is not an unreasonable representation of what occurred, and it helps us understand why she made some of the choices she did. Viewed in light of this interpretive commentary, and taken together with the description of all the lessons that reflect a privileging of discovery-type learning, it is possible to situate the evidence in the videotape within a larger picture that mitigates its dominant impression. While not presented here, the other videotape and reflections surrounding it raise similar issues; the teacher’s description surrounding the lesson creates a different image than what we might infer from the videotape alone.

As teachers, we know that even in the most ‘learner-centered’, discovery-oriented classroom, there are often (with good reason) stretches of dialogue that resemble what we see here. That we can’t hear what is happening among the students on the videotape allows the teacher’s characterization to shape our impressions. Viewed alone, this portfolio can be constructed as a relatively strong performance, better than just passing, even though the evidence provided by the artifacts is a bit uneven.

Turning to the case study, what we see reinforces what we see in the videotape and, taken together with what the case study researcher reports from his interviews with the teacher, presents a substantially different portrait. We focus on the same aspects of the teacher’s practice, presented in essentially the same order: connections across lessons, nature of tasks, and implementation. About her characterization of connections across tasks, the case study researcher writes: “In the pre-lesson interviews when asked to describe her objectives for the next day’s lesson, they were always in terms of discrete topics to be covered, sometimes by book chapter. For instance, Ms. Fleming explained her plans for her Math I class, “I am presenting material that is very close to that they are seeing on the exam.
I will do Chapter 10 tomorrow, reading graphs, finding mean, median, mode, and range.” Or, following a Geometry lesson, she says: “they can do triangles, but not the parallel lines. I keep throwing these [parallel line problems] at them so they keep seeing them.” In his search for counter evidence about this developing pattern, the case study researcher offers the following quotation:

We do introduce the concept of showing how triangles can be congruent and we will ask them to give reasons. The last thing they were doing was perimeter and area for rectangles, parallelograms, and other quadrilaterals. We did Pythagorean Theorem and area under the curve using a trapezoid. And we try to reason with them by making ‘cubes’ under the curve and having them count the ‘cubes’.

The case study researcher argues that the teacher merely mentions ideas that were presented in earlier lessons or other contexts but does not offer any deeper understanding of how ideas are connected, only that they are. The case study researcher develops a very different portrait of the dominant kinds of tasks offered and the kinds of learning they promote. The case study writer concludes, “there is little diversity or richness in the problems offered,” and, “the majority of tasks are one-step applications of definitions and theorems.” He describes, “Both in the integrated geometry and in the geometry
22 of 70 content of the Math I course, she emphasized fundam ental skills such as naming and applying simple definitions. In the firs t period I observed of Integrated Geometry the first set of problems all a re based on knowing definitions (e.g. altitude, median, congruent) or t heorems (e.g. corresponding angles congruent). The questions are one-step appli cations of definitions where the only probe mentioned in the problem, ‘How do yo u know?”?’ is more a reference to naming the correct specific theorem us ed to solve the problem.” He presents numerous examples of these kinds of tasks. He concludes, “Ms. Fleming’s objectives across the tasks she offered w ere centered around coverage of facts, definitions and theorems student s had memorized and not on the development of particular skills or understandi ngs of broader concepts.” The case study writer offers a description of the t ypical discourse pattern, “There was one dominant pattern of interaction arou nd the tasks offered in Ms. Fleming’s classes. My characterization of this patt ern is based on three observations in the context of two different mathem atics classes– Integrated Geometry and Math I. Ms. Fleming offered students a set of problems, similar to those given earlier, in the form of a worksheet. Ms Fleming engaged students in a Question-Response-Evaluate type of dialogue ar ound the problems offered on the worksheet. The pattern consisted of Ms. Flem ing going over the problems with the students as a whole class. She wo uld move in order through the problems on the worksheet they were currently d iscussing and for each question, the pattern would be essentially the same .” The case study writer explains, “As can be seen from the example, Ms. 
Fleming asks a question, or reads a question from the sheet to initiate the conversation; next, a student responds to the question with a specific piece of information, either a number, theorem name, or yes/no with little or no emphasis on reasoning or justification; in the next turn of talk Ms. Fleming evaluates the student’s response, and then either gives a correct answer if the student answer is incorrect, or poses a new question, which implies the student answer was acceptable.” He notes two exceptions to the pattern: (a) a TV game simulation where students are allowed to call on one another for help if they aren’t confident of the answer to a question and (b) a small group activity where students worked together on a problem in groups of three or four where, he notes, the groups often took on much the same dynamic as the class overall: students worked on the same problems and usually agreed on an answer, which other students in the group then copied from those who ‘got it’.

Which portrait presents the more credible representation of the teacher’s practice? What might explain the differences? The difference in the quality of representation and reflection could be attributed, in part, to the differences in format: spontaneous comments in informal conversations and unprepared interviews are unlikely to show the depth of the teacher’s considered reflections. And, the written reflections may have been completed with full access to curriculum resources and feedback from colleagues. The handbook in fact encourages collaborative reflection with colleagues. It’s also important to keep in mind that the case study occurred at the very end of the year when the teacher was reviewing for the final exam. Does this make the classroom discourse atypical? Given the evidence, it is impossible to know. [We address the issue of ambiguous evidence in more detail in Schutz and Moss (in press).]
Whether this would count as a difference that matters depends, in part, on how portfolio readers cope with the ambiguous evidence in the portfolio.

Additional Comparative Vignettes

We examined all 29 of the comparisons in ELA and Math at the level of detail described and illustrated in the previous two cases. In this section, we present vignettes from five additional comparisons in which the differences we observed did seem to matter in terms of the relevant criteria and raise, we argue, dilemmas that assessment developers and policy makers should consider in the design of assessment systems.

Consistent with the intent of the paper, our vignettes are developed to foreground important differences for a particular comparison; we do not describe, as we have above, the similarities in these comparisons. In the interests of space, we summarize our conclusions with brief illustrations. [We hope the extended examples described above and in the Appendix illustrate the attention to detail that underlies these conclusions.] Each vignette follows a similar pattern: we first characterize the issues and context differences that the vignette raises (so that readers can choose whether to read the vignette) and then we provide brief illustrations of those issues. (Note 14) This section concludes with a brief mention of additional sorts of differences we noted but thought were unlikely to matter in terms of the criteria used. We reserve discussion of the issues the vignettes raise until the final section of the paper where we propose some possible paths for resolution.

Vignette 3: Mr. Richards

In this portfolio/case comparison we see a case in which an English teacher's performance looks substantially different in an honors class than in his third level class. Mr. Richards teaches in what the case study researcher describes as a rural school of about 600 students, 97% of whom are white. Distinguishing among students' placements in the school's tracking system, Mr.
Richards indicated, “Honors kids are chosen because of their work ethic and their intelligence.” Students in the second level “have the work ethic, but they just can't grasp the material. They will eventually, but their work ethic keeps their nose above water.” For the students in the third level, “the content is watered down.” While the portrait of the honors class is relatively consistent across the two methodologies, the case study highlights how it is that his beliefs, as well as institutional tracking, seem to shape his practice with students in different tracks. [The original comparison was prepared by LeeAnn Sutherland.]

Only the honors class is represented in the portfolio as they complete a poetry unit. The case study writer observed the honors class as well as a third level class. Mr. Richards explained that the poems for the third level are not as difficult; they use a narrative, abridged version of the literature selection from their textbook (honors classes read the unabridged version); he uses the textbook much more; he has different expectations of students’ writing (the focal correction areas are different). Mr. Richards’s rationale for his choices of texts, for the activities he employs to engage students with literature, and for his implementation of a writing process appears to differ in terms of his understanding of their level of ability.


As the case study writer describes it, both honors and third level students read the same novel at different times during the school year, but Mr. Richards assessed their interpretive needs and abilities differently. The goal for third level students was more “the story” and “trying to pick out the basic elements [such as] plot, theme.” He believed that honors students, however, “can go beyond the literal.” Honors students had “a lot of discussion … a lot of note-taking, explaining the concepts,” whereas third level students answered primarily lower inference questions on worksheets. Of honors students, Mr. Richards required out-of-class reading and book reports that follow a genre sequence: first fantasy/science fiction, second historical fiction, and the like. Students in the other class did a single book report on a biography or autobiography, and they reported on the book by creating a poster or doing an in-character presentation to the class. They ended the school year with a novel based on a made-for-TV movie which Mr. Richards acknowledged has absolutely no literary merit but that he chose because students like it and because it’s reading. Students wrote an essay at the end of the unit, a personal narrative that did not require them to make connections with the text itself.

Another example of the difference in opportunities provided to students depending on level was in composition study. Honors students prepared for 10th grade by writing a persuasive essay that included MLA documentation. About writing, Mr. Richards said that honors students would be mortified to conference with him individually as they “don’t like to be embarrassed.” He typically worked with small groups of these students in the first semester, he said, but did not require “rewrites” of them in the second semester because they had already mastered the guidelines of revision. Mr.
Richards writes in the portfolio narrative about two additional, “authentic” tasks honors students would complete: entries for a poetry contest and composing a group poem to be read at graduation.

In contrast to honors students, third level students met with Mr. Richards for individual meetings about their writing. Conferences took place at the front center of the room, facing the class, with the teacher seated at a low table and the student whose paper was being reviewed seated on a high stool next to him. Mr. Richards marked student papers ahead of time so that he would remember what to tell them they need to fix. He counseled kids by skimming their papers and calling their attention to each item he had marked as problematic. The case study researcher observed 15 conferences he held over two days. Mr. Richards emphasized form and mechanics in these meetings, including frequent references to spelling, contractions, capitalization, use of second-person pronouns, writing numbers in word form, and the need to include information in its proper place in an essay. Students spoke to answer his questions or to ask for clarification of his suggestions. Following those meetings students were to “rewrite,” which offered two options. Students could “rewrite the entire paper and make all the corrections” or they could rewrite problem words ten times, sentences three times, and in addition, write three things that they learned. Vocabulary study for students in this class consisted of writing definitions, parts of speech, and sentences using each word.

Vignette 4: Mr. Johnson


Here we have a portfolio-portfolio comparison across two different subject areas within mathematics. Both portfolios were generated by a novice middle school mathematics teacher working in a community that he describes as white, suburban, and blue collar. Mr. Johnson works at a large middle school with about 900 students, almost all of whom are native English speakers. The two portfolios present 8th grade math courses; both classes use a popular textbook series. The more advanced of the two is an Algebra I course for "average" students. The unit presented in the portfolio from this class concerns linear relationships, particularly the generation of algebraic equations for lines. The other portfolio is from a Transitions course for "general ability" students. The unit from this course covers statistics, particularly the generation of multiple types of graphs to display data. In a close reading of the two portfolios, important differences emerge relating to the use of ‘real world’ applications in classroom tasks, to modes of final assessment, and to the role of the teacher in classroom activities. [The original comparison was prepared by Jon Star.]

The first category of difference concerns Mr. Johnson’s use of real world examples and concrete materials. Connections to real world examples play a very prominent role in the tasks in the Transitions portfolio. Several of the lessons in this unit begin with students collecting data that is subsequently made into a chart. For example, students count Fruit Loops cereal pieces to determine which colors occur most frequently; they work with box scores of a basketball game; and they cut paper plates in their examination of pie charts. In contrast, context plays little or no role in the Algebra portfolio.
Students work exclusively with symbolic equations of lines: these equations are never given any referent or context, nor are any real-life situations embodying linear relationships introduced in class.

A second salient difference concerns the way Mr. Johnson assesses students at the end of the portfolio units. In the Algebra portfolio, the teacher assesses students in a traditional manner, using a written test, administered in a single class period. Students are asked to complete 23 problems, all clustered around the execution of procedures (finding an equation of a line given a point and a slope, finding the equation of a line given two points, and converting a line from point-slope form to general form). There is significant repetition: for example, clusters of four or five problems look identical, with only the numbers changed from one to the next. In contrast, the final assessment in the Transitions portfolio requires students (in groups) to collect data, construct graphs, and give an oral presentation. This assessment takes several days to complete; students are assessed on the quality and accuracy of their graphs and on their oral presentations. At the conclusion of this assessment students meet individually with the teacher to discuss their grade.

A third difference concerns the role Mr. Johnson appears to take in conducting classroom activities. In the Transitions class, the teacher seems to view his role as one of a background guide; his actions and his commentary consistently indicate that his goal is to largely remove himself from classroom activity. For example, in one lesson plan, he writes that he plans to “step into the background and let students proceed with their work on their own.” In another lesson, the teacher makes an explicit attempt to redirect questions posed to him back to students (and he subsequently reflects that he was very happy with the results). In general, almost all of Mr. Johnson’s lessons in this portfolio consist of students being given a worksheet or an activity to do in groups; the teacher spends much of each class in the background circulating from group to group and answering students' questions when they arise. In contrast, the teacher portrayed in the Algebra portfolio is much more directly involved in student activity. Almost all lessons in the Algebra portfolio involve the teacher conducting a recitation: standing in front of the classroom, he demonstrates a procedure, asks frequent questions of the class to guide him through his demonstration, and then offers problems that the class should do for practice. Although students are involved in these recitations via the teacher's questions, the teacher is largely controlling the activity and problem-solving that occurs in most classes. The teacher writes that he views the recitation style of instruction as appropriate for the more advanced Algebra class but less so for the low-achieving students of the Transitions class: “Low achievers and behavior disorder students could not stand more than ten minutes of lecture... The style that works best for them is more of an activity based learning.”

Vignette 5: Mrs. Martin

This ELA portfolio-case comparison raises a complex chain of issues: (a) we see the same class at two different points in time engaging in substantially different learning activities; (b) we learn from the case that some practices illustrated in the portfolio were not consistent with the teacher’s practice, were undertaken because the portfolio handbook prompted them, and were not consistent with what she believed were her students’ needs; and (c) this then causes us to question the teacher’s judgments about her students’ capabilities. [The original comparison was prepared by LeeAnn Sutherland.]
The teacher in this portfolio/case comparison works at a school characterized by the case study writer as a “large inner city school.” The students who attend this school are predominantly Hispanic from poor and working class families. For this teacher, Mrs. Martin, both the portfolio and the case study are based on a 9th grade Writing Enrichment course. The course was developed for students in a “transitional program” who are “too old to be in Middle School” at 15-16 years of age, but are “earmarked as an at-risk group.” There are 13 students in the class, 10 of whom are bilingual; 3 are identified as special education students. The portfolio literature exhibit is comprised of one 4-page short story and one poem which is integrated with the writing exhibit. The case study focuses on a drama unit with a two-page play from an adolescents’ literary magazine as the primary text. The case study writer observed two sections of the same writing enrichment course.

The texts used in the literature section of the portfolio were selected to focus on the theme of strong, courageous women. The teacher characterizes her goals as helping students see the connection among these pieces of literature and their own lives, wanting them to “see the potential within themselves.” The lessons focusing on these texts, across more than 10 of the 15 days represented in the portfolio, take students through a series of activities including the completion of several charts focusing on elements of literature such as character, theme, and imagery, a 2-paragraph “mini-essay” describing one of the characters, and a culminating essay in which students write three paragraphs which compare a character from the short story with a character from the poem. The comparison writer notes that the time spent on these brief and straightforward texts seems excessive and that students have little opportunity to develop their own ideas. The teacher’s reflections suggest that she believes students need this level of support to comprehend the story. She indicates that “students have difficulty decoding words” and that each story read in this course begins with an uninterrupted oral reading (by the teacher) that gives the students “the opportunity to hear the story first, get a basic idea of the plot of the story, and minimize the frustration of difficult vocabulary.” She notes that “focusing on a few skills and then building on them, ensures a complete understanding, and more importantly, retention of the lesson. For this group of students, retention is the key.”

In her portfolio, Mrs. Martin provides commentary and videotape of two students as they conference about essays they have written. Each of the students offers observations to the other, and the author is seen to respond to those observations. They discuss thesis statements, paragraphing, use of examples, and proper citation form including line numbering [for the poem]. They also discuss parts of the essay they found difficult to understand. The comparison writer argues that this is one of the better student-student conferences seen in portfolios, as participants are actively engaged in dialogue about writing. While it is not a substantively rich conference, many of the writing conferences seen on videotape are teacher-directed, or the students speak to one another but not with one another.
The two students seem to “get” the idea of how to engage in a writing conference.

However, when the case study researcher asked the teacher a question from the interview protocol, “Did the portfolio involve things that were not part of your teaching practice?” Mrs. Martin responded: “I thought the peer editing and peer responses were phony for me because I don’t do that yet. The kids were not really ready yet.” If the teacher coached the girls on how to talk for the camera, then that raises one set of issues. If she did not, but simply asked them to conference, even though she usually does not have them do so, then their relative success in the conference raises questions about the teacher’s judgment of students’ abilities.

The case study writer saw no writing in response to literature and no process writing during the time of his observations. The only writing he saw involved lists and definitions associated with vocabulary words. Asked whether what the case study writer had observed over the three-day period was “typical of your teaching and your classroom,” Mrs. Martin indicated that “the class is a writing enrichment class and most of our time for the whole year was spent on enrichment.” She stated that previous writing assignments for the course had followed the [state’s] test format and students had also written autobiographical essays. Neither source of evidence provides examples of these types of writing. About literature, Mrs. Martin said that the portfolio requirements, again, did not jibe with her usual practice: “I think having 7-8 hours of literature did not really fit the curriculum I have for these kids.” She told the case study writer that these students are not ready for a novel or for the “7-8 hours of literature” required for the portfolio, that they struggle with decoding, and that students needed to read the screenplay twice in order to get it. The case study writer observed the teaching of the magazine play, and he reports that students’ oral reading over two days was relatively fluent, and though little attention was paid to students’ understanding of the play, students’ verbal comments, a question one student asked, their expressive reading, and other verbal cues indicated that they did, indeed, comprehend this particular text as they were reading. Again, this raises questions for us about the teacher’s judgment of her students’ capabilities and needs, questions that the evidence is insufficient to address.

Vignette 6: Mr. Gere

In this vignette, we encounter a teacher who indicates that he chose to use his portfolio as an opportunity to improve certain areas of his teaching. In the portfolio he presents an Algebra I class, where he reports the students reflected a wide range of abilities and dispositions. The case study writer observed the Algebra I class and a Pre-Calculus honors class. We learn from the case that Mr. Gere had originally decided to focus on the Pre-Calculus honors class for the portfolio but switched to the Algebra I class because he felt the class needed extra attention. In the portfolio, he indicates that he hoped the portfolio would help him focus in on the difficulties he was having and turn them around. [This vignette is based on the comparison originally developed by Pamela Geist.]

While the two portraits of teaching present quite consistent evidence about this teacher’s practice, the interpretive commentary that surrounds them leaves the reader with a substantially different impression about the effectiveness of his teaching. Unlike the situation with Ms. Fleming, where her interpretation highlights the strengths in her practice, Mr.
Gere focuses, it seems relentlessly, on his concerns about his teaching and what his students are learning. The case study, then, presents the teacher in a far more positive light.

The case study writer, acknowledging the predominance of procedural work and the ongoing focus on manipulating expressions and equations, nevertheless develops a picture of a teacher who encourages students to look at underlying ideas and explore some of the logic associated with working the procedures. The case study writer concludes, “In all six of the classes observed, the students’ oral responses, questions, homework, classwork and quizzes indicated that Mr. Gere’s expectations were accessible to most students.” She notes some differences between Pre-Calculus and Algebra: For example, the case study writer reports that in the algebra class the pattern was one of fairly routine mechanics: first distributing with algebraic expressions, then factoring algebraic expressions, and finally solving quadratic equations that were factored and set equal to zero. In the pre-calculus class, although the work appeared to be quite mechanical, there was more problem solving involved because of the number of possibilities when finding equations that fit a set of data points. However, she also notes: “As the material developed over the three days, the students played a bigger role in the dialogue, offering their own strategies for finding the equation from a set of data points. Mr. Gere also used open-ended questions effectively: for example, about a quadratic equation, in standard form, students were asked to give its characteristics, in other words, to tell him what they could about this function. The responses were extensive and showed depth of knowledge about a quadratic.” The case study writer creates an overall image of a fairly successful teacher, one who takes his work seriously, is well-liked and respected by his students, and works hard to create a practice that meets his goals and expectations.

The comparison writer notes the similarities between this representation and what she sees in the artifacts of the portfolio. The portfolio artifacts show a similar continuum of difficulty on daily worksheets, quizzes, and tests. Problems begin with simple equations and progress to more complicated ones. Initial tasks focus on the procedural steps to solve problems and move toward using these steps in context. There usually is one task that requires students to explain an idea or the logic underlying steps. In this sense, both reports show that tasks become progressively more difficult because they require that students know more about the different scenarios represented in algebraic equations and how to manipulate more complex expressions and equations. In large group work, Mr. Gere demonstrates procedures and talks students through his logic of the steps. Mr. Gere asks next-step questions of students and students answer Mr. Gere directly. And he effectively and accurately demonstrates the procedures for students, using appropriate mathematical language and notation to demonstrate how a system is solved, and students practice and memorize the procedures, eventually making them their own process. Mr. Gere promotes student-to-student discourse in the context of small group work and pairing students together to complete a task. The video evidence illuminates that for the most part students work productively in pairs and small groups explaining to each other how to proceed with a task and compare procedures and answers with each other.

And yet, Mr. Gere’s reflective commentary on his practice paints an entirely different image of his success.
For instance, talking about the difficulties he faced in facilitating discussion, he writes, “I regretted not soliciting a variety of problem solving methods for this exercise and again bypassed potentially rich mathematical discussion in the interests of time. The decomposition of the problem’s solution into discrete steps was worthwhile and helpful, but again lost something due to the more directed discussion that resulted from my sense of time pressure.” He notes further: “My responses to students’ questions also reflect my impatience such as the response to student A when I don’t even let him finish his question before answering. I give him a perfectly accurate and reasonable answer, but the tone of impatience is more damaging in other areas. Another student question is similar in outcome. I quickly give an accurate concise answer to his question but would have benefited the other students with the same misunderstanding by instead redirecting his question to a few of the weaker students to make sure their understanding was solid.” He worries, “I have begun to recognize that I have slowly adopted more and more of the students’ inclination to ‘just let me see how to do the problem so I can stop thinking.’” In fact, he describes what he perceives to be an ongoing decline in students’ efforts to succeed in his Algebra I class: “I know that the effort level has declined precipitously over the past 1 1/2 months in this class, and I worry that I am enabling the very destructive tendencies that are plaguing this class.” Thus, there is a running theme across his reflections, one that details the frustrations and disappointments of not being able to change students’ attitude.


The image he creates in the portfolio is of a teacher struggling with changing his teaching and, at times, there is a sense of hopelessness. Because there is little offered in the portfolio in the way of a rich analysis for how he intends to turn this pattern around, the portfolio writing produces the image of a teacher who sees himself as mostly ineffective and struggling with supporting richer opportunities in the discourse and at the same time offers few ideas for how to change current patterns. In effect, the case study report paints a much more positive image of the discourse patterns, indicating that procedural goals are getting met through the patterns of discourse and at times, especially in Pre-calculus, the discourse supports a deeper and richer investigation into the mathematical ideas. While the comparison writer, perhaps cued by the image in the case study, was able to read behind Mr. Gere’s commentary, this portfolio (which was also used in another study involving multiple readers) elicits quite different reactions depending on the weight the reader gives the teacher’s negative commentary about himself. Whether this is a difference that would matter depends on whether or not the portfolio readers are willing and able to read behind the teacher’s commentary.

Vignette 7: Mrs. Jacobson

In one sense, this portfolio/portfolio comparison provides another example of a teacher whose practices look different in classes she characterizes as comprised of students with different ability levels. In this case, we observe differences in the teacher's demeanor and attitude toward students in the two classes. We also note that her expectations, her explanations for her choices, and her reflections on students' performance in the second portfolio (unlike the primary portfolio) are sometimes framed in terms of cultural and linguistic differences.
[The primary comparison was prepared b y Laura Haniford, drawing on documents from Steve Koziol, Leah Kirell, and Su zanne Knight.] This teacher, Mrs. Jacobson, submitted portfolios f or two 7th grade classes that are “theoretically heterogeneous” but that are actu ally grouped, as she reports, based upon the scheduling of math classes. There ar e 26 students in the first class and 28 students in the second class; there ap pear to be 2-3 students of color in the first class while students of color ar e a majority in the second class. The teacher is white. The base text in both classes is a trade novel set during World War II and is part of her department’s prescr ibed curriculum. Mrs. Jacobson characterizes her students in the fir st class as “ bright and fun ” and states that her expectations for students readi ng this novel are that they “ learn the historical and cultural ramifications of World War II. I intended that students examine the personal struggle of the innoc ent civilians victimized during the war and the incredible strength and cour age of the survivors. ” She also states that this particular selection exposed the students to diverse perspectives “ other than the black/white issue which is pervasive at this school. ” In contrast, Mrs. Jacobson characterizes the studen ts in the second class as “ behavior problems ” and her expectations for them are different. She states that she would not have chosen this book for them and th at “ the majority of these students cannot--or will not--read it and understan d it. These students are intensely committed to being Black or Hispanic and did not relate to the Holocaust…. They love violence and injustice—most k ids their age do…. [But]


This was far too sanitized for them.” Mrs. Jacobson also states that she is more concerned that the students in the second class learn the history of WWII as opposed to understanding any elements of plot or character.

The teacher begins each class with a daily oral language (DOL) experience. In class one, this takes many forms: open-ended questions about literary terms followed by a discussion of some examples from the novel and from students' own experience; brief comprehension questions on the reading; a vocabulary exercise where extra credit is given for making the teacher laugh; a brief review of grammatical terms. In class two, the DOL is consistently a recitation/review of questions on the assigned reading, with answers given “swiftly” and written answers handed in at the end of the week. Commenting on an interaction in class one, she writes, “In the future, I might take a hint from this class and compare movies they may have seen with the books we’re reading. I always try to relate what we’re doing to their own lives, but they like to talk about movies.” Of class two, she writes, “With this group, I have to lead them with a strong hand, although I try very hard not to tell them what they ‘should’ say: I want to hear what they want to say, even if it is immature or downright silly.”

In both classes, students’ written response to literature is related to preparing a five paragraph theme and to addressing the state’s criteria for a persuasive essay. Beyond this, her stated goals for the first class include learning to appeal to all five senses, to write great opening lines, and to engage readers; for the second class, her goal is getting students to “write something--anything.” In the first class, the primary assignment asks students to take a position on whether they would take in an escaped prisoner of war who came to their home seeking refuge.
In the second class, the same primary assignment is given, together with an alternative prompt related to a reading on the Civil Rights movement, because “They identified with the black students.”

In class one, the primary assignment is grouped with two others, a personal time narrative and a group diary designed to personalize the story for students; there are no surrounding assignments in class two, and students in the second class are not given the opportunities to work with one another that the students in class one are. Samples of writing from each class suggest that students understood the demands of the assignment and could respond to it appropriately. Commenting on her concerns about a student’s writing in the first class, she says: He “does not create an effective visual in the opening paragraph. Also…[he] does not respond to the opposition in his fourth paragraph. He only states that there is another side.” Of one student in the second class, she writes: “[He] did not follow the guidelines, either. This child’s family speaks Spanish in the home, and he had made great improvements since September. He has learned to skip the lines to make paragraphs, and he is writing sentences, rather than one long sentence.”

The guidelines for the state's writing tests are the focus of formal writing instruction in both classes and of video segments on writing. In the first class, Mrs. Jacobson’s introduction is brief, mainly an overview, and students have an opportunity to look at some samples and to begin working in a writing workshop format on their own essays: students read aloud some of their drafts, they work together in peer editing, the teacher guides the critique of samples,


drawing from the student samples to deal with topics in language use (e.g., using over-used words), and she confers with individual students. On the video, the teacher circulates around the room, talking with individual students about their work. Overall, the interactions appear positive and supportive.

In class two, the teacher guides students through a series of questions about the state’s writing test guidelines, seeking responses about what is to go into each paragraph and elaborating on student responses. She moves to a whole class example, on the topic of what if there were no teachers, which begins to generate student responses, although the teacher appears (on video, as well as in her comments) to be frustrated that the students don’t seem to understand how to give reasons for the “other side,” which she says is required by the state’s guidelines. There is no small group work: students are in whole class activity or working on their own; when they are writing, the teacher circulates and has occasional interactions with students.

Based on our observation of the video, the teacher's management in class one appears to be smooth; students move from one type of activity to another and from one arrangement to another with little disruption; the teacher comments that this group is especially active and noisy, although that doesn’t appear to be evident from the tapes. In the second class, management issues dominate more of the dialogue. The teacher writes: “running a discussion with this group is like walking through hip-deep jello. With every remark comes ambient noise and chatter which drowns it out and everything has to be repeated. In fact, as I watched this segment, I was bored just listening to myself repeat the instructions more than 10 or 20 times. Virtually nothing got accomplished.” In addition to this, Mrs.
Jacobson has several extended disagreements with individual students in the second class that are conducted in front of the entire class.

Mrs. Jacobson’s reflection about her teaching and about students’ learning is not detailed or extensive. With class one, she notes that some of her assignments were too vague and that she was unprepared for how capable her students were, something she would better prepare for in the future. She thinks she will add some drama in the future, because the group would have done well with this kind of reading and activity. With class two, she notes that she was not particularly effective as a teacher, but attributes this primarily to being required to teach an inappropriate text and having to follow a district mandate that doesn’t fit the students. She notes, “Sadly, any of these students’ real problem is behavior. If they would listen, if they wanted to produce, they could. Peer pressure and stress at home makes it nearly impossible for them to succeed. Patience and in-class time to do their work does increase their chances of doing acceptable work.”

Additional Differences

Substantive differences existed in all the comparisons, as we would expect in any dynamic teaching situation. It’s important to note, however, that in the majority of cases examined in math and ELA, we found that the second data source elaborated but did not overturn our general impression of the quality of the teacher’s performance with respect to the relevant criteria. In some cases,


we simply saw the same practices instantiated in a different content; in some cases we saw somewhat different practices that, taken together, presented a coherent portrait across the two (e.g., Ms. Bertram) or clarified an ambiguity in the original portfolio (e.g., Ms. Fleming); in some cases we saw differences similar to those we represented here but not so substantial as to overturn our judgment of the primary portrait. Portfolios that contained inconsistent evidence--in which the artifacts did not fully support the teacher’s descriptions, as with Ms. Fleming--complicate the question of portfolio generalizability with the problem of interpreting the initial portrait. [We discuss issues of portfolio evaluation elsewhere (Schutz and Moss, in press).] In some cases, we learned things about the teacher in the case, which may not have been relevant to the criteria, but which shaped our judgment of the teacher. For instance, in one case (a likely “conditional” score), the case study researcher observed multiple situations of conflict, at least one potentially violent, during and outside of class that the teacher skillfully resolved. In fact, we often learned about the teacher’s relationship to, rapport with, and work with students outside of class. We also learned about numerous factors that influenced the teachers’ performances that would not likely be mentioned in the portfolio or illuminated by the criteria if they were: the presence or lack of a coherent curriculum and/or text that is consistent with the standards; the presence or lack of a supportive mentor in the teacher’s subject area; large differences in professional development opportunities and opportunities for collaborative work with colleagues; differences in resources available to prepare the portfolio (including release time and access to video equipment for multiple days).
Of course, whether and how these factors influencing a teacher’s performance should or even could be fairly taken into account in this assessment is an open question.

Conclusions

As we indicated in the introduction, our goal is to use these comparisons to illuminate issues for assessment developers to consider in designing assessment systems. Consequently, our analysis was disconfirmatory: It was not intended to document consistency but rather to highlight the kinds of differences that can occur across different representations of a teacher’s practice and that point to potential problems with implicit assumptions about generalizability. We want to caution readers against drawing conclusions about the typicality of our comparisons. There is no way to know how these volunteers--teachers who were willing to complete a second portfolio or to allow an observer into their classrooms for 3-5 days--might differ from the larger population of beginning teachers. However, the dilemmas we have found--which would not be illuminated in data that are routinely collected--highlight important issues for educators, assessment developers, psychometricians, and policy makers to consider.

We begin our conclusions with a review of the kinds of differences that seem likely to matter (that is, likely to result in different performance levels) in terms of the relevant criteria. Then we return to the useful concepts of generalizability with which the study was framed. What is the assessment domain (or “universe”) to which we can safely generalize? What is the (larger) outcome domain about which we can reasonably draw inferences supported with logical arguments and intermittent empirical studies? How consistent are these


domains with the domain implied in the decision about licensure? We close with some more speculative thoughts about the nature of assessment systems (and theoretical resources) that might support well warranted decisions about teaching performance.

What Have We Learned about the Generalizability of Teaching Portfolios?

The comparisons in this study begin to address an important gap in our understanding of the generalizability of portfolio assessments of teaching and, perhaps, of performance assessments of teaching more generally. Taken together, these vignettes raise a number of concerns, some of which relate directly to the topic of generalizability and some of which spill over into concerns about validity and ethics.

In the small set of comparisons we’ve examined here, it is very clear that context matters. We’ve shown differences in performance across classes that differ in (perceived and/or institutionally designated) ability level of students, in subject matter taught, and in cultural background of students. For instance, in the case of Mr. Johnson, we saw differences in performance across two subject matter domains: statistics and linear equations in algebra. Perhaps it is easier for novice teachers to develop “rich and challenging” tasks that foster “connections” and “reflect students’ interests, styles and experiences” in some domains than in others. We've presented two clear examples of differences in performance across classes that differ in perceived ability level (Mr. Richards and Mrs. Jacobson). And we found other cases (not described here) in which the differences were apparent but far more subtle (as might be seen in the difference between the Pre-Calculus and Algebra classes of Mr. Gere). In Mrs. Jacobson’s case, perceived differences in ability were coupled with differences in the cultural background of her students.
The differences in performance here are more troubling because of the teacher's apparent attitude toward the students and tendency to seek explanations of their performance outside her practice, in district requirements and in their perceived needs as members of different cultural groups. The rubric has no place for descriptions of teachers' expectations and, indeed, if it did, it would be easy to coach a teacher to eliminate problematic language from her text.

That context matters will come as no surprise to those who study classroom teaching or performance assessment. There are complex and dynamic relationships among teachers’ social backgrounds and experiences; their expectations, values, and beliefs; their classroom practices; their students’ (inter)actions; and the larger social and institutional structures in which they live and work (Gallego, Cole, and the Laboratory of Human Cognition, 2002; Knapp and Woolverton, 1995; McLaughlin and Little, 1993; McLaughlin, Talbert, and Bascia, 1990; McNeill, 1983; Stodolsky and Grossman, 1995). Research in performance assessment more generally, with tasks that are far narrower in scope than those represented in teaching portfolios, shows us that different people perform differently on different tasks (the person x task interaction, in terms of generalizability theory), which necessarily confounds the construct of interest with variations in the context in which it is performed. A recent review of approaches to performance assessment in health professions (Swanson et al., 1995) leads to similar conclusions about the difficulty of generalizing across the


contexts presented by different tasks. “Regardless of the assessment method used, performance in one context (typically, a patient case) does not predict performance in other contexts very well” (Swanson, Norman, and Linn, 1995, p. 8). The social context of a classroom seems even more complex than that of a health professional-patient relationship. While both are certainly equally embedded in societal and institutional structures, the classroom involves dynamic relationships among as many as 30 – 35 individuals, each with their own cultural/personal backgrounds that vary in ways we can’t predict. Gallego and colleagues (2002) argued that “every continuing social group develops a culture and a body of social relations that are peculiar and common to its members…. Hence,… we can expect that every classroom will develop its own variant” (p. 992).

Two recent reviews of assessments of teaching (NRC, 2001b; Porter, Youngs, and Odden, 2003) both raised concerns about the lack of evidence of teaching performance across differing classroom contexts, and our observations support those concerns. It is hard to imagine, however, how a single assessment program could adequately (and fairly) address those concerns. One could ask for samples of teachers' performance in different classroom contexts, as we tried to do, and yet the variations available within teachers’ yearly class loads vary quite substantially from teacher to teacher, and all are considerably narrower than the range of classes and school contexts in which they are licensed to teach. (Note 15) One could imagine other kinds of assessments in which teachers are presented with cases from a range of classroom contexts, and this might provide some relevant evidence; however, asking teachers to plan or evaluate activities for students with whom they have little experience would raise other kinds of validity questions.
And the experience in health-related professions with these sorts of simulations suggests that questions of generalizability are likely to remain. There are no straightforward solutions.

The case of Ms. Martin raises a second issue directly relevant to generalizability. Here we find a teacher who perceives that she is in the position of being required to show evidence of a performance that is outside of her routine teaching practice. Does that suggest the portfolio guidelines were too directive or restrictive? Experience with National Board assessments has led developers to conclude that it is important for teachers to understand what is valued in the assessment; being explicit about expectations, within the bounds of construct relevance, is considered important for validity and fairness (Pearlman, in press a, b), and INTASC has emulated their practice. Clearly, portfolio assessments of this sort do not support conclusions about what is typical. What we learn with a “passing” portfolio is whether a teacher and a group of her students can engage in a particular kind of practice and reflection in at least one instance. Teachers may, of course, make choices that are not in their best interests, as was the case with Mr. Gere, who chose a class with which he was struggling and then emphasized his shortcomings. While this is commendable and productive for a professional development activity, it is less than strategic for a high-stakes assessment. Careful instructions to candidates, and examples of successful portfolios, will be important in helping teachers demonstrate the strength of their practice with respect to the standards. It is important to recognize, however, that not all candidates will have


commensurate opportunities to illustrate their practice. Assessors should try to make sure that teachers have the human and material resources they need, including adequate time, access to competent mentors, and access to audiovisual services. Of course, assessors cannot control teachers’ work assignments or the schools in which they work. We have to recognize that these factors influence the extent to which teachers can demonstrate a performance consistent with higher scores and design a system that is appropriately skeptical of the validity of its conclusions about individual teachers.

Not surprisingly, differences across methods used here also played a role, with different methods being more or less adequate in providing evidence relevant to different criteria (as we discussed above under structural differences). The portfolio typically offered the teachers in our comparison a better opportunity to explain their choices and reflect thoughtfully on their teaching (although with a skillful interviewer, one could imagine the opposite for some teachers who are uncomfortable with writing); the case study provided more evidence (six full classes vs. brief videotaped segments from two featured lessons) that allowed stronger inferences about the pattern of discourse in class. Of course, either method could be revised to better address these concerns. If we want to draw conclusions about patterns of classroom discourse, having access to two lessons may be insufficient, especially if they do not support the written description in the portfolio. Clearly, more research about this would be most beneficial.

While criteria were not varied in this study (and are typically considered fixed), there are clearly many different ways to instantiate the INTASC principles in specific criteria tied to available evidence.
Consider, for instance, the following two principles taken from the ten INTASC (1992) principles on which the subject specific standards are based:

Principle #2: The teacher understands how children learn and develop, and can provide learning opportunities that support their intellectual, social and personal development.

Principle #5: The teacher uses an understanding of individual and group motivation and behavior to create a learning environment that encourages positive social interaction, active engagement in learning, and self-motivation. (p. 16)

The portfolio assessment situates evidence and criteria relevant to these principles within particular subject matter contexts where particular approaches to learning are privileged. Alternatively, assessment developers could, as PRAXIS III assessments do, frame criteria and evidence more generally. Consider the following criteria drawn from PRAXIS III Domain B:

B1: Creating a climate that promotes fairness
B2: Establishing and maintaining rapport with students
B3: Communicating challenging learning expectations to each student


B4: Establishing and maintaining consistent standards of classroom behavior
B5: Making the physical environment as safe and conducive to learning as possible (Dwyer, 1998, pp. 21-22)

When we have asked INTASC portfolio readers what they would like to attend to that isn’t addressed in the rubric, among the issues that repeatedly arise are teachers’ relationships with their students and their classroom management. If these criteria are given some or substantial weight in a compensatory assessment, a number of teachers are likely to improve their scores. In fact, we saw one teacher, working in a large urban school, whose ELA performance tended to emphasize form over meaning and single correct interpretations--consistent with the lowest performance level--and yet the case study researcher saw multiple examples of the teacher handling potentially violent conflict among students in ways that successfully defused the incident. Moreover, the teacher’s reflections on this incident were insightful; he commented, for example, on how he learned who he could touch in violent situations. This observation is not a criticism of the INTASC criteria and standards. As Pearlman (in press) points out, every assessment system has to decide what it values and then make those values clear to candidates. Contra Brennan (2001), treating rubrics as fixed seems the reasonable choice, although it’s important to acknowledge that changes in the rubric will likely result in some changes in who passes and fails (Moss and Schutz, 1999, 2001).

While it is beyond the scope of this paper to address, our comparisons also raise issues relevant to portfolio readers’ evaluation of the specific evidence contained in the portfolio. As the case of Ms. Fleming illustrates, portfolios sometimes contain conflicting evidence, especially artifacts that do not fully support the written representation.
As we argue elsewhere (Schutz and Moss, in press), portfolios like these can support different interpretations depending on how readers choose to weigh the evidence. Based on recorded dialogue and extended interviews, we show how readers who clearly value the same criteria and describe the same evidence nevertheless construct a different story about the teacher’s practice given the evidence in the portfolio. How should an assessment system address ambiguous evidence like this? We return to this issue as we discuss implications (and in Schutz and Moss, in press).

The meaning of central terms like “discussion,” “problem solving,” “inquiry,” “reasoning,” and so on is also at issue. If teachers and readers hold different meanings for these terms--attach them to different actions/interactions--then this can affect readers’ understanding of classroom practice (for which there may be no accompanying evidence) and/or readers’ evaluation of the accuracy of teachers’ reflections. Further, the issue of slant in the representation of performance cannot be ignored. Comparisons among Mr. Gere’s reflections and those of the case study and comparison writer illustrate the problem, as does a comparison between Mr. Gere’s and Ms. Fleming’s reflections on their practice. Mr. Gere must depend on the tenacity of portfolio readers in finding the teaching performance behind the negative slant of the reflection. While portfolio readers can likely be trained to score situations like these reliably, especially with


analytic rubrics that might leave the weighting and combining of evidence to the predetermined algorithm, it does not make the ambiguity go away; rather, it simply masks the problem behind consistent scores that are unlikely to be challenged with routine procedures.

Many of the vignettes raise an important issue that spills over the bounds of generalizability: how to evaluate the appropriateness of a teacher's practices--the extent to which they are "challenging" for students--based upon the evidence contained in the portfolio. Portfolio readers' understanding of students' interests, needs, and capabilities depends exclusively on the teachers' characterizations and the few artifacts contained in the portfolio (plus readers' own mostly unarticulated experience with similar students). How does a reader decide whether a teacher's practices are appropriate for her students or reflective of inappropriately low expectations, perhaps resulting from ability groupings imposed by the school? One answer to this question, often heard in committee meetings, is that teachers' practices should be evaluated in light of their justification of their choices. However, in our experience, writing evidence-based justifications or reflections is difficult for many beginning teachers. If the quality of the reflection is more crucial to evaluating particular kinds of performances, then some teachers may be differentially disadvantaged. This leads to the disturbing question of whether it is easier for a teacher to receive a higher score with some classes of students than with others. We have seen a number of examples in which portfolio readers’ beliefs about the appropriateness of different activities lead to different scores on the same portfolio. Decisions about appropriateness are often underdetermined by the evidence in the portfolio. One could imagine guidelines that ask the teacher to provide more evidence of students’ capabilities.
In fact, one review panel, noting a similar problem, asked for evidence of students' work from more students and over a much longer period of time. While this appears to be an appealing solution, it risks making the portfolio unmanageable for beginning teachers, who almost invariably report on the time-consuming nature of the task. Again, there are no straightforward solutions to the problem.

What Can We Conclude about the Generalizability of Teaching Portfolio Assessments?

How might the issues we’ve raised be addressed within the bounds of the theoretical resources on generalizability that framed the paper? Returning to the questions with which we began this section: What is the assessment domain (or “universe”) to which we can safely generalize? What is the (larger) outcome domain about which we can reasonably draw inferences supported with logical arguments and intermittent empirical studies? How consistent are these domains with the domain implied in the decision about licensure? Our answers are speculative, based on the evidence provided here and the existing literature on performance assessment of teaching.

What is the assessment domain (or “universe”) to which we can safely generalize?

Readers can certainly be trained to achieve sufficiently reliable scores for


individual decisions, as the National Board continues to demonstrate. Our experience suggests the importance and feasibility of preparing readers to forthrightly acknowledge problems of ambiguity in the portfolio evidence. If we were rewriting the guiding questions and scoring criteria as a result of this experience, we would structure them to encourage careful triangulation across different sources of evidence: first, so that higher level inferences could be explicitly built up from more descriptive inferences (as we illustrated in our comparison methodology) and second, to ferret out disjunctions in the evidence that might call inferences about performance levels into question. Assuming that this is already enacted informally as part of readers’ training, it could be further supported by the formal procedures and documentation through which they record their evaluations. The portfolios so identified could be sent for additional review and possibly for additional evidence from the beginning teacher. While most large-scale assessment systems are set up to deal with unscorable responses, responses on which readers disagree, or responses that are flagged as atypical in some way (Wilson and Case, 1997; Engelhard, 1994, 2002), we imagine that what we are suggesting might well lead to a larger proportion of responses being identified as needing additional attention, which will, in turn, increase the cost of the system. Having additional evidence of patterns of classroom discourse and of students’ learning would be useful and might reduce the number of portfolios that need additional review and/or evidence. Of course, this would increase the time it takes to evaluate each portfolio. It might, however, be possible to develop a multi-stage evaluation system that only examines the additional evidence (beyond the featured lessons, for instance) when questions arise.
That said, it is important to note that with the portfolio evidence alone, we have no idea what additional factors enabled or constrained the performance, and which of those we would consider within and outside the construct-relevant bounds of an appropriate resource. If the portfolio asked for such information, it is not clear how it could be corroborated or fairly taken into account.

If we know that a teacher and her students can demonstrate certain kinds of performances on at least one occasion, how far beyond inferences about the particular portfolio might a well warranted assessment domain expand?

There was nothing in our evidence that suggests it would not be possible to include in the assessment domain what a teacher can do in this class (not just on this occasion) and possibly, what a teacher can do in other classes (perceived as) very much like this one. For those comparisons that involved the same class at different points in time or very similar classes, the only differences we found that seemed to matter could be explained by ambiguous evidence (e.g., disjunction between written portfolio and video) or by knowledge that the teachers felt obliged to demonstrate activities that were not part of their routine practice. By very similar we mean classes that cover the same content and in which teachers’ expectations about the students’ capabilities also appeared to be the same. We are careful here to limit the inference to what they can do (and whether that is replicable), not to what they typically do. However, research on classroom culture, which suggests it is always dynamic and at least partially unique (e.g., Gallego et al., 2002), does raise red flags about even this assumption that should be empirically investigated. Thus, additional research that checks on these assumptions, perhaps with smaller teaching exhibits, or with interview/observation cycles like those of PRAXIS III, would be important. If


feasible, more extended observations would be useful. The advantage of an INTASC-like assessment and our more extended case studies is that we see how a unit unfolds over a series of lessons. Evidence supporting the generalizability of what a teacher can do may well need to include such a series of lessons. Thus, we speculate it is possible to build a logical argument, that should be buttressed with periodic empirical studies, of the extent to which we can generalize to an assessment domain that includes classes, subject matter, and students like these. It may be necessary to build multiple assessment opportunities into the assessment system itself to flag candidates whose performance is not consistent. Whether these differences should be treated at the first (reliability) or second (transfer) level of generalization described above is an open empirical question and then a matter of judgment.

Beyond this, our evidence, taken together with the general lack of positive evidence that might overturn it, suggests that we cannot extend the well-warranted assessment domain to different classes within a teacher’s regular teaching assignments. Many of the differences we’ve encountered suggest variations that may only be supportable at the second level of generalizability, as a matter of transfer, if there Our limited, case-based qualitative evidence certainly supports concerns that are raised about score generalizability that takes task sampling (which always involves some differences in context) into account and, indeed, suggests that the problems with portfolio assessments of teaching may be much worse because the variations in context are so complex. We simply do not know how a teacher working with an honors class might perform with a class designated as remedial or how a teacher working with a statistics class might perform if the content were substantially different.
Here additional research is very much needed--not only to help appropriately limit inferences about teaching performance but also to understand what changes in a teacher’s work context should involve additional professional support. Clearly, the domain implied in the decision to license a teacher, typically constrained only by general subject (or subjects for elementary teachers) and grade or age levels, is greater than can possibly be empirically examined and may, at best, involve weak assumptions and negative arguments (not yet disconfirmed) of the sort that Kane and colleagues (1999) describe.

Given the limited assessment domain our study suggests is likely supportable, is it worth mounting a portfolio assessment of teaching? We do believe that the answer is still yes: If, given adequate time and resources, a teacher is not able to demonstrate in one instance a passing performance, then it makes sense to require additional opportunities for professional development and further demonstration of competence before granting a regular license. This is information about teaching performance that would not be available to a state using only written tests. But is there more we can expect from a performance assessment system?

Returning to First Principles

How does a state education authority, charged with ensuring the competence of the teaching force, undertake that task in a well warranted way? A recent NRC report on Testing Teacher Candidates raised concerns about licensure


decisions based only on tests of basic skills and content knowledge (which, themselves, the report notes, have only limited validity evidence) (Note 16) (NRC, 2001b). The authors of the report called, in addition, for:

Research and development of broad-based indicators of teacher competence, not limited to test-based evidence, should be undertaken; indicators should include assessments of teaching performance in the classroom, of candidates' ability to work effectively with students with diverse learning needs and cultural backgrounds and in a variety of settings, and of competencies that more directly relate to student learning. (p. 172)

Given our conclusions from the previous section and the existing literature on performance assessment, this is a tall order. How can information about these sorts of performances be reasonably taken into account by distant users? When distant users have access to a classroom portfolio like the one we studied here, they certainly know something more about a teacher’s practice than they did before. The question is how to use that information or rather what kind of system can feasibly be developed to support a valid and ethical use of that information. If we theorize this problem within the bounds of conventional approaches to generalizability, then our choices for how to improve the assessment system are limited. As Mislevy and colleagues (2002) noted, “compromises in theory and methods … result when we have to gather data to meet the constraints of specific models” (p. 49; see also NRC, 2001a): “When unexamined standard operating procedures fall short it is often worth the effort to return to first principles” (Mislevy et al., 2003, mss. p. 57).
In addressing this issue, it is important to illuminate the distinction between (a) warranting the validity of the interpretation of a score across individuals with the same score (as psychometrics is positioned to do) and (b) warranting the validity of a consequential decision about an individual (which may be informed by a valid score but typically relies on other/additional kinds of evidence and judgments). That these two sorts of warrant can be different is not a radical suggestion, even within the discourse of educational and psychological measurement. As the testing Standards assert: "In educational settings, a decision or characterization that will have major impact on a student should not be made on the basis of a single test score. Other relevant information should be taken into account if it will enhance the overall validity of the decision" (p. 146). (Note 17) Citing the example of identifying students with special needs, the authors of the Standards note: "It is important, that in addition to test scores, other relevant information (e.g., school record, classroom observation, parent report) is taken into account by the professionals making the decision" (p. 147). And yet, psychometrics has little advice to offer about how to combine such evidence into a well warranted interpretation or decision.

We close, then, by pointing in two somewhat different (and yet potentially complementary) directions for returning to first principles to enhance the validity of high-stakes assessment of teaching competence: (a) toward the flexible use of probability based reasoning illustrated in work of Mislevy, Wilson, and their colleagues (Mislevy et al., 2002, 2003; NRC, 2001a; Wilson and Sloane, 2000; Wilson, 1994) and (b) toward enhancing the capability of local education


authorities, in dialogue with the state, to make well warranted and credible recommendations about individual teachers.

Mislevy, Wilson and colleagues argued that fundamental concepts like validity, reliability, and fairness, are broader than any particular set of methods for addressing them: while “familiar formulas and procedures from test theory” work well with “familiar forms of assessment,” (p. 1) they risk constraining new forms of assessment that respond to new developments in our understanding of how people learn. For instance, citing Brennan’s association of reliability with replication (2001), they noted that “it is less straightforward to know just what repeating the measurement procedure means if the procedure has several steps that could each be done differently … or if some of the steps can’t be repeated at all (if a person learns something by working through a task, a second attempt isn’t measuring the same level of knowledge)” (p. 17) (issues Brennan acknowledges). They offered a more general characterization of reliability as “the evidentiary value that a given … body of data would provide for a claim--more specifically, the amount of information for revising belief about an inference” (p. 33).

They cited a number of alternative theoretical models within and beyond psychometrics which, taken together, enhance our capability to model real world situations in reasoning from evidence to inference. (Note 18) They noted further that each of these might be considered a special case of a more encompassing approach to probability based reasoning that would allow mixing of existing models and development of new ones.
(Note 19) Given the sorts of examples they provided, we imagine that models like these would enable distant users to combine portfolio based evidence with other evidence available about the teacher, including routinely collected evidence from existing tests, and briefer embedded assessments that might be collected during a teacher’s preservice and induction years. While none of these could be considered interchangeable or random samples from the same assessment domain (as common approaches to generalizability idealize), each of them would help in decreasing uncertainty about a teacher’s competence (or accomplishment). Indeed, a consortium of teacher education institutions in California (Performance Assessment for California Teachers [PACT]) is exploring the use of preservice embedded assessments and induction-year teaching exhibits (similar to the INTASC portfolio exhibits) as an alternative to the state’s less-contextualized assessment. Mislevy, speaking about assessment of students’ opportunity to learn, envisioned that models can be developed which simultaneously take into account important features of the context in which the assessment occurs (personal communication, 12/28/02). The goal in addressing the qualities of reliability and validity is to increase “the fidelity of probability-based models to real-world situations” (p. 49). While it is beyond the scope of this paper and our collective expertise to proceed much further down this road, we point readers in the direction of these scholars’ work to highlight the possibility of developing more flexible models with large-scale centralized forms of assessment.

An alternative (and we argue, complementary) direction for returning to first principles involves enhancing the capability of local authorities (districts, teacher education institutions) to make well warranted (and audited or auditable) recommendations to the state about the readiness of individual teachers to


receive a regular license. Portfolio judgments could be combined with other relevant sorts of evidence only routinely available in the local context. This approach suggests different roles for state and local agencies, and a different use for the portfolio at the state level than at the local level. At the state level, the goal would be to audit the practices and judgments about individual candidates that are made at the local level.

To warrant those decisions, we need to move beyond psychometrics or (frequentistic) probability-based reasoning (Note 20) and look to other epistemological/ethical resources (for instance, in anthropology, hermeneutic philosophy, political philosophy and ethics, and the law). The senior author has turned to hermeneutic philosophy--for reasoning from evidence to inference--as a means of warranting knowledge claims and ethical decisions (Note 21) (Moss, 1994, 1996, 1998, in press; Moss and Schutz, 2001; Moss, Schutz, and Collins, 1998). Practices for developing interpretations across disparate sources of evidence and controlling readers’ biases can be found in any number of “qualitative” methods texts (e.g., Erickson, 1986).

Of course, when portfolios are constructed and evaluated within a local educational community, a somewhat different set of threats (and benefits) to validity becomes salient. On the positive side, when candidates have multiple opportunities to demonstrate their learning, capabilities, or accomplishments, the stakes for any one assessment decision are reduced. This is particularly true when support for professional development is provided in between assessment episodes. Similarly, when assessors can seek additional information to help in explaining the observed performance, as is true in many local contexts, the burden placed on interpreting the portfolio evidence is reduced.
On the negative side, an ongoing relationship between the candidate and the readers can detract from validity by allowing potentially irrelevant knowledge and commitments to be brought to bear on the conclusions about the candidate's performance. It will be important to bring in outside perspectives to the evaluation process, so that potentially disabling biases of readers familiar with the candidate and context (whether favorable or unfavorable to the candidate) can be illuminated and self-consciously considered. Readers designated by the state could be invited to participate in the process in a variety of ways--as members of the initial portfolio review team; as auditors of the decision, written documentation, and supporting evidence produced by the team; or as independent reviewers who consider the portfolio based evidence with no knowledge of the outcome. Although consistency in the conclusions of inside and outside reviewers will enhance the validity of the decision, high levels of consistency may be unlikely because of the differing perspectives and knowledge that the different readers bring. Thus, it becomes the state’s role to audit and warrant the process at the local level, not necessarily the individual decision. Here, the role of the outsider is to illuminate taken-for-granted practices and perspectives and make them available for critical review by members of the interpretive community so that they may be self-consciously affirmed or revised. In that way, the interpretive community continues to evolve in its ability to make sound judgments. Alverno College (NRC, 2001b; Zeichner, 2000) provides one well-documented model of a local system set up to support professional development and warrant high stakes decisions. Descriptions of other examples of local decision making processes can be found in Porter,


Youngs and Odden (2003), NRC (2001b), and Lyons (1998).

We believe both of these approaches--one focused on the warrant for centralized decisions and the other on the warrant for local decisions--might enhance the way in which evidence of teaching performance can be taken into account in licensure decisions. Each has advantages and disadvantages, resolving some validity problems and raising others. Whichever approach is privileged in a given educational context (that is, whichever approach results in the decision that “counts”), the other approach can (and should) provide an important check on or challenge to the validity of those decisions.

In closing, we concur with the National Research Council's (2001b) call for a wider range of assessment practices than is typically gathered at the state level, including evidence of teaching performance. The work of the National Board and of INTASC and Connecticut suggests that portfolios represent one feasible means for obtaining information about teachers’ classroom performance. More research is needed, however, to unearth potential problems with portfolio or other performance-based interpretations and to provoke debate about solutions. Further, it is important to note that the dilemmas we have raised do not simply reflect technical problems. And the solutions, we believe, are not within the bounds of a single assessment program. Rather, teacher education institutions and the schools and districts within which teachers work must work together to support beginning teachers, especially as they move into new contexts, and to ensure that they are ready to provide a productive learning environment for all of their students.

Table 1
Participating ELA Teachers* and Their Classes (with achievement levels as characterized by teachers)

Portfolio/Case Comparisons
Teacher | Primary Portfolio Class(es) | Non-Portfolio Class(es)
Ms. Bertram | 6th grade language arts (“heterogeneous”) | 6th grade language arts (“heterogeneous”)
Mrs. Carson | 8th grade English (“advanced” “best math students”) | 8th grade English (“best math students”)
Mr. Koehler | 8th grade ELA (“technically heterogeneous” but mostly “upper level” due to math track) | 8th grade ELA (“heterogeneous”)
Mrs. Martin | 9th grade writing enrichment | 9th grade writing enrichment
Mr. Richards | 9th Grade honors | 9th Grade third level (“content is watered down”)
Mr. Roberts | College English 9 (“average”) | College English 9 (“average”)
Mr. Roosevelt | 12th grade Humanities II honors and 10th grade “general” | 10th grade honors
Mr. Turner | 11th grade English III (“college level”) | 9th grade honors

Portfolio/Portfolio Comparisons
Teacher | Primary Portfolio Class | Secondary Portfolio Class
Mrs. Harris | 10th grade English II (“heterogeneous, vocational”) | 10th grade English II (“heterogeneous, vocational”)
Mrs. Jacobson | 7th grade Language Arts (“heterogeneous,” “phase II math”) | 7th grade Language Arts (“heterogeneous,” “lower ability math”)
Mrs. Marks | 9th grade World Literature I (“top level”) | 10th grade World Literature II (“average track”)
Ms. Patrick | middle school | middle school
Ms. Phillips | 7th grade English (“gifted and talented”) | 7th grade English (“B-level”)
Ms. Snyder | 7th grade English (“many reading 1-2 levels below grade level”) | 7th grade English (“many reading 1-2 levels below grade level”)

*All names are pseudonyms.

Table 2
Participating Mathematics Teachers* and Their Classes (with achievement levels as characterized by teachers)

Portfolio/Case Comparisons
Teacher | Primary Portfolio Class(es) | Non-Portfolio Class(es)
Ms. Anderson | Algebra I, primarily 9th graders | Accelerated Algebra I, all 9th graders
Ms. Fleming | Integrated Geometry, 11th graders | Math I, remedial math (“failing students”)
Mr. Gere | Algebra I, 9th graders | Pre-Calculus, predominantly juniors, a few 10th graders
Mrs. Green | Geometry (academic), 10th, 11th and 12th graders | Integrated Math, remedial math (“socially promoted students”)
Mrs. Jones | Geometry (college bound), 10th and 11th graders | Algebra I (college bound), 9th graders
Ms. Rinaldi | General 8th grade mathematics | Accelerated Geometry, all 8th grade students (most accelerated math students in the school)
Mr. Skinner | Pre-Calculus (optional 4th year of mathematics), 12th graders | Algebra II, 11th and 12th grade (students of varying abilities)
Ms. Weaver | Algebra II (primarily college bound students, some repeating the course), 10th, 11th, and 12th graders | Geometry (majority of the students are college bound, but not honors students), 10th, 11th, and 12th graders

Portfolio/Portfolio Comparisons
Teacher | First Portfolio Class | Second Portfolio Class
Ms. Barnes | Algebra and Geometry, 8th graders (general ability) | Algebra, 8th graders (most advanced math class offered)
Ms. Eastman | Trigonometry/Analytic Geometry, 11th and 12th graders (general ability, high achieving students) | Consumer Math, 10th, 11th, and 12th graders (many repeating the course or have previously failed a math course)
Mr. Freeman | Integrated Level I (Algebra), 9th graders (advanced) | Geometry, 9th and 10th graders (advanced)
Mr. Johnson | Transition Math, 8th graders (general ability) | Algebra, 8th graders (average ability)
Ms. Layton | Transition to College Mathematics, 12th graders (average ability) | Algebra I, part 2, 9th, 10th, 11th, and 12th graders (average ability)
Ms. Schafer | Regular/Inclusion Math, 7th grade | Regular Education Mathematics, 7th grade
Mr. Sexton | 8th grade mathematics (students of varying abilities, most advanced students leave this classroom to take algebra) | 7th grade mathematics (students of varying abilities)

*All names are pseudonyms.

Notes

1. This study was supported, in part, by a grant from the Spencer Foundation. We gratefully acknowledge their support. We also wish to thank Kevin Basmadjian, Leslie Burns, Leah Kirell, Suzanne Knight, Emily Smith, Michigan State University, and Vilma Mesa, University of Michigan, for their thoughtful contributions to the comparative analyses. The senior author is a member of INTASC’s technical advisory committee (TAC). We gratefully acknowledge comments on an earlier draft from Aaron Schutz, Mark Wilson and from INTASC staff and TAC members: Mary Diez, Jim Gee, Ann Gere, Bob Linn, Jean Miller, David Paradise, and Diana Pullin. Opinions, findings, conclusions, and recommendations expressed are those of the authors and do not necessarily reflect the views of INTASC, its technical advisory committee, or its participating states.
2. Kane and colleagues (1999) actually refer to three levels of inference: “inferences from performances to observed scores”, “inferences [or ‘generalization’] from observed scores to universe scores…which includes performances on tasks similar to (i.e., exchangeable with) those in the assessment”, and “inferences [or ‘extrapolation’] from universe scores to target scores” reflecting “a larger, and generally less well-defined domain” where the regulative ideal of random sampling is untenable. Messick’s (1994) distinction between task- and construct-based assessments seems to parallel Kane’s first two levels of inference. Haertel (1985), too, characterizes three levels of generalizability. He differentiates the outcome domain into that part that can be empirically investigated and that part that involves only weak assumptions.
3. Brennan notes a discontinuity between IRT and other measurement models: “It is certainly true that statistics that have a reliability like form can be computed based on an IRT analysis, but it is equally true that almost all such analyses treat items as fixed. This raises important questions about what such statistics mean from the point of view of replications of measurement procedures” (2001, p. 304).
4. Brennan also cites scoring rubrics and rater training procedures as potentially relevant sources of variation, although most assessments treat these (within the universe of generalization) as fixed.
5. There is a long history in the writing assessment literature of examining relationships between so called direct and indirect methods of assessment, both to document the validity of the multiple choice method and to show that actual samples of writing represent a different construct than what can be examined with multiple choice tests.
6. Readers should note that the structure of the National Board assessment is undergoing revision; the description presented here was operative in each of the research studies we describe. See for updated information.


7. For the six certificates that were operational when their Technical Analysis Report was released (1998), the overall estimate of exercise reliability (across the 10 tasks) ranged from 0.72 – 0.87, with a median of 0.825. This included 4 in-class portfolio exercises, two documented accomplishments portfolio exercises, and six assessment center exercises. For the four in-class portfolio exercises, the reliability ranged from .049 – 0.76, with a median of 0.695. [The one math certificate, Adolescence and Young Adulthood/Math, received the highest exercise reliabilities and Early Adolescence Generalist, the lowest.] (p. 109). Decision consistency for exercise sampling ranged from 5% to 7% for false negatives (estimated percent of candidates who incorrectly failed) and 6% to 9% for false positives (estimated percent of candidates who incorrectly passed). They note that decisions are more consistent for candidates with scores further from the cut score. Assessor reliability was generally high (.90 to .98) overall and (.85 to .92) on in-class portfolio exercises. “Based on these analyses of the technical measurement quality of the six certificates administered in the 1996-97, the assessments fully meet the requirements of the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1985) for validity, reliability, and freedom from bias” (p. 125).
8. The use of guiding questions that integrate standards into dimensions directed at a particular teaching performance to produce an interpretive summary was developed by Genette Delandshere, Steve Koziol, Penny Pence, Ray Pecheone, Tony Petrosky, and Bill Thompson in their leadership of one of the first two National Board Assessment development labs. This has informed both the work at INTASC and NBPTS. See, for instance, Delandshere and Petrosky (1994); Koziol, Burns and Brass (2003).
9. With these independent readings, the rank ordering of the two records of teaching was the same.
Given the time consuming nature of the task, we decided it was appropriate to move to a single comparison document and audit described next rather than to have two separate documents produced.
10. While the four separate classroom based exercises in the National Board portfolio may (or may not) encompass some of these variations depending on the class(es) the teacher chooses for each exercise, they are thoroughly confounded with task differences and not, to the best of our knowledge, routinely examined.
11. All names are pseudonyms.
12. Connecticut's ELA guiding questions and rubrics have been revised since we used them. The current versions are available on the Connecticut Department of Education’s web page at ubrics.htm
13. Connecticut's Math guiding questions and rubrics have been revised since we used them. The current versions are available on the Connecticut Department of Education’s web page at ubrics.htm.
14. Direct quotations from the teachers are in italics.
15. The National Board assessments encourage but do not require teachers to choose different classes for different tasks (see guidelines at ). We have not located any documentation that provides evidence about the ways in which teachers respond to this direction,


whether/how they are considered during scoring, or how these differences might shape the evaluations of their performances.
16. "Little information about the technical soundness of teacher licensure tests appears in the published literature. Little research exists on the extent to which licensure tests identify candidates with the knowledge and skills necessary to be minimally competent beginning teachers" (NRC, 2001b, p. 14).
17. Of course, the import of this statement depends on what your conception of validity is.
18. These include item response theory models, latent class models, structural equation models, and hierarchical models.
19. “In the same conceptual framework and with the same estimation approach, we can carry out probability-based reasoning with all of the models we have discussed. Moreover, we can mix and match components of these models, and create new ones, to produce models that correspond to assessment designs motivated by theory and purpose” (Mislevy et al., 2001, p. 49).
20. Kadane and Schum (1996), a resource on which Mislevy (1994) draws, cite subjective versions of probabilistic reasoning that could be used with singular judgments. They use this approach to model the likelihood of potential verdicts in a murder trial. They offer this approach as a supplement to, and a way of illuminating assumptions behind, but not a replacement for, the kinds of human judgments involved in such complex social situations.
21. Like psychometrics, hermeneutics characterizes a general approach to the interpretation of human products, expressions, or actions. Important differences between these disciplines lie, in part, in the ways in which the information is combined. Psychometric practices support aggregative strategies for combining information: scores for distinct (ideally independent) pieces of information are (weighted and) aggregated to form an interpretable overall score or grade.
Hermeneutics supports a holistic and integrative approach to interpretation of human phenomena, which seeks to understand the whole in light of its parts, repeatedly testing interpretations against the available evidence, until each of the parts can be accounted for in a coherent interpretation of the whole (Bleicher, 1980; Ormiston and Schrift, 1990; Schmidt, 1995).
22. Quotation marks (“ ”) indicate quotations from the beginning teacher. Sideways carets (<>) indicate quotations from the case study writer. All names are pseudonyms.

References

AERA, APA, & NCME (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
AERA, APA, & NCME (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Ball, D. L., Gere, A. R., & Moss, P. A. (1998). Fieldwork Guide for the INTASC Beginning Teacher Case Study Project. Unpublished manuscript, University of Michigan.
Bleicher, J. (1980). Contemporary hermeneutics: Hermeneutics as method, philosophy, and critique. London: Routledge and Kegan Paul.


Bond, L., Smith, T., Baker, W. K., & Hattie, J. A. (2000). The certification system of the National Board for Professional Teaching Standards: A construct and consequential validity study. Greensboro, NC: Center for Educational Research and Evaluation, The University of North Carolina at Greensboro.
Breland, H. M., Camp, R., Jones, R. J., Morris, M. M., & Rock, D. A. (1987). Assessing writing skill (Research Monograph No. 11). New York: College Entrance Examination Board.
Brennan, R. L. (1983). The elements of generalizability theory. Iowa City, IA: American College Testing Program.
Brennan, R. L. (2001). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38(4), 295-317.
Brennan, R., & Johnson, E. (1995). Generalizability of performance assessments. Educational Measurement: Issues and Practice, 14(4), 9-12, 27.
Crehan, K. D. (2001). An investigation of the validity of scores on locally developed performance measures in a school assessment program. Educational and Psychological Measurement, 61(5), 841-848.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer (Ed.), Test validity (pp. 3-17). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory and public policy (pp. 147-171). Urbana: University of Illinois Press.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalizability analysis for performance assessments of student achievement or school effectiveness. Educational and Psychological Measurement, 57(3), 373-399.
Delandshere, G., & Petrosky, A. R. (1994). Capturing teachers' knowledge. Educational Researcher, 23(5), 11-18.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4(4), 289-303.
Dwyer, C. A. (1998). Psychometrics of Praxis III: Classroom performance assessments. Journal of Personnel Evaluation in Education, 12(2), 163-187.
Educational Testing Service (ETS) (1998). NBPTS technical analysis report, 1996-97 administration. Southfield, MI: NBPTS.
Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many faceted Rasch model. Journal of Educational Measurement, 31(2), 93-112.
Engelhard, G. (2002). Monitoring raters in performance assessments. In G. Tindal & T. M. Haladyna (Eds.), Large scale assessment programs for all students (pp. 261-288). Mahwah, NJ: Erlbaum.
Erickson, F. (1986). Qualitative methods in research on teaching. In M. C. Wittrock (Ed.), Handbook of research on teaching (pp. 119-161). New York: Macmillan.
Gallego, M. A., Cole, M., & The Laboratory of Human Cognition (2002). Classroom cultures and cultures in classrooms. In V. Richardson (Ed.), Handbook of Research on Teaching (4th ed., pp. 951-997). Washington: AERA.
Gao, X., & Colton, D. A. (1997). Evaluating measurement precisions of performance assessment with multiple forms, raters, and tasks. In D. A. Colton (Ed.), Reliability issues with performance assessments: A collection of papers. Iowa City: American College Testing Program. (ACT Research Report Series 97-3).
Gao, X., Shavelson, R. J., & Baxter, G. P. (1994). Generalizability of large-scale performance


assessments in science: Promises and problems. Applied Measurement in Education, 7(4), 323-342.
Haertel, E. (1985). Construct validity and criterion-referenced testing. Review of Educational Research, 55, 23-46.
Haertel, E. H., & Lorie, W. (in press). Validating standards-based test score interpretations. Measurement: Interdisciplinary Research and Perspectives.
Harris, D. J. (1997). Using reliabilities to make decision. In D. A. Colton (Ed.), Reliability issues with performance assessments: A collection of papers. Iowa City: American College Testing Program. (ACT Research Report Series 97-3).
Interstate New Teacher Assessment and Support Consortium. (1992). Model standards for beginning teacher licensing and development: A resource for state dialogue. Washington, DC: Council of Chief State School Officers.
Jaeger, R. M. (1998). Evaluating the psychometric qualities of the National Board for Professional Teaching Standards' assessments: A methodological accounting. Journal of Personnel Evaluation in Education, 12(2), 189-210.
Kadane, J. B., & Schum, D. A. (1996). A probabilistic analysis of the Sacco and Vanzetti evidence. New York: Wiley.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5-17.
Knapp, M., & Woolverton, S. (1995). Social class and schooling. In J. Banks & C. A. McGee Banks (Eds.), Handbook of Research on Multicultural Education (pp. 548-569). New York: Simon and Schuster.
Klein, S. P., McCaffrey, D., Stecher, B., & Koretz, D. (1995). The reliability of mathematics portfolio scores: Lessons from the Vermont experience. Applied Measurement in Education, 8(3), 243-260.
Koretz, D., McCaffrey, D., Klein, S., Bell, R., & Stecher, B. (1992).
The reliability of scores from the 1992 Vermont Portfolio Assessment Program: Interim report Santa Monica, CA: Rand Institute on Education and Training, National Center for Rese arch on Evaluation, Standards, and Student Testing. Koretz, D., Klein, S., McCaffrey, D., & Stecher, B. (1993). Interim report: The reliability of Vermont portfolio scores in the 1992-93 school year (RAND, RP-260). Santa Monica, CA: RAND. (Reprinted from CSE Technical Report 370, Los Angel es, University of California, Center for Research on Evaluation, Standards, and Student Test ing, December.) Koretz, D., Stecher, B., Klein, S., & McCaffrey, D. (in press). The Vermont Portfolio Assessment Program: Findings and implications. Educational Measurement: Issues and Practice 13 (3), 5-16. Koretz, D., Stecher, B., Klein, S., McCaffrey, D., & Deibert, E. (1993). Can portfolios assess student performance and influence instruction? The 1991-92 Vermont experience (RAND, RP-259). Santa Monica, CA: RAND. (Reprinted from CSE Technic al Report 371, Los Angeles, University of California, Center for Research on Evaluation, S tandards, and Student Testing, December.) Koziol, S. M. Jr., Burns, L., & Brass, J (2003). Four lenses for the analysis of teaching. Supportin g beginning teachers’ practice Working paper, Michigan State University. Lane, S.. Liu., M., Ankenmann, R. D., & Stone, C. A (1996). Generalizability and validity of a mathematics performance assessment. Journal of Educational Measurement 33 (1), 71-92. Linn, R. L. & Burton, E. (1994). Performance-based assessment: Implications of task specificity. Educational Measurement: Issues and Practice 13 (1), 5-8, 15. Lyons, N. (1998) With portfolios in hand: validation the new teacher professionalism New York: Teachers College Press. McBee, M. M., & Barnes, L. L. (1998). The generaliz ability of a performance assessment measuring


52 of 70 achievement in eight-grade mathematics. Applied Measurement in Education 11 (2),179-194. McLaughlin, M., Talbert, J., & Bascia, N. (Eds.). ( 1990). The contexts of teaching in secondary schools: teachers' realities New York: Teachers College Press. McLaughlin, M. & Little, J. W. (Eds.). (1993). Teachers' work: Individuals, colleagues, and contex ts New York: Teachers College Press. McNeil, L. (1983). Contradictions of control: School structure and sch ool knowledge New York: Routledge and K. Paul. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education/Macmillan. Messick, S. (1994). The interplay of evidence and c onsequences in the validation of performance assessments. Education Researcher 32 (2), 13-23. Messick, S. (1996). Validity and washback in langua ge testing. Language Testing 13 (3), 241-256. Mislevy, R.J. (1994). Evidence and inference in edu cational assessment. Psychometrika, 59 439-483 Mislevy, R. J., Almond, R., & Steinberg, L. (2003). On the structure of educational assessment. Measurement: Interdisciplinary Research and Perspec tives 1 (1), pp. 3-62. Mislevy, R. J., Wilson, M. R., Ercikan, K., Chdowsk y, N. (2002). Psychometric principles in student assessment Los Angeles: CRESST. Moss, P. A. (1992). Shifting conceptions of validit y in educational measurement: Implications for performance assessment. Review of Educational Research 62 (3), 229-258. Moss, P. A. (1994). Can there be validity without r eliability? Educational Researcher 23 (2), 5-12. Moss, P. A. (1996). Enlarging the dialogue in educa tional measurement: Voices from interpretive research traditions. Educational Researcher 25 (1),20-28, 43. Moss, P. A. (1998). The role of consequences in val idity theory. Educational Measurement: Issues and Practice 17 (2), 5-12 Moss, P. A. (in press). The meaning and consequence s of reliability. Journal of Educational and Behavioral Statistics. Moss, P. 
A., Rex, L., & Geist, P. (2000a). Case Study Writing Guide for the INTASC Beginning Teacher Case Study Project Unpublished manuscript, University of Michigan. Moss, P. A., Rex, L., & Geist, P. (2000b). Fieldwork Guide for the INTASC Beginning Teacher Ca se Study Project. Unpublished manuscript, University of Michigan. Moss, P. A. & Schutz, A. (1999). Risking frankness in educational assessment. Phi Delta Kappan 80 (9), 680-687. Moss, P. A. & Schutz, A. (2001). Educational standa rds, assessment, and the search for "consensus". American Educational Research Journal 38 (1), 37-70. Moss, P. A., Schutz, A. M., & Collins, K. M (1998). An Integrative approach to portfolio evaluation fo r teacher licensure. Journal of Personnel Evaluation in Education 12 (2),139-161. Moss, P. A., Schutz, A. M., Haniford, L., Miller, R ., & Coggshall, J. (in preparation). High stakes assessment as ethical decision making Unpublished manuscript, University of Michigan. Myford, C. M. (1993). Formative studies of Praxis III: Classroom Performa nce Assessments--An overview. The Praxis Series: Professional Assessmen ts for Beginning Teachers Princeton, NJ: Educational Testing Service. Myford, C. M., & Engelhard, G. (2001). Examining th e psychometric quality of the national board for professional teaching standards early childhood/gen eralist assessment System. Journal of Personnel Evaluation in Education 15 (4), 253-285.


Myford, C. M., & Mislevy, R. J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Research Report). Princeton, NJ: Educational Testing Service.
National Research Council (2001a). Knowing what students know: The science and design of educational assessment. Washington, D.C.: National Academy Press.
National Research Council (2001b). Testing teacher candidates: The role of licensure tests in improving teacher quality. Washington, D.C.: National Academy Press.
Nystrand, M., Cohen, A. S., & Martinez, D. (1993). Addressing reliability problems in the portfolio assessment of college writing. Educational Assessment, 1(1), 53-70.
Ormiston, G. L., & Schrift, A. D. (Eds.) (1990). The hermeneutic tradition: From Ast to Ricoeur. Albany: SUNY Press.
Pearlman, M. (in press a). The design architecture of NBPTS certification assessments. In L. Ingvarson (Ed.), Assessing teachers for professional certification. Stamford, CT: Jai.
Pearlman, M. (in press b). The evolution of the scoring system for NBPTS assessments. In L. Ingvarson (Ed.), Assessing teachers for professional certification. Stamford, CT: Jai.
Porter, A. C., Youngs, P., & Odden, A. (2003). Advances in teacher assessments and their uses. In V. Richardson (Ed.), Handbook of research on teaching (pp. 259-297). Washington, DC: AERA.
Reckase, M. D. (1995). Portfolio assessment: A theoretical estimate of score reliability. Educational Measurement: Issues and Practice, 14(1), 12-14, 31.
Schmidt, L. K. (1995). Introduction: Between certainty and relativism. In L. K. Schmidt (Ed.), The specter of relativism: Truth, dialogue, and phronesis in philosophical hermeneutics (pp. 1-22). Evanston, IL: Northwestern University Press.
Schutz, A., & Moss, P. A. (in press). Reasonable decisions in portfolio assessment. Educational Policy Analysis Archives.
Shavelson, R. J., Baxter, G. P., Pine, J., Yure, J., Goldman, S. R., & Smith, B. (1991). Alternative assessment technologies for large scale science assessment: Instrument of education reform. School Effectiveness and School Improvement, 2(2), 97-114.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: SAGE Publications.
Stodolsky, S., & Grossman, P. (1995). The impact of subject matter on curricular activity: An analysis of five academic subjects. American Educational Research Journal, 32(2), 227-49.
Swanson, D., Norman, G. R., & Linn, R. L. (1995). Performance-based assessment: Lessons from the health professions. Educational Researcher, 24(5), 5-11, 35.
Wilson, M. (1994). Community of judgment: A teacher-centered approach to educational accountability. In Office of Technology Assessment (Ed.), Issues in educational accountability. Washington, D.C.: Office of Technology Assessment, United States Congress.
Wilson, M., & Case, H. (1997). An examination of variation in rater severity over time: A study in rater drift. Berkeley, CA: Berkeley Evaluation and Assessment Research (BEAR) Center.
Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181-208.
Zeichner, K. (2000). Alverno College. In L. Darling-Hammond (Ed.), Studies in excellence in teacher education: Preparation in the undergraduate years. Washington, DC: American Association of Colleges of Teacher Education.

About the Authors


Pamela A. Moss is an Associate Professor in the School of Education at the University of Michigan. Her areas of specialization are at the intersections of educational assessment, validity theory, and interpretive social science. She can be reached at 4220 School of Education, University of Michigan, Ann Arbor, MI 48109-1259.

LeeAnn M. Sutherland is an Assistant Research Scientist at the University of Michigan. Her work focuses on adolescent literacy and identity, particularly as students make sense of school discourse vis-à-vis their everyday experiences.

Laura Haniford is a doctoral candidate in Educational Foundations and Policy at the University of Michigan. She specializes in teacher education, especially multicultural education. Her research focuses on the ways in which classroom discourse influences learning opportunities.

Renee Miller is a doctoral candidate in the School of Education at the University of Michigan. She specializes in Science Education and Museum Studies.

David Johnson is a Ph.D. student in Educational Studies at the University of Michigan. His research interests include the influence of government policy on teachers and students and how students make meaning of their state-mandated testing experiences.

Pamela K. Geist is an educational consultant in Denver, CO. She specializes in mathematics education.

Stephen M. Koziol, Jr. is Professor and Chair of the Department of Curriculum and Instruction at the University of Maryland. He specializes in English Education, program design and policy in teacher education, and teacher assessment.

Jon R. Star is an Assistant Professor in the College of Education at Michigan State University. His research focuses on students' learning of middle and secondary school mathematics, particularly the development of mathematical understanding in algebra.

Raymond L. Pecheone is an Academic Research and Program Officer in the School of Education at Stanford University. He specializes in the design and implementation of complex performance assessment systems. Previously Dr. Pecheone was Bureau Chief of Curriculum, Research, Testing and Assessment for the Connecticut State Department of Education. He was co-director of one of the first assessment development labs for the National Board for Professional Teaching Standards.

Appendix

Excerpts From 70 Page Document Comparing Portfolio And Case For "Ms. Bertram" (prepared by LeeAnn M. Sutherland).


Note: We have emphasized excerpts from the exhibits that focus on writing. Consequently, readers may not find examples of evidence for reading in the conclusions which address writing and reading together. (Note 22)

Teacher's goals for the unit

Portfolio [based on T's commentary]
The T states that the unit we see in the pf, based on the study of a popular age-appropriate novella, has three goals. She writes that Ss will be expected to:
• Identify the character traits of the main characters in the novella,
• Compose a written response citing which character they felt they were most similar to/could relate to best/liked the best and why, and
• Use that reflection as the foundation for creating a simulated journal to record thoughts and reflections upon the events of the plot as viewed through that character’s eyes, to be evaluated at the end of the novel by the teacher and a pre-discussed rubric (TR, 2).

Case [based on interview notes and fieldnotes]
According to the CSW, (CSW, 25). The 3-day series of lessons observed by the CSW was focused on having Ss select 3 best pieces of their own writing from their notebooks, complete evaluation sheets about each one, exchange with a partner who would name his or her choice for the writer’s best piece, revise that piece of writing, and publish it on a web page. This is their choice to say “I have grown as a writer and I have matured. This is the piece out of all of the others that I am proud to call my own.” (CSW, FN 12).

Teacher's characterization of students

Portfolio [based on T's commentary]
The T says: “In a previous unit on biography, I had realized that this particular group of Ss was both willing and capable of ‘allying’ themselves with characters whose gender, life experiences, or age were different than their own, if the Ss perceived a connection based on personality or motivation” (5). Thus she “hoped that Ss would choose to further this exploration--that S diaries might cross over the lines of age and gender perspectives, encouraging discussion and deeper reflection into the theme of the book” (5).
She says this class is “fearless in its class discussions, and the whole-class discussion forum works well--rather than a teacher-led review, these discussions often erupt into lives of their own. Students frequently--though politely--challenge one another’s ideas” (14, L).
The T generalizes about this class, “[Although] this class is, overall, a group very much at ease with higher-order thinking skills, there are a number of more concrete Ss who require questions of a more literal nature” (26).

Case [based on interview notes]
The T describes her 1st period class as “the sleeper class” (CSW, 8) and (CSW, 9). She describes her 5/6 period as the “hoopla class,” one she calls a (CSW, 8). She describes period 2/3, the portfolio class, as (CSW, 9). They are “a class I never have to worry about. They are wonderful, enthusiastic, we will do anything for you today, Ms. Bertram … Their personalities are like a prism … period 1 tend to be a little variegated and textual … whereas period 2 kids come up and share their lives with me” (CSW, 10). “Period two is the class that I’ll toss an idea out to and all of a sudden they’re coming up with ideas. Boom, Boom, Boom. It’s their ideas” (CSW, 11).
During the first pre-observation interview, the T talked with the CSW about how she anticipated Ss <[responding] to her instructional delivery the following day.> She differentiated between 1st period and 2/3rd period (the pf class): “Period one students will find evaluating themselves as difficult. They think they know what to look for, but … these children often require lots more modeling before they fully understand how to evaluate. They will also struggle with the technical aspects of this assignment. Some are still struggling with using laptops finding it difficult to SAVE their work or to save their work to a disk. Picking the entry itself will be very easy for these children, but the chosen entry may not even be their best piece. However, period two students will find the lesson a challenge, yet a breeze. Most of them are technologically literate and experience few technical difficulties using the laptops. They will be quite enthusiastic about the assignment as well as honest. These children know what to look for when selecting their best piece of work and are most capable of knowing the right answers whatever the lesson.”

Chronological summary of activities


Portfolio [based on T's daily log]

Session 1
SSR--Sustained Silent Reading (10 min). The T introduced the fantasy unit by having Ss brainstorm as a class “What elements would we expect to find in a fantasy novel?” The class then discussed conflict in fantasy and the problem-solving role of the hero. In small groups, Ss brainstormed words and phrases that describe a hero, and they used crayons to sketch an image of a hero. These responses were shared with the whole group. The T encouraged discussion by asking Ss to elaborate; for example, when a S offered: “The hero may not have planned to be a hero,” the T asked the class, “What does this mean?” to encourage discussion. Ss’ homework was to write one paragraph describing “The Perfect Hero” (L, 6-7).

Session 2
SSR (10 min.) The class began with Ss sharing their homework about the perfect hero aloud. Afterward, the T guided Ss into identifying those characteristics they have listed as “traits,” and talking about differences between physical and nonphysical character traits. Ss then played a game (10 min.) in which they circulated around the room to get their peers’ signatures on a handout that asked for information about both the physical and the nonphysical traits of their classmates. The worksheet required Ss to learn about 24 topics including who owns a kitten, who sings in the shower, and who has freckles (ART, 9A). When finished, Ss shared what they had learned about their classmates and discussed “which character traits convey the most information about a person or character” (L, 9).

[…sessions 3-8 described….]

Session 9
T read-aloud continued (10 min.) Following this, Ss received instructions and did individual seatwork in preparation for the following day’s Goldfish Bowl activity. Ss were to answer 4 questions on a handout, and the T circulated to answer questions and keep students “on task.” Questions include: “Do the main characters fit the criteria of a hero--at least, the hero we discussed in our first lesson? Why or why not?” Ss were also asked a prediction question, a question about changing the novel’s point of view, and a question relating faults as sometimes helpful to a particular character in the novel--to how faults could be helpful in real life (23a). The last 15 minutes of the period were devoted to small group discussion, assigning a single to each member, revising individual responses based on other group members’ responses, and preparing to share those the following day (22-23, 23a, L).

Session 10--the videotaped session
(No oral reading.) Spent the entire class period on the Goldfish Bowl activity. A “Reflection Sheet” (ART, 25A) asked Ss to respond to 3 questions about each of the larger questions discussed in the bowl. The worksheet asked, “Which part of the discussion most closely matched your own answer? How?” and “Did you disagree with or not understand some part of the discussion? Which part? Why?” and “Do you feel the ‘goldfish’ left anything out? If yes, what?” The T indicates that this activity closed “the pre-vacation leg of the unit” (25, L).

Case [based on fieldnotes]

Day One
Class begins with Ss writing for about 10 minutes in their notebooks. (CSW, FN 15). T does a PowerPoint presentation to walk Ss through the steps of choosing their “best piece” for polishing. Discussion ensues with Ss volunteering strategies for choosing a best piece, reasons why they might consider one “best” and why they might “pass” on others. The PowerPoint presentation includes a sample entry which is discussed as a whole class. Homework is a “Nomination Ballot” which Ss are to use to review the 3 entries they had chosen for the previous night’s homework. <“You’re going to check off the strong points, not so strong points and comment on why these three entries could be the entry of the year”> (CSW, FN 20). Because the T realized after 1st period that the evaluation handout was going to be homework, she took more time in the 2nd period (pf class) such as as the class worked through the PowerPoint presentation and discussion (CSW, 28).

Day Two
Class began with 10-minute writing in notebooks followed by a discussion of what Ss found easy and found difficult in last night’s homework. The T noticed that several Ss had incomplete assignments, so she reviewed the concept of “reflection” on the best piece, answering “Why” it could be a winner.

Day Three
Class began with 10-min writing in notebooks and review of what they have done thus far in regard to the notebooks. The next step is revising, which the T walked Ss through using a PowerPoint presentation. They were also given a handout to guide them.
A difference was noted in the way homework was assigned to both classes. Period one was to finish the teacher-created ditto, while period two was to complete not only the ditto but revisions on their best entry of the year as well. The teacher reported that it was a mistake on her part, realizing that period one should have finished their revisions as well.

Comparison: [[In both portraits, we see the T provide a variety of ELA experiences for these Ss. Reading, writing, speaking and listening happen on most days represented in the pf and in the CS.
Configurations are varied--small group, whole class, and pairs, in both self-selected and T-selected groupings.
We also see lessons that build on one another toward completion of a final composition. In the pf, those compositions are a character diary and a poem, and in the CS the final composition is a revision for on-line publication of a S-selected favorite piece of writing. While the goal in each case was to create a product, the lessons focused on other important skills. Through handouts in the pf and a PowerPoint presentation in the CS, we see the T scaffold students’ learning. She asks questions which aim toward having students achieve particular goals, but the questions themselves do not lead students to particular answers. In both portraits, we see the T guide her students in developing their own critical thinking skills. And, in both the pf and the CS, we see the T attend to development of students’ metacognitive skills. Also, in both portraits we see established routines that Ss seem to respond well to and which seem to align with T goals. In the written pf text, the T tells us how students respond to particular activities, and in the video, we see that what she says is true. The CS provides additional evidence of the same. Both portraits would likely lead to the same evaluation of this T.]]

Classroom Interaction

Portfolio [based on video]
In the video conferencing session, we see the T seated at a table with four students.
The procedure is that each S reads his or her poem aloud to the group, and group members provide feedback. The T facilitates by repeating and clarifying S responses after each one speaks, and by deciding when to move on to the next writer. Students are clearly practiced in this type of session. They make comments such as, “When you say ‘things,’ you could go deeper,” and “You know how you said it’s confusing? When I went over mine, I thought the same thing….” A student also defends his poem which two classmates say is “too deep” with, “That’s what I was aiming for” and explaining, “I’m trying to stay away from the word ‘like.’” The T makes comments such as, “What I’m hearing you say is that we have a good poem here, we just need …” and she ends the session by telling the students “I’m very pleased” with the way in which they conferenced.

Case [based on fieldnotes]
[Whole-class interaction in both the pf and the non-pf class is described by the CSW as representing the same pattern. The following example is one of several that illustrate that the T poses a question and Ss respond one after another with a variety of ideas.] < After Ss have read a sample the T has provided as part of her PowerPoint lesson, she poses the question, “Does the entry show deeper thinking?” [A quality they had already determined was necessary for a good journal entry.] The CSW reports: and the class moves on to another topic (CSW, 15).

Comparison: In both the pf and the CS, we observe similar classroom routines. We see the first 10 minutes of each class period spent on SSR in the pf, notebook writing in the CS. In both portraits, we see the T begin class sessions by asking a S volunteer to recap the previous day’s learning, then proceeding with the day’s activities. We witness a variety of classroom configurations. A horseshoe arrangement of desks is seen on the pf video and is described by the CSW. We observe whole class discussion, peer cfs about writing, small group activities, T lecture, presentations, and preparation of written texts for publication. Class sessions frequently close with the assignment of homework or preview of the following day’s activities.
While on the video the classroom may be seen as quite controlled in terms of time management and change from one activity to another by teacher command, it is clear in the content of the talk that students are thinking and learning. Their questions and comments to peers in writing conferences and in whole-group discussions indicate that they have learned to talk with one another about ideas, to give one another feedback about writing, and to question and compliment their classmates. These are observable in the pf, and are affirmed by the CSW’s observations of classroom dialogue among Ss and T.

Teacher's reflection on classroom interaction

Portfolio [based on commentary]
In order to “keep the noise to a minimum,” the T reports that for the videotaping, she chose to work with one cooperative group while other Ss worked individually. When videotaping was finished, students were “released to work in peer groups of their own” (48). The T says that “each member of the group eagerly offered his or her perceptions on the shared work.” She also reports that she questioned one S’s understanding of the assignment, but otherwise felt that Ss “were fully in command of the assignment.” She offered her perspective to and moderated the discussion, but “made a conscious attempt not to influence student critiques” even when she felt that a S “was being a bit too literal in his images” (48). Ss have been in revision groups before, but the T notes that this is the first time they have written and revised poetry this year.
Student Y is one of what the T terms “gray children,” a student she says she needs to “watch particularly for” and “make a concerted effort to draw into class discussions and activities,” as they readily “fade into the background” among more assertive Ss, and “do not cry out for the attention the lower-ability Ss require” (31-2).
[[What I see on the video jibes with the T’s description of it. The writing session does feel somewhat rushed. The T did try to “facilitate” the writing conference group discussion, but she acted somewhat more as “guide” for the fishbowl activity, seemingly to push the Ss’ thinking. She says that the first group was nervous, and two of the members we see on camera do seem very nervous.]]

Case [based on interview notes]
Reflecting upon a day’s lesson, the T says she was surprised by 1st period’s response: “They aren’t generally that participatory” and she The T (CSW, 31).
When the CSW asked the T her interaction with a boy from period two who has a gift for analyzing detail and working with words. She comments, “He has written poetry that would stop your heart but he holds himself to a standard that is three times higher than any other child. He worries about things and he is able to grasp things that never even occur to the other kids. It works against him sometimes and needs a lot of reassurance and validation. Sometimes he just thinks too much and at times needs to have limits placed on him because of his drive and tremendous efforts.” Teacher encourages him “to relax rather than get an ulcer before the age of twenty.”> (CSW, 33).

Comparison: In the pf, we have info. about how the T views individual Ss as she writes about the 5 writing samples included in her pf. In this CS, the T also talks a great deal in interviews about individual Ss. In both cases, she talks both about who the child is as a person (e.g. personality, affect) and who the child is as a student (e.g. academic strengths and weaknesses). In both portraits we also see a range; the T provides information both about students who are the most successful and about students who struggle the most in her classes.
In both portraits we also see this T’s use of terms like “abstract randoms,” and “concrete learners,” and she describes herself as “concrete random.” She appears to shape activities with these “types” in mind, as she does talk about “types” of Ss who will succeed or struggle with particular activities.
She also describes individual Ss in terms of “high ability,” “lower ability,” and “average.” There is no indication, however, that she holds Ss to particular standards based on which of these she believes the S to be. And, while the T’s characterizations of each class’s personality differ, the CSW reports that “the teacher has the same basic instructional strategies for all of her classes” (CSW, FN 9).

T's assessment of student work

Portfolio [based on comments written on students' papers]
On Student X’s paper, the T circles 3 spelling errors and one missing comma. At the bottom of the page she writes 4 comments: Student X, much of this is wording taken almost directly from the book; that’s not the best tactic here. Use your own words. Not clear whose POV you’re taking in this entry . . What about this character’s thoughts and feelings? this is mainly summary. Reflect on the events--don’t just list them!

Portfolio [based on T's commentary]
In reviewing the Ss’ initial character diary entries, the T reveals, “I feel I may have overestimated the ability of many Ss to take another’s point of view. The work of Student X in particular reminds me that they are still young, and I wonder if some Ss have not yet matured enough to see beyond themselves, to look at things from another’s perspective. Student X’s entry shows that he is taking the book very much at face value; he either cannot see (or does not wish to take the effort to see) the character traits of the characters conveyed in their words or actions” (33).

Case [based on fieldnotes and interview notes]
[[While we see no evaluated S writing in the CS, the T reviewed with Ss the qualities of good writing before (and during) their reading, selecting, conferencing, and revising processes for the task of polishing a journal entry for publication.]]
The CSW writes that the T (FN 45).
[[Two artifacts appended to the pf provide the more specific detail.]] A slide from the T’s PowerPoint presentation asks:
Does the entry show “deeper thinking?”
Does the writer “paint a picture with words?”
Is the writing “honest writing?”
Is the topic unique, unusual, or particularly meaningful?
The Nomination Ballot students use to evaluate their own and a classmate’s work lists the following criteria for judgment in addition to the above:
Entry is clear and focused
Wonderful use of “show, don’t tell.”
Captivating written “voice.”

Comparison: It seems that even though we do not see the T’s actual feedback on Ss’ work in the CS, the nature of her conversation with them and the content of the slides and the evaluation sheets Ss are to use with peers make us confident that she will actually use these standards to evaluate the final products.
So while we know more detailed information, and see the actual follow-through to final draft only in the pf, there is nothing in the two portraits that would likely cause this T to be evaluated differently in regard to what she thinks is important in evaluating S writing.]]

Teacher's reflection on her teaching

Portfolio
The T reports that when she begins this unit another time, she will spend more time on particular aspects of studying the novel that she felt Ss needed more time with. For example, she says that next time she will “spend one day focusing only on the aspects of fantasy before launching into the hero.” She combined the two in this lesson, and reflects: “it would certainly have benefited from additional time for student work and discussion” (L, 7).
The T notes that a vocabulary glossary or a vocabulary activity “in preparation for the reading” might be useful to Ss, as they asked about several words as the T read chapter 1 of the novel aloud. The T lists “prodigious,” “frenzied,” and “exclusive” as examples (L, 10-11).
The T also says she “may have overestimated the ability of many students to take another’s point of view.” Her conclusion is, “In the future, I should probably ‘test drive’ the character diary on a short story before applying it to a longer piece, so as to better assess the abilities of the class to step into another’s point of view” (33).

Case
tried this unit, yet feels she might do the same unit again next year and states that she will give the introductory portion in one day and the nomination ballot for homework. She also believes that the selections regarding best pieces will be done in class ‘to eliminate the consequences of Ss who come to class unprepared’> (CSW, 33).
< During the three day observation period, this interviewer was able to witness her creating curriculum from day to day based on her monitoring and adjusting for student learning first-hand. Her PowerPoint demonstration on revision, for instance, was created in response to her feeling that the students needed it to complete the task at hand> (CSW 29).

GQ 2.1 Describe the ways in which the teacher creates a learning environment that provides all Ss with opportunities to develop as readers, writers, and thinkers:

[[In the pf, the T describes “types” of S learners (e.g. concrete, random) in her logs, her other written text, and her descriptions of those whose writing samples she includes. She speaks in both the pf and the CS about accommodations for special needs Ss (e.g. special education), and the CSW observes the T’s accommodations for absent Ss.
In both the pf and the CS, we observe similar classroom routines. We see the first 10 minutes of each class period spent on SSR in the pf, notebook writing in the CS. In both portraits, we see the T begin class sessions by asking a S volunteer to recap the previous day’s learning, then proceeding with the day’s activities. We witness a variety of classroom configurations. A horseshoe arrangement of desks is seen on the pf video and is described by the CSW. We observe whole class discussion, peer cfs about writing, small group activities, T lecture, presentations, and preparation of written texts for publication. Class sessions frequently close with the assignment of homework or preview of the following day’s activities.
While on the video the classroom may be seen as quite controlled in terms of time management and change from one activity to another by teacher command, it is clear in the content of the talk that students are thinking and learning. Their questions and comments to peers in writing conferences and in whole-group discussions indicate that they have learned to talk with one another about ideas, to give one another feedback about writing, and to question and compliment their classmates. These are observable in the pf, and are affirmed by the CSW’s observations of classroom dialogue among Ss and T.
The T remarked in the pf text and in the CS interviews that some of the activities she planned may be (or may have been) too difficult for some of the Ss. When she anticipated that in advance, she attempted to shape instruction accordingly. When she realized Ss’ difficulties during or after class, she altered her plans for the next day (e.g. created another handout, attended to another aspect of the writing process in her PowerPoint presentation), or she indicated how she will alter instruction in the future. In both portraits, the T creates activities that challenge Ss, and her questions challenge their thinking on handouts and in discussion activities. In addition, we see the T make minor alterations in a particular lesson from one class to another, but she indicates and the CSW observes that she works with the same lesson plans, assignments, and “talk” as she guides S learning.
In each individual class, however, the T asks questions and pushes discussion based on Ss’ contributions, and thus responds in flexible ways to their learning as a class and as individuals within that class.

For this T, the CS reinforces what we learn in the pf. Either portrait alone would have given a substantial amount of information in regard to this Guiding Question, and neither contains information that would likely lead us to evaluate this T differently.]]

GQ 4.1 Describe how the teacher addresses student learning in reflection.
GQ 4.2 Describe how the teacher uses that reflection to inform practice.

[[This T, in both the pf and the CS, addresses issues of S learning in a variety of ways, including addressing learning in terms of the class as a whole and the individuals within the class. She reflects on students in general (age/developmentally) as well as on what she has learned about her own students this year.

In both the pf and the CS, we see the T reflect on student learning in terms of the nature of their contributions to discussion. In the pf, she reflects on S learning in terms of their “command” of assignments and of specific tasks, and in terms of the depth of reflection and the insightfulness shown in their writing. In the CS, she reflects on Ss’ abilities, their strengths and weaknesses, and the behavior they exhibit during class sessions. So, while we gather some similar and some different information about how the T reflects on S learning in each of these portraits, we can see in both that she draws upon a range of information that is both accurate and important in informing her instruction. None of the differences between the evidence presented in the pf and that presented in the CS is likely to be the cause of a different evaluation for this T.]]

Editor: Gene V Glass, Arizona State University
Production Assistant: Chris Murrell, Arizona State University

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, College of Education, Arizona State University, Tempe, AZ 85287-2411. The Commentary Editor is Casey D. Cobb.


