Educational policy analysis archives


Material Information

Educational policy analysis archives
Physical Description:
Arizona State University
University of South Florida
Place of Publication:
Tempe, Ariz
Tampa, Fla
Publication Date:


Subjects / Keywords:
Education -- Research -- Periodicals   ( lcsh )
non-fiction   ( marcgt )
serial   ( sobekcm )

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E11-00194
usfldc handle - e11.194
System ID:



Education Policy Analysis Archives

Volume 8 Number 50
November 2, 2000
ISSN 1068-2341

A peer-reviewed scholarly electronic journal
Editor: Gene V Glass, College of Education, Arizona State University

Copyright 2000, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold.

Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Student Evaluation of Teaching: A Methodological Critique of Conventional Practices

Robert Sproule
Bishop's University (Canada)

Abstract

The purpose of the present work is twofold. The first is to outline two arguments that challenge those who would advocate a continuation of the exclusive use of raw SET data in the determination of "teaching effectiveness" in the "summative" function. The second purpose is to answer this question: "In the face of such challenges, why do university administrators continue to use these data exclusively in the determination of 'teaching effectiveness'?"

I. Introduction

The original purpose of collecting data on the student evaluation of teaching (hereafter SET) was to provide student feedback to an instructor on her "teaching effectiveness" [Adams (1997), Blunt (1991), and Rifkin (1995)]. This function is
dubbed the "formative" function by some, and is viewed as non-controversial by most.

In time, raw SET data have been put to another use: providing student input to faculty committees charged with the responsibility of deciding on the reappointment, pay, merit pay, tenure, and promotion of an individual instructor [Rifkin (1995), and Grant (1998)]. This second function, dubbed the "summative" function by some, is viewed as controversial by many. (Notes 1, 2)

The purpose of the present work is twofold. The first is to outline two arguments that challenge those who would advocate a continuation of the exclusive use of raw SET data in the determination of "teaching effectiveness" in the "summative" function. The first argument identifies two conceptual fallacies, and the second identifies two statistical fallacies, inherent in their methodology. Along the way, I shall also argue that while neither conceptual fallacy can be remedied, one of the statistical fallacies can, by means of the collection of additional data and the use of an appropriate statistical technique of the sort outlined in Mason et al. (1995). The second purpose of the present paper is to answer this question: In the face of such challenges, why do university administrators continue to use these data exclusively in the determination of "teaching effectiveness"?

The general motivation for the present work is located in three classes of statements. The first class is the many reports of the confusion and general disarray caused to the academic mission of many disciplines by the SET process. For example, Mary Beth Ruskai (1996), an associate editor of Notices of The American Mathematical Society, wrote:

Administrators, faced with a glut of data, often find creative ways to reduce it (the SET process) to meaningless numbers. I encountered one who insisted that it sufficed to consider only the question on overall effectiveness, because he had once seen a report that, on average, the average on this question equaled the average of all other questions. He persisted in this policy even in cases for which it was patently false ...

Advocates often cite a few superficial studies in support of the reliability of student evaluations. However, other studies give a more complex picture ...

Many experienced faculty question the reliability of student evaluations as a measure of teaching effectiveness and worry that they may have counterproductive effects, such as contributing to grade inflation, discouraging innovation, and deterring instructors from challenging students.

The second class concerns what constitutes admissible and inadmissible evidence in legal and quasi-legal proceedings related to the "summative" function. For example, over fifteen years ago, Gillmore (1984) wrote: "If student ratings are to qualify as evidence in support of faculty employment decisions questions concerning their reliability and validity must be addressed" (p. 561). In recent times, it seems that the issue of admissibility has been clarified in the U.S. courts. For example, Adams (1997) wrote:

Concerning questions about the legal basis of student evaluations of faculty, Lechtreck (1990) points out that, "In the past few decades, courts have struck down numerous tests used for hiring, and/or promotions on the grounds that the tests were discriminatory or allowed the evaluator to discriminate. The question, How would you rate the teaching ability of this instructor, is wide open to abuse" (p. 298). In his column, "Courtside,"
Zirkel (1996) states, "Courts will not uphold evaluations that are based on subjective criteria or data" (p. 579). Administrative assumptions to the contrary, student evaluations of faculty are not objective, but rather, by their very nature, must be considered subjective. (p. 2) (Note 3)

That said, the present work should be seen as an attempt to further reinforce two views: that SET data are not methodologically sound, and that they ought not be treated as admissible evidence in any legal or quasi-legal hearing related to the "summative" function.

The third motivation stems from the notion of academic honesty, or from the virtue of acknowledging ignorance when the situation permits no more or no less – a notion and a virtue the academic community claims as its own. This motivation is captured succinctly by Thomas Malthus (1836) in a statement made over a century and a half ago. He wrote:

To know what can be done, and how to do it, is beyond a doubt, the most important species of information. The next to it is to know what cannot be done, and why we cannot do it. The first enables us to attain a positive good, to increase our powers, and augment our happiness: the second saves us from the evil of fruitless attempts, and the loss and misery occasioned by perpetual failure. (p. 14)

This article is organized as follows. In the second section, I offer a characterization of the conventional process used in the collection and processing of the SET data, for the benefit of those unacquainted with it. The third section outlines the conceptual fallacies inherent in the conventional SET process. Similarly, the fourth section outlines the statistical fallacies inherent in the same. The next-to-last section addresses this question: In the face of such challenges, why do university administrators continue to use these data exclusively in the determination of "teaching effectiveness"? Final remarks are offered in a concluding section.

II. The Conventional SET Process

The conventional process by which the SET data (on a particular instructor of a particular class) are collected and analyzed may be characterized as follows: (Note 4)

1. The SET survey instrument is comprised of a series of questions about course content and teaching effectiveness. Some questions are open-ended, while others are closed-ended.
2. Those which are closed-ended often employ a scale to record a response. The range of possible values, for example, may run from a low of 1 for "poor" to a high of 5 for "outstanding."
3. In the closed-ended section of the SET survey instrument, one question is of central import to the "summative" function. It asks the student: "Overall, how would you rate this instructor as a teacher in this course?" In the main, this question plays a pivotal role in the evaluation process. For ease of reference, I term this question the "single-most-important question" (hereafter, the SMIQ).
4. In the open-ended section of the SET survey instrument, students are invited to
offer short critiques of the course content and of the teaching effectiveness of the instructor.
5. The completion of the SET survey instrument comes with a guarantee to students; that is, the anonymity of individual respondents.
6. The SET survey instrument is administered: (i) by a representative of the university administration to those students of a given class who are present on the data-collection day, (ii) in the latter part of the semester, and (iii) in the absence of the instructor.
7. Upon completion of the survey, the analyst takes the response to each question on each student's questionnaire and constructs question-specific and class-specific measures of central tendency and of dispersion, in an attempt to determine if the performance of a given instructor in a particular class meets a cardinally- or ordinally-measured minimal level of "teaching effectiveness." (Note 5)
8. It seems that, in such analyses, raw SET data on the SMIQ are used in the main. More likely than not, this situation arises from the fact that the SET survey instrument does not provide for the collection of background data on the student respondent (such as major, GPA, program year, required course?, age, gender, …), and on course characteristics. (Note 6)

An example of the last two features may prove useful. Suppose there are three professors, A, B, and C, who teach classes X, Y, and Z, respectively. Suppose that the raw mean value of the SMIQ for A in X is 4.5, the raw mean value of the SMIQ for B in Y is 3.0, and the raw mean value of the SMIQ for C in Z is 2.5. Suppose too that the reference-group raw mean score for the SMIQ is 3.5, where the reference group could be either: (i) all faculty in a given department, or (ii) all faculty in the entire university. In the evaluation process, C's mean score for the SMIQ may be compared with that of another [say A's], and will be compared with that of her reference group.
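
The summary statistics of step 7 and the raw comparison in the three-professor example can be made concrete with a short sketch. The individual response lists below are hypothetical: the example supplies only the class means (4.5, 3.0, 2.5) and the reference-group mean (3.5), and the lists were chosen merely to reproduce those means. The function name `summarize_smiq` is likewise an illustrative assumption, not part of any actual SET system.

```python
from statistics import mean, stdev

# Hypothetical SMIQ responses on the 1-5 scale. Only the resulting class
# means (4.5, 3.0, 2.5) come from the example; the raw values are invented.
responses = {
    "A in X": [5, 4, 5, 4],
    "B in Y": [3, 2, 4, 3],
    "C in Z": [2, 3, 2, 3],
}
REFERENCE_MEAN = 3.5  # reference-group raw mean from the example


def summarize_smiq(scores, reference_mean):
    """Return the raw mean, sample standard deviation, and gap to the reference mean."""
    class_mean = mean(scores)
    return {
        "mean": class_mean,
        "stdev": stdev(scores),
        "gap_vs_reference": class_mean - reference_mean,
    }


summaries = {
    name: summarize_smiq(scores, REFERENCE_MEAN)
    for name, scores in responses.items()
}

for name, summary in summaries.items():
    print(name, summary)
```

Nothing in this raw comparison uses background data on the students or the courses, which is the limitation described in step 8.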
The object of this comparison is the determination of the teaching effectiveness, or ineffectiveness, of C.

The questions addressed below are: (a) are the data captured by the SMIQ a valid proxy of "teaching effectiveness," and (b) can the raw mean values of the SMIQ be used in such comparisons?

III. Fallacies Of A Conceptual Sort Inherent In The SET Process

In this section, I outline two fallacies of a conceptual sort inherent in the SET process. These are: (a) that students are a, or alternatively are the only, source of reliable information on teaching effectiveness, and (b) that there exists a unique and immutable metric termed "teaching effectiveness."

III.1. Students As A, Or The Only, Source Of Reliable Information on Teaching Effectiveness

Let us return to the example of the three professors, A, B, and C, who teach classes X, Y, and Z, respectively. There are two questions to be addressed here: (a) Would one be justified in believing that students provide reliable information on teaching effectiveness? (b) If yes, would one be justified in believing that students provide the only source of reliable information on teaching effectiveness? In my view, one would not be justified in holding either belief. There are four reasons:


The Public-Good Argument: The advocates of the SET process would argue: The university is a business, and the student its customer. And since the customer is always right, customer opinion must drive the business plan. Mainstream economists would argue that this is a false analogy. Their reason is that these same advocates are assuming that the provision of tertiary education is a "private good." This (economists would argue) is not so: It is a "public good." (Note 7) As such, students are not solely qualified to evaluate course content, and the pedagogical style of a faculty member.

The Student-Instructor Relationship Is Not One of Customer-Purveyor, And Hence Not A Relationship Between Equals: As Stone (1995) noted,

Higher education makes a very great mistake if it permits its primary mission to become one of serving student "customers." Treating students as customers means shaping services to their taste. It also implies that students are entitled to use or waste the services as they see fit. Thus judging by enrollment patterns, students find trivial courses of study, inflated grades, and mediocre standards quite acceptable. If this were not the case, surely there would have long ago been a tidal wave of student protest. Of course, reality is that student protest about such matters is utterly unknown. Tomorrow, when they are alumni and taxpayers, today's students will be vitally interested in academic standards and efficient use of educational opportunities. Today, however, the top priority of most students is to get through college with the highest grades and least amount of time, effort, and inconvenience.

As Michael Platt (1993) noted:

The questions typical of student evaluations teach the student to value mediocrity in teaching and even perhaps to resent good teachers who, to keep to high purposes, will use unusual words, give difficult questions, and digress from the syllabus, or seem to. Above all, such questions also conceive the relation of student and teacher as a contract between equals instead of a covenant between unequals. Thus, they incline the student, when he learns little, to blame the teacher rather than himself. No one can learn for another person; all learning is one's own .... (p. 31)

While the student-instructor relationship is not one of customer-purveyor, and hence not a relationship between equals, the SET process itself offers the illusion that it is. As Platt (1993) noted:

Merely by allowing the forms, the teacher loses half or more of the authority to teach. (p. 32)

Students Are Not Sufficiently Well-Informed To Pronounce On The Success Or Failure of the Academic Mission: Because of age and therefore relative ignorance,
students are not sufficiently well-informed about societal needs for educated persons, and employers' needs for skill sets. Therefore, students are not in a position to speak for all vested interests (including their own long-term interests). For example, Michael Platt (1993) noted:

Pascal says: while a lame man knows he limps, a lame mind does not know it limps, indeed says it is we who limp. Yet these forms invite the limpers to judge the runners; non-readers, the readers; the inarticulate, the articulate; and non-writers, the writers. Naturally, this does not encourage the former to become the latter. In truth, the very asking of such questions teaches students things that do not make them better students. It suggests that mediocre questions are the important questions, that the student already knows what teaching and learning are, and that any student is qualified to judge them. This is flattery. Sincere or insincere, it is not true, and will not improve the student, who needs to know exactly where he or she stands in order to take a single step forward. (p. 32)

In the same vein, Adams (1997) noted,

Teaching, as with art, remains largely a matter of individual judgment. Concerning teaching quality, whose judgment counts? In the case of student judgments, the critical question, of course, is whether students are equipped to judge teaching quality. Are students in their first or second semester of college competent to grade their instructors, especially when college teaching is so different from high school? Are students who are doing poorly in their courses able to objectively judge their instructors? And are students, who are almost universally considered as lacking in critical thinking skills, often by the administrators who rely on student evaluations of faculty, able to critically evaluate their instructors? There is substantial evidence that they are not. (p. 31)

The Anonymity of The Respondent: As noted above, the SET process provides that the identity of the respondent to the SET questionnaire would or could never be disclosed publicly. This fact contains a latent message to students. That is, in the SET process, there are no personal consequences for a negligent, false, or even malicious representation. There is no "student responsibility" in student evaluations.

It is as if the student was being assured: "We trust you. We do not ask for evidence, or reasons, or authority. We do not ask about your experience or your character. We do not ask your name. We just trust you. Your opinions are your opinions. You are who you are. In you we trust." Most human beings trust very few other human beings that much. The wise do not trust themselves that much. [Platt (1993, p. 34)]

III.2. Opinion Misrepresented As Fact Or Knowledge


A major conceptual problem with the SET process is that opinion is misrepresented as fact or knowledge, not to mention the unintended harm that this causes to all parties. As Michael Platt (1993) noted:

I cannot think that the habit of evaluating one's teacher can encourage a young person to long for the truth, to aspire to achievement, to emulate heroes, to become just, or to do good. To have one's opinions trusted utterly, to deliver them anonymously, to have no check on their truth, and no responsibility for their effect on the lives of others are not good for a young person's moral character. To have one's opinions taken as knowledge, accepted without question, inquiry, or conversation is not an experience that encourages self-knowledge. (pp. 33-34)

He continued:

What they teach is that "Opinion is knowledge." Fortunately, the student may be taught elsewhere in college that opinion is not knowledge. The student of chemistry will be taught that the periodic table is a simple, intelligible account of largely invisible elements that wonderfully explains an enormous variety of visible but heterogeneous features of nature. (p. 32)

This misrepresentation of opinion as fact or knowledge raises problems in statistical analysis of the SET data in that any operational measure of "teaching effectiveness" will not be, by definition, a unique and immutable metric. [This is one of the concerns raised in the next section.] In fact, I claim that the metric itself does not exist, or the presumption that it does is pure and unsubstantiated fiction. The assessment of these claims is the next concern.

To initiate discussion, return to the example of the three professors, A, B, and C, who teach classes X, Y, and Z, respectively. From data extracted from the SMIQ, recall that A in X scored 4.5, B in Y scored 3.0, and C in Z scored 2.5. Two premises of the conventional SET process are: (i) there exists a unique and an immutable metric, "teaching effectiveness," and (ii) the operational measure of this metric can be gleaned from data captured by the SMIQ, or by a latent-variable analysis (most commonly, factor analysis) of a number of related questions. The question to be addressed here is: Would one be justified in believing that these two premises are true?

In my view, neither premise is credible. The first premise is not true because to assume otherwise is to contradict both the research literature, and casual inspection. There are three inter-related aspects to this claim:

1. The first premise contains the uninspected supposition that through introspection, any student can "know" an unobservable metric called "teaching effectiveness," and can then be relied upon to accurately report her measurement of it in the SET document. (Note 8)
2. The literature makes quite clear that within any group of students one can find multiple perceptions of what constitutes "teaching effectiveness" (e.g., Fox (1983)). (Note 9)
3. If a measure is unobservable, its metric cannot be claimed to be also unambiguously unique and immutable. (Note 10) To argue otherwise is to be confronted by a bind: A measure cannot be subjective, and its metric
objective.

That said, what could account for the subjective nature of the term, "teaching effectiveness"? One explanation arises from the existence of two distinct motivations for attending university, or alternatively for enrolling in a given program. The details are these:

1. One motivation is the "education-as-an-investment-good" view. This is tantamount to the view that "going to university" will enhance one's prospects of obtaining a high-paying and/or an intellectually-satisfying job upon graduation. Latent in this view is the fact or belief that many employers take education as a signal of the productive capability of a university graduate as a job applicant [Spence (1974), and Molho (1997, part 2)]. (Note 11)
2. The other motivation is the "education-as-a-consumption-good" view. This view is tantamount to some mix of these five views: (a) that education is to be pursued for education's sake, (b) that "going to university" must be above all else enjoyable, (c) that higher education is a democracy, (d) that in this democracy, learning must be fun, and (e) that to be educated, students must like their professor. (Notes 12, 13)

Thus, any student can be seen as holding some linear combination of these two views. What differentiates one student from the next (at any point in time) is the weighting of this combination.

Next, consider the second premise. It states that the operational measure of the metric, "teaching effectiveness," can be gleaned from the SET data in general, and from the SMIQ in particular. In my view, one is not justified in assuming the second premise is true because the metric, "teaching effectiveness," is unobservable and subjective. (Note 14) As such, the data captured by the conventional SET process in general, and the SMIQ in particular, can at best measure "instructor popularity" or "student satisfaction" [Damron (1995)].

An example of this subjectiveness can be found in the following passage from Cornell University's (1997) Science News:

Attention teachers far and wide: It may not be so much what or how you teach that will reap high student evaluations, but something as simple as an enthusiastic tone of voice. And beware, administrators, if you use student ratings to judge teachers: Although student evaluations may be systematic and reliable, a Cornell university study has found that they can be totally invalid. Yet many schools use them to determine tenure, promotion, pay hikes and awards.

These warnings stem from a new study in which a Cornell professor taught the identical course twice with one exception – he used a more enthusiastic tone of voice the second semester – and student ratings soared on every measure that second semester.

Those second-semester students gave much higher ratings not only on how knowledgeable and tolerant the professor was and on how much they say they learned, but even on factors such as the fairness of grading policies, text quality, professor organization, course goals and professor accessibility.

And although the 249 students in the second-semester course said they
learned more than the 229 students the previous semester believed they had learned, the two groups performed no differently on exams and other assessment measures.

"This study suggests that factors totally unrelated to actual teaching effectiveness, such as the variation in a professor's voice, can exert a sizable influence on student ratings of that same professor's knowledge, organization, grading fairness, etc.," said Wendy M. Williams, associate professor of human development at Cornell. Her colleague and co-author, Stephen J. Ceci, professor of human development at Cornell, was the teacher evaluated by the students in a course on developmental psychology that he has taught for almost 20 years.

The assertion that the data captured by the conventional SET process in general, and the SMIQ in particular, measure at best "instructor popularity" or "student satisfaction" is echoed by Altschuler (1999). He wrote:

At times, evaluations appear to be the academic analogue to "Rate the Record" on Dick Clark's old "American Bandstand," in which teen-agers said of every new release, "Good beat, great to dance to, I'd give it a 9." Students are becoming more adjectival than analytical, more inclined to take faculty members' wardrobes and hairstyles into account when sizing them up as educators.

IV. Fallacies Of A Statistical Sort Inherent In The SET Process

In this section, I outline potential fallacies of a statistical sort inherent in the SET process. There are two: (a) that under all circumstances the SMIQ provides a cardinal measure of "teaching effectiveness" of an instructor, and (b) that in the absence of statistical controls, the SMIQ provides an ordinal measure of "teaching effectiveness" of an instructor. (Notes 15, 16)

IV.1. Ascribing A Cardinal Measure of Teaching Effectiveness To An Instructor Based on The SMIQ

Return to the example of the three professors, A, B, and C, who teach classes X, Y, and Z, respectively.
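
For reference, the difference between a cardinal and an ordinal reading of the example's raw means (4.5, 3.0, 2.5, with a reference-group mean of 3.5) can be sketched as plain arithmetic. A cardinal reading treats ratios of the means as meaningful quantities, whereas an ordinal reading uses only their rank order. The helper name `percent_more` is a hypothetical label introduced here for illustration; the third ratio works out to roughly 28.6%.

```python
# Raw SMIQ means from the running example.
means = {"A": 4.5, "B": 3.0, "C": 2.5}
reference_mean = 3.5


def percent_more(score, baseline):
    """Percentage by which `score` exceeds `baseline` (a ratio, i.e. cardinal, reading)."""
    return round((score / baseline - 1.0) * 100.0, 1)


# Cardinal reading: ratios of means are treated as meaningful quantities.
cardinal = {
    "A vs B": percent_more(means["A"], means["B"]),
    "B vs C": percent_more(means["B"], means["C"]),
    "A vs reference": percent_more(means["A"], reference_mean),
}

# Ordinal reading: only the rank order of the means is used.
ordinal_ranking = sorted(means, key=means.get, reverse=True)

print(cardinal)
print(ordinal_ranking)
```

The sketch shows only what each reading would assert; whether either reading is warranted is the question this section takes up.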
Recall that A in X scored 4.5, B in Y scored 3.0, C in Z scored 2.5, and the reference group scored 3.5. A premise of the SET process is that these averages are cardinal measures of "teaching effectiveness." The question to be addressed here is: Would one be justified in believing that this premise is true? That is, would one be justified in believing that A is 50% "more effective" than B, that B is 20% "more effective" than C, or that A is 28% "more effective" than the average? (Note 17) In my view, one would not be justified in believing any such claim simply because of the argument outlined in the previous section; that is, a unique and an immutable metric, "teaching effectiveness," does not exist.

IV.2. The Rank Ordering Of Instructors By Teaching Effectiveness Based on The SMIQ

Return again to the example of three professors, A, B, and C, who teach classes X, Y, and Z, respectively. An alternative premise of the conventional SET process is that
the averages of the data captured by the SMIQ serve as a basis for an ordinal measure of "teaching effectiveness." The question to be addressed here is: Would one be justified in believing that this premise is true? That is, would one be justified in believing that A is "more effective" than B, or that B is "more effective" than C? In my view, this belief could be seen as justifiable: (a) if the SMIQ captures an unequivocal reading of "teaching effectiveness" (see above), and (b) if the subsequent analysis controls for the many variables which confound the data captured by the SMIQ. (Note 18)

What are these confounding variables that require control? To answer this question, two studies are worthy of mention. One, in a review of the literature, Cashin (1990) reports that (in the aggregate) students do not provide SET ratings of teaching performance uniformly across academic disciplines. (Note 19)

Two, in their review of the literature, Mason et al. (1995, p. 404) note that there are three clusters of variables which affect student perceptions of the teaching effectiveness of faculty members. These clusters are: (a) student characteristics, (b) instructor characteristics, and (c) course characteristics. (Note 20) They also note that only one of these clusters ought to be included in any reading of "teaching effectiveness." This is the cluster, "instructor characteristics."

Commenting on prior research, Mason et al. (1995, p. 404) noted:

A … virtually universal problem with previous research is that the overall rating is viewed as an effective representation of comparative professor value despite the fact that it typically includes assessments in areas that are beyond the professor's control. The professor is responsible to some extent for course content and characteristics specific to his/her teaching style, but is unable to control for student attitude, reason for being in the course, class size, or any of the rest of those factors categorized as student or course characteristics above. Consequently, faculty members should be evaluated on a comparative basis only in those areas they can affect, or more to the point, only by a methodology that corrects for those influences beyond the faculty member's control.

By comparing raw student evaluations across faculty members, administrators implicitly assume that none of these potentially mitigating factors has any impact on student evaluation differentials, or that such differentials cancel out in all cases. The literature implies that the former postulate is untrue.

The true import of the above is found again in Mason et al. (1995). Using an ordered-probit model, (Note 21) they demonstrate that student characteristics, instructor characteristics, and course characteristics do impact the response to the SMIQ in the SET dataset. They wrote:

Professor characteristics dominated the determinants of the summary measures of performance, and did so more for those summary variables that were more professor-specific. However, certain course- and student-specific characteristics were very important, skewing the rankings based on the raw results. Students consistently rewarded teachers for using class time wisely, encouraging analytical decision making, knowing when students did not understand, and being well prepared for class. However, those professors who gave at least the impression of lower grades, taught more difficult courses, proceeded at a pace students did not like, or did not
11 of 23stimulate interest in the material, fared worse. (p 414) Mason et al. (1995) then wrote: Based on the probit analysis, an alternative rankin g scheme was developed for faculty that excluded influences beyond the pro fessor's control. These rankings differed to some extent from the raw ranki ngs for each of the aggregate questions. As a result, the validity of t he raw rankings of faculty members for the purposes of promotion, tenure, and raises should be questioned seriously. … Administrators should adjus t aggregate measures of teaching performance to reflect only those items within the professors' control, so that aggregates are more likely to be p roperly comparable and should do so by controlling for types of courses, l evels of courses, disciplines, meeting times, etc. … Administrators f ailing to do this are encouraged to reconsider the appropriateness of agg regate measures from student evaluations in promotion, tenure, and salar y decisions, concentrating instead on more personal evaluations such as analys is of pedagogical tools, peer assessments, and administrative visits. (p. 41 4) It may be useful to ask: To what extent a re the findings of Mason et al. (1995) unique? Surprisingly, they are not; they echo those of other studies, some recent, and some more than a quarter-century old. For example, Miriam Rodin and Burton Rodin (1972) writing in Science present a study in which they correlated an object ive measure of "good teaching" (viz., a student's performance o n a calculus test) with a subjective measure of "good teaching" (viz., a student's evalu ation of her professor) holding constant the student's initial ability in calculus. What they found is that these two measures were not orthogonal or uncorrelated as som e might expect, but something more troublesome. These two variables had a correla tion coefficient less than –0.70, and these two accounted for more about half of the vari ance in the data. How did they interpret their findings? 
The last sentence in their paper states: "If how much students learn is considered to be a major component of good teaching, it must be concluded that good teaching is not validly measured by student evaluations in their current form." How might others interpret their findings? They suggest the individual instructor is in a classic double-bind: If she attempts to maximize her score on the SMIQ, then she lowers student performance. Alternatively, if she attempts to maximize student performance, then her score on the SMIQ suffers. This raises the question: In such a dynamic, how can one possibly use SET data to extract a meaningful measure of "teaching effectiveness"?

In a different study (one concerned with the teaching evaluations for the Department of Mathematics at Texas A&M University, and one which entails the analysis of the correlation coefficients for arrays of variables measuring "teaching effectiveness" and "course characteristics"), Rundell (1996) writes: "(T)he analysis we have performed on the data suggests that the distillation of evaluations to a single number without taking into account the many other factors can be seriously misleading" (p. 8).

V. Why Has The Conventional SET Process Not Been Discarded?

Given that the likelihood of deriving meaningful and valid inferences from raw SET data is nil, the question remains: Why is the conventional SET process (with its conceptual and statistical shortcomings) employed even to this day, and by those who highly revere the power of critical thinking?

To my mind, there are three answers to this question. The first answer concerns political expediency: while fatally flawed, raw SET data can be used as a tautological device, that is, to justify almost any personnel decision. As a professor of economics at Indiana University and the Editor of The Journal of Economic Education noted:

End of term student evaluations of teaching may be widely used simply because they are inexpensive to administer, especially when done by a student in class, with paid staff involved only in the processing of the results… Less-than-scrupulous administrators and faculty committees may also use them … because they can be dismissed or finessed as needed to achieve desired personnel ends while still mollifying students and giving them a sense of involvement in personnel matters. [Becker (2000, p. 114)]

The second is offered by Donald Katzner (1991). He asserted that in their quest to describe, analyze, understand, know, and make decisions, western societies have accepted (for well over five hundred years) the "myth of synonymity between objective science and measurement" (p. 24). (Note 22) He wrote:

[W]e moderns, it seems, attempt to measure everything…. We evaluate performance by measurement…. What is not measurable we strive to render measurable, and what we cannot, we dismiss it from our thoughts and justify our neglect by assigning it the status of the "less important." … A moment's reflection, however, is all that is needed to realize that measurement cannot possibly do everything we expect it to do. … by omitting from our considerations what cannot be measured, or what we do not know how to measure, often leads to irrelevance and even error. (p. 18)

The third reason is offered by Imre Lakatos (1978) in his explanation as to why prevailing scientific paradigms are rarely replaced or overthrown.
His explanation contains these elements:

1. What ought to be appraised in the philosophy of the sciences is not an isolated individual theory, but a cluster of interconnected theories, or what he terms "scientific research programs" (hereafter SRP).
2. An SRP protects a "hard core" set of unquestioned and untestable statements. These statements are accepted as "fact."
3. Stated differently, the hard core of an SRP is surrounded by a "protective belt" of "auxiliary hypotheses."
4. One or more of the hard core statements cannot be refuted without dismantling the entire cognitive edifice, which happens in practice only very rarely. That said, it follows that any departure from the hard core of an SRP is tantamount to the creation of a new and different SRP.

Thus, in my view, the conventional SET process is the artifact of an SRP. Judging from the substance of its protective belt, and from the disciplinary affiliations of its proponents or advocates, this is an SRP defined and protected by a cadre of psychologists and educational administrators. (Notes 23, 24)

VI. Conclusion


In the present work, I have advanced two arguments, both of which question the appropriateness of using raw SET data (as the only source of data) in the determination of "teaching effectiveness." The first argument identified two types of fallacies in this methodology. One is conceptual, and the other statistical. Along the way, I argued by implication that the conceptual fallacies cannot be remedied, but that one of the statistical fallacies can, by means of the collection of additional data and the use of an appropriate statistical technique of the sort outlined in the study of Mason et al. (1995), which I also discussed.

The second argument is centered on the question: Why do the current practices used in the determination of "teaching effectiveness" ignore these two fallacies? I offered three answers to this question. These are: (a) that the conventional SET process offers to any university administration a politically expedient performance measure, and (b) that the conventional SET process may be seen as an example of: (i) Katzner's (1991) "myth of synonymity between objective science and measurement," and (ii) Lakatos' (1978) general explanation of the longevity of SRPs.

Two implications flow from these arguments, and the related discussion. These are as follows:

One, the present discussion should not be seen as tantamount to an idle academic debate. On the contrary, since the SET data have been entered as evidence in courts of law and quasi-legal settings [Adams (1997), Gillmore (1984), and Haskell (1997d)], and since the quality and the interpretation of these data can impact the welfare of individuals, it is clear that the present paper has import and bearing to the extent that: (i) it explicates the inadequacies, and unintended implications, of using raw SET data in the "summative" function, and (ii) it explains the present resistance of the conventional SET process to radical reform.
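The statistical remedy summarized above, that is, controlling for course characteristics before comparing instructors, can be illustrated with a minimal sketch. It is not the ordered-probit procedure of Mason et al. (1995). It is a much simpler stand-in that compares instructors only with peers teaching the same type of course, and it uses medians so that the ratings are treated as ordinal. The data, the single stratifying variable, and the names RATINGS, median_by_instructor, and within_type_standing are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical ratings on a 1-5 scale: (instructor, course_type, rating).
# Electives are assumed to draw higher ratings across the board.
RATINGS = [
    ("A", "elective", 4), ("A", "elective", 4), ("A", "elective", 5),
    ("B", "elective", 5), ("B", "elective", 5), ("B", "elective", 4),
    ("C", "required", 3), ("C", "required", 3), ("C", "required", 4),
    ("D", "required", 2), ("D", "required", 2), ("D", "required", 3),
]


def median_by_instructor(ratings):
    """Raw summary: the median rating each instructor received."""
    by_instructor = defaultdict(list)
    for instructor, _course_type, rating in ratings:
        by_instructor[instructor].append(rating)
    return {name: median(values) for name, values in by_instructor.items()}


def within_type_standing(ratings):
    """Share of same-course-type peers whose median is strictly lower.

    Assumes (for illustration) that each instructor teaches one course type.
    Returns None for an instructor with no peers in the same course type.
    """
    course_type_of = {name: ctype for name, ctype, _rating in ratings}
    medians = median_by_instructor(ratings)
    standing = {}
    for name, ctype in course_type_of.items():
        peers = [p for p, t in course_type_of.items() if t == ctype and p != name]
        if not peers:
            standing[name] = None
            continue
        standing[name] = sum(medians[p] < medians[name] for p in peers) / len(peers)
    return standing


if __name__ == "__main__":
    print("raw medians:", median_by_instructor(RATINGS))
    print("within-type standing:", within_type_standing(RATINGS))
```

With these made-up data, instructor A outranks instructor C on raw medians, yet A sits at the bottom of the elective group while C sits at the top of the required group. This is the kind of reversal that a comparison of raw scores conceals.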
Two, given the present assessment of the conventional SET process, and given the legal repercussions of its continued use, the question becomes: What to do? Here, the news is both good and bad. The bad news is that nothing can be done to obviate the conceptual fallacies outlined in the above pages. The inescapable truth is that the SMIQ in particular, and the SET dataset in general, do not measure "teaching effectiveness." They measure something akin to the "popularity of the instructor," which (it must be emphasized) is quite distinct from "teaching effectiveness." [Recall the discussion of Rodin and Rodin (1972) in the above.] The good news is that one of the statistical fallacies inherent in the conventional SET process can be overcome by capturing and then using background data on student, instructor, and course characteristics, in the mold of Mason et al. (1995).

That said, I leave the last word to what (in my opinion) amounts to a classic in its own time. Mason et al. (1995) state, and I repeat:

Administrators should adjust aggregate measures of teaching performance to reflect only those items within the professors' control, so that aggregates are more likely to be properly comparable and should do so by controlling for types of courses, levels of courses, disciplines, meeting times, etc. … Administrators failing to do this are encouraged to reconsider the appropriateness of aggregate measures from student evaluations in promotion, tenure, and salary decisions, concentrating instead on more personal evaluations such as analysis of pedagogical tools, peer assessments, and administrative visits. (p. 414)

Notes

This article was prepared during the winter semester of 2000 while the author was on a


half-year sabbatical at the University of Manitoba (Winnipeg, Canada). Without implicating them for any remaining errors and oversights, the author thanks Donald Katzner, Paul Mason, Stuart Mckelvie, and three anonymous referees, for many useful comments and critiques.

1. For reviews of the literature that are essentially supportive of the SET process, see d'Apollonia and Abrami (1997), Greenwald and Gilmore (1997), Marsh (1987), Marsh and Roche (1997), and McKeachie (1997). And for reviews of the literature that are highly critical of some mix of the conceptual, statistical, and legal foundations of the SET process, see Damron (1995), and Haskell (1997a, 1997b, 1997c, and 1997d).

2. The terms "formative" and "summative" are due to Scriven (1967).

3. On such matters, the position of the Canadian Association of University Teachers on the admissibility of SET data appears unambiguous in light of statements like these: "Appropriate professional care should be exercised in the development of questionnaires and survey methodologies. Expert advice should be sought, and reviews of the appropriate research and scientific evidence should be carried out. Comments from faculty and students and their associations or unions should be obtained at all stages in the development of the questionnaire. Appropriate trials or pilot studies should be conducted and acceptable levels of reliability and validity should be demonstrated before a particular instrument is used in making personnel decisions" [Canadian Association of University Teachers (1998, p. 3)]. In a footnote to this passage, this document continues, "Most universities require at least this standard of care before investigators are permitted to conduct research on human subjects. It is unacceptable that university administrations would condone a lesser standard in the treatment of faculty, particularly when the consequences of inadequate procedures and methods can be devastating to teachers' careers."

4. The present characterization represents an amalgam of three sources: (a) first-hand knowledge of the SET documents used at three Canadian universities; (b) a small, non-random sample of SET documents for four universities taken from the internet [viz., University of Minnesota, University of British Columbia, York University (Toronto), and University of Western Ontario]; and (c) non-institutional-specific comments made in the voluminous literature on the SET process.

5. The phrase "a cardinally- or ordinally-measured minimal level of 'teaching effectiveness'" requires four comments. One, examples of cardinal measures are: The heights of persons A, B, and C are 6'1", 5'10", and 5'7" respectively. And using the same data, examples of ordinal measures are: A is taller than B, B is taller than C, and A is taller than C. Two, the present measurement terminology is used in economics [Pearce (1992)], and (it can be said) is distinct from that used in other disciplines [e.g., Stevens (1946), Siegel (1956, p. 30), and Hands (1996)]. Three, it is the existence of a unique and an immutable metric (in the above examples, distance or length) that makes both cardinal and ordinal measures meaningful. Four, as the above examples make clear, an ordinal measure can be inferred from a cardinal measure, but not the reverse.

6. An example of this statement is the instrument used by York University (Toronto). An exception to this statement is that used by the University of Minnesota.

7. The distinction between a "private good" and a "public good" can be rephrased in several, roughly equivalent ways. These are: (i) that tertiary education has externalities; (ii) that the net social benefits of tertiary education differ from the net private benefits; (iii) that the benefits of tertiary education do not accrue to, nor are its costs borne by, students solely; and (iv) that students do not pay full freight. Because of this, one could argue that (in the evaluation of "teaching effectiveness") the appropriate populations of opinion to be sampled are all groups who share in the social benefits and social costs. These would include not only students, but also members of the Academy, potential employers, and other members of society (such as taxpayers). In sum, because tertiary education is not a private, but a public good, students are not solely qualified to evaluate course content, and the pedagogical style of a faculty member.

8. A personal vignette provides some insight into the potential seriousness of the inaccuracy of self-reported data. In the fall of 1997, I taught an intermediate microeconomics course. The mark for this course was based solely on two mid-term examinations, and a final examination. Each mid-term examination was marked, and then returned to students and discussed in the class following the examination. Now, the course evaluation form has the question, "Work returned reasonably promptly." The response scale ranges from 0 for "seldom," to 5 for "always." Based on the facts, one would expect (in this situation) an average response of 5. This expectation was dashed in that 50% of the sample gave me a 5, 27.7% gave me a 4, and 22.2% gave me a 3. The import of this? If self-reported measures of objective metrics are inaccurate (as this case indicates), how can one be expected to trust the validity of subjective measurements like "teaching effectiveness"?

9. Indeed, it appears that students and professors can hold different perceptions as to what constitutes "appropriate learning," and hence "appropriate teaching," in tertiary education. For example, Steven Zucker (1996), professor of Mathematics at Johns Hopkins University, laments the gulf between the expectations of students and instructors. He writes: "The fundamental problem is that most of our current high school graduates don't know how to learn or even what it means to learn (a fortiori to understand) something. In effect, they graduate high school feeling that learning must come down to them from their teachers. That may be suitable for the goals of high school, but it is unacceptable at the university level. That the students must also learn on their own, outside the classroom, is the main feature that distinguishes college from high school." (p. 863).

10. Alternatively, Weissberg (1993, p. 8) noted that one cannot measure what one cannot define.

11. These assertions have been borne out empirically under the rubric, "sheepskin effect." The interested reader is directed to Belman and Heywood (1991 and 1997), Heywood (1994), Hungerford and Solon (1987), and Jaeger and Page (1996).

12. Some of these views contradict the raison d'être and the modus operandi of tertiary education. For example, Frankel (1968) wrote: "Teaching is a professional relationship, not a popularity contest. To invite students to participate in the selection or promotion of their teachers exposes the teacher to intimidation." (pp. 30-31) In fact, the Canadian Association of University Teachers (1986) speaks of the irrelevance of "popularity" as a gauge of professional performance by stating: "The university is not a club; it is dedicated to excellence. The history of universities suggests that its most brilliant members can sometimes be difficult, different from their colleagues, and unlikely to win a popularity contest. 'The university is a community of scholars and it is to be expected that the scholars will hold firm views and wish to follow their convictions. Tension, personality conflicts and arguments may be inevitable by-products.'"

13. As Crumbley (1995) noted: "There is another universal assumption that students must like an instructor to learn. Not true. Even if they dislike you and you force them to learn by hard work and low grades, you may be a good educator (but not according to SET scores). SET measures whether or not students like you, and not necessarily whether you are teaching them anything. Instructors should be in the business of educating and teaching students--not SET enhancement. Until administrators learn this simple truth, there is little chance of improving higher education."

14. It seems that some psychologists would argue that latent measures of "teaching effectiveness" can be uncovered by a factor analysis of the SET data [e.g., d'Apollonia and Abrami (1997)]. Also, it seems that the motivation for such a claim is the intellectual appeal and success of studies of a completely different ilk. A case in point is Linden (1977), who uses factor analysis to uncover dimensions which account for event-specific performances of athletes in the Olympic decathlon. However, the expectation that the success found in studies such as Linden (1977) can be replicated in the factor analysis of SET data is unwarranted in that this expectation ignores the fact that the SET data (unlike Linden's data) are opinion based or subjective, have measurement error, and are in need of statistical controls. In brief, it is my view that the use of factor analysis on SET data to uncover latent measures such as "teaching effectiveness" is analogous to trying to "unscramble an egg" in that it just cannot be done. Besides, as the authors of a popular text on multivariate statistics observe, "When all is said and done, factor analysis remains very subjective" [Johnson and Wichern (1988, p. 422)].

15. The terms, ordinal and cardinal measures, are defined in a footnote above. In conjunction with that, it should be noted that the type of a variable governs the statistical manipulations permissible [Hands (1996, pp. 460-62)], and "(T)he use of ordinally calibrated variables as if they were fully quantified … results in constructions that are without meaning, significance, and explanatory power. Treating ordinal variables as cardinal … can mislead an investigator into thinking the analysis has shed light on the real world" [Katzner (1991, p. 3)]. This latter point captures an important dimension of the present state of research on SET data, and of the present paper.

16. For reasons of brevity, I have concentrated on only two of several statistical problems. These are "measurement error" and "omitted variables." By doing so, I have overlooked other statistical problems inherent in the SET data like the unreliability of self- and anonymous-reporting, inadequate sample size, sample-selection bias, reverse causation, and teaching to tests. The reader interested in a more complete treatment of some of these issues may wish to consult readings such as Aiger and Thum (1986), Becker and Power (2000), Gramlich and Greenlee (1993), and Nelson and Lynch (1984).

17. As Rundell (1996) noted, in actual practice, this would mean: "…'Jones had a 3.94 mean on her student evaluations, and since this is 0.2 above the average for the Department, we conclude she is an above average instructor as judged by these questionnaires' is a statement that appears increasingly common" (p. 1).

18. Statistical controls are needed to the extent that they eliminate "observational equivalence." In this connection, two comments are warranted here. One, observational equivalence is said to exist when "alternative interpretations, with different theoretical or policy implications, are equally consistent with the same data… No analysis of the data would allow one to decide between the explanations, they are observationally equivalent. Other information is needed to identify which is the correct explanation of the data" [Smith (1999, p. 248)]. Two, Sproule (2000) has identified three distinct forms of observational equivalence in the interpretation of raw data from the SMIQ.

19. Cashin (1990) reports, for example, that professors of fine arts and music receive high scores on the SMIQ, and professors of chemistry and economics receive lower scores, all things being equal.

20. Mason et al. (1995) contend that those variables which fall under the "student-characteristics" rubric include: (i) reason for taking the course, (ii) class level of the respondent, (iii) student effort in the course, (iv) expected grade in the course, and (v) student gender. Those variables which fall under the "instructor-characteristics" rubric include: (i) the professor's use of class time, (ii) the professor's availability outside of class, (iii) how well the professor evaluates student understanding, (iv) the professor's concern for student performance, (v) the professor's emphasis on analytical skills, (vi) the professor's preparedness for class, and (vii) the professor's tolerance of opposing viewpoints and questions. Those variables which fall under the "course-characteristics" rubric include: (i) course difficulty, (ii) class size, (iii) whether the course is required or not, and (iv) when the course was offered.

21. For an elementary discussion of the ordered-probit model, see Pindyck and Rubinfeld (1991, pp. 273-274).

22. Katzner (1991) also states that this "blind pursuit of numbers" can lead to unintended, and unjust, outcomes. For example, "(W)hen the state secretly sterilizes individuals only because their 'measured intelligence' on flawed intelligence tests is too low, then bitterly dashed hopes and human suffering becomes the issue." (p. 18). That said, it would not be too difficult to claim that the "blind pursuit of numbers" by those responsible for the "summative" function has also led to unintended, and unjust, outcomes. [In fact, see Haskell (1997d) for details.]

23. Three comments seem warranted here. One, the enterprise of science can be seen as a "market process" [Walstad (1999)]. Two, the SRP of this cadre of psychologists and educational administrators could be viewed as a barrier to entry (of the epistemological sort) into the marketplace of ideas. Three, that said, perhaps the recommendation of Paul Feyerabend (1975) applies in this instance; that competition between epistemologies, rather than the monopoly of a dominant epistemology, ought to be encouraged.

24. While it is clear from the above that the protective belt of the SRP associated with the SET has survived many types of logical appraisals (or epistemological attacks), the question remains: Can this protective belt, and this SRP itself, continue to withstand such repeated attacks? I would hazard the opinion that, no, it cannot.

References

Adams, J.V. (1997), Student evaluations: The ratings game, Inquiry 1 (2), 10-16.

Aiger, D., and F. Thum (1986), On student evaluation of teaching ability, Journal of Economic Education, Fall, 243-265.


Altschuler, G. (1999), Let me edutain you, The New York Times Education Life Supplement, April 4.

Becker, W. (2000), Teaching economics in the 21st century, Journal of Economic Perspectives 14 (1), 109-120.

Becker, W., and J. Power (2000), Student performance, attrition, and class size, given missing student data, Economics of Education Review, forthcoming.

Belman, D., and J.S. Heywood (1991), Sheepskin effects in the returns to education: An examination on women and minorities, Review of Economics and Statistics 73 (4), 720-24.

Belman, D., and J.S. Heywood (1997), Sheepskin effects by cohort: Implications of job matching in a signaling model, Oxford Economic Papers 49 (4), 623-37.

Blunt, A. (1991), The effects of anonymity and manipulated grades on student ratings of instructors, Community College Review 18, Summer, 48-53.

Canadian Association of University Teachers (1986), What is fair? A guide for peer review committees: Tenure, renewal, promotion, Information Paper, November.

Canadian Association of University Teachers (1998), Policy on the use of anonymous student questionnaires in the evaluation of teaching, CAUT Information Service Policy Paper 4-43.

Cashin, W. (1990), Students do rate different academic fields differently, in M. Theall and J. Franklin, eds., Student Ratings of Instruction: Issues for Improving Practice, New Directions for Teaching and Learning, No. 43 (San Francisco, CA: Jossey-Bass).

Cornell University (1997), Cornell study finds student ratings soar on all measures when professor uses more enthusiasm: Study raises concerns about the validity of student evaluations, Science News, September 19th.

Crumbley, D.L. (1995), Dysfunctional effects of summative student evaluations of teaching: Games professors play, Accounting Perspectives 1 (1), Spring, 67-77.

Damron, J.C. (1995), The three faces of teaching evaluation, unpublished manuscript, Douglas College, New Westminster, British Columbia.

d'Apollonia, S., and P. Abrami (1997), Navigating student ratings of instruction, American Psychologist 52 (11), 1198-1208.

Feyerabend, P. (1975), Against Method (London: Verso).

Fox, D. (1983), Personal theories of teaching, Studies in Higher Education 8 (2), 151-64.

Frankel, C. (1968), Education and the Barricades (New York: W.W. Norton).

Gillmore, G. (1984), Student ratings as a factor in faculty employment decisions and periodic review, Journal of College and University Law 10, 557-576.

Gramlich, E., and G. Greenlee (1993), Measuring teaching performance, Journal of Economic Education, Winter, 3-13.

Grant, H. (1998), Academic contests: Merit pay in Canadian universities, Relations Industrielles / Industrial Relations 53 (4), 647-664.

Greenwald, A., and G. Gilmore (1997), Grading leniency is a removable contaminant of student ratings, American Psychologist 52 (11), 1209-17.

Hands, D.J. (1996), Statistics and the theory of measurement, Journal of the Royal Statistical Society, Series A 159 (3), 445-473.

Haskell, R.E. (1997a), Academic freedom, tenure, and student evaluations of faculty: Galloping polls in the 21st century, Education Policy Analysis Archives 5 (6), February 12.

Haskell, R.E. (1997b), Academic freedom, promotion, reappointment, tenure, and the administrative use of student evaluation of faculty (SEF): (Part II) Views from court, Education Policy Analysis Archives 5 (6), August 25.

Haskell, R.E. (1997c), Academic freedom, promotion, reappointment, tenure, and the administrative use of student evaluation of faculty (SEF): (Part III) Analysis and implications of views from the court in relation to accuracy and psychometric validity, Education Policy Analysis Archives 5 (6), August 25.

Haskell, R.E. (1997d), Academic freedom, promotion, reappointment, tenure, and the administrative use of student evaluation of faculty (SEF): (Part IV) Analysis and implications of views from the court in relation to academic freedom, standards, and quality of instruction, Education Policy Analysis Archives 5 (6), November 25.

Heywood, J.S. (1994), How widespread are sheepskin returns to education in the U.S.?, Economics of Education Review 13 (3), 227-34.

Hungerford, T., and G. Solon (1987), Sheepskin effects in the returns to education, Review of Economics and Statistics 69 (1), 175-77.

Jaeger, D., and M. Page (1996), Degrees matter: New evidence on sheepskin effects in the returns to education, Review of Economics and Statistics 78 (4), 733-40.

Johnson, R., and D. Wichern (1988), Applied Multivariate Statistical Analysis, Second Edition (Englewood Cliffs: Prentice-Hall).

Katzner, D. (1991), Our mad rush to measure: How did we get there?, Methodus 3 (2), 18-26.

Lakatos, I. (1978), The Methodology of Scientific Research Programmes (Cambridge: Cambridge University Press).


Linden, M. (1977), A factor analytic study of Olympic decathlon data, Research Quarterly 48 (3), 562-568.

Malthus, T. (1836), Principles of Political Economy, 2nd Edition.

Marsh, H. (1987), Students' evaluations of university teaching: Research findings, methodological issues, and directions for future research, International Journal of Educational Research 11, 253-388.

Marsh, H., and L. Roche (1997), Making students' evaluations of teaching effectiveness effective: The central issues of validity, bias, and utility, American Psychologist 52 (11), 1187-97.

Mason, P., J. Steagall, and M. Fabritius (1995), Student evaluations of faculty: A new procedure for using aggregate measures of performance, Economics of Education Review 12 (4), 403-416.

McKeachie, W. (1997), Student ratings: The validity of use, American Psychologist 52 (11), 1218-1225.

Molho, I. (1997), The Economics of Information: Lying and Cheating in Markets and Organizations (Oxford: Blackwell).

Nelson, J., and K. Lynch (1984), Grade inflation, real income, simultaneity, and teaching evaluations, Journal of Economic Education, Winter, 21-37.

Pearce, D.W., ed. (1992), The MIT Dictionary of Modern Economics, 4th Edition (Cambridge, MA: MIT Press).

Pindyck, R., and D. Rubinfeld (1991), Econometric Models & Economic Forecasts (New York: McGraw-Hill).

Platt, M. (1993), What student evaluations teach, Perspectives In Political Science 22 (1), 29-40.

Rifkin, T. (1995), The status and scope of faculty evaluation, ERIC Digest.

Rodin, M., and B. Rodin (1972), Student evaluations of teaching, Science 177, September, 1164-1166.

Rundell, W. (1996), On the use of numerically scored student evaluations of faculty, unpublished working paper, Department of Mathematics, Texas A&M University.

Ruskai, M.B. (1996), Evaluating student evaluations, Notices of The American Mathematical Society 44 (3), March 1997, 308.

Scriven, M. (1967), The methodology of evaluation, in R. Tyler, R. Gagne, and M. Scriven, eds., Perspectives in Curriculum Evaluation (Skokie, IL: Rand McNally).

Siegel, S. (1956), Nonparametric Statistics For The Behavioral Sciences (New York: McGraw-Hill).

Smith, R. (1999), Unit roots and all that: The impact of time-series methods on macroeconomics, Journal of Economic Methodology 6 (2), 239-258.

Spence, M. (1974), Market Signaling (Cambridge, MA: Harvard University Press).

Sproule, R. (2000), The underdetermination of instructor performance by data from the student evaluation of teaching, Economics of Education Review (in press).

Stevens, S.S. (1946), On the theory of scales of measurement, Science 103, 677-680.

Stone, J.E. (1995), Inflated grades, inflated enrollment, and inflated budgets: An analysis and call for review at the state level, Education Policy Analysis Archives 3 (11).

Walstad, A. (1999), Science as a market process, unpublished paper, Department of Physics, University of Pittsburgh at Johnstown.

Weissberg, R. (1993), Standardized teaching evaluations, Perspectives In Political Science 22 (1), 5-7.

Zucker, S. (1996), Teaching at the university level, Notices of The American Mathematical Society 43 (8), August, 863-865.

About the Author

Robert Sproule
Department of Economics
Williams School of Business and Economics
Bishop's University
Lennoxville, Québec, J1M 1Z7, Canada

Robert Sproule is a Professor of Economics at Bishop's University. He received his Ph.D. in Economics at the University of Manitoba (Winnipeg, Canada). His research interests include statistics, econometrics, and decision making under uncertainty.
His research appears in journals such as the Bulletin of Economic Research, Communications in Statistics: Theory and Methods, Economics Letters, the Economics of Education Review, the European Journal of Operations Research, Metroeconomica, Public Finance, and The Statistician.

Copyright 2000 by the Education Policy Analysis Archives

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, at College of Education, Arizona State University, Tempe, AZ 85287-0211 (602-965-9644). The Commentary Editor is Casey D. Cobb.

