Educational policy analysis archives


Material Information

Educational policy analysis archives
Physical Description:
Arizona State University
University of South Florida
Place of Publication:
Tempe, Ariz
Tampa, Fla
Publication Date:


Subjects / Keywords:
Education -- Research -- Periodicals   ( lcsh )
non-fiction   ( marcgt )
serial   ( sobekcm )

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E11-00203
usfldc handle - e11.203
System ID:



Education Policy Analysis Archives
Volume 9 Number 6, February 22, 2001, ISSN 1068-2341

A peer-reviewed scholarly journal. Editor: Gene V Glass, College of Education, Arizona State University

Copyright 2001, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold.

Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Teacher Test Accountability: From Alabama to Massachusetts

Larry H. Ludlow
Boston College

Abstract

Given the high stakes of teacher testing, there is no doubt that every teacher test should meet the industry guidelines set forth in the Standards for Educational and Psychological Testing. Unfortunately, however, there is no public or private business or governmental agency that serves to certify or in any other formal way declare that any teacher test does, in fact, meet the psychometric recommendations stipulated in the Standards. Consequently, there are no legislated penalties for faulty products (tests) nor are there opportunities for test takers simply to raise questions about a test and to have their questions taken seriously by an impartial panel. The purpose of this article is to highlight some of the psychometric results reported by National Evaluation Systems (NES) in their 1999 Massachusetts Educator Certification Test (MECT) Technical Report and, more specifically, to identify those technical characteristics of the MECT that are inconsistent with the Standards. A second purpose of this article is to call for the establishment of a standing test auditing organization with investigation and sanctioning power. The significance of the present analysis is twofold: a) psychometric results for the MECT are similar in nature to psychometric results presented as evidence of test development flaws in an Alabama class-action lawsuit dealing with teacher certification (an NES-designed testing system); and b) there was no impartial enforcement agency to whom complaints about the Alabama tests could be brought, other than the court, nor is there any such agency to whom complaints about the Massachusetts tests can be brought. I begin by reviewing NES's role in Allen v. Alabama State Board of Education, 81-697-N. Next I explain the purpose and interpretation of standard item analysis procedures and statistics. Finally, I present results taken directly from the 1999 MECT Technical Report and compare them to procedures, results, and consequences of procedures followed by NES in Alabama.

From its inception and continuing through present administrations, the Massachusetts Educator Certification Test (MECT) has attracted considerable public attention both regionally and around the world (Cochran-Smith & Dudley-Marling, in press). This attention is due in part to two disturbing facts: 1) educators seeking certification in Massachusetts have generally performed poorly on the test, and 2) in many instances politicians have used these test results to assert, among other things, that candidates who failed are “idiots” (Pressley, 1998).

The purpose of the MECT is “to ensure that each certified educator has the knowledge and some of the skills essential to teach in Massachusetts public schools” (National Evaluation Systems, 1999, p. 22). The Massachusetts Board of Education has raised the stakes on the MECT by enacting plans to sanction institutions of higher education (IHEs) with less than an 80% pass rate for their teacher candidates (Massachusetts Department of Education, 2000).
One consequence of this proposal is that most IHEs are considering requirements that the MECT be passed before students are admitted to their teacher education programs. In addition, Title II (Section 207) of the Higher Education Act of 1998 requires the compilation of state “report cards” for teacher education programs, which must include performance on certification examinations (U.S. Department of Education, 2000).

What all of this means is that poor performance on the MECT could prevent federal funding for professional development programs, limit federal financial aid to students, allow some IHEs to be labeled publicly “low performing”, and prove damaging at the state level when states are inevitably compared to one another upon release of the Title II report cards in October 2001. Given the personal, institutional, and national ramifications of the test results, there is no question that the MECT should be expected to meet the industry benchmarks for good test development practice as set forth in the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999). At this time, however, there is no public or private business or governmental agency, either within the Commonwealth of Massachusetts or nationally, that can certify or in any other formal way declare that the MECT does (or does not), in fact, meet the psychometric recommendations stipulated in the Standards. The National Board on Educational Testing and Public Policy (NBETPP) serves as an “independent organization that monitors testing in the US” but even it does not function as a regulatory agency (NBETPP, 2000).

In addition to the absence of a national regulatory agency, many state departments of education do not have the professionally trained staff to answer technical psychometric questions directly. Nor do they usually have the expertise on staff to confront a testing company, which they have contracted, and demand a sufficient response to a technical question raised by outside psychometricians. Furthermore, even when a database with the candidates' item-level responses is available for internal analysis, a state department of education does not typically conduct rigorous disconfirming analyses, e.g., evidence of adverse impact. Thus, most state departments are largely dependent on whatever information testing companies decide to release. The public is then left with an inadequate accountability process.

One purpose of this article is to highlight some of the psychometric results reported by National Evaluation Systems in their 1999 MECT Technical Report (NES, 1999). Specifically, this article identifies technical characteristics of the MECT that are inconsistent with the Standards. A second purpose of this article is to voice one more call for the establishment of a standing test auditing organization with powers to investigate and sanction (National Commission on Testing and Public Policy, 1990; Haney, Madaus & Lyons, 1993).

The significance of the present analysis is twofold. First, psychometric results reported by NES for the MECT are similar in nature to psychometric results entered as evidence of test development flaws in an Alabama class-action lawsuit dealing with teacher certification (Allen v. Alabama State Board of Education, 81-697-N).
That suit was brought by several African-American teachers who charged, among other things, that “the State of Alabama's teacher certification tests impermissibly discriminate[d] against black persons seeking teacher certification;” the tests “[were] culturally biased;” and the tests “[had] no relationship to job performance” (Allen, 1985, p. 1048). Second, there was no impartial enforcement agency to whom complaints about the Alabama tests could be brought, other than the court, nor is there any such agency to whom complaints about the Massachusetts tests can be brought. These two points are linked in an interesting and troubling way--NES, the Massachusetts Educator Certification Tests contractor, was also the contractor for the Alabama Initial Teacher Certification Testing Program (AITCTP).

Some of the criticism of debates about teacher testing, teacher standards, teacher quality, and accountability suggests that arguments are, in part, ideologically, rather than empirically, based (Cochran-Smith, in press). This may or may not be the case. This article, however, takes the stance that regardless of one's political ideology or philosophy about testing, the MECT is technically flawed. Furthermore, because of the lack of an enforceable accountability process, the public is powerless in its efforts to question the quality or challenge the use of this state-administered set of teacher certification examinations. In this article I argue that the consequences of high-stakes teacher certification examinations are too great to leave questions about technical quality solely in the hands of state agency personnel, who are often ill-prepared and under-resourced, or in the hands of test contractors, who may face obvious conflicts of interest in any aggressive analyses of their own tests.

In the sections that follow, I begin by reviewing NES's role in Allen v Alabama. Then I explain the purpose and interpretation of standard item analysis procedures and statistics. Finally I compare results taken directly from the 1999 MECT Technical Report with statistical results entered as evidence of test development flaws in Allen v Alabama.

NES and the AITCTP


Allen, et al. v. Alabama State Board of Education, et al.

In January 1980, National Evaluation Systems was awarded a contract on a non-competitive basis for the development of the Alabama Initial Teacher Certification Testing Program (AITCTP). Item writing for these tests began in the Spring of 1981, and the first administration of the tests took place on June 6, 1981. Allen v Alabama was brought just six months later on December 15th, 1981. The Allen complaint challenged the Alabama State Board of Education's requirement that applicants for state teacher certification pass certain standardized tests administered under the AITCTP. On October 14, 1983, class certification (Note 1) was granted, and the first trial was set for April 22, 1985. Subsequent to a pre-trial hearing on December 19, 1984 and “after substantial discovery was done,” (Note 2) an out-of-court settlement was reached on April 4, 1985. A Consent Decree was presented to the U.S. District Court April 8, 1985 (Note 3). The Attorney General for the State of Alabama immediately “publicly attacked the settlement” (Allen, 1985, p. 1050), claiming that it was illegal. Nonetheless, the consent decree was accepted by the court October 25, 1985 (Allen, Oct. 25, 1985). A succession of challenges and appeals on the legality and enforceable status of the settlement resulted (Note 4). For example, on February 5, 1986 the district court vacated its October 25th order approving the consent decree (Allen, February 5, 1985, p. 76). While the plaintiffs' appeal of the February 5th decision was pending at the 11th Circuit Court of Appeals, trial began in district court on May 5, 1986.

The AITCTP consisted of an English language proficiency examination, a basic professional studies examination, and 45 content-area examinations.
The purpose of the examinations was to measure “specific competencies which are considered necessary to successfully teach in the Alabama schools” (Allen, Defendants' Pre-Trial Memorandum, 1986, p. 21). A pool of 120 items for each exam was generated--100 of which were scorable and mostly remained unchanged across the first eight administrations. Extensive revisions were incorporated into most of the tests at the ninth administration. By the start of the May 1986 trial the tests had been administered 15 times in all.

A team of technical experts (Note 5) for the plaintiffs was hired in November 1983 (prior to the ninth administration of the exams) to examine test development, administration, and implementation procedures. The team was initially unsure about the form of the sophisticated statistical analyses they assumed would have to be conducted to test for the presence of “bias” and “discrimination”, the bases of the case. That is, the methodology for investigating what was then called “bias” and is now called “differential item functioning” was far from well established at that time (Baldus & Cole, 1980). Nevertheless, when the plaintiffs' team received the student-level item response data from the defendants, their first steps were to perform an “item analysis.” Such an analysis produces various item statistics and test reliability estimates. These initial analyses produced negative point-biserial correlations. Although point-biserial correlations are explained in detail below, suffice it to say at this point that it was a surprise to find negative point-biserial correlations between the responses that examinees provided on individual items and their total test scores. Such correlations are not an intended outcome from a well-designed testing program.

These statistical results prompted a detailed inspection of the content, format, and answers for all the individual items on the AITCTP tests.
Content analyses yielded discrepancies between the keyed correct responses in the NES test documents and the keyed correct responses in the NES-supplied machine scorable answer keys (i.e., miskeyed items were on the answer keys). This finding led to an inspection of the original NES in-house analyses, which revealed that negative point-biserials for scorable items existed in their own records from the beginning of the testing program and continuing throughout the eighth administration without correction.

What this meant for the plaintiffs was that NES had item analysis results in their own possession which indicated that there were miskeyed items. Nonetheless they implemented no significant changes in the exams until they were faced with a lawsuit and plaintiffs' hiring of the testing experts to do their own analyses. The defendants argued that it was normal for some problems to go undetected or uncorrected in a large-scale testing program because the overall effect is trivial for the final outcome. The problem with that argument was that many candidates were denied credit for test items on which they should have received credit, and some of those candidates failed the exam by only one point. In fact, as the plaintiffs argued, as many as 355 candidates over eight administrations of the basic professional skills exam alone should have passed but were denied that opportunity simply because of faulty items that remained on the tests (Millman, 1986, p. 285). It should be noted here that these were items that even one of the state's expert witnesses for the defense admitted were faulty (Millman, 1986, p. 280).

Establishing that there were flawed items with negative point-biserial correlations was critical to the plaintiffs' case. The plaintiffs presented as evidence page after page of so-called “failure tables” (Note 6) with the names of candidates for each test whose answers were mis-scored on these faulty items. Based upon these failure tables, any argument from defendants that the mis-keyed items did not change the career expectations for some candidates would most likely have failed.
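
To make the one-point argument concrete, the following is a minimal Python sketch of how a table of this kind could be assembled: each candidate is scored once with the answer key as used and once with a corrected key, and anyone who failed only under the flawed key is listed. The keys, responses, candidate labels, and cut score are all invented for illustration; the actual failure tables (Note 6) are not reproduced here and may have been built differently.

```python
def score(responses, key):
    """Number of responses that match the answer key."""
    return sum(1 for given, correct in zip(responses, key) if given == correct)


def failure_table(candidates, used_key, corrected_key, cut_score):
    """List candidates who failed under the key as used but would have
    passed had the corrected key been applied."""
    rows = []
    for name, responses in candidates.items():
        original = score(responses, used_key)
        rescored = score(responses, corrected_key)
        if original < cut_score <= rescored:
            rows.append((name, original, rescored))
    return rows


# Hypothetical five-item test; the third item is miskeyed as "B" (should be "D").
used_key = ["A", "C", "B", "D", "A"]
corrected_key = ["A", "C", "D", "D", "A"]
cut_score = 4  # hypothetical passing score

# Hypothetical candidates and their responses.
candidates = {
    "Candidate 1": ["A", "C", "D", "D", "B"],
    "Candidate 2": ["A", "C", "B", "D", "A"],
    "Candidate 3": ["B", "C", "D", "A", "A"],
}

table = failure_table(candidates, used_key, corrected_key, cut_score)
for name, original, rescored in table:
    print(f"{name}: scored {original}, rescored {rescored}, cut score {cut_score}")
```

In this invented data only the first candidate changes status: the miskeyed item costs exactly one point, and that point is the difference between failing and passing.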

In the face of this evidence, the defendants argued at trial that

“...the real disagreement is between two different testing philosophies. One of these philosophies would require virtual perfection under its proponents' rigid definition of that word. The other looks at testing as a constantly developing art in which professional judgment ultimately determines what is appropriate in a particular case” (Allen, Defendant's Pre-trial Memorandum, 1986, p. 121-2).

Plaintiffs counter-argued

“This case…is not a philosophical case at all. This case is a case on professional competence….this was an incompetent job, unprofessional, and as I said before, sloppy and shoddy, and in the case of the miskeyed items, unethical.” (Madaus, 1986, p. 185).

Judge Thompson, in the subsequent Richardson decision which also involved the AITCTP, specifically agreed with plaintiffs on this point (Richardson, 1989, p. 821, 823, 825). Excellent reviews of the diametrically opposed plaintiff and defendant positions may be found in Walden & Deaton (1988) and Madaus (1990).

At the same time that this case was proceeding, the plaintiffs' appeal to reverse the vacating of the original settlement was granted prior to a decision in this trial (Allen, Feb. 5, 1986, p. 75). The U.S. Court of Appeals decided the district court should have enforced the consent decree (Allen, April 22, 1987), which the district court so ordered on May 14, 1987 (Allen, May 14, 1987). Although the decision to uphold the original settlement was a positive ruling for the plaintiffs, it also was somewhat counter-productive for them because it was unexpectedly beneficial to NES at this stage in the proceedings. That is because the evidence presented above in Allen v Alabama was critical of the state and NES (NES was explicitly referred to in the court documents). Thus, NES's best hope for avoiding a written opinion critical of their test development procedures was if plaintiffs' appeal were to be upheld and the original settlement enforced, as it was. Then there would be no evidentiary record, no court ruling, and no legal opinion that would reflect badly upon the NES procedures. Richardson v Lamar County Board of Education (87-T-568-N) commenced, however, and the actions of NES and the Alabama State Board of Education were openly discussed and critiqued in the court's opinion of November 30, 1989 (though NES was not mentioned by name in the Richardson 1989 decision).

Richardson v Lamar County Board of Education, et al

Like Allen v Alabama, Richardson v Lamar County also addressed issues of the “racially disparate impact” of the AITCTP (Richardson, 1989, p. 808). The Honorable Myron H. Thompson again presided, and testimony from Allen v Alabama was admitted as evidence (Richardson, 1989). Although the defendants denied in the Allen v Alabama consent decree that the AITCTP tests were psychometrically invalid, and even though no decision was reached in the abbreviated Allen v Alabama trial, the State Board of Education did not attempt to defend the validity of the tests in Richardson v Lamar and, “in fact, it conceded at trial that plaintiff need not relitigate the issue of test validity” (Richardson v Alabama State Board of Education, 1991, p. 1240, 1246).

Judge Thompson's position on the test development process of NES was clearly stated: “In order to fully appreciate the invalidity of the two challenged examinations, one must understand just how bankrupt the overall methodology used by the State Board and the test developer was” (Richardson, 1989, p. 825, n. 37).
While sensitive to the fact that “close scrutiny of any testing program of this magnitude will inevitably reveal numerous errors,” the court concluded that these errors were not “of equal footing” and “the error rate per examination was simply too high” (Richardson, 1989, pp. 822-24). Thus, none of the examinations that comprised the certification test possessed content validity because of five major errors by the test developer, and the test developer had made six major errors in establishing cut scores (Richardson, 1989, pp. 821-25).

Case Outcomes in Alabama

The Allen v Alabama consent decree required Alabama to pay $500,000 in liquidated damages and issue permanent teaching certificates to a large portion of the plaintiff class (Allen, Consent Decree, Oct. 25, 1985, pp. 9-11). The decree also provided for a new teacher certification process. However, no new test was developed or implemented and the Alabama State Board of Education suspended the teacher certification testing program on July 12, 1988. In 1995 the Alabama State Legislature enacted a law requiring that teacher candidates pass an examination as a condition for graduation. Subsequently, another trial was held February 23, 1996 to decide the state's motions to modify or vacate the 1985 consent decree (Allen, 1997, p. 1414). Those motions were denied on September 8, 1997 (Allen, Sept. 8, 1997). Given the rigorous test development and monitoring conditions of the Amended Consent Decree, it was estimated by the court that the State of Alabama would not gain complete control of its teacher testing program “until the year 2015” (Allen, Jan. 5, 2000, p. 23). Only recently has a testing company stepped forward with a proposal for a new Alabama teacher certification test (Rawls, 2000).

Plaintiff Richardson was awarded re-employment, backpay, and various other employment benefits (Richardson, 1989, pp. 825-26). Defendants (the State of Alabama and its agencies) in both cases were ordered to pay court costs and attorney fees (Richardson, 1989, pp. 825-26). However, even though NES was responsible for the development of the tests, NES was not named as one of the defendants in these cases and was not held liable for any damages (Note 7).

Psychometric and Statistical Background

At this point it is appropriate to discuss some of the psychometric concepts and statistics that are fundamental to any question about test quality. The purpose of this discussion is to illustrate that excruciatingly complex analyses are not necessarily required in order to reveal flaws in a test or individual test items. The first steps in test development simply involve common sense practice combined with sound statistical interpretations. If those first steps are flawed, then no complex psychometric analysis will provide a remedy for the mistakes.

One of the simplest statistics reported in the reliability analysis of a test like the MECT is the “item-test point-biserial correlation.” This statistic goes by other names such as the “item-total correlation” and the “item discrimination index.” It is called the point-biserial correlation specifically because it represents the relationship between a truly dichotomous variable (i.e., an item scored as either right or wrong) and a continuous variable (i.e., the total test score for a person). A total test score, here, is the simple sum of the number of correctly answered items on a test.

The biserial correlation has a long history of statistical use (Pearson, 1909). One of its earliest measurement uses was as an item-level index of validity (Thorndike, et al., 1929, p. 129).
The “point”-biserial correlation appeared specifically for individual dichotomous items in an item analysis because of concerns over the assumptions implicit in the more general biserial correlation (Richardson & Stalnaker, 1933). It was again used as a validity index. It subsequently came to acquire diagnostic value and was re-labeled as a discrimination index (Guilford, 1936, p. 426).

The purpose of this statistic is to determine the extent to which an individual item contributes useful information to a total test score. Useful information may be defined as the extent to which variation in the total test scores has spread examinees across a continuum of low scoring persons to high scoring persons. In the present situation, this refers to the extent to which well qualified candidates can be distinguished from less capable candidates.

Generally, the greater the variation in the test scores, the greater the magnitude of a reliability estimate. Reliability may be defined many ways through the body of definitions and assumptions known as Classical Test Theory or CTT (Lord & Novick, 1968). According to CTT, an examinee's observed score (X) is assumed to consist of two independent components, a true score component (T) and an error component (E), so that X = T + E. One relevant definition of reliability may be expressed as the ratio of true-score variance to observed-score variance. Thus, the closer the ratio is to 1.0, the greater the proportion of observed-score variance that is attributed to true-score variance.

The KR-20 reliability estimate is often reported for achievement tests (Kuder & Richardson, 1937, Eq. 20, p. 158). Although reliability as defined above is necessarily positive, the KR-20 can be negative under certain extraordinary conditions (Dressel, 1940) but typically ranges from 0 to +1. Nevertheless, the higher the value, the more “internally consistent” the items on a test. The magnitude of the KR-20, however, is affected by the direction and magnitude of the point-biserial correlations. Specifically, total test score reliability is decreased by the inclusion of items with near-zero point-biserial correlations and is worsened further by the inclusion of items with negative point-biserial correlations. This is because each additional faulty item increases the error variance in the scores at a faster rate than the increase in true-score variance.

Technically, the point-biserial correlation represents the magnitude and direction of the relationship between the set of incorrect (scored as “0”) and correct (scored as “1”) responses to an individual item and the set of total test scores for a given group of examinees. In other words, it is a variation of the common Pearson product-moment correlation (Lord & Novick, 1968, p. 341). It can range in magnitude from zero to ±1. An estimate near zero indicates a poorly discriminating item that contributes no useful information. An estimate of +1 would indicate a perfectly discriminating item in the sense that no other items are necessary on the test for differentiating between high scoring and low scoring persons. A value of 1.0 is never attained in practice nor is it sought (Loevinger, 1954). Negative estimates are addressed below.

Ideally the test item point-biserial correlation should be moderately positive. Although various authors differ on what precisely constitutes “moderately positive”, a long-standing general rule of thumb among experts is that a correlation of .20 is the minimum to be considered satisfactory (Nunnally, 1967, p. 242; Donlon, 1984, p. 48) (Note 8). There is, however, no disagreement among psychometricians on the direction of the relationship: it has to be positive.

The direction of the correlation is critical.
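
As an illustration of how item direction bears on reliability, the sketch below computes KR-20 for a small invented data set and then again after one item's scoring is reversed, as would happen with a miskeyed item. It assumes the common textbook form KR-20 = (k/(k-1))(1 - sum(pq)/variance of total scores), with population variances; the function name and the data are hypothetical and are not drawn from the MECT or the AITCTP.

```python
from statistics import pvariance


def kr20(scores):
    """KR-20 for a list of examinee rows, each a list of 0/1 item scores.

    Assumes the textbook form (k/(k-1)) * (1 - sum(p*q) / var(total)),
    using population variances throughout.
    """
    k = len(scores[0])   # number of items
    n = len(scores)      # number of examinees
    totals = [sum(row) for row in scores]
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in scores) / n  # proportion answering item j correctly
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / pvariance(totals))


# Hypothetical data: 6 examinees, 5 items of increasing difficulty.
# Examinee i answers the i easiest items correctly.
clean = [[1 if j < i else 0 for j in range(5)] for i in range(6)]

# Simulate a miskeyed third item by reversing its 0/1 scores.
flawed = [row[:2] + [1 - row[2]] + row[3:] for row in clean]

kr20_clean = kr20(clean)
kr20_flawed = kr20(flawed)
print("KR-20, all items keyed correctly:", round(kr20_clean, 3))
print("KR-20, third item miskeyed:      ", round(kr20_flawed, 3))
```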
A positive correlation means that examinees who got an item right also tended to score above the mean total test score and those who got the item wrong tended to score below the mean total test score. This is intuitively reasonable and is an intended psychometric outcome. Such an item is accepted as a good “discriminator” because it differentiates between high and low scoring examinees. This is one of the fundamental objectives of classical test theory, the theory underlying the development and use of the MECT.

A negative point-biserial correlation, however, occurs when examinees who got an item correct tended to score below the mean total test score while those who got the item wrong tended to score above the mean total test score. This situation is contrary to all standard test practice and is not an intended psychometric outcome (Angoff, 1971, p. 27). A negative point-biserial correlation for an item can occur because of a variety of problems (Crocker & Algina, 1986). These include:

1. chance response patterns due to a very small sample of people having been tested,
2. no correct answers to an item,
3. multiple correct answers to an item,
4. the item was written in such a way that “high ability” persons read more into the item than was intended and thus chose an unintended distracter while the “low ability” people were not distracted by a subtlety in the item and answered it as intended,
5. the item had nothing to do with the topic being tested, or
6. the item was mis-keyed, that is, a wrong answer was mistakenly keyed as the correct one on the scoring key.

When an item yields a negative point-biserial correlation, the test developer is obligated to remove the item from the test so that it does not enter into the total test score calculations. In fact, the typical commercial testing situation is one where the test contractor administers the test in at least one field trial, discovers problematic items, either fixes the problems or discards the items entirely, and then readministers the test prior to making the test fully operational. The presence of a flawed item on a high-stakes examination can never be defended psychometrically.

One additional point must be made. The point-biserial correlation can be computed two ways. The first way is to correlate the set of 0/1 (incorrect/correct) responses with the total scores as described above. In this way of computing the statistic, the item for which the correlation is being computed contributes variance to the total score; hence, the correlation is necessarily magnified. That is, the statistical estimate of the extent to which an item is internally consistent with the other items “tends to be inflated” (Guilford, 1954, p. 439).

The second way in which the correlation may be computed is to compute it between the 0/1 responses on an item and the total scores for everyone but with the responses to that particular item removed from the total score (Henrysson, 1963). This is called the “corrected point-biserial correlation.” It is a more accurate estimate of the extent to which an individual item is correlated to all the other items. It is easily calculated and reported by most statistical software packages used to perform reliability analyses (e.g., SPSS's Reliability procedure).

Various concerns have been raised over the interpretation of the point-biserial correlation because the magnitude of the coefficient is affected by the difficulty of the item. The fact is, however, that all the various discrimination indices are highly positively correlated (Nunnally, 1936; Crocker & Algina, 1986). Furthermore, even though the magnitude of the point-biserial correlation tends to be less than the biserial correlation, all writers agree on the interpretation of negative discriminations. “No test item, regardless of its intended purpose, is useful if it yields a negative discrimination index” (Ebel & Frisbie, 1991, p.
237). Such an item “lowers test reliability and, no doubt, validity as well” (Hopkins, 1998, p. 261). Furthermore, “on subsequent versions of the test, these items [with negative point-biserial correlations] should be revised or eliminated” (Hopkins, 1998, p. 259).

NES and the MECT

The 1999 MECT Technical Report

In July 1999 NES released their five volume Technical Report on the Massachusetts Educator Certification Tests. Volume I describes the test design, item development description, and psychometric results. Volume II describes the subject matter knowledge and test objectives. Volume III consists of “correlation matrices by test field.” Volume IV consists of various content validation materials and reports. Volume V consists of pilot material, bias review material, and qualifying score material. The report was immediately hailed by Massachusetts Commissioner of Education David P. Driscoll: “I have said all along that I stand by the reliability and validity of the tests, and this report supports it.” (Massachusetts Department of Education, 1999).

Field Trial

Technical Report Volume I contains the psychometric results for the first four administrations of the MECT (April, July, and October 1998, and January 1999). It does not, however, contain any results from a full-scale field trial, nor are any “pilot” test results reported (Note 9). There is no information on how many different items were tested, where the items came from, how many items were revised or rejected, what the revisions were to any revised items, or what the psychometric item-level results were. In fact, there is no field trial evidence in support of the initial inclusion of any of the individual items on the operational exams because there was no field trial.

Interestingly, the Department of Education released a brochure in January 1998 stating that the first two test administrations would not count for certification, implying that the tests would serve as a field trial. Chairman of the Board of Education John Silber, however, declared in March 1998 that the public had been misinformed and that the first two tests would indeed count for certification. This policy reversal was unfortunate because of the confusion and anxiety it created among the first group of examinees and because it prevented the gathering of statistical results that could have improved the quality of the test.

NES had considered a field trial of their teacher test in Alabama but did not conduct one and presumably came to regret that decision. In Allen v Alabama they argued, “As the evidence will show, there was no need to conduct a separate large-scale field tryout in this case, since the first test administration served that purpose” (Allen, Defendants' Pre-Trial Memorandum, 1986, p. 113). That decision was unwise because it directly affected the implementation and validity of their procedures. For example, “The court has no doubt that, after the results from the first administration of those 35 examinations were tallied, the test developer knew that its cut-score procedures had failed” (Richardson, 1989, p. 823). In fact, the original settlement in Allen v Alabama stipulated that in any new operational examination, the items “shall be field tested using a large scale field test” (Allen, Consent Decree, Oct. 25, 1985, p. 3).
The first two administrations of the MECT would have served an important purpose as a full-scale field trial for the new tests, thus avoiding the mistake made in Alabama. However, that opportunity to detect and correct problems in administration, scoring, and interpretation was lost. The impact of the lack of a field trial is further magnified when it is noted that the time period between when NES was awarded the Massachusetts contract (October 1997) and when the first tests were administered (April 1998) was even shorter than the time period NES had to develop the tests in Alabama, a time frame that the court referred to as “quite short” (Richardson, 1989, p. 817). Furthermore, even though NES may have drawn many of the MECT items from existing test item banks, items written and used elsewhere still must be field tested on each new population of teacher candidates.

Point-biserial correlations

In the NES Technical Report, Volume I, Chapter 8, p. 140, there is a description of when an item is flagged for further scrutiny. One of the conditions is when an item displays an “item-to-test point-biserial correlation less than 0.10 (if the percent of examinees who selected the correct response is less than 50)”. After such an item is found, “The accuracy of each flagged item is reverified before examinees are scored.” The Technical Report, however, does not report the percent of persons who selected the correct response on each item. Nor is there an explanation of what the reverification process consisted of, how many items were flagged, or what was subsequently modified on flagged items. Thus, there is no way to determine the extent to which NES actually followed its own stated guidelines and procedures in the development of the MECT.
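The two ways of computing the point-biserial described earlier, and the flagging rule quoted above, can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not NES's actual procedure: the function names, the tiny made-up response matrix, the use of the uncorrected coefficient for flagging, and the reading of the rule as "both conditions must hold" are all assumptions made for the example.

```python
import numpy as np

# Hypothetical 0/1 response matrix: 6 examinees (rows) x 4 items (columns).
# The data are invented purely for illustration.
RESPONSES = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
])

def point_biserials(responses):
    """Return (uncorrected, corrected) item-total correlations per item."""
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    uncorrected, corrected = [], []
    for j in range(responses.shape[1]):
        item = responses[:, j]
        # Uncorrected: the item itself is part of the total score.
        uncorrected.append(np.corrcoef(item, total)[0, 1])
        # Corrected: the item is removed from the total before correlating.
        corrected.append(np.corrcoef(item, total - item)[0, 1])
    return np.array(uncorrected), np.array(corrected)

def flag_items(responses, r_cut=0.10, pct_cut=50.0):
    """Assumed reading of the quoted rule: flag an item when its
    item-to-test point-biserial is below r_cut AND fewer than pct_cut
    percent of examinees answered it correctly."""
    responses = np.asarray(responses, dtype=float)
    uncorrected, _ = point_biserials(responses)
    pct_correct = 100.0 * responses.mean(axis=0)
    mask = (uncorrected < r_cut) & (pct_correct < pct_cut)
    return np.flatnonzero(mask).tolist()

if __name__ == "__main__":
    unc, cor = point_biserials(RESPONSES)
    for j, (u, c) in enumerate(zip(unc, cor)):
        print(f"item {j}: uncorrected={u:.3f} corrected={c:.3f}")
    print("flagged items:", flag_items(RESPONSES))  # item 3 is flagged
```

In this toy data the last item correlates negatively with the total score and is answered correctly by a third of the examinees, so it is the only item flagged. The corrected coefficient is never larger than the uncorrected one, which is the inflation the text describes.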
The gap between what NES states as its review procedures and what it actually performed matters because, in Alabama, under the topic of content validity, the defense argued that items rated as “content invalid” were revised by NES and that these “revisions were approved by Alabama panelists before they appeared on a test.” The court, however, found that “no such process occurred” (Richardson, 1989, p. 822).

The following table summarizes the point-biserial estimates reported for the MECT. Note that these are not the results prior to NES conducting the item review process. These are the results for the “scorable items” after the NES review.

Table 1
Problematic Point-Biserial Correlations from the 1999 MECT Technical Report

Date             Number   N of M/C   Items with point-biserials <= 0.20            % of total
                 tested   items      < .00   .00-.05   .06-.10   .11-.15   .16-.20   items
Apr-98             4891      315         1         7        15        24        46   29.5%
Jul-98             5716      443         0         2        14        17        39   16.3%
Oct-98             5286      379         2         5        10        15        32   16.9%
Jan-99             9471      507         1         4        14        35        49   20.3%
Total            25,364    1,644         4        18        53        91       166   332/1644 = 20.2%

Test             Number   N of M/C   Items with point-biserials <= 0.20            % of total
                 tested   items      < .00   .00-.05   .06-.10   .11-.15   .16-.20   items
Writing            9750       92         0         0         0         1         1    2.2%
Reading            9455      144         0         0         1         1         6    5.6%
Early Childhood     936      256         0         3        18        30        46   37.9%
Elementary         3125      256         0         2         0         3        27   12.5%
Social Studies      259      128         1         0         1         6        14   17.2%
History             108       64         0         0         2         6         5   20.3%
English             695      256         0         3        11        12        29   21.5%
Mathematics         345      192         1         0         4         4         7    8.3%
Special Needs       691      256         2        10        16        28        31   34.0%
Total                      1,644         4        18        53        91       166

Source: Massachusetts Educator Certification Tests: Technical Report, 1999

A number of observations may be made from the information in this table. First, of the 1,644 items administered over the first four dates, 332 items (20.19%) had point-biserial correlations at or below the industry minimum standard criterion of .20. That is a huge percentage of poorly performing items for a high-stakes examination.

Second, while there are relatively few suspect items on the Reading and Writing tests, there are large numbers of items with poor statistics on many of the subject matter tests. The Early Childhood, English, and Special Needs tests, in particular, consisted of extraordinarily large percentages of poorly performing items (37.9%, 21.5%, and 34%, respectively). Overall, of the 332 items with low point-biserials, 322 (97%) occurred on the subject matter tests. On the face of it, the results for the subject matter tests are terrible. There is, unfortunately, no authoritative source in the literature (including the Standards) that tells us unequivocally whether this overall 20.19% of poorly performing items on a licensure examination with high-stakes consequences is acceptable, not acceptable, or even terrible. Given the steps that NES claims were followed in selecting items from existing item banks and in writing new items, there simply should not be this many technically poor items on these tests.

Reliability

In Volume I, Chapter 9, p. 188 of the Technical Report, the following statement appears: “It is further generally agreed that reliability estimates lower than .70 may call for the exercise of considerable caution.” The practical significance of this statement lies in the fact that when reliability is less than .70, more than 30% of the variance in an examinee's test score is attributable to something other than the subject matter that is being tested. In other words, an examinee's test score consists of less than 70% true-score variance and more than 30% error variance. This ratio of true-score variance to error variance is not desirable in high-stakes examinations (Haney et al., 1999). More than 30 years ago, Nunnally went so far as to describe as “frightening” the extent to which measurement error is present in high-stakes examinations even with reliability estimates of .90 (1967, p. 226).
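The arithmetic behind the reliability statement can be made concrete with a short sketch. The Python code below assumes that the reliability estimate is a KR-20 coefficient for dichotomous items (Kuder & Richardson, 1937) and that the population variance of total scores is used; the function names and the four-examinee data set are hypothetical and serve only to illustrate the formula and the "one minus reliability" error share described above.

```python
import numpy as np

def kr20(responses):
    """KR-20 for a 0/1 response matrix (examinees x items).

    Assumes the population variance of total scores (ddof=0), which
    matches the p * (1 - p) item variances used in the numerator.
    """
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                      # number of items
    p = responses.mean(axis=0)                  # proportion correct per item
    total_var = responses.sum(axis=1).var()     # variance of total scores
    return (k / (k - 1)) * (1.0 - (p * (1.0 - p)).sum() / total_var)

def error_variance_share(reliability):
    """Share of observed-score variance that is error variance."""
    return 1.0 - reliability

if __name__ == "__main__":
    # Invented, perfectly ordered responses for 4 examinees on 3 items.
    toy = [[0, 0, 0],
           [1, 0, 0],
           [1, 1, 0],
           [1, 1, 1]]
    print(f"KR-20 for toy data: {kr20(toy):.2f}")  # 0.75
    for rel in (0.70, 0.90):
        share = error_variance_share(rel)
        print(f"reliability {rel:.2f} -> error variance share {share:.2f}")
```

Read this way, a coefficient of .70 leaves 30% of score variance as error, and even the .90 that Nunnally discusses leaves 10%.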
NES, however, suggests that their reported item statistics and reliability estimates should not greatly influence one's judgment about the overall quality of the tests because the multiple-choice items make up only part of the exam format (NES, 1999, p. 189). The problem with that argument, as noted by Judge Thompson in Richardson (1989, pp. 824-25), is that small errors do accumulate and can invalidate the use for which the test was developed. This practice of simply dismissing troubling statistics as inconsequential is particularly ironic when the MECT has been described by the non-profit Education Trust as “the best [teacher test] in the country” (Daley, Vigue & Zernike, 1999).

The Special Needs test deserves closer attention because it had problems at each reported administration.

1. The sample sizes for the tests were 131, 206, 154, and 200, respectively. Based on NES's own criteria (NES, 1999, p. 187), these sample sizes are sufficient for the generation of statistical estimates that would be relatively unaffected by sampling error.

2. The KR-20 reliability coefficients for the four administrations were .67, .76, .76, and .74, respectively. These are minimally tolerable for the last three administrations. The reliability is not acceptable, however, for the first administration. This means that people were denied certification in Special Needs based on their performance on a test that was deficient even by NES's own guidelines.

3. For the April 1998 administration eleven Special Needs items had point-biserials of .10 or less (again, one of NES's stated criteria for “flagging” an item). For the July 1998 administration it was five items, for October 1998 it was four items, and for January 1999 it was eight items. In fact, in two of the administrations there was an item with a negative point-biserial. (Given the previous discussion about the way the point-biserials were likely to have been calculated (uncorrected), the frequency of negative point-biserials would likely increase if the corrected coefficients had been reported.) Given that there is no specific information about flagging, deleting or replacing items, it is possible that these same faulty items were, and continue to be, carried over from one administration to the next.

The Linkage between Alabama and Massachusetts: A modus operandi

At this point the reasonable reader might ask why I am expending so much effort upon what appears to be a relatively minor problem: some items had negative point-biserial correlations. NES, for example, would likely call this analysis “item-bashing,” as this type of analysis was referred to in Alabama. The significance of these findings lies in the apparent connection between NES's work in Alabama and their present work on the MECT in Massachusetts. In Alabama, defendants claimed that

    Before any item was allowed to contribute to a candidate's score, and before the final 100 scorable items were selected, the item statistics for all the items of the test were reviewed and any items identified as questionable were checked for content and a decision was made about each such item (Allen Defendants' Pre-Trial Memorandum, 1986, pp. 113-14).

In fact, in Alabama there were negative point-biserial correlations in the original reliability reports generated by NES (their own documents reported negative point-biserial correlations as large as -0.70), and those negative point-biserial correlations for the same scorable items remained after multiple administrations of the examinations. Simply taking out the worst 20 items in each test did not remove all the faulty items since each exam had to have 100 scorable items. 
As seen above in Table 1, the MECT has statistically flawed items on many tests; these items have been there since the first administration, and they may be the same items still being used in current administrations.

In Alabama, the negative point-biserial correlations led to the discovery of items for which there was no correct answer. Also discovered were items with multiple correct answers and items for objectives that had been rated “not as job related.” Additionally, items were found to have been mis-keyed on the item analysis scoring forms. Furthermore, those flawed items existed unchanged for the first eight administrations of the tests. They were not revised, deleted, or changed to “experimental” non-scorable status until the ninth administration, one month after the plaintiffs' team agreed to take the case. Defendants argued that “problems with the testing instrument,” “such as mis-keyed answers,” were simply one component of many that is taken into account by the “error of measurement” (Allen Defendants' Pre-Trial Memorandum, 1986, pp. 108-113). (Note 10)

As noted earlier, poor item statistics may result for many reasons. Of those reasons the only acceptable one is that they may be due to sampling error (chance). That explanation is unlikely with respect to the MECT, however, because the sample sizes are sufficiently large and the pattern of faulty item statistics persists over time. The extent to which flawed items may exist in the Massachusetts tests can only be determined by release of the examinee-level item response data and the content of the actual items, something that has not been done to date. Furthermore, such a release of additional technical information, item response data, or item content is highly unlikely. (Note 11)


In Alabama, the statistical results and in-house documents were not produced by NES until the plaintiffs seriously discussed contempt of court actions against NES personnel. Consequently, there is little reason to expect that NES will voluntarily release MECT data or results not explicitly covered in their original confidential contract.

In Alabama there were no independent testing experts appointed or contracted to monitor the test developer's work. This fact led the court to conclude that “The developer's work product was accepted by the state largely on the basis of faith” (Richardson, 1989, p. 817). In Massachusetts the original MECT contract called for the contractor to recommend a technical review committee of nationally recognized experts who were external to their organization (MDOE, 1997, Task 2.14.i, p. 11). The committee was to review the test items, test administration, and scoring procedures for validity and reliability and was to report its findings to the Department of Education. NES did not form such an independent technical advisory committee for the MECT, nor has a formal independent review of the MECT been undertaken by anyone else.

It is not in the short-term business interests of a testing company to conduct disconfirming studies on the technical quality of their commercial product. The MECT is, of course, a product that NES markets as an example of what they can build for other states who might be interested in certification examinations. It is, however, in the best interests of a state for such studies to be conducted. For example, the Commonwealth of Massachusetts has a statutory responsibility to “protect the health, safety and welfare of citizens” who seek services from licensed professionals (NES, 1999, p. 16). In the present situation “citizens” are defined by the Board of Education as “the children in our schools” (MDOE, Special Meeting Minutes, 1998). 
What has apparently been lost in all of this is the fact that prospective educators are “citizens” and deserve protection too: protection from a faulty product that can damage the profession of teaching and can alter drastically the career paths of individuals. Educators and the public at large deserve the highest quality certification examinations that the industry is capable of providing. There is ample evidence that the MECT may not be such an examination.

Conclusion

A technical review of the psychometric characteristics of the MECT has been called for in this journal (Haney et al., 1999; Wainer, 1999). The year 2000 and 2001 budgets passed by the Legislature of the Commonwealth also called for such an independent audit of the MECT. Those budget provisions, however, were vetoed by Governor Cellucci, and the legislature failed to override the vetoes. Until an independent review committee with full investigative authority is convened by the Commonwealth, the only technical material publicly available for independent analysis is the 1999 MECT Technical Report generated by NES (NES, 1999). (Note 12) One of the important points made by Haney et al. (1999) was that the Massachusetts Department of Education is not the appropriate agency for conducting such a review. Part of my point here is that the only review of the MECT the Commonwealth may ever see is the one prepared by NES of its own test. Such a review clearly raises a concern over conflict of interest (Madaus, 1990; Downing & Haladyna, 1996).

Given the national interest in “higher standards” for achievement and assessment, it must be recognized that there are no “gold” standards by which a testing program such as the MECT can be evaluated (Haney & Madaus, 1990; Haney, 1996). This is ironic given how technically sophisticated the testing profession has become. 
Consequently, without “gold” standards to define test development practice, there are no legislated penalties for faulty products (tests) and there is no enforced protection for the public. Testing companies may lose business if the details of shoddy practice are made known, and the public may appeal to the judicial system for damages. But the opportunity for a test taker simply to raise a question about a test that can shape his or her career, and to have that question taken seriously by an impartial panel, should be the right of every test-taking citizen. (Note 13)

Contrary to former Chairman John Silber's statement to the Massachusetts Board of Education, “there is nothing wrong with this test” (Minutes of the Board, Nov. 11, 1998), and the statement by the chief of staff for the MDOE, Alan Safran, “[the test] does not show who will become a great teacher, but it does reliably and validly rule out those who would not” (Associated Press, 1998), there is ample evidence that there may be significant psychometric problems with the MECT. These problems, in turn, have significant practical ramifications for certification candidates and the institutions responsible for their training.

Is the MECT sound enough to support assertions that the candidates are “idiots”? No. Is there evidence that poor performance may, in part, reflect a flawed test containing defective items? Yes. Should the Massachusetts Commissioner of Education independently follow through on the twice-rejected Senate bill to “select a panel of three experts from out-of-state from a list of nationally qualified experts in educational and employment testing, provided by the National Research Council of the National Academy of Sciences, to perform a study of the validity and reliability of the Massachusetts educator certification test as used in the certification of new teachers and as used in the elimination of certification approval of teacher preparation programs and institutions to endorse candidates for teacher certification?” (Massachusetts, 1999, Section 326. (S191K)). Absolutely. 
Should such a panel serve as a blueprint for the formation of a standing national organization for test review and consumer protection? Yes.

As we enter the 21st century, high-stakes tests are becoming increasingly powerful determinants of students' and teachers' lives and life chances. Title II of the 1998 Higher Education Act, in particular, has encouraged a kind of de facto national program of teacher testing. Given the extraordinarily high stakes of these tests, the personal and institutional consequences of poorly designed teacher tests have become too great simply to allow test developers to serve as their own (and lone) quality control and their own (and often non-existent) dispute resolution boards.

Now is the time for the community of professional educators and psychometricians to take a stand and demand that test developers be held accountable for their products in the test marketplace. What this would require at the very least are (1) a mechanism for an independent external audit of the technical characteristics of any test used for high-stakes decisions, and (2) a mechanism for the resolution of disputed scores, results, and cases.

Only then will taxpayers, educators, and test candidates have confidence that teacher tests are actually providing the information intended by legislative actions to raise educational standards and enhance teacher quality. Title II legislation certainly did not cause the high-stakes test Juggernaut that is rolling through all aspects of educational reform in the U.S. and elsewhere. With mandatory teacher test reporting now tied to federal funding, however, Title II legislation certainly has added to the size, weight, and power of the test Juggernaut and strengthened its hold on reform. 
For this reason, federal policy makers are now responsible for providing legislative assurances that the public will be protected from the shoddy craftsmanship of some tests and some testing companies and that there will be remedies in place to right the mistakes that result from negligence.

This article ends with a call to action. Policy makers must now incorporate into the federal legislation that requires state teacher test reporting new concomitant requirements for the establishment of independent audits and dispute resolution boards.


Notes

I wish to thank Marilyn Cochran-Smith, Walt Haney, Joseph Herlihy, Craig Kowalski, George Madaus, and Diana Pullin for their advice and editorial comments.

1. The class consisted of “all black persons who have been or will be denied any level teaching certificate because of their failure to pass the tests by the Alabama Initial Teacher Certification Testing Program.” (Order On Pretrial Hearing, 1984).

2. This specific wording does not appear until the Amended Consent Decree of Jan. 5, 2000.

3. Among other things, conditions were set on the development of new tests, an independent monitoring and oversight panel was established, grade point averages were ordered to be considered in the certification process, and defendants would pay compensatory damages to the plaintiffs and plaintiffs' attorneys' fees and costs (Consent Decree, 1985).

4. That decision has been upheld numerous times since. The latest Amended Consent Decree was approved on January 5, 2000 (Allen, Jan. 5, 2000).

5. George Madaus, Joseph Pedulla, John Poggio, Lloyd Bond, Ayres D'Costa, Larry Ludlow.

6. “Failure tables” consisted of an applicant's name, their raw scores on the exams, the exam cut-scores, their actual responses to suspect items, and their recomputed raw scores if they should have been credited with a correct response to a suspect item. Examinees were identified in court who had failed an examination by one point (i.e., missed the cut-score by one item) but had actually responded correctly to a miskeyed item. For example, on the fifth administration of the Elementary Education exam there were six people who should have been scored correct on scorable item #43 (the so-called “carrot” item) but were not. Their total scores were 72. The cut-score was 73. These individuals should have passed the examination. There was even a candidate who took an exam multiple times and failed but who should have passed on each occasion.
7. The standard contract for test development will include some specification of indemnification. In the case of a state agency like the MDOE, the Request For Responses will typically specify protection for the state, holding the contractor responsible for damages (MDOE, 1997, V. (G), 1, p. 17). Contractors, understandably, are reluctant to enter into such an agreement and have been successful in striking this language from the contract.

8. The rationale is that .20 is the minimum correlation required to achieve statistical significance at alpha = .05 for a sample size of 100. This is because .20 is twice the standard error (based on a sample of 100) needed to differ significantly from a correlation of zero.

9. The difference between piloting test items, as NES did, and conducting a field trial is that the field trial simulates the actual operational test-taking conditions. Its value is that problems can be detected that are otherwise difficult to uncover. For example, non-standardized testing conditions created numerous sources of measurement error on the first administration of the MECT (Haney et al., 1999).

10. This interpretation of measurement error goes considerably beyond conventional practice where “Errors of measurement are generally viewed as random and unpredictable.” (Standards, 1999, p. 26). A miskeyed answer key is not a random error. It is a mistake, and its effect is felt most by those near the cut-score. Although false-positive passes may benefit from the mistake, it is the false-negative fails who suffer and, as a consequence, seek a legal remedy.

11. To date the MDOE has routinely ignored questions requesting technical information, e.g., how many items originally came from item banks, who developed the item banks, how many items have been replaced, what are the reliabilities of new items, what are the technical characteristics of the present tests, will the Technical Report be updated, what “disparate impact” analyses have been conducted?

12. From the start of testing to the present time individual IHEs have not been able to initiate any systematic analysis of their own student summary scores, let alone any statewide reliability and validity analyses. The primary reason for this paucity of within- and across-institution analysis is that NES only provides IHEs with student summary scores printed on paper; no electronic medium is provided for accessing and using one's own institutional data. Thus, each IHE faces the formidable task of hand-entering each set of scores for each student for each test date. This results in a unique and incompatible database for each of the Commonwealth's IHEs.

13. I assert that the right to question any aspect of a high-stakes examination should take precedence over the waiver required when one takes the MECT: “I waive rights to all further claims, specifically including, but not limited to, claims for negligence arising out of any acts or omissions of the Massachusetts Department of Education and the Contractor for the Massachusetts Educator Certification Tests (including their respective employees, agents, and contractors)” (MDOE, 2001, p. 28).

References

Angoff, W. (Ed.). (1971). The College Board Admissions Testing Program: A Technical Report on Research and Development Activities Relating to the Scholastic Aptitude Test and Achievement Tests. NY: College Entrance Examination Board.

Allen v. Alabama State Board of Education, 612 F. Supp. 1046 (M.D. Ala. 1985).

Allen v. Alabama State Board of Education, 636 F. Supp. 64 (M.D. Ala. Feb. 5, 1986).

Allen v. Alabama State Board of Education, 816 F. 2d 575 (11th Cir. April 22, 1987).

Allen v. Alabama State Board of Education, 976 F. Supp. 1410 (M.D. Ala. Sept. 8, 1997).

Allen v. Alabama State Board of Education, 190 F.R.D. 602 (M.D. Ala. Jan. 5, 2000).

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington, D.C.: American Educational Research Association.

Associated Press Archives. (October 4, 1998). State Administers Teacher Certification Test Amid Ongoing Complaints.

Baldus, D.C. & Cole, J.W.L. (1980). Statistical Proof of Discrimination. NY: McGraw-Hill.


Cochran-Smith, M. (in press). The outcomes question in teacher education. Teaching and Teacher Education.

Cochran-Smith, M. & Dudley-Marling, C. (in press). The flunk heard round the world. Teaching Education.

Consent Decree, Allen v. Alabama State Board of Education, No. 81-697-N (M.D. Ala. Oct. 25, 1985).

Crocker, L. & Algina, J. (1986). Introduction to Classical and Modern Test Theory. NY: Holt, Rinehart and Winston.

Daley, B. (1999). “Teacher exam authors put to the test”. Boston Globe, 10/7/98, B3.

Daley, B., Vigue, D.I. & Zernike, K. (1999). “Survey says Massachusetts Teacher Test is best in US”. Boston Globe, 6/22/99, B02.

Defendant's Pre-trial Memorandum, Allen v. Alabama State Board of Education, No. 81-697-N (M.D. Ala. May 1, 1986).

Donlon, T. (Ed.). (1984). The College Board Technical Handbook for the Scholastic Aptitude Test and Achievement Tests. NY: College Entrance Examination Board.

Downing, S. & Haladyna, A. (1996). A model for evaluating high stakes testing programs: Why the fox should not guard the chicken coop. Educational Measurement: Issues and Practice, 15:1, pp. 5-12.

Dressel, P.L. (1940). Some remarks on the Kuder-Richardson reliability coefficient. Psychometrika, 5, 305-310.

Ebel, R.L. & Frisbie, D.A. (1991) (5th ed.). Essentials of Educational Measurement. NJ: Prentice Hall.

Guilford, J.P. (1936) (1st ed.). Psychometric Methods. NY: McGraw-Hill.

Guilford, J.P. (1954) (2nd ed.). Psychometric Methods. NY: McGraw-Hill.

Haney, W., & Madaus, G.F. (1990). Evolution of Ethical and Technical standards. In R.K. Hamilton, & J.N. Zaal (Eds.), Advances in Educational and Psychological Testing (pp. 395-425).

Haney, W.M., Madaus, G.F. & Lyons, R. (1993). The Fractured Marketplace for Standardized Testing. Boston: Kluwer.

Haney, W. (1996). Standards, Schmandards: The need for bringing test standards to bear on assessment practice. Paper presented at the annual meeting of the American Educational Research Association. NY: NY.

Haney, W., Fowler, C., Wheelock, A., Bebell, D. & Malec, N. (1999). Less truth than error?: An independent study of the Massachusetts Teacher Tests. Education Policy Analysis Archives, 7(4). Available online at /.


Henrysson, S. (1963). Correction for item-total correlations in item analysis. Psychometrika, 28, 211-218.

Hopkins, K.D. (1998) (8th ed.). Educational and Psychological Measurement and Evaluation. Boston: Allyn and Bacon.

Kuder, G.F. & Richardson, M.W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151-160.

Loevinger, J. (1954). The attenuation paradox in test theory. Psychological Bulletin, 51, 493-504.

Lord, F.M. & Novick, M.R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

Madaus, G. (May 19-20, 1986). Testimony in Allen v Alabama (81-697-N).

Madaus, G. (1990). Legal and professional issues in teacher certification testing: A psychometric snark hunt. In J.V. Mitchell, S. Wise, & B. Plake (Eds.), Assessment of teaching: Purposes, practices, and implications for the profession (pp. 209-260). Hillsdale, NJ: Lawrence Erlbaum Associates.

Massachusetts. (1999). FY 2000-2001 Budget.

Massachusetts Department of Education (February 24, 1997). Massachusetts Teacher Certification Tests of Communication and Literacy Skills and Subject Matter Knowledge: Request for Responses (RFR).

Massachusetts Department of Education (July 1, 1998). Board of Education Special Meeting Minutes. 0198.html.

Massachusetts Department of Education (July 27, 1999). Department of Education Press Release. ml.

Massachusetts Department of Education (November 28, 2000). Board of Education Regular Meeting Minutes. eg.pdf

Massachusetts Department of Education (February 16, 2001). Massachusetts Educator Certification Tests: Registration Bulletin. 00/00bulletin.pdf

Melnick, S. & Pullin, D. (1999, April). Teacher education & testing in Massachusetts: The issues, the facts, and conclusions for institutions of higher education. Boston: Association of Independent Colleges and Universities of Massachusetts.

Millman, J. (June 17, 1986). Testimony in Allen v Alabama (81-697-N).

National Board on Educational Testing & Public Policy. (2000). Policy statement. Chestnut Hill, MA: Lynch School of Education, Boston College.

National Commission on Testing and Public Policy. (1990). From Gatekeeper to Gateway: Transforming Testing in America. Chestnut Hill, MA: Lynch School of Education, Boston College.

National Evaluation Systems. (1999). Massachusetts Educator Certification Tests: Technical Report. Amherst, MA: National Evaluation Systems.

Nunnally, J. (1967). Psychometric Theory. NY: McGraw-Hill.

Order On Pretrial Hearing, Allen v. Alabama State Board of Education, No. 81-697-N (M.D. Ala. Dec. 19, 1984).

Pearson, K. (1909). On a new method of determining correlation between a measured character A and a character B, of which only the percentage of cases wherein B exceeds or falls short of a given intensity is recorded for each grade of A. Biometrika, Vol. VII.

Pressley, D.S. (1998). “Dumb struck: Finneran slams 'idiots' who failed teacher tests.” Boston Herald, 6/26/98, pp. 1, 28.

Rawls, P. (2000). “ACT may design test for Alabama's future teachers.” The Associated Press, 7/11/00.

Richardson v. Lamar County Board of Education, 729 F. Supp. 806 (M.D. Ala. 1989), aff'd 935 F. 2d 1240 (11th Cir. 1991).

Richardson, M.W. & Stalnaker, J.M. (1933). A note on the use of bi-serial r in test research. Journal of General Psychology, 8, 463-465.

Thorndike, E.L., Bregman, M.V., Cobb, Woodyard, E. et al. (1929). The Measurement of Intelligence. NY: Teachers College, Columbia University.

U.S. Department of Education, National Center for Education Statistics. Reference and Reporting Guide for Preparing State and Institutional Reports on the Quality of Teacher Preparation: Title II, Higher Education Act. NCES 2000089. Washington, DC: 2000.

Wainer, H. (1999). Some comments on the Ad Hoc Committee's critique of the Massachusetts Teacher Tests. Education Policy Analysis Archives, 7(5). Available online at

[...], J.C. & Deaton, W.L. (1988). Alabama's teacher certification test fails. 42 Ed. Law Rep. 1.

About the Author

Larry H. Ludlow
Associate Professor
Boston College
Lynch School of Education
Educational Research, Measurement, and Evaluation Department

Larry Ludlow is an Associate Professor in the Lynch School of Education at Boston College. 
He teaches courses in research methods, statistics, and psychometrics. His research interests include teacher testing, faculty evaluations, applied psychometrics, and the history of statistics.

Copyright 2001 by the Education Policy Analysis Archives

The World Wide Web address for the Education Policy Analysis Archives is

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, or reach him at College of Education, Arizona State University, Tempe, AZ 85287-0211 (602-965-9644). The Commentary Editor is Casey D. Cobb.

EPAA Editorial Board

Michael W. Apple, University of Wisconsin
Greg Camilli, Rutgers University
John Covaleskie, Northern Michigan University
Alan Davis, University of Colorado, Denver
Sherman Dorn, University of South Florida
Mark E. Fetler, California Commission on Teacher Credentialing
Richard Garlikov
Thomas F. Green, Syracuse University
Alison I. Griffith, York University
Arlen Gullickson, Western Michigan University
Ernest R. House, University of Colorado
Aimee Howley, Ohio University
Craig B. Howley, Appalachia Educational Laboratory
William Hunter, University of Calgary
Daniel Kallós, Umeå University
Benjamin Levin, University of Manitoba
Thomas Mauhs-Pugh, Green Mountain College
Dewayne Matthews, Western Interstate Commission for Higher Education
William McInerney, Purdue University
Mary McKeown-Moak, MGT of America (Austin, TX)
Les McLean, University of Toronto
Susan Bobbitt Nolen, University of Washington
Anne L. Pemberton
Hugh G. Petrie, SUNY Buffalo
Richard C. Richardson, New York University
Anthony G. Rud Jr., Purdue University
Dennis Sayers, Ann Leavenworth Center for Accelerated Learning
Jay D. Scribner, University of Texas at Austin


Michael Scriven
Robert E. Stake, University of Illinois—UC
Robert Stonehill, U.S. Department of Education
David D. Williams, Brigham Young University

EPAA Spanish Language Editorial Board

Associate Editor for Spanish Language: Roberto Rodríguez Gómez, Universidad Nacional Autónoma de México

Adrián Acosta (México), Universidad de
J. Félix Angulo Rasco (Spain), Universidad de
Teresa Bracho (México), Centro de Investigación y Docencia Económica-CIDE
Alejandro Canales (México), Universidad Nacional Autónoma
Ursula Casanova (U.S.A.), Arizona State
José Contreras Domingo, Universitat de Barcelona
Erwin Epstein (U.S.A.), Loyola University of
Josué González (U.S.A.), Arizona State
Rollin Kent (México), Departamento de Investigación Educativa-DIE/
María Beatriz Luce (Brazil), Universidad Federal de Rio Grande do Sul-UFRGS, lucemb@orion.ufrgs.br
Javier Mendoza Rojas (México), Universidad Nacional Autónoma de México, javiermr@servidor.unam.mx
Marcela Mollis (Argentina), Universidad de Buenos
Humberto Muñoz García (México), Universidad Nacional Autónoma de México, humberto@servidor.unam.mx
Angel Ignacio Pérez Gómez (Spain), Universidad de
Daniel Schugurensky (Argentina-Canadá), OISE/UT
Simon Schwartzman (Brazil), Fundação Instituto Brasileiro e Geografia e Estatística
Jurjo Torres Santomé (Spain), Universidad de A
Carlos Alberto Torres (U.S.A.), University of California, Los
