Educational policy analysis archives


Material Information

Educational policy analysis archives
Physical Description:
Arizona State University
University of South Florida
Place of Publication:
Tempe, Ariz
Tampa, Fla
Publication Date:


Subjects / Keywords:
Education -- Research -- Periodicals   ( lcsh )
non-fiction   ( marcgt )
serial   ( sobekcm )

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E11-00167
usfldc handle - e11.167
System ID:

Education Policy Analysis Archives
Volume 8 Number 23
May 12, 2000
ISSN 1068-2341

A peer-reviewed scholarly electronic journal
Editor: Gene V Glass, College of Education, Arizona State University

Copyright 2000, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold.

Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

School-based Standard Testing

Craig Bolon
Planwright Systems Corporation
Brookline, MA (USA)

Abstract

School-based standard testing continues to evolve, yet in some ways it remains surprisingly close to its roots in the first two decades of the twentieth century. After use for many years as a diagnostic and as a filter for access to education, in the closing years of the century it has been pressed into service for state-run political accountability programs. In this role, it is generating vehement controversy that recalls protests over intelligence testing in the early 1920s. This background article explores primary characteristics and issues in the development of school-based standard testing, reviews its typical lack of qualification for political accountability programs, and suggests remedies to address major problems. In general, the attitude toward new techniques of assessment is skeptical, in light of the side-effects and unexpected problems that developed during the evolution of current techniques.

Survival of the Fittest

School-based standard testing began a dream decade in the early 1950s, driven by waves of public anxiety over Soviet "dominos," nuclear weapons, Sputnik and the "missile gap." Now, so many years later, it can be hard to imagine the intensity of fears that the Russians were ahead of everybody else, not just in the size of their standing army but in scientific knowledge, inventions and industry. There was widespread agreement that the U. S. needed to identify talented people and train them for critical occupations. (Note 1)

Of course we know more of the dreary facts today: a Russia of gray poverty and workplace spies, burdened with heavy but narrow investment to produce arms, rockets and nuclear bombs. But in those times, who knew? We saw North Korea fortified with MiG-15s, the Hungarian revolt crushed with Russian tanks, and then the Berlin wall built. Russia had been four years behind the U. S. in testing an atomic bomb but only one year behind with its first thermonuclear blast. And although the U. S. employed the Nazi rocket designers from World War II, Soviet Russia had a space satellite first, winking at us and mocking "the American century."

And so it was, into the breach against Godless communism, (Note 2) that we launched our homespun Scholastic Aptitude and Iowa tests. Few questioned the methods or values. In the climate of those days, school-based standard testing was an engine of progress. (Note 3) It would promote technical expertise and fairly chosen leadership to right the balance and put America first again.

Background

School-based standard testing (Note 4) aims to provide uniform, rapid measurement of some kind of mental capability that is related to education. There are many other assessments related to responsibilities or occupations rather than schools. These include, for example, tests for motor vehicle drivers, aircraft pilots, divers, plumbers and power plant operators. Historical precedents for competence testing can be traced to the ancient civilizations of China (Note 5) and Rome.
However, until relatively recently education operated mainly as a craft. Teachers and schools tested their students and applicants, sometimes intensely, but there was rarely interest in tests that would be applied uniformly and rapidly to large groups of students in diverse situations. Key educational credentials were instead the evaluations of students by individual teachers and schools.

It may have been public education, more than any other factor, that inspired interest in school-based standard testing. (Note 6) The U. S., with the strongest history of public schools, also had the strongest early interest in standard testing. Perhaps it should not be surprising that the country which implemented the concepts of standard machine parts and mass production should also be the country that most eagerly adopted standard testing in its rapidly growing education enterprises (see Cremin, 1962, pp. 185-192). The Yankee attitude can be perceived in the pursuit of uniformity and efficiency.

Standard Tests

The distinguishing features of a standard test are uniform administration and some form of calibration. Before routine use, standard tests or component items will be tried out with groups intended to represent populations of test-takers. These trials are used to measure distributions of scores and other properties of a test (Rogers, 1995, pp. 256-257 and 734-741). After calibration, test scores are typically reported by using a formula derived from the calibration (to percentile ranks, for example). Beginning in the 1910s, statistical metrics were developed to characterize test items and report scores (Rogers, 1995, pp. 197-208, 317-325 and 382-388). The IQ score and the SAT scaled score ranging from 200 to 800 are among the well-known metrics.

A quantitative approach helped give standard tests the appearance of objectivity and encouraged a test format that is easily adapted to numerical scoring. Multiple choice and short answer questions quickly became the conventional format. Such questions are scored only as right or wrong. While in principle there is nothing to prevent a standard test from using essays, extended reasoning and scales of partial credit, reliable scoring of extended answers and essays requires careful training and monitoring of test evaluators and substantially more effort. Rushed and inept evaluation of extended answers can be at least as troublesome as restricting testing to multiple choice and short answer formats.

Standard tests have long been distinguished as having "speed" or "power" formats, meaning that they are strictly timed or that they are loosely timed or untimed (Rogers, 1995, p. 256, and Goslin, 1963, pp. 148-149). The distribution of scores is deliberately widened by strict timing. Many common school-based standard tests, including the Stanford, California and Iowa achievement tests, claim to measure knowledge and skill but are in fact "speed" tests. More recent distinctions are proposed between so-called "norm-referenced" and "criterion-referenced" tests (Rogers, 1995, pp. 653-666). Supposedly a "norm-referenced" test has a calibration relative to a population, while a "criterion-referenced" test has an absolute standard (for example, basic competence to drive a motor vehicle). However, for practical purposes nearly all school-based standard tests are "norm-referenced," because critical decisions about how to use the scores are made after score distributions have been measured. We used to call this "grading on the curve." In fact, wild attempts to produce "criterion-referenced" tests, without knowing how many people can actually pass them, generate some of the horror stories of testing.

Another recent and somewhat misleading distinction is so-called "high-stakes testing," meaning the use of test scores to make decisions that critically affect people. Supposedly this is a new practice. Actually it is quite old; parts of the Chinese civil service were closed to applicants who could not pass required examinations more than twenty centuries ago (Reischauer and Fairbank, 1958, p. 106). Beginning in the nineteenth century, standard tests were developed to place students in French schools according to ability. During World War I, U. S. Army recruits were assigned to combat or support missions on the basis of IQ scores.

According to current psychometric standards it is improper to use a test for some purpose for which it was not "designed." Ninety years ago, however, intelligence tests were quickly appropriated to identify "morons," "imbeciles" and "idiots," who were then to be sexually restricted. Claims were advanced that experienced testers could readily identify "feeble-minded" people by observation (Gould, 1981, p. 165). We are not as far away from those days as some would like to think. Recent applicants who failed a new, uncalibrated teacher certification test were denounced as "idiots" by a prominent Massachusetts politician. (Note 7) Although some strong advocates of standard testing were once inspired by egalitarian views (such as Conant, 1940), standard tests have long been instruments for social manipulation and control. In an irony of the late twentieth century, tests like the former Scholastic Aptitude series, once praised as breaking the stranglehold of social elites on access to higher education, became barricades tending to isolate a new, test-conscious elite which, as we will see, largely tracks the social advantages of the old elite.

Aptitude, Achievement and Ability

School-based standard testing is largely a phenomenon of the twentieth century.
An early product, the "intelligence scale" published by Alfred Binet and Théodore Simon in 1905, was intended to identify slow learners. By the 1920s, the testing movement had split into two camps which remain distinct today (see Goslin, 1963, pp. 24-33). The Binet-Simon scale and its offspring, such as the IQ test produced by Lewis M. Terman in 1916, the Army Alpha and Beta tests organized by Robert M. Yerkes during World War I, and the Scholastic Aptitude Test designed by Carl C. Brigham in 1925, all claimed to measure "aptitude." The essay exams of the College Entrance Examination Board, founded in 1900, the Stanford Achievement tests, first published in 1923, and Everett F. Lindquist's Iowa Every-Pupil tests, developed in the late 1920s and early 1930s, claimed instead to measure "achievement."

Tests of "aptitude" try to measure capacity for learning, while tests of "achievement" aim only to measure developed knowledge and skills. From their earliest days, standard aptitude tests have been clouded in controversy. It has never been clearly shown that "aptitude" can be measured separately from knowledge and skills acquired through experience (see Ceci, 1991; also see Neisser, 1998, and Holloway, 1999, on changes over time). Standard achievement tests, while nominally free of these snares, share assumptions about language and cultural proficiency. Performance on almost any test is strongly influenced by language skills. Likewise, all tests rely to some degree on trained and culturally influenced associations and styles of thinking. Despite longstanding claims of distinct purposes, standard aptitude and standard achievement tests may have more similarities than differences.

Standard achievement test scores tend to correlate with standard aptitude test scores, as shown by Cole (1995) and others. To some observers, such as Hunt (1995), this simply shows that bright people learn well, and vice-versa. To others, it suggests that much of what is being tested might be called test-taking ability (see Hayman, 1997, and Culbertson, 1995).
Most content of the widely used school-based standard tests can be viewed as collections of small puzzles to be solved rapidly by choosing options or writing brief statements. Such a pattern of tasks is rarely encountered by most adults in everyday life.

By design, the times allowed to complete standard tests are typically too short for a sizeable fraction of test-takers, putting great stress on rapid work and leaving little opportunity for reflection. For some strictly timed tests favoring men it has been shown that the same tests conducted without time limits favor women (see Kessel and Linn, 1996).

Standard test designers may assign high scoring weights to test items written to be ambiguous, so that they will encourage wrong answers (see Owen and Doerr, 1999, pp. 70-72). Right answers are guided in part by trained or culturally acquired associations: intuitions about a test designer's unstated viewpoint. When ambiguous questions are removed, differences in scores between ethnic groups may be reduced. Test designers sometimes say that ambiguous questions "stretch the scale," differentiating the more skilled from the less skilled. Owen and Doerr (1999, pp. 45 ff.) suggest instead that they raise the scores of test-takers who have the favored patterns of associations and thinking.

The stressful properties of a typical standard test make test-taking into a sort of mental gymnastics, an ability that may well have its uses but does not necessarily predict performance in other situations (see Sacks, 1999, pp. 60-61). We recognize many special skills, such as remembering complex patterns in card games, multiplying numbers in one's head, or solving crossword puzzles. People who do these things deftly may also perform well in other pursuits, or they may not.

Predictive Strengths


Standard tests are promoted on the basis of claims to predict future performance. Their predictive strengths are measured by how well they do this. Despite heavy use of standard tests in circumstances that may critically affect people's lives, there have been remarkably few evaluations of these tests by organizations independent of the test vendors. The underlying substance of predictive evaluations is sometimes shallow. For example, it may be claimed that a standard test required for acceptance to a school program helps to predict the likelihood of graduation, when a key criterion for graduation is the score on a similarly organized standard test.

For a standard test to be useful, it cannot merely predict performance to some degree. It must significantly improve the accuracy of prediction over readily obtained information. Unless it does so, the effort of testing is wasted. (Note 8) During the last forty years, predictive strengths of the SAT, ACT, GRE and similar aptitude tests have been independently evaluated. Scores from these tests improve predictions of first year grades by at most a few percent of the statistical variance over predictions based solely on previous grades, family income and other personal factors. (Note 9) For later and broader measures of performance, the predictive strengths of these tests evaporate. Sometimes negative correlations have been found: lower performance associated with higher scores. (Note 10) In response to the low predictive strengths of standard aptitude tests, growing numbers of colleges have stopped requiring them as part of applications. (Note 11)

Predictive strengths of standard tests are falsely enhanced when they are used to "track" or group students in schools, providing extra opportunities to some while denying them to others. The favored students stand to gain not only skills and knowledge but also self-esteem, which has been shown to correlate with higher test scores. (Note 12) Ability grouping based on standard tests is a form of "high-stakes testing" which has been practiced for at least 80 years in U. S. public schools. We can clearly distinguish between the selection procedures of public schools, which have a legal duty to treat every student fairly, and those of taxpaying private institutions, which may not. Of the public schools, we can surely ask, "Why not provide opportunity to everyone?"

Beyond the schoolhouse door, school-based standard tests show hardly any predictive strength for creativity, professional expertise, management ability or financial success. (Note 13) However, these tests stress either generalized test-taking abilities or subjects that are only occasionally relevant to adult life. Tests for competence in specific skills have been used successfully to predict whether workers can perform tasks that require those skills. For example, some temporary employment agencies now administer technical skills tests to new job-seekers before sending them out to interview with potential employers. This practice has increased employer satisfaction with job performance.

Errors of Testing

All measurements are subject to potential error. Compared with physical measurements, the errors in standard test scores are enormous. There are many sources of error. These include:

Mechanical errors in transcribing short answers or multiple choice answers
Consistency errors in scoring essays or extended answers
Computer errors when calculating or reporting results
Systematic errors from varying difficulty of different test versions
Random errors arising from the physical or mental states of test-takers
Bias errors: test designs that favor some groups of test-takers over others
Content errors: test items that do not accurately cover the intended material

Vendors and promoters of standard tests do not often discuss errors of testing. When they do, they usually bury information in opaque language, tables and formulas found in "technical reports" that may be hard to obtain. Careful reading of such information often reveals defects in the error evaluation as well as large errors.

Test vendors typically present themselves as diligent in reducing or eliminating mechanical, consistency, computer and systematic errors. There are well developed methods for controlling these gross errors. However, such errors do occur. Advanced Systems, a company used by the Massachusetts Board of Education since 1986, was embarrassed by errors in score reporting in Kentucky and lost its Kentucky contract in 1997 (see Szechenyi, 1998, and "Problems," 1998). Gross errors seem to be more common with smaller and newer test vendors than with larger and longer established ones.

The most common error measurement for a standard test is its "reliability." By convention, this describes the range of scores which a test-taker would receive in taking repeated, comparable versions of a test (Rogers, 1995, pp. 61-62, 368-378 and 741-743). A narrow range means high reliability: a test-taker would be likely to receive about the same score on repeated tests. Because training effects occur when tests of a particular type are actually repeated, indirect methods must be used to estimate reliability, such as mathematical models. Details of these methods can be adjusted to change estimates of reliability.

When mechanical, consistency, computer and systematic errors have been well controlled, reliability mainly measures random errors arising from unpredictable, individual circumstances of test-takers. Such errors are often larger than is generally known. As cited by Owen and Doerr (1999, p. 72), the Educational Testing Service has estimated that, on average, individual differences of less than 70 points for its SAT Verbal scores and 80 points for its SAT Math scores are not significant. These margins increase for high scores. Massachusetts (1999a, p. 86, Table 14-4) has estimated there is only about a 56 percent chance that a fourth-grader who is advanced in English language arts, according to its standards, will receive an "advanced" rating on its MCAS fourth-grade English language arts test.

People who are unfamiliar with the large random errors of standard test scores often assume that the scores can be used reliably to rank-order test-takers according to ability. In fact, random errors of testing are so great that scores can be used at most to classify individuals in a few levels. Using only four levels to classify MCAS scores, Massachusetts (1999a, p. 86, Table 14-4) has estimated substantial likelihoods, ranging from 8 to 46 percent, that an MCAS test-taker will be misclassified.

Many types of bias errors have been discovered in standard tests. For example, if the format of a test is changed from multiple choice to essay, different groups of test-takers are favored. A study performed by the Educational Testing Service found that multiple choice questions on its advanced placement tests favored men and European-Americans, while essay questions favored women and African-Americans (cited by Sacks, 1999, p. 205). Grouping test-takers with high essay and low multiple choice scores and those with the reverse pattern, the study showed comparable college grades for the two groups but a sixty point difference in their average Educational Testing Service SAT scores, in favor of the group with high multiple choice scores (Sacks, 1999, p. 206).


People tested using a language in which they are not fluent are likely to do much worse than native speakers of the language. Tests that require reading, in the formats used for most standard testing, assume reading proficiency. Individuals with poor reading proficiency, whatever the cause, are at major disadvantage with respect to others who do not have such limitations. Bias caused by test timing and ambiguous questions has been previously mentioned. Most attempts to compensate for bias involve identifying substantially impaired individuals and providing them extra test time. There is little evidence that test bias is actually corrected with this approach (see Heubert and Hauser, 1999, p. 199).

Perhaps the greatest source of bias and content error in school-based standard testing is the conventional process of standard testing itself, as contrasted with rating actual performance. When an educational assessment should measure success at significant tasks, such as writing a research report or investigating a technical theory, it may be impossible to design a standard test with much accuracy or predictive strength. In the U. S., there has been a movement toward replacing standard testing with criterion-based "performance assessment" (see Appendix 6). A goal of this movement, also called "authentic assessment," is eventually to integrate educational testing with the ordinary processes of teaching and learning. There have been attempts to use performance assessment as part of state testing programs in Kentucky (1990-1997) and California (1991-1995), reviewed by McDonnell (1997, pp. 5-8 and 62-65).

School Accountability

The performance of public schools became an issue in the U. S. almost as soon as support for public education began. In 1845 the Massachusetts Board of Education printed a voluntary written examination to measure eighth-grade achievement. Most students could not pass the test. Schoolmasters complained that knowledge tested did not match their curricula. After a few years the test was abandoned (see Appendix 2). In 1874 the Portland, Oregon, school superintendent distributed a curriculum for each of eight school grades. At the end of the school year, he administered written tests on the curriculum. Test scores were published in a newspaper. Based on test scores, less than half the students were promoted that year and the following year. An uprising by parents and teachers then led to dismissal of the superintendent and an end to the practices of publishing scores and denying promotion on the basis of a test score alone. (Note 14) Since those days similar initiatives and reactions have often occurred throughout the U. S.

The U. S. has sponsored a continuing expansion of public education for 350 years. Most people did not expect to graduate from eighth grade until late in the nineteenth century. High-school graduation became a normal expectation only in the 1930s. Today, we are still struggling with rising expectations that include college. At each stage of this growth, critics have condemned the lowering of educational standards and demanded accountability. However, each of these stages can also be seen as intrusion into a formerly elite province of education by large numbers of students who would previously have been excluded. For several years, levels of performance go down as the system adapts to less prepared students. Over a longer period, curricula change, often abandoning cultural traditions for more practical approaches.

School accountability became a public demand during the first two decades of the twentieth century. (Note 15) Over the ten years from 1905 through 1914 the U. S. accepted the largest flow of immigrants in its history, averaging more than a million per year.
Immigration, coupled with stronger school attendance laws, raised school enrollments and increased the fraction of students for whom English was not a native language. Declines in student achievement were noticed and became an object of public concern. At first standard tests were used to document declining student achievement, but they did not provide a method to improve it. By 1920 many urban school systems had started to use the newly available intelligence tests to measure student aptitude; they grouped students in classes by IQ. (Note 16) Educators hoped to improve performance by providing instruction that was adjusted to student aptitudes. In 1925 a U. S. Bureau of Education survey (cited by Feuer et al., 1992, p. 122, footnote 91) showed that 90 percent of urban elementary schools and 65 percent of urban high schools had adopted this approach. As immigration declined and school attendance became more uniform, student achievement tended to stabilize, and public concern relaxed. Despite warnings from progressives such as John Dewey and Walter Lippmann about a "mechanical…civilization" run by "pseudoaristocrats" (Dewey, 1922), IQ testing and the multiple choice test format had acquired prestige as techniques to improve public schools.

Strong U. S. demand for school accountability arose again in the 1970s through the 1990s. This time aptitude testing and finances played significant roles. Acceptance of Scholastic Aptitude Test scores as a measure of merit by highly selective colleges was regarded by many people as sanctioning a measure of merit for public schools. Average SAT scores for schools and communities began to circulate as tokens of prestige or shame. During the period from 1963 through 1982, the Educational Testing Service reported a continued decline in its national average SAT scores, followed by a slower recovery, as shown by the scores in Table 1.

Table 1
SAT National Average Scores

Test / Year    1963    1980    1995
SAT Verbal      478     424     431
SAT Math        502     466     482

Source of data: Ravitch, 1996.

These scores, scrutinized year after year, were used by the press, broadcast media and opportunist politicians to stir up a new sense of crisis. Once again, the public schools must be failing.

The charges were false. Accurate tracking of changes over time requires painstaking steps to assure that both the measurements and the groups being measured are comparable at each point. As shown by Crouse and Trusheim (1988, pp. 133-134) and by Feuer et al. (1992, pp. 185 ff.), the groups being measured by SAT scores changed drastically. Increases in scholarships and loans, affirmative action programs, and awareness of long-term financial rewards produced more applications to selective colleges. The number of colleges requiring SAT scores more than doubled. As a result, the number of students taking the SAT series for college applications grew from 560 thousand in 1960 to 1.4 million in 1980, an increase of 150 percent over a period in which public school enrollment grew only 16 percent. Students with lower high-school grades were taking these tests who would not have taken them in previous years. Spreads in scores increased significantly, reflecting more diversity in test-takers. Berliner (1993) shows that SAT scores of students with similar characteristics were actually increasing.

Other school-based standard tests do show changes over this period, but they are not parallel trends. Beginning in 1969, reading, writing, science and mathematics skills have been measured by the National Assessment of Educational Progress (NAEP), a federal research program. Scores remained roughly steady through 1996, with typical average scores of 280-300 points at the high-school level and typical changes across this period of less than 10 points (see Appendix 1). NAEP reading comprehension scores would probably have fallen and then risen along with SAT Verbal scores if the SAT scores reflected real changes in education. Actually NAEP high-school reading scores were flat within a band of 1% over the entire 1971-1996 period. There may have been declines in science during the 1970s, but changes in NAEP procedures make them uncertain. During the past 20 years, at the high-school level there appear to have been modest gains in science and math and a slow but persistent decline in writing skills (while SAT Verbal scores were rising). Overall patterns of NAEP scores indicate little change in educational achievement. However, these research results do not generate flashy headlines or sound bites, and they are usually ignored.

The other major cause of concern during the last three decades of the twentieth century has been the increasing cost of public schools (see Appendix 1). Proportionately, spending rose even faster from 1950 to 1970, but that was also a period of rapid growth in school enrollment, the "baby boom" generation, and a period of anxiety over the possibility of nuclear war. Annual, inflation-adjusted public school spending grew from about $1,570 per student in 1950 to $3,720 in 1970 and $7,140 projected for 2000 (all in 1998 dollars). Total public school spending climbed even during the 11 percent enrollment drop from 1970 to 1980.
By demanding accountability the public has in part been seeking value in return for its reasonably generous support.

"School Reform"

Accountability is a political concept, not an educational one. The public figures who talk about it loudest today want "school reform," a familiar war cry in U. S. politics. (Note 17) The measures many current "school reformers" promote are:

Frequent school-based standard testing with "high goals"
Publication of scores for individual schools or districts
Denial of school activities and diplomas to students with low scores
Removal of principals and teachers in schools with low scores

Some politicians go further. (Note 18) In 1983, the Reagan administration embraced a system that would circulate test scores to colleges and employers, maintaining permanent national dossiers of people's test records. The Bush administration proposed legislation in 1991 including these concepts, but it was defeated in Congress. Just what such a program might do to people never seems to have been a concern for the "school reform" promoters.

In the name of "school reform," without any federal mandate, state legislatures and politically controlled state education boards have been increasing the use of standard tests in public schools and the punishments for low test scores. Typical of the state-run "school reform" programs are the following measures:

Statewide standard achievement tests in several or all school grades
Statewide standard tests for course credit, promotion and graduation
"Curriculum frameworks," or required curricula, "aligned" to standard tests
Access to advanced courses and special programs based on standard test scores
Athletic team participation and student privileges based on standard test scores
Special diplomas, honor programs and scholarships based on standard test scores
Classification of school performance based on standard test scores
Publication of test scores or classifications by school or by district
Publicity about school testing requirements, changes and schedules
Financial support for "test preparation" consultants and materials
Financial incentives for administrators and teachers to achieve high test scores
Removal of administrators and teachers in schools with low test scores
State seizure or closure of schools with low test scores

Also associated with "school reform" are movements to support religious schools via "school choice" and financial "vouchers" and initiatives to create privately run "charter schools."

In 1980 eleven states required minimum scores on their standard tests to receive a high-school diploma. By 1997 seventeen states enforced such a requirement (National Center for Education Statistics, 1999, Table 155). During the years 2000-2005 several states, including Alaska, California, Delaware, Massachusetts, New York and Texas, are planning one or more of the following "school reform" initiatives:

Add standard tests for course credit, promotion or graduation.
Raise or begin enforcing required scores.
Dismiss principals of low scoring schools.
Place low scoring schools in receivership.

About two-thirds of the current states with high-school graduation tests are southern or southwestern states; they tend to have larger fractions of poverty and low-income households than the national averages. The students who are denied high-school diplomas typically come from the most disadvantaged households in those states.
Texas has a program often pointed to by "school reform" advocates as a model (see Appendix 4). The program is politically controlled by the governor and state legislature. It has changed several times since its inception in 1984. The key feature for the last ten years is a test system called TAAS, which includes high school graduation requirements. Under this system, there have been reports of weeks spent on test cramming and "TAAS rallies." School ratings are raised by "exempting" students. Schools are allowed to contract for "test preparation" consultants and materials, and some have spent tens of thousands of dollars. There have been reports of falsifying results. In April, 1999, the deputy superintendent of the Austin school district, which had shown dramatic score improvements, was indicted for tampering with government records. In Houston three teachers and a principal were dismissed for prompting students during test sessions ("TAAS scandal," 1999). Official Texas statistics claim reductions in school dropouts, but independent studies consistent with U. S. government data show persistent increases, with 42 percent of all students failing to receive a high school diploma as of 1998 ("Longitudinal Attrition Rates," 1999). Students identified by Texas as black or Hispanic are disproportionately affected. In some schools 100 percent of students with limited English proficiency drop out (IDRA, 1998). Illiteracy remains a major problem in Texas, and it appears to be worsening.

New York has recently released part of the initial results from its new high-school graduation tests. Based on currently required scores, they show that diplomas are likely to be denied at about twice the statewide rate to students in New York City who complete high school (see Appendix 3). The city has the largest concentrations of poverty in the state. In five years New York will increase the required scores by abolishing so-called "local" diplomas. The probable result will be an even more severe impact on students from poverty and low-income households.

State-run "school reform" has operated largely on the basis of beliefs, not evidence. There is little evidence that these programs actually work as intended. Feuer et al. (1992) show that claims for improved achievement, as measured by test scores, are often hollow. They are commonly a result of training students to take the standard tests (also see Sacks, 1999, pp. 117-151). When a new series of tests is substituted, scores typically return to levels, measured against national norms, that are similar to scores when the previous series of tests began.

If "school reform" has caused substantial improvement in student achievements, measurements performed by the National Assessment of Educational Progress (NAEP) ought to reveal it. This longstanding federal research program has taken care to provide broad coverage of educational content, to maintain consistency in its testing over time, and to avoid test formats with sources of bias such as hectic pacing and heavy dependence on reading proficiency in tests other than reading (see Feuer et al., 1992, pp. 90-94). Test formats use multiple choice, short answer, extended answer and essay questions, with scales of partial credit. Since participating schools change, there is little opportunity or incentive for students to be taught the tests.
From about 11,000 to 44,000 students participated in each of the test series given from 1982 through 1996.

Most of the geographically segmented data published for the NAEP are grouped by regions rather than by states. The Northeast region includes Connecticut, Delaware, District of Columbia, Maine, Maryland, Massachusetts, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island and Vermont. From 1982 through 1996 none had a major "school reform" program; only one of the twelve had a high-school graduation test (only New York; see National Center for Education Statistics, 1999, Table 155). The Southeast region includes Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, Virginia and West Virginia. From 1982 through 1996 all had major "school reform" programs and eleven of these twelve had high-school graduation tests (all except Kentucky; see National Center for Education Statistics, 1999, Table 155). Average NAEP scores reported for these two regions from 1982 through 1996 are shown in Table 2.

Table 2
NAEP Regional Average Scores, 1984 and 1996

Reading Scores        Northeast                Southeast
                  1984   1996   Change     1984   1996   Change
    Grade 11       292    291      -1       285    279      -6
    Grade 8        260    261      +1       256    252      -4
    Grade 4        216    220      +4       204    206      +2

Writing Scores        Northeast                Southeast
                  1984   1996   Change     1984   1996   Change
    Grade 11       291    290      -1       287    273     -14
    Grade 8        273    264      -9       267    260      -7
    Grade 4        212    213      +1       204    200      -4

Math Scores           Northeast                Southeast
                  1982   1996   Change     1982   1996   Change
    Age 17         304    309      +5       292    303     +11
    Age 13         277    275      -2       258    270     +12
    Age 9          226    236     +10       210    227     +17

Science Scores        Northeast                Southeast
                  1982   1996   Change     1982   1996   Change
    Age 17         284    296     +12       276    288     +12
    Age 13         254    255      +1       239    251     +12
    Age 9          222    234     +12       214    224     +10

Source of data: National Center for Education Statistics, 1997.

If a case can be made for improvement that may have been caused by "school reform," it is in math and science, where both regions had score improvements but those of "school reform" states were better. However, "school reform" states had worse changes in reading and writing scores. The Northeast, without major "school reform," improved scores an average of 2.8 points, while the Southeast, under major "school reform," improved scores an average of 3.4 points. With the random errors in scores estimated for NAEP, the difference in these results has no statistical significance (National Center for Education Statistics, 1997, pp. iii-vi). At the high-school level, the changes measured in "school reform" states were somewhat better in math, the same in science, somewhat worse in reading and substantially worse in writing. Despite great hopes for "school reform," there is no general evidence of benefit.

"School reform" is strongly associated with high dropout rates and low rates of high-school graduation. Nationally about 32 percent of public school students aged 15 through 17 are enrolled below normal grade levels, a figure that climbed steadily during the years 1979 through 1992. (Note 19) Statistics on school dropout cannot be evaluated readily, since government reporting procedures have been changing, possibly to conceal unfavorable trends (see Appendix 4). Table 3 estimates normal high-school graduation rates for the class of 1996 as percentages of ninth-grade enrollments in the fall of 1992.
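As a cross-check on the Table 2 discussion above, the two regional averages quoted there can be reproduced directly from the table. The sketch below, whose variable and function names are illustrative only, takes the twelve score changes listed for each region and averages them with equal weight. Equal weighting is an assumption, although it does reproduce the 2.8 and 3.4 point figures after rounding.

```python
from statistics import mean

# Score changes from Table 2, listed from the oldest to the youngest group.
# Reading and writing cover 1984-1996; math and science cover 1982-1996.
CHANGES = {
    "Northeast": {
        "reading": [-1, +1, +4],
        "writing": [-1, -9, +1],
        "math": [+5, -2, +10],
        "science": [+12, +1, +12],
    },
    "Southeast": {
        "reading": [-6, -4, +2],
        "writing": [-14, -7, -4],
        "math": [+11, +12, +17],
        "science": [+12, +12, +10],
    },
}


def average_change(region: str) -> float:
    """Unweighted mean of all twelve changes reported for a region."""
    return mean(c for subject in CHANGES[region].values() for c in subject)


for region in CHANGES:
    # Per-subject totals make the pattern behind each average visible.
    by_subject = {s: sum(v) for s, v in CHANGES[region].items()}
    print(region, by_subject, f"average {average_change(region):+.2f}")
```

The per-subject totals show the pattern described in the text: the Southeast gains more in math and science but loses more in reading and writing, so the overall averages (33/12 and 41/12 points) end up close together.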
(Note 20) Table 3 compares nine southern and southwestern states under major "school reform," requiring minimum scores on standard tests for graduation, with nine northeastern states that did not have major "school reform" programs:

Table 3
High-school graduation rates by state, 1996
(Percentage normal high-school graduation, class of 1996)

    States under "school reform"      States without "school reform"
    Alabama            58%            Connecticut        74%
    Florida            58%            Maine              72%
    Georgia            55%            Massachusetts      76%
    Louisiana          58%            New Hampshire      75%
    Mississippi        57%            New Jersey         83%
    North Carolina     62%            New York           62%
    South Carolina     54%            Pennsylvania       76%
    Texas              58%            Rhode Island       71%
    Virginia           76%            Vermont            90%

Source of data: National Center for Education Statistics, 1996 and 1999.

Only one southern or southwestern state with major "school reform" had a normal graduation rate above two-thirds, while only one of the northeastern states had a rate below two-thirds. The worst northeastern state is New York, which has a longstanding Regents examination for high-school graduation but during the 1992-1996 period was also awarding "local" diplomas (see Appendix 3).

Reform Schools and Private Interests

By the early 1990s, with reform schools entrenched for ten years or more in several states, a perverse competition began, which might be called Our Standards Are "Stiffer" Than Yours:

    We make tests harder.
    We mandate more tests.
    We raise minimum scores.
    We enforce more punishments.

See Heubert and Hauser (1999, pp. 59-67) and Sacks (1999, pp. 98-99 and 114). As with most of "school reform," the process is political (see Appendix 4 and Appendix 5). Typically, it is known that test scores ramp up for a few years and then flatten out. Otherwise there is little organized review of whether the testing and punishment systems actually produce harm or benefit for anyone. Nevertheless, state governors and legislators vie for TV spots and news headlines with commitments to "raise standards." In states without major "school reform," politicians are prepared to exploit anxiety over somehow being left behind. (Note 21)

Many states are trying "school reforms" faster than their school systems can adapt. Seeking to change educational content and testing practices at the same time worsens these problems. It has become common first to impose a test and then to "align" the curriculum, obviously putting the cart before the horse.
Even states with a relatively stable curriculum and incremental changes in testing, such as North Carolina, have fallen prey to this disease (McDonnell, 1997, pp. v and 8-11). Some "school reformers" like the Pioneer Institute in Boston utilize the resulting chaos in political karate, aiming to promote "charter schools" which are actually private business ventures fed by tax revenues. James A. Peyser, Executive Director of Pioneer Institute, is currently Chairman of the Massachusetts Board of Education. Charles D. Baker, Jr., a member of the Pioneer Institute Board of Directors, is also a member of the Massachusetts Board of Education. Former and current directors of the Pioneer Institute founded Advantage Schools, Inc., of Boston, a for-profit business that has opened two Massachusetts charter schools and fourteen charter schools in other states.

These cross-interests and educational mistakes need to be made familiar to the public. They are usually ignored by the large newspapers and broadcast media unless a tragedy occurs. (Note 22) In contrast to the strong interest in test scores, our press, broadcast media and politicians show only sporadic interest in the education process. Effective innovations such as team teaching, "looping" and open classrooms are being neglected or forgotten (see Tyack and Cuban, 1995, pp. 86-107). Science and math have been emphasized, but long-term surveys of achievement suggest that progress in these areas has occurred partly at the expense of writing skills. Only computer technology gets much attention, but its limits are becoming apparent. While classroom computers are convenient for exploring the Internet and organizing assignments, they have otherwise taught students few skills.

By conventional standards of psychological testing, (Note 23) major test vendors have been earning revenue from highly questionable uses of their products. While technical manuals may advise that their achievement tests are not "validated" for uses such as school rating or promotion tests, they sell large volumes of these tests to jurisdictions using them for purposes other than individual counseling. For example, the Stanford Achievement Test series, published by Harcourt Brace Educational Measurement, is being used by the state of California to rate and compare school districts (see Appendix 5). The Iowa test series, from the Riverside Publishing division of Houghton Mifflin, is being used by the city of Chicago as promotion tests (see Roderick et al., 1999).
When so used, these tests effectively set the curriculum and the standards of performance for public schools, without meaningful public input or control. Parents and taxpayers are poorly informed about test validation and about strong effects these tests have in setting educational standards.

Taking a cue from Horace Mann, who fought for school standards and then moved to Congress a century and a half ago (see Appendix 2), many modern politicians have sought to use "school reform" as a platform for advancement. The "school reform" movement has enough momentum that few state officeholders and candidates openly oppose it. Candidates for state offices often use "school reform" backgrounds to support their campaigns. In 1996 Governor Wilson of California attempted to mount a campaign for President; Governor Bush of Texas is doing the same this year. Wilson left office after the defeat of his 1998 plan (Proposition 8) to create state-appointed "governing councils" for all California public schools, in charge of budgets. Taking a moderate approach, such as supporting smaller class sizes and improved facilities, has sometimes won out over "back to basics" appeals, as it did in the victory of Tom Vilsack over Jim Ross Lightfoot in the 1998 election for governor of Iowa.

The Social Context

School-based standard testing does not occur in a social vacuum. It has consequences, and the techniques it uses reflect interests and values. Insight and candor about these consequences, interests and values are rare today; they must often be inferred from behaviors. In previous times, the advocates of standard testing were less guarded about their intents.

It has become well known that early promoters of standard aptitude tests were profoundly racist and sexist. Goddard, Terman, Thorndike, Burt, Yerkes and Brigham all believed that these tests identified African-Americans, Native Americans, immigrants from southern and eastern Europe, or women as typically less able than white men whose ancestors came from northern and western Europe. (Note 24) Goddard, Terman and Brigham were advocates of the "eugenics" movement, (Note 25) favoring IQ tests followed by sexual restriction of the "feeble-minded." An echo of their attitudes can be heard in the enthusiasms for standard tests sometimes expressed in the U. S. today, reducing access by African-Americans and Hispanic-Americans to universities and professional schools. Few of the modern promoters of standard tests flaunt prejudices that were once openly displayed. Relative success on these tests by Jews and by the offspring of Asian immigrants has greatly tempered hubris over "Nordic superiority."

The myth of measuring innate talent has been exposed. Multifactor studies link high scores on aptitude tests with advantages in family income, language and cultural exposure, motivation, self-confidence and training (see, for example, Goslin, 1963, pp. 137-147, Duncan and Brooks-Gunn, 1997, pp. 132-189, and Brooks-Gunn et al., 1996). Key research on the inheritance of intelligence, once widely cited, has been probed and found to have been scientific fraud (Gould, 1981, pp. 234-239). After accounting for measurable influences of environment, studies of multiple factors do leave unexplained residues that might be called aptitudes, but they can only be inferred from comparisons across groups. There are no reliable techniques for measuring aptitudes in an individual which are independent of experience, nor has it been shown how many such aptitudes there might be.

Despite exposures of motive and mythology, use of standard testing continues to grow. A century after their origins, school-based standard testing and its scavenger, test preparation, have become industries sustained by powerful institutions and deeply felt personal interests. Their supporters are now often driven by secondary motives that result from widespread testing programs.
At least two generations have been able to profit from test-taking success, entering professions and making connections during their college years that might otherwise have been closed to them. They know how to crack the tests; they make sure their children learn; and they can be angered to think that this useful wedge into income and influence might be removed.

Today's standard test enthusiasts range from right-wing extremists to hard-nosed business people to ambitious young professionals to church schools and home schoolers who are looking for validation of their work: in other words, some of our neighbors. Parents who want to keep young children out of the testing game are now beset with legal mandates in many states and with social pressure almost everywhere. Far too few people are asking whether the public schools are really broken and in need of this kind of a fix (see Berliner, 1993, and Berliner and Biddle, 1995).

Among the right wing, there is a Libertarian perspective from which conventional standard tests are an intrinsic evil because they interfere with local control of schools. Also, it is worth noting that a number of the business enthusiasts for standard testing actually send their own children to private schools where such testing is not emphasized. Berliner and Biddle (1995) have extended such observations into an argument that some testing promoters have a different agenda: using the embarrassment of low test scores in public schools as a weapon to force governments toward corporate schools, which they will operate at a profit.

Much as in the 1920s, its first great decade, school-based standard testing is still sold as a key to discovering talent and measuring ability objectively. When possible its critics are ignored, or they are dismissed as extremists, dreamers or losers. Test development and scoring procedures are wrapped in mystification. "Validation" of tests is widely touted, but it usually means only that people who do well on one test do well on another. Public enlightenment has made progress, but it struggles upstream against a flow of laundry soap, liver pills and snake oil.

What have all the years of more than 100 million school-based standard tests a year (Note 26) brought us? The "one minute" people, perhaps, who judge anything that takes longer as not worth the bother. Try to make life into a rush of standard questions. The idiot-genius computer programmers, fast as lightning. The ones who saddled us with about $200 billion worth of "year 2000" problems, because they didn't think about a slightly bigger picture. The test prep industry, a scrounger that otherwise has no purpose. The product support staff who don't know what to do when they run to the end of their cheat sheets. The cutback from education to test cramming in the states with standard punishment systems. Don't take chances; teach and learn the test.

Remedies

School-based standard testing has seen more than a century of development in the U. S. (see Appendix 7). No quick or simple remedy can cure the many problems it has caused. Any remedy will require resolute public action. The following priorities are essential:

    Stop using standard test scores to deny promotion or graduation.
    Stop using standard test scores to create financial incentives or penalties.

These are the key weapons of the state punishment systems. The significance and accuracy of standard test scores do not justify these measures. They are viruses that transform schools from education to test cramming. They are all harm and no benefit. If we do not stop the damage being wreaked by these mistaken "school reforms," no other remedies will matter much.

If the catastrophes from "school reform" can be curtailed, we can tackle the worst problems of current school-based standard testing:

    Conflict of purpose. We are trying to use the same tests to measure basic competence as to measure high levels of skills and knowledge.
    Conflict of method. We say that we want to measure meaningful skills and knowledge, but our test methods stress empty tasks and fast answers.
The root of these conflicts is the same: choosing speed and price over effectiveness. If we want accurate and meaningful results, we must reverse these priorities. Good tests will not be quick or cheap. A test to measure basic competence in a skill or subject must cover a broad range of what we believe basic competence should mean. A test to measure high levels of skills and knowledge must include open-ended tasks that can be performed with many different strategies. We will need to weigh costs and benefits carefully. Even when they do not corrupt education, meaningful tests will take time and resources that could have been spent otherwise.

The "authentic assessment" and "performance assessment" movements seek to combine educational assessments with the learning process. Classic models are the "course project" and the "term paper." While the intents of these movements are understandable, Kentucky and California experiences in the 1990s suggested that such techniques were not mature enough to provide reliable comparisons among schools or school districts, much less to create promotion or graduation tests (Sanders and Horn, 1995). Moreover, we have no school-based achievement tests at all that have been proven to predict meaningful accomplishments by students in the world beyond the schoolhouse door.

Schools probably test too much, yet at the same time they may fail to use tests when tests can help. A key example is poor and late diagnosis of reading disorders. A great fraction of adult activities require proficient reading; most school activities and standard tests do also. We know that some young students have much more difficulty reading than others, although they may otherwise have strong skills. Schools need to identify reading disorders as early as possible and help to remedy them before they become deeply ingrained.

Limited and conflict-ridden as it is, current standard testing shows systematic deficits for students from low-income and minority households. Better testing will give a better picture of how serious these problems are, but it will not cure them. We need plans and resources to address the problems which are already clearly understood:

    Language. We should teach standard spoken English as a second language to students from households where it is not spoken. We should not disparage dialects or other languages, but we must equip students early with this essential skill.
    Motivation. Other than language, the key barrier for students from low-income and minority households is weak motivation. Home and school partnerships have shown how this problem can be overcome. We must create and strengthen them.

We do not understand all the problems. We do not know how to solve all the problems that we do understand. But we know enough to begin. If not now, then when?

Validity and Relevance

School-based aptitude testing is known to have low predictive strength. Studies have shown that it heavily reflects the income and education levels of students' households and that most of what it can predict is associated with social advantages and disadvantages.
If tax-supported or tax-exempt schools use scores on intelligence or other aptitude tests to deny opportunities to some students while providing them to others, they violate the public trust.

For school-based achievement testing, we have few studies of predictive strength (as one example, see Allen, 1996, section IV-B, pp. 118-120). In most circumstances, we simply do not know whether these tests measure anything apart from social privilege that is useful outside a school setting. After adjustment for social factors, can their scores accurately predict future success in occupations, creative achievements, earning levels, family stability, civic responsibility or any of the other outcomes we mean to encourage with public education? Are there alternative assessments that can accomplish these goals? Given the heavy engagement in "school reforms" and the energy spent on their testing programs, it is amazing to see how little attention these matters receive (see related observations by Broadfoot, 1996, pp. 14-15). Academic and foundation-supported scholars specializing in psychometrics have the greatest opportunities to answer these questions, but they have largely ignored them.

Journalists, broadcasters, bureaucrats, politicians, educators and their critics, like most of the public, usually assume that a mathematics test, for example, actually measures some genuinely useful knowledge and skill. Who has shown this to be true, and for which tests? Is there actually a strong and consistent relation, for example, between top scores on a particular high school math achievement test and a successful career as a civil engineer? If there were not, then what does that test measure? Is there a strong and consistent relation between acceptable scores on a social studies test and adult voting participation? If there were not, then how is such a test of use?

Unfortunately, it is far from proven that any method of assessment can escape the biases, the other errors, and the low or unknown predictive strengths outside the schools which plague the current tests. We should take this not as a signal of defeat but as an invitation to humility. The complexities of human behavior are immense, and our current approaches measure them poorly. Rather than try to stretch each student onto a Procrustean bed of so-called "achievement," taking pride in lengthening the beam a bit every few years, we need to promote core competence and recognize the diversity of other skills. If standard tests were to have any useful role, it would most likely be as an aid to help ensure that students can exercise skills which have been clearly proven essential for ordinary occupations. Even such a limited objective as this requires both education and test validation well beyond current educational and psychometric practices.

As we question the validity of testing, we may also question the relevance of the education supposedly being tested. Are we using the irreplaceable years of youth to convey significant skills and knowledge, or are we cultivating fetishes and harping on hide-bound answers to yesterday's questions? Somehow, despite decades of claims that our schools are inferior, we in the U. S. have achieved a stronger economy than most other industrial countries. Yet we also have more crime than most of these countries. Is our education responsible for these situations? We have many such issues to address. They present truly difficult questions. None of them will be found on school-based standard tests.

Notes

Comments and suggestions from several reviewers are gratefully acknowledged. Mistakes or omissions remain, of course, the fault of the author.

1. For a viewpoint characteristic of the era, see Rickover, 1959.
2. Pope Pius XI, as spoken in "...the defenders of order against the spread of Godless communism," Christmas Allocution, The Holy See, Rome, 1936. "Godless communism" became a popular phrase among cold-war patriots of the era.
3. Lemann, 1995, recounts the history of draft-deferment testing.
4. Commonly called "standardized testing." The underlying purpose of such tests is to set a standard that is calibrated for a population.
5. Reischauer and Fairbank, 1958, pp. 106-107, describe Chinese origins in the Western (Earlier) Han Dynasty, c. 120 BCE.
6. Schultz, 1973, reviews the industrial model for public schooling.
7. Massachusetts House Speaker Thomas Finneran. See Lehigh, 1998.
8. Goslin, 1963, p. 82 (footnote 2), indicates that the relatively low predictive strengths of aptitude tests for college grades were well known by around 1960.
9. Crouse and Trusheim, 1988, pp. 124-127, review predictive strength for the SAT vs. family incomes and high school grades. Nairn and Associates, 1980, show that SAT scores tend to act as proxies for family income. Tyack, 1974, pp. 214-215, cites an equivalent claim for IQ scores made by the Chicago Federation of Labor in 1924.
10. Sacks, 1999, p. 183 (note 23), cites a negative correlation between GRE aptitude test scores and publishing records for academic historians.
11. Owen and Doerr, 1999, Appendix C, list 284 U. S. colleges and universities where SAT and ACT scores are optional for admission into bachelor's programs.
12. Merton, 1957, pp. 421-436, calls such a phenomenon a "self-fulfilling prophecy."
13. Sacks, 1999, pp. 182-185, cites and summarizes several relevant studies.
14. Tyack, 1974, pp. 35-36 and 47-48, recounts the two examples cited of nineteenth-century school testing.
15. Tyack, 1974, pp. 126-147, shows how demands for accountability were used to cement control of public schools by business leaders and school supervisors.
16. Tyack, 1974, pp. 194 and 206-216, recounts the rapid spread of standard testing in the 1920s.
17. Tyack, 1974, pp. 41-46, recounts the first major U. S. school reform, the system of graded classrooms, inspired by Prussian schools and introduced to the U. S. in the 1840s and 1850s. Tyack and Cuban, 1995, explore the history of twentieth-century school reform movements in the U. S.
18. A Nation at Risk, published by the National Commission on Excellence in Education, U. S. Department of Education, in April, 1983, is cited as inspiring many of these initiatives.
19. See Appendix 1. Precedents from the past are worse. In 1922, New York City reported that nearly half of all students were "above normal age for their school grade," as cited by Feuer et al., 1992, p. 118.
20. Data from National Center for Education Statistics, 1996, and National Center for Education Statistics, 1999. See 1995 Table 41 for ninth-grade enrollments and 1998 Table 102 for high school graduates. No attempt is made to adjust for immigration, emigration, mortality or population movement between states.
21. An egregious example of these effects can be seen in California from 1994 through 1997, during the Wilson administration. See Appendix 5.
22. Albert L. Powers, "Questionable reform," Carlisle Mosquito, Carlisle, MA, October 29, 1999. Paul Dunphy, "Charter schools fail promises," Daily Hampshire Gazette, Amherst, MA, February 7, 2000. Beth Daley and Doreen I. Vigue, "Firm pulls out of school where boy died," Boston Globe, February 10, 2000.
23. Standards 6.12, 8.7 and 8.12 in Committee to Develop Standards for Educational and Psychological Testing, 1985, pp. 43 and 53-54. These standards, jointly developed by the American Psychological Association, American Educational Research Association and National Council on Measurement in Education, were also updated in 1999.
24. Brigham, 1923, pp. 87 ff., says "...the foreign born are intellectually inferior," then analyzes inferiority by races and origins.
25. For the proposition that "no feeble-minded person should ever be allowed to marry or to become a parent," Goddard, 1914, p. 561. On "curtailing the reproduction of feeble-mindedness," Terman, 1916, p. 7. On "prevention of the continued propagation of defective strains," Brigham, 1923, p. 210. All three men modified their views in later years.
26. Since at least 1961. See Goslin, 1963, pp. 53-54.

References

Allen, W. B., Project Director (1996). A New Framework for Public Education in Michigan. East Lansing, MI: James Madison College, Michigan State University.

Associated Press (1999, June 3). Blacks nearly four times more likely to be exempt from TAAS than whites. Capitol Times, Austin, TX.

Berliner, D. C. (1993). Educational reform in an era of disinformation. Educational Policy Analysis Archives 1 (2), available at l.

Berliner, D. C., & Biddle, B. J. (1995). The Manufactured Crisis: Myths, Fraud, and the Attack on America's Public Schools. Reading, MA: Addison-Wesley.

Brigham, C. C. (1923). A Study of American Intelligence. Princeton, NJ: Princeton University Press.

Broadfoot, P. M. (1996). Education, Assessment and Society: A Sociological Analysis. Philadelphia, PA: Open University Press.

Brooks-Gunn, J., et al. (1996). Ethnic differences in children's intelligence test scores. Child Development 67 (2), 396-408.

California Department of Education (2000). Academic Performance Index School Rankings, 1999. Sacramento, CA: Department of Education, Delaine Eastin, State Superintendent.

Ceci, S. J. (1991). How much does schooling influence general intelligence and its cognitive components? A reassessment of the evidence. Developmental Psychology 27 (5), 703-722.

Census Bureau (1992). Census of 1990. Washington, DC: U. S. Department of Commerce.

Cole, P. G. (1995). The bell curve: Should intelligence be used as the pivotal explanatory concept of student achievement? Issues In Educational Research 5 (1), 11-22.

Committee to Develop Standards for Educational and Psychological Testing, Melvin R. Novick, Chair (1985). Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.

Conant, J. B. (1940, May). Education for a classless society. Atlantic Monthly 165 (5), 593-602.

Cremin, L. A. (1962). The Transformation of the School: Progressivism in American Education, 1876-1957. New York: Alfred A. Knopf.

Crouse, J., & Trusheim, D. (1988). The Case Against the SAT. Chicago: University of Chicago Press.

Culbertson, J. (1995). Race, intelligence and ideology. Educational Policy Analysis Archives 3 (2), available at l.

Daley, B., & Zernike, K. (2000, January 26).
State may change MCAS contractor. Boston Globe.Dewey, J. (1922, December 13). Individuality, equal ity and superiority. The New


21 of 43Republic 33 (419), pp. 61-63. Duncan, G. J., & Brooks-Gunn, J., Eds. (1997). Consequences of Growing Up Poor. New York: Russell Sage Foundation.Feuer M. L., et al., Eds. (1992). Testing in American Schools: Asking the Right Questions (Publication OTA-SET-519). Washington, DC: U. S. Co ngress, Office of Technology Assessment.Goddard, H. H. (1914). Feeble-mindedness; Its Causes and Consequences. New York: Macmillan.Goslin, D. A. (1963). The Search for Ability. New York: Russell Sage Foundation. Gould, S. J. (1981). The Mismeasure of Man. New York: W. W. Norton and Co. Haney, W. M. (1999). Supplementary Report on the Texas Assessment of Aca demic Skills Exit Test (TAAS-X). Boston: Center for the Study of Testing, Evaluation and Educational Policy, Boston College School of Educat ion. Hayman, R. L., Jr. (1997). The Smart Culture: Society, Intelligence, and Law. New York: New York University Press.Heubert, J. P., & Hauser, R. M., Eds. (1999). High Stakes Testing for Tracking, Promotion and Graduation. Washington, DC: National Academy Press. Holloway M. (1995, January). Flynn's effect. Scientific American 280 (1), 37-38. Hunt, E. (1995). The role of intelligence in modern society. American Scientist 83 (4), 356-369.IDRA Newsletter (1998, January). Intercultural Development Research Association, San Antonio, TX.Kessel, C., & Linn, M. C. (1996). Grades or scores: Predicting future college mathematics performance. Educational Measurement: Issues and Practice 15 (4), 10-14. Lehigh, S. (1998, June 28). For teachers, criticism s from many quarters. Boston Globe. Lemann, N. (1995, September). The great sorting. Atlantic Monthly 276 (3), 84-100. Longitudinal Attrition Rates in Texas Public High S chools, 1985-1986 to 19971998 (1999). Intercultural Development Research Associat ion, San Antonio, TX. Massachusetts Department of Education (1999a). Massachusetts Comprehensive Assessment System 1998 Technical Report. Malden, MA: Department of Education, David P. 
Driscoll, Commissioner.Massachusetts Department of Education (1999b). Massachusetts Comprehensive Assessment System, Report of 1999 State Results. Malden, MA: Department of Education, David P. Driscoll, Commissioner.McDonnell, L. M. (1997). The Politics of State Testing: Implementing New Stu dent


22 of 43Assessments (Publication CSE-424). Los Angeles: National Center for Research on Evaluation, Standards and Student Testing, Universi ty of California. McDonnell, L. M., & Weatherford, M. S. (1999). State Standards-Setting and Public Deliberation: The Case of California (Publication CSE-506). Los Angeles: National Center for Research on Evaluation, Standards and St udent Testing, University of California.Merton, R. K. (1957). Social Theory and Social Structures. Glencoe, IL: Free Press. Nairn, A., & Associates (1980). The Reign of the ETS: The Corporation that Makes Up Minds. Washington, DC: Center for the Study of Responsive Law. National Center for Education Statistics (1996). Digest of Education Statistics, 1995. Washington, DC: U. S. Department of Education.National Center for Education Statistics (1997). NAEP 1996 Trends in Academic Progress (Publication NCES 97-985). Washington, DC: U. S. De partment of Education. National Center for Education Statistics (1999). Digest of Education Statistics, 1998. Washington, DC: U. S. Department of Education.Neisser, U., Ed. (1998). The Rising Curve: Long-Term Gains in IQ and Related Measures. Washington, DC: American Psychological Association. New York State Education Department (1998). New York State School Report Card for the School Year 1996-1997. Albany, NY: Education Department, Richard P. Mills, Commissioner.New York State Education Department (1999). New York State School Report Card for the School Year 1997-1998. Albany, NY: Education Department, Richard P. Mills, Commissioner.Owen, D., & Doerr, M. (1999). None of the Above (Revised ed.). Lanham, MD: Rowman and Littlefield Publishers.Problems with KIRIS test erode public's support for reforms (1998, February 2). Lexington Herald-Leader, Lexington, KY. Ravitch, D. (1996, August 28). Defining literacy do wnward. New York Times. Regional Profile, Juarez and Chihuahua (1999). Texas Centers for Border Educational Development, El Paso, TX.Reischauer, E. 
O., & Fairbank, J. K. (1958). East Asia: The Great Tradition. Boston: Houghton Mifflin.Rickover, H. G. (1959). Education and Freedom. New York: E. P. Dutton and Co. Roderick, M., et al. (1999). Rejoinder to Ending Social Promotion: Results from the First Two Years. Chicago: Consortium on Chicago School Research, Des igns for Change.


Rogers, T. B. (1995). The Psychological Testing Enterprise. Pacific Grove, CA: Brooks/Cole Publishing Co.
Sacks, P. (1999). Standardized Minds. Cambridge, MA: Perseus Books.
Sanders, W. L., & Horn, S. P. (1995). Educational assessment reassessed: The usefulness of standardized and alternative measures of student achievement as indicators for the assessment of educational outcomes. Education Policy Analysis Archives 3 (6), available at
Schultz, S. K. (1973). The Culture Factory: Boston Public Schools, 1789-1860. New York: Oxford University Press.
Szechenyi, C. (1998, March 8). Failing grade? Firm with state's assessment contract has troubled past. Middlesex News, Framingham, MA.
TAAS scandal widens (1999, April 9). Lone Star Report, Austin, TX.
Terman, L. M. (1916). The Measurement of Intelligence. Boston: Houghton Mifflin.
Texas Education Agency (1998). 1998 Comprehensive Biennial Report on Texas Public Schools. Austin, TX: Education Agency, Jim Nelson, Commissioner.
Tyack, D. B. (1974). The One Best System. Cambridge, MA: Harvard University Press.
Tyack, D. B., & Cuban, L. (1995). Tinkering toward Utopia: A Century of Public School Reform. Cambridge, MA: Harvard University Press.

About the Author

Craig Bolon
Planwright Systems Corporation, Inc.
Email:

Craig Bolon is President of Planwright Systems Corp., a software development firm located in Brookline, Massachusetts, USA. After several years in high energy physics research and then in biomedical instrument development at M.I.T., he has been an industrial software developer for the past twenty years. He is author of the textbook Mastering C (Sybex, 1986) and of several technical publications. He is an elected Town Meeting Member and has served as member and Chair of the Finance Committee in Brookline, Massachusetts.

Appendix 1 Information: U. S. Public Education

Figure 1 (on two pages, U. S. Dept. of Education, 1997) shows NAEP national average scores from program inception through 1996.




The next chart, Figure 2, with data in Table 4, shows estimated U. S. public school enrollment and spending for the years 1850-2000. Enrollment is for elementary and secondary schools, including kindergarten, in millions. Spending includes local, state and federal outlays, in US$ billions, adjusted to 1998 dollar equivalence by the annualized Consumer Price Index.

The last chart, Figure 3, shows U. S. public school enrollment aged 15-17 retained below modal grade, for the years 1971 through 1998. The increase in enrollment below modal grade is caused by increases in retention rates at all grades as well as by later ages of first school enrollment (Heubert and Hauser, 1999, pp. 136-158).

Figure 2. U. S. public school enrollment and spending.

Table 4
U. S. Public School Enrollment and Spending, for Figure 2

Year    Enrollment    Spending              Spending per student
        (millions)    ($ billions, 1998)    ($, 1998)
1850       3.4
1860       4.8
1870       6.9
1880       9.9
1890      12.7
1900      15.5            4.2                   270
1910      17.8            7.4                   420
1920      21.6            8.4                   390
1930      25.7           22.6                   880
1940      25.4           27.3                  1070
1950      25.1           39.5                  1570
1960      35.2           86.0                  2440
1970      45.9          170.8                  3720
1980      40.9          189.9                  4650
1990      41.2          265.4                  6440
2000      47.4          338.6                  7140

Sources: U. S. Department of Education, Digest of Education Statistics, 1998 (spending not available in this series before 1900); U. S. Census Bureau, Census of 1850 and Census of 1860; U. S. Bureau of Labor Statistics, Consumer Price Index: All Urban Consumers (annual averages, estimated before 1913).

Figure 3. U. S. public school enrollment below modal grade.

Source: "The population 6 to 17 years old enrolled below modal grade: 1971 to 1998," Current Population Survey Report – School Enrollment – Social and Economic Characteristics of Students, U. S. Bureau of the Census, Washington, DC, Supplementary Table A-3, October, 1999.

Appendix 2 Information: Massachusetts

The Mather school, the first free public school in the U. S., was founded in Dorchester, Massachusetts, in 1639. In 1647 the Massachusetts General Court enacted a law requiring every town of 100 families or more to provide free public education through the eighth grade, but attendance was not required. In 1821 Boston opened English High School, the first free public high school in the U. S. It taught English, history, logic, mathematics and science but did not offer the traditional Latin curriculum. An 1827 Massachusetts law required every town with 500 or more families to support a free public high school, and an 1852 law required school attendance to the age of 14, the first such laws in the U. S. Massachusetts took over 30 years to reach compliance with each.

Massachusetts created a state Board of Education in 1837 to set standards for public schools, then in disarray. Horace Mann, a state senator from Boston and former state representative from Dedham, became the first Secretary to the Board. In 1839, at Mann's urging, Massachusetts created its first state-supported teacher's college, located in Lexington (now in Framingham). In 1845, following disputes over the quality of instruction, the Board of Education issued a voluntary written examination for public school eighth-graders, consisting of 30 short-answer questions. In its first year, the average score was less than one-third correct answers. Scores were soon used to compare schools in the press. Schoolmasters complained that knowledge tested did not correspond to their curricula. After Mann entered Congress in 1848 the examination was discontinued. During the following 138 years the Board of Education did not require testing of students.

In 1986 the Board of Education began statewide student testing called the Massachusetts Educational Assessment Program (MEAP). Among its purposes was to provide comparisons between student achievements in the state and student achievements being measured since 1969 through NAEP, the National Assessment of Educational Progress. Fourth-grade and eighth-grade tests of reading, mathematics and science were given every two years from 1986 through 1996. These tests were designed and administered by Advanced Systems in Measurement and Evaluation, Inc., of Dover, NH.
Questions were in multiple choice, short answer and extended answer formats. Only aggregate scores for the state were publicly reported. Scores for individual schools were not disclosed. While Massachusetts average scores were above national averages, from 26 to 32 percent of the 1992-1996 scores were "below basic," the lowest of four classification levels.

The Massachusetts Education Reform Act of 1993 required revised educational standards and procedures. In January, 1998, the Board of Education began using a new Massachusetts Teacher Test as a part of teacher certification. A communication and literacy skills test and a subject test in one of 41 areas must be passed. These tests, recently renamed the Massachusetts Educator Certification Tests, are being prepared and administered by National Evaluation Systems, Inc., of Amherst, MA, designer of the California Basic Educational Skills tests and the Texas Academic Skills Program tests. They are strictly timed and include multiple choice reading comprehension questions, short answer vocabulary and grammar questions, and a written composition. Testing was initiated without a tryout period for the test and with relatively little advance notice about test content or consequences. In the first group of candidates, less than half passed both parts of the test. As one result, only white candidates were certified to teach in Massachusetts.

In 1995, the Board of Education released "curriculum frameworks," or required curricula, for mathematics and for science and technology. It later issued frameworks for English language arts and for history and social science. In the spring of 1998, after a tryout period in 1997, the Board began a new student testing program in the fourth, eighth and tenth grades called the Massachusetts Comprehensive Assessment System (MCAS). It includes tests in English language arts, mathematics, science and technology, and history and social science. They are loosely timed and include questions in multiple choice, short answer and extended answer formats. Through 1999, the test for history and social science has been administered only to eighth-grade students. Total testing time is about ten to fifteen hours, depending on the year and number of tests, with about half typically spent on English language arts. Scores are reported in a 200-280 point range; they are classified in four levels, equally spaced in the score range, called "advanced," "proficient," "needs improvement" and "failing." Parents are not permitted to exempt their children from testing. There are alternative procedures, such as small group settings, for special needs students and for students for whom English is not a native language.

Beginning in 1999, aggregate scores for each school in the state were publicly reported. Individual scores are disclosed to schools and parents. Schools also receive an analysis of results for each test item. Both 1998 and 1999 test items have been released to the public. The 1999 tests were offered in Spanish as well as English. Statewide, the results for 1998 and 1999 were similar; combined results from these two years are shown in Table 5.

Table 5
Massachusetts MCAS Average Scores, 1998-1999.

MCAS English Language Arts, statewide, 1998-1999 combined

School   Average   Percent    Percent      Percent Needs   Percent
Grade    Score     Advanced   Proficient   Improvement     Failing
10       229        4         32           34              30
 8       237        3         52           31              14
 4       230        0         20           66              14

MCAS Mathematics, statewide, 1998-1999 combined

School   Average   Percent    Percent      Percent Needs   Percent
Grade    Score     Advanced   Proficient   Improvement     Failing
10       222        8         16           23              53
 8       226        7         23           29              41
 4       234       12         23           44              21

MCAS Science and Technology, statewide, 1998-1999 combined

School   Average   Percent    Percent      Percent Needs   Percent
Grade    Score     Advanced   Proficient   Improvement     Failing
10       225        2         21           40              37
 8       224        4         24           29              43
 4       239        8         44           38              10

MCAS History and Social Science, statewide, 1998-1999 combined

School   Average   Percent    Percent      Percent Needs   Percent
Grade    Score     Advanced   Proficient   Improvement     Failing
10
 8       221        1         10           40              49
 4

Source of data: Massachusetts Department of Education, 1999b.
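The tenth-grade "failing" percentages in Table 5 set bounds on the share of students who fail at least one test, even though that share has not been reported. The Python sketch below works out those bounds. It assumes that the same students took all three tenth-grade tests, which Table 5 does not confirm, and the function name is illustrative only.

```python
# Percent of tenth-grade students at the "failing" level, from Table 5
# (statewide, 1998-1999 combined). History and social science is omitted
# because no tenth-grade results are listed for it.
failing_pct = {
    "English Language Arts": 30,
    "Mathematics": 53,
    "Science and Technology": 37,
}


def bounds_failing_at_least_one(pcts):
    """Return (lower, upper) bounds, in percent, on students failing >= 1 test.

    Assumption: every student took every test in `pcts`.
    Lower bound: all students who fail any test are among those who fail
    the most-failed test, so the share equals the largest percentage.
    Upper bound: no student fails more than one test, so the shares add,
    capped at 100 percent.
    """
    values = list(pcts.values())
    return max(values), min(100, sum(values))


lower, upper = bounds_failing_at_least_one(failing_pct)
print(f"Failing at least one tenth-grade test: {lower}% to {upper}%")
```

Under that assumption the lower bound is 53 percent, set by the mathematics test alone, so a majority of tenth-graders failed at least one test regardless of how the failures overlap.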


The Board of Education has released a technical analysis of the 1998 MCAS which includes estimates of the consistency of its classification levels (Massachusetts Department of Education, 1999a). These are phrased in terms of the probability that a student who might receive a particular classification level after many repeated tests of some type would be classified at the same level by any one of those tests. Estimated probabilities range from 56 to 92 percent; they are highest for the "failing" level, averaging 85 percent, and lowest for the "advanced" level, averaging 70 percent. This technical analysis considers "validity" only in the narrow sense of comparison with other test results. Strong correlations, from .6 to .8, were found with components of the Stanford Achievement Test series. Significant MCAS score differences between male and female students and large score differences between students of different ethnic backgrounds were found, a pattern that is commonly duplicated by aptitude tests. Neither the methodology for computing the reported scaled scores nor the basis for classifying scores into passing and failing levels has been disclosed to the public.

In 2003, passing scores on tenth-grade tests will be required for a high-school diploma. The Board of Education has also announced plans to remove principals of schools which receive low scores and do not improve. The Board has not reported the fraction of students failing at least one of the tenth-grade tests, but statewide it must be more than half, since 53 percent failed the tenth-grade mathematics test alone (Table 5). By 2003, Massachusetts may be denying a diploma to a majority of students who complete high school, based on their failure to achieve passing scores on its standard tests.

Like the MEAP tests, the 1998 and 1999 MCAS tests were designed and administered by Advanced Systems in Measurement and Evaluation, Inc., of Dover, NH.
Advanced Systems won a 1995 contract estimated at $25 million over competitors Riverside Publishing, publisher of the Iowa Tests of Educational Development, and Harcourt Brace Educational Measurement, publisher of the Stanford Achievement Tests. Advanced Systems has been a target of state investigations for its work in Maine and New Hampshire. In 1997, it lost a contract in Kentucky after being accused of gross errors in test scoring ("Problems," 1998). Scoring errors by the firm have also been reported in Maine. Tests that use extended answer questions, as those in Massachusetts do, must be scored by individual test evaluators. There have been reports of hasty scoring by Advanced Systems test evaluators working under time pressures and of computer programming errors by the company (Szechenyi, 1998).

In the summer of 1999, the Board of Education opened competitive bidding for the MCAS program of 2000-2004. Bids were received from the same companies as in 1995. In January, 2000, Commissioner of Education David P. Driscoll announced that Harcourt Brace Educational Measurement had received preliminary selection, with final negotiations in progress (Daley and Zernike, 2000). Problems with this change in vendors can be expected. A new vendor lacks the time to repeat the review and tryout process of the first MCAS series before testing starts in April, 2000.

Appendix 3 Information: New York

The state of New York began to appropriate funds for support of public schools in 1795. In 1814, all New York municipalities were required to participate in a statewide system of public school districts. At the time, these schools charged tuition to cover differences between operating costs and state funding. In 1867, free public schooling became a requirement of law. The current school year of 180 days was set in 1913 and the current school-leaving age of 16 was set in 1936.

The New York Board of Regents, originally responsible for supervising higher education, began high-school entrance examinations in 1865, later called "preliminary" examinations. In 1878 it began examinations for graduation from high schools. In the 1880s the Board began inspection visits to public schools. A 1904 reorganization put the Board of Regents in charge of standards for all public education. One response was gradual strengthening of secondary school attendance. Another was development of detailed curricula aimed at preparing students for higher education. Throughout the nineteenth and twentieth centuries, high-school students in New York have been able to obtain a "local" high-school diploma without meeting Regents examination requirements.

In the 1930s the New York City schools began administering the Metropolitan Achievement Test series, designed by The Psychological Corporation, for diagnosis and guidance. In the 1970s the New York Education Department began using this test series, now provided by Harcourt Brace Educational Measurement, for its statewide Pupil Evaluation Program. This program administered tests of reading and mathematics in grades 3 and 6, tests of writing in grade 5, and tests of social studies in grades 6 and 8. During the years 1993 to 1996, the Department gradually changed to the California Achievement Tests, provided by CTB/McGraw-Hill. Throughout these years, the Department also administered the Regents Preliminary Competency Tests of reading and writing in grades 8 and 9.

Beginning in 1999 the Education Department is replacing its elementary and secondary school tests with new Program Evaluation Tests, planned since 1994 and piloted during 1995 through 1998. These tests, developed by CTB/McGraw-Hill, are strictly timed and include questions in multiple choice, short answer, extended answer, essay and laboratory performance formats. Tests for English language arts, mathematics and science are to be administered in grades 4 and 8. Social studies tests are to be administered in grades 5 and 8.
Test items are disclosed to the public. Only English language arts and mathematics tests are being given in 1999 and 2000. Tests are currently offered only in English.

The New York Regents high-school graduation examinations are by subject. Beginning with a few subjects, the examination catalog reached a peak of 68 subjects in 1925. After years of consolidation, by the 1960s the catalog was reduced to English, mathematics, science, social studies and certain foreign languages. Subsequent revisions introduced technical education subjects. In 1998 the Education Department announced a new series of statewide tests in English, mathematics, science, global history and geography, and U. S. history and government, starting in 1999. The new Regents examinations have been developed by CTB/McGraw-Hill. They are strictly timed and include questions in multiple choice, short answer, extended answer, essay and laboratory performance formats. Most tests are offered in English only; some have also been offered in Chinese, Haitian Creole, Korean, Russian and Spanish. Scores of 65 on all tests are now required for a Regents diploma and scores of 55 are required for a "local" diploma. Beginning in 2005, there will be no more "local" diplomas.

A 1987 New York law requires an annual report from the Education Department covering enrollment, student achievement, graduation and dropout rates, and other topics. This is known as the School Report Card. Data tables accompanying these reports show numbers or percentages of students statewide and by school district receiving certain score levels on tests. School Report Card data tables are being released about 9 months after the end of a school year. Statewide percentages of grade enrollment receiving Regents examination scores in specified ranges are shown in Table 6.

Table 6
New York Regents Examination Scores, 1997 and 1998

1997 Examination score    55 or more   65 or more   85 or more
Comprehensive English        63%          56%          17%
Mathematics I                66%          59%          29%
Biology                      51%          44%          15%
US History                   56%          48%          15%
Global Studies               57%          48%          14%

1998 Examination score    55 or more   65 or more   85 or more
Comprehensive English        65%          57%          15%
Mathematics I                70%          62%          33%
Biology                      51%          44%          16%
US History                   60%          52%          17%
Global Studies               65%          56%          17%

Source of data: New York State Education Department, 1998 and 1999.

Although data tables for the 1999 Regents examinations are not yet available, a summary for the English language arts test has been released. It shows that statewide 78 percent of grade enrollment has received a score of 55 or more on this examination, much higher than the percentage on the previous Comprehensive English examination. However, in the New York City schools only 55 percent of grade enrollment has passed this examination, with 35 percent yet to attempt it.

Appendix 4 Information: Texas

The Republic of Texas enacted laws to support free public education in 1845, in anticipation of statehood later that year. It also created a state fund to provide part of the cost of the public school system. Through the rest of the century public education was limited to eight grades in many rural areas, although high schools were founded in cities. In 1911 Texas reorganized its state education system to provide public high schools in all rural areas.

In 1984 the Texas legislature passed House Bill 72, a public "school reform" law. This revised the state's financial support for education, providing more funds for low-income districts, and it directed the Texas Education Agency to establish school performance standards and administer a statewide high-school graduation test. Until 1990 Texas used a series of tests focused on minimum competence. In that year, as required by law, it began introducing over a four year period a testing program designed to raise the expected level of skills, using a new test series.
The Texas Assessment of Academic Skills (TAAS) is a series of standard tests given in the third through tenth grades in reading, writing, mathematics and social studies. These tests are untimed and in multiple choice format except for essays in writing tests. They have been organized by National Computer Systems of Minneapolis, MN, as prime contractor. Harcourt Brace Educational Measurement performs test development; it has involved about 7,000 Texas educators in the process.


TAAS tests are available in English and Spanish, and there is an alternate assessment process for students in special education. Satisfactory scores on the tenth-grade tests in reading, writing and mathematics are required for a high-school diploma. Texas also has standard tests on which passing scores are required to obtain credit for certain high-school courses, currently Algebra I, Biology I, English II and U. S. History. In 1999 passing three such tests in the tenth grade was made equivalent to passing the entire TAAS series. Starting in 2005 a new Texas law will require high-school graduates to get passing scores on new standard tests of English language arts, mathematics, science and social studies, taken in the eleventh grade.

Since 1994 Texas has used an Accountability Rating System to report school and district performance. Schools are rated as "exemplary," "recognized," "acceptable" or "low performing." The key criteria are TAAS scores, for which large racial and ethnic differences have been documented. For rating purposes, students are classified in four groups: white, African-American, Hispanic and economically disadvantaged. To achieve a school rating, the minimum scores for that rating are required in each group. There are also requirements for high attendance and low dropout rates. Ratings are published in newspapers. Schools with strong ratings or progress receive financial rewards, currently a total of $2.5 million per year statewide.

Texas public colleges and universities have a standard qualifying examination, the Texas Academic Skills Program (TASP) test. It is an untimed test of reading, writing and mathematics in multiple choice format, plus an essay, all prepared and administered by National Evaluation Systems, Inc., of Amherst, MA. No one is denied admission based on TASP scores, but passing scores are required to graduate from two-year colleges and to take junior and senior courses at four-year colleges.
The test is waived for students with high enough scores on certain other tests.

Racial differences in Texas test scores are well documented (Texas Education Agency, 1998). According to Texas statistics, the percentage of success for TASP is about the same for men and women, but the percentage of success for whites is more than twice that for African-Americans. The success rate on the tenth-grade TAAS series in 1998 was 85 percent for white students, 60 percent for Hispanic students, and 56 percent for African-American students. So far, however, all legal challenges to racial and ethnic differences in Texas standard test scores have failed. New arguments are being used by plaintiffs seeking to overcome the judicial barriers encountered in previous lawsuits. There is no objective evidence to sustain the passing scores set by Texas for the TAAS high-school graduation examinations, and the state provides no program to assure that the tests cover what is taught in the schools (Haney, 1999).

Texas is in denial about the dropout rates its program appears to be causing; official statements claim substantial decreases in dropout rates, to 10-15 percent. U. S. Department of Education enrollment data indicate much higher dropout rates. Haney (1999, p. 22) notes that Texas Education Agency definitions of "dropout" have changed several times in the last ten years. Longitudinal dropout rates in Texas have been surveyed over several years by an independent organization, the Intercultural Development Research Association. Its estimates for the school years ending in 1986 through 1999 are shown in Figure 4.
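An enrollment-based dropout estimate of the kind mentioned above can be sketched as a simple cohort comparison: ninth-grade enrollment set against the number of graduates a few years later. The Python sketch below is illustrative only. The enrollment figures are hypothetical, not taken from Texas or federal data, and the function name is an assumption. The calculation ignores migration, mortality and grade retention, and it is not presented as the method used by the Texas Education Agency or by the independent survey.

```python
def cohort_attrition_rate(ninth_grade_enrollment, graduates):
    """Fraction of a ninth-grade cohort not counted among later graduates.

    A crude estimate: it makes no adjustment for students who move in or
    out of the state, who die, or who are held back and graduate late.
    """
    if ninth_grade_enrollment <= 0:
        raise ValueError("ninth-grade enrollment must be positive")
    return 1.0 - graduates / ninth_grade_enrollment


# Hypothetical cohort (not real data): 300,000 ninth-graders and
# 195,000 graduates counted a few years later.
rate = cohort_attrition_rate(300_000, 195_000)
print(f"Estimated cohort attrition: {rate:.0%}")
```

With these made-up figures the estimate is about 35 percent. Because late graduates and students who transfer out are counted as lost, an estimate of this kind will tend to run higher than an official rate that tracks individual students.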


Figure 4. Texas dropout rates.

Source: Longitudinal Attrition Rates in Texas Public High Schools, 1985-1986 to 1998-1999, Intercultural Development Research Association, San Antonio, TX, 1999. By permission. Chart prepared by the author. Not shown are data for Asian/Pacific Islander and Native American students. No data were published for 1991 or 1994.

The estimates in Figure 4 are consistent with U. S. Department of Education data. They show that introduction of TAAS in 1990-1995 was associated with a significant increase in dropout rates which has been sustained in the years since. Although the impact of TAAS has been heaviest on African-American students, in some schools 100 percent of students with limited English proficiency drop out (IDRA, 1998). While the impact of TAAS on Hispanic students has been less than the impact on African-American students, Hispanic students remain the group with the highest dropout rates.

Some students who do not receive a diploma at normal high-school graduation age continue in school and obtain a conventional diploma later, or they return to school after having dropped out, or they earn a certificate by passing the GED or a similar test, or they arrange to begin higher education without high-school credentials. U. S. Census data suggest that by age 24 half or more of high-school dropouts may have extended their education up to or beyond high-school equivalence. However, there is no consistent source of statistical data on these educational outcomes in Texas, in most other states or for the U. S. (Heubert and Hauser, 1999, pp. 136-137 and 172).

Under TAAS, there have been reports of weeks spent on test cramming and "TAAS rallies." School ratings are raised by "exempting" students (Associated Press, 1999). Schools are allowed to contract for "test preparation" consultants and materials, and some have spent tens of thousands of dollars. There have been reports of falsifying results.
In 1998 the Austin Independent School District produced dramatic TAAS score improvements; then in April, 1999, Deputy Superintendent Kay Psencik and the school district were indicted for tampering with government records. In Houston three teachers and a principal were dismissed for prompting students during test sessions ("TAAS scandal," 1999).


Illiteracy remains a major problem in Texas. Over 80 percent of Texas prison inmates have been found functionally illiterate. The four largest cities, Dallas, Houston, San Antonio and El Paso, have adult illiteracy rates of 12 to 19 percent. Statewide, the Texas adult illiteracy rate is 12 percent, second worst of any state in the U. S. (Census Bureau, 1992). In communities near the Mexican border, where rates are highest, illiteracy among children has increased during the years under TAAS (Regional Profile, 1999).

Appendix 5 Information: California

In 1961 California began programs of achievement testing in its public schools, with testing procedures and standards under local school district control. A 1972 state law created the California Assessment Program, under which multiple choice tests for reading, writing and mathematics were administered in grades 2, 3, 6 and 12, with grade 8 added in 1983. By 1987 a writing sample and a test for U. S. history and economics had been added. In 1988 the Board began to offer Golden State Examinations, intended to identify and honor outstanding students in public schools. In 1998 about 2,700 high-school graduates received merit diplomas based on these test scores.

In 1978 California voters passed Proposition 13, radically restricting local funds for schools in most communities. Passage of Proposition 62 in 1986 hobbled the ability of state government to assist with funding for education. Proposition 98, approved in 1988, set a school funding floor at a relatively low level and has tended to prevent further erosion. Since 1978 California has fallen from among the top ten states in many national ratings of education to among the bottom ten. California education initiatives since the 1970s must be viewed in the context of the state's flamboyant and reactionary politics and its drastic change in financial support for public schools.
A 1991 state law authorized a new California Learning Assessment System, and the previous testing program was gradually discontinued. In 1994 the new program died after a veto of legislation by the governor, leaving the state with no statewide testing except the Golden State Examinations. In 1995 new state laws established a Pupil Testing Incentive Program and required statewide standards. The Board of Education began to establish "curriculum frameworks," or required curricula (see McDonnell, 1997). In 1997, before the new testing program had been fully implemented, another new state law replaced it with requirements for revised curriculum standards and nationally normed standard tests, to be designated by the Board of Education. In 1997 and 1998 the Board of Education specified new content standards for reading, writing, mathematics, science, and history and social science (see McDonnell and Weatherford, 1999). Curriculum frameworks and tests are being revised and developed to correspond.

As required by the 1997 California law, the Board of Education began a Standardized Testing and Reporting (STAR) Program in 1998. Its major component is annual administration of the Stanford Achievement Tests, published by Harcourt Brace Educational Measurement, to all students in grades 2 through 11. Grades 2 through 8 are tested in reading, writing, spelling and mathematics. Grades 9, 10 and 11 are tested in reading, writing, mathematics, science and social science. There are also "augmentation" tests in language arts and mathematics, intended to reflect the California curriculum, with additional tests in preparation.

By state law, STAR tests are provided only in English, although about forty percent of California's public school students come from Spanish-speaking households. These are strictly timed tests in multiple choice formats plus writing samples. Total testing time is about six hours. Parents may exempt their children from testing. Test items are not being disclosed to the public. California public schools are forbidden by law to use test preparation materials specifically designed for these tests. Their use by parents who can afford them is not restricted.

In April, 1999, the California legislature passed and its governor signed a law called the Public Schools Accountability Act. It requires the state to publish an Academic Performance Index (API) annually for each public school. It also provides extra funding for low performing schools and a system of awards for high performing schools. A total of $100 million was appropriated for awards in 1999. The 1999 law also requires the Board of Education to develop and administer promotion and graduation tests, starting in 2001. After three years, passing scores will be required to enter high school and to obtain a high-school diploma.

For 1999 the Board of Education defined the API on the basis of Stanford Achievement Test scores (California, 1999). It reflects student score ranks, weighted by subject content. Weights for grades 2 through 8 are reading 30 percent, writing 15 percent, spelling 15 percent, and mathematics 40 percent. Weights for grades 9 through 11 are 20 percent each for reading, writing, mathematics, science and social science. A school with all students ranking in the top 20 percent of the distribution of scores will have an API of 1,000, while a school with all students ranking in the bottom 20 percent will have an API of 200. The 1999 API ratings for California public schools are summarized in Figure 5.

Figure 5. California school ratings.

Source: Academic Performance Index School Rankings, 1999, California Department of Education, Sacramento, CA, January, 2000. Chart prepared by the author. Data were grouped into the API ranges shown. Four schools were unrated.

The official goal is to raise all schools to an API of 800.
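
The subject weighting reported for the 1999 API can be illustrated with a short sketch. The subject weights and the two endpoint values (1,000 for the top fifth of the score distribution, 200 for the bottom fifth) are taken from the description above. The values assigned to the three middle fifths, the function names and the sample data are assumptions made only for illustration; the official formula may differ.

```python
# Subject weights reported for the 1999 API (fractions of the total).
WEIGHTS_GRADES_2_8 = {
    "reading": 0.30, "writing": 0.15, "spelling": 0.15, "mathematics": 0.40,
}
WEIGHTS_GRADES_9_11 = {
    subject: 0.20
    for subject in ("reading", "writing", "mathematics", "science", "social science")
}

# Value credited for each fifth of the national score distribution, lowest first.
# Only the endpoints (200 and 1000) come from the article; the evenly spaced
# middle values are an ASSUMPTION for illustration.
BAND_VALUES = (200, 400, 600, 800, 1000)


def band_value(percentile_rank):
    """Return the value for the fifth containing a percentile rank (1-99)."""
    band = min(int(percentile_rank) // 20, len(BAND_VALUES) - 1)
    return BAND_VALUES[band]


def school_index(students, weights):
    """Weighted average of band values over subjects for one school.

    `students` is a list of dicts mapping subject name to percentile rank.
    """
    index = 0.0
    for subject, weight in weights.items():
        values = [band_value(student[subject]) for student in students]
        index += weight * sum(values) / len(values)
    return round(index)


if __name__ == "__main__":
    # Hypothetical school with two students in grades 2 through 8.
    sample = [
        {"reading": 85, "writing": 90, "spelling": 88, "mathematics": 95},
        {"reading": 10, "writing": 15, "spelling": 5, "mathematics": 12},
    ]
    print(school_index(sample, WEIGHTS_GRADES_2_8))
```

Under these assumptions, a school whose students all rank in the top fifth receives 1,000 and a school whose students all rank in the bottom fifth receives 200, matching the endpoints given above. Because each band is defined by rank in a score distribution, the index measures standing relative to other test takers rather than mastery of fixed content.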
Since the API is essentially comparing scores with averages, raising all schools to an API of 800 is a "Lake Wobegon" goal, to make "all the kids above average."

Appendix 6 Performance Assessment


"Performance assessment is a broad term. It covers many different types of testing methods that require students to demonstrate their competencies or knowledge by creating an answer or product. It is best understood as a continuum of formats that range from the simplest student-constructed responses to comprehensive demonstrations or collections of large bodies of work over time. This [section] describes some common forms of performance assessment.

"Constructed-response questions require students to produce an answer to a question rather than to select from an array of possible answers (as multiple-choice items do). In constructed-response items, questions may have just one correct answer or may be more open ended, allowing a range of responses. The form can also vary: examples include answers supplied by filling in a blank; solving a mathematics problem; writing short answers; completing figural responses (drawing on a figure like a graph, illustration, or diagram); or writing out all the steps in a geometry proof.

"Essays have long been used to assess a student's understanding of a subject by having the student write a description, analysis, explanation, or summary in one or more paragraphs. Essays are used to demonstrate how well a student can use facts in context and structure a coherent discussion. Answering essay questions effectively requires analysis, synthesis, and critical thinking. Grading can be systematized by having subject matter specialists develop guidelines for responses and set quality standards. Scorers can then compare each student's essays against models that represent various levels of quality.

"Writing is the most common subject tested by performance assessment methods. Although multiple-choice tests can assess some of the components necessary for good writing (spelling, grammar, and word usage), having students write is considered a more comprehensive method of assessing composition skills.
Writing enables students to demonstrate composition skills--inventing, revising, and clearly stating one's ideas to fit the purpose and the audience--as well as their knowledge of language, syntax, and grammar. There has been considerable research on the standardized and objective scoring of writing assessments.

"Oral discourse was the earliest form of performance assessment. Before paper and pencil, chalk, and slate became affordable, school children rehearsed their lessons, recited their sums, and rendered their poems and prose aloud. At the university level, rhetoric was interdisciplinary: reading, writing, and speaking were the media of public affairs. Today graduate students are tested at the master's and Ph.D. levels with an oral defense of dissertations. But oral interviews can also be used in assessments of young children, where written testing is inappropriate. An obvious example of oral assessment is in foreign languages: fluency can only be assessed by hearing the student speak. As video and audio make it possible to record performance, the use of oral presentations is likely to expand.

"Exhibitions are designed as comprehensive demonstrations of skills or competence. They often require students to produce a demonstration or live performance in class or before other audiences. Teachers or trained judges score performance against standards of excellence known to all participants ahead of time. Exhibitions require a broad range of competencies, are often interdisciplinary in focus, and require student initiative and creativity. They can take the form of competitions between individual students or groups, or may be collaborative projects that students work on over time.

"Experiments are used to test how well a student understands scientific concepts and can carry out scientific processes.
As educators emphasize increased hands-on laboratory work in the science curriculum, they have advocated the development of assessments to test those skills more directly than conventional paper-and-pencil tests. A few states are developing standardized scientific tasks or experiments that all students must conduct to demonstrate understanding and skills. Developing hypotheses, planning and carrying out experiments, writing up findings, using the skills of measurement and estimation, and applying knowledge of scientific facts and underlying concepts--in a word, 'doing science'--are at the heart of these assessment activities.

"Portfolios are usually files or folders that contain collections of a student's work. They furnish a broad portrait of individual performance, assembled over time. As students assemble their portfolios, they must evaluate their own work, a key feature of performance assessment. Portfolios are most common in writing and language arts--showing drafts, revisions, and works in progress. A few states and districts use portfolios for science, mathematics, and the arts; others are planning to use them for demonstrations of workplace readiness."

Source: Michael J. Feuer et al., Eds., Testing in American Schools: Asking the Right Questions, OTA-SET-519, Office of Technology Assessment, U. S. Congress, Washington, DC, 1992, p. 19.

Appendix 7 Chronology of Standard Testing in the U. S.

[Listed in brackets are some developments in other countries which had rapid and substantial impacts in the U. S.]

1900 The College Entrance Examination Board is founded at Columbia College in New York.
1905 [Alfred Binet publishes the first intelligence test, to identify slow learners.]
1908 Edward L. Thorndike, a Columbia professor, begins writing a series of standard achievement tests for use in elementary and high schools, completed in 1916.
1916 First publication of the Stanford-Binet IQ test by Houghton Mifflin, developed by Lewis M. Terman, a Stanford professor.
1916 Arthur S. Otis, a student of Terman and later a test editor for the World Book Company, invents the multiple choice format. It is used in the Army Alpha test.
1917 Robert M. Yerkes, a Harvard professor, organizes the Army Alpha and Beta intelligence tests, given to 1.7 million World War I recruits.
1921 The Psychological Corporation is founded in New York by James M. Cattell, Robert S. Woodworth and Edward L. Thorndike.
1923 First publication of the Stanford Achievement Tests by the World Book Company, developed under the direction of Lewis M. Terman.
1925 Carl C. Brigham, a Princeton professor, develops the Scholastic Aptitude Test for the College Entrance Examination Board.
1927 The California Test Bureau is founded in Los Angeles by Ethel M. Clark and Willis W. Clark, a Los Angeles school teacher.
1928 Everett F. Lindquist, a professor at the University of Iowa, begins the Iowa Testing Program in support of a scholarship competition.
1933 First publication of the Progressive Achievement Test series by the California Test Bureau, developed by Willis W. Clark and Ernest W. Tiegs.
1935 Louis L. Thurstone, a professor at the University of Chicago, publishes a theory of factor analysis as applied to psychometric testing.
1935 First publication of the Iowa Every-Pupil Test of Basic Skills by the University of Iowa Testing Bureau, developed under the direction of Everett F. Lindquist.
1936 IBM scores the New York Regents examination using a machine based on the Markograph soft pencil electrical technology invented by Reynold B. Johnson.
1938 The Mental Measurements Yearbook is first published by Oscar K. Buros, a Rutgers University professor.
1940 Houghton Mifflin acquires publishing rights to the Iowa Test of Basic Skills.
1941 The U. S. armed forces begin using the Army General Classification Test and other standardized tests, given to more than 10 million World War II recruits.
1942 First publication of the Iowa Tests of Educational Development by Houghton Mifflin, developed under the direction of Everett F. Lindquist.
1942 The College Entrance Examination Board replaces its traditional essay tests with multiple choice tests.
1943 Everett F. Lindquist first administers the Test of General Educational Development (GED).
1944 [Great Britain's Parliament approves the Education Act of 1944, beginning the "eleven-plus" examination restricting admission to grammar schools and access to higher education.]
1947 The Educational Testing Service is founded by Henry Chauncey to prepare and administer the Scholastic Aptitude Test (SAT) for the College Entrance Examination Board.
1949 First publication of the Wechsler Intelligence Scales by The Psychological Corporation, developed by David Wechsler, a professor at NYU Medical College.
1956 Houghton Mifflin introduces electronic scanners developed by Everett F. Lindquist and Albert N. Hieronymus, scoring test sheets on both sides without requiring soft pencil markings.
1958 The Educational Testing Service begins disclosing its SAT scores to test takers.
1959 The American College Testing (ACT) Program is founded by Everett F. Lindquist and Theodore McCarrel.
1960 Harcourt Brace and Co. acquires the World Book Publishing Co. and its Stanford Achievement Test series.
1968 McGraw-Hill acquires the California Test Bureau and its CTB Achievement Test series.
1969 Michigan begins a statewide program of standard testing, later expanded to high-school graduation requirements.
1970 Harcourt Brace acquires The Psychological Corporation.
1976 [Key research findings on the inheritance of intelligence by Cyril Burt, a former professor at University College, London, are exposed as scientific fraud.]
1979 Houghton Mifflin establishes a Riverside Publishing division to publish the Iowa achievement tests, Stanford-Binet IQ test and other school-based standard tests.
1979 New York's legislature passes and its governor signs the Educational Testing Act of 1979, a "truth in testing" law.
1983 The Reagan administration publishes A Nation at Risk, embracing a system of school-based standard tests and punitive sanctions for low scores.
1984 Texas begins a statewide program of standard testing, to be required in ten years for high-school graduation.
1985 The National Center for Fair and Open Testing is founded in Cambridge, MA.
1991 The Bush administration's proposed Excellence in Education Act, H.R. 2460, to create federal school and employment tests, is defeated in Congress.
1996 California begins a statewide program of standard testing, to be required in eight years for middle school and high-school graduation, with state receivership for schools with low scores.
1998 Massachusetts begins a statewide program of standard testing, to be required in five years for high-school graduation, with replacement of principals in schools with low scores.

Appendix 8 General Bibliography

David C. Berliner and Bruce J. Biddle, The Manufactured Crisis: Myths, Fraud, and the Attack on America's Public Schools, Addison-Wesley, Reading, MA, 1995.
James Crouse and Dale Trusheim, The Case Against the SAT, University of Chicago Press, Chicago, 1988.
David A. Goslin, The Search for Ability: Standardized Testing in Social Perspective, Russell Sage Foundation, New York, 1963.
Stephen J. Gould, The Mismeasure of Man, W. W. Norton and Co., New York, 1981.
Robert L. Hayman, Jr., The Smart Culture: Society, Intelligence, and Law, New York University Press, New York, 1997.
Jay P. Heubert and Robert M. Hauser, Eds., High Stakes Testing for Tracking, Promotion and Graduation, National Academy Press, Washington, DC, 1999.
Banesh Hoffmann, The Tyranny of Testing, Crowell-Collier Publishing Co., New York, 1962.
National Center for Education Statistics (Jay R. Campbell, Kristin E. Voelkl and Patricia L. Donahue, Eds.), NAEP 1996 Trends in Academic Progress, NCES 97-985, U. S. Department of Education, Washington, DC, 1997.
Nicholas Lemann, The Big Test: The Secret History of the American Meritocracy, Farrar, Straus and Giroux, New York, 1999.
Office of Technology Assessment (Michael J. Feuer et al., Eds.), Testing in American Schools: Asking the Right Questions, OTA-SET-519, U. S. Congress, Washington, DC, 1992.
David Owen, None of the Above: Behind the Myth of Scholastic Aptitude, Houghton Mifflin, Boston, 1985.
David Owen and Marilyn Doerr, None of the Above: Behind the Myth of Scholastic Aptitude, Rowman and Littlefield Publishers, Lanham, MD, Revised and Updated Edition, 1999.
Tim B. Rogers, The Psychological Testing Enterprise: An Introduction, Brooks/Cole Publishing Co., Pacific Grove, CA, 1995.
Peter Sacks, Standardized Minds: The High Price of America's Testing Culture and What We Can Do to Change It, Perseus Books, Cambridge, MA, 1999.
David B.
Tyack, The One Best System: A History of American Urban Education, Harvard University Press, Cambridge, MA, 1974.