
Educational policy analysis archives


Material Information

Title:
Educational policy analysis archives
Physical Description:
Serial
Language:
English
Creator:
Arizona State University
University of South Florida
Publisher:
Arizona State University
University of South Florida.
Place of Publication:
Tempe, Ariz
Tampa, Fla
Publication Date:

Subjects

Subjects / Keywords:
Education -- Research -- Periodicals   ( lcsh )
Genre:
non-fiction   ( marcgt )
serial   ( sobekcm )

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E11-00254
usfldc handle - e11.254
System ID:
SFS0024511:00254



Full Text
Volume/Issue:
Vol. 10, no. 6 (January 16, 2002)
ISSN:
1068-2341
Contents:
Technical and ethical issues in indicator systems: doing things right and doing wrong things / Carol Taylor Fitz-Gibbon and Peter Tymms
Related Item:
Education Policy Analysis Archives (EPAA)
Electronic Location:
http://digital.lib.usf.edu/?e11.254

Education Policy Analysis Archives

Volume 10 Number 6, January 16, 2002, ISSN 1068-2341

A peer-reviewed scholarly journal
Editor: Gene V Glass
College of Education
Arizona State University

Copyright 2002, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Technical and Ethical Issues in Indicator Systems: Doing Things Right and Doing Wrong Things

Carol Taylor Fitz-Gibbon
University of Durham

Peter Tymms
University of Durham

Citation: Fitz-Gibbon, C.T. & Tymms, P. (2002, January 16). Technical and ethical issues in indicator systems: Doing things right and doing wrong things. Education Policy Analysis Archives, 10(6). Retrieved [date] from http://epaa.asu.edu/epaa/v10n6/.

Abstract

Most indicator systems are top-down, published, management systems, addressing primarily the issue of public accountability. In contrast we describe here a university-based suite of "grass-roots," research-oriented indicator systems that are now subscribed to, voluntarily, by about 1 in 3 secondary schools and over 4,000 primary schools in England. The systems are also being used by groups in New Zealand, Australia and Hong Kong, and with international schools in 30 countries. These systems would not have grown had they not been cost-effective for schools. This demanded the technical excellence that makes possible the provision of one hundred percent accurate data in a very timely fashion.
An infrastructure of powerful hardware and ever-improving software is needed, along with extensive programming to provide carefully chosen graphical and tabular presentations of data, giving at-a-glance comparative information. Highly skilled staff, always learning new techniques, have been essential, especially as we move into computer-based data collection. It has been important to adopt transparent, readily understood methods of data analysis where we are satisfied that these are accurate, and to model the processes that produce the data. This can mean, for example, modelling separate regression lines for 85 different examination syllabuses for one age group, because any aggregation can be shown to represent unfair comparisons. Ethical issues are surprisingly often lurking in technical decisions. For example, reporting outcomes from a continuous measure in terms of the percent of students who surpassed a certain level produces unethical behavior: a concentration of teaching on borderline students. Distortion of behavior and data corruption are ever-present concerns in indicator systems. The systems we describe would have probably failed to thrive had they not addressed schools' on-going concerns about education. Moreover, data interpretation can only be completed in the schools by those who know all the factors involved. Thus the commitment to working closely and collaboratively with schools in "distributed research" is important, along with "measuring what matters"... not only achievement. In particular the too-facile interpretation of correlation as causation that characterized much school effectiveness research had to be avoided and the need for experimentation promoted and demonstrated. Reasons for the exceptionally warm welcome from the teaching profession may include both threats (such as the unvalidated inspection regime run by the Office for Standards in Education) and opportunities (such as site-based management).

Indicator systems that we have developed over the last 15 years have, somewhat to our surprise, attracted support and subscriptions from about a third of the schools in England, where we work on a scale many times greater than any other group (Note 1). We have also developed a linked Curriculum, Evaluation and Management Centre at the University of Canterbury in Christchurch, New Zealand, and we have the pleasure of welcoming participation from scattered schools in thirty countries. The development of the indicator systems in the CEM Centre is unusual if not unique in that schools themselves have chosen to participate. The systems are therefore professional, ground-up developments that stand in contrast to "top-down" indicator systems created, and sometimes imposed, by state and local authorities. The interests of both kinds of systems should ultimately, however, be coincident: to improve education.

An indicator can be defined as an item of information collected at regular intervals to track the performance of a system (Fitz-Gibbon, 1990). The indicator systems that have formed the basis of our learning are all designed to feed back valuable information of interest to teachers and administrators in schools and colleges. We see our indicator or information systems as significantly empowering schools as they participate with a university in 'distributed research'. The issue of public accountability must also be addressed by indicator systems.

The design of each system has been driven by the need to measure outcomes that matter along with relevant covariates so that fair comparisons can be made. Process variables are measured in some of the systems, but only to generate hypotheses, not to make judgements.

'Value added' measures have been included in our systems since the first one started in 1983. In 1995 we won the two year contract to conduct national feasibility studies for a value-added system. The final report, Fitz-Gibbon, 1997, can be found on the website for the Qualifications and Curriculum Authority (http://www.cem.dur.ac.uk/ca/5-14/durham_report.asp; click publications and search for "national", or look under Software on the CEM Centre website). The studies carried out contributed to ongoing debates of both a technical and ethical nature. The issues are of interest to those concerned with indicators in any system, either as designers, users or policy makers.

Technical Issues

Technical issues include those procedural problems that must be solved if an indicator system is to be of high quality and run in a timely, efficient and effective fashion. Indicators need to be based on adequate samples, have appropriate levels of reliability, good validity, and above all, positive reactivity. These are technical terms straight out of research methods courses, but they go to the heart of indicator systems, and any practical use of data. The data must be of research quality, otherwise it will confuse rather than guide.

Technical infrastructure

In the early years, in 1983, technical sophistication was no more than batch-processing using a mainframe. The mainframe could deliver capital letters and stars, and eventually an adequate type-face could be constructed via embedded commands in a special program that could produce quite nicely spaced upper and lower case printing. Access to a data entry service was essential and it was quickly obvious, as the volume of data increased, that adequate data verification techniques had to be built in. At first this was by double-entry, which was not satisfactory; this was followed by data checking on entry, which required some programming to prevent out-of-range data being entered and to ensure the data went into the right columns on the 80-column cards.

From these humble and clumsy beginnings, we move today to a situation where extensive programming is used, optical mark recognition assists some data entry, and computer-based tests that can record responses directly and be delivered across intranets and the internet are becoming essential features of indicator systems. All of this requires that a team of very skilled persons is collected together. We have hired predominantly young scientists and mathematicians, who have, almost without exception, continued to be on a steep learning curve, taking further qualifications, constantly upgrading the work, adjusting the programs, writing software, and making full use of the graphical capabilities now available. Not only is data translated into meaningful sentences contingent upon the data-values, but also into graphs that, for example, change colour when differences are statistically significant.

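The kind of on-entry checking described above can be sketched as follows. This is an illustration only, assuming hypothetical field names and ranges rather than the projects' actual specifications.

```python
# Illustrative on-entry validation: reject out-of-range or mis-typed values before
# they are stored. Field names and permitted ranges here are invented examples.
FIELD_SPECS = {
    "baseline_score":  {"type": int, "min": 0, "max": 100},
    "exam_grade":      {"type": str, "allowed": {"A", "B", "C", "D", "E", "N", "U"}},
    "attitude_rating": {"type": int, "min": 1, "max": 5},
}

def validate_record(record: dict) -> list:
    """Return a list of problems found in one keyed data record."""
    problems = []
    for field, spec in FIELD_SPECS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, spec["type"]):
            problems.append(f"{field}: wrong type {type(value).__name__}")
        elif "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{field}: value {value!r} not in allowed set")
        elif "min" in spec and not (spec["min"] <= value <= spec["max"]):
            problems.append(f"{field}: value {value} out of range")
    return problems

# An impossible baseline score is flagged at entry rather than silently stored.
print(validate_record({"baseline_score": 130, "exam_grade": "B", "attitude_rating": 3}))
```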

Without this level of technical expertise, data will not be attractive and easy to access for busy teachers, the turn-around of data will be slow and errors will creep in. Schools want data quickly, within weeks of examination (Note 2) results becoming available. And they want the data 100 per cent accurate. In most research projects, if a small percentage of the data doesn't match, this can simply be reported and ignored, but in indicator systems schools want every single student accounted for. In consequence, not only must there be teams of expert programmers, but also very high capacity computing equipment. Facilities for printing CDs are helpful, as special user-friendly software is developed to assist schools in their own explorations of their data. CDs can also be used to deliver computerised tests. The data files returned to schools also need to interface easily with schools' management information software such as timetabling and staff deployment.

If the technical infrastructure is effective, data turn-around quick, and data presentation attractive and readily interpreted, then the indicator systems will probably grow, and this growth itself demands further technical capabilities, such as running a high-capacity server and creating a central database that can be accessed by researchers and secretaries alike. This central database needs to be relational in order to store efficiently the hundreds of thousands of students with hundreds of variables attached to each student in thousands of schools over many years. It must have an extremely friendly front end, so that secretaries can readily track the mail-out of questionnaires and the return of data, plus a massive invoicing system if individual schools can join the project and pay on their own account. Alternatively, school districts might pay for groups of schools.

Finally, the infrastructure needs communication on a regular basis with all schools. Newsletters, a website and conferences are important, particularly as teachers become conference presenters and have a credibility with fellow teachers that researchers lose after some years away from the classroom.

We have been fortunate in working with teachers and headteachers ready to welcome, and make themselves familiar with, streams of data. Some government policies have also helped to make indicator systems important and feasible in the UK: the framework of achievement tests shown in Figure 1, the site-based management legislation requiring school districts to devolve about 80 percent of their budgets to schools, and open enrolment policies allowing parental choice of schools. These were intended both to put schools into competitive situations and also to give them some freedom of action derived from having budgetary control.

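For illustration, the central relational database described above might be organised along the following lines. The tables, columns and the use of SQLite are assumptions made for this sketch, not the CEM Centre's actual schema or software.

```python
# A minimal relational layout: schools, students and a long, narrow measures table
# so that hundreds of variables per student over many years stay manageable.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE school  (school_id INTEGER PRIMARY KEY, name TEXT, district TEXT);
CREATE TABLE student (student_id INTEGER PRIMARY KEY,
                      school_id INTEGER REFERENCES school, cohort_year INTEGER);
CREATE TABLE measure (student_id INTEGER REFERENCES student,
                      year INTEGER, name TEXT, value REAL);
""")

conn.executemany("INSERT INTO school VALUES (?, ?, ?)",
                 [(1, "School A", "North"), (2, "School B", "North")])
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [(10, 1, 2001), (11, 2, 2001)])
conn.executemany("INSERT INTO measure VALUES (?, ?, ?, ?)",
                 [(10, 2001, "baseline", 55.0), (11, 2001, "baseline", 61.0)])

# The kind of query a researcher or a friendly front end might run: mean baseline by school.
rows = conn.execute("""
    SELECT s.name, AVG(m.value)
    FROM measure m
    JOIN student st ON st.student_id = m.student_id
    JOIN school  s  ON s.school_id  = st.school_id
    WHERE m.name = 'baseline'
    GROUP BY s.name
""").fetchall()
print(rows)
```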

Figure 1. Achievement framework: national tests are provided for students at ages 7, 11, 14, 16, and 18 years.

If the infrastructure for indicator systems can be created, then a cost-effective system is feasible. We now consider the design of such a system, including choosing what to measure, collecting the data, and analysing, reporting and interpreting the data.

Choosing indicators

The advice to select a few key indicators is often given (e.g. Lightfoot, 1983; Somekh, Convery, Dlaney, Fisher, Gray, Gunn, Henworth and Powell, 1999, pp. 30 and 34). Whilst this might make life easy, the temptation should be resisted and the advice rejected. A few indicators cannot reflect the complexity of institutions and will undermine the system as gaming takes hold. Given a few indicators, the effort is focused on these concerns alone. Furthermore it is difficult to know which indicators will become important in the future, so that what is now considered to be a key indicator may become of less concern in the future. And who is to decide? Multiple indicators for complex organisations are a fairer representation of the multiple realities within each than is any attempt to assign a single label, whether this label be numerical (e.g. average value added) or verbal (e.g. 'coasting', 'failing').

Our solution is to try to measure what matters as comprehensively as possible. Here the literature in educational research is of value. Bloom's Taxonomy of Educational Objectives identified affective, cognitive and psychomotor outcomes that can be taken to include behavioural outcomes. The distinction between aptitudes and achievements (Green, 1974) is an important distinction in the cognitive area.
Clearly money matters, so economic indicators are important. The essence of schooling is who is taught what, for how long and by what methods, and these concerns can, following OECD practice (OECD, 1998), be called 'Flow'. Eventually all of these aspects should have indicators. A simple mnemonic makes this list of domains memorable, as shown in Figure 2. (See also Fitz-Gibbon and Kochan, 2000.)

Figure 2. Typology of Education Indicators for Monitoring

The indicators could be collected from various groups such as students, teachers, heads, school districts, states, or parents, the community, the voters. Most indicators could be an input, an output or a long-term outcome and may also be related to a process. Thus a comprehensive classification of indicators can be developed, and this also will require a relational database if the measures are to be efficiently stored.

Baseline tests

Prior achievement is an excellent predictor of subsequent achievement, but each student's level of prior achievement will be partly influenced by the effectiveness of the previous stage of schooling. One teacher's output is then another teacher's input. Teachers quite reasonably worry that if they promote high achievement at one age it may be more difficult to show high rates of progress (value added) subsequently. (In Tennessee, they try to control not just for a child's achievement in the previous grade but also in the grades two and three years earlier in an attempt to overcome this problem (Sanders and Horn, 1995).) We have found that the introduction of baseline tests as an alternative to achievement input measures has provided an important alternative approach. The purpose of our baseline tests is not to stamp labels on students, but to predict how easy or difficult it will be to get students through the next set of examinations. What is needed for these tests, then, is typical performance, not maximum performance. In the secondary school we use tape recordings for test administration so that all schools present students with the same information
in the same tone of voice, using exactly the same words and the same timing of each of the subtests. Old IQ tests, in contrast, are often administered in ways that are not properly standardised from school to school.

We have written our own baseline tests to obtain good predictions of subsequent achievement. Our aim has been to obtain quick and efficient measures by using item formats and tasks that require many responses from the pupil that are instantly recorded and therefore add up very quickly to a good predictor of general academic performance. These tests are in some cases remarkably efficient. For example, in 20 minutes the PIPS individually administered adaptive baseline assessment for 4 and 5 year olds obtains measures that predict subsequent progress in mathematics and reading with correlations of about 0.7 (Tymms, 1999). At the secondary school level, our baseline test (the MidYIS test, part of the Middle Years Information System) takes 45 minutes of working time and predicts subsequent achievement with correlations of about 0.7, depending upon the subject... such as English or mathematics. The MidYIS test was chosen by the prestigious independent schools for a compulsory baseline.

In addition to prior achievement or baseline measures, are there other important covariates? The best source of information about relevant covariates is not what people write about, but what the data shows. Much is written, for example, of the impact of socio-economic status on achievement, but at the pupil level the correlation is generally about 0.3, thus implying that about 9 per cent of variance in the outcomes will be accounted for by knowing the socio-economic status of the student. In contrast, cognitive measures predict about 50 per cent of subsequent variation. To obtain adequate prediction of subsequent achievement ... and therefore the fairest data for teachers ... there is no adequate alternative to a cognitive test.

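The arithmetic behind these figures is simply that the proportion of variance accounted for is the square of the correlation, and the spread of outcomes left unexplained is the square root of what remains. A small worked illustration:

```python
# Variance explained is r squared; the residual spread is sqrt(1 - r^2) of the
# original spread. The two correlations are the rough values quoted in the text.
import math

for label, r in [("socio-economic status (r about 0.3)", 0.3),
                 ("cognitive baseline (r about 0.7)", 0.7)]:
    variance_explained = r ** 2
    residual_spread = math.sqrt(1 - r ** 2)
    print(f"{label}: {variance_explained:.0%} of variance explained; "
          f"prediction errors still {residual_spread:.0%} of the original spread")
# Prints roughly 9% vs 49% of variance explained (residual spread 95% vs 71%).
```
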
Affective and social indicators

In addition to the cognitive indicators we need to address the affective and social domains. In Victoria, Australia, there is extensive use of questionnaires to students, to staff in schools and to parents. Currently, in the Curriculum, Evaluation and Management Centre, we concentrate on questionnaires to students, since education is primarily aimed at the students who are in our care for 15,000 hours of compulsory treatment. This concentration on students is also designed to keep the indicator systems lean and efficient, costing as little as possible and obtaining as close as possible to a hundred percent response rates. Students can tell us on questionnaires how much they like school, how much they like an individual subject, whether they feel safe in school, their aspirations for the future, their relationships with teachers, their health, traumas in their lives, how they are taught, and how interesting they find each subject, etc., etc. For children in their first year at school we also ask teachers to rate the children's attention, impulsivity and activity levels.

Does all this amount to too many indicators? Certainly, when schools first join an indicator system, they can feel quite overwhelmed by the amount of data that is returned from a fully developed system. For schools in the first few years of participation 'Keep it simple, stupid' might be a good motto, especially as there is evidence that giving people too much data is de-motivating (Cousins and Leithwood, 1986). However, it would probably be better to give people choice.

We now operate a wide variety of systems of indicators that involve paper- or computer-based tests, as well as 'basic' or 'extended' versions, the latter including hundreds of variables. We are moving towards systems that will involve on-line administration of data collection and permit matrix sampling and the inclusion of choice, by students or other respondents to questionnaires, of the domains in which they would like to express their views and opinions. This will need close attention to the reliability of the data collected. Thus the matrix sampling will use scales as the unit of sampling rather than items.

Having decided on how to measure the outcomes that matter, one is not finished with the creation of indicators. Just as prior achievement predicts subsequent achievement, so prior attitudes will predict subsequent attitudes, and in order to compare like with like we need to use regression analyses and look at the residuals. The prediction appears not to be so strong as in the cognitive area, perhaps due to less reliable measures, but about 25 percent of the variance of final attitudes in secondary schools is usually predictable from knowing intake attitudes.

Process variables

An indicator system consisting of dependent variables with appropriate covariates is a complete indicator system. However, an indicator system is only a step along the way to trying to understand what works, and how schooling can be improved. Consequently, some of our indicator systems include process variables such as descriptions of methods of teaching and learning for which students in the 16-18 age range report the frequency of use.

Process indicators serve to generate hypotheses and, most importantly, they stimulate discussion of teaching methods among staff in schools, and as such are valuable. The important problems in trying to attribute cause and effect must, however, be continuously emphasised.

Qualitative data: always valued

As Berliner (1992) argued, qualitative data are powerful. Early in the ALIS project, one school was constantly at the bottom of the set of participating schools on a scale assessing attitude to school. It paid very little attention to this fact, but then open-ended questions were introduced into the data collection and students' comments were typed up and made available to the schools. The typing disguised students' handwriting and kept the feedback anonymous. When the school read statements like 'We are treated like fifth formers without uniform', 'Staff are sarcastic', 'I wish I'd gone to another school', this qualitative data had an impact that was immediate and led to a re-design of the provision for subsequent students. Having had that experience, the school then watched the quantitative attitude indicators with more concern, and we continue to provide typed-up responses to open-ended questions.

Credible data collection procedures for attitudinal data

We have already described how the cognitive data collection is standardised so that the same procedures are followed in every school. This standardisation of data collection is important in collecting data that can be validly compared from school to school.

A particular threat to the validity of attitude data could arise from demand characteristics. If students are being asked if they like the school and whether they get on well with teachers, yet teachers are looking over students' shoulders, or if the students feel that their questionnaires will be scrutinised by teachers, then the situation becomes subject to possible pressures and influences that could inhibit honest responding. In the secondary school projects, the tape recording that administers the cognitive test introduces the questionnaire part of the data collection by noting that if there is anything they don't understand they should not raise their hand and ask questions, because the teacher cannot come to their desk to help them, since the teacher will be staying at the front of the class in order to avoid seeing the responses on any of the questionnaires. Additionally, students are given plastic envelopes in which to seal their questionnaires. Of course this procedure requires that students can read the questionnaires and this may not always be the case. If there are non-readers, the questionnaire can be tape recorded and students can be given answer sheets with symbols so that they can listen to the questions on the questionnaire and answer on the answer sheet (Fitz-Gibbon, 1985).

Responding to feedback

The creation of a monitoring system involves a great many decisions and, as a system grows and there is feedback from the users of the system, there is a need to be responsive and flexible whilst holding firm to fundamental principles. In developing an on-entry assessment for 4-5 year olds, the intention was that the data would be kept until the children reached the first statutory assessment three years later. But many reception class teachers suggested that we should assess the students again at the end of their first year at school using an extended version of the on-entry assessment. We now do this on a very wide scale and it has proved to be one of the more important innovations, with a number of unseen benefits. (For an analysis of the data see Tymms, Merrell and Henderson, 1997.)

Matching individual student records from different sources

The first task in analysing progress data is to match records from baseline tests to outcome measures. The outcome measures should, of course, be curriculum-embedded, high-stakes, authentic tests that reflect work actually taught and worth teaching in the classroom. The use of a standardised multiple choice measure of reading comprehension, for example, is not likely to be fair to schools since teachers may not be able to influence reading comprehension skills once students can read. In other words, there is a problem of lack of sensitivity to instruction. The matching of data from different sources can only be efficiently done by the use of unique identifiers. These should preferably be identifiers containing check digits, backed by the computing facilities to make sure that no identifier is mis-entered.

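As an illustration of the check-digit idea, the sketch below uses the widely known Luhn scheme; the paper does not say which scheme the projects actually use, so treat this purely as an example.

```python
# A check digit lets the software detect a mis-keyed identifier at the point of entry.
def luhn_check_digit(base: str) -> str:
    """Compute a check digit such that any single mis-keyed digit is detectable."""
    total = 0
    for i, ch in enumerate(reversed(base)):
        d = int(ch)
        if i % 2 == 0:            # every second digit (from the right) is doubled
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def is_valid(identifier: str) -> bool:
    return luhn_check_digit(identifier[:-1]) == identifier[-1]

pupil_id = "0042731" + luhn_check_digit("0042731")
print(pupil_id, is_valid(pupil_id))             # correctly keyed: passes
print(is_valid(pupil_id.replace("4", "9", 1)))  # one digit mis-keyed: fails
```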

Transparent analyses vs. sophisticated statistics such as hierarchical linear models

Einstein said that everything should be as simple as possible, but no simpler. This is a wise, but very challenging, piece of advice. One cannot know how simple a data analysis can be until one has done both simple and complicated analyses and compared the results with representative sets of real data, so that one is looking not only at theoretical models but also at actual magnitudes.

When we won the contract to design a national system of value added indicators, the brief we were given asked for data that was 'statistically valid' and 'readily understood'. These two desiderata could well have been in opposition. We analysed the same data sets using ordinary least squares and multilevel models, and found, as we had found previously, that the average residuals indicating the so-called 'value added' scores for departments or schools correlated at worst 0.93, and more usually higher, up to 0.99, on the two analyses. Thus it was possible to have the data valid and 'readily understood' by using simple regression. The multi-level analysis, requiring special software and a postgraduate course in statistical analysis, was in contrast to the ordinary least squares analysis, which could be taught in primary schools. In our experience in the UK the ordinary least squares analysis can certainly be presented to schools so that most members of staff understand the analysis and can use software to re-analyse data as necessary. This accessibility of the data, along with the atmosphere of joint investigation (distributed research), probably helped to encourage acceptance of the indicator systems, unlike the situation that sadly seems to have arisen in Tennessee where a highly ambitious, yearly multi-level analysis was tracking students and teachers (Sanders & Horn, 1995; Baker, Xu & Detch, 1995).

The development of multilevel modelling or hierarchical linear models is admirable, and provides efficient calculations and rather different error terms, but to use these procedures in day to day indicator system work is likely to lead to less acceptance of the analysis by teachers. Moreover, it is somewhat akin to applying a correction for relativity when considering the momentum of a moving train: theoretically correct, but in scientific terms, an ill-advised tendency to over-precision.

A recommendation in the Value Added National Project was that prompt initial feedback should be based on very simple value added measures, taking account of prior achievement and using ordinary least squares regression methods that any school could adopt and replicate. Then, before any data is made public, statisticians should be given access to the datasets to analyse in numerous sophisticated ways in order to see if any of the analyses makes a difference to particular scores.

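A minimal sketch, on synthetic data with invented variable names, of the comparison discussed above: each school's 'value added' is taken as its mean residual from a single ordinary least squares fit, and a random-intercept multilevel model is fitted alongside (this assumes the pandas and statsmodels libraries are available).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_schools, pupils_per_school = 40, 100
school = np.repeat(np.arange(n_schools), pupils_per_school)
school_effect = rng.normal(0, 3, n_schools)            # the 'value added' to be recovered
baseline = rng.normal(100, 15, school.size)            # prior achievement
outcome = 10 + 0.5 * baseline + school_effect[school] + rng.normal(0, 8, school.size)
df = pd.DataFrame({"school": school, "baseline": baseline, "outcome": outcome})

# Simple, transparent approach: one OLS line, then average residuals by school.
ols = smf.ols("outcome ~ baseline", df).fit()
ols_value_added = ols.resid.groupby(df["school"]).mean()

# Multilevel (random-intercept) approach, needing specialist software.
mlm = smf.mixedlm("outcome ~ baseline", df, groups=df["school"]).fit()
mlm_value_added = pd.Series({g: re.iloc[0] for g, re in mlm.random_effects.items()})

# How closely do the two sets of school estimates agree?
print(np.corrcoef(ols_value_added.sort_index(), mlm_value_added.sort_index())[0, 1])
```

On data of this kind, with reasonably large groups, the two sets of estimates agree very closely, which is the pattern of correlations (0.93 to 0.99) reported above.
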
Adequate and inadequate statistical modelling

A method of analysing that does make a substantial difference is to consider each subject to have its own regression line, since each subject goes through a particular examining process, with a chief examiner and statistical moderation of the marks arrived at by experienced markers working to guidelines. Professor Robin Plackett, winner of two gold medals from the Royal Statistical Society, emphasised in his lectures, usually in his opening sentences, that the question to ask, first and foremost, was what processes produced the data. The essence of good statistical modelling is to model the process that produces the data.

From the very start, with the A Level Information System in 1982-83, it was clear that the regression line for mathematics was quite different from the regression line for English, and implied that for the same level of prior achievement students would come out two grades lower taking the Advanced examination in mathematics than they would taking Advanced English (Fitz-Gibbon, 1988; Fitz-Gibbon and Vincent, 1994). Regrettably, other researchers (e.g. Donoghue, Thomas, Goldstein, and Knight, 1996) have simply taken the results of all examinations and assumed that the scales could be combined without any adjustment. Having thus confused the data, sophisticated multilevel models were applied to find that there were differential slopes, i.e. slopes that differed for high and low ability intakes. It was even suggested that teachers may be to blame for concentrating on some groups more than others. This was poor data interpretation, since a confound (different subjects with different regression lines) was being attributed to teachers' actions without any corroborating evidence.

In Figure 3, we see some of the different regression segments for different subjects based on intake ranges. These indicate very clearly that the intake differs between subjects, that the difficulty level differs between subjects, and that simply to combine the outcome grades as though each subject were of equivalent difficulty is inconsistent with proper statistical modelling based on the processes that produced the data. The differences are substantial, unlike the difference made by using or not using hierarchical modelling.

Figure 3. Regression segments showing differences in intake (x-axis) and output (y-axis) for different subjects.

Regression segments, such as those shown in Figure 3, are particularly useful in comparing one subject with another subject, but also in comparing subjects across years. Thus we see in Figure 4 that the average achievement level of the intake is steadily declining (the segment is moving to the left), and the output shows grade inflation (the trend segment is moving up the page). This combination of lower intake range and higher outcome grades has been the pattern with the examinations at age 18 for many years, during which time the percentage of students taking these advanced examinations has increased. When these changes are measured against an unchanged baseline, they illustrate the necessary adjustment of 'standards' over time to accommodate the expanding range of uptake of advanced courses (Tymms and Fitz-Gibbon, 2001).

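The confound can be seen with invented numbers: fitting a separate line per subject shows different predicted grades for the same prior achievement, while a single pooled line splits the difference and misattributes it to departments.

```python
# Two subjects with the same slope but different difficulty (intercept); numbers invented.
import numpy as np

rng = np.random.default_rng(1)
prior = rng.normal(6.0, 1.0, 500)                        # prior achievement score
english = 0.9 * prior + 1.5 + rng.normal(0, 0.8, 500)    # points on the outcome grade scale
maths   = 0.9 * prior - 0.5 + rng.normal(0, 0.8, 500)    # same slope, harsher intercept

for name, grades in [("English", english), ("mathematics", maths)]:
    slope, intercept = np.polyfit(prior, grades, 1)
    print(f"{name}: predicted points at prior = 6.0 -> {slope * 6.0 + intercept:.1f}")

# A pooled line sits between the two, so English residuals come out positive and
# mathematics residuals negative, a difference that is not the departments' doing.
slope, intercept = np.polyfit(np.concatenate([prior, prior]),
                              np.concatenate([english, maths]), 1)
print(f"pooled: predicted points at prior = 6.0 -> {slope * 6.0 + intercept:.1f}")
```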

Figure 4. Regression segments for the same subject but different cohorts.

Providing various kinds of feedback, including electronic and web-based feedback

As with the amount of data, the presentation of data needs to change according to the experience of the school. A school just beginning to get feedback data needs a few clear diagrams and a telephone helpline in case of questions. Schools that have become used to receiving data and have, despite some initial rejection from some departments, found it to be useful and credible, start to make more and more use of the data. It therefore becomes valuable to them to have the data provided in Excel spreadsheets, possibly with pre-programmed macros, or in specially prepared software that allows them to undertake procedures such as separating out teaching groups, aggregating by curriculum area, dropping students who have missed substantial amounts of schooling, and adding students for whom data was missing.

Increasingly, as we move from paper-based feedback to sending disks, we provide instant feedback. Eventually, with tight encryption techniques, this will be delivered directly over the internet.

Chances graphs ... making cognitive tests acceptable

It has been immensely important in the development of acceptable indicator systems to listen to and to respond to teachers' concerns. It has been important, for example, that baseline tests are not seen as predicting exact outcomes. Fifty per cent of the variation in outcomes is predictable, but that means that 50 per cent is not. How can this be represented to teachers who, currently in England, are asked by government agencies to set targets?

This problem was confronted very early, in that schools were in some cases preventing students from taking advanced mathematics if they had not received a C grade or higher in earlier mathematics courses. When data from a large number of schools was available, in some of which students had been allowed to take the advanced course even without having done well earlier, it was possible to present what we now call 'Chances graphs' (Fitz-Gibbon, 1992, p. 288). These graphs show the chances a student had (in retrospect) of getting each grade subsequently. These 'chances' can be represented with simple bar charts showing the empirical percentages of students who actually achieved each grade the previous year. This empirical distribution has great credibility with teachers and students. It is data that actually happened, and if it happened once it can happen again. Thus, the low-achieving student is encouraged to recognise that many low-achieving students from the previous year well exceeded the average predicted grade for that starting point. By representing their 'chances', we remove the opposition rightly felt to labelling students with single predicted grades and we provide actual data that is motivating for students.

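A minimal sketch of the logic of a 'chances graph', using invented grades for a previous cohort: for each baseline band, report the percentage of students who actually achieved each grade.

```python
from collections import Counter

# (baseline band, grade achieved) pairs from a previous cohort; entirely invented data.
previous_cohort = ([("low", g) for g in "EEDDDCCCBA"]
                   + [("middle", g) for g in "EDDCCCCBBA"]
                   + [("high", g) for g in "DCCBBBAAAA"])

def chances(band: str) -> dict:
    """Empirical percentage of students in this baseline band who achieved each grade."""
    grades = [g for b, g in previous_cohort if b == band]
    counts = Counter(grades)
    return {grade: round(100 * counts[grade] / len(grades)) for grade in "ABCDE"}

# A 'low' baseline student sees a distribution, not a single predicted grade:
# some students starting from the same point reached a B or even an A last year.
print(chances("low"))   # {'A': 10, 'B': 10, 'C': 30, 'D': 30, 'E': 20}
```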

Statistical Process Control Charts (Shewhart, 1986)

A particularly useful representation of the data is one which answers the question 'How is this department doing from year to year, taking into account the number of students in the group and therefore the expected variation in the average from year to year?' Shewhart's brilliant insight into how to represent confidence intervals has proved most useful. By showing the confidence intervals as guidelines to expected variation, data from year to year are very easily scrutinised. Of course, one expects half the results to be above the line and half below the line in some kind of random order. An example of data from a school that might be concerned about its effectiveness is shown in Figure 5, from the A Level Information System.

Figure 5. A Statistical Process Control chart for departmental residual gain scores averaged over three years.

The representation involved in statistical process control charts can be applied to presenting the average residuals from various subjects in the same year or using a three year moving average. Each figure is automatically processed from the data in the relational database and a test for statistical significance made. If the variation from zero is statistically significant, the indicator bar turns red so that schools, at a glance, can see which departments are probably doing better or worse than inherent variation would suggest. We also warn, however, that statistical significance at any particular level is not a dichotomy between truth and error but simply an indicator on a continuum. The software we provide enables schools to switch easily between a baseline of prior achievement and a curriculum-free baseline.

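A sketch of the control-limit arithmetic behind such charts: the expected year-to-year variation of a department's mean residual shrinks with the number of students, so limits of roughly plus or minus two standard errors can be drawn around zero. The numbers below are invented.

```python
import math

residual_sd = 1.0            # spread of individual student residuals (grade points), assumed
yearly_results = [           # (year, number of students, mean residual); hypothetical
    (1998, 45, +0.05),
    (1999, 52, -0.31),
    (2000, 48, -0.45),
]

for year, n, mean_residual in yearly_results:
    limit = 2 * residual_sd / math.sqrt(n)   # approximate 95% control limits
    flag = "outside limits" if abs(mean_residual) > limit else "within expected variation"
    print(f"{year}: mean residual {mean_residual:+.2f}, limits +/-{limit:.2f} -> {flag}")
```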

For publication: the unit of analysis and the unit of reporting

Compliance with freedom of information legislation and other relevant laws may require that considerable amounts of data are published. The issue as to what should be published is taken up later, since it raises ethical issues.

Let it just be acknowledged here that there are issues regarding the reporting unit (we recommend curriculum area, not whole school nor anything finer-grained), but also the problem arises that the vocabulary of research includes words that raise anxieties, such as 'negative', 'below average' and 'regression'. A solution is to show the data in terms of all-round growth, with simply variations in the amount of growth. For lay audiences this representation may be more accessible than regression lines.

Interpreting data: Establishing substantive as opposed to statistical significance

In the statistical process control charts we saw methods of conveying the inherent variability of data samples. It is highly important that politicians and the public recognise that indicators will fluctuate no matter what teachers do. It was commendable that Scotland waited till it had three years of data before publishing value added measures.

Although we embed statistical significance tests into the data, we also warn schools against using this as a sole criterion. The problems with routine testing at the 0.05 level have been well rehearsed (Alkin & Fitz-Gibbon, 1975; Carver, 1975; Glass, McGaw and Smith, 1981; Hedges and Olkin, 1985). To assist schools in interpreting the data, we provide both raw residuals, which enable substantive interpretation of differences to be made in the metric in which the examination results are reported (Note 5), and standardised residuals, which enable comparisons to be made from year to year. Scales do change; for example, because of grade inflation, an A* was added to the scale as a point above an 'A' grade in the age 16 examinations.

Grade inflation due to the standards setting process?

Tymms has suggested that a drift in standards seems to be characteristic of national tests in English primary schools. The reasons for this are connected with the practice of piloting items and setting their difficulty from the results on students who knew they were simply taking an exercise. This 'adrenaline-free', un-prepared testing situation might produce lower performance that would then serve as the benchmark against which the exams were calibrated the following year. Taken under genuine examination conditions, with revision time having been invested and the adrenaline flowing, students might well be producing much better results than those calibrated. Hence the unconscious drift in 'standards'.

Helping users to interpret the data

It is an unusual teacher-training course that prepares teachers for the kind of information that is now available in England through professional monitoring systems. And yet the information is now seen as vital to many educational professionals. In the course of setting up the CEM Centre projects we have run hundreds of sessions to explain the feedback and to discuss the implications. Further, many courses have been run locally for schools to understand and use the data. Our feeling is that extensive in-service work is an essential part of any monitoring system and that the extent of the need for conferences and workshops often only becomes apparent as the project starts running. It has been standard practice for many years now for our conferences to involve teachers as presenters of the data (e.g. there is a video of a head teacher addressing an early conference, and he was a speaker in New Zealand following our involvement there (Cooper, 1995, video)).

Dealing with issues of cause and effect: what works?

This is the most important aspect of data interpretation. It would be wrong to imply to schools that indicator systems are all they need to find out what works. It could take years, even if the search were successful. A school may implement an innovation and the indicator suggests worse results. But perhaps the results would have been even worse without the innovation. Who knows? So the school repeats the innovation and the results stay the same. So another year's data are awaited, and so on.

If, instead of this year by year indicator monitoring, a school joined with 20 other schools and a random 10 implemented one innovation while the other random half implemented a different innovation, all schools would receive 20 years of data in one year. By adopting the methods of science, learning is speeded up and made more reliable. The fundamental distinction between observation and experimentation must never be blurred. Epidemiology and clinical trials both have their virtues, but the clinical trials are necessary to establish sound evidence as to what works. That concept applies in education as in medicine, and the term 'evidence-based' is now becoming popular. As 'value-added' became the popular word for residuals, evidence-based may become the popular word for experiments. The need not to over-claim for the value of monitoring systems brings us to the next major section of this paper, ethical issues.

Ethical Issues

A major ethical imperative is to do good rather than to do harm. At the very least we might try to observe the Hippocratic oath and 'at least do no harm'. But how do we find out what does harm to students, to society, to academic subjects, to staff?

Evidence of the likely impact of indicator systems on participating schools will be considered, including the small number of controlled trials that exist. In addition to this question about the overall impact of indicators there are numerous ethical issues to be addressed that arise in the course of running indicator systems. Each represents a potential source of net harm, a potential negative in a cost-benefit analysis.

Some of the questions that arise are:

Do indicator systems really help schools and affect achievement, or are the admittedly modest funds misspent?
Should indicator systems lead to a single national, or state, curriculum in order to have a common standard?
What is the effect of analysing by gender, ethnicity, socio-economic status and religion? Does this common activity perpetuate stereotypic thinking?
What are the effects of poorly chosen indicators, such as those dichotomising continuous data distributions, as in 'percent above x'?
What are the effects of benchmarking, i.e. comparisons with putatively 'similar' schools?
Data corruption: does it happen and, if so, who is to blame?
Is personnel work in public acceptable (e.g. publishing indicators per teacher)?
Is performance related pay justified?
Will over-reliance on indicator systems delay the search for better sources of evidence?
What is the role of the public sector? How can an internal market get the advantages of competition and diversity without the disadvantages of 'the bottom line'? Stakeholders not shareholders?

Do indicator systems really help schools and affect achievement?

It could be argued that because schools freely choose to buy into indicator systems, this is proof that they find indicator systems useful. However, people buy snake-oil, and the commercial argument is never adequate. People bought treatment with phosphorus that was actually very damaging, and even without a commercial pressure, treatments are provided that do harm simply because adequate evidence has not been collected. What evidence do we have, of a disinterested and objective kind, that indicator systems help schools and, for example, affect achievement?

Cohen (1980) ran a meta-analysis of controlled trials of: no feedback from students to lecturers vs. feedback from students to lecturers vs. feedback from students to lecturers supported by discussions with "an expert." The feedback the same lecturers received in subsequent years improved most in the third condition, and least in the first condition. This result is important. When the ALIS project was about four years old, a request was made to a committee at the DfEE (then the Department of Education and Science) inviting them to conduct a randomised controlled trial of the impact of this performance indicator system. The Coopers & Lybrand (1988) report had recommended devolved financing and the use of indicators, and the Department was interested. Unfortunately the funds were not found for this potentially important trial. Tymms ran a controlled trial introducing performance indicators for primary schools into a north-eastern school district in England. A modest effect size (ES_rct) of 0.1 was found. This was, however, in 1994, before primary schools were under pressure regarding the publication of examination results, and there was no "expert" advice available.

Coe experimented with giving additional feedback in the A Level Information System to individual teachers rather than just to school departments. Thus the effect of the randomly assigned feedback was measured not against no feedback but against already substantial feedback, so to expect any further improvement was perhaps optimistic.
Nevertheless, as a result of giving classroom by classroom analyses to the teachers concerned, rather than simply departmental data from which this information could be extracted, there was an achievement gain of ES_rct = 0.1 on the high stakes, externally assessed examinations taken at age 18 years (Note 6). In the Value Added National Project, Tymms experimented with kinds of feedback and found that for primary teachers, tables appeared to be better understood and also, importantly, appeared to have had more impact than graphical feedback. The average effect size across English, mathematics and science was ES_rct = 0.2 (Tymms, 1997, p. 12).

In the Years Late Secondary Information System (Note 7), a list of under-aspiring students is produced by combining students' intentions regarding continuing in education with their baseline scores. Many schools given the list of under-aspiring students set up mentoring sessions or special monitoring. Unfortunately, good intentions do not guarantee good outcomes (McCord, 1978; McCord, 1981; Dishion, McCord et al., 1999). Aware of our ethical responsibility not to have teachers wasting their time, and in order to avoid harming students, we obtained permission from some schools to feed back to them only a random half of the list of their under-aspiring students. In following up these schools and comparing the outcomes of the named under-aspirers versus the unnamed under-aspirers, we have actually found more differences in favour of the unnamed group than the named group. Indeed, naming students resulted in an overall effect on examination progress, adjusted for prior achievement, of a value added decrement of ES_rct = -0.38. Naming seemed to have little effect on whether or not students were counselled at all (r = 0.01), but the more counselling sessions that any students, named or not, received, the worse were their value added scores (r = -0.22). Only 15 schools were involved in this first experiment, but it calls into question many facile beliefs about how achievement can be improved. The findings are challenging and the experiment is being repeated with thirty schools. It illustrates how an indicator system can move the profession forward to proper experimentation.

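The effect sizes quoted here (ES_rct) are standardised mean differences from randomised comparisons. One common way of computing such a figure, not necessarily the exact formula used in the studies above, is the difference between the two groups' mean (baseline-adjusted) outcomes divided by their pooled standard deviation. A sketch with invented numbers:

```python
import math

def effect_size(treated: list, control: list) -> float:
    """Standardised mean difference: (mean_t - mean_c) / pooled standard deviation."""
    mean_t = sum(treated) / len(treated)
    mean_c = sum(control) / len(control)
    var_t = sum((x - mean_t) ** 2 for x in treated) / (len(treated) - 1)
    var_c = sum((x - mean_c) ** 2 for x in control) / (len(control) - 1)
    pooled_sd = math.sqrt(((len(treated) - 1) * var_t + (len(control) - 1) * var_c)
                          / (len(treated) + len(control) - 2))
    return (mean_t - mean_c) / pooled_sd

treated = [0.2, 0.0, 0.3, -0.1, 0.1]    # invented value-added scores, schools given extra feedback
control = [0.1, -0.1, 0.2, -0.2, 0.0]   # invented value-added scores, schools without it
print(f"ES_rct = {effect_size(treated, control):.2f}")
```
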
Should indicator systems lead to a National Curriculum?

The resistance to a National Curriculum in the U.S. has contributed to the slow development of curriculum-embedded, high-stakes, authentic tests. In England, where external curriculum-embedded assessments have been used for decades and school performance tables are published using raw results, moves have been made towards value added systems. These will increase the high stakes nature of the external examinations and, at the same time, government pressure on the Qualifications and Curriculum Authority has led to a reduction from seven independent examination boards to three conglomerates of the former boards. Furthermore there has been a reduction in the number of syllabuses on offer for secondary schools.

Meanwhile in primary schools, a single National Curriculum has been imposed and all primary students sit the same tests designed to the same syllabuses at the ages of 7, 11 and 14 years. The specification of a National Curriculum concentrating on particular subjects, and the publication of these data, has put schools under pressure to drop attention to such areas as the fine arts, the performing arts, and physical education, and to concentrate on those indicators that are published. All schools are forced to do the same curriculum unless exemptions are granted.

This restriction and concentration certainly represents a downgrading of the professional status of teachers, who can now make few important decisions, and it may contribute to declining levels of satisfaction of teachers. At the very least there should be various curricula available to be chosen, as was the case for decades for teachers of students aged 16 and 18 years. Thus, a teacher who preferred to teach physical geography rather than economic geography could find a syllabus in which the proportion was attractive for that teacher.

Another reason for maintaining choice and diversity in syllabuses is that in the entire population a much broader range of skills is thereby likely to be developed. Choice and diversity also keep the examination boards in competition, and this ought to lead to an improvement in the quality of the service that they provide. Unfortunately, since they have a virtual monopoly endowed by government approval, it will not be likely that examination boards drop their poor practice unless required to do so. Examples of poor practice from examination boards are leaving the names of students and their schools on the examination paper when it is being assessed. The name of the pupil and the school will often contain clear evidence regarding the pupil's gender, ethnicity, social class and religion. In the face of this information, can essays be read in a totally unbiased way? Further poor practice is the lack of provision of inter-marker reliability data (Fitz-Gibbon, 1996, p. 115).

What is the effect of analysing by gender, ethnicity, socio-economic status and religion?

There may be differences between groups, but ethnicity is very poorly defined; socio-economic status is not well measured; and neither of these variables is alterable by the school. Alterable variables (Bloom, 1979, 1984) are the key to improvement and accountability. Religion is perhaps an alterable variable, but if we find Catholic schools are doing better than Protestant schools, do we draw the inference that we should make schools turn Catholic? Or vice versa? The habit of analysing by these unalterable variables may simply be a result of the pressure to produce academic papers, whether they contribute to practical or theoretical developments or not. Given a body of data it is easy to break it down by these categories and report the differences. The fact that it leads nowhere has not been a major consideration in social science research.

The fact that such analyses perpetuate stereotyping should also be a matter of ethical concern. That these correlational analyses do not promote the search for strong evidence as to what works is certainly a matter for ethical concern. Attention should be directed towards alterable variables rather than unalterable categories into which human beings are grouped, which is the first step to stereotyping. These analyses become particularly a matter of concern when teachers are presumed to be somehow to blame for the 'under-achievement' of boys at the age of 16 as compared with the achievement of girls. Group differences make catchy headlines in the newspapers. While there may sometimes be a need to track group differences, there is a more important need to educate users of data about the size of the effects being studied and what is known about altering the situation. Boys are smaller than girls at age 11. Should they be stretched? Are teachers responsible?

The use of a "percentage greater than" criterion in reporting

The most egregious mistake made in performance data in England has been the DfEE's (Note 8) introduction of arbitrary dichotomies into continuous data.
Thus, primary school students' achievements are publicly reported in terms of the percent of students in each school above a certain level, called Level 4. This has the unfortunate implication that students below Level 4 have in some way failed their school or failed in their schooling. This is extremely unethical, since for some students a Level 4 achievement is an excellent achievement, whereas for others a Level 4 is a failure to reach their potential.

Furthermore, to draw an arbitrary line through continuous outcome data almost always leads to very negative reactivity. At the secondary level the damaging and unethical impact is a concentration on D students, because the reporting line is the percentage of students getting Grade C or above. Time, effort and money have been spent on D students to the neglect of more able and less able students.

If, on the other hand, an average points score is used as the outcome measure, the implication is to work with each pupil to obtain their maximum performance. This is ethical behaviour, it is the kind of behaviour teachers wish to adopt, but it is made impossible by the reporting of indicators based on arbitrary dichotomies in the data.

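A small illustration, with an invented cohort of grades, of why the two reporting choices push behaviour in different directions: the percent-above-a-line indicator cannot tell the difference between helping only the borderline students and helping everyone, whereas an average points score can.

```python
POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}
cohort = list("AABBBCCCCDDDDEEE")                # grades for one invented cohort

def percent_c_or_above(grades):
    return 100 * sum(g in "ABC" for g in grades) / len(grades)

def mean_points(grades):
    return sum(POINTS[g] for g in grades) / len(grades)

# Strategy 1: push only the borderline D students up one grade.
borderline_pushed = [("C" if g == "D" else g) for g in cohort]
# Strategy 2: improve every student below an A by one grade.
all_improved = [{"B": "A", "C": "B", "D": "C", "E": "D"}.get(g, g) for g in cohort]

for label, grades in [("original cohort", cohort),
                      ("borderline D pupils pushed to C", borderline_pushed),
                      ("every pupil improved one grade", all_improved)]:
    print(f"{label}: {percent_c_or_above(grades):.0f}% at grade C or above, "
          f"mean points {mean_points(grades):.2f}")
# Both strategies give the same percentage at C or above, but only the points
# score reflects the much larger improvement when every pupil moves up.
```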


The effects of arbitrary benchmarks

In England, official bodies such as the Office for Standards in Education, lacking pupil-level value added measures, compare schools with 'similar schools'. The classification of 'similar' is usually made on the basis of the percentage of students receiving free school meals. However, two schools can both have 20 per cent of students receiving free school meals yet otherwise have quite different profiles. For example, one may have a larger proportion of children who also come from schools with very high levels of achievement. Such a school, benchmarked against a school with the same percentage of free school meals, will look very good at the expense of the other school, but the comparison is spurious. Such benchmarking is an inadequate way of making comparisons. The only fair comparisons are with similar students in other schools. There are no similar schools.

It is certainly not ethical to make unfair comparisons which in some cases carry financial consequences for the institution concerned and can lead to job losses and demoralisation. Indeed, to take a most extreme and serious consequence, Ofsted inspectors rely on poor benchmarking data and also sit in classrooms judging teachers. Ofsted inspections have recently been cited in four inquests following suicides by teachers (Times Educational Supplement, April 2000).

Fair data, carefully interpreted, is a defence against the inequities of the Ofsted system, problems reported at length to a Select Committee of the House of Commons (website: http://www.cem.dur.ac.uk/) (Kogan, 1999; Fitz-Gibbon, 1998; Fitz-Gibbon and Stephenson, 1999).

Data corruption: when does it happen and who is to blame?

In an article entitled 'On the unintended consequences of publishing performance data in the public sector', Peter Smith, Professor of Economics at the University of York, identified a 'huge number of instances of unintended behavioural consequences of the publication of performance data' (Smith, 1995). He named eight problems associated with non-effective or counter-productive systems: tunnel vision; sub-optimisation; myopia; measure fixation; gaming; ossification; misinterpretation; and misrepresentation. These can be seen as distortions of behaviour and attention (the first six) and data corruption (the last two). With the sole exception of ossification, every one of these possibilities was raised by headteachers in open-ended items in the questionnaires used in the Value Added National Project. Thus in education these are not theoretical problems but actual, already-perceived problems (Fitz-Gibbon, 1997).

W. Edwards Deming (1986) warned that "When there is fear we get the wrong figures." In primary schools in England there have been instances of teachers opening the examination papers the week before assessments and making sure that students were well prepared. This unfortunately has negative consequences for the school subsequently, since higher-than-reasonable achievement levels will then be expected.

A more subtle form of data corruption is to exclude students who are not going to produce good examination results. In England, following the advent of the publication of raw achievement levels in the form of 'School Performance Tables', 9 exclusion rates increased 600 per cent. Exclusions from school may be the beginning of an increased risk of delinquency, drug-taking and criminality. Is this a price worth paying for the publication of school performance data? It is widely acknowledged that there was a causal link here: schools saw a way to improve their standing in the tables and excluded difficult students. The government some years later responded by publishing exclusion rates and making an issue of 'inclusion', but the impact had already taken place for many students.

As further pressures arise from 'performance management' (performance related pay systems) it may not be long before we see baseline measures declining so that value added measures look better, particularly when old IQ tests are used for baselines and are not standardised in their administration procedures.

Personnel work in public

Whole-school indicators should be avoided because the evidence is that there is more variation within a single school than is generally found between schools. Furthermore, the use of whole-school indicators encourages the rank ordering of schools, and the public is not prepared to interpret rank orders adequately. Very small differences in the indicator can move a school through many positions in a rank ordering in the middle of a distribution. To avoid simple rank ordering, schools were sometimes put into bands, but this too can be damaging if bands A through E are used. Schools in 'D' and 'E' bands are castigated, but in any distribution half have to be below average. This may be politically unpalatable, but such is the nature of the average.
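The instability of rank orderings in the middle of a distribution is easily demonstrated. The sketch below is a hypothetical simulation (invented school indicators, not CEM data): it re-measures each school's indicator with a small amount of noise and reports how many league-table places schools move.

# Hypothetical sketch: tiny, plausible measurement noise reshuffles mid-table ranks.
import random

random.seed(1)
N_SCHOOLS = 200

# Invented 'true' school indicators, roughly bell-shaped around 50.
true_scores = [random.gauss(50, 5) for _ in range(N_SCHOOLS)]

def ranks(scores):
    """Rank 1 = highest indicator value."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    r = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        r[idx] = position
    return r

base_ranks = ranks(true_scores)

# Re-measure with a small amount of noise (s.d. of 1 point on a 0-100 style scale).
noisy_scores = [s + random.gauss(0, 1) for s in true_scores]
new_ranks = ranks(noisy_scores)

moves = [abs(b - n) for b, n in zip(base_ranks, new_ranks)]
mid_table = [m for m, b in zip(moves, base_ranks) if 75 <= b <= 125]

print(f"Average rank change overall: {sum(moves) / len(moves):.1f} places")
print(f"Average rank change in the middle of the table: {sum(mid_table) / len(mid_table):.1f} places")
print(f"Largest single move: {max(moves)} places")

Because schools are densely packed around the average, a shift of a single point, well within measurement error, can easily move a mid-table school a dozen or more places.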


If indicators were published for each teacher, this would be tantamount to doing personnel work in public and would be unacceptable. And yet data cannot be withheld from the public unreasonably, so some compromise is needed: not whole-school indicators and not individual teacher indicators.

The compromise recommended in the Value Added National Project was to use the curriculum area as the unit of reporting. This has the virtue of enabling parents to look for schools that seem to be doing well in the areas in which their children are most interested (e.g. performing arts, or mathematics and science). Of course, in small schools there may be no distinction between the indicators for a curriculum area and for a teacher, so some restriction needs to be put on the size of sample that can be reported publicly. The CEM Centre is developing these indicators for the provision of data at the LEA 10 /School District level, as opposed to the individual school level, where the data is presented department by department for affective and cognitive indicators, and student by student in the cognitive area. Within the individual school, further analyses can be undertaken to obtain data teacher by teacher. Such analyses are made easy by our provision of the school's data in a software package called the Pupil Assessment and Recording Information System (PARIS).

Performance related pay

George Soros, in his book The Crisis of Global Capitalism, elaborates on his concept of reflexivity. His point is that in the social world, where perceptions can influence behaviour, saying 'it is so' may indeed 'make it so'. Mistaken beliefs about the nature of the physical world have no influence on the physical world, but distorted beliefs about the social world can have an impact. One of the distortions promulgated by those seeking to implement performance related pay is that pay is the great motivator. This is only a hypothesis, and before huge amounts of money go into implementing performance related pay systems, they should be put to an experimental test in which some schools get performance related pay and other schools get equivalent money to spend as they wish. The negative influences of performance related pay are potentially the destruction of teamwork, the demoralisation of those who do not get a performance pay rise, the corruption of data due to the chance of financial gain from 'good' exam results, and the message sent to students that teachers work for pay: not for their love of the subject, not for their concern for their students, but for pay. According to Soros's concept of reflexivity, this very implication can make itself come true as beliefs become distorted.

Will over-reliance on indicator systems delay the search for better sources of evidence?

Just as epidemiology is inadequate as a basis for assessing medical treatments, so indicators are inadequate as a means of establishing 'what works' in education. As argued earlier, as schools experience the yearly receipt of indicators of the progress of every student and see the data accumulating in Statistical Process Control charts, they realise that simply watching the indicators, whilst very important, is a slow way to find out 'what works'.

The launch, in Philadelphia in February 2000, of the Campbell Collaboration represents a major effort to create a more just and effective society. It is important that the provision of indicators should support this step forward, and indicators do indeed provide an excellent context in which to conduct experiments: by embedding experiments in institutions with on-going indicator systems, time series data with randomised interventions becomes a very powerful source of high quality evidence.
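For readers unfamiliar with Statistical Process Control, the sketch below gives a minimal, hypothetical illustration (invented value-added figures, not the CEM Centre's software) of the idea: Shewhart-style control limits are placed around a department's yearly value-added residuals so that only a result beyond ordinary year-to-year variation prompts a search for causes.

# Minimal SPC-style sketch with invented data: yearly value-added residuals
# for one hypothetical department, with Shewhart-style 3-sigma control limits.
from statistics import mean, stdev

# Hypothetical value-added residuals (observed minus predicted outcome), one per year.
value_added = {
    1993: 0.4, 1994: -0.2, 1995: 0.1, 1996: -0.5, 1997: 0.3,
    1998: -0.1, 1999: 0.2, 2000: 1.9,   # 2000: a possible signal
}

baseline = [v for year, v in value_added.items() if year < 2000]
centre = mean(baseline)
sigma = stdev(baseline)
upper, lower = centre + 3 * sigma, centre - 3 * sigma

print(f"Centre line {centre:+.2f}, control limits [{lower:+.2f}, {upper:+.2f}]")
for year, v in sorted(value_added.items()):
    flag = "OUT OF CONTROL" if not (lower <= v <= upper) else ""
    print(f"{year}: {v:+.2f} {flag}")

# Points inside the limits are treated as ordinary year-to-year variation;
# only a point outside the limits (here, hypothetically, the year 2000) is
# worth investigating, ideally with a planned experiment rather than blame.

Embedding a randomised intervention in such a series then amounts to asking whether the post-intervention points shift relative to these limits, rather than reacting to every wiggle in the data.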


The role of the public sector

Indicator systems, feasible because of computers, may make the public sector, and in particular public sector management, a fascinating exercise in applied social science. Finally social scientists may have some responsibility for more than arguments and papers. The actions of managers and administrators should be guided by social science findings, and they can study their success in applying those findings by watching the indicators, as business managers watch the bottom line or the share price. Perhaps, indeed, the pensions of Chief Education Officers could be tied to the long-term outcomes of the students who are in their care for about 15,000 hours of compulsory treatment. However, the public sector, including universities, will need to permit innovation, flexibility and devolved 'site-based management', and public servants will need to reduce drastically time-serving hierarchies and inefficient bureaucracies.

Conclusion

The most important aspect of an indicator system is its reactivity: the impact it has on behaviour in the system being monitored. All the issues raised above need attention if we are to create indicator systems in which the benefits outweigh the costs.

Porter (1988) described the tensions in how indicator systems may be used. When a headteacher 11 said that our indicator systems had 'introduced a research ethos into the school', we felt this was exactly what was desirable and ethical. But there are pressures to make indicators part of an aggressive management culture, including target setting and performance related pay. Without knowledge of causes, effects and the magnitudes of effects, this is likely to be unproductive gaming. Good management requires good science, including the recognition of our ignorance concerning many aspects of schooling. An 'Evidence-Based Education Network' is one of the ways in which we wish to promote the research agenda in our 'distributed research' with schools. The questions are not 'Who is to blame and who needs to be rewarded?' but 'What do we know and how do we find out what works?' A research ethos.

Notes

1 Each summer, with a turn-around time of a few weeks, the CEM Centre processes hundreds of variables and matched pre-post scores on over a million students. Staff look after 12 servers and a relational database management system (RDBMS) used by researchers and by secretarial and administrative staff.

2 The examination system in England has long delivered authentic, high-stakes, curriculum-embedded tests called 'examinations'. These complex authentic tests are based on syllabuses to which teachers teach, and the examination papers are published each year along with comments from examiners. The systems were set up by universities. Teachers are hired to mark the scripts against clearly designed criteria. The examinations are 'high stakes' but not punitive, aiming instead to provide certification that assists in gaining university entrance and jobs.

3 Further discussion of the statistical issues is available in the Vernon Wall lecture on the website www.cem.dur.ac.uk/software/.

4 Roughly comparable to Advanced Placement in the U.S. Advanced level examinations in England are taken at age 18, and there is, for 2001, a new examination in the year before.


5 E.g. 'Levels' in primary schools and 'grades' (A, B, C, etc.) in secondary schools.

6 To assist readers in distinguishing correlational from experimental findings, the ES is subscripted 'rct' if it arises directly from the manipulation in a randomised controlled trial. This practice (recommended in Fitz-Gibbon, 1999, p. 37) could make meta-analyses considerably easier to conduct, especially for electronically published articles.

7 YELSIS, also known as YELLIS: the Year 11 Information System.

8 Department for Education and Employment, based in London.

9 WEBSITE: Error! Reference source not found.

10 Local Education Authority, i.e., School District.

11 Keith Nancekievil, Gosforth High School, Newcastle upon Tyne, England.

References

Alkin, M.C. & Fitz-Gibbon, C.T. (1975). Methods and theories of evaluating programs. Journal of Research and Development in Education, 8(3), 2-15.

Baker, A.P., Xu, D. & Detch, E. (1995). The Measure of Education: A Review of the Tennessee Value Added Assessment System. Nashville, TN: Office of Education Accountability, Tennessee Department of Education, Comptroller of the Treasury.

Berliner, D.C. (1992). Telling the stories of educational psychology. Educational Psychologist, 27(2), 143-161.

Bloom, B.S. (1979). Alterable Variables: The New Direction in Educational Research. Edinburgh: Scottish Council for Research.

Bloom, B.S. (1984). The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, June/July, 4-16.

Carver, R.P. (1975). The Coleman Report: Using inappropriately designed achievement tests. American Educational Research Journal, 12(1), 77-86.

Cohen, P.A. (1980). Effectiveness of student-rating feedback for improving college instruction: A meta-analysis of findings. Research in Higher Education, 13(4), 321-341.

Coopers and Lybrand (1988). Local Management of Schools. London: HMSO.

Cousins, J.B. & Leithwood, K.A. (1986). Current empirical research on evaluation utilization. Review of Educational Research, 56(3), 331-364.

Deming, W.E. (1986). Out of the Crisis: Quality, Productivity and Competitive Position. Cambridge University Press.

Dishion, T.J., McCord, J. et al. (1999). When interventions harm. American Psychologist, 54(9), 755-764.


Donoghue, M., Thomas, S., Goldstein, H. & Knight, T. (1996). DfEE Study of Value Added for 16-18 Year Olds in England. London: DfEE.

Fitz-Gibbon, C.T. (1985). Using audio tapes in questionnaire administration. Research Intelligence, 19, 8-9.

Fitz-Gibbon, C.T. (1988, August 26). Recalculating the standard. The Times Educational Supplement, p. 15.

Fitz-Gibbon, C.T. (1990). Performance Indicators: A BERA Dialogue. Clevedon, Avon: Multilingual Matters.

Fitz-Gibbon, C.T. (1992). The design of indicator systems, the role of education in universities, and the role of inspectors/advisers: A discussion and a case study. Research Papers in Education: Policy and Practice, 7(3), 271-300.

Fitz-Gibbon, C.T. (1996). Monitoring Education: Indicators, Quality and Effectiveness. London: Cassell/Continuum.

Fitz-Gibbon, C.T. (1997). The Value Added National Project, Final Report: Feasibility Studies for a National System of Value Added Indicators. London: School Curriculum and Assessment Authority.

Fitz-Gibbon, C.T. (1999). Education: High potential not yet realized. Public Money & Management, 19(1), 33-40.

Fitz-Gibbon, C.T. & Kochan, S. (2000). School effectiveness and education indicators. In C. Teddlie & D. Reynolds (Eds.), The International Handbook of School Effectiveness Research (pp. 257-282). London: Falmer Press.

Fitz-Gibbon, C.T. & Stephenson-Forster, N.J. (1999). Is Ofsted helpful? An evaluation using social science criteria. In C. Cullingford (Ed.), An Inspector Calls: Ofsted and Its Effect on School Standards (pp. 97-118). London: Kogan Page.

Fitz-Gibbon, C.T. & Vincent, L. (1994). Candidates' Performance in Public Examinations in Mathematics and Science. London: School Curriculum and Assessment Authority (SCAA).

Glass, G.V., McGaw, B. & Smith, M.L. (1981). Meta-Analysis in Social Research. Beverly Hills, CA: SAGE Publications.

Green, D.R. (1974). The Aptitude Achievement Distinction. Monterey, CA: McGraw-Hill.

Hart, P.M., Conn, M. & Carter, N. (1992). School Organisational Health Questionnaire: Manual. Melbourne, Victoria, Australia: Department of School Education.


Hedges, L.V. & Olkin, I. (1985). Statistical Methods for Meta-Analysis. New York: Academic Press.

Kogan, M. (1999). The Ofsted System of School Inspection: An Independent Evaluation. A report of a study by the Centre for the Evaluation of Public Policy and Practice and Helix Consulting Group. Brunel University: CEPPP.

Lightfoot, S.L. (1983). The Good High School: Portraits of Character and Culture. New York: Basic Books.

McCord, J. (1978). A thirty-year follow-up of treatment effects. American Psychologist, 33, 284-289.

McCord, J. (1981). Considerations of some effects of a counseling program. In S.E. Martin, L.B. Sechrest & R. Redner (Eds.), New Directions in the Rehabilitation of Criminal Offenders (pp. 393-405). Washington, DC: National Academy Press.

OECD (1998). Education at a Glance: OECD Indicators. Paris: OECD.

Porter, A. (1988). Indicators: Objective data or political tool? Phi Delta Kappan, March 1988, 503-508.

Sanders, W.L. & Horn, S. (1995). An Overview of the Tennessee Value-Added Assessment System (TVAAS): Answers to Frequently Asked Questions. Knoxville, TN: The University of Tennessee.

Shewhart, W.A. (1986). Statistical Method from the Viewpoint of Quality Control. Dover. (Originally published by the Graduate School, Department of Agriculture, Washington, 1939.)

Smith, P. (1995). On the unintended consequences of publishing performance data in the public sector. International Journal of Public Administration, 18(2/3), 277-310.

Somekh, B., Convery, A. et al. (1999). Improving College Effectiveness: Raising Quality and Achievement. London: Further Education Development Agency.

Tymms, P.B. (1997). The Value Added National Project, Technical Report: Primary 3. London: SCAA.

Tymms, P.B. (1999). Baseline Assessment and Monitoring in Primary Schools: Achievements, Attitudes and Value-Added Indicators. London: David Fulton.

Tymms, P.B. & Fitz-Gibbon, C.T. (2001). Standards achievement and educational performance, 1976-2001: A cause for celebration? In R. Phillips & J. Furlong (Eds.), Education, Reform and the State: Politics, Policy and Practice, 1976-2001. London: Routledge.

Tymms, P.B., Merrell, C. & Henderson, B. (1997). The first year at school: A quantitative investigation of the attainment and progress of pupils. Educational Evaluation and Research, 3(2), 101-118.

About the Authors


Professor Carol Taylor Fitz-Gibbon
Director, The Curriculum, Evaluation and Management Centre
University of Durham
England, UK
Email: Carol.Fitz-Gibbon@cem.dur.ac.uk

After 10 years of teaching physics and mathematics in a variety of schools in the U.K. and then the U.S., Carol Fitz-Gibbon conducted a study for the U.S. Office of Education on the identification of mentally gifted, inner-city students and then became a Research Associate for Marvin C. Alkin at the Center for the Study of Evaluation, UCLA. She completed a Ph.D. in Research Methods and Evaluation, obtained a grant on the design of compensatory education, co-authored a series of textbooks and returned to the UK in 1978 planning to continue work on cross-age and peer tutoring. But the success of an indicator system she developed with 12 schools in 1983 led to other areas. Much of this work is described in the prize-winning book Monitoring Education: Indicators, Quality and Effectiveness (1996).

Professor Peter Tymms
Director of the PIPS Project
The Curriculum, Evaluation and Management Centre
University of Durham
England, UK

After taking a degree in natural sciences, Peter Tymms taught in a wide variety of schools from Central Africa to the north-east of England before starting an academic career. He was "Lecturer in Performance Indicators" at Moray House, Edinburgh, before moving to Newcastle University and then to Durham University, where he is presently Professor of Education. His main research interests are in monitoring, assessment, school effectiveness and research methodology. He is Director of the PIPS project within the CEM Centre, which involves monitoring the progress and attitudes of pupils in about 4,000 primary schools. He has published many academic articles, and his book Baseline Assessment and Monitoring in Primary Schools has recently appeared.
