
Educational policy analysis archives


Material Information

Educational policy analysis archives
Physical Description:
Arizona State University
University of South Florida
Place of Publication:
Tempe, Ariz
Tampa, Fla
Publication Date:


Subjects / Keywords:
Education -- Research -- Periodicals   ( lcsh )
non-fiction   ( marcgt )
serial   ( sobekcm )

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E11-00187
usfldc handle - e11.187


Volume 8 Number 43    August 21, 2000    ISSN 1068-2341

A peer-reviewed scholarly electronic journal
Editor: Gene V Glass, College of Education, Arizona State University

Copyright 2000, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold.

Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Consistency of Findings Across International Surveys of Mathematics and Science Achievement: A Comparison of IAEP2 and TIMSS

Michael O'Leary
Thomas Kellaghan
St Patrick's College, Dublin

George F. Madaus
Albert E. Beaton
Boston College

Abstract

The investigation reported here was prompted by discrepancies between the performance of Irish students on two international tests of science achievement: the Second International Assessment of Educational Progress (IAEP2) administered in 1991 and the Third International Mathematics and Science Study (TIMSS) administered in 1995. While average science achievement for Irish 13-year-olds was reported to be at the low end of the distribution representing the 20 participating countries in IAEP2, it was around the middle of the distribution representing the 40 or so countries that participated in TIMSS at grades 7 and 8. An examination of the effect sizes associated with mean differences in performance on IAEP2 and TIMSS indicated that the largest differences are associated with the performance of students in France, Ireland and Switzerland. Five hypotheses are proposed to account for the differences.

Introduction

International comparative studies of student achievement have become part of the educational landscape over the past four decades. In these studies, a number of countries (usually represented by research organizations) agree on an instrument to assess achievement in a curriculum area, the instrument is administered to a representative sample of students at a particular age or grade level in each country, and comparative analyses of the data obtained are carried out. The most frequently assessed areas have been reading, mathematics, and science at ages 9 or 10 and 13 or 14. The number of participating countries has grown from 12 in a pilot project conducted between 1959 and 1961 to over 40 for a survey of mathematics and science achievements in 1995 (see Goldstein, 1996; Husén & Postlethwaite, 1996; Kellaghan, 1996).

The potential of international studies to contribute to policy formation was made clear from the earliest studies (Husén, 1967; Lambin, 1995). Over the years, a range of purposes to which information derived from such studies might be put has been suggested. These include the pursuit of equity goals, setting priorities, assessing the effectiveness and efficiency of the educational enterprise and the appropriateness of curricula, evaluating instructional methods and the organization of the school systems, and providing a mechanism for accountability (Kellaghan & Grisay, 1995; Plomp, 1992). While we have relatively little information on the extent to which the findings of international studies have in fact been utilized, there is no doubt that they attract considerable media and public attention.
A variety of factors can affect the extent to which data obtained in an international study accurately reflect what students have learned in the participating countries, something that is necessary if valid comparisons between countries are to be made (see Brown, 1996, 1998; Goldstein, 1996; Kellaghan, 1996; Kellaghan & Grisay, 1995; Murphy, 1996; Nuttall, 1994). One relates to the adequacy of a uniformly administered assessment procedure to measure the outcomes of a variety of curricula. Since curricula differ from country to country, an assessment instrument will not reflect the curricula of all countries participating in an international study to the same degree.

The second factor relates to the extent to which the populations and samples of pupils for whom data are obtained can be regarded as equivalent. Defined target populations may not be comparable across countries since exclusion practices may differ (e.g., relating to students with handicapping conditions/learning problems or when the language of the assessment instrument differs from the language of the school). Differences in participation rates of selected samples (due to lack of co-operation from schools, student absenteeism) will make matters worse.

Many commentators have considered how these problems impact on comparisons based on a single study. Additional problems arise when the findings of two different surveys are being compared. In the case of IAEP2 and TIMSS, instruments used to measure achievement differed in form and content sampled, age-based versus grade-based population definitions were used, and different methods of data manipulation were utilized.

The investigation reported here was prompted by discrepancies between the performance of Irish students on tests of science in the Second International Assessment of Educational Progress in Mathematics and Science (IAEP2) (Lapointe, Askew & Mead, 1992) in 1991 and, four years later, in the Third International Mathematics and Science Study (TIMSS) (Beaton, Mullis, Martin, Gonzalez, Kelly & Smith, 1996a; Beaton, Martin, Mullis, Gonzalez, Smith, & Kelly, 1996b). Initially, the intention was to focus on the Irish problem but, as the investigation proceeded, it became clear that discrepancies in performance between the two surveys were not confined to Irish students.

In this article, we first present brief descriptions of IAEP2 and TIMSS. We then select 12 countries that participated in both surveys for further analyses: Canada, England, France, Hungary, Ireland, Korea, Portugal, Scotland, Slovenia, Spain, Switzerland, and the United States. Our approach to assessing the consistency of countries' performances is based on an examination of the performance of each country relative to the performance of other countries in both surveys. If results are stable, differences in performance between countries should not vary very much from one survey to the next. To the extent that they do, findings may be regarded as unstable. Changes in effect sizes between pairs of means on the two assessments were calculated to obtain an estimate of the magnitude of differences between performance on the two occasions.

IAEP2 and TIMSS

In IAEP2, representative samples of 9- and 13-year-olds in 20 countries were assessed in mathematics and science in 1991 (Lapointe, Askew & Mead, 1992). In TIMSS, the mathematics and science achievements of students in grades 3, 4, 7, 8, and in the final grade of secondary education were assessed in 1995 (Beaton et al., 1996).
Data are reported in our article for 13-year-olds in IAEP2 and for grades 7 and 8 students in TIMSS. However, the main focus is on grade 7 performance, since in all countries that had participated in both assessments except Scotland, more 13-year-olds were in grade 7 than in grade 8 (Beaton et al., 1996a, p. A12).

The IAEP2 tests for 13-year-olds were contained in two separate booklets, each of which had to be completed by students in four 15-minute segments (one hour testing time in all). The mathematics booklet contained 76 items and covered four content areas: Measurement, Geometry, Data Analysis/Statistics/Probability, and Algebra/Functions. The science test consisted of 72 items and covered four content areas: Life Sciences, Physical Sciences, Earth/Space Sciences, and the Nature of Science. Students completed either a mathematics or science test and were administered all items on the test.

Unlike IAEP2, the TIMSS test booklets contained both mathematics and science items. At grades 7 and 8, the mathematics test comprised 151 items and the science test 135 items. The TIMSS mathematics items covered six content areas: Fractions/Number Sense, Geometry, Algebra, Data Representations/Analysis/Probability, Measurement, and Proportionality. The science content areas were: Earth Science, Life Science, Physics, Chemistry, and Environmental Issues/Nature of Science. Items were rotated across eight test booklets and student performance was matrix-sampled using a modified Balanced-Incomplete-Block (BIB) spiraling design (Martin & Kelly, 1997). One and a half hours were allocated for the completion of each booklet.

In both studies, performance on both tests was reported in the form of an average percentage correct score. In the case of TIMSS, an average scale score for each country was also reported.


While scale scores were calculated for the IAEP2 study, they were not included in the published reports.

The Consistency of IAEP2 and TIMSS Science Results

In 1991, the average science performance of Irish 13-year-olds is significantly below the average performance of students in all but two of the 'common' countries (Portugal and the US) and also significantly below the international mean (Lapointe, Askew, & Mead, 1992). However, in 1995, the average performance of Irish students on the TIMSS test at grades 7 and 8 compares much more favorably with the 'common' countries and with the overall TIMSS means (Beaton et al., 1996b). This change of fortune is clearly evident in Table 1, in which countries are listed from highest achieving to lowest achieving, and are categorized according to whether their means were statistically significantly above, below, or did not differ from, the Irish mean.

Table 1
Science and Mathematics Means of Countries that Participated in IAEP2 and TIMSS (Categorised in Terms of the Significance of Difference of Each Mean from the Irish Mean) [a, b]

Science
             IAEP2 13-year-olds     TIMSS Grade 7          TIMSS Grade 8
                   M     SE               M     SE               M     SE
Overall [c]        66.9                   49.8  (0.1)            55.5  (0.1)
             Kor   77.5  (0.5)      Kor   61.4  (0.4)      Kor   65.5  (0.3)
             Swi   73.7  (0.9)      Slo   57.2  (0.5)      Slo   61.7  (0.5)
             Hun   73.4  (0.5)      Hun   55.5  (0.6)      Eng   61.3  (0.6)
             Slo   70.3  (0.5)      Eng   55.6  (0.6)      Hun   60.7  (0.6)
             Can   68.8  (0.4)      US    54.0  (1.1)      Can   58.7  (0.5)
             Eng   68.7  (1.2)      Can   54.0  (0.5)      Ire   58.4  (0.9)
             Fra   68.6  (0.6)      Ire   52.0  (0.7)      US    58.3  (1.0)
             Sco   67.9  (0.6)      Swi   50.1  (0.4)      Swi   56.3  (0.5)
             Spa   67.5  (0.6)      Spa   49.3  (0.4)      Spa   55.6  (0.4)
             US    67.0  (1.0)      Sco   48.2  (0.8)      Sco   55.3  (1.0)
             Ire   63.3  (0.6)      Fra   46.1  (0.6)      Fra   53.7  (0.6)
             Por   62.6  (0.8)      Por   41.3  (0.5)      Por   49.9  (0.6)

Mathematics
             IAEP2 13-year-olds     TIMSS Grade 7          TIMSS Grade 8
                   M     SE               M     SE               M     SE
Overall            58.3                   49.3  (0.1)            55.1  (0.1)
             Kor   73.4  (0.6)      Kor   67.0  (0.6)      Kor   71.7  (0.5)
             Swi   70.8  (1.3)      Hun   53.8  (0.8)      Swi   62.0  (0.6)
             Hun   68.4  (0.8)      Swi   53.1  (0.5)      Hun   61.5  (0.7)
             Fra   64.2  (0.8)      Ire   53.3  (1.0)      Fra   61.3  (0.8)
             Can   62.0  (0.6)      Slo   52.5  (0.7)      Slo   61.2  (0.7)
             Eng   60.6  (2.2)      Can   51.6  (0.5)      Ire   58.7  (1.2)
             Sco   60.6  (0.9)      Fra   51.0  (0.8)      Can   58.7  (0.5)
             Ire   60.5  (0.9)      US    47.7  (1.2)      Eng   53.1  (0.7)
             Slo   57.1  (0.8)      Eng   47.2  (0.9)      US    53.0  (1.1)
             Spa   55.4  (0.8)      Sco   44.3  (0.9)      Sco   51.6  (1.3)
             US    55.3  (1.0)      Spa   42.4  (0.6)      Spa   51.0  (0.5)
             Por   48.3  (0.8)      Por   36.6  (0.6)      Por   42.9  (0.7)

[a] In TIMSS, overall scale scores rather than overall average percents correct were used to report the outcomes of statistical tests.
[b] Average performance in countries whose data appear in bolded type is not statistically significantly different from that in Ireland. Average performance in countries above the bolded entries is statistically significantly above that in Ireland. Average performance in countries below the bolded entries is statistically significantly below that in Ireland.
[c] The international averages in the table are for all participating countries and educational systems in each of the studies. The standard errors for the IAEP averages were not published.
Source. For IAEP2: Lapointe, Askew, & Mead (1992), Lapointe, Mead, & Askew (1992), ETS (1992). For TIMSS: Beaton et al. (1996a; b), Center for the Study of Testing, Evaluation and Public Policy (n.d.).

Compared to their performance on the IAEP2 science assessment, four countries maintain their superiority over Ireland on the TIMSS assessment at grade 7 (Korea, Slovenia, Hungary, England). Two, having performed at a superior level on IAEP2, achieve at levels comparable to Ireland in TIMSS (Canada, Switzerland), while three that were superior on IAEP2 record a significantly poorer performance on TIMSS (France, Scotland, Spain). Comparisons between IAEP2 performance and performance at grade 8 on TIMSS reveal a somewhat similar pattern in which only two countries (Korea and Slovenia) maintain their superior position.
It is apparent that the relative performances of countries other than Ireland also change between IAEP2 and TIMSS (e.g., France and Switzerland). It could be argued that the same phenomenon occurs in mathematics (compare, for example, English and Scottish performances in the two surveys). However, changes in position are less frequent in mathematics, a finding that is reflected in the magnitude of the correlations between scores in the two assessments (Table 2).

Table 2
Correlations Between the Performances of Countries that Participated in Both IAEP2 and TIMSS (n=12)

                                   TIMSS Grade 7       TIMSS Grade 7
                                   Mean Scale Score    Mean Percent Correct
Mathematics
  IAEP2 Mean Scale Score           .83
  IAEP2 Mean Percent Correct                           .83
Science
  IAEP2 Mean Scale Score           .55
  IAEP2 Mean Percent Correct                           .66

In considering the consistency of scores from one assessment to another, data on statistical significance from the published reports could have been used (as they were in Table 1). However, since our interest is in the extent to which the size of differences between pairs of country means changed across the assessments, we chose to use an effect-size index.

Effect Size Differences

The effect size is a measure of the magnitude in numerical terms of a difference of interest (in the present case, mean differences between countries) (Hair, Anderson, & Black, 1995; Wolf, 1986). The measure chosen for the present analysis is Cohen's d, which is a measure of standardized differences between means, expressed in terms of standard deviation units (Cohen, 1977). The measure provides a scale-invariant estimate of the magnitude of an effect and involves dividing the value of the difference between two group means by the pooled standard deviation, using the formula

$$d = (M_1 - M_2) / s_{pooled}$$

in which $d$ is the effect size index for differences between means in standard units; $M_1$ and $M_2$ are the sample means in original measurement units; and $s_{pooled}$ is the pooled standard deviation for both samples, calculated as

$$s_{pooled} = [(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2]^{1/2} (n_1 + n_2 - 2)^{-1/2}$$

where $n_1$ and $n_2$ are the sample sizes and $s_1$ and $s_2$ the sample standard deviations.

The effect size measure is now in the common metric of standard deviation units. Thus, an effect size of 0.3 indicates that one country scored 0.3 of a standard deviation higher (or lower) than the comparison country. Guidance for interpreting effect sizes is equivocal. It has been suggested that effect sizes around 0.2 are small, those around 0.5 are medium, and those around or above 0.8 are large (Cohen, 1977). However, the significance of an effect size will depend on the context in which it is obtained (Durlak, 1995).

Table 3


Effect Sizes Observed in Science for IAEP2

       Can    Eng    Fra    Hun    Ire    Kor    Por    Sco    Slo    Spa    Swi    US
Can    .00   +.01   +.04   -.27   +.39   -.54   +.45   +.08   -.03   +.15   -.31   +.16
Eng   -.01    .00   +.03   -.27   +.34   -.53   +.41   +.06   -.04   +.12   -.28   +.15
Fra   -.04   -.03    .00   -.30   +.32   -.57   +.39   +.03   -.07   +.09   -.32   +.12
Hun   +.27   +.27   +.30    .00   +.60   -.26   +.67   +.32   +.23   +.42   -.01   +.43
Ire   -.39   -.34   -.32   -.60    .00   -.89   +.07   +.28   -.39   -.26   -.65   -.21
Kor   +.54   +.53   +.57   +.26   +.89    .00   +.96   +.60   +.50   +.69   +.25   +.69
Por   -.45   -.41   -.39   -.67   -.07   -.96    .00   -.35   -.45   -.33   -.70   -.28
Sco   -.08   -.06   -.03   -.32   +.29   -.60   +.35    .00   -.10   +.06   -.36   +.09
Slo   +.03   +.04   +.07   -.23   +.39   -.50   +.45   +.10    .00   +.18   -.27   +.19
Spa   -.15   -.12   -.09   -.42   +.26   -.69   +.33   -.06   -.18    .00   -.46   +.03
Swi   +.31   +.28   +.32   +.01   +.66   -.25   +.70   +.36   +.27   +.46    .00   +.44
US    -.16   -.15   -.12   -.43   +.21   -.69   +.28   -.09   -.19   -.03   -.44    .00

Note: Reading across the row and comparing performance with country listed in heading: Positive effect sizes reflect higher average performance; negative effect sizes reflect lower average performance.

Table 4
Effect Sizes Observed in Science for TIMSS Lower Grade

       Can    Eng    Fra    Hun    Ire    Kor    Por    Sco    Slo    Spa    Swi    US
Can    .00   -.14   +.61   -.21   +.04   -.39   +.83   +.34   -.35   +.26   +.17   -.09
Eng   +.14    .00   +.72   -.06   +.17   -.24   +.89   +.44   -.18   +.39   +.28   +.04
Fra   -.61   -.72    .00   -.88   +.58  -1.01   +.31   -.23  -1.06   -.34   -.44   -.57
Hun   +.21   +.06   +.88    .00   +.25   -.19  +1.12   +.54   -.13   +.50   +.39   +.10
Ire   -.04   -.17   +.58   -.25    .00   -.44   +.86   +.29   -.39   +.22   +.13   -.12
Kor   +.39   +.24  +1.01   +.19   +.44    .00  +1.20   +.73   +.05   +.66   +.56   +.26
Por   -.83   -.89   -.31  -1.12   -.86  -1.20    .00   -.51  -1.39   -.63   -.75   -.77
Sco   -.34   -.44   +.23   -.54   -.29   -.73   +.51    .00   -.68   -.11   -.18   -.38
Slo   +.35   +.18  +1.06   +.13   +.39   -.05  +1.39   +.68    .00   +.66   +.55   +.21
Spa   -.26   -.39   +.34   -.50   -.22   -.66   +.63   +.11   -.66    .00   -.09   -.30
Swi   -.17   -.28   +.44   -.39   -.13   -.56   +.75   +.18   -.55   +.09    .00   -.23
US    +.09   -.04   +.57   -.10   +.12   -.26   +.77   +.38   -.21   +.30   +.23    .00

Note: Reading across the row and comparing performance with country listed in heading: Positive effect sizes reflect higher average performance; negative effect sizes reflect lower average performance.

The effect sizes associated with country differences in the IAEP2 and TIMSS surveys are contained in Tables 3 and 4 respectively and are based on the weighted ns, scale scores, and standard deviations (see Appendix A and B). Scale scores for IAEP2 were taken from the public use data file. Changes in effect sizes between pairs of means on the assessments are the absolute values of the difference between the effect size for the IAEP2 assessment and the effect size for TIMSS, i.e., $d_{change} = |d_{IAEP2} - d_{TIMSS}|$. These absolute values are presented in Table 5.

Table 5
Absolute Value of the Differences Between the Effect Sizes Observed in Science for IAEP2 and TIMSS Lower Grade

[Table entries missing in the source text. Rows and columns: Can, Eng, Fra, Hun, Ire, Kor, Por, Sco, Slo, Spa, Swi, US.]

Note: Slight differences between the absolute values in this table and the values in Tables 3 and 4 on which they are based result from rounding error.

Reading across the columns or down the rows gives the effect size differences for a country compared to all other countries. For example, the difference between the effect sizes for Canada and England in the two assessments is 0.15 standard deviation units, a small difference reflecting the fact that the mean achievement of the two countries is not significantly different in either assessment.

Most of the largest effect size differences are associated with France, Ireland, and Switzerland (Table 5). Large effect size differences are evident at the intersection of France and Ireland (0.90) and at the intersection of Ireland and Switzerland (0.77). This is a reflection of the fact that while Ireland's standing relative to these countries was poor in IAEP2, Ireland scored higher than these countries in TIMSS. The intersection of France and Switzerland shows a small effect size difference (0.12) and confirms that these countries maintained their position relative to each other on both occasions. However, effect size differences at the intersection of France and countries such as England (0.69), Hungary (0.58), Slovenia (0.99) and the US (0.69) are large. The Swiss change of fortune is clearly reflected in the effect size differences between it and England (0.57), Slovenia (0.82), and the US (0.67).

Moderate to large effect size differences are also associated with comparisons involving Portugal, Scotland, Slovenia, and the US. For example, the effect size difference at the intersection of Portugal and Slovenia is 0.93. In both assessments, Portugal scored significantly lower than Slovenia. However, the large value results from the fact that while the effect size was in the order of 0.45 in IAEP2, it increased to 1.39 in TIMSS. Indeed, most of the other large effect size differences associated with Portugal reflect that country's very poor performance in TIMSS. Other moderately large effect size differences worth noting are those at the intersections of Scotland and Slovenia (0.57), Scotland and the US (0.47), Korea and Slovenia (0.45), Slovenia and Spain (0.48), and Korea and the US (0.43). Other analyses, not reported here, show that the absolute values of differences between effect sizes observed for mathematics, though large in some cases, are generally much smaller than for science (O'Leary, 1999).

Conclusion

The dilemma that our findings give rise to for policy makers seems straightforward enough. Do the findings (for some countries at any rate) indicate a change in level of science achievement over time? And if not, which results are to be taken as a 'true' reflection of a nation's achievement?
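As a concrete illustration of the effect-size comparison on which these findings rest, the following minimal Python sketch computes Cohen's d with a pooled standard deviation and the absolute change in effect size between two assessments. The function names and the inputs passed to `cohens_d` are hypothetical and are not taken from the study; the effect sizes passed to `d_change` are read from Tables 3 and 4.

```python
import math


def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation of two samples (s = SD, n = sample size)."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))


def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d: difference between two means in pooled-SD units."""
    return (m1 - m2) / pooled_sd(s1, n1, s2, n2)


def d_change(d_iaep2, d_timss):
    """Absolute change in effect size between the two assessments."""
    return abs(d_iaep2 - d_timss)


# Hypothetical inputs (not from the study): two samples with equal SDs.
example_d = cohens_d(m1=510.0, s1=100.0, n1=400, m2=480.0, s2=100.0, n2=400)

# Effect sizes read from Tables 3 and 4 (row country vs. column country),
# given as (IAEP2, TIMSS lower grade).
pairs = {
    ("Can", "Eng"): (+0.01, -0.14),
    ("Ire", "Fra"): (-0.32, +0.58),
    ("Swi", "Slo"): (+0.27, -0.55),
}
changes = {pair: round(d_change(*ds), 2) for pair, ds in pairs.items()}

if __name__ == "__main__":
    print(f"example d = {example_d:.2f}")
    for pair, value in changes.items():
        print(pair, value)
```

For these three pairs the sketch gives 0.15, 0.90, and 0.82, which agree with the values quoted in the text for Canada and England, France and Ireland, and Switzerland and Slovenia.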
Careful consideration now needs to be given to the task of trying to explain why performance in the two assessments seems to be so different for some countries. At least five hypotheses can be suggested (see Beaton et al., 1990 for a description of efforts to disentangle the 1985/86 reading anomaly in the National Assessment of Educational Progress in the United States). These, each of which will be briefly considered, relate to population definitions, survey implementation, approaches to data analysis, the possibility of real gains or losses in the achievement of students in some countries during the period between the two surveys, and measuring instrument issues.

Firstly, differences in population definitions might account for differences in the relative performance of students in IAEP2 and TIMSS science. In IAEP2 a sample of students who were 13 years old was tested. In TIMSS the students were in grades 7 and 8. While there is some overlap between these two populations, there are differences between them that need to be taken into account when comparing performance. For example, it is noteworthy that for TIMSS science the estimated median scale score for Irish 13-year-olds (486) is lower than the mean scale score for Irish seventh graders (495) and that the median score for Swiss 13-year-olds is exactly equivalent to the Irish mean at the seventh grade (see Beaton et al., 1996b, pp. 26 and 37). (A median scale score rather than a mean scale score was calculated for 13-year-olds in TIMSS due to the fact that students were sampled by grade and not by age. Not all 13-year-olds were in the grades sampled and, as a consequence, an estimate of the median was thought to be more reliable.) Ramseier (1997, personal communication) claims that a large part of the change in Swiss performance between IAEP2 and TIMSS can be explained by the fact that 44% of Swiss 13-year-olds are in grade 8. He argues that comparing Swiss grade 8 performance to the performance of grade 7 students in Ireland (where most 13-year-olds are) provides evidence that Swiss IAEP2 and TIMSS performances may not be all that different. However, taking the sampling variability of both medians into account, it must still be argued that, as the scores for both sets of 13-year-olds suggest, Switzerland did not perform significantly better than Ireland in TIMSS. (The standard errors of the Irish and Swiss medians were 3.1 and 2.2 respectively.)

Secondly, populations with exclusions and low participation rates in some countries may also account for some of the differences in outcomes across the two studies. Exclusions were caused by countries modifying the internationally agreed definition of the population to be tested. Low participation rates were caused by having combined school and student participation rates below an agreed cut-off mark (70% in IAEP2 and 75% in TIMSS). A few examples will suffice to illustrate the point. In IAEP2, Spain excluded students in Cataluna but included them in TIMSS. In IAEP2, Switzerland tested in only 15 of the 26 Cantons whereas 22 Cantons were involved in TIMSS. In IAEP2, England had a final participation rate of only 48% while in TIMSS it was closer to 80% after replacement. Indeed, a particularly vexing question in international assessments (or any large-scale assessment for that matter) is the extent to which exclusions and participation rates affect overall performance (see Linn & Baker, 1995).

Thirdly, differences in approaches to data analysis may account for differences in the relative performance of students in IAEP2 and TIMSS science. Both IAEP2 and TIMSS use complex procedures for estimating average percentage correct and average proficiency scale scores. Technical reports that were published in conjunction with the assessments indicate that the technologies differed for the two surveys.
For example, approaches to handling missing data when calculating average percents for items differed across the two studies (not reached items were treated as not administered in IAEP2 while they were treated as incorrect in TIMSS). Moreover, in IAEP2, average scale scores were calculated using a 3-parameter Item Response Theory model, while in TIMSS a modified Rasch model was used (see Adams, Wilson & Wang, 1997). The fact that TIMSS items were matrix sampled (using a BIB design) and that a plausible values technology was used makes it a very different kind of survey to the more straightforward IAEP2.

Fourthly, between 1991 and 1995, levels of science achievement for students around 13 years of age may have increased or decreased, accounting for differences in the relative performance of students in IAEP2 and TIMSS science. We do not, however, have any evidence to support the view that substantial change occurred in the achievement of Irish 13-year-old students during the four years between IAEP2 and TIMSS. Comparing outcomes from the two assessments, all we can say is that, in a normative sense, Irish performance in TIMSS improved. Comparison with the Swiss is important here. Ramseier (1997, personal communication) suggests that age, instruction time and curriculum issues affected Swiss performance in TIMSS. Was Ireland's favorable comparison with the Swiss in TIMSS merely an artifact of poor Swiss performance? Of course Ireland's performance relative to more than one country improved and this suggests that achievement in a real sense may have improved. But we cannot say for sure.
While the time-span between the two assessments is probably not long enough to allow for the kind of gains that might help explain the improved relative performance in TIMSS, the question of how performance in IAEP2 can be equated with performance in TIMSS in an absolute sense is a substantial one and is of the utmost importance to an accurate interpretation of national performance in the two surveys.

Fifthly, differences in measuring instruments might account for differences in the relative performance of students in IAEP2 and TIMSS science. As noted above, there were differences in the content areas of the IAEP2 and TIMSS tests. TIMSS had a section entitled Environmental Issues which IAEP2 did not. There were also differences in the proportion of items assigned to common content areas. For example, while 17% of the IAEP2 items were devoted to the Nature of Science, the figure for TIMSS was 6%. In addition, more of the TIMSS test (5%) was devoted to Physics. Hence, differences in performance may be a function of differences in the nature of the achievement that was assessed. However, an interesting issue arising in this context is worth raising here. The fact is that while the instruments measuring mathematics achievement also differed in content coverage, the mathematics performance of countries across the two studies was more consistent. The question arises: In international studies do particular factors impinge much more strongly on science achievement than mathematics achievement?

Finally, and as an extension of the last point, what seems reasonably clear is that underlying the reporting of results of international studies in the popular media and in many reports emanating from government ministries is an assumption that 'science,' 'mathematics,' 'reading' and the like are clearly understood. But is this the case? Can we say that there is real consensus about the nature of these domains and the underlying psychological constructs implied by "achievement" in these subjects? Or could it be that at the international level an understanding of what constitutes achievement in mathematics, for example, is at a more advanced level than the understanding of what constitutes science achievement?
It is noteworthy that some support for this hypothesis is contained in our finding that country rank orderings were more stable in mathematics than in science across two distinct international assessments. Moreover, in the United States the analysis by Hamilton and her colleagues (1995) of a large-scale national test (NELS:88) provides further food for thought in suggesting that "achievement patterns in science were much more heterogeneous than in math" and that "[i]n science, a far greater number of factors was required to account for student performance differences" (p. 577). Such findings raise critical questions about the science tests used in international comparative studies.

Note

The poor performance of Irish students in science was also a feature of the First International Assessment of Educational Progress in Mathematics and Science (IAEP1) test in 1988 (Lapointe, Mead, & Phillips, 1989).

References

Adams, R. J., Wilson, M., & Wang, W-C. (1997) The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, pp. 1-23.

Beaton, A. E., Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., Kelly, D. L., & Smith, T. A. (1996a) Mathematics Achievement in the Middle School Years: IEA's Third International Mathematics and Science Study. Chestnut Hill, MA: Center for the Study of Testing, Evaluation, and Educational Policy, Boston College.

Beaton, A. E., Martin, M. O., Mullis, I. V. S., Gonzalez, E. J., Smith, T. A., & Kelly, D. L. (1996b) Science Achievement in the Middle School Years: IEA's Third International Mathematics and Science Study. Chestnut Hill, MA: Center for the Study of Testing, Evaluation, and Educational Policy, Boston College.

Beaton, A. E., Zwick, R., Yamamoto, K., Mislevy, R. J., Johnson, E. G., & Rust, K. F. (1990) Disentangling the NAEP 1985-86 Reading Anomaly. Princeton, NJ: Educational Testing Service.

Brown, M. (1996) FIMS and SIMS: the first two IEA international mathematics surveys, Assessment in Education: Principles, Policy & Practice, 3, pp. 193-212.

Brown, M. (1998) The tyranny of the international horse race. In R. Slee & G. Weiner with S. Tomlinson (Eds) School Effectiveness for Whom? Challenges to the School Effectiveness and School Improvement Movements, pp. 33-47 (London, Falmer Press).

Cohen, P. (1977) Statistical Power Analysis for the Behavioural Sciences. New York: Academic Press.

Durlak, J. A. (1995) Understanding meta analysis. In L. G. G. P. R. Yarnold (Eds.), Reading and Understanding Multivariate Statistics, pp. 319-352. Washington, D.C.: American Psychological Association.

ETS (1992) IAEP technical report: Volume 1. Princeton, NJ: Educational Testing Service.

Goldstein, H. (1996) Introduction. Assessment in Education: Principles, Policy and Practice, 3, pp. 125-128.

Hair, J. F., Anderson, R. E., & Black, R. L. T. C. (1995) Multivariate Data Analysis (4th ed). Englewood Cliffs, NJ: Prentice Hall.

Hamilton, L. S., Nussbaum, E. M., Kupermintz, H., Kerkhoven, J. I. M., & Snow, R. S. (1995) Enhancing the validity and usefulness of large-scale educational assessments: NELS:88 science achievement. American Educational Research Journal, 32, pp. 555-581.

Husén, T. (1967) International Study of Achievement in Mathematics, 2 vols. Stockholm: Almquist & Wiksell.

Husén, T., & Postlethwaite, T. N. (1996) A brief history of the International Association for the Evaluation of Educational Achievement (IEA). Assessment in Education: Principles, Policy and Practice, 3, pp. 129-141.

Kellaghan, T. (1996) IEA studies and educational policy. Assessment in Education: Principles, Policy and Practice, 3, pp. 143-160.

Kellaghan, T., & Grisay, A. (1995) International comparisons of student achievement: Problems and prospects. In Measuring What Students Learn, pp. 41-61. Paris: Organization for Economic Co-Operation and Development.

Lambin, R. (1995) What can planners expect from international quantitative studies?, in W. Bos & R. H. Lehman (Eds) Reflections on Educational Achievement. Papers in honour of T. Neville Postlethwaite, pp. 169-182. New York: Waxman.


Lapointe, A. E., Askew, J. M., & Mead, N. A. (1992) Learning Science. Princeton, NJ: Educational Testing Service.

Lapointe, A. E., Mead, N. A. & Phillips, G. W. (1989) A World of Differences: An International Assessment of Mathematics and Science. Princeton, NJ: Educational Testing Service.

Linn, R. L., & Baker, E. L. (1995). What do international assessments imply for world-class standards? Implications of international assessments. Educational Evaluation and Policy Analysis, 17(4), pp. 405-418.

Martin, J. O., Hickey, B. L., & Murchan, D. P. (1992) The second international assessment of educational progress: mathematics and science findings in Ireland. Irish Journal of Education, 26, pp. 3-146.

Martin, M. O., & Kelly, D. L. (1996) Third International Mathematics and Science Study. Technical report. Volume 1: Design and Development (Chestnut Hill, MA, Center for the Study of Testing, Evaluation, and Educational Policy, Boston College).

Murphy, P. (1996) The IEA assessment of science achievement, Assessment in Education: Principles, Policy & Practice, 3, pp. 213-232.

Nuttall, D. (1994) Choosing indicators, In Making Education Count. Developing and using international indicators, pp. 79-96 (Paris, Organisation for Economic Co-operation and Development).

O'Leary, M. (1999) The validity and consistency of findings from international comparative studies of student achievement: A comparison of outcomes from IAEP2 and TIMSS. (Unpublished doctoral dissertation, Boston College, Massachusetts).

Plomp, T. (1992) Conceptualizing a comparative educational research framework. Prospects, 22, pp. 278-288.

Winer, B. J., Brown, D. R., & Michels, K. M. (1991) Statistical Principles in Experimental Design. New York: McGraw Hill.

Wolf, F. M. (1986) Meta Analysis: Quantitative Methods for Research Synthesis. Newbury Park, CA: SAGE.

About the Authors

Michael O'Leary
St. Patrick's College
Drumcondra, Dublin 9, Ireland
Telephone +353-1-8842000
Fax +353-1-8376197

Michael O'Leary is a member of the Education Department at St. Patrick's College, Dublin, Ireland. He holds a Ph.D. from Boston College in the area of educational research, measurement and evaluation. He has served as Ireland's representative on the Board of Participating Countries for the OECD's Programme for International Student Assessment (PISA) project.

Thomas Kellaghan

Thomas Kellaghan is Director of the Educational Research Centre at St. Patrick's College, Dublin. He is a member of Academia Europaea and a Fellow of the International Academy of Education. He is currently serving as President of the International Association for Educational Assessment.

George Madaus

George Madaus is Boisi Professor of Education and Public Policy at Boston College. He has served as Director of the National Commission on Testing and Public Policy, as Vice President of AERA Division D, and as President of NCME. He is a member of the National Academy of Education and a Senior Research Associate of the National Board on Educational Testing and Public Policy.

Albert Beaton

Albert Beaton is a professor in Boston College's Graduate School of Education. He was international study director of the Third International Mathematics and Science Study (TIMSS) and is a former director of design, research, and data analysis for the National Assessment of Educational Progress (NAEP). He is a member of the International Academy of Education and an honorary member of the International Association for the Evaluation of Educational Achievement (IEA).

Appendix A
Average Science Scale Scores for 13-year-olds in IAEP2

Country        n   Weighted n   Scale Score    se    sd
Can         4980       182312           534   1.5    61
Eng          929       504590           533   3.9    71
Fra         1787       672764           531   2.5    69
Hun         1623       149647           552   2.3    72
Ire         1657        63791           509   2.5    72
Kor         1635       671867           570   2.3    68
Por         1520       149228           504   3.8    72
Sco         1584        55398           529   2.8    69
Slo         1598        26640           536   2.2    65
Spa         1609       440322           525   2.3    61
Swi         3653        52726           553   3.4    63
US          1404      3028386           523   4.4    68

Source: International Assessment of Educational Progress (IAEP2), 1991-1992.

Appendix B
Average Science Scale Scores at Grade 7 in TIMSS

Country        n   Weighted n   Scale Score    se    sd
Can         8219       377731           499   2.3    90
Eng         1803       465457           512   3.5   101
Fra         3016       860657           451   2.6    74
Hun         3066       118727           518   3.2    91
Ire         3127        68477           495   3.5    91
Kor         2907       798409           535   2.1    92
Por         3362       146882           428   2.1    71
Sco         2913        62917           468   3.8    94
Slo         3600        28049           530   2.4    86
Spa         3741       549032           477   2.1    80
Swi         4085        66681           484   2.5    82
US          3886      3156847           508   5.5   105

Source: IEA's Third International Mathematics and Science Study (TIMSS), 1994-1995.

Copyright 2000 by the Education Policy Analysis Archives

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, College of Education, Arizona State University, Tempe, AZ 85287-0211 (602-965-9644). The Commentary Editor is Casey D. Cobb.

EPAA Editorial Board

Michael W. Apple, University of Wisconsin
Greg Camilli, Rutgers University
John Covaleskie, Northern Michigan University
Alan Davis, University of Colorado, Denver
Sherman Dorn, University of South Florida
Mark E. Fetler, California Commission on Teacher Credentialing


Richard Garlikov
Thomas F. Green, Syracuse University
Alison I. Griffith, York University
Arlen Gullickson, Western Michigan University
Ernest R. House, University of Colorado
Aimee Howley, Ohio University
Craig B. Howley, Appalachia Educational Laboratory
William Hunter, University of Calgary
Daniel Kallós, Umeå University
Benjamin Levin, University of Manitoba
Thomas Mauhs-Pugh, Green Mountain College
Dewayne Matthews, Western Interstate Commission for Higher Education
William McInerney, Purdue University
Mary McKeown-Moak, MGT of America (Austin, TX)
Les McLean, University of Toronto
Susan Bobbitt Nolen, University of Washington
Anne L. Pemberton
Hugh G. Petrie, SUNY Buffalo
Richard C. Richardson, New York University
Anthony G. Rud Jr., Purdue University
Dennis Sayers, Ann Leavenworth Center for Accelerated Learning
Jay D. Scribner, University of Texas at Austin
Michael Scriven
Robert E. Stake, University of Illinois-UC
Robert Stonehill, U.S. Department of Education
David D. Williams, Brigham Young University

EPAA Spanish Language Editorial Board

Associate Editor for Spanish Language
Roberto Rodríguez Gómez, Universidad Nacional Autónoma de México

Adrián Acosta (México), Universidad de
J. Félix Angulo Rasco (Spain), Universidad de
Teresa Bracho (México), Centro de Investigación y Docencia Económica-CIDE
Alejandro Canales (México), Universidad Nacional Autónoma
Ursula Casanova (U.S.A.), Arizona State
José Contreras Domingo, Universitat de Barcelona
Erwin Epstein (U.S.A.), Loyola University of
Josué González (U.S.A.), Arizona State
Rollin Kent (México), Departamento de Investigación Educativa-DIE/
María Beatriz Luce (Brazil), Universidad Federal de Rio Grande do Sul-UFRGS, lucemb@orion.ufrgs.br
Javier Mendoza Rojas (México), Universidad Nacional Autónoma de México, javiermr@servidor.unam.mx
Marcela Mollis (Argentina), Universidad de Buenos
Humberto Muñoz García (México), Universidad Nacional Autónoma de México, humberto@servidor.unam.mx
Angel Ignacio Pérez Gómez (Spain), Universidad de
Daniel Schugurensky (Argentina-Canadá), OISE/UT,
Simon Schwartzman (Brazil), Fundação Instituto Brasileiro e Geografia e Estatística
Jurjo Torres Santomé (Spain), Universidad de A
Carlos Alberto Torres (U.S.A.), University of California, Los