E11-00231
Educational policy analysis archives. Vol. 9, no. 34 (September 14, 2001).
Tempe, Ariz.: Arizona State University; Tampa, Fla.: University of South Florida.
September 14, 2001
Predicting variations in mathematics performance in four countries using TIMSS / Daniel Koretz, Daniel McCaffrey, [and] Thomas Sullivan.
Education Policy Analysis Archives (EPAA)
Education Policy Analysis Archives
Volume 9 Number 34, September 14, 2001, ISSN 1068-2341

A peer-reviewed scholarly journal
Editor: Gene V Glass, College of Education, Arizona State University

Copyright 2001, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold.

Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Predicting Variations in Mathematics Performance in Four Countries Using TIMSS

Daniel Koretz, Harvard University
Daniel McCaffrey, RAND Education
Thomas Sullivan, RAND Education

Citation: Koretz, D., McCaffrey, D., and Sullivan, T. (2001, September 14). Predicting variations in mathematics performance in four countries using TIMSS. Education Policy Analysis Archives, 9(34). Retrieved [date] from http://epaa.asu.edu/epaa/v9n34/.

Abstract

Although international comparisons of average student performance are a staple of U.S. educational debate, little attention has been paid to cross-national differences in the variability of performance. It is often assumed that the performance of U.S. students is unusually variable or that the distribution of U.S. scores is left-skewed, that is, that it has an
unusually long 'tail' of low-scoring students, but data from international studies are rarely brought to bear on these questions. This study used data from the Third International Mathematics and Science Study (TIMSS) to compare the variability of performance in the U.S., Australia, France, Germany, Hong Kong, Korea, and Japan; investigate how this performance variation is distributed within and between classrooms; and explore how well background variables predict performance at both levels. TIMSS shows that the U.S. is not anomalous in terms of the amount, distribution, or prediction of performance variation. Nonetheless, some striking differences appear between countries that are potentially important for both research and policy. In the U.S., Germany, Hong Kong, and Australia, between 42 and 47 percent of score variance was between classrooms. At the other extreme, Japan and Korea both had less than 10 percent of score variance between classrooms. Two-level models (student and classroom) were used to explore the prediction of performance by social background variables in four of these countries (the U.S., Hong Kong, France, and Korea). The final models included only a few variables; TIMSS lacked some important background variables, such as income, and other variables were dropped either because of problems revealed by exploratory data analysis or because of a lack of significance in the models. In all four countries, these sparse models predicted most of the between-classroom score variance (from 59 to 94 percent) but very little of the within-classroom variance. Korea was the only country in which the models predicted more than 5 percent of the within-classroom variance in scores. In the U.S. and Hong Kong, the models predicted about one-third of the total score variance, and almost all of this prediction was attributable to between-classroom differences in background variables.
In Korea, only 19 percent of total score variance was predicted by the model, and most of this was attributable to within-classroom variables. Thus, in some instances, countries differ more in terms of the structure and prediction of performance variance than in the simple amount of variance. TIMSS does not provide a clear explanation of these differences, but this paper suggests hypotheses that warrant further investigation.

Introduction

International comparisons of average student performance are widely discussed by policymakers and the press and have had a powerful influence on educational debate and policy in the US. In an era when traditional norm-referenced reporting of student performance ostensibly has gone out of favor, "country norms" have become an increasingly important indicator of the success of US education and the levels of performance to which this country should aspire. The publication of the results of the Third International Mathematics and Science Study (TIMSS) over the past several years (Beaton et al., 1996a, 1996b; Mullis et al., 1997, 1998) has increased further the prominence of international comparisons in the US debate.

Much of the discussion of international comparisons has focused on horse-race
comparisons of means or medians. Although presented in TIMSS reports, information on the variability of student performance has usually been ignored in the US debate or has been used in a lopsided and potentially misleading fashion. Typically, the variability in the US has been considered, while the variability in the countries to which the US is compared has been ignored. For example, earlier this decade, the results of the 1991 International Assessment of Educational Progress (IAEP) were projected onto the National Assessment of Educational Progress (NAEP) scale, permitting comparison of countries participating in IAEP to states participating in the 1992 NAEP Trial State Assessment in mathematics. These comparisons, which have been widely cited, showed that the highest-scoring US states, such as Iowa and North Dakota, had mean scores similar to those of the highest-scoring countries, such as Taiwan and Korea (National Center for Education Statistics, 1996, Figure 25). High-scoring regions in Taiwan and Korea, however, were not compared to the US mean.

Underlying some of these comparisons appears to be an expectation that the variability of student performance is atypically large in the US. Indeed, some observers have made this expectation explicit. For example, Berliner and Biddle, in disparaging the utility of international comparisons of mean performance, wrote:

    The achievement of American schools is a lot more variable than is student achievement from elsewhere…. To put it baldly, America now has some of the finest, highest-achieving schools in the world – and some of the most miserable, threatened, underfunded educational travesties, which would fail by any achievement standard (1995, p. 58, emphasis in the original).
To buttress this assertion, they cited the NCES comparisons of US states and foreign nations noted above, which displayed no information about the variation of performance in other countries and included no information about the variation of performance among schools within any country.

Research Questions

This study was undertaken to explore the variability of performance in the US and several other countries using TIMSS data. Specifically, the study explored two primary questions:

1. How large is the performance variation in our sample countries, and how is this variation distributed between and within classrooms?
2. How well do background variables predict performance variation in our countries, both within and between classrooms?

The results reported here are limited to mathematics in the higher grade in Population 2 (grade 8). We focused on Population 2 rather than Population 1 (elementary grades) because of doubts about the validity and utility of self-report data from elementary school students. (Note 1) Population 3 (end of high school) presented formidable difficulties of sample non-equivalence. The analyses focused on mathematics because the TIMSS sample design selected students based on the mathematics classes they attended rather than the science classes (Foy, Rust, and Shleicher, 1996, p. 4-7). This precluded decomposition of score variation and hierarchical modeling in science.
Methods

To answer these research questions, our analyses proceeded in two steps:

1. We compared the distributions of student-level performance across all the countries in the Population 2 sample.
2. We used a smaller, purposive subsample of countries to analyze the variability in student performance between and within classrooms and to explore the contributions of student background characteristics to both of these sources of variability.

The performance measure used in all analyses was BIMATSCR, the "international mathematics achievement score" (Gonzalez, Smith et al., 1997) used in TIMSS published reports for Population 2. Technically, BIMATSCR is not a score in the traditional sense, but it is labeled a score here for simplicity. TIMSS was designed to provide aggregate estimates but not scores for individual students. In lieu of scores, TIMSS provides for each student five plausible values, which are "random draws from the estimated ability distribution of students with similar item response patterns and background characteristics" (Gonzalez, Smith et al., 1997, p. 5-1). In this respect, TIMSS followed a variant of the procedures NAEP has used since 1984. In the case of Population 2, however, scores were conditioned only on country, gender, and class mean, not on background variables (Gonzalez, 1998). In theory, the variance of repeated estimates using different plausible values should be added to the sampling variance to obtain an estimate of error variance for statistics calculated with plausible values. However, Gonzalez, Smith et al. (1997, p. 5-8) report that the intercorrelations among TIMSS plausible values are so high that this error component can be ignored. It was not calculated for statistics reported in this paper.

The step 1 analyses are purely descriptive and use data available in TIMSS publications (Beaton et al., 1996a and 1996b; Mullis et al., 1997; Martin et al., 1997).
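The rule mentioned above, adding the variance across plausible values to the sampling variance, can be illustrated with a minimal sketch. The function name and the numbers are hypothetical, and the (1 + 1/M) multiplier is an assumption taken from the usual multiple-imputation combining rule; the text states only that the two components should be added, and this component was not calculated for the statistics reported in the paper.

```python
from statistics import variance


def combined_error_variance(pv_estimates, sampling_variance):
    """Add the between-plausible-value variance to the sampling variance.

    pv_estimates: one estimate of the same statistic per plausible value.
    sampling_variance: sampling variance of the statistic (e.g., from a jackknife).
    The (1 + 1/M) factor is an assumption (standard multiple-imputation rule).
    """
    m = len(pv_estimates)
    if m < 2:
        raise ValueError("need at least two plausible-value estimates")
    between_pv = variance(pv_estimates)  # sample variance across the M estimates
    return sampling_variance + (1 + 1 / m) * between_pv


# Hypothetical example: a mean estimated once with each of five plausible values.
estimates = [500.0, 501.0, 499.0, 500.5, 499.5]
total_variance = combined_error_variance(estimates, sampling_variance=16.0)
print(total_variance)
```

When the estimates from different plausible values are nearly identical, as the high intercorrelations reported for TIMSS imply, the second term is small relative to the sampling variance.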
Our initial purposive subsample for the more detailed analyses in step 2 included seven countries: Australia, France, Germany, Japan, Hong Kong, Korea, and the US. Japan and Korea were selected because they are often used as examples of high-performing countries in comparisons with the US. Germany was included because it is often noted in discussions of the competitiveness of the US workforce. Hong Kong was included because it has both parallels with and interesting differences from Japan and Korea. France was included because in eighth-grade mathematics, it showed an unusually small variance of performance. Australia was considered primarily for methodological reasons. Although we present some results for all seven countries, we limited modeling of the predictors of variance to four: the US, France, Hong Kong, and Korea. Students in Japan did not complete the survey items used in the modeling. Response patterns for students in Germany made us suspicious of that country's data. Since Australia was included more for methodological than for substantive reasons, we dropped it from the modeling because of similarities in the preliminary results from Australia and other countries.

In our second-stage analyses we decomposed the variance among students' scores from each of the countries into the variance within classrooms and the variance between classrooms, and in the four primary countries, we explored the predictors of variance at each of these levels. Ideally one would want to decompose the variance into at least three levels: within classrooms, between classrooms within schools, and between
schools. The school and classroom levels of aggregation are not exchangeable. For example, a decision to track students on the basis of ability would increase the variance between classrooms within schools while decreasing the variance within classrooms, but it would not directly affect the variance between schools. Conversely, residential segregation on the basis of social class would increase performance variance between schools, but it could decrease the variance between classrooms within schools by making schools more homogeneous with respect to achievement.

In all countries other than the US, Australia, and Cyprus, however, the TIMSS Population 2 sample consisted of a single classroom per school. Therefore, in most countries, one can only specify a two-level model in which variations in performance between schools and between classrooms within schools are completely confounded. Accordingly, we decomposed the variability in math scores from each of the four countries into within-classroom variability and between-classroom variability. The between-classroom variability includes contributions from both the variation of classrooms within schools and the variation between schools.

To fit these models we sacrificed some of the richness of the US data in order to obtain results comparable to those from all four countries. We did this by creating a subsample of the US sample that consisted of a single classroom per school, randomly selected from the multiple classrooms in the original sample. We modified the sample weights and jackknife replicates used in variance estimation accordingly.

Our step 2 analyses followed the same course in each country and extended from simple exploratory data analysis (EDA) to hierarchical modeling.
Extensive EDA was used to explore individual-level and classroom-level variations in performance and background variables, to determine whether background variables showed sufficient variability to be usable in analysis, to determine whether the relationships between background variables and performance appeared sensible, and to decide whether and how to categorize variables. The patterns uncovered by this EDA substantially constrained our analyses in several instances.

Simple bivariate relationships between performance and background variables were examined for all of the variables considered for the hierarchical models. When necessary, variables were recoded so that a positive relationship with scores would be expressed as a positive correlation. The bivariate analyses were carried out three ways because of the inherently hierarchical nature of the data: (1) student-level uncentered (i.e., simple student-level analyses without regard to classrooms); (2) student-level, centered on classroom means (corresponding to the within-classroom component of variance); and (3) classroom-level (corresponding to the between-classroom component of variance).

Hierarchical modeling using multiple background variables followed bivariate analyses. The models include the classroom mean for each background variable and the individual student-level values, centered on classroom means. With centering, the coefficients produced by the model separately measure each variable's contribution to both the between- and within-classroom variability.

TIMSS used a complex sampling plan with unequal probability of selection among schools from each country's sample. To account for this disproportionate sampling, all analyses reported here are weighted unless noted.
Weighted analyses produce consistent estimates of model parameters even if the sample design is disproportionate or, more technically, nonignorable (see, e.g., Pfefferman, 1996, for discussion of the use of weights in model fitting). We used the methods of Pfefferman et al. (1998) to fit our weighted hierarchical models using specially written SAS macros. (For the macro and more detail on methods, see Koretz, McCaffrey, and Sullivan, 2000.)
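The two ideas at the core of the step 2 analyses, splitting score variance into within- and between-classroom parts and centering student-level values on classroom means, can be illustrated with a small unweighted sketch. It is only an illustration under simplifying assumptions: it uses a plain sum-of-squares partition on made-up data, whereas the analyses reported here were weighted and fitted with SAS macros. The function names and the toy scores are hypothetical.

```python
from statistics import fmean


def variance_shares(scores_by_class):
    """Return (between_share, within_share) of the total sum of squares.

    scores_by_class maps a classroom id to the list of its students' scores.
    Unweighted sum-of-squares partition; an assumption made for illustration.
    """
    all_scores = [s for scores in scores_by_class.values() for s in scores]
    grand_mean = fmean(all_scores)
    between_ss = sum(
        len(scores) * (fmean(scores) - grand_mean) ** 2
        for scores in scores_by_class.values()
    )
    within_ss = sum(
        (s - fmean(scores)) ** 2
        for scores in scores_by_class.values()
        for s in scores
    )
    total_ss = between_ss + within_ss
    return between_ss / total_ss, within_ss / total_ss


def center_on_class_means(values_by_class):
    """Return, per classroom, (class mean, student values minus the class mean)."""
    centered = {}
    for class_id, values in values_by_class.items():
        class_mean = fmean(values)
        centered[class_id] = (class_mean, [v - class_mean for v in values])
    return centered


# Hypothetical toy data: two classrooms of two students each.
toy_scores = {"A": [400.0, 500.0], "B": [500.0, 600.0]}
between_share, within_share = variance_shares(toy_scores)
centered_scores = center_on_class_means(toy_scores)
print(between_share, within_share)
print(centered_scores["A"])
```

In a model of this kind, the classroom mean carries the between-classroom information and the centered values carry the within-classroom information, which is why the two sets of coefficients can be read separately.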
Distributions of Student-Level Performance in TIMSS

Basic information about the size of the performance variation in participating countries, analyzed at the level of students without regard to aggregation, is provided in TIMSS publications. Appendices to the reports provide standard deviations and selected percentiles (5th, 25th, 50th, 75th, and 95th) of the performance distributions (Beaton et al., 1996a and 1996b, Appendix E; Mullis et al., 1997, Appendix C; Martin et al., 1997, Appendix C).

At the level of individual students, the eighth-grade mathematics performance of US students was near the median of the 31 countries that met the TIMSS sampling requirements for the eighth grade (see Beaton et al., 1996a, Tables 2.1 and E.3). The country-level standard deviations varied greatly, from 58 to 110, but half were clustered in the narrow range from 84 to 92. The median standard deviation across the 31 countries was 88. The standard deviation of the US sample was 91, only slightly above the international median.

Among these 31 countries, the country-level standard deviation of eighth-grade mathematics performance was strongly predicted by country means: the higher the mean, the larger the standard deviation (r = .71; see Figure 1). Seen this way, the standard deviation of mathematics performance in the US was about nine percent higher than the value that would be predicted from the US mean. Numerous other countries, however, had standard deviations that deviated comparably from those predicted by their means. For example, clustered tightly around the US in Figure 1 are England and New Zealand, and Germany would be as well if it were included in Figure 1. Germany does not appear in Figure 1 because it did not meet all sampling requirements. (In eighth-grade science, the standard deviation in the US was indeed one of the largest, but it is not an outlier; see Koretz, McCaffrey, and Sullivan, 2000.)

Figure 1.
Plot of Mathematics Standard Deviation by Mathematics Mean, Grade 8, 31 Countries Meeting Sampling Requirements (based on Beaton et al., 1996a)

Figure 1 rebuts the common notion that high-scoring Asian countries have a more equitable (i.e., narrower) dispersion of performance, at least in eighth-grade
mathematics. All three of the Asian countries in our sample have larger standard deviations than does the US: Hong Kong's and Japan's standard deviations are roughly 10% larger than that in the US, and Korea's is approximately 20% larger. Among our sample of seven countries, only France has an unusually small standard deviation of eighth-grade mathematics performance, either in absolute terms or relative to its mean.

In grade 8 mathematics, TIMSS also calls into question the view that the US mean is pulled downward by a distribution with an unusually long left-hand (low-scoring) tail. As shown in Figure 2, the US distribution shows a slight right-hand skew rather than a left-hand skew. The US mean is not pulled downward because of a small number of low-scoring students. Figure 2 compares the US distribution to the data from Korea. The Korean distribution is substantially wider, as its larger standard deviation indicates. The right-hand tails of the distributions in the two countries are nearly parallel. The left-hand side of the distribution is much shorter in the US, however, pulling the US tail closer to the Korean tail. (Note 2)

Figure 2. Distributions of Mathematics Scores, Grade 8, Korea and US. This plot is unweighted. Weighting has virtually no effect on the distribution of scores in Korea and only a trivial effect on the distribution in the US.

Simple Decomposition of Performance Variance in Four Countries

The previous discussion demonstrates that the overall distribution of student-level performance in the US is not anomalous. However, looking only at the overall variability might miss important differences between performance in the US population compared to that of other countries. For example, the extent to which the variability is clustered, e.g., within classrooms or schools, might vary across countries.
In addition, the possible sources of the variance might also differ across countries, which would suggest different interpretations of the variability of performance and different policy responses to low mean performance in the US. We used data from all seven countries to determine the clustering of variability within and between classrooms. As noted above,
we focus on the classroom rather than the school because the TIMSS sample makes it impossible to distinguish clustering within schools from clustering within classrooms.

The decomposition of mathematics score variance into within- and between-classroom components is sufficient to reveal striking differences among the seven countries in our sample. In the US, Hong Kong, Germany, and Australia, a bit over half of the total variance in eighth-grade mathematics scores lies within classrooms (Table 1). In contrast, in Japan and Korea, over 90 percent of the variance lies within classrooms. France is intermediate, with about three-fourths of the total variance lying within classrooms.

Table 1
Percent of Score Variance Within and Between Classrooms

Country      Percent Between   Percent Within
Australia    47%               53%
France       27                73
Germany      45                55
Hong Kong    46                54
Japan         8                92
Korea         6                94
US           42                58

Similarities among some countries in this decomposition of variance, however, might mask important differences that would become apparent if TIMSS made it possible to distinguish between-school from between-classroom variance. For example, Schmidt, Wolfe, and Kifer (1993) partitioned the variance of eighth-grade mathematics scores in six countries using data from the Second International Mathematics Study, which had two classrooms per school in a number of countries. They found striking differences among countries in the partitioning of aggregate variance. In France, for example, they found that two-thirds of the aggregate variance lay between schools, while in the US, only 9 percent of the aggregate variance lay between schools (with the remainder lying between classrooms within schools).

The average classrooms in our sample of seven countries differ strikingly in their heterogeneity of performance, with the US showing relatively little variability within classrooms.
The heterogeneity of performance within classrooms depends on both the total variance of performance in each nation and the breakdown of this variance into within- and between-classroom components. Japan and Korea have slightly larger national standard deviations than the US in Population 2 mathematics and also have a much larger share of their total variance lying within classrooms than does the US. Therefore, the typical within-classroom standard deviation in mathematics is considerably larger in Japan (96) and Korea (102) than in the US (74). (See Table 2.) The average classrooms in France, Germany, Hong Kong, and Australia are more similar to that in the US in heterogeneity.

Table 2
Within-Classroom Standard Deviations

Country      Standard Deviation
Australia     83
France        63
Germany       64
Hong Kong     73
Japan         96
Korea        102
US            74

Multilevel Models of Performance Variation

As noted, we used data from four countries, the US, France, Hong Kong, and Korea, to explore the relationships between performance variation and background variables.

Based on research showing which background characteristics predict student performance in the US, we chose to examine parental education, other measures of socioeconomic status and family composition, measures of academic press in the family and community, and a few measures of student attitudes. We also examined the effect of student age, which could predict performance in at least two ways. Through maturational effects, older students might be expected to perform better than others do. On the other hand, to the extent that students who do poorly in school are held back in grade, older students in a given grade might be expected to perform more poorly than others, particularly in the higher grades. Variations in age at entry could also affect later scores in several ways.

We did not examine curricular variables. As measured, these will not predict variation within classrooms, and research in the US has generally shown variations in schooling to be less powerful predictors of performance than background factors. However, curricular differences may be important predictors of performance variation between classrooms within schools (for example, when students are tracked by ability) and between schools (when schools differ substantially in curriculum). Moreover, important curricular variables are likely to be correlated with background variables. Thus, the results we report here should not be interpreted as clear effects of background variables.
Rather, they are likely joint effects of the measured background factors, educational factors collinear with them, and other omitted variables correlated with the measured variables.

Selecting Variables for Inclusion

As noted, exploratory data analysis revealed limitations in some variables that constrained their use in formal models. The few examples presented here illustrate that EDA has particular importance in comparative, international studies because variables may behave differently in different countries.

Although TIMSS includes numerous attitude and press variables, we focused on a set of 15 Likert variables that asked students how strongly they disagreed or agreed with statements that the student's mother, the student's friends, and the student herself considered it important to do well in mathematics, do well in the language of the test, do
well in sports, be in a high-achieving class, and have time to have fun. EDA showed these press and attitude variables to be problematic in several respects. In some instances, responses showed little variation. Some relationships with scores were not what one would anticipate if the variables were measuring the intended constructs. In several instances, data showed suggestions of response bias.

For example, several problems can be seen in the responses of eighth-grade students to the BSBMMIP2 press for achievement variable, "My mother thinks it is important for me to do well in mathematics at school" (Figure 3). Each of the six panels arrayed across Figure 3 represents the results from a different country. In the figure we include the four countries in our analysis sample as well as Australia and Germany; this item was not administered in Japan. The common vertical axis, labeled BIMATSCR, is the final TIMSS mathematics score. The four categories of responses to the survey question are arrayed on the X-axis of each panel: SD = strongly disagree, D = disagree, A = agree, and SA = strongly agree. The vertical position of each plotted circle indicates the mean score of the students in that country who gave that particular response to the background question. The radius of each circle is proportional to the percent of students within each country who provided that particular response. The range of sizes is constrained to make the graphic intelligible, however, and in the case of variables with extreme differences in cell counts, including some cells in Figure 3, the relative sizes of the circles understate the actual differences in cell counts.

Figure 3.
Mathematics Scores and Responses to BSBMMIP2 Press Variable

In all of the six countries other than Germany, the relationship between scores and responses to the "My mother thinks it is important for me to do well in mathematics at school" variable was in the anticipated direction: the more strongly students agreed with this statement, the higher their average scores. In most countries, however, this relationship stemmed in large measure from very small groups of students who "disagree" or "strongly disagree" with this statement, and the groups that included most students showed only weak relationships. In the US, for example, 97 percent of all students are in the "strongly agree" and "agree" categories, the mean mathematics scores
of which differed by only 10 points. The "disagree" and "strongly disagree" categories had markedly different score means but contained only 2 and 1 percent of students, respectively. This variable is likely to have relatively little utility in predicting score variability in the sampled countries, even if maternal press for achievement is an important influence.

The data from Germany in Figure 3 show an unusual pattern and demonstrate the value of EDA. The relationship between this press variable and scores is not monotonically positive in Germany; the strongly agree and strongly disagree groups had approximately the same mean scores. This pattern, which appeared repeatedly across the TIMSS press and attitude variables in the German data, calls the validity of the responses into question. Because of patterns such as these and the less than optimal sampling in Germany, we did not model the relationships between background variables and scores in Germany. The extremely strong positive relationship in Korea, which also appeared repeatedly, was also grounds for concern. For example, the very strong positive relationship appearing in Korea extended to "I think it is important to be placed in the high achieving class," even though eighth-grade classes are not tracked by achievement in Korea (Hyung Im, 1998). However, the response patterns in Korea to the variables we used in modeling were not sufficiently suspect in our judgment to warrant excluding Korea from modeling.

The relationships between some other press variables and student performance varied markedly, sometimes dramatically, among countries. These differences among countries could have several causes. There might be response biases, either consistent or item-specific, that vary among countries. Translation problems could engender misleading response differences.
There might be substantive reasons for these differences as well; for example, press variables might in fact have stronger relationships with student performance in some countries than in others, perhaps because of differences in the correlations between press variables and school characteristics or between press variables and ethnicity.

TIMSS also includes press variables that one would expect to show weak or even negative relationships with scores. One set, for example, asks students how strongly they agree with the statements that mother, friends, and the student herself think it is important to have time to have fun. One might expect that students who think it particularly important to save time for fun might be less willing to put long hours into study and would therefore score lower. Two of the strongest positive predictors of mean scores from this set of variables, however, are the strength of agreement with the statements "I think it is important to have time to have fun" (BSBGSIP4) and "My friends think it is important for me to have time to have fun" (BSBGFIP4).

In response to these findings, we used only two of these 15 press variables in our models: the strength with which the student agreed that the mother and the student herself consider it important to do well in mathematics. We pooled these two variables for each subject, creating a single "press for mathematics" variable from the students' responses pertaining to themselves and their mothers. These composites were the mean of the two variables for the subject when both were present and whichever was present when one was missing. The decision to pool these two variables, which is consistent with the logic of Likert scales, was made because the two press variables taken individually had only insubstantial relationships with scores, while the composite showed stronger relationships with scores.
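The pooling rule just described (the mean of the two responses when both are present, otherwise whichever one is present) can be written as a short sketch. The function name and the numeric 1 to 4 coding of the Likert responses are assumptions made for illustration, and the case in which both responses are missing, which the text does not address, is returned as missing here.

```python
from typing import Optional


def press_composite(mother: Optional[float], student: Optional[float]) -> Optional[float]:
    """Pool two Likert responses into one "press for mathematics" value.

    Each argument is a numeric response (assumed coding: 1 = strongly disagree
    through 4 = strongly agree) or None when the response is missing.
    """
    present = [value for value in (mother, student) if value is not None]
    if not present:
        return None  # assumption: no composite when both responses are missing
    return sum(present) / len(present)


# Hypothetical responses for three students.
print(press_composite(4, 3))       # both present: their mean
print(press_composite(None, 2))    # one missing: the one that is present
print(press_composite(None, None)) # both missing
```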
We also examined the quality of data for 10 student and family background variables: whether the student was born in the country of testing; mother's and father's educational attainment; number of people in the home; whether the father, mother, and any grandparents lived with the student; how many books were in the home; and whether the home had a study desk and a computer. Fewer problems appeared with background variables than with press and attitude variables. Missing data and "I don't know" responses, however, posed serious difficulties, particularly in France.

In all of our countries, responses to the questions about parents' educational attainment were missing for a substantial percentage of students. This problem was particularly severe in France (where 17 percent were missing for fathers and 16 percent for mothers). More important, of the students who responded to these questions, many answered "I don't know." This problem was particularly severe in France, where 34 percent responded "I don't know," so that a total of 50 percent of respondents provided no informative answer (Table 3). Efforts to impute values were unsuccessful. Thus, we had to choose between omitting parental educational attainment from models in France in order to use most of the sample, or including parental education and using a substantially reduced sample. We opted to include mother's education at the cost of using a reduced sample. Comparisons of preliminary models indicated that the choice between these options probably affected parameter estimates but did not have a major effect on the overall prediction of score variance.
Although our interpretation focuses on the latter, any interpretation of the results from France should be taken with caution because of this limitation of the data.

Table 3
Percent of Students with No Response or Response of "I Don't Know" to Question About Mother's Educational Attainment

              Missing   I don't know   Total
Australia        4%         15%         18%
France          13          34          47
Germany          9          21          30
Hong Kong        5           9          14
Korea            0           9           9
USA              3           7          11

Two variables, mother's educational attainment and number of books in the home, illustrate another issue that can arise in comparative studies: it may be desirable or necessary to treat variables differently in different countries. Both variables showed substantial but not always monotonic positive relationships with achievement. For example, the mean mathematics scores of students whose mothers were in the "finished secondary" and "some vocational" categories were not in the same order in all countries. We combined these categories in all countries except Hong Kong. In Hong Kong, small samples and a different pattern of means suggested collapsing the "some vocational" category of maternal education with "finished university." Similarly, in France only, the mean score of the students reporting the largest number of books was lower than that of the category below, so we collapsed those two categories into a single category for the French model. We did not collapse these groups for other countries. The variables used in the final modeling are noted in Appendix A.
Specifying Multilevel Models

The multilevel models reported here are simple "fixed coefficients" models (Kreft and DeLeeuw, 1998). That is, the coefficients estimating the level-one relationships between background factors and achievement (student-level relationships within classrooms) are held constant across classrooms within countries. Between-classroom effects were thus limited to differences in intercepts. In general form, this model is: ... where the subscript i indicates individuals, j indicates classrooms, an underscore indicates a vector, and a bar over a variable indicates a mean. That is, a student's score reflects a vector of background variables weighted by a vector of regression coefficients, a vector of classroom means of those same background characteristics weighted by a second vector of coefficients, and random error. The coefficients applied to individual characteristics are unaffected by classroom characteristics. (That is, there are no cross-level interactions.) Equivalently, this can be expressed in terms of two levels as follows: In other words, the intercept in each classroom is the sum of the overall intercept and the sums of the classroom aggregate variables weighted by the classroom-level regression coefficients, plus error. The score of each individual student is then the sum of that student's classroom intercept and the sum of the student-level background variables weighted by the student-level regression coefficients, plus error. Preliminary analysis indicated that little would be gained by allowing the within-classroom slopes to vary randomly or by modeling their variation.

These models center observations around classroom means. Without group-mean centering, the predictor variance within and between classrooms would be confounded; centering eliminates this confounding.
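The verbal description of the model can be written out as a hedged sketch. The equations below are a reconstruction from the prose, not a copy of the authors' displayed formulas: the symbols b and c follow the labels used in Table 4 for the within-class and between-class coefficients, while a (overall intercept), \beta_{0j} (classroom intercept), u_j (classroom-level error) and e_{ij} (student-level error) are assumed notation. Underlines mark vectors and bars mark classroom means, as in the text.

```latex
% Single-equation form, with student predictors centered on classroom means
y_{ij} = a
       + \underline{b}'\,(\underline{x}_{ij} - \underline{\bar{x}}_{j})
       + \underline{c}'\,\underline{\bar{x}}_{j}
       + u_{j} + e_{ij}

% Equivalent two-level form
% Level 1 (students within classrooms):
y_{ij} = \beta_{0j}
       + \underline{b}'\,(\underline{x}_{ij} - \underline{\bar{x}}_{j})
       + e_{ij}
% Level 2 (classrooms):
\beta_{0j} = a + \underline{c}'\,\underline{\bar{x}}_{j} + u_{j}
```

Substituting the level-2 equation into the level-1 equation recovers the single-equation form. Because b carries no j subscript, the within-classroom slopes are the same in every classroom, which is what "fixed coefficients" and the absence of cross-level interactions imply.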
Centering also makes the model's coefficients straightforward estimates of the within-classroom and between-classroom effects (e.g., Bryk and Raudenbush, 1992).

We began with the assumption that all variables that survived screening by EDA would be included in the models. Including some that survived the EDA, however, resulted in numerous small and statistically non-significant parameter estimates. We therefore constructed models based on what could be called a 'judgmental stepwise' procedure, in which we began with a null model (i.e., a model including nothing but an intercept), built up to a more complex model, and then pared back to a more parsimonious model based on the size and significance of coefficients. (Note 3) In general, we opted to include variables that were only marginally significant or that failed to reach significance by a modest amount, leaving it to the reader to discount them, provided that their inclusion did not markedly change the coefficients of other variables. In addition, because our classroom-level variables are aggregates of student-level variables, we included at both levels any variable that was significant at either level.

The statistics normally reported from hierarchical models (intercepts and regression coefficients at each level of aggregation) are sufficient for predicting means but not for comparing variance of performance across countries. For example, at the classroom level, the estimated effect of the proportion of students living with their fathers indicates how much, on average, the classroom mean score would increase if the proportion increased from 0 to 1, but it does not indicate how much of the variability among classroom mean scores is attributable to this factor. Therefore, we also present a summary of the variance accounted for by the predictors at each level, expressed as the absolute value of the predicted variance, the percentage of variance predicted within level, and the percentage of total variance predicted.

Decomposing Performance Variation

We first give a detailed discussion of the models for the US. This discussion serves as a template for evaluating the results of the other models. We then compare the results from the four countries.

The final two-level model of mathematics scores in the US contained only five variables at each level: the number of books in the home, the presence of a computer in the home, the presence of the father in the home, the academic press variable, and student age. The square of age was included because of nonlinearities in the relationships between age and scores that became apparent in the exploratory data analysis. Each of these variables was at least marginally significant at one of the two levels.

The importance of these predictors can be evaluated several ways. One can look at the significance and impact of the individual coefficients within each level, the relative significance or impact of the coefficients across levels, and the total predictive power of the coefficients at each level.
These three views are each described in turn.

Within classrooms in the US, the strongest effects were those of the number of books, the academic press variable, and students' age (Table 4). The effects of having a computer and the father living at home were both smaller and non-significant. Comparison of these parameter estimates, however, is clouded by their imprecision. Confidence bands around most of these estimates were wide (see Appendix B).

Table 4
Two-Level Models of Mathematics Scores

Variable                  United States   France     Hong Kong   Korea
Intercept                    -351.7        592.6      -424.8       27.9
Within class (b)
  Number books                  7.9**                    0.3       20.2**
  Computer present              4.4                     -3.8       10.9**
  Father present                1.7          8.9*       -7.4
  Mother's education                         4.6
  Father's education                                                9.5**
  Press                         9.6**        8.6*       10.3**     36.2**
  Age                         -14.4**      -18.2**                 -6.0
  Age squared                  -6.9         -0.6                  -14.8**
  Born in country                                      -19.1**
Between class (c)
  M Books                      45.5**                   44.1**     16.2*
  M Computer                   37.2*                    89.8*      44.5**
  M Father present             90.3**       59.5**     326.9**
  M Mother's education                      26.4**
  M Father's education                                             18.8**
  M Press                      43.2**       45.0**     174.5**     47.4**
  M Age                        33.9        -23.0*                  20.5
  M Age squared              -149.4        -23.3                  -26.2*
  M Born in country                                    -44.7
Residual variances
  σ² (within)                4570.4       4040.8      5485.0     9290.6
  τ (between)                 766.2        554.7      1406.2       48.0

NOTE. All estimates of significance reflect jackknifed estimates: * p < .05; ** p < .01

The effects of these estimates can be compared to the distribution of scores to provide a concrete estimate of their size. For example, in the US, the estimated student-level effect of the number of books was 7.9. This variable had five categories. The model predicts that, holding constant the other variables, the mean difference between students in the lowest and highest categories would be 32 points, roughly one-third of the standard deviation of mathematics scores, which was 89 points in this subsample. The press coefficient was larger, but most students were concentrated within two categories of either of the press variables, and the effect of being in the higher of these two categories, relative to the lower of them, was only about one-tenth of a standard deviation. The age coefficient was significant and negative, suggesting that either retention or late entry of slower learners has a larger impact than maturational effects.

At first glance, the estimated effects at the between-classroom level (preceded by an "M," for "mean," in all tables) appear much larger than the coefficients at the within-classroom level. However, the standard errors of the estimated between-class coefficients are generally large, and the t statistics of the between-class coefficients are on average only modestly larger than those of the within-class estimates.
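The conversion of the books coefficient into standard-deviation units can be retraced with a short Python sketch. It assumes, as the 32-point figure implies, that the five categories are coded as consecutive integers so that the lowest and highest categories are four steps apart; the function and variable names are illustrative only.

```python
def span_effect(coefficient: float, n_categories: int) -> float:
    """Predicted difference between the lowest and highest category,
    assuming categories are coded as consecutive integers."""
    return coefficient * (n_categories - 1)


BOOKS_COEF_US = 7.9   # within-class coefficient for number of books (Table 4)
BOOKS_CATEGORIES = 5  # number of response categories reported in the text
SCORE_SD_US = 89.0    # standard deviation of scores in the US subsample

difference = span_effect(BOOKS_COEF_US, BOOKS_CATEGORIES)
in_sd_units = difference / SCORE_SD_US

print(f"lowest-to-highest difference: {difference:.1f} points")
print(f"as a fraction of one SD:      {in_sd_units:.2f}")
```

The same two-step conversion (coefficient times the plausible range of the predictor, divided by the score standard deviation) is what makes coefficients on differently scaled predictors comparable.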
Nonetheless, in the US, there are some striking differences between the within- and between-class estimates. The presence of the father in the home had a non-significant and near-zero relationship to scores within classrooms, but the percentage of fathers in the home showed a substantial relationship to classroom mean scores. On average, the estimated within-classroom effect of having the father present was less than 2 points, roughly 2 percent of a standard deviation. Classrooms in our grade 8 mathematics model sample ranged from 15 to 100 percent of fathers present. Holding other variables constant, going from one standard deviation below the mean to one standard deviation above on the scale of proportion of fathers present (from .50 to .82) would predict an increase in mean scores of about one-third of a standard deviation.

The difference in predictive power at the within- and between-classroom levels in the US becomes clearer if one compares the variance accounted for by variables at each level. In this model, 59 percent of the total variance in scores in the US was within classrooms, while the remaining 41 percent was between classrooms (Table 5). The five variables in the model predicted about 77 percent of the between-classroom variance but only 4 percent of the within-classroom variance. The predicted between-classroom variance was 2,532, while the predicted within-classroom variance was only 198. Thus, the five between-classroom variables accounted for 31 percent of the total variance of mathematics scores [2532/(3299+4769)], while the five within-classroom variables accounted for only 2 percent of the total variance.

Table 5
Total and Predicted Variance in Mathematics Scores at Each Level

                                   United States      France           Hong Kong        Korea
Share of variance                  Between  Within    Between  Within  Between  Within  Between  Within
Total at level                       3299    4769       1356    4232     4543    5557      799   10722
Percent at level                       41      59         24      76       45      55        7      93
Predicted by variables at level      2532     198        801     191     3137      73      751    1431
Percent at level predicted by
  variables at level                   77       4         59       5       69       1       94      13
Percent of total predicted by
  variables at level                   31       2         19       3       31       1        7      12

One surprising finding in the multilevel model for the US was the lack of importance of mother's and father's education, which are generally considered to be among the strongest predictors of student performance in the US.
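The US percentages quoted from Table 5 follow directly from the four variance components. The Python sketch below recomputes them from the reported totals; the variable names and the helper `pct` are illustrative, and the only inputs are the four US figures given in the text and table.

```python
def pct(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# US variance components reported in Table 5.
between_total, within_total = 3299.0, 4769.0
between_predicted, within_predicted = 2532.0, 198.0
total = between_total + within_total

rows = [
    ("share of total variance between classrooms", pct(between_total, total)),
    ("share of total variance within classrooms", pct(within_total, total)),
    ("between-classroom variance predicted", pct(between_predicted, between_total)),
    ("within-classroom variance predicted", pct(within_predicted, within_total)),
    ("total variance predicted by between-class variables", pct(between_predicted, total)),
    ("total variance predicted by within-class variables", pct(within_predicted, total)),
]
for label, value in rows:
    print(f"{label:55s} {value:5.1f}%")
```

Rounded to whole percentages, these reproduce the US column of Table 5 (41, 59, 77, 4, 31 and 2).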
Parental education did not have large enough effects to warrant keeping either variable in the model. Alternative models (for example, one in which the TIMSS parental education categories were entered as dummies) produced the same result. To explore this, we conducted additional analyses of TIMSS and the base year of the National Education Longitudinal Study (NELS-88), modifying our TIMSS model in several ways to make it as comparable as possible to the model we analyzed in NELS. This comparison suggested that several factors contributed to the unimportance of maternal education in our TIMSS model, including the use of a single classroom per school and the inclusion of the academic press variable. However, much of the difference remained unexplained and appears to be a result of unknown characteristics of the TIMSS database. When nearly identical models were analyzed in TIMSS and NELS, in both cases using schools rather than classrooms as the level 2 unit, the level 1 and level 2 parameters for maternal education were both less than half as large in TIMSS as in NELS.

None of the final models fully matched any other in terms of the variables included (Table 4). Only a single variable, academic press, appeared in the final models for all countries. The final models for Hong Kong, Korea, and the US all included variables for number of books in the home, a computer in the home, and academic press. The model for Hong Kong, however, included a variable for father present in the home but excluded age, which was included in both the US and Korea. The model for Korea included age but excluded presence of a father, which was included in the other two countries. The model for Hong Kong included a variable for born in country, and the model for Korea included a variable for father's education; neither of these variables was included in the models for any other countries. The model for France was the only model that excluded variables for the number of books or computer present and was the only one to include mother's education.

Although some of the coefficients were similar in magnitude across countries, others differed markedly. For example, the student-level (within-classroom) coefficients for press were similar in the US, France and Hong Kong: 9.6, 8.6 and 10.3, respectively.
The between-classroom coefficients for this variable were 43.2, 45.0 and 47.4 for the US, France and Korea. In contrast, the between-classroom coefficient for the press variable in Hong Kong was 174.5, several times as large as the coefficients for the same variable in the other models. However, as explained below, we do not place great confidence in specific parameter estimates, and this estimate in Hong Kong may be seen as implausible.

Although the variables in the models and the effects of those variables differed across countries, the models in all countries were consistent in predicting most of the variance between classrooms but little of the variance within classrooms (Table 5). This prediction of between-classroom variance ranged from 59 percent in France to 94 percent in Korea, and the prediction of within-classroom variance ranged from 1 percent in Hong Kong to 13 percent in Korea. The prediction of within-classroom variance in Korea, while a modest 13 percent, is several times as strong as in any other country; the next strongest prediction was 5 percent of the within-classroom variance in France.

The consistency of this strong prediction of between-classroom variance is all the more striking in the light of the sparseness of the models and the weak measurement of social background. Our models included few predictors. The variables available in TIMSS do not necessarily include those that researchers in participating countries would suggest are the most important predictors of achievement. For example, TIMSS does not include income, race/ethnicity, or inner-city location, all three of which are known to be important predictors of performance in the US. Similarly, the National Research Coordinator for Korea indicated that income, type of community (urban, suburban, rural) and geographic region are all somewhat correlated with performance in Korea (Im, 1998).
In addition, the selection of variables for use in the models was constrained in some instances by problems with the data.

Thus, the variables included in the models were a potentially weak proxy for those that would best show the relationships between score variance and background variables in each country. It is possible that the use of a stronger set of predictors would have substantially increased the percentage of variance predicted at one or both levels, particularly the within-classroom level, at which our prediction was very weak. We cannot determine whether this is the case, however. In the general case, the degree of prediction may not be substantially lessened by the weakness of collinear predictors if enough of them are used in the model (e.g., Berends and Koretz, 1996).

We have less confidence in the specific parameter estimates we obtained, particularly in cases in which the estimates varied markedly among countries. There are several reasons for this caution. First, as noted earlier, parameter estimates in multi-level models are often quite sensitive to specification differences (Kreft and DeLeeuw, 1998), and our selections of variables were necessarily somewhat happenstance, constrained as they were by the limitations of the TIMSS database. Models that included additional variables (such as family income) or better-measured constructs might have yielded substantially different estimates of the parameters in our models. Second, EDA showed that some variables behaved quite differently across countries. Other operationalizations of these constructs might have altered these differences and might therefore have produced different parameter estimates.

To test the importance of the particular selections of variables in our final models, we ran a constant, minimal model in each of the four countries, including the individual and aggregate values of number of books, computer present, press, age, and age squared. This fixed model predicted almost as much of the variance in performance as did our final models, which were selected to optimize prediction in each country and subject (Table 6; compare Table 5).
This suggests that predicted variability is fairly insensitive to the particular variables included in the model.

Table 6
Percent of Variance at Each Level Predicted by Fixed Model, Mathematics

                 Between Classroom   Within Classroom
United States          72%                  4%
France                 54                   4
Hong Kong              67                   1
Korea                  86                  12

Differences in the strength of prediction across the four countries therefore may be substantively more important than differences in parameter estimates. One striking difference in prediction becomes apparent when one looks at the prediction of total variance rather than within-level variance. In the US and Hong Kong, roughly one third of the total variance is predicted by the models, in both cases largely because of variation in between-classroom predictors (Table 7). The models predict much less of the variance in France (18 percent) and Korea (19 percent).

Table 7
Percent of Total Variance Predicted by Predictors at Each Level, Final Models

                 Between Classroom   Within Classroom   Both Levels
United States          31%                  2%              34%
France                 14                   3               18
Hong Kong              31                   1               32
Korea                   7                  12               19

NOTE: Entries may not sum to totals because of rounding.

The four countries also differ in terms of the relative predictive power of the models between the student and classroom levels. Again, the US and Hong Kong are very similar: almost all of the predicted variance in each country is attributable to between-classroom variation in the predictors (Table 7). France and Korea, however, differ in this respect, even though the percentage of total variance predicted at both levels is nearly identical in the two countries. In France, most of the predicted variance is attributable to the classroom-level predictors, and France differs from the US and Hong Kong in that the prediction is much weaker at the classroom level. In Korea, in contrast to all three other countries, more of the total prediction is due to within-classroom variation in predictors. This can be seen as a reflection of two factors. First, even though the model predicted only a modest percentage of the within-classroom variance in Korea, the predicted percentage was considerably larger than in the other three countries. Second, a larger percentage of the total variance lies within classrooms in Korea (93 percent) than in France (76 percent), the US (59 percent), or Hong Kong (55 percent). The product of these two percentages, which is the percent of total variance predicted by within-classroom predictors, is therefore much larger in Korea than in the other countries.

There are several possible non-exclusive explanations for these cross-national differences in predicted variance. First, the fixed model and our final models may be a better selection of variables for some countries than for others. Changing to a fixed set of variables drawing from the variables in our set did not have much of an impact, but it is possible that including other variables would have.
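The "product of two percentages" argument for Korea can be checked against the variance components in Table 5. The Python sketch below is illustrative only; the dictionary layout and function name are assumptions, and the inputs are the total between-classroom variance, total within-classroom variance, and predicted within-classroom variance reported for each country.

```python
# (total between, total within, predicted within) from Table 5
TABLE5 = {
    "United States": (3299.0, 4769.0, 198.0),
    "France": (1356.0, 4232.0, 191.0),
    "Hong Kong": (4543.0, 5557.0, 73.0),
    "Korea": (799.0, 10722.0, 1431.0),
}


def within_share_of_total(between_total, within_total, within_predicted):
    """Percent of total variance predicted by within-classroom predictors,
    computed as (share of variance within) x (share of that predicted)."""
    share_within = within_total / (between_total + within_total)
    predicted_within = within_predicted / within_total
    return 100.0 * share_within * predicted_within


results = {name: within_share_of_total(*parts) for name, parts in TABLE5.items()}
for name, value in results.items():
    print(f"{name:14s} {value:5.1f}% of total variance")
```

Rounded, the results match the within-classroom column of Table 7, and Korea stands out because both factors in the product are larger there than in the other three countries.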
Second, taking our models as a given, stronger prediction in one country than in another could stem from larger estimated effects of some variables in the model (that is, stronger partial relationships), greater variability in the predictors themselves, or both. To explore this, we partitioned the variance in the predictors themselves into within- and between-classroom components. We then compared the amount of variance in the predictors to the amount of predicted variance in scores.

The greater prediction of score variance within classrooms in Korea compared to the US appears not to stem from differences in the variability of predictors. Within classrooms, all of the predictors other than age (which matters less because it is a weak predictor of scores) showed roughly similar variance in the US and Korea. This, in conjunction with the larger parameter estimates reported for Korea earlier, indicates that the stronger within-classroom prediction in Korea stems from stronger partial relationships within classrooms between background variables and scores.

The contribution of predictor variance to the difference between France and the US in the prediction of between-classroom score variance, however, is ambiguous. France shows less between-classroom variance in two predictors, number of books and computer present, and the former is a relatively powerful predictor of score variance in France. On the other hand, France shows much more between-classroom variance in age, and age is also a strong predictor of score variance.

Recall that although Hong Kong is similar to Japan and Korea in terms of its overall mean and standard deviation, it is similar to the US, and strikingly different from Japan and Korea, in terms of the decomposition of variance into within- and between-school components. Hong Kong is also very similar to the US in terms of the predictive power of the models both within and between classrooms. Hong Kong and the US are also similar in terms of the within- and between-classroom variance of the predictors themselves, with the exception of age.

Conclusions

This study was prompted in part by a widespread view that performance variance in the US is unusual. This view has sometimes been made explicit, for example, in Berliner and Biddle's assertion that "The achievement of American schools is a lot more variable than is student achievement from elsewhere" (1995, p. 58). In other instances, this view of variability is implicit, as when the scores for US states or districts are compared to national averages from other countries. In response, we asked whether the distribution of performance in the US is anomalous, how the variance in performance is distributed in the US and other countries, and how well background factors can predict that variation.

TIMSS suggests strongly that the variation in performance in the US is not anomalous. In Population 2, the US variance is large but not exceptional in science and more nearly average in mathematics. Contrary to some expectations, the distribution of scores is not particularly skewed in the US, and in eighth-grade mathematics, it is right- rather than left-skewed.
Moreover, differences among countries in the variance of performance do not clearly follow stereotypes about their homogeneity. Socially homogeneous Japan, for example, shows a bit more variation than the US in mathematics, while socially heterogeneous France shows considerably less.

When performance variance is broken into within- and between-classroom components, however, the story becomes more complex. The US, Australia, Germany and Hong Kong show one pattern, in which nearly half of the variance lies between classrooms. Japan and Korea lie at the other extreme; most of their variance lies within classrooms, while very little lies between. The result is that classrooms in Japan and Korea resemble each other in terms of mean performance much more than do classrooms in the US, Germany, Hong Kong, and Australia. France falls between these two poles. By the same token, students in the typical classrooms in Japan and Korea show much greater variability in performance than do their counterparts in the US, Germany, Hong Kong, and Australia.

While the US is similar to many other countries in the overall variability of student performance in mathematics and is similar to several others we investigated in the decomposition of performance variation within and between classrooms, TIMSS does not fully address the reasonableness of Berliner and Biddle's (1995) assertion that US schools are far more variable than are schools elsewhere. Of the countries we considered, only the US and Australia provided samples that allow one to separate between-classroom and between-school variance. For example, if tracking is entirely absent in Japan and Korea, classrooms within schools should be randomly equivalent. In this case, much of the between-classroom variance in these countries might lie between schools, in comparison to the US and Australia, where our preliminary analysis found that most of the between-classroom variance lies within schools. However, only a sample that includes multiple classrooms per school would permit testing this hypothesis.

What do the present findings imply about the reasonableness of comparing means for US states and districts to averages for other nations? We cannot fully answer that question because the TIMSS design does not yield evidence pertaining to districts or states in the US or about similar units in other countries, such as German Länder. However, the wide dispersion of classroom means in Australia and Germany, and the smaller but still substantial dispersion of means in France, suggests that these comparisons may be misleading. Just as some states in the US compare more favorably than do others to means of other countries, some areas in those other countries are likely to score markedly better than the averages for those countries. In contrast, classrooms in Japan and Korea vary much less in average performance, so comparisons between US states and the means in Japan and Korea may be more meaningful. However, even in Korea and Japan, the standard deviations of classroom means are substantial, and the standard deviation of school means, which cannot be estimated from TIMSS, may be sizable as well.

Our analyses cannot identify causes of the cross-national differences we found, but they raise a number of intriguing possibilities that warrant further investigation. One question is what factors might underlie the patterns in Korea: little total variance between classrooms and an unusually large amount of predicted variance within classrooms. One possible contributor to the differences between the US and Korea is stratification of students in terms of ability. This hypothesis is consistent with the differences between the US and Korea in terms of both the decomposition of variance and the ability of the models to predict the within-classroom variance.
We know that Korea's policy is not to track students into classes by ability in eighth-grade mathematics (Im, 1998). If schools as well as classrooms are relatively little stratified in Korea in terms of background factors associated with student performance, then more of the relevant variance of these background variables may lie within classrooms in Korea than in France, the US, or Hong Kong. Note that the total variance in the background factors included in the fixed model is not larger within classrooms in Korea than in the US. However, more of the variance that predicts student performance may lie within classrooms in Korea. In contrast, in countries like the US, the combination of residential stratification and tracking would result in much of the relevant variance of these background variables lying between classrooms rather than within them.

However, other factors, such as instructional differences, might also contribute to the differences between Korea and the other countries examined. For example, instruction might vary less among classrooms in Korea than in Hong Kong or the US. This might help explain the lack of performance variation between classrooms. Instructional factors might also contribute to the greater within-classroom predictive power of background factors in Korea. Although many current US reform efforts aim for both higher standards and greater equity of outcomes, it is possible that, all other factors being equal, a very high level of standards could increase score variance, as the more able students might be better able to take advantage of more difficult material. Curriculum differences might also correlate differently with background factors from one country to another. If curriculum differences are less highly correlated with background factors in Korea than in the US, that too could contribute to the patterns we found.

The results for Hong Kong also raise interesting questions. Four Asian countries, Singapore, Korea, Japan, and Hong Kong, ranked highest in grade 8 mathematics in TIMSS. Hong Kong is also similar to Japan and Korea, but not Singapore, in terms of its simple standard deviation of scores. Our results, however, showed that in both the decomposition and prediction of performance variation, Hong Kong is very similar to the US and strikingly different from Korea and Japan. Hong Kong is also similar to the US in terms of the decomposition of the variance of predictor variables. Further investigation of factors that might cause Hong Kong to resemble other highly developed Asian countries in some respects but the US in other respects could help avoid simplistic explanations of cross-national differences in performance.

Finally, several aspects of performance variation in France (the relatively small overall standard deviation of scores, and the small total and predicted between-classroom variance) could have important implications for policy. As noted earlier, it is not clear from our results whether lesser between-classroom variation in predictors contributed to this, but decompositions of predictor variance did not suggest that this was a major factor. Some observers maintain that the French curriculum is highly standardized, even compared to that of many other countries with national curricula. If so, that uniformity could contribute to a smaller between-classroom variance. In addition, by weakening any correlations between curricular variables and social background, uniformity of curriculum could also lessen the prediction of score variance by background factors.

Further analysis of TIMSS data may help shed light on these questions. For example, the present analysis could be expanded to incorporate instructional and curriculum variables as well as background factors. The TIMSS data, however, will not be sufficient to address certain key aspects of these questions.
Specifically, the TIMSS data cannot provide useful information about variations in larger aggregates, including schools and states (and their equivalents). Moreover, in most countries, TIMSS collected very little information about stratification, either within or between schools. These gaps could be addressed either by modifications of future international surveys or by the use of smaller, more focused studies in selected countries.

Notes

1. A number of studies have shown that even older students often provide reports of background variables that are inconsistent with those of their parents. For example, Kaufman and Rasinski (1991, Table 3.2) showed that only roughly 60 percent of eighth-grade students in the National Education Longitudinal Study (NELS-88) agreed with their parents about their parents' educational attainment. A study of Asian and Hispanic students in NAEP found similar results for middle-school students but found that fewer than half of third-grade students agreed with their parents on this variable (Baratz-Snowden, Pollack, and Rock, 1988).

2. Note that the shape of the distributions depends on the mix of items included in the assessment. For example, it is possible that including a larger number of easy items in the assessment would have stretched the left-hand tails of these distributions, particularly the lower tail of the US distribution.

3. This is in contrast to traditional stepwise or other empirical subsets procedures, in which criteria specified a priori, such as F-for-inclusion, are applied algorithmically.

Acknowledgements
This work was conducted under Task Order 22.214.171.124 with the Education Statistics Services Institute, funded by contract number RN95127001 from the National Center for Education Statistics. The opinions expressed here are solely those of the authors and do not necessarily represent the views of the Education Statistics Services Institute or the National Center for Education Statistics.

The authors would like to acknowledge the assistance of several people who contributed to this work. Eugene Gonzalez of Boston College and TIMSS explained numerous aspects of the TIMSS data. Al Beaton and Laura O'Dwyer of Boston College and Laura Salganik of the Education Statistics Services Institute reviewed this report and provided valuable comments. Any errors of fact or interpretation that remain are solely the responsibility of the authors. Christel Osborn provided secretarial support for the project.

References

Baratz-Snowden, J., Pollack, J., and Rock, D. (1988). Quality of responses of selected items on NAEP special study student survey. Princeton: Educational Testing Service, unpublished.

Beaton, A. E., Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., Kelly, D. L., and Smith, T. A. (1996a). Mathematics Achievement in the Middle School Years. Chestnut Hill, MA: TIMSS International Study Center, Boston College.

Beaton, A. E., Martin, M. O., Mullis, I. V. S., Gonzalez, E. J., Smith, T. A., and Kelly, D. L. (1996b). Science Achievement in the Middle School Years. Chestnut Hill, MA: TIMSS International Study Center, Boston College.

Berends, M., and Koretz, D. (1996). Reporting minority students' test scores: How well can the National Assessment of Educational Progress account for differences in social context? Educational Assessment, 3 (3), 249-285.

Berliner, D. C., and Biddle, B. J. (1995). The Manufactured Crisis: Myths, Fraud, and the Attack on America's Public Schools. Reading, MA: Addison-Wesley.

Bryk, A. S., and Raudenbush, S. W. (1992).
Hierarchical Linear Models. Newbury Park, CA: Sage.

Bryk, A. S., Raudenbush, S. W., and Congdon, R. T. (1996). HLM: Hierarchical Linear and Nonlinear Modeling with the HLM/2L and HLM/3L Programs. Chicago: Scientific Software International.

Foy, P., Rust, K., and Schleicher, A. (1996). Sample design. In M. O. Martin and D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS) Volume I: Design and Development. Chestnut Hill, MA: TIMSS International Study Center, Boston College.

Im, Hyung (1998). Personal communication, June 18.

Kaufman, P., and Rasinski, K. A. (1991). Quality of Responses of Eighth-Grade Students in NELS-88. Washington, DC: US Department of Education, Office of Educational Research and Improvement (NCES 91-487).
Koretz, D., McCaffrey, D., and Sullivan, T. (2000). Using TIMSS to Analyze Correlates of Performance Variation in Mathematics. Santa Monica: RAND, working paper (February).

Kreft, I., and DeLeeuw, J. (1998). Introducing Multilevel Modeling. London: Sage.

Martin, M. O., Mullis, I. V. S., Beaton, A. E., Gonzalez, E. J., Smith, T. A., and Kelly, D. L. (1997). Science Achievement in the Primary School Years: IEA's Third International Mathematics and Science Study. Chestnut Hill, MA: TIMSS International Study Center, Boston College.

Mullis, I. V. S., Martin, M. O., Beaton, A. E., Gonzalez, E. J., Kelly, D. L., and Smith, T. A. (1997). Mathematics Achievement in the Primary School Years: IEA's Third International Mathematics and Science Study. Chestnut Hill, MA: TIMSS International Study Center, Boston College.

Mullis, I. V. S., Martin, M. O., Beaton, A. E., Gonzalez, E. J., Kelly, D. L., and Smith, T. A. (1998). Mathematics and Science in the Final Year of Secondary School: IEA's Third International Mathematics and Science Study (TIMSS). Chestnut Hill, MA: TIMSS International Study Center, Boston College.

National Center for Education Statistics (1996). Education in States and Nations: Indicators Comparing U.S. States with Other Industrialized Countries in 1991. Washington: author (Report NCES 96-160).

Pfeffermann, D. (1996). The use of sampling weights for survey data analysis. Statistical Methods in Medical Research, 5, 239-262.

Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., and Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society, B, 60 (Part 1), 23-40.

Schmidt, W. H., Wolfe, R. G., and Kifer, E. (1993). The identification and description of student growth in mathematics achievement. In L. Burstein (Ed.), The IEA Study of Mathematics III: Student Growth and Classroom Processes.
Oxford: Pergamon, 59-100.

About the Authors

Daniel Koretz is a professor at the Harvard Graduate School of Education and Associate Director of the Center for Research on Evaluation, Standards, and Student Testing (CRESST). His work focuses primarily on educational assessment and recently has included studies of the validity of gains in high-stakes testing programs, the effects of testing programs on schooling, the assessment of students with disabilities, and the effects of alternative systems of college admissions. E-mail: firstname.lastname@example.org

Daniel McCaffrey is a statistician at RAND. His areas of concentration are education policy and the analysis of hierarchical data and data from complex sample designs. E-mail: Daniel_McCaffrey@rand.org

Thomas J. Sullivan is a statistical programmer/analyst with RAND and a doctoral
candidate in Computational Statistics at George Mason University. E-mail: email@example.com

Appendix A
Description of Variables

This Appendix describes the source of the principal variables used in the models presented in this report.

Name                TIMSS name  Notes
Math score          BIMATSCR
Father present      BSBGADU2
Age                 BSDAGE
Books in home       BSBGBOOK    Sometimes entered as a single variable, if test of linearity warranted
Computer in home    BSBGPS02
Press composite                 Mean of BSBMSIP2 and BSBMMIP2 when both were present; either variable if only one present
Mother's education  BSBGEDUM    Sometimes recoded as noted in text; sometimes entered as a single variable, if test of linearity warranted
Father's education  BSBGEDUF    Sometimes recoded as noted in text; sometimes entered as a single variable, if test of linearity warranted
Born in country     BSBGBRN1

Appendix B
Confidence Limits for Parameter Estimates
Two-level Models

Parameter estimates are the same as those reported in the body of the article. Jackknifed estimates of lower and upper 95 percent confidence limits are in parentheses following each parameter estimate. The columns of the table are, in order, United States, France, Hong Kong, and Korea; not every variable was entered in every country's model, so some rows have fewer than four entries.

Intercept: -351.7 (-884.9, 181.5); 592.6 (289.8, 895.4); -424.8 (-708.6, -141.0); 27.9 (-660.5, 716.3)

Within class (b)
Number books: 7.9** (5.6, 10.3); 0.3 (-2.0, 2.7); 20.2** (16.7, 23.7)
Computer present: 4.4 (-2.7, 11.4); -3.8 (-10.0, 2.5); 10.9** (3.2, 18.7)
Father present: 1.7 (-4.9, 8.3); 8.9* (0.9, 17.0); -7.4 (-19.7, 5.0)
Mother's education: 4.6 (1.2, 8.0)
Father's education: 9.5** (5.5, 13.5)
Press: 9.6** (4.4, 14.7); 8.6* (0.5, 16.7); 10.3** (4.6, 16.0); 36.2** (27.9, 44.5)
Age: -14.4** (-21.1, -7.7); -18.2** (-24.6, 11.7); -6.0 (-18.4, 6.4)
Age2: -6.9 (-14.2, 0.5); -0.6 (-6.0, 4.7); -14.8** (-26.0, -3.7)
Born in Country: -19.1** (-30.1, -8.2)

Between-class (c)
M Books: 45.5** (30.7, 60.2); 44.1** (11.7, 76.6); 16.2* (1.7, 30.7)
M Computer: 37.2* (3.6, 70.9); 89.8* (5.4, 174.1); 44.5** (12.5, 76.4)
M Father present: 90.3** (47.4, 133.2); 59.5** (14.0, 104.9); 326.9** (151.8, 502.1)
M Mother's education: 26.4** (16.2, 36.7)
M Father's Education: 18.8** (7.7, 30.0)
M Press: 43.2** (9.0, 77.4); 45.0** (14.3, 75.5); 174.5** (103.5, 245.4); 47.4** (13.1, 81.7)
M Age: 33.9 (3.2, 64.6); -23.0* (-43.1, -2.8); 20.5 (-27.6, 68.5)
M Age2: -149.4 (-223.8, -75.0); -23.3 (-54.7, 8.0); -26.2* (-50.0, -2.4)
M Born in Country: -44.7 (-119.9, 30.4)

Residual variances
r 2 (within): 4570.4; 4040.8; 5485.0; 9290.6
t (between): 766.2; 554.7; 1406.2; 48.0

Copyright 2001 by the Education Policy Analysis Archives

The World Wide Web address for the Education Policy Analysis Archives is epaa.asu.edu

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, firstname.lastname@example.org or reach him at College of Education, Arizona State University, Tempe, AZ 85287-0211 (602-965-9644). The Commentary Editor is Casey D. Cobb: email@example.com.

EPAA Editorial Board

Michael W. Apple, University of Wisconsin
Greg Camilli, Rutgers University
John Covaleskie, Northern Michigan University
Alan Davis, University of Colorado, Denver
Sherman Dorn, University of South Florida
Mark E. Fetler, California Commission on Teacher Credentialing
Richard Garlikov, firstname.lastname@example.org
Thomas F. Green, Syracuse University
Alison I. Griffith, York University
Arlen Gullickson, Western Michigan University
Ernest R. House, University of Colorado
Aimee Howley, Ohio University
Craig B. Howley, Appalachia Educational Laboratory
William Hunter, University of Calgary
Daniel Kalls, Umeå University
Benjamin Levin, University of Manitoba
Thomas Mauhs-Pugh, Green Mountain College
Dewayne Matthews, Education Commission of the States
William McInerney, Purdue University
Mary McKeown-Moak, MGT of America (Austin, TX)
Les McLean, University of Toronto
Susan Bobbitt Nolen, University of Washington
Anne L. Pemberton, email@example.com
Hugh G. Petrie, SUNY Buffalo
Richard C. Richardson, New York University
Anthony G. Rud Jr., Purdue University
Dennis Sayers, California State University, Stanislaus
Jay D. Scribner, University of Texas at Austin
Michael Scriven, firstname.lastname@example.org
Robert E. Stake, University of Illinois-UC
Robert Stonehill, U.S. Department of Education
David D. Williams, Brigham Young University

EPAA Spanish Language Editorial Board

Associate Editor for Spanish Language: Roberto Rodríguez Gómez, Universidad Nacional Autónoma de México, email@example.com

Adrián Acosta (México), Universidad de Guadalajara, adrianacosta@compuserve.com
J. Félix Angulo Rasco (Spain), Universidad de Cádiz, felix.firstname.lastname@example.org
Teresa Bracho (México), Centro de Investigación y Docencia Económica-CIDE, bracho dis1.cide.mx
Alejandro Canales (México), Universidad Nacional Autónoma de México, canalesa@servidor.unam.mx
Ursula Casanova (U.S.A.), Arizona State University, casanova@asu.edu
José Contreras Domingo, Universitat de Barcelona, Jose.Contreras@doe.d5.ub.es
Erwin Epstein (U.S.A.), Loyola University of Chicago, Eepstein@luc.edu
Josué González (U.S.A.), Arizona State University, josue@asu.edu
Rollin Kent (México), Departamento de Investigación Educativa-DIE/CINVESTAV, rkent@gemtel.com.mx email@example.com
María Beatriz Luce (Brazil), Universidad Federal de Rio Grande do Sul-UFRGS, lucemb@orion.ufrgs.br
Javier Mendoza Rojas (México), Universidad Nacional Autónoma de México, javiermr@servidor.unam.mx
Marcela Mollis (Argentina), Universidad de Buenos Aires, mmollis@filo.uba.ar
Humberto Muñoz García (México), Universidad Nacional Autónoma de México, humberto@servidor.unam.mx
Angel Ignacio Pérez Gómez (Spain), Universidad de Málaga, aiperez@uma.es
Daniel Schugurensky (Argentina-Canadá), OISE/UT, Canada, dschugurensky@oise.utoronto.ca
Simon Schwartzman (Brazil), Fundação Instituto Brasileiro e Geografia e Estatística, firstname.lastname@example.org
Jurjo Torres Santomé (Spain), Universidad de A Coruña, jurjo@udc.es
Carlos Alberto Torres (U.S.A.), University of California, Los Angeles, torres@gseisucla.edu