E11-00072
Educational policy analysis archives. Vol. 5, no. 3 (January 15, 1997).
Tempe, Ariz. : Arizona State University ; Tampa, Fla. : University of South Florida. January 15, 1997
Testing writing on computers : an experiment comparing student performance on tests conducted via computer and via paper-and-pencil / Michael Russell [and] Walt Haney.
Arizona State University.
University of South Florida.
Education Policy Analysis Archives (EPAA)
Education Policy Analysis Archives
Volume 5 Number 3, January 15, 1997, ISSN 1068-2341

A peer-reviewed scholarly electronic journal. Editor: Gene V Glass, Glass@ASU.EDU. College of Education, Arizona State University, Tempe AZ 85287-2411

Copyright 1997, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article provided that EDUCATION POLICY ANALYSIS ARCHIVES is credited and copies are not sold.

Testing Writing on Computers: An Experiment Comparing Student Performance on Tests Conducted via Computer and via Paper-and-Pencil

Michael Russell
Boston College

Walt Haney
Boston College

Abstract

Computer use has grown rapidly during the past decade. Within the educational community, interest in authentic assessment has also increased. To enhance the authenticity of tests of writing, as well as of other knowledge and skills, some assessments require students to respond in written form via paper-and-pencil. However, as increasing numbers of students grow accustomed to writing on computers, these assessments may yield underestimates of students' writing abilities. This article presents the findings of a small study examining the effect that mode of administration (computer versus paper-and-pencil) has on middle school students' performance on multiple-choice and written test questions. Findings show that, though multiple-choice test results do not differ much by mode of administration, for students accustomed to writing on computer, scores for responses written on computer are substantially higher than scores for those written by hand (effect size of 0.9 and relative success rates of 67% versus 30%). Implications are discussed in terms of both future research and test validity.

Introduction
Two of the most prominent movements in education over the last decade or so are the introduction of computers into schools and the increasing use of "authentic assessments." A key assumption of the authentic assessment movement is that instead of simply relying on multiple choice tests, assessments should be based on the responses students generate for open-ended "real world" tasks. "Efforts at both the national and state levels are now directed at greater use of performance assessment, constructed response questions and portfolios based on actual student work" (Barton & Coley, 1994, p. 3). At the state level, the most commonly employed kind of non-multiple-choice test has been the writing test (Barton & Coley, 1994, p. 31) in which students write their answers long-hand. At the same time, many test developers have explored the use of computer administered tests, but this form of testing has been limited almost exclusively to multiple-choice tests. Relatively little attention has been paid to the use of computers to administer tests which require students to generate responses to open-ended items.

The consequences of the incongruities in these developments may be substantial. As the use of computers in schools and homes increases and students do more of their writing with word processors, at least two problems arise. First, performance tests which require students to produce responses long-hand via paper-and-pencil (which happens not just with large scale tests of writing, but also for assessments of other skills as evidenced through writing) may violate one of the key assumptions of the authentic assessment movement. For people who do most of their writing via computer, writing long-hand via paper-and-pencil is an artificial rather than real world task. Second, and more importantly, paper-and-pencil tests which require answers to be written long-hand to assess students' abilities (in writing or in other subjects) may yield underestimates of the actual abilities of students who are accustomed to writing via computer.

In this article, we present the results of a small study on the effect of computer administration on student performance on writing or essay tests. Specifically, we discuss the background, design and results of the study reported here. However, before focusing on the study itself, we present a brief summary of recent developments in computerized testing and authentic assessment.

In 1968, Bert Green, Jr., predicted "the inevitable computer conquest of testing" (Green, 1970, p. 194). Since then, other observers have envisioned a future in which "calibrated measures embedded in a curriculum ... continuously and unobtrusively estimate dynamic changes in student proficiency" (Bunderson, Inouye & Olsen, 1989, p. 387). Such visions of computerized testing, however, are far from present reality. Instead, most recent research on computerized testing has focused on computerized adaptive testing, typically employing multiple-choice tests. Perhaps the most widely publicized application of this form of testing occurred in 1993 when the Graduate Record Examination (GRE) was administered nationally in both paper/pencil and computerized adaptive forms.

Naturally, the introduction of computer administered tests has raised concern about the equivalence of scores yielded via computer- versus paper-and-pencil-administered test versions.
Although exceptions have been found, Bunderson, Inouye & Olsen (1989) summarize the general pattern of findings from several studies which examined the equivalence of scores acquired through computer or paper-and-pencil test forms as follows: "In general it was found more frequently that the mean scores were not equivalent than that they were equivalent; that is the scores on tests administered on paper were more often higher than on computer-administered tests." However, the authors also state that "[t]he score differences were generally quite small and of little practical significance" (p. 378). More recently, Mead & Drasgow (1993) reported on a meta-analysis of 29 previous studies of the equivalence of computerized and paper-and-pencil cognitive ability tests (involving 159 correlations between computerized and paper-and-pencil test results). Though they found that computerized tests were slightly harder than paper-and-pencil tests (with an overall cross-mode effect size of -.04), they concluded that their results "provide strong support for the conclusion that there is no medium effect for carefully constructed power tests. Moreover, no effect was found for adaptivity. On the other hand, a substantial medium effect was found for speeded tests" (Mead & Drasgow, 1993, p. 457).

Yet, as previously noted, standardized multiple-choice tests, which have been the object of comparison in previous research on computerized versus paper-and-pencil testing, have been criticized by proponents of authentic assessment. Among the characteristics which lend authenticity to an assessment instrument, Darling-Hammond, Ancess & Falk (1995) argue that the tasks be "connected to students' lives and to their learning experiences..." and that they provide insight into "students' abilities to perform 'real world' tasks" (pp. 4-5). Unlike standardized tests, which may be viewed as external instruments that measure a fraction of what students have learned, authentic assessments are intended to be closely linked with daily classroom activity so that they seamlessly "support and transform the process of teaching and learning" (Darling-Hammond, Ancess & Falk, 1995, p. 4; Cohen, 1990).

In response to this move towards authentic assessment, many developers of nationally administered standardized tests have attempted to embellish their instruments by including open-ended items for which students have to write their answers. These changes, however, have occurred during a period when both the real-world and the school-world have experienced a rapid increase in the use of computers.

The National Center for Education Statistics reports that the percentage of students in grades 1 to 8 using computers in school has increased from 31.5 in 1984, to 52.3 in 1989 and to 68.9 in 1993 (Snyder & Hoffman, 1990; 1994). In the workplace, the percentage of employees using computers has risen from 36.0 in 1989 to 45.8 in 1993. During this period, writing has been the predominant task adult workers perform on a computer (Snyder & Hoffman, 1993; 1995).
Given these trends, tests which require students to answer open-ended items via paper-and-pencil may decrease the tests' "authenticity" in two ways:

1. Assessments are not aligned with students' learning experiences; and
2. Assessments are not representative of 'real-world' tasks.

As the remainder of this paper suggests, these shortcomings may be leading to underestimates of students' writing abilities.

Background to this Study

In 1993, the Advanced Learning Laboratory School (ALL School, http://nis.accel.worc.k12.ma.us) of Worcester, Massachusetts decided to adopt the Co-NECT school design (or Cooperative Networked Educational Community for Tomorrow, http://co-nect.bbn.com). Developed by BBN, Inc., a Boston-based communications technology firm, Co-NECT is one of nine models for innovative schooling funded by the New American Schools Development Corporation. Working with BBN, the ALL School restructured many aspects of its educational environment. Among other reforms, the traditional middle school grade structure (that is, separately organized grade 6, 7 and 8 classes) was replaced with blocks which combined into a single cluster students who otherwise would be divided into grades 6, 7 and 8. In place of traditional subject-based classes (such as English Class, Math Class, Social Studies, etc.), all subjects were integrated and taught through project-based activities. To support this cooperative learning structure, several networked computers were placed in each classroom, allowing students to perform research via the Internet and CD-ROM titles, to write reports, papers and journals, and to create computer-based presentations using several software applications.
To help evaluate the effects the restructuring at the ALL School has on its students as a whole, the Center for the Study of Testing, Evaluation and Educational Policy (CSTEEP) at Boston College helped teachers gather baseline data in the fall of 1993 with plans to perform follow-up assessments in the spring of 1994 and each spring thereafter. To acquire a broad picture of students' strengths and weaknesses, the forms of tests included in the baseline assessment ranged from multiple choice tests to short and long answer open-ended assessments to hands-on performance assessments covering a wide range of reading, writing, science and math skills. To acquire insight into how cooperative projects affected the development of group skills, some of the performance assessments required students to work together to solve a problem and/or answer specific questions. Finally, to evaluate how the Co-NECT Model, as implemented in the ALL School, affected students' feelings about their school, a student survey was administered. Assessments and surveys were administered to representative samples of the whole school's student population.

In the spring of 1994, the same set of assessments was re-administered to different representative samples of students. While a full discussion of the results is beyond the scope of this paper, many of the resulting patterns of change were as expected. For example, performance items which required students to work cooperatively generally showed more improvement than items which required students to work independently. On items that required students to work independently, improvement was generally stronger on open-ended items than on multiple-choice items. But there was one notable exception: open-ended assessments of writing skills suggested that writing skills had declined.

Although teachers believed that the Co-NECT Model enhanced opportunities for students to practice writing, performance on both short answer and long answer writing items showed substantial decreases. For example, on a short answer item which asked students to write a recipe for peace, the percentage of students who responded satisfactorily decreased from 69% to 51%. On a long answer item which asked students to imagine a superhero, describe his/her powers, and write a passage in which the superhero uses his/her powers, the percentage of satisfactory responses dropped from 71% to 41%.
On another long answer item that asked students to write a story about a special activity done with their friends or family, student performance dropped from 56% to 43%. And on a performance writing item which first asked students to discuss what they saw in a mural with their peers and then asked them to write a passage independently that described an element in the mural and explained why they selected it, the percentage of satisfactory responses decreased from 62% to 47%. These declines were all statistically significant, and more importantly were substantively troubling.

Since writing was a skill the school had selected as a focus area for the 1993-94 school year, teachers were surprised and troubled by the apparent decrease in writing performance. During a feedback session on results in June 1994, teachers and administrators discussed at length the various writing activities they had undertaken over the past year. Based on these conversations, it was evident that students were regularly presented with opportunities to practice their writing skills. But a consistent comment was that teachers in the ALL School were increasingly encouraging students to use computers and word processing tools in their writing. As several computers were present in all classrooms as well as in the library, teachers believed that students had become accustomed to writing on the computer. When one teacher suggested that the decrease in writing scores might be due to the fact that all writing items in spring 1994 were administered on paper and required students to write their responses by hand, the theory was quickly supported by many teachers. With a follow-up assessment scheduled to occur a year later, several teachers asked if it would be possible for students to perform the writing items on a computer.
After careful consideration, it was decided that a subsample of students in spring 1995 would perform a computer-administered version of the performance writing item and items from the National Assessment of Educational Progress (NAEP); these items were mostly multiple-choice, with a few short answer items included. But, to preserve comparisons with results from 1993-94, the majority of the student population would perform these assessments as they had in that year, via the traditional pencil-and-paper medium. Hence, we undertook an experiment to compare the effect that the medium of administration (computer versus paper-and-pencil) has on student performance on multiple-choice, short-answer and extended writing test items.

Study Design and Test Instruments

To study the effect the medium of administration has on student performance, that is, taking assessments on computer versus by hand on paper, two groups of students were randomly selected from the ALL School Advanced Cluster (grades 6, 7 and 8). For the experimental group, which performed two of three kinds of assessments on computer, 50 students were selected. The control group, which performed all tests via pencil-and-paper, was composed of the 70 students required for the time-trend study described above. The three kinds of assessments performed by both groups were:

1. An open-ended (OE) assessment comprising 14 items, which included two writing items, five science items, five math items and two reading items.
2. A test comprised of NAEP items which was divided into three sections and included 15 language arts items, 23 science items and 18 math items. The majority of NAEP items were multiple-choice. However, 2 language arts items, 3 science items and 1 math item were open-ended and required students to write a brief response to each item's prompt.
3. A performance writing assessment which required an extended written response.

Both groups performed the open-ended (OE) assessment in exactly the same manner, by hand via paper-and-pencil. The experimental group performed the NAEP and writing assessment on computer, whereas the control group performed both in the traditional manner, by hand on paper.

The performance writing assessment consisted of a picture of a mural and two questions. Students formed small groups of 2 or 3 to discuss the mural. After 5 to 10 minutes, students returned to their seats and responded to one of two prompts:

1. Now, it is your turn to pick one thing you found in the mural. Pick one thing that is familiar to you, that you can recognize from your daily life or that is part of your culture. Describe it in detail and explain why you chose it.
2. Artists usually try to tell us something through their paintings and drawings. They may want to tell us about their lives, their culture or their feelings about what is happening in the neighborhood, community or world. What do you think the artists who made this mural want to tell us? What is this mural's message?

Due to absences, the actual number of students who participated in this study was as follows:

Experimental (Computer) Group: 46
Control (Paper-and-Pencil) Group: 68

It should be noted that the study described in this paper was performed as part of a larger longitudinal study which relied heavily on matrix sampling. For this reason, not all of the students in the control group performed all three tests. However, all students included in the analyses reported here performed at least two tests, one of which was the open-ended assessment. Table 1 shows the actual number of students in each group that performed each test.

Table 1
Number of Students Performing Each Test

Test            Experimental   Control   Total
Open-ended      46             68        114
NAEP            44             42        86
Perf. Writing   40             46        86

To be clear, we emphasize that the treatment, in terms of which the experimental and control groups differed, had nothing to do with the educational experience of the two groups. The groups were receiving similar (albeit quite unusual in comparison to most middle schools) educational experiences in the ALL School. The treatment, in terms of which the two groups differed, was simply that the experimental group took the NAEP and performance writing tests on computer, whereas the control group took these tests in the traditional manner, by hand with paper-and-pencil.

Converting Paper Tests to Computer

Before the tests could be administered on computer, the paper versions were converted to a computerized format. Several studies suggest that slight changes in the appearance of an item can affect performance on that item. Something as simple as changing the font in which a question is written, the order in which items are presented, or the order of response options can affect performance on that item (Beaton & Zwick, 1990; Cizek, 1991). Other studies have shown that people become more fatigued when reading text on a computer screen than when they read the same text on paper (Mourant, Lakshmanan & Chantadisai, 1981). One study (Haas & Hayes, 1986) found that when dealing with passages that covered more than one page, computer administration yielded lower scores than paper-and-pencil administration, apparently due to the difficulty of reading extended text on screen. Clearly, by converting items from paper to computer, the appearance of items is altered.

To minimize such effects, each page of the paper version of the NAEP items and the performance writing item was replicated on the computer screen as precisely as possible.
To that end, the layout of text and graphics on the computer version matched the paper version, including the number of items on a page, the arrangement of response options, and the positioning of footers, headers and directions. Despite these efforts, not every screen matched every page. Since the computer screen contained less vertical space, it was not always possible to fit the same number of questions on the screen as appeared on the page. In addition, to allow the test taker to move between screens (e.g., to go on to the next screen, back to a previous screen, or to flip to a passage or image to which an item referred), each screen of the computer versions contained navigation buttons along its bottom edge. Finally, to decrease the impact of screen fatigue, a larger font was used on the computer version than on the paper version.

To create a computerized version of the NAEP and performance writing tests, the following steps were taken:

1. An appropriate authoring tool was selected. To fully integrate the several graphics used in the multiple-choice items and the full-color photograph of a mural used in the performance writing item, as well as to track students' responses, Macromedia Director was used.
2. All graphics and the photograph of the mural were scanned. Adobe Photoshop was used to retouch the images.
3. A data file was created to store student input, including name, ID number, school name, birth date, gender, date of administration and responses to each item.
4. A prototype of each test was created, integrating the graphics, text and database into a seamless application. As described earlier, navigational buttons were placed along the lower edge of the screen. In addition, a "cover" page was created in which students entered biographical information.
5. The prototype was tested on several adults and students to assure that all navigational buttons functioned properly, that data was stored accurately, and that items and graphics were easy to read.
6. Finally, the prototype was revised as needed and the final versions of the computer tests were installed on twenty-four computers in the ALL School.

As described above, the addition of navigational buttons along the lower edge of the computer screen was the most noticeable difference between the computer and paper versions of the tests. To allow students to review their work and make changes as desired, a "Next Page" and "Previous Page" button appeared on all pages (or screens) of the computer tests (except the first and last page). To allow students to review their work, student responses were not recorded until the student reached the last page of the assessment and clicked a button labeled "I'm Finished." When the "I'm Finished" button was clicked, the student's biographical information and responses to each item were recorded in a data file before the program terminated. For all multiple-choice items, students clicked the option they felt best answered the question posed. For both short- and long-answer questions, examinees used a keyboard to type their answers into text boxes which appeared on their screen. Though they could edit using the keyboard and mouse, examinees did not have access to word processing tools such as spell-checking.

Scoring

Both groups of students performed a combination of multiple-choice and open-ended items. Multiple-choice NAEP items were scored as either correct or incorrect based upon the answer key accompanying the NAEP items. To prevent rater bias based on the mode of response, all short-answer NAEP responses were entered verbatim into the computer. Responses of students who had taken the NAEP questions on computer and via paper-and-pencil were then randomly intermixed.
Applying the rating rubrics designed by NAEP, two raters independently scored each set of six short answer items for each student. As part of an overall strategy to summarize results on all items in terms of percent correct, the initial ratings (which ranged from 1 to 5) were converted to a dichotomous value, 1 or 0, to denote whether student responses were adequate or inadequate. The two raters' converted scores were then compared. Where discrepancies occurred, the raters re-evaluated responses and reached consensus on a score.

To score the performance writing item, all hand-written responses were entered verbatim into the computer, again so as to prevent raters from knowing which responses were originally written by hand. The hand-written and computer-written responses were randomly intermixed. Three independent raters then scored each written response, using the following four-point scoring rubric:

1. Too brief to evaluate: Student did not make an attempt; indicates that student either did not know how to begin, or could not approach the problem in an appropriate manner.
2. Inadequate Response: Student made an attempt but the response was incorrect, reflected a misconception and/or was poorly communicated.
3. Adequate Response: Response is correct and communicated satisfactorily, but lacks clarity, elaboration and supporting evidence.
4. Excellent Response: Response is correct, communicated clearly and contains evidence which supports his/her response.

Initial analyses of the three raters' ratings showed that there was only a modest level of inter-rater reliability among the three (inter-rater correlations ranged from 0.44 to 0.62, across the total of 89 performance writing responses). Although these correlations were lower than expected, research on the assessment of writing has shown that rating of writing samples, even among trained raters, tends to be only modestly reliable (Dunbar, Koretz, & Hoover, 1991). Indeed, that is why we planned to have more than one rater evaluate each student response to the performance writing task. Hence for the purpose of the study reported here we created composite performance rating scores by averaging the three ratings of each student's response (which we call PWAvg).

Since the open-ended assessment was performed by paper-and-pencil by all students, student responses were not entered into the computer. A single rater, who did not know which students had performed other assessments on the computer, scored all responses using a 4-point scale. Although each of the 14 items had its own specific scoring criteria, the general meaning of each score was the same across all 14 open-ended items, as well as the performance writing item. The raw scores were then collapsed into a 0, 1 scale, with original scores of 1 or 2 representing a 0, or inadequate response, and original scores of 3 or 4 representing a 1, or adequate response. For the purpose of the study reported here, total open-ended response scores were calculated by summing across all 14 OE items.

Results

In presenting results from this study, we discuss: 1) assessment results overall; 2) comparative results from the two groups that took assessments via computer or via paper-and-pencil; 3) results of regression analyses; and 4) separate analyses of performance on the short-answer and multiple-choice NAEP items.

We present descriptive data summaries before results of statistical tests. Regarding the latter, we note that this experiment involved multiple comparisons of results based on just two random samples of students. While the literature on how to adjust alpha levels to account for multiple comparisons (e.g. Hancock & Klockars, 1996) is too extensive to review here, let us simply summarize how we dealt with this issue.
We planned to compare results for the experimental and control groups on five different measures: OE, performance writing, and three NAEP subtests, in science, math, and language arts. The Dunn approach to multiple comparisons tells us that the alpha level for c multiple comparisons, $\alpha_{c}$, is related to the simple alpha level for a single comparison, $\alpha$, as follows:

$$\alpha_{c} = 1 - (1 - \alpha)^{1/c}$$

Hence for five comparisons, the adjusted value of a simple 0.05 alpha level becomes 0.0102. Analogously, a simple alpha level of 0.01 for a single comparison becomes 0.0020 for five planned comparisons. We use these alpha levels in discussing the statistical significance of comparisons between experimental and control group results. In discussion, we address not just the statistical significance, but also the substantive significance of our findings.

Overall Results

The actual raw data on which all analyses are based are being made available to the reader. From this point, the data files can be accessed in ASCII or EXCEL Spreadsheet (binary) form.

Table 2 presents a summary of overall results, that is, combined results for all students who took any of the three assessments in Spring 1995.

Table 2
Summary Statistics for All Assessments

Scale              Range   n     Mean   SD
OE                 0-14    114   7.87   2.96
NAEP Lang Arts     0-15    86    9.84   3.79
NAEP Science       0-23    86    9.70   4.37
NAEP Math          0-18    86    6.21   3.39
Perf Writing Avg   1-4     86    2.53   0.62

These data indicate that the assessments were relatively challenging for the students who performed them. Mean scores were in the range of 56-66% correct for the OE and NAEP Language Arts tests, but considerably below 50% correct for the NAEP science and NAEP math subtests. In this regard, it should be noted that all of these assessments were originally designed to be administered to eighth graders, but in the study reported here they were administered to 6th, 7th and 8th grade level students who in the ALL School are intermixed in the same clusters.

Table 3 presents Spearman rank order intercorrelations of all assessments, again across both groups. The OE results correlated only slightly higher with the PWAvg results (0.48) than with the NAEP subtests (0.40-0.46), possibly reflecting the fact that both of these assessments were open-ended, requiring students to produce rather than select an answer. The three NAEP item subtests showed moderate intercorrelations (0.56-0.62), which might be expected for multiple-choice tests in the different subject areas (despite the fact that none of the NAEP subtests contained as many as two dozen items). The PWAvg results showed modest correlations with the NAEP subtests. Of the three NAEP subtests, the PWAvg was most strongly correlated with the Science subtest. Although the NAEP science results were based largely on multiple-choice items, of the three NAEP subtests, the Science section contained the largest number of short answer items (3 out of 23 items). The NAEP subtest that correlated least with the PWAvg scores (0.37) was the NAEP Math subtest, which contained only one open-ended item.

Table 3
Intercorrelations of Assessment Results

                 OE     NAEP Lang Arts   NAEP Science   NAEP Math   Perf. Writing
OE               1.00
NAEP Lang Arts   0.46   1.00
NAEP Science     0.44   0.62             1.00
NAEP Math        0.40   0.56             0.57           1.00
Perf Writing     0.48   0.49             0.54           0.37        1.00

p < .01 for all intercorrelations

Computer versus Paper-and-Pencil Results

Table 4 presents results separately for the experimental and control groups, namely the group which took NAEP and performance writing assessments on paper and the one that took them on computer. The table also shows results of t-tests (for independent samples, assuming equal variances for the two samples and hence using a pooled variance estimate). As an aid to interpretation, the table also shows the effect of computer administration in terms of Glass's delta effect size, that is, the mean of the experimental group minus the mean of the control group, divided by the standard deviation of the control group. While other methods for calculating effect size have been proposed (Rosenthal, 1994, p. 237), note that results would not differ dramatically if a pooled standard deviation were used instead of the control group standard deviation.
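As an illustration of the two statistics just described, the following Python sketch recomputes Glass's delta and the pooled-variance t statistic from summary statistics alone. The function names are hypothetical, and the inputs are the rounded performance writing values reported in Table 4, so the output should be expected to differ slightly from the published 0.94 and 4.16, which were presumably computed from unrounded data.

```python
import math


def glass_delta(mean_exp, mean_ctrl, sd_ctrl):
    """Glass's delta: mean difference scaled by the control-group SD."""
    return (mean_exp - mean_ctrl) / sd_ctrl


def pooled_t(n1, m1, s1, n2, m2, s2):
    """Independent-samples t statistic with a pooled variance estimate.

    Returns the t statistic and its degrees of freedom (n1 + n2 - 2).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / std_err, df


# Rounded performance writing values as printed in Table 4
# (assumption: the rounded figures stand in for the raw data).
control = {"n": 46, "mean": 2.30, "sd": 0.55}
experimental = {"n": 40, "mean": 2.81, "sd": 0.59}

delta = glass_delta(experimental["mean"], control["mean"], control["sd"])
t_stat, df = pooled_t(
    experimental["n"], experimental["mean"], experimental["sd"],
    control["n"], control["mean"], control["sd"],
)

print(f"Glass's delta = {delta:.2f}")
print(f"t({df}) = {t_stat:.2f}")
```

Scaling by the control-group standard deviation, rather than a pooled one, expresses the effect in units of the variability seen under ordinary paper-and-pencil administration.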
Results indicate that, after adjusting for the planned multiple comparisons, the effect of computer administration was significant only for the PWAvg. The effect size of computer administration on the performance writing task was 0.94. The four tests which did not show a statistically significant difference between the two groups were the OE test and the NAEP Language Arts, Science, and Math tests. The absence of a statistically significant difference on the OE test was, of course, expected since the OE test was the one test that was administered in the same form (paper-and-pencil) to the two groups. Similarly, since the NAEP tests were primarily composed of multiple-choice items, which previous research suggests are affected minimally by the mode of administration, differences between the two groups on the NAEP tests were not expected. Note however that the size of the difference in OE scores between the two groups was surprisingly large, given that the two groups had been randomly selected. The absence of four students randomly selected for the experimental group who did not take any tests may partially explain this difference.
Nevertheless, to explore the possibility that group differences may partially account for apparent mode of administration effects (and also, of course, to estimate effects more precisely), regression analyses were conducted.

Table 4 Summary Results by Group

               Control                 Experimental
               n     Mean    SD        n     Mean     SD      Effect Size   (df)     t       Sig
OE             68    7.62    3.14      46    8.24     2.66    0.20          (112)    1.10    0.27
Lang Arts      42    9.24    3.96      44    --       3.58    0.30          (84)     1.44    0.15
Science        42    8.67    4.17      44    10.68    4.39    0.48          (84)     2.18    0.03
Math           42    6.00    3.30      44    6.41     3.51    0.12          (84)     0.56    0.58
Perf Writ.     46    2.30    0.55      40    2.81     0.59    0.94          (84)     4.16    <.0001**

** statistically significant at the 0.01 level after taking multiple comparisons into account

Regression Analyses

As a further step in examining the effects of mode of administration, regression analyses were conducted using the OE scores as a covariate and then introducing a dummy variable (0 = paper/pencil group; 1 = computer administration group) to estimate the effects of mode of administration on the NAEP Language Arts, Science and Math subtests and on the PWAvg scores. Results of these regression analyses are shown in Table 5.

Table 5 Results of Regression Analyses

Dependent Variable    Coeff    SE      t-ratio    Sig
NAEP Lang Arts
  Constant            5.03     1.09    4.60       <.0001**
  OE                  0.57     0.13    4.40       <.0001**
  Group*              0.66     0.75    0.89       0.38
NAEP Science
  Constant            3.72     1.23    3.02       .0033
  OE                  0.67     0.15    4.59       <.0001**
  Group*              1.42     0.84    1.69       0.09
NAEP Math
  Constant            1.99     0.97    2.04       <.0445
  OE                  0.54     0.12    4.70       <.0001**
  Group*              -0.07    0.67    0.11       0.91
Perf Writing
  Constant            1.59     0.16    9.73       <.0001**
  OE                  0.09     0.02    4.88       <.0001**
  Group*              0.44     0.11    3.98       .0001**

* (1 = computer)
** statistically significant at the 0.01 level after taking multiple comparisons into account

These results confirm the findings shown in Table 4, namely that even after controlling for OE scores, the effect of mode of administration was highly significant on the PWAvg. However, for the largely multiple-choice NAEP subtests, results indicate no difference for mode of administration.

Performance on Multiple Choice and Short-Answer NAEP Items

Although the regression analysis suggested that mode of administration did not significantly influence performance on the NAEP subtests, further analysis was performed on the NAEP subtest items to examine the effect of administration mode on the two forms of items contained in the NAEP subtests: multiple-choice and short answer. Table 6 shows the mean score for the two groups on both the multiple-choice items and the short-answer items for the three subtests. Although slight differences between the means were found for the multiple-choice items, none were significant. However, for the science and language arts short answer items, those students who responded on computer performed significantly better than the paper-and-pencil group. While it was expected that performance on multiple-choice items would not differ, the differences detected on the short answer items suggest that even for items that require a brief written response, the mode of administration may affect a student's performance.

The question arises as to why the mode of administration affected performance on the short answer Language Arts and Science questions, but not on the one short-answer Math item.
It is likely that the nature of the open-ended Math item accounts for similar performance between the two groups. The open-ended Math question required a short answer which could not be provided without correctly answering the multiple-choice question that preceded it. In contrast, the three short answer Science items asked students to interpret data in a table, explain their process and respond to a factual item. In particular, the second short answer Science item provided a fair amount of space for a response and many students wrote at least one complete sentence. Although the three Science items were related to the same set of data displayed in a table, responses to these items were not dependent on answers to previous items.

Table 6: Results of Analysis of NAEP Subtest Item Formats: Multiple-choice versus Short Answer

                   Control                Experimental
Items              n     Mean    SD       n     Mean    SD      Effect Size    t       Sig
Lang. Arts
  Mult. Choice     42    8.6     3.47     44    9.0     3.03    0.12           0.64    .522
  Short Answer     42    0.6     0.73     44    1.4     0.75    0.99           4.52    <.0001**
Science
  Mult. Choice     42    8.0     3.97     44    9.0     3.99    0.26           1.22    .226
  Short Answer     42    0.7     0.77     44    1.7     0.98    1.25           5.06    <.0001**
Math
  Mult. Choice     42    5.8     3.07     44    6.1     3.33    0.10           0.44    .660
  Short Answer     42    0.2     0.41     44    0.3     0.47    0.25           1.08    .282

** statistically significant at the 0.01 level after taking multiple comparisons into account

To inquire further into the apparent effect of mode of administration on short answer Language Arts and Science items, we conducted regression analyses, using OE scores as a covariate. Results, shown in Table 7, indicate that the mode of administration had a significant effect on the students' performances on the NAEP Language Arts and Science short-answer items.

Table 7: Results of Regression Analyses on NAEP Language Arts and Science Short-Answer Items

Dependent Var     Coef.    s.e.    beta    s.e.    t-ratio    Sig
NAEP Lang Arts
  Constant        -0.08    0.22                    -.38       .71
  OE              0.10     0.03    0.35    0.09    3.77       .0003**
  Group*          0.63     0.15    0.39    0.09    4.23       .0001**
NAEP Science
  Constant        0.20     0.28                    0.74       .4645
  OE              0.07     0.03    0.20    0.09    2.09       .0397
  Group*          0.91     0.19    0.45    0.09    4.77       <.0001**

* (1 = computer)
** statistically significant at the 0.01 level after taking multiple comparisons into account

Discussion

The experiment described here was a small inquiry aimed at investigating a particular question. Motivated by a question as to whether or not performance on an extended writing task might be better if students were allowed to write on computer rather than on paper, the study aimed at estimating the effects of mode of administration on test results for two kinds of assessments, namely the largely multiple-choice NAEP subtests and the extended writing task previously described.
Unlike most previous research on the effects of computer administered tests, which has focused on multiple-choice tests and has generally found no or small differences due to mode of administration, our results indicate substantial effects due to mode of administration. The size of the effects was found to be 0.94 on the extended writing task and .99
and 1.25 for the NAEP language arts and science short answer items. Effect sizes of this magnitude are unusually large and of sufficient size to be of not just statistical, but also practical significance (Cohen, 1977; Wolf, 1986). An effect size of 0.94, for example, implies that the score for the average student in the experimental group exceeds that of 83 percent of the students in the control group.

A number of authors have noted the difficulty of interpreting the practical significance of effect sizes and have suggested that one useful way of doing so is with a "binomial effect size display" showing proportions of success and failure under experimental and control conditions (Hedges & Olkin, 1985; Rosenthal & Rubin, 1982). While there are a number of ways in which effect sizes, expressed as either Glass's delta or a correlation coefficient, can be converted to a binomial effect size display, in the case of our PWAvg scores, we have a direct way of showing such a display. Recall that student responses to the performance writing item were scored on a 4-point scale in which scores of 1 and 2 represented a less than adequate response and scores of 3 and 4 represented an adequate or better response. Using the cut-point of 2.5 as distinguishing between inadequate (failure) and adequate (success) responses in terms of PWAvg scores, we may display results as shown in Table 8.

Table 8: Binomial Effect Size Display of Experimental Results: In Terms of Inadequate vs. Adequate PWAvg Scores

                           Inadequate    Adequate
Control (Paper)
  N                        32            14
  Percent                  69.6%         30.4%
Experimental (Computer)
  N                        13            27
  Percent                  32.5%         67.5%

This display indicates that the computer mode of administration had the effect of increasing the success rate on the performance writing item (as judged by the average of three independent raters) from around 30% to close to 70%.

As a means of inquiring further into the source of this large effect, we conducted a variety of analyses to explore why and for whom the mode of administration effect occurred. To explore why the mode of administration effect may have occurred, we first undertook a textual analysis of student responses to the extended writing task. Specifically, we calculated the average number of characters, words and paragraphs contained in the responses of both groups. As Table 9 below indicates, those students who performed the assessment on the computer tended to write almost twice as much and were more apt to organize their responses into more paragraphs.

Table 9: Characters, Words and Paragraphs on Performance Writing Task by Mode of Administration

                                  Characters    Words       Paragraphs
Control (Paper)
  Mean                            586.9         111.6       1.457
  Std                             275.58        52.47       1.069
  n                               46            46          46
Experimental (Computer)
  Mean                            1022.2        204.7       2.625
  Std                             549.55        111.32      2.306
  n                               40            40          40
Observed t with pooled variance   4.73          5.07        3.08
Sig                               <.0001**      <.0001**    <.0001**

** statistically significant at the 0.01 level after taking multiple comparisons into account

In some ways, this pattern is consistent with the findings of Daiute (1985) and Morocco and Neuman (1986), who have shown that teaching writing with word processors tends to lead students to write more and to revise more than when they write with paper-and-pencil. Not surprisingly, the length of students' written responses (in terms of numbers of characters and words) correlated significantly with PWAvg scores (0.63 in both cases). Although this suggests that longer responses tended to receive higher scores, the fact that length of response explains less than half of the variance in PWAvg scores suggests that rated quality is not attributable simply to length of response.

Second, we considered the possibility that motivation might help explain the mode of administration effect. This possibility was suggested to us by spontaneous comments made by students after the testing. For example, after taking the writing assessment on computer, one student commented, "I thought we were going to be taking a test." In contrast, a student in the control group, who had not taken any tests via computer, inquired of us, "How come we didn't get to take the test on computer?" Such comments raised the possibility that motivation and the simple novelty of taking tests on computer might explain the mode of administration effect we found.

Two lines of thought suggest that simple motivation cannot explain our results. If differential motivation arising from the novelty of taking tests on computer was the main cause of our results, it is hard to explain why mode of administration effects were absent on the multiple-choice NAEP subtests, but were prevalent on the performance writing test and the NAEP open-ended items.
Furthermore, recent research on the effects of motivation on test performance suggests that the effects of motivation are not nearly as large as the mode of administration effect we found on the performance writing test. Recently, Kiplinger & Linn (1996) reported on the effects of an experiment in which "low stakes" NAEP items were embedded in a "high stakes" state testing program in Georgia. Though results from this experiment were mixed, the largest effects of "high stakes" motivation occurred for nine NAEP items designed for eighth grade students. For these nine items, however, the effect size was only 0.18 (Kiplinger & Linn, 1996, p. 124). In a separate study, O'Neil, Sugrue & Baker (1996) investigated the effects of monetary and other incentives on the performance of eighth grade students on NAEP items. Again, though effects of these motivational conditions were mixed, the largest influence of motivation ranged from an effect size of 0.16 to 0.24 (O'Neil, Sugrue & Baker, 1996, p. 147). With the largest effects of motivation on eighth grade students found to be in the range of 0.16 to 0.24, these results suggest that motivation alone cannot explain the magnitude of mode of administration effects we found for written responses.

To examine for whom the mode of administration effects occurred, we also inquired into whether the mode of administration effect appeared to be different for different students. First we inquired into whether the mode of administration effect seemed to be different for students performing at different levels on the OE test. One simple way of testing this possibility was to calculate PWAvg scores predicted on the basis of OE scores and see if there was a statistically
significant correlation between residuals (actual minus predicted PWAvg scores) and OE scores among the experimental group students. No significant correlation was found, suggesting that the mode of administration effect was not different for students of different ability levels as indicated by their OE scores. A graphical presentation of this pattern is shown in Figure 1, which depicts the line of PWAvg scores regressed on OE scores, with the experimental cases represented with X's and the control group with dots. As can be seen in Figure 1, the actual PWAvg scores for the experimental group tended to exceed the predicted scores across ability levels as represented by the OE scores.

Figure 1: Regression of PWAvg scores on OE Scores

Finally, we explored whether the mode of administration effect seemed to differ for males versus females. Table 10 shows PWAvg scores by gender for both control and experimental groups.

Table 10: PWAvg Scores by Gender and Group

                           Female    Male    Total
Control (Paper)
  Mean                     2.33      2.27    2.30
  SD                       0.58      0.47    0.55
  n                        21        25      46
Experimental (Computer)
  Mean                     2.92      2.60    2.81
  SD                       0.53      0.63    0.59
  n                        26        14      40
Total
  Mean                     2.66      2.38    2.53
  SD                       0.65      0.54    0.62
  n                        47        39      86

Within the control group, females performed only slightly better on PWAvg scores than did males (with means of 2.33 and 2.27 respectively). However, within the experimental group females scored considerably better than males (with means of 2.92 and 2.60). Thus it appears that the effect of computer administration may have been somewhat larger for females than for males.
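The gender comparison can be made more concrete by computing Glass's delta separately for females and males from the Table 10 summary statistics. The sketch below is illustrative only: the function and variable names are ours, the inputs are the rounded table values, and the unweighted average of the two within-gender effect sizes is simply one plausible way of adjusting for gender, assumed here rather than taken from the analysis itself.

```python
def glass_delta(mean_exp, mean_ctl, sd_ctl):
    """Glass's delta: mean difference divided by the control-group SD."""
    return (mean_exp - mean_ctl) / sd_ctl


# Rounded PWAvg means and control-group SDs by gender, taken from Table 10.
TABLE_10 = {
    "female": {"ctl_mean": 2.33, "ctl_sd": 0.58, "exp_mean": 2.92},
    "male": {"ctl_mean": 2.27, "ctl_sd": 0.47, "exp_mean": 2.60},
}

# Effect of computer administration within each gender.
effects = {
    gender: glass_delta(s["exp_mean"], s["ctl_mean"], s["ctl_sd"])
    for gender, s in TABLE_10.items()
}

# Unweighted mean of the two within-gender effect sizes
# (an assumed, simple form of adjustment for gender).
average_effect = sum(effects.values()) / len(effects)

for gender, value in effects.items():
    print(f"{gender}: {value:.2f}")
print(f"average: {average_effect:.2f}")
```

With the rounded inputs this gives an effect size of about 1.02 for females and about 0.70 for males, with an unweighted average of about 0.86.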
Nonetheless the males who took the extended writing task on computer still performed considerably better than the females who took the writing task on paper (with respective means of 2.60 and 2.33). A two-way analysis of variance (PWAvg by gender and group) showed group but not gender to be significant (this was the case whether or not an interaction term was included). This general pattern was confirmed by regression analyses of PWAvg scores on OE scores, sex and group. Though OE scores and the group variable were significant, the sex variable was not.

We should note that post hoc, we were surprised that the proportion of males in the control group (54%) differed by nearly 19 percentage points from the proportion of males in the experimental group (35%). Although the two groups were selected randomly, the probability that this difference would occur is less than .08. However, as can be calculated based on the data in Table 10, even after controlling for gender, the average effect size is 0.86.

Although the experiment reported here had several weaknesses--only one extended writing task was used, no other variables on academic achievement beyond the OE test results were used as covariates in regression analyses, and information on students' extent of experience working on computers was not collected--further research into this topic clearly is warranted.

Increasingly, schools are encouraging students to use computers in their writing. As a result, it is likely that increasing numbers of students are growing accustomed to writing on computers. Nevertheless, large scale assessments of writing, at state, national and even international levels, are attempting to estimate students' writing skills by having them use paper-and-pencil.
Our results, if generalizable, suggest that for students accustomed to writing on computer for only a year or two, such estimates of student writing abilities based on responses written by hand may be substantial underestimates of their abilities to write when using a computer.

This suggests that we should exercise considerable caution in making inferences about student abilities based on paper-and-pencil/handwritten tests as students gain more familiarity with writing via computer. And more generally it suggests an important lesson about test validity. Validity of assessment needs to be considered not simply with respect to the content of instruction, but also with respect to the medium of instruction. As more and more students in schools and colleges do their work with spreadsheets and word processors, the traditional paper-and-pencil modes of assessment may fail to measure what they have learned.

We suspect that it will be some years before schools generally, much less large scale state, national or international assessment programs, develop the capacity to administer wide-ranging assessments via computer. In the meantime, we should be extremely cautious about drawing inferences about student abilities when the media of assessment do not parallel those of instruction and learning.

Acknowledgment

We wish to acknowledge our great appreciation of Carol Shilinsky, Principal, and the teachers of the ALL School of Worcester, MA, who first suggested the idea of the study reported here. In particular we thank Rich Tamalavich and Deena Kelly, who helped arrange the equipment needed and oversaw administration of the computerized testing. We also wish to thank five anonymous EPAA reviewers who, through editor Gene Glass, provided us with some very helpful suggestions on a previous version of this manuscript with a rapidity that was absolutely astonishing. Five reviews were received within one week of submission of the manuscript!
We also thank Gene Glass for suggesting that we take better advantage of the electronic medium via which this journal is published, for example, by appending to this article the full data set for the study so that others might conduct secondary analyses, and by providing electronic links to institutions mentioned in the study. So now, the reviewer who wondered about
the unusual school in which this study was conducted can visit the ALL Schools WWW site. Any such visitors will find that students at this remarkable school are now not just writing via computer, but also publishing via the WWW.

References

Barton, P. E. & Coley, R. J. (1994). Testing in America's schools. Princeton, NJ: Educational Testing Service Policy Information Center.

Beaton, A. E. & Zwick, R. (1990). The Effect of Changes in the National Assessment: Disentangling the NAEP 1985-86 Reading Anomaly. Princeton, NJ: Educational Testing Service, ETS.

Bunderson, C. V., Inouye, D. K. & Olsen, J. B. (1989). The four generations of computerized educational measurement. In Linn, R. L., Educational Measurement (3rd Ed). Washington, D.C.: American Council on Education, pp. 367-407.

Cizek, G. J. (1991). The Effect of Altering the Position of Options in a Multiple-choice Examination. Paper presented at NCME, April 1991. (ERIC)

Cohen, D. (1990). Reshaping the Standards Agenda: From an Australian's Perspective of Curriculum and Assessment. In P. Broadfoot, R. Murphy & H. Torrance (Eds.), Changing Educational Assessment: International Perspectives and Trends. London: Routledge.

Cohen, J. (1977). Statistical power analysis for the behavioral sciences (rev. ed.). NY: Academic Press.

Daiute, C. (1985). Writing and computers. Reading, MA: Addison-Wesley Publishing Co.

Darling-Hammond, L., Ancess, J. & Falk, B. (1995). Authentic Assessment in Action. New York, NY: Teachers College Press.

Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality Control in the Development and Use of Performance Assessments. Applied Measurement in Education, 4(4), 289-303.

Green, B. F., Jr. (1970). Comments on tailored testing. In W. H. Holtzman (Ed.), Computer-assisted instruction, testing and guidance. New York: Harper and Row.

Haas, C. & Hayes, J. R. (1986). What Did I Just Say? Reading Problems in Writing with the Machine. Research in the Teaching of English, 20(1), 22-35.

Hancock, G. R. & Klockars, A. J. (1996). The quest for α: Developments in multiple comparison procedures in the quarter century since Games (1971). Review of Educational Research, 66(3), 269-306.

Hedges, L. V. & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego: Academic Press.

Kiplinger, V. L. & Linn, R. L. (1996). Raising the stakes of test administration: The impact on student performance on the National Assessment of Educational Progress. Educational Assessment, 3(2), 111-133.

Mead, A. D. & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114(3), 449-458.

Morocco, C. C. & Neuman, S. B. (1986). Word processors and the acquisition of writing strategies. Journal of Learning Disabilities, 19(4), 243-248.

Mourant, R. R., Lakshmanan, R. & Chantadisai, R. (1981). Visual Fatigue and Cathode Ray Tube Display Terminals. Human Factors, 23(5), 529-540.

O'Neil, H. F. Jr., Sugrue, B. & Baker, E. L. (1996). Effects of motivational interventions on the National Assessment of Educational Progress mathematics performance. Educational Assessment, 3(2), 135-157.

Rosenthal, R. (1994). Parametric measures of effect size. In Cooper, H. & Hedges, L., The handbook of research synthesis. NY: Russell Sage Foundation, pp. 231-244.

Rosenthal, R. & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.

Snyder, T. D. & Hoffman, C. M. (1990). Digest of Education Statistics. Washington, DC: U. S. Department of Education.

Snyder, T. D. & Hoffman, C. M. (1993). Digest of Education Statistics. Washington, DC: U. S. Department of Education.

Snyder, T. D. & Hoffman, C. M. (1994). Digest of Education Statistics. Washington, DC: U. S. Department of Education.

Snyder, T. D. & Hoffman, C. M. (1995). Digest of Education Statistics. Washington, DC: U. S. Department of Education.

Wolf, F. (1986). Meta-analysis: Quantitative methods for research synthesis. SAGE University series on quantitative applications in the social sciences, series no. 07-059. Newbury Park, CA: SAGE.

About the Authors

Michael Russell
email@example.com
Boston College
School of Education
Center for the Study of Testing, Evaluation and Educational Policy
323 Campion Hall
Chestnut Hill, MA 02167
Ph. 617/552-4521
Fax 617-552-8419

Walt Haney
firstname.lastname@example.org
Boston College
School of Education
Center for the Study of Testing, Evaluation and Educational Policy
323 Campion Hall
Chestnut Hill, MA 02167

Copyright 1997 by the Education Policy Analysis Archives

The World Wide Web address for the Education Policy Analysis Archives is http://olam.ed.asu.edu/epaa

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, email@example.com or reach him at College of Education, Arizona State University, Tempe, AZ 85287-2411 (602-965-2692). The Book Review Editor is Walter E. Shepherd: firstname.lastname@example.org. The Commentary Editor is Casey D. Cobb: email@example.com.

EPAA Editorial Board

Michael W. Apple, University of Wisconsin
Greg Camilli, Rutgers University
John Covaleskie, Northern Michigan University
Andrew Coulson, firstname.lastname@example.org
Alan Davis, University of Colorado, Denver
Sherman Dorn, University of South Florida
Mark E. Fetler, California Commission on Teacher Credentialing
Richard Garlikov, email@example.com
Thomas F. Green, Syracuse University
Alison I. Griffith, York University
Arlen Gullickson, Western Michigan University
Ernest R. House, University of Colorado
Aimee Howley, Marshall University
Craig B. Howley, Appalachia Educational Laboratory
William Hunter, University of Calgary
Richard M. Jaeger, University of North Carolina--Greensboro
Daniel Kallós, Umeå University
Benjamin Levin, University of Manitoba
Thomas Mauhs-Pugh, Rocky Mountain College
Dewayne Matthews, Western Interstate Commission for Higher Education
William McInerney, Purdue University
Mary P. McKeown, Arizona Board of Regents
Les McLean, University of Toronto
Susan Bobbitt Nolen, University of Washington
Anne L. Pemberton, firstname.lastname@example.org
Hugh G. Petrie, SUNY Buffalo
Richard C. Richardson, Arizona State University
Anthony G. Rud Jr., Purdue University
Dennis Sayers, University of California at Davis
Jay D. Scribner, University of Texas at Austin
Michael Scriven, email@example.com
Robert E. Stake, University of Illinois--UC
Robert Stonehill, U.S. Department of Education
Robert T. Stout, Arizona State University