USF Libraries
USF Digital Collections

A comparability analysis of the National Nurse Aide Assessment Program


Material Information

Title:
A comparability analysis of the National Nurse Aide Assessment Program
Physical Description:
Book
Language:
English
Creator:
Jones, Peggy K
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:
2006

Subjects

Subjects / Keywords:
Mode effect
Item bias
Computer-based tests
Differential item functioning
Dichotomous scoring
Dissertations, Academic -- Measurement and Evaluation -- Doctoral -- USF
Genre:
bibliography (marcgt)
theses (marcgt)
non-fiction (marcgt)

Notes

Abstract:
ABSTRACT: When an exam is administered across dual platforms, such as paper-and-pencil and computer-based testing simultaneously, individual items may become more or less difficult in the computer version (CBT) as compared to the paper-and-pencil (P&P) version, possibly resulting in a shift in the overall difficulty of the test (Mazzeo & Harvey, 1988). Using 38,955 examinees' response data across five forms of the National Nurse Aide Assessment Program (NNAAP) administered in both the CBT and P&P mode, three methods of differential item functioning (DIF) detection were used to detect item DIF across platforms. The three methods were Mantel-Haenszel (MH), Logistic Regression (LR), and the 1-Parameter Logistic Model (1-PL). These methods were compared to determine if they detect DIF equally in all items on the NNAAP forms. Data were reported by agreement of methods, that is, an item flagged by multiple DIF methods. A kappa statistic was calculated to provide an index of agreement between paired methods of the LR, MH, and the 1-PL based on the inferential tests. Finally, in order to determine what, if any, impact these DIF items may have on the test as a whole, the test characteristic curves for each test form and examinee group were displayed. Results indicated that items behaved differently and the examinee's odds of answering an item correctly were influenced by the test mode administration for several items, ranging from 23% of the items on Forms W and Z (MH) to 38% of the items on Form X (1-PL), with an average of 29%. The test characteristic curves for each test form were examined by examinee group and it was concluded that the impact of the DIF items on the test was not consequential. Each of the three methods detected items exhibiting DIF in each test form (ranging from 14 items to 23 items). The Kappa statistic demonstrated a strong degree of agreement between paired methods of analysis for each test form and each DIF method pairing (reporting good to excellent agreement in all pairings). Findings indicated that while items did exhibit DIF, there was no substantial impact at the test level.
Thesis:
Dissertation (Ph.D.)--University of South Florida, 2006.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Peggy K. Jones.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 207 pages.
General Note:
Includes vita.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001914297
oclc - 175298839
usfldc doi - E14-SFE0001752
usfldc handle - e14.1752
System ID:
SFS0026070:00001




Full Text



PAGE 1

A Comparability Analysis of the National Nurse Aide Assessment Program

by

Peggy K. Jones

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Measurement and Research
College of Education
University of South Florida

Major Professor: John Ferron, Ph.D.
Robert Dedrick, Ph.D.
Jeffrey Kromrey, Ph.D.
Carol A. Mullen, Ph.D.

Date of Approval: November 2, 2006

Keywords: mode effect, item bias, computer-based tests, differential item functioning, dichotomous scoring

Copyright 2006, Peggy K. Jones

PAGE 2

ACKNOWLEDGEMENTS

To my committee, my colleagues, my family, and my friends, I give many, many thanks and deep gratitude for each and every time you supported and encouraged me. I extend deep appreciation to Dr. John Ferron for his guidance, assistance, mentorship, and encouragement throughout my doctoral program. Fond thanks are extended to Drs. Robert Dedrick, Jeff Kromrey, and Carol Mullen for their support, guidance, and encouragement. A special thank you to Dr. Cynthia Parshall for support which resulted in the successful achievement of many goals throughout this journey. Thank you to Promissor, especially Dr. Betty Bergstrom and Kirk Becker, for allowing me the use of the NNAAP data. All analyses and conclusions drawn from these data are the author's. Special gratitude is extended to my family for their tireless support, patience, and encouragement.

PAGE 3

TABLE OF CONTENTS

List of Tables ..... v
List of Figures ..... vii
Abstract ..... x

Chapter 1: Introduction ..... 1
    Background ..... 3
    Purpose of Study ..... 7
        Research Questions ..... 9
    Importance of Study ..... 9
    Limitations ..... 13
    Definitions ..... 14

Chapter 2: Review of the Literature ..... 17
    Standardized Testing ..... 18
    Test Equating ..... 25
        Research Design ..... 24
        Test Equating Methods ..... 24
    Computerized Testing ..... 27

PAGE 4

        Advantages ..... 27
        Disadvantages ..... 29
        Comparability ..... 30
            Early Comparability Studies ..... 31
            Recent Comparability Studies ..... 33
        Types of Comparability Issues ..... 37
            Administration Mode ..... 37
            Software Vendors ..... 39
        Professional Recommendations ..... 41
        Differential Item Functioning ..... 42
            Mantel-Haenszel ..... 48
            Logistic Regression ..... 53
            Item Response Theory ..... 55
                Assumptions ..... 64
                Using IRT to Detect DIF ..... 65
    Summary ..... 69

Chapter 3: Method ..... 71
    Purpose ..... 71
        Research Questions ..... 72
    Data ..... 73
        National Nurse Aide Assessment Program ..... 73
        Administration of National Nurse Aide Assessment Program ..... 78
        Sample ..... 78

PAGE 5

    Data Analysis ..... 83
        Factor Analysis ..... 84
        Item Response Theory ..... 85
        Differential Item Functioning ..... 86
            Mantel-Haenszel ..... 87
            Logistic Regression ..... 89
            1-Parameter Logistic Model ..... 91
            Measure of Agreement ..... 92
    Summary ..... 93

Chapter 4: Results ..... 94
    Total Correct ..... 95
    Reliability ..... 97
    Factor Analysis ..... 98
    Differential Item Functioning ..... 99
        Mantel-Haenszel ..... 104
        Logistic Regression ..... 114
        1-Parameter Logistic Model ..... 125
    Agreement of DIF Methods ..... 136
        Kappa Statistic ..... 153
    Post Hoc Analyses ..... 156
        Differential Test Functioning ..... 156
        Nonuniform DIF Analysis ..... 164
    Summary ..... 167

PAGE 6

Chapter 5: Discussion ..... 168
    Summary of the Study ..... 168
    Summary of Study Results ..... 169
    Limitations ..... 175
    Implications and Suggestions for Future Research ..... 177
        Test Developer ..... 177
        Methodologist ..... 180
            DIF Methods ..... 181
            Practical Significance ..... 183
            Matching Criterion Approach ..... 183

References ..... 186

Appendices ..... 201

About the Author ..... End Page

PAGE 7

LIST OF TABLES

Table 1 Summary of DIF studies ordered by year ..... 48
Table 2 Groups defined in the MH statistic ..... 51
Table 3 Descriptive statistics: 2004 NNAAP written exam (60 items) ..... 77
Table 4 Number and percent of examinees passing the 2004 NNAAP ..... 80
Table 5 Number of examinees by form and state for sample ..... 82
Table 6 Groups defined in the MH statistic ..... 88
Table 7 Mean total score for sample ..... 96
Table 8 NNAAP statistics for 2004 overall ..... 98
Table 9 Study sample KR-20 results (60 items) ..... 98
Table 10 Confirmatory factor analysis results ..... 99
Table 11 Proportion of examinees responding correctly by item ..... 101
Table 12 Mantel-Haenszel results for form A ..... 105
Table 13 Mantel-Haenszel results for form V ..... 107
Table 14 Mantel-Haenszel results for form W ..... 109
Table 15 Mantel-Haenszel results for form X ..... 111
Table 16 Mantel-Haenszel results for form Z ..... 113

PAGE 8

Table 17 Logistic regression results for form A ..... 116
Table 18 Logistic regression results for form V ..... 118
Table 19 Logistic regression results for form W ..... 120
Table 20 Logistic regression results for form X ..... 122
Table 21 Logistic regression results for form Z ..... 124
Table 22 1-PL results for form A ..... 127
Table 23 1-PL results for form V ..... 129
Table 24 1-PL results for form W ..... 131
Table 25 1-PL results for form X ..... 133
Table 26 1-PL results for form Z ..... 135
Table 27 DIF methodology results for form A ..... 140
Table 28 DIF methodology results for form V ..... 143
Table 29 DIF methodology results for form W ..... 146
Table 30 DIF methodology results for form X ..... 149
Table 31 DIF methodology results for form Z ..... 152
Table 32 Kappa agreement results for all test forms ..... 154
Table 33 Nonuniform DIF results compared to uniform DIF methods ..... 166
Table 34 Number of items favoring each mode by test form and DIF method ..... 172

PAGE 9

LIST OF FIGURES

Figure 1. Item-person map ..... 58
Figure 2. Stem and leaf plot for form A MH odds ratio estimates ..... 106
Figure 3. Stem and leaf plot for form V MH odds ratio estimates ..... 108
Figure 4. Stem and leaf plot for form W MH odds ratio estimates ..... 110
Figure 5. Stem and leaf plot for form X MH odds ratio estimates ..... 112
Figure 6. Stem and leaf plot for form Z MH odds ratio estimates ..... 114
Figure 7. Stem and leaf plot for form A LR odds ratio estimates ..... 117
Figure 8. Stem and leaf plot for form V LR odds ratio estimates ..... 119
Figure 9. Stem and leaf plot for form W LR odds ratio estimates ..... 121
Figure 10. Stem and leaf plot for form X LR odds ratio estimates ..... 123
Figure 11. Stem and leaf plot for form Z LR odds ratio estimates ..... 125
Figure 12. Stem and leaf plot for form A 1-PL theta differences ..... 128
Figure 13. Stem and leaf plot for form V 1-PL theta differences ..... 130
Figure 14. Stem and leaf plot for form W 1-PL theta differences ..... 132
Figure 15. Stem and leaf plot for form X 1-PL theta differences ..... 134
Figure 16. Stem and leaf plot for form Z 1-PL theta differences ..... 132

PAGE 10

Figure 17. Scatterplot of MH odds ratio estimates and 1-PL theta differences for form A ..... 138
Figure 18. Scatterplot of LR odds ratio estimates and 1-PL theta differences for form A ..... 138
Figure 19. Scatterplot of MH odds ratio estimates and LR odds ratio estimates for form A ..... 139
Figure 20. Scatterplot of MH odds ratio estimates and 1-PL theta differences for form V ..... 141
Figure 21. Scatterplot of LR odds ratio estimates and 1-PL theta differences for form V ..... 142
Figure 22. Scatterplot of MH odds ratio estimates and LR odds ratio estimates for form V ..... 142
Figure 23. Scatterplot of MH odds ratio estimates and 1-PL theta differences for form W ..... 144
Figure 24. Scatterplot of LR odds ratio estimates and 1-PL theta differences for form W ..... 145
Figure 25. Scatterplot of MH odds ratio estimates and LR odds ratio estimates for form W ..... 145
Figure 26. Scatterplot of MH odds ratio estimates and 1-PL theta differences for form X ..... 147
Figure 27. Scatterplot of LR odds ratio estimates and 1-PL theta differences for form X ..... 148

PAGE 11

Figure 28. Scatterplot of MH odds ratio estimates and LR odds ratio estimates for form X ..... 148
Figure 29. Scatterplot of MH odds ratio estimates and 1-PL theta differences for form Z ..... 150
Figure 30. Scatterplot of LR odds ratio estimates and 1-PL theta differences for form Z ..... 151
Figure 31. Scatterplot of MH odds ratio estimates and LR odds ratio estimates for form Z ..... 151
Figure 32. Test information curve for form A P&P examinees ..... 159
Figure 33. Test information curve for form A CBT examinees ..... 159
Figure 34. Test information curve for form V P&P examinees ..... 160
Figure 35. Test information curve for form V CBT examinees ..... 160
Figure 36. Test information curve for form W P&P examinees ..... 161
Figure 37. Test information curve for form W CBT examinees ..... 161
Figure 38. Test information curve for form X P&P examinees ..... 162
Figure 39. Test information curve for form X CBT examinees ..... 162
Figure 40. Test information curve for form Z P&P examinees ..... 163
Figure 41. Test information curve for form Z CBT examinees ..... 163
Figure 42. Nonuniform DIF for form X item 12 ..... 165

PAGE 12

A Comparability Analysis of the National Nurse Aide Assessment Program

Peggy K. Jones

ABSTRACT

When an exam is administered across dual platforms, such as paper-and-pencil and computer-based testing simultaneously, individual items may become more or less difficult in the computer version (CBT) as compared to the paper-and-pencil (P&P) version, possibly resulting in a shift in the overall difficulty of the test (Mazzeo & Harvey, 1988).

Using 38,955 examinees' response data across five forms of the National Nurse Aide Assessment Program (NNAAP) administered in both the CBT and P&P mode, three methods of differential item functioning (DIF) detection were used to detect item DIF across platforms. The three methods were Mantel-Haenszel (MH), Logistic Regression (LR), and the 1-Parameter Logistic Model (1-PL). These methods were compared to determine if they detect DIF equally in all items on the NNAAP forms. Data were reported by agreement of methods, that is, an item flagged by multiple DIF methods. A kappa statistic was calculated to provide an index of agreement between paired methods of the LR, MH, and

PAGE 13

the 1-PL based on the inferential tests. Finally, in order to determine what, if any, impact these DIF items may have on the test as a whole, the test characteristic curves for each test form and examinee group were displayed.

Results indicated that items behaved differently and the examinee's odds of answering an item correctly were influenced by the test mode administration for several items, ranging from 23% of the items on Forms W and Z (MH) to 38% of the items on Form X (1-PL), with an average of 29%. The test characteristic curves for each test form were examined by examinee group and it was concluded that the impact of the DIF items on the test was not consequential. Each of the three methods detected items exhibiting DIF in each test form (ranging from 14 items to 23 items). The Kappa statistic demonstrated a strong degree of agreement between paired methods of analysis for each test form and each DIF method pairing (reporting good to excellent agreement in all pairings). Findings indicated that while items did exhibit DIF, there was no substantial impact at the test level.
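The kappa statistic referred to above indexes how often two DIF methods flag the same items. As a minimal sketch, not the author's analysis code, Cohen's kappa can be computed from two hypothetical 0/1 flag vectors like this:

```python
import numpy as np

def cohens_kappa(flags_a, flags_b):
    """Cohen's kappa for two binary item-flag vectors (1 = flagged for DIF)."""
    a = np.asarray(flags_a, dtype=float)
    b = np.asarray(flags_b, dtype=float)
    p_observed = np.mean(a == b)                      # proportion of items the methods agree on
    p_chance = a.mean() * b.mean() + (1 - a).mean() * (1 - b).mean()
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical flags for a 60-item form from two methods (e.g., MH vs. LR)
mh_flags = np.zeros(60); mh_flags[[3, 7, 12, 25, 40]] = 1
lr_flags = np.zeros(60); lr_flags[[3, 7, 12, 25, 41]] = 1
print(round(cohens_kappa(mh_flags, lr_flags), 2))     # about 0.78 for these made-up flags
```

Values near 1 indicate that the two methods flag essentially the same items; the dissertation reports good to excellent agreement for all method pairings.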

PAGE 14

CHAPTER ONE: INTRODUCTION

A test score is valid to the extent that it measures the attribute it is supposed to measure. This study addressed score validity, which can impact the interpretability of a score on any given instrument. This study was designed as a comparability investigation. Comparability exists when examinees with equal ability from different groups perform equally on the same test items. Comparability is a term often used to refer to studies that investigate mode effect. This type of study investigates whether being administered an exam in paper-and-pencil (P&P) mode or computer-based (CBT) mode predicts test score. If it does not predict test score, the test items are mode-free; otherwise, mode has an effect on test score and a mode effect exists. Simply put, it should be a matter of indifference to the examinee which test form or mode is used (Lord, 1980). When multiple forms of an exam are used, forms should be equated and the pass standard or cut score should be set to compensate for the difficulty of the form (Crocker & Algina, 1986; Davier, Holland, & Thayer, 2004; Hambleton & Swaminathan, 1985; Kolen & Brennan, 1995; Lord, 1980; Muckle & Becker, 2005). Comparability studies conducted

PAGE 15

have found some common conditions that tend to produce mode effect. These include items that require multiple screens or scrolling, such as long reading passages or graphs which require the examinee to search for information (Greaud & Green, 1986; Mazzeo & Harvey, 1988; Parshall, Davey, Kalohn, & Spray, 2002); software that does not allow the examinee to omit or review items (Spray, Ackerman, Reckase, & Carlson, 1989); presentation differences such as passage and item layout (Pommerich & Burden, 2000); and speeded tests (Greaud & Green, 1986). In earlier comparability studies, groups (CBT and P&P) were examined by investigating differences in means and standard deviations, looking for main effects or interactions using test-retest reliability indices, effect sizes, ANOVA, and MANOVA. In more recent studies, researchers have begun to examine differential item functioning using various methods as a way of examining mode effects. Differential item functioning (DIF) exists when examinees of equal ability from different groups perform differently on the same test items. DIF methods normally are used to investigate differences between groups involving gender or race. Good test practice calls for a thorough analysis, including test equating, DIF analysis, and item analysis, whenever changes to a test or test program are implemented to ensure that the measure of the intended construct has not been affected. Many researchers (e.g., Bolt, 2000; Camilli & Shepard, 1994; Clauser & Mazor, 1998; Gierl, Bisanz, Bisanz, Boughton, & Khaliq, 2001; Penfield & Lam, 2000; Standards for Educational and Psychological Testing, 1999; Welch & Miller, 1995) call for empirical evidence (observed from operational assessments) of equity and fairness associated with

PAGE 16

tests and report that fairness is concluded when a test is free of bias. The presence of DIF can compromise the validity of the test scores (Lei, Chen, & Yu, 2005), affecting score interpretations. With the onset of computer-based testing, the American Psychological Association (1986) and the Association of Test Publishers (2002) have established guidelines strongly calling for comparability of CBTs (from P&P) to be established prior to administering the test in an operational setting. Items that operate differentially for one group of examinees make scores from one group not comparable to another examinee group; therefore, comparability studies should be conducted (Parshall et al., 2002), and DIF methodology is ideal for examining whether items behave differently in one test administration mode compared to the other (Schwarz, Rich, & Podrabsky, 2003). High stakes exams (e.g., public school grades kindergarten-12, graduate entrance, licensure, certification) are used to make decisions that can have personal, social, or political ramifications. Currently, there are high stakes exams being given in multiple modes for which comparability studies have not been reported.

Background

Measurement flourished in the 20th century with the increased interest in measuring academic ability, aptitude, attitude, and interests. A growing interest in measuring a person's academic ability heightened the movement of standardized testing. Behavioral observations were collected in environments where conditions were prescribed (Crocker & Algina, 1986; Wilson, 2002). The prescribed conditions are very similar to the precise instructions that accompany

PAGE 17

standardized tests. The instructions dictate the standard conditions for collecting and scoring the data (Crocker & Algina, 1986). Wilson (2002) defines a standardized assessment as a set of items administered under the same conditions, scored the same, and resulting in score reports that can be interpreted in the same way.

The early years of measurement resulted in the distribution of scores on a mathematics exam in an approximation of the normal curve by Sir Francis Galton, a Victorian pioneer of statistical correlation and regression; the statistical formula for the correlation coefficient by Karl Pearson, a major contributor to the early development of statistics, based on work by Sir Francis Galton; the more advanced correlational procedure, factor analysis, theorized by Charles Spearman, an English psychologist known for work in statistics; and the use of norms as a part of score interpretation, first used by Alfred Binet, a French psychologist and inventor of the first usable intelligence test, and Theophile Simon, who assisted in the development of the Simon-Binet Scale, to estimate a person's mental age and compare it to chronological age for instructional or institutional placement in the appropriate setting (Crocker & Algina, 1986; Kubiszyn & Borich, 2000). Each of these statistical techniques was commonly used in test validation, furthering standardized testing. These events mark the beginning of test theory. In 1904, E. L. Thorndike published the first text on test theory, An Introduction to the Theory of Mental and Social Measures. Thorndike's work was applied to the development of five forms of a group examination designed to be given to all military personnel for selection and

PAGE 18

placement of new recruits (Crocker & Algina, 1986). Test theory literature continued to grow. Soon new developments and procedures were recorded in the journal Psychometrika (Crocker & Algina, 1986; Salsburg, 2001).

Recently, standardized testing has been used to make decisions about examinees including promotion, retention, salary increases (Kubiszyn & Borich, 2000), and the granting of certification and licensure. The use of tests to make these high stakes decisions is likely to increase (Kubiszyn & Borich, 2000). Humans have made advances in technology, and these advances often come with benefits that make tasks more convenient and efficient. It is not surprising that these technical advances would affect standardized testing, making the administration of tests more convenient, more efficient, and potentially more innovative. The administration of computerized exams offers immediate scoring, item and examinee information, the use of innovative item types (e.g., multimedia graphic displays), and the possibility of more efficient exams (Embretson & Reise, 2000; Mead & Drasgow, 1993; Mills, Potenza, Fremer, & Ward, 2002; Parshall et al., 2002; Wainer, 2000; Wall, 2000). However, even with these benefits, there are challenges: "Computer-based tests present a greater variety of testing paradigms and a unique set of measurement challenges" (ATP, 2002, p. 20). A variety of administrative circumstances can impact the test score and the interpretation of the score. In situations where an exam is administered across dual platforms such as paper-and-pencil and computer-based testing simultaneously, a comparability study is needed (Parshall, 1992). In fact, in large

PAGE 19

scale assessments it is important to gather empirical evidence that a test item behaves similarly for two or more groups (Penfield & Lam, 2000; Welch & Miller, 1995). Individual items may become more or less difficult in the computer version (CBT) of an exam as compared to the paper-and-pencil (P&P) version, possibly resulting in a shift in the overall difficulty of the test (Mazzeo & Harvey, 1988). One example of an item that could be more difficult in the computer mode is an item associated with a long reading passage, because reading material onscreen is more demanding and slower than reading print (Parshall, 1992). Alternately, students who are more comfortable on a computer may have an advantage over those who are not (Godwin, 1999; Mazzeo & Harvey, 1988; Wall, 2000). Comparing the equivalence of administration modes can eliminate issues that could be problematic to the interpretation of scores generated on a computerized exam. Issues that could be problematic include test timing, test structure, item content, innovative item types, test scoring, and violations of statistical assumptions for establishing comparability (Mazzeo & Harvey, 1988; Parshall et al., 2002; Wall, 2000; Wang & Kolen, 1997). If differences are found on an item from one test administration mode over another test administration mode, the item is measuring something beyond ability. In this case, the item would measure test administration mode, which is not relevant to the purpose of the test. As a routine, new items are field tested and screened for differential item functioning (DIF), and this practice is critical for computerized exams in order to uphold quality (Lei et al., 2005).
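To make the idea of a mode effect concrete, the sketch below (not drawn from the dissertation; every value is invented) simulates dichotomous responses under a Rasch model in which five items are made 0.5 logits harder when delivered by computer. Monte Carlo comparability studies of the kind discussed later in this chapter start from data generated this way, where the "truth" about which items carry DIF is known before any detection method is applied.

```python
import numpy as np

rng = np.random.default_rng(2006)

n_examinees, n_items = 5000, 60
theta = rng.normal(0.0, 1.0, n_examinees)     # known examinee abilities (logits)
b = rng.normal(0.0, 1.0, n_items)             # known item difficulties (logits)
cbt = rng.integers(0, 2, n_examinees)         # 0 = P&P examinee, 1 = CBT examinee
mode_shift = np.zeros(n_items)
mode_shift[:5] = 0.5                          # the "truth": items 0-4 are harder on CBT

# Rasch probability of a correct response, with the difficulty shift applied
# only to CBT examinees; responses are then sampled from those probabilities.
logits = theta[:, None] - (b[None, :] + np.outer(cbt, mode_shift))
prob = 1.0 / (1.0 + np.exp(-logits))
responses = (rng.random((n_examinees, n_items)) < prob).astype(int)
```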

PAGE 20

When tests are offered across many forms, the forms should be equated and the pass standard set to compensate for the difficulty of the form (Crocker & Algina, 1986; Davier, Holland, & Thayer, 2004; Hambleton & Swaminathan, 1985; Kolen & Brennan, 1995; Lord, 1980; Muckle & Becker, 2005). This practice should transfer to tests administered in dual platform, as the delivery method can potentially affect the difficulty of an item (Godwin, 1999; Mazzeo & Harvey, 1988; Parshall et al., 2002; Sawaki, 2001). Currently, there are high stakes exams being given in multiple modes for which comparability studies have not been reported.

Purpose of the Study

The purpose of this study was to examine any differences that may be present in the administration of a specific high stakes exam administered in dual platform (e.g., paper and computer). The exam, the National Nurse Aide Assessment Program (NNAAP), is administered in 24 states and territories as a part of the states' licensing procedure for nurse aides. If differences across platforms were found on the NNAAP, it would indicate that a test item behaves differently for an examinee depending on whether the exam is administered via paper-and-pencil or computer. It is reasonable to assume that when items are administered in the paper-and-pencil mode and when the same items are administered in the computer mode, they should function equitably and should display similar item characteristics.

Researchers have suggested that when examining empirical data where the researcher is not sure if a correct decision has been made regarding the

PAGE 21

detection of Differential Item Functioning (DIF), as would be known in a Monte Carlo study, it is wise to use at least two methods for detecting DIF (Fidalgo, Ferreres, & Muniz, 2004). The presence of DIF can compromise the validity of the test scores (Lei et al., 2005). This study used various methods to detect DIF. DIF methodology was originally designed to study cultural differences in test performance (Holland & Wainer, 1993) and is normally used for the purpose of investigating item DIF across two examinee groups within a single test. Examining items across test administration platforms can be viewed as the examination of two examinee groups. DIF would exist if items behave differently for one group over the other group, and the items operating differentially make scores from one examinee group to another not comparable. Schwarz et al. (2003) state that DIF methodology presents an ideal method for examining if an item behaves differently in one test administration mode compared to another test administration mode.

Three commonly used methods of DIF methodology (comparison of b-parameter estimates calibrated using the 1-Parameter Logistic model, Mantel-Haenszel, and Logistic Regression) were used to examine the degree that each method was able to detect DIF in this study. Although these methods are widely used in other contexts (e.g., to compare gender groups), there was some question about the sensitivity of the three methods in this context. A secondary purpose of this study was to examine the relative sensitivity of these approaches in detecting DIF between administration modes on the NNAAP.
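As a concrete sketch of one of these methods, the function below computes the Mantel-Haenszel common odds ratio and chi-square for a single dichotomously scored item, stratifying on total score with P&P treated as the reference group and CBT as the focal group. This is an illustration only, not the dissertation's analysis code; the function and variable names are invented, and operational use would add refinements such as sparse-stratum handling and an effect-size classification.

```python
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """Mantel-Haenszel DIF screen for one 0/1 item.

    item  : 0/1 responses to the studied item
    total : total test scores used as the matching criterion
    group : 0 = reference (e.g., P&P), 1 = focal (e.g., CBT)
    Returns the MH common odds-ratio estimate and the MH chi-square statistic.
    """
    item, total, group = map(np.asarray, (item, total, group))
    num = den = a_obs = a_exp = a_var = 0.0
    for k in np.unique(total):                       # one 2x2 table per score stratum
        s = total == k
        A = np.sum(s & (group == 0) & (item == 1))   # reference, correct
        B = np.sum(s & (group == 0) & (item == 0))   # reference, incorrect
        C = np.sum(s & (group == 1) & (item == 1))   # focal, correct
        D = np.sum(s & (group == 1) & (item == 0))   # focal, incorrect
        T = A + B + C + D
        if T < 2 or (A + C) == 0 or (B + D) == 0:
            continue                                 # stratum carries no information
        num += A * D / T
        den += B * C / T
        a_obs += A
        a_exp += (A + B) * (A + C) / T               # expected A under the no-DIF hypothesis
        a_var += (A + B) * (C + D) * (A + C) * (B + D) / (T**2 * (T - 1))
    odds_ratio = num / den                           # values above 1 favor the reference group
    chi_square = (abs(a_obs - a_exp) - 0.5) ** 2 / a_var   # continuity-corrected MH chi-square
    return odds_ratio, chi_square
```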

PAGE 22

Research Questions

The questions addressed in this study were:

1. What proportion of items exhibit differential item functioning as determined by the following three methods: Mantel-Haenszel, Logistic Regression, and the 1-Parameter Logistic Model?

2. What is the level of agreement among Mantel-Haenszel, Logistic Regression, and the 1-Parameter Logistic Model in detecting differential item functioning between the computer-based National Nurse Aide Assessment Program and the paper-and-pencil National Nurse Aide Assessment Program?

These two questions were equally weighted in importance. The first question required the researcher to use DIF methodology. The second question elaborates on the findings of the first question. The researcher compared the results of items flagged by more than one method of DIF methodology.

Importance of the Study

Comparability studies are conducted to determine if comparable examinees from different groups perform equally on the same test items (Camilli & Shepard, 1994). This study looked at comparable examinees, that is, examinees with equal ability, to determine if they perform equally on the same test items administered in different test modes (P&P, CBT). The majority of comparability studies have been conducted using instruments designed for postsecondary education. The use of computers to administer exams will continue to increase with the increased availability of computers and technology

PAGE 23

(Poggio, Glasnapp, Yang, & Poggio, 2005). This platform is advancing quickly in licensure and certification examinations, and it is necessary to have accurate measures because these instruments are largely used to make high stakes decisions (Kubiszyn & Borich, 2000) about an examinee, including determining whether he or she has sufficient ability to practice law, practice accounting, mix pesticides, or pilot an aircraft. These decisions are not only high stakes for the examinee but also for the public to whom the examinee will provide services (i.e., passengers on an airplane piloted by the examinee). Indeed, test results routinely are used to make decisions that have important personal, social, and political ramifications (Clauser & Mazor, 1998). Clauser and Mazor (1998) state that it is crucial that tests used for these types of decisions allow for valid interpretations, and one potential threat to validity is item bias.

With the increased access to computers, comparability studies of operational CBTs with real examinees under motivated conditions are imperative. For this reason, it is critical to continue examining these types of instruments. The National Nurse Aide Assessment Program (NNAAP) is one such instrument (see http://www.promissor.com). Scores on the NNAAP are largely used to determine whether a person will be licensed to work as a nurse aide. The NNAAP is a nationally administered certification program that is jointly owned by Promissor and the National Council of State Boards of Nursing (NCSBN) and administers both a written assessment and a skills test (Muckle & Becker, 2005) to candidates wishing to be certified as a nurse aide in a state or territory where the NNAAP is required. The written assessment is administered to these

PAGE 24

candidates in 24 states and territories. The written assessment is currently administered via paper-and-pencil (P&P) in all but one state. In New Jersey, the assessment is a computer-based test (CBT), meaning the assessment is administered in dual platforms, making this the ideal time to examine the instrument.

The NNAAP written assessment contains 70 multiple-choice items. Sixty of these items are used to determine the examinee's score and 10 items are pretest items which do not affect the examinee's score. The exam is generally administered in English, but examinees with limited reading proficiency are administered the written exam with cassette tapes that present the items and directions orally. A version in Spanish is available for Spanish-speaking examinees. There are 10 items that measure reading comprehension and contain job-related language words. These reading comprehension items are required by federal regulations and do not fall under the accommodations rendered for limited reading proficiency or Spanish-speaking examinees. These reading comprehension items are computed as a separate score. In order to pass the orally administered exam, a candidate must pass these reading comprehension items (Muckle & Becker, 2005).

Across the United States, the pass rate for the 2004 written exam was 93%. A candidate must pass both the written and skills exams in order to obtain an overall pass. For this study, we were only concerned with the written exam; therefore, the skills assessment results were not reported.

PAGE 25

For this study, data from New Jersey were used, as New Jersey is the only state administering the NNAAP as a CBT. The P&P data came from other states with similar total mean scores that administered the same test forms as New Jersey. All data examined contained item level data of first time test takers.

Simulated comparability methodology studies have been conducted using Monte Carlo designs to compare the likelihood that items may react differently in dual platform. Simulated data allow the researcher to know truth because the researcher has programmed the item parameters and examinee ability. However, while comparability studies have been conducted using simulated data, few operational studies exist in the literature needed to support the theories and conclusions drawn from the simulated data. A comparability study using DIF methodology to detect DIF using real data provides a valuable contribution to the literature. While many studies have been conducted using DIF methodology to detect DIF between two groups, generally gender or ethnicity, very little has been done operationally to use DIF methodology to determine if items behave differently for two groups of examinees whose difference is in testing platform.

The methods selected for this study represent common methods used in DIF studies (Congdon & McQueen, 2000; De Ayala, Kim, Stapleton, & Dayton, 1999; Dodeen & Johanson, 2001; Holland & Thayer, 1988; Huang, 1998; Johanson & Johanson, 1996; Lei et al., 2005; Luppescu, 2002; Penny & Johnson, 1999; Wang, 2004; Zhang, 2002; Zwick, Donoghue, & Grima, 1993). The first nonparametric method, Mantel-Haenszel (MH), has been used for a number of years and continues to be

PAGE 26

recommended by practitioners. A second nonparametric method, yielding recommendations similar to the Mantel-Haenszel and with the ability to detect nonuniform DIF, is Logistic Regression (LR). The parametric model, the 1-Parameter Logistic model (1-PL), is from item response theory (IRT) and is highly recommended for use when the statistical assumptions hold. There are various approaches that fall under IRT. Ascertaining the relative sensitivity of these three methods for assessing comparability in this exam will add to the literature examining the functioning of these DIF detection methods.

A study that replicates methodology impacts the field because it can confirm that findings are upheld, which can increase confidence in the methodology and the interpretations of the study. This replication justifies that the methodology is sound. When similar methods are used with different item response data, it further strengthens the methodology of other researchers who have used the same methods for investigating individual data sets and found similar results. If methods do not hold, valuable information is provided to the field showing that such methodology may need to be further refined or abandoned.

Limitations

The use of real examinees may illustrate items that behave differentially in the test administration platform for this set of examinees, but conclusions drawn from the study cannot be generalized to future examinees. Examinees were not randomly assigned to test administration mode; therefore, it is possible that differences may exist by virtue of the group membership. Additionally, a finite

PAGE 27

number of forms was reviewed in this study. The testing vendor will cycle through multiple forms, so results cannot be generalized to forms not examined in this study. In addition, conclusions about the relative sensitivity of the DIF methods cannot be generalized beyond this test.

Definitions

Ability - The trait that is measured by the test. Ability is often referred to as latent trait or latent ability and is an estimate of ability or degree of endorsement.

Comparability - Exists if comparable examinees (e.g., equal ability) from different groups (e.g., test administration modes) perform equally on the same test items (Camilli & Shepard, 1994).

Computer-based test (CBT) - This term refers to any test that is administered via the computer. It can include computer-adaptive tests (CAT) and computerized fixed tests (CFT).

Computerized Fixed Test (CFT) - A fixed-length, fixed-form test. This test is also referred to as a linear test.

Differential Item Functioning (DIF) - The study of items that function differently for two groups of examinees with the same ability. DIF is exhibited when examinees from two or more groups have different likelihoods of success with the item after matching on ability.

Estimation Error - Decreases when the items and the persons are targeted appropriately. Error estimates, item reliability, and person reliability indicate the degree of replicability of the item and person estimates.

PAGE 28

Item Bias - Item bias refers to detecting DIF and using a logical analysis to determine that the difference is unrelated to the construct intended to be measured by the test (Camilli & Shepard, 1994).

Item Difficulty - In latent trait theory, item difficulty represents the point on the ability scale where the examinee has a 50 percent probability of answering an item correctly (Hambleton & Swaminathan, 1985).

Item Reliability Index - Indicates the replicability of item placements on the item map if the same items were administered to another comparable sample.

Item Response Theory - Mathematical functions that show the relationship between the examinee's ability and the probability of responding to an item correctly.

Item Threshold - In latent trait theory, the threshold is a location parameter related to the difficulty of the item (referred to as the b parameter). Items with larger values are more difficult (Zimowski, Muraki, Mislevy, & Bock, 2003).

Logistic Regression - A regression procedure in which the dependent variable is binary (Yu, 2005).

Logit Scale - This term refers to the log odds unit. The logit scale is an interval scale in which the unit intervals have a consistent meaning. A logit of 0 is the average or mean. Negative logits represent easier items or less ability. Positive logits represent more difficult items or more ability (Bond & Fox, 2001).

Mantel-Haenszel - A measure combining the log odds ratios across levels with the formula for a weighted average (Matthews & Farewell, 1988).

PAGE 29

National Nurse Aide Assessment Program (NNAAP) - A nationally administered certification program jointly owned by Promissor and the National Council of State Boards of Nursing. The assessment is made up of a skills assessment and a written assessment (Muckle & Becker, 2005).

Paper-and-Pencil (P&P) - Refers to the traditional testing format where the examinee is provided a test or test booklet with the test items in print. The examinee responds by marking the answer choice or answer document with a pencil.

Person Reliability Index - Indicates the replicability of person ordering expected if the sample were given another comparable set of items measuring the same construct (Bond & Fox, 2001).

Rasch Model - Mathematically equivalent to the one-parameter logistic model and the widely used IRT model, the Rasch model is a mathematical expression to relate the probability of success on an item to the ability measured by the exam (Hambleton, Swaminathan, & Rogers, 1991).
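The two sketches below, written for this rewrite rather than taken from the dissertation, make the Logistic Regression and Rasch Model entries above concrete; all function names, variable names, and numeric values are invented. The first fits a commonly used set of nested logistic regressions for one item, where a significant group term suggests uniform DIF and a significant group-by-score interaction suggests nonuniform DIF:

```python
import numpy as np
import statsmodels.api as sm

def lr_dif(item, total, group):
    """Logistic-regression DIF screen for one 0/1 item.

    Compares three nested logit models:
      M1: matching total score only
      M2: M1 + group indicator          (likelihood-ratio test for uniform DIF)
      M3: M2 + group-by-score term      (likelihood-ratio test for nonuniform DIF)
    """
    item = np.asarray(item, dtype=float)
    total = np.asarray(total, dtype=float)
    group = np.asarray(group, dtype=float)            # 0 = P&P, 1 = CBT

    x1 = sm.add_constant(np.column_stack([total]))
    x2 = sm.add_constant(np.column_stack([total, group]))
    x3 = sm.add_constant(np.column_stack([total, group, total * group]))

    m1 = sm.Logit(item, x1).fit(disp=0)
    m2 = sm.Logit(item, x2).fit(disp=0)
    m3 = sm.Logit(item, x3).fit(disp=0)

    uniform_chi2 = 2 * (m2.llf - m1.llf)              # 1 df
    nonuniform_chi2 = 2 * (m3.llf - m2.llf)           # 1 df
    group_odds_ratio = float(np.exp(m2.params[2]))    # odds ratio for the group effect
    return uniform_chi2, nonuniform_chi2, group_odds_ratio
```

The second shows the Rasch/1-PL response function and one simple way a mode-related shift in a calibrated item difficulty (the b parameter) might be gauged; the calibration values are made up for illustration:

```python
import numpy as np

def rasch_probability(theta, b):
    """Rasch / 1-PL response function: probability of a correct response for
    ability theta and item difficulty b, both expressed on the logit scale."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

print(rasch_probability(theta=0.4, b=0.4))    # 0.5: ability equal to difficulty

# Hypothetical separate calibrations of one item in each administration mode
b_pp, se_pp = 0.40, 0.05                      # invented P&P difficulty and standard error
b_cbt, se_cbt = 0.65, 0.06                    # invented CBT difficulty and standard error
z = (b_cbt - b_pp) / np.sqrt(se_pp**2 + se_cbt**2)
print(f"difficulty shift = {b_cbt - b_pp:.2f} logits, z = {z:.2f}")
```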

PAGE 30

CHAPTER TWO: REVIEW OF THE LITERATURE

This chapter is a review of literature related to computerized testing and mode effect. Literature is presented in three sections: (a) standardized assessment, (b) test equating, and (c) computerized testing. The first section, standardized assessment, presents a brief history of standardized testing from its first widespread uses to today's large-scale administrations using multiple test forms, leading to the need for test equating. The second section, test equating, provides an overview of research design pertaining to equating studies and common methods of test equating. The third section, computerized testing, discusses the advantages and disadvantages of computer-based testing (CBT), comparability studies, types of comparability, and the need for comparability testing.

Comparability is established when test items behave equitably for different administration modes when ability is equal. When items do not behave equitably for two or more groups, differential item functioning (DIF) is said to exist. The nonparametric models, Mantel-Haenszel (MH) and Logistic Regression (LR), and the 1-parameter logistic (1-PL) model are discussed as methods for detecting items that behave differently. Item response theory (IRT) is widely used in today's computer-based testing programs and before its 1-PL

PAGE 31

model can be discussed as a method for detecting DIF, the items will need to be calibrated using one of the IRT models; therefore, an overview of modern test theory is discussed, including some of the models and methods for examining item discrimination and difficulty.

Literature presented in this chapter was obtained via searches of the ERIC FirstSearch and the Educational Research Services databases, workshops, and sessions with Susan Embretson and Steve Reise, Everett Smith and Richard Smith, Trevor Bond, and Educational Testing Services staff. The literature was selected to provide the historical background and to support the study but is not meant to be exhaustive coverage of all subject matter presented.

Standardized Testing

It would be an understatement to say that the use of large scale assessments is more widespread today than at any other time in history. The No Child Left Behind (NCLB) Act of 2001 has mandated that all states have a statewide standardized assessment. Placement tests such as the Scholastic Achievement Test (SAT) and the American College Testing Program (ACT) are required for admission to many institutions of higher education. Certification and licensure exams are offered more often and in more locations than ever before. Wilson (2002) defines a standardized assessment as a set of items administered under the same conditions, scored the same, and resulting in score reports that can be interpreted in the same way.

The first documentation of written tests that were public and competitive dates back to 206 B.C., when the Han emperors of China administered

PAGE 32

examinations designed to select candidates for government service, an assessment system that lasted for two thousand years (Eckstein & Noah, 1993). The Japanese designed a version of the Chinese assessment in the eighth century, which did not last. Examinations did not reemerge in Japan until modern times (Eckstein & Noah, 1993). In their research, Eckstein and Noah (1993) found that modern examination practices evolved beginning in the second half of the eighteenth century in Western Europe. In contrast to the Chinese version, this practice was used in the private sector for candidate selection and licensure issuance and was more of a qualifying examination than a competitive one. These exams soon came to be tied to specific courses of study, leading to the relationship of examinations in schools.

Needless to say, the growth in students led to the growth in numbers of examinations. Examinations provided a defendable method for filling the scarce spaces available in schools at all educational levels. In order to hold fair examinations, many were administered in a large public setting where examinees were anonymous and responses were scored according to predetermined external criteria. Examinations provided a means to reward an examinee for mastery of standards based on a course syllabus. Eventually, examinations were used as a method for measuring teacher effectiveness (England and the United States). In the United States, the government introduced the use of examinations for selection of candidates in the 1870s. These examinations resulted in appointments to the Patent Office, Census Bureau, and Indian Office. In 1883, the Pendleton Act expanded the use

PAGE 33

of examinations for all U.S. Civil Service positions, a practice that continues today and includes the postal, sanitation, and police services (Eckstein & Noah, 1993).

The onset of compulsory attendance forced educators to deal with a range of abilities (Glasser, 1981). Educators needed to make decisions regarding selection, placement, and instruction, seeking success for each student (Mislevy, cited in Frederiksen, Mislevy, & Bejar, 1993). Mislevy (1993) further posits that an individual's degree of success depends on "how his or her unique skills, knowledge, and interests match up with the equally multifaceted requirements of the treatment" (p. 21). Information could be obtained through personal interviews, performance samples, or multiple-choice tests. By far, multiple-choice tests are the more economical choice (Mislevy, cited in Frederiksen, Mislevy, & Bejar, 1993). Items are selected that yield some information related to success in the program of study. A tendency to answer items correctly over a large domain of items is a prediction of success (Green, 1978). This new popularity for administration of tests led to the need for a means of statistically constructing and evaluating tests, and a need for creating multiple test forms to measure the same construct yet ensure security of items (Mislevy, cited in Frederiksen, Mislevy, & Bejar, 1993).

Ralph Tyler outlined his thoughts regarding what students and adults know and can do, and this document was the beginning of the National Assessment of Educational Progress (NAEP). NAEP produces national assessment
results that are representative and historical (some going as far back as the 1960s), monitoring achievement patterns (Kifer, 2001).

Students were administered a variety of standardized assessments in the 1970s and 1980s. It was during the 1980s that the report A Nation at Risk: The Imperative for Educational Reform placed an emphasis on achievement compared to other countries, calling for widespread reform. A Nation at Risk (National Commission on Excellence in Education, 1983) called for changes in public schools (K-12), causing many states to implement standardized assessments. This gave tests a new purpose: to bring about radical change within the school and to provide the tool to compare schools on student achievement (Kifer, 2001). Kifer concludes that new interest in large-scale assessments flourished in the 1990s, and it was the 1990s that witnessed another national effort affecting schools.

No Child Left Behind (NCLB) was signed into law on January 8, 2002 by President George W. Bush, changing assessment practices in many states (U.S. Department of Education, 2005). The purpose of the act was to ensure that all children have an equal, fair, and significant opportunity to obtain a high-quality education. The act called for a measure (Adequate Yearly Progress) of student success to be reported by the state, district, and school for the entire student population as well as the subgroups of White, Black, Hispanic, Asian, Native American, Economically Disadvantaged, Limited English Proficient, and Students with Disabilities. Success is defined as each student in these groups scoring in the proficient
range on the statewide assessment. Many states scrambled to assemble such a test.

As standardized testing expanded, there grew a need to ensure the security of items. This was accomplished through the development of multiple forms. Also termed horizontal equating, multiple forms of a test are created to be parallel; that is, the difficulty and content should be equivalent (Crocker & Algina, 1986; Davier, Holland, & Thayer, 2004; Holland & Wainer, 1993; Kolen & Brennan, 1995).

Test Equating

The practice of test equating has been around since the early 1900s. When constructing a parallel test, the ideal is to create two equivalent tests (e.g., in difficulty and content). However, since this is not easily accomplished (most tests differ in difficulty), models have been established to measure the degree of equivalence of exam forms, a process called test equating (Davier et al., 2004; Holland & Wainer, 1993; Kolen & Brennan, 1995).

Test equating is a statistical process used to adjust scores on different test forms so the scores can be used interchangeably, adjusting for differences in difficulty rather than content (Kolen & Brennan, 1995). Establishing equivalent scores on two or more forms of a test is known as horizontal equating (Crocker & Algina, 1986; Hambleton & Swaminathan, 1985). Vertical equating is commonly used in elementary school achievement tests that report developmental scores in terms of grade equivalents and allows examinees at different grade levels to be
compared (Crocker & Algina, 1986; Hambleton & Swaminathan, 1985; Kolen & Brennan, 1995).

Test form differences should be invisible to the examinee. The goal of test equating is for scores and test forms to be interpreted interchangeably (Davier, Holland, & Thayer, 2004). Lord (1980) posited that for two tests to be equitable, "it must be a matter of indifference to the applicants at every given ability whether they are to take test x or test y" (p. 195). When the observed raw scores on test x are to be equated to form y, an equating function is used so that the raw score on test x is equated to its equivalent raw score on test y. For example, a score of 10 on test x may be equivalent to a score of 13 on test y (Davier et al., 2004).

Further, Davier et al. summarized the five requirements, or guidelines, for test equating:

The Equal Construct Requirement: tests that measure different constructs should not be equated.

The Equal Reliability Requirement: tests that measure the same construct but which differ in reliability should not be equated.

The Symmetry Requirement: the equating function for equating the scores of Y to those of X should be the inverse of the equating function for equating the scores of X to those of Y.
The Equity Requirement: it ought to be a matter of indifference for an examinee to be tested by either one of the two tests that have been equated.

The Population Invariance Requirement: the choice of (sub)populations used to compute the equating function between the scores of tests X and Y should not matter; i.e., the equating function used to link the scores of X and Y should be population invariant (pp. 3-4).

Research Design

Certain circumstances must exist in order to equate scores of examinees on various tests (Hambleton & Swaminathan, 1985). Three models of research design are permitted in test equating: (1) single group, (2) equivalent groups, and (3) anchor-test design.

Single Group: The same tests are administered to all examinees.

Equivalent Groups: Two tests are administered to two groups, which may be formed by random assignment. One group takes test x and the other group takes test y.

Anchor-Test: The two groups take different tests with the addition of an anchor test, or set of common items, administered to all examinees. The anchor test provides the common items needed for the equating procedure (Crocker & Algina, 1986; Davier et al., 2004; Hambleton & Swaminathan, 1985; Kolen & Brennan, 1985; Schwarz et al., 2003).
Test Equating Methods

The classical methods for test equating are linear, equipercentile, and item response theory (IRT) equating. Linear equating is based on the premise that the distribution of scores on the different test forms is the same; the differences are only in means and standard deviations. Equivalent scores can be identified by looking for the pair of scores (one from each test form) with the same z score. These scores will have the same percentile rank. The equation used for linear equating is

$$\mathrm{Lin}_Y(x) = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X)$$

where $\mu$ = the target population mean and $\sigma$ = the target population standard deviation for the form indicated by the subscript (Kolen & Brennan, 1985). Linear equating is appropriate only when the score distributions of test x and test y differ only in means and standard deviations. When this is not true, equipercentile equating is more appropriate (Crocker & Algina, 1986).

Equipercentile equating is used to determine which two scores (one from each test form) have the same percentile rank. Percentile ranks are plotted against the raw scores for the two test forms. Next, a percentile rank-raw score curve is drawn for each of the two test forms. Once these curves are plotted, equivalent scores can be read from the graphs, plotted against one another (test x, test y), and a smooth curve drawn, allowing equivalent scores to be read from the resulting graph (Crocker & Algina, 1986).
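To make the two classical methods concrete, the following is a minimal sketch, not drawn from this study, of how linear and equipercentile equating of raw scores might be computed with NumPy. The function and variable names are illustrative assumptions rather than any established equating package.

    import numpy as np

    def linear_equate(x, scores_x, scores_y):
        """Linear equating: map a raw score x on form X to the form-Y scale,
        assuming the two score distributions differ only in mean and SD."""
        mu_x, sigma_x = np.mean(scores_x), np.std(scores_x, ddof=1)
        mu_y, sigma_y = np.mean(scores_y), np.std(scores_y, ddof=1)
        return mu_y + (sigma_y / sigma_x) * (x - mu_x)

    def equipercentile_equate(x, scores_x, scores_y):
        """Equipercentile equating: find the form-Y score whose percentile
        rank matches the percentile rank of score x on form X."""
        percentile_rank_x = np.mean(np.asarray(scores_x) <= x) * 100.0
        return np.percentile(scores_y, percentile_rank_x)

    # Illustrative use with simulated raw scores for two 60-item forms.
    rng = np.random.default_rng(0)
    scores_x = rng.binomial(60, 0.70, size=1000)   # form X simulated as slightly easier
    scores_y = rng.binomial(60, 0.65, size=1000)
    print(linear_equate(40, scores_x, scores_y))
    print(equipercentile_equate(40, scores_x, scores_y))

Operational equating adds smoothing and standard errors that this sketch omits; it is intended only to show the two score transformations described above.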


An example of this method could be interpreted as follows: Person A earns a score of 16 on test y, which would be a score of 18 on test x; the percentile rank for either test score is 63. Equipercentile equating is more complicated than linear equating and has a larger equating error. In addition, Hambleton and Swaminathan (1985) indicate that a nonlinear transformation is needed, resulting in a nonlinear relation between the raw scores and a nonlinear relation between the true scores, implying that the scores are not equally reliable; therefore, it is not a matter of indifference which test is administered to the examinee.

A third method can be used with any of the design models. Since assumptions for linear and equipercentile equating are not always met when random assignment is not feasible, IRT or an alternative procedure is required (Crocker & Algina, 1986). Further, equating based on item response theory overcomes problems of equity, symmetry, and invariance if the model fits the data (Davier et al., 2004; Hambleton & Swaminathan, 1985; Kolen & Brennan, 1985; Lord, 1980). Currently, there are many IRT or latent trait models. The more widely used are the one-parameter logistic (Rasch) model and the three-parameter logistic (3-PL) model. First, IRT assumes there is one latent trait which accounts for the relationship of each response to all items (Crocker & Algina, 1986). The ability of an examinee is independent of the items. When the item parameters are known, the ability estimate will not be affected by the items. For this reason, it makes no difference whether the examinee takes an easy or a difficult test; therefore, there is no need for equating test scores within the framework of IRT (Hambleton & Swaminathan, 1985). However, when the item parameters are not known, Crocker and Algina (1986) describe a procedure for equating using the Rasch model (a brief sketch in code follows the steps):

1. Estimate the b-parameters for all items, including the anchor items, for both groups of examinees.
2. Calculate the differences between groups for each of the anchor items. Estimate the average of these differences, m.
3. For each group, estimate the latent trait score corresponding to each number-right score.
4. Add the estimate of m to the latent trait scores for group 2, thus transforming these scores to the scale for group 1.
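The mean-shift step in the procedure above can be sketched in a few lines. This is a simplified illustration under the assumption that Rasch difficulty (b) estimates for the common anchor items are already available from separate calibrations in each group; the array names are hypothetical.

    import numpy as np

    def rasch_mean_shift(b_anchor_group1, b_anchor_group2, theta_group2):
        """Place group 2's latent trait estimates on group 1's scale using the
        average anchor-item difference m (steps 2 and 4 above)."""
        b1 = np.asarray(b_anchor_group1, dtype=float)
        b2 = np.asarray(b_anchor_group2, dtype=float)
        # Difference taken as group 1 minus group 2, following the order of the steps.
        m = np.mean(b1 - b2)
        return np.asarray(theta_group2, dtype=float) + m

    # Hypothetical anchor-item difficulties (logits) from each group's calibration
    # and group-2 ability estimates to be placed on group 1's scale.
    b_g1 = [-0.5, 0.2, 1.1, 0.8]
    b_g2 = [-0.8, -0.1, 0.9, 0.4]
    theta_g2 = [-1.0, 0.0, 0.5, 1.3]
    print(rasch_mean_shift(b_g1, b_g2, theta_g2))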


Many test publishers want to use existing paper-and-pencil exams that are directly transferred to a computerized format. For this type of test equating, IRT methods are the most feasible (Crocker & Algina, 1986; Hambleton & Swaminathan, 1985; Hambleton et al., 1991; Holland & Thayer, 1988).

Computerized Testing

Advantages

Technical advances have affected standardized testing, making the administration of tests more convenient, more efficient, and potentially more innovative. Computerized exams offer immediate scoring, item and examinee information, the use of innovative item types, and the possibility of more efficient
exams (Mead & Drasgow, 1993; Mills et al., 2002; Parshall et al., 2002; Wainer, 2000; Wall, 2000). Computerized exams can provide the examinee with a score immediately after completion of the test, a popular feature among examinees, who learn their status and can plan for their next course of action (Mead & Drasgow, 1993; Mills et al., 2002; Parshall et al., 2002; Wainer, 2000; Wall, 2000). This feature can be used to report data by item, indicator, or standard, generating an expanded score report.

The use of computers provides the option to capture additional data on the examinee at the item level, such as the time each examinee spent on an item, measured from the time the examinee first saw the screen to the time the examinee made a selection or moved to another item. This feature allows the collection of data for items skipped or omitted, items reviewed, and items that were changed (Parshall, 1995). A feature available in some computerized tests is flexibility, which allows the examinee to review, skip, or omit items but return to these items at a later time during the test administration to make or change a response (Parshall et al., 2002).

The use of computers can enhance tests as well as the learning experience through the use of simulations (e.g., for airline pilots, medical research, technology skills), sound, and product-based assessment (Choi, Kim, & Boo, 2003; Harmes et al., 2004; Jones, Parshall, & Harmes, 2003; Parshall et al., 2002; Wall, 2000). Computerized simulations are often used to answer questions that are difficult to approach, such as a medical student using a
simulation to practice making decisions with patients that could cause death in reality. The use of innovative item types replacing traditional multiple-choice formats is just beginning to expand. Tests can be administered that target the examinee's ability, possibly resulting in a shortened and less expensive test and certainly creating a more efficient test (Mills et al., 2002; Parshall et al., 2002; Wainer, 2000; Wall, 2000). Finally, a new advantage coming to light is the use of assistive technology for persons with disabilities (Wall, 2000). Text readers can help the visually disabled, voice recognition and dictation technology can aid those with physical disabilities, touch screens and smart boards provide a means of assessment for those with difficulty controlling fine motor skills, and web-based testing provides accessibility to those unable to travel (Wall, 2000).

Disadvantages

Research has also documented possible disadvantages to the use of computerized testing. The examinee's experience and comfort with computers present a complicated set of issues that can be problematic (Godwin, 1999; Mazzeo & Harvey, 1988; Wall, 2000). The examinee's perception of his or her experience with computers can also be disadvantageous. The examinee may feel unskilled in the use of a computer, causing increased anxiety leading to measurement error. An examinee who is weak in computer skills but perceives himself or herself to be competent may have more difficulty completing the tasks required of the assessment, tasks that would not be present in the paper-and-pencil format. Accessibility to computers and/or the Internet may be limited to
higher-income individuals, disadvantaging those who do not have access (Wall, 2000). Individual items may become more or less difficult in the computer version of an exam as compared to the paper-and-pencil (P&P) version, possibly resulting in a shift in the overall difficulty of the test (Mazzeo & Harvey, 1988). One example of an item that could be more difficult in the computer mode is an item associated with a long reading passage, because reading material onscreen is more demanding and slower than reading print (Parshall, 1992). Other issues that could be problematic include speededness, innovative item types, test scoring, and violations of statistical assumptions for establishing comparability (Mazzeo & Harvey, 1988; Parshall et al., 2002; Wall, 2000; Wang & Kolen, 1997). If differences are found on an item from one test administration mode to another, the item is measuring something beyond ability. In this case, the item would measure test administration mode, which is not relevant to the purpose of the test.

Comparability

With the advantages of using a computer to administer an exam, many test publishers are transforming their traditional paper-and-pencil forms to a computerized form, but this does not mean that the test items behave identically, and the same scores and norms may not apply. When dual platforms are used simultaneously, these factors are even more relevant. This has caused professional associations such as the American Psychological Association (APA) and the Association of Test Publishers (ATP) to publish guidelines specific to the
use of computerized exams (APA, 1986; ATP, 2002). In order to ensure score interpretability, both of these organizations (APA, ATP) strongly call for the comparability of computerized forms of traditional tests to be established before administering the test in an operational setting.

Early Comparability Studies

Early studies were conducted to determine if computer-based exams measured the same construct as the paper-and-pencil exam. Two well-known literature reviews of early comparability studies were conducted by Greaud and Green (1986) and Mazzeo and Harvey (1988). A general finding in these reviews was that tests with multiscreen items tended to produce mode differences. Spray et al. (1989) found that flexibility in the software was responsible for mode effects. That is, the degree to which the examinee was able to move from item to item, omit items, or review items was related to mode effect. Tests that allowed for flexibility exhibited fewer mode differences.

Bergstrom, Lunz, and Gershon (1992) investigated the effects of altering the test difficulty on estimated examinee ability and on test length. The 225 students were assigned to one of three test conditions or difficulties (e.g., easy, medium, or hard) in which the test was targeted at a 50%, 60%, or 70% probability of a correct answer. The researchers concluded that changing the probability of a correct response does not need to affect the estimation of the examinee's ability. By increasing the number of items, an easier test can still reach specified levels of precision.
The earliest well-known operational computer-based exam is the computerized version of the Armed Services Vocational Aptitude Battery (ASVAB). Greaud and Green (1986) investigated mode effect on two subtests of the ASVAB (numerical and coding speed). In the study, 50 college students took shortened versions of the subtests. Mode effect was found for the coding speed exam, which may be due to the speeded characteristic of the exam.

Segall and Moreno (1999) reviewed some additional early studies of the ASVAB. Three of these studies that have comparability implications are noted here. The first study compared the efficiency of an experimental verbal test administered in adaptive (CAT) and conventional (e.g., non-adaptive) modes by computer. Reliability and validity were computed; the results corroborated the theoretical advantage that the decreased test length of a CAT could achieve a particular level of precision. A second, similar study looked at three subtests of the ASVAB in CAT and P&P modes. The results confirmed prior beliefs that the shorter length of a CAT could measure the same construct as a P&P test with equivalent or higher precision. Third, the Joint Services Validity Study investigated the comparability of the P&P ASVAB taken prior to enlistment, an experimental CAT taken during basic training, and a P&P ASVAB subtest taken during basic training. Results indicated only one significant difference, which favored the CAT-ASVAB. The conclusion was that both the CAT and the P&P predicted performance equally well.
Recent Comparability Studies

One of the main concerns in earlier computer-based tests was the available hardware, which made it difficult to administer an operational CBT. The more current comparability studies have been able to address operational CBTs as well as the related issue of administering a CBT under motivated conditions.

Schaeffer, Reese, Steffan, McKinley, and Mills (1993) used the 3-Parameter Logistic model, MH, and LR to investigate the relationship between the Graduate Record Exam (GRE) in the linear CBT mode and in the P&P mode, using the same items. The volunteers, who had taken an earlier version of the P&P GRE, were retested and were assigned to either the CBT or the P&P version of the exam. A number of items were flagged, but differences in item functioning were expected due to sample fluctuation. The authors concluded that item-level mode effect was not found. However, slight test-level mode differences were found in the Quantitative subtest. It is notable that items flagged for exhibiting DIF did not appear to be from any particular item type or content, nor was there anything systematic about the location of the item on the test. Items were not consistently flagged by each of the three methods.

Parshall and Kromrey (1993) investigated the relationship of demographic variables collected by survey for the above-mentioned GRE using residual difference scores. The survey items included questions about computer experience and the use of the interface tools on the GRE. Relationships between the examinee characteristics and mode effect, while slight, did exist. In general, examinees tended to perform better on the computer; the authors suggested this
could either be a result of practice effect, or positive and negative differences for individual items may have cancelled each other out at the test level.

In an impact study, Miles and King (1998) investigated whether gender interacts with administration mode to produce differences in mean scores on personality measures. A statistically significant main effect was present for gender and mode effect; however, no significant interaction effect was found for gender and administration mode.

Neuman and Baydoun (1998) suggested several sources of possible mode effect, including computer anxiety and the use of different motor skills. These hypotheses were investigated (cross-mode equivalence) by administering a battery of office skills tests to 411 undergraduate students. No mode difference was found, leading the authors to conclude that CBT versions of P&P tests can be developed that are psychometrically similar (e.g., congeneric) to P&P tests in factor structure and predictive validity (p. 82).

Pommerich and Burden (2000) administered a non-operational subtest to 36 examinees. Each examinee took one test in one of the content areas of English, math, reading, or science. Following the administration, examinees were asked questions as to their approach for solving the test items. While the small sample size causes results to be speculative, presentation differences (e.g., page and line breaks, passage and item layout features, highlighting, and navigation features) were found to possibly contribute to mode effect.

Julian, Wendt, Way, and Zara (2001) investigated the comparability of two exams administered by the National Council of State Boards of Nursing.
Approximately 2,000 examinees took the National Council Licensure Exam for Registered Nurses in P&P mode while another 2,000 examinees took the same exam in CAT mode during the regular July administration. Approximately 2,000 examinees took the National Council Licensure Exam for Practical/Vocational Nurses in P&P mode while another 2,000 took the same exam in CAT mode during a special July administration. Subgroups of 500 examinees took the exam under modified conditions where flexibility was built into the exam so that examinees were able to review and omit items. In the original design, examinees were told that the CAT would not count; therefore, a beta test design was developed to be able to investigate the test administration modes under motivated conditions. The beta test represented all items in the test plan as well as all difficulty levels. Within the total group and the subgroups, the only statistically significant difference when comparing passing rates was for African Americans, who performed better on the CAT.

Schwarz et al. (2003) used DIF statistics (the Linn-Harnisch procedure and the standardized mean difference) to examine item-level mode effects for two exams: InView and the Test of Adult Basic Education (TABE). The InView, administered to students in grades four through eight, resulted in several flagged items at many levels for the online comparison with the P&P platform. The TABE had two flagged items but resulted in differences between the computer-administered and P&P groups throughout the ability ranges.

Jones, Parshall, and Ferron (2003) examined data from a pilot of a linear version of the Medical College Admissions Test conducted by the Association of
American Medical Colleges in April 2003. Three vendors created the software used to administer the exam internationally to 45 examinees. The exam consisted of three sections: a 60-item Verbal Reasoning subtest, a 77-item Physical Sciences subtest, and a 77-item Biological Sciences subtest. The data were compared by software vendor to the P&P version of the exam with the same items administered to a group of international examinees, and to a P&P version of the same exam administered to a domestic (United States) group of examinees. Data were examined by comparing b-parameter estimates using the Rasch model. This produced a comparability study of mode effect and examinee group difference. The sample size was found to be too small to confidently draw conclusions about the population, but it did show that items tended to behave differently across modes and examinee groups in the sample.

Poggio et al. (2005) conducted a study using a large-scale state math assessment in the K-12 setting. Students in grade seven were able to voluntarily participate in the fixed-form computerized testing program. There were four paper-and-pencil forms of the exam, which were transferred to CBT, and the forms were randomly assigned to students. There were 32,000 students who took, as required, the paper-and-pencil version of the exam, and 2,861 students also took the CBT version of the exam. The researchers used the 3-parameter logistic model with a common guessing parameter to examine the data for mode effect. Results indicated that very little difference existed between student scores on the P&P and the CBT. Of the items flagged, the items appeared to be more difficult
in the CBT version, and the researchers were not able to determine the factor(s) that might account for the differential performance after a detailed study.

Types of Comparability Issues

Administration Mode

While the field continues to realize the many benefits of computer-based assessments, including the opportunity to use challenging stimuli, the ability to collect data online (including response time), and the nearly instantaneous computation of statistical indices (Mead & Drasgow, 1993), it is also recognized that roadblocks abound. Another issue that can be problematic is that tests that are targeted below an examinee's ability must increase in length (Bergstrom et al., 1993). Also, when test developers use a current item bank developed for a P&P test to implement a CAT, there may not be enough difficult items for all of the examinees to have a 50% probability of a correct response (Bergstrom et al., 1993).

Another problem that may be less obvious is that computerized forms may be more difficult than the P&P forms (Mead & Drasgow, 1993). This may be due to aspects of the presentation of the item on the computer screen compared to the presentation of the item in the P&P format. As an example, a score of 50 on a given CBT may be equivalent to a score of 55 on the related P&P. A two-page P&P layout cannot be displayed on a single computer screen, which can advantage certain examinees in a dual-platform situation (Pommerich, 2002). Further, Pommerich (2002) stated that decisions made for item presentation on a computer-only platform can affect examinee performance; this is especially true
for items with long text-based passages or multiple figures or tables, which may require scrolling or require the examinee to navigate through several pages to locate the answer. However, examinees sometimes indicate a preference for the option of taking an exam via computer (Godwin, 1999; Parshall, 1995). This can become a disadvantage to the student offered a choice between the two administration forms who chooses the computerized format, thinking he or she will perform better. Thus, while computers can increase the reliability of test scores by reducing sources of measurement error, such as errors caused by penciling in an unintentional answer or failing to completely erase a response, other sources of error related to the computer may be present, including errors related to the use of the keyboard or mouse (Parshall, 1995).

Several issues can lead to slight changes in the format of the exam from the P&P; these changes can lead to mode effect. It is imperative that a study be conducted to detect any test-delivery medium effect when converting from P&P to CBT (Sawaki, 2001). As new features continue to emerge in this field, comparability studies will continue to be called for.

Tests can be enhanced on the computer. Some of these enhancements include the use of simulations (e.g., for airline pilots, medical research) and product-based assessment. Most recently, simulations are being added as an additional item type. The Uniform CPA Examination includes two simulations for each of the four sections (Johnson, 2003). The Teacher Technology Literacy Exam in Florida (designed to meet recent legislative mandates of the NCLB Act of 2001) uses simulations where the examinee must
demonstrate successful performance on the technology indicator (Harmes et al., 2004). The field is just beginning to discover the possibilities that exist now that computers are economical, fairly reliable, and readily available.

Historically, good test practice has always encouraged thorough analyses (e.g., test equating, DIF analyses, and item analyses) whenever any changes to the test program are implemented, to ensure that the measure of the intended construct is not affected. These changes have included varying the items within the item bank to be included on parallel test forms and examining differences in groups (e.g., gender, ethnicity, socioeconomic status, or to update norms for the current audience) (Crocker & Algina, 1986; Davier et al., 2004; Holland & Wainer, 1993; Kolen & Brennan, 1995). The logical step is to continue to ensure that the test is comparable to the original version when converting to a computerized mode. Mead and Drasgow (1993) concluded that timed power tests can be constructed to measure the same construct as a paper-and-pencil test when they are computerized by "[the] careful processes" (p. 456). Indeed, tests can be constructed to provide accurate measures of an examinee's ability.

Software Vendors

The importance of examining any changes made to a test, in order to confirm that the test still measures the intended construct and that differences across test forms or test administration modes do not result in the measurement of something other than the intended construct, has been stressed throughout this review. Of equal importance is the need to compare the test
forms when a computerized exam has been presented by more than one software vendor. Differences may exist due to the flexibility present (or not present) in one CBT administration software versus another. Software that does not allow flexibility can be a source of difference across modes (Mazzeo & Harvey, 1988; Pommerich & Burden, 2000). Because software aspects differ across vendors, this can impact the item. Complex item types may be more challenging when displayed by different vendors. Mazzeo and Harvey (1988) have suggested that innovative items requiring multiple screens or complex graphics may exhibit a mode effect. Items where passages and questions require multiple screens are more difficult (Mead & Drasgow, 1993). Mead and Drasgow (1993) suggest that the graphics of an item depend on the quality of the display device, the time the item is displayed, and the size of the computerized item compared to the P&P item. They further suggest that the need to resolve this issue may be eliminated as the quality of computers increases.

Since computer skills may be a factor for examinees, Godwin (1999) called for test software with a well-designed user interface that would "facilitate, not hinder, the testing process" (p. 4). Godwin (1999) further suggested the value of surveying examinees after the pilot. The insights provided can be used to influence the design and development of the software before the field test (Godwin, 1999). One software vendor may have been more thorough, or more successful, in carrying out these types of steps as they developed the software interface.
Professional Recommendations

The importance of highly comparable test modes has been strongly called for by professional associations. The APA (1986) states, "When interpreting scores from the computerized versions of conventional tests, the equivalence of scores from computerized versions should be established and documented before using norms or cutting scores obtained from conventional tests" (p. 16). This statement explicitly addresses the importance of investigating the equivalence of scores generated from computer-administered exams to the scores generated from paper-and-pencil exams. Computer-based tests should be "designed and developed with an underlying sound systematic and scientific basis" (ATP, 2002, p. 16). According to the Association of Test Publishers' Guidelines for Computer-Based Testing (2002), the following guidance is provided for test users:

4.2 If test scores from different modes of administration are considered interchangeable, equivalency across modes of administration should be documented in a test user's or technical test manual.

4.4 If computer-based tests are normed or standardized with paper-and-pencil test data, comparability or equivalence of the paper-and-pencil scores to the computer-based scores should be established before norms are applied to scored test data. This may be especially important if the computer-based
test is time constrained, or includes extensive reading passages, scrolling text, or graphical images (p. 19).

Differential Item Functioning (DIF)

As the standardized testing movement continues to expand, becoming incrementally more high stakes for the K-12 population (e.g., No Child Left Behind mandates), the postsecondary population (e.g., SAT, ACT, GRE), and the licensure population (e.g., licensing a person to practice medicine), a new focus has surfaced. The focus is on potential item and test bias, defined as an item or test that behaves differently for two or more groups of examinees (e.g., males and females) when ability is equal. Several authors call for empirical evidence of equity and fairness associated with tests (Bolt, 2000; Camilli & Shepard, 1994; Clauser & Mazor, 1998; Gierl et al., 2001; Penfield & Lam, 2000; Standards for Educational and Psychological Testing, 1999; Welch & Miller, 1995). This fairness is often concluded when a test is considered free of bias. DIF is defined as existing for an item when members of two or more groups matched on ability have different probabilities of getting an item correct (Bolt, 2000; Camilli & Shepard, 1994; Clauser & Mazor, 1998; De Ayala et al., 1999; Gierl et al., 2001; Hambleton et al., 1991; Holland & Thayer, 1988; Penfield & Lam, 2000; Welch & Miller, 1995). Further, DIF is the uninterpreted relative difficulty when an item behaves differently for groups with comparable ability. Bias, more appropriately, refers to detecting the DIF and using a logical analysis to determine that the difference is unrelated to the construct intended to be measured by the test
(Camilli & Shepard, 1994; Clauser & Mazor, 1998; Embretson & Reise, 2000; Hambleton et al., 1991; Holland & Thayer, 1988; Penfield & Lam, 2000; Smith & Smith, 2004). A panel of experts usually reviews the item. It is the conclusion drawn by this panel that determines if indeed there is logical evidence of bias for the item exhibiting DIF (De Ayala et al., 1999). Once an item has been determined to be biased, generally it is revised or eliminated from the test.

There are two ways bias is manifest, which are described by Embretson and Reise (2000). First, external bias refers to test scores that have different correlations with non-test variables for different groups, affecting the predictive validity of the measure. Second, when internal relations (e.g., covariances among item responses) differ across groups, this is called measurement bias, which affects the measurement scale so that it is not equivalent across groups. In order to use a measure to compare examinees (e.g., on level of ability or trait), it is imperative that the scores be on the same scale. DIF occurs when a test item does not have the same relationship to a latent variable across at least two groups (Embretson & Reise, 2000).

In 1910, Alfred Binet first investigated the idea that items may behave differently for two or more groups of examinees with equal ability (Camilli & Shepard, 1994). Binet tested children with low socioeconomic status and was concerned that rather than measuring mental ability, he may have been measuring cultural training, which could include scholastic training, language, and home training. Not a lot of attention was paid to this area until the 1970s, when a moratorium was issued by the NAACP as a response to the practice of using
exams that may be biased against ethnic/racial groups as a means for making decisions about job placement, promotions, and raises (Camilli & Shepard, 1994; Hambleton et al., 1991).

Differential item functioning (DIF) analyses were originally designed to study cultural differences in test performance (Holland & Wainer, 1993) and are normally used for the purpose of investigating item bias across examinee groups within a single test. Examining items across test administration platforms can be viewed as a type of comparability study, as items operating differentially make scores from one examinee group not comparable to those from another group. It is reasonable to assume that when the items are administered in the paper-and-pencil mode and when the same items are administered in the computer mode, they should function equitably and should display similar item characteristics. Therefore, DIF methodology presents an ideal and well-studied method for examining whether an item behaves differently in one test administration mode compared to the other test administration mode (Schwarz et al., 2003).

DIF methods have been conducted that are based on IRT, logistic regression, contingency tables, and effect sizes (Camilli & Shepard, 1994; Clauser & Mazor, 1988; Holland & Thayer, 1988; Penfield & Lam, 2000). Currently, research is being conducted to examine proper methods for detecting DIF for polytomous items (Chang, Mazzeo, & Roussos, 1995; Congdon & McQueen, 2000; Dodeen & Johanson, 2001; Johanson & Johanson, 1996; Penfield & Lam, 2000; Welch & Miller, 1995; Zwick et al., 1993), including surveys and performance-based assessments. One of the features a researcher reviews
when selecting a method is the type of methodology, referring to parametric models, which require more complicated computation and a large sample size, and nonparametric models, which are not as complex and do not require a large sample size. Another important feature to consider is how effective and appropriate the model is for nonuniform DIF. Uniform DIF is said to exist when the relative advantage of one group over the other group is relatively uniform across the ability scale (Camilli & Shepard, 1994; Clauser & Mazor, 1998; Penfield & Lam, 2000). For example, when examining scores of low-ability examinees, Group 1 performs better; this pattern continues with the same magnitude for the mid-range and higher-ability examinees. In contrast, nonuniform DIF is said to exist when the advantage of one group is not uniform along the ability continuum (Camilli & Shepard, 1994; Clauser & Mazor, 1998; Hambleton et al., 1991; Penfield & Lam, 2000). This can occur in two forms: (1) a noncrossing form, where one group has the advantage over the other group throughout the continuum but at varying magnitudes as ability changes, and (2) a crossing form, where one group has an advantage at low ability levels and the other group has an advantage at high ability levels (Camilli & Shepard, 1994; Penfield & Lam, 2000; Penny & Johnson, 1999). The latter form has more of an impact on DIF detection because it can be more difficult for statistics to detect DIF if the statistic is not able to detect nonuniform DIF, and it can indicate validity issues with the item which are beyond the realm of DIF (Camilli & Shepard, 1994; Hambleton et al., 1991; Penfield & Lam, 2000). Parametric models tend to be more useful for detecting nonuniform DIF (Penfield & Lam, 2000).
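To illustrate the distinction, here is a minimal sketch, not from this study, that generates item-correct probabilities for matched reference and focal examinees under a 2-PL-style response function; shifting only the difficulty produces uniform DIF, while changing the discrimination produces the crossing form of nonuniform DIF. All values are invented for illustration.

    import numpy as np

    def p_correct(theta, a, b):
        """2-PL-style probability of a correct response at ability theta."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    theta = np.linspace(-3, 3, 7)                      # matched ability levels

    # Uniform DIF: the item is harder for the focal group at every ability level.
    p_ref_uniform = p_correct(theta, a=1.0, b=0.0)
    p_foc_uniform = p_correct(theta, a=1.0, b=0.5)

    # Crossing nonuniform DIF: the group advantage reverses along the ability scale.
    p_ref_crossing = p_correct(theta, a=1.0, b=0.0)
    p_foc_crossing = p_correct(theta, a=1.8, b=0.0)

    for t, pr, pf in zip(theta, p_ref_crossing, p_foc_crossing):
        print(f"theta={t:+.1f}  reference={pr:.2f}  focal={pf:.2f}")

Printing the crossing case shows the reference group ahead at low ability and the focal group ahead at high ability, which is exactly the pattern that a statistic testing only for uniform DIF can miss.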


However, Jodoin and Gierl (2001) state that nonuniform DIF occurs less frequently than uniform DIF; therefore, it is appropriate to focus on uniform DIF.

Typically, DIF methodology is divided into two types: (1) methods that condition on an estimate of the true latent ability (e.g., 1-PL, 3-PL), and (2) methods that use an observed score as the conditioning variable. This latter type includes contingency table approaches such as the Mantel-Haenszel, standardized differences in proportion correct, and logistic regression (Hambleton et al., 1991; Holland & Thayer, 1988; Welch & Miller, 1995).

Three widely cited methods for detecting DIF that will be elaborated upon here are (1) Mantel-Haenszel, (2) Logistic Regression, and (3) Item Response Theory. Researchers have suggested that when examining empirical data, where the researcher is not sure whether a correct decision has been made regarding the detection of DIF (as would be known in a Monte Carlo study), it is wise to use at least two methods for detecting DIF (Fidalgo et al., 2004). Holland and Thayer (1988) highly recommend the MH as a method for the detection of DIF. Zwick et al. (1993) reported promising results when using the MH for detecting DIF on a performance assessment. De Ayala et al. (1999) investigated six methods (likelihood ratio, logistic regression, Lord's chi-square, MH, Exact Signed Area, H statistic) for detecting DIF and found MH and logistic regression (LR) to have the highest correct identification rates of the six methods. Huang (1998) investigated the reliability of DIF methods applied to achievement tests and reported that the MH was more reliable at detecting DIF than LR. Zhang (2002) used the MH and LR
to examine the interaction of gender and ethnicity, reporting that the MH was powerful in detecting uniform DIF and can produce an effect size indicator. Dodeen and Johanson (2001) compared the MH and LR to determine the effect and magnitude of gender DIF in an attitudinal data set and reported that both methods yielded similar results with respect to uniform DIF, noting that the effect size indicator in the MH is advantageous. Johanson and Johanson (1996) discussed the use of the MH as an aid in the detection of DIF when constructing a survey. Penny and Johnson (1999) found that Type I error rates of the MH chi-square test of the null hypothesis that there is no DIF were not inflated when the MH method was used and the data fit the Rasch model.

Lei et al. (2005) found IRT to be powerful in detecting both uniform and nonuniform DIF. Wang (2004) used the Rasch model to detect DIF and found the method yielded well-controlled Type I error and high power of DIF detection. Congdon and McQueen (2000) successfully used the Rasch model to detect rater severity. Luppescu (2002) compared the Rasch model and hierarchical linear modeling (HLM) and reported that both methods were equally sensitive to the detection of DIF. As can be seen, several studies have used the methods of LR, MH, and IRT for the detection of DIF. In fact, two of these types of DIF methodology (MH and IRT) represent the most commonly used method for that type of model (nonparametric and parametric, respectively). Table 1 summarizes the findings of these studies.
Table 1
Summary of DIF studies ordered by year

Holland & Thayer (1988). Method: Mantel-Haenszel (MH). Recommendation: Highly recommended.
Zwick, Donoghue, & Grima (1993). Methods: MH, Generalized Mantel-Haenszel (GMH), Standardized Mean Difference (SMD), Yanagawa & Fuji (YF). Recommendation: MH more powerful than GMH for conditions with constant DIF; MH more sensitive to mean differences; GMH more sensitive to between-group differences; SMD a useful supplement to chi-square for constant DIF; YF warrants further study.
Johanson & Johanson (1996). Method: MH. Recommendation: Successfully used the MH for item analysis of a survey.
Huang (1998). Methods: MH, LR. Recommendation: MH was more reliable than LR.
De Ayala, Kim, Stapleton, & Dayton (1999). Methods: Likelihood ratio, logistic regression (LR), Lord's chi-square, MH, exact signed area, H statistic. Recommendation: MH and LR had the highest correct identification of DIF of the six methods.
Penny & Johnson (1999). Method: MH on items fitting and not fitting the Rasch model. Recommendation: Type I error rates not inflated when MH is used and data fit the Rasch model.
Congdon & McQueen (2000). Method: Rasch. Recommendation: Successful in detecting rater severity.
Dodeen & Johanson (2001). Methods: MH, LR. Recommendation: Similar results at detecting uniform DIF; the effect size indicator in MH makes it more advantageous than LR.
Zhang (2002). Methods: MH, LR. Recommendation: MH more powerful than LR at detecting uniform DIF and can produce an effect size indicator.
Luppescu (2002). Methods: Rasch, Hierarchical Linear Modeling (HLM). Recommendation: Both methods equally sensitive to the detection of DIF.
Wang (2004). Method: Rasch. Recommendation: Controlled Type I error and had high power in detection of DIF.
Lei, Chen, & Yu (2005). Method: IRT. Recommendation: Powerful in detecting uniform and nonuniform DIF.

Mantel-Haenszel

The most popular current non-IRT approach for the detection of DIF is the Mantel-Haenszel (Hambleton et al., 1991). The Mantel-Haenszel is a widely used
estimate that is an asymptotic chi-square statistic with one degree of freedom (Penny & Johnson, 1999) computed from a set of stratified K x 2 x 2 tables (Matthews & Farewell, 1988; Mould, 1989; Welch & Miller, 1995). Mantel-Haenszel, also referred to as the logrank method, has been used by the medical community to examine differences in experience (e.g., survival) for two or more groups to determine if these differences are statistically significant (Mould, 1989). Matthews and Farewell (1988) define the statistic used in medical statistics as

$$\widehat{OR}_{MH} = \frac{\sum_{i=1}^{k} a_i (B_i - b_i)/N_i}{\sum_{i=1}^{k} b_i (A_i - a_i)/N_i}$$

where

a_i = number successful in Group 1
b_i = number successful in Group 2
A_i - a_i = number of failures in Group 1
B_i - b_i = number of failures in Group 2
A_i = total for Group 1
B_i = total for Group 2
N_i = total for Groups 1 and 2

This statistic is easy to compute and is not affected by zeros in the tables. Research has shown that "statistical properties of this estimate compare very favourably with the corresponding properties of estimates which are based on logistic regression models" (Matthews & Farewell, 1988, p. 219). The Mantel-Haenszel statistic tests H_0 against the alternative hypothesis (Clauser & Mazor, 1998; Holland & Thayer, 1988; Welch & Miller, 1995):
$$H_1: \frac{P_{Rj}}{Q_{Rj}} = \alpha \, \frac{P_{Fj}}{Q_{Fj}}, \qquad \alpha \neq 1$$

where

P_Rj = number of correct responses for the reference group
Q_Rj = number of incorrect responses for the reference group
P_Fj = number of correct responses for the focal group
Q_Fj = number of incorrect responses for the focal group

In 1988, Holland and Thayer applied the statistic for use in detecting DIF in test items. The statistic represents the groups reported in Table 2, estimates alpha, and is written as

$$\hat{\alpha}_{MH} = \frac{\sum_j A_j D_j / T_j}{\sum_j B_j C_j / T_j}$$

where

A_j = number of correct responses for the reference group
B_j = number of incorrect responses for the reference group
C_j = number of correct responses for the focal group
D_j = number of incorrect responses for the focal group
N_Rj = total for the reference group
N_Fj = total for the focal group
T_j = total for both the reference and focal groups
Table 2
Groups defined in the MH statistic

Group        Correct   Incorrect   Total
Reference    A_j       B_j         N_Rj
Focal        C_j       D_j         N_Fj
Total        M_1j      M_0j        T_j

For each group, reference (R) and focal (F), the number correct and the number incorrect are recorded for each item based on total score to control for ability. This information is used to compute the odds ratio and log-odds for each item. For the purpose of this model, the null hypothesis is H_0: alpha = 1. The MH has an associated test of significance distributed as a chi-square, which takes the form (Clauser & Mazor, 1998; Penny & Johnson, 1999)

$$MH\text{-}\chi^2 = \frac{\left( \left| \sum_j A_j - \sum_j E(A_j) \right| - \tfrac{1}{2} \right)^2}{\sum_j \operatorname{var}(A_j)}$$

where

$$\operatorname{var}(A_j) = \frac{N_{Rj} N_{Fj} M_{1j} M_{0j}}{T_j^2 (T_j - 1)}$$

The variance of A_j is the product of the group total for the reference group (N_Rj), the group total for the focal group (N_Fj), the column total for items correct (M_1j), and the column total for items incorrect (M_0j), divided by the squared grand total (T_j^2) multiplied by the grand total minus 1 (T_j - 1). The Mantel-Haenszel chi-square
(MH-CHISQ) is the uniformly most powerful unbiased test of H_0 versus H_1 (Welch & Miller, 1995).

Another value produced by the Mantel-Haenszel is the indicator $\Delta_{MH}$. The equation is stated as (Clauser & Mazor, 1998)

$$\Delta_{MH} = -2.35 \, \ln(\alpha_{MH}) = -\frac{4}{1.7} \, \ln(\alpha_{MH})$$

In essence, the $\Delta_{MH}$ is a logistic transformation. It takes the log of $\alpha_{MH}$ and transforms the scale to be symmetric around zero. The $\Delta_{MH}$ is produced by multiplying the resulting value by -2.35 (Clauser & Mazor, 1998; Holland & Thayer, 1988). This places the value on the Educational Testing Services delta scale. On this scale, items favoring the reference group have values from minus infinity to zero, and items favoring the focal group have values from zero to infinity (Clauser & Mazor, 1998). Both the $\alpha_{MH}$ and the $\Delta_{MH}$ are measures of effect size (Clauser & Mazor, 1998). The symbol $\alpha_{MH}$ refers to the average factor by which the odds that a member of R is correct on the studied item exceed the corresponding odds for a comparable member of F, where R is the reference group and F is the focal group (Holland & Thayer, 1988). Values greater than 1 refer to items on which the reference group performed better than the examinees in the focal group. The symbol $\Delta_{MH}$ is the average amount more difficult the item is for the reference group examinee compared to the focal group examinee. Negative values of $\Delta_{MH}$ refer to items that were easier for examinees in the reference group compared to members of the focal group.
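As a concrete illustration of the computations above, the following is a minimal sketch, not the study's implementation, that builds the K x 2 x 2 tables by total-score level and returns the $\hat{\alpha}_{MH}$ estimate, the MH chi-square with its continuity correction, and $\Delta_{MH}$. The function and variable names are illustrative.

    import numpy as np

    def mantel_haenszel_dif(item, total, group):
        """item: 0/1 responses to the studied item; total: matching total scores;
        group: 'R' (reference) or 'F' (focal) labels. Returns (alpha, chi2, delta)."""
        item = np.asarray(item); total = np.asarray(total); group = np.asarray(group)
        num = den = sum_a = sum_ea = sum_var = 0.0
        for j in np.unique(total):                       # one 2 x 2 table per score level
            at = total == j
            A = np.sum(item[at & (group == "R")] == 1)   # reference correct
            B = np.sum(item[at & (group == "R")] == 0)   # reference incorrect
            C = np.sum(item[at & (group == "F")] == 1)   # focal correct
            D = np.sum(item[at & (group == "F")] == 0)   # focal incorrect
            T = A + B + C + D
            if T < 2 or (A + C) == 0 or (B + D) == 0:
                continue                                 # skip uninformative strata
            num += A * D / T
            den += B * C / T
            sum_a += A
            sum_ea += (A + B) * (A + C) / T              # E(A_j) = N_Rj * M_1j / T_j
            sum_var += (A + B) * (C + D) * (A + C) * (B + D) / (T**2 * (T - 1))
        alpha = num / den
        chi2 = (abs(sum_a - sum_ea) - 0.5) ** 2 / sum_var
        delta = -2.35 * np.log(alpha)
        return alpha, chi2, delta

Score levels at which every examinee answered the same way carry no information about the odds ratio and are skipped in this sketch.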


With ln($\alpha_{MH}$), the matching variable, total score, is partialed out of the association of group membership and performance for the given item (Holland & Thayer, in Wainer & Braun, 1988). The Mantel-Haenszel null hypothesis will hold only if the ability distributions for the two groups are equal, the conditioning test is perfectly reliable, or all items meet the conditions of the Rasch model (Welch & Miller, 1995).

Logistic Regression

Non-parametric tests do not require the rigorous assumptions and complexity of parametric models, nor do they require large sample sizes. Logistic Regression (LR) is a regression procedure in which the dependent variable is binary. Yu (2005) states that the purpose is to predict whether the independent variable can predict the outcome (e.g., if being male does not predict the test score, then the test is gender-free). In this case, we would be investigating whether being administered the CBT predicts the test score for an examinee. If not, the test would be mode-free when controlled for ability; otherwise, mode has an effect on test score. The assumptions of LR are not as rigorous as those of latent trait models and include the following advantages:

LR does not assume a linear relationship between the dependent and independent variables.
The dependent variable does not need to be normally distributed.
There is no homogeneity of variance assumption.
Error terms are not assumed to be normally distributed.
Independent variables do not need to be interval or unbounded (Garson, 2005; Pedhazur, 1997).

The logistic model returns a parameter called the odds ratio (in SAS), which is the ratio of the probability that an outcome (O) will occur divided by the probability that the same outcome will not occur (Pedhazur, 1997; Yu, 2005). The odds of a correct response can be written as

$$\text{Odds}(O) = \frac{P(O)}{1 - P(O)}$$

An odds ratio greater than 1 means that the odds of getting a 1 on the dependent variable (a correct response) are greater for the focal group than for the reference group. The closer the odds ratio is to 1, the more the categories or groups are independent of the dependent variable (mode to test score/correct response). In logistic regression, the dependent variable is a logistic transformation of the odds and is represented as

$$\text{logit}(P) = \log(\text{odds}) = \ln\!\left(\frac{P}{1 - P}\right)$$

where ln = the natural logarithm. A simple logistic regression equation may be represented as

$$\text{logit}(P) = a + bX$$

where X = the independent variable.
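In DIF applications, this model is typically extended with the matching total score, a group (or administration-mode) indicator, and their interaction: a significant group coefficient suggests uniform DIF, and a significant interaction suggests nonuniform DIF. The following is a minimal sketch using statsmodels and simulated data, not the analysis reported in this study; the variable names are illustrative.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    def lr_dif(item, total, cbt):
        """item: 0/1 item responses; total: matching total scores;
        cbt: 1 if the examinee took the CBT form, 0 for paper-and-pencil."""
        df = pd.DataFrame({"item": item, "total": total, "cbt": cbt})
        df["interaction"] = df["total"] * df["cbt"]
        X = sm.add_constant(df[["total", "cbt", "interaction"]])
        model = sm.Logit(df["item"], X).fit(disp=False)
        return model.params, model.pvalues

    # Illustrative call with simulated responses containing a uniform mode effect.
    rng = np.random.default_rng(1)
    n = 2000
    total = rng.integers(20, 61, size=n)
    cbt = rng.integers(0, 2, size=n)
    logit = -4.0 + 0.1 * total - 0.3 * cbt
    p = 1.0 / (1.0 + np.exp(-logit))
    item = rng.binomial(1, p)
    params, pvalues = lr_dif(item, total, cbt)
    print(params, pvalues, sep="\n")

The p-values returned for the cbt and interaction coefficients come from Wald z tests, the significance test mentioned below.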


In the simple logistic regression equation, b is the expected change of logit(P) associated with a unit change in X. When b is positive, increases in X are associated with increases in logits, and when b is negative, increases in X are associated with decreases in logits (Pedhazur, 1997). The equation can be simplified much further by replacing the logits with odds obtained by taking the antilogs, yielding the following equation:

$$P = \frac{1}{1 + e^{-(a + bX)}}$$

where e = the base of the natural logarithm. Further, LR is sensitive to nonuniform DIF. The Wald statistic is commonly used to determine significance (Garson, 2005; Yu, 2005).

Item Response Theory

Item Response Theory (IRT) provides a means to examine a person's ability and the behavior of items that is invariant, or independent of one another. NAEP was probably one of the first programs to use IRT in large-scale (operational) assessments (Kifer, 2001), but with the emergence of computer-based tests, IRT is becoming more common.

Modern test theory, also referred to as latent trait theory, can be traced to the works of two men: Georg Rasch and Frederic Lord. Georg Rasch is known for the Rasch Model, which is mathematically equivalent to the 1-PL model (Hambleton et al., 1991). Frederic Lord is credited with the development of the parameter logistic model. One of the largest differences in explaining these models is philosophical. Those who prefer the Rasch Model believe the model is
fixed and the researcher must find the data to fit the model. Those who prefer the parameter logistic model (common dichotomous models include the 1-PL, 2-PL, and 3-PL) believe the data are fixed and the researcher must fit the model to the data. The following two sections provide a review of the literature that supports the Rasch Model and the 3-parameter logistic model.

Measurement attempts to capture the real thing, latent trait ability, and share it with others (e.g., summarize it). Data are recorded as the presence or absence of a correct response, or the endorsing or not endorsing of an event. This information is recorded by counting the correct items or endorsements on an ordinal scale. Researchers transform the data to an interval scale in order to make meaning of the data and draw conclusions. Placing the data on an interval scale results in distances between counts or scores that are of equal distance, and therefore are meaningful (Bond & Fox, 2001; Smith & Smith, 2004).

Bond and Fox (2001) explain the theoretical scale used in the Rasch model. Criteria are ordered by listing items onto a logit scale, so that usefulness can be gauged. The steps or items at the bottom of the scale would be attainable or easy for the beginner, while the items at the top of the scale would be suitable for the person with more of the attribute. The location of the item on the scale represents item difficulty, and the steps along the way represent the varied levels of person ability. Bond and Fox (2001) define construct validity in the following manner: "recorded performances are reflections of a single underlying construct: the theoretical construct as made explicit by the investigator's attempt to represent it
in items or observations and by the human ability inferred to be responsible for those performances" (p. 26).

Rasch analysis provides fit statistics to assist the researcher. Data that do not fit the model can be eliminated. Items that do not fit do not hold up to the assumption of unidimensionality. These data diverge from the expected ability and difficulty pattern, theoretically a straight line, and the item is thought not to contribute to the measurement of a single construct. When creating a visual scale of the items and ability, a parallel line is drawn on either side of the scale, mathematically represented as a straight line similar to a 95% confidence interval. Taken together, this graphical representation and the analytical output provide detail on how each performance met or failed to meet the model. Items that fail to meet the model require further investigation (Bond & Fox, 2001).
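The following is a minimal sketch, not from this study, of the Rasch item response function together with the unweighted (outfit) and information-weighted (infit) mean-square statistics on which fit analysis of this kind is commonly based; the simulated data and names are illustrative only.

    import numpy as np

    def rasch_prob(theta, b):
        """Rasch (1-PL) probability of a correct response for ability theta
        and item difficulty b, both on the logit scale."""
        return 1.0 / (1.0 + np.exp(-(np.asarray(theta) - b)))

    def item_fit(responses, theta, b):
        """Outfit and infit mean squares for one item from scored 0/1 responses."""
        p = rasch_prob(theta, b)
        w = p * (1.0 - p)                                # response variance (information)
        z2 = (np.asarray(responses) - p) ** 2 / w        # squared standardized residuals
        outfit = np.mean(z2)
        infit = np.sum(w * z2) / np.sum(w)
        return outfit, infit

    # Illustrative use: 500 simulated persons answering one item of difficulty 0.4.
    rng = np.random.default_rng(2)
    theta = rng.normal(0.0, 1.0, size=500)
    responses = rng.binomial(1, rasch_prob(theta, 0.4))
    print(item_fit(responses, theta, 0.4))               # values near 1.0 suggest good fit

Mean squares near 1.0 indicate responses close to the model's expectation; reporting software often converts them to the standardized t values that are read against the -2.0 to +2.0 boundaries discussed below.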

[Figure 1 appears here as an item-person map: difficult items and persons with high ability sit toward the top of the logit scale, easy items and persons with low ability toward the bottom, with dotted fit boundaries at -2 and +2.]

Figure 1. Item-person map* (adapted from Bond & Fox, 2001)

*Note: Items are represented as circles and persons are represented as squares. Larger circles or squares represent more error or more uncertainty associated with the item. The solid vertical line represents the logit scale, which is an interval
scale, so all units have the same value. Fit values are read horizontally on a standardized t scale. Acceptable values fall within the dotted vertical lines, between -2.0 and +2.0. Items and persons outside these boundaries do not meet the model and can be eliminated. Items not fitting the model do not hold up to the assumption of unidimensionality. A person not fitting the data might answer difficult items correctly but miss easier items, not following the expected pattern.

Figure 1 represents the item map and is adapted from Bond and Fox (2001). Items are represented as circles and persons are represented as squares. Larger circles or squares represent more error or more uncertainty associated with the item. The line represents the logit scale, which is an interval scale, so all units have the same value. The higher values are located at the top of the map and represent the more difficult items, higher person ability, or a higher degree of endorsement. The lower values are located at the bottom of the map and represent the easier items, lower person ability, or a lesser degree of endorsement. Estimated values of items and persons are read vertically on the map or scale (Bond & Fox, 2001).

Within this figure are two parallel vertical dotted lines. Items within these lines fit the model, while items outside of these lines do not behave as expected. Fit values are read horizontally on a standardized t scale. Acceptable values fall within the dotted lines, between -2.0 and +2.0. The dotted lines in the figure represent item fit. Items falling outside the two dotted lines (less than -2.0 or greater than +2.0) do not fit the construct (do not contribute to the measurement of only one construct). In this model, these items
most likely represent multidimensionality (more than one construct), which violates the assumption of unidimensionality, and such items should be excluded from analyses. Bond and Fox (2001) suggest that those items may contain story problems confounding reading ability with math computation and would fit better in another pathway. Persons falling in these areas (less than -2.0 or greater than +2.0) do not adhere to the expected ability pattern (Bond & Fox, 2001). In its ideal form, the pattern should be a straight line. Straying excessively from this model may indicate persons responding correctly to an item that should have been missed, or responding incorrectly to an item that should have been answered correctly according to the pattern.

Wright (1995) states that the Rasch model has the same discrimination for all items, which makes possible conjoint additivity, sufficient statistics, and the uncrossed item characteristic curves (ICCs) that define a coherent construct. He further notes that discrimination is a parameter in the 3-PL model but not in the Rasch model. By parameterizing item discrimination, Wright believes the 3-PL model complicates estimation and inhibits interpretation by obscuring the item hierarchy and the construct definition. In addition, Rasch models do not allow ICCs to cross. ICCs that cross do not represent variance but rather item bias or multidimensionality detected by fit statistics. Rasch proponents feel that an item that can differ in difficulty for differing ability levels, meaning the ability estimate depends on the items received, illustrates that items are sample dependent. Only Rasch models can be invariant given the fit and sample for which the data were calibrated. Rasch recognizes that guessing may occur and deals with it through
the detection of outfit statistics. The item is removed, the data are recalibrated, and the two calibrations are compared to determine whether the item should be treated as an outlier would be in an ANOVA (Smith & Smith, 2004). Wright (1992) states that items cannot guess; therefore, guessing should be identified in the person. In his opening lines of an invited debate with Ron Hambleton during the 1992 Annual Meeting of the American Educational Research Association, Wright stated that he knows of only two models: a two-parameter model with B and D, where B is the person's ability and D is the item difficulty, and the four-parameter Birnbaum model with parameters a, b, c, and θ, where:

a = item discrimination
b = item difficulty
c = guessing
θ = theta, or the person's ability

Further, Wright (1992) clarifies that the Rasch model can have multiple parameters to the right of the log-odds statement when they are connected with a plus or minus sign. Wright purports that measurement requires a measure where the scale is defined so that every additional unit is an equal amount (e.g., 8 inches is more than 7 inches, and the difference is the same amount as that between 12 inches and 13 inches). This is an interval scale. Raw data are ordinal, and the Rasch model is the only model that can transform ordinal data to an interval scale. Wright further classifies the Birnbaum model as useful only for dichotomous data, whereas the Rasch model is able to handle any kind of ordered data, such as a rating, scaling, or ranking. Wright concludes that the 3-PL offers no benefit over Rasch (ability
estimate and item difficulties are statistically equivalent to Rasch measures) and has disadvantages compared to Rasch.

The parameter logistic model in item response theory (IRT) postulates that performance on an item can be explained or predicted by the examinee's ability, a latent trait. The models represent a relationship between examinee ability and item characteristics. The probability of getting an item correct depends on where the examinee is on the theta scale and can be computed for anyone anywhere on theta. The three dichotomous models commonly used are the 1-, 2-, and 3-parameter logistic models.

1-PL:   P_i(\theta) = \frac{e^{Da(\theta - b_i)}}{1 + e^{Da(\theta - b_i)}}

2-PL:   P_i(\theta) = \frac{e^{Da_i(\theta - b_i)}}{1 + e^{Da_i(\theta - b_i)}}

3-PL:   P_i(\theta) = c_i + (1 - c_i)\,\frac{e^{Da_i(\theta - b_i)}}{1 + e^{Da_i(\theta - b_i)}}

where

P_i(\theta) = the probability of a correct response to item i for an examinee with ability theta
b_i = item difficulty parameter
a_i = item discrimination parameter
a = common level of discrimination in all items (1-PL)
c_i = lower asymptote or guessing parameter
D = 1.7, a scaling factor
e = the base of the natural logarithms

The 1-PL model includes the item difficulty parameter (b). The 2-PL model includes the item difficulty parameter (b) and the discrimination parameter (a). The 3-PL model includes the item difficulty parameter (b), the discrimination parameter (a), and the guessing parameter (c). While the Rasch model calls for the data to fit the model, the 1-PL, 2-PL, and 3-PL models require the selection of a model to fit the data. Though some researchers may distinguish between the Rasch model and the 1-PL, Hambleton et al. (1991) report that the 1-PL is mathematically equivalent to the Rasch model. Model selection should be based on theoretical or empirical grounds (Stark, Chernyshenko, Chuah, Lee, & Wadlington, 2001). Sample size will factor into the selection: as parameters are added, the required sample size increases. The 3-PL has been used successfully with many achievement data studies (Stark et al., 2001). Stark further states that data must conform to assumptions about dimensionality and be examined in a cross-validation sample. Data can be cross-validated using a calibration sample and a validation sample. The calibration sample is used to estimate the item parameters, and the validation sample is used to check model-data fit. Dimensionality can be investigated by examining the number of eigenvalues greater than one, scree plots, or the ratio of the first eigenvalue to the second eigenvalue. A factor analysis can be conducted before any other analysis to determine whether the items fall on one factor, indicating unidimensionality within the test or subtest to be calibrated (Drasgow & Lissak, 1983; Stark et al., 2001).
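To make the three dichotomous models concrete, the sketch below evaluates the item response functions defined above with the conventional D = 1.7 scaling. The theta and parameter values are illustrative only, not estimates from any data set.

```python
import math

def p_3pl(theta, b, a=1.0, c=0.0, D=1.7):
    """Probability of a correct response under the 3-PL model.
    Setting c = 0 gives the 2-PL; additionally using a common a
    across items gives the 1-PL."""
    z = D * a * (theta - b)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# Illustrative values: an examinee at theta = 0.5 on an item with b = -0.2.
print(p_3pl(0.5, b=-0.2))                 # 1-PL (a = 1, c = 0)
print(p_3pl(0.5, b=-0.2, a=1.3))          # 2-PL
print(p_3pl(0.5, b=-0.2, a=1.3, c=0.2))   # 3-PL
```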
The 3-PL model is a more general form of the Rasch model (Stark et al., 2001). The equation is

P_i(\theta) = c_i + (1 - c_i)\,\frac{e^{Da_i(\theta - b_i)}}{1 + e^{Da_i(\theta - b_i)}}

where

P_i(\theta) = the probability that an examinee with ability θ answers item i correctly
a_i = item discrimination parameter
b_i = location parameter
c_i = pseudo-chance-level parameter
D = a scaling factor that makes the logistic function as close as possible to the normal ogive function
e = transcendental number whose value is 2.718

The higher the a-parameter, the better the discrimination among examinees. The a-parameter is related to the steepness of the item response function (IRF). The b-parameter, or location parameter, indicates the location of the IRF on theta (the horizontal axis). This parameter indicates the difficulty of the item and is inversely related to the proportion-correct score in classical test theory: a large p-value represents an easy item, while a large b-parameter indicates a difficult item. The c-parameter, referred to as the pseudo-guessing parameter, indicates the probability that an examinee with low ability responds correctly due to guessing, and it influences the shape of the IRF through a lower asymptote (Stark et al., 2001).

Assumptions. In order to make inferences from the results of IRT methods, it is important to mention the assumptions of the theory:
unidimensionality, local independence, and logistic response functions (Hambleton & Swaminathan, 1985). Unidimensionality refers to the measurement of one factor or component that influences test performance. Local independence refers to items being independent at a given level of theta: the probability of a correct response to an item is independent once latent ability is conditioned on, and items have nothing in common except ability. The functional form of the response functions assumes that probabilities change across ability levels in a way that is fit by a logistic curve. Hambleton and Swaminathan (1985) also mention speededness, which refers to the addition of another trait; when speededness is present, the assumption of unidimensionality is no longer satisfied. If speed is required to complete a test, the test is measuring ability and speed rather than ability alone (Parshall et al., 2002).

Using IRT to Detect DIF. An advantage of using IRT to detect DIF is that it permits the study of how DIF may vary across different latent ability levels (Bolt, 2000). Further, the statistical properties of the items can be graphed using the IRT approach, broadening the understanding of items showing DIF (Camilli & Shepard, 1994). IRT can detect item bias as distinct from true differences on the attribute, allowing researchers to conduct rigorous tests of equivalence across experimental groups (Stark et al., 2001). Violations of this measurement equivalence result in DIF. Parametric models are strong statistical models when the associated assumptions are met and a large sample size exists (Camilli & Shepard, 1994). The use of a parametric method is less flexible because of its stronger assumptions related to dimensionality and the IRF shape, while
nonparametric methods typically assume only monotonicity (Stark et al., 2001). The sample is categorized as either a member of the reference or the focal group after the distributions have been put on a common metric. DIF exists when examinees from the two groups with the same ability have different item response functions (IRFs), meaning different item parameters. IRFs and item parameters are subpopulation invariant and can be used to detect DIF using Lord's chi-square statistic for differences in the estimated item parameters, or an alternative statistic that tests the area between the IRFs (Stark et al., 2001). There are two commonly used methods for determining DIF: comparing item characteristic curves (ICCs) and comparing item parameters directly. When comparing the ICCs, the researcher places the parameter estimates on a common scale. If the ICCs are identical, then the area between them should be zero (Hambleton et al., 1991; Stark et al., 2001). When the area between the ICCs is not zero, it can be concluded that DIF exists (Camilli & Shepard, 1994; Hambleton et al., 1991). The expression developed by Raju (1988) for the three-parameter model is

Area = (1 - c)\,\left| \frac{2(a_2 - a_1)}{D a_1 a_2} \ln\!\left(1 + e^{\frac{D a_1 a_2 (b_2 - b_1)}{a_2 - a_1}}\right) - (b_2 - b_1) \right|

where

a = item discrimination parameter
b = location parameter
c = pseudo-chance-level parameter
D = a scaling factor that makes the logistic function as close as possible to the normal ogive function
e = transcendental number whose value is 2.718

For the 2-PL, the term involving c is eliminated. For the 1-PL, the expression becomes the absolute difference between the b-values for the two groups (Hambleton et al., 1991).

When DIF exists, the item response curve (IRC) differs for the two groups. Similarly, an item parameter will differ across the groups (Embretson & Reise, 2000). Simply put, DIF can be detected by collecting a large sample from each of the two groups (e.g., P&P, CBT), administering the instrument, estimating the item parameters for each group, and then comparing the IRCs visually or after linking scales. In each calibration, the latent variable is arbitrarily defined to have a mean of 0 and a standard deviation of 1, which fixes the scale of the latent variable. This fixing is done separately for each group, with the result that the scales are probably not the same and cannot be directly compared (Embretson & Reise, 2000). Therefore, before comparing the item parameters it is necessary to put them on the same scale, a process known as linking (i.e., test equating). The detection of DIF is made possible within the Rasch measurement model because the model requires that the probability of the outcome of the interaction of any person and item be solely determined by the ability of the person and the difficulty of the item. "The interactions that represent bias will be visible against the structure imposed by the additive measurement model" (Smith & Smith, 2004, p. 392).

Rather than comparing the ICCs across modes to determine whether the items were functioning comparably, the b-parameters can be compared for common
items from each group. The null hypothesis is that the item parameters are equal and is stated as

H_0: a_1 = a_2;\ b_1 = b_2;\ c_1 = c_2

where

a_1 = item discrimination parameter for Group 1
a_2 = item discrimination parameter for Group 2
b_1 = location parameter for Group 1
b_2 = location parameter for Group 2
c_1 = pseudo-chance-level parameter for Group 1
c_2 = pseudo-chance-level parameter for Group 2

The subscript represents the group. If the hypothesis is rejected, the researcher would conclude that DIF exists (Hambleton et al., 1991). After placing the estimates of the item parameters on a common scale (standardizing the difficulty parameters), the variance-covariance matrix of the parameter estimates in each group is computed. The information matrix is computed and inverted (for each group). Then the variance-covariance matrices of the groups are added, which yields the variance-covariance matrix (Σ) of the differences between the estimates (Hambleton et al., 1991). This statistic is stated as

\chi^2 = (a_{diff}\ \ b_{diff}\ \ c_{diff})\, \Sigma^{-1}\, (a_{diff}\ \ b_{diff}\ \ c_{diff})'

where

a_{diff} = a_2 - a_1
b_{diff} = b_2 - b_1
c_{diff} = c_2 - c_1
a_1 = item discrimination parameter for Group 1
a_2 = item discrimination parameter for Group 2
b_1 = location parameter for Group 1
b_2 = location parameter for Group 2
c_1 = pseudo-chance-level parameter for Group 1
c_2 = pseudo-chance-level parameter for Group 2

This statistic is asymptotically distributed as a chi-square with k degrees of freedom, where k represents the number of parameters compared (a, b, and c). For the 3-PL, where a, b, and c are compared, k = 3. For the 2-PL, where a and b are compared, k = 2. For the 1-PL, where only b is compared, k = 1. Therefore, the expression for the 1-PL can be simplified to

\chi^2 = \frac{b_{diff}^2}{v(b_1) + v(b_2)}

In this statistic, v(b_1) and v(b_2) represent the reciprocals of the information function for the difficulty parameter estimates (Hambleton et al., 1991).

One of the problems associated with using IRT to detect DIF is the sample size. In IRT, a large sample size is required for both groups. Item parameters must be estimated for each group, and a large range of abilities must exist. In many cases, the sample size is not large enough for IRT in the minority group (Clauser & Mazor, 1998; Hambleton et al., 1991). This can lead to poor ability estimates, which can lead to errors in the decisions made regarding DIF.

Summary

The research illustrates that methods for evaluating comparability exist, fit a variety of situations, and are practical. Professional organizations and researchers alike unequivocally call for comparability
studies whenever the administration platform changes. Therefore, there should be no question that comparability studies should be routine.

The increased availability of computers has resulted in more large, high-stakes exams being administered across dual platforms. It is necessary to have accurate measures because these instruments are routinely used to make high-stakes decisions about placement, advancement, and licensure, and these decisions have personal, social, and political ramifications (Clauser & Mazor, 1998). For this reason, it is imperative to continue examining these types of instruments and to identify methods for detecting DIF that allow for valid interpretations (Clauser & Mazor, 1998), are accurate, are consistent, and are generalizable.

The National Nurse Aide Assessment Program (NNAAP) is an important high-stakes exam; the exam is a requirement for certification as a nurse aide. A computer-administered version of this exam potentially has many advantages, including a shortened test, enhanced test security, testing on demand, greater test standardization, and immediate scoring and reporting (Hambleton & Swaminathan, 1985). However, none of these benefits will be truly accomplished without ensuring that the computerized test is statistically equivalent to the paper-and-pencil version. A comparability analysis is essential to ensure that the test provides a "common, standardized experience for all test-takers who are taking the same test, regardless of where [or how] the test is delivered" (ATP, 2002, p. 14).
CHAPTER THREE: METHOD

Purpose

The purpose of this study was to examine any differences that may be present in the items of a specific high-stakes exam administered across dual platforms (paper and computer). The exam, the National Nurse Aide Assessment Program (NNAAP), is administered in 24 states and territories as a part of the states' licensing procedures for nurse aides. If differences were found on the NNAAP, it would indicate that a test item behaves differently for an examinee depending on whether the exam is administered via paper-and-pencil or computer. It is reasonable to assume that when items are administered in the paper-and-pencil mode and when the same items are administered in the computer mode, they should function equitably and should display similar item characteristics.

Researchers have suggested that when examining empirical data, where the researcher cannot be sure whether a correct decision has been made regarding the detection of differential item functioning (DIF), as would be known in a Monte Carlo study, it is wise to use at least two methods for detecting DIF (Fidalgo et al., 2004). Since the presence of DIF can compromise the validity of the test scores (Lei et al., 2005), this study used multiple methods to detect item DIF.
Originally designed to study cultural differences in test performance (Holland & Wainer, 1993), DIF methodology is normally used to investigate item bias across two examinee groups within a single test. Examining items across test administration platforms can be viewed as the examination of two examinee groups. Item bias would exist if items behave differently for one group than for the other after controlling for ability, and items operating differentially make scores from one examinee group not comparable to those of another. Schwarz et al. (2003) state that DIF methodology presents an ideal method for examining whether an item behaves differently in one test administration mode compared to the other. Three commonly used DIF methods (the 1-Parameter Logistic model, Mantel-Haenszel, and Logistic Regression) were used to examine the degree to which each method was able to detect DIF in this study. Although these methods are widely used in other contexts, there is some question about their sensitivity in this context. A secondary purpose of this study was to examine the relative sensitivity of these approaches in detecting DIF between administration modes on the NNAAP.

Research Questions

As previously stated, the questions addressed in this study were:

1. What proportion of items exhibit differential item functioning as determined by the following three methods: Mantel-Haenszel (MH), Logistic Regression (LR), and the 1-Parameter Logistic Model (1-PL)?
2. What is the level of agreement among Mantel-Haenszel, Logistic Regression, and the 1-Parameter Logistic Model in detecting differential item functioning between the computer-based (CBT) National Nurse Aide Assessment Program and the paper-and-pencil (P&P) National Nurse Aide Assessment Program?

Data

National Nurse Aide Assessment Program

Nurse aides primarily work in hospitals, clinics, nursing home facilities, and home health agencies. Nurse aides receive their clinical experience in long-term care facilities or acute care hospital settings (University of Hawaii, 2005). Nurse aides provide basic care such as bathing, feeding, toileting, turning and moving patients in bed, assisting with walking, and transferring patients to a wheelchair or stretcher (Department of Health and Human Services Office of the Inspector General, 2002). They perform such tasks as taking the patient's temperature, pulse, respiration, and blood pressure (University of Hawaii, 2005; Lake County, 2005). A nurse aide must take coursework, complete a clinical experience, and pass the state's certification assessment requirements. Several states use a national assessment to serve as the assessment requirement. The national assessment used in 24 states and territories is the National Nurse Aide Assessment Program (NNAAP).
evaluation which requires the examinee to demonstrate a series of skills required to serve as a nurse aide (Muckle & Becke r, 2005). The written section is made up of 70 multiple-choice items. Sixty of these items are used to determine the examinees score on the written section of the exam and 10 items are pretest items that do not affect the examinees score. The exam is commonly administered in English but examinees with limited reading proficiency are administered the wr itten exam with cassette tapes that present the items and direct ions orally. Spanish-speaking examinees may take the Spanish version of the exam. For both of these versions of the exam given with accommodations, an additional 10 item s are administered in English that measure reading comprehension and contain job-related language words. These items are computed as a separate score, re present written instructions a nurse aide would routinely be required to follo w, are required to be administered by federal regulations, and must be passed before being administered the written portion of the NNAAP with accommodations (Muckle & Becker, 2005). The skills section is a performance-based assessm ent that requires the examinee to successfully demonstrate skills related to t he job of a nurse aide. Both the written and skills sections of the NNAAP must be passed to obtain an overall pass. In this study, the 60-item multiple-cho ice written section was examined. Items contained in the 2004 NNAAP were developed in a National Council of State Boards of Nursing Job Anal ysis in 1998-1999 and pretest items were administered prior to building test forms. These items were revised and reviewed by subject matter experts. The experts r epresented 12 states and were joined by 74

four Promissor staff members and one National Council of State Boards of Nursing (NCSBN) representative (Muckle & Becker, 2005). This team engaged in an item-writing workshop and reviewed the pretest items. A part of the test construction included cueing rules. Cueing occurs when the content of an item provides information that would allow the examinee to answer another item correctly. The NNAAP rules of cueing group all items that possibly cue the examinee into a set considered mutually exclusive, and no pair of these items can appear on the same test form (Muckle & Becker, 2005). Further, approved items for inclusion in a test form must have a p-value (item difficulty) of at least 0.60 and a point biserial larger than 0.10, and they must fit the Rasch model (Muckle & Becker, 2005).

A practice test contains sample items for the examinee to take. The practice tests are not state-specific due to frequently changing state laws. A sample written test is available on the Promissor Web site, and online practice tests are available for a fee. The test items are written to measure several clusters, including Physical Care Skills, Psychosocial Care Skills, and Role of the Nurse Aide (e.g., communication, resident rights, and legal and ethical behavior). The following are sample items (Promissor, 2005).

The client's call light should always be placed:
(A) on the bed
(B) within the client's reach
(C) on the client's right side
(D) over the side rail
Which of the following items is used in the prevention and treatment of bedsores or pressure sores:
(A) Rubber sheet
(B) Air mattress
(C) Emesis basin
(D) Restraint

The six test forms were created using the approved items, and no items overlap across forms, to ensure strict item security. Equating is accomplished by scaling the scored items in the item bank to a common logit scale from which candidate ability is estimated. Passing standards represent equal ability even when item difficulty varies across forms. The logit scale is also used to determine the difficulty of each exam form and the score required to meet the passing standard (Muckle & Becker, 2005). Therefore, cut scores compensate for item difficulty and may vary across forms. This study did not compare forms. Instead, common items were examined on identical test forms administered across the two platforms. While the items are common by form, Promissor informed the researcher that the P&P version uses the term "client" whereas the CBT replaces the term "client" with the term "patient" in all items.

Muckle and Becker (2005) recorded the 2004 NNAAP validity and reliability statistics in the 2004 Technical Report. Subject matter experts who participated in a workshop that included the principles of item writing, practice in the construction and review of items, and a review of the statistical characteristics of items established validity for the NNAAP forms used in this
study (Muckle & Becker, 2005). The subject matter experts, representing several states, along with staff members from Promissor and the NCSBN, reviewed items prior to inclusion on the exam. Items included in the exam were distributed by content category and were eliminated if they did not meet the cueing rules and statistical performance parameters. Additionally, items were required to have a p-value of at least 0.60 and a point biserial larger than 0.10. Approved items fit the Rasch model, and computerized forms were reviewed for content, readability, and presentation.

Item analyses were performed for each of the written test administrations and included p-values, a discrimination index, item distractor analysis, and the Kuder-Richardson KR-20. The KR-20 coefficient for each exam (without pretest items) is at least 0.80 and is reported in Table 3 (data in the table include P&P and CBT).

Table 3
Descriptive statistics: 2004 NNAAP written exam (60 items)

Examination Form   # of Examinees   Mean Raw Score*   Mean Point-Biserial*   Reliability (KR-20)
Form A             5,441            53.28 (5.12)      0.24 (0.07)            0.805
Form V             10,018           54.09 (5.48)      0.25 (0.08)            0.808
Form W             8,751            53.38 (5.19)      0.27 (0.08)            0.830
Form X             9,137            53.11 (5.54)      0.30 (0.10)            0.847
Form Z             5,670            53.87 (5.08)      0.32 (0.11)            0.859

(Muckle & Becker, 2005). *Numbers in parentheses indicate the standard deviations.
Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.
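As a rough illustration of the KR-20 coefficient reported in Table 3 (and computed again for the study sample in Chapter Four), the sketch below applies the standard formula to a small matrix of dichotomous item scores. The response matrix is invented for illustration; it is not NNAAP data.

```python
import numpy as np

def kr20(scores):
    """Kuder-Richardson 20 for a persons-by-items matrix of 0/1 scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    p = scores.mean(axis=0)                      # proportion correct per item
    q = 1.0 - p
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1.0)) * (1.0 - p.dot(q) / total_var)

# Tiny, made-up example: 5 examinees by 4 items.
x = [[1, 1, 0, 1],
     [1, 0, 0, 1],
     [1, 1, 1, 1],
     [0, 0, 0, 1],
     [1, 1, 1, 0]]
print(round(kr20(x), 3))
```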
Administration of the National Nurse Aide Assessment Program

In order to reduce turnaround time for candidates and clients, Promissor has investigated other modes of administration and registration. This has resulted in several administration modes being used, including P&P, CBT, and fax. New Jersey administers the exam by computer only; the other 23 states and territories administer the exam by paper-and-pencil or by fax (Muckle & Becker, 2005). All of these examinees took the exam under motivated conditions.

Five existing forms of the NNAAP were transported to computer using a single vendor that develops test administration software. The vendor addressed issues such as hardware and interface, including navigation buttons, scrolling, functionality, and font size. Jones et al. (2003) found that CBTs using separate software vendors display classical item p-values that differ across administrations, possibly indicating that item bias exists when an instrument is administered across varied software vendors. Therefore, the use of a single vendor for these data acts as a control to limit DIF that could be present due to software vendors.

Promissor tracks the number of attempts the examinee has made on the exam. This is used to determine the test form the examinee receives. There are six forms of the paper-and-pencil NNAAP. Five of these forms have been translated to computer, resulting in five linear forms of the CBT NNAAP.

Sample

Since the researcher is not employed by Promissor or the National Council of State Boards of Nursing, she requested data from Promissor and
entered into a data sharing agreement (see Appendix A). The data shared with the researcher for this study were anonymous to her. Because the study used existing data, an application was submitted to the researcher's university requesting exemption status, which was approved (see Appendix B). The data shared with the researcher included the state of administration, the type of exam (e.g., Standard, Spanish, Oral), the item response and score data, the test form, and an identifier provided by Promissor. Demographic information for this sample was not available from Promissor.

Table 4 reports the number and percent of all examinees administered the NNAAP in 2004 who passed the written and oral portions of the exam. This table represents all test takers in 2004. The CBT test forms administered in New Jersey have been used since early 2002 (and were used through summer 2005), and New Jersey has one of the highest state passing standards on the NNAAP required for certification (K. Becker, personal communication, November 2005). While the passing rate for New Jersey appears very different from that of other states, the mean score on the written exam does not show evidence of significant differences; the mean scores are further discussed in Chapter 4.
Table 4
Number and percent of examinees passing the 2004 NNAAP (Written/Oral)

                       First Time Takers      Repeaters           Total
State                  N        % Pass        N       % Pass      N         % Pass
Alabama                1474     98%           119     95%         1593      98%
Alaska                 541      99%           37      97%         578       98%
California             7992     94%           583     76%         8575      93%
Colorado               3670     98%           305     86%         3975      97%
Connecticut            3055     95%           470     74%         3525      92%
District of Columbia   396      89%           50      80%         446       88%
Louisiana              588      97%           30      77%         618       96%
Maryland               3159     96%           258     80%         3417      95%
Minnesota              5670     92%           1035    79%         6705      90%
Mississippi            2447     98%           317     96%         2764      97%
Nevada                 943      87%           201     58%         1144      82%
New Hampshire          19       100%          2       100%        21        100%
New Jersey             4394     76%           1375    43%         5769      68%
New Mexico             1539     98%           191     93%         1730      97%
North Dakota           971      99%           93      92%         1064      98%
Oregon                 672      99%           9       100%        681       99%
Pennsylvania           8586     89%           1369    65%         9955      85%
Rhode Island           1212     98%           105     89%         1317      97%
South Carolina         3109     98%           441     97%         3550      98%
Texas                  19157    97%           1631    87%         20788     97%
Virginia               4619     96%           557     68%         5176      93%
Virgin Islands         57       96%           6       83%         63        95%
Washington             5514     95%           401     75%         5915      94%
Wisconsin              9831     99%           139     78%         9970      99%
Wyoming                916      100%          95      100%        1011      100%
Total                  90531    95%           9819    74%         100350    93%

(Muckle & Becker, 2005). *Passing scores are set by each state or territory.
This researcher analyzed a subset of the total testing population for 2004. The sample sent to the investigator included all 2004 data for New Jersey. Paper-and-pencil response data from other states were included in the sample when the response data were from the same test forms administered in New Jersey. Other test forms, or forms without common items, were not included in the sample. The criteria for inclusion in the subset were taking the standard written exam, taking the exam for the first time, and taking an exam form that was also administered via computer. Examinees taking the Spanish or oral versions of the exam were eliminated from the subset. Examinees taking the sixth form of the exam, which was not administered as a CBT, and examinees retaking the exam were also eliminated from this subset of examinee data. The reduced sample consisted of 38,955 examinees across five forms. Table 5 reports the number of examinees from each state taking each of the five forms of the NNAAP. Demographic data were not available from Promissor; therefore, no analyses were run by demographic subgroups.
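A minimal sketch of this subsetting logic is shown below, assuming a hypothetical flat file with columns such as exam_type, attempt, and form. These column names, codes, and the file name are illustrative only; they are not the actual Promissor data layout.

```python
import pandas as pd

# Hypothetical layout: one row per examinee with columns for exam type,
# attempt number, and test form.
df = pd.read_csv("nnaap_2004.csv")          # illustrative file name

cbt_forms = {"A", "V", "W", "X", "Z"}       # the five forms also given as CBT

subset = df[
    (df["exam_type"] == "Standard")         # drop Spanish and oral versions
    & (df["attempt"] == 1)                  # first-time takers only
    & (df["form"].isin(cbt_forms))          # drop the sixth, P&P-only form
]
print(len(subset))                          # 38,955 in the actual study sample
```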
Table 5
Number of examinees by form and state for sample

                      Test Form
State            A       V       W       X       Z
Alaska           42      133     130     120     60
California       220     193     298     227     228
Colorado         0       975     973     970     931
Connecticut      0       0       0       3       1
Louisiana        71      71      70      74      74
Maryland         13      0       15      1       0
Minnesota        826     1277    1354    1409    928
Nevada           106     126     110     114     99
N. Hampshire     0       16      0       5       0
New Jersey       861     922     906     908     947
New Mexico       0       0       0       42      0
North Dakota     32      38      52      69      53
Pennsylvania     0       0       0       1       0
Rhode Island     222     233     221     287     216
Texas            3038    5574    4198    4433    1695
Virginia         0       17      413     426     398
Virgin Islands   9       17      11      14      10
Wisconsin        0       0       0       3       3
P&P Total        4579    9093    7845    8198    4696
CBT Total        861     922     906     908     947
Grand Total      5440    10015   8751    9106    5643

Note: P&P = Paper-and-Pencil; CBT = Computer-based Test.
Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.
Data Analysis

This study investigated the comparability of the computer-administered version of the NNAAP to the paper-and-pencil version of the same test for five forms of the test. The sets of achievement data were obtained from 2004 administrations of a 60-item exam. The additional 10 pretest items were not examined, and these 10 pretest items did not count as part of the examinee's score (Muckle & Becker, 2005). Initially, the data were examined at the total test level to look for differences associated with the geographical location of the examinee that might be confounding and could lead to the conclusion that item DIF exists when it does not. The mean percent correct score for the entire sample was calculated. Next, the mean percent correct scores for examinee groups identified by test form were compared for examinees taking the P&P version. This distribution of mean percent scores was examined to identify any possible outlying regions.

Since approved items must statistically fit the Rasch model to be included in test development (Muckle & Becker, 2005), the Rasch model is used by the vendor to analyze and score NNAAP data, and the 1-PL model is mathematically equivalent to the Rasch model (Hambleton et al., 1991), the 1-PL model was used in this study. To ensure the integrity of the remaining results, assumptions were investigated to ensure this data set did fit the model. Assumptions of the model included unidimensionality, local independence, and response functions that are logistic, or parallel in the Rasch model since they do not
cross. The assumption of unidimensionality was investigated using a factor analysis to ensure that the exam measured one factor.

Reliability analyses had been conducted by the vendor. Reliability coefficients should be at least 0.80 for multiple-choice licensure and certification exams (Muckle & Becker, 2005), and the KR-20 coefficients of the NNAAP meet or exceed this requirement. In addition, the set of response data used in this study was examined for reliability using the KR-20.

Factor Analysis

A confirmatory factor analysis was conducted to evaluate the assumption of unidimensionality. Since the assessment is standardized and has been found to be reliable, it was not necessary to explore the number of factor loadings. Rather, the researcher specified the structure of the factor model a priori, according to theory about how the variables ought to be related to the factor (Stevens, 2002). The dominant factor theorized is the knowledge necessary to perform the job tasks required of a nurse aide. If unidimensionality can be assumed, or if a major construct can be identified, IRT can be used to detect DIF because the degree of violation of the unidimensionality assumption is not consequential (Drasgow & Lissak, 1983). Therefore, a confirmatory factor analysis (CFA) was computed for each of the test forms by mode, for a total of 10 analyses. Fit statistics were examined to determine whether a major construct was identified, which would reflect unidimensionality.

Since there were a large number of variables (60) and the variables were categorical, Mplus (Muthén & Muthén, 2004) software was used with the
weighted least squares mean- and variance-adjusted (WLSMV) estimator. WLSMV is defined by Muthén and Muthén (2004) as the weighted least squares parameter estimator that uses a diagonal weight matrix, with standard errors and a mean- and variance-adjusted chi-square test statistic that use a full weight matrix. Each of the operational (not pretest) items for the five test forms was included, making the variable total equal to 60. The CFA was computed for both samples (P&P and CBT). Since the NNAAP CBT is a linear, or fixed, CBT, it is reasonable to assume that the results of the factor analysis for the P&P will be similar to the results of the factor analysis for the CBT. If there is a difference, it may indicate that an additional construct exists, which would likely be related to the format of the test or issues with the computer.

1-Parameter Logistic Model

The data set did not include common examinees but did include examinees' response data on common items that were administered in both the paper-and-pencil mode and the computer-administered mode. Comparability was analyzed at the item level. By investigating the data at the item level, the researcher hoped to identify effects in the data that could have been lost at the test level (e.g., easier items and harder items for a particular test mode that might otherwise average out across a total test score). Assuming unidimensionality, or at least that a major construct was identified, IRT was used to detect DIF (Drasgow & Lissak, 1983), and the two test form modes of the NNAAP (P&P and CBT) were calibrated using the 1-PL model.
P_i(\theta) = \frac{e^{Da(\theta - b_i)}}{1 + e^{Da(\theta - b_i)}}

where

P_i(\theta) = the probability that an examinee with ability θ answers item i correctly
b_i = item difficulty parameter
a = common level of discrimination in all items
D = 1.7, a scaling factor
e = the base of the natural logarithms

The scored items were calibrated using the 1-parameter logistic model with maximum likelihood estimation (MLE) as implemented in BILOG-MG (Zimowski et al., 2003). Each test form was calibrated separately. The 1-PL model was used to calibrate the ability estimate for each examinee and served as one of the methods being examined for detecting DIF.

Differential Item Functioning

The study used DIF methodology to investigate comparability at the item level across two groups of examinees, all of whom were applicants for certification as nurse aides. Three methods were used to identify item DIF, to determine for which items it existed, and to examine whether three of the commonly used methods of identifying DIF detect DIF equally (e.g., the same items are identified as containing item DIF). DIF analyses (Holland & Wainer, 1993) are normally used for the purpose of investigating item bias across examinee groups within a single
test. This can be viewed as a type of comparability study, as items operating differentially make scores from one examinee group not comparable to those of another group. For this research, it is reasonable to assume that when items are administered in the paper-and-pencil mode, and when the same items are administered in the computer mode, they should function equivalently and should display similar item characteristics. Therefore, DIF analyses are ideal for determining whether an item behaves differently in one test administration mode compared to the other. Sample size was more than sufficient for the Rasch or the 1-PL model.

Mantel-Haenszel

The Mantel-Haenszel procedure was investigated as an additional analysis for the detection of DIF. It is a widely used estimate that is an asymptotic chi-square statistic with one degree of freedom (Penny & Johnson, 1999), computed from a set of stratified K x 2 x 2 tables, where K represents the number of ability levels (Matthews & Farewell, 1988; Mould, 1989; Welch & Miller, 1995) formed by contrasting the two groups with items correct and incorrect for each score on an exam (Penny & Johnson, 1999). Nonparametric tests do not rely on a mathematical model; therefore, they do not require the rigorous assumptions and complexity of parametric models, nor do they require large sample sizes (Camilli & Shepard, 1994). Mantel-Haenszel, also referred to as the logrank method, has been used by the medical community to examine differences in experience (e.g., survival) for two or more groups to determine whether these differences are statistically significant (Mould, 1989). In 1988, Holland and Thayer modified the statistic for
use in detecting DIF in test items. The associated test of significance for the MH is distributed as a chi-square (see Table 6) and takes the form (Clauser & Mazor, 1998; Penny & Johnson, 1999):

MH\text{-}\chi^2 = \frac{\left( \left| \sum_j A_j - \sum_j E(A_j) \right| - \tfrac{1}{2} \right)^2}{\sum_j \mathrm{var}(A_j)}

where

\mathrm{var}(A_j) = \frac{N_{rj}\, N_{fj}\, M_{1j}\, M_{0j}}{T_j^2\, (T_j - 1)}

The variance of A_j is the product of the group total for the reference group (N_rj), the group total for the focal group (N_fj), the column total for items correct (M_1j), and the column total for items incorrect (M_0j), divided by the squared grand total (T_j) multiplied by the grand total minus 1 (T_j - 1). The Mantel-Haenszel chi-square (MH-CHISQ) is the uniformly most powerful unbiased test of H_0 versus H_1 (Welch & Miller, 1995).

Table 6
Groups defined in the MH statistic

Group        Correct   Incorrect   Total
Reference    A_j       B_j         N_rj
Focal        C_j       D_j         N_fj
Total        M_1j      M_0j        T_j
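A minimal sketch of the Mantel-Haenszel chi-square defined above, computed from stratified 2 x 2 tables, is shown below; the small table counts are invented for illustration and are not NNAAP data.

```python
def mh_chisq(tables):
    """Mantel-Haenszel chi-square (with continuity correction) from a list of
    2 x 2 tables, one per ability stratum: [[A, B], [C, D]] with the reference
    group in the first row and correct responses in the first column."""
    A = E = V = 0.0
    for (a, b), (c, d) in tables:
        n_r, n_f = a + b, c + d          # row (group) totals
        m1, m0 = a + c, b + d            # column totals (correct, incorrect)
        t = n_r + n_f                    # grand total for the stratum
        if t < 2:
            continue                     # stratum carries no information
        A += a
        E += n_r * m1 / t                # expected count of A_j under H0
        V += n_r * n_f * m1 * m0 / (t * t * (t - 1.0))
    return (abs(A - E) - 0.5) ** 2 / V

# Invented counts for three score strata.
strata = [[[30, 10], [20, 15]],
          [[50, 5], [40, 12]],
          [[25, 2], [22, 4]]]
print(round(mh_chisq(strata), 3))
```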
In this study, for each group, R and F, the number correct and the number incorrect were recorded for each item, based on total score to control for ability. This information was charted and used to compute the odds ratio and log-odds for each item using SPSS (SPSS, 2005). For the purpose of this model, the null hypothesis was H_0: odds ratio = 1. As mentioned, the matching criterion was the total test score. In order to control the Type I error rate, the Bonferroni adjustment was applied. The probability level of 0.05 applied to each test form was divided by 60, representing the 60 test items, resulting in a significance level of p = .00083 for each item comparison.

Logistic Regression

Nonparametric tests do not rely on a mathematical model; therefore, they do not require the rigorous assumptions and complexity of parametric models, nor do they require large sample sizes. Logistic Regression (LR) is a regression procedure in which the dependent variable is binary (Yu, 2005). In this case, the investigator examined whether a participant administered the CBT (versus the P&P) NNAAP had a higher probability of responding correctly to an item, given ability. The assumptions of LR are not as rigorous as those of latent trait models and include the following advantages:

- LR does not assume a linear relationship between the dependent and independent variables
- There is no homogeneity of variance assumption
- Error terms are not assumed to be normally distributed (Garson, 2005; Pedhazur, 1997).
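A minimal sketch of this logistic-regression DIF approach is given below, using statsmodels rather than the SAS procedure used in the study; the simulated responses, total scores, and mode indicator are illustrative only. The item response is regressed on total score (the ability match) and an administration-mode indicator, and the Wald test for the mode coefficient is compared against the Bonferroni-adjusted alpha. An interaction term (total score x mode) could be added to test for nonuniform DIF.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
total = rng.integers(30, 61, size=n)             # matching criterion: total score
mode = rng.integers(0, 2, size=n)                # 0 = P&P, 1 = CBT
# Simulated item responses with a small mode effect built in.
logit = -8.0 + 0.18 * total + 0.4 * mode
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([total, mode]))
fit = sm.Logit(y, X).fit(disp=0)
mode_p = fit.pvalues[2]                          # Wald test for the mode term
print(mode_p, mode_p < 0.05 / 60)                # flag DIF at the adjusted alpha
```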
The logistic model returns a parameter called the odds ratio in SAS (SAS, 2003), which is the ratio of the probability that an outcome (O) will occur divided by the probability that the same outcome will not occur (Pedhazur, 1997; Yu, 2005). The odds of a correct response can be written as

\mathrm{Odds}(O) = \frac{P(O)}{1 - P(O)}

An odds ratio greater than 1 means that the odds of getting a 1 on the dependent variable (a correct response) are greater for the focal group than for the reference group. The closer the odds ratio is to 1, the more the categories or groups are independent of the dependent variable (mode to test score/correct response). A simple logistic regression equation may be represented as

\mathrm{logit}(P) = a + bX

where X = the independent variable. In this equation, b is the expected change in logit(P) associated with a unit change in X. When b is positive, increases in X are associated with increases in the logits; when b is negative, increases in X are associated with decreases in the logits (Pedhazur, 1997). The equation can be simplified further by replacing the logits with odds, yielding the following equation:

P = \frac{1}{1 + e^{-(a + bX)}}

where e = the base of the natural logarithm. In order to control the Type I error rate, the Bonferroni adjustment was applied. The alpha level of 0.05 applied to each test form was divided by 60,
representing the 60 test items, resulting in a significance level of α = .00083 for each item comparison. For this study, the analyses were conditioned on total score.

1-Parameter Logistic Model

Data were scaled to the 1-PL model, and item parameters were estimated separately for the P&P test and the CBT with BILOG-MG (Scientific Software International, 2003; Zimowski et al., 2003). Item difficulty estimates were compared based on the performance of the P&P and CBT groups. The b-parameters were compared for common items on each test administration mode in this study. The null hypothesis is that the item parameters are equal and is stated as

H_0: b_{P\&P} = b_{CBT}

The symbol b refers to the b-parameter, or location parameter, and is the point on the ability scale where the probability of a correct response is 0.5. The subscript represents the group. If the hypothesis is rejected, the researcher would conclude that DIF exists (Hambleton et al., 1991). After placing the estimates of the item parameters on a common scale (standardizing the difficulty parameters), the variance-covariance matrix of the parameter estimates in each group was computed. The information matrix was computed and inverted (for each group). Then the variance-covariance matrices of the groups were added, which yielded the variance-covariance matrix of the differences between the estimates (Hambleton et al., 1991). This statistic is asymptotically distributed as a chi-square with k degrees of freedom, where k represents the number of
parameters compared (a, b, and c). For the 1-PL, where only b is compared, k = 1. Therefore, the expression for the 1-PL is

\chi^2 = \frac{b_{diff}^2}{v(b_1) + v(b_2)}

In this statistic, v(b_1) and v(b_2) represent the reciprocals of the information function for the difficulty parameter estimates (Hambleton et al., 1991). The Bonferroni adjustment was applied to control the Type I error rate. The alpha level of 0.05 applied to each test form was divided by 60, representing the 60 test items, resulting in a significance level of α = .00083 for each item comparison.

Measure of Agreement

Cohen first proposed kappa as a chance-corrected coefficient of agreement for nominal scales, and the kappa statistic is a frequently used measure of interobserver or inter-diagnostic agreement when the data are categorical (Bonnardel, 2005). This study examined the agreement of three methods for detecting DIF. The detection of DIF is categorical (DIF, No DIF). Therefore, the kappa statistic provides an ideal means of calculating the extent to which two of the methods agree in their detection of DIF. The kappa coefficient (K) measures pairwise agreement among a set of coders making category judgments, correcting for expected chance agreement:

K = \frac{P(A) - P(E)}{1 - P(E)}

where

P(A) = the proportion of times that the methods agree
P(E) = the proportion of times that we would expect them to agree by chance
When there is no agreement other than that which would be expected by chance, K is zero. When there is total agreement, K is one (Carletta, 2005; Fidalgo et al., 2004). The kappa statistic, calculated using SPSS, was used in this study to examine all paired methods for the detection of DIF and the degree of agreement between methods.

Summary

Using the response data of 38,955 examinees across five forms of the NNAAP administered in both the CBT and P&P modes, three methods of DIF detection were used to detect DIF across platforms. These methods were compared to determine whether they detected DIF equally in all items on the NNAAP. For each method, Type I error rates were adjusted using the Bonferroni adjustment, dividing .05 by 60, with 60 representing the number of items on each of the test forms. While the Bonferroni adjustment may result in inflated Type II error rates, the sample size is large. A list was compiled identifying how many items were flagged using the 1-PL, how many items were flagged using LR, and how many items were flagged using MH. In order to compare the methods examined in this study, data were reported identifying the items flagged by the 1-PL but neither MH nor LR, the items flagged by MH but neither the 1-PL nor LR, the items flagged by LR but neither the 1-PL nor MH, the items flagged by the 1-PL and MH, the items flagged by the 1-PL and LR, the items flagged by LR and MH, and the items flagged by all three methods. A kappa statistic was calculated to provide an index of agreement between paired methods of the LR, MH, and the 1-PL classification based on the inferential tests.
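A minimal sketch of the pairwise kappa computation described in the Measure of Agreement section is shown below, applied to hypothetical DIF/No-DIF flags from two methods; the flag vectors are invented for illustration.

```python
def kappa(flags_a, flags_b):
    """Cohen's kappa for two lists of categorical judgments
    (here, 1 = item flagged as DIF, 0 = not flagged)."""
    n = len(flags_a)
    p_agree = sum(a == b for a, b in zip(flags_a, flags_b)) / n
    # Chance agreement from the marginal flag rates of each method.
    pa1 = sum(flags_a) / n
    pb1 = sum(flags_b) / n
    p_chance = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (p_agree - p_chance) / (1 - p_chance)

# Hypothetical flags for 10 items from two DIF methods (e.g., MH and LR).
mh = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
lr = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(round(kappa(mh, lr), 3))
```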
CHAPTER FOUR: RESULTS

This chapter includes results related to the analyses outlined in Chapter Three: Method and conducted to answer the research questions:

1. What proportion of items exhibit differential item functioning (DIF) as determined by the following three methods: Mantel-Haenszel (MH), Logistic Regression (LR), and the 1-Parameter Logistic Model (1-PL)?

2. What is the level of agreement among Mantel-Haenszel, Logistic Regression, and the 1-Parameter Logistic Model in detecting differential item functioning between the computer-based (CBT) National Nurse Aide Assessment Program and the paper-and-pencil (P&P) National Nurse Aide Assessment Program?

The results are reported in six sections: (a) total correct, (b) reliability, (c) factor analysis, (d) DIF methodology, (e) agreement, and (f) post hoc analyses. The first section, total correct, presents the mean total for the sample and by state represented. The second section, reliability, reports the KR-20 reliability coefficients for the sample. The third section, factor analysis, discusses the
goodness-of-fit statistics for the confirmatory factor analysis conducted on each of the test forms by mode. The fourth section, DIF methodology, displays results for the Mantel-Haenszel, Logistic Regression, and 1-Parameter Logistic methods. The fifth section, agreement, discusses the items flagged as exhibiting DIF by method and the kappa statistic. The sixth section, post hoc analyses, displays the test characteristic curve for each test form in order to examine the impact of DIF items on the overall test, and it discusses nonuniform DIF in this study.

Total Correct

The mean total score for the sample of paper-and-pencil tests was 54.24 with a standard deviation of 4.83 (N = 34,411). The mean total score for the sample of computer-based tests was 52.27 with a standard deviation of 5.88 (N = 4,544). Table 7 reports the mean total score for each state represented in the sample.
Table 7
Mean total score for sample

State            N       Mean    SD      Minimum   Maximum   Kurtosis   Skewness
Alaska           485     56.00   3.56    34        60        9.11       -2.41
California       1166    53.66   5.88    17        60        7.25       -2.28
Colorado         3849    55.17   4.93    14        60        14.35      -3.12
Connecticut      4       13.00   1.83    11        15        -3.30      0.00
Louisiana        360     52.86   4.69    30        60        3.13       -1.37
Maryland         29      51.90   8.44    16        60        11.66      -3.05
Minnesota        5794    55.54   4.06    11        60        22.11      -3.38
North Dakota     244     55.61   3.57    37        60        6.59       -2.10
New Hampshire    21      55.05   3.56    48        59        -0.22      -0.74
New Jersey       4544    52.27   5.88    13        60        6.08       -1.96
New Mexico       42      52.79   6.58    17        59        21.62      -4.07
Nevada           555     54.91   4.39    25        60        7.62       -2.21
Pennsylvania     1       16.00   **      16        16        **         **
Rhode Island     1179    54.69   4.45    23        60        6.48       -1.99
Texas            18938   53.71   4.66    11        60        6.79       -1.92
Virginia         1677    53.53   6.21    13        60        9.07       -2.56
Virgin Islands   61      52.66   5.02    33        59        4.76       -1.83
Wisconsin        6       15.00   4.29    9         21        -0.55      0.00
Total            38955   54.01   5.00    9         60        9.74       -2.37

Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.

As the table indicates, when the data are viewed by total mean score for the sample and disaggregated by state or territory, the sample did not exhibit visible differences between the P&P administrations and the CBT administrations. It is worth noting that some of the states and territories had very small samples, such as Connecticut, Maryland, New Hampshire, New Mexico, Pennsylvania, the Virgin Islands, and Wisconsin. Scores in four of these small groups included extremely low total scores, such as Connecticut (11), New Mexico (17), Pennsylvania (16), and Wisconsin (9).

While New Jersey's pass rate for the 2004 calendar year appeared vastly different from those of the other states, and New Jersey was notable as the only state that administered the NNAAP via computer, this table shows that the mean total
score for New Jersey is not notably different from those of other states or from the total sample mean. This supports the statement that New Jersey has one of the higher cut scores, as determined by the state, for meeting certification requirements. When the data are broken down by test form and location (state or territory), the pattern continues. Tables of mean percent correct scores by state or territory for each test form are provided in Appendix C.

To further examine the data, the researcher conducted an ANOVA. The obtained F(17, 38937) = 140.71, p < .0001, was judged to be statistically significant (Type I error = .05). These results indicate that at least one pair of means differs. The proportion of variance associated with the state or territory was eta² = 0.058. The researcher then computed a contrast comparing the mean for New Jersey (M = 52.27) to the mean of all other states (M = 54.24). The proportion of variance associated with this contrast was eta² = 0.0047, showing that very little of the variation in scores is associated with whether or not an examinee tested in New Jersey.

Reliability

Reliability was computed for the sample data using the coefficient alpha (Kuder-Richardson KR-20) method in SPSS (SPSS, 2005). Results were very similar to those of the overall sample (see Table 8). The study sample results are reported in Table 9.
Table 8
NNAAP statistics for 2004 overall

Form     N        Mean Raw Score   Mean Point-Biserial   KR-20
Form A   5,441    53.28 (5.12)     0.24 (0.07)           0.81
Form V   10,018   54.09 (5.48)     0.25 (0.08)           0.81
Form W   8,751    53.38 (5.19)     0.27 (0.08)           0.83
Form X   9,137    53.11 (5.54)     0.30 (0.10)           0.85
Form Z   5,670    53.87 (5.08)     0.32 (0.11)           0.86

Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.

Table 9
Study sample KR-20 results (60 items)

Test Form    KR-20   N
CBT Form A   0.82    861
CBT Form V   0.84    922
CBT Form W   0.83    906
CBT Form X   0.86    908
CBT Form Z   0.84    947
P&P Form A   0.80    4,579
P&P Form V   0.80    9,093
P&P Form W   0.83    7,845
P&P Form X   0.82    8,198
P&P Form Z   0.81    4,696

Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.

Factor Analysis

Measurement ideally reduces an experience to one dimension; however, latent domains do not always conform (Briggs & Wilson, 2003). In fact, a latent domain can be divided into subcomponents until the number of domains may be equal to the number of items being administered. Similarly, it is likely that strict unidimensionality will be violated in real data (Millsap & Everson, 1993). Drasgow and Lissak (1983) stated that violation of the unidimensionality assumption is not consequential when a primary factor can be determined. Items that are essentially unidimensional are dominated by a single latent trait, with
Goodness-of-fit statistics (reported in Table 10) for the confirmatory factor analysis conducted on each of the test forms by mode indicate that a major construct is reflected in the factor analysis. All TLI values are greater than .9, and all RMSEA values are .026 or less, indicating good fit. Researchers generally accept RMSEA values of less than .05 as indicating a close approximation of model fit (Stevens, 2002).

Table 10
Confirmatory factor analysis results

Test Form  N      Chi-Square  df   p-value  CFI    TLI    RMSEA
Computer
A          861    240.408     151  0.0000   0.851  0.909  0.026
V          922    186.493     137  0.0032   0.911  0.959  0.020
W          906    248.079     180  0.0006   0.933  0.949  0.020
X          908    267.824     193  0.0003   0.932  0.972  0.021
Z          947    292.268     185  0.0000   0.893  0.944  0.025
Paper-and-Pencil
A          4,579  961.540     553  0.0000   0.923  0.970  0.013
V          9,093  1457.736    742  0.0000   0.943  0.981  0.010
W          7,845  1532.547    677  0.0000   0.931  0.979  0.013
X          8,198  2019.432    730  0.0000   0.819  0.972  0.015
Z          4,696  1050.859    496  0.0000   0.838  0.971  0.015
Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.

Differential Item Functioning

In reviewing the NNAAP items across test forms, items cannot be directly compared because common items are not shared across forms. A more complete picture is garnered by examining the flagged items for each test form across DIF methods. A Bonferroni adjustment was applied to the significance level for each method of analysis, so that the significance level was .00083 to control for the 60 items of each test form.
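The adjusted level follows directly from the Bonferroni rule, dividing the nominal Type I error rate by the number of items tested; an item is then flagged when its p-value falls below the adjusted level. A small sketch of that arithmetic (illustrative only):

    alpha = 0.05
    n_items = 60
    alpha_adj = alpha / n_items     # 0.00083, the per-item significance level used here

    def flag(p_value: float) -> bool:
        """Flag an item as exhibiting DIF under the Bonferroni-adjusted criterion."""
        return p_value < alpha_adj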

The following sections report the results by method for each test form. Results were then summarized by test form, reporting the items commonly flagged by a single DIF method or by multiple methods. Before discussing the results flagged by the various DIF methods, the researcher looked at the proportion of examinees responding correctly to each item by test form. Table 11 reports this proportion. The forms are listed in columns and the items in rows. If an item was not assessed by a test form (field-test items), the item is marked with an asterisk.
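The entries of Table 11 are item-level proportions correct computed separately for each form and administration mode. A minimal pandas sketch, assuming a long-format data frame with hypothetical columns form, mode, item, and correct (0/1):

    import pandas as pd

    def proportion_correct(long_df: pd.DataFrame) -> pd.DataFrame:
        """Proportion of examinees answering each item correctly, by form and mode.

        Field-test items that do not appear on a form simply have no rows and
        show up as missing values, mirroring the asterisks in Table 11.
        """
        return (
            long_df.groupby(["item", "form", "mode"])["correct"]
            .mean()
            .unstack(["form", "mode"])
            .round(2)
        )

    # usage: print(proportion_correct(responses_long))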

Table 11 Proportion of examinees responding correctly by item Proportion of Examinees Responding Correctly by Item Item Form A Form V Form W Form X Form Z CBT P&P CBT P&P CBT P&P CBT P&P CBT P&P s1 .84 .81 .96 .97 .71 .79 .99 .98 * s2 .61 .62 .92 .95 91 .96 .86 .87 .77 .84 s3 .91 .94 .97 .99 75 .79 .63 .66 .98 .98 s4 .88 .85 .64 .72 87 .95 .94 .93 .95 .96 s5 .91 .91 .97 .97 94 .97 .91 .91 .98 .97 s6 .82 .88 .82 .90 85 .94 .47 .57 .97 .96 s7 .96 .97 .90 .91 .90 .95 * .94 .95 s8 .79 .86 .90 .91 89 .91 .98 .98 .92 .92 s9 .97 .94 .94 .98 96 .96 .90 .90 .96 .97 s10 .98 .98 .75 .84 * .95 .96 .79 .90 s11 .99 .99 * .69 .76 .80 .85 .89 .93 s12 * * * .92 .97 .87 .92 s13 .89 .96 .99 99 .91 .93 * * s14 .88 .93 .94 .97 77 .80 .97 .98 .98 .99 s15 .69 .78 * .99 .99 .94 .97 .98 .99 s16 * .59 .66 .91 .89 .90 .96 .83 .92 s17 .56 .61 .94 .96 91 .94 .98 .99 .55 .73 s18 .99 .99 .98 .98 * .78 .85 .85 .93 s19 .97 .99 .98 .98 .97 .97 * .85 .91 s20 .92 .90 .90 .96 46 .65 .83 .93 .97 .96 s21 .98 .98 .84 .92 .93 .97 * .96 .98 s22 .79 .87 .90 .92 83 .94 .97 .98 .91 .94 s23 .89 .92 .97 .98 .93 .91 .80 .92 * s24 * .88 .95 * .86 .92 * s25 .95 .96 .85 .82 96 .95 .80 .87 .83 .88 s26 .78 .81 .96 .98 64 .75 .78 .78 .85 .92 s27 .97 .98 .96 .96 89 .94 .83 .86 .47 .58 101

Proportion of Examinees Responding Correctly by Item Item Form A Form V Form W Form X Form Z CBT P&P CBT P&P CBT P&P CBT P&P CBT P&P s28 .87 .89 * .84 .93 .78 .91 .96 .97 s29 .95 .97 * .86 .94 .75 .84 .77 .89 s30 .85 .81 .84 .89 96 .97 .96 .97 .84 .92 s31 .90 .89 .90 .85 98 .99 .92 .91 .98 .99 s32 * .86 .88 .80 .85 .86 .85 .80 .89 s33 .89 .91 .85 91 * .88 .94 * s34 * .97 .96 79 .85 .89 .92 * s35 .95 .96 .88 .95 99 .99 .64 .81 .96 .97 s36 .99 .99 .82 .92 * .96 .98 .86 .91 s37 .97 .95 .92 .93 79 .94 .86 .86 .51 .69 s38 .93 .93 .95 .91 70 .74 .98 .98 .76 .86 s39 .93 .97 .63 .72 * .69 .86 .96 .98 s40 .88 .88 .97 .97 * .92 .95 .89 .95 s41 .90 .92 .92 .94 99 .99 .97 .98 .85 .89 s42 * .99 .90 94 .97 * .97 .97 s43 * .99 .99 93 .95 * .80 .89 s44 .73 .85 .92 .93 92 .96 .93 .97 .89 .94 s45 .73 .85 * .87 .91 .93 .96 .78 .89 s46 .48 .75 * .95 .91 .83 .77 .98 .99 s47 .95 .94 .94 .98 95 .86 .53 .43 .88 .91 s48 .97 .98 .91 .94 92 .96 .91 .95 .83 .70 s49 .94 .97 .92 .91 97 .97 .97 .99 .76 .85 s50 .79 .89 .87 .92 89 .94 .95 .97 .92 .88 s51 * .94 .95 .93 .96 .94 .96 .75 .83 s52 .83 .87 .92 .94 92 .96 .67 .81 .85 .94 s53 .45 .55 .95 .96 .97 .98 * .83 .86 s54 .84 .83 .64 .66 61 .66 .75 .79 .99 .99 s55 .51 .67 .96 .96 * .94 .98 .61 .59 s56 .97 .95 .90 91 .85 .87 * * s57 .99 .99 .97 .98 .86 .88 .97 .99 * s58 .92 .96 .95 .95 78 .88 .97 .98 .89 .95 s59 .68 .75 .97 .98 98 .98 .96 .89 .96 .96 102

103 Proportion of Examinees Responding Correctly by Item Item Form A Form V Form W Form X Form Z CBT P&P CBT P&P CBT P&P CBT P&P CBT P&P s60 * .53 .55 .76 .74 .87 .92 .95 .93 s61 .94 .96 * 95 .93 * .60 .62 s62 .90 .88 * .99 .99 .86 .94 .98 .97 s63 .87 .91 * * .90 .91 .57 .88 s64 .98 .97 .97 .97 92 .93 .66 .78 .98 .96 s65 .88 .78 .99 .98 96 .99 .85 .90 .99 .97 s66 .99 .98 .97 .98 .93 .93 .99 .98 * s67 .96 .94 .83 .88 .99 .99 .66 .69 * s68 .94 .96 .89 .91 90 .87 .92 .94 .87 .85 s69 * .96 .98 68 .75 * .90 .95 s70 .66 .74 .78 .74 96 .98 .94 .97 .97 .98 Note: CBT=Computer-based Test; P&P= Paper-andPencil; indicates a field test item Copyright 2005 by Promissor, Inc. A ll rights reserved. Used with permission

Mantel-Haenszel

The Mantel-Haenszel analysis was conducted using SPSS (SPSS, 2005). Tables 12 to 16 report results by test form. In the tables reporting results for the Mantel-Haenszel, the items are listed in the first column, and the next column reports the conditional independence chi-squared test; the Mantel-Haenszel statistic is asymptotically distributed as a chi-squared distribution with one degree of freedom. This is followed by the odds ratio estimate and its significance in the third and fourth columns, respectively (the odds ratio is a measure of effect size). The final column is marked with an asterisk for items that are statistically significant, that is, flagged as exhibiting DIF. For example, in test Form A (Table 12), item 1 has a chi-squared value of 12.617 (column 2). The odds ratio estimate for item 1 is 0.688 (column 3), which indicates that the CBT examinees' odds of answering correctly were 1.5 (or 1/0.688) times the odds of the P&P examinees. The asymptotic significance is <.001 (column 4), indicating that the item is statistically significant, and an asterisk is found in the last column representing this significance.

In test Form A, 16 items were flagged as exhibiting DIF (see Table 12). Of these 16 items, 6 favored (that is, the item was less difficult for) the P&P examinees and 10 favored the CBT examinees. The greatest odds ratio flagged, favoring the paper-and-pencil administration group, was for item 46 (OR = 2.939). For this item, P&P examinees' odds of answering correctly were almost three times the odds of the CBT examinees. The smallest odds ratio, favoring the computer-based test group, was for item 65 (OR = 0.271). For this item, CBT

examinees odds of answering correctly were 3.7 times the odds of the P&P examinees. Table 12 Mantel-Haenszel results for form A Mantel-Haenszel Results: Form A (N = 5,441) Chisquared Odds Ratio Asymp Sig Significant Item Item Chisquared Odds Ratio Asymp Sig Significant s1 12.617 0.688 <.001 s 36 2.813 0.311 0.073 s2 9.629 0.769 0.002 s 37 15.239 0.415 <.001 s3 0.235 0.922 0.581 s38 2.804 0.758 0.077 s4 18.716 0.595 <.001 s39 4.239 1.461 0.033 s5 3.278 0.768 0.060 s40 5.989 0.734 0.012 s6 8.380 1.365 0.003 s41 1.208 0.857 0.251 s7 0.046 0.930 0.747 s 44 31.282 1.703 <.001 s8 1.708 1.154 0.172 s 45 39.180 1.785 <.001 s9 22.352 0.356 <.001 s46 158.028 2.939 <.001 s10 0.386 0.787 0.438 s47 2.357 0.760 0.109 s11 0.008 0.968 0.927 s48 0.001 0.981 0.931 s13 26.330 2.069 <.001 s49 0.809 1.207 0.320 s14 9.356 1.477 0.002 s 50 28.645 1.796 <.001 s15 5.015 1.224 0.023 s52 2.167 1.173 0.128 s17 0.390 1.053 0.507 s53 3.187 1.161 0.068 s18 0.589 0.717 0.347 s 54 15.914 0.640 <.001 s19 1.311 1.469 0.191 s 55 36.121 1.623 <.001 s20 13.883 0.584 <.001 s56 14.190 0.442 <.001 s21 1.971 0.669 0.133 s57 0.069 0.843 0.654 s22 5.589 1.293 0.016 s58 4.768 1.446 0.022 s23 0.351 1.089 0.511 s59 1.219 1.103 0.253 s25 0.075 0.933 0.713 s61 0.003 1.007 0.971 s26 1.150 0.891 0.259 s62 7.665 0.705 0.005 s27 0.148 0.885 0.617 s63 0.026 0.972 0.822 s28 1.361 0.862 0.220 s64 7.572 0.455 0.005 s29 0.018 0.951 0.812 s 65 107.667 0.271 <.001 s30 21.867 0.602 <.001 s66 5.887 0.392 0.012 s31 13.581 0.600 <.001 s67 12.132 0.518 0.001 s33 0.002 1.013 0.918 s68 0.001 0.988 0.947 s35 0.013 0.963 0.837 s70 3.557 1.179 0.054 Note: *p-value <.00083 Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission. The stem and leaf plot for the MH odds ratio estimates is displayed in Figure 2. As the figure illu strates, the sample displayed a positively skewed 105

distribution with two extreme values of 2.069 and 2.939. Most estimates ranged from 0.2 through 1.7, with an average odds ratio of 0.99.

Odds Ratio Estimate (Mantel-Haenszel) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  4.00      0  2333
  6.00      0  444555
 13.00      0  6666677777777
 14.00      0  88888999999999
  9.00      1  000011111
  4.00      1  2223
  4.00      1  4444
  4.00      1  6777
  2.00      Extremes (>=2.1)
Stem width: 1.000
Each leaf: 1 case(s)

Figure 2. Stem and leaf plot for form A MH odds ratio estimates

In test Form V, 15 items were flagged as exhibiting DIF (see Table 13). Of these 15 items, 10 favored the P&P examinees and 5 favored the CBT examinees. The greatest odds ratio, favoring the paper-and-pencil administration group, was 2.379 for item 9. The smallest odds ratio, favoring the computer-based test group, was 0.105 for item 42.
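The Mantel-Haenszel odds ratios in Tables 12 to 16 pool 2x2 tables of administration mode by item correctness across total-score strata. The following NumPy sketch of the pooled (common) odds ratio is offered as an illustration rather than the SPSS procedure used in the study; it assumes 0/1 item scores, total scores as the matching variable, and a 0/1 mode indicator, and the direction of the resulting ratio depends on which group is coded 1:

    import numpy as np

    def mh_common_odds_ratio(correct, total_score, group):
        """Mantel-Haenszel common odds ratio for one item, stratified by total score.

        correct:      0/1 item score per examinee
        total_score:  matching variable (raw total score) defining the strata
        group:        0/1 administration-mode indicator; the value returned is the
                      odds of a correct answer for group 1 relative to group 0
        """
        correct = np.asarray(correct)
        total_score = np.asarray(total_score)
        group = np.asarray(group)

        num = 0.0   # sum over strata of A_k * D_k / N_k
        den = 0.0   # sum over strata of B_k * C_k / N_k
        for k in np.unique(total_score):
            m = total_score == k
            a = np.sum((group[m] == 1) & (correct[m] == 1))  # group 1, correct
            b = np.sum((group[m] == 1) & (correct[m] == 0))  # group 1, incorrect
            c = np.sum((group[m] == 0) & (correct[m] == 1))  # group 0, correct
            d = np.sum((group[m] == 0) & (correct[m] == 0))  # group 0, incorrect
            n = a + b + c + d
            if n > 0:
                num += a * d / n
                den += b * c / n
        return num / den

    # usage (hypothetical arrays): mh_common_odds_ratio(x[:, 0], x.sum(axis=1), mode)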

Table 13 Mantel-Haenszel results for form V Mantel-Haenszel Results: Form V (N = 10,018) Item Chisquared Odds Ratio Asymp Sig Significant Item Chisquared Odds Ratio Asymp Sig Significant s1 0.147 0.912 0.633 s36 46.541 2.137 <.001 s2 1.825 1.232 0.152 s37 1.685 0.828 0.168 s3 2.099 1.406 0.122 s38 31.058 0.421 <.001 s4 4.508 1.185 0.030 s39 13.466 1.324 <.001 s5 0.641 0.834 0.370 s40 2.597 0.685 0.090 s6 18.185 1.539 <.001 s41 1.413 1.185 0.208 s7 0.583 0.905 0.408 s42 88.391 0.105 <.001 s8 2.579 0.818 0.100 s43 1.486 0.483 0.171 s9 26.460 2.379 <.001 s44 0.590 0.894 0.407 s10 14.515 1.429 <.001 s47 18.288 2.067 <.001 s13 5.195 0.390 0.018 s48 2.327 1.229 0.111 s14 8.631 1.610 0.003 s49 6.636 0.710 0.009 s16 2.287 1.126 0.122 s50 3.728 1.265 0.049 s17 1.151 1.208 0.248 s51 3.367 0.733 0.058 s18 7.023 0.402 0.007 s52 1.676 0.820 0.168 s19 2.489 0.648 0.095 s53 0.245 0.905 0.564 s20 13.574 1.678 <.001 s54 2.612 0.880 0.099 s21 22.819 1.666 <.001 s55 10.867 0.516 0.001 s22 1.582 0.846 0.194 s56 0.350 0.924 0.514 s23 0.685 0.802 0.347 s57 0.136 0.899 0.635 s24 42.603 2.177 <.001 s58 4.249 0.703 0.033 s25 12.510 0.700 <.001 s59 0.211 1.134 0.572 s26 3.040 1.435 0.068 s60 0.780 0.934 0.357 s27 3.895 0.695 0.041 s64 3.916 0.636 0.040 s30 1.629 1.154 0.183 s65 5.839 0.433 0.013 s31 44.089 0.445 <.001 s66 0.013 0.942 0.810 s32 1.365 0.875 0.220 s67 6.031 1.289 0.012 s33 8.640 1.392 0.003 s68 0.009 1.017 0.881 s34 5.583 0.618 0.015 s69 4.390 1.508 0.031 s35 30.299 1.964 <.001 s70 23.900 0.652 <.001 Note: *p-value <.00083 Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission. The stem and leaf plot for the MH odds ratio estimates is displayed in Figure 3. As the figure illust rates, the sample had a posi tively skewed distribution with one extreme value of 2.379. Most estimates ranged from 0.1 through 2.1 with an average odds ratio of 1.1. 107

Odds Ratio Estimates (Mantel-Haenszel) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  7.00      0  1344444
 27.00      0  566666677778888888888999999
 16.00      1  0111112222233444
  6.00      1  556669
  3.00      2  011
  1.00      Extremes (>=2.4)
Stem width: 1
Each leaf: 1 case(s)

Figure 3. Stem and leaf plot for form V MH odds ratio estimates

In test Form W, 14 items were flagged as exhibiting DIF (see Table 14). Of these 14 items, 7 favored the P&P examinees and 7 favored the CBT examinees. The greatest odds ratio, favoring the paper-and-pencil administration group, was for item 37 (OR = 3.475). The smallest odds ratio, favoring the computer-based test group, was for item 47 (OR = 0.186).

Table 14 Mantel-Haenszel results for form W Mantel-Haenszel Results: Form W (N = 8,751) Item Chisquared Odds Ratio Asymp Sig Significant Item Chisquared Odds Ratio Asymp Sig Significant s1 2.634 1.147 0.097 s37 167.256 3.475 <.001 s2 11.407 1.629 0.001 s38 5.478 0.822 0.018 s3 1.962 0.879 0.145 s41 1.713 0.529 0.136 s4 32.848 1.996 <.001 s42 7.436 1.560 0.007 s5 1.193 1.213 0.250 s43 0.001 1.016 0.913 s6 37.156 2.057 <.001 s44 3.545 1.332 0.052 s7 5.946 1.424 0.012 s45 0.032 0.973 0.815 s8 3.326 0.793 0.058 s46 48.778 0.302 <.001 s9 0.873 0.827 0.303 s47 122.097 0.186 <.001 s11 2.375 0.868 0.109 s48 5.302 1.439 0.015 s13 0.294 0.923 0.544 s49 1.904 0.745 0.147 s14 2.349 0.868 0.114 s50 5.667 1.360 0.015 s15 0.046 1.148 0.689 s51 1.400 1.218 0.221 s16 17.834 0.570 <.001 s52 3.625 1.338 0.050 s17 0.037 0.966 0.796 s53 1.833 0.700 0.143 s19 2.935 0.670 0.071 s54 0.014 1.012 0.878 s20 53.467 1.726 <.001 s56 0.005 1.012 0.905 s21 10.044 1.655 0.002 s57 1.619 0.866 0.184 s22 48.133 2.120 <.001 s58 4.603 1.243 0.029 s23 25.121 0.476 <.001 s59 0.793 0.770 0.317 s25 21.001 0.403 <.001 s60 34.837 0.583 <.001 s26 3.987 1.179 0.042 s61 9.946 0.590 0.001 s27 4.469 1.312 0.031 s62 7.053 0.255 0.009 s28 19.968 1.696 <.001 s64 4.594 0.733 0.029 s29 25.233 1.792 <.001 s65 2.226 1.418 0.119 s30 0.230 0.887 0.562 s66 2.801 0.782 0.081 s31 0.335 0.812 0.472 s67 0.023 0.885 0.739 s32 0.311 0.943 0.549 s68 25.666 0.546 <.001 s34 2.206 0.856 0.127 s69 0.081 0.974 0.747 s35 0.768 0.652 0.297 s70 0.438 1.196 0.430 Note: *p-value <.00083 Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission. The stem and leaf plot for the MH odds ratio estimates is displayed in Figure 4. As the figure illu strates, the sample had a slightly positively skewed distribution with one extreme value of 3.745. Most estimates ranged from 0.1 through 2.1 with an average odds ratio of 1.1. 109

Odds Ratio Estimates (Mantel-Haenszel) Stem-and-Leaf Plot Frequency Stem & Leaf 5.00 0 12344 28.00 0 5555566777777888888888899999 17.00 1 00011112223333444 7.00 1 5666779 2.00 2 01 1.00 Extremes (>=3.5) Stem width: 1.00 Each leaf: 1 case(s) Figure 4. Stem and leaf plot fo r form W MH odds ratio estimates In test Form X, 16 items were fla gged as exhibiting DIF (see Table 15). Out of these 16 items, 7 favored the P&P examinees and 9 favored the CBT examinees. The greatest odds ratio, fa voring the paper-and-penc il administration group, was for item 23 (OR = 2.193). The smallest odds ratio, favoring the computer-based test group, was for item 66 (OR = 0.126). 110

Table 15 Mantel Haenszel results for form X Mantel-Haenszel Results: Form X (N = 9.137) Item Chisquared Odds Ratio Asymp Sig Significant Item Chisquared Odds Ratio Asymp Sig Significant s1 3.774 0.529 0.040 s35 68.748 1.961 <.001 s2 14.976 0.645 <.001 s36 0.415 1.187 0.447 s3 3.771 0.856 0.047 s37 28.251 0.532 <.001 s4 11.850 0.564 <.001 s38 0.616 0.789 0.372 s5 11.553 0.635 0.001 s39 72.861 2.101 <.001 s6 8.193 1.247 0.004 s40 5.079 1.379 0.020 s8 0.844 0.733 0.287 s41 0.021 0.946 0.801 s9 9.809 0.682 0.002 s44 0.386 1.123 0.481 s10 4.415 0.671 0.029 s45 0.090 0.941 0.707 s11 0.035 1.023 0.815 s46 47.253 0.518 <.001 s12 10.651 1.610 0.001 s47 103.242 0.463 <.001 s14 0.278 0.871 0.529 s48 0.881 1.158 0.313 s15 0.499 1.154 0.424 s49 0.815 1.296 0.296 s16 8.696 1.554 0.002 s50 0.996 0.791 0.271 s17 0.002 0.949 0.855 s51 5.148 0.653 0.021 s18 1.409 1.122 0.217 s52 47.313 1.731 <.001 s20 52.260 2.169 <.001 s54 0.272 0.953 0.574 s22 1.079 1.308 0.243 s55 10.393 1.945 0.001 s23 59.637 2.193 <.001 s57 2.434 1.568 0.085 s24 6.602 1.330 0.009 s58 2.006 0.687 0.128 s25 1.724 1.142 0.177 s59 87.310 0.184 <.001 s26 9.662 0.752 0.002 s60 0.885 1.123 0.319 s27 0.000 0.993 0.944 s62 28.697 1.845 <.001 s28 40.207 1.901 <.001 s63 8.680 0.684 0.003 s29 4.443 1.215 0.032 s64 10.571 1.326 0.001 s30 0.511 0.852 0.419 s65 1.122 0.874 0.262 s31 15.741 0.585 <.001 s66 20.722 0.126 <.001 s32 12.714 0.681 <.001 s67 1.147 0.917 0.266 s33 8.450 1.471 0.003 s68 3.610 0.738 0.047 s34 0.000 1.007 0.955 s70 0.593 1.197 0.370 Note: *p-value <.00083 Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission. The stem and leaf plot for the MH odds ratio estimates is displayed in Figure 5. As the figure illu strates, the sample had a slightly positively skewed distribution with no extreme values. Es timates ranged from 0.1 through 2.1 with an average odds ratio of 1.1. 111

Odds Ratio Estimates (Mantel-Haenszel) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  3.00      0  114
 28.00      0  5555566666666777778888999999
 18.00      1  001111111122233334
  8.00      1  55678999
  3.00      2  111
Stem width: 1.00
Each leaf: 1 case(s)

Figure 5. Stem and leaf plot for form X MH odds ratio estimates

In test Form Z, 14 items were flagged as exhibiting DIF (see Table 16). Of these 14 items, 6 favored the P&P examinees and 8 favored the CBT examinees. The greatest odds ratio, favoring the paper-and-pencil administration group, was for item 63 (OR = 4.254). The smallest odds ratio, favoring the computer-based test group, was for item 65 (OR = 0.237). In each of these flagged items, the significance level of the odds ratio was < 0.001.

Table 16 Mantel Haenszel results for form Z Mantel-Haenszel Results: Form Z (N = 5,670) Item Chisquared Odds Ratio Asymp Sig Significant Item Chisquared Odds Ratio Asymp Sig Significant s2 0.196 0.955 0.624 s37 26.690 1.506 <.001 s3 6.073 0.507 0.013 s38 5.690 1.260 0.015 s4 2.962 0.730 0.075 s39 2.183 0.683 0.110 s5 6.903 0.521 0.007 s40 2.365 1.236 0.115 s6 12.860 0.456 <.001 s41 1.825 0.858 0.159 s7 7.420 0.620 0.005 s42 9.734 0.479 0.001 s8 2.111 0.817 0.130 s43 1.804 1.163 0.164 s9 0.534 0.849 0.414 s44 0.755 1.129 0.353 s10 19.109 1.580 <.001 s45 6.580 1.309 0.009 s11 0.002 1.016 0.907 s46 0.308 1.344 0.453 s12 0.135 1.051 0.673 s47 7.134 0.717 0.007 s14 0.052 0.898 0.712 s48 68.851 0.461 <.001 s15 0.330 1.252 0.470 s49 0.739 1.092 0.363 s16 10.727 1.447 0.001 s50 53.907 0.368 <.001 s17 22.361 1.465 <.001 s51 13.782 1.383 <.001 s18 1.822 1.188 0.160 s52 11.105 1.523 0.001 s19 0.722 0.890 0.345 s53 10.911 0.693 0.001 s20 7.859 0.541 0.004 s54 1.411 0.560 0.174 s21 1.988 0.714 0.130 s55 30.268 0.649 <.001 s22 0.069 1.049 0.737 s58 0.449 1.117 0.454 s25 0.340 1.066 0.527 s59 1.137 0.806 0.253 s26 0.366 1.082 0.504 s60 19.543 0.472 <.001 s27 0.309 1.047 0.555 s61 3.776 0.859 0.047 s28 2.766 0.682 0.075 s62 8.505 0.471 0.003 s29 30.186 1.677 <.001 s63 335.891 4.254 <.001 s30 3.959 1.268 0.041 s64 22.554 0.274 <.001 s31 0.000 0.955 0.882 s65 21.495 0.237 <.001 s32 1.577 1.150 0.194 s68 45.782 0.450 <.001 s35 1.243 0.769 0.222 s69 2.762 1.259 0.088 s36 0.621 0.905 0.396 s70 0.006 1.054 0.836 Note: *p-value <.00083 Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission. The stem and leaf plot for the MH odds ratio estimates is displayed in Figure 6. As the figure illu strates, the sample displayed a positively skewed distribution with one extreme value of 4. 254. Most estimates ranged from 0.2 through 1.6 with an average odds ratio of 0.98. 113

Odds Ratio Estimates (Mantel-Haenszel) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  3.00      0  223
 10.00      0  4444445555
  9.00      0  666667777
 10.00      0  8888888999
 13.00      1  0000000011111
  8.00      1  22222333
  5.00      1  44555
  1.00      1  6
  1.00      Extremes (>=4.3)
Stem width: 1.00
Each leaf: 1 case(s)

Figure 6. Stem and leaf plot for form Z MH odds ratio estimates

Logistic Regression

The logistic regression odds ratios were computed using SAS (SAS, 2003). In the tables reporting results for the logistic regression (Tables 17 to 21), the items are listed in the first column, and the next two columns report the maximum likelihood estimate and its p-value. This is followed by the odds ratio point estimate in the fourth column. The final column is marked with an asterisk for items that are statistically significant, that is, flagged as exhibiting DIF. For example, in test Form A (Table 17), item 1 has a maximum likelihood estimate of -0.319 (column 2) and a p-value of 0.002 (column 3), indicating that the P&P examinees had an increased likelihood of getting the item incorrect; however, this value is not statistically significant at the adjusted level. The odds ratio point estimate for item 1 is 0.727 (column 4), which indicates that the CBT examinees' odds of answering correctly were 1.4 times the odds of the P&P examinees. The item is not statistically significant, so an asterisk is not found in the last column.
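The logistic regression approach models the probability of a correct response from the total score (the matching criterion) plus an administration-mode indicator; exponentiating the mode coefficient gives the odds ratio point estimate (for the Form A example above, exp(-0.319) is approximately 0.727). The following is a minimal Python sketch of that uniform-DIF model, offered as an illustration rather than the author's SAS code, assuming arrays correct, total_score, and cbt (1 = CBT, 0 = P&P):

    import numpy as np
    import statsmodels.api as sm

    def lr_dif(correct, total_score, cbt):
        """Uniform-DIF logistic regression for a single item.

        Model: logit P(correct) = b0 + b1*total_score + b2*cbt.
        Returns the group coefficient, its p-value, and the odds ratio exp(b2).
        """
        X = sm.add_constant(np.column_stack([total_score, cbt]))
        fit = sm.Logit(np.asarray(correct), X).fit(disp=False)
        b2, p2 = fit.params[2], fit.pvalues[2]
        return b2, p2, np.exp(b2)

    # usage: est, p, odds_ratio = lr_dif(x[:, 0], x.sum(axis=1), mode)
    # flag the item if p < .00083 (the Bonferroni-adjusted level used here)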

In test Form A, 17 items were flagged as exhibiting DIF (see Table 17). Of these 17 items, 7 favored (that is, the item was less difficult for) the P&P examinees and 10 favored the CBT examinees. The greatest odds ratio, favoring the paper-and-pencil administration group, was for item 46 (OR = 2.967). For this item, P&P examinees' odds of answering correctly were almost three times the odds of the CBT examinees. The smallest odds ratio, favoring the computer-based test group, was for item 65 (OR = 0.283). For this item, CBT examinees' odds of answering correctly were 3.5 times the odds of the P&P examinees.

Table 17 Logistic Regression results for form A Logistic Regression Results: Form A (N = 5,441) Maximum Likelihood Odds Ratio Maximum Likelihood Odds Ratio Item Estimate p value Point Estimate Significant Item Estimate p value Point Estimate Significant s1 -0.319 0.002 0.727 s 36 -0.805 0.147 0.447 s2 -0.231 0.006 0.793 s37 -0.821 <.001 0.440 s3 -0.040 0.791 0.961 s38 -0.197 0.202 0.821 s4 -0.513 <.001 0.598 s39 0.465 0.009 1.592 s5 -0.233 0.092 0.792 s40 -0.307 0.014 0.736 s6 0.349 <.001 1.418 s 41 -0.107 0.440 0.899 s7 -0.030 0.894 0.971 s44 0.564 <.001 1.757 s8 0.161 0.126 1.174 s45 0.634 <.001 1.885 s9 -1.065 <.001 0.345 s46 1.088 <.001 2.967 s10 -0.276 0.345 0.759 s47 -0.307 0.068 0.736 s11 -0.113 0.749 0.893 s48 0.005 0.982 1.005 s13 0.829 <.001 2.290 s49 0.145 0.454 1.156 s14 0.405 0.001 1.499 s50 0.586 <.001 1.796 s15 0.219 0.014 1.244 s52 0.198 0.058 1.219 s17 0.072 0.353 1.075 s53 0.175 0.030 1.192 s18 -0.348 0.300 0.706 s54 -0.384 <.001 0.681 s19 0.440 0.132 1.552 s55 0.515 <.001 1.674 s20 -0.487 <.001 0.614 s56 -0.814 <.001 0.443 s21 -0.404 0.140 0.668 s57 -0.219 0.557 0.804 s22 0.313 0.004 1.368 s58 0.451 0.005 1.570 s23 0.172 0.185 1.187 s59 0.138 0.106 1.148 s25 0.019 0.920 1.019 s61 0.053 0.767 1.054 s26 -0.082 0.416 0.921 s62 -0.350 0.006 0.704 s27 -0.035 0.883 0.966 s63 0.034 0.792 1.034 s28 -0.124 0.308 0.883 s64 -0.770 0.007 0.463 s29 0.105 0.610 1.111 s 65 -1.262 <.001 0.283 s30 -0.476 <.001 0.621 s66 -0.861 0.014 0.423 s31 -0.507 <.001 0.603 s67 -0.702 <.001 0.496 s33 0.065 0.604 1.068 s68 0.039 0.830 1.039 s35 -0.067 0.714 0.935 s70 0.184 0.031 1.202 Note: *p-value <.00083 Copyright 2005 by Promissor, Inc. A ll rights reserved. Used with permission The stem and leaf plot in Figure 7 di splays the odds ratio estimates for the logistic regression method for Form A. The distribution was bimodal with two 116

extreme values of 2.290 and 2.967. Most estimates ranged from 0.2 through 1.8, with an average odds ratio of 1.0.

Odds Ratio Estimates (Logistic Regression) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  2.00      0  23
  7.00      0  4444445
 13.00      0  6666677777777
 10.00      0  8888899999
 13.00      1  0000000111111
  4.00      1  2223
  5.00      1  44555
  3.00      1  677
  1.00      1  8
  2.00      Extremes (>=2.3)
Stem width: 1.000
Each leaf: 1 case(s)

Figure 7. Stem and leaf plot for form A LR odds ratio estimates

In test Form V, 16 items were flagged as exhibiting DIF (see Table 18). Of these 16 items, 10 favored the P&P examinees and 6 favored the CBT examinees. The greatest odds ratio (OR), favoring the paper-and-pencil administration group, was for item 9 (OR = 2.657). The smallest odds ratio, favoring the computer-based test group, was for item 42 (OR = 0.113).

Table 18 Logistic Regression results for form V Logistic Regression Results: Form V (N= 10,018) Maximum Likelihood Odds Ratio Maximum Likelihood Odds Ratio Item Estimate p value Point Estimate Significant Item Estimate p value Point Estimate Significant s1 -0.069 0.726 0.933 s36 0.824 <.001 2.279 s2 0.247 0.097 1.280 s 37 -0.166 0.229 0.847 s3 0.436 0.052 1.546 s 38 -0.871 <.001 0.419 s4 0.189 0.015 1.208 s39 0.287 <.001 1.333 s5 -0.160 0.431 0.852 s40 -0.389 0.092 0.678 s6 0.461 <.001 1.583 s41 0.205 0.129 1.228 s7 -0.079 0.515 0.924 s42 -2.184 <.001 0.113 s8 -0.163 0.195 0.850 s43 -0.674 0.185 0.510 s9 0.977 <.001 2.657 s 44 -0.088 0.525 0.916 s10 0.389 <.001 1.476 s47 0.850 <.001 2.340 s13 -0.909 0.018 0.403 s48 0.227 0.079 1.254 s14 0.441 0.007 1.554 s 49 -0.326 0.016 0.722 s16 0.132 0.085 1.141 s50 0.284 0.021 1.328 s17 0.215 0.205 1.240 s 51 -0.275 0.111 0.760 s18 -0.793 0.020 0.452 s52 -0.151 0.305 0.860 s19 -0.399 0.138 0.671 s53 -0.113 0.529 0.893 s20 0.626 <.001 1.869 s 54 -0.119 0.128 0.888 s21 0.543 <.001 1.722 s55 0.720 <.001 0.487 s22 -0.148 0.275 0.863 s56 -0.065 0.604 0.937 s23 -0.190 0.429 0.827 s57 -0.070 0.760 0.932 s24 0.844 <.001 2.326 s 58 -0.338 0.051 0.713 s25 -0.345 <.001 0.708 s59 0.083 0.723 1.087 s26 0.390 0.052 1.477 s 60 -0.060 0.408 0.941 s27 -0.381 0.039 0.684 s64 -0.460 0.038 0.631 s30 0.163 0.133 1.178 s 65 -0.884 0.012 0.413 s31 -0.792 <.001 0.453 s66 -0.017 0.948 0.983 s32 -0.105 0.341 0.900 s67 0.241 0.017 1.272 s33 0.362 0.001 1.437 s68 0.035 0.761 1.036 s34 -0.477 0.019 0.621 s69 0.460 0.017 1.585 s35 0.755 <.001 2.127 s 70 -0.404 <.001 0.668 Note: *p-value <.00083 Copyright 2005 by Promissor, Inc. A ll rights reserved. Used with permission The stem and leaf plot in Figure 8 di splays the odds ratio estimates for the logistic regression method for Form V. The distribution had a positive skew with 118

four extreme values of 2.279, 2.326, 2.340, and 2.657. Most estimates ranged from 0.1 through 2.1, with an average odds ratio of 1.1.

Odds Ratio Estimates (Logistic Regression) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  7.00      0  1444444
 27.00      0  566666677778888888899999999
 15.00      1  001122222233444
  6.00      1  555578
  1.00      2  1
  4.00      Extremes (>=2.3)
Stem width: 1
Each leaf: 1 case(s)

Figure 8. Stem and leaf plot for form V LR odds ratio estimates

In test Form W, 17 items were flagged as exhibiting DIF (see Table 19). Of these 17 items, 10 favored the P&P examinees and 7 favored the CBT examinees. The greatest odds ratio, favoring the paper-and-pencil administration group, was for item 37 (OR = 3.917). The smallest odds ratio, favoring the computer-based test group, was for item 47 (OR = 0.171).

Table 19 Logistic Regression results for form W Logistic Regression Results: Form W (N = 8,751) Maximum Likelihood Odds Ratio Maximum Likelihood Odds Ratio Item Estimate p value Point Estimate Significant Item Estimate p value Point Estimate Significant s1 0.194 0.020 1.214 s37 1.365 <.001 3.917 s2 0.560 <.001 1.750 s 38 -0.148 0.077 0.862 s3 -0.063 0.472 0.939 s41 -0.486 0.226 0.615 s4 0.788 <.001 2.199 s42 0.533 0.002 1.704 s5 0.293 0.091 1.340 s43 0.100 0.495 1.105 s6 0.847 <.001 2.332 s44 0.389 0.009 1.476 s7 0.454 0.001 1.574 s45 0.062 0.593 1.064 s8 -0.209 0.092 0.812 s46 -1.131 <.001 0.323 s9 -0.167 0.365 0.846 s47 -1.765 <.001 0.171 s11 -0.068 0.441 0.935 s48 0.405 0.006 1.499 s13 -0.014 0.917 0.986 s49 -0.303 0.145 0.738 s14 -0.093 0.304 0.912 s50 0.400 0.002 1.492 s15 0.047 0.887 1.048 s51 0.246 0.142 1.279 s16 -0.538 <.001 0.584 s52 0.398 0.007 1.489 s17 0.077 0.567 1.080 s 53 -0.325 0.179 0.723 s19 -0.279 0.206 0.756 s54 0.044 0.558 1.045 s20 0.580 <.001 1.786 s56 0.011 0.911 1.012 s21 0.625 <.001 1.869 s 57 -0.105 0.337 0.901 s22 0.833 <.001 2.300 s58 0.333 <.001 1.395 s23 -0.701 <.001 0.496 s59 -0.138 0.598 0.871 s25 -0.736 <.001 0.479 s60 -0.485 <.001 0.616 s26 0.216 0.008 1.241 s 61 -0.542 0.001 0.582 s27 0.341 0.007 1.406 s 62 -1.495 0.004 0.224 s28 0.608 <.001 1.836 s 64 -0.252 0.084 0.777 s29 0.629 <.001 1.876 s65 0.561 0.014 1.753 s30 -0.110 0.588 0.896 s66 -0.216 0.127 0.806 s31 0.115 0.689 1.121 s 67 -0.060 0.861 0.941 s32 0.016 0.870 1.016 s 68 -0.545 <.001 0.580 s34 -0.090 0.390 0.914 s69 0.007 0.929 1.008 s35 -0.477 0.243 0.621 s70 0.328 0.141 1.388 Note: *p-value <.00083 Copyright 2005 by Promissor, Inc. A ll rights reserved. Used with permission The stem and leaf plot in Figure 9 di splays the odds ratio estimates for the logistic regression method for Form W. The distribution was positively skewed 120

with one extreme value of 3.917. Most estimates ranged from 0.1 through 2.3, with an average odds ratio of 1.2.

Odds Ratio Estimates (Logistic Regression) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  5.00      0  12344
 23.00      0  55566677778888889999999
 20.00      1  00000001122233344444
  8.00      1  57777888
  3.00      2  133
  1.00      Extremes (>=3.9)
Stem width: 1.00
Each leaf: 1 case(s)

Figure 9. Stem and leaf plot for form W LR odds ratio estimates

In test Form X, 19 items were flagged as exhibiting DIF (see Table 20). Of these 19 items, 11 favored the P&P examinees and 8 favored the CBT examinees. The greatest odds ratio, favoring the paper-and-pencil administration group, was for item 55 (OR = 2.297). The smallest odds ratio, favoring the computer-based test group, was for item 66 (OR = 0.153).

Table 20 Logistic Regression results for form X Logistic Regression Results: Form X (N = 9,137) Maximum Likelihood Odds Ratio Maximum Likelihood Odds Ratio Item Estimate p value Point Estimate Significant Item Estimate p value Point Estimate Significant s1 -0.491 0.094 0.612 s35 0.690 <.001 1.993 s2 -0.432 <.001 0.649 s36 0.177 0.447 1.193 s3 -0.151 0.052 0.860 s37 -0.629 <.001 0.533 s4 -0.531 <.001 0.588 s38 -0.288 0.300 0.749 s5 -0.441 0.001 0.643 s39 0.820 <.001 2.270 s6 0.189 0.011 1.208 s40 0.308 0.024 1.360 s8 -0.256 0.355 0.774 s41 -0.048 0.830 0.953 s9 -0.351 0.005 0.704 s44 0.172 0.311 1.187 s10 -0.379 0.041 0.684 s45 -0.034 0.842 0.966 s11 0.037 0.702 1.038 s 46 -0.635 <.001 0.530 s12 0.566 <.001 1.761 s 47 -0.765 <.001 0.465 s14 -0.133 0.532 0.875 s48 0.199 0.186 1.221 s15 0.158 0.376 1.171 s49 0.409 0.102 1.506 s16 0.511 <.001 1.667 s 50 -0.137 0.529 0.872 s17 0.037 0.896 1.037 s 51 -0.425 0.027 0.654 s18 0.145 0.120 1.156 s52 0.564 <.001 1.757 s20 0.804 <.001 2.235 s 54 -0.025 0.777 0.976 s22 0.274 0.244 1.315 s55 0.831 <.001 2.297 s23 0.827 <.001 2.287 s57 0.553 0.032 1.738 s24 0.311 0.005 1.365 s 58 -0.436 0.083 0.647 s25 0.175 0.080 1.192 s 59 -1.695 <.001 0.184 s26 -0.305 0.001 0.737 s60 0.180 0.130 1.197 s27 0.023 0.815 1.023 s62 0.673 <.001 1.961 s28 0.722 <.001 2.059 s 63 -0.325 0.014 0.723 s29 0.231 0.012 1.259 s64 0.293 <.001 1.340 s30 -0.109 0.565 0.897 s65 -0.107 0.386 0.898 s31 -0.545 <.001 0.580 s66 -1.875 <.001 0.153 s32 -0.348 0.001 0.706 s67 -0.079 0.307 0.924 s33 0.409 0.002 1.505 s 68 -0.230 0.131 0.794 s34 0.008 0.950 1.008 s70 0.203 0.312 1.224 Note: *p-value <.00083 Copyright 2005 by Promissor, Inc. A ll rights reserved. Used with permission The stem and leaf plot in Figure 10 displays the odds ratio estimates for the logistic regression method for Form X. The distribution had a slightly positive 122

skew with no extreme values. Estimates ranged from 0.1 through 2.2, with an average odds ratio of 1.1.

Odds Ratio Estimates (Logistic Regression) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  3.00      0  114
 26.00      0  55556666667777777888889999
 18.00      1  000011111122223333
  8.00      1  55677799
  5.00      2  02222
Stem width: 1.00
Each leaf: 1 case(s)

Figure 10. Stem and leaf plot for form X LR odds ratio estimates

In test Form Z, 17 items were flagged as exhibiting DIF (see Table 21). Of these 17 items, 9 favored the P&P examinees and 8 favored the CBT examinees. The greatest odds ratio, favoring the paper-and-pencil administration group, was 4.541 for item 63. The smallest odds ratio, favoring the computer-based test group, was 0.230 for item 65.

Table 21 Logistic Regression results for form Z Logistic Regression Results: Form Z (N = 5,670) Maximum Likelihood Odds Ratio Maximum Likelihood Odds Ratio Item Estimate p value Point Estimate Significant Item Estimate p value Point Estimate Significant s2 0.029 0.763 1.029 s 37 0.478 <.001 1.613 s3 -0.659 0.023 0.517 s38 0.287 0.002 1.333 s4 -0.120 0.511 0.887 s39 -0.312 0.188 0.732 s5 -0.538 0.027 0.584 s40 0.335 0.015 1.397 s6 -0.819 <.001 0.441 s41 0.029 0.788 1.030 s7 -0.440 0.010 0.644 s42 -0.557 0.011 0.573 s8 -0.157 0.237 0.855 s43 0.267 0.014 1.306 s9 -0.097 0.645 0.908 s44 0.270 0.045 1.310 s10 0.596 <.001 1.814 s45 0.402 <.001 1.494 s11 0.112 0.401 1.118 s46 0.472 0.216 1.603 s12 0.137 0.253 1.147 s 47 -0.227 0.074 0.797 s14 0.050 0.863 1.051 s 48 -0.810 <.001 0.445 s15 0.211 0.507 1.235 s49 0.152 0.117 1.164 s16 0.473 <.001 1.604 s 50 -0.902 <.001 0.406 s17 0.433 <.001 1.541 s51 0.354 <.001 1.424 s18 0.337 0.006 1.401 s52 0.526 <.001 1.693 s19 -0.014 0.910 0.986 s53 -0.334 0.003 0.716 s20 -0.576 0.007 0.562 s54 -0.591 0.161 0.554 s21 -0.070 0.757 0.932 s55 -0.372 <.001 0.689 s22 0.130 0.354 1.139 s58 0.167 0.262 1.182 s25 0.184 0.071 1.201 s 59 -0.247 0.192 0.782 s26 0.190 0.114 1.210 s 60 -0.716 <.001 0.489 s27 0.096 0.216 1.101 s 61 -0.096 0.203 0.908 s28 -0.335 0.140 0.716 s62 -0.745 0.004 0.475 s29 0.602 <.001 1.825 s63 1.513 <.001 4.541 s30 0.342 0.003 1.408 s 64 -1.244 <.001 0.288 s31 0.203 0.489 1.225 s 65 -1.469 <.001 0.230 s32 0.221 0.043 1.247 s 68 -0.725 <.001 0.484 s35 -0.329 0.119 0.719 s69 0.372 0.007 1.451 s36 0.002 0.989 1.002 s70 0.051 0.835 1.053 Note: *p-value <.00083 Copyright 2005 by Promissor, Inc. A ll rights reserved. Used with permission The stem and leaf plot in Figure 11 displays the odds ratio estimates for the logistic regression method for Form Z. The distribution was bimodal with one 124

extreme value of 4.541. Most estimates ranged from 0.2 through 1.8, with an average odds ratio of 1.1.

Odds Ratio Estimates (Logistic Regression) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  2.00      0  22
 11.00      0  44444455555
  8.00      0  66777777
  6.00      0  889999
 11.00      1  00000111111
  9.00      1  222223333
  6.00      1  444445
  4.00      1  6666
  2.00      1  88
  1.00      Extremes (>=4.5)
Stem width: 1.00
Each leaf: 1 case(s)

Figure 11. Stem and leaf plot for form Z LR odds ratio estimates

The logistic regression method allows for the examination of DIF by group. We were able to determine, for each item on each test form, which mode of administration was favored (that is, for that particular mode of administration, the odds of answering the item correctly were greater than the odds of answering the item correctly for members of the other administration mode).

1-Parameter Logistic Model

Item response theory was applied to each test form using the 1-PL model in the BILOG-MG software program (Zimowski et al., 2003). This yielded adjusted threshold values for each group. Threshold refers to the difficulty parameter (sometimes referred to as the b-parameter). Using the group threshold differences (also referred to as logit differences) and the standard error, a z-value was calculated for each item.

The z-value was compared to a critical z-value for a normal distribution, applying the Bonferroni adjustment to the significance level so that the significance level was .00083 and the critical z-value was 3.34 (two-tailed). In the tables reporting results for the 1-PL (Tables 22 to 26), the items are listed in the first column, and the next two columns report the group threshold difference and its standard error. This is followed by the z-value for the item. The final column is marked with an asterisk for items that are statistically significant, that is, flagged as exhibiting DIF. For example, in test Form A (Table 22), item 1 has a group threshold difference of 0.486 (column 2), indicating that the adjusted threshold for the P&P examinees is 0.486 above that of the CBT examinees, meaning the item was more difficult for P&P examinees and favored the CBT examinee group. The standard error is 0.123 (column 3). The z-value for item 1 is 3.951 (column 4), indicating the estimate is almost four standard errors from zero. The item is statistically significant, and an asterisk is found in the last column indicating this significance.

In test Form A, 19 items were flagged as exhibiting DIF (see Table 22). Of these 19 items, 10 favored the P&P examinees and 9 favored the CBT examinees. The smallest group threshold difference (threshold diff), favoring the paper-and-pencil administration group, was for item 46 (threshold diff = -1.351). The greatest group threshold difference, favoring the computer-based test group, was for item 65 (threshold diff = 1.267).
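The flagging rule just described is simple arithmetic on the calibration output: divide each between-group threshold difference by its standard error and compare the absolute result with the Bonferroni-adjusted critical z. A small illustrative sketch (the threshold differences and standard errors are assumed to come from the BILOG-MG output):

    from scipy.stats import norm

    ALPHA_ADJ = 0.05 / 60                  # .00083
    Z_CRIT = norm.ppf(1 - ALPHA_ADJ / 2)   # about 3.34, two-tailed

    def flag_item(threshold_diff: float, std_error: float) -> bool:
        """Flag an item for DIF when |difference / SE| exceeds the critical z."""
        return abs(threshold_diff / std_error) > Z_CRIT

    # Form A, item 1 (Table 22): 0.486 / 0.123 is about 3.95, so the item is flagged.
    print(flag_item(0.486, 0.123))   # True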

Table 22 1-PL results for form A 1-PL: Form A (N=5,441) Item Group 2-1 Estimate Standard Error z value Significant Item Group 2-1 Estimate Standard Error z value Significant s1 0.486 0.123 3.951 s 36 0.764 0.659 1.159 s2 0.186 0.099 1.879 s37 0.971 0.260 3.735 s3 -0.103 0.173 -0.595 s38 0.117 0.182 0.643 s4 0.557 0.141 3.950 s 39 -0.694 0.206 -3.369 s5 0.267 0.170 1.571 s40 0.292 0.145 2.014 s6 -0.430 0.126 -3.413 s41 -0.020 0.157 -0.127 s7 -0.215 0.251 -0.857 s44 -0.741 0.113 -6.558 s8 -0.314 0.123 -2.553 s45 -0.782 0.110 -7.109 s9 1.212 0.262 4.626 s46 -1.351 0.102 -13.245 s10 0.204 0.345 0.591 s47 0.544 0.204 2.667 s11 -0.082 0.407 -0.201 s48 -0.107 0.268 -0.399 s13 -1.043 0.169 -6.172 s49 -0.380 0.218 -1.743 s14 -0.528 0.150 -3.520 s50 -0.784 0.127 -6.173 s15 -0.344 0.106 -3.245 s52 -0.201 0.124 -1.621 s17 -0.042 0.093 -0.452 s53 -0.294 0.097 -3.031 s18 0.263 0.395 0.666 s54 0.349 0.131 2.664 s19 -0.724 0.318 -2.277 s55 -0.652 0.096 -6.792 s20 0.516 0.169 3.053 s56 0.887 0.256 3.465 s21 0.231 0.310 0.745 s 57 -0.082 0.415 -0.198 s22 -0.512 0.126 -4.063 s58 -0.617 0.190 -3.247 s23 -0.238 0.155 -1.535 s59 -0.189 0.101 -1.871 s25 -0.071 0.221 -0.321 s61 -0.252 0.206 -1.223 s26 -0.052 0.120 -0.433 s62 0.513 0.147 3.490 s27 -0.083 0.281 -0.295 s63 -0.177 0.149 -1.188 s28 0.068 0.144 0.472 s64 0.539 0.313 1.722 s29 -0.341 0.239 -1.427 s65 1.267 0.149 8.503 s30 0.624 0.128 4.875 s66 0.615 0.397 1.549 s31 0.480 0.160 3.000 s67 0.889 0.223 3.987 s33 -0.115 0.150 -0.767 s68 -0.154 0.210 -0.733 s35 0.135 0.217 0.622 s70 -0.262 0.101 -2.594 Note: Variables were coded such that Group 1= CBT examinees and Group 2= P&P examinees; *pvalue<.00083. Copyright 2005 by Promissor, In c. All rights reserved. Used with permission. The stem and leaf plot is displaye d in Figure 12. The plot illustrates the normal distribution of the threshold diffe rences for Form A. No extreme values were reported. Threshold differences ranged from -1.0 through 1.2 with an average difference of -0.9. 127

Threshold Differences (1-PL) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  2.00     -1  03
  9.00     -0  556667777
 24.00     -0  000000011111122222233334
 12.00      0  011122222344
 11.00      0  55555667889
  2.00      1  22
Stem width: 1.000
Each leaf: 1 case(s)

Figure 12. Stem and leaf plot for form A 1-PL threshold difference values

In test Form V, 16 items were flagged as exhibiting DIF (see Table 23). Of these 16 items, 10 favored the P&P examinees and 6 favored the CBT examinees. The smallest group threshold difference, favoring the paper-and-pencil administration group, was for item 47 (threshold diff = -0.945). The greatest group threshold difference, favoring the computer-based test group, was for item 42 (threshold diff = 2.666).

Table 23 1-PL results for form V 1-PL Results: Form V (N = 10,018) Item Group 2-1 Estimate Standard Error z value Significant Item Group 2-1 Estimate Standard Error z value Significant s1 0.108 0.217 0.498 s36 -0.891 0.121 -7.364 s2 -0.327 0.163 -2.006 s37 0.204 0.155 1.316 s3 -0.556 0.245 -2.269 s38 1.122 0.177 6.339 s4 -0.176 0.088 -2.000 s39 -0.261 0.085 -3.071 s5 0.209 0.222 0.941 s40 0.320 0.248 1.290 s6 -0.498 0.112 -4.446 s41 -0.202 0.152 -1.329 s7 0.152 0.135 1.126 s42 2.666 0.313 8.518 s8 0.212 0.136 1.559 s43 0.467 0.544 0.858 s9 -1.090 0.191 -5.707 s44 0.086 0.149 0.577 s10 -0.464 0.102 -4.549 s47 -0.945 0.185 -5.108 s13 0.588 0.396 1.485 s 48 -0.220 0.145 -1.517 s14 -0.480 0.177 -2.712 s49 0.345 0.146 2.363 s16 -0.132 0.086 -1.535 s50 -0.393 0.131 -3.000 s17 -0.360 0.179 -2.011 s51 0.168 0.183 0.918 s18 0.347 0.334 1.039 s52 0.094 0.161 0.584 s19 0.151 0.281 0.537 s 53 -0.017 0.192 -0.089 s20 -0.727 0.152 -4.783 s54 0.157 0.086 1.826 s21 -0.591 0.119 -4.966 s55 0.445 0.219 2.032 s22 0.045 0.142 0.317 s56 0.081 0.138 0.587 s23 0.070 0.261 0.268 s 57 -0.025 0.249 -0.100 s24 -0.923 0.135 -6.837 s58 0.273 0.183 1.492 s25 0.598 0.111 5.387 s 59 -0.277 0.239 -1.159 s26 -0.444 0.220 -2.018 s60 0.139 0.082 1.695 s27 0.358 0.200 1.790 s64 0.549 0.246 2.232 s30 -0.240 0.120 -2.000 s65 0.485 0.352 1.378 s31 0.885 0.137 6.460 s 66 -0.210 0.266 -0.789 s32 0.091 0.121 0.752 s 67 -0.249 0.114 -2.184 s33 -0.448 0.123 -3.642 s68 0.066 0.127 0.520 s34 0.521 0.221 2.357 s 69 -0.543 0.208 -2.611 s35 -0.838 0.136 -6.162 s70 0.522 0.098 5.327 Note: Variables were coded such that Group 1= CBT examinees and Group 2= P&P examinees; *pvalue<.00083. Copyright 2005 by Promissor, In c. All rights reserved. Used with permission. The stem and leaf plot is displayed in Figure 13. The plot illustrates the positively skewed distribution of the th reshold differences for Form V. One extreme value was reported (2.666). Most threshold differences ranged from -1.0 through 1.1 with an average difference of 0.0. 129

Threshold Differences (1-PL) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  1.00     -1  0
  8.00     -0  55578899
 19.00     -0  0011222222233344444
 24.00      0  000000011111122223333444
  6.00      0  555558
  1.00      1  1
  1.00      Extremes (>=2.7)
Stem width: 1
Each leaf: 1 case(s)

Figure 13. Stem and leaf plot for form V 1-PL threshold difference values

In test Form W, 18 items were flagged as exhibiting DIF (see Table 24). Of these 18 items, 11 favored the P&P examinees and 7 favored the CBT examinees. The smallest group threshold difference, favoring the paper-and-pencil administration group, was for item 37 (threshold diff = -1.443). The greatest group threshold difference, favoring the computer-based test group, was for item 47 (threshold diff = 1.691).

Table 24 1-PL results for form W 1-PL Results: Form W (N = 8,751) Item Group 2-1 Estimate Standard Error z value Significant Item Group 2-1 Estimate Standard Error z value Significant s1 -0.082 0.091 -0.901 s37 -1.443 0.111 -13.000 s2 -0.608 0.155 -3.923 s38 0.244 0.091 2.681 s3 0.169 0.099 1.707 s41 0.646 0.455 1.420 s4 -0.828 0.132 -6.273 s42 -0.589 0.178 -3.309 s5 -0.378 0.182 -2.077 s43 -0.060 0.163 -0.368 s6 -0.911 0.129 -7.062 s44 -0.452 0.160 -2.825 s7 -0.552 0.153 -3.608 s45 -0.054 0.126 -0.429 s8 0.258 0.135 1.911 s46 1.031 0.182 5.665 s9 0.211 0.205 1.029 s47 1.691 0.177 9.554 s11 0.028 0.096 0.292 s 48 -0.439 0.163 -2.693 s13 0.141 0.146 0.966 s49 0.295 0.224 1.317 s14 0.212 0.099 2.141 s 50 -0.415 0.139 -2.986 s15 0.030 0.375 0.080 s 51 -0.376 0.173 -2.173 s16 0.791 0.144 5.493 s 52 -0.433 0.162 -2.673 s17 -0.088 0.144 -0.611 s53 0.129 0.251 0.514 s19 0.404 0.245 1.649 s54 0.168 0.081 2.074 s20 -0.578 0.083 -6.964 s56 0.160 0.113 1.416 s21 -0.683 0.173 -3.948 s57 0.244 0.119 2.050 s22 -0.882 0.120 -7.350 s58 -0.397 0.108 -3.676 s23 0.812 0.160 5.075 s59 0.133 0.286 0.465 s25 0.699 0.213 3.282 s60 0.573 0.100 5.730 s26 -0.192 0.090 -2.133 s61 0.820 0.181 4.530 s27 -0.375 0.136 -2.757 s62 1.583 0.578 2.739 s28 -0.692 0.126 -5.492 s64 0.219 0.154 1.422 s29 -0.639 0.127 -5.031 s65 -0.642 0.238 -2.697 s30 0.108 0.225 0.480 s66 0.310 0.155 2.000 s31 -0.197 0.303 -0.650 s67 0.046 0.375 0.123 s32 -0.050 0.106 -0.472 s68 0.841 0.131 6.420 s34 -0.072 0.109 -0.661 s69 0.024 0.091 0.264 s35 0.549 0.455 1.207 s70 -0.466 0.230 -2.026 Note: Variables were coded such that Group 1= CBT examinees and Group 2= P&P examinees; *pvalue<.00083. Copyright 2005 by Promissor, In c. All rights reserved. Used with permission. The stem and leaf plot is displaye d in Figure 14. The plot illustrates the normal distribution of the threshold diffe rences for Form W. Two extreme values were reported (1.583 and 1.691). Most thre shold differences ranged from -1.4 through 1.0 with an average difference of 0.0. 131

Threshold Differences (1-PL) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  1.00     -1  4
 11.00     -0  55566666889
 17.00     -0  00000011333344444
 20.00      0  00001111111222222234
  8.00      0  55667888
  1.00      1  0
  2.00      Extremes (>=1.6)
Stem width: 1.00
Each leaf: 1 case(s)

Figure 14. Stem and leaf plot for form W 1-PL threshold difference values

In test Form X, 23 items were flagged as exhibiting DIF (see Table 25). Of these 23 items, 12 favored the P&P examinees and 11 favored the CBT examinees. The smallest group threshold difference, favoring the paper-and-pencil administration group, was for item 55 (threshold diff = -0.939). The greatest group threshold difference, favoring the computer-based test group, was for item 66 (threshold diff = 1.755).

Table 25 1-PL results for form X 1-PL Results: Form X (N = 9,137) Item Group 2-1 Estimate Standard Error z value Significant Item Group 2-1 Estimate Standard Error z value Significant s1 0.668 0.318 2.101 s35 -0.674 0.087 -7.747 s2 0.386 0.118 3.271 s 36 -0.400 0.226 -1.770 s3 0.291 0.082 3.549 s37 0.687 0.126 5.452 s4 0.696 0.171 4.070 s38 0.194 0.276 0.703 s5 0.456 0.142 3.211 s 39 -0.846 0.093 -9.097 s6 -0.098 0.080 -1.225 s40 -0.104 0.147 -0.707 s8 0.125 0.288 0.434 s41 -0.044 0.228 -0.193 s9 0.485 0.125 3.880 s 44 -0.269 0.171 -1.573 s10 0.242 0.188 1.287 s 45 -0.146 0.168 -0.869 s11 0.036 0.100 0.360 s46 0.940 0.101 9.307 s12 -0.612 0.154 -3.974 s47 0.917 0.080 11.463 s14 0.125 0.224 0.558 s 48 -0.345 0.152 -2.270 s15 -0.230 0.185 -1.243 s49 -0.575 0.253 -2.273 s16 -0.608 0.148 -4.108 s50 -0.155 0.209 -0.742 s17 0.045 0.296 0.152 s51 0.056 0.183 0.306 s18 -0.082 0.100 -0.820 s52 -0.496 0.085 -5.835 s20 -0.827 0.114 -7.254 s54 0.145 0.091 1.593 s22 -0.420 0.235 -1.787 s55 -0.939 0.200 -4.695 s23 -0.856 0.110 -7.782 s57 -0.652 0.265 -2.460 s24 -0.265 0.114 -2.325 s58 0.193 0.250 0.772 s25 -0.188 0.104 -1.808 s59 1.770 0.202 8.762 s26 0.393 0.096 4.094 s 60 -0.193 0.122 -1.582 s27 0.193 0.107 1.804 s 62 -0.688 0.121 -5.686 s28 -0.774 0.107 -7.234 s63 0.372 0.135 2.756 s29 -0.220 0.096 -2.292 s64 -0.340 0.090 -3.778 s30 0.123 0.201 0.612 s 65 -0.107 0.123 -0.870 s31 0.704 0.142 4.958 s66 1.755 0.446 3.935 s32 0.555 0.113 4.912 s67 0.268 0.083 3.229 s33 -0.465 0.135 -3.444 s68 0.183 0.158 1.158 s34 0.058 0.134 0.433 s70 -0.441 0.199 -2.216 Note: Variables were coded such that Group 1= CBT examinees and Group 2= P&P examinees; *pvalue=.00083. Copyright 2005 by Promissor, In c. All rights reserved. Used with permission. The stem and leaf plot is displayed in Figure 15. The plot illustrates the normal distribution of the threshold diffe rences for Form X. Two extreme values were reported (1.755 and 1.770). Most thre shold differences ranged from -0.8 through 0.9 with an average difference of 0.0. 133

Threshold Differences (1-PL) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  4.00     -0  8889
  6.00     -0  666667
  6.00     -0  444445
  6.00     -0  222233
  9.00     -0  000111111
 12.00      0  000011111111
  6.00      0  222333
  3.00      0  445
  4.00      0  6667
  2.00      0  99
  2.00      Extremes (>=1.8)
Stem width: 1.00
Each leaf: 1 case(s)

Figure 15. Stem and leaf plot for form X 1-PL threshold difference values

In test Form Z, 21 items were flagged as exhibiting DIF (see Table 26). Of these 21 items, 12 favored the P&P examinees and 9 favored the CBT examinees. The smallest group threshold difference, favoring the paper-and-pencil administration group, was for item 63 (threshold diff = -1.880). The greatest group threshold difference, favoring the computer-based test group, was for item 64 (threshold diff = 1.572).

Table 26 1-PL results for form Z 1-PL Results: Form Z (N = 5,670) Item Group 2-1 Estimate Standard Error z value Significant Item Group 2-1 Estimate Standard Error z value Significant s2 -0.042 0.119 -0.353 s37 -0.607 0.099 -6.131 s3 0.547 0.337 1.623 s 38 -0.378 0.118 -3.203 s4 0.037 0.220 0.168 s 39 -0.002 0.280 -0.007 s5 0.712 0.305 2.334 s 40 -0.541 0.167 -3.240 s6 0.975 0.272 3.585 s41 0.030 0.135 0.222 s7 0.394 0.211 1.867 s42 0.712 0.278 2.561 s8 0.478 0.164 2.915 s 43 -0.530 0.133 -3.985 s9 -0.092 0.248 -0.371 s44 -0.424 0.164 -2.585 s10 -0.780 0.129 -6.047 s45 -0.642 0.127 -5.055 s11 -0.233 0.165 -1.412 s46 -0.716 0.451 -1.588 s12 -0.216 0.146 -1.479 s47 0.111 0.151 0.735 s14 -0.135 0.355 -0.380 s48 1.553 0.118 13.161 s15 -0.453 0.372 -1.218 s49 -0.311 0.121 -2.570 s16 -0.597 0.139 -4.295 s50 1.280 0.174 7.356 s17 -0.609 0.101 -6.030 s51 -0.181 0.106 -1.708 s18 -0.578 0.151 -3.828 s52 -0.799 0.154 -5.188 s19 -0.249 0.149 -1.671 s53 0.192 0.134 1.433 s20 0.823 0.266 3.094 s54 0.032 0.446 0.072 s21 -0.114 0.269 -0.424 s55 0.651 0.097 6.711 s22 -0.119 0.178 -0.669 s58 -0.496 0.178 -2.787 s25 -0.066 0.126 -0.524 s59 0.442 0.235 1.881 s26 -0.383 0.147 -2.605 s60 1.111 0.211 5.265 s27 -0.118 0.096 -1.229 s61 0.411 0.094 4.372 s28 -0.074 0.253 -0.292 s62 0.609 0.313 1.946 s29 -0.667 0.117 -5.701 s 63 -1.880 0.104 -18.077 s30 -0.530 0.144 -3.681 s64 1.572 0.328 4.793 s31 -0.397 0.367 -1.082 s65 1.401 0.387 3.620 s32 -0.454 0.133 -3.414 s68 0.732 0.144 5.083 s35 0.278 0.259 1.073 s 69 -0.509 0.168 -3.030 s36 -0.052 0.146 -0.356 s70 -0.108 0.308 -0.351 Note: Variables were coded such that Group 1= CBT examinees and Group 2= P&P examinees; *pvalue<.00083. Copyright 2005 by Promissor, In c. All rights reserved. Used with permission. The stem and leaf plot is displaye d in Figure 16. The plot illustrates the positively skewed distribution of the threshold differences for Form Z. One extreme value was reported (-1.880). Most threshold differences ranged from 1.5 through 1.5 with an aver age difference of 0.0. 135

Threshold Differences (1-PL) Stem-and-Leaf Plot
Frequency   Stem & Leaf
  1.00      Extremes (=<-1.9)
 13.00     -0  5555556666777
 23.00     -0  00000011111122233334444
 10.00      0  0001123444
  8.00      0  56677789
  3.00      1  124
  2.00      1  55
Stem width: 1.00
Each leaf: 1 case(s)

Figure 16. Stem and leaf plot for form Z 1-PL threshold difference values

Agreement of DIF Methods

Figures 17 to 31 display the relationship between the estimates of the detection methods in a paired comparison for all items on each test form. There are three paired comparisons for each test form. The first figure displays the relationship between the MH and 1-PL detection methods, and the correlation value is reported. The second figure displays the relationship between the LR and 1-PL methods, with the correlation of the two methods. Finally, the third figure displays the relationship between the LR and MH methods, with the correlation of the two methods. These figures are provided as a visual demonstration of the agreement in the values obtained by each DIF method. Similarly, Tables 27 to 31 display the items flagged by each method. This display of the DIF detection decisions allows a look at the degree to which the methods agree in their ability to flag an item for DIF. The tables report results by test form. The first column displays the item number, followed by columns for each method of analysis: 1-PL, MH, and LR. A C or P in a column indicates that the item was flagged by that method.

A C indicates the item favored the CBT examinees, and a P indicates the item favored the P&P examinees. For example, in Form A (Table 27) item 1 was flagged by the 1-PL method and the MH method. A C is reported in the second and third columns, indicating the item favored CBT examinees for both methods. The item was not flagged by LR; therefore, that column (column 4) was left blank.

Figure 17 illustrates the relationship for each item on Form A by displaying the odds ratios obtained in MH and the threshold differences using the 1-PL model. This type of graph allows us to view the item relationships. The correlation between the values obtained by MH and 1-PL was 0.93. Figures 18 and 19 display similar information for the pairings of LR and the 1-PL and of MH and LR, respectively. The correlation between the estimates obtained by LR and 1-PL was 0.93. The correlation between the estimates obtained by LR and MH was 0.995.
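Both the correlations just reported and the degree to which the methods agree on flagging decisions can be computed from per-item summaries of the three analyses. The sketch below is illustrative only; mh_or, lr_or, and pl_diff are assumed vectors of per-item estimates, the *_flag arrays are the corresponding 0/1 DIF decisions, and Cohen's kappa is used as one common index of agreement between paired decisions. Note that the sign of a correlation between odds ratios and threshold differences depends on whether the ratios are log-transformed; the magnitude is what the figures are meant to convey.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    def method_agreement(mh_or, lr_or, pl_diff, mh_flag, lr_flag, pl_flag):
        """Pairwise correlations of the estimates and kappa agreement of the flags."""
        corr = {
            "MH vs 1-PL": np.corrcoef(mh_or, pl_diff)[0, 1],
            "LR vs 1-PL": np.corrcoef(lr_or, pl_diff)[0, 1],
            "LR vs MH":   np.corrcoef(lr_or, mh_or)[0, 1],
        }
        kappa = {
            "MH vs 1-PL": cohen_kappa_score(mh_flag, pl_flag),
            "LR vs 1-PL": cohen_kappa_score(lr_flag, pl_flag),
            "LR vs MH":   cohen_kappa_score(lr_flag, mh_flag),
        }
        return corr, kappa

    # usage: corrs, kappas = method_agreement(mh_or, lr_or, pl_diff, mh_flag, lr_flag, pl_flag)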

[Scatterplot: Mantel-Haenszel odds ratios vs. 1-PL theta differences]
Figure 17. Scatterplot of MH odds ratios and 1-PL threshold differences for form A

[Scatterplot: Logistic Regression odds ratios vs. 1-PL theta differences]
Figure 18. Scatterplot of LR odds ratios and 1-PL threshold differences for form A

[Scatterplot: Logistic Regression odds ratios vs. Mantel-Haenszel odds ratios, Form A]
Figure 19. Scatterplot of LR odds ratios and MH odds ratios for form A

In Form A (see Table 27), 12 items were commonly flagged by all methods. An additional item was flagged by both the 1-PL and MH methods, three additional items were flagged by the MH and LR methods, and two additional items were flagged by the 1-PL and LR methods. Four additional items were flagged solely by the 1-PL method.

Table 27 DIF methodology results for form A DIF Methodology Results: Form A (N = 5,441) Item 1-PL MH LR Item 1-PL MH LR s1 C C s36 s2 s37 C C C s3 s38 s4 C C C s39 P s5 s40 s6 P P s41 s7 s44 P P P s8 s45 P P P s9 C C C s46 P P P s10 s47 s11 s48 s13 P P P s49 s14 P s50 P P P s15 s52 s17 s53 s18 s54 C C s19 s55 P P P s20 C C s56 C C C s21 s57 s22 P s58 s23 s59 s25 s61 s26 s62 C s27 s63 s28 s64 s29 s65 C C C s30 C C C s66 s31 C C s67 C C s33 s68 s35 s70 Form Total 19 16 17 1-PL =1-Parameter Logistic Model; MH=Mant el-Haenszel; LR=Logistic Regression C=item favored CBT examinees; P=item favored P&P examinees Copyright 2005 by Promissor, Inc. A ll rights reserved. Used with permission. Figure 20 illustrates the relationship between the odds ratios obtained in MH and the threshold differences using t he 1-PL model for each item on Form V. 140

Figures 21 and 22 display similar information for the pairings of LR and the 1-PL as well as MH and LR, respectively. The correlation between the values obtained by MH and 1-PL was 0.89. The correlation between the estimates obtained by LR and 1-PL was 0.88. The correlation between the estimates obtained by LR and MH was 0.997.

[Scatterplot: Mantel-Haenszel odds ratios vs. 1-PL theta differences, Form V]
Figure 20. Scatterplot of MH odds ratios and 1-PL theta differences for form V

[Scatterplot: Logistic Regression odds ratios vs. 1-PL theta differences, Form V]
Figure 21. Scatterplot of LR odds ratios and 1-PL theta differences for form V

[Scatterplot: Logistic Regression odds ratios vs. Mantel-Haenszel odds ratios, Form V]
Figure 22. Scatterplot of LR odds ratios and MH odds ratios for form V

In Form V (see Table 28), 14 items were commonly flagged by all methods. An additional item was flagged by both the MH and LR methods, two additional items were flagged by the 1-PL only, and an additional item was flagged by LR solely. Table 28 DIF methodology results for form V DIF Methodology Results: Form V (N = 10,018) Item 1-PL MH LR Item 1-PL MH LR s1 s36 P C P s 2 s 3 7 s3 s38 C C C s4 s39 P P s 5 s 4 0 s6 P P P s41 s7 s42 C C C s 8 s 4 3 s9 P P P s44 C s10 P P P s47 P P P s13 s48 s14 s49 s16 s50 s17 s51 s18 s52 s19 s53 s20 P P P s54 s21 P P P s55 C s22 s56 s23 s57 s24 P P P s58 s25 C C C s59 s26 s60 s27 s64 s30 s65 s31 C C C s66 s32 s67 s33 P s68 s34 s69 s35 P P P s70 P C C Form Total 16 15 16 1-PL =1-Parameter Logistic Model; MH=Mant el-Haenszel; LR=Logistic Regression C=item favored CBT examinees; P=item favored P&P examinees Copyright 2005 by Promissor, Inc. A ll rights reserved. Used with permission. 143

Figure 23 illustrates the relationship between the odds ratios obtained in MH and the threshold differences using the 1-PL model for each item on Form W. Figures 24 and 25 display similar information for the pairings of LR and the 1-PL as well as MH and LR, respectively. The correlation between the estimates obtained by MH and 1-PL was 0.90. The correlation between the estimates obtained by LR and 1-PL was 0.91. The correlation between the estimates obtained by LR and MH was 0.994.

[Scatterplot: Mantel-Haenszel odds ratios vs. 1-PL theta differences, Form W]
Figure 23. Scatterplot of MH odds ratios and 1-PL theta differences for form W


Figure 24. Scatterplot of LR odds ratios and 1-PL theta differences for form W

Figure 25. Scatterplot of MH odds ratios and LR odds ratios for form W


In Form W (see Table 29), 13 items were commonly flagged by all methods. An additional item was flagged by the MH and LR methods, and an additional three items were flagged by the 1-PL and LR methods. An additional two items were flagged solely by the 1-PL method.

Table 29. DIF methodology results for form W
DIF Methodology Results: Form W (N = 8,751)
Item 1-PL MH LR Item 1-PL MH LR
s1 s37 P P P s2 P P s38 s3 s41 s4 P P P s42 s5 s43 s6 P P P s44 s7 P s45 s8 s46 C C C s9 s47 C C C s11 s48 s13 s49 s14 s50 s15 s51 s16 C C C s52 s17 s53 s19 s54 s20 P P P s56 s21 P P s57 s22 P P P s58 P P s23 C C C s59 s25 C C s60 C C C s26 s61 C s27 s62 s28 P P P s64 s29 P P P s65 s30 s66 s31 s67 s32 s68 C C C s34 s69 s35 s70
Form Total 18 14 17
1-PL = 1-Parameter Logistic Model; MH = Mantel-Haenszel; LR = Logistic Regression; C = item favored CBT examinees; P = item favored P&P examinees
Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.


Figure 26 illustrates the relationship between the odds ratios obtained in MH and the threshold differences using the 1-PL model for each item on Form X. Figures 27 and 28 display similar information for the pairings of LR and the 1-PL as well as MH and LR, respectively. The correlation between the estimates obtained by MH and 1-PL was 0.91. The correlation between the estimates obtained by LR and 1-PL was 0.91. The correlation between the estimates obtained by LR and MH was 0.99.

Figure 26. Scatterplot of MH odds ratios and 1-PL theta differences for form X


Figure 27. Scatterplot of LR odds ratios and 1-PL theta differences for form X

Figure 28. Scatterplot of MH odds ratios and LR odds ratios for form X


In Form X (see Table 30), 14 items were commonly flagged by all methods. An additional item was flagged by both the 1-PL and MH methods, one additional item was flagged by the MH and LR methods, and an additional four items were flagged by the 1-PL and LR methods. An additional four items were flagged solely by the 1-PL method.

Table 30. DIF methodology results for form X
DIF Methodology Results: Form X (N = 9,137)
Item 1-PL MH LR Item 1-PL MH LR
s1 s35 P P P s2 C C s36 s3 C s37 C C C s4 C C C s38 s5 s39 P P P s6 s40 s8 s41 s9 C s44 s10 s45 s11 s46 C C C s12 P P s47 C C C s14 s48 s15 s49 s16 P P s50 s17 s51 s18 s52 P P P s20 P P P s54 s22 s55 P P s23 P P P s57 s24 s58 s25 s59 C C C s26 C s60 s27 s62 P P P s28 P P P s63 s29 s64 P P s30 s65 s31 C C C s66 C C C s32 C C s67 s33 P s68 s34 s70
Form Total 23 16 19
1-PL = 1-Parameter Logistic Model; MH = Mantel-Haenszel; LR = Logistic Regression; C = item favored CBT examinees; P = item favored P&P examinees
Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.


Figure 29 illustrates the relationship between the odds ratios obtained in MH and the threshold differences using the 1-PL model for each item on Form Z. Figures 30 and 31 display similar information for the pairings of LR and the 1-PL as well as MH and LR, respectively. The correlation between the estimates obtained by MH and 1-PL was 0.82. The correlation between the estimates obtained by LR and 1-PL was 0.84. The correlation between the estimates obtained by LR and MH was 0.99.

Figure 29. Scatterplot of MH odds ratios and 1-PL theta differences for form Z


Figure 30. Scatterplot of LR odds ratios and 1-PL theta differences for form Z

Figure 31. Scatterplot of MH odds ratios and LR odds ratios for form Z


In Form Z (see Table 31), 13 items were commonly flagged by all methods. An additional item was flagged by both the MH and LR methods and an additional three items were flagged by the 1-PL and LR methods. An additional five items were flagged solely by the 1-PL method.

Table 31. DIF methodology results for form Z
DIF Methodology Results: Form Z (N = 5,670)
Item 1-PL MH LR Item 1-PL MH LR
s2 s37 P P P s3 s38 s4 s39 s5 s40 s6 C C C s41 s7 s42 s8 s43 P s9 s44 s10 P P P s45 P P s11 s46 s12 s47 s14 s48 C C C s15 s49 s16 P P s50 C C C s17 P P P s51 P P s18 P s52 P P s19 s53 s20 s54 s21 s55 C C C s22 s58 s25 s59 s26 s60 C C C s27 s61 C s28 s62 s29 P P P s63 P P P s30 P s64 C C C s31 s65 C C C s32 P s68 C C C s35 s69 s36 s70
Form Total 21 14 17
1-PL = 1-Parameter Logistic Model; MH = Mantel-Haenszel; LR = Logistic Regression; C = item favored CBT examinees; P = item favored P&P examinees
Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.


Kappa Statistic

Cohen's kappa measures the agreement between the evaluations of two raters when both are rating the same object. A value of 1 indicates perfect agreement. A value of 0 indicates that agreement is no better than chance. Table 32 reports the values for the DIF methods of analysis grouped in pairs. The first column reports the test form. The second column reflects the value for the paired agreement of flagged items between the 1-PL and MH. The third column reflects the value for the paired agreement of flagged items between the 1-PL and LR. The fourth column reflects the value for the paired agreement of flagged items between MH and LR. For all paired values of kappa, the p-value is reported in the same cell as the value but marked by parentheses. Simple Interactive Statistical Analysis (SISA) provides guidelines for interpreting the level of agreement where a value below .4 represents poor agreement, a value between .4 and .75 represents fair to good agreement, and a value greater than .75 represents excellent agreement.
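To make the agreement index concrete, the sketch below computes Cohen's kappa for two hypothetical vectors of flag decisions (1 = flagged as exhibiting DIF, 0 = not flagged); the values in Table 32 were computed from the actual flag decisions produced by the 1-PL, MH, and LR inferential tests.

def cohen_kappa(flags_a, flags_b):
    """Cohen's kappa for two raters making 0/1 decisions about the same items."""
    n = len(flags_a)
    p_o = sum(a == b for a, b in zip(flags_a, flags_b)) / n        # observed agreement
    p_both_1 = (sum(flags_a) / n) * (sum(flags_b) / n)             # chance both flag
    p_both_0 = (1 - sum(flags_a) / n) * (1 - sum(flags_b) / n)     # chance neither flags
    p_e = p_both_1 + p_both_0                                      # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical flag decisions for ten items under two methods.
flags_1pl = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
flags_mh  = [1, 0, 1, 0, 0, 0, 1, 0, 0, 1]
print(round(cohen_kappa(flags_1pl, flags_mh), 3))   # 0.8, i.e., excellent agreement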


Table 32. Kappa agreement results for all test forms
Kappa Agreement Results
Test Form | 1-PL/MH (p-value) | 1-PL/LR (p-value) | MH/LR (p-value)
A | 0.638 (<.001) | 0.638 (<.001) | 0.875 (<.001)
V | 0.822 (<.001) | 0.830 (<.001) | 0.911 (<.001)
W | 0.746 (<.001) | 0.792 (<.001) | 0.773 (<.001)
X | 0.663 (<.001) | 0.781 (<.001) | 0.799 (<.001)
Z | 0.643 (<.001) | 0.770 (<.001) | 0.870 (<.001)
Note: 1-PL = 1-Parameter Logistic Model; MH = Mantel-Haenszel; LR = Logistic Regression
Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.

For Form A, the value for the paired agreement of flagged items between the 1-PL and MH is 0.638, indicating good agreement. The value for the paired agreement of flagged items between the 1-PL and LR is 0.638, indicating good agreement. The value for the paired agreement of flagged items between MH and LR is 0.875, representing excellent agreement. Each paired agreement value is statistically significant, indicating the agreement between pairs is greater than we would expect to see by chance.

For Form V, the value for the paired agreement of flagged items between the 1-PL and MH is 0.822, indicating excellent agreement. The value for the paired agreement of flagged items between the 1-PL and LR is 0.830, indicating excellent agreement. The value for the paired agreement of flagged items between MH and LR is 0.911, again indicating excellent agreement. For Form V, each of the paired agreement values is statistically significant, indicating the agreement between pairs is greater than we would expect to see by chance, and all values represent excellent agreement between paired methods.


For Form W, the value for the paired agreement of flagged items between the 1-PL and MH is 0.746, indicating good agreement. The value for the paired agreement of flagged items between the 1-PL and LR is 0.792, representing excellent agreement. The value for the paired agreement of flagged items between MH and LR is 0.773, representing good agreement. Each paired agreement value is statistically significant, indicating the agreement between pairs is greater than we would expect to see by chance.

For Form X, the value for the paired agreement of flagged items between the 1-PL and MH is 0.663, indicating good agreement. The value for the paired agreement of flagged items between the 1-PL and LR is 0.781, representing excellent agreement. The value for the paired agreement of flagged items between MH and LR is 0.799, representing excellent agreement. Each paired agreement value is statistically significant, indicating the agreement between pairs is greater than we would expect to see by chance.

For Form Z, the value for the paired agreement of flagged items between the 1-PL and MH is 0.643, indicating good agreement. The value for the paired agreement of flagged items between the 1-PL and LR is 0.770, representing good agreement. The value for the paired agreement of flagged items between MH and LR is 0.870, indicating excellent agreement. Each paired agreement value is statistically significant, indicating the agreement between pairs is greater than we would expect to see by chance.

In summary, kappa values for paired agreements of decision (flagged as DIF item) were all greater than 0.600.


The 1-PL and MH had good to excellent agreement on all five forms. Similarly, the 1-PL and LR had good to excellent agreement on all five forms. Finally, the paired methods of LR and MH had kappa values representing excellent agreement on all five forms. In addition, the highest values reported for paired agreements in this study were between LR and MH (0.911 Form V; 0.875 Form A; 0.870 Form Z), representing excellent agreement, and all values for this pairing were greater than 0.75.

Post Hoc Analyses

Differential Test Functioning

Several items have been identified as exhibiting DIF in each test form. The DIF detection methods tend to agree that several items exhibit DIF in each test form. Further, there was notable agreement in the group favored on each item as reported by the DIF methodology indicators. This generates a new question: what is the impact of the items exhibiting DIF on the test as a whole? This is more formally referred to as Differential Test Functioning (DTF). At present, readily available hypothesis tests for DTF do not exist. However, work has begun in this area. Raju, van der Linden, and Fleer (1995) have begun using χ2 and t statistics for this work. However, these statistics assume known parameters and do not take into consideration error in estimating item parameters. Two methods (one statistical and one visual) were used to obtain a sense of the impact of the item DIF at the test level.

Initially, the tests were examined using DIFAS 2.0 (Penfield, 2006). This analysis is unsigned and the variance estimates are summed.


For this data set, where the item DIF did not tend to substantially favor one group (CBT, P&P) over the other and there are positive and negative estimates, this may garner an inaccurate view of the impact of the item DIF on the test since the item DIF will yield test DIF (Camilli & Penfield, 1997). Results for Form A were Tau2 = .175, standard error = .039, z-value = 4.487, indicating test DIF. Similar results were obtained for the remaining forms. Form V results were Tau2 = .238, standard error = .05, z-value = 4.76, indicating test DIF. Form W results were Tau2 = .227, standard error = .048, z-value = 4.729, again indicating test DIF. Form X results were Tau2 = .244, standard error = .05, z-value = 4.88, indicating test DIF. Finally, Form Z results were Tau2 = .2, standard error = .043, z-value = 4.651, indicating test DIF.

Since this test had several items favoring both groups similarly, the test information curve (TIC) was examined since it summarizes the information function for the set of items of the test form. Each item's contribution to the total information is additive. The TIC displays information for the set of items at each point on the ability scale. The item slope and item variance are important. An increase in slope and a decrease in variance yield more information, resulting in a smaller standard error of measurement. The amount of information provided by the set of items (solid line) is inversely related to the error (dotted line) associated with the ability estimates at the ability level. Using the test information, the test developer can assess the precise contribution of an item to the precision of the total test (Hambleton & Swaminathan, 1985). Used in this study, the test information allows the investigator to assess the precision of the total test with the DIF items to determine if there is an impact at the test level.
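Because the NNAAP data were calibrated with the Rasch model, the test information and standard error curves discussed here have a simple closed form: each item contributes P(theta)(1 - P(theta)) at ability theta, the contributions sum across items, and the standard error is the inverse square root of that sum. A minimal Python sketch with hypothetical item difficulties (the operational Rasch b-estimates are not reproduced here):

import numpy as np

def rasch_test_information(theta, b):
    """Test information and standard error at ability theta under the 1-PL model."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))     # probability correct for each item
    info = np.sum(p * (1.0 - p))               # item information is additive
    return info, 1.0 / np.sqrt(info)           # SE shrinks as information grows

# Hypothetical item difficulties for a short form.
b = np.array([-1.5, -0.8, -0.3, 0.0, 0.4, 0.9, 1.6])
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    info, se = rasch_test_information(theta, b)
    print(f"theta={theta:+.1f}  information={info:.2f}  SE={se:.2f}")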


The TICs for each group by test form were created in Bilog-MG and are displayed in Figures 32 to 41. There are two figures displayed for each form (Form A, Form V, Form W, Form X, and Form Z, respectively). The first of the two figures displays the TIC for that form for the CBT examinee group, followed by the TIC for that form for the P&P examinee group (second figure). Within the figure there are two lines. The solid line represents the test information curve and is read from the left vertical axis. The dotted line represents the standard error curve and is read from the right vertical axis. Please note that the right vertical axis is not part of the underlying data. It is computed by the software program that draws the plots and cannot be controlled (L. Stam, personal communication, November 2006). Therefore, the right axis scale is not necessarily consistent across each mode displayed. As can be seen, in each test form there is virtually no difference in the TIC for the two groups of examinees, indicating that the overall level of DIF within any one test form is virtually inconsequential with respect to the impact on differential test functioning.


Figure 32. Test information curve for form A for CBT examinees

Figure 33. Test information curve for form A for P&P examinees


Figure 34. Test information curve for form V for CBT examinees

Figure 35. Test information curve for form V for P&P examinees


Figure 36. Test information curve for form W for CBT examinees

Figure 37. Test information curve for form W for P&P examinees


Figure 38. Test information curve for form X for CBT examinees

Figure 39. Test information curve for form X for P&P examinees


Figure 40. Test information curve for form Z for CBT examinees

Figure 41. Test information curve for form Z for P&P examinees


Nonuniform DIF Analysis

The three methods had strong agreement in their ability to detect item DIF. The analyses were conducted to look at agreement of uniform DIF detection, as each method examined was designed to detect uniform DIF (De Ayala et al., 1999; Dodeen & Johanson, 2001; Lei et al., 2005; Zhang, 2002). With uniform DIF, the mode effect is assumed constant across all ability levels. It is conceivable that the difference in modes would vary across ability levels, leading to nonuniform (crossing) DIF. Jodoin and Gierl (2001) note that nonuniform DIF occurs at a much lower frequency than uniform DIF, and that it is more appropriate to focus a test around uniform DIF but not at the exclusion of nonuniform DIF. The researcher decided to use one method to determine if this may have occurred in the NNAAP data set. While there are different methods that can be used to examine nonuniform DIF (e.g., LR, IRT; Lei et al., 2005), the researcher decided to conduct an interaction LR using SAS (SAS, 2003) to examine any effect between mode and score.

The interaction was statistically significant for three items on Form A, five items on Form V, six items on Form W, 13 items on Form X, and nine items on Form Z. The odds ratios ranged from 1.043 to 1.079 (all test forms). See Table 33 for interaction statistics. Out of the 36 items flagged for nonuniform DIF, 13 were flagged using one of the uniform DIF detection methods. The remaining items (not flagged by the uniform DIF detection methods) had odds ratio values that were close to 1.0 (range from 1.048 to 1.072) and thus appear to exhibit little nonuniform DIF.
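The interaction model can be illustrated with a short sketch: the item response is regressed on the matching score, the administration mode, and their product, and a statistically significant product term signals nonuniform (crossing) DIF, while the mode main effect corresponds to uniform DIF. The Python/statsmodels code below uses simulated data and illustrative variable names; it is not the SAS procedure used in the study.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000

# Simulated examinees: total score (matching criterion) and mode (1 = CBT, 0 = P&P).
total_score = rng.integers(30, 71, size=n)
cbt_mode = rng.integers(0, 2, size=n)

# Simulate one item whose mode effect changes with score (crossing DIF).
true_logit = -6.0 + 0.12 * total_score + 1.5 * cbt_mode - 0.03 * total_score * cbt_mode
item_correct = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

df = pd.DataFrame({"item_correct": item_correct,
                   "total_score": total_score,
                   "cbt_mode": cbt_mode})

# The total_score:cbt_mode term tests for nonuniform DIF; cbt_mode tests uniform DIF.
fit = smf.logit("item_correct ~ total_score + cbt_mode + total_score:cbt_mode",
                data=df).fit(disp=False)
print(fit.summary())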


To give the reader a better sense of the degree of nonuniform DIF, one item (Form X item 12) is shown with the crossing probability curve illustrating nonuniform DIF (see Figure 42).

Figure 42. Nonuniform DIF for Form X item 12


Table 33. Nonuniform DIF results compared to uniform DIF methods
Items Exhibiting Nonuniform DIF by Form Compared to Uniform DIF Methods
Item | Interaction Odds Ratio Estimate | p value | Uniform DIF flags (1-PL MH LR)
Form A (N = 5,441)
s41 | 1.068 | 0.0011 |
s59 | 1.049 | 0.0026 |
s62 | 1.061 | 0.0026 | X
Form V (N = 10,018)
s14 | 1.069 | 0.0017 |
s22 | 1.069 | 0.0008 |
s47 | 1.078 | 0.0020 | X X X
s54 | 1.063 | 0.0003 |
s59 | 1.072 | 0.0035 |
Form W (N = 8,751)
s5 | 1.068 | 0.0021 |
s20 | 1.051 | 0.0018 | X X X
s32 | 1.057 | 0.0009 |
s34 | 1.070 | 0.0006 |
s42 | 1.068 | 0.0017 |
s69 | 1.053 | 0.0011 |
Form X (N = 9,137)
s3 | 1.054 | 0.0001 | X
s6 | 1.057 | 0.0002 |
s9 | 1.067 | 0.0001 | X
s11 | 1.049 | 0.0008 |
s12 | 1.059 | 0.0009 | X X
s24 | 1.058 | 0.0001 |
s26 | 1.043 | 0.0040 | X
s29 | 1.048 | 0.0026 |
s39 | 1.056 | 0.0015 | X X X
s45 | 1.061 | 0.0023 |
s47 | 1.079 | 0.0001 | X X X
s62 | 1.051 | 0.0020 | X X X
s64 | 1.056 | 0.0029 | X X
Form Z (N = 5,670)
s12 | 1.058 | 0.0010 |
s27 | 1.070 | 0.0001 |
s40 | 1.063 | 0.0014 |
s44 | 1.055 | 0.0035 |
s47 | 1.065 | 0.0006 |
s51 | 1.057 | 0.0001 | X X
s53 | 1.069 | 0.0004 |
s63 | 1.049 | 0.0011 | X X X
s69 | 1.070 | 0.0002 |
Note: Bonferroni adjustment = 0.004167; 1-PL = 1-Parameter Logistic Model; MH = Mantel-Haenszel; LR = Logistic Regression
Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.


Summary

The DIF methods used in this study (1-PL, MH, and LR) were similar in the number of items flagged as exhibiting DIF on each of the five test forms and in the actual items flagged, unlike Schaeffer et al. (1993), who found little consistency in the three methods (MH, LR, 3-PL) used for detection of item DIF when comparing administration mode (CBT, P&P) on the GRE. When reviewing which group was favored as evidenced by the DIF methods in this study, the tendency appeared to favor the P&P group, meaning the items seemed easier for this group, but this was inconsequential.

Results indicated that items did behave differently and the examinee's odds of answering an item correctly were influenced by the test mode administration for several items, ranging from 23% of the items on Forms W and Z (MH) to 38% of the items on Form X (1-PL), with an average of 29%. Form X had the greatest number of items favoring one group over the other (using the 1-PL). Flagged items did not contain long reading passages or multiple screens (K. Becker, personal communication, November 2005), which are common issues that result in a mode effect (Greaud & Green, 1986; Mazzeo & Harvey, 1988; Parshall et al., 2002). As a note, the researcher is unable to share the flagged items publicly as the items are active items and must remain secure. The study reported excellent agreement in the detection of DIF at the item level, indicating that items do exhibit DIF. However, upon investigation of the TCCs, it is clear that the test does not appear to behave differentially for the two administration modes regardless of individual item DIF.


CHAPTER FIVE: DISCUSSION

Summary of the Study

A variety of administrative circumstances can impact test scores and their interpretation. In situations where an exam is administered across dual platforms, such as via paper-and-pencil and computer simultaneously, a comparability study is needed (Parshall et al., 2002). Individual items may become more or less difficult in the computer version (CBT) of an exam as compared to the paper-and-pencil (P&P) version, possibly resulting in a shift in the overall difficulty of the test (Mazzeo & Harvey, 1988).

This study was conducted to probe the following two research questions.

1. What proportion of items exhibit differential item functioning (DIF) as determined by the following three methods: Mantel-Haenszel (MH), Logistic Regression (LR), and the 1-Parameter Logistic Model (1-PL)?

2. What is the level of agreement among Mantel-Haenszel, Logistic Regression, and the 1-Parameter Logistic Model in detecting differential item functioning between the computer-based and paper-and-pencil versions of the National Nurse Aide Assessment Program?


Using 38,955 examinees' response data across five forms of the NNAAP administered in both the CBT and P&P mode, three methods of evaluating DIF (1-PL, MH, LR) were used to detect DIF across platforms. The three methods of DIF were compared to determine if they detect DIF equally in all items on the NNAAP forms. For each method, Type I error rates were controlled using the Bonferroni adjustment. While this may have resulted in inflated Type II error rates, the sample size for each form was large (5,440; 10,015; 8,751; 9,106; 5,643). As sample size (N) increases, statistical estimates become more precise and the power of the statistical test increases (Murphy & Myors, 2004). A list was compiled identifying the number of items flagged by each DIF method (1-PL, LR, and MH) and identified as either favoring the P&P examinee group or the CBT examinee group. Data were reported by agreement of methods, that is, an item flagged by multiple DIF methods (e.g., 1-PL and MH but not LR, or all three methods). A kappa statistic was calculated to provide an index of agreement between paired methods of the LR, MH, and the 1-PL based on the inferential tests. Finally, in order to determine what, if any, impact these DIF items may have on the test as a whole, the test characteristic curves for each test form and examinee group were displayed.

Summary of Study Results


First, the sample was examined as a whole. The mean percent correct for the P&P group was 54.24 (SD = 4.83; N = 34,411) and the mean percent correct for the CBT group was 52.27 (SD = 5.88; N = 4,544), indicating no sizable differences between the two administration groups. The Kuder-Richardson KR-20 was computed to review the reliability of the scores for the study sample and displayed similar results to the 2004 NNAAP sample (Muckle & Becker, 2005). All coefficients were greater than .80.
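The KR-20 coefficient can be computed directly from the scored responses; a minimal sketch with a small hypothetical 0/1 response matrix (examinees by items) is shown below. The study's coefficients were, of course, computed from the full NNAAP response data.

import numpy as np

def kr20(responses):
    """Kuder-Richardson 20 for a matrix of dichotomously scored (0/1) responses."""
    k = responses.shape[1]                          # number of items
    p = responses.mean(axis=0)                      # proportion correct per item
    sum_pq = np.sum(p * (1.0 - p))                  # sum of item variances
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1.0)) * (1.0 - sum_pq / total_var)

# Hypothetical response matrix: 6 examinees by 5 items.
responses = np.array([[1, 1, 1, 0, 1],
                      [1, 0, 1, 0, 0],
                      [1, 1, 1, 1, 1],
                      [0, 0, 1, 0, 0],
                      [1, 1, 0, 1, 1],
                      [0, 0, 0, 0, 0]])
print(round(kr20(responses), 3))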


Next, in order to examine whether the data were consistent with a single factor model, a confirmatory factor analysis (CFA) was conducted for each form. The RMSEA indicators were less than 0.026 for each test form, where it is suggested that RMSEA values be less than 0.05 (Stevens, 2002), indicating good fit. Thus the data appear consistent enough with the unidimensionality assumption to warrant examination of DIF using IRT methods, in addition to MH and LR. It is important to note that the CFA was conducted by form and by group. It is possible that results may have indicated violations to the unidimensionality assumption had the two groups (P&P, CBT) been analyzed together.

Research Question 1 was examined using DIF methodology to determine if an examinee's odds of answering an item were greater for the group assessed on the computer-based National Nurse Aide Assessment Program (NNAAP) compared to a paper-and-pencil version of the same test. Results indicated that items behaved differently and the examinee's odds of answering an item correctly were influenced by the test mode administration for several items, ranging from 23% of the items on Forms W and Z (MH) to 38% of the items on Form X (1-PL), with an average of 29%. These percentages are similar to what Flowers and Oshima (1994) found when examining DIF by gender and ethnic groups. They found item DIF ranging from 19% to 55% of items in a 75-item exam. These results are consistent with the observation of Wang (2004) that real or operational tests tend to contain a moderate percentage of DIF items, equal to around 20%. Similar to findings by Schaeffer et al. (1993), the flagged items in this study did not contain any typical item type concerns, such as scrolling multiple screens or long reading passages, that tend to produce a mode effect (Greaud & Green, 1986; Mazzeo & Harvey, 1988; Parshall et al., 2002). However, presentation differences (Pommerich & Burden, 2000) such as font style, font size, and item layout may have been present.

Reviewing the DIF items by method (see Table 34), the 1-PL tended to moderately favor P&P in all five forms, MH tended to find more items favoring P&P in one form and to find more items favoring CBT in three forms, and LR tended to find more items that favored P&P in four forms and to find more items that favored CBT in one form. Reviewing by form, Form A tended to have slightly more items that favored P&P in one method and CBT in two methods, Form V tended to have slightly more items that favored P&P in all three methods, Form W tended to have slightly more items that favored P&P in two methods, Form X tended to have slightly more items that favored P&P in two methods and CBT in one method, and Form Z tended to have slightly more items that favored P&P in two methods and CBT in one method. These findings are consistent with other studies that have looked at item level DIF (generally comparing gender or ethnicity groups) and found DIF at the item level (Dodeen & Johanson, 2001; Huang, 1998; Schwarz, 2003; Zhang, 2002).


Table 34. Number of items favoring each mode by test form and DIF method
Number of Items Favoring Mode of Administration
Form | DIF Method | Favor P&P | Favor CBT
A (N = 5,441) | 1-PL | 10 | 9
A | MH | 6 | 10
A | LR | 7 | 10
V (N = 10,018) | 1-PL | 10 | 6
V | MH | 9 | 5
V | LR | 10 | 6
W (N = 8,751) | 1-PL | 11 | 7
W | MH | 7 | 7
W | LR | 9 | 6
X (N = 9,137) | 1-PL | 12 | 11
X | MH | 7 | 9
X | LR | 11 | 8
Z (N = 5,670) | 1-PL | 12 | 9
Z | MH | 6 | 8
Z | LR | 9 | 8
Note: 1-PL = 1-Parameter Logistic Model; MH = Mantel-Haenszel; LR = Logistic Regression
Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.

Given that each test form did have DIF items, the researcher wanted to see if these DIF items affected the overall test performance and therefore the test score interpretation. To do this, the test characteristic curves for each test form were examined by administration mode groups. Upon investigation of these TCCs, it can be concluded that the impact of the DIF items on the test was not significant. This finding is consistent with Flowers and Oshima (1994), who compared DIF and DTF in a study of gender and ethnic groups on a statewide testing program (N = 1,000) and found that 14 to 41 items were flagged as exhibiting DIF while results of the consistency of DTF were encouraging.


Hauser and Kingsbury (2004) studied DIF and its effect on DTF using standardized test data of students in grades 4, 8, and 10 on an achievement test (comparing gender and ethnic groups) and found that the overall level of DIF in any test had an impact that was virtually nil on the larger issue of DTF.

Uniform DIF occurs more frequently than nonuniform (crossing) DIF (Jodoin & Gierl, 2001), and this study was designed to investigate uniform DIF. However, nonuniform DIF was examined using the interaction term in the LR model. For the five test forms combined, 36 items were flagged as exhibiting nonuniform DIF. Thirteen of those items were also flagged by one of the uniform DIF detection methods, and the remaining items appeared not to exhibit an odds ratio far from 1.0.

Research Question 2 was examined using three commonly reported methods (1-PL, MH, LR) for assessing DIF, which were used to conduct the analyses required to explore Question 1. These three methods were selected because they have been cited in applied studies (Congdon & McQueen, 2000; Dodeen & Johanson, 2001; Huang, 1998; Johanson & Johanson, 1996; Zhang, 2002) and report reliable results as recorded in studies using simulated data (De Ayala et al., 1999; Luppescu, 2002; Penny & Johnson, 1999; Rogers & Swaminathan, 1993; Wang, 2004; Zwick et al., 1993) compared to other methods (e.g., likelihood ratio, H statistic) for the detection of DIF. After the DIF analyses, the kappa statistic was used to examine the degree to which the results of the three methods agreed in the detection of DIF. Each of the three methods (1-PL, MH, and LR) detected items exhibiting DIF in each test form (ranging from 14 to 23 items).


The kappa statistic demonstrated a strong degree of agreement between paired methods of analysis for each test form and each DIF method pairing. Kappa values higher than 0.75 represent excellent agreement (Simple Interactive Statistical Analysis) while values between 0.4 and 0.75 represent fair to good agreement. Using these guidelines, the kappa statistic in this study reported good to excellent agreement in 15 out of 15 pairings. These kappas were higher than those found in a DIF study (comparing gender and ethnicity) conducted by Flowers and Oshima (1994) where no kappa indices were above 0.7 and many indices indicated poor agreement.

These results were not surprising as these three methods have been compared and were found to yield similar results in the detection of DIF. Dodeen and Johanson (2000) found similar results in their comparison of the MH and LR methods to examine gender differences on attitudinal scale data. Several studies used the IRT method (Congdon & McQueen, 2000; Lei et al., 2005; Wang, 2004) and MH results were favorable when data fit the Rasch model (Penny & Johnson, 1999). Among these studies, Congdon and McQueen (2000) used the Rasch method to examine stability of raters on existing writing exam data. Luppescu (2002) simulated data to compare IRT and HLM. Wang (2004) simulated data to examine the effects of anchor item methods within IRT models. Penny and Johnson (1999) found that Type I error rates were not inflated when the MH was used and data fit the Rasch model, and NNAAP data were calibrated using the Rasch model. However, Zhang (2002) found that MH was more powerful than LR. Many DIF studies compare methods using simulated data to compare demographic groups (e.g., gender).


Conversely, this study was conducted to compare these DIF detection methods using real data and to evaluate administration mode effect (rather than gender or ethnicity).

Limitations of Study

Examinees were not randomly assigned to test administration mode; therefore, it is possible that differences may exist by virtue of the group membership. Further, the examinees making up the CBT group were all from one state and none of the examinees from this particular state took the NNAAP under the P&P administration. This could confound any DIF interpretations as it is possible that the cause of item DIF could be curricular-based due to any possible curriculum differences among states rather than administration mode.

In this study, the researcher used data with real examinees, which may illustrate items that behave differentially in the test administration platform for this set of examinees. However, conclusions drawn from the study results cannot be generalized beyond this sample of examinees. Additionally, a finite number of forms were reviewed in this study, and the testing vendor cycles through multiple forms. Consequently, results cannot be generalized to forms not examined in this study.

In this study, the investigator examined the sensitivity of DIF detection methods to flag DIF items. The sample was limited to examinees prepared to be evaluated on their knowledge of a specific construct (e.g., nurse aide responsibilities), and the administration mode was unique (varied by state of administration rather than within state).


Therefore, conclusions about the relative sensitivity of the DIF methods cannot be generalized beyond this test or set of examinees. Further, Flowers and Oshima (1994) recommend that DIF indices be interpreted according to each context (how the DIF indices are used) to help eliminate some of the inconsistency in reliability of results between DIF indices.

The current study had a large sample size, which resulted in a high level of power, leading to analyses that can be sensitive to small levels of DIF. Researchers strive to increase power in order to increase statistical validity and to decrease the chance of erroneous conclusions, and increases in power allow the researcher to be more confident that the findings could exist in the population. However, the size of the sample in this study may have resulted in the statistical tests being overly sensitive. This sensitivity could have resulted in small effects of DIF being detected that could be inconsequential to the impact of the item performance. Indeed, some of the statistically significant differences were for small observed effects. For example, the odds ratio for Form V item 25 was 0.708 (LR) and the odds ratio for Form V item 39 was 1.333 (LR). Similarly, the odds ratio for Form V item 25 was 0.700 (MH) and the odds ratio for Form V item 39 was 1.324 (MH). As a final note, this study examined the reliability of the DIF methods (the consistent identification of DIF in a particular item between samples), not the validity of the methods (whether the item identified as exhibiting DIF is truly biased).


Implications and Suggestions for Future Study

Test Development

This study dealt with DIF, not bias. It is possible that the items flagged as exhibiting DIF may also be biased. The study provides the groundwork to conclude that items did behave differently between the two platforms (computer and paper). In order to draw conclusions about item bias, additional work is required. Determining DIF is based on an empirical examination of the data. To determine if an item is also biased, the test developer would form a group of experts such as members of the National Council of State Boards of Nursing, subject matter experts, and representatives of the test developer's staff. Since the test is given in several states and laws can vary by state, experts for each state administering the instrument should be included as part of the panel, as well as experts in software development who can provide insight into font, color, and other presentation issues that can affect the display on the computer monitor. This group would review each item that was identified as exhibiting DIF in order to logically determine if the difference in item performance across examinee groups is unrelated to the construct intended to be measured by the test (Camilli & Shepard, 1994; Clauser & Mazor, 1998; Ellis & Raju, 2003; Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991; Holland & Thayer, 1988; Penfield & Lam, 2000; Smith & Smith, 2004), that is, the skills required to complete the responsibilities required of a nurse aide.


If the committee of experts finds an item is biased for an examinee group administered the exam on one platform versus the other (e.g., CBT examinees tend to correctly answer the item compared to P&P examinees), this expert committee would decide if the item should be revised for the administration platform or eliminated.

Since the data examined in this study represented varied states, the cause of the DIF or possible bias may not be the administration platform. Rather, the cause could be that students who take standardized tests could receive a different curriculum. It is reasonable to assume that the nurse aide curriculum in each state using the NNAAP is based on the same standards. It is also reasonable to expect that examinees of equal ability given the same curriculum should perform equally on the NNAAP. Further, DIF analyses were originally conducted to determine differences on a single test form related to economic status, gender, or ethnicity. Since these demographic variables were not available for this study, it is possible that differences in various demographic variables (e.g., age, educational background, gender, ethnicity, and nurse aide training site) could affect item performance for examinees of equal ability. These questions have not been resolved. To explore this issue further, additional analyses should be conducted disaggregating results by subgroup to determine if there is an impact affecting item performance.

It is critical to review items that exhibit DIF in an exam that is administered dual platform (Lei et al., 2005; Parshall et al., 2002; Penfield & Lam, 2000; Welch & Miller, 1995) where score interpretation can be affected. In this situation, two examinees, in effect, have non-equated test scores.


This is a critical concern if an examinee may have been administered a non-equated form of the test where score values were not adjusted for difficulty, resulting in erroneous score interpretations. However, the NNAAP is not administered dual platform within one state and each state has set its own pass score; therefore, all examinees within a state are being compared on an equal standard and score interpretations are not affected.

It is possible that an examinee could successfully complete the required curriculum and pass the written and skills portions of the NNAAP, meeting the state's certification criteria as a nurse aide. This examinee may move to another state that also uses the NNAAP as part of the certification requirements. If the examinee were to inquire about the state's reciprocity position, the state may accept the certification or choose to accept the test score and apply the unique state cut score. In such a case, the NNAAP is essentially being administered in both computer and paper versions. As such, additional analyses are recommended using a counterbalanced design so that identical examinees are taking both the P&P version of the test and the same test form administered as a CBT. This design would identify differences, if any, due to administration platform.

Computer programs exist for assessing differential functioning for unidimensional dichotomous data, unidimensional polytomous data, and multidimensional dichotomous data (Oshima, Raju, Flowers, & Slinde, 1998), allowing researchers to examine possible sources of DIF on questionnaires which use the Likert scale, essay or extended response items, or partial credit items.


With advances in technology, it may be possible to convert the observation instrument for the skills portion of the NNAAP and review the instrument for differential functioning. It would be interesting to see if a difference exists in examinees for skills demonstrated versus cognitive ability.

Finally, if the test developer conducted an item bias review and several items were recommended for revision or replacement, there would be an impact on the construct of the test and a financial burden on the test developer. It is possible and likely (Hauser & Kingsbury, 2004) that all tests have items that exhibit DIF and that some of these items can remain in the test without harmful effects. Rewriting items takes a great deal of time and is expensive. Further, revising or rewriting several items could result in a test that measures a different construct than was intended. It is possible that upon examination of DTF (versus DIF), the items exhibiting DIF cancel each other out so that the test is free of DTF and does not function differentially. To examine this further, analyses using DTF or even Differential Bundle Functioning (DBF) are recommended. DBF allows the examination of the data to determine if a pattern of results occurs relative to a common secondary dimension (Gierl et al., 2001). In this case, the items may be bundled by the clusters (e.g., Physical Care Skills, Psychosocial Care Skills, Role of the Nurse Aide) measured in the NNAAP.

Methodologist

The test developer uses a variety of analyses to constantly review test items, including item analysis, DIF analysis, and reliability analyses, to ensure that the test provides reliable scores that can be validly interpreted, is free of bias, measures the intended construct, is appropriately scaled for the current sample of examinees, and has item security.


Since the test developer deals with data from operational exams, it is important that the methods used for such analyses yield consistent, valid, and reliable results. The methodologist is able to assist in this area. The methodologist is able to compare methods using various situations to determine if statistical methods vary or if assumptions are violated. There are three areas that are recommended as a result of this study for future exploration: (1) guidelines for selecting among DIF methods, (2) the determination of practical significance, and (3) the examination of alternative methods for determining the matching criterion (e.g., match based on item scores rather than total score).

DIF Methods

While many of the items in the current study were flagged by a combination of DIF methods and the kappa statistic was satisfactory, there were items flagged by one method and not the other two or by two of the three methods. Each method used in the current study was noted for attributes contributing to the outcome of the analyses. The non-parametric tests like Logistic Regression and Mantel-Haenszel are less complicated and require fewer assumptions. Within the examination of data, there is a question of uniform DIF versus nonuniform DIF. Parametric tests like IRT are better for the detection of nonuniform DIF (Lei et al., 2005). LR has been shown to detect nonuniform DIF better than MH (Rogers & Swaminathan, 1993). Uniform DIF occurs more often than nonuniform DIF, and some researchers (Jodoin & Gierl, 2001) have suggested that the focus on uniform DIF is more appropriate for test development. Examining nonuniform DIF can result in a loss of power.


For example, using the interaction term in LR can impact the Type I error because of the loss of one degree of freedom (Jodoin & Gierl, 2001) since the researcher is examining mode and the mode by score interaction, resulting in two analyses. Studies examining this nonuniform DIF have typically been conducted using simulated data often designed to mimic real data (see Rogers & Swaminathan, 1993; Wang, 2004). The issue of which method was better at detecting nonuniform DIF was not resolved in this study.

In fact, many of the DIF studies have been conducted through the use of simulated data such as Monte Carlo studies or studies based on achievement test data. Many of these studies have compared groups based on gender or ethnicity, not on administration mode. Additional studies conducted using actual data from certification and licensure exams administered using different administration modes should be published. These studies would allow the comparison of how achievement test items behave when presented in dual platform (paper and computer) compared to certification and licensure test items.


Given that there are several factors that can impact an analysis, it would be helpful to produce a comprehensive review of each method for various types of data (e.g., dichotomous versus polytomous, achievement measures, certification and licensure measures, attitudinal measures) in order to produce a set of guidelines for the differing conditions that may be present and the DIF method that provides the greatest power among methods that provide Type I error control, allowing the analyst to choose a method more appropriate to the conditions (e.g., sample size, uniform versus nonuniform DIF, cost of analysis, complexity of method, presence of effect size indicator) of the given situation.

Practical Significance

There are few guidelines for determining how large a difference in the difficulty estimate between examinee groups should be before it is considered to be of practical significance (Hauser & Kingsbury, 2004). In most cases, the researcher is flagging items according to statistical significance, but it is possible that there are instances where statistical significance is not practical significance. The validity and reliability of test items are critical and can be a sensitive topic, particularly in high stakes environments, and it is possible that items yielding statistical significance indicating the existence of DIF are not, for practical purposes, impacting the examinee's performance or score interpretation. Additional methodological work conducted in varied DIF methods would be useful in providing guidelines that help to inform others when the difficulty estimate difference value is of practical significance in addition to statistical significance. Further, additional work is warranted using effect size indicators such as MH and R² (LR). The use of confidence intervals can assist in this area by providing an index relative to the sample for which there is a 95% chance that the true difficulty difference mean falls in a particular range, providing the researcher confidence in the practical significance of the obtained results.
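One convention that pairs statistical significance with a practical-significance judgment is the ETS classification, which converts the MH common odds ratio to the delta scale (delta = -2.35 times the natural log of the odds ratio) and labels items A (negligible), B (moderate), or C (large). The sketch below applies a simplified version of that rule to odds ratios of the magnitude reported earlier for Form V items 25 and 39; it is offered as an example only, since the study itself did not apply this classification.

import math

def ets_dif_category(mh_odds_ratio, significant):
    """Simplified ETS A/B/C classification based on the MH delta metric."""
    delta = -2.35 * math.log(mh_odds_ratio)
    if abs(delta) < 1.0 or not significant:
        return delta, "A (negligible)"
    if abs(delta) < 1.5:
        return delta, "B (moderate)"
    return delta, "C (large)"

for odds in (0.700, 1.324):     # MH odds ratios similar to Form V items 25 and 39
    delta, label = ets_dif_category(odds, significant=True)
    print(f"alpha_MH={odds:.3f}  delta={delta:+.2f}  {label}")

Both of these odds ratios fall in the negligible range on the delta scale, consistent with the observation above that some statistically significant DIF reflected small observed effects.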


Matching Criterion Approach

This study used the score-anchored approach to DIF, where the test score (e.g., total number right) was used as the matching criterion across groups. This approach is used by the majority of DIF analyses in the measurement literature, and the procedures are numerous and well-known (Koretz & McCaffrey, 2005; Schmitt & Crone, 1991). This has the advantage of not requiring assumptions about individual items but assumes that the score has the same meaning across the groups being compared, and it is possible that this approach could have a logical inconsistency (Koretz & McCaffrey, 2005). Therefore, it would be reasonable to examine items for DIF using an alternate approach. Koretz and McCaffrey (2005) recommend the item-anchored approach, where certain items are determined to behave similarly across groups. These items are used as an anchor in the analysis of DIF.

Another method for the matching criterion is to use the purification method, where items exhibiting DIF are gradually eliminated in an iterative score-anchored (modified) approach (Holland & Thayer, 1988; Jodoin & Gierl, 2001; Koretz & McCaffrey, 2005). Once computed, items identified as exhibiting DIF are removed and the scores recalculated. The new score is used in a second DIF analysis assessing all items. This purification of the matching criterion can be computed to eliminate the effect of DIF items within the matching criterion that can occur when an internal criterion is used (Clauser & Mazor, 1998; Holland & Thayer, 1988; Jodoin & Gierl, 2001). Holland and Thayer (1988) recommend that the studied item be included in the matching criterion for the first screening and omitted from the screening of other items when using the purification procedure.
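The purification procedure described above amounts to a simple loop: screen all items using the total score as the matching criterion, remove the flagged items from the criterion, rescore, and repeat until the flagged set stabilizes. The schematic Python sketch below assumes a user-supplied detect_dif function (a hypothetical placeholder for an MH or LR screening); it also simplifies the Holland and Thayer recommendation by excluding all flagged items from the criterion rather than adding the studied item back in for its own screening.

import numpy as np

def purified_dif(responses, group, detect_dif, max_iter=10):
    """Iterative score-anchored (purified) DIF screening.

    responses: examinee-by-item 0/1 matrix; group: administration mode per examinee;
    detect_dif: callable returning the set of flagged item indices for a given
    matching criterion (e.g., an MH or LR screening supplied by the user).
    """
    n_items = responses.shape[1]
    flagged = set()
    for _ in range(max_iter):
        keep = [j for j in range(n_items) if j not in flagged]
        criterion = responses[:, keep].sum(axis=1)       # rescored matching criterion
        new_flags = set(detect_dif(responses, group, criterion))
        if new_flags == flagged:                         # flag set has stabilized
            return flagged
        flagged = new_flags
    return flagged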


Studies comparing each of these methods for selecting the matching criterion in DIF studies would strengthen the literature and provide valuable information to test developers as they conduct more powerful analyses.

The current understanding of items exhibiting DIF and item bias and their impact on test scores is still incomplete. It is a challenge to produce and monitor an instrument that provides scores that are a valid, reliable measure of a person's ability on a given construct so that the instrument yields good information for the client and does not unfairly advantage a particular membership group of examinees. Yet, it is a challenge that we must continue to accept. Since this study used an operational assessment instrument, the analyses can be used by test developers and practitioners to conduct similar studies to examine their data. This type of study can be useful when the test developer is considering moving from P&P to CBT and wants to examine whether items behave differently in the two modes in ways that could impact score validity. Scholars who wish to apply some of the theories in the literature using real data to probe how test conditions affect actual scores may use these study results to identify possible trends and to compare other types of assessment instruments. The study provides a framework for applying some of the methods in the literature. Finally, educators may use these analyses to identify aberrant responses of examinee groups on individual items. This information may be used for instructional planning purposes.


REFERENCES

American Psychological Association. (1986). Guidelines for computer-based tests and interpretations. Washington, DC: Author.

Association of Test Publishers. (2002). Guidelines for computer-based testing. Washington, DC: Author.

Bergstrom, B. A., Luntz, M. E., & Gershon, R. C. (1992). Altering the level of difficulty in computer adaptive testing. Applied Measurement in Education, 5, 137-149.

Bolt, D. M. (2000). A SIBTEST approach to testing DIF hypotheses using experimentally designed test items. Journal of Educational Measurement, 37, 307-327.

Bond, T., & Fox, C. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum.

Bonnardel, P. (2005). The Kappa coefficient: The measurement of interrater agreement when the ratings are on categorical scales. Retrieved on September 12, 2005, from http://kappa.chez.tiscali.fr/kappa.txt


Briggs, D., & Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4, 87-100.

Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage Publications.

Camilli, G., & Penfield, D. A. (1997). Variance estimation for differential test functioning based on Mantel-Haenszel statistics. Journal of Educational Measurement, 34(2), 123-139.

Carletta, J. (2004). Assessing agreement on classification tasks: The kappa statistic. Retrieved on September 12, 2005, from http://acl.ldc.upenn.edu/J/J96-2004.pdf

Chang, H., Mazzeo, J., & Roussos, L. (1995). Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Research Report. Educational Testing Service. Princeton, NJ. (ERIC Document Reproduction Service No. ED390886).

Choi, I., Kim, K. S., & Boo, J. (2003). Comparability of a paper-based language test and a computer-based language test. Language Testing, 20, 295-320.

Clauser, B. E., & Mazor, K. M. (1998, Spring). Using statistical procedures to identify differentially functioning test items. Educational Measurement, 31-44.


Congdon, P. J., & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37, 163-178.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth, TX: Harcourt Brace Jovanovich.

Davier, A. A. von, Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. New York: Springer.

De Ayala, R. J., Kim, S., Stapleton, L. M., & Dayton, C. M. (1999). A reconception of differential item functioning. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada. (ERIC Document Reproduction Service No. ED447153).

Department of Health and Human Services Office of Inspector General. (2002). Nurse Aide Training. Retrieved on October 7, 2005, from http://oig.hhs.gov/oei/reports.oei-05-01-00030.pdf

Dodeen, H., & Johanson, G. (2001). The prevalence of gender DIF in survey data. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA. (ERIC Document Reproduction Service No. ED454269).

Drasgow, F., & Lissak, R. (1983). Modified parallel analysis: A procedure for examining the latent dimensionality of dichotomously scored item responses. Journal of Applied Psychology, 68, 363-373.


Eckstein, M. A., & Noah, H. J. (1993). Secondary school examinations: International perspectives on policies and practice. New Haven, CT: Yale University Press.

Ellis, B. B., & Raju, N. S. (2003). Test and item bias: What they are, what they aren't, and how to detect them. In Measuring up: Assessment issues for teachers, counselors, and administrators. (ERIC Document Reproduction Service No. ED480042).

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.

Fidalgo, A. M., Ferreres, D., & Muniz, J. (2004). Liberal and conservative differential item functioning detection using Mantel-Haenszel and SIBTEST: Implications for type I and type II error rates. The Journal of Experimental Education, 73, 23-39.

Flowers, C. P., & Oshima, T. C. (1994, April). The consistency of DIF/DTF across different test administrations: A multidimensional perspective. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Fredericksen, N., Mislevy, R. J., & Bejar, I. I. (Eds.) (1993). Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum.

Garson, G. D. (2005). Logistic regression. Retrieved on October 7, 2005, from http://www2.chass.ncsu.edu/garson/pa765/logistic.htm


Gierl, M. J., Bisanz, J., Bisanz, G. L., Boughton, K. A., & Khaliq, S. N. (2001, Summer). Illustrating the utility of differential bundle functioning analyses to identify and interpret group differences on achievement tests. Educational Measurement: Issues and Practice, 26-36.

Glasser, R. (1981). The future of testing: A research agenda for cognitive psychology and psychometrics. American Psychologist, 36, 923-936.

Godwin, J. (1999). Designing the ACT ESL listening test. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.

Greaud, V., & Green, B. F. (1986). Equivalence of conventional and speed tests. Applied Psychological Measurement, 10, 23-34.

Green, B. F. (1978). In defense of measurement. American Psychologist, 33, 664-670.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park: Sage.

Harmes, J. C., Parshall, C. G., Rendina-Gobioff, G., Jones, P. K., Githens, M., & Dennard, A. (2004, November). Integrating usability methods into the CBT development process: Case study of a technology literacy assessment. Paper presented at the annual meeting of the Florida Educational Research Association, Tampa, FL.


Hauser, C., & Kingsbury, G. (2004). Differential item functioning and differential test functioning in the Idaho Standards Achievement Tests for spring 2003. Oregon: Northwest Evaluation Association. (ERIC Document Reproduction Service No. ED491248).

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In Wainer, H., & Braun, H. (Eds.), Test validity. New Jersey: Lawrence Erlbaum.

Holland, P. W., & Wainer, H. (Eds.) (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.

Huang, C. (1998). Factors influencing the reliability of DIF detection methods. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA. (ERIC Document Reproduction Service No. ED419833).

Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329-349.

Johanson, G. A., & Johanson, S. N. (1996). Differential item functioning in survey research. Paper presented at the annual meeting of the American Educational Research Association, New York, NY. (ERIC Document Reproduction Service No. ED399293).

Johnson, G. (2003). Computerized CPA exam only months away. Retrieved May 1, 2004, from http://www.aicpa.org/pub/jofa/sep2003/special.htm

PAGE 205

the paper-and-pencil MCAT? Paper presented at the 2003 MCAT Graduate Student Research Pr ogram, Washington, D.C. Jones, P. K., Parshall, C. G., & Harmes, J. C. (2003, February). Audio-enhanced internet courses: The impac t on affect and achievement Paper presented at the annual meeting of the Eastern Educational Research Association, Hilton Head, S.C. Julian, E., Wendt, A., Way, D., & Zara, A. (2001). Mo ving a national licensure examination to computer. Nurse Educator 26, 264-267. Kifer, E. (2001). Large-scale assessment: Dimensions, dilemmas, and policy Thousand Oaks, CA: Corwin Press. Kolen, M. J., & Brennan, R. L. (1995). Test equating: Methods and practices New York: Springer. Koretz, D. M., & McCa ffrey, D. F. (2005). Using IRT DIF methods to evaluate the validity of score gains. CA: The Regents of the University of California. Kubiszyn, T., & Borich, G. (2000). Educational testing and measurement: Classroom applicati on and practice 6 th edition New York: Wiley & Sons. Lake County. Job posting. Retrieved on November 22, 2005 from http:// www.co.lake.il.us/hr/j obs/data/positions/8111.pdf Lei, P., Chen, S., & Yu, L. (2005). Comparing methods of assessing differential item functioning in computeriz ed adaptive testing environment Paper presented at the Annual Meeting of the National Council on Measurement in Education, Montreal, Canada. 192

PAGE 206

Lord, F. M. ( 1980). Applications of item response theory to practical testing problems Hillsdale, NJ: Lawrence Erlbaum. Luppescu, S. (2002). DIF detection in HLM Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans: LA. (ERIC Document Repr oduction Service No. 479334). Matthews, D. E., & Fa rewell, V. T. (1988). Using and understanding medical statistics 2 nd edition. Switzerland: Karger. Mazzeo, J., & Harvey, A. L. (1988). The equivalence of scores from automated and conventional educational and psyc hological test: A review of the literature. New York: College Entranc e Examination Board. Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-andpencil cognitive ability te sts: A meta-analysis. Psychological Bulletin 114 449-458. Miles, E. W., & King, W. C. (1998) Gender and administration mode effects when pencil-and-paper personality test are computerized. Educational and Psychological Measurement 58, 68-69. Mills, C. N., Potenza, M. T., Fremer, J. J., & Ward, W. C. (Eds.) (2002). Computer-based testing: Building t he foundation for future assessments. Mahwah, NJ: Lawrence Erlbaum. Millsap, R. E., & Everson, H. T. (1993). Methodology re view: Statistical approaches for assessing measurement bias Applied Psychological Measurement, 17, 297-334. 193

PAGE 207

Mould, R. F. (1989). Introductory medical statistics 2 nd edition Bristol, England: Adam Hilger. Muckle, T., & Becker, K. (2005). The national nurse aide assessment program (NNAAP) 2004 technical report for ex aminations administered between January 1, 2004 and De cember 31, 2004 Evanston, IL: Promissor Inc. Murphy, K. R., & Myors, B. (2004). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests 2 nd edition Mahwah, NJ: Lawrence Erlbaum. Muthn, L. K., & Mu thn, B. O. (2004). Mplus users guide. 3 rd Edition Los Angeles, CA: Author. National Commission on Excellence in Education (1983). A nation at risk: The imperative for educational reform Washington, DC: Government Printing Office. Neuman, G., & Baydoun, R. (199 8). When are they equivalent? Applied Psychological Measurement, 22 71-83. Oshima, T. C., Raju, N. S ., Flowers, C. P., & Slinde, J. A. (1998). Differential bundle functioning using the DFIT framework: Procedures for identifying possible sources of differential functioning. Applied Measurement in Education, 11, 353-369. Parshall, C. G. (1992). Computer testing vs. pape r-and-pencil testing: An analysis of examinee characteristics associated with mode effect on the GRE General Test. (Unpublished doctoral dissertati on, University of South Florida, Tampa). 194

PAGE 208

Parshall, C. G. (1995). Practical issues in computer-based testing. Journal of Instruction Delivery Systems 13-17. Parshall, C. G., Davey, T., Ka lohn, J., & Spray, J. (2002). Practical considerations in computer-based testing. New York: Springer-Verlag. Parshall, C. G., & Kromre y, J. D. (1993, April). Computer testing versus paperand-pencil testing: An analysis of examinee characteristics associated with mode effect Paper presented at the annual meeting of the American Educational Research Association, Atlanta, GA. Pedhazur, E., J. (1997). Multiple regression in behavioral research: Explanation and prediction 3 rd edition Fort Worth, TX: Harcourt Brace College Publishers. Penfield, R. D. (2006). Differential item functioning analysis system (DIFAS) (Version 2.0) [Computer software]. Author. Penfield, R. D., & Lam, T. C. M. ( 2000, Fall). Assessing differential item functioning in performance assessm ent: Review and recommendations. Educational Measuremen t: Issues and Practice 5-15. Penny, J., & Johnson, R. L. (1999). How group differences in matching criterion distribution and IRT item difficulty c an influence the magnitude of the Mantel-Haenszel chisquare DIF index. The Journal of Experimental Education, 67, 343-366. Poggio, J., Glasnapp, D. R., Yang, X., & Poggio, A.J. (2005). A comparative evaluation of score results fr om computerized and paper & pencil mathematics testing in a large scale state assessment program. Journal of 195

PAGE 209

Technology, Learning, and Assessment 3. Retrieved June 22, 2005, from http://www.jtla.org Pommerich, M., & Burden, T. (2000). From simulation to application: Examinees react to computerized testing Paper presented at the annual meeting of the National Council on Measurement in Education (New Orleans, LA April 25-27). Pommerich, M. (2002, April). The effect of administration mode on test performance and score precision factor s contributing to mode differences Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA. Promissor. NNAAP sample items Retrieved on November 22, 2005 from http://www.promissor.com Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika 53, 495-502. Raju, N. S., Linden, W. J. van der, & Fl eer, P. F. (1995). IRT-based internal measures of differential f unctioning of items and tests. Applied Psychological Measurement, 19(4), 353-368. Rogers, H. J., & Swami nathan, H. (1993). A comparison of Logistic Regression and Mantel-Haenszel procedures for detecting differential item functioning Applied Psychological M easurement, 17, 106-116. Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century New York: Henry Holt and Co. 196

PAGE 210

Sawaki, Y. (2001). Compar ability of conventional and computerized tests of reading in a second language. LLT 5 38-59. Schaeffer, G. A., Reese, C. M. Steffan, M., McKinley, R. L., & Mills, C. N. (1993). Field test of a computer-based GRE general test. GRE Board of Professional Report No. 88-08P (ETS RR 93-07). Princeton, New Jersey: Educational Testing Service. (ERIC Document Reproduction Service No. 385588). Schmitt, A. P., & Crone, C. R. (1991). Alternative mathematical aptitude item types: DIF issues. Princeton, New Jersey: Educational Testing Services. (ERIC Document Reproduction Service No. 384659). Schwarz, R., Rich, C., & P odrabsky, T. (2003, April). A DIF analysis of item-level mode effects for computerized and paper-and-pencil tests Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, IL. Scientific Software International. ( 2003). BILOG-MG for Windows, Version 3.0.2327.2. Lincolnwood, IL: Author. Segall, D. O., & Moreno, K. E. (1 999). Development of the computerized adaptive testing version of the Armed Services Vocational Aptitude Battery. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assessment. Mahwah, New Jersey: Lawrence Erlbaum. Simple Interactive Statistical Analysis (SISA). Kappa. Retrieved on July 17, 2006 from http://home.clara.net/sisa/twoby2.htm 197

PAGE 211

Smith, E. V., & Smith, R. M. (2004). Introduction to Rasch measurement: Theory, models, and applications Maple Grove Minnesota: JAM Press. Spray, J. A., Ackerman, T. A., Reckase, M. D., & Carlson, J. E. (1989). Effects of medium of item pr esentation on examinee performance and item characteristics. Journal of Educational Measurement 26, 261-271. Standards for Educational and Psychological Testing. (1999). Washington, DC: American Educational Research Asso ciation, American Psychological Association, National Council on Measurement in Education. Stark, S., Chernyshenko, S., Chuah, D., Lee, W., & Wadlington, P. (2001). Detection of DIF using the SIBTEST procedure Retrieved July 18, 2003, from http: //work.psych.u iuc.edu/irt/dif.sibtest.asp Statistical Analysis System (2003). Statistical Analysis System (Version 9.1) [Computer software]. Cary, NC: SAS Institute Inc. Statistical Package for the Social Sciences. (2005). Statistical Package for the Social Sciences (Version 14.0) [Computer software] Chicago, IL: Author. Stevens, J. P. (2002). Applied multivariate statisti cs for the social sciences 4 th edition New Jersey: Lawrence Erlbaum. University of Hawaii Kapiolani Community College. Nurse Aide Program Retrieved on November 22, 2005 from http://programs.kcc.ha waii.edu/health/na/ U. S. Department of Education. No Child Left Behind Act of 2001 Retrieved on September 26, 2005 from http://www. ed.gov/nclb/landing.jhtml?src=pb71k 198

PAGE 212

Wainer, H. (2000). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum. Wall, J. (2000). Technology -delivered assessments: Diamonds or rocks? (ERIC Document Reproduction Service No. ED446325). Wang, W. (2004). Effects of anchor item methods on the detection of differential item functioning within the family of Rasch models. The Journal of Experimental Education 72, 221-261. Wang, T., & Kolen, M. J. (1997). Evaluating comparabi lity in computerized adaptive testing: A theoretical framework with an example Paper presented at the annual meeting of t he American Educational Research Association, Chicago, IL. Welch, C. J., & Miller, T. R. (1995). Assessing differ ential item functioning in direct writing assessments: Problems and an example. Journal of Educational Measurement 32, 163-178. Wilson, L. W. (2002). Better instruction through assessment: What your students are trying to tell you. Larchmont, NY: Eye on Education. Wright, B. D. (1995). 3PL or Rasch? Rasch Measurement Transactions 9 408409. Yu, C. (2005). True score model and item response theory Retrieved on October 7, 2005 from http://seamonkey.ed.asu.edu/~alex /teaching/WBI/measurement.shtml Zhang, Y. (2002, April). DIF in a large scale mathematics assessment: The interaction of gender and ethnicity Paper presented at the Annual Meeting 199

PAGE 213

of the National Council on Measurement in Education, New Orleans: LA. (ERIC Document Reproduction Service No. ED464152). Zimowski, M., Muraki, E., Mi slevy, R., & Bock, D. ( 2003). BILOG MG in Mathilda du Toit (Ed.) IRT from SSI Chicago: Scientific Software International.. Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessing differential item functioning in performance tests. Educational Testing Series Re search Report. Princeton: NJ. (ERIC Document Reproduction Service No. ED386493). 200


Appendices


Appendix A. Letter of Data Approval


Appendix B. Institution Review Board Letter of Exemption


Appendix B (continued). Institution Review Board Letter of Exemption


Appendix C. Tables of Mean Percent Correct by State and Form

Table C1
Mean Percent Correct for Form A by State

State    Mean     N       Std. Deviation   Minimum   Maximum   Skewness   Kurtosis
AK       55.81    42      2.91             50        60        -0.44      -0.92
CA       52.05    220     6.33             24        60        -1.72      4.01
LA       52.18    71      5.09             31        59        -1.57      3.77
MD       52.62    13      5.33             38        59        -1.78      4.35
MN       54.61    826     4.95             11        60        -3.06      16.90
ND       55.28    32      3.94             45        60        -1.02      0.70
NJ       51.81    861     5.56             18        60        -1.63      4.28
NV       54.80    106     4.94             25        60        -2.75      12.23
RI       54.17    222     4.57             23        60        -2.12      9.39
TX       53.11    3038    4.76             24        60        -1.61      4.28
VI       50.44    9       7.52             33        56        -1.85      3.51
Total    53.18    5440    5.07             11        60        -1.85      6.11

Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.

Table C2
Mean Percent Correct for Form V by State

State    Mean     N       Std. Deviation   Minimum   Maximum   Skewness   Kurtosis
AK       56.29    133     3.65             34        60        -2.76      12.07
CA       54.99    193     4.86             23        60        -2.88      12.33
CO       55.71    975     4.89             17        60        -3.23      14.45
LA       53.99    71      4.15             41        60        -1.14      1.30
MN       56.35    1277    2.95             37        60        -1.86      5.85
ND       55.03    38      3.72             42        60        -1.29      2.76
NH       54.50    16      3.22             48        59        -0.75      0.56
NJ       53.58    922     5.47             15        60        -2.40      8.98
NV       55.50    126     4.07             34        60        -2.51      9.52
RI       54.64    233     4.76             33        60        -1.98      4.64
TX       54.39    5574    4.31             15        60        -2.10      7.73
VA       54.16    440     6.23             13        60        -2.78      10.76
VI       51.53    17      5.91             35        59        -1.53      2.78
Total    54.74    10015   4.53             13        60        -2.45      10.24

Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.


Table C3
Mean Percent Correct for Form W by State

State    Mean     N       Std. Deviation   Minimum   Maximum   Skewness   Kurtosis
AK       55.99    130     4.30             35        60        -2.49      7.67
CA       54.00    298     5.91             25        60        -2.21      5.96
CO       55.55    973     4.73             15        60        -3.10      14.88
LA       53.96    70      4.80             30        60        -2.05      7.88
MD       53.67    15      4.70             43        60        -0.91      0.33
MN       55.82    1354    4.06             12        60        -3.26      21.08
ND       55.73    52      4.64             37        60        -2.92      9.15
NJ       52.54    906     5.67             24        60        -1.50      3.05
NV       54.51    110     4.56             37        60        -1.62      2.73
RI       55.14    221     4.40             31        60        -2.22      7.60
TX       53.87    4198    4.99             15        60        -2.02      6.98
VA       53.83    413     6.08             17        60        -2.30      7.14
VI       53.82    11      3.54             47        58        -0.77      -0.19
Total    54.31    8751    5.06             12        60        -2.19      7.76

Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.

Table C4
Mean Percent Correct for Form X by State

State    Mean     N       Std. Deviation   Minimum   Maximum   Skewness   Kurtosis
AK       56.26    120     2.62             45        60        -1.52      3.92
CA       53.04    227     6.53             17        60        -2.28      7.20
CO       54.75    970     4.97             18        60        -2.98      13.48
CT       13.33    3       2.08             11        15        -1.29      --
LA       52.04    74      4.71             36        59        -1.15      1.29
MD       16.00    1       --               16        16        --         --
MN       55.32    1409    4.23             13        60        -3.80      25.53
ND       55.49    69      3.18             44        60        -1.54      3.10
NH       56.80    5       4.38             49        59        -2.18      4.80
NJ       51.73    908     6.48             13        60        -2.15      7.02
NM       52.79    42      6.58             17        59        -4.07      21.62
NV       55.19    114     4.47             33        60        -2.30      7.76
PA       16.00    1       --               16        16        --         --
RI       54.99    287     4.43             26        60        -2.26      8.37
TX       53.10    4433    4.66             11        60        -1.93      7.81
VA       53.58    426     5.86             22        60        -2.80      10.09
VI       53.36    14      3.59             48        59        -0.05      -1.40
WI       17.00    3       4.58             12        21        -0.94      --
Total    53.61    9106    5.18             11        60        -2.57      11.63

Note. Dashes indicate statistics that could not be computed because of the very small number of examinees in that state.
Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.


Table C5
Mean Percent Correct for Form Z by State

State    Mean     N       Std. Deviation   Minimum   Maximum   Skewness   Kurtosis
AK       54.98    60      3.59             40        60        -2.26      7.51
CA       54.25    228     5.06             19        60        -2.81      13.67
CO       54.64    931     5.04             14        60        -3.29      15.54
CT       12.00    1       --               12        12        --         --
LA       52.23    74      4.33             35        59        -1.02      2.08
MN       55.17    928     4.00             17        60        -3.43      21.56
ND       56.26    53      2.28             49        60        -0.91      1.33
NJ       51.65    947     5.92             14        60        -2.08      6.48
NV       54.39    99      3.79             40        60        -1.62      3.53
RI       54.44    216     4.00             40        60        -1.11      1.12
TX       53.71    1695    4.39             20        60        -1.90      6.81
VA       52.48    398     6.58             13        60        -2.48      8.75
VI       54.30    10      3.27             49        59        -0.55      -0.01
WI       13.00    3       3.61             9         16        -1.15      --
Total    53.72    5643    5.15             9         60        -2.70      12.27

Note. Dashes indicate statistics that could not be computed because of the very small number of examinees in that state.
Copyright 2005 by Promissor, Inc. All rights reserved. Used with permission.
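The descriptive statistics reported in Tables C1 through C5 (mean, N, standard deviation, minimum, maximum, skewness, and kurtosis of the number-correct scores, grouped by state) can be reproduced directly from examinee-level data. The study's analyses were conducted with the statistical packages cited in the references (SAS, SPSS, and BILOG-MG); the short Python sketch below is offered only as an illustration of the computation, and the column names state and num_correct are hypothetical placeholders rather than fields from the actual NNAAP data file.

```python
# Illustrative sketch only; not the dissertation's analysis code.
# Assumes a hypothetical examinee-level table with columns "state" and
# "num_correct" (number of the 60 scored items answered correctly).
import pandas as pd


def summarize_by_state(df: pd.DataFrame) -> pd.DataFrame:
    """Return mean, N, SD, min, max, skewness, and kurtosis of scores by state."""
    grouped = df.groupby("state")["num_correct"]
    summary = grouped.agg(
        Mean="mean",
        N="count",
        Std_Deviation="std",        # sample standard deviation
        Minimum="min",
        Maximum="max",
        Skewness=pd.Series.skew,    # adjusted Fisher-Pearson coefficient
        Kurtosis=pd.Series.kurt,    # excess kurtosis
    )
    # Append a "Total" row computed over all examinees, matching the tables.
    total = df["num_correct"].agg(
        ["mean", "count", "std", "min", "max", "skew", "kurt"]
    )
    summary.loc["Total"] = total.values
    return summary.round(2)


if __name__ == "__main__":
    # Tiny fabricated example purely to show the call pattern.
    demo = pd.DataFrame(
        {
            "state": ["AK", "AK", "TX", "TX", "TX"],
            "num_correct": [56, 54, 58, 49, 52],
        }
    )
    print(summarize_by_state(demo))
```

For states with only one or a few examinees, pandas returns NaN for the standard deviation, skewness, or kurtosis, which mirrors the blank cells shown for those states in Tables C4 and C5.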


ABOUT THE AUTHOR

Peggy Jones holds a Bachelor of Science in Elementary Education from Florida State University and a Master of Arts in Educational Leadership from the University of South Florida. She has worked for a Florida public school system at the elementary, middle, and district levels as a teacher and administrator.