Control number: 001461865
Identifier: E14SFE0000222
OCLC number: (OCoLC)54908333
Local numbers: AJQ2277; SFE0000222
Cataloging source: FHM
Call number: LB1555
Author: Aaron, Lisa Therese.
Title: A comparative simulation of Type I error and power of four tests of homogeneity of effects for random- and fixed-effects models of meta-analysis [electronic resource] / by Lisa Therese Aaron.
Publication: [Tampa, Fla.] : University of South Florida, 2003.
Dissertation note: Thesis (Ph.D.)--University of South Florida, 2003.
Notes: Includes bibliographical references. Includes vita. Text (Electronic thesis) in PDF format. System requirements: World Wide Web browser and PDF reader. Mode of access: World Wide Web. Title from PDF of title page. Document formatted into pages; contains 244 pages.
Adviser: Jeffrey Kromrey, Ph.D.
Keywords: meta-analytic Q tests; homogeneity of effects; fixed-effects tests; random-effects tests; tau squared.
Subject: Dissertations, Academic -- USF -- Interdisciplinary Education -- Doctoral.
Host item: USF Electronic Theses and Dissertations.
Electronic access: http://digital.lib.usf.edu/?e14.222
A Comparative Simulation of Type I Error and Power of Four Tests of Homogeneity of Effects for Random- and Fixed-Effects Models of Meta-Analysis
by
Lisa Therese Aaron
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Interdisciplinary Studies
College of Education
University of South Florida
Major Professor: Jeffrey Kromrey, Ph.D.
Robert Dedrick, Ph.D.
John Ferron, Ph.D.
Howard Johnston, Ph.D.
Date of Approval: December 1, 2003
Keywords: Meta-Analytic Q Tests, Homogeneity of Effects, Fixed-Effects Tests, Random-Effects Tests, Tau Squared
Copyright 2003, Lisa Therese Aaron
Dedication
This project is dedicated first and foremost to the One responsible for all hope, love, truth and persistence. When the interpersonal, physical, intellectual and logistical obstacles seemed most demoralizing, it was God alone who carried this work to completion and imbued it with a meaningful and redemptive purpose. Secondly, this effort is dedicated with love to Bruce. Thank you for freeing me to focus on the spaces in between. To Kellianne, my special facilitator, without whose help this work would not have been possible, may this effort be encouragement for the development of your potential and the realization of your goals, however out-of-reach they may seem. Lastly, I dedicate this work to the students in the Title I program who are responsible for inspiring my interest in an otherwise abstract subject. The application of the findings of this project to the evaluation and improvement of their educational experience is what will determine the ultimate value of this assignment.
Acknowledgments
Many friends and acquaintances have offered words of encouragement along the way. And I am grateful to them. But a few provided ongoing support and their active input for the completion of this work.
First, I want to thank my parents, who always believed in the value of education. Their direction instilled an early and persistent interest in the pursuit of higher education. My thanks to my Major Professor, Dr. Jeff Kromrey, and the other members of my committee for contributing many helpful suggestions in the conception of this project and its completion. Special thanks go to Dr. Howard Johnston for your friendship at a pivotal juncture in this process and for remaining on my committee despite the many disruptions of this endeavor. Thank you to Dr. Jack Vevea for lending crucial technical assistance central to the purpose of this project. Heartfelt thanks go to Dr. Tina Bacon, who by way of taking a personal interest in a relative stranger's challenges helped me rein in my anxieties about the process challenges confronting me. My deepest gratitude goes to my brother, Larry, who never questioned the value of the protracted effort, nor my ability to accomplish the goal. Thank you for not only encouraging me and taking an active interest, but also pushing me towards the finish line.
Table of Contents
List of Tables
List of Figures
Abstract
Chapter One Introduction
    Background
    Statement of the Problem
    Purpose of the Study
    Research Questions
    Limitations
    Definitions of Terms
Chapter Two Review of Literature
    Historical and Philosophical Evolution of Meta-analysis
    Purpose of Homogeneity Tests
    Introduction of Tests
    Calculation Strategies Used to Augment Precision
    Distinguishing Random- and Fixed-effects Models
    Model Selection
    Implications for Test Selection
    Suggested Research
    Summary
Chapter Three Method
    Purpose
    Design
    Sample
    Test Statistics Examined
    Data Analysis
Chapter Four Results
    Type I Error
    Power
    Discussion of Conditions with Both Adequate Type I Error Control and Power
Chapter Five Interpretations and Conclusions
    Summary
    Discussion
    Limitations
    Topics for Additional Research
References
Appendices
    Appendix A: SAS Program for Simulating True Null Hypotheses
    Appendix B: SAS Program for Simulating False Null Hypotheses
About the Author
List of Tables
Table 1 Relevant Factors Examined by Other Studies
Table 2 Study Design
Table 3 Proportion of Simulations Controlling Type I Error (τ² = 0, δ = 0)
Table 4 Proportion of Simulations Controlling Type I Error (τ² = 0, δ = .8)
Table 5 Proportion of Simulations Controlling Type I Error (τ² = .33, δ = 0)
Table 6 Proportion of Simulations Controlling Type I Error (τ² = .33, δ = .8)
Table 7 Proportion of Simulations Controlling Type I Error (τ² = 1, δ = 0)
Table 8 Proportion of Simulations Controlling Type I Error (τ² = 1, δ = .8)
Table 9 All Average Type I Error Rates (τ² = 0, δ = 0)
Table 10 All Average Type I Error Rates (τ² = 0, δ = .8)
Table 11 All Average Type I Error Rates (τ² = .33, δ = 0)
Table 12 All Average Type I Error Rates (τ² = .33, δ = .8)
Table 13 All Average Type I Error Rates (τ² = 1, δ = 0)
Table 14 All Average Type I Error Rates (τ² = 1, δ = .8)
Table 15 Type I Error Rate Estimates (τ² = 0, δ = 0), at .05 for K = 10
Table 16 Type I Error Rate Estimates (τ² = 0, δ = .8), at .05 for K = 10
Table 17 Type I Error Rate Estimates (τ² = 0, δ = 0), at .05 for K = 30
Table 18 Type I Error Rate Estimates (τ² = 0, δ = .8), at .05 for K = 30
Table 19 Type I Error Rate Estimates (τ² = .33, δ = 0), at .05 for K = 10
Table 20 Type I Error Rate Estimates (τ² = .33, δ = .8), at .05 for K = 10
Table 21 Type I Error Rate Estimates (τ² = .33, δ = .8), at .10 for K = 10
Table 22 Type I Error Rate Estimates (τ² = .33, δ = 0), at .05 for K = 30
Table 23 Type I Error Rate Estimates (τ² = .33, δ = .8), at .05 for K = 30
Table 24 Type I Error Rate Estimates (τ² = .33, δ = .8), at .10 for K = 30
Table 25 Type I Error Rate Estimates (τ² = 1, δ = 0), at .05 for K = 10
Table 26 Type I Error Rate Estimates (τ² = 1, δ = .8), at .05 for K = 10
Table 27 Type I Error Rate Estimates (τ² = 1, δ = 0), at .05 for K = 30
Table 28 Type I Error Rate Estimates (τ² = 1, δ = .8), at .05 for K = 30
Table 29 Type I Error Rate Estimates (τ² = 1, δ = .8), at .10 for K = 30
Table 30 Power Estimates Indicating Adequate Type I Error (τ² = 0, δ = .8), at .05 for K = 10
Table 31 Power Estimates Indicating Adequate Type I Error (τ² = 0, δ = .8), at .05 for K = 30
Table 32 Power Estimates Indicating Adequate Type I Error (τ² = .33, δ = .8), at .05 for K = 10
Table 33 Power Estimates Indicating Adequate Type I Error (τ² = .33, δ = .8), at .10 for K = 10
Table 34 Power Estimates Indicating Adequate Type I Error (τ² = .33, δ = .8), at .05 for K = 30
Table 35 Power Estimates Indicating Adequate Type I Error (τ² = .33, δ = .8), at .10 for K = 30
Table 36 Power Estimates Indicating Adequate Type I Error (τ² = 1, δ = .8), at .05 for K = 10
Table 37 Power Estimates Indicating Adequate Type I Error (τ² = 1, δ = .8), at .05 for K = 30
Table 38 Power Estimates Indicating Adequate Type I Error (τ² = 1, δ = .8), at .10 for K = 30
Table 39 Power Estimates Indicating Robustness & Power (τ² = 0, δ = .8), at .05 for K = 10
Table 40 Power Estimates Indicating Robustness & Power (τ² = 0, δ = .8), at .05 for K = 30
Table 41 Power Estimates Indicating Robustness & Power (τ² = .33, δ = .8), at .05 for K = 30
Table 42 Power Estimates Indicating Robustness & Power (τ² = .33, δ = .8), at .10 for K = 30
Table 43 Effectiveness of 5 Meta-analytic Tests of Homogeneity for True Null Conditions
List of Figures
Figure 1. Box and Whisker Plot (K = 10)
Figure 2. Box and Whisker Plot (K = 10)
Figure 3. Box and Whisker Plot (K = 30)
Figure 4. Box and Whisker Plot (K = 30)
Figure 5. Box and Whisker Plot (τ² = 0)
Figure 6. Box and Whisker Plot (τ² = 0)
Figure 7. Box and Whisker Plot (τ² = .33)
Figure 8. Box and Whisker Plot (τ² = .33)
Figure 9. Box and Whisker Plot (τ² = 1)
Figure 10. Box and Whisker Plot (τ² = 1)
Figure 11. Box and Whisker Plot (primary study sample size = 10)
Figure 12. Box and Whisker Plot (primary study sample size = 10)
Figure 13. Box and Whisker Plot (primary study sample size = 40)
Figure 14. Box and Whisker Plot (primary study sample size = 40)
Figure 15. Box and Whisker Plot (primary study sample size = 200)
Figure 16. Box and Whisker Plot (primary study sample size = 200)
Figure 17. Box and Whisker Plot (population variance = 1/1)
Figure 18. Box and Whisker Plot (population variance = 1/1)
Figure 19. Box and Whisker Plot (population variance = 2/1)
Figure 20. Box and Whisker Plot (population variance = 2/1)
Figure 21. Box and Whisker Plot (population variance = 4/1)
Figure 22. Box and Whisker Plot (population variance = 4/1)
Figure 23. Box and Whisker Plot (skewness/kurtosis = 0/0)
Figure 24. Box and Whisker Plot (skewness/kurtosis = 0/0)
Figure 25. Box and Whisker Plot (skewness/kurtosis = 1/3)
Figure 26. Box and Whisker Plot (skewness/kurtosis = 1/3)
Figure 27. Box and Whisker Plot (skewness/kurtosis = 2/6)
Figure 28. Box and Whisker Plot (skewness/kurtosis = 2/6)
A Comparative Simulation of Type I Error and Power of Four Tests of Homogeneity of Effects for Random- and Fixed-Effects Models of Meta-analysis
Lisa Therese Aaron
ABSTRACT
In a Monte Carlo analysis of meta-analytic data, Type I and Type II error rates were compared for four homogeneity tests. The study controlled for violations of normality and homogeneity of variance. This study was modeled after Harwell (1997) and Kromrey and Hogarty's (1998) experimental design. Specifically, it entailed a 2 x 3 x 3 x 3 x 3 x 3 x 2 factorial design. The study also controlled for between-studies variance, as suggested by Hedges and Vevea's (1998) study. As with similar studies, this randomized factorial design was comprised of 5000 iterations for each of the following 7 independent variables: (1) number of studies within the meta-analysis (10 and 30); (2) primary study sample size (10, 40, 200); (3) score distribution skewness and kurtosis (0/0; 1/3; 2/6); (4) equal or random (around typical sample sizes, 1:1; 4:6; and 6:4) within-group sample sizes; (5) equal or unequal group variances (1:1; 2:1; and 4:1); (6) between-studies variance, τ² (0, .33, and 1); and (7) between-class effect size differences, δ_k (0 and .8). The study incorporated 1,458 experimental conditions. Simulated data from each sample were analyzed using each of four significance test statistics including: a) the fixed-effects Q test of homogeneity; b) the random-effects modification of the Q test; c) the conditionally-random procedure; and d) permuted Q-between. The results of this dissertation will inform researchers regarding the relative effectiveness of these statistical approaches, based on Type I and Type II error rates. This dissertation extends previous investigations of the Q test of homogeneity.
Specifically, permuted Q provided the greatest frequency of effectiveness across extreme conditions of increasing heterogeneity of effects, unequal group variances and nonnormality. Small numbers of studies and increasing heterogeneity of effects presented the greatest challenges to power for all of the tests under investigation.
Chapter One
Introduction
Background
The purpose of meta-analysis is to discover if some treatment effect or some nonexperimental factor consistently exerts influence over a broad, but similar, set of contexts or studies. Researchers hope to expose truth about a given population and whether an influence bears sufficient strength to produce an expected outcome, regardless of other competing forces. By examining the relationship across multiple contexts, samples and measures, one can determine the possible presence of a stable, more generalizable influence. According to Tukey (1969), the search for constant relationships between treatments and outcomes requires seeking for irremovable complexities, rather than triv[i]al ones and choosing the numerical expression of our variables to make things as simple as possible, while grasping greedily at the unsimplicities that remain (p. 86). Such an approach to meta-analysis will determine whether there is a single true effect across studies or if differences between studies result from other moderating influences.
Meta-analysis is a secondary analysis. Summary statistics from each of the primary studies included in the secondary sample of studies comprise the data. Prior to conducting a review of the literature, the meta-analyst establishes a set of criteria for directing the gathering of a collection of similar studies. Important factors of the studies (e.g., research design factors) to be collected are coded within each study.
The coding used for marking the relevant aspects of each study also serves as the basis for building a model of the extent to which the sample's aggregated features reflect the characteristics of the population of interest. Effect sizes are then computed for each study using one of several effect indices. In turn, sampling errors are estimated. Lastly, the corresponding Q test is calculated to provide information about the homogeneity/heterogeneity of effects across studies. This result determines the decision to pool effects.
When conducting a meta-analysis, the crucial decision to pool effect sizes is confounded by an array of inconsistencies across studies, including differing measures, varying study designs, disparate statistical tools and sampling error. Since Glass (1976) coined the term meta-analysis, the primary interest has remained the identification of the amount of similarity among studies' effects. Erez, Bloom and Wells (1996) further explain that "The underlying assumption of meta-analysis is that combining information from independent, but similar, studies improves estimates of population parameters over those obtained from any single study" (p. 277). But the comprehensiveness of a sample of studies is becoming increasingly difficult to define, as more studies are being introduced through media other than academic journals.
The major issue when addressing the variability across effect sizes pertains to whether such variation occurs as a result of random variation in the true effect, moderators or sampling error (Bangert-Drowns, 1986). The test of homogeneity (also referred to as the Q test) is a tool used to determine the extent of this variability across studies by simultaneously evaluating the degree of within-study variability for each study within the collection.
Bangert-Drowns further explains the test of homogeneity is an extension of Glass's concern with variability among studies, in that it also attends to the variance associated with each effect size as a summary statistic (p. 394). Once the presence of homogeneity/heterogeneity is determined, the focus of the meta-analysis returns to summarization of the effect(s) through the process of approximate data pooling. Approximate data pooling is the statistical practice of combining effect sizes either to compute a common effect (in the case of homogeneity of effects) or an average effect (in the case of heterogeneity of effects). Bangert-Drowns (1986) cites Hedges (1982) and Rosenthal and Rubin (1982) as those who first explicated approximate data pooling. If the Q test results in a determination of multiple effects, the focus shifts to a description of moderators contributing to the varied effects. In such a case, linear regression is often used to model the various effects. Hedges (1982) credits Glass (1978) with the earliest application of regression in meta-analysis as he coded study characteristics as a vector of predictor variables, thereby regressing effect size estimates on these predictors to determine the relationship between the two.
Hedges (1982) devised the Q test of homogeneity of effect sizes, realizing conventional statistics were not applicable to meta-analysis. Unlike ANOVA, the Q statistic can withstand a violation of the assumption of homogeneity of variance without diminished sensitivity to treatment effect variances (Chang, 1993). Conventional statistics applied to meta-analysis are particularly subject to violations of the assumptions of normality and homogeneity of variance, due to the multitude of measures and statistical techniques employed in the primary studies (Seltzer, 1991; Chang, 1993), resulting in inflated Type I and Type II error rates.
Additionally, primary studies contribute their own sensitivity due to violations that are present, but not reported (Keselman et al., 1998). Problems inherent in violations of normality in primary studies are further compounded in meta-analysis.
Many assert that Hedges' (1982) Q is an effective and parsimonious tool for modeling the variability among standardized mean differences across studies (Harwell, 1997). The purpose of the statistic is to detect statistically significant differences, if any, between effect sizes across multiple studies. It tests the assertion that $H_0: \delta_1 = \delta_2 = \delta_3 = \cdots = \delta_k$. Hedges (1982) explains:
The test of homogeneity of effect size (Hedges, 1982a) provides a method of empirically testing whether the variation in effect size estimates is greater than would be expected by chance alone. If the null hypothesis of homogeneity is not rejected, the reviewer is in a strong position vis-a-vis the argument that studies exhibit real variability, which is observed by coarse grouping (p. 246-7).
The true variability refers to the variance of the treatment across studies.
Hedges and Olkin (1985) supply several algorithms for the Q statistic. But the basic, large-sample derivation is
$Q = \sum_{i=1}^{k} (d_i - d_+)^2 / \hat{\sigma}^2(d_i)$
where $d_+$ is the weighted estimator of effect size, $d_i$ is the population effect size estimate from the ith study and $\hat{\sigma}^2(d_i)$ is the estimated variance. Hedges and Olkin explain that "The test statistic Q is the sum of squares of the $d_i$ about the weighted mean $d_+$, where the ith square is weighted by the reciprocal of the estimated variance of $d_i$" (p. 123). There is no differentiation of between-studies variance and sampling error across studies. This factor will become increasingly relevant to the present discussion. Valid use of this analytic tool relies on probable a priori inferences about the relative characteristics of the sample to the population, as expressed by the model.
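The large-sample Q statistic described above can be sketched in a few lines of Python. This is a minimal illustration rather than the dissertation's SAS code: the function name `q_statistic` and the example values are hypothetical, and the weighted mean is taken to be the usual inverse-variance weighted average, consistent with the Hedges and Olkin description of weighting each square by the reciprocal of the estimated variance.

```python
from typing import Sequence, Tuple


def q_statistic(d: Sequence[float], v: Sequence[float]) -> Tuple[float, float]:
    """Return (Q, d_plus) for study effect sizes d and their estimated variances v.

    Illustrative helper (not from the source): d_plus is the inverse-variance
    weighted mean and Q is the weighted sum of squared deviations about it.
    """
    if len(d) != len(v) or len(d) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / vi for vi in v]  # reciprocal of each estimated variance
    d_plus = sum(w * di for w, di in zip(weights, d)) / sum(weights)
    q = sum(w * (di - d_plus) ** 2 for w, di in zip(weights, d))
    return q, d_plus


# Hypothetical example: three studies with equal sampling variances.
q, d_plus = q_statistic([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
```

Under the null hypothesis of homogeneous effects, Q is referred to a chi-square distribution with k - 1 degrees of freedom, so a value of Q larger than the chosen critical value leads to rejection of homogeneity.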
Though methodologists advocate the use of random-effects tests once heterogeneity is found present, this procedure is not generally being employed (as evidenced by a survey to be described later). The determination of homogeneity resulting from the use of the Q test involves both theoretic and statistical implications. As with any statistic, Hedges' fixed-effects Q provides a less valid analysis under certain conditions and other considerations still require investigation. For example, Harwell (1997) concludes that skewed distributions combined with unequal variances result in inflated Type I error rates for fixed-effects Q. Kromrey and Hogarty (1998) further corroborated these findings and cautioned against the use of Q under these conditions. Another consideration involves the influence of unequal sample sizes on Q test control of Type I and Type II error. Studies within a meta-analysis rarely have equal sample sizes, the primary factor permitting unbiased estimates of effects. Lastly, there is concern for the limited attention given to the influence of violations of normality (Wolf, 1990; Chang, 1993; Harwell, 1997; and Kromrey & Hogarty, 1998) and homogeneity of variance (Harwell, 1997; and Kromrey & Hogarty, 1998) on the control of meta-analytic Type I and Type II errors.
Two models, the fixed- and random-effects, differ in their characterization of the error involved in defining the relationship of the sample to the population. The fixed-effects model is appropriately applied to a collection of studies representing a single population (that is, when the entire population is available for analysis). When applied to a condition where the sample reflects a diverse array of effects, thereby representing a number of differing populations, the model underestimates the variance. The fixed-effects model depicts error only in terms of within-studies variance, whereas the random-effects model separates this error in terms of both within- and between-studies variance.
The between- from within-study partitioning is expressed algorithmically through an added component of uncertainty, τ². In other words, the random-effects model expresses the effects as a random distribution, not limited to the sample of studies included in the meta-analysis, but including all other study effects not captured by that particular sample.
The homogeneity test of effect size variance was developed to investigate whether significant differences are present in effects across studies. However, as several researchers (Erez, Bloom & Wells, 1996 and Abelson, 1997) have noted, the fixed-effects version of Q fails to distinguish between-study variance from sampling error. It is possible that the undifferentiated variance may contribute to the Q statistic's sensitivity to conditions of extreme parameter effects and small sample sizes. Chang (1993) found a pattern of significant discrepancies in which the fixed-effects Q test produced greater simulated power values than theoretical power values when either small sample sizes (and large k) or extreme parameter effects were present. In contrast, the random-effects model did not evidence significant discrepancies between simulated and theoretical power values. However, Chang notes that population effects were normally distributed for the random-effects model and not similarly controlled for the fixed-effects model.
The theoretical simplicity and computational ease of the fixed-effects model, as well as traditional practice of researcher inferences guiding model selection, has kept the majority of meta-analysts firmly entrenched in the consistent selection of fixed-effects. A survey of the Review of Educational Research for the past five years confirms the pervasiveness of this strategy (12 of 15 studies employed fixed-effects). The National Research Council (1992) recognized this practice, recommending increased use of the random-effects model to offset the excessive reliance on the fixed-effects model.
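The partitioning of error into within- and between-studies components described above can be made concrete with a short Python sketch. The text does not say which estimator of the between-studies variance is used, so the method-of-moments (DerSimonian and Laird) estimator below is an assumption chosen for illustration, and the function names and example values are hypothetical.

```python
from typing import Sequence


def moment_tau2(d: Sequence[float], v: Sequence[float]) -> float:
    """Method-of-moments estimate of between-studies variance (assumed estimator)."""
    w = [1.0 / vi for vi in v]
    sum_w = sum(w)
    d_plus = sum(wi * di for wi, di in zip(w, d)) / sum_w
    q = sum(wi * (di - d_plus) ** 2 for wi, di in zip(w, d))
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    # Q exceeds its expectation (k - 1) only when effects vary beyond sampling error;
    # negative estimates are truncated at zero.
    return max(0.0, (q - (len(d) - 1)) / c)


def pooled_standard_error(v: Sequence[float], tau2: float = 0.0) -> float:
    """Standard error of the pooled effect; tau2 = 0 gives the fixed-effects case."""
    return (1.0 / sum(1.0 / (vi + tau2) for vi in v)) ** 0.5


# Hypothetical example: widely spread effects with equal sampling variances.
effects = [0.1, 0.5, 0.9]
variances = [0.04, 0.04, 0.04]
tau2 = moment_tau2(effects, variances)
se_fixed = pooled_standard_error(variances)
se_random = pooled_standard_error(variances, tau2)
```

In this sketch the fixed-effects weights use only the within-study variance, while the random-effects weights add the estimated between-studies variance to each study's variance, so the pooled standard error can only grow when the estimate is positive.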
For many meta-analysts, judgments about homogeneity of effects rely, in large part, on subjective interpretations about the sample's representation of the population in question. As homogeneity rarely characterizes treatment effects in education (Erez, Bloom & Wells, 1996; Abelson, 1997; Harwell, 1997; and Mulaik, Raju & Harshman, 1997), exclusive application of this model is inappropriate. In addition, recent evidence suggests the fixed-effects Q test is not appropriate when significant heterogeneity of effects (Chang, 1993) and nonnormality are both present (Harwell, 1997; and Kromrey & Hogarty, 1998). The proliferation of studies, advances in the field of statistics and more sophisticated statistical software have all contributed to the interest in applying tests of homogeneity possessing greater robustness properties, as has the growing body of evidence indicating that use of incongruent models impedes progress.
Whether applying the Q statistic alone (associated with the fixed-effects model) or in combination with the between-studies variance (associated with the random-effects model), both statistics evidence higher degrees of sensitivity under certain conditions. For instance, Harwell (1997) found the fixed-effects Q test yielded increased Type I error rates when small sample sizes were paired with a larger number of studies. In contrast, because the random-effects Q incorporates an additional variance component, it tends to be a more conservative estimate. The random-effects standard error is typically larger (i.e., less precise) than that of fixed-effects models due to the added between-studies variance component, resulting in larger sampling error.
Three tests of homogeneity (the traditional fixed-effects Q, random-effects Q and fixed-effects permuted Q-between) and the conditionally-random procedure illustrate the manner in which homogeneity tests numerically elaborate the inferences expressed by each particular model.
As their names imply, the traditional fixed-effects Q and permuted Q-between are used to test the inferences expressed by the fixed-effects model. The random-effects Q tests the inferences extended by the random-effects model. The conditionally-random procedure applies the fixed-effects Q test as the decision-point to first determine the presence of homogeneity. The presence or absence of homogeneity then determines the model to be selected (Hedges & Vevea, 1998). As each model has its corresponding test(s), statistics are used which correspond to the selected model.
Prevalent and indiscriminate model selection results in repeated oversights regarding heterogeneity of effects, thereby obfuscating accurate interpretation of treatment effects in education. The typical application of fixed-effects has inadvertently sanctioned the practice of fitting data to the model. In so doing, the meta-analyst presents a faulty interpretation of an estimate of the mean population effect generated through an invalid analysis. The combined influences of sample size for any given study and the heterogeneity among the true effects determine the precision of each study's estimate of effect (Raudenbush, 1994). Therefore, only a probable estimate of model homogeneity/heterogeneity can render a probable parameter estimate of the mean treatment effect when Q is rejected (Friedman, 2000). Chang's (1993) study substantiates this conclusion as she found significant discrepancies between theoretical and simulated power estimates when the selected model did not correspond with actual homogeneity/heterogeneity conditions. Whether applying a fixed-effects test to a condition of heterogeneous effects or a random-effects test to a condition of homogeneous effects, power discrepancies result.
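The decision logic of the conditionally-random procedure described above can be sketched as follows. This is a hedged illustration only: the function name, the returned labels and the example data are hypothetical, the chi-square critical value is passed in by the caller rather than computed, and the method-of-moments estimate of the between-studies variance is an assumption, since the text does not specify an estimator.

```python
from typing import Sequence, Tuple


def conditionally_random_estimate(
    d: Sequence[float], v: Sequence[float], critical_value: float
) -> Tuple[str, float, float]:
    """Return (model, pooled_effect, tau2) using fixed-effects Q as the decision point.

    critical_value is the chi-square cutoff with k - 1 degrees of freedom,
    supplied by the caller (for example 5.991 for k = 3 at the .05 level).
    """
    k = len(d)
    w = [1.0 / vi for vi in v]
    sum_w = sum(w)
    d_plus = sum(wi * di for wi, di in zip(w, d)) / sum_w
    q = sum(wi * (di - d_plus) ** 2 for wi, di in zip(w, d))

    if q <= critical_value:
        # Homogeneity not rejected: retain the fixed-effects model.
        return "fixed", d_plus, 0.0

    # Homogeneity rejected: switch to random-effects weights.
    # Assumed estimator: method of moments, truncated at zero.
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_star = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(wi * di for wi, di in zip(w_star, d)) / sum(w_star)
    return "random", pooled, tau2


# Hypothetical example: heterogeneous effects trigger the random-effects branch.
model, pooled, tau2 = conditionally_random_estimate(
    [0.1, 0.5, 0.9], [0.04, 0.04, 0.04], critical_value=5.991
)
```

Because the model is chosen by the outcome of the Q test, any Type I or Type II error in that test carries directly into the choice between fixed- and random-effects pooling, which is why the error rates of the procedure are of interest.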
As Chang points out, Type II error is of special concern in the determination of homogeneity of effects as it conveys the false conclusion that a treatment produces some uniform outcome when it actually generates any number of outcomes depending on the sample.
Therefore, to promote accurate interpretation of homogeneity of effects, the primary question under consideration is how four tests of homogeneity of effects perform under varying degrees of skewness and kurtosis, random within-study sample sizes, differing within-studies variance and heterogeneous effects. Such an investigation should inform meta-analysts' future model selection. There have been two basic decisional approaches used to select the meta-analytic models and corresponding statistics: 1) decide a priori (as indicated by Hedges and Vevea, 1998) the sort of generalization to be made about the samples or 2) conduct a test of homogeneity of effects (if heterogeneity exists, a random-effects model is adopted; if homogeneity is evidenced, a fixed-effects model is upheld). Hedges and Vevea (1998) assert that one must be primarily concerned with the inferences to be made about the population of studies included in a meta-analysis. But others such as Chang (1993) demonstrate the disparity in power between fixed- and random-effects contingent on homogeneity of effect sizes. Without better-reasoned, more consistent and deliberate use of models, data are analyzed and interpreted unevenly and less cogently throughout the field, delaying theoretical advances.
Statement of the Problem
Growing concern about the appropriateness of the fixed-effects model for meta-analysis (National Research Council, 1992; Chang, 1993; Erez, Bloom & Wells, 1996; Abelson, 1996; Harwell, 1997; Kromrey & Hogarty, 1998; and Friedman, 2000) signals the need for more highly defined criteria to select homogeneity tests useful for complex effects characterizing educational data.
Since the model directs the selection of statistics, the analysis is only as valid as the extent to which the model expresses the relationship between the population, treatment and other influences on performance outcomes. Moreover, educational and psychological data most often possess skewed distributions (Lix & Keselman, 1998). The danger of using statistics sensitive to nonnormality and heterogeneity is that they become liberal, inflating Type I error, particularly with an unbalanced design (Lix & Keselman). Once distributional assumptions of a statisti cal procedure are violated, as is ofte n the case, it is useful to know the subsequent behavior of such statistics. In this way, applied researchers can better assess the extent to which such analyses generate valid results (Keselman, Huberty, Lix, Olejnik, Cribbie, Donahue, Kowalchuk, Lowman, Petoskey, Keselman & Levin, 1998). The gr eater possibility for heterogeneity of effects in educational studies and the tendency for metaanalysts to favor use of the fixedef fects model suggest that metaanalysts are applying fixedeffects to collections of studies actually containing heterogeneous effects. Furthermore, previously simulated metaanalyses present conclusions about homogeneity tests based on oversimplified conditions, not realistic to e ducation. Some important investigations have used equal sample sizes and variances, as well as normally distributed primary samples. Langenfeld and Coombs PAGE 18 8 (1998) express concern for the effect of these data conditions on the resulting magnitude of effects: our understanding of the influence of varying sample sizes, degree of variance heterogeneity, and type of distribution on ME [magnitude of effect] statistics is not well understood (p. 15). Hunter and Schmidt (1994) echo concerns about nonnormal distributions influence on the Q statistic, stating The development of methods for sensitivity analysis for random effects mo dels is an important, open research area (p. 
391). Equal primary study variances and sample sizes have been shown to permit unbiased estimators for primary study effects (Hedges & Olkin, 1985). The presence of unequal study sample sizes impacts the accuracy of the primary study effect sizes. The paucity of realistically simulated conditions in such research challenges the meta-analyst to apply models without the full benefit of a well-substantiated rationale. Only realistic simulations based on conditions commonly found in educational settings can support valid conclusions about the efficiency of estimators for each model's homogeneity tests. Therefore, an important aspect of promoting appropriate model selection involves the comparative investigation of the robustness properties of each model's tests to well-controlled and diverse data conditions. Prior studies comparing the performance of fixed- and random-effects Q tests have not simultaneously controlled for data conditions specific to education. To this point, unequal within-study variances (Hedges & Vevea, 1998), varied skewness and kurtosis (Chang, 1993) and random primary study sample sizes (Kromrey & Hogarty, 1998) have not been addressed simultaneously in a comparative analysis of random- and fixed-effects models. Furthermore, only Hedges and Vevea (1998) have controlled for varying degrees of between-studies variance. As Harwell (1997) points out, nonnormal score distributions, unequal group variances and unequal study sample sizes impact the t statistic in primary studies. As these are common conditions in educational settings, their effects on meta-analytic tests are important questions.

Purpose of the Study

The purpose of the study is to investigate data conditions typical in education settings to begin to establish a set of criteria facilitating deliberate model selection for optimal model fit.
The responsiveness of four tests of homogeneity of effects is compared under conditions of varying degrees of heterogeneity of variance, primary study sample sizes, number of primary studies and dual violations of normality and homogeneity of effects, as evidenced by statistical power and control of Type I error. Harwell (1997) recommended utilizing random-effects regression in addition to the Q statistic that has been typically generated with fixed-effects regression. Raudenbush (1994) and Bollen (1989) further advised applying a weighted least squares regression when sample sizes across studies are unbalanced. This second recommendation is supported by Hedges (1982), who found it provided reasonable accuracy for model fit specification when sample sizes were as small as 10. However, scant information is available to indicate whether such a procedure provides adequate robustness for meta-analytic tests.

Research Questions

The study provided meta-analysts with more specific guidelines for the use of each of four tests of homogeneity of effects by addressing the following questions:

1) To what extent is the Type I error rate of the fixed-effects Q, permuted Qbet, random-effects Q and conditionally-random procedure maintained near the nominal alpha level across variations in the number of primary studies included in the meta-analysis, sample sizes in primary studies, heterogeneity of variance, varying degrees of heterogeneity of effects ($\tau^2$) and primary study skewness and kurtosis?

2) What is the relative statistical power of the fixed-effects Q, permuted Qbet, random-effects Q and conditionally-random procedure given variations in the number of primary studies included in the meta-analysis, sample sizes in primary studies, heterogeneity of variance, varying degrees of heterogeneity of effects ($\tau^2$) and primary study skewness and kurtosis?
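Both research questions come down to estimating rejection rates by simulation: the proportion of replications in which a test rejects homogeneity when effects are truly homogeneous (Type I error) and when they are not (power). The sketch below illustrates the idea for the fixed-effects Q test only. It rests on simplifying assumptions that are not the design of this study: normal scores, equal group sizes and variances, Hedges' small-sample correction with the large-sample variance formula for d, and a tabled chi-square critical value supplied by the caller. All function and parameter names are hypothetical.

```python
import math
import random


def standardized_mean_difference(treat, ctrl):
    """Return (d, v): bias-corrected standardized mean difference and its
    large-sample variance (assumed formulas, after Hedges & Olkin)."""
    n1, n2 = len(treat), len(ctrl)
    m1, m2 = sum(treat) / n1, sum(ctrl) / n2
    ss1 = sum((x - m1) ** 2 for x in treat)
    ss2 = sum((x - m2) ** 2 for x in ctrl)
    pooled_sd = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    g = (m1 - m2) / pooled_sd
    d = (1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)) * g  # small-sample correction
    v = (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))
    return d, v


def q_statistic(effects, variances):
    """Fixed-effects Q: weighted sum of squared deviations from the
    inverse-variance weighted mean effect."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return sum(w * (d - mean) ** 2 for w, d in zip(weights, effects))


def rejection_rate(k, n, delta, tau2, crit, reps, seed=0):
    """Proportion of simulated meta-analyses in which Q exceeds crit.

    k: studies per meta-analysis; n: subjects per group;
    delta: mean true effect; tau2: between-studies variance;
    crit: chi-square critical value for k - 1 degrees of freedom.
    """
    rng = random.Random(seed)
    rejections = 0
    for _rep in range(reps):
        effects, variances = [], []
        for _study in range(k):
            true_effect = rng.gauss(delta, math.sqrt(tau2))
            treat = [rng.gauss(true_effect, 1.0) for _ in range(n)]
            ctrl = [rng.gauss(0.0, 1.0) for _ in range(n)]
            d, v = standardized_mean_difference(treat, ctrl)
            effects.append(d)
            variances.append(v)
        if q_statistic(effects, variances) > crit:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    CRIT_DF9 = 16.919  # tabled chi-square value, alpha = .05, df = 9
    print("Type I error:", rejection_rate(10, 20, 0.0, 0.0, CRIT_DF9, 500, seed=1))
    print("Power:", rejection_rate(10, 20, 0.0, 1.0, CRIT_DF9, 500, seed=1))
```

With `tau2=0` the returned proportion estimates the Type I error rate. With `tau2>0` it estimates power. Nonnormal scores or unequal variances would be introduced by changing how `treat` and `ctrl` are generated.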
Limitations

Computer programs simulated the data analyzed within this Monte Carlo study, in which distribution shapes, study sample sizes, extent of variance across studies and random variation in true effects were controlled. Moreover, the models defining the statistical tools to be used were methodically alternated. Other potentially influential conditions were not examined. Only a handful of sample sizes and distribution shapes were under investigation in the present study. Related to this issue, there were innumerable potential factors influencing the effect size in any given study (Fern & Monroe, 1996). For this reason, one can never be absolutely certain the variance is attributable to the independent variables under investigation. The simulated data resembled data typically collected in reading programs found in public school systems, including features of skewness and kurtosis, sample sizes and the numbers of studies sampled. Therefore, these findings should not be considered widely generalizable to other contexts without further investigation and juxtaposition to other contexts. Differing conditions change the robustness of the statistical properties. As Overton (1998) suggests, "... one of the most important considerations in selecting a meta-analysis model is the contextual conditions in which the effect of interest (e.g., a selection test's validity) is to be generalized in theory or in application" (p. 376). In other words, a statistic is only meaningful given the assumptions applicable to the situation in which it is being used. No single statistic can be expected to be robust to all conditions. Bradley's conservative criterion of acceptable Type I error was applied for purposes of evaluating the performance of each of the tests under the conditions being investigated. However, a number of alternative criteria could have been applied. This particular criterion was utilized based on typical best practices of researchers.
Finally, because this study was based on a Monte Carlo simulation, the results were derived from approximations of the data, not real-world observations.

Definition of Terms

Several terms employed in the review of literature and description of the study design are defined below:

Asymptotic Robustness Theory: (Bentler, 1994; Wang, Fan & Willson, 1996) refers to the robustness of certain parameter estimates which maintain independence from the assumptions of normality, homogeneity of variance and homoscedasticity, due to the nature of their computation, as sample sizes increase.

Central Limit Theorem: as n increases, the influence of any outliers in the population diminishes to the extent that the distribution of scores assumes a normal distribution. Put differently, the effects of nonnormality in the population diminish as the sample size (n) increases (Glass & Hopkins, 1984).

Estimate of the true effect: referred to as $T_i$; it consists of both the true effect and the error in estimating the true effect.

Fixed Effects Model: "A statistical model which stipulates that the units of one or more of the factors under analysis (studies, in a meta-analysis) are the ones of interest, and thus constitute the entire population of units. Only within-study sampling error is taken to influence the uncertainty of the results of the meta-analysis" (Cooper & Hedges, p. 535).

Homoscedasticity: the error variance is constant for all observations (Chatterjee & Price, p. 38).

Meta-analysis: "the statistical analysis of the findings of many individual analyses" (Glass, 1977, p. 352).

Moderator Variable: a variable altering the effect of another variable on some outcome.

Monte Carlo Study: a procedure in which an empirical sampling distribution is drawn from a computer-simulated treatment population manifesting specific model violations, then compared against a theoretical (critical) distribution for a given significance level (Kennedy & Bush, 1985).
Random Effects Model: "A statistical model in which both within-study sampling error and between-studies variation are included in the assessment of the uncertainty of the results of the meta-analysis" (Cooper & Hedges, 1994, p. 539).

Robust Methods: test statistics which generate probabilities with minimal or no discrepancies between the nominal and actual levels of significance when applied to a set of data comprised of characteristics deviating from those parameters recommended for their use (Kennedy & Bush, 1985).

Sensitivity analysis: the process of developing methods which minimize the discrepancy between p values and the nominal p values under a variety of distributions while maintaining high efficiency and stringency over an array of circumstances. According to Seltzer (1991), it involves applying a random-effects model to determine if much change occurs in the second-stage model before drawing conclusions about the relationship between two variables. It may change judgments about the magnitude of the relationships (p. 174).

Standard error (of the statistic): estimates the standard deviation of a parameter estimate. It is the standard deviation of the sampling distribution (Winer, 1971).

Total variance of the estimate of true effect ($v_i^*$): consists of both $\tau^2$ and the within-studies or estimation variance ($v_i$).

Type I error: rejection of a true null hypothesis, resulting in the faulty conclusion of statistically significant treatment effects.

Type II error: failure to reject a false null hypothesis, resulting in the failure to identify true treatment effects.

Unbiasedness: Winer (1971) states: "One criterion for the goodness of a statistic as an estimate of a parameter is lack of bias. A statistic is an unbiased estimate of a parameter if the expected value of the sampling distribution of the statistic is equal to the parameter of which it is an estimate" (p. 7).
Therefore, unbiasedness is a characteristic of the sampling distribution, not only the nature of the statistic. That is, a statistic is unbiased if, over the course of a large number of samples, the mean of these statistics equals the true parameter. Otherwise, the statistic is biased.

Weighted Least Squares: a modified regression procedure applied to obtain weighted estimates of differentially reliable studies. Hedges (1982) also defined weighted least squares as an estimation of linear model parameters by minimizing a weighted sum of squares of differences between observations and estimates (p. 256). It is characterized by the following: $\hat{\beta}_w = (X'WX)^{-1} X'WY$

Chapter Two

Literature Review

The literature referenced in this study concerning the relative sensitivity of two fixed-effects tests, the random-effects Q and conditionally-random procedures will be presented in seven sections to extend a rationale for both addressing this research interest and applying the Monte Carlo design. First, the historical and philosophical evolution of meta-analysis is discussed. Next, the four tests are presented. Third, a discussion of the process of model selection, model fitting and hypothesis testing is offered, as well as the rationale for maintaining a flexible, iterative approach. Fourth, potential factors affecting the efficacy of homogeneity tests are presented. Fifth, research literature pertaining to the comparative sensitivity of each of the four tests of homogeneity of effects is surveyed. Sixth, calculation strategies for reducing estimator bias are presented. Finally, a summary of the recommendations of prior researchers is presented.

Historical & Philosophical Evolution of Meta-analysis

The objective of Science is not focused on the unearthing of Truth, as much as the construction of more heuristic problems and perspectives.
That is, the problems or hypotheses incorporate more overriding factors, not merely the particulars of a single context. The potential for generating more general problems propels scientific progress, not the discovery of certain Truth (Popper, 1968). In a similar fashion, Kuhn (1962) suggests the scientist is obligated to understand the world and enlarge the scope and precision of the system of its ordering. Popper continues by explaining that, in fact, the need for complete objectivity prevents scientific declarations from being anything more than tentative. Similarly, Pearson (Inman, 1994) asserts the scientist works to summarize perceptual data, not necessarily to uncover Truth. It is a descriptive, not explanatory, endeavor. Some, Meehl (1978) points out, argue the social sciences possess few accumulated insights. Glass, McGaw & Smith (1981) echo this concern with respect to meta-analysis: "Although scholars continued to integrate studies narratively, it was becoming clear that chronologically arranged verbal descriptions of research failed to portray the accumulated knowledge" (p. 12). Post hoc analysis of scientific disciplines reveals that progress ensues over the course of time and many contributions, not readily evidenced in a single event (Serlin, 1987). Kuhn (1962) points out that many question whether the social sciences have adopted a set of rules and standards (a paradigm) to constrain the generally accepted practice of the scientific enterprise. Acceptance of a paradigm requires a group of practitioners to universally embrace the same set of theories and rules. Without the accumulation of experiences, the group does not possess a shared perspective. If probability is the science of tentative truths (Popper, 1968), statistics is the discipline of determining the extent to which those truths are possible. Statistics is guided by an imperative to summarize data when possible.
We employ this process as a means of extending information (with a given amount of error) about the characteristics of a sample to a population of larger interest. Estimation of a common parameter by combining multiple estimates from similar studies is valued as being a more efficient, accurate and natural process (Glass & Hopkins, 1984; Hedges & Vevea, 1998). The more we minimize error, the more we learn about the presumed characteristics of the population. Combining data from similar studies, as opposed to data derived from a single study, enhances population parameter estimates (Erez, Bloom & Wells, 1996) by checking the consistency of the effect across contexts given the multitude of potential moderators within each study (Raudenbush, 1994). Pearson (1904) first derived an average estimate of effects to obtain a typical effect. Tippett (1931) and Fisher (1932) are among the first to combine probabilities across studies. The fundamental objectives are to identify both the patterns across studies and the aberrations from the same. As will be discussed, it is this philosophy which underlies the inferences embedded within the fixed- and random-effects (meta-analytic) models. However, these models maintain varying degrees of focus on patterns or aberrations. Meta-analysis permits questions not possible within the context of a single study. Present synthesis methods facilitate tests of hypotheses not tested in primary studies (Cooper & Hedges, 1994). Estimating the generalizability of the effects, as well as the extent to which they are specifiable to groups, exemplifies this objective. With meta-analysis, there is an accumulation of studies' results across situations, making it possible to determine whether a relationship exists regardless of specific circumstances. It is this capability that makes meta-analysis more consistent with general scientific inquiry.
The accumulation of scientific knowledge requires the development of testable hypotheses incorporating prior knowledge and theory (Kuhn, 1962). The idea of synthesis evolved from testing for one overall level of significance to identifying an average true effect(s), as well as an explanation of any moderators contributing to the impossibility of a single average effect (Cooper & Hedges, 1994). Significance testing used within the context of a single study does not accomplish this purpose alone (Mulaik, Raju & Harshman, 1997). But meta-analysis integrates the results of multiple related studies to facilitate generalization about a problem or set of factors (Cooper & Hedges, 1994). Meta-analysis refers to "the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings" (Glass, 1976, p. 3). It represents one class of research review, the integrative review (Bangert-Drowns, 1986). Cooper & Hedges (1994) describe research integration as an effort involving the determination of consistencies and the origin of variability across a set of related studies. Redundancy and multiple contexts, both integral to drawing generalizations about the validity of some hypothesis (Cohen, 1990; Tukey, 1969), are two central features of meta-analysis. This method achieves two important objectives: the determination of result invariance and the enhancement of objectivity. Replication and integration of studies both entail redundancy, a useful property for establishing consistency of results. By integrating studies incorporating similar features, meta-analysis uses multiple studies to assess the robustness and generality of a relationship(s) (Rosnow & Rosenthal, 1989). Similarly, Cohen (1990) states, "Only successful future replication in the same and different settings (as might be found through meta-analysis) provides an approach to settling the issue [of the stable influence of some treatment]" (p. 1311).
The compilation of studies demonstrating consistent results, regardless of research design differences and treatment administration peculiarities, ensures the insignificance of the researcher's influence and chance, as well as the significance of the effect. Because repetition (or repeated outcomes) permits the evaluation of variability and consistency, it is a cornerstone of scientific inquiry (Tukey, 1969; Hedges & Olkin, 1985). Repeated outcomes suggest greater reliability of the data. Popper (1968) asserts: "Only when certain events recur in accordance with rules or regularities, as is the case with repeatable experiments, can our observations be tested in principle by anyone ... Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated coincidence, but with events which, on account of their regularity and reproducibility, are in principle intersubjectively testable" (p. 45). With respect to the second aim, objectivity involves invariance in the manner with which a researcher develops observations, separate and apart from his/her actions or predispositions (Mulaik, Raju & Harshman, 1997). Objectivity increases as the error inherent in any particular study is minimized. But even assuming the absence of researcher bias, invariance does not provide irrefutable support for the presence of an absolute relationship or set of results. As Popper (1968) suggests, other outcomes could still be possible; corroboration of a result does not nullify the possibility of another outcome as well. Regarding model selection in meta-analysis, there is much disagreement as to the degree of objectivity required for preliminary efforts taken at the first stage of the analysis involving the determination of homogeneous effects. As meta-analysis involves the compilation of many studies, summary statistics from each of the primary studies are converted into effect sizes.
The first stage of the meta-analytic process involves the determination of whether there is a common population effect or multiple typical effects. The disagreement lies in which criterion to use for model selection in determining the homogeneity of effects: either a priori selection based on researcher inferences about the comprehensiveness of the sample, or selection based on the results of an initial administration of the fixed-effects Q test. The appropriateness of the model selected bears significant implications for the determination of the magnitude of the effect(s) at stage 2 (Chang, 1993). The null hypothesis for a typical meta-analysis includes the assumptions that there is no difference in the effects from one study to the next ($\delta_1 = \dots = \delta_k$) and that each study's effect does not equal zero ($\delta_k \neq 0$). Another null hypothesis assumes the study effects across classes share a common effect. Properly interpreted, a lack of evidence to the contrary does not rule out the possibility that true differences exist. Chang (1993) finds that false rejection of the null at the first stage results in z tests (incorporating random-effects variances) with reduced sensitivity for the magnitude of the common effect at the second stage (p. 121). In an effort to replicate and summarize results, there is an attempt to draw more definitive conclusions about the relationships between a multitude of variables (Raudenbush, 1994), as well as to compile, evaluate and integrate the corpus of research studies into a meaningful nexus of information (Cooper & Hedges, 1994). As Kirk (1996) contends, "What we want to know is the size of the difference between A and B and the error associated with our estimate; knowing that A is greater than B is not enough" (p. 774).
Replication or, in the case of meta-analysis, synthesis of multiple studies contributes to either the corroboration or nullification of a consistent measurement which then generalizes more accurately to a greater number of subjects. Cronbach et al. (1972) stated, "A behavioral measurement is a sample from the measurements that might have been made, and interest attaches to the obtained score only because it is representative of the whole collection or universe" (p. 18). Combining information from similar studies in somewhat different settings provides more conclusive evidence regarding a hypothesis by further enhancing population parameter estimates, as opposed to those generated from a single study (Cohen, 1990; Erez et al., 1996). A single study fails to provide conclusive evidence for the generalizability of a finding, because it does not demonstrate the stability of a relationship outside of one context. In other words, the consistency of an outcome supports its nonchance occurrence and indicates that it truly represents some expectancy. Meta-analysis involves procedures for computing both an estimate of across-studies effect sizes and an overall significance level for multiple studies addressing similar treatments. A brief overview of the determination of a true estimate of effect versus an overall significance level can help to distinguish these two meta-analytic procedures. Glass is credited with applying a method of combining study results from differing scales of measurement. He demonstrated how Cohen's d (the standardized mean difference) or a product-moment correlation coefficient could fulfill the need for a scale-free measure of effect magnitude. Glass, McGaw and Smith (1981) applied analysis of variance and multiple regression for meta-analytic purposes. Effect sizes were employed as the dependent or criterion variable and study characteristics inputted as the independent or predictor variable(s).
Tippett (1931) and Fisher (1932) were among the first to combine probabilities across studies. Tippett illustrated that if p values are independent, then each originates from a similar distribution under the null hypothesis. Significance tests are a sort of measure indicating the extent to which one ought to be disinclined to accept the null hypothesis under consideration (Mulaik, Raju & Harshman, 1997). As Rosenthal (1978) illustrates, there are at least eight methods for obtaining an overall significance level. Some are appropriate with larger numbers of studies, others with fewer studies. The blocking method, for instance, may be more powerful than most, but is computationally more complex, particularly with a larger set of studies. Neither the combination of probabilities nor the generation of a single index is adequate, in and of itself, for the calculation of a common effect size. Each resolves only part of the question as to the possibility of the presence of the relationship and the comprehensiveness of the sample. Meta-analysis circumvents the issue of statistical appropriateness by deriving its summary statistic from a multitude of parameter estimates or sampling distributions. As stated earlier, both confidence intervals and significance levels guide the generation and/or evaluation of a common effect size in meta-analysis. There are two problems in the exclusive application of probabilities and statistics. Specifically, Chow (1998) suggests a potential conflict in assuming the presence of a single sampling distribution, supporting the need to evaluate treatment effects using both significance levels and effect size estimates. He states, "The validity of statistical power is debatable because statistical significance is determined with a single sampling distribution of the test statistic based on null hypothesis, whereas it takes two distributions to represent statistical power or effect size" (p. 169).
Although Fisher's method of combining probabilities may be best known, it is faulty in that, when there are two studies with equally significant outcomes in opposite directions, it supports the significance of either result (Rosenthal, 1978), failing to account for the direction of the outcome. There is also a basic problem with attempting to interpret effect sizes in isolation. Meta-analysts do not test the opposite of the null, that there is an effect, because it is not an exact hypothesis, for there is no single suggested value (like zero for the null) (Mulaik, Raju & Harshman, 1997). If a treatment group mean is significantly greater than the control group mean, this result is not specified. But when one merely combines effect sizes without presenting confidence intervals, the overall effect can equal zero as all effects are taken into account. Carver (1978) explains: "Now if we reverse the question to ask what is the probability that two obtained groups were sampled from the same population, we have the question that most people want to answer and assume they have answered when they calculate the p value from statistical significance testing. In essence they are asking what the probability is that the null hypothesis, $H_0$, is true, given the type of large mean difference we have obtained, or, what is $p(H_0 \mid D_1)$?" (p. 385). Because significance levels draw a single conclusion about the probability of an outcome(s), variation across studies was perceived, prior to the use of meta-analysis, as obscuring rather than clarifying interpretation. Meta-analysis identifies both the variation and its source(s) (Cooper & Hedges, 1994). Cooper & Hedges further assert that past syntheses were often conducted using the wrong datum (significance tests rather than effect sizes) and the wrong evaluation criteria (implicit cognitive algebra rather than formal statistical test) (p. 523).
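The direction problem noted above for Fisher's method can be made concrete. The sketch below assumes the usual form of the procedure, in which k independent p values are combined as -2 times the sum of their natural logs and referred to a chi-square distribution with 2k degrees of freedom. It also assumes the inputs are two-tailed p values. The function names are hypothetical, and the closed-form tail probability used here holds only for even degrees of freedom, which is always the case for 2k.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate with even df,
    using the closed form exp(-x/2) * sum_{j<df/2} (x/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def fisher_combined(p_values):
    """Return (statistic, combined p) for independent p values."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * len(p_values))


if __name__ == "__main__":
    # Two hypothetical studies, each with two-tailed p = .01. The function
    # never receives the sign of either effect, so the combined result is
    # identical whether the studies agree or point in opposite directions.
    print(fisher_combined([0.01, 0.01]))
```

Because the sign of each effect never enters the calculation, two equally significant results in opposite directions yield the same combined p value as two results in the same direction.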
Comprehensiveness, not consistency alone, best imparts scientific truth (Light & Smith, 1971; and Glass, McGaw & Smith, 1981). Meta-analysts strive to account for discrepancies across studies, in an attempt to identify moderating variables, thereby providing a more complete representation of the population of interest. This objective cannot be accomplished by deriving an overall significance test in and of itself. To this end, meta-analysts call for the combined presentation of both estimates of effect and an overall significance level, accompanied by confidence intervals (Light & Smith, 1971; and Rosenthal, 1978). The presentation of effect sizes in conjunction with significance results helps to contextualize the interpretation of the latter (Rosenthal, 1978; Serlin, 1987; and Cohen, 1990). Heterogeneous effects generate an estimate of the effect size, at the second stage of the meta-analysis, which is an average effect ( ) from a distribution of random effects, not a single effect ( ) from a set of equal effects (Chang, 1993). An average effect size computed across a series of heterogeneous effects is a distinct statistic from the estimated effect size of a common population estimate. The pooling of effect sizes across studies, in the case of the latter, assumes a consistent treatment administration to a single population. When an a priori decision, or one attained as a result of information gleaned from a test of homogeneity of effects, indicates multiple independent populations, effect sizes are no longer pooled into a single estimate of effect. Rather, an average effect size can be computed with the understanding that the statistic does not represent the effect of a treatment on a single population. The effect size representing a whole population presupposes the same strength of influence for a treatment or independent predictor variable on the relatively similar set of characteristics displayed by a uniform population.
In contrast, an average effect simply combines effects across studies irrespective of treatment differences and any true differences between groups within a population. Combining p values involves a nonparametric test, whereas tests of homogeneity of effects are parametric. Becker (1994) describes the p value, or significance level, as reflecting the probability of seeing a sample as unique as the observed sample assuming the verity of a null hypothesis. Large ps indicate samples are well represented by the null condition. Significance levels or p values are laden with information about sample size or degrees of freedom, as well as the population, not distinguishing these from the influence of the treatment alone. Unlike effect-magnitude indices, combined ps do not present measures of effects free from these other inputs. Only an analysis of effect magnitudes provides information about both the size and the strength of the average effect. Furthermore, these analyses can pinpoint any moderator variables contributing to inconsistent effects. The omnibus test wherein p values are combined presents information only about whether or not the effect is different from zero.

Purpose of Homogeneity Tests

The question driving meta-analysis is whether there is one average pooled effect or multiple average effects. Two questions are subsumed within the one about the breadth of the population, a substantive question with implications for the partialling of variability and interpretation of results. Is one or are several populations being represented by the sample? This question is determined by another: is there variability across effects or merely sampling error? As will be discussed in greater detail, this question is the reason for conducting regression. In integrating studies' results, the objective is to determine whether a single average effect or a random distribution of effects best describes a treatment effect on a population.
Homogeneity tests permit inferences as to whether a set of studies share a common true effect, facilitating greater generalization about a treatment's efficacy from a single context to a broader array of settings. A frequently applied test of homogeneity, Q (Hedges, 1982), tests the assertion that effect sizes are equal ($H_0: \delta_1 = \delta_2 = \delta_3 = \dots = \delta_k$), with variability assumed constant across studies. When the assumption of homogeneity is violated, the Q test possesses an approximate chi-square distribution with a noncentrality parameter. This noncentrality parameter cannot be theorized, because it consists of a combination of multiple sampling distributions. The only alternative is to estimate power using a simulation technique. In another method, the analyst further accounts for the variability in studies' effects by incorporating sampling error and between-studies variance. Three models apply to meta-analysis: fixed-effects, random-effects and conditionally-random models. Distinguishing features of the first two models will be discussed shortly. Each model employs at least one statistical procedure. For the present study, 2 fixed-effects tests (not including the basic, large-sample derivation), 1 random-effects test and 1 conditionally-random procedure are presented. Hedges and Vevea (1998) define Q, the general test of homogeneity of effects, as a comparison of between- to within-study variance (p. 490). Hedges (1982) states, "The test of homogeneity of effect (Hedges, 1982a) provides a method of empirically testing whether the variation in effect estimates is greater than would be expected by chance alone. If the null hypothesis of homogeneity is not rejected, the reviewer is in a strong position vis-a-vis the argument that studies exhibit real variability, which is observed by coarse grouping" (p. 246-7). Harwell (1997) explains that retention of $H_0$ is typically followed by pooling the $d_i$ and testing the weighted average $d_+$ against zero (p. 220).
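For reference, the Q statistic described verbally above can be written out as follows. This is the standard large-sample formulation usually attributed to Hedges and Olkin (1985), offered here as a sketch and not as a transcription of this study's own equations. The symbols $d_i$ and $d_+$ follow the Harwell passage quoted above, while the weight $w_i$ and the estimated variance $\hat{\sigma}^2(d_i)$ are notation assumed for illustration.

```latex
% Inverse-variance weights and the weighted average effect
w_i = \frac{1}{\hat{\sigma}^2(d_i)}, \qquad
d_+ = \frac{\sum_{i=1}^{k} w_i d_i}{\sum_{i=1}^{k} w_i}

% Homogeneity statistic: weighted squared deviations from d_+
Q = \sum_{i=1}^{k} w_i \,(d_i - d_+)^2

% Under H_0: \delta_1 = \dots = \delta_k and large within-study samples,
% Q is approximately distributed as chi-square with k - 1 degrees of freedom
Q \;\dot\sim\; \chi^2_{k-1}
```

Read this way, a large Q relative to the $\chi^2_{k-1}$ critical value indicates more between-study variation in the $d_i$ than sampling error alone would produce.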
Chang (1993), among others, elaborates a two-stage process for determining a population effect. According to Mulaik, Raju & Harshman (1997), there has to be some criterion for determining whether a small positive variance between effects is small enough to approximate zero. Variance between studies is estimated and then evaluated to determine if the effects are significantly different. The question becomes how one can select the inference without knowing whether there is one or several population effects. Faulty estimation of a common effect can result in the blurring of true differences across studies (Hedges & Olkin, 1985) and the use of erroneous z tests to determine the magnitude of effect (Chang). The two-stage process of effect size estimation involves both Chang's (1993) study of Type II error consequences in accurately computing z tests and the determination of the effect for a particular study, as well as the estimation of a population effect(s) across studies. First, one determines whether studies share a common effect size. This step involves the use of homogeneity tests. Second, one computes the magnitude of the average effect.

Despite the possible commission of Type II error, consideration of the influence of violations of statistical assumptions is limited. The recent and infrequent consideration of these assumption violations demonstrates that further study and vigilance are warranted. Concern related to the drawing of valid generalizations across studies dates back centuries (Legendre, 1805). Rasmussen & Dunlap (1991) assert that many data analysts demonstrate limited concern for the distributional features of their data. Similarly, Wolf (1990) expresses concern for how few meta-analysts recognize the consequences of violated assumptions on the performance of meta-analytic tests and for their failure to investigate the impact of such violations.
As recently as the mid-90s, a review of over 400 psychological and educational journals revealed a similar lack of consideration in studies employing various univariate and multivariate designs. The violation of normality received attention from 11% (46 of 411) of the authors, while only 16% (69 of 411) addressed any assumption violations whatsoever (Keselman, Huberty, Lix, Olejnik, Cribbie, Donahue, Kowalchuk, Lowman, Petoskey, Keselman, & Levin, 1998). Though there is ongoing dispute as to whether one determines the model (whether fixed, random or mixed) prior to or after applying the test of homogeneity, both the theory and data conditions need to be considered as an integrated whole. For this reason, we now turn to the data-analytic limitations and robustness features of each of five tests/procedures. Each is more or less appropriate under certain conditions.

Introduction of Tests

Q tests the null hypothesis that study effects are equal, or that there is no statistically significant variance among population effects. Essentially, it sums the squared differences between each study's effect estimate and the weighted average of all effect estimates, each divided by the variance of that estimate. According to Hedges and Vevea, the Q test can be interpreted as an analysis of between- to within-study variance. If the effect sizes vary, Q has a noncentral chi-square distribution with $k - 1$ degrees of freedom and a noncentrality parameter, $\lambda = \sum_i (\delta_i - \bar{\delta})^2 / \sigma^2(d_i)$, where $d_i$ equals the unbiased estimate of the population effect, $\delta_i$. When study sample sizes decrease, Q no longer retains a sampling distribution with a chi-square spread and well-defined degrees of freedom. For this reason, estimated Type I error rates depart sharply from nominal values (Harwell, 1997). For the appropriate use of Q, primary studies are independent, normally distributed and share a common variance. Typically, there are a more limited number of studies included in the meta-analysis.
The population is well defined and finite. The only source of variance under consideration is sampling error or variance introduced by uncontrollable and unknown factors. The Q test incorporates weighted variances in order to account for greater or lesser precision across studies included in the meta-analysis. In this set of circumstances, both the Type I error rate and power of Q approximate theoretical values. But unequal group variances and sample sizes, as well as nonnormal distributions, disrupt this congruence between estimated and theoretical values of Q (Harwell, 1997). Harwell (1997) found that the larger the variance ratios, the smaller the estimated power values. As within-group variances became progressively divergent, estimated power values diminished. This finding held for all k, distributions and variance ratios, and results from the incorrect pooling of within-study sample variances for the generation of d. Unequal group variances result in d's denominator being overestimated, diminishing the value of d and its power. Specifically, Q requires the maintenance of the following assumptions to perform optimally: 1) independence of scores in primary studies; 2) a normal distribution; and 3) common variance.

To better present the disparity between fixed- and random-effects models, a brief overview of their application follows. For both traditional Q and the random-effects procedure, first apply the regression equation

\[ d_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + u_i + e_i \]

where
$d_i$ = the effect size estimate for the ith study,
$\beta_0$ = the grand mean effect size,
$\beta_1$ = the expected mean difference in effect sizes between studies of different classes,
$X_{i1}$ = some amount of the first study characteristic in the ith study,
$u_i$ = the residual, or component of the effect size not explained by X, and
$e_i$ = the error of estimation.

Calculation strategies used to augment precision.

Generalized Linear Models are the largest class of statistical models.
They include the more specialized classical linear models, those models restricted to linear relationships. In a classical linear model, there is a systematic component (entailing the model matrix X and the vector of parameters equivalent to the means) and a random component. The latter part assumes independence and constant variance of the errors, or homoscedasticity. There are known covariates influencing the mean and these are measured without error. Unbiased estimates generated from the minimizing of the least-squares criterion possess minimum variance. Least squares estimates depend on the assumptions of common variance and the independence of the observations from their mean value. Often, these are referred to as ordinary least squares. The presence of either unequal variances or correlated observations signals the inappropriateness of ordinary least squares (Draper & Smith, 1998). In addition, Lix and Keselman (1998) advise against the use of usual least squares estimates under conditions of nonnormal distributions and unbalanced study designs (in meta-analysis this situation arises when the number of subjects in each condition of a study varies across studies). The unbalanced design contributes to the uneven precision of estimated effect sizes.

In identifying the extent of variance across study effects, the primary objective is to incorporate estimates of variance that have minimal bias and a degree of precision corresponding to the kind of information yielded by each particular study. The precision of these estimates depends upon the study's sample size and on the extent of heterogeneity across the true effect sizes (Raudenbush, 1994). Studies with larger sample sizes permit more precise estimators of effect and need to be more heavily weighted than estimators derived from studies with smaller sample sizes (Hedges & Olkin, 1985; and Hedges & Vevea, 1998).
The extent of the precision is captured by employing weighted estimators that minimize variance by using weights inversely proportional to the variance in each study (Hedges & Vevea, 1998). For estimates of effect, "the nonsystematic variance is inversely proportional to the sample size of the study on which the estimate is based" (Hedges & Olkin, 1985, p. 11). The resulting procedure, weighted least squares regression, originates from the class of Generalized Linear Models. Before addressing the statistical procedure used to calculate these weighted estimates, a brief discussion of the Generalized Linear Model is presented.

The Generalized Linear Models extend beyond the classical linear models in that they permit one to consider the patterns of how moderating variables systematically affect the variation in some treatment outcome. As a result, they define both linear and nonlinear relationships. Specifically, Generalized Linear Models are applied because they permit two extensions beyond the normal distribution for the random component and the identity function as the link between the random and systematic components. First, the distribution may be derived from an exponential family. Second, the link function may become any monotonic differentiable function (McCullagh & Nelder, 1983, p. 27).

Regardless of the use of a fixed- or random-effects model, once the individual variances evidence large differences from one study to the next, residual variances will be unequal (Raudenbush, 1994). In either case, weighting the ds will be more appropriate. Using weighted least squares will pinpoint the source of variability, as the residuals from the regression help determine the form of the variances. This procedure generates negatively biased standard error estimates. The bias diminishes as $\hat{\delta}_i$, the unbiased estimator, approaches 0 or as the residual variance of $\delta_i$ diminishes.
The distinction between random- and fixed-effects weighting involves the additional variance component, $\tau^2$, added to the denominator of the random-effects version. Specifically, the random-effects weights are

\[ w_i^* = 1 / v_i^* = 1 / (\tau^2 + v_i) \]

whereas the fixed-effects weights are

\[ w_i = 1 / v_i \]

The weighted mean is incorporated into the calculation of the maximum likelihood estimator:

\[ d_+ = \sum_i w_i T_i \Big/ \sum_i w_i \]

Note: $d_+$ is the weighted estimate of the population effect, calculated by multiplying each study's estimated effect by its weight, summing these products, and dividing that sum by the sum of the weights.

In conclusion, weighted least squares utilizes study characteristics as the predictors of study outcomes, as well as the estimators of effect size variance unexplained by the model (Raudenbush, 1994). Applying weighted least squares permits improved validity of the overall results, by enhancing the precision of variance estimates, and facilitates further explanation of variance by identifying sources of variability.

The fixed-effects homogeneity test is generally referred to as Q or H (Chang, 1993). Notice that the random-effects version of this algorithm incorporates two units in the error term, $u_i + e_i$. In the fixed-effects equation, $u_i$ is absent. The $u_i$ term refers to the true score (effect) variance, whereas the fixed-effects model only considers the sampling error, $e_i$, discounting the possibility of true variance across studies. As will be seen, this additional component of uncertainty plays a considerable role in increasing the variance and power of the random-effects statistic relative to traditional, fixed-effects Q. The fixed-effects Q test follows:

\[ Q = \sum_i (d_i - d_+)^2 / \sigma^2(d_i) \qquad \text{(Fixed-effects Test)} \]

where $d_i$ = estimate of the population effect size, the minimum variance unbiased estimator of $\delta_i$.
$d_+$ = average of the $d_i$s:

\[ d_+ = \frac{\sum_i d_i / \sigma^2(d_i)}{\sum_i 1 / \sigma^2(d_i)} \]

$\sigma^2(d_i)$ = within-study variance:

\[ \sigma^2(d_i) = \frac{n_E + n_C}{n_E n_C} + \frac{d_i^2}{2(n_E + n_C)} \]

After controlling for the study characteristics, the variance for $d_i$ is

\[ v_i^* = \mathrm{Var}(u_i + e_i) = \tau^2 + v_i \]

With a balanced design and no predictors, proceed as follows:

\[ Q = \sum_i w_i (d_i - d_+)^2 \]

Note: $d_+$ is the weighted estimate of the population effect and was calculated by

\[ d_+ = \sum_i w_i d_i \Big/ \sum_i w_i \]

If the design were unbalanced, compute the weights for each study effect accordingly: $w_i = 1/v_i$ for fixed-effects weights, and $w_i^* = 1/(v_i + \tau^2)$ for random-effects weights.

Hedges and Vevea (1998) describe how the additional variance component in the random-effects test makes it a more conservative, less powerful test than the fixed-effects Q. They explain: "Because the additional component of variance is the same for all studies, it both increases the total variance of each effect size estimate and tends to make the total variances of the studies (the $v_i^*$) more equal than the sampling-error variances (the $v_i$)" (p. 492).

Once Q is calculated for the fixed-effects test, it is combined with the constant c, computed from the fixed-effects weights, to permit the computation of $\tau^2$ for the random-effects procedure:

\[ c = \sum_i w_i - \frac{\sum_i w_i^2}{\sum_i w_i} \]

Next, test the significance of the effect-size variance component ($\tau^2$), that is, $H_0: \tau^2 = 0$:

\[ \hat{\tau}^2 = [Q - (k - 1)] / c \]

If this value is larger than zero, then one can no longer assume the presence of a single effect. One then recomputes the random-effects weights using the estimate of $\tau^2$, and obtains $d_+$ for random-effects, the random-effects weighted mean effect size:

\[ d_+^* = \sum_i w_i^* d_i \Big/ \sum_i w_i^* \]

Using $d_+$ for random-effects, one constructs the confidence interval around the population mean effect, $\mu$:

\[ \text{Lower limit} = d_+^* - 1.96(SE) \le \mu \le d_+^* + 1.96(SE) = \text{Upper limit} \]

Conceptually, the random-effects model interprets the collection of studies as part of a wider and unknown universe, as opposed to the fixed-effects model, by which hypothesis testing is restricted to the immediate sample. For this reason, the random-effects statistic accounts for variance in a unique expression.
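The computations above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in any of the cited studies: the function names are hypothetical, the truncation of the variance estimate at zero follows the convention attributed to Hedges and Vevea (1998), and the standard error of the random-effects mean is assumed to be the square root of the reciprocal of the summed random-effects weights, which the text does not state explicitly.

```python
import math


def within_study_variance(d, n_e, n_c):
    """Estimated sampling variance of one standardized mean difference d_i."""
    return (n_e + n_c) / (n_e * n_c) + d ** 2 / (2 * (n_e + n_c))


def homogeneity_summary(effects, variances):
    """Fixed-effects Q, moment estimate of tau^2, and random-effects mean."""
    k = len(effects)
    w = [1.0 / v for v in variances]                 # fixed-effects weights
    sum_w = sum(w)
    d_fixed = sum(wi * di for wi, di in zip(w, effects)) / sum_w
    q = sum(wi * (di - d_fixed) ** 2 for wi, di in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)               # set to 0 when negative
    w_star = [1.0 / (v + tau2) for v in variances]   # random-effects weights
    sum_w_star = sum(w_star)
    d_random = sum(wi * di for wi, di in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)                 # assumed SE (see lead-in)
    ci = (d_random - 1.96 * se, d_random + 1.96 * se)
    return {"q": q, "c": c, "tau2": tau2,
            "d_fixed": d_fixed, "d_random": d_random, "ci": ci}


if __name__ == "__main__":
    # Three hypothetical studies with equal sampling variances.
    print(homogeneity_summary([0.2, 0.5, 0.8], [0.04, 0.04, 0.04]))
```

With equal variances the fixed- and random-effects means coincide; the two approaches differ only in the width of the interval, which grows with the estimated between-studies variance.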
Although the random-effects statistic has been treated as one not requiring an assumption of normal distribution of the study effects, there has been recent cause for suspicion of the same. Hedges (1992) suggests this estimator may not be unbiased under conditions of nonnormal random effects.

The between-studies variance component is the distinctive feature of the random-effects homogeneity test (referred to by this study as Q+ but variously referred to in other literature as H+), differentiating the random-effects Q both theoretically and algorithmically. This element is variously referred to as the between-studies variance component, estimator of population variance, estimator of the variance of population effects, estimator of the population variance component, heterogeneity of effects and treatment x studies interaction, and is expressed with various symbols in the literature. For purposes of this study, it will be referred to as the between-studies variance component, heterogeneity of effects or $\tau^2$.

The between-studies variance, or $\tau^2$, represents the part of the variance that consists of systematic error. It defines the degree of variation across studies relevant to study-specific treatment effects, or refers to the variance of the population from which the study-specific effect parameters are sampled (Hedges & Vevea, 1998). It is added to the sampling error to compute the estimate of the total variance of the average effect. Moreover, $\tau^2$ entails the variance of the distribution of the errors of $\delta_i$. Typically, this statistic has increased variance due to the added uncertainty built into its algorithm. As stated above, the between-studies variance component ($\tau^2$) is responsible for enlarging the standard error of the mean, making it substantially larger than in the fixed-effects test. This difference results from the increased between-study heterogeneity in the effects (Hedges & Vevea, 1998).
In fact, Hedges and Vevea explain that the between-studies variance component is approximately two thirds as large as the average estimate of the sampling error variance. $\tau^2$ is set to 0 when $Q - (k - 1)$ becomes negative, because it cannot be negative and is either present or absent (Hedges & Vevea, 1998). Though it may not be partitioned within the model, Hedges and Vevea conduct their simulation controlling for varying degrees of the between-studies variance, $\tau^2$. When $\tau^2 = 0$ with an unconditional inference, fixed-effects Q generates nominal probability; however, when $\tau^2 > 0$, Q generates a lower than nominal probability. But the same is true to a lesser extent for random- and conditionally-random procedures, particularly when k is small and heterogeneity is large (Hedges & Vevea, 1998). If one assumes $\tau^2$ to be small when it is not, the result will be the underestimation of the variance $v^*$ (the sampling variance of the random-effects estimate) (Hedges & Vevea, 1998).

The random-effects algorithm includes the estimate of population variance, the component accounting for the added uncertainty.

\[ Q_+ = \sum_i (d_i - d_+)^2 / \sigma^2(d_i \mid \delta_i) \qquad \text{(Random-effects Test)} \]

Note: The primary difference between the traditional Q and Q+ is the inclusion of the variance of the sample effect sizes while holding constant the population effect size(s), $\delta_i$. This modification creates the variance of the conditional distribution of $d_i$ given $\delta_i$.

Hedges and Olkin (1985) explain that the algorithm applies expected values of the mean squares. These values are represented in terms of variance components. Sample values are substituted for these expected values and used to solve for the variance components (please refer to the procedural description of the fixed-effects test). Weighted least squares are typically applied to enhance the precision of the estimates of variance for study effects. This procedure results in unbiased estimates of the variance components.
A more in-depth discussion of weighted least squares is presented at the end of this chapter.

First, it is important to distinguish the conditionally random-effects Q as a procedure and not a test statistic. Rather, it is a sort of protocol concerning the treatment of the choice to conduct either a random- or fixed-effects test of homogeneity. Hedges and Vevea (1998) state: "If the analyst chooses to make conditional inferences (by conditioning on the studies in the data set), the statistical model has been determined because the effect parameters are treated as fixed for inference. If the analyst chooses to make unconditional inferences, the statistical model treats the effects as a sample (even if no real sampling has been done), and thus, they are treated as random effects" (p. 495). When conditional inferences are made (the effect parameters are fixed), random- and conditionally random-effects procedures overestimate confidence intervals (they are too wide). Such overextension depends on K and the degree of between-study heterogeneity (Hedges & Vevea, 1998). The conditionally random-effects procedure performs similarly to the fixed-effects Q when $\tau^2$ is not statistically significant, and similarly to the random-effects Q when $\tau^2$ is significant (Hedges & Vevea, 1998).

According to Hedges and Vevea (1998), the label of conditionally-random refers to the choice of random-effects being predicated on the test that $\tau^2$ is greater than zero. Because it is a procedure conditioned on the outcome of the null hypothesis of homogeneity of effects, it mediates the fixed- and random-effects approaches. According to Hedges and Vevea (1998), if one is not certain as to the homogeneity of the population, one applies the conditionally-random procedure. But optimally, according to Hedges and Vevea, the researcher will determine the model first and then select the appropriate procedure based on this a priori decision.
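The decision rule behind the conditionally-random procedure can be illustrated with a short Python sketch. It assumes that the test of the between-studies variance is carried out with the fixed-effects Q statistic compared against a chi-square critical value supplied by the caller; the function name, the dictionary keys and the example data are hypothetical, and the sketch is not taken from Hedges and Vevea (1998).

```python
def conditionally_random_mean(effects, variances, q_critical):
    """Pick fixed- or random-effects weighting based on the Q test.

    q_critical is assumed to be the upper critical value of a chi-square
    distribution with k - 1 degrees of freedom (e.g. 5.991 for k = 3
    studies at alpha = .05); it is passed in rather than computed here.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                 # fixed-effects weights
    sum_w = sum(w)
    d_fixed = sum(wi * di for wi, di in zip(w, effects)) / sum_w
    q = sum(wi * (di - d_fixed) ** 2 for wi, di in zip(w, effects))

    if q < q_critical:
        # Homogeneity retained: behave like the fixed-effects procedure.
        return {"model": "fixed", "q": q, "tau2": 0.0, "mean": d_fixed}

    # Homogeneity rejected: estimate tau^2 and switch to random-effects weights.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_star = [1.0 / (v + tau2) for v in variances]
    d_random = sum(wi * di for wi, di in zip(w_star, effects)) / sum(w_star)
    return {"model": "random", "q": q, "tau2": tau2, "mean": d_random}


if __name__ == "__main__":
    # Hypothetical data: mild versus pronounced spread among three effects.
    print(conditionally_random_mean([0.2, 0.5, 0.8], [0.04] * 3, 5.991))
    print(conditionally_random_mean([0.0, 0.5, 1.0], [0.04] * 3, 5.991))
```

Passing the critical value as an argument keeps the sketch free of external dependencies; in practice it would be looked up from a chi-square table or a statistics library.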
There is the equivalent practice of using the test for determining the homogeneity of effects across studies. Hedges and Vevea credit Chang (1993) with the identification of the two-stage process applied by a number of meta-analysts for the purpose of determining the model in use based on the homogeneity/heterogeneity of effects at the first stage. If an unconditional inference is desired, the effect parameters are treated as a sample from a population, and one estimates the mean and variance of that population. The observed effect parameters are treated as a random sample from that population. The test being conducted concerns the null hypothesis that the mean effect size equals a specified value (usually 0).

In order to explain the specific use of the conditionally-random procedure, Hedges and Vevea (1998, p. 503) describe the decision point in the following manner: "In the conditionally random effects procedures, the random-effects procedures are used if Q is statistically significant, and the fixed-effects procedures, are used otherwise. Thus, the expression for $v_{\cdot C}$ conditional on Q" (p. 503). For use in making either conditional or unconditional inferences apply:

\[ v_{\cdot C} = \begin{cases} vQ / [k(k-1)] & \text{if } Q \ge Q_{.95} \\ v / k & \text{if } Q < Q_{.95} \end{cases} \]

This procedure assumes Q to have a chi-square distribution with $k - 1$ degrees of freedom and to be independent of the estimate of the population effect, d. The sampling distribution of Q is a noncentral chi-square variate with a noncentrality parameter. A confidence interval, with upper and lower limits, is built around either $\bar{\theta}$, the mean of the k effect-size parameters, or $\mu$, the mean of the population from which the k effect-size parameters $\theta_1, \dots, \theta_k$ in the k studies being analyzed were sampled. The selection of the mean depends upon whether the test pertains to conditional or unconditional inferences. The confidence intervals vary for conditional and unconditional inferences.
\[ L_C = d_+ - z_{\alpha/2} \sqrt{v_{\cdot C}} \le \bar{\theta} \le d_+ + z_{\alpha/2} \sqrt{v_{\cdot C}} = U_C \]

\[ L_C = d_+ - z_{\alpha/2} \sqrt{v_{\cdot C}} \le \mu \le d_+ + z_{\alpha/2} \sqrt{v_{\cdot C}} = U_C \]

Notice that the confidence intervals are built the same way for both conditional and unconditional inferences, respectively, with the exceptions of the parameter about which the confidence interval is built and the formulas used to compute the standard error. $\bar{\theta}$ represents the "mean of the k effect-size parameters" (p. 495), the one true estimate of the treatment effect uniform across all studies, or the difference between the populations T and C across all studies, whereas $\mu$ refers to the "mean of the population from which the k effect-size parameters $\theta_1, \dots, \theta_k$ in the k studies being analyzed were sampled" (Hedges & Vevea, 1998, p. 498).

Hedges & Olkin (1985) refer to the Q-between test as the between-class goodness-of-fit statistic, $Q_B$. The Q-between statistic is a chi-square distributed test. The permuted version will be investigated here. Permuted Q-between is a randomization test designed to generate a larger, empirical sample. It tests the null hypothesis that the average effect size is the same across classes of studies. It remains consistent with the fixed-effects model because the only classes that are included in the test are those in the sample (it is a test of between-class homogeneity). It is not possible to permute studies that are not present. It should be noted that testing this null does not yield any direct variance estimates, only a probability statement about the similarity/difference between average effect sizes, not the difference between two or more parameters. For example, if one conducts such a test, one would assume that two grades of students (3rd and 4th grades) had the same effect size. The question becomes: what is the probability that we would see data like those observed, if grade level is not a moderator variable and assuming that population effect sizes are in reality identical?
So this probability statement is obtained without the need for any distributional assumptions. The null hypothesis is expressed as $H_0: \delta_1 = \delta_2$. The test statistic is based on the total weighted sum of squares. The denominator is the normalized weighted sum of squares of the effect size indices about the grand mean $d_{++}$. Having a common effect size across studies indicates that Q-between possesses an approximate chi-square distribution with $k - 1$ degrees of freedom. The permuted version applies a permutation strategy instead of using a chi-square distribution. Kromrey & Hogarty (1998) find this procedure to be more robust to the dual violations of normality and homogeneity of variance. When the Q-between test is employed using a permutation strategy, instead of a chi-square distribution, the Type I error control is well maintained. The primary limitation noted by their study is its inability to generate a sufficient data set with which to test at the .05 alpha level under conditions of small K (generally 5 or less).

According to Noreen (1989), "a randomization test is a procedure for assessing the significance of a test statistic [and] involves randomizing the ordering of one variable relative to another" (p. 12). Orderings are permutations of the variables relative to each other. Randomization tests are nonparametric and based on an empirical, rather than theoretical, distribution. Nonparametric refers to the manner in which the nature of the population distribution is not specified explicitly. Almost all permutation tests are nonparametric and vice versa. They do not require random sampling, but are based on random assignment within the study. Distribution-free, as opposed to nonparametric, refers to a test whose significance level is not predicated on the form of the population from which the sample is selected. Randomization tests are distribution-free, while maintaining the data's scale values.
Permutations are not quite distribution-free because they still require distribution symmetry. But all distribution-free tests are permutations. The permuted data are computed from other possible random assignments based on a randomization scheme, yielding a new sampling distribution. Test statistics derived from the simulated sampling distributions are compared against the observed test statistic. Based on the proportion of those statistics equal to or greater than the observed statistic, a significance level is computed.

There are two classes of randomization tests: exact and approximate. An exact randomization is generated when all possible permutations are completed. An exact test is one for which the probability of committing a Type I error is exactly alpha (Good, 1994). In contrast, an approximate randomization is the random shuffling of the possible permutations. Not all permutations are conducted. Noreen describes the purpose of approximate tests as being to increase the sample (number of orderings), thereby improving the precision of the approximation. Furthermore, approximate tests serve as a time-saving measure in conducting permutations, as they do not require the generation of every possible order.

Permutation tests are more complex to implement than conventional statistical tests in that they require the generation of a new sample. Good (1994) provides an outline for conducting a permutation test:

1. Analyze the problem. What are the null and alternative hypotheses?
2. Select a test statistic.
3. Calculate the statistic based on the original observation labels.
4. Rearrange the labels and recalculate the statistic for this set. Repeat until the entire set of permutations has a test statistic, generating a new sampling distribution. When one compares the test statistic of the shuffled data (the theoretical) against the estimated test statistic of the original data, and the former is greater than the latter, one is added to the nge counter. Once the number of shuffles reaches NS, one computes the significance level. In effect, one counts how many randomizations give a test statistic larger than that of the original data in order to obtain a significance level at the end.
5. Decide on the validity of the null hypothesis based on the permutation distribution. One looks to see whether the original statistic is an extreme value within the permutation distribution. If so, one rejects the null hypothesis.

As an example, there may be 4 possible selections, but 24 permutations of these:

\[ 4! = 4 \times 3 \times 2 \times 1 = 24 \]

Permuting this test involves the reordering of study effects. The number of reorderings is found by dividing the factorial of K, the number of study effects, by the product of the factorial of R, the number of studies in the smaller group (either treatment or control), and the factorial of K minus R. For example, if the number of study effects equals 10 and the number of studies in the smaller group equals 4, the number of combinations of K taken R at a time gives the number of permutations to be conducted. In this case, 210 permutations would be simulated, by computing $K! / (R!(K - R)!)$, or specifically

\[ \frac{10!}{4!(10 - 4)!} = \frac{3{,}628{,}800}{24 \times 720} = 210 \]

Once the number of permutations is computed, the permutations are run using the Q-between test as follows:

\[ Q_{bet} = \sum_i \frac{(d_{i+} - d_{++})^2}{\sigma^2(d_{i+})} = \sum_i \sum_j \frac{(d_{i+} - d_{++})^2}{\sigma^2(d_{ij})} \qquad \text{(Q-between test)} \]

with the summation over i classes and j studies in each class, where $d_{i+}$ refers to the average weighted effect size for class i, $d_{++}$ represents the grand mean effect size, and $d_{ij}$ represents the effect size for the jth study in the ith class (Hedges & Olkin, 1985).

Permutations may be more powerful than parametric statistics, as they operate using symmetric distributions and/or distributions with small shifts in value. Good (1994) explains that "A most powerful unbiased permutation test often works in cases where a most powerful parametric test fails for lack of knowledge of some yet unknown nuisance parameter" (p. 2).
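The permutation steps and the Q-between statistic described above can be sketched in Python as an approximate randomization test. Several details are assumptions made for illustration and are not stated in the text: the function names are hypothetical, class labels are shuffled across studies while each effect stays paired with its own variance, ties count toward the nge counter, and the significance level is computed as (nge + 1) / (NS + 1).

```python
import random
from math import comb


def q_between(effects, variances, labels):
    """Weighted between-class sum of squares about the grand mean d_++."""
    w = [1.0 / v for v in variances]
    grand = sum(wi * di for wi, di in zip(w, effects)) / sum(w)
    q = 0.0
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        w_cls = sum(w[i] for i in idx)               # 1 / variance of d_i+
        d_cls = sum(w[i] * effects[i] for i in idx) / w_cls
        q += w_cls * (d_cls - grand) ** 2
    return q


def permuted_q_between(effects, variances, labels, n_shuffles=999, seed=0):
    """Approximate randomization test: shuffle class labels NS times."""
    rng = random.Random(seed)
    observed = q_between(effects, variances, labels)
    shuffled = list(labels)
    nge = 0                                          # "greater or equal" counter
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        # Small tolerance so floating-point noise does not hide exact ties.
        if q_between(effects, variances, shuffled) >= observed - 1e-12:
            nge += 1
    return observed, (nge + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    # Number of distinct splits for K = 10 studies with R = 4 in the smaller class.
    print(comb(10, 4))
    # Hypothetical data: two classes of three studies each.
    d = [0.10, 0.20, 0.15, 0.90, 1.00, 0.95]
    v = [0.04] * 6
    print(permuted_q_between(d, v, [0, 0, 0, 1, 1, 1]))
```

With only six studies there are just 20 distinct label arrangements, so the smallest attainable significance level is limited, which echoes the small-K limitation noted for the permuted test.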
To ensure the validity of a permutation test comparing samples of two populations requires that both the treatment and the control samples be drawn from the same distribution. According to Good, permutation tests can be employed with mixed subpopulations of heterogeneous data. This was how Kromrey and Hogarty applied these tests.

Further study of the fixed-effects Q test relative to random-effects Q, the conditionally-random procedure and the permuted Q-between test is warranted on the basis of several studies and reviews. As mentioned, Chang's (1993) study illustrates the influence of faulty decisions about homogeneity of effects at the first stage on the computation of magnitude of effect, as power is affected. Subsequently, Harwell's (1997) study of the traditional homogeneity of effects Q test suggests it has unstable Type I and Type II error control under violations of normality, unequal variance and unequal sample size in primary studies. Finding significant influences on Type I error and power, Harwell recommends meta-analysts consider using the random-effects procedure as an alternative to the traditional fixed-effects Q test, particularly under conditions of nonnormality and heterogeneous effects. Based on the performance of the traditional fixed-effects Q under conditions of heterogeneity of effects, unequal primary study variances and nonnormality, both Harwell (1997) and Kromrey & Hogarty (1998) advise against its exclusive use.

These studies provide the context and structure for the present study design. The research questions directing the present study draw their value from these prior conclusions, designs and methodologies. A brief overview of the variables, inferences and methodology incorporated by these studies follows.
Chang (1993) conducts one of the most extensive power analyses comparing fixed-effects Q and random-effects Q using the following variables: normal or noncentral chi-square distributions, variance of parameter effects, the number of studies included in the meta-analysis, the total sample size of a single study and study effect sizes. She also completes a regression to determine the most influential factors contributing to the power of both tests. Lastly, Chang conducts an analysis to determine the extent to which a faulty decision about the homogeneity/heterogeneity at the first stage of meta-analysis results in making a faulty decision to reject or accept the obtained z test value at the second stage.

Harwell (1997) investigates the Type I and Type II error rates of the Q test by controlling the following variables: skewness and kurtosis, variance ratios within primary studies, the number of studies in the meta-analysis, the N of a single study and the study effect sizes (for the specific values for each variable, refer to the table at the end of this chapter). He conducts this test under inferences commonly associated with the fixed-effects model. Harwell further manipulates the positive and negative pairing of the sample sizes and variances. Harwell employs a nominal alpha of .05, a standard power of .8 and Bradley's (1978) criterion for intermediate stringency of +/, determining the number of recommended simulated meta-analyses to be 5,000 (see Robey & Barcikowski, 1992, p. 286). Harwell uses an unspecified random-number generator from the 1986 Numerical Recipes to simulate standard-normal deviates. He replicates the nonnormal distributions by transforming these normal random variates based on the Fleishman (1978) technique, referred to as the Power Method.
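The Power Method mentioned above can be illustrated with a small Python sketch. It assumes the commonly used cubic form of Fleishman's transformation, Y = a + bZ + cZ^2 + dZ^3 with a = -c, and the standard expressions for the variance, skewness and excess kurtosis implied by the coefficients. The coefficient values in the example correspond to a skewness of 1.75 and a kurtosis of 3.75; they are chosen for illustration and are not one of the conditions simulated in the studies described here. All function names are hypothetical.

```python
import random


def fleishman_moments(b, c, d):
    """Variance, skewness and excess kurtosis implied by coefficients b, c, d."""
    variance = b ** 2 + 6 * b * d + 2 * c ** 2 + 15 * d ** 2
    skew = 2 * c * (b ** 2 + 24 * b * d + 105 * d ** 2 + 2)
    kurt = 24 * (b * d
                 + c ** 2 * (1 + b ** 2 + 28 * b * d)
                 + d ** 2 * (12 + 48 * b * d + 141 * c ** 2 + 225 * d ** 2))
    return variance, skew, kurt


def fleishman_transform(z, b, c, d):
    """Map a standard normal deviate z to a nonnormal deviate (a = -c)."""
    return -c + b * z + c * z ** 2 + d * z ** 3


def simulate_nonnormal(n, b, c, d, seed=0):
    """Draw n nonnormal deviates by transforming standard normal variates."""
    rng = random.Random(seed)
    return [fleishman_transform(rng.gauss(0.0, 1.0), b, c, d) for _ in range(n)]


if __name__ == "__main__":
    # Illustrative coefficients (skewness 1.75, kurtosis 3.75); an assumption,
    # not a condition taken from the simulation designs discussed in the text.
    b, c, d = 0.92966052480111, 0.39949667453766, -0.03646699281275
    print(fleishman_moments(b, c, d))
    sample = simulate_nonnormal(10000, b, c, d)
    print(sum(sample) / len(sample))
```

Checking the implied moments before simulating is a cheap safeguard: if the coefficients do not return unit variance and the target skewness and kurtosis, the generated scores will not represent the intended condition.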
Kromrey and Hogarty's (1998) study extends the investigation first conducted by Harwell by controlling the same variables, but furthers the inquiry by comparing Q to two more chi-square distributed tests and four permuted tests. The permuted tests include the permuted version of Q-between, as well as gamma, trimmed d and Cohen's d. In addition to unequal variances, Kromrey and Hogarty (1998) simulate all of the studies using the heterogeneous variance conditions. Like Harwell, these researchers focus on applying tests under inferences consistent with the fixed-effects model. The other major research question this study addresses pertains to the robustness of some common effect size indices: Hedges' g, Cohen's d, trimmed d and γ1.

Keeping the conditions consistent with Harwell's (1997) study, Kromrey and Hogarty vary the same variables and maintain the same values for each level of each variable. Kromrey and Hogarty (1998) use the RANNOR random number generator in SAS to generate normally distributed random variables. A different seed value is used in each execution of the program. Nonnormal distributions are produced by transforming the normal random variates from RANNOR with the Fleishman (1978) technique, referred to as the Power Method. Using SAS/IML version 6.12, Kromrey and Hogarty verify the accuracy of the data analysis by comparing the results to those of the GLM procedure. Type I error rates are computed for all seven procedures examined by Kromrey and Hogarty, drawn from either 1,000 or 5,000 randomly generated samples for each condition of the study. They apply Bradley's (1978) liberal criterion of robustness to assess the Type I error control of each test under the given conditions and determine the proportion of conditions with adequate Type I error control for each test.
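The robustness screening described above can be illustrated with a minimal sketch. It assumes the usual statement of Bradley's liberal criterion, under which an empirical Type I error rate is acceptable when it falls between 0.5 and 1.5 times the nominal alpha. The function names and the example rates are hypothetical and are not taken from Kromrey and Hogarty's SAS programs.

```python
def bradley_bounds(alpha=0.05, tolerance=0.5):
    """Return the (lower, upper) limits alpha * (1 -/+ tolerance)."""
    return alpha * (1.0 - tolerance), alpha * (1.0 + tolerance)


def proportion_robust(type1_rates, alpha=0.05, tolerance=0.5):
    """Share of conditions whose empirical Type I error rate lies within the limits."""
    lower, upper = bradley_bounds(alpha, tolerance)
    within = [lower <= rate <= upper for rate in type1_rates]
    return sum(within) / len(within)


# Hypothetical rejection rates under a true null, one per simulated condition.
example_rates = [0.048, 0.061, 0.024, 0.090, 0.052]
lower, upper = bradley_bounds()
print(f"limits: [{lower:.3f}, {upper:.3f}]")
print(f"proportion robust: {proportion_robust(example_rates):.2f}")
```

A stricter criterion can be screened with the same functions by passing a smaller tolerance.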
Kromrey and Hogarty (1999) also investigate the power, as well as the Type I error control, of both the effect size indices commonly applied in meta-analysis and the tests of homogeneity previously examined in their 1998 study. They employ the same five variables and methodology described in their earlier study. Again, 5,000 simulations are conducted.

The Hedges and Vevea (1998) study is unique in that it deliberately and methodically tests the robustness of the fixed-effects Q, random-effects Q and conditionally-random test under the inferences associated with both fixed- and random-effects models. They control two variables: the magnitude of between-studies variance and K, the number of studies included in the meta-analysis. Given that factors such as nonnormality, within-study variance and within-study sample sizes are not varied, this study provides limited information about robustness issues typical of educational studies.

In addition to concerns raised by the limited robustness of the Q test and other homogeneity tests, concern about the exclusive and indiscriminate use of a single model has been expressed. Initially, several authors urged meta-analysts to consider applying the random-effects model (National Research Council, 1992; Erez, Bloom & Wells, 1996; Abelson, 1997). By doing so, researchers avoid the use of the fixed-effects model as the primary default. However, this initiative was more of a general call for applying models with greater discrimination; it offered nothing more specific about the conditions best suited to one model or the other. Despite the preference for its use, the fixed-effects Q test's sensitivity to heterogeneity of variance and primary study nonnormality makes it suspect (see Harwell, 1997, and Kromrey & Hogarty, 1998). As the traditional Q test is not always appropriate, it is important to investigate alternatives.
Specifically, it is important to know how these operate under dual conditions of heterogeneity of variances and nonnormality, using random sample sizes within and across studies. Before addressing conditions appropriate for use with specific tests, the meta-analytic models are presented in detail.

Distinguishing random- and fixed-effects models. As mentioned previously, the primary features distinguishing the two models are the breadth of the inferences about the sample of collected studies, the degree of variability of the study effects, the number of studies included in the meta-analysis and the treatment of uncertainty. Each of these elements has implications for the other aspects of the model. For instance, as the number of studies increases, so does the potential for expanding generalizability, as well as for increasing variability. In turn, the treatment of the uncertainty about the error is affected.

The breadth of the inferences translates into how variance is characterized. The sort of inferences made refers to whether the sample is considered to be representative of one single, well-defined population or a wider, less clearly delineated universe of several populations. Invariant or predictably varied study characteristics reflect the need for fixed-effects approaches. In contrast, multiple, unidentifiable sources of variance may be best treated using random-effects approaches. Random-effects tests possess a mathematical mechanism for enhancing the sensitivity necessary to accommodate the increased ambiguity inherent in the model: τ², the between-studies variance component. In the random-effects model, true effects originate from a distribution of effects with some variance.
The study effects, δi, vary randomly around one grand mean. There are two sources of variation in the population effect sizes: 1) the variance in population effects parameters in the population distribution of the effect sizes; and 2) the variance in the estimator about the true parameter value for a study (Chang, 1993, p. 26). As a result of the differences in the treatment of variance, the tests corresponding to these two models tend to yield noticeable differences in power. Tests associated with fixed-effects models can produce narrow confidence intervals, as they do not incorporate this between-studies variance (Erez, Bloom & Wells, 1996). In contrast, confidence intervals generated from random-effects models widen as τ² increases.

Many researchers maintain that assuming a fixed-effects model is justifiable only if the null hypothesis of homogeneity of effects is retained and no moderating variables are suspected (Chang, 1993; Matt & Cook, 1994; Erez et al., 1996). Some would argue there is no purpose in combining studies with little or no variability in treatment administration. Either no additional information is contributed (Erez et al.) or the between-studies variance (τ²) is considered to be trivial or nonexistent.

Another consideration related to variability across study effects pertains to the number of available studies. Raudenbush (1994) suggests the decision to employ fixed- or random-effects be predicated, in part, on the number of available studies. His reasoning stems from the concern that random-effects estimation is not as precise with small numbers of studies. In fact, others have produced evidence suggesting the sensitivity of certain homogeneity tests depends upon the relationship between large k and small N (Chang, 1993; Harwell, 1997; Hedges & Vevea, 1998; Kromrey & Hogarty, 1998).
Raudenbush further recommends that meta-analysts account for unidentifiable numbers of potential moderators of a true effect(s) by treating the true effect(s) of a series of studies as random. In general, the potential for a greater number of moderators increases as a meta-analysis incorporates more studies. The validity of the random-effects (RE) summary relies on both accurate estimation of τ² (denoted by other symbols in some sources) and an adequate number of studies. Moreover, valid generalization is predicated on a clearly-defined population (Raudenbush, 1994). If τ² = 0, there is a common effect; equivalently, the conditional variance of d equals the unconditional variance of d. τ² is the variance of the population distribution of effects. If homogeneity between effects is not present, then a reviewer can categorize the effects by group, testing each group for homogeneity of effects. Essentially, the fixed-effects model assumes the presence of one universal effect. In contrast, the random-effects model assumes each treatment produces its own effect and is derived from a universe of similar, but distinct, treatments. For this reason, the effects are most accurately modeled as a distribution of true effects.

A final consideration involves the modeling of uncertainty. In the fixed-effects model, there is one source of uncertainty, concerning participant sampling: within-group sampling error, with corresponding variance σ². Conversely, the random-effects model includes three sources of uncertainty: within-group sampling error (σ²), the random effect of the study, and the interaction between the treatment and the study. Winer (1971) contends it is imperative that experimental procedures closely reflect mathematical models to ensure valid prediction of experimental results. In other words, sampling methods must be accurately expressed in the algorithms used to model the variables being investigated.
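The contrast drawn above between narrow fixed-effects intervals and random-effects intervals that widen with the between-studies variance can be illustrated with a minimal sketch. It assumes the standard inverse-variance weighting, with weights 1 / v_i for the fixed-effects model and 1 / (v_i + tau2) for the random-effects model, and it treats tau2 as known rather than estimated. The function name, effect sizes and sampling variances are hypothetical; the three tau2 values are those used in the present study design.

```python
import math


def pooled_estimate(effects, variances, tau2=0.0, z=1.96):
    """Inverse-variance pooled effect and its confidence interval.

    tau2 = 0 gives fixed-effects weights 1 / v_i; tau2 > 0 gives
    random-effects weights 1 / (v_i + tau2).
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)  # standard error of the pooled effect
    return mean, mean - z * se, mean + z * se


# Hypothetical effect sizes and sampling variances for five studies.
effects = [0.10, 0.35, 0.62, 0.80, 0.45]
variances = [0.05, 0.04, 0.06, 0.05, 0.04]

for tau2 in (0.0, 0.33, 1.0):
    mean, low, high = pooled_estimate(effects, variances, tau2)
    print(f"tau2={tau2:.2f}  mean={mean:.3f}  CI width={high - low:.3f}")
```

Because every weight shrinks as tau2 grows, the standard error of the pooled effect, and with it the interval width, can only increase.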
The meta-analyst must have a clear conception of the degree to which the collection of studies represents a specified population. If significant heterogeneity is present, it is a fairly clear indication that a single treatment is not responsible for the resulting effect. Because the influence of error is accounted for in the expression and analysis of the random-effects model, it has been argued that this model is more consistent with most other statistical methodologies than the fixed-effects approach (Erez et al., 1996). As mentioned earlier, the fixed-effects approach assumes a uniform model for any given set of studies. Both Erez et al. (1996) and Harwell (1997) encourage meta-analysts to utilize random-effects models in an attempt to discourage the inappropriate use of fixed-effects statistics. The National Research Council (1992) also recommends that reviewers apply the random-effects model with greater frequency to avoid the more restrictive assumptions underlying the fixed-effects model.

Model selection. Model selection is usually determined by either the tenability of the assumption of homogeneity of study effects (Chang, 1993; Erez, Bloom & Wells, 1996) or the researcher's theoretical inferences about the relationship of the sample to a population (Hedges & Vevea, 1998). As will be discussed, model selection bears important implications for the power of the test and its sensitivity. For this reason, and because inferences are based solely on hypothetical judgments rather than certain truths while a homogeneity test is probabilistic rather than absolute, equal consideration of both is necessary. Such a decision does not nullify the influence of theoretical judgments when variables are interpreted as being sampled from either a larger population of studies or a fixed and clearly defined population (Shadish & Haddock, 1994). As specified in chapter 1, meta-analysis involves a two-stage process.
First, the reviewer attempts to determine whether effects are equal (or whether there is variance between study effects). In other words, is there a common population effect size? The second hypothesis refers to the question of whether the true effect is greater than zero. If the homogeneity assumption is maintained, is the common effect size equal to zero? Or, to put it another way, does the treatment have a significant, nonrandom effect on the population to which it was administered? These hypotheses are expressed as follows:

H01: δ1 = δ2 = ... = δk (There is no difference between study effects.)
H02: δ = 0 (The common effect size is equal to zero.)

The models being discussed differ primarily in how they characterize the uncertainty of the variance across study effects. With respect to the initial selection of either fixed- or random-effects models, inappropriate model use at the first stage bears consequences for the decisions made at the second stage of the process (Chang, 1993; Hedges & Vevea, 1998). Specifically, Chang finds that when homogeneity of effects is falsely rejected, the application of a z test at the second stage correlates with more inflated Type I and Type II error than if the same assumption is falsely maintained. This result is especially affected by a large number of studies, each with small sample sizes. Hedges and Vevea note similar outcomes using unconditional inferences (typically associated with the random-effects model). When a fixed-effects test is applied, inflated Type I error will ensue unless there is perfect homogeneity of effects. Under such conditions, the random- and conditionally-random procedures provide results closer to the nominal level than fixed-effects tests. Given the importance of the decision and the differences inherent in each model (both in terms of the inferences made and the partitioning of variance), we turn our attention to the issue of how model selection is presently being conducted in the field of meta-analysis.
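The first-stage test of homogeneity described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure coded for any of the studies reviewed: it uses the usual fixed-effects statistic Q = sum of w_i (d_i - d_bar)^2 with w_i = 1 / v_i, referred to a chi-square distribution with k - 1 degrees of freedom, and a common method-of-moments estimator of the between-studies variance. The function names and the data are hypothetical; the critical values are standard tabled chi-square values at alpha = .05.

```python
# Tabled upper 5% points of the chi-square distribution, keyed by degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def q_statistic(effects, variances):
    """Fixed-effects homogeneity statistic Q, its degrees of freedom and the weights."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total
    q = sum(w * (d - mean) ** 2 for w, d in zip(weights, effects))
    return q, len(effects) - 1, weights


def moment_tau2(effects, variances):
    """Method-of-moments estimate of between-studies variance, truncated at zero."""
    q, df, weights = q_statistic(effects, variances)
    total = sum(weights)
    c = total - sum(w * w for w in weights) / total
    return max(0.0, (q - df) / c)


# Hypothetical data: four studies with equal sampling variance.
effects = [0.0, 0.8, 0.0, 0.8]
variances = [0.04, 0.04, 0.04, 0.04]

q, df, _ = q_statistic(effects, variances)
reject = q > CHI2_CRIT_05[df]
print(f"Q={q:.2f}, df={df}, reject homogeneity: {reject}")
print(f"estimated tau2={moment_tau2(effects, variances):.3f}")
```

If homogeneity is rejected, the estimated tau2 can be added to each sampling variance to form random-effects weights for the second-stage test.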
Two schools of thought present distinctly different rationales for addressing the decision of model selection. Conventional statistical wisdom suggests that tools are selected based first on the theoretical purpose and later modified by the data's distribution characteristics (Tukey, 1969; Cohen, 1990). In meta-analysis, there appears to be a more distinct and less integrated process adopted by each of two camps, focusing primarily on either theory or data-analytic concerns. One group of synthesists demands a priori selection of a model based on inferences formulated from theoretical knowledge of the sample of studies and the degree to which it represents the population. The other group employs the test of homogeneity of effects to determine the appropriateness of the model, based on how likely the model is to describe truly the variability across sample effects, an indicator of the presence of one or several population effects. It is noteworthy that Kromrey and Hogarty (1999) conclude that the traditional Q test of homogeneity of effects has limited, if any, statistical power, rendering it a poor instrument for this purpose.

The opposing viewpoints are captured in the following. Abelson (1997) posits: "Empirically, considerable heterogeneity of effect sizes is quite often found in meta-analyses. We can argue abstractly all we want, but in the end, we must attend to behavior of our methods when confronted with real data ... The assumption of constant true effect sizes is rarely sustainable" (p. 1234). In contrast, Cooper and Hedges (1994) state: "Conceptual criteria would be applied first, with the model chosen according to the goals of the inference. Only after considering the object of the inference would empirical criteria influence modeling strategy" (p. 526). Lix and Keselman (1998) state: "the researcher needs to be clear on the goals of data analysis prior to choosing a particular method of statistical inference" (p. 411).
In the first camp, inferences about the population are restricted to the group of values of predictor variables represented in the sample. Generalizations about treatments apply to similar treatments, even if not controlled in the study. Inferences are based only on studies collected in the sample. Hypothesis testing relates only to the present collection. Generalizing beyond the collection is possible only subjectively. Alternatively, the collection of studies may be viewed as a result of chance. Generalizing beyond the immediate set of studies is necessary in setting up inferences about the sample and is done statistically. Population values of effect are random samples from a distribution of effects.

Other factors used to distinguish meta-analytic models are the data conditions under which the statistics operate most effectively (Keselman, Huberty, Lix, Olejnik, Cribbie, Donahue, Kowalchuk, Lowman, Petoskey, Keselman, & Levin, 1998), the implemented sampling procedure and the number of studies incorporated in the meta-analysis. In these cases, data-analytic goals are clarified, followed by the selection of a statistical method (Lix & Keselman, 1998). In contrast, another perspective asserts that theory directs the use of a particular sampling procedure and the parameters used to define the extent to which a sample represents the population (Serlin, 1987). Rasmussen and Dunlap (1991), Raudenbush (1994) and Hedges and Vevea (1998) agree that the decision to apply one model over another hinges, in part, on the number of studies to be included in the meta-analysis. Smaller numbers of studies would be more validly analyzed using fixed-effects. In contrast, larger numbers of studies would permit valid generalization to a large possible universe or population. Both the theoretical and the data-analytic perspectives overlook the tentativeness and reiterative nature of the scientific process.
As McCullagh and Nelder (1983) assert, one never knows with certainty whether a model is accurate. In the case of the theoretical camp, one cannot be certain whether the selected studies are inclusive of a single population or multiple populations. Those in the data-analytic camp are initially applying a test based on a fixed-effects model's partitioning of variance. Further, this test has been shown to have little, if any, power to detect true differences in study effects (Kromrey & Hogarty, 1999). As the divergence between these meta-analysts' approaches suggests, the choice between fixed and random model use is debatable because it pertains both to the nature of scientific inquiry and to the type of data deemed applicable for answering research questions (Hedges, 1994).

A similar dichotomy exists in the field of measurement, pertaining to generalizability theory. When conducting a D-study for purposes of drawing data for decision-making, the set of measurement conditions is treated as either fixed or random. Specifically, when these conditions are viewed as fixed, the intent is to restrict generalizations to those conditions appearing in the study. However, treating the set of conditions as random involves treating them as a sample from a broader universe of conditions, to which inferences will be made.

Though it is a process of quantifying observed objects, measurement, and ultimately model specification, is a system of human abstraction. Nunnally (1967) emphasizes that measurement (the rules for assigning numbers to objects to represent quantities of attributes) relies on a system of abstraction to measure attributes of objects or people (p. 2). If a measure is intended to fit a set of axioms for measurement (a model), the closeness of the fit can be determined only by the extent to which relations in empirical data meet the requirements of the model (p. 8).
The underlying implication is that measurement and model development are subjective and imperfect endeavors. The extent to which a sample represents a population is predicated on the sampling procedures, the degree of disparity in the primary studies' administration of the treatment, and the modeling of the uncertainty. Meta-analysts first establish their theoretical inferences by determining the criteria used for primary study selection, as random sampling is not typically possible. Serlin (1987) suggests, "Theory must guide the selection of a sampling procedure and theory must determine in what ways the sample should be representative" (p. 366). Initially, the reviewer must establish a criterion for the selection of primary studies. This criterion dictates the characteristics of the sample of accumulated studies. In turn, sampling procedures influence the characteristics of the sampling distribution of the statistic.

When one attempts to analyze a set of data with tools selected solely on the basis of a desired set of inferences, tools and data are not necessarily compatible. Therefore, the tools may not generate a valid analysis. Any random sample bears a unique set of distribution characteristics (Keselman et al., 1998). Independent of sampling, Keselman et al. explain that "Every inferential statistical tool is founded on a set of core assumptions. As long as the assumptions are satisfied, the tool will function as intended. When the assumptions are violated, however, the tool may mislead" (p. 351). As a result of any mismatch between the sampling procedures, in particular, and the population of interest, the sample data can possess unexpected distribution characteristics. To promote optimal model selection, it is necessary to understand the principles defining a well-specified linear model.
The value of a model lies in its ability to promote the summary of data through a differentiated presentation of systematic effects and of the nature and magnitude of random variation (McCullagh & Nelder, 1983). Furthermore, a good model is characterized by fitted values which minimize some criterion (either a discrepancy or closeness measure); parsimony of parameters included in the model; and scope. Finally, they explain that value lies in describing not just the systematic variation of the data under immediate investigation but patterns potentially occurring in future data of a similar nature. Based on these objectives, a presentation of the process initiated with model selection and culminating in sensitivity analysis follows.

First, a model can be defined as a symbolic representation of the elements specifying the relationship between the xs and ys, including the characterization and partitioning of systematic and nonsystematic variance. The statistics are employed as the vehicle through which one makes probability statements about the viability of each of those elements in relationship to each other as expressed by the model. In other words, the statistic is a mathematical abstraction of the empirically-based model (Stevens, 1968). The model is based on the researcher's theory-based perceptions of how the variables relate to each other. Theories are hunches about some phenomenon based on observations (data). Furthermore, the scaffolding for these definitions is based on former researchers' hunches. The data are collected based on operational definitions and measures of some behavior (these too are developed based on the perception of the researcher). And the statistics used to analyze the data are constructed on the basis of a set of assumptions about the data.
Keeping in mind the connection between the nature of model development and the origin of the measures and statistics, while reviewing the process of their development and utilization, may illuminate the influence each contributes in conducting tests of parameter estimation, as well as in the subsequent interpretation of data analyses. Finally, the influence of previously unspecified moderating variables changes the dynamics of the relationship captured by the model.

Model selection is the first step in a process of establishing a model for use in identifying statistics appropriate for testing the relationship characterized by the model. It is the determination of a general class of models. Secondly, model specification is conducted for explanatory meta-analysis. Most meta-analysts treat the moderator variables as if they were causes of differences in effect sizes. For this purpose, model specification is a strong assumption. Having the right regressor or moderating variables and the correct functional form (linear or nonlinear, and/or nonadditive) follows model selection. Model specification, as defined by Hedges and Olkin (1985), is an expression of the way in which estimates of regression coefficients approach their respective population values (p. 172). They further explain that the estimates of coefficients in linear models are consistent when the variables that actually determine the dependent variable are included in the model (p. 172). Model fitting is the third step, that of estimating the regression coefficients. Model fitting involves generating estimates of the parameters and checking the residuals by way of sensitivity analysis. Model fitting subsumes model checking. Model fitting refers to the determination of the distance or closeness of the theoretical values (derived through the use of the model and observed outcomes) from the representation of the linear relationship of the observed ys and the selected covariate xs.
According to Hedges (1994), specification of the population directs how the synthesis results will be interpreted. The relationship between theory and sampling makes model selection inextricably related to the sensitivity of the statistical test, as sampling produces the distribution characteristics. To properly interpret the goodness of a model, one needs a fair and representative sample of the population, as well as a statistic that effectively filters the noise while remaining sensitive to the detection of the signal.

Theoretical assumptions about the population guide the sampling procedure. Determining from where the sample is to be drawn requires the a priori demarcation of the parameters of the population. The sampling distribution of the statistic is based, in large part, on the distribution of the sample. In turn, the statistic is used to make a decision about the likelihood of the relationship between sample and population. Presumably, the statistic reflects the principal theoretical underpinnings of a given model. The assumptions are essentially the rules for use, employed because they are the conditions under which the statistic is best suited to detect true variance.

Ideally, a statistical test would possess two primary features. The α, or significance level, of this statistic would equal zero and the power, 1 - β, would equal 1 or 100%. However, without perfect knowledge about the true state of the conditions being investigated, confirming these parameters is impossible (Good, 1994). A statistic is unbiased when its average (derived from multiple samples) equals the parameter it estimates. As Winer (1971) explains, "One criterion for the goodness of a statistic as an estimate of a parameter is lack of bias. A statistic is an unbiased estimate of a parameter if the expected value of the sampling distribution of the statistic is equal to the parameter of which it is an estimate" (p. 7).
Unbiasedness refers to a feature of both the sampling distribution and the statistic. Efficiency refers to how precisely the statistic estimates a parameter. That is, the more precise an estimate, the narrower the confidence interval or the smaller the standard error. The third property to be considered when evaluating estimators concerns the consistency of each estimator's successive approximation to the parameter as sample size increases. As the sample size increases, a consistent estimator, regardless of its degree of bias, will more closely approximate the parameter's value. As variance across study effects diminishes, the precision of the population effect estimate increases.

As violated assumptions can result in misinterpretation of statistical results, application of a corresponding model statistic that fails to parallel the parametric conditions will lead to faulty interpretations of the results. Assumptions are the conditions for appropriate use of inferential statistics (Keselman et al., 1998). Similarly, when a statistic reflecting one model's treatment of the variance of effects does not parallel the reality of the data's population derivation (either from a single population or from a distribution of population effects), a faulty interpretation results (Becker, 1994). With respect to meta-analytic model selection, Erez et al. (1996) point out that homogeneity of effects rarely occurs, due to the variety of constructs studied in psychology and other social sciences. Moreover, certainty about the inclusion of all relevant studies is not possible (Glass, McGaw & Smith, 1981; Abelson, 1997). In the presence of uncertainty, wherein one pools effects and derives estimates of these, the possibility remains that a variety of effects impacts these estimates via an unreported moderating variable (Mulaik, Raju & Harshman, 1997). For this reason, some argue against the assumption of homogeneous effects (Abelson, 1997).
Others (Hedges & Vevea, 1998) state that heterogeneity alone is not a reason to avoid using fixed-effects. Rather, if a limited set of studies is drawn and inferences are restricted to that set, selecting a fixed-effects model is the only reasonable option. Ultimately, it is important to understand how the model contributes to test sensitivity.

According to Good (1994), sensitivity analysis is the process of developing methods which minimize the discrepancy between p values and the nominal p values under a variety of distributions while maintaining high efficiency and stringency over an array of circumstances. A test's power is determined by how likely it is to detect true differences between populations. The sensitivity of one test is compared to that of the others (Good, 1994).

Hedges and Olkin (1985) have described how failure to differentiate systematic variation from estimates of error undermines the sensitivity of the statistical test for systematic variation. In fact, the extent to which the systematic variance is partitioned from the nonsystematic variance, the random error, determines the sensitivity of the test and the extent of bias. Because the test is devised to correspond to and test the underlying assumptions of the model, higher Type II error rates result from misspecified variance in the model. An example of the effect of poor model specification on test sensitivity lies in the fixed-effects model and the Q test. As a result of the fixed-effects model's failure to differentiate between-studies variance from sampling error, the fixed-effects Q possesses little, if any, power to detect heterogeneous effects. As will be described, Kromrey and Hogarty (1999) find the Q test has no practical utility for detecting true differences in study effects, as its power is low when applied to data with heterogeneous variances. It also evidences unstable Type I error control. For a test to be useful, it needs to control both Type I and Type II errors.
For these errors to be well controlled, the statistic requires proper partitioning of the systematic variance from the nonsystematic variance. Kromrey and Hogarty (1998) also find that the iterated Q (Hedges & Olkin, 1985) offers no appreciable improvement in Type I error control under similar conditions.

The literature supplies little explicit attention to determining the necessity for, or superfluity of, direct model-to-statistic correspondence. This consideration is important and bears directly on the model selection process. More discussion is warranted, but it is beyond the purview of the present study. The understanding permitted by simulating and reporting the effect of common data conditions on the four tests' sensitivity will provide an apples-to-oranges knowledge of the relative sensitivity of each. Realizing the elusive nature of Truth, it is valuable to identify actions clarifying the relationship, if any, between applied models and statistics.

Given the numerous points for misinterpretation that arise from the subjectivity involved in each process, from model development and model specification to test selection, maintaining a flexible, open-ended approach to model selection and test use is crucial to the integrity of the overall scientific endeavor. Similarly, McCullagh and Nelder (1983) recommend an iterative process whereby model selection is followed by model checking, reiterated as long as necessary until the best plausible model is identified and verified (model fitting), before summarizing the results and drawing conclusions. There is a need for ongoing model checking, while maintaining flexibility in model selection. The selection process would be best guided by treating both inference goals and data conditions with equal consideration. Returning to the objective of Science referred to at the beginning of this chapter: it is to develop more heuristic problems, not to suggest certainty of a single theory (Popper, 1968).
That is, the problems or hypotheses incorporate more overriding factors, not merely the particulars of a single context. As Kuhn (1962) suggests, the job of scientists is to magnify the scope and precision of ordering a system of understanding. Due to the need for flexibility in model development and implementation, as well as the many factors contributing to test sensitivity, it is most important to maintain an approach of informed vigilance, as opposed to adherence to a single model and test. But one cannot place too much emphasis on inference, because rigid use of a statistic results in a loss of flexibility in data analysis (Cohen, 1990, p. 1310). In an effort to support well-informed, context-based decisions of meta-analytic test use, some general concerns related to the minimization of bias in homogeneity tests will be presented in the following section. The process of refining model-to-statistic correspondence will gain greater clarity by understanding conditions where Type I error is minimized and power is increased, as well as by becoming knowledgeable about the conditions that increase the precision of tests while improving model fit. Such knowledge will facilitate future decision-making about the tenability of models. To this end, this study attempts to refine our knowledge of the application of four tests of homogeneity to specific data conditions. By now, it should be clear that any judgments made about the adequacy of one statistic relative to another are predicated on the specific model and data conditions under consideration. Additionally, once distribution assumptions of a statistical procedure are violated, as is often the case, it is useful to know the impact on subsequent performance of these statistics. In this way, applied researchers can better assess the extent to which such analyses generate valid results (Keselman, Huberty et al., 1998).
Implications for Test Selection

Because the problem of interest, in meta-analysis as in all data analysis, is one of isolating and controlling sources of artificial variation across studies (Cooper & Hedges, 1994), it is important to recognize all possible sources of variation, including sampling error, true variance due to treatment differences, variance due to differences in populations and bias introduced by the test. The question becomes one of determining whether the variance across study effects is attributable to sampling error or to something meaningful, the true variance (either due to differences in treatment or differences in the treatment's effect on multiple populations). Depending upon this initial determination of variance, the researcher will either pool the effects or develop a regression model to classify the remaining variance. It should be noted that in the case of the permuted Q-between, one starts out by testing to see if the average effects across classes are equivalent. The moderator variables have been identified a priori. Using tests that minimize bias is integral to the validity of the analysis. As described earlier, bias contributes to Type I and Type II error, wherein the investigator fails to accurately estimate the parameter(s) of interest, thereby drawing faulty conclusions about the accuracy of the null hypothesis. In general, larger sums of squares are associated with smaller standard errors of the regression coefficients. So extreme values for X enhance statistical significance tests (Pedhazur, 1982). Furthermore, when unequal error variances go unchecked, parameter estimates have large standard errors, though appearing unbiased, resulting in diminished precision and tests with low sensitivity (Chatterjee & Price, 1977). The precision of an estimator is generally measured by the standard error of its sampling distribution (Winer, 1971). A smaller standard error enhances the precision of the estimate.
For clarification, a description of the true states of reality assumed by each error type follows. The Type I error assumes the null hypothesis is false, even though it is true (there is no difference between effects and there is one true effect, but one rejects it instead). Underlying a Type II error is the assumption that the null hypothesis is true, when it is, in fact, false (there is a true difference between effects, but one accepts the false null). In general, power bears a positive concomitant relationship to sample size, effect size and α. As these three variables increase, power increases. When assumptions are violated or certain factors are present, several researchers (Chang, 1993; Harwell, 1997; Hedges & Vevea, 1998; Kromrey & Hogarty, 1998) find that both the traditional fixed-effects and random-effects Q tests are subject to inflated Type I error. Inflated Type I error is especially problematic because it suggests good power, though power values are likely to be artificially enhanced (Chang, 1993). With respect to many tests of homogeneity, the null hypothesis states there is a uniform treatment effect or there is no significant between-studies variance. A difference suggests the possible presence of a true treatment effect. At the meta-analytic level, the researcher is interested in identifying differences, if any, between groups from one study to the next. Significant variation across studies indicates that multiple, not single, effects may have been manifested. Multiple effects do not render meaningful generalizations, unless moderator variables are isolated and analyzed. Now discussion is turned to the conclusions drawn in the literature about the specific conditions and their influence on the control of Type I and Type II errors for each test of homogeneity under present investigation.
Additionally, those conditions and tests requiring further study will be presented as identified by the authors of the previous research or as indicated by the omission of variables in those studies. As revealed by the Review of Educational Research survey mentioned previously, Q is presently used as the common default. Therefore, factors influencing its use will be discussed first, so as to prepare the reader for the conclusions drawn about the comparative performance of the three alternative tests examined in this study. Before proceeding to the specific conditions for test application, general information about some factors relevant to the use of all tests of homogeneity is presented. Typically, data collected from educational and psychological settings are characterized by skewed distributions (Lix & Keselman, 1998). Distortions in data analysis most often arise when skewness or kurtosis is greater than 1 in absolute value (Wang, Fan & Willson, 1996). Most traditional homogeneity tests lack robustness to violations of normality to begin with, resulting from the relationship between the sampling variance of the sample variance and the kurtosis of the population (Raudenbush & Bryk, 1987). The presence of both skewness and kurtosis exerts greater influence on an analysis than either distribution trait alone (Wang, Fan & Willson, 1996). Normality is an important consideration due to the implications for the control of Type II errors (Keselman et al., 1998; Wilcox, 1995). Kromrey and Hogarty (1998) find that sampling distributions of most mean difference indices tested reflect increases in positive skewness when samples are generated from populations with heterogeneous variances. Though the standardized mean difference typically reveals the degree of overlap between distributions of experimental and control group scores, it cannot provide a valid summary if data are not normally distributed.
As Winer (1971) describes, parameters provide a description of the population distribution. The frequency distribution determines the number of parameters needed to depict the population. A normal distribution requires two parameters, whereas skewed distributions require more (Winer, 1971). Only a monotonic transformation of the data can permit an unbiased estimator to adequately estimate the effect (Hedges & Olkin, 1985). In general, increasing the N, using random sampling, mitigates the effects of a nonnormal distribution (Snedecor & Cochran, 1989). There is a corresponding increase in power as N increases. (Note: there are circumstances under which increasing the total N does not convey protection against the effects of skewed distributions.) Chang (1993) elaborates this point: "Using the asymptotic theory to obtain power for homogeneity test would give conservative power estimates for data with small samples or nonnormal population effects" (p. 59). Moreover, increased sample sizes contribute to improved accuracy, diminishing the effects of random error. In a related manner, Kromrey and Hogarty (1998) find: "All of the indices, as expected, showed a decrease in sampling variation with increasing sample size ... bec[oming] more symmetric and more mesokurtic with larger samples" (p. 8). As the estimates of population effect incorporate effect size indices and these are the data upon which the tests of homogeneity are conducted, understanding the sensitivity of the effect size index is important when considering the extent to which homogeneity tests remain robust under various conditions. Heterogeneous variance tended to result in increases to the mean and variance of Hedges' g and Cohen's d (Kromrey & Hogarty, 1998). The effect-size index, Hedges' g, upon which the fixed-effects Q test is computed, operates on the assumption that experimental and control group data are normally distributed (Hedges & Olkin, 1985).
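To make the effect-size indices concrete, the following is a minimal sketch of a standardized mean difference computed from raw scores. It assumes the naming convention of Hedges and Olkin (1985), in which g divides the mean difference by the pooled standard deviation and d applies the common approximate small-sample correction 1 - 3/(4m - 1), with m the pooled degrees of freedom. The function name and the sample scores are illustrative only; other authors attach the labels g and d differently.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(treatment, control):
    """Return (g, d) for two lists of raw scores.

    g: mean difference divided by the pooled standard deviation.
    d: g multiplied by the approximate small-sample correction
       1 - 3 / (4 * m - 1), where m = n_t + n_c - 2.
    (Labels follow the Hedges & Olkin convention; this is an assumption.)
    """
    n_t, n_c = len(treatment), len(control)
    m = n_t + n_c - 2  # pooled degrees of freedom
    pooled_var = ((n_t - 1) * variance(treatment)
                  + (n_c - 1) * variance(control)) / m
    g = (mean(treatment) - mean(control)) / math.sqrt(pooled_var)
    d = (1 - 3 / (4 * m - 1)) * g
    return g, d


# Hypothetical scores: both groups have sample variance 4, means 4 and 3.
g, d = standardized_mean_difference([2, 4, 6], [1, 3, 5])
```

With such small groups the correction is substantial (d is 80% of g here), which is why the correction matters most for small primary studies.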
Kromrey and Hogarty explain that standardized mean difference indices are especially susceptible to heterogeneous variances in that they exhibit increases in positive skewness. Moreover, there is a concomitant relationship whereby increases in population distribution skewness seem to be accompanied by increases in the mean effect size for g and d, as well as increases in the variance of these indices. Kromrey and Hogarty (1998) conclude there is no major difference in Type I error control from one effect size estimate to the next. Similarly, Kromrey and Hogarty (1999) find few differences in the power of three commonly used effect size indices, g, d and trimmed-d. Trimmed-d provides the most power when applied to data from nonnormal distributions. Hedges' g presents the most power with normal distributions. With respect to significance tests, few understand the role of sample size (Abelson, 1997). Sample size often determines the extent of variance. Typically, a smaller sample size is associated with greater variance. Again, the larger the sample size, the greater the power of a test. As already mentioned, large sample theory suggests that large samples mitigate the influence of skewed distributions. For purposes of this study, four issues pertaining to sample size are of present interest: total size of the experimental and control groups, differences in within-study groups of the sample, the order of the discrepancy, if any, in size between the first and second groups within a study, the size of the first and second groups relative to the size of the variance between these two groups and the ratio of sample size to k. Large differences in sample size across studies contribute to heterogeneous error variances (Hedges & Olkin, 1985). Hedges and Olkin (1985) describe: In fact, the nonsystematic variance of estimates of effect is inversely proportional to the sample size of the study on which the estimate is based.
Therefore, if studies have different sample sizes, as is usually the case, effect estimates will have different error variances. If the sample sizes of the studies vary over a wide range, so will the error variances (p. 11). If considerable heterogeneity is present, procedures permitting explanatory analyses should be selected. Those studying the behavior of the homogeneity tests tend to use equal sample sizes, though this condition rarely occurs in the actual literature. According to Harwell (1992, 1997), equal sample sizes minimize the possibility of inflated Type I error for t-tests with unequal variances, provided the score distribution is normal. In general, increasing the sample size minimizes the influence of random error. However, as mentioned above, skewed distributions can exacerbate unequal variances even under conditions of equal group sample sizes. Harwell (1997) found equal sample sizes between experimental and control groups act as a partial safeguard against Type I error inflation when distributions are normal and variances are unequal. But equal sample sizes fail to provide any safeguard against inflated error rates in the presence of combined unequal variances and skewed distributions. It would seem that larger variance, regardless of the larger sample size, increases Type I error, as it pertains to chi-square distributed statistics. This pattern does not reflect the tendency of t-tests. As Glass and Hopkins (1984) suggest: "the true probability of a type I error is always less than the nominal probability when the larger n and larger variance are paired" (p. 238). Therefore, the robustness of the t-test cannot be assumed for all of the tests of homogeneity. How well the systematic variance is partitioned from the nonsystematic (random) variance, the error, determines the sensitivity of the test and the extent of bias. Between-studies variance is the systematic part of the variance.
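The inverse relationship between sample size and error variance quoted from Hedges and Olkin (1985) above can be illustrated with the usual large-sample approximation to the variance of a standardized mean difference. This is a sketch only: the formula is the standard asymptotic one, while the function name, the effect size of 0.5 and the per-group sizes are illustrative assumptions rather than values taken from the studies reviewed.

```python
def smd_sampling_variance(d, n_t, n_c):
    """Large-sample variance of a standardized mean difference d
    based on group sizes n_t (treatment) and n_c (control)."""
    return (n_t + n_c) / (n_t * n_c) + d * d / (2 * (n_t + n_c))


# Illustrative effect size and equal per-group sizes (assumed values).
variances_by_n = {n: smd_sampling_variance(0.5, n, n) for n in (10, 40, 200)}
for n, v in variances_by_n.items():
    print(f"n per group = {n:3d}  error variance = {v:.5f}")
```

Studies with 10 participants per group carry roughly twenty times the error variance of studies with 200 per group, so a meta-analysis mixing such studies will have markedly heterogeneous error variances even when the true effect is constant.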
τ² (between-studies variance) is a more comprehensive estimate than between-class effect size differences, in that it represents the variance in the true effect sizes of different implementations of the treatment. Beyond any moderating variables, part of the systematic error is due to aspects of treatment logistics, as well as the time and location of the measurement of the effect of such treatment on a given performance. The δk (between-class effect size differences) can be employed as an explicit test for a moderating variable. The systematic difference in average effect size is broken down by groups of studies, grouped according to some variable. For example, it is the difference between the average effect size when given to first graders and the average when given to third graders, otherwise referred to as δk. Harwell (1997) suggests that a δk of 0 produces results similar to those demonstrated by Hedges and Olkin (1985), using a δk = .25. The δk = 0 case permits the estimation of Type I error rates. And δk = 1 permits the estimation of power. No other specific conclusions are presented with respect to between-class effect size differences for each of the chi-square and permuted tests. Generally, increasing population effect size results in an increase of sampling variability within the sampling distributions of the effect sizes (those based on mean differences), thereby increasing positive skewness and leptokurtosis (Kromrey & Hogarty, 1998). In fact, a single study effect size can significantly influence meta-analytic results (Fleiss & Gross, 1991). Within a given study, if the treatment and control groups have different population variances, then the size of the population variance and the sample size of each group have implications for Type I error control. Groups with larger variance and smaller sample sizes represent a negative pairing. Positive pairings consist of smaller variance and smaller sample size.
Positive pairings will yield a conservative test and negative pairings will yield a liberal test. In the latter case, Type I error becomes even more inflated than in the presence of equal sample sizes and unequal variances (Harwell, 1997; Kromrey & Hogarty, 1998). When negative pairings are combined with variance ratios of 4:1 and 8:1, all of these estimated Type I error rates are inflated well beyond .05. Large sample sizes do not seem to neutralize this effect. By contrast, positive pairings with the same variance ratios tend to yield conservative Type I error rates (Harwell, 1997; Kromrey & Hogarty, 1998). Unequal sample sizes, where the group with the larger population variance is paired with smaller sample sizes than the second, help to maintain better Type I error control. With respect to the t-test, Glass and Hopkins (1984) explain that it is robust to heterogeneity, as long as the sample sizes from the two groups are equal. In the t-test case, pairing a larger set of ns with a smaller variance results in an underestimation of the true alpha at .05. So we have conservative Type I error, associated with Type II error. But when the sample effects of the group with the smaller sample sizes also come from a population with smaller variance, the Type I error rate quickly increases as the n ratio gets smaller (the first n is smaller). When sample sizes are equal, there is homogeneity for all practical purposes. Now that factors relevant for the use of all homogeneity tests have been presented, results from studies examining the specific conditions governing the efficient use of each test are elaborated. It should be noted that not all factors considered for one test have been controlled in the study of every other statistic. Therefore, certain potentially influential factors are not addressed in this section, but are proposed later.
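The pairing effect that Glass and Hopkins (1984) describe for the t-test can be reproduced with a small Monte Carlo sketch. It assumes normal scores, a true null hypothesis, a 4:1 variance ratio and group sizes of 10 and 30; these settings, the seed and the function names are illustrative choices, not the design of any study cited here. The hard-coded critical value applies only to 38 degrees of freedom.

```python
import math
import random
from statistics import mean, variance

T_CRIT_DF38 = 2.024  # two-sided .05 critical value of t with 38 df


def pooled_t(x, y):
    """Two-sample t statistic using the pooled variance estimate."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * variance(x)
                  + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / n_x + 1 / n_y))


def rejection_rate(n_x, sd_x, n_y, sd_y, reps=2000, seed=1):
    """Proportion of rejections when both population means are 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, sd_x) for _ in range(n_x)]
        y = [rng.gauss(0.0, sd_y) for _ in range(n_y)]
        if abs(pooled_t(x, y)) > T_CRIT_DF38:
            rejections += 1
    return rejections / reps


# Negative pairing: the smaller group has the larger variance (4:1).
negative = rejection_rate(10, 2.0, 30, 1.0)
# Positive pairing: the larger group has the larger variance (4:1).
positive = rejection_rate(10, 1.0, 30, 2.0)
print(f"negative pairing: {negative:.3f}  positive pairing: {positive:.3f}")
```

Under these assumptions the negative pairing rejects a true null far more often than the nominal .05, while the positive pairing rejects far less often, which is the liberal versus conservative pattern described in the text.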
Conditions Affecting the Use of Q

According to Harwell (1997), the performance of Q relies on large sample theory, wherein within-study sample sizes are large enough to support a noncentral t distribution. Based on Harwell's (1997) study, the Q test has conservative Type I error when sample sizes are less than 40, particularly when study sample sizes and K have a ratio less than 1. Sample sizes less than 20 contribute to low power for all but dramatic heterogeneity of effects. Though equal sample sizes between experimental and control groups can often diminish inflated Type I error when there are unequal variances in the primary studies, they have relatively little influence when skewed distributions are involved. Kromrey and Hogarty (1998) find that increasing K, increasing variance heterogeneity and increasing nonnormality most noticeably affect the Q test's sensitivity. When the group with the larger population variance (whether it is the experimental or control group), considered the first group, has a smaller sample size than the second group, Type I error control is better maintained. As suggested in the discussion of normality, there are conditions under which larger study samples do not afford any enhancement in robustness (Abelson, 1997; Kromrey & Hogarty, 1998). When smaller study sample sizes are present, the Q test's sampling distribution no longer parallels the chi-square distribution. As a result of this discrepancy, the Type I error rates substantially depart from nominal α. Moreover, the greatest degree of inflation of average Type I error results from negative pairings of sample size and variance. Conversely, small unequal variance and small unequal sample sizes generate conservative Type I error rates and lower power for all variance ratios (Harwell, 1997; Kromrey & Hogarty, 1998). The observed effect size has a larger actual magnitude when there is a smaller N, assuming the same p values as a study with a larger N.
The Q test has conservative Type I error when sample sizes are less than 40, particularly when sample sizes and the number of studies in the meta-analysis (k) have a ratio less than 1. There is higher power with increasing K, once N is larger than 40 (Harwell, 1997). In general, as variance heterogeneity increases, Type I error control diminishes for Q (Hedges & Vevea, 1998; Kromrey & Hogarty, 1998). Large variance ratios yield small power (Harwell, 1997). Harwell (1997) and Kromrey and Hogarty (1998) further conclude that within primary studies, small unequal variance and large unequal sample sizes yield minimal departures from nominal α for smaller variance ratios, but inflated Type I error for variance ratios of 4:1 or 8:1. Note: When there are large variance ratios (e.g., 8:1), and effect sizes are incorrectly pooled in d, estimated power shrinks. This happens because the denominator of d is overestimated, incorrectly reducing the value of d and diminishing power (Harwell, 1997). Hedges and Vevea (1998) find that Q is least robust to τ² > 0, in that the Type I error rate exceeded nominal α, particularly when unconditional inferences are in place. Skewness, when matched with unequal primary study variances, contributed to inflated Type I error regardless of sample size and K (Harwell, 1997). Even given equal sample sizes, skewness will continue to erode Type I error control, particularly when unequal variances are present. Furthermore, inflation increases as skewness and heterogeneity increase. Harwell finds the same outcome for N=200. But Q has conservative error rates for k=30 when there were skewed distributions. In fact, it is likely that any chi-square distributed test will have increased sensitivity under conditions of skewness and kurtosis (Wang, Fan & Willson, 1996). Generally, when K is small, variance estimate truncation at zero can lead to bias. It is minimized as K increases (Hedges & Vevea, 1998).
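For readers who want to see the statistic under discussion, the following is a minimal sketch of the fixed-effects Q test in its standard form (Hedges & Olkin, 1985): each effect size is weighted by the inverse of its sampling variance, and the weighted sum of squared deviations from the weighted mean is referred to a chi-square distribution with k - 1 degrees of freedom. The function name, the example effect sizes and variances, and the short table of .05 critical values are illustrative assumptions.

```python
# Upper .05 critical values of chi-square for small degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def fixed_effects_q(effects, variances):
    """Weighted sum of squared deviations of the study effect sizes
    from their inverse-variance weighted mean."""
    weights = [1.0 / v for v in variances]
    weighted_mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return sum(w * (d - weighted_mean) ** 2 for w, d in zip(weights, effects))


# Hypothetical meta-analyses of k = 3 studies with equal error variances.
q_hetero = fixed_effects_q([0.0, 0.5, 1.0], [0.05, 0.05, 0.05])
q_homog = fixed_effects_q([0.3, 0.3, 0.3], [0.05, 0.05, 0.05])

df = 3 - 1
reject_hetero = q_hetero > CHI2_CRIT_05[df]  # homogeneity rejected
reject_homog = q_homog > CHI2_CRIT_05[df]    # homogeneity retained
```

The chi-square reference distribution is exactly the large-sample approximation whose breakdown under small samples, unequal variances and nonnormality is at issue in this section.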
When N is held constant, increasing K tends to generate more conservative Type I error rates (Harwell, 1997). Specifically, as K increases, even with conditions of homogeneous variance, Q's Type I error control diminishes (Kromrey & Hogarty, 1998). Kromrey and Hogarty notice that under the same conditions, but with the added influence of increasing variance heterogeneity, Type I error control erodes more drastically. Kromrey and Hogarty (1998) point out that for Q, control of Type I error worsened with nonnormal distributions and larger values of K. According to Harwell (1997), the primary determinant is larger K because of the increase in noncentral t variates. As and N increase, power improves (per Harwell, regarding Q). When study sample sizes and k have a ratio less than 1, Harwell says Q yields conservative Type I error. Large K with small N contributes to inflated Type I error (per Harwell, 1997). Chang (1993) also concludes that estimated power surpasses theoretical power. With a large number of studies where the sample size within each study is small (10 or less), large sample theory for effect sizes needs to be modified (Hedges & Olkin, 1985). When you increase k for a fixed N, the problem worsens for Q because there are more noncentral t variates (Harwell, 1997). That is, reliance on large sample theory (wherein sample sizes for each group are at least 10 and the magnitude of the effect sizes is no greater than 1.5) may no longer be possible, rendering the estimates of effect sizes less accurate. Once you increase the within-study sample sizes for a fixed k, the ds correspond more closely to a normal distribution, with estimated Type I error rates closer to α. Using a weighted linear combination of the estimators di will permit estimates to approximate the maximum likelihood estimator. Bias is not necessarily eliminated when a large number of estimators, K, are averaged, as each estimator's bias can be in the same direction.
To this purpose, Hedges and Olkin advise applying the unbiased estimator di when studies have small sample sizes. With respect to this concern, Chang (1993) concludes that though the noncentral chi-square distribution (based on asymptotic theory) enhances robustness for reviews with large samples and evenly distributed parameter effects, it results in conservative power for reviews with small samples or nonnormal population effects. For equal within-study sample sizes but unequal variances, Type I error becomes increasingly inflated with skewed distributions and increasing variance ratios for fixed N and K, and the inflation is more pronounced as k increases with a fixed N and variance ratio. So the problem (resulting in inflated Type I error) seems to be the unequal variance that further compounds with skewed distributions and increasing K (Harwell, 1997). Estimated power is less than theoretical (nominal) power when N is less than 40, particularly for small N/K ratios. Note: Chang's results indicated that the estimated power was larger than theoretical for the same set of conditions (per Harwell, related to Q). As K increased and N decreased, greater departures arose between the theoretical and simulated power (Chang, pertaining to Q). Both Harwell (1997) and Kromrey and Hogarty (1998, 1999) find that Q has inflated Type I error and minimal power under conditions of heterogeneous variances and nonnormal distributions in the primary studies. In a study comparing the fixed-effects Q, the random-effects Q and the conditionally-random Q, Hedges and Vevea (1998) conclude that Q most closely approximates the nominal value of α when homogeneity of effects is present. With conditions of heterogeneous effects, the conditionally-random and random-effects Q more closely approximate the nominal value of α. Similarly, Chang (1993) finds that fixed-effects Q is not appropriate for use under conditions of heterogeneous effects.
Although Hedges and Vevea (1998) suggest the application of Q for situations of complete homogeneity of effects, Kromrey and Hogarty (1998) find Q to have diminished Type I error control even under this condition, as K increases from 3 to 30. In fact, only permuted tests maintain Type I error regardless of the presence of heterogeneity of variance and K (provided that K was equal to or greater than 10). Further, Kromrey and Hogarty (1998) find regular Q-between tends to maintain better Type I error control over more conditions than the Q test. So for conditions of homogeneous effects and when K is less than 10, the regular Q-between is preferable. Therefore, there seems to be some disagreement between the suggestions being made by Hedges and Olkin (1985), Hedges and Vevea (1998) and Kromrey and Hogarty (1998). Chang (1993) explains that the additional variance component included in the random-effects Q test overestimates variance when true population effects are equal. In such a case, the fixed-effects Q should be applied (p. 86). Hedges and Vevea conclude that the random and conditionally-random procedures are most appropriate for between-studies heterogeneity or when the between-studies variance component is greater than zero. But when this variance component equals zero (that is, there is no difference among effect size estimates), the fixed-effects Q most optimally controls Type I error. Based on the aforementioned effect on Q's sensitivity, Kromrey and Hogarty (1998) strongly advise against the use of the Q test of homogeneity when treatment effects are computed from primary studies involving nonnormal distributions, heterogeneous variances or both. The presence of either or both of these conditions tends to result in inflated Type I error rates. Given these conditions, they recommend employing the permuted Q-between test.
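A permuted Q-between test of the kind recommended above can be sketched as follows. The sketch assumes two classes of studies, treats each study's effect size and variance as a unit that is reassigned to classes, and enumerates every possible assignment rather than sampling permutations at random; the cited authors' exact procedure may differ. Function names and data are hypothetical.

```python
from itertools import combinations
from math import comb


def q_between(effects, variances, in_first_class):
    """Weighted between-class sum of squares for a two-class split."""
    weights = [1.0 / v for v in variances]
    grand_mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q_b = 0.0
    for flag in (True, False):
        idx = [i for i, member in enumerate(in_first_class) if member == flag]
        w_sum = sum(weights[i] for i in idx)
        class_mean = sum(weights[i] * effects[i] for i in idx) / w_sum
        q_b += w_sum * (class_mean - grand_mean) ** 2
    return q_b


def exact_permutation_p(effects, variances, first_class):
    """Share of all class assignments whose Q-between is at least as
    large as the observed one. first_class holds the study indices
    observed in the first class (at least 1, fewer than k of them)."""
    k, k1 = len(effects), len(first_class)
    observed = q_between(effects, variances,
                         [i in first_class for i in range(k)])
    as_extreme = 0
    for candidate in combinations(range(k), k1):
        members = set(candidate)
        q_b = q_between(effects, variances, [i in members for i in range(k)])
        if q_b >= observed - 1e-12:  # tolerance for floating-point ties
            as_extreme += 1
    return as_extreme / comb(k, k1)


# Hypothetical k = 5 studies; the first two form one class.
effects = [0.9, 0.8, 0.1, 0.0, 0.2]
variances = [0.1] * 5
p_value = exact_permutation_p(effects, variances, {0, 1})
n_assignments = comb(5, 2)
```

Here the observed split is the most extreme of the 10 possible assignments, yet the p value can be no smaller than 1/10. With so few studies the permutation distribution is too coarse to support a test at the .05 level, however large the between-class difference.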
Additionally, Chang's (1993) study illustrates how large K paired with small N or extreme parameter effects presents the greatest challenge to maintaining Type I error control when applying fixed-effects Q. Moreover, as K increased and N decreased, greater departures arose between the theoretical and simulated power. When the Q-between test is employed using a permutation strategy, instead of a chi-square distribution, the Type I error control is well maintained. But this test is inadequate when only a small number of studies (k=5 or less) is included in a meta-analysis, because there are too few permutations of the data to permit a test at an alpha level of .05. Unequal variances matched with positive pairings of variances and sample sizes generate conservative Type I error rates for the fixed-effects Q (Kromrey & Hogarty). Harwell also finds that the proportion of the variances and sample sizes (e.g., small unequal variances to small unequal sample sizes) generates conservative Type I error rates and lower power across all variance ratios. Furthermore, the size of the variance ratio used in conjunction with the positive or negative pairings of sample sizes and variances affects the Type I error rate as well. Specifically, within primary studies, small unequal variance and large unequal sample sizes generated minimal departures from nominal α for smaller variance ratios, but inflated Type I error for larger variance ratios (4:1 and 8:1).

Conditions Indicating the Use of Random-effects Q

As stated previously, Hedges and Vevea (1998) recommend using either the random-effects or conditionally-random procedures when the between-studies variance component (τ²) is greater than zero, the true population parameter is greater than zero and unconditional inferences are employed. Under these circumstances, the fixed-effects Q, conditionally-random procedure and the random-effects Q all produce inflated Type I error.
But the random-effects Q test produces rejection rates much lower than either of the other two tests (Hedges & Vevea, 1998). When the between-studies variance is underestimated, the random-effects Q produces less inflated estimates. Additionally, as both K and between-studies variance increase, the random Q most closely approximates the nominal α, though still exceeding it (Hedges & Vevea, 1998). Random Q does not produce this result for conditions with small N. Additionally, researchers such as Raudenbush (1994) tend to advise use of the fixed-effects model for small collections of studies and random-effects for larger collections of studies. It is unclear at what value of K this specific demarcation lies. Raudenbush describes scenarios based on two studies (clearly fixed-effects) or several hundred (clearly random-effects). Beyond those extreme parameters, there are no more detailed criteria, other than the suggestion to look to the magnitude of the treatment and control group differences. As Hedges and Vevea (1998) illustrate, the primary difference between the fixed- and random-effects estimates of variance is that within the random-effects procedure, the estimate of τ² is added to the variance before dividing it by k. According to Erez, Bloom and Wells (1996), random-effects confidence intervals widen as τ² (the between-studies variance) increases. In contrast, the fixed-effects confidence intervals tend to be too narrow, as they do not include between-studies variance. Mulaik, Raju and Harshman (1997) point out that narrow confidence intervals reflect high power and wide intervals suggest low power. Chang's (1993) study seems to confirm these findings, as she explains: "Unlike for the fixed-effects models where simulated power values were sometimes higher than theoretical power values; for random-effects models, a strong two-thirds (9/28) of the discrepancies reflected lower simulated power values" (p. 63).
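The difference between the fixed- and random-effects variance of the mean effect can be sketched numerically. The sketch assumes a method-of-moments estimate of τ² computed from Q and truncated at zero (the DerSimonian-Laird form); the estimators used in the studies cited above may differ in detail, and the function name and data are hypothetical.

```python
import math


def pooled_estimates(effects, variances):
    """Fixed- and random-effects standard errors of the mean effect,
    with tau2 estimated by the method of moments (assumed estimator)."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    fixed_mean = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    q = sum(wi * (d - fixed_mean) ** 2 for wi, d in zip(w, effects))
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # truncated at zero
    # Random-effects weights add tau2 to each study's error variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    return {
        "tau2": tau2,
        "fixed_se": math.sqrt(1.0 / sum(w)),
        "random_se": math.sqrt(1.0 / sum(w_star)),
    }


# Hypothetical k = 3 studies with equal error variance 0.05.
result = pooled_estimates([0.0, 0.5, 1.0], [0.05, 0.05, 0.05])
fixed_half_width = 1.96 * result["fixed_se"]
random_half_width = 1.96 * result["random_se"]
```

In this equal-variance example the random-effects variance of the mean is (0.05 + τ²)/k, which matches the description of adding τ² to the variance before dividing by k. The resulting interval is more than twice as wide as the fixed-effects interval, consistent with the widening described by Erez, Bloom and Wells (1996).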
Specifically, the random-effects approach recognizes multiple sources of variance by including systematic error. Between-study heterogeneity in effects contributes to differences in standard errors from the random- to the fixed-effects model. The random-effects model's larger variance component, due to the addition of τ², increases the standard error of the mean (Hedges & Vevea, 1998). Heterogeneity of effects and the number of studies included determine how conservative Type I error becomes when employing random- and conditionally-random-effects procedures in a conditional inferences situation (Hedges & Vevea, 1998). Small k does not permit accurate estimation of the variance component, which in turn affects the precision of the weights. The estimate of the variance component (τ²) is used to generate the weights used in calculating the mean estimate of the average effect size and its total variance.

When Not to Use the Random-Q Test

Hedges and Vevea (1998) suggest the application of Q for situations of complete homogeneity of effects, as the fixed-effects Q is the only model tested by them that provides rejection rates exactly equal to nominal α (Hedges & Vevea, 1998). It should be noted that Kromrey and Hogarty (1998) recommend the use of regular Q-between and permuted alternatives of homogeneity tests. But in general, fixed-effects procedures should be applied to conditions calling for conditional inferences and homogeneity of effects. Applying either random-effects Q or the conditionally-random procedure will undermine power and produce Type II error. But when heterogeneity is present with conditional inferences, these two models (random-effects Q and conditionally-random test) have even lower rejection rates than the fixed-effects Q. In this case, random-effects Q produces even less inflated Type I error than the conditionally-random procedure (Hedges & Vevea, 1998).
But because the random-effects Q does not test the inferences underlying the fixed-effects model, its use under these conditions is not recommended. Additionally, under heterogeneity, random-effects theoretical power is less than for fixed-effects, as there is a larger variance included in the denominator of the magnitude-of-effects (z) test (Chang, 1993). Relative to the random-effects z test, the fixed-effects z consistently has greater power.

There are some other factors worth mentioning. The random-effects Q test is limited in that the estimate is not precise when there are a small number of studies, due to the limitations of the incorporated τ². Power is also curtailed when there is large k with small samples (Chang, 1993). Moreover, when both K and between-study heterogeneity are large, the width of the confidence intervals becomes overextended (Hedges & Vevea, 1998). Such an outcome results in rejection rates being too low for random-effects and conditionally-random procedures.

Conditions Indicating the Use of Conditionally-Random Q

Conditionally-random or random-effects procedures should be employed under unconditional inference situations. Using fixed-effects procedures under these conditions will yield inflated Type I error rates (Hedges & Vevea, 1998). If τ² is greater than zero, the fixed-effects procedure confidence intervals exhibit lower than nominal probability content, that is, the confidence intervals are too narrow. But given the same scenario using random or conditionally-random procedures,

As with the random-effects Q test, the conditionally-random procedure is susceptible to inflated Type I error and low power when conditional inferences are employed. However, the conditionally-random procedure is less sensitive than the random-effects Q. Moreover, the conditionally-random procedure tends to underestimate nominal α when the true population effect is equal to 0 (perfect homogeneity of effects).
In fact, it yields rejection rates progressively more conservative as τ² exceeds 0. In other words, the conditionally-random procedure tends to produce Type II error (low power), though to a lesser extent than the random-effects Q test. When the true population effect is greater than 0, employing conditional inferences, use of the conditionally-random procedure tends to result in inflated Type I error for all conditions of τ² (whether equal to or greater than 0). In fact, so do the fixed- and random-effects Q tests. But the conditionally-random procedure produces less inflation than the fixed-effects Q, and more inflation than the random-effects Q.

When to Use Conditionally-Random Q

The conditionally-random procedure produces rejection rates not as extreme as the fixed-effects Q, but more so than the random-effects Q, when between-studies variance becomes greater than zero (Hedges & Vevea, 1998). This pattern is more pronounced when the true population effect is greater than zero and as K increases. On the other hand, when the true population effect is zero (homogeneous effects) and between-studies variance increases, the conditionally-random procedure more closely approximates α and produces less Type II error than the random-effects Q (Hedges & Vevea, 1998).

When Not to Use Conditionally-Random Q

As stated previously, Hedges and Vevea (1998) recommend using either the random-effects or conditionally-random procedures when the between-studies variance component (τ²) is greater than zero, the true population parameter is greater than zero and unconditional inferences are employed. Under these circumstances, the fixed-effects Q, the conditionally-random procedure and the random-effects Q all produce inflated Type I error. But the random-effects Q test produces rejection rates much lower than either of the other two tests (Hedges & Vevea, 1998). When the between-studies variance is underestimated, the random-effects Q produces less inflated estimates.
Additionally, as both K and between-studies variance increase, the random Q most closely approximates the nominal α, though still exceeding it (Hedges & Vevea, 1998). Note this does not include a situation with small N.

Conditions Indicating the Use of Permuted Q

This test has not been compared to either the random-effects Q test or the conditionally random-effects procedure. In fact, Kromrey and Hogarty (1998) may have produced the only study to simulate and analyze the performance of the permuted version of the Q-between test. Therefore, further simulation would enhance the reliability of these results, in and of itself. The other tests under investigation in their study are the Q test, iterated Q and regular Q-between, as well as 3 permuted indices. All of the permutation-based tests under investigation in the Kromrey and Hogarty (1998) study outperformed the chi-square tests, in general, for each variable under consideration. In testing an alternative hypothesis and utilizing a different distribution strategy, the permuted Q-between maintains better control of Type I and Type II error than the three investigated chi-square tests (Q, iterated Q and regular Q-between).

Permuted Q tests a different null from the other types of homogeneity tests. It also tests a different null from that tested by other fixed-effects tests. Instead, it tests for between-class homogeneity. Chi-square tests are greatly affected by nonnormality and large sample sizes (Wang, Fan & Willson, 1996). The Q-between statistic tests a different null hypothesis, essentially determining the presence of moderating variables. Kromrey and Hogarty explain that increasing K seems to be associated with inflated Type I error rates for both the Q test and iterated Q, whereas the Q-between (though still inflated) maintains better Type I error control.
Additionally, the permutation strategy further frees the statistic from many of the assumptions typically held by chi-square and normal distributions. As long as K is at least 10, the permutation tests maintain Type I error control across population shapes. They also maintain Type I error control for all conditions with K=10 and 30, regardless of the extent of variance heterogeneity. Despite the caution that robust tests applied to unbalanced designs under conditions of heterogeneity and nonnormality may exhibit liberal Type I error rates (Lix & Keselman, 1998), permuted Q-between seems to maintain extraordinary Type I error control. Kromrey and Hogarty (1998) find this procedure to be most robust to the dual violations of normality and homogeneity of variance. When the Q-between test is employed using a permutation strategy, instead of a chi-square distribution, the Type I error control is well maintained. The permuted Q-between derives part of its robustness from the superior properties of the regular Q-between, as well as its permutation strategy.

In order to promote a deeper understanding of the comparative performance of the permuted Q-between, results of the comparative performance of three chi-square based tests (including regular Q-between) follow. Though increasing heterogeneity results in increasingly inflated Type I error rates for each of the chi-square tests (Q, iterated Q and, to a lesser extent, regular Q-between), all permutation strategy-based tests maintain better Type I error control. Regardless of the degree of heterogeneity, permuted Q-between maintains Type I error control, as long as the number of studies is 10 or greater. The Type I error rate closely approximates the nominal level, .05. Based on Kromrey and Hogarty's (1998) study, the Q-between test outperformed both the Q test and iterated Q with proportions of conditions ranging from 75 to .396.
Although some inflation of the Type I error rate was evidenced, violations of normality affected the Q-between test less than the homogeneity tests. The Q-between test's ability to maintain adequate Type I error control is not necessarily enhanced by larger samples in the primary studies. Regardless of sample size, it permitted greater Type I error control than did the Q test or iterated Q test. Unequal sample sizes, wherein the first group has a smaller sample size than the second group, may have contributed to better Type I error control for the two homogeneity tests, as well as the Q-between test (to a lesser extent). Nonnormality produces inflated Type I error rates for two versions of the Q test of homogeneity. But the Q-between maintains better Type I error control under the same condition (Kromrey & Hogarty, 1998). This test is less likely to be influenced by sample size than regular or iterated Q tests (Kromrey & Hogarty, 1998).

When to Use the Permuted Q-between

Where fixed-effects inferences are most appropriate, the number of studies exceeds 9 and the researcher suspects the presence of moderator variables, applying the permuted Q-between appears to be the optimal choice. But assuming that the fixed-effects model is appropriate for collections larger than 9, application of permuted Q-between appears to be substantiated by the results from the Kromrey and Hogarty (1998) study. In contrast to Q, permuted Q-between maintains Type I error control near the nominal level under all investigated conditions of varied skewness and kurtosis. Moreover, use of this test does not require the determination of homogeneity/heterogeneity, as it tests a different null hypothesis. And thus far, permuted Q-between demonstrates robustness to all variations of within-study sample sizes.

When Not to Use the Permuted Q-between

When the Q-between test is employed using a permutation strategy, instead of a chi-square distribution, the Type I error control is well maintained.
But this test bears the limitation of insufficiency with small numbers of studies (k=5 or less) included in a meta-analysis, because there are too few permutations of the data to permit a test at an alpha level of .05. Kromrey and Hogarty (1998) conclude that small K, fewer than 10 studies, generates too few permutations to permit a valid test at the nominal level .05. They suggest this restriction to be the major limitation of the permuted Q-between.

Another consideration pertains to the appropriateness of applying the test under conditions suggesting multiple treatment effects. Although it tests for between-class moderating variables, the underlying assumption is more consistent with the fixed-effects model because the only classes included in the test are those in the sample (it is a test of between-class homogeneity). It is not possible to permute studies that are not present. It should be noted that testing this null does not yield any direct variance estimates, only a probability statement about the similarity/difference between average effect sizes, not the difference between two or more parameters. The test assumes population effects are identical. So when the meta-analyst hypothesizes the presence of multiple treatment populations, and wishes to draw inferences beyond the immediate collection of studies, applying some other model, such as random-effects, would be more appropriate. If the permuted Q-between were applied instead, the test would not be testing the parameters expressed by the model.

Primary Within-study Sample Sizes (REQ)

An increase in the variance of population effects or in the sample sizes yields increased mean power estimates. Chang (1993) concludes that only Total Sample Size has a significant influence on power. Random samples within primary studies are not considered. She explains there may be the possibility that simulated power underestimates actual power for small samples and large K.
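The small-K limitation of the permuted Q-between noted earlier in this section can be made concrete with a short Python sketch. It assumes two classes of studies, inverse-variance weights, and an exact permutation scheme that reassigns whole studies (effect and variance together) to classes. These are illustrative assumptions, not a description of the procedure Kromrey and Hogarty (1998) programmed, and all function names and data values are hypothetical.

```python
from itertools import combinations
from math import comb

def q_between(effects, variances, class_a):
    """Two-class Q-between: sum over classes of W_j * (class mean - grand mean)^2."""
    weights = [1.0 / v for v in variances]
    grand = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    class_b = set(range(len(effects))) - set(class_a)
    q = 0.0
    for members in (set(class_a), class_b):
        w_sum = sum(weights[i] for i in members)
        mean = sum(weights[i] * effects[i] for i in members) / w_sum
        q += w_sum * (mean - grand) ** 2
    return q

def permutation_p_value(effects, variances, class_a):
    """Exact permutation p-value: share of class assignments with Q-between >= observed."""
    observed = q_between(effects, variances, class_a)
    k, k_a = len(effects), len(class_a)
    extreme = sum(
        1
        for candidate in combinations(range(k), k_a)
        if q_between(effects, variances, candidate) >= observed - 1e-12
    )
    return extreme / comb(k, k_a)

def smallest_attainable_p(k, k_a):
    """Lower bound on the exact p-value: one assignment out of C(k, k_a)."""
    return 1.0 / comb(k, k_a)

# Hypothetical K = 5 meta-analysis; studies 0 and 1 form the first class.
effects = [0.1, 0.2, 0.9, 1.0, 1.1]
variances = [0.1] * 5
print(permutation_p_value(effects, variances, (0, 1)))
print(smallest_attainable_p(5, 2), smallest_attainable_p(10, 5))
```

Under these assumptions, five studies split 2 and 3 allow only C(5, 2) = 10 distinct class assignments, so no exact p-value can fall below .10 and a test at .05 can never reject. Ten studies split evenly allow C(10, 5) = 252 assignments.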
Between-Studies Variance (τ²) (REQ)

Hedges and Vevea (1998) find that when between-studies variance equals 0, there is inflated Type I error, particularly for random-effects Q. Based on Chang's regression analysis of the factors involved in the random-effects test of homogeneity, the variation of effects and the total sample size in the model are most responsible for explaining the power of the random-effects test (Chang, 1993, p. 78). Similarly, Raudenbush (1994) notes the precision of the estimates of study effects reflects both the study sample size and the degree of heterogeneity across the true effects.

As the number of studies in the meta-analysis increases, only the random-effects Q and conditionally-random procedure continue to approximate the nominal value. However, substantial heterogeneity of effects combined with increasing numbers of studies can widen the departure from the nominal value. Hedges and Vevea contend the robustness of either the random-effects Q or conditionally-random procedure, although more appropriate with heterogeneous effects, relies on the combined quantities of both heterogeneous effects and increasing numbers of studies.

Between-study heterogeneity in the effects contributes to differences in standard errors from a RE to a FE model (Hedges & Vevea, 1998, p. 494). The RE model's larger variance component, due to the addition of τ², increases the standard error of the mean. But precise estimation of τ² is primarily contingent on K (Hedges & Vevea, 1998). If K is small, precision of weighted estimates will be undermined, despite large study sample sizes. However, when K is larger than 20, biases are minimized.

Number of Studies in the Meta-analysis (k) (REQ)

The power of random-effects Q was most explained by the spread of parameter effects and total sample size (Chang, 1993). Hedges and Vevea (1998) point out that if both K and between-study heterogeneity are large, the width of the confidence intervals becomes overextended.
Such an outcome results in rejection rates being too low for random-effects and conditionally-random procedures. Under conditions of large within-study sample sizes, K plays a crucial role in enhancing the precision of the estimate of the between-studies variance component (Hedges & Vevea, 1998, p. 493). Conversely, Chang (1993, p. 63) finds that simulated power may underestimate actual power for small samples with large k. But according to her regression analysis (p. 78), k had no effect, due to its inclusion in the computation of total sample size. For random-effects Q, Chang (1993) concludes there are no significant associations between k and N and the frequency of significant power discrepancies. In other words, the dependence of power discrepancies on sample sizes N did not vary with k. Chang's study reveals that when k is larger and the average within-study N is smaller, simulated power was higher than theoretical power for the random-effects Q. In other words, random-effects Q produces inflated Type I error.

Between-Studies Variance (τ²) (CRQ)

The conditionally-random procedure produces rejection rates not as extreme as the fixed-effects Q, but more so than the random-effects Q, when between-studies variance becomes greater than zero (Hedges & Vevea, 1998). This pattern is more pronounced when the true population effect is greater than zero and as K increases. On the other hand, when the true population effect is zero (homogeneous effects) and between-studies variance increases, the conditionally-random procedure more closely approximates α and produces less Type II error than the random-effects Q (Hedges & Vevea, 1998).

Number of Studies in the Meta-analysis (k) (CRQ)

As K increases under conditional inferences and a true population effect greater than zero, the conditionally-random procedure produces increasingly inflated Type I error. However, as the between-studies variance increases, it tends to produce progressively less inflated Type I error rates.
K does not appear to have an appreciable influence when the true population effect is zero (homogeneous effects) and between-studies variance equals zero. Rejection rates diminish as between-studies variance increases, producing conservative Type I error. However, when the true population effect is greater than zero and between-studies variance equals zero, rejection rates again become exceedingly greater than α as K increases.

As K increases under unconditional inferences and a true population effect greater than zero, the conditionally-random procedure produces the same pattern, that is, it produces increasingly inflated Type I error. In this case, as the between-studies variance increases, it produces increasingly more inflated Type I error, as do the other two tests. Again, K does not appear to have an appreciable influence when the true population effect is zero and between-studies variance equals zero. Rejection rates increase as between-studies variance increases, producing inflated Type I error. When the true population effect is greater than zero (heterogeneous effects) and as between-studies variance increases, rejection rates increase for small K, but decrease for larger K.

Suggested Research

Although Chang conducts a thorough power analysis and comparison between the traditional and random-effects Q tests, she does not evaluate these tests' performance in terms of skewness/kurtosis, randomly assigned sample sizes and relative to the performance of the Q-between test. Kromrey and Hogarty (1998) state they do not explore the influence of randomly assigned sample sizes within primary studies. Also, their design does not control for varying degrees of between-studies variance. Although the permuted Q-between test is shown to have more robustness under heterogeneous variances than Q, it has not been compared to the random-effects and conditionally random-effects procedures.
Comparing it to random-effects and conditionally-random tests can lend greater insight into how all three of these tests respond to violations of normality, random samples and heterogeneity of effects relative to each other. Hedges and Vevea (1998) do not vary the within-study variances, skewness/kurtosis or primary within-study sample sizes. Therefore, a potentially fruitful investigation involves comparing the random- and fixed-effects Q, the conditionally-random procedure and the permuted Q-between test.

Chang (1993) suggests investigating the influence of the between-studies variance (τ²) component on the random-effects test to determine the magnitude of the effects. Both Chang and Harwell recommend further investigation of the random-effects statistic, particularly with respect to the impact of small n and large k on the between-studies variance component, τ². Specifically, Chang suggests that the theoretical power estimates presented a poor fit for the simulated random-effects values, calling into question the precision of τ², which is contingent on K. Like Hedges (1992), Chang acknowledges that the distribution of effects is not taken into account for random-effects Q. Hedges (1992) concedes that meta-analysts tend to accept that random-effects Q does not require the study effects to be normally distributed. However, if this issue has not been investigated, it would be a fruitful line of inquiry.

Neither Chang nor Hedges and Vevea investigate the combined influence of variance ratio to sample size pairings on the control of Type I error, particularly using random samples. Also, neither of these studies explores the influence of skewness and kurtosis of the within-study samples on the control of Type I error and power. Further, permuted Q-between has not been compared to either random-effects Q or the conditionally-random procedure.
Hedges and Vevea's study controls for the between-studies variance heterogeneity and K, but does not account for within-study sample sizes or the influence of violations of normality. Furthermore, the conditionally-random procedure is compared only to fixed- and random-effects Q. Therefore, comparing the Type I error control and power of the conditionally-random procedure to the permuted Q-between as well as the fixed- and random-effects Q tests under conditions of primary study nonnormality, unequal variances and unequal sample sizes is a worthwhile line of investigation.

As mentioned previously, Kromrey and Hogarty (1998) recommend varying the sample sizes within the primary studies and permitting them to vary randomly. In addition, the permuted Q-between test has not been compared to the random-effects or conditionally-random procedures in terms of both Type I error control and power. Again, this test has not been submitted to conditions of varying between-studies variance.

Lastly, in reviewing the internet web sites providing commercial and freeware meta-analytic software, none appear to introduce the incorporation of τ² or the use of permuted Q-between. The commercial and best-funded programs include: www.metaanalysis.com, www.metawinsoft.com and www.weasyma.com. The first two programs provide both fixed- and random-effects options for analyzing categorical or continuous data, presenting transformations of common effect size indices. The last program only presents methods for evaluating categorical data, explicitly stating that methods for use with continuous data are not yet available.

Summary

Because Q, the traditional test of homogeneity, is susceptible to within-study variance heterogeneity and nonnormal score distributions (Chang, 1993; Harwell, 1997; Kromrey & Hogarty, 1998), researchers have begun investigating alternative procedures to accommodate these conditions.
Though standardized mean difference effect size estimates such as Hedges' g (used to compute the Q test) remain sensitive to violations of normality and homogeneity, Kromrey and Hogarty (1998) find permuted Q-between more robust than the traditional test. Another alternative, the conditionally-random procedure, can be effective with significant between-studies variance (Hedges & Vevea, 1998). Several researchers recommend the random-effects model of the traditional test when significant heterogeneity or a distribution of several population effects is suspected (Chang, 1993; National Research Council, 1992; Erez, Bloom & Wells, 1996; and Harwell, 1997). As suggested in Table 1, none of these procedures have been compared under violations of normality and homogeneity within primary studies, unequal and random sample sizes, and heterogeneous effects across schools.

Table 1
Relevant Factors Examined By Other Studies
Studies (columns): Chang (1993); Harwell (1997); Kromrey & Hogarty (1998); Hedges & Vevea (1998)
Test: Q RE Q Q Permuted Q-Between Q CR RE
Skewness & Kurtosis: normal, noncentral chi-square 2/6; 0/25; 0/0; 1/3; 1.5/5 2/6; 0/25; 0/0; 1/3; 1.5/5
Variances within and across studies: variance of parameter effects 1:1; 2:1; 4:1; 8:1 1:1; 2:1; 4:1; 8:1 τ² = 0; .33; .67; and 1.0
K: 2, 5, 10, 30 5, 10, & 30 5, 10, & 30 110, 20, 30, 40, 50, 60, 70, 80, 90, & 100
N of a single study: 20, 60, 120, & 200 equal & unequal within: 10, 20, 40 & 200 equal & unequal within: 10, 20, 40 & 200
Study effect sizes: many effect sizes 0, .125, .375, .5, .75 & 1.0 0, .125, .375, .5, .75 & 1.0

Chapter Three

Method

The effectiveness of each of four tests of homogeneity of effects under varying conditions will be evaluated, using computer-simulated data, following the design and analyses to be described.
Evaluation of test effectiveness concerns the accurate verification or falsification of homogeneity of effects, as evidenced by the degree to which Type I and Type II errors are controlled, relative to nominal α.

Purpose

The purpose of the study was to investigate data conditions typical in education settings to begin to establish a set of criteria facilitating deliberate model selection for optimal model fit. The responsiveness of four tests of homogeneity of effects was compared under conditions of varying degrees of heterogeneity of variance, primary study sample sizes, number of primary studies and dual violations of normality and homogeneity of effects, as evidenced by statistical power and control of Type I error.

Harwell (1997) recommended utilizing random-effects regression in addition to the Q statistic that has been typically generated with fixed-effects regression. Raudenbush (1994) and Bollen (1989) further advised applying a weighted least squares regression when sample sizes across studies are unbalanced. This second recommendation is supported by Hedges (1982), who found it provided reasonable accuracy for model fit specification when sample sizes were as small as 10. However, scant information is available to indicate whether such a procedure provides adequate robustness for meta-analytic tests.

Comparative analyses are based on each test's relative Type I error rates and power. This study will provide meta-analysts with more specific guidelines for the use of each test by addressing the following questions:

1) To what extent is the Type I error rate of the fixed-effects Q, permuted Q-between, random-effects Q and conditionally-random Q maintained near the nominal alpha level across variations in the number of primary studies included in the meta-analysis, sample sizes in primary studies, heterogeneity of variance, varying degrees of heterogeneity of effects (τ²) and primary study skewness and kurtosis?
2) What is the relative statistical power of the fixed-effects Q, permuted Q-between, random-effects Q and conditionally-random Q across variations in the number of primary studies included in the meta-analysis, sample sizes in primary studies, heterogeneity of variance, varying degrees of heterogeneity of effects (τ²) and primary study skewness and kurtosis?

Design

This study is modeled after Harwell (1997) and Kromrey and Hogarty's (1998) experimental design. Specifically, it entails a 2 x 3 x 3 x 3 x 3 x 3 x 2 factorial design. The study also controls for between-studies variance, as suggested by Hedges and Vevea's (1998) study. The randomized factorial design includes seven independent variables: (1) number of studies within the meta-analysis (10 and 30); (2) primary study sample size (10, 40, 200); (3) score distribution skewness and kurtosis (0/0; 1/3; 2/6); (4) equal or random (around typical sample sizes, 1:1; 4:6; and 6:4) within-group sample sizes; (5) equal or unequal group variances (1:1; 2:1; and 4:1); (6) between-studies variance, τ² (0, .33, and 1); and (7) between-class effect size differences, δ (0 and .8).

Data were obtained using two programs: one for null hypotheses (972 simulations) and the other for non-null hypotheses (486 simulations). Hence, the study incorporates 1,458 experimental conditions, illustrated in Figure 1. Simulated data from each sample are analyzed using each of four tests of homogeneity.

The dependent variable is, in part, the proportion of conditions with adequate Type I error control at the nominal alpha level of .05. Additionally, estimates of statistical power are computed for those conditions where tests maintained adequate Type I error control. These power estimates indicate the degree to which a test reflects sensitivity to significant heterogeneity of effects, in the presence of violated assumptions. Harwell (1997) applies a criterion for determining inflated versus conservative Type I error rates.
Specifically, rejection rates above .056 are termed inflated, whereas empirical values below .044 are termed conservative. The dependent variable is the proportion of meta-analyses leading to a rejection of the null hypothesis. This represents either the Type I error rate or power, depending on the truth of the null hypothesis. Essentially, the effectiveness of each of the four tests of homogeneity of effects is being evaluated based on this performance, as well as how large a discrepancy exists between the simulated data and nominal values.

Table 2
Study Design (shaded cells = power estimates; white cells = Type I error rates)
Note: The number of cells will be quadrupled as the data for each of the 4 tests of homogeneity are entered.
Rows, nested in this order:
Primary within-study sample size: 10, 40, 200
Equal or random value of within-study sample size: 1:1, 4:6, 6:4
Skewness & kurtosis: 0/0, 1/3, 2/6
Variance within studies: 1:1, 2:1, 4:1
Between-class effect size differences: 0, 0.8
Columns (for each of the four tests): τ² = 0, τ² = .33, τ² = 1.0

Seven experimental variables are being investigated: (1) number of primary studies included in the meta-analysis, (2) total sample size, (3) primary study distribution shape, (4) equal, random or unequal within-group sample sizes, (5) experimental to control group variances, (6) extent of heterogeneity of effects and (7) between-studies effect size differences. Three variables, (1), (6) and (7), focus on aspects of the meta-analysis, whereas (2) through (5) address features of the primary studies included in the meta-analysis.

In order to determine the extent to which the experimental conditions are representative of published meta-analyses, a survey was conducted of all articles published in the Review of Educational Research from 1995 to 2000. Out of a total of 15 meta-analyses, three employed random-effects tests either a priori or post hoc. The meta-analyses included between 13 and 180 studies. Because some of these syntheses included studies in which several effect sizes were computed to address multiple hypotheses, as many as 1,728 effect sizes were computed for a single meta-analysis.
Six syntheses included 30 or fewer studies, eight included 50 or fewer studies, and eleven included 60 or fewer studies. It should be noted that two of the 15 meta-analyses either did not conduct homogeneity tests or did not compute an average effect. These two meta-analyses were eliminated for this reason. Sample size data were incomplete for six of the syntheses. Due to differences in design, some studies presented total primary study sample sizes, whereas others provided the sample sizes for the experimental and control groups. Total sample sizes ranged widely, from 12 to 3,656. There were multiple sample sizes for syntheses addressing multiple questions.

Ten of the meta-analyses resulted in overall heterogeneity, requiring further evaluation of moderating variables. Three syntheses presented unclear information about the result. One study presented a homogeneous effect after the removal of outlying effect sizes. Only two studies presented a definitively homogeneous main effect.

χ² and average effects were computed for most of the meta-analyses. Eight of the remaining 13 meta-analyses generated an average effect. These were computed for heterogeneous and homogeneous outcomes alike. The average effects (computed for 8 of the 13 studies) ranged from .34 to .79. The χ² values (computed for 4 of the 13 studies) ranged from 82.32 to 3,246.99. The levels selected for the variables being investigated in this study correspond with those presented in the extant literature.

Given that the between-studies variance component represents a significant difference in the delineation of uncertainty between the fixed-effects and random-effects models, further consideration of the influence of varying τ² seems warranted. Based on previously applied τ² levels and a conversation with L. V. Hedges (personal communication, September 3, 1999) about the determination of typical τ² values, the decision was made to incorporate 3 specific values of τ² (0, .33 and 1.0).
After analyzing raw data from a Title I reading program administered throughout a public school district in Florida, it was apparent that reading scores were nonnormally distributed for this population. Scores were negatively skewed and leptokurtic. Several researchers have addressed the influence of nonnormal population shape on the performance of homogeneity tests (Hedges & Olkin, 1985; Raudenbush & Bryk, 1987; Wilcox, 1995; Lix & Keselman, 1998; Kromrey & Hogarty, 1998). Based on prior evidence in the literature and from the school district, 3 representative population shapes were selected [0/0 (normal skewness and kurtosis), 1/3 and 2/6 (extreme skewness and kurtosis)].

Sample

The data were generated through a Monte Carlo study. A Monte Carlo study uses computer models to simulate the performance of statistics under various conditions. Snedecor and Cochran (1989) state: "An important use of tables of random digits and of computers is to draw repeated random samples of a given size from a population. By estimating a desired population characteristic from each sample we obtain the sampling distribution of the estimates" (p. 15). The effectiveness of the statistic is determined by the extent to which Type I and Type II error rates are controlled, relative to the theoretical or criterion alpha level.

Data were generated using the SAS Interactive Matrix Language (IML) procedure. Hedges and Olkin (1985) refer to SAS proc Matrix: "A simpler alternative to the computation of estimates and test statistics is to use a program (such as SPSS or SAS proc GLM) that can perform WLS analyses" (p. 173). For this study, however, the analyses were conducted initially using SAS/IML version 6.12, which performs the operations more rapidly than the GLM procedure. The accuracy of the data analysis was then verified by comparing the results to those of the GLM procedure.
In order to extend the studies conducted by Harwell (1997) and Kromrey and Hogarty (1998), this study utilized the same procedures for random number generation and transformation of nonnormal scores. Harwell used an unspecified random-number generator from the 1986 Numerical Recipes to simulate standard-normal deviates. He then applied Fleishman's method to transform them. Following the procedure used by Kromrey and Hogarty (1998), the RANNOR random number generator in SAS was used to generate normally distributed random variables. Different seed values were used in each execution of the program to yield the random numbers. Nonnormal distributions were produced by transforming normal random variates derived from RANNOR using the Fleishman (1978) technique, referred to as the Power Method.

In conducting a Monte Carlo study, the study design focuses on establishing a criterion for determining the robustness of the tests, as well as the number of iterations necessary to ensure the reliability of the results. Based on Robey and Barcikowski's (1992) seminal work, the number of iterations necessary for a two-tailed test of departures of the estimated actual Type I error rate from the nominal Type I error rate was determined. The actual Type I error rate is estimated as the observed proportion of the total number (n) of calculated test statistics greater than a critical test value under the null hypothesis. Consistent with Harwell's nominal alpha of .05, a standard power of .8 and Bradley's (1978) criterion for intermediate stringency, the number of recommended simulated meta-analyses would be 2,660 (see Robey & Barcikowski, 1992, p. 286). Though it may not effect appreciable differences in the consistency of findings, this study will employ the same number of iterations (5,000) as found in Harwell's study.

In an effort to describe the realistic simulation of an educational meta-analysis, study characteristics are presented in the form of a linear model.
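To make the role of the iteration count concrete, the following is a minimal Python sketch (the study itself used SAS/IML) of the binomial standard error of an estimated rejection rate. It assumes independent replications and a true rejection rate equal to the nominal alpha; the function name is hypothetical and the calculation is not the Robey and Barcikowski (1992) procedure itself.

```python
import math


def mc_standard_error(alpha: float, iterations: int) -> float:
    """Binomial standard error of an estimated rejection rate.

    Assumes the true rejection rate equals `alpha` and that the
    simulated meta-analyses are independent replications.
    """
    return math.sqrt(alpha * (1.0 - alpha) / iterations)


if __name__ == "__main__":
    # Compare the recommended 2,660 replications with the 5,000 used here.
    for n in (2660, 5000):
        se = mc_standard_error(0.05, n)
        print(f"iterations={n}: SE of estimated rate = {se:.4f}")
```

Under these assumptions, moving from 2,660 to 5,000 iterations shrinks the standard error of each estimated Type I error rate only modestly, which is in line with the remark that the larger count may not appreciably change the consistency of the findings.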
In thinking about the model, one might think of the simulated meta-analysis as consisting of a collection of studies about a given reading program being administered to groups of first and third grade readers. The reading program being investigated is differentially effective depending upon the students' grade, the moderating variable. The individual student scores will be generated through the implementation of the following model,

$$X_{ijk} = \mu + \alpha_{jk} + \varepsilon_{ijk}$$

where
$X_{ijk}$ = the observed score for child i, in group j, in study k,
$\mu$ = the grand mean of all scores for all children in both groups in all of the studies,
$\alpha_{jk}$ = the effect of treatment j being implemented in study k, and
$\varepsilon_{ijk}$ = random error associated with this child's score.

Each simulation will represent an individual student's score, designated as part of either the control or treatment group. This latter feature will be expressed as an independent variable embedded within $\alpha_{jk}$. At the meta-analytic level, the characterization of samples of studies as either random- or fixed-effects will arise from the value of $\tau^2$ as either equal to 0 (fixed-effects) or greater than 0 (random-effects). $\tau^2$ will be expressed in the $\alpha_{jk}$ term. The extent of heterogeneity of the sample will be manipulated during the simulation by varying the error term, $\varepsilon_{ijk}$. In a similar manner, the skewness and kurtosis of the sample effects will be controlled. K will be varied based on the number of iterations.

When considering a heterogeneous outcome or nonnull condition, two additional components will need to be added. If heterogeneity is simulated with the influence of a systematic moderator, studies will be grouped according to the students' participation in either a 1st or 3rd grade reading program.
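The score-generating model above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SAS/IML program used in the study: errors are drawn as normal deviates only (no Fleishman transformation), each study's treatment effect is drawn from a normal distribution with mean `delta` and variance `tau2`, and all function and parameter names are hypothetical.

```python
import math
import random
import statistics


def simulate_study(n_treat, n_ctrl, effect_k, rng, grand_mean=0.0,
                   sd_treat=1.0, sd_ctrl=1.0):
    """Generate scores X = grand_mean + treatment effect + random error."""
    treat = [grand_mean + effect_k + rng.gauss(0.0, sd_treat)
             for _ in range(n_treat)]
    ctrl = [grand_mean + rng.gauss(0.0, sd_ctrl) for _ in range(n_ctrl)]
    return treat, ctrl


def standardized_mean_difference(treat, ctrl):
    """Effect size d: mean difference divided by the pooled standard deviation."""
    n_t, n_c = len(treat), len(ctrl)
    pooled_var = ((n_t - 1) * statistics.variance(treat)
                  + (n_c - 1) * statistics.variance(ctrl)) / (n_t + n_c - 2)
    return (statistics.mean(treat) - statistics.mean(ctrl)) / math.sqrt(pooled_var)


def simulate_meta_analysis(k, n_treat, n_ctrl, delta, tau2, seed=None):
    """Return k simulated effect sizes; tau2 = 0 gives a fixed-effects set."""
    rng = random.Random(seed)
    effects = []
    for _ in range(k):
        # Assumed random-effects step: study effect varies around delta.
        effect_k = rng.gauss(delta, math.sqrt(tau2))
        treat, ctrl = simulate_study(n_treat, n_ctrl, effect_k, rng)
        effects.append(standardized_mean_difference(treat, ctrl))
    return effects


if __name__ == "__main__":
    print(simulate_meta_analysis(k=10, n_treat=20, n_ctrl=20,
                                 delta=0.0, tau2=0.33, seed=1))
```

Passing an explicit seed makes a run reproducible, which mirrors the use of recorded seed values in each execution of the simulation program.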
The linear model can then be characterized as

$$X_{ijkm} = \mu + \alpha_{jkm} + \beta_{m} + (\alpha\beta)_{jkm} + \varepsilon_{ijk}$$

where $\beta_{m}$ represents a main effect for overall differences in 1st and 3rd grade reading students, and $(\alpha\beta)_{jkm}$ expresses the differential effectiveness of the reading program for use with 1st and 3rd grade students. The magnitude of this last component establishes the degree of falsity of the meta-analytic null hypothesis of homogeneity of effects.

The type of score distribution and group variance ratios are patterned after the Harwell (1997) and Kromrey and Hogarty (1998) studies. Kromrey and Hogarty (1998) recommended the inclusion of random within-group sample sizes, as included in the present study. Hedges and Vevea (1998) incorporated the between-study variances included in the present study. As between-study variance has been shown to significantly influence the resulting Q statistic (independent of primary study variance), it too has been included.

Test Statistics Examined

For each simulated meta-analysis, the significance probability of heterogeneous effects will be analyzed by employing each of four tests of homogeneity of effects. Tests of homogeneity of effects reflect whether sampling error alone accounts for the variation in effect sizes or whether features of studies, samples, treatment designs, or outcome measures also contribute to that variation; a significant result indicates that more variance exists in effect size estimates across studies than predicted by sampling error alone (Cooper & Dorr, 1995, p. 489). Further, "Homogeneity analysis compares the amount of variance exhibited by a set of effect sizes with the amount of variance expected if only sampling error is operating" (Cooper, Nye, Charlton, Lindsay & Greathouse, 1996, p. 251). Homogeneity tests, as a whole, apply either regression or correlation coefficients to determine the average effect. In this study, regression coefficients are employed.
The rationale for applying regression rather than correlation is that regression coefficients provide a more apt measure of magnitude than do correlation coefficients (Abelson, 1997). The slope b has dimensional units, and r is susceptible to range restriction, whereas b is not.

For both traditional Q and the random-effects procedure, first apply the regression equation

$$d_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + u_i + e_i$$

where
$d_i$ = the effect size estimate for the ith study,
$\beta_0$ = the grand mean effect size,
$\beta_1$ = the expected mean difference in effect sizes between studies of different classes,
$X_{i1}$ = some amount of the first study characteristic in the ith study,
$u_i$ = the residual, or component of the effect size not explained by X, and
$e_i$ = the error of estimation.

For each simulated meta-analysis, consisting of k studies, the Q test of homogeneity will be calculated. This statistic is derived by

$$Q = \sum_{i=1}^{k} \frac{(d_i - d_+)^2}{\hat{\sigma}^2(d_i)}$$

where $d_i$ is the sample effect size for the ith study (the minimum variance estimate of the sample effect) and $d_+$ is the weighted average effect size across the k studies,

$$d_+ = \frac{\sum_{i=1}^{k} d_i / \hat{\sigma}^2(d_i)}{\sum_{i=1}^{k} 1 / \hat{\sigma}^2(d_i)}$$

The element $v_i = \hat{\sigma}^2(d_i)$ used to weight the reliability of each study is the variance of the effect size in the ith study, obtained as

$$v_i = \frac{n_i^T + n_i^C}{n_i^T n_i^C} + \frac{d_i^2}{2(n_i^T + n_i^C)}$$

where $n_i^T$ and $n_i^C$ are the sample sizes for the treatment and control groups in the ith study. The obtained test statistic, Q, is evaluated for statistical significance by comparing its magnitude to a critical value of chi-square with k-1 degrees of freedom. If the obtained Q exceeds the critical chi-square value, the null hypothesis of homogeneity of effects is rejected.

As mentioned previously, the random-effects model interprets the field of studies as part of a wider and unknown universe. The random-effects procedure bears one unique element, differentiating it both theoretically and algorithmically.
This element is variously referred to as the between-studies variance component, the estimator of population variance, the estimator of the variance of population effects and the estimator of the population variance component. This element is added to the sampling error to compute the estimate of the total variance of the average effect. Typically, this statistic has increased variance due to the added uncertainty built into its algorithm. Often, differences between fixed- and random-effects standard errors are due to substantial between-study heterogeneity in the effects (Hedges & Vevea, 1998, p. 494). Typically, the standard error is substantially larger than in the fixed-effects test. Its algorithm includes the estimate of population variance, the component accounting for the added uncertainty. The distinction of between-studies variance translates into added degrees of freedom in the denominator incorporated into the individual study variances, contributing to the larger standard error of the mean. A larger standard error results in wide confidence intervals and low power.

$$Q_+ = \sum_{i=1}^{k} \frac{(d_i - d_+)^2}{\sigma^2(d_i \mid \delta_i)}$$

Note: The primary difference between the traditional Q and $Q_+$ is the inclusion of the variance of sample effect sizes while holding constant the population effect size(s), $\delta_i$. This modification creates the variance of the conditional distribution of $d_i$ given $\delta_i$. Hedges and Olkin (1985) explain that the algorithm applies expected values of the mean squares. These values are represented in terms of variance components. Sample values are substituted for these expected values and used to solve for the variance components. This procedure results in unbiased estimates of the variance components. Specifically, if the design is unbalanced, compute the weights for each study effect accordingly (see Raudenbush, 1992). The random-effects weights are

$$w_i = \frac{1}{v_i + \tau^2}$$

where $v_i$ is defined as the conditional variance, or the square of the standard error, for a given study.
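As an illustration of the fixed-effects Q computation described earlier in this section, here is a minimal Python sketch. It follows the formulas for the effect-size variance, the weighted mean effect and Q; the function names are hypothetical, and the comparison of Q with a chi-square critical value on k-1 degrees of freedom is left to the reader's statistical tables or software.

```python
def effect_variance(d, n_treat, n_ctrl):
    """Estimated variance of a standardized mean difference d."""
    total = n_treat + n_ctrl
    return total / (n_treat * n_ctrl) + d * d / (2.0 * total)


def fixed_effects_q(effects, variances):
    """Return (Q, d_plus): homogeneity statistic and weighted mean effect."""
    weights = [1.0 / v for v in variances]
    d_plus = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - d_plus) ** 2 for w, d in zip(weights, effects))
    return q, d_plus


if __name__ == "__main__":
    d_values = [0.10, 0.45, 0.80]           # illustrative effect sizes
    sizes = [(20, 20), (16, 24), (24, 16)]  # illustrative (treatment, control) n
    v_values = [effect_variance(d, nt, nc) for d, (nt, nc) in zip(d_values, sizes)]
    q, d_plus = fixed_effects_q(d_values, v_values)
    print(f"d+ = {d_plus:.3f}, Q = {q:.3f} on {len(d_values) - 1} df")
```

The effect sizes and group sizes in the usage example are made up for demonstration and do not come from the simulation.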
Once Q is calculated for fixed effects, this statistic can be used to calculate c, which then permits the computation of $\tau^2$ for the random-effects procedure:

$$c = \sum w_i - \frac{\sum w_i^2}{\sum w_i}$$

Next, test the significance of the effect-size variance component ($\tau^2$), that is, $H_0\!: \tau^2 = 0$:

$$\hat{\tau}^2 = \frac{Q - (k - 1)}{c}$$

If this value is larger than zero, then one can no longer assume the presence of a single effect. One then recomputes the random-effects weights using the estimate of $\tau^2$. The random-effects weighted mean effect size is

$$d_+^{RE} = \frac{\sum w_i^{*} d_i}{\sum w_i^{*}}$$

where $w_i^{*}$ denotes the recomputed random-effects weights. Using $d_+$ for random effects, one constructs the confidence interval around the mean effect.

The conditionally-random effects Q test is a procedure for maintaining a degree of flexibility in the process of testing for homogeneity of effects. Initially, the traditional fixed-effects Q is used to test for the homogeneity of the effects. If the null hypothesis of homogeneity is retained, the decision is made to combine the study effects and compute an average or common effect. But if the synthesist rejects the null, further testing for moderating variables is conducted using the random-effects Q and the corresponding weighted least squares procedure.

The Permuted Q-Between test is a randomization test designed to generate a more extensive sample from a limited number of studies by randomly reassigning studies to each of two classes. It is a modification of the fixed-effects test of homogeneity first investigated by Hedges and Olkin (1985), who refer to the Q-between test as the between-class goodness-of-fit statistic $Q_B$. The Permuted Q-Between tests a different sort of hypothesis than the other homogeneity tests. It tests the null hypothesis that the average effect size is the same across classes. Specifically, it partitions the observed effect sizes into groups according to the hypothesized moderator variable and tests the tenability of the null hypothesis that population effects are the same in the subgroups.
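The variance-component estimate and the random-effects weighted mean described in this section can be sketched in Python as below. The sketch follows the formulas for c, the estimate of tau-squared and the random-effects weights; truncating a negative estimate at zero is an assumption of this sketch (a common convention), and the function names are hypothetical.

```python
import math


def estimate_tau_squared(effects, variances):
    """Moment estimate of tau^2 from the fixed-effects Q statistic."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    d_plus = sum(w * d for w, d in zip(weights, effects)) / sum_w
    q = sum(w * (d - d_plus) ** 2 for w, d in zip(weights, effects))
    c = sum_w - sum(w * w for w in weights) / sum_w
    # Assumption: negative estimates are truncated at zero.
    return max(0.0, (q - (k - 1)) / c)


def random_effects_mean(effects, variances, tau2):
    """Random-effects weighted mean and its standard error."""
    weights = [1.0 / (v + tau2) for v in variances]
    sum_w = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / sum_w
    return mean, math.sqrt(1.0 / sum_w)


if __name__ == "__main__":
    d_values = [0.0, 2.0]
    v_values = [0.5, 0.5]
    tau2 = estimate_tau_squared(d_values, v_values)
    mean, se = random_effects_mean(d_values, v_values, tau2)
    # An approximate 95% interval uses the normal critical value 1.96.
    print(tau2, mean, (mean - 1.96 * se, mean + 1.96 * se))
```

Because tau-squared is added to every study's variance, the random-effects standard error can never be smaller than the fixed-effects one, which is the source of the wider confidence intervals noted above.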
It is a simpler hypothesis in that it directly tests for moderating variables. However, this test maintains the fixed-effects assumption that all of the observed variation is sampling error. Moreover, the sample effect sizes originate from the same population. The variable effects have been observed. The test statistic is based on the total weighted sum of squares. The denominator is the normalized weighted sum of squares of the effect size indices about the grand mean $d_{++}$. Having a common effect size across studies indicates that the Q-between possesses an approximate chi-square distribution with k-1 degrees of freedom.

According to Noreen (1989), a randomization test "is a procedure for assessing the significance of a test statistic [and] involves randomizing the ordering of one variable relative to another" (p. 12). Orderings are permutations of the variables relative to each other. Randomization tests are nonparametric and based on an empirical, rather than theoretical, distribution. Nonparametric refers to the manner in which the nature of the population distribution is not specified explicitly. Almost all permutation tests are nonparametric and vice versa. They do not require random sampling, but are based on random assignment within the study.

Permuting this test involves the reordering of study effects. The number of reorderings is found by dividing the factorial of K, the number of study effects, by the product of the factorial of R, the number of studies in the smaller group (either treatment or control), and the factorial of K minus R. For example, if the number of study effects equals 10 and the number of studies in the smaller group equals 4, the number of combinations of K (now referred to as N) taken R at a time gives the number of permutations to be conducted. In this case, 210 permutations would be simulated, by computing

$$\frac{N!}{R!\,(N-R)!} = \frac{10!}{4!\,(10-4)!} = \frac{3{,}628{,}800}{24 \times 720} = 210$$

Once the number of permutations has been determined, one can then generate the data to be tested for homogeneity of effects. The algorithm for the Q-between test follows:

$$Q_{bet} = \sum_{i} \frac{(d_{i+} - d_{++})^2}{\hat{\sigma}^2(d_{i+})} = \sum_{i}\sum_{j} \frac{(d_{i+} - d_{++})^2}{\hat{\sigma}^2(d_{ij})}$$

with the summation over i classes and j studies in each class, where $d_{i+}$ refers to the average weighted effect size for class i, $d_{++}$ represents the grand mean effect size, and $d_{ij}$ represents the effect size for the jth study in the ith class (Hedges & Olkin, 1985).

Kromrey and Hogarty (1998) found this procedure to be more robust to the dual violations of normality and homogeneity of variance. When the Q-between test is employed using a permutation strategy, instead of a chi-square distribution, Type I error control is well maintained. Because it does not involve a normal or chi-square distribution, the Permuted Q-Between is freed from many of the assumptions, such as normality, upon which other homogeneity tests are based. The primary limitation noted by their study is its inability to generate a sufficient data set with which to test at the .05 alpha level under conditions of small K (generally 5 or less).

Data Analysis

Each of the 1,458 experimental conditions extrapolated from the seven independent variables is to be analyzed using each of the four tests of homogeneity. Effectiveness of each test is to be evaluated based on the proportion of the 5,000 simulations of each meta-analytic condition reflecting adequate Type I error control at the nominal alpha level of .05. As defined previously, rejection rates above .056 are termed inflated, whereas empirical values below .044 are termed conservative. In addition, the estimated Type I error rates will be reported. For nonnull conditions, power estimates will be calculated. The rejection rates of all variable combinations will be presented by inserting these in the design matrix, such as the one appearing in Figure 1 of this chapter.
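The permutation count and the Q-between statistic described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study's SAS/IML code: studies fall into two classes labeled 0 and 1, every distinct reassignment is enumerated exactly (instead of being sampled at random), and the function names are hypothetical.

```python
from itertools import combinations
from math import comb


def q_between(effects, variances, labels):
    """Between-class weighted sum of squares about the grand mean."""
    weights = [1.0 / v for v in variances]
    grand_mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q_b = 0.0
    for cls in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        w_cls = sum(weights[i] for i in idx)  # equals 1 / variance of class mean
        d_cls = sum(weights[i] * effects[i] for i in idx) / w_cls
        q_b += w_cls * (d_cls - grand_mean) ** 2
    return q_b


def permuted_q_between(effects, variances, labels):
    """Return (observed Q-between, permutation p-value, number of arrangements)."""
    k = len(effects)
    r = sum(1 for lab in labels if lab == 1)
    observed = q_between(effects, variances, labels)
    n_arrangements = 0
    n_extreme = 0
    for chosen in combinations(range(k), r):
        chosen_set = set(chosen)
        relabeled = [1 if i in chosen_set else 0 for i in range(k)]
        n_arrangements += 1
        # Small tolerance guards against floating-point ties.
        if q_between(effects, variances, relabeled) >= observed - 1e-12:
            n_extreme += 1
    return observed, n_extreme / n_arrangements, n_arrangements


if __name__ == "__main__":
    print(comb(10, 4))  # number of ways to choose 4 of 10 studies
    print(permuted_q_between([0.0, 0.0, 1.0, 1.0], [1.0] * 4, [0, 0, 1, 1]))
```

With only four studies split two and two, there are six arrangements, so the smallest attainable p-value is far above .05. This illustrates why very small K leaves too few reorderings to test at the .05 level.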
Furthermore, the marginal rejection rates for each factor of the study, indicating comparative Type I error control, will be presented in the form of graphs. Both the proportion of conditions with adequate Type I error control and the average Type I error rate estimates will be presented graphically, categorized by each of the seven independent variables. Finally, box and whisker plots will be used to illustrate the distribution of the Type I error rates of each of the four tests. The organizing variable will be K, the number of studies included in the meta-analysis.

Chapter Four

Results

The purpose of this study was to investigate the power and Type I error control of the permuted Q, random-effects test, fixed-effects test and regular Q test under varying levels of heterogeneity of effects (τ²=0, τ²=.33, and τ²=1) and at α level .05, as well as three variance ratios, two different numbers of studies (hereafter referred to as K) and 3 levels of sample sizes within studies (hereafter referred to as N=10, 40 and 200). Test performance was based on a set of criteria established by Bradley (1978), wherein the robustness of the test depends upon the range of p results falling around a preset α. The relative efficacy of these three tests (and one conditionally-random procedure) was compared between the K=10 and K=30 conditions and across variance ratios between control and experimental groups, increasing skewness/kurtosis and increasing sample sizes within meta-analytic studies. The comparison of the relative performance of these three tests and the conditionally-random procedure within each set of controlled conditions should enhance the appropriateness of practitioners' test selection for meta-analysis. In particular, the research questions addressed were the following:

1. To what extent is the Type I error rate of the fixed-effects test (FE), permuted Qbet, random-effects test (RE) and conditionally-random procedure (CR) maintained near the nominal alpha level across variations in the number of primary studies included in the meta-analysis, sample sizes in primary studies, heterogeneity of variance, varying degrees of heterogeneity of effects (τ²) and primary study skewness and kurtosis?

2. What is the relative statistical power of the fixed-effects test (FE), permuted Qbet, random-effects test (RE) and conditionally-random procedure (CR) given variations in the number of primary studies included in the meta-analysis, sample sizes in primary studies, heterogeneity of variance, varying degrees of heterogeneity of effects (τ²) and primary study skewness and kurtosis?

Results of each of the 1,458 experimental conditions arising from the factoring of seven independent variables across each of the three tests of homogeneity and the conditionally-random procedure are presented. Effectiveness of each test was evaluated based on the proportion of the 5,000 simulations of each meta-analytic condition reflecting adequate Type I error control at the nominal alpha level of .05. As defined by Bradley (1978), rejection rates above .055 are termed inflated, whereas empirical values below .045 are termed conservative for nominal α=.05. For nominal alpha level .10, rejection rates above .11 are inflated, while those below .09 are conservative.

As an overview, the box and whisker plots for each primary condition will be presented. All conditions will be incorporated while isolating the one variable of interest. These plots will be presented by the following 5 controlled variables: 2 plots for K=10 and K=30; 3 plots for τ² (τ²=0, τ²=.33, and τ²=1); 3 plots by N (N=10, N=40 and N=200); 3 plots for skewness/kurtosis (0/0, 1/3 and 2/6); and 3 plots for each of the variance ratios of 1:1, 2:1 and 4:1, respectively.
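The classification rule stated above can be written as a short Python sketch. The default bounds are the .045 and .055 limits given for nominal alpha .05; the function names and the label "adequate" are hypothetical conveniences.

```python
def classify_rate(rate, lower=0.045, upper=0.055):
    """Label an estimated Type I error rate against the stated limits."""
    if rate > upper:
        return "inflated"
    if rate < lower:
        return "conservative"
    return "adequate"


def proportion_adequate(rates, lower=0.045, upper=0.055):
    """Proportion of conditions whose rejection rate falls within the limits."""
    labels = [classify_rate(r, lower, upper) for r in rates]
    return labels.count("adequate") / len(labels)


if __name__ == "__main__":
    example_rates = [0.050, 0.060, 0.040, 0.051]  # illustrative values only
    print([classify_rate(r) for r in example_rates])
    print(proportion_adequate(example_rates))
    # For nominal alpha .10, pass the .09 and .11 limits explicitly.
    print(classify_rate(0.10, lower=0.09, upper=0.11))
```

The rates in the usage example are invented for demonstration and are not results from the simulation.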
The estimated Type I error rates are reported. For nonnull conditions, power estimates have been calculated. The rejection rates of all variable combinations are presented within the design matrices, appearing in Tables 1 through 6 of this chapter. Tables illustrating the proportion of conditions with adequate Type I error control and the average Type I error rate estimates follow. Lastly, box and whisker plots used to display the distribution of the Type I error rates of each of the three tests and the conditionally-random procedure are presented.

Control of Type I Error Rate

For purposes of this study, the control of Type I error is examined using box and whisker plots, the proportion of simulations with adequate Type I error control, average Type I error rates for each condition and marginal error rates for individually simulated conditions. Type I error control must first be determined before further examination of power becomes relevant for any of the conditions with a τ² or effect greater than 0. Type I error occurs in the meta-analytic case when the researcher has deemed that a differential treatment effect (a moderating effect) has occurred for separately defined groups when in fact no true difference in effect exists.

Box Plots for K

In the box plots for both K=10 and K=30 (Figures 1 and 3), the permuted Q between, the random-effects test (RE) and the conditionally random procedure (CR) demonstrated markedly greater concentrations of conditions in which Type I error was better maintained than did either the regular Q test or the fixed-effects (FE) test. For this reason, only the box plots for the three better performing tests were magnified so that a closer comparison can be made among them. Because the distributions for the regular Q and FE tests cover a much broader range, magnification of that range is less necessary. Figure 2 reveals that the permuted Q between maintained Type I error control to the highest degree of the three best performing tests.
For K=10, the random-effects test (RE) and the conditionally random (CR) procedure performed similarly, showing a larger spread in the distribution of Type I error rate estimates.

Comparing the K=10 (see Figure 1) to the K=30 (see Figure 3) condition, the regular Q test produced a greater number of conditions with lower, though still inflated, Type I error when K=10. At K=10, the median error rate for the regular Q test fell at .75, whereas at K=30 it was 1. Surprisingly, the FE test maintained a similar frequency of conditions with lower, though still inflated, Type I error, regardless of K.

The RE and CR tests performed similarly at both K=10 (see Figure 2) and K=30 (see Figure 4). At K=10, each had slightly inflated medians with inflated Type I error at .10. At K=30, the median for each decreased to approximately .065. As will be demonstrated later, this pattern of performance was borne out by the other analyses.

Of all of the tests, only permuted Q maintained Type I error control consistently across conditions of K. Although deviations from nominal alpha were present, at least 50% of the conditions fell within Bradley's criterion of acceptability. Only a few of the permuted Q conditions exceeded .06 at K=30 (see Figure 4).

Figure 1. Distributions of Type I Error Rate Estimates Across Experimental Conditions for K=10

Figure 2. Magnified Distributions of Type I Error Rate Estimates Across Experimental Conditions for K=10

Figure 3. Distributions of Type I Error Rate Estimates Across Experimental Conditions for K=30
[Box plot: Type I error rate (0.00 to 1.05) for CR, FE, RE, PQb and Reg Q; N = 485 conditions per test.]

Figure 4. Magnified Distributions of Type I Error Rate Estimates Across Experimental Conditions for K=30
[Box plot: Type I error rate (0.00 to .20) for CR, RE and PQb; N = 485 conditions per test.]

*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

Box Plots for τ²

When τ²=0, the between-studies variance is essentially not a consideration. Under τ²=0 (see Figure 5), each of the tests tended to maintain Type I error rates around nominal α=.05. The RE, FE, CR and regular Q tests produced some conservative rates and some moderately liberal rates. As the reader will recall, the model underlying the regular Q test requires the test to be most sensitive to within-study variance. Therefore, the reported outliers are to be expected.

Increasing τ² introduces between-studies variance. Because Q was constructed to be sensitive to heterogeneity across studies, Q cannot simultaneously maintain robustness to Type I errors to the same extent that other tests do. But Q's particular sensitivity to variance permits it to detect any heterogeneity present. The reader will discover how increasing τ² permitted this test to display increased power while other tests (including the most powerful one) lost power.

Similarly, the FE test did not maintain robustness when many factors were introduced. At τ²=0, FE performed much like the RE and CR tests. But with increases in τ² and the introduction of between-studies variance, the median error rate rose from just under .05 to .30 (for τ²=.33, Figure 7) and .50 (at τ²=1, Figure 9).

The marginal error rates of the RE and CR tests closely mirrored each other.
When τ²=0, the median for both tests fell below the nominal α of .05. Therefore, these tests provided conservative estimates of α. When τ² values increased from 0 to .33 and 1.0, median estimates greater than .05 ensued. Though the tests produced marginal error rates less inflated than those resulting from the FE test, those rates still exceeded .05.

Only Permuted Q remained robust to Type I error across all τ² conditions. The median rejection rates closely approximated .05. Indeed, most rejection rates did not deviate far from this value.

Figure 5. Distribution of Type I Error Rate Estimates Across Experimental Conditions for τ² = 0
[Box plot: Type I error rate (0.00 to 1.05) for CR, FE, RE, PQb and Reg Q; N = 324 conditions per test.]

Figure 6. Magnified Distribution of Type I Error Rate Estimates Across Experimental Conditions for τ² = 0
[Box plot: Type I error rate (0.00 to .20) for CR, RE and PQb; N = 324 conditions per test.]

Figure 7. Distribution of Type I Error Rate Estimates Across Experimental Conditions for τ² = .33
[Box plot: Type I error rate (0.00 to 1.05) for CR, FE, RE, PQb and Reg Q; N = 324 conditions per test.]

Figure 8. Magnified Distribution of Type I Error Rate Estimates Across Experimental Conditions for τ² = .33
[Box plot: Type I error rate (0.00 to .20) for CR, RE and PQb; N = 324 conditions per test.]

Figure 9. Distribution of Type I Error Rate Estimates Across Experimental Conditions for τ² = 1.0
[Box plot: Type I error rate (0.00 to 1.05) for CR, FE, RE, PQb and Reg Q; N = 321 conditions per test.]

Figure 10. Magnified Distribution of Type I Error Rate Estimates Across Experimental Conditions for τ² = 1.0
[Box plot: Type I error rate (0.00 to .20) for CR, RE and PQb; N = 321 conditions per test.]

Box Plots for Primary Study Sample Size

Harwell (1997) suspected that small N paired with large K contributes to regular Q's greater tendency for inflation of Type I error. Applying the FE test while increasing primary study sample size resulted in increasingly inflated Type I error. Q's median rose from just over nominal .05 to .30 (N=40, see Figure 13) and .60 (for N=200, see Figure 15). This result stands in contrast to Harwell's conclusion, except that K was not controlled for in this specific analysis.

As sample size increased, RE and CR performance again reflected the same pattern. At N=10 (see Figure 11), the median rates exceeded nominal alpha, but not .10. Although the median remained relatively unchanged at all 3 levels of N, the 3rd quartile rose from approximately .07 to .11 as N was elevated from 10 to 40. The dispersion of error rates did not change dramatically from N=40 to N=200.

Lastly, the permuted Q maintained adequate Type I error control across all 3 levels. In fact, the distribution of error rates remained unchanged from N=10 to N=200.
Figure 11 Distribution of Type I Error Rate Estimates Across Experimental Conditions for N = 10
[Box plots of Type I error rate by test (Reg Q, PQb, RE, FE, CR); N = 324 per test]

Figure 12 Magnified Distribution of Type I Error Rate Estimates Across Experimental Conditions for N = 10
[Box plots of Type I error rate by test (PQb, RE, CR); N = 324 per test]

Figure 13 Distribution of Type I Error Rate Estimates Across Experimental Conditions for N = 40
[Box plots of Type I error rate by test (Reg Q, PQb, RE, FE, CR); N = 322 per test]

Figure 14 Magnified Distribution of Type I Error Rate Estimates Across Experimental Conditions for N = 40
[Box plots of Type I error rate by test (PQb, RE, CR); N = 322 per test]

Figure 15 Distribution of Type I Error Rate Estimates Across Experimental Conditions for N = 200
[Box plots of Type I error rate by test (Reg Q, PQb, RE, FE, CR); N = 323 per test]

Figure 16 Magnified Distribution of Type I Error Rate Estimates Across Experimental Conditions for N = 200
[Box plots of Type I error rate by test (PQb, RE, CR); N = 323 per test]

*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

Box Plots for Variance Heterogeneity

Median error rates tended to decrease for the regular Q and the FE tests (see Figures 17, 19 and 21), as the difference in population variances increased.
In contrast, median error rates remained fairly constant for the RE and CR tests (refer to Figures 18, 20 and 22). What did change for these latter 2 tests was the third quartile of error rates (.075 at sds = 1/1; .09 at sds = 2/1; and .11 at sds = 4/1).

Again, permuted Q maintained Type I error control, with marginal error rates clustered tightly around the nominal alpha. This test appeared unaffected by the change from equal to unequal variances within groups.

Figure 17 Distribution of Type I Error Rate Estimates Across Experimental Conditions for sds = 1/1
[Box plots of Type I error rate by test (Reg Q, PQb, RE, FE, CR); N = 405 per test]

Figure 18 Magnified Distribution of Type I Error Rate Estimates Across Experimental Conditions for sds = 1/1

Figure 19 Distribution of Type I Error Rate Estimates Across Experimental Conditions for sds = 2/1
[Box plots of Type I error rate by test (Reg Q, PQb, RE, FE, CR); N = 322 per test]

Figure 20 Magnified Distribution of Type I Error Rate Estimates Across Experimental Conditions for sds = 2/1
[Box plots of Type I error rate by test (PQb, RE, CR); N = 322 per test]

Figure 21 Distribution of Type I Error Rate Estimates Across Experimental Conditions for sds = 4/1
[Box plots of Type I error rate by test (Reg Q, PQb, RE, FE, CR); N = 323 per test]

Figure 22 Magnified Distribution of Type I Error Rate Estimates Across Experimental Conditions for sds = 4/1
[Box plots of Type I error rate by test (PQb, RE, CR); N = 323 per test]

*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

Box Plots for Skewness/Kurtosis

Contrary to the findings established by Kromrey & Hogarty (1998) and Harwell (1997), changing the population distribution did not appreciably alter Type I error rates for any of the 5 tests under consideration. It should be noted that the box plots provide a general overview and do not permit examination of multiple factors held constant all at once.

Permuted Q maintained adequate Type I error control, as marginal error rates concentrated around nominal alpha. In other words, permuted Q was unaffected by the population distribution.

The RE and CR tests both demonstrated a median marginal error rate just over .05, but less than .10 (see Figures 24, 26 and 28). At least 25% of all error rates fell below .05. None of the error rates exceeded .15. This pattern remained constant across the 3 levels of skewness and kurtosis.

The median error rates for regular Q and the FE tests remained at .80 and .15, respectively (see Figures 23, 25 and 27). Because none of these 5 tests produced varying error rates across differing degrees of population shape, it can be concluded that distribution shape did not have a marked effect on Type I error control.
Figure 23 Distribution of Type I Error Rate Estimates Across Experimental Conditions for skewness/kurtosis = 0/0
[Box plots of Type I error rate by test (Reg Q, PQb, RE, FE, CR); N = 323 per test]

Figure 24 Magnification of Type I Error Rate Estimates Across Experimental Conditions for skewness/kurtosis = 0/0
[Box plots of Type I error rate by test (PQb, RE, CR); N = 323 per test]

Figure 25 Distribution of Type I Error Rate Estimates Across Experimental Conditions for skewness/kurtosis = 1/3
[Box plots of Type I error rate by test (Reg Q, PQb, RE, FE, CR); N = 323 per test]

Figure 26 Magnification of Type I Error Rate Estimates Across Experimental Conditions for skewness/kurtosis = 1/3
[Box plots of Type I error rate by test (PQb, RE, CR); N = 323 per test]

Figure 27 Distribution of Type I Error Rate Estimates Across Experimental Conditions for skewness/kurtosis = 2/6
[Box plots of Type I error rate by test (Reg Q, PQb, RE, FE, CR); N = 324 per test]

Figure 28 Magnification of Type I Error Rate Estimates Across Experimental Conditions for skewness/kurtosis = 2/6
[Box plots of Type I error rate by test (PQb, RE, CR); N = 324 per test]

*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

Summary of Box Plots

In short, small K, increasing τ² and increasing primary study sample sizes resulted in increasingly inflated Type I error for all tests (except permuted Q).
This trend was most pronounced for the regular Q and FE tests and, to a lesser extent, the RE and CR tests. Changes in the population shape did not result in notable changes in any test's performance; the differences lay between tests. These patterns of performance remained constant for each test across conditions of increasing skewness and kurtosis.

Permuted Q uniquely controlled Type I error across all changes in conditions. It was the only test to maintain Type I error rates near .05, regardless of the extremity of the isolated variable.

Proportion of Simulations Controlling Type I Error

Each proportion was calculated by counting the conditions whose marginal rejection rates met Bradley's criterion for acceptable Type I error control and dividing by the number of conditions in that cell. For nominal α = .05, the range of acceptability was restricted to rejection rates no smaller than .045 and no larger than .055.

The proportion tables for both τ² = .33 and τ² = 1 (Tables 5 and 6 for τ² = .33, and Tables 7 and 8 for τ² = 1) dovetail with the results illustrated in the box plots pertaining to τ². Increases in τ² superseded all other factors in terms of producing inflated Type I error rates. Due to this effect, only permuted Q consistently showed proportions of simulations with adequate Type I error control greater than 50%. All other tests yielded either no conditions with adequate Type I error control or only a few conditions in which the proportion showing adequate control was less than 50%. This pattern was true for τ² = .33 and τ² = 1, regardless of whether δ = 0 or .8.

It is worth noting that τ² = 0, δ = .8 returned more conditions with proportions of simulations with adequate Type I error control than τ² = 0, δ = 0. Here, the FE test presented either comparable or greater proportions of Type I error control than either the RE or CR tests, particularly for primary study sample sizes of 6/4, 20/20 and 100/100. As described earlier, this influence can be explained by the model underlying the formulation of the FE test.
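As an illustration of how such a proportion can be computed, the following minimal Python sketch applies Bradley's bounds to the nine rejection rates listed in Table 15 for primary study sample sizes of 5/5 at K = 10 (τ² = 0, δ = 0). The function and variable names are illustrative assumptions, not code from the study. The standard-error helper additionally assumes independent replications and reads the N = 5000 in the Table 15 caption as the number of iterations per condition.

```python
# Illustrative sketch only; all names are hypothetical, not taken from the study.
from math import sqrt

# Bradley's bounds for nominal alpha = .05, as stated in the text.
LOWER, UPPER = 0.045, 0.055


def within_bradley(rate, lower=LOWER, upper=UPPER):
    """True when a rejection rate falls inside the acceptable range."""
    return lower <= rate <= upper


def proportion_adequate(rates):
    """Share of conditions whose rejection rate meets the criterion."""
    return sum(within_bradley(r) for r in rates) / len(rates)


def mc_standard_error(p, replications):
    """Binomial standard error of an estimated rejection rate,
    assuming independent replications (an assumption of this sketch)."""
    return sqrt(p * (1 - p) / replications)


# Rejection rates from Table 15 (5/5, K = 10): 3 variance ratios x 3 shapes.
reg_q = [0.034, 0.040, 0.055, 0.029, 0.030, 0.048, 0.028, 0.033, 0.083]
pqb = [0.047, 0.051, 0.051, 0.049, 0.041, 0.049, 0.052, 0.052, 0.050]

print(round(proportion_adequate(reg_q), 3))      # 0.222
print(round(proportion_adequate(pqb), 3))        # 0.889
print(round(mc_standard_error(0.05, 5000), 4))   # 0.0031
```

The two proportions agree with the 5/5 row for K = 10 in Table 3. Under the stated assumptions, the standard error of a rate near .05 is about .003, so Bradley's band of ±.005 spans roughly 1.6 standard errors on either side of the nominal value.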
Table 3 Proportion of Simulations with Adequate Type I Error Control (τ² = 0, δ = 0)
Note: Proportions reflecting more than half of the conditions expressing adequate Type I error control are bolded.
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

In all but one of the conditions for τ² = 0, δ = 0, permuted Q evidenced the greatest proportion of simulations with adequate Type I error control. Neither distribution shape nor within-group variances appreciably alters this pattern. Furthermore, changes in K do not exert a notable influence.

Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.222  0.889  0.000  0.111  0.000
4/6                         4/6           0.000  1.000  0.000  0.000  0.000
6/4                         4/6           0.000  0.889  0.111  0.222  0.111
20/20                       4/6           0.444  0.444  0.444  0.667  0.556
16/24                       4/6           0.000  0.889  0.111  0.111  0.111
24/16                       4/6           0.111  0.889  0.222  0.111  0.000
100/100                     4/6           0.889  0.889  0.556  0.667  0.667
80/120                      4/6           0.222  0.889  0.111  0.222  0.222
120/80                      4/6           0.222  1.000  0.444  0.333  0.333
5/5                         12/18         0.222  1.000  0.000  0.000  0.000
4/6                         12/18         0.000  1.000  0.000  0.000  0.000
6/4                         12/18         0.000  0.778  0.444  0.111  0.444
20/20                       12/18         0.222  0.889  0.222  0.333  0.333
16/24                       12/18         0.000  0.778  0.000  0.111  0.111
24/16                       12/18         0.000  0.889  0.444  0.000  0.222
100/100                     12/18         0.556  0.889  0.222  1.000  0.778
80/120                      12/18         0.222  1.000  0.222  0.333  0.333
120/80                      12/18         0.111  0.778  0.333  0.222  0.222

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.259  0.852  0.222  0.444  0.296
sk=1.00, kr=3.00            4/6           0.222  0.852  0.259  0.222  0.185
sk=2.00, kr=6.00            4/6           0.222  0.926  0.185  0.148  0.185
sk=0.00, kr=0.00            12/18         0.148  0.815  0.222  0.222  0.296
sk=1.00, kr=3.00            12/18         0.148  0.963  0.259  0.259  0.185
sk=2.00, kr=6.00            12/18         0.148  0.889  0.185  0.222  0.333

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.259  0.926  0.222  0.444  0.296
1/2                         4/6           0.185  0.778  0.259  0.222  0.185
1/4                         4/6           0.259  0.889  0.185  0.148  0.185
1/1                         12/18         0.148  0.926  0.111  0.333  0.259
1/2                         12/18         0.148  0.778  0.222  0.185  0.296
1/4                         12/18         0.148  0.963  0.296  0.185  0.259

Table 4 Proportion of Simulations with Adequate Type I Error Control (τ² = 0, δ = .8)
Note: Proportions reflecting more than half of the conditions expressing adequate Type I error control are bolded.

In 9 of the conditions utilizing permuted Q, 100% of the simulations resulted in adequate Type I error control. For each of the other 4 tests, the proportions with adequate Type I error control did not exceed 56%.

From τ² = 0, δ = 0 (Table 3) to τ² = 0, δ = .8 (Table 4), the RE test yielded increases in the proportion of adequate Type I error control for the equal variance conditions, whether K = 10 or 30. As the variance ratios increased, proportions remained lower for this test. The CR test showed an increase in the proportion only when variances were equal and K = 10. For all other variance ratios, the CR test's proportions remained fairly constant. The FE test demonstrated decreased proportions at the equal variance condition.

Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.111  1.000  0.111  0.111  0.111
4/6                         4/6           0.000  1.000  0.111  0.111  0.111
6/4                         4/6           0.222  1.000  0.444  0.333  0.333
20/20                       4/6           0.111  0.889  0.333  0.444  0.556
16/24                       4/6           0.000  1.000  0.222  0.111  0.111
24/16                       4/6           0.222  1.000  0.556  0.556  0.444
100/100                     4/6           0.222  0.889  0.333  0.333  0.333
80/120                      4/6           0.111  1.000  0.222  0.222  0.222
120/80                      4/6           0.222  1.000  0.556  0.444  0.444
5/5                         12/18         0.000  0.889  0.000  0.111  0.111
4/6                         12/18         0.000  0.889  0.000  0.000  0.000
6/4                         12/18         0.111  1.000  0.111  0.222  0.111
20/20                       12/18         0.111  0.889  0.333  0.444  0.444
16/24                       12/18         0.000  1.000  0.222  0.222  0.111
24/16                       12/18         0.111  1.000  0.556  0.333  0.556
100/100                     12/18         0.222  0.889  0.556  0.333  0.556
80/120                      12/18         0.111  0.778  0.222  0.000  0.111
120/80                      12/18         0.222  0.889  0.556  0.333  0.333

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.148  0.926  0.222  0.296  0.259
sk=1.00, kr=3.00            4/6           0.074  0.926  0.370  0.259  0.407
sk=2.00, kr=6.00            4/6           0.074  0.926  0.148  0.148  0.148
sk=0.00, kr=0.00            12/18         0.222  0.889  0.259  0.259  0.259
sk=1.00, kr=3.00            12/18         0.037  0.852  0.296  0.222  0.370
sk=2.00, kr=6.00            12/18         0.037  0.963  0.222  0.074  0.037

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.148  0.926  0.407  0.333  0.482
1/2                         4/6           0.148  0.889  0.222  0.222  0.185
1/4                         4/6           0.000  0.963  0.111  0.148  0.148
1/1                         12/18         0.148  0.889  0.444  0.222  0.259
1/2                         12/18         0.148  0.889  0.111  0.185  0.222
1/4                         12/18         0.000  0.926  0.222  0.148  0.185

Table 5 Proportion of Simulations with Adequate Type I Error Control (τ² = .33, δ = 0)
Note: Proportions reflecting any conditions expressing adequate Type I error control are bolded.
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

These results (see Table 5) substantiate the box plot outcomes previously reported. As τ² increased from 0 to .33, all tests other than permuted Q displayed either trace or no proportions of adequate Type I error control. Given the constant absence of Type I error control across conditions, there is evidence of a strong influence of this increase in τ² on Type I error control for 4 of the 5 tests.

Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.000  0.889  0.000  0.000  0.000
4/6                         4/6           0.000  0.889  0.000  0.000  0.000
6/4                         4/6           0.000  1.000  0.000  0.000  0.000
20/20                       4/6           0.000  1.000  0.000  0.000  0.000
16/24                       4/6           0.000  0.889  0.000  0.000  0.000
24/16                       4/6           0.000  1.000  0.000  0.000  0.000
100/100                     4/6           0.000  1.000  0.000  0.000  0.000
80/120                      4/6           0.000  1.000  0.000  0.000  0.000
120/80                      4/6           0.000  1.000  0.000  0.000  0.000
5/5                         12/18         0.000  0.778  0.000  0.000  0.000
4/6                         12/18         0.000  1.000  0.111  0.000  0.000
6/4                         12/18         0.000  1.000  0.000  0.000  0.000
20/20                       12/18         0.000  0.889  0.000  0.000  0.000
16/24                       12/18         0.000  1.000  0.000  0.000  0.000
24/16                       12/18         0.000  1.000  0.000  0.000  0.000
100/100                     12/18         0.000  0.889  0.000  0.000  0.000
80/120                      12/18         0.000  0.889  0.000  0.000  0.000
120/80                      12/18         0.000  1.000  0.000  0.000  0.000

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.000  0.963  0.000  0.000  0.000
sk=1.00, kr=3.00            4/6           0.000  0.926  0.000  0.000  0.000
sk=2.00, kr=6.00            4/6           0.000  0.889  0.000  0.000  0.000
sk=0.00, kr=0.00            12/18         0.000  0.963  0.037  0.000  0.000
sk=1.00, kr=3.00            12/18         0.000  0.926  0.000  0.000  0.000
sk=2.00, kr=6.00            12/18         0.000  0.815  0.000  0.000  0.000

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.000  0.926  0.000  0.000  0.000
1/2                         4/6           0.000  0.889  0.000  0.000  0.000
1/4                         4/6           0.000  0.963  0.000  0.000  0.000
1/1                         12/18         0.000  0.926  0.000  0.000  0.000
1/2                         12/18         0.000  0.926  0.037  0.000  0.000
1/4                         12/18         0.000  0.852  0.000  0.000  0.000

Table 6 Proportion of Simulations with Adequate Type I Error Control (τ² = .33, δ = .8)
Note: Proportions reflecting any conditions expressing adequate Type I error control are bolded.
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

Again, with a few exceptions, only permuted Q evidenced a substantial proportion of simulations with adequate Type I error control (see Table 6).

Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.000  1.000  0.000  0.000  0.000
4/6                         4/6           0.000  0.778  0.000  0.000  0.000
6/4                         4/6           0.000  1.000  0.000  0.000  0.000
20/20                       4/6           0.000  0.889  0.000  0.000  0.000
16/24                       4/6           0.000  0.889  0.000  0.000  0.000
24/16                       4/6           0.000  1.000  0.000  0.000  0.000
100/100                     4/6           0.000  1.000  0.000  0.000  0.000
80/120                      4/6           0.000  0.778  0.000  0.000  0.000
120/80                      4/6           0.000  1.000  0.000  0.000  0.000
5/5                         12/18         0.000  0.889  0.000  0.000  0.000
4/6                         12/18         0.000  0.889  0.333  0.000  0.333
6/4                         12/18         0.000  1.000  0.111  0.000  0.000
20/20                       12/18         0.000  0.667  0.000  0.000  0.000
16/24                       12/18         0.000  1.000  0.000  0.000  0.000
24/16                       12/18         0.000  1.000  0.000  0.000  0.000
100/100                     12/18         0.000  0.889  0.000  0.000  0.000
80/120                      12/18         0.000  0.889  0.000  0.000  0.000
120/80                      12/18         0.000  1.000  0.000  0.000  0.000

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.000  0.926  0.000  0.000  0.000
sk=1.00, kr=3.00            4/6           0.000  0.889  0.000  0.000  0.000
sk=2.00, kr=6.00            4/6           0.000  0.889  0.000  0.000  0.000
sk=0.00, kr=0.00            12/18         0.000  0.852  0.037  0.000  0.037
sk=1.00, kr=3.00            12/18         0.000  0.926  0.037  0.000  0.037
sk=2.00, kr=6.00            12/18         0.000  0.889  0.037  0.000  0.037

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.000  0.926  0.000  0.000  0.000
1/2                         4/6           0.000  0.889  0.000  0.000  0.000
1/4                         4/6           0.000  0.889  0.000  0.000  0.000
1/1                         12/18         0.000  0.926  0.000  0.000  0.000
1/2                         12/18         0.000  0.852  0.000  0.000  0.000
1/4                         12/18         0.000  0.889  0.111  0.000  0.111

Table 7 Proportion of Simulations with Adequate Type I Error Control (τ² = 1, δ = 0)
Note: Proportions reflecting any conditions expressing adequate Type I error control are bolded.
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

The pattern established as τ² increased from 0 to .33 continued from .33 to 1.0 (see Table 7). It did not make a difference whether δ = 0 or .8. Because the absence of adequate Type I error control continued from τ² = .33 to 1, one can conclude that increasing heterogeneity of effects had a strong influence on the performance of these tests. Only permuted Q maintained adequate Type I error control in more than 50% of the simulations. Despite increases in heterogeneity, permuted Q maintained robustness across all changes in variance and population shape.
Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.000  0.778  0.000  0.000  0.000
4/6                         4/6           0.000  0.889  0.000  0.000  0.000
6/4                         4/6           0.000  1.000  0.000  0.000  0.000
20/20                       4/6           0.000  0.778  0.000  0.000  0.000
16/24                       4/6           0.000  1.000  0.000  0.000  0.000
24/16                       4/6           0.000  0.889  0.000  0.000  0.000
100/100                     4/6           0.000  0.889  0.000  0.000  0.000
80/120                      4/6           0.000  0.778  0.000  0.000  0.000
120/80                      4/6           0.000  0.889  0.000  0.000  0.000
5/5                         12/18         0.000  1.000  0.000  0.000  0.000
4/6                         12/18         0.000  1.000  0.000  0.000  0.000
6/4                         12/18         0.000  1.000  0.000  0.000  0.000
20/20                       12/18         0.000  1.000  0.000  0.000  0.000
16/24                       12/18         0.000  0.667  0.000  0.000  0.000
24/16                       12/18         0.000  1.000  0.000  0.000  0.000
100/100                     12/18         0.000  1.000  0.000  0.000  0.000
80/120                      12/18         0.000  0.889  0.000  0.000  0.000
120/80                      12/18         0.000  0.778  0.000  0.000  0.000

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.000  0.963  0.000  0.000  0.000
sk=1.00, kr=3.00            4/6           0.000  0.815  0.000  0.000  0.000
sk=2.00, kr=6.00            4/6           0.000  0.889  0.000  0.000  0.000
sk=0.00, kr=0.00            12/18         0.000  0.889  0.000  0.000  0.000
sk=1.00, kr=3.00            12/18         0.000  0.963  0.000  0.000  0.000
sk=2.00, kr=6.00            12/18         0.000  0.926  0.000  0.000  0.000

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.000  0.926  0.000  0.000  0.000
1/2                         4/6           0.000  0.852  0.000  0.000  0.000
1/4                         4/6           0.000  0.852  0.000  0.000  0.000
1/1                         12/18         0.000  0.963  0.000  0.000  0.000
1/2                         12/18         0.000  0.889  0.000  0.000  0.000
1/4                         12/18         0.000  0.926  0.000  0.000  0.000

Table 8 Proportion of Simulations with Adequate Type I Error Control (τ² = 1, δ = .8)
Note: Proportions reflecting any conditions expressing adequate Type I error control are bolded.
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

The same pattern of Type I error control is evidenced regardless of the value of δ (see Table 8). Therefore, one can conclude that increases in τ² negatively influenced Type I error control.
Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.000  0.778  0.000  0.000  0.000
4/6                         4/6           0.000  1.000  0.000  0.000  0.000
6/4                         4/6           0.000  1.000  0.000  0.000  0.000
20/20                       4/6           0.000  0.556  0.000  0.000  0.000
16/24                       4/6           0.000  0.778  0.000  0.000  0.000
24/16                       4/6           0.000  0.889  0.000  0.000  0.000
100/100                     4/6           0.000  1.000  0.000  0.000  0.000
80/120                      4/6           0.000  1.000  0.000  0.000  0.000
120/80                      4/6           0.000  1.000  0.000  0.000  0.000
5/5                         12/18         0.000  0.889  0.000  0.000  0.000
4/6                         12/18         0.000  0.889  0.000  0.000  0.000
6/4                         12/18         0.000  0.778  0.000  0.000  0.000
20/20                       12/18         0.000  1.000  0.000  0.000  0.000
16/24                       12/18         0.000  0.778  0.000  0.000  0.000
24/16                       12/18         0.000  0.889  0.000  0.000  0.000
100/100                     12/18         0.000  0.778  0.000  0.000  0.000
80/120                      12/18         0.000  1.000  0.000  0.000  0.000
120/80                      12/18         0.000  1.000  0.000  0.000  0.000

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.000  0.778  0.000  0.000  0.000
sk=1.00, kr=3.00            4/6           0.000  0.852  0.000  0.000  0.000
sk=2.00, kr=6.00            4/6           0.000  0.926  0.000  0.000  0.000
sk=0.00, kr=0.00            12/18         0.000  0.852  0.000  0.000  0.000
sk=1.00, kr=3.00            12/18         0.000  0.778  0.000  0.000  0.000
sk=2.00, kr=6.00            12/18         0.000  0.963  0.000  0.000  0.000

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.000  0.926  0.000  0.000  0.000
1/2                         4/6           0.000  0.889  0.000  0.000  0.000
1/4                         4/6           0.000  0.741  0.000  0.000  0.000
1/1                         12/18         0.000  0.889  0.000  0.000  0.000
1/2                         12/18         0.000  0.852  0.000  0.000  0.000
1/4                         12/18         0.000  0.852  0.000  0.000  0.000

Average Type I Error Rate Estimates

From the previous presentation of the proportion of simulations with adequate Type I error control, it is clear that τ² exerted a substantial influence on the control of Type I error for all of the tests except permuted Q. Permuted Q maintained robustness across all conditions. The following tables (Tables 9-14) present more specific evidence of the Type I error rate for each set of conditions.
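For orientation, each entry in the average tables can be read as the mean of the individual-condition rejection rates that share a factor level. The Python sketch below averages the nine Table 15 rates for primary study sample sizes of 5/5 at K = 10 (τ² = 0, δ = 0). The names are illustrative assumptions rather than code from the study, and agreement with the corresponding row of Table 9 holds only up to rounding of the reported rates.

```python
from statistics import mean

# Rejection rates from Table 15 (sample sizes 5/5, K = 10, tau^2 = 0, delta = 0),
# ordered as three variance ratios within each of three population shapes.
rates = {
    "Reg Q": [0.034, 0.040, 0.055, 0.029, 0.030, 0.048, 0.028, 0.033, 0.083],
    "PQb": [0.047, 0.051, 0.051, 0.049, 0.041, 0.049, 0.052, 0.052, 0.050],
    "RE": [0.032, 0.033, 0.036, 0.039, 0.031, 0.035, 0.030, 0.035, 0.042],
    "FE": [0.033, 0.035, 0.040, 0.041, 0.035, 0.038, 0.030, 0.038, 0.046],
    "CR": [0.032, 0.033, 0.037, 0.039, 0.032, 0.035, 0.030, 0.036, 0.043],
}

# Average rejection rate per test, rounded to three decimals as in the tables.
averages = {test: round(mean(values), 3) for test, values in rates.items()}

for test, avg in averages.items():
    print(f"{test}: {avg:.3f}")
```

An average near .05 can mask individual conditions that are liberal or conservative, which is why the individual-condition rates are worth reading alongside these marginal means.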
Table 9 All Average Type I Error Rates (τ² = 0, δ = 0)
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

This table (Table 9) illustrates that permuted Q consistently maintained rejection rates around .05. Under this condition, the FE test produced the second greatest number of adequate Type I error rates (13/30). The RE and CR tests tended to exhibit conservative Type I error rates when τ² = 0.

Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.042  0.049  0.035  0.037  0.035
4/6                         4/6           0.024  0.050  0.026  0.028  0.027
6/4                         4/6           0.081  0.050  0.043  0.048  0.044
20/20                       4/6           0.050  0.050  0.044  0.049  0.045
16/24                       4/6           0.026  0.050  0.033  0.036  0.034
24/16                       4/6           0.098  0.050  0.054  0.064  0.056
100/100                     4/6           0.049  0.051  0.045  0.050  0.047
80/120                      4/6           0.028  0.049  0.033  0.036  0.034
120/80                      4/6           0.095  0.050  0.054  0.066  0.057
5/5                         12/18         0.039  0.050  0.033  0.034  0.033
4/6                         12/18         0.013  0.049  0.026  0.026  0.026
6/4                         12/18         0.114  0.051  0.041  0.046  0.042
20/20                       12/18         0.050  0.049  0.041  0.046  0.043
16/24                       12/18         0.019  0.052  0.034  0.036  0.035
24/16                       12/18         0.150  0.050  0.049  0.061  0.052
100/100                     12/18         0.048  0.050  0.044  0.050  0.047
80/120                      12/18         0.020  0.051  0.036  0.039  0.037
120/80                      12/18         0.151  0.049  0.052  0.065  0.056

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.054  0.050  0.040  0.046  0.042
sk=1.00, kr=3.00            4/6           0.051  0.049  0.040  0.045  0.041
sk=2.00, kr=6.00            4/6           0.059  0.051  0.042  0.048  0.043
sk=0.00, kr=0.00            12/18         0.064  0.050  0.039  0.044  0.040
sk=1.00, kr=3.00            12/18         0.061  0.050  0.039  0.044  0.041
sk=2.00, kr=6.00            12/18         0.077  0.051  0.041  0.047  0.043

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.039  0.050  0.039  0.043  0.040
1/2                         4/6           0.049  0.050  0.040  0.045  0.042
1/4                         4/6           0.077  0.050  0.043  0.050  0.045
1/1                         12/18         0.034  0.050  0.037  0.041  0.039
1/2                         12/18         0.054  0.051  0.040  0.045  0.042
1/4                         12/18         0.113  0.050  0.041  0.049  0.043

Table 10 All Average Type I Error Rates (τ² = 0, δ = .8)
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

This table (Table 10) reflects a slight decline for the regular Q and FE tests (as compared to τ² = 0, δ = 0) in the number of conditions with average Type I error rates approximating nominal α. The FE test produced adequate Type I error rates in 9 of 30 conditions, as compared with the prior 13 of 30. With the introduction of δ = .8, the RE and CR tests provided a few more instances of robustness. The RE and CR tests demonstrated better robustness when the sample size in the first group was greater than that in the second group.

Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.040  0.050  0.032  0.035  0.033
4/6                         4/6           0.025  0.050  0.027  0.028  0.027
6/4                         4/6           0.066  0.050  0.042  0.045  0.042
20/20                       4/6           0.045  0.049  0.039  0.043  0.040
16/24                       4/6           0.032  0.049  0.032  0.035  0.033
24/16                       4/6           0.074  0.051  0.048  0.055  0.049
100/100                     4/6           0.049  0.049  0.041  0.047  0.043
80/120                      4/6           0.035  0.052  0.034  0.038  0.035
120/80                      4/6           0.076  0.049  0.049  0.057  0.050
5/5                         12/18         0.033  0.050  0.031  0.032  0.031
4/6                         12/18         0.020  0.050  0.026  0.027  0.026
6/4                         12/18         0.080  0.050  0.037  0.040  0.038
20/20                       12/18         0.049  0.050  0.039  0.043  0.041
16/24                       12/18         0.033  0.050  0.033  0.036  0.034
24/16                       12/18         0.102  0.049  0.046  0.055  0.049
100/100                     12/18         0.057  0.052  0.041  0.048  0.044
80/120                      12/18         0.041  0.049  0.032  0.036  0.034
120/80                      12/18         0.110  0.048  0.046  0.056  0.049

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.056  0.050  0.041  0.046  0.042
sk=1.00, kr=3.00            4/6           0.046  0.050  0.038  0.042  0.039
sk=2.00, kr=6.00            4/6           0.046  0.049  0.035  0.040  0.036
sk=0.00, kr=0.00            12/18         0.069  0.050  0.039  0.045  0.041
sk=1.00, kr=3.00            12/18         0.049  0.049  0.037  0.041  0.039
sk=2.00, kr=6.00            12/18         0.057  0.050  0.034  0.039  0.036

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.066  0.050  0.046  0.052  0.047
1/2                         4/6           0.039  0.050  0.037  0.040  0.038
1/4                         4/6           0.042  0.050  0.032  0.035  0.032
1/1                         12/18         0.085  0.049  0.044  0.051  0.046
1/2                         12/18         0.038  0.049  0.036  0.039  0.037
1/4                         12/18         0.052  0.051  0.031  0.035  0.032

Table 11 All Average Type I Error Rates (τ² = .33, δ = 0)
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

Only permuted Q maintained adequate Type I error control (see Table 11). The other 4 tests exhibited inflated Type I error consistently across all conditions. When K was raised to 30, the RE and CR tests appeared to maintain a greater degree of robustness across all conditions. The RE and CR tests outperformed the regular Q and FE tests in terms of overall effectiveness. But the increase in τ² from 0 to .33 promoted Type I error inflation for all of the tests except permuted Q.

Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.340  0.049  0.076  0.102  0.077
4/6                         4/6           0.270  0.050  0.072  0.092  0.074
6/4                         4/6           0.403  0.050  0.083  0.117  0.085
20/20                       4/6           0.909  0.050  0.114  0.309  0.115
16/24                       4/6           0.875  0.051  0.115  0.290  0.116
24/16                       4/6           0.921  0.050  0.115  0.324  0.116
100/100                     4/6           0.999  0.049  0.117  0.614  0.117
80/120                      4/6           0.999  0.049  0.117  0.599  0.117
120/80                      4/6           1.000  0.049  0.117  0.622  0.117
5/5                         12/18         0.640  0.049  0.063  0.100  0.065
4/6                         12/18         0.504  0.050  0.061  0.087  0.063
6/4                         12/18         0.726  0.048  0.062  0.110  0.064
20/20                       12/18         0.999  0.049  0.067  0.302  0.067
16/24                       12/18         0.998  0.049  0.068  0.281  0.068
24/16                       12/18         1.000  0.049  0.068  0.314  0.068
100/100                     12/18         1.000  0.052  0.071  0.611  0.071
80/120                      12/18         1.000  0.050  0.068  0.595  0.068
120/80                      12/18         1.000  0.050  0.069  0.615  0.069

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.723  0.049  0.101  0.334  0.102
sk=1.00, kr=3.00            4/6           0.742  0.049  0.102  0.341  0.103
sk=2.00, kr=6.00            4/6           0.773  0.050  0.105  0.348  0.106
sk=0.00, kr=0.00            12/18         0.842  0.050  0.066  0.329  0.066
sk=1.00, kr=3.00            12/18         0.869  0.050  0.067  0.334  0.067
sk=2.00, kr=6.00            12/18         0.911  0.049  0.066  0.341  0.067

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.736  0.049  0.102  0.338  0.103
1/2                         4/6           0.742  0.050  0.102  0.338  0.103
1/4                         4/6           0.761  0.050  0.104  0.346  0.105
1/1                         12/18         0.862  0.049  0.066  0.332  0.067
1/2                         12/18         0.871  0.049  0.066  0.336  0.066
1/4                         12/18         0.890  0.050  0.067  0.337  0.067

Table 12 All Average Type I Error Rates (τ² = .33, δ = .8)
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

As the average Type I error rates indicate (Table 12), only permuted Q maintained adequate robustness. The rates associated with the population shapes and variances suggest that small K reduced robustness to a greater extent for the RE and CR tests. These tests, though still inflating Type I error, did not perform as poorly when K = 30. The regular Q and FE tests demonstrated inflated Type I error to a much greater extent than the RE or CR tests. It should be noted that the population shape did not present any more of a challenge to the control of Type I error for any of the tests than did the variances.

Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.312  0.051  0.076  0.101  0.077
4/6                         4/6           0.247  0.049  0.069  0.088  0.070
6/4                         4/6           0.375  0.050  0.079  0.112  0.081
20/20                       4/6           0.892  0.052  0.115  0.302  0.116
16/24                       4/6           0.856  0.050  0.111  0.283  0.112
24/16                       4/6           0.902  0.048  0.112  0.313  0.113
100/100                     4/6           0.999  0.050  0.116  0.607  0.116
80/120                      4/6           0.999  0.049  0.117  0.595  0.117
120/80                      4/6           0.999  0.048  0.116  0.610  0.116
5/5                         12/18         0.595  0.050  0.062  0.098  0.064
4/6                         12/18         0.460  0.049  0.057  0.082  0.059
6/4                         12/18         0.690  0.050  0.063  0.106  0.064
20/20                       12/18         0.999  0.048  0.066  0.299  0.066
16/24                       12/18         0.997  0.049  0.068  0.275  0.069
24/16                       12/18         0.999  0.051  0.069  0.304  0.069
100/100                     12/18         1.000  0.048  0.068  0.604  0.068
80/120                      12/18         1.000  0.049  0.069  0.588  0.069
120/80                      12/18         1.000  0.049  0.067  0.608  0.067

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.716  0.049  0.099  0.327  0.100
sk=1.00, kr=3.00            4/6           0.728  0.049  0.101  0.335  0.102
sk=2.00, kr=6.00            4/6           0.750  0.050  0.103  0.342  0.104
sk=0.00, kr=0.00            12/18         0.837  0.049  0.065  0.324  0.065
sk=1.00, kr=3.00            12/18         0.855  0.050  0.065  0.328  0.066
sk=2.00, kr=6.00            12/18         0.888  0.050  0.067  0.336  0.067

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.734  0.051  0.103  0.337  0.104
1/2                         4/6           0.727  0.049  0.100  0.333  0.102
1/4                         4/6           0.733  0.049  0.100  0.333  0.101
1/1                         12/18         0.864  0.049  0.065  0.330  0.066
1/2                         12/18         0.855  0.050  0.066  0.328  0.067
1/4                         12/18         0.861  0.049  0.065  0.330  0.066

Table 13 All Average Type I Error Rates (τ² = 1, δ = 0)
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

The average Type I error rates continued to escalate as heterogeneity of effects increased from .33 to 1 (see Table 12 as compared to Table 13). Again, the 4 tests other than permuted Q tended to produce inflated Type I error. The regular Q and FE tests' Type I error rates inflated to a much greater extent than those of either the RE or CR tests. These patterns were evident across changes in population shape and variance ratios, regardless of normality and equal variance. Lastly, Type I error for the FE test increased dramatically when N increased to 200.

Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.768  0.049  0.102  0.193  0.103
4/6                         4/6           0.701  0.050  0.098  0.179  0.099
6/4                         4/6           0.798  0.050  0.104  0.202  0.105
20/20                       4/6           0.996  0.052  0.115  0.475  0.116
16/24                       4/6           0.994  0.049  0.115  0.462  0.115
24/16                       4/6           0.996  0.052  0.117  0.481  0.117
100/100                     4/6           1.000  0.049  0.116  0.742  0.116
80/120                      4/6           1.000  0.051  0.117  0.730  0.117
120/80                      4/6           1.000  0.048  0.115  0.747  0.115
5/5                         12/18         0.987  0.049  0.067  0.185  0.067
4/6                         12/18         0.973  0.050  0.067  0.173  0.067
6/4                         12/18         0.991  0.050  0.066  0.195  0.066
20/20                       12/18         1.000  0.051  0.068  0.467  0.068
16/24                       12/18         1.000  0.051  0.069  0.452  0.069
24/16                       12/18         1.000  0.050  0.068  0.474  0.068
100/100                     12/18         1.000  0.048  0.066  0.734  0.066
80/120                      12/18         1.000  0.048  0.067  0.725  0.067
120/80                      12/18         1.000  0.050  0.069  0.740  0.069

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.899  0.050  0.110  0.460  0.111
sk=1.00, kr=3.00            4/6           0.916  0.049  0.111  0.469  0.111
sk=2.00, kr=6.00            4/6           0.937  0.051  0.112  0.476  0.112
sk=0.00, kr=0.00            12/18         0.991  0.050  0.068  0.454  0.068
sk=1.00, kr=3.00            12/18         0.995  0.050  0.068  0.462  0.068
sk=2.00, kr=6.00            12/18         0.998  0.049  0.067  0.466  0.067

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.913  0.050  0.111  0.465  0.111
1/2                         4/6           0.916  0.050  0.111  0.466  0.111
1/4                         4/6           0.922  0.050  0.111  0.473  0.112
1/1                         12/18         0.995  0.050  0.068  0.460  0.068
1/2                         12/18         0.995  0.050  0.068  0.459  0.068
1/4                         12/18         0.994  0.049  0.067  0.462  0.067

Table 14 All Average Type I Error Rates (τ² = 1, δ = .8)
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

Type I error rates for each of the tests did not change dramatically from the prior condition of τ² = 1, δ = 0 (see Table 13 relative to Table 14). The same patterns of inflation arose when K increased from 10 to 30, with both population shape and population variances controlled. The degree of inflation for each test remained fairly constant from the prior condition, as well.

Primary Study Sample Sizes  N of studies  Reg Q  PQb    RE     FE     CR
5/5                         4/6           0.740  0.050  0.104  0.196  0.106
4/6                         4/6           0.682  0.050  0.098  0.179  0.100
6/4                         4/6           0.777  0.050  0.104  0.205  0.105
20/20                       4/6           0.995  0.047  0.113  0.476  0.114
16/24                       4/6           0.993  0.050  0.114  0.458  0.114
24/16                       4/6           0.995  0.051  0.115  0.481  0.115
100/100                     4/6           1.000  0.048  0.114  0.739  0.114
80/120                      4/6           1.000  0.049  0.119  0.730  0.119
120/80                      4/6           1.000  0.051  0.115  0.740  0.115
5/5                         12/18         0.983  0.051  0.068  0.185  0.068
4/6                         12/18         0.965  0.052  0.068  0.171  0.068
6/4                         12/18         0.988  0.050  0.067  0.195  0.067
20/20                       12/18         1.000  0.051  0.068  0.465  0.068
16/24                       12/18         1.000  0.051  0.071  0.457  0.071
24/16                       12/18         1.000  0.050  0.069  0.478  0.069
100/100                     12/18         1.000  0.049  0.068  0.739  0.068
80/120                      12/18         1.000  0.050  0.069  0.728  0.069
120/80                      12/18         1.000  0.051  0.069  0.740  0.069

Population Shape            N of studies  Reg Q  PQb    RE     FE     CR
sk=0.00, kr=0.00            4/6           0.893  0.050  0.111  0.461  0.111
sk=1.00, kr=3.00            4/6           0.906  0.050  0.110  0.466  0.111
sk=2.00, kr=6.00            4/6           0.926  0.050  0.111  0.473  0.112
sk=0.00, kr=0.00            12/18         0.989  0.051  0.069  0.446  0.069
sk=1.00, kr=3.00            12/18         0.993  0.050  0.069  0.461  0.069
sk=2.00, kr=6.00            12/18         0.997  0.051  0.068  0.468  0.068

Population Variances        N of studies  Reg Q  PQb    RE     FE     CR
1/1                         4/6           0.909  0.049  0.110  0.464  0.111
1/2                         4/6           0.908  0.049  0.110  0.466  0.111
1/4                         4/6           0.908  0.051  0.112  0.470  0.112
1/1                         12/18         0.994  0.050  0.067  0.461  0.067
1/2                         12/18         0.993  0.051  0.069  0.463  0.069
1/4                         12/18         0.992  0.051  0.069  0.452  0.069

Summary of Results Concerning Average Type I Error Rates

With the introduction of δ = .8 at τ² = 0, the RE, CR and FE tests began to provide more instances of adequate control of Type I error, particularly when N was 40 or greater and the first group had a larger N than the second group. The equal variances condition also seemed to have contributed to the robustness of these tests. Robustness of these tests continued to decline as heterogeneity of effects increased from .33 to 1. The regular Q and FE tests' Type I error rates inflated to a much greater extent than those of either the RE or CR tests. These patterns were evident across changes in population shape and variance ratios.

Only permuted Q maintained a consistent pattern of robustness across a wide variety of conditions. Regardless of the extent of the heterogeneity of effects, population variance ratios, sample size of the primary studies, the number of studies or population shape, error rates continued to fluctuate only slightly around the nominal alpha level, .05.

Type I Error Rate Estimates of Individual Simulated Conditions

The Type I error rate estimates of all true null conditions are now presented. These tables provide evidence of each test's performance under all of the specified conditions of the study. Presenting the individual error rates of each condition gives a more detailed picture of the behavior of each test, so that the general patterns highlighted by the prior presentations can be more fully elaborated.

Table 15 Type I Error Rate Estimates (τ² = 0, δ = 0) at α = .05 for K=10, N=5000
Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion.
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

When N was less than 40, none of the tests except permuted Q permitted adequate control of Type I error. Error rates tended to be conservative for the RE, FE and CR tests.

Primary Study Sample Sizes  Population Variances  N of studies  Population Shape  Reg Q  PQb    RE     FE     CR
5/5                         1/1                   4/6           sk=0.00, kr=0.00  0.034  0.047  0.032  0.033  0.032
5/5                         1/2                   4/6           sk=0.00, kr=0.00  0.040  0.051  0.033  0.035  0.033
5/5                         1/4                   4/6           sk=0.00, kr=0.00  0.055  0.051  0.036  0.040  0.037
5/5                         1/1                   4/6           sk=1.00, kr=3.00  0.029  0.049  0.039  0.041  0.039
5/5                         1/2                   4/6           sk=1.00, kr=3.00  0.030  0.041  0.031  0.035  0.032
5/5                         1/4                   4/6           sk=1.00, kr=3.00  0.048  0.049  0.035  0.038  0.035
5/5                         1/1                   4/6           sk=2.00, kr=6.00  0.028  0.052  0.030  0.030  0.030
5/5                         1/2                   4/6           sk=2.00, kr=6.00  0.033  0.052  0.035  0.038  0.036
5/5                         1/4                   4/6           sk=2.00, kr=6.00  0.083  0.050  0.042  0.046  0.043
4/6                         1/1                   4/6           sk=0.00, kr=0.00  0.035  0.049  0.032  0.034  0.033
4/6                         1/2                   4/6           sk=0.00, kr=0.00  0.017  0.051  0.023  0.024  0.023
4/6                         1/4                   4/6           sk=0.00, kr=0.00  0.018  0.055  0.022  0.023  0.022
4/6                         1/1                   4/6           sk=1.00, kr=3.00  0.029  0.048  0.033  0.036  0.034
4/6                         1/2                   4/6           sk=1.00, kr=3.00  0.021  0.045  0.024  0.024  0.024
4/6                         1/4                   4/6           sk=1.00, kr=3.00  0.016  0.050  0.021  0.023  0.021
4/6                         1/1                   4/6           sk=2.00, kr=6.00  0.024  0.048  0.028  0.030  0.028
4/6                         1/2                   4/6           sk=2.00, kr=6.00  0.022  0.053  0.028  0.029  0.028
4/6                         1/4                   4/6           sk=2.00, kr=6.00  0.037  0.048  0.028  0.030  0.029
6/4                         1/1                   4/6           sk=0.00, kr=0.00  0.031  0.055  0.032  0.035  0.033
6/4                         1/2                   4/6           sk=0.00, kr=0.00  0.065  0.046  0.041  0.046  0.041
6/4                         1/4                   4/6           sk=0.00, kr=0.00  0.145  0.050  0.056  0.065  0.057
6/4                         1/1                   4/6           sk=1.00, kr=3.00  0.030  0.049  0.032  0.034  0.033
6/4                         1/2                   4/6           sk=1.00, kr=3.00  0.057  0.052  0.043  0.046  0.044
6/4                         1/4                   4/6           sk=1.00, kr=3.00  0.127  0.047  0.051  0.060  0.052
6/4                         1/1                   4/6           sk=2.00, kr=6.00  0.031  0.054  0.030  0.031  0.030
6/4                         1/2                   4/6           sk=2.00, kr=6.00  0.061  0.049  0.039  0.043  0.041
6/4                         1/4                   4/6           sk=2.00, kr=6.00  0.177  0.049  0.060  0.072  0.062
20/20                       1/1                   4/6           sk=0.00, kr=0.00  0.044  0.047  0.046  0.051  0.048
20/20                       1/2                   4/6           sk=0.00, kr=0.00  0.043  0.057  0.045  0.049  0.046
20/20                       1/4                   4/6           sk=0.00, kr=0.00  0.052  0.048  0.041  0.046  0.042
20/20                       1/1                   4/6           sk=1.00, kr=3.00  0.038  0.048  0.042  0.046  0.043
20/20                       1/2                   4/6           sk=1.00, kr=3.00  0.046  0.043  0.041  0.045  0.042
20/20                       1/4                   4/6           sk=1.00, kr=3.00  0.054  0.057  0.047  0.052  0.048
20/20                       1/1                   4/6           sk=2.00, kr=6.00  0.039  0.044  0.040  0.044  0.041
20/20                       1/2                   4/6           sk=2.00, kr=6.00  0.048  0.049  0.044  0.050  0.046
20/20                       1/4                   4/6           sk=2.00, kr=6.00  0.082  0.056  0.051  0.058  0.052
16/24                       1/1                   4/6           sk=0.00, kr=0.00  0.043  0.051  0.046  0.050  0.047
16/24                       1/2                   4/6           sk=0.00, kr=0.00  0.018  0.052  0.029  0.031  0.030
16/24                       1/4                   4/6           sk=0.00, kr=0.00  0.011  0.049  0.026  0.026  0.026
16/24                       1/1                   4/6           sk=1.00, kr=3.00  0.040  0.046  0.038  0.042  0.039
16/24                       1/2                   4/6           sk=1.00, kr=3.00  0.023  0.053  0.032  0.034  0.033
16/24                       1/4                   4/6           sk=1.00, kr=3.00  0.010  0.043  0.022  0.023  0.022
16/24                       1/1                   4/6           sk=2.00, kr=6.00  0.037  0.051  0.042  0.045  0.043
16/24                       1/2                   4/6           sk=2.00, kr=6.00  0.027  0.046  0.034  0.038  0.035
16/24                       1/4                   4/6           sk=2.00, kr=6.00  0.029  0.053  0.030  0.032  0.030

Table 15 (continued)
Type I Error Rate Estimates (τ² = 0, δ = 0) at α = .05 for K=10, N=5000
Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion.
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

Primary Study Sample Sizes  Population Variances  N of studies  Population Shape  Reg Q  PQb    RE     FE     CR
24/16                       1/1                   4/6           sk=0.00, kr=0.00  0.048  0.051  0.041  0.048  0.043
24/16                       1/2                   4/6           sk=0.00, kr=0.00  0.092  0.051  0.055  0.065  0.056
24/16                       1/4                   4/6           sk=0.00, kr=0.00  0.151  0.047  0.063  0.080  0.067
24/16                       1/1                   4/6           sk=1.00, kr=3.00  0.042  0.047  0.038  0.042  0.039
24/16                       1/2                   4/6           sk=1.00, kr=3.00  0.085  0.052  0.055  0.063  0.057
24/16                       1/4                   4/6           sk=1.00, kr=3.00  0.148  0.045  0.063  0.079  0.065
24/16                       1/1                   4/6           sk=2.00, kr=6.00  0.040  0.050  0.041  0.044  0.042
24/16                       1/2                   4/6           sk=2.00, kr=6.00  0.087  0.057  0.062  0.071  0.064
24/16                       1/4                   4/6           sk=2.00, kr=6.00  0.188  0.050  0.069  0.088  0.072
100/100                     1/1                   4/6           sk=0.00, kr=0.00  0.049  0.052  0.048  0.051  0.049
100/100                     1/2                   4/6           sk=0.00, kr=0.00  0.050  0.055  0.051  0.056  0.053
100/100                     1/4                   4/6           sk=0.00, kr=0.00  0.046  0.050  0.040  0.043  0.041
100/100                     1/1                   4/6           sk=1.00, kr=3.00  0.044  0.050  0.042  0.047  0.044
100/100                     1/2                   4/6           sk=1.00, kr=3.00  0.050  0.054  0.047  0.052  0.048
100/100                     1/4                   4/6           sk=1.00, kr=3.00  0.054  0.046  0.046  0.050  0.047
100/100                     1/1                   4/6           sk=2.00, kr=6.00  0.047  0.051  0.044  0.049  0.045
100/100                     1/2                   4/6           sk=2.00, kr=6.00  0.048  0.050  0.041  0.046  0.043
100/100                     1/4                   4/6           sk=2.00, kr=6.00  0.055  0.051  0.048  0.056  0.049
80/120                      1/1                   4/6           sk=0.00, kr=0.00  0.044  0.045  0.040  0.044  0.040
80/120                      1/2                   4/6           sk=0.00, kr=0.00  0.026  0.044  0.029  0.032  0.030
80/120                      1/4                   4/6           sk=0.00, kr=0.00  0.010  0.050  0.025  0.027  0.026
80/120                      1/1                   4/6           sk=1.00, kr=3.00  0.051  0.047  0.046  0.052  0.048
80/120                      1/2                   4/6           sk=1.00, kr=3.00  0.022  0.052  0.035  0.037  0.035
80/120                      1/4                   4/6           sk=1.00, kr=3.00  0.013  0.050  0.024  0.025  0.024
80/120                      1/1                   4/6           sk=2.00, kr=6.00  0.046  0.055  0.043  0.049  0.045
80/120                      1/2                   4/6           sk=2.00, kr=6.00  0.023  0.050  0.033  0.035  0.034
80/120                      1/4                   4/6           sk=2.00, kr=6.00  0.013  0.050  0.026  0.027  0.026
120/80                      1/1                   4/6           sk=0.00, kr=0.00  0.046  0.053  0.042  0.048  0.044
120/80                      1/2                   4/6           sk=0.00, kr=0.00  0.094  0.046  0.054  0.066  0.057
120/80                      1/4                   4/6           sk=0.00, kr=0.00  0.142  0.049  0.062  0.080  0.067
120/80                      1/1                   4/6           sk=1.00, kr=3.00  0.044  0.050  0.045  0.051  0.047
120/80                      1/2                   4/6           sk=1.00, kr=3.00  0.087  0.050  0.054  0.062  0.055
120/80                      1/4                   4/6           sk=1.00, kr=3.00  0.147  0.050  0.064  0.079  0.066
120/801/14/6sk=2.00, kr=6.000.0480.0520.0450.0510.047 120/801/24/6sk=2.00, kr=6.000.0930.0490.0550.0660.058 120/801/44/6sk=2.00, kr=6.000.1580.0490.0660.0870.070 PAGE 133 123 Table 16 Type I Error Rate Estimates ( 2 = 0, =.8) at =.05 for K=10, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded areas signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure The regular Q tests robustness succumbed to the in fluence of the effect size greater than zero, as did the FE test to a lesser extent (see Table 16 abov e). The RE and CR tests evidenced a greater degree of robustness than in the prior table (Table 15), when effect size was held to zero. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/14/6sk=0.00, kr=0.000.0320.0510.0310.0340.032 5/51/24/6sk=0.00, kr=0.000.0400.0520.0330.0360.034 5/51/44/6sk=0.00, kr=0.000.0660.0490.0370.0400.037 5/51/14/6sk=1.00, kr=3.000.0460.0520.0400.0430.041 5/51/24/6sk=1.00, kr=3.000.0310.0470.0240.0250.024 5/51/44/6sk=1.00, kr=3.000.0370.0500.0270.0300.028 5/51/14/6sk=2.00, kr=6.000.0720.0500.0440.0480.046 5/51/24/6sk=2.00, kr=6.000.0260.0520.0280.0300.029 5/51/44/6sk=2.00, kr=6.000.0150.0480.0240.0250.024 4/61/14/6sk=0.00, kr=0.000.0310.0500.0330.0350.033 4/61/24/6sk=0.00, kr=0.000.0210.0470.0250.0260.025 4/61/44/6sk=0.00, kr=0.000.0140.0460.0240.0240.024 4/61/14/6sk=1.00, kr=3.000.0430.0490.0360.0380.037 4/61/24/6sk=1.00, kr=3.000.0160.0540.0260.0270.026 4/61/44/6sk=1.00, kr=3.000.0110.0500.0180.0180.018 4/61/14/6sk=2.00, kr=6.000.0680.0530.0480.0520.049 4/61/24/6sk=2.00, kr=6.000.0160.0500.0220.0240.022 4/61/44/6sk=2.00, kr=6.000.0070.0500.0100.0100.010 6/41/14/6sk=0.00, kr=0.000.0300.0480.0330.0340.033 6/41/24/6sk=0.00, kr=0.000.0770.0530.0460.0490.047 6/41/44/6sk=0.00, 
kr=0.000.1410.0510.0560.0650.057 6/41/14/6sk=1.00, kr=3.000.0410.0530.0390.0410.039 6/41/24/6sk=1.00, kr=3.000.0530.0480.0400.0430.041 6/41/44/6sk=1.00, kr=3.000.0960.0460.0450.0520.046 6/41/14/6sk=2.00, kr=6.000.0680.0500.0450.0480.045 6/41/24/6sk=2.00, kr=6.000.0470.0490.0370.0390.038 6/41/44/6sk=2.00, kr=6.000.0440.0540.0330.0360.034 20/201/14/6sk=0.00, kr=0.000.0440.0510.0440.0480.045 20/201/24/6sk=0.00, kr=0.000.0430.0550.0490.0520.049 20/201/44/6sk=0.00, kr=0.000.0580.0480.0440.0480.045 20/201/14/6sk=1.00, kr=3.000.0710.0470.0480.0540.050 20/201/24/6sk=1.00, kr=3.000.0310.0480.0360.0400.037 20/201/44/6sk=1.00, kr=3.000.0230.0540.0310.0330.032 20/201/14/6sk=2.00, kr=6.000.0990.0460.0520.0620.054 20/201/24/6sk=2.00, kr=6.000.0260.0420.0280.0330.029 20/201/44/6sk=2.00, kr=6.000.0070.0490.0180.0180.018 16/241/14/6sk=0.00, kr=0.000.0420.0530.0400.0440.041 16/241/24/6sk=0.00, kr=0.000.0280.0440.0290.0320.030 16/241/44/6sk=0.00, kr=0.000.0150.0490.0270.0280.027 16/241/14/6sk=1.00, kr=3.000.0660.0520.0530.0600.055 16/241/24/6sk=1.00, kr=3.000.0190.0470.0280.0310.029 16/241/44/6sk=1.00, kr=3.000.0060.0510.0200.0200.020 16/241/14/6sk=2.00, kr=6.000.0990.0480.0550.0650.057 16/241/24/6sk=2.00, kr=6.000.0140.0480.0270.0280.027 16/241/44/6sk=2.00, kr=6.000.0020.0490.0080.0090.009 PAGE 134 124 Table 16 (continued) Type I Error Rate Estimates ( 2 = 0, =.8) at =.05 for K=10, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded areas signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure The RE and CR tests held Type I error in check on a greater number of conditions when the sample size in the first group was greater than in the second (see Table 16). This pattern of response was more consistent when sample sizes increased to 40 or above. 
The FE test responded most effectively (in terms of robustness) when sample sizes were equal across groups. Permuted Q maintained robustness across all conditions, regardless of the sample size of either and both groups combined. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 24/161/14/6sk=0.00, kr=0 .000.0450.0500.0400.0460.042 24/161/24/6sk=0.00, kr=0 .000.0930.0550.0550.0640.056 24/161/44/6sk=0.00, kr=0 .000.1540.0510.0650.0810.068 24/161/14/6sk=1.00, kr=3 .000.0600.0490.0460.0520.048 24/161/24/6sk=1.00, kr=3 .000.0660.0440.0460.0530.047 24/161/44/6sk=1.00, kr=3 .000.0810.0500.0470.0550.048 24/161/14/6sk=2.00, kr=6 .000.0990.0490.0540.0630.057 24/161/24/6sk=2.00, kr=6 .000.0450.0520.0430.0470.045 24/161/44/6sk=2.00, kr=6 .000.0260.0540.0310.0330.032 100/1001/14/6sk=0.00, kr=0 .000.0480.0480.0370.0420.038 100/1001/24/6sk=0.00, kr=0 .000.0510.0520.0460.0530.048 100/1001/44/6sk=0.00, kr=0 .000.0570.0530.0490.0560.051 100/1001/14/6sk=1.00, kr=3 .000.0830.0440.0490.0580.051 100/1001/24/6sk=1.00, kr=3 .000.0380.0550.0420.0450.043 100/1001/44/6sk=1.00, kr=3 .000.0200.0510.0330.0350.033 100/1001/14/6sk=2.00, kr=6 .000.1190.0500.0620.0770.065 100/1001/24/6sk=2.00, kr=6 .000.0200.0470.0360.0380.036 100/1001/44/6sk=2.00, kr=6 .000.0040.0440.0160.0180.017 80/1201/14/6sk=0.00, kr=0 .000.0460.0560.0460.0500.046 80/1201/24/6sk=0.00, kr=0 .000.0240.0530.0390.0420.040 80/1201/44/6sk=0.00, kr=0 .000.0130.0490.0250.0260.026 80/1201/14/6sk=1.00, kr=3 .000.0810.0490.0480.0560.050 80/1201/24/6sk=1.00, kr=3 .000.0180.0520.0310.0340.032 80/1201/44/6sk=1.00, kr=3 .000.0030.0520.0180.0190.018 80/1201/14/6sk=2.00, kr=6 .000.1170.0500.0610.0730.063 80/1201/24/6sk=2.00, kr=6 .000.0110.0500.0300.0310.031 80/1201/44/6sk=2.00, kr=6 .000.0010.0530.0100.0100.010 120/801/14/6sk=0.00, kr=0 .000.0460.0500.0440.0500.046 120/801/24/6sk=0.00, kr=0 .000.0940.0480.0540.0650.056 120/801/44/6sk=0.00, kr=0 .000.1560.0500.0640.0770.065 120/801/14/6sk=1.00, kr=3 
.000.0680.0490.0520.0610.053 120/801/24/6sk=1.00, kr=3 .000.0710.0460.0460.0540.048 120/801/44/6sk=1.00, kr=3 .000.0640.0500.0470.0530.049 120/801/14/6sk=2.00, kr=6 .000.1200.0510.0620.0760.065 120/801/24/6sk=2.00, kr=6 .000.0450.0470.0410.0450.041 120/801/44/6sk=2.00, kr=6 .000.0170.0460.0280.0290.028 PAGE 135 125 Table 17 Type I Error Rate Estimates ( 2 = 0, =0) at =.05 for K=30, N=5000 Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify th ose conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure None of the tests, save permuted Q, showed consistent robustness when N was less than 40. All of the tests, except regular Q and permuted Q, tende d to yield conservative Type I error rates under this set of conditions. This pattern was evident across variance ratios and population shapes. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/112/18sk=0.00, kr=0.000.0230.0490.0280.0290.028 5/51/212/18sk=0.00, kr=0.000.0300.0520.0350.0360.036 5/51/412/18sk=0.00, kr=0.000.0520.0510.0350.0370.036 5/51/112/18sk=1.00, kr=3.000.0180.0470.0280.0300.029 5/51/212/18sk=1.00, kr=3.000.0230.0500.0320.0330.032 5/51/412/18sk=1.00, kr=3.000.0480.0490.0310.0320.031 5/51/112/18sk=2.00, kr=6.000.0170.0540.0290.0290.029 5/51/212/18sk=2.00, kr=6.000.0320.0460.0380.0400.039 5/51/412/18sk=2.00, kr=6.000.1100.0500.0410.0440.042 4/61/112/18sk=0.00, kr=0.000.0210.0490.0300.0300.030 4/61/212/18sk=0.00, kr=0.000.0080.0480.0230.0230.023 4/61/412/18sk=0.00, kr=0.000.0040.0490.0210.0210.021 4/61/112/18sk=1.00, kr=3.000.0180.0450.0250.0250.025 4/61/212/18sk=1.00, kr=3.000.0050.0500.0280.0280.028 4/61/412/18sk=1.00, kr=3.000.0030.0510.0220.0230.023 4/61/112/18sk=2.00, kr=6.000.0170.0530.0320.0330.033 4/61/212/18sk=2.00, kr=6.000.0130.0480.0230.0240.024 
4/61/412/18sk=2.00, kr=6.000.0290.0510.0290.0300.029 6/41/112/18sk=0.00, kr=0.000.0220.0450.0240.0240.024 6/41/212/18sk=0.00, kr=0.000.0750.0520.0400.0440.042 6/41/412/18sk=0.00, kr=0.000.2300.0550.0530.0620.055 6/41/112/18sk=1.00, kr=3.000.0180.0530.0310.0310.031 6/41/212/18sk=1.00, kr=3.000.0710.0510.0400.0420.041 6/41/412/18sk=1.00, kr=3.000.1980.0470.0460.0530.048 6/41/112/18sk=2.00, kr=6.000.0130.0510.0310.0310.031 6/41/212/18sk=2.00, kr=6.000.0730.0610.0530.0560.054 6/41/412/18sk=2.00, kr=6.000.3200.0450.0540.0670.055 20/201/112/18sk=0.00, kr=0.000.0420.0470.0360.0420.040 20/201/212/18sk=0.00, kr=0.000.0410.0450.0390.0410.040 20/201/412/18sk=0.00, kr=0.000.0470.0470.0400.0440.041 20/201/112/18sk=1.00, kr=3.000.0360.0450.0360.0410.039 20/201/212/18sk=1.00, kr=3.000.0420.0490.0430.0460.044 20/201/412/18sk=1.00, kr=3.000.0570.0540.0450.0510.048 20/201/112/18sk=2.00, kr=6.000.0300.0480.0390.0420.041 20/201/212/18sk=2.00, kr=6.000.0540.0530.0440.0490.046 20/201/412/18sk=2.00, kr=6.000.0980.0500.0470.0550.049 16/241/112/18sk=0.00, kr=0.000.0390.0570.0450.0480.046 16/241/212/18sk=0.00, kr=0.000.0100.0500.0300.0300.030 16/241/412/18sk=0.00, kr=0.000.0040.0500.0270.0280.027 16/241/112/18sk=1.00, kr=3.000.0370.0480.0410.0440.043 16/241/212/18sk=1.00, kr=3.000.0090.0550.0350.0360.036 16/241/412/18sk=1.00, kr=3.000.0050.0520.0250.0250.025 16/241/112/18sk=2.00, kr=6.000.0400.0470.0410.0440.043 16/241/212/18sk=2.00, kr=6.000.0170.0550.0380.0390.038 16/241/412/18sk=2.00, kr=6.000.0110.0490.0290.0300.030 PAGE 136 126 Table 17 (continued) Type I Error Rate Estimates ( 2 = 0, =0) at =.05 for K=30, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded areas signify tho se conditions in which the error rate fell within Bradleys criterion. 
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

The regular Q, FE and CR tests produced a concentration of conditions with well-maintained Type I error as the sample size of the primary studies increased to 200 under the equal groups condition. As unequal sample sizes were introduced, these tests had diminished Type I error control. The RE test was less robust than these other tests when sample sizes increased to 200. The RE test demonstrated particular robustness when, at N=10, 40 and 200, the first group had a larger N than the second.

Primary Study Sample Sizes  Population Variances  N of studies  Population Shape  Reg Q  PQb    RE     FE     CR
24/16                       1/1                   12/18         sk=0.00, kr=0.00  0.042  0.046  0.038  0.042  0.040
24/16                       1/2                   12/18         sk=0.00, kr=0.00  0.123  0.046  0.047  0.056  0.050
24/16                       1/4                   12/18         sk=0.00, kr=0.00  0.266  0.046  0.054  0.077  0.059
24/16                       1/1                   12/18         sk=1.00, kr=3.00  0.039  0.051  0.041  0.044  0.042
24/16                       1/2                   12/18         sk=1.00, kr=3.00  0.113  0.059  0.058  0.067  0.062
24/16                       1/4                   12/18         sk=1.00, kr=3.00  0.266  0.048  0.052  0.074  0.057
24/16                       1/1                   12/18         sk=2.00, kr=6.00  0.037  0.051  0.040  0.044  0.041
24/16                       1/2                   12/18         sk=2.00, kr=6.00  0.126  0.050  0.051  0.061  0.053
24/16                       1/4                   12/18         sk=2.00, kr=6.00  0.338  0.051  0.060  0.085  0.066
100/100                     1/1                   12/18         sk=0.00, kr=0.00  0.042  0.052  0.044  0.051  0.048
100/100                     1/2                   12/18         sk=0.00, kr=0.00  0.047  0.056  0.044  0.050  0.046
100/100                     1/4                   12/18         sk=0.00, kr=0.00  0.045  0.047  0.042  0.047  0.045
100/100                     1/1                   12/18         sk=1.00, kr=3.00  0.047  0.048  0.041  0.047  0.043
100/100                     1/2                   12/18         sk=1.00, kr=3.00  0.048  0.049  0.047  0.053  0.051
100/100                     1/4                   12/18         sk=1.00, kr=3.00  0.050  0.054  0.047  0.053  0.050
100/100                     1/1                   12/18         sk=2.00, kr=6.00  0.044  0.050  0.043  0.049  0.046
100/100                     1/2                   12/18         sk=2.00, kr=6.00  0.053  0.048  0.042  0.048  0.045
100/100                     1/4                   12/18         sk=2.00, kr=6.00  0.056  0.047  0.042  0.048  0.045
80/120                      1/1                   12/18         sk=0.00, kr=0.00  0.047  0.054  0.049  0.054  0.050
80/120                      1/2                   12/18         sk=0.00, kr=0.00  0.011  0.053  0.035  0.038  0.037
80/120                      1/4                   12/18         sk=0.00, kr=0.00  0.003  0.050  0.026  0.026  0.026
80/120                      1/1                   12/18         sk=1.00, kr=3.00  0.045  0.052  0.045  0.050  0.047
80/120                      1/2                   12/18         sk=1.00, kr=3.00  0.013  0.053  0.035  0.036  0.035
80/120                      1/4                   12/18         sk=1.00, kr=3.00  0.002  0.046  0.027  0.028  0.028
80/120                      1/1                   12/18         sk=2.00, kr=6.00  0.045  0.050  0.045  0.051  0.047
80/120                      1/2                   12/18         sk=2.00, kr=6.00  0.010  0.049  0.033  0.035  0.034
80/120                      1/4                   12/18         sk=2.00, kr=6.00  0.005  0.054  0.030  0.031  0.030
120/80                      1/1                   12/18         sk=0.00, kr=0.00  0.043  0.053  0.047  0.051  0.049
120/80                      1/2                   12/18         sk=0.00, kr=0.00  0.137  0.043  0.047  0.060  0.052
120/80                      1/4                   12/18         sk=0.00, kr=0.00  0.263  0.052  0.060  0.083  0.064
120/80                      1/1                   12/18         sk=1.00, kr=3.00  0.043  0.046  0.038  0.044  0.042
120/80                      1/2                   12/18         sk=1.00, kr=3.00  0.131  0.048  0.053  0.067  0.059
120/80                      1/4                   12/18         sk=1.00, kr=3.00  0.259  0.052  0.063  0.083  0.066
120/80                      1/1                   12/18         sk=2.00, kr=6.00  0.048  0.044  0.041  0.046  0.043
120/80                      1/2                   12/18         sk=2.00, kr=6.00  0.147  0.055  0.055  0.067  0.060
120/80                      1/4                   12/18         sk=2.00, kr=6.00  0.285  0.051  0.062  0.086  0.068

Table 18
Type I Error Rate Estimates (τ² = 0, δ = .8) at α = .05 for K=30, N=5000
Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion.
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

As an effect size greater than 0 was introduced, the performance of the regular Q and FE tests declined (compare Table 18 with Table 17). The regular Q test had 14 conditions with adequate Type I error control when the effect size was 0 (see Table 17), as compared to 8 in the present set of conditions. The FE test showed a smaller decline, from 21 adequate conditions to 17 in the present set.
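Counts of "adequate" conditions such as those quoted above can be reproduced by classifying each estimated rate against Bradley's criterion. The sketch below is illustrative only. It assumes the stringent form of the criterion, α ± 0.1α (.045 to .055 at α = .05), which this section does not state explicitly, although that interval appears consistent with the count of 14 reported for regular Q in Table 17. The function names, the `tolerance` parameter and the small `eps` guard against floating-point boundary effects are hypothetical. The sample values are the regular Q rates from the nine 100/100 rows of Table 17.

```python
def classify_rate(rate, alpha=0.05, tolerance=0.1, eps=1e-9):
    """Label an estimated Type I error rate relative to a Bradley-style interval.

    tolerance=0.1 gives alpha +/- 0.1*alpha (the assumption made here);
    tolerance=0.5 would give the liberal interval 0.5*alpha to 1.5*alpha.
    """
    lower = alpha * (1 - tolerance)
    upper = alpha * (1 + tolerance)
    if rate < lower - eps:
        return "conservative"
    if rate > upper + eps:
        return "inflated"
    return "adequate"


def tally(rates, **kwargs):
    """Count how many rates fall into each category."""
    counts = {"conservative": 0, "adequate": 0, "inflated": 0}
    for rate in rates:
        counts[classify_rate(rate, **kwargs)] += 1
    return counts


# Regular Q, Table 17, primary study sample sizes 100/100 (K=30).
reg_q_100_100 = [0.042, 0.047, 0.045, 0.047, 0.048, 0.050, 0.044, 0.053, 0.056]
print(tally(reg_q_100_100))
```

Passing `tolerance=0.5` instead applies the liberal interval, under which far more conditions would count as adequate, so the choice of interval should be checked against the method chapter before the counts are compared.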
Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/112/18sk=0.00, kr=0.000.0270.0530.0300.0310.031 5/51/212/18sk=0.00, kr=0.000.0350.0500.0310.0330.032 5/51/412/18sk=0.00, kr=0.000.0590.0540.0380.0400.039 5/51/112/18sk=1.00, kr=3.000.0350.0440.0310.0310.031 5/51/212/18sk=1.00, kr=3.000.0190.0510.0310.0330.032 5/51/412/18sk=1.00, kr=3.000.0210.0470.0270.0280.027 5/51/112/18sk=2.00, kr=6.000.0830.0510.0430.0470.044 5/51/212/18sk=2.00, kr=6.000.0150.0510.0270.0270.027 5/51/412/18sk=2.00, kr=6.000.0050.0490.0190.0200.019 4/61/112/18sk=0.00, kr=0.000.0270.0510.0330.0340.033 4/61/212/18sk=0.00, kr=0.000.0120.0460.0210.0220.021 4/61/412/18sk=0.00, kr=0.000.0080.0510.0220.0220.022 4/61/112/18sk=1.00, kr=3.000.0380.0570.0400.0410.040 4/61/212/18sk=1.00, kr=3.000.0060.0510.0270.0280.027 4/61/412/18sk=1.00, kr=3.000.0030.0490.0200.0200.020 4/61/112/18sk=2.00, kr=6.000.0790.0480.0400.0430.041 4/61/212/18sk=2.00, kr=6.000.0080.0480.0220.0220.022 4/61/412/18sk=2.00, kr=6.000.0010.0470.0120.0120.012 6/41/112/18sk=0.00, kr=0.000.0220.0480.0270.0270.027 6/41/212/18sk=0.00, kr=0.000.0820.0450.0340.0380.035 6/41/412/18sk=0.00, kr=0.000.2440.0520.0510.0610.053 6/41/112/18sk=1.00, kr=3.000.0330.0550.0350.0360.035 6/41/212/18sk=1.00, kr=3.000.0530.0470.0360.0390.037 6/41/412/18sk=1.00, kr=3.000.1190.0500.0420.0450.043 6/41/112/18sk=2.00, kr=6.000.0960.0490.0410.0440.042 6/41/212/18sk=2.00, kr=6.000.0410.0490.0340.0350.034 6/41/412/18sk=2.00, kr=6.000.0340.0530.0330.0340.034 20/201/112/18sk=0.00, kr=0.000.0430.0510.0440.0490.046 20/201/212/18sk=0.00, kr=0.000.0460.0520.0430.0460.045 20/201/412/18sk=0.00, kr=0.000.0610.0560.0490.0540.051 20/201/112/18sk=1.00, kr=3.000.0770.0480.0470.0550.051 20/201/212/18sk=1.00, kr=3.000.0270.0490.0380.0410.039 20/201/412/18sk=1.00, kr=3.000.0130.0480.0300.0310.030 20/201/112/18sk=2.00, kr=6.000.1530.0510.0550.0680.058 20/201/212/18sk=2.00, kr=6.000.0170.0460.0270.0290.028 20/201/412/18sk=2.00, 
kr=6.000.0010.0500.0180.0190.019 16/241/112/18sk=0.00, kr=0.000.0420.0470.0410.0460.043 16/241/212/18sk=0.00, kr=0.000.0130.0480.0350.0370.037 16/241/412/18sk=0.00, kr=0.000.0040.0490.0280.0290.029 16/241/112/18sk=1.00, kr=3.000.0730.0500.0490.0550.051 16/241/212/18sk=1.00, kr=3.000.0080.0440.0260.0270.026 16/241/412/18sk=1.00, kr=3.000.0010.0520.0190.0190.019 16/241/112/18sk=2.00, kr=6.000.1490.0480.0540.0660.058 16/241/212/18sk=2.00, kr=6.000.0040.0530.0290.0300.030 16/241/412/18sk=2.00, kr=6.000.0000.0550.0120.0120.012 PAGE 138 128 Table 18 (continued) Type I Error Rate Estimates ( 2 = 0, =.8) at =.05 for K=30, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded areas signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure The RE and CR tests generated more conservative Type I error rates than any of the other 3 tests, but the greatest frequency (other than permuted Q) of conditions with adequately controlled Type I error. The RE test performed best when the first group had the larger sample size. 
Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 24/161/112/18sk=0.00, kr=0.000.0480.0470.0400.0440.042 24/161/212/18sk=0.00, kr=0.000.1220.0460.0480.0600.053 24/161/412/18sk=0.00, kr=0.000.2830.0480.0530.0710.057 24/161/112/18sk=1.00, kr=3.000.0780.0480.0460.0510.048 24/161/212/18sk=1.00, kr=3.000.0800.0450.0420.0500.046 24/161/412/18sk=1.00, kr=3.000.0980.0520.0510.0590.054 24/161/112/18sk=2.00, kr=6.000.1520.0520.0570.0690.061 24/161/212/18sk=2.00, kr=6.000.0430.0550.0480.0550.051 24/161/412/18sk=2.00, kr=6.000.0110.0510.0330.0360.034 100/1001/112/18sk=0.00, kr=0.000.0460.0540.0450.0520.048 100/1001/212/18sk=0.00, kr=0.000.0520.0550.0510.0590.055 100/1001/412/18sk=0.00, kr=0.000.0650.0560.0500.0540.052 100/1001/112/18sk=1.00, kr=3.000.1080.0480.0470.0600.054 100/1001/212/18sk=1.00, kr=3.000.0250.0540.0440.0460.045 100/1001/412/18sk=1.00, kr=3.000.0080.0520.0290.0310.031 100/1001/112/18sk=2.00, kr=6.000.2000.0510.0550.0730.059 100/1001/212/18sk=2.00, kr=6.000.0130.0480.0320.0350.033 100/1001/412/18sk=2.00, kr=6.000.0000.0510.0190.0190.019 80/1201/112/18sk=0.00, kr=0.000.0490.0420.0370.0430.040 80/1201/212/18sk=0.00, kr=0.000.0150.0470.0320.0350.034 80/1201/412/18sk=0.00, kr=0.000.0060.0490.0250.0260.026 80/1201/112/18sk=1.00, kr=3.000.1040.0480.0520.0620.056 80/1201/212/18sk=1.00, kr=3.000.0080.0560.0350.0360.036 80/1201/412/18sk=1.00, kr=3.000.0000.0480.0190.0190.019 80/1201/112/18sk=2.00, kr=6.000.1800.0480.0520.0710.057 80/1201/212/18sk=2.00, kr=6.000.0030.0530.0260.0280.027 80/1201/412/18sk=2.00, kr=6.000.0000.0490.0070.0070.007 120/801/112/18sk=0.00, kr=0.000.0530.0470.0390.0460.042 120/801/212/18sk=0.00, kr=0.000.1330.0510.0550.0660.058 120/801/412/18sk=0.00, kr=0.000.2740.0490.0600.0840.066 120/801/112/18sk=1.00, kr=3.000.1020.0460.0460.0570.050 120/801/212/18sk=1.00, kr=3.000.0860.0460.0450.0540.049 120/801/412/18sk=1.00, kr=3.000.0880.0480.0460.0540.050 120/801/112/18sk=2.00, kr=6.000.1950.0470.0520.0680.057 
120/801/212/18sk=2.00, kr=6.000.0510.0430.0390.0440.040 120/801/412/18sk=2.00, kr=6.000.0070.0520.0320.0330.033 PAGE 139 129 Table 19 Type I Error Rate Estimates ( 2 =.33, =0) at =.05 for K=10, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded areas signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure There was a dramatic increase in the number of conditions with inflated Type I error for all tests, except permuted Q. As sample size increased to 40, the Ty pe I error for each of the 4 tests exhibited another notable increase. This pattern resulted in a total ab sence of error control by the aforementioned tests. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/14/6sk=0.00, kr=0.000.2620.0460.0700.0920.072 5/51/24/6sk=0.00, kr=0.000.2800.0510.0700.0910.072 5/51/44/6sk=0.00, kr=0.000.3100.0460.0700.0920.072 5/51/14/6sk=1.00, kr=3.000.2960.0440.0700.0910.072 5/51/24/6sk=1.00, kr=3.000.3110.0480.0730.0970.076 5/51/44/6sk=1.00, kr=3.000.3580.0550.0820.1110.084 5/51/14/6sk=2.00, kr=6.000.3750.0480.0790.1060.081 5/51/24/6sk=2.00, kr=6.000.4080.0490.0790.1130.081 5/51/44/6sk=2.00, kr=6.000.4580.0530.0850.1250.087 4/61/14/6sk=0.00, kr=0.000.2510.0460.0690.0850.070 4/61/24/6sk=0.00, kr=0.000.2020.0520.0660.0810.068 4/61/44/6sk=0.00, kr=0.000.1790.0450.0610.0730.062 4/61/14/6sk=1.00, kr=3.000.2960.0480.0700.0940.071 4/61/24/6sk=1.00, kr=3.000.2370.0590.0800.0980.080 4/61/44/6sk=1.00, kr=3.000.2290.0480.0640.0830.066 4/61/14/6sk=2.00, kr=6.000.3730.0460.0840.1100.086 4/61/24/6sk=2.00, kr=6.000.3180.0510.0770.1020.080 4/61/44/6sk=2.00, kr=6.000.3470.0520.0770.1030.078 6/41/14/6sk=0.00, kr=0.000.2470.0500.0670.0810.067 6/41/24/6sk=0.00, kr=0.000.3420.0520.0850.1120.087 6/41/44/6sk=0.00, kr=0.000.4430.0490.0900.1270.092 
6/4                         1/1                   4/6           sk=1.00, kr=3.00  0.303  0.051  0.076  0.099  0.078
6/4                         1/2                   4/6           sk=1.00, kr=3.00  0.381  0.049  0.079  0.111  0.081
6/4                         1/4                   4/6           sk=1.00, kr=3.00  0.513  0.049  0.094  0.143  0.096
6/4                         1/1                   4/6           sk=2.00, kr=6.00  0.355  0.050  0.081  0.109  0.083
6/4                         1/2                   4/6           sk=2.00, kr=6.00  0.451  0.048  0.081  0.121  0.084
6/4                         1/4                   4/6           sk=2.00, kr=6.00  0.594  0.051  0.093  0.148  0.094
20/20                       1/1                   4/6           sk=0.00, kr=0.00  0.902  0.045  0.110  0.296  0.112
20/20                       1/2                   4/6           sk=0.00, kr=0.00  0.894  0.050  0.113  0.301  0.115
20/20                       1/4                   4/6           sk=0.00, kr=0.00  0.894  0.052  0.119  0.302  0.119
20/20                       1/1                   4/6           sk=1.00, kr=3.00  0.913  0.051  0.118  0.322  0.119
20/20                       1/2                   4/6           sk=1.00, kr=3.00  0.911  0.052  0.113  0.302  0.114
20/20                       1/4                   4/6           sk=1.00, kr=3.00  0.910  0.046  0.110  0.309  0.111
20/20                       1/1                   4/6           sk=2.00, kr=6.00  0.922  0.053  0.116  0.317  0.118
20/20                       1/2                   4/6           sk=2.00, kr=6.00  0.912  0.051  0.116  0.312  0.117
20/20                       1/4                   4/6           sk=2.00, kr=6.00  0.920  0.045  0.112  0.323  0.113
16/24                       1/1                   4/6           sk=0.00, kr=0.00  0.884  0.048  0.113  0.290  0.114
16/24                       1/2                   4/6           sk=0.00, kr=0.00  0.862  0.052  0.116  0.277  0.117
16/24                       1/4                   4/6           sk=0.00, kr=0.00  0.840  0.053  0.117  0.280  0.118
16/24                       1/1                   4/6           sk=1.00, kr=3.00  0.901  0.049  0.111  0.303  0.111
16/24                       1/2                   4/6           sk=1.00, kr=3.00  0.868  0.047  0.113  0.293  0.115
16/24                       1/4                   4/6           sk=1.00, kr=3.00  0.853  0.049  0.108  0.274  0.110
16/24                       1/1                   4/6           sk=2.00, kr=6.00  0.910  0.055  0.114  0.312  0.115
16/24                       1/2                   4/6           sk=2.00, kr=6.00  0.888  0.049  0.114  0.283  0.115
16/24                       1/4                   4/6           sk=2.00, kr=6.00  0.869  0.056  0.125  0.295  0.126

Table 19 (continued)
Type I Error Rate Estimates (τ² = .33, δ = 0) at α = .05 for K=10, N=5000
Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion.
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

As the sample size increased to 40 and greater, permuted Q continued to show adequate Type I error control across all conditions. None of the other tests maintained robustness under these conditions.
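The small fluctuations of permuted Q around .05 can be put in context with the sampling error of a simulated rejection rate. As a rough sketch, assuming each table entry is the proportion of rejections over the 5,000 independent replications indicated in the table captions, the binomial standard error is sqrt(p(1 - p)/R). The function names are illustrative and not taken from the study.

```python
import math


def mc_standard_error(p, replications):
    """Binomial standard error of a rejection rate estimated from simulation."""
    return math.sqrt(p * (1.0 - p) / replications)


def mc_interval(p, replications, z=1.96):
    """Approximate 95% band around a true rejection rate p (normal approximation)."""
    half_width = z * mc_standard_error(p, replications)
    return p - half_width, p + half_width


se = mc_standard_error(0.05, 5000)
low, high = mc_interval(0.05, 5000)
print(f"SE = {se:.4f}, approx. 95% band = ({low:.3f}, {high:.3f})")
# prints: SE = 0.0031, approx. 95% band = (0.044, 0.056)
```

Under these assumptions, an estimate between roughly .044 and .056 is within sampling error of the nominal .05, whereas the rates reported for the other tests at N = 40 and above (for example, RE rates near .11) lie far outside that band.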
Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 24/161/14/6sk=0.00, kr=0.000.8880.0500.1190.3110.121 24/161/24/6sk=0.00, kr=0.000.9200.0520.1150.3110.115 24/161/44/6sk=0.00, kr=0.000.9280.0490.1150.3410.116 24/161/14/6sk=1.00, kr=3.000.9030.0490.1090.3030.110 24/161/24/6sk=1.00, kr=3.000.9190.0480.1110.3150.112 24/161/44/6sk=1.00, kr=3.000.9470.0500.1180.3470.119 24/161/14/6sk=2.00, kr=6.000.9060.0550.1230.3160.124 24/161/24/6sk=2.00, kr=6.000.9310.0440.1100.3230.111 24/161/44/6sk=2.00, kr=6.000.9470.0520.1130.3450.114 100/1001/14/6sk=0.00, kr=0.000.9990.0500.1120.6100.112 100/1001/24/6sk=0.00, kr=0.001.0000.0430.1120.6090.112 100/1001/44/6sk=0.00, kr=0.001.0000.0550.1130.6090.113 100/1001/14/6sk=1.00, kr=3.000.9990.0500.1180.6200.118 100/1001/24/6sk=1.00, kr=3.000.9990.0460.1200.6150.120 100/1001/44/6sk=1.00, kr=3.000.9990.0520.1200.6130.120 100/1001/14/6sk=2.00, kr=6.000.9990.0460.1210.6180.122 100/1001/24/6sk=2.00, kr=6.000.9990.0500.1140.6150.114 100/1001/44/6sk=2.00, kr=6.001.0000.0500.1180.6190.118 80/1201/14/6sk=0.00, kr=0.000.9990.0480.1180.6050.118 80/1201/24/6sk=0.00, kr=0.000.9990.0480.1180.5910.118 80/1201/44/6sk=0.00, kr=0.000.9990.0480.1190.5870.119 80/1201/14/6sk=1.00, kr=3.000.9990.0530.1200.6140.120 80/1201/24/6sk=1.00, kr=3.001.0000.0470.1080.5940.108 80/1201/44/6sk=1.00, kr=3.000.9990.0480.1150.5980.115 80/1201/14/6sk=2.00, kr=6.000.9990.0480.1140.6050.114 80/1201/24/6sk=2.00, kr=6.000.9990.0490.1200.5960.120 80/1201/44/6sk=2.00, kr=6.000.9990.0530.1190.5970.119 120/801/14/6sk=0.00, kr=0.000.9990.0460.1160.6130.116 120/801/24/6sk=0.00, kr=0.001.0000.0510.1160.6170.116 120/801/44/6sk=0.00, kr=0.001.0000.0510.1160.6290.116 120/801/14/6sk=1.00, kr=3.000.9990.0490.1200.6060.120 120/801/24/6sk=1.00, kr=3.001.0000.0500.1190.6210.119 120/801/44/6sk=1.00, kr=3.001.0000.0470.1170.6360.117 120/801/14/6sk=2.00, kr=6.000.9990.0490.1180.5980.118 120/801/24/6sk=2.00, kr=6.001.0000.0500.1220.6370.122 
120/801/44/6sk=2.00, kr=6.001.0000.0470.1110.6370.111 PAGE 141 131 Table 20 Type I Error Rate Estimates ( 2 =.33, =.8) at =.05 for K=10, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded areas signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure The same pattern of results continued (see Table 20) as displayed in the prior table (Table 19) and the permuted Q retained its robustness in terms of Type I error control. Again, none of the other tests maintained robustness under increasing heterogeneity of effects. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/14/6sk=0.00, kr=0.000.2480.0520.0680.0890.069 5/51/24/6sk=0.00, kr=0.000.2640.0490.0680.0930.070 5/51/44/6sk=0.00, kr=0.000.3040.0500.0790.1000.080 5/51/14/6sk=1.00, kr=3.000.2960.0520.0780.1020.080 5/51/24/6sk=1.00, kr=3.000.2760.0500.0690.0890.072 5/51/44/6sk=1.00, kr=3.000.3090.0500.0780.1020.080 5/51/14/6sk=2.00, kr=6.000.4050.0520.0870.1200.089 5/51/24/6sk=2.00, kr=6.000.3500.0520.0770.1070.078 5/51/44/6sk=2.00, kr=6.000.3530.0490.0770.1090.078 4/61/14/6sk=0.00, kr=0.000.2380.0470.0670.0890.070 4/61/24/6sk=0.00, kr=0.000.2090.0500.0620.0770.063 4/61/44/6sk=0.00, kr=0.000.1780.0470.0580.0720.058 4/61/14/6sk=1.00, kr=3.000.2890.0510.0780.1020.080 4/61/24/6sk=1.00, kr=3.000.2230.0480.0650.0790.067 4/61/44/6sk=1.00, kr=3.000.1890.0570.0720.0880.075 4/61/14/6sk=2.00, kr=6.000.3850.0500.0830.1150.086 4/61/24/6sk=2.00, kr=6.000.2810.0430.0660.0880.068 4/61/44/6sk=2.00, kr=6.000.2260.0490.0650.0830.067 6/41/14/6sk=0.00, kr=0.000.2410.0530.0680.0890.070 6/41/24/6sk=0.00, kr=0.000.3290.0490.0760.1030.079 6/41/44/6sk=0.00, kr=0.000.4520.0470.0790.1190.081 6/41/14/6sk=1.00, kr=3.000.2990.0470.0730.0980.076 6/41/24/6sk=1.00, kr=3.000.3580.0550.0840.1150.086 
6/4            1/4         4/6      sk=1.00, kr=3.00  0.440  0.048  0.081  0.120  0.083
6/4            1/1         4/6      sk=2.00, kr=6.00  0.389  0.053  0.084  0.117  0.086
6/4            1/2         4/6      sk=2.00, kr=6.00  0.407  0.046  0.085  0.123  0.088
6/4            1/4         4/6      sk=2.00, kr=6.00  0.455  0.049  0.083  0.123  0.084
20/20          1/1         4/6      sk=0.00, kr=0.00  0.881  0.057  0.122  0.296  0.124
20/20          1/2         4/6      sk=0.00, kr=0.00  0.884  0.048  0.110  0.280  0.111
20/20          1/4         4/6      sk=0.00, kr=0.00  0.883  0.054  0.115  0.292  0.116
20/20          1/1         4/6      sk=1.00, kr=3.00  0.906  0.050  0.119  0.319  0.120
20/20          1/2         4/6      sk=1.00, kr=3.00  0.892  0.048  0.111  0.301  0.111
20/20          1/4         4/6      sk=1.00, kr=3.00  0.890  0.052  0.106  0.300  0.106
20/20          1/1         4/6      sk=2.00, kr=6.00  0.917  0.054  0.122  0.321  0.122
20/20          1/2         4/6      sk=2.00, kr=6.00  0.890  0.052  0.117  0.308  0.117
20/20          1/4         4/6      sk=2.00, kr=6.00  0.881  0.054  0.116  0.301  0.116
16/24          1/1         4/6      sk=0.00, kr=0.00  0.872  0.049  0.115  0.293  0.116
16/24          1/2         4/6      sk=0.00, kr=0.00  0.846  0.048  0.103  0.265  0.105
16/24          1/4         4/6      sk=0.00, kr=0.00  0.813  0.049  0.108  0.261  0.111
16/24          1/1         4/6      sk=1.00, kr=3.00  0.891  0.054  0.123  0.309  0.124
16/24          1/2         4/6      sk=1.00, kr=3.00  0.847  0.048  0.111  0.273  0.113
16/24          1/4         4/6      sk=1.00, kr=3.00  0.836  0.042  0.106  0.265  0.108
16/24          1/1         4/6      sk=2.00, kr=6.00  0.905  0.055  0.118  0.318  0.119
16/24          1/2         4/6      sk=2.00, kr=6.00  0.863  0.053  0.111  0.293  0.112
16/24          1/4         4/6      sk=2.00, kr=6.00  0.827  0.050  0.102  0.268  0.104

Table 20 (continued). Type I Error Rate Estimates (τ² = .33, δ = .8) at α = .05 for K = 10, N = 5000

Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion. *Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure.

Permuted Q continued to maintain Type I error control across all conditions as sample sizes increased. By Bradley's criterion, none of the other tests maintained Type I error control; their Type I error rates were inflated and became increasingly so as the primary study sample sizes increased.
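Bradley's criterion can be applied mechanically to any row of these tables. The sketch below is illustrative only: this section does not state which of Bradley's bounds was used for the shading, so the tolerance is left as a parameter (0.5 gives Bradley's liberal range of 0.5α to 1.5α; 0.1 gives his stringent range of 0.9α to 1.1α). The function names are hypothetical. The example row is the first row of Table 20 (5/5, 1/1, 4/6, sk = 0.00, kr = 0.00).

```python
def bradley_bounds(alpha, tolerance):
    """Return the (lower, upper) range of acceptable empirical Type I error rates."""
    return alpha * (1.0 - tolerance), alpha * (1.0 + tolerance)


def classify(rate, alpha, tolerance):
    """Label an empirical rejection rate relative to the chosen bounds."""
    lower, upper = bradley_bounds(alpha, tolerance)
    if rate < lower:
        return "conservative"
    if rate > upper:
        return "inflated"
    return "within"


def classify_row(row, alpha, tolerance):
    """Classify every test in one table row (a dict of test name -> rate)."""
    return {test: classify(rate, alpha, tolerance) for test, rate in row.items()}


# First row of Table 20, nominal alpha = .05.
row = {"Reg Q": 0.248, "PQb": 0.052, "RE": 0.068, "FE": 0.089, "CR": 0.069}

# The tolerance is an assumption; both of Bradley's ranges are shown.
liberal = classify_row(row, alpha=0.05, tolerance=0.5)
stringent = classify_row(row, alpha=0.05, tolerance=0.1)

for test in row:
    print(f"{test:6s} liberal={liberal[test]:12s} stringent={stringent[test]}")
```

The two tolerances can label the same rate differently, so any comparison against the shading should state which range is being assumed.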
Primary study  Population  N of     Population
sample sizes   variances   studies  shape             Reg Q  PQb    RE     FE     CR
24/16          1/1         4/6      sk=0.00, kr=0.00  0.872  0.048  0.114  0.283  0.115
24/16          1/2         4/6      sk=0.00, kr=0.00  0.900  0.045  0.109  0.307  0.110
24/16          1/4         4/6      sk=0.00, kr=0.00  0.917  0.047  0.117  0.335  0.118
24/16          1/1         4/6      sk=1.00, kr=3.00  0.887  0.050  0.107  0.293  0.108
24/16          1/2         4/6      sk=1.00, kr=3.00  0.905  0.045  0.108  0.311  0.110
24/16          1/4         4/6      sk=1.00, kr=3.00  0.916  0.049  0.112  0.328  0.113
24/16          1/1         4/6      sk=2.00, kr=6.00  0.899  0.048  0.109  0.308  0.110
24/16          1/2         4/6      sk=2.00, kr=6.00  0.910  0.051  0.117  0.321  0.118
24/16          1/4         4/6      sk=2.00, kr=6.00  0.917  0.049  0.110  0.330  0.111
100/100        1/1         4/6      sk=0.00, kr=0.00  0.999  0.047  0.114  0.603  0.114
100/100        1/2         4/6      sk=0.00, kr=0.00  0.999  0.053  0.127  0.608  0.127
100/100        1/4         4/6      sk=0.00, kr=0.00  0.999  0.047  0.113  0.599  0.113
100/100        1/1         4/6      sk=1.00, kr=3.00  0.998  0.049  0.109  0.609  0.109
100/100        1/2         4/6      sk=1.00, kr=3.00  1.000  0.051  0.118  0.611  0.118
100/100        1/4         4/6      sk=1.00, kr=3.00  1.000  0.047  0.116  0.602  0.116
100/100        1/1         4/6      sk=2.00, kr=6.00  0.998  0.054  0.118  0.607  0.118
100/100        1/2         4/6      sk=2.00, kr=6.00  0.999  0.050  0.113  0.618  0.113
100/100        1/4         4/6      sk=2.00, kr=6.00  1.000  0.049  0.112  0.603  0.112
80/120         1/1         4/6      sk=0.00, kr=0.00  0.999  0.050  0.117  0.599  0.117
80/120         1/2         4/6      sk=0.00, kr=0.00  0.999  0.043  0.116  0.572  0.116
80/120         1/4         4/6      sk=0.00, kr=0.00  0.999  0.050  0.116  0.587  0.116
80/120         1/1         4/6      sk=1.00, kr=3.00  0.999  0.051  0.120  0.611  0.120
80/120         1/2         4/6      sk=1.00, kr=3.00  1.000  0.044  0.109  0.599  0.109
80/120         1/4         4/6      sk=1.00, kr=3.00  0.999  0.046  0.110  0.583  0.110
80/120         1/1         4/6      sk=2.00, kr=6.00  0.999  0.049  0.122  0.618  0.123
80/120         1/2         4/6      sk=2.00, kr=6.00  0.999  0.054  0.119  0.599  0.119
80/120         1/4         4/6      sk=2.00, kr=6.00  0.999  0.050  0.118  0.584  0.118
120/80         1/1         4/6      sk=0.00, kr=0.00  1.000  0.048  0.111  0.592  0.111
120/80         1/2         4/6      sk=0.00, kr=0.00  0.999  0.052  0.119  0.617  0.119
120/80         1/4         4/6      sk=0.00, kr=0.00  1.000  0.046  0.110  0.612  0.110
120/80         1/1         4/6      sk=1.00, kr=3.00  0.999  0.050  0.117  0.596  0.117
120/80         1/2         4/6      sk=1.00, kr=3.00  0.999  0.049  0.118  0.612  0.118
120/80         1/4         4/6      sk=1.00, kr=3.00  0.999  0.049  0.118  0.614  0.118
120/80         1/1         4/6      sk=2.00, kr=6.00  1.000  0.044  0.113  0.608  0.113
120/80         1/2         4/6      sk=2.00, kr=6.00  0.999  0.051  0.123  0.626  0.123
120/80         1/4         4/6      sk=2.00, kr=6.00  1.000  0.045  0.117  0.617  0.117

Table 21. Type I Error Rate Estimates (τ² = .33, δ = .8) at α = .10 for K = 10, N = 5000

Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion. *Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure.

Primary study  Population  N of     Population
sample sizes   variances   studies  shape             Reg Q  PQb    RE     FE     CR
5/5            1/1         4/6      sk=0.00, kr=0.00  0.364  0.103  0.127  0.151  0.131
5/5            1/2         4/6      sk=0.00, kr=0.00  0.374  0.099  0.128  0.163  0.134
5/5            1/4         4/6      sk=0.00, kr=0.00  0.422  0.104  0.134  0.166  0.137
5/5            1/1         4/6      sk=1.00, kr=3.00  0.428  0.101  0.135  0.169  0.139
5/5            1/2         4/6      sk=1.00, kr=3.00  0.397  0.097  0.125  0.155  0.129
5/5            1/4         4/6      sk=1.00, kr=3.00  0.427  0.104  0.137  0.168  0.141
5/5            1/1         4/6      sk=2.00, kr=6.00  0.536  0.107  0.140  0.190  0.145
5/5            1/2         4/6      sk=2.00, kr=6.00  0.467  0.097  0.133  0.176  0.137
5/5            1/4         4/6      sk=2.00, kr=6.00  0.470  0.102  0.137  0.177  0.141
4/6            1/1         4/6      sk=0.00, kr=0.00  0.349  0.098  0.127  0.159  0.132
4/6            1/2         4/6      sk=0.00, kr=0.00  0.312  0.099  0.114  0.141  0.119
4/6            1/4         4/6      sk=0.00, kr=0.00  0.284  0.101  0.113  0.130  0.118
4/6            1/1         4/6      sk=1.00, kr=3.00  0.411  0.109  0.139  0.175  0.146
4/6            1/2         4/6      sk=1.00, kr=3.00  0.332  0.093  0.114  0.137  0.117
4/6            1/4         4/6      sk=1.00, kr=3.00  0.292  0.112  0.122  0.145  0.126
4/6            1/1         4/6      sk=2.00, kr=6.00  0.509  0.106  0.141  0.186  0.148
4/6            1/2         4/6      sk=2.00, kr=6.00  0.386  0.098  0.128  0.159  0.133
4/6            1/4         4/6      sk=2.00, kr=6.00  0.322  0.103  0.116  0.144  0.122
6/4            1/1         4/6      sk=0.00, kr=0.00  0.353  0.108  0.132  0.159  0.136
6/4            1/2         4/6      sk=0.00, kr=0.00  0.459  0.101  0.138  0.174  0.143
6/4            1/4         4/6      sk=0.00, kr=0.00  0.571  0.100  0.136  0.189  0.139
6/4            1/1         4/6      sk=1.00, kr=3.00  0.428  0.097  0.135  0.172  0.141
6/4            1/2         4/6      sk=1.00, kr=3.00  0.484  0.106  0.143  0.180  0.146
6/4            1/4         4/6      sk=1.00, kr=3.00  0.561  0.104  0.142  0.198  0.149
6/4            1/1         4/6      sk=2.00, kr=6.00  0.525  0.102  0.143  0.186  0.147
6/4            1/2         4/6      sk=2.00, kr=6.00  0.532  0.104  0.144  0.195  0.150
6/4            1/4         4/6      sk=2.00, kr=6.00  0.571  0.099  0.141  0.194  0.146
20/20          1/1         4/6      sk=0.00, kr=0.00  0.919  0.112  0.180  0.374  0.182
20/20          1/2         4/6      sk=0.00, kr=0.00  0.923  0.098  0.167  0.367  0.170
20/20          1/4         4/6      sk=0.00, kr=0.00  0.930  0.106  0.178  0.378  0.180
20/20          1/1         4/6      sk=1.00, kr=3.00  0.944  0.112  0.179  0.402  0.181
20/20          1/2         4/6      sk=1.00, kr=3.00  0.928  0.104  0.171  0.383  0.172
20/20          1/4         4/6      sk=1.00, kr=3.00  0.928  0.098  0.167  0.386  0.170
20/20          1/1         4/6      sk=2.00, kr=6.00  0.946  0.110  0.182  0.405  0.184
20/20          1/2         4/6      sk=2.00, kr=6.00  0.926  0.106  0.182  0.400  0.184
20/20          1/4         4/6      sk=2.00, kr=6.00  0.919  0.107  0.177  0.383  0.180
16/24          1/1         4/6      sk=0.00, kr=0.00  0.916  0.098  0.174  0.374  0.176
16/24          1/2         4/6      sk=0.00, kr=0.00  0.897  0.096  0.167  0.346  0.170
16/24          1/4         4/6      sk=0.00, kr=0.00  0.878  0.101  0.166  0.341  0.169
16/24          1/1         4/6      sk=1.00, kr=3.00  0.930  0.112  0.184  0.391  0.186
16/24          1/2         4/6      sk=1.00, kr=3.00  0.900  0.099  0.179  0.358  0.183
16/24          1/4         4/6      sk=1.00, kr=3.00  0.885  0.096  0.169  0.349  0.171
16/24          1/1         4/6      sk=2.00, kr=6.00  0.939  0.106  0.182  0.400  0.185
16/24          1/2         4/6      sk=2.00, kr=6.00  0.913  0.103  0.172  0.379  0.175
16/24          1/4         4/6      sk=2.00, kr=6.00  0.886  0.098  0.166  0.352  0.170

Table 21 (continued). Type I Error Rate Estimates (τ² = .33, δ = .8) at α = .10 for K = 10, N = 5000

Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion. *Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure.

Increasing the nominal alpha level from .05 to .10 (compare Tables 20 and 21) did not improve the performance of any of the five tests under investigation when K = 10. Permuted Q still maintained adequate Type I error control; no other test did. When the primary study sample sizes were 40 or greater, both the regular Q and the FE tests still failed to constrain Type I error. Therefore, when the groups were truly equal (i.e., there was no difference), these tests too often indicated a difference that did not exist.
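Each estimate in these tables is a proportion of rejections over N = 5000 replications, so it carries binomial sampling error of its own. As a rough guide (a sketch, not part of the original analysis), the standard error of an empirical rate whose true value equals the nominal alpha is sqrt(α(1 − α)/N). The helper name below is hypothetical.

```python
import math


def mc_standard_error(p, n_reps):
    """Binomial standard error of a rejection rate estimated from n_reps replications."""
    return math.sqrt(p * (1.0 - p) / n_reps)


# N = 5000 replications per condition, as stated in the table titles.
for alpha in (0.05, 0.10):
    se = mc_standard_error(alpha, 5000)
    low, high = alpha - 1.96 * se, alpha + 1.96 * se
    print(f"alpha={alpha:.2f}  SE={se:.4f}  approx. 95% range=({low:.4f}, {high:.4f})")
```

Under this approximation, a test that holds its nominal level exactly would be expected to produce estimates of roughly .044 to .056 at α = .05 and roughly .092 to .108 at α = .10 from sampling error alone.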
Primary study  Population  N of     Population
sample sizes   variances   studies  shape             Reg Q  PQb    RE     FE     CR
24/16          1/1         4/6      sk=0.00, kr=0.00  0.919  0.105  0.178  0.374  0.182
24/16          1/2         4/6      sk=0.00, kr=0.00  0.933  0.099  0.174  0.386  0.176
24/16          1/4         4/6      sk=0.00, kr=0.00  0.946  0.105  0.185  0.417  0.187
24/16          1/1         4/6      sk=1.00, kr=3.00  0.925  0.101  0.171  0.383  0.172
24/16          1/2         4/6      sk=1.00, kr=3.00  0.938  0.097  0.174  0.392  0.175
24/16          1/4         4/6      sk=1.00, kr=3.00  0.948  0.103  0.177  0.409  0.178
24/16          1/1         4/6      sk=2.00, kr=6.00  0.931  0.099  0.174  0.390  0.175
24/16          1/2         4/6      sk=2.00, kr=6.00  0.944  0.110  0.179  0.391  0.181
24/16          1/4         4/6      sk=2.00, kr=6.00  0.949  0.100  0.176  0.421  0.177
100/100        1/1         4/6      sk=0.00, kr=0.00  0.999  0.099  0.174  0.663  0.174
100/100        1/2         4/6      sk=0.00, kr=0.00  0.999  0.110  0.190  0.662  0.190
100/100        1/4         4/6      sk=0.00, kr=0.00  1.000  0.100  0.174  0.661  0.174
100/100        1/1         4/6      sk=1.00, kr=3.00  0.999  0.096  0.175  0.667  0.175
100/100        1/2         4/6      sk=1.00, kr=3.00  1.000  0.105  0.179  0.669  0.179
100/100        1/4         4/6      sk=1.00, kr=3.00  1.000  0.100  0.175  0.662  0.175
100/100        1/1         4/6      sk=2.00, kr=6.00  0.999  0.104  0.181  0.663  0.181
100/100        1/2         4/6      sk=2.00, kr=6.00  0.999  0.098  0.170  0.671  0.170
100/100        1/4         4/6      sk=2.00, kr=6.00  1.000  0.099  0.179  0.657  0.179
80/120         1/1         4/6      sk=0.00, kr=0.00  0.999  0.102  0.180  0.657  0.180
80/120         1/2         4/6      sk=0.00, kr=0.00  1.000  0.101  0.181  0.634  0.181
80/120         1/4         4/6      sk=0.00, kr=0.00  0.999  0.101  0.179  0.647  0.179
80/120         1/1         4/6      sk=1.00, kr=3.00  1.000  0.101  0.175  0.665  0.175
80/120         1/2         4/6      sk=1.00, kr=3.00  1.000  0.095  0.171  0.656  0.171
80/120         1/4         4/6      sk=1.00, kr=3.00  0.999  0.094  0.171  0.649  0.171
80/120         1/1         4/6      sk=2.00, kr=6.00  1.000  0.109  0.191  0.678  0.191
80/120         1/2         4/6      sk=2.00, kr=6.00  1.000  0.108  0.182  0.660  0.182
80/120         1/4         4/6      sk=2.00, kr=6.00  1.000  0.100  0.171  0.645  0.171
120/80         1/1         4/6      sk=0.00, kr=0.00  1.000  0.100  0.175  0.660  0.175
120/80         1/2         4/6      sk=0.00, kr=0.00  1.000  0.105  0.182  0.673  0.182
120/80         1/4         4/6      sk=0.00, kr=0.00  1.000  0.093  0.171  0.671  0.171
120/80         1/1         4/6      sk=1.00, kr=3.00  1.000  0.102  0.182  0.650  0.182
120/80         1/2         4/6      sk=1.00, kr=3.00  1.000  0.101  0.173  0.666  0.173
120/80         1/4         4/6      sk=1.00, kr=3.00  0.999  0.102  0.175  0.671  0.175
120/80         1/1         4/6      sk=2.00, kr=6.00  1.000  0.100  0.171  0.665  0.171
120/80         1/2         4/6      sk=2.00, kr=6.00  1.000  0.108  0.183  0.682  0.183
120/80         1/4         4/6      sk=2.00, kr=6.00  1.000  0.102  0.175  0.675  0.175

Table 22. Type I Error Rate Estimates (τ² = .33, δ = 0) at α = .05 for K = 30, N = 5000

Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion. *Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure.

Again, permuted Q maintained its robustness across almost all conditions. All other tests produced inflated Type I error rates. Despite the increase in the number of studies, K, the RE and CR tests did not regain robustness.

Primary study  Population  N of     Population
sample sizes   variances   studies  shape             Reg Q  PQb    RE     FE     CR
5/5            1/1         12/18    sk=0.00, kr=0.00  0.500  0.056  0.068  0.094  0.070
5/5            1/2         12/18    sk=0.00, kr=0.00  0.527  0.053  0.064  0.090  0.066
5/5            1/4         12/18    sk=0.00, kr=0.00  0.597  0.051  0.062  0.089  0.063
5/5            1/1         12/18    sk=1.00, kr=3.00  0.590  0.048  0.061  0.095  0.063
5/5            1/2         12/18    sk=1.00, kr=3.00  0.618  0.048  0.060  0.099  0.063
5/5            1/4         12/18    sk=1.00, kr=3.00  0.680  0.046  0.060  0.098  0.062
5/5            1/1         12/18    sk=2.00, kr=6.00  0.696  0.048  0.062  0.106  0.063
5/5            1/2         12/18    sk=2.00, kr=6.00  0.734  0.044  0.060  0.106  0.061
5/5            1/4         12/18    sk=2.00, kr=6.00  0.822  0.050  0.066  0.120  0.068
4/6            1/1         12/18    sk=0.00, kr=0.00  0.480  0.054  0.066  0.087  0.067
4/6            1/2         12/18    sk=0.00, kr=0.00  0.362  0.050  0.055  0.074  0.057
4/6            1/4         12/18    sk=0.00, kr=0.00  0.319  0.047  0.056  0.073  0.059
4/6            1/1         12/18    sk=1.00, kr=3.00  0.573  0.049  0.061  0.093  0.063
4/6            1/2         12/18    sk=1.00, kr=3.00  0.465  0.049  0.060  0.085  0.063
4/6            1/4         12/18    sk=1.00, kr=3.00  0.405  0.054  0.064  0.083  0.067
4/6            1/1         12/18    sk=2.00, kr=6.00  0.695  0.049  0.061  0.100  0.062
4/6            1/2         12/18    sk=2.00, kr=6.00  0.622  0.051  0.062  0.096  0.063
4/6            1/4         12/18    sk=2.00, kr=6.00  0.617  0.050  0.061  0.092  0.063
6/4            1/1         12/18    sk=0.00, kr=0.00  0.482  0.049  0.060  0.086  0.062
6/4            1/2         12/18    sk=0.00, kr=0.00  0.664  0.048  0.065  0.106  0.066
6/4            1/4         12/18    sk=0.00, kr=0.00  0.816  0.047  0.060  0.115  0.061
6/4            1/1         12/18    sk=1.00, kr=3.00  0.568  0.050  0.060  0.089  0.062
6/4            1/2         12/18    sk=1.00, kr=3.00  0.720  0.046  0.061  0.106  0.062
6/4            1/4         12/18    sk=1.00, kr=3.00  0.855  0.052  0.069  0.134  0.069
6/4            1/1         12/18    sk=2.00, kr=6.00  0.683  0.047  0.064  0.101  0.066
6/4            1/2         12/18    sk=2.00, kr=6.00  0.814  0.045  0.064  0.118  0.065
6/4            1/4         12/18    sk=2.00, kr=6.00  0.928  0.046  0.060  0.132  0.060
20/20          1/1         12/18    sk=0.00, kr=0.00  0.999  0.051  0.068  0.294  0.068
20/20          1/2         12/18    sk=0.00, kr=0.00  0.999  0.047  0.065  0.293  0.065
20/20          1/4         12/18    sk=0.00, kr=0.00  0.999  0.055  0.070  0.303  0.070
20/20          1/1         12/18    sk=1.00, kr=3.00  0.999  0.051  0.068  0.303  0.068
20/20          1/2         12/18    sk=1.00, kr=3.00  0.999  0.052  0.067  0.309  0.067
20/20          1/4         12/18    sk=1.00, kr=3.00  1.000  0.051  0.070  0.304  0.070
20/20          1/1         12/18    sk=2.00, kr=6.00  1.000  0.050  0.070  0.314  0.070
20/20          1/2         12/18    sk=2.00, kr=6.00  1.000  0.044  0.061  0.299  0.061
20/20          1/4         12/18    sk=2.00, kr=6.00  0.999  0.043  0.059  0.300  0.059
16/24          1/1         12/18    sk=0.00, kr=0.00  0.999  0.048  0.070  0.293  0.070
16/24          1/2         12/18    sk=0.00, kr=0.00  0.997  0.049  0.068  0.272  0.068
16/24          1/4         12/18    sk=0.00, kr=0.00  0.997  0.053  0.070  0.258  0.070
16/24          1/1         12/18    sk=1.00, kr=3.00  1.000  0.047  0.065  0.280  0.065
16/24          1/2         12/18    sk=1.00, kr=3.00  0.999  0.048  0.069  0.285  0.069
16/24          1/4         12/18    sk=1.00, kr=3.00  0.997  0.051  0.068  0.266  0.068
16/24          1/1         12/18    sk=2.00, kr=6.00  1.000  0.046  0.063  0.306  0.063
16/24          1/2         12/18    sk=2.00, kr=6.00  0.999  0.051  0.067  0.292  0.067
16/24          1/4         12/18    sk=2.00, kr=6.00  0.997  0.049  0.070  0.275  0.070

Table 22 (continued). Type I Error Rate Estimates (τ² = .33, δ = 0) at α = .05 for K = 30, N = 5000

Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion. *Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure.

Increasing the primary study sample size to 40 and above did not improve the robustness of the regular Q, RE, FE, and CR tests.
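One way to see why larger primary samples can make matters worse for a test that ignores between-studies variance: the sampling variance of a standardized mean difference shrinks as group sizes grow, while τ² does not. The sketch below uses the standard large-sample approximation v = (n1 + n2)/(n1·n2) + d²/(2(n1 + n2)) and asks how often a two-sided Z statistic built on v alone would reject if its true variance were v + τ². It is an idealized normal-theory illustration, not a reproduction of the simulation, and the function names are hypothetical.

```python
import math


def smd_variance(n1, n2, d=0.0):
    """Large-sample sampling variance of a standardized mean difference."""
    return (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def variance_ratio_and_rejection(n1, n2, tau2, z_crit=1.959964):
    """Return (true/assumed variance ratio, implied two-sided rejection rate).

    Assumes the test statistic is standard normal when tau2 = 0 and that
    between-studies variance simply adds tau2 to each study's variance.
    """
    v = smd_variance(n1, n2)
    ratio = (v + tau2) / v
    rejection = 2.0 * normal_cdf(-z_crit / math.sqrt(ratio))
    return ratio, rejection


for n1, n2 in [(5, 5), (20, 20), (100, 100)]:
    ratio, rejection = variance_ratio_and_rejection(n1, n2, tau2=0.33)
    print(f"n={n1}/{n2}  variance ratio={ratio:.2f}  implied rejection rate={rejection:.3f}")
```

The implied rates are not expected to match the tabled values, which come from small-sample, non-normal, and unequal-variance conditions; the point is only that the gap between assumed and true variance widens as the primary samples grow.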
Primary study  Population  N of     Population
sample sizes   variances   studies  shape             Reg Q  PQb    RE     FE     CR
24/16          1/1         12/18    sk=0.00, kr=0.00  0.999  0.050  0.070  0.286  0.070
24/16          1/2         12/18    sk=0.00, kr=0.00  1.000  0.048  0.066  0.313  0.066
24/16          1/4         12/18    sk=0.00, kr=0.00  1.000  0.051  0.067  0.319  0.067
24/16          1/1         12/18    sk=1.00, kr=3.00  0.999  0.050  0.069  0.293  0.069
24/16          1/2         12/18    sk=1.00, kr=3.00  1.000  0.055  0.075  0.322  0.075
24/16          1/4         12/18    sk=1.00, kr=3.00  1.000  0.045  0.065  0.334  0.065
24/16          1/1         12/18    sk=2.00, kr=6.00  0.999  0.047  0.062  0.305  0.062
24/16          1/2         12/18    sk=2.00, kr=6.00  0.999  0.051  0.068  0.328  0.068
24/16          1/4         12/18    sk=2.00, kr=6.00  1.000  0.047  0.067  0.327  0.067
100/100        1/1         12/18    sk=0.00, kr=0.00  1.000  0.048  0.067  0.605  0.067
100/100        1/2         12/18    sk=0.00, kr=0.00  1.000  0.052  0.071  0.618  0.071
100/100        1/4         12/18    sk=0.00, kr=0.00  1.000  0.048  0.066  0.608  0.066
100/100        1/1         12/18    sk=1.00, kr=3.00  1.000  0.048  0.066  0.606  0.066
100/100        1/2         12/18    sk=1.00, kr=3.00  1.000  0.054  0.076  0.610  0.076
100/100        1/4         12/18    sk=1.00, kr=3.00  1.000  0.051  0.068  0.610  0.068
100/100        1/1         12/18    sk=2.00, kr=6.00  1.000  0.053  0.076  0.601  0.076
100/100        1/2         12/18    sk=2.00, kr=6.00  1.000  0.052  0.071  0.616  0.071
100/100        1/4         12/18    sk=2.00, kr=6.00  1.000  0.058  0.079  0.624  0.079
80/120         1/1         12/18    sk=0.00, kr=0.00  1.000  0.047  0.064  0.608  0.064
80/120         1/2         12/18    sk=0.00, kr=0.00  1.000  0.046  0.062  0.599  0.062
80/120         1/4         12/18    sk=0.00, kr=0.00  1.000  0.051  0.065  0.584  0.065
80/120         1/1         12/18    sk=1.00, kr=3.00  1.000  0.043  0.065  0.592  0.065
80/120         1/2         12/18    sk=1.00, kr=3.00  1.000  0.050  0.067  0.593  0.067
80/120         1/4         12/18    sk=1.00, kr=3.00  1.000  0.052  0.075  0.582  0.075
80/120         1/1         12/18    sk=2.00, kr=6.00  1.000  0.050  0.068  0.614  0.068
80/120         1/2         12/18    sk=2.00, kr=6.00  1.000  0.052  0.071  0.585  0.071
80/120         1/4         12/18    sk=2.00, kr=6.00  1.000  0.056  0.074  0.598  0.074
120/80         1/1         12/18    sk=0.00, kr=0.00  1.000  0.052  0.073  0.601  0.073
120/80         1/2         12/18    sk=0.00, kr=0.00  1.000  0.048  0.068  0.610  0.068
120/80         1/4         12/18    sk=0.00, kr=0.00  1.000  0.047  0.067  0.613  0.067
120/80         1/1         12/18    sk=1.00, kr=3.00  1.000  0.051  0.070  0.608  0.070
120/80         1/2         12/18    sk=1.00, kr=3.00  1.000  0.047  0.064  0.621  0.064
120/80         1/4         12/18    sk=1.00, kr=3.00  1.000  0.052  0.072  0.619  0.072
120/80         1/1         12/18    sk=2.00, kr=6.00  1.000  0.050  0.070  0.598  0.070
120/80         1/2         12/18    sk=2.00, kr=6.00  1.000  0.052  0.070  0.626  0.070
120/80         1/4         12/18    sk=2.00, kr=6.00  1.000  0.047  0.067  0.636  0.067

Table 23. Type I Error Rate Estimates (τ² = .33, δ = .8) at α = .05 for K = 30, N = 5000

Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion. *Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure.

At δ = .8 (see Table 23), permuted Q remained robust when K increased to 30 and was most effective once the sample size reached 40. All other tests failed to provide adequate robustness under these conditions. The RE and CR tests showed minimal improvement in robustness with the increase in effect size to .8, particularly at sample sizes below 40.

Primary study  Population  N of     Population
sample sizes   variances   studies  shape             Reg Q  PQb    RE     FE     CR
5/5            1/1         12/18    sk=0.00, kr=0.00  0.482  0.052  0.063  0.090  0.065
5/5            1/2         12/18    sk=0.00, kr=0.00  0.515  0.056  0.068  0.092  0.070
5/5            1/4         12/18    sk=0.00, kr=0.00  0.589  0.051  0.064  0.095  0.066
5/5            1/1         12/18    sk=1.00, kr=3.00  0.595  0.047  0.060  0.094  0.062
5/5            1/2         12/18    sk=1.00, kr=3.00  0.558  0.049  0.061  0.093  0.064
5/5            1/4         12/18    sk=1.00, kr=3.00  0.582  0.048  0.059  0.099  0.061
5/5            1/1         12/18    sk=2.00, kr=6.00  0.741  0.048  0.062  0.113  0.063
5/5            1/2         12/18    sk=2.00, kr=6.00  0.652  0.053  0.065  0.105  0.066
5/5            1/4         12/18    sk=2.00, kr=6.00  0.639  0.047  0.059  0.099  0.062
4/6            1/1         12/18    sk=0.00, kr=0.00  0.463  0.053  0.061  0.088  0.063
4/6            1/2         12/18    sk=0.00, kr=0.00  0.363  0.046  0.057  0.073  0.059
4/6            1/4         12/18    sk=0.00, kr=0.00  0.317  0.050  0.052  0.068  0.055
4/6            1/1         12/18    sk=1.00, kr=3.00  0.572  0.051  0.060  0.091  0.061
4/6            1/2         12/18    sk=1.00, kr=3.00  0.419  0.052  0.058  0.081  0.060
4/6            1/4         12/18    sk=1.00, kr=3.00  0.337  0.050  0.054  0.070  0.055
4/6            1/1         12/18    sk=2.00, kr=6.00  0.728  0.048  0.061  0.105  0.062
4/6            1/2         12/18    sk=2.00, kr=6.00  0.521  0.050  0.063  0.090  0.066
4/6            1/4         12/18    sk=2.00, kr=6.00  0.417  0.044  0.050  0.074  0.053
6/4            1/1         12/18    sk=0.00, kr=0.00  0.466  0.046  0.056  0.081  0.058
6/4            1/2         12/18    sk=0.00, kr=0.00  0.624  0.047  0.059  0.094  0.060
6/4            1/4         12/18    sk=0.00, kr=0.00  0.796  0.046  0.060  0.115  0.061
6/4            1/1         12/18    sk=1.00, kr=3.00  0.573  0.055  0.068  0.099  0.069
6/4            1/2         12/18    sk=1.00, kr=3.00  0.686  0.050  0.062  0.102  0.064
6/4            1/4         12/18    sk=1.00, kr=3.00  0.786  0.050  0.067  0.126  0.068
6/4            1/1         12/18    sk=2.00, kr=6.00  0.729  0.046  0.059  0.101  0.060
6/4            1/2         12/18    sk=2.00, kr=6.00  0.749  0.052  0.066  0.111  0.067
6/4            1/4         12/18    sk=2.00, kr=6.00  0.798  0.055  0.071  0.126  0.073
20/20          1/1         12/18    sk=0.00, kr=0.00  0.999  0.048  0.066  0.295  0.066
20/20          1/2         12/18    sk=0.00, kr=0.00  0.999  0.044  0.064  0.295  0.064
20/20          1/4         12/18    sk=0.00, kr=0.00  0.999  0.044  0.063  0.288  0.063
20/20          1/1         12/18    sk=1.00, kr=3.00  1.000  0.049  0.070  0.302  0.070
20/20          1/2         12/18    sk=1.00, kr=3.00  0.999  0.049  0.066  0.291  0.066
20/20          1/4         12/18    sk=1.00, kr=3.00  0.999  0.047  0.067  0.289  0.067
20/20          1/1         12/18    sk=2.00, kr=6.00  1.000  0.043  0.060  0.312  0.060
20/20          1/2         12/18    sk=2.00, kr=6.00  1.000  0.058  0.073  0.308  0.073
20/20          1/4         12/18    sk=2.00, kr=6.00  0.999  0.051  0.067  0.307  0.067
16/24          1/1         12/18    sk=0.00, kr=0.00  0.996  0.047  0.066  0.279  0.066
16/24          1/2         12/18    sk=0.00, kr=0.00  0.996  0.049  0.069  0.268  0.069
16/24          1/4         12/18    sk=0.00, kr=0.00  0.996  0.049  0.068  0.260  0.069
16/24          1/1         12/18    sk=1.00, kr=3.00  1.000  0.049  0.069  0.285  0.069
16/24          1/2         12/18    sk=1.00, kr=3.00  0.996  0.046  0.065  0.270  0.065
16/24          1/4         12/18    sk=1.00, kr=3.00  0.995  0.051  0.069  0.257  0.069
16/24          1/1         12/18    sk=2.00, kr=6.00  1.000  0.051  0.068  0.314  0.068
16/24          1/2         12/18    sk=2.00, kr=6.00  0.998  0.049  0.067  0.274  0.067
16/24          1/4         12/18    sk=2.00, kr=6.00  0.996  0.053  0.074  0.266  0.074

Table 23 (continued). Type I Error Rate Estimates (τ² = .33, δ = .8) at α = .05 for K = 30, N = 5000

Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion. *Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure.

When the primary study sample size was 40 and above, permuted Q showed adequate Type I error control. None of the other tests was robust.
Primary study  Population  N of     Population
sample sizes   variances   studies  shape             Reg Q  PQb    RE     FE     CR
24/16          1/1         12/18    sk=0.00, kr=0.00  0.997  0.050  0.070  0.278  0.070
24/16          1/2         12/18    sk=0.00, kr=0.00  0.999  0.049  0.067  0.295  0.067
24/16          1/4         12/18    sk=0.00, kr=0.00  0.999  0.044  0.062  0.316  0.062
24/16          1/1         12/18    sk=1.00, kr=3.00  0.999  0.051  0.068  0.290  0.068
24/16          1/2         12/18    sk=1.00, kr=3.00  0.999  0.048  0.064  0.303  0.064
24/16          1/4         12/18    sk=1.00, kr=3.00  1.000  0.055  0.071  0.327  0.071
24/16          1/1         12/18    sk=2.00, kr=6.00  1.000  0.054  0.071  0.304  0.071
24/16          1/2         12/18    sk=2.00, kr=6.00  1.000  0.053  0.072  0.301  0.072
24/16          1/4         12/18    sk=2.00, kr=6.00  1.000  0.055  0.075  0.325  0.075
100/100        1/1         12/18    sk=0.00, kr=0.00  1.000  0.048  0.070  0.592  0.070
100/100        1/2         12/18    sk=0.00, kr=0.00  1.000  0.049  0.072  0.603  0.072
100/100        1/4         12/18    sk=0.00, kr=0.00  1.000  0.050  0.065  0.601  0.065
100/100        1/1         12/18    sk=1.00, kr=3.00  1.000  0.052  0.070  0.600  0.070
100/100        1/2         12/18    sk=1.00, kr=3.00  1.000  0.044  0.063  0.604  0.063
100/100        1/4         12/18    sk=1.00, kr=3.00  1.000  0.047  0.067  0.600  0.067
100/100        1/1         12/18    sk=2.00, kr=6.00  1.000  0.046  0.065  0.622  0.065
100/100        1/2         12/18    sk=2.00, kr=6.00  1.000  0.049  0.068  0.604  0.068
100/100        1/4         12/18    sk=2.00, kr=6.00  1.000  0.046  0.069  0.611  0.069
80/120         1/1         12/18    sk=0.00, kr=0.00  1.000  0.046  0.067  0.589  0.067
80/120         1/2         12/18    sk=0.00, kr=0.00  1.000  0.053  0.075  0.590  0.075
80/120         1/4         12/18    sk=0.00, kr=0.00  1.000  0.049  0.068  0.582  0.068
80/120         1/1         12/18    sk=1.00, kr=3.00  1.000  0.042  0.061  0.597  0.061
80/120         1/2         12/18    sk=1.00, kr=3.00  1.000  0.051  0.070  0.580  0.070
80/120         1/4         12/18    sk=1.00, kr=3.00  1.000  0.049  0.070  0.577  0.070
80/120         1/1         12/18    sk=2.00, kr=6.00  1.000  0.052  0.072  0.606  0.072
80/120         1/2         12/18    sk=2.00, kr=6.00  1.000  0.049  0.071  0.591  0.071
80/120         1/4         12/18    sk=2.00, kr=6.00  1.000  0.050  0.070  0.581  0.070
120/80         1/1         12/18    sk=0.00, kr=0.00  1.000  0.050  0.068  0.596  0.068
120/80         1/2         12/18    sk=0.00, kr=0.00  1.000  0.048  0.064  0.607  0.064
120/80         1/4         12/18    sk=0.00, kr=0.00  1.000  0.046  0.062  0.618  0.062
120/80         1/1         12/18    sk=1.00, kr=3.00  1.000  0.048  0.065  0.598  0.065
120/80         1/2         12/18    sk=1.00, kr=3.00  1.000  0.051  0.071  0.612  0.071
120/80         1/4         12/18    sk=1.00, kr=3.00  1.000  0.054  0.068  0.618  0.068
120/80         1/1         12/18    sk=2.00, kr=6.00  1.000  0.049  0.068  0.600  0.068
120/80         1/2         12/18    sk=2.00, kr=6.00  1.000  0.047  0.066  0.609  0.066
120/80         1/4         12/18    sk=2.00, kr=6.00  1.000  0.049  0.066  0.614  0.066

Table 24. Type I Error Rate Estimates (τ² = .33, δ = .8) at α = .10 for K = 30, N = 5000

Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion. *Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure.

Increasing the nominal alpha level from .05 to .10 (see Table 24) produced a minimal improvement in robustness for the RE and CR tests when K = 30 and sample sizes were smaller than 40. For these two tests, Type I error control was the aspect of performance that benefited from the increase in nominal alpha.

Primary study  Population  N of     Population
sample sizes   variances   studies  shape             Reg Q  PQb    RE     FE     CR
5/5            1/1         12/18    sk=0.00, kr=0.00  0.618  0.105  0.117  0.156  0.123
5/5            1/2         12/18    sk=0.00, kr=0.00  0.645  0.105  0.119  0.156  0.123
5/5            1/4         12/18    sk=0.00, kr=0.00  0.703  0.102  0.122  0.167  0.126
5/5            1/1         12/18    sk=1.00, kr=3.00  0.714  0.100  0.114  0.162  0.119
5/5            1/2         12/18    sk=1.00, kr=3.00  0.683  0.102  0.115  0.158  0.119
5/5            1/4         12/18    sk=1.00, kr=3.00  0.697  0.099  0.115  0.163  0.119
5/5            1/1         12/18    sk=2.00, kr=6.00  0.827  0.102  0.123  0.184  0.125
5/5            1/2         12/18    sk=2.00, kr=6.00  0.754  0.102  0.119  0.172  0.122
5/5            1/4         12/18    sk=2.00, kr=6.00  0.742  0.099  0.115  0.166  0.118
4/6            1/1         12/18    sk=0.00, kr=0.00  0.591  0.103  0.113  0.152  0.119
4/6            1/2         12/18    sk=0.00, kr=0.00  0.489  0.093  0.102  0.125  0.106
4/6            1/4         12/18    sk=0.00, kr=0.00  0.438  0.099  0.106  0.126  0.110
4/6            1/1         12/18    sk=1.00, kr=3.00  0.696  0.094  0.109  0.158  0.114
4/6            1/2         12/18    sk=1.00, kr=3.00  0.552  0.096  0.103  0.137  0.110
4/6            1/4         12/18    sk=1.00, kr=3.00  0.467  0.099  0.106  0.136  0.112
4/6            1/1         12/18    sk=2.00, kr=6.00  0.817  0.100  0.117  0.177  0.119
4/6            1/2         12/18    sk=2.00, kr=6.00  0.642  0.104  0.113  0.155  0.116
4/6            1/4         12/18    sk=2.00, kr=6.00  0.537  0.095  0.103  0.138  0.108
6/4            1/1         12/18    sk=0.00, kr=0.00  0.596  0.099  0.110  0.144  0.115
6/4            1/2         12/18    sk=0.00, kr=0.00  0.737  0.094  0.110  0.165  0.116
6/4            1/4         12/18    sk=0.00, kr=0.00  0.876  0.094  0.112  0.184  0.113
6/4            1/1         12/18    sk=1.00, kr=3.00  0.706  0.105  0.120  0.166  0.125
6/4            1/2         12/18    sk=1.00, kr=3.00  0.788  0.096  0.114  0.172  0.117
6/4            1/4         12/18    sk=1.00, kr=3.00  0.867  0.104  0.125  0.197  0.127
6/4            1/1         12/18    sk=2.00, kr=6.00  0.826  0.094  0.113  0.171  0.115
6/4            1/2         12/18    sk=2.00, kr=6.00  0.842  0.101  0.120  0.183  0.122
6/4            1/4         12/18    sk=2.00, kr=6.00  0.867  0.108  0.129  0.202  0.131
20/20          1/1         12/18    sk=0.00, kr=0.00  0.999  0.099  0.121  0.376  0.121
20/20          1/2         12/18    sk=0.00, kr=0.00  1.000  0.102  0.123  0.377  0.123
20/20          1/4         12/18    sk=0.00, kr=0.00  1.000  0.098  0.121  0.369  0.121
20/20          1/1         12/18    sk=1.00, kr=3.00  1.000  0.105  0.128  0.383  0.128
20/20          1/2         12/18    sk=1.00, kr=3.00  0.999  0.102  0.122  0.377  0.122
20/20          1/4         12/18    sk=1.00, kr=3.00  1.000  0.098  0.119  0.362  0.119
20/20          1/1         12/18    sk=2.00, kr=6.00  1.000  0.091  0.114  0.393  0.114
20/20          1/2         12/18    sk=2.00, kr=6.00  1.000  0.104  0.126  0.392  0.126
20/20          1/4         12/18    sk=2.00, kr=6.00  1.000  0.098  0.122  0.388  0.122
16/24          1/1         12/18    sk=0.00, kr=0.00  0.999  0.098  0.118  0.366  0.118
16/24          1/2         12/18    sk=0.00, kr=0.00  0.999  0.100  0.122  0.351  0.122
16/24          1/4         12/18    sk=0.00, kr=0.00  0.998  0.101  0.124  0.345  0.125
16/24          1/1         12/18    sk=1.00, kr=3.00  1.000  0.099  0.121  0.366  0.121
16/24          1/2         12/18    sk=1.00, kr=3.00  0.999  0.097  0.116  0.362  0.116
16/24          1/4         12/18    sk=1.00, kr=3.00  0.998  0.099  0.117  0.347  0.117
16/24          1/1         12/18    sk=2.00, kr=6.00  1.000  0.105  0.124  0.400  0.124
16/24          1/2         12/18    sk=2.00, kr=6.00  0.999  0.099  0.119  0.360  0.119
16/24          1/4         12/18    sk=2.00, kr=6.00  0.998  0.109  0.133  0.343  0.133

Table 24 (continued). Type I Error Rate Estimates (τ² = .33, δ = .8) at α = .10 for K = 30, N = 5000

Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded areas signify those conditions in which the error rate fell within Bradley's criterion. *Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure.

As the sample size increased (see Table 24), the RE and CR tests produced greater inflation of Type I error rates.
Therefore, permuted Q was the only test that was effective under these conditions.

Primary study  Population  N of     Population
sample sizes   variances   studies  shape             Reg Q  PQb    RE     FE     CR
24/16          1/1         12/18    sk=0.00, kr=0.00  0.998  0.098  0.121  0.359  0.121
24/16          1/2         12/18    sk=0.00, kr=0.00  0.999  0.100  0.118  0.379  0.118
24/16          1/4         12/18    sk=0.00, kr=0.00  1.000  0.094  0.117  0.399  0.117
24/16          1/1         12/18    sk=1.00, kr=3.00  1.000  0.100  0.123  0.375  0.123
24/16          1/2         12/18    sk=1.00, kr=3.00  1.000  0.096  0.114  0.386  0.114
24/16          1/4         12/18    sk=1.00, kr=3.00  1.000  0.102  0.123  0.406  0.123
24/16          1/1         12/18    sk=2.00, kr=6.00  1.000  0.102  0.119  0.393  0.119
24/16          1/2         12/18    sk=2.00, kr=6.00  1.000  0.103  0.120  0.391  0.120
24/16          1/4         12/18    sk=2.00, kr=6.00  1.000  0.105  0.127  0.410  0.127
100/100        1/1         12/18    sk=0.00, kr=0.00  1.000  0.103  0.122  0.648  0.122
100/100        1/2         12/18    sk=0.00, kr=0.00  1.000  0.099  0.124  0.662  0.124
100/100        1/4         12/18    sk=0.00, kr=0.00  1.000  0.099  0.120  0.660  0.120
100/100        1/1         12/18    sk=1.00, kr=3.00  1.000  0.105  0.126  0.662  0.126
100/100        1/2         12/18    sk=1.00, kr=3.00  1.000  0.096  0.118  0.659  0.118
100/100        1/4         12/18    sk=1.00, kr=3.00  1.000  0.097  0.121  0.662  0.121
100/100        1/1         12/18    sk=2.00, kr=6.00  1.000  0.095  0.117  0.680  0.117
100/100        1/2         12/18    sk=2.00, kr=6.00  1.000  0.099  0.124  0.663  0.124
100/100        1/4         12/18    sk=2.00, kr=6.00  1.000  0.100  0.121  0.672  0.121
80/120         1/1         12/18    sk=0.00, kr=0.00  1.000  0.101  0.122  0.649  0.122
80/120         1/2         12/18    sk=0.00, kr=0.00  1.000  0.104  0.126  0.648  0.126
80/120         1/4         12/18    sk=0.00, kr=0.00  1.000  0.096  0.113  0.643  0.113
80/120         1/1         12/18    sk=1.00, kr=3.00  1.000  0.096  0.119  0.663  0.119
80/120         1/2         12/18    sk=1.00, kr=3.00  1.000  0.100  0.122  0.641  0.122
80/120         1/4         12/18    sk=1.00, kr=3.00  1.000  0.102  0.122  0.636  0.122
80/120         1/1         12/18    sk=2.00, kr=6.00  1.000  0.103  0.126  0.664  0.126
80/120         1/2         12/18    sk=2.00, kr=6.00  1.000  0.103  0.127  0.651  0.127
80/120         1/4         12/18    sk=2.00, kr=6.00  1.000  0.105  0.128  0.647  0.128
120/80         1/1         12/18    sk=0.00, kr=0.00  1.000  0.099  0.121  0.657  0.121
120/80         1/2         12/18    sk=0.00, kr=0.00  1.000  0.095  0.118  0.666  0.118
120/80         1/4         12/18    sk=0.00, kr=0.00  1.000  0.098  0.120  0.674  0.120
120/80         1/1         12/18    sk=1.00, kr=3.00  1.000  0.101  0.123  0.666  0.123
120/80         1/2         12/18    sk=1.00, kr=3.00  1.000  0.104  0.132  0.669  0.132
120/80         1/4         12/18    sk=1.00, kr=3.00  1.000  0.102  0.123  0.678  0.123
120/80         1/1         12/18    sk=2.00, kr=6.00  1.000  0.097  0.123  0.656  0.123
120/80         1/2         12/18    sk=2.00, kr=6.00  1.000  0.101  0.123  0.668  0.123
120/80         1/4         12/18    sk=2.00, kr=6.00  1.000  0.097  0.122  0.674  0.122

Table 25. Type I Error Rate Estimates (τ² = 1, δ = 0) at α = .05 for K = 10, N = 5000

Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded cells signify those conditions in which the error rate fell within Bradley's criterion. *Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure.

As heterogeneity of effects (τ²) increased from .33 to 1, Type I error rates rose steadily for all tests except permuted Q, so these tests were not robust.

Primary study  Population  N of     Population
sample sizes   variances   studies  shape             Reg Q  PQb    RE     FE     CR
5/5            1/1         4/6      sk=0.00, kr=0.00  0.697  0.052  0.101  0.176  0.102
5/5            1/2         4/6      sk=0.00, kr=0.00  0.710  0.048  0.097  0.175  0.098
5/5            1/4         4/6      sk=0.00, kr=0.00  0.744  0.049  0.102  0.184  0.103
5/5            1/1         4/6      sk=1.00, kr=3.00  0.753  0.050  0.105  0.190  0.107
5/5            1/2         4/6      sk=1.00, kr=3.00  0.753  0.051  0.106  0.188  0.107
5/5            1/4         4/6      sk=1.00, kr=3.00  0.778  0.041  0.094  0.195  0.095
5/5            1/1         4/6      sk=2.00, kr=6.00  0.813  0.044  0.097  0.203  0.097
5/5            1/2         4/6      sk=2.00, kr=6.00  0.826  0.052  0.106  0.207  0.106
5/5            1/4         4/6      sk=2.00, kr=6.00  0.835  0.052  0.107  0.224  0.107
4/6            1/1         4/6      sk=0.00, kr=0.00  0.679  0.046  0.095  0.169  0.097
4/6            1/2         4/6      sk=0.00, kr=0.00  0.632  0.048  0.091  0.157  0.093
4/6            1/4         4/6      sk=0.00, kr=0.00  0.600  0.054  0.097  0.161  0.098
4/6            1/1         4/6      sk=1.00, kr=3.00  0.729  0.049  0.095  0.183  0.096
4/6            1/2         4/6      sk=1.00, kr=3.00  0.690  0.048  0.097  0.176  0.099
4/6            1/4         4/6      sk=1.00, kr=3.00  0.664  0.049  0.098  0.173  0.099
4/6            1/1         4/6      sk=2.00, kr=6.00  0.800  0.049  0.105  0.209  0.106
4/6            1/2         4/6      sk=2.00, kr=6.00  0.765  0.052  0.100  0.195  0.101
4/6            1/4         4/6      sk=2.00, kr=6.00  0.750  0.056  0.106  0.190  0.107
6/4            1/1         4/6      sk=0.00, kr=0.00  0.681  0.050  0.101  0.173  0.102
6/4            1/2         4/6      sk=0.00, kr=0.00  0.758  0.045  0.098  0.182  0.099
6/4            1/4         4/6      sk=0.00, kr=0.00  0.817  0.047  0.100  0.198  0.102
6/4            1/1         4/6      sk=1.00, kr=3.00  0.736  0.053  0.104  0.189  0.106
6/4            1/2         4/6      sk=1.00, kr=3.00  0.800  0.047  0.099  0.194  0.099
6/4            1/4         4/6      sk=1.00, kr=3.00  0.861  0.053  0.111  0.230  0.112
6/4            1/1         4/6      sk=2.00, kr=6.00  0.796  0.052  0.101  0.196  0.102
6/4            1/2         4/6      sk=2.00, kr=6.00  0.848  0.049  0.110  0.216  0.111
6/4            1/4         4/6      sk=2.00, kr=6.00  0.888  0.050  0.107  0.242  0.107
20/20          1/1         4/6      sk=0.00, kr=0.00  0.994  0.054  0.122  0.473  0.122
20/20          1/2         4/6      sk=0.00, kr=0.00  0.995  0.056  0.115  0.464  0.115
20/20          1/4         4/6      sk=0.00, kr=0.00  0.997  0.050  0.113  0.470  0.114
20/20          1/1         4/6      sk=1.00, kr=3.00  0.997  0.049  0.112  0.460  0.112
20/20          1/2         4/6      sk=1.00, kr=3.00  0.997  0.056  0.117  0.478  0.117
20/20          1/4         4/6      sk=1.00, kr=3.00  0.996  0.052  0.122  0.484  0.122
20/20          1/1         4/6      sk=2.00, kr=6.00  0.997  0.051  0.111  0.479  0.111
20/20          1/2         4/6      sk=2.00, kr=6.00  0.996  0.048  0.110  0.487  0.110
20/20          1/4         4/6      sk=2.00, kr=6.00  0.997  0.051  0.116  0.482  0.116
16/24          1/1         4/6      sk=0.00, kr=0.00  0.994  0.052  0.120  0.473  0.120
16/24          1/2         4/6      sk=0.00, kr=0.00  0.993  0.049  0.112  0.452  0.112
16/24          1/4         4/6      sk=0.00, kr=0.00  0.991  0.048  0.114  0.441  0.114
16/24          1/1         4/6      sk=1.00, kr=3.00  0.996  0.046  0.116  0.467  0.116
16/24          1/2         4/6      sk=1.00, kr=3.00  0.996  0.050  0.117  0.463  0.117
16/24          1/4         4/6      sk=1.00, kr=3.00  0.994  0.049  0.114  0.474  0.114
16/24          1/1         4/6      sk=2.00, kr=6.00  0.995  0.051  0.109  0.468  0.109
16/24          1/2         4/6      sk=2.00, kr=6.00  0.996  0.054  0.122  0.482  0.122
16/24          1/4         4/6      sk=2.00, kr=6.00  0.994  0.050  0.119  0.469  0.119

Table 25 (continued). Type I Error Rate Estimates (τ² = 1, δ = 0) at α = .05 for K = 10, N = 5000

Note: All unshaded areas for each of the tests reflect either inflated or conservative Type I error. Shaded cells signify those conditions in which the error rate fell within Bradley's criterion. *Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure.

As the primary study sample size increased to 40 and above (see Table 25), all tests except permuted Q continued to show inflated Type I error rates. Permuted Q maintained robustness at the larger sample sizes with K = 10.
Primary study  Population  N of     Population
sample sizes   variances   studies  shape             Reg Q  PQb    RE     FE     CR
24/16          1/1         4/6      sk=0.00, kr=0.00  0.996  0.052  0.120  0.457  0.120
24/16          1/2         4/6      sk=0.00, kr=0.00  0.995  0.051  0.114  0.474  0.114
24/16          1/4         4/6      sk=0.00, kr=0.00  0.998  0.049  0.117  0.484  0.117
24/16          1/1         4/6      sk=1.00, kr=3.00  0.995  0.050  0.113  0.462  0.113
24/16          1/2         4/6      sk=1.00, kr=3.00  0.996  0.052  0.117  0.491  0.117
24/16          1/4         4/6      sk=1.00, kr=3.00  0.997  0.054  0.119  0.495  0.119
24/16          1/1         4/6      sk=2.00, kr=6.00  0.995  0.050  0.114  0.467  0.114
24/16          1/2         4/6      sk=2.00, kr=6.00  0.997  0.054  0.114  0.489  0.114
24/16          1/4         4/6      sk=2.00, kr=6.00  0.997  0.057  0.124  0.509  0.125
100/100        1/1         4/6      sk=0.00, kr=0.00  1.000  0.050  0.117  0.736  0.117
100/100        1/2         4/6      sk=0.00, kr=0.00  1.000  0.049  0.117  0.738  0.117
100/100        1/4         4/6      sk=0.00, kr=0.00  1.000  0.048  0.110  0.752  0.110
100/100        1/1         4/6      sk=1.00, kr=3.00  1.000  0.052  0.122  0.745  0.122
100/100        1/2         4/6      sk=1.00, kr=3.00  1.000  0.044  0.118  0.734  0.118
100/100        1/4         4/6      sk=1.00, kr=3.00  1.000  0.048  0.115  0.744  0.116
100/100        1/1         4/6      sk=2.00, kr=6.00  1.000  0.049  0.116  0.752  0.116
100/100        1/2         4/6      sk=2.00, kr=6.00  1.000  0.050  0.116  0.741  0.116
100/100        1/4         4/6      sk=2.00, kr=6.00  1.000  0.047  0.114  0.732  0.114
80/120         1/1         4/6      sk=0.00, kr=0.00  1.000  0.054  0.120  0.735  0.120
80/120         1/2         4/6      sk=0.00, kr=0.00  1.000  0.053  0.116  0.725  0.116
80/120         1/4         4/6      sk=0.00, kr=0.00  1.000  0.049  0.114  0.725  0.114
80/120         1/1         4/6      sk=1.00, kr=3.00  1.000  0.043  0.107  0.733  0.107
80/120         1/2         4/6      sk=1.00, kr=3.00  1.000  0.049  0.124  0.731  0.124
80/120         1/4         4/6      sk=1.00, kr=3.00  1.000  0.051  0.119  0.728  0.119
80/120         1/1         4/6      sk=2.00, kr=6.00  1.000  0.048  0.120  0.739  0.120
80/120         1/2         4/6      sk=2.00, kr=6.00  1.000  0.052  0.114  0.730  0.114
80/120         1/4         4/6      sk=2.00, kr=6.00  1.000  0.055  0.117  0.727  0.117
120/80         1/1         4/6      sk=0.00, kr=0.00  1.000  0.050  0.113  0.735  0.113
120/80         1/2         4/6      sk=0.00, kr=0.00  1.000  0.051  0.121  0.745  0.121
120/80         1/4         4/6      sk=0.00, kr=0.00  1.000  0.048  0.113  0.755  0.113
120/80         1/1         4/6      sk=1.00, kr=3.00  1.000  0.048  0.114  0.742  0.114
120/80         1/2         4/6      sk=1.00, kr=3.00  1.000  0.041  0.107  0.754  0.107
120/80         1/4         4/6      sk=1.00, kr=3.00  1.000  0.046  0.113  0.746  0.113
120/80         1/1         4/6      sk=2.00, kr=6.00  1.000  0.054  0.118  0.738  0.118
120/80         1/2         4/6      sk=2.00, kr=6.00  1.000  0.051  0.118  0.742  0.118
120/801/44/6sk=2.00, kr=6.001.0000.0460.1160.7660.116 PAGE 153 143 Table 26 Type I Error Rate Estimates ( 2 =1, =.8) at =.05 for K=10, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded cells signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure Though limited in overall effectiveness, permuted Q maintained adequate robustness over a majority of the conditions (see Table 26). All other tests did not maintain Type I error control, making those tests ineffective as well. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/14/6sk=0.00, kr=0.000.6700.0480.0970.1750.098 5/51/24/6sk=0.00, kr=0.000.6920.0440.0990.1820.101 5/51/44/6sk=0.00, kr=0.000.7120.0590.1150.2030.117 5/51/14/6sk=1.00, kr=3.000.7350.0500.1040.1900.104 5/51/24/6sk=1.00, kr=3.000.7370.0520.1010.1940.102 5/51/44/6sk=1.00, kr=3.000.7560.0510.1050.1980.106 5/51/14/6sk=2.00, kr=6.000.7960.0480.1060.2070.107 5/51/24/6sk=2.00, kr=6.000.7840.0490.1080.2080.109 5/51/44/6sk=2.00, kr=6.000.7760.0520.1050.2070.106 4/61/14/6sk=0.00, kr=0.000.6650.0480.0990.1810.101 4/61/24/6sk=0.00, kr=0.000.6150.0480.0950.1610.097 4/61/44/6sk=0.00, kr=0.000.5950.0530.0970.1600.099 4/61/14/6sk=1.00, kr=3.000.7280.0520.0990.1850.101 4/61/24/6sk=1.00, kr=3.000.6690.0500.0980.1770.100 4/61/44/6sk=1.00, kr=3.000.6380.0510.0980.1690.099 4/61/14/6sk=2.00, kr=6.000.7850.0520.1030.1970.105 4/61/24/6sk=2.00, kr=6.000.7410.0480.0940.1840.095 4/61/44/6sk=2.00, kr=6.000.7020.0510.1000.1950.100 6/41/14/6sk=0.00, kr=0.000.6670.0480.1000.1770.102 6/41/24/6sk=0.00, kr=0.000.7360.0500.1020.1900.103 6/41/44/6sk=0.00, kr=0.000.8000.0560.1130.2220.115 6/41/14/6sk=1.00, kr=3.000.7350.0490.1030.1900.105 6/41/24/6sk=1.00, kr=3.000.7730.0540.1080.2090.110 6/41/44/6sk=1.00, 
kr=3.000.8240.0510.1060.2210.107 6/41/14/6sk=2.00, kr=6.000.7940.0500.1050.2070.106 6/41/24/6sk=2.00, kr=6.000.8160.0490.1000.2080.101 6/41/44/6sk=2.00, kr=6.000.8460.0460.0940.2210.095 20/201/14/6sk=0.00, kr=0.000.9960.0450.1090.4660.109 20/201/24/6sk=0.00, kr=0.000.9950.0430.1120.4650.112 20/201/44/6sk=0.00, kr=0.000.9960.0430.1120.4730.112 20/201/14/6sk=1.00, kr=3.000.9960.0400.1050.4740.105 20/201/24/6sk=1.00, kr=3.000.9940.0430.1100.4850.110 20/201/44/6sk=1.00, kr=3.000.9950.0530.1180.4830.118 20/201/14/6sk=2.00, kr=6.000.9950.0510.1200.4810.120 20/201/24/6sk=2.00, kr=6.000.9950.0520.1170.4810.117 20/201/44/6sk=2.00, kr=6.000.9960.0560.1180.4800.118 16/241/14/6sk=0.00, kr=0.000.9940.0510.1120.4600.112 16/241/24/6sk=0.00, kr=0.000.9920.0500.1180.4560.118 16/241/44/6sk=0.00, kr=0.000.9910.0490.1130.4350.113 16/241/14/6sk=1.00, kr=3.000.9950.0510.1130.4780.113 16/241/24/6sk=1.00, kr=3.000.9910.0520.1140.4460.115 16/241/44/6sk=1.00, kr=3.000.9900.0570.1120.4540.112 16/241/14/6sk=2.00, kr=6.000.9960.0440.1080.4710.109 16/241/24/6sk=2.00, kr=6.000.9940.0490.1140.4560.115 16/241/44/6sk=2.00, kr=6.000.9910.0520.1210.4640.121 PAGE 154 144 Table 26 (continued) Type I Error Rate Estimates ( 2 =1, =.8) at =.05 for K=10, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I error. Shaded cells signify those conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure Permuted Qs robustness remained intact with increasing sample sizes (see Table 26 above). Therefore, permuted Q appeared to be robust to multiple violations of assumptions. All other tests projected inflated Type I error rates. 
Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 24/161/14/6sk=0.00, kr=0.000.9950.0480.1100.4660.110 24/161/24/6sk=0.00, kr=0.000.9940.0550.1160.4750.116 24/161/44/6sk=0.00, kr=0.000.9970.0480.1190.4900.119 24/161/14/6sk=1.00, kr=3.000.9950.0530.1190.4680.119 24/161/24/6sk=1.00, kr=3.000.9950.0460.1080.4750.108 24/161/44/6sk=1.00, kr=3.000.9960.0480.1130.4850.114 24/161/14/6sk=2.00, kr=6.000.9950.0530.1210.4800.121 24/161/24/6sk=2.00, kr=6.000.9950.0530.1180.4940.118 24/161/44/6sk=2.00, kr=6.000.9970.0490.1130.5030.113 100/1001/14/6sk=0.00, kr=0.001.0000.0480.1210.7380.121 100/1001/24/6sk=0.00, kr=0.001.0000.0490.1140.7410.114 100/1001/44/6sk=0.00, kr=0.001.0000.0480.1120.7450.112 100/1001/14/6sk=1.00, kr=3.001.0000.0480.1130.7430.113 100/1001/24/6sk=1.00, kr=3.001.0000.0510.1180.7380.118 100/1001/44/6sk=1.00, kr=3.001.0000.0460.1150.7360.115 100/1001/14/6sk=2.00, kr=6.001.0000.0460.1070.7270.107 100/1001/24/6sk=2.00, kr=6.001.0000.0500.1100.7380.110 100/1001/44/6sk=2.00, kr=6.001.0000.0480.1170.7430.117 80/1201/14/6sk=0.00, kr=0.001.0000.0530.1170.7210.117 80/1201/24/6sk=0.00, kr=0.001.0000.0490.1180.7270.118 80/1201/44/6sk=0.00, kr=0.001.0000.0490.1250.7300.125 80/1201/14/6sk=1.00, kr=3.001.0000.0500.1210.7300.121 80/1201/24/6sk=1.00, kr=3.001.0000.0480.1160.7360.116 80/1201/44/6sk=1.00, kr=3.001.0000.0460.1170.7230.117 80/1201/14/6sk=2.00, kr=6.001.0000.0460.1170.7330.117 80/1201/24/6sk=2.00, kr=6.001.0000.0480.1180.7330.118 80/1201/44/6sk=2.00, kr=6.001.0000.0530.1190.7310.119 120/801/14/6sk=0.00, kr=0.001.0000.0490.1110.7280.111 120/801/24/6sk=0.00, kr=0.001.0000.0530.1140.7430.114 120/801/44/6sk=0.00, kr=0.001.0000.0550.1230.7490.123 120/801/14/6sk=1.00, kr=3.001.0000.0520.1130.7320.113 120/801/24/6sk=1.00, kr=3.001.0000.0530.1150.7360.115 120/801/44/6sk=1.00, kr=3.001.0000.0500.1110.7410.111 120/801/14/6sk=2.00, kr=6.001.0000.0540.1190.7350.119 120/801/24/6sk=2.00, kr=6.001.0000.0500.1210.7470.121 
120/801/44/6sk=2.00, kr=6.001.0000.0450.1090.7520.109 PAGE 155 145 Table 27 Type I Error Rate Estimates ( 2 =1, =0) at =.05 for K=30, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded cells signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure Only permuted Q maintained adequate Type I error control under increasing heterogeneity of effects. Again, permuted Qs robustness was demonstr ated across most all conditions under investigation. Although the RE and CR tests error rates were not as inflated as those of regular Q and the FE test, the former two tests still presented error rates exceeding nominal alpha. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/112/18sk=0.00, kr=0.000.9770.0510.0670.1640.067 5/51/212/18sk=0.00, kr=0.000.9790.0450.0610.1650.061 5/51/412/18sk=0.00, kr=0.000.9820.0460.0620.1710.062 5/51/112/18sk=1.00, kr=3.000.9860.0520.0700.1830.070 5/51/212/18sk=1.00, kr=3.000.9870.0490.0680.1880.068 5/51/412/18sk=1.00, kr=3.000.9900.0490.0670.1820.067 5/51/112/18sk=2.00, kr=6.000.9940.0520.0710.1980.071 5/51/212/18sk=2.00, kr=6.000.9950.0480.0650.2010.065 5/51/412/18sk=2.00, kr=6.000.9960.0520.0680.2120.068 4/61/112/18sk=0.00, kr=0.000.9700.0460.0620.1630.062 4/61/212/18sk=0.00, kr=0.000.9540.0480.0680.1520.068 4/61/412/18sk=0.00, kr=0.000.9350.0500.0680.1540.068 4/61/112/18sk=1.00, kr=3.000.9850.0540.0700.1810.070 4/61/212/18sk=1.00, kr=3.000.9740.0520.0700.1730.070 4/61/412/18sk=1.00, kr=3.000.9680.0470.0610.1630.061 4/61/112/18sk=2.00, kr=6.000.9930.0540.0730.1960.073 4/61/212/18sk=2.00, kr=6.000.9890.0470.0650.1850.066 4/61/412/18sk=2.00, kr=6.000.9850.0490.0680.1860.068 6/41/112/18sk=0.00, kr=0.000.9720.0520.0690.1680.069 6/41/212/18sk=0.00, kr=0.000.9860.0470.0640.1760.064 6/41/412/18sk=0.00, 
kr=0.000.9950.0480.0630.2000.063 6/41/112/18sk=1.00, kr=3.000.9830.0490.0640.1800.064 6/41/212/18sk=1.00, kr=3.000.9930.0520.0700.1940.070 6/41/412/18sk=1.00, kr=3.000.9980.0510.0670.2130.067 6/41/112/18sk=2.00, kr=6.000.9950.0500.0700.1930.070 6/41/212/18sk=2.00, kr=6.000.9990.0500.0640.2060.064 6/41/412/18sk=2.00, kr=6.001.0000.0510.0660.2260.066 20/201/112/18sk=0.00, kr=0.001.0000.0520.0710.4780.071 20/201/212/18sk=0.00, kr=0.001.0000.0540.0710.4550.071 20/201/412/18sk=0.00, kr=0.001.0000.0490.0640.4650.064 20/201/112/18sk=1.00, kr=3.001.0000.0480.0680.4610.068 20/201/212/18sk=1.00, kr=3.001.0000.0520.0700.4650.070 20/201/412/18sk=1.00, kr=3.001.0000.0500.0660.4810.066 20/201/112/18sk=2.00, kr=6.001.0000.0490.0680.4630.068 20/201/212/18sk=2.00, kr=6.001.0000.0520.0730.4700.073 20/201/412/18sk=2.00, kr=6.001.0000.0480.0650.4680.065 16/241/112/18sk=0.00, kr=0.001.0000.0590.0800.4560.080 16/241/212/18sk=0.00, kr=0.001.0000.0530.0680.4430.068 16/241/412/18sk=0.00, kr=0.001.0000.0500.0680.4360.068 16/241/112/18sk=1.00, kr=3.001.0000.0490.0650.4650.065 16/241/212/18sk=1.00, kr=3.001.0000.0470.0670.4490.067 16/241/412/18sk=1.00, kr=3.001.0000.0560.0730.4490.073 16/241/112/18sk=2.00, kr=6.001.0000.0460.0650.4700.065 16/241/212/18sk=2.00, kr=6.001.0000.0440.0610.4430.061 16/241/412/18sk=2.00, kr=6.001.0000.0510.0700.4540.070 PAGE 156 146 Table 27 (continued) Type I Error Rate Estimates ( 2 =1, =0) at =.05 for K=30, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded cells signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure The Type I error control of the regular Q, RE, FE and CR tests did not improve as sample sizes increased to 40 and greater (see Table 27 above). Perm uted Q maintained adequate Type I error control in these conditions. 
Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 24/161/112/18sk=0.00, kr=0.001.0000.0510.0690.4580.069 24/161/212/18sk=0.00, kr=0.001.0000.0530.0730.4820.073 24/161/412/18sk=0.00, kr=0.001.0000.0500.0670.4700.067 24/161/112/18sk=1.00, kr=3.001.0000.0460.0620.4620.062 24/161/212/18sk=1.00, kr=3.001.0000.0510.0700.4810.070 24/161/412/18sk=1.00, kr=3.001.0000.0500.0660.4790.066 24/161/112/18sk=2.00, kr=6.001.0000.0490.0650.4690.065 24/161/212/18sk=2.00, kr=6.001.0000.0530.0710.4780.071 24/161/412/18sk=2.00, kr=6.001.0000.0470.0680.4870.068 100/1001/112/18sk=0.00, kr=0.001.0000.0470.0670.7370.067 100/1001/212/18sk=0.00, kr=0.001.0000.0520.0690.7290.069 100/1001/412/18sk=0.00, kr=0.001.0000.0460.0640.7380.064 100/1001/112/18sk=1.00, kr=3.001.0000.0470.0630.7340.063 100/1001/212/18sk=1.00, kr=3.001.0000.0470.0660.7430.066 100/1001/412/18sk=1.00, kr=3.001.0000.0500.0660.7280.066 100/1001/112/18sk=2.00, kr=6.001.0000.0470.0670.7340.067 100/1001/212/18sk=2.00, kr=6.001.0000.0470.0620.7330.062 100/1001/412/18sk=2.00, kr=6.001.0000.0490.0680.7330.068 80/1201/112/18sk=0.00, kr=0.001.0000.0520.0710.7270.071 80/1201/212/18sk=0.00, kr=0.001.0000.0430.0640.7270.064 80/1201/412/18sk=0.00, kr=0.001.0000.0500.0660.7170.066 80/1201/112/18sk=1.00, kr=3.001.0000.0470.0650.7280.065 80/1201/212/18sk=1.00, kr=3.001.0000.0490.0720.7310.072 80/1201/412/18sk=1.00, kr=3.001.0000.0460.0660.7220.066 80/1201/112/18sk=2.00, kr=6.001.0000.0480.0650.7410.065 80/1201/212/18sk=2.00, kr=6.001.0000.0520.0690.7250.069 80/1201/412/18sk=2.00, kr=6.001.0000.0450.0640.7040.064 120/801/112/18sk=0.00, kr=0.001.0000.0510.0690.7340.069 120/801/212/18sk=0.00, kr=0.001.0000.0520.0660.7350.066 120/801/412/18sk=0.00, kr=0.001.0000.0570.0750.7520.075 120/801/112/18sk=1.00, kr=3.001.0000.0500.0670.7430.067 120/801/212/18sk=1.00, kr=3.001.0000.0540.0780.7330.078 120/801/412/18sk=1.00, kr=3.001.0000.0500.0690.7490.069 120/801/112/18sk=2.00, kr=6.001.0000.0480.0680.7350.068 
120/801/212/18sk=2.00, kr=6.001.0000.0430.0630.7320.063 120/801/412/18sk=2.00, kr=6.001.0000.0480.0640.7440.064 PAGE 157 147 Table 28 Type I Error Rate Estimates ( 2 =1, =.8) at =.05 for K=30, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded cells signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure Again, permuted Q maintained adequate Type I error control. However, none of the tests offered any degree of effectiveness as heterogeneity of effects increased to 1 (see Table 28 above). Type I error rates of the other tests remained elevated. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/112/18sk=0.00, kr=0.000.9670.0550.0730.1740.074 5/51/212/18sk=0.00, kr=0.000.9700.0530.0720.1770.073 5/51/412/18sk=0.00, kr=0.000.9780.0540.0720.1810.073 5/51/112/18sk=1.00, kr=3.000.9830.0450.0610.1790.062 5/51/212/18sk=1.00, kr=3.000.9810.0490.0690.1780.069 5/51/412/18sk=1.00, kr=3.000.9870.0510.0660.1910.066 5/51/112/18sk=2.00, kr=6.000.9940.0500.0640.1910.064 5/51/212/18sk=2.00, kr=6.000.9910.0520.0660.1970.066 5/51/412/18sk=2.00, kr=6.000.9930.0490.0630.1970.063 4/61/112/18sk=0.00, kr=0.000.9690.0480.0630.1700.064 4/61/212/18sk=0.00, kr=0.000.9430.0510.0660.1560.067 4/61/412/18sk=0.00, kr=0.000.9320.0520.0710.1580.071 4/61/112/18sk=1.00, kr=3.000.9800.0530.0680.1840.068 4/61/212/18sk=1.00, kr=3.000.9620.0580.0760.1740.076 4/61/412/18sk=1.00, kr=3.000.9530.0470.0640.1560.064 4/61/112/18sk=2.00, kr=6.000.9920.0540.0680.1950.068 4/61/212/18sk=2.00, kr=6.000.9800.0470.0650.1750.065 4/61/412/18sk=2.00, kr=6.000.9710.0550.0700.1710.070 6/41/112/18sk=0.00, kr=0.000.9640.0470.0660.1650.066 6/41/212/18sk=0.00, kr=0.000.9850.0550.0710.1960.071 6/41/412/18sk=0.00, kr=0.000.9930.0490.0680.2050.068 6/41/112/18sk=1.00, 
kr=3.000.9800.0450.0610.1710.062 6/41/212/18sk=1.00, kr=3.000.9900.0490.0690.1900.069 6/41/412/18sk=1.00, kr=3.000.9960.0490.0640.1940.064 6/41/112/18sk=2.00, kr=6.000.9940.0490.0670.2010.067 6/41/212/18sk=2.00, kr=6.000.9980.0500.0660.2090.066 6/41/412/18sk=2.00, kr=6.000.9970.0570.0720.2260.072 20/201/112/18sk=0.00, kr=0.001.0000.0500.0690.4650.069 20/201/212/18sk=0.00, kr=0.001.0000.0470.0650.4650.065 20/201/412/18sk=0.00, kr=0.001.0000.0510.0670.4470.067 20/201/112/18sk=1.00, kr=3.001.0000.0540.0670.4660.067 20/201/212/18sk=1.00, kr=3.001.0000.0510.0680.4550.068 20/201/412/18sk=1.00, kr=3.001.0000.0490.0680.4580.068 20/201/112/18sk=2.00, kr=6.001.0000.0500.0680.4850.068 20/201/212/18sk=2.00, kr=6.001.0000.0530.0680.4810.068 20/201/412/18sk=2.00, kr=6.001.0000.0500.0690.4630.069 16/241/112/18sk=0.00, kr=0.001.0000.0510.0710.4540.071 16/241/212/18sk=0.00, kr=0.001.0000.0510.0690.4590.069 16/241/412/18sk=0.00, kr=0.001.0000.0580.0770.4330.077 16/241/112/18sk=1.00, kr=3.001.0000.0540.0760.4750.076 16/241/212/18sk=1.00, kr=3.001.0000.0440.0670.4620.067 16/241/412/18sk=1.00, kr=3.001.0000.0470.0690.4360.069 16/241/112/18sk=2.00, kr=6.001.0000.0500.0690.4780.069 16/241/212/18sk=2.00, kr=6.001.0000.0500.0700.4620.070 16/241/412/18sk=2.00, kr=6.001.0000.0530.0670.4540.067 PAGE 158 148 Table 28 (continued) Type I Error Rate Estimates ( 2 =1, =.8) at =.05 for K=30, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded cells signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure Increases in sample size to 40 and above did not improve the Type I error control of any of the tests, aside from permuted Q (see Table 28 above). 
Due to the continued inflation of Type I error rates, it can be concluded that, under increasing heterogeneity of effects, application of the tests other than permuted Q would not be warranted.

Sample sizes  Variances  N of studies  Population shape  Reg Q  PQb    RE     FE     CR
24/16         1/1        12/18         sk=0.00, kr=0.00  1.000  0.046  0.063  0.454  0.063
24/16         1/2        12/18         sk=0.00, kr=0.00  1.000  0.058  0.077  0.491  0.077
24/16         1/4        12/18         sk=0.00, kr=0.00  1.000  0.046  0.061  0.478  0.061
24/16         1/1        12/18         sk=1.00, kr=3.00  1.000  0.054  0.074  0.461  0.074
24/16         1/2        12/18         sk=1.00, kr=3.00  1.000  0.050  0.070  0.474  0.070
24/16         1/4        12/18         sk=1.00, kr=3.00  1.000  0.055  0.076  0.499  0.076
24/16         1/1        12/18         sk=2.00, kr=6.00  1.000  0.049  0.067  0.471  0.067
24/16         1/2        12/18         sk=2.00, kr=6.00  1.000  0.048  0.066  0.489  0.066
24/16         1/4        12/18         sk=2.00, kr=6.00  1.000  0.048  0.069  0.484  0.069
100/100       1/1        12/18         sk=0.00, kr=0.00  1.000  0.048  0.065  0.736  0.065
100/100       1/2        12/18         sk=0.00, kr=0.00  1.000  0.054  0.069  0.744  0.069
100/100       1/4        12/18         sk=0.00, kr=0.00  1.000  0.049  0.067  0.738  0.067
100/100       1/1        12/18         sk=1.00, kr=3.00  1.000  0.044  0.063  0.746  0.063
100/100       1/2        12/18         sk=1.00, kr=3.00  1.000  0.053  0.076  0.742  0.076
100/100       1/4        12/18         sk=1.00, kr=3.00  1.000  0.047  0.066  0.742  0.066
100/100       1/1        12/18         sk=2.00, kr=6.00  1.000  0.048  0.068  0.726  0.068
100/100       1/2        12/18         sk=2.00, kr=6.00  1.000  0.047  0.066  0.741  0.066
100/100       1/4        12/18         sk=2.00, kr=6.00  1.000  0.052  0.069  0.734  0.069
80/120        1/1        12/18         sk=0.00, kr=0.00  1.000  0.050  0.072  0.720  0.072
80/120        1/2        12/18         sk=0.00, kr=0.00  1.000  0.049  0.068  0.725  0.068
80/120        1/4        12/18         sk=0.00, kr=0.00  1.000  0.050  0.071  0.720  0.071
80/120        1/1        12/18         sk=1.00, kr=3.00  1.000  0.049  0.064  0.744  0.064
80/120        1/2        12/18         sk=1.00, kr=3.00  1.000  0.052  0.071  0.731  0.071
80/120        1/4        12/18         sk=1.00, kr=3.00  1.000  0.052  0.072  0.726  0.072
80/120        1/1        12/18         sk=2.00, kr=6.00  1.000  0.045  0.064  0.742  0.064
80/120        1/2        12/18         sk=2.00, kr=6.00  1.000  0.051  0.069  0.720  0.069
80/120        1/4        12/18         sk=2.00, kr=6.00  1.000  0.052  0.071  0.726  0.071
120/80        1/1        12/18         sk=0.00, kr=0.00  1.000  0.050  0.066  0.726  0.066
120/80        1/2        12/18         sk=0.00, kr=0.00  1.000  0.051  0.070  0.745  0.070
120/80        1/4        12/18         sk=0.00, kr=0.00  1.000  0.053  0.070  0.746  0.070
120/80        1/1        12/18         sk=1.00,
kr=3.001.0000.0480.0660.7420.066 120/801/212/18sk=1.00, kr=3.001.0000.0510.0700.7260.070 120/801/412/18sk=1.00, kr=3.001.0000.0530.0720.7570.072 120/801/112/18sk=2.00, kr=6.001.0000.0520.0690.7200.069 120/801/212/18sk=2.00, kr=6.001.0000.0520.0710.7440.071 120/801/412/18sk=2.00, kr=6.001.0000.0510.0700.7570.070 PAGE 159 149 Table 29 Type I Error Rate Estimates ( 2 =1, =.8) at =.10 for K=30, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded cells signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure Again, increasing the nominal alpha from .05 to .10 did not provide a substantive improvement to the performance of any of the tests, including permuted Q. And Type I error control was poor for all tests, other than permuted Q (see Table 29 above). Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/112/18sk=0.00, kr=0.000.9840.1130.1310.2540.131 5/51/212/18sk=0.00, kr=0.000.9850.1050.1270.2560.128 5/51/412/18sk=0.00, kr=0.000.9890.1060.1310.2630.131 5/51/112/18sk=1.00, kr=3.000.9920.0970.1140.2600.114 5/51/212/18sk=1.00, kr=3.000.9910.0990.1190.2520.120 5/51/412/18sk=1.00, kr=3.000.9940.1010.1230.2720.124 5/51/112/18sk=2.00, kr=6.000.9980.0980.1180.2690.118 5/51/212/18sk=2.00, kr=6.000.9960.1030.1210.2730.122 5/51/412/18sk=2.00, kr=6.000.9970.1010.1190.2830.119 4/61/112/18sk=0.00, kr=0.000.9860.1040.1270.2490.127 4/61/212/18sk=0.00, kr=0.000.9710.1010.1210.2270.122 4/61/412/18sk=0.00, kr=0.000.9640.1070.1280.2410.129 4/61/112/18sk=1.00, kr=3.000.9910.0990.1190.2590.120 4/61/212/18sk=1.00, kr=3.000.9830.1060.1260.2500.127 4/61/412/18sk=1.00, kr=3.000.9760.0960.1130.2360.114 4/61/112/18sk=2.00, kr=6.000.9970.1010.1210.2830.122 4/61/212/18sk=2.00, kr=6.000.9900.0970.1200.2520.120 4/61/412/18sk=2.00, 
kr=6.000.9830.1030.1190.2470.119 6/41/112/18sk=0.00, kr=0.000.9820.0990.1190.2510.120 6/41/212/18sk=0.00, kr=0.000.9930.1110.1280.2730.128 6/41/412/18sk=0.00, kr=0.000.9970.0990.1250.2830.125 6/41/112/18sk=1.00, kr=3.000.9900.0900.1100.2500.110 6/41/212/18sk=1.00, kr=3.000.9960.1000.1230.2760.123 6/41/412/18sk=1.00, kr=3.000.9980.0950.1120.2780.112 6/41/112/18sk=2.00, kr=6.000.9980.1010.1230.2800.123 6/41/212/18sk=2.00, kr=6.000.9990.1020.1200.2910.120 6/41/412/18sk=2.00, kr=6.000.9980.1110.1250.3000.125 20/201/112/18sk=0.00, kr=0.001.0000.1050.1240.5340.124 20/201/212/18sk=0.00, kr=0.001.0000.0980.1190.5380.119 20/201/412/18sk=0.00, kr=0.001.0000.0990.1150.5210.115 20/201/112/18sk=1.00, kr=3.001.0000.1020.1240.5440.124 20/201/212/18sk=1.00, kr=3.001.0000.1030.1260.5310.126 20/201/412/18sk=1.00, kr=3.001.0000.1010.1240.5320.124 20/201/112/18sk=2.00, kr=6.001.0000.0980.1250.5600.125 20/201/212/18sk=2.00, kr=6.001.0000.0940.1230.5550.123 20/201/412/18sk=2.00, kr=6.001.0000.1020.1220.5400.122 16/241/112/18sk=0.00, kr=0.001.0000.1010.1250.5280.125 16/241/212/18sk=0.00, kr=0.001.0000.0990.1250.5400.125 16/241/412/18sk=0.00, kr=0.001.0000.1080.1270.5100.127 16/241/112/18sk=1.00, kr=3.001.0000.1100.1290.5530.129 16/241/212/18sk=1.00, kr=3.001.0000.0950.1230.5370.123 16/241/412/18sk=1.00, kr=3.001.0000.0990.1180.5120.118 16/241/112/18sk=2.00, kr=6.001.0000.1030.1220.5490.122 16/241/212/18sk=2.00, kr=6.001.0000.1040.1260.5370.126 16/241/412/18sk=2.00, kr=6.001.0000.1020.1230.5280.123 PAGE 160 150 Table 29 (continued) Type I Error Rate Estimates ( 2 =1, =.8) at =.10 for K=30, N=5000 Note: All unshaded areas for each of the tests reflect either infl ated or conservative Type I erro r. Shaded cells signify tho se conditions in which the error rate fell within Bradleys criterion. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure Increasing nominal alpha to .10 did not alter the effectiveness of permuted Q. 
But this test did remain robust under these conditions (see Table 29 above). Despite the increase in sample size, the Type I error rates of all other tests continued to exceed the nominal alpha level. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 24/161/112/18sk=0.00, kr=0.001.0000.0930.1130.5260.113 24/161/212/18sk=0.00, kr=0.001.0000.1100.1350.5620.135 24/161/412/18sk=0.00, kr=0.001.0000.0970.1180.5510.118 24/161/112/18sk=1.00, kr=3.001.0000.1060.1270.5430.127 24/161/212/18sk=1.00, kr=3.001.0000.1070.1280.5530.128 24/161/412/18sk=1.00, kr=3.001.0000.1050.1290.5720.129 24/161/112/18sk=2.00, kr=6.001.0000.1000.1210.5510.121 24/161/212/18sk=2.00, kr=6.001.0000.1000.1220.5620.122 24/161/412/18sk=2.00, kr=6.001.0000.1030.1240.5550.124 100/1001/112/18sk=0.00, kr=0.001.0000.1000.1230.7780.123 100/1001/212/18sk=0.00, kr=0.001.0000.1040.1300.7830.130 100/1001/412/18sk=0.00, kr=0.001.0000.0990.1230.7760.123 100/1001/112/18sk=1.00, kr=3.001.0000.0930.1150.7860.115 100/1001/212/18sk=1.00, kr=3.001.0000.1110.1300.7790.130 100/1001/412/18sk=1.00, kr=3.001.0000.0970.1190.7800.119 100/1001/112/18sk=2.00, kr=6.001.0000.1050.1280.7700.128 100/1001/212/18sk=2.00, kr=6.001.0000.0980.1200.7810.120 100/1001/412/18sk=2.00, kr=6.001.0000.1020.1260.7770.126 80/1201/112/18sk=0.00, kr=0.001.0000.1000.1210.7640.121 80/1201/212/18sk=0.00, kr=0.001.0000.1010.1200.7700.120 80/1201/412/18sk=0.00, kr=0.001.0000.0990.1240.7620.124 80/1201/112/18sk=1.00, kr=3.001.0000.1010.1240.7830.124 80/1201/212/18sk=1.00, kr=3.001.0000.1000.1220.7750.122 80/1201/412/18sk=1.00, kr=3.001.0000.1020.1250.7690.125 80/1201/112/18sk=2.00, kr=6.001.0000.0980.1200.7860.120 80/1201/212/18sk=2.00, kr=6.001.0000.1040.1290.7610.129 80/1201/412/18sk=2.00, kr=6.001.0000.0990.1170.7730.117 120/801/112/18sk=0.00, kr=0.001.0000.1040.1270.7660.127 120/801/212/18sk=0.00, kr=0.001.0000.1020.1240.7800.124 120/801/412/18sk=0.00, kr=0.001.0000.1000.1250.7890.125 120/801/112/18sk=1.00, 
kr=3.00  1.000  0.097  0.118  0.787  0.118
120/80        1/2        12/18         sk=1.00, kr=3.00  1.000  0.103  0.127  0.770  0.127
120/80        1/4        12/18         sk=1.00, kr=3.00  1.000  0.099  0.122  0.793  0.122
120/80        1/1        12/18         sk=2.00, kr=6.00  1.000  0.105  0.120  0.761  0.120
120/80        1/2        12/18         sk=2.00, kr=6.00  1.000  0.100  0.124  0.783  0.124
120/80        1/4        12/18         sk=2.00, kr=6.00  1.000  0.104  0.125  0.796  0.125

Summary of Type I Error Rate Estimates

All of the tests offered some degree of effectiveness when τ² = 0, whether nominal alpha was .05 or .10. Robustness was inhibited when K = 10 and sample sizes were small. Once K was increased to 30, at τ² = 0, all tests evidenced robust performance for several (13 and 19) conditions. The regular Q (14 conditions of robustness), FE (21 conditions of robustness) and CR (19 conditions of robustness) tests produced a concentration of well-maintained Type I error conditions as the sample size of the primary studies increased to 200 under the equal groups condition. As unequal sample sizes were introduced, both of these tests had diminished Type I error control. The RE test was less robust than either of these other two tests when sample sizes increased to 200. The RE test (18 conditions of robustness) demonstrated particular robustness when, at N = 10, 40 and 200, the first group had a larger N than the second.

As an effect size greater than 0 was introduced, the performance of the regular Q and FE tests diminished. The regular Q had 14 conditions with adequate Type I error when the effect size was 0 (see Table 17), as compared to 8 in the present set of conditions (Table 18). The FE test showed less of a decline in performance, from 21 adequate conditions to 17 in the present set. The RE and CR tests generated more conservative Type I error rates than any of the other 3 tests, but the greatest frequency (other than permuted Q) of conditions with adequately controlled Type I error. Again, the RE test performed best when the first group had the larger sample size.
There was a dramatic increase in the number of conditions with inflated Type I error for all tests, except permuted Q, once τ² increased to .33 (see Tables 19 and 20). As sample size increased to 40, the Type I error for each of the 4 tests exhibited another notable increase. This pattern resulted in a total absence of error control by the aforementioned tests. The same pattern of results continued as displayed in Table 19, and the permuted Q retained its robustness in terms of Type I error control when τ² = .33, δ = .8 and K = 10 (Table 20). Again, none of the other tests maintained robustness under increasing heterogeneity of effects.

At δ = .8, τ² = .33, permuted Q remained robust and effective with the increase of K to 30 and upon the increase in sample size to 40 (see Table 23). All other tests failed to provide adequate robustness under these conditions. The RE and CR tests showed minimal improvement to robustness with the increase in effect size to .8. However, at sample sizes below 40, effective use of these tests in the few robust conditions would not be warranted (to be discussed further in the summary on power estimates).

Increasing the nominal alpha level from .05 to .10 did not enhance the performance of any of the five tests being investigated when K = 10 (see Table 21). The permuted Q still maintained adequate Type I error control, though it was not truly effective. No other test maintained adequate Type I error control. As K increased to 30 (see Table 24), the robustness of permuted Q again became integral to its utility.

Only permuted Q maintained adequate Type I error control under increasing heterogeneity of effects at τ² = 1 (see Tables 26 and 27). However, none of the tests offered any degree of effectiveness as heterogeneity of effects increased to 1 (power was lacking). Type I error rates of the other tests remained elevated.
Though the RE and CR tests' error rates were not as inflated as those of regular Q and the FE test, increasing heterogeneity contributed to the erosion of robustness for both tests.

Conditions Wherein Test Simulations Demonstrate Both Adequate Type I Error and Good Power

It is not enough for a test to maintain adequate Type I error control. Without good power under given conditions, a test will fail to detect true differences when they occur. Failure to detect these true differences renders a test ineffective in the ultimate task of identifying true variance between groups. For example, it would be important for a school administrator to know the effectiveness of a reading program for various groups of students in order to evaluate the usefulness of that program before investing future resources.

In order to provide a more comprehensive analysis of the utility of the five tests being investigated, two sets of power tables are presented. Both lend a picture of the degree of specific effectiveness for each test. Taken together, they will assist the practitioner in selecting the statistics to be applied in meta-analytic research.

Power Estimates for Conditions Indicating Adequate Type I Error Control

The following presentation explains both the conditions under which each test extends good power to detect true differences and the situations for which the tests possessed inadequate power. The first set of power estimate tables (Tables 30-38) originated from power estimates for all conditions under investigation. Power estimates were removed for those conditions under which tests did not permit adequate Type I error control. All remaining estimates reflect both good and insufficient power values for those conditions where tests proved to be robust. By retaining all power values for all conditions signifying adequate Type I error control, two pieces of information arise.
First, this presentation can illuminate those conditions under which a test performed either effectively or not (effective conditions were highlighted in green). Specifically, this data presentation allows the reader to determine when a test achieved Type I error control and whether or not good power accompanied the robustness of the test. Secondly, one can determine to what extent a test failed to provide good power under given conditions. Good power was identified as any power estimate of .795 or higher.

The reason for presenting the results in this manner is that most practitioners first consider whether a test demonstrates adequate Type I error control. If Type I error is not well maintained, then considerations of power are deemed irrelevant. Typically, the practitioner wants to determine first whether a particular null hypothesis is to be rejected or maintained. If a given test does not provide an adequate degree of robustness, the test, for that given set of conditions, is rendered invalid. The flip side of this argument is that the researcher is no longer taking a disconfirmatory approach to hypothesis testing. Furthermore, alternative hypotheses are not being investigated. But that concern involves a deeper epistemological argument, beyond the purview of the present study.

Table 30
Power Estimates for Conditions Indicating Adequate Type I Error Control (τ² = 0, δ = .8) at α = .05 for K = 10

*Reg Q = Regular Q; PQb = Permuted Q Between; RE = Random-effects Z; FE = Fixed-effects Z; CR = Conditionally Random Procedure

Power was low for all test conditions with adequate Type I error when sample sizes were small (see Table 30). Because power was low, none of the tests would prove effective under these conditions until sample sizes increase to 40 and above.
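The two-step screening described above (keep a power estimate only where Type I error was adequately controlled, then flag it as good power at .795 or higher) can be sketched as follows. This is a minimal illustration, not the procedure used to build the tables. The function names are hypothetical. The band of α ± 0.5α is the form in which Bradley's liberal criterion is usually quoted and is an assumption here, since this section does not state which band was applied; it is left as a parameter. The Type I error rates in the example are copied from the first row of Table 26, while the power values are invented for illustration.

```python
NOMINAL_ALPHA = 0.05
GOOD_POWER = 0.795  # threshold for "good power" stated in the text


def within_band(rate, alpha=NOMINAL_ALPHA, tolerance=0.5):
    """Return True when an estimated Type I error rate lies inside
    alpha +/- tolerance * alpha. The default tolerance is an assumption."""
    lower = (1.0 - tolerance) * alpha
    upper = (1.0 + tolerance) * alpha
    return lower <= rate <= upper


def screen_power(type1_rates, power_rates, alpha=NOMINAL_ALPHA,
                 tolerance=0.5, good_power=GOOD_POWER):
    """Drop power estimates for tests whose Type I error rate falls outside
    the band; for the rest, return (power, is_effective) keyed by test."""
    kept = {}
    for test, rate in type1_rates.items():
        if within_band(rate, alpha, tolerance):
            power = power_rates[test]
            kept[test] = (power, power >= good_power)
    return kept


# Type I error rates: first row of Table 26 (5/5, 1/1, sk=0.00, kr=0.00).
type1_example = {"Reg Q": 0.670, "PQb": 0.048, "RE": 0.097,
                 "FE": 0.175, "CR": 0.098}
# Power values: hypothetical, for illustration only.
power_example = {"Reg Q": 0.99, "PQb": 0.41, "RE": 0.62,
                 "FE": 0.88, "CR": 0.63}

screened = screen_power(type1_example, power_example)
print(screened)
```

Under these assumptions only permuted Q survives the screen in the example, and its illustrative power falls below .795, so it would be reported but not marked as effective. A narrower band can be tried by passing a smaller `tolerance`.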
Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/14/6sk=0.00, kr=0.000.376 5/51/24/6sk=0.00, kr=0.000.368 5/51/44/6sk=0.00, kr=0.000.370 5/51/14/6sk=1.00, kr=3.000.1430.413 5/51/24/6sk=1.00, kr=3.000.421 5/51/44/6sk=1.00, kr=3.000.439 5/51/14/6sk=2.00, kr=6.000.4550.4980.489 5/51/24/6sk=2.00, kr=6.000.500 5/51/44/6sk=2.00, kr=6.000.513 4/61/14/6sk=0.00, kr=0.000.364 4/61/24/6sk=0.00, kr=0.000.383 4/61/44/6sk=0.00, kr=0.000.411 4/61/14/6sk=1.00, kr=3.000.403 4/61/24/6sk=1.00, kr=3.000.441 4/61/44/6sk=1.00, kr=3.000.480 4/61/14/6sk=2.00, kr=6.000.4300.4480.4600.451 4/61/24/6sk=2.00, kr=6.000.503 4/61/44/6sk=2.00, kr=6.000.544 6/41/14/6sk=0.00, kr=0.000.363 6/41/24/6sk=0.00, kr=0.000.3440.3800.4000.381 6/41/44/6sk=0.00, kr=0.000.322 6/41/14/6sk=1.00, kr=3.000.401 6/41/24/6sk=1.00, kr=3.000.2120.393 6/41/44/6sk=1.00, kr=3.000.3840.4560.4870.459 6/41/14/6sk=2.00, kr=6.000.4350.4660.4780.470 6/41/24/6sk=2.00, kr=6.000.2490.448 6/41/44/6sk=2.00, kr=6.000.463 20/201/14/6sk=0.00, kr=0.000.9170.9660.962 20/201/24/6sk=0.00, kr=0.000.9170.9570.9650.958 20/201/44/6sk=0.00, kr=0.000.9040.9690.961 20/201/14/6sk=1.00, kr=3.000.9120.9620.9700.963 20/201/24/6sk=1.00, kr=3.000.925 20/201/44/6sk=1.00, kr=3.000.945 20/201/14/6sk=2.00, kr=6.000.9170.9600.961 20/201/24/6sk=2.00, kr=6.00 20/201/44/6sk=2.00, kr=6.000.970 16/241/14/6sk=0.00, kr=0.000.904 16/241/24/6sk=0.00, kr=0.00 16/241/44/6sk=0.00, kr=0.000.937 16/241/14/6sk=1.00, kr=3.000.9080.9560.957 16/241/24/6sk=1.00, kr=3.000.937 16/241/44/6sk=1.00, kr=3.000.959 16/241/14/6sk=2.00, kr=6.000.9110.952 16/241/24/6sk=2.00, kr=6.000.956 16/241/44/6sk=2.00, kr=6.000.976 PAGE 165 155 Table 30 (continued) Power Estimates for Conditions Indicating Adequate Type I Error Control ( 2 = 0, =.8) at =.05 for K=10 Note: Shaded areas indicate those conditions for which a test demonstrated good power to detect true differences. 
As sample sizes increased to 40 and greater, there were a greater number of conditions with both robustness and good power for all of the tests (see Table 30). After permuted Q, the RE and CR tests provided the greatest number of conditions with robustness and good power (19 conditions each). Regular Q yielded 6 effective conditions, while the FE test demonstrated 16. Permuted Q achieved the majority of effective conditions once sample sizes increased to 40 and greater.

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
24/16 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.907 0.961
24/16 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.885 0.948
24/16 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.870
24/16 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.913 0.956 0.966 0.957
24/16 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.965 0.973 0.966
24/16 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.905 0.960 0.974 0.962
24/16 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.913 0.959
24/16 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.828 0.935 0.979 0.974
24/16 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.948
100/100 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 1.000 1.000
100/100 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 1.000 1.000 1.000 1.000 1.000
100/100 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 1.000 1.000 1.000
100/100 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 1.000 1.000
100/100 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 1.000 1.000
100/100 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 1.000
100/100 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 1.000
100/100 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 1.000
100/100 | 1/4 | 4/6 | sk=2.00, kr=6.00 |
80/120 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 1.000 1.000 1.000 1.000
80/120 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 1.000
80/120 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 1.000
80/120 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 1.000 1.000 1.000
80/120 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 1.000
80/120 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 1.000
80/120 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 1.000
80/120 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 1.000
80/120 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 1.000
120/80 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 1.000 1.000 1.000 1.000
120/80 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 1.000 1.000
120/80 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 1.000
120/80 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 1.000 1.000 1.000
120/80 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 1.000 1.000 1.000 1.000
120/80 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 1.000 1.000 1.000 1.000
120/80 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 1.000
120/80 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 1.000 1.000 1.000
120/80 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 1.000

Table 31
Power Estimates for Conditions Indicating Adequate Type I Error Control (τ² = 0, δ = .8) at α = .05 for K = 30

With the increase in K to 30 (Table 31), the FE test offered the same number of effective, powerful conditions as with K = 10 (see Table 30). The RE test, however, showed an increased number of effective conditions (24, compared to 19 at K = 10), primarily in the conditions with equal sample sizes and those in which the first group had the greater sample size. The CR test showed no change in the number of effective conditions, and these were concentrated in the equal sample size conditions.

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
5/5 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.869
5/5 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.877
5/5 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.872
5/5 | 1/1 | 12/18 | sk=1.00, kr=3.00 |
5/5 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.916
5/5 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.931
5/5 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.930 0.928
5/5 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.959
5/5 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.967
4/6 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.865
4/6 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.880
4/6 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.891
4/6 | 1/1 | 12/18 | sk=1.00, kr=3.00 |
4/6 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.930
4/6 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.942
4/6 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.923
4/6 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.962
4/6 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.977
6/4 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.870
6/4 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.836
6/4 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.814 0.836 0.839
6/4 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.902
6/4 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.375 0.904
6/4 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.907 0.926
6/4 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.924
6/4 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.949
6/4 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.953
20/20 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 1.000 1.000 1.000
20/20 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.983 1.000 1.000 1.000
20/20 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 1.000 1.000 1.000
20/20 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 1.000 1.000 1.000 1.000
20/20 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 1.000
20/20 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 1.000
20/20 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 1.000 1.000
20/20 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 1.000
20/20 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 1.000
16/24 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 1.000 1.000
16/24 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 1.000
16/24 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 1.000
16/24 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 1.000 1.000 1.000 1.000
16/24 | 1/2 | 12/18 | sk=1.00, kr=3.00 |
16/24 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 1.000
16/24 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 1.000 1.000
16/24 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 1.000
16/24 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 1.000

Table 31 (continued)

Permuted Q showed an increased number of effective conditions at both smaller and larger sample sizes (compare Table 30 to Table 31). These effective conditions were spread across all conditions for permuted Q.
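When comparing individual entries across Table 30 and Table 31, it may help to keep the Monte Carlo sampling error of each estimate in mind. The following is a rough guide only, assuming each power estimate is the proportion of rejections over R = 5,000 independent replications (the number of iterations reported in the abstract); the symbols R and p̂ are introduced here for illustration.

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{R}}, \qquad R = 5000

\hat{p} = .795:\quad \sqrt{\frac{(.795)(.205)}{5000}} \approx .0057
\qquad
\hat{p} = .50:\quad \sqrt{\frac{(.50)(.50)}{5000}} \approx .0071
```

Under that assumption, differences of a few thousandths between two entries are within sampling noise, while the differences between the small-sample and larger-sample blocks of the tables are far larger than this.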
Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
24/16 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.978 1.000
24/16 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 1.000 1.000 1.000
24/16 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 1.000 1.000
24/16 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 1.000 1.000 1.000 1.000
24/16 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 1.000 1.000 1.000
24/16 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 1.000 1.000 1.000
24/16 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 1.000
24/16 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 1.000 1.000 1.000 1.000
24/16 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 1.000
100/100 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 1.000 1.000 1.000 1.000 1.000
100/100 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 1.000 1.000 1.000 1.000
100/100 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 1.000 1.000 1.000
100/100 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 1.000 1.000 1.000
100/100 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 1.000 1.000 1.000
100/100 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 1.000
100/100 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 1.000 1.000
100/100 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 1.000
100/100 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 1.000
80/120 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 1.000
80/120 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 1.000
80/120 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 1.000
80/120 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 1.000 1.000
80/120 | 1/2 | 12/18 | sk=1.00, kr=3.00 |
80/120 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 1.000
80/120 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 1.000 1.000
80/120 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 1.000
80/120 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 1.000
120/80 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 1.000 1.000 1.000
120/80 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 1.000 1.000
120/80 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 1.000
120/80 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 1.000 1.000 1.000
120/80 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 1.000 1.000 1.000 1.000
120/80 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 1.000 1.000 1.000 1.000
120/80 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 1.000 1.000
120/80 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 1.000
120/80 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 1.000

Table 32
Power Estimates for Conditions Indicating Adequate Type I Error Control (τ² = .33, δ = .8) at α = .05 for K = 10
With the increase in τ² to .33 at K = 10 and a nominal alpha of .05, sufficient power was absent for all tests, including permuted Q (see Table 32). Power remained limited for permuted Q, regardless of increases in sample size, as long as K = 10.

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
5/5 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.228
5/5 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.229
5/5 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.234
5/5 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.247
5/5 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.238
5/5 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.245
5/5 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.248
5/5 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.246
5/5 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.269
4/6 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.231
4/6 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.227
4/6 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.247
4/6 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.232
4/6 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.248
4/6 | 1/4 | 4/6 | sk=1.00, kr=3.00 |
4/6 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.248
4/6 | 1/2 | 4/6 | sk=2.00, kr=6.00 |
4/6 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.281
6/4 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.229
6/4 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.219
6/4 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.215
6/4 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.230
6/4 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.242
6/4 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.232
6/4 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.241
6/4 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.245
6/4 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.251
20/20 | 1/1 | 4/6 | sk=0.00, kr=0.00 |
20/20 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.366
20/20 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.371
20/20 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.374
20/20 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.367
20/20 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.394
20/20 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.376
20/20 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.381
20/20 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.381
16/24 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.366
16/24 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.372
16/24 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.373
16/24 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.369
16/24 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.392
16/24 | 1/4 | 4/6 | sk=1.00, kr=3.00 |
16/24 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.379
16/24 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.388
16/24 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.394

Table 32 (continued)
Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
24/16 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.371
24/16 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.365
24/16 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.359
24/16 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.371
24/16 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.367
24/16 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.356
24/16 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.372
24/16 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.381
24/16 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.370
100/100 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.436
100/100 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.447
100/100 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.447
100/100 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.445
100/100 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.441
100/100 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.452
100/100 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.447
100/100 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.446
100/100 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.444
80/120 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.455
80/120 | 1/2 | 4/6 | sk=0.00, kr=0.00 |
80/120 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.444
80/120 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.446
80/120 | 1/2 | 4/6 | sk=1.00, kr=3.00 |
80/120 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.446
80/120 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.447
80/120 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.443
80/120 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.452
120/80 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.450
120/80 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.447
120/80 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.450
120/80 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.441
120/80 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.437
120/80 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.452
120/80 | 1/1 | 4/6 | sk=2.00, kr=6.00 |
120/80 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.441
120/80 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.439

Table 33
Power Estimates for Conditions Indicating Adequate Type I Error Control (τ² = .33, δ = .8) at α = .10 for K = 10
The increase in nominal alpha to .10 for the τ² = .33 condition at K = 10 (see Table 33) provided a small, but inadequate, improvement in power for all of the tests, including permuted Q. The combined increase in heterogeneity and small K suppressed power for all of the tests, making them all ineffective. This pattern was evident as heterogeneity increased to 1 for both nominal alpha .05 and .10 when K = 10.

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
5/5 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.360
5/5 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.359
5/5 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.365
5/5 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.375
5/5 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.372
5/5 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.387
5/5 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.373
5/5 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.384
5/5 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.409
4/6 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.364
4/6 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.352
4/6 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.377
4/6 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.366
4/6 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.388
4/6 | 1/4 | 4/6 | sk=1.00, kr=3.00 |
4/6 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.379
4/6 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.407
4/6 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.424
6/4 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.362
6/4 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.351
6/4 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.334
6/4 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.353
6/4 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.377
6/4 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.362
6/4 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.363
6/4 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.383
6/4 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.390
20/20 | 1/1 | 4/6 | sk=0.00, kr=0.00 |
20/20 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.527
20/20 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.527
20/20 | 1/1 | 4/6 | sk=1.00, kr=3.00 |
20/20 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.531
20/20 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.546
20/20 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.531
20/20 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.545
20/20 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.539
16/24 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.527
16/24 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.528
16/24 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.533
16/24 | 1/1 | 4/6 | sk=1.00, kr=3.00 |
16/24 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.548
16/24 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.543
16/24 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.527
16/24 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.537
16/24 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.548

Table 33 (continued)

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
24/16 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.520
24/16 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.517
24/16 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.521
24/16 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.531
24/16 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.524
24/16 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.507
24/16 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.522
24/16 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.534
24/16 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.523
100/100 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.600
100/100 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.606
100/100 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.604
100/100 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.616
100/100 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.600
100/100 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.611
100/100 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.606
100/100 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.608
100/100 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.606
80/120 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.609
80/120 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.595
80/120 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.606
80/120 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.608
80/120 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.616
80/120 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.612
80/120 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.607
80/120 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.600
80/120 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.604
120/80 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.605
120/80 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.605
120/80 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.607
120/80 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.603
120/80 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.596
120/80 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.614
120/80 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.597
120/80 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.597
120/80 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.609

Table 34
Power Estimates for Conditions Indicating Adequate Type I Error Control (τ² = .33, δ = .8) at α = .05 for K = 30
Low power at sample sizes below 40 prohibited effective use of the RE and CR tests in the few conditions evidencing robustness. Permuted Q also produced low power for these conditions, until sample sizes rose to 40 and above.

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
5/5 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.636
5/5 | 1/2 | 12/18 | sk=0.00, kr=0.00 |
5/5 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.631
5/5 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.654
5/5 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.670
5/5 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.684
5/5 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.684
5/5 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.697
5/5 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.700
4/6 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.620
4/6 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.635
4/6 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.655 0.686 0.691
4/6 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.640
4/6 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.676
4/6 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.689 0.729 0.733
4/6 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.660
4/6 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.703
4/6 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.756 0.760
6/4 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.621
6/4 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.613
6/4 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.594
6/4 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.654
6/4 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.653
6/4 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.651
6/4 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.662
6/4 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.675
6/4 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.686
20/20 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.874
20/20 | 1/2 | 12/18 | sk=0.00, kr=0.00 |
20/20 | 1/4 | 12/18 | sk=0.00, kr=0.00 |
20/20 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.878
20/20 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.881
20/20 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.885
20/20 | 1/1 | 12/18 | sk=2.00, kr=6.00 |
20/20 | 1/2 | 12/18 | sk=2.00, kr=6.00 |
20/20 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.889
16/24 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.871
16/24 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.874
16/24 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.889
16/24 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.879
16/24 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.878
16/24 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.894
16/24 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.868
16/24 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.880
16/24 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.891

Table 34 (continued)

Increasing K to 30 had an ameliorative effect on power for permuted Q, despite heterogeneity of effects at .33 (compare Table 32 to Table 34). As long as sample sizes exceeded the smallest level, power increased for this one test.

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
24/16 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.869
24/16 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.867
24/16 | 1/4 | 12/18 | sk=0.00, kr=0.00 |
24/16 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.878
24/16 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.876
24/16 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.864
24/16 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.867
24/16 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.879
24/16 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.875
100/100 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.935
100/100 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.935
100/100 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.935
100/100 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.937
100/100 | 1/2 | 12/18 | sk=1.00, kr=3.00 |
100/100 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.939
100/100 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.934
100/100 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.939
100/100 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.934
80/120 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.936
80/120 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.936
80/120 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.937
80/120 | 1/1 | 12/18 | sk=1.00, kr=3.00 |
80/120 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.939
80/120 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.935
80/120 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.932
80/120 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.938
80/120 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.933
120/80 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.931
120/80 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.930
120/80 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.923
120/80 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.926
120/80 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.937
120/80 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.928
120/80 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.933
120/80 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.932
120/80 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.931

Table 35
Power Estimates for Conditions Indicating Adequate Type I Error Control (τ² = .33, δ = .8) at α = .10 for K = 30

The increase in nominal alpha served to increase power to a minimal extent (when sample sizes were small) for the RE and CR tests. Permuted Q's power did not change with the increase and remained high across all conditions.

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
5/5 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.761
5/5 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.753
5/5 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.747
5/5 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.774
5/5 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.779
5/5 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.800
5/5 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.792
5/5 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.799
5/5 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.810
4/6 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.747
4/6 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.755 0.779 0.784
4/6 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.773 0.792 0.795
4/6 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.761 0.789
4/6 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.782 0.807 0.812
4/6 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.799 0.821
4/6 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.770
4/6 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.809
4/6 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.824 0.844 0.846
6/4 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.743 0.773
6/4 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.733 0.764
6/4 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.718
6/4 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.771
6/4 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.768
6/4 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.768
6/4 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.785
6/4 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.787
6/4 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.796
20/20 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.932
20/20 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.934
20/20 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.932
20/20 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.935
20/20 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.938
20/20 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.944
20/20 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.931
20/20 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.938
20/20 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.944
16/24 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.931
16/24 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.932
16/24 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.942
16/24 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.938
16/24 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.939
16/24 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.944
16/24 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.927
16/24 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.937
16/24 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.947

Table 35 (continued)

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
24/16 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.930
24/16 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.928
24/16 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.931
24/16 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.933
24/16 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.935
24/16 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.932
24/16 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.926
24/16 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.935
24/16 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.932
100/100 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.971
100/100 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.970
100/100 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.968
100/100 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.969
100/100 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.974
100/100 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.971
100/100 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.970
100/100 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.972
100/100 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.969
80/120 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.970
80/120 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.970
80/120 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.971
80/120 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.972
80/120 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.975
80/120 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.970
80/120 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.966
80/120 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.972
80/120 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.970
120/80 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.968
120/80 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.969
120/80 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.960
120/80 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.962
120/80 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.966
120/80 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.963
120/80 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.970
120/80 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.972
120/80 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.967
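The direction of the pattern in Tables 32 through 35 (power falling as τ² rises and recovering as K grows) can be illustrated with a simple normal-approximation calculation. The sketch below is not the simulation used in this study and does not model the permutation test, nonnormality, or unequal variances. It assumes a two-sided Z test on the difference between two class mean effect sizes, with each study's total variance taken as v + τ², where v ≈ (n1 + n2)/(n1·n2) is a simplified large-sample variance of a standardized mean difference. The function name `approx_power` and all of its parameters are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(delta, tau2, n1, n2, k_a, k_b, alpha=0.05):
    """Approximate power of a two-sided Z test for a between-class
    difference `delta` in mean effect size (illustrative assumptions only)."""
    # Simplified within-study sampling variance of a standardized mean difference
    v = (n1 + n2) / (n1 * n2)
    # Variance of the difference between two class means, each averaging
    # k studies whose total variance is v + tau2
    var_diff = (v + tau2) / k_a + (v + tau2) / k_b
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta / var_diff ** 0.5
    # Probability of landing in either rejection region
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


# Group sizes 20/20, delta = .8, classes of 4/6 studies (K=10) and 12/18 (K=30)
for k_a, k_b in ((4, 6), (12, 18)):
    for tau2 in (0.0, 0.33, 1.0):
        p = approx_power(0.8, tau2, 20, 20, k_a, k_b)
        print(f"K={k_a + k_b:2d}  tau2={tau2:4.2f}  approx power={p:.3f}")
```

Because τ² enters the variance of each class mean divided only by the number of studies in the class, larger primary-study samples cannot offset it in this approximation; only more studies can, which is consistent with the contrast between the K = 10 and K = 30 tables.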
Table 36
Power Estimates for Conditions Indicating Adequate Type I Error Control (τ² = 1, δ = .8) at α = .05 for K = 10

With K = 10 and increased heterogeneity of effects at 1, permuted Q evidenced low power, rendering it an ineffective statistic under these conditions. Again, none of the other tests had adequate Type I error control. Therefore, concerns about the adequacy of power for these tests were not relevant.

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
5/5 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.140
5/5 | 1/2 | 4/6 | sk=0.00, kr=0.00 |
5/5 | 1/4 | 4/6 | sk=0.00, kr=0.00 |
5/5 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.137
5/5 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.135
5/5 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.152
5/5 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.142
5/5 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.141
5/5 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.146
4/6 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.149
4/6 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.150
4/6 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.140
4/6 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.135
4/6 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.147
4/6 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.146
4/6 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.145
4/6 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.148
4/6 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.158
6/4 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.149
6/4 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.141
6/4 | 1/4 | 4/6 | sk=0.00, kr=0.00 |
6/4 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.153
6/4 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.136
6/4 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.144
6/4 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.149
6/4 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.146
6/4 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.148
20/20 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.172
20/20 | 1/2 | 4/6 | sk=0.00, kr=0.00 |
20/20 | 1/4 | 4/6 | sk=0.00, kr=0.00 |
20/20 | 1/1 | 4/6 | sk=1.00, kr=3.00 |
20/20 | 1/2 | 4/6 | sk=1.00, kr=3.00 |
20/20 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.175
20/20 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.172
20/20 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.184
20/20 | 1/4 | 4/6 | sk=2.00, kr=6.00 |
16/24 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.173
16/24 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.178
16/24 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.174
16/24 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.181
16/24 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.175
16/24 | 1/4 | 4/6 | sk=1.00, kr=3.00 |
16/24 | 1/1 | 4/6 | sk=2.00, kr=6.00 |
16/24 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.167
16/24 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.178

Table 36 (continued)

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
24/16 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.168
24/16 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.171
24/16 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.165
24/16 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.174
24/16 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.168
24/16 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.168
24/16 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.174
24/16 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.174
24/16 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.165
100/100 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.194
100/100 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.195
100/100 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.195
100/100 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.186
100/100 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.185
100/100 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.187
100/100 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.186
100/100 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.182
100/100 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.184
80/120 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.190
80/120 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.185
80/120 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.190
80/120 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.183
80/120 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.186
80/120 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.194
80/120 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.189
80/120 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.188
80/120 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.199
120/80 | 1/1 | 4/6 | sk=0.00, kr=0.00 | 0.186
120/80 | 1/2 | 4/6 | sk=0.00, kr=0.00 | 0.185
120/80 | 1/4 | 4/6 | sk=0.00, kr=0.00 | 0.190
120/80 | 1/1 | 4/6 | sk=1.00, kr=3.00 | 0.198
120/80 | 1/2 | 4/6 | sk=1.00, kr=3.00 | 0.177
120/80 | 1/4 | 4/6 | sk=1.00, kr=3.00 | 0.180
120/80 | 1/1 | 4/6 | sk=2.00, kr=6.00 | 0.185
120/80 | 1/2 | 4/6 | sk=2.00, kr=6.00 | 0.182
120/80 | 1/4 | 4/6 | sk=2.00, kr=6.00 | 0.186

Table 37
Power Estimates for Conditions Indicating Adequate Type I Error Control (τ² = 1, δ = .8) at α = .05 for K = 30

Though permuted Q's power improved with the increase in K at τ² = 1 and as sample size increased to 40, power was still insufficient. Low power was evident across all variants within this combination of heterogeneity of effects and effect size.

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
5/5 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.389
5/5 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.382
5/5 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.386
5/5 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.392
5/5 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.383
5/5 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.387
5/5 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.396
5/5 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.390
5/5 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.398
4/6 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.375
4/6 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.389
4/6 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.394
4/6 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.385
4/6 | 1/2 | 12/18 | sk=1.00, kr=3.00 |
4/6 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.398
4/6 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.392
4/6 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.403
4/6 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.393
6/4 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.366
6/4 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.367
6/4 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.362
6/4 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.388
6/4 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.395
6/4 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.383
6/4 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.370
6/4 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.387
6/4 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.378
20/20 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.491
20/20 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.492
20/20 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.488
20/20 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.478
20/20 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.502
20/20 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.485
20/20 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.480
20/20 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.490
20/20 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.479
16/24 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.482
16/24 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.492
16/24 | 1/4 | 12/18 | sk=0.00, kr=0.00 |
16/24 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.506
16/24 | 1/2 | 12/18 | sk=1.00, kr=3.00 |
16/24 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.499
16/24 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.496
16/24 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.483
16/24 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.493

Table 37 (continued)

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
24/16 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.487
24/16 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.484
24/16 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.487
24/16 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.482
24/16 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.493
24/16 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.481
24/16 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.480
24/16 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.495
24/16 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.474
100/100 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.523
100/100 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.524
100/100 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.523
100/100 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.519
100/100 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.509
100/100 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.525
100/100 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.522
100/100 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.515
100/100 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.530
80/120 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.532
80/120 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.524
80/120 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.518
80/120 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.527
80/120 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.518
80/120 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.533
80/120 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.525
80/120 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.532
80/120 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.522
120/80 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.523
120/80 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.508
120/80 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.535
120/80 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.519
120/80 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.525
120/80 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.529
120/80 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.517
120/80 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.515
120/80 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.523

Table 38
Power Estimates for Conditions Indicating Adequate Type I Error Control (τ² = 1, δ = .8) at α = .10 for K = 30

The increase in nominal alpha to .10 promoted further slight improvements in permuted Q's power, but these increases remained inadequate for appropriate use under all of the τ² = 1 conditions.

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
5/5 | 1/1 | 12/18 | sk=0.00, kr=0.00 |
5/5 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.516
5/5 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.513
5/5 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.518
5/5 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.514
5/5 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.519
5/5 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.530
5/5 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.534
5/5 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.525
4/6 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.509
4/6 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.520
4/6 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.521
4/6 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.516
4/6 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.530
4/6 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.531
4/6 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.524
4/6 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.528
4/6 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.528
6/4 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.504
6/4 | 1/2 | 12/18 | sk=0.00, kr=0.00 |
6/4 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.494
6/4 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.525
6/4 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.526
6/4 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.515
6/4 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.503
6/4 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.517
6/4 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.513
20/20 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.628
20/20 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.627
20/20 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.618
20/20 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.618
20/20 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.631
20/20 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.614
20/20 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.619
20/20 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.628
20/20 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.611
16/24 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.615
16/24 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.631
16/24 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.627
16/24 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.627
16/24 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.620
16/24 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.629
16/24 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.626
16/24 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.614
16/24 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.621

Table 38 (continued)

Summary of Power Estimates Results

When τ² = 0 and K = 10 at a nominal alpha of .05, power was low for each of the 5 tests in all conditions evidencing adequate Type I error control when sample sizes were small. With the increase in K to 30, the FE test offered the same number of effective, powerful conditions as with K = 10 (see Table 30). The RE test, however, showed an increased number of effective conditions (24, compared to 19 at K = 10), primarily in the conditions with equal sample sizes and those in which the first group had the greater sample size. The CR test showed no change in the number of effective conditions, and these were concentrated in the equal

Primary Study Sample Sizes | Population Variances | N of studies | Population Shape | Reported power estimates (among Reg Q, PQb, RE, FE, CR)
24/16 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.620
24/16 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.620
24/16 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.623
24/16 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.615
24/16 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.627
24/16 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.606
24/16 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.620
24/16 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.628
24/16 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.610
100/100 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.650
100/100 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.662
100/100 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.654
100/100 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.646
100/100 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.641
100/100 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.658
100/100 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.660
100/100 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.650
100/100 | 1/4 | 12/18 | sk=2.00, kr=6.00 | 0.660
80/120 | 1/1 | 12/18 | sk=0.00, kr=0.00 | 0.661
80/120 | 1/2 | 12/18 | sk=0.00, kr=0.00 | 0.659
80/120 | 1/4 | 12/18 | sk=0.00, kr=0.00 | 0.651
80/120 | 1/1 | 12/18 | sk=1.00, kr=3.00 | 0.647
80/120 | 1/2 | 12/18 | sk=1.00, kr=3.00 | 0.655
80/120 | 1/4 | 12/18 | sk=1.00, kr=3.00 | 0.675
80/120 | 1/1 | 12/18 | sk=2.00, kr=6.00 | 0.663
80/120 | 1/2 | 12/18 | sk=2.00, kr=6.00 | 0.658
80/120   1/4   12/18   sk=2.00, kr=6.00   0.655
120/80   1/1   12/18   sk=0.00, kr=0.00   0.661
120/80   1/2   12/18   sk=0.00, kr=0.00   0.646
120/80   1/4   12/18   sk=0.00, kr=0.00   0.660
120/80   1/1   12/18   sk=1.00, kr=3.00   0.650
120/80   1/2   12/18   sk=1.00, kr=3.00   0.649
120/80   1/4   12/18   sk=1.00, kr=3.00   0.655
120/80   1/1   12/18   sk=2.00, kr=6.00   0.651
120/80   1/2   12/18   sk=2.00, kr=6.00   0.651
120/80   1/4   12/18   sk=2.00, kr=6.00   0.656

sample size conditions. Permuted Q had an increased number of effective conditions at both the smaller and the larger sample sizes.

With the increase in τ² to .33 at K=10 and nominal alpha .05, sufficient power was absent for all tests, including permuted Q. Power remained limited for permuted Q regardless of increases in sample size. With the introduction of effect sizes greater than 0, even permuted Q's effectiveness deteriorated sharply because of losses in power. The increase in nominal alpha to .10 for the τ² = .33 condition at K=10 did not improve power for the permuted Q, RE or CR tests. Both the regular Q and the FE test possessed good power when primary study sample sizes were 40 or greater, but lacked robustness. The combination of increased heterogeneity and small K suppressed power for all of the tests. This pattern held as heterogeneity increased to 1 at both nominal alpha .05 and .10 when K=10.

Increasing K to 30 (with alpha set to .05) had an ameliorative effect on power for permuted Q, even with heterogeneity of effects equal to .33. As long as sample sizes exceeded the smallest level, this test's power increased. The RE and CR tests showed slight improvement in robustness with the increase in effect size to .8. However, low power at sample sizes below 40 prohibited effective use of these tests in the conditions evidencing robustness. The increase in nominal alpha increased power only minimally (when sample sizes were small) for the RE and CR tests.
Instead, the increase in heterogeneity of effects continued to suppress power for these tests, regardless of any additional changes in any of the factors investigated. Table 23 illustrates that power was still insufficient, despite a minimal increase, to permit effective use of the RE and CR tests. Permuted Q's power was enhanced only slightly by the increase at K=30 for the same τ² and remained high across all conditions. Therefore, the increase in K to 30 appeared to have more of an ameliorative impact than any other factor for increasing permuted Q's power at τ² = .33.

Once heterogeneity of effects increased to 1, permuted Q's power dropped dramatically. Regardless of any and all variations of K, nominal alpha and sample size, permuted Q's power remained low. None of the other tests attained adequate robustness for power to become a consideration.

In summary, none of the tests offered any degree of effectiveness as heterogeneity of effects increased to 1 or when sample sizes were less than 40 with nominal alpha set to .05 (and K=10). Though permuted Q yielded adequate Type I error control for all conditions investigated, it proved to be susceptible to Type II error under these conditions.

Power Estimates for Conditions Indicating Both Robustness and Good Power

This second series of results illustrates those conditions for which each test provided both adequate Type I error control and good power. Low power values were excluded. These tables (Tables 39-42) originated as Type I error estimate tables for each of the true null conditions with a true effect of .8. Tables were constructed only for those conditions in which at least one test obtained both adequate Type I error control and good power for at least one condition. Once the conditions reflecting adequate robustness were identified and highlighted, only the data reflecting good power for those conditions were presented.
Therefore, these tables best illustrate optimal effectiveness of the five tests being evaluated. Generally, adequate power is defined as probability estimates (the corresponding false null estimates for the same τ² and δ combination) of .80 or better. This means that in at least 80% of the simulations, the test detected true differences between groups. Power estimates are only relevant for those conditions where a test has demonstrated adequate control of Type I error. As mentioned previously, only those conditions in which a test evidenced both robustness and good power were included in the next set of tables. Therefore, the following tables (Tables 39-42) provide a picture of the conditions for optimal effective use of each test being evaluated.

Table 39
Power Estimates for Conditions Indicating Both Robustness & Good Power (τ² = 0, δ = .8) at α = .05 for K=10

Note: Empty shaded cells indicate those conditions with adequate Type I error control, but low power. Unshaded cells indicate conditions with poor Type I error control. Shaded cells containing data have both good Type I error control and good power.
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Random-effects Z; FE=Fixed-effects Z; CR=Conditionally Random Procedure

Though each of the tests presented at least one occasion of robustness when sample sizes were small, none of the tests, including permuted Q, possessed enough power for effective application. After sample sizes increased to 40 and above, the RE, FE and CR tests all showed enhanced power when sample sizes were equal (particularly the FE and CR tests) and when the first group had a larger sample size than the second (particularly for the RE test).
Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/14/6sk=0.00, kr=0.00 5/51/24/6sk=0.00, kr=0.00 5/51/44/6sk=0.00, kr=0.00 5/51/14/6sk=1.00, kr=3.00 5/51/24/6sk=1.00, kr=3.00 5/51/44/6sk=1.00, kr=3.00 5/51/14/6sk=2.00, kr=6.00 5/51/24/6sk=2.00, kr=6.00 5/51/44/6sk=2.00, kr=6.00 4/61/14/6sk=0.00, kr=0.00 4/61/24/6sk=0.00, kr=0.00 4/61/44/6sk=0.00, kr=0.00 4/61/14/6sk=1.00, kr=3.00 4/61/24/6sk=1.00, kr=3.00 4/61/44/6sk=1.00, kr=3.00 4/61/14/6sk=2.00, kr=6.00 4/61/24/6sk=2.00, kr=6.00 4/61/44/6sk=2.00, kr=6.00 6/41/14/6sk=0.00, kr=0.00 6/41/24/6sk=0.00, kr=0.00 6/41/44/6sk=0.00, kr=0.00 6/41/14/6sk=1.00, kr=3.00 6/41/24/6sk=1.00, kr=3.00 6/41/44/6sk=1.00, kr=3.00 6/41/14/6sk=2.00, kr=6.00 6/41/24/6sk=2.00, kr=6.00 6/41/44/6sk=2.00, kr=6.00 20/201/14/6sk=0.00, kr=0.000.9170.9660.962 20/201/24/6sk=0.00, kr=0.000.9170.9570.9650.958 20/201/44/6sk=0.00, kr=0.000.9040.9690.961 20/201/14/6sk=1.00, kr=3.000.9120.9620.9700.963 20/201/24/6sk=1.00, kr=3.000.925 20/201/44/6sk=1.00, kr=3.000.945 20/201/14/6sk=2.00, kr=6.000.9170.9600.961 20/201/24/6sk=2.00, kr=6.00 20/201/44/6sk=2.00, kr=6.000.970 PAGE 185 175 Table 39 (continued) Power Estimates for Conditions Indicating Both Robustness & Good Power ( 2= 0, = .8) at =.05 for K=10 *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure Permuted Q consistently offered the majority of ef fective conditions across all factors of variance, sample size and population shape, once sample size increased to 40 and above. 
Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 16/241/14/6sk=0.00, kr=0.000.904 16/241/24/6sk=0.00, kr=0.00 16/241/44/6sk=0.00, kr=0.000.937 16/241/14/6sk=1.00, kr=3.000.9080.9560.957 16/241/24/6sk=1.00, kr=3.000.937 16/241/44/6sk=1.00, kr=3.000.959 16/241/14/6sk=2.00, kr=6.000.9110.952 16/241/24/6sk=2.00, kr=6.000.956 16/241/44/6sk=2.00, kr=6.000.976 24/161/14/6sk=0.00, kr=0.000.9070.961 24/161/24/6sk=0.00, kr=0.000.8850.948 24/161/44/6sk=0.00, kr=0.000.870 24/161/14/6sk=1.00, kr=3 .000.9130.9560.9660.957 24/161/24/6sk=1.00, kr=3.000.9650.9730.966 24/161/44/6sk=1.00, kr=3 .000.9050.9600.9740.962 24/161/14/6sk=2.00, kr=6.000.9130.959 24/161/24/6sk=2.00, kr=6 .000.8280.9350.9790.974 24/161/44/6sk=2.00, kr=6.000.948 100/1001/14/6sk=0.00, kr=0.001.0001.000 100/1001/24/6sk=0.00, kr=0.0 01.0001.0001.0001.0001.000 100/1001/44/6sk=0.00, kr=0.001.0001.0001.000 100/1001/14/6sk=1.00, kr=3.001.0001.000 100/1001/24/6sk=1.00, kr=3.001.0001.000 100/1001/44/6sk=1.00, kr=3.001.000 100/1001/14/6sk=2.00, kr=6.001.000 100/1001/24/6sk=2.00, kr=6.001.000 100/1001/44/6sk=2.00, kr=6.00 80/1201/14/6sk=0.00, kr=0 .001.0001.0001.0001.000 80/1201/24/6sk=0.00, kr=0.001.000 80/1201/44/6sk=0.00, kr=0.001.000 80/1201/14/6sk=1.00, kr=3.001.0001.0001.000 80/1201/24/6sk=1.00, kr=3.001.000 80/1201/44/6sk=1.00, kr=3.001.000 80/1201/14/6sk=2.00, kr=6.001.000 80/1201/24/6sk=2.00, kr=6.001.000 80/1201/44/6sk=2.00, kr=6.001.000 120/801/14/6sk=0.00, kr=0 .001.0001.0001.0001.000 120/801/24/6sk=0.00, kr=0.001.0001.000 120/801/44/6sk=0.00, kr=0.001.000 120/801/14/6sk=1.00, kr=3.001.0001.0001.000 120/801/24/6sk=1.00, kr=3 .001.0001.0001.0001.000 120/801/44/6sk=1.00, kr=3 .001.0001.0001.0001.000 120/801/14/6sk=2.00, kr=6.001.000 120/801/24/6sk=2.00, kr=6.001.0001.0001.000 120/801/44/6sk=2.00, kr=6.001.000 PAGE 186 176 Table 40 Power Estimates for Conditions Indicating Both Robustness & Good Power ( 2= 0, = .8) at =.05 for K=30 Note: Empty shaded cells indicate 
those conditions with adequate Type I error control, but low power. Unshaded cells indicate conditions with poor Type I error control. Shaded cells containing data have both goo d Type I error and good power. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure The greatest difference from Table 39 to 40 (with the increase in K to 30) was that permuted Q provided effective use conditions when sample sizes were small. The number of effective conditions yielded by the regular Q, RE, FE and CR tests incr eased when sample sizes reached 40 and above. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/112/18sk=0.00, kr=0.000.869 5/51/212/18sk=0.00, kr=0.000.877 5/51/412/18sk=0.00, kr=0.000.872 5/51/112/18sk=1.00, kr=3.00 5/51/212/18sk=1.00, kr=3.000.916 5/51/412/18sk=1.00, kr=3.000.931 5/51/112/18sk=2.00, kr=6.000.9300.928 5/51/212/18sk=2.00, kr=6.000.959 5/51/412/18sk=2.00, kr=6.000.967 4/61/112/18sk=0.00, kr=0.000.865 4/61/212/18sk=0.00, kr=0.000.880 4/61/412/18sk=0.00, kr=0.000.891 4/61/112/18sk=1.00, kr=3.00 4/61/212/18sk=1.00, kr=3.000.930 4/61/412/18sk=1.00, kr=3.000.942 4/61/112/18sk=2.00, kr=6.000.923 4/61/212/18sk=2.00, kr=6.000.962 4/61/412/18sk=2.00, kr=6.000.977 6/41/112/18sk=0.00, kr=0.000.870 6/41/212/18sk=0.00, kr=0.000.836 6/41/412/18sk=0.00, kr=0.000.8140.8360.839 6/41/112/18sk=1.00, kr=3.000.902 6/41/212/18sk=1.00, kr=3.000.904 6/41/412/18sk=1.00, kr=3.000.9070.926 6/41/112/18sk=2.00, kr=6.000.924 6/41/212/18sk=2.00, kr=6.000.949 6/41/412/18sk=2.00, kr=6.000.953 20/201/112/18sk=0.00, kr=0.001.0001.0001.000 20/201/212/18sk=0.00, kr=0.000.0461.0001.0001.000 20/201/412/18sk=0.00, kr=0.001.0001.0001.000 20/201/112/18sk=1.00, kr=3.001.0001.0001.0001.000 20/201/212/18sk=1.00, kr=3.001.000 20/201/412/18sk=1.00, kr=3.001.000 20/201/112/18sk=2.00, kr=6.001.0001.000 20/201/212/18sk=2.00, kr=6.001.000 20/201/412/18sk=2.00, kr=6.001.000 PAGE 187 177 Table 
40 (continued) Power Estimates for Conditions Indicating Both Robustness & Good Power ( 2= 0, = .8) at =.05 for K=30 *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure The power of all tests was optimal in relationship to performance for all other combinations of heterogeneity of effects and effect size once K=30 and sample sizes were 40 or above. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 16/241/112/18sk=0.00, kr=0.001.0001.000 16/241/212/18sk=0.00, kr=0.001.000 16/241/412/18sk=0.00, kr=0.001.000 16/241/112/18sk=1.00, kr=3.001.0001.0001.0001.000 16/241/212/18sk=1.00, kr=3.00 16/241/412/18sk=1.00, kr=3.001.000 16/241/112/18sk=2.00, kr=6.001.0001.000 16/241/212/18sk=2.00, kr=6.001.000 16/241/412/18sk=2.00, kr=6.001.000 24/161/112/18sk=0.00, kr=0.000.9781.000 24/161/212/18sk=0.00, kr=0.001.0001.0001.000 24/161/412/18sk=0.00, kr=0.001.0001.000 24/161/112/18sk=1.00, kr=3.001.0001.0001.0001.000 24/161/212/18sk=1.00, kr=3.001.0001.0001.000 24/161/412/18sk=1.00, kr=3.001.0001.0001.000 24/161/112/18sk=2.00, kr=6.001.000 24/161/212/18sk=2.00, kr=6.001.0001.0001.0001.000 24/161/412/18sk=2.00, kr=6.001.000 100/1001/112/18sk=0.00, kr=0.001.0001.0001.0001.0001.000 100/1001/212/18sk=0.00, kr=0.001.0001.0001.0001.000 100/1001/412/18sk=0.00, kr=0.001.0001.0001.000 100/1001/112/18sk=1.00, kr=3.001.0001.0001.000 100/1001/212/18sk=1.00, kr=3.001.0001.0001.000 100/1001/412/18sk=1.00, kr=3.001.000 100/1001/112/18sk=2.00, kr=6.001.0001.000 100/1001/212/18sk=2.00, kr=6.001.000 100/1001/412/18sk=2.00, kr=6.001.000 80/1201/112/18sk=0.00, kr=0.001.000 80/1201/212/18sk=0.00, kr=0.001.000 80/1201/412/18sk=0.00, kr=0.001.000 80/1201/112/18sk=1.00, kr=3.001.0001.000 80/1201/212/18sk=1.00, kr=3.00 80/1201/412/18sk=1.00, kr=3.001.000 80/1201/112/18sk=2.00, kr=6.001.0001.000 80/1201/212/18sk=2.00, kr=6.001.000 80/1201/412/18sk=2.00, kr=6.001.000 120/801/112/18sk=0.00, 
kr=0.001.0001.0001.000 120/801/212/18sk=0.00, kr=0.001.0001.000 120/801/412/18sk=0.00, kr=0.001.000 120/801/112/18sk=1.00, kr=3.001.0001.0001.000 120/801/212/18sk=1.00, kr=3.001.0001.0001.0001.000 120/801/412/18sk=1.00, kr=3.001.0001.0001.0001.000 120/801/112/18sk=2.00, kr=6.001.0001.000 120/801/212/18sk=2.00, kr=6.001.000 120/801/412/18sk=2.00, kr=6.001.000 PAGE 188 178 Table 41 Power Estimates for Conditions Indicating Both Robustness & Good Power ( 2= .33, = .8) at =.05 for K=30 Note: Empty shaded cells indicate those conditions with ade quate Type I error control, but low power. Unshaded cells indicate conditions with poor Type I error control. Shaded ce lls containing data have both g ood Type I error and good power. *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure The increase in heterogeneity of effects to .33 (even with K=30) minimized the effectiveness of all tests, particularly with the small sample size of 10. Once sample sizes increased to 40 and above, only permuted Q demonstrated robustness and power. However, power was reduced, though sufficient for effectiveness for most conditions. 
Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/112/18sk=0.00, kr=0.00 5/51/212/18sk=0.00, kr=0.00 5/51/412/18sk=0.00, kr=0.00 5/51/112/18sk=1.00, kr=3.00 5/51/212/18sk=1.00, kr=3.00 5/51/412/18sk=1.00, kr=3.00 5/51/112/18sk=2.00, kr=6.00 5/51/212/18sk=2.00, kr=6.00 5/51/412/18sk=2.00, kr=6.00 4/61/112/18sk=0.00, kr=0.00 4/61/212/18sk=0.00, kr=0.00 4/61/412/18sk=0.00, kr=0.00 4/61/112/18sk=1.00, kr=3.00 4/61/212/18sk=1.00, kr=3.00 4/61/412/18sk=1.00, kr=3.00 4/61/112/18sk=2.00, kr=6.00 4/61/212/18sk=2.00, kr=6.00 4/61/412/18sk=2.00, kr=6.00 6/41/112/18sk=0.00, kr=0.00 6/41/212/18sk=0.00, kr=0.00 6/41/412/18sk=0.00, kr=0.00 6/41/112/18sk=1.00, kr=3.00 6/41/212/18sk=1.00, kr=3.00 6/41/412/18sk=1.00, kr=3.00 6/41/112/18sk=2.00, kr=6.00 6/41/212/18sk=2.00, kr=6.00 6/41/412/18sk=2.00, kr=6.00 20/201/112/18sk=0.00, kr=0.000.874 20/201/212/18sk=0.00, kr=0.00 20/201/412/18sk=0.00, kr=0.00 20/201/112/18sk=1.00, kr=3.000.878 20/201/212/18sk=1.00, kr=3.000.881 20/201/412/18sk=1.00, kr=3.000.885 20/201/112/18sk=2.00, kr=6.00 20/201/212/18sk=2.00, kr=6.00 20/201/412/18sk=2.00, kr=6.000.889 PAGE 189 179 Table 41 (continued) Power Estimates for Conditions Indicating Both Robustness & Good Power ( 2= .33, = .8) at =.05 for K=30 *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 16/241/112/18sk=0.00, kr=0.000.871 16/241/212/18sk=0.00, kr=0.000.874 16/241/412/18sk=0.00, kr=0.000.889 16/241/112/18sk=1.00, kr=3.000.879 16/241/212/18sk=1.00, kr=3.000.878 16/241/412/18sk=1.00, kr=3.000.894 16/241/112/18sk=2.00, kr=6.000.868 16/241/212/18sk=2.00, kr=6.000.880 16/241/412/18sk=2.00, kr=6.000.891 24/161/112/18sk=0.00, kr=0.000.869 24/161/212/18sk=0.00, kr=0.000.867 24/161/412/18sk=0.00, kr=0.00 24/161/112/18sk=1.00, kr=3.000.878 24/161/212/18sk=1.00, kr=3.000.876 24/161/412/18sk=1.00, 
kr=3.000.864 24/161/112/18sk=2.00, kr=6.000.867 24/161/212/18sk=2.00, kr=6.000.879 24/161/412/18sk=2.00, kr=6.000.875 100/1001/112/18sk=0.00, kr=0.000.935 100/1001/212/18sk=0.00, kr=0.000.935 100/1001/412/18sk=0.00, kr=0.000.935 100/1001/112/18sk=1.00, kr=3.000.937 100/1001/212/18sk=1.00, kr=3.00 100/1001/412/18sk=1.00, kr=3.000.939 100/1001/112/18sk=2.00, kr=6.000.934 100/1001/212/18sk=2.00, kr=6.000.939 100/1001/412/18sk=2.00, kr=6.000.934 80/1201/112/18sk=0.00, kr=0.000.936 80/1201/212/18sk=0.00, kr=0.000.936 80/1201/412/18sk=0.00, kr=0.000.937 80/1201/112/18sk=1.00, kr=3.00 80/1201/212/18sk=1.00, kr=3.000.939 80/1201/412/18sk=1.00, kr=3.000.935 80/1201/112/18sk=2.00, kr=6.000.932 80/1201/212/18sk=2.00, kr=6.000.938 80/1201/412/18sk=2.00, kr=6.000.933 120/801/112/18sk=0.00, kr=0.000.931 120/801/212/18sk=0.00, kr=0.000.930 120/801/412/18sk=0.00, kr=0.000.923 120/801/112/18sk=1.00, kr=3.000.926 120/801/212/18sk=1.00, kr=3.000.937 120/801/412/18sk=1.00, kr=3.000.928 120/801/112/18sk=2.00, kr=6.000.933 120/801/212/18sk=2.00, kr=6.000.932 120/801/412/18sk=2.00, kr=6.000.931 PAGE 190 180 Table 42 Power Estimates for Conditions Indicating Both Robustness & Good Power ( 2= .33, = .8) at =.10 for K=30 Note: Empty shaded cells indicate those conditions with adequate Type I error control, but low power. Unshaded cells indicate conditions with poor Type I error control. Shaded cells containing data have both goo d Type I error and good power. 
*Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 5/51/112/18sk=0.00, kr=0.00 5/51/212/18sk=0.00, kr=0.00 5/51/412/18sk=0.00, kr=0.00 5/51/112/18sk=1.00, kr=3.00 5/51/212/18sk=1.00, kr=3.00 5/51/412/18sk=1.00, kr=3.000.800 5/51/112/18sk=2.00, kr=6.00 5/51/212/18sk=2.00, kr=6.000.799 5/51/412/18sk=2.00, kr=6.000.810 4/61/112/18sk=0.00, kr=0.00 4/61/212/18sk=0.00, kr=0.00 4/61/412/18sk=0.00, kr=0.000.795 4/61/112/18sk=1.00, kr=3.00 4/61/212/18sk=1.00, kr=3.000.8070.812 4/61/412/18sk=1.00, kr=3.000.7990.821 4/61/112/18sk=2.00, kr=6.00 4/61/212/18sk=2.00, kr=6.000.809 4/61/412/18sk=2.00, kr=6.000.8240.8440.846 6/41/112/18sk=0.00, kr=0.00 6/41/212/18sk=0.00, kr=0.00 6/41/412/18sk=0.00, kr=0.00 6/41/112/18sk=1.00, kr=3.00 6/41/212/18sk=1.00, kr=3.00 6/41/412/18sk=1.00, kr=3.00 6/41/112/18sk=2.00, kr=6.00 6/41/212/18sk=2.00, kr=6.00 6/41/412/18sk=2.00, kr=6.000.796 20/201/112/18sk=0.00, kr=0.000.932 20/201/212/18sk=0.00, kr=0.000.934 20/201/412/18sk=0.00, kr=0.000.932 20/201/112/18sk=1.00, kr=3.000.935 20/201/212/18sk=1.00, kr=3.000.938 20/201/412/18sk=1.00, kr=3.000.944 20/201/112/18sk=2.00, kr=6.000.931 20/201/212/18sk=2.00, kr=6.000.938 20/201/412/18sk=2.00, kr=6.000.944 16/241/112/18sk=0.00, kr=0.000.931 16/241/212/18sk=0.00, kr=0.000.932 16/241/412/18sk=0.00, kr=0.000.942 16/241/112/18sk=1.00, kr=3.000.938 16/241/212/18sk=1.00, kr=3.000.939 16/241/412/18sk=1.00, kr=3.000.944 16/241/112/18sk=2.00, kr=6.000.927 16/241/212/18sk=2.00, kr=6.000.937 16/241/412/18sk=2.00, kr=6.000.947 PAGE 191 181 Table 42 (continued) Power Estimates for Conditions Indicating Both Robustness & Good Power ( 2= .33, = .8) at =.10 for K=30 *Reg Q=Regular Q; PQb=Permuted Q Between; RE=Randomeffects Z; FE=F ixedeffects Z; CR=Conditionally Random Procedure The increase in nominal alpha to .10 slightly fac ilitated the effectiveness of the 
permuted Q, RE and CR tests when sample sizes were small. But only permuted Q retained power and robustness when sample sizes were 40 and above. Permuted Qs eff ectiveness was high and slightly better with increased nominal alpha, .10. Primary Study Sample Sizes Population Variances N of studies Population ShapeReg Q PQbREFECR 24/161/112/18sk=0.00, kr=0.000.930 24/161/212/18sk=0.00, kr=0.000.928 24/161/412/18sk=0.00, kr=0.000.931 24/161/112/18sk=1.00, kr=3.000.933 24/161/212/18sk=1.00, kr=3.000.935 24/161/412/18sk=1.00, kr=3.000.932 24/161/112/18sk=2.00, kr=6.000.926 24/161/212/18sk=2.00, kr=6.000.935 24/161/412/18sk=2.00, kr=6.000.932 100/1001/112/18sk=0.00, kr=0.000.971 100/1001/212/18sk=0.00, kr=0.000.970 100/1001/412/18sk=0.00, kr=0.000.968 100/1001/112/18sk=1.00, kr=3.000.969 100/1001/212/18sk=1.00, kr=3.000.974 100/1001/412/18sk=1.00, kr=3.000.971 100/1001/112/18sk=2.00, kr=6.000.970 100/1001/212/18sk=2.00, kr=6.000.972 100/1001/412/18sk=2.00, kr=6.000.969 80/1201/112/18sk=0.00, kr=0.000.970 80/1201/212/18sk=0.00, kr=0.000.970 80/1201/412/18sk=0.00, kr=0.000.971 80/1201/112/18sk=1.00, kr=3.000.972 80/1201/212/18sk=1.00, kr=3.000.975 80/1201/412/18sk=1.00, kr=3.000.970 80/1201/112/18sk=2.00, kr=6.000.966 80/1201/212/18sk=2.00, kr=6.000.972 80/1201/412/18sk=2.00, kr=6.000.970 120/801/112/18sk=0.00, kr=0.000.968 120/801/212/18sk=0.00, kr=0.000.969 120/801/412/18sk=0.00, kr=0.000.960 120/801/112/18sk=1.00, kr=3.000.962 120/801/212/18sk=1.00, kr=3.000.966 120/801/412/18sk=1.00, kr=3.000.963 120/801/112/18sk=2.00, kr=6.000.970 120/801/212/18sk=2.00, kr=6.000.972 120/801/412/18sk=2.00, kr=6.000.967 PAGE 192 182 Power Estimates Summary In brief, increasing K tended to enhance the power of all of the tests, particularly permuted Q. Changing this variable in combination with changing 2 appeared to have some unique effect in that 2 =0 at K=10 resulted in several conditions for completely effective use of all tests. 
But only once K was elevated to 30 for τ² = .33 did the permuted Q test work effectively (and then only as effectively as when τ² = 0 at K=10, and not as effectively as when K=30). Raising nominal alpha from .05 to .10 mitigated this influence of increasing τ² on power, particularly for permuted Q.

Furthermore, power and Type I error control improved as primary study sample sizes increased from 10 to 40. This influence was most salient when τ² = 0. Surprisingly, the opposite pattern arose when heterogeneity of effects was .33 and nominal alpha was .10 at K=30: effectiveness improved for the RE and CR tests at small sample sizes only.

When τ² = 0 and alpha = .05, an increase in K from 10 to 30 (see Tables 39 and 40) most significantly affected permuted Q's power. When K=10, 48 of the 81 conditions had both adequate Type I error control and good power. At K=30, permuted Q not only demonstrated adequate Type I error control, but also an increased capacity for good power (78 of the 81 conditions). The other 4 tests (Regular Q, RE, FE and CR) produced far fewer conditions with both adequate Type I error control and good power. The number of such conditions did not change appreciably from K=10 to 30, as it did for permuted Q (see Tables 39 and 40). Only the RE test produced an increase in the number of effective conditions (from 19 to 24) under these conditions.

With respect to the combination of τ² = .33, δ = .8 and nominal alpha = .05 at K=10, a table of power estimates is not presented because none of the conditions for any of the tests, including permuted Q, exhibited both good power and adequate Type I error control. Only once K was increased to 30 for the same set of conditions (see Table 41) did permuted Q yield both adequate Type I error control and good power. Permuted Q's performance with this set of conditions was comparable to its performance at τ² = 0, δ = .8 at K=10 (Table 39) in that sufficient power was absent when sample sizes were small.
Though good power was absent, the RE and CR tests had a few conditions for which each enabled adequate control of Type I error.

In order to determine the effect of nominal alpha, if any, alpha was elevated from .05 to .10 for the combination of τ² = .33, δ = .8 at K=10 and K=30. The combination at K=10 did not result in any conditions with adequate Type I error control and good power. But at K=30, there was some positive effect on the power of permuted Q, and to a lesser extent on that of the RE and CR tests. When α = .05 under the same constraints, neither RE nor CR provided adequate Type I error control or good power for any of the conditions. Permuted Q produced 47 such conditions, as compared with 61 conditions under K=30 (see Tables 39 and 40).

Answers to Research Questions

1. To what extent is the Type I error rate of the fixed-effects Q (FE), permuted Q, random-effects (RE) and conditionally-random procedure (CR) maintained near the nominal alpha level across variations in the number of primary studies included in the meta-analysis, sample sizes in primary studies, heterogeneity of variance, varying degrees of heterogeneity of effects (τ²) and primary study skewness and kurtosis?
a. Small K tended to inflate Type I error for all of the tests (except for Permuted Q), particularly as τ² increased.
b. Fluctuations in the number of primary studies influenced the Type I error rates as K varied from 10 to 30. There tended to be greater inflation of Type I error rates when the primary study sample sizes were small, especially when K was smaller.
c. Variance within primary studies did not produce significant changes in the median rejection rates of the RE, CR and Permuted Q tests. Variance had the greatest influence on the distribution of rejection rates for the regular Q test.
d. Heterogeneity of effects (τ²) had the greatest impact on the performance of each of the tests. As τ² increased from 0 to .33, Type I error became increasingly inflated.
Only Permuted Q was unaffected by varying the τ² values, as it held the error rates near the nominal alpha of .05.
e. Skewness and kurtosis in isolation did not tend to contribute to inflated Type I error for any of the tests.

2. What is the relative statistical power of the fixed-effects Q (FE), permuted Q, random-effects (RE) and conditionally-random procedure (CR) given variations in the number of primary studies included in the meta-analysis, sample sizes in primary studies, heterogeneity of variance, varying degrees of heterogeneity of effects (τ²) and primary study skewness and kurtosis?
a. Small K diminished power for all tests, including permuted Q.
b. As primary study sample sizes increased from 10 to 40, power tended to increase for all tests.
c. Varying the variance within primary studies did not exert notable influences on the power of any of the tests.
d. Increasing τ² did not enhance power for these tests (see Tables 30 and 32; Tables 31 and 34). This pattern continued whether K equaled 10 or 30.
e. Varying skewness and kurtosis had no remarkable effect on power.

Chapter Five
Interpretations and Conclusions

Following is a summary of the present research, including a review and discussion of the results, consideration of the limitations of the research design, and recommendations for future studies.

Summary

The purpose of this study was to investigate the power and Type I error control of the permuted Q, random-effects Z test, fixed-effects Z test, conditionally-random procedure and regular Q test under varying levels of heterogeneity of effects (τ² = 0, τ² = .33, and τ² = 1) and at α level .05, as well as three variance ratios, two different K and N=10, 40 and 200. The relative effectiveness of these three tests (and one conditionally-random procedure) was compared under varying conditions of K, variance within groups, primary study sample sizes within meta-analytic studies, population shape, as well as τ².
The comparison of the relative performance of these three tests of homogeneity of effects and the conditionally-random procedure within each set of controlled conditions should enhance the appropriateness of practitioners' test selection for meta-analysis. The research questions propelling this investigation were the following:

1. To what extent is the Type I error rate of the regular fixed-effects Q (regQ), permuted Qbet, fixed-effects Z (FE), random-effects Z (RE) and conditionally-random procedure (CR) maintained near the nominal alpha level across variations in the number of primary studies included in the meta-analysis, sample sizes in primary studies, heterogeneity of variance, varying degrees of heterogeneity of effects (τ²) and primary study skewness/kurtosis?

2. What is the relative statistical power of the regular fixed-effects Q (regQ), permuted Qbet, fixed-effects Z (FE), random-effects Z (RE) and conditionally-random procedure (CR) given variations in the number of primary studies included in the meta-analysis, sample sizes in primary studies, heterogeneity of variance, varying degrees of heterogeneity of effects (τ²) and primary study skewness/kurtosis?

Results of each of the 1,458 experimental conditions arising from the factoring of seven independent variables across each of the three tests of homogeneity and the conditionally-random procedure were presented. Effectiveness of each test was evaluated based on the proportion of the 5000 simulations of each meta-analytic condition reflecting adequate Type I error control at the nominal alpha level of .05. As defined by Bradley (1978), rejection rates above .055 are termed inflated, whereas those empirical values below .045 are termed conservative for nominal α = .05. For nominal alpha level .10, rejection rates above .11 are inflated, while those below .09 are conservative.

This study is modeled after Harwell (1997) and Kromrey and Hogarty's (1998) experimental design.
Specifically, it entails a 2 x 3 x 3 x 3 x 3 x 3 x 2 factorial design. The study also controls for between-studies variance, as suggested by Hedges and Vevea's (1998) study. The randomized factorial design includes seven independent variables: (1) number of studies within the meta-analysis (10 and 30); (2) primary study sample size (10, 40, 200); (3) score distribution skewness and kurtosis (0/0; 1/3; 2/6); (4) equal or unequal (around typical sample sizes, 1:1; 4:6; and 6:4) within-group sample sizes; (5) equal or unequal group variances (1:1; 2:1; and 4:1); (6) between-studies variance, τ² (0, .33, and 1); and (7) between-class effect size differences, δ (0 and .8). Data were obtained using two programs: one for null hypotheses (972 simulations) and the other for nonnull hypotheses (486 simulations). Hence, the study incorporated 1,458 experimental conditions, illustrated in Figure 1. Simulated data from each sample were analyzed using each of five tests of homogeneity (includes one procedure).

The dependent variable is, in part, the proportion of conditions with adequate Type I error control at the nominal alpha level of .05. Although not an original consideration of the study, the performance of each of these tests under nominal alpha level .10 was investigated under τ² = .33 and 1.0. Additionally, estimates of statistical power were computed for those conditions where tests maintained adequate Type I error control. These power estimates indicate the degree to which a test is sensitive to significant heterogeneity of effects in the presence of violated assumptions.

There were nine hundred seventy-two simulated data conditions, consisting of six sets of null conditions. These sets (81 conditions per set) of null conditions entailed the following: τ² = 0, δ = 0; τ² = 0, δ = .8; τ² = .33, δ = 0; τ² = .33, δ = .8; τ² = 1, δ = 0; τ² = 1, δ = .8. Each of these sets of conditions was submitted to a further condition, varying K.
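The condition counts stated in this summary can be checked with a short enumeration. The following Python sketch is illustrative only: it is not the simulation program used in the study, and all variable names are hypothetical. It assumes that the 81 conditions per set come from crossing the three primary study sample sizes, three population shapes, three sample-size ratios and three variance ratios, and that the nonnull program covers the three τ² levels at δ = .8 for both values of K.

```python
from itertools import product

# Factor levels as listed in the design description (names are hypothetical).
K_LEVELS = (10, 30)                      # number of studies in the meta-analysis
TOTAL_N = (10, 40, 200)                  # primary study sample size
SHAPES = ((0, 0), (1, 3), (2, 6))        # (skewness, kurtosis)
N_RATIOS = ("1:1", "4:6", "6:4")         # within-group sample-size ratios
VAR_RATIOS = ("1:1", "2:1", "4:1")       # group variance ratios
TAU2_LEVELS = (0, 0.33, 1)               # between-studies variance
DELTA_LEVELS = (0, 0.8)                  # between-class effect size difference

# Assumption: one "set" is the full crossing of the four within-study factors.
within_set = list(product(TOTAL_N, SHAPES, N_RATIOS, VAR_RATIOS))

# Null program: six (tau2, delta) sets, each run at both values of K.
null_conditions = list(product(K_LEVELS, TAU2_LEVELS, DELTA_LEVELS, within_set))

# Nonnull program: the three tau2 levels at delta = .8, each run at both values of K.
nonnull_conditions = list(product(K_LEVELS, TAU2_LEVELS, (0.8,), within_set))

total_conditions = len(null_conditions) + len(nonnull_conditions)
print(len(within_set), len(null_conditions), len(nonnull_conditions), total_conditions)
```

Under these assumptions the enumeration reproduces the figures given in the text: 81 conditions per set, 972 conditions from the null program, 486 from the nonnull program, and 1,458 in total.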
K equaled 10 for one (total number = 486) and 30 (total number = 486) for the next. Additionally, 486 data conditions (three sets) were generated, utilizing the nonnull program. These three sets consisted of the following: τ² = 0, δ = .8; τ² = .33, δ = .8; and τ² = 1, δ = .8. Again, each set of 81 conditions was simulated, assuming K = 10 for one set and K = 30 for the other. Five thousand iterations were simulated for each condition and an average rejection rate was calculated.

Results are presented as box and whisker plots of the Type I error rates, the proportion of simulations with adequate Type I error, average Type I error rates, and power value rates for each given test. The proportion of simulations was computed by adding up the number of conditions with adequate Type I error and then dividing that frequency by the number of conditions. Average Type I error rates were derived from the average of all error rates for a particular condition. Power value rates were based on the nonnull conditions. When experimental conditions exhibited Type I error control, based on Bradley's criterion for each nominal alpha, power analyses were completed. First, power detection rates for each nonnull condition were deemed either within good power limits (estimates greater than .795) or too low. For those conditions displaying both adequate Type I error control and good power, it was concluded that a particular test or tests demonstrated effectiveness under that set of conditions. Results define the extent of Type I error control and the comparable degree of power of the five tests being investigated (regular Q serves as a baseline, as its purpose is somewhat different from the other tests).
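The decision rules just described can be illustrated with a minimal Python sketch. The function names and the example rejection rates are hypothetical and are not taken from the study's simulation programs; only the bounds (.045 to .055 for nominal alpha .05, .09 to .11 for nominal alpha .10) and the good-power threshold (.795) come from the text.

```python
# Bounds on the empirical rejection rate quoted in the text for each nominal alpha.
BRADLEY_BOUNDS = {0.05: (0.045, 0.055), 0.10: (0.09, 0.11)}


def type1_status(rate, alpha=0.05):
    """Classify an empirical rejection rate as inflated, conservative or adequate."""
    lower, upper = BRADLEY_BOUNDS[alpha]
    if rate > upper:
        return "inflated"
    if rate < lower:
        return "conservative"
    return "adequate"


def is_effective(type1_rate, power, alpha=0.05, power_floor=0.795):
    """Effective = adequate Type I error control and power above the floor."""
    return type1_status(type1_rate, alpha) == "adequate" and power > power_floor


def proportion_adequate(rates, alpha=0.05):
    """Number of conditions with adequate control divided by number of conditions."""
    flags = [type1_status(rate, alpha) == "adequate" for rate in rates]
    return sum(flags) / len(flags)
```

For instance, under these assumptions `proportion_adequate([0.050, 0.060, 0.040, 0.052])` counts two of the four hypothetical conditions as adequately controlled.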
The effects of varying the within-study variance and population shape were of particular concern, as it has been demonstrated that increasing heterogeneity of variance within studies and increased skewness and kurtosis led to greater inflation of Type I error (Harwell, 1997; Kromrey & Hogarty, 1998). Also of interest, Chang (1993) had noted that few had addressed the issue of Type II error with respect to regular Q. Increasing K with a small N in the primary studies had resulted in increased inflation of the Type I error when applying regular Q (Hedges & Vevea, 1998). No subsequent studies had examined the relative influence of increased heterogeneity of effects (τ²) on the performance of regular Q, permuted Q, the RE test, FE test and the conditionally-random procedure. Lastly, the effect of altering the nominal alpha level from the commonly applied .05 to a more liberal .10 had not been investigated in a comparative analysis of these five tests, with respect to control of Type I error or degree of power. Therefore, the overall purpose in conducting these analyses is to provide practitioners with guidance as to when best to apply each of the five tests under a given set of conditions.

As demonstrated previously, the permuted Q maintained the greatest degree of robustness under each set of conditions. However, it was surprising to find that it did not evidence good power under increasing heterogeneity of effects.

As illustrated by the box and whisker plots, changes in the population shape did not have the appreciable effect on Type I error that had previously been reported (Harwell, 1997; Kromrey & Hogarty, 1998). As skewness/kurtosis increased from normal to more skewed and leptokurtic, no significant changes were evident. Though inflated Type I error is evident for all tests but permuted Q, the degree of inflation did not appear to change appreciably from normal to more skewed/kurtotic extremes.
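To make the permuted Q referred to above more concrete, the following Python sketch computes a fixed-effects Q-between statistic and a permutation p-value obtained by shuffling class labels across studies. It is a generic illustration under stated assumptions (known sampling variances, inverse-variance weights, hypothetical function names and data); the exact permutation scheme used in the simulations is not described in this section and may differ.

```python
import random


def q_between(effects, variances, classes):
    """Fixed-effects Q-between: sum over classes of W_j * (mean_j - grand_mean)^2."""
    weights = [1.0 / v for v in variances]
    grand = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = 0.0
    for label in sorted(set(classes)):
        idx = [i for i, c in enumerate(classes) if c == label]
        w_sum = sum(weights[i] for i in idx)
        mean_j = sum(weights[i] * effects[i] for i in idx) / w_sum
        q += w_sum * (mean_j - grand) ** 2
    return q


def permuted_q_between(effects, variances, classes, n_perm=2000, seed=0):
    """Return the observed Q-between and a permutation p-value.

    Assumption: class labels are exchangeable across studies under the null,
    so the reference distribution is built by shuffling the labels.
    """
    rng = random.Random(seed)
    observed = q_between(effects, variances, classes)
    labels = list(classes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        # Small tolerance guards against floating-point ties with the observed value.
        if q_between(effects, variances, labels) >= observed - 1e-12:
            hits += 1
    # Counting the observed arrangement itself keeps the p-value above zero.
    return observed, (hits + 1) / (n_perm + 1)
```

A call such as `permuted_q_between(d, v, labels)` takes the study effect sizes, their sampling variances and the class label of each study; the reference distribution comes from the data themselves rather than from a chi-square approximation.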
Consistent with the findings of the Hedges and Vevea (1998) study, the results of this investigation further demonstrate that the combination of increasing K with small N produces an increase in Type I error. When either N increased to 40 or above or K increased to 30, Type I error rates tended to decrease for all tests. As mentioned above, permuted Q was the only test relatively unaffected by these changes alone. Note that this finding is in isolation from the increase in heterogeneity of effects.

Investigating the issue of power, conditional on the adequacy of Type I error control, presents a more complete picture of the effectiveness of each test. It was surprising to note how often adequate Type I error control was not accompanied by correspondingly good power, particularly with respect to the permuted Q. Conversely, the other four tests often demonstrated insufficient control of Type I error.

As heterogeneity of effects increased from 0 to .33, there was a dramatic effect on the power of each of the tests. As τ² varied from 0 to .33, K played an integral role in the maintenance of Type I error control for all tests, except permuted Q. However, the same pattern proved relevant to the power of permuted Q. The effect from .33 to 1.0 was even more pronounced.

Applying a more liberal nominal alpha did little to enhance the effectiveness of the tests. Under increasing heterogeneity of effects, applying a more liberal nominal alpha (from .05 to .10) permitted the permuted Q to achieve greater power (Type I error was well-maintained regardless of nominal alpha). But the actual increase in the frequency of effective conditions was minimal. Referring to Table 43, there were increases across nominal alpha levels at the K = 30, τ² = .33 condition. Permuted Q, RE and CR tests increased in effectiveness from 0 to low effectiveness (less than 25% of the conditions showed effective application of that specific test) when alpha went from .05 to .10.
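The frequency labels reported in Table 43 follow from the share of conditions in a group that were effective. A minimal sketch of that mapping is given below; the function name and the counts in the usage note are hypothetical, and the treatment of shares of exactly 25% or 75% as medium is an assumption, since the table legend does not say which label applies at the boundaries.

```python
def frequency_label(n_effective, n_conditions):
    """Map a count of effective conditions to a Table 43 style label."""
    share = n_effective / n_conditions
    if share == 0:
        return "0"      # no condition in the group was effective
    if share < 0.25:
        return "low"    # more than 0 but fewer than 25% of conditions
    if share <= 0.75:
        return "med"    # between 25% and 75% (boundaries assumed to be medium)
    return "high"       # more than 75% of conditions
```

For example, a hypothetical group in which 5 of 27 conditions were effective would be labeled low, and one in which 25 of 27 were effective would be labeled high.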
Generally, the permuted Q maintained adequate Type I error control under all conditions of varying sample sizes within primary studies, population shape, variance within studies, effect size (from 0 to .8) and K. Additionally, it did appear that increasing K had a positive effect on permuted Q's power. When the effectiveness of permuted Q did decline, it was due to low power.

Table 43
Effective Application of Five Meta-analytic Tests of Homogeneity for True Null Conditions

                        Nominal α = .05                 Nominal α = .10
K    δ    τ²    N       RegQ  PQB   RE    FE    CR      RegQ  PQB   RE    FE    CR
10   .8   0     <40     0     0     0     0     0       0     0     0     0     0
10   .8   0     >40     low   high  med   med   med     low   high  med   med   med
10   .8   .33   <40     0     0     0     0     0       0     0     0     0     0
10   .8   .33   >40     0     0     0     0     0       0     0     0     0     0
10   .8   1     <40     0     0     0     0     0       0     0     0     0     0
10   .8   1     >40     0     0     0     0     0       0     0     0     0     0
30   .8   0     <40     0     high  low   low   low     0     high  low   low   low
30   .8   0     >40     low   high  med   med   med     low   high  med   med   med
30   .8   .33   <40     0     0     0     0     0       0     low   low   0     low
30   .8   .33   >40     0     high  0     0     0       0     high  0     0     0
30   .8   1     <40     0     0     0     0     0       0     0     0     0     0
30   .8   1     >40     0     0     0     0     0       0     0     0     0     0

low = low frequency for effectiveness: less than 25% of the conditions, but more than 0
med = medium frequency for effectiveness: between 25% and 75% of the conditions
high = high frequency for effectiveness: more than 75% of the conditions
0 = Type I error rate is outside Bradley's criterion for robustness and/or power was low for all conditions
Note: effectiveness is determined when a test exhibits both adequate Type I error control and good power for a given condition.

The RE test and conditionally-random procedure generally exhibited the next greatest degree of robustness. In contrast to the RE and CR tests, both the FE and regular Q tests were more sensitive in terms of maintaining Type I error control under most changes in each of the treatment effects. When effectiveness diminished for any of these four tests, it was generally due to a lack of robustness.

Discussion

The results with respect to the overall robustness of permuted Q are consistent with previous research (Kromrey & Hogarty, 1998; Hogarty & Kromrey, 1999). Over all of the varied conditions, permuted Q enabled sufficient Type I error control. But there was an unexpected lack of power from one level of heterogeneity of effects to the next. Power diminished as heterogeneity of effects increased. There was limited enhanced power for permuted Q as nominal alpha became more liberal from .05 to .10, evidenced at K = 30, τ² = .33. The RE and CR tests also displayed similar small increases in frequency of effectiveness at this same level of heterogeneity of effects.

The introduction of heterogeneous variance tended to inflate Q's Type I error rate (Hogarty & Kromrey, 1999). In the present study, within-studies variance also impacted the frequency of conditions with Type I error control for the RE, FE and CR tests, particularly for τ² from 0 to .33. In general, as variance ratios increased, the frequency of conditions evidencing control of Type I error decreased. When τ² = 0, margin error rates for these 3 tests tended to be conservative. As τ² increased from 0 to .33, margin error rates for these 3 tests tended to inflate. As will be discussed, as heterogeneity of effects increased, Type I error control diminished substantially.
More specifically, when variance was equal, the greatest frequency of well-controlled Type I error conditions occurred for all 3 tests, regardless of whether K = 10 or 30 for the τ² = 0, δ = .8 set of conditions (see Table 4). Proportions of conditions with adequate Type I error for the RE, FE and CR tests were .41, .33 and .48, respectively, for K = 10. At K = 30, proportions were .44, .22 and .26. Permuted Q presented better than 85% of its conditions with adequate Type I error at each of the variance ratios. This pattern occurred consistently for permuted Q across all variance ratios and for all combinations of heterogeneity of effects and delta. As for the other tests, few proportions greater than zero were evidenced for any other conditions beyond the τ² = 0, δ = .8 and τ² = 0, δ = 0 (refer both to Table 43 and proportion Tables 5-8). As increases in variance ratios were introduced, the frequency of conditions with well-controlled Type I error for each of these tests fell consistently, especially for K = 10. At K = 30, the CR test had the same frequency of conditions with well-maintained Type I error control at variance ratios of 1/1 and 1/2, with a drop at variance 1/4. The RE test manifested a notable drop in frequency of these well-controlled conditions from 1/1 to 1/2, with a minimal increase at 1/4. The FE test exhibited a continued decrease in frequency of these conditions from the normal to more extreme variance conditions. At τ² = .33 (when K = 30) with a liberal nominal alpha of .10, the RE test had a slight increase in the frequency of conditions with well-maintained Type I error control as within-study variance increased (see Table 33). The CR test had an equal number of well-maintained Type I error conditions at 1/2 and 1/4 and no conditions with well-maintained Type I error at the equal variance condition.

The combined effect of heterogeneous variances and unequal sample sizes was to reduce the power of any permutation test (Hogarty & Kromrey, 1999).
But when the larger sample originated from the group with the larger variance, permutation tests had the highest power. Power was diminished for equal sample sizes or for negative pairings of sample size and population variance. Kromrey and Hogarty (1998) found these negative pairings of sample size and variance (where the first sample size is larger than that of the second group) result in increasingly inflated Type I error, as compared to the condition when there are equal sample sizes between groups and unequal variances. Alternatively, Harwell (1997) and Kromrey and Hogarty (1998) found that positive pairings (small unequal variance and small unequal sample sizes) resulted in conservative Type I error rates.

When N was 40 or greater, tests attained good power only with larger δ (Hogarty & Kromrey, 1999). Because only two effect sizes (0 and .8) were applied, this current study cannot shed additional light on changes in δ. However, in general, the number of conditions where any of the tests exhibited good power increased as the total sample size increased from 10 to 40, particularly for τ² = 0 or .33. The present study also demonstrated that the frequency of conditions with good power increased as either K increased or nominal alpha was expanded from .05 to .10, for all tests (for τ² = .33). At τ² = 1, tests providing sufficient control of Type I error (the RE, CR, and permuted Q tests) also presented low power, regardless of whether K increased or nominal alpha was liberalized. Therefore, none of these tests would be effectively applied under these conditions.

The relationship between K and within-study sample size (N) has been discussed at length (Hedges & Vevea, 1998). Specifically for the regular Q, when N is small (less than 40) and constant, Type I error control deteriorates as K increases. The present study seems to bear out this result.
But for other tests, such as permuted Q, RE, and CR, Type I error control improves as K increases for the τ² = 0 condition (when α = .05) and to a more limited extent for τ² = .33 (when α = .10). Referring to Table 43, one can determine that if N was less than 40 with K less than 30, power and Type I error control were curtailed, even when τ² = 0. But as K increased to 30, keeping all else equal, both power and Type I error control improved dramatically for all tests but regular Q. The ameliorative effect arose whether nominal alpha equaled .05 or .10 (for τ² = 0).

Table 43 represents a compilation of the power estimate and power detection tables. It provides an overview of recommendations for the appropriate application of each of the tests evaluated by this study. The number of conditions was counted to derive a frequency for each distinct group of conditions (each τ² by δ by N and K group) for the two nominal alphas. Further clarification can be obtained from the power estimate and power detection tables.

Further evidence of the N to K relationship emerged when τ² was raised to .33. In general, as τ² increased, Type I error increased. Only once K increased to 30 and alpha was elevated to .10 (see Table 43) did robustness improve for all tests but permuted Q. When τ² = .33 (for α = .05) and sample size increased to 40 and above, permuted Q's effectiveness increased from 0 to high. At nominal alpha = .10, robustness showed a decrement from low effectiveness to 0 as sample size (N) increased to 40 and above. This pattern contradicted what has been reported in the literature as the impact on regular Q's performance. Also in contrast, permuted Q's effectiveness improved from low to medium, as N increased to 40 and above. Moreover, Type I error control continued to erode for all tests but permuted Q as τ² increased to 1.
Again, only an increase in nominal alpha to .10 ameliorated Type I error control for the RE and CR tests as τ² increased to .33, and this improvement was minimal (see Table 43). Permuted Q showed the greatest improvement as τ² increased to .33 and N increased to 40 (nominal alpha = .05).

In contrast to Hedges and Vevea's (1998) findings, but consistent with the results found by Kromrey and Hogarty (1998), permuted Q was the only statistic to provide rejection rates closely approximating nominal α across all conditions. Even under conditions of extreme heterogeneity of effects (τ²), permuted Q tended to maintain adequate Type I error control more often than either the RE or CR tests. However, as discovered by Hedges and Vevea, both of these latter tests did provide superior Type I error control (though still exceeding nominal α) as compared to either the regular Q or FE tests. Furthermore, Chang (1993) found that for Q, as K increased and sample size decreased, greater departures arose between the theoretical and simulated power. More specifically, Type I error results from Q's use under these conditions.

Increasing K has been thought to exacerbate the influence of heterogeneous effects (τ²). In particular, Hedges and Vevea (1998) suggest that under such conditions RE and CR tests produce conservative rejection rates. Chang (1993) contends that K had no such effect on the RE test, although total sample size within the primary studies did. In the present study, as the heterogeneity of effects (τ²) increased, RE and CR's rejection rates remained more conservative than Q's. However, the Type I error rate still exceeded nominal alpha when τ² > 0. This result occurred whether δ equaled 0 or .8. As τ² increased from 0 to 1, while holding δ at 0, all of the tests, except permuted Q, produced inflated Type I error.
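The inflation described above can be reproduced in miniature with a simplified Monte Carlo sketch. It is not the simulation program used in this study: observed effects are drawn directly from a normal distribution with a common, known sampling variance v (an assumed value of 0.1), the studies are split into two equal classes, and only the fixed-effects Q-between test at nominal alpha .05 is examined. With δ = 0, any rejection is a Type I error.

```python
import random

CHI2_CRIT_DF1_05 = 3.841  # upper 5% point of the chi-square distribution, 1 df


def fe_q_between_two_classes(effects, v):
    """Fixed-effects Q-between for two equal-sized classes with common variance v."""
    half = len(effects) // 2
    mean_a = sum(effects[:half]) / half
    mean_b = sum(effects[half:]) / half
    # Under the fixed-effects model, Var(mean_a - mean_b) = 2 * v / half.
    return (mean_a - mean_b) ** 2 / (2 * v / half)


def rejection_rate(tau2, k=10, v=0.1, reps=5000, seed=1):
    """Empirical rejection rate when delta = 0 and between-studies variance is tau2."""
    rng = random.Random(seed)
    sd = (v + tau2) ** 0.5  # true sd of each observed effect under the random model
    rejections = 0
    for _ in range(reps):
        effects = [rng.gauss(0.0, sd) for _ in range(k)]
        if fe_q_between_two_classes(effects, v) > CHI2_CRIT_DF1_05:
            rejections += 1
    return rejections / reps
```

Under these assumptions the statistic is distributed as (v + τ²)/v times a chi-square variate with 1 degree of freedom, so calling `rejection_rate` with τ² = 0, .33 and 1 should give a rate near the nominal level in the first case and increasingly inflated rates in the other two.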
Practitioners responsible for evaluating treatments investigated through meta-analytic procedures seem faced with the choice of (a) applying a permuted test which, though robust under most circumstances, does not always exhibit good power, particularly under increasingly heterogeneous effects or (b) utilizing a test which does not maintain adequate control of Type I error across a wide variety of circumstances.

Researchers determined to evaluate treatments investigated by meta-analytic methods are advised to utilize statistical tests in the following manner:

When K = 10, for τ² = 0 at nominal α = .05, permuted Q can produce both good power and a consistently well-controlled Type I error rate. The RE, FE and CR tests performed effectively in fewer than half of the cases. Similar performance can be expected at α = .10, holding all other factors constant. Unfortunately, none of the 5 tests investigated proved effective for use when K = 10 and sample size was less than 40. Specifically, all of the tests but permuted Q were not robust to Type I error and power was limited. Permuted Q, though robust, was ineffective due to low power. This problem was exacerbated as τ² increased to anything above 0. The same outcome occurred whether nominal alpha was .05 or .10.

As K increased to 30, all tests, except the regular Q, displayed a marked increase in effectiveness from K = 10, at τ² = 0. Again, permuted Q evidenced the highest degree of effectiveness in terms of frequency of conditions where both Type I error was well-controlled and good power demonstrated. Again, the RE, FE and CR tests proved to be limited in their effectiveness when N was less than 40, whether nominal alpha equaled .05 or .10. As sample size increased to 40 and higher, effectiveness improved for all tests. Again, permuted Q manifested the greatest degree of effectiveness. Performance was comparable across alpha levels.
Once τ² increased to .33 (nominal alpha .05), none of the tests were effective until sample size increased to 40. And then, only permuted Q provided both effective detection of true differences and maintained robustness to residual variance. As τ² increased to .33 (nominal alpha .10) and sample size was less than 40, the permuted Q, RE and CR tests all demonstrated equal, but limited, effectiveness. This outcome is somewhat unusual in that typically more conditions evidence effective control of Type I error and good power as sample size increases. In this case, the RE and CR tests actually proved to be more effective when sample sizes were small. As sample size increased to 40 and above, only the permuted Q (for both nominal alpha .05 and .10) demonstrated any effectiveness.

None of the tests were effective in providing both good power and robustness to Type I error as τ² increased to 1. This result was manifest whether nominal alpha was set to .05 or .10.

Hedges and Vevea (1998) recommend applying the random-effects model (RE or CR test) when K is large and generalizations are to be made to a broader universe of studies, as it provides more conservative estimates of the effects. Unfortunately, applying the random-effects test presents one with the dilemma of having a low frequency of effectiveness when N is less than 40 at τ² = 0, K = 30 and medium frequency of effectiveness when N is 40 or greater under the same conditions. (It should be noted that this pattern of results was evident whether nominal alpha was set at .05 or .10.) However, this approach may be appropriate if one decides a priori that the generalizations they will be making go beyond the immediate sample or a sample with simple, well-defined characteristics. The random-effects approach provided no practical utility when τ² exceeded 0, unless nominal alpha was set to .10, N was less than 40, and τ² = .33. The frequency of effectiveness was minimal, however, even under the latter set of conditions.
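For practitioners who wish to consult these recommendations programmatically, Table 43 can be encoded as a simple lookup, sketched below in Python. The dictionary layout and function name are hypothetical. The τ² values attached to each group of rows (0, .33 and 1, in that order) are inferred from the surrounding discussion and should be checked against the original table.

```python
TESTS = ("RegQ", "PQB", "RE", "FE", "CR")
ZERO_ROW = ("0",) * 5
MED_ROW = ("low", "high", "med", "med", "med")
LOW_ROW = ("0", "high", "low", "low", "low")

# Keys: (nominal alpha, K, tau-squared, N band); values follow the order in TESTS.
TABLE_43 = {}
for alpha in (0.05, 0.10):
    # Start with every cell at "0", then fill in the rows that differ.
    for k in (10, 30):
        for tau2 in (0.0, 0.33, 1.0):
            for n_band in ("<40", ">40"):
                TABLE_43[(alpha, k, tau2, n_band)] = ZERO_ROW
    TABLE_43[(alpha, 10, 0.0, ">40")] = MED_ROW
    TABLE_43[(alpha, 30, 0.0, "<40")] = LOW_ROW
    TABLE_43[(alpha, 30, 0.0, ">40")] = MED_ROW
    TABLE_43[(alpha, 30, 0.33, ">40")] = ("0", "high", "0", "0", "0")
# The only cell group that differs between the two nominal alphas.
TABLE_43[(0.10, 30, 0.33, "<40")] = ("0", "low", "low", "0", "low")


def effective_tests(alpha, k, tau2, n_band):
    """Return the tests with a nonzero effectiveness label for one group of conditions."""
    row = TABLE_43[(alpha, k, tau2, n_band)]
    return {test: label for test, label in zip(TESTS, row) if label != "0"}
```

A query such as `effective_tests(0.05, 30, 0.33, ">40")` then lists only the tests that showed any effectiveness for that combination of nominal alpha, K, τ² and sample size band.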
Permuted Q can best be applied to conditions similar to those in which the RE and CR tests can be effectively applied, but where the investigator has decided a priori to draw generalizations more limited to the immediate sample of studies. The permuted Q has the added advantage of greater Type I error control under a wider variety of conditions than any of the other tests. Therefore, there are a greater number of instances in which the permuted Q would demonstrate greater robustness to the commission of Type I errors as well as a greater tendency to detect true differences. Lastly, the permuted Q can also be effectively applied when τ² increases to .33, as long as K is greater than 10. As with the other tests investigated, it did not demonstrate practical utility when τ² increased to .33 or 1 at K = 10 (whether nominal alpha = .05 or .10) and when τ² = 1 (whether nominal alpha = .05 or .10) at K = 30.

Limitations

Normality of effects is not being investigated. The random-effects test, for example, requires a normal distribution of effects (Raudenbush, 1994, p. 317). Even a single outlying effect can introduce bias into the computation of Q. Only sensitivity analysis, wherein the outlier(s) are omitted, can accurately predict the consequences of skewed distributions of effects.

As regression is used after the computation of Q to compute the relationship of potential moderator variables and the effect size, there may be a problem with model overfit (Raudenbush, 1994). Model overfit occurs when there are multiple predictor or independent variables to a single or just a few criterion or dependent variables. This study incorporated 30 studies to a single dependent variable. Each of these studies introduces a multitude of variables not necessarily common to all of the studies in the collection.
Having such a large K to dependent variable ratio could be problematic in the calculation of regression weights using Weighted Least Squares (typically completed after Q is computed and heterogeneity is determined).

This study investigated the effects of several varying conditions on the performance of five tests of homogeneity as applied to unconditional inferences only. However, most researchers still draw conclusions appropriate to the selection of unconditional inferences while inconsistently applying the fixed-effects model and corresponding tests (National Research Council, 1992). More specifically, unconditional inferences pertain to situations where one extends conclusions about a sample's performance, given some treatment, to a larger (usually less well-defined) population. Meta-analysts do not typically limit their conclusions about the performance of a particular sample to that sample. Such an approach severely curtails the generalization of their work, prohibiting the development of heuristic theory. Moreover, there is a common understanding that one can no longer assume the collection of a completely representative sample. With the prevalent use of the internet and other automated data sources, it is no longer possible to select a sample from a well-delimited population of studies. Therefore, conclusions about a sample must account for the added uncertainty inherent in the relationship of that sample to some larger, more abstract population to permit valid generalization.

Other potentially influential conditions were not examined. Only a handful of sample sizes and distribution shapes were under investigation in the present study. Related to this issue, there are innumerable potential factors influencing the effect size in any given study (Fern & Monroe, 1996). For this reason, one can never be absolutely certain the variance is attributable to the independent variables under investigation.
Differing conditions change the robustness of the statistical properties. As Overton (1998) suggests, "...one of the most important considerations in selecting a meta-analysis model is the contextual conditions in which the effect of interest (e.g., a selection test's validity) is to be generalized in theory or in application" (p. 376). In other words, a statistic is only meaningful given the assumptions applicable to the situation in which it is being used. No single statistic can be expected to be robust to all conditions.

Furthermore, realistic data often entail widely differing data characteristics, not modeled by generated data sets. Computer programs simulated data to be analyzed within this Monte Carlo study in which distribution shapes, study sample sizes, extent of variance across studies and random variation in true effects have been controlled. Moreover, the models defining the statistical tools used were methodically alternated. For this reason, these results should be confirmed by direct application of these tests to data collected in actual educational settings.

Topics for Additional Research

Future research is required to consider the effect of heterogeneity of effects on the use of trimmed-d with nonnormal populations (Hogarty & Kromrey, 1999). As many Federally-funded school programs are targeted to populations where negatively skewed performance is common, it would be important to understand how other statistics like the trimmed-d operate when incorporating various degrees of heterogeneity of effects. It is in such school programs where evaluation of treatment programs must take into account the unique characteristics of the performance-related data.

Additional study is warranted to determine the extent of the problem of model overfit in the recalculation of the regression once Q has been computed. If overfit is a problem, what measures will need to be taken in order to ensure proper computation of the regression weights?
As the regression weights are required to compute the relationship of potential moderator variables and the effect size, the problem of model overfit is an important issue in the final evaluation of treatment programs.

An investigation of the use of effect size indices other than the two-group sort incorporated in the same tests of homogeneity would be useful. School programs frequently organize students in one of several groups according to general ability level, thereby classifying students in more than two groups. Therefore, evaluating these tests of homogeneity using two-group effect size indices limits practitioners' understanding of how effectively these tests operate in realistic settings.

As schools do not have the luxury of consistently restricting class sizes to a given number, it would be worthwhile to evaluate the introduction of unequal sample sizes within any given condition. In this way, the behavior of each of the tests of homogeneity can be more realistically evaluated.

References

Abelson, R.P. (1997). A retrospective on the significance test ban of 1999 (If there were no significance tests, they would be invented). In L.L. Harlow, S.A. Mulaik and J.H. Steiger (Eds.), What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum Associates, pp. 117-141.
Bangert-Drowns, R.L. (1986). Review of developments in meta-analytic method. Psychological Bulletin, 99(3), 388-99.
Becker, B.J. (1992). Using results from replicated studies to estimate linear models. Journal of Educational Statistics, 17(4), 341-362.
Becker, B.J. (1994). Combining significance levels. In The Handbook of Research Synthesis. New York: Russell Sage Foundation.
Bollen, K.A. (1989). Structural Equations with Latent Variables. New York: John Wiley & Sons, Inc.
Carver, R.P. (1978). The case against statistical significance testing. Harvard Educational Review, 48(3), 378-399.
Chang, L. (1993).
Power analysis of the test of homogeneity in effect size meta-analysis. Unpublished doctoral dissertation, Michigan State University, East Lansing.
Chatterjee, S. & Price, B. (1977). Regression Analysis by Example. New York: John Wiley & Sons.
Chelimsky, E. (1998). The role of experience in formulating theories of evaluation practice. American Journal of Evaluation, 19(1), 35-55.
Chow, S.L. (1998). Precis of statistical significance: Rationale, validity, and utility. Behavioral and Brain Sciences, 21, 169-239.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, Second Edition. New Jersey: Lawrence Erlbaum Associates, Publishers.
Cooper, H. & Hedges, L.V. (Eds.) (1994). The Handbook of Research Synthesis. New York: Russell Sage Foundation.
Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Diaconis, P. & Efron, B. (1983). Computer-intensive methods in statistics. Scientific American, 116-30.
Draper & Smith (1998). Applied Regression Analysis, 3rd edition.
Efron, B. & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37(1), 36-48.
Erez, A., Bloom, M.C. & Wells, M.T. (1996). Using random rather than fixed effects models in meta-analysis: Implications for situational specificity and validity generalization. Personnel Psychology, 49, 275-306.
Fisher, R.A. (1959). Statistical methods and scientific inference. Edinburgh: Oliver & Boyd.
Fleishman, A.I. (1978). A method for simulating nonnormal distributions. Psychometrika, 43(4), 521-532.
Fleiss, J.L., & Gross, A.J. (1991). Meta-analysis in epidemiology, with special reference to studies of the association between exposure to environmental tobacco smoke and lung cancer: A critique. Journal of Clinical Epidemiology, 44, 127-139.
Fowler, R.L.
(1988). Estimating the standardized mean difference in intervention studies. Journal of Educational Statistics, 13(4), 337-50.
Glass, G.V. (1977). Integrating findings: The meta-analysis of research. Review of Research in Education, 5, 351-79.
Glass, G.V. & Hopkins, K.D. (1984). Statistical Methods in Education and Psychology, 2nd edition. Boston: Allyn & Bacon.
Glass, G.V., McGaw, B. & Smith, M.L. (1981). Meta-Analysis in Social Research. Beverly Hills: Sage Publications.
Glass, G.V., Peckham, P.D. & Sanders, J.R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance. Review of Educational Research, 42, 237-88.
Good, P. (1994). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. New York: Springer-Verlag.
Greenhouse, J.B. & Iyengar, S. (1994).
Harwell, M.R. (1997). An empirical study of Hedges's homogeneity test. Psychological Methods, 2(2), 219-31.
Harwell, M. (1997). An investigation of the Raudenbush (1988) test for studying variance heterogeneity. The Journal of Experimental Education, 65(2), 181-190.
Harwell, M.R. (1992). Summarizing Monte Carlo results in methodological research. Journal of Educational Statistics, 17(4), 297-313.
Harwell, M.R., Rubinstein, E.N., Hayes, W.S., & Olds, C.C. (1992). Summarizing Monte Carlo results in methodological research: The one- and two-factor fixed effects ANOVA cases. Journal of Educational Statistics, 17, 315-39.
Hauck, W.W., & Anderson, S. (1984). A survey regarding the reporting of simulation studies. The American Statistician, 38(3), 214-16.
Hayes, W.S., & Olds, C.C. (1992). Summarizing Monte Carlo results in methodological research: The one- and two-factor fixed effects ANOVA cases. Journal of Educational Statistics, 17, 315-39.
Hedges, L.V. (1982). Fitting continuous models to effect size data. Journal of Educational Statistics, 1(4), 245-70.
Hedges, L.V. (1983). A random effects model for effect sizes. Psychological Bulletin, 93(2), 388-95.
Hedges, L.V. (1992). Meta-analysis. Journal of Educational Statistics, 17(4), 279-96.
Hedges, L.V. & Olkin, I. (1985). Statistical Methods for Meta-Analysis. Orlando: Academic Press, Inc.
Hedges, L.V. & Vevea, J.L. (1998). Fixed- and random-effects models in meta-analysis. Psychological Methods, 3(4), 486-504.
Hoaglin, D.C. & Andrews, D.F. (1975). The reporting of computation-based results in statistics. The American Statistician, 29(3), 122-26.
Hopkins, K.D. & Weeks, D.L. (1990). Tests for normality and measures of skewness and kurtosis: Their place in research reporting. Educational and Psychological Measurement, 50, 717-29.
Inman, H.F. (1994). Karl Pearson and R.A. Fisher on statistical tests: A 1935 exchange from Nature. The American Statistician, 48(1), 2-10.
Kennedy, J.J. & Bush, A.J. (1985). An Introduction to the Design and Analysis of Experiments in Behavioral Research. Lanham: University Press of America.
Keselman, H.J., Huberty, C.J., Lix, L.M., Olejnik, S., Cribbie, R.A., Donahue, B., Kowalchuk, R.K., Lowman, L.L., Petoskey, M.D., Keselman, J.C. & Levin, J.R. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68(3), 350-86.
Kirk, R.E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56(5), 746-59.
Kraemer, H.C. & Andrews, G.A. (1982). A nonparametric technique for meta-analysis effect size calculation. Psychological Bulletin, 91, 404-412.
Kromrey, J.D., Lee, R.S., & Ferron, J.M. (1998). Permutation tests of equality of variances: An empirical comparison of robustness and statistical power. Paper presented at the annual meeting of the American Educational Research Association (Chicago, IL, April, 1997).
Kromrey, J.D. & Hogarty, K.Y. (1998). Effect size estimates: An empirical study of their robustness in meta-analysis.
Paper presented at the annual meeting of the Florida Educational Research Association (Orlando, FL, November, 1998). Kromrey, J.D. & Larrimore, C.D. (1998). The robustness of metaanalytic tests for homogeneity of effect sizes: An empirical i nvestigation. Paper presented at the annual meeting of the Florida Educational Research Association (Orlando, FL, November, 1998). Kuhn, T. S.(1962). The structure of scientific revolutions .Chicago: The University of Chicago Press. Langenfeld, T. E. & Coombs, W.T. (1998). The influence of total sample size, type of distribution, and ratio of population standard deviations on magnitu deofeffect statistics. Paper presented at the annual meeting of the American Educational Research Association (San Diego, CA, April, 1998). Legendre, D.T. (1805). Nouvelles method a la determination des orbites des cometes Paris: Courcier. Lin, C. & Davenport, E.C. (1997). A weighted least squares approach to robustify least squares estimates. Paper presented at the annual meeting of the American Educational Research Association (Chicago, IL, March, 1997). PAGE 211 201 Lix, L.M. & Keselman, H.J. (1998). To trim or no t to trim: Tests of location equality under heteroscedasticity and nonnormality. Educational and Psychological Measurement 58 3 409429. Mathes, P.G. & Fuchs, L.S. (1994). The efficacy of peer tutoring in reading for students with mild disabilities: A bestevidence synthesis. School Psychology Review 23 1 5980. Matt, G.E. & Cook, T.D. (1994). Threats to the validity of research syntheses. In The Handbook of Research Synthesis New York: Russell Sage Foundation. McCullagh, P. and Nelder, J.A. 1989. Generalized Linear Models, Second Edition Boca Raton: Chapman & Hall/CRC. Metsala, J.L., Stanovich, K.E. & Brown, G.D.A. (1998). Regularity effects and the phonological Deficit model of reading disabilities: A metaanalytic review. Journal of Educational Psychology 90 2 27993. Mulaik, S.A., Raju, N.S. & Harshman, R.A. (199 7). 
There is a time and a place for significance Testing. In What If There Were No Significance Tests? Mahwah, N.J.: Lawrence Erlbaum Associates, Publishers. National Research Council. (1992). Combining information: Statistical issues and opportunities for research Washington, DC: National Academy Press. Noreen, E.W. (1989). Computer Intensive Methods for Testing Hypotheses New York: Wiley. OShaughnessy, T.E. & Swanson, H.L. (1998). Do immediate memory deficits in students with learning disabilities in reading reflect a developmental lag or deficit?: A selective metaanalysis of the literature. Learning Disability Quarterly 21 12348. Overton, R.C. (1998). A comparison of fixedeffects and mixed (randomeffects) models for metaanalysis tests of modera tor variable effects. Psychological Methods 3 3 35479. Paunonen, S.V. & Gardner, R.C. (1991). Biases resulting from the use of aggregated variables in psychology. Psychological Bulletin 109 3 520523. Pedhazer, E.J. 1982. Multiple Regression in Behavioral Research Fort Worth: Holt, Rinehart and Winston, Inc. Pinnell, G.S., DeFord, D.E., Bryk, A.S. & Seltzer, M. (1994). Comparing instructional models for the literacy education of highrisk first graders. Reading Research Quarterly 29 939. Popper, K.R. (1968). The logic of scientific discovery NewYork: Harper & Row, Publishers. Rasmussen, J.L. & Dunlap, W.P. (1991). Dealing with nonnormal data: Parametric analysis of transformed data vs nonparametric analysis. Educational and Psychological Measurement 51 809820. Raudenbush, S.W. (1994). Random effects models. In The Handbook of Research Synthesis New York: Russell Sage Foundation, pg. 301320. Raudenbush, S.W. & Bryk, A.S. (1987). Examining correlates of diversity. Journal of Educational Statistics 12 3 24169. Robey, R.R. (1990). The analysis of oneway between effects in fluency research. Journal of Fluency Disorde rs, 15 27589. PAGE 212 202 Robey, R.R. & Barcikowski, R.S. (1992). 
Type I error and the number of iterations in monte carlo studies of robustness. British Journal of Mathematical and Statistical Psychology 45 28388. Rosnow, R.L. & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in Psychological science. American Psychologist 44 10 12761284. Seltzer, M. (1991). The use of data augmentation in fitting hierarchical models to educational data. Unpublished doctoral dissertation, The University of Chicago, Chicago. Serlin, R.C. (1987). Hypothesis testing, theory building, and the philosophy of science. Journal of Counseling Psychology 34 4 365371. Shadish, W. R. (1998). Evaluation theory is who we are. American Journal of Evaluation 19 1 119. Shadish, W.R. & Haddock, C.K. (1994). Combining estimates of effect size. In The Handbook of Research Synthesis New York: Russell Sage Foundation, pg. 26180. Shanahan, T. & Barr, R. (1995). Reading recovery: An independent evaluation of the effects of an early instructional intervention for atrisk learners. Reading Research Quarterly 30 4 95896. Snedecor, G.W. & Cochran, W.G. 1989. Statistical Methods, eighth edition Ames: Iowa State University Press. Thomas, H. (1986). Effect size standard errors for the nonnormal nonidentically distributed case. Journal of Educational Statistics 11 4 293303. Thompson, B. (1998). Five methodology errors in educational research: The pantheon of statistical significance and other faux pas. Paper pr esented at the annual meeting of the American Educational Research Association (San Diego, CA, April, 1998). Tukey, J.W. (1969). Analyzing data: Sanctification or detective work? American Psychologist 24 8391. Wachter, K.W. & Straf, M.L. (Eds.) 1990. The Future of MetaAnalysis New York: Russell Sage Foundation. Wang, L., Fan, X., & Willson, V.L. (1996). Effect s of nonnormal data on parameter estimates and fit indices for a model with latent and manifest variables: An empirical study. Structural Equation Modeling 3 3 22847. Wang, M.C. 
& Bushman, B.J. (1999). Integrating Results through MetaAnalytic Review Using SAS Software .Cary, NC: SAS Institute Inc. Weisberg, S. 1985. Applied Linear Regression (2nd edition) New York: John Wiley & Sons. Wilcox, R.R. (1995). ANOVA: A paradigm for low power and misleading measures of effect size? Review of Educational Research 65 5177. Wilcox, R.R. (1998). The goals and strategies of robust methods. British Journal of Mathematical and Statistical Psychology 51 139. Winer, B.J. 1971. Statistical Principles in Experimental Design, 2nd edition New York: McGrawHill Book Company. PAGE 213 203 Wolf, F.M. (1990). Methodological observations on bias. In K.W. Wachter & M.L. Straf (Eds.), The Future of MetaAnalysis New York: Russell Sage Foundation, pg. 139151. Yuen, K.K. & Dixon, W.J. (1973). The approximate behaviour and performance of the twosample trimmed t Biometrika 60 2 36974. Zeng, L. & Cope, R. T. (1995). Standard error of linear equating for the counterbalanced design. Journal of Educational and Behavioral Statistics 20 4 337348. Zucker, D.M. (1990). An an alysis of variance pitfall: The fixed effects analysis in a nested design. Educational and Psychological Measurement 50 731738. 
Appendices

Appendix A: SAS Program for Simulating True Null Hypotheses

option ps=59 ls=132 pageno=1;
proc printto print='c:\my documents\dissertation results\testnull100203.out';
proc iml;
* +------------------------------------------------------------+
    ROBUSTQ.SAS
    Changes required to execute the program:
       *NN0 - *NNB         Sample sizes
       *SPEC0 - *SPEC1     True False Moderating Null Hypothesis
       *S1 - *S4           Variances
       *SHAPE1 - *SHAPE5   Skewness and Kurtosis
       *K1 - *K3           N of studies in each meta-analysis
       *delta0 - *delta5   Population mean differences
  +------------------------------------------------------------+;
* +------------------------------------------------------------+
    Define parameters for execution of the simulation
  +------------------------------------------------------------+;
replicat=5000;         * N of meta-analyses to simulate;
dlta=0;
*delta1 dlta=0.8;
*K1 KK={2,3};          * N of studies in each meta-analysis;
KK={4,6};
*K3 KK={12,18};
*NN2 njs={ 5, 5};
*NN3 njs={20,20};
*NN4 njs={4,6};
njs={16,24};
*NN7 njs={80,120};
*NN8 njs={6,4};
* Note: NN8 - NN10 reverse pairing with unequal variances;
* Use these only for nonnull conditions;
*NN0 njs={120,80};
*NNA njs={100,100};
*NNB njs={24,16};
*S1 sds={1.0,1.0};
sds={1.0,2.0};
*S3 sds={1.0,4.0};
*pooled=SQRT((njs` * sds)/sum(njs));
POOLED_VAR=(0.5)#SUM(sds);
POOLED_SD=SQRT(POOLED_VAR);
*Tau2=0;
*Tau2=.33;
Tau2=1.0;
* Tau2=5.0;
MEANDELTA=DLTA#POOLED_SD;
VARDELTA=TAU2#POOLED_VAR;
mu1={0.0,0.0};         * Pop means for experimental and control boys;
mu2={0.0,0.0};         * Pop means for experimental and control girls;
specific=0;            * Null condition for moderation effect;
*SPEC1 specific=1;* Nonnull condition for moderation effect;
* +------------------------------------------------------------+
    Fleishman Transformations to nonnormality
  +------------------------------------------------------------+;
* The following give sk= 0, kr= 0;
*SHAPE1 b=1;
*SHAPE1 c=0;
*SHAPE1 d=0;
* The following give sk= 1.00, kr= 3.00;
*SHAPE2 b= .83221632289426;
*SHAPE2 c= .12839670935047;
*SHAPE2 d= .04803205907079;
* The following give sk= 2.00, kr= 6.00;
b=0.82632385761082;
c=0.31374908500462;
d=0.02270660525731;
* +------------------------------------------------------------+
    Initialize counters
  +------------------------------------------------------------+;
rejq101 = 0; rejq105 = 0; rejq110 = 0;
rejreq101 = 0; rejreq105 = 0; rejreq110 = 0;
rejreq201 = 0; rejreq205 = 0; rejreq210 = 0;
rejqb01=0;
rejqb05=0; rejqb10=0;
rejqb201=0; rejqb205=0; rejqb210=0;
rejprob_REZ101=0; rejprob_REZ105=0; rejprob_REZ110=0;
rejprob_FEZ101=0; rejprob_FEZ105=0; rejprob_FEZ110=0;
rejprob_conrand01=0; rejprob_conrand05=0; rejprob_conrand10=0;
nsamples=0;
* +------------------------------------------------------------+
    Subroutine to generate a random sample.
    User specifies the population mean and standard deviation.
    For population shapes, Fleishman constants are used.
    Inputs to the subroutine are
       NN       - desired sample size
       mu       - population mean
       variance - population variance
       bb,cc,dd - Fleishman constants
    Outputs are
       Rawdata  - column vector of NN observations from the
                  specified population
  +------------------------------------------------------------+;
start gendata(NN,variance,bb,cc,dd,mu,rawdata);
   seed1=round(1000000*ranuni(0));
   rawdata=rannor(repeat(seed1,nn,1));
   rawdata = (-1*cc) + (bb*rawdata) + (cc*rawdata##2) + (dd*rawdata##3);
   rawdata = (rawdata * SQRT(variance)) + mu;
finish;
* +------------------------------------------------------------+
    Direct resampling for randomization
  +------------------------------------------------------------+;
start resamp(x);
   n=Nrow(x);
   allnbut=n-1;
   do i = 1 to allnbut;
      * +------------------------------------------------------+
          Randomly select rows from the matrix X to create the
          matrix NEWM. Sampling is without replacement so that
          the matrix NEWM has the same data as X, but in random
          order
        +------------------------------------------------------+;
      ranrow = round(uniform(0)*(n - i + 0.999)+0.5);
      if i = 1 then do;
         newm = x[ranrow,];
      end;
      if i > 1 then do;
         newm = newm//x[ranrow,];
      end;
      if ranrow > 1 then do;
         if ranrow < (n-(i-1)) then x = x[1:ranrow-1,]//x[ranrow+1:n-(i-1),];
         if ranrow = n-(i-1) then x=x[1:(n-i),];
      end;
      if ranrow = 1 then x = x[2:n-(i-1),];
   end;
   newm = newm//x;
   x = newm;
   print x;
finish;
* +------------------------------------------------------------+
    Subroutine to calculate the Q test of homogeneity.
    Inputs to the subroutine are
       di_vec - column vector of effect sizes (d)
       n_vec  - matrix (k X 2) of sample sizes corresponding
                to each effect size
    Outputs are
       QQ       = the obtained value of Q
       d_plus   = weighted mean d value
       d_star   = iteratively weighted mean d value
       prob_qq1 = chi-square probability associated with QQ
  +------------------------------------------------------------+;
*start calcq(di_vec,n_vec,qq,d_plus,prob_qq1);
* calculate variance for each effect size;
*k = nrow(di_vec);
*var_di=J(k,1,0);
*do i = 1 to k;
*   var_di[i,1] = ((n_vec[i,1]+n_vec[i,2])/(n_vec[i,1]#n_vec[i,2])) +
                  ((di_vec[i,1]##2)/(2#(n_vec[i,1]+n_vec[i,2])));
*end;
* calculate weighted mean effect size;
*d_plus = 0;
*sum_wt = 0;
*do i = 1 to k;
*   d_plus = d_plus + di_vec[i,1]/var_di[i,1];
*   sum_wt = sum_wt + var_di[i,1]##-1;
*end;
*d_plus = d_plus/sum_wt;
* calculate Q;
*QQ = 0;
*do i = 1 to k;
*   QQ = QQ + ((di_vec[i,1] - d_plus)##2/var_di[i,1]);
*end;
*prob_qq1 = 1 - PROBCHI(QQ,k-1);
*print di_vec var_di;
*print d_plus qq prob_qq1;
*finish;
* +------------------------------------------------------------+
    Subroutine to calculate the REQ test of homogeneity.
    Inputs to the subroutine are
       KK     - column vector of N of studies in each class
       di_vec - column vector of effect sizes (d)
       n_vec  - matrix (k X 2) of sample sizes corresponding
                to each effect size
    Outputs are
       reqq     = the obtained value of Q
       d_plusrq = weighted mean d value
       d_starrq = iteratively weighted mean d value
       prob_req = chi-square probability associated with req
  +------------------------------------------------------------+;
start calcreq(KK,di_vec,n_vec,RSS_wls1,B_wls2,B_wlsi,vartheta,cov_B,cov_B2,SE_B,SE_B2);
   * calculate variance for each effect size;
   k = nrow(di_vec);
   var_di=J(k,1,0);
   Vi=J(k,1,0);
   X=J(k,1,1);
   do i = 1 to k;
      var_di[i,1] = ((n_vec[i,1]+n_vec[i,2])/(n_vec[i,1]#n_vec[i,2])) +
                    ((di_vec[i,1]##2)/(2#(n_vec[i,1]+n_vec[i,2])));
      Vi[i,1]=var_di[i,1]##-1;
   end;
   print X;
   print var_di di_vec Vi;
   B_ols = INV(X`*X)*X`*di_vec;
   M=X*INV(X`*X)*X`;
   NOBS = NROW(di_vec);
   IOBS = I(NOBS);
   RSS = di_vec`*(IOBS - M)*di_vec;
   *const1=(J(1, NOBS, 1)*Vi) - TRACE(X`*DIAG(Vi)*X*INV(X`*X));
   const1=(J(1,NOBS,1)*Vi##-1) - TRACE(X`*DIAG(Vi##-1)*X*INV(X`*X));
   *A vector of variances will need to be created where we take the
    reciprocal= var_di and call in the Vi;
   const2=NOBS - NCOL(X);
   vartheta=(RSS - const1)/const2;
   if vartheta<0 then vartheta=0;
   wi=Vi##-1 + J(NOBS,1,1)*vartheta;
   wi = wi##-1;
   wi2=Vi;
   *prob_req = 1 - PROBCHI(req,k-1);
   print 'Initial Run Using OLS';
   print B_ols RSS vartheta const1 const2 vi wi wi2;
   B_wls=INV(X`*DIAG(wi2)*X)*X`*DIAG(wi2)*di_vec;
   RSS_wls1=(di_vec - X*B_wls)`*DIAG(wi2)*(di_vec - X*B_wls);
   print 'This equals Q test:'RSS_wls1;
   X=J(k,2,1);
   do i=1 to k;
      if i<=KK[1,1] then X[i,2] = 0;
   end;
   print X;
   * +---------------------------------------------------------+
       Weighted least squares estimation using wi as
       variance estimates
     +---------------------------------------------------------+;
   B_wls = INV(X`*DIAG(wi)*X)*X`*DIAG(wi)*di_vec;
   RSS_wls = (di_vec - X*B_wls)`*DIAG(wi)*(di_vec - X*B_wls);
   cov_b = INV(X`*DIAG(wi)*X);
   SE_B = SQRT(vecdiag(cov_b));
   B_wls2=INV(X`*DIAG(wi2)*X)*X`*DIAG(wi2)*di_vec;
   RSS_wls2=(di_vec - X*B_wls2)`*DIAG(wi2)*(di_vec - X*B_wls2);
   cov_b2= INV(X`*DIAG(wi2)*X);
   SE_B2= SQRT(vecdiag(cov_b2));
   print 'Running Thru WLS with wi as the weights';
   print B_wls RSS_wls cov_b SE_B;
   print 'Running Thru WLS with wi2 as the weights';
   print B_wls2 RSS_wls2 cov_b2 SE_B2;
   * +---------------------------------------------------------+
       Maximum Likelihood Estimation
     +---------------------------------------------------------+;
   change=1;
   iterate=1;
   do until(change<.0000000001);
      wi=(Vi##-1) + J(NOBS,1,1)*vartheta;
      wi=wi##-1;
      *wi2=Vi##-1;
      wi2=Vi;
      B_wlsi = INV(X`*DIAG(wi)*X)*X`*DIAG(wi)*di_vec;
      print wi B_wlsi;
      RSS_i= (di_vec - X*B_wlsi)`*DIAG(wi)*(di_vec - X*B_wlsi);
      r_vec=(di_vec - X*B_wlsi)##2;
      *varnew=SUM(wi##2#(r_vec - vi))/(wi`*wi);
      varnew=SUM(wi##2#(r_vec - var_di))/(wi`*wi);
      if varnew<0 then varnew=0;
      change=abs(vartheta - varnew);
      B_prt = B_wlsi`;
      print 'Maximum Likelihood Algorithm';
      print iterate vartheta varnew change B_prt RSS_i;
      vartheta = varnew;
      iterate = iterate + 1;
   end;
   wi=(Vi##-1) + J(NOBS,1,1)*vartheta;
   wi=wi##-1;
   cov_b = INV(X`*DIAG(wi)*X);
   cov_b2= INV(X`*DIAG(wi2)*X);
   SE_B = SQRT(vecdiag(cov_b));
   SE_B2= SQRT(vecdiag(cov_b2));
   print 'Last Commands in Routine';
   print cov_b cov_b2;
   print se_b se_b2;
   *prob_req = 1 - PROBCHI(req,k-1);
finish;
* +------------------------------------------------------------+
    Subroutine to calculate Qb test of homogeneity of effect
    sizes across classes.
    Inputs to the subroutine are
       dp_vec1 - column vector of study effect sizes for boys
       dp_vec2 - column vector of study effect sizes for girls
       n_vec   - matrix (K X 2) of sample sizes corresponding
                 to each individual study
       KK      - column vector with N of studies on boys and
                 on girls
    Outputs are
       Qb       = the obtained value of Qb
       d_plspls = grand mean effect size
       prob_qb  = chi-square probability associated with Qb
  +------------------------------------------------------------+;
start calcqb (dp_vec1,dp_vec2,n_vec,KK,K,qb,d_plspls,prob_qb);
   print 'This is within the CALCB Subroutine';
   print dp_vec1 dp_vec2 n_vec KK K;
   *calculate variance for each effect size for the studies on boys;
   n_vec1=n_vec[1:kk[1,1],];
   var_di1=J(kk[1,1],1,0);
   do i=1 to kk[1,1];
      var_di1[i,1]=((n_vec1[i,1] + n_vec1[i,2])/(n_vec1[i,1]#n_vec1[i,2]))+
                   ((dp_vec1[i,1]##2)/(2#(n_vec1[i,1]+n_vec1[i,2])));
   end;
   *calculate weighted mean effect size per class;
   dp1=0;
   sum_wt=0;
   do i=1 to kk[1,1];
      dp1=dp1 + dp_vec1[i,1]/var_di1[i,1];
      sum_wt=sum_wt + var_di1[i,1]##-1;
   end;
   dp1=dp1/sum_wt;
   print 'These are calculations for boys';
   print n_vec1 var_di1 dp1;
   *calculate variance for each effect size for the studies on girls;
   n_vec2=n_vec[(kk[1,1]+1):K,];
   var_di2=J(kk[2,1],1,0);
   do i=1 to kk[2,1];
      var_di2[i,1]=((n_vec2[i,1] + n_vec2[i,2])/(n_vec2[i,1]#n_vec2[i,2]))+
                   ((dp_vec2[i,1]##2)/(2#(n_vec2[i,1]+n_vec2[i,2])));
   end;
   *calculate weighted mean effect size for girls;
   dp2=0;
   sum_wt=0;
   do i=1 to kk[2,1];
      dp2=dp2 + dp_vec2[i,1]/var_di2[i,1];
      sum_wt=sum_wt + var_di2[i,1]##-1;
   end;
   dp2=dp2/sum_wt;
   print 'These are calculations for girls';
   print n_vec2 var_di2 dp2;
   *calculate weighted grand mean (d++);
   dpall=dp_vec1//dp_vec2;
   varall=var_di1//var_di2;
   d_plspls=0;
   sum_wt=0;
   do i=1 to k;
      d_plspls=d_plspls + dpall[i,1]/varall[i,1];
      sum_wt=sum_wt + varall[i,1]##-1;
   end;
   d_plspls=d_plspls/sum_wt;
   *calculate Qb;
   Qb=0;
   do i=1 to kk[1,1];
      Qb=Qb + ((dp1 - d_plspls)##2/var_di1[i,1]);
   end;
   do i=1 to kk[2,1];
      qb=qb+ (dp2 - d_plspls)##2/var_di2[i,1];
   end;
   prob_qb=1 - PROBCHI(Qb,1);
   print d_plspls qb prob_qb;
finish;
* +------------------------------------------------------------+
    Subroutine to calculate exact (and approximate) permutation
    test of homogeneity of effect sizes across classes (Qb).
    For K = 5 and K = 10, the test is exact. For K = 30, the
    test is approximate, based on a sample of 1000 permutations
    of the obtained effect sizes.
    Inputs to the subroutine are
       dp_vec1 - column vector of study effect sizes for boys
       dp_vec2 - column vector of study effect sizes for girls
       n_vec   - matrix (K X 2) of sample sizes corresponding
                 to each individual study
       KK      - column vector with N of studies on boys and
                 on girls
       K       - total number of studies in the meta-analysis
       Q_real  - obtained value of Qb on the actual study data
    Outputs are
       prob_qb2 - permutation probability associated with Qb
  +------------------------------------------------------------+;
start Qb_exact (dp_vec1,dp_vec2,n_vec,KK,K,Q_real,prob_qb2);
   dpall = dp_vec1//dp_vec2;
   prob_qb2 = 0;
   perm = 0;
   if K = 5 then do;
      do i = 1 to K-1;
         do j = 2 to K;
            if i < j then do;
               dvect1 = dpall[i,];
               dvect1 = dvect1//dpall[j,];
               nvect = n_vec[i,];
               nvect = nvect//n_vec[j,];
               cdt2 = 0;
               do z = 1 to K;
                  if (z ^= i & z ^= j) then do;
                     dvect2=dvect2//dpall[z,];
                     nvect = nvect//n_vec[z,];
                  end;
               end;
               run calcqb (dvect1,dvect2,nvect,KK,K,qbtemp,d_plspl,probqbt);
               if Qbtemp < Q_real then prob_qb2 = prob_qb2 + 1;
               perm = perm + 1;
               free dvect1 dvect2 nvect;
            end;
         end;
      end;
      prob_qb2 = 1 - (prob_qb2 / perm);
   end;
   if K = 10 then do;
      do i = 1 to K-3;
      do j = 2 to K-2;
      do l = 3 to K-1;
      do m = 4 to K;
         if (i
[...]
         run calcqb (dvect1,dvect2,nvect, KK, K, qbtemp, d_plspls, probqbt);
         if qbtemp < Q_real then prob_qb2=prob_qb2+1;
         perm=perm+1;
         free dvect1 dvect2 nvect;
      end;
      prob_qb2=1 - (prob_qb2/perm);
   end;
finish;
* +------------------------------------------------------------+
    Bubble sort
       X = matrix to be sorted
       N = number of rows in matrix
       C = column number to sort by
  +------------------------------------------------------------+;
start bubble(x,n,c);
   do i = 1 to n;
      do j = 1 to n-1;
         if x[J,C] > x[J+1,C] then do;
            temp = x[J+1,];
            x[J+1,] = x[J,];
            x[J,] = temp;
         end;
      end;
   end;
finish;
* +------------------------------------------------------------+
    Main program
    Generates samples, calls subroutines, computes rejection
    rates.
  +------------------------------------------------------------+;
do rep=1 to replicat;            * This starts the big do loop;
   k=sum(kk);
   n_vec=J(K,2,0);
   do study = 1 to k;            * Inner loop for primary studies;
      ALPHA_K=RANNOR(0)#SQRT(VARDELTA)+MEANDELTA;
      Mean_exp=mu1[1,1]+alpha_k;
      n1=rannor(0)+ njs[1,1];
      n1=round(n1);
      if n1<3 then n1=3;
      n2=rannor(0)+njs[2,1];
      n2=round(n2);
      if n2<3 then n2=3;
      n_vec[study,1]=n1;
      n_vec[study,2]=n2;
      *print 'Group Sizes:' Rep Study n1 n2;
      * generate a vector of scores for each group;
      if (study <= kk[1,1]) then do;
         run gendata(n1,sds[1,1],b,c,d,mu1[1,1],z1);
         run gendata(n2,sds[2,1],b,c,d,mean_exp,z2);
      end;
      if (study > kk[1,1]) then do;
         run gendata (n1,sds[1,1],b,c,d, mu2[1,1], z1);
         run gendata (n2,sds[2,1],b,c,d, mean_exp,z2);
      end;
      * calculate sample means, SS, di and cd for primary studies;
      xbar1 = (J(1,n1,1)*z1)/n1;
      xbar2 = (J(1,n2,1)*z2)/n2;
      ss1 = (J(1,n1,1)*(z1##2)) - ((J(1,n1,1)*z1)##2/n1);
      ss2 = (J(1,n2,1)*(z2##2)) - ((J(1,n2,1)*z2)##2/n2);
      njstemp = n1//n2;
      di = ((xbar1 - xbar2)/sqrt((ss1 + ss2)/(J(1,2,1)*njstemp - 2)))#
           (1 - (3/(4#(J(1,2,1)*njstemp) - 9)));
      if study = 1 then di_vec = di;
      if study > 1 then di_vec = di_vec//di;
   end;                          * End inner loop for primary studies;
   * Calculate the regular parametric Q tests;
   *n_vec = njs*J(1,k,1);
   *n_vec = T(n_vec);
   * run calcq(di_vec,n_vec,q,d_plus,prob_q1);
   *print 'From CALCQ';
   * print di_vec n_vec q d_plus prob_q1;
   run calcreq(KK,di_vec,n_vec,RSS_wls1,B_wls2,B_wlsi,vartheta,cov_B,cov_B2,SE_B,SE_B2);
   prob_q1=1 - probchi(RSS_wls1,k-1);
   *print 'FROM CALCREQ';
   *print di_vec n_vec RSS_wls1 B_wls2 B_wlsi vartheta cov_B cov_B2 SE_B SE_B2;
   FE_T=B_wls2[2,1]/SE_B2[2,1];
   PROB_FE=2#(1 - probt(abs(FE_T),K-2));
   PROB_FEZ=2#(1 - probnorm(abs(FE_T)));
   *print 'Fixed Effects Test:' fe_t prob_fe prob_fez;
   RE_T = B_wlsi[2,1]/se_b[2,1];
   PROB_RE=2#(1 - probt(abs(RE_T),K-2));
   PROB_REZ=2#(1 - probnorm(abs(RE_T)));
   *---------------------------------------------------------;
   *record reject and fail-to-reject probabilities for the
    random- and fixed-effects magnitude of effects tests.
   *---------------------------------------------------------;
   if prob_REZ < .01 then rejprob_REZ101 = rejprob_REZ101+1;
   if prob_REZ < .05 then rejprob_REZ105 = rejprob_REZ105+1;
   if prob_REZ < .10 then rejprob_REZ110 = rejprob_REZ110+1;
   if prob_FEZ < .01 then rejprob_FEZ101 = rejprob_FEZ101+1;
   if prob_FEZ < .05 then rejprob_FEZ105 = rejprob_FEZ105+1;
   if prob_FEZ < .10 then rejprob_FEZ110 = rejprob_FEZ110+1;
   *print 'Random Effects Test:'RE_t prob_RE prob_REZ;
   *calculate Qb tests;
   run calcqb (di_vec[1:kk[1,1],],di_vec[kk[1,1]+1:k,], n_vec, kk, k, qb, d_plspls, prob_qb);
   print 'From CALCQB';
   print qb d_plspls prob_qb;
   *---------------------------------------------------------;
   *Subroutine to run conditionally-random procedure
   *conrand=conditionally-random procedure
   *---------------------------------------------------------;
   conrand=prob_q1;
   if prob_q1<.05 then conrand=prob_REZ;
   if prob_q1>.05 then conrand=prob_FEZ;
   if conrand < .01 then rejprob_conrand01 = rejprob_conrand01+1;
   if conrand < .05 then rejprob_conrand05 = rejprob_conrand05+1;
   if conrand < .10 then rejprob_conrand10 = rejprob_conrand10+1;
   print 'Conditionally Random Test:'conrand;
   * Permutation test of Qb;
   run Qb_exact (di_vec[1:kk[1,1],],di_vec[kk[1,1]+1:k,],n_vec,KK,K,FE_T##2,prob_qb2);
   * Record reject/fail to reject for each test;
   if prob_q1 < .01 then rejq101 = rejq101+1;
   if prob_q1 < .05 then rejq105 = rejq105+1;
   if prob_q1 < .10 then rejq110 = rejq110+1;
   if prob_qb2 < .01 then rejqb201=rejqb201+1;
   if prob_qb2 < .05 then rejqb205=rejqb205+1;
   if prob_qb2 < .10 then rejqb210=rejqb210+1;
   nsamples=nsamples+1;
end;                             *end the big loop;
rejprob_REZ101 = rejprob_REZ101/nsamples;
rejprob_REZ105 = rejprob_REZ105/nsamples;
rejprob_REZ110 = rejprob_REZ110/nsamples;
rejprob_FEZ101 = rejprob_FEZ101/nsamples;
rejprob_FEZ105 = rejprob_FEZ105/nsamples;
rejprob_FEZ110 = rejprob_FEZ110/nsamples;
rejprob_conrand01 = rejprob_conrand01/nsamples;
rejprob_conrand05 = rejprob_conrand05/nsamples;
rejprob_conrand10 = rejprob_conrand10/nsamples;
* +------------------------------------------------------------+
    Convert counts of rejected hypotheses into proportions
  +------------------------------------------------------------+;
rejq101 = rejq101/nsamples;
rejq105 = rejq105/nsamples;
rejq110 = rejq110/nsamples;
*rejreq101 = rejreq101/nsamples;
*rejreq105 = rejreq105/nsamples;
*rejreq110 = rejreq110/nsamples;
*rejreq201 = rejreq201/nsamples;
*rejreq205 = rejreq205/nsamples;
*rejreq210 = rejreq210/nsamples;
*rejqb01=rejqb01/nsamples;
*rejqb05=rejqb05/nsamples;
*rejqb10=rejqb10/nsamples;
rejqb201=rejqb201/nsamples;
rejqb205=rejqb205/nsamples;
rejqb210=rejqb210/nsamples;
*print 'Tests of Homogeneity of Effect Sizes';
PRINT njs sds dlta kk specific b Tau2 nsamples;
*print 'sk= 0.00, kr= 0.00';
*SHAPE2 print 'sk= 1.00, kr= 3.00';
*SHAPE3 print 'sk= 1.50, kr= 5.00';
*SHAPE4 print 'sk= 2.00, kr= 6.00';
*SHAPE5 print 'sk= 0.00, kr= 25.00';
PRINT rejq101, rejq105, rejq110;
*PRINT rejreq101, rejreq105, rejreq110, rejreq201, rejreq205, rejreq210;
PRINT rejqb201, rejqb205, rejqb210;
PRINT rejprob_REZ101, rejprob_REZ105, rejprob_REZ110;
PRINT rejprob_FEZ101,rejprob_FEZ105,rejprob_FEZ110;
PRINT rejprob_conrand01, rejprob_conrand05, rejprob_conrand10;
quit;

Appendix B: SAS Program for Simulating False Null Hypotheses

option ps=59 ls=132 pageno=1;
proc printto print='c:\my documents\dissertation \test48.out';
proc iml;
* +------------------------------------------------------------+
    ROBUSTQ.SAS
    Changes required to execute the program:
       *NN0 - *NNB         Sample sizes
       *SPEC0 - *SPEC1     True False Moderating Null Hypothesis
       *S1 - *S4           Variances
       *SHAPE1 - *SHAPE5   Skewness and Kurtosis
       *K1 - *K3           N of studies in each meta-analysis
       *delta0 - *delta5   Population mean differences
  +------------------------------------------------------------+;
* +------------------------------------------------------------+
    Define parameters for execution of the simulation
  +------------------------------------------------------------+;
replicat=5000;         * N of meta-analyses to simulate;
*delta0 dlta=0;
dlta=0.8;
*K1 KK={2,3};          * N of studies in each meta-analysis;
KK={4,6};
*K3 KK={12,18};
*NN2 njs={ 5, 5};
*NN3 njs={20,20};
*NN5 njs={4,6};
*NN6 njs={16,24}; *NN7 njs={80,120}; *NN8 njs={6,4}; Note: NN8 NN10 reverse pairing with unequal variances; Use these only for nonnull conditions; njs={ 120 80 }; *NNA njs={100,100}; *NNB njs={24,16}; *S1 sds={1.0,1.0}; *S2 sds={1.0,2.0}; sds={ 1.0 4.0 }; *pooled=SQRT ((njs` sds)/sum (njs)); POOLED_VAR=( 0.5 )#SUM(sds); POOLED_SD=SQRT(POOLED_VAR); Tau2= 0 ; PAGE 230 220 Appendix B (Continued) *Tau2=.33; *Tau2=1.0; Tau2=5.0; MEANDELTA=DLTA#POOLED_SD; VARDELTA=TAU2#POOLED_VAR; mu1={ 0.0 0.0 }; Pop means for experimental and control boys; mu2={ 0.0 0.0 }; Pop means for experimental and control girls; *SPEC0 specific=0; Null condition for moderation effect; specific= 1 ;* Nonnull condition for moderation effect; ++ Fleishman Transformations to nonnormality ++; The following give sk= 0, kr= 0; *SHAPE 1b=1; *SHAPE 1c=0; *SHAPE 1d=0; The following give sk= 1.00, kr= 3.00; *SHAPE 2b= .83221632289426; *SHAPE 2c= .12839670935047; *SHAPE 2d= .04803205907079; The following give sk= 2.00, kr= 6.00; b= 0.82632385761082 ; c= 0.31374908500462 ; d= 0.02270660525731 ; ++ Initialize counters ++; rejq101 = 0 ; rejq105 = 0 ; rejq110 = 0 ; rejreq101 = 0 ; rejreq105 = 0 ; rejreq110 = 0 ; rejreq201 = 0 ; rejreq205 = 0 ; rejreq210 = 0 ; rejqb01= 0 ; rejqb05= 0 ; rejqb10= 0 ; PAGE 231 221 Appendix B (Continued) rejqb201= 0 ; rejqb205= 0 ; rejqb210= 0 ; rejprob_REZ101= 0 ; rejprob_REZ105= 0 ; rejprob_REZ110= 0 ; rejprob_FEZ101= 0 ; rejprob_FEZ105= 0 ; rejprob_FEZ110= 0 ; rejprob_conrand01= 0 ; rejprob_conrand05= 0 ; rejprob_conrand10= 0 ; nsamples= 0 ; ++ Subroutine to generate a random sample. User specifies the population mean and standard deviation. For population shapes, Fleishman constants are used. 
Inputs to the subroutine are NN desired sample size mu population mean variance population variance bb,cc,dd Fleishman constants Outputs are Rawdata column vector of NN observations from the specified population ++; start gendata(NN,variance,bb,cc,dd,mu,rawdata); seed1=round( 1000000 *ranuni( 0 )); rawdata=rannor(repeat(seed1,nn, 1 )); rawdata = (1 *cc) + (bb*rawdata) + (cc*rawdata## 2 ) + (dd*rawdata## 3 ); rawdata = (rawdata SQRT(variance)) + mu; finish; ++ Direct resampling for randomization ++; start resamp(x); n=Nrow(x); allnbut=n1 ; do i = 1 to allnbut; PAGE 232 222 Appendix B (Continued) ++ Randomly select rows from the matrix X to create the matrix NEWM. Sampling is without replacement so that the matrix NEWM has the same data as X, but in random order ++; ranrow = round(uniform( 0 )*(n i + 0.999 )+ 0.5 ); if i = 1 then do; newm = x[ranrow,]; end; if i > 1 then do; newm = newm//x[ranrow,]; end; if ranrow > 1 then do; if ranrow < (n(i1 )) then x = x[ 1 :ranrow1 ,]//x[ranrow+ 1 :n(i1 ),]; if ranrow = n(i1 ) then x=x[ 1 :(ni),]; end; if ranrow = 1 then x = x[ 2 :n(i1 ),]; end; newm = newm//x; x = newm; print x; finish; ++ Subroutine to calculate the Q test of homogeneity. 
Inputs to the subroutine are di_vec column vector of effect sizes (d) n_vec matrix (k X 2) of sample sizes corresponding to each effect size Outputs are QQ = the obtained value of Q d_plus = weighted mean d value d_star = iteratively weighted mean d value prob_qq1 = chisquare probability associated with QQ ++; *start calcq(di_vec,n_vec,qq,d_plus,prob_qq1); calculate variance for each effect size; k = nrow(di_vec); var_di=J(k,1,0); do i = 1 to k; var_di[i,1] = ((n_vec[i,1]+n_vec[i,2])/(n_vec[i,1]#n_vec[i,2])) + ((di_vec[i,1]##2)/(2#(n_vec[i,1]+n_vec[i,2]))); end; calculate weighted mean effect size; PAGE 233 223 Appendix B (Continued) d_plus = 0; sum_wt = 0; do i = 1 to k; d_plus = d_plus + di_vec[i,1]/var_di[i,1]; sum_wt = sum_wt + var_di[i,1]##1; end; d_plus = d_plus/sum_wt; calculate Q; QQ = 0; do i = 1 to k; QQ = QQ + ((di_vec[i,1] d_plus)##2/var_di[i,1]); end; prob_qq1 = 1 PROBCHI(QQ,k1); print di_vec var_di; print d_plus qq prob_qq1; finish; *++ Subroutine to calculate the REQ test of homogeneity. 
Inputs to the subroutine are KK column vector of N of studies in each class di_vec column vector of effect sizes (d) n_vec matrix (k X 2) of sample sizes corresponding to each effect size Outputs are reqq = the obtained value of Q d_plusrq = weighted mean d value d_starrq = iteratively weighted mean d value prob_req = chisquare probability associated with req ++; start calcreq(KK,di_vec,n_vec,RSS_wls1,B_wls2,B_wlsi,vartheta,cov_B,cov _B2,SE_B, SE_B2); calculate variance for each effect size; k = nrow(di_vec); var_di=J(k, 1 0 ); Vi=J(k, 1 0 ); X=J(k, 1 1 ); do i = 1 to k; var_di[i, 1 ] = ((n_vec[i, 1 ]+n_vec[i, 2 ])/(n_vec[i, 1 ]#n_vec[i, 2 ])) + ((di_vec[i, 1 ]## 2 )/( 2 #(n_vec[i, 1 ]+n_vec[i, 2 ]))); Vi[i, 1 ]=var_di[i, 1 ]##1 ; end; print X; print var_di di_vec Vi; PAGE 234 224 Appendix B (Continued) B_ols =INV(X`*X)*X`*di_vec; M=X*INV(X`*X)*X`; NOBS =NROW(di_vec); IOBS = I (NOBS); RSS = di_vec`*(IOBS M)* di_vec; *const1=(J(1, NOBS, 1)*Vi) TRACE(X`*DIAG(Vi)*X*INV(X`*X)); const1=(J( 1 NOBS, 1 )*Vi##1 )TRACE(X`*DIAG(Vi##1 )*X*INV(X`*X)); *A vector of variances will need to be created where we take the reciprocal= var_di and call in the Vi; const2=NOBS NCOL(X); vartheta=(RSS const1)/const2; if vartheta< 0 then vartheta= 0 ; wi=Vi##1 + J(NOBS, 1 1 )*vartheta; wi = wi##1 ; wi2=Vi; *prob_req = 1 PROBCHI(req,k1); print 'Initial Run Using OLS'; print B_ols RSS vartheta const1 const2 vi wi wi2; B_wls=INV(X`*DIAG(wi2)*X)*X`*DIAG(wi2)*di_vec; RSS_wls1=(di_vec X*B_wls)`*DIAG(wi2)*(di_vecX*B_wls); print 'This equals Q test:'RSS_wls1; X=J(k, 2 1 ); do i= 1 to k; if i<=KK[ 1 1 ] then X[i, 2 ] = 0 ; end; print X; *++ Weighted least squares estimation using wi as variance estimates ++; B_wls = INV(X`*DIAG(wi)*X)*X`*DIAG(wi)*di_vec; RSS_wls = (di_vec X*B_wls)`*DIAG(wi)*(di_vecX*B_wls); cov_b = INV(X`*DIAG(wi)*X); SE_B = SQRT(vecdiag(cov_b)); B_wls2=INV(X`*DIAG(wi2)*X)*X`*DIAG(wi2)*di_vec; RSS_wls2=(di_vec X*B_wls2)`*DIAG(wi2)*(di_vecX*B_wls2); cov_b2= INV(X`*DIAG(wi2)*X); SE_B2= 
SQRT(vecdiag(cov_b2)); print 'Running Thru WLS with wi as the weights'; print B_wls RSS_wls cov_b SE_B; print 'Running Thru WLS with wi2 as the weights'; print B_wls2 RSS_wls2 cov_b2 SE_B2; *++ Maximum Likelihood Estimation ++; change= 1 ; iterate= 1 ; PAGE 235 225 Appendix B (Continued) do until(change< .0000000001 ); wi=(Vi##1 ) + J(NOBS, 1 1 )*vartheta; wi=wi##1 ; *wi2=Vi##1; wi2=Vi; B_wlsi = INV(X`*DIAG(wi)*X)*X`*DIAG(wi)*di_vec; print wi B_wlsi; RSS_i= (di_vec X*B_wlsi)`*DIAG(wi)*(di_vec X*B_wlsi); r_vec=(di_vecX*B_wlsi)## 2 ; *varnew=SUM(wi##2#(r_vec vi))/(wi`*wi); varnew=SUM(wi## 2 #(r_vec var_di))/(wi`*wi); if varnew< 0 then varnew= 0 ; change=abs(vartheta varnew); B_prt = B_wlsi`; print 'Maximum Likelihood Algorithm'; print iterate vartheta varnew change B_prt RSS_i; vartheta = varnew; iterate = iterate + 1 ; end; wi=(Vi##1 ) + J(NOBS, 1 1 )*vartheta; wi=wi##1 ; cov_b = INV(X`*DIAG(wi)*X); cov_b2= INV(X`*DIAG(wi2)*X); SE_B = SQRT(vecdiag(cov_b)); SE_B2= SQRT(vecdiag(cov_b2)); print 'Last Commands in Routine'; print cov_b cov_b2; print se_b se_b2; *prob_req = 1 PROBCHI(req,k1); finish; ++ Subroutine to calculate Qb test of homogeneity of effect sizes across classes. 
    Inputs to the subroutine are
       dp_vec1 - column vector of study effect sizes for boys
       dp_vec2 - column vector of study effect sizes for girls
       n_vec   - matrix (K X 2) of sample sizes corresponding to each individual study
       KK      - column vector with N of studies on boys and on girls
    Outputs are
       Qb       = the obtained value of Qb
       d_plspls = grand mean effect size
       prob_qb  = chi-square probability associated with Qb
  +------------------------------------------------------------+;
start calcqb (dp_vec1,dp_vec2,n_vec,KK,K,qb,d_plspls,prob_qb);
print 'This is within the CALCB Subroutine';
print dp_vec1 dp_vec2 n_vec KK K;
*calculate variance for each effect size for the studies on boys;
n_vec1 = n_vec[1:kk[1,1],];
var_di1 = J(kk[1,1],1,0);
do i = 1 to kk[1,1];
   var_di1[i,1] = ((n_vec1[i,1] + n_vec1[i,2])/(n_vec1[i,1]#n_vec1[i,2])) +
                  ((dp_vec1[i,1]##2)/(2#(n_vec1[i,1]+n_vec1[i,2])));
end;
*calculate weighted mean effect size per class;
dp1 = 0;
sum_wt = 0;
do i = 1 to kk[1,1];
   dp1 = dp1 + dp_vec1[i,1]/var_di1[i,1];
   sum_wt = sum_wt + var_di1[i,1]##-1;
end;
dp1 = dp1/sum_wt;
print 'These are calculations for boys';
print n_vec1 var_di1 dp1;
*calculate variance for each effect size for the studies on girls;
n_vec2 = n_vec[(kk[1,1]+1):K,];
var_di2 = J(kk[2,1],1,0);
do i = 1 to kk[2,1];
   var_di2[i,1] = ((n_vec2[i,1] + n_vec2[i,2])/(n_vec2[i,1]#n_vec2[i,2])) +
                  ((dp_vec2[i,1]##2)/(2#(n_vec2[i,1]+n_vec2[i,2])));
end;
*calculate weighted mean effect size for girls;
dp2 = 0;
sum_wt = 0;
do i = 1 to kk[2,1];
   dp2 = dp2 + dp_vec2[i,1]/var_di2[i,1];
   sum_wt = sum_wt + var_di2[i,1]##-1;
end;
dp2 = dp2/sum_wt;
print 'These are calculations for girls';
print n_vec2 var_di2 dp2;
*calculate weighted grand mean (d++);
dpall = dp_vec1//dp_vec2;
varall = var_di1//var_di2;
d_plspls = 0;
sum_wt = 0;
do i = 1 to k;
   d_plspls = d_plspls + dpall[i,1]/varall[i,1];
   sum_wt = sum_wt + varall[i,1]##-1;
end;
d_plspls = d_plspls/sum_wt;
*calculate Qb;
Qb = 0;
do i = 1 to kk[1,1];
   Qb = Qb +
      ((dp1 - d_plspls)##2/var_di1[i,1]);
end;
do i = 1 to kk[2,1];
   qb = qb + (dp2 - d_plspls)##2/var_di2[i,1];
end;
prob_qb = 1 - PROBCHI(Qb,1);
print d_plspls qb prob_qb;
finish;
* +------------------------------------------------------------+
    Subroutine to calculate exact (and approximate) permutation test of
    homogeneity of effect sizes across classes (Qb). For K = 5 and K = 10,
    the test is exact. For K = 30, the test is approximate, based on a
    sample of 1000 permutations of the obtained effect sizes.
    Inputs to the subroutine are
       dp_vec1 - column vector of study effect sizes for boys
       dp_vec2 - column vector of study effect sizes for girls
       n_vec   - matrix (K X 2) of sample sizes corresponding to each individual study
       KK      - column vector with N of studies on boys and on girls
       K       - total number of studies in the meta-analysis
       Q_real  - obtained value of Qb on the actual study data
    Outputs are
       prob_qb2 - permutation probability associated with Qb
  +------------------------------------------------------------+;
start Qb_exact (dp_vec1,dp_vec2,n_vec,KK,K,Q_real,prob_qb2);
dpall = dp_vec1//dp_vec2;
prob_qb2 = 0;
perm = 0;
if K = 5 then do;
   do i = 1 to K-1;
      do j = 2 to K;
         if i < j then do;
            dvect1 = dpall[i,];
            dvect1 = dvect1//dpall[j,];
            nvect = n_vec[i,];
            nvect = nvect//n_vec[j,];
            cdt2 = 0;
            do z = 1 to K;
               if (z ^= i & z ^= j) then do;
                  dvect2 = dvect2//dpall[z,];
                  nvect = nvect//n_vec[z,];
               end;
            end;
            run calcqb (dvect1,dvect2,nvect,KK,K,qbtemp,d_plspl,probqbt);
            if Qbtemp < Q_real then prob_qb2 = prob_qb2 + 1;
            perm = perm + 1;
            free dvect1 dvect2 nvect;
         end;
      end;
   end;
   prob_qb2 = 1 - (prob_qb2/perm);
end;
if K = 10 then do;
   do i = 1 to K-3;
      do j = 2 to K-2;
         do l = 3 to K-1;
            do m = 4 to K;
               if (i

   run resamp (dpN);
   dvect1 = dpN[1:12,1];
   dvect2 = dpN[13:30,1];
   nvect = dpN[,2:3];
   run calcqb (dvect1,dvect2,nvect,KK,K,qbtemp,d_plspls,probqbt);
   if qbtemp < Q_real then prob_qb2 = prob_qb2 + 1;
   perm = perm + 1;
   free dvect1 dvect2 nvect;
   end;
   prob_qb2 = 1 - (prob_qb2/perm);
end;
finish;
* +------------------------------------------------------------+
    Bubble sort
    X = matrix to be sorted
    N = number of rows in
        matrix
    C = column number to sort by
  +------------------------------------------------------------+;
start bubble(x,n,c);
do i = 1 to n;
   do j = 1 to n-1;
      if x[J,C] > x[J+1,C] then do;
         temp = x[J+1,];
         x[J+1,] = x[J,];
         x[J,] = temp;
      end;
   end;
end;
finish;
* +------------------------------------------------------------+
    Main program
    Generates samples, calls subroutines, computes rejection rates.
  +------------------------------------------------------------+;
do rep = 1 to replicat;   * This starts the big do loop;
k = sum(kk);
n_vec = J(K,2,0);
do study = 1 to k;   * Inner loop for primary studies;
   ALPHA_K = RANNOR(0)#SQRT(VARDELTA)+MEANDELTA;
   ALPHA_K_GIRLS = RANNOR(0)#SQRT(VARDELTA);
   Mean_exp = mu1[1,1]+alpha_k;
   MEAN_GIRLS = mu2[1,1]+ALPHA_K_GIRLS;
   n1 = rannor(0)+njs[1,1];
   n1 = round(n1);
   if n1 < 3 then n1 = 3;
   n2 = rannor(0)+njs[2,1];
   n2 = round(n2);
   if n2 < 3 then n2 = 3;
   n_vec[study,1] = n1;
   n_vec[study,2] = n2;
   *print 'Group Sizes:' Rep Study n1 n2;
   * generate a vector of scores for each group;
   if (study <= kk[1,1]) then do;
      run gendata(n1,sds[1,1],b,c,d,mu1[1,1],z1);
      run gendata(n2,sds[2,1],b,c,d,mean_exp,z2);
   end;
   if (study > kk[1,1]) then do;
      run gendata(n1,sds[1,1],b,c,d,mu2[1,1],z1);
      run gendata(n2,sds[2,1],b,c,d,mean_girls,z2);
   end;
   * calculate sample means, SS, di and cd for primary studies;
   xbar1 = (J(1,n1,1)*z1)/n1;
   xbar2 = (J(1,n2,1)*z2)/n2;
   ss1 = (J(1,n1,1)*(z1##2)) - ((J(1,n1,1)*z1)##2/n1);
   ss2 = (J(1,n2,1)*(z2##2)) - ((J(1,n2,1)*z2)##2/n2);
   njstemp = n1//n2;
   di = ((xbar1 - xbar2)/sqrt((ss1 + ss2)/(J(1,2,1)*njstemp - 2)))#
        (1 - (3/(4#(J(1,2,1)*njstemp) - 9)));
   if study = 1 then di_vec = di;
   if study > 1 then di_vec = di_vec//di;
end;   * End inner loop for primary studies;
* Calculate the regular parametric Q tests;
*n_vec = njs*J(1,k,1);
*n_vec = T(n_vec);
run calcq(di_vec,n_vec,q,d_plus,prob_q1);
*print 'From CALCQ';
print di_vec n_vec q d_plus prob_q1;
run calcreq(KK,di_vec,n_vec,RSS_wls1,B_wls2,B_wlsi,vartheta,cov_B,cov_B2,SE_B,SE_B2);
prob_q1 = 1 - probchi(RSS_wls1,k-1);
*print 'FROM CALCREQ';
*print di_vec n_vec RSS_wls1 B_wls2 B_wlsi vartheta cov_B
       cov_B2 SE_B SE_B2;
FE_T = B_wls2[2,1]/SE_B2[2,1];
PROB_FE = 2#(1 - probt(abs(FE_T),K-2));
PROB_FEZ = 2#(1 - probnorm(abs(FE_T)));
*print 'Fixed Effects Test:' fe_t prob_fe prob_fez;
RE_T = B_wlsi[2,1]/se_b[2,1];
PROB_RE = 2#(1 - probt(abs(RE_T),K-2));
PROB_REZ = 2#(1 - probnorm(abs(RE_T)));
*------------------------------------------------------------;
*record reject and fail-to-reject probabilities for the random- and
 fixed-effects magnitude of effects tests.
*------------------------------------------------------------;
if prob_REZ < .01 then rejprob_REZ101 = rejprob_REZ101+1;
if prob_REZ < .05 then rejprob_REZ105 = rejprob_REZ105+1;
if prob_REZ < .10 then rejprob_REZ110 = rejprob_REZ110+1;
if prob_FEZ < .01 then rejprob_FEZ101 = rejprob_FEZ101+1;
if prob_FEZ < .05 then rejprob_FEZ105 = rejprob_FEZ105+1;
if prob_FEZ < .10 then rejprob_FEZ110 = rejprob_FEZ110+1;
*print 'Random Effects Test:' RE_t prob_RE prob_REZ;
*calculate Qb tests;
run calcqb (di_vec[1:kk[1,1],],di_vec[kk[1,1]+1:k,],n_vec,kk,k,qb,d_plspls,prob_qb);
print 'From CALCQB';
print qb d_plspls prob_qb;
*------------------------------------------------------------;
*Subroutine to run conditionally-random procedure
*conrand=conditionally-random procedure
*------------------------------------------------------------;
conrand = prob_q1;
if prob_q1 < .05 then conrand = prob_REZ;
if prob_q1 > .05 then conrand = prob_FEZ;
if conrand < .01 then rejprob_conrand01 = rejprob_conrand01+1;
if conrand < .05 then rejprob_conrand05 = rejprob_conrand05+1;
if conrand < .10 then rejprob_conrand10 = rejprob_conrand10+1;
print 'Conditionally Random Test:' conrand;
* Permutation test of Qb;
run Qb_exact (di_vec[1:kk[1,1],],di_vec[kk[1,1]+1:k,],n_vec,KK,K,FE_T##2,prob_qb2);
* Record reject/fail to reject for each test;
if prob_q1 < .01 then rejq101 = rejq101+1;
if prob_q1 < .05 then rejq105 = rejq105+1;
if prob_q1 < .10 then rejq110 = rejq110+1;
if prob_qb2 < .01 then rejqb201 = rejqb201+1;
if prob_qb2 < .05 then rejqb205 = rejqb205+1;
if prob_qb2 < .10 then rejqb210 = rejqb210+1;
nsamples = nsamples+1;
end; *end the big loop;
rejprob_REZ101 =
   rejprob_REZ101/nsamples;
rejprob_REZ105 = rejprob_REZ105/nsamples;
rejprob_REZ110 = rejprob_REZ110/nsamples;
rejprob_FEZ101 = rejprob_FEZ101/nsamples;
rejprob_FEZ105 = rejprob_FEZ105/nsamples;
rejprob_FEZ110 = rejprob_FEZ110/nsamples;
rejprob_conrand01 = rejprob_conrand01/nsamples;
rejprob_conrand05 = rejprob_conrand05/nsamples;
rejprob_conrand10 = rejprob_conrand10/nsamples;
* +------------------------------------------------------------+
    Convert counts of rejected hypotheses into proportions
  +------------------------------------------------------------+;
rejq101 = rejq101/nsamples;
rejq105 = rejq105/nsamples;
rejq110 = rejq110/nsamples;
*rejreq101 = rejreq101/nsamples;
*rejreq105 = rejreq105/nsamples;
*rejreq110 = rejreq110/nsamples;
*rejreq201 = rejreq201/nsamples;
*rejreq205 = rejreq205/nsamples;
*rejreq210 = rejreq210/nsamples;
*rejqb01=rejqb01/nsamples;
*rejqb05=rejqb05/nsamples;
*rejqb10=rejqb10/nsamples;
rejqb201 = rejqb201/nsamples;
rejqb205 = rejqb205/nsamples;
rejqb210 = rejqb210/nsamples;
*print 'Tests of Homogeneity of Effect Sizes';
PRINT njs sds dlta kk specific b Tau2 nsamples;
*print 'sk= 0.00, kr= 0.00';
*SHAPE2 print 'sk= 1.00, kr= 3.00';
*SHAPE3 print 'sk= 1.50, kr= 5.00';
*SHAPE4 print 'sk= 2.00, kr= 6.00';
*SHAPE5 print 'sk= 0.00, kr= 25.00';
PRINT rejq101, rejq105, rejq110;
*PRINT rejreq101, rejreq105, rejreq110, rejreq201, rejreq205, rejreq210;
PRINT rejqb201, rejqb205, rejqb210;
PRINT rejprob_REZ101, rejprob_REZ105, rejprob_REZ110;
PRINT rejprob_FEZ101, rejprob_FEZ105, rejprob_FEZ110;
PRINT rejprob_conrand01, rejprob_conrand05, rejprob_conrand10;
quit;

About the Author

Lisa Aaron received a Bachelor's Degree in Psychology from Emory University in 1987 and an M.S. in Rehabilitation Counseling from Georgia State University in 1990. She practiced vocational rehabilitation counseling after graduation from the Master's program, until she entered the Ph.D. program at the University of South Florida in 1992. While in the Ph.D. program at the University of South Florida, Ms.
Aaron worked as a member of the design team for the Teaching of Higher Order Thinking. She also worked as the Supervisor of Title I Evaluation for the Pasco County School District, serving migrant and low socioeconomic students. She has presented papers at regional and national meetings of the American Educational Research Association. She provides consultation to the Working for Success organization affiliated with Hesed House, dedicated to job placement of the homeless.