Thesis (Ph.D.), University of South Florida, 2003.
Includes bibliographical references and vita. Text (electronic thesis) in PDF format; 223 pages.
Effect Sizes, Significance Tests, and Confidence Intervals: Assessing the Influence and Impact of Research Reporting Protocol and Practice

by

Melinda Rae Hess

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Measurement and Research, College of Education, University of South Florida

Major Professor: Jeffrey D. Kromrey, Ph.D.
Kathryn M. Borman, Ph.D.
John M. Ferron, Ph.D.
Cynthia G. Parshall, Ph.D.

Date of Approval: October 30, 2003

Keywords: Research Practices, Practical Significance, Statistical Significance, Educational Research, Confidence Bands

Dedication

It is absolutely inconceivable that I could have hoped to achieve this goal without the support and focus of my parents, Judy and Dean Hess. Regardless of the situations I have found myself in, past and present, they have always encouraged me to pursue my goals, even when things looked bleak. I often thought of quitting, slowing, or delaying this dream due to full-time work, financial concerns, classwork, etc., yet their encouragement, support, and love provided me with the strength and willpower to continue to pursue this goal, regardless of the obstacles present, perceived or otherwise.

I also wish to dedicate this to the memory of my grandfather, Howard Erslev. I have been fortunate to have family on all sides of my family tree who were, and are, good people with strong values. However, my grandfather Erslev has always had a special place in my heart and memory as an individual who knew the value of people as well as hard work. He treated all people well, regardless of race, gender, or background, during a time and place of great prejudice. The values he instilled and exhibited will never be forgotten and, I believe, were key in shaping my values, even before I knew what values were.

So, in every way possible, thanks Mom and Dad.
I dedicate this work to you, my brother Mike, nephew Michael Jr., and my grandparents. There is no way I could have made it through this without you.

Acknowledgements

I am indebted to the entire Measurement and Research Department for providing the support necessary to successfully complete this program. I was very fortunate to have four extremely knowledgeable and accessible researchers willing to be on my committee. I will always look back on the mentorship of Dr. Jeff Kromrey, Dr. John Ferron, Dr. Cynthia Parshall, and Dr. Kathryn Borman with extreme gratitude. Jeff has been a constant inspiration and example of the consummate researcher who, since early in my program, helped me identify my research interests through provision of various research opportunities. John has been a wonderful mentor whose sense of humor provided an additional dimension to this experience, while ensuring I stayed focused on the goal. Cynthia has not only been a wonderful supporter in my research endeavors, but also provided invaluable assistance in obtaining an internship with the Educational Testing Service. Kathryn, who so willingly gave of herself, her time, and her expertise, has provided me with an appreciation of the broader implications of methods research.

Of course, without the support and help of my fellow doctoral students, this would not have been the fantastic experience it was. A special thank you to Tom Lang III, Gianna Rendina-Gobioff, Kris Hogarty, Peggy Jones, and Freda Watson for making this such an enjoyable experience. It was great to have colleagues with whom to laugh, commiserate, and, yes, even get some productive work accomplished at times. Thanks so much for everything.
Table of Contents

List of Tables
List of Figures
Abstract

Chapter One: Introduction
  Statement of Problem
  Purpose of the Study
  Research Questions
  Study Significance
  Limitations
  Definition of Terms

Chapter Two: Review of the Literature
  Reporting Research
  Disciplinary Norms
  Effect Sizes
  Statistical vs. Practical Significance
  Point Estimates vs. Confidence Intervals
  Examples
  Summary

Chapter Three: Method
  Study Type and Description
  Meta-Analysis
  Methodological Research Review
  Sample
  Selection of Journals
  Selection of Published Studies
  Computations
  Confidence Intervals
  Data Analysis
  Reporting Results and Conclusions
  Reliability of Interpretative Results
  Recommendations for Reporting Research Results

Chapter Four: Results
  Characteristics of Selected Studies
  Statistical Significance vs. Practical Significance
  Potential Impact on Results and Conclusions
  Examples
  Summary
  Point Estimates vs. Confidence Intervals
  Potential Impact on Results and Conclusions
  Examples
  Summary

Chapter Five: Conclusions
  Purpose of Research
  Overview of Method
  Impact of Findings
  Statistical Significance vs. Practical Significance
  Point Estimates vs.
Confidence Intervals
  Reporting Results
  Relevant Issues
  Future Research
  Summary

References

Appendices
  Appendix A: Coding Sheet for Studies
  Appendix B: Coding Sheet for Reviewers
  Appendix C: SAS Code
  Appendix D: Summary of 42 Analyses
  Appendix E: Internal Review Board Exemption

About the Author

List of Tables

Table 1. Profile of Journals
Table 2. Citation Scores and Rankings Compared to All Social Science Journals
Table 3. Journal Ranks Relative to Subject-Specific Journals
Table 4. Types of Analyses Included in Number of Articles
Table 5. Effect Sizes and Associated Interpretation
Table 6. Types of Analyses Reviewed by Article Number and Journal
Table 7. Numbers of Analyses Reporting Statistical Significance Relative to Computed Effect Size
Table 8. Number and Percent of Analyses or Sets of Analyses that Warrant Different Degrees of Change When Effect Size Is Considered in Addition to Results of Statistical Significance

List of Figures

Figure 1. An illustration of various confidence bandwidths.
Figure 2. Point estimates (Cohen's f²) of the impact of gender on Mathematics Attitude and Identity.
Figure 3. Point estimates (Cohen's f²) and confidence intervals on the impact of gender on Mathematics Attitude and Identity at a Type I error rate of .05.
Figure 4. Point estimates (Cohen's d) and confidence intervals on the impact of treatment intensity on gains in students' instructional reading levels at a Type I error rate of .05.
Figure 5. Distribution of effect sizes and 95% confidence intervals for all t-test analyses pooled across journals as effect size increases.
Figure 6. Distribution of effect sizes and 95% confidence intervals for all ANOVA analyses pooled across journals as effect size increases.
Figure 7. Distribution of effect sizes and 95% confidence intervals for all Regression analyses pooled across journals as effect size increases.
Figure 8. Distribution of effect sizes and 95% confidence intervals for all t-test analyses as effect size increases, by journal type.
Figure 9. Distribution of effect sizes and 95% confidence intervals for all ANOVA analyses pooled across journals as effect size increases.
Figure 10. Distribution of effect sizes and 95% confidence intervals for all Regression analyses as effect size increases, by journal.
Figure 11. Distribution of effect sizes and 90% confidence intervals for all ANOVA analyses pooled across journals as effect size increases.
Figure 12. Distribution of effect sizes and 95% confidence intervals for all ANOVA analyses pooled across journals as effect size increases.
Figure 13. Distribution of effect sizes and 99% confidence intervals for all ANOVA analyses pooled across journals as effect size increases.
Figure 14. Bandwidth of Cohen's f pooled across journals as total sample size increases for Type I error rates of .01, .05, and .10.
Figure 15. Bandwidth of Cohen's f pooled across journals as the ratio of total sample size to number of groups increases for Type I error rates of .01, .05, and .10.
Figure 16. Effect sizes of statistically significant findings at an alpha of .05, by journal.
Figure 17. Effect sizes of statistically significant findings pooled across journals at an alpha of .05, by analysis type.
Figure 18. Effect sizes of non-statistically significant findings pooled across journals at an alpha of .05, by analysis type.
Figure 19. Percent of effect sizes of 95% confidence band endpoints pooled across journals found in statistically significant analyses.
Effect Sizes, Significance Tests, and Confidence Intervals: Assessing the Influence and Impact of Research Reporting Protocol and Practice

Melinda Rae Hess

Abstract

This study addresses research reporting practices and protocols by bridging the gap between the theoretical and conceptual debates typically found in the literature and more realistic applications using data from published research. Specifically, the practice of using findings of statistical analysis as the primary, and often only, basis for results and conclusions of research is investigated by computing effect sizes and confidence intervals and considering how their use might impact the strength of inferences and conclusions reported. Using a sample of published manuscripts from three peer-reviewed journals, central quantitative findings were expressed as dichotomous hypothesis test results, point estimates of effect sizes, and confidence intervals. Studies using three different types of statistical analyses were considered for inclusion: t-tests, regression, and Analysis of Variance (ANOVA). The differences in the substantive interpretations of results from these accomplished and published studies were then examined as a function of these different analytical approaches. Both quantitative and qualitative techniques were used to examine the findings. General descriptive statistical techniques were employed to capture the number of studies and analyses that might have different interpretations if alternative methods of reporting findings were used in addition to traditional tests of statistical significance. Qualitative methods were then used to gain a sense of how these other forms of reporting findings would impact the wording used in the research conclusions. It was discovered that tests with nonsignificant results were more likely to need evidence of effect size than those with significant results.
Regardless of the outcomes of significance tests, the addition of information from confidence intervals tended to heavily impact the findings resulting from those tests. The results were interpreted in terms of improving the reporting practices in applied research. Issues noted in this study relevant to the primary focus are discussed in general, with implications for future research. Recommendations are made regarding editorial and publishing practices, both for primary researchers and editors.

Chapter One
Introduction

The ever-increasing attention and concern about effective educational practices, as well as the focus on accountability among educators, requires educational research to be as precise and informative as possible. Results of research in education are used in a wide variety of ways, often with potentially critical fiscal, political, and practical implications. As such, current issues in educational research span a wide variety of topics: from decisions on appropriate and critical subjects to be studied and funded (e.g., curriculum effectiveness, student achievement), to how that information should be communicated to key members of the educational community (policy makers, researchers, and practitioners).

One of the outcomes of the call for increased accountability in education is an emphasis on science-based research and assessment of educational effectiveness. The recent No Child Left Behind Act (United States Department of Education, n.d.) legislation is but one example of this increased emphasis on educational accountability. Although it is critical that methods used in research be judiciously selected, carefully designed, and fastidiously implemented, the analysis of the data and reporting of the findings must also reflect a rigorous attitude and practice. Discussions of research methods seem commonplace, yet the criticality of reporting practices and protocols should not be overlooked or marginalized.
A research study may follow all the tenets of sound design and conduct, but if results are not presented properly and thoroughly, it is possible, and maybe even probable, that consumers of the research may be misled or, even worse, misinformed about the strength of meaning and applicability of the findings. Therefore, researchers must be made aware of, and held accountable for, proper reporting procedures and protocols.

Statement of Problem

The need for awareness of, and compliance with, proper and thorough research reporting practices is the primary inspiration for this study, which focuses on the differences in the strength of inferences that may be drawn as a result of how a researcher chooses to present his or her findings. Through the review and analysis of previously conducted and published research, this study illustrates the impact that reporting practices may have on how results are interpreted and presented by researchers. With a clear demonstration of the differences that may result from how findings are reported, it is anticipated that researchers will gain a greater appreciation for the need to approach reporting their results with the same degree of rigor they use when designing their studies and analyzing their data.

Among the vast variety of reporting issues, two in particular have garnered growing interest, and at times conflict, in recent years: (1) how should results be reported to adequately convey their importance and meaning (e.g., significance testing with p-values vs. effect sizes), and (2) how well does the representation of results communicate the precision of the findings (e.g., point estimates vs. confidence intervals) (Thompson, 1998; Nix & Barnette, 1998).
The last two editions of the American Psychological Association's (APA) Publication Manual (1994, 2001), as well as the 1999 report by Wilkinson and the APA Task Force on Statistical Inference, both recommend and encourage the reporting of effect sizes as well as confidence intervals. Fidler and Thompson (2001) provide three very specific recommendations based on the findings of the task force: (1) "Always provide some effect-size estimate when reporting a p value" (p. 599 of the task force report), (2) report confidence intervals, as they provide more information than is available from a yes-or-no decision based on a single point estimate, and (3) use graphical representations of confidence intervals to aid in data presentation and interpretation.

With a variety of factors influencing the most appropriate way(s) of reporting research findings, the debates that result from differing viewpoints about what and how findings should be reported are not likely to be easily resolved. This complexity of influences thus necessitates further exploration of the impact of research reporting practices and protocols. The growing importance of effect size and confidence interval reporting is further supported not only by a seemingly ever-increasing presence of professional journal articles on the topic, but also by a text devoted entirely to the issue of effect sizes (Harlow, Mulaik, & Steiger, 1997). In addition, the summer 2001 publication of an entire issue of Educational and Psychological Measurement devoted primarily to these two topics (Vol. 61(4), August 2001) further underscores their growing importance in the field. Within this text and journal are numerous articles and papers by a wide range of researchers that cover many aspects of effect size and CI reporting, including specific issues with noncentrality, fixed- and random-effects designs, as well as statistical power.
The recognition of the criticality of reporting effect sizes and using confidence intervals by such authorities as the American Psychological Association, as well as professional journals such as Educational and Psychological Measurement, should leave little doubt about the growing recognition of these two statistical measures as necessary elements in solid research reporting. Robinson, Fouladi, Williams, and Bera (2002) note that "Curiously, no researchers have attempted to determine how the inclusion of effect size information might affect readers' interpretations of research articles" (p. 370). One goal of the proposed study is to address this specific issue, albeit indirectly. It is indirect because this study focuses on how using these reporting methods might affect the conclusions, recommendations, and implications reached by researchers, not on empirically assessing readers' interpretations.

Purpose of Study

The primary goal of this research is not to advocate the appropriateness of specific statistical tests (ANOVA, t-test, etc.), effect sizes (Cohen's d, Hedges' g, Cohen's f, etc.), or methods of computing confidence intervals (bootstrapping, Student's t, etc.); rather, it is to provide a sense of how reporting results in different ways may affect the strength of inferences that can be obtained from a study and, as a consequence, the potential impact on results, conclusions, implications, and recommendations made by researchers. It is anticipated that with clear examples and illustrations of how the representation of findings can potentially alter the conclusions drawn from specific research studies, educational and other social science research professionals will gain an even greater appreciation for the importance and criticality of reporting results in a variety of appropriate and meaningful ways to better understand what the data represent.
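The contrast among these reporting modes can be made concrete with a short sketch. This is an illustration only, not the study's method: the two groups of scores are invented, and the interval uses a large-sample normal approximation to the standard error of d rather than the exact noncentral-distribution methods discussed later.

```python
from statistics import mean, stdev

def cohens_d(x, y):
    """Cohen's d: difference of sample means over the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5

# Invented scores for two hypothetical groups (illustration only).
treatment = [12, 15, 14, 16, 13, 17, 15, 14]
control = [11, 13, 12, 14, 12, 13, 11, 12]

d = cohens_d(treatment, control)

# Rough 95% interval around d using the large-sample standard error
# (a normal approximation, not an exact noncentral-t interval).
n1, n2 = len(treatment), len(control)
se = ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5
lo, hi = d - 1.96 * se, d + 1.96 * se

print(f"Cohen's d = {d:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

Reported this way, a reader sees not only whether an effect was detected but also how large it appears to be and how precisely it was estimated, which is the contrast this study examines.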
Potential differences that may result from interpreting data using statistical significance approaches compared to practical significance approaches are vital to understanding why one or the other alone may not be sufficient. Additionally, the context and purpose of the study underscore the interpretation of these two types of measures.

The first issue of interest in this study concerns determining how the significance of results should be reported and interpreted. That is, does one consider statistical significance, as determined by testing a given null hypothesis and focusing on resulting p-values, sufficient? Or should other indices of significance (e.g., effect sizes, such as Cohen's d) be reported instead of, or in addition to, p-values or similar statistical significance measures? Often one finds these two ideas classified, respectively, as statistical significance and practical significance (Fan, 2002; Thompson, 2002a; Robinson & Levin, 1997). Fan presents these two approaches as analogous to two sides of a coin, saying "they complement each other but do not substitute for one another" (2002, p. 275).

The second issue of interest concerns not just what should be reported, but how. Of particular interest is whether a point estimate is sufficient, or whether it is better to use some measure of precision, such as a confidence interval. Complicating this issue even more is the determination of an appropriate method for constructing intervals around such measures as effect sizes, which can be much more complex than the more common and accepted practice of constructing confidence intervals around descriptive statistics such as the mean (Thompson, 2002b).

Research Questions

The main objectives and focus of this research lead to three questions:

1. To what extent does reporting outcomes of tests of statistical significance vs.
tests of practical significance result in different conclusions and/or strengths of inference to be drawn from the results of research?

2. To what extent does reporting confidence intervals instead of, or in addition to, point estimates affect the conclusions and inferences to be drawn from the results of research?

3. What method, or combination of methods, is recommended for reporting results in educational studies?

Study Significance

Today's educational atmosphere is heavily laden with assessment and accountability issues. Researchers need to be attuned to the need for effectively communicating the practical impact of research results in addition to, or possibly in lieu of, merely reporting findings that are statistically significant. The use of effect sizes and confidence intervals can be a key element in aiding this communication. Effect sizes provide a means of measuring practical significance, and confidence intervals convey the precision of results. The difference between a tight confidence interval and a wide confidence interval cannot be overstated when discussing study implications.

Oft-criticized for substandard practices and products (see, for example, Davis, 2001, and Gall, Borg, & Gall, 1996), educational researchers must increase their awareness of, and compliance with, sound research methods, including how they report their research. The increased emphasis on accountability in education is not limited to the practitioner. The educational researcher is also likely to come under closer scrutiny as time progresses and resource expenditures for educational program evaluation continue to climb.
When applied to ongoing research in education as well as the other social sciences, the ability to construct effective and efficient confidence intervals that provide precise data summaries will enable decision-makers at all levels of the educational system to make better decisions based on more precise and accurate information about the effectiveness of interventions, curricula, and other aspects of the educational environment. Technology is available to support these enhanced methods, and there is no viable excuse not to pursue and develop the ability to use confidence intervals instead of point estimates for numerous statistical estimates, including the increasingly critical estimate of effect size.

Limitations

This is an initial investigation into using confidence intervals and effect sizes in addition to, or in lieu of, traditional significance test results beyond the theoretical and conceptual level. It is based on previously reported research and is therefore limited in its ability to predict performance with untested data. That is, it is recognized that reported research is typically research that has been shown to have an effect or significant finding. This study, like many meta-analytic studies, is subject to bias due to the exclusion of research studies that may have fallen victim to the "file drawer" syndrome (Bradley & Gupta, 1997; Rosenthal, 1979; Rosenthal, 1995; Gall, Borg, & Gall, 1996; Reichardt & Gollob, 1997). These studies are likely to have either shown a nonsignificant result or shown evidence in the opposite direction of the hypothesis (Bradley & Gupta, 1997).
Therefore, it is possible that studies that have not been reported because they showed a small or nonsignificant effect might have had a wide confidence band. If those studies were revisited using confidence intervals instead of point estimates, the null hypothesis might not have been subject to a Fail to Reject (FTR) decision in a definitive fashion; rather, the researcher might recognize that the decision to Fail to Reject resulted from a very large confidence band that barely extended to the point of nonsignificance. Such awareness could provide a researcher who has a strong theoretical foundation for his or her alternative hypothesis, but a weak study design, with enough justification to repeat the study with an improved design (e.g., larger sample sizes).

Cohen's effect sizes (Cohen's d, Cohen's f, and Cohen's f²) are just a few of the myriad effect indices available. They were selected for this study for a variety of reasons, including commonality of use and the oft-desired characteristic of standardization when using multiple studies and scales; however, the use of these statistics does not imply that they are always the most appropriate for a given study. The purpose of a study, the nature of the data, and the selection of data analysis methods may make the use of different effect sizes more appropriate. Additionally, even when they may be deemed appropriate statistics for a study, the context and criticality of the study itself are essential for proper interpretation of index values. As the purpose of this study is to investigate how different reporting processes may affect findings, and not to investigate study method, purpose, and/or strength, this contextual issue, though recognized as a valid and important topic, is not considered a primary issue in this study.
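The fail-to-reject scenario described above can be sketched numerically. The observed effect size, the group sizes, and the standard-error formula below (a large-sample approximation for Cohen's d) are all invented or assumed for illustration; they are not drawn from any study in the sample.

```python
# A hypothetical small-sample result: moderate observed effect, nonsignificant test.
d, n1, n2 = 0.45, 20, 20  # invented effect size and group sizes

# Large-sample approximation to the standard error of d.
se = ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5
lo, hi = d - 1.96 * se, d + 1.96 * se

decision = "Fail to Reject" if lo <= 0 <= hi else "Reject"
print(f"d = {d}, 95% CI = ({lo:.2f}, {hi:.2f}): {decision}")
```

Here the interval runs roughly from -0.18 to 1.08: it contains zero, so the test fails to reject, yet it also contains values conventionally labeled large effects. That is exactly the information a bare FTR decision conceals, and exactly the justification for replication with a stronger design.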
Likewise, the consistent use of Cohen's d, Cohen's f, and Cohen's f² throughout the study provides the consistency necessary to make common decisions and comparisons.

A final limitation of this study pertains to who is interpreting the findings. This study is primarily focused on how the researcher(s) of a particular study analyze, interpret, and report the results of their research. Of core interest is an investigation of how different analyses and reporting practices might impact the conclusions and recommendations made by the researcher. Also of interest is how the choice of method for reporting findings may impact the perceived strength of the findings. What is not investigated in this study, but is acknowledged as being of fundamental and vital importance, is the impact of reporting practices and protocols on the consumer of the research, that is, the practitioner who reads and interprets the findings presented. This type of research question has been addressed to a slight degree (Robinson, Fouladi, and Williams, 2002) and deserves further consideration and investigation external to this study.

Definitions of Terms

The following definitions are provided for clarification. Some of the terms used, e.g., practical significance, have various interpretations depending on the source; the definitions provided were chosen to best reflect how they are intended to be used and interpreted within this study.

Cohen's d: One method of computing an effect size, this measure is determined by taking the difference of the two sample means and dividing by the pooled standard deviation (Cohen, 1988).
Cohen's f: An effect size often used with ANOVA significance tests, given by f = σ_m / σ, where σ_m is the standard deviation of the means of the k groups around the grand mean and σ is the standard deviation of the common population. Values can range from 0, when there is no difference between groups, to, at least theoretically, infinitely large as σ_m increases in magnitude relative to σ (Cohen, 1988).

Cohen's f²: An effect size measure calculated in correlational/multiple regression studies, given by f² = PV_S / PV_E, where PV_S is the proportion of variance accounted for by the source, or predictor variables, and PV_E is the proportion of variance accounted for by the residuals (Cohen, 1988; Cohen, Cohen, West, & Aiken, 2003).

Confidence Interval: "An interval containing a range of possible values, so defined that there can be high confidence that the true value, the parameter, lies within this range" (Glass & Hopkins, 1996, p. 261). Boundaries are calculated as a function of the designated level of Type I error. Other variables and characteristics of the study are also taken into account, but these depend on the method of confidence interval estimation used.

Effect Size: An estimate of the magnitude of a difference, a relationship, or other effect in the population represented by a sample (Gall, Borg, & Gall, 1996).

Eta squared (η²): A measure of association used in ANOVA analyses, this is the proportion of variance accounted for by group membership, given by η² = SS_B / SS_T, where SS_B is the Sum of Squares between groups and SS_T is the Sum of Squares Total (Stevens, 1999).

Meta-Analysis: As defined by Hedges and Olkin (1985), "Meta-analysis is the rubric used to describe quantitative methods for combining evidence across studies" (p. 13).
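As a brief illustration of the definitions above: because f² = PV_S / PV_E, and in a one-way fixed-effects ANOVA the explained proportion of variance is η² with residual proportion 1 − η², Cohen's f can be recovered from the same sums of squares that define η². The sums of squares below are invented for the sketch.

```python
# Invented one-way ANOVA sums of squares (illustration only).
ss_between, ss_total = 30.0, 200.0

eta_sq = ss_between / ss_total    # eta squared = SS_B / SS_T
f2 = eta_sq / (1 - eta_sq)        # f^2 = PV_S / PV_E, with PV_E = 1 - eta^2
f = f2 ** 0.5                     # Cohen's f

print(f"eta^2 = {eta_sq:.3f}, Cohen's f = {f:.3f}")
```

The same ratio-of-variance logic underlies the R² definition that follows; only the source of the explained sum of squares differs.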
Multiple Correlation Coefficient (R²): A measure of association that provides the proportion of variance of a dependent variable that can be predicted, or accounted for, by the predictors in the model, given by:

R² = SS_reg / SS_tot

where SS_reg is the Sum of Squares due to regression and SS_tot is the Sum of Squares Total (Stevens, 1999).

Point Estimate: A specific, single quantitative value used to estimate a parameter (Glass & Hopkins, 1996).

Practical Significance: Often associated with effect sizes, this is the concept of evaluating "the practical noteworthiness of results" (Thompson, 2002a, p. 65).

Significance Tests: Statistical tests that lead a researcher to make a decision, either Reject or Fail to Reject. In this study, the Reject-Support approach will be employed (Steiger & Fouladi, 1997), in which a decision to Reject actually supports the researcher's expectations (e.g., that there is a difference in populations), as it is the primary school of thought used in most social science research.

Statistical Significance: A means of using quantitative, probabilistic interpretations to determine whether to Reject (or Fail to Reject) a given null hypothesis (Gall, Borg, & Gall, 1996).

Type I Error: The error that occurs when a researcher incorrectly rejects a true null hypothesis (Glass & Hopkins, 1996, p. 259).

Chapter Two

Review of Literature

This review of the literature is intended to provide a concise yet comprehensive overview of the controversies and explorations relative to significance reporting, as well as the use of point estimates compared to confidence intervals. It is divided into five main areas of review. First, an overview of research reporting practices in education, both in general and as a function of study type and method, is provided. Second, disciplinary norms and the need to consider their influence when reading research from different disciplines are discussed.
Next, a synopsis of effect size uses and characteristics is given. After the discussion on effect sizes, a discourse on the controversy surrounding statistical versus practical significance is presented. And finally, there is an overview of the discussions and differences of opinion regarding the use of point estimates compared to confidence intervals.

Reporting Research

Appropriate, effective, and meaningful reporting practices are critical for communicating research results correctly. Thoughtful interpretation of research and the ability of readers to sift through good and bad research have gone beyond being merely a part of courses in research methodology. Books are now being written to provide readers not only with a sense of interpreting research itself, e.g., Hittleman and Simon's Interpreting Educational Research: An Introduction for Consumers of Research, 2nd ed. (2002), but also with guidance for determining the quality of the research (see, for example, Making Sense of Research: What's Good, What's Not, and How to Tell the Difference (McEwan & McEwan, 2003) and Evaluating Research Articles from Start to Finish, 2nd ed. (Girden, 2001)). The mere fact that there is a market for such books is indicative of the lack of trust in, and/or perceived rigor of, research conduct and reporting. Although poor conduct or design of research must always be a concern, it is also unfortunate that the reporting practices themselves can leave a lot to be desired. The less ethical researcher might alter how he or she reports findings, including only information that supports his or her hypothesis, or present results in such a way as to misinform or mislead the reader. In his book Statistics as Principled Argument, Abelson (1995) provides numerous examples of how this might be accomplished. For example, the conduct of numerous types of tests on the same data may be suspect unless clearly justified. As he illustrates on p.
70, "If you look at enough boulders, there is bound to be one that looks like a sculpted human face." Other issues for which he takes research reporting to task include the use of rhetoric to justify results not quite meeting the desired conclusion (e.g., p-values of .07 when the desired Type I error rate is .05), wording that hints at more in-depth meaning than the data clearly indicate, and findings reached from distributions and/or statistics that are strange (e.g., outliers and/or dips in data distributions, or statistics that are logically too small, too large, or otherwise defy logic). Abelson presents cautions about using statistics (p-values) void of reason, logic, and judgment. While Abelson provides important cautions about interpreting research as well as beneficial guidance on how to use statistics to support research, his concern, along with others', about the misuse of statistics is not new. Almost half a century ago, a still oft-used book by Huff (1954), How to Lie With Statistics, provided the interested reader with numerous examples of how the public had been misled through advertisement and research results during that timeframe. The fact that these types of issues still exist, and may even be worse, is a sad and troubling reflection on current research, especially considering the presumably ongoing advances in statistical methods, applications, and understanding. Educational research specifically is often criticized for poor research practices. In their text titled, appropriately enough, Educational Research, Gall, Borg, and Gall (1996) advise the reader in their section about studying a research report to keep in mind that "the quality of published studies in education and related disciplines is, unfortunately, not very high" (p. 151).
In a review of the analytical practices of studies contained in 17 fairly prominent social science journals, Keselman et al. (1998) noted that "The present analyses imply that researchers rarely verify that validity assumptions are satisfied and that, accordingly, they typically use analyses that are nonrobust to assumption violations" (p. 350). Tuckman (1990) found that when it came to educational research, "much of the work in print ought not to be there" (p. 22). The editor of the Journal of Curriculum and Supervision (Davis, 2001) provides a succinct yet thoughtful discourse on educational research reporting practices in general. While potentially harsh, the issues discussed in this article provide one with a sense of the impact that poor or inadequate research reporting can have on practice. He states on page 9 that "Educational research inattentive to meanings corrupts the enterprise of inquiry and fails its obligation to practice." Davis hints at the possibility that ineffective and inappropriate reporting, hopefully a relatively innocent result of unfortunate ignorance of the subject, context, or proper procedure, may also be intentional on the part of the researcher. As such, he notes that "Educational research has the moral purpose to inform, not to direct or to control, educational practice" (p. 9). Davis also recognizes that the responsibility for good decision-making does not rely solely on the researcher, as the practitioner has a moral duty to be capable enough to discern what the research is telling him or her. However, if the research is not communicated properly and effectively, the practitioner has little, if any, real opportunity to put the research to proper use.

Disciplinary Norms

Understanding that attributes of particular sciences or disciplines differ in many aspects, including written communications, is important when reviewing literature from various disciplines.
Parry (1998) provides a succinct discussion of the importance of disciplinary norms within scholarly writing, including the need to address this issue during the preparation of future academic scholars. She discusses the absence of a clear understanding of what disciplinary norms are and attempts to aid the newcomer to this type of knowledge through an extensive discussion of previous literature on this aspect of research. Essentially, one might think of disciplinary norms as the conventions, rules, and/or practices, explicit or implicit, that one finds within a certain body of scholarly literature relative to a given discipline. According to Becher (1987), there are broad disciplinary groupings that encompass a wide variety of disciplinary norms. Furthermore, the conventions of writing and language within different disciplinary norms vary and often are not explicit in nature; rather, these norms must be learned through observation within different disciplines and subdisciplines. As such, Gersholm (1990) asserts that many of these norms are implicit and must be learned through tacit means. Social science research reporting, according to Bazerman (1981), tends to lean toward persuasion due to the potential differences in methodological and theoretical frameworks in the scholarly community. He also identifies six attributes that may contribute to differences in written research as a function of discipline. These attributes include conventions regarding the type of knowledge, traditions, external accessibility of knowledge, degree of technicality, methodological and theoretical considerations, and writing mechanics associated with a given discipline.
Becher (1987) asserts that four overlapping domains exist within linguistic preferences and styles in different disciplines: modes of formal scholarly communication; how writers assert field-unique tacit knowledge; guiding conventions for citing and referencing previous research; and traditions of argument structure. Depending on the disciplinary umbrella under which research is written, different practices and accepted conventions may be evidenced in different manners depending on the particular field in which the research is conducted and disseminated. As such, it is necessary for consumers of research originating in different disciplines to acknowledge that underlying differences exist and, at a minimum, be sensitive to those differences when considering the quality, nature, and intention of the research.

Effect Sizes

Effect size has become increasingly recognized as an important statistic that needs to be reported. Numerous field experts have stressed the need for effect size reporting throughout the social sciences, including education (Nix & Barnette, 1998). Both the fourth and fifth editions of the Publication Manual of the American Psychological Association (1994, 2001) highly recommend that researchers report effect sizes. Often termed practical significance or, sometimes, substantive significance (Robinson & Levin, 1997), effect sizes provide a different, albeit related, piece of information about how a treatment or other variable impacts the issue of interest. There are various effect size indices available, as well as different terms used when referencing effect sizes. Some of the descriptors for effect size estimates include percent of variance accounted for, strength of association, and magnitude of effect, among others (Plucker, 1997). Additionally, correlation coefficients such as Spearman's rho and the Pearson Product Moment Correlation Coefficient are sometimes considered a type of effect size (Plucker, 1997).
Hedges' g, Glass's Δ, and Cohen's d are all variations of effect sizes for differences in means between two groups (Rosenthal, 1994; Cohen, 1988). For studies using statistical methods examining correlational or variance relationships, measures such as eta-squared (η²), R-squared (R²), and omega-squared (ω²) are available (Snyder & Lawson, 1993). In his book Statistical Power Analysis for the Behavioral Sciences, Cohen (1988) provides effect sizes for various types of analyses, including those that can be used with t-tests, chi-square tests, and multivariate tests, just to name a few. Ultimately, of course, the selection of effect size indices is a function of many considerations, including the purpose of the research, the data analysis to be employed, and the nature of the data. For example, a decision on whether to use Hedges' g or Glass's Δ may depend on the disparities between the groups in sample size and variance (Rosenthal, 1988).

Statistical vs. Practical Significance

The literature over the past decade seems inundated with articles and tomes pleading for, at a minimum, the inclusion of effect sizes when reporting research results (see, for example: Plucker, 1997; Thompson, 1998; Thompson, 1999a; Fan, 2001; Fouladi & Williams, 2002). In his review of studies reporting effect sizes in gifted education, Plucker describes the relationship between statistical significance and practical significance as analogous to a chasm in the earth. In his illustration, he uses the p-value of a significance test as the indication that the chasm exists, and the effect size reported as the measure of the width of the chasm. Both of these concepts of significance, as they tend to be thought of today, are products of the last century.
During the early 1900s, such groundbreakers of modern statistical concepts as Karl Pearson, Ronald Fisher, and Jerzy Neyman, among others, provided the conceptualization and formal development of null hypothesis based significance testing (Harlow, 1997). However, it wasn't until around the middle of the 20th century that significance tests started taking a dominant role in the research literature. Hubbard and Ryan (2000) reviewed articles in 12 prominent journals of the American Psychological Association and found that until 1940, significance tests appeared in empirically based research only about 40% of the time or less. By 1960, the popularity of significance tests had risen to such a degree that over 90% of empirical research reported findings using some type of significance-based analysis. Interestingly, it was during this rise in publication popularity that the notion of statistical inference testing using a null hypothesis approach began acquiring a vocal set of detractors (Mulaik, Raju, & Harshman, 1995; Rozeboom, 1960). As time has progressed, the popularity of reporting significance tests has continued, while at the same time the debates about using other reporting methods, e.g., effect sizes and confidence intervals, have grown stronger and more frequent. A portion of researchers go so far as to advocate the use of effect sizes in place of, not merely in addition to, traditional significance tests (Schmidt & Hunter, 1997; Meehl, 1997). Others are more moderate and take a middle-of-the-road approach, arguing that effect sizes and/or tests of significance are both useful, depending on the context and purpose of the research. Mulaik, Raju, and Harshman (1997) provide arguments for the inclusion of indices of practical significance in many cases but also suggest that the elimination of significance testing is neither warranted nor desired.
They illustrate how influences of factors such as the power of a given study may limit the desirability of relying on significance tests, but argue that significance testing has an objective nature that requires the researcher to form an opinion based on theory and/or previous research before conducting the analysis. This required assertion of a formal hypothesis a priori to data analysis helps preserve a certain sanctity of the research by avoiding potentially inappropriate data-driven hypothesizing about the effectiveness of a given treatment or study effect. Regardless of the position held by individual statisticians and researchers, there is little doubt that this topic is one of the hot buttons of debate in educational research today. Within the past few years, an entire text was dedicated to this issue (Harlow, Mulaik, & Steiger, 1997), as well as an issue of Educational and Psychological Measurement (Vol. 61(4), 2001). However, it would be a mistaken notion to consider this an issue of recent origin. According to Schmidt and Hunter (1997, p. 58), a discourse by Jones in 1955, in Volume 6 of the Annual Review of Psychology, was one of the first, if not the first, to argue for the replacement of statistical significance with effect sizes (as well as confidence intervals). Since then, the topic has ridden a wave of periodic attention, often becoming the topic du jour for a period of time before taking a back seat to other topics of interest for a few years and then once again coming back to the forefront of attention. However, over the past decade, this issue has taken on a new and stronger life among researchers and, rather than waning, appears to be continuing to gather momentum.
From the aforementioned dedicated text and journal issue to the stronger stance taken by the APA on reporting requirements, resulting, at least in part, from the findings of the Statistical Task Force of 1999-2001 (Wilkinson, 2001), enhanced attention to the issues of effect size reporting and the use of confidence intervals is evident. While the stance and beliefs of individual researchers are critical to their personal motivation to report effect size estimates, actual reporting of such estimates is also an indirect result of what publishers and journal editors demand and expect in submissions. In general, support for effect size reporting is growing as more professional journals across disciplines require such statistics for consideration for publication. At least 17 such journals, spanning areas of interest from careers, education, counseling, and agricultural education, currently require this information (Fidler & Thompson, 2001). Unfortunately, even though a growing number of journals are requiring effect sizes to be reported, many are not enforcing their own mandates for publication. A review by McMillan, Snyder, and Lewis (2002) of 13 journals that require effect size reporting revealed that most of those journals were not enforcing this particular constraint. Additionally, Devaney (2001) found in a survey of journal editors that while 93% of those surveyed agreed with the importance of effect size reporting, 73% indicated that inclusion of effect size information was not a requirement for consideration of a manuscript. These findings seem to indicate that while there is indeed a perceived need to report effect size information, there is little, if any, enforcement of such reporting.
The reasons for this are not clear, and it may well be the case that editors and others who make critical decisions about what research is noteworthy require more evidence about how the reporting of findings may impact conclusions and the relative significance of findings resulting from a particular study.

Point Estimates vs. Confidence Intervals

Confidence intervals have been accepted for quite some time as a useful method for describing statistical parameter estimates such as sample means, and their use can be traced back at least three decades (Meehl, 1967). The use of statistics to describe population parameters is an imprecise science, and the use of confidence bands around a given statistic allows researchers to gauge the precision of that statistic, which can help determine the strength of the conclusions and inferences that can be drawn. Unfortunately, confidence intervals do not appear as frequently in research as might be desired. Reichardt and Gollob (1997) provide eight reasons why this might be the case.
These reasons, summarized, are: (1) conventional use of statistical tests precludes consideration of the use of intervals; (2) lack of recognition by researchers of situations conducive to the use of intervals; (3) less frequent production of intervals by computer programs as compared to the results of statistical tests, e.g., p-values; (4) the diminished size of the actual parameter estimate and associated confidence interval is less impressive than reporting statistical significance alone; (5) the magnitude of the interval width might be large enough to inhibit the potential for publication acceptance; (6) some statistical tests, e.g., the chi-square test of association for a 2x2 table, do not have a unique parameter defined, thus necessitating additional steps to identify appropriate measures; (7) criticism of statistical tests, sometimes itself incorrect, rather than advocacy of interval strengths, dissuades use; and (8) the incorrect and inappropriate association of interval-use advocacy with the banning of statistical testing undermines, and thus discourages, the acceptance and application of confidence intervals. These reasons for not using confidence intervals seem to fall into three main types of justification. The first general type of aversion to using confidence intervals is, perhaps, the least alarming. The lack of use resulting from reasons (1), (2), or (3) appears to result from a lack of knowledge and awareness of either the methods or the tools available. These obstacles to using confidence intervals are likely to diminish as awareness increases and computer programs continue to increase in sophistication. The second broad category of reasons for which one might be reticent to use confidence intervals seems to center around a researcher's concern that his or her research won't get published or recognized because confidence intervals or point estimates might diminish the strength of the findings (reasons (4) and (5)).
These types of justifications (and the associated ethical issues) seem to be, in some regards, the most insidious of the three and are likely contributors to the skepticism with which research is often viewed. The final broad category encompasses the last two items on the list. The lack of use of intervals due to these concerns has a more philosophical flavor; it may be a factor of personal comfort with techniques and tools learned early in one's career (e.g., significance testing) and may be overcome by better communication of the benefits of confidence intervals and less vilification of significance testing. Although there are issues associated with the lack of universal use of confidence intervals in research reporting, there have been recent advances in using confidence intervals for statistics other than the mean and standard deviation. The use of confidence intervals for other statistical estimates is quickly growing as a way of reporting more informative estimates than point estimates alone. Cumming and Finch (2001) provide four reasons for researchers to give confidence interval estimates when reporting research findings: (1) confidence intervals provide both point and interval information, which improves understanding and interpretation; (2) the use of intervals enhances the practice of traditional null hypothesis reporting rather than negating it; that is, if a specific null value being tested falls outside the computed interval, the null hypothesis is rejected, but with more precision; (3) the use of CIs may serve meta-analytic methods, which focus on estimation using many sources of study data; and (4) information about the precision of the study and subsequent findings may be gained through the use of intervals. In Figure 1, the results of four hypothetical studies are illustrated with computed confidence bands around the effect size (Cohen's d, in all cases).
[Figure 1. An illustration of various confidence band widths: Cohen's d point estimates with confidence bands for four hypothetical studies.]

In studies 1, 2, and 4, the decision based on statistical significance testing would have been to Reject the null hypothesis. However, this illustration helps demonstrate that the strength of the inference to be drawn from such a conclusion is not consistent. Whether one considers effect size in addition to statistical significance, and/or confidence intervals in addition to point estimates, can dramatically impact how one interprets the findings and the certainty one places on the associated Reject or Fail to Reject decision. In study 1, a report of the effect size point estimate only would support the findings of the significance test; however, the lack of precision of the results indicates that the population effect size might be as small as 0.12, a rather minor effect, or as large as 1.28, a very large effect. In this case, the reporting of the effect size doesn't really change how one views the results; however, the inclusion of confidence intervals very well might have an impact on the interpretation of findings. In study 2, the opposite phenomenon occurs. In this case, the confidence interval is very tight. A bandwidth of 0.12 indicates high precision of the estimate, and one is likely to be confident that there is a statistical difference found in the study. However, an effect size of 0.23 is considered small by Cohen, so although one is likely to have little doubt that there is really a difference, the practicality of the difference is very small. At this point, the context and purpose of the study would be the primary determinants in deciding whether such a small measure of practical significance is worth pursuing.
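Confidence bands like those illustrated in Figure 1 can be approximated in a few lines. The sketch below uses a common large-sample normal approximation to the variance of Cohen's d (Hedges & Olkin, 1985), not the more precise interval inversion or bootstrap methods discussed later, and the effect size and sample sizes are hypothetical.

```python
def cohens_d_ci(d, n1, n2, z=1.96):
    """Approximate 95% CI for Cohen's d using the large-sample variance
    var(d) ~ (n1 + n2)/(n1*n2) + d^2 / (2*(n1 + n2)) (Hedges & Olkin, 1985)."""
    se = ((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))) ** 0.5
    return d - z * se, d + z * se

# A d of 0.70 from two hypothetical groups of 25 yields a wide band
lower, upper = cohens_d_ci(0.70, 25, 25)
print(round(lower, 2), round(upper, 2))  # prints: 0.13 1.27
```

Note how a statistically significant d of 0.70 is compatible with population effects ranging from trivial to very large, which is precisely the interpretive point the figure makes.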
In study 4, neither the use of a measure of practical significance nor a confidence interval has the potential for as dramatic an impact on interpretation as in the first two studies. In this case, although the confidence band still indicates a rather large amount of error in the sample, the effect size is large enough that, at a minimum, the effect is moderately strong (d = .55). The final study considered, study 3, may illustrate one of the most compelling reasons to use confidence intervals, especially when one Fails to Reject the null. In this case, using statistical significance tests alone would likely result in the unfortunate file drawer syndrome (Bradley & Gupta, 1997; Rosenthal, 1992; Rosenthal, 1979) previously discussed. The researcher would put away this particular line of research inquiry and pursue other endeavors. Using effect sizes and/or confidence intervals, however, the results of the significance test lose quite a bit of credibility. The effect size of 0.9 is large by virtually any standard, and the confidence interval clearly indicates that the decision to Fail to Reject was not reached by a large margin. If nothing else, this type of result would indicate that further pursuit of this research is warranted, hopefully with attention paid to increasing the power of the study through larger samples, better controls, a more potent treatment, etc. Estimates made prior to conducting a particular study can help guide and inform study design, while follow-up of results will provide greater precision about the potential interpretation and inferences that can be drawn from the findings. Confidence intervals provide a measure of precision for statistics and can give decision makers a better sense of how strong or reliable a reported statistic actually is. Methods of constructing confidence intervals are as much of a concern as whether to use them at all.
Factors such as sample size, distribution shape, variance heterogeneity, and reliability must be taken into consideration, as well as the nature of the parameter to be estimated, when deciding on an appropriate method of constructing these intervals. Confidence intervals for descriptive statistics such as the mean and standard deviation are fairly commonplace and have been in use for many years. It is only in more recent years that the construction of confidence intervals around statistics such as the multiple correlation coefficient, Cohen's d, Cronbach's alpha, and others has been investigated (see, for example, Steiger & Fouladi, 1997; Fidler & Thompson, 2001; Carpenter & Bithell, 2001; Fan & Thompson, 2001). Although the arguments for effective construction of confidence intervals for a larger variety of statistics have been around, at least theoretically, for many years, it is only within recent years, due, at least in part, to the explosion in the sophistication of technology, that more computationally demanding methods such as Steiger and Fouladi's interval inversion method (1992) have been able to be implemented. Nine techniques for constructing confidence intervals have recently been examined using Monte Carlo techniques for the indices of practical significance to be used in this study (see Kromrey & Hess, 2001, and Hess & Kromrey, 2003, for details). In general, the Steiger and Fouladi interval inversion method (Steiger & Fouladi, 1992) and the pivotal bootstrap method (Carpenter & Bithell, 2001) showed the best results, followed by the normal Z computation for approximately homogeneous samples (Kromrey & Hess, 2002; Hess & Kromrey, 2003). Due to the design of this study, a bootstrap technique such as the pivotal bootstrap is not tenable, as only summary data and statistics were expected to be available.
Therefore, confidence interval construction was limited to using the most promising equation-based algorithm found in these studies for the type of analysis considered, e.g., the Fisher Z-transformation for R², as well as the computer-intensive Steiger and Fouladi methods. Both the hyperbolic sine transformation and Student's t show some promise in selected applications; however, they did not add anything beyond the simpler computations chosen and therefore were eliminated as unnecessary transformations.

Examples

To illustrate the potential impact of using different reporting practices, or a combination of reporting practices, two studies that reported significant findings were examined. In the first study (Nosek, Banaji, & Greenwald, 2002), the researchers were interested in investigating how an individual's implicit and explicit attitudes and self-identities regarding math and science differed from his or her implicit and explicit attitudes and self-identities regarding the arts, as a function of gender. The authors reported significant findings on students' math/arts attitude and identity depending on gender, using an alpha of .05. The study did not report effect size information or confidence intervals. Using the data provided, effect sizes for a correlational analysis, f², were computed (Figure 2). Using the guidance provided by Cohen (1988), these results reflect effect sizes approaching large (attitude f² = 0.32) and medium (identity f² = 0.14) measures of practical significance. The magnitude of these effects tends to provide further support for the researchers' finding that gender has a significant impact on students' academic attitude and identity; however, the strength of these assertions is somewhat diminished when one computes confidence intervals around these effect sizes (Figure 3). The use of confidence intervals provides additional information that should impact the types of conclusions drawn and the merit given to those conclusions.
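For a correlational analysis like the one above, f² can be obtained from the model R² via f² = R²/(1 - R²), and Cohen's (1988) conventional benchmarks (0.02 small, 0.15 medium, 0.35 large) serve as rough verbal labels. The R² value below is hypothetical, not taken from the Nosek et al. data.

```python
def f2_from_r2(r2):
    """Cohen's f^2 from a model R^2: f^2 = R^2 / (1 - R^2)."""
    return r2 / (1 - r2)

def label_f2(f2):
    """Rough verbal label using Cohen's (1988) f^2 benchmarks."""
    if f2 >= 0.35:
        return "large"
    if f2 >= 0.15:
        return "medium"
    if f2 >= 0.02:
        return "small"
    return "negligible"

# A hypothetical model explaining 24% of the variance
f2 = f2_from_r2(0.24)
print(round(f2, 2), label_f2(f2))  # prints: 0.32 medium
```

As the example shows, an R² of 0.24 corresponds to an f² of about 0.32, which sits between Cohen's medium and large benchmarks, matching the "approaching large" characterization used above.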
[Figure 2. Point estimates (Cohen's f²) of the impact of gender on mathematics attitude and identity.]

[Figure 3. Point estimates (Cohen's f²) and confidence intervals on the impact of gender on mathematics attitude and identity at a Type I error rate of .05.]

While the relatively large width around the attitude measure does not weaken the argument for gender's impact on attitude too severely (the lower limit still reflects a medium effect), the confidence interval around the identity variable provides evidence that the impact of gender on a student's math/arts identity may not be very influential after all. The lower limit in this case is 0.02, a very small, almost nonsignificant effect. In this study, the provision of confidence intervals adds important information necessary to report the findings adequately and comprehensively. In another study (Fitzgerald, 2000), the researcher, investigating the impact of an intervention on students' reading achievement as, at least in part, a function of the intensity of participation, reported significant differences between students who received the treatment for the duration of the program (25 weeks) and those students who were enrolled in the treatment for only a fraction of the program (6-12 weeks). Similar to the first study, the calculation of the effect size (Cohen's d) still tended to support the author's conclusion about effectiveness (d = 0.7, a large effect according to Cohen); however, the construction of confidence intervals around the effect size (Figure 4) again weakens the definitiveness with which one might regard the results.
In this case, the confidence interval is approximately one full standard deviation wide, with a lower limit reflecting a very small effect and an upper limit reflecting a huge effect. The imprecision of the measurement should be clearly represented in reported findings, using a tool such as a confidence interval, to fully inform the reader.

Figure 4. Point estimates (Cohen's d) and confidence intervals on the impact of treatment intensity on gains in students' instructional reading levels at a Type I error rate of .05.

Summary

While recognition of these two elements of research reporting, effect sizes and confidence intervals, appears to have grown over the last decade, they are not new to debate among statisticians and researchers. The theoretical knowledge and conceptual basis of effect sizes can be traced back to early in the 20th century (Harlow, 1995). The use of confidence intervals as they are currently applied can be traced back at least three decades (Meehl, 1967). However, it is only the recent advances in technology and the availability of high-powered computers to the average researcher that have enabled the use of more advanced and precise techniques. Statistical software packages available commercially in the past few years readily compute and report statistics that used to require extensive programming and calculation by the researcher (Fidler & Thompson, 2001). These computations, probably taken for granted by many researchers in the past few years, are recent developments when one considers the historical evolution of these tools. Given that these reporting issues have a long history in the statistical and research world, an attempt to at least broach the issue from an applied setting is called for.
Additionally, since the lack of appropriate mechanisms and necessary technology is no longer a barrier to conducting this type of research, it is imperative that beginning steps be taken to bridge the conceptual and theoretical world of research and connect it with the realistic and applied world of research. This study is intended to begin building such a bridge.

Chapter Three

Method

The general purpose of this study, to investigate the impact of reporting practices on the types of conclusions reached by researchers, is supported by three questions:

1. To what extent does reporting outcomes of tests of statistical significance vs. tests of practical significance result in different conclusions and/or strengths of inference to be drawn from the results of research?

2. To what extent does reporting confidence intervals instead of, or in addition to, point estimates affect the conclusions and inferences to be drawn from the results of research?

3. What method, or combination of methods, is recommended for reporting results in educational studies?

To address this purpose and the associated questions, this study goes beyond the rhetoric and philosophical arguments currently found in most of the published literature on this issue. Rather, actual studies already deemed worthy of professional consideration and use by others in the field, as evidenced by publication in well-known, peer-reviewed journals used throughout professional circles, were examined to determine whether alternative conclusions, and/or differences in inferential strength, might have resulted from different analysis and reporting procedures.

Study type and description

The nature and objective of this study are such that it does not fit cleanly into one classification or type of study. It uses techniques that are both qualitative and quantitative in nature but is not explicitly one or the other.
As such, it takes on a mixed method approach and might reflect the type of study that Tashakkori and Teddlie (1998) call a mixed model design with multilevel uses of data, using different types of analyses and methods of analysis at different levels of the study. Summary data, not original raw data, are used, so it cannot be considered a secondary data analysis. Probably the closest description of this study would be to consider it a mixed method design with a blending of meta-analytic methods (Hedges & Olkin, 1985) and a methodological research review (Keselman et al., 1998).

Meta-Analysis. While there is evidence of research synthesis across studies as far back as 1904 (Cooper & Hedges, 1994, p. 5), the now common term, meta-analysis, debuted courtesy of Glass (1976). He defined meta-analysis as "the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings" (p. 3). Over the past three decades, the use of meta-analysis has increased at a tremendous rate, not only in the social sciences but also in other fields such as medical research. According to Cooper and Hedges (1994), only four books and two major papers emerged in the first half of the 1980s. This rather limited number of resources has expanded virtually exponentially over the last decade and a half. A cursory search of the literature reveals a much more detailed list of resources dedicated to meta-analysis, its techniques, uses, and applications. Cooper and Hedges also discuss how studies using meta-analytic techniques have increased in conjunction with resource materials. A search of three major databases (PsycINFO, ERIC, and Social Scisearch) over a 15-year period (1974 to 1989) revealed almost no meta-analytic studies in the first four years considered, 1974 to 1977.
Beginning in 1978, approximately exponential growth was seen, from about 18 studies reported across the three databases in 1978 to almost 300 meta-analytically designed studies in 1989. It is highly likely that this type of growth in the use of meta-analysis in social science research has continued. Traditionally, meta-analysis is used to synthesize findings across studies with a common theme or substantive research question, e.g., the impact of gender differences on mathematics or the effectiveness of new medications for members of different populations. In traditional meta-analytic studies, researchers gather primary research studies pertinent to their topic of interest (often with varied and disparate findings and conclusions), code articles to determine relative strengths and weaknesses, and perform statistical calculations, typically in the form of effect sizes, to determine the effectiveness of a treatment, the magnitude of difference between groups, etc. There are myriad forms these different steps can take, any one of which would likely be worthy of further investigation. However, this study, while using a meta-analytic approach through the synthesis of findings from different studies, has a slightly different research focus. Rather than targeting a specific topic or applied research question, a meta-analytic philosophy was used to examine the effect of the chosen statistical analysis and the chosen reporting method(s) on the interpretation of findings, using various studies found in published research that have potential implications for educators.

Methodological Research Review. Since this study is more focused on method and practice, one might also consider it to be, at least in part, a methodological research review. According to Keselman et al. (1998, p. 351), these types of reviews tend to have two main purposes: (a) to form a basis for recommending improvements in research practice, and (b) to use as a guide for procedures to be taught in methods courses.
The American Educational Research Association offers the following definition of a methodological review to be considered when submitting an article to its journal Review of Educational Research: "descriptions of research design, methods, and procedures that can be employed in literature reviews or research in general. The articles should highlight the strengths and weaknesses of methodological tools and explore how methods constrain or open up opportunities for learning about educational problems. They should be written in a style that is accessible to researchers in education rather than methodologists" (AERA, 2003). A review of some of the studies that used this phrase, Methodological Research Review, or some derivation of it such as Methodological Review, Research Review, etc., finds a rather wide umbrella of study goals and designs. In Barnett, Docherty, and Frommelt (1991), the authors reviewed 43 studies published since 1963 for a broad range of methodological flaws within a very specific topic of study, child and adolescent psychotherapy. Other studies are more specific about the method type they are interested in and less concerned about the substantive topic at hand. For example, Morgan (1996) investigated appropriate methods for a specific strategy of data collection, focus groups, across academic and applied research areas. Other studies may have a mix of specificity regarding both the method of interest and the topic, or domain, of interest. In DiPrete and Forristal's (1994) study, they reviewed a fairly specific family of methods, multilevel models, used within a broad yet focused area of study, sociology, over a more restrictive span of time, 10 years. Similar to the issues about classifying this as a meta-analysis in the traditional sense, this study cannot be considered a pure methodological review either.
Rather, statistical methods are being augmented within each published study to determine the potential impact of such changes in reporting.

Sample

Previously conducted social science research studies with either a direct or indirect educational implication were gathered and reviewed. Studies were drawn from a limited number of education and social science journals in order to restrict variation in research rigor that may be influenced by publication source as well as targeted audience. Obviously, within the journals selected, there was the influence of publication source; however, by limiting the number of journals used in this study, it is hoped that this publication bias was minimized. Additionally, consideration must be given to the idea of disciplinary norms. Disciplinary norms address differences in how professionals within various disciplines communicate, including conventions regarding the conduct and reporting of their research. As such, the sampling for this research addressed research contained within the broad umbrella of specific disciplines within the social sciences. Although articles selected for inclusion were required to have either a direct or indirect educationally oriented focus, at least a portion of the journals in the sample were written for audiences that included not only educators but also psychologists and other social scientists. The specific number of studies from each journal varied slightly, due to differences in the frequency of publication and the number of articles per publication; however, the goal of a minimum of ten studies extracted from each journal was met (see Table 6). Additionally, in order to attain a representative sample of current research reporting practices, only studies published within a five-year time frame (July 1998 to June 2003) were considered for inclusion.
Selection of Journals

Considerations leading toward journal selection included a review of journals sponsored by professional organizations such as the American Psychological Association, the National Council of Teachers of Mathematics, and the American Educational Research Association. The characteristics of the types of studies published were of key importance, as many of the journals reviewed were primarily methodologically based, e.g., the Review of Educational Research, or contained a majority of studies that were not of a nature conducive to inclusion, such as those using many qualitative types of studies, e.g., the Journal for Research in Mathematics Education. Other considerations for selection included whether or not journals utilized a peer-review process as well as their longevity in the field. A final consideration was the frequency of use and consultation of the selected journals, as evidenced by their availability in libraries and the frequency of citations by other journals. These criteria were identified to maintain some degree of similarity both in expected research rigor and in exposure to more recent advances in research methods and philosophy. Based on a preliminary review of journals currently in use in the social sciences, three journals were identified as the primary sources for studies to be reviewed. After a preliminary screening of the recent five-year collection of studies within each journal, it was determined that a sufficient number of studies with the required data were available within each of the three journals. The journals included in this study as the sources of the research studies analyzed are: (1) Reading Research Quarterly, (2) the Journal of Educational Research, and (3) the Journal of Personality and Social Psychology. These three were selected after a review of journals used in the social sciences and consultation with individuals familiar with research-based professional journals, using the criteria and considerations previously discussed.
All three have a national or international research audience and contain empirically based research with educational consequences. Additionally, the three represent journals whose audiences vary in scope. The first, Reading Research Quarterly, the flagship journal of the International Reading Association, is of primary interest to educators with a focus on literacy issues. The Journal of Educational Research has a broader audience, including educators of various academic disciplines as well as roles, e.g., administrators. The final journal, the Journal of Personality and Social Psychology, reaches beyond the educational community and encompasses the entirety of social science professionals. The difference in aspects of disciplinary norms associated with the different primary target audiences of these journals must be taken into consideration. While there is some concern that the research contained in these journals may differ in the type of knowledge represented as well as in technical depth, the fact that all three journals fall within the realm of social science research is likely to minimize the impact of such differences. To some degree, the audience of the smallest-scope journal might include the readers of the other two, and the audience of the Journal of Educational Research might include readers of the third; however, this is not a reciprocal relationship. This difference in scope may be of potential importance regarding the impact of research rigor and reporting methods relative to the type and size of the intended audience. All three journals are disseminated worldwide and were thus readily accessible. Table 1 contains a brief profile of each journal regarding the source and frequency of publication, as well as a summary of the number of libraries currently subscribing to each journal (University of South Florida Virtual Library, n.d.).
This table illustrates the diversity of the types of journals contained within the broad context of educational research, not only in scope of topic but also in sponsoring organization and frequency of publication. A review of the Journal Citation Reports, Social Sciences Edition (Institute for Scientific Information, 2002) indicated varying degrees of strength of use, as evidenced by the frequency of citations in other journals (Table 2). The Impact Score is intended to provide an indication of a journal's relative importance to the field and is calculated by dividing the number of citations during a given year, in this case 2001, to articles published during the preceding two years (1999 and 2000) by the number of articles published during those two years. The Immediacy Score, a measure intended to provide an indication of how quickly the journal is cited, is calculated by dividing the number of citations in a given year to articles published in that same year by the number of articles published that year. The 1999 and 2000 articles of the Journal of Educational Research (JER), for example, were cited 29 times in 2001. During that time (1999 and 2000), JER published 71 articles. To calculate the Impact Factor, we divide 29 by 71, which yields 0.408. Likewise, 1 of the 29 articles JER published in 2001 was cited during that same year, resulting in an Immediacy Index of 0.034.

Table 1. Profile of Journals

Journal Name | Sponsoring Organization | Frequency of Publication | Libraries Subscribing
Journal of Personality and Social Psychology | American Psychological Association | Monthly | 1,683
Journal of Educational Research | Heldref Publications | Bimonthly | 1,661
Reading Research Quarterly | International Reading Association | Quarterly | 1,190

The differences in the number of citations and the other indices are not, for the purposes of this study, considered problematic, due to the substantive differences in the target audience of each journal as well as the differences in frequency of publication.

Table 2.
Citation Scores and Rankings Compared to All Social Science Journals

Journal Name | Impact Score (rank) | Immediacy Score (rank) | Citations in 2001 (rank)
Journal of Personality and Social Psychology | 3.61 (24) | 0.48 (142) | 23,565 (3)
Journal of Educational Research | 0.41 (1075) | 0.034 (1142) | 395 (606)
Reading Research Quarterly | 1.87 (139) | 0.15 (560) | 922 (280)
Total number of journals in the Social Sciences Journal Citation Report = 1,682

The journals were ranked relative to the entire body of social science journals as well as to those in their specific discipline. Although the journals were ranked at widely disparate levels when compared to other social science journals overall (Table 2), the strength of their rankings was enhanced when compared to other journals in their discipline (Table 3). The only one of the three not in one of the first two ranks in its discipline was the Journal of Educational Research. However, the 47 journals preceding JER in the Education and Educational Research category showed a lack of fit for this study in focus, content, or scope. Only eight of the higher-ranking journals were research focused, and of those, five were subject specific, e.g., Health Education Research (rank: 11), and three were methodologically or review oriented, e.g., the Review of Educational Research (rank: 1). The highest-ranked subject-specific research journal, Reading Research Quarterly, was selected for this study as the subject-specific journal. The Journal of Educational Research was the highest-ranked research journal with a general educational focus that contained primarily empirically based research. As such, it was considered the most acceptable for use in this study when all factors were taken into consideration.

Table 3.
Journal Ranks Relative to Subject-Specific Journals

Journal Name | JCR Subject Category | Journals in Category | Rank
Journal of Personality and Social Psychology | Psychology, Social | 43 | 1
Journal of Educational Research | Education and Educational Research | 92 | 48
Reading Research Quarterly | Education and Educational Research | 92 | 2

Selection of Published Studies

Studies were considered for inclusion which, to the extent possible, met the following selection criteria: (1) availability of all the statistical estimates needed to permit calculation of the appropriate effect size (if the effect size was not reported in the published report) and confidence intervals, including, but not limited to, means, standard deviations, and sample sizes; (2) use of the analyses of interest as a primary basis for reported results, conclusions, and recommendations; and (3) a design conducive to the purposes of this research, e.g., research examining differences between two or more groups (t-test or ANOVA designs) or employing regression/correlational designs. Although it would have been ideal if other key information, such as reliability indices and data distribution information, had been included to help ascertain the soundness of a given study, it was anticipated, and proved to be true, that this information was not available for many studies; it was therefore not considered a requirement for inclusion. These criteria permitted a certain degree of commonality between the studies selected based on design type and group similarity, thus limiting comparisons to only three general types of studies with groups that are reasonably homogeneous.
Additionally, in the case of studies from the Journal of Personality and Social Psychology, only studies with a direct or indirect educational relevance (e.g., studies on the attention span of children or other behavior that could have an impact in a classroom) were considered, in order to maintain an educational focus. The selection process for the final sample had multiple stages. Once the journals were identified, all studies within the three journals covering the time span of interest (July 1998 to June 2003) were scanned to determine whether the types of analyses included and the statistics reported warranted consideration for inclusion. Additionally, the topic of each article was considered relative to its direct or indirect relationship to educational issues. From this initial review, 79 articles were selected as potential studies to include. Each of these was then reviewed in more depth to determine the level of data available. That is, were standard deviations, group sizes, and other critical information clearly reported relative to the analysis employed? At this point, the context in which the analyses addressed in this study were employed in each article was considered, to determine the impact those analyses had on the overall findings and purpose of the study. For example, some studies might only have used t-tests to examine pre-existing differences between groups, without any significant or direct impact on the goal of the study. The final sample of articles and the types of analyses represented within articles (N = 33), by journal and analysis type, is provided in Table 4.

Table 4.
Types of Analyses Included in Number of Articles

Analysis Type | Journal of Personality and Social Psychology | Journal of Educational Research | Reading Research Quarterly | Total
Two-Group Comparisons (t-tests) | 4 | 7 | 1 | 12
More than Two Group Comparisons (ANOVA) | 9 | 4 | 9 | 22
Regression Analyses | 1 | 0 | 3 | 4

Note: In some cases, studies used more than one analysis of interest; thus the total differs from that reported in the text.

The types of analyses used in the different articles were fairly diverse, both within a specific journal and across journals. ANOVA applications tended to dominate the literature, with 22 articles using this type of analysis. Comparatively, only four studies incorporated regression analyses, with two-group comparisons using t-tests falling almost halfway between these two extremes, used in 12 studies.

Computations

Using the reported information, the following statistics were computed, if not already reported in the published study:

1. Test of statistical significance (t-values, etc.), including the associated p-value.

2. Confidence interval for the statistic of interest. For studies comparing differences between two groups, the CI for the difference of means was constructed. For studies comparing differences between more than two groups, e.g., in an ANOVA context, CIs were constructed around η², a measure of the degree of variance attributable to group membership. For studies examining a correlational relationship, the CI around the squared multiple correlation coefficient, R², a measure of explained variance, was constructed.

3. Statistic of practical significance. Depending on the study design and analysis, one of three effect sizes was computed.

a. For studies comparing differences between two groups, Cohen's d was used, given by d = (M1 − M2) / σp, where M1 and M2 are the means of the two groups and σp is the pooled standard deviation.

b.
For studies comparing more than two groups, e.g., ANOVA analyses, Cohen's f effect size was computed, given by f = sqrt(η² / (1 − η²)).

c. For studies examining a correlational relationship, e.g., those using a regression analysis, Cohen's signal-to-noise ratio, f², was used, given by f² = R² / (1 − R²).

4. Confidence intervals for the statistic of practical significance were constructed using the normal Z-transformation and the Steiger and Fouladi interval inversion method.

Confidence Intervals

Confidence intervals were constructed using Type I error rates of 0.01, 0.05, and 0.10, using both the normal Z-transformation and the Steiger and Fouladi interval inversion method. Based on previous studies (Hess & Kromrey, 2003; Kromrey & Hess, 2002), it was anticipated that the results of these two methods would not differ to a substantial degree, an expectation that was fulfilled. The only issue relative to CI construction was limited to a very small portion of the studies analyzed. In this small portion of cases (less than 2%), the values were so extreme (due primarily to inordinately large sample sizes combined with either very large or very small effect sizes) that the Steiger and Fouladi interval inversion method would not function, owing to the limitations of the SAS software system on probability computations in the extreme tails of the t and F distributions. In these cases, the other calculations (e.g., Student's t, the Fisher z-transformation, or the z distribution, as appropriate) were used. The widths of the intervals were then examined at each of the three levels for general distributional characteristics. To the extent possible, the studies used were analyzed with consideration given to the strength of the study design, the variables considered, and the types of related information reported (e.g., was there specific mention of the Type I error rate at which significance tests were conducted?).
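The first two effect size formulas listed under Computations translate directly into code. A minimal sketch with hypothetical summary statistics (the f² formula for item c works the same way, substituting R² for η² and omitting the square root):

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Item (a): d = (M1 - M2) / pooled SD, with the pooled variance
    weighted by each group's degrees of freedom."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

def cohens_f(eta_squared):
    """Item (b): f = sqrt(eta^2 / (1 - eta^2)), with eta^2 the
    proportion of variance attributable to group membership."""
    return math.sqrt(eta_squared / (1.0 - eta_squared))

# Hypothetical group summaries (not drawn from any of the sampled studies):
d = cohens_d(52.0, 9.0, 30, 46.0, 9.0, 30)  # (52 - 46) / 9, about 0.67
f = cohens_f(0.20)                           # sqrt(0.20 / 0.80) = 0.50
```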
The strength of the conclusions that could be drawn using a confidence band instead of a point estimate was examined and discussed. All computational aspects of the analysis were conducted using SAS version 8.2 run on the Windows XP operating system. The data were then imported into Microsoft Excel for the purposes of constructing visual displays of the findings in tables and figures.

Data Analysis. The selected studies were coded to collect information on the characteristics of the study, such as distributional information, the impact of missing data, etc., as well as the statistics reported, e.g., ANOVA F values and regression R² values (Appendix A). The purpose of the coding was not to produce a clear measure of study strength or rigor; rather, it was intended to gather relevant information about each study as well as provide a sense of the type of information typically reported. Effect sizes were calculated, regardless of whether they had been reported, based on the data provided by the author(s), e.g., reported means, sample sizes, and degree of variability. This computation external to the study was necessary to preclude the possibility of the author(s) having used an effect size calculation other than the three identified for this study. For the purposes of this study, effect size magnitudes were classified using Cohen's criteria (Cohen, 1988) without attention to contextual issues. Table 5 contains a summary of the three effect sizes and Cohen's admittedly reluctant classification of effects as small, medium, or large.

Table 5. Effect Sizes and Associated Interpretation

Effect Size Index | Cohen's d | Cohen's f | Cohen's f²
Small Effect | 0.20 | 0.10 | 0.02
Medium Effect | 0.50 | 0.25 | 0.15
Large Effect | 0.80 | 0.40 | 0.35

The consideration of context when interpreting effect sizes is vital for applied purposes; however, context is not a direct consideration in this study and will therefore not be addressed.
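The classification rule implied by Table 5 can be sketched as a small lookup function. The "negligible" label for values below the small threshold is an assumption added here for completeness; Cohen's table itself only names small, medium, and large benchmarks.

```python
def classify_effect(value, index="d"):
    """Classify an effect size against Cohen's (1988) benchmarks,
    as summarized in Table 5. Values below the 'small' threshold
    return 'negligible' (a label assumed here, not Cohen's)."""
    thresholds = {
        "d":  (0.20, 0.50, 0.80),
        "f":  (0.10, 0.25, 0.40),
        "f2": (0.02, 0.15, 0.35),
    }
    small, medium, large = thresholds[index]
    if value >= large:
        return "large"
    if value >= medium:
        return "medium"
    if value >= small:
        return "small"
    return "negligible"

# e.g., an f-squared of 0.32 sits just under the 0.35 large benchmark:
rating = classify_effect(0.32, "f2")  # "medium", approaching large
```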
Confidence band widths were calculated at three Type I error rates, .01, .05, and .10, using techniques appropriate to the analysis of interest. Confidence intervals for two-group comparisons were constructed using the Student's t distribution for the differences between means, and the z-distribution as well as the Steiger and Fouladi interval inversion method for the Cohen's d measures (see Hess & Kromrey, 2002, for details). Similar approaches were used for the regression and ANOVA analyses, using a logarithmic transformation of Z, similar to the Fisher transformation, as well as the Steiger and Fouladi interval inversion approach. Details of the effectiveness of these techniques can be found in Hess and Kromrey (2002) and Kromrey and Hess (2000). Intervals were examined to determine whether there were noticeable differences in the research rigor found in different journals, or in the impact of precision based on the type of study and method of analysis chosen.

Reporting Results and Conclusions

The discussion sections of the published studies were reviewed to determine whether findings or conclusions might have been affected or altered by different reporting practices. Specific discussions and statements relative to the statistical analyses conducted were culled from each study and reviewed with the intent of determining whether additional information, e.g., effect sizes and/or confidence intervals, should have affected the strength of the wording used in results and conclusions. A determination was made as to whether the inclusion of effect sizes and/or confidence intervals would:

1. have no impact on how the results and conclusions were reported, that is, no changes needed;

2. have some impact on how the results and conclusions were reported, that is, slight changes needed;

3. have substantial impact on how the results and conclusions were reported, that is, drastic changes needed;

4.
have a major impact on how the results and conclusions were reported, that is, a complete revision required.

A copy of the instrument used for this determination, as well as a sample study and analysis summary, is included in Appendix C. A total of 42 analyses or sets of analyses were extracted from the 33 studies for this portion of the study. These analyses or sets of analyses were identified upon review of the results and conclusions provided. If a statement was clearly based on a single analysis, then the statistics associated with that analysis were used. If a statement was based on a group of analyses, then they were reviewed conjointly. This typically happened when an ANOVA test was conducted with follow-up t-tests. A large majority of the analyses conducted within the broad scope of this research did not lend themselves to inclusion in this part of the study. The reasons for this varied, with the most dominant being that although the results of statistical significance tests might have been reported numerically, either in the text or in a table, the impact of these specific analyses was not uniquely identifiable within the results and/or the discussion of the results. Multiple t-tests may have been run to support a single written conclusion within a larger context. Other examples that were not investigated relative to the interpretative aspects of the study included analyses run to check for pre-existing differences (typically not a focus of the results or discussions of the implications of findings) and analyses addressing non-focal points of the study, e.g., analyses of demographic data that were not addressed relative to conclusions or impact.

Reliability of Interpretative Results

Twenty of the analyses or sets of analyses were independently reviewed by measurement specialists well versed in educational research to determine whether the decisions reached by this researcher would likely be representative of members of the research world in general.
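The reliability index used for this check is simple percent agreement: the share of items on which two raters assigned the same code. A minimal sketch, with hypothetical ratings rather than the actual coding data:

```python
def percent_agreement(rater_a, rater_b):
    """Percent of items on which two raters assigned the same code
    (e.g., on the 1 = No Change to 4 = Complete Revision scale)."""
    if len(rater_a) != len(rater_b):
        raise ValueError("rating lists must be the same length")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * matches / len(rater_a)

# Hypothetical codes for six analyses from two raters; they agree on 5 of 6:
agreement = percent_agreement([1, 2, 2, 3, 1, 4], [1, 2, 3, 3, 1, 4])
```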
One of the twenty analyses had to be discarded due to a problem noted in the summary information provided to the reviewers; thus the percent agreement was based on 19 analyses or sets of analyses. This was not considered a major problem, as 43.54% of the sample was still used as a basis for verification and as a measure of the reliability of this researcher's recommendations for change. Prior to the independent reviews, the researcher coded all the analyses (or sets of analyses) using the 1 (No Change Needed) to 4 (Complete Revision Needed) scale described previously. The analyses (or sets of analyses) coded by the independent reviewers were selected to be representative of the 42 used in the analysis. The subset used for this reliability check included analyses from all three analysis types studied in this research (t-tests, ANOVAs, and regression) as well as analyses (or sets of analyses) from all three selected journals. Additionally, the subset included analyses that the researcher had determined to need varying degrees of interpretative adjustment when effect size and confidence interval information was included. That is, the independent coders received a range of analyses previously rated by the researcher as needing No Change, Slight Change, Much Change, or Complete Revision. Once the subset of analyses had been selected, the researcher conducted a training session with the reviewers. Each reviewer was provided with an instruction sheet, a coding sheet for each analysis (or set of analyses), and a summary of each of the studies to be reviewed, with the appropriate statistics (see Appendix C). The researcher read the instructions aloud while the reviewers followed along on the instruction sheet. The reviewers were given the opportunity to ask questions and provide input. At that point, one analysis was reviewed and coded independently by each individual and the results discussed among the group.
There were some initial differences in how much to weigh information such as study strength (some reviewers had taken sample size, deduced from degrees of freedom information, into consideration in their ratings). They were instructed to concentrate primarily on the statistics themselves and not take other elements of the study into consideration. After the training, practice, and discussion, the reviewers were given all their materials to conduct the rest of their reviews independently. Coding sheets were then returned to the researcher (one reviewer emailed their results) and the ratings were entered into an Excel spreadsheet. The decisions reached by these independent reviewers were then compared to those reached by this researcher, and the percent agreement, both by item and overall, was computed. In general, agreement was strong. Overall agreement was 83%, with the highest agreement (89%) concerning the degree to which results and conclusions might be affected by confidence intervals. Interestingly, the lowest agreement (79%) concerned the degree to which the results and conclusions might be altered based on the results of the significance tests conducted, and reported, within the original study. The percent agreement regarding the degree to which reporting effect sizes might prompt revisions of results and conclusions fell between the other two, at 82%.

Recommendations for Reporting Research Results

Finally, the results of this study were considered holistically to provide recommendations for reporting research results. The use of illustrations from actual results is anticipated to provide yet another piece of justification for researchers to report their findings more thoroughly, and for journal editors to demand such reporting. Just as educators in the field are being held accountable for their methods, so should be the methods and work of educational researchers, including their reporting practices and protocols.
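The by-item and overall agreement figures above are simple percent agreement between the researcher's codes and a reviewer's codes. A minimal sketch of that computation follows; the function name and the example ratings are illustrative, not data from the study:

```python
def percent_agreement(ratings_a, ratings_b):
    """Percent of items on which two raters assigned the same code."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("rating lists must have the same length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)

# Hypothetical codes on the 1 (No Change Needed) to 4 (Complete
# Revision Needed) scale described in the text.
researcher = [1, 2, 2, 3, 4, 1, 2, 3]
reviewer = [1, 2, 3, 3, 4, 1, 2, 2]
overall = percent_agreement(researcher, reviewer)  # 75.0 for this toy data
```

In the study, this comparison was done both per item (each analysis or set of analyses) and pooled across all 19 items.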
Chapter Four

Results

The purpose of this research was to examine the potential impact of different methods of reporting research results on the conclusions that could, and should, be drawn from those findings. Specifically, this study investigated how the use of practical significance, as measured by effect sizes, in addition to measures of statistical significance might affect the degree to which one should interpret results. Additionally, the use of confidence intervals around point estimates was examined in order to determine the precision of measurements obtained in studies and how that degree of precision might affect conclusions drawn from findings. Previously conducted research deemed worthy of publication that contained one of three traditional and oft-used statistical analyses (t-tests, Analysis of Variance (ANOVA), and/or regression) was reviewed, and results were re-analyzed using not only the significance test results provided in the study but also the appropriate measures of practical significance (Cohen's d, Cohen's f, and Cohen's f2, respectively). Further, confidence intervals were constructed for all point estimates, including measures of statistical as well as practical significance. Results and conclusions relative to specific statistical analyses were then examined with consideration given to the additional information provided by the calculated effect sizes and confidence intervals. The degree to which the results and conclusions that were presented might be adjusted or reconsidered was estimated. The three questions investigated in this research were: 1.) To what extent does reporting outcomes of tests of statistical significance vs. tests of practical significance result in different conclusions and/or strengths of inference to be drawn from the results of research? 2.)
To what extent does reporting confidence intervals instead of, or in addition to, point estimates affect the conclusions and inferences to be drawn from the results of research? 3.) What method, or combination of methods, is recommended for reporting results in educational studies?

Characteristics of Selected Studies

For the most part, researchers did not report either effect sizes or confidence intervals in their results. Only one article of the 79 studies considered for final inclusion during the screening steps of study selection reported results of significance tests, effect sizes, and confidence intervals (Baumann, Edwards, Font, Terehinski, Kameenui, & Olejnik, 2000). No other study reviewed reported confidence intervals; few reported effect sizes, and none did so consistently. Of the final sample of 33 articles, 393 ANOVA analyses, 108 regression analyses, and 149 t-test analyses were reviewed. The types of analyses varied widely across articles and journals (see Table 6 for specifics). For example, the Journal of Educational Research tended to have fewer analyses within a given study and reported more two-group comparisons than the other two journals. ANOVA applications seemed to dominate studies in both Reading Research Quarterly and the Journal of Personality and Social Psychology. During the initial screening, numerous articles were excluded due to non-reporting of statistics required for this study, such as sample size or standard deviation. For example, two-group comparisons using t-tests appeared in the Journal of Personality and Social Psychology somewhat more often than is apparent in this study; however, those articles often lacked sufficient information to permit their inclusion.
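The three practical-significance measures used in this chapter can be recovered from summary statistics that articles commonly report, which is why missing standard deviations or sample sizes forced exclusions. The sketch below is illustrative: the f-from-F conversion shown (f squared = F x df1 / df2) is one common approximation, and the thesis's exact estimator is not reproduced here.

```python
from math import sqrt

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d for a two-group comparison, using the pooled SD."""
    pooled = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled

def cohens_f_from_F(F, df1, df2):
    """Cohen's f approximated from a reported ANOVA F ratio.

    Uses f^2 = F * df1 / df2, a common approximation when only the
    F ratio and degrees of freedom are reported.
    """
    return sqrt(F * df1 / df2)

def cohens_f2_from_R2(r2):
    """Cohen's f^2 for regression, from the squared multiple correlation."""
    return r2 / (1.0 - r2)
```

For instance, an R2 of 0.80 (the magnitude Sutton and Soderstrom reported) yields f2 = 4.0, far beyond Cohen's large-effect threshold of 0.25, which is why those regression results skew the f2 distribution so heavily.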
It was possible, in limited cases, to derive some of that information from other data provided (e.g., degrees of freedom), but this was done only in situations where the derived information could be safely relied on. The contribution of regression analyses to this study was limited: only four articles contained the information needed for inclusion. In many cases, studies with regression applications reported weights and coefficients only, with no indication of explained variance. Of the four regression studies, two had results that do not seem typical of regression analyses in general and thus may be responsible for the distribution of the results being highly skewed toward very large effect sizes. For example, Sutton and Soderstrom (1999) reported R2 values that were atypically large (e.g., 0.80, 0.76). Not all of the analyses contained within the 33 studies were considered appropriate to include in the interpretation-of-results-and-conclusions part of this study, as they examined such things as pre-existing differences between groups, provided evidence of known differences, or were not an evident or specific contributor to the results and conclusions discussed.

Table 6. Types of Analyses Reviewed by Article Number and Journal

Article No.  Journal  Analyses  Total
1    JER   12        12
2    JER   22        22
3    JER   33        33
4    JER   10, 8     18
5    JER   10        10
6    JER   6         6
7    JER   14        14
8    JER   1         1
9    JER   4         4
10   RRQ   26, 1     27
11   RRQ   45        45
12   RRQ   3         3
13   RRQ   6         6
14   RRQ   38        38
15   RRQ   1         1
16   RRQ   3         3
17   RRQ   38        38
18   JPSP  58, 38    96
19   JPSP  32        32
20   JPSP  15        15
21   JPSP  9         9
22   JPSP  21        21
23   JPSP  5, 4      9
24   JPSP  6         6
25   JPSP  3, 27     30
26   JPSP  20        20
27   JPSP  6, 11     17
28   JER   2         2
29   JER   25        25
30   JER   6, 3      9
31   RRQ   11        11
32   RRQ   32, 12    44
33   RRQ   23        23
Total: t-tests 149, ANOVA 393, regression 108 (640 analyses overall)
These analyses were included when examining the general behavior of the statistics as a function of study characteristics. Regardless of the type of analysis conducted, the general distribution of effect sizes revealed extremes at either end, with most effect sizes spanning Cohen's small-to-large range (Figures 5 and 6) for group comparison studies (ANOVAs and t-tests).

Figure 5. Distribution of effect sizes and 95% confidence intervals for all t-test analyses pooled across journals, in order of increasing effect size.

The studies with extreme values were further examined and found primarily to reflect unique comparisons that, upon review, seemed to provide understandable conditions for the extremeness of the result. For example, many of the large effect sizes in the ANOVA applications came from one study that examined differences in text composition across literary genres. The only exception was the distribution of the results for Cohen's f2 (Figures 7 and 10), which shows a tendency toward rather large effect sizes. This may be due, at least in part, to the limited number of regression-based studies available for inclusion in this study (n = 4).

Figure 6. Distribution of effect sizes and 95% confidence intervals for all ANOVA analyses pooled across journals, in order of increasing effect size.

Figure 7. Distribution of effect sizes and 95% confidence intervals for all regression analyses pooled across journals, in order of increasing effect size.

Additionally, the distribution of effect sizes was relatively similar across journals (see Figures 8, 9, and 10), although the frequency of different types of analyses varied from journal to journal.
Although the number of published studies containing t-tests was largest in the Journal of Educational Research, the actual number of t-tests conducted within those studies was largest within the Journal of Personality and Social Psychology.

Figure 8. Distribution of effect sizes and 95% confidence intervals for all t-test analyses, in order of increasing effect size, by journal.

Figure 9. Distribution of effect sizes and 95% confidence intervals for all ANOVA analyses, in order of increasing effect size, by journal.

Figure 10. Distribution of effect sizes and 95% confidence intervals for all regression analyses, in order of increasing effect size, by journal.

Relative to the Type I error rate of interest, the bandwidth noticeably increases as alpha decreases, as would be expected. Figures 11, 12, and 13 illustrate this using the results of the ANOVA analyses.

Figure 11. Distribution of effect sizes and 90% confidence intervals for all ANOVA analyses pooled across journals, in order of increasing effect size.

Figure 12. Distribution of effect sizes and 95% confidence intervals for all ANOVA analyses pooled across journals, in order of increasing effect size.

Figure 13. Distribution of effect sizes and 99% confidence intervals for all ANOVA analyses pooled across journals, in order of increasing effect size.

Sample size, as one might expect, had a notable impact on bandwidth. In Figure 14, bandwidths for ANOVA analyses are illustrated as a function of increasing total sample size for the three Type I error rates examined. A similar trend was noted for studies using t-tests and regression analyses.
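Both trends (wider bands as alpha decreases, narrower bands as sample size grows) can be illustrated with the z-based interval for Cohen's d mentioned in the methods discussion. This is a sketch using a common normal-approximation standard error for d, not the Steiger and Fouladi noncentral-t inversion actually used for the thesis computations:

```python
from math import sqrt
from statistics import NormalDist

def d_confidence_interval(d, n1, n2, alpha=0.05):
    """Normal-approximation confidence interval for Cohen's d.

    Illustrative only: uses the common large-sample standard error
    for d rather than the interval inversion method.
    """
    se = sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return d - z * se, d + z * se

def band_width(d, n_per_group, alpha):
    """Width of the interval for equal group sizes."""
    lo, hi = d_confidence_interval(d, n_per_group, n_per_group, alpha)
    return hi - lo

# Width grows as alpha shrinks, for fixed n:
widths_by_alpha = [band_width(0.5, 30, a) for a in (0.10, 0.05, 0.01)]
# Width shrinks as n grows, for fixed alpha:
widths_by_n = [band_width(0.5, n, 0.05) for n in (15, 30, 60, 120)]
```

Under this approximation, quadrupling the per-group sample size halves the band width, which matches the broad pattern visible in Figure 14.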
Additionally, as the ratio of sample size to the number of groups in ANOVA studies increased (that is, as average sample size within each group increased), bandwidths also tended to decrease (Figure 15).

Figure 14. Bandwidth of Cohen's f pooled across journals as total sample size increases, for Type I error rates of .01, .05, and .10.

Figure 15. Bandwidth of Cohen's f pooled across journals as the ratio of total sample size to number of groups increases, for Type I error rates of .01, .05, and .10.

Additionally, there was a notable lack of what some might consider basic but critical information about a research study. Noticeably lacking in most studies were measures of, and information regarding, reliability and validity, distributional characteristics of the data (including the presence or absence of outliers), missing data, and dependence/independence of observations. The propensity to leave this type of information out was alarming.

Statistical Significance vs. Practical Significance

Of the 640 individual analyses used in this study, the degree to which they were reported as statistically significant varied as a function of the type of analysis conducted (see Table 7). There were a total of 149 two-group comparisons that used t-tests as their analysis of choice. Of those 149, slightly less than half (n = 70, 47%) reported statistically significant findings. Contrast this with the reported regression analyses, 88% (n = 95) of which reported statistically significant findings, and the ANOVA analyses, 81% (n = 319) of which reported significant findings.
Although not explicitly stated in virtually any of the studies examined, it seemed evident that most, if not all, significance testing was done using a Type I error rate of 0.05. This inference is based on the fact that most findings not contained in a table and asterisked (*) to indicate various significance levels (a fairly common, and lamentable, practice in reporting research results) were reported using notation such as p < .xx, and in the studies reviewed this number did not exceed 0.05. Otherwise, the letters ns, indicating non-significant findings, were reported. Additionally, this initial examination of effect sizes with regard to significance testing, absent of context, revealed that in many cases results reported to be statistically significant (the original authors' interpretations) had varying degrees of practical significance (see Table 7).

Table 7. Numbers and Percent of Analyses Reporting Statistical Significance Relative to Computed Effect Size
(Effect size categories: No effect, d < .1, f < .05, f2 < .01; Small, d .1-.34, f .05-.16, f2 .01-.08; Medium, d .35-.64, f .17-.32, f2 .09-.25; Large, d .65+, f .33+, f2 .25+)

t-Test (total 149): no effect 37 (24.83%), small 24 (16.11%), medium 28 (18.79%), large 60 (40.27%)
  Significant (70): no effect 0 (0%), small 3 (4.29%), medium 13 (18.57%), large 54 (77.14%)
  Non-significant (79): no effect 37 (46.84%), small 21 (26.58%), medium 15 (18.99%), large 6 (7.59%)
ANOVA (total 393): no effect 18 (4.58%), small 45 (11.45%), medium 98 (24.94%), large 232 (59.03%)
  Significant (319): no effect 0 (0%), small 9 (2.82%), medium 82 (25.71%), large 228 (71.47%)
  Non-significant (74): no effect 18 (24.32%), small 36 (48.65%), medium 16 (21.62%), large 4 (5.41%)
Regression (total 108): no effect 0 (0%), small 3 (2.78%), medium 16 (14.81%), large 89 (82.41%)
  Significant (95): no effect 0 (0%), small 1 (1.05%), medium 15 (15.79%), large 79 (83.16%)
  Non-significant (13): no effect 0 (0%), small 2 (15.38%), medium 1 (7.69%), large 10 (76.92%)

As might be expected from the dominant reporting of statistically significant findings in regression analyses, this type of analysis reported the greatest number of large effect sizes.
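The effect-size categories used in Table 7 and throughout the figures that follow amount to a small lookup over Cohen's cut points. The thresholds below are transcribed from the table header; the function name is illustrative:

```python
# Cut points from Table 7 (after Cohen, 1988). Below the first value
# is "no effect"; the values mark the start of small, medium, large.
CUTS = {
    "d": (0.1, 0.35, 0.65),
    "f": (0.05, 0.17, 0.33),
    "f2": (0.01, 0.09, 0.25),
}

def classify_effect(value, measure):
    """Bin an effect size; measure is 'd', 'f', or 'f2'."""
    small_at, medium_at, large_at = CUTS[measure]
    v = abs(value)
    if v < small_at:
        return "no effect"
    if v < medium_at:
        return "small"
    if v < large_at:
        return "medium"
    return "large"
```

The worked examples later in this chapter fall out directly: Cohen's f = .4043 classifies as large, while f = .2385 and f = .2828 classify as medium.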
The degree to which effect sizes varied based on whether tests showed statistical significance was further investigated. Of the 640 analyses investigated, a total of 484 were reported to be statistically significant. The magnitudes of the effect sizes associated with these analyses were reviewed both as a function of journal and as a function of type of analysis. Figure 16 contains a summary of the effect sizes for the different analyses by journal. As might be expected, no statistically significant analyses had effect sizes indicating a complete absence of effect, and only a small number indicated small effects. When considering whether a medium or large effect size associated with significant findings varied by journal type, the Journal of Educational Research exhibited a greater preponderance of large effect sizes (73 of its 78 included analyses, 93.6%) compared with either the Journal of Personality and Social Psychology (139 of 208 analyses, 66.8%) or Reading Research Quarterly (157 of 209 analyses, 75.1%). When effect sizes of statistically significant analyses were reviewed by type of analysis, again as expected, there were no instances in which no effect was present, and only a limited number revealed small effects. Depending on the analysis, there were some differences regarding evidence of a large or medium effect. Regression analyses tended to pair large effects with statistically significant results (95 of 108 total). The results of statistically significant ANOVA tests revealed that a little over a quarter of the analyses had medium effects or less, with a slightly smaller proportion of regression and t-test analyses indicating medium effect sizes or less.

(Figure 16 bar values, for no/small/medium/large effects respectively: JER 0.00%, 1.28%, 5.13%, 93.59%; JPSP 0.00%, 4.33%, 28.85%, 66.83%; RRQ 0.00%, 0.96%, 23.92%, 75.12%.)

Figure 16.
Effect sizes of statistically significant findings at an alpha of .05, by journal.

Figure 17. Effect sizes of statistically significant findings pooled across journals at an alpha of .05, by analysis type (for no/small/medium/large effects respectively: ANOVA 0.00%, 3.07%, 25.77%, 71.17%; regression 0.00%, 1.06%, 14.89%, 84.04%; t-tests 0.00%, 1.33%, 21.33%, 77.33%).

Results of non-statistically significant analyses were not as plentiful, owing to publishing preferences toward statistically significant findings. Only 166 analyses reporting non-significant findings (about one-third the number for statistically significant findings) contained enough information to calculate effect sizes. Additionally, the 166 that were available came predominantly from studies using ANOVA and/or t-tests; only four of the non-significant findings used regression analyses. While it is not reasonable to offer a definitive explanation for this apparent disparity, it may result from the nature of the tests themselves. Multiple regression models using the same variables in various combinations are often tested, and only the ones performing successfully may have been included in the final analysis. Additionally, the comparative nature of t-tests and ANOVA using multiple variables of interest might make it less likely for researchers to exclude non-significant findings when reporting significant ones. Effect sizes of significance tests were examined as a function of analysis type for non-significant findings. Regardless of direction, a Cohen's d of around 0.2 or more indicates some degree of difference, so it was not considered problematic to consider evidence of effect within these analyses compared with the other two analysis types considered. Figure 18 contains the results of considering point estimates for non-significant findings.
It is important to note that the regression results include only four cases, so the generalizability of this distribution is very limited.

Figure 18. Effect sizes of non-statistically significant findings pooled across journals at an alpha of .05, by analysis type (for no/small/medium/large effects respectively: ANOVA 25.33%, 48.00%, 21.33%, 5.33%; regression 0.00%, 50.00%, 25.00%, 25.00%; t-tests 47.44%, 26.92%, 19.23%, 6.41%).

Of the other two analyses reviewed, data for 74 ANOVA tests and 79 t-tests were available. Of note in these results is the evidence of at least a small effect in most of the analyses. Over half of the t-tests indicated the presence of at least a small measure of practical difference between the two groups examined, with either a medium or large effect evident in approximately a quarter of the cases (25.64%). ANOVA had a similar proportion with medium or large effects (26.66%), and only a quarter of the ANOVA analyses indicated the absence of a practical difference (25.33%). The sparse regression representatives all indicated some effect, with two analyses having a small effect size, one a medium effect size, and the fourth a large effect size.

Potential Impact on Results and Conclusions

The final piece of this analysis was a review of the results and conclusions reported that were based on the tests of statistical significance. The 42 analyses or groups of analyses included in this portion of the study were examined considering the computed effect size(s) in addition to the statistical significance tests. The results and conclusions were then determined to need varying degrees of adjustment based on the information provided by effect sizes: 1. have no impact on how the results and conclusions were reported, that is, no changes needed; 2. have some impact on how the results and conclusions were reported, that is, slight changes needed; 3.
have substantial impact on how the results and conclusions were reported, that is, drastic changes needed; 4. have a major impact on how the results and conclusions were reported, that is, a complete revision required. Only 26.19% (n = 11) of the studies were determined to have results and conclusions that did not need any revision based on the addition of effect size information. About a quarter of the sample analyses were determined to need substantial changes (n = 12, 28.57%), with relatively few recommended for complete revision (n = 2, 4.76%). The largest relative proportion of studies, 40.48% (n = 17), were identified as needing slight changes when the magnitude of effect size was considered in addition to tests of statistical significance.

Table 8. Number and Percent of Analyses or Sets of Analyses that Warrant Different Degrees of Change when Effect Size or Confidence Interval is Considered in Addition to Results of Statistical Significance Tests

When effect size is considered: No Change Needed 11 (26.19%); Slight Change Needed 17 (40.48%); Much Change Needed 12 (28.57%); Complete Revision Needed 2 (4.76%)
When 95% confidence interval is considered: No Change Needed 3 (7.14%); Slight Change Needed 8 (19.05%); Much Change Needed 13 (30.95%); Complete Revision Needed 18 (42.86%)

These findings are fairly comparable to those of the other coders. When the 19 sets of analyses reviewed by the other research specialists were compared, there was adequate percent agreement with the decisions of the researcher of this study: 82% when effect size was considered and 88% when confidence intervals were considered.

Examples

Four analyses or sets of analyses were extracted from the sample to illustrate examples of each of the four possible decisions: (1) No Change Needed, (2) Slight Change Needed, (3) Much Change Needed, and (4) Complete Revision Needed.
For the first example, the study investigated the degree to which college students believed that their admission was based, at least in part, on their race/ethnicity (Brown, Charnsangavej, Keough, Newman, & Renfrow, 2000). Students were classified as members of a stigmatized race/ethnicity if they were African American or Latino; conversely, they were classified as members of a non-stigmatized race/ethnicity if they were White or Asian American. The result of the statistical significance test (ANOVA) indicated the presence of a statistically significant difference: F(1,369) = 69.89, p < .001. The authors reported: "When we compared stigmatized and nonstigmatized students in the degree to which they suspected that their race or ethnicity might have helped them gain admission to college, we also found a significant difference, as expected. Stigmatized students suspected that their admission to the University of Texas at Austin had been influenced by their race or ethnicity to a greater extent than did nonstigmatized students." (p. 254) The computed effect size, Cohen's f = .4043, tends to support the authors' conclusion. As such, the inclusion of effect size is not likely to have added any further information that would have suggested different results or necessitated alterations to the conclusions drawn. The rating received by this analysis was a (1), No Change Needed. In the second example, it was determined that while the stated results and conclusions were supported by consideration of the effect size in general, the effect size magnitude was sufficient to suggest slight modifications to the statement made in the conclusions. The researcher in this study (Fitzgerald, 2001) was investigating the degree to which students' participation in a tutoring program (part time vs. full time) affected their achievement in reading.
The results of an ANOVA conducted on a measure of post-participation reading level found statistically significant differences, F(1,76) = 4.72, p = .03. The associated concluding comment by the author was: "There was a statistically significant treatment effect. Overall, high-level treatment children outperformed low-level treatment children in instructional reading level." (p. 45) The computed effect size was Cohen's f = .2385. In general, the effect size supported the author's conclusion; however, a rating of (2), Slight Change Needed, was assigned due to the rather strong wording associated with what may be, at most, a medium practical effect. It would be recommended that the term "outperformed" be replaced or conditionally qualified to slightly lessen the strength with which these findings were reported. In many studies, the results were found to need more attention to qualifying the wording once effect size information was included in addition to statistical significance. In this third example, the results were agreed with in principle but were considered to need some revamping in order to reflect an appropriate strength of inference. In this study, high school students indicated a preference for morning or afternoon academic work (Callan, 1999). These students were then randomly assigned to different groups which were administered an Algebra exam in the morning and in the afternoon. The groups contained a mix of students with different preferences. In this set of analyses, the question being investigated was whether students with different time preferences (morning or afternoon) perform differently if they take a test in the morning. Statistical significance was found between the performance of students with different preferences, F(1,64) = 5.44, p < .05. The authors concluded that "There was a significant difference between afternoon-preferenced students and morning-preferenced students taking the test in the morning."
(p. 296) and, "The results indicate clearly that the time-of-day element in learning style may play a significant part in the instructional environment. When time preference and testing environment were matched, significant differences emerged between test results, but only for the morning test." (p. 298) The measure of practical significance found a medium effect present, Cohen's f = .2849. It was determined that the authors should temper the strength reflected in their comments. Words and phrases such as "clearly indicate" and "play a significant part" are very strong, were considered inappropriate given the potential presence of only a medium effect, and are thus potentially misleading. As such, this example was assigned a rating of (3), Much Change Needed. Finally, there were a few studies for which inclusion of effect size tended to negate, or show as inappropriately represented, the results. That is, the results, after inclusion of effect size information, were considered to be in need of complete revision. One such study addressed how different types of praise affected children's judgments of their performance on tasks (Mueller & Dweck, 1998). Students were put into three groups: one in which the children were praised for their ability (also referred to as praise for intelligence), another in which the children were praised for effort, and a third in which no praise was provided. Based on the results of the significance tests, F(2,48) = 2.04, ns, the authors reported: "These results indicate that effort praise and intelligence praise do not lead children to judge their performance differently." (p. 42) This finding, as written, indicates a rather definitive decision about the lack of differences among the three groups of children in how harshly they judge their performance.
However, when one considers the associated effect size, Cohen's f = 0.2828, which indicates, according to Cohen, the potential presence of at least a medium effect, the certainty with which one decides that there is no difference should be affected. Due to the definitiveness of the statement regarding the findings of this part of the study, this example was considered to warrant a (4), Complete Revision Needed. The practical significance results indicate the possible presence of a medium effect between the groups that should be addressed in the discussion. It would be advisable to at least discuss the possible existence of an effect, note that further research into this issue might be warranted, and avoid making a definitive statement or judgment.

Summary

Reporting effect sizes in addition to measures of statistical significance appears to add valuable information to at least a small proportion of tests that have statistically significant results. The utility of a measure of effect appears to be enhanced when statistical tests result in non-significant findings: over 75% of the non-statistically significant results had indications of at least a small to moderate effect. This type of information might be valuable to researchers who believe, based on theory, previous research, or experience, that a true difference does exist but that other factors (e.g., research design, rigor) might have affected significance findings.

Point Estimates vs. Confidence Intervals

The use of confidence intervals tended to be scant in the literature. Only one article found during the initial review of journals contained information on confidence intervals. However, when confidence intervals were constructed around the statistics of interest in this study, including effect sizes, it became fairly obvious that they added an important element of information regarding the strength with which one should rely on the findings.
Figure 19 summarizes the percent of analyses whose lower-limit and upper-limit effect sizes (using a 95% confidence band) indicated no effect, small effect, medium effect, or large effect, as defined by Cohen (1988). This bar chart represents the proportion of confidence bands, by analysis type, that contained varying levels of effect size. The left half of the chart shows the percent of analyses whose lower band limit had a magnitude indicating no, small, medium, or large effect; the right half shows the same for the upper band limit. With the exception of the regression analyses, confidence bands tended to include effect sizes of little or no effect in a substantial proportion of the analyses (39% for ANOVA analyses and 43% for t-test analyses). Only 12% of t-tests contained a large effect at both the lower and upper limits. Consideration of these confidence intervals provides clear evidence of a lack of precision in many of these studies. For example, in at least 15% of the ANOVA analyses, the lower band included effect sizes indicating the lack of any effect, and 28% contained small effect sizes. As such, in at least 15% of the ANOVA-based studies found to be statistically significant, one cannot determine with certainty that there is a true difference between the groups of interest. Additionally, only 57% of those found to be statistically significant at an alpha of .05 had confidence bands that included only medium to large effects.

[Figure 19, a bar chart, appeared here: percent of analyses (ANOVA, regression, t-tests) whose 95% confidence band lower and upper limits indicated no, small, medium, or large effects.]

Figure 19.
Percent of effect sizes of 95% confidence band endpoints, pooled across journals, found in statistically significant analyses.

Using a more stringent Type I error rate, e.g., an alpha of 0.01, further dilutes the ability to determine whether there is a substantiated finding in the research, such as a true difference between groups or an impact of a treatment. For example, when 99% confidence intervals were constructed around effect sizes, the percent of ANOVA analyses whose bands included effect sizes indicating no effect jumped to 19% (n = 74), and bands containing small effects (not including those indicating the lack of any effect) rose to 152 (39%). Thus, less than half (42%) of the statistically significant analyses could be said, with any degree of confidence at an alpha level of .01, to be indicative of a medium or large effect.

Potential Impact on Results and Conclusions
The final piece of this analysis was a review of the results and conclusions reported on the basis of the tests of statistical significance. The 42 analyses or groups of analyses included in this portion of the study were examined in light of the computed confidence intervals around effect sizes, in addition to the effect size(s) and statistical significance tests. The results and conclusions were then reviewed to determine whether the information provided by effect sizes would:
1. have no impact on how the results and conclusions were reported, that is, no changes needed;
2. have some impact on how the results and conclusions were reported, that is, slight changes needed;
3. have substantial impact on how the results and conclusions were reported, that is, drastic changes needed; or
4. have a major impact on how the results and conclusions were reported, that is, a complete revision required.
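The widening from a 95% to a 99% band described above can be illustrated with a common large-sample approximation to the confidence interval for Cohen's d (Hedges & Olkin, 1985). The effect size and group sizes below are hypothetical, and this study likely used exact noncentral-distribution methods rather than this normal approximation; the sketch only shows how a stricter alpha widens the band toward zero.

```python
import math

def d_interval(d, n1, n2, z):
    """Approximate (1 - alpha) confidence interval for Cohen's d
    using the common large-sample standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# Hypothetical medium effect, two groups of 25:
lo95, hi95 = d_interval(0.5, 25, 25, 1.960)   # 95% band
lo99, hi99 = d_interval(0.5, 25, 25, 2.576)   # 99% band
print(round(lo95, 3), round(hi95, 3))   # -0.063 1.063
print(round(lo99, 3), round(hi99, 3))   # wider band, lower limit
                                        # further below zero
```

Even at 95%, this hypothetical interval already includes zero; at 99% the lower limit moves further into "no effect" territory, mirroring the jump in no-effect bands reported above.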
The inclusion of bandwidth information had a rather dramatic impact on the degree to which one could agree with the results and conclusions reported in the studies. Of the 42 results and conclusions examined in light of specific analyses, only three (7.14%) were considered adequate when confidence intervals were considered (see Table 8). A slightly larger number were determined to need some changes (n = 8, 19.05%), with a greater number possibly needing more substantial changes to the wording (n = 13, 30.95%). The relative majority were considered to need complete revision of wording (n = 18, 42.86%) to better reflect the appropriate strength of inference evidenced in results and conclusions relative to the analysis. The overall findings of this portion of the study are quite comparable to those of other researchers, as evidenced by a review of randomly selected analyses used in this study. When the recommendations for changes in the strength of wording of reported results and conclusions for the 19 sets of analyses reviewed by other research specialists were compared with those reached by the researcher conducting this study, there was a strong level of percent agreement (89%).

Examples
Four examples were extracted from the sample to illustrate the basis for reaching each of the four possible decisions: (1) No Change Needed, (2) Slight Change Needed, (3) Much Change Needed, and (4) Complete Revision Needed. One of the few analyses reviewed whose results and/or conclusions were not considered to be impacted by the reporting of confidence intervals was a study conducted by Sutton and Soderstrom (1999). In this study, the researchers investigated the impact on student achievement of variables within the control of a school system, such as class size, teacher experience, and expenditure per pupil, as well as variables considered outside the control of the school system, e.g., mobility, attendance, and low income.
They built regression models to determine the relationship of these variables in combination, using two models. One model contained the "can control" variables and the other contained the "cannot control" variables. The outcome for the "cannot control" model indicated statistical significance, with R² = .70, p < .001 for reading achievement and R² = .56, p < .001 for math achievement. The authors reported that: "In contrast to the low model R² values obtained for the can control regression models, the R² values obtained for the cannot control regression models were considerably higher. We therefore concluded that the cannot control models accounted more accurately for variance in Grade 3 achievement scores than did the can control variables." The calculation of confidence intervals around the estimated effect sizes, 2.1149 < f² < 2.5706 for reading and 1.1397 < f² < 2.4176 for math, supports the authors' conclusions, as the lower and upper limits of the bands are inordinately large. As such, it was determined that no change was necessary, a rating of (1), based on the inclusion of confidence band information. In a few cases, the results were considered to need only a slight adjustment in wording to reflect the additional information that might be gleaned about the strength of the inference through the use of confidence intervals. In the study by Helwig, Rozek-Tedesco, Tindal, and Heath (1999), researchers were interested in determining whether students would do better on a math test that was augmented with video compared with the more traditional written test. The general concern was how reading level might impact math performance and whether that impact could be minimized through the use of a video-based delivery of the test as an accommodation.
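The f² point estimates for these regression models follow directly from the reported R² values via Cohen's conversion f² = R² / (1 − R²); a minimal check:

```python
def f2_from_r2(r2):
    """Cohen's f-squared for a regression model: R^2 / (1 - R^2)."""
    return r2 / (1.0 - r2)

# Sutton & Soderstrom (1999), "cannot control" models:
print(round(f2_from_r2(0.70), 4))   # 2.3333 (reading)
print(round(f2_from_r2(0.56), 4))   # 1.2727 (math)
```

Both point estimates fall inside the confidence bands reported above (2.1149 to 2.5706 and 1.1397 to 2.4176), consistent with the conclusion that even the band limits indicate very large effects.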
The findings did not reach statistical significance at a .05 Type I error rate (p = .08, no t value reported) and the authors concluded: "Students taking the video version of the test scored slightly higher than those taking the standard version, although that difference was not statistically significant." (p. 121) and, "As our results indicate, accommodations are unnecessary for the majority of students." (p. 123) Based on the confidence interval around the associated effect size, which had an upper limit close to a small effect, -.1012 < d < .251, it was determined that the wording might be slightly altered to reflect at least the potential for an impact of the accommodation; this example was thus rated a (2), Slight Change Needed. The use of confidence intervals had more impact on some studies without going as far as requiring a complete revision. Stangor, Carr, and King (1998) conducted a study on whether the belief that one was chosen for a leadership role based on merit or on group membership (in this case gender) impacted performance. Women were paired with a male partner to perform certain tasks. One group was told they were selected based on merit to be the leader of the pair; the other group was told they were selected merely based on their gender and not merit. The research team found a statistically significant difference, F(1, 75) = 4.75, p < .04, between the performance of the women depending on the group to which they were assigned. The authors concluded that: "As predicted, participants in the gender-only condition performed worse than participants in the control and gender+merit conditions." (p. 1191) and, "The data were conceptually consistent with prior research in demonstrating that the belief that one has been selected for a task on the basis of gender alone." (p. 1195) Based on the results of the significance test and the point estimate of effect size (Cohen's f = .2484), these statements do not appear to be too strong.
However, when one considers the confidence interval, .0225 < f < .4867, with a lower limit close to no effect, the results seem too strongly worded. While a true difference does appear to exist, there is also a possibility that any difference is very small. As such, this case earned a rating of (3), Much Change Needed. Finally, in many cases the use of confidence intervals impacted the written results/conclusions quite strongly, resulting in a recommendation for complete revision. Using an example from one of the studies cited in the previous set of examples (Mueller & Dweck, 1998), in which children were studied for their response to different types of praise, either for intelligence or effort, as well as the absence of praise, we can also see the potential impact of confidence intervals on findings, albeit from the different perspective of statistical significance. In this example, a different group of children was studied, grouped into the same three categories as before. This part of the study examined how children in the three groups differed in how much they reported enjoying tasks. Unlike the previous example from this study, the findings were statistically significant, F(2, 120) = 7.73, p < .005 (with three supporting t-tests, all showing statistical significance). Based on the results of these tests, the authors reported that: "Children praised for intelligence enjoyed the tasks less than did children praised for effort; again, children in the control conditions fell in between the other two groups. Children praised for intelligence were significantly less likely to enjoy the problems than were children in the effort and control conditions. Further, children in the control condition were less likely to enjoy the problems than those praised for effort" (p.
37) and, "Indictment of ability also led children praised for intelligence to display more negative responses in terms of lower levels of task enjoyment than their counterparts" (p. 48). The results of both the statistical and practical significance tests supported these assertions to a fair extent, with resulting p-values less than .05 on both the ANOVA and t-tests and effect sizes ranging from moderate to large point estimates. However, when confidence bands were constructed around the effect sizes, two of the three two-group comparisons included values indicating no effect. Only the t-test between the groups of children praised for ability and for effort had a confidence band that ranged from moderate to very strong differences between the two groups (0.4136 < d < 1.3495). The bandwidth around the effect size for the difference between children praised for intelligence and those receiving no praise was almost a full standard deviation wide, with a lower limit of almost zero (0.0175 < d < 0.8814), and the band around the practical effect size between students praised for effort and those receiving no praise was similar (0.0043 < d < 0.9158). This lack of precision in the estimates is alarming and does not support the strength of the authors' claims. As such, it would have been appropriate for the authors to report their findings with indications of the limitations of the inferences that could be drawn among the control group, the effort group, and the ability group. The rating received for this analysis was a (4), Complete Revision Needed.

Summary
The results of this portion of the study provide strong evidence that the inclusion of confidence intervals in reporting research findings may, in fact, severely impact the strength with which one interprets the results.
In the majority of the analyses in this study, the width of the confidence intervals and their propensity to include measures indicating a lack of effect or a small effect is of concern. Conversely, the ability to report that a confidence interval contains only medium to large effects serves to enhance the strength with which a researcher can draw conclusions. Unfortunately, this latter situation was not typical of the studies found. In approximately 74% of the analyses reviewed, the use of confidence intervals resulted in a recommendation either that results and conclusions be changed to a large extent, because they reflected the findings of significance testing only to a slight degree, or that they be completely revamped, because the confidence intervals did not substantiate the results and conclusions based on the significance testing.

Chapter Five
Conclusions

Purpose of Research
The purpose of this research was to examine the potential impact of different methods of reporting research results on the conclusions that could, and should, be made from those findings. Specifically, this study investigated how the use of practical significance, as measured by effect sizes, in addition to tests of statistical significance might impact the degree to which one should interpret results. Additionally, the use of confidence intervals around point estimates was examined in order to determine the precision of measurements obtained in studies and how that degree of precision might impact conclusions drawn from findings. The three questions investigated in this research were:
1.) To what extent does reporting outcomes of tests of statistical significance vs. tests of practical significance result in different conclusions and/or strengths of inference to be drawn from the results of research?
2.) To what extent does reporting confidence intervals instead of, or in addition to, point estimates affect the conclusions and inferences to be drawn from the results of research?
3.) What method, or combination of methods, is recommended for reporting results in educational studies?

Overview of Method
Journals used in the social sciences were reviewed for inclusion, and three rather prominent journals were selected: Reading Research Quarterly, the Journal of Personality and Social Psychology, and the Journal of Educational Research. Previously conducted research deemed worthy of publication that contained one of three rather traditional and oft-used statistical analyses, t-tests, Analysis of Variance (ANOVA), and/or regression, was reviewed, and the results were reanalyzed using not only the significance test results provided in the study but also the appropriate measures of practical significance (Cohen's d, Cohen's f, and Cohen's f², respectively). Further, confidence intervals were constructed for all point estimates, including measures of statistical as well as practical significance. Results and conclusions relative to specific statistical analyses were then examined with consideration given to the additional information provided by the calculated effect sizes and confidence intervals. The degree to which the results and conclusions presented might need to be adjusted or reconsidered was estimated.

Impact of Findings
The criticality of thorough and appropriate reporting of research results should be of primary importance to researchers, policymakers, funding agencies, publishing entities, and practitioners alike. The propensity of the current research-based literature to rely almost exclusively on the results of tests of statistical significance has the potential to rob the consumer of research, including fellow researchers and practitioners, of important information regarding the strength of the findings.
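As a sketch of the reanalysis described above, the measure used for t-tests, Cohen's d, can be computed from published summary statistics alone, without access to raw data. The group means, standard deviations, and sizes below are hypothetical:

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical groups: equal SDs, half-SD mean difference.
print(round(cohens_d(105.0, 10.0, 30, 100.0, 10.0, 30), 2))  # 0.5
```

This is the kind of computation that lets a reviewer attach a practical-significance measure to any published t-test that reports group means and standard deviations.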
The findings of this study provide evidence supporting the APA Task Force recommendations (Wilkinson, 2001) to include measures of practical significance as well as confidence intervals when reporting findings of quantitative research. The additional reporting of measures of practical significance, e.g., effect sizes, had a limited, though often informative, impact on the strength of inferences drawn in the articles examined in this study. However, the inclusion of confidence bands in analyses appears to have the potential for a drastic impact on the types and strengths of results and conclusions drawn by researchers. Admittedly, one suggested reason for resistance to using intervals is that reporting them might weaken the strength of conclusions drawn from a study, a rationale at least partially substantiated by the results of this study. While this outcome might be highly likely, it is obviously not an ethically sound reason to avoid including this information in results, and such avoidance should be stridently opposed. It is incumbent upon consumers of research to expect inclusion of this type of information if research is to contribute to practice effectively. In the end, it does not benefit the education populace to allow potentially substandard reporting practices to continue.

Statistical Significance vs. Practical Significance
When considering the results of this study, question one, "To what extent does reporting outcomes of tests of statistical significance vs. tests of practical significance result in different conclusions and/or strengths of inference to be drawn from the results of research?", is addressed with caution.
While there were clear indications that effect size reporting did impact a select number of studies, especially those found not to be statistically significant, effect sizes did not, for the most part, drastically alter how one considered the results of studies with statistically significant results. Overall, only 30.57% (n = 14) of the results/conclusions examined were considered to require major or complete revision when measures of practical significance were considered in addition to findings of statistical significance. Although this researcher continues to maintain that the reporting of effect sizes is a reasonable expectation of researchers, as it provides a different yet complementary interpretation of results, it does not appear, based on these findings, to have a substantial impact on how one views the results of a large portion of studies reporting statistically significant results in this type of literature. It is important to note, however, that the vast majority of the studies reviewed in this research had sample sizes that might be considered small to moderate. Only six of the studies had sample sizes exceeding 100 participants, and three of those were from the four regression analyses. This limitation made it difficult to observe the relationship between statistical significance and practical significance when sample sizes are large. One of the ongoing arguments for reporting measures of practical significance addresses the concern that the likelihood of finding statistically significant results increases as sample size increases. As such, with larger sample sizes, which typically provide enhanced precision of the estimate, there is a greater potential for statistically significant results to have smaller measures of practical significance, which would further impact how strongly one can interpret the results of a given study.
The consideration of practical significance measures in analyses containing non-statistically significant results had a slightly greater impact on the findings of this study. The fact that evidence of at least a small effect was present in the majority of analyses reporting a lack of statistical significance, 111 of 166 (66.87%), is quite notable. It may be that the need to consider effect sizes in research is more critical for those finding nonsignificance, especially if the design of the study is not rigorous. The potential that a true difference between groups exists, as evidenced by an effect size measure but not found through statistical significance testing, may provide enough of a foundational rationale to pursue a particular line of research with an enhanced study design.

Point Estimates vs. Confidence Intervals
The results of this study provide a much stronger basis for answering question two: "To what extent does reporting confidence intervals instead of, or in addition to, point estimates affect the conclusions and inferences to be drawn from the results of research?". Clearly, the results of the analytic review of the disparity of confidence band limits, in conjunction with the interpretive review of results, support the contention that confidence bands are critical to ensuring that results are interpreted and reported appropriately. Very few bands indicated any strong degree of measurement precision in the findings, and this lack of precision weakens the strength with which one should interpret the results. Only 7.14% of the results and conclusions considered were determined to adequately reflect the strength of inference that should be drawn when confidence interval information was included in addition to results of statistical significance and point estimates of effect sizes. The failure to include measures such as confidence bands is a disservice to the consumer of research.
The degree to which one is able to interpret the strength of inference present in any study is key to ensuring that the information is presented properly and thoroughly. Omitting this type of information is likely to result in conclusions that are, at best, misleading and, at worst, incorrect.

Reporting Results
In order to address question three, "What method, or combination of methods, is recommended for reporting results in educational studies?", many elements of the nature of the research to be conducted and the study design need to be taken into account. It does not seem reasonable that the reporting of any of the three types of information, statistical significance, practical significance, or confidence bands, should ever be discouraged or considered unacceptable due to such things as limits on manuscript length for publication purposes. One of the studies used in the examples provided earlier (Mueller & Dweck, 1998) clearly illustrated how the use of both practical significance and confidence intervals can impact different aspects of findings and conclusions in different ways within one study. The strength of nonsignificant findings was found to be questionable when considering measures of practical effect, and the strength of statistically significant findings was weakened when considering confidence intervals. However, it is important to realize that the criticality of including such measures may vary by study. Practical significance measures in statistically significant analyses provide additional information that can contribute to the interpretation of results but may make a limited substantive contribution to changes in overall conclusions and findings, especially when sample sizes are small to moderate. One of the concerns about the limitations of statistical significance tests is the tendency to find statistical significance as sample size increases.
This research, due to its inherent limitations, did not include many studies with very large samples. As such, it is quite possible that the importance of including measures of practical significance in studies with statistically significant results increases as sample size increases. In studies that do not have statistically significant results, the inclusion of effect sizes appears to have more impact, as it may be a key piece of information that helps researchers determine whether or not to pursue a given line of research. While the recommendations about whether to include measures of practical significance are somewhat murky, the same cannot be said regarding confidence intervals. The results of this study clearly indicate the importance of including such a measure to assist in determining the precision of research results. To omit this information is to withhold critical information from consumers of research; its inclusion should not only be encouraged but increasingly be made an expectation. When considering recommendations for what to include in research reporting, a critical element guiding decisions must be the intended use of the findings. If research findings will impact decisions on such things as funding, policymaking, or choice of curriculum, the importance of providing all relevant information about the effectiveness and significance of research cannot be overstated. The more critical a decision is, the more information should be provided. To that end, the information gleaned from practices such as effect size reporting and confidence intervals should always be reported.

Relevant Issues
In addition to the findings that were a direct goal and consequence of this research, other issues were identified that impact the overall integrity of research reporting.
Few studies reported what many might consider to be highly important information regarding research design and data characteristics (e.g., distributional information, reliability and validity data). Of particular note was the dearth of information about the Type I error rate at which a given study was being conducted. Relatedly, studies that used more than one t-test did not indicate that they had performed any special analyses, e.g., Bonferroni adjustments, to compensate for the possibility of inflated Type I error rates due to multiple comparisons. Relative to this study, this issue requires further investigation into how one thinks about constructing confidence intervals under these conditions. That is, do the algorithms for constructing confidence intervals need to be adjusted in situations that involve multiple comparison tests? In most cases, one had to infer the alpha level from what the authors reported as significant. The infamous asterisks in the table did not dominate all the studies but were a notable contributor to the inability to determine what Type I error rate was of true interest. This seems to indicate an underlying violation of one of the basic tenets of good research taught in most beginning research courses: the need for the researcher to decide, based on the criticality of the research and knowledge of the field, the alpha level at which significance testing will be conducted, a priori to the actual conduct of the research. The obvious absence of the communication of this rather foundational aspect of a research design is just one possible reason that the ethics of research is sometimes called into question. The findings of this study also impact how one thinks of the disciplinary norms associated with the reporting of research within the disciplines of the social sciences.
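A Bonferroni adjustment of the kind mentioned above is simple to apply; the family-wise rate and p-values below are hypothetical. The same idea extends to intervals: Bonferroni-adjusted confidence intervals are simply built at the 1 − α/m confidence level, which is one standard answer to the question of how interval algorithms might change under multiple comparisons.

```python
def bonferroni_alpha(alpha, m):
    """Per-comparison alpha after a Bonferroni adjustment
    for m tests at family-wise error rate alpha."""
    return alpha / m

# Three follow-up t-tests at a family-wise rate of .05:
adj = bonferroni_alpha(0.05, 3)
print(round(adj, 4))   # 0.0167
for p in (0.001, 0.03, 0.04):   # hypothetical p-values
    print(p, "significant" if p < adj else "not significant")
```

Under the unadjusted .05 criterion all three hypothetical p-values would be declared significant; under the adjusted criterion only the first survives, which is exactly the kind of detail the reviewed studies left unreported.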
Perhaps the community as a whole needs to consider the accepted practices of reporting research in such disciplines as education and psychology, their current expectations, and what might be changed to make the research less open to criticism or alternative interpretations. Even within a given discipline, the roles of different professionals will influence how they think about, interpret, and apply the results and conclusions of research. Within this research itself, this issue is evident. For example, the review of the interpretative results was conducted by other methodologists with backgrounds and training similar to those of the researcher conducting this study. As such, the rather strong level of interrater reliability can only be used to support the contention that other methodological researchers would draw the same types of conclusions. It cannot be used to support a claim that other consumers of research, e.g., practitioners, theorists, etc., would have similar interpretations regarding how effect size and/or confidence interval information might impact their view of the results and conclusions. A final element that should be considered, if there is to be any potential for changing the reporting practices of researchers, is the preparation of future scholars, researchers, and practitioners. Members entering a given profession engage in the practices in which they have been trained and instructed. As such, in addition to trying to reach those currently active in the engagement, dissemination, and consumption of research, it seems critical to properly train and educate those entering the field on appropriate reporting practices. New researchers should be made aware of both the frailties and the merits of the various options for reporting results. The type of information provided by effect size estimates as well as confidence intervals should be an important element of that training.
Future Research
The findings of this study strongly support the need for further investigation of the impact of research reporting practices on the integrity and interpretability of published research. This study was an initial foray into the practical implications of using effect size information as well as confidence intervals in addition to measures of statistical significance. Future studies might benefit the research community by selecting a more specific genre of research literature to review in order to assess the impact on specific fields, e.g., subject-specific research such as mathematics, administration-based research such as policy analyses, or different levels of development such as specific school levels. Additionally, similar studies within a given field but with respect to varying professional roles and responsibilities within those fields, e.g., practitioner vs. statistician, would provide yet another way of considering how different individuals and professionals perceive results based on how they are reported. One might also consider an extended examination of the impact of publication source on the extent to which measures of practical significance and confidence intervals are either reported or impact published findings. In general, the findings of this study did not indicate any strongly notable differences between the three somewhat diverse journals used, with the exception of the types of statistical analyses typically employed; however, other explorations with a focus on this as a primary question might have different results. Additionally, the relationship between the importance of measures such as practical significance and confidence intervals and the design of research studies is likely to be vital to determining the true utility of these measures in research reporting under certain conditions. Research into more definitive impacts of design characteristics such as sample size, heterogeneity of samples, etc.,
in applied research studies, along with an evaluation of their impact on effect sizes and confidence intervals, would be very beneficial to researchers throughout the social sciences. The other element of this issue is the need for research from the point of view of the consumers of research. One issue that became evident when measurement specialists were used to determine possible changes in the results reported was their tendency to use all aspects of the research design in arriving at their ratings. How this might change when the reader is less well-versed in measurement, statistical analyses, and research design is an important distinction that might further guide refinements in making determinations and judgments about appropriate practices in reporting research. A final consideration for future research taps into the preparation of researchers. It could be quite enlightening to investigate the extent to which graduate students are trained and instructed in the use of various reporting methods and practices when conducting research studies. This type of inquiry could take many forms, from reviews of course content, e.g., syllabi and textbooks, to a methodological review of dissertations and theses. An examination of how often effect size and confidence interval information is provided in new scholars' work would provide some evidence regarding the extent to which new researchers are entering the field prepared to report findings above and beyond the results of significance testing.

Summary

The findings of this research reinforce the need for increased emphasis on appropriate and thorough research reporting practices. Individuals in leadership positions who have critical decision-making power in the research world, e.g., administrators, policymakers, journal editors, and funding sources, need to require enactment and enforcement of more in-depth research reporting practices and protocols.
Without substantial requirements from such guiding forces in research, as well as enforcement of those requirements, the quality of research reported in the social sciences is not likely to see any substantial change or improvement. The quality of research in any field does not merely impact the research community. Poor research has the potential to damage the leadership of a professional community, the policies and guidelines constructed for that community, and ultimately the consumers or customers within that community. In education, this translates to damage to the learner. As a society that values education and understands that a strong educational foundation is necessary to keep society strong, we cannot afford to overlook the importance of ensuring that sound research practices are in place for all aspects of research conduct, including study design, methods, conduct, and reporting. The idea that there is a problem with the quality of educational and social science research is not new, and it is incumbent upon the leaders in the field who guide policy and funding to take strong action to improve the situation. It is often suggested that research should guide practice. What benefit is that if the research is poorly conceived, designed, or reported?

References

Abelson, R. P. (1995). Statistics as Principled Argument. Hillsdale, NJ: Lawrence Erlbaum Associates.

American Educational Research Association (n.d.). Call for manuscripts 2003-2006, Review of Educational Research. Retrieved May 18, 2003, from http://www.aera.net/pubs/rer/recall.htm

Alspaugh, J. W. (1998). Achievement loss associated with the transition to middle school and high school. Journal of Educational Research, 92(1), 20-25.

American Psychological Association (2001). Publication Manual of the American Psychological Association (5th ed.). Washington, DC: Author.

Barnett, R. J., Docherty, J. P., & Frommelt, G. M. (1991). A review of child psychotherapy research since 1963.
Journal of the American Academy of Child and Adolescent Psychiatry, 30(1), 1-14.

Baumeister, R. F., Twenge, J. M., & Nuss, C. K. (2002). Effects of social exclusion on cognitive processes: Anticipated aloneness reduces intelligent thought. Journal of Personality and Social Psychology, 83(4), 817-827.

Bazerman, C. (1981). What written knowledge does. Philosophy of the Social Sciences, 2, 361-387.

Becher, T. (1987). Disciplinary discourse. Studies in Higher Education, 12, 261-274.

Blanton, J., Buunk, B. P., Gibbons, F. X., & Kuyper, H. (1999). When better-than-others compare upward: Choice of comparison and comparative evaluation as independent predictors of academic performance. Journal of Personality and Social Psychology, 76(3), 520430.

Bowman, C. L., & McCormick, S. (2000). Comparison of peer coaching versus traditional supervision effects. Journal of Educational Research, 93(4), 256382.

Bradley, M. T., & Gupta, R. D. (1997). Estimating the effect of the file drawer problem in meta-analysis. Perceptual and Motor Skills, 85, 719-722.

Brown, R. P., Charnsangavej, T., Keough, K. A., Newman, M. L., & Renfrow, P. J. (2000). Putting the affirm into affirmative action: Preferential selection and academic performance. Journal of Personality and Social Psychology, 79(5), 736-747.

Brown, R. P., & Josephs, R. A. (1999). A burden of proof: Stereotype relevance and gender differences in math performance. Journal of Personality and Social Psychology, 76(2), 246-257.

Callan, R. J. (1999). Effects of matching and mismatching students' time-of-day preferences. Journal of Educational Research, 92(5), 295-299.

Carpenter, J. W., & Bithell, C. (2001). Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians. Statistics in Medicine, 19, 1141-1164.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). New York: Academic Press.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003).
Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Cooper, H., & Hedges, L. (1994). The Handbook of Research Synthesis. New York: Russell Sage Foundation.

Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61(4), 532-574.

Davies, M. F. (1998). Dogmatism and belief formation: Output interference in the processing of supporting and contradictory cognitions. Journal of Personality and Social Psychology, 75(2), 456-466.

Davis, O. L. (2001). So what? Journal of Curriculum and Supervision, 16(2), 91-94.

DiPrete, T. A., & Forristal, J. D. (1994). Multilevel models: Methods and substance. Annual Review of Sociology, 20, 331-359.

Evertson, C. M., & Smithey, M. W. (2000). Mentoring effects on proteges' classroom practice: An experimental field study. Journal of Educational Research, 93(5), 294204.

Fan, X. (2001). Statistical significance and effect size in education research: Two sides of a coin. The Journal of Educational Research, 94(5), 275-282.

Fan, X., & Thompson, B. (2001). Confidence intervals about score reliability coefficients, please: An EPM guidelines editorial. Educational and Psychological Measurement, 61(4), 517-531.

Fidler, F., & Thompson, B. (2001). Computing correct confidence intervals for ANOVA fixed- and random-effects effect sizes. Educational and Psychological Measurement, 61(4), 575-604.

Fitzgerald, J. (2001). Can minimally trained college student volunteers help young at-risk children to read better? Reading Research Quarterly, 36(1), 28-47.

Galassi, J. P., White, K. P., Vesilind, E. M., & Bryan, M. E. (2001). Perceptions of research from a second-year, multisite professional development schools partnership. Journal of Educational Research, 95(2), 75-83.

Gall, M. D., Borg, W. R., & Gall, J. P. (1996).
Educational Research: An Introduction (6th ed.). New York: Longman Publishers.

Gerholm, T. (1990). On tacit knowledge in academia. European Journal of Education, 25, 263-271.

Girden, E. R. (2001). Evaluating Research Articles from Start to Finish (2nd ed.). Thousand Oaks, CA: Sage Publications.

Glass, G. V., & Hopkins, K. D. (1996). Statistical Methods in Education and Psychology (3rd ed.). Needham Heights, MA: Allyn and Bacon.

Grissom, R. J., & Kim, J. J. (2001). Review of assumptions and problems in the appropriate conceptualization of effect size. Psychological Methods, 6(2), 135-146.

Hancock, D. R. (2000). Impact of verbal praise on college students' time spent on homework. Journal of Educational Research, 93(6), 384-389.

Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What If There Were No Significance Tests? Mahwah, NJ: Lawrence Erlbaum Associates.

Hedges, L. V., & Olkin, I. (1985). Statistical Methods for Meta-Analysis. New York: Academic Press.

Helwig, R., Rozek-Tedesco, M. A., Tindal, G., & Heath, B. (1999). Reading as an access to mathematics problem solving on multiple-choice tests for sixth-grade students. Journal of Educational Research, 93(21), 113-125.

Hess, M. H., & Kromrey, J. D. (2001, November). Confidence intervals around standardized mean differences: An empirical comparison of methods for constructing confidence bands around standardized mean differences. Paper presented at the annual meeting of the Florida Educational Research Association, Marco Island, FL.

Hess, M. H., & Kromrey, J. D. (2002, November). Variations on the bootstrap: A comparison of confidence band coverage for the standardized mean difference. Paper presented at the annual meeting of the Florida Educational Research Association, Gainesville, FL.

Hittleman, D. R., & Simon, A. J. (2002). Interpreting Educational Research: An Introduction for Consumers of Research (3rd ed.). Upper Saddle River, New Jersey: Merrill Prentice Hall.

Hogarty, K. Y., & Kromrey, J. D.
(1999, August). Traditional and robust effect size estimates: Power and Type I error control in meta-analytic tests of homogeneity. Paper presented at the Joint Statistical Meetings, Baltimore, MD.

Hubbard, R., & Ryan, P. A. (2000). The historical growth of statistical significance testing in psychology and its future prospects. Educational and Psychological Measurement, 60(5), 661-681.

Huff, D. (1954). How to Lie with Statistics. New York: Norton.

Institute for Scientific Information (2002). Social Sciences Citation Index Journal Citation Reports for 2001 [Microform]. Philadelphia, PA: Institute for Scientific Information.

Jenkins, E., Queen, A., & Algozzine, B. (2002). To block or not to block: That's not the question. Journal of Educational Research, 95(4), 196-202.

Jordan, G. E., Snow, C. E., & Porche, M. V. (2000). Project EASE: The effect of a family literacy project on kindergarten students' early literacy skills. Reading Research Quarterly, 35(4), 524-546.

Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., Kowalchuk, R. K., et al. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68(3), 350-386.

Kim, H. S. (2002). We talk, therefore we think? A cultural analysis of the effect of talking on thinking. Journal of Personality and Social Psychology, 83(4), 828-842.

Knapp, T. R., & Sawilowsky, S. S. (2001). Constructive criticisms of methodological and editorial practices. The Journal of Experimental Education, 70(1), 65-79.

Kromrey, J. D., & Hess, M. H. (2001, April). Interval estimates of R2: An empirical comparison of accuracy and precision under violations of the normality assumption. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.

Kromrey, J. D., & Hess, M. H. (2002, November).
Constructing confidence bands around mean differences between nonhomogeneous groups: An empirical comparison of methods. Paper presented at the annual meeting of the Florida Educational Research Association, Gainesville, FL.

Lepore, S. J., Ragan, J. D., & Jones, S. (2000). Talking facilitates cognitive-emotional processes of adaptation to an acute stressor. Journal of Personality and Social Psychology, 78(3), 499-508.

Leseman, P. M., & de Jong, P. F. (1998). Home literacy: Opportunity, instruction, cooperation and social-emotional quality predicting early reading achievement. Reading Research Quarterly, 33(3), 294-318.

McEwan, E. K., & McEwan, P. J. (2003). Making Sense of Research: What's Good, What's Not, and How to Tell the Difference. Thousand Oaks, CA: Corwin Press.

Meehl, P. E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? (pp. 393-425). Mahwah, NJ: Lawrence Erlbaum.

Meehl, P. E. (1967). Theory testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.

Morgan, D. L. (1996). Focus groups (using focus groups in research). Annual Review of Sociology, 22, 129-153.

Mori, Y., & Nagy, W. (1999). Integration of information from context and word elements in interpreting novel kanji compounds. Reading Research Quarterly, 34(1), 80-101.

Mueller, C. M., & Dweck, C. S. (1998). Praise for intelligence can undermine children's motivation and performance. Journal of Personality and Social Psychology, 75(1), 33-52.

Mulaik, S. A., Raju, N. S., & Harshman, R. A. (1997). There is a time and place for significance testing. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? (pp. 65-116). Mahwah, NJ: Lawrence Erlbaum.

Nix, T. W., & Barnette, J. J. (1998).
The data analysis dilemma: Ban or abandon. A review of null hypothesis significance testing. Research in the Schools, 5(2), 3-14.

Nosek, B. A., Banaji, M. R., & Greenwald, A. G. (2002). Math = Male, Me = Female, therefore Math ≠ Me. Journal of Personality and Social Psychology, 83(1), 44-59.

Paris, A. H., & Paris, S. G. (2003). Assessing narrative comprehension in young children. Reading Research Quarterly, 38(1), 36-76.

Parry, S. (1998). Disciplinary discourse in doctoral theses. Higher Education, 36, 273-299.

Plucker, J. A. (1997). Debunking the myth of the highly significant result: Effect sizes in gifted education research. Roeper Review, 20, 122-126.

Reichardt, C. S., & Gollob, H. F. (1997). When confidence intervals should be used instead of statistical tests, and vice versa. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? (pp. 260-284). Mahwah, NJ: Lawrence Erlbaum.

Robinson, D. H., Fouladi, R. T., & Williams, N. J. (2002). Some effects of including effect size and what if information. The Journal of Experimental Education, 70(4), 365-382.

Robinson, D. H., & Levin, J. R. (1997). Reflections on statistical and substantive significance, with a slice of replication. Educational Researcher, 26(5), 21-27.

Rosenthal, R. (1992). Effect size estimation, significance testing, and the file-drawer problem. Journal of Parapsychology, 56, 57-58.

Rosenthal, R. (1988). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The Handbook of Research Synthesis (pp. 231-244). New York, NY: Russell Sage.

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638-641.

Roth, F. P., Speece, D. L., & Cooper, E. H. (2002). A longitudinal analysis of the connection between oral language and early reading. Journal of Educational Research, 96(5), 259-271.

Santa, C. M., & Hoien, T. (1999). An assessment of Early Steps: A program for early intervention of reading problems.
Reading Research Quarterly, 34(1), 54-79.

Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum.

Spooner, F., Jordan, L., Algozzine, B., & Spooner, M. (1999). Student ratings of instruction in distance learning and on-campus classes. Journal of Educational Research, 92(3), 132-140.

Stangor, C., Carr, C., & King, L. (1998). Activating stereotypes undermines task performance expectations. Journal of Personality and Social Psychology, 75(5), 1191-1197.

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? (pp. 221-257). Mahwah, NJ: Lawrence Erlbaum.

Steiger, J. H., & Fouladi, R. T. (1992). R2: A computer program for interval estimation, power calculation, and hypothesis testing for the squared multiple correlation. Behavior Research Methods, Instruments, and Computers, 4, 581-582.

Stevens, J. (1999). Intermediate Statistics: A Modern Approach. Mahwah, NJ: Lawrence Erlbaum.

Stine, R. (1990). An introduction to bootstrap methods. Sociological Methods and Research, 18(2-3), 243-291.

Sutton, A., & Soderstrom, I. (1999). Predicting elementary and secondary school achievement with school-related and demographic factors. Journal of Educational Research, 92(6), 330-338.

Tashakkori, A., & Teddlie, C. (1998). Mixed Methodology: Combining Qualitative and Quantitative Approaches. Thousand Oaks, CA: Sage Publications.

Thompkins, A. C., & Binder, K. S. (2003). A comparison of factors affecting reading performance of functionally illiterate adults and children matched by reading level. Reading Research Quarterly, 38(2), 236-258.

Thompson, B. (2002a).
Statistical, practical, and clinical: How many kinds of significance do counselors need to consider? Journal of Counseling and Development, 80(1), 64-71.

Thompson, B. (2002b). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 25-32.

Thompson, B. (1999a). Improving research clarity and usefulness with effect size indices as supplements to statistical tests. Exceptional Children, 65(3), 329-337.

Thompson, B. (1999b). Why encouraging effect size reporting is not working: The etiology of researcher resistance to changing practices. The Journal of Psychology, 133(2), 133-140.

Thompson, B. (1998). Statistical significance and effect size reporting: Portrait of a possible future. Research in the Schools, 5(2), 33-38.

Tiedens, L. Z., & Linton, S. (2001). Judgment under emotional certainty and uncertainty: The effects of specific emotions on information processing. Journal of Personality and Social Psychology, 81(6), 973-988.

Tuckman, B. W. (1990). A proposal for improving the quality of published educational research. Educational Researcher, 19(9), 22-24.

United States Department of Education (n.d.). Inside No Child Left Behind. Retrieved May 29, 2003, from http://www.ed.gov/legislation/ESEA02

United States Department of Education (n.d.). No Child Left Behind Act factsheet. Retrieved May 27, 2003, from http://www.ed.gov/offices/OESE/esea/factsheet.html

University of South Florida Virtual Library (n.d.). Retrieved June 5, 2003, from http://www.usf.virtuallibrary.edu

Van den Branden, K. (2000). Does negotiation of meaning promote reading comprehension? A study of multilingual primary school classes. Reading Research Quarterly, 35(3), 426-443.

Wilkinson, L., & the APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Appendices

Appendix A Article Coding Sheet

Title of Article: _________________________________________________________________
Authors: __________________________________________________________________________
Website, date accessed (if applicable): ___________________________________________
Journal Name: ______________________________ Vol (No): _____ Date: _______ Pgs: _______

Preliminary Screening Information:
Which of the three analyses of interest are used in this study: ____ t-tests ____ Regression ____ ANOVA
Is one of the analyses of interest the primary analysis used for this study? ____ yes ____ no
If no, explain the relationship of the analysis to be focused on to the other analyses in the study (e.g., t-tests are used to provide supportive and/or additional information in a study that uses SEM as the primary analysis):
___________________________________________________________________________________
___________________________________________________________________________________

General Description of Study:
Date(s) of Study: _____________ Conducted by: ___________
Description of participants (age, grade, school, etc.): ____________________________
Where was the study conducted (classroom, school,
lab)? __________________
Purpose of Study: _________________________________________________________________
All Method(s) used: ____ Regression ____ Qual. ____ t-tests ____ ANOVA ____ ANCOVA ____ MANOVA ____ HLM ____ SEM ____ Other ____________ ____ Other ___________
How were missing data handled? (not discussed, listwise deletion, imputation, etc.): ______________________________________________
Was power discussed? If so, briefly describe: _____________________________________
Were validity and reliability discussed? If so, briefly describe: _________________
Demographics: ____ Race/Ethnicity ____ Gender ____ Age ____ SES ____ Other __________ ____ Other __________
No. of groups: _______
Other characteristics/issues of study: ____________________________________________
___________________________________________________________________________________

Two groups: N1  N2  Mean1  Mean2  Sd1  Sd2  Sig rep?
Regression: N  K  R2
ANOVA: N  K  F

Appendix B Article Reviewer Instructions and Cases

Reviewer Instructions (to be provided verbally as well as in writing):

You have received a collection of analyses pulled from published research. Each analysis contains a synopsis of the study, with relevant statistical information provided, as well as the results and conclusions reported by the author(s) of the study. The synopsis is not necessarily a direct quote from the study investigated; rather, it is a summary. However, all statistical information and related results/conclusions are taken directly from the article of interest, and the wording pertaining to the statistical information is quoted directly.

1. Please read the synopsis and the analysis reported. Then, review the author's words regarding their interpretation and application of that statistical analysis. Once you have reviewed the analysis and results, decide whether or not you concur with the findings/results of the author as reported, and to what degree, and then complete item A on the review sheet.

2. After completing item A, consider the calculated effect size provided. Using Cohen's definitions of effect size, decide whether or not you concur with the findings/results of the author as reported, and to what degree, and then complete item B on the review sheet.

Effect Size Index    Cohen's d    Cohen's f    Cohen's f2
Small Effect         0.20         0.10         0.02
Medium Effect        0.50         0.25         0.15
Large Effect         0.80         0.40         0.35

3. Finally, consider the confidence interval calculated at a Type I error rate of 0.05, which indicates that we are 95% confident that the truth resides somewhere within that band, although where, we do not know. When considering the interval and the related results/conclusions reported, take into account such characteristics of the interval as its lower and upper limits, width, etc.
Using this information, again decide whether or not you concur with the findings/results of the author as reported, and to what degree, and then complete item C on the review sheet.

Study Number: ____ Analysis: ____ Coder: ____________________________________________

A. Based on the information provided by the author regarding statistical significance, I:
_____ Agree completely with the results/conclusions drawn. No changes needed.
_____ Agree in essence with the results/conclusions provided; however, the wording of the results/conclusions should be changed slightly to better reflect the appropriate strength of inferences, generalizability, etc.
_____ Agree a little bit with the results/conclusions provided; however, the wording of the results/conclusions should be changed drastically to better reflect the appropriate strength of inferences, generalizability, etc.
_____ Disagree completely with the results/conclusions drawn. Complete revision needed.

B. Based on the information provided by the researcher regarding practical significance, I:
_____ Agree completely with the results/conclusions drawn. No changes needed.
_____ Agree in essence with the results/conclusions provided; however, the wording of the results/conclusions should be changed slightly to better reflect the appropriate strength of inferences, generalizability, etc.
_____ Agree a little bit with the results/conclusions provided; however, the wording of the results/conclusions should be changed drastically to better reflect the appropriate strength of inferences, generalizability, etc.
_____ Disagree completely with the results/conclusions drawn. Complete revision needed.

C. Based on the information provided by the researcher regarding 95% confidence intervals, I:
_____ Agree completely with the results/conclusions drawn. No changes needed.
_____ Agree in essence with the results/conclusions provided; however, the wording of the results/conclusions should be changed slightly to better reflect the appropriate strength of inferences, generalizability, etc.
_____ Agree a little bit with the results/conclusions provided; however, the wording of the results/conclusions should be changed drastically to better reflect the appropriate strength of inferences, generalizability, etc.
_____ Disagree completely with the results/conclusions drawn. Complete revision needed.

Study Number: 68

Synopsis of Study: The purpose of this study was to determine whether or not the prediction of impending misfortune and/or aloneness (emphasis was on aloneness) impacted perseverance and/or cognitive abilities. Three groups were assembled. Based on the results of assessments administered, participants were told that they would either: 1) spend the rest of their lives surrounded by people who care about them, 2) be accident prone for the rest of their lives, or 3) become increasingly alone in life (lose friends over time, not replaced). Participants were then administered an intelligence test. Measurements were taken of the number of items attempted and the total score.

Statistical Significance Reported, with associated results and conclusions:

Analysis 1: Issue addressed: Difference between groups regarding correctness of answers
Statistical Significance Information: F(2, 37) = 5.44, p < .01
Relevant Results/Conclusions: Participants in the future alone condition answered significantly fewer questions correctly, as compared with participants in the future belonging and misfortune conditions (p. 819). Thus, hearing that one was likely to be alone later in life affected performance on a timed cognitive test (pp. 819-820). A diagnostic forecast of future social exclusion caused a significant drop in intelligent performance (p.
820).

Before continuing, please answer item A on the review sheet.
Calculated Effect Size (Cohen's f): 0.5215
Before continuing, please answer item B on the review sheet.
Calculated Confidence Interval (95%): Lower Cohen's f: 0.1372; Upper Cohen's f: 0.8318
Please answer item C on the review sheet.

Study Number: 68

Synopsis of Study: The purpose of this study was to determine whether or not the prediction of impending misfortune and/or aloneness (emphasis was on aloneness) impacted perseverance and/or cognitive abilities. Three groups were assembled. Based on the results of assessments administered, participants were told that they would either: 1) spend the rest of their lives surrounded by people who care about them, 2) be accident prone for the rest of their lives, or 3) become increasingly alone in life (lose friends over time, not replaced). Participants were then administered an intelligence test. Measurements were taken of the number of items attempted and the total score.

Statistical Significance Reported, with associated results and conclusions:

Analysis 2: Issue addressed: Difference between groups in effort, as measured by number of items attempted.
Statistical Significance Information: F(2, 37) = 3.46, p < .05
Relevant Results/Conclusions: This analysis again showed significant variation among the three conditions. Participants in the future alone condition attempted the fewest problems. Again, the deficit was specific to feedback about social exclusion, insofar as participants in the misfortune control condition attempted as many problems (if not more) than the people in the future belonging condition (p. 820). The decline in performance reflected both a higher rate of errors and a reduced number of problems attempted (p. 820). A diagnostic forecast of future social exclusion caused a significant drop in intelligent performance (p.
820).

Before continuing, please answer item A on the review sheet.
Calculated Effect Size (Cohen's f): 0.4159
Before continuing, please answer item B on the review sheet.
Calculated Confidence Interval (95%): Lower Cohen's f: 0.000; Upper Cohen's f: 0.7149
Please answer item C on the review sheet.

Study Number: 53

Synopsis of Study: The purpose of this study was to determine whether or not the degree to which someone was considered dogmatic impacted such things as their confidence and tendency to be judgmental. This study also investigated the degree to which dogmatism impacted an individual's ability to provide reasons behind decisions and judgments, and the nature of those reasons. Faced with two possible outcomes to given scenarios (e.g., the likelihood of persons stopping to help an injured person with blood present vs. no blood present), participants selected their prediction of the outcome and then indicated how confident they were in their decision. They then listed reasons why they thought their outcome was most likely (pro decisions), as well as reasons why the other outcome might occur (con decisions).

Analysis 1: Issue addressed: Difference in confidence between individuals classified as high or low in dogmatism.
Statistical Significance Information: F(1, 61) = 3.46, p < .01
Relevant Results/Conclusions: Individuals high in dogmatism were much more confident in their judgments (M = 7.17) than individuals low in dogmatism (M = 6.19).
(p. 458)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): 0.2905
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.0505 Upper Cohen's f: 0.5238
Please answer item C on the review sheet

Study Number: 53
Synopsis of Study: The purpose of this study was to determine whether or not the degree to which someone was considered dogmatic impacted such things as their confidence and tendency to be judgmental. This study also investigated the degree to which dogmatism impacted an individual's ability to provide reasons behind decisions and judgments and the nature of those reasons. Faced with two possible outcomes to given scenarios (e.g., likelihood of persons stopping to help an injured person with blood present vs. no blood present), participants selected their prediction of the outcome and then indicated how confident they were in their decision. They then listed reasons why they thought their outcome was most likely (pro decisions) as well as reasons why the other outcome might occur (con decisions).
Statistical Significance Reported with associated results and conclusions:
Analysis 2:
Issue addressed: Are there differences in the types of reasons provided for outcomes that support an individual's opinion (pro decisions) as compared to the reasons that oppose an individual's opinion (con decisions), resulting from how dogmatic an individual is?
Statistical Significance Information: Due to the nature of the issue and the statistics provided to support the results and conclusions, data from two main effects and an interaction effect must be considered for this analysis. Please use all relevant information when deciding how you will answer the review sheet.
Main effect of dogmatism on generation of pro reasons:
F(1, 61) = 3.47, p < .07
Main effect of dogmatism on generation of con reasons: F(1, 61) = 3.07, p < .08
Interaction of level of dogmatism and type of reason generated: F(1, 61) = 10.03, p < .01
Relevant Results/Conclusions: There was a significant interaction of dogmatism with type of reason generated (see interaction information). Individuals high in dogmatism produced more pro reasons than individuals low in dogmatism (see main effect 1). Also, they produced fewer con reasons than individuals low in dogmatism (see main effect 2). (p. 458) The results (of the experiment) show that individuals high in dogmatism are more likely to generate cognitions supporting their newly created beliefs and are less likely to generate cognitions contradicting them. (p. 459)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): 0.2347
Calculated Effect Size (Cohen's f): 0.2207
Calculated Effect Size (Cohen's f): 0.4049
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.0 Upper Cohen's f: 0.4842
Calculated Confidence Interval (95%) Lower Cohen's f: 0.0 Upper Cohen's f: 0.4699
Calculated Confidence Interval (95%) Lower Cohen's f: 0.1462 Upper Cohen's f: 0.6605
Please answer item C on the review sheet

Study Number: 52
Synopsis of Study: The purpose of this study was to determine if the types of praise given to children impacted their motivation and performance. Children were placed in three groups. In the two experimental groups, children were given different types of praise for accomplishments. The first group was praised on ability ("You must be smart at these problems") and the second group was praised on effort ("You must have worked hard at these problems"). The third group served as a control and was given no feedback.
Students were subsequently given measures that rated persistence, enjoyment, quality of performance, and failure attributions. Additionally, they were administered a second assessment (similar to the one on which they had received praise) of similar difficulty.
Analysis 1:
Issue addressed: Do children who receive different types of praise (ability, effort, or none) differ in what they attribute their performance on performance measures to (effort or intelligence)?
Statistical Significance Information: Two main effects reported, no interactions:
Effect of low effort on performance: F(2, 120) = 8.64, p < .001
Effect of low intelligence on performance: F(2, 120) = 4.63, p < .05
Relevant Results/Conclusions: Children differed in their endorsements of low effort and low ability as causes of their failure (p. 37) Overall, the findings (of the study) support our hypothesis that children who are praised for intelligence when they succeed are the ones least likely to attribute their performance to low effort, a factor over which they have some amount of control. (p. 39)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): 0.3748
Calculated Effect Size (Cohen's f): 0.2744
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.1750 Upper Cohen's f: 0.5482
Calculated Confidence Interval (95%) Lower Cohen's f: 0.0621 Upper Cohen's f: 0.4423
Please answer item C on the review sheet

Study Number: 52
Synopsis of Study: The purpose of this study was to determine if the types of praise given to children impacted their motivation and performance. Children were placed in three groups. In the two experimental groups, children were given different types of praise for accomplishments.
The first group was praised on ability ("You must be smart at these problems") and the second group was praised on effort ("You must have worked hard at these problems"). The third group served as a control and was given no feedback. Students were subsequently given measures that rated persistence, enjoyment, quality of performance, and failure attributions. Additionally, they were administered a second assessment (similar to the one on which they had received praise) of similar difficulty.
Analysis 2:
Issue addressed: Do children who receive different types of praise (ability, effort, or none) differ in how they rate their enjoyment of tasks?
Statistical Significance Information: Difference between groups: F(2, 120) = 7.73, p < .005
Follow-up group comparisons:
Ability vs. Effort, t(81) = 3.81, p < .001
Ability vs. Control, t(83) = 2.03, p < .05
Control vs. Effort, t(82) = 2.16, p < .05
Relevant Results/Conclusions: Children praised for intelligence (M = 4.11) enjoyed the tasks less than did children praised for effort (M = 4.89); again, children in the control condition fell in between the other two groups (M = 4.52) Children praised for intelligence were significantly less likely to enjoy the problems than were children in the effort and control conditions. Further, children in the control condition were less likely to enjoy the problems than those praised for effort. (p. 37) Indictment of ability also led children praised for intelligence to display more negative responses in terms of lower levels of task enjoyment than their counterparts, who received commendations for effort.
(p. 48)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): 0.3545
Calculated Effect Size (Cohen's t): 0.8816
Calculated Effect Size (Cohen's t): 0.4495
Calculated Effect Size (Cohen's t): 0.4801
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.1358 Upper Cohen's f: 0.5269
Calculated Confidence Interval (95%) Lower Cohen's t: 0.4136 Upper Cohen's t: 1.3495
Calculated Confidence Interval (95%) Lower Cohen's t: 0.0175 Upper Cohen's t: 0.8814
Calculated Confidence Interval (95%) Lower Cohen's t: 0.0043 Upper Cohen's t: 0.9158
Please answer item C on the review sheet

Study Number: 52
Synopsis of Study: The purpose of this study was to determine if the types of praise given to children impacted their motivation and performance. Children were placed in three groups. In the two experimental groups, children were given different types of praise for accomplishments. The first group was praised on ability ("You must be smart at these problems") and the second group was praised on effort ("You must have worked hard at these problems"). The third group served as a control and was given no feedback. Students were subsequently given measures that rated persistence, enjoyment, quality of performance, and failure attributions. Additionally, they were administered a second assessment (similar to the one on which they had received praise) of similar difficulty.
Analysis 3:
Issue addressed: Do children who receive different types of praise (ability, effort, or none) differ regarding their future expectations of their performance?
Statistical Significance Information: Difference between groups: F(2, 48) = 1.01, ns
Relevant Results/Conclusions: No significant differences were noted for children's expectations; children in the intelligence, effort, and control conditions displayed equivalent expectations. (p. 40)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): 0.199
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.0 Upper Cohen's f: 0.4419
Please answer item C on the review sheet

Study Number: 52
Synopsis of Study: The purpose of this study was to determine if the types of praise given to children impacted their motivation and performance. Children were placed in three groups. In the two experimental groups, children were given different types of praise for accomplishments. The first group was praised on ability ("You must be smart at these problems") and the second group was praised on effort ("You must have worked hard at these problems"). The third group served as a control and was given no feedback. Students were subsequently given measures that rated persistence, enjoyment, quality of performance, and failure attributions. Additionally, they were administered a second assessment (similar to the one on which they had received praise) of similar difficulty.
Analysis 4:
Issue addressed: Do children who receive different types of praise (ability, effort, or none) differ in how harshly they judge their performance?
Statistical Significance Information: Difference between groups: F(2, 48) = 2.04, ns
Relevant Results/Conclusions: No significant differences were noted for children's expectations; children in the intelligence, effort, and control conditions displayed equivalent expectations.
(p. 40) These results indicate that effort praise and intelligence praise do not lead children to judge their performance differently.

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): 0.2828
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.0 Upper Cohen's f: 0.5366
Please answer item C on the review sheet

Study Number: 52
Synopsis of Study: The purpose of this study was to determine if the types of praise given to children impacted their motivation and performance. Children were placed in three groups. In the two experimental groups, children were given different types of praise for accomplishments. The first group was praised on ability ("You must be smart at these problems") and the second group was praised on effort ("You must have worked hard at these problems"). The third group served as a control and was given no feedback. Students were subsequently given measures that rated persistence, enjoyment, quality of performance, and failure attributions. Additionally, they were administered a second assessment (similar to the one on which they had received praise) of similar difficulty.
Analysis 5:
Issue addressed: Do children who receive different types of praise (ability, effort, or none) differ regarding persistence?
Statistical Significance Information: Difference between groups: F(2, 45) = 3.16, p = .05
Follow-up group comparisons:
Ability vs. Effort, t(30) = 2.09, p < .05
Ability vs. Control, t(30) = 2.22, p < .05
Control vs. Effort, t(30) = 0.12, ns
Relevant Results/Conclusions: Children praised for intelligence were less likely to want to persist on the problems after setbacks than were children praised for effort; children in the control condition closely resembled those in the effort condition.
Follow-up t-tests revealed significant differences between the intelligence condition and the effort and control conditions but no difference between the effort and control conditions. (p. 46) Indictment of ability also led children praised for intelligence to display more negative responses in terms of lower levels of task persistence than their counterparts, who received commendations for effort. (p. 48)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): 0.3707
Calculated Effect Size (Cohen's t): 0.7332
Calculated Effect Size (Cohen's t): 0.7777
Calculated Effect Size (Cohen's t): 0.0412
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.0 Upper Cohen's f: 0.6462
Calculated Confidence Interval (95%) Lower Cohen's t: 0.0055 Upper Cohen's t: 1.4609
Calculated Confidence Interval (95%) Lower Cohen's t: 0.0472 Upper Cohen's t: 1.0582
Calculated Confidence Interval (95%) Lower Cohen's t: -0.7570 Upper Cohen's t: 0.6746
Please answer item C on the review sheet

Study Number: 1
Synopsis of Study: The purpose of this study was to determine if differences in course delivery mode (on-campus vs. distance learning) of college courses impacted student perceptions/satisfaction with the course with respect to instructor, organization, teaching, and communication. Students in two graduate-level special education courses delivered in both modes responded to surveys measuring satisfaction with the course.
Analysis 1:
Issue addressed: Do ?
Statistical Significance Information: Overall Satisfaction: t(25) = 0.81, p > .01, ns
Relevant Results/Conclusions: No differences were evident in overall ratings. Students' overall perceptions of the course were similar when the course was taught on campus or off campus with distance education technologies.
(p. 46) As evidenced by this research, data on outcomes of distance learning experiences are favorable. Within the context expanded by data on such issues, the promises of technology-improved distance learning experiences will be realized and education for all students will be greatly enhanced.

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's t): -0.6740
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's t: -1.7509 Upper Cohen's t: 0.4030
Please answer item C on the review sheet

Study Number: 12
Synopsis of Study: The purpose of this study was to determine if differences in students' time-of-day preferences impacted their performance on an algebra test. A measure of students' time-of-day preference (morning or afternoon) was obtained, and the test was administered to members of both groups during the morning and the afternoon (not the same students).
Analysis 1:
Issue addressed: Do students who have different preferences (morning or afternoon) perform differently if they take the test in the morning?
Statistical Significance Information: Difference between groups: F(1, 64) = 5.44, p < .05
Relevant Results/Conclusions: There was a significant difference between afternoon-preferenced students and morning-preferenced students taking the test in the morning. (p. 298) The results indicate clearly that the time-of-day element in learning style may play a significant part in the instructional environment. When time preference and testing environment were matched, significant differences emerged between test results, but only for the morning test (p.
298)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): 0.2849
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.0024 Upper Cohen's f: 0.5283
Please answer item C on the review sheet

Study Number: 12
Synopsis of Study: The purpose of this study was to determine if differences in students' time-of-day preferences impacted their performance on an algebra test. A measure of students' time-of-day preference (morning or afternoon) was obtained, and the test was administered to members of both groups during the morning and the afternoon (not the same students).
Analysis 2:
Issue addressed: Do students who have different preferences (morning or afternoon) perform differently if they take the test in the afternoon?
Statistical Significance Information: Difference between groups: F(1, 64) = 3.81, p < .055
Relevant Results/Conclusions: There was a small difference between afternoon-preferenced students and morning-preferenced students taking the test in the afternoon. (p. 298) The results indicate clearly that the time-of-day element in learning style may play a significant part in the instructional environment. When time preference and testing environment were matched, significant differences emerged between test results, but only for the morning test (p.
298)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): 0.2385
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.0 Upper Cohen's f: 0.4805
Please answer item C on the review sheet

Study Number: 76
Synopsis of Study: The purpose of this study was to determine if the type of supervision preservice teachers experienced impacted their development of clarity skills, pedagogical reasoning and actions, and attitudes toward several aspects of their field experience. Preservice teachers were assigned either to the experimental group, which engaged in peer coaching techniques, or to the control group, which experienced traditional mentoring.
Analysis 1:
Issue addressed: Do students who have different supervision experiences have different attitudes toward their experience upon completion?
Statistical Significance Information: Difference between groups on overall measure: t(30) = .67, p > .51
Relevant Results/Conclusions: We did not find statistical significance for the overall rating. (p. 260) Evidence presented here indicates that peer coaching is a feasible vehicle for instituting collaborative efforts; therefore, peer coaching warrants consideration as a potentially serviceable solution for strengthening field-based training of prospective teachers (p. 261)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's d): .7929
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's d: .2840 Upper Cohen's d: 1.3018
Please answer item C on the review sheet

Study Number: 76
Synopsis of Study: The purpose of this study was to determine if the type of supervision preservice teachers experienced impacted their development of clarity skills, pedagogical
reasoning and actions, and attitudes toward several aspects of their field experience. Preservice teachers were assigned either to the experimental group, which engaged in peer coaching techniques, or to the control group, which experienced traditional mentoring.
Analysis 2:
Issue addressed: Do preservice teachers who have different supervision experiences demonstrate differences in clarity skills?
Statistical Significance Information: Difference between groups on overall measure: F(1, 30) = 41.66, p < .001
Relevant Results/Conclusions: Posttreatment results showed statistically significant differences in favor of the experimental group for overall demonstration of clarity skills. (p. 260) Evidence presented here indicates that peer coaching is a feasible vehicle for instituting collaborative efforts; therefore, peer coaching warrants consideration as a potentially serviceable solution for strengthening field-based training of prospective teachers (p. 261)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): .8068
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.5213 Upper Cohen's f: 1.0874
Please answer item C on the review sheet

Study Number: 78
Synopsis of Study: The purpose of this study was to determine if participation by families in a literacy intervention project helped their young students gain literacy skills. Parents and families participated in monthly training sessions for five months to provide them with skills and materials to help their kindergarten-age children with literacy skills. Gains on various measures were compared with gains by children in the same schools and classes who did not participate in the program.
Analysis 1:
Issue addressed: Is the family intervention program effective in helping children gain vocabulary skills?
Statistical Significance Information: Difference between groups on overall measure across time: F(1, 247) = 32.08, p < .001
Relevant Results/Conclusions: When examining the effect of the interaction of group affiliation with time using repeated measures ANOVA, we found that Project EASE participants made statistically significantly greater gains than the control group on Vocabulary. (p. 532) It appeared from the posttest measures on the CAP vocabulary subtests that those students who participated in the intervention were better able to recall more superordinate terms, which in turn have been shown to relate to the reading skills of elementary-aged children. (p. 538) Because vocabulary knowledge, story comprehension, and story sequencing are precisely the language skills that relate most strongly to literacy accomplishments (citation), the improvement on these measures strongly confirms the relevance of the intervention to improved reading outcomes. (p. 539)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): .3597
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.2309 Upper Cohen's f: 0.4878
Please answer item C on the review sheet

Study Number: 78
Synopsis of Study: The purpose of this study was to determine if participation by families in a literacy intervention project helped their young students gain literacy skills. Parents and families participated in monthly training sessions for five months to provide them with skills and materials to help their kindergarten-age children with literacy skills. Gains on various measures were compared with gains by children in the same schools and classes who did not participate in the program.
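The Cohen's f values reported in these cases can be approximated directly from the published F ratios and their degrees of freedom. A minimal Python sketch, assuming the common fixed-effects conversion via eta-squared (the appendix's exact estimator may differ slightly, e.g., through a small-sample bias correction, so small discrepancies from the printed values are expected):

```python
import math

def cohens_f_from_F(F, df1, df2):
    """Approximate Cohen's f from an ANOVA F ratio.

    eta^2 = df1*F / (df1*F + df2), and f = sqrt(eta^2 / (1 - eta^2)),
    which algebraically simplifies to sqrt(df1*F / df2).
    """
    return math.sqrt(df1 * F / df2)

# Project EASE vocabulary analysis above: F(1, 247) = 32.08
f = cohens_f_from_F(32.08, 1, 247)
```

For this case the formula gives roughly 0.36, close to the appendix's reported .3597.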
Analysis 2:
Issue addressed: Is the family intervention program effective in helping children gain sound awareness skills?
Statistical Significance Information: Difference between groups on overall measure across time: F(1, 247) = 7.45, p < .01
Relevant Results/Conclusions: When examining the effect of the interaction of group affiliation with time using repeated measures ANOVA, we found that Project EASE participants made statistically significantly greater gains than the control group on Sound Awareness. (p. 532)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): .1733
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.0474 Upper Cohen's f: 0.2985
Please answer item C on the review sheet

Study Number: 78
Synopsis of Study: The purpose of this study was to determine if participation by families in a literacy intervention project helped their young students gain literacy skills. Parents and families participated in monthly training sessions for five months to provide them with skills and materials to help their kindergarten-age children with literacy skills. Gains on various measures were compared with gains by children in the same schools and classes who did not participate in the program.
Analysis 3:
Issue addressed: Is the family intervention program effective in helping children gain story comprehension skills?
Statistical Significance Information: Difference between groups on overall measure across time: F(1, 229) = 6.85, p < .01
Relevant Results/Conclusions: When examining the effect of the interaction of group affiliation with time using repeated measures ANOVA, we found that Project EASE participants made statistically significantly greater gains than the control group on Story Comprehension. (p. 532) The impact of participation in Project EASE on children's language scores is striking. (p.
537) Because vocabulary knowledge, story comprehension, and story sequencing are precisely the language skills that relate most strongly to literacy accomplishments (citation), the improvement on these measures strongly confirms the relevance of the intervention to improved reading outcomes.

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): .1874
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.0448 Upper Cohen's f: 0.3288
Please answer item C on the review sheet

Study Number: 78
Synopsis of Study: The purpose of this study was to determine if participation by families in a literacy intervention project helped their young students gain literacy skills. Parents and families participated in monthly training sessions for five months to provide them with skills and materials to help their kindergarten-age children with literacy skills. Gains on various measures were compared with gains by children in the same schools and classes who did not participate in the program.
Analysis 4:
Issue addressed: Is the family intervention program effective in helping children gain language skills?
Statistical Significance Information: Difference between groups on overall measure across time: F(1, 246) = 35.46, p < .001
Relevant Results/Conclusions: Although all the children in the sample showed statistically significant gains in all three literacy composites over time, we were able to attribute a statistically significant gain in Language skills to the Project EASE intervention. (p. 532) The impact of participation in Project EASE on children's language scores is striking. (p.
537)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's f): .3789
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's f: 0.2494 Upper Cohen's f: 0.5077
Please answer item C on the review sheet

Study Number: 73
Synopsis of Study: The purpose of this study was to determine if praise impacted the amount of time college students spent on homework. Additionally, it investigated whether praise impacted achievement. Students maintained a log of time spent on homework and were either placed into the praised group (when receiving the log, the instructor briefly reviewed it and told the student "good job," "very good," or "great work") or into the non-praised group; these students were merely thanked when they turned in their log. At the end of the course, the average amount of time spent on homework for 17 randomly selected homework assignments was calculated and compared, as well as performance on an instructor-created final examination.
Analysis 1:
Issue addressed: Does praise impact the amount of time spent on homework?
Statistical Significance Information: Difference between groups: t(59) = 9.788, p < .001
Relevant Results/Conclusions: Results revealed that students studied significantly more outside of the classroom when exposed to the verbal praise treatment than when exposed to the no verbal praise treatment. (p. 387) Although the results of this study may not generalize to all college student populations, they demonstrate the profound impact of properly administered verbal praise on college students' motivation to engage in homework. (p.
388)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's d): 2.4881
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's d: 1.8196 Upper Cohen's d: 3.1566
Please answer item C on the review sheet

Study Number: 73
Synopsis of Study: The purpose of this study was to determine if praise impacted the amount of time college students spent on homework. Additionally, it investigated whether praise impacted achievement. Students maintained a log of time spent on homework and were either placed into the praised group (when receiving the log, the instructor briefly reviewed it and told the student "good job," "very good," or "great work") or into the non-praised group; these students were merely thanked when they turned in their log. At the end of the course, the average amount of time spent on homework for 17 randomly selected homework assignments was calculated and compared, as well as performance on an instructor-created final examination.
Analysis 2:
Issue addressed: Does praise on homework throughout the length of a course impact performance on the end-of-course assessment?
Statistical Significance Information: Difference between groups: t(59) = 1.929, p > 0.05, ns
Relevant Results/Conclusions: Although the difference was not statistically significant (on the end-of-course exam), the direction of the means suggested that the students exposed to verbal praise not only studied more for each lesson but also achieved more than those not exposed to verbal praise. (p. 387) In addition, my findings suggest that students who experience verbal praise for doing homework perform somewhat better on an instructor-created, criterion-referenced final examination than those who experience no verbal praise for their homework habits. (p.
388)

Before continuing, please answer item A on the review sheet
Calculated Effect Size (Cohen's d): .4800
Before continuing, please answer item B on the review sheet
Calculated Confidence Interval (95%) Lower Cohen's d: .0292 Upper Cohen's d: .9891
Please answer item C on the review sheet

Appendix C SAS Code

proc printto print='C:\Cohen_ci.lst';
*++
This program calculates confidence bands for two-group effect size
(Cohen's d) using both an interval inversion approach through the
macro at the beginning and then using z bands.
This first part calculates endpoints using Steiger
Raw values are input about midway through program for two-group
Ns, means and std deviations. Depending on data provided, these
inputs might need to be modified.
Last modification: 4 Sept 2003
++;
*++
Input to the macro:
data = name of data set
effect_size = obtained sample value of Cohen d
n1 = sample size of group one
n2 = sample size of group two
Output is printed table of confidence intervals
*++;
%macro EFFECT_CI(data,effect_size,n1,n2);
proc iml;
start find_delta(obs_stat,n1,n2,pctl,delta_t);
df = n1 + n2 - 2;
* Step 1: Find value of delta that is a little too high;
OK = 0;
delta_t = 0; * start the loop with population effect size = 0;
loop = 0;
do until (OK = 1);
nc = delta_t#sqrt(n1#n2/(n1+n2));
cumprob = PROBT(obs_stat,df,nc);
if cumprob ...
delta_t = 0; * start the loop with population effect size = 0;
loop = 0;
do until (OK = 1);
nc = delta_t#sqrt(n1#n2/(n1+n2));
cumprob = PROBT(obs_stat,df,nc);
if cumprob > pctl then OK = 1;
if cumprob ...
do i = 1 to k;
obs_stat = effect_vec[i,1]#sqrt(n1[i,1]#n2[i,1]/(n1[i,1]+n2[i,1]));
run find_delta(obs_stat,n1[i,1],n2[i,1],.005,delta005);
run find_delta(obs_stat,n1[i,1],n2[i,1],.995,delta995);
run find_delta(obs_stat,n1[i,1],n2[i,1],.025,delta025);
run find_delta(obs_stat,n1[i,1],n2[i,1],.975,delta975);
run find_delta(obs_stat,n1[i,1],n2[i,1],.05,delta05);
run find_delta(obs_stat,n1[i,1],n2[i,1],.95,delta95);
print_effect = effect_vec[i,1];
file print;
put @1 print_effect 8.4 @10 delta995 8.4
@20 delta005 8.4 @30 delta975 8.4 @40 delta025 8.4 @50 delta95 8.4 @60 delta05 8.4;
end;
quit;
%mend EFFECT_CI;

data one;
input journ $ article analysis $ n1 n2 mn1 mn2 sd1 sd2;
nsample1 = n1;
nsample2 = n2;
d = 0; vard = 0;
width_z_99 = 0; width_z_95 = 0; width_z_90 = 0;
lo_z_99 = 0; hi_z_99 = 0;
lo_z_95 = 0; hi_z_95 = 0;
lo_z_90 = 0; hi_z_90 = 0;
*++
Compute sample means and variances
*++;
n1 = n1;
n2 = n2;
mn1 = mn1;
mn2 = mn2;
var1 = sd1**2;
var2 = sd2**2;
*++
Compute sample value of d and its variance
*++;
d = (mn1 - mn2) / (((((n1 - 1)*var1) + ((n2 - 1)*var2)) / (n1 + n2 - 2))**0.5);
vard = ((n1 + n2)/(n1*n2)) + d**2/(2*(n1 + n2));
*++
Compute endpoints of CI using normal distribution
*++;
lo_z_99 = d - (2.576*sqrt(vard));
hi_z_99 = d + (2.576*sqrt(vard));
lo_z_95 = d - (1.96*sqrt(vard));
hi_z_95 = d + (1.96*sqrt(vard));
lo_z_90 = d - (1.645*sqrt(vard));
hi_z_90 = d + (1.645*sqrt(vard));
*++
Normal Z Bands
++;
width_z_99 = width_z_99 + (hi_z_99 - lo_z_99);
width_z_95 = width_z_95 + (hi_z_95 - lo_z_95);
width_z_90 = width_z_90 + (hi_z_90 - lo_z_90);
*++
just computing sample delta
*++;
width_z_99 = 0; width_z_95 = 0; width_z_90 = 0;
lo_z_99 = 0; hi_z_99 = 0;
lo_z_95 = 0; hi_z_95 = 0;
lo_z_90 = 0; hi_z_90 = 0;
*++
Compute sample means and variances
*++;
n1 = n1;
n2 = n2;
mn1 = mn1;
mn2 = mn2;
var1 = 6.93**2;
var2 = 5.71**2;
*++
Compute sample value of d and its variance
*++;
d = (mn1 - mn2) / (((((n1 - 1)*var1) + ((n2 - 1)*var2)) / (n1 + n2 - 2))**0.5);
vard = ((n1 + n2)/(n1*n2)) + d**2/(2*(n1 + n2));
*++
Compute endpoints of CI using normal distribution
*++;
lo_z_99 = d - (2.576*sqrt(vard));
hi_z_99 = d + (2.576*sqrt(vard));
lo_z_95 = d - (1.96*sqrt(vard));
hi_z_95 = d + (1.96*sqrt(vard));
lo_z_90 = d - (1.645*sqrt(vard));
hi_z_90 = d + (1.645*sqrt(vard));
*++
Normal Z Bands
++;
width_z_99 = width_z_99 + (hi_z_99 - lo_z_99);
width_z_95 = width_z_95 + (hi_z_95 - lo_z_95);
width_z_90 = width_z_90 + (hi_z_90 - lo_z_90);
*If journals are coded by:
 1: Reading Research Quarterly
 2: Journal of Educational Research
 3: Journal of Personality and Social Psychology;
diff = mn1 - mn2;
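The data step above computes Cohen's d from two-group summary statistics and puts normal-approximation z bands around it. As a cross-check, here is a minimal Python sketch of the same arithmetic; the function name and the example values in the usage note are hypothetical illustrations, not values from the reviewed studies.

```python
import math

def cohens_d_ci(n1, n2, mn1, mn2, sd1, sd2, z=1.96):
    """Cohen's d from two-group summary statistics, with the
    normal-approximation (z band) interval used in the SAS data step."""
    # pooled-variance standardizer, as in the SAS computation of d
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mn1 - mn2) / sp
    # large-sample variance of d, mirroring vard above
    vard = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    half = z * math.sqrt(vard)
    return d, d - half, d + half
```

With two groups of 50, means 10.5 and 10.0, and common SD 1.0, this gives d = 0.50 with a 95% band of roughly (0.10, 0.90), the same shape of result the SAS program prints for each study.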
vardiff = ((n1 + n2)/(n1*n2)) + diff**2/(2*(n1 + n2));
crit_t99 = TINV(.995, n1 + n2 - 2, 0);
crit_t95 = TINV(.975, n1 + n2 - 2, 0);
crit_t90 = TINV(.95, n1 + n2 - 2, 0);
lo_t_99 = diff - (crit_t99*sqrt(vardiff));
hi_t_99 = diff + (crit_t99*sqrt(vardiff));
lo_t_95 = diff - (crit_t95*sqrt(vardiff));
hi_t_95 = diff + (crit_t95*sqrt(vardiff));
lo_t_90 = diff - (crit_t90*sqrt(vardiff));
hi_t_90 = diff + (crit_t90*sqrt(vardiff));
width_t_99 = hi_t_99 - lo_t_99;
width_t_95 = hi_t_95 - lo_t_95;
width_t_90 = hi_t_90 - lo_t_90;
cards;
JER1A423 3.693.94.59.33
JER1B423 3.563.88.33.31
JER1C423 3.653.88.59.34
JER1D423 4.154.23.34.17
JER1E423 3.483.62.47.44
JER1F423 3.493.79.48.27
JER1G1113 3.693.79.28.44
JER1H1113 3.723.60.29.22
JER1I1113 3.653.65.19.43
JER1J1113 3.834.25.19.10
JER1K1113 3.583.42.44.23
JER1L1113 3.563.63.13.40
JER5A5515 4.794.21.50.81
JER5B5515 4.714.07.54.83
JER5C5515 4.464.13.75.83
JER5D5515 4.363.93.881.16
JER5E5515 4.044.27.87.70
JER5F5515 3.923.72.961.58
JER5G5515 4.023.291.181.27
JER5H5515 3.063.401.071.12
JER5I5515 2.423.131.321.13
JER5J5515 2.173.471.201.36
JER76D32323.754.501.18.63
JER76E32323.874.751.02.45
JER76F32324.314.88.87.34
JER76G32324.314.80.70.41
JER76H32324.694.56.80.73
JER76I32324.564.751.03.45
JER4A24724727.5626.849.459.68
JER4B14914931.1030.768.168.47
JER4C9898 22.1820.878.748.27
JER4D9494 22.2721.579.039.31
JER4E4545 34.5134.537.628.41
JER4F3333 24.9424.527.367.05
JER4G5959 20.4819.079.238.49
JER4H3535 25.3125.787.929.22
JER4I3535 29.5028.366.765.86
JER4J3535 36.7537.076.326.55
JER73A303134.746.85.34.4
JER73B303183.586.05.64.8
JPSP63M54414.272.053.171.69
JPSP63N54414.896.021.842.62
JPSP63Y1461463.27.96.851.59
JPSP52D383911.964.948.157.04
JPSP52E463810.5811.968.438.15
JPSP52F393816.499.7811.049.00
JPSP52G463913.8816.499.1811.04
JPSP52H39383.254.531.411.03
JPSP52I39463.254.304.411.33
JPSP52J38464.534.301.031.33
JPSP52L39384.114.891.02.72
JPSP52M39464.114.521.020.81
JPSP52N38464.894.52.72.81
JPSP52P3938.921.211.531.63
JPSP52Q3946.92.131.531.57
JPSP52R38461.21.131.631.57
JPSP52AA302914.834.707.703.43 JPSP52AB302914.837.977.704.87 JPSP52AC29294.707.973.434.87 JPSP52AD293019.797.707.186.20 JPSP52AE292919.7912.287.187.43 JPSP52AF30297.7012.286.207.43 JPSP52AH29303.245.20.831.00 JPSP52AI29293.244.28.831.29 JPSP52AJ30295.204.281.001.29 JPSP52AL29303.864.991.01.55 JPSP52AM29293.864.491.01.94 JPSP52AN30294.994.49.55.94 PAGE 165 Appendix C SAS Code 155 JPSP52AQ2930.371.231.421.50 JPSP52AR2929.37.341.422.13 JPSP52AS30291.23.341.502.13 JPSP52AZ17174.242.191.791.52 JPSP52BA17172.193.471.522.24 JPSP52BB17174.243.461.792.24 JPSP52BE151620.067.1311.325.52 JPSP52BF151520.0610.0611.326.79 JPSP52BG16157.1310.065.526.79 JPSP52BH161520.947.757.179.50 JPSP52BI161520.9412.067.178.06 JPSP52BJ15157.7512.069.508.06 JPSP52BL16153.444.621.591.63 JPSP52BM16153.444.561.591.26 JPSP52BN15154.624.561.631.26 JPSP52BP16153.925.19.95.82 JPSP52BQ16153.924.90.95.93 JPSP52BR15155.194.90.82.95 JPSP52BV161620.817.259.425.34 JPSP52BW161620.815.759.424.92 JPSP52BX16167.255.755.344.92 JPSP52BZ161616.947.139.746.48 JPSP52CA161616.9413.319.748.67 JPSP52CB16167.1313.316.488.67 JPSP52CE16163.844.86.74.88 JPSP52CF16163.844.41.74.80 JPSP52CG16164.864.41.88.80 JPSP52CK16164.386.812.162.23 JPSP52CL16166.814.942.231.84 JPSP52CM16164.384.942.161.84 JPSP52CP16164.132.561.201.44 JPSP52CQ16164.132.941.201.84 JPSP52CR16162.562.941.441.84 JPSP69D17179.2412.354.042.62 JPSP69E21209.769.353.483.01 JPSP69I22225.274.452.072.13 JPSP69J23233.304.172.322.23 JPSP69K23225.894.511.011.23 JPSP69L23224.022.861.891.68 JPSP58A64644.492.25.50.74 JPSP58B64644.483.01.50.68 JPSP58C64644.122.32.59.66 JPSP58D64643.983.35.83.71 JPSP58E64643.562.94.74.87 rrq18A23263.62.61.31.4 rrq49A232659.653.75.9512.4 rrq49H232629.823.25.88.2 rrq49I23263.62.61.31.4 JER3A103611312.522.581.011.07 JER3B103611312.832.91.91.91 JER3C103611312.282.291.121.16 JER3D103611313.073.10.88.89 JER3E103611311.982.061.151.15 JER3F103611312.242.411.071.10 JER3G103611312.442.501.051.06 JER3H103611312.212.151.161.21 PAGE 166 Appendix C SAS Code 
JER3I103611312.372.281.051.16
JER3J103611311.291.171.281.24
JER3K103611311.831.801.241.25
JER3L103611311.461.45.57.60
JER3M103611311.581.58.54.55
JER3N103611311.271.23.67.67
JER3O103611311.501.50.56.58
JER3P103611311.111.12.71.69
JER3Q103611311.261.31.60.60
JER3R103611311.421.41.62.62
JER3S103611311.361.36.60.60
JER3T103611311.431.35.61.65
JER3U10361131.78.69.76.74
JER3V103611311.121.08.67.68
JER3W103611312.042.10.75.75
JER3X103611312.062.10.76.76
JER3Y103611311.541.52.991.01
JER3Z103611312.462.43.75.79
JER3AA103611311.721.74.90.93
JER3AB103611311.591.56.91.94
JER3AC103611311.971.95.88.90
JER3AD103611311.981.95.81.81
JER3AE103611311.981.86.91.92
JER3AF103611311.111.021.101.02
JER3AG103611311.431.371.01.98
;
*The following calls the macro for Interval Inversion;
%EFFECT_CI(one, d, n1, n2);

PROC FREQ;
TABLES JOURN*ARTICLE;

title1 'Cohen d Confidence Intervals - z transformation';
proc print;
var journ article analysis n1 n2 mn1 mn2 var1 var2 vard d hi_z_99
    lo_z_99 d hi_z_95 lo_z_95 d hi_z_90 lo_z_90 d;
*proc print;
*var d lo_z_99 hi_z_99 lo_z_95 hi_z_95 lo_z_90 hi_z_90;
*proc print;
*var width_z_99 width_z_95 width_z_90;

title1 'Difference of Means Confidence Intervals by t-test';
*proc print;
*var journ article n1 n2 mn1 mn2 sd1 sd2 diff;
*proc print;
*var crit_t99 crit_t95 crit_t90;
proc print;
var journ article n1 n2 mn1 mn2 sd1 sd2 diff hi_t_99 lo_t_99 diff
    hi_t_95 lo_t_95 diff hi_t_90 lo_t_90;
*proc print;
*var width_t_99 width_t_95 width_t_90;
run;

*++
This program calculates confidence bands for the effect size
(Cohen's f) in ANOVA analyses using both an interval-inversion
approach and z transformation.
Raw values are input about midway through the program for the total N,
the number of groups, and the F value obtained in the original analysis.
Depending on the data provided, these inputs might need to be modified.
Last modification: 4 Sept 2003
++;
*++
Input to subroutine:
data  = name of data set
F_obt = obtained value of F
N     = sample size
K     = number of groups
u     = degrees of freedom, numerator
v     = degrees of freedom, denominator
Output is printed table of confidence intervals (at least I hope, someday :)
*++;
data one;
input journ $ article analysis $ N k F_obt;
u = k - 1;
v = N - k;
eta2 = ((k - 1)*F_obt)/((k - 1)*F_obt + N);
f = (eta2/(1 - eta2))**.5;
loweta2_90 = 0; loweta2_95 = 0; loweta2_99 = 0;
higheta2_90 = 0; higheta2_95 = 0; higheta2_99 = 0;
widtheta2_90 = 0; widtheta2_95 = 0; widtheta2_99 = 0;
lowf_90 = 0; lowf_95 = 0; lowf_99 = 0;
highf_90 = 0; highf_95 = 0; highf_99 = 0;
widthf_90 = 0; widthf_95 = 0; widthf_99 = 0;
*+++++++++++++++++++++++++++++++++++++++++++++
Calculating the upper and lower bounds of eta2
using O&F3 ... called loweta2_95 and higheta2_95.
Currently calculations are only done using the
95th percentile, pending resolution of method
+++++++++++++++++++++++++++++++++++++++++++++;
z = log((1 + sqrt(eta2))/(1 - sqrt(eta2)));
loweta2_95 = z - ((2*(1.96))/(sqrt(N)));
higheta2_95 = z + ((2*(1.96))/(sqrt(N)));
low95 = exp(loweta2_95);
high95 = exp(higheta2_95);
loweta2_95 = ((low95 - 1)/(low95 + 1))**2;
higheta2_95 = ((high95 - 1)/(high95 + 1))**2;
if loweta2_95 < 0 then loweta2_95 = 0;
if higheta2_95 > 1 then higheta2_95 = 1;
widtheta2_95 = higheta2_95 - loweta2_95;
loweta2_99 = z - ((2*(2.576))/(sqrt(N)));
higheta2_99 = z + ((2*(2.576))/(sqrt(N)));
low99 = exp(loweta2_99);
high99 = exp(higheta2_99);
loweta2_99 = ((low99 - 1)/(low99 + 1))**2;
higheta2_99 = ((high99 - 1)/(high99 + 1))**2;
if loweta2_99 < 0 then loweta2_99 = 0;
if higheta2_99 > 1 then higheta2_99 = 1;
widtheta2_99 = higheta2_99 - loweta2_99;
loweta2_90 = z - ((2*(1.645))/(sqrt(N)));
higheta2_90 = z + ((2*(1.645))/(sqrt(N)));
low90 = exp(loweta2_90);
high90 = exp(higheta2_90);
loweta2_90 = ((low90 - 1)/(low90 + 1))**2;
higheta2_90 = ((high90 - 1)/(high90 + 1))**2;
if loweta2_90 < 0 then loweta2_90 = 0;
if higheta2_90 > 1 then higheta2_90 = 1;
widtheta2_90 = higheta2_90 - loweta2_90;
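The eta-squared, Cohen's f, and z-band computations in the data step above can be cross-checked with a short Python sketch. The function name and the example F, N, and k values are hypothetical illustrations, not values from the reviewed studies.

```python
import math

def anova_effect_ci(F_obt, N, k, zcrit=1.96):
    """eta^2 and Cohen's f from an ANOVA F statistic, with the
    log/z-transformation bands used in the SAS data step above."""
    eta2 = ((k - 1) * F_obt) / ((k - 1) * F_obt + N)
    f = math.sqrt(eta2 / (1 - eta2))
    # z = log((1 + sqrt(eta2)) / (1 - sqrt(eta2))), banded by 2*zcrit/sqrt(N)
    z = math.log((1 + math.sqrt(eta2)) / (1 - math.sqrt(eta2)))
    lo_z = z - 2 * zcrit / math.sqrt(N)
    hi_z = z + 2 * zcrit / math.sqrt(N)
    # back-transform and clamp to [0, 1], as the SAS code does
    back = lambda b: ((math.exp(b) - 1) / (math.exp(b) + 1)) ** 2
    lo, hi = max(back(lo_z), 0.0), min(back(hi_z), 1.0)
    return eta2, f, lo, hi
```

Note the back-transform is exact at the point estimate: with zcrit = 0 both endpoints collapse to eta-squared itself, since (e^z - 1)/(e^z + 1) recovers sqrt(eta2).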
*+++++++++++++++++++++++++++++++++++++++++ ThissetofCIs(calledlowf_95aNd highf_95)arecoNstructedbycalculatiNg fforthelowereta2aNduppereta2calculated earlier...thismethodistheoNemore appropriate??? PAGE 170 Appendix C SAS Code 160 ++++++++++++++++++++++++++++++++++++++++++; lowf_95=(loweta2_95/( 1 loweta2_95))** .5 ; highf_95=(higheta2_95/( 1 higheta2_95))** .5 ; widthf_95=highf_95lowf_95; lowf_99=(loweta2_99/( 1 loweta2_99))** .5 ; highf_99=(higheta2_99/( 1 higheta2_99))** .5 ; widthf_99=highf_99lowf_99; lowf_90=(loweta2_90/( 1 loweta2_90))** .5 ; highf_90=(higheta2_90/( 1 higheta2_90))** .5 ; widthf_90=highf_90lowf_90; Smpl_eta2=eta2; cards; RRQ78A2292.04 RRQ78B2292.71 RRQ78C2482.19 RRQ78D248232.08 RRQ78E2472.72 RRQ78F19526.85 RRQ78G24824.80 RRQ78H248212.86 RRQ78I22928.52 RRQ78J2292.56 RRQ78K24822.08 RRQ78L24827.45 RRQ78M24821.42 RRQ78N2472.89 RRQ78O1952.09 RRQ78P2482.06 RRQ78Q2482.03 RRQ78R2292.57 RRQ78S22921.14 RRQ78T2482.28 RRQ78U2482.16 RRQ78V2482.13 RRQ78W24821.53 RRQ78X24822.63 RRQ78Y2472.81 RRQ78Z24728.13 RRQ78AA24721.59 RRQ78AB247235.46 RRQ78AC24723.69 RRQ78AD24721.92 RRQ78AE24720.00 RRQ78AF2472.78 RRQ32B5825.85 RRQ32C58218.05 RRQ32D5822.43 RRQ32E11628.41 RRQ32F11623.13 RRQ32G5826.88 RRQ32H5827.61 RRQ32I58213.81 PAGE 171 Appendix C SAS Code 161 RRQ32J58210.05 RRQ32K58211.48 RRQ32L5827.79 RRQ32M58268.9 RRQ32N5824.02 RRQ32O58211.56 RRQ32P58214.88 RRQ32Q58290.93 RRQ32R58225.15 RRQ32S5824.20 RRQ32T5825.71 RRQ32U5824.20 RRQ32V58210.74 RRQ32W58211.99 RRQ32X58233.19 RRQ32Y58217.19 RRQ32Z5824.67 RRQ32AA5828.05 RRQ35A158315.10 RRQ35B158326.35 RRQ35C158327.10 RRQ35D158315.37 RRQ35E158215.19 RRQ35F6024.71 RRQ35G6026.99 RRQ35H9033.10 RRQ35I9036.59 RRQ35J9038.79 RRQ35K9039.71 RRQ35L9037.18 RRQ35M9039.17 RRQ35N9139.47 RRQ35O9137.18 RRQ35P9035.10 RRQ35Q9139.47 RRQ35R9134.86 RRQ35S9138.64 RRQ35T9125.88 RRQ35U4625.99 RRQ35V46210.72 RRQ35W4626.32 RRQ35X4625.50 RRQ35Y46210.69 RRQ35Z13936.9 RRQ35AA8524.8 RRQ35AB14039.3 RRQ35AC8624.9 RRQ35AD140313.3 RRQ35AE8625.9 
RRQ35AF139310.0 RRQ35AG8528.9 RRQ35AH140349.2 RRQ35AI140338.5 RRQ35AJ140338.5 RRQ35AK53227.0 RRQ35AL53258.1 PAGE 172 Appendix C SAS Code 162 RRQ35AM53253.8 RRQ35AN53229.2 RRQ35AO53210.9 RRQ35AP53264.5 RRQ35AQ53236.9 RRQ35AR53250.0 RRQ35AS5326.3 RRQ47A882284.09 RRQ47B8833.61 RRQ47C8823.89 RRQ47D88357.02 RRQ47E88310.26 RRQ47F88214.10 RRQ47G882428.82 RRQ47H8833.73 RRQ47I8836.22 RRQ47J88332.43 RRQ47K88332.43 RRQ47L882374.57 RRQ47M88232.11 RRQ47N8836.51 RRQ47O8835.47 RRQ47P882329.66 RRQ47Q8836.23 RRQ47R882136.73 RRQ47S8827.60 RRQ47T8839.23 RRQ47U882178.00 RRQ47V882700.61 RRQ47W8839.14 RRQ47X8828.42 RRQ47Y8833.90 RRQ47Z88321.24 RRQ47AA882620.89 RRQ47AB88220.61 RRQ47AC88311.64 RRQ47AD8836.14 RRQ47AE8826.97 RRQ47AF8839.99 RRQ47AG88327.87 RRQ47AH88245.16 RRQ47AI88233.65 RRQ47AJ88321.85 RRQ47AK8827.63 RRQ47AL8838.06 RRQ48A118331.6 RRQ42A1514124.81 RRQ42B151121.73 RRQ42C15142.90 RRQ46A9134.57 RRQ46B913113.5 RRQ46C91389.29 RRQ46D91373.99 RRQ46E913113.26 RRQ46F91362.09 RRQ79A8324.72 PAGE 173 Appendix C SAS Code 163 RRQ79B83216.72 RRQ79C8326.27 RRQ79D8325.09 RRQ79E83253.66 RRQ79F83216.42 RRQ79G83221.78 RRQ79H8328.55 RRQ79I83252.98 RRQ79J8329.83 RRQ79K83248.03 RRQ79L83317.68 RRQ79M83326.29 RRQ79N83374.26 RRQ79O83392.84 RRQ79P83278.81 RRQ79Q83211.23 RRQ79R832182.44 RRQ79S83397.11 RRQ79T8328.91 RRQ79U8323.40 RRQ79W83229.3 RRQ79Z83315.58 RRQ77A713.02 RRQ77B7131.80 RRQ77C71368.84 RRQ77D71346.72 RRQ77E71316.38 RRQ77F7139.72 RRQ77G713202.44 RRQ77H7138.48 RRQ77I71314.05 RRQ77J71329.95 RRQ77L71313.32 JER76A64223.71 JER76B64249.77 JER76C64241.66 JER12A672.16 JER12B6728.95 JER12C6729.23 JER12D67213.81 JER12E6720 JER12F6725.44 JER12G6723.81 JER12H7429.90 JER12I7426.25 JER12J7428.25 JER12K74218.32 JER12L7420.04 JER12M74210.27 JER12N742.12 JER31A473.98 JER31B14596.42 JER31C14591.04 JER31D4737.34 JER74A9244.64 JER74B9244.65 PAGE 174 Appendix C SAS Code 164 JER74C9248.16 JER74D9244.63 JER74E9245.32 JER74F9247.47 JER74G9246.33 JER74H9247.95 JER74I92410.79 JER74J92410.25 JER74K9246.32 
JER74L9246.22 JER74M9247.23 JER74N9247.59 JER74O92412.55 JER74P9247.66 JER74Q9246.74 JER74R9244.82 JER74S92415.40 JER74T9246.29 JER74U9247.88 JER74V9245.17 JER74W9244.71 JER74X9245.92 JER74Y9245.23 JPSP56A124210.82 JPSP56B12423.97 JPSP56C12427.01 JPSP56D7443.98 JPSP56E3820.00 JPSP56F3428.17 JPSP56G6944.55 JPSP56H3321.05 JPSP56I3624.75 JPSP63A116227.75 JPSP63B1162144.98 JPSP63C11221.06 JPSP63D11225.38 JPSP63E163231.32 JPSP63F163253.18 JPSP63G15927.23 JPSP63H15923.94 JPSP63I15925.22 JPSP63J95270.42 JPSP63K9521.87 JPSP63L95212.78 JPSP63O9526.15 JPSP63P9323.19 JPSP63Q9325.36 JPSP63R146429.19 JPSP63S146415.25 JPSP63T144416.55 JPSP63U1454105.5 JPSP63V145411.29 JPSP63W1454.91 JPSP63X1454.07 JPSP63Z140224.47 JPSP63AA14242.96 PAGE 175 Appendix C SAS Code 165 JPSP63AF140226.21 JPSP63AG14243.48 JPSP63AH14242.21 JPSP53A6327.46 JPSP53B63213.97 JPSP53C63210.33 JPSP53D6323.47 JPSP53E6323.07 JPSP53F622.86 JPSP53G7326.16 JPSP53H7329.4 JPSP53I7324.97 JPSP53J7325.37 JPSP53L72227.95 JPSP53M7222.15 JPSP53N128217.61 JPSP53O12825.31 JPSP53P128221.17 JPSP53Q12821.79 JPSP53R12825.26 JPSP53S128221.11 JPSP53T12821.80 JPSP53U12826.39 JPSP53V128211.75 JPSP53W128217.78 JPSP53Y128213.14 JPSP53Z128218.97 JPSP53AB12829.73 JPSP53AC128224.18 JPSP53AD12827.11 JPSP53AE12828.29 JPSP53AF12825.95 JPSP53AG12821.71 JPSP53AH12826.15 JPSP53AI12822.26 JPSP68A4035.44 JPSP68B4033.46 JPSP68C4034.29 JPSP68D4034.32 JPSP68E6233.21 JPSP68F3634.91 JPSP68G3635.18 JPSP68H6531.73 JPSP68J4735.43 JPSP68K353.39 JPSP68L7933.65 JPSP68M7933.2 JPSP68N8233.33 JPSP68O8233.16 JPSP68P823.65 JPSP68Q4732.13 JPSP68R473.84 JPSP68S473.83 JPSP68T4732.91 JPSP68U4735.91 JPSP52A123315.90 PAGE 176 Appendix C SAS Code 166 JPSP52B12338.64 JPSP52C12334.63 JPSP52F123311.14 JPSP52K12337.73 JPSP52O123317.62 JPSP52S1233.79 JPSP52T1233.18 JPSP52U5131.06 JPSP52V513.17 JPSP52W5131.01 JPSP52X5132.04 JPSP52Y88327.54 JPSP52Z88322.68 JPSP52AG88325.62 JPSP52AK88312.95 JPSP52AO8836.58 JPSP52AP883.28 JPSP52AT8832.70 JPSP52AU5131.03 JPSP52AV513.68 
JPSP52AW513.07 JPSP52AX5131.41 JPSP52AY5134.98 JPSP52BC46310.79 JPSP52BD46310.50 JPSP52BK4633.16 JPSP52BO4638.64 JPSP52BS4632.13 JPSP52BT463.59 JPSP52BU48323.38 JPSP52BY4835.57 JPSP52CC483.35 JPSP52CD4836.38 JPSP52CH4836.18 JPSP52CI483.32 JPSP52CJ483.54 JPSP52CN4832.49 JPSP69A7622.50 JPSP69B7622.39 JPSP69C7625.23 JPSP69F7621.66 JPSP69G762.59 JPSP69H762.04 JPSP69M8721.20 JPSP69N872.60 JPSP69O87211.01 JPSP69U8722.2 JPSP69V872.01 JPSP58F254263.66 JPSP58G25532.69 JPSP58H25532.88 JPSP58I254222.92 JPSP55A54235.3 JPSP55B54269.94 JPSP55C5424.15 JPSP55D5425.84 PAGE 177 Appendix C SAS Code 167 JPSP55E5420.62 JPSP55F5428.3 JPSP55G88222.25 JPSP55H882146.73 JPSP55I86211.39 JPSP55J8624.68 JPSP55K8623.99 JPSP55L8624.86 JPSP55M8624.05 JPSP55N862.51 JPSP55O86211.39 JPSP60A7724.75 JPSP60B7724.29 JPSP60C37128.07 JPSP60D371269.89 JPSP60E350218.61 JPSP60F350213.04 ; title1'Eta2andCohenfconfidenceintervalsusingztransformation'; /*PROCFREQ; tablesjourn*article; procpriNt; varjournarticleeta 2fNk; procprint; varloweta2_99higheta2_99loweta2_95higheta2_95loweta2_90 higheta2_90widtheta2_99widtheta2_95widtheta2_90; procprint; varlowf_99highf_99lowf_95highf_95lowf_90highf_90widthf_90 widthf_95widthf_99; ruN;*/ PROCFREQ ; TABLESJOURN*ARTICLE; procprint ; varjournarticleanalysi sNkhigheta2_99loweta2_99eta2higheta2_95 loweta2_95eta2higheta2_90loweta2_90eta2; procprint ; varjournarticleanalysi sNkhighf_99lowf_99fhighf_95lowf_95f highf_90lowf_90f; prociml ; *++ Subroutineeta2_PCTL Calculatespercentilesfromthesamplingdistributionofrsquare usingtheinversionmethodofSteigerandFouladi(1997). 
Inputs are
SMPL_eta2 = obtained sample value of eta square
k         = number of regressor variables
N         = sample size
PCTL      = desired percentile from the sampling distribution
Output is
LASTetap2 = the population r square that provides SMPL_eta2 at the
            pctl percentile
*
++;
start eta2_PCTL(Smpl_eta2, k, N, pctl, lastetap2, OOPS);
*print 'Values within eta2_PCTL Subroutine';
eta2_tilde = Smpl_eta2/(1 - Smpl_eta2);
*Step 1: Find value of etap squared that is a little too high;
OOPS = 0;
OK = 0;
etap2 = 0;
loop = 0;
flag = 0;
flag1 = 0;
flag2 = 0;
do until (OK = 1);
  etap_tild = etap2/(1 - etap2);
  gamma2 = 1/(1 - etap2);
  phi_1 = (N - 1)*(gamma2 - 1) + k;
  phi_2 = (N - 1)*(gamma2##2 - 1) + k;
  phi_3 = (N - 1)*(gamma2##3 - 1) + k;
  G = (phi_2 - SQRT(phi_2##2 - (phi_1#phi_3)))/phi_1;
  v = (phi_2 - 2#etap_tild#sqrt(gamma2)#sqrt((N - 1)#(N - k - 1)))/G##2;
  nc = (etap_tild#sqrt(gamma2)#sqrt((N - 1)#(N - k - 1)))/G##2;
  obt_stat = (eta2_tilde#(N - k - 1))/(v#G);
  *++
  Be sure the computation is possible in SAS
  *++;
  little = FINV(.0000001, v, n - k - 1, nc);
  big = FINV(.99999, v, n - k - 1, nc);
  not_poss = 1;
  if (obt_stat > little & obt_stat .98) then do;
    flag = 1;
    OK = 1;
    cumprob = 1;
  end;
  IF not_poss = 0 then do;
    if cumprob
etap2 = .99;
flag = 1;
end;
loop = loop + 1;
if loop > 1500 then do;
  *print 'Looping too much!' loop eta2_tilde k v pctl cumprob ok nc etap2;
  OK = 1;
end;
END;
*print 'Estimating High' eta2_tilde k v pctl cumprob ok nc etap2 not_poss;
end;
high = etap2;
if flag = 1 then do;
  high = 1.00;
  flag1 = 1;
end;
*print 'End of High Loop:' high;
*print high;
*Step 2: Find value of etap squared that is a little too low;
OK = 0;
etap2 = .99;
flag = 0;
do until (OK = 1);
  etap_tild = etap2/(1 - etap2);
  gamma2 = 1/(1 - etap2);
  phi_1 = (N - 1)*(gamma2 - 1) + k;
  phi_2 = (N - 1)*(gamma2##2 - 1) + k;
  phi_3 = (N - 1)*(gamma2##3 - 1) + k;
  G = (phi_2 - SQRT(phi_2##2 - (phi_1#phi_3)))/phi_1;
  v = (phi_2 - 2#etap_tild#sqrt(gamma2)#sqrt((N - 1)#(N - k - 1)))/G##2;
  nc = (etap_tild#sqrt(gamma2)#sqrt((N - 1)#(N - k - 1)))/G##2;
  obt_stat = (eta2_tilde#(N - k - 1))/(v#G);
  *++
  Be sure the computation is possible in SAS
  *++;
  little = FINV(.0000001, v, n - k - 1, nc);
  big = FINV(.99999, v, n - k - 1, nc);
not_poss = 1;
if (obt_stat > little & obt_stat .01) then do;
  *print 'Prog is in this one!';
  etap2 = etap2 - .01;
  cumprob = 1;
end;
IF (not_poss = 1 & etap2 < .02) then do;
  *print 'Program is here!';
  flag = 1;
  OK = 1;
  cumprob = 1;
end;
IF not_poss = 0 then do;
  if cumprob > pctl then OK = 1;
  if cumprob
*print 'not possible =' not_poss;
if not_poss = 1 then do;
  change = 0;
  OOPS = 1;
  lastetap2 = 0;
end;
if not_poss = 0 then do;
  if cum_h
run eta2_PCTL(Smpl_eta2[i, 1], k[i, 1], N[i, 1], 0.95, eta2_95, oops95);
print_eta2 = Smpl_eta2[i, 1];
file print;
put @1 print_eta2 8.4 @10 eta2_995 8.4 @20 eta2_005 8.4 @30 eta2_975 8.4
    @40 eta2_025 8.4 @50 eta2_95 8.4 @60 eta2_05;
end;
*proc iml;
start find_NC(F_obt, u, v, ncc, pctl, f);
OK = 0;
nc = 0;
target = pctl;
loop = 0;
do until (OK = 1);
  cumprob = PROBF(F_obt, u, v, nc);
  if cumprob
read all var {f} into effect_vec;
k_total = nrow(effect_vec);
file print;
put // @1 'Confidence Intervals Around Sample Cohen f' //
    @16 '99% CI' @36 '95% CI' @56 '90% CI' /
    @2 'Effect' @10 '' @30 '' @50 '' /
    @3 'Size' @12 'Lower Upper' @32 'Lower Upper' @52 'Lower Upper' /
    @1 '' @10 '' @30 '' @50 '';
do i = 1 to k_total;
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_005, .005, f005);
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_995, .995, f995);
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_025, .025, f025);
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_975, .975, f975);
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_05, .05, f05);
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_95, .95, f95);
  print_effect = effect_vec[i, 1];
  file print;
  put @1 print_effect 8.4 @10 f995 8.4 @20 f005 8.4 @30 f975 8.4
      @40 f025 8.4 @50 f95 8.4 @60 f05 8.4;
end;
quit;

data one;
input journ $ article analysis $ N k r2;
*+++++++++++++++++++++++++++++++++++++++++++++
This calculates confidence intervals for the effect
size for regression analyses (f2) using
a log transformation (O&F3 Fisher Z). Two
approaches were used (testing, as we discussed).
Very different results.
Last edit: Aug 23
+++++++++++++++++++++++++++++++++++++++++++++;
fsquare = 0;
lowr2_90 = 0; lowr2_95 = 0; lowr2_99 = 0;
highr2_90 = 0; highr2_95 = 0; highr2_99 = 0;
widthr2_90 = 0; widthr2_95 = 0; widthr2_99 = 0;
lowf2_90 = 0; lowf2_95 = 0; lowf2_99 = 0;
highf2_90 = 0; highf2_95 = 0; highf2_99 = 0;
widthf2_90 = 0; widthf2_95 = 0; widthf2_99 = 0;
u = k;
v = N - k - 1;
F_obt = (r2/u)/((1 - r2)/v);
F2 = r2/(1 - r2); *I computed f2 here;
Smpl_R2 = r2;
*+++++++++++++++++++++++++++++++++++++++++++++
Calculating the upper and lower bounds of eta2
using O&F3 ... called loweta2_95 and higheta2_95.
Currently calculations are only done using the
95th percentile, pending resolution of method
+++++++++++++++++++++++++++++++++++++++++++++;
z = log((1 + sqrt(r2))/(1 - sqrt(r2)));
lowr2_95 = z - ((2*(1.96))/(sqrt(N)));
highr2_95 = z + ((2*(1.96))/(sqrt(N)));
low95 = exp(lowr2_95);
high95 = exp(highr2_95);
lowr2_95 = ((low95 - 1)/(low95 + 1))**2;
highr2_95 = ((high95 - 1)/(high95 + 1))**2;
if lowr2_95 < 0 then lowr2_95 = 0;
if highr2_95 > 1 then highr2_95 = 1;
widthr2_95 = highr2_95 - lowr2_95;
lowr2_99 = z - ((2*(2.576))/(sqrt(N)));
highr2_99 = z + ((2*(2.576))/(sqrt(N)));
low99 = exp(lowr2_99);
high99 = exp(highr2_99);
lowr2_99 = ((low99 - 1)/(low99 + 1))**2;
highr2_99 = ((high99 - 1)/(high99 + 1))**2;
if lowr2_99 < 0 then lowr2_99 = 0;
if highr2_99 > 1 then highr2_99 = 1;
widthr2_99 = highr2_99 - lowr2_99;
lowr2_90 = z - ((2*(1.645))/(sqrt(N)));
highr2_90 = z + ((2*(1.645))/(sqrt(N)));
low90 = exp(lowr2_90);
high90 = exp(highr2_90);
lowr2_90 = ((low90 - 1)/(low90 + 1))**2;
highr2_90 = ((high90 - 1)/(high90 + 1))**2;
if lowr2_90 < 0 then lowr2_90 = 0;
if highr2_90 > 1 then highr2_90 = 1;
widthr2_90 = highr2_90 - lowr2_90;
*+++++++++++++++++++++++++++++++++++++++++
This set of 2 CIs (called lowf2_95 and
highf2_95) are constructed by calculating
f2 for the lower r2 and upper r2 calculated
earlier ... this method is the one more
appropriate???
++++++++++++++++++++++++++++++++++++++++++; lowf2_95=lowr2_95/( 1 lowr2_95); highf2_95=highr2_95/( 1 highr2_95); widthf2_95=highf2_95lowf2_95; lowf2_99=lowr2_99/( 1 lowr2_99); highf2_99=highr2_99/( 1 highr2_99); widthf2_99=highf2_99lowf2_99; lowf2_90=lowr2_90/( 1 lowr2_90); highf2_90=highr2_90/( 1 highr2_90); widthf2_90=highf2_90lowf2_90; cards; PAGE 186 Appendix C SAS Code 176 JER7A484.80 JER7B483.75 JER7C484.74 JER7R391.61 JER7S391.60 JER7T393.71 RRQ51A891.13 RRQ51B891.28 RRQ51C891.17 RRQ51D891.35 RRQ51E891.34 RRQ51F891.30 RRQ51G891.31 RRQ51H891.39 RRQ51I891.88 RRQ51J891.47 RRQ51K891.52 RRQ51L891.08 RRQ51M891.08 RRQ51N891.45 RRQ51O891.49 RRQ51P891.51 RRQ51Q891.44 RRQ51R891.88 RRQ51S471.27 RRQ51T471.41 RRQ51U471.25 RRQ51V471.10 RRQ51W891.53 RRQ51X891.55 RRQ51Y891.17 RRQ51Z891.13 RRQ51AA471.06 RRQ51AB471.06 RRQ51AC471.29 RRQ51AD471.16 RRQ51AE891.37 RRQ51AF891.38 RRQ51AG891.18 RRQ51AH891.17 RRQ51AI891.26 RRQ51AJ891.13 RRQ51AK891.33 RRQ51AL891.19 RRQ78AG1951.29 RRQ78AH1951.10 RRQ78AI1951.08 RRQ78AJ1953.42 RRQ78AK1954.49 RRQ78AL1955.54 RRQ78AM1491.35 RRQ78AN1491.17 RRQ78AO1491.04 RRQ78AP1491.22 RRQ78AQ1493.45 RRQ78AR1494.48 PAGE 187 Appendix C SAS Code 177 RRQ32A607.55 JER2A23075.26 JER2B23075.18 JER2C6445.23 JER2D6445.23 JER2E23074.70 JER2F23074.56 JER2G6445.74 JER2H6445.62 JER2I23079.62 JER2J23079.69 JER2K23079.70 JER2L23079.71 JER2M23079.71 JER2N23079.71 JER2O23079.71 JER2P23079.71 JER2Q23079.71 JER2R23075.52 JER2S23075.54 JER2T23075.56 JER2U23075.57 JER2V23075.58 JER4x38565.26 JER4x38565.18 JER4x38565.23 JER4x38565.23 JER4x38565.70 JER4x38565.56 JER4x38565.74 JER4x38565.62 JPSP57A6383.75 JPSP57B6213.67 JPSP57C6493.77 JPSP57D6423.73 JPSP57E5993.70 JPSP57F6303.81 JPSP57G6303.78 JPSP57H6403.74 JPSP57I6243.76 JPSP57J6503.71 JPSP57K6433.64 JPSP57L6553.69 JPSP57M6333.69 JPSP57N6313.74 JPSP57O6402.24 JPSP57P6242.22 JPSP57Q6502.42 JPSP57R6502.42 JPSP57S6582.26 JPSP57T6332.31 JPSP57U6322.35 ; PAGE 188 Appendix C SAS Code 178 
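The regression data step above converts R-squared to f2 = R2/(1 - R2) and places the same log-transformation bands around R2 before converting the endpoints to f2. A minimal Python sketch of that arithmetic follows; the function name and the example N are hypothetical.

```python
import math

def r2_to_f2_ci(r2, N, zcrit=1.96):
    """f2 = R2/(1-R2) with endpoints taken from the z-transformation
    interval on R2, mirroring the SAS data step above."""
    z = math.log((1 + math.sqrt(r2)) / (1 - math.sqrt(r2)))
    lo_z = z - 2 * zcrit / math.sqrt(N)
    hi_z = z + 2 * zcrit / math.sqrt(N)
    # back-transform each z endpoint to the R2 scale, clamped to [0, 1]
    back = lambda b: ((math.exp(b) - 1) / (math.exp(b) + 1)) ** 2
    lo_r2 = max(back(lo_z), 0.0)
    hi_r2 = min(back(hi_z), 1.0)
    f2 = r2 / (1 - r2)
    return f2, lo_r2 / (1 - lo_r2), hi_r2 / (1 - hi_r2)
```

For example, R2 = .26 gives f2 = .26/.74, approximately .3514, which is the point estimate reported for study 2/1 in Appendix D.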
*the following card set is absent the large N with large R2 (middle 4)
 and will run complete, even with large R2, when there is small N;
*cards;
*JER438565.26 JER438565.18 JER438565.23 JER438565.23 JER7484.80 JER7483.75 JER7484.74 JER7391.61 JER7391.60 JER7393.71
;
proc freq;
tables journ article;
title1 'R2 and F2 Confidence Intervals using Z transformation';
/*proc print;
var r2 f2 N k;
proc print;
var lowr2_99 highr2_99 lowr2_95 highr2_95 lowr2_90 highr2_90 widthr2_99
    widthr2_95 widthr2_90;
proc print;
var lowf2_99 highf2_99 lowf2_95 highf2_95 lowf2_90 highf2_90
    widthf2_90 widthf2_95 widthf2_99;*/
proc print;
var journ article analysis N k highr2_99 lowr2_99 r2 highr2_95 lowr2_95
    r2 highr2_90 lowr2_90 r2;
proc print;
var journ article analysis N k highf2_99 lowf2_99 f2 highf2_95 lowf2_95
    f2 highf2_90 lowf2_90 f2;
run;

proc iml;
*++
Subroutine R2_PCTL
Calculates percentiles from the sampling distribution of r square
using the inversion method of Steiger and Fouladi (1997).
Inputs are
SMPL_R2 = obtained sample value of r square
k       = number of regressor variables
N       = sample size
PCTL    = desired percentile from the sampling distribution
Output is
LASTRHO2 = the population r square that provides SMPL_R2 at the
           pctl percentile
++;
start R2_PCTL(Smpl_R2, k, N, pctl, lastrho2, OOPS);
*print 'Values within R2_PCTL Subroutine';
R2_tilde = Smpl_R2/(1 - Smpl_R2);
*Step 1: Find value of rho squared that is a little too high;
OOPS = 0;
OK = 0;
rho2 = 0;
loop = 0;
flag = 0;
flag1 = 0;
flag2 = 0;
do until (OK = 1);
  rho_tild = rho2/(1 - rho2);
  gamma2 = 1/(1 - rho2);
  phi_1 = (N - 1)*(gamma2 - 1) + k;
  phi_2 = (N - 1)*(gamma2##2 - 1) + k;
  phi_3 = (N - 1)*(gamma2##3 - 1) + k;
  G = (phi_2 - SQRT(phi_2##2 - (phi_1#phi_3)))/phi_1;
  v = (phi_2 - 2#rho_tild#sqrt(gamma2)#sqrt((N - 1)#(N - k - 1)))/G##2;
  nc = (rho_tild#sqrt(gamma2)#sqrt((N - 1)#(N - k - 1)))/G##2;
  obt_stat = (R2_tilde#(N - k - 1))/(v#G);
  *++
  Be sure the computation is possible in SAS
  *++;
  little = FINV(.0000001, v, n - k - 1, nc);
  big = FINV(.99999, v, n - k - 1, nc);
  not_poss = 1;
  if (obt_stat > little & obt_stat .98) then do;
    flag = 1;
    OK = 1;
    cumprob = 1;
  end;
  IF not_poss = 0 then do;
    if cumprob
if loop > 1500 then do;
  *print 'Looping too much!' loop R2_tilde k v pctl cumprob ok nc rho2;
  OK = 1;
end;
END;
*print 'Estimating High' R2_tilde k v pctl cumprob ok nc rho2 not_poss;
end;
high = rho2;
if flag = 1 then do;
  high = 1.00;
  flag1 = 1;
end;
*print 'End of High Loop:' high;
*print high;
*Step 2: Find value of rho squared that is a little too low;
OK = 0;
rho2 = .99;
flag = 0;
do until (OK = 1);
  rho_tild = rho2/(1 - rho2);
  gamma2 = 1/(1 - rho2);
  phi_1 = (N - 1)*(gamma2 - 1) + k;
  phi_2 = (N - 1)*(gamma2##2 - 1) + k;
  phi_3 = (N - 1)*(gamma2##3 - 1) + k;
  G = (phi_2 - SQRT(phi_2##2 - (phi_1#phi_3)))/phi_1;
  v = (phi_2 - 2#rho_tild#sqrt(gamma2)#sqrt((N - 1)#(N - k - 1)))/G##2;
  nc = (rho_tild#sqrt(gamma2)#sqrt((N - 1)#(N - k - 1)))/G##2;
  obt_stat = (R2_tilde#(N - k - 1))/(v#G);
  *++
  Be sure the computation is possible in SAS
  *++;
  little = FINV(.0000001, v, n - k - 1, nc);
  big = FINV(.99999, v, n - k - 1, nc);
  not_poss = 1;
  if (obt_stat > little & obt_stat .01) then do;
    *print 'Prog is in this one!';
    rho2 = rho2 - .01;
    cumprob = 1;
  end;
  IF (not_poss = 1 & rho2 < .02) then do;
    *print 'Program is here!';
    flag = 1;
    OK = 1;
    cumprob = 1;
  end;
IF not_poss = 0 then do;
  if cumprob > pctl then OK = 1;
  if cumprob
file print;
put @1 print_r2 8.4 @10 r2_995 8.4 @20 r2_005 8.4 @30 r2_975 8.4
    @40 r2_025 8.4 @50 r2_95 8.4 @60 r2_05;
end;

proc iml;
start find_NC(F_obt, u, v, ncc, pctl, f2); *I added f2 to the arguments here;
OK = 0;
nc = 0;
target = pctl;
loop = 0; *I initialized loop here;
do until (OK = 1);
  cumprob = PROBF(F_obt, u, v, nc);
  if cumprob
read all var {F2} into effect_vec; *I added this statement to create effect_vec;
k_total = nrow(effect_vec);
file print;
put @1 'Confidence Intervals Around Sample f2' //
    @16 '99% CI' @36 '95% CI' @56 '90% CI' /
    @2 'Effect' @10 '' @30 '' @50 '' /
    @3 'Size' @12 'Lower Upper' @32 'Lower Upper' @52 'Lower Upper' /
    @1 '' @10 '' @30 '' @50 '';
do i = 1 to k_total;
  *obs_stat = effect_vec[i,1]#sqrt(n1[i,1]#n2[i,1]/(n1[i,1]+n2[i,1]));
  *obt_F = (r2[i,1]/u[i,1])/((1 - r2[i,1])/v[i,1]);
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_005, .005
, f2005);
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_995, .995, f2995);
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_025, .025, f2025);
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_975, .975, f2975);
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_05, .05, f205);
  run find_NC(F_obt[i, 1], u[i, 1], v[i, 1], nc_95, .95, f295);
  print_effect = effect_vec[i, 1]; *Don't think this belongs as is
   (relative to previous computation), but didn't want to lose the thought;
  file print;
  put @1 print_effect 8.4 @10 f2995 8.4 @20 f2005 8.4 @30 f2975 8.4
      @40 f2025 8.4 @50 f295 8.4 @60 f205 8.4;
end;
quit;

Appendix D
Summary of Analyses and Associated Statistics with Decision Made About Results

The statistical significance notation used reflects how the original author reported it (format, amount of information included, etc.). Also, the wording in the Findings/Results column consists of exact quotes; any information added or deleted for purposes of clarification is in parentheses and italicized.

Study/Analysis 2/1
Issue addressed: Do variables that can be controlled by school systems (e.g., average class size, teacher experience, pupil-teacher ratio, teacher salary, and expenditure per pupil) predict academic achievement?
Statistical significance reported: R2 = .26, p < .001 (reading); R2 = .18, p < .001 (math)
Findings/Results: According to the model F statistics, both multiple regressions (reading and math) were statistically significant in accounting for variance in third-grade reading and mathematics scores. However, the model R2 for the two models was relatively small, with R2 values of .26 and .18 respectively.
Effect size: Cohen f2 = .3514 (reading); Cohen f2 = .2195 (math). Decision: (2) Slight Change Needed
CIs for effect size: .2979 < f2 < .3514 (reading); .1796 < f2 < .2642 (math). Decision: (3) Much Change Needed

Study/Analysis 2/2
Issue addressed: Do variables that cannot be controlled by school systems (e.g., percentage White, low income, attendance, mobility) predict academic achievement?
Statistical significance reported: R2 = .70, p < .001 (reading); R2 = .56, p < .001 (math)
Findings/Results: In contrast to the low model R2 values obtained for the can control regression models, the R2 values obtained for the cannot control regression models were considerably higher. We therefore concluded that the cannot control models accounted more accurately for variance in Grade 3 achievement scores than did the can control models.
Effect size: Cohen f2 = 2.3333 (reading); Cohen f2 = 1.2727 (math). Decision: (1) No Change Needed
CIs for effect size: 2.1149 < f2 < 2.5706 (reading); 1.1397 < f2 < 1.4176 (math). Decision: (1) No Change Needed

Study/Analysis 4/1
Issue addressed: Does the use of video as an accommodation on a math test avoid the impact of reading ability on math-test performance for all students?
Statistical significance reported: t value not reported; p = .08
Findings/Results: Students taking the video version of the test scored slightly higher than those taking the standard version, although that difference was not statistically significant. As our results indicate, accommodations are unnecessary for the majority of students.
Effect size: Cohen d = .0753. Decision: (1) No Change Needed
CIs for effect size: -.1012 < d < .2517. Decision: (2) Slight Change Needed

Study/Analysis 4/2
Issue: Does the use of video as an accommodation on a math test (to avoid the impact of reading ability on math performance) affect performance for students with low math ability?
Statistical significance reported: t value not reported; p = .05
Findings/Results: Of the subgroups examined, only the low mathematics group showed a preference that reached significance.
Effect size: Cohen d = .1537. Decision: (2) Slight Change Needed
CIs for effect size: -.1265 < d < .4344. Decision: (3) Much Change Needed

Study/Analysis 7/1
Issue: Do students with different time preferences (morning or afternoon) perform differently if they take a test in the morning?
Statistical significance reported: F(1,64) = 5.44, p < .05
Findings/Results: There was a significant difference between afternoon-preferenced students and morning-preferenced students taking the test in the morning. The results indicate clearly that the time-of-day element in learning style may play a significant part in the instructional environment. When time preference and testing environment were matched, significant differences emerged between test results, but only for the morning test.
Effect size: Cohen f = .2849. Decision: (3) Much Change Needed
CIs for effect size: .0024 < f < .5283. Decision: (4) Complete Revision Needed

Study/Analysis 7/2
Issue: Do students with different time preferences (morning or afternoon) perform differently if they take a test in the afternoon?
Statistical significance reported: F(1,64) = 3.81, p < .055
Findings/Results: There was a small difference between afternoon-preferenced students and morning-preferenced students taking the test in the afternoon.
The results indicate clearly that the time-of-day element in learning style may play a significant part in the instructional environment. When time preference and testing environment were matched, significant differences emerged between test results, but only for the morning test.
Effect size: Cohen f = .2385. Decision: (2) Slight Change Needed
CIs for effect size: .0000 < f < .4805. Decision: (4) Complete Revision Needed

Study/Analysis 11/1
Issue: Do children differ in their explicit and implicit comprehension abilities?
Statistical significance reported: F(1,155) = 15.19, p < .001
Findings/Results: The explicit comprehension subscore was significantly higher than the implicit comprehension subscore.
Effect size: Cohen f = .3101. Decision: (1) No Change Needed
CIs for effect size: .1499 < f < .4778. Decision: (3) Much Change Needed

Study/Analysis 11/2
Issue: Do children differ in their overall ability to comprehend narrative based on their grade level (K-2)?
Statistical significance reported: F(2,88) = 7.18, p < .001; F(2,87) = 9.17, p < .001; F(2,88) = 9.47, p < .001
Findings/Results: Older children received significantly more points than younger children on total prompted comprehension for all three task versions.
Effect size: Cohen f = .3994, .4514, .44562. Decision: (2) Slight Change Needed
CIs for effect size: .1839 < f < .6321; .2328 < f < .6894; .2385 < f < .6933. Decision: (2) Slight Change Needed

Study/Analysis 11/4
Issue: Does the ability of children to retell a story differ among students in grades K-2?
Statistical significance reported: F(1,84) = 5.9, p < .05
Findings/Results: Older students were significantly more likely to provide retellings with appropriate sequencing of events. Retelling (and prompted comprehension scores) improved significantly, indicating that the NC task differentiates between children who can recall main narrative elements from children who have weakness with this narrative comprehension skill.
Effect size: Cohen f = .3236. Decision: (2) Slight Change Needed
CIs for effect size: .1058 < f < .5561. Decision: (3) Much Change Needed

Study/Analysis 12/1
Issue: Does negotiation of meaning (allowing students to discuss meanings of words prior to taking individual assessments) impact performance?
Statistical significance reported: F = 124.81, df = 3, p < .001
Findings/Results: An analysis of variance with repeated measures showed a statistically significant main effect for condition.
Effect size: Cohen f = 1.5747. Decision: (1) No Change Needed
CIs for effect size: 1.2960 < f < 1.8936. Decision: (1) No Change Needed

Study/Analysis 12/2
Issue: Does the level of language ability impact the effectiveness of using negotiation of meaning for students measured on comprehension?
Statistical significance reported: F = 1.73, df = 9, p = .079
Findings/Results: The interaction of condition by level of language proficiency was not significant.
Effect size: Cohen f = .3350. Decision: (3) Much Change Needed
CIs for effect size: .1896 < f < .5395. Decision: (4) Total Revision Needed

Study/Analysis 15/1
Issue: Is there a difference in the way students learning a foreign language use different types of clues: specifically contextual clues, (for learning Japanese) kanji measures, or an integration of the two methods?
Statistical significance reported: F(2,116) = 31.6, p < .0001
Findings/Results: A one-way analysis of variance indicates a statistically significant effect of condition on students' choice of integrated answers.
Effect size: Cohen f = .7218. Decision: (2) Slight Change Needed
CIs for effect size: .5190 < f < .9686. Decision: (2) Slight Change Needed

Study/Analysis 17/1
Issue: Is a child's receptive vocabulary at ages 4 and 7 different, depending on ethnic background?
Statistical significance reported: R2 = .47, p < .01 (Age 4); R2 = .52, p < .01 (Age 7)
Findings/Results: Children's receptive vocabulary at ages 4 and 7 also differs strongly between groups.
Effect size: Cohen f2 = .89 (Age 4), 1.08 (Age 7). Decision: (1) No Change Needed
CIs for effect size: .4552 < f2 < .8868 (Age 4); .5796 < f2 < 1.8644 (Age 7). Decision: (2) Slight Change Needed

Study/Analysis 18/1
Issue: Do children who receive different types of praise (ability, effort, or none) differ in what they attribute their performance on performance measures to (effort or intelligence)?
Statistical significance reported: Two main effects. Effect of low effort on performance: F(2,120) = 8.64, p < .001. Effect of low intelligence on performance: F(2,120) = 4.63, p < .05
Findings/Results: Children differed in their endorsements of low effort and low ability as causes of their failure. Overall, the findings (of the study) support our hypothesis that children who are praised for intelligence when they succeed are the ones least likely to attribute their performance to low effort, a factor over which they have some amount of control.
Effect size: Cohen f = .3748, .2744. Decision: (3) Much Change Needed
CIs for effect size: .1750 < f < .5482; .0621 < f < .4423. Decision: (4) Complete Revision Needed

Study/Analysis 18/2
Issue: Do children who receive different types of praise (ability, effort, or none) differ in how they rate their enjoyment of tasks?
Statistical significance reported: F(2,129) = 7.73, p < .005; Ability vs. Effort: t(81) = 3.81, p < .001; Ability vs. Control: t(83) = 2.03, p < .05; Control vs. Effort: t(82) = 2.16, p < .05
Findings/Results: Children praised for intelligence enjoyed the tasks less than did children praised for effort; again, children in the control condition fell in between the other two groups. Children praised for intelligence were significantly less likely to enjoy the problems than were children in the effort and control conditions. Further, children in the control condition were less likely to enjoy the problems than those praised for effort.
Indictment of ability also led children praised for intelligence to display more negative responses in terms of lower levels of task enjoyment than their counterparts, who received commendations for effort.
Effect size: Cohen f = .3545; Cohen d = -.8816, -.4495, -.4801. Decision: (3) Much Change Needed
CIs for effect size: .1358 < f < .5269; -1.3495 < d < -.4136; -.8814 < d < -.0175; -.9158 < d < -.0043. Decision: (4) Complete Revision Needed

Study/Analysis 18/3
Issue: Do children who receive different types of praise (ability, effort, or none) differ regarding their future expectations of their performance?
Statistical significance reported: F(2,48) = 1.01, ns
Findings/Results: No significant differences were noted for children's expectations; children in the intelligence, effort and control conditions displayed equivalent expectations.
Effect size: Cohen f = .1990. Decision: (3) Much Change Needed
CIs for effect size: .0000 < f < .4419. Decision: (4) Complete Revision Needed

Study/Analysis 18/4
Issue: Do children who receive different types of praise (ability, effort, or none) differ in how harshly they judge their performance?
Statistical significance reported: F(2,48) = 2.04, ns
Findings/Results: No significant differences were noted for children's judgement of their performance; children in the intelligence, effort and control conditions displayed equivalent expectations. These results indicate that effort, praise and intelligence do not lead children to judge their performance differently.
Effect size: Cohen f = .2828. Decision: (4) Complete Revision Needed
CIs for effect size: .0000 < f < .5366. Decision: (4) Complete Revision Needed

Study/Analysis 18/5
Issue: Do children who receive different types of praise (ability, effort, or none) differ regarding persistence?
Statistical significance reported: F(2,45) = 3.16, p = .05; Ability vs. Effort: t(30) = 2.09, p < .05; Ability vs. Control: t(30) = 2.22, p < .05; Control vs. Effort: t(30) = .12, ns
Findings/Results: Children praised for intelligence were less likely to want to persist on the problems after setbacks than were children praised for effort; children in the control condition closely resembled those in the effort conditions. Follow-up t-tests revealed significant differences between the intelligence condition and the effort and control conditions but no difference between the effort and control conditions. Indictment of ability also led children praised for intelligence to display more negative responses in terms of lower levels of task persistence than their counterparts, who received commendations for effort.
Effect size: Cohen f = .3707; Cohen d = -.7332, -.7777, .0412. Decision: (2) Slight Change Needed
CIs for effect size: .0000 < f < .6462; -1.4609 < d < -.0055; -1.0582 < d < -.0472; -.6746 < d < .7570. Decision: (4) Complete Revision Needed

Study/Analysis 19/1
Issue: Is there a difference in confidence between individuals classified as high or low in dogmatism?
Statistical significance reported: F(1,61), p < .01
Findings/Results: Individuals high in dogmatism were much more confident in their judgments than individuals low in dogmatism.
Effect size: Cohen f = .2905. Decision: (2) Slight Change Needed
CIs for effect size: .0500 < f < .5236. Decision: (3) Much Change Needed

Study/Analysis 19/2
Issue: Are there differences in the types of reasons provided for outcomes that support an individual's opinion (pro decisions) as compared to the reasons that oppose an individual's opinion (con decisions), resulting from how dogmatic an individual is?
Statistical significance reported: F(1,61), p < .01
Findings/Results: Individuals high in dogmatism produced more pro reasons than individuals low in dogmatism. Also they produce fewer con reasons than individuals low in dogmatism. The results show that individuals high in dogmatism are more likely to generate cognitions supporting their newly created beliefs and are less likely to generate cognitions contradicting them.
Effect size: Cohen f = .4049. Decision: (1) No Change Needed
CIs for effect size: .1462 < f < .6605. Decision: (2) Slight Change Needed

Study/Analysis 20/1
Issue: Does gender stereotyping impact women's predictions of their own performance, regardless of their ability as evidenced by previous performance?
Statistical significance reported: F(1,52) = 4.15, p < .05
Findings/Results: On ratings of estimated performance on a stereotypical task, the effects of initial confidence were completely undermined.
Effect size: Cohen f = .3920. Decision: (3) Much Change Needed
CIs for effect size: .1162 < f < .6060. Decision: (4) Complete Revision Needed

Study/Analysis 21/1
Issue: Would pre-assessment belief about whether a test outcome predicts weakness or excellence impact actual performance by women on a math test?
Statistical significance reported: F(1,122) = 3.97, p < .05
Findings/Results: Women who believed that the test would indicate whether they were especially weak in math performed less well than did women who believed the test would indicate whether they were exceptionally strong.
Effect size: Cohen f = .1789. Decision: (2) Slight Change Needed
CIs for effect size: .00198 < f < .3144. Decision: (4) Complete Revision Needed

Study/Analysis 21/2
Issue: Would pre-assessment belief about whether a test outcome predicts weakness or excellence impact actual performance by men on a math test?
Statistical significance reported: F(1,122) = 7.01, p < .01
Findings/Results: Men performed less well when they believed the test might indicate whether they were exceptionally strong.
Effect size: Cohen f = .2378. Decision: (1) No Change Needed
CIs for effect size: .05960 < f < .4233. Decision: (2) Slight Change Needed

Study/Analysis 24/1
Issue: Does the belief that selection for a leadership role is based on merit or gender bias impact performance?
Statistical significance reported: F(1,75) = 4.75, p < .04
Findings/Results: As predicted, participants in the gender-only condition performed worse than participants in the control and gender + merit conditions. The data (from this study) were conceptually consistent with prior research in demonstrating that the belief that one has been selected for a task on the basis of gender alone.
Effect size: Cohen f = .2484. Decision: (2) Slight Change Needed
CIs for effect size: .0225 < f < .4867. Decision: (3) Much Change Needed

Study/Analysis 24/2
Issue: Does membership in a stigmatized race/ethnicity (African American or Latino), as compared to a non-stigmatized race/ethnicity, impact the degree to which one suspects preferential treatment for admission into college?
Statistical significance reported: F(1,369) = 69.89, p < .001
Findings/Results: When we compared stigmatized and nonstigmatized students in the degree to which they suspected that their race or ethnicity might have helped them gain admission to college, we also found a significant difference, as expected.
Stigmatized students suspected that their admission to the University of Texas at Austin had been influenced by their race or ethnicity to a greater extent than did nonstigmatized students.
Effect size: f = .4043. Decision: (1) No Change Needed
CIs for effect size: .3252 < f < .5470. Decision: (2) Slight Change Needed

Study/Analysis 24/3
Issue: Does membership in a stigmatized race/ethnicity (African American or Latino), as compared to a non-stigmatized race/ethnicity, impact the degree to which students possess academic self-confidence?
Statistical significance reported: F(1,348) = 18.61, p < .001
Findings/Results: Stigmatized and nonstigmatized participants differed in academic self-confidence.
Effect size: f = .2306. Decision: (2) Slight Change Needed
CIs for effect size: .1241 < f < .3396. Decision: (3) Much Change Needed

Study/Analysis 24/4
Issue: Does membership in a stigmatized race/ethnicity (African American or Latino), as compared to a non-stigmatized race/ethnicity, impact the degree to which students are certain about their own self-confidence?
Statistical significance reported: F(1,348) = 18.61, p < .001
Findings/Results: Related, stigmatized students in our sample were significantly lower than nonstigmatized students in the certainty of their self-confidence ratings.
Effect size: f = .1930. Decision: (2) Slight Change Needed
CIs for effect size: .0872 < f < .3010. Decision: (4) Complete Revision Needed

Study/Analysis 26/1
Issue: Does the anticipation or prediction of future loneliness impact perseverance on tasks?
Statistical significance reported: F(2,37) = 3.46, p < .05
Findings/Results: This analysis again showed significant variation among the three conditions. Participants in the future alone condition attempted the fewest problems.
Again, the deficit was specific to feedback about social exclusion, insofar as participants in the misfortune control condition attempted as many problems (if not more) than the people in the future belonging condition. The decline in performance reflected both a higher rate of errors and reduced number of problems attempted. A diagnostic forecast of future social exclusion caused a significant drop in intelligent performance.
Effect size: Cohen f = .4159. Decision: (3) Much Change Needed
CIs for effect size: .0000 < f < .7149. Decision: (4) Complete Revision Needed

Study/Analysis 26/2
Issue: Does the anticipation or prediction of future loneliness impact cognitive abilities?
Statistical significance reported: F(2,37), p < .01
Findings/Results: Hearing that one was likely to be alone later in life affected performance on a timed cognitive test. A diagnostic forecast of future social exclusion caused a significant drop in intelligent performance.
Effect size: Cohen f = .5215. Decision: (2) Slight Change Needed
CIs for effect size: .1372 < f < .8318. Decision: (4) Complete Revision Needed

Study/Analysis 27/1
Issue: Does culture (European American or Asian American) impact performance on a problem-solving exam?
Statistical significance reported: F(1,74) = 2.50, ns
Findings/Results: The test revealed that there was no main effects of culture on the number of answers reported correctly.
Effect size: Cohen f = .1837. Decision: (2) Slight Change Needed
CIs for effect size: .0445 < f < .4164. Decision: (3) Much Change Needed

Study/Analysis 27/2
Issue: Does performance by East Asian Americans differ when they work under think-aloud conditions or silent conditions?
Statistical significance reported: t(32) = 2.67, p < .05
Findings/Results: East Asian American participants' performance was worse when they had to think aloud than when they were not thinking aloud. The results support the hypothesis that talking would interfere with East Asian American participants' performance.
Effect size: Cohen d = .9134. Decision: (1) No Change Needed
CIs for effect size: .2069 < d < 1.6199. Decision: (2) Slight Change Needed

Study/Analysis 27/3
Issue: Does performance by European Americans differ when they work under think-aloud conditions or silent conditions?
Statistical significance reported: t(39) = .40, ns
Findings/Results: European American participants' performance, however, did not differ whether they were thinking aloud or not. The results support the hypothesis that talking would not interfere with European American participants' cognitive performance.
Effect size: Cohen d = .1258. Decision: (2) Slight Change Needed
CIs for effect size: -.4872 < d < .7388. Decision: (4) Complete Revision Needed

Study/Analysis 28/1
Issue: Does praise on homework impact the amount of time spent on homework?
Statistical significance reported: t(59) = 9.788, p < .001
Findings/Results: Results revealed that students studied significantly more outside of the classroom when exposed to the verbal praise treatment than when exposed to the no verbal praise treatment. Although the results of this study may not generalize to all college student populations, they demonstrate the profound impact of properly administered verbal praise on college students' motivation to engage in homework.
Effect size: Cohen d = 2.4881. Decision: (1) No Change Needed
CIs for effect size: 1.8196 < d < 3.1566. Decision: (1) No Revision Needed

Study/Analysis 28/2
Issue: Does praise on homework given throughout the course impact performance on the end-of-course assessment?
Statistical significance reported: t(59) = 1.929, p > .05, ns
Findings/Results: Although the difference was not statistically significant (on the end of course exam), the direction of the means suggested that the students exposed to verbal praise not only studied more for each lesson but also achieved more than those not exposed to verbal praise. In addition, my findings suggest that students who experience verbal praise for doing homework perform somewhat better on an instructor-created, criterion-referenced final examination than those who experience no verbal praise for their homework habits.
Effect size: Cohen d = .4800. Decision: (3) Much Change Needed
CIs for effect size: -.0292 < d < .9891. Decision: (4) Complete Revision Needed

Study/Analysis 30/1
Issue: Do preservice teachers who have different supervision experiences have different attitudes toward their experience upon completion?
Statistical significance reported: t(30) = .67, p > .51
Findings/Results: We did not find statistical significance for the overall rating. Evidence presented here indicates that peer coaching is a feasible vehicle for instituting collaborative efforts; therefore, peer coaching warrants consideration as a potentially serviceable solution for strengthening field-based training of prospective teachers.
Effect size: Cohen d = -.7929. Decision: (3) Much Change Needed
CIs for effect size: -1.3018 < d < -.2840. Decision: (4) Complete Revision Needed

Study/Analysis 30/2
Issue: Do preservice teachers who have different supervision experiences demonstrate differences in clarity skills?
Statistical significance reported: t(30) = 41.66, p < .001
Findings/Results: Post treatment results showed statistically significant differences in favor of the experimental group for overall demonstration of clarity skills. Evidence presented here indicates that peer coaching is a feasible vehicle for instituting collaborative efforts; therefore, peer coaching warrants consideration as a potentially serviceable solution for strengthening field-based training of prospective teachers.
Effect size: Cohen d = .8068. Decision: (2) Slight Change Needed
CIs for effect size: .5213 < d < 1.0874. Decision: (3) Much Change Needed

Study/Analysis 32/1
Issue: Is a family intervention program effective in helping children gain vocabulary skills?
Statistical significance reported: F(1,247) = 32.08, p < .001
Findings/Results: When examining the effect of the interaction of group affiliation with time using repeated measures ANOVA, we found that Project EASE participants made statistically significantly greater gains than the control group on Vocabulary. It appeared from the posttest measures on the CAP vocabulary subtests that those students who participated in the intervention were better able to recall more superordinate terms, which in turn have been shown to relate to the reading skills of elementary-aged children.
Because vocabulary knowledge, story comprehension, and story sequencing are precisely the language skills that relate most strongly to literacy accomplishments, the improvement on these measures strongly confirms the relevance of the intervention to improved reading outcomes.
Effect size: Cohen f = .3597. Decision: (3) Much Change Needed
CIs for effect size: .2309 < f < .4878. Decision: (3) Much Change Needed

Study/Analysis 32/2
Issue: Is a family intervention program effective in helping children gain sound awareness skills?
Statistical significance reported: F(1,247) = 7.45, p < .01
Findings/Results: When examining the effect of the interaction of group affiliation with time using repeated measures ANOVA, we found that Project EASE participants made statistically significantly greater gains than the control group on Sound Awareness.
Effect size: Cohen f = .1733. Decision: (3) Much Change Needed
CIs for effect size: .2309 < f < .4878. Decision: (3) Complete Revision Needed

Study/Analysis 32/3
Issue: Is a family intervention program effective in helping children gain story comprehension skills?
Statistical significance reported: F(1,227) = 6.85, p < .01
Findings/Results: When examining the effect of the interaction of group affiliation with time using repeated measures ANOVA, we found that Project EASE participants made statistically significantly greater gains than the control group on Story Comprehension. The impact of participation in Project EASE on children's language scores is striking.
Because vocabulary knowledge, story comprehension, and story sequencing are precisely the language skills that relate most strongly to literacy accomplishments, the improvement on these measures strongly confirms the relevance of the intervention to improved reading outcomes.
Effect size: Cohen f = .1874. Decision: (4) Complete Revision Needed
CIs for effect size: .0448 < f < .3288. Decision: (4) Complete Revision Needed

Study/Analysis 32/4
Issue: Is a family intervention program effective in helping children gain language skills?
Statistical significance reported: F(1,246) = 35.46, p < .001
Findings/Results: Although all the children in the sample showed statistically significant gains in all three literacy composites over time, we were able to attribute a statistically significant gain in Language Skills to the Project EASE intervention. The impact of participation in Project EASE on children's language scores is striking.
Effect size: Cohen f = .3789. Decision: (3) Much Revision Needed
CIs for effect size: .2494 < f < .5077. Decision: (4) Complete Revision Needed

Study/Analysis 33/1
Issue: Does level of participation in a tutoring program impact student achievement in overall reading level?
Statistical significance reported: F(1,76) = 4.72, p = .03
Findings/Results: There was a statistically significant treatment effect. Overall, high-level treatment children outperformed low-level treatment children in instructional reading level.
Effect size: Cohen f = .2385. Decision: (2) Slight Change Needed
CIs for effect size: .0211 < f < .4669. Decision: (3) Much Change Needed

Study/Analysis 33/2
Issue: Does level of participation in a tutoring program impact student achievement in reading words in isolation?
Statistical significance reported: F(1,71) = 5.09, p = .03
Findings/Results: There was a treatment effect for reading words in isolation. On average, for reading words in isolation, those who received longer treatment had higher word reading abilities overall.
Effect size: Cohen f = .2476. Decision: (1) No Change Needed
CIs for effect size: .0300 < f < .4767. Decision: (3) Much Change Needed

Appendix E: IRB Exemption

About the Author

Melinda Hess received her Bachelor of Science degree in Electrical Engineering from USF in 1986 while on an Air Force Reserve Officer Training Corps scholarship. Commissioned in May 1986, she served honorably in the Air Force for 11 years, in locations worldwide. She earned her master's degree in Management from Webster University in 1990. Melinda entered education in 1997, teaching mathematics and beginning her doctoral studies. Additional experiences include consulting for a local school district and co-teaching graduate-level educational research courses. She has presented her research at professional and technical conferences at the state, regional, and national levels and has been nominated three times for the Florida Educational Research Association's Distinguished Paper Award, winning the award once. Additionally, she has interned with the Educational Testing Service, edited the Florida Journal of Educational Research, and served as President of the USF Graduate Research Association for Professional Enhancement.