xml version 1.0 encoding UTF-8 standalone no
record xmlns http:www.loc.govMARC21slim xmlns:xsi http:www.w3.org2001XMLSchema-instance xsi:schemaLocation http:www.loc.govstandardsmarcxmlschemaMARC21slim.xsd
leader nam 22 Ka 4500
controlfield tag 007 cr-bnu---uuuuu
008 s2009 flu s 000 0 eng d
datafield ind1 8 ind2 024
subfield code a E14-SFE0003263
035
(OCoLC)
040
FHM
c FHM
049
FHMM
090
XX9999 (Online)
1 100
Toyinbo, Peter.
0 245
Additive latent variable (alv) modeling :
b assessing variation in intervention impact in randomized field trials
h [electronic resource] /
by Peter Toyinbo.
260
[Tampa, Fla] :
University of South Florida,
2009.
500
Title from PDF of title page.
Document formatted into pages; contains X pages.
Includes vita.
502
Dissertation (Ph.D.)--University of South Florida, 2009.
504
Includes bibliographical references.
516
Text (Electronic dissertation) in PDF format.
538
Mode of access: World Wide Web.
System requirements: World Wide Web browser and PDF reader.
3 520
ABSTRACT: In order to personalize or tailor treatments to maximize impact among different subgroups, there is need to model not only the main effects of intervention but also the variation in intervention impact by baseline individual level risk characteristics. To this end a suitable statistical model will allow researchers to answer a major research question: who benefits or is harmed by this intervention program? Commonly in social and psychological research, the baseline risk may be unobservable and have to be estimated from observed indicators that are measured with errors; also it may have nonlinear relationship with the outcome. Most of the existing nonlinear structural equation models (SEM's) developed to address such problems employ polynomial or fully parametric nonlinear functions to define the structural equations. These methods are limited because they require functional forms to be specified beforehand and even if the models include higher order polynomials there may be problems when the focus of interest relates to the function over its whole domain. To develop a more flexible statistical modeling technique for assessing complex relationships between a proximal/distal outcome and 1) baseline characteristics measured with errors, and 2) baseline-treatment interaction; such that the shapes of these relationships are data driven and there is no need for the shapes to be determined a priori. In the ALV model structure the nonlinear components of the regression equations are represented as generalized additive model (GAM), or generalized additive mixed-effects model (GAMM). Replication study results show that the ALV model estimates of underlying relationships in the data are sufficiently close to the true pattern. The ALV modeling technique allows researchers to assess how an intervention affects individuals differently as a function of baseline risk that is itself measured with error, and uncover complex relationships in the data that might otherwise be missed. Although the ALV approach is computationally intensive, it relieves its users from the need to decide functional forms before the model is run. It can be extended to examine complex nonlinearity between growth factors and distal outcomes in a longitudinal study.
590
Advisor: C. Hendricks Brown, Ph.D.
653
Generalized additive model
Em algorithm
Monte carlo em
Markov chain monte carlo
Metropolis-hastings algorithm
Nonlinear structural equation model
690
Dissertations, Academic
z USF
x Public Health
Doctoral.
773
t USF Electronic Theses and Dissertations.
4 856
u http://digital.lib.usf.edu/?e14.3263
PAGE 2
DEDICATION To my family. Grace, Funbi, Femi and Tomi.
PAGE 3
ACKNOWLEDGMENTS This work has been shaped by one of the research goals of the Prevention Science and Methodology Group (PSMG) directed by C. Hendricks Brown, Ph.D. It is supported by a training grant (Preventive Research Training in Mental Health / 5T32MH018834-19 funded by NIMH, PI Nicholas Ialongo (JHSPH); and also by a Diversity Supplement 3R01DA019984-02S1 from NIDA, Parent grant 1R01DA019984-01, PI Sheppard Kellam (American Institutes for Research). The author is indebted to his dissertation committ ee members for their valuable supervision, to his fellow colleagues at the U.S.F. College of P ublic Health for the helpful discussions at the Biostatistical Forums, and to Zilli Sloboda, Sc.D. for allowing the data for illustration to be provided through The University of Akrons In stitute for Health and Social Policy (IHSP).
PAGE 4
TABLE OF CONTENTS LIST OF TABLES iii LIST OF FIGURES iv ABSTRACT v 1. INTRODUCTION 1 1.1 Variation in Intervention Impact across Individual Level Baseline Characteristics 1 1.2 The Concept of Additive Latent Variable (ALV) Modeling 4 1.3 Proposing a New Methodology: Additive Latent Variable Model 6 2. FRAMEWORK FOR ALV MODEL FORMULATION 10 2.1 General SEM framework 10 2.2 Basic formulation of ALV model 14 2.3 ML Estimation of ALV Model via the EM Algorithm 16 2.3.1 The Expectation step of EM 19 2.3.2 The Maximization step of EM 27 2.3.3 Standard Errors of Estimates 29 3. ADDITIVE MODEL COMPONENT OF ALV MODEL 32 3.1 General Introduction 32 3.2 The Additive Model 32 3.3 Baseline-Treatment Interactions using GAM 35 3.4 The Best Smoothing Function 38 3.5 Cubic Splines 42 3.5.1 Representation of Natural Cubic Splines 45 3.5.2 Penalized Regression Cubic Splines 47 3.5.3 Estimation in Penalized Regression Splines 49 3.6 Goodness of Fit and Model Comparison 51 4. COMPUTATIONAL DEVELOPMENT OF ALV MODEL 54 4.1 Computation Steps 54 4.2 Criteria for Convergence of ALV Model 60 4.2.1 Convergence of MCMC 61 4.2.2 Convergence of EM 62 5. APPLICATION TO SIMULATED DATA 65 5.1 Simulation Design and Background Information 65 5.2 Monitoring Convergence 69 5.2.1 Convergence Pattern: MCMC loop 69 5.2.2 Convergence Pattern: EM loop 71 5.3 Assessing performance of ALV model 76 i
PAGE 5
5.3.1 Performance under Scheme 1 77 5.3.2 Performance under Scheme 2 85 6. APPLICATION OF ALV MODEL TO ASAPS DATA 92 7. DISCUSSION & RECOMMENDATIONS 105 REFERENCES CITED 110 APPENDICES 115 Appendix A: DATA SIMULATION CODES FOR SCHEMES 1 & 2 116 Appendix B: SELF-WRITTEN R FUNCTIONS CALLED BY ALV MODEL 128 Appendix C: R CODES FOR ALV MODEL ALGORITHM 134 ABOUT THE AUTHOR End Page ii
PAGE 6
LIST OF TABLES Table 3.1 Cubic Spline Interpolation 43 Table 4.1 Performance of ALV Model Estimation 45 Table 4.2 Results of ALV Model and Mplus Analyses under Scheme 1 46 Table 5.1 Twelve Simulation Scenarios used to Assess ALV Performance 66 Table 5.2 ALV Model Estimation Performan ce under Scheme 1 (50 Replications) 80 Table 5.3 Results of ALV and Mplus Analyses of a Single Dataset (Scheme 1,N=300) 81 Table 5.4 Percent Change in MSEs: GAMs of Z on and Compared to 89 Table 6.1 Five Items Used in the ASAPS to Assess Normative Beliefs of 11th Graders 96 Table 6.2 Ten Summary Baseline Risk Constructs in ASAPS Data 97 Table 6.3 Some 11th Grade Outcomes and Missing Data in ASAPS Data 98 Table 6.3 Parameter Estimates for ALV Model 46 Table 6.4a ALV Model of Marijuana Use Reported by 11th Grade Males (N=2500; 79 High School Clusters): Additive Sub-model# Includes Nonlinear Interaction Term 104 Table 6.4b ALV Model of Marijuana Use Reported by 11th Grade Males (N=2500; 79 High School Clusters): Additive Sub-model# Includes Linear Interaction Term 105 iii
PAGE 7
LIST OF FIGURES Figure 2.1 The Proposed ALV Model Diagram 15 Figure 3.1 The Plots of Distal Outcome versus Baseline Risk 37 Figure 4.1a ALV Algorithm Flow Chart: Overview of EM Setup 55 Figure 4.1b ALV Algorithm Flow Chart: Implementing the E-step via MCMC Process 56 Figure 4.1c ALV Algorithm Flow Chart: Summary of the MCEM Implementation 57 Figure 5.1 Trace and Density Plots of Markov Samples for Individual Subjects (Scheme 1) 71 Figure 5.2 Convergence Errors versus EM Iteration 73 Figure 5.2 Scheme 2: Binary Z, Nonlinear Model, N=200 74 Figure 5.4 Sequences of Parameters (Scheme 1) Across 100 EM Iterations 75 Figure 5.5 Sequences of Parameters (Scheme 2) Across 100 EM Iterations 76 Figure 5.6 Boxplots of Eta Produced from Three Sources (Scheme 1, N = 300) 82 Figure 5.7 Q-Normal Plots of Eta Produced from Three Sources (Scheme 1, N = 300) 82 Figure 5.8 Model Checking Plots: GAM Component of ALV Model (Scheme 1, N = 300) 83 Figure 5.9 ALV Model Convergence (Scheme 1, N=300) 84 Figure 5.10 Correlations between ALV Estimates of Eta and Population Values 86 Figure 5.11 Boxplots for the MSEs of GAM of Z Separately on and 86 Figure 5.12 Plots of Z (Binary Outcome) Predicted by Eta (Baseline Risk) by G (Treatment) (Results Based on a Single Simulated Sample Among 50) 92 Figure 5.13 ALV Model Convergence (Results Based on a Single Simulated Sample of Size N=200 Selected from 50) 93 Figure 6.1 Estimated Relationship of Probability of Marijuana Use to Baseline Risk 100 iv
PAGE 8
Figure 6.2 The Effect of Baseline Risk on Probability of Marijuana Use by Group with 95% Pointwise Confidence Bands 102 v
PAGE 9
ADDITIVE LATENT VARIABLE (ALV) MODELING: ASSESSING VARIATION IN INTERVENTION IMPACT IN RANDOMIZED FIELD TRIALS Peter Ayo Toyinbo ABSTRACT In order to personalize or tailor treatments to maximize impact among different subgroups, there is need to model not only the main effects of intervention but also the variation in intervention impact by baseline individual leve l risk characteristics. To this end a suitable statistical model will allow researchers to answer a major research question: who benefits or is harmed by this intervention program? Commonly in social and psychological research, the baseline risk may be unobservable and have to be estimated from observed indicators that are measured with errors; also it may have nonlinear relationship with the outcome. Most of the existing nonlinear structural equation models (SEMs) developed to address such problems employ polynomial or fully parametric nonlinear functions to define the structural equations. These methods are limited because they require func tional forms to be specified beforehand and even if the models include higher order polynomials there may be problems when the focus of interest relates to the function over its whole domain. To develop a more flexible statistical modeling technique for assessing complex relationships between a proximal/distal outcome and 1) baseline characteristics measured with errors, and 2) baseline-treatment interaction; such that the shapes of these relationships are data driven and there is no need for the shapes to be determined a priori. I n the ALV model structure the nonlinear components of the regression equa tions are represented as generalized additive model (GAM), or generalized additive mixed-effects model (GAMM). vi
PAGE 10
Replication study results show that the ALV model estimates of underlying relationships in the data are sufficiently close to the tr ue pattern. The ALV modeling technique allows researchers to assess how an intervention affects i ndividuals differently as a function of baseline risk that is itself measured with error, and uncover complex relationships in the data that might otherwise be missed. Although the ALV approach is computationally intensive, it relieves its users from the need to decide functional forms be fore the model is run. It can be extended to examine complex nonlinearity between growth factors and distal outcomes in a longitudinal study. vii
PAGE 11
1 CHAPTER 1 INTRODUCTION 1.1 Variation in Intervention Impact across Individual Level Baseline Characteristics Preventive interventions focus on reducing specific risk factors or enhancing protective factors. Intervention theory posits that a distal target or an outcome should be impacted by effectively modifying the risk factors. The su ccess of a typical intervention program then is measured in terms of how well it impacts distal preventive targets. In an experimental design aimed at assessing the effects of an intervention, ra ndomization is carried out to ensure that all the intervention groups are comparable at baseline with respect to the risk factors and other systematic effects. Given a successful randomi zation, post intervention differences between the groups may be attributable solely to intervention e ffects. However even in randomized trials there remains a variable degree of within-group heteroge neity with respect to individual level baseline risk that can potentially modify the interventi on impact. Generally such variability is more common in field trials that are universal than in standard efficacy trials. Depending on the nature of an intervention program, its effects on a risk factor may be expected to vary according to the individuals baseline status on the risk scale; that is, the intervention impact may not be same for all individuals. One example is a randomized fi eld trial (RFT) where the intervention, Good Behavior Game (GBG), was designed to reduce aggr essive behavior in first grade kids (Muthen, 2002; Kellam, et al., 2008; Brown, et al., 2008; Poduska, et al., 2008). The investigators predicted that, since about half the kids have minimal le vel of aggressive behavior throughout life, GBG would have impact on only those kids who scored high on the risk scale for aggressive behavior
PAGE 12
2 at baseline. These predictions can be tested statistically by measuring the significance of interaction effects between the baseline aggressive behavior and the intervention (Kellam, et al., 2008; Poduska, et al., 2008; Brown, et al., 2008). In their paper Brown, et al. (2008) the authors argued that the focus of intent-to-treat (ITT) analyses should not be limited to that of ex amining the main effects of intervention. They explained that one reason to also model the varia tion in intervention impact by baseline individual level risk characteristics is to personalize or tailor treatments to maximize impact among different subgroups. It is also possible for an intervention to be beneficial to some individuals while the same is harmful to others. The author s concluded that there is almost always an a priori reason to examine interactions involving intervention and baseline level of risk, particularly in RFTs employing universal preventive interventions. Often times investigators are interested in how multiple risk factors act together to predict a long term outcome. In social sciences in par ticular, these observable risk factors often have measurement errors. To obtain a summary risk va riable for use in a statistical analysis designed to investigate intervention effect, a better altern ative to simple averaging might be hypothetical latent risk variable(s) that is/a re constructed from the observed baseline risk variables. The risk variables then serve as indicators for the underlying latent construct(s) or factor(s). By using e.g. a single latent summary variable, the accuracy of th e results is less affected by measurement error in any individual risk factor. Multiple measures (indicators) of a construct therefore tend to be more valid and reliable than when only a single fa llible indicator is used (Kline, 2005). In a more complex scenario the observed risk indicators ma y be jointly multi-dimensional rather than unidimensional; the dimension is reflected by the number of constructs underlying the observed indicators. So, when risk summary by simple av eraging is not an option, a latent variable
PAGE 13
3 modeling technique (such as structural equation modeling (SEM) and its extensions) is most appropriate for carrying out such analyses. In a SEM, the observed baseline risk indicators are summarized into latent factor(s) in the measurement part of the model. Here the estimate(s) of the latent factor(s) constitutes the latent baseline risk, the common contents of the observed risk indicators. The latent baseline risk factor(s) may then be applied in the structural part of the SEM to relate to one another and also as predictor(s) of one or more distal outcomes. This approach can be extended for longitudinal data modeling where estimated latent growth factors (i ntercept and slope) may relate to each other and predict a later outcome in a system of linear equations within the SEM framework (Muthen, 2002). For simplicity and to keep within the scope of this dissertation, we will assume that the indicators are uni-dimensional, and the single latent factor predicts a single distal outcome in a sub-model. To assess variation in intervention impact as a function of the baseline, a dummycoded intervention variable plus its interaction term with baselin e are added to the sub-model. The conventional SEM seeks simultaneous solutions to a system of linear regression equations that form the measurement and the structural parts, with the assumption that the latent factors (or latent baseline risk) influence the other set of variables (e.g. a distal outcome) or their transformed versions in a linear fashion. However if in reality the latent baseline relates to the outcome it predicts in a non-linear fashion, the usual linear modeling methods can easily lead to erroneous conclusion (Wang, Brown, & Bandeen-Roche, 2005). Also as a result of such misspecification, a test of intervention impact on later outcomes may result in non-significant findings due to low statistical power. Even with significant effects, there may be misinterpretation of what the effects mean. The conditions under which such intervention impacts occur may only be fully captured by methods that allow for nonlinear relations among both the
PAGE 14
4 observed and unobserved variables. Linear mode ls, even if they include higher order polynomials, do not provide sufficient flexibilit y in assessing this potential type of impact (Brown, 1993). So, even the advanced SEM technique that includes polynomial model specifications has its limitations. 1.2 The Concept of Additive Latent Variable (ALV) Modeling An alternative approach to modeling nonlin ear relationship is to use a nonparametric estimation method. These methods can vary in complexity. For example, an estimation method was developed for panel data (binary or conti nuous) with correlated errors (Chib & Jeliazkov, 2006). In the authors regression model, the res ponse variable depends parametrically on some covariate vectors and nonparametrically on a nother covariate where the relationship is unspecified but the function is assumed to be sm ooth, otherwise unrestricted. The authors used MCMC based algorithm in the estimation of the nonparametric function when the errors are correlated. They were able to incorporate nonlinearity and intertemporal dependence in their model (Chib & Jeliazkov, 2006). While their setting is a single regression model aimed at distinguishing among important sources of inte rtemporal dependence in the observations, our focus here is on developing a system of simultaneous regression equations incorporating latent variable and semiparametric m odeling techniques. Our choice approach to modeling nonlinear relationship is the use of the Additive Model (AM) technique (Hastie & Tibshirani, 1990). This method employs smooth functions which (in a flexible way) allow data to define the relationships between response and predictor variables. For example, in Kellam, et al.,(2008) and Brown, et al., (2008) the Additive modeling technique was used to examine intervention impact on certain distal outcomes (e.g. drug abuse/dependence) in terms of smooth functions of the observed aggressive behavior at baseline and of the baseline-treatment interaction term. Specifically, the
PAGE 15
5 authors employed the Generalized Additive Mode l (GAM) technique (Hastie & Tibshirani, 1990) for the binary outcomes. The theory of interv ention that guided their work posited that the intervention would succeed to the extent that it does affect the risk factor. If someone is already at the lowest level of risk, i.e., does not exhi bit aggressive behavior, then the impact of the intervention is expected to be low or nonexistent. In the context of baseline measures, two notable statistical issues commonly arise in social and psychological research, and how they are addr essed may have serious implications to validity of statistical inference. The first situation is when baseline risk factors are not directly observed and have to be estimated from multiple observed i ndicators before each can be used to predict any outcome in a regression model. Second, a nonlinear relationship between the baseline and the outcome is often closer to reality than linear rela tionship. One analytic approach to resolve these issues is to combine two existing methodologies se quentially in a complimentary way: latent variable (LV) modeling and additive modeling (AM). Essentially one can perform a two-step analysis where the Empirical Bayes (EB) estimate (or factor scores) of the baseline risk is first obtained for individuals via a Confirmatory Factor Analysis (CFA), then follow up with a nonlinear regression analysis such as the GAM (H astie & Tibshirani, 1990) that includes the baseline estimate as a predictor (Toyinbo, P. A., & Brown, C. H. Variation in Intervention Impact in Adolescent Substance Abuse Prevention Study. Report presented on April 18, 2007 at the Adolescent Substance Abuse Prevention Study Na tional Advisory Group Meeting, Institute for Health and Social Policy, University of Akron Akron, Ohio, USA.). The approach adopted here led to the novel concept of Additive Latent Va riable (ALV) modeling technique (the original name was coined by the second author).
PAGE 16
6 We noted that a strong feature of the SEM framework is the capacity for evaluating the whole model as a unit with several available and useful model comparison and goodness-of-fit tests. However when a two-step analysis is carried out as described above, the simultaneous solution advantage is lost. In addition such an ad-hoc approach suffers from the usual estimation errors of factor scores (Muthen, 1989). Also, as at the time of this study, the two-step analysis requires different statistical software platforms for the different implementation steps, hence the method is not efficient. Ideally, a simultaneous so lution to the system of equations consisting of both LV and GAM models will be more efficient but to the best of our knowledge, such method has never been reported in the literature. We hope to begin to fill this gap in our current study. Based on the idea of bringing together several di fferent existing analysis types in one general model under a unifying framework of the SEM (Mut hen, 2002), an attractive approach to filling the gap we have identified is to integrate GAM into the very powerful and flexible SEM framework to allow for simultaneous solutions to one system of equations. This will allow maximal strengths to be drawn from both component modeling techniques. 1.3 Proposing a New Methodology: Additive Latent Variable Model This dissertation concerns the development of a new method for analyzing variation in impact of a universal preventive intervention. The new integrated ALV modeling technique we are proposing here consists of two existing anal ytic techniques (LV and GAM) and will allow for the evaluation of the combined analyses as a whole unit. By integrating additive models into a LV framework, a single system of equations can be solved simultaneously, and many of the powerful tests possible in LV modeling may be availabl e globally. This new method has a great potential for extensions and will provide an opportunity to examine smoothed nonlinear relationships between latent variable constructs (such as baseline risk, growth factors) and a proximal or distal
PAGE 17
7 outcome, within the LV framework. The met hod will also examine how these relationships interact with treatment. What is unique about the proposed model is that it incorporates two powerful analytic techniques into one modeling tech nique which to date has never been reported. Moreover, the ALV will allow for a more efficien t use of the data, alleviate the problem of multiple comparison tests as well as be ridden of the need for multiple statistical application platforms. The ALV modeling technique can be used to analyze cross-sectional or longitudinal data. In the longitudinal case it allows us to exam ine how the effect of treatment on a long term outcome may interact with the mediating growth process, particularly when such interaction is non-linear. This dissertation is motivated by analysis requirements of the Adolescent Substance Abuse Prevention Study (ASAPS) data (Sloboda, et al., 2008). The ASAPS study, funded by the Robert Wood Johnson Foundation an d conducted by the Institute for Health and Social Policy of the University of Akron, is a cluster randomized field trial of 83 school clusters consisting of high school and all its feeder middle schools with a total of 19,200 students from six metropolitan areas spread across the country. Th e intervention program: Take Charge of Your Life (TCYL) was delivered to students in the 7th grade and re peated in the 9th grade. Seven self-administered survey waves were administered, starting from 7th grade (pre-intervention) to the 11th grade, to collect data on behavioral outcomes including use of alcohol, tobacco and other drugs. A major research question confronting the resear chers is who benefits or is harmed by this intervention? To put it in another way, do es the intervention affect individuals differently as a function of baseline risk? To answer this ma jor research question the analyses will benefit immensely from both the Latent Variable and A dditive modeling techniques. The Additive model which uses a smoother to summarize the potential nonlinear trends in the data complements the
PAGE 18
8 Latent Variable model. The Latent Variable modeling technique can be used to summarize the observed baseline risk indicators into a one-dim ensional error-free latent baseline risk, from which the EB estimates can be computed. The GAM model part then describes the dependence (linear or nonlinear) of the outcome on baseline ri sk and the baseline interaction with treatment (see also Brown, 1993). So the GAM method can help distinguish whether the effect of the intervention is similar across all levels of risk (o r other baseline covariates of interest), limited to low or high risk, or beneficial for some but harmful for others. Furthermore, for hierarchical data such as the ASAPS data, the GAM can be substituted with a generalized additive mixed model (GAMM) counterpart to add school level random ef fects into the nonlinear regression sub-model. A major limitation to the two-step modeling approach described above is that the two analytic methods required two different statistical app lication platforms for implementation, so the approach is not efficient. The purpose of this dissertation is to address the deficiency by integrating the two components into one singl e model that performs simultaneous solutions. In the next chapter, we will discuss the basic statistical theory for the development and estimation of the proposed integrated ALV m odel. First we describe the SEM as a standard template for LV models. The EM algorithm will be set up for the maximum likelihood estimation such that it will allow for a later incorporation of the additive model to into the LV framework. The Markov Chain Monte Carlo fitting method for handling resultant complex integrations is also presented here. The remainder of the dissertation is organized as follows. Chapter 3 is devoted to the basic theory and estimation of Additive Models Here the GAM is finally integrated into the LV framework which employs the EM algorithm fo r estimation as set up in chapter 2. Chapter 4 is concerned with the computational developm ent of the ALV model algorithm while detailed simulation study of the performance of the ALV estimation method is presented in chapter 5. In
PAGE 19
9 chapter 6 the new ALV modeling technique is app lied to a portion of the ASAP data, and finally the discussion is presented in chapter 7.
PAGE 20
10 CHAPTER 2 FRAMEWORK FOR ALV MODEL FORMULATION 2.1 General SEM Framework The structural equation modeling (SEM) is a pow erful and very flexible analytic tool. To illustrate how expandable the SEM modeling framework is, we describe here a general model where continuous latent variables represent some unobservable constructs of substantive significance, but are indirectly measured by multiple observable variables or factor indicators, and are also influenced by covariates. This model type represents a member of the SEM family, the general form of which can be specified in tw o parts: measurement and structural sub-models (Muthen, 2002). The measurement sub-model of the SEM links a m1 vector of latent variables to a p1 vector of factor indicator variables observed on i th unit as follows: K iiiiyx (2.1) where is a p1 parameter vector of measurement intercepts, x is a q1 vector of covariates, is a p1 vector of measurement errors or residuals with a p p covariance matrix denoted is a p mparameter matrix of factor loadings or measurement slopes. Kis a p qparameter matrix of regression slopes measuring the direct influence on the outcome vector y by covariates x. For some applications, the flexibility of the SEM general framework is exploited to allow for the measurement intercept vector to be used for multiple groups modeling or for growth
PAGE 21
11 modeling with multiple indicators at each time poi nt. The structural sub-model is specified as follows: iiiiBx (2.2) where is an m1 parameter vector of structural intercepts, is an mq parameter matrix of slopes where latent variables are regressed on covariates, and is an m1 vector of structural errors and has a mm covariance matrix This formulation is written with the latent variables on both sides of the equation to allow for some late nt variables to be predicted by others or one another (simultaneous equations with recursive or non-recursive relationships). So Bis a restricted, mm matrix of regression coefficients that relate latent variables to one another (either recursively or non-recursively). We will li mit our focus in this study to recursive models (no feedback loops); that is, Bmatrix is restricted to lower tria ngular with zeros in its diagonal so that the independent latent variables do not co-vary with dependent latent variables. The above formulations of the SEM are equivalent to the original LISREL model (Joreskog, 1977) with the usual model assu mptions as follows. The measurement errors s and structural errors s are mutually independent among the units, and both vectors are not correlated. We apply the standard assump tions about the error distribution. The s and s are assumed to have a multivariate normal distri bution with zero expectations. Also the s are independent of the exogenous latent variables. p m~N(0,) ~N(0,) Cov(,)0
PAGE 22
12 These assumptions imply that the latent independent variables are multivariate normal. The independent covariates that are measured w ithout error may have any distribution since we will be conducting our analysis conditional on these covariates. Th e general formulation for SEM above can be re-expressed in a more trad itional regression framework by solving for and then pre-multiplying by1() I B: 111()()() IBIBxIB (2.3) 111()()() yIBIBxIBKx (2.4) We prefer to work with this reduced model form. Thus for the general SEM, the mean and covariance structures conditional on xare respectively 11()() IBIBxKx (2.5) 11()() IBIB (2.6) with the restriction that I B is non-singular. For convenience, we will consider a simple and less general member of the SEM family where y is a vector of indicators of a single latent variable and there are no covariates. This simple model can be easily specified by introduc ing different constraints into the general SEM model framework (2.2 and 2.3). With no covariates xin the model, Kand diminish. Such is the confirmatory factor analytic (CFA) model which is a system of linear regressions where each of the observed variables is predicted by one or more latent variables based on theory. A conventional CFA model is a simple recursive model with additional constraints that include
PAGE 23
13 0B in (2.3). This model can be represented by the following simplified results from (2.3) and (2.4) respectively: y (2.7) (2.8) pp~N(,);[|]~N(,) yy (2.9) In this model, the loadings are the regression coefficients of the latent variables. This would be a traditional multivariate regression model (with restricted regression coefficients) if the latent variables were observed. With latent variables being unobservable there is a built in indeterminacy where affine transformations of th e latent variables can yield the same fit but completely different loadings and other parameters. To resolve this identifiability issue, sufficient number of restrictions is placed on either (one element in each column is fixed at one) or on (diagonal elements set to one); and also on A convenient way to remove this model indeterminacy is to fix the means of to zero and the covariance to the identity. It means that the factors are orthogonal and the factor components are independent under normal distribution. Using this specification, the mean vector is unrestricted and therefore optimally estimated (under ML and other such mode ls) by the observed sample means for y The vector of the parameters to be estimated is1p1p1p(,,)( ,..., ,..., ,..., ) y Assuming a standard normal marginal distribution for define a p-variate vector of variables iY observed on the ith subject, i=1,,N; assume that the N observations are independent and each vector iYconditionally has multivariate normal density written as
PAGE 24
14 1/2 /21 /21 f(|;)(2 )exp()(), 2 1 f()(2 )exp; 2 p iiyiiii m iiiyy y (2.10) where is the mean vector of i y Next we will introduce the ALV model building blocks and the algorithms to be used to optimize the m odel parameters. Implemen tation details will be addressed in subsequent chapters. 2.2 Basic Formulation of ALV Model In the ASAPS study, each of N individuals in the ASAPS study is associated with a set of observations including the baseline risk variab les, baseline covariates, distal outcome and intervention status data. Let the vector of risk measures be denoted by 1p{Y,.....,Y} Y which serve as indicators of a single underlying latent risk factor denoted by. We wish to examine how the unobserved baseline risk factor may predict the observed distal outcome Zgiven an intervention status Treat and fixed covariates X. Specifically we are interested in modeling the potential smooth non-linear relationship between Zand. With little modifications to the model di agram conventions adopted in (Muthen & Muthen, 2008) the path diagram for our proposed ALV model when p = 4 is depicted in the Figure 2.1. The straight lines with single arro w head represent linear regressions where the s are factor loadings, that is regression coefficients of the Ys on the latent variable. The thick curve counterpart represents a potential non-linear relationship to be captured by the regression of Zon which may appropriately require anything fro m non-linear parametric to non-parametric modeling approach, such as the use of a smooth function (Sm) in the ALV model.
PAGE 25
15 Figure 2.1 The Proposed ALV Model Diagram The proposed ALV model therefore involves the integration of Generalized Additive Model (GAM) into a LV model framework. Let the LV model be defined as a simple latent factor model of the observed data i{,z}iywithout covariates and be represented by the probability density function iip(,z| ;)iy where is the updated set of all parameters to be estimated. For now we treat the population as a single homogenous group and there is no additional e xplanatory variable to be considered in the model. That means we have anN(p1) observed data matrix consisting of p factor indicators () Sm 1234
PAGE 26
16 plus a single distal outcome. We assume iY are continuous and are linearly related to an underlying latent factori, that is, iY retains the conventional linear factor analysis model structure and the latent variable model is identif iable. In preparation for the new work on this dissertation, we do not require the response iZ to be normally distributed or to be linearly related toi, but we will always assume that iY and iZ are independent giveni. With iY and iZconditionally independent, we can decompose the LV model into two independent joint conditional likelihood functions: i 1 iii 1L(;,z|)p(,| ;) p(| ;)p(z| ;) N ii i N iyz iyyz y (2.11) where y and z are distinct set of parameters in The first component of the LV model then represents a simple confirmatory factor anal ysis (CFA) model consisting of a system of p linear regression equations while the second component is simply a regression model of the distal outcome on the latent factor. To convert the LV into ALV we simply represent the second component as a GAM of a distal outcome Z where Z can be a continuous, categorical or count variable. 2.3 ML Estimation of ALV Model via the EM Algorithm We adopt a likelihood based approach to parameter estimation. It is anticipated that eventually when we fully specify the two co mponent density functions constituting the ALV model, performing direct maximization of the obse rved data likelihood will be complicated by the presence of non-linear relationship between the vari ables and the associated parameters and the intractable integrals that may result. Theref ore we choose to implement maximum likelihood
PAGE 27
17 estimation using the Expectation-Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). The EM algorithm is a general and easily adaptable approach for finding the maximum likelihood estimates (mles) of the underlying pa rameters in a given data when the data are incomplete or have missing values. In our case the observed data{,z}ydepend on a latent factor which is unobservable. So we consider th e situation as a missing data problem, where is treated as missing at random (Rubin, 1976). Our sp ecifications of the EM algorithm will be based on regarding as a random N-vector of missing data within the SEM model framework. We then treat the observed {,z}y as incomplete data while {,z, }y constitute complete data in which the rows are independently and identically distributed (Dempster, Laird, & Rubin, 1977). We will develop an EM procedure for parameter estimati on that allows for a non-linear regression of Z on the latent factor via a smooth function. The complete-data likelihood is expressed as com yzL()p(,z, ;) p(| ;)p(z| ;)p( ;) y y (2.12) where the random is unknown and, given a factor analytic model framework, its marginal distribution is fixed as standard normal for model identification purpose (see equation (2.10)). The maximum likelihood estimates of will be computed from the complete data with the above specifications and restrictions. Consider that if were observed, then we have a simple distribution for the complete data where mles for can be obtained by the usual least square method based on the sums, sums of squares and sums of cross-products (Dempste r, Laird, & Rubin, 1977). Normally, when is not observed, we would obtain the mles of the parameters by integrating the complete-data
PAGE 28
18 likelihood (,z, ;)py with respect to and maximizing the results with respect to. However the approach to estimation taken by the EM algor ithm is to alternate between computation of the expectation of the complete data log-likelihood (E-step) and the maximization of this expectation with respect to (M-step). The idea is to fill in a set of values for the missing(E-step) and solve the problem, i.e. find mles for (M-step); the repeat the two steps to find better values of to fill in (Rubin,1991). Because is unknown, draws from its conditional distributionp( |,z;) y will be taken to simulate. Let ()Q(,)kbe defined as the kth iterative expected complete data log-likelihood given the observed data and current values of() k, which is given by () () com ()Q(,)E[logL()|,z,] E[logp(,z, ;)|,z,] kk ky yy (2.13) Each repetition of the two steps yields a new set of mles for by numerically increasing the value of quantity ()Q(,)k and the iteration continues until convergence. One important property of the EM algorithm is that a (k+1)th iteration causes ()Q(,)k to increase over its kth value (Dempster, Laird, & Rubin, 1977). Briefly, contains the parameters to be optimized as ()Q(,)k increases; with a current value()k, iteration k of the EM Algorithm is implemented in the following sequences: 1. Draw from the conditional distribution()p( |,z;)ky (i.e. evaluate it at the current parameter estimates()k); supply initial values if k=0.
PAGE 29
19 2. E-step: Evaluate ()Q(,)k using updates from (1), that is taking the expectation of the complete data log-likelihood with respect to the conditional distribution()p( |,z;)ky 3. M-step: Maximize ()Q(,)k over to obtain a revised(1) k That is, solve (1) () maxQ(,)kk 4. Check for convergence, if none, set ()(1) kk and return to (1). 2.3.1 The Expectation Step of EM The E-step at the kth iteration computes the expected value of the complete data loglikelihood over given the observed data and current values of()k This step is more formally defined as () () ()Q(,)logp(,z, ;)p( ;)d logp(,z, ;)p( |,z;)d kk ky yy (2.14) where the complete data likelihood derives its randomness solely from being a function of random variable that is governed by its conditional predictive distribution given the observed data:()f( |,z;)ky The complete data log-likelihood function is not tractable analytically because we do not have a fully known para metric form for the joint distributionp(,z, ;)y therefore we require an alternative method to direct integration in the E-step. Markov Chain Monte Carlo (MCMC) Method Whenever the computations involved in the in tegration (E-step) and /or optimization (Mstep) are intractable, numerical methods or Monte Carlo methods may be indicated (Wei &
PAGE 30
20 Tanner, 1990; McLachlan & Krishnan, 2008). Our choice here, the Monte Carlo method, computes integrals using random number generation, and it is pref erred to numerical quadrature methods when the dimension of integral may be large or the functions may not be smooth (McLachlan & Krishnan, 2008). For simplicity we illustrate with an example of a complex integral I(y)f(y|x)p(x)dx which can be expressed as an expectation of f(y|x) over the continuous densityp(x). To use the classical Monte Carlo integration (McLachlan & Krishnan, 2008; Walsh, 2004), a sufficiently large number 1cCx,..,x,..,xof random sample are drawn from the density p(x)(which must be completely known) and the integral is approximated by C c c11 I(y)f(y|x) C (2.15) The estimated variance of the Monte Carlo estimate is given by C 2 c c111 var[I(y)]f(y|x)I(y) CC1 If the target distribution() p x itself is complex and is indirectly or incompletely specified, then a more complex Monte Carlo method will be required (McLachlan & Krishnan, 2008). For example when() p x is uniquely defined but does not have a standard form that is amenable to direct sampling, Markov chain Monte Carlo (MCMC) methods are commonly used to draw samples indirectly from these distributions (McLachlan & Krishnan, 2008; Lee & Song, 2007; Wei & Tanner, 1990). A Markov chain is a stochastic process that characterizes sequences of random variables, where the transition probabilities between different values in the state space depend only on the random variables current state (Walsh, 2004; Gamerman & Lopes, 2006). The most critical
PAGE 31
21 feature that defines a particular Markov chain is the transition kernel (transition probabilities) which is the limiting distribution of the chain. The aim therefore is to construct a Markov chain such that its limiting distribution equals the target distribution we wish to simulate. Let the transition kernel be defined as (c)(c1)q(x,x)which is the probability of transition of a process from an earlier state (c)xto the next state (c1)x in a single step (Walsh, 2004). To draw a random sample from distribution p(x)via Markov chains, the transition kernel must be chosen such that the stationary distribution of the chain is p(x) (Gamerman & Lopes, 2006), and q(,)must satisfy (c)(c)(c1)(c1)(c1)(c) (c)(c1) (c) (c1)(c) (c1) (c)(c1)p(x)q(x,x)p(x)q(x,x),(x,x) p(x)q(x,x) i.e. p(x)q(x,x) (2.16) This is the basis for the Metropolis-Hastings Algorithm which we will discuss next. Metropolis-Hastings Algorithm The Metropolis-Hastings (M-H) Algorithm is a very widely applicable MCMC method for simulating a complex nonstandard multivariate distribution; the Gibbs sampler (Geman & Geman, 1984) is a special case of the M-H algor ithm (Walsh, 2004). The mechanism of the M-H algorithm as outlined in Gamerman & Lopes (2006) and Walsh (2004) will be described briefly here. Note from the qratio above that it is sufficient to be able to express a qualifying stationary distribution p(x)up to the normalizing constant, since any constant factor cancels out when calculating the transition kernel. Suppose we wish to draw samples from 1d p (x):x(x,...,x) of which direct sampling is complicated. If f(x)is an approximation up to a constant and is
PAGE 32
22 available, such that p(x)f(x)/h where the normalizing constant h is difficult to compute, we can generate a d-dimensional random vector from f(x)using the M-H algorithm. For the M-H scheme, first a proposal kernel (density) (c)(c1)q(x,x)is chosen so as to be as similar to the target densityp(x)as possible, to increase acceptance rate. No te that if the sampling (proposal) kernel equals the target distribution (i.e. when the la tter is known), then acceptance rate is 100 percent and direct draw from the target density itself is possible, as in the classical Monte Carlo procedure. Desirable features of a proposal kern el include tunable parameters such as location and scale (Chib & Greenberg, 1995; Walsh, 2004). A widely used proposal kernel (or candidategenerating density) is the multivariate normal. The following steps are carried out in the M-H scheme: (I) choose arbitrary initial values 0xsatisfying0f(x)0 (II) evaluate the proposal (or jumping) distribution (c)(c1)q(x,x)at the current 0xvalues, and then sample a candidate point *xfrom (c)(c1)q(x,x), (III) define an acceptance probability of a move of the chain from current value (c)xto a new value (c1)x as the ratio of the densities at the proposal point *xand current point (c)x: *** cccp(x)f(x)hf(x) p(x)f(x)hf(x) (2.17) If 1 a move to the new proposal point increases th e density and so is allowed, else the move is allowed with a probability of The basis for allowable move can be summarized as *c* c* c*cf(x)q(x,x) (x,x)min1, f(x)q(x,x) (2.18)
PAGE 33
23 (IV) To introduce randomness a quantityuis generated from an independent uniform distribution U(0,1), then the proposal point is accepted as the current value *(c1)xx if u else it is rejected and no change takes place, i.e. *(c)xx Steps II to IV are iterated until convergence. The above Metropolis-Hastings algorithm is a genera lization of the original Metropolis algorithm (McLachlan & Krishnan, 2008; Walsh, 2004). The original algorithm requires that the proposal density be symmetric (e.g. normal distributions): c**cqx,xqx,x so that c*(x,x) reduces to c* cf(x) (x,x)min1, f(x) (2.19) Expectation of Complete Data Log-likelihood To reiterate we are adopting a method sim ilar to the Monte Carlo EM algorithm (MCEM) (Wei & Tanner, 1990) whereby the Monte Carlo inte gration of the log-likelihood is approximated by drawing a sufficiently large number Cof observations from the predictive conditional distribution ()p( |,z;)ky evaluated at the current values()k Upon generating the random observations (c) i{ ,c=1,...,C,i=1,...,N}by the MH algorithm, there are different ways to use the observations in both the E-step and M-step. For the E-step, a popular and straight forward process is to plug the expected value of the Mar kov process generated random observations (or its function of some sufficient statistics) directly into the ()Q(,)kfunction (Lee & Zhu, 2000). For another example, these random observations we re plugged into conditio nal expectations of the complete data approximate su fficient statistics in (Lee & Zhu, 2002) to evaluate the E-step. In our case, in the E-step we decided to fill in the entire estimated density of ()p( |,z;)ky into ()Q(,)k so the problem considerably simplifies to that of a C number of simple regression
PAGE 34
24 equations with fixed covariates similar to an example described in McLachlan & Krishnan (2008). Note that()k is already imbedded in the C drawings: () (c) 11 Q(,)[logp(,z, ;)] CC k cy (2.20) A single scan or generated sequence (-t)(0)(1)(c)(C) iiiii ,..., ,..., ,...,is a Markov chain for the ith subject. As c tends to infinity, or with a sufficiently largeC, the stationary distribution converges in distribution to the target distribution() iip( |,z;)k iy To allow a sufficient amount of time for a stationary distribution to be r eached, the first set of iterations in the chain (-t)(0) ii ,..., serves as the burn-in segment. This initi al set of iterations is discarded while the remainder segment of the chain forms the sample of an optimal finite size C to be used in the Monte Carlo integration. The usable Markov sample then consists of limiting transition probabilities that are no longer dependent on the start values. However, by using successive values from a single Markov chain per subject, within-subject autocorrelation does induce chain dependence. In order for inference based on the sample to still be valid, higher autocorrelation will require a longer chain to run (Gamerman & Lopes, 2006). The authors also noted that Markov chains only have first order dependence which decreases with increasing lag between iterations, therefore subsample of quasi-independent elements can be formed by storing only every jth value post burn-in period. This method is re ferred to as thinning and it also has the advantage of requiring relatively shorter chains. With thinning we can achieve independence in the final sample with improved optimality and, in addition, reduce storage requirement for computer generated data. Furthermore, by ge nerating a Markov sample independently for each subject, we are able to make the assumpti on of both within-subject and between-subject
PAGE 35
25 independence for the N Markov samples. Therefore by drawing a sufficiently large sample (c) i ,c=1,...,C from()p( |,z;)ky we can write () (c) ii 111 Q(,)[logp(,z, ;)] CCN k i ciy (2.21) Since we are using Metropolis algorithm to sample from a conditional normally distributed, the selection probability simplifies to *p( |,z;)/p( |,z;) yy where *is the proposal value. Recall that given, Yand Zare independent. Therefore for the ith subject in the kth EM iterate the conditional distribution can be approximated up to a constant K as follows: () () () ii ii ii () i () () () ii iiiip( ,,z;) 1 p( |,z;)p( ,,z;); p(,z;)h 1 p( |,z;)p(| ;)p(z| ;)p( ). h k kk i ii k i kkk iiyzy yy y yy (2.22) So we have (for normal Z linearly related to ) 1/2 -p/21 i 2-1/2 2 ii ii 2 -m/2 ii i1 f(| ;)(2 )exp()(), 2 1 f(z| ;)(2 )exp-(z-a), 2 1 f( )=(2 )exp 2 iy iiii zyy y (2.23) In the cth MCMC iteration the candidate value idrawn from the univariate normal proposal distribution is accepted as the new value (c+1) iwith the probability of: *()*()* iiii (c)* ii (c)()(c)()(c) iiiip(| ;)p(z| ;)p( ) ( )=min1, p(| ;)p(z| ;)p( ) kk iyz kk iyzy y (2.24)
PAGE 36
26 Note that h has cancelled out in the ratio(c)* ii ( ). From (2.23) the ratio therefore simplifies to *1 *-2*2** ii i i i i (c)1 (c)-2 (c)2(c)(c) ii i i i i1 exp( )( ) (z-a-b ) 2 min1, 1 exp( )( ) (z-a-b ) 2 ii iiyy yy (2.25) If u where u has a random uniform distribution, the transition jump (c)(c+1) ii is allowed, otherwise the jump does not occur and the current va lue is retained in the Markov chain position. We chose for our proposal density a norma l distribution centered on the current value(c) i. The scale and spread of the proposal density are important factors controlling the acceptance or rejection rate and the sample space region covered by the chain. For accuracy it is desirable that the density be sampled mostly around its mode. If the variance of the density is too large some generated candidates will be too far from the mode and so have relatively low acceptance probability. On the other hand if the variance is t oo small, it will prolong the time required by the process to sufficiently traverse the sampling space supported by the density, leading to undersampling of the low probability regions. To achie ve a delicate balance an approximate acceptance rate of 0.45 is recommended when dealing with one-dimensional problem like ours where we estimate only one parameter (i ) (Chib & Greenberg, 1995). Therefore a proper fine tuning of the variance of proposal density is necessary for good mixing and efficient sampling (Chib & Greenberg, 1995; Walsh, 2004). As a rough guide, we compute the Empirical Bayes variance estimate of ()[ |;]ky for use as a start value. From general multivariate results for factor analytic model, the latent factors conditional on the observed indicators are multivariate normal: m | |[ |,]~N(,) yy y Therefore given a standard normal marginal density of (see (2.7) to (2.9)), the common conditional variance is computed as follows
PAGE 37
27 1 |() yI (2.26) 2.3.2 The Maximization Step of EM Recall the decomposition of the complete data log-likelihood (see (2.12) and (2.13)) as reproduced below: () () ()() (c)()(c)() iii 1111Q(,)E[logp(| ;)logp(z| ;)logp( )]|,z, E[logp(| ;)|,z,]E[logp(z| ;)|,z,]logp( ) 11 [logp(| ;)|,][logp(z| ;)|z,]w CCk k yz kk yz CNCN kk iyz ciciyy yyy yy (2.27) The first two terms on the right of (2.27) on the firs t line (a factor analytic model and a univariate regression model) have their sepa rate distinct parameters, so ma ximization can be done separately (McLachlan & Krishnan, 2008). Note that for the purpose of identification the marginal distribution p() has 0 mean and unit variance (Dempster, Laird, & Rubin, 1977). Secondly, even if we resort to approximatingp() by its conditional distribution in the M-step, this will not be useful since ()[ |,z;]ky can be specified only up to a normalizing constant (see (2.22)) and the conditional distribution is proportiona l to its joint distribution. Ther efore the last term is treated here as a constant w (2.27) that does not depend on hence does not contribute to the maximization. The EM algorithm hence concerns the finding of 1. (1) k y to maximize ()E[logp(| ;)|,]k yyyy and 2. (1) k z to maximize ()E[logp(z| ;)|z,]k zz. An alternative approach to maximization is base d on the idea of a Stochastic EM algorithm as described in (Lee, Song, & Lee, 2003). He re the mean of random observations (i ) in the
PAGE 38
28 Markov chain for the ith subject is computed, considered as fixed, and simultaneous regression model is solved to obtain mles. For example, their approach to modeling the Ys would give: (1) () () C (c) ii c=1 argmaxQ(,)argmax[logp(| ;)|,], 1 = ,i=1,.....N Cy ykk k yy y yyy However our approach to maximization is sli ghtly different in the sense that we plugged (c) idirectly into the regression model and solve C simultaneous regression equations instead. A major consideration in our decision is to avoid bias in our estimation, since the computed likelihoods from the two methods are not necessarily equivalent. Our approach requires the assumption that the random sample elements in each Markov chain (subject) are independent; which we are able to satisfy by using thinning method as necessary to minimize autocorrelation. Secondly we can also assume independent obser vations between subjects since the C Markov samples are independently generated for each subj ect to yield N independent Markov chains. So, using the MCMC method, the kth M-step solves (1) () (c) () 1 (1) () (c) () 11 argmaxQ(,)argmax[logp(| ;)|,] C 1 argmaxQ(,)argmax[logp(z| ;)|z,] C y y zzC kkk yyyy c C kkk zzzz cyy (2.28) One notable advantage of the ALV model structure is that with iavailable, the two parts above have fixed-effects GLM structure and ma ximum likelihood estimation can be carried out separately for them using the existing statistical tools for standard regression models. For the future ALV model extensions, all that is required of either part is for the response variables to belong to the exponential family.
PAGE 39
29 2.3.3 Standard Errors of Estimates In the context of EM algorithm the standard errors of the maximum likelihood estimates may be obtained from the inverted Hessian or information matrix based on the observed data likelihood function, according to the method of L ouis e.g. (Lee & Zhu, 2002; Song & Lee, 2005; Law, Taylor, & Sandler, 2002). The Louis method expresses observed information matrix as the difference between complete and missing data information matrices, thus 2 com com I(;,z)ElogL(;,z, )|,z, VarlogL(;,z, )|,z, yyy yy (2.29) where I(;,z) y is the observed information and the firs t and second terms on the right represent complete and missing data information evalua ted at the final parameter maximum likelihood estimates This approach is chosen because th e EM implementation does not generate observed data information as a by-product. Howe ver since these matrices generally have no closed forms, the Louis method provides a formula for computing the observed information matrix in terms of the expectation of the first and second derivatives of the complete data log likelihood function using the MCMC samples (M cLachlan & Krishnan, 2008). The missing data information formula is written as 2( c ) com 1 2 (c) (c) com com 11 logL(;,z, |,z,) 1 I(;,z) C logL(;,z, |,z,)logL(;,z, )|,z, 11 CC C c CC ccyy y yyyy (2.30)
PAGE 40
30 Given the availability of and assuming normally distributed response variables, the complete-data log likelihood function and relate d partial derivatives can be easily obtained separately for each outcome variable at each point in the Markov chain in the form of a least square regression model: 1( c ) 2 com i 1 com 1( c ) i 1 com 1( c ) ii 1 com 12 ( c )2 ii 1111 logL()Nlog(2 )Nlog( ) 222 logL() ( ) logL() ( ) logL() 11 NN ( ) 22 N yi i N y i i N y i i N y i iy y y y (2.31) where 1p1p1p { ,..., ,..., ,..., } yis the set of MLEs associated with p indicator variables. The corresponding second partial derivatives are 2 com 1 2 2 com 1(c)2 i 2 1 2 com 23(c)(c)2 ii 2 1 logL() logL() ( ) logL() 1 NN ( ) 2 y N y i N y i iy (2.32) Appropriate combinations of th e above derivations according to the Louis formula will supply the approximate ingredients of the observed information matrix with respect to each outcome. Similar expression can be derived for the Z variable. According to the literature, out of the availa ble different techniques for computation of the standard errors within the EM setting, Loui s method is best suited for adaptation to the
PAGE 41
31 Monte Carlo version of the EM (McLachlan & Krishnan, 2008), our simulation results show values that are much smaller than the true population values. Therefore we have decided to explore a different option that is more applicable to our situation. According to (Gamerman & Lopes, 2006), if the ergodic theorem is applied where it is possible to go from every state to every other state, the summaries (marginal point or intervals) of any real function t( ) can be consistently estimated by their corresponding es timators based on the ge nerated random sample. If for each state of the Markov chain we havect( ) c, then we can estimate the posterior mean and variance of by c 11 2 111 E()t( ) CC 1 Var()() C CC c cc C c c (2.33) Since we are employing standard regre ssion model estimation techniques, the ML estimates as well as their standard errors are di rect products of regression analyses and can be treated as functions of. Furthermore, in addition to the theoretical support for the use of the ergodic averages as estimates, the approximate confidence intervals about these estimates can be computed based on central limit theorem (Gamerman & Lopes, 2006). For example, for C = 1000, the 0.025 and 0.975 quantiles of can be consistently estimate d by the limits provided by the 25th and 975th largest sample values of In summary, based on the results of our simulation studies, the standard error estimates from this a pproach are relatively closer to the true values compared to the estimates obtained using the Louis method.
PAGE 42
32 CHAPTER 3 ADDITIVE MODEL COMPONENT OF ALV MODEL 3.1 General Introduction In this chapter and the next we formally in troduce the theoretical basis for the integration of Generalized Additive Models (GAM) into a latent variable framework to model a smoothed nonlinear relationship between the latent baseline ri sk and a distal outcome. Specifically, this requires the development of a statistic al model linking the latent variable to the outcome Z with a spline function. So we start the current ch apter by first exploring further the crucial role played by Additive Models (AM) in the assessment of variation in intervention impact. We then devote the rest of the chapter to discussing the basic theory and estimation of GAM along that line. 3.2 The Additive Model Consider a set of independent observations 1,...., p X XZ, consisting of response random variable Z and a set of predictor variablesX which may include interaction terms. The standard linear regression model fit to the data is specified as 1 p j j j x (3.1) where E(Z) and j is unknown parameter or coefficient that quantifies the dependence of the on j x Let us express the linear regression problem as iiiZs(x)e,i1,...,n;
PAGE 43
33 2 iix[0,1];eN(0,) Then (.) s is assumed to be of the form(,)sx linear in and known up to the parameters to be estimated from the data. Genera lly in parametric models including both linear and nonlinear regression models, (.) s is constrained such that the model space dimension (number of unknown parameters in ) is much smaller than n, with consequent possible model misspecification. In contrast, non-parametric and semi-parametric estimation methods allow (.) s varying in a relatively high dimension function state to avoid possible model misspecification (Gu, 2002). One such popular method is the Additive Model (AM) (Hastie & Tibshirani, 1990) which is a generalization of th e linear regression model that allows for a more flexible description of the dependence of the outcome Z on individual predictor term in X, without requiring the usual rigid parametric form fo r the dependence. Although a standard multiple regression model is additive in natu re but there is a single coefficient per predictor term to explain its relationship, which is very restrictive. The idea in AM is that a complex nonlinear relationship often requires the estimation of more than a single coefficient for each predictor in order to achieve the best prediction of the outcome variable values. The AM accomplishes this by estimating an uns pecified or non-parametric (smoothing) function for each covariate to produce a representation of the trend of Z (as a function of one or more predictors) that is less variable than Z itself. Here the estimated smooth functions serve as analogues of the coefficients in standard linear model (Hastie & Tibshirani, 1990): p jj j1s(x) (3.2) The functions are fitted using scatterplot smoothers which are nonparametric techniques that define the relationships be tween the response variable and each predictor in a flexible way,
PAGE 44
34 thereby relieving the user from the need to s earch for the appropriate transformation for each predictor (Chambers & Hastie, 1993). Note that the two models described above are both additive in their predictor effects which also can be exam ined separately, in the absence of interactions. However in AM the linear predictors are replaced with additive predictors which are represented as smooth functions of the predictors (Hastie & Tibshirani, 1990). There are multivariate assumptions underlying the AM and hence the name additive: a low-dimensional additive structure to the p-predictor function of X, that are far easier to interpret than a p-dimensional multivariate surface (Chambers & Hastie, 1993). So an additive approximation to the multivariate regression function is obtained by using a univariate smoother to estimate the individual additive terms; this way the problem of curse of dimensionality (e.g. rapidly increasing variance with increasing dime nsionality) is avoided (Xiang, 2004). Similarly, the Generalized Additive Model (GAM) (Hastie & Tibshirani, 1990; Wood, 2006) and the Generalized Additive Mixed Model (Wood, 2006) are the respective additive or smooth nonlinear versions of the Generalized Li near Model (GLM) and the Generalized Linear Mixed Model (GLMM). Also, a semi parametric fo rm of the AM can be fitted whereby a linear relation of the outcome with some predictors is assumed while unknown functional relations are assumed for some other predictors in the model and are explored by smoothers. For example, a GAM that has p non-parametric smooth functions of plus k parametric terms can be expressed in the general form p k jjmm j1m1g()s()x (3.3) where ()iiEY and ~iYsome distribution in the exponential family.
PAGE 45
35 From another perspective (Ho ller, 2005), GLMs may be seen as a special case of GAMs. For example consider a regression equation of the form 12 1 2 Interceptsxsxas a generic additive model. For a GLM the functions 1s and 2s can be polynomial, categorical, or transforms e.g. log. In a GAM one or both functions may be repres ented as non-parametric smoothers; in the former case we have the semi parametric type of GAM. The question then is how to strike the best balance between the degr ees of freedom, amount of data, and functional form (Holler, 2005). 3.3 Baseline-Treatment Interactions using GAM As already indicated in the first chapter, additive models (GAM, GAMM) are particularly useful for uncovering a potential nonlinear stru cture between an outcome and each continuous covariate (and its interaction with other predic tors) that one might otherwise miss. Consider a GAM modeling of the dependence of the mean of the outcome Z on treatment Tx (intervention=1, control=0), and the smooth functions of the baseline risk covariate and baseline-treatment interaction: i1i2ig(E[z])= +()+ (Tx)+(*Tx)iiss (3.4) In addition to the use of smoothers by the GAM procedure to estimate the dependence in the data based on the model, the smoothers are also used to estimate the distribution shapes to enhance the visual appearance of the plot of Z against the predictor (Hastie & Tibshirani, 1990), and to describe vividly the nature of the treatment-b aseline interaction (Brown, 1993; Khoo, 1997; Brown, et al., 2008). The usefulness of these m odels can be best illustrated with hypothetical situations such as described in the plots in figure 3.1 which is modified from Khoo (1997) with additions. In the plots, Y is the fitted outcome of a GAM model in which Z is regressed on the
PAGE 46
36 treatment variable plus smooth functions of base line risk and baseline-treatment interaction. On the x-axis the level of baseline risk increases fro m left to right. The dashed curve represents the treatment that is designed to reduce outcome Z relative to the control (solid line). Any tangible separation between the two curves indicates interv ention effects. The length of a vertical arrow measures the drop in Z along y-axis, thereby depicting the magnitude of intervention effects at a given level of baseline risk. Generally all the plots display a nonlinear increase of the outcome with the baseline risk irrespective of the intervention condition. In plot A the curves are parallel and the constant length of the arrows indicates constant treatment effects across all levels of the baseline risk; hen ce there is no baseline-treatment interaction. In contrast there is a steady or linear increase of group difference (drop arrows) in Z as the baseline risk increases in plot B; this signifies a lin ear baseline-treatment interaction. The higher the baseline risk levels of the subject the more eff ective the intervention. In plot C the Z drop arrow length initially increases with baseline then tapers off; that is, the intervention is effective for individuals in the lower end of the risk scale but less so for the high risk individuals. The opposite occurs in plot D where the intervention is rather effective for only the high risk individuals. Plot E describes a rather interesting situation where the intervention impact is most effective in some middle region but not at the extremes of risk. Su ch situations exist whenever too little or too much of a baseline characteristic that is the ta rget of intervention is problematic and more resistant to modification. For example, either extreme on a parenting scale (too authoritative or too permissive) may lead to poorer child outcomes th an moderate scores on this scale. Lastly, it is not uncommon that program interventions may produce iatrogenic effects. As shown in plot F, the intervention is detrimental to low risk indivi duals but the impact gradually shifts to being beneficial as the baseline risk level gets higher.
PAGE 47
37 Figure 3.1 The Plots of Distal Outcome versus Baseline Risk As we can see, analyses that ignore variati on in intervention impact may not be telling the whole story since all the hypothetical situations depicted on the plots are not implausible. Apart from gaining insight as to what works and for whom, we may also uncover unintended consequences of a given intervention if there is an y. Much of this obtainable extra information is contingent on the ability to capture the nonlinear outcome-baseline relationship; this type of information may be easily lost if we are limited to linear modeling techniques. Most importantly, while GAM type models are most suitable for exposing such nonlinear dependence in the data, they can handle linearity as well. VARIATION IN INTERVENTION IMPACT ACROSS BASELINE RIS K A B C D E F
PAGE 48
38 3.4 The Best Smoothing Function Motivation For a simple illustration consider this time a se t of independent bivariat e observations consisting of outcome Z and predictor, where ii, z,i 0, 1, 2,,n and012na ...b We wish to fit a curve through the data points and plot it on a graph, infer data values between the data points and estimate some parameters from the data. Supp ose we wish to fit a simple function s() that can be easily manipulated to the disc rete data, such that it matches the data points exactly; such a function will be an inter polant. Some families of common interpolant functions include polynomia ls, piecewise polynomials or splines (segments of polynomials joined together at data points or knots); trigonometric, exponential and rational functions. Polynomials are popular candidates for interpolation because they are continuously differentiable up to all orders so that the smoot hness can be easily quantified. However, simply fitting a single high-degree polynomial function to se veral data points is plagued with excessive oscillations thereby resulting in some misfit. For this reason polynomial bases are not efficient for representing s() when we are interested in the whole domain of s() (Wood, 2006). A better alternative is to employ a piecewise polynomial in terpolation (spline bases) which allows for fitting low-degree polynomials (e.g. cubics) to interval segments on the continuum (Heath, 2005; Wood, 2006; Cheney & Kincaid, 2004). So in terms of fitting a model to sampled data from a function, the idea is to create a spline that approximates that data well. Therefore it is necessary to determine which of the low order polynomials will be most appropriate for achieving optimal smoothness and minimal error. While c hoosing the best curve fitting function is of paramount importance it should not be done arbitrarily (Wood, 2006).
PAGE 49
39 Smoothness Property If we assume that the data points represent a discrete sample of an underlying continuous function, fitting all the observed points exactly may be undesirable because the behavior of the function spanning the discrete data points will lik ely be highly variable. For example the results of plotting a candidate function ma y be unpleasing to the eyes because of excessive oscillations or sharp curvatures. A curvature is a function of the second derivative (rate of change of slope) at the given data point. Therefore for s() to be the best smoothing interpolant, it must possess the minimum magnitude of the integrated squared s econd derivatives over all data points, that is, b 2 amin[s''()] from among all other interpolating func tions (that are differentiable up to second derivatives) over the same set of data points. Let 2 iiiiizs()e,E(e)0,Var(e) We wish to estimate the unknown function s() without specifying a form for s except that s belongs to a class of suitably smooth functions. So in terms of data fitting we seek a general solution to the penalized least squares criterion (l east squares criterion with respect to optimal smoothness), specified as b 22 ii a i[zs()][s''()] (3.5) where is the smoothing parameter. The first term in (3.5) measures approximation to the data, and the second term controls smoothne ss by penalizing larger curvatures.
PAGE 50
40 Theorem (Cheney & Kincaid, 2004; Heath, 2005) For a given, there exists an interpolant s() for the set ii, z where, of all twicecontinuously differentiable functions f() that interpolate ii, z s()f() is the smoothest interpolant, i.e. an explicit, unique minimizer of (3.5) in the sense of having the smallest integrated squared second derivative over ii, z Thus we have the following Lemma: bb 22 aas"()f"() (3.6) Proof We need to show that if certain conditions are satisfied, s() will qualify as the smoothest interpolant. Since s() and any other f() are interpolants with knots at all the data points in the interval, the functions must be equal ati ; hence it follows thatiif()s() and also 22 iiii ii[zf()][zs()] Therefore we let g()f()s()0 such that f"s"g" By expansion bbbb 222 aaaa(f")(s")2s"g"(g") Note that we are mainly interested in the magn itudes of the integrated squared second derivatives. Hence we see that the inequality bb 22 aa[s"()][f"()] will be true if the integralb as"g"0 so that
PAGE 51
41 bbbb 2222 aaaa(f")(s")(g")(s") Therefore, to prove our theorem we next need to show that this integral equals zero under certain specified conditions which must be satisfied bys() We set out to accomplish the task by integrating by parts. Using the formulauvuvvu lets"u,g"v then we have bb b a aas"g"s"g'|s'''g' Now, the first set of conditions is: s''(a)0 ands''(b)0 that is, the second derivatives at both end points 1ax and nxb must be set to zero. This done, we will then have bb aas"g"s'''g' If we break the interval [a,b] into its n-1 segments of component functions joined together at the knots, the equation becomes discrete summation over all segments, that is i1 in1 bb aa i1s"g"s'''g's'''g' Next, it is required thats''', the 3rd derivative at each unit interval ii1[,] be a constant, sayic, a property of cubic polynomial at each interval, so that we have i1 i1 i1 iiikn 1n 1n 1 x iiii 1i x i1 i1 i1 i1S'''g'xcg'cg'cg()g()0 The last term above equals zero because we specify at the beginning thatig()0 for every knot, thus the proof.
PAGE 52
42 So far we have determined a number of conditions that must be imposed on s() for it to be the smoothest interpolant: be a cubic sp line with knots at the unique values of, and the second derivatives at the end points set to zero. By definition, the function that satisfies these conditions is a natural cubic spline (Cheney & Kincaid, 2004). 3.5 Cubic Splines There are several types of splines in the l iterature and the typology may be associated with how the splines are represented, the spaci ng of the knots, and type of other conditions imposed. For example, in B-splines basis functions are used for the entire spline, interpolating splines require that the splines include some give n values, zero second derivatives are enforced at the end knots to yield natural splines, and uniform splines have evenly spaced knots; just to mention a few. As already shown, the natural c ubic splines are the best available curve fitting functions (Cheney & Kincaid, 2004; Wood, 2006). A k-degree spline function is a function consisting of k-degree polynomial pieces joined together and are continuously differentiable k-1 times (Heath, 2005). A cubic spline (k = 3) is a twice continuously differentiable piecewise poly nomial function. The connection points of the polynomial pieces plus the two e nd points are known as the knots of the spline. The polynomials join smoothly at these knots because the cubic splin e is continuous up to second derivative across the knots (Wood, 2006). Supposing an N-vector (single predictor variable) is divided into n intervals so that 012n. represent n1 unique values. Let different cubic polynomials be fitted to each intervaljj1,;j1,...,n In its standard representation the knots of a cubic spline
PAGE 53
43 coincide exactly with the unique values of in the data; and the 1st and 2nd derivatives including the values of the cubic spline at the knots are specified to yield a number of equations and polynomial coefficients (parameters) to be es timated. Each cubic polynomial piece joins two adjacent knots and has four unknown coefficients 's whose values vary from one piece to the other. Given n intervals in the piecewise polynomial, there are n+1 (or q) knots, thus there are n different cubics and 4n spline coefficients in all. The esti mates of the coefficients are therefore simultaneous solutions to a system of linear equa tions. To get a unique set of solution, it is required that the number of equations and parameters be equal. Thus for a simple standard representation of a natural cubic spline to be fitted to the set of n+1 knots, a total number of 4n equations is formed with continuity conditions imposed on the cubic polynomials as listed in Table 3.1 below (Heath, 2005; Cheney & Kincaid, 2004): Table 3.1 Cubic Spline Interpolation Following an example that is illustrated in (Heath, 2005), suppose we wish to estimate the natural cubic spline function that interpolates three data points jj, z,j 0,1,2. So we have n2 Three Continuity Conditions Number of equations 1 Each cubic to pass through the 2 knots at either end of its interval j+1, j 2n 2 1st derivatives of adjacent cubics to match at each of n 1 interior knots (0,n)jj n 1 3 2ndderivatives of adjacent cubics to match at each of n 1 interior knots (0,n)jj n 1 4* 2ndderivatives of first and last cubics to be fixed at zero at endpoints 0 and n 2 Total number of equations 4n addition of this specification results in a natural cubic spline function
PAGE 54
44 intervals 0112(,),(,) with two cubic polynomials joined at the 3 knots to represent the cubic spline; and 4n8 simultaneous equations to estimate 8 parameters a,b in the two polynomials denoted as 23 11234 23 21234p()aaaa p()bbbb (3.7) The 2n4 equations satisfying continuity condition 1 in the table 3.1 are specified as follows: 23 012030400 23 112131411 23 112131411 23 212232422At:aaaaz At:aaaaz At:bbbbz At:bbbbz (3.8) Condition #2 requires the first derivatives of the two polynomials to match at the lone interior point: 22 12314123141At:a2a3ab2b3b (3.9) Similarly, condition #3 with respect to the second derivatives gives the equation: 1341341At:2a6a2b6b (3.10) The final two equations satisfying the 4th condition relate to the endpoints: 0340 2342At:2a6a0 At:2b6b0 (3.11) The above representations and notations are for the very basic conventional spline where the knots coincide exactly with the input data points. Typically less number of knots than data
PAGE 55
45 points are chosen and may be evenly spaced over the range of values of that is constrained to between 1 and 0 (Wood, 2006). Alternatively the knots may be placed at the quintiles of unique values distribution of. We will revisit how to determine the optimal number of knots later in this chapter. 3.5.1 Representation of Natural Cubic Splines A critical objective of GAM fitting is ensuring that the chosen smoothing function is the best smoother, as well as fits or summarizes the data well. This property is related to how the smooth function is represented. The representation of the smoothing function can take many forms and can be very complicated and intimidating especia lly for those forms that are most suitable for computation and general practical use. Therefor e, representing the smooth functions and choosing how smooth the functions should be are two critical issues of major theoretical importance in additive modeling (Hastie & Tibshirani, 1990; Wood 2006). There is more than one approach to representing GAM depending on the method of estimation. The estimation by backfitting technique (Hastie & Tibshirani, 1990) iteratively fits each smoot hing component to its partial residuals until the individual components no longer change (convergence) but automatic smoothness selection is very costly (Wood, 2006) Another approach to estimation is penalized regression splines; this involves choosing some ba sis functions defined as the space of functions of which the smoothing function is an element (Wood, 2006). Here the degree of smoothness of model terms is estimated as part of the GAM algorithm (Wood, 2006), therefore we prefer this latter approach for our work. The estimation of degree of smoothness is not integrated into the backfitting procedure (Wood, 2006) and the degree has to be chosen by the user.
PAGE 56
46 To illustrate the basic principles, we will again use a simple regression model of the outcome Z with a smooth function of the single predictor : iiizs() (3.12) A proper representation of (3.12) requires that it becomes a linear model. One way to achieve this is by choosing for (.) s some basis functions and treating them as known (Wood, 2006): L ll l1s()b() (3.13) A basis for () s defines the space of all functions of which () s or its approximation is an element. With ()lb as the lth basis function andl the lth parameter, substituting (3.13) into (3.12) results in a linear model (Wood, 2006) so that estimation methods for linear model such as least square method can be employed. For example, a basis for the space of cubic or less order polynomials is 23 1234b()1,b(),b(),b() in which case we have 23 1234s() (3.14) The above representation is for a single 4th degree polynomial fitted to in its entirety. As previously noted, the natural cubic spline is th e best smoothing function; in which case we have divided into intervals and we fit a cubic to each segment. For a similar purpose, a modified representation of cubic spline function can be made. Let the knot locations be* and the number
PAGE 57
47 of the chosen knots be q, where q therefore represents the dimension or rank of the basis. The rather complicated bases for the cubic spline (Wood, 2006) are 12j 2j b ()1,b(),b()R(,),forj1.....q2 (3.15) where, if we let t represent the jth knot location *j 22 42R(,t)t1/21/121/21/12/4 t1/21/2t1/27/240/24 (3.16) The result is a linear model representation of Z which then allows for model estimation by least squares: q2 12jj2 j1zz s ( ) s() R(,) X (3.17) where is a q-vector of real valued coefficients and the ith row of model matrix X is ** iii1i2 iq21,,R(,),R(,),..............R(,) X Further technical details about the cubic spline bases formulation can be found in (Wood, 2006). 3.5.2 Penalized Regression Cubic Splines Once a basis has been chosen for each smooth in the model, next it is necessary to control the degree of smoothness. One method for doing this is to fix the basis dimension q (number of knots) at a slightly higher level than necessary a nd add a wiggleness pena lty to the least square fitting criterion (Wood, 2006). That is, fit model to the data by minimizing
PAGE 58
48 1 2 2 0ys ()d X (3.18) over all twice continuously differentiable functions (.) s having integrable second derivatives. The first term in (3.18) measures the goodness-of-fit to the data and from here the wiggleness of the function arises; and without a pe nalty term the model becomes strictly an interpolation of q knots. The second term in (3.18) represents quantified wiggleness multiplied by It penalizes the first term. The tradeoff between the wiggleness (how closely the data points are tracked) and smoothing (for visual pleasing and ease of interpretation) is controlled by the smoothing parameter which weights the wiggleness. Whens"()0 a constant slope is implied, that is s() is linear, in which case we have the standa rd least squares problem. Otherwise, whens"()0 (and therefore 2[s"()] is positive), the slope is changing and nonlinearity is present; therefore as approaches infinity the penalty term also approaches infinity. Obviously the penalty term then needs to be calibrated. For example 0 implies an un-penalized regression estimate (Hastie & Tibshirani, 1990; Wood, 2006). Too low causes the model to fit the signal plus the noise; the resulting excessive tracking or extra variability will lead to poor prediction of the missing datum by the model. The idea is to choose the best value for that will allow a candidate additive model to maximally predict data to which it was not fitted. Fortunately there are algorithms for finding the optimal value for including the ordinary cross validation (OCV) and the generalized cross validation (GCV); basically both methods find that minimizes the difference between the true function () s and the spline estimate () s : 2 11 ()()n ii iCVss n
PAGE 59
49 Since we do not know() s the cross validation (CV) cannot be calculated directly, instead the expected squared error in predicti ng a new variable is derived as 2() ECV and worked with in slightly different ways in the two methods; deta ils of which can be found in (Wood, 2006). The GCV approach has computational advantages over the former; hence GCV is preferred for searching for the optimal that is, selecting the degree of smoothness (Wood, 2006). Whereas the approach to model estimation in AM is by penalized least squares, the method of choice for estimation in GAM is penalized likelihood ma ximization which in practice is achieved by penalized iterative least squares (Wood, 2006) For detailed information about the crossvalidation techniques and the model estimation methods, please refer to (Wood, 2006). In the GAM procedure according to the mgcv package (R Development Core Team, 2008), the effective degrees of freedom (edf) is automatically calculated as a mathematical function of and reported in the model output. A higher edf corresponds to greater nonlinearity. 3.5.3 Estimation in Penalized Regression Splines Expanded details of the estimation process de scribed in this section can be found in (Wood, 2006). Briefly the penalty term in (3.18) being linear in the parameters can be reexpressed in a quadratic form of (for cubic splines) 1 2 0 ** i2,j2ijs"()dx SR, i,j1,...,q2 S (3.19) where Sis a matrix of known coefficients with its first two rows and columns equal to zero. It follows that fitting the model reduces to minimizing
PAGE 60
50 2y XS (3.20) w.r.t. given. An optimal smoothing parameter is chosen using the method of generalized cross validation. For an additive model involving two smooth functions, penalized regression spline basis is used for each smooth func tion. Consider two predictors U and with all values constrained to lie in[0,1]: 1 22 i1i2iii q2 112jj2 j1 q2 212jj2 j1ys(u)s()e;ei.i.d.N(0,) s(u)uRu,u s(v)vR, (3.21) where 1 q and 2 q are the number of parameters to be estimated for the corresponding smooth function. For identification, either of 1 or 1 is set to zero. The ith row of the model matrix for the linear model form y X becomes 12***** iii1i2iq2ii1iq21,u,R(u,u),R(u,u),...,R(u,u),,R(,),...,R(,) X (3.22) To estimate the parameters1212q2q[,,...,,,...,] we minimize the least square objective 2 1122y XSS (3.23) For non-normal data the Generalized Additive Models (GAMs) are set up as penalized GLMs and the model is fitted by penalized likelihood maxi mization using penalized iterative least square. For an example of a model that includes both non-smoothed and smoothed terms including interaction term, let *i X represent the model matrix of the strictly parametric (non-smoothed)
PAGE 61
51 component of the model with its associated parameters while the j s are the smooth functions; we have j* i1i2ii q jjjijij i1gE(y)s()s(,x)... s()b() X (3.24) with g as the known link function. To make the mode l identifiable, the model matrices for each smooth term is meanor sum-centered at zero, and the model can then be represented as 12 12gE(y) [X:X:X:...] [,,,...] X X (3.25) To suppress the wiggleness contribution from each jjs(x) the likelihood ()L of the model is penalized to obtain()pL : p jj j1 L()L() 2 S (3.26) where the smoothing parameters j control the wiggleness and are themselves estimated. For the proof and the iterative estimation process the reader is referred to (Wood, 2006). 3.6 Goodness of Fit and Model Comparison For each regression equation in the ALV mode l we applied the generalized linear model (Nelder & Wedderburn, 1972) so that each regressi on model specification is in terms of the linear predictor X So the deviance is output directly by the standard GLM/GAM procedure, and is defined as
PAGE 62
52 sat 2 np D2[logL()logL()] D where log() s atL is the maximized log likeliho od of the saturated model, n is number of observations, p is number of identifiable parameters, and the scale parameter 1 for the Normal and Binomial distributions used in the development of the ALV model. Note that there are C columns of N-vectorgenerated in each EM cycle as MCMC samples (N and C are number and length of MCMC chains respectively). For p response variables there are p univariate regression models fitted for each N-vector repeatedly across C columns, to yield a pC matrix of each element of the model fit results. One of these elements is the deviance D directly estimated by each regression model. Then the average deviance is computed over the C columns to produce a set of average values 1p{D,...,D}for the p sub-models. So there are C univariate regression model fits yielding C deviances w.r.t. each response variable. These p deviances are then summed for the system of regression equations to give a total deviance D which indicates the overall log likelihood of the ALV model. So, to compare nested ALV models 1 and 2 we can perform the likelihood ratio test, where with hypothesis testing based on large sa mple limit we have approximately 212 p pD1D2 Non-nested ALV models can also be compared on the Bayesian information criterion (BIC) and Akaike information criterion (AIC) also wh ich we are able to compute as follows: AICD2*p BICD(logn)*p
PAGE 63
53 The better fitting of the two ALV models will pr oduce lower values of either statistic. We considered computing the BIC and AIC also at the sub-model levels and finding the average as done for the deviance, however we believe th at more simulation studies will be required specifically to investigate which approach should be better, and this should be a subject of future study.
PAGE 64
54 CHAPTER 4 COMPUTATIONAL DEVELOPMENT OF ALV MODEL 4.1 Computation Steps The proposed ALV model was developed and written entirely in R language using the R 2.8.1 statistical application (R Deve lopment Core Team, 2008). The latent vector was simulated using a random walk Metropolis algorithm avai lable within the Markov Chain Monte Carlo Package (MCMCpack version 0.9-4) written by Martin, Quinn, & Park (2009) in R. To simulate the random vector we employed the R function MCMCmetrop1 available from the MCMCpack to construct a Markov sample fro m user-defined conditional distribution of using a random walk Metropolis algorithm. For dia gnostic purpose the output of the MCMC simulations is analyzed with the CODA (Convergence Diagnostics and Output Analysis) package (Plummer, Best, Cowles, & Vines, 2006) that comes with the MCMCpack. The steps involved in the extended MCEM computations are graphically displayed in Figures 4.1a-c reflecting summaries of th e equations (2.20) to (2.28) For the k th MCEM iteration, a single chain Markov sample of size C was drawn from the conditional distribution of for the ith subject in the E-step via the Metropolis algorithm (Figure 4.1b). This yields for all subjects N independent Markov samples stored in an NCmatrix. Each column of this matrix constitutes independent obse rvations and was plugged into the Q function one column at a time to substitute for, given the current ( k th) parameter estimates. The availability of estimated Nvector as a predictor variable then allows for the new (k+1) th MLEs and their standard errors
PAGE 65
55 to be obtained at the M-step as direct outputs by fitting standard regression models including GAM (Figure 4.1c). The ALV model at this stage accepts continuous indicator variables (Ys), one continuous or binary distal outcome (Z), and a two-category group or treatment variable (GRP). In addition it can also accept a cluster variable as a random effect; however in its current form he ALV model can optionally include the cluster vari able in its analysis only at the final EM iteration. That is, the Additive component will switch from GAM to GAMM in the final EM iteration to accommodate the the clustering factor in the data. Technically the GAMM analysis procedure combines Linear Mixed Model (LME ) with GAM within its algorithm (Wood, 2006). Figure 4.1a ALV Algorithm Flow Chart: Overview of EM Setup 1log()log(,|;)N iii iLpyz (1) ()argmax(,)kkQ (1)(|,,)k iiipyz (1) k 1 ()() 1 () 1log()log(,,;) (,)log(,,;)|,, log(,,;)(|,,)N ciii i N kk iiiii i N k iiiiii iLp yz QEpyzyz pyzpyzd
PAGE 66
56 Figure 4.1b ALV Algorithm Flow Chart: Implementing the E-Step via MCMC Process () () () () ()()(,,;) (|,;) (,;) (,,;)/ (|;)(|;)()/k k iii iii k ii k iii kk iiyiizipyz pyz pyz pyzh py pzph 1/2 /21 21/22 2 /21 (|;)(2 )exp()(), 2 1 (|;)(2 )exp(), 2 1 ()(2 )exp. 2p iiy iiii iiz ii m ii ify y y fz za f *()*()* ()* ()()()()()(|;)(|;)() (,)min1, (|;)(|;)()kk iiyiizi c ii ckckc iiyiizipypzp pypzp *1 *2*2** ()1 ()2 ()2()()1 exp()()() 2 min1, 1 exp()()() 2iiiiiiii ccccc iiiiiiiiyyz a b yyz a b
PAGE 67
57 Figure 4.1c ALV Algorithm Flow Chart: Summary of the MCEM Implementation () () () 111 (,)[log(,,;)|,,]CN kck iiiii ciQ pyzyz C (1) () ()() 11 ()() 11argmax(,) 1 argmax[log(|;) 1 argmaxlog(|;)kk CN ck iiy ci CN ck iiz ciQ py C p zw C The main parameters to be estimated require start values. We adopt the following scheme to facilitate fast convergence and efficiency of the AVL algorithm. The calculated sample means and variances of the Ys are employed as start values for the Y-intercepts and measurement error variances while Y-slopes are arbitrarily assigned start values, e.g. 1.0. For the GAM component, Z is regressed on GRP and Ys to obtain a start value for the Z intercept, but the initial error variance 2 is also obtained from the sample variance. The above first line start
PAGE 68
58 values are then used to compute the empirical Bayes (EB) estimates to serve as start values for the N-vector. Alternatively, a standard normal rando m sample can be gene rated as the initial vector and we found this to work as well for our simulated data but in general this is not our preferred choice. It is important to choose start values that allow the MCMC chain to start as close to the center of the target distribution (conditional distribution of) as possible (e.g. EB estimates, approximate MLEs) as this will greatly reduce the required burn-in time and facilitate a well mixing chain (Walsh, 2004). In a well mixing chain the entire space of the target distribution is sufficiently sampled. In a situati on where the target distribution has multiple peaks, a simulated annealing approach would be an a lternative for obtaining start values on a singlechain such as ours (Walsh, 2004), but our target distribution is uni-modal. The pseudocode for ALV model is as follows: Step 1. Preliminary Dataset: Arrange variable columns in the order1p{Y,...,Y,Z,GRP,Cluster}. Remove rows with missing values. Start values Parametric coefficients: supply 0's compute 2 000's,'s,'s Generate or compute the initial vector 0 Nonparametric coefficients: regress Z on 0 using the GAM function to obtain initial MLEs of 0 and 0's Use 0 (rescaled to lie in [0,1]) to construct initial GAM model matrix 0Xmat
PAGE 69
59 Step 2. Start EM Loop for k = 1 to maximum iteration do until convergence update EM counter, parameters, vector to give k th values create matrices for holding results generated in k th iteration Metropolis Loop (generates MCMC samples) for i = 1 to N do update subject counter input: subject level data: ith row of (i) dataset (ii) vector (iii) Xmat apply MCMCmetrop1R function to th e input data (subject level) output: NC -matrix; the rows consist of N in dependent Markov chains of length C; each column is an N-vector end for Regression Loop (produces MLEs) for j = 1 to C do input: (i) NC matrix consisting of columns of (ii) dataset regression models are plugged into the conditional likelihood functions: fit linear model to each Y-indicator with jth column of -matrix as a lone predictor
PAGE 70
60 fit GAM to Z (response variable) with jth column of -matrix + GRP + 2-way interaction terms as predictors output: ( k+1) th set of (i) mles {'s,'s,'s } (ii) standard errors of mles (iii) {2 00's,'s } computed from residuals of regre ssion equations (iv) deviance estimate for each model fit. end for A total of C regression equations are fitted per response variable to yield C estimates per parameter. Final ( k+1) th estimates are the average of C estimates Update Xmat Compute the row means of the NC eta-matrix to yield for N subjects Use N-vector to generate the ( k+1) th Xmat Compute convergence errors. If convergence, stop, else return to step 2. End EM Loop 4.2 Criteria for Convergence of ALV Model Convergence issues are addressed at two le vels: how to ensure that the Metropolis sampler has reached a stationary distribution; and how to diagnose convergence for the E-M iterations and ensure that the parameter estimates converge to their true values.
PAGE 71
61 4.2.1 Convergence of MCMC There are two main concerns including how to eliminate dependence on start values and how to diagnose convergence of the MCMC itera tions. Also there are two schools of thought on the appropriate approach to address these concer ns: generate multiple chains from different start values, or simply use one long chain because this is more robust with respect to poor start values (McLachlan & Krishnan, 2008). According to the authors, whether one uses one long Markov chain or uses multiple short chains, the diagnostic tests of convergence can still be fallible; so the focus of MCMC runs should be on the precision of estimation of the expectation(s). Therefore, considering the above and the fact that our ALV algorithm involves N number of MCMC runs per EM cycle; obviously we prefer the single chain approach. Later in our simulation studies, we will place emphasis on the precision of MCMC esti mates (compared to true values) in the evaluation of convergence of the ALV model. We took advantage of the available diagnostic te sts that can be conducted within the MCMCpack (Martin, Quinn, & Park, 2009) to confirm that the Markov chain converges sufficiently close to its stationary distribution. Using the R functi on raftery.diag we were able to calculate the effects of autocorrelation in a short pilot run of a Markov chain and use the results to determine an adequate length required for the chain to achieve a stationary state. If the estimated autocorrelation is high (dependence factor estim ate > 5), the required length of chain will be large and this can pose computer memory ch allenge. The memory demand can be reduced by thining the output whereby every n th consecutive value after burn-in period is selected and stored for use in subsequent analysis (Walsh, 2004). The results of the Raftery diagnosis also included the estimated number of burn-in iterations to be thrown away at the beginning of the Markov chain, as well as plots of the sampler run.
PAGE 72
62 4.2.2 Convergence of EM The determination of convergence in the Monte Carlo EM extension is not trivial; the usual standard approach is not suitable and non-convergence may be compounded by implementation or numerical errors (Lee, S ong, & Lee, 2003; McLachlan & Krishnan, 2008). According to the authors, by approximating the e xpectation at the E-step with values generated from MCMC samples, a Monte Carlo error is introduced and the monotonicity property of the EM algorithm is lost. One approach to monitori ng of EM convergence therefore is to plot the values of parameter estimates against the index of iteration and conclude that convergence has taken place if the process has stabilized with random fluctuations around the estimates (Wei & Tanner, 1990; McLachlan & Krishnan, 2008). When the number of parameters to be estimated is large (as in our case) an alternative appro ach is to monitor changes in a function of such as the log-likelihood function (Meng & Sc hiling, 1996). It is also known that the log-likelihood function can still fluctuate randomly along the EM iterates even in the absence of implementation or numerical errors (Lee, Song, & Lee, 2003), however this has been shown to be adequate for the purpose of statistical inference (Lee & Zhu, 2002; Meng & Schiling, 1996). Although we included both of these methods in our appro ach, we placed relatively more emphasis the monitoring of log-likelihood function. A special method is required to monitor convergence of a likelihood function in the EM setting. We would be interested specifically in the change in observed data log-likelihoods between two consecutive EM iterations ( k, k+1), which can be obtained from the logarithm of the ratio of the two likelihood values (logLR): (1) (1)() ()p(,z| ;) logLR(,)log p(,z| ;) k kk ky y (4.1)
PAGE 73
63 Ideally the ratio will be easy to evaluate if using marginal likelihood after integrating out. However, similar to the experience of Lee, Song, & Lee (2003) the observed likelihood in our case is difficult to obtain analytically, so we follow the authors approach (bridge sampling method) by applying the Meng and Schilings a pproximation formula (Meng & Schiling, 1996) based on the complete data likelihood with respect to c th MCMC iterate within the k th EM cycle as follows: 1/2 k,(c)(1) (1)() k,(c)() 1 1/2 k+1,(c)() k+1,(c)(1) 1p(,z, |) logLR(,)log p(,z, |) p(,z, |) log p(,z, |) k C kk k c k C k cy y y y (4.2) where k(c){ ,c=1,....C}are simulated from the conditional distribution of evaluated at kth estimates. The aim is to claim approximate convergence when the change in consecutive likelihoods along EM iterations becomes very small a nd fluctuates within a desired level, that is, the log of the likelihood ratio fluctuates near zer o. The approximation is claimed to be sufficient for the purpose of statistical inference (Meng & Schiling, 1996; Lee & Zhu, 2002). Similar to the monitoring of parameter estimates, approximate convergence is assumed when the estimated logLR approaches and fluctuates in the neighborhood of zero. Most importantly, once within the region of su ch steady fluctuation, it is necessary to come up with appropriate values of parameter es timates at convergence. Some authors obtained the average of all values within the region for the final estimates with respect to individual parameters, while some selected the parameters values at arbitrary kth iteration within the region as the MLEs (Lee, Song, & Lee, 2003). In ou r case, regression model deviances are also available as output using our algorithm. So, in order to establish a more objective criterion for
PAGE 74
64 point convergence, we decided to also monitor the ALV model deviance computed as the sum of deviances for all regression model fits comprising the ALV model. So, from within the steady fluctuation region we choose the parameter values corresponding to the point of minimum deviance across the EM iterations i.e. set of parame ters that provide the best overall model fit to the data. So our strategy for deciding EM convergence is to first monitor the log of the likelihood ratio to ensure the region of stable fluctuati on around zero is achieved, then choose parameter estimates associated with minimal mode l deviance within the stable region.
PAGE 75
65 CHAPTER 5 APPLICATION TO SIMULATED DATA 5.1 Simulation Design and Background Information A simulation study was carried out to evaluate the properties of the estimation procedure of the proposed ALV model. The simulation data structure mimics a randomized control trial investigating the variation in impact of treat ment G on an outcome Z across the levels of a baseline risk (); with the additional challenge that is unobservable and must be inferred from five observed variables Y1 to Y5. The Ys are assumed to be continuous and multivariate Normal, G has two levels, and Z may be binary or continuous with a normal distribution. Also, the Ys and Z are conditionally independent given The ALV model in this case consists of six regression sub-models corresponding to five Ys and one Z. We used different specifications, each with 50 replicated datasets, and repeated for each of three sample sizes N = 100, 200, 300 (see Table 5.1). All simulation datasets were ge nerated in R 2.81 (R Development Core Team, 2008). Complementary analyses were performed in Mplus version 5.1 (Muthen & Muthen, 2008). The simulations were used to assess two ma jor important questions for the model: (1) what is the long term behavior of the ALV model (pattern of convergence)? (2) How well does the ALV model perform under (a) different study sa mple sizes and (b) different functional forms of the relationship of Z to? To answer the posed questions we performed simulations under 12 different scenarios constructed from the combinations of the following data structures (see Table 5.1): Z-scale (continuous and binary), -effects (linear and nonlinear), and sample size (100,
PAGE 76
66 200, 300). Under each scenario we investigated (i) the optimality requirements for the MCMC sampler, (ii) the ALV model convergence character istics with a single run of the model through 100 EM iterations, and (iii) the accuracy of the ALV model estimates and their standard errors by running 50 replications. Regarding the MCMC sa mpler optimality conditions, we pilot tested the MCMC sampling to (a) fine tune the variance of proposal distribution so as to achieve a Metropolis sampling acceptance rate of between 0. 43 and 0.47, (b) determine the shortest length of Markov chain required to achieve stationarity (burn-ins), (c) assess variance inflation factor I of the data, if 4Ithen determine how much thinning is required to reduce the autocorrelation in the data to a minimum, (d) determine the op timal size of MCMC samples that is required to estimate accurately and efficiently, i.e. the minimum size that is adequate for the purpose of statistical inference. Table 5.1 Twelve Simulation Scenario s used to Assess ALV Model Performance Sample size Scale of Z Z dependence on and *Z100 200 300 Continuous Linear 1 2 3 Nonlinear 4 5 6 Binary (logit) Linear 7 8 9 Nonlinear 10 11 12
PAGE 77
67 If we can assume a joint multivariate normality for all dependent variables (Ys and Z) the ALV model analysis will mimic a simple lin ear CFA. However, unlike the conventional CFA, in addition to solving linear equations, the ALV model is also able to model complex and unknown relationships in the Z sub-model compone nt. So the major difference between a simple linear CFA and the proposed ALV model in its current formulation is found in the functional form of the regression sub-model of Z (the GAM component). Therefore for brevity we illustrate the results of our simulation study with the report on six representative scenarios that capture the span of ALV performance under two standard conditions (simplest & most complex) based on the functional form of the Z regression equation. Th e two functional forms include a continuous Z linearly related to both and G interaction term (1st row of Table 5.1), and a binary Z related to both predictor terms in a nonlinear fashion (last row of Table 5.1), and are presented as schemes 1 & 2. The first standard condition (1st row, scheme 1) allows for the assessment of ALV performance when joint multivariate normality can be assumed for the data (Ys and Z conditioned on), equivalent to a standard linear CFA. Importantly, the ALV model results in this case can be compared to the results of CFA performed by a standard statistical application such as Mplus. The second simulated condition (last row, scheme 2) enables us to evaluate the ALV model application to more complex data. Such condition includes when Z has a binary distribution conditioning on, plus the presence of a complex nonlinear dependence of Z on the Ys indirectly through. The dependence is not fully specifi ed except that the variables are conditionally independent. The emphasis in the latter evaluation is therefore on how well the ALV model is able to recover the true and uncover the functional fo rms used to generate the data.
PAGE 78
68 For all simulations, the measurement sub-mode l component of the ALV model connecting the latentand the five observed indicators ,i=1,...Niyis given by ii ; N(0,1)iiy Population values for this component are assigned as15{ ,..., }0 ,15{ ,..., }1 and 15diag{ ,..., }0.5 A standard normal N-vector was generated first, then the response variables Ys and Z were generated conditioned on While a linear model is specified for each Y, both linear and nonlinear regression models of Z (second component of the ALV model) were specified. Nonlinearity is described by inclusion of appropriate higher degree polynomial terms in the model. Let n = N/2 where N is the number of subjects in the sample; separate specification for each treatment group G is as follows: i=1:n0010i=1:ni=1:n 2 i i=(n+1):N0111i=(n+1):Ni=(n+1):N 23 i=1:n0010i=1:n20i=1:n30i=1:n i=(n+1):N0111i=(n+1):G=0:z= + +e Scheme1(Linear);e~N(0, ) G=1:z= + +e G=0:z= + + + Scheme2(Nonlinear) G=1:z= + 23 N21i=(n+1):N31i=(n+1):N+ + (5.1) For the second scheme we then simulated binary *Z to have probability p rob(z=1)=1[1+exp(-z)] where z is probability on the logit scale. Model specifications for scheme 1 were0001 = =0, 10 =0.2, 11 =0.7, and =0.5. For the second scheme we specified0010 =1, =-1, 20 30 =-0.5, =1for group 0; 01 =0, 11 =-3,21 =0.4,31 =0.6for group 1.
PAGE 79
69 5.2 Monitoring Convergence 5.2.1 Convergence Pattern: MCMC loop The studies of the MCMC convergence were ca rried out at the subject level where the conditional distribution of is randomly generated from an inner loop inside an EM iteration cycle. Guided by the Raftery diagnostic tests re sults from several runs in which we looked at the times series trace of across number of MCMC iterations, we found that a relatively shorter MCMC chain with 400 iterations after a burn-in period of at least 100 iterations is generally sufficient for the purpose of inference with the ALV model. For all our simulated data the calculated sample inflation factor due to autocorr elation was generally low at about 0.4 (less than 0.5) (Chib & Greenberg, 1995), and we found that th inning of the MCMC samples did not change our results in any significant way. The accepta nce rates for the Metropolis algorithm ranged between 0.40 and 0.47 (Lee & Zhu, 2002; Gamerman & Lopes, 2006). For a snapshot illustration the results of three Metropolis sampling runs are shown in Figure 5.1; the purpose here is to compare the graphical outputs of the Raftery test for three different burn-in periods. The dataset used was ge nerated according to scheme 1, and the results for the subject sample size N =300 is reported he re. Potentially N trace/density plots could be generated in each EM cycle for all the subject s in the sample. However each run producing a trace plot in the Figure 5.1 occurred in the kth EM cycle and was carried out on the ith subject randomly selected from the subjects sample stratified by treatment group. Each plot in the left column depicts a trace of accepted i values across 400 random-walk Metropolis samplings. Note that by default in R (difficult to override) the N in the density plot label (right panels) represents MCMC sample size (for the ith subject) and not subject sample size; this confusion with use of symbols arises in this instance only. So each density plot depicts the distribution of
PAGE 80
70 MCMC samplings from a single run for the ith subject in the kth EM cycle; under different burnins. For the trace plots (left panels), the horizontal scale starts from the (b+1)th MCMC iteration after a burn-in period of length b. Any long flat segment of the trajectory corresponds to iterations where all proposed values were being rejected; this is not desirable. The presence of multiple vertical spikes indicates well explored sampli ng space. We want the Markov chain to be well mixing and this is achieved when the time series looks like white noise (Walsh, 2004). In addition, if stationarity has been reached the average value of across the iterations should be approximately linear and horizontal; if it appears to be drifting, it may su ggest inadequate burn-in period. The results reported here in Figure 5.1 are re presentative of our findings for several subjects with different sample sizes under the different model specifications we tested. They show a fair settling of the traces (linear trend) with good mixing produced with the three choices of burn-in, therefore we found the shortest burn-in period of 100 to be most efficient. In addition, apart from the relatively greater computation time and memory demand by the longer burn-in periods, we did not see any noteworthy difference in the model estimates under burn-in periods of 100 or more.
PAGE 81
71 Figure 5.1 Trace and Density Plots of Markov Samples for Individual Subjects (Scheme 1) 5.2.2 Convergence Pattern: EM loop We report here the results of our investiga tion of the long term behavior of the ALV model estimation process with applications to simulated data. For each ALV model run we monitored across the EM iterations (i) convergence errors, (ii) approximate log of observed Burn in = 100 Burn in = 200 Burn in = 400 100200300400500 -0.6-0.5-0.4-0.3-0.2-0.10.0 IterationsTrace of var1 -0.6-0.4-0.20.0 01234 N = 400 Bandwidth = 0.03218 Density of var1 3rd Randomly Selected Subject (GRP=1) 200400600 -0 .8-0.6-0 .4-0 .20.0 IterationsTrace of var1 -0.8-0.40.0 0 .00.51.01 .52.02.5 N = 400 Bandwidth = 0.04863 Density of var1 5th Randomly Selected Subject (GRP=2) 400500600700800 -0 .20 .00 .20.4 IterationsTrace of var1 -0.40.00.20.40.6 0 .00.51.01.52.02.5 N = 400 Bandwidth = 0.04876 Density of var1 5th Randomly Selected Subject (GRP=2)
PAGE 82
72 likelihood ratio derived by bridge sampling, (iii) total ALV model deviance, and (iv) parameter estimates. We calculated convergence errors separa tely for two sets of parameters, the smoothing coefficients and the remainder parametric ML estimates. To compute the convergence error for the current EM iteration given P para meters we apply the formula: P 2 pp p1newerroroldestimate-newestimate (5.2) Two representative plots of sequences of c onvergence errors across 100 EM iterations are displayed in Figure 5.2. The two convergence erro r curves in either plot (dashed line for the parametric set of estimates and solid line for the smoothing spline coefficients) show dramatic drop before flattening out. The steady portion of each trajectory is fairly linear for the parametric set but values of the smoothing coefficients show random fluctuation with in a small range. Note that the starting convergence error for the parametric set is relatively small for scheme 1 that corresponds to linear CFA analysis. This is because we started very close to the true values of vector by using the empirical Bayes estimates of as start values in the Metropolis algorithm. However such approximation of is less accurate in scheme 2 where multivariate normality does not hold, therefore the corresponding starting convergence error in this particular case is expectedly higher. The patterns are otherwise rather similar.
PAGE 83
73 Figure 5.2 Convergence Errors versus EM Iteration In the same ALV model run (scheme 2), the consecutive values of log of likelihood ratios and the summed deviances from all six regression sub-model estimations were plotted against the index of EM iterations (Figure 5.3). From the top graph we see that the log of likelihood ratios curve quickly approaches zero and thereafter con tinues to fluctuate within a narrow band around zero. This pattern is consistent with reports of previous similar studies in which the authors used the Monte Carlo EM (MCEM) approach (Lee & Song, 2007; Lee & Zhu, 2000; McLachlan & Krishnan, 2008). In the bottom graph of Figure 5.3 the total deviance scores had been rescaled so that the minimum value equals zero. The deviance curve reaches a minimum in the 4th iteration (vertical dashed line) before stabilizing; ba sed on our criteria for convergence we concluded convergence at this point. A trajectory with an early pit followed be steadiness has been the typical finding from all our simulations results describing the trace of ALV model deviance. Therefore we believe that the bottom of the pit likely represents a global minimum on the trajectory. Although the linear model structure probably indicates this is the case, we cannot Scheme 1b: Continuous Z, linear model, N = 200 Scheme 2: Binary Z, nonlinear model, N = 200 020406080100 0.00.51.01.52.02.53.0 IterationConvergence Errors {mu's, lambda's, theta's, sigma^2's} beta's 0 20 40 60 80100 0.00.51.01.52.02.53.0 IterationConvergence Errors {mu's, lambda's, theta's, sigma^2's} beta's
PAGE 84
74 assume this in general. For efficiency, once convergence is decided at this minimum model deviance, the ALV algorithm is terminated at a couple of EM iterations afterwards. Figure 5.3 Scheme 2: Binary Z, Nonlinear Model, N=200 Just for completeness we also monitored the individual parameters in the parametric set of estimates just to explore how these parameters behave as the iterations in EM increase; some results are displayed in Figures 5.4 and 5.5 where start values on the Y-axis correspond to zero iteration. From these Figures, the residuals (thetas, sigma^2) and z-intercept (GAM component) stabilize rather quickly by the 4th iteration, our decided point of convergence; however the ys and lambdas (measurement intercepts and slopes) show gentle steady increase and only start to stabilize as from around the 100th EM iteration. Again, from all of our simulation the results the model deviance typically becomes stable as from around 10th to 20th iteration after the minimum deviance has already been achieved; 02 04 06 08 01 0 0 0.000.15 IterationLog of likelihood ratioLog of Observed-Data Likelihood Ratio Versus EM Iteration from the 2nd Iteration 02 04 06 08 01 0 0 040100 IterationTotal devianceScaled Total Deviance Versus EM Iteration
PAGE 85
75 similar patterns are exhibited by the model residuals. Therefore we suspect that the drifting values of the y-intercepts and lambdas indicate possible mu ltiple solutions; this may need to be explored further in future studies. We believe these findings are further justifications for our approach of choosing solutions at the point of minimum model deviance as these solutions will be better than if chosen at any other point in the iterations. Fo r these reasons, we did not see the need to extend the EM runs beyond 100 iterations just to show where the y-intercepts and lambdas finally stabilize. Figure 5.4. Sequences of Parameters (Scheme 1) Across 100 EM Iterations 020406080100 -0.10.10.30.5 iterationestimateY-intercepts 020406080100 0.51.52.5 iterationestimateLambdas 020406080100 1.01.52.0 iterationestimateThetas 020406080100 0.01.02.0 iterationestimateSigma^2 -.Z-interc --
PAGE 86
76 Figure 5.5. Sequences of Parameters (Scheme 2) Across 100 EM Iterations 5.3 Assessing Performance of ALV Model We performed 50 replications of the ALV m odel analysis using datasets of different sample sizes (N = 100, 200, 300) generated accord ing to the scheme 1 where joint multivariate normality is assumed for the {Ys, Z}. For the Monte Carlo part of each replication, we used burn-in =100, MCMC sample = 400. To be c onservative we stopped the ALV algorithm at two iterations subsequent to reaching the minimum point on the total deviance curve. This is based on our consistent findings of an early convex shape (pit) before a prolonged flat trajectory for all the plots of deviance against EM iterations in our simulations for studying ALV model convergence (see section 5.2.2). 020406080100 -0.7-0.4-0.1 iterationestimateY-intercepts 020406080100 0.51.52.5 iterationestimateLambdas 020406080100 0.20.61.01.4 iterationestimateThetas 020406080100 -1.5-0.50.5 iterationestimateSigma^2 -.Z-interc --
PAGE 87
77 To evaluate the overall accuracy of the ALV model we calculated the following summary statistics for each parameter esti mate based on 50 replications: Bias This was calculated as the difference between the true value and the computed mean of estimates. Standard deviation (SD) This is the empirical standard deviation of the parameter estimates across replications. Root mean square error (RMSE) This was calculated as the square root of the sum of the variance (of the estimates across replications) and the squared bias. Standard error average (SE) This is the mean of the standard errors estimated by ALV model for each parameter estimate across the replications. SE/SD This ratio was used to assess the accuracy of the standard errors estimated by the ALV model. If the number of replications is sufficiently large, the empirical SD can be taken as the standard error of estimate. Therefore, assumi ng we have sufficient number of replications, correctly estimated SE should closely approximate the empirical SD. However given the extensive computations involve in our simulations we have arbitrarily limited our replications to 50. 5.3.1 Performance under Scheme 1 The replication study based on linear models (scheme 1) helps establish that the ALV algorithm was set up correctly; and the results are reported in Table 5.2. Overall, the estimates produced by the ALV model are close to their true values as evidenced by the very small RMSEs, and the values further reduce (i.e. th e performance improves) as the sample size
PAGE 88
78 increases. However, the residual variance estima tes (thetas, sigma^2) show little change across the different sample sizes, possibly masked by round-off errors. In addition, the estimated residual variances are considerably smaller than the specified values for the error terms (true values) used in generating the data, hence the hi gh values of recorded biases. Alternatively the true values of the residual variances may be approximated by replicating OLS regression equations with true eta as predictor, but we decide d that this is not crucial to our study. While the bias associated with the y-intercept estimates dec lines as sample size increases, no clear pattern is seen with respect to the estimated slopes (lambda s). The recorded bias in z-intercept estimation appears not to be influenced by sample size. Al so, while the SE/SD columns show values close to 1 for the intercepts and thetas, the values recorded for the lambdas are small. This indicates potential bias (or possibly imprecision due to in sufficient number of replications) in the ALV model estimation of standard errors of estimates for the lambda parameters specifically, although there is improvement as sample size increases. Based on our simulation findings above we belie ve that the measurement part of the ALV model may not yield unique solutions to the parameter estimates (intercepts and lambdas) under the current stopping rule we have adopted for co nvergence. As previously mentioned in earlier section, the potential existence of multiple soluti ons may be reflected in the delayed stabilization seen for the Y-intercepts and lambdas long after the thetas, Z-intercept and sigma^2 have stabilized (see Figures 5.4 & 5.5). Based on our stopping rule, convergence is diagnosed at the point of minimum deviance on the condition that the approximate observed log-likelihood ratio has stabilized (is fluctuating around zero) (see Figure 5.3), even when the Y-intercepts and lambdas are yet to. Although ther eafter the stable sequence of the model deviance did not change considerably from the minimum, it is most efficient to stop the algorithm soon after the minimum is crossed. We believe that running the model l onger than is allowed by our stopping rule will not
PAGE 89
79 yield improvement in the estim ation of the latent variablewhich is our major focus, however further studies are needed in this area. In addition to the above replication study of ALV model performance we also compared its analysis results to those obtained from a standard reference statistical application such as Mplus. Both statistical methods were applied howev er to only one copy of the simulated datasets (scheme 1, N = 300), in which a simple confirma tory factor analysis (CFA) was performed in Mplus. Although no definitive conclusion can be drawn from the results based on a single replication, the following comparison analyses o ffer a glimpse into some other aspects of the performance of ALV model. We found that the results of both analyses (see Table 5.3) are similar; although relatively smaller standard errors are recorded for the ALV model, the residual variance estimates from both models are close. Next we used the results of the same set of analyses (one replication, N=300) to make comparisons between (1) the true (2) Empirical Bayes (EB) estimates of obtained from Mplus output, and (3) MCMC estimated from the ALV model. As revealed in Figures 5.6 and 5.7, the ALV model estimated is nearly identical in distribution to both true and EB estimates. This suggests that the ALV algorith m is able to accurately estimate the latent(conditional distribution) underlying the outcome variables Ys and Z in the data. These results based on a single dataset are only preliminary; the accuracy of ALV model in estimating the latent factor will be examined further with replication studies later under scheme 2.
PAGE 90
80 Table 5.2 ALV Model Estimation Performance under Scheme 1 (50 Replications) Conditional on, Ys are Linearly Related to Z N = 100 N = 200 N = 300 ParameterPop Bias SE/SD RMSE Bias SE/SD RMSE Bias SE/SD RMSE Y1 intercept 0 0.161 0.906 0.026 0.085 0.982 0.007 0.012 1.029 0.000 Y2 intercept 0 0.159 0.896 0.026 0.088 1.071 0.008 0.010 1.066 0.000 Y3 intercept 0 0.160 1.003 0.026 0.090 0.816 0.008 0.011 1.017 0.000 Y4 intercept 0 0.164 1.276 0.027 0.090 1.198 0.008 0.011 1.019 0.000 Y5 intercept 0 0.160 1.050 0.026 0.090 0.934 0.008 0.013 1.094 0.000 lambda1 1 0.076 0.493 0.007 0.118 0.649 0.014 0.091 0.720 0.008 lambda2 1 0.084 0.518 0.009 0.119 0.571 0.014 0.090 0.613 0.008 lambda3 1 0.085 0.532 0.009 0.118 0.604 0.014 0.091 0.651 0.009 lambda4 1 0.078 0.510 0.008 0.116 0.474 0.014 0.093 0.657 0.009 lambda5 1 0.083 0.487 0.009 0.115 0.576 0.014 0.093 0.625 0.009 theta1 0.25* 0.218 NA 0.047 0.218 NA 0.047 0.217 NA 0.047 theta2 0.25* 0.216 NA 0.047 0.217 NA 0.047 0.217 NA 0.047 theta3 0.25* 0.218 NA 0.048 0.217 NA 0.047 0.217 NA 0.047 theta4 0.25* 0.216 NA 0.047 0.216 NA 0.047 0.217 NA 0.047 theta5 0.25* 0.217 NA 0.047 0.218 NA 0.047 0.217 NA 0.047 Zintercept 0 0.002 1.064 0.000 0.002 1.188 0.000 0.032 1.009 0.001 sigma^2 0.25* 0.215 NA 0.046 0.211 NA 0.045 0.210 NA 0.044 Variance of error term use d in simulatio n
PAGE 91
81 Table 5.3 Results of ALV and Mplus Analyses of a Single Dataset (Scheme 1, N=300) Conditional on, Ys are Linearly Related to Z MPLUS ANALYSIS ALV ANALYSIS Parameter Estim S.E. Estim S.E. Y1 intercept 0.007 0.061 0.014 0.010 Y2 intercept 0.034 0.060 0.027 0.011 Y3 intercept 0.000 0.060 0.007 0.012 Y4 intercept 0.009 0.059 0.002 0.010 Y5 intercept 0.029 0.059 0.023 0.010 lambda1 1.031 0.047 1.104 0.011 lambda2 1.023 0.047 1.096 0.011 lambda3 1.014 0.046 1.086 0.012 lambda4 1.011 0.046 1.082 0.011 lambda5 1.001 0.046 1.072 0.011 theta1 0.038 0.004 0.031 NA theta2 0.040 0.004 0.033 NA theta3 0.047 0.005 0.040 NA theta4 0.038 0.004 0.031 NA theta5 0.035 0.004 0.029 NA Zintercept 0.037 0.019 0.008 0.012
PAGE 92
82 Figure 5.6 Boxplots of Eta Produced from Three Sources (Scheme 1, N = 300) Conditioned on Ys are Linearly Related to Z Figure 5.7 Q-Normal Plots of Eta Produced from Three Sources (Scheme 1, N = 300) Conditional on, Ys are Linearly Related to Z -3 -2 -1 0 1 2 3TRUE -2 -1 0 1 2 3MPLUS -2 -1 0 1 2 3ALV -3-10123 -3 -2 -1 0 1 2 3 TRUETheoretical QuantilesSample Quantiles -3-10123 210123 MPLUSTheoretical QuantilesSample Quantiles -3-10123 210123 ALVTheoretical QuantilesSample Quantiles
PAGE 93
83 To assess the quality of the fitting process of the GAM component of the ALV model, some basic residual plots were produced (Figure 5.8) using the gam.check routine in R (Wood, 2006). The closeness of the Q-Q plot to a straight lin e validates the Gaussian assumption for the model and the histogram of the residuals is consistent with normality. The plot of residuals versus linear predictors (top right) shows that the assumption of constant variance as the mean increases is not violated. The bottom right plot shows a positive lin ear correlation between the response and fitted values. Figure 5.8 Model Checking Plots: GAM Component of ALV Model (Scheme 1, N = 300) Conditional on, Ys are Linearly Related to Z The same analysis (based on a single dataset) is also used to further illustrate with an example of how convergence is decided in a simple run of the ALV algorithm with the results of the monitoring displayed in Figure 5.9. Based on our criteria for convergence, the total deviance -3-2-10123 -2-1012 Normal Q-Q PlotTheoretical QuantilesSample Quantiles -3-2-1012 -2-1012 Resids vs. linear pred.linear predictorresidualsHistogram of residualsResidualsFrequency -3-2-10123 0204060 -3-2-1012 -4-2024 Response vs. Fitted ValuesFitted ValuesResponse
PAGE 94
84 trajectory reached the minimum at the 3rd iteration (see left panels, bottom plot). Therefore convergence was decided after just three iterations and the algorithm was terminated after five iterations (two consecutive iterations following the minimum deviance point). It is seen that the approximate log of likelihood ra tio has already reached the region of zero at the chosen convergence point. Similarly the right column show s that the estimates of the residual variances and z-intercept stabilized by the 3rd iteration. However the measurement y-intercepts and slopes (lambdas) continue to drift slightly in their estima tes, as previously noted. Note that the recorded values of the log of likelihood ratio start from 2nd iteration (first likelihood ratio being between the first two iterations). For the plots in the right panels the recorded values at zero iteration correspond to the start values used in the ALV algorithm. Figure 5.9 ALV Model Convergence (Scheme 1, N=300) Conditional on Ys are Linearly Related to Z 12345 0.00.30.6 IterationLog of likelihood ratioLog of Observed-Data Likelihood Ratio Versus EM Iteration from the 2nd Iteration 12345 020 IterationTotal devianceScaled Total Deviance Versus EM Iteration 012345 -0.19-0.16-0.13 iterationestimateY-intercepts 012345 0.60.81.01.2 iterationestimateLambdas 012345 0.00.40.8 iterationestimateThetas 012345 0.000.150.30 iterationestimateSigma^2 -.Z-interc --
PAGE 95
85 5.3.2 Performance under Scheme 2 As previously stated in section 5.1, the emphasis in the evaluation of ALV performance under scheme 2 is on how well the ALV model is able to estimate conditioned on Ys and Z and uncover an unknown comp lex relationship between and Z. Although we were able simulate a complex relationship accordingly we did not have a full analytic expression for the conditional distribution ofor an appropriate existing standard statistical model for comparison (like in the linear case in scheme 1). Therefore for this assessment we compared the ALV model estimate to the true generated according to scheme 2 speci fications based on 50 replications; considering that the marginal distribution of true is directly proportional to its true conditional distribution to be estimated as by the ALV model. We comput ed 50 correlation values between and r ,r=1,...,50 using a sample size N = 300. We believe that the strength of the computed correlation indirectly reflects the closeness in values of (estimated conditional distribution) to the unknown true values of the conditional distri bution. Our results show that the correlation between and is very high in the range of .973 to .981 with mean of .975. The distribution of the calculated correlations is shown in Figure 5.10. This result indicates that the measurement component of the ALV model consiste ntly recovers the latent factor underlying the Ys and Z variables even when the solutions to the measurement parameters (y-intercepts and y-slopes) may not be unique. However we are aware that while a high correlation between and is desirable it does not necessarily indicate accuracy in the estimation of because a shift of from its true value by a constant (bias) can retain the high correlation. Therefore we took the next step to address this concern.
PAGE 96
86 Figure 5.10 Correlations between ALV Es timates of Eta and Population Values (50 Replications; Scheme 2: Conditional on, Ys are Nonlinearly Related to Z (Logit); N = 300) To further quantify the performance of ALV model in estimating we considered computing the mean square error or MSE which a ssesses the quality of the estimation in terms of its variation and unbiasedness. Ideally we would define 2 MSE()E[()] however this definition of MSE is not appropriate here because is not an estimate of the marginal distribution of rather it is an estimate of the conditional distribution. Therefore instead we compared the MSE wed get if we used the true in a regression model (GAM), to the MSE obtained by using (1) ALV estimated and (2) *measurementerror where the error term is defined as the ratio of the variance of to the variance of true For this comparison three different GAMs were fitted to 50 replicated datasets (N=300) genera ted under scheme 2. The regression models were specified (with variables represented as vectors) as follows: 0.9740.9760.9780.980
PAGE 97
87 rrrrrrrrrr *** rrrrrirrrrlogit(Z) ~ s() + GRP + s(GRP)+e; logit(Z) ~ s() + GRP + s(GRP)+e;(|y,z;); logit(Z) ~ s() + GRP + s(GRP)+e;[var()/var()]; r1,.....,50replications. (5.3) For each fitted GAM the MSE was computed as the mean of the squared residuals; residuals being the difference between the observe d and the fitted values of Z. The degree of closeness of the computed MSEs for the differe nt GAMs will reflect the accuracy of the ALV model; that is one can assess how comparable is the estimated to the true in predicting the Z observations. Similar comparison between and will allow us to assess the effects of measurement errors (associated with ) on the quality of prediction of the true. Since the MSE in the context of statistical models depends on data, it is treated as a random variable and the 50 replicated MSEs then serve as a measure of how well the three models explain the variability in the observations. As anticipated, the boxplots of MSEs in Figure 5. 11 and the related summaries in Table 5.4 show that generally there is only a slight increase in the MSEs with respect to the predictor over that of, and on the average the increase in median MSE is less than 1%. Also, the measurement errors arising from the ALV estimation did not a ffect the quality of model prediction when added to the true in the GAM. In addition the spread of MSEs for is slightly smaller than the other two.
PAGE 98
88 Figure 5.11 Boxplots for the MSEs of GAM of Z Separately on and (Scheme 2; N=300, 50 Replications) Having assessed the performance of the ALV m odel quantitatively, next we wish to use graphical tools to visually demonstrate the primary purpose of the ALV model, which is to assess variation in intervention impact across the unobserved baseline given an unknown complex relationship between the outcome Z and the predictors including baseline-treatment interaction (and G). In the following description we again sp ecifically investigated the GAM component of
PAGE 99
89 the ALV to determine how well it is able to recover the true complex relationships between a binary response Z and the predictors. To achieve this we compared two analyses. Table 5.4 Percent Change in MSEs: GAMs of Z on and Compared to (Scheme 2; N=300; 50 Replications) Value Change (%) Value Change (%) Lower Quartile 1.208 1.231 1.836 1.208 -0.001 Median 1.238 1.248 0.806 1.238 0.001 Upper Quartile 1.264 1.269 0.388 1.264 0.000 = True (simulated); = ALV estimate (|y,z) ; = var()/var() In the first analysis a stand-alone GAM pr ocedure was performed on the sample data using the true as known. For the second analysis the ALV model was fitted to the same data with estimated from the data. This pair of analyses was performed on a single sample randomly selected from 50 under each sample si ze N = 100, 200, 300; and the results are graphically displayed in Figures 5.12 and 5.13. In the first column of Figure 5.12 the outcome is model estimated logit of Z (simulated as binary) and is plotted against the true that was used to generate it. This plot is used to establish the true trajectories according to the simulation model in scheme 2. The ALV model performance is evaluated against the true trajectories directly by comparing plots in the first and third columns. Al so, the trajectories of the fitted values by ALV model (column 3) are compared to those of the stand alone GAM (column 2). Each trajectory on
PAGE 100
90 a single plot represents members of one arm of treatment. The curves are drawn with points corresponding to actual data points (and estimated logit of Z). Vertical dashed lines are drawn to partition the trajectories along the indicated quantiles of. The vertical lines serve as aids in the assessment of distribution of fitted values for individual subjects across the baseline; also comparison across modalities is made easy. Conf idence bounds are constructed at one standard error around the estimates for easy comparison on precision of estimates. From the patterns of the plots (Figure 5.12), compared to the true trajectories both GAM and ALV trajectories reveal some attenuation gene rally; otherwise the ALV trajectories are nearly identical to those of GAM. Note that there are one or two substantial ou tliers in the observed Z (logit transformed) located in the top right corn ers in column one. The presence of such outliers in data has been noted to be problematic in GAM fitting technique (Wood, 2006), apparently the outliers were not tracked to any reasonable degr ee by the trajectories produced by both GAM and ALV model. Both methods did not completely capture the true relationship between Z and, however the use of a single dataset as a basis for the comparison prevents any definitive conclusion here. Possibly these performances may also be related to the outlier problems. Single replication analyses notwithstanding, the simila rity between GAM and ALV models reflects our earlier findings (comparisons of MSEs) and suggests that when and its relationship Z are unknown, the ALV model may perform equiva lently to GAM procedure given known. Also graphically there seems an improved perform ance by both GAM and ALV models (closer approximation to the true trajectories) and increas ed similarities between the two as sample size increases (Figure 5.12). The convergence pattern of the ALV analysis under scheme 2 is again depicted in Figure 5.13. For example, the ALV model converged after 4 iterations (left column). On the right
PAGE 101
91 column it is seen that the parameter estimates (except for the lambdas) have stabilized before the convergence point. These findings are similar to those obtained under scheme 1.
PAGE 102
92 Figure 5.12 Plots of Z (Binary Outcome) Predicted by Eta (Basel ine Risk) by G (Treatment) (Resu lts Based on a Single Simulated Sample Among 50) Observed Z (logit) vs True ETA GAM fitted Z vs True ETA ALV fitted Z vs ALV estimated ETA N = 100 N = 200 N = 300 -2-1012 -2024 Vertical lines at p ercentiles of eta eta (population values)Z (logit scale, population values) 10th 25th 50th 75th 90th -2-1012 -2024 Vertical lines at p ercentiles of eta eta (population values)fitted Z (GAM ) 10th 25th 50th 75th 90th -1012 -2024 Vertical lines at p ercentiles of eta eta (ALV estimates)fitted Z (ALV) 10th 25th 50th 75th 90th -3-2-10123 0510 Vertical lines at p ercentiles of eta eta (population values)Z (logit scale, population values) 10th 25th 50th 75th 90th -3-2-10123 0510 Vertical lines at p ercentiles of eta eta (population values)fitted Z (G AM ) 10th 25th 50th 75th 90th -10 1 2 0510 Vertical lines at p ercentiles of eta eta (ALV estimates)fitted Z (ALV) 10th 25th 50th 75th 90th -3-2-10123 -50510 Vertical lines at p ercentiles of eta eta (population values)Z (logit scale, population values) 10th 25th 50th 75th 90th -3-2-10123 -50510 V e r t i ca l lin es at pe r ce n t il es o f eta eta (population values)fitted Z (GAM) 10th 25th 50th 75th 90th -1 0 1 2 -50510 Vertical lines at p ercentiles of eta eta (ALV estimates)fitted Z (ALV) 10th 25th 50th 75th 90th
PAGE 103
93 Figure 5.13. ALV Model Convergence (Results Based on a Single Simulated Sample of Size N=200 Selected from 50) 123456 0.00.3 IterationLog of likelihood ratioLog of Observed-Data Likelihood Ratio Versus EM Iteration from the 2nd Iteration 123456 0150 IterationTotal devianceScaled Total Deviance Versus EM Iteration 0123456 0.00.20.40.6 iterationestimateY-intercepts 0123456 0.61.0 iterationestimateLambdas 0123456 0.20.61.01.4 iterationestimateThetas 0123456 0.40.8 iterationestimateSigma^2 -.Z-interc --
PAGE 104
94 CHAPTER 6 APPLICATION OF ALV MODEL TO ASAPS DATA We illustrate the ALV method with an application to data from the Adolescent Substance Abuse Prevention Study (ASAPS) (Sloboda, et al., 2008). This is a cluster randomized field study involving 19,200 students in 83 high school clusters (a cluster being a high school and all its feeder middle schools) from six metropolitan areas across the U.S. (see chapter 1of this dissertation). The studys main objective was to test an intervention program Take Charge of Your Life (TCYL) delivered by selected trai ned D.A.R.E. officers, on its effectiveness in reducing some key behavioral outcomes: use of al cohol, tobacco and other drugs (ATOD). One of the major research questions was to investigate who benefits or is ha rmed by the instituted intervention program and how the intervention effect s are moderated by the baseline risk factors. The original D.A.R.E. curriculum was criticized fo r focusing on the low risk group, thinking that high risk group would be alienated by offi cers who were preaching at them. The new curriculum with TCYL program delivered by tr ained instructors was designed with sensation seeking and high risk kids in mind. The aim was to impact intentions to use alcohol, tobacco and other drugs (marijuana) by addressing baseline (7th grade pretest) beliefs as to the normative use of ATOD; perceptions of the harmful effects of use; and skills necessary to avoid substance use (decision making, resistance skills). It was hypot hesized that intervention may show different effects for low and high risk kids at baseline. The 1st wave-data (pretest data) consisted of 53 ite ms that showed significant loadings on 10 risk constructs in a previous factor analysis performed by the researchers. The item-level response scores on Likert scale were coded so that the highe st score implies highest risk. As an example of the constructs, the five items designe d to assess normative beliefs of 11th graders about alcohol,
PAGE 105
95 tobacco and other drugs are displayed in Table 6.1. To illustrate ALV model application these constructs (see Table 6.2) formed 10 summary risk variables that served as factor indicators for a single latent risk to be estimated by ALV. Some of these summary risk variables are skewed but no attempt was made to dichotomize any of them since the ALV model its current form only takes continuous measurement variables. In th e ALV analysis we examined variation in the intervention program effect on only one of the 7th wave-outcomes (substance use in 11th grade), (see Table 6.2), across the estimated baseline ri sk. This illustrative ALV analysis is neither complete nor final because of the presence of significant amount of missing data on the outcomes, for which no imputation was performed (Table 6. 3). The researchers had anticipated 50 percent attrition among the student cohort. There was substantial cross mobility of students during transition to high school from feeder middle schools. For example some students went into study high schools not assigned to their middle schools or to high schools not included in the study. In addition, one high school opted out of the study and by the time of the 11th grade survey two additional high schools affected by Hurricane Katrina were lost from the study. Therefore, for illustrative purpose, we report here on the results of fitting ALV model to incomplete data on risk measures in 7th grade and substance use in 11th grade for 2500 males from the ASAPS study (after listwise deletion of missing values).
PAGE 106
96 Table 6.1 Five Items Used in the ASAPS to Assess Normative Beliefs of 11th Graders Item Questions In the Last 30 Days, how many 8th graders across the entire U.S. do you think a) used cocaine or other hard drugs? b) drank beer, wine or liquor? c) smoked cigarettes? d) sniffed glue, inhale gases or a spray to get high? e) smoked marijuana (pot, reefer, weed, blunts)? Possible Answers Possible Scores All or almost all (100%) More than half (about 75%) About half (50%) Less than half (25%) None (0%) 5 4 3 2 1
PAGE 107
97 Table 6.2 Ten Summary Baseline Risk Constructs in ASAPS Data Construct 1 Normative beliefs 2 Referent others 3 Consequences of ATOD use on the brain 4 Personal attitudes towards ATOD use 5 Negative expectation from ATOD use 6 Intentions (to use under certain situations) 7 Intentions ( what age ok to initiate risky behave) 8 Number of best friends using ATOD 9 Pro-social bonding (school attachment) 10 Self-reported delinquent behaviors ATOD = alcohol, tobacco and other drugs (marijuana)
PAGE 108
98 Table 6.3 Some 11th Grade Outcomes and Missing Data in ASAPS Data A key feature of the ALV model lending weight to its appropriateness for analyzing the ASAPS data is that it can easily handle complex relationships in the da ta without requiring the knowledge of the relationship before hand. Nonlinearities arise in the data because of the potential variation in impact of the administered behavioral intervention on the individuals with different baseline risk experience. It is also important to note that the risk experience was not directly observed and has to be inferred from the data as a latent variable; plus, the shape of the relationship between the latent risk and the outcome (in this example, marijuana use) is unknown and is potentially complex. These are compelling reasons to specify the effects of the latent baseline risk (and its interaction with interven tion) nonparametrically. To include the cluster effects of school districts in the analysis Generalized Additive Mixed Model (GAMM) was specified for the additive part of the ALV algorith m at the final EM iteration after baseline risk has been estimated ( ) from the data, treating the clusters as random effects: # MissingProportion Missing Explanatory Variables School 00 Gender00 Treatment 4760.03 Outcomes Used Marijuana in Past 30 Days78690.46 Got Drunk in Past 30 Days78240.46 Binge drinking in Past 30 Days77580.46 Used Cigs in Past 30 Days77500.45 Used Inhalants in Past 30 Days78260.46
PAGE 109
99 ii i01iiii bilogit(E[z])Groups(Risk)s(Risk*Group)Hb; bN(0,);N(0,). (5.4) Here z is a binary response Marijuana Use; 's are fixed parameters for the model intercept and intervention group variable; s(.) is a smoothing function that estimates the unknown complex relationships of the response to the baselin e risk and its interaction with treatment; H and b are the random effects model matrix and coefficients. The partial results (additive part) of ALV model fit to the ASAPS data are reported here (Figures 6.1 & 6.2; Tables 6.4 a & b). In the context of the estimates of the nonparametric functions, the plots in Figure 6.1 describe the relationships between the smoothing terms in the model and the outcome using solid lines/curves w ithin 95% point wise confidence bands (dashed lines). Along the bottom of each plot are rug-plots at points corresponding to the covariate values for each smooth. For the whole sample (treatme nt and control), a smooth curve (top panel) is estimated with 2.97 (number in y-axis caption) effective degrees of freedom for the effect of baseline risk while the estimated interaction effect (bottom panel) is approximately linear with the outcome and so requires only 1 degree of freedom to estimate a slope. The above information could be missed if a parametric model with s(.) restricted to be linear were to be fitted to the data; although for this particular sample data, the fit of a quadratic model may be sufficiently close in quality to the ALV model fit.
PAGE 110
100 Figure 6.1 Estimated Relationship of Probab ility of Marijuana Use to Baseline Risk. ALV Estimated Baseline Risk (top panel) and Baseline-Treatment Interaction (bottom panel) Figure 6.2 is a visual display of the variation in intervention impact across the baseline risk. The dashed curves represent pointwise 95% confidence intervals around values predicted from the results of the fitted GAMM. To compute these values the R function predict() was applied to the R object for GAMM fit; the correspondi ng standard errors were also returned. The 95% confidence was then constructed around each predicted value as value +/standard error and from these generated values it was possible to draw the upper and lower limits separately around the fitted curves. The plot shows a changing direction of intervention effects along the risk scale and precisely which levels of risk are asso ciated with higher or lower marijuana use. Note 10123 -2024 risk1s(risk1,2.97) 10123 -2024 riskBYgroups(riskBYgroup,1)
PAGE 111
101 that the uniformly increasing group difference in average probability of marijuana use across baseline indicates a linear interaction effect, as re vealed by the bottom panel plot in Figure 6.1, and previously expounded in this dissertation (refe r to Figure 3.1 (B)). For the top 5% of kids on the baseline risk scale, the average probability of marijuana use is obviously lower for individuals in the intervention group relative to the controls In contrast, the intervention appears to be marginally harmful to the low risk subgroup (below 25 percentile). In summary the effect of the intervention is harmful when there is low baseline risk and gets more beneficial with higher risk. However only across the percentiles where the 95% confidence intervals show no overlap is significant intervention impact implied. There appears to be some degree of overlap across all percentiles more marked at the top end of the ri sk scale. This indicates that the intervention effects are not locally significant, that is the intervention has no significant impact on any risk subgroup at the 95% confidence level.
PAGE 112
102 Figure 6.2 The Effect of Baseline Risk on Proba bility of Marijuana Use by Group, with 95% Pointwise Confidence Bands Results of the ALV model analysis are also reported in Tables 6.4 a & b. These results are from the direct output of the Additive component of the ALV model and are supported by the iterpretations derived from Figure 6.1. In Table 6.4a both terms for the baseline risk and the interactioe are specified as nonparametric smoothi ng functions as in (5.4). Under the section on Nonlinear Terms the baseline risk (3rd row) shows significant nonlinearity (p<0.001) in its relationship with marijuana use and this effect is estimated as a smooth curve with 2.97 expected degrees of freedom (edf). However a straight lin e corresponding to edf of 1.00 is estimated for its interaction effect (4th row) and the test of nonlinearity for this term is not significant (p=0.07). It should be noted here that the p-values of smooth terms are only approximate due to the 10123 0.00.20.40.60.81.0 ALV Model of Variation in Intervention Impact on Marijuana Use in 11th Grade Males by Baseline Risk in 7th GradeLatent Baseline Risk Probability of Marijuana Use in Past 30 Days Tx (n = 1355 ) Ctrl (n = 1145 ) Risk Percentile 25th 50th 75th 90th 95th
PAGE 113
103 uncertainty in estimating smoothing parameters (Wood, 2006). According to the author, the pvalues are usually safe to rely on only when they give a very clear cut result; when the results are around a reject/accept threshold, the tests reject the null too readily and therefore must be treated with caution. Given that linear interaction effect is demonstrated in Table 6.4a, we fitted another GAMM this time using a fixed parameter for the interaction term; the results (Table 6.4b, 3rd row) show a negative linear interaction that is fairly significant (p=0.037) at the 95% confidence level. This is in support of the finding of reversal of intervention effects along the baseline demonstrated graphically in Figure 6.2; and gi ven the caution required for interpreting p-values, the interaction effects are probably not significant. For the parametric terms, we see in the 2nd rows of both tables that no significant main effect (p=0.50 ) is demonstrated for the interventi on. Finally, there is significant random effect of school districts clustering in the data (last rows). Combining all of the findings from Figures 6.1 & 6.2 and Tables 6.4a&b, in summary there is significant nonlinearity in the relationship of 7th grade baseline risk and Marijuana use in 11th grade but no significant intervention effect are demonstrated across any baseline risk subgroups.
PAGE 114
104 Table 6.4a ALV Model of Marijuana Use Reported by 11th Grade Males (N=2500; 79 High School Clusters): Additive Sub-model# Includes Nonlinear Interaction Term #GAMM: i01 iii ilogit(E[MarijuanaUse])*Interventions(Risk)s(Risk*Intervention) Type of Effect Effect Coefficient (Logit) SE z value p value Parametric Terms 1. Intercept 1.521 0.100 15.103<0.001 2 InterventionMain Effect (adjusted ) Intervention = 1 vs. Controls = 0 0.091 0.135 0.6750.500 Smooth Terms Functions ed f F p value 3 BaselineRiskSmooth (baseline) Smoothing coefficients 2.97 27.493 <0.001 4. Interaction Effect Smooth (interaction) Smoothing coefficients 1.002.9060.070 Random EffectsEffect NameSD 95% CI Cluste r School District 0.3480.219 0.555
PAGE 115
105 Table 6.4b ALV Model of Marijuana Use Reported by 11th Grade Males (N=2500; 79 High School Clusters): Additive Sub-model# Includes Linear Interaction Term #GAMM: i01 ii3i ilogit(E[MarijuanaUse])*Interventions(Risk)(Risk*Intervention) Type of Effect Effect Coefficient (Logit) SE z value p -value Parametric Terms 1. Intercept -1.552 0.102 -15.168<0.001 2 .Intervention Main Effect(adjusted ) 3 .InteractionEffects Intervention = 1 vs. Controls = 0 Interv by-Baseline 0.091 -0.418 0.135 0.200 0.675 -2.087 0.500 0.037 Smooth Terms Functions ed f F p -value 4 .BaselineRiskSmooth (baseline)Smoothing coefficients 2.97 27.480 <0.001 Random EffectsEffect NameSD95% CI Cluste r School District 0.3480.219 0.555
PAGE 116
106 CHAPTER 7 DISSCUSSION & RECOMMENDATIONS In this dissertation we have considered plausi ble variations in intervention impact due to baseline individual level risk/protective factor ch aracteristics. We also considered the importance of modeling these variations in the statistical analyses of behavioral, social and psychological research data from randomized field trials in particular, where measurement errors and nonlinearity commonly arise and pose statistical cha llenges. We reviewed the existing statistical modeling techniques that have been applied to assess these variations, such as nonlinear (polynomial terms) SEM and GAM. We highlighted their limitations including the inefficiency associated with the ad hoc approach of stepwise application of these two methods in one analysis but on different statistical application platforms. To address these cha llenges we have developed a new modeling technique, ALV, by integrating the two powerful statistical models (SEM and GAM) into one model that runs on one plat form and draws strength from both methods. We reached the following conclusions from the results of our simulation studies. First, the ALV model works well with th e tested sample sizes of 100, 200, and 300 with measurement errors. Second, this new method was successful in capturing the nonlinear dependence of the outcome on a latent variable in the data. Al so the method performs nonlinear modeling task nearly as well as it does a linear modeling at leas t in the simulation studies with sample size as low as 100. Like most existing methods in SEM our pr oposed ALV model approach is based on the assumptions of conditional independence for the baseline factor indicators and distal outcome given the underlying latent factor, plus normally distributed errors. However a notable
PAGE 117
107 distinguishing feature of the ALV modeling technique is that it makes no assumption about the relationship between the latent factor and the distal outcome. The new ALV method is developed to simultaneously estimate the latent factor unde rlying the observed baseline risk variables plus the complex relationship between the latent fact or and the distal outcome it predicts, without requiring a priori specification of a functional form for the unknown relationship. The ALV modeling is implemented in Monte Carlo EM environment and it involves the estimation of posterior distribution of the latent factor in the E-step via Metropolis algorithm while ML estimation of parameters is via standard regr ession sub-models in the M-step. The EM type algorithms are tremendously useful in solving statistical problems involving missing and latent data. In order to establish a more objective criterion for our stopping rule for convergence in the Monte Carlo EM loop within the ALV algorithm, we have taken into account the overall fit of the ALV model in addition to the behavior of para meters. Given the typical long term pattern of the ALV model deviance trace with respect to EM iterations, we are able to conclude model convergence at the point of minimum deviance, whic h we consider to be probably global within the context of our simulations. Our stopping rule is new relative to those proposed in the literature for Monte Carlo EM; and from our experience we also found our criteria (including point of minimum deviance) to be very crucial for the ef ficiency of the ALV algorithm. The criteria allow us to decide convergence after single digit number of EM iterations in most instances, because the ALM model is largely a linear model. Performance-wise, a key emphasis has been on testing the ability of ALV model to accurately recover both the latent factor (unde rlying baseline risk) as well as the complex nonlinear relationships between the outcome and the predictors. The results of our simulation studies show that the ALV model performs well. While the role of the measurement part is
PAGE 118
108 mainly concerned with estimation of the latent factor, for interpretability our focus necessarily shifts to the nonparametric (GAM) component, on which the major feature of the ALV model depends. Compared to the easily interpretable GL Ms, GAMs may be more difficult to interpret because of the nonparametric nature of the unde rlying nonlinearity in the data. However it is important to acknowledge that although GAMs may serve different analytic purposes like suitably exploring the data nonparametrically a nd visualizing the complex relationships, in the presence of unknown complex nonlinearity GAMs are closer to reality and are known to yield a better fit than their GLM counterparts. These properties are well illuminated by the results of our application of the proposed ALV model to both simulated and real data in this dissertation. In practice, because of the flexibility of GAM technique, it is very possible to provide a good fit to the data by tracking significant noise in addition to the nonlinear relationships in the predictor variables. This happens whenever higher than th e appropriate degrees of freedom are used in estimating the nonparametric functions of the pred ictor terms. Although the user is allowed to specify degrees of freedom for the cubic spline sm oother for each predictor term in a stand-alone GAM procedure, the optional feature we adopted in the GAM component of ALV model allows for optimal estimates of effective degrees of freedom to be computed directly by the model (Wood, 2006). So the potential problems of over fitting (or under fitting) typically associated with user-specified degrees of freedom in AM methods are minimized in the ALV method. One major limitation was the number of cases we examined in the simulation. This limitation with respect to maximum size of 300 was due to practical considerations since each simulation required massive computing time. The minimum size of 100 was chosen because typically factor analysis is a large sample procedure, and also because the choice is in line with similar past studies involving Monte Carlo version of the EM (Lee & Song, 2007; Lee & Zhu, 2002). However more studies are required to stud y the stability of ALV model when sample size
PAGE 119
109 drops below the minimum of 100 used in the present study. Another major limitation of the ALV model in its current form is its listwise deletion approach to missing data problems. Given the frequent encounter with missing data in practice a nd the availability of more effective methods of handling this problem, the incorporation of such methods into ALV model will be of considerable importance and we are planning to do this in our next stage. As it is currently set up, the nesting in the data is accounted for only at the final EM iteration and only in the GAM component of the ALV model. Further studies are needed to assess the adequacy of this partial effort compared to full multilevel extensions to the ALV model. Although this new approach is computationally intensive, given the persistent rapid developmen ts in computer technology, this should not be considered a serious limitation. Even though th e ALV model consistently estimates the latent factor accurately in the measurement part of th e model, the associated measurement parameter estimates are not stable and this may indicate that the solutions are non-unique. Therefore the emphasis of the ALV model application should be on the accurate recovery of unknown complex relationships in the data; in its current form it may not be useful for analyzing psychometric properties of instruments. There are several other ways (than our choice in this dissertation) of defining a cubic regression spline basis which may offer some advantag es with respect to the interpretability of the parameters and appropriateness to the data at hand (Wood, 2006). The ALV method can be improved upon therefore by exploring other smoothi ng spline bases available as options in the R package mgcv and determining under what conditions a par ticular choice would be best within the ALV framework. Our model can be extended to examine co mplex nonlinearity between multiple distal outcomes and their predictors including multiple la tent factors (e.g. multiple-factors solutions to observed baseline risk variables) or growth factors in a longitudinal study. In future we intend to
PAGE 120
110 also explore the application of ALV method to a wider spectrum of nonlinear structural equation modeling involving complex factor-to-factor, factor-to-indicator, and indicator-to-indicator relationships, using n onparametric methods. In conclusion, the ALV modeling technique allows researchers to assess how an intervention affects individuals differently as a function of baseline risk that is itself measured with error, and uncover complex relationships in th e data that might otherwise be missed. In practice, its users are relieved from the need to decide functional forms for the complex relationships before the model is run. The ALV program is written in R la nguage and the R software is freely available; so general users can apply the new methodology. We expect the ALV model and its extensions to have lots of new applications to modeling of behavioral, sociological and psychological data in the future.
PAGE 121
111 REFERENCES CITED Brown, C. (1993). Analyzing Preventive Tr ials with Generalized Additive Models. American Journal of Community Psychology 21, 635-664. Brown, C., Wang, W., Kellam, S., Petras, H., Toyinbo, P., Poduska, J., et al. (2008). Methods for Testing Theory and Evaluating Impact in Ra ndomized Field Trials: Intent-to-Treat Analyses for Integrating the Perspectives of Person, Place, and Time. Drug and Alcohol Dependence 95 (Supplement 1), S74-S104. Chambers, J. M., & Hastie, T. J. (1993). Statistical Models in S. London: Chapman & Hall. Cheney, W., & Kincaid, D. (2004). Numerical Mathematics and Computing. Belmont, California: Brooks/Cole-Thomson. Chib, S., & Greenberg, E. (1995). Unders tanding the Metropolis-Hastings Algorithm. The American Statistician 49 (4), 327-335. Chib, S., & Jeliazkov, I. (2006). Inference in Semiparametric Dynamic Models for Binary Longitudinal Data. Journal of the American Statistical Association 101, 685-700. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society 39 (1), 1-38. Gamerman, D., & Lopes, H. F. (2006). Markov Chain Monte Carlo. Stochastic Simulation for Bayesian Inference (2nd Edition ed.). Boca Raton, FL: Chapman & Hall/CRC.
PAGE 122
112 Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution and Bayesian restoration of images. IEE Transactions on Pattern Analysis and Machine Intelligence 6, 721 741. Gu, C. (2002). Smoothing Spline Anova Models. New York: Springer-Verlag. Hastie, T. J., & Tibshirani, R. J. (1990). Generalized Additive Models. New York: Chapman and Hall. Heath, M. T. (2005). Scientific Computing : An Introductory Survey. New York, New York: McGraw-Hill. Holler, K. D. (2005). Generalized Additive Models. Retrieved May 5, 2009, from www.casact.org/education/specsem/f2005/handout s/holler.ppt Joreskog, K. G. (1977). Structural Equation Models in the Social Sciences: Specification, Estimation and Testing. In P. R. Krishnaiah, Applications of Statistics. Armsterdam: NorthHolland. Kellam, S., Brown, C., Poduska, J., Ialongo, N ., Petras, H., Wang, W., et al. (2008). Effects of a Universal Classroom Behavior Management Program in First and Second Grades on Young Adult Behavioral, Psychiatric, and Social Outcomes. Drug and Alcohol Dependence 95 (Supplement 1), S5S28. Khoo, S.-T. (1997). Assessing Interactions between Program Effects and Baseline. Dissertation University of California, Los Angeles, Carlifornia.
PAGE 123
113 Law, N. J., Taylor, J. M., & Sandler, H. (2002). The Joint Modeling of a Longitudinal Disease Progression Marker and the Failure Time Process in the Presence of Cure. Biostatistics 3 (4), 547-563. Lee, S. Y., & Zhu, H. T. (2000). Statistical Analysis of Nonlinear Structural Equation Modelwith Continous and Polytomous Data. British Journal of Mathem atical and Statistical Psychology 53, 209-232. Lee, S.-Y., & Song, X.-Y. (2007). A Unified Maximum Likelihood Approach for Analyzing Structural Equation Models W ith Missing Nonstandard Data. Sociological Methods & Research 35 (3), 352-381. Lee, S.-Y., & Zhu, H.-T. (2002). Maximum Likelihood Estimation of Nonlinear Structural Equation Models. Psychometrika 67 (2), 189-210. Lee, S.-Y., Song, X.-Y., & Lee, J. C. (2003). Maximum Likelihood Estimation of Nonlinear Structural Equation Models with Ignorable Missing Data. Journal of Educational and Behavioral Statistics 28 (2), 111-134. Martin, A. D., Quinn, K. M., & Park, J. H. (2009). R package version 0.9-6. Retrieved from MCMCpack: Markov chain Monte Carlo (MCM C) Package: http://mcmcpack.wustl.edu McLachlan, G. J., & Krishnan, T. (2008). The EM Algorithm and Extensions (2nd Edition ed.). Hoboken, New Jersey: John Wiley & Sons, Inc. Meng, X. L., & Schiling, S. (1996). Fitting Fu ll-information Item Factor Models and an Empirical Investigation of BridgeSampling. Journal of American Statistical Association 91, 1254-1267.
PAGE 124
114 Muthen, B. O. (2002). Beyond SEM: General Latent Variable Modeling. Behaviormetrika 29 (1), 81-117. Muthen, B. O. (1989). Latent Variable Modeling in Heterogeneous Populations. Psychometrika 54 (4), 557-585. Muthen, L. K., & Muthen, B. O. (2008). Mplus User's Guide, Version 5 Nelder, J. A., & Wedderburn, R. W. (1972). Generalized Linear Models. Journal of the Royal Statistical Society 135, 370-384. Plummer, M., Best, N., Cowles, K., & Vines, K. (2006, March). CODA: Convergence Diagnosis and Output Analysis for MCMC. R News 7-11. Poduska, J., Kellam, S. G., Wang, W., Brown, C. H., Ialongo, N., & Toyinbo, P. (2008). Impact of the Good Behavior Game, a Universal ClassroomBased Behavior Intervention, on Young Adult Service Use for Problems with Em otions, Behavior, or Drugs or Alcohol. Drug and Alcohol Dependence 95 (Supplement 1), S29-S44. R Development Core Team. (2008). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing. Rubin, D. B. (1991). EM and Beyond. Psychometrika 56, 241-254. Rubin, D. B. (1976). Inference and Missing Data. Biometrika 63, 581-592. Sloboda, Z., Pyakuryal, A., Stephens, P. C., Teas dale, B., Forrest, D., St ephens, R. C., et al. (2008). Reports of Prevention Programming Available in Schools. Prev. Sci.
PAGE 125
115 Song, X.-Y., & Lee, S.-Y. (2005). Maximum Likelihood Analysis of Nonlinear Structural Equation Models With Dichotomous Variables. Multivariate Behavioral Research 40 (2), 151177. Walsh, B. (2004). Markov Chain Monte Carlo and Gibbs Sampling. Lecture Notes for EEB 581 version 26. Wang, C., Brown, C., & Bandeen-Roche, K. (2005). Residual Diagnostics for Growth Mixture Models. Journal of the American Statistical Association Wei, G. C., & Tanner, M. A. (1990). A Mont e Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation. American Statistical Association 85 (411), 699-704. Wood, S. N. (2006). Generalized Additive Models: An Introduction with R. Boca Raton, Florida: Chapman & Hall / CRC. Xiang, D. (2004). Fitting Generalized Additive Models with the GAM Procedure. Paper P256-26 Cary, NC: SAS 9.1.2 Users Guide, SAS Institute Inc.
PAGE 126
116 APPENDICES
PAGE 127
117 APPENDIX A: DATA SIMULATION CODES FOR SCHEMES 1 & 2 #################################################### # SIMULATE COPIES OF DATASETS (N=100,200,300): # 6 VARIABLES CONDITIONALLY INDEPENDENT GIVEN ETA: # {5 CONTINUOUS Y's + 1 CONTINUOUS OR BINARY Z} #################################################### # Define Population parameters n <150 # half sample size J <50 # number of datasets lambda.1
PAGE 128
118 APPENDIX A: (CONTINUED) # define G (group) G
PAGE 129
119 APPENDIX A: (CONTINUED) # SIMULATION for (j in 1:J){ # simulate y1 to y5 e1
PAGE 130
120 APPENDIX A: (CONTINUED) #---------------------------------------------------------------------# define Z(binary, nonlinear with eta) for group (0,1) #----------------------------------------------------------------------Z.logit
PAGE 131
121 APPENDIX A: (CONTINUED) #============================================= # ANALYZE BINARY Z: Given observed eta #============================================= #------------------------------------------------# establish population characteristics graphically #------------------------------------------------plot(density(eta), main="Eta (Population values)") ### LOGIT SCALE Z.logit.true
PAGE 132
122 APPENDIX A: (CONTINUED) segments(Q[1], yrange[1], Q[1], yrange[2], lty = 2) segments(Q[2], yrange[1], Q[2], yrange[2], lty = 2) segments(Q[3], yrange[1], Q[3], yrange[2], lty = 2) segments(Q[4], yrange[1], Q[4], yrange[2], lty = 2) segments(Q[5], yrange[1], Q[5], yrange[2], lty = 2) text( Q[1], yrange[1], 10th ", adj = c(0,0), par(srt=90)) text( Q[2], yrange[1], 25th ", adj = c(0,0), par(srt=90)) text( Q[3], yrange[1], 50th ", adj = c(0,0), par(srt=90)) text( Q[4], yrange[1], 75th ", adj = c(0,0), par(srt=90)) text( Q[5], yrange[1], 90th ", adj = c(0,0), par(srt=90)) ### PROBABILITY SCALE # Prob(Z = 1) = 1/[1+exp(Z.logit.true)] Z.prob.true <1/(1+exp(Z.logit.true)) par(mfrow = c(1, 1)) yrange
PAGE 133
123 APPENDIX A: (CONTINUED) # plot fixed values with no error term plot(eta, Z.prob.true, type="n", ylim = yrange, xlab = "eta (population values)", yl ab = "Probability of Z (population)", sub = "Vertical lines at percentiles of eta") points(eta [1:n], Z.prob.true[1:n], pch=19 col=4) points( eta [(n+1) : (2*n) ], Z.prob.true[(n+1) : (2*n) ], col=2) Q
PAGE 134
124 APPENDIX A: (CONTINUED) ############################################################### # SELECT A COPY FROM 50 DATASETS FOR Z-REGRESSION MODEL ############################################################### ## For Analyses of 1st copy of 50 datasets dat.copy
PAGE 135
125 APPENDIX A: (CONTINUED) #--------------------------------# ANALYTIC PLOTS #---------------------------------lwd<-2; lwd2<-1; Tx.col<-2; Ctr.col<-4; fit
PAGE 136
126 APPENDIX A: (CONTINUED) # ylab = paste("Probability of Z (GAM fit)")) # for continous Z ylab = paste("fitted Z (GAM)")) # for binary Z # ylab = paste("fitted Z (GLM)")) # for binary Z xord
PAGE 137
127 #tx.legend
PAGE 138
128 APPENDIX A: (CONTINUED) #============================================= # save simulated datasets fo r later replication studies #============================================= write.csv(ContArray, file = "C:/.../repDat .N300.ContLin.csv", row.names = FALSE) write.csv(BinArray, file = "C:/.../repDat .N300.BinNlin.csv", row.names = FALSE) write.csv(ContArray2, file = "C:/.../repDat2.N300.ContLin.csv", row.names = FALSE) write.csv(BinArray2, file = "C:/.../repDat 2.N300.BinNlin.csv", row.names = FALSE)
PAGE 139
129 APPENDIX B: SELF-WRITTEN R FUNCTIONS CALLED BY ALV MODEL ######################################################## # R-FUNCTIONS FOR THE GAM COMPONENT OF ALV MODEL ######################################################## # LIST OF FUNCTIONS # (1) write R function to define R(x,z)for cubic spline on [0,1] # function name = rk # (2) Use the rk function in a new function that takes a sequence of knots # and an array of x values to produce a model matrix X for cubic spline (p127) # function name = spl.X # (3) write a function to setup a pena lized regression spline penalty matrix S # function name = spl.S # (4) write a simple matrix sqrt function to use on S # function name = mat.sqrt # (5) write a function to SET UP a simple additive model # with 2 smooth terms + 1 parametric term This function is modified from the # function am.setup(Wood, 2006, p 135) and calls functions (1) to (3). # function name = am.setup2
PAGE 140
130 APPENDIX B: (CONTINUED) ##################################### # FOR GAM COMPONENT ################################### # SIMPLE CUBIC SPLINE # write R function to define R(x,z)for cubic spline on [0,1] rk
PAGE 141
131 APPENDIX B: (CONTINUED) # EXTENSION TO PENALIZED CUBIC SPLINE # Model extension: to fit penalized regression spline to x, y, data # First write a function to setup a pena lized regression spline penalty matrix S spl.S
PAGE 142
132 APPENDIX B: (CONTINUED) # EXTENSION TO ADDITIVE MODEL # Write a function to SET UP a 3-term simple additive model : # function to produce a model matrix X # and 2 regression penalty matrices in S for # a 2-smooth + 1-parametric terms additive model am.setup2
PAGE 143
133 APPENDIX B: (CONTINUED) X1[ ,2:q]
PAGE 144
134 APPENDIX B: (CONTINUED) # FOR MCMC ALGORITHM # Define the unnormalized log-density of the cond distribution of eta # from which to draw a sample. # The function accepts data from the ith independent observation. condETAfun.gam
PAGE 145
135 APPENDIX C: R CODES FOR ALV MODEL ALGORITHM #$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ # MUST FIRST RUN ALV FUNCTIONS IN R (APPENDIX B) #$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ # load libraries. library(foreign) library(mgcv) library(nlme) library(MASS) library(MCMCpack) library(numDeriv # GET SIMULATED DATASETS (50 COPIES STACKED HORIZONTALLY) replicData
PAGE 146
136 APPENDIX C: (CONTINUED) ### assign GLM family of distribution Y1.family no of Y variables Ydata
PAGE 147
137 APPENDIX C: (CONTINUED) #----------------------------------------# TRUE VALUES #------------------------------------------# STORE TRUE VALUES (WHERE AVAILABLE) FOR EASY TABULATION # Y's muY.t
PAGE 148
138 APPENDIX C: (CONTINUED) beta.t
PAGE 149
139 APPENDIX C: (CONTINUED) #----------------------------------------# START VALUES #----------------------------------------muY0
PAGE 150
140 APPENDIX C: (CONTINUED) beta0
PAGE 151
141 APPENDIX C: (CONTINUED) # generate initial N-vector eta from its cond distrib eta00
PAGE 152
142 APPENDIX C: (CONTINUED) ################################################# # ALV (MCEM) ALGORITHM ################################################# #---------------------------------------------------------------# SET PARAMETERS FOR ALV ALGORITHM #---------------------------------------------------------------# NOTE: # SINGLE REPLICATION TO STUDY CONVERGENCE (set bridge = TRUE) # MULTIPLE REPLICATIONS TO STUDY ESTIMATION (set bridge = FALSE) # Start RUN from here bridge
PAGE 153
143 APPENDIX C: (CONTINUED) ######################################################################### #=========== START OF ALV MODEL RUN ============ ######################################################################### ## intialize lines as pointers for tracking different stages in the MCEM loop line1 <0; line2 <0; line3 <0; line4 <0; line5 <0; line6 <0; line7 <0; line8 <0; line9 <0; line10 <0 #---------------------------------------------------------------------------#### Initialize storage matrices for all MCEM replications #----------------------------------------------------------------------------etaVectors
PAGE 154
144 APPENDIX C: (CONTINUED) # sub-model deviances all.deviances replic ) { replic
PAGE 155
145 APPENDIX C: (CONTINUED) #### create storage matrices for kth MCEM iteration # arrays to store calculated MEANS of individual regression parameter values Y_params
PAGE 156
146 APPENDIX C: (CONTINUED) # arrays to store calculated VARIANCES of individual regression parameter values Y_parVars
PAGE 157
147 APPENDIX C: (CONTINUED) # Store initial values in 2nd row of parameters table for (h in 1:p) { Y_params[2 ,c(2:10), h]
PAGE 158
148 APPENDIX C: (CONTINUED) log_LR
PAGE 159
149 APPENDIX C: (CONTINUED) # supply initial parameter values new.params1 iter ) { iter
PAGE 160
150 APPENDIX C: (CONTINUED) # create an array to record derivatives a nd calculated stderr to be generated in # the current EM iteration using Louis' formula ############ # E-step ############ #--------------------------------------------------------# START M-H ITERATION: 3RD LOOP #--------------------------------------------------------old.params1
PAGE 161
151 APPENDIX C: (CONTINUED) # SUBJECT loop to run N Markov chains (1 for each subject i = 1:N) i <0 # initialize subject (row) counter while ( N > i ) { # begin M-H inner loop i 5) break } eta.samp
PAGE 162
152 APPENDIX C: (CONTINUED) #raftery
PAGE 163
153 APPENDIX C: (CONTINUED) ########## # M-step: Estimate new parameters given the expected value of eta ########## # linear regression of Y's on etaHat ### Personal note: This loop will be generalized later to accept any number p of regressions line3
PAGE 164
154 APPENDIX C: (CONTINUED) Yparam.est[j, 1:3, 2]
PAGE 165
155 APPENDIX C: (CONTINUED) Yparam.est[j, 4:6, 5]
PAGE 166
156 APPENDIX C: (CONTINUED) ############################################################################## ### Calculate the means & Monte Carlo std err of parameter estimates for current EM iteration ############################################################################## #=============== # parameters for Y #================ Ymeans
PAGE 167
157 APPENDIX C: (CONTINUED) # Record y-interc, lambdas, thetas and calculate their MC std err of estimates Y.estim.se[iter, c(1,3,5), k]
PAGE 168
158 APPENDIX C: (CONTINUED) #=================================== # 2nd set of parameters for Z #=================================== Zmeans2
PAGE 169
159 APPENDIX C: (CONTINUED) #-----------------------# Update Z parameters #-----------------------muZ
PAGE 170
160 APPENDIX C: (CONTINUED) # update current estim of penalty of th e penalized least square expression for # (Z|eta, grp, eta*grp) component of (eta|Y,Z ,omega(k)). NOTE: This step is not neccssary penalty
PAGE 171
161 ## Calculate the gradient/Hessian of a func tion b y numerical approximation using numDerivpackage functions func.mu
PAGE 172
162 APPENDIX C: (CONTINUED) # calculate Louis std err (use NE GATIVE 2nd partial derivatives) # Store stacked in a column per EM iter for (k in 1:p) { louis.se[k, iter]
PAGE 173
163 APPENDIX C: (CONTINUED) # Calculate and update convergence error err1
PAGE 174
164 APPENDIX C: (CONTINUED) if ( iter == iter.best ) { eta.chains.best
PAGE 175
165 Ym eans.best
PAGE 176
166 APPENDIX C: (CONTINUED) if (bridge == TRUE && iter > 1) { # RUN bridge-sampling loop only when studying convergence # and start from 2nd EM iteration for (m in 1:M) { for (i in 1:N) { Y
PAGE 177
167 APPENDIX C: (CONTINUED) Lik_ab[i, m] <(1/sqrt(det(theta.b))) (1/sqrt(sigma.sq.b)) exp( -0.5 (t(Y-muY.b-lambda.b%*%eta.a) %*% solve(theta.b) %*% (YmuY.b-lambda.b%*%eta.a) + 1/sigma.sq.b ((Z t(Xm.a) %*% beta.b)^2 + penalty) + eta.a^2 )) Lik_ba[i, m] <(1/sqrt(det(theta.a))) (1/sqrt(sigma.sq.a)) exp( -0.5 (t(Y-muY.a-lambda.a%*%eta.b) %*% solve(theta.a) %*% (YmuY.a-lambda.a%*%eta.b) + 1/sigma.sq.a ((Z t(Xm.b) %*% beta.a)^2 + penalty) + eta.b^2 )) Lik_bb[i, m] <(1/sqrt(det(theta.b))) (1/sqrt(sigma.sq.b)) exp( -0.5 (t(Y-muY.b-lambda.b%*%eta.b) %*% solve(theta.b) %*% (YmuY.b-lambda.b%*%eta.b) + 1/sigma.sq.b ((Z t(Xm.b) %*% beta.b)^2 + penalty) + eta.b^2 )) } } num1
PAGE 178
168 APPENDIX C: (CONTINUED) # record estimates for (k)th EM iteration muY.a new.iter.best) target.replic
PAGE 179
169 APPENDIX C: (CONTINUED) line9
PAGE 180
170 APPENDIX C: (CONTINUED) paramMeans2[ replic]
PAGE 181
171 APPENDIX C: (CONTINUED) # store sub-model deviances at EM covergence for each replication all.deviances[replic, ]
PAGE 182
172 APPENDIX C: (CONTINUED) line5 line6 line7 line8 line9 line10 #============================================ # COMPILE REPLICATION RESULTS #============================================ ##### COMPILE TRUE VALUES p no of Y variables Parameter
PAGE 183
173 APPENDIX C: (CONTINUED) #### PARAMETRIC COEFFICIENTS L
PAGE 184
174 APPENDIX C: (CONTINUED) # Save estimated eta vector for each replication write.csv(etaVectors, file = "C:/.../*.csv", row.names = FALSE) # Save results of replication studies write.csv(paramCoef, file = "C:/.../*.csv", row.names = FALSE) ########################################################### # WHEN ALV MODEL IS FITTED TO A SINGLE DATASET, # COMPILE RESULTS FOR GAM COMPONENT AS FOLLOWS: ########################################################### #==================================== # Plot the fitted curve GAM #===================================== # USE THE SELECTED BEST ETA ES TIMATE (AT EM CONVERGENCE) data.comp
PAGE 185
175 APPENDIX C: (CONTINUED) # ANALYTIC PLOT lwd<-2; lwd2<-1; Tx.col<-2; Ctr.col<-4; fit
PAGE 186
176 APPENDIX C: (CONTINUED) lines(xord[Grord1 == 1 ], fitord[Grord1 == 1 ], lty=1, lwd=lwd, col=Tx.col) lines(xord[Grord1 == 0 ], fitord[Grord1 == 0 ], lty=2, lwd=lwd, col=Ctr.col) tx.legend
PAGE 187
177 APPENDIX C: (CONTINUED) #------------------------------------# DIAGNOSTIC PLOTS #------------------------------------gam.check(fitZ.b) # residual plots plot(fitZ.b,pages=1,residuals=TRUE) plot(fitZ.b,pages=1,seWithMean=TRUE, shade=TRUE) #-----------------------------------------------------------------------# MORE ALV MODEL FIT RESULTS FOR REVIEW #-----------------------------------------------------------------------Y_params.best Z_params1.best Z_params2.best Y_params[1:(iter+2),-c(4,7,9) ,] Z_params1[1:(iter+2),c(1:3,8,9,11,13)] paramMeans1[,1:replic]
PAGE 188
178 APPENDIX C: (CONTINUED) ############################################# # ALV MODEL CONVERGENCE RESULTS ############################################# # OVERAL FOR ALV MODEL # record log of observed-data likeli hood ratio between 2 consecutive steps convergence$logLR.obs
PAGE 189
179 APPENDIX C: (CONTINUED) #--------------------------------------------------------# PLOTS TO MONITOR CONVERGENCE #--------------------------------------------------------### DEVIANCE PLOTS par(mfrow = c(2, 1)) plot(c(1:iter), convergence$logLR.obs[1:iter], type="l", ylab="Log of likelihood ratio", xlab="Iteration") title(main="Log of Observed -Data Likelihood Ratio Versus EM Iteration from the 2nd Iteration", cex.main=1.1) abline( v = iter.best, col = "blue", lty=3) plot(c(1:iter), convergence$SumDeviance2[1:iter], type="l", ylab="Total deviance", xlab="Iteration") title(main="Scaled Total Deviance Ve rsus EM Iteration", cex.main=1.1) abline( v = iter.best, col = "blue", lty=3)
PAGE 190
180 APPENDIX C: (CONTINUED) ### CONVERGENCE ERRORS par(mfrow = c(1, 1)) plot(convergence[,1], convergence[,3], y lim=c(0,3), xlim=c(0,iter),type="n", ylab="Convergence Errors", xlab="Iteration") title(main="Log of Observed -Data Likelihood Ratio Versus EM Iteration from the 2nd Iteration", cex.main=1.1) #abline( v = iter.best, col = "black", lty=3) lines(convergence[,1], convergence[ ,3], lwd=1.9, lty=2, col=1) lines(convergence[,1], convergence[ ,4], lwd=1.9, lty=1, col=1) err1.legend
PAGE 191
181 APPENDIX C: (CONTINUED) for (k in 1:p) { est1[ k]
PAGE 192
182 APPENDIX C: (CONTINUED) plot(iteration, estA[,1], ylim=range(estA[,1:5]), ylab="estimate", type="n") title(main="Y-intercepts", cex.main=1.1) abline( v = iter.best, lty=3) for (j in 1:5) { lines(iteration, estA[,j], col=(j+1), lty=(j+1), lwd=2) } plot(iteration, estA[,1], y lim=range(estA[,6:10]), ylab="estimate", type="n") title(main="Lambdas", cex.main=1.1) abline( v = iter.best, col = "blue", lty=3) for (j in 6:10 ) { lines(iteration, estA[,j], col=(j-4), lty=(j-4), lwd=2) } plot(iteration, estA[,1], ylim=range(estA[,11:15]), ylab="estimate", type="n") title(main="Thetas", cex.main=1.1) abline( v = iter.best, col = "blue", lty=3) for (j in 11:15 ) { lines(iteration, estA[,j], col=(j-9), lty=(j-9), lwd=2) }
PAGE 193
183 plot(iteration, estA[,1], ylim=range(estA[ c(16,18)]), ylab="estimate", type="n") title(main="Sigma^2 -.Z-interc __", cex.main=1.1) abline( v = iter.best, col = "blue", lty=3) lines(iteration, estA[,16], lwd=2, lty=2) #lines(iteration, estA[,17], lwd=2, lty=3) lines(iteration, estA[,18], lwd=2, lty=4) #--------------------------------------------------------------------------------------------# PLOTS to examine MCMC simulations for 5 randomly selected subjects #---------------------------------------------------------------------------------------------plot(eta.sample1.1) mtext("1st Randomly Selected Subject (GRP=1)", cex = 1.2, side = 3, line = 3) plot(eta.sample1.2) mtext("2nd Randomly Selected Subject (GRP=1)", cex = 1.2, side = 3, line = 3) plot(eta.sample1.3) mtext("3rd Randomly Selected Subject (GRP=1)", cex = 1.2, side = 3, line = 3) plot(eta.sample1.4) mtext("4th Randomly Selected Subject (GRP=1)", cex = 1.2, side = 3, line = 3)
PAGE 194
184 APPENDIX C: (CONTINUED) plot(eta.sample1.5) mtext("5th Randomly Selected Subject (GRP=1)", cex = 1.2, side = 3, line = 3) plot(eta.sample2.1) mtext("1st Randomly Selected Subject (GRP=2)", cex = 1.2, side = 3, line = 3) plot(eta.sample2.2) mtext("2nd Randomly Selected Subject (GRP=2)", cex = 1.2, side = 3, line = 3) plot(eta.sample2.3) mtext("3rd Randomly Selected Subject (GRP=2)", cex = 1.2, side = 3, line = 3) plot(eta.sample2.4) mtext("4th Randomly Selected Subject (GRP=2)", cex = 1.2, side = 3, line = 3) plot(eta.sample2.5) mtext("5th Randomly Selected Subject (GRP=2)", cex = 1.2, side = 3, line = 3) ############################ # SAVE WORKSPACE ############################ save.image(file = "C://*.RData")
PAGE 195
ABOUT THE AUTHOR Peter Toyinbo received a doctoral Degree in Medicine from University of Ife, Nigeria in 1981 and an M.S.P.H. (Biostatistics) from Univers ity of South Florida (U.S.F.) in 2004. He was accepted into the Ph.D. program at the U.S.F. in 2004. Concurrently he completed a two-year postdoctoral fellowship in Preventive Research Me thodology in Mental Health from the Johns Hopkins Bloomberg School of Public Health in 2006. He received the national Certified in Public Health (CPH) credential in 2009. He is an active member of the Prevention Science and Methodology Group (PSMG) and has collaborated on few research projects through th e PSMG research efforts. He currently holds the position of Statistical Data Analyst with the Department of Aging and Mental Health Disparities in the Florida Mental Health Institute of the USF where he collaborates and provides consultations on research study designs and st atistical analyses to fellow research faculty.