USF Electronic Theses and Dissertations
Synthesis of Local Thermo-Physical Models Using Genetic Programming

by

Ying Zhang

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Chemical and Biomedical Engineering
College of Engineering
University of South Florida

Major Professor: Aydin K. Sunol, Ph.D.
John A. Llewellyn, Ph.D.
Scott W. Campbell, Ph.D.
Luis H. Garcia-Rubio, Ph.D.
Rafael Perez, Ph.D.

Date of Approval: December 11, 2008

Keywords: data mining, symbolic regression, function identification, parameter regression, statistical analysis, process simulation

Copyright 2009, Ying Zhang

ACKNOWLEDGMENTS

I wish to express my deepest gratitude to Dr. Aydin K. Sunol for his continuous guidance and encouragement throughout my Ph.D. experience. I would also like to thank Dr. John A. Llewellyn, Dr. Scott W. Campbell, Dr. Luis H. Garcia-Rubio and Dr. Rafael Perez for being my committee members and taking time from their busy schedules. Finally, I would like to thank my family for their help.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER ONE: INTRODUCTION
CHAPTER TWO: LITERATURE REVIEW
  2.1 Review of local models for phase partition coefficients
  2.2 Review of data mining techniques in the knowledge discovery process
  2.3 Elements of structural regression using genetic programming
    2.3.1 Terminal set, function set and initial representation
    2.3.2 Fitness measures for models of varying complexity
      2.3.2.1 Fitness function with no penalty for the model complexity
      2.3.2.2 Fitness function with model complexity control
        2.3.2.2.1 Minimum description length
        2.3.2.2.2 Parsimony pressure
      2.3.2.3 Fitness function using external validation
    2.3.3 Genetic operators
      2.3.3.1 Reproduction and crossover
      2.3.3.2 Mutation
    2.3.4 Selection strategy
  2.4 Parametric regression: review of objective functions and optimization methods
  2.5 Applications of intelligent systems in chemical engineering
CHAPTER THREE: A HYBRID SYSTEM FOR STRUCTURAL AND PARAMETRIC OPTIMIZATION
  3.1 The system structure
  3.2 Data and data preparation
  3.3 The regression strategy
  3.4 Implementation with MATLAB based genetic search toolbox
    3.4.1 The population architecture
    3.4.2 The population size
    3.4.3 The principal component analysis and the selection of terminal & function sets
    3.4.4 Genetic operator
    3.4.5 Fitness evaluation
    3.4.6 Selection strategy
    3.4.7 Decimation strategy
    3.4.8 Result designation and termination criteria
    3.4.9 Summary
  3.5 Model evaluation
    3.5.1 Statistical analysis
      3.5.1.1 Graphical methods
      3.5.1.2 Goodness of fit statistics
      3.5.1.3 Cross validation and prediction
    3.5.2 Steady state and dynamic simulation using local models
CHAPTER FOUR: RESULTS AND DISCUSSIONS
  4.1 Local composition models and their impact
    4.1.1 Developed models for mixtures that form ideal or near-ideal solutions and their statistical analysis
    4.1.2 Performance of models for separation of ideal and near-ideal mixtures
    4.1.3 Developed models for mixtures that form non-ideal solutions and their statistical analysis
    4.1.4 Performance of models for separation of non-ideal solutions
  4.2 Discussion
    4.2.1 Ideal gaseous and liquid mixture
    4.2.2 Non-ideal gaseous mixture and ideal liquid mixture
    4.2.3 Ideal gaseous and non-ideal liquid mixture
    4.2.4 Non-ideal gaseous mixture and non-ideal liquid mixture
    4.2.5 Linear model vs. nonlinear model
    4.2.6 Extrapolation
CHAPTER FIVE: CONCLUSIONS AND RECOMMENDATIONS
REFERENCES
APPENDICES
  Appendix A. User Manual for GP Package
    A.1 MATLAB files
    A.2 Architecture
    A.3 How to use the GP package
  Appendix B. Statistical Analysis
    B.1 Tests for outliers
    B.2 Evaluation of final model: cross-validation
ABOUT THE AUTHOR

LIST OF TABLES

Table 4.1 Result Summary for Propylene (1)-Propane (2)
Table 4.2 Standard Error Analysis for Generated K1 and K2 Models
Table 4.3 Propane-Propylene Distillation Column Profile
Table 4.4 Result Summary for Acetone (1)-Water (2)
Table 4.5 Standard Error Analysis for Local Model for Acetone-Water System
Table 4.6 Tower Profile at Different Runtimes
Table 4.7 R² of Different Models for K1 in Group Tests
Table 4.8 R² of Extrapolation Test
Table B.1 MATLAB Code for Outlier Test
Table B.2 The Result for Outlier Test
Table B.3 MATLAB Code for Data Set Partition
Table B.4 The Result for Data Set Partition

LIST OF FIGURES

Figure 2.1 Flowchart for an Isothermal Flash Calculation Algorithm that Uses Composition-Dependent Local Models
Figure 2.2 Flowchart for an Isothermal Flash Calculation Algorithm that Uses an Equation of State
Figure 2.3 Architecture of Local Model Strategy
Figure 2.4 Knowledge Discovery Process
Figure 2.5 Example of a Regression Tree
Figure 2.6 A Simple Multilayer Neural Network Model with Two Hidden Nodes
Figure 2.7 A Flowchart for Computation through Genetic Programming
Figure 2.8 Functional Representation of C_p(T) = a + bT + cT^2 Using a Tree Structure
Figure 2.9 An Example of Reproduction Operator
Figure 2.10 Crossover Operation for an Algebraic Equation Manipulation
Figure 2.11 An Example of Mutation Operation, Type I
Figure 2.12 An Example of Mutation Operation, Type II
Figure 3.1 A Hybrid System Structure for Structural and Parametric Optimization
Figure 3.2 The Structure for Linear-Nonlinear Regression Strategy
Figure 3.3 A Schematic Diagram of the Genetic Search Methodology
Figure 3.4 A Flowchart of Genetic Search, Steady State Population
Figure 3.5 A Flowchart of Genetic Search, Generational Population
Figure 3.6 Fitness vs. Generation Using Different Population Sizes
Figure 3.7 Fitness vs. Generation Using Different Mutation and Crossover Probabilities
Figure 3.8 The Standard Deviation of Fitness Using Different Mutation and Crossover Probabilities
Figure 3.9 Model Accuracy vs. Generation Using Different Fitness Functions
Figure 3.10 Model Complexity vs. Generation Using Different Fitness Functions
Figure 3.11 Model Fitness vs. Generation Using Different Tournament Sizes
Figure 3.12 The Standard Deviation of Fitness vs. Generation Using Different Tournament Sizes
Figure 3.13 Model Accuracy vs. Generation Using Different Deletion Strategies
Figure 3.14 The Standard Deviation of Fitness vs. Generation Using Different Deletion Strategies
Figure 4.1 K1 Model vs. Experimental Data for Propylene (1)-Propane (2)
Figure 4.2 Residual Plot of K1 for Propylene (1)-Propane (2)
Figure 4.3 K2 Model vs. Experimental Data for Propylene (1)-Propane (2)
Figure 4.4 Residual Plot of K2 for Propylene (1)-Propane (2)
Figure 4.5 P-X1-Y1 at Low Pressures for Propylene (1)-Propane (2)
Figure 4.6 P-X1-Y1 at Medium to High Pressures for Propylene (1)-Propane (2)
Figure 4.7 K1 vs. Liquid Composition for Different Temperatures for the Propylene (1)-Propane (2) System
Figure 4.8 K1 vs. Pressure for Different Temperatures for Propylene (1)-Propane (2)
Figure 4.9 Temperature Profile of Propylene-Propane Distillation Column
Figure 4.10 Relative Volatility Profile for Propylene-Propane Distillation Column
Figure 4.11 K1 Model vs. Experimental Data for Acetone (1)-Water (2)
Figure 4.12 Residual Plot of K1 for Acetone (1)-Water (2)
Figure 4.13 K2 Model vs. Experimental Data for Acetone (1)-Water (2)
Figure 4.14 Residual Plot of K2 for Acetone (1)-Water (2)
Figure 4.15 T-X1-Y1 at Different Temperatures for Acetone (1)-Water (2)
Figure 4.16 K1 vs. Liquid Composition at Different Pressures for Acetone (1)-Water (2)
Figure 4.17 K1 vs. Temperature at Different Pressures for Acetone (1)-Water (2)
Figure 4.18 Temperature Profile of Acetone-Water Distillation Column
Figure 4.19 Relative Volatility Profile of Acetone-Water Distillation Column
Figure 4.20 Distillation Total Flow Rate
Figure 4.21 Distillation Column Pressure
Figure A.1 Architecture of MATLAB Code

SYNTHESIS OF LOCAL THERMOPHYSICAL MODELS USING GENETIC PROGRAMMING

Ying Zhang

ABSTRACT

Local thermodynamic models are practical alternatives to computationally expensive rigorous models that involve implicit computational procedures, and often complement them to accelerate computation for real-time optimization and control. Human-centered strategies for development of these models are based on approximation of theoretical models. A Genetic Programming (GP) system can extract knowledge from the given data in the form of symbolic expressions.
This research describes a fully data-driven, automatic, self-evolving algorithm that builds appropriate approximating formulae for local models using genetic programming. No a priori information on the type of mixture (ideal/non-ideal, etc.) or assumptions are necessary. The approach involves synthesis of models for a given set of variables and mathematical operators that may relate them. The selection of variables is automated through principal component analysis and heuristics. For each candidate model, the model parameters are optimized in an inner, integrated nested loop. The trade-off between accuracy and model complexity is addressed through incorporation of the Minimum Description Length (MDL) into the fitness (objective) function.

Statistical tools including residual analysis are used to evaluate the performance of the models. Adjusted R-square is used to test model accuracy, and the F-test is used to test whether the terms in the model are necessary. The analysis of the performance of the models generated with the data-driven approach depicts the theoretically expected range of compositional dependence of partition coefficients and the limits of ideal gas as well as ideal solution behavior. Finally, the models built by GP are integrated into a steady state and dynamic flow sheet simulator to show the benefits of using such models in simulation. The test systems were propane-propylene for ideal solutions and acetone-water for non-ideal solutions. The results show that the generated models are accurate over the whole range of data and that their performance is tunable. The generated local models can indeed be used as empirical models that go beyond elimination of the local model updating procedures, further enhancing the utility of the approach for deployment in real-time applications.

CHAPTER ONE

INTRODUCTION

Approaches to modeling of chemical processes have changed significantly in the past three decades. In general, these approaches are divided into two generic categories.
One is mechanistic modeling, which is mainly based on first principles and fundamental knowledge. The other is empirical modeling, which is data driven. In the latter, the model structure and its associated parameters are selected to represent the process data accurately for a given range, with the aim of simplifying the model development stage as well as reducing the computational load.

Data-driven modeling techniques have been popular for many decades. They are easier to develop than mechanistic models, particularly for practitioners. This is especially true when mechanistic first-principles models and their associated thermophysical properties are not adequate in representing real-world problems. Furthermore, these mechanistic models are highly nonlinear and complex, which makes them difficult to identify [Ramirez 1989] and to implement, particularly in real-time applications. Currently, most data-driven modeling methods fall under the headings of statistical methods and artificial neural networks [Pyhnen, 1996]. Neural networks usually provide models that are accurate in representing the data, but they do not provide any insight into the represented phenomena. Usually, neural networks are black boxes, and one cannot abstract the underlying physical relationships between input and output data. It is often desirable to gain some insight into the underlying structures, as well as to make accurate numeric predictions. Genetic Programming (GP) based approaches are known to produce input-output models with relatively simple and transparent structures, and the associated procedures are coined with the symbolic regression terminology.

Genetic Programming allows synthesis of data-driven models in which model elements are represented as a tree structure. This tree structure is of variable length and consists of nodes.
The terminal nodes can be input variables, parameters or constants, while the non-terminal nodes are standard library functions, like addition, subtraction, multiplication and division. Each tree structure may possibly describe an equation. Genetic programming works by emulating natural evolution to generate an optimum model structure that best maximizes some fitness function. Model structures evolve through the action of operators known as reproduction, crossover and mutation. Crossover involves the interchange of branches between two parent structures. Mutation is the random creation of a completely new branch. At each generation, a population of model structures undergoes crossover, mutation and selection, and then a fitness function is evaluated. These operators improve the general fitness of the population. Based on fitness, the next generation is selected from the pool of old and new structures. The process repeats itself until some convergence criterion is satisfied and a model is generated.

One primary classification used for property and process models in Chemical Engineering is based on algebraic versus differential equation models [Franks 1967]. The mathematical models are comprised of either a set of algebraic equations for steady-state operation, a set of ordinary differential equations (ODE) coupled with algebraic equations for dynamic (time-dependent) models, or partial differential equations (PDE) for distributed models. The majority of algebraic mathematical models for physical or engineered systems can be classified as one of the following three types [Englezos 2001]:

Type I: A model with a single dependent variable and a single independent variable. For example, the heat capacity model for an ideal gas is a function of temperature.
Type II: A model with a dependent variable and several independent variables; for example, a pressure-explicit equation of state (EOS), which enables the calculation of fluid phase equilibrium and thermophysical properties such as enthalpy, entropy, and density, necessary in the design of chemical processes. Mathematically, a pressure-explicit EOS expresses the relationship among pressure, volume, temperature, and composition for a fluid mixture.

Type III: A model with multiple dependent variables and several independent variables. A typical group of applications is the modeling of reaction kinetics, where a possible mechanism is depicted as multiple reactions that are coupled through the concentrations of species.

The objective of this dissertation is to develop a methodology which uses genetic operations in order to find a symbolic relationship between a single dependent variable and multiple independent variables, i.e., Type II. The approach was demonstrated for Type I problems by Zhang [2004] earlier.

The structure, and hence the complexity, of the model or equation is not specified as in conventional regression, which seeks to find the best set of parameters for a pre-specified model. The goal is to seek a mathematical expression, in symbolic form, which fits or approximates a given sample of data using genetic programming (GP). The approach is called Symbolic Regression.

A nested two-tier approach is proposed in this research, where a parameter regression method is embedded within GP. The GP is employed to optimize the structure of a model, while classical numerical regression is employed to optimize its parameters for each proposed structure. The model structure and its parameters are unknown, and are determined at each step through the algorithm. Model adequacy is tested through post-analysis.
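The tree representation and the reproduction/crossover/mutation operators described above can be made concrete with a minimal symbolic-regression toy in Python. This is an illustrative sketch, not the MATLAB GP package developed later in this dissertation: the tuple encoding, the truncation selection, and all names are choices made here for brevity.

```python
import random

# Expression trees: a terminal is the string 'x' or a float constant; a
# non-terminal is a tuple ('op', left, right). All names are illustrative.
OPS = {
    'add': lambda a, b: a + b,
    'sub': lambda a, b: a - b,
    'mul': lambda a, b: a * b,
    'div': lambda a, b: a / b if abs(b) > 1e-9 else 1.0,  # protected division
}

def evaluate(tree, x):
    """Recursively evaluate an expression tree at the point x."""
    if tree == 'x':
        return x
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def random_tree(depth=3):
    """Grow a random tree; leaves are 'x' or a constant in [-2, 2]."""
    if depth == 0 or random.random() < 0.3:
        return 'x' if random.random() < 0.5 else round(random.uniform(-2, 2), 2)
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def fitness(tree, xs, ys):
    """Sum of squared errors over the data; lower is better (no size penalty)."""
    total = 0.0
    for x, y in zip(xs, ys):
        err = evaluate(tree, x) - y
        total += err * err          # err*err (not **2) avoids float OverflowError
    return total

def random_subtree(t):
    """Pick a random subtree (possibly t itself)."""
    if not isinstance(t, tuple) or random.random() < 0.5:
        return t
    return random_subtree(random.choice(t[1:]))

def crossover(a, b):
    """Replace one randomly chosen branch of parent a with a branch of parent b."""
    if not isinstance(a, tuple) or random.random() < 0.3:
        return random_subtree(b)
    op, left, right = a
    if random.random() < 0.5:
        return (op, crossover(left, b), right)
    return (op, left, crossover(right, b))

def mutate(t):
    """Type-II mutation: overwrite a randomly chosen branch with a fresh one."""
    if not isinstance(t, tuple) or random.random() < 0.3:
        return random_tree(2)
    op, left, right = t
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

# Tiny demonstration run: search for a tree fitting samples of y = 2x + 1.
random.seed(0)
xs = [i / 4 for i in range(-8, 9)]
ys = [2 * x + 1 for x in xs]
pop = [random_tree() for _ in range(60)]
for generation in range(40):
    pop.sort(key=lambda t: fitness(t, xs, ys))
    parents = pop[:20]                                   # truncation selection
    pop = parents + [mutate(crossover(random.choice(parents), random.choice(parents)))
                     for _ in range(40)]
best = min(pop, key=lambda t: fitness(t, xs, ys))
```

With the seed fixed the run is deterministic; in a realistic setting the fitness function would also penalize tree size (for example via MDL, as discussed in Chapter Two), and constants would be fitted by the inner numerical regression rather than guessed randomly.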
The approach is tested on a practical and significant problem: the development of local and/or empirical partition coefficient models for vapor-liquid separation.

For accurate chemical process design and effective operation, a correct estimate of physical and thermodynamic properties is a prerequisite. The estimation of these properties through first-principles but complex implicit models for pure components and mixtures is computationally costly. The computational time is critical, particularly in real-time applications. The phase equilibrium calculations are the most computationally intensive of these properties, due to the implicit nature of the procedures with more complex property models, especially when used with rigorous separation models [Leesley 1977].

Local thermodynamic models are explicit functions that approximate more rigorous models that involve implicit computational procedures in equilibrium calculations. Computations with these functions are fast, and at times non-iterative, but are only valid in a limited region where the functions are accurate. Therefore, local models need to be updated as the simulation moves into new regions of the state space. Since the late seventies, many functional forms with differing independent variable sets for these models have been suggested, and some have been implemented within flow sheet simulator packages [Perregaard 1992, Storen 1994, and Storen 1997].

This introductory chapter is followed by Chapter Two, where local models, data mining applications and technologies, evolutionary algorithms and their applications in chemical engineering, and optimization methods and their objective functions are reviewed. Chapter Three describes the proposed system structure, guidelines for determination of the GP controlling parameters, and the details of implementing the approach. Results and discussion are given in Chapter Four. Finally, in Chapter Five, conclusions and recommendations are presented.
CHAPTER TWO

LITERATURE REVIEW

This chapter reviews local models for the vapor-liquid partition coefficient (K value), data mining tasks and techniques, evolutionary algorithms, and optimization methods. In the first section, the development of local models is summarized. A review of data mining technologies is given in the second section. The third section describes the development of evolutionary algorithms and a comparison of different algorithms; more emphasis is given to genetic programming. A brief summary of applications of intelligent systems in chemical engineering is also given at the end of that section. In the fourth section, some popular optimization methods and pertinent objective functions (criteria) are reviewed.

2.1 Review of local models for phase partition coefficients

A correct estimate of physical and thermodynamic properties is a prerequisite for accurate chemical process design and operation. The calculation of those properties for pure components and mixtures contributes the major cost in computer time. Local thermodynamic models are practical alternatives to computationally expensive, more rigorous macroscopic or molecular models that involve implicit computational procedures, and they often complement them to accelerate computation for run-time optimization and control. Since the vapor-liquid equilibrium constant K is among the most computationally expensive properties [Leesley 1977], research efforts on developing local models have focused on it.

Since the late seventies, several research groups have developed local thermodynamic models and accompanying procedures to be implemented within flow sheet simulator packages. The objective is to replace, or assist, more rigorous thermodynamic models with local alternatives to reduce the computer time while maintaining the thermodynamic accuracy at an acceptable level.
Local thermodynamic models have been used to accelerate steady state calculations [Leesley et al. 1977, Chimowitz et al. 1983, Perregaard 1993], dynamic simulation [Chimowitz et al. 1984, Perregaard 1993] and dynamic optimization [Storen 1997]. The more rigorous thermodynamic models are nonlinear equation sets which involve iterative calculations for the vapor-liquid equilibrium constant (K) model. The local models are in explicit form, and linear with respect to their parameters. Their calculation procedure is shown in Figure 2.1, while the flowchart of the isothermal, isobaric flash calculation using an equation of state is shown in Figure 2.2. As can be seen, the explicit local models are much easier and faster to evaluate, but they are only valid locally. The local models must be updated if the simulation proceeds out of the region where the local model is valid. The implementation of the idea involves three major components: local model formulation, error monitoring and parameter updating, as shown in Figure 2.3.

Figure 2.1 Flowchart for an Isothermal Flash Calculation Algorithm that Uses Composition-Dependent Local Models

Figure 2.2 Flowchart for an Isothermal Flash Calculation Algorithm that Uses an Equation of State

Figure 2.3 Architecture of Local Model Strategy

The first component of local model based system development is to formulate the approximate local function. Leesley [1977] developed several local models for ideal solutions, which did not include composition dependence. He derived the local K model from the complete form:

    K_i = \frac{y_i}{x_i} = \frac{\gamma_i \varphi_i^{S} P_i^{s}}{\varphi_i^{v} P} \exp\!\left(\frac{1}{RT}\int_{P_i^s}^{P} V_i^{L}\, dP\right)    (2.1)

After simplification, through the ideal solution assumption and by avoiding complex functional forms, an approximation to Eq. (2.1) can be developed for low pressure:

    \ln K_i = A_{1,i} \ln P_i^{s}(T) + A_{2,i} \ln P    (2.2)

Eq. (2.2) reproduces the temperature dependence of K-values fairly well over the range of 50-100 °C. However, the relation is too approximate to be useful above pressures of 2-3 bars. Thus, a third adjustable coefficient has been introduced in the approximation formula for high-pressure applications:

    \ln K_i = \frac{A_{1,i}}{T} + A_{2,i} + A_{3,i} \ln P    (2.3)

Chimowitz [1983] extended the local models to non-ideal solutions for multicomponent vapor-liquid systems. One of the essential ideas has been to treat multicomponent mixtures as pseudo-binary solutions. The functional form used to model the K values, which is composition-dependent for each pseudo-binary, has also been derived from basic thermodynamic considerations.
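The inner flash iteration summarized in Figure 2.1 can be sketched for the simplest case of composition-independent K-values (the composition-dependent case adds an outer loop that re-evaluates K and repeats). The function name `flash_TP` and the bisection solver are illustrative choices; a two-phase solution is assumed to exist.

```python
def flash_TP(z, K, tol=1e-12):
    """Isothermal, isobaric flash with fixed K-values.

    Solves the material-balance (Rachford-Rice) equation of Figure 2.1,
        f(L) = sum_i z_i (K_i - 1) / (L + K_i (1 - L)) = 0,
    for the liquid fraction L by bisection (f is monotonically increasing in L),
    then recovers x_i = z_i / (L + K_i (1 - L)) and y_i = K_i x_i.
    """
    def f(L):
        return sum(zi * (Ki - 1.0) / (L + Ki * (1.0 - L)) for zi, Ki in zip(z, K))

    lo, hi = 1e-12, 1.0 - 1e-12   # assumes a two-phase split: f(lo) < 0 < f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) >= 0.0:
            hi = mid               # root lies in the lower half
        else:
            lo = mid
    L = 0.5 * (lo + hi)
    x = [zi / (L + Ki * (1.0 - L)) for zi, Ki in zip(z, K)]
    y = [Ki * xi for Ki, xi in zip(K, x)]
    return L, x, y

# Example: an equimolar binary feed with K1 = 2 and K2 = 0.5 splits at L = 0.5.
L, x, y = flash_TP([0.5, 0.5], [2.0, 0.5])
```

With a local K model, each pass through `flash_TP` is cheap and non-iterative in the thermodynamics; with an equation of state (Figure 2.2), the K-values themselves must be converged by equating fugacities, which is where the computational cost arises.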
In his work, Chimowitz presented a local model for non-ideal solutions:

    \ln K_i = \frac{A}{RT}(1 - x_i)^2 + \ln P_i^{s} - \ln P    (2.4)

Ledent [1994] used a sequential least squares procedure to build approximating formulae from a general model that contains all the terms necessary to represent any particular mixture:

    \ln K_i = A_1 + A_2 \ln P_i^{s}(T) + A_3 \ln P + \frac{A_4}{T} + A_5 \frac{P}{T} + \sum_{j=1}^{n} A_{5+j}\, x_j + \sum_{j=1}^{n} A_{5+n+j}\, (1 - x_j)^2    (2.5)

The problem with this formula is that it has too many parameters (5 + 2n) to be efficient. To eliminate the unnecessary terms, Eq. (2.5) is rewritten in the form of:

    A = (Q^T Q)^{-1} Q^T F    (2.6)

Eq. (2.6) is the least squares solution of Eq. (2.5), where A is the vector of local model parameters, F is the vector of ln(K) values obtained from experimental data, and Q is the matrix of terms. If one lets C = (Q^T Q)^{-1}, then

    Corr_{ij} = \frac{C_{ij}}{\sqrt{C_{ii} C_{jj}}}    (2.7)

If this correlation is found to be higher than a specified tolerance, one of the two parameters is eliminated from the corresponding line and column in matrix C. A stepwise regression strategy is applied: for each parameter introduction, the parameters are recomputed and the residuals are examined. If they are satisfactory, the local model is accepted; if not, the parameter is eliminated before introducing the next one. When all parameters have been examined, the ones that gave the lowest residuals are accepted.

The second component is the error monitor, which estimates the range of validity of the local models for a set of parameters and identifies when to update the local model. Leesley and Heyen [1977] fixed upper and lower values of the two independent variables, T and P. The bounds defined an interval which included the two data points used in calculating the parameters. Hillestad et al. [1989] and Storen [1994] developed different error models for predicting the deviation between local and rigorous thermodynamic property models.
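Eqs. (2.6) and (2.7) reduce to a few lines of linear algebra. The sketch below uses synthetic data and illustrative names; it builds a term matrix whose last two columns are nearly collinear, so the correlation test flags one of the two parameters for elimination.

```python
import numpy as np

def lsq_parameters_and_correlation(Q, F):
    """Eq. (2.6): A = (Q^T Q)^{-1} Q^T F, and Eq. (2.7):
    Corr_ij = C_ij / sqrt(C_ii C_jj), with C = (Q^T Q)^{-1}."""
    C = np.linalg.inv(Q.T @ Q)
    A = C @ Q.T @ F
    d = np.sqrt(np.diag(C))
    return A, C / np.outer(d, d)

# Toy term matrix: the third column nearly duplicates the second, so their
# parameter estimates come out almost perfectly correlated.
rng = np.random.default_rng(1)
t = np.linspace(1.0, 2.0, 50)
Q = np.column_stack([np.ones_like(t), t, t + 1e-4 * rng.standard_normal(50)])
F = 1.0 + 2.0 * t                         # "data" lying in the span of [1, t]
A, Corr = lsq_parameters_and_correlation(Q, F)
```

Here `abs(Corr[1, 2])` is close to 1, signalling that one of the two redundant terms should be dropped before refitting, exactly as in the stepwise procedure described above.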
The third component is parameter estimation for updating the models as the range of the model changes. Macchietto [1986] and Hillestad et al. [1989] applied recursive least-squares methods. The objective here is to preserve information from past data in the covariance matrix for the parameters. When a new data point is introduced, the covariance matrix can be updated and new values for the parameters can be obtained. Storen [1994] used a simplified scheme with correction factors. Ledent [1994] presented a sequential least squares procedure.

In summary, two major approaches were developed to synthesize local thermodynamic models. One method is to derive a relationship based on thermodynamic insight. Assumptions are made to simplify the relationship [Leesley 1977, Chimowitz 1983]. Each formula is suitable for a particular type of solution (ideal/non-ideal, etc.). The final structure of the formula mostly includes one constant term, one term that accounts for the temperature influence, and one term that accounts for the pressure influence. In the case of non-ideal solutions, one or more terms may be added to account for the composition influence. The other approach to the empirical formulation is on an evaluated statistical basis, i.e., to provide a general form of local model which includes all the terms described above and their combinations, and then eliminate the redundant terms by examining the correlation for every pair of parameters [Ledent 1994]. Both approaches are human-centered strategies, and they share a common task of developing local models: an initial function structure is proposed first, and then the function structure is simplified and reduced by applying different strategies.
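The recursive least-squares idea used by Macchietto [1986] and Hillestad et al. [1989] can be sketched as follows. The standard RLS gain and covariance recursions are assumed here, since the text does not reproduce them, and all names are illustrative.

```python
import numpy as np

def rls_update(A, P, q, f, lam=1.0):
    """One recursive least-squares step: fold the new observation (q, f),
    where f ~ q . A, into the parameter vector A and covariance matrix P
    without refitting on all past data. lam < 1 would discount old data."""
    K = P @ q / (lam + q @ P @ q)          # gain vector
    A = A + K * (f - q @ A)                # correct parameters by the new residual
    P = (P - np.outer(K, q) @ P) / lam     # shrink the covariance
    return A, P

# Stream four points of f = 1 + 2 t through the update, starting from an
# uninformative prior (large covariance); A converges to [1, 2].
A, P = np.zeros(2), 1e6 * np.eye(2)
for t, f in [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]:
    A, P = rls_update(A, P, np.array([1.0, t]), f)
```

The appeal for local-model updating is exactly the one stated in the text: each new simulation point costs only a rank-one update of P, rather than a full least-squares solve over the accumulated history.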
The human-centered approaches may bring some limitations to the final structure of the local models, due either to an oversimplified structure introduced by inappropriate assumptions made in the simplification procedure, or to an insufficient description of the studied system introduced by the proposed initial structure.

The form of the local model is important because it is closely related to the correlation capabilities of the local model. It is also very important that the local model ensures a fast, robust and consistent evaluation of the parameters. For this study, we are interested in a fully automatic algorithm which can develop formulae sufficiently flexible for all solution mixture types and functional forms. This can be obtained with symbolic regression through genetic programming.

2.2 Review of data mining techniques in the knowledge discovery process

In this study, genetic programming is used as a data mining tool for knowledge discovery in data. Knowledge discovery in databases (KDD) is the non-trivial process of searching for valid, novel, potentially useful and ultimately understandable patterns or models in data. It involves a number of steps [Thuraisingham 1999]. For the sake of simplicity, these steps can be grouped into three major stages: data preprocessing, data mining, and postprocessing. The simplified flowchart is shown in Figure 2.4.

Data mining, here, refers to a particular step in the overall knowledge discovery process. As the core stage of the KDD process, it focuses on applying discovery algorithms to find understandable and useful relationships from observed data. The data mining tasks can be divided into three major categories [Hand et al. 2001]: model building, discovering patterns and rules, and retrieval by content.

The tasks of model building can be categorized further based on objectives. The first is descriptive modeling. The goal of a descriptive model is to describe all of the data.
Examples of such descriptions include models for the overall probability distribution of the data (density estimation), partitioning of the n-dimensional space into groups (cluster analysis), and models describing the relationship between variables (dependency modeling). The second is predictive modeling. The aim of predictive modeling is to build a model that will allow the value of one variable to be predicted from the known values of other variables. The key distinction between prediction and description is that prediction has, as its objective, one or more specifically targeted variables, while in descriptive problems no single variable is central to the model. Classification and regression are two of the most popular applications of predictive modeling. In classification, the variable being predicted is categorical, while in regression the variable is quantitative.

From a data mining viewpoint, this study can be placed in the category of predictive modeling, which involves both model structure and parameter regression to build a local model for the vapor-liquid equilibrium coefficient K.

Figure 2.4 Knowledge Discovery Process

There are many different data mining techniques. In this section, only a few of the techniques that are applicable to predictive modeling are summarized and compared. These include linear regression, regression trees, neural networks, genetic algorithms and genetic programming.
Since the structure of the linear model is simple and easy to interpret, and estimation of parameters for linear models is straightforward, linear regression holds a special place among data-driven data analysis methods. A linear regression model can be represented as:

    y = a_0 + \sum_{i=1}^{n} a_i X_i    (2.8)

where the a_i's are parameters that need to be estimated by fitting the model to the given data set. X_i can simply be an original predictor variable x_i, or a more generalized form f(x_i), i.e., a transformation of the original x variables. f(x_i) could be a smooth function, such as a log, square-root, or cross-product term of the x_i's for polynomial models, which allows interaction among the x_i's in the model.

Parameter estimation for a linear regression model is straightforward through least squares fitting. However, selecting a proper model structure to fit the data is a challenge. This is because the selected model is generally empirical, rather than first-principles. The model may not include all of the predictor variables, or certain functions of the predictor variables, that are needed for correct prediction.

A regression tree (RT) can be viewed as a variant of a decision tree. It is designed for approximating real-valued functions, instead of being used for classification as a traditional decision tree is. A regression tree has a representation as shown in Figure 2.5.

Figure 2.5 Example of a Regression Tree

A regression tree is built through a process known as binary recursive partitioning. This is an iterative process that splits the data into partitions, and then splits the partitions up further on each of the branches. In the structure of a regression tree, each intermediate node is a decision node that contains a test on one predictor variable's value. The terminal nodes of the tree contain the predicted output variable values. The objective function for building an optimum tree structure, i.e., the minimized function, is the mean absolute distance.

The process of regression tree induction usually has two phases: building a tree structure that covers the training data, and pruning the tree to the best size using a validation data set. In the training process, at each node, the best split that
The process of regression tree induction usually has two phases: building a tree structure that covers the training data, and pruning the tree to its best size using a validation data set. In the training process, at each node the best split, the one that minimizes the mean absolute distance, is selected. Partitioning continues until a pre-specified minimum number of training data are covered by a node, or until the mean absolute distance within a node is zero. Pruning involves removing nodes from the bottom up so that there are fewer and fewer branches in the tree; the regression tree is pruned to avoid overfitting.

In terms of performance, regression trees are extremely effective at finding the key attributes in high-dimensional applications; in most applications these key features are only a small subset of the original feature set. On the negative side, regression trees cannot compactly represent many simple functions, for example linear functions. A second weakness is that the regression tree model is discrete, yet it predicts a continuous variable: for function approximation the expectation is a smooth continuous function, but a tree provides discrete regions that are discontinuous at the boundaries. As for explanatory capability, a regression tree cannot describe the relationship between the output variable and the predictor variables in the form of a function.

[Figure 2.6: A Simple Multilayer Neural Network Model with Two Hidden Nodes — three inputs x1, x2, x3 feed two hidden nodes h1, h2 through weights w1–w6; the hidden nodes feed the output y through weights v1, v2.]

Neural networks have been found useful because of their learning and generalization abilities. The model structure presented by a neural network is multiple layers of nonlinear transformations of weighted sums of the input variables. In the single-hidden-layer network shown in Figure 2.6, the w_i and v_i are weight factors, and h_i is a nonlinear transformation of a weighted sum of the input variables x.
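A forward pass through a network like that of Figure 2.6 can be sketched as follows (a minimal illustration assuming a logistic activation for the hidden nodes and a linear output node; the activation function is an assumption, as the text does not specify one):

```python
import math

def forward(x, w, v):
    """One forward pass: 3 inputs -> 2 hidden nodes -> 1 output.

    x: length-3 input vector; w: 2x3 input-to-hidden weights;
    v: length-2 hidden-to-output weights.
    """
    # Hidden nodes: nonlinear (logistic) transformation of weighted input sums
    h = [1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(row, x))))
         for row in w]
    # Output: weighted sum of the hidden-node values
    return sum(vi * hi for vi, hi in zip(v, h))

y = forward([1.0, 0.5, -0.2],
            [[0.1, 0.4, -0.3], [0.2, -0.1, 0.5]],  # w1..w6
            [0.7, -0.4])                            # v1, v2
```

Training would adjust the weights w and v to minimize the prediction error over the data, which is the step described next in the text.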
The output variable y is a weighted sum of the h_i. In general, therefore, the output variable y is a nonlinear function of the input variables x, and a neural network can be used as a nonlinear model for regression. If there is more than one hidden layer, the outputs from one layer, which are transformed linear combinations of the nodes in the previous layer, serve as inputs to the next layer; in this next layer, the inputs are combined in exactly the same way, i.e., each node forms a weighted sum that is then nonlinearly transformed. The number of layers and the number of nodes per layer are important decisions. There is no limit to the number of layers that can be used, though it can be proven that a single hidden layer (with enough nodes in that layer) is sufficient to model any continuous function [Hand 2001]. Once a network has been structured for a particular application, it is ready to be trained: the weights w_i and v_i are the parameters of the model and must be determined from the data in the training process.

The fact that a neural network is highly parameterized makes it very flexible, so that it can accurately model relatively small irregularities in functions. On the other hand, such flexibility brings a serious danger of overfitting; in recent years, strategies have been developed for overcoming this problem. Because of the multiple layers of nonlinear transformations of weighted sums, the relationship between the output variable y and the input variables x is hard to present in a single explicit mathematical form, so the neural network is usually used as a black box for predictive modeling.

Evolutionary algorithms (EAs) provide an effective avenue for structural and parametric regression. EAs were originally divided into three major categories, namely evolutionary programming (EP) [Fogel et al. 1966], evolution strategies [Rechenberg 1973], and genetic algorithms (GAs) [Holland 1975].
In the 1990s, a new branch called genetic programming (GP) was added to the group, introduced by John Koza [Koza 1992, 1994]. GP is an extension of John Holland's GA in which the genetic population consists of models of varying complexities and structures.

A GA uses a binary string to represent possible solutions to a problem, whereas GP uses a tree structure as its knowledge representation. Both GA and GP guide the search using genetic operators and the principle of survival of the fittest. The major difference between GA and GP is the coding used to represent possible solutions to a problem. In GA, a solution is presented in the form of a fixed-length binary string, and its output is a quantity; the aim of such coding is to allow the possible solutions to be manipulated by the genetic operators during the evolutionary process. Sometimes it is a challenge to encode the possible solutions as binary strings. GP uses tree structures of variable size, which allows the solutions to be manipulated in their current form. Therefore, GP can be used as a tool for symbolic regression, i.e., structural regression. The details of GP are explained in the next section.

2.3 Elements of structural regression using genetic programming

The objective of this research is to find an approximate function for K, which involves both parametric and structural regression. As mentioned earlier, genetic programming (GP) is an extension of the genetic algorithm in which the genetic population consists of possible solutions (that is, compositions of primitive functions and terminals). Koza [1992] demonstrated the surprising result that genetic programming is capable of symbolic regression.
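For illustration, a GP individual such as a + b·T + c·T² can be held as a nested tree of operators, variables and parameters, and evaluated recursively (a minimal sketch of the tree representation discussed above; the tuple encoding is an assumption, not the dissertation's implementation):

```python
def evaluate(node, env):
    """Recursively evaluate an expression tree.

    Intermediate nodes are (operator, left, right) tuples; terminal
    nodes are variable/parameter names (strings) or numeric constants.
    """
    if isinstance(node, tuple):
        op, left, right = node
        a, b = evaluate(left, env), evaluate(right, env)
        return {"+": a + b, "*": a * b, "^": a ** b}[op]
    return env.get(node, node) if isinstance(node, str) else node

# a + b*T + c*T^2 with a=1, b=2, c=3, evaluated at T=2 -> 1 + 4 + 12 = 17
tree = ("+", ("+", "a", ("*", "b", "T")), ("*", "c", ("^", "T", 2)))
value = evaluate(tree, {"a": 1, "b": 2, "c": 3, "T": 2})
```

Because the individual is a tree rather than a fixed-length string, genetic operators can cut and splice whole subtrees without invalidating the representation, which is what makes structural regression possible.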
To accomplish this, genetic programming starts with a pool of randomly generated mathematical models and genetically breeds the population using the Darwinian principle of survival of the fittest and an analog of the naturally occurring genetic crossover (sexual recombination) operation. In other words, genetic programming provides a way to search the space of possible model structures for a solution that fits, or approximately fits, a given data set.

Genetic programming is a domain-independent method that genetically breeds populations of models to fit the given data set by executing the following three steps, which are also shown in Figure 2.7:

Generate an initial population of random individuals (mathematical models) composed of the primitive functions and terminals of the problem.

Iteratively perform the following intermediate steps until the termination criterion has been satisfied:
  o Execute each individual in the population and assign it a fitness value according to how well it solves the problem.
  o Create a new population of individuals by applying the following three primary operations. The operations are applied to individual(s) in the population selected with a probability based on fitness (i.e., the fitter the individual, the more likely it is to be selected).
    Reproduction: copy an existing individual to the new population.
    Crossover: create two new offspring individuals for the new population by genetically recombining randomly chosen parts of two existing individuals. The genetic crossover (sexual recombination) operation (described below) operates on two parental individuals and produces two offspring individuals using parts of each parent.
    Mutation: randomly alter an existing individual, producing one offspring individual.

The single best individual in the population produced during the run is designated as the result of the run of genetic programming.
This result may be the solution (or an approximate solution) fitted to the given data set. A description of GP's components is given in the following subsections, covering the terminal set, function set, fitness function, genetic operators, and selection strategies.

[Figure 2.7: A Flowchart for Computation through Genetic Programming — starting from Gen = 0, create the initial population; evaluate the fitness of each individual; if the termination criterion is satisfied, designate the result and stop; otherwise probabilistically select a genetic operation (reproduction, N = N + 1; crossover, N = N + 2; mutation, N = N + 1), repeating until N reaches the population size, then increment the generation counter and repeat.]

2.3.1 Terminal set, function set and initial representation

In genetic programming, any explicit mathematical equation can be represented by a tree in which intermediate nodes are mathematical operators (functions) and terminal nodes (leaves) are input variables and parameters.

As shown in Figure 2.8, the equation for ideal heat capacity, C_p = a + bT + cT², can be represented by a tree:

[Figure 2.8: Functional representation of C_p = a + bT + cT² using a tree structure — the root "+" joins the subtree a + b·T to the subtree c·T², built from the operators +, * and ^ and the leaves a, b, c, T, and the constant 2.]

In this graphical depiction, the function set consists of the intermediate nodes of the tree, labeled with mathematical operators such as +, * and ^. The terminal set consists of the terminal nodes (leaves) of the tree, labeled with the input variable T, the parameters a, b, c, and the constant 2.

The terminal and function sets are important components of genetic programming: they contain the primitive elements from which the mathematical model is composed. The sufficiency property requires that the set of terminals and the set of primitive functions together be capable of expressing a solution to the problem.

2.3.2 Fitness measures for models of varying complexity

The most difficult and most important concept of genetic programming is the fitness function. The fitness function determines how well a generated model fits the data.
Fitness is the driving force of genetic programming. In genetic programming, each individual model in a population is assigned a fitness value.

2.3.2.1 Fitness function with no penalty for model complexity

The basic fitness function is a function of the difference between the model-predicted values and the data. Widely used basic fitness functions include the raw fitness, adjusted fitness, and normalized fitness. The raw fitness is the sum of squared errors. In particular, the raw fitness r(i,t) of an individual model i in a population of size M at generation t is

    r(i,t) = Σ_{j=1}^{Ne} [S(i,j) − C(j)]²    (2.9)

where S(i,j) is the value returned by individual model i for data case j (of Ne data cases) and C(j) is the data value for data case j. The closer this sum of squared errors is to zero, the better the model.

Another popular raw fitness is the absolute error. Since squared error gives greater weight than absolute error to extreme differences between the predicted values and the data, the quality of the model is perhaps more appropriately reflected by absolute error in some cases.

Each raw fitness value can be adjusted (scaled) to produce an adjusted fitness measure a(i,t):

    a(i,t) = 1 / (1 + r(i,t))    (2.10)

where r(i,t) is the raw fitness of individual model i at generation t. Unlike raw fitness, the adjusted fitness is larger for better individuals in the population; moreover, it lies between 0 and 1.

Each such adjusted fitness value a(i,t) is then normalized. The normalized fitness value n(i,t) is

    n(i,t) = a(i,t) / Σ_{j=1}^{M} a(j,t)    (2.11)

The normalized fitness not only ranges between 0 and 1 and is larger for better individuals in the population, but the normalized fitness values also sum to 1. Thus, normalized fitness is a probability value.
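Equations (2.9)–(2.11) can be sketched as follows (a minimal illustration; the candidate models here are stand-in functions rather than GP trees):

```python
def raw_fitness(model, cases):
    """Eq. (2.9): sum of squared errors over the data cases."""
    return sum((model(x) - y) ** 2 for x, y in cases)

def adjusted_fitness(r):
    """Eq. (2.10): maps raw fitness into (0, 1]; larger is better."""
    return 1.0 / (1.0 + r)

def normalized_fitness(adjusted):
    """Eq. (2.11): scales adjusted fitnesses so they sum to 1."""
    total = sum(adjusted)
    return [a / total for a in adjusted]

cases = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]          # data from y = 2x + 1
population = [lambda x: 2 * x + 1, lambda x: x ** 2]  # two candidate models
adj = [adjusted_fitness(raw_fitness(m, cases)) for m in population]
probs = normalized_fitness(adj)   # usable directly as selection probabilities
```

Because the normalized values sum to one, they can serve directly as the selection probabilities used by the genetic operators described in Section 2.3.3.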
2.3.2.2 Fitness function with model complexity control

It has been noted that, after a certain number of generations, the average size of the mathematical models in a population starts growing at a rapid pace, while the increase in model complexity does not yield a significant improvement in fitness. This behavior of GP is called bloat. Bloat often occurs in symbolic regression problems, where GP runs start from a population of small individuals and then grow in complexity in order to comply with all the data. In practice, bloat significantly affects the efficiency of GP: overcomplicated model structures are computationally expensive to evolve or use, can be hard to interpret, and may generalize poorly, i.e., overfit. Over the years, many theories have been proposed to explain bloat from different aspects, but none of them is universally accepted as a unified theory explaining the various observations on bloat. Accordingly, several strategies for the control of complexity have been proposed with different theoretical foundations.

2.3.2.2.1 Minimum description length

As described earlier, overfitting is a common problem in every GP application. Besides the complexity of the model generated by GP, the presence of noise in the data is another possible cause of overfitting: good results with noisy data are achievable only at the cost of precision over the entire data distribution. The issue of selecting a model of appropriate complexity to overcome the overfitting problem is therefore always a key concern in any GP application.

One proposed strategy is the Minimum Description Length (MDL) principle, which provides a trade-off between the accuracy and the complexity of the model by including a structure estimation term for the model.
The final model (the one with minimal MDL) is optimal in the sense of being a consistent estimate of the complexity of the model while achieving the minimum error.

A number of criteria have been proposed for MDL, which compare models based on a measure of goodness of fit penalized by model complexity. The most popular and widely used criteria are Akaike's Information Criterion (AIC) (Eq. (2.12)) and the Schwarz Bayesian Information Criterion (BIC) (Eq. (2.13)):

    AIC = (n/2) ln(MSE) + k    (2.12)

    BIC = (n/2) ln(MSE) + k ln(n)/2    (2.13)

where MSE is the mean squared prediction error, MSE = SSE/n; n is the number of data points used for the identification of the model, i.e., the sample size; and k is the number of model parameters.

Both AIC and BIC take the form of a penalized maximized likelihood. Their first term can be interpreted as an evaluation of the model's accuracy; the second is the penalty term, which can be interpreted as the complexity of the model, a function of the number of parameters and the depth of the tree structure representing the model. The two criteria use different penalties: AIC adds 1 for each additional variable included in the model, while BIC adds ln(n)/2.

2.3.2.2.2 Parsimony pressure

A variety of practical techniques have been proposed to control complexity bloat. Among these techniques, the parsimony pressure method is a simple and frequently used method to control bloat in genetic programming [Zhang et al. 1993, Zhang et al. 1995]. In this method, parsimony pressure is applied to the original fitness function:

    f_p(x) = f(x) − c·l(x)    (2.14)

where f_p(x) is the new fitness function with the parsimony pressure term, f(x) is the original fitness of model x as mentioned above, and c is a constant known as the parsimony coefficient.
l(x) is the size of model x, counted as the number of intermediate nodes in the tree representation, i.e., the number of mathematical operators appearing in the model. The new fitness function f_p(x) discourages large models by using the penalty term as a mild constraint. The penalty is simply proportional to model size, so the fitness of a model decreases as its size increases. The strength of the control over bloat is determined by the parsimony coefficient c. The value of this coefficient is very important: if c is too small, GP runs will still bloat wildly; if it is too large, GP will take size minimization as its main target and almost ignore fitness, incurring a loss of model accuracy and consequently weakening the predictive ability of the model. The proper value of the parsimony coefficient depends strongly on the specific problem being solved, the choice of functions and terminals, and various GP parameter settings. Very few theories have been proposed to help set the parsimony coefficient, and trial and error was widely used before Poli [2008] introduced a simple, effective, and theoretically sound solution to this problem.

The strategies introduced above focus the fitness function against complexity bloat and, hence, the overfitting problem. Beyond such anti-bloat selection rules, numerous empirical techniques based on improvements to the GP algorithm itself have also been proposed to control complexity bloat. Briefly, these techniques fall into two major categories: size and depth limits [Koza 1992], and anti-bloat genetic operators [Kinnear 1993, Langdon 1998, Langdon 2000, Crawford-Marks et al. 2002].

2.3.2.3 Fitness function using external validation

Section 2.3.2.2 introduced two well-accepted strategies for complexity control. Although based on different theories, both combine multiple objectives into a scalar fitness function.
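Both kinds of scalar criteria — the information criteria of Eqs. (2.12)–(2.13) and the parsimony-pressure fitness of Eq. (2.14) — can be sketched as follows (illustrative values only; the parsimony coefficient c here is an arbitrary placeholder, not a recommended setting):

```python
import math

def aic(n, mse, k):
    """Eq. (2.12): accuracy term plus a penalty of 1 per parameter."""
    return (n / 2) * math.log(mse) + k

def bic(n, mse, k):
    """Eq. (2.13): accuracy term plus a penalty of ln(n)/2 per parameter."""
    return (n / 2) * math.log(mse) + k * math.log(n) / 2

def parsimony_fitness(f, size, c=0.01):
    """Eq. (2.14): original fitness minus a size-proportional penalty."""
    return f - c * size

# A lower AIC/BIC is better: a 3-parameter model must reduce the MSE enough
# to justify its extra parameter relative to a 2-parameter model.
simple, complex_ = aic(50, 0.20, 2), aic(50, 0.18, 3)
```

For the same data, BIC penalizes each extra parameter more heavily than AIC whenever ln(n)/2 > 1 (i.e., n > e²), so BIC tends to select smaller models.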
A different strategy for choosing models is sometimes used, based not on adding a penalty term but on external validation of the model. The basic idea is to randomly split the data into two parts, a training set and a validation set. The training set is used to construct the models and estimate the parameters; the fitness function is then recalculated using the validation set, and these validation scores are used to select models. In the validation context, since the training set and validation set are independently and randomly selected, the validation score of a given model provides an unbiased estimate of that model's fitness on new data points; the difference in validation scores can therefore be used to choose between models. This general idea of validation has been extended to the notion of cross-validation. The external validation method is discussed further in Section 3.5.1.3.

2.3.3 Genetic operators

In this section, three genetic operators are described in detail. Reproduction and crossover are introduced first as the two primary operations, and mutation, including its two different types, is introduced as a secondary operation.

2.3.3.1 Reproduction and crossover

The two primary genetic operations in GP for modifying structures are fitness-proportionate reproduction, shown in Figure 2.9, and crossover, shown in Figure 2.10.

Fitness-proportionate reproduction is an asexual operation in that it operates on only one parental individual (model), and the result of the operation is one offspring individual (model). In this operation, if f(i,t) is the fitness of an individual i in a population of size M at generation t, the individual i will be copied into the next generation with probability

    f(i,t) / Σ_{j=1}^{M} f(j,t)    (2.15)

The operation of fitness-proportionate reproduction does not create anything new in the population.
It increases or decreases the number of occurrences of individuals already in the population, and improves the average fitness of the population (at the expense of its genetic diversity): it increases the number of occurrences of more fit individuals and decreases the number of occurrences of less fit individuals.

The crossover (recombination) operation starts with two parental individuals (models). Both parents are selected from the population with a probability equal to their normalized fitness. The result of the crossover operation is two offspring individuals (models). Unlike fitness-proportionate reproduction, the crossover operation creates new individuals in the population.

[Figure 2.9: An Example of the Reproduction Operator — the parent a·T + b is copied unchanged into the offspring a·T + b.]

The operation begins by randomly and independently selecting one point in each parent using a specified probability distribution (discussed below). The number of points in the two parental individuals is typically not equal. As will be seen, the crossover operation is well defined for any two individuals: for any two individuals and any two crossover points, the resulting offspring are always valid individuals in the population, and each offspring contains some traits from its parents.
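Subtree crossover on tree-encoded models can be sketched as follows (a minimal illustration of the swap described here; the crossover points are fixed to reproduce the example of Figure 2.10 rather than chosen at random, and the tuple encoding is an assumption):

```python
# Trees are ("op", left, right) tuples; leaves are variable names or constants.
parent1 = ("+", ("*", "a", "T"), "b")               # a*T + b
parent2 = ("+", ("*", "a", ("^", "T", 2)), "b")     # a*T^2 + b

def crossover(p1, p2):
    """Swap the second argument of each parent's multiplication subtree.

    In actual GP the crossover points are selected randomly; fixing them
    here reproduces the subtree exchange of Figure 2.10.
    """
    frag1 = p1[1][2]    # crossover fragment "T" from parent 1
    frag2 = p2[1][2]    # crossover fragment ("^", "T", 2) from parent 2
    child1 = ("+", ("*", "a", frag2), p1[2])   # a*T^2 + b
    child2 = ("+", ("*", "a", frag1), p2[2])   # a*T + b
    return child1, child2

c1, c2 = crossover(parent1, parent2)
```

Because whole subtrees are exchanged, both children are themselves well-formed trees, which is the validity property emphasized in the text.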
Since entire subtrees are swapped, this ge netic crossover (recombination) operation produces syntactically and semantically v alid individuals as offspring regardless of which point is selected in either par ent. For example, consider the parental individuals, i.e., algebraic equation b T a + and b a T +2. These two models can be depicted graphically as rooted, pointlabeled trees with ordered branches. The two parental models are shown in Figure 2.10. S uppose that the crossover points are randomly selected for each parent indivi dual. The crossover points are therefore the in the first parent and the + in th e second parent. The places from which the crossover fragments were removed are identified with dash line. 2.3.3.2 Mutation Mutation is another important feature of genetic pr ogramming. Two types of mutations are possible. In the first type, a functi on can only replace a function or a terminal can only replace a terminal. In the second type, an entire subtree can replace another subtree. Figures 2.11 and 2.12 explain the concept of mutation: PAGE 45 34 Crossover point Crossover point Parent1: b T a + Parent2: b a T +2 Offspring1: b T a + 2 Offspring2: b a T + Figure 2.10 Crossover Operation for an Algebraic Eq uation Manipulation Parent: b T a + 2 Offspring: a a T + + )2 ( Figure 2.11 An Example of Mutation Operation, Type I + a T b + ^ 2 T b a + T b a + a ^ 2 T b + a ^ 2 T b + a + 2 T a PAGE 46 35 Parent: b T a + 2 Offspring: b T a + Figure 2.12 An Example of Mutation Operation, Type II 2.3.4 Selection strategy There are many different selection methods based on fitness. The most popular is fitnessproportionate selection. If f (i, t) is the fitness of individual i in the population at generation t, then, under fitnessproportionate selection, the probability that individual i will be selected to process genetic operation is: = M j tj f ti f 1 ) ( ) ,( (2.16) where M is the population size. 
Among the alternative sele ction methods are tournament selection and rank selection [Goldberg 1989]. In ra nk selection, selection is based on the rank (not the numerical value) of the fitness value s of the individuals in the population. Rank selection reduces the potentially dominating e ffects of comparatively highfitness individuals in the population by establishing a pre dictable, limited amount of selection pressure in favor of such individuals. At the same t ime, rank selection exaggerates the + a T b + a ^ 2 T b PAGE 47 36 difference between closely clustered fitness values so that the better ones can be sampled more. In tournament selection, a specified group of indiv iduals (typically two) are chosen at random from the population and the one wi th the better fitness (i.e., the lower standardized fitness) is then selected. 2.4 Parametric regression: review of objective functions and op timization methods Optimization techniques are used to find a set of va lues for design variables that best meet an objective. In parameter estimation, th e optimum parameter values are searched with the aid of a selected optimization me thod, to minimize or maximize a welldefined objective function that depends on the parameters, measurements and the model. The objective function is a suitable measure of the overall departure of the model performance from the observed measurements. A widel y used objective function in parameter estimation is the least squares formulati on. For the model equation ) ( b x f Y = bis denoted as a vector that is an estimate of the parameter vector b x is a vector of independent variables, the sum of sq uared residuals is: ) ()' ( '* *Y Y Y Y = = Fe e (2.17) where Y is vector of experimental observations of the depen dent variables. 
The least-squares method evaluates the unknown vector b̂ by minimizing the sum of squared residuals Φ.

In linear regression analysis, the equation is a linear function of the parameters, of the form

    Y* = Xb    (2.18)

so that Eq. (2.17) becomes

    Φ = e'e = (Y − Xb̂)'(Y − Xb̂)    (2.19)

In order to calculate the vector b̂ that minimizes Φ, the partial derivative of Φ with respect to b̂ is taken and set equal to zero:

    ∂Φ/∂b̂ = −X'(Y − Xb̂) − (Y − Xb̂)'X = 0    (2.20)

Eq. (2.20) can be simplified and rearranged to yield

    b̂ = (X'X)⁻¹X'Y    (2.21)

Therefore, the value of the parameter vector b̂ can be obtained directly from Eq. (2.21).

In nonlinear regression analysis, the model equation Y* = f(x, b) is nonlinear with respect to the parameters. There are several techniques for minimizing the sum of squared residuals described by Eq. (2.17). These techniques are broadly classified into two categories: gradient methods and direct search methods. Gradient search methods require derivatives of the objective function, whereas direct methods are derivative-free and rely solely on function evaluations. Gradient search methods are efficient for smooth functions, and remain usable if there are some discontinuities in the derivatives; direct search techniques, which use only function values, are more efficient for highly discontinuous functions.

The major gradient search methods include Gauss-Newton, steepest descent, Marquardt, and Newton's method. All of these are local methods that provide a global solution only for convex models.

The Gauss-Newton method converts the nonlinear problem into a linear one by approximating the function Y* by a Taylor series expansion around an estimated value of the parameter vector b:

    Y* = Y*(x, b_m + Δb) = Y*(x, b_m) + (∂Y*/∂b)Δb = Y*_m + JΔb    (2.22)

where the Taylor series has been truncated after the second term. Eq.
(2.22) is linear in Δb. The problem has therefore been transformed from finding b to finding the correction Δb that must be added to an estimate of b to minimize the sum of squared residuals. J is the n × k Jacobian matrix of partial derivatives of Y* with respect to the parameters, evaluated at all n points where experimental observations are available:

    J = [ ∂Y*_1/∂b_1  ...  ∂Y*_1/∂b_k
          ...              ...
          ∂Y*_n/∂b_1  ...  ∂Y*_n/∂b_k ]    (2.23)

and

    Δb = (J'J)⁻¹J'(Y − Y*)    (2.24)

    b_{m+1} = b_m + Δb    (2.25)

where m is the iteration counter.

The drawback of the Gauss-Newton method is that the incremental changes, namely the Δb described above, can be estimated very poorly due to the computation of the inverse (J'J)⁻¹ when J'J is close to singular. The result is that convergence may be very slow, with a large number of iterations being required. The Δb may even take the wrong signs, in which case the procedure moves in the wrong direction, and the method may not converge at all, with the residual sum of squares continuing to increase. Also, the closer a model is to behaving like a linear model, the more likely the method is to converge in a small number of iterations from a reasonable starting point and, more important, the larger its zone of convergence compared with a far-from-linear model. Since the objective function is quadratic in nature, the Gauss-Newton method is susceptible to these difficulties.

In the steepest descent method, the gradient of the scalar objective function gives the direction of the greatest objective function decrease at any point. Therefore, the initial vector of parameter estimates is corrected in the direction of the negative gradient of Φ:

    Δb = −K ∂Φ/∂b    (2.26)

where Φ is the sum of squared residuals, K is a suitable constant factor, and Δb is the correction vector to be applied to the estimated value of b to obtain a new estimate of the parameter vector, as before:

    b_{m+1} = b_m + Δb    (2.27)

where m is the iteration counter.
Δb can then be calculated from

    Δb = 2KJ'(Y − Y*)    (2.28)

where J is the Jacobian matrix of partial derivatives of Y* with respect to b, evaluated at all n points where experimental observations are available, as in the Gauss-Newton method.

The steepest descent method moves toward the minimum sum of squares without diverging, provided that the value of K, which determines the step size, is small enough. The value of K may be held constant throughout the calculation, or it may change at each calculation step. However, the rate of convergence to the minimum decreases as the search approaches the minimum.

The Marquardt method is a compromise between the Gauss-Newton and steepest descent methods. The interpolation is achieved by adding the diagonal matrix λI to the matrix J'J in the Gauss-Newton expression for Δb:

    Δb = (J'J + λI)⁻¹J'(Y − Y*)    (2.29)

The value of λ is chosen at each iteration so that the corrected parameter vector results in a lower sum of squares in the following iteration. As the equation for Δb shows, as λ → 0 the Marquardt method approaches the Gauss-Newton method, while as λ → ∞ it becomes identical to steepest descent, except for a scale factor that does not affect the direction of the parameter correction vector but gives a small step size. In this way, by selecting an appropriate value of λ, an indicator of the compromise between Gauss-Newton and steepest descent, the Marquardt method combines the best features of the two methods: it almost always converges and does not slow down [Draper et al. 1981].

In Newton's method, similarly to the Gauss-Newton method, which approximates the function Y* by a Taylor series expansion to the second term, the sum of squared residuals Φ is expanded in a Taylor series up to the third term:

    Φ(x, b_m + Δb) = Φ(x, b_m) + (∂Φ/∂b)'Δb + (1/2) Δb'(∂²Φ/∂b²)Δb    (2.30)

Taking the partial derivative of both sides of Eq.
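One iteration of Eq. (2.29) can be sketched as follows (a minimal illustration for a hypothetical two-parameter exponential model y = b1·exp(b2·x) with synthetic data; λ is held fixed here rather than adapted at each iteration as the Marquardt method prescribes):

```python
import math

def marquardt_step(b, xs, ys, lam=0.1):
    """One Marquardt update: db = (J'J + lam*I)^-1 J'(Y - Y*)."""
    b1, b2 = b
    resid = [y - b1 * math.exp(b2 * x) for x, y in zip(xs, ys)]
    # Jacobian rows [dY*/db1, dY*/db2] at each data point (Eq. 2.23)
    J = [[math.exp(b2 * x), b1 * x * math.exp(b2 * x)] for x in xs]
    # Normal-equation matrix A = J'J + lam*I and right-hand side g = J'resid
    A = [[sum(r[i] * r[j] for r in J) + (lam if i == j else 0.0)
          for j in range(2)] for i in range(2)]
    g = [sum(r[i] * e for r, e in zip(J, resid)) for i in range(2)]
    # Solve the 2x2 system A db = g by Cramer's rule
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    db = [(A[1][1] * g[0] - A[0][1] * g[1]) / det,
          (A[0][0] * g[1] - A[1][0] * g[0]) / det]
    return [b1 + db[0], b2 + db[1]]           # Eq. (2.25): b_{m+1} = b_m + db

xs = [0.0, 0.5, 1.0, 1.5]
ys = [2.0 * math.exp(0.8 * x) for x in xs]    # noise-free data, b = (2, 0.8)
b = [1.8, 0.7]                                # reasonable starting point
for _ in range(20):
    b = marquardt_step(b, xs, ys)
```

Setting lam to 0 recovers the Gauss-Newton step of Eq. (2.24), while a very large lam reduces the step to a small move along J'(Y − Y*), i.e., the steepest descent direction.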
(2.30) with respect to b gives

    ∂Φ/∂b = (∂Φ/∂b)|_{b_m} + (∂²Φ/∂b²)Δb = −2J'(Y − Y*) + HΔb    (2.31)

where H is the k × k Hessian matrix of second-order partial derivatives of Φ with respect to b, evaluated at all n points where experimental observations are available:

    H = [ ∂²Φ/∂b_1²      ...  ∂²Φ/∂b_1∂b_k
          ...                 ...
          ∂²Φ/∂b_k∂b_1   ...  ∂²Φ/∂b_k² ]    (2.32)

The above description of optimization methods is based on least squares as the objective function. Besides least squares, maximum likelihood estimation is also popularly used as an objective function for parameter estimation; it is based on statistical principles and accounts for data quality.

In general, statistically based parameter estimation reduces the problem of determining the parameters of the mechanistic model to assessing the correspondence between the residuals generated by a particular set of parameter values and the assumptions made about the residual distribution.

It is assumed that every vector of residuals e_i for an experiment is a random vector following a probability density function of specified form p_i(e_i | b), where b is the unknown vector of parameters. The experimental outcome of an e_i vector is regarded as a random sample from the distribution defined by p_i. The combination of all p_i for all e_i results in the likelihood function L(b; e_1, e_2, e_3, ...), which, for a correct specification of the joint probability density function of all the e_i and known true b values, represents the probability density of obtaining just the set of e_i vectors observed experimentally. In the parameter estimation situation, b is not known.
So, in maximum likelihood estimation, the unknown values are searched with the aid of an optimization method that maximizes the function L. This means that the optimal values of b obtained are those parameter values that generate the residual pattern for which the probability density is highest.

The maximum likelihood (ML) method can be used with any joint probability density functional form of the residuals, but a specific distribution of the residuals has to be assumed before the ML method is applied. In most applications, the normal distribution is used. However, these assumptions are often not fulfilled at the optimal parameter values determined, because of random measurement errors; systematic measurement errors such as drift, calibration, and measurement technique; deterministic model inadequacies; and errors in values of dependent variables assumed to be precisely known.

Besides these measurement errors, there are three types of computation-related errors: the truncation error, the round-off error, and the propagation error. The truncation error is a function of the number of terms that are retained in the approximation of the solution from an infinite series expansion. Since computers carry numbers using a finite number of significant figures, a round-off error is introduced when the computer rounds up or down (or simply chops) a number to n significant figures. Meanwhile, the truncation and round-off errors may accumulate and propagate, creating the propagation error, which may grow in an exponential or oscillatory pattern. Thus, these errors may cause the calculated solution to deviate drastically from the correct solution.

All the errors explained above may affect the distribution properties of the residuals; in other words, the assumptions made about the probability density function of the residuals may be violated. In this case, the optimal parameter values may not be trustworthy.
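The Marquardt update of Eq. (2.29) can be sketched numerically. The following is a minimal illustration, not the dissertation's implementation: it fits a hypothetical two-parameter exponential model to synthetic, noiseless data, accepting a step only when it lowers the sum of squares and adjusting λ accordingly.

```python
import numpy as np

def marquardt(residual_fn, jacobian_fn, b0, max_iter=100, lam=1e-3):
    """Minimal Levenberg-Marquardt sketch following Eq. (2.29):
    delta_b = (J'J + lam*I)^(-1) J' r, where r = Y - Yhat."""
    b = np.asarray(b0, dtype=float)
    for _ in range(max_iter):
        r = residual_fn(b)                      # r = Y - Yhat(b)
        J = jacobian_fn(b)                      # dYhat/db at each observation
        A = J.T @ J + lam * np.eye(b.size)
        delta = np.linalg.solve(A, J.T @ r)
        if np.sum(residual_fn(b + delta) ** 2) < np.sum(r ** 2):
            b = b + delta                       # accept: lean toward Gauss-Newton
            lam *= 0.5
        else:
            lam *= 2.0                          # reject: lean toward steepest descent
        if np.linalg.norm(delta) < 1e-10:
            break
    return b

# fit y = b0 * exp(b1 * x) to synthetic data (true parameters [2.0, -1.5])
x = np.linspace(0.0, 1.0, 20)
true_b = np.array([2.0, -1.5])
y = true_b[0] * np.exp(true_b[1] * x)

res = lambda b: y - b[0] * np.exp(b[1] * x)
jac = lambda b: np.column_stack([np.exp(b[1] * x),
                                 b[0] * x * np.exp(b[1] * x)])
b_hat = marquardt(res, jac, [1.0, -0.5])
```

Because the data are noiseless, the recovered parameters approach the true values; with noisy data the same loop converges to the least-squares estimate instead.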
2.5 Applications of intelligent systems in chemical engineering

Development of intelligent systems in process engineering has mainly focused on the following six areas [Stephanopoulos 1987, 1994]:

Process design: select thermodynamic models and estimate physical properties, i.e., select the best thermodynamic model(s) for the problem [Fredenslund 1980, Banares-Alcantara et al. 1985, and Gani 1989]. If no explicit models are available, the system could select a method and estimate the value of the physical property [Friese 1998], or use process data with engineering heuristics to recommend the optimal processing method for the task at hand [Bamicki 1990].

Fault diagnosis: process troubleshooting, i.e., determining the origins of process problems and recommending solutions [Frank 1997, Ozyurt 1998, Ruiz et al. 2000].

Process control: improving process control through utilization of qualitative process information, trend analysis, neural networks, etc.

Planning and operations: scheduling, developing procedures, assessing safety concerns, executing complex interrelated procedures, and aiding maintenance [Csukas 1998].

Modeling and simulation: using qualitative reasoning and symbolic computing to model and simulate chemical processes [Cao et al. 1999, McKay et al. 1997, Csukas 1998, Greeff 1998, Gao et al. 2001, Hinchliffe 2003, Grosman et al. 2004].

Product design, development, and selection: recommending chemical formulations, compositions, materials, process procedures, etc., required to design, develop, or select a new or existing product that achieves specified objectives [Xu 2005].

The foundation of accurate chemical process design and simulation is a correct estimate of physical and thermodynamic properties. In this exploratory project, an automatic procedure will be developed to identify a thermophysical model from a set of given data.
This model is expected to be used in further investigation of the physical system or to validate the structure of an existing model developed in some other way.

CHAPTER THREE

A HYBRID SYSTEM FOR STRUCTURAL AND PARAMETRIC OPTIMIZATION

In this chapter, the structure of the hybrid system and its implementation are introduced. The structure of the hybrid system is presented in the first section. The data used throughout this research and its preparation are described in the second section. In the third section, the regression strategies applied in this research are introduced. The fourth section describes the determination of the genetic programming (GP) control parameters. In the last section, the model verification strategies are summarized.

3.1 The system structure

The purpose of this research is to investigate the feasibility of designing a general-purpose machine function identification system that can automatically build a function model to fit given experimental data. The approach to solving the function identification problem is to couple a symbolic computing method (genetic programming) with a parameter regression method. The parameter regression is either linear or nonlinear, depending on the problem definition. The two-layer structure is shown in Figure 3.1; the parameter regression is embedded in the genetic programming as an inner layer. For a given data set, the structural regression system searches the space of mathematical models, dynamically creates new generations of mathematical models using genetic programming, and optimizes the parameters of the generated models using a linear or nonlinear regression algorithm, in an effort to develop the best model and associated parameters that represent (fit) the data. The complete procedure ends with a statistical analysis, which is used to evaluate the model and its performance.
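The two-layer idea can be sketched with a toy example. The sketch below is illustrative only: the structure space here is a hypothetical set of basis-function subsets searched by random mutation (standing in for the GP outer layer), and the inner layer is linear least squares; the actual system evolves expression trees and supports nonlinear regression.

```python
import numpy as np
import random

# Hypothetical model structures: subsets of basis functions of x.
BASIS = {"1": lambda x: np.ones_like(x), "x": lambda x: x,
         "x^2": lambda x: x ** 2, "exp(x)": lambda x: np.exp(x)}

def fit_params(structure, x, y):
    """Inner layer: linear least squares for the chosen structure."""
    A = np.column_stack([BASIS[name](x) for name in structure])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    sse = float(np.sum((A @ coef - y) ** 2))
    return coef, sse

def search(x, y, n_iter=200, seed=0):
    """Outer layer: mutate the structure, keep the best (SSE fitness)."""
    rng = random.Random(seed)
    best = ["1"]
    _, best_sse = fit_params(best, x, y)
    for _ in range(n_iter):
        # toggle one randomly chosen basis function in or out
        cand = sorted(set(best) ^ {rng.choice(list(BASIS))}) or ["1"]
        _, sse = fit_params(cand, x, y)
        if sse < best_sse:
            best, best_sse = cand, sse
    return best, best_sse

x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x ** 2          # true structure: {1, x^2}
best, sse = search(x, y)
```

Every candidate structure gets its parameters re-estimated by the inner layer before its fitness is compared, which is exactly the division of labor in Figure 3.1.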
[Figure 3.1: A Hybrid System Structure for Structural and Parametric Optimization. Raw data passes through data pre-processing (outlier detection, PCA); the outer layer performs model structure identification and the inner layer performs parameter estimation; model testing and statistical analysis yield the final model candidate with its structure and parameter values.]

3.2 Data and data preparation

Generally, the data to be mined is voluminous, incomplete or imprecise, noisy, has missing values, and contains redundant or insignificant entries. To obtain a better mining result, the data is preprocessed to eliminate such data points. In other words, preprocessing is a sequence of operations converting raw data into a representation suitable for the processing tasks. Data preparation is one of the most important steps in the model development process: from the simplest analysis to the most complex model, the quality of the data used is vital to the success of the modeling. Once the data is cleaned, the data set is ready for modeling.

The widely applied data preprocessing approaches include data integration, data cleaning, and feature selection [Freitas 2002]. Data integration is necessary if the data to be mined comes from several different sources; this step involves, for instance, removing inconsistencies in variable names or variable value names between data sets from different sources. For data cleaning, it is important to make sure that the data to be mined is as accurate as possible; this step may involve detecting and correcting errors in the data, filling in missing values, etc. Feature selection consists of selecting a subset of features (variables) relevant for mining the data among all original features.

Data on vapor-liquid equilibrium can be obtained from various sources. In this research, the equilibrium data for the propylene-propane and acetone-water systems are taken from DECHEMA.
Since the linear regression method is sensitive to outliers, outliers particularly need to be detected, and deleted if there are any. Outliers can be detected from the studentized residuals, defined for the i-th observation as:

r_i = (y_i − ŷ_i) / sqrt(MSE (1 − h_ii))   (3.1)

where h_ii is the i-th diagonal element of the hat matrix H, defined as

H = X (X'X)^(-1) X'

and MSE is the mean square error. When working with a sufficient number of observations, e.g. (n − p − 1) > 20, where n is the number of observations and p is the number of parameters, |r_i| > 2.0 indicates that the i-th observation might be an outlier. Similarly, |r_i| > 2.5 is a strong indicator of a likely outlier.

3.3 The regression strategy

The developed GP package includes both linear and nonlinear regression options for parameter estimation. The regression procedure in this package is shown in Figure 3.2. The regression strategy can be selected depending on the definition of the problem. All individuals in the population are checked for linearity automatically. Depending on the characteristics of the problem, the user needs to assign a value to the model-type indicator before running the program. The assigned value can be 1 for linear models only, 0 for both linear and nonlinear models, and −1 for nonlinear models only. If the final model is expected to be linear and the indicator's value is set to 1, the program deletes the nonlinear models and applies linear regression. If the final model is expected to be nonlinear, the indicator's value is set to −1; the program then deletes the linear models and enters the nonlinear regression procedure. If the final model could be either linear or nonlinear, the user predefines the indicator's value as 0; then, after checking the linearity of each individual model, the program applies the corresponding regression strategy, considering both types of models.
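The outlier screen of Eq. (3.1) can be sketched as follows. The data here are synthetic, a straight line with one planted gross error, rather than the dissertation's equilibrium data; the 2.5 cutoff is the "strong indicator" threshold from the text.

```python
import numpy as np

def studentized_residuals(X, y):
    """Studentized residuals per Eq. (3.1):
    r_i = (y_i - yhat_i) / sqrt(MSE * (1 - h_ii)),
    with hat matrix H = X (X'X)^-1 X' and MSE = SSE / (n - p)."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y                     # ordinary residuals
    mse = float(e @ e) / (n - p)
    h = np.diag(H)
    return e / np.sqrt(mse * (1.0 - h))

# straight-line data with one gross error planted at index 10
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 40)
y = 3.0 + 0.5 * x + rng.normal(0.0, 0.05, 40)
y[10] += 2.0                          # the outlier
X = np.column_stack([np.ones_like(x), x])
r = studentized_residuals(X, y)
flagged = np.flatnonzero(np.abs(r) > 2.5)
```

Only the planted point exceeds the 2.5 threshold; after deleting it, the regression would be repeated on the cleaned data.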
To identify the linearity of a model, the first-order derivative of the individual model with respect to each parameter is taken. If all derivatives of the model with respect to the parameters are constant, the model is linear in the parameters. For example:

y = a + b x   (3.2)
y = a + b² x   (3.3)

Eq. (3.2) is linear, while Eq. (3.3) is nonlinear. For Eq. (3.2), ∂y/∂a = 1 and ∂y/∂b = x; since neither parameter a nor b appears in the derivative terms, Eq. (3.2) is linear with respect to the parameters a and b. Similarly, for Eq. (3.3), ∂y/∂a = 1 and ∂y/∂b = 2bx; although the derivative with respect to parameter a is constant, the partial derivative with respect to b is a function of b. Therefore, Eq. (3.3) is not linear with respect to parameter b. For the identification of linearity, the MATLAB Symbolic Math Toolbox is used.

Linear least squares regression is a very popular tool for empirical process modeling because of its effectiveness and completeness. The estimates of the unknown parameters obtained from linear least squares regression are unique and are regarded as the global optimal values. Furthermore, linear least squares makes very efficient use of data and is easy to implement; good results can be obtained with relatively small data sets.

[Figure 3.2: The Structure for the Linear-Nonlinear Regression Strategy. The flowchart checks each model's linearity; depending on the model-type indicator, a model is either deleted or routed to linear regression (linear least squares) or to nonlinear regression, which calls the MATLAB function lsqnonlin repeatedly, restarting from the latest parameter values x, until the relative change of resnorm falls below 0.001 or i > 200.]
Note: i is the iteration number; x0 is the initial value of the parameters; x is the parameter vector obtained from the Marquardt method; resnorm is the sum of squares at the (i+1)-th iteration; old_resnorm is the sum of squares at the i-th iteration; lsqnonlin is the MATLAB nonlinear regression (Marquardt) function.

The evolution of the model is an automatic process. The individual models generated in the process can be very complicated, and their structures are neither predictable nor easily simplified. This increases the difficulty of convergence during the parameter regression stage if the nonlinear regression strategy is applied. Nonlinear least squares regression may involve iterative computation to estimate the parameters. With functions that are linear in the parameters, the least squares estimates of the parameters can always be obtained analytically, while that is generally not the case with nonlinear models. The use of iterative numerical procedures for nonlinear regression requires the user to provide initial values for the unknown parameters before the optimization process starts. The initial values should generally be reasonably close to the real parameter values, or the optimization procedure may not converge; unless the problem is convex, poor initial values may result in convergence to a local minimum rather than the global minimum.

To ensure that the parameter regression converges robustly and more precisely, an automated restart operation and two stop criteria are implemented. These force the algorithm to repeat the regression process before it stops. The automated restart operation initializes the parameter values with the regressed parameter values from the last run.
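The linearity test described above is symbolic in the dissertation (MATLAB's Symbolic Math Toolbox); the sketch below substitutes a numerical surrogate for illustration. The idea is the same: a model y = f(x; b) is linear in b exactly when the gradient of f with respect to b does not depend on b, so finite-difference gradients taken at two different random parameter vectors must agree.

```python
import numpy as np

def is_linear_in_params(model, n_params, x, seed=0):
    """Numerical linearity test (a stand-in for the symbolic check):
    compare finite-difference parameter gradients at two random
    parameter vectors; they coincide iff the model is linear in b."""
    rng = np.random.default_rng(seed)

    def grad(b):
        eps = 1e-6
        g = []
        for j in range(n_params):
            bp, bm = b.copy(), b.copy()
            bp[j] += eps
            bm[j] -= eps
            g.append((model(x, bp) - model(x, bm)) / (2.0 * eps))
        return np.array(g)

    b1 = rng.normal(size=n_params)
    b2 = rng.normal(size=n_params)
    return np.allclose(grad(b1), grad(b2), atol=1e-4)

x = np.linspace(0.1, 2.0, 10)
linear = is_linear_in_params(lambda x, b: b[0] + b[1] * x, 2, x)        # Eq. (3.2)
nonlinear = is_linear_in_params(lambda x, b: b[0] + b[1] ** 2 * x, 2, x)  # Eq. (3.3)
```

For Eq. (3.2) both gradients are (1, x) regardless of b; for Eq. (3.3) the b-component of the gradient is 2bx, which differs between the two random draws.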
The two termination criteria are: the relative change of the sum of squares is less than a set value, i.e.,

(old_resnorm − resnorm) / old_resnorm < 0.001   (3.4)

or the number of iterations i exceeds a predefined number, as shown in Figure 3.2.

3.4 Implementation with a MATLAB based genetic search toolbox

The Genetic Search Toolbox provides an integrated environment for performing a genetic search, including a collection of genetic operations used to implement genetic search methods. A schematic representation of the genetic search process is shown in Figure 3.3. The important functional elements of a genetic search are: a population with an associated fitness evaluation methodology, one or more selection and creation strategies, and a decimation strategy. The core component of every genetic search is the population. In a genetic search, each member of the population needs to be evaluated and assigned a fitness. Members of the population can be selected for genetic operations through a variety of criteria and, in order to manage the size of the population, can also be selected for deletion through different decimation criteria.

Genetic programming is controlled by several parameters for its components. In the following sections, tests of different population sizes, mutation and crossover probabilities, fitness functions, tournament sizes, and deletion strategies are analyzed. The sensitivity is analyzed by changing the values of the control parameters within a range, one at a time, with all other parameters fixed. Thus, although the resulting parameters (population size, etc.) are not optimal for any given specific problem, the analysis provides a good feasible set that is robust for most systems.
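The restart logic of Figure 3.2 and Eq. (3.4) can be sketched as follows. This is an illustrative stand-in: a single Gauss-Newton pass plays the role of one lsqnonlin call, and the names fit_once and regress_with_restarts are hypothetical.

```python
import numpy as np

def regress_with_restarts(fit_once, b0, tol=1e-3, max_iter=200):
    """Re-run the inner regression from the last solution until the
    relative change of the residual norm meets Eq. (3.4), or until
    the iteration cap is reached."""
    b, resnorm = fit_once(np.asarray(b0, dtype=float))
    for _ in range(max_iter):
        old_resnorm = resnorm
        b, resnorm = fit_once(b)                 # restart from last result
        if old_resnorm == 0.0 or abs(old_resnorm - resnorm) / old_resnorm < tol:
            break                                # Eq. (3.4)
    return b, resnorm

# synthetic data for y = 2 exp(-x)
x = np.linspace(0.0, 1.0, 25)
y = 2.0 * np.exp(-1.0 * x)

def fit_once(b):
    """One Gauss-Newton pass, standing in for a single lsqnonlin call."""
    e = np.exp(b[1] * x)
    J = np.column_stack([e, b[0] * x * e])       # d(yhat)/db
    r = y - b[0] * e
    b = b + np.linalg.solve(J.T @ J, J.T @ r)
    return b, float(np.sum((y - b[0] * np.exp(b[1] * x)) ** 2))

b_hat, sse = regress_with_restarts(fit_once, [1.0, 0.0])
```

Each restart begins from the previously regressed parameters, so the wrapper keeps refining the fit until the relative improvement stalls, mirroring the loop in Figure 3.2.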
[Figure 3.3: A Schematic Diagram of the Genetic Search Methodology. Starting from an initial population, the population feeds fitness evaluation; selection provides parents for the creation of offspring, and decimation removes excess individuals.]

3.4.1 The population architecture

The genetic search process can be implemented in two distinct ways. In the first, the genetic search uses a steady-state population, i.e., one or two individuals are selected and manipulated at a time, and one or two new offspring are generated, as shown in Figure 3.4. The term "generation" then refers to a single iteration of creating one or two new offspring. The second, shown in Figure 3.5, is the generational approach: individuals are selected, the entire population is manipulated, and many new offspring are created at each generation; one generation may represent an almost complete population turnover (some members may be retained unmodified through "reproduction").

[Figure 3.4: A Flowchart of the Genetic Search with a Steady-State Population. After the initial population is created and evaluated (via the regression subroutine), parent chromosomes and a genetic operation are selected, the operation is performed, the offspring are evaluated, and chromosomes are selected for deletion whenever the population must be reduced; the loop repeats until termination.]

[Figure 3.5: A Flowchart of the Genetic Search with a Generational Population. The same selection, operation, and evaluation steps are repeated until the number of new offspring equals the population size, at which point the search proceeds to the next generation.]

In this research, the steady-state population is used. Specifically, the following occurs, as shown in Figure 3.4: an individual is first selected according to some selection strategy, and then the genetic operator is selected. Depending on the selected genetic operator, a second individual may be selected, and the genetic operation is performed to create one or two offspring.
Finally, the decimation strategy selects the individual that is going to be deleted from the population, and the new child replaces the deleted individual.

As a steady-state optimizer, where GP operates on just one individual at a time, the number of cycles within a given run can be high, perhaps 25,000 or more. In order to make the results comparable to those of a generational optimizer, the number of cycles is divided by the size of the population to give the approximate number of generations. The theoretical understanding of the relationship between steady-state and generational optimizers is not strong. In order to generate reliable statistics, each test was run multiple times, typically ten times; from these runs, the average performance for each selection scheme was calculated.

3.4.2 The population size

Genetic programming is an optimization technique that uses a population of candidate individuals to search the solution space. Since a population consists of multiple individuals, several locations in the solution space are examined in parallel. The use of a large population has several advantages. First, it allows the evolutionary algorithm to examine a large number of positions in the solution space simultaneously. Second, a large
Often, the fitness evaluation is the most timeconsuming s tep in genetic programming, and reducing the number of fitness evaluations can sign ificantly speed up the search. Since the population size is one of the major contr ol parameters, some analytic methods are available for suggesting optimal popula tion sizes for runs of the genetic algorithm on particular problems. However, the prac tical reality is that researchers generally do not use any such analytic method to ch oose the population size. Instead, researchers determine the population size such that genetic programming can execute a reasonably large number of generations within the a mount of computer time people are willing to devote to the problem [Koza et al. 2003] Therefore, in this research, a group of tests on the speed of GP convergence are run for di fferent population sizes of 250 (P250), 500 (P500) and 1000 (P1000). As shown in Figure 3.6 the GP runs with the population of 1000 and 500 display a similar behavior, both of which reach the approximately optimal solution at about eighteenth generation, wh ile the population of 250 needs about twentyfive generations to find an approximate solu tion. Compared with P250, P1000 and P500 have better fitness than P250 does. A larg er population size does help to speed PAGE 70 59 up the searching, but also has impact on improving the accuracy level. The population size was set to 500 based on the results. Figure 3.6 Fitness vs. Generation Using Different P opulation Sizes 3.4.3 The principal component analysis and the selection of te rminal & function sets The complexity of the mathematical model evolved by genetic programming depends on the choice of the function set and termi nal set. The chance of discovering a specific formula in a finite number of generations is a decreasing function of its model complexity [Chen 1997]. A larger function set and t erminal set will help reduce the complexity of a mathematical model. 
From this perspective, it seems that a larger terminal set and function set can enhance search efficiency, and there is empirical evidence suggesting that this is indeed the case [Johnson 2000]. However, the influence of the search space and the population size has to be taken into consideration.

The search space includes all potential individuals, and its size grows exponentially with the size of the terminal and function sets. The probability of finding the solution in a finite number of generations depends on the population size ratio s, i.e.,

s = G / S

where G is the population size and S is the size of the search space. If the population size does not grow exponentially with the size of the function and terminal sets, the population size ratio will be close to zero, i.e., finding the specific formula in a finite number of generations is nearly impossible. Since in practice the population size cannot grow in proportion to the size of the function and terminal sets, reducing the complexity of a mathematical function by enlarging the terminal and function sets gains little efficiency. Therefore, constrained by the population size ratio, the size of the function and terminal sets has to be optimized.

There are several strategies to optimize the size of the function and terminal sets. Principal component analysis (PCA) is one way to help optimize the size of the terminal set. PCA is a multivariate procedure in which the data is transformed such that the maximum variabilities are projected onto the axes. Essentially, a set of correlated variables is transformed into a set of uncorrelated variables that are ordered by decreasing variability. The uncorrelated variables are linear combinations of the original variables. The first principal component is the combination of variables that explains the greatest amount of variation.
The second principal component explains the next largest amount of variation and is independent of the first principal component, and so on; the last of these variables can be removed with minimal loss of real information. The main use of PCA is to reduce the dimensionality of a data set, i.e., the size of the terminal set, while retaining as much information as possible. PCA computes a compact and optimal description of the data set.

The method proposed by Jolliffe [1986] uses the principal components as the basis for feature selection. A high absolute value of the i-th coefficient of one of the principal components (PCs) implies that the i-th original variable is very dominant in that PC. By choosing the variables corresponding to the highest coefficients of each of the first several selected PCs, the same projection as that computed by PCA is approximated. This is a very intuitive and computationally feasible method.

Due to the low dimensionality of the vapor-liquid equilibrium data set, where the number of degrees of freedom is only two, application of PCA is not necessary in this case. However, the developed GP package aims to be more general; therefore, PCA is included in the package as an option to aid the selection of the terminal set. A more practical approach than PCA, which is also popular among GP users, is to incorporate physically meaningful variables and their parameters into the function and terminal sets. In this research, based on prior knowledge of thermodynamics, we believe that some composition of the functions and terminals supplied here can yield a solution to the problem. The algorithm performs symbolic regression on the experimental data to extract a functional representation of the data. The genetic programming module starts with a set of primitive functions, including +, −, *, /, exp, square root (sqrt), log, power (^) and so on.
Temperature, pressure, and the vapor and liquid compositions are selected to be the elements of the terminal set.

3.4.4 Genetic operators

The genetic programming paradigm is controlled by two major numerical parameters, the population size and the maximum number of generations, which depend on the difficulty of the problem involved. The population size was addressed in Section 3.4.2. Throughout this research, unless specifically stated otherwise, the maximum number of generations is set to seventy-five; as shown in Figure 3.6, seventy-five generations are enough for GP to converge to an approximate solution in this research. Other, minor numerical parameters include the probabilities of crossover and mutation, which are defined in this section.

Since the population is steady-state, an individual is preserved in the population until it is selected for deletion by the decimation strategy. The reproduction operator is therefore wasteful, and only crossover and mutation are applied.

Good values for the mutation and crossover probabilities depend on the problem and must be manually tuned based on experience, as there are few theoretical guidelines on how to do this. For some problems, performance can be quite sensitive to these values, while for others the values do not make much difference. Different mutation and crossover probabilities were tested, with the ratio between mutation and crossover set to 0.5:0.5, 0.4:0.6 and 0.2:0.8, respectively. Figure 3.7 plots the fitness for the different mutation and crossover probabilities, and Figure 3.8 plots the standard deviation of the fitness values, which is an index of the diversity of the individuals in the population. Figure 3.7 shows that the different mutation and crossover probabilities reach equivalent accuracy levels.
Meanwhile, the diversity of the population follows a similar trend. These results show that, in this research, GP is robust and not sensitive to the mutation and crossover probabilities.

[Figure 3.7: Fitness vs. Generation Using Different Mutation and Crossover Probabilities (0.5:0.5, 0.4:0.6, 0.2:0.8).]

[Figure 3.8: The Standard Deviation of Fitness Using Different Mutation and Crossover Probabilities (0.5:0.5, 0.4:0.6, 0.2:0.8).]

3.4.5 Fitness evaluation

The fitness function is the driving force of GP; it should reflect the goodness of a potential solution, which is proportional to the probability of the individual being selected. Usually, the fitness function is based on the sum of squared errors (SSE) between the calculated and the measured output values:

SSE = Σ_{j=1}^{N_e} [S(i, j) − C(j)]²   (3.5)

where S(i, j) is the value returned by equation i for fitness case j (of N_e fitness cases) and C(j) is the correct value for fitness case j. The closer this sum is to zero, the better the program.

A good model is not only accurate but also simple, transparent and interpretable. In addition, a complex, over-parameterized model decreases the general estimation performance of the model. Since GP is a self-evolving process that can produce overly complex models, a fitness function is needed that ensures a trade-off between complexity and model accuracy.

To evaluate the efficiency of different fitness functions, including least squares, Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC), several GP tests were run for the propylene-propane system as an example, and the accuracy and complexity of the generated models were compared, as shown in Figure 3.9 and Figure 3.10.
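The penalized fitness functions compared here can be sketched as follows, using the standard Gaussian-error forms of AIC and BIC; the exact complexity measure used in the package may differ, and k (the parameter count) is only a proxy for model complexity.

```python
import numpy as np

def fitness(sse, n_obs, k, criterion="BIC"):
    """Complexity-penalized fitness (lower is better). Standard forms
    under Gaussian errors:
      AIC = n ln(SSE/n) + 2k,   BIC = n ln(SSE/n) + k ln(n),
    where k is the number of model parameters."""
    if criterion == "LS":
        return sse                        # no parsimony pressure
    base = n_obs * np.log(sse / n_obs)
    return base + (2 * k if criterion == "AIC" else k * np.log(n_obs))

# a 2-parameter model fitting slightly worse than a 6-parameter one
n = 50
simple_bic = fitness(1.00, n, 2, "BIC")
complex_bic = fitness(0.90, n, 6, "BIC")
simple_aic = fitness(1.00, n, 2, "AIC")
complex_aic = fitness(0.90, n, 6, "AIC")
```

Because the BIC penalty per parameter is ln(n) rather than 2, it punishes the extra four parameters more heavily than AIC does (for n > e², ln(n) > 2), which matches the observation that BIC selects the more parsimonious models.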
Figure 3.9 shows that, compared with least squares, both AIC and BIC maintain a comparable accuracy level in this case. Figure 3.10 compares the complexity of the models: without a parsimony rule, GP tends to increase the complexity of the generated models steadily over the generations. Between the two parsimony rules, AIC and BIC, BIC has better control over the complexity of the generated models and tends to select more parsimonious models, because BIC places a larger penalty on the number of predictors. Theoretical and simulation studies by another researcher [Hansen 2001] also conclude that, mostly in the regression case, when the underlying model is finite-dimensional (specified by a finite number of parameters), BIC should be preferred.

[Figure 3.9: Models' Accuracy vs. Generation Using Different Fitness Functions (LS, BIC, AIC).]

[Figure 3.10: Models' Complexity vs. Generation Using Different Fitness Functions (LS, BIC, AIC).]

3.4.6 Selection strategy

One practical difficulty in symbolic regression is the problem of crowding. Crowding is a phenomenon in which individuals that are more fit than others in the population are quickly reproduced, so that copies of these individuals and similar individuals take over a large fraction of the population. Crowding reduces the diversity of the population, thereby slowing further progress by the GP. Several strategies have been explored for reducing crowding. One approach is to change the selection function, using criteria such as tournament selection or rank selection instead of fitness-proportionate selection. For the selection strategy, the commonly used tournament selection is applied.
Under tournament selection, a group of individuals is randomly selected from the population, and the individual with the highest fitness in this subset is returned. The size of the group is called the tournament size. Clearly, the larger this group is, the more likely it is that a highly fit individual from the population is selected. In the tests, tournament sizes ranging from 3 to 6 are used, in order to examine a range of selection intensities. When reporting test results, the following notation is applied: TOUR3 means tournament selection with a tournament size of 3, and similarly for TOUR4, TOUR6 and so on.

[Figure 3.11: Model Fitness vs. Generation Using Different Tournament Sizes (TOUR3, TOUR4, TOUR6).]

[Figure 3.12: The Standard Deviation of Fitness vs. Generation Using Different Tournament Sizes (TOUR3, TOUR4, TOUR6).]

Figure 3.11 compares the fitness of the models using different tournament sizes, and Figure 3.12 shows the impact of the tournament size on preserving the diversity of the population. Figure 3.11 shows that, compared to TOUR4 and TOUR6, TOUR3 converges relatively slowly: TOUR3 takes sixty-six generations to find an approximate solution, TOUR4 takes fifty-three generations, and TOUR6 needs thirty-four generations. From TOUR3 to TOUR6, as the tournament size increases, the slope of the convergence curve gets steeper. For a given number of generations, the larger the tournament size, i.e., the higher the selection intensity, the higher the probability that the fittest solution is chosen, and the fewer generations are needed to find an approximate solution. However, neither too fast nor too slow a convergence is desirable.
If the algorithm converges too fast, it may imply premature termination, and the solution may be a local optimum, as is the case for TOUR6: TOUR6 converges the fastest, but its fitness is the worst among the three tournament sizes. TOUR3 converges somewhat slowly, while TOUR6 converges prematurely and becomes stuck. TOUR4 appears to be about the correct tournament size for this problem, with TOUR3 an acceptable alternative.

3.4.7 Decimation strategy

In order to manage the size of the population in a genetic search, as well as to keep the better individuals in the population, a decimation strategy for deleting excess individuals from the population must be decided. Analogous to the selection strategies described and compared in Section 3.4.6, different decimation strategies are compared in this section, as shown in Figures 3.13 and 3.14. The maximum population size is set to 500; if the population size exceeds this limit, the individuals chosen by the decimation strategy are deleted from the population.

A common problem experienced with population-based optimization methods is the gradual decline in population diversity that tends to occur over time. This can slow the progress of a search or even halt it completely if the population converges on a local optimum from which it cannot escape.

With steady-state population optimizers, the standard decimation strategy is simply random deletion. The reason is that random deletion is neutral, in the sense that it does not skew the distribution of the population in any way. Thus, whether the population tends toward high or low fitness is solely a function of the selection scheme and its parameters, in particular the selection intensity; this issue was discussed in the previous section. The other option for the decimation strategy is to delete individuals from the population based on their fitness.
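Both decimation options can be sketched as follows. This is an illustrative sketch, not the toolbox code; the names are assumed, and for clarity the fitness-based variant is written deterministically, always removing the least fit individual:

```python
import random

def decimate_random(population, fitness, max_size, rng=random):
    """Neutral decimation: delete uniformly random individuals until
    the population is back at max_size (does not skew fitness)."""
    while len(population) > max_size:
        i = rng.randrange(len(population))
        population.pop(i)
        fitness.pop(i)

def decimate_by_fitness(population, fitness, max_size):
    """Fitness-based decimation: repeatedly delete the least fit
    individual (higher fitness is better in this sketch)."""
    while len(population) > max_size:
        worst = min(range(len(population)), key=lambda i: fitness[i])
        population.pop(worst)
        fitness.pop(worst)

pop, fit = ["a", "b", "c", "d"], [0.9, 0.1, 0.5, 0.7]
decimate_by_fitness(pop, fit, max_size=2)
assert pop == ["a", "d"]  # only the two fittest individuals survive
```

Random deletion preserves whatever distribution selection has produced, which is consistent with the diversity in Figure 3.14 holding up under random deletion and collapsing under fitness-based deletion.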
The worse the fitness, the higher the probability that a particular individual is selected for deletion. Figure 3.13 depicts the comparison between the two decimation strategies in terms of model accuracy, and Figure 3.14 shows the standard deviation of fitness. As mentioned earlier, the standard deviation of the fitness values in the population is an index of the diversity of the individuals in the population. As shown in Figure 3.13, the fitness-proportional decimation strategy allows the GP algorithm to converge at the fifteenth generation, while the random deletion strategy results in convergence at the twenty-seventh generation; that is, fitness-proportional deletion converges faster than random deletion. However, the fitness-proportional deletion strategy leads to less diversity in the population. As shown in Figure 3.14, using the same selection strategy (tournament selection with size 4), the diversity of the population using random deletion is maintained well over the generations, whereas the diversity of the population using fitness-proportional deletion decreases tremendously after the second generation. From this group of tests we conclude that the diversity of the population maintained by random deletion is best for the whole process.

Figure 3.13 Model Accuracy vs. Generation Using Different Deletion Strategies

Figure 3.14 The Standard Deviation of Fitness vs. Generation Using Different Deletion Strategies

3.4.8 Result designation and termination criteria

Each computer experiment is terminated when the maximum number of GP generations is reached.
To better determine the maximum number of GP generations, computational experiments with different population sizes, mutation and crossover probabilities, fitness evaluation functions, selection strategies and decimation strategies were performed; the results were shown in Sections 3.4.2 to 3.4.7. Model accuracy vs. generation using different population sizes is shown in Figure 3.6, Figure 3.7 depicts model accuracy vs. generation using different mutation and crossover probabilities, Figure 3.9 is the same plot using different fitness functions, Figure 3.11 depicts model accuracy vs. generation using different selection intensities, and Figure 3.13 uses different deletion strategies. All of these plots of model accuracy under different GP control parameters show that the GP program reaches its approximate solution before forty generations; the accuracy does not improve as the number of generations is increased beyond that point. To ensure that the GP program has enough run time and that the generated model fits the data at a satisfactory level, the maximum number of generations is set to 75 throughout this research. In each generation, the best-so-far model is designated.

3.4.9 Summary

Based on the theoretical analysis and experimental tests shown in the previous sections, the input values for the GP control parameters used throughout the research are summarized as follows:

Population Size: 500
Terminal Set: temperature, pressure, liquid phase composition (x_i), parameters (maximum eight parameters)
Function Set: +, -, ×, ÷, ^, exponential, log
Mutation and Crossover Probability: 0.5 : 0.5
Fitness Function: Schwarz Bayesian Information Criterion (BIC)
Selection Strategy: Tournament, Size 4
Maximum Generations: 75
Deletion Strategy: Random

3.5 Model evaluation

Genetic programming (GP) is inherently probabilistic in nature, and thus, for symbolic regression problems, each time the algorithm is run one can expect to arrive at a different approximate solution. Therefore, one should perform multiple experiments to develop alternative models, employ statistical analysis to further prune the list of model candidates, and evaluate model performance through steady state and dynamic simulation of columns that utilize these models.

3.5.1 Statistical analysis

The GP-generated models with optimized parameters were evaluated using graphical and numerical statistical methods.

3.5.1.1 Graphical methods

A scatter plot of the predicted response versus the observed response provides a simple, visual depiction of model performance. Any model that can explain most of the variation in the observed responses will produce a plot with points clustered around a 45° line. Better models yield less scatter about this ŷ = y line. Moreover, the scatter of the points about the ŷ = y line should remain roughly constant with magnitude; that is, a poor model that is less accurate at larger values of ŷ will produce increasing scatter with larger values of y.

A scatter plot of the residuals is also useful in evaluating a regression model. Here, the model residuals or errors (r = y − ŷ) are plotted against the model predictions (ŷ). Residual plots are used to visually verify some of the basic assumptions underlying regression analysis and to observe model-data mismatch in terms of bias and accuracy. The residuals (errors) between the model predictions and the observed responses should have a mean of zero and a constant variance.
Hence, the scatter in the residuals should be fairly uniform and centered about zero. A good regression model will produce a scatter in the residuals that is roughly constant with ŷ; unsatisfactory models yield a scatter in the residuals that changes with ŷ. All of the information on lack of fit is contained in the residuals. Assuming the model fitted to the data is correct, the residuals approximate the random errors. Therefore, if the residuals appear to behave randomly, it suggests that the model fits the data well; if the residuals display a systematic pattern, it is a clear sign that the model is biased in representing the data. If the residuals appear randomly scattered around zero, the model describes the data accurately and without systematic bias.

3.5.1.2 Goodness of fit statistics

To express the quality of fit between a regression model and the sample data, the coefficient of multiple determination (R²) is typically used. However, adding any regressor variable to a multiple regression model, even an irrelevant one, yields a smaller SSE and a greater R². For this reason, R² by itself is not a good measure of the quality of fit, and an adjusted value is used. The adjusted coefficient of multiple determination (R̄²) is defined as:

R̄² = 1 − [(n − 1)/(n − p)] (1 − R²)   (3.6)

Since the number of model coefficients (p) is used in computing R̄², its value will not necessarily increase with the addition of a regressor. Hence, R̄² is a more reliable indicator of model quality.

3.5.1.3 Cross validation and prediction

Finally, it is often important to evaluate the ability of a fitted regression model to predict future events. The best way to do this is to gather additional, new data and compare the observed responses with predictions from the model. However, when this is not possible, the data used to fit the model can be split and a cross validation performed.
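The adjusted coefficient of determination of Eq. (3.6) is straightforward to compute; the following is a minimal sketch (the function names are mine, and p counts all model coefficients):

```python
def r_squared(y, y_hat):
    """Coefficient of multiple determination, R^2 = 1 - SSE/SST."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst

def adjusted_r_squared(y, y_hat, p):
    """Adjusted R^2 of Eq. (3.6): 1 - (n-1)/(n-p) * (1 - R^2)."""
    n = len(y)
    return 1.0 - (n - 1) / (n - p) * (1.0 - r_squared(y, y_hat))

y     = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.9, 5.1]
# The adjustment always penalizes an imperfect fit: R-bar^2 < R^2.
assert adjusted_r_squared(y, y_hat, p=2) < r_squared(y, y_hat)
```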
A good regression model can be fit to part of the original data set and still accurately predict the other observations. To perform a double cross-validation:

Partition the data into two subsets (say, A and B) with an equal number of observations in each. Individual observations must be assigned to subset A or B randomly.

Using the same model form, fit the model to the data in subset A, and use this model to predict the observations in subset B. Compute the predicted R²p,A for the model fit to subset A, as defined in Eq. (3.7) below.

Similarly, fit the model to the data in subset B and use it to predict the observations in subset A. Compute the predicted R²p,B for the model fit to subset B.

A good model will produce high values of R²p for both subsets, and these values will be approximately equal (R²p,A ≈ R²p,B).

The predicted R²p for a model fit to subset A (R²p,A) is computed as:

R²p,A = 1 − [ Σ_{i=1..nB} (y_iB − ŷ_iA)² ] / [ Σ_{i=1..nB} (y_iB − ȳ_B)² ]   (3.7)

where n_B and ȳ_B are the number and mean of the observed responses (y_iB) in the random subset B. Using the model fit to subset A, predictions of the observations in subset B are made to give ŷ_iA. The predicted R²p for a model fit to the data in subset B (R²p,B) is computed in the same way, with the roles of subsets A and B reversed.

3.5.2 Steady state and dynamic simulation using local models

For further evaluation of the models' stability and accuracy, the generated model is integrated with the chemical engineering process simulation package CHEMCAD. CHEMCAD allows users to supply their own K-values in three major ways. First, users may input tabular K-values that CHEMCAD then interpolates for use during the calculations.
The second option is that the K-values are assumed to be a polynomial function of temperature:

K = f(T) = a + bT + cT³ + dT⁴   (3.8)

where users provide the coefficients a, b, c and d. The third option allows users to create their own added modules: the User Added Modules (UAM) facility gives users the ability to create new thermodynamic routines. UAMs are programmed in Visual C++ and have access to a large number of internal CHEMCAD routines. The K-value model evolved by GP can be linked with CHEMCAD as a UAM and then evaluated through process simulation.

CHAPTER FOUR
RESULTS AND DISCUSSION

The structure and implementation of the GP package for symbolic regression was addressed in Chapter Three. In this chapter, the application of the GP package to the development of local K-value models for the propylene-propane (ideal or near-ideal solution) and acetone-water (nonideal solution) systems is presented. The results and their discussion are given in Sections 4.1 and 4.2, respectively. In Section 4.1, the generated K-value models with their associated regressed parameter values, the phase equilibrium plots, and the distillation column simulation profiles (steady state or dynamic) are presented along with residual plots. In Section 4.2, the performance of the models for ideal and nonideal solutions is discussed. The approach is applied to both the propylene-propane and acetone-water systems, using different terminal sets and different pressure ranges of data. Furthermore, the results are discussed and the performance of the models is compared to rigorous models as well as to data.

4.1 Local composition models and their impact

The macroscopic vapor-liquid equilibrium coefficient (K-value) is a function of temperature, pressure and the compositions of the liquid and vapor phases. However, simple composition-independent models are very popular due to their reduced computational burden, since iterative procedures with composition-dependent models are avoided.
These models work well for ideal solutions. In this section, the equilibrium K-value models developed using GP for the ideal (or near-ideal) propylene-propane binary system and for the strongly nonideal acetone-water binary system are studied. The data for the propylene-propane system range from low to higher pressures at different temperatures, and the data for the acetone-water system range from low to higher temperatures at different pressures. For each binary system, the final K-value model and its parameters are summarized, the fit of the models is evaluated statistically, and the model performance is evaluated with the process simulation package CHEMCAD.

4.1.1 Developed models for mixtures that form ideal or near-ideal solutions and their statistical analysis

Table 4.1 summarizes the final structure of the K-value models for the propylene-propane solution.

Table 4.1 Result Summary for Propylene (1)-Propane (2)

Result for K1:
Evolutionary model: log K1, a three-term expression in T, P and x1(1 − x1) with regressed parameters v1 to v3
Adjusted R²: 0.997

Result for K2:
Evolutionary model: log K2, the sum of four parameterized terms in x2 and T plus a pressure term, with regressed parameters v1 to v4
Adjusted R²: 0.996

The y = ŷ plots for the generated K1 and K2 models are given in Figures 4.1 and 4.3, respectively. The points are distributed along the 45° line, and no significant deviation from the 45° line is observed. The residual plots for the K1 and K2 models are given in Figures 4.2 and 4.4; the mean values for both residual plots are close to zero.

Figure 4.1 K1 Model vs.
Experimental Data for Propylene (1)-Propane (2)

Figure 4.2 Residual Plot of K1 for Propylene (1)-Propane (2) (StdDev = 2.24E-3, Mean = 3.18E-5)

Figure 4.3 K2 Model vs. Experimental Data for Propylene (1)-Propane (2)

Figure 4.4 Residual Plot of K2 for Propylene (1)-Propane (2) (StdDev = 2.04E-3, Mean = 5.21E-5)

ΔK and ΔK/K (%) for K1 and K2 at five different temperatures are listed in Table 4.2. The maximum and average values of ΔK/K (%) for K1 and K2 are small, which indicates that the models for K1 and K2 fit the experimental data well.

Table 4.2 Standard Error Analysis for the Generated K1 and K2 Models, K1 (K2)

T, K      ΔK/K % Max.      ΔK/K % Avg.       ΔK Max.               ΔK Avg.
223.75    0.614 (0.255)    0.292 (0.168)     0.00734 (0.00235)     0.00351 (0.00150)
239.35    0.484 (0.417)    0.158 (0.269)     0.00659 (0.00367)     0.00186 (0.00242)
310.93    0.315 (0.534)    0.101 (0.232)     0.00372 (0.00495)     0.00112 (0.00239)
327.59    0.121 (0.130)    0.0659 (0.0679)   0.00138 (0.00121)     0.000716 (0.000648)
344.26    0.286 (0.106)    0.147 (0.0557)    0.00316 (0.00101)     0.00156 (0.000533)

4.1.2 Performance of models for separation of ideal and near-ideal mixtures

The generated K1 and K2 models are integrated with CHEMCAD as user added modules. P-x-y diagrams for five temperatures (223.75 K, 239.35 K, 310.93 K, 327.59 K and 344.26 K) are plotted and compared with those from the Peng-Robinson (PR) equation of state, as shown in Figures 4.5 and 4.6. As observed, the generated local models fit the entire data range well, and they perform as well as the PR EOS, which is usually the default model for these tasks.
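Summary statistics of the kind reported in Table 4.2 can be generated with a short routine; the sketch below uses made-up K values, not the thesis data set:

```python
def k_error_stats(k_data, k_model):
    """Return (max |dK|, avg |dK|, max |dK/K| %, avg |dK/K| %)
    between experimental and model K values at one condition."""
    dk = [abs(d - m) for d, m in zip(k_data, k_model)]
    rel = [100.0 * e / d for e, d in zip(dk, k_data)]
    n = len(dk)
    return max(dk), sum(dk) / n, max(rel), sum(rel) / n

# Illustrative values only:
k_data  = [1.20, 1.10, 1.05, 1.30]
k_model = [1.21, 1.09, 1.05, 1.28]
dk_max, dk_avg, rel_max, rel_avg = k_error_stats(k_data, k_model)
assert rel_max < 2.0  # small relative errors, as in Table 4.2
```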
Figure 4.5 P-X1-Y1 at Low Pressures for Propylene (1)-Propane (2)

Figure 4.6 P-X1-Y1 at Medium to High Pressures for Propylene (1)-Propane (2)

Figure 4.7 K1 vs. Liquid Composition at Different Temperatures for the Propylene (1)-Propane (2) System

Figure 4.8 K1 vs. Pressure at Different Temperatures for Propylene (1)-Propane (2)

The results show that the generated K models are functions of pressure, composition and temperature; therefore, the K-composition and K-pressure analyses of the models are given in Figures 4.7 and 4.8. The K1-composition plot in Figure 4.7 shows that the local model fits the composition dependence better than the rigorous PR model does over the whole data range. In the K1-pressure plot, the local model shows a slight deviation in the high pressure range.

Propylene is one of the most important building blocks in the petrochemical industry. The cost of producing propylene is closely associated with the propylene-propane separation step, conventionally performed by low temperature or high pressure distillation. This step is energy intensive, and therefore costly, because of the low relative volatility; the separation requires large distillation columns with more than 210 trays and high reflux ratios. This makes the propylene-propane system one of the most energy-expensive separations.

To test the local models' reliability further, a steady state simulation is run.
A 210-plate distillation column is simulated. The feed stream, containing 800 lbmol/hr propylene and 200 lbmol/hr propane, is fed to tray 125 at a pressure of 19 bar as a saturated liquid. The column is operated at 17.5 bar at the top with a 1.8 bar pressure drop; the mole fraction of propylene at the top and at the bottom is 0.95 and 0.05, respectively. The tower temperature profile is presented in Figure 4.9 and the relative volatility profile in Figure 4.10. The vapor and liquid flow rates at different trays are given in Table 4.3.

Table 4.3 Propane-Propylene Distillation Column Profile

                                Model      Data       PR
Stage 10    Vapor, lbmol/h      8481.52    8092.83    8534.49
            Liquid, lbmol/h     7650.25    7261.44    7703.27
Stage 100   Vapor, lbmol/h      8634.44    8227.33    8688.34
            Liquid, lbmol/h     7802.67    7395.33    7856.68
Stage 200   Vapor, lbmol/h      8861.79    8438.48    8918.01
            Liquid, lbmol/h     9031.94    8608.70    9088.10
Condenser, MMBtu/h              46.03      44.02      46.43
Reboiler, MMBtu/h               45.86      43.84      46.27

Figure 4.9 Temperature Profile of the Propylene-Propane Distillation Column

Figure 4.10 Relative Volatility Profile for the Propylene-Propane Distillation Column

4.1.3 Developed models for mixtures that form nonideal solutions and their statistical analysis

Table 4.4 Result Summary for Acetone (1)-Water (2)

Result for K1:
Evolutionary model: log K1, a four-term expression in T and x1 including a log(P) term, with regressed parameters v1 to v4
Adjusted R²: 0.994

Result for K2:
Evolutionary model: log K2, a four-term expression in T, P and x2 including logarithmic terms in T and P, with regressed parameters v1 to v4
Adjusted R²: 0.990

Figure 4.11 K1 Model vs.
Experimental Data for Acetone (1)-Water (2)

Figure 4.12 Residual Plot of K1 for Acetone (1)-Water (2) (StdDev = 0.2206, Mean = 0.05209)

Figure 4.13 K2 Model vs. Experimental Data for Acetone (1)-Water (2)

Figure 4.14 Residual Plot of K2 for Acetone (1)-Water (2) (StdDev = 0.02763, Mean = 0.001994)

The y = ŷ plots for the generated K1 and K2 models for the acetone-water system are given in Figures 4.11 and 4.13, respectively. The points are distributed along the 45° line, and no significant deviation from the 45° line is observed. The residual plots for the K1 and K2 models are given in Figures 4.12 and 4.14; the mean values for both residual plots are very close to zero.

ΔK and ΔK/K (%) for K1 and K2 at three different pressures are listed in Table 4.5. The maximum and average values of ΔK/K (%) for K1 and K2 are small, which indicates that the models for K1 and K2 fit the experimental data well. It is also observed that the models' deviations at medium (17.24 bar) and high (34.47 bar) pressure are larger than those at low pressure (1.013 bar).

Table 4.5 Standard Error Analysis of the Local Models for the Acetone-Water System, K1 (K2)

P, bar    ΔK/K % Max.    ΔK/K % Avg.     ΔK Max.           ΔK Avg.
1.013     1.92 (4.0)     0.962 (2.78)    0.374 (0.0325)    0.0564 (0.0147)
17.24     4.99 (4.32)    2.61 (2.26)     0.643 (0.0273)    0.102 (0.0173)
34.47     4.97 (5.20)    3.01 (2.89)     0.245 (0.0625)    0.0661 (0.0298)

4.1.4 Performance of models for separation of nonideal solutions

The generated K1 and K2 models for the acetone-water solution are integrated with CHEMCAD as a user added module. T-x-y diagrams for three pressure levels, 1.013 bar, 17.24 bar and 34.47 bar, are plotted and compared with the NRTL activity coefficient model, as shown in Figure 4.15.
As shown, the generated local models fit the entire range of the data set relatively well, although they show larger deviations as the pressure increases.

Figure 4.15 T-X1-Y1 at Different Pressures for Acetone (1)-Water (2)

The K-composition and K-pressure analyses of the models are given in Figures 4.16 and 4.17. The K1-composition plot in Figure 4.16 shows that the local model fits the composition dependence slightly better than NRTL does over the entire data range.

Figure 4.16 K1 vs. Liquid Composition at Different Pressures for Acetone (1)-Water (2)

Figure 4.17 K1 vs. Temperature at Different Pressures for Acetone (1)-Water (2)

In the K1-pressure plot, the local model shows a larger deviation at high pressure due to the azeotropy. As can be seen from Figures 4.16 and 4.17, the dependencies of the K-values on temperature and composition are different. Figure 4.17 shows that the relative deviation of the temperature dependency increases as the pressure increases, but remains within an acceptable level. In the examples given here, the local models were used at pressures up to 34.47 bar with reliable results. From our experience with the mixtures we have examined, the general conclusion is that, for nonideal mixtures, the generated models are capable of correlating the equilibrium ratios to an acceptable accuracy over a wide range of compositions, temperatures and pressures.

To test the local models' reliability further, a steady state simulation is run. A 56-plate acetone-water distillation column is simulated.
The feed stream, containing 400 lbmol/hr acetone and 600 lbmol/hr water, is fed to the bottom of the column (tray 55) at a pressure of 2 bar as a saturated liquid. The column is operated at 1.013 bar at the top with a 0.5 bar pressure drop; the mole fraction of acetone at the top and at the bottom is 0.95 and 0.05, respectively. The tower temperature profile is presented in Figure 4.18 and the relative volatility profile in Figure 4.19. Figures 4.18 and 4.19 show that, compared with the data and the NRTL model, the generated local models' performance is excellent under the specified operating conditions.

Figure 4.18 Temperature Profile of the Acetone-Water Distillation Column

Figure 4.19 Relative Volatility Profile of the Acetone-Water Distillation Column

In the dynamic simulation, the process starts at steady state, and a disturbance of +10% in the feed flow rate is introduced at the 50th minute of the simulation. The resulting tower vapor and liquid flow rate profiles at different simulation times are given in Table 4.6. The distillation flow rate fluctuation due to the 10% feed flow rate increase is shown in Figure 4.20, and the pressure response to the disturbance over time is shown in Figure 4.21. All of the dynamic simulation figures show that the generated model gives reliable results in the simulation.

Figure 4.20 Distillation Total Flow Rate

Figure 4.21 Distillation Column Pressure

Table 4.6 Tower Profile at Different Run Times: 0 min and 200 min
                            0 min (Feed: 1000 lbmol/h)        200 min (Feed: 1100 lbmol/h)
                            Model     Data      NRTL          Model     Data      NRTL
Stage 5    Vapor, lbmol/h   1630.8    1611.3    1523.6        1793.9    1772.4    1686.0
           Liquid, lbmol/h  1257.8    1238.3    1150.2        1383.6    1362.2    1275.3
Stage 25   Vapor, lbmol/h   1621.4    1602.0    1507.8        1783.6    1762.3    1668.6
           Liquid, lbmol/h  1248.7    1229.3    1134.8        1373.6    1352.3    1258.3
Stage 50   Vapor, lbmol/h   1598.2    1578.7    1485.1        1758.0    1736.5    1643.6
           Liquid, lbmol/h  1220.9    1201.5    1109.5        1343.0    1321.6    1230.4
Condenser, MMBtu/h          21.03     20.78     19.64         23.13     22.86     21.76
Reboiler, MMBtu/h           21.51     21.25     20.06         23.66     23.37     22.26

4.2 Discussion

The linear models of the vapor-liquid partition coefficient K for the propane-propylene and acetone-water systems presented in Section 4.1 evolved from the full data set, which covers the entire pressure range. For the acetone-water system, as shown in Figures 4.15 to 4.17 (the T-x-y diagrams at 1.013 bar, 17.24 bar and 34.47 bar, respectively), the deviation of the models increases with pressure.

Table 4.7 R² of Different Models for K1 in the Group Tests

Pressure (P) Range     Model        Propylene-Propane           Acetone-Water
for the Data                        PT      PTX     PTXY        PT      PTX     PTXY
Low P                  Linear       0.995   0.996   0.996       0.85    0.995   0.996
                       Nonlinear    0.995   0.996   0.996       0.87    0.995   0.996
High P                 Linear       0.992   0.995   0.996       0.82    0.991   0.995
                       Nonlinear    0.994   0.996   0.996       0.87    0.993   0.997
Full Data              Linear       0.993   0.994   0.995       0.82    0.991   0.996
                       Nonlinear    0.994   0.995   0.995       0.87    0.993   0.996

In an effort to study pressure-induced effects, the full data set was divided into low pressure and high pressure sets and additional tests were performed. The model was forced to be linear and nonlinear, respectively, for both low and high pressures, as discussed in the following sections.

4.2.1 Ideal gaseous and liquid mixture

Low pressure models developed for the propylene-propane system with different terminal sets provide a very good data representation. Models with PT, PTX and PTXY terminal sets have R² values of 0.995, 0.996 and 0.996, respectively.
Including a composition term in the evolved local model does not improve the model's goodness of fit, since the PT models already represent the data very well. This result is expected, as can be seen from the discussion that follows.

The basic vapor-liquid equilibrium relation can be expressed as:

f_i^l(T, P, x) = x_i φ_i^l(T, P, x) P = y_i φ_i^v(T, P, y) P = f_i^v(T, P, y)   (4.1)

The description of vapor-liquid equilibrium using an equation of state for both the liquid and vapor phases is referred to as the φ-φ method; then

K_i = y_i / x_i = φ_i^l(T, P, x) / φ_i^v(T, P, y)   (4.2)

The alternative is to use an activity coefficient model for the liquid phase and an equation of state for the vapor phase, i.e., the γ-φ method. Eq. (4.1) can then be rewritten in the form:

x_i γ_i(T, P, x) φ_i^{l,sat}(T) P_i^{sat}(T) = y_i φ_i^v(T, P, y) P   (4.3)

K_i = γ_i(T, P, x) φ_i^{l,sat}(T) P_i^{sat}(T) / [φ_i^v(T, P, y) P]   (4.4)

At low pressures, the vapor phase can be treated as an ideal gas, and all fugacity coefficient corrections are negligible. Thus, Eq. (4.4) simplifies to:

K_i = γ_i(T, P, x) P_i^{sat}(T) / P   (4.5)

Furthermore, the propylene-propane mixture forms an ideal solution in the liquid phase, i.e., γ_i = 1 for both propylene and propane. Eq. (4.5) can then be further simplified to:

K_i = P_i^{sat}(T) / P   (4.6)

Eq. (4.6) shows that the vapor-liquid partition coefficient K for the propylene-propane system at low pressure is a function of temperature and pressure only, not of composition. Therefore, the models with the PTX and PTXY terminal sets do not improve the model accuracy significantly.

4.2.2 Nonideal gaseous mixture and ideal liquid mixture

For the propylene-propane system at high pressure, as shown in Table 4.7, R² for the local models using the terminal sets PT, PTX and PTXY is 0.992, 0.995 and 0.996, respectively. Compared with the local model using the PT terminal set, the one with PTX has better accuracy, whereas the model using PTXY has about the same R² value.
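The composition independence in Eq. (4.6) is easy to see numerically; the sketch below uses an Antoine-type vapor pressure correlation with placeholder constants (not fitted values for propylene or propane):

```python
def p_sat_antoine(T, A, B, C):
    """Antoine vapor pressure correlation: log10(Psat) = A - B/(T + C)."""
    return 10.0 ** (A - B / (T + C))

def k_raoult(T, P, antoine):
    """Eq. (4.6): K_i = Psat_i(T)/P; no composition dependence at all."""
    return p_sat_antoine(T, *antoine) / P

coeffs = (3.95, 800.0, -25.0)  # placeholder Antoine constants (bar, K)
K_240 = k_raoult(240.0, 2.0, coeffs)
K_250 = k_raoult(250.0, 2.0, coeffs)
assert K_250 > K_240  # Psat, and hence K, increases with temperature
```

Because x and y never enter the calculation, regressing on a PTX or PTXY terminal set can add no information in this regime, which is consistent with the low pressure propylene-propane R² values in Table 4.7.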
PAGE 113 102 At higher pressures, the behavior of propyleneprop ane vapor phase deviates from ideal gas state and thus fugacity coefficient canno t be negligible. The liquid phase is not affected by pressure much so it still can be treate d as ideal solution. The Eq. (4.4) can be reduced to: ) , ( ) (, = y P T P P T P Kvi sat li sat i if f (4.7) Due to the composition dependent characteristic of fugacity coefficient correction, the introduced composition term of the evolved loca l model can and do improve models accuracy. 4.2.3 Ideal gaseous and nonideal liquid mixture As can be seen In Table 4.7, the local models devel oped for acetonewater system at low pressure improve significantly the goodness of fit after the liquid composition is introduced to the model, R2 is 0.85 for the terminal set of PT and R2 is 0.995 for the terminal set of PTX. There is no significant improv ement for the terminal set of PTXY where R2 is 0.996. At low pressures, the vapor phase of ace tonewater system approaches ideal gas state, and fugacity coefficient correctio ns can be omitted. However, in liquid phase, acetonewater forms a pol ar solution which displays a strong nonideal behavior. Therefore, the effect of activity coefficient needs to be taken into account. Eq. (4.5) can be used to describe the vaporliquid equilibrium for acetonewater system at low pressure. PAGE 114 103 Since activity coefficient is composition dependent the evolved local model with composition term improves the goodness of fit. 4.2.4 Nonideal gaseous mixture and nonideal liquid mixture At high pressures, R2 s for local models with different terminal set of PT, PTX and PTXY are 0.82, 0.991 and 0.995 respectively for acetonewater system. The models with terminal set of PTX and PTXY have better data fit than the model with the terminal set of only PT. The model with PTXY also has better accuracy than the model with PTX. 
At higher pressures, both the vapor and the liquid phase display nonideal behavior, so both the fugacity coefficient and the activity coefficient are needed to describe the mixture, and composition has a relatively large impact in both phases. The rigorous form of the K model, Eq. (4.4), can be used to describe this binary system at high pressure. From Eq. (4.4) it can be seen that the K value depends on both the vapor and the liquid composition; therefore, local models with vapor- and liquid-composition terms can significantly improve the goodness of fit.

It is also observed that for the acetone-water system at low pressure the accuracy of the PTX model is close to that of the PTXY model, whereas at high pressure the difference between PTX and PTXY is larger. This is because at higher pressures the nonideal gas behavior has more impact in the vapor phase; including the vapor-phase composition therefore improves the model's accuracy.

4.2.5 Linear model vs. nonlinear model

Using each of the low and high pressure data groups, linear and nonlinear models were developed. The linear models evolved from low pressure data for both the propane-propylene and the acetone-water systems are given in Table 4.7. As can be seen there, the nonlinear models for low pressure do not exhibit a significant improvement in goodness of fit for the propylene-propane system. At low pressure, the propane-propylene system is ideal or nearly ideal, and a linear model is sufficient to describe the characteristics of the solution. For the acetone-water system, the nonlinear model improves prediction accuracy for the PT terminal set (R2 of 0.87 compared with 0.85 for the linear model), but there is no significant improvement over the linear models with the PTX and PTXY terminal sets. This indicates that composition is a necessary term for the evolved local model to describe a nonideal system adequately.
Once the composition term is included, a linear model is also good enough to describe the nonideal behavior.

At higher pressures, the picture changes. Comparing the linear and nonlinear models for the propane-propylene system, the nonlinear model is slightly more accurate for the PT terminal set (R2 of 0.994 versus 0.992 for the linear model); for the PTX and PTXY terminal sets, the nonlinear models show no significant improvement. For the acetone-water system at higher pressures, the nonlinear model shows some improvement for the PT, PTX and PTXY terminal sets. The acetone-water system exhibits strong nonlinearity at high pressure, so a nonlinear model may describe this behavior better. This is further demonstrated in Figures 4.16 and 4.17, which plot K1 vs. composition and K1 vs. pressure respectively. At low pressures, with no azeotropy present, the linear local model with the PTX terminal set fits the data well. At higher pressures, with azeotropic behavior, the K value exhibits strong nonlinearity, and the nonlinear model fits the data better.

Acetone-water is a nonideal system with an azeotropic point, and its behavior becomes increasingly nonlinear with pressure. Its vapor-liquid partition coefficient K also displays stronger nonlinearity than that of the propane-propylene system. However, the model evolved from the data is linear with respect to its parameters, which limits model performance for the acetone-water system.

This group of tests shows that the models with different terminal sets are consistent with the thermophysical behavior of ideal (or nearly ideal) and nonideal systems.
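The goodness-of-fit comparisons in this section rest on R2 = 1 - SSE/SST. A minimal Python sketch of that computation, together with an ordinary least-squares line fit by normal equations (the package itself performs linear fits in linear.m, not in Python):

```python
def r_squared(y, yhat):
    """Coefficient of determination R^2 = 1 - SSE/SST."""
    ybar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - sse / sst

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x via the normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b
```

A candidate structure, linear or nonlinear, is judged by the R2 of its fitted predictions against the same data, which is how the values in Table 4.7 compare across terminal sets.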
4.2.6 Extrapolation

To explore the extrapolation capability and validity of the models, models generated using only low pressure data, both linear and nonlinear, were tested against the high pressure data. Similarly, models generated using high pressure data were tested at lower pressure. The R2 values for each case are shown in Table 4.8, where *** indicates a result already presented in Table 4.7.

Table 4.8 R2 of Extrapolation Test

                                             Propylene-Propane        Acetone-Water
                                             low P      high P        low P      high P
model (PTX) generated from     linear        ***        0.848         ***        0.528
low pressure data              nonlinear     ***        0.976         ***        0.720
model (PTX) generated from     linear        0.946      ***           0.919      ***
high pressure data             nonlinear     0.966      ***           0.977      ***

As can be seen in Table 4.8, the models generated from low pressure data do not represent the high pressure data well, whether linear or nonlinear, particularly for the acetone-water system, since the azeotropy information is missing at low pressure. One can also observe that the nonlinear models generalize better than the linear models, which is not surprising given the nonlinear nature of the phenomena, particularly as transitions from ideal gas to real behavior, or to azeotropy, occur. Similarly, the models generated from high pressure data generalize better than those generated from low pressure data. These results show again that genetic programming is a fully data-driven method and is reliable in capturing the phenomena when applied judiciously.

CHAPTER FIVE: CONCLUSIONS AND RECOMMENDATIONS

This research proposes a hybrid evolutionary modeling algorithm to build vapor-liquid equilibrium K-factor models automatically.
The developed system may also serve as a general-purpose function identification package that automatically builds a functional model to fit given experimental data. The main idea of the algorithm is to embed linear or nonlinear regression within GP: GP optimizes the structure of a model, while linear or nonlinear regression optimizes its parameters. A distinct advantage of this method is that no a priori assumptions have to be made about the actual model form: the structure and complexity of the model evolve as part of the problem solution. This can reduce the limitations on function structure introduced by an insufficient description of the studied system in a proposed initial structure, or by an oversimplified structure resulting from inappropriate assumptions made during function simplification. The developed K model can be inserted into process simulation programs to save computation time without affecting the accuracy of the simulation.

The results clearly show that the hybrid system is able to find explicit model forms that closely approximate the data over a relatively wide range, while the accuracy can be tailored to a desired level. The models have also been shown to accurately represent ideal mixture equilibrium ratios for a binary vapor-liquid system. The greater the accuracy and region of validity of the models, the less frequently the parameters have to be updated. This in turn requires less use of the rigorous thermodynamic-physical property programs for computationally expensive property evaluations. For a nonideal system with an azeotropic point, GP appears capable over the whole pressure range at an acceptable level, and if necessary the parameters of the generated model can be adjusted to fit the data well. The experimental data were obtained from the literature and may contain errors.
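The nesting described above, structural search outside and parameter regression inside, can be sketched compactly. The sketch below is a deliberate toy: a random search over pairs of assumed basis terms in T and P stands in for the GP operators, and the inner loop solves 2x2 normal equations; it is not the package's implementation.

```python
import random

# Candidate basis terms the structural search may combine (assumed, illustrative).
TERMS = {
    "1": lambda T, P: 1.0,
    "T": lambda T, P: T,
    "1/P": lambda T, P: 1.0 / P,
    "T/P": lambda T, P: T / P,
}

def fit_two_terms(data, t1, t2):
    """Inner loop: linear least squares K ~ c1*t1 + c2*t2 (2x2 normal equations)."""
    f1 = [TERMS[t1](T, P) for T, P, _ in data]
    f2 = [TERMS[t2](T, P) for T, P, _ in data]
    k = [K for _, _, K in data]
    a11 = sum(v * v for v in f1)
    a12 = sum(u * v for u, v in zip(f1, f2))
    a22 = sum(v * v for v in f2)
    b1 = sum(u * v for u, v in zip(f1, k))
    b2 = sum(u * v for u, v in zip(f2, k))
    det = a11 * a22 - a12 * a12
    c1 = (b1 * a22 - b2 * a12) / det
    c2 = (a11 * b2 - a12 * b1) / det
    sse = sum((K - c1 * u - c2 * v) ** 2
              for (_, _, K), u, v in zip(data, f1, f2))
    return (c1, c2), sse

def structure_search(data, generations=50, seed=0):
    """Outer loop: random structural search over term pairs (a stand-in for GP),
    keeping the structure whose regressed parameters give the lowest SSE."""
    rng = random.Random(seed)
    names = list(TERMS)
    best = None
    for _ in range(generations):
        t1, t2 = rng.sample(names, 2)
        coeffs, sse = fit_two_terms(data, t1, t2)
        if best is None or sse < best[2]:
            best = ((t1, t2), coeffs, sse)
    return best
```

On data generated as K = c/P, any structure containing the 1/P term fits exactly, so the search settles on one of those structures with essentially zero residual.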
The results revealed that the hybrid system is able to find an explicit model that closely approximates the data, which demonstrates GP's robustness with respect to the data. In some cases, the parsimony principle was incorporated through the fitness function; the results show that a simpler structure can then be obtained, at some cost in accuracy. In conclusion, the results presented in this research indicate the potential of GP for developing thermophysical models and for general-purpose function identification. Genetic programming is a robust and efficient paradigm for discovering model structure using the expressiveness of symbolic representation.

In the developed package, the Marquardt regression method is used to obtain the parameter values for a nonlinear model structure. This method yields local rather than global solutions. GP, like other evolutionary methods, may terminate early or converge to an incorrect solution when the parameters regressed with local solvers are not adequate. A global optimization method would therefore resolve the local optimization problem and ensure a robust hybrid symbolic regression scheme.

As stated before, a primary classification of property and process models in chemical engineering is algebraic versus differential equation models. The K model is an algebraic equation in multiple independent variables (temperature, pressure and, where needed, composition). Extending the developed package to differential equations can be studied further, which would complete the functionality of the package.
APPENDICES

Appendix A. User Manual for GP Package

A.1 MATLAB files

In the developed GP package, the MATLAB files (.m files) include:

gp_main.m: the main code that carries out the GP operations.
marquardt.m: the subroutine that performs nonlinear parameter regression using the Marquardt method and calculates the fitness of an individual.
linear.m: the subroutine that performs linear parameter regression using least squares and calculates the fitness of an individual.

A.2 Architecture

The relationship between the .m files described above is given in Figure A.1. The expanded structure of the main code (gp_main.m) is on the left side, and the subroutine is shown on the right-hand side.
Figure A.1 Architecture of MATLAB Code. The flowchart proceeds as follows: create the initial population; evaluate fitness (subroutine linear.m or marquardt.m); select parent chromosomes; select and perform a genetic operation; evaluate the fitness of the new chromosomes; if population reduction is required, select chromosomes and delete them from the population; if the termination criterion is met, assign the solution and end; otherwise, continue the loop.

A.3 How to use the GP package

This section follows Figure A.1 and explains in detail how to specify the parameters in the code.

Step 1: load the data set and assign values to the variables of the terminal set. Command: DATA = LOAD(FILENAME).

Step 2: create the initial population. Command: CHROM_ID = GS_NEW(POPULATION, CHROMOSOME). POPULATION is a character string designating the chromosome population in which the new chromosome is created. CHROMOSOME is a character string defining the gene structure of the new chromosome. CHROM_ID is an integer identifying the newly created chromosome. Example: mem_id = gs_new('Pop1', 'v2/T'). The initial population defined with the GS_NEW command needs to include the complete terminal and function sets.

Step 3: define the regression strategy. Command: REGRESSION_TYPE = TYPE_VALUE. TYPE_VALUE is set to 0 for linear regression, 1 for nonlinear regression, or -1 for both; the default is 0. This command defines the regression strategy to be used in the GP.

Step 4: define the stop criterion (maximum generation). Command: MAXGEN = MAXIMUM NUMBER OF GENERATIONS. This command defines the maximum number of GP generations; the stop criterion is controlled by a while loop.

Step 5: select parent chromosomes.
The selection strategies provided in the GP toolbox are:

o GS_SEL_HIFIT (or GS_SEL_LOFIT): select the two chromosomes with the highest (lowest) fitness values and return their IDs.
o GS_SEL_LOTRN (or GS_SEL_HITRN): choose chromosomes using tournament selection and return their IDs, with chromosomes of higher (lower) fitness winning the tournament.
o GS_SELR: select two chromosomes randomly.
o GS_SELR_HIFIT (or GS_SELR_LOFIT): select two chromosomes randomly, with the probabilities of selection proportional to their fitness values.
o GS_SELR_HIRANK (or GS_SELR_LORANK): randomly select two high (low) fitness chromosomes, with the probabilities of selection based on fitness rank.

In this research, the tournament method is used. Example: mem_ids = gs_sel_lotrn('Pop1', 4, 2).

Step 6: define and perform genetic operators. Command: OP_NAME = GS_SEL_OP(OP_LIST, OP_PROB). This chooses a random genetic operation from a list according to specified probabilities. OP_LIST is a cell array containing the names of the operations among which to select, OP_PROB is the corresponding list of probabilities, and OP_NAME returns a string identifying the chosen operation, or null ('') if an error occurs.

Then run the genetic operation with: CHROM_IDS = GS_OP(POPULATION, OPERATION, CHROM_ID1, [CHROM_ID2, MUTPROB]). GS_OP applies the specified genetic operation to the specified chromosome(s). POPULATION is a string designating the population containing the chromosomes to be operated on. OPERATION is a string identifying the genetic operation. CHROM_ID1 is the first chromosome to be operated on. CHROM_ID2 is the second chromosome to be operated on, and is required only for crossover operations. MUTPROB is the mutation probability, required only for binary mutation. CHROM_IDS returns a vector of identification numbers of the one or two new chromosomes created by the genetic operation.
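The calls in Steps 5 and 6 can be sketched in Python. The names mirror the toolbox's gs_sel_lotrn and gs_sel_op, but the signatures here are simplified assumptions; in particular, this tournament returns a single winner's index and treats lower fitness as better.

```python
import random

def gs_sel_op(op_list, op_prob, rng=random):
    """Pick a genetic operation name at random according to the given
    probabilities (a simplified analogue of the toolbox's GS_SEL_OP)."""
    return rng.choices(op_list, weights=op_prob, k=1)[0]

def tournament_select(fitnesses, size, rng=random):
    """Tournament selection with lower fitness winning: draw `size` contenders
    at random without replacement and return the index of the best one."""
    contenders = rng.sample(range(len(fitnesses)), size)
    return min(contenders, key=lambda i: fitnesses[i])
```

A larger tournament size raises the selection pressure: with size equal to the population, the best individual always wins, while size 2 gives the mildest pressure.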
Example: op_name = gs_sel_op({'mut', 'xovr'}, [0.5, 0.5]); off_ids = gs_op('Pop1', op_name, 12, 22).

Step 7: define the deletion strategy. The commands are the same as those used for the selection strategies. Command: OUTCOME = GS_DEL(POPULATION, CHROM_ID). POPULATION is a character string designating the chromosome population from which the chromosome is to be deleted. CHROM_ID is an integer identifying the chromosome to be deleted. OUTCOME returns the specified CHROM_ID if the deletion is successful; otherwise, 0 is returned. Example: gs_del('Pop1', off_ids(off)).

Appendix B. Statistical Analysis

B.1 Tests for outliers

Outliers can be detected from studentized residuals. The MATLAB code is given in Table B.1.

Table B.1 MATLAB Code for Outlier Test

nn = length(x1data);
residual = k1_calc1 - k1data;
avg_k1 = sum(k1data)/nn;
sst = sum((k1data - avg_k1).^2);
sse = sum(residual.^2);
ssr = sum((k1_calc1 - avg_k1).^2);
counter = length(param);
mse = sse/(nn - counter);
msr = ssr/(counter - 1);
% evolved model structure; the operator signs are uncertain in the electronic copy
chr1 = char('(1+3*x1+x1*log(x1)) + v1*T + v2*x1/T + v3*(1-log(x1)) + v4*(1-x1)');
term1 = char('T');
term2 = char('x1./T');
term3 = char('1-log(x1)');
term4 = char('1-x1');
r = 0;
X = [eval(term1) eval(term2) eval(term3) eval(term4)];
H = X*inv(X'*X)*X';
for ii = 1:length(k1data)
    r(ii) = abs((k1data(ii) - k1_calc1(ii))/sqrt(mse*(1 - H(ii,ii))));
end

Table B.2 The Result for Outlier Test (data point number followed by its studentized residual r)

1 1.667    26 0.014   51 0.848    76 0.446   101 0.1719
2 0.694    27 0.341   52 0.224    77 0.275   102 0.0758
3 1.594    28 0.106   53 0.0003   78 0.022   103 0.0657
4 0.978    29 0.154   54 0.091    79 0.003   104 0.0673
5 0.437    30 0.030   55 0.156    80 0.062   105 0.1176
6 0.087    31 0.005   56 0.201    81 0.113   106 0.1549
7 0.054    32 0.009   57 0.215    82 0.100
8 0.165    33 0.026   58 0.182    83 0.125
9 0.259    34 0.081   59 0.129    84 0.090
10 0.334   35 0.124   60 0.094    85 1.540
11 1.979   36 0.592   61 2.647    86 1.265
12 0.099   37 0.491   62 1.371    87 0.963
13 0.846   38 1.389   63 0.968    88 0.529
14 0.457   39 0.216   64 0.428    89 0.288
15 0.598
40 0.170   65 0.106   90 0.046
16 0.389   41 0.150   66 0.026    91 0.042
17 0.380   42 0.060   67 0.157    92 0.072
18 0.119   43 0.058   68 0.194    93 0.085
19 0.021   44 0.057   69 0.213    94 0.108
20 0.028   45 0.061   70 0.172    95 0.039
21 0.088   46 0.040   71 0.141    96 0.022
22 0.155   47 0.006   72 4.143    97 6.162
23 1.001   48 0.013   73 0.828    98 0.056
24 2.510   49 2.592   74 0.803    99 0.268
25 0.520   50 2.276   75 0.530    100 0.206

A value of r_i > 3 is a strong indicator that a data point is a likely outlier. The table shows that data points #72 and #97 could be outliers. Examining the data, both #72 and #97 are equilibrium data points of dilute solutions, which cannot be deleted; therefore, no outlier was removed from this data set.

B.2 Evaluation of final model: cross-validation

When it is not possible to gather additional new data and compare such observations with the values predicted by the model, cross-validation is an alternative evaluation method. The original data set can be divided into two subsets A and B at random, which is double cross-validation. A good model should fit both parts. The MATLAB code for the data set partition is given in Table B.3.
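The outlier test of Table B.1 can also be sketched in Python. For simplicity, this version assumes a single regressor, so the leverage reduces to h_ii = x_i^2 / sum(x_j^2); the MATLAB code above builds the full hat matrix X(X'X)^{-1}X' instead.

```python
import math

def studentized_residuals(x, y, yhat, n_params):
    """Studentized residuals r_i = |e_i| / sqrt(MSE * (1 - h_ii)) for a
    single-regressor model (through the origin), where h_ii = x_i^2 / sum(x^2).
    A simplified sketch of the full hat-matrix computation in Table B.1."""
    n = len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    mse = sse / (n - n_params)
    sxx = sum(v * v for v in x)
    return [abs(yi - fi) / math.sqrt(mse * (1.0 - xi * xi / sxx))
            for xi, yi, fi in zip(x, y, yhat)]
```

Points with unusually large r, such as #72 and #97 above, would be flagged for inspection rather than deleted automatically.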
Table B.3 MATLAB Code for Data Set Partition

r = ceil(106.*rand(53,1));
r = sort(r,1);
for ivv = 1:20
    for iii = 2:53
        if r(iii) == r(iii-1)
            r(iii) = ceil(106.*rand(1,1));
        end
    end
    r = sort(r,1);
end
A_index = r;
B_index = 0;
counter_Bset = 0;
for iqq = 1:106
    alarm0 = 0;
    for imm = 1:53
        if A_index(imm) == iqq
            alarm0 = 1;
            break;
        end
    end
    if alarm0 == 0
        counter_Bset = counter_Bset + 1;
        B_index(counter_Bset) = iqq;
    end
end

Table B.4 The Result for Data Set Partition

Set A (Datapoint_index_setA):
2 3 7 15 16 19 21 22 25 29 31 33 37 38 41 44 45 46 48 49 50 52 53 54 56 57 58 63 65 66 69 70 72 73 74 76 78 79 80 81 84 87 88 89 90 91 92 95 96 98 99 100 101

Set B (Datapoint_index_setB):
1 4 5 6 8 9 10 11 12 13 14 17 18 20 23 24 26 27 28 30 32 34 35 36 39 40 42 43 47 51 55 59 60 61 62 64 67 68 71 75 77 82 83 85 86 93 94 97 102 103 104 105 106

Result:

R^2_{p,A} = 1 - \frac{\sum_{i=1}^{n_B} (y_{iB} - \hat{y}_{iA})^2}{\sum_{i=1}^{n_B} (y_{iB} - \bar{y}_B)^2} = 0.9952

R^2_{p,B} = 1 - \frac{\sum_{i=1}^{n_A} (y_{iA} - \hat{y}_{iB})^2}{\sum_{i=1}^{n_A} (y_{iA} - \bar{y}_A)^2} = 0.9948

where \hat{y}_{iA} denotes predictions from the model fitted to subset A and \hat{y}_{iB} those from the model fitted to subset B. R^2_{p,A} and R^2_{p,B} are both high and close to each other, which indicates that the evolved regression model is good.

ABOUT THE AUTHOR

Ying Zhang received a Bachelor's Degree in Chemical Engineering from Beijing Union University, Beijing, China in 1995 and an M.S. in Chemical Engineering from the University of South Florida in 2004, and continued her Ph.D. program at the University of South Florida until 2008.