E14-SFE0001408
Use of random subspace ensembles on gene expression profiles to enhance the accuracy of survival prediction for colon cancer patients [electronic resource] / by Vidya Kamath.
[Tampa, Fla.] : University of South Florida.
Thesis (M.S.B.E.)--University of South Florida, 2005.
Includes bibliographical references.
Text (Electronic thesis) in PDF format.
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
Title from PDF of title page.
Document formatted into pages; contains 104 pages.
ABSTRACT: Cancer is a disease process that emerges out of a series of genetic mutations that cause seemingly uncontrolled multiplication of cells. The molecular genetics of cells indicates that different combinations of genetic events or alternative pathways in cells may lead to cancer. A study of the gene expressions of cancer cells, in combination with the external influential factors, can greatly aid in cancer management, such as understanding the initiation and etiology of cancer, as well as detection, assessment and prediction of the progression of cancer. Gene expression analysis of cells yields a very large number of features that can be used to describe the condition of the cell. Feature selection methods are explored to choose the features that are most relevant to the problem at hand. Random subspace ensembles created using these selected features perform poorly in predicting the 36-month survival for colon cancer patients. A modification to the random subspace scheme is proposed to enhance the accuracy of prediction. The method first applies random subspace ensembles with decision trees to select predictive features. Then, support vector machines are used to analyze the selected gene expression profiles in cancer tissue to predict the survival outcome for a patient. The proposed method is shown to achieve a weighted accuracy of 58.96%, with 40.54% sensitivity and 77.38% specificity in predicting 36-month survival for new and unknown colon cancer patients. The prediction accuracy of the method is comparable to the baseline classifiers and significantly better than random subspace ensembles on gene expression profiles of colon cancer.
Adviser: Dr. Rangachar Kasturi.
Co-adviser: Dr. Dmitry Goldgof
Biomedical Engineering
USF Electronic Theses and Dissertations.
Use of Random Subspace Ensembles on Gene Expression Profiles in Survival Prediction for Colon Cancer Patients
by
Vidya Kamath
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Biomedical Engineering
Department of Chemical Engineering
College of Engineering
University of South Florida
Co-Major Professor: Rangachar Kasturi, Ph.D.
Co-Major Professor: Dmitry Goldgof, Ph.D.
Lawrence Hall, Ph.D.
Steven Eschrich, Ph.D.
Date of Approval: November 4, 2005
Keywords: Microarray, Bioinformatics, Data Mining, Feature Selection, Classifiers
Copyright 2005, Vidya Kamath
DEDICATION
Dedicated to our battle against cancer.
ACKNOWLEDGMENTS
I would like to express my deepest gratitude to Dr. Steven Eschrich for providing me with masterful guidance through the course of this project. I am grateful to Dr. Kasturi, Dr. Goldgof, and Dr. Hall for being on my committee and guiding me with their expertise in the area of data mining. Special thanks are due to Dr. Yeatman and Dr. Gregory Bloom for insightful discussions on understanding cancer and gene expressions. I also extend my thanks to Dr. Niranjan Pai, Kurt Kramer and Andrew Hoerter for their helpful interactions during the implementation of the project. Finally, I am grateful to H. Lee Moffitt Cancer Center and Research Institute for allowing me to work with the microarray gene expression data, and to the colorectal cancer patients whose consent made this work possible.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1: INTRODUCTION
  1.1 Introduction
  1.2 Overview of genetics
  1.3 Structure and function of DNA
  1.4 Cancer
    1.4.1 Cancer vs normal cells
    1.4.2 Causes of cancer
  1.5 Microarray technology for gene expression analysis
    1.5.1 Techniques
  1.6 Overview of bioinformatics methods for gene analysis
  1.7 Outline of the thesis
CHAPTER 2: GENE EXPRESSION DATA FOR ANALYSIS OF COLON CANCER
CHAPTER 3: METHODS FOR GENE EXPRESSION ANALYSIS
  3.1 Introduction
  3.2 Supervised feature selection
  3.3 Unsupervised feature selection
  3.4 Classifiers for gene expression analysis
    3.4.1 Feed-forward backpropagation neural network
    3.4.2 Support vector machines
    3.4.3 C4.5 decision trees
  3.5 Evaluation of classifiers
  3.6 Accuracy of classification
CHAPTER 4: RANDOM SUBSPACE ENSEMBLES FOR GENE EXPRESSION ANALYSIS
  4.1 Introduction
  4.2 Random subspace ensembles
  4.3 Voting techniques to create random subspace ensembles
  4.4 Selection of good subspaces
CHAPTER 5: RESULTS
  5.1 Introduction
  5.2 Supervised feature selection
  5.3 Unsupervised feature selection
  5.4 Baseline experiments with colon cancer gene expression data
  5.5 Majority voting to create random subspace ensembles
  5.6 Selection of good subspaces
  5.7 Verification of results
CHAPTER 6: DISCUSSION AND CONCLUSION
  6.1 Discussion
  6.2 Conclusion
REFERENCES
BIBLIOGRAPHY
APPENDICES
  Appendix A: Application of the proposed method on different gene expression datasets
    A.1 Analysis of leukemia data
    A.2 Analysis of gender data
LIST OF TABLES
Table 1.1: Landmark events in the era of classical genetics
Table 1.2: Characteristics of normal vs cancer cells
Table 2.1: Dukes classification (modified by Turnbull)
Table 3.1: Confusion matrix
Table 5.1: Range of parameters used for majority voting technique using random subspace ensembles
Table 5.2: Confusion matrix for the performance of the support vector machine on the union of features created by selecting good random subspaces (LT: survival less than 3 years, GT: survival greater than 3 years)
Table A.1: Confusion matrix for the performance of the proposed method on the leukemia gene expression dataset
Table A.2: Confusion matrix for the performance of the proposed method on the gender gene expression dataset
LIST OF FIGURES
Figure 1.1: Structure of DNA (a) Basic unit of DNA (b) DNA double helix
Figure 1.2: The central dogma of genetic information processing
Figure 1.3: Process of DNA replication
Figure 1.4: Transcription and translation
Figure 1.5: The genetic code
Figure 1.6: Phase shift in the reading frame of the genetic code
Figure 1.6: Stages of development of cancer
Figure 1.7: Hybridization of RNA with cDNA
Figure 1.8: Perfect-match and mismatch probes form a probe-pair
Figure 3.1: A typical setup for microarray gene expression analysis
Figure 3.2: Formulation of the t-test
Figure 3.3: Three cases with equal difference in means (a) medium variability (b) high variability (c) low variability
Figure 3.4: A sample Kaplan-Meier curve
Figure 3.5: Comparison of two sample K-M curves using log-rank test
Figure 3.6: Architecture of feed-forward-back-propagation neural network
Figure 3.7: A maximum margin hyperplane in a support vector machine
Figure 3.8: Structure of a decision tree
Figure 3.9: 10-fold cross-validation scheme
Figure 4.1: Creation of random subspace ensembles
Figure 4.2: Random subspace ensemble classifier using the majority voting technique
Figure 4.3: Classification scheme for selecting good subspaces
Figure 4.4: Scheme to select good features for classification with typical values of the random subspace parameters (a, r, c) for a 10-fold cross-validation specified in parenthesis
Figure 5.1: Number of features with a specified t-test p-value
Figure 5.2: Number of features with a specified log-rank test p-value for comparing Kaplan-Meier curves of the two survival classes
Figure 5.3: Graph of the number of features retained as the two threshold values of expression level and minimum percentage value are varied
Figure 5.4: Graph of the number of features retained as the threshold for variance is varied
Figure 5.5: Number of features retained as the threshold for MAD values is varied
Figure 5.6: Histogram of the mean level of gene expressions across all samples (a) all genes (b) cancer-related genes
Figure 5.7: Basic classifier block for the baseline gene analysis experiment; the parameter a was varied (100 <= a <= 1000)
Figure 5.8: Performance of baseline classification schemes
Figure 5.9: Basic classifier block to create random subspace ensembles using majority voting technique
Figure 5.10: Random subspace ensembles (a=5000, r=200, c) vs single decision tree (a=5000, r=200, c=1)
Figure 5.11: Weighted test accuracies of 2000 random decision trees
Figure 5.12: Weighted training and testing accuracies of 100 random classifiers built from random subspaces
Figure 5.13: Classification by selection of good subspaces
Figure 5.14: Weighted accuracies of neural networks, support vector machines and decision trees; these classifiers were trained on the union of the best features created by selecting good random subspaces (Section 5.6)
Figure 5.15: Survival curves for the predicted classes; the survival curves are statistically different at significance of 0.05 as determined by a log-rank test
Figure 5.16: Repetition of genes across two or more folds of the cross-validation scheme
Figure 5.17: Variation in the weighted accuracy for prediction of survival for colon cancer with changes in randomization of the samples and feature subspaces
Figure 5.18: Number of features repeatedly selected as the most predictive features across all the experiments to test variability of results
Figure 6.1: Survival curves for two genes, split on the median, repeated across three folds in the classifier scheme described in Section 5.6
Figure A.1: Classifier performance with ALL-AML: neural networks, support vector machines and C4.5 decision trees
Figure A.2: Random subspace ensembles (a=5000, r=200, c) vs. single decision tree (a=5000, r=200, c=1) on the ALL-AML dataset
Figure A.3: Classifier performance with gender dataset: neural networks and support vector machines
Figure A.4: Random subspace ensembles (a=5000, r=200, c) vs. single decision tree (a=5000, r=200, c=1) on the gender dataset
USE OF RANDOM SUBSPACE ENSEMBLES ON GENE EXPRESSION PROFILES IN SURVIVAL PREDICTION FOR COLON CANCER PATIENTS
Vidya Kamath
ABSTRACT
Cancer is a disease process that emerges out of a series of genetic mutations that cause seemingly uncontrolled multiplication of cells. The molecular genetics of cells indicates that different combinations of genetic events or alternative pathways in cells may lead to cancer. A study of the gene expressions of cancer cells, in combination with the external influential factors, can greatly aid in cancer management, such as understanding the initiation and etiology of cancer, as well as detection, assessment and prediction of the progression of cancer.
Gene expression analysis of cells yields a very large number of features that can be used to describe the condition of the cell. Feature selection methods are explored to choose the features that are most relevant to the problem at hand. Random subspace ensembles created using these selected features perform poorly in predicting the 36-month survival for colon cancer patients. A modification to the random subspace scheme is proposed to enhance the accuracy of prediction. The method first applies random subspace ensembles with decision trees to select predictive features. Then, support vector machines are used to analyze the selected gene expression profiles in cancer tissue to predict the survival outcome for a patient.
The proposed method is shown to achieve a weighted accuracy of 58.96%, with 40.54% sensitivity and 77.38% specificity in predicting 36-month survival for new and unknown colon cancer patients. The prediction accuracy of the method is comparable to the baseline classifiers and significantly better than random subspace ensembles on gene expression profiles of colon cancer.
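The three figures reported in the abstract are consistent with the weighted accuracy being the unweighted mean of sensitivity and specificity. The abstract does not state this formula, so it is an assumption here, and the function name `weighted_accuracy` is purely illustrative. A minimal Python sketch of the arithmetic:

```python
def weighted_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity.

    Assumed definition: the abstract reports the three numbers but
    does not give the formula that relates them.
    """
    return (sensitivity + specificity) / 2.0


# Sensitivity and specificity as reported in the abstract, in percent.
print(round(weighted_accuracy(40.54, 77.38), 2))
```

Under this assumed definition the two reported rates reproduce the reported weighted accuracy of 58.96%.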
CHAPTER 1
INTRODUCTION
1.1 Introduction
Cancer is a disease process that emerges out of a series of genetic mutations that cause seemingly uncontrolled multiplication of cells [1,2]. The progress made in the area of molecular genetics in recent years has made it possible to profile the different combinations of genetic events or alternative pathways in cells that may lead to cancer. A study of the gene expressions of cancer cells, in combination with the external influential factors, has shown promise in several areas of cancer management [1,3], such as understanding the initiation and etiology of cancer, as well as detection, assessment and prediction of the progression of cancer.
1.2 Overview of genetics
The fascinating diversity of traits amongst living beings and the transmission of traits through generations of a species have led scientists and biologists to investigate the nature of heredity since the late 1600s. Use of science, reason and observation led to a series of landmark discoveries that yielded a deeper insight into the functioning of living beings. Table 1.1 shows a limited list of the contributors to classical genetics along with their contributions to the field.
Table 1.1: Landmark events in the era of classical genetics
Period  Contributor                                  Contribution to genetics
1651    William Harvey                               Identification of the egg as the basis of life
1665    Robert Hooke                                 Discovery of cells as the basic unit of organisms
1677    Antoni van Leeuwenhoek                       Discovery of sperms
1801    Erasmus Darwin                               Evolution of life based on progress, development and metamorphosis
1815    Jean Baptiste Lamarck                        Evolution based on acquired characteristics
1833    Robert Brown                                 Description of the cell nucleus
1858    Charles Darwin                               Evolution by natural selection
1865    Gregor Mendel                                Law of segregation and law of independent assortment for peas
1880    Eduard Strasburger                           Description of mitosis
1888    Gottfried Waldeyer                           Discovery of chromosomes
1890    August Weismann                              Description of meiosis
1909    Wilhelm Johannsen                            Definition of "genotype", "phenotype" and "genes"
1926    Hermann J. Muller                            Proposal that the gene is the basis of life
1944    Oswald Avery, Maclyn McCarty, Colin MacLeod  Establishment of DNA as genetic material
1953    Watson, Crick                                Double-helix model of DNA
The era of classical genetics focused on understanding the functional behavior of cells. Cells were identified as the basic unit of life, and chromosomes as the basis of individual traits of the cell. However, it was not until the era of molecular genetics that scientists began to investigate the structural and functional properties of the chromosomes. The discovery of deoxyribonucleic acid (DNA) as the molecular basis of chromosomes ushered in the era of molecular genetics. In 1953, Watson and Crick deduced the geometric configuration of the components of DNA along a stretch of the molecule.
Molecular genetics involves the analysis of the exact functioning of DNA at a molecular level in the transmission of traits and the sustenance of life [2,4,6]. DNA serves as the repository of information that determines the genetic variability of an organism.
It is a polymeric molecule that encodes the genetic information for an organism in an
arrangement of nucleic acid bases along the polymer chain [5,7]. A gene is a length of nucleic acids which is responsible for the transmission and expression of a hereditary characteristic. Four nucleic acid bases, thymine (T), adenine (A), cytosine (C) and guanine (G), are arranged in a specific sequence in a gene. This sequence determines the amino acid sequence of the polypeptide chain synthesized through the transcription or expression of the gene. A gene can be treated as a sentence that gives the step-by-step instructions for the production of the protein. Each "word" in this sentence is described by a sequence of three nucleic acid bases (refer to Section 1.3).
1.3 Structure and function of DNA
Structure of DNA
The basic molecular sub-unit of DNA consists of a deoxyribose sugar, attached to a phosphate molecule on one end, and one of the four nucleic acid bases on the other [5,7].
Figure 1.1: Structure of DNA (a) Basic unit of DNA (b) DNA double helix; reproduced with permission from: http://www.biology-online.org
These basic molecular units attach to other such units at the 3' and the 5' position of the molecules, forming a long chain of polymeric molecules like "beads on a chain".
Further, each unit can attach to another unit at the position of the nucleic acid base. These bases cannot undergo non-specific binding: Adenine (A) bonds exclusively with Thymine (T), and Guanine (G) bonds exclusively with Cytosine (C). The bonds between the bases bring together two polymeric DNA chains like the rungs of a ladder, with the two individual strands forming the parallel sides of the ladder. Due to the oblique angle at which each of these two types of bonds can occur, the ladder twists to form a double helix structure [5,7].
Function of DNA
The central dogma of molecular genetics describes the three fundamental phases in genetic information processing:
Replication: DNA serves as a template for additional DNA synthesis
Transcription: DNA provides a template for mRNA production
Translation: mRNA furnishes a template for protein synthesis
Figure 1.2: The central dogma of genetic information processing (DNA, the master template, is replicated and transcribed into RNA, the working template, which is translated into protein)
Replication
Biosynthesis of DNA occurs during cellular division or reproduction. During replication, the DNA molecule functions as a template for the synthesis of two replicate
molecules which are fundamentally identical to the parent DNA. During the first stage of replication, the double-stranded DNA unwinds itself to expose two single DNA strands. Each of these strands serves as a template, directing the growth of the nucleotide base sequence for the synthesis of a new complementary strand, from the 3' to 5' end of the single-stranded template. Each of these two new complementary strands combines with one of the parent strands to form the two replicate DNA molecules. This type of synthesis is termed "semi-conservative", since the parent DNA is entirely contained in the product DNAs: one of the parent strands is found in one replicate molecule, while the other strand is found in the second replicate.
The structural stability intrinsic to the formation of the base pairs reinforces the fidelity of DNA replication. However, the rare occurrence of errors at the level of DNA replication could result in genetic mutations. Three basic types of errors may arise:
i. Substitution, or a mismatch in base pairing during the formation of the new complementary strands, results in the substitution of one base pair for another at a particular point in the molecule.
ii. Deletion, or the loss of a specific base pair from a particular point in the molecule.
iii. Insertion, or the addition of a specific base pair at a particular point in the molecule.
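The pairing rule and the three error types listed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a strand is modeled as a plain string, each error is applied to a single strand rather than to a base pair, and the sequence and all function names are hypothetical.

```python
# Watson-Crick pairing described in the text: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complementary strand (no reversal)."""
    return "".join(PAIR[base] for base in strand)


def substitute(strand: str, pos: int, base: str) -> str:
    """Substitution: replace the base at index pos with another base."""
    return strand[:pos] + base + strand[pos + 1:]


def delete(strand: str, pos: int) -> str:
    """Deletion: drop the base at index pos."""
    return strand[:pos] + strand[pos + 1:]


def insert(strand: str, pos: int, base: str) -> str:
    """Insertion: add a base in front of index pos."""
    return strand[:pos] + base + strand[pos:]


template = "ATCGATCTC"  # hypothetical sequence, for illustration only
print("complement:  ", complement(template))
print("substitution:", substitute(template, 3, "A"))
print("deletion:    ", delete(template, 3))
print("insertion:   ", insert(template, 3, "A"))
```

Note that a substitution keeps the strand length unchanged, while a deletion or an insertion shifts every base downstream of the change by one position.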
Figure 1.3: Process of DNA replication
Transcription
The genetic message encoded in a DNA molecule via the nucleotide base sequence is instrumental in the formation of a specific protein. However, DNA is not directly used in the formation of a protein. Instead, mRNA (messenger RNA) is first synthesized as a working template from the DNA master template through the process of transcription. Hence, the genetic information contained in DNA is transferred to the
mRNA molecule via the process of transcription. Only one of the complementary DNA strands may be used in transcription at a time, depending on the gene being transcribed. Synthesis of mRNA proceeds as in DNA replication. However, the RNA base Uracil (U) is used instead of the DNA base Thymine (T). Once synthesized, the newly formed mRNA molecule is released from the DNA template, which then resumes its right-handed helical form. The newly formed mRNA is then transported to the cytoplasm of the cell, the site of translation.
Figure 1.4: Transcription and translation; images reproduced with permission from Dennis O'Neil, copyright 2005 by Dennis O'Neil
Translation
The synthesis of a protein molecule from an mRNA template occurs through the process of translation. A protein is synthesized by translating the mRNA nucleotide base sequence into the amino acid sequence of a primary polypeptide by means of a 4-letter, 64-word genetic code (Figure 1.5). Each triplet (or codon) in the mRNA sequence of bases gives instruction for one amino acid to be included into a growing polypeptide
chain. Thus, the linear arrangement of codons along the mRNA template dictates the types and the linear arrangement of amino acids in the final protein product. A ribosomal complex within the cell aids in setting the phase of the genetic message. Reading of the mRNA template occurs here, and protein synthesis proceeds along the 5' to 3' direction. As the genetic code is read at the ribosomal level, each codon is recognized by a particular transfer RNA (tRNA) molecule. This tRNA transports the amino acid specified by the codon to the site of protein synthesis.
Figure 1.5: The genetic code
1st position   2nd position                  3rd position
(5' end)       U      C      A      G        (3' end)
U              Phe    Ser    Tyr    Cys      U
U              Phe    Ser    Tyr    Cys      C
U              Leu    Ser    STOP   STOP     A
U              Leu    Ser    STOP   Trp      G
C              Leu    Pro    His    Arg      U
C              Leu    Pro    His    Arg      C
C              Leu    Pro    Gln    Arg      A
C              Leu    Pro    Gln    Arg      G
A              Ile    Thr    Asn    Ser      U
A              Ile    Thr    Asn    Ser      C
A              Ile    Thr    Lys    Arg      A
A              Met    Thr    Lys    Arg      G
G              Val    Ala    Asp    Gly      U
G              Val    Ala    Asp    Gly      C
G              Val    Ala    Glu    Gly      A
G              Val    Ala    Glu    Gly      G
Initiation of protein synthesis occurs at the AUG codon or the start-word by activating the ribosomal complex to set the phase of translation. Subsequently, the ribosome shifts one triplet down the mRNA in the 3' direction during translation and the
appropriate tRNA brings the amino acid encoded by the new codon into position. This process continues until one of the three stop-codons (UAA, UAG, or UGA) is encountered. The initiator codon-ribosome complex partitions the mRNA base sequence into codons to determine the reading frame of the translation. A phase-shift mutation (DNA deletions and insertions) in the gene modifies this reading frame. For example, Figure 1.6 shows three ways in which the genetic code could be read, depending on the position or phase of the first base pair. Addition or deletion of a base pair from the genetic sequence changes the sequence of the base pairs translated, and this can radically change the protein structure.
Figure 1.6: Phase shift in the reading frame of the genetic code
Sequence: 5'-AUCGAUCUCGCCCAC-3'
Frame 1: AUC GAU CUC GCC CAC -> Ile Asp Leu Ala His
Frame 2: UCG AUC UCG CCC -> Ser Ile Ser Pro
Frame 3: CGA UCU CGC CCA -> Arg Ser Arg Pro
1.4 Cancer
1.4.1 Cancer vs normal cells
Reproduction through cell division is essential for body growth and tissue repair. Cells that are constantly sloughed off the surface, such as cells of the skin and intestinal lining, reproduce themselves almost continuously. The initiation signals for cell division are not fully understood, but surface-volume relationships are deemed to be important [1,2,6]. The volume of a cell dictates the amount of nutrients needed for the
cell to survive. The need for nutrients grows in proportion to the size of the cell. However, the surface area of the plasma membrane gradually becomes inadequate to transfer nutrients to the cell and flush the waste products out of the cell. When the cell reaches such a critical size, cell division is initiated to produce two daughter cells that are each smaller in size.
Cell division is also influenced by other mechanisms, including availability of space and chemical signals such as growth factors and hormones released by neighboring or distant cells. Normal cells employ the phenomenon of contact inhibition to stop proliferating when they begin touching. When cells break free from these normal controls of cell division, they begin to divide wildly, thus turning into cancer cells [1,2,6].
It is estimated that four to seven mutational events must occur between an initial normal state and a final stage of malignancy of a cell. For example, some epithelial cancers, such as skin cancer and colon cancer, follow a sequence that includes [2,6]:
i. Hyperplasia ("increased numbers of regularly arranged normal cells")
ii. Dysplasia ("increased numbers of normal cells with some atypical cells and some abnormal arrangement of cells but with no major disturbance of tissue structure")
iii. Carcinoma-in-situ ("a severe form of dysplasia, with numerous atypical cells major disturbance of tissue structure but no invasion of surrounding tissue")
iv. Invasive cancer ("spread of altered cells derived from one tissue into adjacent different tissues").
The risk of developing invasive cancer at the site of a dysplastic lesion is greater than that of developing cancer from normal tissue, and the risk of invasive cancer developing from a carcinoma-in-situ lesion is greater than that of developing it from a dysplastic lesion.
Table 1.2: Characteristics of normal vs cancer cells
Normal Cells                          Cancer Cells
Reproduce themselves exactly          Reproduce continuously
Stop reproducing at the right time    Don't obey signals from other neighboring cells
Stick together in the right place     Don't stick together
Become specialized or 'mature'        Don't become specialized, but stay immature
Self destruct if they are damaged     Don't self-destruct or die if they move to another part of the body
Alteration of cell behavior that transforms the cell from normal to cancerous is permanently maintained and transmitted to descendant generations of the cell through the chromosomes, the genetic component of the cell [2,6]. Normal cellular activity does not require all genes to be operational within the cell; however, a relatively intact set of chromosomes is vital. Each cell is furnished with a complete chromosome set during the process of reproduction, when each daughter cell receives a replica of the chromosomes of the parent cell. The delicate balance of the integrated genetic system may be disrupted if a chromosome or parts of a chromosome are lost from or added to the genome through some error during cell division. Such an error may have fatal consequences, not only in the cells affected but eventually in the whole organism [2,6].
Cancer cells evolve along pathways that define the fate of the tumor [1,2]. Once transformed into cancerous cells, growth becomes more rapid, and cell types of a less normal nature appear in the tissue [2,6]. The ability of the abnormal cells to invade surrounding tissues becomes more evident. The cell then undergoes a series of physiological alterations that could collectively encourage malignant growth [1,2,6]. These changes include self-sufficiency in growth signals; insensitivity to growth-inhibitory signals; evasion of programmed cell death; limitless replicative potential; sustained angiogenesis; and tissue invasion and metastasis [1,2].
Figure 1.6: Stages of development of cancer; images reproduced with permission from: www.cancerhelp.org.uk
The cancer cells are classified according to a grade based on how normal the cell appears to be. The more normal a cancer cell is, the lower its grade. The more abnormal or less well-developed a cancer cell is, the higher its grade. The sequence of events that cause the cells to change from dysplasia to carcinoma-in-situ to low-grade malignancy to high-grade malignancy could possibly be programmed in the genetic material of the cell at the time of the first essential change from normal to cancerous state [1,2].
1.4.2 Causes of cancer
Mutations of genes alter the behavior of the cell in ways that may ultimately cause the cell to become cancerous [1,2,6]. While innumerable factors, such as the DNA replicative state, the repair potential and the hormonal status of the host, are likely to be promotional factors for cancer initiation, exposure of the cell to carcinogens is also a likely cause of cancer. Almost all known carcinogens have been shown to be capable of irreversibly binding to genetic material in receptive animal tissues. This occurs either as a consequence of direct chemical reaction or metabolism of reactive metabolites. The initial event in carcinogenesis is the introduction of certain inheritable defects causing the cells to divide incorrectly. Here, genetic material is divided disproportionately between
daughter cells. This results in a mixed population of cells in both cancerous and pre-cancerous lesions that compete with each other for nutrients and survival. Selective survival of the most aggressive of these cells could lead to tumor progression.
At least three distinct kinds of genes are important in making a cell cancerous: oncogenes, tumor suppressor genes and DNA repair genes.
i. Oncogenes are genes that encourage the cell to multiply. These are normal cellular genes that, when inappropriately activated, cause the cell to multiply without stopping.
ii. Tumor suppressor genes are genes that stop the cell multiplication. These genes produce proteins that act to slow or regulate mitogenic activity. When these genes are impaired, the cells are not inhibited from multiplying uncontrollably and malignant progression occurs.
iii. DNA repair genes are genes that repair the other damaged genes. These genes aid in detection and facilitation of the correction of errors in the genetic code.
1.5 Microarray technology for gene expression analysis
DNA microarray chips are employed to analyze the genetic behavior of tissue. An in-depth description of the behavior of cells may be obtained by analyzing the DNA of the cells. It is important to understand and locate the presence of mutations in genes that are important to the functioning of the cell, as well as to detect the genes that are active in the cell. Manual analysis of all the genes in any cell would take an extraordinarily long time. The time lag for such manual processing may be unacceptable while attempting to make decisions regarding treatment options for patients based on the possible genetic mutations in the tissue. Microarray technology alleviates this problem in
the following ways: microarray technology can follow the activity of many genes at the same time, compare the activity of genes in diseased and healthy cells, determine any mutations in a gene, and categorize diseases into subgroups while acquiring results very rapidly.
1.5.1 Techniques
Each cell in the body ideally carries the same copy of the entire genome as every other cell [2,6,7]. However, only a small set of these genes is active in any one cell, and the functioning of these genes aids in understanding the functioning of the cell. Directly measuring the DNA of a cell will not aid in quantifying the level of expression of a gene. However, the number of copies of each mRNA in a cell indicates the level of activity of the gene that corresponds to that mRNA. Labeling these copies of mRNA and counting them will then directly indicate the level of activity of the corresponding genes. A sample of the tissue may be analyzed in isolation, to understand the behavior of the cell in its natural environment, or the sample tissue may be subjected to two or more different kinds of environment in order to analyze the genetic behavior of the tissue under different conditions. Further, one kind of tissue may be compared with another kind to analyze the difference in gene expressions and activity in the two tissue types. In general, gene expression analysis techniques involve three major steps: preparing the DNA chip, carrying out the reaction, and collection and analysis of data [3,8].
Preparing the DNA chip

The first step in analyzing gene activity is to list the genes that need to be monitored. The sequences, or parts of the sequences, of these genes must be specified. A piece of each gene is synthesized as a short strand of DNA, or oligonucleotide, a few base pairs long (for example, Affymetrix DNA chips use 25 base pairs). Each short strand of DNA is fixed on a tiny spot on a slide. Billions of copies of each strand are affixed on the same spot of the slide. Several thousand such genes may be converted into short strands and fixed to the glass slide for analysis.

Carrying out the reaction

The next step is to obtain the mRNA transcribed from the DNA of the target cell under the environmental conditions being studied. If more than one environmental condition of the cell is used, the mRNA obtained under each condition is labeled separately with a fluorescent stain to make it easy to identify the level of activity under the different conditions. This is achieved by reverse transcribing the mRNA into complementary DNA, or cDNA. By introducing modified fluorescent bases into the cDNA during reverse transcription, the cDNA can be conveniently tagged with different colors, such as red and green, for different experimental conditions. The one or more sets of cDNA are then combined and hybridized onto the DNA chip in a special chamber.
Figure 1.7: Hybridization of RNA with cDNA, image reproduced with permission: Wosik, E. cDNA-Detailed Information, Connexions Web site. http://cnx.rice.edu/content/m12385/1.2/, Sep 30, 2004

Collection and analysis of data

The final step is to measure the amount of each type of cDNA hybridized to any spot on the DNA chip. If multiple conditions are used, not just the total amount of hybridization, but also the relative levels of hybridization of the two types of cDNA on any one spot of the chip are important. Color laser scanners, one for each color used in tagging the cDNA, are used to scan the DNA chip. Each color scan indicates the amount of that color of cDNA hybridized to all spots on the chip. By combining the information from the color scans, one can measure the relative expression of genes under the different experimental conditions. The results indicate which genes are turned on, and to what level of activity, under the experimental conditions. Alternatively, the fluorescence of a single color indicates the expression level of a gene under the target conditions.
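
The combination of two color scans into a relative expression value can be illustrated with a small sketch. It assumes, as is common practice but not stated in the text above, that relative expression at a spot is summarized as the base-2 logarithm of the ratio of the two scanned intensities. The function name and the intensity values are hypothetical.

```python
import math


def log2_ratio(red_intensity: float, green_intensity: float) -> float:
    """Relative expression of one spot as log2(red / green).

    Assumption: the red and green scans correspond to the two
    experimental conditions; both intensities must be positive.
    """
    if red_intensity <= 0 or green_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red_intensity / green_intensity)


# Hypothetical spot: the red-labeled condition is four times brighter.
print(log2_ratio(800.0, 200.0))
```

Under this convention a value of zero means equal hybridization under both conditions, while positive and negative values indicate higher expression under the red-labeled and green-labeled condition respectively.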
Oligonucleotide arrays

Two of the several types of microarray technology available today for DNA analysis are cDNA (complementary DNA) chips and short oligonucleotide arrays [8]. cDNA chips measure the relative abundance of a spotted DNA sequence in two DNA or RNA samples by assessing the differential hybridization of these two samples to the sequence on the array. Here, probes are defined as the DNA sequences spotted on the array.

Short oligonucleotide arrays, such as Affymetrix chips, on the other hand, use "probesets" for measurement. Each gene is represented by 6-20 oligonucleotides of 25 base pairs (or 25-mers). Each 25-mer is called a "probe". Two complementary probes are created for a 25-mer that has to be analyzed: a perfect-match probe is a 25-mer that is the exact complement of the target, and a mismatch probe is a 25-mer that is the same as the perfect match but with a single homomeric base change at the middle (13th) base.

Figure 1.8: Perfect-match and mismatch probes form a probe-pair

A perfect-match and mismatch combination for a 25-mer sequence of a gene is called a "probe pair", and about 16-20 probe pairs form a "probeset". The addition of the mismatch probe to the experiment helps in the measurement of non-specific binding and background noise.
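
The construction of a probe pair can be sketched in a few lines. The sketch below assumes that the "homomeric base change" replaces the 13th base of the perfect-match probe with its complementary base (A with T, C with G); the function name and the example sequence are hypothetical and are not taken from any Affymetrix specification.

```python
# Watson-Crick complements of the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(perfect_match: str) -> str:
    """Return the mismatch partner of a 25-mer perfect-match probe.

    Assumption: only the middle (13th) base is changed, and it is
    replaced by its complementary base.
    """
    if len(perfect_match) != 25:
        raise ValueError("expected a 25-mer")
    middle = 12  # zero-based index of the 13th base
    swapped = COMPLEMENT[perfect_match[middle]]
    return perfect_match[:middle] + swapped + perfect_match[middle + 1:]


# Hypothetical 25-mer used only for illustration.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

The two printed sequences differ only at the 13th position, which is what lets the mismatch signal serve as an estimate of non-specific binding for its perfect-match partner.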
1.6 Overview of bioinformatics methods for gene expression analysis

Advancements in the area of molecular genetics have enabled the mapping of the entire human genome. About 30,000 genes of the human genome have been mapped today, and specific information regarding the genes actively expressed in various tissue types has made it possible to identify the normal functioning of cells in the body. The introduction of microarray technology allowed the analysis of several thousands of genes in a single experiment. This explosion of information makes it possible to thoroughly investigate the expression of genes in tissues. The genetic activity in normal cells can be compared with the activity in tumor cells, and tumors of different types may be distinguished.

Several investigators have worked towards mining meaningful information from the thousands of genes acquired from microarray experiments in order to distinguish between various diseased conditions of tissues [10,11,12]. In the area of cancer management, the two main areas of research have been class discovery and class prediction. Class discovery involves identifying previously unrecognized tumor types, and class prediction involves the present or future assignment of a tumor to a previously discovered tumor type.

Golub et al analyzed two types of acute leukemia (ALL: acute lymphoblastic leukemia, and AML: acute myeloid leukemia) to develop a general strategy for discovering and predicting types of cancer. Neighborhood analysis was used to identify a set of informative genes that could predict the class of an unknown sample of leukemia. Each informative gene was used to cast a weighted vote on the class of the sample, and the summation of the votes predicted the class of the sample. Self-organizing maps (SOM) were used to cluster tumors by gene expression to discover new tumor types.
van 't Veer et al [15,16] utilized a hierarchical clustering algorithm to identify a gene expression signature that could predict the prognosis of breast cancer. Two subgroups were created, using the clustering technique, with genes that were highly correlated with the prognosis of cancer. The number of genes in each cluster was then optimized by sequentially adding subsets of 5 genes and evaluating the power of prediction in a leave-one-out cross-validation scheme. Expression profiles of tumors with a correlation coefficient above the optimized sensitivity threshold were classified as good prognosis, and the rest as poor prognosis.

Alon et al distinguished between normal and tumor samples of colon cancer using a deterministic annealing algorithm. Genes were clustered into separate groups sequentially to build a binary "gene tree", and tissues were clustered to create a "tissue tree". Genes that showed strong correlation were found closer to each other on the "gene tree", and tissues with strong similarities were found close together on the "tissue tree". A two-way ordering of genes and tissues was used to identify families of genes and tissues based on the gene expressions in the dataset.

Glinsky et al identified an 11-gene signature that was shown to be a powerful predictor of a short interval to distant metastasis and poor survival after therapy in breast and lung cancer patients diagnosed with early-stage disease. The method clustered genes exhibiting concordant changes of transcript abundance. The degree of resemblance of the transcript abundance rank order within a gene cluster between a test sample and a reference standard was measured by the Pearson correlation coefficient.

Ramaswamy et al analyzed a 17-gene signature to study the metastatic potential of cancer cells in solid tumors. Genes were selected based on a signal-to-noise
metric followed by hierarchical clustering to determine the individual correlations for the selected genes. The results of the algorithm were tested using Kaplan-Meier survival analysis techniques.

Eschrich et al showed that molecular staging of colorectal cancer, using the gene expression profile of the tumor at diagnosis, can predict the long-term survival outcome more accurately than clinical staging of the tumor. A feed-forward back-propagation neural network was used with 43 genes to predict the molecular stage of a tumor sample.

1.7 Outline of the thesis

While the main goal of the study is to develop a classifier scheme using a random subspace ensemble to improve the accuracy of survival prediction for colon cancer patients, it is essential to have a fundamental understanding of microarray gene expression data and the methods generally used to analyze this data.

Chapter 2 describes the microarray gene expression data used in the study.

Chapter 3 introduces the general method used to analyze gene expression data, including feature selection and classification. A brief description of some of the algorithms used at various stages of the analysis is given. The chapter concludes with a description of the methods used to evaluate and measure the performance of the classifiers.

Chapter 4 introduces the concept of random subspaces and describes three methods of creating random subspace ensembles, highlighting the merits and demerits of each method.
Chapter 5 describes the experiments conducted with feature selection methods and the various baseline classifier experiments on the colon cancer gene expression dataset. A detailed description of the experiment with random subspace ensembles is presented next. The chapter concludes with a verification of the results.

Finally, a discussion of the proposed method, its merits and potential improvements is presented in Chapter 6, followed by conclusions from the study.

In addition to the experiments with colon cancer data, the proposed method was tested on datasets with different clinical measures (leukemia and gender). A description of these experiments is presented in Appendix Section A.
CHAPTER 2
GENE EXPRESSION DATA FOR ANALYSIS OF COLON CANCER

Colorectal cancers are the second most common cause of cancer-related deaths in developed countries and the most common GI (gastro-intestinal) cancer. Colon cancer develops as polyps in the intestinal wall, and could progress slowly to a severe-stage cancer if left unchecked. A common and well-accepted method of clinical staging of colon cancer is the Dukes' classification (Table 2.1) [2,6]. However, the Dukes' staging system has been shown to be inadequate in determining prognosis for patients diagnosed with stage B or C colon cancer. Molecular staging, on the other hand, has shown promise in predicting prognosis for patients based on the gene expression profile of the tumor.

Table 2.1: Dukes' classification (modified by Turnbull)

Stage   Description
A       Limited to bowel wall
B       Extension to pericolic fat; no nodes
C       Regional lymph node metastasis
D       Distant metastasis (liver, lung, bone)

The goal of this study is to analyze gene expression patterns to predict 36-month survival for colon cancer patients. The samples used for the study were categorized based on patient prognosis rather than clinical staging. Samples were classified as good prognosis cases if the patient survived longer than 36 months, and bad
prognosis if the survival was less than 36 months. Thus, all patients used for the study had to have been followed for at least 36 months.

121 samples of colorectal adenocarcinoma were selected from the Moffitt Cancer Center Tumor Bank, including 37 samples with bad prognosis and 84 samples with good prognosis. The samples included all four Dukes' stages of colon cancer. The evidence of survival, as well as patient information such as family history of cancer and treatment history, was acquired from the cancer registry. Each tissue sample used for the microarray analysis was taken from the primary site of the tumor during surgical resection and verified as adenocarcinoma of the colon by a pathologist.

The gene expression microarray used to analyze these tumor cases was the Affymetrix Human Genome U133 Plus 2.0 Array. Each microarray experiment measured the expression levels of 54675 probesets (refer to Section 1.5.1). These expression levels were normalized using the Robust Multichip Average (RMA) method to yield feature values on a log-2 scale for analysis.
CHAPTER 3
METHODS FOR GENE EXPRESSION ANALYSIS

3.1 Introduction

DNA microarray analysis generates information about the level of expression of genes in a target cell or tissue type [3,8]. Normal and cancerous cells are expected to exhibit differential expression of certain genes. For example, abnormally high levels of expression of oncogenes, which ordinarily regulate the multiplication of cells [1,2], could indicate a tendency of the cell to proliferate without control. The genes that are instrumental in the onset and progression of cancer are likely to be expressed differently than in normal tissue, with alterations in their expression levels as the cancer progresses. Identifying these differentially expressed genes, and analyzing their expression patterns as the cancer progresses, will aid the diagnosis and prognosis of cancer.

Microarray gene expression analysis involves studying the expression patterns of genes across varying environmental conditions of the tissue, such as tumor cells treated with different types of drugs or radiation therapy, or across the different stages of development of the cancer. The aim of these analyses is to identify a set of genes that are reliably expressed differently across the different stages of cancer. However, most microarray experiments yield a very large number of features for analysis. In practical situations, it is reasonable to assume that only a subset of these features truly represents the distinction between the stages of cancer, as well as between cancerous and normal tissues. Hence,
methods to reduce the dimensionality of the feature set are used in the first stage of analysis, to obtain a minimum useful set for classification. This feature selection may be done in a supervised or unsupervised manner. While supervised techniques use the underlying class information to select features, unsupervised techniques use empirical evidence in the data to decide whether or not a feature would aid classification.

In general, classifiers are used to analyze a set of samples and separate them into groups, such that the characteristics of each group reflect the characteristics or features of the individual samples of the group. These defining features are governed by the context of analysis. In practical situations, even when the best feature set is used, it is often difficult to identify features that unambiguously separate groups from one another as well as predict the classes of new samples. Hence, classification methods aim at uncovering patterns that best describe the distinction between these groups. These patterns are learned from training samples and later used in predicting the class of new and unseen samples.

Figure 3.1: A typical setup for microarray gene expression analysis
[Flow diagram of microarray expression analysis. Input: probesets from microarray analysis. Feature selection: pick a set of features that will aid classification. Classifier: build a classifier to distinguish between the classes, using the selected features.]

Thus, a typical microarray gene analysis experiment would follow the steps shown in Figure 3.1. The first stage in analyzing gene expressions is to select a limited set of
features that would aid the classification stage in identifying the important patterns that distinguish between the classes. The features that may confuse the classification are dropped from consideration. A classifier is then built to learn patterns from these selected features in order to distinguish between the classes or conditions under consideration. The knowledge gained by this classifier in learning the patterns from the training samples is evaluated to ensure that the patterns may be generalized to unseen samples. A measure of performance of the classifier will indicate its expected performance in predicting the class of an unseen sample.

3.2 Supervised feature selection

Supervised feature selection methods use the underlying class information of all the samples to make a decision regarding the importance of a particular feature in distinguishing between the classes. Statistical techniques such as the Student's t-test and survival analysis [25,26], which attempt to capture the biological relevance of a feature, are used to retain a minimal set of features that would be best able to distinguish between the classes.

Student's t-test

The Student's t-test is used to determine whether the means of two groups are statistically different from each other. This analysis assumes normally distributed data, with mean $\mu$ and variance $\sigma^2$. The two groups for comparison are created by varying one or more features that characterize the samples. The aim of the t-test is then to
determine if the distribution of the samples changes due to the variation of the feature(s), and if so, whether the change can be detected easily.

The null hypothesis for a t-test is that the two groups are not different from each other, and hence have the same mean. A t-statistic is computed from the samples of the two groups and is treated as "evidence" for or against the null hypothesis. The computed statistic is compared to a standard measure to decide whether or not to reject the null hypothesis. Strong evidence for being able to detect a difference between the two groups would suggest rejection of the null hypothesis.

Figure 3.2: Formulation of the t-test; reproduced with permission from: http://www.socialresearchmethods.net/kb/stat_t.htm
[Distributions of Group T and Group C, each with its own mean.]

Figure 3.2 shows the distributions of two groups with individual means. In a basic sense, the distance between the two means can be used as a measure of difference between the groups, and gives an indication of the distinction between the groups. However, as shown in Figure 3.3, the distinction between the groups may be influenced by the relative spread of the two groups. The standard error of the difference between the two means is:

$$SE(\bar{X}_T - \bar{X}_C) = \sqrt{\frac{var_T}{n_T} + \frac{var_C}{n_C}}$$

where
$\bar{X}_T$: mean of group T
$var_T$: variance of group T
$n_T$: number of samples in group T
$\bar{X}_C$: mean of group C
$var_C$: variance of group C
$n_C$: number of samples in group C
Figure 3.3: Three cases with equal difference in means (a) medium variability (b) high variability (c) low variability; reproduced with permission from: http://www.socialresearchmethods.net/kb/stat_t.htm

Thus, a true measure of the difference between the two groups can be obtained by computing a score that measures the difference between their means relative to the spread or variability of their distributions. The t-statistic is computed as the ratio of the difference between the two means to the standard error of the difference:

$$t = \frac{\bar{X}_T - \bar{X}_C}{\sqrt{\dfrac{var_T}{n_T} + \dfrac{var_C}{n_C}}}$$

where
$\bar{X}_T$: mean of group T
$var_T$: variance of group T
$n_T$: number of samples in group T
$\bar{X}_C$: mean of group C
$var_C$: variance of group C
$n_C$: number of samples in group C

The t-statistic indicates the ease of distinguishing between two groups in the presence of variability, whether due to the inherent variability of the data or to noise in measurement.

A standard (critical) t-statistic is determined from the degrees of freedom available and the significance level desired for the test. A significance level ($\alpha$), commonly set at 0.05, indicates that 5 times out of 100 a significant difference between the means could be found merely by chance, even if there was none. The computed t-statistic is compared with the standard t-statistic to obtain a p-value for the test. The p-value is the probability of observing a difference at least as large as the one measured if the two groups in fact had the same mean.
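
The computation of the t-statistic from the definitions above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name and the sample values are hypothetical, and the p-value lookup is omitted because it requires the t-distribution.

```python
import math
from statistics import mean, variance


def t_statistic(group_t, group_c):
    """t = (mean_T - mean_C) / sqrt(var_T / n_T + var_C / n_C).

    `variance` is the sample variance (N - 1 in the denominator).
    """
    standard_error = math.sqrt(
        variance(group_t) / len(group_t) + variance(group_c) / len(group_c)
    )
    return (mean(group_t) - mean(group_c)) / standard_error


# Hypothetical expression values of one feature in two patient groups.
group_t = [5.0, 6.0, 7.0, 8.0, 9.0]
group_c = [1.0, 2.0, 3.0, 4.0, 5.0]
print(t_statistic(group_t, group_c))
```

In this example both groups have the same spread, so the statistic is driven entirely by the distance between the two means; increasing the variance of either group would shrink it, as Figure 3.3 suggests.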
A p-value of less than the $\alpha$-level indicates that the difference between the two groups is statistically significant, and hence, the null hypothesis is rejected.

The ability of each feature in the microarray dataset to predict the classes for new samples can be examined using the described t-test. The p-value for each feature is univariately computed at a significance level $\alpha$ = 0.05. All features with p-values less than 0.05 are considered by convention to have statistically significant power to discriminate between the two groups of patients.

Survival analysis

When dealing with problems in cancer research, a common endpoint is determining whether a patient will survive for a certain period of time. Here, "death" is considered to be an event, and survival analysis [25,26] attempts to model the time-to-event in order to predict the fraction of the population that could survive past a certain time. Of those that survive, the analysis tries to predict the rate at which these patients would fail or die.

Survival models may be viewed as ordinary regression models where the response variable is time. However, this analysis also needs to account for missing data, or information on patients who could not be followed for the entire duration of the study for various reasons. This introduces the concept of censoring in survival analysis. It may be known that a patient had colon cancer, but died at an unknown time before data collection began. This is known as left censoring. Right censoring occurs when a patient's date of death lies at an unknown time in the future. When a sample is both left and right censored at the same time, the sample is said to be interval censored. Another possibility of an
incomplete event is delayed entry, when the patient does not enter the study until a certain event occurs.

Kaplan-Meier (KM) curves [25,26] are used to plot the probability of survival of the population against intervals of time. For each interval, the survival probability is computed as the ratio of the number of patients alive at that time point to the number of patients at risk. All patients who are alive and have reached the time point are considered to be "at risk". Patients who either die before the time point or are "lost" to the study are not counted as "at risk" patients. "Lost" patients are censored. Further, patients who have not yet reached the time point are not considered "at risk". The probability of survival to any point is estimated from the cumulative probability of surviving each of the preceding time intervals. This formula, also known as the Kaplan-Meier Product-Limit formula, is given by:

$$S(t) = \prod_{j=1}^{t} \left( \frac{n - j}{n - j + 1} \right)^{\delta(j)}$$

where $n$ is the total number of cases, and

$$\delta(j) = \begin{cases} 1 & \text{if the } j\text{th case is uncensored} \\ 0 & \text{if the } j\text{th case is censored} \end{cases}$$

Figure 3.4: A sample Kaplan-Meier curve
[Plot of the cumulative probability of survival (%) against time. In one interval, 10 patients are at risk, 2 die and 8 remain alive, so the interval survival is 8/10 = 80% and the cumulative probability is 30% x 80% = 24%. In the next interval, 5 patients are at risk, 3 die and 2 remain alive, so the interval survival is 2/5 = 40% and the cumulative probability is 24% x 40% = 9.6%. Censored patients are removed from the at-risk count between intervals.]
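
The product-limit formula can be traced with a short sketch. It assumes that the cases are ordered by follow-up time and, as a convention chosen here for ties, that a death at a given time is processed before a censoring at the same time. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate after each ordered case.

    times  : follow-up time of each case
    events : True if the case is uncensored (death observed),
             False if the case is censored
    Returns a list of (time, survival probability) pairs.
    """
    # Order cases by time; at equal times, deaths come before censorings.
    cases = sorted(zip(times, events), key=lambda c: (c[0], not c[1]))
    n = len(cases)
    survival = 1.0
    curve = []
    for j, (time, uncensored) in enumerate(cases, start=1):
        if uncensored:  # delta(j) = 1; censored cases contribute a factor of 1
            survival *= (n - j) / (n - j + 1)
        curve.append((time, survival))
    return curve


# Hypothetical follow-up times in months; False marks a censored patient.
months = [2, 3, 4, 5, 6]
died = [True, True, False, True, False]
for time, probability in kaplan_meier(months, died):
    print(time, round(probability, 3))
```

Censored cases leave the estimate unchanged at their own time point but reduce the number of cases remaining for later deaths, which is how incomplete follow-up enters the curve.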
The computation thus far has been shown on a single group of samples. In DNA analysis for prediction of survival for colon cancer patients, two groups of samples are used: a group that survived less than 36 months, and a group that survived longer than 36 months. Survival curves for both of these groups may be shown on a single graph. The next task is then to determine whether or not these two KM curves are statistically equivalent.

Figure 3.5: Comparison of two sample K-M curves using log-rank test
[Plot of the survival distribution against survival time in months for Cluster 1 and Cluster 2; log-rank test, P < 0.001.]

The log-rank test can be used for this purpose. This test is basically a large-sample chi-square test that uses as its test criterion a statistic that provides an overall comparison of the two KM curves:

$$\text{Log-rank statistic} = \frac{(O_i - E_i)^2}{Var(O_i - E_i)}$$

where
$O_i$: observed score for group $i$
$E_i$: expected score for group $i$

$$Var(O_i - E_i) = \sum_j \frac{n_{1j}\, n_{2j}\, (m_{1j} + m_{2j})\, (n_{1j} + n_{2j} - m_{1j} - m_{2j})}{(n_{1j} + n_{2j})^2\, (n_{1j} + n_{2j} - 1)}$$

where $n_{ij}$ is the number of samples at risk in group $i$ at the $j$th failure time and $m_{ij}$ is the number of failures in group $i$ at that time; $i$ = 1, 2.
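
A minimal sketch of the two-group log-rank computation is given below. It assumes the usual expected count for group 1 at each failure time, $n_{1j}(m_{1j} + m_{2j})/(n_{1j} + n_{2j})$, which is not written out in the text above, and it obtains the p-value from the chi-square distribution with one degree of freedom through the complementary error function. The function name and the data are hypothetical.

```python
import math


def log_rank_test(times1, events1, times2, events2):
    """Two-group log-rank statistic and its chi-square (1 d.o.f.) p-value.

    times*  : follow-up times of the samples in each group
    events* : True where the failure (death) was observed
    """
    failure_times = sorted(
        {t for t, e in zip(times1, events1) if e}
        | {t for t, e in zip(times2, events2) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in failure_times:
        n1 = sum(1 for x in times1 if x >= t)  # at risk in group 1
        n2 = sum(1 for x in times2 if x >= t)  # at risk in group 2
        m1 = sum(1 for x, e in zip(times1, events1) if e and x == t)
        m2 = sum(1 for x, e in zip(times2, events2) if e and x == t)
        n, m = n1 + n2, m1 + m2
        observed_minus_expected += m1 - n1 * m / n
        if n > 1:
            variance += n1 * n2 * m * (n - m) / (n * n * (n - 1))
    if variance == 0.0:
        raise ValueError("variance is zero; the statistic is undefined")
    statistic = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical survival times in months; all deaths observed.
stat, p = log_rank_test([1, 2], [True, True], [3, 4], [True, True])
print(round(stat, 4), round(p, 4))
```

With only two samples per group the large-sample approximation is not trustworthy; the tiny dataset is used here only so that each term of the sum can be checked by hand.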
Under the null hypothesis that the two KM curves are statistically equivalent, the log-rank test statistic is approximately chi-square distributed with one degree of freedom. Thus, a p-value may be obtained at a significance level $\alpha$ (say 0.05) from the chi-square distribution tables. At p-values less than the significance level $\alpha$, the null hypothesis is rejected, and hence the two curves are considered to be statistically different. The features of the microarray gene expression data with significant p-values from the log-rank test are retained as features useful in discriminating between the two classes of patients, divided based on survival times.

3.3 Unsupervised feature selection

Unsupervised feature selection does not use any a priori information regarding the classes or the distribution of samples amongst the classes in order to select features. Many unsupervised feature selection methods analyze some statistical measurement made on the samples in order to identify a small set of features that help in separating the samples into distinct groups. A feature is selected based on the strength of its ability to separate the samples into the required number of classes.

These methods may be categorized as quantitative methods and qualitative methods. The quantitative methods use statistical quantities such as the expression level or a measure of variability to reduce the feature set, while the qualitative method attempts to identify features that are relevant to the problem at hand.
Expression level threshold

While conducting a microarray gene expression analysis experiment, a minimum level of hybridization of cDNA to the DNA chip is required to reliably translate the activity of the gene to an expression level. This imposes a lower limit on the level of gene expression that may be considered useful for detection and analysis. The expression level threshold method of feature selection can be used to reduce the number of features for classification based on a minimum expression level for each feature. However, it is difficult to find a crisp cut-off value for this threshold, and it is possible to find at least a few samples that have expression levels that are marginally higher than the threshold. Thus, it may not be possible to eliminate a feature based purely on the expression level being below a threshold. A second limitation must then be imposed to successfully eliminate features. This limitation will only allow a feature to be eliminated if at least a predetermined percentage of the samples display expression levels below the threshold value. Thus, feature selection by expression level threshold is parameterized by two threshold values: t, the expression level threshold, and p, the threshold for the minimum percentage of samples below t.

This method of feature selection will aid in identifying the features with meaningful gene expression levels and would potentially aid in classifying the samples into the relevant classes.

Measures of variability

Features that tend to have similar values for samples belonging to different classes exhibit low variance across samples of both classes. Since classifiers attempt to learn
patterns that can distinguish samples of distinct classes, features with low variance rarely aid in classification. Researchers have often used the 2-fold expression change as a measure of variability. However, this approach has been questioned. In general, measures of variability are used to select a set of features that display sufficient variability between classes. Parametric or non-parametric measures of variability may be used, depending on the distribution of the expression levels of the features.

Statistical variance: Measure of variability

Statistical variance is a parametric measure of variability. This measure assumes a normal distribution of the feature values with mean $\mu$ and variance $\sigma^2$. Mathematically, the statistical variance is defined as:

$$s^2 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}$$

where $\bar{Y}$ is the mean of the data, and $N$ is the number of samples. It can be observed that the variance is roughly the arithmetic average of the squared distance from the mean. Squaring the distance from the mean has the effect of giving greater weight to values that are further from the mean. Thus, although the variance is intended to be an overall measure of spread, it can be greatly affected by the tail behavior, or the values at the extreme ends of the distribution.

Feature selection may be achieved using the statistical variance measure of variability by discarding all features that exhibit a variance lower than a pre-determined threshold across all the samples. The retained features would be better suited to aid classification than the discarded features.
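
The variance-based filter described above can be sketched as follows. The function name, the feature names and the threshold are hypothetical; the spread measure is passed in as a parameter so that a different measure of variability could be substituted without changing the filter itself.

```python
from statistics import variance


def select_by_variability(features, threshold, measure=variance):
    """Keep the features whose spread across all samples reaches `threshold`.

    features : dict mapping a feature name to its values across samples
    measure  : function computing the spread of a list of values; the
               default is the sample variance (N - 1 in the denominator)
    """
    return {
        name: values
        for name, values in features.items()
        if measure(values) >= threshold
    }


# Hypothetical log-scale expression values for two probesets.
features = {
    "probeset_flat": [7.0, 7.0, 7.0, 7.0],
    "probeset_variable": [5.0, 9.0, 5.0, 9.0],
}
print(sorted(select_by_variability(features, threshold=1.0)))
```

The class labels never appear in the computation, which is what makes this an unsupervised selection step.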
Median of absolute deviation from the median

A non-parametric measure of variability is the median of absolute deviation from the median (MAD). Mathematically, MAD is defined as:

$$MAD = \mathrm{median}(|Y_i - \tilde{Y}|)$$

where $\tilde{Y}$ is the median of the data and $|Y|$ is the absolute value of $Y$.

Since the median is at the middle of a distribution, the value is not as sensitive to the values at the extreme ends or tails of the distribution as are the mean and variance. Further, since the computation of MAD does not use the sample size, the MAD value is expected to be a stable measure of variability, especially in the case of small sample sizes.

Feature selection using MAD involves discarding all features that exhibit MAD values below a pre-determined threshold. As described earlier, such features with low variability across classes are considered to be ineffective in predicting the underlying classes, and hence can be safely eliminated from consideration.

Selection of biologically relevant genes

For several decades, researchers across the world have been studying the genetic behavior of cancer [1,2,3,10,11,12]. Attempts have been made to pinpoint gene mutations that may be strongly indicative of the cancerous nature of cells, as well as genes that are predictive of cancer progression [10,11,12]. It is reasonable to assume that these genes, when over- or under-expressed in a cancerous tissue, would exhibit expression characteristics that are significantly different from those of genes that are not associated with the presence or progression of cancer.
A careful analysis of the characteristics of these "cancer-related" genes could yield an insight into the behavior of genes that control the progression of cancer. Hence, these genes could be identified [27,28,29] and separated from the rest of the genes in the microarray dataset based on their expression patterns.

3.4 Classifiers for gene expression analysis

Some of the classification methods that have been used for gene expression analysis include Neural Networks [31,32,33], Support Vector Machines [31,34], and Decision Trees [31,35]. The following sections review the basic methodologies of each of these classifiers, which will be used as baselines for performance measurement in the analysis of the colon cancer gene expression data to predict survival.

3.4.1 Feed-forward back propagation neural network

A neural network is a massively parallel distributed processor [31,32,33] made up of simple processing units. It resembles the brain in two respects:

i. Knowledge is acquired by the network from the environment through a learning process.
ii. Inter-neuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.

A feed-forward back-propagation neural network typically consists of at least three layers. The first layer is the input layer, followed by one or more layers of hidden units or computational nodes, ending in a layer of output nodes. The learning algorithm employs a forward pass and a backward pass of signals through the different layers of the
network. The forward pass involves the application of an input vector to the sensory nodes of the network. The effect of this input vector is propagated through the layers of the network, producing a response at the output layer of the network. While the weights of the nodes are fixed during the forward pass, they are adjusted according to an error-correction rule during the backward pass. The error signal is computed by subtracting the actual response of the network from a desired or target response. The error signal is then propagated backward through the network against the direction of synaptic connections, adjusting the weights to make the actual response of the network closer to the desired response.

Figure 3.6: Architecture of feed-forward-back-propagation neural network
[Diagram of processing units arranged in an input layer (I0 to I4), a hidden layer and an output layer (O0, O1).]

3.4.2 Support vector machines

Support vector machines are algorithms which use linear models to represent nonlinear boundaries between classes [31,34]. Input feature vectors are transformed into a higher dimensional space using a non-linear mapping. Hyperplanes are defined in this high dimensional space so that data from any two class categories can always be
separated. The hyperplane that achieves the highest separation of the classes is known as the maximum margin hyperplane and generalizes the solution of the classifier.

Figure 3.7: A maximum margin hyperplane in a support vector machine [diagram labels: maximum margin hyperplane, support vectors, margin]

The maximum margin hyperplane is completely defined by specifying the vectors closest to it. These vectors are called support vectors. Since these vectors have the minimum distance to the plane, they uniquely define the hyperplane for the learning problem. Thus, the maximum margin hyperplane can be completely reconstructed given these support vectors, and all other training instances can be deleted without changing the position and orientation of the hyperplane.

Consider a simple two-class problem with two attributes or features, $a_1$ and $a_2$. A hyperplane separating the two classes may be written as:

$$x = w_0 + w_1 a_1 + w_2 a_2$$

where the three weights $w_i$ are to be learned.

This may be expressed in terms of the support vectors. Suppose we define a class variable $y$ with a value 1 if the instance belongs to class 1, else -1.
Then, the maximum margin hyperplane is defined as:

$$x = b + \sum_{i} \alpha_i \, y_i \, \mathbf{a}(i) \cdot \mathbf{a}$$

where the sum runs over the support vectors, and:
$y_i$ : the class value of the training instance $\mathbf{a}(i)$
$b$, $\alpha_i$ : the numeric parameters that have to be determined by the learning algorithm; these parameters determine the hyperplane
$\mathbf{a}$ : the vector representing the test instance
$\mathbf{a}(i)$ : the support vectors
$\mathbf{a}(i) \cdot \mathbf{a}$ : the dot product of the test instance with one of the support vectors

The support vectors are the training samples that define the optimal separating hyperplane and are the most difficult patterns and also the most informative patterns for the classification task. A constrained quadratic optimization technique is used to learn the parameters $b$ and $\alpha_i$.

3.4.3 C4.5 decision trees

Decision trees are learning algorithms that employ the "divide and conquer" strategy. Decision trees are constructed by creating nodes at various levels by testing certain attributes. The first step is to select an attribute to be placed at the root node. At every node, a comparison of an attribute value with a constant is made. When using discrete attribute data, this makes one branch for each possible value of this attribute. This process splits up the samples into subsets for each value of the attribute. The process is then repeated recursively for each branch, using only those samples that reach the branch. When a node attribute cannot split the samples into any more
subsets, a leaf node is created. Leaf nodes give a classification that applies to all samples that reach the leaf. An unknown sample is classified by routing it down the tree according to the value of the attributes tested in successive nodes. When a leaf is reached, the instance is classified according to the class assigned to the leaf.

Figure 3.8: Structure of a decision tree [diagram labels: root node (test attribute), node (test attribute), leaf node, branches (set of possible answers)]

The structure of a decision tree is governed by the rules used to select the attribute to split on at each node or branch. Given a set of attributes to choose from, the best choice for splitting the data is the attribute that produces subsets of samples that are most distinct from each other. This choice is made by measuring the purity of the daughter nodes at each split. The best decision is made when the purest daughter nodes are created.

The C4.5 algorithm:

C4.5 is a variant of the basic decision learning approach that uses the concept of information gain as a measure of purity at each node. The information gain can be
described as the effective decrease in entropy resulting from making a choice as to which attribute to use and at what level.

$$\mathrm{entropy}(p_i) = -p_i \log(p_i)$$

where $p_i$ = (number of samples at node $i$) / (total number of samples at the parent node).

For each attribute that is tested as a potential splitting attribute, the entropy of the subsets created by the split is measured and compared to the entropy of the system prior to the split. The attribute that yields the maximum information gain by splitting the dataset is chosen as the best split or test attribute. By considering the best attributes for discriminating among cases at a particular node, the tree can be built up of decisions that allow navigation from the root of the tree to a leaf node by continually using attributes to determine the path to take. The decision tree can be simplified using pruning techniques to reduce the size of the tree according to a user-defined level. Pruning yields decision trees that are more generalized.

3.5 Evaluation of classifiers

The task of machine learning is to "learn" or acquire knowledge about input data. This can be achieved by looking for and describing patterns in the input data. This acquired knowledge can then be used to predict patterns in unseen samples. The quality of knowledge gained in this process is determined by the samples used to train the system. The samples have to be representative of all characteristics that may be encountered to ensure that predictions on unknown samples are accurate. It should also be ensured that the machine learning system infers the correct patterns in the data. Methods to evaluate and predict performance on seen and unseen data help in ensuring this.
While measuring the performance of a learning algorithm, a measure of the success rate, or alternately, the error rate is used. This is measured by comparing the results of classification on each of the training samples to the actual class to which the sample belongs. Thus, the success rate will indicate how well the algorithm has learnt the characteristics of the training samples. However, this gives no indication as to how the algorithm will behave when asked to predict the class of a new and unknown sample. The error in prediction may also be computed by testing the classification of test samples. If these test samples are taken from the same pool of data that was used for training, the measured success rate will be highly optimistic [31], and will not realistically indicate future performance. It is therefore necessary to set aside a set of samples that will not be used for training, but used for testing purposes only.

Generally in DNA analysis problems, the number of samples available for inferring gene activity and mutation is very small in comparison to the number of features available [3,8]. Separating a set of samples for testing will further reduce the number of samples available for training the learning algorithm. While it is beneficial to have a good size test set to rigorously test the prediction accuracy of the classifier, it is equally important to ensure that the samples used for training are representative of the population.

A set of samples is generally held out as a completely independent test set while the rest are used for training. A smaller set from these training samples may be held out as a test set while training the classifier. Training and test performances are measured on this set to tune the learning algorithm. The independent test set is then used to validate the performance of the algorithm. Several methods have been used to address this issue.
A simple validation method is a hold-out procedure that involves dividing the dataset into a fixed number of partitions. All but one partition is used to train the classifier and the left-out partition is used for testing. The training-and-testing procedure is repeated a sufficient number of times (called folds) so that each partition is used as a test set exactly once. This method is known as Cross-Validation, and a variable number of partitions may be used depending on the number of samples available. Ideally, the samples in each partition should represent a proportional selection of samples from all the classes under consideration to ensure that the classifier learns all the classes equally well, and is not over-trained on any one class.

Leave-One-Out-Cross-Validation (LOOCV) leaves out a single sample for testing, while training on the rest. This method is useful when a very small number of samples are available, since it increases the number of train-test procedures that can be performed. However, this method does not ensure that the classifier learns all the classes well. Since only one sample is used to test the prediction accuracy in each fold, the classifier may predict the majority class for each sample, and still achieve high prediction accuracy. Further, this method of cross-validation may be computationally expensive.

A reasonable method of cross-validation is stratified n-fold cross-validation. The sample set is divided into n partitions such that each partition is stratified in proportion to the number of samples in each class. This ensures that the classifier is trained proportionally well to learn all the classes. Further, each test set will require the classifier to predict all classes, yielding a more realistic measure of the classifier performance. Although 'n' can take any value, n=10 has been experimentally shown in literature to achieve a reasonable estimate of error.
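The stratified partitioning described above can be sketched as follows. This is a minimal illustration only, not the procedure used in the thesis experiments: the function names `stratified_folds` and `train_test_splits`, the round-robin dealing of samples, and the fixed random seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds partitions, stratified by class.

    Hypothetical helper: samples of each class are shuffled and then dealt
    round-robin across the folds, so every fold receives a share of each
    class that is proportional to the class size.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    offset = 0  # continue dealing where the previous class stopped
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for k, index in enumerate(indices):
            folds[(offset + k) % n_folds].append(index)
        offset += len(indices)
    return folds


def train_test_splits(labels, n_folds=10, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = stratified_folds(labels, n_folds, seed)
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test)
```

With 40 "short" and 60 "long" samples and n=10, each test fold in this sketch holds 4 "short" and 6 "long" samples, so every fold asks the classifier to predict both classes.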
Figure 3.9: 10-fold cross-validation scheme [diagram labels: all training samples; for each of folds 1 to 10, training on 9/10ths of the samples and testing on 1/10th of the samples, with classifiers 1 to 10 producing the predictions of folds 1 to 10; the classifications of the 10 folds are combined as the output]

3.6 Accuracy of classification

In the context of predicting the survival time for patients with colon cancer, the performance of a classifier can be evaluated using a confusion matrix, as shown in Table 3.1.

Table 3.1: Confusion matrix

True condition                 | Classified as short term survival (positive) | Classified as long term survival (negative)
Short term survival (positive) | True Positive (a)                            | False Negative (b)
Long term survival (negative)  | False Positive (c)                           | True Negative (d)

The most common measure of performance is the accuracy of classification, defined as:

$$\text{Accuracy} = \frac{a + d}{a + b + c + d}$$
Accuracy is a good measure to use if samples are distributed equally amongst both the classes. However, in cases where an unequal distribution of the samples may be expected, a weighted accuracy computation will yield a better estimate of how well the classifier performed in both the classes.

$$\text{Weighted Accuracy} = \frac{1}{2}\left(\frac{a}{a + b} + \frac{d}{c + d}\right)$$

While dealing with clinical information however, measures of sensitivity and specificity are used to gauge the performance in each class separately.

$$\text{Sensitivity} = \frac{a}{a + b} \qquad \text{Specificity} = \frac{d}{c + d}$$

Here, it can be observed that sensitivity is the probability that the classifier will predict short term survival, given that the patient survived less than 36 months. Specificity is the probability that the classifier will predict long term survival, given that the patient survived longer than 36 months.

Sensitivity is also the true positive rate and specificity is the true negative rate. Weighted accuracy reports the average of these rates, and hence may be used as a convenient measure to evaluate the performance of the classifier.
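The measures of this section can be computed directly from the four cells of the confusion matrix in Table 3.1. The sketch below assumes the usual convention in which sensitivity is the true positive rate a/(a+b) and specificity is the true negative rate d/(c+d); the function name `survival_metrics` and the counts in the usage note are hypothetical and are not results from the thesis.

```python
def survival_metrics(a, b, c, d):
    """Compute accuracy measures from confusion matrix counts.

    a: true positives  (short term survival, classified short term)
    b: false negatives (short term survival, classified long term)
    c: false positives (long term survival, classified short term)
    d: true negatives  (long term survival, classified long term)
    """
    sensitivity = a / (a + b)  # true positive rate
    specificity = d / (c + d)  # true negative rate
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Average of the two per-class rates
        "weighted_accuracy": (sensitivity + specificity) / 2,
    }
```

For illustration, a classifier that labels every one of 20 short term and 80 long term survivors as long term (a=0, b=20, c=0, d=80) reaches an accuracy of 0.8 but a weighted accuracy of only 0.5, which is why the weighted measure is the more informative one when the classes are unbalanced.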
CHAPTER 4
RANDOM SUBSPACE ENSEMBLES FOR GENE EXPRESSION ANALYSIS

4.1 Introduction

Traditionally, classifiers have been used to uncover patterns from input features that can explain the observed characteristics of the samples, as well as make predictions on unseen samples [13-19]. The training stage uses a set of samples drawn from a larger population. If the total number of observed features that describe the population is very large, feature selection methods can be used in an attempt to pick a small set of features that adequately describe the patterns of differences between the classes in the population. Since the classifier learns these patterns from a limited set of features describing limited samples, there is a risk associated with over-training the classifier, or over-fitting the patterns to the samples at hand. If the samples chosen for training do not adequately represent the population, the patterns learned would be specific in identifying these samples, and hence may not be general enough to identify or predict classes of unseen samples.

The patterns learned could also be highly dependent on the features used to train the classifier. If a different feature set was used for training, a different set of patterns could be learned. With different configurations of learning parameters, different classifiers would be created. Some of these classifiers could be successful in accurately predicting classes of unknown samples, while others could have varying degrees of weaknesses depending on the feature set used to train the classifier. Further, use of
different sets of features may help in identifying different types of patterns, all of which may be important in completely describing the population. Random subspace ensembles may be used to take advantage of this variation in performance due to different selection parameters in order to create a classification scheme that performs better than any single classifier.

4.2 Random subspace ensembles

The goal of creating ensembles of classifiers is to combine a collection of weak classifiers into a single strong classifier. One way to create ensembles of classifiers is to divide the entire space of features into subspaces. Each subspace is formed by randomly picking features from the entire space, allowing for features to be repeated across subspaces. If enough such random subspaces are formed, the subspaces taken together may represent all the important features in the feature space.

Figure 4.1: Creation of random subspace ensembles [diagram labels: features are divided into random subspaces 1 to n; one classifier is built per subspace, giving accuracies 1 to n; a smart combination of the classifiers gives better accuracy]
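The formation of random subspaces can be sketched as below. This is an illustrative sketch under stated assumptions: the function names, the use of feature indices in place of probeset identifiers, and the fixed seed are hypothetical. Features are drawn without repetition inside one subspace, but the same feature may appear in several subspaces.

```python
import random


def draw_random_subspaces(n_features, r, c, seed=0):
    """Draw c random subspaces, each holding r distinct feature indices.

    Sampling is independent between subspaces, so a feature may be
    repeated across subspaces.
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), r)) for _ in range(c)]


def coverage(subspaces, n_features):
    """Fraction of the feature space that appears in at least one subspace."""
    covered = set().union(*subspaces)
    return len(covered) / n_features
```

Under uniform sampling, a given feature is missed by one subspace with probability 1 - r/a, so the expected fraction of a features covered by c subspaces is 1 - (1 - r/a)^c. For example, with a=5000, r=200 and c=100 this is roughly 98%, which gives a rough sense of how many subspaces are needed before most features have been seen at least once.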
One classifier is trained on each random subspace of features, using all the training samples. Thus, each classifier is built on one random projection of the feature space. A large number of such classifiers are created. If each classifier were tuned to learn a few characteristics of the population, then a judicious ensemble of these classifiers would be better at identifying samples from the entire population than any one classifier. Depending on the characteristics learned by each classifier, different kinds of ensembles could be created. Voting techniques used on ensembles typically assume that all the random subspaces created are useful in some way in describing the classes. Alternately, if some of the random subspaces are found to be ineffective in describing the classes, while some others are very effective, then an ensemble could be created by using only the effective subspaces, while discarding all other subspaces.

4.3 Voting techniques to create random subspace ensembles

A general approach to the combination of random subspaces in an ensemble is the use of the majority voting technique. Here, all the classifiers created are retained for use. Since each classifier is built from a random subset of the feature space, a single classifier may learn only a small section of the characteristics of the population. If the entire feature space is assumed to be important in describing the population, then each classifier created plays a role in describing the population. When a new and unknown sample from the population has to be analyzed, each classifier is considered to be equally capable in classifying the sample.
The majority voting technique uses each classifier to individually predict the class of a new sample. Then, a simple majority vote amongst the predictions is used to decide the final classification of the sample.

Figure 4.2: Random subspace ensemble classifier using the majority voting technique; here the number 'c' of trees built is varied from 1 to 2000 [diagram labels: t-test, select top 'a' (e.g., a=5000) features based on p-values; for each of trees 1 to c: randomly select 'r' (r=200) features, build the tree, record training accuracy, classify test samples; majority vote on predictions of test samples]

An alternate voting technique is to use weighted majority voting instead of a simple majority. In this technique, the classifiers are not considered to learn equally well. While all classifiers are assumed to play a role in describing and distinguishing the patterns of the classes, some classifiers are deemed to be better at classifying samples than others. These "better" classifiers are given a higher weight in the voting, while the "poorer" classifiers are given a lesser weight. As in the simple majority voting case, all the random subspace classifiers are used to individually predict the class of a new and unseen sample. The quality of the classifier in predicting the class is typically defined by the field of application. The individual predictions are weighted by the quality of the classifier before computing the majority class prediction. The majority class from the weighted vote is used as the prediction for that sample.
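Both voting rules can be sketched in a few lines. The code below is a minimal illustration: the function names are hypothetical, the weights stand in for whatever measure of classifier quality the application defines, and the tie-breaking rule (first label in sorted order) is an arbitrary assumption that the text does not specify.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the class with the largest total weight.

    predictions: one predicted class label per classifier
    weights:     one non-negative quality weight per classifier
    Ties are broken in favour of the label that sorts first (an assumption).
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(sorted(totals), key=lambda label: totals[label])


def majority_vote(predictions):
    """Simple majority vote: every classifier carries the same weight."""
    return weighted_vote(predictions, [1.0] * len(predictions))
```

As a usage example, three trees predicting "short", "long", "long" give "long" under a simple majority, but if the first tree carries weight 0.9 and the other two carry 0.3 each, the weighted vote returns "short".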
4.4 Selection of good subspaces

The voting techniques to create ensembles of classifiers, described in the previous section, work well when most of the features are useful in describing the characteristics of the population. In typical gene expression analysis problems, the number of features used for classification is very large. Although feature selection methods help in reducing the number of features for classification, these methods cannot ensure that every feature considered for classification is indeed important for prediction. It is reasonable to assume that in gene expression analysis problems, a subset of random subspaces may be created that are completely ineffective for classification. Including these ineffective subspaces in the ensemble may bias the classification in an undesirable manner. Hence, one approach to creating a good ensemble of classifiers is to discard these subspaces from consideration altogether. Alternatively, a small set of effective classifiers may be retained for creating the ensemble classifier.

Consider a set of random subspaces of 'r' features selected from a set of 'a' features. If the number 'c' of random subspaces created is large enough to cover the feature space sufficiently, allowing for features to be repeated across subspaces, then at least a few subspaces are likely to include a majority of the "good" features. An effective classifier can be created by picking only these random subspaces with the "good" features.

A simple method to identify a good random subspace is to estimate the accuracy of the subspace in predicting the classes of a set of samples. Each random subspace is trained on a set of samples, such that the classifier learns patterns from this set in order to make class predictions on unseen samples. Hence, the quality of a random subspace can
be assessed by determining the prediction accuracy of the subspace classifier on test samples. In addition to this measure of quality, a random subspace may be assessed on the accuracy of learning the training samples. Although optimistically biased, training accuracy of a classifier reflects the ability of the classifier to learn the patterns of classes from the given training data. Classifiers built on subspaces that have a large number of good features should be able to learn predictive patterns better than classifiers built on poorer features. Hence, selecting a random subspace that has a combination of good classifier testing and training accuracies should ensure an overall more accurate classifier ensemble.

In order to estimate the accuracy of the selected feature subspace, the gene expression dataset is split into three separate subsets of samples. These are the independent test set (10% of the total samples), the training set (81% of the samples), and the validation set (9% of the samples). The random subspace classifiers are built on the training set (81%) and each classifier is tested on the samples in the validation set (9%). The performance of a classifier on the validation set, along with its training accuracy, is used to determine the predictive quality of the classifier. A classifier is chosen as the best random subspace classifier based on the condition of the highest validation set accuracy, and in case of ties, a secondary condition of the best training accuracy is used. The features used by this classifier are selected as good features for the task.

To ensure that the selection process is relatively independent of the samples used for training, the gene expression dataset is split into the training and validation subsets in many different ways, so as to create different combinations of samples for training and validation. Consider 10 different ways of creating these subsets. The procedure of
selecting features, described above, is repeated for each of these 10 sets of data. Thus, for each of the 10 sets of training and validation samples, a set of features is selected that can best describe 81% of the samples, and is tested by predicting the classes for 9% of the samples. The union of these 10 sets of selected features describes 90% of the samples (81% training samples + 9% validation samples). The ability of this union of features to predict the class of unknown samples is tested with the help of the samples held out as the independent test set (10% of the samples). A classifier is built using these features by training on the 90% of the samples (81% training and 9% validation). This classifier is used to predict the class of each independent test sample. The weighted accuracy (see Section 3.6) of these predictions would indicate the expected performance of the classifier in predicting the class of new samples.

To further ensure that the prediction accuracy is not particularly tuned to the combination of training, validation and independent test samples, the gene expression dataset is split into these three subsets in several different ways. Consider 10 different ways of creating these subsets. Each of the 10 ways yields a weighted accuracy of prediction on the 10% of the independent samples for that set. The individual sample predictions on all 10 independent test sets are used to create the confusion matrix for the classifier scheme (see Section 3.6), and the weighted accuracy of the classifier scheme is computed from this matrix as a measure of performance.

In an experimental setup (Figure 4.3), this described procedure may be achieved by using a 10-fold cross-validation scheme. To illustrate the use of the described procedure on a gene expression dataset, consider a hypothetical gene expression dataset with 1000 samples and 50,000 features (probesets). Each fold of the 10-fold cross-validation creates an independent test set with a distinct 100 samples (10%), and a training set with 900 samples (90%). An additional 10-fold cross-validation performed on each 900-sample training set provides 810 samples (81% of the overall samples) for training, and 90 samples (9% of the overall samples) for validation. Therefore, for each of the 10 sets of the data (the 10 folds), 100 samples are held out as an independent test set, and 10 internal sets, each with different combinations of 810 training and 90 validation samples, are created from the 900 training samples.

Figure 4.3: Classification scheme for selecting good subspaces [diagram labels: entire feature set, all training samples; for each of 10 folds: all features, training on 9/10ths of the samples and testing on 1/10th of the samples, selection of good features (Fig 4.4), classifiers 1 to 10, predict classes of test samples; all individual test predictions are used to create the confusion matrix for the classifier and to compute its weighted accuracy]

For each training set of 810 samples, a preliminary feature selection using a t-test is performed, and the best 5000 features, ranked according to significant t-test p-values (see Section 3.2), are retained for use. Random subspace classifiers are created on these 810 training samples using the selected 5000 features. Consider creating 100 random subspaces that have each been created by randomly picking 200 features from the 5000
input features. For each random subspace of 200 features, a single decision tree is built by training on all of the 810 samples. The decision tree selects, from this random subspace of 200 features, the best features to distinguish between the classes. The prediction accuracy of each decision tree (random subspace classifier) is tested on the corresponding 90 validation samples.

Each of the random subspace classifiers is trained on the same set of 810 samples and tested on the same set of 90 samples, and the classifier with the highest validation set accuracy is selected, with training accuracy used to break ties. The features used by the selected classifier are identified.

This procedure of selecting good features is repeated on each of the 10 sets of 810 training and 90 validation samples for a given fold of the data, yielding 10 possibly distinct sets of good features. The union of these features is then used to train a single classifier on the 900 samples (90%; the 81% training and 9% validation samples combined), and tested on the 100 held-out independent test samples (10%).

This process is repeated for all 10 folds. The individual predictions on all the independent test samples are used to create the confusion matrix (see Section 3.6) for the system. The weighted accuracy computed from this matrix estimates the expected performance of the classifier scheme in predicting the classes of new samples.
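The selection rule and the union step can be sketched as follows. This is an illustrative sketch only: the record layout (a dictionary with `features`, `val_acc` and `train_acc` keys) and the function names are assumptions made for the example, and the decision trees themselves are taken as already built and evaluated.

```python
def select_best_classifier(candidates):
    """Pick the candidate with the highest validation accuracy.

    candidates: list of dicts with hypothetical keys
        'features'  - features actually used by the tree
        'val_acc'   - accuracy on the validation samples
        'train_acc' - accuracy on the training samples
    Training accuracy is only consulted when validation accuracies tie.
    """
    return max(candidates, key=lambda c: (c["val_acc"], c["train_acc"]))


def union_of_selected_features(candidates_per_split):
    """Union of the features of the best tree from each internal split."""
    selected = set()
    for candidates in candidates_per_split:
        selected.update(select_best_classifier(candidates)["features"])
    return sorted(selected)
```

In the hypothetical 1000-sample example, `candidates_per_split` would hold 10 lists (one per internal training and validation split of a fold), each containing the 100 evaluated trees, and the returned union would be the feature set used to train the final classifier on the 900 samples of that fold.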
Figure 4.4: Scheme to select good features for classification, with typical values of the random subspace parameters (a, r, c) for a given 10-fold cross-validation specified in parentheses [diagram labels: input feature set and input training samples (90% of the entire sample set); for each of folds 1 to 10: all features, with 81% of the overall samples used for training and 9% of the overall samples used for validation; t-test, select top 'a' (e.g., a=5000) features based on p-values; for each of trees 1 to c: randomly select 'r' (r=200) features, build the tree, record training accuracy, classify validation samples; select the features used by the tree with the best validation and training accuracy as the output features of the fold; the final output is the union of the features selected by each fold]
CHAPTER 5
RESULTS

5.1 Introduction

The colon cancer gene expression dataset, described in Chapter 2, was used to create classifiers to predict survival prognosis for patients. First, supervised and unsupervised feature selection methods were explored to choose the best method for predicting survival. A series of baseline classifier experiments were conducted using the basic experimental scheme described in Section 3.1. Random subspace ensembles were created using the majority voting technique as well as the proposed technique of selecting "good" classifiers (Chapter 4). The performance of these random subspace ensembles was compared to the baseline classifier performance. Finally, the results were further tested and verified in a series of additional experiments.

5.2 Supervised feature selection

T-test

A t-test was used on each feature in the training dataset at a significance level of α = 0.05. The null hypothesis for the test was that the mean expression level for the two prognosis groups was equal. A feature was considered to be significant in predicting survival for a colon cancer patient if the p-value for the feature, as determined by the t-test, was less than 0.05.
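A per-feature significance filter of this kind can be sketched with the standard library alone. Two assumptions are made loudly here because the text does not settle them: the sketch uses the unequal-variance (Welch) form of the t statistic, and it approximates the two-sided p-value with the normal distribution rather than the t distribution, which is only reasonable for larger group sizes. A statistics library would normally supply exact t-test p-values. The function names and the layout of `expression` (feature name mapped to one value per sample) are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Two-sample t statistic with unequal variances (an assumed variant)."""
    standard_error = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    difference = mean(x) - mean(y)
    if standard_error == 0.0:
        # Both groups are constant: no evidence unless the means differ
        return 0.0 if difference == 0.0 else math.copysign(math.inf, difference)
    return difference / standard_error


def approx_two_sided_p(t):
    """Two-sided p-value using a normal approximation (not an exact t-test)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def select_significant_features(expression, labels, alpha=0.05):
    """Return feature names with approximate p < alpha, most significant first."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two prognosis groups")
    p_values = {}
    for name, values in expression.items():
        group0 = [v for v, lab in zip(values, labels) if lab == classes[0]]
        group1 = [v for v, lab in zip(values, labels) if lab == classes[1]]
        p_values[name] = approx_two_sided_p(welch_t(group0, group1))
    kept = [name for name, p in p_values.items() if p < alpha]
    return sorted(kept, key=p_values.get)
```

Sorting the retained features by p-value also supports the alternative use described in Chapter 4, where the top 'a' features ranked by p-value are kept rather than all features below the 0.05 level.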
Figure 5.1: Number of features with a specified t-test p-value [plot: number of features retained (0 to 60000) against t-test p-value; annotation: 5901 features significant at level 0.05]

Figure 5.1 shows the significance of the features in distinguishing between the two classes for all the samples, at the 0.05 level. All features with p-values less than 0.05 were considered to be significant for prediction, and features with lower p-values were considered to be stronger predictors than features with higher p-values. There were 5901 features found to be predictive features for classification. Since the t-test could aid in selecting a small number of features that were highly significant for prediction, the test was found to be a good feature selection technique for predicting survival for colon cancer patients. This test aided in reducing the number of features for classification, while ensuring that the retained features were indeed strong predictors of survival.

Survival Analysis:

The dataset was analyzed with respect to the two classes using the survival analysis techniques described in Section 3.2. Two Kaplan-Meier survival curves were
plotted for each feature, one curve for each of the two classes. A feature was considered effective in predicting survival if the survival curves for each of the two classes were statistically different. A log-rank test was used at the significance level of 0.05 to test if the survival curves were significantly different. All features with log-rank p-values less than 0.05 were considered to be significant in predicting survival times for the patient.

Figure 5.2: Number of features with a specified log-rank test p-value for comparing Kaplan-Meier curves of the two survival classes [plot: number of features retained (0 to 60000) against log-rank test p-value; annotation: 4676 features significant at level 0.05]

It can be observed from Figure 5.2 that only 4676 features in the experiment demonstrate the ability to predict survival. Hence, when censored samples are expected to be included in the experiment, survival analysis could be a reliable feature selection tool.
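For reference, a generic two-group log-rank test can be sketched with the standard library. This is a textbook form of the test and is not necessarily the exact procedure of Section 3.2: how the two groups are formed for each feature is described there and is not reproduced here, and the function name and argument layout are hypothetical. The p-value uses the fact that, for one degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2)).

```python
import math


def logrank_p_value(times_a, events_a, times_b, events_b):
    """Two-group log-rank test p-value.

    times_*:  survival or follow-up time of each patient in the group
    events_*: 1 if the death was observed, 0 if the sample is censored
    """
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_a = expected_a = var = 0.0
    for t in event_times:
        at_risk_a = sum(1 for x in times_a if x >= t)
        at_risk_b = sum(1 for x in times_b if x >= t)
        deaths_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        deaths_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n = at_risk_a + at_risk_b
        d = deaths_a + deaths_b
        observed_a += deaths_a
        expected_a += d * at_risk_a / n
        if n > 1:
            # Hypergeometric variance of the deaths in group a at time t
            var += d * (at_risk_a / n) * (at_risk_b / n) * (n - d) / (n - 1)
    if var == 0.0:
        return 1.0  # no information to separate the curves
    chi2 = (observed_a - expected_a) ** 2 / var
    return math.erfc(math.sqrt(chi2 / 2.0))
```

Censored samples contribute to the at-risk counts up to their follow-up time but never count as deaths, which is how the test accommodates the censored samples mentioned above.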
5.3 Unsupervised feature selection

Quantitative Methods: Expression Level Threshold

Low expression levels recorded during microarray analysis may be attributed to noise in measurement or other undesirable effects. Feature selection by expression level threshold was used to eliminate features that seemed to arise from sources other than expression of genes. The experiment was parameterized by the threshold value t (3.5 <= t <= 14.5) for the expression level and the threshold p (85% <= p <= 100%) for the minimum percentage of samples below t. The goal of this experiment was to identify an operating pair (t, p) such that a maximum number of non-informative features were discarded.

Figure 5.3: Graph of the number of features retained as the two threshold values of expression level and minimum percentage value are varied [surface plot titled "Number of features retained using expression level threshold method"; axes: expression level threshold, % threshold, number of features retained (0 to 60000, in bands of 10000)]

Figure 5.3 shows the number of features retained at each point (t, p). It can be observed that the number of discarded features remains fairly constant as 'p' is varied from 85% to 100% for most values of 't'. Also, the number of discarded features drops very slowly for lower values of 't', making it difficult to clearly identify a threshold that
could distinguish between informative features and noise. Given the difficulty in selecting appropriate threshold values, as well as the insignificant drop in the number of features at low threshold values, the method was not considered a suitable feature selection method for gene expression analysis.

Quantitative Methods: Measures of Variability

The statistical variance method of feature selection was used to eliminate features that did not have high enough variability to be useful for classification. All features with variances below a cut-off threshold t (0.05 <= t <= 8.0) were considered for elimination.

Figure 5.4: Graph of the number of features retained as the threshold for variance is varied (x-axis: variance of the feature across all samples; y-axis: number of features retained)

Figure 5.4 shows that a large number of features may be dropped with values of t < 0.5. The selection of a threshold for variance could be made either by choosing the desired number of features for classification, or simply by the value of the variance. In either case, care has to be taken to ensure that truly predictive features are not dropped
from consideration. The p-values of a t-test at significance level of 0.05 were used to determine if any predictive features were eliminated. The filter based purely on variance does not take into account the effects of central tendency, such as the mean value, as the t-test does. Hence, at each threshold value of variance below 0.5, at least 25% of the eliminated features were found to be predictive, thereby rendering this feature selection method ineffective for the purpose of classification.

Feature selection with MAD was used to eliminate features with low variability. All features with MAD values less than a threshold t (0.05 <= t <= 3.5) were considered to be ineffective for classification and therefore removed.

Figure 5.5: Number of features retained as the threshold for MAD values is varied (x-axis: MAD values of the features across all samples; y-axis: number of features retained)

Figure 5.5 shows that a large number of features may be dropped with values of 0.1 < t < 0.5. Here, the threshold could be chosen by specifying the desired number of features for classification, or by choosing an optimal value of variability below which the classifier would not be able to distinguish between classes. As described in the section on the experiments with statistical variance, a t-test at significance level of 0.05 was used to
determine if predictive features were eliminated due to the MAD threshold value. At each threshold value t < 0.5, at least 25% of the eliminated features were found to be predictive. Thus, feature selection with MAD was not found to be useful for gene expression analysis.

Qualitative Methods: Selection of biologically relevant genes

A list of all genes associated with cancer was carefully compiled to study the characteristics of expression levels in known cancer genes [27,28,29]. A total of 5687 cancer-related genes, described by 9149 probesets, were used for this experiment (refer to Section 1.5.1).

In order to determine if the cancer-related genes had any distinctive expression patterns in the colon cancer dataset, a smoothed histogram of the mean expression levels of these probesets was compared to the smoothed histogram of the mean expression levels of all the probesets in the dataset. To make a fair comparison of the curves, each histogram was normalized for the number of probesets used. If the cancer-related genes had distinctive characteristics, the two histograms would differ in spread and central tendency.

However, as shown in Figure 5.6, both histograms have very similar characteristics. It can be inferred from the graphs that the set of cancer-related genes used for this analysis does not display characteristics that are significantly different from those of genes that are not associated with cancer progression. Thus, a study of the gene expression patterns of cancer-related genes would not aid in identifying the most predictive features for classification.
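The two variability filters discussed earlier in this section (statistical variance and MAD) can be sketched as below. The sketch assumes that MAD denotes the median absolute deviation from the median and that each feature is held as a list of expression values across samples; the function names and the probeset labels in the usage lines are hypothetical.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median (assumed meaning of MAD)."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

def variance_filter(features, threshold):
    """Keep features whose sample variance is at least `threshold`."""
    return {name: vals for name, vals in features.items()
            if statistics.variance(vals) >= threshold}

def mad_filter(features, threshold):
    """Keep features whose MAD is at least `threshold`."""
    return {name: vals for name, vals in features.items()
            if mad(vals) >= threshold}

# Hypothetical probesets: one nearly constant, one highly variable.
features = {
    "probeset_flat": [5.0, 5.1, 4.9, 5.0],
    "probeset_variable": [2.0, 8.0, 3.0, 9.0],
}
kept_by_variance = list(variance_filter(features, 0.5))
kept_by_mad = list(mad_filter(features, 0.5))
```

Neither filter looks at the class labels, which is why a separate t-test was needed to check whether predictive features were being discarded.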
Figure 5.6: Histogram of the mean level of gene expressions across all samples: (a) all genes, (b) cancer-related genes (x-axis: mean expression level; y-axis: % features)

5.4 Baseline experiments with colon cancer gene expression data

The gene expression data for prediction of survival for colon cancer patients was used to conduct three main experiments with three different classifiers: Neural Networks, Support Vector Machines and C4.5 Decision Trees. Each experiment was set up as a 10-fold cross-validation, with the t-test as the feature selection method. The top a features (100 <= a <= 10000) from the entire dataset were selected within each fold of the cross-validation to avoid pre-selection bias. Since the distribution of the samples across the two classes was not balanced, the weighted accuracy (see Section 3.6) of each experiment was computed as a measure of success.
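Selecting features inside each fold, rather than once on the full dataset, is what prevents pre-selection bias. The sketch below illustrates the idea under stated assumptions: it ranks features by the absolute pooled-variance t statistic computed on the training samples only (for a fixed sample split this gives the same order as ranking by t-test p-value, because every feature shares the same degrees of freedom), and the function names and fold construction are hypothetical rather than taken from the thesis.

```python
import math
import random

def t_statistic(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    pooled = ss / (nx + ny - 2)
    if pooled == 0.0:
        return 0.0 if mx == my else math.inf
    return (mx - my) / math.sqrt(pooled * (1.0 / nx + 1.0 / ny))

def top_features(rows, labels, a):
    """Indices of the `a` features with the largest |t| between classes 0 and 1."""
    scores = []
    for j in range(len(rows[0])):
        g0 = [r[j] for r, lab in zip(rows, labels) if lab == 0]
        g1 = [r[j] for r, lab in zip(rows, labels) if lab == 1]
        scores.append((abs(t_statistic(g0, g1)), j))
    scores.sort(key=lambda s: -s[0])
    return [j for _, j in scores[:a]]

def select_within_folds(rows, labels, a, k=10, seed=0):
    """For each fold, rank features using the training samples only."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    selections = []
    for i in range(k):
        held_out = set(indices[i::k])
        train_rows = [r for n, r in enumerate(rows) if n not in held_out]
        train_labels = [lab for n, lab in enumerate(labels) if n not in held_out]
        selections.append((sorted(held_out), top_features(train_rows, train_labels, a)))
    return selections

# Toy data: feature 1 separates the classes, features 0 and 2 do not.
labels = [0] * 10 + [1] * 10
rows = [[i % 4, 10 * labels[i] + (i % 3), (i * 7) % 5] for i in range(20)]
per_fold = select_within_folds(rows, labels, a=1, k=5)
```

The held-out samples never influence the ranking, so the test accuracy of a classifier trained on the selected features is not inflated by the selection step.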
Figure 5.7: Basic classifier block for the baseline gene analysis experiment; the parameter a was varied (100 <= a <= 10000). Blocks: t-test, select top 'a' features based on p-values; build classifier; record training accuracy; classify test samples.

Since the parameter a could take on several different values, the accuracies of the classifier for each value of a (100 <= a <= 10000) were explored to determine the optimal configuration of the classifier scheme.

The Neural Network used for this experiment was Quickprop, a fast implementation of the feed-forward back-propagation network described in Section 3.4.1. The network was designed with 10 hidden units and two output nodes. The training of the classifier was designed to halt either when the net error dropped to zero or after 500 epochs. The Support Vector Machine experiments used the implementation in WEKA [31]. A linear kernel was used with standard normalization. The USF implementation of C4.5 decision trees was used to test the accuracies of single decision trees at the various parameter settings.
Figure 5.8: Performance of baseline classification schemes (x-axis: number of features used for training, 100 to 10000; y-axis: weighted accuracy (%); series: Neural Networks, Support Vector Machines, C4.5 Decision Trees)

Figure 5.8 shows that none of the baseline classifiers was able to achieve a weighted accuracy higher than 58.47%. For all three classifiers, the highest accuracies were achieved when 3000-4000 features were used. This suggests that the best features for prediction are in the top 4000 features ranked by t-test p-value. The observation is supported by the results of feature selection with t-tests, which indicate that the top 4000 features are highly significant in prediction. As fewer features are used, the accuracies drop, possibly because the features are inadequate to represent the sample characteristics. As more features are used, the useful features start being overwhelmed by the non-predictive features, resulting in inaccurate classifiers.
5.5 Majority voting to create ensembles

Random subspace ensembles were created using the majority voting technique described in Section 4.3. The basic classifier block shown in Figure 5.9 was used in the 10-fold cross-validation scheme (refer to Figure 3.9) to create a single ensemble classifier from a set of classifiers built on random subspaces within each fold. A t-test was used on the training samples within each fold to choose the best 'a' features for classification (100 <= a <= 1000). Random subspaces were created by picking features randomly from this set of selected features. Individual decision trees were built on each random subspace and used to predict the class of each test sample. The final predicted class of each test sample was decided by the majority prediction of all the trees within the fold. The confusion matrix for the final classifier was created by using the predictions of all the test samples. The weighted accuracy computed from this matrix was used as a measure of performance of the classifier.

Figure 5.9: Basic classifier block to create random subspace ensembles using majority voting technique (used within the 10-fold cross-validation scheme, Figure 3.9); see Table 5.1 for experimental values of parameters (a, r, c). Blocks: t-test, select top 'a' features based on p-values; for each of trees 1 to c: randomly select 'r' features, build tree, record training accuracy, classify test samples; majority vote on predictions of test samples.
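The majority voting scheme of Figure 5.9 can be sketched as follows. To keep the example short and self-contained, a nearest-centroid classifier stands in for the C4.5 decision trees used in the thesis; this substitution, the function names, and the toy data are assumptions made purely for illustration. The parameters `pool`, `r` and `c` play the roles of the top-a feature set, the subspace size and the number of subspaces.

```python
import random
from collections import Counter

def train_centroids(rows, labels, features):
    """Stand-in base learner: per-class mean over the subspace features."""
    centroids = {}
    for cls in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == cls]
        centroids[cls] = [sum(m[j] for m in members) / len(members) for j in features]
    return centroids

def predict_centroid(centroids, features, row):
    """Predict the class whose centroid is nearest in the subspace."""
    def squared_distance(center):
        return sum((row[j] - cj) ** 2 for j, cj in zip(features, center))
    return min(centroids, key=lambda cls: squared_distance(centroids[cls]))

def build_ensemble(rows, labels, pool, r, c, seed=0):
    """Train c base classifiers, each on r features drawn at random from pool."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(c):
        features = rng.sample(pool, r)  # no repeats within one subspace
        ensemble.append((features, train_centroids(rows, labels, features)))
    return ensemble

def majority_vote(ensemble, row):
    """Final prediction: the class predicted by most ensemble members."""
    votes = Counter(predict_centroid(cent, feats, row) for feats, cent in ensemble)
    return votes.most_common(1)[0][0]

# Toy data in which every feature separates the two classes.
rows = [[0, 1, 0, 1], [1, 0, 1, 0], [10, 9, 10, 9], [9, 10, 9, 10]]
labels = [0, 0, 1, 1]
ensemble = build_ensemble(rows, labels, pool=[0, 1, 2, 3], r=2, c=5)
prediction_low = majority_vote(ensemble, [0.5, 0.5, 0.5, 0.5])
prediction_high = majority_vote(ensemble, [9.5, 9.5, 9.5, 9.5])
```

Voting helps only when most subspaces yield classifiers that are better than chance, which is the assumption examined in the experiments that follow.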
Several experiments were conducted for the various values of the design parameters: random subspace size (r), number of random subspaces (c) and the number of features used for classification (a). The values of these parameters used for the experiment are listed in Table 5.1.

Table 5.1: Range of parameters used for majority voting technique using random subspace ensembles

Parameter   Description                          Min value   Max value
a           Top features selected from t-test    5000        10000
r           Size of random subspace              50          2000
c           Number of random subspaces/trees     1           2000

If all the features selected from the t-test feature selection stage are predictive in nature, a random subspace ensemble created using a majority voting technique is expected to yield higher accuracies as a larger number of subspaces are created (refer to Appendix Sections A.1 and A.2 for details). A larger number of subspaces of a given size ensures better coverage of the feature space, and hence the ensemble of classifiers is expected to learn the patterns in the samples more accurately.

Figure 5.10: Random subspace ensembles (a=5000, r=200, c) vs single decision tree (a=5000, r=200, c=1) (x-axis: number of random subspaces (c); y-axis: weighted accuracy of ensemble (%); series: majority voting, single decision tree)
The weighted accuracies of ensembles created with varying values of (a, r, c) were compared to the accuracies of single classifiers created from a single random subspace (a, r, c=1). It can be observed from Figure 5.10 that, contrary to the expected outcome, there is a decrease in accuracy as the number of random subspaces is increased. This indicates that a large number of the random subspaces created are probably not very effective in describing the sample classes.

In order to investigate the nature of these subspaces, the decision tree built on each random subspace was tested on the 10% held-out test samples from a single 90%-10% split of the data. The weighted test accuracies of these subspace classifiers were analyzed to identify the subspaces that represented the sample classes well, and those that did not.

Figure 5.11: Weighted test accuracies of 2000 random decision trees (x-axis: tree index; y-axis: weighted test accuracy (%))

Figure 5.11 shows the spread of the weighted test accuracies of 2000 decision trees created from random subspaces of size 200 features from the top 5000 t-test features
(random subspace parameters are (a=5000, r=200, c=2000)). If the top 5000 features as determined by the p-values of a t-test at significance level 0.05 are predictive in nature, and the combination of these features is also predictive, then all random subspaces created from these 5000 features would be expected to perform accurately. However, less than 7% of the 2000 random subspaces created were found to have accuracy higher than 80%.

An analysis of the training accuracies for each decision tree against the corresponding test accuracies indicated that while a few random subspace classifiers seemed to have learned the training samples well, their performance on test samples was poor. Only a few subspace classifiers had been able to learn the training samples well and were able to generalize the knowledge enough to predict classes of test samples accurately (Fig 5.12).

Figure 5.12: Weighted training and testing accuracies of 100 random classifiers built from random subspaces (test accuracy vs training accuracy for a single 90%-10% split of the data)
5.6 Selection of good subspaces

Figures 5.11 and 5.12 indicate that only a small subset of the random subspaces generated on the input features are effective for prediction. The classifiers created on random subspaces that generate high test and train accuracies simultaneously are considered good classifiers. The basic scheme to select good subspaces or features is described in Section 4.4. A 10-fold cross-validation scheme was used to split the 121-sample colon cancer gene expression dataset (see Chapter 2) into 10 sets of training (90%) and independent test (10%) sets. For each fold, an additional 10-fold cross-validation created 10 sets of training samples (81%) and validation samples (9%). A t-test was used on each of the 81% training sets to select the best 5000 features from the total of 54675 features, ranked according to the t-test p-values at the significance level of 0.05. 200 features were picked randomly from this set of 5000 features to create a random subspace. 100 such random subspaces were created. For each random subspace, a single decision tree was built on the training samples (81%), and the training accuracy was recorded. The decision tree was tested on the 9% validation set. The decision tree with the highest validation accuracy and the highest training accuracy was selected as the best classifier for that training and validation set of samples. The features used by this tree were selected as good features for the sub-fold.

Each of the 10 training and validation sets for a selected fold produces a set of good features. The union of these feature sets was used to train a single classifier on the 90% training samples for that fold. This single classifier was tested using the independent test set (10%) to estimate the prediction accuracy. Ten such classifiers were built, one for each fold of the cross-validation scheme. The predictions of these classifiers on the
individual independent test samples were collectively used to create the confusion matrix for the classifier scheme. The weighted accuracy of prediction for the classifier was computed from this matrix.

Figure 5.13: Classification by selection of good subspaces. Blocks: entire feature set, all training samples; for each of the 10 folds (all features; training: 9/10ths of samples; testing: 1/10th of samples): selection of good features (Fig 4.4), classifier 1 to classifier 10, predict classes of test samples; use all individual test predictions to create the confusion matrix for the classifier, compute weighted accuracy of the classifier.

Experiments were performed with three different classifiers (neural networks, support vector machines and decision trees) for prediction, using the scheme shown in Figure 5.13.

Figure 5.14: Weighted accuracies of neural networks, support vector machines and decision trees; these classifiers were trained on the union of the best features created by selecting good random subspaces (Section 5.6) (y-axis: weighted prediction accuracy (%))
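The selection step of Section 5.6 can be sketched as below. The sketch covers only the bookkeeping: picking the best tree in each sub-fold and taking the union of the features those trees used. It assumes that "highest validation accuracy and highest training accuracy" means ranking by validation accuracy with training accuracy as the tie-breaker, which is one possible reading; the function names and the numbers in the usage lines are hypothetical.

```python
def best_subspace_features(candidates):
    """Pick the features of the best tree in one sub-fold.

    candidates: list of (validation_accuracy, training_accuracy, features_used),
    one entry per random subspace. Ties in validation accuracy are broken by
    training accuracy (an assumed reading of the selection rule).
    """
    best = max(candidates, key=lambda cand: (cand[0], cand[1]))
    return best[2]

def union_of_good_features(sub_fold_candidates):
    """Union of the best-tree features over all sub-folds of one outer fold."""
    good = set()
    for candidates in sub_fold_candidates:
        good.update(best_subspace_features(candidates))
    return sorted(good)

# Hypothetical results for two sub-folds (feature indices are made up).
sub_folds = [
    [(0.70, 0.90, [3, 8]), (0.80, 0.60, [1, 2]), (0.80, 0.95, [2, 5])],
    [(0.60, 1.00, [5, 9])],
]
good_features = union_of_good_features(sub_folds)
```

The resulting feature list is what the final classifier for that fold (support vector machine, neural network or decision tree) would be trained on, using the full 90% training set.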
Support vector machines were found to achieve the highest accuracy, and the use of decision trees resulted in the poorest accuracy of prediction. The performance measures for the best classifier are listed in Table 5.2.

Table 5.2: Confusion matrix for the performance of the support vector machine trained on the union of the features created by selecting good random subspaces (LT: survival less than 3 years, GT: survival greater than 3 years)

                 Classified as
True class       LT      GT
LT               15      22
GT               19      65

Weighted accuracy   58.96 %
Total accuracy      66.12 %
Sensitivity         40.54 %
Specificity         77.38 %

The classifier was used to predict one of two classes for each test sample. These two classes (survival less than 3 years and survival greater than 3 years) were used as two groups to draw survival K-M curves. The p-value for the log-rank test to compare the curves indicates that the two predicted classes are significantly different from each other. As can be observed in Figure 5.15, the percentage of patients surviving across time in the poor prognosis class (LT) decreased at a higher rate than that of the patients in the good prognosis category (GT). While the survival curves are significantly different when using a survival cut-off point of 36 months, a clear cut-off in the survival values cannot be observed for the two classes. Hence, a different cut-off point in survival may yield better accuracy of prediction.
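As a check on the summary measures reported in Table 5.2, the sketch below recomputes them from the four confusion matrix counts. It assumes that LT is treated as the positive class and that the weighted accuracy of Section 3.6 is the mean of sensitivity and specificity; both assumptions are consistent with the reported values, but the function name and layout are illustrative only.

```python
def summary_measures(lt_as_lt, lt_as_gt, gt_as_lt, gt_as_gt):
    """Summary measures (in %) from a 2x2 confusion matrix, LT as positive class."""
    sensitivity = lt_as_lt / (lt_as_lt + lt_as_gt)
    specificity = gt_as_gt / (gt_as_gt + gt_as_lt)
    total = lt_as_lt + lt_as_gt + gt_as_lt + gt_as_gt
    return {
        "sensitivity": round(100 * sensitivity, 2),
        "specificity": round(100 * specificity, 2),
        # Assumed definition: mean of the two per-class accuracies.
        "weighted_accuracy": round(100 * (sensitivity + specificity) / 2, 2),
        "total_accuracy": round(100 * (lt_as_lt + gt_as_gt) / total, 2),
    }

measures = summary_measures(lt_as_lt=15, lt_as_gt=22, gt_as_lt=19, gt_as_gt=65)
```

Because the GT class is more than twice the size of the LT class (84 versus 37 samples), the total accuracy is dominated by GT; the weighted accuracy gives both classes equal influence.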
Figure 5.15: Survival curves for the predicted classes; the survival curves are statistically different at significance of 0.05 as determined by a log-rank test (title: Colon Cancer Survival Prognosis; x-axis: survival in months; y-axis: survival distribution; P=0.0321; N=34 and N=87)

Analysis of features used by the classifier

The t-test was used to select the best features for prediction on 81% of the samples. Since this selection was repeated on a different set of 81% training samples each time, a different set of features may be selected for use depending on the patterns of the training samples within the classes, with a minimum of 5000 unique features selected across the entire experiment. A larger number of unique features would indicate that the predictive strength of the features varied depending on the samples used for training. A total of 24998 unique features were selected by the t-tests across all the folds. This suggests some features were found to be predictive only when specific samples were used for training.

The random subspaces were created by picking 200 features randomly from the best 5000 features determined by the t-test. Decision trees, built on these random
subspaces, selected the best of these 200 features to create the tree. Each tree used, on average, 10 features selected from the random subspace. Since 10 feature sets were used to create the union of features for classification, it is expected that between 10 and 100 unique features would be used by a single classifier. The entire classifier scheme, including 10 such classifiers, used a total of 744 features. Hence, each classifier used an average of 74.4 unique features. 667 of these 744 features used for classification were found to be unique.

Features that are truly predictive would ideally be found to be the best features across multiple folds of the cross-validation scheme. Figure 5.16 shows the repetition of features across two or more folds using the classifier described in Section 5.6. Since these features were selected as predictive features for various combinations of training samples, they are expected to be the most predictive for the samples used in the study.

Figure 5.16: Repetition of features across two or more folds of the cross-validation scheme (x-axis: number of times a feature was repeated across folds; y-axis: number of features repeated)
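The repetition counts behind this kind of analysis can be sketched as follows: given the feature set used by the classifier in each fold, count in how many folds each feature appears and then tabulate those counts. The function name and the feature labels in the usage lines are hypothetical.

```python
from collections import Counter

def repetition_histogram(fold_features):
    """Count, for each feature, the number of folds that used it.

    Returns (folds_per_feature, histogram), where histogram maps
    'number of folds' to 'number of features used in that many folds'.
    """
    folds_per_feature = Counter(
        feature for features in fold_features for feature in set(features)
    )
    histogram = Counter(folds_per_feature.values())
    return folds_per_feature, histogram

# Hypothetical feature sets for three folds.
fold_features = [["g1", "g2", "g3"], ["g2", "g3", "g4"], ["g3", "g5"]]
folds_per_feature, histogram = repetition_histogram(fold_features)

total_used = sum(len(set(features)) for features in fold_features)
unique_used = len(folds_per_feature)
```

In the thesis's terms, `total_used` corresponds to the 744 features summed over the 10 classifiers and `unique_used` to the 667 distinct features among them.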
5.7 Verification of results

In order to help verify the results of classification, a few additional experiments were performed to test the effect of randomization of the samples and the feature subspaces on classification accuracy.

The proposed method uses randomization of the samples into train and test sets at each fold of the cross-validation, and in the creation of the training and validation sets. Further, the random subspaces select features from a pool at random, with a few repetitions of the features. The classifier results can be considered reliable if they are reproduced with different random selections at every stage.

To investigate the reliability of the classifier scheme, a series of experiments were conducted to vary the configuration of the samples and feature subspaces. In the first experiment, the classifier scheme was re-run three times with constant parameters. Since the parameters for partitioning the data into the various training, validation and independent test sets split the samples into exactly the same configuration for each of these runs, the only source of variation in results was the randomization in creation of subspaces.

Three additional experiments were conducted to change the parameters of the cross-validation schemes. In the first of these experiments, the initial random seed used to create the split of samples into the independent test and train sets was varied. The second experiment varied the split of samples into the training and validation sets, and the third experiment varied both splits simultaneously.
Figure 5.17: Variation in the weighted accuracy for prediction of survival for colon cancer with changes in randomization of the samples and feature subspaces; the accuracies reported here are the averages of experiments for each randomization (weighted accuracy (%): random subspace selection 58.96; random train-test splits 57.29; random train-validation splits 58.2; random both 54.75)

It can be observed from Figure 5.17 that the weighted accuracy from the verification experiments varies by 3.36%, indicating that the classification scheme is relatively stable in spite of changes in the samples used for training.

The next step in verifying the stability of the classifier is analyzing the features used for classification by each experiment. Repetition of features used would indicate that the feature selection at the random subspace stage was relatively invariant to the randomization of samples and subspace generation. Figure 5.18 shows the number of features that were used by a total of 8 experiments, including two experiments with randomization of the subspaces, and two each for the randomization for splitting the data into the various test, train and validation sets. It can be observed that several features were repeatedly picked as the best predictors in spite of the various randomization effects in the verification experiments.
Figure 5.18: Number of features repeatedly selected as the most predictive features across all the experiments to test variability of results (x-axis: number of times a feature was repeated across experiments; y-axis: number of features repeated; values: 3 times: 198; 4 times: 78; 5 times: 48; 6 times: 19; 7 times: 12; 8 times: 2)
CHAPTER 6
DISCUSSION AND CONCLUSION

6.1 Discussion

The proposed classifier scheme using random subspace ensembles, described in Section 5.6, achieved a weighted accuracy of prediction of 58.96% for colon cancer microarray data. Several features used by the classifier in the final prediction of samples were found to be repeated across at least three folds of the cross-validation scheme. Further, the survival times for each predicted class for this classifier were found to be significantly different (Figure 5.15), indicating that the features used for prediction are collectively predictive in nature for the colon cancer gene expression data.

Figure 6.1: Survival curves for two genes, split on the median, repeated across three folds in the classifier scheme described in Section 5.6 (panels: Probeset A, repeated across 3 folds; Probeset B, repeated across 3 folds; x-axis: survival in months; y-axis: survival distribution)
Figure 6.1 shows two survival K-M curves and the corresponding log-rank tests for two of the features that were repeated across three folds of the cross-validation scheme. Since these features were picked as the most predictive features in three of the folds, they would be expected to be individually predictive. However, Figure 6.1 indicates that while the K-M curves for feature B are significantly different, indicating good predictive value for the feature, the K-M curves for feature A are not significantly different. This suggests that although the selected features may not be individually significant in prognostic value, a set of features in combination could be good for prediction of survival. The goal of the classifier is then to select an optimal set of features that can collectively predict the outcome of colon cancer in terms of survival.

Although the proposed method using random subspaces improved the weighted accuracy (Sections 5.5 and 5.6), the success of any of the classifiers generated is clearly not optimal. This could indicate that the colon cancer dataset probably includes sub-groups of patients within each of the survival groups. These sub-groups may exhibit unique characteristics that are not sufficiently described by the group as a whole. In other words, the two survival groups are likely to be heterogeneous in gene expression characteristics. The voting technique to create random subspace ensembles would work well in simple cases (refer to Appendix Sections A.1 and A.2) where the groups for classification consist of homogeneous gene expression characteristics. With a more complex or heterogeneous set of samples, the simple voting technique would confuse the classification, since not all the sub-groups would be adequately represented. In such cases, the proposed technique of selecting the best feature subspaces across multiple
folds may work well by generalizing the feature space to include a majority of the sub-groups within the training samples.

Future work

The classifier scheme of creating random subspace ensembles, by selection of the best feature subspace, has been shown to work at least as well as the baseline experiments in the task of predicting 36-month survival for colon cancer patients. However, Figure 5.15 suggests that a dichotomization on survival time points other than 36 months may lead to a more accurate classifier. The proposed method could be used with other survival time thresholds, to investigate the split of the training samples for highest accuracy of prediction.

Further, the configuration of the random subspaces used with C4.5 decision trees was selected based on the results of the baseline experiments. Experimentation with variations of the configuration would help in identifying a potentially more accurate classifier scheme and in selecting features that are more robust in survival prediction. The described method used the random subspace classifier with the best validation and training accuracy for selection of features. Selection of multiple subspaces rather than a single best random subspace may enhance the accuracy of survival prediction by including better descriptors of the classes.

In the description of the proposed method, C4.5 decision trees have been used with the random subspaces to select good features for classification. Support vector machines or neural networks were used with these good features to train on the training samples. These classifiers were used for prediction of classes for new samples. Use of the
same type of classifier at the feature selection stage as well as final classification could yield a simpler model. However, the effect of such a model on the accuracy of prediction may be assessed only through experimental evidence.

6.2 Conclusion

Gene expressions of cancer tissue at different stages of development are expected to have unique signatures. Identification of these signatures would aid in prognosis of cancer, and prediction of long-term survival for the patient. Gene expressions of colon cancer tissue were studied for the purpose of predicting 36-month survival for the cancer patient. Microarray technology enables analysis of gene expressions by generating information for thousands of genes. A t-test was used to select a set of the most promising features for prediction.

Random subspace ensembles created using these selected features yielded poor accuracy in survival prediction for the colon cancer data. A modification to the random subspace technique was proposed that selects the most accurate feature subspace among all the random subspaces created and treats its features as the most predictive features. These predictive features were used by support vector machines to classify samples into two survival groups with a weighted accuracy of 58.96%. The accuracy of this classifier was shown to be comparable to any of the baseline classifiers tested on the same dataset in predicting the class of new and unknown samples. Further, the method was tested on other gene expression datasets (see Appendix Sections A.1 and A.2) and shown to work with prediction accuracies comparable to the accuracies of the baseline classifiers.
REFERENCES

1. Douglas Hanahan, Robert A. Weinberg. The Hallmarks of Cancer. Cell, Vol. 100, 57-70, January 7, 2000. Copyright 2000 by Cell Press.
2. E. J. Ambrose, F. J. C. Roe (1975). Biology of Cancer. Sussex, England: Ellis Horwood Limited.
3. David J. Lockhart, Elizabeth A. Winzeler. Genomics, gene expression and DNA arrays. Nature, Vol. 405, 15 June 2000.
4. E. A. Carlson (2004). Mendel's Legacy. New York: Cold Spring Harbor Laboratory Press.
5. Watson, J., Crick, F.H.C. Molecular Structure of Nucleic Acids. Letters to Nature, Nature 171, 737-738 (1953), Macmillan Publishers Ltd.
6. Elaine Marieb. Human Anatomy and Physiology. Pearson Education Inc.
7. Findley, McGlynn, Findley (1989). The Geometry of Genetics. Wiley-Interscience Monographs in Chemical Physics.
8. Affymetrix. http://www.affymetrix.com/index.affx
9. The Human Genome Project. http://www.ornl.gov/sci/techresources/Human_Genome/project/info.shtml
10. Edlich RF, Winters KL, Lin KY. Breast cancer and ovarian cancer genetics. J Long Term Eff Med Implants. 2005;15(5):533-45.
11. Murphy N, Millar E, Lee CS. Gene expression profiling in breast cancer: towards individualising patient management. Pathology. 2005 Aug;37(4):271-7.
12. Ueki K. Oligodendroglioma: impact of molecular biology on its definition, diagnosis and management. Neuropathology. 2005 Sep;25(3):247-53.
13. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, E. S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, Vol. 286, 15 October 1999.
83 14. Gennadi V. Glinsky, Olga Berezovska, Anna B. Glinsk ii. Microarray analysis identifies a death-from-cancer signature predicting therapy failure in patients with multiple types of cancer The Journal of Clinical Investigation, Vol. 115, No. 6, June 2005. 15. Laura J van Â‘t Veer,. Hongyue Dai, Marc J van de Vi jver, Yudong D. He, Hart, Augustinus A. M., Mao Mao, Hans L.Peterse, Karin va n der Kooy, Marton, Matthew J., Anke T Witteveen, George J Schreiber, Ron M. Ke rkhoven, Chris Roberts, Peter S. Linsley, Rene Bernards, Stephen H Friend. Gene expression profiling predicts clinical outcome of breast cancer Letters to nature, Vol. 415(6871), 31 January 200 2, pp 530-536. 16. M.van de Vijver, Y.He, L.van'T Veer H.Dai, A.Hart D.Voskuil, G.Schreiber, et al. A gene expression signature as a predictor of survi val in breast cancer The New England Journal of Medicine 2002; 347 (25): 1999-20 09. 17. U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Yb arra, D. Mack, A. J. Levine. Broad patterns of gene expression revealed by clust ering analysis of tumor and normal colon tissues probed by oligonucleotides arr ays Proc. Natl. Acad. Sci. USA, Vol. 96, pp. 6745-6750, June 1999, Cell Biology. 18. Sridhar Ramaswamy, Ken N. Ross, Eric S. Lander, Tod d R. Golub. A molecular signature of metastasis in primary solid tumors Nature Genetics, Vol 33, January 2003. 19. Steven Eschrich, Ivana Yang, Greg Bloom, Ka Yin Kw ong, David Boulware, Alan Cantor, Domenico Coppola, Mogens Kruhoffer, Lauri A altonen, Torben F. Orntoft, John Quackenbush, Timothy J. Yeatman. Molecular staging for survival prediction of colorectal cancer patients J Clin Oncol. 2005 May 20;23(15):3526-35. 20. National Cancer Institue. SEER Statistics http://seer.cancer.gov/ 21. Bolstad, B.M., Irizarry R. A., Astrand, M., and Spe ed, T.P. A Comparison of Normalization Methods for High Density Oligonucleot ide Array Data Based on Bias and Variance Bioinformatics 19(2):185-193, (2003). 22. 
Torres-Roca JF, Eschrich S, Zhao H, Bloom G, Sung J McCarthy S, Cantor AB, Scuto A, Li C, Zhang S, Jove R, Yeatman. Prediction of radiation sensitivity using a gene expression classifier Cancer Res. 2005 Aug 15;65(16):7169-76. 23. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response .Proc Natl Acad Sci U S A. 2001 Apr 24;98(9):5116-2 1. Epub 2001 Apr 17. Erratum in: Proc Natl Acad Sci U S A 2001 Aug 28;98(18):10515.
84 24. D. C. Montgomery, G. C. Runger, N. F. Hubele (1997) Engineering Statistics JohnWiley & Sons, Inc. 25. D. G. Kleinbaum (1996). Survival Analysis: A self-learning text New York: SpringerVerlag New York, Inc. 26. Statsoft inc. Survival Analysis : http://www.statsoft.com/textbook/stsurvan.html#kapl an. 27. Atlas of Genetics and Cytogenetics in Oncology and Haematology. http://www.infobiogen.fr/services/chromcancer/Genes /Geneliste.html 28. Cancer Genetics Web. http://www.cancerindex.org/ge neweb/genes_a.htm 29. Tumor Gene Database. http://condor.bcm.tmc.edu/ermb /tgdb/tgdb.html 30. Ivana V Yang, Emily Chen, Jeremy P Hasseman, Wei Li ang, Bryan C Frank, Shuibang Wang, Vasily Sharov, Alexander I Saeed, Jo seph White, Jerry Li, Norman H Lee, Timothy J Yeatman, and John Quackenbush. Within the fold: assessing differential expression measures and reproducibilit y in microarray assays Genome Biol. 2002; 3(11): research0062.1Â–research0062.12. 31. H. Witten, E. Frank (2000). Data Mining Morgan Kauffman Publishers. 32. S. V. Kartalopoulos (1996). Understanding Neural Networks and Fuzzy logic: Basi c Concepts and Applications. New Delhi: Prentice-Hall of India. 33. Simon Haykin (1999). Neural Networks: A Comprehensive Foundation New Jersey: Prentice Hall, Inc. 34. Cristopher J. C. Burges (1998). Data Mining and Kno wledge Discovery, A Tutorial on Support Vector Machines for Pattern Recognition. Boston: Kluwer Academic Publishers. 35. Tin Kam Ho. The Random Subspace Method for Constructing Decisio n Forests IEEE Transactions on Pattern Analysis and Machine Intell igence, Vol. 20, No. 8, August 1998. 36. Trisha Greenhalgh. How to read a paper: Papers that report diagnostic or screening tests BMJ 1997;315:540-543 (30 August). 37. Steven Eschrich, Timothy J. Yeatman. DNA microarrays and data analysis: An overview Surgery, Vol. 136, No. 3, May 2004.
85 BIBLIOGRAPHY 1. Findley, McGlynn, Findley (1989). The Geometry of Genetics Wiley-Interscience Monographs In Chemical Physics. 2. A.H. Sturtevant (2001). A history of genetics New York: Cold Spring Laboratory Press. 3. F. H. Prtugal, J. S. Cohen (1977). A Century of DNA: A History of the Discovery of the Structure and Function of the Genetic Substance Cambridge: The MIT Press. 4. R. O. Duda, P. E. Hart, D. G. Stork (2001). Pattern Classification John Wiley & Sons, Inc.
Appendix A: Application of the proposed method on various gene expression datasets

The proposed method has been applied to the analysis of the gene expression of colon cancer in order to predict survival. The method was shown to work with prediction accuracy comparable to other classifier schemes discussed in the literature. In order to test the merit of the proposed method on the analysis of gene expression profiles, the method was used to analyze two datasets with different class characteristics. The first dataset was the publicly available leukemia dataset, with two main classes: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia). The second dataset consisted of gender information extracted from the colon cancer survival dataset; the two classes in this case were male and female patients. The experiments for each of these datasets are described in the following sections.

A.1 Analysis of leukemia data

Data Description: The leukemia gene expression dataset covers two variants of leukemia, Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML). The dataset includes a total of 7129 normalized features, or probesets. The 38 samples in the dataset include 27 samples of ALL and 11 samples of AML.
Baseline Experiments: Basic classifier experiments were conducted to obtain a baseline performance measure on the dataset. The three classifiers used were Neural Networks, Support Vector Machines and C4.5 Decision Trees. The t-test was used as an initial feature selection step to reduce the number of features used for classification. Since the number of samples in the two classes was unequal, a weighted accuracy was used to measure the success of classification.

Figure A.1: Classifier performance with ALL-AML: neural networks, support vector machines and C4.5 decision trees. (Plot title: Performance of baseline classifiers; x-axis: number of features used for classifier training, 100 to 5000; y-axis: weighted accuracy (%); series: Neural Networks, Decision Trees, Support Vector Machines.)

The Neural Networks performed consistently, with an accuracy of 95.45% for every number of selected features tested (100 <= a <= 5000). Support Vector Machines achieved a high accuracy of 97.37%. Decision trees, however, deteriorated in performance as the number of input features increased, with the maximum accuracy occurring at the lowest number of features.
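The initial t-test feature selection can be illustrated with a minimal sketch that ranks features by the absolute value of a two-sample t statistic and keeps the top a. The text does not state which form of the t-test was used, so the unequal-variance (Welch) form here is an assumption, and the function names and toy data are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(xs, ys):
    """Two-sample t statistic, unequal-variance (Welch) form (an assumption)."""
    diff = mean(xs) - mean(ys)
    se = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    if se == 0:
        # No spread in either group: undefined ratio, treated as 0 or +/-inf.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def top_features_by_t_test(samples, labels, a):
    """Return the indices of the `a` features with the largest |t|."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are expected")
    group0 = [s for s, y in zip(samples, labels) if y == classes[0]]
    group1 = [s for s, y in zip(samples, labels) if y == classes[1]]
    scores = []
    for j in range(len(samples[0])):
        t = welch_t([s[j] for s in group0], [s[j] for s in group1])
        scores.append((abs(t), j))
    # Largest |t| first; ties are broken by feature index for determinism.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scores[:a]]


# Hypothetical toy data: only feature 0 separates the two classes.
samples = [
    [1.0, 5.0, 0.2],
    [1.2, 4.0, 0.1],
    [0.9, 6.0, 0.3],
    [3.0, 5.5, 0.2],
    [3.1, 4.5, 0.1],
    [2.9, 5.0, 0.3],
]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
selected = top_features_by_t_test(samples, labels, a=1)
```

In an experiment of the kind described above, a would range over values such as 100 to 5000, and the ranking would be computed on training samples only so that held-out samples do not influence which features are kept.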
Random subspace ensembles using majority voting technique: The majority voting technique (Figure 5.9) was used in the creation of random subspace ensembles to predict the classes of samples from the ALL-AML dataset. The experiment was run at various values of the parameters (a, r, c) (Table 5.2), using the weighted accuracy to measure the performance of the ensemble. The performance of the ensembles was compared with the performance of a single decision tree built on a single random subspace selected from the same set of a features.

Figure A.2: Random subspace ensembles (a = 5000, r = 200, c) vs. single decision tree (a = 5000, r = 200, c = 1) on the ALL-AML dataset. (Plot title: Performance of random subspace ensembles using majority voting technique (r = 200); x-axis: number of random subspaces (c), 100 to 2000; y-axis: weighted prediction accuracy (%); series: random subspaces, single decision tree.)

The accuracy of the ensemble increases as the number of subspaces increases. The increasing number of subspaces ensures better coverage of the feature space. Since all the features seem to be predictive in nature, the accuracy of the ensemble increases as more predictive features are added to it. Each of these random subspace ensembles has a better predictive accuracy than a single decision tree (Figure A.1).
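The majority voting scheme can be sketched as follows, with r the size of each random subspace and c the number of subspaces, as in the parameters above. This is only an illustration under stated assumptions: the experiments used decision trees as base classifiers, whereas the sketch substitutes a simple nearest-centroid classifier to stay self-contained, and all function names and data are hypothetical.

```python
import random
from collections import Counter


def fit_centroids(samples, labels, features):
    """Stand-in base learner (not C4.5): per-class mean over the chosen features."""
    counts = Counter(labels)
    sums = {}
    for sample, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, j in enumerate(features):
            acc[k] += sample[j]
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(centroids, features, sample):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    def distance(centroid):
        return sum((sample[j] - centroid[k]) ** 2 for k, j in enumerate(features))
    return min(sorted(centroids), key=lambda label: distance(centroids[label]))


def train_ensemble(samples, labels, r, c, seed=0):
    """Train c base learners, each on r features drawn at random without replacement."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    ensemble = []
    for _ in range(c):
        features = rng.sample(range(n_features), r)
        ensemble.append((features, fit_centroids(samples, labels, features)))
    return ensemble


def predict_majority(ensemble, sample):
    """Each member casts one vote; the most frequent class wins (ties by label order)."""
    votes = Counter(
        predict_centroid(centroids, features, sample)
        for features, centroids in ensemble
    )
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))[0][0]


# Hypothetical toy data in which every feature is predictive.
samples = [[0, 1, 0, 1], [1, 0, 1, 0], [10, 9, 10, 9], [9, 10, 9, 10]]
labels = ["ALL", "ALL", "AML", "AML"]
ensemble = train_ensemble(samples, labels, r=2, c=5, seed=1)
prediction_low = predict_majority(ensemble, [0.5, 0.5, 0.5, 0.5])
prediction_high = predict_majority(ensemble, [9.5, 9.5, 9.5, 9.5])
```

Setting c = 1 reduces the ensemble to a single classifier on one random subspace, which corresponds to the single-tree comparison in Figure A.2.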
Random subspace ensembles by selection of good classifiers: The proposed method of using random subspace ensembles by selecting the good classifiers within the cross-validation scheme (Figure 5.13) was tested on the ALL-AML dataset. 100 random subspaces, each of size 200, selected from the top 5000 t-test features, were used to create the ensembles. Support Vector Machines were used for classification. The performance of the method, assessed by computing the weighted accuracy of prediction on the 10% held-out independent samples, is shown in Table A.1.

Table A.1: Confusion matrix for the performance of the proposed method on the leukemia gene expression dataset

True class    Classified as ALL    Classified as AML
ALL           27                   0
AML           2                    9

Weighted accuracy    90.91 %
Total accuracy       94.74 %
Specificity          100.0 %
Sensitivity          81.81 %

A.2 Analysis of gender data

Data Description: The colon cancer dataset (refer to Chapter 2) was split into two classes based on gender: MALE and FEMALE. The dataset consisted of 135 samples, with 68 female and 67 male patients. Each sample was characterized by 54675 normalized features/probesets.
Baseline Experiments: Basic classifier experiments were conducted to obtain a baseline performance measure on the dataset. The two classifiers used were Neural Networks and Support Vector Machines. The t-test was used as an initial feature selection step to reduce the number of features used for classification. Weighted accuracy was used to measure the success of classification.

Figure A.3: Classifier performance with gender dataset: neural networks and support vector machines. (Plot title: Performance of baseline classifiers; x-axis: number of features used for classifier training, 100 to 9000; y-axis: weighted accuracy (%); series: Neural Networks, Support Vector Machines.)

Random subspace ensembles using majority voting technique: The majority voting technique (Figure 5.9) was used in the creation of random subspace ensembles to predict the gender of samples in the dataset. The experiment was run at various values of the parameters (a, r, c) (Table 5.2), using the weighted accuracy to measure the performance of the ensemble. The performance of the ensembles was compared with the performance of a single decision tree built on a single random subspace selected from the same set of a features.

Figure A.4: Random subspace ensembles (a = 5000, r = 200, c) vs. single decision tree (a = 5000, r = 200, c = 1) on gender dataset. (Plot title: Performance of random subspace ensembles using majority voting technique (r = 200); x-axis: number of random subspaces (c), 1 to 2000; y-axis: weighted prediction accuracy (%); series: random subspaces, single decision tree.)

As expected, the accuracy of the ensemble increases as the number of subspaces increases. The increasing number of subspaces ensures better coverage of the feature space. Since all the features seem to be predictive in nature, the accuracy of the ensemble increases as more predictive features are added to it. Each of these random subspace ensembles has a better predictive accuracy than a single decision tree (Figure A.4).

Random subspace ensembles by selection of good classifiers: The proposed method of using random subspace ensembles by selecting the good classifiers within the cross-validation scheme (Figure 5.13) was tested on the gender dataset. 100 random subspaces, each of size 200, selected from the top 5000 t-test features, were used to create the ensembles. Support Vector Machines were used for classification. The performance of the method, assessed by computing the weighted accuracy of prediction under 10-fold cross-validation, is shown in Table A.2. The proposed method creates a classifier that predicts the classes of unknown samples with accuracy comparable to that obtained using the majority voting technique of creating random subspace ensembles.

Table A.2: Confusion matrix for the performance of the proposed method on the gender gene expression dataset

True class    Classified as MALE    Classified as FEMALE
MALE          60                    8
FEMALE        4                     63

Weighted accuracy    91.13 %
Total accuracy       91.11 %
Specificity          88.23 %
Sensitivity          94.02 %
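The summary figures under Tables A.1 and A.2 can be re-derived from the four confusion-matrix counts. The sketch below rests on two assumptions inferred from the tabulated values rather than stated in this appendix: the class in the second row (AML, FEMALE) is treated as the positive class, and weighted accuracy is taken as the mean of sensitivity and specificity. The function name is hypothetical.

```python
def confusion_summary(tn, fp, fn, tp):
    """Summarize a 2x2 confusion matrix; all returned values are percentages."""
    sensitivity = tp / (tp + fn)  # positive-class samples classified correctly
    specificity = tn / (tn + fp)  # negative-class samples classified correctly
    return {
        "sensitivity": 100 * sensitivity,
        "specificity": 100 * specificity,
        # Assumption: weighted accuracy is the mean of the two per-class rates.
        "weighted_accuracy": 100 * (sensitivity + specificity) / 2,
        "total_accuracy": 100 * (tp + tn) / (tn + fp + fn + tp),
    }


# Counts read from Table A.1 (first row ALL, second row AML).
leukemia = confusion_summary(tn=27, fp=0, fn=2, tp=9)
# Counts read from Table A.2 (first row MALE, second row FEMALE).
gender = confusion_summary(tn=60, fp=8, fn=4, tp=63)

for name, result in (("leukemia", leukemia), ("gender", gender)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Under these assumptions the computed weighted and total accuracies match both tables, and the per-class rates agree to within the last printed digit. Averaging the per-class rates keeps the larger class from dominating the score, which matters for the 27-to-11 split in the leukemia data.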