
An iterative feature perturbation method for gene selection from microarray data


Material Information

Title:
An iterative feature perturbation method for gene selection from microarray data
Physical Description:
Book
Language:
English
Creator:
Canul-Reich, Juana
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:
2010
Subjects

Subjects / Keywords:
IFP
Feature selection
Classification
T-test
Data mining
SVM-RFE
SVM
Dissertations, Academic -- Computer Science & Engineering -- Doctoral -- USF   ( lcsh )
Genre:
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: Gene expression microarray datasets often consist of a limited number of samples with a large number of gene expression measurements, usually on the order of thousands. These characteristics might negatively impact the prediction accuracy of a classification model. Therefore, dimensionality reduction is critical prior to any classification task. We introduce the iterative feature perturbation method (IFP), an embedded gene selector that iteratively discards non-relevant features. Relevant features are those which, after perturbation with noise, cause a change in the predictive accuracy of the classification model; non-relevant features cause no accuracy change in such a situation. We apply IFP to 4 cancer microarray datasets: colon, leukemia, Moffitt colon, and lung. We compare results obtained by IFP to those of SVM-RFE and the t-test, using a linear support vector machine as the classifier. We use the entire set of features in the datasets, and a preselected set of 200 features (based on p values from the t-test) from each dataset. When using the entire set of features, IFP results in comparable accuracy (and higher at some points) with respect to SVM-RFE on 3 of the 4 datasets. The t-test feature ranking produces classifiers with the highest accuracy across the 4 datasets. When using 200 features, the accuracy results show up to 3% performance improvement for both IFP and SVM-RFE across the 4 datasets. We corroborate these results with an AUC analysis and a statistical analysis using the Friedman/Holm test. In addition to the t-test, we used information gain and ReliefF as filters and compared all three. Results of the AUC analysis show that IFP and SVM-RFE obtain the highest AUC values when applied to the t-test-filtered datasets. This result is additionally corroborated with statistical analysis. The percentage of overlap between the gene sets selected by any two methods across the four datasets indicates that different sets of genes can result in similar accuracies. We created ensembles of classifiers using the bagging technique with IFP, SVM-RFE, and the t-test, and showed that their performance can be at least equivalent to that of the non-bagging cases, and better in some cases.
Thesis:
Dissertation (Ph.D.)--University of South Florida, 2010.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by Juana Canul-Reich.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains X pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E14-SFE0004571
usfldc handle - e14.4571
System ID:
SFS0027886:00001




Full Text

An Iterative Feature Perturbation Method for Gene Selection from Microarray Data

by

Juana Canul-Reich

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science & Engineering
College of Engineering
University of South Florida

Major Professor: Lawrence O. Hall, Ph.D.
Dmitry Goldgof, Ph.D.
Rafael Perez, Ph.D.
Steven Eschrich, Ph.D.

Date of Approval: June 11, 2010

Keywords: IFP, feature selection, classification, t-test, data mining, SVM-RFE, SVM

Copyright © 2010, Juana Canul-Reich

To my mother, who will always be an inspiration throughout my life. To my husband, for being my partner in life, who on top of his own job strove his best at home while I was at school. And to my daughters Maru, Mine, and Vivi, who will always be my blessings from God. You all three represent my greatest commitments in life. I pray for wisdom so I can accomplish my job as your mother. All together, we have survived throughout this, in our situation, challenging journey. I love you all.

Acknowledgements

My first acknowledgement is to God for being my source of spiritual strength.

I would like to acknowledge those agencies who in different ways supported my PhD studies: Fulbright, Promep, Conacyt, UJAT, UJAT-DAIS, Moffitt Cancer Center and Research Institute, and USF. My endless gratitude to Dr. Hall, Dr. Goldgof, and Dr. Eschrich for their valuable week-after-week advice. To my family, for walking along with me all this time.

Last but not least, my heart is with those people we met in Tampa, now friends of my family, who made our stay an unforgettable and pleasant one.

Thank you all.

Table of Contents

List of Tables  v
List of Figures  viii
Abstract  xiii
Chapter 1  Introduction  1
  1.1  Motivation  1
  1.2  Contributions  3
  1.3  Organization  4
Chapter 2  Background  5
  2.1  Microarray Data  5
  2.2  Feature Selection  6
    2.2.1  Filters  7
      2.2.1.1  Student's T-Test  7
      2.2.1.2  Information Gain  8
      2.2.1.3  ReliefF  9
    2.2.2  Wrappers  10
    2.2.3  Embedded Methods  11

  2.3  Support Vector Machines  12
Chapter 3  Feature Selection Methods for Microarray Data  16
  3.1  Filters  16
    3.1.1  Student's T-Test  16
    3.1.2  Rank Products  17
    3.1.3  Markov Blanket and ROC Curves  19
    3.1.4  Information Gain, ReliefF, and Correlation-Based Feature Selection  21
  3.2  Wrappers  23
    3.2.1  Sequential Search  23
    3.2.2  Genetic Algorithms and Local Search  24
  3.3  Embedded  26
    3.3.1  Recursive Feature Elimination (SVM-RFE)  26
    3.3.2  Random Forests  28
    3.3.3  Penalized Methods: HHSVM  30
    3.3.4  Local Learning Based Feature Selection  31
Chapter 4  Iterative Feature Perturbation Method  34
  4.1  Binary Search  36
  4.2  Computational Cost of IFP  38
Chapter 5  Experimental Studies  40
  5.1  Data and Preprocessing  40
  5.2  Parameters for the SVM  42

  5.3  Performance Measure  42
  5.4  Adaptive Feature Elimination Strategy  43
  5.5  Gene Pre-Filtering Strategy  44
  5.6  Experiments and Results  44
    5.6.1  Accuracy Results  45
    5.6.2  Intersection Across the Entire Set of Features  47
    5.6.3  Gene Preselection  54
    5.6.4  Intersection Across a Subset of Features (Genes)  60
  5.7  Frequency of the Use of SVM Weights for Tie-Breaking by IFP  66
  5.8  AUC Analysis  66
    5.8.1  Across the Entire Set of Features  67
    5.8.2  Across the Top 200 Subset of Features  69
Chapter 6  Statistical Analysis  71
  6.1  Starting with the Entire Set of Features  72
  6.2  Starting with a Preselected Set of n Features  75
  6.3  Preselection vs. No Preselection of IFP and SVM-RFE  79
Chapter 7  Filtering for Improved Gene Selection  83
  7.1  Experimental Design and Evaluation  83
  7.2  Effect of Filtering in Terms of SVM Accuracy  84
  7.3  Accuracy of IFP and SVM-RFE Using Filtered Datasets  87
  7.4  Statistical Comparison of Filtering Methods  95

  7.5  Analysis of Overlap Between Genes Selected  99
  7.6  Summary of Results Including Reviewed Feature Selection Methods  104
Chapter 8  Ensemble Approach Using Bagging  110
  8.1  Defining Ensemble and Bagging  110
  8.2  Why Use Bagging?  111
  8.3  Bagging Applied to IFP, SVM-RFE, and the T-Test  112
    8.3.1  Colon Cancer Dataset  113
    8.3.2  Leukemia Dataset  114
    8.3.3  Moffitt Colon Cancer Dataset  114
Chapter 9  Discussion and Future Work  120
  9.1  Future Work  123
List of References  124
Appendices  133
  Appendix A: Flowchart for Iterative Feature Perturbation Method  134
  Appendix B: Accuracies for the Top N Features with P Values <= 0.01  135
About the Author  End Page

List of Tables

Table 5.1  Confusion matrix.  43
Table 5.2  AUC analysis of accuracy curves across all 4 datasets using the entire set of features.  68
Table 5.3  AUC analysis of accuracy curves across all 4 datasets using the top 200 features.  70
Table 6.1  Statistical analysis of results on the colon cancer dataset between the methods IFP (I), SVM-RFE (R), and the t-test (T) across features (a) 50 to 38, (b) 37 to 25, (c) 24 to 12, and (d) 11 to 1 with no previous preselection of features.  73
Table 6.2  Statistical analysis of results on the leukemia dataset between the methods IFP (I), SVM-RFE (R), and the t-test (T) across features (a) 24 to 12 and (b) 11 to 1 with no previous preselection of features.  74
Table 6.3  Statistical analysis of results on the colon cancer dataset between the methods IFP (I), SVM-RFE (R), and the t-test (T) across features (a) 50 to 38, (b) 37 to 25, (c) 24 to 12, and (d) 11 to 1 with previous preselection of n features (those with p values <= 0.01).  76
Table 6.4  Statistical analysis of results on the leukemia dataset between the methods IFP (I), SVM-RFE (R), and the t-test (T) across features (a) 50 to 38, (b) 37 to 25, (c) 24 to 12, and (d) 11 to 1 with previous preselection of n features (those with p values <= 0.01).  78
Table 6.5  Statistical analysis of results on the lung cancer dataset between the methods IFP (I), SVM-RFE (R), and the t-test (T) across features (a) 33 to 21, (b) 20 to 8, and (c) 7 to 1 with previous preselection of n features (those with p values <= 0.01).  79

Table 6.6  Statistical comparison between doing preselection (P) and not doing preselection (NP) with the methods IFP and SVM-RFE across features (a) 50 to 38, (b) 37 to 25, (c) 24 to 12, and (d) 11 to 1 on the colon cancer dataset.  80
Table 6.7  Statistical comparison between doing preselection (P) and not doing preselection (NP) with the methods IFP and SVM-RFE across features (a) 32 to 20, (b) 19 to 7, and (c) 6 to 1 on the lung cancer dataset.  82
Table 7.1  Highest accuracies attained for each dataset across three filters.  90
Table 7.2  AUC analysis of accuracy curves across all four filtered datasets, using the (a) t-test, (b) information gain, and (c) ReliefF.  98
Table 7.3  Average accuracy improvements obtained between IFP and SVM-RFE applied to the four original non-filtered datasets and to IFP and SVM-RFE applied to the 200-gene filtered datasets.  101
Table 7.4  Statistical comparison across the top 50 features of the accuracies from IFP on the best 200 genes filtered according to the t-test (T), information gain (G), and ReliefF (F) for the four datasets.  101
Table 7.5  Statistical comparison across the top 50 features of the accuracies from SVM-RFE on the best 200 genes filtered according to the t-test (T), information gain (G), and ReliefF (F) for the four datasets.  102
Table 7.6  Statistical comparison across the top 50 features of the accuracies from SVM on the best 200 genes filtered according to the t-test (T), information gain (G), and ReliefF (F) for the four datasets.  102
Table 7.7  Percentage of intersection (at 50 genes) between genes selected by IFP and SVM-RFE when applied on the top 200 genes filtered with the t-test, information gain, and ReliefF across the four datasets.  103
Table 7.8  Percentage of intersection (at 50 genes) of genes filtered by (a) IFP and (b) SVM-RFE with the genes filtered by each filter across the four datasets.  103
Table 7.9  Percentage of intersection (at 50 genes) between genes filtered by any 2 filters.  103

Table 7.10  Percentage of intersection (at 50 genes) for each method with itself using the (a) information gain, (b) ReliefF, and (c) t-test, across the top 200 genes filtered with the corresponding filter for the four datasets.  104
Table 7.11  Summary of results from feature selection methods in [5, 9-11] for the colon cancer dataset.  106
Table 7.12  Summary of results from feature selection methods in [12, 47] and this work for the colon cancer dataset.  107
Table 7.13  Summary of results from feature selection methods in [5, 8-10] for the leukemia dataset.  108
Table 7.14  Summary of results from feature selection methods in [11, 12, 44] and this work for the leukemia dataset.  109

List of Figures

Figure 2.1  Matrix representation of a microarray dataset [3].  6
Figure 2.2  Filters work independently of the learning algorithm [15].  7
Figure 2.3  Algorithm ReliefF [23].  10
Figure 2.4  Wrappers use the learning algorithm as a black box when evaluating feature subsets [15].  11
Figure 2.5  Embedded methods incorporate the feature subset search and evaluation when building a classifier.  11
Figure 2.6  Illustration of a feature mapping via a kernel function (F).  12
Figure 2.7  Maximum margin hyperplane (H_0).  13
Figure 4.1  IFP algorithm.  39
Figure 5.1  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the colon cancer dataset.  47
Figure 5.2  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the leukemia dataset.  48
Figure 5.3  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the Moffitt colon cancer dataset.  49

Figure 5.4  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the lung cancer dataset.  50
Figure 5.5  Intersection across the entire set of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the colon cancer dataset.  51
Figure 5.6  Intersection across the entire set of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the leukemia dataset.  53
Figure 5.7  Intersection across the entire set of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the Moffitt colon cancer dataset.  54
Figure 5.8  Intersection across the entire set of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the lung cancer dataset.  55
Figure 5.9  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the colon cancer dataset across top 200 features as filtered by t-test.  56
Figure 5.10  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the leukemia dataset across top 200 features as filtered by t-test.  57
Figure 5.11  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the Moffitt colon cancer dataset across top 200 features as filtered by t-test.  58
Figure 5.12  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the lung cancer dataset across top 200 features as filtered by t-test.  59

Figure 5.13  Intersection across the top 200 subset of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the colon cancer dataset.  61
Figure 5.14  Intersection across the top 200 subset of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the leukemia dataset.  62
Figure 5.15  Intersection across the top 200 subset of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the Moffitt colon cancer dataset.  63
Figure 5.16  Intersection across the top 200 subset of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the lung cancer dataset.  65
Figure 7.1  SVM accuracy across the top 200 genes filtered with the t-test, information gain, and ReliefF for the colon cancer dataset.  85
Figure 7.2  SVM accuracy across the top 200 genes filtered with the t-test, information gain, and ReliefF for the leukemia dataset.  86
Figure 7.3  SVM accuracy across the top 200 genes filtered with the t-test, information gain, and ReliefF for the Moffitt colon cancer dataset.  87
Figure 7.4  SVM accuracy across the top 200 genes filtered with the t-test, information gain, and ReliefF for the lung cancer dataset.  88
Figure 7.5  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and information gain (in terms of SVM classifier accuracy) on the colon cancer dataset across top 200 features as filtered by information gain.  90
Figure 7.6  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and ReliefF (in terms of SVM classifier accuracy) on the colon cancer dataset across top 200 features as filtered by ReliefF.  91

Figure 7.7  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and information gain (in terms of SVM classifier accuracy) on the leukemia dataset across top 200 features as filtered by information gain.  92
Figure 7.8  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and ReliefF (in terms of SVM classifier accuracy) on the leukemia dataset across top 200 features as filtered by ReliefF.  93
Figure 7.9  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and information gain (in terms of SVM classifier accuracy) on the Moffitt colon cancer dataset across top 200 features as filtered by information gain.  94
Figure 7.10  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and ReliefF (in terms of SVM classifier accuracy) on the Moffitt colon cancer dataset across top 200 features as filtered by ReliefF.  95
Figure 7.11  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and information gain (in terms of SVM classifier accuracy) on the lung cancer dataset across top 200 features as filtered by information gain.  96
Figure 7.12  Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and ReliefF (in terms of SVM classifier accuracy) on the lung cancer dataset across top 200 features as filtered by ReliefF.  97
Figure 7.13  Comparison of AUC values corresponding to the average accuracy curves of IFP applied across the top 200 genes filtered with the t-test, information gain, and ReliefF for each dataset.  99
Figure 7.14  Comparison of AUC values obtained with SVM-RFE across the top 200 genes filtered with the t-test, information gain, and ReliefF for each dataset.  100
Figure 8.1  IFP bagged accuracy in comparison with IFP and SVM-RFE (highest accuracy achieved) for the colon cancer dataset.  113

Figure 8.2  SVM-RFE bagged accuracy in comparison with SVM-RFE (highest accuracy achieved) for the colon cancer dataset.  114
Figure 8.3  t-test bagged accuracy in comparison with the t-test (in terms of SVM classifier accuracy) and SVM-RFE (highest accuracy achieved) for the colon cancer dataset.  115
Figure 8.4  IFP bagged accuracy in comparison with IFP and the t-test (in terms of SVM classifier accuracy), which achieved the highest accuracy for the leukemia dataset.  116
Figure 8.5  SVM-RFE bagged accuracy in comparison with SVM-RFE and the t-test (in terms of SVM classifier accuracy), which achieved the highest accuracy for the leukemia dataset.  117
Figure 8.6  t-test bagged accuracy in comparison with the t-test (in terms of SVM classifier accuracy) for the leukemia dataset.  118
Figure 8.7  IFP bagged accuracy in comparison with IFP and the t-test (in terms of SVM classifier accuracy), which achieved the highest accuracy for the Moffitt colon cancer dataset.  118
Figure 8.8  SVM-RFE bagged accuracy in comparison with SVM-RFE and the t-test (in terms of SVM classifier accuracy), which achieved the highest accuracy for the Moffitt colon cancer dataset.  119
Figure 8.9  t-test bagged accuracy in comparison with the t-test (in terms of SVM classifier accuracy) for the Moffitt colon cancer dataset.  119
Figure A.1  Flowchart of IFP.  134
Figure B.1  Comparison of resulting average weighted accuracy of IFP, SVM-RFE, and the t-test across the top n features with p values <= 0.01 for the (a) colon, (b) leukemia, and (c) Moffitt colon cancer datasets.  135

An Iterative Feature Perturbation Method for Gene Selection from Microarray Data

Juana Canul-Reich

ABSTRACT

Gene expression microarray datasets often consist of a limited number of samples relative to a large number of expression measurements, usually on the order of thousands of genes. These characteristics pose a challenge to any classification model, as they might negatively impact its prediction accuracy. Therefore, dimensionality reduction is a core process prior to any classification task.

This dissertation introduces the iterative feature perturbation method (IFP), an embedded gene selector that iteratively discards non-relevant features. IFP considers relevant features to be those which, after perturbation with noise, cause a change in the predictive accuracy of the classification model. Non-relevant features do not cause any change in the predictive accuracy in such a situation.

We apply IFP to 4 cancer microarray datasets: colon cancer (cancer vs. normal), leukemia (subtype classification), Moffitt colon cancer (prognosis predictor), and lung cancer (prognosis predictor). We compare results obtained by IFP to those of SVM-RFE and the t-test, using a linear support vector machine as the classifier in all cases.

We do so using the original entire set of features in the datasets, and using a preselected set of 200 features (based on p values) from each dataset. When using the entire set of features, the IFP approach results in comparable accuracy (and higher at some points) with respect to SVM-RFE on 3 of the 4 datasets. The simple t-test feature ranking typically produces classifiers with the highest accuracy across the 4 datasets. When using 200 features chosen by the t-test, the accuracy results show up to 3% performance improvement for both IFP and SVM-RFE across the 4 datasets. We corroborate these results with an AUC analysis and a statistical analysis using the Friedman/Holm test.

Similar to the application of the t-test, we used the methods information gain and ReliefF as filters and compared all three. Results of the AUC analysis show that IFP and SVM-RFE obtain the highest AUC value when applied on the t-test-filtered datasets. This result is additionally corroborated with statistical analysis.

The percentage of overlap between the gene sets selected by any two methods across the four datasets indicates that different sets of genes can and do result in similar accuracies.

We created ensembles of classifiers using the bagging technique with IFP, SVM-RFE, and the t-test, and showed that their performance can be at least equivalent to that of the non-bagging cases, as well as better in some cases.

Chapter 1: Introduction

Gene expression microarray datasets tend to be small in sample size due to the cost associated with the assays. Typically, there are many more gene expression measurements (e.g., 54,000 transcripts) available than samples. Hence, the selection of a subset of genes/features is crucial before building a classifier. Identifying a small number of genes that are good predictors is important from a biological standpoint, as expression experiments are typically performed to generate hypotheses for further experimentation in the lab [1]. For clinical applications, identifying a small number of genes that are important in predicting patient survival time or diagnosing cancer can speed the translation of expression signatures into cost-effective tests for clinical practice. From a machine learning viewpoint, too many features/genes in a dataset can negatively influence the classification performance, as they increase the possibility of overfitting. Therefore, the feature selection process plays a vital role in the building of a successful classifier from microarray datasets.

1.1 Motivation

Feature selection on microarray datasets is primarily conducted to select relevant genes amongst the usually numerous genes present in this type of data [2, 3].

This process aids in other aspects such as general data reduction; performance improvement, since fewer genes lead to a reduced risk of overfitting and learning algorithms are able to work faster with a smaller amount of data; and data understanding and visualization, since, for example, it is simpler to visualize a reduced dataset [2].

On the other hand, most learning algorithms existing before the advent of microarray analysis were not designed to cope with high-dimensional, small-sample-size data. Microarray analysis raises the need for a special feature selection process prior to analysis of the data.

Currently, in bioinformatics the feature selection process has become a prerequisite for model building, since by nature tasks such as sequence analysis, microarray analysis, and spectral analysis deal with high-dimensional data. As a result, diverse feature selection techniques have appeared in this field [4].

Three commonly used approaches for feature selection are filters, wrappers, and embedded methods [4]. Filters typically evaluate each gene in isolation without considering correlation between genes. A filter will rank all genes based on their capability of discriminating the target class, and eventually the top n genes get selected [3]. Examples of univariate filters are the t-test [5] and rank products [6]. Multivariate filters consider interaction between genes; examples are the Markov blanket filter [7] and correlation-based feature selection [8]. Wrappers and embedded feature selection methods are multivariate, since they search for subsets of genes. Examples of wrappers are applied in [9, 10], and examples of embedded methods are applied in [11-13]. A summary of results obtained with these approaches is shown and discussed in Section 7.6.

1.2 Contributions

This dissertation contributes to the state of the art of gene selection methods with the introduction of the Iterative Feature Perturbation (IFP) method. IFP is an embedded gene selector with the capability of using any classification algorithm as the base classifier. All experiments comparing IFP with Recursive Feature Elimination (SVM-RFE) and the t-test feature ranking are conducted on four microarray cancer datasets using an SVM as the base classifier in all cases. Experiments include, first, the use of the entire set of genes of each dataset, and second, the use of a subset of preselected genes ranked on the basis of their p values (from the t-test). We show that accuracy improvements can be obtained as a result of gene preselection. For both scenarios, statistical analysis is conducted to determine the significance level of differences detected in the accuracy results. The Area Under the accuracy Curve (AUC) from each method is used as a measure for comparison between methods as well.

The intersection or amount of overlap between genes selected by two methods at points where similar accuracies are observed is analyzed, and the conclusion is made that different sets of genes can lead to the same or similar accuracies. It is also true that a high percentage of overlap between genes selected does not guarantee similar performance. This suggests better feature selection methods that find unique sets, or a better biological model, are needed.

Experiments are conducted using filters such as information gain and ReliefF in addition to the t-test. Performance comparisons of IFP and SVM-RFE applied to the 12 filtered datasets (4 datasets and 3 filters) are made in terms of the AUC measure. Statistical analysis is conducted on the accuracy results.

Based on our results, we conclude that the t-test can be at least as accurate as other filter methods, and we suggest that it should be considered for feature selection on microarray data. It provides a simple approach for comparison purposes.

Ensembles of classifiers typically improve performance. We show that accuracy improvements can be attained by building ensembles of 30 bagged classifiers for each of IFP, SVM-RFE, and the t-test using SVMs.

1.3 Organization

This dissertation is presented as follows. In Chapter 2, background concepts related to the topics of microarray data and feature selection, as well as support vector machines, are discussed. Chapter 3 presents some feature selection techniques commonly used in the microarray domain. Chapter 4 introduces the Iterative Feature Perturbation method. Chapter 5 provides a thorough description of experimental studies conducted using IFP, SVM-RFE, and the t-test for feature selection across four microarray cancer datasets. Results are presented in terms of accuracy, overlap of genes selected between each pair of methods, as well as in terms of the AUC measure, for two scenarios: 1) using the entire set of genes of each dataset and 2) using prefiltered datasets. Chapter 6 shows the statistical analysis conducted on results presented in Chapter 5. Chapter 7 presents results of experiments conducted using information gain and ReliefF as filter methods in comparison with the t-test. Chapter 8 discusses the use of bagging for an ensemble approach including 30 classifiers for each method: IFP, SVM-RFE, and the t-test. Chapter 9 summarizes the work presented and outlines future directions.

Chapter 2: Background

This chapter reviews some concepts related to the investigated problem. Microarray data, feature selection, the classification of feature selection methods, and support vector machines are briefly described. The three filters later used in experiments are described here.

2.1 Microarray Data

Microarray analyses are motivated by the search for useful patterns for tumor classification, disease state classification, and discovery of new subtypes of disease or disease states, among others. Microarray data are characterized as having a large number of features/genes, typically on the order of thousands, in contrast to a small number of samples, usually on the order of tens or hundreds, and redundancy among genes. A matrix representation of a dataset is shown in Figure 2.1. Each row represents a sample and each column a gene. Each entry shows a number which is the level of expression of a particular gene in a particular sample. The last column to the right shows the label or class that each sample belongs to. Microarray studies often involve the use of machine learning methods such as unsupervised and supervised learning. Unsupervised learning finds subgroups in the data based on a similarity measure between the expression profiles.

Figure 2.1: Matrix representation of a microarray dataset [3].

Unsupervised learning does not consider any prior knowledge or classification information to accomplish its goal; that is, it does not use the class column of a dataset. As a result of unsupervised learning, unknown subtypes of tumors may be found. Supervised learning, on the other hand, starts with sets of samples known to be associated with a particular disease or diseases (it does require a class column in the dataset), and searches for a pattern of expression or rules. These rules help classify unseen samples [14].

2.2 Feature Selection

The high dimensionality and redundancy of microarray datasets make the classification task challenging [3]. The problem is that some learning algorithms may perform poorly when dealing with data that has a number of irrelevant features [15]. So, in microarray applications there is a need to select a subset of features (genes) to be used by the classification algorithm when creating a model, such that a model that does not overfit and achieves the highest accuracy possible is chosen. Also, learning algorithms work faster on datasets with fewer features.

Feature selection methods can be categorized into filters, wrappers, and embedded methods [4, 13].

2.2.1 Filters

Filters select features based on a measure/score individually obtained on each feature (univariate case) [15]. Low-scoring features are then discarded and only a subset of selected features is given as input to the learning algorithm. Filters do not incorporate the learning algorithm in the feature subset search; they only look at properties of the data. In other words, filters select subsets of features independently of the learning algorithm, as illustrated in Fig. 2.2. Advantages of filters are that they are computationally efficient enough to deal with very high dimensional datasets, and they are simple to compute. A disadvantage is that filters do not account for feature dependencies. Multivariate filters aim at incorporating feature dependencies [4, 16].

Figure 2.2: Filters work independently of the learning algorithm [15].

Three filters are used in a portion of the experiments conducted in this work and are described below: the Student's t-test, information gain, and ReliefF.

2.2.1.1 Student's T-Test

The t-test [5, 17, 18] is a statistical hypothesis test used to determine the significance of the difference between the means of two independent samples. It assumes normally distributed populations. For unequal variances and unequal (possibly equal) sample sizes, the t statistic is calculated as follows:

t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

where \bar{X}_1 and \bar{X}_2 are the means of the two samples, s_1^2 and s_2^2 are the variance estimates of the two samples, and n_1 and n_2 are the sample sizes of the two samples.

The degrees of freedom df can be calculated as

df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2 / n_1)^2}{n_1 - 1} + \frac{(s_2^2 / n_2)^2}{n_2 - 1}}

The t distribution can be used with the t statistic and df degrees of freedom as parameters to calculate the corresponding p value. In this work we use an R library [19] that implements the t distribution. When the t-test is used as a filter, a p value for each feature in the dataset is generated. Features are then ranked according to p values. The smaller the p value, the more relevant the feature.
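As an illustration only (our sketch, not the dissertation's code, which relies on an R implementation; X, y, and the 200-gene cutoff are hypothetical placeholders), the same filter can be written in a few lines of Python, where scipy's ttest_ind with equal_var=False computes the unequal-variance t statistic and its two-sided p value:

import numpy as np
from scipy.stats import ttest_ind

def ttest_rank(X, y):
    """Rank genes by increasing p value (most relevant first).
    X: (n_samples, n_genes) expression matrix; y: binary class labels."""
    group1, group2 = X[y == 0], X[y == 1]
    # equal_var=False gives the unequal-variance t statistic and the
    # Welch-Satterthwaite degrees of freedom shown above.
    _, pvals = ttest_ind(group1, group2, equal_var=False)
    return np.argsort(pvals), pvals

# e.g., keep the 200 genes with the smallest p values:
# order, pvals = ttest_rank(X, y); top200 = order[:200]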

2.2.1.2 Information Gain

The information gain measure from information theory was described by Quinlan in [20]. This filter evaluates the worth of a feature/gene by measuring the information gain with respect to the class. Features can be ranked in decreasing information gain order, and those with the largest information gain are selected. Information gain is defined by [20-22]

InfoGain(X|Y) = H(X) - H(X|Y)

where H(X) is the entropy of X, defined as

H(X) = -\sum_i P(x_i) \log_2 P(x_i)

where P(x_i) is the prior probability of each value of X, and H(X|Y) is the entropy of X given the values of variable Y, defined as

H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j)

where P(x_i|y_j) is the posterior probability of x_i given the value of y_j.

In terms of classes and attributes, the information gain measure is expressed as

InfoGain(Class|Attribute) = H(Class) - H(Class|Attribute).
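As a concrete illustration (our sketch, not code from this work; names are hypothetical), the information gain of a single already-discretized attribute can be computed as follows:

import numpy as np

def entropy(labels):
    """H(X) = -sum_i P(x_i) log2 P(x_i), estimated from value counts."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(attribute, classes):
    """InfoGain(Class|Attribute) = H(Class) - H(Class|Attribute)."""
    h_cond = 0.0
    for value in np.unique(attribute):
        mask = attribute == value
        h_cond += mask.mean() * entropy(classes[mask])  # P(y_j) * H(Class|y_j)
    return entropy(classes) - h_cond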

2.2.1.3 ReliefF

The ReliefF algorithm was described by Kononenko in [23, 24]. It is used to estimate the quality of attributes based on the criterion of how well their values distinguish among instances of different classes. ReliefF estimates probabilities more reliably and is able to deal with incomplete (missing values) and multi-class datasets.

Given an instance (and for each training instance), ReliefF searches for its k nearest neighbors: k nearest hits (instances from the same class) and k nearest misses (instances from each different class), and averages the contribution of all k nearest hits/misses. k is a user-defined value which, as proposed by the authors in [23], can be used with a default value of k = 10 with satisfactory results (k = 10 was used for experiments with ReliefF in this work). The average contribution of all near misses is weighted with the prior probabilities of each class. The function diff(Attribute, Instance1, Instance2) calculates the difference between the values of Attribute for two instances. Feature weights are calculated as shown in Fig. 2.3, and they are estimates of the quality of features (attributes). The logic behind the weight formulation is that a good attribute should have the same value for instances from the same class and should differentiate between instances from different classes. The pseudocode for ReliefF is shown in Fig. 2.3.

set all weights W[A] := 0.0;
for i := 1 to n do
    randomly select an instance R;
    find k nearest hits H_j;
    for each class C != class(R) do
        find k nearest misses M_j(C);
    end for
    for A := 1 to #attributes do
        W[A] := W[A] - \sum_{j=1}^{k} diff(A, R, H_j) / (n k)
                + \sum_{C != class(R)} \frac{P(C)}{1 - P(class(R))} \sum_{j=1}^{k} diff(A, R, M_j(C)) / (n k);
    end for
end for

Figure 2.3: Algorithm ReliefF [23].
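A simplified Python sketch of Fig. 2.3 for the two-class, complete-data case follows (our illustration, not the implementation used in this work; it visits every instance instead of sampling, and with two classes the miss weight P(C)/(1 - P(class(R))) reduces to 1):

import numpy as np

def relieff(X, y, k=10):
    """Estimate attribute quality weights; higher weight = better attribute."""
    n, n_attr = X.shape
    span = X.max(axis=0) - X.min(axis=0)  # normalizes diff() to [0, 1]
    span[span == 0] = 1.0
    w = np.zeros(n_attr)
    for i in range(n):                       # each instance plays the role of R
        d = (np.abs(X - X[i]) / span).sum(axis=1)
        d[i] = np.inf                        # exclude R itself
        hits = np.argsort(np.where(y == y[i], d, np.inf))[:k]
        misses = np.argsort(np.where(y != y[i], d, np.inf))[:k]
        w -= (np.abs(X[hits] - X[i]) / span).sum(axis=0) / (n * k)
        w += (np.abs(X[misses] - X[i]) / span).sum(axis=0) / (n * k)
    return w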

2.2.2 Wrappers

For wrappers, the feature subset search incorporates a learning algorithm to assess diverse feature subsets, as shown in Fig. 2.4. The subset with the best assessment is chosen [15]. A search algorithm searches through the search space for the optimal feature subset according to some criteria. However, as the number of features increases, the number of possible feature subsets grows exponentially. Heuristic search methods can be used to guide the search. An advantage of wrappers is that they account for feature dependencies. A disadvantage is that wrappers are computationally intensive, as they require training a model for each potential feature subset [2, 4].

Figure 2.4: Wrappers use the learning algorithm as a black box when evaluating feature subsets [15].

2.2.3 Embedded Methods

Embedded methods incorporate the feature subset search and evaluation in the process of building a classifier [2], as shown in Fig. 2.5. The search is guided by the learning algorithm. Embedded methods are computationally less intensive than wrappers, and they account for feature dependencies.

Figure 2.5: Embedded methods incorporate the feature subset search and evaluation when building a classifier.

2.3 Support Vector Machines

Data with the characteristics of high dimensionality and a small number of samples can be handled with SVMs [25], and linear SVM kernels in particular have been used for microarray data [11], [26], [27]. SVMs were introduced by Vapnik [28]. A thorough description of SVMs can be found in [29, 30].

SVMs map 2-class training data, termed the input space, into a higher dimensional space, termed the feature space, by applying a kernel function (see Fig. 2.6), and find a maximum margin hyperplane in the feature space that separates the data into the two classes. The maximum margin hyperplane has the largest distance from the hyperplane to the closest training points [26]. By maximizing the margin between the two classes, the classification performance improves [31].

Figure 2.6: Illustration of a feature mapping via a kernel function (F). The hyperplane in the feature space corresponds to a non-linear decision boundary in the input space [21, 32].

Fig. 2.7 shows an example of a maximum margin hyperplane. Support vectors are the training samples which define the margin boundaries. SVM training has the form of a quadratic programming dual problem [34]:

Figure 2.7: Maximum margin hyperplane (H_0). H_1 and H_2 are the margin boundaries. Support vectors are the training samples that fall on H_1 and H_2 [33].

maximize:

\sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)    (2.1)

subject to:

0 \le \alpha_i \le C    (2.2)

\sum_{i=1}^{l} \alpha_i y_i = 0    (2.3)

where l is the number of training samples, y_i is the class/label (+1 for positive and -1 for negative) of the i-th training sample x_i, and K(x_i, x_j) represents the value of the kernel function for the i-th and j-th samples.

C is the regularization parameter chosen by the user and represents the penalty on errors. When C is set to small values, errors are allowed and a much larger margin is obtained by the SVM [27]. The prediction of the SVM for a sample x is

f(x) = sign\left( \sum_{i=1}^{l} \alpha_i y_i K(x, x_i) + b \right)    (2.4)

where the scalar value b and the vector of alphas \alpha (with l elements) are determined by the quadratic optimization problem.

The weight of each feature [29] is calculated as follows:

w = \sum_{i=1}^{nSV} \alpha_i y_i x_i    (2.5)

where nSV is the number of support vectors, which are the only training samples with nonzero alpha values; y_i is the label/class (+1/-1) of the i-th support vector; \alpha_i is a positive real value given by the SVM model to the i-th support vector, indicating its contribution to the margin; and x_i is the gene/feature vector of the i-th support vector.

The nonlinear mapping of the input space into the feature space simply requires the evaluation of dot products between the samples in the input space, without the need of visualizing an image of the actual feature space [32]:

K(x_i, x_j) = (\Phi(x_i) \cdot \Phi(x_j))    (2.6)

SVMs can be used with the following kernels (K in Eq. 2.6):

Polynomial:

K(x_i, x_j) = (x_i \cdot x_j)^d    (2.7)

Linear, which is a particular case of the polynomial kernel with d = 1.

Radial basis function:

K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))    (2.8)

Sigmoid:

K(x_i, x_j) = \tanh(\kappa (x_i \cdot x_j) + \Theta)    (2.9)
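For the linear kernel, the weight vector of Eq. 2.5 is exposed directly by common SVM libraries. A hypothetical scikit-learn sketch (not code from this dissertation) computes the per-gene weights, together with the w_i^2 ranking criterion used later by SVM-RFE (Section 3.3.1):

import numpy as np
from sklearn.svm import SVC

def linear_svm_weights(X, y, C=1.0):
    """X: (n_samples, n_genes); y in {-1, +1}. Returns w and w**2."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_.ravel()  # Eq. 2.5, computed by the library
    # Equivalent manual sum over support vectors: alpha_i * y_i * x_i
    w_manual = (clf.dual_coef_.ravel()[:, None] * clf.support_vectors_).sum(axis=0)
    assert np.allclose(w, w_manual)
    return w, w ** 2  # w**2 is the SVM-RFE ranking criterion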

Chapter 3: Feature Selection Methods for Microarray Data

This chapter reviews some feature selection techniques commonly used in the microarray domain, classified as filters, wrappers, and embedded methods [4].

3.1 Filters

3.1.1 Student's T-Test

The t-test is a well-known statistical approach frequently applied in microarray data analysis [35]. In [5] the t-test was used to filter gene data on four microarray datasets, including the colon cancer [36] and leukemia [37] datasets used in this work. Different p value thresholds, along with the number of features selected under each threshold, were analyzed to determine the number of features they would experiment with. Feature sets of 6, 27, and 53 features for the leukemia dataset, and of 7, 27, and 54 features for the colon cancer dataset, were chosen to create reduced-in-dimension datasets for their experiments. The feature sets selected corresponded to a p value threshold of p <= 0.01. Four classification algorithms were applied to each of the reduced datasets: linear genetic programs, Multivariate Regression Splines (MARS), Classification and Regression Trees (CART), and Random Forests.

50% of the samples of the colon cancer dataset were used as the train set and the remaining 50% as the test set. The original split was used for the leukemia dataset: 37 training samples and 38 testing samples. No data preprocessing was described throughout the paper for any dataset.

The best results reported in [5] were obtained with linear genetic programs across all datasets for different numbers of features. Results were reported in terms of accuracy per class. So, for comparison purposes with our results for these two datasets, their corresponding weighted accuracies were calculated. For the colon cancer dataset, they achieved 85.91% test set weighted accuracy with 7, 27, and 54 features. We reported 86.57% average 10-fold cross validation weighted accuracy for 12 features with IFP on experiments that started with the entire set of 2000 features (see Fig. 5.1). For the 200-gene dataset filtered with the t-test, IFP had an 87.93% average 10-fold cross validation weighted accuracy with 23 features (see Fig. 5.9). The t-test, in terms of SVM classifier accuracy, reached 89.20% with 24 features (see Fig. 5.1).

The result reported in [5] for the leukemia dataset was 100% test set weighted accuracy with 7, 27, and 54 features. Our average 10-fold cross validation weighted accuracy was 96.56% for 213 features with IFP (see Fig. 5.2). These accuracies are not directly comparable, since one is test set accuracy and the other is average 10-fold cross validation weighted accuracy.

3.1.2 Rank Products

In [6] a technique based on the calculation of rank products (RP) for detecting differentially expressed genes was introduced.

The technique originated from a biological rationale: by manually observing a two-color microarray experiment comparing mRNA levels under conditions A and B on one slide, a biologist will know which genes are up- or down-regulated under each condition. With noisy data, these results would not be so reliable. However, if after a number of replicate experiments the same genes appear at the top of each list of differentially expressed genes, then the confidence in such observations would increase.

The rank products method establishes that for each gene g in k replicates, each examining n_i genes, the combined probability for the gene to appear at the top of each list can be calculated as a rank product

RP_g^{up} = \prod_{i=1}^{k} (r_{i,g}^{up} / n_i)

where r_{i,g}^{up} is the position of gene g in the list of genes in the i-th replicate sorted by decreasing fold change; that is, r^{up} = 1 corresponds to the most strongly upregulated gene. Eventually, the genes with the smallest RP values can be selected for biological significance.
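A minimal Python sketch of the rank product computation follows (our illustration, assuming per-replicate fold changes are already available and every replicate examines the same n genes; names are hypothetical):

import numpy as np

def rank_products(fold_changes):
    """fold_changes: (k_replicates, n_genes). Smaller RP = more consistently
    upregulated across replicates."""
    k, n = fold_changes.shape
    order = np.argsort(-fold_changes, axis=1)    # rank 1 = most upregulated
    ranks = np.empty_like(order)
    ranks[np.arange(k)[:, None], order] = np.arange(1, n + 1)
    return np.prod(ranks / n, axis=0)            # RP_g = prod_i (r_i,g / n_i)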

A gene that is highly likely to be detected as differentially expressed was defined as a true positive gene. Three datasets were used in [6] for experiments, including the leukemia dataset used in our work, for which in [6] a constant was added to each gene expression value to make the smallest value one. Then the normalization process followed. It was argued in [6] that by analyzing the entire dataset a good set of true positive genes would result. It was also considered that a good algorithm would give good ranks to these true positive genes even when using only a portion of the dataset. Given these arguments, three random subsets of the leukemia dataset were chosen. Each subset consisted of three ALL and three AML samples, which were analyzed with SAM [38] (significance analysis of microarrays) and RP. The experiment consisted of looking at the ranks given by SAM to the top 25 upregulated genes, and then comparing them with the ranks given by RP to this same subset of genes. The results indicated that, using the entire dataset, RP and SAM agreed on the ranks. However, RP outperformed SAM when the subsets of samples were used: RP assigned ranks about twice as good as those assigned by SAM to these 25 genes.

Our work with the algorithm IFP is more oriented towards accuracy analysis during the gene selection process. IFP considers the accuracy change caused by a gene when noise is added to it as a determinant factor of its gene relevance. IFP outputs a rank-ordered list based on this relevance criterion. On the other hand, RP emphasizes how consistently the rank list matches the set of most upregulated genes.

3.1.3 Markov Blanket and ROC Curves

In [7] a multivariate filter method for feature selection was introduced which was based on a Markov blanket and the use of ROC (Receiver Operating Characteristic) curves [39]. It was designed for 2-class microarray datasets. The method was called FROC. It took advantage of the non-parametric property of ROC curves: the area under the ROC curve is related to the non-parametric hypothesis testing method of Mann-Whitney-Wilcoxon [40], which makes ROC curves suitable as a measure for feature selection from high dimensional datasets with few samples.

The Markov blanket of a feature F_i is a set of features highly correlated with F_i. The idea consists of safely removing a feature for which a Markov blanket is found in the current feature subset. The Markov blanket is able to capture features that are irrelevant to the target class and those redundant given other features. However, a disadvantage of filtering based on the Markov blanket is that it will not remove a feature for which a Markov blanket is not found, even when the feature is not relevant to the target class [7].

The method FROC proposed in [7] was a two-step procedure. The first step was ROC-curve-based one-gene-at-a-time filtering. It consists in using the ROC curve to check the relevance of a feature. For each feature, an ROC curve is generated showing the ratio of the number of samples in one class to that of samples in the other class at a certain cut-off value. The ARD (the area between the ROC curve and the diagonal line) is calculated. This process ends up with an ARD value for each feature, which is sorted such that the most relevant features are on top. The top features constitute the initial feature subset selected.

The second step was ROC-curve-based Markov blanket filtering. Iteratively, redundant features are eliminated from the initial feature subset using the Markov blanket approach based on ROC curves.
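The ARD is closely related to the ROC AUC: when a feature's ROC curve stays on one side of the diagonal, the ARD equals |AUC - 0.5|. Under that simplifying assumption, step one can be sketched in Python (our reconstruction, not the authors' code; X and y are hypothetical placeholders):

import numpy as np
from sklearn.metrics import roc_auc_score

def ard_rank(X, y):
    """X: (n_samples, n_genes); y: binary labels. Most relevant genes first."""
    ard = np.array([abs(roc_auc_score(y, X[:, j]) - 0.5)
                    for j in range(X.shape[1])])
    return np.argsort(-ard), ard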

Experiments conducted in [7] involved five two-class microarray datasets taken from ONCOMINE [41], a cancer microarray database. FROC results were compared to those of two other feature selection methods: 1) the t-test, which ranks features by the relevance of each feature to the class, with the top n of them constituting the final feature subset selected; and 2) a two-step feature selection procedure, in which, first, the information gain of all features is calculated and the features with the highest information gain are selected for a chosen intermediate feature subset, and second, Markov blanket filtering is applied to the intermediate feature subset to find the final feature subset selected. Decision trees and a linear SVM were used as learning algorithms with the datasets containing only the final feature subsets selected (of size 100 and 50).

Five-fold cross validation repeated five times was the evaluation method used for estimation of predictive performance, and the average accuracies were reported. Results showed a significant performance advantage of FROC over competing methods in most cases across all five datasets. The performance difference between FROC and the t-test was greater than that between FROC and the two-step feature selection procedure. With FROC, the results were almost the same for 100 and 50 features. The two feature sets (of 100 and 50 features) were selected using all the data.

In our work we experimented with datasets using the entire set of features, as well as datasets using a reduced feature subset (filtered datasets). In both cases, accuracies were calculated across different numbers of features. FROC is a filter method that considers interaction between features via Markov blanket filtering (as a filter it does not involve any classifier for feature ranking). IFP is an embedded gene selector whose feature removal criterion considers interaction between features as well. It does involve a classifier as the base learning algorithm.

3.1.4 Information Gain, ReliefF, and Correlation-Based Feature Selection

In [8] the authors investigated the phenomenon of information extraction and dimensionality reduction on microarray data by using filters and wrappers. The filters applied in their experiments were the χ² statistic, information gain, symmetrical uncertainty, ReliefF, and correlation-based feature selection (CFS), on two microarray datasets: the acute leukemia data (the same we use in our work) and the diffuse large B-cell lymphoma data.

We put emphasis on the performance of information gain and ReliefF because we use them in our work, and on CFS because it is a filter commonly used in microarray data analysis [4]. The first two were described in Sections 2.2.1.2 and 2.2.1.3, respectively. The CFS filter is briefly described here.

CFS evaluates feature subsets. For CFS, a good feature subset contains features highly correlated with the class and uncorrelated with the rest of the features in the evaluated subset. The merit of a subset is given by

CFS_S = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \overline{r_{ff}}}}

where CFS_S is the score of a feature subset S containing k features, \overline{r_{cf}} is the average feature-to-class correlation (f ∈ S), and \overline{r_{ff}} is the average feature-to-feature correlation [8].
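A small Python sketch of this merit computation for one candidate subset follows (an illustration only; absolute Pearson correlations stand in for the correlation measure of [8]):

import numpy as np

def cfs_merit(X, y, subset):
    """X: (n_samples, n_genes); subset: indices of the genes in S."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k > 1:
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                        for i, a in enumerate(subset) for b in subset[i + 1:]])
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)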

Wrappers in [8] were used with decision trees (J48) and Naive Bayes as base classifiers, with best-first search as the search method, which was used with CFS as well.

In their experiments, the top 10 genes using the filters χ², information gain, symmetrical uncertainty, and ReliefF were selected on both datasets. Expression data were discretized for the experiments with χ², information gain, and symmetrical uncertainty. The rankings of these 10 genes were similar between information gain, χ², and symmetrical uncertainty, while ReliefF produced a different ranking for these genes on both datasets. They argued that this dissimilarity was due to ReliefF being sensitive to feature interactions while the other filters are not. In our work, ReliefF was the filter which led to the least accuracy improvement across our four datasets, as shown in Table 7.3.

CFS, J48 wrappers, and Naive Bayes wrappers were evaluated within a leave-one-out cross validation process. CFS selected the same one gene 34 times out of 38 runs, which was also the result with the J48 wrapper. The Naive Bayes wrapper selected that same gene 28 times out of 38 runs. The gene in question was given a high rank by the filters χ², information gain, symmetrical uncertainty, and ReliefF. The authors' experimental results showed that genes selected by either filters or wrappers, and CFS, on the analyzed datasets led to classifiers of similar performance. So they recommended the use of filters and CFS for fast analysis of data.

We assume the entire set of features of each dataset was used, and experiments were performed with filters and wrappers for finding good feature subsets. In our work, the use of filters is focused, on one hand, on dimensionality reduction prior to the application of the embedded feature selection algorithms we analyze, such as IFP and the counterpart SVM-RFE.

On the other hand, we do calculate the performance of filters as feature selectors, and compare their performance in terms of SVM classifier accuracy against those of IFP and SVM-RFE. Also, in our work we show results in terms of average (over five 10-fold cross validation processes) weighted accuracy achieved by IFP across each number of features, in comparison with those of SVM-RFE. Statistical significance analysis is also used to compare differences.

3.2 Wrappers

3.2.1 Sequential Search

A wrapper sequential search approach for gene selection was introduced in [9]. The search component followed a sequential forward selection criterion: a hill-climbing deterministic search algorithm [42] that starts with an empty subset of genes and continues adding genes one at a time until no performance improvement is observed. The wrapper method introduced did not require as input a specified number of genes to look for; this number was determined by the search procedure itself.
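As an illustration of this style of search (our sketch, not the code of [9]; a 1-nearest-neighbor learner stands in for IB1, and k-fold cross validation replaces their LOOCV for brevity):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_select(X, y, cv=5):
    """Hill climbing: add one gene at a time while CV accuracy improves."""
    est = KNeighborsClassifier(n_neighbors=1)
    selected, best = [], 0.0
    remaining = set(range(X.shape[1]))
    while remaining:
        scores = {j: cross_val_score(est, X[:, selected + [j]], y, cv=cv).mean()
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:   # stop: no performance improvement
            break
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best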

A comparison was made in [9] between the performance of two filter methods (P-metric and t-test) and the wrapper approach using four classification algorithms: IB1 (nearest neighbor), Naive Bayes, C4.5 (decision tree), and CN2 (rule induction). The numbers of genes specified to be selected by the filters were 3, 5, 10, and 20. The experiments conducted involved two microarray cancer datasets: the colon cancer [36] and leukemia [37] datasets. The leave-one-out cross-validation (LOOCV) method was used for performance estimation of three approaches: using filters, using the wrapper, and the no-gene-selection case. In the case of the wrapper, the LOOCV was performed only with the gene subset selected by the search procedure. A paired t-test was conducted to determine the statistical significance of accuracy differences between the no-gene-selection case and the use of filters and of the wrapper approach.

According to the results reported, accuracy improvements in favor of the wrapper approach were observed in both datasets when compared with the no-gene-selection case. These accuracy differences were found to be statistically significant at the 5% confidence level, except for the use of C4.5 on the colon cancer dataset. Statistically significant differences were also found when the wrapper approach was compared with most of the filter methods on both datasets, except for IB1, for which no statistically significant differences were found on either dataset. In both datasets, the number of genes selected by the wrapper approach ranged between 2 and 4 across the four classification algorithms.

In our work, all conducted experiments followed a backward elimination approach in the gene removal process. That is, the selection process started with the entire set of genes in the datasets, and non-relevant genes were removed iteratively. The average over five runs of a 10-fold cross-validation process was used as the measure for performance estimation.

3.2.2 Genetic Algorithms and Local Search

A wrapper-filter feature selection algorithm (WFFSA) using a memetic framework was introduced in [10]. A memetic framework is a combination of genetic algorithms (GA) and local search. The purpose of WFFSA was to improve classification performance and speed up the search in identifying important feature subsets. Features are added to or deleted from each feature subset based on a filter method ranking.

At the start of the procedure, an initial GA population of feature subsets is randomly generated. Each candidate feature subset represents a chromosome (selected features have a value of 1 and excluded features a value of 0). All candidate feature subsets are evaluated using a classification algorithm. As a result, all or a number of these feature subsets are chosen to undergo a local improvement process. After the local improvement process completes, genetic operators based on selection, crossover, and mutation are used to generate the next population. This process is repeated until stopping conditions are met.

In the local improvement process, features are added or deleted from the feature subset (chromosome) as follows. A filter method is used to rank all selected features and all excluded features. The selected feature with the lowest rank is marked as excluded (feature deletion). Similarly, the excluded feature with the highest rank is marked as selected (feature addition). The improved feature subset (chromosome) is evaluated with a classification algorithm and replaces the original if higher accuracy is obtained.

The experiments conducted in [10] included the colon cancer and leukemia datasets, the same datasets we use in our work. Three filter methods, ReliefF, gain ratio, and χ², were used in the local improvement process. The one nearest neighbor classifier (1NN) and leave-one-out cross validation (LOOCV) were used to evaluate all improved feature subsets (chromosomes). Due to the stochastic nature of the GA and WFFSA, the accuracy results reported correspond to the average over ten independent runs. Performance comparisons were made between the GA, WFFSA, and using the filter methods for feature selection. The results indicated that, in terms of classification accuracy, WFFSA outperformed all three filter methods and the GA. On the leukemia dataset, WFFSA showed better classification accuracy and used less than one-third of the features required by GA and the filter methods.

As a wrapper, WFFSA evaluates many possible feature subsets using a classification algorithm, which makes it computationally intensive. In addition to finding a feature subset with high accuracy, WFFSA results in feature subsets with few genes. In contrast, IFP as an embedded gene selector does not require evaluation of as many feature subsets. Also, in our experiments we showed average weighted accuracy results not only for a particular feature subset but also across each number of features.

3.3 Embedded

3.3.1 Recursive Feature Elimination (SVM-RFE)

A method using the SVM algorithm as the base learner is recursive feature elimination for support vector machines (SVM-RFE), introduced in [11]. It is an embedded selector that follows a backward elimination approach. It ranks the features according to their weights, which are calculated from the support vectors given by the SVM model. See Section 2.3 for a detailed description of SVMs.

The iterative procedure called Recursive Feature Elimination consists of:

1. Training the classifier using the SVM learning algorithm.
2. Calculating the ranking criterion for all features, which is equivalent to the square of the weight of each feature calculated with Eq. 2.5.
3. Removing the feature with the lowest ranking criterion.
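A minimal Python sketch of this loop follows (our illustration, removing one gene per iteration; scikit-learn's sklearn.feature_selection.RFE packages the same procedure, and C = 100 mirrors the setting used in [11]):

import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, C=100.0):
    """Return gene indices ordered from least to most relevant."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while remaining:
        clf = SVC(kernel="linear", C=C).fit(X[:, remaining], y)     # step 1
        criterion = clf.coef_.ravel() ** 2                          # step 2
        eliminated.append(remaining.pop(int(np.argmin(criterion)))) # step 3
    return eliminated  # reverse it for a most-relevant-first ranking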

The colon cancer [36] and leukemia [37] datasets were used for experiments in [11]. Due to the linear separability of these datasets, the SVM model should be insensitive to the value of parameter C. The authors used C = 100 and a linear kernel.

The datasets were preprocessed differently. For the colon cancer dataset, the logarithm of each value was calculated, sample vectors and feature vectors were normalized, and the result was passed through a squashing function to decrease the importance of outliers. The normalization consisted in subtracting the mean over all training values and dividing by the corresponding standard deviation. For the leukemia dataset, the mean value of a gene was subtracted from each gene expression value and the result was divided by its standard deviation.

Leave-one-out cross validation was the method used for calculation of the generalization error on both datasets. For the colon cancer dataset, SVM-RFE resulted in an accuracy of 98% using 4 genes. For the leukemia dataset, an accuracy of 100% was achieved using 2 genes.

SVM-RFE is the main method our IFP algorithm was compared against. SVM-RFE works only with support vector machines, while IFP can be applied to other learning algorithms. In our work, the accuracy results were reported as average weighted accuracy calculated over five runs of 10-fold cross validation processes, as described in Section 5.6. All experiments conducted for algorithm IFP were completed for SVM-RFE as well. Our results, including such comparisons, are described throughout Chapters 5, 6, 7, and 8.

3.3.2 Random Forests

A gene selection and classification approach for microarray data using random forests was introduced in [12]. The objective was to identify the smallest possible sets of genes that would allow good predictive performance. The random forests approach [43] is a classification algorithm essentially consisting of an ensemble of classification trees. Each classification tree is built on a bootstrap sample of the data and is not pruned. The features present in the bootstrap are randomly chosen from the original feature set, and features are randomly selected at each split.

Experiments in [12] were conducted on both real and simulated microarray datasets. Real datasets included the colon [36] and leukemia [37] datasets, which were used in our work. The prediction error rate was calculated using the 0.632+ bootstrap method with 200 bootstrap samples.

Results of random forests were given for two scenarios: 1) using a fixed number of genes, for which the 200 genes with the largest F-ratio from each dataset were used, and 2) with gene selection, which consisted in iteratively eliminating a 0.2 fraction of the least important genes used in the previous iteration (therefore it followed a backward elimination approach). Forests of each iteration were examined. The importance of a gene was determined by the decrease in classification accuracy obtained when values of a gene in a node of a tree were randomly permuted.
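A hypothetical Python sketch of the scenario 2 procedure follows (our illustration; impurity-based importances stand in for the permutation importance used in [12]):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_backward_elimination(X, y, drop_fraction=0.2, min_genes=2):
    """Repeatedly drop the drop_fraction least important genes; one gene
    subset (and one forest) per iteration is kept for later examination."""
    genes = np.arange(X.shape[1])
    subsets = [genes.copy()]
    while len(genes) > min_genes:
        rf = RandomForestClassifier(n_estimators=500).fit(X[:, genes], y)
        keep = max(min_genes, int(len(genes) * (1 - drop_fraction)))
        if keep == len(genes):
            break
        genes = genes[np.argsort(rf.feature_importances_)[-keep:]]
        subsets.append(genes.copy())
    return subsets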

Error rates for scenario 1 were compared to those of alternative methods, including support vector machines (SVM), diagonal linear discriminant analysis (DLDA), and K nearest neighbor (KNN), which were estimated using the 0.632+ bootstrap method in all cases.

Results of random forests in [12] shown for scenario 1 indicated an error rate of 0.127 (0.873 accuracy) for the colon cancer dataset, and of 0.051 (0.949 accuracy) for the leukemia dataset. SVM had an error rate of 0.147 (0.853 accuracy) for the colon cancer dataset and of 0.014 (0.986 accuracy) for the leukemia dataset. The authors concluded random forests had comparable predictive performance to that of alternative methods.

For scenario 2, the minimum error rate shown was 0.159 (0.841 accuracy) for the colon cancer dataset, and 0.087 (0.913 accuracy) for the leukemia dataset, with 14 and 2 genes respectively.

In our work, we did use a backward elimination approach just as in the aforementioned scenario 2, only our gene removal proportion was different. Our generalization error for each feature set was estimated as an average over five 10-fold cross validation processes, vs. the 0.632+ bootstrap error rate they reported. The error rate of IFP applied on the 200-gene dataset filtered with the t-test was 0.149 (0.851 accuracy) for 200 genes, and 0.121 (0.879 accuracy) for 23 genes (see Fig. 5.9) for the colon cancer dataset. Similarly, for the leukemia dataset (see Fig. 5.10) the error rate was 0.047 (0.953 accuracy) for 200 genes, and 0.035 (0.965 accuracy) for 75 genes.

The authors observed in their experiments the multiplicity problem: gene selection on microarray data can result in a number of solutions that are equivalent in terms of predictive performance. This phenomenon was discussed in our results described in Section 5.6.2.

3.3.3 Penalized Methods: HHSVM

A hybrid huberized support vector machine (HHSVM) was introduced in [44] for both classification and gene selection. HHSVM uses a combination of the huberized hinge loss function [45] to measure misclassification and the elastic-net penalty [46], which allows for automatic variable selection and a grouping effect (groups of correlated variables get selected/removed together).

Results of the HHSVM were shown on the leukemia dataset [37] compared to those of SVM-RFE. In the context of the original training/test split of the dataset, 38 training samples and 34 testing samples, SVM-RFE made 2/34 errors with 128 genes, and 0/38 errors with 128 genes in a cross validation experiment. Similarly, HHSVM made 0/34 errors with 84 genes in the original split and 0/38 errors in the cross validation context.

In [44] results were shown for experiments conducted under a randomly-splitting approach, where they combined all the original training/testing samples together and made random splits of 38 training and 34 testing samples; this process was repeated 50 times and an average of results was reported. SVM-RFE had an average testing error of 2.25% with 256 genes on average, while HHSVM had an average testing error of 1.67% with just 87.9 genes on average.

In [47], results were shown for HHSVM on the colon cancer dataset [36] compared to those of SVM-RFE. The dataset was randomly split 100 times into 42 training samples (27 cancer samples and 15 normal tissues) and 20 testing samples (13 cancer samples and 7 normal tissues). HHSVM resulted in 12.69% test error with 94.5 genes, while SVM-RFE showed a test error of 17.10% with 64 genes.

The HHSVM is quadratic in the number of features. In their experiments, they used SVM-RFE with a fixed number of genes. However, the approach is different from this work, in which we look at average accuracy (vs. total accuracy, which is biased towards larger classes) at each set of genes and how this varies with "small" training set changes. In their experiments, 10% of the genes were removed at each SVM-RFE iteration, whereas we removed genes one at a time once either a pre-selection had been done or a set number had been removed. So, SVM-RFE might be more competitive under different conditions.

A summary of results obtained by the feature selection methods reviewed in this chapter, compared to IFP, is discussed in Section 7.6.

3.3.4 Local Learning Based Feature Selection

A method for feature selection based on local learning was introduced in [48]. The formulation of their algorithm was based on the concept that a complex problem could be analyzed by transforming it into a set of locally linear problems. The parameter estimation could then be performed globally.

For each sample in the dataset, the nearest hit (same class) and nearest miss (different class) were found using a distance function. The margin for the sample is defined based on these two distances. The large margin theory [49] establishes that a classifier that minimizes a margin-based error function has good performance. Based on this concept, their algorithm scales each feature to obtain a weighted feature space, so that the minimization of a margin-based error function is conducted in the transformed space. A method for finding the best weights over all data was developed. Features with the highest weights are more relevant than those with the lowest weights.

The algorithm uses only local information from each sample (neighborhood).

Experiments in [48] were conducted on eight datasets from the UCI repository [50]. Experiments involved adding irrelevant features (up to 30000) which were independently sampled from a zero-mean and unit-variance Gaussian distribution. They compared their algorithm with five other approaches, and concluded that it consistently assigned high weights to nearly the same relevant features, regardless of the number of noisy features added to the dataset. SVM-RFE (RBF kernel) and I-Relief [51] were compared using the backward feature elimination approach, and they concluded that in the presence of irrelevant features, useful features were eliminated in the backward elimination process. SVM-RFE with an RBF kernel did not show good performance on high dimensional data, though they used a fixed number of features which was sometimes more than existed and sometimes less.

They also experimented with their algorithm on three microarray datasets: prostate cancer [52], breast cancer [53], and diffuse large B-cell lymphoma [54]. The 3-nearest neighbor classifier was used in a leave-one-out cross validation for performance estimation, and results were compared against two other approaches. They took the top 50 genes down to 1. Features were selected on each train subset. Their algorithm obtained the highest accuracy along with selecting a number of genes as follows. On prostate cancer, 83.5% accuracy with 6 genes vs. I-Relief 74.7% with 9 genes. On breast cancer, 78.3% accuracy with 4 genes vs. I-Relief 76.3% accuracy with 28 genes. On diffuse large B-cell lymphoma, 97.4% with 10 genes vs. I-Relief 94.8% accuracy with 7 genes.

It is mentioned in [48] that some top-ranked genes from the t-test are not found by their algorithm. They argued it excluded redundant genes (with no correlation analysis shown).

However, from our work we know different sets of genes can lead to similar accuracies. Leave-one-out cross validation also seems to be optimistic in its error estimate, and it uses the maximum amount of data for gene selection as well.

Chapter 4: Iterative Feature Perturbation Method

This chapter introduces a new algorithm for feature/gene selection termed the Iterative Feature Perturbation (IFP) method. A detailed description along with pseudocode is provided. A flowchart for the IFP algorithm is shown in Appendix A.

The iterative feature perturbation method (IFP) is an embedded gene selector that follows a backward elimination approach. The base learning algorithm is involved in the process of determining which features are going to be removed in the next step. The algorithm starts with the entire set of features in the dataset, and at every iteration the size of the feature set is reduced by removing the least important features.

The criterion to determine which features are the least important relies on the impact on the classification performance that each feature has when perturbed. That is, each feature is perturbed by adding noise to it. If, as a result, it leads to a big change in the classification performance, then the feature is considered relevant. Correspondingly, non-relevant features will cause little or no impact on the classification performance. Non-relevant features are then removed so that only relevant features remain.

In [55] we concluded that different amounts of noise were needed to adequately perturb feature sets of different sizes. This work outlines a new IFP algorithm that is described in Fig. 4.1. Fig. 4.1a describes the main iterative part of IFP, and Fig. 4.1b describes the binary search called from Fig. 4.1a for the calculation of the noise level, perturbation, and ranking of features.

IFP receives as input the original dataset, which constitutes the training set X, and a k value indicating the number of features to be removed in the current iteration. At the beginning the subset of current surviving features S is set to be the initial set of features. The method iterates through stages (described below) until no features remain; then a ranked feature list is output. For performance evaluation purposes, a classification model can be created based on a selected feature set from the ranked list, and its accuracy can be calculated on the test set.

In the iterative process, the first stage is to train a classification model on X with all existing surviving features S. Any classification learning algorithm could be used to create the classification model. For the experiments conducted in this research, a support vector machine (SVM) was used as the base classifier. For performance reasons, after training the classification model, the training set was reduced to the subset of samples selected as the support vectors. These samples carry the essential information needed for the classification problem; the rest of the samples are irrelevant for the feature evaluation stage [32, 56].

In the second stage the training accuracy for the reduced sample set was calculated. The third stage perturbs and ranks all features in S from least to most relevant. This phase identifies an appropriate amount of noise to be injected in each feature by using a binary search process, and identifies the k least relevant features for removal.

The ranking for a feature is determined by the change in accuracy observed on the training samples before and after adding noise. Non-relevant features were defined as those causing a change in accuracy in the 0-10% range. The binary search process returns a ranking of features which is examined for ties in the fourth stage of IFP. Two features are tied when they cause the same accuracy change.

If no tie is found, then the top k features in the ranking, which are the k least relevant features of the current set S, are removed. In case of a tie, all tied features are ranked based on a tie-breaking criterion, and only the top feature is removed. Since SVM is the base classifier used, the weight of each feature [29] was chosen as the tie-breaking criterion, which is calculated by Eq. 2.5 previously defined in Section 2.3.

Given that microarray datasets usually have many fewer examples than gene expressions, the feature removal process would be computationally expensive if features were removed one at a time. Moreover, only a small subset of features is expected to be relevant for classification. This is the reason why the adaptive feature elimination strategy was applied in our experiments (see definition in Section 5.4).

After removing the k least relevant features, the final feature ranking F is updated.

4.1 Binary Search

An iterative binary search was implemented to find the amount of noise to be added to each feature when perturbing. Noise is generated as

    Noise_i = c * sd_i                                            (4.1)

where sd_i is the standard deviation of the feature being perturbed across all examples in the training set, and c is a dynamic factor indicating the noise level being injected in the perturbation process. It is dynamic in the sense that it varies in magnitude for different sizes of the set of surviving features S. That is, the c factor for a set S with 500 features may be different from that for a set S with 10 features.
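A minimal sketch of the per-feature perturbation score follows, assuming scikit-learn's SVC in place of the modified libSVM used in this work; the additive form of the injected noise and the function name perturbation_scores are illustrative assumptions.

    # Sketch of IFP's perturbation step: train a linear SVM, keep only the
    # support vectors, then add noise of magnitude c * sd_i (Eq. 4.1) to one
    # feature at a time and record the change in training accuracy.
    import numpy as np
    from sklearn.svm import SVC

    def perturbation_scores(X, y, c_factor):
        clf = SVC(kernel="linear", C=1).fit(X, y)
        sv = clf.support_                  # indices of the support vectors
        Xs, ys = X[sv], y[sv]              # reduced sample set (see text above)
        base_acc = clf.score(Xs, ys)       # training accuracy before perturbation
        scores = np.empty(X.shape[1])
        for i in range(X.shape[1]):
            Xp = Xs.copy()
            Xp[:, i] += c_factor * X[:, i].std()   # Eq. (4.1): noise = c * sd_i
            scores[i] = abs(base_acc - clf.score(Xp, ys))
        return scores                      # small change => less relevant feature

Features whose score falls in the 0-10% band are the removal candidates; the binary search described next tunes c so that exactly k features land in that band.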

The c factor impacts the final amount of noise injected into each feature.

When a large amount of noise is applied to the set S, it may result in no features within the 0-10% accuracy change. The opposite is also true. That is, when a small amount of noise is applied, it may result in getting more features within the 0-10% range than needed. Generally, the more noise applied to a feature, the bigger the accuracy change it will cause. Specifically, the binary search looks for a c factor such that, when applied in Eq. (4.1), it finds the k features with the least accuracy change to remove.

Binary search is an efficient algorithm to find an item in a sorted list. In our implementation, the c factor can take a minimum value of 1x10^-6 and can be as large as needed. This fact allows the search space to be ordered in ascending order. The initial value of the maximum boundary of the search space is set to a very large maximum value. As the search advances, the search space will be reduced to either its left half or right half, depending on the need to increment or decrement the number of features within the 0-10% accuracy change, respectively. Eventually, the c factor being sought will be the value in the midpoint of the search space.

In Fig. 4.1b, line 1 sets adjusting_variable to the lowest value that the c factor can take. This variable is used when reducing the search space to its left or right half in lines 16 and 18. The minimum value that adjusting_variable is assigned allows for a further scan of the search space. Lines 2 and 3 set the minimum and maximum boundary values of the search space, respectively. The perturbation of features is coded in lines 6 through 12. The amount of noise that is injected into each feature in line 7 is calculated with (4.1). In line 10, Acc_i corresponds to the accuracy obtained after perturbing feature i of the set S. In line 11, the ranking criterion is calculated; it is the difference of the training accuracy and the accuracy calculated in line 10.

In line 13, all features in S are ranked based on the ranking criterion; that is, the resulting ranking will go from features causing the least change in accuracy down to features changing accuracy the most. In line 14, the features with 0-10% accuracy change are counted. In lines 15-16, when there are no features within this range of accuracy change or they total less than the k features needed, the search space is reduced to the left half, meaning that the c factor, and thus the noise level, has to be decreased in order to get more features within the desired range of accuracy change. On the other hand, in lines 17-18, when there are more features within the 0-10% accuracy change than needed, the search space is reduced to the right half, meaning that the c factor, and thus the noise level, has to be increased in order to get fewer features within the desired range of accuracy change.

Finally, in line 20, when exactly the k features needed are found, the ranking of all features is returned by the binary search.

4.2 Computational Cost of IFP

The computational cost of IFP is O(lg K * K^2 * N), with K being the number of features and N the number of samples in the dataset.

(a) Main loop of IFP

Input: a set X of training samples and a k value indicating the number of features to remove.
Output: F, the final feature ranking.
S: subset of surviving features.

 1. S <- all features
 2. while S is not empty do
 3.   train a classifier on X with features S
 4.   calculate the training set accuracy Acc
 5.   perturb and rank features by binary search
 6.   determine if there are tied features
 7.   if no tied features then
 8.     S <- S - {k least relevant features}
 9.   else
10.     rank tied features by increasing weight
11.     S <- S - {feature on top}
12.   end if
13.   update feature ranking F
14. end while
15. return F

(b) Binary search called from (a)

Input: the current set X of training samples with S surviving features; the k value and the training accuracy Acc of current set X.
Output: FR, ranking of current features in S.

 1. adjusting_variable <- 1x10^-6
 2. min <- 1x10^-6
 3. max <- large maximum value
 4. while min <= max do
 5.   midpoint <- (min + max) / 2; c_factor <- midpoint
 6.   for all features i in S do               {perturbing features}
 7.     add noise to i across all examples
 8.     get a new perturbed dataset X'
 9.     predict classes for X'
10.     calculate accuracy Acc_i
11.     calculate ranking criterion r_i = abs(Acc - Acc_i)
12.   end for
13.   FR <- ranking of all features in S by increasing r_i
14.   count <- features with 0-10% ranking criterion
15.   if count = 0 or count < k then           {c_factor needs to be decreased}
16.     max <- midpoint - adjusting_variable
17.   else if count > k then                   {c_factor needs to be increased}
18.     min <- midpoint + adjusting_variable
19.   else                                     {right c_factor found}
20.     return FR
21.   end if
22. end while

Figure 4.1: IFP algorithm. (a) is the main loop of the iterative feature perturbation algorithm and (b) the binary search called from (a) for calculation of the noise level/c factor, perturbation, and ranking of features.
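For concreteness, the following is a minimal Python rendering of Fig. 4.1b under stated assumptions: score_fn stands in for lines 6-12 (it must return the per-feature ranking criterion r for a given c factor, with accuracies in [0, 1], as in the perturbation_scores sketch shown earlier), and the 1x10^6 upper bound is an arbitrary stand-in for the "large maximum value".

    # Hedged sketch of the binary search in Fig. 4.1b.
    import numpy as np

    def binary_search_ranking(score_fn, k, lo=1e-6, hi=1e6):
        ranking = None
        while lo <= hi:
            midpoint = (lo + hi) / 2.0         # candidate c factor (line 5)
            r = score_fn(midpoint)             # lines 6-12: r[i] = |Acc - Acc_i|
            ranking = np.argsort(r)            # least to most relevant (line 13)
            count = int(np.sum(r <= 0.10))     # features with 0-10% change (line 14)
            if count < k:
                hi = midpoint - 1e-6           # too much noise: decrease c (line 16)
            elif count > k:
                lo = midpoint + 1e-6           # too little noise: increase c (line 18)
            else:
                return ranking                 # exactly k candidates found (line 20)
        return ranking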

Chapter 5: Experimental Studies

Experiments conducted with IFP on four microarray datasets are described in this chapter. The IFP accuracy results are compared with those of SVM-RFE and the t-test as feature ranking (in terms of SVM classifier accuracy). Two alternate scenarios of experiments are explored for each dataset: first, we start the feature removal process with the entire set of genes in the dataset, and second, with a preselection of genes based on their p values. An AUC analysis of the resultant accuracy curves is conducted for performance comparison between methods under both scenarios. An analysis of the intersection or overlap between genes selected by each method is performed under both scenarios as well.

5.1 Data and Preprocessing

Experiments were performed on 4 Affymetrix-platform 2-class gene expression microarray datasets. All 4 datasets underwent a preprocessing phase, as is typical for this type of data, which included a log2 function applied to each gene expression value. Usually, the transformed data satisfies the assumptions of statistical tests. The log2 is the most common transformation in microarray studies [57]. Data preparation allows the learning algorithm to easily access the information carried by the datasets [58].

The colon cancer dataset is a well-studied, publicly available microarray benchmark [36]. It is made up of 62 samples including 22 normal and 40 colon cancer tissues. There are 2000 gene expression values for each sample. For data preprocessing, a log2 transformation was applied to each gene expression value.

The leukemia dataset is another publicly available dataset [37]. It contains information on human acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), with 25 and 47 samples respectively. There are 7129 gene expression values for each sample. The dataset was explored and a large number of negative gene expression values were found. For data preprocessing, all negative gene values were set to 1 and a log2 transformation was applied to the entire dataset, resulting in a large number of zero values. A set of 2689 genes was preselected under the criteria of genes having < 25% zero values and variance >= 1.

The Moffitt colon cancer dataset used in this work is a superset of the set described in [59]. It contains information on 122 samples: 84 labeled "good prognosis", representing patients with survival time >= 36 months, and 38 samples labeled "poor prognosis", representing patients with survival time < 36 months. The original dataset has 54675 probe sets. For data preprocessing, a log2 transformation was done, and a preselection of genes using the criteria of genes having variance >= 0.5 resulted in a subset of 2619 genes.

The lung cancer dataset used in our experiments is the same as was used in [60]. It is composed of 410 samples, 271 labeled "good prognosis" and 139 labeled "poor prognosis" relative to their survival time, as described for the Moffitt colon cancer dataset. The original dataset has 22282 probe sets, including 68 control genes (which were removed), leaving a subset of 22214 genes. The dataset was explored and a number of genes with values close to zero were found.

For data preprocessing, all gene expression values < 2 were set to 2, so they would not result in negative numbers in the log2 transformation. A subset of 2428 genes was preselected under the criteria of genes having < 25% of gene values = 1 and variance >= 1.

Finally, gene expressions in all 4 datasets were scaled to be in [0, 1]. A variance >= 1 was chosen based on the fact that a variance of 1 represents a 2-fold change on a log2-transformed dataset. For the Moffitt colon cancer dataset we observed a large number (53979) of probe sets with variance < 1, which is the reason why for this particular dataset we selected genes with variance >= 0.5.

5.2 Parameters for the SVM

As stated in the description of IFP in Chapter 4, the IFP method could be used with any classification algorithm as the base classifier. In our implementation, we used SVM as the base classifier to be able to compare our results against those of SVM-RFE. The SVM used is a modified version of libSVM [61]. A linear kernel was used with parameter C = 1 to reduce training time and the probability of overfitting. The optimization algorithm used was sequential minimal optimization (SMO).

5.3 Performance Measure

All results reported in Section 5.6 are expressed in weighted accuracy rather than total accuracy. Weighted accuracy was preferred as the classifier performance measure due to the unequal distribution of the two classes in all four datasets. In situations like these, weighted accuracy gives a better performance estimate of the learning algorithm [62].

Weighted accuracy for 2-class datasets is defined as follows:

    Weighted Accuracy = ( tp / (tp + fn) + tn / (fp + tn) ) / 2           (5.1)

where tp, fp, tn, and fn respectively are the number of true positives, false positives, true negatives, and false negatives in a confusion matrix, as shown in Table 5.1.

Table 5.1: Confusion matrix.

                           Predicted Cancer       Predicted Normal Tissue
  Actual Cancer            True Positive (TP)     False Negative (FN)
  Actual Normal Tissue     False Positive (FP)    True Negative (TN)

5.4 Adaptive Feature Elimination Strategy

The strategy consists of determining the number of features to be removed in relation to the current number of surviving features. Specifically, A% of the surviving features can be removed at a time, rather than just one, when the number of existing features is larger than a given threshold; thereafter one-at-a-time feature removal is used.
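As an illustration, a minimal sketch of Eq. (5.1) and of the adaptive removal schedule follows; the function names are ours, and the 50%/25% values are the ones later used in Section 5.6.

    # Weighted accuracy (Eq. 5.1) from 2-class confusion-matrix counts, plus
    # the adaptive elimination schedule of Section 5.4: remove A% of the
    # surviving features while many remain, then one at a time.
    def weighted_accuracy(tp, fn, fp, tn):
        return (tp / (tp + fn) + tn / (fp + tn)) / 2

    def n_features_to_remove(n_surviving, n_total, pct=0.50, threshold=0.25):
        # 50% per iteration until 25% of the total remains (10% for the colon
        # dataset), as in Section 5.6; afterwards remove one feature at a time.
        if n_surviving > threshold * n_total:
            return max(1, int(pct * n_surviving))
        return 1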

5.5 Gene Pre-Filtering Strategy

A common practice suggested for microarray data [4] when it comes to gene selection is to pre-reduce the original high-dimension dataset via a univariate filter. Wrappers and embedded learning algorithms are then applied over the new reduced dataset.

The gene pre-selection process gets rid of a set of unimportant/noisy genes based on the chosen filter criterion, allowing the learning algorithm under study to perform over a less noisy dataset.

5.6 Experiments and Results

The adaptive feature elimination strategy defined in Section 5.4 was applied in the feature removal process for all experiments conducted that started with the entire set of features in the dataset. Specifically, 50% of the existing features were removed across iterations until a threshold of 25% of the total was reached, except for the colon cancer dataset, whose threshold was set to 10% of the total. All experiments were assessed via a 10-fold cross validation process, and each cross-validation experiment was repeated 5 times with different seeds. The average weighted accuracy over the five runs is reported.

Three methods of feature ranking were assessed with all four datasets: IFP, SVM-RFE, and the t-test. The t-test ranks the features according to their p values. The most relevant feature has the lowest p value and the least relevant feature has the highest p value.
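A hedged sketch of this ranking follows, using scipy's two-sample t-test; whether equal variances were assumed in the original implementation is not stated here, so scipy's default is an assumption.

    # Rank genes by the p value of a two-sample t-test between the two classes:
    # the smallest p value marks the most relevant gene.
    import numpy as np
    from scipy.stats import ttest_ind

    def ttest_ranking(X, y):
        pvals = np.array([ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue
                          for j in range(X.shape[1])])
        return np.argsort(pvals)      # gene indices, most to least relevant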

5.6.1 Accuracy Results

Results for the four datasets are shown in Fig. 5.1 through 5.4. The graph of the colon cancer dataset in Fig. 5.1 shows that IFP resulted in more accurate classifiers than SVM-RFE throughout most of the range of 6 to 139 features (genes), with a bigger difference in favor of IFP in the ranges of 81 through 95, 53 through 68, and 6 through 42 features. SVM-RFE resulted in a more accurate classifier than IFP throughout most of the range of 140 to 2000 features. The t-test based feature ranking resulted in the highest accuracy classifier for nearly the entire set of features on this dataset.

The leukemia dataset, shown in Fig. 5.2, shows that the t-test resulted in a more accurate classifier than IFP and SVM-RFE throughout the range of 1000 to 2000 features. IFP resulted in a more accurate classifier than both SVM-RFE and the t-test throughout the range of 100 to 250 features. The three accuracies were comparable and there was a tie in the range of 57 through 100 features. However, there were differences in accuracies in the range of 1 through 56 features. IFP alternated with SVM-RFE in accuracy spikes in this range, although IFP overall attained higher accuracies. On the other hand, the t-test feature ranking resulted in the highest accuracy classifier throughout the range of 3 to 24 features, and it was the least accurate classifier in the range of 31 to 56 features.

The graph of the Moffitt colon cancer dataset in Fig. 5.3 shows comparable accuracies between the three methods throughout the range of 600 to 2619 features. IFP spiked at 500 features and thereafter dropped off to the lowest accuracy of the three methods, mostly throughout the range of 33 to 250 features. The t-test clearly resulted in a more accurate classifier throughout the range of 116 to 250 features, whereas SVM-RFE and the t-test showed similar accuracies in the range of 80 through 115 features. SVM-RFE was better than the t-test throughout most of the range of 48 to 79 features, and the t-test was better than SVM-RFE in the range of 33 through 46 features.

Last, there were interesting accuracies in the top 33 features. SVM-RFE and IFP showed similar accuracies in this range, except that IFP accuracy spiked around 25 features. On the other hand, the t-test showed its lowest accuracy in this range, at 24 features.

The graph of the lung cancer dataset in Fig. 5.4 shows similar accuracies between IFP and SVM-RFE throughout the range of 1000 to 2428 features. The t-test resulted in the highest accuracy classifier throughout nearly the entire range of 14 through 250 features, except that IFP reached accuracies closer to those of the t-test in the range of 43 through 92 features. SVM-RFE and IFP reached very comparable accuracies throughout the range of 180 to 250 features, whereas IFP clearly outperformed SVM-RFE in the range of 33 through 154 features. From 33 downward they showed similar accuracies. Finally, the t-test resulted in the lowest accuracy classifier across the top 9 features.

Surprisingly, the t-test ranking based on the p value of the features resulted in selecting subsets of features which, in terms of SVM classifier accuracy, tended to outperform both IFP and SVM-RFE across all four datasets. On the other hand, IFP showed accuracy comparable or superior to that of SVM-RFE on the colon, leukemia, and lung datasets. IFP resulted in a less accurate classifier on the Moffitt colon cancer dataset, except at the end, when fewer than 33 features remained, where IFP accuracy was comparable or superior to that of SVM-RFE at some points. The t-test based classifier accuracy was very low in this range of features.

Figure 5.1: Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the colon cancer dataset.

5.6.2 Intersection Across the Entire Set of Features

An analysis of the intersection across the entire set of features on each dataset was conducted to determine the degree of similarity of the sets of features/genes at chosen points on the ranked lists resulting from each of the three methods IFP, SVM-RFE, and the t-test. The analysis consisted of looking at the average percentage of intersection of features between any two methods. Average percentages were calculated across each number of features. Given that there were five runs of each method with different seeds, the percentage of intersection was calculated between pairs of runs using the same seed for any two methods. We ended up with five percentages to average over. For the intersection of any method with itself, one of its five runs was used to calculate the percentage of intersection with each of the other four runs. The average percentage of intersection was calculated over four percentages in these cases.

Figure 5.2: Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the leukemia dataset.

Results on the colon cancer dataset are illustrated in Fig. 5.5, on the leukemia dataset in Fig. 5.6, on the Moffitt colon cancer dataset in Fig. 5.7, and on the lung cancer dataset in Fig. 5.8.

Fig. 5.5a shows the intersection of features between IFP and SVM-RFE on the colon cancer dataset. Interestingly, in Fig. 5.1 the accuracies of IFP and SVM-RFE were close to each other at a number of points in the range of 70 through 2000 features, whereas their percentage of intersection was not as high as one might expect for this range of features, given their similarity in accuracies. At 2000 features their intersection started at 100%, and steadily went down as the number of features decreased until around 200 features were left. Thereafter the percentage of intersection stayed at around 60% all the way until 70 features remained, and their intersection kept noticeably decreasing until 1 feature was left.

Figure 5.3: Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the Moffitt colon cancer dataset.

This fact suggests that different sets of genes will result in similar accuracies.

Figs. 5.5b, 5.5d, and 5.5f show the intersections of IFP, the t-test, and SVM-RFE respectively with themselves. These three figures show that the t-test was more stable than both IFP and SVM-RFE in selecting features.

Fig. 5.5c shows the intersection of features between the t-test and IFP, which stayed very low overall. It was 100% at 2000 features, and pictorially followed a close-to-linear behavior as the number of features decreased until 250 or fewer features were left. Thereafter, the intersection stayed in the range of 15% to 20%. Fig. 5.5e shows similar behavior for the intersection of the t-test and SVM-RFE, except that the percentage of intersection stayed at around 20% from when 250 features were left.
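A short sketch of the overlap measure follows; the helper name and the run containers (runs_a, runs_b) are illustrative assumptions.

    # Percentage of shared genes between the top-n lists of two ranking runs;
    # the reported curves average this over the five seed-matched run pairs.
    def overlap_percent(rank_a, rank_b, n):
        shared = set(rank_a[:n]) & set(rank_b[:n])
        return 100.0 * len(shared) / n

    # e.g., for two methods with five seed-matched runs each:
    # avg = sum(overlap_percent(runs_a[s], runs_b[s], n) for s in range(5)) / 5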

Figure 5.4: Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the lung cancer dataset.

Fig. 5.6a shows the intersection between IFP and SVM-RFE on the leukemia dataset. Interestingly, the percentage of features in the intersection between these two methods steadily decreased with the number of features. Also, Fig. 5.6c and Fig. 5.6e show the percentage of intersection between the t-test and IFP and between the t-test and SVM-RFE, respectively. The number of features in the intersection of both IFP and SVM-RFE with the t-test was low compared to that in the intersection between IFP and SVM-RFE, a result which indicates that the t-test selected different sets of features. However, in terms of accuracies, in Fig. 5.2 the three methods did not differ in the same proportion. In particular, the t-test and SVM-RFE showed similar accuracies in the range of 57 through 2689 features. These results highlight the idea that different sets of features (genes) can lead to comparable if not identical accuracies. On the other hand, Fig. 5.6b, Fig. 5.6d, and Fig. 5.6f show, as in the case of the colon cancer dataset in Fig. 5.5, that the t-test was more stable in selecting features across the entire set than IFP and SVM-RFE.

Figure 5.5: Intersection across the entire set of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the colon cancer dataset.

Fig. 5.7a shows the intersection of features between IFP and SVM-RFE on the Moffitt colon cancer dataset. Clearly, it shows the same trend as with the colon and leukemia datasets; that is, the percentage of intersection decreases with the number of features.

For this dataset we wanted to focus on the ranges of features where any two methods had similar accuracies. Fig. 5.3 shows that both the t-test and SVM-RFE selection methods resulted in similar accuracies of around 54% in the range of 80 through 114 features; however, their intersection in this range, as shown in Fig. 5.7e, was approximately 23%. This means that these methods showed similar base classifier accuracies with very different sets of features. The same situation occurred with the best 9 features, where the three methods reached accuracies close to each other while their percentages of intersection were different. For the t-test with IFP, as well as for the t-test with SVM-RFE, the intersection was less than 20% (2 features), and for IFP with SVM-RFE it was less than 25%. Also, for this dataset the percentage of intersection of the t-test with both IFP and SVM-RFE was low compared to that of IFP with SVM-RFE. As with the colon and leukemia datasets, just looking at the percentage of intersection of each method with itself shows that both IFP and SVM-RFE were less stable than the t-test in selecting features.

Last, the results obtained for the lung cancer dataset were not much different from those for the colon cancer, leukemia, and Moffitt colon cancer datasets, in the sense that any two methods could reach the same or similar accuracies with different sets of features. That was the case for IFP and SVM-RFE, whose accuracies shown in Fig. 5.4 were at 57% in the range of 178 to 225 features, while their intersection shown in Fig. 5.8a was around 54% in the same range. Another case is the analysis of IFP with the t-test, whose accuracies were around 57% in the range of 57 to 71 features, while their intersection shown in Fig. 5.8c was as low as 6.6% in the same range.

The experiments conducted on four microarray datasets, starting with their entire sets of features, have consistently shown that the feature ranking based on the t-test resulted in sets of features with very competitive SVM classifier accuracy. Accuracies of SVM classifiers with features chosen by IFP, SVM-RFE, and the t-test were compared against each other, showing the t-test either equaling or outperforming the other two in an important range of features.

Figure 5.6: Intersection across the entire set of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the leukemia dataset.

Figure 5.7: Intersection across the entire set of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the Moffitt colon cancer dataset.

5.6.3 Gene Preselection

In this section we explore the effect of applying the filtering strategy described in Section 5.5 on the previously discussed four datasets. The filter chosen was the t-test.

Figure 5.8: Intersection across the entire set of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the lung cancer dataset.

In [14] it was noted that p values were useful for prioritizing genes for further investigation. In [4] it was advised to pre-reduce the search space up front via a univariate filter, and secondly to apply wrapper or embedded methods.

Our experiments consisted of first preselecting the top 200 genes (200 genes is a reasonable number for further investigation of biological significance) based on their p values; second, forming a new dataset with this subset of genes; and third, applying IFP and SVM-RFE on the reduced set to examine their performance on pre-reduced datasets. The gene preselection process was done within a 10-fold cross validation context, each performed five times. The average weighted accuracies over the five runs are shown in Fig. 5.9 through Fig. 5.12.

Figure 5.9: Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the colon cancer dataset across the top 200 features as filtered by the t-test.

As a reference to results shown in Fig. 5.1 through Fig. 5.4, and for a clearer view of the effect of preselecting the features on each dataset, the t-test line of each chart in Fig. 5.1 through Fig. 5.4 was included on its corresponding chart in Fig. 5.9 through Fig. 5.12, as they follow the same methodology.
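A minimal sketch of this protocol follows, reusing the ttest_ranking helper from the earlier sketch; performing the selection inside each fold, on the training split only, is the essential point.

    # Gene preselection inside cross validation: pick the 200 smallest-p genes
    # on each training split, then hand the reduced matrix to IFP or SVM-RFE.
    from sklearn.model_selection import StratifiedKFold

    def preselect_top_genes(X_train, y_train, n_genes=200):
        return ttest_ranking(X_train, y_train)[:n_genes]

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    # for train_idx, test_idx in skf.split(X, y):
    #     keep = preselect_top_genes(X[train_idx], y[train_idx])
    #     # ... run IFP or SVM-RFE on X[train_idx][:, keep] ...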

Figure 5.10: Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the leukemia dataset across the top 200 features as filtered by the t-test.

For the colon cancer dataset, Fig. 5.9 shows that with gene preselection both IFP and SVM-RFE reached higher accuracies than when no preselection was done. Their accuracies are now at or above 85% for nearly the entire set of 200 genes, in contrast to below 85% accuracy before. SVM-RFE gained the most, since it now outperformed the t-test in the top 22 features, which did not happen in Fig. 5.1.

Fig. 5.10 shows that for the leukemia dataset the gene preselection process benefited both IFP and SVM-RFE. Their accuracies were higher than when no gene preselection was done; they are now at around 95%. These two methods resulted in more accurate classifiers than the t-test in the range of 22 through 200 features on this dataset.

The gene preselection based on the t-test was also beneficial for both IFP and SVM-RFE on the Moffitt colon cancer dataset, as Fig. 5.11 shows. As a positive effect of the preselection, these two methods reached higher accuracies.

Figure 5.11: Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the Moffitt colon cancer dataset across the top 200 features as filtered by the t-test.

The IFP average accuracy over the top 200 features without gene preselection was 53.04%, while with preselection it went up to 56.27%. The SVM-RFE average accuracy over the top 200 features without gene preselection was 54.43%, while with preselection it went up to 56.36%. It is notable in Fig. 5.11 that both IFP and SVM-RFE reached accuracies similar to those of the t-test in the range of 124 to 200 features, while the t-test in Fig. 5.3 was more accurate in the same range. Also, IFP and SVM-RFE resulted in more accurate classifiers than the t-test in the range of 1 to 123 features in Fig. 5.11, which did not occur in Fig. 5.3.

Last, Fig. 5.12 shows once more that both IFP and SVM-RFE gained benefits from gene preselection on the lung cancer dataset. According to the results shown in Fig. 5.4, the t-test was more accurate than both IFP and SVM-RFE most of the time on the top 200 features, while with gene preselection in Fig. 5.12 these two methods improved their accuracies in this range to the point of reaching accuracies around or higher than those of the t-test.

Figure 5.12: Comparison of resulting average weighted accuracy of feature ranking given by methods IFP, SVM-RFE, and the t-test for the lung cancer dataset across the top 200 features as filtered by the t-test.

Overall, results on the four datasets showed that both IFP and SVM-RFE improved their accuracies when the dataset underwent a gene preselection process prior to running the learning algorithm under study. In our experiments, the t-test criterion was used as the filtering technique. Our results rigorously reinforce that it is a useful criterion when preselecting genes for further analysis, as stated by Quackenbush in [14].

5.6.4 Intersection Across a Subset of Features (Genes)

For the analysis of the intersection across the subset of 200 genes, we proceeded in the same way as described in Section 5.6.2. Based on the results obtained on the intersection of genes selected by the three methods across the entire set of genes of each dataset, a similar approach can be taken here on points where any two methods reached similar or the same accuracies. This procedure helps clarify whether or not it is possible to reach the same accuracy with different sets of features (genes). Average percentages were calculated across each of the 200 features. Given that there were five runs of each method with different seeds, the percentage of intersection was calculated between pairs of runs using the same seed for any two methods. We ended up with five percentages to average over. For the intersection of any method with itself, one of its five runs was used to calculate the percentage of intersection with each of the other four runs. The average percentage of intersection was calculated over four percentages in these cases.

Results of percentages of intersection on the colon cancer dataset are shown in Fig. 5.13, on the leukemia dataset in Fig. 5.14, on the Moffitt colon cancer dataset in Fig. 5.15, and on the lung cancer dataset in Fig. 5.16.

Fig. 5.13a, Fig. 5.13c, and Fig. 5.13e show the average intersection between IFP and SVM-RFE, the t-test and IFP, and the t-test and SVM-RFE, respectively, across the 200 genes. In the three cases the amount of intersection decreased with the number of features. The intersection with the t-test was lower as fewer features remained. The intersection between the t-test and IFP was at less than 50% when 100 features remained and reached its lowest rate of 16.57% at 7 features. The intersection between the t-test and SVM-RFE was somewhat similar to that with IFP. These results led us to observe that the t-test results in selecting very different sets of genes.

Figure 5.13: Intersection across the top 200 subset of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the colon cancer dataset.

However, according to the accuracies indicated in Fig. 5.9, there were points where classifiers built after applying the three methods reached very similar accuracies, such as at 97 genes, where IFP reached 85.57%, SVM-RFE 85.52%, and the t-test 85.98%. At this point the intersection with the t-test was less than 50%, and between IFP and SVM-RFE it was 83.96%. Again, it was clear that it is possible to reach similar accuracies with different sets of genes.

On the other hand, the intersection of IFP with itself in Fig. 5.13b resulted in around 59.80% overlap on average, that of SVM-RFE with itself in Fig. 5.13f resulted in 58.72% overlap on average, and the intersection of the t-test with itself in Fig. 5.13d resulted in 77.28% overlap on average. These numbers indicate that the t-test was more stable in selecting genes than IFP and SVM-RFE.

Figure 5.14: Intersection across the top 200 subset of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the leukemia dataset.

Figure 5.15: Intersection across the top 200 subset of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the Moffitt colon cancer dataset.

Fig. 5.14 shows the average intersections between IFP, SVM-RFE, and the t-test on the leukemia dataset. The shape of each of the curves resembles those of the colon dataset in Fig. 5.13. Similar conclusions were drawn from Fig. 5.14c and Fig. 5.14e, as the t-test ended up with different sets of selected genes, since the average intersection between the t-test and IFP, as well as the average intersection between the t-test and SVM-RFE, was lower most of the time across the 200 genes than that of IFP with SVM-RFE.

Now, the three methods resulted in similar accuracies at 21 genes in Fig. 5.10: IFP 93.05%, SVM-RFE 93.37%, and the t-test 93.66%, while their average intersections at the same point were 29.43% between the t-test and IFP, 30.10% between the t-test and SVM-RFE, and 42.57% between IFP and SVM-RFE.

Again, as in the colon cancer dataset, results on the leukemia dataset showed that it is possible to reach the same or similar accuracies with different sets of genes. On the other hand, the intersection of IFP with itself in Fig. 5.14b was 66.84% overlap on average, that of SVM-RFE with itself in Fig. 5.14f was 67.67% overlap on average, and the intersection of the t-test with itself in Fig. 5.14d resulted in 85.82% overlap on average. Again, these numbers indicated that the t-test was more stable in selecting genes than IFP and SVM-RFE.

Fig. 5.15 shows the average intersections between IFP, SVM-RFE, and the t-test on the Moffitt colon cancer dataset. The shape of each of the curves resembles those of the colon cancer and leukemia datasets in Fig. 5.13 and Fig. 5.14 respectively, except that the t-test in Fig. 5.15d did not maintain the same percentage of intersection across the 200 genes as it did on the colon cancer and leukemia datasets. It was less stable for this dataset.

As for the intersections of IFP and SVM-RFE with the t-test in Fig. 5.15c and Fig. 5.15e respectively, they showed again that the t-test selected different sets of genes. Also, the three methods reached similar accuracies at 141 features in Fig. 5.11: IFP 56.70%, SVM-RFE 56.35%, and the t-test 56.27%. However, even though their intersections were high at this point, they were not close to the same proportion among them. Average intersections reached 70.74% between the t-test and IFP, 71.22% between the t-test and SVM-RFE, and 91.72% between IFP and SVM-RFE.

Once more, the observation was that with different sets of genes the resulting accuracies can be alike.

Figure 5.16: Intersection across the top 200 subset of features of (a) IFP vs. SVM-RFE, (b) IFP vs. IFP, (c) t-test vs. IFP, (d) t-test vs. t-test, (e) t-test vs. SVM-RFE, and (f) SVM-RFE vs. SVM-RFE on the lung cancer dataset.

Finally, Fig. 5.16 shows the average intersections between IFP, SVM-RFE, and the t-test on the lung cancer dataset. The shape of each of the curves led us to note that similar conclusions can be drawn regarding reaching alike accuracies with different sets of genes.

Even though for this dataset the accuracies of the three methods shown in Fig. 5.12 are closer to each other across the 200 genes, their average intersections still showed behavior similar to those of the colon cancer, leukemia, and Moffitt colon cancer datasets.

5.7 Frequency of the Use of SVM Weights for Tie-Breaking by IFP

For each dataset, we observed the following frequencies of tie-breaking across experiments conducted using either the entire set of features (Section 5.6.1) or the top 200 features (Section 5.6.3). The averages shown were observed for each fold in a 10-fold cross validation process. For the colon cancer dataset, IFP used the SVM weights 1.4 times on average to break ties between 3 or 2 features. For the leukemia dataset, IFP did not have any ties at all. For the Moffitt colon cancer dataset, IFP used the SVM weights to break ties 1 time on average, and this occurred mostly among the last 5 features to be removed. Ties were observed all the way until 2 features were left. Similar to the Moffitt colon cancer dataset, for the lung cancer dataset IFP used the SVM weights 1 time on average to break ties, mostly among the last 9 features to be removed. These ties were observed all the way until 2 features were left.

5.8 AUC Analysis

The accuracy results between methods for all datasets were additionally analyzed using the area under the curve (AUC) [55] to allow for comparison of the feature selection methods. The method with the highest AUC value had the highest accuracy more frequently across the range of feature set sizes.

AUC was calculated for all of the accuracy curves in Fig. 5.1 through Fig. 5.4 and Fig. 5.9 through Fig. 5.12 using the trapezoidal method in the student version of the DADiSP software [63]. Fig. 5.1 through Fig. 5.4 show accuracy results for our 4 datasets using the entire set of features. Fig. 5.9 through Fig. 5.12 show accuracy results using the top 200 features preselected based on their p values.

5.8.1 Across the Entire Set of Features

Results are shown in Table 5.2. Table 5.2(a) shows the AUC of each method across the 4 datasets. The t-test had the greatest AUC on the colon cancer and the leukemia datasets. SVM-RFE was highest on the Moffitt colon and the lung cancer datasets. The comparison of any 2 methods in Table 5.2(b) indicates that IFP had greater AUC than that of SVM-RFE only on the leukemia dataset. Table 5.2(c) shows that the t-test had greater AUC than that of IFP on the colon, leukemia, and lung cancer datasets, while IFP had the greater AUC on the Moffitt colon dataset. Table 5.2(d) shows that the t-test reached greater AUC than that of SVM-RFE on the colon cancer and the leukemia datasets. Interestingly, the differences in AUC on the datasets where the t-test outperformed IFP or SVM-RFE are mostly larger than those where these methods outperformed the t-test. Also, differences in AUC between IFP and SVM-RFE are mostly smaller in magnitude than those when these 2 methods are compared against the t-test, suggesting a similarity in these two methods.
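The same trapezoidal integration can be reproduced with numpy, as in this small sketch (the numbers shown are made up for illustration).

    # Area under an accuracy-vs-feature-count curve via the trapezoidal rule,
    # mirroring the DADiSP computation behind Tables 5.2 and 5.3.
    import numpy as np

    n_features = np.array([1, 2, 4, 8, 16, 32])                  # feature-set sizes
    accuracy   = np.array([0.70, 0.74, 0.80, 0.82, 0.81, 0.83])  # weighted accuracy
    auc = np.trapz(accuracy, n_features)                         # higher = better overall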

Table 5.2: AUC analysis of accuracy curves across all 4 datasets using the entire set of features. (a) AUC (a bold entry in the original represents the highest AUC). Comparison between methods in terms of AUC difference (noted in the column of the method with the highest AUC): (b) IFP vs. SVM-RFE, (c) IFP vs. t-test, and (d) SVM-RFE vs. t-test.

(a)
  Dataset/Method   IFP         SVM-RFE     t-test
  Colon            164654.88   165656.48   168526.20
  Leukemia         255409.53   254867.40   256530.08
  Moffitt Colon    138889.91   139046.06   138300.87
  Lung             134020.46   134691.81   134664.98

(b)
  Dataset/Method   IFP       SVM-RFE
  Colon                      1001.60
  Leukemia         542.13
  Moffitt Colon              156.15
  Lung                       671.35

(c)
  Dataset/Method   IFP       t-test
  Colon                      3871.32
  Leukemia                   1120.08
  Moffitt Colon    589.04
  Lung                       644.52

(d)
  Dataset/Method   SVM-RFE   t-test
  Colon                      2869.72
  Leukemia                   1662.68
  Moffitt Colon    745.19
  Lung             26.83

5.8.2 Across the Top 200 Subset of Features

Results are shown in Table 5.3. Table 5.3(a) shows the AUC of each method across the 4 datasets using only the top 200 features. The t-test had the greatest AUC on the colon and the lung cancer datasets. IFP was highest on the leukemia dataset. SVM-RFE had the greatest AUC on the Moffitt colon cancer dataset. The comparison of any 2 methods in Table 5.3(b) indicates that IFP now had greater AUC than SVM-RFE on the colon cancer, the leukemia, and the lung cancer datasets. Table 5.3(c) shows that IFP had a bigger AUC than that of the t-test on the leukemia and the Moffitt colon cancer datasets.

Table 5.3(d) shows that SVM-RFE had greater AUC than the t-test on the leukemia and the Moffitt colon cancer datasets. Interestingly, the differences in AUC on the datasets where either IFP or SVM-RFE outperformed the t-test were comparable to those where the t-test outperformed the two former. This is true except on the lung cancer dataset, where the t-test based feature selector resulted in a better classifier than both IFP and SVM-RFE with a very minimal difference in AUC.

On the other hand, differences in AUC between IFP and SVM-RFE were very small compared to those between these two methods and the t-test. Our perception at this point was that the preselection of genes helped both IFP and SVM-RFE improve their performance. Considering AUC values only, we observed that accuracies of IFP and SVM-RFE got closer to, and in some cases exceeded, those of the t-test as an effect of gene preselection.

Table 5.3: AUC analysis of accuracy curves across all 4 datasets using the top 200 features. (a) AUC (a bold entry in the original represents the highest AUC). Comparison between methods in terms of AUC difference (noted in the column of the method with the highest AUC): (b) IFP vs. SVM-RFE, (c) IFP vs. t-test, and (d) SVM-RFE vs. t-test.

(a)
  Dataset/Method   IFP        SVM-RFE    t-test
  Colon            16962.56   16958.48   17252.42
  Leukemia         18940.77   18877.87   18636.43
  Moffitt Colon    11200.31   11219.15   10961.38
  Lung             11598.31   11593.80   11600.65

(b)
  Dataset/Method   IFP      SVM-RFE
  Colon            4.08
  Leukemia         62.90
  Moffitt Colon             18.84
  Lung             4.51

(c)
  Dataset/Method   IFP      t-test
  Colon                     289.86
  Leukemia         304.34
  Moffitt Colon    238.93
  Lung                      2.34

(d)
  Dataset/Method   SVM-RFE  t-test
  Colon                     293.94
  Leukemia         241.44
  Moffitt Colon    257.77
  Lung                      6.85

Chapter 6: Statistical Analysis

The accuracy results described in Section 5.6 were further analyzed for statistical significance in the accuracy differences between methods. We used the Friedman/Holm test, which has been discussed in [18, 64-66].

The Friedman test is a non-parametric test that allows the comparison of two or more classifiers. It ranks the methods being compared, ranging from 1-3 in this work, 1 and 3 being the highest and the lowest ranks respectively. Ties for rank 1 are each given 1.5. The null hypothesis states there are no differences between the methods. When the null hypothesis is rejected, a post-hoc test follows to determine the method with better results.

In this work the Holm procedure was used as the post-hoc test. It consists of sequentially testing hypotheses starting with the most significant p value. When a hypothesis is rejected, the Holm procedure moves on to the next p value and continues until no null hypothesis can be rejected.

We applied the Friedman/Holm test to the top 50 accuracies resulting from two scenarios: a) starting with the entire set of features in the dataset (Fig. 5.1 through Fig. 5.4), and b) starting with a preselected set of n features, n being the number of features with p values <= 0.01 from a t-test. The latter criterion implied that each dataset had a different-in-size initial set of features. The colon cancer dataset started with 201 features, the leukemia dataset with 437 features, the Moffitt colon dataset with 50 features, and the lung cancer dataset with 375 features.

Our sample size was 10; that is, for each dataset we conducted the 10-fold cross validation experiment 10 times, each with a different seed.

Throughout the following description of the statistical analysis, the terms different and better are used to mean statistically significantly different and statistically significantly better, respectively. We considered results statistically significant at a 95% confidence level (p values <= 0.05).

6.1 Starting with the Entire Set of Features

Table 6.1 shows the results for the colon cancer dataset. Each table entry shows the method that was found better (the winner) between the methods at the number of features indicated by the column title. A blank table entry means that no method was found significantly different.

When compared to IFP, the t-test was better for 35 feature sets, no difference was found for 14 feature sets, and IFP was found better for only 1 feature set. When compared to SVM-RFE, the t-test was better for 38 feature sets, and no difference was found for 12 feature sets. The comparison between IFP and SVM-RFE indicated that both methods were nearly the same, except at 1 feature, where IFP was the better classifier.

Table 6.2 shows the results for the leukemia dataset. The subtable corresponding to the range of 25-50 features was omitted since no method was found different throughout this range. When compared to IFP, the t-test was better for 11 feature sets, and no method was found different for 39 feature sets. When compared to SVM-RFE, the t-test was better for 9 feature sets, and no difference was found for 41 feature sets. No difference was found between IFP and SVM-RFE.
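As a hedged sketch only, the procedure at one feature-set size could look as follows with scipy; pairwise_pvalues is a hypothetical helper for the post-hoc pairwise comparisons, which we do not spell out here.

    # Friedman test over the per-seed accuracies of the three methods at one
    # feature-set size (ten repeated 10-fold runs => sample size 10), followed
    # by a Holm step-down over the pairwise post-hoc p values.
    from scipy.stats import friedmanchisquare

    def friedman_holm(acc_ifp, acc_rfe, acc_ttest, pairwise_pvalues, alpha=0.05):
        stat, p = friedmanchisquare(acc_ifp, acc_rfe, acc_ttest)
        if p > alpha:
            return []                          # null hypothesis not rejected
        pvals = sorted(pairwise_pvalues)       # most significant first
        m = len(pvals)
        rejected = []
        for i, pv in enumerate(pvals):
            if pv <= alpha / (m - i):          # Holm's adjusted threshold
                rejected.append(pv)
            else:
                break                          # stop at the first acceptance
        return rejected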

Table 6.1: Statistical analysis of results on the colon cancer dataset between the methods IFP (I), SVM-RFE (R), and the t-test (T) across features (a) 50 to 38, (b) 37 to 25, (c) 24 to 12, and (d) 11 to 1 with no previous preselection of features.

(a) Features 50 down to 38:
  t-test vs IFP:      T T T T T T T T T T T T T
  t-test vs SVM-RFE:  T T T T T T T T T T T T T
  IFP vs SVM-RFE:

(b) Features 37 down to 25:
  t-test vs IFP:      T T T T T T T T T T T T T
  t-test vs SVM-RFE:  T T T T T T T T T T T T T
  IFP vs SVM-RFE:

(c) Features 24 down to 12:
  t-test vs IFP:      T T T T T T T T
  t-test vs SVM-RFE:  T T T T T T T T T T T
  IFP vs SVM-RFE:

(d) Features 11 down to 1:
  t-test vs IFP:      T I
  t-test vs SVM-RFE:  T
  IFP vs SVM-RFE:     I

Table 6.2: Statistical analysis of results on the leukemia dataset between the methods IFP (I), SVM-RFE (R), and the t-test (T) across features (a) 24 to 12 and (b) 11 to 1 with no previous preselection of features.

(a) Features 24 down to 12:
  t-test vs IFP:      T T T T T T T T
  t-test vs SVM-RFE:  T T T
  IFP vs SVM-RFE:

(b) Features 11 down to 1:
  t-test vs IFP:      T T T
  t-test vs SVM-RFE:  T T T T T T
  IFP vs SVM-RFE:

The results for the Moffitt colon cancer dataset showed that no difference was found between IFP and the t-test for 49 feature sets. IFP was better for 1 feature set (12 features). When compared to SVM-RFE, the t-test was better for 1 feature set (39 features), no difference was found for 48 feature sets, and SVM-RFE was found better for 1 feature set (12 features). No difference was found between IFP and SVM-RFE.

The results for the lung cancer dataset showed that no difference was found between IFP and the t-test for 44 feature sets. IFP was better for 3 feature sets (7, 6, and 5 features). The t-test was better for 3 feature sets (42, 34, and 23 features). When compared to SVM-RFE, no difference was found for 44 feature sets. SVM-RFE was better for 3 feature sets (7, 6, and 5 features). The t-test was better for 3 feature sets (42, 37, and 34 features). No difference was found between IFP and SVM-RFE.

Interestingly, the statistical analysis showed that the more complicated feature selection algorithms, IFP and SVM-RFE, did not generally result in better classifiers using a support vector machine as the base classifier on microarray data.

6.2 Starting with a Preselected Set of n Features

This section describes the statistical analysis performed on the results derived from experiments under the aforementioned scenario 2. That is, prior to the application of IFP and SVM-RFE, each dataset underwent a preselection of the top n features, where n was the number of features with p values <= 0.01.

Table 6.3 shows the results on the colon cancer dataset. When compared to IFP, the t-test was better on 20 feature sets (15 fewer than without preselection). No difference was found in 29 feature sets (15 more than without preselection). IFP was better in 1 feature set. When compared to SVM-RFE, the t-test was found better on 15 feature sets (23 fewer than without preselection). No difference was found in 35 feature sets (23 more than without preselection). IFP was better than SVM-RFE in 1 feature set (6 features).

Results on this dataset indicated that there was a change in the performance of IFP and SVM-RFE from doing an upfront preselection of genes, as follows. The number of feature sets where the t-test was found better than IFP and SVM-RFE decreased after preselecting. Also, the number of feature sets where no difference was found between the methods being compared increased after preselecting. So, it was not the case that IFP and SVM-RFE were found better than the t-test; however, these methods got enough performance improvement after preselecting that they showed no difference with the t-test in a greater number of feature sets.

Table 6.4 shows the results on the leukemia dataset. No difference was found between IFP and the t-test on 34 feature sets (5 fewer than without preselection). Of these 5 feature sets, IFP was better on 3 (3 more than without preselection). The t-test was now better on 13 feature sets (2 more than without preselection).

Table 6.3: Statistical analysis of results on the colon cancer dataset between the methods IFP (I), SVM-RFE (R), and the t-test (T) across features (a) 50 to 38, (b) 37 to 25, (c) 24 to 12, and (d) 11 to 1 with previous preselection of n features (those with p values <= 0.01).

(a) Features 50 down to 38:
  t-test vs IFP:      T T T T T T T T T T T T
  t-test vs SVM-RFE:  T T T T T T T T T T T T
  IFP vs SVM-RFE:

(b) Features 37 down to 25:
  t-test vs IFP:      T T T T T T T
  t-test vs SVM-RFE:  T T T
  IFP vs SVM-RFE:

(c) Features 24 down to 12:
  t-test vs IFP:      T
  t-test vs SVM-RFE:
  IFP vs SVM-RFE:     R

(d) Features 11 down to 1:
  t-test vs IFP:      I
  t-test vs SVM-RFE:
  IFP vs SVM-RFE:     I

When compared to SVM-RFE, the t-test was found better for 11 feature sets, and no difference was found for 39 feature sets. The comparison between IFP and SVM-RFE showed no difference on 46 feature sets (4 fewer than without preselection). SVM-RFE was found better than IFP on 4 feature sets (4 more than without preselection).

The results for the Moffitt colon cancer dataset indicated that after preselection no difference was found between IFP and the t-test throughout all 50 features. Without preselection, IFP was better than the t-test on 1 feature set. On the other hand, no difference was found between the t-test and SVM-RFE on 49 feature sets (1 more than without preselection). That 1 feature set benefited SVM-RFE, which improved its performance to reach that of the t-test. SVM-RFE was still found better than the t-test on 1 feature set. The comparison between IFP and SVM-RFE showed no difference on 47 feature sets (3 fewer than without preselection). SVM-RFE was found better than IFP on 3 feature sets (3 more than without preselection).

Table 6.5 shows the results on the lung cancer dataset. The subtable corresponding to the range of 34-50 features was omitted since no method was found different throughout this range. When compared to the t-test, IFP was better on 5 feature sets (2 more than without preselection). The t-test was not found better at all (it was better on 3 feature sets without preselection). No difference was found in 45 feature sets (1 more than without preselection). When compared to SVM-RFE, the t-test was better on 4 feature sets (1 more than without preselection). SVM-RFE was better on only 1 feature set (2 fewer than without preselection). No difference was found on 45 feature sets (1 more than without preselection). IFP was better than SVM-RFE on 14 feature sets (14 more than without preselection).

Table 6.4: Statistical analysis of results on the leukemia dataset between the methods IFP (I), SVM-RFE (R), and the t-test (T) across features (a) 50 to 38, (b) 37 to 25, (c) 24 to 12, and (d) 11 to 1 with previous preselection of n features (those with p values <= 0.01).

(a) Features 50 down to 38:
  t-test vs IFP:      I I I
  t-test vs SVM-RFE:
  IFP vs SVM-RFE:

(b) Features 37 down to 25:
  t-test vs IFP:
  t-test vs SVM-RFE:  T
  IFP vs SVM-RFE:

(c) Features 24 down to 12:
  t-test vs IFP:      T T T T T T T T
  t-test vs SVM-RFE:  T T T T T
  IFP vs SVM-RFE:     R

(d) Features 11 down to 1:
  t-test vs IFP:      T T T T T
  t-test vs SVM-RFE:  T T T T T
  IFP vs SVM-RFE:     R R R

Table 6.5: Statistical analysis of results on the lung cancer dataset between the methods IFP (I), SVM-RFE (R), and the t-test (T) across features (a) 33 to 21, (b) 20 to 8, and (c) 7 to 1 with previous preselection of n features (those with p values <= 0.01).

(a) Features 33 down to 21:
  t-test vs IFP:
  t-test vs SVM-RFE:  T T
  IFP vs SVM-RFE:     I I

(b) Features 20 down to 8:
  t-test vs IFP:      I
  t-test vs SVM-RFE:  T T
  IFP vs SVM-RFE:     I I I I I I I I I I I

(c) Features 7 down to 1:
  t-test vs IFP:      I I I I
  t-test vs SVM-RFE:  R
  IFP vs SVM-RFE:     I

Interestingly, our results showed that the preselection of genes prior to the application of IFP or SVM-RFE mostly makes a positive impact on the performance of these feature selection methods.

6.3 Preselection vs. No Preselection of IFP and SVM-RFE

The previous sections described the statistical significance analysis performed on the top 50 accuracies obtained for each of the methods on each dataset. First, we analyzed the scenario without doing preselection prior to the application of IFP and SVM-RFE. Second, we analyzed the scenario where we did a preselection of features/genes prior to the application of these two methods.

In this section we statistically describe how each method compared to itself in its two versions: with no preselection and with preselection of genes. The analysis was performed on each dataset, and showed at each feature set which version was statistically significantly better, or whether no statistically significant difference was found between the two versions.

Table 6.6: Statistical comparison between doing preselection (P) and not doing preselection (NP) with the methods IFP and SVM-RFE across features (a) 50 to 38, (b) 37 to 25, (c) 24 to 12, and (d) 11 to 1 on the colon cancer dataset.

(a) Features 50 down to 38:
  IFP:      P P P P P P P P
  SVM-RFE:

(b) Features 37 down to 25:
  IFP:      P P P P P P P P P P
  SVM-RFE:  P P P P P P P P P P P

(c) Features 24 down to 12:
  IFP:      P P P P
  SVM-RFE:  P P P P P P P P P P P

(d) Features 11 down to 1:
  IFP:      P P P P
  SVM-RFE:  P

Table 6.6 shows the results on the colon cancer dataset. On IFP, doing preselection was found better on 26 feature sets. No difference between the two versions was found on 24 feature sets. On SVM-RFE, doing preselection was found better on 23 feature sets. No difference was found on 27 feature sets.

Results on the leukemia dataset showed that for IFP, doing preselection was found better on 4 feature sets (45, 46, 47, and 48 features). No difference was found between the two versions on 46 feature sets. On SVM-RFE, doing preselection was found better in 1 feature set (21 features). No difference was found in 49 feature sets.

Results on the Moffitt colon cancer dataset showed that for IFP, not doing preselection was found better on 3 feature sets (9, 13, and 14 features). No difference was found between the two versions on 47 feature sets. On SVM-RFE, no difference at all was found across the 50 feature sets.

Table 6.7 shows the results on the lung cancer dataset. The subtable corresponding to the range 50-33 was omitted since no difference between the two scenarios was found throughout this range. On IFP, doing preselection was found better on 13 feature sets. No difference between the two versions was found on 37 feature sets. On SVM-RFE, not doing preselection was found better on 1 feature set. No difference was found on 49 feature sets.

Our results showed that the preselection of features made a statistically significant difference on some feature sets on the colon cancer, leukemia, and lung cancer datasets.


Table 6.7: Statistical comparison between doing preselection (P) and not doing preselection (NP) with the methods IFP and SVM-RFE across features (a) 32 to 20, (b) 19 to 7, and (c) 6 to 1 on the lung cancer dataset.

(a) features 32 to 20
    IFP:      P P P P P
    SVM-RFE:

(b) features 19 to 7
    IFP:      P P P P P P
    SVM-RFE:  NP

(c) features 6 to 1
    IFP:      P P
    SVM-RFE:


Chapter 7: Filtering for Improved Gene Selection

This chapter demonstrates the use of filtering methods in combination with embedded feature selection methods. Three popular filtering methods were considered: the t-test [5, 16], information gain [67], and ReliefF [68]. These methods rank individual features based on data characteristics and do not invoke classifiers to evaluate feature quality. Filtering can quickly reduce the number of features available for embedded feature selection techniques, such as IFP or SVM-RFE.

We investigated the effect on accuracy of gene selection by first reducing the dimensionality of microarray datasets via filters, and second applying an embedded gene selection algorithm on the resultant reduced datasets. Much of this chapter has been published in [69].

7.1 Experimental Design and Evaluation

Here, we expand our investigation of the use of filters for dimensionality reduction prior to the work of embedded gene selection algorithms. We experimented on four microarray datasets (described in Section 5.1): colon cancer (cancer vs. normal), leukemia (subtype classification), Moffitt colon cancer (prognosis prediction), and lung cancer (prognosis prediction).


Evaluation of our experimental results includes the use of weighted accuracy as described in Section 5.3. Also, for an appropriate accuracy estimate, our experiments were performed within a 10-fold cross validation repeated five times with different seeds, as described in Section 5.6. Parameters used for the support vector machine are the same as those described earlier in Section 5.2.

7.2 Effect of Filtering in Terms of SVM Accuracy

Each of the datasets under study (in the form resulting from preprocessing as described in Section 5.1) was filtered to include the top k genes, with k ranging from 200 genes down to one gene, using three filters: the t-test [5, 16], information gain [67], and ReliefF [68]. The t-test filter starts with the 200 features with the smallest p values, the information gain filter with the 200 features with the greatest information gain, and ReliefF with the 200 features with the highest quality (the features with the greatest weights). That is, 200 new reduced datasets, each with a different number of features, were created out of the original by each filter for each dataset. This gene filtering was performed within the 10-fold cross validation process (in fact, 200 reduced datasets were created for each fold). The 10-fold cross validation SVM weighted accuracy was calculated for each set of k genes, ranked by each individual filter, across the four datasets. Averages over five runs of the 10-fold cross validation process are used in our graphs.
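As an illustration of this filtering step (not the original experimental code), the following minimal Python sketch ranks genes with the two-sample t-test inside each training fold and evaluates a linear SVM on the held-out fold; balanced accuracy is used here as a stand-in for the weighted accuracy of Section 5.3, and all function names are our own.

# Minimal sketch: t-test gene filtering applied inside each CV fold,
# assuming a two-class problem with X (samples x genes) and y in {0, 1}.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

def ttest_rank(X_train, y_train):
    """Rank genes by two-sample t-test p value (smallest first)."""
    _, pvals = ttest_ind(X_train[y_train == 0], X_train[y_train == 1], axis=0)
    return np.argsort(pvals)                 # most significant gene first

def filtered_cv_accuracy(X, y, k=200, seed=0):
    """Mean balanced accuracy of a linear SVM on the top-k t-test genes,
    with the filter re-fit on the training portion of every fold."""
    accs = []
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for tr, te in cv.split(X, y):
        top = ttest_rank(X[tr], y[tr])[:k]   # filter uses training data only
        clf = SVC(kernel="linear", C=1.0).fit(X[tr][:, top], y[tr])
        accs.append(balanced_accuracy_score(y[te], clf.predict(X[te][:, top])))
    return float(np.mean(accs))

Repeating filtered_cv_accuracy with five different seeds and averaging mirrors the five repeated cross-validation runs used for the graphs.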


Figure 7.1: SVM accuracy across the top 200 genes filtered with the t-test, information gain, and ReliefF for the colon cancer dataset.

In the colon cancer dataset (Fig. 7.1), a clear increase in accuracy was observed when reducing the number of features used within the classifier. The t-test filtering generated the highest-accuracy SVM classifiers across the 200 genes most frequently. However, in the range of the top 20 genes, the information gain filter produced SVMs with the highest accuracies. For the leukemia dataset (Fig. 7.2), accuracies tended to decrease slightly with fewer features; however, the range of changes was small. The t-test and ReliefF filters alternately have the highest accuracy in the range of 200 to 58 genes. From 24 genes, the t-test started showing the highest accuracies.

For the Moffitt colon cancer dataset (Fig. 7.3), the overall predictive accuracy was low. The t-test resulted in the highest-accuracy SVM classifiers throughout the range of 200 to 116 genes. From 81 features down, the t-test alternates in highest accuracy with ReliefF.


Figure 7.2: SVM accuracy across the top 200 genes filtered with the t-test, information gain, and ReliefF for the leukemia dataset.

For the lung cancer dataset (Fig. 7.4), the accuracy was generally low; moreover, classifier performance was very poor with small numbers of features (e.g., < 20). The t-test produced the highest-accuracy SVM classifiers in the range of 200 to 75 genes. From 74 genes downward, the t-test and information gain showed similar accuracies.

The three filtering techniques vary significantly in accuracy in the experiments performed. Further investigation of the overlap in the 200 features selected by the different approaches indicated that a high overlap does not guarantee similar classifier performance. Among the genes selected by the three filters across the four 200-gene datasets, the highest overlap percentage was 71.34%, between the t-test and information gain for the leukemia dataset. The average case was 41.26% overlap.


Figure 7.3: SVM accuracy across the top 200 genes filtered with the t-test, information gain, and ReliefF for the Moffitt colon cancer dataset.

The lowest overlap was 11.32% and occurred between information gain and ReliefF for the Moffitt colon dataset (where the largest disparity between methods was observed).

7.3 Accuracy of IFP and SVM-RFE Using Filtered Datasets

For additional gene selection after filtering, we applied the methods IFP and SVM-RFE to each of the 200-gene datasets. Only the 200-gene datasets were used, since IFP and SVM-RFE calculated accuracies for each feature set all the way from 200 features down to 1 feature. The average weighted accuracies over the five runs are reported in Fig. 7.5, Fig. 7.7, Fig. 7.9, and Fig. 7.11 for the performance using information gain as the filter on the colon cancer, leukemia, Moffitt colon cancer, and lung cancer datasets, respectively.
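For readers who want a concrete picture of the elimination step, the sketch below shows recursive feature elimination with a linear SVM via scikit-learn's RFE class; it is an illustrative stand-in under our own assumptions, not the thesis implementation.

# Minimal sketch of SVM-RFE on a 200-gene dataset: repeatedly fit a linear
# SVM and drop the gene with the smallest squared weight until one remains.
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

def svm_rfe_ranking(X200, y):
    """X200: samples x 200 genes. Returns ranks 1 (kept longest) .. 200."""
    rfe = RFE(estimator=SVC(kernel="linear", C=1.0),
              n_features_to_select=1, step=1)  # eliminate one gene per round
    rfe.fit(X200, y)
    return rfe.ranking_                        # rank 1 survives to the end

# Accuracy curves such as Fig. 7.5 re-train the SVM on each nested subset:
# the genes with ranking <= k form the k-gene feature set.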


Figure 7.4: SVM accuracy across the top 200 genes filtered with the t-test, information gain, and ReliefF for the lung cancer dataset.


The corresponding performance using ReliefF as the filter is shown in Fig. 7.6, 7.8, 7.10, and 7.12. The corresponding graphs for the t-test-filtered datasets were shown in Fig. 5.9 through Fig. 5.12.

Each chart includes the line corresponding to the SVM classifier accuracies across the 200 genes for the filter applied to the datasets. The charts also show a comparison between the performance of the filter applied and the best performance (in terms of the highest AUC value, see Table 7.2) attained across the three filters for each dataset.

For the colon cancer dataset, the best performance observed across the three filters applied to it was that of the t-test feature ranking (in terms of SVM classifier accuracy), which is compared against the performance of the filters information gain (Fig. 7.5) and ReliefF (Fig. 7.6). The information gain feature ranking showed the highest accuracies from the point where 20 features were left (Fig. 7.5).

For the leukemia dataset, the best performance observed was that of IFP applied on the t-test-filtered dataset, which is compared against the performance of the filters information gain (Fig. 7.7) and ReliefF (Fig. 7.8). IFP applied on the ReliefF-filtered dataset (Fig. 7.8) showed the highest accuracies from the point where 60 features were left.

For the Moffitt colon cancer dataset, the best performance observed was that of SVM-RFE applied on the t-test-filtered dataset, which is compared against the performance of the filters information gain (Fig. 7.9) and ReliefF (Fig. 7.10). Overall, accuracies achieved in experiments with ReliefF were higher than those of information gain.

For the lung cancer dataset, the best performance observed across the three filters applied to it was that of the t-test feature ranking (in terms of SVM classifier accuracy), which is compared against the performance of the filters information gain (Fig. 7.11) and ReliefF (Fig. 7.12). The information gain feature ranking performed similarly to the t-test feature ranking


Figure 7.5: Comparison of resulting average weighted accuracy of feature ranking given by the methods IFP, SVM-RFE, and information gain (in terms of SVM classifier accuracy) on the colon cancer dataset across the top 200 features as filtered by information gain.

from the point where 102 features were left, except toward the end, where the information gain feature ranking achieved the highest accuracies (Fig. 7.11).

Table 7.1 shows the experiment that achieved the highest accuracy considering all nine experiments for each dataset: three filters were applied to each dataset, and for each filter, accuracies were given for IFP, SVM-RFE, and the filter itself in terms of SVM classifier accuracy.

Table 7.1: Highest accuracies attained for each dataset across three filters.

Dataset        Highest accuracy (%)  No. features  Experiment
Colon          90                    17            information gain (SVM)
                                     20            SVM-RFE applied on t-test-filtered dataset
Leukemia       96.94                 13            IFP applied on ReliefF-filtered dataset
Moffitt colon  58.98                 102           IFP applied on t-test-filtered dataset
Lung           59.85                 162           t-test (SVM)


Figure 7.6: Comparison of resulting average weighted accuracy of feature ranking given by the methods IFP, SVM-RFE, and ReliefF (in terms of SVM classifier accuracy) on the colon cancer dataset across the top 200 features as filtered by ReliefF.

We calculated the AUC value for each of the average-accuracy curves across the four 200-gene datasets in Fig. 5.9 through Fig. 5.12 and Fig. 7.5 through Fig. 7.12 (the computation is sketched below). Table 7.2 shows the AUC values for each filter/classifier combination applied across our four datasets. Interestingly, the highest AUC values (marked entries in the table) were attained in experiments conducted using the t-test as a filter.

In addition to the overall comparison in terms of AUC values shown in Table 7.2, Fig. 7.13 shows an AUC-based comparison of the IFP method applied to the 200-gene filtered datasets. Of particular note, across the four datasets the IFP method achieved the highest AUC values when applied to datasets that were filtered with the t-test.

Fig. 7.14 shows the AUC-based comparison of the SVM-RFE method applied to the 200-gene filtered datasets. Similarly to IFP, SVM-RFE reached the highest AUC values when applied to datasets that were filtered with the t-test. Taken together, Fig. 7.13 and Fig. 7.14 suggest that, as a filter, the simple t-test can be as good as and even better than information gain and ReliefF.
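The AUC in this chapter is the area under an accuracy-versus-number-of-features curve (not a ROC area). A minimal sketch of how such a value can be computed with the trapezoidal rule follows; the curve values here are random placeholders, not data from the experiments.

# Minimal sketch: area under an average-accuracy curve, where acc[i] is the
# mean weighted accuracy (%) obtained with n_genes[i] features.
import numpy as np

n_genes = np.arange(200, 0, -1)            # 200, 199, ..., 1 features
acc = np.random.uniform(80, 90, size=200)  # placeholder accuracy curve (%)

# Trapezoidal rule over the feature axis; a larger area means the curve
# stays higher across the whole range of feature-set sizes.
auc = np.trapz(acc[::-1], n_genes[::-1])   # integrate with x increasing
print(f"AUC of the accuracy curve: {auc:.2f}")

With accuracies near 85% over 200 feature-set sizes, this integral lands near 17,000, which is the scale of the values reported in Table 7.2.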


Figure 7.7: Comparison of resulting average weighted accuracy of feature ranking given by the methods IFP, SVM-RFE, and information gain (in terms of SVM classifier accuracy) on the leukemia dataset across the top 200 features as filtered by information gain.

On the other hand, we investigated the difference in accuracy attained between applying IFP and SVM-RFE on the four original non-filtered datasets and on the 200-gene filtered datasets. Accuracy gains/changes were observed, and they are shown in Table 7.3. These numbers represent average improvements calculated over 10 experiments for each method (IFP and SVM-RFE) applied on the non-filtered datasets, compared to 10 experiments for the same two methods applied on the 200-gene filtered datasets.

Interestingly, both methods (IFP and SVM-RFE) showed accuracy improvements across the four t-test-filtered datasets. However, the information gain-filtered datasets resulted


Figure 7.8: Comparison of resulting average weighted accuracy of feature ranking given by the methods IFP, SVM-RFE, and ReliefF (in terms of SVM classifier accuracy) on the leukemia dataset across the top 200 features as filtered by ReliefF.

in accuracy improvements for both IFP and SVM-RFE only when applied on the colon dataset, and for SVM-RFE only when applied on the lung dataset. The ReliefF-filtered datasets resulted in accuracy improvements for both IFP and SVM-RFE when applied on the colon and Moffitt colon cancer datasets, and for IFP when applied on the leukemia dataset. The accuracy improvements shown in Table 7.3 illustrate the positive accuracy impact of a dimensionality reduction process performed prior to the actual gene selection process.

Across all our experiments, the simple t-test used as a filter was shown to work well with microarray data; however, based on the cases of accuracy loss observed for IFP and SVM-RFE on datasets filtered with information gain and ReliefF, we believe there may be better filters for microarray data.


Figure 7.9: Comparison of resulting average weighted accuracy of feature ranking given by the methods IFP, SVM-RFE, and information gain (in terms of SVM classifier accuracy) on the Moffitt colon cancer dataset across the top 200 features as filtered by information gain.

The t-test has been used across four datasets and three filters. The AUC analysis indicated that the highest AUC values were achieved in experiments where the t-test was used as a filter (Table 7.2). Accuracy improvements for IFP and SVM-RFE (Table 7.3) were always observed across the four datasets filtered with the t-test. Moreover, both IFP and SVM-RFE obtained the highest AUC values when applied on datasets filtered with the t-test (Fig. 7.13 and Fig. 7.14). The t-test is at least as accurate as the more complex information gain and ReliefF.


Figure 7.10: Comparison of resulting average weighted accuracy of feature ranking given by the methods IFP, SVM-RFE, and ReliefF (in terms of SVM classifier accuracy) on the Moffitt colon cancer dataset across the top 200 features as filtered by ReliefF.

7.4 Statistical Comparison of Filtering Methods

The accuracy curves resulting from IFP and SVM-RFE applied on the 200-gene datasets (see Section 7.3) were further analyzed using the range of 50 genes down to 1 gene. This range was used to evaluate the statistical significance of accuracy differences. Such statistical comparisons allowed the identification of the gene sets for which each method attained significantly better accuracy on each dataset. The sample size for the comparison was 10; that is, for each filter and for each dataset we conducted the 10-fold cross validation experiment 10 times with different seeds.

We used the Friedman/Holm test for statistical analysis, which was briefly described in Chapter 6. Table 7.4 shows comparison results from applying IFP on the four datasets.
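As a reminder of the procedure, the sketch below runs a Friedman test over matched accuracy samples and then applies Holm's step-down correction to pairwise follow-up p values. It is a schematic illustration with made-up inputs; the pairwise Wilcoxon follow-up is our own choice for the sketch, not necessarily the exact post-hoc test of the thesis.

# Minimal sketch of a Friedman test followed by Holm's step-down correction,
# assuming `samples` holds one accuracy list per method over the same 10 runs.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

samples = {  # illustrative accuracies for one feature-set size
    "IFP":     [86.1, 85.7, 86.3, 85.9, 86.0, 85.8, 86.2, 85.6, 86.4, 85.9],
    "SVM-RFE": [85.2, 85.0, 85.5, 84.9, 85.3, 85.1, 85.4, 84.8, 85.6, 85.0],
    "t-test":  [87.0, 86.8, 87.2, 86.7, 87.1, 86.9, 87.3, 86.6, 87.4, 86.8],
}
stat, p = friedmanchisquare(*samples.values())
print(f"Friedman p = {p:.4f}")

if p < 0.05:                       # follow up with pairwise tests + Holm
    names = list(samples)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    pvals = [wilcoxon(samples[a], samples[b]).pvalue for a, b in pairs]
    m = len(pvals)
    for rank, idx in enumerate(np.argsort(pvals)):
        alpha_adj = 0.05 / (m - rank)      # Holm: alpha/m, alpha/(m-1), ...
        verdict = "significant" if pvals[idx] <= alpha_adj else "n.s."
        print(pairs[idx], f"p={pvals[idx]:.4f} vs {alpha_adj:.4f}: {verdict}")
        if pvals[idx] > alpha_adj:
            break                          # Holm stops at first non-rejection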


Figure 7.11: Comparison of resulting average weighted accuracy of feature ranking given by the methods IFP, SVM-RFE, and information gain (in terms of SVM classifier accuracy) on the lung cancer dataset across the top 200 features as filtered by information gain.

Each row of Table 7.4 indicates the two filters applied to the dataset named in the column title to obtain the 200-gene datasets. IFP was applied on these reduced datasets, and the accuracy results were statistically compared. Each table entry identifies the filter for which the IFP results were statistically significantly better, along with the gene sets where this happened. For example, T:24-22 means that IFP on the t-test-filtered dataset was significantly better in the range of 24 to 22 genes. We observed that when IFP applied to a t-test-filtered dataset was compared to IFP applied to an information gain-filtered dataset, the t-test-filtered dataset gave significantly better results on a greater number of gene sets, except for the leukemia dataset, for which the information gain-filtered dataset showed the greatest number of significantly better gene sets.


Figure 7.12: Comparison of resulting average weighted accuracy of feature ranking given by the methods IFP, SVM-RFE, and ReliefF (in terms of SVM classifier accuracy) on the lung cancer dataset across the top 200 features as filtered by ReliefF.

We proceeded similarly for SVM-RFE (Table 7.5), for which the t-test was always the favored filter when a significant difference against it was found. Between information gain and ReliefF, the former was the favored filter when significant differences were found for the colon and lung datasets. ReliefF was better for the leukemia and Moffitt colon datasets.

We also conducted the SVM accuracy statistical comparison (Table 7.6) when using just the three filters, starting with our 200-gene set and then examining only the "best" 50 genes. The t-test results were significantly better when there was a difference for all experiments except the lung dataset (vs. information gain) in a few isolated gene sets. Between information gain and ReliefF, the former was significantly better when a difference was found for the leukemia and lung datasets. ReliefF was better for the Moffitt colon dataset. Both information gain and ReliefF were significantly better at different gene sets for the colon dataset.


Table 7.2: AUC analysis of accuracy curves across all four filtered datasets, using the (a) t-test, (b) information gain, and (c) ReliefF. Entries marked with * represent the highest AUC values across all 3 filters for each dataset.

(a) t-test filter
Dataset        IFP        SVM-RFE    t-test
Colon          16962.56   16958.48   17252.42*
Leukemia       18940.77*  18877.87   18636.43
Moffitt colon  11200.31   11219.15*  10961.38
Lung           11598.31   11593.80   11600.65*

(b) information gain filter
Dataset        IFP        SVM-RFE    information gain
Colon          16754.69   16715.87   16944.77
Leukemia       18558.02   18499.82   18529.33
Moffitt colon  9851.37    9701.76    9435.32
Lung           11162.36   11099.16   11452.19

(c) ReliefF filter
Dataset        IFP        SVM-RFE    ReliefF
Colon          16318.74   16220.00   16920.33
Leukemia       18824.12   18631.90   18596.31
Moffitt colon  10925.57   10817.46   10799.18
Lung           10660.74   10636.86   10763.78

Our statistical analysis indicates that the t-test can be a good filter for microarray datasets. This is of particular interest since the t-test as a filter is less complex than information gain or ReliefF.


Figure 7.13: Comparison of AUC values corresponding to the average accuracy curves of IFP applied across the top 200 genes filtered with the t-test, information gain, and ReliefF for each dataset.

7.5 Analysis of Overlap Between Genes Selected

We investigated the percentage of overlap between the genes selected by IFP and by SVM-RFE when each was applied on the 200-gene datasets filtered with the t-test, information gain, and ReliefF for the four datasets. We considered all 5 runs of each experiment for the calculation of average percentages. This analysis was performed at 50 genes. Overall, the results shown in Table 7.7 indicate that IFP and SVM-RFE selected genes with an overlap between 63% and 76% on average across the four datasets.
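The overlap percentage between two selected gene sets reduces to the size of their intersection over the set size. A minimal sketch (our own illustration, with toy sets standing in for the 50-gene selections) follows.

# Minimal sketch: percentage of overlap between two equal-size gene
# selections, averaged over paired runs of two methods.
def overlap_percentage(genes_a, genes_b):
    """genes_a, genes_b: collections of selected gene indices, equal size."""
    a, b = set(genes_a), set(genes_b)
    return 100.0 * len(a & b) / len(a)

# Example: two runs of two methods (tiny toy sets instead of 50 genes each).
runs_ifp = [{1, 2, 3, 4}, {1, 2, 5, 6}]
runs_rfe = [{1, 2, 3, 7}, {2, 5, 6, 8}]
avg = sum(overlap_percentage(a, b)
          for a, b in zip(runs_ifp, runs_rfe)) / len(runs_ifp)
print(f"average overlap: {avg:.1f}%")   # 75.0% for this toy example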


Figure 7.14: Comparison of AUC values obtained with SVM-RFE across the top 200 genes filtered with the t-test, information gain, and ReliefF for each dataset.

We also looked at the percentage of overlap between the genes selected by IFP applied on the 200-gene datasets and the genes selected by each filter. The results shown in Table 7.8(a) indicate that these overlaps lie between 21% and 34%. These percentages indicate that the method IFP selected different genes from those selected by each of the filters. We performed a similar analysis for SVM-RFE, and the results shown in Table 7.8(b) were between 23% and 43% across the four datasets.

On the other hand, we were interested in the percentage of overlap between the genes selected by any two filters: t-test vs. ReliefF, t-test vs. information gain, and information gain vs. ReliefF. The results are shown in Table 7.9. For the colon and leukemia datasets, the overlap percentages were between 47% and 64%. For the Moffitt colon and the lung cancer datasets, the percentages were as low as 11.1% to 20%, except for the overlap between the genes selected by the t-test and information gain for the lung cancer dataset, which was 62.1%.


Table 7.3: Average accuracy changes between IFP and SVM-RFE applied to the four original non-filtered datasets and IFP and SVM-RFE applied to the 200-gene filtered datasets. The filters used were the t-test, information gain, and ReliefF on the (a) colon and leukemia and (b) Moffitt colon and lung datasets. Positive entries represent accuracy improvements observed.

(a)
                   Colon              Leukemia
Filter / Method    IFP      SVM-RFE   IFP      SVM-RFE
t-test             4.62%    3.79%     0.70%    0.03%
Info gain          4.16%    3.83%     -1.16%   -1.99%
ReliefF            1.54%    0.17%     0.15%    -0.63%

(b)
                   Moffitt colon      Lung
Filter / Method    IFP      SVM-RFE   IFP      SVM-RFE
t-test             4.15%    0.03%     3.81%    4.42%
Info gain          -6.58%   -1.99%    -0.17%   0.25%
ReliefF            2.87%    2.22%     -4.15%   -4.19%

Table 7.4: Statistical comparison across the top 50 features of the accuracies from IFP on the best 200 genes filtered according to the t-test (T), information gain (G), and ReliefF (F) for the four datasets. Entries show the gene sets for which significant accuracy differences were found and in whose favor they were. T:24-22 means the t-test was significantly better from 24 to 22 genes.

t-test vs Infogain:
    Colon:          T: 24-22
    Leukemia:       G: 34, 30, 26-22, 22, 20, 19
    Moffitt colon:  T: 43-41, 31, 30, 28, 14-11
    Lung:           T: 45, 43-40, 20, 19, 17, 17-13, 11, 1

t-test vs ReliefF:
    Leukemia:       F: 35-33, 31-28, 26-19, 17-10, 1
    Moffitt colon:  F: 11
    Lung:           T: 50-40, 38-35, 33-30, 28-27, 24-22, 20, 19, 17-14, 7, 4

Infogain vs ReliefF:
    Leukemia:       F: 30
    Moffitt colon:  F: 6, 5
    Lung:           F: 49-41, 38, 27, 11, 7


Table 7.5: Statistical comparison across the top 50 features of the accuracies from SVM-RFE on the best 200 genes filtered according to the t-test (T), information gain (G), and ReliefF (F) for the four datasets. Entries show the gene sets for which significant accuracy differences were found and in whose favor they were. T:15-13 means the t-test was significantly better from 15 to 13 genes.

t-test vs Infogain:
    Colon:          T: 21, 20, 18, 15-13; G: 8, 7
    Leukemia:       T: 23-20
    Moffitt colon:  T: 50, 47, 46, 44, 43, 41, 39, 35, 34
    Lung:           T: 43, 19; G: 7, 6, 2

t-test vs ReliefF:
    Colon:          T: 48, 35, 31, 29-20, 18, 15-12, 1
    Moffitt colon:  T: 50-46, 44, 43, 41-29, 27, 25, 23-19, 3
    Lung:           F: 7

Infogain vs ReliefF:
    Colon:          G: 50, 48, 47, 45-42, 39-35, 32-25, 7, 2, 1
    Leukemia:       F: 22-20
    Moffitt colon:  F: 50-46, 44, 41, 39
    Lung:           G: 50, 39-37, 34, 32, 29, 26, 24, 3

Table 7.6: Statistical comparison across the top 50 features of the accuracies from SVM on the best 200 genes filtered according to the t-test (T), information gain (G), and ReliefF (F) for the four datasets. Entries show the gene sets for which significant accuracy differences were found and in whose favor they were. T:50-35 means the t-test was significantly better from 50 to 35 genes.

t-test vs Infogain:
    Colon:          T: 50-35, 33-31, 4
    Leukemia:       T: 29, 23-9, 5, 4, 2, 1; G: 42, 8-6
    Moffitt colon:  T: 50, 48, 46, 40-36, 34; G: 11-9
    Lung:           G: 49, 46

t-test vs ReliefF:
    Colon:          T: 50-48, 46-40, 38, 37, 17, 13, 8; F: 2, 1
    Leukemia:       T: 29, 23, 21-4, 2; F: 42
    Moffitt colon:  T: 6
    Lung:           T: 44-8, 6

Infogain vs ReliefF:
    Colon:          F: 32, 5, 3-1; G: 18-16, 13-8
    Leukemia:       G: 20, 10-5
    Moffitt colon:  F: 50-46, 44, 43, 40, 38-36, 34-32, 30-26
    Lung:           G: 50-5


Table 7.7: Percentage of intersection (at 50 genes) between genes selected by IFP and SVM-RFE when applied on the top 200 genes filtered with the t-test, information gain, and ReliefF across the four datasets.

Filter / Dataset   Colon   Leukemia   Moffitt colon   Lung
information gain   76.28   66.12      71.6            68.5
ReliefF            73.76   63.6       74.2            68.5
t-test             75.88   66.56      70.16           71

Table 7.8: Percentage of intersection (at 50 genes) of genes selected by (a) IFP and (b) SVM-RFE with the genes selected by each filter across the four datasets. IFP and SVM-RFE were applied on the top 200 genes filtered by each filter.

(a) IFP
Filter             Colon   Leukemia   Moffitt colon   Lung
information gain   26.44   34.64      32.36           28.4
ReliefF            29.96   34         34.64           26.2
t-test             28.96   36.52      33.92           21.7

(b) SVM-RFE
Filter             Colon   Leukemia   Moffitt colon   Lung
information gain   30.04   34.56      31.4            27
ReliefF            29.08   43.08      35              25.5
t-test             31.36   36.36      33.32           23.3

Table 7.9: Percentage of intersection (at 50 genes) between genes selected by any two filters.

Filters                        Colon   Leukemia   Moffitt colon   Lung
t-test vs ReliefF              64.92   47.72      13.36           11.1
t-test vs information gain     61.76   71.12      20.08           62.1
information gain vs ReliefF    58.08   49.36      7.32            13.6

Last, we investigated the overlap of the genes selected by IFP, SVM-RFE, and each filter with themselves across runs. IFP and SVM-RFE were applied on the 200-gene datasets. These analyses allowed us to determine the stability of each method in selecting genes. The results, shown in Table 7.10, indicate that each of the filters was more stable than IFP and SVM-RFE across the four datasets.
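Stability here amounts to the average pairwise overlap of a method's 50-gene selections across repeated runs. A small illustrative sketch (our own, with toy run data) follows.

# Minimal sketch: selection stability as the mean pairwise overlap between
# the gene sets a method produced in different runs (toy sets shown).
from itertools import combinations

def stability(selections):
    """selections: list of equal-size sets of gene indices, one per run."""
    pairs = list(combinations(selections, 2))
    return 100.0 * sum(len(a & b) / len(a) for a, b in pairs) / len(pairs)

runs = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6, 7}]   # one set per run
print(f"stability: {stability(runs):.1f}%")          # 58.3% for this toy data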


Table 7.10: Percentage of intersection (at 50 genes) for each method with itself using (a) information gain, (b) ReliefF, and (c) the t-test, across the top 200 genes filtered with the corresponding filter for the four datasets.

(a) information gain filter
Method             Colon   Leukemia   Moffitt colon   Lung
IFP                46.35   55.95      43.9            33.3
SVM-RFE            47.45   62.85      40.7            31.8
information gain   71.75   77.85      51.1            71

(b) ReliefF filter
Method     Colon   Leukemia   Moffitt colon   Lung
IFP        52.9    66.95      37.05           31.7
SVM-RFE    49.7    63.6       37.1            30.9
ReliefF    74.7    79.8       50.1            56.7

(c) t-test filter
Method     Colon   Leukemia   Moffitt colon   Lung
IFP        52.9    59.45      44.8            46.7
SVM-RFE    53.15   62.1       41.65           46.7
t-test     78      87.2       47.15           77

7.6 Summary of Results Including Reviewed Feature Selection Methods

A summary of the results obtained by the feature selection approaches reviewed in Chapter 3, including our work with IFP, is shown in Table 7.11 and Table 7.12 for the colon cancer dataset, and in Table 7.13 and Table 7.14 for the leukemia dataset.


Table 7.11, Table 7.12, Table 7.13, and Table 7.14 show the portion of the corresponding dataset used in the experiments for feature selection, the feature selector applied, the number of features for which results are shown, and a brief description of results. As can be noted, different classifiers and evaluation methods were applied across the diverse feature selection methods. Test set accuracies, total accuracies, weighted accuracies, and error rates were used to express the results obtained.

Higher accuracies can be seen when all data is used for feature selection. However, this leaves no truly unseen test data. For the colon cancer dataset, our accuracies are comparable. For the leukemia dataset, leave-one-out testing can lead to 100%, especially when all data is used for feature selection. Our accuracies are comparable in the other cases.


Table 7.11: Summary of results from feature selection methods in [5, 9-11] for the colon cancer dataset.

[5]  Portion used for feature selection: 50% train, 50% test; features selected from train set.
     Feature selector: filter t-test and linear genetic programs.
     No. features: 7, 27, 54.
     Results: 85.91% test weighted accuracy; no stat. signif. test.

[9]  Portion used for feature selection: entire dataset.
     Feature selector: wrappers using 1-nearest neighbor / Naive Bayes / C4.5 / CN2.
     No. features: 4 / 2 / 3 / 3.
     Results (single LOOCV): 91.94% / 87.10% / 95.16% / 91.94%; paired t-test for stat. test.

[10] Portion used for feature selection: entire dataset.
     Feature selector: wrapper-filter WFFSA with ReliefF / Gain ratio / chi-squared.
     No. features: 10.9 avg. / 11.9 avg. / 10 avg.
     Results (avg. over 10 LOOCV): 97.90% / 96.29% / 95.97%; no stat. signif. test.

[11] Portion used for feature selection: entire dataset.
     Feature selector: SVM-RFE.
     No. features: 4.
     Results: 98% LOOCV; statist. signif. test.


Table 7.12: Summary of results from feature selection methods in [12, 47] and this work for the colon cancer dataset.

[12]      Portion used for feature selection: entire dataset.
          Feature selector: Random Forests.
          No. features: 200 / 14.
          Results (error rates using the 0.632+ bootstrap with 200 bootstrap samples): 0.127 / 0.159; no stat. signif. test.

[47]      Portion used for feature selection: 100 random splits into 42 training and 20 testing samples; features selected on train for each split.
          Feature selector: HHSVM.
          No. features: 94.5 avg.
          Results: 12.69% avg. test error; no stat. signif. test.

This work Portion used for feature selection: each fold.
          Feature selector: IFP starting with 2000 features / IFP starting with 200 features.
          No. features: 12 / 23.
          Results (avg. over 5 10-fold CV): 86.57% weighted accuracy (87.42% total accuracy) / 87.93% weighted accuracy (88.06% total accuracy); Friedman/Holm for stat. test.


Table 7.13: Summary of results from feature selection methods in [5, 8-10] for the leukemia dataset.

[5]  Portion used for feature selection: original train and test sets; features selected from train set.
     Feature selector: filter t-test and linear genetic programs.
     No. features: 6, 27, 53.
     Results: 100% test accuracy; no stat. signif. test.

[8]  Portion used for feature selection: original train and test sets; features selected from train set.
     Feature selector: wrappers using J48 (C4.5) / Naive Bayes.
     No. features: 1 / 1.
     Results: 91.18% test / 91.18% test; no stat. signif. test.

[9]  Portion used for feature selection: entire dataset.
     Feature selector: wrappers using 1-nearest neighbor / Naive Bayes / C4.5 / CN2.
     No. features: 3 / 4 / 2 / 3.
     Results (single LOOCV): 100% / 95.83% / 95.83% / 97.22%; paired t-test for stat. test.

[10] Portion used for feature selection: all samples with 1000 features.
     Feature selector: wrapper-filter WFFSA with ReliefF / Gain ratio / chi-squared.
     No. features: 5.1 avg. / 5.5 avg. / 6.1 avg.
     Results (avg. over 10 LOOCV): 100% / 100% / 100%; no stat. signif. test.


Table 7.14: Summary of results from feature selection methods in [11, 12, 44] and this work for the leukemia dataset.

[11]      Portion used for feature selection: entire dataset.
          Feature selector: SVM-RFE.
          No. features: 2.
          Results: 100% LOOCV; statist. signif. test.

[12]      Portion used for feature selection: original train set only, with 3051 features.
          Feature selector: Random Forests.
          No. features: 200 / 2.
          Results (error rates using the 0.632+ bootstrap with 200 bootstrap samples): 0.051 / 0.087; no stat. signif. test.

[44]      Portion used for feature selection: 50 random splits into 38 training and 34 testing samples; features selected on train for each split.
          Feature selector: HHSVM.
          No. features: 87.9 avg.
          Results: 1.67% avg. test error; no stat. signif. test.

This work Portion used for feature selection: each fold.
          Feature selector: IFP starting with 2689 features / IFP starting with 200 features.
          No. features: 213 / 75.
          Results (avg. over 5 10-fold CV): 96.56% weighted accuracy (97.22% total accuracy) / 96.51% weighted accuracy (96.67% total accuracy); Friedman/Holm for stat. test.


Chapter 8: Ensemble Approach Using Bagging

The goal in this chapter is to explore whether accuracy improvements are possible through an ensemble approach, that is, a combination of classifiers. Bagging is the ensemble technique we applied. Results are shown for the colon cancer, leukemia, and Moffitt colon cancer datasets.

8.1 Defining Ensemble and Bagging

An ensemble is typically a combination of classifiers. The classifiers may be of the same type (e.g., SVM) or of different types. The resultant output summarizes the predictions of the diverse models into a single prediction. For a classification problem, the diverse models can be generated using the bagging technique. Bagging (bootstrap aggregating) consists of sampling instances randomly with replacement from the original training set to create new training sets, usually of the same size. In bagging, a learning algorithm is applied to each of the new training sets. Each of the resultant models predicts the class of each instance in the test set, with the final predicted class typically determined by majority vote [62, 70, 71]. Other combination methods are possible for classifier approaches [62, 70, 72, 73].


8.2 Why Use Bagging?

It is expected that by applying the bagging technique with IFP, SVM-RFE, and the t-test (using SVM as the base learning algorithm), the achieved accuracies will improve compared to those of the non-bagged approaches.

On one hand, it has been claimed that after parameter tuning of the SVM model, a combination of SVMs does not lead to a performance improvement over a single SVM [74]. On the other hand, the curse of dimensionality of microarray datasets can be addressed with feature selection algorithms, while variance problems derived from the small sample size, along with the large degree of biological variability, can be addressed with ensemble methods based on resampling techniques such as bagging [75].

In practice, when creating ensembles of classifiers where SVM is the base learning algorithm, one must experiment with diverse values for the parameter C [76]. The C value producing the best results is chosen.

In our work, experiments were conducted with diverse C values for the colon cancer dataset (using the entire dataset): 1000, 500, 100, 1, 0.8, 0.5, 0.1, 0.01, and 0.001. The chosen value C = 0.5 was applied on both the colon cancer and leukemia datasets. For the Moffitt colon cancer dataset, C = 0.5 showed no accuracy improvements, so the value C = 1 was used.
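A simple way to run this kind of C sweep is cross-validated evaluation over the candidate list. The sketch below is our own illustration using scikit-learn, not the original tuning script, and balanced accuracy again stands in for the thesis's weighted accuracy.

# Minimal sketch: select C for a linear SVM by cross-validated (balanced)
# accuracy over the candidate values listed in the text.
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.svm import SVC

def pick_C(X, y, candidates=(1000, 500, 100, 1, 0.8, 0.5, 0.1, 0.01, 0.001)):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = {C: cross_val_score(SVC(kernel="linear", C=C), X, y,
                                 scoring="balanced_accuracy", cv=cv).mean()
              for C in candidates}
    return max(scores, key=scores.get), scores   # best C plus the full sweep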


8.3 Bagging Applied to IFP, SVM-RFE, and the T-Test

The datasets used for the experiments with the ensemble approach were reduced to include only the top n features with p values <= 0.01. For the colon cancer dataset n = 201, for the leukemia dataset n = 437, and for the Moffitt colon cancer dataset n = 50. For the non-bagged experiments, IFP, SVM-RFE, and the t-test were then applied to each of the reduced datasets under a 10-fold cross validation process, which was repeated five times with different seeds. The average weighted accuracies for each dataset are shown in Appendix B, and are included in the following charts, represented by the lines labeled with the name of the method (the non-bagged cases).

The bagging technique in this work generated 30 classifiers for each of IFP, SVM-RFE, and the t-test for each dataset. Essentially, bagging was applied within the 10-fold cross validation process: for each training set, 30 bags were generated and a model created for each bag (300 models total, 30 for each fold). We had each of these models vote their predicted class for each instance in the test set, and the final predicted class was determined as the class with the most votes (this voting scheme is sketched below). Only 30 classifiers were used due to the high computational cost that using more classifiers required. Although 30 classifiers can show the advantage of ensembles, more classifiers are likely needed to show statistical significance of the results [77].

All the following charts include the feature selection method investigated (IFP, SVM-RFE, or t-test), showing its bagged and non-bagged results in comparison with the best (in terms of the highest accuracy achieved) non-bagged results for each dataset. For the colon cancer dataset, the highest accuracy achieved was 90% with 20 features by SVM-RFE (Fig. B.1a). For the leukemia dataset, it was 96.94% with 371 features by the t-test (in terms of SVM classifier accuracy) (Fig. B.1b). For the Moffitt colon cancer dataset, the highest accuracy achieved was 56.74% with 45 features by the t-test (Fig. B.1c).
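The following sketch (our own illustration, not the experimental code) shows the bagged voting scheme for one cross-validation fold: 30 bootstrap bags, one linear SVM per bag, and a majority vote over the test instances. It assumes integer class labels (e.g., 0/1).

# Minimal sketch of bagging within one CV fold: 30 bootstrap samples of the
# training fold, one linear SVM per bag, majority vote over the test fold.
import numpy as np
from sklearn.svm import SVC
from sklearn.utils import resample

def bagged_predict(X_train, y_train, X_test, n_bags=30, C=0.5, seed=0):
    votes = np.zeros((n_bags, len(X_test)), dtype=int)
    for b in range(n_bags):
        Xb, yb = resample(X_train, y_train, replace=True,
                          n_samples=len(X_train), random_state=seed + b)
        votes[b] = SVC(kernel="linear", C=C).fit(Xb, yb).predict(X_test)
    # majority vote across the 30 models for each test instance
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                               axis=0, arr=votes)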


Figure 8.1: IFP bagged accuracy in comparison with IFP and SVM-RFE (highest accuracy achieved) for the colon cancer dataset.

8.3.1 Colon Cancer Dataset

For the colon cancer dataset, the bagged IFP (Fig. 8.1) did show performance improvement compared to the non-bagged IFP. However, it did not reach the highest accuracy attained by the non-bagged case of SVM-RFE. The bagged SVM-RFE (Fig. 8.2) performed similarly to the bagged IFP; it did not reach the highest accuracy of 90% (at 20 genes) of the non-bagged SVM-RFE. Last, the bagged t-test (Fig. 8.3) did not show overall performance improvement compared to the non-bagged t-test.


Figure 8.2: SVM-RFE bagged accuracy in comparison with SVM-RFE (highest accuracy achieved) for the colon cancer dataset.

8.3.2 Leukemia Dataset

For the leukemia dataset, all bagged experiments showed performance improvements compared to their corresponding non-bagged cases. The bagged IFP (Fig. 8.4), bagged SVM-RFE (Fig. 8.5), and bagged t-test (Fig. 8.6) each reached the highest accuracies achieved by the t-test filter, and did so for a greater number of feature sets.

8.3.3 Moffitt Colon Cancer Dataset

For the Moffitt colon cancer dataset, the bagged IFP (Fig. 8.7) and the bagged SVM-RFE (Fig. 8.8) outperformed their corresponding non-bagged cases. They even achieved


Figure 8.3: t-test bagged accuracy in comparison with the t-test (in terms of SVM classifier accuracy) and SVM-RFE (highest accuracy achieved) for the colon cancer dataset.


Figure 8.4: IFP bagged accuracy in comparison with IFP and the t-test (in terms of SVM classifier accuracy), which achieved the highest accuracy for the leukemia dataset.

higher accuracies than the t-test as a filter with SVM, which had reached the highest accuracy. The bagged t-test (Fig. 8.9) showed lower accuracies overall than the non-bagged case.

The bagged results shown in this chapter led us to conclude that the performance of bagged ensembles created for IFP, SVM-RFE, and the t-test can be better than that of their corresponding non-bagged cases for the colon cancer, leukemia, and Moffitt colon cancer datasets. However, with only 30 classifiers, it often is not.


Figure 8.5: SVM-RFE bagged accuracy in comparison with SVM-RFE and the t-test (in terms of SVM classifier accuracy), which achieved the highest accuracy for the leukemia dataset.


Figure 8.6: t-test bagged accuracy in comparison with the t-test (in terms of SVM classifier accuracy) for the leukemia dataset.

Figure 8.7: IFP bagged accuracy in comparison with IFP and the t-test (in terms of SVM classifier accuracy), which achieved the highest accuracy for the Moffitt colon cancer dataset.


Figure 8.8: SVM-RFE bagged accuracy in comparison with SVM-RFE and the t-test (in terms of SVM classifier accuracy), which achieved the highest accuracy for the Moffitt colon cancer dataset.

Figure 8.9: t-test bagged accuracy in comparison with the t-test (in terms of SVM classifier accuracy) for the Moffitt colon cancer dataset.


Chapter 9: Discussion and Future Work

The IFP algorithm includes a self-tuning mechanism, via binary search, to determine the amount of noise needed to perturb any number of features (genes). We compared the performance of three feature selection methods, IFP, SVM-RFE, and the Student's t-test, in terms of average SVM classifier accuracy. Four microarray datasets were preprocessed and used in our experiments: the colon cancer, leukemia, Moffitt colon cancer, and lung cancer datasets. Overall, IFP resulted in a classifier comparable or superior in accuracy to SVM-RFE on the colon, leukemia, and lung datasets. IFP resulted in a less accurate classifier on the Moffitt colon dataset.

Surprisingly, the t-test feature ranking, which is based on the p values of the genes, turned out to be the best gene selector explored. It found better sets of genes than the more complicated IFP and SVM-RFE did, in the sense that the genes selected by the t-test led to SVM accuracies higher than those of IFP and SVM-RFE for many gene subsets across all 4 datasets. This suggests that perhaps the more complex algorithms for feature selection increase the risk of overfitting for such small-sample problems.

Based on the good performance of the t-test as a gene selector, we investigated the effect of a preselection of genes/features across our 4 datasets before the application of IFP and SVM-RFE. We used the p values of each gene as our filter criterion and analyzed the accuracy results statistically using the Friedman/Holm test and using the AUC criterion.


Both scenarios of experiments were analyzed (i.e., with and without gene preselection). Our results confirmed the superiority of the t-test in experiments without gene preselection, as well as the performance improvement of IFP and SVM-RFE in experiments where gene preselection with the t-test was incorporated.

We also looked at the similarity of the sets of genes selected by each of the methods, with particular emphasis on points where any two methods reached similar accuracies. Our findings indicate that similar accuracies can be reached with different sets of genes.

While the t-test can be an accurate technique for feature selection, it is limited to two-class problems. However, the use of ANOVA could provide a similar method in the case of 3 or more classes (see the sketch below).

We have investigated the dimensionality reduction problem for microarray datasets by filtering features to the top 200 genes on four microarray datasets using three filters in comparison: the t-test, information gain, and ReliefF. For each individual filter, the SVM accuracy at varying numbers of genes was calculated. The results demonstrate that the simple t-test is generally as good as and even better than the more complex information gain and ReliefF methods across the four datasets.

We applied IFP and SVM-RFE on the 12 (4 datasets and 3 filters) resultant 200-gene datasets and conducted an AUC analysis on each of the corresponding accuracy curves. It showed that both IFP and SVM-RFE reached the highest AUC value when they were applied on datasets that were filtered with the t-test.

Both IFP and SVM-RFE consistently demonstrated accuracy improvements when applied to the t-test-filtered datasets; however, this was not true when applied to datasets filtered with information gain or ReliefF.
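For the multiclass extension mentioned above, a one-way ANOVA F-test can play the role the two-sample t-test plays in the binary case. The sketch below is our own illustration using scipy, not a method evaluated in this work.

# Minimal sketch: ranking genes for a 3-or-more-class problem with a one-way
# ANOVA F-test, analogous to ranking by t-test p values in the 2-class case.
import numpy as np
from scipy.stats import f_oneway

def anova_rank(X, y):
    """X: samples x genes; y: integer class labels (3 or more classes).
    Returns gene indices sorted by ANOVA p value (most significant first)."""
    groups = [X[y == c] for c in np.unique(y)]
    pvals = np.array([f_oneway(*[g[:, j] for g in groups]).pvalue
                      for j in range(X.shape[1])])
    return np.argsort(pvals)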


For the leukemia and the Moffitt colon cancer datasets, both IFP and SVM-RFE had lower accuracies when applied to datasets filtered with information gain. So did IFP for the lung cancer dataset. Also, for the lung cancer dataset, both IFP and SVM-RFE had lower accuracies when applied to datasets filtered with ReliefF. So did SVM-RFE for the leukemia dataset.

We also conducted a statistical analysis using the Friedman/Holm test across the top 50 genes obtained by IFP and SVM-RFE as they were applied on the 200-gene datasets. The analysis shows that these methods generally result in the greatest number of statistically significant accuracies for sets of genes on datasets filtered with the t-test.

The results here show the simple t-test is a surprisingly good filter of genes for microarray data. Embedded feature selection techniques, such as IFP and SVM-RFE, can be used to select good combinations of genes to use for classification purposes; however, the initial filtering using a t-test can improve classifier performance by limiting the number of uninformative genes considered.

These results clearly point to the idea that the t-test can be at least as accurate as other filter methods and should be considered in these types of studies.

We showed that the performance of bagged ensembles created for IFP, SVM-RFE, and the t-test using SVMs as the base classifier can be more accurate than their corresponding non-bagged results. However, with 30 classifiers and SVM, they often are not more accurate.


9.1 Future Work

The IFP implemented and experimented with in this work uses the backward elimination approach in the feature removal process. However, the use of forward feature selection is always an alternative to explore with IFP. Under this approach, the initial set of selected features is empty; progressively, features are added until no more performance improvement is observed (a generic sketch of this scheme appears at the end of this section).

IFP in this work used SVM as the base learning algorithm. It would be interesting to explore the application of IFP using other learning algorithms.

Further investigation of the genes selected by IFP may include their analysis for biological significance.

As a filter, the t-test is known to select highly correlated genes. A set of such features might be expected to lower the performance of classification algorithms [78]. Based on the good performance of the t-test as a filter in this work, it becomes of interest to verify the correlation between features selected from microarray datasets using the t-test, and its impact on IFP performance.

The ensemble approach can also be investigated using a combination of different kernels for the SVM model. Larger ensembles can be explored as well.
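The forward scheme described above is generic greedy forward selection. The sketch below (our own illustration, not part of the thesis experiments) adds one feature at a time as long as cross-validated accuracy keeps improving; note the cost grows quickly with the number of candidate genes.

# Minimal sketch of greedy forward feature selection with a linear SVM:
# start from an empty set, add the single best feature per round, and stop
# when no candidate improves cross-validated accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_select(X, y, cv=10):
    selected, best = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: cross_val_score(SVC(kernel="linear"),
                                     X[:, selected + [j]], y, cv=cv).mean()
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:        # no improvement: stop
            break
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected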


List of References

[1] W. Kuo, E. Kim, J. Trimarchi, T. Jenssen, S. Vinterbo, and L. Ohno-Machado, "A primer on gene expression and microarrays for machine learning researchers," Journal of Biomedical Informatics, vol. 37, no. 4, pp. 293–303, 2004.

[2] I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[3] H. Liu and H. Motoda, Computational Methods of Feature Selection. Chapman & Hall/CRC, 2008.

[4] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, p. 2507, 2007.

[5] S. Mukkamala, Q. Liu, R. Veeraghattam, and A. Sung, "Feature selection and ranking of key genes for tumor classification: Using microarray gene expression data," Lecture Notes in Computer Science, vol. 4029, p. 951, 2006.

[6] R. Breitling, P. Armengaud, A. Amtmann, and P. Herzyk, "Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments," FEBS Letters, vol. 573, no. 1-3, pp. 83–92, 2004.

[7] H. Mamitsuka, "Selecting features in microarray classification using ROC curves," Pattern Recognition, vol. 39, no. 12, pp. 2393–2404, 2006.


[8] Y. Wang, I. Tetko, M. Hall, E. Frank, A. Facius, K. Mayer, and H. Mewes, "Gene selection from microarray data for cancer classification - a machine learning approach," Computational Biology and Chemistry, vol. 29, no. 1, pp. 37–46, 2005.

[9] I. Inza, P. Larrañaga, R. Blanco, and A. Cerrolaza, "Filter versus wrapper gene selection approaches in DNA microarray domains," Artificial Intelligence in Medicine, vol. 31, no. 2, pp. 91–103, 2004.

[10] Z. Zhu, Y. Ong, and M. Dash, "Wrapper-filter feature selection algorithm using a memetic framework," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 37, no. 1, p. 70, 2007.

[11] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1, pp. 389–422, 2002.

[12] R. Díaz-Uriarte and A. de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, no. 1, p. 3, 2006.

[13] S. Ma and J. Huang, "Penalized feature selection and classification in bioinformatics," Briefings in Bioinformatics, vol. 9, no. 5, p. 392, 2008.

[14] J. Quackenbush, "Microarray analysis and tumor classification," The New England Journal of Medicine, vol. 354, no. 23, p. 2463, 2006.

[15] R. Kohavi and G. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, no. 1-2, pp. 273–324, 1997.

[16] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," The Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[17] G. Kanji, 100 Statistical Tests. SAGE Publications Ltd, 2006.


[18] R. Lowry, Concepts and Applications of Inferential Statistics, Vassar College, Poughkeepsie, NY, 2009. [Online]. Available: http://faculty.vassar.edu/lowry/webtext.html

[19] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2010, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org

[20] J. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.

[21] M. Berthold and D. Hand, Intelligent Data Analysis: An Introduction. Springer-Verlag, 2007.

[22] L. Yu and H. Liu, "Redundancy based feature selection for microarray data," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM New York, NY, USA, 2004, pp. 737–742.

[23] I. Kononenko, "Estimating attributes: Analysis and extensions of RELIEF," Lecture Notes in Computer Science, pp. 171–171, 1994.

[24] I. Kononenko, M. Robnik-Šikonja, and U. Pompe, "ReliefF for estimation and discretization of attributes in classification, regression, and ILP problems," Artificial Intelligence: Methodology, Systems, Applications, pp. 31–40, 1996.

[25] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. Mesirov, and T. Poggio, "Support vector machine classification of microarray data," AI Memo 1677, Massachusetts Institute of Technology, 1999.

[26] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, M. Ares, and D. Haussler, "Support vector machine classification of microarray gene expression data," Proc. Natl. Acad. Sci. USA, vol. 97, pp. 262–267, 2000.

[27] S. Mukherjee, "Classifying microarray data using support vector machines," A Practical Approach to Microarray Data Analysis, vol. 1, pp. 166–185, 2003.


[28] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[29] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, March 2000.

[30] C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.

[31] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.

[32] M. Hearst, S. Dumais, E. Osman, J. Platt, and B. Schölkopf, "Support vector machines," IEEE Intelligent Systems, vol. 13, no. 4, pp. 18–28, 1998.

[33] O. Ivanciuc, "SVM - Support Vector Machines, Optimum Separation Hyperplane," http://www.support-vector-machines.org/SVM_osh.html, 2005, [Online; accessed 7 July 2008].

[34] D. Decoste and B. Schölkopf, "Training invariant support vector machines," Machine Learning, vol. 46, no. 1, pp. 161–190, 2002.

[35] P. Jafari and F. Azuaje, "An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors," BMC Medical Informatics and Decision Making, vol. 6, no. 1, p. 27, 2006.

[36] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proceedings of the National Academy of Sciences, vol. 96, no. 12, pp. 6745–6750, 1999.

[37] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, p. 531, 1999.


[38] V. Goss Tusher, R. Tibshirani, and G. Chu, "Significance analysis of microarrays applied to the ionizing radiation response," Proc Natl Acad Sci, vol. 98, pp. 5116–5121, 2001.

[39] J. Hanley and B. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, p. 29, 1982.

[40] D. Hand and R. Till, "A simple generalisation of the area under the ROC curve for multiple class classification problems," Machine Learning, vol. 45, no. 2, pp. 171–186, 2001.

[41] D. Rhodes, J. Yu, K. Shanker, N. Deshpande, R. Varambally, D. Ghosh, T. Barrette, A. Pandey, and A. Chinnaiyan, "ONCOMINE: a cancer microarray database and integrated data-mining platform," Neoplasia (New York, NY), vol. 6, no. 1, p. 1, 2004.

[42] J. Kittler, "Feature set search algorithms," Pattern Recognition and Signal Processing, p. 41, 1978.

[43] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[44] L. Wang, J. Zhu, and H. Zou, "Hybrid huberized support vector machines for microarray classification," in Proceedings of the 24th International Conference on Machine Learning. ACM New York, NY, USA, 2007, pp. 983–990.

[45] S. Rosset and J. Zhu, "Piecewise linear regularized solution paths," Annals of Statistics, vol. 35, no. 3, p. 1012, 2007.

[46] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, Statistical Methodology, vol. 67, no. 2, p. 301, 2005.

[47] L. Wang, J. Zhu, and H. Zou, "Hybrid huberized support vector machines for microarray classification and gene selection," Bioinformatics, vol. 24, no. 3, p. 412, 2008.


[48] Y. Sun, S. Todorovic, and S. Goodison, "Local learning based feature selection for high dimensional data analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.

[49] V. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.

[50] A. Asuncion and D. Newman, "UCI machine learning repository," 2007. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html

[51] Y. Sun, "Iterative RELIEF for feature weighting: Algorithms, theories, and applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1035–1051, 2007.

[52] A. Stephenson, A. Smith, M. Kattan, J. Satagopan, V. Reuter, P. Scardino, and W. Gerald, "Integration of gene expression profiling and clinical variables to predict prostate carcinoma recurrence after radical prostatectomy," Cancer, vol. 104, no. 2, pp. 290–298, 2005.

[53] Y. He, A. Hart, M. Mao, H. Peterse, K. van der Kooy, M. Marton, A. Witteveen, G. Schreiber, R. Kerkhoven, C. Roberts et al., "Gene expression profiling predicts clinical outcome of breast cancer," 2002.

[54] M. Shipp, K. Ross, P. Tamayo, A. Weng, J. Kutok, R. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G. Pinkus et al., "Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning," Nature Medicine, vol. 8, no. 1, pp. 68–74, 2002.

[55] J. Canul-Reich, L. Hall, D. Goldgof, and S. Eschrich, "Feature selection for microarray data by AUC analysis," in 2008 IEEE International Conference on Systems, Man, and Cybernetics, Singapore, October 12-15, Proceedings, 2008, pp. 768–773.

[56] B. Schölkopf, C. Burges, and V. Vapnik, "Extracting support data for a given task," in Knowledge Discovery and Data Mining, 1995, pp. 252–257. [Online]. Available: http://www.aaai.org/Papers/KDD/1995/KDD95-030.pdf


[57] D. Allison, X. Cui, G. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–66, 2006.

[58] D. Pyle, Data Preparation for Data Mining. Morgan Kaufmann Pub, 1999.

[59] S. Eschrich, I. Yang, G. Bloom, K. Kwong, D. Boulware, A. Cantor, D. Coppola, M. Kruhoffer, L. Aaltonen, T. Orntoft et al., "Molecular staging for survival prediction of colorectal cancer patients," Journal of Clinical Oncology, vol. 23, no. 15, p. 3526, 2005.

[60] K. Shedden, J. Taylor, S. Enkemann, M. Tsao, T. Yeatman, W. Gerald, S. Eschrich, I. Jurisica, T. Giordano, D. Misek et al., "Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study," Nature Medicine, vol. 14, no. 8, pp. 822–827, 2008.

[61] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm

[62] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Pub, 2005.

[63] DADiSP/SE 2002: The ultimate engineering spreadsheet, version 6.0. [Online]. Available: http://www.dadisp.com

[64] M. Friedman, "The use of ranks to avoid the assumption of normality implicit in the analysis of variance," Journal of the American Statistical Association, vol. 32, no. 200, pp. 675–701, 1937.

[65] S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.

[66] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," The Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.


[67] A. Hobson and B. Cheng, "A comparison of the Shannon and Kullback information measures," Journal of Statistical Physics, vol. 7, no. 4, pp. 301–310, 1973.

[68] I. Kononenko, E. Šimec, and M. Robnik-Šikonja, "Overcoming the myopia of inductive learning algorithms with RELIEFF," Applied Intelligence, vol. 7, no. 1, pp. 39–55, 1997.

[69] J. Canul-Reich, L. Hall, D. Goldgof, and S. Eschrich, "Filtering for improved gene selection on microarray data," in 2010 IEEE International Conference on Systems, Man, and Cybernetics, Istanbul, October 10-13, Proceedings to appear, 2010.

[70] T. Dietterich, "Ensemble methods in machine learning," Multiple Classifier Systems, pp. 1–15, 2000.

[71] G. Valentini and F. Masulli, "Ensembles of learning machines," Neural Nets, pp. 3–20, 2002.

[72] G. Seni and J. Elder, "Ensemble methods in data mining: Improving accuracy through combining predictions," Synthesis Lectures on Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 1–126, 2010.

[73] E. Alpaydin, Introduction to Machine Learning. The MIT Press, 2004.

[74] T. Evgeniou, L. Perez-Breva, M. Pontil, and T. Poggio, "Bounds on the generalization performance of kernel machine ensembles," in Machine Learning: International Workshop then Conference, 2000, pp. 271–278.

[75] G. Valentini, M. Muselli, and F. Ruffino, "Cancer recognition with bagged ensembles of support vector machines," Neurocomputing, vol. 56, pp. 461–466, 2004.

[76] G. Valentini, "Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles," Artificial Intelligence in Medicine, vol. 26, no. 3, pp. 281–304, 2002.


[77] R. Banfield, L. Hall, K. Bowyer, and W. Kegelmeyer, "A comparison of decision tree ensemble creation techniques," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 173–180, 2007.

[78] J. Jäger, R. Sengupta, and W. Ruzzo, "Improved gene selection for classification of microarrays," Biocomputing 2003, p. 53, 2003.


Appendices


Appendix A: Flowchart for Iterative Feature Perturbation Method

Figure A.1: Flowchart of IFP. (a) The main loop of the iterative feature perturbation algorithm and (b) the binary search called by (a) for calculation of the noise level/c factor, perturbation, and ranking of features.
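For readers without access to the flowchart image, the sketch below outlines the IFP main loop as described in the text: perturb each remaining feature with noise, rank features by how little the perturbation changes accuracy, and discard the least relevant. It is our own schematic reconstruction; in particular, the fixed noise scale stands in for the thesis's binary-search tuning of the noise level, and the one-feature-per-iteration elimination step is an assumption.

# Schematic sketch of the IFP main loop (our reconstruction, not the exact
# thesis implementation). A feature whose perturbation barely changes the
# classifier's accuracy is considered non-relevant and is discarded.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

def ifp(X_train, y_train, X_val, y_val, n_keep=1, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    remaining = list(range(X_train.shape[1]))
    while len(remaining) > n_keep:
        clf = SVC(kernel="linear", C=1.0).fit(X_train[:, remaining], y_train)
        base = balanced_accuracy_score(y_val, clf.predict(X_val[:, remaining]))
        impact = []
        for pos in range(len(remaining)):
            Xp = X_val[:, remaining].copy()
            # perturb one feature with noise; the thesis tunes the amount of
            # noise by binary search, here a fixed scale stands in for that
            Xp[:, pos] += rng.normal(0.0, noise_scale, size=len(Xp))
            impact.append(abs(base - balanced_accuracy_score(
                y_val, clf.predict(Xp))))
        # discard the feature whose perturbation changed accuracy the least
        remaining.pop(int(np.argmin(impact)))
    return remaining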


Appendix B: Accuracies for the Top N Features with P Values <= 0.01

The methods IFP, SVM-RFE, and the t-test were applied on the colon (Fig. B.1a), leukemia (Fig. B.1b), and Moffitt colon cancer datasets (Fig. B.1c) within a 10-fold cross validation process repeated five times with different seeds. Each dataset was previously reduced to include only the top n features with p values <= 0.01. For the colon cancer dataset n = 201, for the leukemia dataset n = 437, and for the Moffitt colon cancer dataset n = 50. Average weighted accuracy results are shown in Fig. B.1.

Figure B.1: Comparison of resulting average weighted accuracy of IFP, SVM-RFE, and the t-test across the top n features with p values <= 0.01 for the (a) colon, (b) leukemia, and (c) Moffitt colon cancer datasets.


About the Author

Ms. Juana Canul-Reich is a Ph.D. candidate in the Department of Computer Science and Engineering at the University of South Florida (USF). In Mexico, she received her BSc degree in February 1996 from the Autonomous University of Tabasco (UJAT). Later, in 1999, she received her MSc degree in Computer Science from the Monterrey Technological Institute (ITESM). In the USA, she received her MSc degree in Computer Engineering from USF in 2007. She is a Fulbright scholar.

Her research interests include, but are not limited to, support vector machines, feature/gene selection, data preprocessing, supervised learning, microarray data analysis, data mining, and machine learning.

