
A study of machine learning performance in the prediction of juvenile diabetes from clinical test results


Material Information

Title:
A study of machine learning performance in the prediction of juvenile diabetes from clinical test results
Physical Description:
Book
Language:
English
Creator:
Pobi, Shibendra
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:

Subjects

Subjects / Keywords:
Ensembles
Decision trees
Neural networks
Diabetes prediction
Over-sampling
Dissertations, Academic -- Computer Science -- Masters -- USF
Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: Two approaches to building models for prediction of the onset of Type 1 diabetes mellitus in juvenile subjects were examined. A set of tests performed immediately before diagnosis was used to build classifiers to predict whether the subject would be diagnosed with juvenile diabetes. A modified training set consisting of differences between test results taken at different times was also used to build classifiers to predict whether a subject would be diagnosed with juvenile diabetes. Neural networks were compared with decision trees and ensembles of both types of classifiers. Support Vector Machines were also tested on this dataset. The highest known predictive accuracy was obtained when the data was encoded to explicitly indicate missing attributes in both cases. In the latter case, high accuracy was achieved without test results which, by themselves, could indicate diabetes. The effects of oversampling of minority class samples in the training set by generating synthetic examples were tested with ensemble techniques like bagging and random forests. It was observed that oversampling of diabetic examples led to increased accuracy in diabetic prediction, demonstrated by a significantly better F-measure value. ROC curves and the statistical F-measure were used to compare the performance of the different machine learning algorithms.
Thesis:
Thesis (M.A.)--University of South Florida, 2006.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Shibendra Pobi.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 52 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001911070
oclc - 173642076
usfldc doi - E14-SFE0001671
usfldc handle - e14.1671
System ID:
SFS0025989:00001




Full Text

A Study of Machine Learning Performance in the Prediction of Juvenile Diabetes from Clinical Test Results

by

Shibendra Pobi

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, Department of Computer Science and Engineering, College of Engineering, University of South Florida

Major Professor: Lawrence O. Hall, Ph.D.
Dmitry Goldgof, Ph.D.
Rangachar Kasturi, Ph.D.

Date of Approval: July 12, 2006

Keywords: ensembles, decision trees, neural networks, diabetes prediction, over-sampling

© Copyright 2006, Shibendra Pobi

DEDICATION

To my fiancée

ACKNOWLEDGEMENTS

I would like to thank my major professor, Dr. Lawrence Hall, for his constant support and guidance in successfully accomplishing this project. I would also like to thank Dr. Jeffrey Krischer of the Moffitt Cancer Research Center for his expert guidance during various phases of the research. My sincere gratitude to Dr. Dmitry Goldgof and Dr. Rangachar Kasturi for providing their expertise and thoughtful suggestions. This study has been partly supported by National Institutes of Health grant 1U54RR019259-01.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER 1 INTRODUCTION
  1.1 Type-1 Diabetes
  1.2 Project Overview

CHAPTER 2 PREVIOUS WORK
  2.1 Research on the Pima Indian Diabetes Dataset
  2.2 Prediction of Blood Glucose Level and Other Health Risks of Diabetes Patients
  2.3 Diabetes Prediction Using Health Risk Assessment (HRA) Questionnaires

CHAPTER 3 DIABETES DATASET
  3.1 Diabetes Prevention Trial-1
  3.2 Dataset Description
  3.3 Attributes
  3.4 Test Results Dataset
    3.4.1 Dataset with Nominal Attributes
    3.4.2 Dataset with Bit Encoded Nominal Attributes
  3.5 Differential Dataset
    3.5.1 Differential Dataset with Nominal Attributes
    3.5.2 Differential Dataset with Bit Encoded Nominal Attributes

CHAPTER 4 LEARNING APPROACHES USED
  4.1 Cascade Correlation Neural Networks
  4.2 C4.5 Decision Trees
  4.3 Ensembles
    4.3.1 Bagging
    4.3.2 Random Forests
  4.4 Support Vector Machines
  4.5 Synthetic Minority Over-Sampling Technique

CHAPTER 5 EXPERIMENTS AND RESULTS
  5.1 Predicting Type 1 Juvenile Diabetes Using Test Results
    5.1.1 Experiments with Decision Trees and Ensembles
    5.1.2 Experiments with Neural Networks and Ensembles
  5.2 Type 1 Juvenile Diabetes Prediction from Differences in Test Results
    5.2.1 Experiments with Decision Trees and Ensembles
    5.2.2 Experiments with Neural Networks and Ensembles
    5.2.3 Experiments with Support Vector Machines
    5.2.4 Experiments and Results Using SMOTE
  5.3 Analysis of Different Learning Techniques
    5.3.1 F-Measure
    5.3.2 ROC Curves

CHAPTER 6 SUMMARY AND DISCUSSION

REFERENCES

LIST OF TABLES

Table 3.1  Details of the SUBJECT table
Table 3.2  Details of the TEST table
Table 3.3  Details of the TESTRES table
Table 3.4  List of attributes
Table 5.1  Confusion matrix after 10-fold cross validation using usfc4.5
Table 5.2  Confusion matrix after 10-fold cross validation: using a single decision tree on the differential dataset
Table 5.3  Confusion matrix after 10-fold cross validation: using random forests with 50 attributes and 100 trees on the differential dataset
Table 5.4  Confusion matrix after 10-fold cross validation: using a single neural network on the modified differential dataset
Table 5.5  Confusion matrix after 10-fold cross validation: using bagged ensemble of 100 neural networks on the modified differential dataset
Table 5.6  Confusion matrix after 10-fold cross validation: using random forests with 100 attributes and 100 trees on the modified differential dataset
Table 5.7  Summary of prediction accuracies obtained from 10-fold cross validation using different learning methods on the modified differential dataset
Table 5.8  Confusion matrix after 10-fold cross validation: using support vector machine on the modified differential dataset
Table 5.9  Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 200 trees and 100% smoting of difference values
Table 5.10 Confusion matrix after 10-fold cross validation: using random forests with lg n attributes, 100 trees and 100% smoting
Table 5.11 Confusion matrix after 10-fold cross validation: using random forests with 50 attributes, 100 trees and 100% smoting
Table 5.12 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 100 trees and 100% smoting
Table 5.13 Confusion matrix after 10-fold cross validation: using random forests with 50 attributes, 150 trees and 100% smoting
Table 5.14 Confusion matrix after 10-fold cross validation: using random forests with 50 attributes, 200 trees and 100% smoting
Table 5.15 Confusion matrix after 10-fold cross validation: using random forests with 50 attributes, 150 trees and 50% smoting
Table 5.16 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 150 trees and 50% smoting
Table 5.17 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 100 trees and 75% smoting
Table 5.18 Confusion matrix after 10-fold cross validation: using random forests with 75 attributes, 150 trees and 75% smoting
Table 5.19 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 150 trees and 125% smoting
Table 5.20 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 100 trees and 200% smoting
Table 5.21 Summary of F-measure values for different learning techniques

LIST OF FIGURES

Figure 4.1  A representation of data in 2-dimensional space with support vectors marked in grey squares, with margins of separation
Figure 5.1  Decision tree generated using the modified dataset
Figure 5.2  ROC curve for comparing the performance of random forests with and without oversampling

A STUDY OF MACHINE LEARNING PERFORMANCE IN THE PREDICTION OF JUVENILE DIABETES FROM CLINICAL TEST RESULTS

Shibendra Pobi

ABSTRACT

Two approaches to building models for prediction of the onset of Type 1 diabetes mellitus in juvenile subjects were examined. A set of tests performed immediately before diagnosis was used to build classifiers to predict whether the subject would be diagnosed with juvenile diabetes. A modified training set consisting of differences between test results taken at different times was also used to build classifiers to predict whether a subject would be diagnosed with juvenile diabetes. Neural networks were compared with decision trees and ensembles of both types of classifiers. Support Vector Machines were also tested on this dataset. The highest known predictive accuracy was obtained when the data was encoded to explicitly indicate missing attributes in both cases. In the latter case, high accuracy was achieved without test results which, by themselves, could indicate diabetes. The effects of oversampling of minority class samples in the training set by generating synthetic examples were tested with ensemble techniques like bagging and random forests. It was observed that oversampling of diabetic examples led to increased accuracy in diabetic prediction, demonstrated by a significantly better F-measure value. ROC curves and the statistical F-measure were used to compare the performance of the different machine learning algorithms.

CHAPTER 1
INTRODUCTION

1.1 Type-1 Diabetes

Type 1 diabetes occurs when the body's immune system attacks and destroys certain cells in the pancreas, an organ about the size of a hand that is located behind the lower part of the stomach. These cells, called beta cells, are contained, along with other types of cells, within small islands of endocrine cells called the pancreatic islets. Beta cells normally produce insulin, a hormone that helps the body move the glucose contained in food into cells throughout the body, which use it for energy. But when the beta cells are destroyed, no insulin can be produced, and the glucose stays in the blood instead, where it can cause serious damage to all the organ systems of the body. Type 1 diabetes is usually diagnosed in children and young adults, and was previously known as juvenile diabetes. Scientists do not yet know exactly what causes Type 1 diabetes, but they believe that autoimmune, genetic, and environmental factors are involved.

Incidence of Type-1 diabetes in the US is estimated at about 30,000 cases annually and about 40 per 10,000 children. Type 1 diabetes accounts for 5% to 10% of all diagnosed cases of diabetes. According to the National Institute of Allergy and Infectious Disease (NIAID), the prevalence rate of Type 1 diabetes is approximately 1 in 800, or around 340,000 people in the US. The risk of developing type 1 diabetes is higher than for virtually all other severe chronic diseases of childhood. Peak incidence occurs during puberty, around 10 to 12 years of age in girls, and 12 to 14 years of age in boys. According to the Juvenile Diabetes Research Foundation, as many as 3 million Americans may have type-1 diabetes. Each year over 13,000 children are diagnosed with diabetes in the U.S. That's 35 children each and every day.

As is evident from the above discussion, Type 1 diabetes is a serious medical concern not only in the United States but across the world. There has been quite a lot of research in the medical community on how to treat this disease. Our objective has been to analyze a Type 1 diabetes dataset from the Diabetes Prevention Trial-1 and study whether it is possible to predict the onset of Type-1 diabetes from medical test results using learned classifiers.

1.2 Project Overview

There has been a lot of study in the area of machine learning on data from the domain of diabetes, especially on the Pima Indian Diabetes dataset [25], [2], [11] from the University of California at Irvine (UCI) repository. There has also been tremendous interest in using machine learning algorithms for post-diagnosis care, like prediction of blood glucose levels to control the dosage of insulin [21] and the use of association rules to predict the occurrence of certain diseases in diabetic patients [28], [16]. However, our study differs from both these approaches: we use a dataset that is not restricted to a particular ethnicity but is restricted to a specific type of diabetes, namely Type 1 diabetes in the juvenile population, and our objective is not to monitor diabetic patients but to learn a model to predict the occurrence of this type of diabetes by taking the patient's past medical records into consideration.

We primarily used two types of classifiers, C4.5-based decision trees [23] and Cascade Correlation [24] based neural networks, to predict diabetic cases from non-diabetic ones by using subject test results. This is not known to be a difficult problem for physicians, but the reader will see that building a good predictive model was not trivial. Next, we used the same base classifiers to predict diabetes, but this time the attributes were the differences in test results between consecutive tests of the same type for a subject. This approach has the promise of allowing a prediction that someone is susceptible to diabetes before any test results indicate they may have it. Both random forests of decision trees [2] and bagged classifiers [1], for both neural networks and decision trees, were used. Surprisingly, it was necessary to explicitly encode missing attributes to achieve over 95% accuracy in diabetes prediction for both decision trees and neural networks.

The ensemble classifiers provided the best accuracy. Decision tree classifiers were comparable to cascade correlation neural network classifiers. Of interest was the fact that approximately 80% accuracy can be obtained in predicting diabetes from differences in test results over time, without using data from the last time period before diagnosis as diabetic.

Another aspect that was tested was whether oversampling of the minority class examples (i.e., the diabetic examples) would improve the prediction accuracy of the classifiers. The oversampling technique that was used was the Synthetic Minority Oversampling Technique (SMOTE) [5], which uses a nearest neighbor approach to generate synthetic examples. Although the oversampling technique did not improve the overall accuracy of prediction using bagging and random forests, the number of True Positives did show a significant increase, with comparably higher F-measure values.

CHAPTER 2
PREVIOUS WORK

2.1 Research on the Pima Indian Diabetes Dataset

Most of the work related to machine learning in the domain of diabetes diagnosis has concentrated on the study of the Pima Indian Diabetes dataset in the UCI repository. In this context, Shanker [25] used neural networks to predict the onset of diabetes mellitus among the Pima Indian female population near Phoenix, Arizona. This particular dataset has been widely used in machine learning experiments and is currently available through the UCI repository of standard datasets. This population has been studied continuously by the National Institute of Diabetes, Digestive and Kidney Diseases owing to the high incidence of diabetes. The study chose 8 particular variables which were considered high risk factors for the occurrence of diabetes, like number of times pregnant, plasma glucose concentration at 2 h in an oral glucose tolerance test (OGTT), diastolic blood pressure, 2-h serum insulin, body mass index, diabetes pedigree, etc. All the 768 examples were randomly separated into a training set of 576 cases (378 subjects without diabetes and 198 subjects with diabetes) and a test set of 192 cases (122 non-diabetic subjects and 70 diabetic cases). Using neural networks with one hidden layer, Shanker [25] obtained an overall accuracy of 81.25%, which was higher than the prediction accuracies obtained using a logistic regression method and the ADAP model [27]. Many other papers have reported results on this dataset.

2.2 Prediction of Blood Glucose Level and Other Health Risks of Diabetes Patients

Research on diabetes data other than the Pima Indian dataset, related to the application of machine learning techniques, has mainly focused on trying to predict and monitor the Blood Glucose Levels (BGL) of diabetic patients [21] or possible health risks of such patients [28]

and [16]. In [21], a combination of Artificial Neural Networks (ANN) and a Neuro-Fuzzy optimizer was used to predict the BGL of a diabetic patient in the near future, and then a possible schedule of diet and exercise, as well as the dosage of insulin for the patient, was suggested. Although the BGL predictions were close to the actual readings, the dataset was restricted to only two Type 1 diabetic patients, which raises doubts about its usability for large groups.

In another study, by Karim Al Jabali [10], artificial neural networks were used to model and simulate the progression of Type 1 diabetes in patients, as well as to predict the optimum or adequate dosage of insulin that should be delivered to maintain the blood glucose level (BGL). The study dataset was comprised of 70 patients with 30,000 training instances, and the attributes considered were Previous Glucose Level and Short Term, Mid Term and Long Term insulin release, as well as some other features like exercise, meal, etc. A back propagation neural network with four layers was used to simulate the diabetic patient's metabolism and also to simulate the controllers delivering insulin. The results showed that the use of complex neural network architectures could effectively emulate the working of controllers that deliver insulin to Type 1 diabetic patients.

Neuro-Fuzzy systems have also been used by Dazzi et al. [8] for the control of BGL in critical diabetic patients, with the main objective of being able to predict the exact dosage of insulin with the least number of invasive blood tests. A combination of back propagation (BEP) neural networks and fuzzy logic was used to predict the variation in insulin dosage. The neural networks were employed to discover the relationships between variables, find the right rules, and adjust membership functions. For training the neural networks, a set of 1000 randomly simulated BG values were used and the corresponding insulin infusion rates noted. The trained neural nets were then tested with a set of 400 unseen BG values, and the predicted insulin infusion rates were monitored and used to build a nomogram. The Neuro-Fuzzy system was able to provide fine-tuned variations in insulin infusion in response to small glycemic variations, and maintain BGL better than conventional control systems.

Another area of research in Type 1 diabetes using machine learning techniques has been the study of the genetic data associated with the occurrence of Type-1 diabetes (T1DM). A number of recent studies have aimed at unraveling the genetic basis of T1DM with a focus on whole genome screenings of families with affected sibling pairs (ASPs). Pociot et al. [22]

studied the application of data mining techniques to detect complex interactions of genes underlying the onset of Type 1 diabetes, i.e., nonlinear interactions between multiple-trait loci. The dataset they studied had the genetic data from the analyses of 318 micro-satellite markers in 331 multiplex families. The subjects included 375 ASPs, 188 unaffected sib pairs, and 564 discordant sib pairs, making up a total of 1586 individuals. Decision trees and neural network approaches were used to analyze the data. Both these techniques were not only able to identify all the major linkage peaks that were identified by other nonparametric linkage (NPL) analyses, but also found evidence of some new regions of interest that affect the onset of diabetes on certain specific chromosomes. The data mining techniques proved robust to missing and erroneous data. Moreover, these approaches could predict the Type 1 diabetic patients from the nondiabetics, with training using sets of combinations of fewer markers. This study also emphasized that inherited factors influence both susceptibility and resistance to the disease. Linkage analysis of ASPs could not identify protective gene variants, whereas data mining analysis with unaffected subjects was able to identify certain combination rules that occurred only in nondiabetics. The rules on marker interaction were generated by decision trees, which were validated using neural network analysis.

For experiments concentrating on predicting potential health risks of diabetic patients, the machine learning algorithm of choice for most researchers is association rule mining. In [28], the authors make a comparative study of association rules and decision trees to predict the occurrences of certain diseases prevalent in diabetic patients. In [16], they deal solely with association rule mining on diabetes patient data, to come up with new rules for prediction of specific diseases in such patients. A Local Causal Discovery (LCD) [6] algorithm [26] is used to study how causal structures can be determined from association rules and to generate rules that map symptoms to diseases. Moreover, exception rule mining leads to more useful rules from a medical point of view.

2.3 Diabetes Prediction Using Health Risk Assessment (HRA) Questionnaires

In [20] the prediction of diabetes from repeated Health Risk Appraisal (HRA) questionnaires of the participants, using a neural network model, was studied. It used a sequential

multilayered perceptron (SMLP) with back propagation, and captured the time-sensitiveness of the risk factors as a tool for prediction of diabetes among the participants. A hierarchy of neural networks was used, where each network outputs the probability of a subject getting diabetes in the following year. This probability value is then fed forward to the next neural network along with the HRA records for the next year. Results show improvement in accuracy over time; i.e., the study of the risk factors over time, rather than at any particular instant, yields better results. With the SMLP approach, the maximum accuracy of prediction obtained was 99.3% for non-diabetics and 83.6% for diabetics, at a threshold of output probability from each neural network in the hierarchy of 20%.

While [20] focuses on the importance of time-sensitiveness of the risk factors in diabetes predictions using only neural networks, our study compares decision tree learning methods, including random forests [2], and an ensemble of neural networks applied to a specific juvenile diabetes dataset. Our study also differs from [20] in the following aspects:

1. [20] used HRA records of employees from a manufacturing firm, with ages of the subjects ranging from 45 to 64, while our subjects are all juveniles.

2. The attributes in the dataset in [20] are general health parameters like Body Mass Index (BMI), Alcohol Consumption, Back pain, Cholesterol, etc., which are completely different from the attributes that we deal with here, like Intravenous Glucose Tolerance, C-Peptides and other medical tests that are specific to Type 1 diabetes.

In the current study we have used data from the Diabetes Prevention Trial - Type 1 [15], which was the first large-scale trial in North America designed to test whether intervention with antigen-based therapies, parenteral insulin and oral insulin, would prevent or delay the onset of diabetes. In [4] it was shown, using the DPT-1 data, that a strong correlation existed between first-phase (1 minute + 3 minute) insulin (FPIR) production during intravenous glucose tolerance tests (IV-GTT) and risk factors for developing type 1 diabetes. In [14] the asymptomatic group of cases in the DPT-1 trial, whose diabetes could be directly diagnosed by the 2-h criteria on the Oral Glucose Tolerance Test (OGTT), was studied. Both these studies [14] [15] helped identify the tests we used for our training data.

CHAPTER 3
DIABETES DATASET

In this chapter, the diabetes dataset that has been studied is discussed in detail. We start with a discussion of the Diabetes Prevention Trial, from which the current data has been extracted, and then move on to how the datasets were built and formatted to train the classifiers.

3.1 Diabetes Prevention Trial-1

The Diabetes Prevention Trial - Type 1 (DPT-1) was the first large scale trial in North America designed to test whether intervention during the prodromal period can prevent or delay the onset of Type 1 diabetes. The major objective of DPT-1 was to determine whether intervention with antigen-based therapies, parenteral insulin for high-risk subjects (>50% likelihood of developing diabetes in the next 5 years) and oral insulin or placebo for intermediate-risk subjects (25%-50% risk in the next 5 years), would prevent or delay the onset of diabetes. The study was designed to identify relationships between insulin production during the IV-GTT and known risk factors for Type 1 diabetes in a large population at increased risk for development of diabetes. It was a multicenter randomized trial involving first-degree relatives (ages 3-45 years) or second-degree relatives (ages 3-20 years) of persons who began insulin therapy before the age of 40 years.

As of December 1998, 59,600 individuals had been screened, and 2199 (3.69%) had positive findings (>= 10 Juvenile Diabetes Foundation units) on their ICA test. The initial staging IV-GTT had been done for 1622 subjects. The mean age of the 1622 subjects was 11 years (range, 3-45 years); 56% of these subjects were female and 44% were male. Of the 1622 subjects, 81.6% were white, 9.7% were Hispanic, 2.6% were black, and 3.4% were members of unspecified ethnic background.

A total of 84,228 first degree and second degree relatives of patients with diabetes were screened for islet-cell antibodies; 3152 tested positive; 2103 of the 3152 underwent genetic, immunologic, and metabolic staging to quantify their risk; 372 of the 2103 had a projected five-year risk of more than 50 percent.

3.2 Dataset Description

The dataset was obtained as a set of three tables. The tables were imported into a MS Access database to facilitate the querying of test results for individual patients. The following tables describe the attributes in each of the three tables. Table 3.1 shows the attributes in the SUBJECT table, which had the details of the 711 patients whose medical records were considered.

Table 3.1 Details of the SUBJECT table

  PATID    - Patient ID. Each of the 711 subjects had a unique PATID.
  DATERAND - Date of randomization. Date from which the medical test results were recorded.
  RACE     - Ethnicity of the subject. Possible values: W (White), B (Black), H (Hispanic), O (Other), X (Unknown).
  SEX      - Sex of the subject. Possible values: Male, Female.
  DATEDIAG - Date of diabetes diagnosis. For non-diabetic patients this field had no data.

Table 3.2 shows in detail the contents of the TEST table, which listed the details of the clinical tests that were conducted on these 711 patients, the date on which they were conducted, and the outcome of the tests. The TESTNAME is associated with each of the test categories mentioned in Table 3.2. A test category may be considered as a set of individual tests whose results in combination determine the outcome of that test category. A simple example would be AB-IVGTT, which is constituted of the following individual tests:

1. GLU-4
2. GLU1
3. INS1
4. INS3

The test results of these 4 tests combined determine the outcome of AB-IVGTT, which can be either low, normal or high. The OUTCOME field in the TEST table has the outcome of these test categories, and not the results of the individual tests that constitute that test category.

Table 3.2 Details of the TEST table

  PATID    - Patient ID. Foreign key from the SUBJECT table.
  TESTUID  - Unique ID for the particular test. Primary key for the TESTRES table.
  TESTNAME - Name of the test category. Some of the test categories are ICA, AB-IVGTT, CO-IVGTT, IAA and so on.
  DATEDRAW - Date on which the test was conducted.
  OUTCOME  - Result of the test category. The values in the OUTCOME field are low, normal, high, absent, pos and neg.

Table 3.3 Details of the TESTRES table

  TESTUID - Unique ID for a set of tests. Foreign key from table TEST.
  SUBTSTN - Name of the individual test. Some of the values were ICA, GLU-4, GLU1, INS1, PEP120 and so on.
  RESULT  - Test result value. Numeric value of the medical test.

Table 3.3 lists the attributes of the third table, TESTRES, which had the actual clinical test result values for each of the tests listed in Table 3.2.
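The three tables form a simple relational chain (SUBJECT -> TEST -> TESTRES). As an illustration only, the sketch below shows how one subject's raw test values could be pulled from such a schema; it uses Python's sqlite3 with a tiny in-memory stand-in for the database (the original data lived in MS Access, and the row values here are invented):

    import sqlite3

    # Tiny in-memory stand-in for the three DPT-1 tables of Section 3.2;
    # column names follow Tables 3.1-3.3, the row values are made up.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE SUBJECT (PATID TEXT, DATERAND TEXT, RACE TEXT, SEX TEXT, DATEDIAG TEXT);
    CREATE TABLE TEST    (PATID TEXT, TESTUID INTEGER, TESTNAME TEXT, DATEDRAW TEXT, OUTCOME TEXT);
    CREATE TABLE TESTRES (TESTUID INTEGER, SUBTSTN TEXT, RESULT REAL);
    INSERT INTO SUBJECT VALUES ('P0001', '1995-01-01', 'W', 'F', NULL);
    INSERT INTO TEST    VALUES ('P0001', 1, 'AB-IVGTT', '1995-02-01', 'normal');
    INSERT INTO TESTRES VALUES (1, 'GLU-4', 92.0);
    """)

    # One subject's raw results, oldest first: SUBJECT -> TEST -> TESTRES.
    rows = conn.execute("""
    SELECT t.TESTNAME, r.SUBTSTN, t.DATEDRAW, r.RESULT
    FROM SUBJECT s
    JOIN TEST    t ON t.PATID   = s.PATID
    JOIN TESTRES r ON r.TESTUID = t.TESTUID
    WHERE s.PATID = ?
    ORDER BY t.DATEDRAW
    """, ("P0001",)).fetchall()
    print(rows)  # [('AB-IVGTT', 'GLU-4', '1995-02-01', 92.0)]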

3.3 Attributes

This section gives a detailed listing of the attributes that constituted the diabetes dataset. The attributes of the dataset are comprised of clinical tests and two demographic parameters (Race and Sex). The clinical tests that were chosen were found relevant to the diagnosis and the onset of Type 1 diabetes as part of Diabetes Prevention Trial studies. Table 3.4 shows the list of attributes, along with their respective data types (e.g., Nominal and Continuous).

In all, there were 42 attributes or features that were available. Of these, 5 nominal attributes, AB_IVGTT, CO_IVGTT, FPG, OGTT and SUBTSTN_R, were left out, which either were rarely used or had outcomes determined by the combination of some other test results that were already considered (i.e., whether AB_IVGTT is normal or not is determined by the values of GLU-4, GLU1, INS1 and INS3). Moreover, the test DQAB was excluded as it had the value "ABSENT" for all the 711 patients. So there are 36 attributes that have been taken into consideration, of which 34 are real valued, representing medical test results, and 2 are nominal, representing Race and Sex.

3.4 Test Results Dataset

The training data was extracted differently for diabetic and nondiabetic patients. For diabetic patients, only the last set of medical tests before their diagnosis of diabetes was considered. Of the 256 subjects who had diagnosed diabetes, 201 of them had tests in the immediate vicinity of their diagnosis. The remaining 55 patients had their last set of tests before diagnosis quite far back in time (approximately 3 months before the date of diagnosis). For non-diabetic subjects, all their test results throughout their recorded medical history were considered. For repetitive tests, only the results of the first test were used. There were two compelling reasons as to why this approach was adopted:

1. For nondiabetic patients there was no reference date, unlike diabetic subjects who had a date of diagnosis to consider tests in that vicinity.

Table 3.4 List of attributes

  Demography:                  Race, Sex (Nominal)
  FPG:                         F (Continuous)
  GAD65 Antibody:              GAD65 (Continuous)
  Blood Glucose:               GLU0, GLU1, GLU-10, GLU120, GLU30, GLU-4, GLU60, GLU90 (Continuous)
  HBA1C Antibody:              HBA1C (Continuous)
  Insulin Auto Antibody:       IAA (Continuous)
  Islet Cell Antibody:         ICA, ICA512 (Continuous)
  Insulin Level:               INS1, INS10, INS-10, INS3, INS-4, INS5, INS7 (Continuous)
  Micro Insulin Auto Antibody: mIAA (Continuous)
  C-Peptides:                  PEP0, PEP1, PEP10, PEP-10, PEP120, PEP3, PEP30, PEP-4, PEP5, PEP60, PEP7, PEP90 (Continuous)

2. The consideration of the results of all distinct tests that were conducted on the subject reduced the number of missing data values in the dataset.

On average, each subject had 30 data values from the 36 attributes. However, for diabetic patients, since only their latest tests before diagnosis were considered, there were a considerable number of missing data values. One of the diabetic subjects had just one test result in the test set closest to the diagnosis date. So for this particular patient, the set of tests before the final test prior to diagnosis was also taken into consideration.

3.4.1 Dataset with Nominal Attributes

This particular dataset had both nominal and real valued attributes. The two nominal features were Race and Sex, while all attributes corresponding to the different medical test results were real valued. The class attribute had only two possible values, 0 and 1, representing the classes diabetic and non-diabetic respectively. Another important aspect was the representation of missing values. Missing attribute values in this dataset were represented by "?", irrespective of whether the attribute was nominal or real valued. The decision tree implementation tool that was used, i.e., usfc4.5, has the ability to handle missing data represented as "?", so no special encoding of such missing values was required. This particular dataset was used in the evaluation of the prediction accuracy of learning techniques like decision trees (using usfc4.5) and ensembles that used decision trees, like bagging and random forests. This dataset will henceforth be referred to as diabetes dataset D.

3.4.2 Dataset with Bit Encoded Nominal Attributes

For the neural network approach, the missing data values had to be handled differently. Moreover, the nominal attributes also had to be translated into numeric attributes. The reason for this was that the usfcascor implementation, like most neural networks, could only handle numeric attributes. The changes that were made to the representation of the dataset were:

1. In the dataset D for the decision tree approach, the missing data values were represented by "?". In the neural network dataset, we adopted a different approach to handle missing data. Each test attribute, which is a numeric field, had another attribute

associated with it. Let us call this new attribute an indicator attribute for that test field. The indicator attributes could have only 2 possible values, -0.5 and 0.5. If a test field had a valid data value, then the value of the corresponding indicator attribute was 0.5; otherwise, if the test field value was missing, then the indicator attribute's value was -0.5. Moreover, the value assigned to the test field if the data was missing was also -0.5; i.e., if the indicator attribute has a value of -0.5, then the associated test attribute too has a value of -0.5.

2. The data values for each of the attributes were scaled to lie between -0.5 and 0.5. To scale the value of an attribute, the minimum and maximum values for that attribute were calculated by traversing through the dataset. The scaled value was then computed by the formula:

  Scaled Value = (Actual Value - Minimum) / (Maximum - Minimum) - 0.5

where Minimum is the minimum value of that attribute and Maximum is the maximum value of that attribute.

3. The two nominal attributes, Race and Sex, were split up into a total of 7 distinct attributes (5 for Race and 2 for Sex), each indicating one of the possible values of the nominal attribute. Each of these 7 attributes has a value of 0.5 or -0.5. As a simple example, the attribute indicating the Race value "W" would be 0.5 if the race of the subject is "W", and all the other 4 Race attributes would have the value -0.5. No separate indicator attributes were associated with these nominal fields. In case of a missing value, each of the attributes associated with that nominal feature (either Race or Sex) has a value of -0.5.

This dataset will be referred to as the modified diabetes dataset, Dmod, in the later sections.
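The encoding just described (indicator attributes plus linear scaling into [-0.5, 0.5]) is mechanical, so a short sketch may help. The Python fragment below is a minimal illustration, not the actual preprocessing code used in the study; the MISSING sentinel and function names are assumptions:

    MISSING = None  # assumed sentinel for a missing test result

    def encode_numeric(values):
        """Scale valid values into [-0.5, 0.5] and emit (value, indicator) pairs.

        Missing entries become (-0.5, -0.5), as described in Section 3.4.2.
        """
        present = [v for v in values if v is not MISSING]
        lo, hi = min(present), max(present)
        encoded = []
        for v in values:
            if v is MISSING:
                encoded.append((-0.5, -0.5))
            else:
                scaled = (v - lo) / (hi - lo) - 0.5 if hi > lo else 0.0
                encoded.append((scaled, 0.5))
        return encoded

    def encode_nominal(value, categories):
        """One-of-m encoding with 0.5/-0.5; all -0.5 when the value is missing."""
        return [0.5 if value == c else -0.5 for c in categories]

    # Example: the Race attribute, a missing Sex value, one numeric column.
    print(encode_nominal("W", ["W", "B", "H", "O", "X"]))  # [0.5, -0.5, -0.5, -0.5, -0.5]
    print(encode_nominal(MISSING, ["M", "F"]))             # [-0.5, -0.5]
    print(encode_numeric([10.0, MISSING, 30.0]))           # [(-0.5, 0.5), (-0.5, -0.5), (0.5, 0.5)]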

3.5 Differential Dataset

The next phase of analysis involved the comparison of the decision tree and neural network approaches in predicting the onset of Type 1 diabetes by looking at the difference in successive test results. The objective of such an experiment was to see if it was possible to predict, with reasonable accuracy, whether a subject is potentially at risk of getting Type 1 diabetes by examining the change of different diagnostic test results over time. Ensembles of classifiers, both decision trees and neural networks, were used to improve the accuracy of prediction.

3.5.1 Differential Dataset with Nominal Attributes

This dataset was built by computing the difference in successive test results of the same test type for a particular subject. For non-diabetic subjects, all the tests throughout their recorded medical history (i.e., collected throughout the duration of the study) were considered. On the other hand, for diabetic patients, their medical tests from the date of randomization to the date of diagnosis of diabetes, except the last set of tests immediately before diagnosis, were considered. The reason for leaving out the last set of tests before diagnosis was that these results would have most likely differentiated the diabetic patients from the non-diabetic subjects, as they presumably enabled a diagnosis. For the 55 diabetic patients who had their last tests conducted a long time before their diagnosis, this last set of tests (which were conducted as far back as 3 months before diagnosis of diabetes) was considered. In the later experiments, this dataset will be referred to as the differential diabetes dataset, DD.

In the differential dataset, each example corresponds to a particular subject, either diabetic or non-diabetic. Each of the attributes in the dataset is a difference value between consecutive tests of the same test type. To build this dataset, the following steps were performed:

1. All the relevant test results for each of the 711 subjects were collected from the original database.

2. For each test type, the difference was computed between one test and the next, and so on. For example, the test type F was repeated 4 times for a patient, and so there were three attribute values corresponding to the 3 difference values: between the 1st F test value and the 2nd F test value, between the 2nd test and the 3rd test, and so on.

3. Since each patient may have a test performed a different number of times, the number of difference values for each subject may be different. To calculate the number of attributes in

the dataset, the maximum number of times a test is performed on a patient is computed for each of the test types. These maximum numbers, minus 1, give the maximum number of difference values for that test. Hence, the maximum number of differences for a test type determines the number of attributes associated with that test. For example, the test GAD65 was repeated a maximum of 15 times for a particular patient among all the 711 subjects, so the maximum number of difference values is 14 for GAD65. The attributes are named from GAD65_0 to GAD65_13. For subjects who have fewer than 14 GAD65 tests, the remaining values are treated as missing values.

The total number of attributes in the dataset was calculated to be 473, including the 2 nominal attributes, Race and Sex. Similar to the dataset D for the decision tree approaches, the differential data extracted was formatted into dataset DD. The dataset has two nominal attributes, Race and Sex, and 471 real valued features. Each of these 471 attributes represents a difference value between two consecutive tests of the same type. The Race attribute could have the values "W", "B", "H", "O" and "X". The Sex attribute could take the values "M" and "F". The class attribute had two possible values of 1 and 0, representing the two classes Diabetic and Non-Diabetic, respectively. The missing data values in the dataset, both for nominal and real valued features, were replaced by the character "?". The usfc4.5 implementation (i.e., the decision tree approach) can handle missing data represented in this manner, and so the dataset DD was used with usfc4.5 based decision trees and ensemble techniques, like bagging and random forests, that used decision trees for the individual predictors.
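As an illustration of the construction just described, here is a minimal Python sketch, under the assumption that each subject's results are held in a per-test-type list ordered by draw date (the variable names and toy values are invented for the example):

    # Each subject's history: test type -> list of results ordered by date drawn.
    # Invented toy data; the real attribute counts come from the maximum number
    # of repetitions of each test across all 711 subjects.
    subject_history = {
        "GAD65": [1.2, 1.9, 3.4],   # three tests -> two difference values
        "ICA":   [10.0],            # a single test yields no differences
    }

    def difference_attributes(history, max_diffs):
        """Build the differential feature vector for one subject.

        max_diffs maps a test type to the maximum number of differences seen
        for any subject; positions without data become missing values ("?").
        """
        features = {}
        for test, n in max_diffs.items():
            results = history.get(test, [])
            diffs = [b - a for a, b in zip(results, results[1:])]
            for i in range(n):
                name = f"{test}_{i}"  # e.g. GAD65_0 ... GAD65_13
                features[name] = diffs[i] if i < len(diffs) else "?"
        return features

    print(difference_attributes(subject_history, {"GAD65": 14, "ICA": 4}))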

3.5.2 Differential Dataset with Bit Encoded Nominal Attributes

The differential dataset had to be reformatted to be used with the Cascade Correlation based learning technique. The usfcascor [11] tool was used in this case. The changes in the dataset were related to the representation of the missing values and the nominal attributes, Race and Sex. The following changes were made to the dataset:

1. An indicator attribute was associated with each attribute in the differential dataset; it indicates whether the associated difference attribute has a valid value or not. If the associated difference attribute has a valid value, then the indicator attribute value is 0.5; otherwise it is -0.5. If the indicator attribute is -0.5, then the value of the corresponding difference attribute is also -0.5.

2. The values of each of the 471 difference attributes were scaled between 0.5 and -0.5. In order to scale the attribute values, the maximum and minimum values of each of the difference attributes were calculated, and then the scaled value was computed as follows:

  Scaled Value = (Actual Data Value - Minimum) / (Maximum - Minimum) - 0.5

where Minimum and Maximum are the minimum and maximum values of the difference attribute in the differential dataset.

3. The nominal attributes Race and Sex were split up into separate attributes, each representing an individual value, i.e., 5 attributes for Race corresponding to the 5 different Race values "W", "B", "H", "O" and "X", and 2 attributes for Sex corresponding to the values "F" and "M". Indicator attributes are not required for any of these, as a missing nominal attribute value is represented by all the value attributes being -0.5 (i.e., if a Race value is missing, then all the 5 attributes representing the 5 different Race values are -0.5).

This dataset will henceforth be referred to as the modified differential dataset, DDmod, in the later sections.

CHAPTER 4
LEARNING APPROACHES USED

In this chapter, the different machine learning algorithms, and their implementations, that have been used in studying the DPT-1 data are discussed in some detail. Also discussed is an algorithm for oversampling of minority class examples, called SMOTE [5], which was used in combination with different learning predictors.

4.1 Cascade Correlation Neural Networks

Cascade-Correlation neural networks [24] are based on two main concepts: the cascade architecture and the learning method. The cascade architecture refers to the adding of hidden units one at a time, and then freezing the input weights of the added units before adding the next new unit. The learning algorithm deals with creating and installing the hidden units. The addition of each hidden unit tries to maximize the correlation between the output of the new hidden unit and the residual error that the network is trying to eliminate.

The building of the cascor network starts initially with just the input nodes and the output nodes, with no hidden units. The numbers of input and output units are determined by the problem and the nature of the data representation. The input nodes are each connected to each of the output nodes, with each connection having an adjustable weight, and the bias is permanently set to 1. The output values can just be a weighted sum of the inputs, or the learning method may employ some nonlinear activation function. In cascor, the default activation function is a symmetric sigmoidal function (hyperbolic tangent) whose output range is -1.0 to +1.0. The hidden units are added one at a time, with each hidden unit receiving an input from each of the input nodes as well as the previous hidden nodes, and having an output connection to each of the output nodes. The input weights are all frozen when adding a new hidden layer, and only the output weights need to be adjusted.
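For reference, the quantity maximized when training a candidate hidden unit, as defined in Fahlman and Lebiere's cascade-correlation paper [24] (stated here from that cited paper, not from this thesis), is the magnitude of the covariance between the candidate's output V and the residual output error E, summed over all output units o and training patterns p:

  S = \sum_{o} \Big| \sum_{p} (V_p - \bar{V}) (E_{p,o} - \bar{E}_o) \Big|

where \bar{V} and \bar{E}_o are averages over the training patterns. The candidate's input weights are adjusted by gradient ascent on S before the unit is installed and its inputs frozen.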

As discussed earlier, there are no hidden units at the beginning, just the input and the output nodes. These connection weights are trained as well as possible over the entire training set. Cascor does not use back propagation; instead it uses either some well known single layer neural network learning scheme, like the Widrow-Hoff "delta rule", or the Quickprop method. After some time, the training reaches an asymptote, whereby no significant reduction in error occurs over a certain number of epochs; then the error is noted. A new hidden unit is added with its input connection weights frozen, and the output weights are once again trained. The process continues until the error reaches an acceptable limit. The usfcascor tool [11], developed at the University of South Florida, is a modification of Fahlman's original cascor implementation that has been modified to read the same data format as the USF implementation of C4.5 (the usfc4.5 tool), as discussed in Section 4.2.

4.2 C4.5 Decision Trees

A decision tree is a learning mechanism which uses a "divide and conquer" strategy to classify instances. It consists of leaf nodes labeled with a class, and test nodes that connect two or more subtrees. Every test node computes a result based on some particular attribute, and this outcome decides along which subtree the example should go. Every instance starts from the root of the tree and traverses down, depending on the attribute tests at each node, until it reaches a leaf, whereby the class label of the leaf determines the class prediction of the example.

In order to build a decision tree, at every node an attribute needs to be chosen that would offer the best split among the training data. The decision is based on the value of information gain or gain ratio, i.e., the attribute that most effectively splits the training examples according to their class labels. The attribute that has the best information gain at any node is used to split that node. The decision tree could also be geometrically represented as partitioning the x-dimensional data space (defined by x attributes) with decision planes corresponding to the tests at each node.

Real data in most cases contains noise, and the divide and conquer approach to building trees tends to build such discrepancies into the classifier, which evidently leads to lower prediction accuracy on unseen test data. To overcome such overfitting of the training

data, decision trees are typically pruned. Smaller, pruned trees are not only easier to understand but also are generally more accurate on test data. Unpruned trees, however, are useful in ensembles like random forests, generally producing a better accuracy percentage. Ensembles and random forests are discussed in a later section.

The C4.5 [23] algorithm for generating decision trees is based on the ID3 (Iterative Dichotomiser 3) algorithm. The splitting criterion used by C4.5 at each node of the decision tree is an information-based measure, the gain ratio, which takes into consideration different probabilities of test outcomes. Let D be a dataset and T be a particular test that has outcomes T_1, T_2, T_3, ..., T_k, which is used to partition D into subsets D_1, D_2, D_3, ..., D_k, such that D_i contains the set of examples having the outcome T_i for the test T. Now, if C denotes the number of classes and p(D, j) the proportion of cases in D that belong to the jth class, then the residual uncertainty about the class to which a case in D belongs is expressed as

  Info(D) = - \sum_{j=1}^{C} p(D, j) \log_2 p(D, j)

and the corresponding information gain from a test T with k outcomes is

  Gain(D, T) = Info(D) - \sum_{i=1}^{k} \frac{|D_i|}{|D|} Info(D_i).

Again, the potential information obtained by partitioning a set of examples is based on which subset D_i the example lies in, and is known as the split information, defined as

  Split(D, T) = - \sum_{i=1}^{k} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}.

The gain ratio of a test is defined as the ratio of its information gain to its split information. The C4.5 algorithm computes the gain ratio of every test at a node, and the one that yields the maximum value is used to split the node. An alternative stopping criterion is to stop splitting a node when all the examples in the node belong to the same class, i.e., when the gain ratios for all tests are zero. In the case of continuous attributes, it must be noted that they can be used multiple times to split nodes. The C4.5 algorithm has some improvements over ID3 in terms of handling missing data and dealing with continuous attributes.

The usfc4.5 tool is a tool based on C4.5 that has been developed at the Computer Science and Engineering department at the University of South Florida.
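For concreteness, the three quantities above can be computed directly from class counts; the following is an illustrative Python rendering of the formulas (not the usfc4.5 source):

    from math import log2

    def info(class_counts):
        """Info(D): entropy of the class distribution in a set of examples."""
        total = sum(class_counts)
        return -sum(c / total * log2(c / total) for c in class_counts if c)

    def gain_ratio(parent_counts, subset_counts):
        """Gain(D,T) / Split(D,T) for a test T partitioning D into subsets."""
        n = sum(parent_counts)
        weights = [sum(s) / n for s in subset_counts]
        gain = info(parent_counts) - sum(w * info(s)
                                         for w, s in zip(weights, subset_counts))
        split_info = -sum(w * log2(w) for w in weights if w)
        return gain / split_info

    # Example: 14 diabetic / 16 non-diabetic examples split into two subsets.
    print(round(gain_ratio([14, 16], [[12, 4], [2, 12]]), 3))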

A usfc4.5 dataset has two main files, a .data file and a .names file. The .names file contains the class labels and a list of all the attributes, with the attribute type (nominal, continuous, etc.) and, for nominal attributes, the possible values. The .data file contains each example on a separate line, with the attribute values separated by commas and the last value being the class label.

4.3 Ensembles

Ensemble techniques [9] are learning algorithms that build a set of classifiers and then classify unseen examples by taking a weighted or unweighted vote of their predictions. Bagging, Boosting and Random Forests are some of the ensemble techniques. A necessary and sufficient condition for a classifier to be more accurate when used in ensembles is that the classifier be accurate (with accuracy greater than random guessing) and diverse (the classifiers make different errors on new examples).

There are three main reasons as to why most ensembles produce results better than a single classifier. The first is statistical, where a single classifier suffers from a lack of sufficient training data (i.e., too small compared to the size of the hypothesis space, H). Building an ensemble of accurate classifiers and averaging the predictions reduces the error of choosing the wrong classifier in the hypothesis space. The second reason is computational, where a classifier can get stuck in a local optimum (e.g., neural networks which use gradient descent, or decision trees using a greedy approach). Finding the best hypothesis using neural networks or decision trees is an NP-hard problem, and creating ensembles which search from different starting points can give a better approximation. The third reason is representational, whereby the use of an ensemble does not restrict the classifier to finding the best hypothesis for a single training set, but uses different training sets and hence searches a greater space of possible best hypotheses within H.

Following are some of the most popular methods of building ensembles:

1. Different sets of training examples are used for building each classifier in the ensemble, which works well for unstable classifiers like decision trees and neural networks.

2. Building different classifiers with different combinations of features.

3. An ensemble of classifiers can be built by partitioning the classes (if there are a large number of classes) into two groups, one having a single class and the other all the other classes

combined, and then relabeling the training and test sets. Repeating this procedure with each of the classes builds an ensemble.

4. Introducing randomness into the method of building the classifier, e.g., randomly choosing the initial weights of the nodes in a back propagation neural network, or randomly choosing an attribute from among the top ten attributes with the best information gain to split a node of a decision tree.

4.3.1 Bagging

Bagging predictors [1] is an ensemble method for generating multiple versions of a classifier and aggregating the predictions. For numeric predictions, the results from the individual predictors are averaged to give the final predicted value, while for classifiers, a plurality vote among the predictors decides the predicted class. One important requirement for bagging to yield better accuracy than a single predictor is for the underlying predictor to be unstable, where unstable is defined as when perturbing the training set causes significant changes in the predictor constructed.

To better illustrate how bagging works, let us consider a learning dataset L which consists of the data points {(y_n, x_n), n = 1, 2, ..., N}, where the y's are either class labels or numeric results, and the predictor's prediction is denoted by φ(x, L). If an ensemble of k predictors is built, then each of the {L_k} learning sets consists of N independently drawn observations from the same underlying dataset L. For numeric predictions, i.e., if y is numeric, the aggregated prediction is given by the average of φ(x, L_k) over all k, i.e., by φ_A(x) = E_L φ(x; L), where E_L represents the expected value over L and φ_A(x) denotes the aggregated value. In case y represents class labels with a value j ∈ {1, 2, ..., J}, then the aggregation is done using voting. If N_j represents the number of votes for the jth class from all the k classifiers, then the voted prediction of the ensemble is given by φ_A(x) = argmax_j N_j. In bootstrap aggregation, or bagging, different training sets are generated by repeatedly taking bootstrap samples {L_B} from L. The aggregation techniques remain the same, i.e., the average value for numeric predictions and the voted majority class for classification. The {L_B} each have N examples, drawn randomly with replacement from L.
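A minimal sketch of bootstrap aggregation may make this concrete. The toy base learner below simply predicts the majority class of its own bootstrap sample; it stands in for any unstable base predictor, such as the usfc4.5 trees or cascor networks used in this study (all names here are invented for the illustration):

    import random
    from collections import Counter

    def bagged_ensemble(train, learn, k):
        """Train k predictors, each on N examples drawn with replacement."""
        return [learn([random.choice(train) for _ in range(len(train))])
                for _ in range(k)]

    def vote(ensemble, x):
        """Plurality vote of the individual predictors decides the class."""
        return Counter(predict(x) for predict in ensemble).most_common(1)[0][0]

    # Toy base learner: predicts the majority class of its bootstrap sample.
    def majority_class_learner(sample):
        label = Counter(y for _, y in sample).most_common(1)[0][0]
        return lambda x: label

    train = ([({"GLU0": 90}, "non-diabetic")] * 5
             + [({"GLU0": 150}, "diabetic")] * 2)
    ensemble = bagged_ensemble(train, majority_class_learner, k=100)
    print(vote(ensemble, {"GLU0": 120}))  # -> 'non-diabetic'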

The accuracy obtained using bagging depends on the stability of the predictor φ. If changes in L produce substantial changes in the predictions, i.e., if the predictor is unstable, then the bagged results should give better accuracy. The implementation of bagged ensembles is quite simple, whereby individual predictors are iteratively created using different training sets and the class/numeric predictions are aggregated at the end. However, one tradeoff for better accuracy is that the bagged predictions are less interpretable than the results obtained from a single classifier like a decision tree.

4.3.2 Random Forests

In Random Forests [2], multiple classification trees are grown, and to classify each new example, the input vector is sent down each of these trees and the individual trees vote to decide on the final prediction. Whichever class gets the maximum votes is predicted. To build a single tree in the forest, the following steps are followed:

1. If the number of training examples is N, then N samples are randomly selected from the training set, with replacement, similar to bagging. This sampled dataset is used to build the tree.

2. Of all the available features, say M, a subset m (such that m << M) is selected randomly at each node, and the best split on these m features is used to split the node.
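The step that distinguishes a random forest tree from a plain bagged tree is this per-node restriction of the split search to a random feature subset. A one-function sketch (the feature names echo Table 3.4 and the differential naming scheme, used here only as an illustration):

    import random

    def node_split_candidates(all_features, m):
        """Step 2: at each node, only m randomly chosen features (m << M)
        are considered when searching for the best split."""
        return random.sample(all_features, m)

    features = [f"GAD65_{i}" for i in range(14)] + ["Race", "Sex"]
    print(node_split_candidates(features, m=4))  # e.g. ['GAD65_3', 'Sex', ...]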
Among the properties of random forests noted in [2] are the following:

- It generates an internal unbiased estimate of the generalization error as the forest grows.
- It can effectively estimate missing data values and maintain accuracy when a large proportion of data is missing.
- It can balance error for unbalanced datasets (unequal class distribution).
- It provides for an experimental method to detect variable interactions.

4.4 Support Vector Machines

The Support Vector Machine (SVM) [7] is a relatively new method of learning for two-class classification problems. The SVM maps the input vectors non-linearly to a high dimensional feature space and builds a linear decision boundary, i.e., a hyperplane, within this feature space. The main objective is to find an optimal separating decision plane that will give the best generalization among all the hyperplanes in the high dimensional feature space. The optimal hyperplane is defined as a linear decision function with the maximum distance between the vectors of the two classes, as illustrated in Figure 4.1. To build the optimal hyperplane, only a small number of training set examples need to be considered. These examples are called the support vectors.

[Figure 4.1: A representation of data in 2-dimensional space with support vectors marked in grey squares, with margins of separation]

Mathematically, given a training set of labeled instances (x_i, y_i), i = 1, 2, ..., n, where x_i \in R^d and y \in \{1, -1\}^n, the SVM involves the solution to the optimization problem

  \min_{w, b, \xi} \; \tfrac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i
  subject to \; y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.

A function \phi maps the training instances to a higher dimensional space, and the SVM finds the plane that separates them with the maximum margin in this high dimensional space. C > 0 is the cost factor, and K(x_i, x_j) = \phi(x_i)^T \phi(x_j) is called the kernel function. The common kernels are linear, polynomial, radial basis function and sigmoid. The Support Vector Machine implementation that has been used here is the open source LIBSVM tool [3].

The basic steps required for testing a dataset involve:

- Transform the dataset to the software specific format.
- Try a few different kernels with different parameter values.
- Test the predictor with unseen examples.

SVM requires that each of the data attributes be a real number. To achieve higher accuracy, it is generally good to scale the data values, so that larger numeric ranges do not dominate over features with smaller ranges. Linear scaling of data is commonly used. Also, the most common model or kernel used is the radial basis function (RBF). This kernel non-linearly maps the samples to a high dimensional space and thus is able to handle non-linear relationships between the features and the classes. The parameters also need to be tuned to give the best results for choosing the optimal separating hyperplane. A grid search technique is generally employed to determine the best combination of the cost factor, C, and the kernel parameter γ.
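The grid search over C and the RBF width γ can be reproduced with modern tooling. The study used LIBSVM's bundled Python scripts; the sketch below substitutes scikit-learn's GridSearchCV as a stand-in, with illustrative power-of-two parameter ranges (commonly recommended for LIBSVM) and a synthetic toy dataset rather than the thesis data:

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import make_classification

    # Toy stand-in for the scaled differential dataset.
    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    # Exponentially spaced C and gamma values, the usual LIBSVM-style grid.
    param_grid = {
        "C": [2 ** k for k in range(-5, 11, 2)],
        "gamma": [2 ** k for k in range(-15, 1, 2)],
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))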

4.5 Synthetic Minority Over-Sampling Technique

The problem of imbalanced datasets, where there is a wide disparity in the number of examples belonging to each class, has been studied by many, like Japkowicz [17], Kubat and Matwin [18], and Ling and Li [19], to mention a few. It had been shown that under-sampling of majority class examples enables better classifiers than over-sampling of minority class examples. However, the most common approach to over-sampling has been to sample with replacement from the original data. Another over-sampling approach has been proposed by Chawla et al. [5], called SMOTE (Synthetic Minority Over-sampling Technique). In this particular method, minority class examples are over-sampled by generating "synthetic" examples rather than over-sampling with replacement. The generation of synthetic examples operates in the feature space rather than in the data space. The minority class is oversampled by taking each minority class example and generating new synthetic examples along the line segments joining any or all of the k minority class nearest neighbors. Neighbors from the k nearest neighbors are randomly chosen depending on the amount of oversampling required. As an example, if the amount of over-sampling is 200%, two of the k nearest neighbors are randomly chosen and then one synthetic example is generated along each of these two nearest neighbor samples. To generate a synthetic example, the following steps need to be done:

1. The difference between the feature vector (sample) and the nearest neighbor is computed.

2. The difference is multiplied by a random number chosen from 0 to 1, and the product is added to the feature vector under consideration.

This causes the selection of a random point on the line segment joining the two feature vectors, and thus forces the decision region of the minority class to be more general. The synthetic examples create larger, less specific decision boundaries, helping decision trees generalize better.
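A minimal sketch of the two steps above, using k nearest neighbors by Euclidean distance. The helper names are invented for the illustration, and this is not the original SMOTE implementation of [5]:

    import random

    def nearest_neighbors(sample, minority, k):
        """k nearest minority-class neighbors of sample (Euclidean distance)."""
        others = [m for m in minority if m is not sample]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(sample, m)))
        return others[:k]

    def smote(minority, amount_pct=100, k=5):
        """Generate (amount_pct / 100) synthetic examples per minority sample."""
        synthetic = []
        per_sample = amount_pct // 100
        for sample in minority:
            neighbors = nearest_neighbors(sample, minority, k)
            for nb in random.sample(neighbors, min(per_sample, len(neighbors))):
                gap = random.random()  # random point along the joining segment
                synthetic.append(tuple(a + gap * (b - a)
                                       for a, b in zip(sample, nb)))
        return synthetic

    minority = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (1.1, 2.1)]
    print(smote(minority, amount_pct=100, k=2))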

CHAPTER 5
EXPERIMENTS AND RESULTS

This chapter discusses in detail the different experiments that were conducted on the clinical test results dataset and the differential datasets.

5.1 Predicting Type 1 Juvenile Diabetes Using Test Results

In this set of experiments, the datasets D and Dmod were used to test the performance of the different learning methods in predicting juvenile diabetes by looking at actual test results. The details of how these datasets were compiled have been discussed in Sections 3.4 and 3.5.

5.1.1 Experiments with Decision Trees and Ensembles

For the decision tree approach, the dataset D, with nominal attributes, was used. The dataset D represented missing values with "?". The decision tree implementation that was used was the usfc4.5 tool. A 10-fold cross validation, using usfc4.5 as the classifier, was done, and the resultant overall accuracy obtained was 88.8%. The confusion matrix generated from the 10-fold cross validation is shown in Table 5.1.

This result was a bit disappointing, given the fact that a physician almost always diagnoses diabetes accurately. We suspected that the 55 diabetic patients who had their last tests conducted almost three months before diagnosis would be hard to predict, so their prediction results were examined separately. The predictions for these 55 patients were isolated from the test folds of the 10-fold cross validation results and checked. 5 of the 55 subjects were classified as diabetics and the rest were classified as non-diabetics. The reason as to why almost all of these 55 patients were wrongly classified might be attributed to the absence of test results in the immediate vicinity of their date of diabetes diagnosis.

Table 5.1 Confusion matrix after 10-fold cross validation using usfc4.5

  Classified as →    Diabetics    Non-Diabetics
  Diabetics          188          68
  Non-Diabetics      11           444

5.1.2 Experiments with Neural Networks and Ensembles

In the following experiments, the dataset Dmod, as described in Section 3.4.2, was used. A 10-fold cross validation using the usfcascor [11] neural network classifier yielded an overall accuracy of 99.72%, which was quite an astonishing figure. To better understand the reason for such a high accuracy rate, a decision tree was generated using this particular dataset. The decision tree, shown in Figure 5.1, reveals that the absence of two particular tests in a subject's records improved the prediction accuracy from the previously obtained 88.8% (using dataset D) to 99.85% (using dataset Dmod). From the decision tree in Figure 5.1, it is evident that the absence of the tests PEP4 and GLU0 (as shown by their respective indicator attributes SUBTSTN_PEP_4_IND and SUBTSTN_GLU0_IND having value -0.5) plays a vital role in achieving such high accuracy. If these tests are not done, then the patient is likely to be diagnosed with diabetes.

[Figure 5.1: Decision tree generated using the modified dataset]
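As a quick arithmetic check (an illustrative computation, not from the thesis), the overall accuracy reported in Section 5.1.1 can be recovered from Table 5.1:

    # Entries from Table 5.1: rows are true classes, columns are predictions.
    tp, fn = 188, 68    # diabetics correctly / incorrectly classified
    fp, tn = 11, 444    # non-diabetics incorrectly / correctly classified

    total = tp + fn + fp + tn          # 711 subjects
    accuracy = (tp + tn) / total
    print(f"{accuracy:.1%}")           # ~88.9%, consistent with the reported 88.8%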

5.2 Type 1 Juvenile Diabetes Prediction from Differences in Test Results

The next phase of analysis involved the comparison of the decision tree and neural network approaches in predicting the onset of Type 1 diabetes by looking at the difference in successive test results. The objective of this experiment was to see if it was possible to predict, with reasonable accuracy, whether a subject is potentially at risk of getting Type 1 diabetes by examining the change of different diagnostic test results over time. Ensembles of classifiers, both decision trees and neural networks, were used to improve the accuracy of prediction.

5.2.1 Experiments with Decision Trees and Ensembles

For the decision tree approach, the dataset DD was used, in which the missing values were represented by "?" and the Race and Sex attributes were treated as nominal attributes. The 10-fold cross validation using a single decision tree yielded an overall accuracy of 66.8%. As can be seen, the average accuracy obtained from the 10-fold cross validation using the usfc4.5 classifier was quite low, and was close to the accuracy (about 64%) that would have been obtained if all the examples were predicted to be of the majority (i.e., non-diabetic) class. Table 5.2 shows the confusion matrix that resulted from the predictions over a 10-fold cross validation using a single decision tree on the differential dataset, DD.

Table 5.2 Confusion matrix after 10-fold cross validation: using a single decision tree on the differential dataset

  Classified as →    Diabetics    Non-Diabetics
  Diabetics          129          127
  Non-Diabetics      109          346

Next, random forests with usfc4.5 decision tree classifiers were used to see if ensembles of decision trees would yield better prediction accuracy. Different subsets of randomly chosen attributes were tested for the random forest approach, to come up with the number of attributes which would yield maximum prediction accuracy. It was seen that a subset of 50 attributes using an ensemble of 100 decision trees yielded maximum prediction accuracy. A 10-fold cross validation with random forests using 50 attributes and 100 trees yielded an average

accuracy of 71.31%. The confusion matrix of the random forests approach is shown in Table 5.3.

Table 5.3 Confusion matrix after 10-fold cross validation: using random forests with 50 attributes and 100 trees on the differential dataset

  Classified as →    Diabetics    Non-Diabetics
  Diabetics          99           157
  Non-Diabetics      47           408

Although the overall accuracy of the random forest approach was quite an improvement over a single decision tree, the number of correct diabetic predictions was still low. The number of diabetic subjects who were incorrectly classified as non-diabetics far outnumbered the ones correctly classified as diabetics. This low number of True Positives was a cause for concern.

5.2.2 Experiments with Neural Networks and Ensembles

The 10-fold cross validation results using usfcascor on the modified differential dataset DDmod yielded an overall accuracy of 72.71%, which is a marginal improvement over the random forests approach discussed in Section 4.3.2. The confusion matrix obtained from the 10-fold cross validation is shown in Table 5.4.

Table 5.4 Confusion matrix after 10-fold cross validation: using a single neural network on the modified differential dataset

  Classified as →    Diabetics    Non-Diabetics
  Diabetics          151          105
  Non-Diabetics      89           366

Although the overall accuracy obtained using Cascor based neural networks did not improve much over the random forests results, a significant increase in correct diabetic predictions was noted. The True Positives increased from 99 to 151, which is a 52% increase over the random forest results. Due to the increase in the number of True Positives using neural networks, and the increase in accuracy with ensembles (random forests), the next set of experiments used ensembles of neural networks.

The usfcascor tool was modified in order to implement the ensemble of neural networks. To build an ensemble of bagged neural networks, the first step was to build the training sets of each of the neural networks that form part of the ensemble. The training examples for each of the neural networks are randomly chosen with replacement (the replacement is determined by the percentage value specified, and by default it is taken as 100%) from the original training set. A random seed value is chosen, which determines the random samples that make up the training data of each individual neural network. Having built the training set, the neural network is built and the predictions on the unknown test examples are noted. The process is repeated for each of the individual predictors. To get the final prediction for each of the test examples, a majority vote of all the predictions from each predictor for that example is taken.

A bagged ensemble of Cascor based neural networks was used with different percentages for replacement while drawing training samples. When bags with 100% replacement of drawn samples (i.e., 100% of the training samples were drawn with replacement) were used, with an ensemble of 100 neural networks, the average accuracy obtained from a 10-fold cross validation was 77.21%. The confusion matrix obtained is shown in Table 5.5. Using an ensemble of neural networks also shows improvement with respect to some other statistical measures, like the F-value, which is discussed in Section 5.3.1. Experiments with bagged ensembles of neural nets with 90% replacement (to have greater variability in the training data) were also conducted on the modified differential dataset DDmod. The accuracy from 10-fold cross-validation with 100 neural nets was 77.07%, which did not increase over the accuracy with 100% bags.

Table 5.5 Confusion matrix after 10-fold cross validation: using bagged ensemble of 100 neural networks on the modified differential dataset

  Classified as →    Diabetics    Non-Diabetics
  Diabetics          162          94
  Non-Diabetics      68           387
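The F-value referred to above is the standard F-measure, the harmonic mean of precision and recall on the diabetic (positive) class. As an illustration only (these values are computed here from the confusion matrices already shown, not quoted from the thesis):

    def f_measure(tp, fn, fp):
        """F1 = harmonic mean of precision and recall for the positive class."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    # Diabetic class: Table 5.3 (random forests) vs Table 5.5 (bagged nets).
    print(round(f_measure(tp=99, fn=157, fp=47), 3))   # ~0.493
    print(round(f_measure(tp=162, fn=94, fp=68), 3))   # ~0.667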

PAGE 41

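The bagging procedure just described is generic and independent of the underlying learner. A minimal sketch in Python, assuming hypothetical train() and model.predict() interfaces that stand in for the actual usfcascor code:

    import random

    def bagged_ensemble(train_set, test_set, train, n_models=100, pct=1.0, seed=0):
        """Majority-vote bagging; train(bag) -> model, model.predict(x) -> label."""
        rng = random.Random(seed)           # the seed determines each bootstrap sample
        n_draw = int(pct * len(train_set))  # pct=1.0 corresponds to 100% bags
        votes = [{} for _ in test_set]      # per test example: label -> vote count
        for _ in range(n_models):
            # draw a bag with replacement from the original training set
            bag = [train_set[rng.randrange(len(train_set))] for _ in range(n_draw)]
            model = train(bag)
            for i, example in enumerate(test_set):
                label = model.predict(example)
                votes[i][label] = votes[i].get(label, 0) + 1
        # the majority vote of the individual predictors gives the final label
        return [max(v, key=v.get) for v in votes]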
The modified differential dataset, DDmod, was also used with the decision trees and random forests to test whether these learning approaches would yield better results. The accuracy results obtained after a 10-fold cross validation using the decision tree approach on the modified differential dataset, DDmod, are:

- A single decision tree yielded an overall accuracy of 72.29%, a significant increase over the 66.8% obtained using the differential dataset DD.
- Bagging with 100% bags and 100 decision trees gave an accuracy of 79.04%.
- Using 100% bags with 100 trees, but with each tree in the ensemble left unpruned (to make the trees more unstable), the average accuracy increased marginally to 79.76%.
- Random forests with lg(n) random attributes at each node (where n is the number of attributes) and 100 trees had an accuracy of 77.78%.
- Random forests with subsets of 100 attributes and 100 trees yielded a higher accuracy of 80.31%.

As can be seen from the results above, random forests with a subset of 100 random attributes at each node and 100 trees yielded the maximum average accuracy over a 10-fold cross validation, 80.31%. The confusion matrix is shown in Table 5.6.

Table 5.6 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes and 100 trees on the modified differential dataset

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             156           100
    Non-Diabetics          40           415

A different set of experiments was conducted to test the effect of varying the random seed value used to split the 10 folds for cross validation. A series of 5 experiments was run, each with a different seed value; the minimum accuracy obtained was 78.91% and the maximum 80.59%, which is slightly better than the best accuracy of 80.31% obtained using the default seed. The average accuracy over these five tests was 79.75%.

A summary of some of the prediction accuracies obtained from 10-fold cross validation using different learning algorithms on the modified differential dataset is shown in Table 5.7.
Table 5.7 Summary of prediction accuracies obtained from 10-fold cross validation using different learning methods on the modified differential dataset

    Classifier                                            Average Accuracy (%)
    Single Neural Network                                        72.71
    100% bags and 100 Neural Networks                            77.21
    90% bags and 100 Neural Networks                             77.07
    Single Decision Tree                                         72.29
    100% bags and 100 Pruned Decision Trees                      79.04
    100% bags and 100 Unpruned Decision Trees                    79.76
    Random Forests with lg(n) attributes and 100 Trees           77.78
    Random Forests with 100 attributes and 100 trees             80.31

5.2.3 Experiments with Support Vector Machines

Support Vector Machine (SVM) based predictors were also used to classify the differential data. As mentioned in Section 4.4, the tool used is LIBSVM [3]. Testing the prediction accuracy of an SVM with this tool requires the following steps (a sketch of the data transformation in step 1 follows this list):

1. Transform the dataset into the format required by the LIBSVM tool.
2. Scale the data linearly.
3. Choose a kernel. The available kernels are: Linear, Polynomial, Radial Basis Function, Sigmoid, and a pre-computed kernel whose values are specified in a special file.
4. Use cross validation to determine the best combination of the cost, C, and the kernel parameters.
5. Use these parameter values to train the SVM.
6. Use an unknown test set to calculate the accuracy.
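A minimal sketch of step 1, reflecting the reformatting described in the following paragraphs (class labels become +1/-1, each attribute value is written as index:value, and an m-valued nominal attribute is split into m numeric columns holding 0.5 for the active value and -0.5 otherwise). The function and data structures are hypothetical, and the treatment of missing numeric values (simply omitting them, which LIBSVM reads as 0) is an assumption, since the thesis does not specify it:

    def to_libsvm_line(label, values, nominal_values):
        """label: 'diabetic' or 'non-diabetic'; values: ordered list of
        (attribute, value) pairs; nominal_values: dict mapping a nominal
        attribute to its list of possible values."""
        cols = []
        for attr, v in values:
            if attr in nominal_values:
                # split an m-valued attribute into m numeric attributes
                for choice in nominal_values[attr]:
                    cols.append(0.5 if v == choice else -0.5)
            else:
                cols.append(v)  # numeric values are assumed already scaled
        target = '+1' if label == 'diabetic' else '-1'
        # sparse LIBSVM format: "label index:value index:value ..."
        return target + ' ' + ' '.join(
            '%d:%g' % (i + 1, c) for i, c in enumerate(cols) if c is not None)

    # e.g. a diabetic example with a scaled Age and a nominal Sex attribute:
    print(to_libsvm_line('diabetic',
                         [('Age', 0.3), ('Sex', 'F')],
                         {'Sex': ['F', 'M']}))   # -> "+1 1:0.3 2:0.5 3:-0.5"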
The LIBSVM software [3], [12] comes with additional tools, written as Python scripts, to scale the dataset as well as to perform a grid search for the values of C and the kernel parameters that give the best prediction accuracy. A 10-fold cross validation is then run with these optimized values to obtain the highest overall accuracy.

The modified differential dataset, DDmod, was reformatted for LIBSVM [12]. LIBSVM [3] requires that all attribute values in the dataset be real valued, i.e. all nominal attributes need to be converted to numeric attributes. An m-valued attribute is split into m different numeric attributes, and only one attribute has a value of 1 while the other m-1 attributes are 0. In the case of the dataset DDmod, nominal attributes like Race and Sex are split up into their values, but the attribute that is turned "on" is represented by the value 0.5 instead of 1 and the remaining attributes have the value -0.5 instead of 0. The attribute values of DDmod had already been scaled linearly, as discussed in Section 3.5.2, so the dataset did not require any further scaling. However, DDmod had to be modified to meet the following LIBSVM requirements:

- The class labels had to be changed to +1 (representing diabetic patients) and -1 (representing non-diabetic subjects).
- Each attribute value is preceded by the attribute serial number, with the two separated by a colon.
- The .names file is not required by LIBSVM, as the only attribute datatype is numeric.

The Radial Basis Function was used as the kernel, and a grid search was conducted to find the best combination of C and the RBF parameter gamma. The grid search optimized the parameters C and gamma for 10-fold cross validation by taking into account the entire training and test data in the dataset DDmod; hence the accuracy obtained is an overly optimistic accuracy for the SVM approach. The highest accuracy with 10-fold cross validation was obtained with a cost value of C = 70 and gamma = 0.00048828125; the accuracy was 79.747%. The best accuracy figures with the SVM thus did not improve over the best random forests accuracy, but they are better than the bagged ensemble of neural networks (77.21%) discussed in Section 5.2.2.
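A minimal sketch of the grid-search step, assuming the svmutil Python interface bundled with LIBSVM (when the '-v' option is given, svm_train returns the cross-validation accuracy instead of a model); the file name is hypothetical and the exponent ranges follow the defaults of LIBSVM's grid.py tool:

    from svmutil import svm_read_problem, svm_train

    y, x = svm_read_problem('ddmod.libsvm')   # hypothetical reformatted dataset

    best_c, best_g, best_acc = None, None, 0.0
    for log2c in range(-5, 16, 2):            # C = 2^-5 .. 2^15
        for log2g in range(-15, 4, 2):        # gamma = 2^-15 .. 2^3
            c, g = 2.0 ** log2c, 2.0 ** log2g
            # '-t 2' selects the RBF kernel; '-v 10' requests 10-fold CV
            acc = svm_train(y, x, '-t 2 -c %g -g %g -v 10' % (c, g))
            if acc > best_acc:
                best_c, best_g, best_acc = c, g, acc
    print('best C = %g, gamma = %g, CV accuracy = %.3f%%'
          % (best_c, best_g, best_acc))

A finer grid can then be searched around the best coarse point, which is how a value such as C = 70 (not a power of two) can arise; note that gamma = 0.00048828125 is exactly 2^-11.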
The confusion matrix obtained after a 10-fold cross validation using the LIBSVM tool with the best combination of parameters is shown in Table 5.8.

Table 5.8 Confusion matrix after 10-fold cross validation: using a support vector machine on the modified differential dataset

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             178            78
    Non-Diabetics          66           389

If the confusion matrix obtained from LIBSVM in Table 5.8 is compared with the confusion matrix from random forests using 100 attributes and 100 trees in Table 5.6, which yields the best overall accuracy on the modified differential dataset DDmod, it can be seen that the SVM yields a significantly greater number of True Positives, i.e. it predicts a greater number of diabetic examples correctly. As will be shown in Section 5.3.1, the F-measure value for the SVM results is also higher than the F-value for the random forests approach.

5.2.4 Experiments and Results Using SMOTE

The smoting implementation used is the smote tool developed at USF [5]. The implementation, however, does not have a method for generating synthetic examples belonging to the minority class in the presence of missing data values. In order to generate a synthetic example, the algorithm performs the following tasks:

- The distance between the features of the sample example and each of the remaining minority class examples is computed.
- The nearest neighbors are ranked according to this distance.
- Of the k nearest neighbors, the example in whose direction the synthetic example is to be generated is chosen randomly.
- The offset introduced in the synthetic example is the product of the difference between the features and a random number between 0 and 1.
- The direction of the offset is determined by the nearest neighbor, i.e. the synthetic example lies somewhere between the candidate and the nearest neighbor in the feature space.
However, for missing values it is not possible to calculate the difference between the features unless a value is assigned to the missing attribute in either the candidate or the nearest neighbor. The strategy that we adopted was (see the sketch after this discussion):

- If one of the attribute values was missing, either in the sample or in the nearest neighbor, then the difference between the examples for that feature was calculated as 10% of the product of the non-missing value and a random number between 0 and 1. The sign of the difference value does not matter in this case, as the sum of squares of the differences is computed to rank the neighbors.
- If both values were missing, the difference was taken to be zero.

These differences are computed to determine which minority class examples are the nearest neighbors of the candidate example. Once the nearest neighbor has been identified, the synthetic example is generated. The attribute values of the synthetic example, as mentioned earlier, are computed by adding an offset to the candidate's attribute value in the direction of the nearest neighbor; the offset is the product of a random number between 0 and 1 and the difference between the attribute values. If either the candidate or the nearest neighbor has a missing value, then the offset is computed as 10% of the product of the non-missing value and a random number between -1 and 1. Finally, if both the candidate attribute value and the nearest neighbor attribute value are missing, then the offset is zero and the synthetic attribute has the same value as that of the candidate.
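A minimal sketch of this modified per-attribute logic, with None standing for a missing value; it illustrates the rules above rather than reproducing the actual smote tool [5]. Where a rule needs a base value and the text leaves it implicit, the non-missing value is used, which is an assumption:

    import random

    def feature_difference(a, b, rng):
        """Difference used only for ranking nearest neighbors."""
        if a is None and b is None:
            return 0.0                            # both missing: difference is zero
        if a is None or b is None:
            present = a if a is not None else b
            # 10% of the non-missing value times a random number in [0, 1];
            # the sign is irrelevant since squared differences are summed
            return 0.1 * present * rng.random()
        return a - b

    def synthetic_attribute(cand, neigh, rng):
        """One attribute of the synthetic example."""
        if cand is None and neigh is None:
            return cand                           # offset is zero; value stays the candidate's
        if cand is None or neigh is None:
            present = cand if cand is not None else neigh
            # offset: 10% of the non-missing value times a random number in [-1, 1]
            return present + 0.1 * present * rng.uniform(-1.0, 1.0)
        return cand + rng.random() * (neigh - cand)  # usual SMOTE interpolation

    def synthetic_example(candidate, neighbor, seed=0):
        rng = random.Random(seed)
        return [synthetic_attribute(c, n, rng) for c, n in zip(candidate, neighbor)]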
In all the experiments performed, 10-fold cross validation was used to estimate the prediction accuracy of each learning technique. In 10-fold cross validation, the entire dataset is divided into 10 disjoint folds, where the examples in each fold are drawn independently and randomly from the original dataset. The experiments are run 10 times, each time with a different set of 9 training folds and the remaining fold as a test set. If the whole dataset were oversampled using SMOTE, the resulting accuracy would be an overestimate: the test examples belonging to the minority class would have also been oversampled, and some of the synthetic examples generated from them could have been in the training set used to build the classifier. To prevent such a bias from overestimating the accuracy, the original dataset was first split into the 10 disjoint folds for cross validation, and then the training folds were oversampled while the test fold was kept as is. The same process was repeated for each of the 10 rounds of the 10-fold cross validation.

Another aspect that was tested was whether to oversample (smote) the difference dataset, or to smote the medical test results and then compile the difference dataset for each round of the 10-fold cross validation process. In the first round of oversampling experiments, the differential dataset used with the neural networks was used and the 10 disjoint folds for 10-fold cross validation were built. A number of experiments were performed on the oversampled dataset to find the best prediction accuracy. Since random forests with a subset of 100 attributes and 100 trees give the best accuracy results on the modified differential dataset DDmod, ensembles of random forests were tested on the oversampled dataset. As discussed in Section 4.3.2, the prediction accuracy of random forests varies with the number of attributes randomly selected at each node for consideration as a possible test at that node, and the best accuracy occurs within a specific range of attribute subset sizes. To determine the subset size yielding the maximum accuracy, the random forests experiments were performed with different attribute subsets. The percentage of oversampling (smoting) was also varied, to observe how the accuracy figures vary with the number of synthetic examples in the training data. The sampling percentage determines the number of synthetic examples generated per minority class example in the smoted dataset; e.g. a smoting percentage of 200% implies that for every minority class example that is smoted, two synthetic examples are generated. The sampling percentage also determines how many nearest neighbor examples are considered, i.e. for 200% sampling, two nearest neighbors are randomly selected for every minority class example and one synthetic example is built along each of the two nearest neighbors' directions. The average accuracy obtained after 10-fold cross validation was 78.2% when random forests with lg(n) attributes (n being the total number of attributes), an ensemble of 100 trees, and a smoting sampling percentage of 100% were used. When the number of random attributes was increased to 50 in the previous combination (i.e. 100% oversampling and 100 trees), the accuracy made a slight jump to 78.45%.
The best overall accuracy, 79.89%, was obtained with a combination of 50 random forests attributes, an ensemble of 200 trees, and 100% oversampling of the differential dataset using SMOTE [5]. Table 5.9 shows the confusion matrix obtained from 10-fold cross validation using random forests with 100 attributes, 200 trees, and 100% oversampling.

Table 5.9 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 200 trees and 100% smoting of difference values

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             150           106
    Non-Diabetics          37           418

The accuracy figures did not show any marked improvement over the non-smoted results discussed in Subsections 5.2.1 and 5.2.2. Another interesting observation is that the number of True Positives does not increase when the training data is oversampled, which should have been the case, since the presence of synthetic diabetic examples should have shifted the bias of the individual predictors towards the minority class. Therefore, a different approach to oversampling the minority data in the training set was adopted, to test whether the prediction accuracy on diabetic examples could be improved.

In the previous set of experiments, the training examples oversampled with SMOTE [5] were the test difference examples in the dataset DDmod. In the following experiments a different approach to smoting was adopted: instead of smoting the differences in test results, the medical test results of the diabetic (minority class) patients were smoted. For this purpose a different dataset was built. To build the difference dataset, all the medical tests for each patient had been collected; for subjects diagnosed with diabetes, the tests up to the date of diagnosis were used (excepting the last tests before diagnosis), and for the remaining subjects the tests over the entire span of the medical data collected were used. From this collection of medical test data, a dataset was compiled for use with the machine learning techniques. The number of times a single test was repeated for a patient was computed, and the maximum determined the number of attributes associated with that test. For example, among all 711 patients a particular patient had the PEP90 test repeated a maximum of 20 times, and as a result the dataset had 20 attributes associated with PEP90, named PEP90_0 to PEP90_19. For patients having fewer than the maximum number of tests, the remaining attribute values were considered missing, i.e. if a patient had 10 PEP90 tests then the values of the attributes PEP90_10 to PEP90_19 were missing for this patient. The nominal attributes were not bit encoded, as the SMOTE implementation can handle nominal attributes for oversampling.
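A minimal sketch of this flattening of variable-length test sequences into fixed-width attribute columns; the data structures and names are hypothetical:

    def flatten_test(patients, test_name):
        """patients: list of dicts mapping a test name to its chronological
        list of results. Returns column names and one row per patient."""
        width = max(len(p.get(test_name, [])) for p in patients)
        header = ['%s_%d' % (test_name, i) for i in range(width)]
        rows = []
        for p in patients:
            results = p.get(test_name, [])
            # pad with None (missing) up to the maximum repeat count
            rows.append([results[i] if i < len(results) else None
                         for i in range(width)])
        return header, rows

    # e.g. flatten_test(patients, 'PEP90') yields the columns
    # PEP90_0 .. PEP90_19 when some patient repeated the PEP90 test 20 times.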
A 10-fold cross validation involved the following steps (a sketch of this loop follows the list):

1. The dataset DDtests was divided into 10 disjoint folds, with the examples in each fold randomly selected.
2. For each of the 10 rounds in the cross validation, the data comprising the 9 training folds were collected into a single temporary dataset.
3. This temporary training data was then smoted, i.e. the minority class (diabetic) examples were oversampled according to the sampling percentage.
4. The synthetic examples were combined with the actual examples to build a combined training set, Ttemp.
5. The differences between consecutive tests of the same test type for each patient were computed for both the dataset Ttemp and the test fold (the remaining 10th fold extracted in step 1 from the dataset DDtests), and the maximum number of differences for each test type was noted. The values of the nominal attributes, Race and Sex, were kept unchanged.
6. The differential training dataset DDtraintests was built using the difference values from Ttemp. The test set DDtesttests was built from the differences from the test fold. The names file was created from the names file of the dataset DDtests, with the number of attributes associated with each test reduced by 1, owing to the obvious fact that a set of n tests yields n-1 differences between consecutive values.
7. The differential dataset DDtraintests was then used to train the ensemble of predictors, and the ensemble was tested on the test set DDtesttests.
8. For the next round of cross validation, the steps were repeated from step 2.

In this set of experiments with oversampling, the medical test results dataset, DDtests, used in each of the 10 rounds of the 10-fold cross validation was not normalized.
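A minimal sketch of steps 1-8, with smote(), differences(), train() and evaluate() standing in for the actual tools; everything here is illustrative rather than the thesis code:

    import random

    def cv_with_smote(dd_tests, smote, differences, train, evaluate,
                      k=10, pct=1.0, seed=0):
        rng = random.Random(seed)
        data = dd_tests[:]
        rng.shuffle(data)
        folds = [data[i::k] for i in range(k)]      # step 1: 10 disjoint folds
        results = []
        for i in range(k):                          # step 2: one round per fold
            test_fold = folds[i]
            t_temp = [ex for j, f in enumerate(folds) if j != i for ex in f]
            minority = [ex for ex in t_temp if ex['class'] == 'diabetic']
            t_temp = t_temp + smote(minority, pct)  # steps 3-4: oversample training only
            # steps 5-6: differences are computed after smoting, for both sets
            train_diff = [differences(ex) for ex in t_temp]
            test_diff = [differences(ex) for ex in test_fold]
            model = train(train_diff)               # step 7
            results.append(evaluate(model, test_diff))
        return results                              # step 8: the loop continues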
Since the SMOTE implementation cannot handle missing values represented by "?", the missing values were replaced by the number -100000. The reason for choosing such a number was that no difference value was that large, i.e. no test result differed from the next result of the same test by a margin of 100000. As before, random forests with different subsets of attributes and different oversampling percentages were tested to find the combination producing the best accuracy figures. As expected, increasing the sampling percentage of SMOTE increases the number of diabetic examples in the training set (actual and synthetic combined), which shifts the bias of the learned classifier towards the diabetic examples, so the classifiers give better accuracy on diabetic test examples. However, the prediction accuracy on non-diabetic examples goes down significantly too, which results in the overall accuracy remaining almost unchanged. Table 5.10 shows the confusion matrix from a 10-fold cross validation with random forests, with the number of random attributes chosen to split each node determined by the value of lg(n), where n is the number of attributes. An ensemble of 100 trees was used with a smote sampling percentage of 100%. The overall prediction accuracy obtained was 76.65%.

Table 5.10 Confusion matrix after 10-fold cross validation: using random forests with lg(n) attributes, 100 trees and 100% smoting

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             193            63
    Non-Diabetics         103           352

In the next experiment, random forests with a subset of 50 attributes were used. As can be seen from the confusion matrix obtained after a 10-fold cross validation using 100 trees with 100% smoting, shown in Table 5.11, not only does the number of True Positives (correct diabetic predictions) increase, but the overall accuracy increases too, from 76.65% to 79.04%.

Table 5.11 Confusion matrix after 10-fold cross validation: using random forests with 50 attributes, 100 trees and 100% smoting

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             200            56
    Non-Diabetics          93           362
In view of the increase in prediction accuracy obtained by raising the attribute subset size to 50, the next experiment used a subset of 100 attributes. The confusion matrix from a 10-fold cross validation with random forests of 100 attributes and 100 trees with 100% smoting (i.e. one synthetic example generated per minority class example in the training set) is shown in Table 5.12. Although the average accuracy showed a marginal increase to 79.18%, the number of True Positives decreased significantly.

Table 5.12 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 100 trees and 100% smoting

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             189            67
    Non-Diabetics          81           374

To study whether increasing the number of decision trees in the ensemble would increase the prediction accuracy, experiments with random forests using 50 attributes and 100% oversampling were conducted with different ensemble sizes. Table 5.13 shows the results of a 10-fold cross validation with 150 trees, which gives an average accuracy of 79.32%, and Table 5.14 shows the results with 200 trees, with an accuracy of 79.75%. With 150 trees both the overall accuracy and the number of correct diabetic predictions increase, while with 200 trees the overall accuracy increased marginally but the number of True Positives decreased.

Table 5.13 Confusion matrix after 10-fold cross validation: using random forests with 50 attributes, 150 trees and 100% smoting

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             200            56
    Non-Diabetics          91           364

Table 5.14 Confusion matrix after 10-fold cross validation: using random forests with 50 attributes, 200 trees and 100% smoting

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             178            78
    Non-Diabetics          69           389
Another interesting aspect to test was the effect of varying the sampling percent, i.e. varying the number of synthetic examples used in the training set. From the previous experiments it became apparent that increasing the minority class examples by oversampling (generating new synthetic training data) shifts the classifier bias towards the minority class, resulting in greater accuracy in predicting the diabetic (minority class) patients. However, it was also noticed that the overall accuracy does not improve, owing to the increase in the number of False Positives, i.e. non-diabetic examples wrongly classified as diabetic. This prompted a study of whether lowering the sampling percentage would give an optimal amount of oversampling of minority class examples, such that the overall prediction accuracy would reach a maximum without generating too many False Positives. Table 5.15 shows the 10-fold cross validation results with 50% oversampling and random forests using 50 attributes and 150 trees; the overall accuracy in this case was 79.04%.

Table 5.15 Confusion matrix after 10-fold cross validation: using random forests with 50 attributes, 150 trees and 50% smoting

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             176            80
    Non-Diabetics          69           386

Table 5.16 shows the 10-fold cross validation results using 50% smoting, 100 attributes and 150 trees, with an average accuracy of 80.59%.

Table 5.16 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 150 trees and 50% smoting

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             181            75
    Non-Diabetics          63           392

This increase in overall accuracy was investigated further by increasing the sampling rate to 75%, to see if it would lead to further improvements in the overall accuracy figures. The confusion matrix after 10-fold cross validation using 75% smoting, 100 random forests attributes and 100 trees, shown in Table 5.17, gives an accuracy of 79.04%. Random forests with 75 attributes, 75% smoting and 150 trees yielded the best overall accuracy of 80.73%, which is marginally better than the best accuracy figure of 80.31% obtained using random forests without oversampling. The confusion matrix of this experiment is shown in Table 5.18.
Table 5.17 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 100 trees and 75% smoting

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             188            68
    Non-Diabetics          81           374

Table 5.18 Confusion matrix after 10-fold cross validation: using random forests with 75 attributes, 150 trees and 75% smoting

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             196            60
    Non-Diabetics          77           378

The sampling percentage was also increased beyond 100%, to 125% and 200%, to study the best accuracy that could be obtained on just the diabetic examples. The best accuracy obtained with 125% oversampling was 80.03%, using random forests with 100 attributes and 150 trees. Table 5.19 shows the confusion matrix from a 10-fold cross validation of this experiment.

Table 5.19 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 150 trees and 125% smoting

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             202            54
    Non-Diabetics          88           367

With a 200% oversampling rate, the diabetic examples (including the synthetic ones) in the training set outnumber the non-diabetic examples, so the resulting predictors should be biased towards the diabetic class and hence should produce a greater number of True Positives. The best overall accuracy obtained with 200% oversampled data was 75.95%, using random forests with subsets of 100 attributes and 100 trees. The confusion matrix from a 10-fold cross validation is shown in Table 5.20. As can be seen, the overall accuracy dips when the oversampling rate is increased to 200%; however, the number of True Positives increases significantly, from the 196 obtained with 75% smoting (which produced the best average accuracy) to 213.
Table 5.20 Confusion matrix after 10-fold cross validation: using random forests with 100 attributes, 100 trees and 200% smoting

    Classified as ->    Diabetics    Non-Diabetics
    Diabetics             213            43
    Non-Diabetics         128           327

Increasing the oversampling rate definitely improves the performance of the predictors on the minority class (diabetic) examples; however, the presence of synthetic data also increases the number of False Positives, and as a result the overall accuracy figure suffers. The more the oversampling rate is increased, the more pronounced the effect of the False Positives becomes, and the average accuracy dips significantly.

5.3 Analysis of Different Learning Techniques

Two different statistical measures, the F-measure and the ROC curve, have been used in this study to compare the performance of the different learning methods in predicting diabetes. The following two sections discuss the results of this analysis.

5.3.1 F-Measure

Before going into the results and F-values, let us discuss what the F-measure means and how it is computed from the prediction results of a classifier. The computation of the F-value for a particular predictor requires two values to be computed first:

1. Precision: the fraction of the examples predicted as relevant (here, diabetic) that actually were relevant. Mathematically,

   Precision = True Positives / (True Positives + False Positives)

2. Recall: the fraction of the actually relevant test examples that were correctly predicted. Mathematically,

   Recall = True Positives / (True Positives + False Negatives)

Once the precision and recall values have been computed, the F-value is calculated by the following formula:

   F-value = (2 x precision x recall) / (precision + recall)
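A minimal sketch of this computation, taking diabetics as the positive class; plugging in the SVM confusion matrix from Table 5.8 (TP = 178, FP = 66, FN = 78) reproduces the F-value of 0.712 listed for the SVM in Table 5.21:

    def f_value(tp, fp, fn):
        precision = tp / float(tp + fp)  # fraction of positive predictions that are correct
        recall = tp / float(tp + fn)     # fraction of actual positives that are found
        return 2 * precision * recall / (precision + recall)

    print(round(f_value(178, 66, 78), 3))   # -> 0.712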
The F-measure values were calculated from the confusion matrices generated by 10-fold cross validation for all the different learning techniques, in order to compare their prediction performance. The F-values for the different learning techniques are listed in Table 5.21.

Table 5.21 Summary of F-measure values for different learning techniques

    Dataset and Experiment           Learning Technique                                         F-value
    Differential Dataset             Single Decision Tree                                       0.5222
                                     Random Forests (50 attributes, 100 trees)                  0.4925
    Modified Differential Dataset    Single Neural Network                                      0.6089
                                     100% Bagging, 100 Neural Networks                          0.6532
                                     Random Forests (100 attributes, 100 trees)                 0.68
                                     SVM with optimized parameters                              0.712
    Oversampled Dataset using SMOTE  100% Smote, Random Forests (lg(n) attributes, 100 trees)   0.6993
                                     100% Smote, Random Forests (50 attributes, 100 trees)      0.6966
                                     100% Smote, Random Forests (100 attributes, 100 trees)     0.7186
                                     100% Smote, Random Forests (50 attributes, 150 trees)      0.789
                                     100% Smote, Random Forests (50 attributes, 200 trees)      0.7077
                                     50% Smote, Random Forests (100 attributes, 150 trees)      0.724
                                     50% Smote, Random Forests (50 attributes, 150 trees)       0.7063
                                     75% Smote, Random Forests (75 attributes, 150 trees)       0.741
                                     125% Smote, Random Forests (100 attributes, 150 trees)     0.74
                                     200% Smote, Random Forests (100 attributes, 100 trees)     0.713

As can be seen from Table 5.21, oversampling using SMOTE with a 100% sampling percentage and random forests with a subset of 50 attributes and 150 trees yields the best F-value for the differential data, 0.789, although this method did not yield the best overall accuracy, as shown by the confusion matrix in Table 5.13.
Another very interesting observation is that oversampling the training data using SMOTE does not produce any significant improvement in the overall accuracy of prediction, as can be seen by comparing the confusion matrices. However, the F-values for all the learning experiments using oversampled data are significantly higher than for the non-oversampled experiments. The reason is that oversampling the minority class examples in the training set shifts the bias of the classifier towards the minority class (here, the diabetic class) and thus yields a higher number of True Positives as well as a higher number of False Positives; the False Positives bring down the overall accuracy, but the high True Positive count improves the F-measure.

5.3.2 ROC Curves

The Receiver Operating Characteristic (ROC), or receiver operating curve [13], is a graphical method of comparing binary classification accuracy. It plots two statistical measures, the sensitivity and 1 - specificity, computed from the results of a binary classifier system as its discrimination threshold is varied. The sensitivity of a binary classifier is the proportion of all actually positive examples that are predicted positive, i.e. mathematically:

   Sensitivity = True Positives / (True Positives + False Negatives)

Specificity, on the other hand, is represented mathematically as:

   Specificity = True Negatives / (True Negatives + False Positives)

In other words, in the context of this study it is the proportion of the people who actually were non-diabetic that were predicted to be non-diabetic. A classifier that has a high specificity has a low Type I error. The ROC can equivalently be generated by plotting the fraction of True Positives against the fraction of False Positives. ROC curves are used not only for comparing classifiers but also in medical research, epidemiology and psychophysics. A completely random predictor generates a straight line at an angle of 45 degrees with the horizontal, from bottom left to top right, because as the threshold is varied its True Positive and False Positive rates increase equally. Classifiers with ROC curves above this straight line are better than a random classifier.
The statistic most commonly calculated from a ROC for comparing classifier performance is the Area Under the Curve (AUC).

The overall accuracy figures from 10-fold cross validation and the F-measure values computed for the different learning techniques clearly demonstrate that ensembles of neural networks and of decision trees perform significantly better than single classifiers, and that the representation of the differential data in the dataset DDmod gives better prediction accuracy than the representation in the dataset DD. However, the difference in performance between ensembles such as random forests on oversampled and non-oversampled data is not as evident from the overall accuracy figures or the F-measure values. So ROC curves were plotted for the random forests method using smoted and non-smoted data, and different combinations of random attributes and numbers of trees in the ensemble were tested to compare their ROC curves.

To build a ROC curve for an ensemble method like random forests or bagging, the technique used was to plot the True Positive percent against the False Positive percent while varying the decision threshold. In this case the decision threshold was the number, or percentage, of votes of the individual predictors comprising the ensemble needed to decide the final classification (class label) of a particular example. For this purpose, usfc4.5 was modified such that, for every test example (a vote-thresholding sketch is given below):

- The class prediction of each of the individual ensemble predictors is noted.
- If the number of votes for the diabetic class exceeds the chosen threshold, the example is classified as diabetic; e.g. if the threshold is 10% with 100 trees in the ensemble, then the example is classified as diabetic if the number of votes for diabetic is 10 or more.

The ROC curve shown in Figure 5.2 is a graphical comparison of the diabetic prediction accuracy, i.e. of how the True Positive rate varies with respect to the False Positive rate as the threshold value is varied. The threshold was varied in increments of 10% of the total number of trees in the ensemble, starting from 0 and going to 100; to get more points on the curve, increments of 5% were used when increasing the decision threshold from 70% to 100%.
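A minimal sketch of this vote thresholding, assuming the per-tree votes for the diabetic class have been recorded for every test example; it returns the (False Positive rate, True Positive rate) points that trace the curve:

    def roc_points(diabetic_votes, true_labels, n_trees, thresholds):
        """diabetic_votes[i]: votes for 'diabetic' on test example i."""
        pos = sum(1 for l in true_labels if l == 'diabetic')
        neg = len(true_labels) - pos
        points = []
        for t in thresholds:                  # e.g. 0, 10, ..., 70, 75, ..., 100
            cutoff = t / 100.0 * n_trees
            tp = sum(1 for v, l in zip(diabetic_votes, true_labels)
                     if v >= cutoff and l == 'diabetic')
            fp = sum(1 for v, l in zip(diabetic_votes, true_labels)
                     if v >= cutoff and l != 'diabetic')
            # x = 1 - specificity (FP rate), y = sensitivity (TP rate)
            points.append((fp / float(neg), tp / float(pos)))
        return sorted(points)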
As the threshold is increased, the number of True Positives (TP) decreases, and so does the number of False Positives (FP).

Figure 5.2 ROC curve for comparing the performance of random forests with and without oversampling

Since random forests with 100 attributes and 100 trees gave the best overall accuracy on the non-smoted dataset DDmod, this ensemble was tested against the results of the random forests method on smoted datasets with 75% oversampling, 75 attributes and 150 trees (which yields the best overall accuracy among the smoted experiments) and with 200% smoting, 100 random attributes and 100 trees (which yields the best True Positive count). Also shown in Figure 5.2 is the ROC curve for the random forests approach with 50 attributes and 150 trees trained on 100% oversampled data, as this particular method produced the best F-value. As can be seen from the ROC curves in Figure 5.2, no particular ensemble technique used with oversampling performs significantly better than the non-smoted results. Furthermore, no convex hull exists that would clearly distinguish whether oversampling performs better under certain conditions or not.
CHAPTER 6

SUMMARY AND DISCUSSION

From the results of the experiments conducted, it can be seen that ensemble approaches with decision trees (random forests [2] and bagging [1]) give comparable, and sometimes even better, juvenile diabetes prediction accuracy on our dataset than cascade correlation based neural networks. Decision tree approaches also have the potential advantage over neural networks of being fast and easy to build and to understand. However, it may also be noted that a bagged [1] ensemble of cascor networks is able to predict more diabetic subjects correctly than the decision tree based approaches, as can be seen from the confusion matrices discussed earlier. Further, a well tuned neural network of another type may well provide a higher ensemble accuracy.

Another interesting observation from the results is that the representation of the data has important implications in the current context and affects the accuracy of prediction to a great extent. In the first series of experiments, where the medical test records were used directly for diabetes prediction, the accuracy figures for decision trees improved by over 10%, from 88.8% to 99.85%, when the learning set was modified, i.e. when associated indicator attributes for test existence were added and the data values were normalized to between -0.5 and 0.5. Similarly, when the differential dataset was modified for the neural network approach, with indicator attributes and scaled data values, the prediction accuracy showed a significant increase even with ensembles of decision trees (random forests [2] and bagging [1]), from 66.8% to a maximum of 80.31% using random forests [2] with 100 attributes. Decision trees have built-in methods to deal with missing attributes, for example allowing examples to go down multiple branches with weights, as in our implementation. However, in this case a more explicit representation of missing values was very useful.
Moreover, it can also be inferred from the results that the absence of some medical tests reveals important information which helps to predict the occurrence of Type 1 diabetes in juvenile subjects with better accuracy. Such information needs to be embedded in the learning data, as was done here with the indicator attributes associated with each test result attribute, to improve prediction accuracy.
REFERENCES

[1] L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.

[2] L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.

[3] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[4] H. P. Chase, D. Cuthbertson, L. M. D., et al. First-phase insulin release during the intravenous glucose tolerance test as a risk factor for type 1 diabetes. Journal of Pediatrics, 138:244-249, 2001.

[5] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:341-378, 2002.

[6] G. F. Cooper. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Machine Learning, 1:203-224, 1997.

[7] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, September 1995.

[8] D. Dazzi, F. Taddei, A. Gavarini, E. Uggeri, R. Negro, and A. Pezzarossa. The control of blood glucose in the critical diabetic patient: a neuro-fuzzy method. Journal of Diabetes Complications, 15:80-87, Mar-Apr 2001.

[9] T. G. Dietterich. Ensemble methods in machine learning. Lecture Notes in Computer Science, 1857:1-15, 2000.

[10] A. K. El-Jabali. Neural network modeling and control of type 1 diabetes mellitus. Bioprocess and Biosystems Engineering, 27:75-79, April 2005.

[11] S. Eschrich. Learning from Less: A Distributed Method for Machine Learning. PhD thesis, Dept. of CSE, Univ. of South Florida, 2003.

[12] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training SVM. Journal of Machine Learning Research, 6:1889-1918, 2005.

[13] T. Fawcett. ROC graphs: Notes and practical considerations for researchers, 2004.

[14] C. J. Greenbaum, D. Cuthbertson, and J. P. Krischer. The Diabetes Prevention Trial of Type 1 Diabetes Study Group: type 1 diabetes manifested solely by 2-h oral glucose tolerance test criteria. Diabetes, 50:470-476, 2001.
[15] D. S. Group. The diabetes prevention trial of type 1 diabetes [abstract]. Diabetes, 43(suppl 1):159A, 1994.

[16] W. Hsu, M. L. Lee, B. Liu, and T. W. Ling. Exploration mining in diabetic subjects databases: Findings and conclusion. In 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.

[17] N. Japkowicz. The class imbalance problem: Significance and strategies. In International Conference on Artificial Intelligence: Special Track on Inductive Learning, Las Vegas, Nevada, 2000.

[18] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In Fourteenth International Conference on Machine Learning, Nashville, Tennessee, pages 179-186, 1997.

[19] C. Ling and C. Li. Data mining for direct marketing: problems and solutions. In Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York, NY, 1998. AAAI Press.

[20] J. Park and D. W. Edington. A sequential neural network model for diabetes prediction. Artificial Intelligence in Medicine, 23:277-293, November 2001.

[21] K. Patterson and W. Sandham. Neural network and neuro-fuzzy systems for improving diabetes therapy. In 20th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Hong Kong, China, 1998.

[22] F. Pociot, A. E. Karlsen, C. B. Pedersen, M. Aalund, and J. Nerup. Novel analytical methods applied to type 1 diabetes genome-scan data. The American Journal of Human Genetics, 74:647-660, April 2004.

[23] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., 1993.

[24] S. E. Fahlman and C. Lebiere. The cascade-correlation architecture. Advances in Neural Information Processing Systems, 2:524-532, 1990.

[25] M. Shanker. Using neural networks to predict the onset of diabetes mellitus. Journal of Chemical Information and Computer Sciences, 36:35-41, 1996.

[26] C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. Data Mining and Knowledge Discovery, 4-3:162-192, 2000.

[27] J. W. Smith, J. E. Everhart, W. C. Dickson, W. C. Knowler, and R. Johannes. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In 12th Annual Symposium on Computer Applications in Medical Care, pages 261-265, November 1988.

[28] M. Zorman, G. Masuda, P. Kokol, R. Yamamoto, and B. Stiglic. Mining diabetes database with decision trees and association rules. In 15th IEEE Symposium on Computer-Based Medical Systems, pages 134-139, 2002.