USF Libraries
USF Digital Collections

Semi-supervised self-learning on imbalanced data sets


Material Information

Title:
Semi-supervised self-learning on imbalanced data sets
Physical Description:
Book
Language:
English
Creator:
Korecki, John
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:

Subjects

Subjects / Keywords:
Machine learning
Ensembles
Bootstrapping
Self-training
Stratification
Dissertations, Academic -- Computer Science and Engineering -- Masters -- USF   ( lcsh )
Genre:
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: Semi-supervised self-learning algorithms have been shown to improve classifier accuracy under a variety of conditions. In this thesis, semi-supervised self-learning using ensembles of random forests and fuzzy c-means clustering similarity was applied to three data sets to show where improvement is possible over random forests alone. Two of the data sets are emulations of large simulations in which the data may be distributed. Additionally, the ratio of majority to minority class examples in the training set was altered to examine the effect of training set bias on performance when applying the semi-supervised algorithm.
Thesis:
Thesis (M.S.C.S.)--University of South Florida, 2010.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by John Korecki.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains X pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E14-SFE0003441
usfldc handle - e14.3441
System ID:
SFS0027756:00001




Full Text

Semi-Supervised Self-Learning on Imbalanced Data Sets

by

John Nicholas Korecki

A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science in Computer Science
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Lawrence O. Hall, Ph.D.
Dmitry Goldgof, Ph.D.
Steven Eschrich, Ph.D.

Date of Approval:
April 5, 2010

Keywords: machine learning, ensembles, bootstrapping, self-training, stratification

© Copyright 2010, John Nicholas Korecki

DEDICATION

To my advisor, Dr. Lawrence Hall, for all of his guidance and support.
To Larry Shoemaker, for his advice and suggestions.
To Dr. Casey Korecki, my sister, for her great example.
To my parents, Nicholas and Sheree, for always being there.
And to John and Doris Comiskey.

ACKNOWLEDGEMENTS

This thesis incorporates an earlier publication, "Semi-Supervised Learning on Large Complex Simulations" [1].

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER 1  INTRODUCTION AND RELATED WORK
  1.1  Related Work
    1.1.1  Semi-Supervised Learning
      1.1.1.1  Semi-Supervised Self-Learning
    1.1.2  Learning From Imbalanced Data
  1.2  Contributions
  1.3  Thesis Organization

CHAPTER 2  DATA SETS
  2.1  Bolt Data Set
  2.2  Can Data Set
  2.3  KDD Cup 1999 Data Set

CHAPTER 3  SEMI-SUPERVISED SELF-LEARNING FRAMEWORK
  3.1  General Framework
  3.2  Labeled Data Selection
  3.3  Unlabeled Data Selection
  3.4  Learning Model Selection
  3.5  Labeling Unlabeled Data
  3.6  Partitioning Large Data Sets
  3.7  Evaluating Model Performance

CHAPTER 4  EXPERIMENTS AND RESULTS
  4.1  Results on the Bolt Data Set
  4.2  Results on the Can Data Set
    4.2.1  Randomly Sampled Training Set
      4.2.1.1  Non-Identically Distributed Training Set
    4.2.2  Partitioning
      4.2.2.1  Non-Identically Distributed Training Set
  4.3  Results on the KDD Cup 1999 Data Set
    4.3.1  Closer Than Average Unlabeled Example Selection
    4.3.2  Near Average Unlabeled Example Selection
    4.3.3  Non-Identically Distributed Training Set
      4.3.3.1  Closer Than Average
      4.3.3.2  Near Average

CHAPTER 5  SUMMARY AND DISCUSSION
  5.1  Future Work

REFERENCES

LIST OF TABLES

Table 2.1   Partitioning characteristics for the bolt simulation data set.
Table 4.1   The entire bolt simulation data set is split across the training, semi-supervised, and test sets by time step (1-20).
Table 4.2   Bolt set predictions on the test set averaged over 20 trials.
Table 4.3   Average number of test set bolt nodes found for each time step.
Table 4.4   Average F-measure and class-weighted accuracy of 30 trials for different initial labeled set sizes on the can data set using a randomly sampled training set.
Table 4.5   Wilcoxon Signed-Rank test results on the can data set using random sampling.
Table 4.6   Average initial and final F-measure for 30 trials of the semi-supervised method using different amounts of the available labeled examples with 10% to 90% majority class for the can data set using stratified sampling.
Table 4.7   Wilcoxon Signed-Rank test results for N trials of the semi-supervised method with various numbers of labeled examples and between 10% and 90% majority class.
Table 4.8   Average F-measure and class-averaged accuracy for 30 runs using the can data set with partitioning.
Table 4.9   Partitioned can data set Wilcoxon Signed-Rank test results.
Table 4.10  Average F-measure for 30 trials of the semi-supervised method on the partitioned can data set using 0.015%, 0.03%, 0.05%, 0.07%, 0.09%, 0.1%, 0.11% and 0.13% of the available labeled examples with 10% to 90% majority class.
Table 4.11  Wilcoxon Signed-Rank test results for N trials of the semi-supervised method on the partitioned can data set with closer than average selection using 0.015% to 0.13% of the available labeled examples with between 10% and 90% majority class.
Table 4.12  Partitioned can data set results comparisons.
Table 4.13  Average performance for 30 supervised trials of a 200-tree random forest for relatively small samples of KDD data.
Table 4.14  Average performance for 30 trials of the semi-supervised method with closer than average selection on the KDD data set using 10, 17, 25 and 34 labeled examples.
Table 4.15  Significance of closer than average unlabeled example selection results on the KDD data set for 30 trials of the semi-supervised method using 10, 17, 25 and 34 labeled examples.
Table 4.16  Average performance for 30 trials of the semi-supervised method with near average selection on the KDD data set using 10, 17, 25 and 34 labeled examples.
Table 4.17  Significance of results for 30 trials of the semi-supervised method with near average selection on the KDD data set using 10, 17, 25 and 34 labeled examples.
Table 4.18  Average F-measure for 30 trials of the semi-supervised method with closer than average selection on the KDD data set using 10, 17, 25 and 34 labeled examples with 10% to 90% majority class.
Table 4.19  Wilcoxon Signed-Rank test results for N trials of the semi-supervised method with closer than average selection on the KDD data set using 10, 17, 25 and 34 labeled examples with between 10% and 90% majority class.
Table 4.20  Average F-measure for 30 trials of the semi-supervised method with near average selection on the KDD data set using 10, 17, 25 and 34 labeled examples with 10% to 90% majority class.
Table 4.21  Wilcoxon Signed-Rank test results for N trials of the semi-supervised method with near average selection on the KDD data set using 10, 17, 25 and 34 labeled examples with between 10% and 90% majority class.

LIST OF FIGURES

Figure 2.1   The five partitions of the bolt data set showing the casing and bolts.
Figure 2.2   The can data set at time step 1 and time step 14.
Figure 3.1   Flowchart of the general semi-supervised algorithm.
Figure 3.2   Visual explanation of the Generalized Minkowski distance's Cartesian join and Cartesian meet operators.
Figure 4.1   Average F-measure over 20 trials with the range for the bolt data set.
Figure 4.2   Visualization of the average change in F-measure for each sample rate and majority proportion for the non-partitioned can data set.
Figure 4.3   Visualization of the significance of the change in F-measure for each sample rate and majority proportion for near average selection.
Figure 4.4   True majority percentage and true example count for the partitioned can experiments.
Figure 4.5   Improvement due to the semi-supervised processes on the partitioned can data set as the true majority percentage of the initial training set increases.
Figure 4.6   Statistical significance due to the semi-supervised processes with the partitioned can data set as the true majority percentage of the initial training set increases.
Figure 4.7   Improvement (y-axis) due to the semi-supervised processes on the partitioned can data set as the size of the initial training set increases (x-axis).
Figure 4.8   Statistical significance due to the semi-supervised processes with the partitioned can data set as the size of the initial training set increases.
Figure 4.9   True majority percentage and true example count for the partitioned can experiments with improvement shown.
Figure 4.10  True majority percentage and true example count for the partitioned can experiments with statistical significance shown.
Figure 4.11  Visualization of the average change in F-measure for each sample rate and majority proportion for closer than average selection for the KDD data set.
Figure 4.12  Visualization of the significance of the change in F-measure for each sample rate and majority proportion for closer than average selection for the KDD data set.
Figure 4.13  Visualization of the average change in F-measure for each sample rate and majority proportion for near average selection for the KDD data set.
Figure 4.14  Visualization of the significance of the change in F-measure for each sample rate and majority proportion for near average selection for the KDD data set.

SEMI-SUPERVISED SELF-LEARNING ON IMBALANCED DATA SETS

John Nicholas Korecki

ABSTRACT

Semi-supervised self-learning algorithms have been shown to improve classifier accuracy under a variety of conditions. In this thesis, semi-supervised self-learning using ensembles of random forests and fuzzy c-means clustering similarity was applied to three data sets to show where improvement is possible over random forests alone. Two of the data sets are emulations of large simulations in which the data may be distributed. Additionally, the ratio of majority to minority class examples in the training set was altered to examine the effect of training set bias on performance when applying the semi-supervised algorithm.

CHAPTER 1
INTRODUCTION AND RELATED WORK

Semi-supervised self-learning algorithms have been shown to improve classification accuracy under a variety of conditions. In this thesis, semi-supervised self-learning using ensembles of random forests and fuzzy c-means clustering example similarity was applied to three complex and large data sets to show where improvement is possible over random forests alone.

Minimizing the amount of required labeled data can greatly reduce the cost of obtaining data, since obtaining initial labeling is typically a costly process. Generally, experts will manually designate salient examples in data sets as "interesting" according to personal, possibly subjective criteria. For large data sets, this process is both time-consuming and tedious for the domain expert. This process would be markedly sped up by analyzing salient examples and tentatively labeling some examples to be used as "truth" in further processing. In addition, the amount of labeled data required for accurate classification is difficult to determine prior to classifier generation. The goal of semi-supervised learning is to use existing labeled data in conjunction with unlabeled data to generate more accurate classifiers than using the labeled data alone.

Three different data sets with different levels of imbalance were used to study the effect of a semi-supervised self-learning algorithm on a variety of data sets. The first two, the bolt and can data sets, represent the nodal characteristics of two physical simulations. The third data set comes from the KDD Cup 1999 competition. The examples in this data set represent LAN network activity. In the bolt and KDD data sets, the classes are imbalanced in size. In the can data set, the ratio of class examples is about even, with a class ratio of 1.08:1.

1.1  Related Work

1.1.1  Semi-Supervised Learning

Semi-supervised algorithms are those that leverage both labeled and unlabeled data examples. They aim to combine unsupervised learning, which utilizes no data labels, and supervised learning, which utilizes data for which labels are present. The algorithm generally does this using a smoothness assumption, which states that if two examples are relatively close in feature space, then their corresponding class outputs should be close in class space [2]. If one can assume that examples which are close in feature space are from the same class, then fewer labeled examples from each class (assuming a homogeneous data distribution) should be needed to determine classifications for large amounts of data.

Semi-supervised learning algorithms can be generative, discriminative, or a combination of both. Generative models attempt to model the joint probabilities of examples and their labels. Once this joint probability is modeled, one can generate new examples for a particular class, as well as determine the most likely class for a given example. Discriminative models restrict themselves to determining the most likely class for a given example by estimating the probability of each class given the data example. Discriminative models do not model the classes, so generation of new class examples is difficult. Generative models can be thought of as modeling the method by which (example, label) pairs are generated, while discriminative models attempt to identify the boundaries between classes [2]. Modeling a joint distribution of data and class, or determining the boundaries between the data of different classes, is a very data-dependent exercise; it has been found that most attempts at using semi-supervised classification shift the responsibility of a domain expert from labeling data to choosing the best assumptions about the data [3].

One commonly used generative semi-supervised algorithm is Expectation Maximization with generative mixture models, which works well on data with well-grouped and separated classes. It essentially models the example distribution and uses what labels exist to label the modeled distributions. Other discriminative techniques include Co-Training, which splits the feature space to create multiple classifiers.

Each classifier is built on conditionally class-independent feature sets from labeled examples, and supplies labels for additional unlabeled training examples. Transductive Support Vector Machines (TSVMs) are built on the Support Vector Machine, which attempts to find a low-density decision boundary with the largest margin between labeled examples, by ensuring the margin separates the unlabeled examples as well. This is done by finding a labeling of the unlabeled examples which maximizes the margin found. However, this is an NP-hard problem [3]. Another class of semi-supervised algorithms, self-training, is essentially a class of wrapper methods which leverage existing supervised techniques, and is discussed further in Section 1.1.1.1.

Although semi-supervised approaches appear to have much promise, they are of course not always successful. As an example, the accuracy of Hidden Markov Models in lexical analysis can be reduced through select semi-supervised learning algorithms [4, 5].

1.1.1.1  Semi-Supervised Self-Learning

In a subset of semi-supervised learning referred to as self-training or self-learning, a classifier is iteratively built on its own predictions. First, a classifier is built on the labeled data and used to classify unlabeled data. Typically the most confidently predicted examples are then iteratively inserted into the training set and a new classifier generated.

Semi-supervised self-learning methods were one of the first uses of semi-supervised learning. However, the method relies heavily on the underlying classifier [2]. Despite this, they have found much success, are much less complex, and are lauded for ease of use compared to other semi-supervised methods. In [6], self-training was applied to the problem of deciphering context in written words, such as whether "crane" is referring to a bird or a machine. This application is particularly fitting because words around the target word provide a strong sense as to its meaning. Therefore, higher accuracy is obtained by incrementally adding (also called tagging) words which are found close to the target word. In most cases, accuracy using only two manually tagged words was on par with that of using a fully labeled training set.

In [7], semi-supervised self-training was applied to time series classification. The procedure starts with a set of labeled positive examples P and a set of unlabeled examples U. By considering the examples in U to be negative, a one-nearest-neighbor classifier C was built. Using classifier C, all examples in U are classified. The most confidently classified positive example is then added to P and removed from U. This procedure continues until all examples from U are consumed, since no stopping criterion had been developed.

This approach appeared to provide a good increase in the precision-recall break-even point with only a small number of initially labeled examples. However, if too many examples were added, the break-even point could drop dramatically. The precision-recall break-even point is a well documented text classification measure which evaluates a classifier by altering prediction confidence levels to find the point where precision is equal to recall. Informally, recall is the percentage of all positive examples found by a classifier, and precision is the percentage of found examples that are positive. Precision and recall are presented formally in Section 3.7. By varying confidence levels, a curve describing the trade-off between precision and recall can be generated. On this curve, precision will equal recall when the number of predictions that are made is equal to the number of positive examples present. This point, called the precision-recall break-even point, can be used to compare two precision-recall curves.
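To make the definition concrete, a small sketch (illustrative only, not code from [7]) can sweep the confidence threshold and report the point where precision and recall meet:

    import numpy as np

    def breakeven_point(scores, labels):
        # Take the top-k most confident predictions as positive and
        # sweep k; precision equals recall exactly when k equals the
        # number of true positives present. Toy illustration only.
        order = np.argsort(-np.asarray(scores))   # most confident first
        labels = np.asarray(labels)[order]
        n_pos = labels.sum()
        best_gap, best_prec = float("inf"), 0.0
        for k in range(1, len(labels) + 1):
            tp = labels[:k].sum()
            precision, recall = tp / k, tp / n_pos
            if abs(precision - recall) < best_gap:
                best_gap, best_prec = abs(precision - recall), precision
        return best_prec

    # 3 positives among 5 examples; the break-even occurs at k = 3
    print(breakeven_point([0.9, 0.8, 0.7, 0.6, 0.4], [1, 1, 0, 1, 0]))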

1.1.2  Learning From Imbalanced Data

Data imbalance occurs when one class has a greater representation in the available data than another class, typically in a significant manner. This raises many issues. First, since most classifiers attempt to address the global error rate across all classes instead of the individual error of each class, classifying every example as majority will provide a better than random error rate. Second, since many classifiers make the assumption that a training and testing set are independent and identically distributed, the class imbalance which differs between training and testing set needs to be corrected, or the model will be built assuming the data distribution exhibited by the training set [8, 9]. Further, in some instances a training set with the actual class distribution exhibited by the data does not result in a classifier that performs best on test sets with the actual class distribution [10].

Under-sampling is used in this thesis to explore data imbalance issues. Under-sampling to make the class distribution uniform is equivalent to violating the assumption of independent and identically distributed training and testing sets, or in other words, artificially creating missing data in the training set. Missing data considered missing at random (MAR) can occur in two different ways, missing completely at random (MCAR) or missing not at random (MNAR) [11]. Stratified sampling conditioned on class causes data to be MNAR, and has the effect of adding bias to the model that results from non-randomly selected examples to estimate the target hypothesis [9]. Since the act of causing data to be missing has a pattern, the model attempts to fit both the target hypothesis from which the data has been generated and the pattern of the missing data. However, semi-supervised learning can help increase the performance more than supervised learning alone when the training set distribution does not reflect the population distribution [12]. In [8], a combination of under-sampling and then synthetic minority over-sampling (SMOTE) was performed to find the optimal combination to learn on imbalanced data.
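As a minimal sketch of the class-conditioned (stratified) sampling used throughout the experiments, the following hypothetical helper draws a training set with a fixed majority proportion; it is an illustration under stated assumptions, not the thesis code:

    import numpy as np

    def stratified_sample(X, y, n, majority_frac, majority_label=1, seed=0):
        # Draw n examples with a fixed fraction of the majority class,
        # e.g. majority_frac=0.9 for a 90% majority training set.
        # Because selection is conditioned on class, the resulting
        # "missingness" is MNAR, as discussed above.
        rng = np.random.default_rng(seed)
        maj = np.flatnonzero(y == majority_label)
        mino = np.flatnonzero(y != majority_label)
        n_maj = int(round(n * majority_frac))
        idx = np.concatenate([rng.choice(maj, n_maj, replace=False),
                              rng.choice(mino, n - n_maj, replace=False)])
        rng.shuffle(idx)
        return X[idx], y[idx]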

1.2  Contributions

First, we will utilize a different measure of similarity. Typically semi-supervised self-learning algorithms use classifier confidence to select new examples. Rather than utilize the typical confidence-based approach, a clustering-based approach is utilized to select examples close in feature space to clusters of labeled data. This is similar to the approach of TSVMs and of graph-based methods, where well-separated examples are assumed to be different classes while examples close in feature space are assumed to be of a similar class [3]. Using distance rather than confidence is motivated by [13], where it was found that a selection metric based on distance outperformed a confidence-based approach. Further, using feature space distance rather than classifier confidence draws on some of the power of generative models, while the classifier confidence measures the confidence in class discrimination.

Also considered are cases of learning from disjoint data. Complex simulations or experiments can generate very large amounts of data stored in a disjoint manner across many local disks. Semi-supervised learning is designed to take advantage of unlabeled data, and is particularly relevant when the quantity of unlabeled data is large. However, most of the research has focused on small data sets [3]. The scalability of such semi-supervised learning algorithms has largely been left unaddressed. In the case of a large complex simulation, data is stored on disks attached to compute nodes according to its spatial location within a 3D simulation [14]. One approach is to vote predictions from classifiers built on disjoint data "bites" [15]. In this thesis, a semi-supervised self-learning algorithm is applied to disjoint data and the results voted together to evaluate the utility of these consensus semi-supervised predictions.

Another contribution is an examination of the effect of the training set class distribution on test set accuracy. Studies have shown that on data with examples MNAR, semi-supervised techniques can help to improve performance over supervised learning [11]. However, the effect of training sets with different levels of class imbalance was not explored.

Based on a review of the literature, it appears that not much focus has been put on self-learning because of the difficulties in understanding the effect of self-learning on the chosen supervised learning method [2]. No literature on semi-supervised self-learning in the context of distributed learning or data imbalance was found, so to the best of our knowledge this is the first analysis of the effect of semi-supervised self-learning on imbalanced data.

1.3  Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 describes the characteristics of the data sets used in this thesis. Chapter 3 discusses the semi-supervised self-learning framework generally, while Chapter 4 discusses the considerations, implementation, and results for each data set. Finally, Chapter 5 contains the conclusion.

CHAPTER 2
DATA SETS

Three data sets were used to test the framework to be presented in Chapter 3. Two data sets, bolt and can, contain examples describing the nodal characteristics of physically simulated objects. The third data set, KDD, comes from the KDD Cup 1999 contest and contains examples representing network intrusion data.

2.1  Bolt Data Set

The bolt data set contains data describing a physical simulation of a casing dropping onto the ground. The casing is composed of four main sections, including the body tube and tail section, which are joined by the coupler through a series of ten bolts. The casing is dropped from a short height and its tail section impacts the ground at an angle, simulating the stresses across the entire device as might be encountered were it to be dropped from some height. More information about the casing data set is available in [16], where it was used for ordering regions predicted salient. The nodes comprising the data set are described by a total of 21 continuous features describing the physical state of each node of the simulation.

The goal is to discover which nodes in the simulation belong to bolts. When dropped at an angle on the tail, one group of bolts will experience a tensile force, while the other group of bolts will experience a compressive force. Each will also be subject to shear forces. These forces occur in many other sections of the casing as well. The physical characteristics of the individual nodes modeling the bolts are not substantially different from those modeling the rest of the casing. In other words, there is no a priori reason to assume the existence of some underlying feature of "boltness" which would make this an easy problem. Each feature describing physical force on the nodes comprising the canister at each time step was normalized uniformly, since feature space distance was used for determining example similarity.

Figure 2.1. The five partitions of the bolt data set showing the casing and bolts. An unimpeded view of the bolts is shown on the right for clarity.

The data for each time step is divided spatially according to the compute node to which it is assigned. The bolt data set was only considered in a partitioned fashion. Figure 2.1 shows the partitioning of the actual simulation, and an unimpeded view of how the bolts are distributed. There also exists a data set imbalance problem, as those partitions of data sets with four bolts are much larger than the data set with only two bolts. The partitioning is performed lengthwise in five pieces across the cylindrical body so as to distribute the bolts across compute nodes. The data is purposefully partitioned so that two of the partitions have never seen any node from a bolt. This creates two one-class classifiers which must be carefully dealt with by the voting algorithm during classification.

The data set contains 1,569,813 examples evenly divided into 21 time steps. Of the 74,753 examples in each time step, 69,073 are unsalient, non-bolt examples. Each of the 10 bolts in each time step consists of 568 examples, and in total each time step contains 5,680 salient, bolt examples. Data set attributes include the motion variables of displacement, velocity, and acceleration, as well as several interaction variables, which are the contact force, total internal force, total external force, and the reaction force.

Table 2.1. Partitioning characteristics for the bolt simulation data set. Node counts are per time step.

                          non-bolt   bolt    total
# nodes in partition 1    19,458     2,272   21,730
# nodes in partition 2    10,297     0       10,297
# nodes in partition 3    6,661      1,136   7,797
# nodes in partition 4    11,379     0       11,379
# nodes in partition 5    21,278     2,272   23,550

The simulations are recorded as a time sequence of discrete snapshots of the 3D scene. In this thesis, these "time steps" are considered as logical building blocks for incremental learning in a semi-supervised fashion.

2.2  Can Data Set

The can data set is also a simulation recorded as a time sequence of discrete snapshots of a 3D scene. The can data set represents the nodes of a canister impacting a stationary block, which crushes the can. The definition of the class value 'salient' is less well defined in this context, and is marked by an expert as the "crush zone". Two time steps from the can data set are shown in Figure 2.2.

The can data set and the bolt data set have several important differences. There is not a large change in the structure of the casing described by the bolt data as the simulation runs through time, with the change in the structure occurring mostly at the end of the simulation. Since the structural changes are more subtle, the deformation of the bolt simulation turns out to be more difficult than the canister simulation to accurately mark. Likewise, since the deformation of the can is more pronounced, the labeling is more intensive and more prone to noise than simply marking the salient 'bolt' regions of the bolt data set.

Also, the can data set consists of 4 distinct can-crush simulations, each with a different initial velocity and number of time steps [17]. For each node comprising the can, only six features describing displacement and velocity in each dimension were extracted. Features describing acceleration were available for only one of the 4 simulations, so were omitted to combine the simulations.

Figure 2.2. The can data set at time step 1 and time step 14.

The 4 simulations were combined into one large data set. Time steps still provide a framework for example selection, but instead of selecting all salient regions in a particular step, as was done for bolt, only a sample of salient nodes was selected from three chosen time steps from the first simulation. The task of learning "salient" is made harder with this experimental design, since only one of the four simulations contributes labeled examples, while all four simulations contribute unlabeled examples. In the entire combined data set, there are 423,613 unsalient nodes and 457,231 salient nodes. Of the three data sets explored, this data set has the most balanced class distribution.

For can, as with bolt, each feature was normalized uniformly, since feature space Euclidean distance was used for determining example similarity.

For the partitioned experiments, only 4 vertical spatial partitions in the data set were used. At the start of each of the 4 simulations, the can is in the same spatial position. While in this position, the vertical partitioning of each simulation was performed in an identical manner, with 1640, 1886, 1886, and 1312 nodes in each partition per simulation [18]. Due to the relative ratio of salient to unsalient nodes in each partition in each simulation, the combined effect is 4 partitions each with about equal numbers of salient and unsalient examples.

2.3  KDD Cup 1999 Data Set

As an additional test, we considered the 1999 KDD Cup data set [19]. The data set contains information collected in a simulated LAN environment consisting of normal traffic with a relatively small number of intrusion attempts. The original data set consisted of 23 classes, of which normal traffic was only one. The original intent was discrimination between the 22 attack classes, which each have an associated cost. In our experiments, the data set was parsed down to a total of 2 classes, 'normal' and 'attack'. After parsing, the KDD data has 972,781 minority 'attack' class examples and 3,925,650 majority 'normal' class examples, which is approximately 80.14% majority examples.

Additionally, the KDD data has many nominal features. Since examples are added to the semi-supervised self-learner's training set at each iteration based on feature space distance, a generalized Minkowski metric which utilizes the Cartesian space model was used to determine distance between examples [20].

CHAPTER 3
SEMI-SUPERVISED SELF-LEARNING FRAMEWORK

3.1  General Framework

Algorithm 1  Semi-Supervised Learner
 1: Training set T (labeled examples)
 2: Semi-supervised set S (unlabeled examples to add)
 3: while |S| > 0 do
 4:   s ← x% of S
 5:   Learn model M from T
 6:   for all s do
 7:     Predict class c for s using M
 8:     if s is similar to all T in c then
 9:       Add s to T as class c
10:     end if
11:   end for
12: end while

The semi-supervised self-training algorithm is given a labeled subset T of the total data set. A model Forest_i is created from the data. The model is then used to predict a small number (x%) of unlabeled examples s drawn from the total data set S. In the case of the bolt and can data sets, the example predictions are then smoothed spatially. This is done by transforming each example's class value to either 0 or 1, and then averaging class values within a specified radius. Next, the class values are binarized using the Otsu thresholding algorithm [21]. This is done since it is known that salient class examples occur spatially close to each other. The training data T are then separated by class c, and then each set is clustered in feature space.

A subset of unlabeled candidate data is chosen for addition to the training data using a threshold based on the average Euclidean distance to the closest centroid of the same class, and is assigned the predicted label. Learning then starts again from the beginning, this time learning from the original labeled training set augmented by the newly labeled examples. This process continues until all provided unlabeled data has been processed. Figure 3.1 shows this flow graphically.
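As a compact restatement of Algorithm 1, the following Python sketch wires the pieces together. The four callables are hypothetical stand-ins for the components described in Sections 3.4 and 3.5 (the random forest ensemble, its prediction, per-class clustering, and the distance threshold); this is an illustrative skeleton, not the thesis implementation:

    def self_train(labeled, unlabeled, train_forest, predict,
                   centroids_by_class, is_close_enough, sample_frac=0.1):
        # labeled: list of (features, label); unlabeled: list of features.
        # Each pass trains a model, labels a batch of unlabeled examples,
        # and keeps only those close in feature space to the centroids of
        # their predicted class (Algorithm 1).
        T = list(labeled)
        S = list(unlabeled)
        while S:
            model = train_forest(T)                  # Section 3.4
            k = max(1, int(sample_frac * len(S)))    # draw x% of S
            batch, S = S[:k], S[k:]
            centroids = centroids_by_class(T)        # Section 3.5
            for s in batch:
                c = predict(model, s)
                if is_close_enough(s, centroids[c]):
                    T.append((s, c))                 # tentatively trust label
        return train_forest(T)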

Figure 3.1. Flowchart of the general semi-supervised algorithm.

3.2  Labeled Data Selection

In the can and bolt data sets, the time steps of the simulation provide a natural choice for selecting examples for inclusion in the training set. For the bolt data set, salient examples were specific bolts from specific time steps. For unsalient examples, a random selection of examples from the casing was selected. For the can data set, random examples, both salient and unsalient, from specific time steps were used as a training set.

For the KDD data set, a random sample selection of attack and non-attack examples was used for selecting the initial training set.

Additionally, for the can and KDD data sets, initial training sets were selected using stratified sampling, as discussed in Section 1.1.2. This was done to explore the effect of initial class distribution on classifier performance.

3.3  Unlabeled Data Selection

For selecting examples to include in the training set at each iteration, a ratio of 1:9 of labeled to unlabeled data was used. While the empirically best ratio was determined to be very domain dependent, this ratio was found to be useful in more than one case [11]. Since the goal of semi-supervised learning is to improve when using a small amount of labeled data with a much greater amount of unlabeled data, this ratio seems appropriate. In all cases, examples were drawn uniformly at random from the available unlabeled examples. In the case of the bolt and can data sets, any example not selected for training or testing was a candidate for unlabeled sample selection. For the KDD data set, examples were drawn from an accompanying set of unlabeled examples.

3.4  Learning Model Selection

Random forests were chosen as the base classifier because they have been shown to be comparable to other ensemble creation techniques and because they provide good overall classifier performance [22]. The free open source software package "OpenDT" [23] was used to generate random forest ensembles of 200 trees [24]. Each of the trees was built to leaf purity or until no additional splits could be made.
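The thesis used OpenDT for its ensembles; as a rough, hedged stand-in (an assumption of this sketch, not the software actually used), the same configuration can be approximated with scikit-learn:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 6))             # toy stand-in for node features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy stand-in for class labels

    # 200 trees, each grown to leaf purity or until no further split is
    # possible (approximated here by max_depth=None, min_samples_leaf=1)
    forest = RandomForestClassifier(n_estimators=200, max_depth=None,
                                    min_samples_leaf=1, random_state=0)
    forest.fit(X, y)
    print(forest.predict(rng.normal(size=(3, 6))))  # majority vote of the trees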

3.5  Labeling Unlabeled Data

Instead of using classifier confidence (or in this case ensemble vote) to determine if an unlabeled example should be included, feature space distance to labeled examples was used. Clustering was performed using single-pass fuzzy c-means [25]. In all cases 25 centroids were used. Although the Xie-Beni index was explored as a means of identifying the optimal number of centroids for a given set of labeled data, it was found that 25 centroids often outperformed the optimal number identified by Xie-Beni [26]. This may have occurred due to coincident centroids causing an artificial boost in the number of selected unlabeled examples, but this was not explored further.
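For reference, the standard batch fuzzy c-means update is sketched below. The thesis used a single-pass variant [25] and, for KDD, a modified distance (described next), so this plain Euclidean version is an illustrative assumption only:

    import numpy as np

    def fuzzy_c_means(X, c=25, m=2.0, iters=100, seed=0):
        # Standard batch fuzzy c-means with Euclidean distance.
        # Returns (centroids, memberships); illustrative sketch only.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=c, replace=False)]
        for _ in range(iters):
            # squared distance of every example to every centroid, (n, c)
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1) + 1e-12
            # membership u_ik proportional to d2_ik^(-1/(m-1))
            u = d2 ** (-1.0 / (m - 1))
            u /= u.sum(axis=1, keepdims=True)
            # centroids are the u^m-weighted means of the examples
            w = u ** m
            centroids = (w.T @ X) / w.sum(axis=0)[:, None]
        return centroids, u

    # usage: centroids for one class's labeled examples
    X = np.random.default_rng(1).normal(size=(500, 6))
    centroids, _ = fuzzy_c_means(X, c=25)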

As mentioned in Section 2.3, the KDD data set contains both continuous and nominal valued attributes. To determine a distance between these examples, the single-pass fuzzy c-means code was modified to use the generalized Minkowski metric for mixed features, with p = 2 [20]. The generalized Minkowski metric is defined as

    d_p(A, B) = [ Σ_{k=1..d} ( c_k · δ(A_k, B_k) )^p ]^(1/p)

where d is the number of features, c_k > 0 is the weight of the impact of each feature on the overall distance measurement, and Σ_k c_k = 1. This means that 0 ≤ d_p ≤ 1. The term δ is defined as

    δ(A_k, B_k) = φ(A_k, B_k) / |U_k|

where |U_k| is the size of the k-th feature's values; for continuous and ordinal features it is the range, which is estimated from all observed feature values, while for nominal features it is the number of different feature values. The function φ is defined as

    φ(A_k, B_k) = |A_k ⊕ B_k| − |A_k ⊗ B_k|

which introduces two new operators, the Cartesian join ⊕ and the Cartesian meet ⊗, which operate differently based on the type of feature. For nominal features,

    A_k ⊕ B_k = A_k ∪ B_k

so that the ⊕ operator grows the set of included items. The |A_k ⊕ B_k| is simply the number of unique items in the union of A_k and B_k. For continuous and ordinal features,

    A_k ⊕ B_k = [ min(A_kL, B_kL), max(A_kU, B_kU) ]

where A_kU is the upper bound of the interval described by A_k and A_kL is the lower bound. Of course, if A_k is not an interval but instead a single point value, then A_kL = A_kU = A_k. In the case where A_k and B_k are two non-equal point values, the result is an interval, whose size is given by |A_k ⊕ B_k|. The Cartesian meet operator ⊗ is defined as

    A_k ⊗ B_k = A_k ∩ B_k.

For nominal features, this is the intersection of the values in A_k and B_k. For continuous and ordinal features, A_k ⊗ B_k will represent the overlap between the intervals A_k and B_k. If A_k and B_k are equal single point values, then A_k ⊗ B_k = A_k = B_k; if A_k and B_k are unequal point values then A_k ⊗ B_k = ∅. Figure 3.2 visually shows the result of the Cartesian meet and Cartesian join operators on intervals and point values.

The data sets explored in this thesis only contain examples with single-valued nominal and continuous features. Since intervals and sets with more than one item are never compared, computation becomes much simpler. For nominal features, φ(A_k, B_k) will either be 0 if the items are identical or 2 if the items are not. For continuous or ordinal features, φ(A_k, B_k) represents the distance between the two continuous values. Once divided by |U_k|, the value of δ(A_k, B_k) will represent the proportion of the entire feature-value range by which the two feature values differ. This metric allows continuous and nominal feature value distances to be combined into a single metric measuring similarity between examples.
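Under the single-valued simplification just described, the metric reduces to a few lines. The sketch below assumes uniform weights c_k = 1/d (the thesis does not state its weights) and is illustrative only:

    def minkowski_mixed(a, b, sizes, nominal, p=2):
        # Generalized Minkowski distance for single-valued mixed features.
        # sizes[k] = |U_k|: the observed range for a continuous/ordinal
        # feature, or the number of distinct values for a nominal feature.
        # Uniform weights c_k = 1/d are an assumption of this sketch.
        d = len(a)
        total = 0.0
        for k in range(d):
            if nominal[k]:
                phi = 0.0 if a[k] == b[k] else 2.0  # |join| - |meet|
            else:
                phi = abs(a[k] - b[k])              # join minus meet = |a - b|
            delta = phi / sizes[k]                  # proportion of |U_k|
            total += ((1.0 / d) * delta) ** p       # (c_k * delta)^p
        return total ** (1.0 / p)

    # example: two mixed-type feature vectors (illustrative values)
    a = [0.2, "tcp", 5.0]
    b = [0.8, "udp", 1.0]
    print(minkowski_mixed(a, b, sizes=[1.0, 3, 10.0],
                          nominal=[False, True, False]))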

Figure 3.2. Visual explanation of the Generalized Minkowski distance's Cartesian join and Cartesian meet operators. The Cartesian join A ⊕ B of interval A and interval B is an interval encompassing both A and B. The Cartesian meet of A and B is the empty set since there is no overlap. Intervals C and D do overlap, so the Cartesian meet C ⊗ D is defined as the overlapping interval. For point values E and F, the Cartesian join E ⊕ F is the interval whose endpoints are E and F. Again, E ⊗ F is the empty set since E and F do not overlap. Finally, for point values G and H, the Cartesian join and meet both result in a point value equal to both G and H. Note that for point values, the magnitude of the Cartesian join minus the magnitude of the Cartesian meet is equal to the absolute value of the difference between values, while for intervals it provides a measure of the difference or similarity between intervals.

Once centroids were identified from the labeled data for both salient and unsalient examples, unlabeled data selection could be performed. For all data sets, unlabeled examples were first classified as either salient or unsalient. The predicted salient examples were then assigned to the nearest salient centroid, while predicted unsalient examples were assigned to the nearest unsalient centroid. Next, for each centroid, the average distance of all unlabeled examples assigned to that centroid was computed, and examples that were closer than the average distance were added to that centroid along with the corresponding label. A variation of the above was explored on the KDD data set, in which only examples that were around the average distance were added as labeled data.

3.6  Partitioning Large Data Sets

The effect of feature space partitioning was also explored for the bolt and can data sets, since both could naturally be partitioned on spatial features. The change to the overall algorithm is minimal. Instead of building one ensemble based on all available labeled data, one ensemble is built per partition. Unlabeled examples are randomly selected and a consensus vote of all partitions is obtained through majority voting [17]. Once labeled, similarity thresholding occurs in the partition to which the examples belong, based on spatial features. This form of partitioning has the added effect of distributing the clustering and ensemble generation, while still allowing all ensembles generated from all partitions to participate in unlabeled example classification.
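A minimal sketch of the partitioned variant follows; scikit-learn forests stand in for the OpenDT ensembles (an assumption), and binary 0/1 classes are assumed:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def train_per_partition(partitions):
        # Train one 200-tree forest per spatial partition.
        # partitions: list of (X, y) arrays, one pair per partition.
        return [RandomForestClassifier(n_estimators=200,
                                       random_state=0).fit(X, y)
                for X, y in partitions]

    def consensus_vote(models, X):
        # Majority vote of the per-partition ensembles on examples X.
        votes = np.stack([m.predict(X) for m in models])  # (n_models, n)
        return (votes.mean(axis=0) >= 0.5).astype(int)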

3.7  Evaluating Model Performance

The performance of all models was evaluated using two different accuracy measures. The first was the F-measure, which is defined as the harmonic mean of the precision and recall for a single class c:

    F = 2 · (Precision · Recall) / (Precision + Recall)

where

    Precision = TP / (TP + FP)

where TP is the number of examples classified as class c that were from that class, and FP is the number of examples classified as c that were actually from a different class. The Precision is the ratio of the correct model predictions to the total number of model predictions for a class. Likewise,

    Recall = TP / (TP + FN)

where FN is the number of examples from class c that were classified as a different class. Recall is the ratio of correct model predictions to the total number of actual class examples. Since the F-measure is a measure of accuracy for a single class, in all cases the F-measure of the minority class is given.

The second measure used is called the class-weighted accuracy, and is defined as

    Class-Weighted Accuracy = (1/2) · ( TP / (TP + FN) + TN / (TN + FP) )

which is the arithmetic mean of the accuracies of each class. The class-weighted accuracy is high when both class accuracies are high. If one class accuracy is low, perhaps because one class is an extreme minority, then the class-weighted accuracy will be low. If instead the normal accuracy was used (all true positive and true negative predictions divided by all predictions), then the minority class accuracy would be overwhelmed by the majority class accuracy.

To test the statistical significance of performance results, the Wilcoxon Signed-Rank test was used. The Wilcoxon Signed-Rank test is a paired test to determine if one method performs statistically significantly better than another method. The difference between performance for two methods is determined for a number of trials, and the differences in performances are ranked by their absolute value. The intuition behind the test is that if the two methods have the same performance, then the distributions of positive and negative differences in the ranked ordering should be about uniform. In other words, if more positive or more negative differences were ranked higher, so the absolute differences were greater, then the two methods do not have the same performance.
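A short sketch of both measures and the significance test, using scipy's Wilcoxon implementation on toy per-trial F-measures (the numbers are illustrative, not results from the thesis):

    import numpy as np
    from scipy.stats import wilcoxon

    def f_measure(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    def class_weighted_accuracy(tp, fp, tn, fn):
        # arithmetic mean of the per-class accuracies
        return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

    # paired test over per-trial F-measures (toy numbers)
    supervised      = np.array([0.83, 0.86, 0.90, 0.88, 0.91, 0.87])
    semi_supervised = np.array([0.84, 0.85, 0.92, 0.90, 0.93, 0.89])
    stat, p = wilcoxon(supervised, semi_supervised)
    print(f"W={stat}, p={p:.4f}")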

CHAPTER 4
EXPERIMENTS AND RESULTS

Primarily, we wished to determine if our semi-supervised framework would provide an improvement over a similar supervised learning approach, and the conditions under which an improvement could be realized. The following sections present the approach for the bolt data set described in 2.1, the can data set described in 2.2, and the KDD Cup 1999 data set described in 2.3.

4.1  Results on the Bolt Data Set

To select initial labeled data used for learning, a region of salient nodes was selected, and then unsalient nodes with a small spatial distance from the selected region of salient nodes were selected as unlabeled data. This was done to simulate the act of a user with a view of the simulation marking a region of the bolt's nodes salient, and in doing so implicitly marking nodes not selected from the view as unsalient.

First, nodes from the 2nd, 9th, and 16th time steps were chosen according to the method above. These time steps were chosen because of their uniform separation in the simulation. We attempted to use the smallest number of examples possible for the initial training set, as expert labeling of data is quite expensive. Exactly 375 nodes (0.5% of the 74,753 nodes) comprising the 2nd, 9th, and 16th time steps were chosen by selecting the salient region of nodes comprising the bolts and a matching number of unsalient nodes a small spatial distance away. These nodes are correctly labeled and used for training. Entire bolts were selected for two reasons. First, using 375 nodes from each training time step resulted in a small supervised model with accuracy better than simply predicting the majority class of non-bolt for all examples. Second, typically experts label entire regions of interest rather than individual nodes of interest when visualizing simulations. Selecting an entire bolt emulates the process of selecting a region.

Selecting time steps close to the initial ones, yet still spaced uniformly, 3.375% of the unlabeled nodes in the 4th, 5th, 11th, 12th, 18th, and 19th time steps were added to the training set by the semi-supervised algorithm. The percentage sampled ensured that a ratio of at least 1 labeled example to 9 unlabeled examples was reached. After exhausting the time steps denoted for semi-supervised addition, the algorithm starts again, this time evaluating the semi-supervised time steps without the unlabeled examples previously added to the model. The accuracy was recorded after each group of examples was added to the training set. The test set consisted of all unused time steps, i.e., all time steps not part of the training set or part of the semi-supervised set of time steps evaluated for addition. See Table 4.1. A total of 6000 trees were created on every block of time steps, 2000 for each of the 3 partitions containing bolts. The individual ensemble predictions were combined via Probabilistic Majority Vote [27].

The clusters were obtained via a fast incremental fuzzy c-means approach [25] using 25 centroids after linearly normalizing the data. This method of obtaining cluster centroids provides similar accuracy to that of clustering an entire data set, but only a fraction of the data is used at each iteration. We chose to use 10% of the data at each iteration for the clustering of both the salient and the unsalient examples, and the number of clusters chosen in each case was 25. After obtaining predictions for examples in the test set, the prediction data was averaged within a radius of 3 units and thresholded at 0.5 to obtain smoothed predictions of 0.0 for unknown or 1.0 for salient.

The semi-supervised model has a higher F-measure than a supervised model using the same underlying classifier, as shown in Figure 4.1. The same amount of the initially labeled data given to the semi-supervised self-teaching learner was provided to the supervised approach for training, and the same test set used for evaluating the semi-supervised technique was used for evaluating the performance of the supervised approach. We found a 15.1% increase in the number of salient bolt nodes identified correctly and an increase in overall accuracy, as shown in Table 4.2.
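The spatial smoothing step can be sketched as follows; the coordinates, the 3-unit radius, and the O(n^2) neighbor search are illustrative simplifications, not the thesis implementation:

    import numpy as np

    def smooth_predictions(coords, preds, radius=3.0, threshold=0.5):
        # Average each node's 0/1 prediction over all nodes within
        # `radius` in space, then re-binarize at `threshold`.
        coords = np.asarray(coords, dtype=float)
        preds = np.asarray(preds, dtype=float)
        smoothed = np.empty_like(preds)
        for i, c in enumerate(coords):
            near = np.linalg.norm(coords - c, axis=1) <= radius
            smoothed[i] = preds[near].mean()          # includes the node itself
        return (smoothed >= threshold).astype(float)  # 1.0 salient, 0.0 unknown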

Table 4.1. The entire bolt simulation data set is split across the training, semi-supervised, and test sets by time step (1-20).

                       time steps               % used from each
initial training set   2, 9, 16                 0.5%
semi-supervised set    4, 5, 11, 12, 18, 19     3.375%
test set               all remaining            100%

Table 4.2. Bolt set predictions on the test set averaged over 20 trials. The semi-supervised method uses a random 0.5% of labeled data from the original time steps and 3.375% of unlabeled data from the semi-supervised steps. The salient percentage result is the percentage of salient nodes correctly identified, while the overall percentage is the percentage of nodes correctly identified regardless of class.

True Class   Predicted Class   Not Semi-Supervised   Semi-Supervised
Unsalient    Unsalient         748942.2              750366.4
Unsalient    Salient           10860.8               9436.6
Salient      Unsalient         19307.45              9903.1
Salient      Salient           43172.55              52576.9
Salient %                      69.1%                 84.2%
Overall %                      96.3%                 97.6%

For each time step of the test set, the number of bolt nodes identified increased on average by 15%. The full results by test set time step are presented in Table 4.3.

Since the variances of the means for the supervised and semi-supervised results appear unequal, we performed a Wilcoxon Signed-Rank test to determine the statistical significance using each individual test set time step. Based on a sample size of 11, our observed sum of signed ranks W was 58 with z-ratio 2.56 and probability 0.0052, indicating a significant difference at the 0.01 level between the supervised and semi-supervised approaches.

4.2  Results on the Can Data Set

In order to test the statistical significance of the results of the semi-supervised method using the can data set, multiple runs were performed. Additionally, different training set distributions were tested, including drawn at random and drawn in a stratified manner according to a predefined class imbalance level. Sampling at random is not entirely appropriate with the can data set.

Figure 4.1. Average F-measure over 20 trials with the range for the bolt data set.

Table 4.3. Average number of test set bolt nodes found for each time step.

             Supervised                       Semi-supervised
Test Set
Time Step    TP        % of total   FP        TP        % of total   FP
3            3754.85   66.1%        175.45    2728.65   48.0%        186.45
6            3748.25   66.0%        154.1     4783.85   84.2%        240.7
7            4023      70.8%        260.45    4848.1    85.4%        371.5
8            4150.95   73.1%        615.6     5039.85   88.7%        675.05
10           4298.25   75.7%        437.8     5335.55   93.9%        536.65
13           4318.45   76.0%        1583.1    5332.55   93.9%        988.15
14           3792.15   66.8%        2384.55   4967.7    87.5%        1377.85
15           4320.4    76.1%        2315.4    5292.4    93.2%        1714.15
17           3877.85   68.3%        1060.9    5050.05   88.9%        948.55
20           3601.2    63.4%        1220.15   4718.05   83.1%        1344.15
21           3287.2    57.9%        653.3     4480.15   78.9%        1053.4

TP - # bolt nodes found, FP - # bolt nodes incorrect

As with the bolt data set, an expert would typically label regions of nodes in the simulation as interesting rather than individual nodes.

The 1st, 20th, and 30th time steps of the first simulation were made available as training time steps. Unlabeled data was drawn randomly from 3 time steps in the first simulation and 2 time steps in the remaining 3 simulations. The test set was comprised of all remaining time steps, which included 34 simulation 1 time steps, 19 each from simulations 2 and 3, and 20 time steps from simulation 4.

Sampling was performed because the goal is to minimize the amount of labeled data, and because initial classification results using full time steps were at a level too high for semi-supervised learning to provide an increase in the F-measure. After sampling, the F-measure was used to test statistical significance between the semi-supervised method and a supervised approach. In all experiments, the test set was drawn randomly from the available training data.

4.2.1  Randomly Sampled Training Set

At seven different levels of random sampling, 30 trials were performed. The average results are shown in Table 4.4. Sampling was performed to achieve an initial training set that had poor enough performance to allow for improvement, but as can be seen, such a level was hard to achieve. With only 8 labeled examples, a supervised approach resulted in an average F-measure of 0.8359, and by 20 examples the average F-measure rose to 0.9056. For 18 and 20 examples, the F-measure increased with the addition of the unlabeled data, as did the weighted average accuracy, but for 8, 10, 12, 14, and 16 examples, both the F-measure and class-averaged accuracy decreased. Next, the significance of the change in the average performance was tested.

Table 4.4. Average F-measure and class-weighted accuracy of 30 trials for different initial labeled set sizes on the can data set using a randomly sampled training set. Entries marked * indicate improvement.

Percent   N    Init. FM   Final FM   Init. CWA   Final CWA
0.004%    8    0.8359     0.8195     0.8521      0.8486
0.005%    10   0.8682     0.8238     0.8885      0.8472
0.006%    12   0.9009     0.8939     0.9049      0.9049
0.007%    14   0.8895     0.8735     0.8898      0.8682
0.008%    16   0.9056     0.8985     0.9078      0.8990
0.009%    18   0.9063     0.9121*    0.9119      0.9182*
0.01%     20   0.9056     0.9141*    0.9109      0.9191*

N - Labeled Data Size, FM - F-Measure, CWA - Class-Weighted Accuracy.

Table 4.5. Wilcoxon Signed-Rank test results on the can data set using random sampling. While 30 trials were performed, in some cases the initial model had no potential for improvement; the classifier classified all test examples as a single class. These models were discarded and are not included.

Percent   Examples   W      N    Improvement?   P-value     Significant?
0.004%    8          188    30   No             > 0.19808   No
0.005%    10         190    29   No             > 0.19808   No
0.006%    12         162    30   No             0.15198     No
0.007%    14         168    29   No             0.19092     No
0.008%    16         -196   30   Yes            > 0.19808   No
0.009%    18         -176   29   Yes            > 0.19808   No
0.01%     20         -161   30   Yes            0.14600     No

W - Wilcoxon score, N - Number of trials.

Using the Wilcoxon Signed-Rank Test, none of the sampling percentages showed statistically significant changes between the supervised and semi-supervised approaches, as seen in Table 4.5. It appears that the lower sampling rates result in a usually non-significant decrease in performance, while 18 and 20 examples result in an increase in F-measure, with 20 examples resulting in the improvement closest to statistical significance. Possibly more than 20 examples will result in increasingly significant F-measure increases.

4.2.1.1  Non-Identically Distributed Training Set

Experiments were also performed to analyze the effect of the class distribution of the training set on performance of the method. A training set was sampled with a fixed proportion of majority examples, ranging from 10% to 90% of the training data. In the case of the can data set, the percentage of majority class in all labeled data is 52%. Referring to Table 4.6, the averages do not appear to result in any trend beyond larger F-measures with more majority examples. The semi-supervised process in some cases results in a higher average F-measure; in other cases the process results in a lower F-measure. Figure 4.2 represents these results visually, for additional insight.

Table 4.6. Average initial and final F-measure for 30 trials of the semi-supervised method using different amounts of the available labeled examples with 10% to 90% majority class for the can data set using stratified sampling. Note that the initial F-measure is the accuracy that a supervised approach would achieve. Entries marked * indicate improvement.

        0.03%                    0.05%
Maj.    N    Start     End       N    Start     End
10%     7    0.4373    0.4194    10   0.4090    0.4532*
30%     6    0.4126    0.4283*   10   0.7879    0.7775
50%     6    0.8628    0.8643*   10   0.8719    0.8790*
70%     6    0.8592    0.8301    10   0.8984    0.9042*
90%     6    0.8159    0.8705*   10   0.8113    0.9140*

        0.07%                    0.09%
Maj.    N    Start     End       N    Start     End
10%     14   0.3424    0.3818*   18   0.3919    0.4132*
30%     14   0.8561    0.8172    18   0.8652    0.8349
50%     14   0.8917    0.8953*   18   0.8859    0.8820
70%     14   0.8740    0.9080*   18   0.9154    0.9167*
90%     14   0.9016    0.9296*   18   0.9112    0.9269*

        0.1%
Maj.    N    Start     End
10%     20   0.6954    0.7139*
30%     20   0.8507    0.8496
50%     20   0.9060    0.9048
70%     20   0.9081    0.9034
90%     20   0.9030    0.9384*

N - Number of labeled examples.

Many of the highly imbalanced training sets (10% and 90% majority examples) appear to result in a higher average F-measure when the semi-supervised approach is used.

To analyze the significance of the differences in F-measure, the Wilcoxon Signed-Rank test was again applied. Although using a data set comprised of 10% majority class examples and 90% minority class examples resulted in an average improvement in the F-measure, the difference in averages was not statistically significant. However, with 90% majority class examples, every training set size showed a statistically significant improvement in the F-measure of the minority class, as shown in Table 4.7.

Figure 4.2. Visualization of the average change in F-measure for each sample rate and majority proportion for the non-partitioned can data set. Points above the line show rates where average improvement occurs.

Figure 4.3. Visualization of the significance of the change in F-measure for each sample rate and majority proportion for near average selection. Points at 1 indicate significant improvement; -1 indicates a significant decrease in the F-measure. A point at 0 indicates no significant increase or decrease.

Table 4.7. Wilcoxon Signed-Rank test results for N trials of the semi-supervised method with various numbers of labeled examples and between 10% and 90% majority class. `Better?' indicates a statistically significant improvement in the F-measure (p < 0.01). Although 30 trials were performed, in some trials the initial model showed no possibility of improvement; all test examples were classified as a single class. These models were discarded.

        0.03%                                 0.05%
Maj.   Num   W      N    Prob        Better?   Num   W      N    Prob        Better?
10%    7     227    29   > 0.19962   No        10    -176   30   > 0.19808   No
30%    6     230    30   > 0.19808   No        10    169    30   0.19808     No
50%    6     -232   30   > 0.19808   No        10    -219   30   > 0.19808   No
70%    6     124    30   0.02478     No        10    -190   30   > 0.19808   No
90%    6     -107   30   0.008706    Yes       10    -93    30   6.146E-8    Yes

        0.07%                                 0.09%
Maj.   Num   W      N    Prob        Better?   Num   W      N    Prob        Better?
10%    14    -162   30   0.15188     No        18    -187   30   > 0.19808   No
30%    14    115    30   0.014538    No        18    167    30   0.18396     No
50%    14    -191   30   > 0.19808   No        18    -217   30   > 0.19808   No
70%    14    -160   30   0.14028     No        18    -171   30   > 0.19808   No
90%    14    -55    30   3.128E-4    Yes       18    -137   30   0.04972     Yes

        0.1%
Maj.   Num   W      N    Prob        Better?
10%    20    -157   30   0.12414     No
30%    20    -205   30   > 0.19808   No
50%    20    217    30   > 0.19808   No
70%    20    -232   30   > 0.19808   No
90%    20    -48    30   4.408E-5    Yes

W - Wilcoxon score, N - Number of Trials

Table 4.8. Average F-measure and class-averaged accuracy for 30 runs using the can data set with partitioning.

Percent   N    Initial FM   Final FM   Initial CWA   Final CWA
0.02%     4    0.4889       0.4889     0.5           0.5
0.03%     4    0.4667       0.4682     0.5           0.4969
0.04%     6    0.6486       0.6134     0.6199        0.5671
0.05%     7    0.5954       0.5913     0.6394        0.5900
0.06%     10   0.7177       0.6984     0.6700        0.6694
0.07%     11   0.7718       0.7781     0.7235        0.7304

N - Labeled Examples, FM - F-Measure, CWA - Class-Weighted Accuracy

4.2.2  Partitioning

As with the bolt data set, the effect of spatial partitioning was explored. As described in Section 2.2, 4 non-overlapping partitions were created from the can data. Each partition was treated as a separate training set, and votes from each were combined for a majority vote on test examples.

In all cases, 30 trials with different random seeds were performed and average results obtained, shown in Table 4.8. Compared to the non-partitioned trials, the initial, supervised F-measure was lower in the case of partitioning. While this may seem incorrect since the total number of labeled examples is the same, this is appropriate because each classifier gets less labeled data.

Since the sampling is controlled so that each partition must contain at least one example, the sampling rates of 0.02% and 0.03% resulted in 4 examples, 1 in each partition. Also of note is that the F-measure of these ensembles before and after the semi-supervised process was roughly 0.5. Since they were exposed to only one example, they acted in a "brain-dead" fashion, classifying all examples into the majority class of the 4 represented examples.

For all sample rates explored, no improvement was seen due to the semi-supervised process, although the margin of accuracy decrease appeared small in all cases.

Once again, the significance of the differences in the F-measures was determined using the Wilcoxon Signed-Rank test. As Table 4.9 shows, one instance of significant decrease (at the 0.05 level) in F-measure was found for 6 examples. In all 30 trials, only 13 resulted in some change in F-measure.

Table 4.9. Partitioned can data set Wilcoxon Signed-Rank test results. Significance was found with a p < 0.05.

Percent   Examples   W      N    Improvement?   P           Significant?
0.02%     4          N/A    0    N/A            N/A         N/A
0.03%     4          N/A    2    N/A            N/A         N/A
0.04%     6          16     13   No             0.0398      Yes
0.05%     7          68     18   No             > 0.19638   No
0.06%     10         96     20   No             > 0.18934   No
0.07%     11         -151   26   Yes            > 0.19874   No

W - Wilcoxon score, N - Number of trials

This indicates that the classifiers generated from the set of initial training examples were too "brain-dead" or inaccurate from the beginning.

4.2.2.1  Non-Identically Distributed Training Set

In the case of uniformly sampled training data, little or no decrease was observed using the semi-supervised self-learning framework. To test how non-uniformly sampled data would affect these results, experiments were performed by sampling the training set in a stratified manner to set the number of majority examples that would be included. However, small sample sizes obscured the results of these trials.

Table 4.10 shows the initial supervised and final semi-supervised F-measure for eight sampling percentages and five levels of majority in the training set. The semi-supervised method appears to cause a decrease in the F-measure for lower sampling percentages. One notable exception is 0.015%, but as shown, this level of sampling resulted in 8 training examples. A sampling rate of 0.03% also resulted in 8 training examples, but in that case the results did not show an accuracy increase, but rather a decrease. Because at least 1 example must be present from each class and in each partition, the 4-partition, 2-class nature of the experiment creates a minimum 8-example training set.

At 0.09%, the higher biases start to show an increase in minority class F-measure due to the semi-supervised process. The same can be seen with a 0.1% sampling rate. For rates of 0.11% and 0.13%, all but one majority percentage shows an increase in the average F-measure.

Table 4.10. Average F-measure for 30 trials of the semi-supervised method on the partitioned can data set using 0.015%, 0.03%, 0.05%, 0.07%, 0.09%, 0.1%, 0.11% and 0.13% of the available labeled examples with 10% to 90% majority class. Entries marked * indicate an increase in the F-measure.

        0.015%                   0.03%
Maj.    N    Start     End       N    Start     End
10%     8    0.8248    0.8400*   8    0.8475    0.7525
30%     8    0.8319    0.8377*   8    0.8140    0.6963
50%     8    0.8312    0.8351*   8    0.8287    0.7533
70%     8    0.8354    0.8362*   8    0.8462    0.7180
90%     8    0.7995    0.8113*   8    0.8172    0.7464

        0.05%                    0.07%
Maj.    N    Start     End       N    Start     End
10%     11   0.7455    0.6367    15   0.6220    0.6165
30%     11   0.7379    0.6698    12   0.6572    0.6305
50%     8    0.8286    0.7262    11   0.7446    0.7425
70%     8    0.8511    0.7359    11   0.8043    0.8229*
90%     11   0.8044    0.8442*   15   0.7580    0.8881*

        0.09%                    0.1%
Maj.    N    Start     End       N    Start     End
10%     21   0.5499    0.6235*   21   0.5958    0.6453*
30%     17   0.6514    0.6179    17   0.6147    0.6562*
50%     17   0.8558    0.8358    17   0.8540    0.8436
70%     17   0.8284    0.8843*   17   0.8424    0.8901*
90%     18   0.7606    0.8692*   21   0.7710    0.8917*

        0.11%                    0.13%
Maj.    N    Start     End       N    Start     End
10%     25   0.5038    0.6527*   29   0.5336    0.6132*
30%     21   0.6079    0.6763*   25   0.7913    0.8120*
50%     21   0.8809    0.8773    25   0.8896    0.8833
70%     21   0.8701    0.9090*   25   0.8888    0.9088*
90%     21   0.7775    0.8913*   25   0.7657    0.8676*

N - Labeled examples

The Wilcoxon Signed-Rank test was utilized to test the significance of the observed change in accuracy across all trials. Table 4.11 shows the result of a 2-tail test and the level at which statistical significance occurs, if any. The column `Better?' indicates if the F-measure was observed to improve with statistical significance. For the low sampling rates of 0.015% and 0.03% we see a statistically significant decrease in the F-measure. Similarly, 0.05% shows a borderline significant decrease for different percentages of majority, but for 90% majority class the minority F-measure does improve significantly. At 0.07% of the available examples, the method shows no significance, except in the case of 90% majority, which again shows significant improvement, a trend which can be seen in all tested sampling rates above 0.05%, or 11 examples. For sampling rates of 0.11% and 0.13%, the Wilcoxon Signed-Rank test indicates significant improvement in the F-measure for all tested majority percentages except 50%, which is closest to the true data class distribution.

The count of labeled training examples listed at each sampling rate in Tables 4.10 and 4.11 appears inconsistent due to sampling artifacts, as mentioned above. Additionally, the percentage of majority class was affected by the choice of sampling rate. Since one example of each class must be present in each partition, the lower sampling rates resulted in eight, one of each class from each of the four partitions; a true majority percentage of 50%. To better understand the interaction of the true number of initial examples and true majority percentage on model accuracy, the number of examples and majority proportion were determined post-sampling for each trial run, as shown in Figure 4.4. Determining the true majority and true count allowed visualization of the effect on both the change in the F-measure and the statistical significance of any change. Figure 4.7 shows the effect of the number of labeled examples on the change in the F-measure. The x-axis is the size of the initial labeled set, while a value of 1 on the y-axis represents an increase and a value of -1 represents a decrease. From this view there appears to be no trend between initial set size and improvement in F-measure. However, when we visualize the effect of the majority percentage in the training set, we can clearly see that many instances of improvement in the F-measure occur with 40% or less majority examples in the training set, with a few cases in the higher majority percentages.


Table 4.11. Wilcoxon Signed-Rank test results for N trials of the semi-supervised method on the partitioned can data set with closer than average selection using 0.015%, 0.03%, 0.05%, 0.07%, 0.09%, 0.1%, 0.11% and 0.13% of the available labeled examples with between 10% and 90% majority class.

              0.015%                                     0.03%
    Maj.   Num   W      N    Prob         Better?     Num   W     N    Prob         Better?
    10%    8     -140   30   0.05768      No          8     49    30   4.408E-05    No
    30%    8     -168   30   0.19092      No          8     65    30   0.0001416    No
    50%    8     -215   30   > 0.19808    No          8     78    30   0.0009518    No
    70%    8     231    30   > 0.19808    No          8     37    30   1.061E-05    No
    90%    8     -136   30   0.04726      Yes         8     129   30   0.03272      No

              0.05%                                      0.07%
    Maj.   Num   W      N    Prob         Better?     Num   W      N    Prob         Better?
    10%    11    123    30   0.01171      No          15    -222   30   > 0.19808    No
    30%    11    141    30   0.06056      No          12    -222   30   > 0.19808    No
    50%    8     99     30   0.005012     No          11    221    30   > 0.19808    No
    70%    8     25     30   1.6828E-06   No          11    -186   30   > 0.19808    No
    90%    11    -131   30   0.03644      Yes         15    0      30   1.8626E-9    Yes

              0.09%                                      0.1%
    Maj.   Num   W      N    Prob         Better?     Num   W     N    Prob         Better?
    10%    21    -107   30   0.004353     Yes         21    -159  30   0.13474      No
    30%    17    193    30   > 0.19808    No          17    -158  30   0.12936      No
    50%    17    176    30   > 0.19808    No          17    200   30   > 0.19808    No
    70%    17    -71    30   0.0005054    Yes         17    -62   30   0.0002092    Yes
    90%    18    -20    30   6.910E-07    Yes         21    -34   29   6.618E-06    Yes

              0.11%                                      0.13%
    Maj.   Num   W      N    Prob         Better?     Num   W      N    Prob         Better?
    10%    25    -47    30   2.904E-05    Yes         29    -128   30   0.03098      Yes
    30%    21    -69    30   0.0004184    Yes         25    -132   30   0.03842      Yes
    50%    21    178    30   > 0.19808    No          25    208    30   > 0.19808    No
    70%    21    -76    30   0.0007978    Yes         25    -119   30   0.01853      Yes
    90%    21    -21    30   8.326E-07    Yes         25    -44    30   2.678E-05    Yes

    Num - Labeled examples, W - Wilcoxon score, N - Trials


Visualizing both the majority proportion and the example count together in Figure 4.9, we can see that average improvement occurred in two regions of the parameter space. The first region is more than 20 examples and less than 30% majority. The second is any trial with larger than 60% majority examples. Unfortunately, due to the sampling artifacts, smaller sample sizes with large majority proportions were not explored.

Despite seeing improvement, not all cases are statistically significant, as shown in Figure 4.10. Of the trials with 20 or more examples and less than 30% majority class, those with the smallest sample sizes do not result in statistical significance. Similarly, the smaller sample sizes above 60% majority examples did not result in statistical significance. It is also interesting to note that the statistically significant decreases in accuracy occurred at 8 examples with 50% majority and at 11 examples with 36% majority.

It appears that the partitioned semi-supervised approach works well for the can data set under cases of training set imbalance in either class's favor, given enough training examples. When no partitioning is used, it appears only to work well for a training set mixture containing 90% majority class. Comparing the two cases becomes increasingly difficult as one considers that each of the partitions only contains a fraction of the available data. The non-partitioned trials have the possibility of randomly seeing all of the available labeled data in a training set. Further, the largest number of examples seen by a single partition was 8 examples, seen by 2 of the 4 partitions when sampling 0.13% of the available data at 10% majority. Contrasted with the smallest training set seen with no partitioning, 6 examples, a direct comparison of performance is difficult to achieve. Still, if comparing based on the raw number of initial labeled examples, both show no statistically significant improvement when drawing randomly with a uniform distribution from the available labeled examples. When drawing in a stratified manner across classes, a partitioned learner can improve for more percentages of the majority class than a non-partitioned learner.


Figure 4.4. True majority percentage and true example count for the partitioned can experiments. Due to stratified sampling and low sample sizes, requested majority percentages resulted in different true majority percentages and example counts.


Figure 4.5. Improvement due to the semi-supervised processes on the partitioned can data set as the true majority percentage of the initial training set increases. A -1 indicates a decrease in F-measure, while 1 indicates an increase in F-measure.


Figure 4.6. Statistical significance due to the semi-supervised processes with the partitioned can data set as the true majority percentage of the initial training set increases. A -1 indicates a statistically significant decrease in F-measure, while 1 indicates a statistically significant increase in F-measure.


Figure 4.7. Improvement (y-axis) due to the semi-supervised processes on the partitioned can data set as the size of the initial training set increases (x-axis). A -1 indicates a decrease in F-measure, while 1 indicates an increase in F-measure.


Figure 4.8. Statistical significance due to the semi-supervised processes with the partitioned can data set as the size of the initial training set increases. A -1 indicates a statistically significant decrease in F-measure, while 1 indicates a statistically significant increase in F-measure.


Figure 4.9. True majority percentage and true example count for the partitioned can experiments with improvement shown. The low majority proportions with high example counts showed improvement, as well as the high majority proportions.


Figure 4.10. True majority percentage and true example count for the partitioned can experiments with statistical significance shown. The statistically significant trials occurred in two regions - lowest majority proportion (< 0.3) and large training sets (> 20), and large training sets (> 15) with large majority proportions (> 0.6).


For the partitioned case, some limited comparison with previous approaches becomes possible. In [17], ensembles were created from the same 4 disjoint vertical partitions using only training data from the first simulation. Then, false positive rates and false negative rates were presented using each of the 2nd, 3rd, and 4th simulations as test sets. These rates can be converted into the F-measure using the known size of each simulation and the known number of salient nodes. Calculating the F-measure is possible since the false positive rate FPr = 1 - TNr = 1 - TN/(TN + FP). Since TN + FP is simply the number of unsalient nodes in the data set, TN is calculable. Similarly, the false negative rate FNr = 1 - TPr = 1 - TP/(TP + FN). In this case TP + FN is the number of salient simulation nodes. The total F-measure for each method was computed by determining the total number of true positive, false negative, and false positive predictions across all test simulations. Table 4.12 shows the classifier or ensemble technique used as well as the test simulation, the false positive and false negative rates, all from [17], and the computed F-measure.

While the computed F-measure values provide a basis for evaluating the semi-supervised method, only limited comparison is possible since the test sets used in this thesis and in [17] differ by quite a large degree.
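A minimal sketch of this conversion follows; the node counts in the example call are hypothetical placeholders, not the actual simulation sizes.

    def f_measure_from_rates(fp_rate, fn_rate, n_salient, n_unsalient):
        """Recover the F-measure from FP/FN rates and known class sizes."""
        tp = (1.0 - fn_rate) * n_salient    # TPr = 1 - FNr
        fn = fn_rate * n_salient
        fp = fp_rate * n_unsalient          # FPr = 1 - TNr
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    # e.g. the RF train-1/test-2 row of Table 4.12 (FP 0.021, FN 0.103),
    # combined with hypothetical counts of salient and unsalient nodes:
    print(round(f_measure_from_rates(0.021, 0.103, 60000, 40000), 3))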


Table 4.12. Partitioned can data set results comparisons. Columns 1-4 are reproduced from [17]; column 5 is computed from available data. Because of test set differences, only limited comparisons can be drawn.

    Classifier/Ensemble   Train/Test   FP Rate   FN Rate   F-M
    DT                    1/2          0.023     0.143     0.809
    RF                    1/2          0.021     0.103     0.859
    RFW                   1/2          0.026     0.086     0.878
    DT                    1/3          0.052     0.041     0.955
    RF                    1/3          0.041     0.028     0.966
    RFW                   1/3          0.053     0.019     0.965
    DT                    1/4          0.055     0.045     0.958
    RF                    1/4          0.072     0.024     0.957
    RFW                   1/4          0.088     0.014     0.053
    DT                    1/all                            0.880
    RF                    1/all                            0.904
    RFW                   1/all                            0.907

    Train/Test - Simulations used for Train and Test, F-M - F-measure

4.3 Results on the KDD Cup 1999 Data Set

We start by randomly downsampling the training set used to train the initial supervised learner. The process of randomly downsampling is appropriate since one goal is to utilize as little labeled data as possible. We assume an expert labels the training examples, but that the expert does not select or guide example selection for either the labeled training data or the unlabeled data utilized in the semi-supervised learning process. The cost of obtaining each additional labeled example motivates reducing the amount of labeled data used. We also found this process necessary since performance was rather high on the data set using random forests, leaving little room for improvement. For instance, we randomly sampled 0.1% of the available labeled examples and then built a random forest with 200 trees. An average F-measure from 30 trials of 0.9958 was obtained with a range of [0.9929, 0.9974]. Table 4.13 gives the results for a variety of small random samples of the available training data using a random forest with 200 trees.

Table 4.13. Average performance for 30 supervised trials of a 200-tree random forest for relatively small samples of KDD data.

    Percent   N        FM       CWA
    0.1%      3428     0.9958   0.9987
    2%        68578    0.9992   0.9997
    4%        137156   0.9994   0.9998
    6%        205734   0.9995   0.9998
    8%        274312   0.9996   0.9998
    10%       342890   0.9996   0.9998

    FM - F-Measure, CWA - Class-Weighted Accuracy, N - Labeled Examples

To add data to the training set, first a random sample of available data was taken. Using the current training set, class labels are predicted, and then examples are selected for addition to the training set for the next iteration. Two methods for selecting examples for addition to the training set were explored, the first being the closer than average method.
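The iteration just described can be sketched as follows. This is an illustration only, using scikit-learn's random forest in place of the learners used in this thesis, and `select` stands in for either of the two selection rules described next; both names are placeholders.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def self_train(X_lab, y_lab, X_pool, select, n_iter=10, batch=1000, seed=0):
        """Iteratively grow the labeled set with self-labeled examples."""
        rng = np.random.default_rng(seed)
        model = RandomForestClassifier(n_estimators=200, random_state=seed)
        for _ in range(n_iter):
            model.fit(X_lab, y_lab)
            if len(X_pool) == 0:
                break
            idx = rng.choice(len(X_pool), size=min(batch, len(X_pool)),
                             replace=False)
            X_batch = X_pool[idx]
            y_pred = model.predict(X_batch)               # provisional labels
            keep = select(X_lab, y_lab, X_batch, y_pred)  # boolean mask
            X_lab = np.vstack([X_lab, X_batch[keep]])
            y_lab = np.concatenate([y_lab, y_pred[keep]])
            X_pool = np.delete(X_pool, idx[keep], axis=0)  # drop used examples
        return model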


4.3.1 Closer Than Average Unlabeled Example Selection

In this method, the selected unlabeled data was separated by class predictions. Similarly, the labeled training data is separated by class and then clustered. Next, each example evaluated during a semi-supervised step is assigned membership to a centroid by finding the closest same-class centroid found from clustering the training data. Finally, the average distance of all examples in a group from the corresponding centroid is determined, and only those examples closer than the average to the centroid are added to the training data. Closer than average unlabeled example selection is described in detail in Section 3.5.

To test this method of adding examples to the training set, 30 trials were performed with different random seeds. Table 4.14 presents the average results of the closer than average thresholding method for selecting examples for addition to the training set. Initial training sets consisting of 0.0003%, 0.0005%, 0.00075%, and 0.001% of the available labeled training data (10, 17, 25, and 34 examples) were randomly selected at each trial. Then, 10 iterations were performed, adding enough randomly selected unlabeled data to the training set so that the final ratio of initially labeled to initially unlabeled was approximately 1:9.

The average of the 30 runs was not promising. In all cases the average F-measure and average class-weighted accuracy decrease from start to end. However, we realized that the average result does not accurately capture the performance of the method. Rather, we wish to test whether improvement occurs on a case-by-case basis as a result of applying the semi-supervised method. To test this we used the Wilcoxon Signed-Rank test on the initial F-measure, representing the supervised result, and the final F-measure, representing the result of the semi-supervised method.
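A sketch of the selection rule is given below. The thesis clusters with fuzzy c-means; scikit-learn's KMeans is substituted here only to keep the example self-contained, and the 25-centroid default mirrors the experimental setup.

    import numpy as np
    from sklearn.cluster import KMeans

    def closer_than_average(X_lab, y_lab, X_batch, y_pred, n_clusters=25):
        """Keep self-labeled examples closer than average to their centroid."""
        keep = np.zeros(len(X_batch), dtype=bool)
        for c in np.unique(y_lab):
            X_c = X_lab[y_lab == c]
            k = min(n_clusters, len(X_c))      # small seed sets limit k
            centroids = KMeans(n_clusters=k, n_init=10).fit(X_c).cluster_centers_
            sel = np.where(y_pred == c)[0]     # examples predicted as class c
            if sel.size == 0:
                continue
            # Distance of each selected example to every same-class centroid.
            d = np.linalg.norm(X_batch[sel, None, :] - centroids[None, :, :],
                               axis=2)
            nearest = d.argmin(axis=1)
            dist = d[np.arange(sel.size), nearest]
            for j in np.unique(nearest):       # group by assigned centroid
                grp = nearest == j
                keep[sel[grp]] = dist[grp] < dist[grp].mean()
        return keep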


Table 4.14. Average performance for 30 trials of the semi-supervised method with closer than average selection on the KDD data set using 10, 17, 25 and 34 labeled examples.

    Num   Init. FM   Final FM   Init. CWA   Final CWA
    10    0.6662     0.6069     0.7857      0.7480
    17    0.9002     0.8682     0.9290      0.8958
    25    0.9139     0.9102     0.9348      0.9297
    34    0.9410     0.9060     0.9581      0.9229

    FM - F-Measure, CWA - Class-Weighted Accuracy

Table 4.15. Significance of closer than average unlabeled example selection results on the KDD data set for 30 trials of the semi-supervised method using 10, 17, 25 and 34 labeled examples.

    % of Avail. Lbled   Examples   W-Score   Probability of H = H0   Improvement?
    0.0003%             10         171       > 0.05                  No
    0.0005%             17         139       < 0.05 and > 0.01       No
    0.00075%            25         228       > 0.05                  No
    0.001%              34         64        < 0.01                  No

As Table 4.15 shows, with 17 and 34 examples the initial supervised and final semi-supervised F-measures are statistically significantly different at the 0.05 level, but with a decrease in the F-measure. At 34 examples, the statistical significance holds at the 0.01 level. With 10 and 25 examples, there is no statistical difference between the initial supervised and final semi-supervised F-measure. At first glance it appears that the semi-supervised method has no hope of improving the initial supervised F-measure and can do nothing but harm the results. However, looking again at Table 4.14 we can see that even with just 17, 25 or 34 examples the average initial supervised F-measure is above 0.9, indicating very good results and leaving very little room for improvement.

4.3.2 Near Average Unlabeled Example Selection

In the second unlabeled data selection method, examples near the average are selected for addition to the training set. After assigning unlabeled examples to similar class centroids using the predicted labels and determining the average distance of all assigned examples from the corresponding centroids, all examples within plus or minus 10% of the average distance are added to the training set with their corresponding predicted labels. Near-average unlabeled example selection is described in detail in Section 3.5.
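Relative to the closer-than-average sketch above, only the keep-mask changes; a sketch of the near-average rule over the same per-centroid distances might look like this.

    import numpy as np

    def near_average_mask(dist, nearest, band=0.10):
        """dist[i]: distance of example i to its assigned centroid;
        nearest[i]: index of that centroid. Keep examples whose distance
        falls within +/-10% of their group's average distance."""
        keep = np.zeros(len(dist), dtype=bool)
        for j in np.unique(nearest):
            grp = nearest == j
            avg = dist[grp].mean()
            keep[grp] = np.abs(dist[grp] - avg) <= band * avg
        return keep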


Table 4.16 shows the average initial supervised and final semi-supervised F-measure and class-weighted accuracy results. Table 4.17 shows the significance of each sampling rate over 30 randomly initialized trials. In all cases, the semi-supervised method resulted in a significant decrease in performance. This differs from closer than average selection, where only with 17 and 34 examples the F-measure became significantly worse, and with 10 and 25 examples it neither improved nor became worse.

Table 4.16. Average performance for 30 trials of the semi-supervised method with near average selection on the KDD data set using 10, 17, 25 and 34 labeled examples.

    Num   Init. FM   Final FM   Init. CWA   Final CWA
    10    0.7562     0.5396     0.8197      0.7077
    17    0.8990     0.7816     0.9288      0.8359
    25    0.9223     0.8909     0.9453      0.9116
    34    0.9248     0.8766     0.9417      0.8992

    FM - F-Measure, CWA - Class-Weighted Accuracy

Table 4.17. Significance of results for 30 trials of the semi-supervised method with near average selection on the KDD data set using 10, 17, 25 and 34 labeled examples.

    % of Avail. Lbl.   Examples   W-Score   Probability of H = H0   Improvement?
    0.0003%            10         48        < 0.01                  No
    0.0005%            17         26        < 0.01                  No
    0.00075%           25         118       < 0.01                  No
    0.001%             34         56        < 0.01                  No

4.3.3 Non-Identically Distributed Training Set

We also explored selecting initial training sets with different class distributions than the original set of labeled data. As discussed in Section 2.3, the KDD data set has 972,781 minority (attack) class examples and 3,925,650 majority (normal) class examples, which is approximately 80.14% majority examples. We explored using an initial majority percentage of between 10% and 90%, at 10% increments. By contrast, the test data was drawn uniformly from the available labeled data. We again selected training examples for labeling and inclusion based on distance to cluster centroids obtained from labeled data.

4.3.3.1 Closer Than Average

Looking at the average result for 30 trials using 10, 17, 25, and 34 examples as shown in Table 4.18, the training sets with 80% majority examples mirror the results using uniform random sampling of the training set, as expected. In all cases, more than 70% majority class in the training set results in very little or no improvement in the F-measure. However, for 60% majority, 10 and 17 examples show improvement, and for all majority percentages used below 60%, all four training set sizes show improvement after applying the semi-supervised method.


Table 4.18. Average F-measure for 30 trials of the semi-supervised method with closer than average selection on the KDD data set using 10, 17, 25 and 34 labeled examples with 10% to 90% majority class. End values greater than Start indicate an increase in the F-measure; smaller values indicate a decrease.

             0.0003%              0.0005%
    Maj.   Start     End        Start     End
    0.1    0.5832    0.5493     0.5870    0.6250
    0.2    0.6778    0.6963     0.7275    0.8027
    0.3    0.7662    0.8026     0.8348    0.8560
    0.4    0.8063    0.9023     0.8691    0.9122
    0.5    0.8105    0.8500     0.9042    0.9475
    0.6    0.8570    0.8893     0.9110    0.9112
    0.7    0.8305    0.8223     0.9149    0.9154
    0.75   --        --         0.9204    0.9117
    0.8    0.6662    0.6069     0.9002    0.8682
    0.9    0.4223    0.2846     --        --

             0.00075%             0.001%
    Maj.   Start     End        Start     End
    0.1    0.6775    0.6833     0.7381    0.8267
    0.2    0.8421    0.9015     0.8218    0.8829
    0.3    0.8643    0.9226     0.9288    0.9547
    0.4    0.9295    0.9616     0.9385    0.9672
    0.5    0.9282    0.9656     0.9566    0.9615
    0.6    0.9468    0.9485     0.9627    0.9593
    0.7    0.9403    0.9215     0.9528    0.9436
    0.75   --        --         0.9491    0.9396
    0.8    0.9139    0.9102     0.9410    0.9060
    0.9    0.7412    0.6895     --        --

Figure 4.11 presents the same results graphically. The x-axis represents the average initial supervised F-measure obtained using a random forest of 200 trees. The y-axis represents the average final semi-supervised F-measure obtained after applying the semi-supervised method to the stratified selection of examples. The line where y = x is the point where no improvement occurs. Many of the averages with 10% to 50% majority are above the line where the initial F-measure equals the final F-measure, that is, where the supervised F-measure equals the semi-supervised F-measure. The graph qualitatively illustrates that when the training data set is imbalanced away from the majority class, improvement appears possible by applying the semi-supervised method.
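A plot of this kind can be sketched as follows, here using the Start/End pairs from the 0.0003% column of Table 4.18.

    import matplotlib.pyplot as plt

    # (initial supervised, final semi-supervised) average F-measures for
    # the 0.0003% sampling rate, majority proportions 0.1-0.9 (Table 4.18).
    start = [0.5832, 0.6778, 0.7662, 0.8063, 0.8105, 0.8570, 0.8305, 0.6662, 0.4223]
    end   = [0.5493, 0.6963, 0.8026, 0.9023, 0.8500, 0.8893, 0.8223, 0.6069, 0.2846]

    plt.scatter(start, end)
    plt.plot([0, 1], [0, 1], linestyle="--")       # y = x: no change
    plt.xlabel("initial supervised F-measure")
    plt.ylabel("final semi-supervised F-measure")  # points above the line improved
    plt.show()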


Figure 4.11. Visualization of the average change in F-measure for each sample rate and majority proportion for closer than average selection for the KDD data set. Points above the line show rates where average improvement occurs.

To validate the statistical significance of these qualitative observations, the Wilcoxon Signed-Rank test was performed for each initial sample rate and majority proportion. Table 4.19 shows the results of the Wilcoxon test.

With only 10 examples, the trials with 30%, 40%, 50% and 60% majority class show statistical improvement after applying the semi-supervised method. With 10% and 90% majority class, there was a statistically significant decrease in the F-measure. For 20%, 70%, and 80% majority, there was no statistically significant increase or decrease in the F-measure. A similar pattern can be seen with 17 examples. With a training set consisting of only 10% majority examples, the results indicate no statistically significant increase or decrease in the F-measure, but for 20%, 30%, 40%, and 50% majority a statistically significant increase occurs as a result of using the semi-supervised method. For 60% and 70% the results were not statistically significant, and by 80% majority the F-measure becomes significantly worse.


Table 4.19. Wilcoxon Signed-Rank test results for N trials of the semi-supervised method with closer than average selection on the KDD data set using 10, 17, 25 and 34 labeled examples with between 10% and 90% majority class.

             0.0003%                              0.0005%
    Maj.   W      N    Prob         Better?    W      N    Prob         Better?
    0.1    98     28   0.007816     No         -220   30   > 0.1        No
    0.2    -181   30   > 0.1        No         -76    30   0.0003989    Yes
    0.3    -148   30   0.04203      Yes        -134   30   0.02133      Yes
    0.4    -25    28   8.42E-07     Yes        -89    30   0.001184     Yes
    0.5    -142   29   0.03178      Yes        -127   30   0.01466      Yes
    0.6    -152   30   0.0502       No         -163   28   > 0.1        No
    0.7    198    29   > 0.1        No         -219   30   > 0.1        No
    0.75   --     --   --           --         196    29   > 0.1        No
    0.8    171    30   > 0.1        No         139    30   0.02746      No
    0.9    100    30   0.002691     No         --     --   --           --

             0.00075%                             0.001%
    Maj.   W      N    Prob         Better?    W      N    Prob         Better?
    0.1    -158   30   0.06468      No         -72    30   0.0002774    Yes
    0.2    -90    30   0.00128      Yes        -147   30   0.04016      Yes
    0.3    -93    30   0.001611     Yes        -105   30   0.003806     Yes
    0.4    -85    30   0.0008593    Yes        -192   30   > 0.1        No
    0.5    -127   30   0.01466      Yes        -188   30   > 0.1        No
    0.6    -192   30   > 0.1        No         -228   30   > 0.1        No
    0.7    155    29   0.05709      No         -294   30   > 0.1        No
    0.75   --     --   --           --         187    30   > 0.1        No
    0.8    228    30   > 0.1        No         58     29   0.0001346    No
    0.9    -277   30   > 0.1        No         --     --   --           --

    W - Wilcoxon score, N - Trials

These results are visually represented in Figure 4.12. The 'improvement' axis indicates whether a significant increase, represented by a '1', or a significant decrease, represented by a '-1', occurs as a result of using the semi-supervised method. An 'improvement' value of '0' indicates no statistical significance. Selecting a low proportion of majority examples in the initial training set provides a statistically significant increase in the F-measure.


Figure 4.12. Visualization of the significance of the change in F-measure for each sample rate and majority proportion for closer than average selection for the KDD data set. Points at 1 indicate significant improvement; -1 indicates significant decrease in the F-measure. A point at 0 indicates no significant increase or decrease.


Table 4.20. Average F-measure for 30 trials of the semi-supervised method with near average selection on the KDD data set using 10, 17, 25 and 34 labeled examples with 10% to 90% majority class. End values greater than Start indicate an increase in the F-measure; smaller values indicate a decrease.

             0.0003%              0.0005%
    Maj.   Start     End        Start     End
    0.1    0.5643    0.5374     0.3311    0.3316
    0.2    0.6437    0.6858     0.3996    0.4013
    0.3    0.7336    0.8094     0.5985    0.6338
    0.4    0.8252    0.8805     0.8690    0.9122
    0.5    0.8755    0.9074     0.9041    0.9475
    0.6    0.8668    0.8840     0.9109    0.9112
    0.7    0.8425    0.7880     0.8866    0.8835
    0.8    0.7562    0.5396     0.6054    0.3868
    0.9    0.3328    0.1339     0.6806    0.5562

             0.00075%             0.001%
    Maj.   Start     End        Start     End
    0.1    0.6619    0.6961     0.4015    0.4020
    0.2    0.7876    0.8739     0.6340    0.6344
    0.3    0.9005    0.9536     0.6340    0.6338
    0.4    0.8963    0.9380     0.9385    0.9672
    0.5    0.9553    0.9599     0.9566    0.9615
    0.6    0.9344    0.9503     0.9627    0.9593
    0.7    0.9494    0.9456     0.9582    0.9277
    0.8    0.9223    0.8909     0.8348    0.6815
    0.9    0.7216    0.5262     0.9140    0.8978

4.3.3.2 Near Average

Near average unlabeled example selection was also compared to determine whether the semi-supervised method performed any better or worse under this selection method for the KDD data set with different majority proportions in the initial training set. The average F-measure results of near average selection are shown in Table 4.20. As with closer than average selection, low percentages of the majority class in the initial training set generally resulted in increases in the average F-measure, while high percentages of the majority class generally resulted in a decrease in the F-measure.

Once again visualizing the final F-measure as a function of the initial supervised F-measure, it appears that majority proportions of 20% to 60% result in increases in the F-measure from applying the semi-supervised method. When the majority proportion is 10% of the initial training set, it appears that often a decrease in the F-measure occurs, as seen in Figure 4.13. At or above 60% it appears that no increase occurs, and for 80% and 90%, the F-measure definitely appears to decrease.


Figure 4.13. Visualization of the average change in F-measure for each sample rate and majority proportion for near average selection for the KDD data set. Points above the line show rates where average improvement occurs.

By using the Wilcoxon Signed-Rank test for testing the change after applying the semi-supervised method, we see that the percentage of majority class in the initial labeled training data set appears to have an effect on the ability of the semi-supervised method to improve the F-measure (Table 4.21). For majority percentages of 50% or less, improvement does occur after applying the semi-supervised method, except for the case with only 10% majority class in the training set with only 10 or 17 examples. By 25 examples, with training made up of 10% to 50% of the majority class we see significant improvement. Since the initial majority percentage of the data is around 80%, we can see that improvement due to the semi-supervised method will not appear consistently with labeled training sets drawn uniformly randomly from the available labeled data.


Table 4.21. Wilcoxon Signed-Rank test results for N trials of the semi-supervised method with near average selection on the KDD data set using 10, 17, 25 and 34 labeled examples with between 10% and 90% majority class.

             0.0003%                              0.0005%
    Maj.   W      N    Prob         Better?    W      N    Prob         Better?
    0.1    168    30   0.09546      No         181    28   > 0.1        No
    0.2    -162   30   0.07594      No         -124   29   0.02156      Yes
    0.3    -117   30   0.008216     Yes        -110   30   0.005299     Yes
    0.4    -139   30   0.02746      Yes        -139   30   0.02746      Yes
    0.5    -137   29   0.04184      Yes        -132   30   0.01921      Yes
    0.6    -188   30   > 0.1        No         204    30   > 0.1        No
    0.7    109    29   0.008942     No         -233   30   > 0.1        No
    0.8    48     30   2.20E-05     No         26     30   9.96E-07     No
    0.9    55     30   4.95E-05     No         68     30   0.00019      No

             0.00075%                             0.001%
    Maj.   W      N    Prob         Better?    W      N    Prob         Better?
    0.1    -155   30   0.05709      No         -35    30   4.00E-06     Yes
    0.2    -74    30   0.0003333    Yes        -15    30   1.28E-07     Yes
    0.3    -57    30   6.17E-05     Yes        -15    30   1.28E-07     Yes
    0.4    -91    30   0.001383     Yes        -64    30   0.0001281    Yes
    0.5    -170   30   > 0.1        No         -160   30   0.07014      Yes
    0.6    -176   29   > 0.1        No         -254   30   > 0.1        No
    0.7    -186   30   > 0.1        No         85     30   0.0008593    No
    0.8    118    30   0.008727     No         56     30   5.53E-05     No
    0.9    67     30   0.0001725    No         102    29   0.005661     No

    W - Wilcoxon score, N - Trials

These results are visually represented in Figure 4.14. The 'improvement' axis indicates whether a significant increase, represented by a '1', or a significant decrease, represented by a '-1', occurs as a result of the semi-supervised method. An 'improvement' value of '0' indicates no statistical significance. As with closer than average selection, a low proportion of majority examples in the initial training set provides a statistically significant increase in the F-measure.

These results indicate that selecting a training set uniformly randomly from the available examples will not result in an increase in the F-measure for the KDD data regardless of the method by which unlabeled examples are selected for labeling.


Figure 4.14. Visualization of the significance of the change in F-measure for each sample rate and majority proportion for near average selection for the KDD data set. Points at 1 indicate significant improvement; -1 indicates significant decrease in the F-measure. A point at 0 indicates no significant increase or decrease.


Both selection methods yield poor results when the initial training set is drawn uniformly randomly from the available examples. Instead, a training set which favors examples of the minority class will result in a statistically significant increase in the F-measure.


CHAPTER 5

SUMMARY AND DISCUSSION

These experiments have shown that an increase in accuracy or F-measure is possible for the semi-supervised self-learning framework using cluster distance for unlabeled data selection. Section 4.1 showed that for the bolt data set, partitioned experiments were performed selecting examples closer than average to centroids for addition. The average F-measure from 30 trials went from 0.7411 up to 0.8447, a statistically significant improvement from using the semi-supervised method.

In Section 4.2, the can data set, when not partitioned, showed an increased F-measure as the training set size increased. When selecting initial training sets with 0.009% and 0.01% of the available examples, the average F-measure of 30 runs showed the semi-supervised method outperforming the supervised approach with the same underlying classifier. However, while these runs showed an increasing F-measure, they did not show a statistically significant increase. When class-stratified sampling was performed, it was found that using 90% majority class examples in the training set would create a significant increase in the F-measure after applying the semi-supervised method, regardless of the number of examples. This shows that an increase in the F-measure is possible with enough labeled training examples and, although statistical significance was not observed, presumably a statistically significant increase would appear as the number of labeled examples increases. Another method of finding a statistically significant increase is to artificially create imbalance in the training set using stratified sampling, introducing bias into the classifier.

With partitioning, again no significance was seen when sampling randomly from available examples. When performing stratified sampling, significance was seen for the trials where the initial training set had more than 15 examples and more than 60% majority examples, or when the training set had more than 25 examples and less than 30% majority.


Seeing improvement with low and high majority class percentages is unexpected in this case, because "majority" is a misnomer - the class distribution is about equal.

Looking at both the partitioned and non-partitioned results, an interesting trend appears. With equally frequent classes, one would not normally be concerned with training set imbalance. In this case, introducing artificial imbalance toward one class or the other appears to cause a significant increase in the F-measure after applying the semi-supervised method. Stratified sampling should cause the built model to include some amount of bias introduced by the class imbalance. It could be that the application of the semi-supervised method may help to overcome this bias through the repeated introduction of new examples from the true, balanced data distribution. Looking at the can results for both partitioned and non-partitioned stratified sampling average performance (Tables 4.6 and 4.10), we see that, with a few minor exceptions, the initial supervised performance with 50% majority class examples is usually higher than the initial supervised performance when the training set is imbalanced. In the case of imbalanced training sets, it could be that the semi-supervised method has more room for improvement by correcting the bias introduced by the imbalance in the original training set.

Finally, in Section 4.3, for the highly imbalanced KDD data set, both closer than average and near average selection showed the same general result. When sampling a training set uniformly randomly, no significance is apparent. However, when sampling more minority examples than majority examples, improvement in the F-measure is possible. This reflects the anticipated result from using stratified sampling. The expectation is that in the face of class imbalance, introducing more minority class examples into the training set can combat the inherent imbalance. It has been shown that if a supervised model is used in an imbalanced environment, introducing imbalance into the training set by stratified sampling can help improve performance; this supports the idea that a semi-supervised method can be used in an imbalanced environment using the same set of training set imbalancing techniques.


For bolt, no alteration of the training set was required to see an improvement in the F-measure. For can and KDD, altering the balance of classes in the training set increased performance. Additionally, for a given number of examples, there existed a ratio of majority to minority class examples where the mean F-measure was greater than it was without using the semi-supervised method, regardless of the ratio of majority to minority class examples. For instance, with the KDD data set using near average selection (Section 4.3.3.2), Table 4.20 shows that the best average F-measure when using 0.001% of the available training data, 0.9672, occurred after applying the semi-supervised method with a training set containing 40% majority class examples. The best average performance when not using the semi-supervised method occurred with a training set with 60% majority class examples, achieving an average F-measure of 0.9627. This singular case is representative of all cases: the semi-supervised method, when used with training sets with varying proportions of majority examples, results in an average F-measure which is greater than without using the semi-supervised method.

Since using the semi-supervised method results in the same or higher average F-measure, the only remaining question is how to determine the percentage of majority examples to use in the training set, and how to determine this percentage while keeping the number of needed examples low. It appears that at lower numbers of labeled examples, the majority percentages that cause improvement remain the same as training set size increases. For example, using a selection of the non-partitioned can data set with 90% majority class examples causes a statistically significant increase in performance for all numbers of examples, and the average F-measure is greater at 90% majority class examples than at all other percentages. For the KDD data set using closer than average unlabeled example selection, with 10 examples the best percentage of majority class examples was 40%, for 17 examples it was 50%, and for 25 and 34 examples it was 40% majority class examples. In all cases, the best proportion was around 40%-50%.

Based on this, it appears that to find the best training set majority percentage, one would start with the lowest number of examples possible, trying all possible ratios of majority to minority examples in the training set. The number of examples can then be increased until a statistically significant improvement is seen on a validation set. At that point, it appears that the percentage of majority examples with the best performance will remain the percentage with the best performance as the number of examples increases. Using a validation set, the best majority percentage can be found and used while increasing the number of labeled examples until the model performs adequately or until the cost of obtaining labeled examples becomes prohibitive. We assume that experts will manually assign a class to each example, a process which is both time-consuming and tedious. Because of this, the cost (time and difficulty) of obtaining labeled examples is high, making the use of a semi-supervised approach attractive because it allows for fewer labeled examples to be used.
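A sketch of this search procedure follows; `evaluate` is a hypothetical callback returning the validation-set F-measure of the semi-supervised method for a given seed size and majority fraction, and the fixed score gap is a crude stand-in for a proper significance test.

    def find_best_majority(evaluate, fractions=(0.1, 0.3, 0.5, 0.7, 0.9),
                           start_n=8, max_n=128, gap=0.01):
        """Grow the labeled set until one majority fraction is clearly best."""
        n, best = start_n, fractions[0]
        while n <= max_n:
            scores = {f: evaluate(n, f) for f in fractions}
            best = max(scores, key=scores.get)
            runner_up = sorted(scores.values())[-2]
            if scores[best] - runner_up > gap:  # stand-in for a Wilcoxon test
                break                           # fix this fraction from here on
            n *= 2                              # label more examples and retry
        return best, n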


5.1 Future Work

This work has many areas for possible expansion and refinement. For instance, in all cases 25 centroids were used. In some cases, sampling causes a very small number of examples to be in the training data. In these cases, it is not completely clear what effect multiple coincident centroids would have on unlabeled example selection. Presumably, unlabeled examples showing feature space similarity would still be selected, but whether more or fewer are selected due to multiple close centroids is an open question. Another interesting path would be comparing a variety of base classifiers to see the effect on performance.

Another extension would be to increase the range of initial labeled data. In the case of can, the upper initial training set sizes explored were showing increased significance with uniform random sampling. Likewise, the stratified sampling results showed developing regions of significance for combinations of majority percentage and training set size. Expanding the sample sizes would give a clearer picture of the total effect of the combination of majority percentage and number of initial labeled examples on performance after applying the semi-supervised method.


The KDD data set could also be explored from a cost-performance perspective. The KDD data set has multiple attack type classes that were combined into one single 'attack' class, and each type of attack has a relative cost associated with detection.


REFERENCES

[1] John N. Korecki, Robert E. Banfield, Lawrence O. Hall, Kevin W. Bowyer, and Philip Kegelmeyer. Semi-supervised learning on large complex simulations. In IEEE 19th International Conference on Pattern Recognition, 2008.

[2] O. Chapelle, B. Scholkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.

[3] Xiaojin Zhu. Semi-supervised learning literature survey. Technical report, Computer Sciences, University of Wisconsin-Madison, 2007.

[4] Bernard Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155-171, 1994.

[5] David Elworthy. Does Baum-Welch re-estimation help taggers? In Fourth ACL Conference on Applied Natural Language Processing, pages 53-58, 1994.

[6] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, 1995.

[7] Li Wei and Eamon Keogh. Semi-supervised time series classification. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 748-753, New York, NY, USA, 2006. ACM.

[8] Nitesh V. Chawla, David A. Cieslak, Lawrence O. Hall, and Ajay Joshi. Automatically countering imbalance and its empirical relationship to cost. Data Mining and Knowledge Discovery, 17:225-252, 2008.


[9] James J. Heckman. Sample selection bias as a specification error. Econometrica, 47:153-161, January 1979.

[10] Gary M. Weiss and Foster Provost. Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19:315-354, 2003.

[11] Nitesh V. Chawla and Grigoris Karakoulas. Learning from labeled and unlabeled data: an empirical study across techniques and domains. Journal of Artificial Intelligence Research, 23:331-366, 2005.

[12] B. M. Shahshahani and D. A. Landgrebe. The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. Geoscience and Remote Sensing, IEEE Transactions on, 32:1087-1095, September 1994.

[13] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. In Application of Computer Vision, 2005. WACV/MOTIONS '05 Volume 1. Seventh IEEE Workshops on, 2005.

[14] Lawrence O. Hall, D. Bhadoria, and Kevin W. Bowyer. Learning a model from spatially disjoint data. In 2004 IEEE International Conference on Systems, Man, and Cybernetics, volume 2, pages 1447-1451, October 2004.

[15] Nitesh V. Chawla, Lawrence O. Hall, Kevin W. Bowyer, and W. Philip Kegelmeyer. Learning ensembles from bites: A scalable and accurate approach. Journal of Machine Learning Research, 5:421-451, 2004.

[16] Larry Shoemaker, Robert Banfield, Lawrence O. Hall, W. Philip Kegelmeyer, and Kevin W. Bowyer. Detecting and ordering salient regions for efficient browsing. In The 19th International Conference on Pattern Recognition, 2008.

[17] Larry Shoemaker, Lawrence O. Hall, Kevin W. Bowyer, and W. Philip Kegelmeyer. Using classifier ensembles to label spatially disjoint data. Information Fusion, 8:120-133, 2008.


[18] Larry Shoemaker, Robert E. Banfield, Lawrence O. Hall, Kevin W. Bowyer, and W. Philip Kegelmeyer. Learning to predict salient regions from disjoint and skewed training sets. In ICTAI '06: Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence, pages 116-126, Washington, DC, USA, 2006. IEEE Computer Society.

[19] KDD Cup 1999 Data Set. UCI machine learning repository. http://kdd.ics.uci.edu/databases/kddcup99, 1999.

[20] M. Ichino and H. Yaguchi. Generalized Minkowski metrics for mixed feature-type data analysis. Systems, Man and Cybernetics, IEEE Transactions on, 24:698-708, April 1994.

[21] N. Otsu. A threshold selection method from gray level histograms. IEEE Transactions on Systems, Man and Cybernetics, 9:62-66, 1979.

[22] Robert E. Banfield, Lawrence O. Hall, Kevin W. Bowyer, Divya Bhadoria, W. Philip Kegelmeyer, and Steven Eschrich. A comparison of ensemble creation techniques. Multiple Classifier Systems, pages 223-232, 2004.

[23] Robert E. Banfield. OpenDT. http://opendt.sourceforge.net, 2005.

[24] Leo Breiman. Random forests. Machine Learning, 45:5-32, 2001.

[25] Prodip Hore, Lawrence Hall, and Dmitry B. Goldgof. Single pass fuzzy C means. In 2007 IEEE International Conference on Fuzzy Systems, July 2007.

[26] X. L. Xie and G. Beni. A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:841-847, 1991.

[27] Robert E. Banfield, Lawrence O. Hall, Kevin W. Bowyer, and W. Philip Kegelmeyer. Ensembles of classifiers from spatially disjoint data. In Sixth International Conference on Multiple Classifier Systems, pages 196-205, 2005.

