USF Libraries
USF Digital Collections

Ensemble learning with imbalanced data


Material Information

Title:
Ensemble learning with imbalanced data
Physical Description:
Book
Language:
English
Creator:
Shoemaker, Larry
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:
2010
Subjects

Subjects / Keywords:
Random Forest
Saliency
Probabilistic Voting
Imbalanced Training Data
Lift
Dissertations, Academic -- Engineering Computer Science -- Doctoral -- USF   ( lcsh )
Genre:
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: We describe an ensemble approach to learning salient spatial regions from arbitrarily partitioned simulation data. Ensemble approaches for anomaly detection are also explored. The partitioning comes from the distributed processing requirements of large-scale simulations. The volume of the data is such that classifiers can train only on data local to a given partition. Since the data partition reflects the needs of the simulation, the class statistics can vary from partition to partition. Some classes will likely be missing from some or even most partitions. We combine a fast ensemble learning algorithm with scaled probabilistic majority voting in order to learn an accurate classifier from such data. Since some simulations are difficult to model without a considerable number of false positive errors, and since we are essentially building a search engine for simulation data, we order predicted regions to increase the likelihood that most of the top-ranked predictions are correct (salient). Results from simulation runs of a canister being torn and from a casing being dropped show that regions of interest are successfully identified in spite of the class imbalance in the individual training sets. Lift curve analysis shows that the use of data driven ordering methods provides a statistically significant improvement over the use of the default, natural time step ordering. Significant time is saved for the end user by allowing an improved focus on areas of interest without the need to conventionally search all of the data. We have also found that using random forests weighted and distance-based outlier ensemble methods for supervised learning of anomaly detection provide significant accuracy improvements when compared to existing methods on the same dataset. Further, distance-based outlier and local outlier factor ensemble methods for unsupervised learning of anomaly detection also compare favorably to existing methods.
Thesis:
Dissertation (Ph.D.)--University of South Florida, 2010.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by Larry Shoemaker.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains X pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E14-SFE0004763
usfldc handle - e14.4763
System ID:
SFS0028055:00001




Full Text


Ensemble Learning With Imbalanced Data

by

Larry Shoemaker

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Lawrence O. Hall, Ph.D.
Dmitry B. Goldgof, Ph.D.
Sudeep Sarkar, Ph.D.
Kevin W. Bowyer, Ph.D.

Date of Approval: September 20, 2010

Keywords: Random Forest, Saliency, Probabilistic Voting, Imbalanced Training Data, Lift

Copyright © 2010, Larry Shoemaker

DEDICATION

To my family and friends

ACKNOWLEDGEMENTS

I would like to thank Dr. Lawrence O. Hall for his guidance, feedback, and enthusiasm. I would also like to thank Dr. Kevin W. Bowyer, Dr. W. Philip Kegelmeyer, Dr. Dmitry B. Goldgof and Dr. Sudeep Sarkar for their valuable input and feedback.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
    1.1 Motivation
    1.2 Contributions
    1.3 Organization
CHAPTER 2 ENSEMBLE LEARNING
    2.1 Background and Related Work
    2.2 General Methods
        2.2.1 Bagging
        2.2.2 Boosting
    2.3 Decision Trees and Random Forests
    2.4 Anomaly Detection
CHAPTER 3 DATASETS
    3.1 Canister Tear Simulation
        3.1.1 Physical and Spatial Characteristics
        3.1.2 Train and Test Sets
    3.2 Casing Simulation
        3.2.1 Physical and Spatial Characteristics
        3.2.2 Train and Test Sets
    3.3 Modified KDD Cup 1999
CHAPTER 4 SIMULATION ENSEMBLE EXPERIMENTS
    4.1 Predicting and Ordering Salient Regions
    4.2 Experiments
    4.3 Evaluation Metrics
    4.4 Casing Simulation Regional Results
    4.5 Canister Tear Simulation Regional Results
    4.6 Labeling Noise Results

    4.7 Voting Method Results
    4.8 Statistical Significance Results
CHAPTER 5 ANOMALY ENSEMBLE EXPERIMENTS
    5.1 Experiments
    5.2 Evaluation Metrics
    5.3 Results
CHAPTER 6 CONCLUSIONS
    6.1 Contributions
REFERENCES
ABOUT THE AUTHOR

LIST OF TABLES

Table 3.1 Physical and spatial characteristics for the canister tear simulation runs.
Table 3.2 Feature ranges for the canister tear data in runs 1 and 2.
Table 3.3 Feature ranges for the canister tear data in runs 3 and 4.
Table 3.4 Salient class statistics by partition for the canister tear simulation runs.
Table 3.5 Physical and spatial characteristics for the casing simulation.
Table 3.6 Feature ranges for the casing simulation.
Table 3.7 Partitioning characteristics for the casing simulation.
Table 3.8 Modified KDD Cup 1999 dataset characteristics.
Table 4.1 Casing regional results evaluated with 10% and 50% overlap thresholds.
Table 4.2 Canister tear DT regional results evaluated with 10% overlap threshold.
Table 4.3 Canister tear RF unweighted regional results evaluated with 10% overlap threshold.
Table 4.4 Canister tear RFW regional results evaluated with 10% overlap threshold.
Table 4.5 Casing simulation RFW bolt labeling noise results evaluated with 10% overlap threshold.
Table 4.6 Casing simulation RF bolt labeling noise results evaluated with 10% overlap threshold.
Table 4.7 Casing simulation DT bolt labeling noise results evaluated with 10% overlap threshold.

Table 4.8 Casing simulation SDT bolt labeling noise results evaluated with 10% overlap threshold.
Table 4.9 Canister tear average regional results using only two-class partitions and evaluated with 10% overlap threshold.
Table 4.10 Canister tear average regional results evaluated with 10% overlap threshold.
Table 4.11 Casing regional results evaluated with 10% overlap threshold.
Table 4.12 Statistically significant results across casing and canister tear simulation experiments using 10% overlap threshold.
Table 5.1 Modified KDD Cup 1999 average train/test ROC AUC results.
Table 5.2 Modified KDD Cup 1999 average test ROC AUC results.

LIST OF FIGURES

Figure 1.1 Visualizations of the casing and canister tear simulations.
Figure 3.1 A visualization of the 14 canister tear simulation partitions, with the tear area (seen in later time steps) inside the white outline.
Figure 3.2 Training time step in the canister tear simulation run 1, run 2, run 3, and run 4 (left to right).
Figure 3.3 Final time step in the canister tear simulation run 1, run 2, run 3, and run 4 (left to right).
Figure 3.4 A visualization of the 12 casing simulation partitions.
Figure 4.1 A visualization of the casing cumulative lift curve for the model trained using random forests unweighted ensembles and evaluated with 10% overlap threshold.
Figure 4.2 A visualization of the casing cumulative lift curves for the models trained using single decision tree, and random forests unweighted and weighted ensembles, evaluated with 10% overlap thresholds.
Figure 4.3 A visualization of the casing cumulative lift curves for the model trained using random forests unweighted ensembles, evaluated with 10%, 50%, and 80% overlap thresholds.
Figure 4.4 Casing simulation surface view of time steps 7 to 20 with predicted salient (bolt) regions for the RF model trained on time steps 0 to 6 partitioned data.
Figure 4.5 Casing simulation alternate surface view of time steps 7 to 20 with predicted salient (bolt) regions for the RF model trained on time steps 0 to 6 partitioned data.

Figure 4.6 A visualization of the canister tear cumulative lift curves for the models trained using RFW ensembles and evaluated with 10% overlap threshold on canister tear run 1.
Figure 4.7 A visualization of the canister tear cumulative lift curves for the models trained using RFW ensembles and evaluated with 10% overlap threshold on canister tear run 2.
Figure 4.8 A visualization of the canister tear cumulative lift curves for the models trained using RFW ensembles and evaluated with 10% overlap threshold on canister tear run 3.
Figure 4.9 A visualization of the canister tear cumulative lift curves for the models trained using RFW ensembles and evaluated with 10% overlap threshold on canister tear run 4.
Figure 4.10 Canister tear run 4 (time steps 1 to 6) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.
Figure 4.11 Canister tear run 4 (time steps 7 to 10) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.
Figure 4.12 Canister tear run 4 (time steps 11 to 14) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.
Figure 4.13 Canister tear run 4 (time steps 15 to 17) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.
Figure 4.14 Canister tear run 4 (time steps 18 to 20) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.
Figure 4.15 Canister tear run 4 (time steps 21 to 23) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.
Figure 4.16 Canister tear run 4 (time steps 24 to 26) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.

Figure 4.17 Canister tear run 4 (time steps 27 to 28) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.
Figure 4.18 Canister tear run 4 (time steps 29 to 30) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.
Figure 4.19 Eleven-point interpolated precision-recall graph averaged across 16 canister tear train/test combinations using RFW ensembles and all four runs.
Figure 4.20 Eleven-point interpolated precision-recall graph averaged across 16 canister tear train/test combinations using RF ensembles and all four runs.
Figure 4.21 Eleven-point interpolated precision-recall graph averaged across 16 canister tear train/test combinations using DT ensembles and all four runs.
Figure 4.22 Eleven-point interpolated precision-recall graph averaged across nine canister tear train/test combinations using RFW ensembles and runs two to four.
Figure 4.23 Eleven-point interpolated precision-recall graph averaged across seven canister tear train/test combinations using RFW ensembles and run one for training and/or testing.
Figure 4.24 Bolt labeling noise of 0%, 1%, 5%, 10%, 15%, and 20% shown for a typical bolt.
Figure 4.25 Non-bolt labeling noise of 0%, 1%, 5%, 10%, 15%, and 20% shown for a typical bolt.
Figure 4.26 Casing simulation bolt and non-bolt noise cumulative lift curves using RFW.
Figure 4.27 Casing simulation bolt and non-bolt noise cumulative lift curves using RF.
Figure 4.28 Casing simulation bolt and non-bolt noise cumulative lift curves using DT.
Figure 4.29 Casing simulation bolt and non-bolt noise cumulative lift curves using SDT.

Figure 5.1 Modified KDD Cup 1999 ROC curves for RFW and DBO using no partitioning on the training group.
Figure 5.2 Modified KDD Cup 1999 ROC curves for RFW and DBO using 20 partitions on the training group.
Figure 5.3 Modified KDD Cup 1999 ROC curves for DBO using no partitioning, two, and 20 partitions on the test group.
Figure 5.4 Modified KDD Cup 1999 ROC curves for LOF using 20 partitions on the test group.

ABSTRACT

We describe an ensemble approach to learning salient spatial regions from arbitrarily partitioned simulation data. Ensemble approaches for anomaly detection are also explored. The partitioning comes from the distributed processing requirements of large-scale simulations. The volume of the data is such that classifiers can train only on data local to a given partition. Since the data partition reflects the needs of the simulation, the class statistics can vary from partition to partition. Some classes will likely be missing from some or even most partitions. We combine a fast ensemble learning algorithm with scaled probabilistic majority voting in order to learn an accurate classifier from such data. Since some simulations are difficult to model without a considerable number of false positive errors, and since we are essentially building a search engine for simulation data, we order predicted regions to increase the likelihood that most of the top-ranked predictions are correct (salient). Results from simulation runs of a canister being torn and from a casing being dropped show that regions of interest are successfully identified in spite of the class imbalance in the individual training sets. Lift curve analysis shows that the use of data driven ordering methods provides a statistically significant improvement over the use of the default, natural time step ordering. Significant time is saved for the end user by allowing an improved focus on areas of interest without the need to conventionally search all of the data. We have also found that using random forests weighted and distance-based outlier ensemble methods for supervised learning of anomaly detection provide significant accuracy improvements when compared to existing methods on the same dataset. Further, distance-based outlier and local outlier factor ensemble methods for unsupervised learning of anomaly detection also compare favorably to existing methods.


CHAPTER 1

INTRODUCTION

1.1 Motivation

We consider the problem of dealing with datasets too large to fit in the memory of any one computer and too bandwidth-intensive to move between neighboring computers [1]. Such problems exist in the United States Department of Energy's Advanced Simulation and Computing (ASC) program [2, 3], wherein a supercomputer simulates a hypothetical real-world event. The simulation data is partitioned and distributed across separate disks, to facilitate parallel computation. It may be many terabytes to petabytes in size. The current state of the art is that developers spend time manually browsing for anomalies in order to develop the simulation, and domain experts spend similar time looking for salient events. We want to create a tool to let them manually mark a small number of examples and then automatically flag found examples throughout the rest of the dataset, or similar datasets for directed browsing. Because there will be false positives, we want to present predicted positives to the user in an order that increases the chances of true positives being presented early.

As a result of partitioning, the points of interest, or "salience", in some partitions may be limited to only a few nodes. Salient points, being few in number, exhibit a pathological minority class classification problem. The problems associated with imbalanced datasets and various strategies for dealing with those problems are described in [4, 5].

Techniques include various forms of undersampling and oversampling [6], and cost-sensitive learning methods [7]. In the case of a partition having zero salient points, a single-class "classifier" will be learned. This motivates a scaling adjustment to the voting scheme used in [8, 9], and developed in [10] to improve accuracy, as shown in Chapter 4.1. Facial region recognition experiments and analysis of nodal as well as regional accuracy were also included in [9]. A different, smaller dataset with only four partitions was used in [8, 9]. We first used the ordering of salient regions and the use of lift quality to measure the quality of the ordering for the casing simulation only [10].

In this dissertation, we give examples of learning from four simulation runs of a canister being torn, and from one run of a casing being dropped. The material in our previous journal article [37] forms a foundation for this research, and has been included in this dissertation. These are relatively small simulations, used in initial investigations in the ASC program, but large enough to show the utility and advantages of our approach. Visualizations of a casing being dropped and of a canister tear simulation model are shown in Figure 1.1. We have evaluated how well our approach can detect connected groups of salient nodes. Also, we have measured the quality of our ordering of salient region predictions, as discussed in Chapter 4.1. We show that it is possible to obtain an accurate prediction and a useful ordering of salient regions, even when the data is broken up arbitrarily in 3D space with no particular relation to feature space. Results on the canister tear and casing datasets indicate that experts working with much larger simulations can benefit from the predictive guidance obtained from only a small amount of relevant data.

As a final piece of evidence for the utility of this approach, one of the experts in this field, W. Philip Kegelmeyer, assisted with a real-world example of a much larger simulation that involved 162 runs of 876 gigabytes of data in each run.

(a) Casing simulation (b) Canister tear simulation

Figure 1.1: Visualizations of the casing and canister tear simulations. For the casing simulation, ground truth salient (bolt) regions are the smaller red (in color) or darker (in grayscale) regions.

The original data is classified. The structural safety simulation was a very complicated model with many layers that developed cracks and tears, which sometimes constituted a breach all the way from the outside to the inside of the model. After 180 man hours were spent manually marking only the tears in 12 runs, a faster approach was needed. A distributed classification approach similar to that proposed in this dissertation was used to train on the 12 marked runs and test the remaining 150 runs to find all tears and evaluate breaches in a cumulative 75 man hours [37].

Ensemble approaches for anomaly detection (not region or simulation based) are also explored and compared to existing approaches. We have found that using random forests weighted and distance-based outlier ensemble methods for supervised learning of anomaly detection provide significant accuracy improvements when compared to existing methods on the same dataset.

Further, distance-based outlier and local outlier factor ensemble methods for unsupervised learning of anomaly detection also compare favorably to existing methods.

1.2 Contributions

The main contributions of this dissertation include an approach that shows how to order predicted regions by most likely to be interesting/relevant for large partitioned, spatially disjoint datasets where the individual predictions are not as important as finding/predicting regions. We show that the method works even when data is distributed in ways that may not be helpful to the learning algorithms. Our methods might be adapted to present regions in medical images from distributed data by order of likely importance. Another application might be to recognize supernovae regions in astronomy. Our approach is limited to applications with the above characteristics, where large arbitrary partitions of data include a relatively small minority class and where region labeling noise in training is typically less than 10%. A second area of contribution is to demonstrate how using different ensemble approaches to anomaly detection can improve the results of some of the existing approaches.

1.3 Organization

The rest of this dissertation is organized as follows: in Chapter 2, background, related work, and types of ensemble learning and anomaly detection are discussed; Chapter 3 includes the canister tear, casing simulation, and KDD Cup 1999 modified dataset characteristics and train/test set organization; Chapter 4 includes experiments, evaluation metrics, and results for the canister tear and casing simulations; Chapter 5 includes experiments, evaluation metrics, and results for the KDD Cup 1999 modified dataset; Chapter 6 summarizes the dissertation and its contributions.


CHAPTER 2

ENSEMBLE LEARNING

In this chapter, background, related work, and types of ensemble learning and anomaly detection are discussed.

2.1 Background and Related Work

We discuss the most relevant research in the related areas of incremental learning, distributed learning, and ranking problems. Incremental learning [11-14], where the model changes as training data becomes available over time, provides a potential approach for creating a model from a very large training dataset. The model could be built on one set of data and then moved to another processor for continued learning on a second set of data, etc. Incremental learning models that require the storage of previous training examples, such as instance-based learning approaches [12], and decision tree approaches [15] are time consuming for very large datasets. Also, we could find no work evaluating their performance on very large datasets. Alternatively, data mining of streaming data [16, 17] has been developed precisely for endless streams of data. The datasets considered in this dissertation could be treated as a stream, although they lack a natural ordering principle. Our empirical experiments show statistically different results depending on how the partitions are ordered [37].

There are distributed learning algorithms, such as distributed boosting [18], that could be applied to this problem. The authors of [19] evaluate several distributed boosting algorithms, one of which deals specifically with learning from homogeneous distributions of data scattered across different sites. They consider the problem from the standpoint of data privacy, where data examples may not be propagated to other computers. In this algorithm, they compute statistics on the data such as mean and covariance in order to calculate the Mahalanobis distance between sites. Sites containing similar distributions employ the authors' distributed boosting algorithm, while those without similarity use standard boosting.

In the distributed boosting algorithm, a boosted classifier was built in each partition and broadcast to the other partitions. Using this ensemble of classifiers, the weight of each example was updated. A global weight array stores the sum of the updated weights for each individual site, thus providing information on how difficult it is to learn at any one site, and weighting that partition accordingly for the next iteration. The authors showed that this algorithm was at least as accurate as standard boosting on the centralized database. The only spatially disjoint sets used in [19] were two very small synthetic datasets with three equal size classes, two physical dimensions, and no time dimension. In contrast, our much larger datasets consist of physics simulations of real world events with unequal size classes, three physical dimensions, a time dimension, and different partition schemes that present unique data mining challenges.

Distributed learning models have been shown to be able to provide classification performance that is competitive with that obtained on all of the data [20]. There is some work that indicates it is possible to do effective distributed learning with cost sensitive data [21]. Further, any approach that builds independent classifiers or models and combines them could potentially be applied [22]. Of the work discussed here, only in [19] were spatially disjoint datasets used, with significant differences from our work as mentioned above. In addition, we have developed smoothing and thresholding methods to obtain regional predictions.

Many variants of ranking problems in machine learning exist. For instance, documents in information retrieval are ranked according to their probability of being relevant to a query. Another example is a personalized email filter that assigns priorities to unread mail. One approach is to learn an order of items based on a pairwise score function [23]. Bucket orders, i.e., total orders with ties, are considered in [24]. The Spearman rank correlation metric is minimized by using a simple weighted voting procedure [25], and by active learning of label ranking functions [26]. The authors of [27] use an order consistency metric to measure how identical the predicted order is to the true order of recommendation on item graphs. Since we are concerned only with whether our predictions satisfy an overlap requirement, we use lift quality as developed for database marketing [28, 29].

2.2 General Methods

Ensemble methods improve class accuracy by combining the predictions of multiple classifiers. Requirements for improved accuracy include having independent (or only slightly correlated) base classifiers that perform better than random guessing [30]. One method of constructing ensembles of classifiers is to manipulate the training set by creating multiple training sets by resampling the original dataset. Two examples of this technique are bagging and boosting, which are described in Chapters 2.2.1 and 2.2.2. A second ensemble creation technique manipulates the input features. An example of this type is random forests, which also uses bagging and is described in Chapter 2.3.

A third method, for data with a large number of classes, manipulates the class labels by randomly partitioning the class labels into two disjoint subsets and using the relabeled examples to train a base classifier. This process is repeated to create an ensemble of base classifiers. For each test example, each base classifier predicts 0 or 1, and all classes belonging to the winning subset receive a vote. The votes are tallied and the class that receives the highest vote is assigned to the test example. An example of this technique is the error-correcting output coding method. A fourth ensemble creation method manipulates the learning method by changing the network topology or initial weights in artificial neural networks or by injecting randomness into the decision tree-growing procedure [30]. While the first three ensemble creation methods can be applied to any classifier, the fourth depends on the classifier type. Base classifiers that are sensitive to minor perturbations in the training set are known as unstable classifiers and include decision trees, rule-based classifiers, and artificial neural networks. Ensemble methods with such classifiers tend to improve accuracy. Each test example is classified by taking a majority vote on predictions of base classifiers or by weighting each prediction with the base classifier's accuracy.

2.2.1 Bagging

Bagging or bootstrap aggregating repeatedly samples with replacement from a dataset using a uniform probability distribution [30]. Each bag has the same size as the original dataset, with some examples appearing more than once and some not at all in each bag. About 63% of the original dataset appears in each bag on average. An unstable base classifier such as a decision tree is trained on each bag of data. For each test example, the class label predicted by a majority vote of the base classifiers is assigned to that example. The reduction in variance of the base classifier accounts for the improvement in generalization error [30].
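To make the bagging procedure concrete, here is a minimal Python sketch (not code from the dissertation; a scikit-learn decision tree is assumed as the unstable base classifier, and all variable names are illustrative):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier  # assumed base learner

def train_bagged_trees(X, y, n_bags=25, seed=0):
    """Train one decision tree per bootstrap bag of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_bags):
        # sample n indices WITH replacement: each bag is the size of the
        # original dataset, and ~63% of distinct examples appear per bag
        idx = rng.integers(0, n, size=n)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def predict_majority(models, X_test):
    """Assign each test example the majority vote of the base classifiers."""
    votes = np.array([m.predict(X_test) for m in models])  # (n_bags, n_test)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```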

2.2.2 Boosting

Boosting is an iterative process that increases the focus of base classifiers on examples that are difficult to classify [30]. Boosting does this by assigning a weight to each training example and adaptively changing this weight at the end of each boosting round. These weights can be used to draw a set of bootstrap examples from the original data, or they can be used by the base classifier to learn a model that favors higher-weight examples.

2.3 Decision Trees and Random Forests

A decision tree classifier is a commonly used classification technique with three types of nodes. They include a root node at the top, internal nodes, and leaf or terminal nodes [30]. The root and internal nodes contain attribute test conditions to separate records that have different characteristics. Each leaf node is assigned a class label. Once the decision tree has been constructed, a test example is classified by beginning at the root node, applying the test condition to the record, and following the branch that satisfies the test. This will either lead to an internal node with a new test condition for selecting another branch or to a leaf node with a class label associated with that node. When a leaf node is reached, its class label is then assigned to the test record [30].

Building an optimal decision tree for large real-world datasets is not computationally feasible due to the exponential search space size. However, efficient algorithms have been developed to induce a reasonably accurate decision tree in a reasonable time. These usually employ a greedy strategy of making the decision about which attribute to use for partitioning the data as the tree is built. One such algorithm is Hunt's algorithm, in which a decision tree is grown recursively by partitioning the training records into successively purer subsets [30].

The first step consists of creating a leaf node with the class of the training records associated with that node, if they all have the same class. The second step, in the case when all the training records associated with a node contain records that belong to more than one class, is to select an attribute test condition to partition the records into smaller subsets. Each outcome of the test condition generates a child node, and the training records are distributed to the child nodes based on the outcomes. These two steps are sufficient for cases when all attribute combinations are present in the training data and each unique attribute combination has a unique class label. Since this is not necessarily the case, there are two additional cases that must be addressed. First, any child node created in the second step above may be empty because none of the training records match the combination of attributes associated with that node. In this case, the child node is declared a leaf node with the class label of the majority class of training records associated with its parent. Second, if all the records associated with a node have identical attribute values but not the same class label, then that node is declared a leaf node with the class label of the majority of the records associated with the node [30].

The two main issues that a decision tree induction algorithm must address are deciding how the training records should be split and when the splitting process should end. For the first issue, the algorithm must specify the test condition for different attribute types and provide an objective measure of the test condition's quality. An attribute is an object's property or characteristic that can vary among objects or times. Different types of attributes include binary, nominal, ordinal, and continuous. Binary attributes have two possible values or outcomes, while nominal attributes have three or more. Ordinal attributes can produce binary or multiple splits as long as the grouping of attribute values preserves the order property of the values.

For instance, grouping good and best together without including better is not allowed. Unlike the first three categorical or qualitative types of attributes, continuous attributes have a quantitative or numeric nature. The test condition for continuous attributes can be expressed as a comparison test with either binary outcomes (two ranges) or multiple outcomes (more than two ranges) [30].

Different measures of test condition quality include entropy, Gini, classification error, and information gain. They deal with the class distribution of the records before and after splitting. If p(i|t) denotes the fraction of records belonging to class i at a given node t, the following measures for selecting the best split are:

    Entropy(t) = -\sum_{i=0}^{c-1} p(i|t) \log_2 p(i|t)    (2.1)

    Gini(t) = 1 - \sum_{i=0}^{c-1} [p(i|t)]^2    (2.2)

    Classification error(t) = 1 - \max_i [p(i|t)]    (2.3)

where c is the number of classes and 0 \log_2 0 = 0 in entropy calculations [30]. These measures are based on the degree of impurity of the child nodes. For example, in a two class problem, if a node has all examples with the same class, it has zero impurity. If a node has uniform class distribution among two classes, it has the highest impurity. To determine the quality of a test condition, the degree of impurity of the parent node (before splitting) is compared to the degree of the child nodes (after splitting). A larger difference indicates a better test condition. The gain, \Delta, is a measure of this difference:

    \Delta = I(parent) - \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j)    (2.4)

where I(\cdot) is the impurity measure of a given node, N is the total number of records at the parent node, k is the number of attribute values, and N(v_j) is the number of records associated with the child node v_j [30].

If the impurity measure used above is entropy, then the gain is known as information gain, \Delta_{info}. A test condition that results in many outcomes may result in unreliable predictions if the number of records associated with each split is too small. One way of overcoming this problem is to use a splitting criterion known as gain ratio:

    Gain ratio = \Delta_{info} / SplitInfo    (2.5)

    SplitInfo = -\sum_{i=1}^{k} P(v_i) \log_2 P(v_i)    (2.6)

and k is the total number of splits [30]. An attribute that produces many splits increases the split information, which reduces the gain ratio.

For the second issue of determining when to stop the splitting process, the previously discussed methods of continuing to expand a node until either all records belong to the same class or all records have the same attribute values are both sufficient to stop splitting. However, early termination methods such as pre-pruning and post-pruning can be beneficial [30].
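The split-quality measures in Eqs. (2.1)-(2.6) translate directly into code. The sketch below (illustrative, not from the dissertation) computes each measure from class counts at a node:

```python
import numpy as np

def entropy(p):
    """Eq. (2.1); p is the class-probability vector p(i|t) at node t."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # convention: 0 * log2(0) = 0
    return -np.sum(nz * np.log2(nz))

def gini(p):
    """Eq. (2.2)."""
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

def classification_error(p):
    """Eq. (2.3)."""
    return 1.0 - np.max(p)

def _impurity_from_counts(counts, impurity):
    counts = np.asarray(counts, dtype=float)
    return impurity(counts / counts.sum())

def gain(parent_counts, child_counts, impurity=entropy):
    """Eq. (2.4): Delta = I(parent) - sum_j N(v_j)/N * I(v_j)."""
    N = float(np.sum(parent_counts))
    return _impurity_from_counts(parent_counts, impurity) - sum(
        np.sum(c) / N * _impurity_from_counts(c, impurity)
        for c in child_counts)

def gain_ratio(parent_counts, child_counts):
    """Eqs. (2.5)-(2.6): information gain divided by split information."""
    sizes = np.array([np.sum(c) for c in child_counts], dtype=float)
    P = sizes / sizes.sum()
    split_info = -np.sum(P * np.log2(P))
    return gain(parent_counts, child_counts, entropy) / split_info
```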

Breiman's random forest (RF) algorithm [31] is an ensemble method specifically designed for decision tree classifiers. One popular random forest method initially uses bagging or bootstrap aggregating (see Chapter 2.2.1) to repeatedly sample with replacement from a dataset using a uniform probability distribution [30]. Each bag has the same size as the original dataset, with some examples appearing more than once and some not at all in a given bag. About 63% of the original dataset appears in each bag on average. Another method of injecting randomness into each base decision tree, grown using each bag of training data, is to randomly select only some of the features as tests at each node of the decision tree. There are different ways of determining the number of features selected, but one commonly used is ⌊log₂ n⌋ + 1 given n features. RF predictions produce a single class vote for the forest, while random forests weighted (RFW) predictions are based on the percentage of trees that vote for a class. The motivation for using this ensemble technique stems from the inherent speed benefit of analyzing only a few possible attributes from which a test is selected at an internal tree node. The accuracy of random forests was evaluated in [32] and shown to be comparable with or better than other well-known ensemble generation techniques. It is more impervious to noise and much faster than AdaBoost, a commonly used boosting ensemble method [30].

2.4 Anomaly Detection

Anomaly detection, also known as outlier detection, deals with finding patterns in data that are unusual, abnormal, unexpected, and/or interesting [33]. Anomalies are important because they translate to significant information that can lead to critical action in a wide variety of application domains, such as credit card fraud detection, security intrusion detection, insurance, health care, fault detection, and military surveillance [33]. Some of the challenges presented by anomaly detection include imprecise boundaries between normal and anomalous behavior, malicious actions that make anomalies appear normal, evolution of normal behavior, different application domains, lack of labeled data, and noise. Researchers have applied concepts from statistics, machine learning, data mining, information theory, and spectral theory to form anomaly detection techniques [33].

Accurate data labels that denote normal or anomalous behavior usually require manual effort by a human expert and can be too expensive to acquire for many applications. In addition, labeling all possible types of anomalies that can arise in an evolving domain can be difficult. Based on the type of data labels available, there are three modes in which anomaly detection techniques can operate. They are supervised, semi-supervised, and unsupervised anomaly detection modes [33].

In supervised mode, a training dataset with labeled instances for normal and anomaly classes is assumed. Typically, a predictive model is built for normal vs. anomaly classes using the training data. Then the model is used to predict the class of unseen data. Since the anomaly class is usually rare, the imbalanced data distribution must be addressed.

In semi-supervised mode, it is assumed that the training data has labels only for the normal class. A model is built for the normal class, and the model is used to identify anomalies in the test dataset. In unsupervised mode, no labeled training data is required. The only assumption made is that normal instances appear much more frequently than anomaly instances in the test data [33]. If this assumption is not valid, a high false alarm rate results.

CHAPTER 3

DATASETS

In this chapter, the characteristics and train/test set organization of the canister tear, casing simulation, and modified KDD Cup 1999 datasets are presented.

3.1 Canister Tear Simulation

In the canister tear simulation, also used in [37], a canister is dropped on a strike plate as shown in Figure 1.1b. The canister appears at the top, over the strike plate. The canister is made of one material for the sides and of a second material for the top and bottom. Simulated welds join the top and bottom to the sides. The collision of the canister with the strike plate causes compression faults in the canister shell at the point of impact and rupture faults (tears) in the canister shell farthest from impact. In our experiments, depending on the particular run, we observed 11 to 31 time steps for the simulated event. The baseline run was designated run 1. In runs 2 and 3, variables associated with the two canister materials were given values different from the baseline values. In run 4, the shell height of the middle region of the canister shell was increased from 1 to 2 and the refined weld surface and thickness were each reduced from 2 to 1.

Table 3.1: Physical and spatial characteristics for the canister tear simulation runs.

    Tear run                                     1          2          3          4
    # of nodes per time step                     140,293    140,293    140,293    81,465
    # of time steps                              11         11         11         31
    Total # of nodes                             1,543,223  1,543,223  1,543,223  2,525,415
    % of salient nodes in training time step     0.79       4.94       3.25       2.28
    % of salient nodes in remaining time steps   2.17       3.73       4.50       2.38
    % of salient nodes in all time steps         2.05       3.84       4.38       2.30

There are 37 nodal variables for each run, including 26 that change.

3.1.1 Physical and Spatial Characteristics

In the four different instances of the canister tear simulation provided to us by the Department of Energy, all in the EXODUS II format, nodes and finite elements of the simulation model are embedded in a mesh framework [34]. Nine physical variables are stored for each node within each of the time steps. They are the displacement on the X, Y, and Z axes; velocity on the X, Y, and Z axes; and acceleration on the X, Y, and Z axes. In addition, 17 variables are stored for each finite element of eight nodes. We converted to a purely nodal representation by averaging all values of the corresponding elemental variables that contain the node. We then used only nodal variables for learning. Table 3.1 shows the parameter settings for each run. Tables 3.2 and 3.3 show the different ranges taken on by the features available in each run.

Table 3.2: Feature ranges for the canister tear data in runs 1 and 2.

    Feature          Run 1 min    Run 1 max    Run 2 min    Run 2 max
    DISPLX           -262.1       374.4        -186.5       472.4
    DISPLY           -30.68       507.3        -30.52       338.8
    DISPLZ           -486.9       206.9        -416.4       215.6
    VELX             -144,943     262,027      -170,385     164,370
    VELY             -111,516     234,437      -129,884     212,983
    VELZ             -171,581     102,341      -214,932     122,208
    ACCLX            -5.74E+11    3.71E+11     -5.36E+11    6.65E+11
    ACCLY            -5.59E+11    3.49E+11     -6.67E+11    3.31E+11
    ACCLZ            -3.55E+11    6.54E+11     -3.80E+11    3.87E+11
    ELEMDEATH        0            4            0            4
    ELEMVAR7         0            324.8        0            288.4
    ELEMVAR8         0            0.218        0            0.226
    ELEMVAR9         0            47,968       0            22,256
    ELEMVAR10        0            538.9        0            597.4
    ELEMVAR11        0            9.18E+06     0            5.93E+06
    ELEMVAR12        0            1.04         0            1.04
    ELEMVAR18        0            1.56         0            1.63
    ELEMVAR19        0            64,043       0            30,995
    ELEMVAR20        -1640        1021         -2202        1165
    PLASTICSTRAIN    0            1.56         0            1.63
    SIGMAXX          -1913        1021         -2202        1173
    SIGMAXY          -568.7       520.5        -625.4       593.8
    SIGMAYY          -2513        1222         -2871        1238
    SIGMAYZ          -566.1       515.8        -640.6       581.2
    SIGMAZX          -574.6       515.2        -660.7       620.9
    SIGMAZZ          -1819        1025         -2387        1300

Table 3.3: Feature ranges for the canister tear data in runs 3 and 4.

    Feature          Run 3 min    Run 3 max    Run 4 min    Run 4 max
    DISPLX           -223.3       231.2        -134.8       190.2
    DISPLY           -30.16       435.4        -30.66       235.3
    DISPLZ           -212.4       214.6        -182.3       142.9
    VELX             -122,159     133,943      -111,141     151,133
    VELY             -133,301     312,411      -161,039     227,234
    VELZ             -117,727     118,168      -96,732      119,858
    ACCLX            -5.67E+11    6.50E+11     -2.12E+11    2.99E+11
    ACCLY            -3.84E+11    1.88E+11     -2.50E+11    2.35E+11
    ACCLZ            -2.82E+11    1.80E+11     -1.51E+11    1.33E+11
    ELEMDEATH        0            4            0            4
    ELEMVAR7         0            326.0        0            324.8
    ELEMVAR8         0            0.229        0            0.217
    ELEMVAR9         0            26,819       0            24,105
    ELEMVAR10        0            680.4        0            557.0
    ELEMVAR11        0            7.05E+06     0            5.06E+06
    ELEMVAR12        0            1.01         0            1.00
    ELEMVAR18        0            2.19         0            1.20
    ELEMVAR19        0            54,581       0            29,743
    ELEMVAR20        -1929        1038         -1974        1025
    PLASTICSTRAIN    0            2.19         0            1.21
    SIGMAXX          -1929        1051         -1974        1025
    SIGMAXY          -540.1       527.8        -541.0       521.4
    SIGMAYY          -2366        1084         -2327        976.7
    SIGMAYZ          -595.4       527.7        -564.5       518.8
    SIGMAZX          -585.9       532.5        -582.6       514.8
    SIGMAZZ          -2268        1134         -1734        1046

3.1.2 Train and Test Sets

To create labeled data for every time step, those pieces of the canister that have deformed so as to possibly indicate a tear in the canister wall were marked as salient by manual editing of the data via a custom plug-in to the open source visualization tool ParaView [35]. At the beginning of the simulation there are no salient nodes within the mesh. As time progresses and the canister deforms, more and more nodes were marked salient.

The marking of the salient nodes within the mesh can in principle be as precise as desired, but more precision requires greater effort in manual marking. In actual ASC work, the scientists use a tool that permits them to quickly mark coarsely shaped regions, or to laboriously mark detailed regions. Since they invariably choose the fast but coarse option, we have done the same, allowing noise in the class labels by marking areas rather than individual nodes in the simulation models. Smoothing of the output to create regions may reduce the noise in predictions created by imprecise labeling, as we shall see.

The data for the middle time step of each canister tear simulation run was divided spatially according to the computer to which it is assigned and used for training classifiers and/or ensembles. The partitioning divided the canister into 14 disjoint spatial partitions of roughly equal size, as shown in Figure 3.1. Table 3.4 shows the number of salient nodes in each canister tear partition of the training time step. The training data in eight of the 14 training partitions of both runs 1 and 3 have no salient examples. The training data in seven of the 14 training partitions of both runs 2 and 4 have no salient examples. In addition, two other partitions of run 1 each contain only two salient examples. The high number of one-class partitions was deliberately chosen to illustrate the advantages of scaled probabilistic majority voting. In reality, the partitioning would be arbitrary and not user selectable.

Figure 3.1: A visualization of the 14 canister tear simulation partitions, with the tear area (seen in later time steps) inside the white outline.

In each time step and in each partition, saliency was designated as described. Every node not designated salient received the label "unknown", rather than "not salient", to reflect the fact that, in general, the users will indicate only salient regions. An ensemble of classifiers was trained on each of the 14 partitions of the training time step. Testing was done either on all of the remaining time steps of the same run, or on all time steps of each different run. Figures 3.2 and 3.3 show a view of the training time step and the final time step of all four canister tear runs. The classifiers predicted each test example based on the attributes associated with that example. The votes of each partition were combined using a scaled probabilistic combination of the votes (to be reviewed in Chapter 4.1). We obtained region-based results by spatially smoothing and thresholding the point-based predictions.

Table 3.4: Salient class statistics by partition for the canister tear simulation runs.

    Partition   # of training nodes           # of salient training nodes
                Run 1, 2, or 3    Run 4       Run 1   Run 2   Run 3   Run 4
    0           5041              5041        0       0       0       0
    1           6800              6800        0       0       0       0
    2           6271              6271        0       0       0       0
    3           7471              7388        0       0       0       0
    4           12,980            7441        0       0       0       0
    5           10,672            5720        0       0       0       0
    6           12,257            6642        0       0       0       0
    7           10,759            5972        2       64      0       32
    8           11,651            5823        166     1076    226     183
    9           12,653            6560        471     1679    1415    337
    10          10,938            6371        2       622     1095    389
    11          8258              4406        0       1226    685     289
    12          8693              3780        37      1192    537     368
    13          15,849            3250        433     1071    599     262
    all         140,293           81,465      1111    6930    4557    1860

Smoothing occurs by averaging saliency values of nodes within a specified distance and subsequently binarizing the saliency using the Otsu automatic thresholding algorithm [36]. We focus on the accuracy of region detection, not node detection, because it is regions that are presented to the user, and assessed for their utility.

3.2 Casing Simulation

In this dataset, also used in [10, 37, 38], a casing was dropped on the ground as shown in Figure 1.1a. The casing is composed of four main sections: the nose cone, the body tube, the coupler, and the tail. The coupler connects the body tube and tail through a series of ten bolts. The ground has also been modeled. The casing was dropped from a short height and landed on the tail at an angle. This simulation recorded the stresses across the entire device as might be found were it to be accidentally dropped during transport, storage, etc.

Figure 3.2: Training time step in the canister tear simulation run 1, run 2, run 3, and run 4 (left to right). Ground truth salient regions are red (in color) or darker (in grayscale) than unknown regions. The tear area view in Figure 3.1 has been rotated 90° cw.

Figure 3.3: Final time step in the canister tear simulation run 1, run 2, run 3, and run 4 (left to right). Ground truth salient regions are red (in color) or darker (in grayscale) than unknown regions. The tear area view in Figure 3.1 has been rotated 90° cw.

3.2.1 Physical and Spatial Characteristics

The goal using this dataset is to discover which nodes in the simulation belong to bolts. When dropped at an angle on the tail, one group of bolts experiences a tensile force, while the other group of bolts experiences a compressive force. Each was also subject to shear forces. These forces were expressed in many other sections of the casing as well. The physical characteristics of the individual nodes modeling the bolts are not substantially different from those modeling the rest of the casing. In other words, there is no underlying feature of "boltness" which would make this an easy problem without using additional block node identification or location geometry. This additional information was only used for the initial labeling of bolt nodes as salient, and not as one of the features for improving test accuracy, as discussed later in Chapter 3.2.1.

Table 3.5: Physical and spatial characteristics for the casing simulation.

    # nodal variables                21
    # time steps                     21
    # non-bolt nodes per time step   69,073
    # bolt nodes per time step       5680
    Total # nodes per time step      74,753
    Total # non-bolt nodes           1,450,533
    Total # bolt nodes               119,280
    Total # nodes                    1,569,813
    Total % of bolt nodes            7.6%

The physical and spatial characteristics are provided in Table 3.5. Dataset attributes include the motion variables of displacement, velocity, and acceleration as well as several interaction variables such as contact force, total internal force, total external force, and reaction force. The different ranges for each of these attributes are shown in Table 3.6. A time step showing the ground truth data is shown in Figure 1.1. The bolts are the smaller red (in color) or darker (in grayscale) regions and represent the salient nodes in this simulation.

There are several important differences between this dataset and the canister tear dataset. There is not a large change in the structure of the casing data as the simulation runs through time. The change in the structure occurs mostly at the end of the simulation after some amount of shear has taken place. Since the structural changes are more subtle, the deformation of the casing simulation turns out to be more difficult than the canister simulation to accurately predict. Instead, for this dataset it is considered sufficient merely to identify the bolts.

Table 3.6: Feature ranges for the casing simulation.

    Feature        Minimum      Maximum
    DISPLX         -2.62        5.00
    DISPLY         -0.24        0.23
    DISPLZ         -10.34       0.55
    VELX           -4306        7437
    VELY           -2108        5943
    VELZ           -11,518      3922
    ACCELX         -1.30E+09    8.79E+09
    ACCELY         -1.47E+09    1.46E+09
    ACCELZ         -2.23E+09    3.29E+09
    F-CONTACT-X    -463.9       392.4
    F-CONTACT-Y    -469.1       478.6
    F-CONTACT-Z    -4917        2354
    F-EXT-X        -1550        877.2
    F-EXT-Y        -354.1       345.8
    F-EXT-Z        -2561        2329
    F-INT-X        -1550        877.0
    F-INT-Y        -470.0       473.1
    F-INT-Z        -4920        2354
    REACT-X        -558.3       596.4
    REACT-Y        -354.1       345.8
    REACT-Z        -165.4       2328

Figure 3.4 shows the partitioning of the actual simulation. The properties of the partitioning for the casing dataset are shown in Table 3.7. The partitioning was performed lengthwise in 12 pieces across the cylindrical body so as to distribute the bolts across computers. We purposefully partitioned the data so that four of the partitions do not contain any node from a bolt. This created four one-class classifiers, which were processed accordingly by the voting algorithm during classification. Two of the remaining partitions each contain a complete bolt and parts of two other bolts. The six remaining partitions each contain only a part of each of two bolts.

Figure 3.4: A visualization of the 12 casing simulation partitions. Four of the eight bolts that are each contained in more than one partition are visible.

The ground section was also partitioned and used for training in case its data became relevant in later time steps.

3.2.2 Train and Test Sets

During the simulation, nodes belonging to all of the bolts were specifically designated as their own substructure or block within the simulation. Therefore, labeling of those points was a matter of setting all those nodes as salient, and hence the training and test sets are labeled perfectly. This block node identification was deliberately not used as one of the features for improving test accuracy in order to establish a legitimate machine learning challenge for our methods. Recall that in the canister tear simulation the ground truth was subject to the inaccuracies inherent in the tools available to designate saliency.

Table 3.7: Partitioning characteristics for the casing simulation.

    Partition   # of training nodes   # of salient training nodes   % of salient training nodes
    0           53,529                6818                          12.74
    1           59,654                5110                          8.57
    2           37,625                0                             0.00
    3           29,617                980                           3.31
    4           40,467                6972                          17.23
    5           29,183                0                             0.00
    6           43,106                0                             0.00
    7           29,155                3374                          11.57
    8           30,254                4578                          15.13
    9           54,488                0                             0.00
    10          49,728                3374                          6.78
    11          66,465                8554                          12.87
    all         523,271               39,760                        7.60

Data from time steps zero to six were combined for each of the 12 partitions to form 12 sets of training data. The test set consisted of all of the data in the remaining time steps, 7 to 20. A classifier or an ensemble of classifiers was trained on the training data of each of the 12 partitions. Testing was performed using a scaled probabilistic combination of those 12 votes (to be reviewed in Chapter 4.1). The classifiers predicted each test example based on the attributes associated with that example. We obtained region-based results by spatially smoothing and thresholding the point-based predictions.

3.3 Modified KDD Cup 1999

In order to compare our ensemble based anomaly detection approaches to other approaches on the same dataset, we selected the same modified KDD Cup 1999 data subset used in [39, 40]. The unmodified dataset "includes a wide variety of intrusions simulated in a military network environment" [41]. The full dataset, a 10% subset of the dataset, a full test dataset, and three different 10% subsets of the full test dataset are available from the UCI KDD Archive [41].

Table 3.8: Modified KDD Cup 1999 dataset characteristics.

    Modifications made   Size     # continuous features   # discrete features   # of outliers   % of outliers
    U2R vs. normal       60,839   32                      9                     246             0.40%

The KDD Cup 1999 10% test dataset with corrected labels is the one modified and used in [39, 40]. This dataset with 311,029 data records and five classes was modified to include only the normal class and the U2R intrusion attack class, which was considered the outlier or anomaly class. The modified dataset has 60,593 normal and 246 intrusion (outlier) data records. Table 3.8 shows the modified dataset characteristics.

The modified KDD Cup 1999 10% corrected test dataset, to be referred to as the modified KDD Cup 1999 dataset, was randomly split into two groups a total of 30 times, with each group used as a train/test set as in [39, 40]. In order to investigate ensemble approaches to outlier detection for this dataset, and to compare the results to those in [39, 40], each group was partitioned into a number of stratified partitions (each partition had about the same number of positive instances and each had about the same number of negative instances) for testing on the other group of the random data split. Each group was also partitioned into a number of random (non-stratified) partitions for another set of experiments to test just the data in that group for each random data split.
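As a rough illustration of the two partitioning schemes just described, the following sketch (a hypothetical helper, not the authors' code) splits a labeled dataset into stratified partitions, which keep each partition's class counts roughly equal, or into purely random partitions:

```python
import numpy as np

def partition_indices(y, n_parts, stratified=True, seed=0):
    """Return a list of n_parts index arrays over the labels y."""
    rng = np.random.default_rng(seed)
    if not stratified:
        # random (non-stratified): shuffle and cut into equal pieces
        return np.array_split(rng.permutation(len(y)), n_parts)
    parts = [[] for _ in range(n_parts)]
    for cls in np.unique(y):
        # deal each class out separately so every partition gets about
        # the same number of positives and of negatives
        cls_idx = rng.permutation(np.where(y == cls)[0])
        for i, chunk in enumerate(np.array_split(cls_idx, n_parts)):
            parts[i].extend(chunk.tolist())
    return [np.array(sorted(p)) for p in parts]
```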

CHAPTER 4

SIMULATION ENSEMBLE EXPERIMENTS

This chapter includes experiments, evaluation metrics, and results for the canister tear and casing simulations. First the process of predicting and ordering salient regions is discussed. Then the details of the experiments, the evaluation metrics used, and the results of the experiments are presented.

4.1 Predicting and Ordering Salient Regions

Initially, a classifier or ensemble of classifiers was constructed using the labeled, spatially disjoint, training data local to each partition. Each of these classifiers or ensembles was then transferred to a test partition from either the same or similar simulations. Once there, each classifier or ensemble of classifiers was used to predict the class of each instance of test data local to that computer. Due to possible class imbalances, a scaled probabilistic majority vote of all class predictions was used to determine the consensus class of each instance of test data. Because spatial regional predictions in simulations are the ultimate goal, connected component regions of the predicted data were constructed, smoothed, and thresholded for better accuracy. For evaluation purposes, these predicted regions were compared to the labeled ground truth test regions, using different overlap thresholds to determine the quality of each result.

First, to establish a baseline for each partition we used a single default pruned C4.5 release 8 decision tree (DT) with a certainty factor of 25, trained on all the data at that partition. Then we used Breiman's random forest (RF) algorithm [31], with 250 unpruned trees per partition with both unweighted (RF) and weighted (RFW) predictions. Breiman used random forests with 100 trees [31], while a later study used 1000 trees [32]. We chose 250 to achieve more accurate results than the established number of 100 without incurring the additional computational costs of 1000 trees, which might not necessarily prove beneficial in our experiments. The accuracy of random forests was evaluated in [32] and shown to be comparable with or better than other well-known ensemble generation techniques. The number of random features chosen at each decision tree node was ⌊log₂ n⌋ + 1 given n features. The values for the same set of features for each individual node are presented to each DT or RF, but RF randomly selects only some of the features for use internally. RF predictions produce a single class vote for the forest, while RFW predictions are based on the percentage of trees that vote for a class. The motivation for using this ensemble technique stems from the inherent speed benefit of analyzing only a few possible attributes from which a test is selected at an internal tree node.

Classification of a test point within the simulation involves prediction by each partition's ensemble of decision trees. Because our algorithms need to work when only a few computers have salient examples, a simple majority vote algorithm may fail to classify any points as salient. In a large-scale simulation it is likely that there will be nodes which have no salient examples in training. If many individual classifiers are unable to predict a node as salient because there are no salient examples in the individual training sets, then it may be impossible for a majority vote to predict a node as salient. Therefore we must consider the prior probability that any given node contained salient examples during training and therefore is capable of producing a classifier that can predict an example as salient.

A breakdown of this algorithm as presented in [42] is as follows:

    p(w_1|x) = % of ensembles voting for class w_1 for example x    (4.1)

    P(w_1) = % of ensembles capable of predicting class w_1    (4.2)

    Classify as w_1 if: p(w_1|x)/P(w_1) > p(w_2|x)/P(w_2)    (4.3)

    Classify as w_2 if: p(w_1|x)/P(w_1) < p(w_2|x)/P(w_2)    (4.4)

Thus, a probabilistic majority vote can be applied for a two-class problem. For instance, suppose there are five training partitions, including two partitions with both unknown (w_1 class) and salient (w_2 class) examples and three partitions with only unknown (w_1 class) examples. Therefore P(w_1) = 5 and P(w_2) = 2. If the first two partitions each vote salient for example x, and since the final three partitions can only vote unknown for example x, the overall vote would be salient since 3/5 < 2/2.

This algorithm does not differentiate between ensembles trained on data with a very different number of examples by class. In order to further improve accuracy, we modified the input to the above algorithm by first multiplying each partition's ensemble vote for each class by the percentage of examples of that class in the corresponding partition, compared to the number of examples of that class in all partitions. After this additional step, the modified class votes were totaled, and the above algorithm applied. We call this implementation a scaled probabilistic majority vote (spmv).
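A minimal sketch of the two-class scaled probabilistic majority vote, written from Eqs. (4.1)-(4.4) and the scaling step above (one plausible reading of the method, not the dissertation's code; class 0 is "unknown" and class 1 is "salient"):

```python
import numpy as np

def spmv(votes, class_counts):
    """Consensus class for one test example.

    votes: (n_partitions,) 0/1 prediction from each partition's ensemble.
    class_counts: (n_partitions, 2) per-class training-example counts,
    so a one-class partition has class_counts[p, 1] == 0.
    """
    totals = class_counts.sum(axis=0).astype(float)         # class sizes overall
    capable = (class_counts > 0).sum(axis=0).astype(float)  # P(w): partitions able to predict w
    scaled = np.zeros(2)
    for p, v in enumerate(votes):
        # scale each vote by the partition's share of that class's examples
        scaled[v] += class_counts[p, v] / totals[v]
    ratio = scaled / capable                                # p(w|x) / P(w)
    return 1 if ratio[1] > ratio[0] else 0                  # ties -> unknown
```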

An n-class problem's class votes would be similarly modified, and the algorithm below would then be applied [42]:

    Classify as w_n: \arg\max_n p(w_n|x)/P(w_n)    (4.5)

In the case of a tie vote, the unknown class was predicted, since a definite salient vote has not been determined. We are interested in directing people to salient regions so, presumably, missing a few salient points that are tied in a vote will not be important for region recognition.

Casing simulation ground truth salient regions (bolts) are constant in size, while salient regions in the canister tear simulations generally grow larger with each time step. Different methods were explored in an attempt to order true salient regions before false positive regions. Predicted regions were ordered by their size in number of nodes, with largest regions first. This method assumes that very small predicted regions are less likely to meet overlap threshold requirements for true positives. Another method ordered the predicted regions closest to the mean size of all predicted regions first. This technique assumes false positive regions are more likely to be very small or very large. Regions were also ordered by the mean of the salient margins of the scaled probabilistic majority votes by ensembles for nodes in each region before smoothing. In this case, salient margin is computed by salient votes minus unknown votes. Regions with higher means are ordered first since the ensemble voting shows more confidence in a salient classification. In addition, using domain knowledge, predicted regions for casing experiments were ordered by how closely their number of nodes compared to the number of nodes (568) in each ground truth bolt. The goal is to point the user to actual ground truth regions first, and false positive regions last, in those cases where perfect accuracy cannot be obtained.
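Two of these orderings are simple enough to sketch directly (hypothetical region records carrying a node count and per-node salient margins; illustrative, not the dissertation's code):

```python
def order_by_mean_margin(regions):
    """Most confident regions first: sort by mean salient margin, where
    each node's margin is salient votes minus unknown votes."""
    return sorted(regions,
                  key=lambda r: sum(r["margins"]) / len(r["margins"]),
                  reverse=True)

def order_by_bolt_size(regions, bolt_size=568):
    """Domain-knowledge ordering for the casing data: regions whose node
    count is closest to a ground-truth bolt (568 nodes) come first."""
    return sorted(regions, key=lambda r: abs(r["size"] - bolt_size))
```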

PAGE 46
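The four ordering heuristics amount to a few lines each. The sketch below is an illustrative stand-in (the function name and argument layout are invented for the example); it returns region indices, best-first, and uses the casing ground truth bolt size of 568 nodes for the domain-knowledge ordering.

```python
import numpy as np

def order_regions(region_sizes, method, margin_means=None, gt_bolt_size=568):
    """Rank predicted regions so likely true positives come first."""
    sizes = np.asarray(region_sizes, dtype=float)
    if method == "size":            # largest regions first
        key = -sizes
    elif method == "size_ratio":    # closest to the mean region size first
        key = np.abs(sizes - sizes.mean())
    elif method == "gt_ratio":      # closest to the known bolt size first
        key = np.abs(sizes - gt_bolt_size)
    elif method == "smm":           # highest salient margin mean first
        key = -np.asarray(margin_means, dtype=float)
    else:
        raise ValueError(method)
    return np.argsort(key, kind="stable")
```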

4.2 Experiments

For the casing experiments, training was performed on the data contained in each of the 12 partitions of time steps 0 to 6 to create both a single pruned decision tree and a 250-tree random forest ensemble for each partition. We chose 250 trees as discussed in Chapter 4.1. The decision tree classifier or the random forest ensemble of each training partition returned a single prediction (or a weighted prediction in the case of random forests weighted) for each test example in test time steps 7 to 20. The 12 predictions from those classifiers or ensembles were combined into a single prediction for each test example using the scaled probabilistic majority vote (see Chapter 4.1).

For each of the canister tear simulation experiments, training was performed on the data contained in each of the 14 partitions of the training time step of a single run. For each of runs 1, 2, and 3, this time step was 5 of 0 to 11, and for run 4, this time step was 15 of 0 to 31. The data in the training time step is a small fraction of the total data in all time steps, which reflects the typical situation where labeled data for training is rare. Both a single pruned decision tree and a 250-tree random forest ensemble were created for each partition of the training data in a run. We chose 250 trees as discussed in Chapter 4.1. The decision tree classifier or the random forest ensemble of each partition returned a single prediction (or a weighted prediction in the case of random forests weighted) for each test example of either the remaining time steps of the same run, or of all time steps of a different run. The 14 predictions from those classifiers or ensembles were combined into a single prediction for each test example using the scaled probabilistic majority vote (see Chapter 4.1).
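The end-to-end per-partition flow might look like the following sketch, which substitutes scikit-learn for the original implementation. Two caveats: scikit-learn's predict_proba averages tree probabilities rather than counting hard tree votes, which only approximates RFW, and a forest trained on a one-class partition returns a single probability column that must be aligned to the global class list before voting. All names below are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_partition_forest(X, y, n_trees=250, seed=0):
    """One forest per partition: 250 trees, log2(n)+1 features per node."""
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        max_features=int(np.log2(X.shape[1])) + 1,
        random_state=seed,
    )
    return forest.fit(X, y)

def aligned_proba(forest, X, n_classes=2):
    """Spread predict_proba columns onto the global class list, so a
    one-class partition still yields an (n, n_classes) vote array."""
    proba = np.zeros((len(X), n_classes))
    proba[:, forest.classes_] = forest.predict_proba(X)
    return proba

# Hypothetical flow for one test set, reusing the spmv sketch above:
# forests = [train_partition_forest(Xp, yp) for Xp, yp in partitions]
# votes   = np.stack([aligned_proba(f, X_test) for f in forests])
# counts  = np.array([np.bincount(yp, minlength=2) for _, yp in partitions])
# labels  = [scaled_probabilistic_majority_vote(votes[:, i], counts)
#            for i in range(len(X_test))]
```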

The salient regions of the data for training were marked using the region-based tools of the ParaView application [35]. The ensembles of classifiers used to classify the test data often produced smaller salient clusters of nodes, or even individual isolated salient nodes, which did not correspond well to the larger marked, ground truth regions. In order to improve the regional accuracy of these ensembles, we employed some of the regional tools in the Feature Characterization Library (FCLib-1.2.0) toolkit [43] to process the ensemble prediction data. The numeric class label (0.5 for unknown, 1.0 for salient) of all nodes within a physical radius of three units of each node (found by testing on the training data) was averaged in a smoothing operation. We expected that smoothing at three units would erase smaller dimension regions without degrading larger regions.

After smoothing, nodes had numeric class labels in the range [0.5, 1]. These values were binarized using the Otsu automatic thresholding algorithm [36]. Predicted regions were created from connected components of salient nodes after smoothing. Smoothing tended to remove the smaller salient regions and the isolated salient nodes. All pairs of salient regions separated by no more than the maximum edge distance between nodes for the casing simulation, or by no more than an edge distance of two units between nodes for the canister tear simulation runs, were assigned the same region label. For reference, the longest dimensions of the casing and canister tear simulations in the initial time step are 301.4 units and 176.1 units respectively. Two units were chosen for the tear simulation instead of the default maximum edge distance because one block of the tear simulation not involved in the tearing had much larger edge distances, up to 29.74 units. This would have resulted in assigning the same region label to widely separated regions.
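A compact sketch of this post-processing pipeline appears below. It substitutes generic SciPy/scikit-image tools for the FCLib routines actually used, so it is an approximation under stated assumptions: coords is an (n, 3) NumPy array of node positions, labels is an (n,) array of the numeric class labels (0.5 or 1.0), and edges is an (m, 2) integer array of mesh edges. The later merge of regions separated by less than the maximum edge distance is omitted for brevity.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from skimage.filters import threshold_otsu

def smooth_and_segment(coords, labels, edges, radius=3.0):
    """Average labels within `radius`, binarize with Otsu's threshold,
    and label connected components of salient nodes over mesh edges."""
    tree = cKDTree(coords)
    smoothed = np.array(
        [labels[tree.query_ball_point(p, radius)].mean() for p in coords]
    )
    salient = smoothed >= threshold_otsu(smoothed)

    # Keep edges whose endpoints are both salient; each connected
    # component of the remaining graph is one predicted region.
    keep = salient[edges[:, 0]] & salient[edges[:, 1]]
    e = edges[keep]
    n = len(coords)
    adj = csr_matrix((np.ones(len(e)), (e[:, 0], e[:, 1])), shape=(n, n))
    _, region = connected_components(adj, directed=False)
    region[~salient] = -1          # non-salient nodes get no region
    return smoothed, region
```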

Another tool was used to generate overlap matrices of connected component ground truth and predicted regions. Predicted salient regions were finally ordered by the various size and voting confidence methods as described in the previous chapter.

4.3 Evaluation Metrics

Our previous approach [8] did not consider the actual node intersection percentage between ground truth and predicted salient regions. We extended that approach [9, 10, 37] by establishing thresholds for the overlap percentage of the nodes in a ground truth salient region and a predicted salient region for the prediction to be counted as a true positive. The overlap required for a true positive at a given threshold was applied separately to both the ground truth region and to the predicted region. If no predicted salient regions sufficiently overlapped a ground truth salient region, a false negative was registered for the failure to adequately predict the ground truth region.

A false positive was recorded for each predicted region that did not sufficiently overlap any ground truth region. This may have resulted in more total predicted regions than actual regions. It is possible that more than one predicted salient region satisfied a given overlap threshold for intersection with a labeled salient region. We counted this as a single discovery of the ground truth region (true positive or TP), with the remaining prediction(s) counted as false positive(s). For the purposes of people searching for interesting events, this appears sensible because they would be directed to the region. If one predicted region sufficiently overlapped more than one labeled salient region, the only true positive counted was the one with the most overlap with the predicted region.
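These counting rules are easy to get subtly wrong, so a sketch may help. The function below is one illustrative reading of the rules above, not the evaluation code used for the experiments: overlap[i, j] is the number of nodes shared by ground truth region i and predicted region j, and the threshold must be met as a fraction of both regions' node counts.

```python
import numpy as np

def region_metrics(overlap, gt_sizes, pred_sizes, threshold=0.10):
    """Count TP/FP/FN from a ground-truth-by-predicted overlap matrix."""
    overlap = np.asarray(overlap, dtype=float)
    gt_sizes = np.asarray(gt_sizes, dtype=float)
    pred_sizes = np.asarray(pred_sizes, dtype=float)

    frac_gt = overlap / gt_sizes[:, None]
    frac_pred = overlap / pred_sizes[None, :]
    hits = (frac_gt >= threshold) & (frac_pred >= threshold)

    found = np.zeros(len(gt_sizes), dtype=bool)
    fp = 0
    for j in range(len(pred_sizes)):
        rows = np.flatnonzero(hits[:, j])
        if rows.size == 0:
            fp += 1                  # overlaps no ground truth region
            continue
        # Credit only the ground truth region with the most overlap.
        i = rows[np.argmax(overlap[rows, j])]
        if found[i]:
            fp += 1                  # region already discovered once
        else:
            found[i] = True          # single discovery -> one TP
    tp = int(found.sum())
    fn = len(gt_sizes) - tp
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = 2 * tp / (2 * tp + fp + fn)
    return tp, fp, fn, recall, precision, f_measure
```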

Recall, precision, and the traditional F-measure, which weights false positives (FP) and false negatives (FN) equally, provide measures of regional accuracy, as shown below [44].

$$\text{recall} = \frac{TP}{TP + FN} \tag{4.6}$$

$$\text{precision} = \frac{TP}{TP + FP} \tag{4.7}$$

$$\text{F-measure} = \frac{2\,TP}{2\,TP + FP + FN} \tag{4.8}$$

For many users, a small overlap threshold is an appropriate regional metric, since coarsely pointing those users to suspicious regions for further investigation is the main goal. From a machine learning viewpoint, a smaller overlap does not address the case where a very large region is always predicted salient. As long as this region minimally overlaps a given ground truth region, a true positive is counted. By increasing the overlap requirement to 10%, or even 50% for example, a more precise match is obtained. The stricter requirements also provide useful discrimination between classifier methods that would not be possible with a minimal overlap requirement.

While the F-measure indicates how well salient regions are found, it does not measure the quality of ordering predicted salient regions so that correct predictions are selected before false predictions. Lift is a measure often used in database marketing for this purpose and is defined as the percent of all targets (hits) in the first $p$% of the marketing list, sorted by decreasing score of the model, divided by $p$. The authors of [28, 29] previously addressed the measurement of lift quality in database marketing. The authors of [28] introduced a lift index that used a weighted sum of the items in the lift table. That index converged to the ratio of the area under the cumulative lift curve.

The L-quality measure described in [29] is similar to the lift index, and ranges from -1 (worst case), to 0 (random case), to 1 (optimal case). We apply the same basic formula for calculating L-quality, as shown below.

$$\text{L-quality}(M) = \frac{\text{SumCPH}(M) - \text{SumCPH}(R)}{\text{SumCPH}(B) - \text{SumCPH}(R)} \tag{4.9}$$

The term CPH denotes Cumulative Percent Hits, which is defined as lift multiplied by $p$% as explained above. The term SumCPH($M$) is defined as the area under the CPH curve for the model $M$. The terms SumCPH($R$) and SumCPH($B$) are defined as the areas under the CPH curve for the random model and for the optimal model respectively. The optimal case occurs when all targets are grouped at the beginning of the list. In our application, we sort a list of predicted regions instead of a list of potential customers. Instead of counting cumulative targets or hits, we count the number of cumulative ground truth salient regions that meet given overlap threshold requirements with a unique predicted region. While database applications involve numbers of potential customers large enough for practical expression as percentages on cumulative lift charts, the number of predicted regions is small enough to show on our charts as actual numbers.

Two cases that usually do not occur in database marketing applications would result in undefined L-qualities. First, every predicted region may meet the given overlap threshold and qualify as a true positive. In this case, all possible orders of the predicted regions have equal quality. Second, no ground truth region may be correctly predicted if the overlap threshold requirement is sufficiently high. Since all orderings are equivalent in these cases, we cannot evaluate the L-quality. Hence, it will be undefined.
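The sketch below computes L-quality directly from an ordered list of true-positive flags. It is a minimal reading of equation (4.9), taking the discrete area under each CPH curve as the mean of the cumulative hit percentages, and it returns None in the two undefined cases just described.

```python
import numpy as np

def l_quality(is_hit):
    """L-quality (eq. 4.9) of predicted regions in presentation order.

    is_hit -- boolean per predicted region, True where it is a true
    positive. Returns a value in [-1, 1], or None when undefined.
    """
    hits = np.asarray(is_hit, dtype=float)
    n, total = len(hits), hits.sum()
    if total == 0 or total == n:
        return None  # all orderings equivalent; L-quality undefined

    def sum_cph(h):
        # Discrete area under the cumulative-percent-hits curve.
        return (np.cumsum(h) / total).mean()

    model = sum_cph(hits)
    best = sum_cph(np.sort(hits)[::-1])        # all hits first
    rand = np.mean(np.arange(1, n + 1) / n)    # CPH of random order is p
    return (model - rand) / (best - rand)
```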

4.4 Casing Simulation Regional Results

Table 4.1 shows the regional results for the casing experiments evaluated with 10% and 50% overlap thresholds. These experiments used 12 partitions of training data, each from the first seven time steps. As discussed in Chapter 4.1, predicted regions were ordered by the ratio of region size to ground truth bolt size (GT ratio), by the ratio of region size to mean region size (size ratio), by region size from highest to lowest (size), and by the mean of the salient margins of the scaled probabilistic majority votes by ensembles for nodes in each region before smoothing (smm). Each method resulted in high L-qualities, which indicates that the ordering greatly improves the user experience compared to random ordering, by pointing to the most salient regions first and causing most false positives to lie near the end of the list. Regions were also ordered naturally, by time step (ts), from low to high, then by region number within each time step from low to high. The corresponding natural L-qualities are lower, but still above zero, which a random ordering would produce. Figure 4.1 shows the cumulative lift curve for the model trained using 250 random forests unweighted trees for each of the 12 training partitions of data, using a 10% overlap threshold. Predicted regions were ordered by how closely their number of nodes compared to the number of nodes (568) in each ground truth bolt. For reference, the ideal, random, natural (ordered by time step), and worst case lift curves are also shown.

Figure 4.2 shows the cumulative lift curves for a single decision tree (DT), for 250 RF unweighted trees, and for 250 RFW trees for each of the 12 training partitions of data, using an overlap threshold of 10%. The vertical height of the rightmost point of each curve shows the total number of ground truth salient regions for which predicted regions meet the 10% overlap threshold requirement.

Table 4.1: Casing regional results evaluated with 10% and 50% overlap thresholds. (Bold indicates the highest values.)

Class  OT%  GT   Preds  TP   FN  FP   Rec.  Prec.  F-m   L-q(GT ratio)  L-q(size)  L-q(size ratio)  L-q(smm)  L-q(ts)
DT     10   140  319    123  17  196  0.88  0.39   0.54  0.97           0.98       0.86             0.90      0.26
RF     10   140  251    124  16  127  0.89  0.49   0.63  0.98           0.97       0.93             0.87      0.33
RFW    10   140  257    115  25  142  0.82  0.45   0.58  0.96           0.95       0.94             0.94      0.37
mean:                                 0.86  0.44   0.58  0.97           0.97       0.91             0.90      0.32
sd:                                   0.04  0.05   0.05  0.01           0.02       0.04             0.04      0.06
DT     50   140  319    104  36  215  0.74  0.33   0.45  1.00           0.81       0.91             0.89      0.37
RF     50   140  251    109  31  142  0.78  0.43   0.56  0.99           0.79       0.95             0.89      0.43
RFW    50   140  257    100  40  157  0.71  0.39   0.50  0.99           0.76       0.97             0.91      0.48
mean:                                 0.74  0.38   0.50  0.99           0.79       0.94             0.90      0.43
sd:                                   0.04  0.05   0.06  0.01           0.03       0.03             0.01      0.06

Class: classifier; OT: overlap threshold; GT: ground truth regions; Preds: predicted regions; TP: true positives; FN: false negatives; FP: false positives; Rec.: Recall; Prec.: Precision; F-m: F-measure; smm: salient margin mean; ts: time step.

Of the 140 actual ground truth regions, RFW correctly predicted 115, DT 123, and RF 124. The vertical distance of the rightmost point below the top of the chart indicates the number of false negative regions. The horizontal location of the rightmost point of each curve shows the total number of predicted regions, including true and false positives. RF has 251, RFW has 257, and DT has 319 predicted regions. The high L-qualities for all three classifier/ensemble methods indicate that false positives are mostly added at the end, after the user has seen almost all correctly identified salient regions. Figure 4.3 shows the RF cumulative lift curves, evaluated with 10%, 50%, and 80% overlap thresholds. As expected, a higher overlap threshold requirement results in a lower final height of the curve (fewer true positive regions). Predicted regions in Figures 4.2 and 4.3 were ordered by how closely their number of nodes compared to the number of nodes (568) in each ground truth bolt.

Figure 4.1: A visualization of the casing cumulative lift curve for the model trained using random forests unweighted ensembles and evaluated with 10% overlap threshold. The ideal, natural, random, and worst case lift curves are also shown. (Axes: predicted region # vs. # of TP regions.)

Figure 4.2: A visualization of the casing cumulative lift curves for the models trained using a single decision tree, and random forests unweighted and weighted ensembles, evaluated with 10% overlap thresholds. (Axes: predicted region # vs. # of TP regions; curves: RF, DT, RFW.)

Figure 4.3: A visualization of the casing cumulative lift curves for the model trained using random forests unweighted ensembles, evaluated with 10%, 50%, and 80% overlap thresholds. (Axes: predicted region # vs. # of TP regions.)

Figures 4.4 and 4.5 show different casing simulation surface views of time steps 7 to 20 with predicted salient (bolt) regions for the RF model trained on time steps 0 to 6 partitioned data. The ground (bottom slab in Figure 1.1a) does not contain any predicted regions and thus has been hidden in these figures for better viewing of the remaining blocks. There are ten bolts, which exist for each time step. The leftmost and rightmost bolts in each subfigure appear in both figures, but the other four bolts in each subfigure are unique to the figure. True and false positive counts for each time step were computed using a 10% overlap threshold. The number of true positive regions decreases in the final four time steps, and the number of false positive regions increases in the final seven time steps. These final time steps, when forces peak, are the farthest in time from the training time steps (0 to 6) and are more difficult to predict correctly. Some of the largest predicted regions appear in these time steps. Large predicted regions overlap with two or three bolts in time steps 19 and 20 of Figure 4.5. The evaluation metrics discussed in Chapter 4.3 only permit a single true positive to be counted for each of these regions. Most of the false positive regions are ordered after the true positive regions, as shown in the cumulative lift curves in Figure 4.3.

(a) Time step 7: 10 TP, 1 FP. (b) Time step 8: 10 TP, 0 FP. (c) Time step 9: 10 TP, 1 FP. (d) Time step 10: 10 TP, 10 FP. (e) Time step 11: 10 TP, 6 FP. (f) Time step 12: 9 TP, 9 FP. (g) Time step 13: 8 TP, 5 FP. (h) Time step 14: 10 TP, 11 FP. (i) Time step 15: 10 TP, 15 FP. (j) Time step 16: 10 TP, 19 FP. (k) Time step 17: 7 TP, 12 FP. (l) Time step 18: 7 TP, 13 FP. (m) Time step 19: 7 TP, 13 FP. (n) Time step 20: 6 TP, 11 FP.

Figure 4.4: Casing simulation surface view of time steps 7 to 20 with predicted salient (bolt) regions for the RF model trained on time steps 0 to 6 partitioned data. Predicted bolt regions are red (in color) or darker (in grayscale) regions. TP (true positives) and FP (false positives) were computed using a 10% overlap threshold.

(a) Time step 7: 10 TP, 1 FP. (b) Time step 8: 10 TP, 0 FP. (c) Time step 9: 10 TP, 1 FP. (d) Time step 10: 10 TP, 10 FP. (e) Time step 11: 10 TP, 6 FP. (f) Time step 12: 9 TP, 9 FP. (g) Time step 13: 8 TP, 5 FP. (h) Time step 14: 10 TP, 11 FP. (i) Time step 15: 10 TP, 15 FP. (j) Time step 16: 10 TP, 19 FP. (k) Time step 17: 7 TP, 12 FP. (l) Time step 18: 7 TP, 13 FP. (m) Time step 19: 7 TP, 13 FP. (n) Time step 20: 6 TP, 11 FP.

Figure 4.5: Casing simulation alternate surface view of time steps 7 to 20 with predicted salient (bolt) regions for the RF model trained on time steps 0 to 6 partitioned data. Predicted bolt regions are red (in color) or darker (in grayscale) regions. TP (true positives) and FP (false positives) were computed using a 10% overlap threshold.

4.5 Canister Tear Simulation Regional Results

Tables 4.2, 4.3, and 4.4 show the canister tear regional results with 10% overlap thresholds using DT, RF, and RFW respectively. The training data from each run was from 14 partitions of a single time step. Most of the lower F-measures involve run 1 (baseline) as either the test run or the training run time step. Run 1 has the fewest salient nodes of any run, and is the only run that has more than one salient region in a single time step (seven time steps each have two smaller salient regions, including the training time step). Only 3 of the 14 partitions of run 1 have at least 50 salient nodes. As discussed in Chapter 4.1, predicted regions were ordered by region size from highest to lowest, by the ratio of region size to mean region size, and by the mean of the salient margins of the scaled probabilistic majority votes by ensembles for nodes in each region before smoothing. Regions were also ordered naturally, by time step (ts), from low to high, then by region number within each time step from low to high. While some of each table's entries show an L-quality of 1.00, the most notable of these is the 89 false positive regions in Table 4.2, all presented after the 29 true positives. A high L-quality is more significant when there are many false positives along with many true positives.

Table 4.2: Canister tear DT regional results evaluated with 10% overlap threshold.

Train  Test  GT  Preds  TP  FN  FP   Rec.  Prec.  F-m   L-q(size)  L-q(size ratio)  L-q(smm)  L-q(ts)
1      1     15  53     8   7   45   0.53  0.15   0.24  0.96       0.57             0.86      -0.01
2      1     17  30     4   13  26   0.24  0.13   0.17  0.71       0.71             0.67      -0.23
3      1     17  36     4   13  32   0.24  0.11   0.15  0.92       0.53             0.53      -0.86
4      1     17  17     5   12  12   0.29  0.29   0.29  0.83       0.17             0.53      -0.27
1      2     10  66     10  0   56   1.00  0.15   0.26  0.94       0.56             0.88       0.06
2      2     9   19     9   0   10   1.00  0.47   0.64  1.00       1.00             0.96       0.51
3      2     10  61     9   1   52   0.90  0.15   0.25  0.97       0.74             0.94      -0.65
4      2     10  18     10  0   8    1.00  0.56   0.71  1.00       0.93             1.00       0.45
1      3     10  75     10  0   65   1.00  0.13   0.24  0.87       0.74             0.86       0.22
2      3     10  20     10  0   10   1.00  0.50   0.67  1.00       0.60             0.80       0.60
3      3     9   45     9   0   36   1.00  0.20   0.33  1.00       0.83             0.96      -0.13
4      3     10  24     10  0   14   1.00  0.42   0.59  1.00       1.00             1.00       0.41
1      4     29  194    25  4   169  0.86  0.13   0.22  0.91       0.20             0.87       0.00
2      4     29  67     29  0   38   1.00  0.43   0.60  1.00       0.98             0.91       0.34
3      4     29  117    28  1   89   0.97  0.24   0.38  1.00       0.81             0.95      -0.39
4      4     28  57     28  0   29   1.00  0.49   0.66  1.00       1.00             0.98       0.39
mean:                                0.81  0.29   0.40  0.94       0.71             0.86       0.03
sd:                                  0.30  0.16   0.20  0.08       0.26             0.15       0.43

GT: ground truth regions; Preds: predicted regions; TP: true positives; FN: false negatives; FP: false positives; Rec.: Recall; Prec.: Precision; F-m: F-measure; smm: salient margin mean; ts: time step.

Table 4.3: Canister tear RF unweighted regional results evaluated with 10% overlap threshold.

Train  Test  GT  Preds  TP  FN  FP  Rec.  Prec.  F-m   L-q(size)  L-q(size ratio)  L-q(smm)  L-q(ts)
1      1     15  18     14  1   4   0.93  0.78   0.85  0.14        0.93             0.96     -0.04
2      1     17  12     7   10  5   0.41  0.58   0.48  0.20       -0.20            -0.14     -0.20
3      1     17  17     5   12  12  0.29  0.29   0.29  0.63        0.40             0.37      0.20
4      1     17  10     7   10  3   0.41  0.70   0.52  0.14       -1.00            -0.52     -0.14
1      2     10  21     10  0   11  1.00  0.48   0.65  0.36        0.47             0.53      0.38
2      2     9   13     9   0   4   1.00  0.69   0.82  1.00        1.00             1.00      0.61
3      2     10  12     10  0   2   1.00  0.83   0.91  1.00        1.00             1.00      0.60
4      2     10  12     10  0   2   1.00  0.83   0.91  1.00        1.00             1.00     -0.10
1      3     10  20     7   3   13  0.70  0.35   0.47  0.58        0.69             0.56      0.52
2      3     10  13     10  0   3   1.00  0.77   0.87  1.00        1.00             1.00      0.60
3      3     9   9      9   0   0   1.00  1.00   1.00  ND          ND               ND        ND
4      3     10  16     10  0   6   1.00  0.62   0.77  1.00        1.00             1.00      0.40
1      4     29  67     26  3   41  0.90  0.39   0.54  0.41        0.25             0.24      0.22
2      4     29  60     29  0   31  1.00  0.48   0.65  0.68        0.54             0.80      0.39
3      4     29  51     29  0   22  1.00  0.57   0.73  1.00        1.00             0.71      0.44
4      4     28  33     28  0   5   1.00  0.85   0.92  1.00        1.00             1.00      0.01
mean:                               0.85  0.64   0.71  0.68        0.61             0.63      0.26
sd:                                 0.25  0.20   0.20  0.35        0.57             0.47      0.29

GT: ground truth regions; Preds: predicted regions; TP: true positives; FN: false negatives; FP: false positives; Rec.: Recall; Prec.: Precision; F-m: F-measure; smm: salient margin mean; ts: time step; ND: not defined.

Table 4.4: Canister tear RFW regional results evaluated with 10% overlap threshold.

Train  Test  GT  Preds  TP  FN  FP  Rec.  Prec.  F-m   L-q(size)  L-q(size ratio)  L-q(smm)  L-q(ts)
1      1     15  18     15  0   3   1.00  0.83   0.91  0.51        0.96             0.91      0.64
2      1     17  10     5   12  5   0.29  0.50   0.37  0.60       -0.60            -0.92     -0.60
3      1     17  17     5   12  12  0.29  0.29   0.29  0.63        0.40             0.30      0.17
4      1     17  11     6   11  5   0.35  0.55   0.43  0.47       -0.60            -0.33     -0.27
1      2     10  18     10  0   8   1.00  0.56   0.71  0.18        0.40             0.18      0.30
2      2     9   10     9   0   1   1.00  0.90   0.95  1.00        1.00             1.00      0.33
3      2     10  11     10  0   1   1.00  0.91   0.95  1.00        1.00             1.00      0.40
4      2     10  11     10  0   1   1.00  0.91   0.95  1.00        1.00             1.00      0.20
1      3     10  19     9   1   10  0.90  0.47   0.62  0.40       -0.09             0.53      0.29
2      3     10  15     10  0   5   1.00  0.67   0.80  1.00        1.00             1.00      0.56
3      3     9   9      9   0   0   1.00  1.00   1.00  ND          ND               ND        ND
4      3     10  16     10  0   6   1.00  0.62   0.77  1.00        1.00             1.00      0.30
1      4     29  61     24  5   37  0.83  0.39   0.53  0.26        0.23             0.26      0.11
2      4     29  31     29  0   2   1.00  0.94   0.97  1.00        1.00             0.45     -0.17
3      4     29  47     29  0   18  1.00  0.62   0.76  1.00        1.00             1.00      0.54
4      4     28  30     28  0   2   1.00  0.93   0.97  1.00        1.00             0.89     -0.11
mean:                               0.85  0.69   0.75  0.74        0.58             0.55      0.18
sd:                                 0.27  0.22   0.23  0.31        0.60             0.58      0.34

GT: ground truth regions; Preds: predicted regions; TP: true positives; FN: false negatives; FP: false positives; Rec.: Recall; Prec.: Precision; F-m: F-measure; smm: salient margin mean; ts: time step; ND: not defined.

An illustration of the canister tear run 1 cumulative lift curves for the model trained using 250 random forests weighted trees for each of the 14 training partitions of data appears in Figure 4.6. Predicted regions were ordered by their size (number of nodes) from large to small. Similarly, the cumulative lift curves for the canister tear runs 2, 3, and 4 are shown in Figures 4.7, 4.8, and 4.9. As discussed above, most of the lower F-measures involve run 1 (baseline), and can be distinguished by a lower and/or farther right final point on the lift curve. Those lift curves with higher L-qualities have more diagonally upward steps (true positives) for the initial predictions and more horizontal steps (false positives) for the final predictions.

Figure 4.6: A visualization of the canister tear cumulative lift curves for the models trained using RFW ensembles and evaluated with 10% overlap threshold on canister tear run 1. In each case the time step (ts) of the training run (train) is specified. (Axes: predicted region # vs. # of TP regions; curves: train 1, ts 5; train 2, ts 5; train 3, ts 5; train 4, ts 15.)

Figure 4.7: A visualization of the canister tear cumulative lift curves for the models trained using RFW ensembles and evaluated with 10% overlap threshold on canister tear run 2. In each case the time step (ts) of the training run (train) is specified.

Figure 4.8: A visualization of the canister tear cumulative lift curves for the models trained using RFW ensembles and evaluated with 10% overlap threshold on canister tear run 3. In each case the time step (ts) of the training run (train) is specified.

Figure 4.9: A visualization of the canister tear cumulative lift curves for the models trained using RFW ensembles and evaluated with 10% overlap threshold on canister tear run 4. In each case the time step (ts) of the training run (train) is specified.

Figures 4.10 to 4.18 show canister tear run four ground truth salient regions and predicted salient regions for the model trained on run 2, time step 5 partitioned data using RFW ensembles and an overlap threshold of 10%. The wireframe view shows the edges that connect salient nodes. All nodes with unknown salience have been removed from these figures for better viewing. Since all regions are shown to the same scale, it is evident that as time steps increase from one to 30, both ground truth salient and predicted salient regions increase in size. Table 4.4 and Figure 4.9 show that of the 31 predicted salient regions, 29 are true positives and two are false positives. Time step one has a false positive because a salient region was predicted and there is no ground truth salient region. Time step 26 also has a false positive, which is too small to see in the figure. Note that although ground truth salient regions and/or predicted salient regions may sometimes appear to be separate in a given time step, these regions met the maximum edge distance of two units separating them, and thus were labeled as belonging to the same region, as discussed in Chapter 4.2. When predicted regions were ordered by size or size ratio, the L-qualities were 1.0 for each method, which signifies that the two false positive regions were ordered last, after all 29 true positive regions.

(a) Time step 1. Since there are no ground truth salient regions on the left, the predicted salient region on the right is a false positive, with 214 nodes.
(b) Time step 2. Ground truth (left) has 133 nodes. Predicted (right) has 540 nodes. Overlap is 129 nodes.
(c) Time step 3. Ground truth (left) has 361 nodes. Predicted (right) has 774 nodes. Overlap is 349 nodes.
(d) Time step 4. Ground truth (left) has 621 nodes. Predicted (right) has 769 nodes. Overlap is 341 nodes.
(e) Time step 5. Ground truth (left) has 694 nodes. Predicted (right) has 952 nodes. Overlap is 427 nodes.
(f) Time step 6. Ground truth (left) has 665 nodes. Predicted (right) has 1002 nodes. Overlap is 362 nodes.

Figure 4.10: Canister tear run 4 (time steps 1 to 6) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.

(a) Time step 7. Ground truth (left) has 686 nodes. Predicted (right) has 1039 nodes. Overlap is 323 nodes.
(b) Time step 8. Ground truth (left) has 1182 nodes. Predicted (right) has 1502 nodes. Overlap is 871 nodes.
(c) Time step 9. Ground truth (left) has 1270 nodes. Predicted (right) has 1521 nodes. Overlap is 907 nodes.
(d) Time step 10. Ground truth (left) has 1527 nodes. Predicted (right) has 1574 nodes. Overlap is 1044 nodes.

Figure 4.11: Canister tear run 4 (time steps 7 to 10) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.

(a) Time step 11. Ground truth (left) has 1363 nodes. Predicted (right) has 1629 nodes. Overlap is 908 nodes.
(b) Time step 12. Ground truth (left) has 1288 nodes. Predicted (right) has 1874 nodes. Overlap is 1067 nodes.
(c) Time step 13. Ground truth (left) has 1276 nodes. Predicted (right) has 1962 nodes. Overlap is 1006 nodes.
(d) Time step 14. Ground truth (left) has 1240 nodes. Predicted (right) has 2010 nodes. Overlap is 1041 nodes.

Figure 4.12: Canister tear run 4 (time steps 11 to 14) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.

(a) Time step 15. Ground truth (left) has 1860 nodes. Predicted (right) has 2140 nodes. Overlap is 1411 nodes.
(b) Time step 16. Ground truth (left) has 1579 nodes. Predicted (right) has 2148 nodes. Overlap is 1260 nodes.
(c) Time step 17. Ground truth (left) has 2020 nodes. Predicted (right) has 2231 nodes. Overlap is 1364 nodes.

Figure 4.13: Canister tear run 4 (time steps 15 to 17) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.

(a) Time step 18. Ground truth (left) has 2099 nodes. Predicted (right) has 2335 nodes. Overlap is 1542 nodes.
(b) Time step 19. Ground truth (left) has 1846 nodes. Predicted (right) has 2444 nodes. Overlap is 1383 nodes.
(c) Time step 20. Ground truth (left) has 2999 nodes. Predicted (right) has 2661 nodes. Overlap is 1790 nodes.

Figure 4.14: Canister tear run 4 (time steps 18 to 20) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.

(a) Time step 21. Ground truth (left) has 2890 nodes. Predicted (right) has 2675 nodes. Overlap is 1643 nodes.
(b) Time step 22. Ground truth (left) has 3079 nodes. Predicted (right) has 2797 nodes. Overlap is 1874 nodes.
(c) Time step 23. Ground truth (left) has 3528 nodes. Predicted (right) has 2698 nodes. Overlap is 1902 nodes.

Figure 4.15: Canister tear run 4 (time steps 21 to 23) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.

(a) Time step 24. Ground truth (left) has 2793 nodes. Predicted (right) has 2913 nodes. Overlap is 1615 nodes.
(b) Time step 25. Ground truth (left) has 3323 nodes. Predicted (right) has 2753 nodes. Overlap is 1827 nodes.
(c) Time step 26. Ground truth (left) has 3248 nodes. Predicted (right) has 2951 nodes. Overlap is 1910 nodes. A false positive region with one node is too small to be seen in this view.

Figure 4.16: Canister tear run 4 (time steps 24 to 26) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.

(a) Time step 27. Ground truth (left) has 4381 nodes. Predicted (right) has 3089 nodes. Overlap is 2143 nodes.
(b) Time step 28. Ground truth (left) has 4000 nodes. Predicted (right) has 3214 nodes. Overlap is 1958 nodes.

Figure 4.17: Canister tear run 4 (time steps 27 to 28) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.

(a) Time step 29. Ground truth (left) has 4342 nodes. Predicted (right) has 3251 nodes. Overlap is 2062 nodes.
(b) Time step 30. Ground truth (left) has 3727 nodes. Predicted (right) has 3631 nodes. Overlap is 2223 nodes.

Figure 4.18: Canister tear run 4 (time steps 29 to 30) ground truth salient regions (left) and predicted salient regions (right) for the RFW model trained on run 2 time step 5 partitioned data.

While precision, recall, and the F-measure are computed on unordered sets of predicted regions, L-quality is computed on ordered or ranked sets of predicted regions. Another ranking quality method is the precision-recall curve, which shows the precision at increasing recall levels. The standard curve is usually smoothed to remove sawtooth patterns by using interpolated precision, which is the highest precision found for any recall level greater than or equal to the given recall level. Eleven-point interpolated precision-recall graphs average the interpolated precision at eleven fixed recall levels from zero to one [45]. Figure 4.19 shows an eleven-point interpolated precision-recall graph averaged across 16 canister tear train/test combinations using RFW ensembles and all four runs. Curves for five overlap thresholds (10% to 50%) between predicted and ground truth regions are graphed. It is conventional in interpolated precision-recall curves to plot the precision as zero for recall levels that are never reached due to not retrieving all relevant documents (in our case, due to not always predicting all salient regions that meet a given overlap threshold) [45]. In order to avoid confusion that this non-intuitive convention might cause, we have simply elected not to plot any such zero precision points for the corresponding recall levels not reached in all of the curves that were averaged. However, plotted points that represent the average interpolated precision of multiple curves may include points from one or more (but not all) of the curves that were assigned the zero precision value.

As expected, as more salient regions are retrieved from the ranked list (the recall increases), some more false positive regions are also retrieved (the precision decreases). In general, the curves meeting lower overlap threshold requirements show the best performance and are closest to the upper right corner of the graph, where recall and precision are maximized.
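The eleven-point construction is sketched below for a single curve, assuming recalls and precisions are the raw points of that curve. Following the plotting choice just described, unreached recall levels are returned as NaN rather than zero; the averaging across curves is omitted, and the function name is illustrative.

```python
import numpy as np

def eleven_point_interpolated_pr(recalls, precisions):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 [45].

    Interpolated precision at recall r is the highest precision at any
    recall >= r; levels the curve never reaches come back as NaN.
    """
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    levels = np.linspace(0.0, 1.0, 11)
    interp = []
    for r in levels:
        reachable = recalls >= r
        interp.append(precisions[reachable].max() if reachable.any()
                      else np.nan)
    return levels, np.array(interp)
```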

Figures 4.20 and 4.21 show the precision-recall graphs for RF and DT that correspond to Figure 4.19 for RFW. In general, RFW has slightly better results than RF, followed by DT.

Figure 4.22 shows an eleven-point interpolated precision-recall graph averaged across nine canister tear train/test combinations using RFW ensembles and runs two to four. This figure leaves out canister tear run one, which was included in Figure 4.19 and which, for reasons discussed earlier in this chapter, shows considerably less accuracy when used for training or testing. Perfect precision of one is shown for recall levels up to one for overlap thresholds up to 10%. Figure 4.23 shows an eleven-point interpolated precision-recall graph averaged across seven canister tear train/test combinations using RFW ensembles and run one for training and/or testing. This figure shows the less accurate results for run one, which was left out of Figure 4.22.

Figure 4.19: Eleven-point interpolated precision-recall graph averaged across 16 canister tear train/test combinations using RFW ensembles and all four runs. Curves are shown for five overlap thresholds (OT) from 10% to 50%. (Axes: Recall vs. Interpolated Precision.)

Figure 4.20: Eleven-point interpolated precision-recall graph averaged across 16 canister tear train/test combinations using RF ensembles and all four runs. Curves are shown for five overlap thresholds (OT) from 10% to 50%.

Figure 4.21: Eleven-point interpolated precision-recall graph averaged across 16 canister tear train/test combinations using DT ensembles and all four runs. Curves are shown for five overlap thresholds (OT) from 10% to 50%.

Figure 4.22: Eleven-point interpolated precision-recall graph averaged across nine canister tear train/test combinations using RFW ensembles and runs two to four. Curves are shown for five overlap thresholds (OT) from 10% to 50%.

Figure 4.23: Eleven-point interpolated precision-recall graph averaged across seven canister tear train/test combinations using RFW ensembles and run one for training and/or testing. Curves are shown for five overlap thresholds (OT) from 10% to 50%.

4.6 Labeling Noise Results

Labeling noise experiments were performed on the casing simulation, since it has perfect ground truth labeling for all bolts (salient regions). The ground truth labels were changed in the training data only, so that 1%, 5%, 10%, 15%, and 20% of the 568 nodes in each bolt were mislabeled as unknown instead of salient. Each bolt has 11 layers of nodes, and noise was thus added to the exterior surface nodes of layer(s) beginning farthest from the bolt head, as required, as shown in Figure 4.24. In separate experiments, the labels of the same number of nodes closest to each bolt were changed from unknown to salient at the above five noise levels, as shown in Figure 4.25. In other words, the first set of five noise levels decreased the number of bolt nodes that were labeled correctly, and the second set increased the number of nodes adjacent to each bolt that were labeled incorrectly as bolt nodes.

Random forests unweighted (RF), weighted (RFW), and a single decision tree (DT) were separately trained on each of the 12 partitions of the training data. For reference, a single decision tree was also trained on all of the combined training data (SDT), although this method would not be feasible for much larger datasets. An overlap threshold of 10% was used. The results of the labeling noise experiments for RFW are shown in Table 4.5 and Figure 4.26. The results for RF, DT, and SDT are shown in the Appendix: Tables 4.6, 4.7, and 4.8, and Figures 4.27, 4.28, and 4.29. All figures show cumulative lift curves that were produced with the ground truth (GT) ratio method of ordering predicted regions, as discussed in Chapter 4.1. For all noise level cases, the L-qualities remained consistently high. Methods that produce high L-qualities compensate for low precision and make higher recall the key measurement, since most true positive regions (bolts) are detected before false positive regions.
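As a rough illustration of the first kind of noise injection, the sketch below mislabels the requested fraction of each bolt's nodes, choosing those farthest from the bolt head. It is a simplified stand-in for the procedure described above (the real experiments walked the 11 node layers from the end of each bolt), and every name in it is hypothetical.

```python
import numpy as np

UNKNOWN = 0  # assumed class encoding: 0 = unknown, 1 = salient (bolt)

def add_bolt_noise(coords, labels, bolt_ids, bolt_head_xyz, frac):
    """Flip `frac` of each bolt's labels to unknown, farthest-first.

    coords: (n, 3) node positions; labels: (n,) class labels;
    bolt_ids: (n,) bolt id per node, -1 for non-bolt nodes;
    bolt_head_xyz: mapping from bolt id to its head position.
    """
    noisy = labels.copy()
    for b in np.unique(bolt_ids[bolt_ids >= 0]):
        nodes = np.flatnonzero(bolt_ids == b)
        dist = np.linalg.norm(coords[nodes] - bolt_head_xyz[b], axis=1)
        k = int(round(frac * len(nodes)))
        if k:
            # Mislabel the k nodes farthest from the bolt head.
            noisy[nodes[np.argsort(-dist)[:k]]] = UNKNOWN
    return noisy
```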

(a) 0% bolt noise. (b) 1% bolt noise. (c) 5% bolt noise. (d) 10% bolt noise. (e) 15% bolt noise. (f) 20% bolt noise.

Figure 4.24: Bolt labeling noise of 0%, 1%, 5%, 10%, 15%, and 20% shown for a typical bolt. Red (in color) or darker (in grayscale) wireframe outlines bolt nodes that are labeled bolt. Blue (in color) or lighter (in grayscale) wireframe outlines bolt nodes that are labeled non-bolt.

(a) 0% non-bolt noise. (b) 1% non-bolt noise. (c) 5% non-bolt noise. (d) 10% non-bolt noise. (e) 15% non-bolt noise. (f) 20% non-bolt noise.

Figure 4.25: Non-bolt labeling noise of 0%, 1%, 5%, 10%, 15%, and 20% shown for a typical bolt. Red (in color) or darker (in grayscale) wireframe outlines non-bolt nodes that are labeled bolt. Blue (in color) or lighter (in grayscale) wireframe outlines bolt nodes that are labeled bolt.

While the recall for each method steadily decreased as the non-bolt noise level increased, there were exceptions as the bolt noise level increased. Both SDT and RFW showed a small gain in recall for some increase(s) in bolt noise levels. DT showed a larger decrease in recall (though still high overall) at bolt noise levels of 1% and 10% than at the other bolt noise levels. This was likely due to a combination of decision tree instability and partitioning of the data.

The decision tree methods each proved more robust to noise inside the bolts than the random forests methods at the 15% and 20% noise levels. Conversely, both decision tree methods were less robust to 20% noise added to nodes outside the bolts. For the recall at the bolt and non-bolt 10% noise levels, SDT averaged a recall of 0.89, followed by RFW at 0.63, RF at 0.62, and DT at 0.55. Excluding SDT, which is not viable for larger datasets, RFW performed best overall for both types of noise. SDT and DT predicted more salient regions (true and false positives combined) than RF and RFW, and the number of predicted regions for all methods tended to decrease at the higher noise levels. In general, our methods are best used for noise levels less than about 10%.

Table 4.5: Casing simulation RFW bolt labeling noise results evaluated with 10% overlap threshold.

Noise level  Noise type  GT   Preds  TP   FN   FP   Rec   Prec  F-m   L-q(GT ratio)  L-q(size)  L-q(size ratio)  L-q(ts)
0%           NA          140  257    115  25   142  0.82  0.45  0.58  0.96           0.95       0.94             0.37
1%           b           140  250    118  22   132  0.84  0.47  0.61  0.97           0.95       0.93             0.33
5%           b           140  226    120  20   106  0.86  0.53  0.66  0.92           0.91       0.89             0.37
10%          b           140  195    90   50   105  0.64  0.46  0.54  0.91           0.87       0.87             0.32
15%          b           140  174    60   80   114  0.43  0.34  0.38  0.93           0.93       0.80             0.26
20%          b           140  106    25   115  81   0.18  0.24  0.20  0.86           0.86       0.59             0.19
1%           nb          140  253    104  36   149  0.74  0.41  0.53  0.93           0.94       0.92             0.38
5%           nb          140  230    98   42   132  0.70  0.43  0.53  0.94           0.88       0.95             0.38
10%          nb          140  192    86   54   106  0.61  0.45  0.52  0.95           0.79       0.94             0.38
15%          nb          140  192    78   62   114  0.56  0.41  0.47  0.95           0.85       0.96             0.42
20%          nb          140  160    71   69   89   0.51  0.44  0.47  0.93           0.76       0.97             0.39

NA: not applicable; b: bolt; nb: non-bolt; GT: ground truth regions; Preds: predicted regions; TP: true positives; FN: false negatives; FP: false positives; Rec: Recall; Prec: Precision; F-m: F-measure; ts: time step.

(a) Bolt noise. Top to bottom: 5, 1, 0, 10, 15, 20. (b) Non-bolt noise. Top to bottom: 0 to 20.

Figure 4.26: Casing simulation bolt and non-bolt noise cumulative lift curves using RFW. (Axes: predicted region # vs. # of TP regions; curves for 0%, 1%, 5%, 10%, 15%, and 20% noise.)

Table 4.6: Casing simulation RF bolt labeling noise results evaluated with 10% overlap threshold.

Noise level  Noise type  GT   Preds  TP   FN   FP   Rec   Prec  F-m   L-q(GT ratio)  L-q(size)  L-q(size ratio)  L-q(ts)
0%           NA          140  251    124  16   127  0.89  0.49  0.63  0.98           0.97       0.93             0.33
1%           b           140  241    118  22   123  0.84  0.49  0.62  0.97           0.98       0.92             0.32
5%           b           140  228    106  34   122  0.76  0.46  0.58  0.87           0.86       0.83             0.35
10%          b           140  212    67   73   145  0.48  0.32  0.38  0.84           0.83       0.78             0.17
15%          b           140  171    52   88   119  0.37  0.30  0.33  0.87           0.86       0.75             0.20
20%          b           140  132    22   118  110  0.16  0.17  0.16  0.90           0.90       0.47             0.18
1%           nb          140  270    115  25   155  0.82  0.43  0.56  0.95           0.96       0.93             0.31
5%           nb          140  263    115  25   148  0.82  0.44  0.57  0.96           0.94       0.90             0.27
10%          nb          140  249    107  33   142  0.76  0.43  0.55  0.95           0.95       0.94             0.25
15%          nb          140  234    92   48   142  0.66  0.39  0.49  0.95           0.89       0.96             0.31
20%          nb          140  189    79   61   110  0.56  0.42  0.48  0.95           0.85       0.97             0.34

NA: not applicable; b: bolt; nb: non-bolt; GT: ground truth regions; Preds: predicted regions; TP: true positives; FN: false negatives; FP: false positives; Rec: Recall; Prec: Precision; F-m: F-measure; ts: time step.

(a) Bolt noise. Top to bottom: 0 to 20. (b) Non-bolt noise. Top to bottom: 0 to 20.

Figure 4.27: Casing simulation bolt and non-bolt noise cumulative lift curves using RF.

Table 4.7: Casing simulation DT bolt labeling noise results evaluated with 10% overlap threshold.

Noise level  Noise type  GT   Preds  TP   FN   FP   Rec   Prec  F-m   L-q(GT ratio)  L-q(size)  L-q(size ratio)  L-q(ts)
0%           NA          140  319    123  17   196  0.88  0.39  0.54  0.97           0.98       0.86             0.26
1%           b           140  264    96   44   168  0.69  0.36  0.48  0.98           0.90       0.98             0.37
5%           b           140  277    122  18   155  0.87  0.44  0.59  0.94           0.94       0.93             0.22
10%          b           140  213    87   53   126  0.62  0.41  0.49  0.93           0.91       0.97             0.44
15%          b           140  253    121  19   132  0.86  0.48  0.62  0.97           0.92       0.95             0.18
20%          b           140  266    100  40   166  0.71  0.38  0.49  0.97           0.92       0.96             0.30
1%           nb          140  300    112  28   188  0.80  0.37  0.51  0.94           0.99       0.85             0.25
5%           nb          140  258    83   57   175  0.59  0.32  0.42  0.93           0.91       0.95             0.36
10%          nb          140  288    67   73   221  0.48  0.23  0.31  0.93           0.91       0.95             0.40
15%          nb          140  273    54   86   219  0.39  0.20  0.26  0.93           0.89       0.96             0.35
20%          nb          140  240    41   99   199  0.29  0.17  0.22  0.91           0.87       0.96             0.64

NA: not applicable; b: bolt; nb: non-bolt; GT: ground truth regions; Preds: predicted regions; TP: true positives; FN: false negatives; FP: false positives; Rec: Recall; Prec: Precision; F-m: F-measure; ts: time step.

(a) Bolt noise. Top to bottom: 0, 5, 15, 20, 1, 10. (b) Non-bolt noise. Top to bottom: 0 to 20.

Figure 4.28: Casing simulation bolt and non-bolt noise cumulative lift curves using DT.

Table 4.8: Casing simulation SDT bolt labeling noise results evaluated with 10% overlap threshold.

Noise level  Noise type  GT   Preds  TP   FN   FP   Rec   Prec  F-m   L-q(GT ratio)  L-q(size)  L-q(size ratio)  L-q(ts)
0%           NA          140  323    133  7    190  0.95  0.41  0.57  0.97           1.00       0.84             0.28
1%           b           140  345    131  9    214  0.94  0.38  0.54  0.97           0.99       0.83             0.29
5%           b           140  378    131  9    247  0.94  0.35  0.51  0.97           1.00       0.83             0.18
10%          b           140  351    130  10   221  0.93  0.37  0.53  1.00           1.00       0.89             0.20
15%          b           140  323    129  11   194  0.92  0.40  0.56  0.99           0.99       0.86             0.25
20%          b           140  348    135  5    213  0.96  0.39  0.55  0.99           1.00       0.85             0.23
1%           nb          140  334    132  8    202  0.94  0.40  0.56  0.98           0.99       0.82             0.27
5%           nb          140  240    126  14   114  0.90  0.53  0.66  0.99           1.00       0.97             0.20
10%          nb          140  236    117  23   119  0.84  0.50  0.62  0.97           0.99       0.93            -0.02
15%          nb          140  317    87   53   230  0.62  0.27  0.38  0.89           0.96       0.80             0.27
20%          nb          140  186    20   120  166  0.14  0.11  0.12  0.92           0.82       0.97            -0.33

NA: not applicable; b: bolt; nb: non-bolt; GT: ground truth regions; Preds: predicted regions; TP: true positives; FN: false negatives; FP: false positives; Rec: Recall; Prec: Precision; F-m: F-measure; ts: time step.

(a) Bolt noise. Top to bottom: 20, 0 to 15. (b) Non-bolt noise. Top to bottom: 0 to 20.

Figure 4.29: Casing simulation bolt and non-bolt noise cumulative lift curves using SDT.

4.7 Voting Method Results

A simple majority vote (mv) and a probabilistic majority vote (pmv) without scaling each produce inferior results when compared to a scaled probabilistic majority vote (spmv) for the experiments in this dissertation. If only partitions that have examples of both classes (unknown and salient) are used for training, the probabilistic component of spmv is the same for both classes and has no effect on each classifier or ensemble vote. However, both mv and a scaled majority vote (smv) using only two-class partitions could be used to examine alternative voting methods. Table 4.9 shows the average results of such tests on the 16 canister tear train-test combinations. The smv method almost always yields higher recall, precision, and F-measure than the mv method. The final three lines of Table 4.10 show the corresponding spmv results, which are better than the mv and smv results for recall, precision, and F-measure in Table 4.9 for RF and RFW. The difference lies in the adjustment to the probabilistic component of the scaled probabilistic majority vote, which assigns higher relative weights to salient votes in these cases. Of course, this adjustment could be made without building one-class partition classifiers, since they always predict the class as unknown. In addition, an analysis of the actual multipliers of the scaled probabilistic majority vote can sometimes lead to further simplification in the case of unweighted ensembles. For instance, there are six training partitions, each with at least one salient example, in canister tear run one, time step 5. Two of the partitions have most of the salient examples, and each has a multiplier much larger than the other four such partitions. As a result, if, and only if, either or both of the unweighted ensembles with large multiplier partitions vote an example as salient, that example will be voted as salient regardless of how the other four salient partitions vote.

Weighted ensembles would still theoretically require votes from all partitions with at least one salient example for a definitive vote.

Table 4.9: Canister tear average regional results using only two-class partitions and evaluated with 10% overlap threshold. (Bold indicates the highest values.)

Classifier  Voting method  Recall  Precision  F-measure  L-q(size)  L-q(size ratio)  L-q(time step)
DT          mv             0.61    0.31       0.40       0.85       0.42             0.32
DT          smv            0.80    0.34       0.47       0.84       0.71             0.18
RF          mv             0.60    0.47       0.43       0.76       0.58             0.51
RF          smv            0.84    0.50       0.61       0.65       0.71             0.40
RFW         mv             0.63    0.47       0.50       0.70       0.50             0.48
RFW         smv            0.73    0.45       0.54       0.66       0.66             0.50

mv: majority vote; smv: scaled majority vote.

4.8 Statistical Significance Results

Ensembles often result in a higher accuracy classifier than a single classifier. Many times an ensemble is trained on a subset of the data (e.g. bagging). The data may be implicitly weighted as in bagging, features left out to create random subspaces, etc. Other work has shown that better accuracy can be obtained from an ensemble of classifiers built on subsets of the data [18, 46]. The disjoint subsets of data here will result in classifiers that make different errors, which can (and often does) lead to better accuracy, as has been seen with smaller datasets previously.

The average canister and casing regional results for a 10% overlap threshold are shown in Tables 4.10 and 4.11. For a baseline comparison, each table includes results for a single decision tree built on the unpartitioned training data, which includes all available labeled training data. To illustrate, instead of training one DT for the data in each of the 14 tear training partitions, only one decision tree was trained on all of the combined training data of the 14 partitions.

Table 4.10: Canister tear average regional results evaluated with 10% overlap threshold. (Bold indicates the highest values.)

Classifier  Training data partitioned?  Recall  Precision  F-measure  L-q(size)  L-q(size ratio)  L-q(time step)
SDT         no                          0.77    0.24       0.35       0.87       0.72             0.05
DT          yes                         0.81    0.29       0.40       0.94       0.71             0.03
RF          yes                         0.85    0.64       0.71       0.68       0.61             0.26
RFW         yes                         0.85    0.69       0.75       0.74       0.58             0.18

For the combined casing experiments and canister tear experiments, we applied the Friedman test, an average algorithm rank method, and the Holm step-down procedure, both described in [47], to show that the precision and F-measure from both SDT and DT are significantly worse than either RF or RFW with a 99% confidence level. Each of the four canister tear simulation runs was considered as a separate dataset, since each has a unique percentage of salient nodes in the training time step and in all time steps, and either unique material properties or a different number of time steps or nodes per time step. The casing experiments and all 16 tear train-test combinations were used in the evaluations. We also found that the natural time step ordering is significantly worse than all of the predicted region ordering methods with a 99% confidence level. These results demonstrate that partitioning obstacles to data mining can be more than overcome with the diversity of random forests. Table 4.12 shows a summary of the statistically significant results. The salient margin mean method (and the other methods) of ordering salient regions for L-quality is discussed in Chapter 4.1.
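For readers who want to reproduce this style of comparison, the sketch below runs a Friedman omnibus test followed by Holm-adjusted pairwise follow-ups. It is a simplified stand-in for the procedure of [47], under stated assumptions: the pairwise follow-ups here use Wilcoxon signed ranks rather than the average-rank z statistic, and the loop only adjusts p-values (a full Holm procedure stops testing at the first non-rejection). All names are illustrative.

```python
import numpy as np
from scipy import stats

def friedman_holm(scores, names, alpha=0.01):
    """Friedman test plus Holm-adjusted pairwise comparisons.

    scores: (n_datasets, n_methods) array, e.g. each method's
    F-measure on each train/test combination.
    """
    scores = np.asarray(scores, dtype=float)
    stat, p = stats.friedmanchisquare(*scores.T)
    print(f"Friedman: chi2 = {stat:.3f}, p = {p:.4f}")

    pairs, pvals = [], []
    m = scores.shape[1]
    for i in range(m):
        for j in range(i + 1, m):
            _, pw = stats.wilcoxon(scores[:, i], scores[:, j])
            pairs.append(f"{names[i]} vs {names[j]}")
            pvals.append(pw)
    for rank, idx in enumerate(np.argsort(pvals)):
        adj = min(pvals[idx] * (len(pvals) - rank), 1.0)  # Holm multiplier
        flag = "significant" if adj < alpha else "n.s."
        print(f"{pairs[idx]}: p_holm = {adj:.4f} ({flag})")
```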

Table 4.11: Casing regional results evaluated with 10% overlap threshold. (Bold indicates the highest values.)

Classifier  Training data partitioned?  Recall  Precision  F-measure  L-q(GT ratio)  L-q(size)  L-q(size ratio)  L-q(time step)
SDT         no                          0.95    0.41       0.57       0.97           1.00       0.84             0.28
DT          yes                         0.88    0.39       0.54       0.97           0.98       0.86             0.26
RF          yes                         0.89    0.49       0.63       0.98           0.97       0.93             0.33
RFW         yes                         0.82    0.45       0.58       0.96           0.95       0.94             0.37

GT: ground truth regions.

Table 4.12: Statistically significant results across casing and canister tear simulation experiments using 10% overlap threshold.

Statistically worse              Statistically better
Classifier  Metric               Classifier  Metric
SDT         precision            RF          precision
SDT         precision            RFW         precision
DT          precision            RF          precision
DT          precision            RFW         precision
SDT         F-measure            RF          F-measure
SDT         F-measure            RFW         F-measure
DT          F-measure            RF          F-measure
DT          F-measure            RFW         F-measure
SDT         L-quality (time step)  SDT       L-quality (size)
DT          L-quality (time step)  DT        L-quality (size)
RF          L-quality (time step)  RF        L-quality (size)
RFW         L-quality (time step)  RFW       L-quality (size)
SDT         L-quality (time step)  SDT       L-quality (size ratio)
DT          L-quality (time step)  DT        L-quality (size ratio)
RF          L-quality (time step)  RF        L-quality (size ratio)
RFW         L-quality (time step)  RFW       L-quality (size ratio)
SDT         L-quality (time step)  SDT       L-quality (salient margin mean)
DT          L-quality (time step)  DT        L-quality (salient margin mean)
RF          L-quality (time step)  RF        L-quality (salient margin mean)
RFW         L-quality (time step)  RFW       L-quality (salient margin mean)

CHAPTER 5
ANOMALY ENSEMBLE EXPERIMENTS

This chapter includes experiments, evaluation metrics, and results for the modified KDD Cup 1999 dataset.

5.1 Experiments

Anomaly detection, also known as outlier detection, deals with finding patterns in data that are unusual, abnormal, unexpected, and/or interesting [33] (see Chapter 2.4). In these experiments the outlier examples are those rare cases of the U2R intrusion attack class in the modified KDD Cup 1999 dataset, as described below.

As discussed in Chapter 3.3, we used the same modified KDD Cup 1999 dataset used in [39, 40], which includes only the normal class and the U2R intrusion attack class, which was considered the outlier or anomaly class. The modified dataset has 60,593 normal and 246 intrusion (outlier) data records. The modified KDD Cup 1999 dataset was randomly split into two groups. Each group was then partitioned, and outlier detection methods were used on all the data in that group, or on the data in various partition combinations of that group, to determine outliers in either group. This basic process was repeated for 30 different random splits.

The first group of experiments included using outlier detection methods on the data in one data group, or in partitions of that group, to determine outliers in the other data group. The outlier detection methods included random forests weighted (RFW) [31] and distance-based outlier (DBO) methods [48].

The accuracy of random forests was evaluated in [32] and shown to be comparable with or better than other well-known ensemble generation techniques. Here, the number of random features chosen at each decision tree node was $\log_2 n + 1$, given $n$ features. RFW predictions are based on the percentage of trees that vote for a class. First, RFW with 250 trees was trained on all of the data (without partitioning) in one group and then used to test all of the data in the remaining group. Breiman used random forests with 100 trees [31], while a later study used 1000 trees [32]. We chose 250 to achieve more accurate results than the established number of 100 without incurring the additional computational costs of 1000 trees, which might not necessarily prove beneficial in our experiments. The random split was intended to duplicate the methods used in [39, 40], and did not limit each group to the same number of examples. Then the second data group was used for training and the first for testing. This double train/test process was used for 30 random data splits, for a total of 60 test results. Next, each training data group was partitioned into 20 partitions with stratification (each partition had as close as possible the same number of positive instances and the same number of negative instances). Then RFW with 250 trees was trained on the data in each training partition and tested on all of the test data in the other group. This simulates distributed data that can't be shared, a more difficult problem.

The above train/test experiments were repeated using the distance-based outlier (DBO) algorithm, except that the training data in all cases was used as the reference set of neighbors for determining the average distance from each test instance to its five nearest neighbors [48]. DBO was chosen as the conventional outlier method for comparison with RFW and for investigating DBO's ensemble performance. The software implementation of DBO was ORCA, which is a method for finding outliers in near linear time [48].

Continuous features were scaled to zero mean and unit standard deviation. The distance measure used for continuous features was the weighted Euclidean distance, and for discrete features was the weighted Hamming distance. The number of outliers was specified as the number of test instances, so that as many test instances as possible were given outlier prediction scores. Unlike RFW, the ground truth training labels were not used for training, only for the stratified partitioning. The DBO outlier prediction scores were used instead of the corresponding RFW scores above. These prediction scores rated each test instance's likelihood of being an outlier based on the average distance from its five nearest neighbors in the reference training set. Those instances with a higher average distance were given a higher score, to reflect the increased likelihood of being outliers.

The second group of experiments included using outlier detection methods on the data in each group, or in partitions of that group, to determine outliers in that same group. The outlier detection methods included distance-based outlier (DBO) and local outlier factor (LOF) methods [49]. DBO was chosen for these experiments for direct comparison to its performance in the first group of supervised learning experiments. LOF was chosen to compare its ensemble performance to its non-ensemble performance in [39, 40]. This group of experiments represents a case of unsupervised learning, where ground truth test labels were not used in either the outlier prediction process or in the partitioning process. Each test was repeated using the other random group as the test group. First, DBO used all of the test data in the test group (except for each test instance being processed) as a reference set of neighbors for determining the average distance from each test instance to its five nearest neighbors.
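The DBO scoring rule itself is short. The sketch below expresses it with scikit-learn; it is an approximation of what the experiments did (they used the ORCA implementation [48] with weighted Euclidean and Hamming distances), using plain Euclidean distance on pre-scaled features, and the function name is illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dbo_scores(reference, test, k=5):
    """Distance-based outlier scores: mean distance from each test
    instance to its k nearest reference neighbors; a higher score
    means a more likely outlier."""
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    dist, _ = nn.kneighbors(test)
    return dist.mean(axis=1)
```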

Next, each test data group was randomly partitioned into 2, 4, and 20 partitions without stratification (without considering class). Then DBO used the data in each test partition as the reference set to choose neighbors.

LOF outliers, or outliers with a local outlier factor per object, are density-based outliers [49]. The approach to find those outliers is based on measuring the density of objects and their relation to each other (referred to here as local reachability density). Based on the average ratio of the local reachability density of an object and its k-nearest neighbors (e.g. the objects in its k-distance neighborhood), a local outlier factor (LOF) is computed. The approach takes a parameter MinPts (actually specifying the "k"), and it uses the maximum LOFs for objects in a MinPts range (lower bound and upper bound to MinPts). The default MinPts range from 10 to 20 and the default distance measure of Euclidean distance were used. The local outlier factor (LOF) method was used to predict outliers in the test data group by applying LOF separately to each of the 20 test partitions of data used above for DBO tests. This was done for the same random splits of data that were used for DBO.
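The MinPts-range variant of LOF can be approximated with scikit-learn as shown below. This is a sketch, not the implementation used in the experiments: scikit-learn's LocalOutlierFactor fits a single neighborhood size, so the range is handled by looping over k and keeping each point's maximum LOF; negative_outlier_factor_ stores -LOF, hence the sign flip.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_scores(X, minpts_range=range(10, 21)):
    """Maximum LOF per point over a MinPts range; higher = more
    outlying, mirroring the description above."""
    scores = np.full(len(X), -np.inf)
    for k in minpts_range:
        lof = LocalOutlierFactor(n_neighbors=k)
        lof.fit(X)
        scores = np.maximum(scores, -lof.negative_outlier_factor_)
    return scores
```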

5.2 Evaluation Metrics

Receiver operating characteristic (ROC) graphs are commonly used in machine learning and data mining research to organize and visualize the performance of classifiers, especially for domains with skewed class distribution and unequal classification costs [50]. For two-class problems, a classifier is a mapping from instances to predicted classes, positive or negative. Some classifier models only predict the class of the instance, while others produce a continuous output or score. An ROC graph can be plotted by applying different thresholds to the continuous output, which produces different points on the graph. The true positive (TP) rate (also known as the detection or hit rate) and the false positive (FP) rate (also known as the false alarm rate) are determined below [30].

$$\text{TP rate} = \frac{\text{Positives correctly classified}}{\text{Total positives}} \tag{5.1}$$

$$\text{FP rate} = \frac{\text{Negatives incorrectly classified}}{\text{Total negatives}} \tag{5.2}$$

The TP rate is plotted on the $Y$ axis and the FP rate is plotted on the $X$ axis in ROC graphs. Notable points in ROC space include the lower left point (0,0), which represents the classifier never predicting positive for any instance, and (1,1), which represents the classifier always predicting positive for any instance. Another point of interest is (0,1), which represents perfect classification, with no FPs or FNs. Average or random performance lies on the diagonal from (0,0) to (1,1). An ROC curve is constructed by sorting test instances by the classifier's scores from most likely to least likely to be a member of the positive class [50]. Each classifier score establishes a point on the ROC curve and a threshold that can be used to classify instances with scores above that threshold as positive, otherwise as negative. Since there may be cases of classifiers assigning equal scores to some test instances with possibly different ground truth class labels, all equal classifier scores are processed by an ROC algorithm before establishing a new ROC point. This reflects the expected performance of the classifier, independent of the order of equally scored instances that reflects the arbitrary order of test instances.

An ROC curve is a two-dimensional representation of expected classifier performance. If a single scalar value is desired to represent expected classifier performance, it can be determined as the area under the ROC curve, abbreviated AUC [50-52]. Since random guessing produces the diagonal from (0,0) to (1,1) and an AUC of 0.5, a realistic classifier should have an AUC greater than this.

The AUC is equivalent to the probability that its classifier will rank a randomly selected positive instance higher than a randomly selected negative instance [50]. This feature is equivalent to the Wilcoxon test of ranks [50, 51]. If an area of interest is less than that of the full curve, the areas under the parts of the curves of interest may be used instead for comparing classifier performance.

When considering an average ROC for multiple test runs, one could simply merge sort the instances together by their assigned scores into one large set, and run an ROC algorithm on the set. In order to make a valid classifier comparison using ROC curves, a measure of the variance of multiple test runs should be used [50]. One valid method is vertical averaging, which takes vertical samples for fixed, regularly spaced FP rates and averages the corresponding TP rates. For each ROC curve, the maximum plotted TP rate is chosen for each fixed FP rate. This point is used in individual ROC curves to calculate the AUC, and thus is also used for averaging multiple curve AUCs. If no TP rate has been plotted for the fixed FP rate, the TP rate is interpolated between the maximum TP rates at the FP rates immediately before and immediately after the fixed FP rate. The mean of the maximum plotted TP rates for all curves at each fixed FP rate is plotted, and confidence intervals of each mean can be added by drawing the corresponding confidence interval error bars [50]. These are calculated as shown below [53, 54]:

$$\text{Standard deviation (SD)} = \sqrt{\frac{\sum (X - M)^2}{n - 1}} \tag{5.3}$$

$$\text{Standard error (SE)} = \frac{\text{SD}}{\sqrt{n}} \tag{5.4}$$

$$\text{Confidence interval (CI)} = M \pm t_{(n-1)} \cdot \text{SE} \tag{5.5}$$

where $X$ in this case is a TP rate for the selected FP rate, $M$ is the mean of the TP rates for the selected FP rate, $n$ is the number of trials, and $t_{(n-1)}$ is a critical value of $t$.

Threshold averaging is an alternate method for averaging multiple ROC test runs. It is based on the threshold of classifier scores, which can be directly controlled by the researcher. This method generates a set of thresholds to sample by placing all of the classifier scores into N bins, so that each bin has approximately the same number of scores, and selecting the thresholds that separate the bins. For each threshold, the corresponding point for each ROC curve is selected and all the selected points are averaged [50].
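Vertical averaging with the confidence bars of equations (5.3)-(5.5) can be sketched as follows. This is an illustration under stated assumptions, not the plotting code used here: the per-run maximum TP rate at each fixed FP rate is approximated by linear interpolation with np.interp, and the default grid matches the 0.0 to 0.2 range shown in Figure 5.1.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_curve

def vertical_average_roc(score_lists, label_lists, fp_grid=None, conf=0.99):
    """Vertically averaged ROC over multiple runs, with CI half-widths."""
    if fp_grid is None:
        fp_grid = np.arange(0.0, 0.201, 0.01)
    tps = []
    for scores, labels in zip(score_lists, label_lists):
        fpr, tpr, _ = roc_curve(labels, scores)
        # TP rate at each fixed FP rate, interpolated between points.
        tps.append(np.interp(fp_grid, fpr, tpr))
    tps = np.asarray(tps)                       # shape (runs, grid)

    mean = tps.mean(axis=0)
    n = tps.shape[0]
    sd = tps.std(axis=0, ddof=1)                # eq. (5.3)
    se = sd / np.sqrt(n)                        # eq. (5.4)
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, n - 1)
    half_width = t_crit * se                    # eq. (5.5)
    return fp_grid, mean, half_width
```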

Table5.1:ModiedKDDCup1999averagetrain/testROCAUCre sults. Method#oftrainingpartitionsusedforROCAUC nalvoteoneachtestexampleMeanStandard Deviation RFW(nopartitioning)0.99760.0029RFW20of20(ensemblevoting)0.99850.0007RFW1of20(average)0.98300.0176 DBO(nopartitioning)0.96920.0108DBO20of20(ensemblevoting)0.98500.0017DBO1of20(average)0.98180.0049 RFW:randomforestsweighted;DBO:distance-basedoutlierROC:receiveroperatorcharacteristics;AUC:areaundercu rve ensemblesforeachtestinstancewereaveragedtodetermine theaverageoutlier predictionscore.Thesescoresweresortedwithmostlikely topredictanoutlier rst.TheRFW20of20partitionmeanAUCforis0.9985whichis almostidentical totheaboveresult.Finally,theRFWoutlierpredictionsco resusingonlyoneofthe 20partitionsweresortedandthemeanAUCwasdetermined.Th eRFW1of20 meanAUCof0.9830istheaverageofallsinglepartitionresu lts.Thisshowsthat using20ensemblestovoteismoreaccuratethanjustusingas ingleclassier. TheabovetestswererepeatedforDBOexceptthatthethetrai ningdatain allcaseswasusedasthereferencesetofneighborsfordeter miningtheaverage distancefromeachtestinstancetoitsvenearestneighbor s.UnlikeRFW,the groundtruthtraininglabelswerenotusedforoutlierdetec tion.TheDBOoutlier predictionscoreswereusedinsteadofthecorrespondingRF Wscoresabove.The lowestDBOmeanAUCis0.9692fortheDBOusingalltrainingda tawithoutpartitioningasthereferencesetofneighbors.TheDBO20of20m eanAUCforthe 20partitionvoteis0.9850whichisonly0.0135lessthanthe RFW20of20mean AUC.TheDBOmeanAUCforusingasinglepartitionis0.9818,w hichisstillhigher thanusingnopartitioning.Forcomparison,thefeaturebag gingmethodofoutlier 89


Figure 5.1 shows the modified KDD Cup 1999 ROC vertically averaged curves using RFW and DBO with unpartitioned training data. Confidence bars are shown for the 99% confidence region of the ROC mean. The curve is only shown for the FP (false alarm) rate range of 0.0 to 0.2 in 0.01 steps, since both curves are very close to a detection rate of 1.0 for FP rates of 0.2 to 1.0. The mean AUC for RFW is 0.9976 and for DBO is 0.9692, as shown in Table 5.1. RFW is better than DBO with statistical significance at a 99% confidence level for an FP rate of 0.0 to 0.1.

Figure 5.2 shows the modified KDD Cup 1999 ROC vertically averaged curves using RFW and DBO with 20 partitions of training data. Confidence bars are shown for the 99% confidence region of the ROC mean. The curve is only shown for the FP rate of 0.0 to 0.2, as discussed above. The mean AUC for RFW is 0.9985 and for DBO is 0.9850, as shown in Table 5.1. RFW is better than DBO with statistical significance at a 99% confidence level for an FP rate of 0.0 to 0.1.


[Figure 5.1 appears here: detection rate versus false alarm rate (0.0 to 0.2) for RFW and DBO.]

Figure 5.1: Modified KDD Cup 1999 ROC curves for RFW and DBO using no partitioning on the training group. Curves are vertically averaged over both groups of 30 random splits. Confidence bars are shown for the 99% confidence region of the ROC mean.


[Figure 5.2 appears here: detection rate versus false alarm rate (0.0 to 0.2) for RFW and DBO.]

Figure 5.2: Modified KDD Cup 1999 ROC curves for RFW and DBO using 20 partitions on the training group. Curves are vertically averaged over both groups of 30 random splits. Confidence bars are shown for the 99% confidence region of the ROC mean.


Table 5.2 shows the modified KDD Cup 1999 ROC AUC results for distance-based outlier (DBO) and local outlier factor (LOF) methods when the data in one group of each of the 30 random data splits was partitioned and various partition combinations were used to determine outliers in only that test data group, a case of unsupervised learning where ground truth test labels are not used in the outlier prediction process. Each process was repeated using the other random group as the test group. First, DBO used all of the test data in the test group (except for each test instance being processed) as the reference set of neighbors for determining the average distance from each test instance to its five nearest neighbors. The resulting DBO no-partitioning mean AUC is 0.9112, which is considerably less than the 0.9692 in Table 5.1 for DBO using the training group data as the reference set of neighbors.

Next, each test data group was randomly partitioned into 20 partitions without stratification (class was not considered). These 20 partitions were combined as required to create separate two- and four-partition experiments in addition to the 20-partition experiment. DBO then used the data in each test partition as the reference set of neighbors. The scores for each individual partition were sorted with the instances most likely to be outliers first. All individual partition results were averaged over the corresponding number of partitions to produce the 1 of 2, 1 of 4, and 1 of 20 (average) DBO entries in Table 5.2. Additionally, the votes for each test instance were averaged among the corresponding number of partitions to determine the average outlier prediction score. These average scores were sorted with the instances most likely to be outliers first to produce the 2 of 2, 4 of 4, and 20 of 20 (ensemble voting) DBO entries in Table 5.2.

The DBO 1 of 2, 2 of 2, 1 of 4, and 4 of 4 mean AUCs gradually increase from 0.9587 to 0.9700 as the number of partitions increases and as ensemble voting is used for each number of partitions.
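A minimal sketch of the unsupervised distance-based outlier scoring and the partition-ensemble vote described above follows, assuming the test group is a NumPy array. The function names and the use of scikit-learn's NearestNeighbors are assumptions; discarding the single closest neighbor is a simplification that prevents an instance from counting itself when it belongs to the reference partition, at the cost of also dropping one legitimate neighbor when it does not.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def dbo_scores(X_ref, X_test, k=5):
    """Distance-based outlier score: mean distance from each test
    instance to its k nearest neighbors in the reference set."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_ref)
    distances, _ = nn.kneighbors(X_test)
    return distances[:, 1:].mean(axis=1)   # larger = more likely an outlier

def dbo_partition_vote(X_test, n_partitions=20, k=5, seed=0):
    """Unsupervised 'n of n' voting: score all test instances against
    each random, unstratified test partition and average the votes."""
    rng = np.random.default_rng(seed)
    partitions = np.array_split(rng.permutation(len(X_test)), n_partitions)
    votes = [dbo_scores(X_test[part], X_test, k=k) for part in partitions]
    return np.mean(votes, axis=0)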


Table 5.2: Modified KDD Cup 1999 average test ROC AUC results.

  Method (# of test partitions used for           Mean     Standard
  final vote on each test example)                         Deviation
  ----------------------------------------------  -------  ---------
  DBO (no partitioning)                           0.9112   0.0101
  DBO 1 of 2 (average)                            0.9587   0.0140
  DBO 2 of 2 (ensemble voting)                    0.9611   0.0110
  DBO 1 of 4 (average)                            0.9655   0.0098
  DBO 4 of 4 (ensemble voting)                    0.9700   0.0061
  DBO 1 of 20 (average)                           0.9819   0.0052
  DBO 20 of 20 (ensemble voting)                  0.9854   0.0019
  LOF 1 of 20 (separate voting on                 0.9105   0.0068
  each partition's examples)

  RFW: random forests weighted; DBO: distance-based outlier;
  LOF: local outlier factor; ROC: receiver operating characteristic;
  AUC: area under curve.

This trend continued for the DBO 1 of 20 mean AUC of 0.9819 and the DBO 20 of 20 mean AUC of 0.9854, which is 0.0742 more than the 0.9112 above using unpartitioned test data. This shows that the mean AUC increases as a lower percentage of the test data is used as the reference set of neighbors for each ensemble.

The local outlier factor (LOF) method was used to predict outliers in the test data group by applying LOF separately to each of the 20 test partitions used above for the DBO tests, and then merging and sorting the resulting ensembles of LOF outlier prediction scores. This was done for the same 30 data splits as DBO in Table 5.2. The resulting mean AUC is 0.9105. For comparison, the LOF method of outlier detection in [39, 40] produced an AUC of 0.61 (±0.1), which is much lower than our result.
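A sketch of the per-partition LOF procedure, using scikit-learn's LocalOutlierFactor, is shown below; the neighborhood size and function name are assumptions, since this chapter does not state the LOF parameters.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_partition_scores(X_test, n_partitions=20, n_neighbors=5, seed=0):
    """Apply LOF separately to each random test partition, then merge
    the per-partition scores into one ranking for the whole test group."""
    rng = np.random.default_rng(seed)
    scores = np.empty(len(X_test))
    for part in np.array_split(rng.permutation(len(X_test)), n_partitions):
        lof = LocalOutlierFactor(n_neighbors=n_neighbors).fit(X_test[part])
        # negative_outlier_factor_ is more negative for stronger outliers,
        # so negate it so that larger means more likely an outlier.
        scores[part] = -lof.negative_outlier_factor_
    return scores   # sort descending to rank likely outliers first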


Figure 5.3 shows the modified KDD Cup 1999 ROC vertically averaged curves using DBO with unpartitioned test data, two of two partitions, and 20 of 20 partitions of test data. Confidence bars are shown for the 99% confidence region of the ROC mean. The curves are only shown for the FP (false alarm) rate range of 0.0 to 0.5 in 0.01 steps, since the curves are very close to a detection rate of 1.0 for FP rates of 0.5 to 1.0. The mean AUC for DBO with 20 test partitions is 0.9854 and for DBO with unpartitioned test data is 0.9112, as shown in Table 5.2. Twenty-partition ensemble voting is better than no partitioning with statistical significance for an FP rate of 0.0 to 0.49.

Figure 5.4 shows the modified KDD Cup 1999 ROC vertically averaged curves using LOF with 20 partitions of testing data, with the final results merged for testing. Confidence bars are shown for the 99% confidence region of the ROC mean. The curve is only shown for the FP rate of 0.0 to 0.5, as discussed above. The mean AUC for LOF is 0.9105, as shown in Table 5.2.

In summary, we found that the supervised learning approach of using RFW ensembles yields the most accurate results. DBO ensembles, in either supervised or unsupervised learning, were less accurate, followed by the unsupervised LOF ensembles approach. Of note is that the best ROC AUCs were always obtained with ensemble approaches. An alternative supervised classification learning approach is one-class classification support vector machines (SVMs) [33]. For this technique, it is assumed that all training instances have only one class label. A discriminative boundary is learned around the normal instances, and any test instance that falls outside of the boundary is declared an anomaly.
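As a brief illustration of that alternative, the sketch below fits scikit-learn's OneClassSVM on normal-only training data and flags boundary violations as anomalies. The synthetic arrays and the parameter values (nu, the RBF kernel) are hypothetical, not settings used in this dissertation.

import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical data: X_normal holds single-label 'normal' training
# instances; X_test mixes normal points with five obvious outliers.
rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 4))
X_test = np.vstack([rng.normal(size=(20, 4)), np.full((5, 4), 6.0)])

# Learn a boundary around the normal instances; anything that falls
# outside it is declared an anomaly.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_normal)
is_anomaly = ocsvm.predict(X_test) == -1           # -1 = outside the boundary
anomaly_score = -ocsvm.decision_function(X_test)   # larger = more anomalous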


[Figure 5.3 appears here: detection rate versus false alarm rate (0.0 to 0.5) for DBO with 20 partitions, 2 partitions, and no partitions.]

Figure 5.3: Modified KDD Cup 1999 ROC curves for DBO using no partitioning, two, and 20 partitions on the test group. Curves are vertically averaged over both parts of 30 random splits. Confidence bars are shown for the 99% confidence region of the ROC mean.


[Figure 5.4 appears here: detection rate versus false alarm rate (0.0 to 0.5) for LOF.]

Figure 5.4: Modified KDD Cup 1999 ROC curves for LOF using 20 partitions on the test group. Curves are vertically averaged over both parts of 30 random splits. Confidence bars are shown for the 99% confidence region of the ROC mean.


CHAPTER 6

CONCLUSIONS

Large simulations (terabyte to petabyte scale) must be partitioned across multiple processors in order to obtain results in a reasonable amount of time. The method of breaking data into pieces may cause highly skewed class distributions, as it violates the usual assumption of independent and identically distributed datasets. In this dissertation, we showed how such data may nonetheless be effectively used for data mining. We showed that results from the distributed training data are as good as or better than one can obtain with a single decision tree trained on all the labeled training data. Our approach uses fast ensemble learning algorithms, scaled probabilistic majority voting, and ordering of predicted regions of saliency.

The results show that a simulation experiment that yields only somewhat above average regional F-measures can still provide efficient visual analysis of those results through effective ordering of the predicted regions. The vast majority of false positives were ordered last, after the user has already seen most of the true positive salient regions. The canister tear results often showed higher F-measures than the casing results, in spite of the relatively fewer examples used for training ensembles. Again, the quality of the ordering of predicted regions is typically reflected in high L-quality measures.

The results indicate that simulation developers and users would be accurately directed to regions of interest with only occasional misdirection. This has the potential to save significant time during debugging and use by allowing for a much improved focus of attention on areas of interest without highly time-consuming search.


An exploration of ensemble approaches for use in anomaly detection shows that, for both supervised and unsupervised learning categories, some of the existing approaches can be improved significantly by employing ensembles. The usual method of comparing anomaly (outlier) detection approaches is to compare the areas under each approach's receiver operating characteristic (ROC) curve. The improvement shown by ensemble approaches lies in the larger area under the ROC curve during the vitally important initial stages, where the detection (true positive) rate reaches higher levels while the false alarm (false positive) rate is still relatively low.

6.1 Contributions

The main contributions of this dissertation include an approach that shows how to order predicted regions from most likely to be interesting/relevant for large partitioned, spatially disjoint datasets, where the individual predictions are not as important as finding/predicting regions. We show that the method works even when data is distributed in ways that may not be helpful to the learning algorithms. Our methods might be adapted to present regions in medical images from distributed data in order of likely importance. Another application might be to recognize supernova regions in astronomy. Our approach is limited to applications with the above characteristics, where large arbitrary partitions of data include a relatively small minority class and where region labeling noise in training is typically less than 10%. A second area of contribution is to demonstrate how using different ensemble approaches to anomaly detection can improve the results of some of the existing approaches.


REFERENCES

[1] L. O. Hall, D. Bhadoria, and K. W. Bowyer. Learning a model from spatially disjoint data. In 2004 IEEE International Conference on Systems, Man, and Cybernetics, Vol. 2, pages 1447–1451, October 2004.

[2] D. F. Kusnezov. Advanced Simulation & Computing: the next ten years. Technical report, Sandia National Labs, Albuquerque, NM 87185, 2004.

[3] ASC, National Nuclear Security Administration in collaboration with Sandia, Lawrence Livermore, and Los Alamos National Laboratories. http://www.sandia.gov/NNSA/ASC/, accessed on 29 Nov 2008.

[4] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas. Handling imbalanced datasets: a review. GESTS International Transactions on Computer Science and Engineering, 30(1):25–36, 2006.

[5] G. Weiss. Mining with rarity: a unifying framework. SIGKDD Explorations, 6(1):7–19, 2004.

[6] N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

[7] P. Domingos. MetaCost: a general method for making classifiers cost-sensitive. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155–164, New York, NY, USA, 1999. ACM Press.

[8] L. Shoemaker, R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Learning to predict salient regions from disjoint and skewed training sets. In 18th IEEE Conference on Tools with Artificial Intelligence (ICTAI 2006), Arlington, Virginia, USA, pages 116–123, 2006.

[9] L. Shoemaker, R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Using classifier ensembles to label spatially disjoint data. Information Fusion, 9(1):120–133, 2008.


[10] L. Shoemaker, R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Detecting and ordering salient regions for efficient browsing. In Proceedings of the 19th Conference of the International Association for Pattern Recognition, 2008.

[11] Z. Erdem, R. Polikar, F. Gurgen, and N. Yumusak. Ensemble of SVMs for incremental learning. In Multiple Classifier Systems, 6th International Workshop, volume 3541 of Lecture Notes in Computer Science, pages 246–256, Seaside, CA, USA, 2005. Springer.

[12] M. A. Maloof and R. S. Michalski. Incremental learning with partial instance memory. Artificial Intelligence, 154(1-2):95–126, 2004.

[13] G. I. Webb, J. R. Boughton, and Z. Wang. Not so naive Bayes: aggregating one-dependence estimators. Machine Learning, 58(1):5–24, 2005.

[14] R. Kong and B. Zhang. A fast incremental learning algorithm for support vector machine. Control and Decision, 20(10):1129–1136, 2005.

[15] P. Domingos and G. Hulten. Mining high-speed data streams. In KDD '00: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, New York, NY, USA, 2000. ACM Press.

[16] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. On demand classification of data streams. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 503–508, New York, NY, USA, 2004. ACM Press.

[17] W. Fan. Systematic data selection to mine concept-drifting data streams. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 128–137, New York, NY, USA, 2004. ACM Press.

[18] N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Learning ensembles from bites: a scalable and accurate approach. Journal of Machine Learning Research, 5:421–451, 2004.

[19] A. Lazarevic and Z. Obradovic. Boosting algorithms for parallel and distributed learning. Distributed and Parallel Databases Journal, 11(2):203–229, 2002.

[20] N. V. Chawla, T. E. Moore, L. O. Hall, K. W. Bowyer, W. P. Kegelmeyer, and C. Springer. Distributed learning with bagging-like performance. Pattern Recognition Letters, 24(1-3):455–471, 2003.


[21] W. Fan, H. Wang, P. S. Yu, and S. J. Stolfo. A fully distributed framework for cost-sensitive data mining. In Proceedings 22nd International Conference on Distributed Computing Systems, pages 445–446, 2002.

[22] C. A. Shipp and L. I. Kuncheva. Relationships between combination methods and measures of diversity in combining classifiers. Information Fusion, 3(2):135–148, 2002.

[23] W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243–270, 1999.

[24] A. Gionis, H. Mannila, K. Puolamäki, and A. Ukkonen. Algorithms for discovering bucket orders from data. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 561–566, 2006.

[25] E. Hullermeier and J. Furnkranz. Learning label preferences: ranking error versus position error. In Proceedings IDA05, 6th International Symposium on Intelligent Data Analysis, 2005.

[26] K. Brinker. Active learning of label ranking functions. In Proceedings of the 21st International Conference on Machine Learning, pages 129–136, 2004.

[27] F. Wang, S. Ma, L. Yang, and T. Li. Recommendation on item graphs. In Proceedings of the Sixth International Conference on Data Mining, pages 1119–1123, 2006.

[28] C. X. Ling and C. Li. Data mining for direct marketing: problems and solutions. In Knowledge Discovery and Data Mining, pages 73–79, 1998.

[29] G. Piatetsky-Shapiro and S. Steingold. Measuring lift quality in database marketing. SIGKDD Explorations Newsletter, 2(2):76–80, 2000.

[30] P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2006.

[31] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[32] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 173–180, 2007.

[33] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: a survey. ACM Computing Surveys, 41(3):1–58, 2009.


[34] L. A. Schoof and V. R. Yarberry. EXODUS II: a finite element data model, Technical Report #SAND92-2137. Technical report, Sandia National Labs, Albuquerque, NM 87185, 1998.

[35] A. Henderson. The ParaView Guide. Kitware, Inc., United States, 2004.

[36] N. Otsu. A threshold selection method from gray level histograms. IEEE Transactions on Systems, Man and Cybernetics, 9:62–66, 1979.

[37] L. Shoemaker, R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Detecting and ordering salient regions. Data Mining and Knowledge Discovery, pages 1–32, 2010. 10.1007/s10618-010-0194-6.

[38] J. N. Korecki, R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Semi-supervised learning on large complex simulations. In Proceedings of the 19th Conference of the International Association for Pattern Recognition, 2008.

[39] A. Lazarevic. Feature bagging for outlier detection. In KDD '05, pages 157–166, 2005.

[40] N. Abe, B. Zadrozny, and J. Langford. Outlier detection by active learning. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 504–509, New York, NY, USA, 2006. ACM.

[41] UCI KDD Archive. KDD Cup 1999 data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, accessed on 1 Jan 2010.

[42] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Ensembles of classifiers from spatially disjoint data. In Multiple Classifier Systems, Sixth International Workshop, volume 3541 of Lecture Notes in Computer Science, pages 196–205, Seaside, CA, USA, 2005. Springer.

[43] W. S. Koegler and W. P. Kegelmeyer. FCLib: a library for building data analysis and data discovery tools. Advances in Intelligent Data Analysis VI, IDA 2005:192–203, 2005.

[44] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, second edition. Morgan Kaufmann, San Francisco, 2005.

[45] C. D. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.


[46] S. Eschrich and L. O. Hall. Learning from soft partitions of data: reducing the variance. In Fuzzy Systems, 2003. FUZZ '03. The 12th IEEE International Conference on, volume 1, pages 666–671, 2003.

[47] J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

[48] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[49] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[50] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006. ROC Analysis in Pattern Recognition.

[51] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36, 1982.

[52] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30:1145–1159, 1997.

[53] G. Cumming and S. Finch. Inference by eye: confidence intervals and how to read pictures of data. American Psychologist, 60, No. 2:170–180, February–March 2005.

[54] G. Cumming, F. Fidler, and D. L. Vaux. Error bars in experimental biology. The Journal of Cell Biology, 177, No. 1:7–11, April 9, 2007.


ABOUT THE AUTHOR

Larry Shoemaker received the B.S. and M.S. degrees in computer science from the University of South Florida in 2003 and 2005, respectively. His role as a graduate research assistant at the University of South Florida included membership in the Avatar project research team from 2004 to 2008. His research interests include data mining, machine learning, and knowledge discovery.