USF Libraries
USF Digital Collections

Learning on complex simulations


Material Information

Title:
Learning on complex simulations
Physical Description:
Book
Language:
English
Creator:
Banfield, Robert E
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla.
Publication Date:
2007
Subjects

Subjects / Keywords:
Machine learning
Spatial inhomogeneity
Parallel processing
Complex simulations
Ensemble methods
Dissertations, Academic -- Computer Science and Engineering -- Doctoral -- USF   ( lcsh )
Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Summary:
ABSTRACT: This dissertation explores Machine Learning in the context of computationally intensive simulations. Complex simulations such as those performed at Sandia National Laboratories for the Advanced Strategic Computing program may contain multiple terabytes of data. The amount of data is so large that it is computationally infeasible to transfer between nodes on a supercomputer. In order to create the simulation, data is distributed spatially. For example, if this dissertation were to be broken apart spatially, the binding might be one partition, the first fifty pages another partition, the top three inches of every remaining page another partition, and the remainder confined to the last partition. This distribution of data is not conducive to learning using existing machine learning algorithms, as it violates some standard assumptions, the most important being that data is independently and identically distributed (i.i.d.). Unique algorithms must be created in order to deal with the spatially distributed data. Another problem which this dissertation addresses is learning from large data sets in general. The pervasive spread of computers into so many areas has enabled data capture from places that previously did not have available data. Various algorithms for speeding up classification of small and medium-sized data sets have been developed over the past several years. Most of these take advantage of developing a multiple classifier system in which the fusion of many classifiers results in higher accuracy than that obtained by any single classifier. Most also have a direct application to the problem of learning from large data sets. In this dissertation, a thorough statistical analysis of several of these algorithms is provided on 57 publicly available data sets. Random forests, in particular, is able to achieve some of the highest accuracy results while speeding up classification significantly. Random forests, through a classifier fusion strategy known as Probabilistic Majority Voting (PMV) and a variant referred to as Weighted Probabilistic Majority Voting (wPMV), was used on two simulations. The first simulation is of a canister being crushed in the same fashion as a human might crush a soda can. Each of half a million physical data points in the simulation contains nine attributes. In the second simulation, a casing is dropped on the ground. This simulation contains 21 attributes and over 1,500,000 data points. Results show that reasonable accuracy can be obtained by using PMV or wPMV, but this accuracy is not as high as using all of the data in a non-spatially partitioned environment. In order to increase the accuracy, a semi-supervised algorithm was developed. This algorithm is capable of increasing the accuracy several percentage points over that of using all of the non-partitioned data, and includes several benefits such as reducing the number of labeled examples which scientists would otherwise manually identify. It also depicts more accurately the real-world usage situations which scientists encounter when applying these Machine Learning techniques to new simulations.
Thesis:
Dissertation (Ph.D.)--University of South Florida, 2007.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Robert E. Banfield.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 108 pages.
General Note:
Includes vita.
General Note:
Advisor: Lawrence O. Hall, Ph.D.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001920238
oclc - 187987392
usfldc doi - E14-SFE0002112
usfldc handle - e14.2112
System ID:
SFS0026430:00001


Full Text
Learning on Complex Simulations

by

Robert E. Banfield

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Lawrence O. Hall, Ph.D.
Dmitry Goldgof, Ph.D.
Sudeep Sarkar, Ph.D.
Kevin W. Bowyer, Ph.D.

Date of Approval: April 11, 2007

Keywords: Machine Learning, Spatial Inhomogeneity, Parallel Processing, Complex Simulations, Ensemble Methods

© Copyright 2007, Robert E. Banfield

Note to Reader

The original of this document contains color that is necessary for understanding the data. The original dissertation is on file with the USF library in Tampa, Florida.

Dedication

To those many people for whom my unending gratitude is offered for their contributions, directly and indirectly, to my life. While those people are numerous, I wish to specifically name Elizabeth Banfield, Eugene Merritt, Dr. Douglas Yarbrough, and my research team including Dr. Lawrence Hall, Dr. Kevin Bowyer, Dr. Philip Kegelmeyer, Dr. Nitesh Chawla, and Dr. Steven Eschrich.

Table of Contents

List of Tables
List of Figures
Abstract
Chapter 1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Organization
Chapter 2 Background
  2.1 Decision Trees
  2.2 Multiple Classifier Systems
Chapter 3 Ensemble Learning
  3.1 General Methods for Creating Ensembles
    3.1.1 Bagging
    3.1.2 Boosting
    3.1.3 Random Subspaces
    3.1.4 Disjoint Partitioning
  3.2 Decision Tree Specific Methods
    3.2.1 Random Trees
    3.2.2 Random Forests
  3.3 Experimental Design
    3.3.1 Algorithmic Details
    3.3.2 Data Sets
    3.3.3 Experiments
    3.3.4 Statistical Tests
  3.4 Experimental Results
  3.5 Discussion
  3.6 Towards an Appropriate Number of Classifiers
    3.6.1 Complications
    3.6.2 Experiments Using Out-of-Bag Error
Chapter 4 Learning on Complex Simulations
  4.1 Considerations
  4.2 The Canister Crush Problem
  4.3 The Casing Problem
Chapter 5 A Probabilistic Approach Using Majority Vote
  5.1 Majority Voting
  5.2 Probabilistic Majority Voting
  5.3 Weighted Probabilistic Majority Voting
  5.4 Smoothing
  5.5 Experimental Design
  5.6 Train and Test Sets
  5.7 Canister Odd/Even Experiments
  5.8 Canister Out-of-Partition Experiments
  5.9 Casing Odd/Even Experiments
  5.10 Casing Out-of-Partition Experiments
Chapter 6 A Semi-Supervised Incremental Approach
  6.1 Semi-Supervised Methods
  6.2 A Semi-Supervised Self-Training Algorithm
  6.3 Experiments
  6.4 Further Analysis of SSI Learning
  6.5 Removing the Unknown
  6.6 Discussion
Chapter 7 Discussion and Future Work
  7.1 Contributions
References
About the Author

List of Tables

Table 1. Description of data sets: attributes and size.
Table 2. The average raw accuracy results over a 10 fold cross validation.
Table 3. The average raw accuracy results over 5x2 fold cross validations.
Table 4. Statistical results are shown for each data set.
Table 5. A summary statistical table is provided for each method showing statistical wins and losses against bagging.
Table 6. Number of trees and test set accuracy of the stopping criteria for random forests and bagging.
Table 7. Test set accuracy results using a third of the trees chosen in Table 6.
Table 8. Physical and spatial characteristics for the canister simulation.
Table 9. Feature ranges for canister simulation.
Table 10. Physical and spatial characteristics for the casing simulation.
Table 11. Partitioning characteristics for the casing simulation.
Table 12. Feature ranges for the casing simulation.
Table 13. The unsmoothed nodal accuracy results for the canister simulation for the odd/even experiments are provided.
Table 14. Smoothed nodal accuracy results for the canister simulation for the odd/even experiments are provided.
Table 15. Unsmoothed nodal accuracy results for the canister simulation for the out-of-partition experiments are provided.
Table 16. Smoothed nodal accuracy results for the canister simulation for the out-of-partition experiments are provided.
Table 17. Unsmoothed nodal accuracy results for the casing simulation for the odd/even experiments are provided.
Table 18. Smoothed nodal accuracy results for the casing simulation for the odd/even experiments are provided.
Table 19. Unsmoothed nodal accuracy results for the casing simulation for the out-of-partition experiments are provided.
Table 20. Smoothed nodal accuracy results for the casing simulation for the out-of-partition experiments are provided.
Table 21. Smoothed nodal accuracy results for the casing simulation using the SSI algorithm and other approaches are provided.
Table 22. Smoothed nodal accuracy results for the casing simulation using the SSI algorithm incrementing by multiple time steps.

List of Figures

Figure 1. An example of a small decision tree is provided.
Figure 2. Decision trees can only make splits which are parallel to the x-axis or y-axis.
Figure 3. A modular learning scheme is shown.
Figure 4. An ensemble architecture is shown.
Figure 5. A graphical depiction of bagging is shown.
Figure 6. A graphical depiction of the random subspace algorithm.
Figure 7. Out-of-bag accuracy vs. test set accuracy results as classifiers are added to the ensemble for satimage.
Figure 8. Out-of-bag accuracy vs. test set accuracy results as classifiers are added to the ensemble for segment.
Figure 9. A visualization of the canister simulation as distributed across compute nodes is given.
Figure 10. A mosaic of different views of the canister simulation as labeled by an expert is provided.
Figure 11. A visualization of partitions of the casing data as distributed across compute nodes is provided.
Figure 12. A visualization of the ground truth of the casing data is provided.
Figure 13. An image mosaic of predicted saliency after smoothing on the canister simulation using Probabilistic Majority Voting on time steps 10, 22, 32, and 44 for the odd/even experiments is provided.
Figure 14. An image mosaic of predicted saliency after smoothing on the canister simulation using Weighted Probabilistic Majority Voting on time steps 10, 22, 32, and 44 for the odd/even experiments is provided.
Figure 15. An image mosaic of predicted saliency after smoothing on the canister simulation using all the data on time steps 10, 22, 32, and 44 for the odd/even experiments is provided.
Figure 16. An image mosaic of predicted saliency after smoothing on the canister simulation using Probabilistic Majority Voting on time steps 10, 22, 32, and 44 for the out-of-partition experiments is provided.
Figure 17. An image mosaic of predicted saliency after smoothing on the canister simulation using Weighted Probabilistic Majority Voting on time steps 10, 22, 32, and 44 for the out-of-partition experiments is provided.
Figure 18. An image mosaic of predicted saliency after smoothing on the canister simulation using all the data on time steps 10, 22, 32, and 44 for the out-of-partition experiments is provided.
Figure 19. An image mosaic of predicted saliency after smoothing on the casing simulation using Probabilistic Majority Voting on time steps 6, 10, 16, and 20 for the odd/even experiments is provided.
Figure 20. An image mosaic of predicted saliency after smoothing on the casing simulation using Weighted Probabilistic Majority Voting on time steps 6, 10, 16, and 20 for the odd/even experiments is provided.
Figure 21. An image mosaic of predicted saliency after smoothing on the casing simulation using all the data on time steps 6, 10, 16, and 20 for the odd/even experiments is provided.
Figure 22. An image mosaic of predicted saliency after smoothing on the casing simulation using Probabilistic Majority Voting on time steps 6, 10, 16, and 20 for the out-of-partition experiments is provided.
Figure 23. An image mosaic of predicted saliency after smoothing on the casing simulation using Weighted Probabilistic Majority Voting on time steps 6, 10, 16, and 20 for the out-of-partition experiments is provided.
Figure 24. An image mosaic of predicted saliency after smoothing on the casing simulation using all the data on time steps 6, 10, 16, and 20 for the out-of-partition experiments is provided.
Figure 25. An image mosaic of predicted saliency after using the SSI algorithm on the casing simulation with Probabilistic Majority Voting is provided.
Figure 26. An image mosaic of predicted saliency without the SSI algorithm on the casing simulation using Probabilistic Majority Voting is provided.
Figure 27. An image mosaic of predicted saliency without the SSI algorithm using all the data is provided.
Figure 28. An image mosaic of predicted saliency showing SSI learning using PMV as time progresses forward is provided.
Figure 29. True positive versus true negative accuracy on the final time step after SSI learning processes self-labeled data.
Figure 30. False positive versus false negative error on the final time step after SSI learning processes self-labeled data.
Figure 31. True positive versus true negative accuracy across all time steps after SSI learning processes self-labeled data.
Figure 32. False positive versus false negative error across all time steps after SSI learning processes self-labeled data.
Figure 33. An illustration of the classification based on two well separated Gaussian distributions is shown for predictions on time step 8.
Figure 34. An illustration of the classification based on two well overlapping Gaussian distributions is shown.
Figure 35. An illustration of the histogram of predicted saliency is shown.
Figure 36. An illustration of the two Gaussian distributions as estimated by the EM algorithm for time step 8 is shown.
Figure 37. An internet search on Google for cars returns very few topics pertaining to the automobile.

Learning on Complex Simulations

Robert E. Banfield

ABSTRACT

This dissertation explores Machine Learning in the context of computationally intensive simulations. Complex simulations such as those performed at Sandia National Laboratories for the Advanced Strategic Computing program may contain multiple terabytes of data. The amount of data is so large that it is computationally infeasible to transfer between nodes on a supercomputer. In order to create the simulation, data is distributed spatially. For example, if this dissertation were to be broken apart spatially, the binding might be one partition, the first fifty pages another partition, the top three inches of every remaining page another partition, and the remainder confined to the last partition. This distribution of data is not conducive to learning using existing machine learning algorithms, as it violates some standard assumptions, the most important being that data is independently and identically distributed (i.i.d.). Unique algorithms must be created in order to deal with the spatially distributed data.

Another problem which this dissertation addresses is learning from large data sets in general. The pervasive spread of computers into so many areas has enabled data capture from places that previously did not have available data. Various algorithms for speeding up classification of small and medium-sized data sets have been developed over the past several years. Most of these take advantage of developing a multiple classifier system in which the fusion of many classifiers results in higher accuracy than that obtained by any single classifier. Most also have a direct application to the problem of learning from large data sets. In this dissertation, a thorough statistical analysis of several of these algorithms is provided on 57 publicly available data sets. Random forests, in particular, is able to achieve some of the highest accuracy results while speeding up classification significantly.

Random forests, through a classifier fusion strategy known as Probabilistic Majority Voting (PMV) and a variant referred to as Weighted Probabilistic Majority Voting (wPMV), was used on two simulations. The first simulation is of a canister being crushed in the same fashion as a human might crush a soda can. Each of half a million physical data points in the simulation contains nine attributes. In the second simulation, a casing is dropped on the ground. This simulation contains 21 attributes and over 1,500,000 data points. Results show that reasonable accuracy can be obtained by using PMV or wPMV, but this accuracy is not as high as using all of the data in a non-spatially partitioned environment.

In order to increase the accuracy, a semi-supervised algorithm was developed. This algorithm is capable of increasing the accuracy several percentage points over that of using all of the non-partitioned data, and includes several benefits such as reducing the number of labeled examples which scientists would otherwise manually identify. It also depicts more accurately the real-world usage situations which scientists encounter when applying these Machine Learning techniques to new simulations.

Chapter 1 Introduction

This dissertation focuses primarily on the problem of learning from large amounts of data generated by complex simulations performed on a supercomputer. This implies two separate problems. The first, learning on large amounts of data, is a relatively new problem in Machine Learning and presents many interesting challenges. Some data sets are too large to fit in the memory of any single computer. They may even be too large to fit on any single hard drive, despite today's current capacity of 750GB per drive. As an example, the Wayback Machine, which retains copies of internet sites dating back to 1996, stores over 2 petabytes of data and grows by 20 terabytes every month [1]. The problem is more than an issue of space. If the amount of memory could be increased to sufficient capacity, the processor speeds are still currently insufficient to allow learning in a tractable amount of time. This necessitates the design of more efficient algorithms. The second issue, applying learning algorithms to complex simulations, is an entirely new topic. The manner in which these complex simulations are performed makes them particularly tedious for Machine Learning. The standard Machine Learning paradigms break easily under the constraints encountered.

1.1 Motivation

Multiple processor computers have a wide range of uses, such as in high performance workstations, web servers, and databases. Frequently, these machines are combined with other multiple processor computers for reasons of load-balancing, additional computing power, or simply as hot-spares. Conglomerations of computers, each called a node, assembled expressly for additional computing power, are commonly referred to as Beowulf clusters, with the fastest, and consequently the most expensive, being called supercomputers.

These are used in weather forecasting, climate research, cryptanalysis, molecular modeling, and physical simulations such as fusion and the detonation of nuclear weapons. It is for this last purpose that the ASC program was created [2].

Specialized multi-threaded and highly optimized applications are needed in order to take advantage of this additional computing power. They rely on predetermined architectural information including the number of processors, amount of memory, amount of disk space on each node, and the interconnecting network. Fast parallel disk speeds are achieved both by distributing the operations across hard disks (commonly referred to as a RAID) and by distributing them across nodes, the most common of which is called a parallel virtual file system [3, 4]. The interconnecting networks are often high speed and low latency, enabling, when necessary, fast and efficient communication between processors [5, 6]. The current top of the line ASC supercomputer, the Red Storm, consists of 10,368 AMD Opteron 2.0GHz processors, 31.2TB of RAM, and 750TB of disk space [7]. The theoretical peak floating point operations per second (flops) are 41.2 teraflops (10^12 flops), using 1.7 megawatts of power and occupying over 3000 square feet of floor space. In 2007, the number of compute nodes is expected to increase to 12,960 and the processors to be replaced with AMD Opteron 2.4GHz dual-core processors, effectively increasing the number of CPUs to 25,920 and producing a theoretical peak floating point performance of 125 teraflops [7]. Effectively using such a massive system can require an entirely disjoint set of algorithms from those used in a single processor environment.

1.2 Contributions

This dissertation contributes to the state of the art in the following ways. Extensive experimentation is performed on 57 publicly available data sets and five different learning algorithms and variations using several different statistical methods [8, 9, 10]. This is the largest such published comparison to date, the size of which enables the finding of a surprising conclusion: statistical results comparing the overall accuracy of an algorithm may be quite different from the aggregate statistical results assembled from the individual data sets. A method for deciding when enough trees have been added to the ensemble on-the-fly is then introduced which does not require a separate validation set and does not negatively affect the accuracy of the ensemble [10].

Several classifier fusion methods are introduced for the purpose of learning on large complex simulations where the amount of data is great and cannot be transferred to other processors [11]. These methods show that the adverse partitioning of the data has a negative effect upon the accuracy of classification, however the impact is not great. Through a semi-supervised learning algorithm, the loss in accuracy as a result of the partitioning can be recovered, and the net effect is an overall increase in predictive accuracy.

1.3 Organization

This dissertation is organized as follows: in Chapter 2, background information on the problem of learning on vast quantities of data is provided and the concept of learning with ensembles of classifiers is introduced; Chapter 3 presents methods by which these ensembles may be generated, provides a detailed statistical analysis on several of the ensemble learning strategies, and introduces and tests a new method for determining when enough classifiers have been included in the ensemble; Chapter 4 introduces a heretofore-not-considered problem of learning from spatially disjoint complex simulations and shows two example problems; Chapters 5 and 6 discuss new algorithms for efficiently predicting salient regions in these simulations; Chapter 7 summarizes the dissertation as well as suggests possibilities for future work.

Chapter 2 Background

This dissertation deals solely with supervised machine learning algorithms. In this paradigm, a database of examples is constructed from a set of attributes and a labeled target class. The goal is to, given a set of attributes (e.g. amount of gasoline left in the car, amount of cash in hand, etc.), assign the most likely class label (e.g. whether or not to stop at the gas station). There are many algorithms for assigning the most likely class label. In a probabilistic framework, the most likely class is determined by creating probability density functions which describe how attribute values correspond to class labels. In the nearest neighbor algorithm, the class of an example is assigned by finding the example from the training set which has the smallest distance between it and the test example. Each of these approaches results in a classifier. That is, the labeled training data is processed and learned such that, given an entirely new example which was not present in the training set, a computer can make a prediction of the class to which an example with the specified attribute set belongs.

2.1 Decision Trees

Decision trees are one of the most basic yet versatile and powerful Machine Learning classifiers. The method by which decision tree learning and classification works is easily explainable, and closely mimics human decision making. A decision tree produces a series of questions to ask of the attributes, and based on the answers it receives, chooses the next most important question to ask or gives its prediction. An example of a simple decision tree is given in Figure 1, which shows a decision making process for deciding when to buy gasoline for an automobile. The first question asks if the number of gallons in the tank is greater than 3 gallons. If it is not, then gas is needed, and the classifier predicts gasoline will be purchased.

Figure 1. An example of a small decision tree is provided. Questions are marked in boxes. Class labels are not enclosed in boxes.

If the number of gallons is greater than 3, then the decision tree asks if the number of gallons in the tank is greater than 6 gallons. If it is, then no more gas is needed. If the number of gallons in the tank is less than 6 but necessarily greater than 3, then the decision tree queries the amount of money the potential customer has. If the customer has greater than $20.00, the decision tree predicts that gas will be purchased, however if the amount is less than $20.00, the decision tree predicts that gas will not be purchased.

A decision tree provides for a simplified human-like decision making process. Its interpretability allows decision trees to be used in unique environments where other classifiers are inappropriate. Credit companies should be able to give a reason for denying a line of credit, pharmaceutical companies must know why a patient is being denied or accepted into clinical trials, etc. The complexity of decision tree creation is in discovering which attributes should be queried first and then in what order.

A decision tree is made up of decision nodes and leaf nodes. Attribute tests are contained within the decision nodes, and class labels are contained in the leaf nodes. During classification, an example proceeds through each decision node on a path until it reaches a leaf node.

In the C4.5 decision tree learning algorithm [12], the utility of each attribute as a test at a node is determined by evaluating what the gain in information would be if the decision tree used that attribute in the decision node. Once an appropriate split has been found, examples from the training set are themselves sent through that decision node, producing two or more disjoint sets. Each disjoint set is recursively simplified in this manner, until no additional splits are possible (all examples within that set contain the same class label), and a leaf node containing the class label is created.

The base information, Information(D), contains the amount of information stored in data set D prior to splitting the data set at the decision node. This is calculated as a function of the probabilities of class $i \in C$ in the data set $D$, defined here as $p_i$. The equation for calculating Information(D) is shown in Equation (1).

\[ \mathrm{Information}(D) = \sum_{i=1}^{|C|} -p_i \log(p_i) \tag{1} \]

In the next step of the decision tree creation algorithm, each possible test is simulated and evaluated. As a result of splitting the data on an attribute $A$ with $a \in \mathrm{Values}(A)$, there will be several data sets $D_a \in D$. The information gain of this potential split is calculated by subtracting the sum of the information in each new data set from the base information of the full data set at that node, as in Equation (2).

\[ \mathrm{InformationGain}(D) = \mathrm{Information}(D) - \sum_{a \in \mathrm{Values}(A)} \frac{|D_a|}{|D|}\, \mathrm{Information}(D_a) \tag{2} \]

The attribute test which produces the largest information gain is chosen for the decision node. When creating a decision node, if all the examples belong to the same class, then a leaf node is created instead which contains the value of the class label. This procedure happens recursively until all data has been processed. The creation of a decision tree is usually fast, taking advantage of the reduced time complexity of divide-and-conquer algorithms (time complexities are polynomial) [13]. The divide-and-conquer paradigm is often exploited in sorting [14, 15].
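
To make Equations (1) and (2) concrete, the following is a minimal Python sketch; it is not taken from the dissertation, the function names are invented, and log base 2 is assumed.

    import math
    from collections import Counter

    def information(labels):
        """Information (entropy) of a list of class labels, as in Equation (1)."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, attribute_values):
        """Information gain of splitting on one attribute, as in Equation (2).
        labels[i] is the class of example i; attribute_values[i] is that
        example's value a of the candidate attribute A."""
        n = len(labels)
        subsets = {}
        for label, value in zip(labels, attribute_values):
            subsets.setdefault(value, []).append(label)
        # Subtract the size-weighted information of each subset D_a.
        return information(labels) - sum(
            (len(s) / n) * information(s) for s in subsets.values())

    # A perfectly class-separating split recovers the full base information:
    labels = ["buy", "buy", "skip", "skip"]
    values = ["low", "low", "high", "high"]
    print(information(labels))               # 1.0
    print(information_gain(labels, values))  # 1.0

A split that leaves each subset pure drives the subset information to zero, so its gain equals the base information, which is why such tests are preferred at a decision node.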

Figure 2. Decision trees can only make splits which are parallel to the x-axis or y-axis.

There is a special case when an attribute is made up of continuous values. In such a case, a discretized version of this attribute is created by testing its value against a threshold.

2.2 Multiple Classifier Systems

Artificial Intelligence and Machine Learning techniques have undergone many improvements throughout the past several years. For supervised learning, traditionally a researcher would obtain a labeled set of data and then pass it to a Machine Learning algorithm to build a model for classification. The chosen Machine Learning algorithm would attempt to learn a model of the entire space of possible decisions (or problem space). This could lead to inefficiencies in both the running time of the algorithm and in the prediction accuracy. One of the most important breakthroughs to occur in this field has been the discovery of the potential accuracy increase that comes from combining many individual classifiers into a system of classifiers, a direction of research otherwise known as Multiple Classifier Systems [16, 17, 18, 19].

Multiple Classifier Systems (MCS) have allowed scientists to evaluate problem spaces which were too large for a single learning algorithm to handle by utilizing algorithms which focus on the divide and conquer approach. By breaking a large problem space into much smaller spaces (the divide stage), traditional learning methods can be utilized (the conquer stage). This allows the data of the problem to fit in memory so as to build a model in a manageable amount of time. The divide and conquer approach is often used in Computer Science for problems with greater than linear complexity such as in sorting [14] or fast Fourier transforms [20].

In the Bioinformatics field, problem spaces are usually very large [21, 22]. Sometimes they have a very large number of attributes, such as DNA sequences used in genome projects, which form hundreds of thousands of attributes [23]. Another example is the science of protein folding, where very large distributed projects such as Stanford's Folding@Home [24] have been created which enlist the idle CPU cycles of thousands of computers across the world to assist in their simulation. The problem of discovering how they fold is at least as demanding. Massive databases hold information on the actual structure of many proteins [25, 26, 27]. A series of contests referred to as the Critical Assessment of Techniques for Protein Structure Prediction (CASP) were designed to evaluate prediction accuracy in such domains [28]. Nearly all Machine Learning participants in the contests used MCS approaches due to the intractable size of the problem.

In addition to providing speed and memory benefits, MCS approaches have also increased the accuracy of predictive models for small and medium size data sets [29, 30, 31].

Methods of creating MCS are either modular or committee-based [32]. In a modular approach, the problem is broken up into many smaller tasks in a non-trivial way [33, 34]. An example of a modular approach is the hierarchical mixture of experts method [34, 35, 36]. Considering a two level approach, on the first level of processing, a classifier assigns an example to one of many potential second level classifiers. The chosen secondary classifier then predicts the class label. There are several ways to build such a classifier, and it is important to note that the first level classifier does not need to be trained on the training set. Training could occur by analyzing the decision boundaries or error on the second level classifiers. Subsequently, it would develop a means of sending examples to those classifiers most likely to be correct given the region of feature space for that example. Furthermore, each second level classifier does not need to be trained on all the available data, but rather only subsets of data that it would become an expert on. Rather than having the first level classifier send the example to only one second level classifier, it could send it to a number of classifiers. Though not typically considered as such, multi-layer neural networks can be likened to a mixture of experts, with each neuron representing an expert [37]. An example of a modular learning system is shown in Figure 3.

In a committee-based approach, also called an ensemble, each member is redundant, capable of prediction on the entire problem space without the need for other members within the committee. Yet, by combining many of the independent classifiers, a more accurate classification can be obtained more quickly. In an ensemble, many simple classification methodologies, such as neural networks or decision trees, are manipulated such that repeated construction of the classifier generates potentially different predictions for the same example. This is contrasted against the modular approach where each member may only learn or predict on some particular subject matter. Figure 4 shows how an ensemble architecture operates. This work focuses on ensembles, though most of the topics covered would apply to modular architectures as well.
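
As a toy illustration of the two-level modular idea (the class and names below are invented, not from the dissertation), a first-level classifier routes each example to a single second-level expert, which alone produces the prediction:

    class TwoLevelModular:
        """First level picks an expert; the chosen expert predicts the label."""

        def __init__(self, gate, experts):
            self.gate = gate        # callable: example -> index of an expert
            self.experts = experts  # trained second-level classifiers

        def predict_one(self, x):
            expert = self.experts[self.gate(x)]
            return expert.predict([x])[0]

An ensemble, by contrast, would send x to every member and combine the answers.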

Figure 3. A modular learning scheme is shown. Dashed lines symbolize that the path need not be taken. Classifier 1,1 can pass an example to one or more classifiers in the second level, each of which would then output a prediction.

Figure 4. An ensemble architecture is shown. An example is given to each classifier, and an ensemble prediction is generated by some combination of individual predictions.

Chapter 3 Ensemble Learning

One benefit of producing multiple classifiers is having each learn different decision boundaries such that the combination of the multiple individual classifiers is more accurate than any single classifier. An accuracy increase typically occurs for an ensemble of classifiers. Different decision boundaries amongst classifiers are produced by having some degree of non-deterministic choices in the learning stage. Classifiers trained on different data will often learn different concepts. Different decision boundaries can be obtained by training each classifier on randomly perturbed variations of the training set. Sometimes this non-deterministic behavior occurs intrinsically. For example, in training a typical neural network, each weight connecting two nodes is initially assigned a random value between two threshold values. Given a large number of nodes and a sufficiently wide threshold, it is highly unlikely that any two networks will ever originate, or finish training, with the exact same decision boundaries. In the past, experimenters would build multiple neural networks and remove all but the best network, defined as the one which obtained the highest accuracy on a validation set. On the other hand, research today shows that still higher accuracy may be obtained by using all of these networks, combining the individual predictions into one single ensemble prediction [32]. Much of this chapter has been published in [10].

3.1 General Methods for Creating Ensembles

In [32], different philosophies of generating ensembles are addressed. The first means of creating different classifiers is to modify the training data for the algorithm. This can cause unstable classifiers to produce very different results. An unstable classifier, by definition, may obtain vastly different decision boundaries due to small differences in training data [38], i.e. decision trees and neural networks. This is in contrast to stable classifiers, such as nearest neighbor or linear regression algorithms, where a small change in training data does not typically substantially impact the overall output of the classifier. Another way of creating different classifiers is to modify the learning algorithm itself. Decision trees have been the focus of a large majority of this research, though attention has been paid to neural networks as well [39]. Training data manipulation methods, referred to here as the general method for creating ensembles since nearly all classifiers require training data, are described first. Decision tree specific models are then analyzed.

3.1.1 Bagging

Bootstrap aggregating, also known as bagging [40], creates classifiers by manipulating the original training set by successively resampling it with replacement to create many different training sets. For every training set created, called a bag, a classifier is learned. Both the generation of the bags and the learning of the classifier can occur in parallel. Furthermore, testing of the classifiers can also be performed in parallel, making the scalability of the algorithm nearly one to one with the number of CPUs available. Only the majority vote in the prediction phase is sequential. An example of bagging is shown in Figure 5.

Figure 5. A graphical depiction of bagging is shown. Each letter represents an example.

Though Breiman [40] did not include statistical significance results against the accuracy of a single tree, bagging was one of the first ensemble methods to show consistent improvement in accuracy over that of a single classifier. For seven publicly available data sets, Breiman compared bagging with a single decision tree using the CART program. For each data set, the bagged ensemble reduced the error from between 20% and 47% of the single decision tree error. Breiman did not test for statistical significance. Statistically significant improvement over a single classifier was later verified in [9]. As a result of its success, much of the literature uses bagging as a benchmark in evaluating the accuracy of new algorithms. Though there is often experimental evidence of increased accuracy with later algorithms, the results were not typically tested for statistical significance. When the accuracy results are scrutinized statistically, bagging is often found to be indistinguishable.

In Figure 5 the bags are equal in size to that of the original training set. This is not a requirement, and surprisingly good results have been obtained using only a small subset of the examples [41]. Using a smaller subset of examples will decrease the training time of the algorithm for that data set.
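
The resampling loop at the core of bagging is short enough to show directly. This is an illustrative sketch only, not the OpenDT implementation used later in this chapter; scikit-learn's DecisionTreeClassifier stands in for the base learner, and class labels are assumed to be small non-negative integers so the vote can use bincount.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_trees=100, seed=0):
        """Train one tree per bag, each bag drawn with replacement from (X, y)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        trees = []
        for _ in range(n_trees):
            idx = rng.integers(0, n, size=n)  # bootstrap sample of size n
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def bagging_predict(trees, X):
        """Majority vote over the individual tree predictions."""
        votes = np.stack([t.predict(X) for t in trees]).astype(int)
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

Both the loop in bagging_fit and the per-tree predictions in bagging_predict are embarrassingly parallel; only the final vote is sequential, as noted above.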

3.1.2 Boosting

Freund and Schapire introduced a boosting algorithm [42, 19] for incremental refinement of an ensemble by emphasizing hard-to-classify data examples. Boosting begins by building a classifier on the training data and observing which examples from the original training data are not correctly classified. The weights of each training example are then adjusted so that misclassified examples are weighted more heavily than correctly classified examples. There are three methods of adjusting the weights of the examples: decreasing the weight of examples predicted correctly, increasing the weight of examples predicted incorrectly, or both increasing and decreasing the weight depending on the correctness of classification [43]. A weight is then assigned to the created classifier, and the prediction of test examples involves a weighted majority vote of all classifiers in the ensemble. Examples that are intuitively hard to learn are assigned higher weight values and will appear in the training data for each classifier more frequently. Easy examples need not be given as much weight because most classifiers will correctly predict them anyway. For data sets without noise, this algorithm generally provides an increase in accuracy, sometimes a very large increase, over a single classifier. However, for data sets where noise is an issue, those examples which are deemed difficult to learn are indeed not hard, but noise. Rather than focus on them specifically, as boosting does, they should be ignored altogether.

Freund and Schapire showed that boosting was often more accurate than bagging when using a nearest neighbor algorithm as the base classifier, though this margin was significantly diminished when using C4.5. Results were reported for 27 data sets, comparing the performance of boosting with that of bagging using C4.5 as the base classifier. The same ensemble size of 100 was used for boosting and bagging. In general, 10-fold cross validation was done, repeated for 10 trials, and average error rate reported. For data sets with a defined test set, an average over 20 trials was used. Boosting resulted in higher accuracy than bagging on 13 of the 27 data sets, bagging resulted in higher accuracy than boosting on 10 data sets, and there were 4 ties. The differences in accuracy were not evaluated for statistical significance.
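
The reweighting cycle can also be sketched compactly. The sketch below follows the generic two-class AdaBoost recipe with labels in {-1, +1}, not the AdaBoost.M1W variant used later in this dissertation, and again borrows scikit-learn's tree as the base learner.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, n_rounds=50):
        """y holds -1/+1 labels; returns a list of (classifier, alpha) pairs."""
        n = len(X)
        w = np.full(n, 1.0 / n)              # start with uniform example weights
        ensemble = []
        for _ in range(n_rounds):
            clf = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = clf.predict(X)
            err = w[pred != y].sum()
            if err == 0 or err >= 0.5:       # weak-learning assumption violated
                break
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(-alpha * y * pred)   # raise weight of mistakes, lower the rest
            w /= w.sum()
            ensemble.append((clf, alpha))
        return ensemble

    def adaboost_predict(ensemble, X):
        """Weighted majority vote of the per-round classifiers."""
        return np.sign(sum(alpha * clf.predict(X) for clf, alpha in ensemble))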

3.1.3 Random Subspaces

Ho's random subspace technique selects random subsets of the available features to be used in training the individual classifiers in an ensemble [5]. Ho's approach typically selects a random one half of the available features for each decision tree and creates ensembles of size 100. Each training set consists of different attributes for the same examples and is unique. As in bagging, both the training set creation and learning can be done in parallel. Yet random subspaces has the added benefit of decreasing the number of attributes, thereby simplifying the dimensionality of the problem space. Since most of the CPU time in decision tree learning is spent calculating a function to measure the information for each attribute, decreasing the number of attributes is particularly helpful in lowering the training time. This method is presented visually in Figure 6.

Figure 6. A graphical depiction of the random subspace algorithm.

In one set of experiments, the random subspace technique gave better performance than either bagging or boosting for a single train/test split for four data sets. Another set of experiments involved 14 data sets that were randomly split into halves for training and testing. Ten random splits were done for each of the 14 data sets. For each data set, the minimum and maximum of the 10 accuracies were deleted and the remaining eight values averaged. Qualitatively, it appears that random subspaces resulted in higher accuracy than either bagging or boosting on about five of the 14 data sets. The differences in accuracy were not evaluated for statistical significance. Ho summarized the results as follows: "The subspace method is better in some cases, about the same or worse in other cases when compared to the other two forest building techniques [bagging and boosting]" [5]. One other conclusion was that the subspace method is best when the data set has a large number of features and samples, and that it is not good when the data set has very few features coupled with a very small number of samples or a large number of classes [5].
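
Random subspaces differs from bagging only in what is sampled: feature columns rather than example rows. A minimal sketch under the same assumptions as before, keeping a random half of the features per tree as Ho's approach typically does:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def random_subspace_fit(X, y, n_trees=100, seed=0):
        """Each tree sees every example but only a random half of the features."""
        rng = np.random.default_rng(seed)
        n_features = X.shape[1]
        k = n_features // 2
        forest = []
        for _ in range(n_trees):
            cols = rng.choice(n_features, size=k, replace=False)
            tree = DecisionTreeClassifier().fit(X[:, cols], y)
            forest.append((tree, cols))  # remember each tree's feature subset
        return forest

    def random_subspace_predict(forest, X):
        votes = np.stack([t.predict(X[:, cols]) for t, cols in forest]).astype(int)
        return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)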

3.1.4 Disjoint Partitioning

Another means of generating classifiers in a distributed fashion is breaking the data set up into disjoint partitions [44, 45]. Since ensembles typically benefit from a large number of classifiers and each data partition gets its own classifier, this method works primarily for data sets with a large number of examples. As well as being distributable, when the number of examples is divided by the number of disjoint partitions, the workload for each individual learning algorithm is reduced, as seen in random subspaces. In data sets with millions or even billions of examples, this can provide very large decreases in training time. Finally, these random partitions can also be stratified, meaning that the ratio of classes present in the original training set is maintained for each disjoint subset after partitioning. While this typically produces better results [44, 46, 47], it negatively affects the ability to distribute the algorithm because stratification is largely sequential. Unfortunately, there is no standard means of selecting the number of disjoint partitions except by empirical selection after experimentation with multiple values.

3.2 Decision Tree Specific Methods

Ensembles can also be created by inserting nondeterministic approaches into a specific type of learning algorithm to produce different classifiers. Most research has focused on manipulating decision tree building algorithms to produce trees with differing predictions without training set modifications, as discussed in the previous section. Depending on the implementation of the multiple training set creation and storage of the sets, this may generate time and space savings. The classic decision tree, as described in Section 2.1, is built by recursively dividing the training set by splitting on the attribute test with the greatest information gain [48], information gain ratio [49, 12], or value of the gini index [50]. The branching factor for a continuous attribute is typically two, representing an attribute value either less than or greater than the test value. The number of potential tests, on the other hand, is at most the number of distinct values minus one, meaning that many possible splits may exist for each continuous attribute. For discrete attributes, only one possible split exists, with the branching factor equal to the number of discrete values for that attribute. Nondeterminism can be integrated into each learner by choosing randomly from one of several continuous attribute thresholds or attributes instead of only the best. These algorithms can all be run in a distributed fashion by repeatedly calling the decision tree creation function.

3.2.1 Random Trees

An example of an ensemble creation technique used in decision trees is choosing to split the data on a random attribute test selected from a list of the most important tests. Dietterich's random trees [51] work by analyzing the best twenty tests according to information gain ratio across all attributes, and choosing one at random as the test on which to split. Data sets with continuous attributes will generate a large number of potential splits, thus the value of twenty that Dietterich chose will likely generate splits with high, but suboptimal, information gain. On the other hand, for a data set with twenty or less discrete attributes, this method is equivalent to selecting any attribute to split on randomly, despite its information gain ratio. Unfortunately, for data sets with a mixture of both continuous and discrete attributes, the myriad split possibilities generated by continuous attributes may cause the discrete splits to be underrepresented in the top-twenty list.

Dietterich reported on experiments with 33 data sets from the UC Irvine repository. For all but three of the data sets, a 10-fold cross validation approach was followed. The other three used a train/test split as included in the distribution of the data set. Random tree ensembles were created using both unpruned and pruned (with certainty factor 10) trees, and the better of the two was manually selected for comparison against bagging. Differences in accuracy were tested for statistical significance at the 95 percent level. With this approach, it was found that randomized C4.5 resulted in better accuracy than bagging six times, worse performance three times, and was not statistically significantly different 24 times.

3.2.2 Random Forests

Breiman's random forest technique blends elements of random subspaces and bagging in a way that is specific to using decision trees as the base classifier [16]. At each node in the tree, a subset of the available features is randomly selected and the best split available within those features is selected for that node. The number of features randomly chosen (from n total) at each node is a parameter of this approach. In choosing a test for a node, an attribute may be chosen whose best split has a very low information gain, causing the tree to lose accuracy. This is especially a concern for data sets with a large number of attributes. Finally, bagging is used to create the training set of data items for each individual tree. Following [16], we considered versions of random forests created with random subsets of size 1, 2, and ⌊log2(n) + 1⌋.

Breiman reported on experiments with 20 data sets, in which each data set was randomly split 100 times into 90 percent for training and 10 percent for testing. Ensembles of size 50 were created for AdaBoost and ensembles of size 100 were created for random forests, except for the zip code data set, for which ensembles of size 200 were created. Accuracy results were averaged over the 100 train-test splits. The random forest with a single attribute randomly chosen at each node was better than AdaBoost on 11 of the 20 data sets. The random forest with ⌊log2(n) + 1⌋ attributes was better than AdaBoost on 14 of the 20 data sets. The results were not evaluated for statistical significance.
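
Because common tree implementations expose the per-node feature restriction as a parameter, both ingredients of the random forest, bootstrap sampling and a random feature subset at every node, fit in a few lines. This is an illustrative sketch using scikit-learn's max_features argument for the per-node subset, not the OpenDT code used in the experiments below.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def random_forest_fit(X, y, n_trees=100, seed=0):
        """Bagging plus a random feature subset evaluated at every tree node."""
        rng = np.random.default_rng(seed)
        n, n_features = X.shape
        k = int(np.log2(n_features)) + 1   # the "random forests-lg" subset size
        forest = []
        for _ in range(n_trees):
            idx = rng.integers(0, n, size=n)   # bootstrap sample, as in bagging
            tree = DecisionTreeClassifier(max_features=k)
            forest.append(tree.fit(X[idx], y[idx]))
        return forest

Prediction is then the same majority vote used for bagging above.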

3.3 Experimental Design

The complex simulations this dissertation addresses require algorithms which are optimized for both speed and accuracy. It is well known that there is no one best algorithm for classifying any arbitrary data set [52, 53, 54]. Quite often some learning methods perform well on some data sets or even subsets, and not so well in others. In order to ascertain which method might produce the best classifiers for the simulations, a rigorous statistical test was performed on a large number of preliminary data sets. These data sets range from small to medium in size, and incorporate many different kinds of variables.

3.3.1 Algorithmic Details

This work takes advantage of the free open source software package OpenDT [55] for learning decision trees in parallel. This program has the ability to output trees very similar to C4.5 release 8 [49], but has added functionality for ensemble creation. In OpenDT, like C4.5, a penalty is assessed to the information gain of a continuous attribute with many potential splits. In the event that the attribute set randomly chosen provides a negative information gain, its approach is to randomly re-choose attributes until a positive information gain is obtained, or no further split is possible. This enables each test to improve the purity of the resultant leaves. This approach was also used in the WEKA system [56].

As boosting was designed for binary classes, a simple extension to this algorithm is used called AdaBoost.M1W [57], which modifies the stopping criteria and weight update mechanism to deal with multiple classes and weak learning algorithms. The implemented boosting algorithm uses a weighted random sampling with replacement from the initial training set, which is different from a boosting-by-weighting approach where the information gain is adjusted according to the weight of the examples. Freund and Schapire use the boosting-by-resampling approach in [19]. There appears to be no accuracy advantage for boosting-by-resampling or boosting-by-reweighting [58, 59, 60], though Breiman reports increased accuracy for boosting-by-resampling when using unpruned trees [61]. This dissertation uses unpruned trees because of this and, in general, for increased ensemble diversity [62]. However, boosting-by-resampling may take longer to converge than boosting-by-reweighting [58].

There is a modification to the random trees ensemble creation method [8] in which only the best test from each attribute is allowed to be among the best set of twenty tests, from which one is randomly chosen. This allows the algorithm to be less prejudiced against discrete attributes when there are a large number of continuous valued attributes. This is called the random trees B approach. For this approach, a random test from the 20 attributes with maximal information gain is used. In the random subspace approach of Ho, half (⌈n/2⌉) of the attributes were chosen each time. For the random forest approach, a single attribute, two attributes, and ⌊log2(n) + 1⌋ attributes (which will be abbreviated as random forests-lg in the following) are used.

3.3.2 Data Sets

Each of the data sets used in the most influential papers describing the algorithms experimented with were considered here. Fifty-seven data sets were used, the majority from the UC Irvine repository [63] and others publicly available [64, 65, 66]. The data sets, described in Table 1, have from 4 to 256 attributes, and the attributes are a mixture of continuous and nominal values.

3.3.3 Experiments

For each data set, a stratified ten-fold cross validation was performed. A stratified n-fold cross validation breaks the data set into n disjoint subsets, each with a class distribution approximating that of the original data set. For each of the n folds, an ensemble is trained using n − 1 of the subsets, and evaluated on the held-out subset. This creates n non-overlapping test sets, allowing for statistical comparisons between approaches to be made.

For each data set, a set of five stratified two-fold cross validations was also used. In this methodology, the data set is randomly broken into two halves. One half is used in training and the other in testing, and vice versa. Each half of the data contains a class distribution approximately equal to that of the original data set. This validation is repeated five times, each with a new half/half partition. Dietterich's experiments used a t-score to evaluate statistical significance [67]. In an alternative method by Alpaydin, the t-score is abandoned in favor of an F-score for reasons of stability [68]. Specifically, rather than using the difference of only one test set, the difference of each test set is considered in the F-score.
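
Assuming a library such as scikit-learn is available, the 5x2 protocol just described can be written as five repetitions of a stratified two-fold split, each half serving once as the training set and once as the test set; the function name is illustrative.

    from sklearn.model_selection import StratifiedKFold

    def five_by_two_splits(X, y):
        """Yield (train_idx, test_idx) pairs for a 5x2 cross validation."""
        for rep in range(5):
            # Two stratified folds: each half keeps the class distribution.
            skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=rep)
            for train_idx, test_idx in skf.split(X, y):
                yield train_idx, test_idx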

Table 1. Description of data sets: attributes and size.

Data Set  #attributes  #cont. att.  #examples  #classes
abalone (3)  8  7  4177  29
anneal (2)  38  6  898  6
audiology (2,4)  69  0  226  24
autos (2)  25  15  205  7
breast-w (1,2,3,4)  9  9  699  2
breast-y (2)  9  0  286  2
bupa (1)  6  6  345  2
car (3)  6  0  1728  4
credit-a (2)  15  6  690  2
credit-g (1,2,3,4)  20  7  1000  2
crx (4)  15  6  690  2
dna (3)  180  0  3186  3
ecoli (1)  7  7  325  8
glass (1,2,4)  9  9  214  7
heart-c (2,4)  13  5  303  2
heart-h (2)  13  5  294  2
heart-s (2)  13  5  123  2
heart-v (2)  13  5  200  2
hepatitis (2,4)  19  6  155  2
horse-colic (2)  22  8  368  2
hypo (2,4)  25  7  3163  2
ion (1,4)  34  34  351  2
iris (2,4)  4  4  150  3
krk (2)  6  6  28056  18
krkp (2,3,4)  36  0  3196  2
labor (2,4)  16  8  57  2
led-24  24  0  5000  10
letter (1,2,3,4)  16  16  20000  26
lrs (3)  93  93  530  10
lymph (2)  18  3  148  4
mushroom (4)  22  0  8124  2
nursery (3)  8  0  12961  5
page  10  10  5473  5
pendigits  16  16  10992  10
phoneme (2)  5  5  5404  2
pima (1,3,4)  8  8  768  2
primary (2)  17  0  339  22
promoters (4)  57  0  106  2
ringnorm (1)  20  20  300  2
sat (1,2,3,4)  36  36  6435  7
segment (1,2,3,4)  19  19  2310  7
shuttle (2,3)  9  9  58000  7
sick (2,4)  29  7  3772  2
sonar (1,2,4)  60  60  208  2
soybean (1,2,4)  35  0  683  19
soybean-small (4)  35  0  47  4
splice (2,3,4)  60  0  3190  3
tic-tac-toe (3)  9  0  958  2
twonorm (1)  20  20  300  2
threenorm (1)  20  20  300  2
vehicle (1,2,3,4)  18  18  846  4
voting (1,4)  16  0  435  2
vote1 (2,4)  15  0  435  2
vowel (1,4)  10  10  528  11
waveform (1,2)  21  21  5000  3
yeast (3)  8  8  1484  10
zip (1)  256  256  9298  10

(1) Used in Breiman's random forest paper [16]; (2) used in Dietterich's random trees paper [17]; (3) used in Ho's random subspaces paper [18]; (4) used in Freund & Schapire's Boosting paper [19].

For each approach 1000 trees are generated, though boosting is examined with 50, 100, and 1000 trees. Breiman often used only 50 boosted trees in his research [40, 16], and Schapire has used as many as 1000 [69]. 100 trees are, perhaps, most typical.

3.3.4 Statistical Tests

In this research three statistical frameworks for determining the performance of the ensemble creation methods were used. Unfortunately, two frameworks have known issues. While the ten-fold cross validation experiments have independent test sets, the training data overlaps across folds, and t-tests assume independent trials. This notwithstanding, the ten-fold cross validation is the most widely used statistical test for these types of experiments. Dietterich points out that these experiments produce elevated Type I error, which can be corrected for by his 5x2 cross validation. This relies on the idea that learning curves rarely cross for algorithms as training set size varies, which may not be true.

Let $p_i^{(j)}$ be the difference in error rates on fold $j = 1, 2$ for replication $i = 1, 2, 3, 4, 5$. Let $\bar{p}_i$ be the mean error for replication $i$ and $s_i^2 = (p_i^{(1)} - \bar{p}_i)^2 + (p_i^{(2)} - \bar{p}_i)^2$ the estimated variance. Assuming $p_i^{(1)}$ and $p_i^{(2)}$ are independent normally distributed variables, then $s_i^2 / \sigma^2$ has a $\chi^2$ distribution with one degree of freedom. Since each of the $s_i^2$ values are assumed to be independent,

\[ M = \frac{\sum_{i=1}^{5} s_i^2}{\sigma^2} \tag{3} \]

has a chi-square distribution with 5 degrees of freedom. If $Z \sim N(0, 1)$ and $X \sim \chi_n^2$, and $Z$ and $X$ are independent, then

\[ T_n = \frac{Z}{\sqrt{X/n}} \tag{4} \]

has a t-distribution with $n$ degrees of freedom, and

\[ t = \frac{p_1^{(1)}}{\sqrt{M/5}} \tag{5} \]

is t-distributed with 5 degrees of freedom.
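
In code, the t statistic of Equation (5) is a few lines; this illustrative sketch takes the ten differences p_i^(j) as a 5x2 nested list.

    import math

    def dietterich_5x2_t(p):
        """p[i][j]: error-rate difference on fold j (0 or 1) of replication i (0..4)."""
        s2 = []
        for p1, p2 in p:
            mean = (p1 + p2) / 2.0
            s2.append((p1 - mean) ** 2 + (p2 - mean) ** 2)  # per-replication variance
        M = sum(s2)
        return p[0][0] / math.sqrt(M / 5.0)  # compare against a t with 5 dof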

Dietterich stopped after calculating this t-value. Alpaydin considers that there are actually ten values which can be placed in the numerator of Equation (5), depending on the fold and replication value:

\[ t_i^{(j)} = \frac{p_i^{(j)}}{\sqrt{M/5}} \tag{6} \]

Since the fold and replication are arbitrary, and $p_i^{(j)}$ depends only on the order in which the folds are evaluated, $p_1^{(1)}$ may lead to incorrect conclusions if skewed. As a result, Alpaydin suggests the F-test

\[ f = \frac{N/10}{M/5} = \frac{\sum_{i=1}^{5} \sum_{j=1}^{2} \left(p_i^{(j)}\right)^2}{2 \sum_{i=1}^{5} s_i^2} \tag{7} \]

where $N = \sum_{i=1}^{5} \sum_{j=1}^{2} (p_i^{(j)})^2 / \sigma^2$, since $N$ is a chi-square with 10 degrees of freedom.
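
The corresponding F statistic of Equation (7) reuses the same quantities; an illustrative sketch continuing the one above:

    def alpaydin_5x2_f(p):
        """F statistic over all ten differences; compare against F(10, 5)."""
        s2, N = [], 0.0
        for p1, p2 in p:
            mean = (p1 + p2) / 2.0
            s2.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
            N += p1 ** 2 + p2 ** 2   # numerator sums every squared difference
        return N / (2.0 * sum(s2))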

The Bonferroni correction [70, 71], a calculation which raises the critical value necessary for determining significance, was applied in order to compensate for the number of methods used in these experiments. In the Bonferroni correction, the $\alpha$ value of an entire set of $n$ comparisons is adjusted by taking the $\alpha$ value of each individual test as $\alpha/n$ [70]. In these experiments, $\alpha = 0.05$ and $n = 7$. In the case of the 10-fold cross validation, the t-critical value is 3.47, and for the 5x2-fold cross validation, the F-critical value is 11.66.

A recent paper [72] suggests that the best way to compare multiple algorithms across multiple data sets is to compare their average ranks. In this case, one could rank the algorithms by average accuracy over a cross validation experiment from 1 (the best) to 8 (the worst). In the case of a tie, the rank is the average of the ranks were they not tied. For example, if two algorithms tied for third, they would each get a rank of (3 + 4)/2 = 3.5. After obtaining the average ranks, the Friedman test can be applied to determine if there are any statistically significant differences among the algorithms for the data sets. Let $N$ equal the number of data sets to be tested on by each of $k$ possible algorithms. For the Friedman test, let $R_j = \frac{1}{N} \sum_i r_i^j$, where $r_i^j$ is the rank of the $j$th algorithm on data set $i$. Under the null hypothesis, each of the ranks would be equal. The Friedman test statistic,

\[ \chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right] \tag{8} \]

is a $\chi^2$ distribution with $k - 1$ degrees of freedom. If the Friedman statistic indicated significance, then the Holm step-down procedure is used to determine which ensemble method might be statistically significantly different from bagging. In the Holm step-down procedure, if the most significant p value is less than $\alpha/(k-1)$ then the null hypothesis is rejected. The next most significant p value is then compared against $\alpha/(k-2)$, and so on until the null hypothesis cannot be rejected; once that happens, no further null hypotheses are rejected. It was argued [72] that this is a stable approach for evaluating many algorithms across many data sets and determining overall statistically significant differences.
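
A sketch of the Friedman statistic and the Holm step-down comparison against a control method such as bagging; the names are invented, ranks[i][j] is the rank of algorithm j on data set i, and the p values for the pairwise comparisons are assumed to come from a standard statistics library.

    def friedman_statistic(ranks):
        """Friedman statistic; compare against chi-square with k-1 dof."""
        N, k = len(ranks), len(ranks[0])
        R = [sum(row[j] for row in ranks) / N for j in range(k)]  # average ranks
        return (12.0 * N / (k * (k + 1))) * (
            sum(r * r for r in R) - k * (k + 1) ** 2 / 4.0)

    def holm_step_down(p_values, alpha=0.05):
        """p_values: one p per method compared against the control.
        Returns the indices whose null hypotheses are rejected."""
        k = len(p_values) + 1                # methods plus the control
        order = sorted(range(len(p_values)), key=lambda i: p_values[i])
        rejected = set()
        for step, i in enumerate(order):
            if p_values[i] < alpha / (k - 1 - step):  # alpha/(k-1), alpha/(k-2), ...
                rejected.add(i)
            else:
                break                        # first failure stops all later rejections
        return rejected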

3.4 Experimental Results

Tables 2 and 3 show the accuracy results in raw percentages. Table 4 shows the statistical results. Statistical wins against bagging are designated by a plus sign and losses by a minus sign. If neither a statistical win nor a statistical loss is registered, the table field for that data set is omitted. The results of the 10-fold cross validation and the 5x2-fold cross validation are separated with a slash. The average results for each method using the 10-fold cross validation are shown in Table 2 and the results for the 5x2-fold cross validation are shown in Table 3. Table 5 contains a summary of the results.

For 37 of 57 data sets, considering both types of cross validation, none of the ensemble approaches resulted in a statistically significant improvement over bagging. On one data set, zip, all ensemble techniques showed statistically significant improvement under the 10-fold cross validation approach. The best ensemble building approaches appear to be boosting-1000 and random forests-lg. Each scored the most wins against bagging while never losing. For both random subspaces and random forests-1, there were a greater number of statistical losses to bagging than statistical wins. Boosting with 50 and 100 trees and random forests using only two attributes also did well. Random trees-B had a high number of statistical wins in the 10-fold cross validation, yet it also had a high number of losses. Interestingly, in the 5x2-fold cross validation, it resulted in very few wins and losses.

In comparing the differences between the 10-fold cross validation and the 5x2-fold cross validation, the primary difference is the number of statistical wins or losses. Using the 5x2-fold cross validation method, for only 12 of the 57 data sets was there a statistically significant win over bagging with any ensemble technique. This can be compared to the 10-fold cross validation, where for 18 of the 57 data sets there was a statistically significant win over bagging. Under the 5x2-fold cross validation, for no data set was every method better than bagging.

The average ranks for the algorithms are shown in Table 5. It was surprising to see that random forests, when examining only two randomly chosen attributes, had the lowest average rank for the 10-fold cross validation, and the second lowest average rank for the 5x2-fold cross validation. Using the Friedman test followed by the Holm test with a 95% confidence level, it can be concluded that there was a statistically significant difference between bagging and all approaches except for random subspaces, using the average accuracy from a ten-fold cross validation. Using the 5x2 cross validation, there was a statistically significant difference between bagging and all approaches except for boosting with 50 classifiers and random subspaces. The approaches were often not significantly more accurate than bagging. However, they were consistently more accurate than bagging.

3.5 Discussion

Of the 57 data sets considered, for 37 we could find no statistically significant improvement over bagging for any of the other techniques, using either the 10-fold or 5x2 cross validation. However, using the Friedman-Holm tests on the average ranks, it can be concluded that several approaches perform statistically significantly better than bagging on average across the group of data sets. Informally, one might say that while the gain over bagging is often small, there is a consistent pattern of gain.

Table 2. The average raw accuracy results over a 10 fold cross validation.

| Data Set | Bst 1000 | Bst 100 | Bst 50 | RS | RTB | Bag | RF lg | RF 1 | RF 2 |
|---|---|---|---|---|---|---|---|---|---|
| abalone | 25.07 | 24.23 | 24.35 | 24.44 | 24.49 | 23.51 | 24.2 | 24.59 | 24.68 |
| anneal | 99.56 | 99.56 | 99.67 | 99.22 | 99.56 | 99.33 | 99.56 | 99.67 | 99.67 |
| audiology | 83.68 | 85.43 | 84.57 | 83.2 | 83.66 | 85 | 86.74 | 81.52 | 84.58 |
| autos | 82.38 | 82.38 | 82.88 | 87.24 | 82.36 | 83.33 | 83.83 | 81.4 | 82.88 |
| breast-w | 96.42 | 96.42 | 96.13 | 96.7 | 96.71 | 95.99 | 96.27 | 96.99 | 96.56 |
| breast-y | 64.61 | 63.9 | 63.55 | 73.07 | 72.75 | 69.9 | 71.32 | 73.74 | 70.96 |
| bupa | 72.76 | 71.92 | 76.24 | 67.84 | 71.9 | 72.77 | 73.64 | 75.65 | 75.09 |
| car | 97.45 | 97.45 | 96.93 | 70.49 | 80.21 | 93.92 | 94.1 | 87.04 | 94.68 |
| credit-a | 87.39 | 86.67 | 86.96 | 87.83 | 86.38 | 86.38 | 86.38 | 87.25 | 87.54 |
| credit-g | 74.7 | 73.9 | 73.4 | 76.1 | 73.9 | 72.1 | 74.7 | 74.1 | 75.3 |
| crx | 87.1 | 87.39 | 85.8 | 86.67 | 86.09 | 86.09 | 86.52 | 87.68 | 87.54 |
| dna | 96.05 | 95.36 | 95.23 | 96.33 | 95.98 | 95.1 | 95.92 | 72.6 | 83.27 |
| ecoli | 86.18 | 85.57 | 85.86 | 82.17 | 86.18 | 83.11 | 83.71 | 86.79 | 86.17 |
| glass | 79.07 | 80.43 | 77.14 | 78.48 | 80.93 | 77.19 | 79.55 | 79.52 | 80.02 |
| heart-c | 78.83 | 80.82 | 81.81 | 81.18 | 82.49 | 77.9 | 82.17 | 81.81 | 82.49 |
| heart-h | 79.97 | 79.63 | 81.7 | 84.02 | 82.33 | 80.66 | 81.02 | 82.67 | 82.36 |
| heart-s | 88.53 | 87.76 | 87.76 | 91.86 | 89.36 | 88.46 | 89.29 | 93.53 | 91.86 |
| heart-v | 71.5 | 70.5 | 72.5 | 71.5 | 74.5 | 73.5 | 72.5 | 75 | 73.5 |
| hepatitis | 85.67 | 83.83 | 81.96 | 81.83 | 83.79 | 82.54 | 84.42 | 83.75 | 85.08 |
| horse-colic | 83.69 | 84.21 | 83.67 | 84.77 | 82.88 | 83.69 | 85.6 | 82.6 | 83.44 |
| hypo | 99.11 | 99.08 | 99.05 | 98.7 | 98.86 | 99.02 | 99.18 | 98.77 | 98.99 |
| ion | 94.02 | 94.02 | 94.02 | 93.44 | 93.17 | 92.89 | 93.74 | 93.44 | 93.73 |
| iris | 94.67 | 94.67 | 94.67 | 94.67 | 94.67 | 94.67 | 94.67 | 94.67 | 95.33 |
| krk | 90.15 | 89.17 | 88.21 | 42.13 | 87.21 | 86.88 | 87.12 | 85.22 | 87.9 |
| krkp | 99.59 | 99.53 | 99.53 | 96.24 | 98.87 | 99.69 | 99.56 | 98.25 | 99.28 |
| labor | 93 | 93 | 91.33 | 93 | 96.67 | 91.33 | 93 | 96.67 | 94.67 |
| led-24 | 73.24 | 72.5 | 71.22 | 69.3 | 72.28 | 73.6 | 74.84 | 73.86 | 74.64 |
| letter | 99.74 | 99.71 | 99.65 | 99.72 | 99.66 | 99.38 | 99.72 | 99.6 | 99.73 |
| lrs | 89.81 | 87.92 | 87.92 | 86.42 | 87.17 | 86.6 | 88.3 | 86.98 | 87.92 |
| lymph | 85.14 | 85.14 | 87.81 | 84.48 | 85.81 | 81.05 | 85.1 | 85.76 | 86.48 |
| mushroom | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| nursery | 99.96 | 99.9 | 99.83 | 93.8 | 97.65 | 99.05 | 99 | 97.71 | 99.29 |
| page | 97.17 | 97 | 96.93 | 97.22 | 97.55 | 97.31 | 97.44 | 97.44 | 97.64 |
| pendigits | 99.52 | 99.45 | 99.38 | 99.24 | 99.24 | 98.58 | 99.16 | 98.99 | 99.1 |
| phoneme | 91.75 | 91.6 | 91.41 | 83.42 | 90.17 | 91.28 | 91.39 | 91.23 | 91.47 |
| pima | 74.07 | 75.51 | 74.74 | 73.82 | 75.91 | 75.77 | 75.51 | 75.51 | 76.29 |
| primary | 42.18 | 38.95 | 39.83 | 44.84 | 41.89 | 41.61 | 41.92 | 41.6 | 42.5 |
| promoters | 96.36 | 96.27 | 93.27 | 88.91 | 91.73 | 88.91 | 93.55 | 86.09 | 88.91 |
| ringnorm | 97.67 | 96.33 | 95 | 92.67 | 96 | 92.33 | 93.33 | 97.67 | 94.67 |
| sat | 92.6 | 91.84 | 91.56 | 92.26 | 92.37 | 91.47 | 91.97 | 90.88 | 91.56 |
| segment | 98.66 | 98.83 | 98.48 | 97.71 | 97.92 | 98.1 | 98.35 | 97.66 | 98.14 |
| shuttle | 99.99 | 99.99 | 99.99 | 99.89 | 99.97 | 99.98 | 99.99 | 99.96 | 99.98 |
| sick | 99.13 | 99.13 | 99.02 | 97.4 | 98.41 | 99.13 | 98.91 | 97.69 | 98.52 |
| sonar | 84.19 | 83.71 | 84.19 | 83.69 | 89.05 | 75.55 | 84.76 | 83.26 | 84.24 |
| soybean | 94.88 | 94.58 | 94 | 95.46 | 94.14 | 94.15 | 95.02 | 93.12 | 94.87 |
| soybean-small | 91.5 | 91.5 | 91.5 | 100 | 100 | 97.5 | 100 | 100 | 100 |
| splice | 95.77 | 95.39 | 95.3 | 96.61 | 95.14 | 94.7 | 96.93 | 76.87 | 88.84 |
| threenorm | 86.33 | 85.33 | 84.67 | 83 | 84.67 | 83.67 | 85 | 87.33 | 86.67 |
| tic-tac-toe | 98.75 | 98.64 | 98.64 | 73.17 | 89.36 | 96.24 | 96.97 | 88.84 | 95.83 |
| twonorm | 93.67 | 94.33 | 93.33 | 94 | 95.67 | 94 | 95 | 96.67 | 95.67 |
| vehicle | 77.53 | 76.23 | 76 | 75.75 | 73.39 | 74.22 | 74.21 | 73.15 | 74.45 |
| vote | 95.17 | 94.72 | 95.87 | 95.41 | 94.94 | 95.62 | 96.1 | 95.17 | 95.87 |
| vote1 | 88.26 | 88.72 | 87.8 | 90.57 | 89.87 | 87.57 | 88.04 | 90.12 | 89.41 |
| vowel | 97.17 | 95.65 | 95.83 | 96.03 | 98.11 | 92.06 | 96.6 | 97.74 | 97.36 |
| waveform | 85.08 | 84.56 | 84.26 | 85.56 | 85.44 | 84.28 | 85.14 | 85.62 | 85.52 |
| yeast | 61.05 | 59.16 | 57.48 | 55.06 | 61.73 | 61.66 | 62.06 | 62.2 | 62.47 |
| zip | 97.06 | 96.72 | 96.16 | 96.36 | 97.03 | 93.74 | 96.87 | 95.56 | 96.18 |

Table 3. The average raw accuracy results over 5x2 fold cross validations.

| Data Set | Bst 1000 | Bst 100 | Bst 50 | RS | RTB | Bag | RF lg | RF 1 | RF 2 |
|---|---|---|---|---|---|---|---|---|---|
| abalone | 24.32 | 24.05 | 23.74 | 24.06 | 23.88 | 24.01 | 24.29 | 24.58 | 24.66 |
| anneal | 99.18 | 99.13 | 99.13 | 98.8 | 99.02 | 98.78 | 99.13 | 98.95 | 99.33 |
| audiology | 80.09 | 79.47 | 78.5 | 78.76 | 79.03 | 80.88 | 82.3 | 73.1 | 78.76 |
| autos | 76 | 75.12 | 73.95 | 76.68 | 72.29 | 75.71 | 76.68 | 73.56 | 76.19 |
| breast-w | 96.08 | 95.91 | 95.94 | 96.74 | 96.82 | 95.99 | 96.28 | 97 | 96.65 |
| breast-y | 66.01 | 65.73 | 65.31 | 71.82 | 71.05 | 69.37 | 70.07 | 71.54 | 71.19 |
| bupa | 69.45 | 70.03 | 69.74 | 64.64 | 65.51 | 70.96 | 71.07 | 71.54 | 71.08 |
| car | 95.59 | 95.5 | 95.2 | 71.04 | 83.08 | 92.03 | 92.41 | 83.21 | 92.73 |
| credit-a | 86 | 86.09 | 85.39 | 85.91 | 85.8 | 85.57 | 86.29 | 86.72 | 87.07 |
| credit-g | 74.18 | 73.14 | 72.7 | 73.72 | 73.1 | 72.86 | 74.54 | 72.9 | 73.62 |
| crx | 86.26 | 86.29 | 86.14 | 85.97 | 85.36 | 85.77 | 86.41 | 86.29 | 86.52 |
| dna | 95.51 | 95.17 | 94.99 | 95.56 | 95.25 | 94.66 | 95.06 | 70.27 | 81.34 |
| ecoli | 85.23 | 84.74 | 84.99 | 81.72 | 84.86 | 83.39 | 84 | 86.03 | 85.66 |
| glass | 73.46 | 71.96 | 72.15 | 73.18 | 73.83 | 73.46 | 72.71 | 74.39 | 74.67 |
| heart-c | 80.33 | 80.2 | 78.95 | 80.52 | 81.72 | 77.75 | 80.72 | 81.45 | 81.65 |
| heart-h | 80.34 | 80.61 | 81.43 | 82.79 | 80 | 80.2 | 80.95 | 81.63 | 81.43 |
| heart-s | 89.76 | 89.6 | 89.6 | 92.19 | 91.55 | 90.89 | 92.03 | 93.33 | 93.01 |
| heart-v | 69.8 | 71.5 | 70.7 | 72.2 | 73.2 | 71.2 | 72.6 | 75.5 | 74.2 |
| hepatitis | 84.65 | 84.13 | 84.65 | 83.47 | 84.9 | 82.58 | 84.13 | 84.51 | 85.8 |
| horse-colic | 82.39 | 82.34 | 81.09 | 82.07 | 82.5 | 83.04 | 83.59 | 82.01 | 82.61 |
| hypo | 98.98 | 98.97 | 98.97 | 98.72 | 98.75 | 98.95 | 98.98 | 98.64 | 98.8 |
| ion | 93.85 | 93.62 | 93.45 | 92.99 | 92.37 | 92.65 | 93.05 | 92.71 | 93.11 |
| iris | 93.6 | 93.87 | 93.73 | 93.73 | 93.47 | 93.87 | 93.6 | 93.6 | 93.47 |
| krk | 82.31 | 81.05 | 79.84 | 41.88 | 79.19 | 78.88 | 79.29 | 77.38 | 80.09 |
| krkp | 99.34 | 99.33 | 99.32 | 95.21 | 98.04 | 99.11 | 99.29 | 96.91 | 98.8 |
| labor | 90.91 | 90.55 | 91.27 | 91.22 | 94.75 | 91.22 | 92.28 | 93.69 | 93.34 |
| led-24 | 72.95 | 72 | 71.24 | 68.87 | 71.56 | 72.65 | 74.42 | 73.73 | 74.65 |
| letter | 98.46 | 98.22 | 97.95 | 98.27 | 97.87 | 96.68 | 98.12 | 97.43 | 98.06 |
| lrs | 88.04 | 87.77 | 87.36 | 85.81 | 87.55 | 86.34 | 87.32 | 86.75 | 87.06 |
| lymph | 82.43 | 81.35 | 81.62 | 80.27 | 84.73 | 79.05 | 82.3 | 82.7 | 83.51 |
| mushroom | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| nursery | 99.27 | 99.07 | 98.85 | 93.53 | 96.19 | 97.5 | 97.49 | 96.34 | 98.17 |
| page | 96.81 | 96.73 | 96.77 | 96.85 | 97.14 | 97.01 | 97.14 | 96.89 | 97.07 |
| pendigits | 99.3 | 99.18 | 99.05 | 98.97 | 99.07 | 98.08 | 98.87 | 98.75 | 98.88 |
| phoneme | 89.51 | 89.2 | 89.01 | 82.55 | 87.89 | 88.87 | 89 | 89.09 | 89.17 |
| pima | 74.48 | 73.75 | 73.52 | 72.58 | 75.34 | 76.04 | 76.12 | 75.6 | 76.48 |
| primary | 40.42 | 40.77 | 39.41 | 43.6 | 40.53 | 39.3 | 40.18 | 40.71 | 40.83 |
| promoters | 91.7 | 90.75 | 88.68 | 83.96 | 90.94 | 84.91 | 91.51 | 83.77 | 88.3 |
| ringnorm | 94.73 | 94 | 93.73 | 90.4 | 94.4 | 88.87 | 91.6 | 97 | 94.8 |
| sat | 91.6 | 91.23 | 90.71 | 91.09 | 91.25 | 90.03 | 90.97 | 90.15 | 90.62 |
| segment | 98.17 | 98.04 | 97.88 | 97.02 | 97.47 | 96.66 | 97.59 | 96.88 | 97.32 |
| shuttle | 99.99 | 99.98 | 99.98 | 99.88 | 99.95 | 99.96 | 99.98 | 99.94 | 99.96 |
| sick | 98.64 | 98.64 | 98.67 | 97.08 | 98.08 | 98.59 | 98.64 | 97.13 | 98.13 |
| sonar | 80.87 | 79.42 | 76.83 | 78.17 | 82.12 | 75.96 | 80.29 | 80 | 80 |
| soybean | 93.56 | 93.29 | 92.74 | 94.32 | 93.44 | 92.39 | 93.53 | 92.03 | 93.47 |
| soybean-small | 97.01 | 97.01 | 97.01 | 99.15 | 100 | 97.45 | 99.57 | 100 | 100 |
| splice | 95.7 | 95.18 | 94.76 | 96.45 | 94.61 | 94.26 | 96.65 | 74.04 | 87.35 |
| threenorm | 85.67 | 82.6 | 81.27 | 79.27 | 83 | 80.6 | 83.33 | 85.67 | 85.27 |
| tic-tac-toe | 98.14 | 97.66 | 96.74 | 72.8 | 85.72 | 90.52 | 92.86 | 84.61 | 90.75 |
| twonorm | 95.07 | 93.73 | 93.13 | 88.6 | 94.8 | 88.6 | 92.6 | 95.67 | 94.87 |
| vehicle | 76.52 | 76.1 | 76.36 | 74.78 | 75.04 | 74.16 | 74.52 | 73.59 | 74.04 |
| vote | 95.04 | 94.99 | 94.85 | 95.04 | 94.35 | 94.67 | 95.63 | 94.71 | 95.54 |
| vote1 | 88.74 | 89.06 | 88.78 | 91.17 | 90.12 | 87.68 | 89.15 | 90.53 | 90.25 |
| vowel | 91.55 | 89.58 | 88.11 | 89.02 | 93.48 | 85.15 | 89.05 | 92.2 | 91.52 |
| waveform | 84.74 | 84 | 83 | 84.79 | 85.32 | 83.36 | 84.54 | 85.25 | 85.35 |
| yeast | 58.49 | 57.52 | 56.51 | 53.83 | 59.02 | 58.85 | 59.16 | 60.11 | 59.95 |
| zip | 96.21 | 95.73 | 95.25 | 95.17 | 96.31 | 92.84 | 96.11 | 94.88 | 95.55 |

Table 4. Statistical results are shown for each data set. Results are displayed as 10 Fold / 5x2 Fold, where a plus sign designates a statistically significant win and a minus sign designates a statistically significant loss against bagging. Entries appear in method column order (Bst 1000, Bst 100, Bst 50, RS, RTB, RF lg, RF 1, RF 2); fields with neither a win nor a loss are blank, and only data sets for which there were significant differences are listed.

zip: +/+ +/+ +/+ +/+ +/+ +/+ +/ +/+
letter: +/+ +/+ +/+ +/+ +/+ +/+ +/+
pendigits: +/+ +/+ +/+ +/+ +/ +/+ +/
heart-c: /+ +/ +/ +/
waveform: +/ +/ +/
page: +/ +/
sat: +/ +/
sonar: +/ +/ +/
krk: +/+ +/+ +/ -/ +/ -/ +/+
splice: /+ /+ +/ +/+ +/+ -/ -/
autos: +/
credit-a: /+
lrs: /+
promoters: +/
vote1: +/
breast-w: /+
led-24: -/ -/ +/+ +/+
tic-tac-toe: +/+ +/+ /+ -/ -/ -/ -/
dna: +/ +/ +/ -/ -/
breast-y: -/ -/
horse-colic: -/
car: +/+ +/+ +/+ -/ -/ -/
nursery: +/+ +/+ +/+ -/ -/ -/
yeast: -/
shuttle: -/
phoneme: -/ -/
krkp: -/ -/ -/
sick: -/ -/ -/

Table 5. A summary statistical table is provided for each method showing statistical wins and losses against bagging. The average rank is also shown.

| | Bst 1000 | Bst 100 | Bst 50 | RS | RTB | RF lg | RF 1 | RF 2 |
|---|---|---|---|---|---|---|---|---|
| 10 Fold Wins | 10 | 8 | 8 | 6 | 10 | 8 | 1 | 8 |
| 10 Fold Losses | 0 | 1 | 2 | 10 | 7 | 0 | 8 | 2 |
| 5x2 Fold Wins | 8 | 8 | 6 | 5 | 2 | 6 | 0 | 5 |
| 5x2 Fold Losses | 0 | 0 | 0 | 9 | 4 | 0 | 6 | 2 |
| 10 Fold average rank¹ | 3.89 | 4.80 | 5.49 | 5.84 | 5.00 | 4.28 | 5.32 | 3.81 |
| 5x2 Fold average rank² | 3.51 | 4.77 | 5.88 | 6.05 | 4.95 | 4.12 | 5.21 | 3.70 |

¹ Average rank for 10 Fold Bagging is 6.56.
² Average rank for 5x2 Fold Bagging is 6.83.

The raw accuracy numbers show that random subspaces can be up to 44% less accurate than bagging on some data sets. Data sets with which random subspaces performs poorly likely have attributes which are both highly uncorrelated and each individually important. One such example is the krk (king-rook-king) data set, which stores the position of three chess pieces in (row #, column #) format. If even one of the attributes is removed from the data set, vital information is lost. If half of the attributes are dismissed (e.g. King at A1, Rook at A?, King at ??) the algorithm will not have enough information and will be forced to guess randomly at the result of the chess game.

Boosting-by-resampling 1000 classifiers was better than boosting with 100 classifiers, and substantially better than boosting with 50 classifiers. Sequentially generating more boosted classifiers resulted in both more statistically significant wins and fewer statistically significant losses. If processing time permits additional classifiers to be generated, a larger ensemble than 50 is worthwhile. In a separate set of statistical experiments, the accuracies from boosting 100 trees were compared directly against those of boosting-50 and boosting-1000. Whereas boosting-1000 was statistically significantly more accurate when compared to boosting-50 in the Friedman-Holm tests, this was not the case between boosting-1000 and boosting-100. Determining the appropriate number of classifiers to build, without explicitly removing examples from the training set for a separate validation set, is a difficult problem which is addressed in the next section.

Random forests using only two attributes had a better average rank than random forests-lg in both cross validation methods, but did worse in terms of number of statistically significant improvements. Experimentation with the splice data set resulted in statistically significant wins for random forests-lg and statistically significant losses for random forests-2, with a 6-9% difference in accuracy. Thus, while testing only two random attributes is likely sufficient, testing additional attributes may prove beneficial on certain data sets. Breiman suggested using out-of-bag accuracy to determine the number of attributes to test [16].

There are other potential benefits aside from increased accuracy, which are of particular importance in the realm of complex simulations. Random forests, by picking only a small number of attributes to test, allows for very rapid tree generation. Random subspaces, which tests fewer attributes, can use much less memory because only the chosen percentage of attributes needs to be stored.

Recall that since random forests may potentially test any attribute, it does not require less memory to store the data set. Since random trees do not need to make and store new training sets, they save a small amount of time and memory over the other methods; however, this is not significant enough to warrant further attention. Finally, random trees and random forests can only be directly used to create ensembles of decision trees. Bagging, boosting and random subspaces could be used with other learning algorithms, such as neural networks.

For the simulations, the random forest and boosting algorithms appear to be the most applicable. While random forests is both fast and accurate, the standard framework for boosting would need to be modified in order to make it more conducive to learning quickly. There have already been several attempts at doing this for boosting, such as DIvoting [41, 73], with several concentrating specifically on neural networks [74, 75].

3.6 Towards an Appropriate Number of Classifiers

An arbitrarily large number of trees were created for the ensembles in the preceding sections. The boosting results, for example, show that an increase in the number of trees provides better accuracy than the smaller ensemble sizes generally used. This suggests a need to know when enough trees have been generated. It also raises the question of whether approaches competitive with boosting-1000 may nearly reach their final accuracy before 1000 trees are generated.

The easiest way of determining when enough trees have been generated would be to use a validation set. This unfortunately results in a loss of data which might otherwise have been used for training. One of the advantages of the techniques which use bagging is the ability to test the accuracy of the ensemble without removing data from the training set, as is done with a validation set. Breiman hypothesized that this would be effective [16]. He referred to the error observed when testing each classifier on examples not in its bag as the out-of-bag error, and suggested that it might be possible to stop building classifiers once this error no longer decreases as more classifiers are added to the ensemble. The effectiveness of this technique has not yet been fully explored in the literature. In particular, there are several important aspects which are easily overlooked, and these are described in the following section.

3.6.1 Complications

In bagging, only a subset of examples typically appears in the bag which will be used in training the classifier. Out-of-bag error provides an estimate of the true error by testing those examples which did not appear in the training set. Formally, given a set $T$ of examples used in training the ensemble, let $t$ be a set of size $|T|$ created by a random sampling of $T$ with replacement, more generally known as a bag. Let $s$ be the set consisting of $T - t$. Since $s$ consists of all those examples not appearing within the bag, it is called the out-of-bag set. A classifier is trained on set $t$ and tested on set $s$. In calculating the voted error of the ensemble, each example in the training set is classified and voted on by only those classifiers which did not include the example in the bag on which that classifier was trained. Because the out-of-bag elements, by definition, were not used in the training set, they can be used to provide an estimate of the true error.

All the trees in the ensemble will vote on an item of test data, but only a fraction of the trees in the ensemble are eligible to vote on any given item of training data, by its being out-of-bag relative to them. For example, suppose out-of-bag error was minimized after 150 trees were generated. These 150 trees are most likely an overestimate of the true number needed, because for any example in the data set, it would need to be out-of-bag on 100% of the bags in order to have all 150 trees classify that example. Therefore, the out-of-bag results most likely lead to a larger ensemble than is truly needed.

Breiman showed [40] that with $N$ possible examples to place in a bag and $N$ total examples to place in the bag, the probability that example $n$ is selected 0, 1, 2, ... times is approximately Poisson distributed with $\lambda = 1$ for large $N$. The probability that any example will appear at least once in the bag is $1 - 1/e \approx 0.632$. Since any training example is only being voted on by the roughly 36.8% of classifiers for which it is out-of-bag, using the total number of trees generated as a target for how many classifiers are necessary will overestimate the number of classifiers required; only about one third as many would suffice.
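The 0.632 figure is easy to verify numerically. The sketch below computes the probability that a given example appears at least once in a bag of size $N$ drawn with replacement from $N$ examples, which converges to $1 - 1/e$:

```python
import math

# P(example appears at least once in a bootstrap sample of size N).
for N in (10, 100, 1000, 100000):
    print(N, 1.0 - (1.0 - 1.0 / N) ** N)

# Limit: 1 - 1/e ~= 0.632 of the trees contain the example in-bag, so only
# about 1/e ~= 0.368 of the trees are eligible to vote on it out-of-bag.
print(1.0 - 1.0 / math.e)
```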

Table 6. Number of trees and test set accuracy of the stopping criteria for random forests and bagging.

| Data Set | Algorithm # of Trees (rf/bagging) | Best OOB # of Trees (rf/bagging) | Oracle # of Trees (rf/bagging) | Algorithm Accuracy (rf/bagging) | Best OOB Accuracy (rf/bagging) | Oracle Accuracy (rf/bagging) |
|---|---|---|---|---|---|---|
| credit-g | 113/92 | 296/645 | 25/913 | 75.7/74.6 | 75.8/75.2 | 76.8/75.8 |
| hypo | 53/73 | 100/305 | 48/502 | 99.11/99.21 | 99.15/99.24 | 99.15/99.30 |
| krkp | 216/35 | 1997/159 | 161/13 | 99.06/99.41 | 98.84/99.47 | 99.09/99.56 |
| led-24 | 204/290 | 2000/946 | 132/887 | 75.02/74.52 | 75.00/74.78 | 75.36/74.90 |
| pendigits | 181/215 | 911/449 | 1518/838 | 99.02/98.43 | 99.04/98.46 | 99.09/98.53 |
| phoneme | 118/189 | 1464/377 | 123/269 | 90.49/90.03 | 90.49/90.01 | 90.58/90.18 |
| ringnorm | 243/192 | 1841/988 | 947/961 | 96.35/95.60 | 96.43/95.69 | 96.49/95.72 |
| sat | 291/204 | 1828/969 | 786/926 | 91.67/91.14 | 91.72/91.19 | 91.80/91.22 |
| segment | 217/167 | 1862/555 | 1822/721 | 98.18/97.53 | 98.18/97.62 | 98.31/97.66 |
| sick | 93/64 | 93/64 | 1990/13 | 98.33/99.05 | 98.33/99.05 | 98.41/99.18 |
| splice | 342/129 | 1863/536 | 1414/609 | 96.30/95.05 | 96.58/95.14 | 96.65/95.17 |
| twonorm | 311/419 | 1586/985 | 1863/429 | 96.96/96.86 | 97.07/96.78 | 97.14/96.99 |
| waveform | 182/185 | 1534/876 | 1562/920 | 84.98/83.86 | 85.28/84.22 | 85.36/84.36 |

Experimentation with algorithms to predict an adequate number of decision trees is further complicated by out-of-bag error estimate quirks on data sets with a small number of examples. Small data sets, i.e. less than 1000 examples, can often have a very low error estimate with a rather small number of decision trees (up to 100), but then the addition of more trees results in a greater error rate in both the out-of-bag error and the test set error, as might be shown in a ten fold cross validation. This behavior is contrary to many experiments which have shown that test set error steadily decreases with an increasing number of classifiers until it plateaus. We speculate that this is a result of instability in the predictions leading to a lucky guess by the ensemble for such data sets. Since the decision to stop building additional classifiers is more effective, in a time-saving sense, for large data sets, it is more important to concentrate on data sets with a larger number of examples.

An algorithm [10] has been developed which appears to provide a reasonable solution to the problem of deciding when enough classifiers have been created for an ensemble. It works by first smoothing the out-of-bag error graph with a sliding window in order to reduce the variance. A window size of 5 is used in these experiments. After the smoothing has been completed, the algorithm takes windows of size 20 on the smoothed data points and determines the maximum accuracy within that window. It continues to process windows of size 20 until the maximum accuracy within that window no longer increases. At this point the stopping criterion has been reached, and the algorithm returns the ensemble with the maximum raw accuracy from within that window. The algorithm is shown in Algorithm 1.

Algorithm 1. Algorithm for deciding when to stop building classifiers.

1:  SlideSize ← 5, BuildSize ← 20
2:  A[n] ← raw ensemble accuracy with n trees
3:  S[n] ← average ensemble accuracy using SlideSize trees
4:  W[n] ← maximum smoothed value
5:  repeat
6:    Add BuildSize more trees to the ensemble
7:    NumTrees ← NumTrees + BuildSize
      // Update A[] with raw accuracy estimates obtained from out-of-bag error
8:    for x ← NumTrees − BuildSize to NumTrees do
9:      A[x] ← VotedAccuracy(Tree_1 ... Tree_x)
10:   end for
      // Update S[] with averaged accuracy estimates
11:   for x ← NumTrees − BuildSize to NumTrees do
12:     S[x] ← Average(A[x − SlideSize] ... A[x])
13:   end for
      // Update maximum smoothed accuracy within the window
14:   W[NumTrees/BuildSize − 1] ← max(S[NumTrees − BuildSize] ... S[NumTrees])
15: until W[NumTrees/BuildSize − 1] ≤ W[NumTrees/BuildSize − 2]
16: Stop at tree argmax_j A[j], j ∈ [NumTrees − 2·BuildSize ... NumTrees − BuildSize]
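A minimal Python rendering of Algorithm 1 follows. It assumes two hypothetical helpers that are not defined in the text: `add_trees(k)`, which grows the ensemble by `k` trees, and `oob_accuracy(n)`, which returns the out-of-bag voted accuracy of the first `n` trees.

```python
SLIDE_SIZE, BUILD_SIZE = 5, 20

def stopping_point(add_trees, oob_accuracy):
    A, S, W = [], [], []    # raw, smoothed, and per-window maximum accuracies
    num_trees = 0
    while True:
        add_trees(BUILD_SIZE)
        num_trees += BUILD_SIZE
        for x in range(num_trees - BUILD_SIZE, num_trees):
            A.append(oob_accuracy(x + 1))              # raw out-of-bag accuracy
            lo = max(0, len(A) - SLIDE_SIZE)
            S.append(sum(A[lo:]) / len(A[lo:]))        # sliding-window smoothing
        W.append(max(S[num_trees - BUILD_SIZE:]))      # best smoothed value in window
        if len(W) >= 2 and W[-1] <= W[-2]:
            break                                      # window maximum stopped rising
    # Return the ensemble size with the best raw accuracy in the previous window.
    window = range(num_trees - 2 * BUILD_SIZE, num_trees - BUILD_SIZE)
    return max(window, key=lambda j: A[j]) + 1
```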

3.6.2 Experiments Using Out-of-Bag Error

The stopping points and the resulting test set accuracy are compared using ensembles built to 2000 trees using random forests-lg and a ten fold cross validation. For this comparison the following results are examined: (a) the stopping point of the new algorithm, (b) the stopping point found by taking the minimum out-of-bag error over all 2000 trees, and (c) an oracle algorithm which looks at the lowest observed error on the test set over the 2000 created trees as trees are added sequentially. Thirteen of the previously used data sets with greater than 1000 examples were used. The results are shown in Table 6.

For most data sets, the out-of-bag error continues to decrease long into the training stage. This often does not result in any improvement of test set performance. Across all thirteen data sets the total gain by using the minimum out-of-bag error rather than the new algorithm was only 0.06% on average. Comparing the new algorithm to the oracle, the accuracy loss is less than 0.25% per data set. In comparing the number of trees used, the new method uses many fewer trees than the other methods.

On average this algorithm uses 1140 fewer trees compared to the minimum out-of-bag error, and 755 fewer trees compared to the oracle method. While these numbers are clearly influenced by the maximum number of trees chosen to build, it is also evident that looking at the maximum out-of-bag accuracy causes the algorithm to continue building a large number of trees.

This method has also been tested on bagged trees without the use of random forests. Half the number of the trees used in the previous experiment are generated, in order to shorten the previously observed large overestimation of the number of trees using the minimum out-of-bag error alone, and to reduce the training time. The results for this experiment are shown in Table 6. The use of the new algorithm results in an average net loss of 0.12% per data set compared to the minimum out-of-bag error, meanwhile using 431 fewer trees. Compared to the oracle method, there is a net loss of 0.25% per data set (consistent with the previous experiment) while using 442 fewer trees.

Based on these results, it is possible to choose an acceptable stopping point while the ensemble is being built and without losing training data. In experiments with the new algorithm, it has not shown itself to be overly sensitive to the parameters of the sliding window size and the building window size. On average, the number of trees built in excess for the purpose of choosing the stopping point will be half of the building window size.

As shown previously, the probability of any particular example being included in the bag is $1 - 1/e \approx 0.632$, meaning only about $1/e \approx 0.368$ of the examples are out-of-bag. Put another way, for each example in the training set, only a little more than one-third of the trees generated have a vote on any unseen example. Therefore, the number of trees chosen to stop at may be as many as three times the amount necessary for equivalent performance on a test set consisting of all unseen examples. For this reason, the accuracy results obtained by using a random one-third of the number of trees chosen to stop within the previous experiments are included. These results are shown in Table 7.

Figures 7 and 8 demonstrate the relationship of out-of-bag error and test set error for a given number of trees in the full ensemble. Figure 8 is a worst case result, with out-of-bag error decreasing but overall error being minimal early and higher with more trees before stabilizing.

Table 7. Test set accuracy results using a third of the trees chosen in Table 6.

| Data Set | Original Algorithm Accuracy (rf/bagging) | 1/3 Algorithm Accuracy (rf/bagging) | Original Max OOB Accuracy (rf/bagging) | 1/3 Max OOB Accuracy (rf/bagging) |
|---|---|---|---|---|
| credit-g | 75.70/74.60 | 75.60/73.80 | 75.80/75.20 | 75.80/74.90 |
| hypo | 99.11/99.21 | 99.11/99.27 | 99.15/99.24 | 99.11/99.24 |
| krkp | 99.06/99.41 | 98.88/99.53 | 98.84/99.47 | 98.81/99.47 |
| led-24 | 75.02/74.52 | 74.92/74.54 | 75.00/74.78 | 75.00/74.48 |
| pendigits | 99.02/98.43 | 99.02/98.38 | 99.04/98.46 | 99.02/98.40 |
| phoneme | 90.49/90.03 | 90.17/89.93 | 90.49/90.01 | 90.32/89.90 |
| ringnorm | 96.35/95.60 | 96.01/95.45 | 96.43/95.69 | 96.35/95.60 |
| sat | 91.67/91.14 | 91.52/90.89 | 91.72/91.19 | 91.76/90.97 |
| segment | 98.18/97.53 | 98.31/97.45 | 98.18/97.62 | 98.27/97.49 |
| sick | 98.33/99.05 | 98.28/99.07 | 98.33/99.05 | 98.28/99.07 |
| splice | 96.30/95.05 | 95.77/95.05 | 96.58/95.14 | 96.27/94.95 |
| twonorm | 96.96/96.86 | 96.78/96.58 | 97.07/96.78 | 96.95/96.83 |
| waveform | 84.98/83.86 | 84.36/83.58 | 85.28/84.22 | 85.02/83.96 |

Looking at the accuracy, one-third of the number of trees shows mixed results. Though there are some data sets unaffected by the change, other data sets, especially the larger sized ones, benefited from the greater number of trees. We believe this algorithm, which stops early at the very first window where accuracy no longer increases, compensates for what might otherwise require three times the number of trees to decide.

Figure 7. Out-of-bag accuracy vs. test set accuracy results as classifiers are added to the ensemble for satimage.

Figure 8. Out-of-bag accuracy vs. test set accuracy results as classifiers are added to the ensemble for segment.

Chapter 4
Learning on Complex Simulations

Herein are described the unique issues surrounding the simulations which are performed by members of the United States Department of Energy's Advanced Strategic Computing (ASC) program. Illustrative examples of real-world simulations will serve as the basis for the majority of subsequent experimentation. They include a canister being crushed by a high speed impactor bar, and a casing falling to the ground.

4.1 Considerations

Efficiency is a principal concern in performing the simulations. The most fundamental aspect of this is nearby data points being kept in computational proximity. A physical simulation which is spatially partitioned into a series of components without overlap is referred to as being spatially disjoint. As an example, if a computer was to be simulated, one node might contain the motherboard, processor, and memory, another node contains the disk drives, and on another node the power supply is modeled. This spatial division of parts is potentially quite the hindrance to pattern recognition algorithms, as it violates the concept of independently and identically distributed (iid) data. The origination of iid data comes from Probability Theory, where each random variable is required to have the same probability distribution and be mutually independent. In Statistics, samples are often assumed to be iid because it tends to simplify the underlying mathematical methods [52, 48] when making inferences about the data.

For a pattern recognition algorithm to be accurate, the data on which to apply the learned concepts must be representative of the data on which it was trained. Otherwise, the concepts learned during the training phase might be substantially different from those the learning algorithm was expected to derive. Human assessment of what the resulting classifier has actually learned is often exceedingly difficult [76].

A frequently cited example is one in which the United States Army employed a neural network for detecting the presence of a tank in an image [77]. Testing on a separate data set showed good results; however, in the field, results were quite disappointing. It was discovered that the data sets the neural network was provided with showed tanks under cloudy skies, and no tanks under clear skies. As a result, the neural network was only able to decipher between a cloudy and a not-so-cloudy day. The stark differences between the data available for training and the data for testing precisely describe the problem which afflicts Machine Learning researchers working with these simulations.

In order to solve the iid problem, a naive solution is to transfer data points corresponding to all different physical regions of the simulation amongst the compute nodes. This solution forces iid data, yet is not a good solution in several respects: the amount of time to broadcast gigabyte or terabyte-sized data is prohibitive, the amount of space to store this additional data is limited, and it requires new data structures and expensive parallel code to accomplish. Due to these limitations, a potentially feasible solution, regardless of the challenge, is to learn directly on the compute nodes with the data that has previously been acquired at that location. It is not as expensive in time or computational power to transmit most types of classifiers between processors, so this is an avenue that may be pursued.

There are additional complications as well. The simulations take place in four dimensions, including time. The concepts learned at earlier time steps may lose applicability as time moves forward. Data points which were initially not relevant to the target concept could suddenly become highly important. There may even be times when a concept is cyclical in nature, causing areas of the simulation to manifest themselves as relevant only during certain non-contiguous time steps.

The target concept consists of identifying salient, or interesting, points. Saliency is defined as an arbitrary characteristic of the data, and there is no reason to assume underlying physical properties of the simulations support classification. One manifestation of this is in debugging. The simulation programs themselves consist of many lines of code, and it is entirely probable that bugs exist. By labeling the flawed regions of the simulation, one hope is to automatically discover other flawed regions in unrelated spatial locations. By labeling a small instance of a flawed region, testers hope to be able to automatically track the bugs throughout time.

In practice, most of the simulation would have been debugged prior to applying Machine Learning algorithms. Thus the number of bugs is likely to be small, which further hinders Machine Learning algorithms by presenting a problem where one important class is hugely underrepresented. This is generally known as a minority class problem. Salient points are generally defined as being different from the rest of the simulation, hence a pathological minority class problem is likely to be found.

A minority class may surface in several distinct areas. Within a node, only a few points may be salient. In this case, there are known techniques such as minority class oversampling and majority class undersampling [78, 79, 80, 81] or cost-sensitive learning algorithms [82, 83] which may be applied. A different type of minority class problem can occur when only a few time steps include salient examples. A minority class in time can be addressed using the previous techniques because all of the time steps for any one spatial portion of the simulation are stored on the same compute node. More interestingly, there can be a minority class presence in only a select number of nodes, such that a majority vote of classifiers developed on each node outweighs the minority class by default. Such a minority class problem could include one-class classifiers which label all points as belonging to the majority class. While it would be a simple matter to exclude these classifiers from voting, valuable information about the examples making up this region is also thrown away.

Another concern is nodes that contain two opposing salient concepts. The goal of these simulations is to combine the pattern recognition capabilities across nodes, yet in the case of opposing salient concepts this is counterproductive. Even outside of the simulation and supercomputer realm this would be a problem, although it is more likely to unknowingly manifest itself when the details are hidden in terabyte-sized data sets spread out across several machines.

In the following sections, two representative data sets from simulations done at Sandia National Laboratories are examined. The first simulation shows a canister being crushed at high speed by an impactor bar. The second simulation involves a multi-component casing being accidentally dropped. Each of these data sets is moderately sized, and provides meaningful insight into the problem of learning on large spatially disjoint data.

Figure 9. A visualization of the canister simulation as distributed across compute nodes is given. There are four partitions shown with different colors as the storage canister is crushed.

4.2 The Canister Crush Problem

A simulation of a storage canister being crushed from above by an impactor bar at approximately 300 miles per hour is shown in Figure 9. The canister is rapidly crushed in much the same fashion as a person might crush a soda can. The walls of the canister buckle under the pressure and the top of the canister accelerates downward with the impactor bar until it meets the bottom. The above event is simulated and recorded in 44 slices of time.

Figure 9 also illustrates how the complete simulation appears when divided into partitions. The four different colors represent the four different compute nodes which originally modeled the simulation. Note that pieces of the impactor bar crushing the canister are also broken up spatially according to the partition which is being crushed. The data for each of the time steps is divided spatially according to the compute node to which it is assigned. The partitioning is performed vertically along the Y axis of the canister, dividing the canister into 4 disjoint spatial partitions of roughly equal size. Each compute node can see only one of these partitions, and it is too expensive in time or storage space to move data to another compute node.

Table 8. Physical and spatial characteristics for the canister simulation.

| Bar initial velocity (in/s) | 5,000 |
|---|---|
| # nodal variables | 9 |
| # can nodes per time step | 6,724 |
| # bar nodes per time step | 3,364 |
| Total # nodes per time step | 10,088 |
| # time steps | 44 |
| Total # can nodes | 295,856 |
| % salient can nodes | 64.7 |

Table 9. Feature ranges for the canister simulation.

| Feature | Minimum | Maximum |
|---|---|---|
| DISPLX | -7.2 | 1.4 |
| DISPLY | -5.5 | 1.5 |
| DISPLZ | -17.8 | 0.1 |
| VELX | -4,820 | 2,252 |
| VELY | -7,891 | 3,357 |
| VELZ | -8,862 | 3,287 |
| ACCLX | -1.75E+09 | 2.39E+09 |
| ACCLY | -2.47E+09 | 3.38E+09 |
| ACCLZ | -3.99E+09 | 3.02E+09 |

Ten physical variables are stored for each of 10,088 nodes within each time step. They are the displacement on the X, Y, and Z axes; velocity on the X, Y, and Z axes; acceleration on the X, Y, and Z axes; and equivalent plastic strain, which is a metric for the stress on the surface of the canister [84]. Equivalent plastic strain is not included in the training and test data sets. Instead, it was used as a template for labeling the data. The physical and spatial characteristics are provided in Table 8, with the range of the canister data shown in Table 9.

For every time step, those pieces of the canister that have buckled and been crushed are marked as salient. At the beginning of the simulation, before the impactor bar has made contact, there are no salient nodes within the mesh. As time progresses and the canister collapses, more and more nodes are marked salient.

The process of marking salient nodes within the mesh can be as precise as the expert demands. However, a high level of precision requires a correspondingly high level of effort marking the data.

Figure 10. A mosaic of different views of the canister simulation as labeled by an expert is provided. Time steps 1, 15, 32, and 44 are shown from left to right, top to bottom. Red areas have been marked salient.

In order to model a practical scenario where an expert is more interested in saving time than catering to the nuances of Machine Learning, a fair amount of noise has been included in the class labels by using tools which mark areas as salient, rather than individual points, since there are over 10,000 points per time step. Since the impactor bar and the canister are in such proximity, it is quite reasonable to assume the bar will often have overlapping areas selected by the tools and be incorrectly marked as salient. Different views of the ground truth for this simulation are provided in a mosaic in Figure 10. Time steps 1, 15, 32, and 44 are shown with the salient regions colored in red and the non-salient regions colored in blue. The impactor bar has been removed from the figures so as not to obscure the view of the canister being crushed.

Figure 11. A visualization of partitions of the casing data as distributed across compute nodes is provided. There are five partitions, each represented with different colors. A separate view of how the bolts are partitioned is also shown.

4.3 The Casing Problem

The casing problem is an order of magnitude larger than the canister crush problem. In this data set, a casing is dropped on the ground. The casing is composed of four main sections: the nose cone, the body tube, the coupler, and the tail. The coupler connects the body tube and tail through a series of ten bolts. The ground has also been modeled. The casing is dropped from a short height and lands on the ground at an angle on the tail. This simulation records the stress across the entire device as might be found were it to be accidentally dropped during transport, storage, etc.

The goal using this data set is to discover which nodes in the simulation belong to bolts. When dropped at an angle on the tail, one group of bolts will experience a tensile force, while the other group of bolts will experience a compressive force. Each will also be subject to shear forces. These forces are expressed in many other sections of the casing as well. The physical characteristics of the individual nodes modeling the bolts are not substantially different from those modeling the rest of the casing. In other words, there is no a priori reason to assume some underlying feature of "boltness" which would make this an easy problem.

The data for each of the time steps is divided spatially according to the compute node to which it is assigned. Figure 11 shows the partitioning both of the actual simulation, and an unimpeded view of how the bolts are distributed. There is also a data set imbalance problem here, as those data sets with four bolts are much larger than the data set with only two bolts. The partitioning is performed lengthwise in five pieces across the cylindrical body so as to distribute the bolts across compute nodes. The data is purposefully partitioned so that two of the partitions have never seen any node from a bolt. This creates two one-class classifiers which must be carefully dealt with by the voting algorithm during classification. Two of the sections contain four bolts, and the one remaining section has only two bolts. An exploded view of the bolts and how they are partitioned is also shown in this figure.

The physical and spatial characteristics are provided in Table 10. The properties of the partitioning for the casing data set are shown in Table 11. In order to calculate stress and other forces within the simulation, the number of nodal attributes has more than doubled. The same motion variables of displacement, velocity, and acceleration are present as in the canister data set. In addition, several interaction variables are stored, such as contact force, total internal force, total external force, and the reaction force. The ranges for each of these attributes are shown in Table 12.

A time step showing the ground truth data is shown in Figure 12. The bolts are colored in red and represent the salient nodes in this simulation. Anything which is not a bolt is colored in blue and is not salient.

There are several important differences between this data set and the canister data set. There is not a large change in the structure of the casing data as the simulation runs through time. The change in the structure occurs mostly at the end of the simulation, after some amount of shear has taken place. Since the structural changes are more subtle, the deformation of the casing simulation turns out to be more difficult than the canister simulation to accurately mark. Instead, for this data set it is considered sufficient merely to identify the bolts. During the simulation, nodes belonging to all of the bolts are specifically designated as their own substructure within the simulation. Therefore, labeling of those points is a trivial matter of setting all those nodes as salient, and hence the training and test sets are labeled perfectly. Recall that in the canister crush simulation, the ground truth was subject to the inaccuracies inherent in the tools available to designate saliency.

Figure 12. A visualization of the ground truth of the casing data is provided. As opposed to the canister simulation, the casing simulation does not change much as the simulation progresses.

Table 10. Physical and spatial characteristics for the casing simulation.

| # nodal variables | 21 |
|---|---|
| # time steps | 21 |
| # non-bolt nodes per time step | 69,150 |
| # bolt nodes per time step | 5,603 |
| Total # nodes per time step | 74,753 |
| Total # non-bolt nodes | 1,452,150 |
| Total # bolt nodes | 117,663 |
| Total # nodes | 1,569,813 |
| Total % of bolt nodes | 7.5% |

Table 11. Partitioning characteristics for the casing simulation.

| | non-bolt | bolt |
|---|---|---|
| # nodes in partition 1 per time step | 19,458 | 2,272 |
| # nodes in partition 2 per time step | 10,297 | 0 |
| # nodes in partition 3 per time step | 6,738 | 1,059 |
| # nodes in partition 4 per time step | 11,379 | 0 |
| # nodes in partition 5 per time step | 21,278 | 2,272 |
| Total # nodes in partition 1 | 408,618 | 47,712 |
| Total # nodes in partition 2 | 216,237 | 0 |
| Total # nodes in partition 3 | 141,498 | 22,239 |
| Total # nodes in partition 4 | 238,959 | 0 |
| Total # nodes in partition 5 | 446,838 | 47,712 |

Table 12. Feature ranges for the casing simulation.

| Feature | Minimum | Maximum |
|---|---|---|
| DISPLX | -2.62 | 5.00 |
| DISPLY | -0.24 | 0.23 |
| DISPLZ | -10.34 | 0.55 |
| VELX | -4306 | 7437 |
| VELY | -2108 | 5943 |
| VELZ | -11518 | 3922 |
| ACCELX | -1.30E+09 | 8.79E+09 |
| ACCELY | -1.47E+09 | 1.46E+09 |
| ACCELZ | -2.23E+09 | 3.29E+09 |
| F-CONTACT-X | -463.9 | 392.4 |
| F-CONTACT-Y | -469.1 | 478.6 |
| F-CONTACT-Z | -4917 | 2354 |
| F-EXT-X | -1550 | 877.2 |
| F-EXT-Y | -354.1 | 345.8 |
| F-EXT-Z | -2561 | 2329 |
| F-INT-X | -1550 | 877.0 |
| F-INT-Y | -470.0 | 473.1 |
| F-INT-Z | -4920 | 2354 |
| REACT-X | -558.3 | 596.4 |
| REACT-Y | -354.1 | 345.8 |
| REACT-Z | -165.4 | 2328 |

Chapter 5
A Probabilistic Approach Using Majority Vote

This chapter introduces a new algorithm for learning on complex simulations. Namely, a method is discussed for weighting classifiers based on the a priori knowledge of which partitions contain salient data in their training sets and which do not. Experimental analysis showing the success of this algorithm on simulation data for the canister and casing data is provided.

5.1 Majority Voting

Majority voting is a standard way of combining multiple classifier systems [85, 16, 18, 40]. In pattern recognition, more accurate results are obtained as more classifiers are incorporated into the vote. Mathematically, majority voting is well defined under several assumptions [86, 87]. Define $p_i$ as the probability that any individual, $i$, is correct. The first assumption is that the accuracy of all individuals in the group, $G$, is the same: $p_i = p_j$ for all $i, j \in G$. It is also assumed that each classifier is independent of the others. In such a case, the probability that the group produces a correct binary decision, $P_C(n)$ over $n$ classifiers, is a binomial distribution:

$$P_C(n) = \sum_{m=k}^{n} \binom{n}{m} p^m (1-p)^{n-m}$$

where

$$k = \begin{cases} n/2 + 1, & \text{if } n \text{ is even} \\ (n+1)/2, & \text{if } n \text{ is odd} \end{cases}$$

In addition, the Condorcet Jury Theorem [88, 87] provides bounds on $P_C$ for several cases of $p$ when $n \geq 3$. When the decision is binary:

1. If $p > 0.5$ then $P_C(n) \to 1$ as $n \to \infty$.
2. If $p = 0.5$ then $P_C(n) \to 0.5$ as $n \to \infty$.
3. If $p < 0.5$ then $P_C(n) \to 0$ as $n \to \infty$.

This states that if all individuals make decisions which are independently erroneous and each individual is better than 50% accurate on a 2-class problem (i.e. voting the opposite does not produce more accurate results), the majority vote will converge to 100% accuracy.

Machine Learning violates these assumptions. The errors are frequently not independent, hence the accuracy of the ensemble, $P_C(n)$, will not converge to 1. Of particular importance to this dissertation, the accuracy rates are often not equal. In the general case, this is not a problem. In fact, attempts to correct this problem by weighting classifiers according to accuracy have led to, at best, extremely minimal gains and, more often, extra processing time for no significant gain in accuracy [89, 59]. Weighted classifiers are needed in only some specialized areas of Machine Learning, such as for incremental learning [89, 90, 91] and tracking concept drift in streams [92, 93, 94, 95]. Weighting has also been found to be useful in the problem of learning on spatially disjoint data [11].
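The binomial formula and the Condorcet limits are easy to check numerically; the sketch below evaluates $P_C(n)$ under the idealized assumptions of equal individual accuracy $p$ and independent errors on a binary decision.

```python
from math import comb

def p_correct(n, p):
    """Probability that a majority vote of n independent classifiers,
    each correct with probability p, makes the correct binary decision."""
    k = n // 2 + 1 if n % 2 == 0 else (n + 1) // 2   # votes needed for a majority
    return sum(comb(n, m) * p**m * (1 - p)**(n - m) for m in range(k, n + 1))

# With p > 0.5 the group accuracy climbs toward 1 as n grows.
for n in (3, 11, 101, 1001):
    print(n, round(p_correct(n, 0.6), 4))
```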

5.2 Probabilistic Majority Voting

Because any algorithm must work on simulations where only a few compute nodes have salient examples, a simple majority vote algorithm may fail to classify any points as salient if the number of compute nodes trained with salient examples is less than half of the number of compute nodes. The problem is dealt with by using a weighted majority vote. In the learning stage, several ensembles of classifiers are trained. On each compute node, a fast parallel ensemble technique, such as random forests, is applied. Each of these classifier ensembles is then transferred to a single node. Classification of a test point within the simulation involves a vote of the predictions by each partition's ensemble.

Mathematically, a set of data partitions $D_i \in D$ is presented for training, generating classifiers $C_i(x)$. Let $d_j^u$ describe an example $d_j \in D_i$ having class $u$. Any individual classifier $C_i(x)$ may not have been trained on an example with class $u$. The number of partitions containing at least one example of $u$ is calculated and referred to as $N_u$. That is, $N_u = \sum_i n_i^u$ where

$$n_i^u = \begin{cases} 1, & \text{if } d_j^u \in D_i \text{ for some } d_j \in D_i \\ 0, & \text{otherwise} \end{cases}$$

Under the Probabilistic Majority framework, the chosen label of a new example $x$ is

$$\arg\max_u \frac{\left|\{i : C_i(x) = u\}\right|}{N_u}$$

As an example, if 3 of 5 partitions contain salient examples and 5 of 5 partitions contain non-salient examples, then the number of partitions capable of voting salient, $N_s$, is 3 and the number of partitions capable of voting non-salient, $N_n$, is 5.

In the two class saliency determination problem, this rule can be simplified from that of an argmax problem to that of finding an appropriate threshold, given the number of partition ensembles predicting saliency $V_s$, the number of partition ensembles predicting non-saliency $V_n$, the number of partition ensembles capable of predicting saliency $N_s$, and the number of partition ensembles capable of predicting non-saliency $N_n$. Specifically, a vote for saliency is obtained if

$$\frac{V_s}{N_s} > \frac{V_n}{N_n}$$

Any unlabeled example is considered to be non-salient by default. Thus, unless an entire partition is classified as salient, partitions capable of predicting saliency are also capable of predicting non-saliency. This can be represented by writing $N_n = V_s + V_n$. It follows that a vote for saliency occurs when

$$\frac{V_s}{N_s} > \frac{N_n - V_s}{N_n}$$

$$N_n V_s > N_n N_s - V_s N_s$$

$$N_n V_s + V_s N_s > N_n N_s$$

$$V_s (N_n + N_s) > N_n N_s$$

$$V_s > \frac{N_n N_s}{N_n + N_s}$$

This threshold allows the a priori calculation of the number of required saliency votes given parameters calculated during the training phase of the ensemble. Simplifying the calculation of a probability to that of a threshold number of salient votes to be met is very convenient. Consider the previous example where 3 of 5 partitions included salient examples in the partition, and each of the 5 contained non-salient examples. By the threshold above, the number of partitions required to vote salient in order to classify a node as salient is $(5 \cdot 3)/(3 + 5) = 15/8$, i.e., 2. The threshold can be easily shown to simplify back to that of a simple majority vote when $N_n = N_s$:

$$V_s > \frac{N_n N_s}{N_n + N_s} = \frac{N_s^2}{2N_s} \quad\Longrightarrow\quad \frac{V_s}{N_s} > \frac{1}{2}$$

It can also be shown that these equations can deal with risk. This may be important when the cost of misclassifying a salient example is greater than the cost of misclassifying a non-salient example, or vice versa. Let $0 < \alpha < 1$ represent the risk of misclassifying a salient example, and $0 < 1 - \alpha < 1$ the risk of misclassifying a non-salient example. Then

$$\alpha\frac{V_s}{N_s} > (1 - \alpha)\frac{V_n}{N_n}$$

$$\frac{\alpha}{1-\alpha}\cdot\frac{V_s}{N_s} > \frac{N_n - V_s}{N_n}$$

$$\frac{\alpha}{1-\alpha}N_n V_s > N_n N_s - V_s N_s$$

$$\frac{\alpha}{1-\alpha}N_n V_s + V_s N_s > N_n N_s$$

Without loss of generality, when $\alpha > 0.5$ greater weight is given to a salient classification:

$$\alpha > 0.5 \;\Rightarrow\; 2\alpha > 1 \;\Rightarrow\; \alpha > 1 - \alpha \;\Rightarrow\; \frac{\alpha}{1-\alpha} > 1 \;\Rightarrow\; \frac{\alpha}{1-\alpha}V_s N_s > V_s N_s$$

Substituting this into the risk inequality gives

$$\frac{\alpha}{1-\alpha}N_n V_s + \frac{\alpha}{1-\alpha}V_s N_s > N_n N_s$$

$$\frac{\alpha}{1-\alpha}V_s > \frac{N_n N_s}{N_n + N_s}$$

Hence we have a linear scaling factor on $V_s$ which allows for the incorporation of risk. As $\alpha \to 1$, $p(u = \text{salient} \mid x) \to 1$, and as $\alpha \to 0$, $p(u = \text{salient} \mid x) \to 0$. The standard threshold derived from the Probabilistic Majority Vote is applied when $\alpha = 0.5$. For a scientist using visualization tools, the value of $\alpha$ could be adjusted simply with a slider bar, without any additional learning requirements. This would provide for an actual visualization of the cost function.

5.3 Weighted Probabilistic Majority Voting

In the original formula for Probabilistic Majority Voting, one partition is used for training one classifier that provides one vote. While this classifier could itself be an ensemble, still only the local majority vote from that partition is considered. The details behind the number of classifiers voting for each class are hidden. The convenient threshold form can also be easily extended to handle a similar yet subtly different problem: combining each of the individual partition ensembles into one conglomerate super-ensemble. This conversion is done by redefining $V_s$ as the total number of classifiers voting salient, $V_n$ as the total number of classifiers voting non-salient, $N_s$ as the total number of classifiers which could predict saliency, and $N_n$ as the total number of classifiers which could predict non-saliency. Under this framework, a weighted classification could be said to be produced by each ensemble. Thus this method is called the Weighted Probabilistic Majority Vote.
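The decision rule derived above reduces to a few lines of code. In the sketch below, `votes_needed` computes the salient-vote threshold from $N_s$ and $N_n$, and `is_salient` applies the risk factor $\alpha$; for wPMV the same functions are used with classifier counts in place of partition counts. The function names are illustrative, not from the dissertation.

```python
import math

def votes_needed(Ns, Nn):
    """Smallest integer Vs satisfying Vs > Nn*Ns / (Nn + Ns)."""
    return math.floor(Nn * Ns / (Nn + Ns)) + 1

def is_salient(Vs, Ns, Nn, alpha=0.5):
    """PMV decision with risk: (alpha/(1-alpha)) * Vs > Nn*Ns / (Nn + Ns).
    alpha = 0.5 recovers the standard Probabilistic Majority Vote."""
    return (alpha / (1.0 - alpha)) * Vs > Nn * Ns / (Nn + Ns)

# Example from the text: 3 of 5 partitions can vote salient and all 5 can
# vote non-salient, so more than 15/8 of a vote -- i.e. 2 votes -- is needed.
print(votes_needed(3, 5))    # -> 2
print(is_salient(2, 3, 5))   # -> True
```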

5.4 Smoothing

In these simulations, scientists are directed to areas containing salient examples. In theory, scientists would need to examine even those areas consisting of a single salient point. In practice, single point regions of saliency are often not truly salient. Another problematic observation is that in a large simulation, these false regions of saliency are often scattered. In order for scientists to have faith in the accuracy of the classification, it is very important to remove these spurious decisions.

In order to remove small salient regions, a two step algorithm is applied. First, the entire data set is smoothed by storing at each point the average value of nearby points within a specified radius. This changes the saliency value from a binary true/false classification to a floating point value between 0 (not salient) and 1 (salient). Following this, each point in the simulation is compared with a threshold, and only if the threshold value is exceeded does that node become salient. In this fashion, small salient regions surrounded by non-salient regions are eliminated. Likewise, small non-salient regions within salient regions are converted to salient.

The choice of an appropriate threshold can be particularly daunting, and is largely dependent on the choice of smoothing radius. In the case of a radius larger than that of the salient region, the true salient region needs an appropriately low threshold in order to avoid removing the region entirely. Likewise, the radius must be appropriately high in order to remove the small salient regions which have been misclassified. Manual specification of a threshold is complicated even further should different time steps require different thresholds. A robust solution would be to determine the threshold automatically.

In gray-scale image binarization [96, 97, 98, 99], there are several algorithms for determining the best threshold for which pixel values greater than a determined value are turned to white, and pixel values lower than the threshold are turned black. A popular one, Otsu thresholding [96, 97], creates a histogram of values and chooses the best possible threshold which maximizes the between-class variance (functionally equivalent to minimizing the within-class variance).

In this application of Otsu thresholding, a time step having saliency values with a range [0, 1] is converted into a histogram of $N$ bins. Each bin $i$ contains the number of examples $f_i$. The probability of an example appearing in bin $i$ is

$$P_i = f_i/N$$

For each threshold value of $t = 1, 2, \ldots, N$, the probability that an example appears in a bin less than or equal to $t$ is

$$\omega_1(t) = \sum_{i=1}^{t} P_i$$

Likewise, the probability of an example appearing in a bin greater than $t$ is

$$\omega_2(t) = \sum_{i=t+1}^{N} P_i$$

These values can be used to calculate the means for the class of examples less than or equal to the threshold, and the class of examples greater than the threshold:

$$\mu_1(t) = \frac{1}{\omega_1(t)}\sum_{i=1}^{t} i P_i \qquad \mu_2(t) = \frac{1}{\omega_2(t)}\sum_{i=t+1}^{N} i P_i$$

The mean value for the entire time step can be calculated as

$$\mu_T = \frac{1}{N}\sum_{i=1}^{N} i P_i$$

Finally, the between class variance, $\sigma_B^2$, as derived by Otsu for threshold $t$ is

$$\sigma_B^2(t) = \omega_1(t)\left(\mu_1(t) - \mu_T\right)^2 + \omega_2(t)\left(\mu_2(t) - \mu_T\right)^2$$

The appropriate threshold to choose is the value of $t$ which maximizes $\sigma_B^2(t)$.
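A minimal sketch of the two-step procedure follows, assuming hypothetical inputs `points` (an n x 3 array of nodal coordinates) and `saliency` (n binary predictions): saliency values are averaged within a radius, then binarized at the threshold that maximizes the between-class variance. The total mean is computed in the standard Otsu form $\mu_T = \sum_i i P_i$, and the neighbor search is brute force for clarity.

```python
import numpy as np

def smooth(points, saliency, radius):
    points = np.asarray(points, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    out = np.empty(len(points))
    for idx, p in enumerate(points):
        near = np.linalg.norm(points - p, axis=1) <= radius
        out[idx] = saliency[near].mean()       # fraction of salient neighbors
    return out

def otsu_threshold(values, n_bins=256):
    hist, edges = np.histogram(values, bins=n_bins, range=(0.0, 1.0))
    P = hist / hist.sum()                      # bin probabilities
    i = np.arange(1, n_bins + 1)
    mu_T = (i * P).sum()                       # overall mean
    best_t, best_var = 1, -1.0
    for t in range(1, n_bins):
        w1, w2 = P[:t].sum(), P[t:].sum()
        if w1 == 0.0 or w2 == 0.0:
            continue
        mu1 = (i[:t] * P[:t]).sum() / w1
        mu2 = (i[t:] * P[t:]).sum() / w2
        var_b = w1 * (mu1 - mu_T) ** 2 + w2 * (mu2 - mu_T) ** 2
        if var_b > best_var:                   # maximize between-class variance
            best_t, best_var = t, var_b
    return edges[best_t]

# smoothed = smooth(points, predictions, radius=1.0)
# salient = smoothed > otsu_threshold(smoothed)
```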

5.5 Experimental Design

Experiments were done on both the canister and the casing simulations. For each simulation, Probabilistic Majority Voting (PMV), Weighted Probabilistic Majority Voting (wPMV) and testing using all the data was done. While testing on all the data involves shifting data between compute nodes, this experiment provided a useful benchmark. The number of classifiers was chosen such that during prediction, approximately 1000 classifiers were voting on each data example. The open source software package OpenDT [55] was used for building random forest decision trees in parallel. Each tree in the random forest tested only $\lfloor \lg n \rfloor + 1$ attributes at a given node and was built until leaf purity. In the case of a tie vote between classifiers, the unknown class is predicted, since a definite salient vote has not been determined. This allows for fewer false positive predictions, and smoothing helps mitigate the effect of not having enough classifiers vote for saliency.

5.6 Train and Test Sets

There are many ways of breaking up the simulation into train and test sets, with two being examined here. In method one, the data is broken up according to time step information. For each partition, data present in the time steps are collapsed into two segments, a training set and a test set, according to the time step number: odd time steps (1, 3, 5, ...) are used for training and even time steps (2, 4, 6, ...) are used for testing. In this fashion, half of the data is used for training, and the other half of the data is used for testing. In the case of the can, these experiments utilize four partitions each having two data sets, for a total of eight data sets. In the case of the bolt, five partitions each having two data sets are created, for a total of ten data sets. Predictions from classifiers learned on the training sets are combined via PMV or wPMV to predict saliency. When training on all the data, the individual training sets are combined into a single training set, and the ensemble predicts the saliency.

In method two, an out-of-partition scheme is developed where each partition serves as a test set for classifiers trained on the other partitions. Specifically, a partition is used to create classifiers, and then those classifiers are used to predict each of the remaining partitions. This process is repeated until all partitions have been used for training and all partitions have been tested on by $P - 1$ partitions. For example, when there are four partitions, classifiers from partitions $P_1, P_2, P_3$ vote on $P_4$; classifiers from partitions $P_1, P_3, P_4$ vote on $P_2$; etc. For PMV and wPMV, the out-of-partition votes are combined to make a single ensemble prediction. When training on all of the data, the individual partitioned training sets are combined. Continuing with the above example, this means classifiers trained on $P_1 \cup P_2 \cup P_3$ vote on $P_4$; classifiers trained on $P_1 \cup P_3 \cup P_4$ vote on $P_2$; etc.
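The out-of-partition scheme of method two can be summarized in a few lines. Here `train` and `predict` are hypothetical stand-ins for the OpenDT training and voting steps, not functions from the package itself.

```python
def out_of_partition_votes(partitions, train):
    """Train one ensemble per partition; each partition is then predicted
    only by the ensembles trained on the other partitions."""
    ensembles = [train(p) for p in partitions]
    votes = {}
    for test_idx, test_part in enumerate(partitions):
        votes[test_idx] = [ens.predict(test_part)
                           for train_idx, ens in enumerate(ensembles)
                           if train_idx != test_idx]
    return votes   # P-1 prediction vectors per partition, combined by PMV/wPMV
```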

5.7 Canister Odd/Even Experiments

In this section, the simulation is broken into train and test sets according to time step number: training occurs on the odd time steps and testing occurs on the even time steps. Prediction occurs using 1000 decision trees generated via the random forests algorithm. In the case of PMV and wPMV, 250 classifiers are generated on each of the four partitions for all of the odd time steps. During prediction, all 1000 classifiers vote on every even time step. Since every node contains both salient and non-salient examples, PMV and wPMV simplify to a majority vote of the ensemble predictions and classifier predictions, respectively. When using all of the data, 1000 trees are generated from data obtained on all odd time steps, and they are used to predict on the even time steps. The nodal accuracy for each method is recorded. Predicted saliency values are then smoothed using a radius of 1 inch and thresholded automatically via Otsu thresholding. Screenshots of the simulation are provided for comparison on time steps throughout the simulation in Figures 13, 14, and 15. The nodal accuracy for both the unsmoothed and the smoothed versions is shown in Tables 13 and 14.

In every case, using all of the data provides for more accurate classification. This shows the unfortunate predicament of only being able to train on a partition's local data. PMV was more likely to classify an example as not salient than was wPMV. This is explained by the choice of how ties are broken (all to unsalient). PMV is using four individual predictions whereas wPMV is using 1000. Therefore it is much more likely that a tie will occur for PMV than for wPMV. The overall accuracy for wPMV is higher than that of PMV, but for this problem that can again be explained by the tie breaking procedure, as there are more salient examples than non-salient examples in this simulation.

In comparing the three figures, the similarities are profound. They each seem to be capturing the same concept of the crushed canister nodes. The accuracy differences seem to originate from the fine detail of exactly how crushed a node has to be before it is classified as salient. Smoothing increases the accuracy of both the true positive and true negative results. The number and percentage of differences are very small, usually around 0.5%. The smoothness of the resulting canister figures shows that the smoothing algorithm has done well at connecting the individual regions and removing salient speckle from the simulation.

Table 13. The unsmoothed nodal accuracy results for the canister simulation for the odd/even experiments are provided. For each true class, the percentage of correctly and incorrectly classified examples is shown in parentheses beside the nodal counts. The overall accuracy is also given.

| True Class | Predicted Class | PMV | wPMV | All |
|---|---|---|---|---|
| Unsalient | Unsalient | 89776 (89.87%) | 87906 (87.99%) | 91772 (91.86%) |
| Unsalient | Salient | 10123 (9.13%) | 11993 (12.01%) | 8127 (8.14%) |
| Salient | Unsalient | 11267 (9.23%) | 6760 (5.54%) | 6563 (5.38%) |
| Salient | Salient | 110770 (90.77%) | 115277 (94.46%) | 115474 (94.62%) |
| Overall | | 90.36% | 91.55% | 93.38% |

Table 14. Smoothed nodal accuracy results for the canister simulation for the odd/even experiments are provided. For each true class, the percentage of correctly and incorrectly classified examples is shown in parentheses beside the nodal counts. The overall accuracy is also given.

| True Class | Predicted Class | PMV | wPMV | All |
|---|---|---|---|---|
| Unsalient | Unsalient | 89827 (90.68%) | 88204 (88.29%) | 91940 (92.03%) |
| Unsalient | Salient | 10072 (10.23%) | 11695 (11.71%) | 7959 (7.97%) |
| Salient | Unsalient | 10503 (8.61%) | 6285 (5.15%) | 6001 (4.92%) |
| Salient | Salient | 111534 (91.39%) | 115752 (94.85%) | 116036 (95.08%) |
| Overall | | 90.73% | 91.90% | 93.71% |

Figure 13. An image mosaic of predicted saliency after smoothing on the canister simulation using Probabilistic Majority Voting on time steps 10, 22, 32, and 44 for the odd/even experiments is provided. Time step order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.

Figure 14. An image mosaic of predicted saliency after smoothing on the canister simulation using Weighted Probabilistic Majority Voting on time steps 10, 22, 32, and 44 for the odd/even experiments is provided. Time step order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.

Figure 15. An image mosaic of predicted saliency after smoothing on the canister simulation using all the data on time steps 10, 22, 32, and 44 for the odd/even experiments is provided. Time step order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.

5.8 Canister Out-of-Partition Experiments

In this section, partitions which were not used for training are tested on. Prediction occurs using approximately 1000 decision trees generated via the random forests algorithm. In the case of PMV and wPMV, 333 classifiers are generated on each of the four partitions. During prediction, each partition is voted on by the classifiers that were not trained on that partition. Thus each partition is voted on by $3 \cdot 333 = 999$ classifiers. Since every node contains both salient and non-salient examples, PMV and wPMV simplify to a majority vote of the ensemble predictions and classifier predictions, respectively. When using all of the data, 1000 trees are generated from data appearing in three of the four partitions. The held-out partition is then voted on using those 1000 trees. This process is repeated until all partitions have been classified. The nodal accuracy for each method is recorded. Predicted saliency values are then smoothed using a radius of 1 inch and thresholded automatically via Otsu thresholding. Screenshots of the simulation are provided for comparison on time steps throughout the simulation in Figures 16, 17, and 18. The nodal accuracy for both the unsmoothed and the smoothed versions is shown in Tables 15 and 16.

In these experiments, using all the data is once again more accurate than using wPMV, which is more accurate than using PMV. In this case though, the difference between using PMV and wPMV is very small. This experiment does not have the same tie breaking issues as were seen in the odd/even experiments for PMV, a likely explanation for the closer accuracy results between PMV and wPMV. In all cases, smoothing again increases the accuracy rate of the true positives and true negatives. The increase in overall accuracy due to smoothing is approximately 0.5%. Looking at the mosaics of the time steps shows that the resulting predictions are mostly large single regions of saliency, without appreciable speckling. These experiments were evidently a bit more difficult to learn, as the accuracy dropped between 1-2% for each learning algorithm.

5.9 Casing Odd/Even Experiments

The same odd/even experiments used in the canister simulation are repeated here for the casing simulation. That is, the odd time steps (1, 3, 5, ...) are reserved for training and the even time steps (2, 4, 6, ...) for testing.


Table 15. Unsmoothed nodal accuracy results for the canister simulation for the out-of-partition experiments are provided. For each true class, the percentage of correctly and incorrectly classified examples is shown beside the nodal counts. The overall accuracy is also given.

True Class   Predicted Class   PMV               wPMV              All
Unsalient    Unsalient         171679 (84.56%)   172752 (85.08%)   173880 (85.64%)
Unsalient    Salient            31356 (15.44%)    30283 (14.92%)    29155 (14.36%)
Salient      Unsalient          17295  (7.18%)    15986  (6.64%)    10691  (4.44%)
Salient      Salient           223542 (92.82%)   224851 (93.36%)   230146 (95.56%)
Overall                                89.04%            89.58%            91.02%

Table 16. Smoothed nodal accuracy results for the canister simulation for the out-of-partition experiments are provided. For each true class, the percentage of correctly and incorrectly classified examples is shown beside the nodal counts. The overall accuracy is also given.

True Class   Predicted Class   PMV               wPMV              All
Unsalient    Unsalient         172944 (85.18%)   173382 (85.40%)   174731 (86.06%)
Unsalient    Salient            30091 (14.82%)    29653 (14.60%)    28304 (13.94%)
Salient      Unsalient          16028  (6.66%)    15037  (6.24%)     8821  (3.66%)
Salient      Salient           224809 (93.34%)   225800 (93.76%)   232016 (96.34%)
Overall                                89.61%            89.93%            91.64%


Figure 16. An image mosaic of predicted saliency after smoothing on the canister simulation using Probabilistic Majority Voting on timesteps 10, 22, 32, and 44 for the out-of-partition experiments is provided. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


Figure 17. An image mosaic of predicted saliency after smoothing on the canister simulation using Weighted Probabilistic Majority Voting on timesteps 10, 22, 32, and 44 for the out-of-partition experiments is provided. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


Figure 18. An image mosaic of predicted saliency after smoothing on the canister simulation using all the data on timesteps 10, 22, 32, and 44 for the out-of-partition experiments is provided. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


5.9 Casing Odd/Even Experiments

The same odd/even experiments used in the canister simulation are repeated here for the casing simulation. That is, the odd timesteps, 3, 5, ..., are reserved for training and the even timesteps, 4, 6, ..., for testing. Once again, prediction occurs using approximately 1000 decision trees generated via the random forests algorithm. Since there are five partitions in this simulation, data from each partition is used to learn 200 classifiers. In the case of PMV and wPMV, only 3 partitions will ever see salient examples. Therefore, the number of partitions predicting saliency for PMV must be greater than or equal to ⌈3·5/(5+3)⌉ = 2. Likewise for wPMV, the number of classifiers predicting saliency must be greater than or equal to ⌈3·1000/(5+3)⌉ = 375. For experiments using all the data, 1000 classifiers are generated on all the data throughout each odd timestep, and prediction occurs on each even timestep. The nodal accuracy for each method is recorded. Predicted saliency values are then smoothed using a radius of 2 inches and thresholded automatically via Otsu thresholding. Screenshots of the simulation are provided for comparison on timesteps throughout the simulation in Figures 19, 20, and 21. The nodal accuracy for both the unsmoothed and the smoothed versions is shown in Tables 17 and 18.

The results for the casing simulation show that PMV is more likely to result in a salient vote than any of the other algorithms. Unfortunately, this results in large regions of non-salient examples being classified as salient. This is borne out in the table results and plainly visible in the figures. On the other hand, PMV does seem to find each of the bolts and label them mostly full after smoothing. The results for wPMV and all the data look very similar. On the last timestep, wPMV labels only about 1/3 of one of the bolts, whereas using all of the data labels that bolt fully.

Once again the Otsu smoothing algorithm improves the accuracy of each training algorithm. This time however, the accuracy difference between smoothed and unsmoothed for PMV and wPMV is much larger. The accuracy increase for wPMV for true positives is 3.6% and for PMV it is 11.6%. As neither of these increases produces a lower true negative accuracy, this is a welcome improvement. Using all the data only produced a true positive increase of 0.1%. Interestingly, PMV had a lower true positive accuracy than both wPMV and all the data before smoothing, as well as a higher true positive rate after smoothing. Most likely PMV was able to speckle each bolt with salient predictions such that the smoothing algorithm was able to fill in the gaps. Unfortunately, PMV still had the highest false positive rate.
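The vote thresholds quoted above, and those in the next section, are all consistent with a single rule: with s of the p training partitions containing salient examples and V voters in total, predicting saliency requires at least ⌈sV/(p+s)⌉ salient votes. The following sketch simply reproduces that arithmetic; the function name is ours, introduced for illustration, and the rule itself is a reconstruction from the quoted thresholds rather than the dissertation's stated formula.

```python
from math import ceil

def salient_vote_threshold(voters, salient_partitions, training_partitions):
    """Minimum salient votes required: ceil(s * V / (p + s)),
    reconstructed from the thresholds quoted in the text."""
    return ceil(salient_partitions * voters /
                (training_partitions + salient_partitions))

# Casing odd/even experiments: 5 partitions, 3 of which see salient examples.
assert salient_vote_threshold(5, 3, 5) == 2       # PMV: partitions voting
assert salient_vote_threshold(1000, 3, 5) == 375  # wPMV: classifiers voting
```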


Table 17. Unsmoothed nodal accuracy results for the casing simulation for the odd/even experiments are provided. For each true class, the percentage of correctly and incorrectly classified examples is shown beside the nodal counts. The overall accuracy is also given.

True Class   Predicted Class   PMV               wPMV              All
Unsalient    Unsalient         678953 (98.19%)   686275 (99.24%)   686450 (99.27%)
Unsalient    Salient            12547  (1.81%)     5225  (0.76%)     5050  (0.73%)
Salient      Unsalient           9667 (17.25%)     8228 (14.68%)     5673 (10.12%)
Salient      Salient            46363 (82.75%)    47802 (85.32%)    50357 (89.88%)
Overall                                97.03%            98.20%            98.57%

Table 18. Smoothed nodal accuracy results for the casing simulation for the odd/even experiments are provided. For each true class, the percentage of correctly and incorrectly classified examples is shown beside the nodal counts. The overall accuracy is also given.

True Class   Predicted Class   PMV               wPMV              All
Unsalient    Unsalient         679285 (98.23%)   686376 (99.26%)   686558 (99.29%)
Unsalient    Salient            12215  (1.77%)     5124  (0.74%)     4942  (0.71%)
Salient      Unsalient           3153  (5.63%)     6212 (11.09%)     5622 (10.03%)
Salient      Salient            52877 (94.37%)    49818 (88.91%)    50408 (89.97%)
Overall                                97.94%            98.48%            98.59%


Figure 19. An image mosaic of predicted saliency after smoothing on the casing simulation using Probabilistic Majority Voting on timesteps 6, 10, 16, and 20 for the odd/even experiments is provided. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


Figure 20. An image mosaic of predicted saliency after smoothing on the casing simulation using Weighted Probabilistic Majority Voting on timesteps 6, 10, 16, and 20 for the odd/even experiments is provided. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


Figure 21. An image mosaic of predicted saliency after smoothing on the casing simulation using all the data on timesteps 6, 10, 16, and 20 for the odd/even experiments is provided. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


5.10 Casing Out-of-Partition Experiments

The same out-of-partition experiments used in the canister simulation are repeated here for the casing simulation. Once again, prediction occurs using approximately 1000 decision trees generated via the random forests algorithm. Since there are five partitions in this simulation, of which one will be held out each iteration, each partition has 250 classifiers learned on its data. In the case of PMV and wPMV, only 3 partitions will ever see salient examples. Calculating the number of salient votes required to predict saliency becomes a bit more complex in these experiments. If the out-of-partition example comes from one of the partitions which did not see salient examples, then 3 of the 4 partitions in training included salient examples. In the case of PMV, greater than or equal to ⌈3·4/(4+3)⌉ = 2 ensembles must predict saliency, and in the case of wPMV, greater than or equal to ⌈3·1000/(4+3)⌉ = 429 classifiers must predict saliency. On the other hand, when the out-of-partition example comes from one of the partitions which did see salient examples, then 2 of the 4 partitions were trained on salient examples. For PMV, the saliency requirement does not change, since ⌈2·4/(4+2)⌉ = 2. In the case of wPMV, greater than or equal to ⌈2·1000/(4+2)⌉ = 334 classifiers must predict saliency. For experiments using all the data, 1000 classifiers are generated on each set of four partitions and tested on the held-out partition. The nodal accuracy for each method is recorded. Predicted saliency values are then smoothed using a radius of 2 inches and thresholded automatically via Otsu thresholding. Screenshots of the simulation are provided for comparison on timesteps throughout the simulation in Figures 22, 23, and 24. The nodal accuracy for both the unsmoothed and the smoothed versions is shown in Tables 19 and 20.

In these experiments, there are many more missed bolts in the simulation using PMV, wPMV, and all the data. The difference in results when compared to the odd/even experiments suggests that each of the bolts is behaving a bit differently. Further, the bolts which are most likely to be missed are those on the outer partitions, especially the two bolts on either side which are further from the others. Both of these observations suggest that each bolt is more like its neighbor than any other bolt in the simulation.
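Using the salient_vote_threshold sketch from the previous section, the out-of-partition thresholds above follow from the same reconstructed rule:

```python
# Out-of-partition casing experiments: 4 training partitions per fold.
assert salient_vote_threshold(4, 3, 4) == 2       # PMV, held-out partition saw no salient data
assert salient_vote_threshold(1000, 3, 4) == 429  # wPMV, same case
assert salient_vote_threshold(4, 2, 4) == 2       # PMV, held-out partition saw salient data
assert salient_vote_threshold(1000, 2, 4) == 334  # wPMV, same case
```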


Table 19. Unsmoothed nodal accuracy results for the casing simulation for the out-of-partition experiments are provided. For each true class, the percentage of correctly and incorrectly classified examples is shown beside the nodal counts. The overall accuracy is also given.

True Class   Predicted Class   PMV                wPMV               All
Unsalient    Unsalient         1448761 (99.77%)   1449300 (99.80%)   1448694 (99.76%)
Unsalient    Salient              3389  (0.23%)      2850  (0.20%)      3456  (0.24%)
Salient      Unsalient           27549 (23.41%)     29108 (24.74%)     21494 (18.27%)
Salient      Salient             90114 (76.59%)     88555 (75.26%)     96169 (81.73%)
Overall                                  98.03%             97.96%             98.41%

Table 20. Smoothed nodal accuracy results for the casing simulation for the out-of-partition experiments are provided. For each true class, the percentage of correctly and incorrectly classified examples is shown beside the nodal counts. The overall accuracy is also given.

True Class   Predicted Class   PMV                wPMV               All
Unsalient    Unsalient         1448478 (99.75%)   1448996 (99.78%)   1448484 (99.75%)
Unsalient    Salient              3672  (0.25%)      3154  (0.22%)      3666  (0.25%)
Salient      Unsalient           24078 (20.54%)     24534 (20.85%)     17928 (15.24%)
Salient      Salient             93129 (79.46%)     93129 (79.15%)     99735 (84.76%)
Overall                                  98.23%             98.24%             98.62%

The propensity of PMV and wPMV to not vote as many bolt nodes as salient bolsters their true negative accuracy rate, allowing the true negative accuracy rate of wPMV to overtake that of using all the data. The true positive rate when using all the data is 5-6% higher than that of PMV or wPMV. Nonetheless, the use of all the data still misses entire bolts from the simulation.

The smoothing results show a very slight decrease in true negative accuracy, the largest difference being 0.02%. In return, each of the true positive rates increases by 3-4%. Considering the difficulty in this experiment with finding bolts, this is likely a welcome trade-off.


Figure 22. An image mosaic of predicted saliency after smoothing on the casing simulation using Probabilistic Majority Voting on timesteps 6, 10, 16, and 20 for the out-of-partition experiments is provided. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


Figure 23. An image mosaic of predicted saliency after smoothing on the casing simulation using Weighted Probabilistic Majority Voting on timesteps 6, 10, 16, and 20 for the out-of-partition experiments is provided. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


Figure 24. An image mosaic of predicted saliency after smoothing on the casing simulation using all the data on timesteps 6, 10, 16, and 20 for the out-of-partition experiments is provided. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


Chapter 6
A Semi-Supervised Incremental Approach

A typical Machine Learning problem exists in a static environment. That is, all of the data that can be trained on is available during the Learning Phase and all of the testing can be performed in the Testing Phase. The algorithm from the previous chapter is readily applicable to that environment, and the simulation problem can be readily converted to fit that scheme. In making this conversion though, relevant and potentially valuable information about the future saliency of nodes is dismissed. A salient node in one timestep is likely to again be salient in the next timestep.

6.1 Semi-Supervised Methods

Most training sets must be manually labeled by a domain expert in order to provide an inductive learner with the necessary data to generate a classifier. For large data sets, particularly those with many examples, this process can become both time-consuming and tedious for the domain expert. The amount of labeled data required for accurate classification is difficult to determine prior to classifier generation, and increases rapidly with the scaling of the number of attributes and/or classes.

Semi-supervised learning uses existing labeled data in conjunction with unlabeled data to generate more accurate classifiers than using the labeled data alone. It achieves this, in part, by providing labels for new examples which have not yet been classified by an expert. The foundation for semi-supervised classification is built around accurately modeling the problem which is to be solved. Clearly, providing an excess of incorrect labels to the unlabeled data will introduce noise into the inductive learner and adversely affect the choice of decision boundaries. It has been found that most attempts at using semi-supervised classification shift the responsibility of a domain expert from labeling data to choosing the best model which fits the data [100]. Not all algorithms are successful


in new domains. As an example, the accuracy of Hidden Markov Models in lexical analysis can be reduced through select semi-supervised learning algorithms [101, 102].

There are several common practices in the design of semi-supervised learning algorithms. In a generative model, the unlabeled data is clustered, and then assigned to the most likely class chosen from the manually labeled examples. There are a multitude of methods by which this data may be clustered. A soft clustering can be performed using an Expectation-Maximization (EM) algorithm in order to estimate the parameters of the underlying distributions. The mixture of Gaussians model and the mixture of multivariate Bernoulli models are representative. In theory, if the model assumption is correct, unlabeled data examples are guaranteed to improve accuracy [103, 104, 105]. If the model assumption is not correct, accuracy can be negatively affected, as derived in [106]. By incorporating active learning, where the domain expert participates in providing labels for algorithmically chosen samples, a more accurate application of the EM algorithm can be obtained [107].

Another method for semi-supervised learning is referred to as co-training. In co-training, multiple models which can provide an estimate of the confidence in their prediction are generated from the labeled data. Some or all of the unlabeled data is then classified according to the classifiers, and the most confidently predicted examples are inserted into the labeled training sets with the predicted class labels. Many papers have been written which utilize this methodology. In [108, 109], the attributes of the labeled training set are disjointly partitioned into two subsets which are used for training two classifiers. If the class of an unlabeled example can be confidently determined by one classifier, then that example is recursively added to the training set of the other classifier. In order for this method to work, the partitioning of the attribute space must be such that the attribute groups are not well correlated. In general, co-training can only work if the individual classifiers are independent [110]. In [111], rather than partition the attributes, the full set of labeled data is provided to both a top-down decision tree creator, ID3 [12], and a bottom-up decision graph creator, HOODG [112]. This work was further extended to consider an ensemble of greater than two classifiers, wherein the example was included in the training data if the majority of classifiers confidently agreed upon the predicted class label [113].
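A minimal sketch of the attribute-split co-training idea of [108, 109] follows. It is a simplification for illustration only, assuming scikit-learn and a numeric attribute matrix whose columns are split into two views at a given index; the function and parameter names are ours, not those of the cited systems.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def co_train(X, y, X_unlabeled, split, rounds=5, confidence=0.95):
    """Two attribute views (columns left/right of `split`); each view's
    classifier confidently labels unlabeled examples for the other view."""
    views = (slice(None, split), slice(split, None))
    train_X = [X.copy(), X.copy()]
    train_y = [y.copy(), y.copy()]
    pool = X_unlabeled.copy()
    classifiers = [None, None]
    for _ in range(rounds):
        for v in (0, 1):
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            clf.fit(train_X[v][:, views[v]], train_y[v])
            classifiers[v] = clf
            if len(pool) == 0:
                continue
            probs = clf.predict_proba(pool[:, views[v]])
            keep = probs.max(axis=1) >= confidence
            other = 1 - v
            # Confidently labeled examples augment the *other* view's training set.
            train_X[other] = np.vstack([train_X[other], pool[keep]])
            train_y[other] = np.concatenate(
                [train_y[other], clf.classes_[probs[keep].argmax(axis=1)]])
            pool = pool[~keep]  # labeled examples leave the unlabeled pool
    return classifiers
```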


Finally, there is the method of self-training, in which a classifier is built on the labeled data and used to classify some portion of the unlabeled data. Typically the most confidently predicted examples are then recursively inserted into the training set and a new classifier generated. In [114], self-training was applied to the problem of deciphering context in written words, such as whether "crane" is referring to a bird or a machine. The application of semi-supervised learning to this problem is particularly fitting because of several known properties of word usage. Surrounding words around the target word provide a strong sense as to its meaning. Therefore, higher accuracy is obtained by incrementally adding (also called tagging) words which are found close to the target word. Except for some specific words such as "sake" (meaning either the alcoholic beverage or a benefit), accuracy using only two manually tagged words was on par with that of using a fully labeled training set. Using the dictionary definition of the word produced slightly higher accuracy. An increase in accuracy over supervised learning was obtained by increasing the confidence of all tagged words in a document if there was high confidence in any tagged word in the document. This was successful because target words are not often used in multiple contexts [115].

Self-training has also been used in object detection systems in order to reduce labeling associated with the large number of variations an object in a photograph may take [116]. A significant conclusion of their work was that the measure of confidence from the classifier in the predictions was not especially beneficial to increasing accuracy. Another measure, which they called the MSE selection metric, performed much better. In this metric, the relative value of a newly labeled example is determined by calculating the distance between said example and the groups of existing labeled examples in feature space, whether those labels came from an expert or via semi-supervised classification. The confidence metric resulted in a steady decrease in accuracy, whereas the MSE metric resulted in an overall increase in accuracy. The authors claim the confidence metric and batch EM algorithms are particularly weak in this domain because the data distribution of the sample is inherently not representative of the underlying distribution of the data. The authors provide an example which shows how self-training with a confidence metric can negatively change the underlying distribution of the data for a class from that of its initial manually labeled distribution.
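The basic self-training loop, common to the systems cited above, can be sketched as follows. This is a generic illustration assuming scikit-learn, not a reimplementation of [114] or [116].

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train(X_labeled, y_labeled, X_unlabeled, confidence=0.95, rounds=10):
    """Repeatedly move the most confidently predicted unlabeled examples
    into the training set, retraining after each round."""
    X, y, pool = X_labeled, y_labeled, X_unlabeled
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    for _ in range(rounds):
        if len(pool) == 0:
            break
        probs = clf.predict_proba(pool)
        keep = probs.max(axis=1) >= confidence   # most confident predictions
        if not keep.any():
            break
        X = np.vstack([X, pool[keep]])
        y = np.concatenate([y, clf.classes_[probs[keep].argmax(axis=1)]])
        pool = pool[~keep]
        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    return clf
```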


Semi-supervised learning is designed to take advantage of unlabeled data, and is particularly relevant when the quantity of unlabeled data is large. Unfortunately, most of the research has focused on small data sets [100]. The scalability of such semi-supervised learning algorithms has largely been left unaddressed. In the following section, a scalable and accurate semi-supervised learning algorithm is applied to the large casing data set, which has more than one million examples.

6.2 A Semi-Supervised Self-Training Algorithm

The problem of determining saliency has an analogue in the Operating Systems problem of determining page file allocation and replacement strategies [117, 118]. Spatial locality states that a page and its neighbors, once accessed, are likely to be accessed again in the near future. Temporal locality states that pages which have recently been used are likely to be reused in the future. Spatial locality can readily be applied to the simulation data via smoothing of the class labels. That is, after the testing phase has been completed, the saliency of every node is a function of the saliency of itself and its neighbors. The use of this was shown in the previous chapter.

Temporal smoothing would also be possible. Namely, the saliency of every node is a function of the saliency of itself, and its past and future selves. While temporal smoothing would certainly be applicable in most problems, there are instances where it would not. For example, in a fluid dynamics problem, the liquid nodes are often changing location while the liquid container does not. Suppose fluid nodes are only salient when located in a certain static area of the system. In an enclosed circuit, nodes would change from salient to non-salient and repeat this cycle for each loop through the circuit. Temporal smoothing could change the entire liquid substructure to salient or not salient, harming the accuracy. In reconsidering the Operating System analogy, temporal locality is not exhibited when the access pattern is sporadic, such as in a SQL database. The properties of the fluid nodes in feature space as they pass through the salient region of the container could likewise appear random. A more universal approach which takes advantage of temporal locality is needed.

In the following semi-supervised self-training algorithm, the user presents to the learner a subset of labeled data. The subset consists of several timesteps from the beginning of the simulation.


The learner uses these timesteps to predict a small number of future timesteps. The predicted future timesteps are smoothed spatially. Learning then starts again from the beginning, this time learning on the additional predicted and smoothed timesteps. This process continues recursively until all data has been labeled.

Define T_i ∈ T as a timestep in the simulation, and T_{i,j} as example j in T_i. T^u_{i,j} is the class label of T_{i,j}. Let n be the number of timesteps which are originally labeled by an expert and m the total number of timesteps. Classifier C_i(x) is a classifier which has been trained on T_1 ... T_{i-1}. The semi-supervised learning algorithm is shown in Algorithm 2. Though this algorithm increments one timestep at a time, a multiple-step increment could also be specified. We refer to this algorithm as the Semi-Supervised Incremental (SSI) algorithm.

Algorithm 2: The Semi-Supervised Incremental (SSI) Learning Algorithm
1: Initially T_i for all i ≤ n are labeled
2: i ← n
3: while i < m do
4:    train C_{i+1} on the labeled timesteps T_1 ... T_i
5:    predict the class of each example in T_{i+1} with C_{i+1}
6:    smooth the predicted labels of T_{i+1} spatially and threshold them
7:    mark T_{i+1} as labeled
8:    i ← i + 1
9: end while
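A Python sketch of Algorithm 2 is given below. It assumes scikit-learn's RandomForestClassifier as the base learner, and smooth_and_threshold is a hypothetical helper standing in for the 3-D spatial smoothing and Otsu thresholding, which depend on the simulation mesh. The step parameter corresponds to the multiple-step increment mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ssi_learn(timesteps, labels, n_labeled, smooth_and_threshold, step=1):
    """timesteps: list of per-timestep attribute arrays T_1 ... T_m;
    labels: class-label arrays for the first n_labeled timesteps.
    Assumes n_labeled < len(timesteps)."""
    labeled = list(labels[:n_labeled])
    i = n_labeled
    clf = None
    while i < len(timesteps):
        # Retrain on everything labeled so far (expert plus self-labeled).
        clf = RandomForestClassifier(n_estimators=1000, random_state=0)
        clf.fit(np.vstack(timesteps[:i]), np.concatenate(labeled))
        for j in range(i, min(i + step, len(timesteps))):
            # Predict the next timestep(s), then smooth spatially so that
            # isolated mispredictions are absorbed by their neighbors.
            raw = clf.predict_proba(timesteps[j])[:, 1]
            labeled.append(smooth_and_threshold(raw, timesteps[j]))
        i += step
    return clf
```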

Ensembles of decision trees were created on every block of timesteps, 200 per each of the five partitions. The individual ensemble predictions were combined via PMV. In the previous chapter, Probabilistic Majority Voting was shown to be less accurate than wPMV in classifying the casing simulation. Ideally the application of the SSI algorithm would help the accuracy of PMV to meet or exceed that of wPMV or using all the data.

After building an ensemble of classifiers on each group of timesteps, the predictions for the following timesteps are smoothed, thresholded via Otsu's algorithm, and recorded. In these experiments, the trees were used to predict and smooth all of the remaining timesteps. This is not necessary for the SSI learning algorithm, where only the next timestep needs to be predicted and smoothed. The extra predictions were obtained because they allowed for analyzing how later predictions changed as a function of time.

Figure 25 shows the final predictions made by the SSI algorithm on four different timesteps. Timestep 8, the first timestep after which labeled training data has been provided, will have bolts that are very similar physically. Timestep 21, the last timestep in the simulation and furthest in time from the labeled data, will have bolts that have undergone the most amount of change physically. Two intermediate timesteps, 12 and 16, are also included in the mosaic. For comparison purposes, Figure 26 shows the result when not using the SSI algorithm. Instead, the first seven timesteps are used for training, and each of the remaining timesteps is predicted via PMV. Figure 27 does not use the SSI algorithm either, but uses the entire unpartitioned amount of data on the first seven timesteps. In all of these figures, the resulting predictions have been smoothed and Otsu thresholded just as in the case of the SSI algorithm. As smoothing and thresholding have already been shown to increase accuracy, this process is necessary for a fair comparison. Table 21 provides an accuracy comparison.

6.4 Further Analysis of SSI Learning

Semi-supervised learning is a process by which additional labeled examples are added to the training set. The number of additional examples to be added to the training set is fixed based on the number of nodes in the timestep of the simulation. The choice of class label for those new examples is


Figure 25. An image mosaic of predicted saliency after using the SSI algorithm on the casing simulation with Probabilistic Majority Voting is provided. Timesteps shown are 8, 12, 16, and 21. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


Figure 26. An image mosaic of predicted saliency without the SSI algorithm on the casing simulation using Probabilistic Majority Voting is provided. Timesteps shown are 8, 12, 16, and 21. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


Table 21. Smoothed nodal accuracy results for the casing simulation using the SSI algorithm and other approaches are provided. For each true class, the percentage of correctly and incorrectly classified examples is shown beside the nodal counts. The overall accuracy is also given.

True Class   Predicted Class   PMV (SSI)         PMV (No SSI)      All (No SSI)
Unsalient    Unsalient         959818 (99.14%)   927364 (95.79%)   955061 (98.65%)
Unsalient    Salient             8282  (0.86%)    40736  (4.21%)    13039  (1.35%)
Salient      Unsalient            316  (0.40%)       34  (0.04%)     4009  (5.11%)
Salient      Salient            78126 (99.60%)    78408 (99.96%)    74433 (94.89%)
Overall                                99.18%            96.10%            98.37%

Figure 27. An image mosaic of predicted saliency without the SSI algorithm using all the data is provided. Timesteps shown are 8, 12, 16, and 21. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


subject to two processes. First, the decision trees label the nodes of the simulation. Next, the smoothing and thresholding algorithms extend large regions of saliency and remove small regions of saliency. Since predicting one timestep ahead is easy compared with predicting multiple timesteps ahead, the hope is that small steps through the data will lead to better results overall.

In Figure 28, predictions made on the final timestep are shown as SSI learning occurs. In the first image, the first seven manually labeled timesteps are being used to predict the final timestep. In the second image, the first seven manually labeled timesteps plus the next four self-labeled timesteps are being used to predict the final timestep. Initially the results are quite poor, as expected. As SSI learning proceeds, the results clearly improve. By timestep 15, very reasonable predictions are being made on the final timestep.

Further analysis is provided in the form of graphs by observing the changes in true positives and true negatives (Figure 29), and false positives and false negatives (Figure 30), as SSI learning increments. The first seven timesteps have been manually labeled and are not included in the figures. As in Figure 28, the change in the predictions of the final timestep is observed with reference to the last timestep used in SSI learning. The percentage of true negatives (non-salient nodes marked as non-salient) steadily increases. The number of true positives (salient regions predicted as salient) is slightly reduced between the beginning and end of the simulation. During the intermediate timesteps 13-16, the percentage of true positives demonstrates a clear dip in accuracy. At approximately timestep 13, the bolts begin to shear, a physical process which the labeled data does not include. The same errors are made in this region whether or not SSI learning is used, and whether or not the data is partitioned. It is notable that these errors are removed after training on later timesteps.

False positive predictions (non-salient nodes marked as salient) and false negative predictions (salient nodes marked as non-salient) correlate precisely with true negative and true positive predictions, respectively. Hence the large dip in true positive accuracy between intermediate timesteps will be mirrored by a spike in false negative error. The overall increase in accuracy is largely explained by the reduction in false positive error. Indeed, the false negative error actually increases slightly. Given that false positive error will direct scientists to explore areas of the simulation with


Figure 28. An image mosaic of predicted saliency showing SSI learning using PMV as time progresses forward is provided. Each image is of the final timestep, showing the predictions after learning on 7, 11, 15, and 20 timesteps. Only the first 7 timesteps used labeled data. Timestep order is from left to right, top to bottom. Salient nodes are colored in red and non-salient nodes in blue.


Figure 29. True positive versus true negative accuracy on the final timestep after SSI learning processes self-labeled data.

Figure 30. False positive versus false negative error on the final timestep after SSI learning processes self-labeled data.


no value, this is immensely beneficial. In practice, a scientist intending to trace a concept through a simulation is more likely to prefer correctly identifying other instances of that concept than sorting through numerous red herrings.

The trends in error on the final timesteps largely characterize the whole of the simulation. The same methods for producing the previous graphs are duplicated in producing graphs which characterize performance across the entire simulation. Figures 31 and 32 show the same large reduction in false positive error at the slight expense of true negative accuracy. The trends are less amplified throughout the whole simulation due to the number of timesteps closer to the data labeled by the expert.

Finally, the number of timesteps which the SSI algorithm increments by in the prediction/retraining phase can be increased to consider multiple timesteps. It is possible to predict n timesteps and then choose the examples from those timesteps, with labels, to include in the training set. Then retraining can be done. All of the examples, or a representative set, or the best set of examples by some criteria can be chosen to add to the training set in the same way as would be done when looking only one timestep ahead. An advantage of this is an increase in overall speed as a result of not classifying and retraining on every intermediate timestep. As the number of timesteps to increment approaches the full number of timesteps remaining in the simulation, the algorithm approaches PMV. We might therefore expect that the accuracy of skipping multiple timesteps is lower bounded by that of PMV.

The casing simulation was experimented upon using the out-of-partition strategy. The number of timesteps is incremented by two for each prediction/retraining iteration. The results are provided in Table 22. The number of false positives and false negatives has increased, resulting in a net loss of 0.61%. The overall accuracy is still better than using the unpartitioned data, but only by 0.20%. PMV without SSI produces an accuracy of 96.10%, hence a reduction in running time of about 50% was obtained without significantly harming the classification accuracy.

6.5 Removing the Unknown

Most of the semi-supervised learning algorithms use less than the full amount of data for training. Under the typical semi-supervised paradigm, each timestep would have only its most confidently


Figure 31. True positive versus true negative accuracy across all timesteps after SSI learning processes self-labeled data.

Figure 32. False positive versus false negative error across all timesteps after SSI learning processes self-labeled data.


Table 22. Smoothed nodal accuracy results for the casing simulation using the SSI algorithm incrementing by multiple timesteps. For each true class, the percentage of correctly and incorrectly classified examples is shown beside the nodal counts. The overall accuracy is also given.

True Class   Predicted Class   SSI (Inc. 1)      SSI (Inc. 2)      All (No SSI)
Unsalient    Unsalient         959818 (99.14%)   954127 (98.56%)   955061 (98.65%)
Unsalient    Salient             8282  (0.86%)    13973  (1.44%)    13039  (1.35%)
Salient      Unsalient            316  (0.40%)      949  (1.21%)     4009  (5.11%)
Salient      Salient            78126 (99.60%)    77493 (98.79%)    74433 (94.89%)
Overall                                99.18%            98.57%            98.37%

predicted data points classified as either salient or unsalient, with the remainder of the data falling into a new unknown class. Examples having an unknown class are excluded from the training set. In a typical semi-supervised approach, data would be recursively reclassified until all or most of the unknown data has been assigned to an existing class. As has been cited previously, most of these algorithms have not been considered in the paradigm of large scale learning. Repeatedly reclassifying data is a time-consuming process, and determining classifier confidence can result in dubious numbers.

In this section a statistical method for determining whether data should be labeled or remain as unknown is considered. The first seven timesteps are used as training, and prediction occurs on the next timestep. The predicted value of each test set example is the real-valued average prediction of each classifier generated from each of five partitions, 200 classifiers per partition. The Expectation-Maximization algorithm is used to model the distribution as that of two Gaussians, one Gaussian describing the bolt class, and the other describing the not-bolt class.

The Expectation-Maximization (EM) Algorithm [119, 120, 121] is a means of incrementally deriving the parameters of an n-component distribution, consisting of two steps: the expectation step and the maximization step. In the expectation step, the parameters of the distribution (or initial guesses of them) are used to calculate the data likelihood. For p being a joint probability distribution with parameters given by Θ, p(y, x | Θ), the conditional distribution can be expressed as:


\[
p(x \mid y, \Theta) = \frac{p(y, x \mid \Theta)}{p(y \mid \Theta)} = \frac{p(y \mid x, \Theta)\, p(x \mid \Theta)}{\int p(y \mid \hat{x}, \Theta)\, p(\hat{x} \mid \Theta)\, d\hat{x}}
\]

using Bayes' Rule and the Law of Total Probability. In the maximization step, the values of Θ_n are re-estimated to produce Θ_{n+1}. The value Θ_{n+1} maximizes the conditional expectation obtained from Θ_n. Denoted as Q, this can be given as:

\[
Q(\Theta \mid \Theta_n) = E_x\left[\log p(y, x \mid \Theta) \mid y\right] = \int_{-\infty}^{\infty} p(x \mid y, \Theta_n)\, \log p(y, x \mid \Theta)\, dx
\]

Incremental refinement of the parameters continues until convergence. In the case where two Gaussian distributions are being estimated, the following equations can be used:

\[
\mu_i = \frac{\sum_{j=1}^{m} P(x_i \mid y_j, \Theta_t)\, y_j}{\sum_{j=1}^{m} P(x_i \mid y_j, \Theta_t)}
\qquad
\Sigma_i = \frac{\sum_{j=1}^{m} P(x_i \mid y_j, \Theta_t)\, (y_j - \mu_i)(y_j - \mu_i)^T}{\sum_{j=1}^{m} P(x_i \mid y_j, \Theta_t)}
\qquad
P(x_i) = \frac{\sum_{j=1}^{m} P(x_i \mid y_j, \Theta_t)}{\sum_{k=1}^{n} \sum_{j=1}^{m} P(x_k \mid y_j, \Theta_t)}
\]

We consider examples with predicted values within 1.65 standard deviations of the mean, the one-tailed value for which 95% of the examples are contained. If the two Gaussian distributions are well separated, examples which lie between μ_unsalient + 1.65σ_unsalient and μ_salient − 1.65σ_salient are considered unknown. Examples with a predicted saliency less than μ_unsalient + 1.65σ_unsalient or greater than μ_salient − 1.65σ_salient are assigned to the unsalient (not bolt) or salient (bolt) class, respectively, as in Figure 33. If the two distributions overlap, values less than μ_salient − 1.65σ_salient are labeled as unsalient, values greater than μ_unsalient + 1.65σ_unsalient are labeled as salient, and the remaining examples are labeled as unknown, as shown in Figure 34. The equation for the assignment of class labels is shown in Equation 31.


Figure 33. An illustration of the classification based on two well-separated Gaussian distributions is shown for predictions on timestep 8.

Figure 34. An illustration of the classification based on two overlapping Gaussian distributions is shown.

\[
\mathrm{Classification}(x) =
\begin{cases}
\text{unsalient}, & \text{if } x \le \min(\mu_u + 1.65\sigma_u,\; \mu_s - 1.65\sigma_s) \\
\text{salient}, & \text{if } x \ge \max(\mu_u + 1.65\sigma_u,\; \mu_s - 1.65\sigma_s) \\
\text{unknown}, & \text{otherwise}
\end{cases}
\tag{31}
\]

where μ_u, σ_u and μ_s, σ_s are the mean and standard deviation of the unsalient and salient distributions, respectively.

The classification accuracy results using this method do not compare well with the SSI algorithm previously discussed. This is due largely to immense skew toward the classification of unsalient examples. Looking only at timestep eight, there are 74,753 examples of which 69,150 belong to the unsalient class. Of these 69,150 examples, 30,305 examples are not classified as bolt by any of the 1000 trees, and 60,407 are classified as salient by 20 or fewer trees. This rapid decline in predicted saliency is not well modeled by a Gaussian distribution, which in this case causes an underestimation of the variance for the calculated mean, as shown in Figures 35 and 36. The EM algorithm, in maximizing the expected values for both distributions, incorrectly incorporates very low predicted saliency values into the salient distribution. This adversely affects the salient distribution by decreasing the mean and increasing the variance.
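The labeling scheme of Equation 31 can be sketched with scikit-learn's GaussianMixture in place of the hand-derived EM updates above; predicted_saliency stands for the per-node mean vote of the trees, and the 1.65 standard-deviation cut is the one-tailed 95% bound from the text. This is an illustration under those assumptions, not the code used in these experiments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def label_with_unknown(predicted_saliency):
    x = np.asarray(predicted_saliency).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    lo, hi = np.argsort(gmm.means_.ravel())   # unsalient, salient components
    mu_u, mu_s = gmm.means_.ravel()[lo], gmm.means_.ravel()[hi]
    sd_u = np.sqrt(gmm.covariances_.ravel()[lo])
    sd_s = np.sqrt(gmm.covariances_.ravel()[hi])
    upper_u, lower_s = mu_u + 1.65 * sd_u, mu_s - 1.65 * sd_s
    labels = np.full(len(x), "unknown", dtype=object)
    labels[x.ravel() <= min(upper_u, lower_s)] = "unsalient"  # Equation 31
    labels[x.ravel() >= max(upper_u, lower_s)] = "salient"
    return labels
```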


Figure 35. An illustration of the histogram of predicted saliency is shown. The y-axis is cut off at 1000 examples in order to show the histogram clearly.

Figure 36. An illustration of the two Gaussian distributions as estimated by the EM algorithm for timestep 8 is shown.


As a result of the overestimation of the variance of the bolt class, many examples whose true class is not bolt are labeled as bolts. These examples are further propagated to the following timesteps, but as ground-truth labeled data. As a result of the incorrectly labeled truth data, a cascading misprediction of unsalient examples ensues. The overall accuracy of this model is 75.26%, which is approximately 24 percentage points lower than the corresponding SSI model on the same data. The final timestep had 22,483 false positives and an accuracy of only 69.92%. This illustrates in alarming detail the benefits and drawbacks of semi-supervised learning, where a certain amount of domain knowledge is necessary for proper classification.

6.6 Discussion

The SSI learning algorithm is most applicable when a scientist wants to track a salient concept through time. This is a much harder problem than the experiments in the previous chapter. Specifically, in the previous odd/even experiments, the decision trees have the advantage of interpolating a salient concept from between timesteps. In the case where saliency must be extrapolated, as in these SSI learning experiments, the target concept is much more difficult to learn.

Tracking a salient concept through time is a more realistic depiction of a scientist's interaction with the program than labeling odd and even timesteps. Likewise, obtaining labeled data from manual scientist interaction with the program is consistently problematic. The SSI learning algorithm allows scientists to mark only a few timesteps at the emergence of a salient concept. We believe this algorithm may also be useful in tracking salient targets backwards in time as well as forwards. The SSI algorithm may also be useful for incrementing through partitions instead of timesteps, such as learning on partitions 1-4, then incorporating the predicted labels of partition 5 into the training set.

A major difference in accuracy can be seen between using SSI learning on the partitioned data versus not using SSI on the partitioned data. The initial images are very similar because the bolts are very similar to the labeled training data. In other words, the SSI learning algorithm is gaining very little new and valuable data with the addition of each early timestep. Later timesteps show a significant improvement for SSI. The SSI learning algorithm eliminates nearly all of the previously false positive predictions for the final timesteps of the simulation.


Using all the data without SSI learning shows much better accuracy results than does PMV without SSI learning, a result we see replicated from the previous chapter. Surprisingly, SSI learning using PMV produces much better results than does using all the data without SSI learning. Using all the data violates our hypothesis about the inability to transfer data between nodes. Thus, this is a very satisfying result. In fact, the raw numbers show that a higher accuracy is obtained using PMV and SSI learning than in any of the experiments of the previous chapter, despite the greater difficulty of the problem.


Chapter 7
Discussion and Future Work

This dissertation considers the problem of learning on large simulations such as those performed by research scientists on supercomputers. Specifically, there are data sets so large that the duplication of data across compute nodes is not feasible. As a result, the data must be partitioned spatially according to requirements of the simulation program. Unfortunately, the computational optimizations needed by the simulator are at odds with the typical assumptions made in existing Machine Learning algorithms. In this dissertation, a method for adapting existing base learning algorithms to the challenging supercomputer simulation environment is considered and shown to be effective. Results show that similar accuracy results can be obtained using spatially partitioned non-i.i.d. data when compared to using the full data. In the case of the casing simulation and the SSI learning algorithm introduced here, it was possible to exceed that accuracy.

Two frameworks for combining classifiers are mathematically derived from the majority voting algorithm. The first is a Probabilistic Majority Voting algorithm (PMV) which takes into account an unfair voting scheme. The second is a weighted version of the first, called Weighted Probabilistic Majority Vote or wPMV. These algorithms were evaluated on two simulations: a canister being crushed by an impactor bar, and a casing being dropped on the ground. Generally speaking, the weighted version is more accurate than the unweighted version. This is mostly a reflection upon the increased capability of wPMV to break ties and account for round-off error in the PMV calculation.

A new method for learning is introduced in the special circumstance when a scientist establishes a few target concepts early on in the simulation and wishes to both track those concepts and discover new variants. Semi-Supervised Incremental (SSI) learning works by predicting the class of new data, deciding either to incorporate such data as a target region or as an anomalous false detection into a new expanded training set. Thus, learning occurs little by little over time, until classifiers are created which are successful at classifying all the data in the simulation. Even PMV, the weaker of


the two voting algorithms, was able to obtain an increase in accuracy over using the unpartitioned data when not using SSI learning.

SSI learning has an analog in stream processing Machine Learning algorithms. In the version this dissertation considers, valuable new data is made available by smoothing predicted classes in 3-D space and automatically establishing a threshold which new data must exceed before being assigned to the target class. By smoothing in feature space rather than 3-D space, it may be possible to better track how concepts change through time. For example, the k-nearest neighbor algorithm could be used to establish both the 3-D distance equivalent for the amount of smoothing to be done and the required confidence for a node to keep its assigned class label.

For a scientist using these algorithms in a simulation, the ability to be directed toward interesting, or salient, regions is paramount. The visualization tools which scientists use show areas of the simulation, not individual nodes. Many of the figures in this dissertation are captured through the use of one of these tools, called ParaView [122]. As scientists are directed toward regions of saliency, it makes more sense to consider regional accuracy rather than nodal accuracy. Tools are currently in development for this purpose. One of these tools connects nearby regions and determines whether this region is contained within enough of the ground-truth region. Determining what constitutes "nearby" regions and what constitutes "enough" of the ground truth is problematic. True positive, false positive, true negative, and false negative regional percentages can be insufficient. For example, a true salient region can be made up of several predicted salient regions when the "nearby" parameter is too low. As a result, many regions of the same general location are shown to a frustrated scientist, meanwhile the true positive accuracy is uncertain.

In connection with which regions to show a scientist, there are issues pertaining to the order in which those regions are displayed. In manual processing of simulations with ParaView, scientists browse to a region of interest and create a lookmark, similar to a bookmark in web browsers, which can immediately return the scientist to that region on a later date [123]. When the classifier discovers a salient region, it notifies the scientists by forwarding the lookmark. The number and ordering of the lookmarks have significant interactions with some human factors. If there are only a few lookmarks, each will probably be looked at. If there are a large number, the capability and


usefulness of the classifiers will likely be judged by only the first few lookmarks on the list before fatigue sets in.

The problem of multiply defined salient regions diminishes if nearby regions are simply moved to the bottom of the list of lookmarks. The lack of quantifiability in comparing different algorithms remains, and now is exacerbated by the subjective nature of what qualifies as an appropriate ordering. Google has addressed the very similar issue of ordering hypertext documents in the PageRank algorithm [124]. PageRank uses a recursive probability distribution to assign numerical values based on the number of links to and from a web page given contextual information in the search terms. Unfortunately, large scale simulations may have only a few scientists examining the results, and each may be interested in very different regions of saliency.

It should also be noted that search engine algorithms do not address multiply defined true positive regions as in our previous example. A search on Google for "cars" (Figure 37) brings up many new and used car pricing services and information about the recent (2006) movie Cars. In reality, there are many different subtopics, analogous to regions, which fall under the general term of "cars". Instead of displaying a wide variety, the same popular subtopics are shown repeatedly. This is precisely the problem which needs to be avoided in ordering the lookmarks.

7.1 Contributions

In summary, the contributions of this dissertation are as follows. Extensive experimentation was performed on 57 publicly available data sets and five different learning algorithms and several variations [8, 9, 10]. The results showed that while for any individual data set the number of statistically significant differences may be small, a statistical conclusion about the general tendencies of the algorithms can be obtained. Namely, boosting and random forests with greater than one attribute are more likely to result in higher accuracy. Since random forests is the fastest of these algorithms, this is a particularly pleasing result. A method using random forests was then introduced for determining when enough classifiers have been added to an ensemble. Random forests are then applied to the previously unconsidered problem of learning on large complex simulations where the amount of data is great and cannot be transferred to other processors. Data from these simulations is distributed


Figure 37. An internet search on Google for "cars" returns very few topics pertaining to the automobile.


in such a way that it is generally not, and cannot be made, independently and identically distributed, one of the basic assumptions in most Machine Learning and Statistics models [10]. Finally, a way to further increase the accuracy of these algorithms is considered which takes advantage of the semi-supervised learning paradigm. Experimentation within the large simulation environment showed that better accuracy results can be obtained using this method than using the entire amount of unpartitioned data.

The Machine Learning field is rife with difficult challenges. An enormous amount of data is generated with each passing day. Cheaper storage, faster processing and network speeds, and the vast expansion of electronic data recording into nearly every aspect of life will only make these challenges harder. This dissertation offers naught but a glimpse into the kind of data-driven challenges present today, yet with implications long into tomorrow.


References

[1] Frequently asked questions. Internet Archive. [Online]. Available: http://www.archive.org/about/faqs.php.
[2] R. Christensen. Computer simulations in support of national security. Lawrence Livermore National Laboratories. [Online]. Available: http://www.llnl.gov/str/Christensen.html.
[3] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, "PVFS: A parallel file system for Linux clusters," in 4th Annual Linux Showcase and Conference, 2000.
[4] H. Ramachandran, "Design and implementation of the system interface for PVFS2," Master's thesis, Clemson University, 2002.
[5] F. Petrini, W.-C. Feng, A. Hoisie, S. Coll, and E. Frachtenberg, "The Quadrics network: High-performance clustering technology," IEEE Micro, vol. 22, pp. 46, 2002.
[6] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su, "Myrinet: A gigabit-per-second local area network," IEEE Micro, vol. 15, pp. 29, 1995.
[7] J. Tomkins, "Capability machines: Red Storm," in 10th Workshop on Distributed Supercomputing, 2006.
[8] R. E. Banfield, L. O. Hall, K. W. Bowyer, D. Bhadoria, W. P. Kegelmeyer, and S. Eschrich, "A comparison of ensemble creation techniques," in The Fifth International Conference on Multiple Classifier Systems, 2004, pp. 223.
[9] L. Hall, K. Bowyer, R. Banfield, D. Bhadoria, W. Kegelmeyer, and S. Eschrich, "Comparing pure parallel ensemble creation techniques against bagging," in The Third IEEE International Conference on Data Mining, 2003, pp. 533.
[10] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, "A comparison of decision tree ensemble creation techniques," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 173, 2007.
[11] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, "Ensembles of classifiers from spatially disjoint data," in Sixth International Conference on Multiple Classifier Systems, 2005, pp. 196.
[12] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81, 1986.
[13] N. H. Bshouty, "Simple learning algorithms using divide and conquer," Computational Complexity, vol. 6:2, pp. 174, 1996.


[14] C. A. R. Hoare, "Quicksort," Computer Journal, vol. 5, no. 1, pp. 10, 1962.
[15] D. Knuth, The Art of Computer Programming, 3rd ed. Addison Wesley, 1997, vol. 3, ch. Sorting by Merging, pp. 158.
[16] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5, 2001.
[17] T. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139, 2000.
[18] T. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832, 1998.
[19] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Machine Learning: Proceedings of the Thirteenth National Conference, 1996, pp. 148.
[20] S. Gorlatch, "Programming with divide-and-conquer skeletons: An application to FFT," Journal of Supercomputing, vol. 12, no. 1-2, pp. 85, 1998.
[21] S. Eschrich, N. V. Chawla, and L. O. Hall, "Generalization methods in bioinformatics," in 2nd Workshop on Data Mining in Bioinformatics (KDD02), 2002, pp. 25.
[22] S. Eschrich, N. V. Chawla, and L. O. Hall, "Scalable learning methods for bioinformatics," Journal of System Simulation, vol. 14, pp. 1464, 2002.
[23] A. Narayanan, E. C. Keedwell, and B. Olsson, "Artificial intelligence techniques for bioinformatics," Applied Bioinformatics, vol. 1, pp. 191, 2002.
[24] S. M. Larson, C. D. Snow, M. Shirts, and V. S. Pande, "Folding@home and genome@home: Using distributed computing to tackle previously intractable problems in computational biology," in Computational Genomics, 2002, pp. 645.
[25] H. W. Mewes, D. Frischman, U. Güldener, G. Mannhaupt, K. Mayer, M. Mokrejs, B. Morgenstern, M. Münsterkötter, S. Rudd, and B. Weil, "MIPS: A database for genomes and protein sequences," Nucleic Acids Research, vol. 30, pp. 31, 2000.
[26] A. Bairoch and R. Apweiler, "The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999," Nucleic Acids Research, vol. 27, pp. 49, 1999.
[27] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, "The Protein Data Bank," Nucleic Acids Research, vol. 28, pp. 235, 2000.
[28] Critical assessment of techniques for protein structure predictions. Genome Center, University of California, Davis. [Online]. Available: http://predictioncenter.gc.ucdavis.edu/.
[29] R. O. V. Santos, M. M. B. R. Vellasco, F. A. V. Artola, and S. A. B. da Fontoura, "Neural net ensembles for lithology recognition," in 4th International Workshop on Multiple Classifier Systems, 2003, pp. 246.


[30] E. Jaser, J. Kittler, and W. Christmas, "Building classifier ensembles for automatic sports classification," in 4th International Workshop on Multiple Classifier Systems, 2003, pp. 366-374.
[31] N. C. Oza, K. Tumer, I. Y. Tumer, and E. M. Huff, "Classification of aircraft maneuvers for fault detection," in 4th International Workshop on Multiple Classifier Systems, 2003, pp. 375.
[32] A. J. C. Sharkey, Combining Artificial Neural Nets: Ensembles and Modular Multi-net Systems, A. J. C. Sharkey, Ed. Springer-Verlag, 1998.
[33] M. I. Jordan and R. A. Jacobs, "Hierarchical mixtures of experts and the EM algorithm," Neural Computation, vol. 6, pp. 181, 1994.
[34] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, pp. 79, 1991.
[35] P. A. Estevez and R. Nakano, "Hierarchical mixture of experts and max-min propagation neural networks," in IEEE International Conference on Neural Networks, 1995, pp. 651-656.
[36] V. Ramamurti and J. Ghosh, "Advances in using hierarchical mixture of experts for signal classification," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 3569.
[37] S. Eschrich, "Learning from less: a distributed method for machine learning," Ph.D. dissertation, University of South Florida, 2003.
[38] T. Dietterich, "Ensemble methods in machine learning," in 1st International Workshop on Multiple Classifier Systems, 2000, pp. 1.
[39] G. Brown and J. L. Wyatt, "Negative correlation learning and the ambiguity family of ensemble learning methods," in Multiple Classifier Systems, 2003, pp. 266.
[40] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123, 1996.
[41] N. V. Chawla, L. O. Hall, K. W. Bowyer, T. E. Moore, Jr., and W. P. Kegelmeyer, "Distributed pasting of small votes," in Multiple Classifier Systems Conference, Cagliari, Italy, June 2002, pp. 52.
[42] Y. Freund, "Boosting a weak learning algorithm by majority," in Computational Learning Theory, 1990, pp. 202.
[43] L. I. Kuncheva and C. J. Whitaker, "Using diversity with three variants of boosting: aggressive, conservative, and inverse," in 3rd International Workshop on Multiple Classifier Systems, 2002, pp. 81.
[44] N. V. Chawla, S. Eschrich, and L. O. Hall, "Creating ensembles of classifiers," University of South Florida, Department of Computer Science and Engineering, ISL-01-01, Tech. Rep., 2001.


[45] B. Parmanto, P. W. Munro, and H. R. Doyle, "Improving committee diagnosis with resampling techniques," Advances in Neural Information Processing Systems, vol. 8, pp. 882, 1996.
[46] D. E. Hocevar, M. R. Lightner, and T. N. Trick, "A study of variance reduction techniques for estimating circuit yields," IEEE Transactions on Computer-Aided Design, vol. CAD-2, pp. 180, 1983.
[47] C. Ding, C. Hsieh, Q. Wu, and M. Pedram, "Stratified random sampling for power estimation," in International Conference on Computer Aided Design, 1996, pp. 576.
[48] T. M. Mitchell, Machine Learning, E. M. Munson, Ed. McGraw Hill, 1996.
[49] J. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992, San Mateo, CA.
[50] L. Breiman, J. Friedman, R. Olshen, and P. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth International Group, 1984.
[51] T. Dietterich and E. B. Kong, "Machine learning bias, statistical bias, and statistical variance of decision tree algorithms," Department of Computer Science, Oregon State University, Tech. Rep., 1995.
[52] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley and Sons, Inc., 2001.
[53] D. H. Wolpert and W. G. Macready, "No free lunch theorems for optimization," IEEE Transactions on Evolutionary Computation, vol. 1, pp. 67, 1997.
[54] Y. C. Ho and D. L. Pepyne, "Simple explanation of the no free lunch theorem of optimization," Cybernetics and Systems Analysis, vol. 38, pp. 292, 2002.
[55] R. Banfield. OpenDT. [Online]. Available: http://opendt.sourceforge.net.
[56] I. H. Witten and E. Frank, Data Mining: Practical machine learning tools with Java implementations. San Francisco: Morgan Kaufmann, 1999.
[57] G. Eibl and K. P. Pfeiffer, "How to make AdaBoost.M1 work for weak base classifiers by changing only one line of the code," in Proceedings of the Thirteenth European Conference on Machine Learning, 2002, pp. 72.
[58] E. Bauer and R. Kohavi, "An empirical comparison of voting classification algorithms: Bagging, boosting, and variants," Machine Learning, vol. 36, no. 1, 2, pp. 105, 1999.
[59] L. Breiman, "Arcing classifiers," Annals of Statistics, vol. 26, no. 3, pp. 801, 1998.
[60] Y. Freund and R. E. Schapire, "Discussion of the paper 'Arcing classifiers' by Leo Breiman," Annals of Statistics, vol. 26, no. 2, pp. 824, 1998.
[61] L. Breiman, "Rejoinder to the paper 'Arcing classifiers' by Leo Breiman," Annals of Statistics, vol. 26, no. 2, pp. 841, 1998.


[62] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, "A new ensemble diversity measure applied to thinning ensembles," in Fifth International Workshop on Multiple Classifier Systems, 2003, pp. 306.
[63] C. Merz and P. Murphy, UCI Repository of Machine Learning Databases, Univ. of CA, Dept. of CIS, Irvine, CA, http://www.ics.uci.edu/mlearn/MLRepository.html.
[64] P. Brazdil. NIADD. [Online]. Available: http://www.niaad.liacc.up.pt/old/statlog/datasets.html.
[65] J. A. Lee. ELENA project. [Online]. Available: http://www.dice.ucl.ac.be/neural-nets/Research/Projects/ELENA/elena.htm.
[66] Delve datasets. Department of Computer Science, University of Toronto. [Online]. Available: http://www.cs.utoronto.ca/delve/data/datasets.html.
[67] T. G. Dietterich, "Approximate statistical test for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895, 1998. [Online]. Available: citeseer.ist.psu.edu/dietterich98approximate.html.
[68] E. Alpaydin, "Combined 5x2cv F test for comparing supervised classification learning algorithms," Neural Computation, vol. 11, no. 8, pp. 1885, 1999.
[69] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: a new explanation for the effectiveness of voting methods," in Proc. 14th International Conference on Machine Learning. Morgan Kaufmann, 1997, pp. 322.
[70] R. Johnson and D. Wichern, Applied Multivariate Statistical Analysis, 3rd ed. Prentice-Hall, 1992.
[71] J. P. Shaffer, "Multiple hypothesis testing," Annual Review of Psychology, vol. 46, pp. 561, 1995.
[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1, 2006.
[73] N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, "Learning ensembles from bites: A scalable and accurate approach," Journal of Machine Learning Research, vol. 5, pp. 421, 2004.
[74] A. Lazarevic and Z. Obradovic, "The distributed boosting algorithm," in Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 311.
[75] A. Lazarevic and Z. Obradovic, "Boosting algorithms for parallel and distributed learning," Distributed and Parallel Databases, Special Issue on Parallel and Distributed Data Mining, vol. 11:2, pp. 203, 2002.
[76] B. J. Taylor, M. A. Darrah, and C. D. Moats, "Verification and validation of neural networks: A sampling of research in progress," Intelligent Computing: Theory and Applications, vol. 5103, pp. 8, 2003.


[77] D. M. Skapura and P. S. Gordon, Building Neural Networks. Addison-Wesley, 1996.
[78] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: A synthetic minority oversampling technique," in Knowledge Based Computer Systems, 2000, pp. 46.
[79] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321, 2002.
[80] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: One sided selection," in Fourteenth International Conference on Machine Learning, 1997, pp. 179.
[81] N. Japkowicz, "The class imbalance problem: Significance and strategies," in 2000 International Conference on Artificial Intelligence: Special Track on Inductive Learning, 2000, pp. 111.
[82] M. Pazzani, C. Merz, P. Murphy, K. Ali, T. Hume, and C. Brunk, "Reducing misclassification costs," in Eleventh International Conference on Machine Learning, 1994, pp. 217.
[83] P. Domingos, "MetaCost: A general method for making classifiers cost-sensitive," in Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 155.
[84] B. S. Lee, R. R. Snapp, and R. Musick, "Toward a query language on simulation mesh data: an object oriented approach," in International Conference on Database Systems for Advanced Applications, 2001, pp. 242.
[85] G. James, "Majority vote classifiers: Theory and applications," Ph.D. dissertation, Stanford University, 1998.
[86] L. Lam, "Theory and application of majority vote - from Condorcet Jury Theorem to pattern recognition," in Conference on Mathematics for Living, 2000, pp. 177.
[87] N. Miller, Information Pooling and Group Decision Making. JAI Press, 1986, ch. Information, Electorates, and Democracy: Some Extensions and Interpretations of the Condorcet Jury Theorem, pp. 173.
[88] H. P. Young, "Condorcet's theory of voting," American Political Science Review, vol. 82, pp. 1231, 1988.
[89] W. N. Street and Y. Kim, "A streaming ensemble algorithm (SEA) for large-scale classification," in Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 377.
[90] R. Polikar, J. Byorick, S. Krause, A. Marino, and M. Moreton, "Learn++: a classifier independent incremental learning algorithm for supervised neural networks," in 2002 International Joint Conference on Neural Networks, 2002, pp. 1742.
[91] M. Lewitt and R. Polikar, "An ensemble approach for data fusion with Learn++," in Fourth International Workshop on Multiple Classifier Systems, 2003, pp. 176.


[92] J. Z. Kolter and M. A. Maloof, "Dynamic weighted majority: a new ensemble method for tracking concept drift," in Third IEEE International Conference on Data Mining, 2003, pp. 123.
[93] J. Z. Kolter and M. A. Maloof, "Using additive expert ensembles to cope with concept drift," in Proceedings of the Twenty-second International Conference on Machine Learning, 2005, pp. 449.
[94] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining concept-drifting data streams using ensemble classifiers," in Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 226.
[95] K. O. Stanley, "Learning concept drift with a committee of decision trees," Department of Computer Sciences, University of Texas at Austin, Tech. Rep., 2003.
[96] N. Otsu, "A threshold selection method from gray level histograms," IEEE Trans. Systems, Man and Cybernetics, vol. 9, pp. 62, 1979.
[97] P.-S. Liao, T.-S. Chen, and P.-C. Chung, "A fast algorithm for multilevel thresholding," Journal of Information Science and Engineering, vol. 17, pp. 713, 2001.
[98] W. Tsai, "Moment-preserving thresholding: A new approach," Computer Vision and Graphic Image Processing, vol. 29:3, pp. 377, 1985.
[99] O. D. Trier and T. Taxt, "Evaluation of binarization methods for document images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, pp. 312, 1995.
[100] X. Zhu, "Semi-supervised learning literature survey," Computer Sciences, University of Wisconsin-Madison, Tech. Rep., 2005.
[101] B. Merialdo, "Tagging English text with a probabilistic model," Computational Linguistics, vol. 20:2, pp. 155, 1994.
[102] D. Elworthy, "Does Baum-Welch re-estimation help taggers?" in Fourth ACL Conference on Applied Natural Language Processing, 1994, pp. 53.
[103] V. Castelli and T. Cover, "The exponential value of labeled samples," Pattern Recognition Letters, vol. 16, pp. 105, 1995.
[104] V. Castelli and T. Cover, "The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter," IEEE Transactions on Information Theory, vol. 42, pp. 2101, 1996.
[105] J. Ratsaby and S. Venkatesh, "Learning from a mixture of labeled and unlabeled examples with parametric side information," in Eighth Annual Conference on Computational Learning Theory, 1995, pp. 412.
[106] F. Cozman, I. Cohen, and M. Cirelo, "Semi-supervised learning of mixture models and Bayesian networks," in 20th International Conference on Machine Learning, 2003.


[107] K. P. Nigam, "Using unlabeled data to improve text classification," Ph.D. dissertation, Carnegie Mellon University, 2001.
[108] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in COLT: Proceedings of the Workshop on Computational Learning Theory, 1998, pp. 92.
[109] T. Mitchell, "The role of unlabeled data in supervised learning," in Sixth International Colloquium on Cognitive Science, 1999.
[110] K. Nigam and R. Ghani, "Analyzing the effectiveness and applicability of co-training," in Ninth International Conference on Information and Knowledge Management, 2000, pp. 86-93.
[111] S. Goldman and Y. Zhou, "Enhancing supervised learning with unlabeled data," in 17th International Conf. on Machine Learning, 2000, pp. 327.
[112] R. Kohavi, "Bottom-up induction of oblivious, read-once decision graphs: strengths and limitations," in National Conference on Artificial Intelligence, 1994, pp. 613.
[113] Y. Zhou and S. Goldman, "Democratic co-learning," in 16th IEEE International Conference on Tools with Artificial Intelligence, 2004, pp. 594.
[114] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in 33rd Annual Meeting of the Association for Computational Linguistics, 1995, pp. 189.
[115] W. Gale, K. Church, and D. Yarowsky, "A method for disambiguating word senses in a large corpus," Computers and the Humanities, vol. 26, pp. 415, 1992.
[116] C. Rosenberg, M. Hebert, and H. Schneiderman, "Semi-supervised self-training of object detection models," in Seventh IEEE Workshop on Applications of Computer Vision, 2005, pp. 29.
[117] G. J. Nutt, Operating Systems: A Modern Perspective. Addison-Wesley, 2000.
[118] A. Silberschatz, P. B. Galvin, and G. Gagne, Operating System Concepts. John Wiley and Sons, Inc., 2002.
[119] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1, 1977.
[120] J. Bilmes, "A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," University of Berkeley, Tech. Rep. ICSI-TR-97-021, 1997.
[121] F. Dellaert, "The expectation maximization algorithm," Georgia Institute of Technology, Tech. Rep. GIT-GVU-02-20, 2002.
[122] J. Ahrens, B. Geveci, and C. Law, "ParaView: An end-user tool for large-data visualization," in The Visualization Handbook, C. Hansen and C. Johnson, Eds. Academic Press, 2005.


[123] E. T. Stanton and W. P. Kegelmeyer, "Creating and managing lookmarks in ParaView," in Proceedings of the IEEE Symposium on Information Visualization, 2004, pp. 215.
[124] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Computer Networks and ISDN Systems, vol. 30, pp. 107, 1998.


About the Author

Robert Banfield obtained his Bachelor's Degree in Computer Science in 2001 and his Master's Degree in Computer Science in 2003 from the University of South Florida. For six years he occupied the position of Research Assistant for Dr. Lawrence Hall under a grant from Sandia National Laboratories, and worked in conjunction with Lawrence Livermore National Laboratories on the Advanced Strategic Computing supercomputers. He has presented papers on his research both domestically and internationally. In 2003 he was awarded the Verizon fellowship. He is also a Linux Network Administrator and High Performance Computing architect. He is a certified SCUBA diver and skydiver, and president of the university skydiving club, the South Florida Skydiving Bulls.