Ensembles for distributed data

Material Information

Title:
Ensembles for distributed data
Physical Description:
Book
Language:
English
Creator:
Shoemaker, Larry
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla.
Publication Date:

Subjects

Subjects / Keywords:
Random forests
Nearest centroid
Exodus
ParaView
Region labeling
Dissertations, Academic -- Computer Science -- Masters -- USF   ( lcsh )
Genre:
government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Summary:
ABSTRACT: Many simulation data sets are so massive that they must be distributed among disk farms attached to different computing nodes. The data is partitioned into spatially disjoint sets that are not easily transferable among nodes due to bandwidth limitations. Conventional machine learning methods are not designed for this type of data distribution. Experts mark a training data set with different levels of saliency emphasizing speed rather than accuracy due to the size of the task. The challenge is to develop machine learning methods that learn how the expert has marked the training data so that similar test data sets can be marked more efficiently. Ensembles of machine learning classifiers are typically more accurate than individual classifiers. An ensemble of machine learning classifiers requires substantially less memory than the corresponding partition of the data set. This allows the transfer of ensembles among partitions. If all the ensembles are sent to each partition, they can vote for a level of saliency for each example in the partition. Different partitions of the data set may not have any salient points, especially if the data set has a time step dimension. This means the learned classifier for such partitions can not vote for saliency since they have not been trained to recognize it. In this work, we investigate the performance of different ensembles of classifiers on spatially partitioned data sets. Success is measured by the correct recognition of unknown and salient regions of data points.
Thesis:
Thesis (M.S.C.S.)--University of South Florida, 2005.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Larry Shoemaker.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 82 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001709560
oclc - 69371719
usfldc doi - E14-SFE0001296
usfldc handle - e14.1296
System ID:
SFS0025617:00001


Full Text

PAGE 1

Ensembles for Distributed Data
by
Larry Shoemaker

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
Department of Computer Science & Engineering
College of Engineering
University of South Florida

Major Professor: Lawrence O. Hall, Ph.D.
Dmitry Goldgof, Ph.D.
Sudeep Sarkar, Ph.D.

Date of Approval: October 21, 2005

Keywords: random forests, nearest centroid, Exodus, ParaView, region labeling

Copyright 2005, Larry Shoemaker

PAGE 2

ACKNOWLEDGMENTS

I would especially like to thank my major professor, Dr. Lawrence O. Hall, for his enthusiastic and expert guidance, support, patience, and feedback during my research efforts. I would also like to thank Dr. Kevin W. Bowyer for his valuable feedback. I also appreciate the valuable ideas and tools offered by W. Philip Kegelmeyer. I am indebted to Dr. Dmitry Goldgof and to Dr. Sudeep Sarkar for generously offering to serve on my committee. I would like to thank Divya Bhadoria for access to her research data and tools. I am also very fortunate to have received the expert assistance and guidance of Robert Banfield.

This research was partially supported by the United States Department of Energy through Sandia National Laboratories ASCI Views Data Discovery Program, Contract number: DE-AC04-76DO00789 and the National Science Foundation under grant EIA0130768. Images from the FERET database of facial images collected under the FERET program were used.

PAGE 3

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
CHAPTER 2 DATA DESCRIPTION
  2.1 Face Data
  2.2 Can Data
CHAPTER 3 CLASSIFIERS AND ENSEMBLES
  3.1 Decision Tree
  3.2 Random Forests
  3.3 K-Nearest Neighbor
  3.4 K-Nearest Centroid
CHAPTER 4 FACE EXPERIMENTS AND RESULTS
  4.1 Face Baseline Experiments
  4.2 Face Partitioning
  4.3 Face Ensemble Experiments and Results
    4.3.1 Four Row by Two Column
    4.3.2 Vertical Partitions
    4.3.3 Horizontal Partitions
    4.3.4 Diagonal Partitions
CHAPTER 5 CAN EXPERIMENTS AND RESULTS
  5.1 Can Partitioning
  5.2 Out of Partition Can Experiments
    5.2.1 Vertical Partition Experiments
      5.2.1.1 Vertical Decision Tree
      5.2.1.2 Vertical Random Forests
      5.2.1.3 Vertical KNC
      5.2.1.4 Vertical KNN
      5.2.1.5 Vertical Comparison

PAGE 4

    5.2.2 Horizontal Partition Experiments
      5.2.2.1 Horizontal Decision Tree
      5.2.2.2 Horizontal Random Forests
      5.2.2.3 Horizontal KNC
      5.2.2.4 Horizontal KNN
      5.2.2.5 Horizontal Comparison
    5.2.3 Regional Experiments
CHAPTER 6 SUMMARY AND FUTURE WORK
  6.1 Summary
  6.2 Future Work
REFERENCES

PAGE 5

LIST OF TABLES

Table 5.1 Can Horizontal Centroids

PAGE 6

LIST OF FIGURES

Figure 2.1 Images after Preprocessing
Figure 2.2 Training Images with Classes Marked
Figure 2.3 Can Crush
Figure 4.1 Baseline Experiments
Figure 4.2 Ground Truth in Face Training Partitions
Figure 4.3 Face Four Row by Two Column Results
Figure 4.4 Face Vertical Partition Results
Figure 4.5 Face Horizontal Partition Results
Figure 4.6 Face Diagonal Partition Results
Figure 5.1 Can Partitions
Figure 5.2 Can Decision Tree Accuracies for Vertical Partitions
Figure 5.3 Can Random Forests Accuracies for Vertical Partitions
Figure 5.4 Can Vertical, Ground Truth vs. Random Forests Maps
Figure 5.5 Can Random Forests Accuracies for Unknown Weighted Thresholds (vertical)
Figure 5.6 Can ROC Curve of Weighted Random Forests Accuracies (vertical)
Figure 5.7 Can KNC Accuracies for Vertical Partitions
Figure 5.8 Can Vertical Ground Truth vs. KNC Maps
Figure 5.9 Can KNN Accuracies for Vertical Partitions
Figure 5.10 Can Out of Partition Accuracies for Vertical Partitions
Figure 5.11 Can Accuracies by Type for Vertical Partitions

PAGE 7

Figure 5.12 Can Decision Tree Accuracies for Horizontal Partitions
Figure 5.13 Can Random Forests Accuracies for Horizontal Partitions
Figure 5.14 Can Horizontal, Ground Truth vs. Random Forests Maps
Figure 5.15 Can Random Forests Accuracies for Unknown Weighted Thresholds (horizontal)
Figure 5.16 Can ROC Curve of Weighted Random Forests Accuracies (horizontal)
Figure 5.17 Can ROC Curves of Weighted Random Forests Accuracies
Figure 5.18 Can KNC Accuracies for Horizontal Accuracies
Figure 5.19 Can Horizontal, Ground Truth vs. KNC Maps
Figure 5.20 Can KNN Accuracies for Horizontal Partitions
Figure 5.21 Can Out of Partition Accuracies for Horizontal Partitions
Figure 5.22 Can Accuracies by Type for Horizontal Partitions
Figure 5.23 Nodal Accuracy for Regionalized Ground Truth Models
Figure 5.24 Can ROC Curve for Regionalized Ground Truth Models
Figure 5.25 Nodal Accuracy for Regionalized Random Forest Models
Figure 5.26 Can ROC Curve for Regionalized Random Forest Models
Figure 5.27 Nodal Accuracies for Regionalized KNC Models
Figure 5.28 Can ROC Curve for Regionalized KNC Models
Figure 5.29 Can Horizontal, Ground Truth vs. Random Forests Regional Maps
Figure 5.30 Can Horizontal, Ground Truth vs. KNC Regional Maps
Figure 5.31 Can Ground Truth Maps of Shells 4 (inside), to 1 (outside), Time 0 to 21
Figure 5.32 Can Ground Truth Maps of Shells 4 (inside), to 1 (outside), Time 22 to 43
Figure 5.33 Can Vertical Inside Shell, Ground Truth vs. KNC Maps
Figure 5.34 Can Vertical Shell 3 Ground Truth vs. KNC Maps
Figure 5.35 Can Vertical Inside Shell, Ground Truth vs. Random Forests Maps

PAGE 8

Figure 5.36 Can Vertical Shell 3, Ground Truth vs. RF Maps

PAGE 9

ENSEMBLES FOR DISTRIBUTED DATA

Larry Shoemaker

ABSTRACT

Many simulation data sets are so massive that they must be distributed among disk farms attached to different computing nodes. The data is partitioned into spatially disjoint sets that are not easily transferable among nodes due to bandwidth limitations. Conventional machine learning methods are not designed for this type of data distribution. Experts mark a training data set with different levels of saliency emphasizing speed rather than accuracy due to the size of the task. The challenge is to develop machine learning methods that learn how the expert has marked the training data so that similar test data sets can be marked more efficiently.

Ensembles of machine learning classifiers are typically more accurate than individual classifiers. An ensemble of machine learning classifiers requires substantially less memory than the corresponding partition of the data set. This allows the transfer of ensembles among partitions. If all the ensembles are sent to each partition, they can vote for a level of saliency for each example in the partition. Different partitions of the data set may not have any salient points, especially if the data set has a time step dimension. This means the learned classifier for such partitions can not vote for saliency since they have not been trained to recognize it.

PAGE 10

In this work, we investigate the performance of different ensembles of classifiers on spatially partitioned data sets. Success is measured by the correct recognition of unknown and salient regions of data points.

PAGE 11

CHAPTER 1 INTRODUCTION

Computer simulations can have datasets that are too massive to store in just one computer node [1] [2]. Such datasets must be stored in many computer nodes, each having a spatially disjoint partition of the data. The DOE's ASC program [3] is one example [1]. When experts visually examine the simulation data, they search for interesting events that are typically uncommon. They may also need to examine the data for debugging purposes. Machine learning classifiers offer a time and labor saving alternative to an unassisted visual search. The way the data is divided may not necessarily be conducive to ordinary machine learning methods. Many partitions and time steps may not have any interesting events for classifiers to learn [1] [2]. This unbalanced data, where classes are unequally represented, presents a problem for classifiers or ensembles of classifiers.

This thesis explores different methods of machine learning with two different types of spatially disjoint sets of data. For this study, these datasets are necessarily much smaller than the gigantic terabyte-sized datasets we are truly interested in, and we may consider them small-scale models of these. One dataset has two spatial dimensions, while the other has three spatial dimensions and a time dimension.

The first dataset studied is of face images from the FERET database [4] [5]. There are two different images of each of five people. Five images are used for training, and the

PAGE 12

other five are used for testing. The interesting and somewhat interesting facial features were marked on each image. We established a baseline by separately training a single decision tree, a random forest, and a k-nearest neighbor classifier on all of the training data, and then testing on each test image. Next, we partitioned each training image into eight partitions using four different partitioning schemes. This produced many partitions in several of the schemes that exhibit an imbalance of classes. Then we trained separately on each of the 40 partitions to produce random forests ensembles, and a k-nearest centroid classifier. Finally, we used each set of these ensembles of classifiers to test each test image, and combined the votes for a final classification by each set.

For our second study, we used the data from a simulation of a plate crushing a can during 44 time steps. Salient regions of the can were marked for each time step. We then partitioned the plate and the can into four vertical and five horizontal partitions for training with the four classifiers that we used for the face study. We tested each partition with classifiers or ensembles using each of the three remaining partitions. This allowed us to test all of the data without training directly on the same data. Again we combined the votes from the ensemble of classifiers for the final classification.

Chapter 2 describes the data and how it was separated into training and test sets. Chapter 3 presents some theory of the various classifiers and ensembles used in this thesis. Chapter 4 describes face experiments and results, while Chapter 5 provides can experiments and results. Chapter 6 presents a summary, and future directions for research.

PAGE 13

CHAPTER 2 DATA DESCRIPTION

Two different types of data are used for this study. The first consists of data for two different frontal images of each of five people. The second type of data is for a four-dimensional can crush simulation with three spatial dimensions and one time dimension.

2.1 Face Data

For our first series of experiments we tried to find regions in face images. In her Master's thesis, Divya Bhadoria performed experiments [6] on images from the Facial Recognition Technology (FERET) database [5]. We used the same images she used in her experiments in order to extend the research with different classifiers and partitions. In her work, she randomly selected five people and chose two frontal images of each person that differed in such features as facial expression, illumination, and hair-style. For each person, one image was used for training and one image for testing, as shown in Figure 2.1. She pre-processed each eight-bit gray scale image by normalizing the intensity, aligning the eyes at fixed pixel coordinates, and using an elliptical mask to remove everything except the face [6]. Image manipulation software [7] was used to manually mark eyes and mouths red for the interesting (I) class, and eyebrows green for the somewhat interesting (SI) class. The remaining face regions were left unmarked for the not interesting (NI) class. The training and test images before and after marking are shown in Figure 2.2.

PAGE 14

Figure 2.1 Images after Preprocessing (columns: image number, training set, test set; rows: images 1 to 5)

PAGE 15

Figure 2.2 Training Images with Classes Marked (red = I, green = SI) (columns: image number, training set, train ground truth, test set, test ground truth; rows: images 1 to 5)

PAGE 16

Bhadoria created data sets in USFC4.5 format [8] with seven features for each image pixel [6]. This feature vector for each pixel consists of:

- Intensity value (0 to 255) of the pixel
- Maximum intensity over a 5x5 pixel neighborhood of the pixel
- Minimum intensity over a 5x5 pixel neighborhood of the pixel
- Intensity range equal to the difference between maximum and minimum intensity over a 5x5 pixel neighborhood of the pixel
- Arithmetic mean of the intensities over a 5x5 pixel neighborhood of the pixel
- Standard deviation of the intensities over a 5x5 pixel neighborhood of the pixel
- Class assigned to the pixel (NI = 0, I = 1, SI = 2)

For pixels located less than two pixels from one of the external image borders, some of the neighborhood pixels would lie outside the image. For such pixels, the features of the corresponding pixels inside the image that lie the same normal distance from the border(s) are used. Each image after preprocessing consists of 150 rows of 130 horizontal pixels for a total of 19,500 pixels.

2.2 Can Data

The can crush data is simulation mesh data that represents a rectangular plate crushing half of an open cylindrical can. W. Philip Kegelmeyer provided this data set in Exodus format [9]. The 16 in. x 8 in. x 2.5 in. plate impacts the can from above at an angle of 10 degrees with an initial velocity of 5000 in/s. The can has an inside radius of 5 in., a height of 15 in., and a thickness of 0.2 in. A displacement barrier prevents the can from moving below its initial bottom position.

The can data is stored in a four-dimensional space that consists of variables x, y, z, and a time step. The plate has 3,364 nodes (four layers of 29 x 29 nodes) and the can

PAGE 17

has 6,724 nodes (four shells of 41 x 41 nodes). For each of the 44 time steps, the displacement, velocity, and acceleration in the x, y, and z direction for each node are stored as field variables. In addition, an eqps (equivalent plastic strain) field variable is stored for each element (a hexahedral volume consisting of eight nodes, one at each corner). This variable represents the stress or strain on the can [10].

We employed an additional node variable to represent saliency, which the user can assign during a ParaView training session. ParaView is an open-source, multiplatform visualization application [11] [12]. W. Philip Kegelmeyer provided a special Linux version of ParaView that includes an Exodus reader for this data format. The user can view the can data set at selectable angles, scales, and time steps. The user can also mark unknown or salient node(s) by using point and/or box picking tools. The default class for unmarked nodes is unknown. Initially, we used an earlier visualization application (VEP), which lacks many of the more useful ParaView tools, to view and mark the can data set. Figure 2.3 shows the can before, during, and after the plate crushes it.

Figure 2.3 Can Crush: (a) Time Step 0, (b) Time Step 21, (c) Time Step 43

PAGE 18

CHAPTER 3 CLASSIFIERS AND ENSEMBLES

Some machine learning algorithms process training data to produce general classifier models that aim to correctly classify as many unobserved test examples as possible. Such induction-based learning methods produce classifiers such as decision trees and artificial neural networks. Other instance-based learning classifiers wait until a new test example is presented to process the training data. These include k-nearest neighbors, radial basis functions, and case-based reasoning [13]. Different ensemble methods, such as boosting and bagging, generate a group of classifiers that vote to classify each test example. This chapter presents an overview of the classifiers and ensembles used in later experiments.

3.1 Decision Tree

A decision tree is an example of a divide-and-conquer approach to machine learning [14]. Each node of the tree contains a test of a particular feature or attribute. Usually, an attribute value is compared to a constant. This comparison causes the data at the node to flow into one of two or more branches. As each unknown instance is sent down the tree, another attribute is tested at each new node. When an instance reaches a leaf, it is assigned the class that is associated with the leaf. If all the attributes are nominal, each attribute will normally be tested at just one node of the tree. If an attribute is numeric or continuous, tests may be done using different values of the attribute, and more than one node may contain a test on the same attribute.
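The traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-dictionary tree layout, the `classify` function, and the attribute names and thresholds in `TREE` are all hypothetical, and the tree is hand-built rather than learned from the thesis data.

```python
def classify(node, instance):
    """Send one instance down the tree until it reaches a leaf."""
    while "label" not in node:
        value = instance[node["attribute"]]
        if "threshold" in node:
            # Numeric attribute: compare to a constant, two branches.
            branch = "le" if value <= node["threshold"] else "gt"
        else:
            # Nominal attribute: one branch per attribute value.
            branch = value
        node = node["branches"][branch]
    return node["label"]


# Hypothetical hand-built tree (names and thresholds are assumptions).
TREE = {
    "attribute": "intensity_range",
    "threshold": 60,
    "branches": {
        "le": {"label": "NI"},
        "gt": {
            "attribute": "mean_intensity",
            "threshold": 90,
            "branches": {"le": {"label": "I"}, "gt": {"label": "SI"}},
        },
    },
}

if __name__ == "__main__":
    print(classify(TREE, {"intensity_range": 100, "mean_intensity": 50}))
```

Each internal node holds one attribute test and each leaf holds a class, so classifying an instance costs one comparison per level of the tree.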

PAGE 19

If a nominal attribute is chosen for the test at the root, one branch is created for every value of that attribute. If the chosen attribute is continuous, two branches are typically created. This splits the training data into one proper subset for each branch. Each branch leads to another node of the tree. The process is recursive until all training instances that reach a node have the same class, or no more attributes remain for testing, or some other stopping criterion is met. The building process for that branch is halted and a class is assigned to the node, which is now called a leaf.

One major challenge in the construction of a decision tree is deciding which attribute should be selected for the test at each node, beginning with the root. In order to make the tree as small as possible, it would be helpful to select an attribute at each node that results in as many leaves as possible. Using some method of evaluating the purity of each node would lead to this goal. Methods include information gain, and for cases where attributes have many possible values, information gain ratio. Some of the other enhancements to the tree-building process involve the handling of missing values and either postpruning or prepruning the tree. Postpruning is more common and involves subtree replacement and/or subtree raising [14].

3.2 Random Forests

A random forest involves a combination of decision trees [15]. There are two random steps in the process. For every tree in the desired forest, 100% of the input training data is bagged with replacement. In other words, for as many examples as there are of training data, one of the examples from the entire set of examples is randomly chosen for the bag. Once a particular example is chosen, the chances of that example being chosen again are the same as for any example not yet chosen for the bag. On

PAGE 20

average this results in about two thirds of the examples being chosen for the bag (some more than once) and one third not being chosen for the bag. Each bag of training examples will be used to train one of the unpruned decision trees in the forest.

The second element of randomness is in the building process for each tree. Normally, all features are considered for each split decision, and the one that leads to the highest information gain or other criterion is chosen. In the case of random forests, only a certain number of features are randomly selected from all features for utilization in the test at that node. The number of features to be selected is chosen by the user before the process begins. Breiman found that one of the best choices for the number of features is int(log2(M) + 1), where M is the number of features [15]. For instance, if there are nine features, the number of features randomly chosen for evaluation at a node would be four.

While a decision tree is an individual classifier, random forests consist of an ensemble of trees. The training set variation that results from bagging and the random, reduced feature selection for each tree node produce a variety of classifiers in the ensemble. While some trees may not correctly classify a given test example, the votes of other trees can often be enough to overcome the incorrect minority.

Breiman found several advantages for random forests [15]. Their accuracy compares favorably with Adaboost (a popular type of boosting), while offering easy parallelization that Adaboost does not offer. Random forests are also faster than boosting or bagging with regular decision trees. Another advantage is the robustness with respect to noise in the data compared to boosting.
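The two random steps can be checked numerically with a short Python sketch. It is an illustration only, not the software used in this work: the function names are hypothetical, and the sketch covers just the bagging step and the int(log2(M) + 1) rule, not tree construction.

```python
import math
import random


def features_per_split(num_features):
    """Number of features tried at each node: int(log2(M) + 1)."""
    return int(math.log2(num_features) + 1)


def bootstrap_indices(num_examples, rng):
    """Draw num_examples indices with replacement (one bag)."""
    return [rng.randrange(num_examples) for _ in range(num_examples)]


def unique_fraction(num_examples, seed=0):
    """Fraction of distinct examples that land in one bag."""
    rng = random.Random(seed)
    bag = bootstrap_indices(num_examples, rng)
    return len(set(bag)) / num_examples


if __name__ == "__main__":
    print(features_per_split(9))    # nine features, as in the text
    print(unique_fraction(10000))   # close to 1 - 1/e for large sets
```

For large training sets the expected fraction of distinct examples in a bag approaches 1 - 1/e, roughly 0.63, which is the "about two thirds" quoted above.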

PAGE 21

3.3 K-Nearest Neighbor

While decision trees and random forests spend the majority of their time processing the training set rather than classifying, instance-based learning classifiers such as nearest neighbors consume the most processing time classifying each new test example [14]. In nearest neighbors, each new test example is compared to all training examples using a distance metric. The class of the training example with the shortest distance to the test example is predicted as the test class. If the data is noisy, using the nearest neighbor approach may result in a high error rate. If the k nearest neighbors are located, and the class of a majority vote of these neighbors is predicted, error rates may be reduced. Normally k is restricted to odd positive integers.

One of the most common distance metrics is the Euclidean distance metric. For each feature, the difference between the two examples' values is squared. These squared differences are summed. The square root of this sum is the Euclidean distance. In the case of k-nearest neighbors, the square root step can be skipped, since it does not change the basis for distance comparison.

If all attributes have equal importance but different scales, those with smaller scales may have their importance diminished in the calculation of the distance metric. One way to counteract this problem is to normalize the attribute data. Even then, if some attributes are more important than others, it would be helpful to weight each attribute according to importance. This is a main challenge in instance-based learning [14].

Another problem to be addressed is the selection of the k in k-nearest neighbors that yields the highest accuracy. In the case of unbalanced data, the accuracy component of rare classes may need to be weighted more heavily than for other classes [16]. Processing

PAGE 22

time for a large data set may set a practical limit for k. Generally k is upper bounded by the number of minority class examples in order to give each class a theoretical chance of winning the vote.

3.4 K-Nearest Centroid

This variation of k-nearest neighbors was developed in [6]. Since huge datasets render KNN impractical due to processing requirements, KNC provides a faster, practical alternative. Instead of searching the entire training set for the k examples that are closest to each test example in vector space, the search is limited to the set of pre-computed centroids. One centroid is computed and stored for each class in each training partition. The average value of each feature of all examples with the same class in a training partition is selected as the corresponding value for that feature in the centroid with that class. The search space for each test example is thus greatly reduced as a result of this relatively minimal processing of the training data, which is done before any test examples are classified. Thus, KNC has a training step that KNN does not have.
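A minimal Python sketch of the centroid step and the vote follows. It is an illustration under stated assumptions rather than the implementation from [6]: the function names and data layout are hypothetical, squared Euclidean distance is used as described in Section 3.3, and ties in the vote go to the class of the closest tied centroid, which is a choice made here and not one stated in the text.

```python
from collections import Counter


def class_centroids(examples):
    """examples: list of (feature_tuple, label) from ONE training partition.
    Returns one (centroid, label) pair per class present in the partition."""
    sums, counts = {}, Counter()
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return [(tuple(s / counts[label] for s in sums[label]), label)
            for label in sums]


def squared_distance(a, b):
    """Squared Euclidean distance; the square root is not needed for ranking."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knc_predict(centroids, query, k=1):
    """Vote among the k centroids closest to the query."""
    nearest = sorted(centroids, key=lambda c: squared_distance(c[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    part_a = [((0, 0), "NI"), ((2, 0), "NI"), ((10, 10), "I")]
    part_b = [((0, 2), "NI"), ((12, 10), "I")]
    centroids = class_centroids(part_a) + class_centroids(part_b)
    print(knc_predict(centroids, (9, 9), k=3))
```

With two partitions and two classes each, a query is compared against only four centroids instead of every training example, which is where the speedup over KNN comes from.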

PAGE 23

CHAPTER 4 FACE EXPERIMENTS AND RESULTS

First we tested each test image by using a single pruned decision tree that was trained on data from all of the pixels in the five training images. We repeated this experiment using random forests and KNN. Then we partitioned each training image into eight, nearly equally sized partitions for each of four different partition arrangements. We began with the same four row by two column partition arrangement that was used in [6]. Three other variations include eight vertical partitions, eight horizontal partitions, and eight diagonal partitions. We tested each test image by using voted ensembles of random forests, and then ensembles of KNC classifiers. Each ensemble was trained separately on each of the 40 partitions in each arrangement.

4.1 Face Baseline Experiments

We used C4.5 software to train a single pruned decision tree on the pixel feature vectors of the five training images. We used the default certainty factor for pruning. Then we repeated this experiment with a single, log(n) random forest of 1000 pure, unpruned trees. Finally we used a single KNN classifier that was trained on all five training images, using odd k from 1 to 99. These experiments provide a baseline for comparing ensembles of classifiers that are each trained on single partitions of the data. Of course, these representative datasets are small enough to allow such processing, as opposed to the actual data sets that are too huge to process in this way. The results are shown in Figure 4.1. For KNN, the highest average accuracy of 85.52% was with k = 99.


Figure 4.1 Baseline Experiments (red = I, green = SI). Panels for each of images 1 to 5: Ground Truth, Test, Decision Tree, Random Forests, KNN (k = 99).


All three methods do best at correctly predicting the interesting eye regions. The decision tree and random forests correctly predict more eyebrow pixels as somewhat interesting than KNN does, but all three get some credit for at least predicting eyebrows as interesting rather than not interesting. Mouth pixels are predicted as interesting (correct) or somewhat interesting, at least enough to direct attention to these regions. The nostrils in all five images and the feminine hair on the forehead of images two and three are incorrectly predicted as either interesting or somewhat interesting.

4.2 Face Partitioning

First we partitioned each training image into eight partitions using a four row by two column arrangement. The top and third row partitions are 65 pixels wide by 37 pixels high. The second and bottom row partitions are 65 pixels wide by 38 pixels high. All 40 resulting partitions contain NI examples. Of those 40 partitions, 24 also contain I examples but no SI examples, seven also contain SI examples but no I examples, and five contain both I and SI examples. We reprocessed the data in each partition so that the pixels near the new partition borders would have their neighborhood pixel features calculated correctly. This step was required for each partitioning arrangement. Figure 4.2 shows the ground truth class assignment for each training image partition.

Next we used a vertical arrangement of eight partitions per image. The fourth from the left and the rightmost partitions are 17 pixels wide by 150 pixels high, and the remaining six partitions are 16 pixels wide by 150 pixels high. As a result, every partition has examples of all three classes. The partition data was reprocessed as before. Figure 4.2 shows the ground truth assignment for each partition.


For the eight horizontal partitions, the top and the fifth from the top partitions are 130 pixels wide by 18 pixels high. The other six partitions are 130 pixels wide by 19 pixels high. There are 14 partitions with only NI examples, three with examples of all three classes, six with only NI and SI examples, and 17 with only NI and I examples. Figure 4.2 shows the ground truth for each partition.

In order to make the eight diagonal partitions each contain approximately the same number of pixels, it was necessary to space the partition borders unequally. Starting at the upper left partition, the partitions contain 2415, 2436, 2409, 2425, 2434, 2431, 2465, and 2485 pixels respectively. Two partitions have only NI and SI examples, 11 have only NI and I examples, one has only NI examples, and 26 have examples of all three classes.


Figure 4.2 Ground Truth in Face Training Partitions (red = I, green = SI). Panels for each of images 1 to 5: 4 Row by 2 Col., Vertical, Horizontal, Diagonal.


4.3 Face Ensemble Experiments and Results

We used each of the four partition arrangements to create an ensemble of 40 classifiers (or of 40 ensembles in the case of random forests). The final classification for each test example was the result of a vote in which more weight was given to classes that had training examples in fewer partitions. Each partition represents a compute node. If the number of partitions with examples of a less common class is less than half of the number of partitions with examples of another class, the minority class may be outvoted unfairly. In order to make the vote fair, the probability that a given partition contained each class was taken into account, as was done in [2]. This is an application of Bayesian decision theory [2]. We will refer to this as a probabilistic majority method. The equations for a two-class problem according to [2] are:

$p(w_1 \mid x)$ = percentage of ensembles voting for class $w_1$ for example $x$
$P(w_1)$ = percentage of ensembles capable of predicting class $w_1$

Classify as $w_1$ if: $p(w_1 \mid x) / P(w_1) > p(w_2 \mid x) / P(w_2)$
Classify as $w_2$ if: $p(w_1 \mid x) / P(w_1) < p(w_2 \mid x) / P(w_2)$
A tie, $p(w_1 \mid x) / P(w_1) = p(w_2 \mid x) / P(w_2)$, is broken randomly.

The number of ensembles capable of predicting class $w_1$ is the number of ensembles that have at least one training example of class $w_1$. Our three-class problem is an example of an n-class problem according to [2]:

Classify as $w_n$: $\arg\max_n \left( p(w_n \mid x) / P(w_n) \right)$

We broke ties in I>SI>NI order instead of randomly, as in [2].

4.3.1 Four Row by Two Column

Each of the eight partitions in the four row by two column partition arrangement of each of the five training images was initially bagged 1,000 times at 100%. These bags of data were used to train 40 ensembles of random forests, each with 1,000 pure decision trees. Each ensemble was then used to classify each pixel of each test image. Finally, the class that received a probabilistic majority [2] of the 40 ensemble votes was predicted as the class for the pixel of interest. This method weights each vote by dividing it by the number of partitions with examples of that class. Ties were broken in I>SI>NI order.

The procedure for using KNC on the above partition arrangement began with the calculation of one centroid for each class of each training partition. For each of the first six features, the mean of that feature over the examples of the same class in the partition of interest was calculated as the feature value for the centroid. This resulted in 40 NI centroids, 29 I centroids, and 12 SI centroids for a total of 81. Odd k from 1 to 11 was used for KNC to classify each pixel of each test image. For each k used, the k centroids with the closest Euclidean distance to the test pixel in the six-dimensional feature space were determined, and the classes of those centroids were voted using the probabilistic majority method [2]. The weight of each vote was adjusted by dividing it by the number of centroids of that class. Ties were broken in I>SI>NI order. The results for k = 11 produced the best overall average accuracy of 72.78%.

The results for random forests and KNC are shown in Figure 4.3. Random forests are more accurate than KNC in correctly predicting I and SI pixels and in not falsely predicting these classes. Random forests perform better than baseline random forests on all the data. KNC performs considerably worse than baseline KNN on all the data. The much greater total processing time required for random forests than for KNC results in a significant accuracy advantage.
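The probabilistic majority vote can be sketched as follows. This is a minimal illustration and not the code from [2]: the function name `probabilistic_majority` and the dictionary inputs are assumptions, and the vote counts in the example are made up. Because p(w|x) and P(w) are both fractions of the same number of voters, their ratio reduces to the votes for a class divided by the number of voters capable of predicting it.

```python
def probabilistic_majority(votes, capable, tie_order=("I", "SI", "NI")):
    """Pick the class maximizing p(w|x) / P(w).

    votes[c]   : number of voters that chose class c for this example
    capable[c] : number of voters trained with at least one example of c
    Ties go to the class listed first in tie_order (I > SI > NI).
    """
    best_class, best_score = None, -1.0
    for cls in tie_order:
        if capable.get(cls, 0) == 0:
            continue  # no voter can predict this class at all
        score = votes.get(cls, 0) / capable[cls]
        if score > best_score:  # strict ">" keeps the earlier class on a tie
            best_class, best_score = cls, score
    return best_class


# Partition counts for the four row by two column arrangement.
capable = {"NI": 40, "I": 29, "SI": 12}

# Hypothetical vote split: NI has a plain plurality, but I wins once each
# count is divided by the number of partitions able to predict that class.
winner = probabilistic_majority({"NI": 18, "I": 16, "SI": 6}, capable)
```

When every voter has seen every class, all the divisors are equal and the rule reduces to a simple majority vote.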


Figure 4.3 Face Four Row by Two Column Results (red = I, green = SI). Panels for each of images 1 to 5: Ground Truth, Random Forests, KNC (k = 11).


4.3.2 Vertical Partitions

Random forests and KNC methods were used to train on the 40 vertical partitions and then used to test each of the five test images. Since each of the 40 vertical partitions has examples of all three classes, the probabilistic majority vote results in the same predicted class as a simple majority vote. The results are shown in Figure 4.4. Random forests have fewer false positives than they did for the previous partition arrangement. KNC incorrectly predicts most I pixels as SI. The highest average KNC accuracy of 68.54% was for k = 11. We attempted to improve KNC performance by eliminating centroids that represented fewer than a minimum number of pixels, trying a variety of minimum values, without success.


Figure 4.4 Face Vertical Partition Results (red = I, green = SI). Panels for each of images 1 to 5: Ground Truth, Random Forests, KNC (k = 11).


4.3.3 Horizontal Partitions

Each of the five test images was tested with random forests and KNC that were trained on the 40 horizontal partitions. The random forests probabilistic majority voting method was used normally, and also modified by requiring a minimum threshold of 50 examples in the partition for a class to be considered present. This was an attempt to neutralize the effect of classes that are represented by only a small number of examples in some of the partitions. For KNC, there were 40 NI centroids, 20 I centroids, and 9 SI centroids. A value of 11 for k produced the highest average accuracy of 75.4%. The results are shown in Figure 4.5. Both random forests variants have far fewer false positives than KNC.


Figure 4.5 Face Horizontal Partition Results (red = I, green = SI). Panels for each of images 1 to 5: Ground Truth, Random Forests, Rand. For. (th = 50), KNC (k = 11).


4.3.4 Diagonal Partitions

Each of the five test images was tested with KNC classifiers that were each trained on one of the 40 diagonal partitions. There were 40 NI centroids, 37 I centroids, and 28 SI centroids for a total of 105. Probabilistic majority voting was used as in previous experiments. The results are shown in Figure 4.6 for k = 11, which produced the highest average accuracy of 70.02%. They are similar to those for KNC with vertical partitions in Figure 4.4.


Figure 4.6 Face Diagonal Partition Results (red = I, green = SI). Panels for each of images 1 to 5: Ground Truth, KNC (k = 11).


CHAPTER 5 CAN EXPERIMENTS AND RESULTS

We partitioned the can and the plate into four vertical partitions for the first series of experiments, and into a plate and four horizontal can partitions for the second series of experiments. For each partition, we separately trained a single decision tree, an ensemble of random forest classifiers, a k-nearest neighbor classifier (KNN), and a k-nearest centroid classifier (KNC). We then determined predictive accuracy for each partition using the classifier or ensemble of classifiers previously trained on each of the remaining three partitions. A majority of the three votes determined the predicted class, unknown or salient, for each test node.

5.1 Can Partitioning

First, we partitioned the can and plate into four vertical partitions, so that each spatially disjoint partition contains can data in approximately equal proportions. Figure 5.1 shows the vertical partitions. Partition 0 includes the front of the plate and the left and right outer vertical sections of the can. Partition 3 includes the rear of the plate and the center vertical section of the can. When the plate strikes the can from above, the plate crushes the top part of the can in the initial time steps, and successively lower parts as each time step progresses, as seen in Figure 2.3.

Next, we divided the can and plate horizontally into a plate partition and four can partitions. Figure 5.1 shows the horizontal partitions, with the plate at the top, and can partitions 0 to 3 in order beneath the plate. Can partitions 0, 1, and 2 each contain 10 rows of nodes, and partition 3 contains 11 rows of nodes.

Figure 5.1 Can Partitions: (a) Vertical Partitions, (b) Horizontal Partitions

This partitioning scheme represents a greater machine learning challenge because the plate and crushed can do not impact the lower positioned partitions until many of the 44 time steps have elapsed. During these time steps, partitions two and three do not contain any nodes marked salient. Learned ensembles of classifiers for lower partitions have fewer salient examples overall for training. Conversely, ensembles for upper partitions have fewer unknown saliency examples.

5.2 Out of Partition Can Experiments

We trained a separate classifier or ensemble of classifiers on the data for each of the four partitions. Then, we tested each partition by using each of the three classifiers or ensembles that were trained on the three remaining partitions. Finally, we voted the three out of partition predictions to obtain the final prediction for each node. This scheme, suggested by W. Philip Kegelmeyer, allows each data node of each partition to be tested without the use of any data nodes of that partition for training. We will use the acronym OOP for out of partition from this point forward. Since there are only two classes and three votes, ties are not possible.

5.2.1 Vertical Partition Experiments

Each vertical partition contains both plate and can data. When we used the VEP marking tool to mark salient can nodes in each time step, the crudeness of the tool resulted in some plate nodes also being marked salient. Although we chose to use both plate and can data for training and testing, we concentrate on the can results. The plate data was never an objective for saliency marking, since the plate does not deform in the simulation.

5.2.1.1 Vertical Decision Tree

We used C4.5 software to create four single decision tree classifiers, one for each partition. We used the default certainty factor for pruning. The accuracy results are shown in Figure 5.2. The highest accuracy for each test partition is for the decision tree that was trained on that same partition. Even though these results are included in the table, they were not used in the OOP voting to obtain the voted results. The number of can nodes in partitions 0 to 3 varies in a 20:23:23:16 relationship. The total voted accuracy for the labeled examples from the can is 92.89%.
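The OOP voting scheme of Section 5.2 can be illustrated with a short sketch. The function name `oop_vote`, the representation of each trained classifier as a Python callable, and the stand-in classifiers in the example are all assumptions made for illustration; they are not the C4.5-based tooling used in the experiments.

```python
from collections import Counter


def oop_vote(test_partition, classifiers, node):
    """Predict a node's class using only classifiers from other partitions.

    classifiers maps partition id -> callable(node) -> "unknown" or "salient".
    With four partitions there are three voters and two classes, so the
    majority is always well defined.
    """
    votes = [clf(node) for pid, clf in classifiers.items() if pid != test_partition]
    return Counter(votes).most_common(1)[0][0]


# Stand-in classifiers that ignore their input (purely illustrative).
classifiers = {
    0: lambda node: "salient",
    1: lambda node: "unknown",
    2: lambda node: "salient",
    3: lambda node: "unknown",
}

prediction_for_partition_0 = oop_vote(0, classifiers, node=None)
prediction_for_partition_1 = oop_vote(1, classifiers, node=None)
```

The classifier trained on the test partition itself is skipped, which is why the same-partition accuracies reported in the figures do not contribute to the voted results.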


Accuracy (%) by training partition and OOP vote:

          train 0   train 1   train 2   train 3   voted
test 0      98.94     86.89     83.17     79.46   85.32
test 1      93.08     99.69     95.14     88.36   95.55
test 2      83.41     96.99     99.45     94.62   96.89
test 3      74.77     92.57     94.47     98.92   92.8

Figure 5.2 Can Decision Tree Accuracies for Vertical Partitions

5.2.1.2 Vertical Random Forests

For our random forests experiments, we again used the C4.5 software to train four ensembles of 250 classifiers, one ensemble for each partition. Each pure, unpruned decision tree in each forest was trained on partition data that was initially bagged, as is recommended in [15]. Each of the three OOP ensembles produced one vote for each data node. We assigned the majority class of the three OOP votes as the predicted class for each node. The results are shown in Figure 5.3. The overall OOP voted accuracy for the labeled examples from the can is 93.93%.


Accuracy (%) by training partition and OOP vote:

          train 0   train 1   train 2   train 3   voted
test 0      99.89     89.97     86.5      86.34   87.36
test 1      94.52     99.97     97.14     95.1    97.06
test 2      89.45     97.45     99.96     96.95   97.31
test 3      84.18     92.53     94.81     99.91   92.77

Figure 5.3 Can Random Forests Accuracies for Vertical Partitions

Figure 5.4 shows two-dimensional maps of ground truth and random forests OOP can predictions for each time step. Yellow or light represents unknown saliency nodes and red or dark represents salient nodes. Each pixel in every map represents a single node of the inside shell of the can. There are 41 nodes in each row and column. The other three shells for each time step have similar saliency maps. The random forests maps in the second column of maps show the most false positive salient predictions. Most of the total errors are false positives in these time steps.


Figure 5.4 Can Vertical, Ground Truth vs. Random Forests Maps (red = salient, yellow = unknown). One ground truth (GT) map and one random forests (RF) map for each time step (TS) 0 to 43.

We used the unweighted predictions from each of the three random forests in the experiments above for OOP voting. Each unweighted vote from a partition's random forest was either for unknown saliency or for salience. Weighted predictions are also available from each partition's random forest. For example, if 200 trees in a partition's forest voted for salience, and 50 trees voted for unknown saliency, the weighted vote for that partition would be 0.8 for salience and 0.2 for unknown saliency. When the weighted predictions were used instead of the unweighted predictions, voted OOP accuracy increased from 93.93% to 94.66%. This is for a simple majority of the three weighted predictions, with ties assigned to unknown saliency.

We varied the threshold of total weighted unknown votes for classifying an example as unknown, from 0.0 to 3.1, and found a maximum OOP voted accuracy of 95.47% at a threshold of 1.0 or 1.1. For a threshold of 1.0, an example was classified as unknown if the sum of the weighted unknown votes from the three ensembles was from 1.0 to 3.0; otherwise, the example was classified as salient. The results are shown in Figure 5.5.

Figure 5.5 Can Random Forests Accuracies for Unknown Weighted Thresholds (vertical). Horizontal axis: unknown weighted vote total threshold for unknown classification; vertical axis: OOP voted accuracy %.

An ROC curve is an evaluation tool used in data mining. This curve depicts the performance of a classifier by plotting the percentage of correct positive predictions (true positives) compared to all positive examples on the vertical axis, and the percentage of incorrect positive predictions (false positives) compared to all negative examples on the horizontal axis [14]. We plotted an ROC curve for this experiment using weighted thresholds from 0.0 to 3.1. The resulting curve is shown in Figure 5.6. If a true positive rate of 91.2% is desired, a false positive rate of 4.39% must be allowed (at a threshold of 0.5). A true positive rate of 97.5% at a threshold of 1.0 corresponds to a false positive rate of 8.22%. A true positive rate of 99.0% at a threshold of 1.5 corresponds to a false positive rate of 13.4%.

Figure 5.6 Can ROC Curve of Weighted Random Forests Accuracies (vertical). Horizontal axis: false positives; vertical axis: true positives.
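The weighted-vote threshold rule and the ROC points derived from it can be sketched as below. The function names `classify_weighted` and `roc_point`, the data layout, and the example weights are assumptions for illustration only. Salient is treated as the positive class, and each example carries the three unknown-saliency weights from its OOP forests.

```python
def classify_weighted(unknown_weights, threshold):
    """Classify as unknown when the summed unknown weights reach the threshold."""
    return "unknown" if sum(unknown_weights) >= threshold else "salient"


def roc_point(examples, threshold):
    """Return (false positive rate, true positive rate) at one threshold.

    examples: list of (unknown_weights, true_label); positive class = salient.
    """
    tp = fp = positives = negatives = 0
    for weights, truth in examples:
        predicted = classify_weighted(weights, threshold)
        if truth == "salient":
            positives += 1
            tp += predicted == "salient"
        else:
            negatives += 1
            fp += predicted == "salient"
    return fp / negatives, tp / positives


# Hypothetical weighted votes (fraction of trees voting unknown per forest).
examples = [
    ([0.0, 0.25, 0.25], "salient"),
    ([0.5, 0.5, 0.25], "salient"),
    ([1.0, 0.75, 1.0], "unknown"),
    ([0.5, 0.25, 0.0], "unknown"),
]

# Sweeping the threshold traces out the ROC curve, one point per threshold.
curve = [roc_point(examples, t) for t in (0.0, 1.0, 1.5)]
```

Raising the threshold makes it harder to classify an example as unknown, so both the true positive and false positive rates for salience rise, which matches the trend reported for thresholds 0.5, 1.0, and 1.5.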


5.2.1.3 Vertical KNC

For KNC experiments, we first obtained the average value of each of the nine features for each class present in each time step. We call this average feature vector a centroid. Partitions 0 to 2 each have 87 centroids and partition 3 has 86 centroids. If nodes of both classes existed for each time step, there would be 88 centroids for each partition. We combined all the centroids for an individual partition for training use by the KNC classifier. The classifier determines the majority class of the k centroids that are closest in Euclidean distance to the vector of each test node.

One of the challenges in instance-based learning is the weighting of features according to their importance [14]. The range of the three displacement features is about two to three orders of magnitude smaller than the range of the three velocity features. The range of the velocity features is likewise about two to three orders of magnitude smaller than that of the three acceleration features. One approach to this problem is to normalize the data. This prevents the components of the Euclidean distance calculation for features with greater magnitudes from dwarfing those for features with much smaller magnitudes.

We tried several methods of normalizing the data, including linear normalization, which provided the best boost in accuracy for KNC. We also tried another approach to the problem by running KNC using just one feature, and various combinations of two to four features. We determined that by using just the first four features (the three displacement features and the X velocity feature), overall accuracy could be improved over the accuracy of using all nine features. The improvement gained by first normalizing the first four features was 0.6%. We decided to use the first four non-normalized features for all remaining KNC and KNN experiments.

We tried odd k from 1 to 87 and chose k = 51 for optimal overall accuracy. This accuracy measure gives equal weight to false positive and false negative errors. We could have selected a different k if we wanted to emphasize salient accuracy more than unknown accuracy, for example. We polled the votes from each of the OOP classifiers to obtain the predicted class for each test node.

The can KNC accuracies are shown in Figure 5.7. The overall OOP voted accuracy for the labeled examples from the can is 89.16%, which is about four to five percentage points lower than that of decision trees and random forests.

Accuracy (%) by training partition and OOP vote:

          train 0   train 1   train 2   train 3   voted
test 0      88.57     88.09     88.04     88.45   88.09
test 1      90.88     90.77     90.78     90.85   90.85
test 2      89.83     90.01     90.01     89.88   89.88
test 3      86.86     87.05     87.02     86.81   87.05

Figure 5.7 Can KNC Accuracies for Vertical Partitions
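The feature scaling issue discussed in this section can be illustrated with a small sketch. The text does not define "linear normalization" precisely; the example assumes it means min-max scaling of each feature to the range [0, 1], which may differ from what was actually used. The function name `min_max_normalize` and the sample values are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every feature column linearly to [0, 1] (assumed normalization).

    rows: list of equal-length feature vectors. A constant column maps to 0.0.
    """
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Two made-up features whose ranges differ by orders of magnitude,
# loosely mimicking a displacement feature next to a velocity feature.
raw = [[0.25, 100.0], [0.75, 300.0], [0.5, 200.0]]
scaled = min_max_normalize(raw)
```

After scaling, both features contribute on the same footing to a Euclidean distance, whereas in the raw data the second feature would dominate.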


Figure 5.8 shows two-dimensional maps of ground truth and KNC OOP can predictions for each time step. Yellow or light represents unknown saliency nodes and red or dark represents salient nodes. False negatives are more plentiful than in the random forests maps of Figure 5.4. The accuracy is best after time step 33, when most of the can nodes are salient.


Figure 5.8 Can Vertical, Ground Truth vs. KNC Maps (red = salient, yellow = unknown). One ground truth (GT) map and one KNC map for each time step (TS) 0 to 43.

5.2.1.4 Vertical KNN

For KNN experiments, we used odd k from 1 to 51. We selected k = 1 for best overall accuracy over all voting partitions. The majority class of the three OOP votes was chosen as the predicted class for each test node. The KNN voted accuracies in Figure 5.9 are higher than the corresponding KNC voted accuracies for three of the four partitions. The overall KNN voted accuracy is 90.41%, which is higher than the 89.16% for KNC. The cost of this higher accuracy is a much longer processing time.


Accuracy (%) by training partition and OOP vote:

          train 0   train 1   train 2   train 3   voted
test 0     100        88.96     84.7      80.01   85.03
test 1      86.81    100        92.35     87.93   93.51
test 2      94.57     92.81    100        90.68   92.82
test 3      80.7      89.55     90.79    100      89.23

Figure 5.9 Can KNN Accuracies for Vertical Partitions

5.2.1.5 Vertical Comparison

Figure 5.10 shows a comparison of the OOP voted accuracies for the classifiers and ensembles we tested. The total accuracies of random forests and the single decision tree surpass those of KNN and KNC. The total accuracy is calculated using the node totals for each partition, since the number of nodes in each partition is not uniform. This method gives equal weight to false positives and false negatives.


OOP voted accuracy (%) by classifier or ensemble:

          dec. tree   rand. forest   knc     knn
test 0      85.32        87.36       88.09   85.03
test 1      95.55        97.06       90.85   93.51
test 2      96.89        97.31       89.88   92.82
test 3      92.8         92.77       87.05   89.23
total       92.89        93.93       89.16   90.41

Figure 5.10 Can Out of Partition Accuracies for Vertical Partitions

Figure 5.11 shows the voted accuracies in terms of unknown and salient accuracies. Unknown accuracy is the percentage of correctly classified unknown examples compared to all unknown examples. Salient accuracy is the percentage of correctly classified salient examples compared to all salient examples. Overall accuracy is the percentage of all correctly classified examples compared to all examples.
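The three accuracy measures defined above can be computed as in the following sketch. The function name `accuracy_by_class` and the label lists are illustrative assumptions; the sketch also assumes that every listed class occurs at least once in the ground truth.

```python
def accuracy_by_class(truth, predicted, classes=("unknown", "salient")):
    """Return per-class and overall accuracy as percentages.

    Per-class accuracy = correctly classified examples of that class divided
    by all ground truth examples of that class. Assumes each class is present.
    """
    result = {}
    for cls in classes:
        pairs = [(t, p) for t, p in zip(truth, predicted) if t == cls]
        hits = sum(1 for t, p in pairs if t == p)
        result[cls] = 100.0 * hits / len(pairs)
    hits = sum(1 for t, p in zip(truth, predicted) if t == p)
    result["overall"] = 100.0 * hits / len(truth)
    return result


# Made-up labels for five nodes.
truth = ["unknown", "unknown", "salient", "salient", "salient"]
predicted = ["unknown", "salient", "salient", "salient", "unknown"]
scores = accuracy_by_class(truth, predicted)
```

Because overall accuracy pools all nodes, it is pulled toward the accuracy of whichever class has more examples.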


Accuracy (%) by type and by classifier or ensemble:

           dec. tree   rand. forest   knc     knn
unknown      83.14        85.02       85.94   81.17
salient      98.45        99.01       91      95.64
overall      92.89        93.93       89.16   90.41

Figure 5.11 Can Accuracies by Type for Vertical Partitions

5.2.2 Horizontal Partition Experiments

We isolated the plate as a separate partition from the four can partitions. This allowed us to use only can partitions for training and testing. The OOP procedures for horizontal testing are otherwise very similar to those for vertical testing.

5.2.2.1 Horizontal Decision Tree

Figure 5.12 shows the single decision tree accuracies for horizontal partitions. The total accuracy of 91.11% is lower than the 92.89% for vertical partitions. Each vertical partition is more representative of the entire can as the can is crushed. The percentage of ground truth salient nodes per partition is more uniform with vertical partitions.


Accuracy (%) by training partition and OOP vote:

          train 0   train 1   train 2   train 3   voted
test 0      99.13     92.08     94.22     71.39   93.5
test 1      93.72     99.8      95.63     81.25   96
test 2      85.77     86.49     99.61     88.78   93.1
test 3      72.48     65.49     82.68     99.18   82.7

Figure 5.12 Can Decision Tree Accuracies for Horizontal Partitions

5.2.2.2 Horizontal Random Forests

Figure 5.13 shows the random forests results for horizontal partitions. The total voted accuracy of 89.15% is lower than the 91.11% for single decision trees.


Accuracy (%) by training partition and OOP vote:

          train 0   train 1   train 2   train 3   voted
test 0      99.92     94.38     94.81     94.28   94.7
test 1      92.62     99.99     96.71     94.33   96.5
test 2      83.91     87.79     99.95     93.29   90.5
test 3      76.31     63.58     84.58     99.96   76.2

Figure 5.13 Can Random Forests Accuracies for Horizontal Partitions

Figure 5.14 shows two-dimensional maps of ground truth and random forests OOP can predictions for each time step. Yellow or light represents unknown saliency nodes and red or dark represents salient nodes. False positives in time steps 12 to 28 are the main type of error.


Figure 5.14 Can Horizontal, Ground Truth vs. Random Forests Maps (red = salient, yellow = unknown). One ground truth (GT) map and one random forests (RF) map for each time step (TS) 0 to 43.

We used the unweighted predictions from the decision trees in the random forests experiments above for OOP voting. When the weighted predictions were used instead, voted OOP accuracy increased from 89.15% to 89.97%. This is for a simple majority of the weighted predictions, with ties assigned to unknown saliency. We varied the threshold of total weighted unknown votes for classifying an example as unknown, from 0.0 to 3.1, and found a maximum OOP voted accuracy of 92.51% at a threshold of 1.0. The results are shown in Figure 5.15.

Figure 5.15 Can Random Forests Accuracies for Unknown Weighted Thresholds (horizontal). Horizontal axis: unknown weighted vote total threshold for unknown classification; vertical axis: OOP voted accuracy %.

We plotted an ROC curve for this experiment using weighted thresholds from 0.0 to 3.1. The resulting curve is shown in Figure 5.16. If a true positive rate of 90.4% is desired, a false positive rate of 9.04% must be allowed (at a threshold of 0.6). A true positive rate of 96.9% at a threshold of 1.0 corresponds to a false positive rate of 15.6%. A true positive rate of 99.0% at a threshold of 1.5 corresponds to a false positive rate of 26.6%.


Figure 5.16 Can ROC Curve of Weighted Random Forests Accuracies (horizontal). Horizontal axis: false positives; vertical axis: true positives.

A comparison of sections of the ROC curves of weighted random forests accuracies using vertical and horizontal partitions is shown in Figure 5.17. For a true positive rate of 97.5%, the classifiers trained on vertical partitions produce a lower false positive rate (8%) than the classifiers trained on horizontal partitions (17%). Overall, the classifiers trained on vertical partitions have the greater area under the curve (AUC) [14].


Figure 5.17 Can ROC Curves of Weighted Random Forests Accuracies. Series: Vertical, Horizontal; horizontal axis: false positives; vertical axis: true positives.

5.2.2.3 Horizontal KNC

Table 5.1 shows the number of centroids for each horizontal partition.

Table 5.1 Can Horizontal Centroids

Number of Centroids      Partition Number
                         0     1     2     3
Unknown                 41    29    36    44
Salient                 43    38    31    22
Total                   84    67    67    66

We tried each odd k from 1 to 65 for KNC experiments. We chose to use k = 23 for optimal overall accuracy over all voting partitions. We defined overall accuracy as the percentage of correctly classified nodes compared to the total number of nodes. Figure 5.18 shows the KNC accuracies for horizontal partitions. The total accuracy of 82.7% is lower than the 89.16% for vertical partitions. KNC trained on partition 1 (second from the top of the can, which is nearest to the plate) is the most accurate in classifying examples from any partition.

Accuracy (%) by training partition and OOP vote:

          train 0   train 1   train 2   train 3   voted
test 0      79.1      88.7      82.3      80.9    82.6
test 1      84.4      92.6      86.7      85.9    85.9
test 2      83.1      90.8      84.6      83.5    84.2
test 3      76.2      83.9      79        77.3    78.5

Figure 5.18 Can KNC Accuracies for Horizontal Partitions

Figure 5.19 shows two-dimensional maps of ground truth and KNC OOP can predictions for each time step. Yellow or light represents unknown saliency nodes and red or dark represents salient nodes. False negatives in time steps 38 to 44 are the most noticeable errors.


Figure 5.19 Can Horizontal, Ground Truth vs. KNC Maps (red = salient, yellow = unknown). One ground truth (GT) map and one KNC map for each time step (TS) 0 to 43.

5.2.2.4 Horizontal KNN

Figure 5.20 shows the KNN accuracies for horizontal partitions. The total accuracy of 87.42% is lower than the 90.41% for vertical partitions. Partition 3 (bottom) is the most difficult to classify for KNN trained on other partitions.


Accuracy (%) by training partition and OOP vote:

          train 0   train 1   train 2   train 3   voted
test 0     100        91.6      91.8      88.8    91.9
test 1      89.4     100        94        89.4    94.2
test 2      79.1      89.6     100        86.7    90.8
test 3      59.3      75.3      82.1     100      74.1

Figure 5.20 Can KNN Accuracies for Horizontal Partitions

5.2.2.5 Horizontal Comparison

Figure 5.21 shows a comparison of the OOP voted accuracies for the classifiers and ensembles we tested. The total accuracies of the single decision tree and random forests surpass those of KNN and KNC. The total accuracy is calculated using the node totals for each partition, since the number of nodes in each partition is not uniform. Partition 1 (second from the top) has the highest accuracy and partition 3 (bottom) has the lowest accuracy for all classifiers.


OOP voted accuracy (%) by classifier or ensemble:

          dec. tree   rand. forests   knc     knn
test 0      93.5         94.7         82.6    91.9
test 1      96           96.5         85.9    94.2
test 2      93.1         90.5         84.2    90.8
test 3      82.7         76.2         78.5    74.1
total       91.11        89.15        82.7    87.42

Figure 5.21 Can Out of Partition Accuracies for Horizontal Partitions

Figure 5.22 shows these accuracies in terms of unknown and salient accuracies. Unknown accuracy is the percentage of correctly classified unknown examples compared to all unknown examples. Salient accuracy is the percentage of correctly classified salient examples compared to all salient examples. Overall accuracy is the percentage of all correctly classified examples compared to all examples.


Figure 5.22 Can Accuracies by Type for Horizontal Partitions
(classifier-ensemble accuracy, %)

          dec. tree   rand. forest   KNC     KNN
unknown   81.46       73.19          88.57   75.37
salient   95.94       96.15          75.97   95.88
overall   91.11       89.15          82.7    87.42

5.2.3 Regional Experiments

We have measured classifier accuracy so far by counting the can nodes of each predicted and ground truth class. In order to investigate how this accuracy relates to regional accuracy, we established a fixed, three-dimensional grid for can regions. There are four layers or shells of 41 nodes by 41 nodes in the can structure. It is not possible to divide the can nodes into a fixed number of regions, each with the same length in each spatial dimension, and with each node in only one region. Therefore we created 196 regions per time step, in a 14 by 14 region arrangement. Of these 196 regions, 169 regions include 3x3x4 = 36 nodes, 26 regions include 24 nodes (13 with 3x2x4 and 13 with 2x3x4), and one region includes 2x2x4 = 16 nodes. Looking at the inside of the can with the plate on top, the 24 node regions lie in the left column and the bottom row of the can, and the single 16 node region is at the bottom left corner. Every node is part of only one region.

In order for a region to be considered salient, a minimum number of salient nodes in the region must be established as a threshold. If the number of salient nodes in the region meets or exceeds this threshold, then the region is salient; otherwise the region is of unknown saliency. Since the majority of regions have 36 nodes, this suggests 36 possible saliency thresholds. First we applied thresholds 1 to 36 to the ground truth regions to establish a new set of ground truth nodes for each threshold. Then we tested each new set for accuracy compared to the original ground truth model on a node by node basis. Figure 5.23 shows the resulting plot. The regional ground truth model with a threshold of 24 has the maximum overall accuracy of about 98%. This threshold is also where unknown and salient accuracies converge.

Figure 5.23 Nodal Accuracy for Regionalized Ground Truth Models
(x-axis: salient nodes per region threshold; y-axis: nodal accuracy, 90 to 100; series: unknown, salient, and overall accuracy)

Figure 5.24 shows the significant area of an ROC curve for the regionalized ground truth models. In this case, positive represents salient and negative represents unknown saliency. Saliency thresholds of 0 to 37 were used for each 36 node region.
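The regional grid and the thresholding step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used for the experiments: the names `region_of`, `region_sizes`, `regionalize`, and `nodal_accuracy` are hypothetical, node labels are held in nested lists of booleans (True = salient), and the narrow 2-node band is placed at the highest row and column indices, which may not match the orientation of the actual can mesh.

```python
from collections import Counter

SIDE = 41    # nodes per side of each shell (from the text)
SHELLS = 4   # shells in the can (from the text)
BAND = 3     # nodes per region side, except the last band of 2
GRID = 14    # regions per side: 13 bands of 3 nodes plus 1 band of 2

def region_of(row, col):
    """Region id (0..195) for a node position within a shell.

    Assumption: the 2-node band sits at the highest indices.
    """
    return (row // BAND) * GRID + (col // BAND)

def region_sizes():
    """Number of nodes in each region, counting all four shells."""
    sizes = Counter()
    for row in range(SIDE):
        for col in range(SIDE):
            sizes[region_of(row, col)] += SHELLS
    return sizes

def regionalize(salient, threshold):
    """Relabel every node with its region's label.

    salient[shell][row][col] is True for a salient node. A region is
    salient when its salient-node count meets or exceeds the threshold.
    """
    counts = Counter()
    for shell in salient:
        for row in range(SIDE):
            for col in range(SIDE):
                if shell[row][col]:
                    counts[region_of(row, col)] += 1
    return [[[counts[region_of(row, col)] >= threshold
              for col in range(SIDE)]
             for row in range(SIDE)]
            for _ in range(SHELLS)]

def nodal_accuracy(truth, relabeled):
    """Percentage of nodes whose regional label matches the original label."""
    pairs = [(t, r)
             for tshell, rshell in zip(truth, relabeled)
             for trow, rrow in zip(tshell, rshell)
             for t, r in zip(trow, rrow)]
    return 100.0 * sum(t == r for t, r in pairs) / len(pairs)

# Toy ground truth: one fully salient 36-node region plus one stray node.
truth = [[[row < 3 and col < 3 for col in range(SIDE)]
          for row in range(SIDE)] for _ in range(SHELLS)]
truth[0][10][10] = True

size_histogram = Counter(region_sizes().values())
acc_at_24 = nodal_accuracy(truth, regionalize(truth, 24))
acc_at_1 = nodal_accuracy(truth, regionalize(truth, 1))
```

In this toy case a threshold of 24 mislabels only the stray node, while a threshold of 1 marks the stray node's whole region as salient and mislabels its 35 neighbors. The size histogram reproduces the 169, 26, and 1 region counts given in the text.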


Figure 5.24 Can ROC Curve for Regionalized Ground Truth Models
(x-axis: false positives, 0% to 10%; y-axis: true positives, 92% to 100%)

Next, we applied thresholds 1 to 36 to the random forests prediction model for the can horizontal partitions. Then we tested each new set for accuracy compared to the original ground truth model on a node by node basis. Figure 5.25 shows the resulting plot. This exercise also illustrates an opportunity to reprocess an original prediction model to maximize accuracy. The target accuracy could also be based on different weights for unknown and salient accuracies.
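Choosing a threshold against a weighted target could look like the sketch below. The weighted mean used here is one possible target, chosen for illustration; the text does not specify a formula. The function names and the sample curve are hypothetical, and the numbers are not read from Figure 5.25.

```python
def weighted_accuracy(unknown_acc, salient_acc, w_unknown, w_salient):
    """Weighted mean of the two per-class accuracies."""
    return (w_unknown * unknown_acc + w_salient * salient_acc) / (w_unknown + w_salient)

def best_threshold(curve, w_unknown=1.0, w_salient=1.0):
    """Pick the threshold whose weighted accuracy is highest.

    curve is a list of (threshold, unknown_acc, salient_acc) tuples.
    """
    best = max(curve,
               key=lambda row: weighted_accuracy(row[1], row[2], w_unknown, w_salient))
    return best[0]

# Illustrative numbers only; they are not taken from Figure 5.25.
curve = [(1, 60.0, 99.0), (12, 80.0, 95.0), (24, 92.0, 85.0), (36, 99.0, 50.0)]
balanced = best_threshold(curve)
salient_heavy = best_threshold(curve, w_unknown=1.0, w_salient=4.0)
```

With equal weights the sample curve selects threshold 24; weighting salient accuracy four times as heavily moves the choice down to threshold 12, where fewer salient nodes are missed.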


Figure 5.25 Nodal Accuracy for Regionalized Random Forest Models
(x-axis: salient nodes per region threshold; y-axis: nodal accuracy %, 60 to 100; series: unknown, salient, and overall accuracy)

Figure 5.26 shows the corresponding ROC curve. If the allowable false positive rate is increased from 22% to 32%, the true positive rate can be increased from 97% to almost 100%.
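The ROC points behind such a trade-off can be computed by sweeping the regional threshold, with salient as the positive class and unknown as the negative class. The following is a minimal sketch under assumptions: `roc_points` and `best_under_fpr` are hypothetical names, each node carries the salient-node count of its region as its score, and the toy data is invented for the example.

```python
def roc_points(truth, scores, thresholds):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples.

    truth[i] is True when node i is salient in the ground truth.
    scores[i] is the predicted salient-node count of node i's region;
    the node is predicted salient when that count meets the threshold.
    """
    positives = sum(1 for y in truth if y)
    negatives = len(truth) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present to compute rates")
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(truth, scores) if y and s >= t)
        fp = sum(1 for y, s in zip(truth, scores) if not y and s >= t)
        points.append((t, fp / negatives, tp / positives))
    return points

def best_under_fpr(points, max_fpr):
    """Point with the highest true positive rate within the allowed FPR."""
    allowed = [p for p in points if p[1] <= max_fpr]
    return max(allowed, key=lambda p: p[2]) if allowed else None

# Invented toy data: four salient nodes followed by four unknown nodes.
truth = [True, True, True, True, False, False, False, False]
scores = [30, 20, 10, 2, 12, 3, 1, 0]
points = roc_points(truth, scores, thresholds=[1, 5, 15, 25])
strict = best_under_fpr(points, max_fpr=0.0)
relaxed = best_under_fpr(points, max_fpr=0.25)
```

Relaxing the false positive allowance lets a lower threshold qualify, which raises the true positive rate, the same pattern the ROC curve describes.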


Figure 5.26 Can ROC Curve for Regionalized Random Forest Models
(x-axis: false positives, 0% to 100%; y-axis: true positives, 0% to 100%)

Next, we applied thresholds of 1 to 36 to the KNC prediction model for the can horizontal partitions. Then we tested each new set for accuracy compared to the original ground truth model on a node by node basis. Figure 5.27 shows the resulting plot.


Figure 5.27 Nodal Accuracies for Regionalized KNC Models
(x-axis: salient nodes per region threshold; y-axis: nodal accuracy %, 60 to 100; series: unknown, salient, and overall accuracy)

Figure 5.28 shows the corresponding ROC curve. If the allowable false positive rate is increased from 8% to 13%, the true positive rate can be increased from 65% to 90%.


Figure 5.28 Can ROC Curve for Regionalized KNC Models
(x-axis: false positives, 0% to 100%; y-axis: true positives, 0% to 100%)

Because the selection of a particular threshold is arbitrary, we chose a threshold of four for mapping our out of partition (OOP) results for random forests and KNC on can horizontal partitions. Figure 5.29 shows the ground truth vs. random forests regional maps for the inside can shell in each time step. For both ground truth and random forests, if the threshold of four salient nodes per region (adjusted for regions with fewer than 36 nodes) is met or exceeded, the region is colored red or dark for salient; otherwise it is colored yellow or light for unknown saliency. Figure 5.30 shows similar maps for ground truth vs. KNC.
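The text states that the threshold was adjusted for regions with fewer than 36 nodes but does not give the rule. The sketch below assumes one plausible rule, proportional scaling rounded up, purely for illustration; the function names and the rounding choice are assumptions, not the method used for Figures 5.29 and 5.30.

```python
FULL_REGION_NODES = 36  # size of most regions (from the text)

def adjusted_threshold(base_threshold, region_nodes):
    """Scale a threshold defined for 36-node regions to a smaller region.

    Assumption: proportional scaling, rounded up, never below one node.
    """
    scaled = -(-base_threshold * region_nodes // FULL_REGION_NODES)  # ceiling division
    return max(1, scaled)

def region_color(salient_count, region_nodes, base_threshold=4):
    """Map color for a region: red for salient, yellow for unknown."""
    if salient_count >= adjusted_threshold(base_threshold, region_nodes):
        return "red"
    return "yellow"

# Thresholds for the three region sizes when the base threshold is four.
thresholds_by_size = {n: adjusted_threshold(4, n) for n in (36, 24, 16)}
```

Under this assumed rule, a base threshold of four becomes three for the 24-node regions and two for the 16-node region, so a region with three salient nodes is red when it has 24 nodes but yellow when it has 36.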


Figure 5.29 Can Horizontal, Ground Truth vs. Random Forests Regional Maps (red = salient, yellow = unknown)
(panels: ground truth (GT) and random forests (RF) maps for time steps (TS) 0 to 43)


Figure 5.30 Can Horizontal, Ground Truth vs. KNC Regional Maps (red = salient, yellow = unknown)
(panels: ground truth (GT) and KNC maps for time steps (TS) 0 to 43)

We performed additional experiments that highlight the regional performance of random forests and KNC. We maintained the class assignments for exterior nodes, which are composed of all nodes in the inside and outside shells, and externally visible nodes of the two mostly hidden shells. We then assigned the unknown saliency class to almost all of the hidden interior nodes. Even though this does not represent a conventional class assignment, it does reveal how each ensemble type handles regions with different levels of saliency than those in previous experiments.

Figure 5.31 shows the ground truth assignment for all four can shells during time steps 0 to 21. Figure 5.32 continues with time steps 22 to 43. Figure 5.33 shows ground truth vs. KNC maps for the inside can shell, while Figure 5.34 displays maps for the hidden shell adjacent to the inside can shell. Figures 5.35 and 5.36 show the ground truth vs. random forests maps for the inside and adjacent can shells respectively.

KNC performs almost the same as it did in the previous vertical partition experiments, even though the saliency percentage for given regions was almost halved. In contrast, random forests more closely track the actual class assignment for individual nodes. The alternating vertical stripes represent predictions for different partitions. Random forests tend to track the average saliency of a region rather than recognizing a minimum threshold of saliency for a region.


Figure 5.31 Can Ground Truth Maps of Shells 4 (inside) to 1 (outside), Time 0 to 21 (red = salient, yellow = unknown)
(panels: ground truth maps GT4, GT3, GT2, and GT1 for time steps (TS) 0 to 21)


Figure 5.32 Can Ground Truth Maps of Shells 4 (inside) to 1 (outside), Time 22 to 43 (red = salient, yellow = unknown)
(panels: ground truth maps GT4, GT3, GT2, and GT1 for time steps (TS) 22 to 43)


Figure 5.33 Can Vertical Inside Shell, Ground Truth vs. KNC Maps (red = salient, yellow = unknown)
(panels: ground truth (GT) and KNC maps for time steps (TS) 0 to 43)


Figure 5.34 Can Vertical Shell 3, Ground Truth vs. KNC Maps (red = salient, yellow = unknown)
(panels: shell 3 ground truth (GT3) and KNC maps for time steps (TS) 0 to 43)


Figure 5.35 Can Vertical Inside Shell, Ground Truth vs. Random Forests Maps (red = salient, yellow = unknown)
(panels: ground truth (GT) and random forests (RF) maps for time steps (TS) 0 to 43)


Figure 5.36 Can Vertical Shell 3, Ground Truth vs. RF Maps (red = salient, yellow = unknown)
(panels: shell 3 ground truth (GT3) and random forests (RF) maps for time steps (TS) 0 to 43)


CHAPTER 6
SUMMARY AND FUTURE WORK

6.1 Summary

We have demonstrated that computer simulations with data spatially divided into partitions can successfully be processed by machine learning algorithms to produce classifiers of interesting or salient events. The partitioning often results in some partitions and/or time steps without interesting examples. This hurdle can largely be overcome by intelligent weighting of the votes of the ensemble. Random forests provide a fast machine learning technique with promising accuracy. k-nearest centroid (KNC) is even faster, but less accurate.

The face image experiments showed that KNC and random forests were largely successful at distinguishing between events that were not interesting and those that were either interesting or somewhat interesting. These tools were not nearly as successful at discriminating between classes with different levels of interest. The performance of random forests using partitioned data was usually competitive with its performance on baseline non-partitioned data. In contrast, KNC using partitioned data was usually considerably less accurate than KNN in the baseline experiments.

Although each can partition had examples of both classes (unknown and salient), many time steps in some horizontal partitions only had examples of one class, usually of unknown saliency. As a result of this more non-uniform distribution of data, all classifiers were less accurate overall with horizontal partitions than with vertical partitions. The imbalance in classes was reflected in the centroids used by KNC, since one or two centroids were created for each time step in a partition.

One advantage of using a single decision tree or random forests is the ability to process the weighted predictions using different thresholds in the OOP voting. This allows specific unknown vs. salient accuracies to be targeted.

An attempt to explore the accuracy of classifiers at the level of regions produced results similar to those observed at the nodal level. The ability to set different thresholds of saliency also provides an opportunity to post-process the output of the voting ensembles to improve the accuracy according to different weights for unknown and salient accuracies.

6.2 Future Work

For the face studies, it would be interesting to observe the effects of several methods of normalizing the data. Another area to explore is the weighting of features according to importance in an attempt to improve accuracy. Perhaps a regional accuracy study would be beneficial. A more advanced set of features could be designed in order to improve accuracy.

For the can studies, further variations in the number and arrangement of partitions could lead to new insights. If the number of partitions were increased, and each partition were further divided by groups of time steps, a test of the probabilistic majority vote method could be performed. If a different saliency marking scheme were used, the saliency class could be represented by far fewer examples. This would simulate the typical situation in which interesting events are rare. New can simulations with variations such as different plate angles and initial velocities would provide an illuminating test of current classifiers.

More research needs to be done on the measurement of regional accuracy. A more definitive difference in the weight of false positives and false negatives could be determined. A more advanced goal is to develop easily adjustable metrics that measure how well regions of different sizes and with different levels of saliency are classified.




Control number: 001709560
Identifier: E14-SFE0001296
OCLC number: (OCoLC)69371719
Local identifier: SFE0001296
Call number: QA76 (Online)
Author: Shoemaker, Larry.
Title: Ensembles for distributed data [electronic resource] / by Larry Shoemaker.
Published: [Tampa, Fla.] : University of South Florida, 2005.
Thesis: Thesis (M.S.C.S.)--University of South Florida, 2005.
Bibliography: Includes bibliographical references.
Format: Text (Electronic thesis) in PDF format.
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
Notes: Title from PDF of title page. Document formatted into pages; contains 82 pages.
Adviser: Dr. Lawrence O. Hall.
Keywords: Random forests. Nearest centroid. Exodus. ParaView. Region labeling.
Subject: Dissertations, Academic -- USF -- Computer Science -- Masters.
Host item: USF Electronic Theses and Dissertations.
URL: http://digital.lib.usf.edu/?e14.1296