USF Libraries
USF Digital Collections

System for identifying plankton from the sipper instrument platform


Material Information

Title:
System for identifying plankton from the sipper instrument platform
Physical Description:
Book
Language:
English
Creator:
Kramer, Kurt
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:

Subjects

Subjects / Keywords:
Marine Science
PICES
Machine Learning
Feature Selection
Support Vector Machine
SVM
Multi-Class
Pair-Wise
Dissertations, Academic -- Engineering Computer Science -- Masters -- USF   ( lcsh )
Genre:
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: Plankton imaging systems such as SIPPER produce a large quantity of data in the form of plankton images from a variety of classes. A system known as PICES was developed to quickly extract, classify and manage the millions of images produced from a single one-week research cruise. A new fast technique for parameter tuning and feature selection for Support Vector Machines using Wrappers was created. This technique allows for faster feature selection, while at the same time maintaining and sometimes improving classification accuracy. It also gives the user greater flexibility in the management of class contents in existing training libraries. Support vector machines are binary classifiers that can implement multi-class classifiers by creating a classifier for each possible combination of classes or for each class using a one class versus all strategy. Feature selection searches for a single set of features to be used by each of the binary classifiers. This ignores the fact that features that may be good discriminators for two particular classes might not do well for other class combinations. As a result, the feature selection process may not include these features in the common set to be used by all support vector machines. It is shown through experimentation that by selecting features for each binary class combination, overall classification accuracy can be improved and the time required for training a multi-class support vector machine can be reduced. Another benefit of this approach is that significantly less time is required for feature selection when additional classes are added to the training data. This is because the features selected for the existing class combinations are still valid, so that feature selection only needs to be run for the new combination added. This work resulted in a system called PICES, a GUI based user friendly system, which aids in the classification management of over 55 million images of plankton split amongst 180 classes. PICES embodies an improved means of performing Wrapper based feature selection that creates classifiers that train faster and are just as accurate and sometimes more accurate, while reducing the feature selection time.
Thesis:
Dissertation (PHD)--University of South Florida, 2010.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by Kurt Kramer.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains X pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E14-SFE0004805
usfldc handle - e14.4805
System ID:
SFS0028084:00001




Full Text

PAGE 1

System for Identifying Plankton from the SIPPER Instrument Platform

by

Kurt A. Kramer

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Computer Science and Engineering, College of Engineering, University of South Florida

Co-Major Professor: Dmitry Goldgof, Ph.D.
Co-Major Professor: Lawrence O. Hall, Ph.D.
Sudeep Sarkar, Ph.D.
Scott Samson, Ph.D.
Andrew Remsen, Ph.D.

Date of Approval: October 29, 2010

Keywords: Marine Science, PICES, Machine Learning, Feature Selection, Support Vector Machine, SVM, Multi-Class, Pair-Wise

Copyright 2010, Kurt A. Kramer

PAGE 2

DEDICATION

I would like to dedicate this dissertation to my father, Gustav K. Kramer. He believed in working hard, always doing the right thing, and being honest with himself and others at all times. These are qualities that, when practiced, will lead to a successful and fulfilling life, as no one else I know has achieved as well as my dad.

PAGE 3

ACKNOWLEDGMENTS

I wish to acknowledge my major professors, Dr. Dmitry Goldgof and Dr. Lawrence Hall, for their beyond infinite patience and hand holding these past several years. I wish to thank the other members of my committee who have helped guide me through the years: Dr. Scott Samson, Dr. Andrew Remsen, and Dr. Sudeep Sarkar. I want to thank the people from Marine Science who I had the great pleasure to work with over the years: Dr. Kendra Daly, Bill Flanery, and Gino Gonzalez. In general I want to also acknowledge all the faculty that I have worked with in the Computer Science and Engineering Department who helped make these last seven years very rewarding: Dr. Rangachar Kasturi, Dr. Srinivas Katkoori, Edward Kellner, Dr. Miguel Labrador, Dr. Rafael Perez, Dr. Les Piegl, Dr. Nagarajan Ranganathan, Dr. Ralph Tindell, and Dr. Rahul Tripathi. I especially want to thank the staff who have had to deal with all the administrative headaches that I have created over the years: Dee Allen, Yvette Blanchard, Catherine Burton, Theresa Collins, John Giannoni, and Daniel Prieto. Most importantly, let's not forget the plankton that made my dissertation possible.

PAGE 4

TABLE OF CONTENTS

LIST OF TABLES  iii
LIST OF FIGURES  vi
ABSTRACT  viii
CHAPTER 1: INTRODUCTION  1
CHAPTER 2: BACKGROUND  8
  2.1 SIPPER Project  8
    2.1.1 Tow Platform  8
    2.1.2 PICES  10
  2.2 Support Vector Machine  20
    2.2.1 Support Vector Machine Introduction  20
    2.2.2 Description  20
    2.2.3 Assigning Probability Values  24
    2.2.4 Probability Parameter Adjustment  25
    2.2.5 Multi-Class Support Vector Machines  25
  2.3 Feature Selection  27
  2.4 Datasets  28
    2.4.1 Plankton Datasets  29
    2.4.2 Forest Cover Dataset  44
    2.4.3 Letter Dataset  45
    2.4.4 Sat Image  47
  2.5 Data Normalization  47
  2.6 Significance Testing  48
CHAPTER 3: METHODS  50
  3.1 Introduction  50
  3.2 General Organization of Parameter Tuning and Feature Selection  51
  3.3 SVM Tuning  55
    3.3.1 Correctness of Probability Prediction (CPP)  55
    3.3.2 Criteria  55
    3.3.3 General Flow  57
    3.3.4 Detailed Implementation  58
  3.4 Feature Selection  62
  3.5 Merge the N Best Feature Sets  66
  3.6 Experimental Procedure  66
  3.7 Unbalanced Datasets  67
  3.8 Adding a Class  69
  3.9 Multi-Processor Implementation  69

PAGE 5

CHAPTER 4: RESULTS  72
  4.1 Results Showing Accuracy and Time Improvements  72
  4.2 Feature Selection Time Analysis  74
  4.3 Classification Accuracy and Training Time Improvements  83
  4.4 Unbalanced Datasets  90
  4.5 Adding a Class  92
CHAPTER 5: DISCUSSION  95
REFERENCES  99
APPENDICES  105
  Appendix A Plankton Images  106
  Appendix B SIPPER Raw Data Format  118
  Appendix C Glossary  121

PAGE 6

LIST OF TABLES

Table 1 Dataset Descriptions.  29
Table 2 WFS Dataset Class Distribution.  30
Table 3 ETP2008 Class Distribution.  33
Table 4 Nine-Class Plankton Class Distribution.  34
Table 5 Plankton Feature Categories.  34
Table 6 Common Variables/Functions Used in Feature Calculation.  35
Table 7 Eight Basic Moment Features Used in the Three Different Moment Groups.  37
Table 8 Texture Features Variables and Functions.  38
Table 9 Lower and Upper Frequency Bounds for Texture Features.  38
Table 10 Intensity Regions.  39
Table 11 Upper and Lower Contour Frequency Ranges.  41
Table 12 Contour Variables and Functions.  41
Table 13 Hybrid Contour Features.  42
Table 14 List of Plankton Features.  42
Table 15 Forest Cover List of Features.  45
Table 16 Forest Cover Class Breakdown.  45
Table 17 Letter Dataset Class Distribution.  46
Table 18 Letter Dataset Feature Description.  46
Table 19 Sat Image Class Distribution.  47
Table 20 Comparison of Feature Reduction Parm Tuning Before vs After F/S.  51

PAGE 7

Table 21 List of Fields Maintained for Each Job.  52
Table 22 SVM Parameter Tuning Steps.  59
Table 23 Feature Selection Variables and Functions.  63
Table 24 Feature Selection Steps.  64
Table 25 Nine-Class Plankton; Feature Selection and Parameter Tuning Times.  75
Table 26 WFS Feature Selection and Parameter Tuning Times.  75
Table 27 ETP2008 Station 1, Feature Selection and Parameter Tuning Times.  76
Table 28 Forest Cover Dataset; 300/Class; Feature Selection and Parameter Tuning Times.  76
Table 29 Forest Cover Dataset; 1,500/Class; Feature Selection and Parameter Tuning Times.  77
Table 30 Letter Dataset Feature Selection and Parameter Tuning Times.  77
Table 31 Sat Image Dataset Feature Selection and Parameter Tuning Times.  77
Table 32 Summary of CPU and Longest Path Times Required for Processing.  79
Table 33 Number of Binary SVMs Built Performing Parameter Search and Feature Selection.  82
Table 34 Nine-Class Plankton; Most Accurate Set of Features.  84
Table 35 WFS; Most Accurate Set of Features.  85
Table 36 ETP2008 Station 1; Most Accurate Set of Features.  85
Table 37 Forest Cover Dataset; 300/Class; Most Accurate Set of Features.  86
Table 38 Forest Cover Dataset; 1,500/Class; Most Accurate Set of Features.  87
Table 39 Letter Dataset; Most Accurate Set of Features.  88
Table 40 Sat Image; Most Accurate Set of Features.  89
Table 41 Sat Image Dataset as Reported by [19], a Pairwise F/S Paper.  89
Table 42 Summary of Results.  90
Table 43 WFS BFS-Produced Classifier Where Minority Classes are Compensated.  91
Table 44 ETP2008 BFS-Produced Classifier Where Minority Classes are Compensated.  92
Table 45 Nine-Class Plankton Dataset with Only 8 Classes.  93

PAGE 8

Table 46 ETP2008 Adding One Class at a Time.  94
Table 47 ETP2008 Station 1 Support Vector Comparison.  97
Table B1 SIPPER 3 Grayscale Decoding Values.  118
Table B2 SIPPER File Sensor Number Descriptions.  119
Table B3 Data Payload Table.  120

PAGE 9

LIST OF FIGURES

Figure 1 SIPPER Underwater Imaging Platform.  9
Figure 2 SIPPER Sampling Tube and Light Source Path.  9
Figure 3 SIPPER Canister Including Light Source, Camera, and Data Control Board.  10
Figure 4 SIPPER Light Source.  10
Figure 5 SIPPER Line Scan Camera.  10
Figure 6 System Flow Chart.  11
Figure 7 Image Extraction.  14
Figure 8 PICES Commander.  16
Figure 9 Image Viewer.  17
Figure 10 Classification Breakdown.  19
Figure 11 Random Harvesting Algorithm.  31
Figure 12 2D Fourier Transform of Image, Frequency Ranges Indicated.  38
Figure 13 Contour Frequency Domain.  40
Figure 14 Basic Process Flow.  54
Figure 15 Classification Accuracy Response to C and γ on Nine-Class Plankton Dataset.  56
Figure 16 Processing Time Response to C and γ on Nine-Class Plankton Dataset.  57
Figure 17 Classification Accuracy Response to C and γ on the Forest Cover Dataset with 300 Examples per Class.  57
Figure 18 Experiment Procedure Steps.  67
Figure 19 Status File Update Procedure.  71

PAGE 10

Figure 20 Nine-Class Plankton Feature Combinations Evaluated to Reach a Given Feature Count.  80
Figure 21 Nine-Class Plankton CPU Seconds Consumed to Reach a Given Feature Count.  81
Figure 22 WFS Feature Combinations Evaluated to Reach a Given Feature Count.  81
Figure 23 WFS CPU Seconds Consumed vs Feature Count.  82
Figure A1 Images from ETP2008 Dataset.  106

PAGE 11

ABSTRACT

Plankton imaging systems such as SIPPER produce a large quantity of data in the form of plankton images from a variety of classes. A system known as PICES was developed to quickly extract, classify and manage the millions of images produced from a single one-week research cruise. A new fast technique for parameter tuning and feature selection for Support Vector Machines using Wrappers was created. This technique allows for faster feature selection, while at the same time maintaining and sometimes improving classification accuracy. It also gives the user greater flexibility in the management of class contents in existing training libraries. Support vector machines are binary classifiers that can implement multi-class classifiers by creating a classifier for each possible combination of classes or for each class using a one class versus all strategy. Feature selection searches for a single set of features to be used by each of the binary classifiers. This ignores the fact that features that may be good discriminators for two particular classes might not do well for other class combinations. As a result, the feature selection process may not include these features in the common set to be used by all support vector machines. It is shown through experimentation that by selecting features for each binary class combination, overall classification accuracy can be improved and the time required for training a multi-class support vector machine can be reduced. Another benefit of this approach is that significantly less time is required for feature selection when additional classes are added to the training data. This is because the features selected for the existing class combinations are still valid, so that feature selection only needs to be run for the new combinations added. This work resulted in a system called PICES, a GUI-based, user-friendly system, which aids in the classification management of over 55 million images of plankton split amongst 180

PAGE 12

classes. PICES embodies an improved means of performing Wrapper-based feature selection that creates classifiers that train faster and are just as accurate, and sometimes more accurate, while reducing the feature selection time.

PAGE 13

CHAPTER 1: INTRODUCTION

This dissertation centers on work spawned within the SIPPER (Shadow Imaging Particle Profiler and Evaluation Recorder) project. My contribution to this project was the application of image processing and machine learning techniques to a marine science problem: the timely extraction and identification of the millions of images per day of deployment scanned by the SIPPER underwater sensor platform. The majority of this software lies in the image processing and machine learning disciplines. Applications were developed to extract individual plankton [1] images along with their discriminating features, classify images into user-specified classes, assist in training library development through active learning [2, 3] techniques, perform feature selection and parameter tuning to improve both classification accuracy and processing times, and manage database functions to facilitate the processing of very large image datasets (over 50 million individual images). Small plants and animals collectively known as plankton are the foundation of most oceanic food webs. Almost all commercially important fish and shrimp species begin their lives as plankton and/or feed on plankton. Consequently, determining plankton populations and their diversity is an important means of determining the current health of the oceans. This leads to the need to efficiently collect statistics on plankton populations such as their distribution, interaction amongst different types, and related environmental conditions. Traditional methods of collecting plankton in nets are labor intensive and do not provide spatial distribution or environmental parameters such as depth, temperature, or salinity. In some studies [4, 5] nets have been shown to undercount the actual number of plankton particles. To overcome these limitations and to more efficiently collect plankton data, imaging systems [6] such as the Video Plankton Recorder

PAGE 14

(VPR) [7, 8, 9], the Shadow Imaging Particle Profiler and Evaluation Recorder (SIPPER) [10], and HOLOMAR [11] have been developed [12]. Current automated plankton classification systems are achieving 70% to 80% classification accuracy for 10 to 30 classes [13]. The Video Plankton Recorder (VPR) [7, 8, 9] is an underwater in situ imaging platform based on a video recorder. It can image particles as small as 50 microns to a couple of centimeters in size. The current version is the Digital Video Plankton Recorder II [8]. It utilizes a progressive scan monochrome 1008 by 1018 pixel CCD camera outputting 30 frames per second. A custom software program called VPRdeck extracts regions of interest (ROI) based on user-selectable thresholds of brightness, focus, and object size. After data collection a subset of images is extracted for manual labeling to be used as a training library. One of two learning algorithms, a neural net or a support vector machine, is then used to classify all the unlabeled data. The authors of [8] report overall classification accuracy on unseen data, for 7 to 10 classes/taxa, ranging from 60% to 90%. Classification occurs separately from ROI extraction and performs at the rate of 6 ROI per second on a 2 GHz Pentium 4. HOLOMAR is a holographic based system that takes 3-D imagery of plankton in situ with the ability to distinguish plankton particles as small as 5 microns to as large as a few millimeters. Classification software is based on a neural network using Hu-based moments as features. ZooScan [14] is an integrated system for analyzing preserved plankton samples. It consists of image extraction and classification software using a scanner for capturing images. It processes tens of thousands of images. There are several different learning algorithms supported. Unlike the previous systems mentioned, it does not sample in situ. SIPPER is a continuous scanning sensor capturing images that are 10 cm in width and continuous in length. All plankton particles that enter the sampling tube are imaged. A single 6-hour deployment will result in half a million to a million imaged plankton particles. A need to quickly extract, classify, manage and analyze these discrete plankton images is important for the success of the instrument platform. A database management system is required to manage the large amount of data generated by this sensor platform. The abilities that a database system

PAGE 15

provides, such as the quick retrieval and organization of data by multiple parameters (cruise, deployment, depth, salinity, temperature, taxa/classes, date-time, etc.), result in more efficient and timely processing of collected data. To respond to the needs described above the Plankton Imaging Classification Extraction System (PICES) was developed. It incorporates the image extraction, classification, active learning, feature selection, parameter tuning, and database management functions needed to manage the large amount of data collected by the SIPPER platform. The PICES extraction function is uniquely designed to process the continuously scanned imagery data generated by SIPPER, extracting individual plankton images and associated embedded environmental data. Feature vectors are computed for each image, which is then automatically classified, using training libraries maintained by the user, into user-defined classes. The classified images with their feature vectors and environmental data are then inserted into a database. Overall classification accuracies from 75% to 85% are achieved with classifiers consisting of 30 to 55 classes. When classes are weighted equally, accuracy ranges from 69% to 75%. The PICES database allows for easy management of the data, providing facilities to view data by various parameters such as cruise, deployment, date-time, depth, predicted class, and others. The user can manually classify images and update training libraries, improving future classification performance. At any time images can be reclassified with the improved training libraries. Feature selection methods can be divided into at least three different types: Filter, Embedded, and Wrappers. Filter methods work with the feature data itself without any knowledge of the learning algorithm to be used. These methods are fast, but because they are ignorant of any bias that a particular learning algorithm may have, they may produce feature subsets that are not as accurate as those produced by the other two groups. Embedded methods such as Recursive Feature Elimination (RFE) [16] perform feature selection as part of the training process of the learning algorithm. These methods are not ignorant of the learning algorithm's bias and yet are relatively fast. Wrappers [17] use the learning algorithm as a black box, often with cross validation as a heuristic, to drive a search through feature space towards finding a good set

PAGE 16

of features. In all these methods a single global set of features is being selected for all classes. In some more recent papers [18, 19] authors are starting to look at the possibility of determining feature subsets and parameter settings that are optimal for a single class or combination. In some cases they still select a global set of features for all classes but try to at least consider the consequences of a given subset of features by binary class combinations [20, 21]. There are applications where a relatively large number of classes are involved and additional classes are added and subtracted as the situation may warrant. For example, plankton images collected by SIPPER [10, 15, 22] may consist of 20 to 50 classes of interest out of thousands of possible classes. As data is collected and conditions change, the user requires the addition and subtraction of classes to an existing classifier. If the user wishes to keep the features and Support Vector Machine (SVM) parameters tuned, the feature selection process needs to be run again for each new set of combinations of classes, which is needed to get the best performance. With the Wrappers approach this is a very lengthy procedure. Consider the case where there is a classifier consisting of 20 classes that has already been tuned and had features selected, and the user would like to add one more class. The parameter tuning and feature selection process has already been done on the 190 binary class combinations that comprise the 20-class classifier. If these features and parameters were specified for each binary class combination separately, then for all practical purposes the procedure needs to be performed for only the 20 additional binary class combinations being created. This is preferable to processing the 210 binary class combinations that need to be evaluated when a single set of features and parameters is used, as done otherwise. The objects in these classes may vary in different ways; some differ by shape while others have similar shapes but vary in texture. As a result, features that do a good job of separating two particular classes may be ineffective in separating two other classes, thus reducing classification accuracy. This leads to the idea of selecting features by individual binary class combinations. For example, for 3 classes A, B, and C, feature selection would be performed separately for each binary combination of classes AB, AC, and BC, rather than for ABC. It will be

PAGE 17

shown that feature selection for binary combinations can result in a net reduction of features, which can have the benefits of improved classification accuracy, reduced training times, and faster feature selection times. In [19] the authors implemented a Wrappers-based feature selection approach where they specialized features on pair-wise combinations; that is, they select a different set of features for each binary combination of classes. Their premise is that there may be a subset of features that can be descriptive for a particular binary class combination but not for the global case. As a result, a global-based feature selection schema would not select these features. They apply this logic using a Nearest Neighbors (NN) and a Bayes learner on four different datasets. They show a reduction in error rates and also a reduction in the mean number of features. Feres de Souza, et al. [18] experimented with tuning the Support Vector Machine (SVM) cost parameter C and the Radial Basis Function (RBF) kernel parameter γ for binary class combinations. Their results showed that they obtained error rates that were comparable to using one set of parameters for all binary classifiers. Chappelle and Keerthi [21] concentrated on selecting one global set of features for all binary classes. They implemented a new embedded method that utilized scaling factors with the goal of finding the smallest number of features that worked for all classes. Their results showed that they can produce a classifier requiring fewer features than a traditional embedded method such as recursive feature elimination (RFE) [23] without losing classification accuracy. Chen, et al. [20] looked at feature selection for multi-class problems by applying an RFE algorithm across all binary classifiers. RFE selects a feature for removal by determining the impact each feature makes on the SVM margin. This is accomplished by iteratively removing one feature and computing the difference in margins between all features and the one feature removed. The feature that had the smallest impact on the margin is then selected for removal. In the multi-class case, the margin differences for all binary classifiers would be totaled up and the feature whose sum was the smallest selected. The authors proposed to select the smallest maximum margin difference across all classifiers. In experimentation they showed that they

PAGE 18

could maintain a higher classification accuracy as they reduced the number of features, using their method over the traditional method. Support Vector Machines (SVM) [24] often perform multi-class classification by building a SVM for each combination of classes. If there are 3 classes A, B, and C, for example, three SVMs would be built: AB, AC, and BC. Unknown examples would be voted on by all three SVMs and the class that wins the greatest number of votes would be selected as the prediction. The same SVM parameters and features would be selected for use by all three classifiers, where Wrappers [17] might be used for feature selection and a grid search for SVM parameters. It is proposed in this dissertation that the feature selection and SVM parameters should be selected separately for each binary class combination. There are several benefits to this approach. First, the feature selection process is faster (in some cases it will be shown to be 2.5 times faster). Second, the addition of new classes to an already tuned classifier only requires that the work needed for the additional binary class combinations be created rather than for all class combinations. Third, the removal of a class would require no additional processing at all. Fourth, the resultant classifiers would consist of a fewer number of features, resulting in faster training times. Finally, in some cases it was shown that an increase in classification accuracy can be achieved. Portions of this work have appeared in the following publications: "Fast support vector machines for continuous data" [25] in IEEE Transactions on Systems, Man and Cybernetics, Part B, 2009; "Active Learning to Recognize Multiple Types of Plankton" [2] in Journal of Machine Learning Research; "Recognizing Plankton Images from the Shadow Image Particle Profiling Evaluation Recorder" [26] in IEEE Transactions on Systems, Man and Cybernetics, Part B, 2004; and "Active Learning to Recognize Multiple Types of Plankton" [3] in International Conference on Pattern Recognition (ICPR), 2004. The first major contribution of this research involves the application of image processing and machine learning techniques for large multi-class data sets. In particular, applications to the marine science domain are shown. This includes the development of the system called PICES,

PAGE 19

which manages feature extraction, classification of plankton images, development of training libraries through active learning techniques, as well as the management of the millions of images that are acquired during a single-day deployment of the SIPPER underwater sensor platform. The second major contribution is the exploration of a pair-wise class feature selection technique for the Support Vector Machine to speed up feature selection, allow for the incremental addition of classes, and produce classifiers that train faster as well as, in some cases, improve classification accuracy. This is the first work that describes the plankton classification system known as PICES (Plankton Imaging Classification Extraction Software) used by the SIPPER underwater imaging platform and a new feature selection and parameter tuning procedure that selects parameters and features by individual binary class combinations. PICES is a system of applications that manages the extraction and classification of plankton images, the management of a database of the same images, the maintenance and tuning of training libraries, and the running of related reporting facilities. Its purpose is to make parameter tuning and feature selection faster and more flexible while at the same time maintaining and/or improving classification accuracy, using a feature selection procedure that tunes SVM parameters and selects features by binary class combinations (BFS). As a result of the PICES system the user has the ability to efficiently manage the extraction and classification of the millions of plankton images that SIPPER can image during a single-day deployment. This dissertation is organized into five chapters. Chapter 2 provides background on the SIPPER project and underwater sensor platform, support vector machines (SVM), and the Wrappers feature selection method. Chapter 3 provides the methods used for experimentation, such as the procedure for performing feature selection by binary class combinations. Chapter 4 describes the experiments and results: first the feature selection timing results comparing the new BFS with the MFS procedures, followed by the impact on classification accuracy and training time. Chapter 5 is a discussion of the results and the conclusion.
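To make the pairwise bookkeeping concrete, the short sketch below (illustrative only, with hypothetical class names; not code from the dissertation) counts the binary class combinations a one versus one classifier needs and shows why, with per-pair feature selection, adding one class to a 20-class classifier only creates 20 new pairs to tune.

```python
from itertools import combinations

def pairwise_combinations(classes):
    """All binary class pairs a one-vs-one multi-class SVM needs."""
    return set(combinations(sorted(classes), 2))

existing = [f"class_{i:02d}" for i in range(20)]    # hypothetical 20-class library
pairs_before = pairwise_combinations(existing)       # 20 * 19 / 2 = 190 pairs

pairs_after = pairwise_combinations(existing + ["class_new"])   # 21 * 20 / 2 = 210 pairs

# With per-pair (BFS) tuning, only the pairs involving the new class need work;
# the 190 existing pairs keep their previously selected features and parameters.
new_pairs = pairs_after - pairs_before
print(len(pairs_before), len(pairs_after), len(new_pairs))   # 190 210 20
```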

PAGE 20

CHAPTER 2: BACKGROUND

2.1 SIPPER Project

The SIPPER project is a collaboration between the University of South Florida College of Marine Science in St. Petersburg and the Department of Computer Science and Engineering in the College of Engineering in Tampa. It involves several disciplines: Marine Science, Mechanical Engineering, Electrical Engineering, and Computer Science. SIPPER is an underwater imaging platform that was designed to collect images of plankton in situ along with corresponding instrumentation data. The project was started in 1998 under the direction of Dr. Thomas L. Hopkins and Larry Langebrake. It has evolved through three major versions and has become a mature platform that is in active use by both the USF College of Marine Science and the National Oceanic and Atmospheric Administration (NOAA). There are two main components to the SIPPER project: the tow platform, called SIPPER, and the software that is used to operate it and process the collected data, called PICES.

2.1.1 Tow Platform

The SIPPER tow platform was developed by the Center for Ocean Technology (COT) at the USF College of Marine Science. Figure 1 shows an image of the platform in the COT workshop being prepared for a research cruise. The picture is taken from in front of the instrument, with the sampling tube sticking out. SIPPER is towed through the water behind a research vessel with a cable attached to the bridle. As it is being towed, water flows through the sampling tube, in which any particles in the water will be imaged. Figure 2 is a diagram showing the relative position of the components and the path of the light from the light source to the camera. This

PAGE 21

light source is collimated by use of a bowl-shaped mirror. As particles flow through the sampling tube they create a silhouette image by blocking the path of the light. The image that is produced is a 3-bit grayscale image.

Figure 1 SIPPER Underwater Imaging Platform.

Figure 2 SIPPER Sampling Tube and Light Source Path.

PAGE 22

Figure 3 is a picture that contains the light source, line scan camera, and the SIPPER data control board. The pressure vessels covering the light source, data control board, and line scan camera have been removed. There is what appears to be a silver bowl at one end of the assemblage; this is what collimates the light. Figures 4 and 5 are close-ups of the light source and line scan camera.

Figure 3 SIPPER Canister Including Light Source, Camera, and Data Control Board.

Figure 4 SIPPER Light Source.

Figure 5 SIPPER Line Scan Camera.

2.1.2 PICES

To manage the large volume of images generated by SIPPER, a system was developed that would manage all image extraction, feature calculation, image classification, and database

PAGE 23

management. All the programs and the database became collectively known as Plankton Imaging Classification Software (PICES). This system consists of several applications and a MySQL-based database [27] [28]. There is one central application called PICES Commander, which provides a friendly GUI-based interface to the user. From this application users can perform various functions that allow them to manage all the images extracted from SIPPER files. These functions include validation of misclassified images, active learning, browsing images by various criteria, maintenance of class structure, updating of training libraries, classification of images, and various reporting functions. Appendix C contains a glossary of terms that are related to the PICES system and the rest of this work.

Figure 6 System Flow Chart.

As Figure 6 shows, the first function PICES performs is to Decode SIPPER Data. SIPPER raw data consists of data compressed using a simple run-length algorithm. It is organized into two-byte data records that are either image data or instrument data. The decoder will decompress this data and create two streams of data: one consisting of 4096-pixel scan lines with 3-bit grayscale, and one of instrument data as produced by the original instruments such as

PAGE 24

a Conductivity, Temperature, and Depth sensor (CTD) [29]. The instrument data will be human-readable text data produced by the source instrument. Appendix B contains a description of the raw SIPPER data as it is encoded by the SIPPER data control board. The Extract Frame function groups scan lines together into logical frames so that individual plankton images are not split across two frames. These frames can be between 1 and 4096 scan lines in length. This is accomplished by taking the next 4096 scan lines and working backwards from the last scan line until a break point consisting of three blank scan lines is detected. If no such break point is found, then the scan line with the least number of foreground pixels is used as the division point between two frames. All scan lines past the break point are added into the next frame. As part of this step a filter is applied that removes artifact lines that can be caused by the accumulation of particles partially blocking the camera light source. Given a frame as input, the Extract Individual Images function identifies individual plankton images by performing a connected component analysis. Two pixels are considered connected if they are both foreground and within three pixels of each other. In the Extract Features function, a feature vector is calculated for each plankton image. This vector consists of 88 features for each plankton image. These features can be divided into several groups: Size, Moment, Morphological, Contour, Textural, and instrument data; they were developed and refined over time. The original features were developed for SIPPER I [26], which provided binary image data. These included Moment [30] and the morphological-based features. SIPPER II introduced 3-bit grayscale with higher resolution, allowing for the development and implementation of texture and shape-based features such as Fourier descriptors [31]. SIPPER III, the current version, introduced embedded instrumentation data; that is, the instrument data is embedded with the image data (see Table B3). This gives the ability to record environmental parameters, such as temperature, salinity, depth, pressure, and flow rate, with each individual plankton image. The Classify Images function classifies unknown images into user-defined classes. The user provides a library of classes which consists of user-labeled plankton images.
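The following is a minimal sketch of the Extract Frame break-point rule described above (an illustrative helper, not the PICES implementation): scan backwards for a run of three blank scan lines, and fall back to the scan line with the fewest foreground pixels.

```python
import numpy as np

def find_frame_break(scan_lines, blank_run=3):
    """Pick a frame division point in a block of scan lines (rows of pixels).

    Illustrative sketch of the Extract Frame rule: walk backwards from the last
    scan line looking for a run of `blank_run` blank lines; if none exists,
    fall back to the scan line with the fewest foreground (non-zero) pixels.
    """
    foreground = (np.asarray(scan_lines) > 0).sum(axis=1)  # foreground pixels per line

    run = 0
    for i in range(len(foreground) - 1, -1, -1):            # scan backwards
        run = run + 1 if foreground[i] == 0 else 0
        if run == blank_run:
            return i                                         # break at the blank run
    return int(np.argmin(foreground))                        # fallback: sparsest line

# Scan lines at and after the returned index would be carried into the next frame.
```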

PAGE 25

For each class, the user provides a set of examples from which the classifier will learn. The classifier uses a machine learning algorithm called a support vector machine (SVM) [32]. The SVM learns from the user-labeled images how to recognize the class to which the unlabeled images will be assigned. The SVM accomplishes this by locating hyperplanes separating the different classes from each other. For example, if the user provides examples for 4 different plankton classes called A, B, C and D, the SVM will find 6 different hyperplanes that will separate each possible class combination (AB, AC, AD, BC, BD, and CD). Unknown examples/images can now be put into the same space as the hyperplanes and, depending on what side of the surfaces they fall, the most likely class that they belong to can be decided. In effect each example will participate in 3 different votes and the class having the most votes wins. In the case of ties the class with the highest probability will be selected. The probability is a confidence value assigned by the classifier; the larger the probability, the more likely an example belongs to the predicted class. Figure 7 shows the SIPPER image extraction function at work. This is the first application to be used after data has been retrieved from SIPPER. It performs all the functions described above: decoding SIPPER data, extracting frames, extracting individual plankton images, computing a feature vector for each plankton image, classifying images using the feature vectors, and storing them in a MySQL-based database. Once images are classified they are added to the PICES Image Database, where they can be retrieved and from which reports can be created. Data that is recorded with each image includes the two most likely classes with their related probabilities, the location in the source SIPPER file, and instrumentation data (temperature, salinity, depth, etc.). The probabilities allow for the implementation of active learning [2, 3].
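A minimal sketch of the voting scheme just described, assuming a hypothetical `classifiers` mapping from each class pair to its trained binary SVM (illustrative only, not the PICES code):

```python
from itertools import combinations

def one_vs_one_predict(x, classes, classifiers):
    """Predict a class by pairwise voting with a probability tie-break.

    classifiers[(a, b)] is assumed to be a callable returning (winner, prob)
    for the binary SVM trained on classes a and b; this interface is
    hypothetical and only illustrates the voting scheme described above.
    """
    votes = {c: 0 for c in classes}
    best_prob = {c: 0.0 for c in classes}
    for a, b in combinations(classes, 2):
        winner, prob = classifiers[(a, b)](x)
        votes[winner] += 1
        best_prob[winner] = max(best_prob[winner], prob)
    # Most votes wins; ties fall back to the highest probability seen.
    return max(classes, key=lambda c: (votes[c], best_prob[c]))
```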

PAGE 26

Figure 7 Image Extraction.

Figure 8 shows the main screen of PICES Commander. From this screen the user can sort and browse images by several different criteria (size, depth, class, cruise, station, deployment, prediction probability) and validate the class to which they belong. Other functions include reclassifying selected images using a classifier, performing cross validations, randomly harvesting images, exporting images, extracting feature data for parameter tuning and feature selection, and general maintenance functions. The final function of the PICES process is active learning. This is the process of improving the training library by locating examples that have a high likelihood of improving classification accuracy and asking a user/expert to manually classify them. This is accomplished

PAGE 27

by selecting the examples that the classifier has the hardest time distinguishing between. Specifically, it uses the probabilities/confidence values that are assigned to each class for each prediction. The examples that have the smallest probability difference between the two most likely classes are selected. These images are then added to the training library, where they will be used for future classifications after an expert classifies them. This is shown in Figure 8, where each panel shows a thumbnail view of a plankton image with some related data, sorted by this probability difference. The images at the top are the ones most likely to improve classification accuracy.
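A sketch of this active-learning selection step, assuming the per-image prediction probabilities have already been pulled from the database (hypothetical tuple layout, for illustration only):

```python
def select_for_labeling(predictions, budget=50):
    """Rank unlabeled images for expert review, most ambiguous first.

    predictions is assumed to be a list of (image_id, prob_best, prob_second)
    tuples taken from the stored classifier output. Images whose two most
    likely classes have the smallest probability difference come first, as in
    the active-learning step described above.
    """
    ranked = sorted(predictions, key=lambda p: p[1] - p[2])  # smallest gap first
    return [image_id for image_id, _, _ in ranked[:budget]]

# Example: an image scored 0.41 vs 0.39 outranks one scored 0.90 vs 0.05.
```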

PAGE 28

Figure 8 PICES Commander.

PAGE 29

Figure 9 shows a single plankton image displayed in the Image Viewer. From here the user can see the actual size of the image; the grid lines indicate millimeters. The top two results from the two currently active classifiers are displayed. In the case of Figure 9 both classifiers are using the same training library but different parameters. The top one is using the MFS approach and the bottom one the BFS approach. Both classifiers display the probabilities and votes assigned to the two most likely classes. In this case both classifiers are correct. If the classification were incorrect, the user would have the option to validate the true class and update the training library by pressing the correct class in the list below the image.

Figure 9 Image Viewer.

PAGE 30

Figure 10 shows the prediction breakdown of the PICES classifier, the Support Vector Machine (SVM). The PICES implementation of the SVM creates multiple binary class SVMs; that is, for each pair of classes there is a binary class SVM. The breakdown shown is for a classifier built from pair-wise binary class SVM classifiers. Each one of these SVM classifiers makes a prediction between the two possible classes and assigns a probability to each class. Each one of these predictions is considered a vote, where the class that gets a probability greater than 50% wins the vote. Each class will participate in 54 votes. A final probability is assigned to each class by performing a product sum for each class and normalizing the probabilities of all classes such that the sum of all probabilities is 100%. There are two panes in Figure 10. The top pane shows a summary, by class, of the votes and probabilities assigned to each class. The bottom pane shows the results for the individual binary class SVM classifiers where the highlighted class in the top pane is one of the classes. The highlighted row in the bottom pane represents the results of the classifier between the copepod class and the gelatinous tunicate doliolid class; the probability displayed on that row, 13.68%, indicates that for that particular binary class classifier the copepod class received 13.68% and the gelatinous tunicate doliolid class received 86.32%. At the very bottom of Figure 10 is an information row that displays the probability breakdown of the two currently highlighted classes.

PAGE 31

Figure 10 Classification Breakdown.

PAGE 32

2.2 Support Vector Machine

2.2.1 Support Vector Machine Introduction

A Support Vector Machine (SVM) [32] classifier was extended to provide a confidence value, as explained below. The SVM is a two-class classifier that uses training examples to find a hyperplane that separates the two classes. Since in many cases there is no linear solution that will separate the two classes in feature space, the SVM projects the data into a higher dimension through the use of a kernel function. In this higher dimensional space locating a hyperplane becomes a much more tractable problem. A SVM can easily be extended to handle multiple classes. For the PICES research, a one vs. one schema was utilized, where a SVM is built for every possible two-class combination. The final classification can be decided by either a popular vote or the highest probability. The specific SVM used was derived from [33]; the probability model was added separately from [26].

2.2.2 Description

For the training data $T$ in equation (1), $\mathbf{x}_i$ represents a single feature vector of $d$ features and $y_i$ is its corresponding label, either $-1$ or $+1$. The training data is mapped in feature space and a hyperplane is found that separates the two classes ($-1$ and $+1$). Equation (2) defines the hyperplane. Assuming that the data is linearly separable, there will be more than one possible hyperplane.

$T = \{(\mathbf{x}_i, y_i) \mid \mathbf{x}_i \in \mathbb{R}^d,\ y_i \in \{-1, +1\}\}, \quad i = 1, \dots, n$    (1)

$\mathbf{w} \cdot \mathbf{x} - b = 0$    (2)

where $\mathbf{w}$ is a normal vector that is perpendicular to the hyperplane and $\frac{b}{\|\mathbf{w}\|}$ is the distance from the origin to the hyperplane along the vector $\mathbf{w}$.

PAGE 33

By adding the requirement that the hyperplane must maximize the margin between the two classes, the possible hyperplanes are reduced to just one. This constraint can be thought of as identifying two more hyperplanes parallel to the separating hyperplane, where one hyperplane is in contact with the nearest $-1$ example and the other is in contact with the nearest $+1$ example. Equations (3) and (4) represent these two hyperplanes respectively.

$\mathbf{w} \cdot \mathbf{x} - b = +1$    (3)

$\mathbf{w} \cdot \mathbf{x} - b = -1$    (4)

The distance between the two hyperplanes is expressed in equation (5), with the goal of maximizing $\frac{2}{\|\mathbf{w}\|}$ or, conversely, minimizing $\|\mathbf{w}\|$. The optimization problem is then updated to include the constraint shown in (6).

$\text{margin} = \frac{2}{\|\mathbf{w}\|}$    (5)

$y_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1, \quad i = 1, \dots, n$    (6)

To increase the classification ability of SVMs, feature data is mapped into a higher dimensional space with $\phi(\mathbf{x})$, where inner products are calculated using a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ [34], which is used to avoid performing inner products in the higher dimensional space. Equations (7) and (8) represent the optimization problem to be solved in its primal form. Because not all training data can be separated, a second term, $C \sum_i \xi_i$, is added. This term allows for some training examples to be on the wrong side of the decision boundary, where the slack variable $\xi_i$ represents the distance that example $\mathbf{x}_i$ is from the decision boundary and $C$ is a cost parameter that can be used to balance between empirical risk and margin width. The greater the value that $C$ takes, the higher the penalty paid for training examples on the wrong side of the decision boundary. This parameter has an impact on the hypothesis space or capacity of the classifier; that is, larger values of $C$ increase the capacity of the classifier but reduce its ability to generalize [35] [36].

PAGE 34

To minimize, equation (7) is subject to equation (8), where $C$ is the cost parameter for training examples that cannot be separated and $\xi_i$ is the slack variable that handles non-separable examples.

minimize: $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$    (7)

subject to: $y_i (\mathbf{w} \cdot \phi(\mathbf{x}_i) - b) \ge 1 - \xi_i, \quad \xi_i \ge 0$    (8)

The Lagrange multipliers $\alpha_i$ (and $\mu_i$ for the slack constraints) are introduced into equation (7), creating (9).

$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (\mathbf{w} \cdot \phi(\mathbf{x}_i) - b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \mu_i \xi_i$    (9)

This becomes a convex optimization problem [37] where the optimal solution is at a saddle point. Taking the partial derivatives with respect to $\mathbf{w}$, $b$, and $\xi_i$ gives us (10).

$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \phi(\mathbf{x}_i), \qquad \frac{\partial L_P}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \frac{\partial L_P}{\partial \xi_i} = 0 \Rightarrow C = \alpha_i + \mu_i$    (10)

Substituting (10) into equation (9) results in equations (11) through (13). The dual form of a SVM is shown in (11). Equation (13) becomes the decision function.

maximize: $L_D = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$    (11)

subject to: $0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$    (12)

decision function: $f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) - b \right)$    (13)

PAGE 35

The Karush-Kuhn-Tucker condition [38] of the optimal solutions to equations (7) and (8) results in:

$\alpha_i \left[ y_i (\mathbf{w} \cdot \phi(\mathbf{x}_i) - b) - 1 + \xi_i \right] = 0$    (14)

When $\alpha_i$ is non-zero and (15) is satisfied, then $\mathbf{x}_i$ is contributing to the decision boundary separating the two classes and is called a support vector.

$y_i (\mathbf{w} \cdot \phi(\mathbf{x}_i) - b) = 1 - \xi_i$    (15)

As discussed earlier, the cost parameter $C$ in equation (7) represents the penalty that is assessed for training examples that end up on the wrong side of the decision boundary. The larger its value, the harder the SVM training process will work to find a separating hyperplane that cleanly separates the training examples. The danger is that too large a value will cause overfitting, where the resulting classifier will not generalize well to unseen data. The function $K(\mathbf{x}_i, \mathbf{x}_j)$ from (11) is the kernel function that allows for projection to a higher dimensional space. It returns the result of a dot product between $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$ as if the dot product were performed in the higher dimension. Equation (16) shows the radial basis function (RBF) kernel used for all experiments discussed in this research.

$K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2}$    (16)

This gives us another parameter, $\gamma$ (gamma), which controls the hypothesis space. The larger $\gamma$ is, the more powerful the classifier but the less capable it is to generalize. However, making it too small will reduce the ability of the classifier to find a good decision boundary, also resulting in poorer classification performance.
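As a small numeric illustration of equation (16), the sketch below evaluates the RBF kernel on made-up feature vectors and shows how gamma changes how quickly similarity decays with distance (not code from the dissertation):

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel of equation (16): exp(-gamma * ||x_i - x_j||^2)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Made-up feature vectors; ||x1 - x2||^2 = 3.25 here.
x1, x2 = [1.0, 2.0], [2.0, 0.5]
print(rbf_kernel(x1, x2, gamma=0.1))   # ~0.72    (small gamma: points look similar)
print(rbf_kernel(x1, x2, gamma=10.0))  # ~7.7e-15 (large gamma: similarity decays fast)
```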


2.2.3 Assigning Probability Values

A probability is calculated for the classification by using a sigmoid function. The distance from the decision boundary is translated by the sigmoid function into a confidence value/probability [ 39 40 ]. A probability of less than 0.5 indicates class -1, while a probability greater than or equal to 0.5 indicates class +1.

p(x) = 1 / ( 1 + exp( -A · f(x) ) )    (17)

The probability calculation is extended to the multiple class case by making use of a method developed in [ 26 ]. This method normalizes the distance function (18) of each binary class SVM by dividing it by its respective weight vector norm, giving us equation (19). This allows for the use of the same probability parameter by all the binary class SVMs involved in the multi-class SVM.

f_ij(x) = w_ij · φ(x) + b_ij = distance function for classifier ij    (18)
p_ij(x) = 1 / ( 1 + exp( -A · f_ij(x) / ||w_ij|| ) )    (19)

Equation (20) shows the product for class c, which then needs to be normalized such that the probabilities of all possible classes add up to 1.0, as shown in equation (21).

P_c(x) = Π_{j ≠ c} p_cj(x)    (20)
P'_c(x) = P_c(x) / Σ_k P_k(x)    (21)

The variable A in equation (19) now becomes an additional parameter that will need to be tuned along with C and γ for each support vector machine. The idea is that the probability should reflect the likelihood that the classification of a given example is correct.
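The sketch below shows one plausible reading of equations (19) through (21): a shared-parameter sigmoid applied to the margin-normalized output of each binary SVM, followed by a per-class product and normalization. Function and variable names are illustrative, not the PICES API.

    import numpy as np

    def pairwise_probability(decision_value, w_norm, A):
        """Sigmoid of the margin-normalized SVM output, equation (19)."""
        return 1.0 / (1.0 + np.exp(-A * decision_value / w_norm))

    def combine_class_probabilities(pair_probs, classes):
        """pair_probs[(a, b)] = probability the (a, b) classifier assigns to class a.
        Returns per-class probabilities normalized to sum to 1 (equations (20)-(21))."""
        scores = {}
        for c in classes:
            prod = 1.0
            for (a, b), p in pair_probs.items():
                if a == c:
                    prod *= p
                elif b == c:
                    prod *= (1.0 - p)
            scores[c] = prod
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}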


2.2.4 Probability Parameter Adjustment

One important part of the feature selection process is the tuning of the SVM parameters C, γ, and A. C is the cost that has to be paid by the classifier for each training example that ends up on the wrong side of the decision boundary. The greater the value of C, the more CPU time is required to find a decision boundary that separates all the training examples. γ (Gamma) controls the hypothesis space; the larger γ is, the more powerful the classifier but the less able it is to generalize. However, making it too small will reduce the ability of the classifier to find a good decision boundary, resulting in poorer classification performance. The probability parameter A is used to tune the confidence value. By using this parameter, the probability returned by the prediction function can be made to closely reflect the actual probability that the prediction is correct. For example, if all predictions that were given a probability of 80% were submitted to an oracle, they would turn out to have an actual classification accuracy close to 80%. Therefore, part of the SVM parameter tuning process is concerned with the probability of prediction (confidence value) being as close as possible to the actual probability that the prediction is correct.

In a multi-class problem this becomes even more useful. By using the probability from each binary SVM rather than voting, some of the bias that gets introduced by the weaker binary class classifiers can be reduced. Consider 3 classes A, B, and C, for example, with 3 binary classifiers AB, AC, and BC. If classifiers AB and AC have a 90% classification accuracy on a test dataset while classifier BC has a 60% classification accuracy on the same test dataset, then with unweighted voting classifier BC has the same influence over the final multi-class prediction as classifiers AB and AC. However, if probabilities are used, then the prediction from BC will carry less weight.

2.2.5 Multi-Class Support Vector Machines

There are two main strategies for SVMs, which are binary classifiers, to deal with multi-class problems: one versus all and one versus one.


1) One versus all: One binary SVM is built for each class, where the training data is divided into the class of interest and all other classes. Problems with this approach include the unbalanced nature of each binary classifier, since one side will consist of all the remaining training examples, and the implementation of a decision function.

2) One versus one: There is one SVM built for each possible binary class combination. For example, if there are 4 classes A, B, C, and D, there would be 6 classifiers built: AB, AC, AD, BC, BD, and CD. In this case prediction would be implemented using a voting scheme. Another scheme for deciding the winning class is to utilize the computed probabilities and voting as done in [ 3 ].

The one versus one strategy requires the construction of more SVMs than the one versus all strategy, but because there are fewer examples to train per SVM, the one versus one strategy is actually faster than the one versus all strategy. Experiments done in [ 41 ] show that the training time of one versus one was considerably faster than one versus all, a speed-up of 2 to 10 times depending on the dataset. In both strategies the same features are used for all the SVMs. This research proposes to select features as well as tune the SVM parameters for each specific binary combination of classes, the premise being that the user can take greater advantage of the specific characteristics that apply to the two classes involved and thus maximize the performance of each individual SVM with respect to both accuracy and processing time, resulting in better performance for the overall classifier.

There were two different decision functions explored: voting (23) and probability (24). In the case of voting it is possible that there might be ties, that is, two different classes receiving the same number of votes. In this case the class with the highest probability from any classifier is used to break the tie.

( 22 )


class(x) = argmax_c Σ_{i<j} 1[ classifier ij votes for class c ]    (23)
class(x) = argmax_c P'_c(x)    (24)

2.3 Feature Selection

Feature selection is the process of identifying the subset of features for a given dataset that will improve classification accuracy and reduce processing time. Typically, the more features a dataset has, the longer the training time. By reducing the feature count we can reduce the training time. At the same time, by eliminating features that do not help discriminate the various classes, an improvement in classification accuracy can be achieved.

Wrappers is the feature selection procedure used in PICES. Wrappers is a process that searches through feature space by using the learning algorithm as a black box. It assigns an accuracy to each feature subset by performing a cross validation, typically a 5-fold cross validation, on the training data. The accuracy assigned to each feature subset is then used to drive the search. Each feature subset is referred to as a node, with each node connected to other feature subsets. Two connecting nodes are separated by the addition or subtraction of one feature. The search method employed is a best-first search. In this case the Wrappers algorithm keeps track of nodes that have already been evaluated and assigned an accuracy and those that have not been evaluated yet. It keeps processing the nodes that have not been evaluated until none are left. At that point it will select, from the entire set of evaluated nodes, the one with the highest accuracy; this is referred to as expansion. This node will then be expanded to create new nodes to search. This is done by either removing one feature or adding one feature. There are two typical ways of performing the search. One is starting with all features selected and reducing down to a small set, and the other is starting with all two-feature combinations and growing the features. The process continues in a loop until a termination condition is met, at which point it switches to a 5-wide beam search. In the case of this research


that termination is when 50 expansions are made without locating a subset of features that produces a higher classification accuracy. The purpose of the beam search is to drive the search down to just one feature. At each expansion it will select the 5 nodes with the highest accuracy from the set with the least number of features. This will continue until there is only one feature left.

2.4 Datasets

Experiments were performed on seven different datasets. Three datasets were derived from plankton images produced by SIPPER [ 10 ]: West Florida Shelf (WFS), ETP2008 Station 1, and Nine Class Plankton [ 42 ]. The fourth and fifth datasets were both derived from the Forest Cover dataset [ 43 ], which has more than 500,000 examples of tree coverage: Spruce, Pine, Willow, Aspen, etc. The difference between the two Forest Cover datasets is the number of examples used in the training datasets; one has 300 examples per class and the other 1,500 examples per class. The purpose of the two different sizes was to see how the two methods, MFS and BFS, would respond to changes in the training set size. The sixth dataset is the Letter dataset found in the UCI repository [ 44 ]. The seventh dataset is the Sat Image dataset, also found in the UCI repository. Table 1 provides a summary of the datasets.

Each dataset is split into two parts, training and test, by randomly sorting and selecting the first examples of each class for the training dataset and the rest for the test dataset. The training set is used to drive both the feature selection and SVM parameter searches, and the test set is used as a final validation of results for comparison purposes between the MFS and BFS search methods. In addition, the test datasets will be stratified by class for 10 folds.


Table 1 Dataset Descriptions

Dataset                    Num Classes   Num Features   Train Set Size   Test Set Size
WFS                        33            82             16,807           4,199
ETP2008 Station 1          55            83             17,678           23,211
Nine Class Plankton        9             73             9,000            4,500
Forest Cover 300/class     7             54             2,100            574,012
Forest Cover 1500/class    7             54             7,000            570,512
Letter                     26            16             15,998           4,002
Sat Image                  7             36             4,435            2,000

2.4.1 Plankton Datasets

All three plankton datasets come from the SIPPER underwater imaging platform. Two of them are training libraries, WFS and ETP2008, while the third is from [ 42 ]. Both training library datasets, WFS and ETP2008, are highly unbalanced with respect to class distribution, while the Nine Class Plankton dataset is evenly distributed. The examples from all three plankton datasets were derived from images of plankton manually classified by marine biologists from the College of Marine Science at the University of South Florida. See Appendix A for images of the different classes of plankton. The three datasets share a common set of features that are listed in Table 14. The Nine Class dataset utilizes the first 73 features, the WFS dataset uses the first 82 features, and the ETP2008 dataset uses all 83 features listed. The categories of features are listed in Table 5.

The West Florida Shelf (WFS) plankton dataset is from marine science cruises that occurred in the Gulf of Mexico off the west coast of Florida between 2004 and 2007. It consists of 21,006 examples split into 33 classes. The distribution of examples among classes is very unbalanced. The smallest class has 34 examples while the largest contains 1,948 examples. Table 2 lists all classes with their related counts. There are 82 features, which are listed in Table 14. All feature values are continuous floats. The data was randomly split into training and test datasets with 80% of each class going into the training dataset and 20% into the test dataset.


Table 2 WFS Dataset Class Distribution

Class Name                Idx   Count     Class Name                Idx   Count
Artifact                   1    351       Cnidaria_Aglaura          18    845
Chaetognath                2    1,047     Cnidaria_Hydroid          19    592
Cladoceran                 3    1,185     Cnidaria_other            20    452
Copepod                    4    170       Doliolid                  21    157
Copepod_Calanoid           5    1,948     Salp                      22    127
Copepod_Copilia            6    58        Siphonophore              23    388
Copepod_Corycaeus          7    294       Lancelet                  24    66
Copepod_Macrosetella       8    325       Larvacean                 25    1,513
Copepod_Oithona            9    1,233     Other                     26    997
Copepod_Oncaea             10   322       Other_Fish                27    100
Eumalacostracan            11   808       Other_Polychaete          28    84
Ostracod                   12   1,112     Other_Pteropod            29    191
Detritus                   13   1,252     Protist                   30    1,222
Echinoderm_Bipinnaria      14   34        Trichodesmium_Colonies    31    1,092
Echinoderm_Plutei          15   830       Trichodesmium_Elongate    32    384
Elongate_Phytoplankton     16   576       UnKnown                   33    1,065
Elongate_Strings           17   186       Total                           21,006

The ETP2008 dataset is from a four-week marine science cruise in the Eastern Tropical Pacific aboard the R/V Knorr [ 45 46 29 ], a 3,000-ton research vessel operated by Woods Hole Oceanographic Institution. The cruise departed December 8, 2008 from Balboa, Panama and returned January 6, 2009 to Puntarenas, Costa Rica. There were over 10 million plankton images acquired during the cruise from 14 deployments of SIPPER at 4 different stations. Each station represents a different geographical location in the Eastern Tropical Pacific; they were labeled 1, 4A, 4B, and 8. The examples for this dataset were randomly harvested from images collected at station 1 and then manually labeled by a marine biologist. The random harvesting of examples was done such that depth was weighted equally; that is, examples were grouped into 5-meter depth


ranges and then examples were randomly selected from each depth range, weighted by the density of images at that depth. The idea is that the training library should reflect the underlying depth distribution of the different plankton classes. Since SIPPER was not deployed at various depths for equal amounts of time, the random sampling needed to be adjusted to reflect the density of images at given depth ranges. Figure 11 shows the algorithm used to randomly harvest.

Figure 11 Random Harvesting Algorithm.

Table 3 shows the class distribution of the resulting ETP2008 dataset. There are 40,889 examples divided into 55 classes. The smallest class consists of 23 examples while the largest contains 9,566 examples. The data was then randomly split into training and test datasets with 70% of each class going into the training dataset and the remaining 30% into the test dataset, with a limit of 1,000 examples maximum per class in the training dataset. When 70% of a given class exceeds 1,000 examples, the remainder is added to the test dataset. There are 83 features in this dataset, with the first 82 being the same as the WFS dataset and the 83rd feature representing the depth at which a given plankton image was sampled. Table 5 summarizes the features by category while Table 14 lists each feature.
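The depth-weighted harvesting of Figure 11 can be summarized with the short sketch below. It is a minimal illustration only, assuming each image carries its sampling depth; the function and variable names are hypothetical and not part of PICES.

    import random
    from collections import defaultdict

    def harvest(images, depths, sample_size, bin_size=5.0):
        """Randomly pick sample_size images, allocating picks to 5 m depth
        bins in proportion to how many images each bin contains."""
        bins = defaultdict(list)
        for img, d in zip(images, depths):
            bins[int(d // bin_size)].append(img)
        total = len(images)
        selected = []
        for members in bins.values():
            quota = round(sample_size * len(members) / total)
            selected.extend(random.sample(members, min(quota, len(members))))
        return selected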


The Nine Class Plankton dataset from [ 42 ] consists of 13,500 examples split into 9 classes, with 1,500 examples in each class and 73 features. This dataset was originally used in the development of the first 73 features created for the SIPPER system, which are the same as the first 73 features described in Table 14. This dataset is randomly divided into training and test datasets, with 66.66% of each class going into the training set and the remaining 33.34% going into the test set.


Table 3 ETP2008 Class Distribution

Class Name                     Idx   Count     Class Name                     Idx   Count
Copepod_Calanoid                0    1,238     Larvacean                       28   557
Copepod_Calanoid_Eucalanus      1    1,011     Larvacean_House                 29   299
Copepod_Copilia                 2    826       Larvacean_Large                 30   422
Copepod_Eyes                    3    379       Larvacean_Tectillaria           31   279
Copepod_Macrosetella            4    572       Larvae_Doliolid                 32   63
Copepod_Nauplii                 5    856       Larvae_Polychaete               33   115
Copepod_oithona                 6    1,586     Larvae_Tornaria                 34   23
Copepod_Oncaea                  7    671       Pteropod_Creseis                35   812
Eumalacostracan                 8    298       Pteropod_Gymnosome              36   33
Eumalacostracan_amphipod        9    201       Noctiluca                       37   430
Eumalacostracan_euphausiid      10   1,010     Noise                           38   9,566
Ostracod                        11   435       Other                           39   242
Detritus_Molts                  12   638       Phyto_Chaetoceros               40   377
Detritus_Snow                   13   8,604     Phyto_Pyrocystis                41   215
Echinoderm_Plutei               14   131       Protist_Darkcenter              42   182
Elongate_Chaetognath            15   533       Protist_Diffuse                 43   99
Elongate_Polychaete             16   333       Protist_Knobby                  44   271
Elongate_Strands                17   312       Protist_Lobed                   45   451
Fish                            18   170       Protist_Lopsided                46   246
Ctenophore                      19   60        Protist_Multiple                47   98
Ctenophore_Cydippid             20   27        Protist_Phage                   48   38
Hydromedusae                    21   549       Protist_Phi                     49   371
Hydromedusae_Blunt              22   541       Protist_Radiolarian             50   1,236
Hydromedusae_Small              23   148       Protist_Spiny                   51   465
Hydromedusae_Solmundella        24   111       Protist_Wisp                    52   75
Siphonophore                    25   265       Radiolarian_Ribboncolony        53   84
Tunicate_Doliolid               26   1,597     Radiolarian_Roundcolony         54   579
Tunicate_Pyrosome               27   159       Total                                40,889


Table 4 Nine Class Plankton Class Distribution

Class Name                      Idx   Count
Chaetognath                      0    1,500
Cnidaria_Smallbell_Longarms      1    1,500
Copepod_Oithona                  2    1,500
Echino_Plutei                    3    1,500
Larvacean                        4    1,500
Marine_Snow_Dark                 5    1,500
Marine_Snow_light                6    1,500
Protist_all                      7    1,500
Trich                            8    1,500
Total                                 13,500

Table 5 Plankton Feature Categories

Category                 Sub-Category                                                     Feature Count
Moment Features [ 30 ]   Binary                                                           8
                         Intensity Weighted                                               8
                         Edge Pixels Only                                                 8
Morphological (9)        Head/Tail: the main axis of the image is found via an
                         eigenvector, and the image is rotated to align with the
                         horizontal axis; pixel counts of the first quarter and
                         last quarter.                                                    2
                         Length vs Width                                                  1
                         Length                                                           1
                         Width                                                            1
                         Filled Area                                                      1
                         Convex Area                                                      1
                         Transparency (one binary, one weighted)                          2
Texture                  Using Fourier transform; one feature for each frequency
                         range, from low to high frequency.                               5
Contour Fourier          Average of 5 frequency domains, low to high.                     5
                         Hybrid: the 4 lowest frequencies are sampled while the rest
                         represent ranges of frequencies.                                 15
Intensity Histogram      Not including white space                                        7
                         Including white space                                            8
Instrument Data          Depth                                                            1


Table 6 Common Variables/Functions Used in Feature Calculation

H                  Image height
W                  Image width
I(x, y)            Intensity at (x, y); 0 = background, 255 = foreground
(cx, cy)           Center of image
(cwx, cwy)         Weighted center of image
N                  Number of foreground pixels in image
S                  Image size in number of pixels
range(I(x, y))     Indicates which intensity range the pixel value is in; see Table 10
inRange_r(x, y)    Indicates that the intensity of the pixel at (x, y) falls in intensity range r; 1 if true, else 0
fg(x, y)           Indicates that the intensity of the pixel at (x, y) is greater than 31, a foreground pixel; 1 if true, else 0
H_r                Histogram feature value for intensity range r; see Table 10

The following equations are used in Table 14 as part of the plankton feature data description. Equation (26) computes the number of foreground pixels in the image. Equation (27) returns the weighted size of the image; that is, the size weighted by the intensity of each pixel. Equations (28) and (29) return the centroid and weighted centroid of the image.

( 25 )
N = Σ_x Σ_y fg(x, y)    (26)
S_w = (1/255) Σ_x Σ_y I(x, y)    (27)
(cx, cy) = ( (1/N) Σ_{fg(x,y)=1} x , (1/N) Σ_{fg(x,y)=1} y )    (28)


(cwx, cwy) = ( (1/S_w) Σ_x Σ_y x · I(x, y)/255 , (1/S_w) Σ_x Σ_y y · I(x, y)/255 )    (29)

Table 7 describes the 8 basic moment features developed in [ 30 ]. There are three different flavors of moment features implemented: binary, edge, and intensity weighted.

1) Features 0 through 7 are the binary moment features and use equation (30) with the moment features described in Table 7.
2) Features 8 through 15 are the edge moment features and use equation (30) with the moment features described in Table 7.
3) Features 31 through 38 are the intensity weighted moment features and use equation (31) with the moment features described in Table 7.

( 30 )
( 31 )
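To make equations (26) through (29) concrete, the following minimal numpy sketch computes the foreground pixel count, weighted size, centroid, and weighted centroid of a grayscale SIPPER-style image (0 = background, 255 = strongest foreground). The 31-intensity foreground threshold follows Table 6; everything else (names, array layout) is illustrative.

    import numpy as np

    def basic_size_and_centroids(img):
        """img: 2-D uint8 array, 0 = background, 255 = foreground."""
        img = img.astype(float)
        fg = img > 31                          # foreground mask, as in Table 6
        ys, xs = np.nonzero(fg)
        n_fg = fg.sum()                        # equation (26)
        weighted_size = img.sum() / 255.0      # equation (27)
        centroid = (xs.mean(), ys.mean())      # equation (28)
        w = img / img.sum()
        yy, xx = np.indices(img.shape)
        weighted_centroid = ((xx * w).sum(), (yy * w).sum())   # equation (29)
        return n_fg, weighted_size, centroid, weighted_centroid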


Table 7 Eight Basic Moment Features Used in the Three Different Moment Groups (features 0 through 7)

With the grayscale values that SIPPER 2 and SIPPER 3 produce, features that reflect the texture of the image can be computed. A 2D Fourier transform is performed on the original image. Using the result of this transform, the energy of different frequency ranges was captured by computing the average magnitude for each of 5 different frequency ranges (see Table 9). Figure 12 shows a plankton image and its Fourier transform. The semi-circle bands labeled R1 through R5 indicate the boundaries of the regions. Only half the Fourier domain needs to be processed since both halves are mirror images of each other. These five regions result in five Fourier features. The value of each feature is the average value of the magnitude of its respective region.


Figure 12 2D Fourier Transform of Image, Frequency Ranges Indicated

Table 8 provides descriptions of some variables and functions that are needed for equation (32). Using these, five features are computed that represent five different frequency ranges, as listed in Table 9.

Table 8 Texture Feature Variables and Functions

F(u, v)       Fourier transform of the image. This is a two-dimensional matrix with the same dimensions as the original image. Each element in the matrix has both a real and an imaginary part.
d_max         Distance from the upper left to the centroid.
in_k(u, v)    Indicator function that specifies whether the pixel at (u, v) is in region k; returns 1 if true, 0 if false. Uses Table 9 and d_max.
N_k           Pixel count for region k.

Table 9 Lower and Upper Frequency Bounds for Texture Features

Region Number   Lower Bound   Upper Bound
1               0             -
2               -             -
3               -             -
4               -             -
5               -             1
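The texture features of Table 9 are the mean FFT magnitudes within concentric frequency bands, formalized in equation (32) below. A minimal numpy sketch follows; the evenly spaced band bounds here are placeholders, not the values of Table 9, and the names are illustrative.

    import numpy as np

    def texture_features(img, bounds=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
        """Average FFT magnitude in 5 concentric frequency bands.
        bounds are fractions of the maximum frequency radius (placeholders)."""
        f = np.fft.fftshift(np.fft.fft2(img.astype(float)))
        mag = np.abs(f)
        h, w = mag.shape
        yy, xx = np.indices(mag.shape)
        r = np.hypot(yy - h / 2.0, xx - w / 2.0)
        r = r / r.max()                        # normalize radius to [0, 1]
        feats = []
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            band = (r >= lo) & (r < hi) if hi < 1.0 else (r >= lo) & (r <= hi)
            feats.append(mag[band].mean())     # average magnitude in the band
        return feats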


T_k = (1 / N_k) Σ_{(u,v)} in_k(u, v) · |F(u, v)|    (32)

Equation (33) computes the fraction of image pixels that belong to a given intensity range. It is used by the two groups of intensity histogram features. The first group, features 63 through 69, is computed from the original image, while the second group, features 74 through 82, is computed after a fill-hole operation is performed on the original image.

H_r = (1 / S) Σ_x Σ_y inRange_r(x, y)    (33)

Table 10 Intensity Regions

Region        Intensity Range
Background    0 - 31
1             32 - 63
2             64 - 95
3             96 - 127
4             128 - 159
5             160 - 191
6             192 - 223
7             224 - 255

Contour features based on Fourier descriptors were implemented. The Fourier descriptors are derived by performing a Fourier transform on a one-dimensional array of data that represents the contour of the image, where the real and imaginary components come from the locations of the edge pixels. When performing a Fourier transform on an array that represents the edge/contour of an image, the frequencies captured in the resultant array reflect the deviations from a circle. There are two types of contour features implemented: 1) averaging by frequency region, and 2) a combination of region averaging and sampling referred to as hybrid.


1) A Fourier transform is performed on the entire contour of the image. The result of the transform is used to generate 5 contour features, with each one representing a range of frequencies. This is done by computing the average value of the magnitudes for each range (see Figure 13). This is similar to the way the texture features were computed. In this case, instead of bounding the regions with semi-circles around the center of the image, the region is derived by determining the distance from the center of the array. Table 11 shows the size of the frequency ranges as a fraction of 1. Equation (34) computes the averaging contour feature for the specified region using functions described in Table 12.

2) Hybrid is a mix of averaging and sampling. The lowest frequencies are sampled and the higher frequencies are averaged. The idea here is that the lowest individual frequencies capture the greatest amount of information, while individual higher frequencies are not as significant but, taken as an average over a domain, can contribute to classification accuracy. Table 13 gives a summary of the 16 features computed in this section.

Figure 13 Contour Frequency Domain


Table 11 Upper and Lower Contour Frequency Ranges

Region Number   Lower Bound   Upper Bound
1               -             -
2               -             -
3               -             -
4               -             -
5               -             -

Table 12 Contour Variables and Functions

L          Length of edge in pixels
c          Center position of the array
|G(j)|     Magnitude of the complex number (amplitude) at position j
in_k(j)    Indicator function; specifies whether position j is in region k. If so, 1, else 0.
n_k        Number of pixels in region k
CF_k       Contour feature value for region k

CF_k = (1 / n_k) Σ_j in_k(j) · |G(j)|    (34)
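A minimal sketch of the averaging contour features of equation (34): the edge pixels are treated as complex numbers x + iy, a 1-D FFT is taken, and the magnitudes are averaged within frequency regions measured by distance from the center of the array. The region bounds below are placeholders (the actual fractions are those of Table 11) and the names are illustrative.

    import numpy as np

    def contour_features(edge_xs, edge_ys, bounds=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
        """Averaging contour features: mean FFT magnitude per frequency region."""
        z = np.asarray(edge_xs, dtype=float) + 1j * np.asarray(edge_ys, dtype=float)
        mag = np.abs(np.fft.fft(z))
        n = len(mag)
        # distance of each bucket from the center of the array, as a fraction of 1
        dist = np.abs(np.arange(n) - n / 2.0) / (n / 2.0)
        feats = []
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            region = (dist >= lo) & (dist < hi) if hi < 1.0 else (dist >= lo)
            feats.append(mag[region].mean())
        return feats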


Table 13 Hybrid Contour Features

Feature   Range               Description
0         1 Hz Left           First bucket in resultant Fourier transform
1         2 Hz Left           Second bucket in resultant Fourier transform
2         3 Hz Left
3         4 Hz Left
4         13/16 - 4 Hz        Avg. of amplitudes in left buckets that range from 13/16th to 4 Hz from center
5         12/16 - 13/16       Avg. of amplitudes in left buckets that range from 12/16th to 13/16th from center
6         10/16 - 12/16       Avg. of amplitudes in left buckets that range from 10/16th to 12/16th from center
7         Center - 10/16      Left
8         Center - 10/16      Right
9         10/16 - 12/16       Avg. of amplitudes in right buckets that range from 10/16th to 12/16th from center
10        12/16 - 13/16       Avg. of amplitudes in right buckets that range from 12/16th to 13/16th from center
11        13/16 - 4 Hz        Avg. of amplitudes in right buckets that range from 13/16th to 4 Hz from center
12        4 Hz Right
13        3 Hz Right
14        2 Hz Right
15        1 Hz Right          Last bucket in resultant Fourier transform

Table 14 List of Plankton Features

Feature Num   Description
0 - 7         Moment features from Table 7 using equation (30).
8 - 15        Edge moments: a. image is reduced to just edge pixels; b. new center is calculated.
16
17
18
19
20


Table 14 (Continued)

21
22
23
24
25, 26        a. Create covariance matrix of image; b. calculate 1st and 2nd eigenvectors; c. ...; d. determine orientation of image; e. using the orientation, rotate the image so that it lies horizontal; f. ...; g. helps to determine if the organism has a head.
27
28
29
30 - 37       Moment equations from Table 7 using equation (31).
38 - 42
43 - 57       Hybrid contour as described in Table 13.
58 - 62       Averaging contour as described in Table 11 and equation (34).
63 - 69       Intensity histogram field [0 - 6].
70            Height / Width. Using information used to calculate the eigen ratio: a. the image is rotated so that its longest dimension runs parallel to the bottom; b. a tight bounding box is drawn; c. the shortest dimension is considered the Height while the longest is the Width.
71            Height
72            Width


Table 14 (Continued)

73            Hole filled area.
74 - 82       Intensity histogram including whitespace.
83            Depth from CTD [ 47 ].

2.4.2 Forest Cover Dataset

The Forest Cover dataset resulted from a study performed by the Colorado State University Department of Forest Sciences using remote sensor data provided by the US Geological Survey (USGS) and the US Forest Service (USFS). Each example in the dataset represents the predominant growth in a 30 meter by 30 meter square area. The area where the data was collected comprises the Rawah, Comanche Peak, Neota, and Cache la Poudre wilderness areas of the Roosevelt National Forest in northern Colorado. The dataset can be downloaded from the UCI Machine Learning Repository [ 43 ]. It consists of 54 features and 7 classes. The 54 features are of both continuous and Boolean types, with 10 features being continuous and 44 being Boolean. Table 15 gives a breakdown of the features. The seven classes represent the type of forest growth, i.e. the trees growing in each 30 by 30 meter cell. Table 16 gives a detailed breakdown of the seven classes. A more detailed description of the study, the data, and how it was acquired can be found in [ 48 49 50 ]. Analysis of the Forest Cover dataset shows that, for each of the two sets of Boolean fields, Wilderness and Soil Type, only one field in the set will be true for any example.


Table 15 Forest Cover List of Features

Description                                                Type         Feature Count
Elevation                                                  Continuous   1
Aspect                                                     Continuous   1
Slope                                                      Continuous   1
Horizontal dist. to nearest water                          Continuous   1
Vertical dist. to nearest water surface                    Continuous   1
Horizontal distance to nearest road                        Continuous   1
Angle to sun at 9am on the summer solstice                 Continuous   1
Angle to sun at 12 noon on the summer solstice             Continuous   1
Angle to sun at 3pm on the summer solstice                 Continuous   1
Horizontal dist. to nearest forest fire ignition point     Continuous   1
Wilderness area designation                                Boolean      4
Soil Type                                                  Boolean      40

Table 16 Forest Cover Class Breakdown

Cover Type            Number Examples
Spruce / fir          211,840
Lodgepole pine        283,301
Ponderosa pine        35,754
Cottonwood / Willow   2,747
Aspen                 9,493
Douglas fir           17,367
Krummholz             20,510

2.4.3 Letter Dataset

The Letter dataset was first described in [ 51 ]. It consists of 26 classes and 16 integer features. Each class represents a different letter generated using 20 different fonts. The distribution amongst the classes is relatively even, with the smallest class having 734 examples and the largest 813 examples. The dataset was randomly split into training and test datasets with 80% of each class in the training dataset and the remaining 20% in the test dataset. Support vector machines are known to obtain good classification accuracy, better than 95%, on this


dataset [ 25 52 53 ]. In [ 54 ] a classification accuracy of 100% was achieved using an ensemble of 200 C4.5 classifiers. This makes it an interesting dataset on which to attempt improved classification accuracy, since there is not as much room for improvement as with the other datasets. Table 17 shows the class distribution of examples and Table 18 provides a description of the 16 features, which are all integer based.

Table 17 Letter Dataset Class Distribution

Class   Count   Class   Count   Class   Count
A       789     J       747     S       748
B       766     K       739     T       796
C       736     L       761     U       813
D       805     M       792     V       764
E       768     N       783     W       752
F       775     O       753     X       787
G       773     P       803     Y       786
H       734     Q       783     Z       734
I       755     R       758

Table 18 Letter Dataset Feature Description

Feature Num   Description                        Feature Num   Description
1             Horizontal position of box.        9             Mean y variance.
2             Vertical position of box.          10            Mean x y correlation.
3             Width of box.                      11            Mean of x x y.
4             Height of box.                     12            Mean of x y y.
5             Total # pixels.                    13            Mean edge count left to right.
6             Mean x of on pixels in box.        14            Correlation of x edge with y.
7             Mean y of on pixels in box.        15            Mean edge count bottom to top.
8             Mean x variance.                   16            Correlation of y edge with x.


2.4.4 Sat Image

This is the Statlog (Landsat Satellite) dataset [ 55 ]. It represents a subset of the original dataset, which was purchased from NASA. There are four overlaid images, where each image represents a different spectral region. Each example represents a 3x3 pixel region where each pixel represents an 80 x 80 meter area. It was ground-truthed by a site visit by Ms. Karen Hall and Professor John A. Richards at the Centre for Remote Sensing at the University of New South Wales, Australia. Table 19 lists the 7 classes in the dataset, the smallest containing 626 examples and the largest 1,533 examples. The dataset is already split into training and test as downloaded from the UCI repository, with 4,435 examples in the training dataset and 2,000 examples in the test dataset. This dataset is also included in [ 19 ], which also implemented a binary class pair-wise feature selection procedure.

Table 19 Sat Image Class Distribution

Class   Class Description                Count
1       Red soil                         1,533
2       Cotton crop                      703
3       Grey soil                        1,358
4       Damp grey soil                   626
5       Soil with vegetation stubble     707
6       Mixture class (all types)        1,508
7       Very damp grey soil              1,533

2.5 Data Normalization

Data normalization is the process of scaling feature data such that all features have approximately the same range of values. This is necessary so that all features have equivalent weight when building the support vector machines. For example, the size feature in the plankton datasets typically ranges from 200 to 200,000, whereas the intensity histogram features of the same datasets typically range from 0.0 to 1.0. If no normalization were done, the size feature would overwhelm the intensity histogram features.


The z-score normalization procedure [ 56 ] is used on all continuous and integer data. This requires the computation of normalization parameters for each feature. These parameters are the mean value, μ_f, and standard deviation, σ_f, of each continuous feature f in the training datasets. These normalization parameters are then used to normalize the training and test data. Equation (35) is used to compute the normalized value of feature f for example i.

x'_{i,f} = ( x_{i,f} - μ_f ) / σ_f    (35)

Binary data, such as the Soil Type features in the Forest Cover dataset, are not normalized since their values are already 0 for false and 1 for true.

2.6 Significance Testing

The feature selection and parameter tuning methods were run using the training data extracted from each dataset. The idea is that the training datasets represent training libraries created by users who wish to improve performance against unseen data. The test data is not involved in the feature selection or SVM parameter tuning processes but represents the future unseen data that the resultant classifiers will need to classify.

The need for a significance test in this dissertation is to compare the performance of the two procedures, MFS and BFS. To compare the two procedures, classifiers will be built for each one using the training examples that drove the feature selection and parameter tuning processes. These classifiers will then classify the unseen test examples that had no influence on the selection of features and parameters. This will result in two sets of predictions that can then be compared. In [ 57 ] the t-test (random splits), the cross-validated t-test (10-fold cross validation), and a new proposed test called the 5x2 cross validation are analyzed. Of these tests the paper recommends the 5x2 cross validation when the experiment can be run 10 times, or a McNemar's test when it can only be run once. To use the 5x2 cross validation on the two procedures MFS and BFS


would require rerunning both procedures 10 times. In the case of the ETP2008 dataset this would take approximately 50 days, which is not practical. McNemar's test, which [ 46 ] states has similar power [ 57 58 ], was therefore used in all experiments to determine if two classifiers were statistically different. This test compares the results of two classifiers by comparing the test examples that are classified correctly and those that are classified incorrectly. Given two classifiers A and B and the variables n00, n01, n10, and n11 described below, equation (36) is used to calculate the test statistic. A test statistic equal to or greater than 3.8415 indicates that the two classifiers have a 95% or greater probability of being different. The null hypothesis is that n01 = n10. The variables n00 and n11 are not used as part of equation (36).

χ² = ( |n01 - n10| - 1 )² / ( n01 + n10 )    (36)

where:
n00 = number of test examples misclassified by both classifiers A and B,
n01 = number of test examples correctly classified by A but not B,
n10 = number of test examples misclassified by A and correctly classified by B, and
n11 = number of test examples correctly classified by both A and B.
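The statistic in equation (36) is straightforward to compute; a minimal sketch follows (the function name and the example counts are illustrative only).

    def mcnemar_statistic(n01, n10):
        """McNemar test statistic with continuity correction, equation (36).
        n01: correct by A only; n10: correct by B only."""
        if n01 + n10 == 0:
            return 0.0
        return (abs(n01 - n10) - 1) ** 2 / float(n01 + n10)

    # A value >= 3.8415 (chi-square, 1 degree of freedom, alpha = 0.05) indicates
    # the two classifiers differ with 95% or greater confidence.
    print(mcnemar_statistic(120, 85))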


CHAPTER 3: METHODS

3.1 Introduction

In this chapter a description of the methods used to implement the binary class feature selection process and related functions is presented. These methods include SVM parameter tuning, the specific implementation of Wrappers feature selection, merging of best features, a description of the multi-processor implementation, a description of experiments including unbalanced datasets, and experiments related to adding additional classes to existing classifiers.

Feature selection by binary combinations consists of three major steps: initial SVM parameter tuning, binary class feature selection, and binary class SVM parameter tuning. It is important that SVM parameters are tuned before feature selection is performed. If feature selection starts with poor SVM parameters the process will not be as effective at reducing the number of features. For example, Table 20 compares feature reduction results when SVM parameter tuning is done prior to feature selection and when it is not. The WFS dataset reduced to 60 features without the parameter tuning but 41 when parameter tuning was done first.

1) SVM parameter tuning. The search is driven by classification accuracy and then by correctness of probability prediction (CPP). It is important that the SVM parameters are tuned before feature selection. Poor SVM parameters will have a detrimental impact on the feature selection process.

2) Binary class feature selection. Using the SVM parameters determined in the previous step, perform feature selection for each binary combination of classes.

3) Binary class SVM parameter tuning. For each binary class combination, perform the SVM parameter search as done in step 1 above.


The labeled feature data is divided into two datasets, training and test. The training dataset is used in both the SVM parameter tuning and feature selection processes. The test dataset allows measurement of how well the binary class feature selection process does compared to the traditional multi-class feature selection process.

Table 20 Comparison of Feature Reduction, Parameter Tuning Before vs After Feature Selection

Dataset                   Initial Number Features   Default Parms Used   With Parm Tuning
Nine Class                73                        51                   43
WFS                       82                        60                   41
Forest Cover 1500/Class   54                        38                   32
Letter                    16                        16                   15

3.2 General Organization of Parameter Tuning and Feature Selection

The central unit in both SVM parameter tuning and feature selection is the n-fold cross validation. This is how a specific combination of features and SVM parameters is evaluated. The term Job refers to a specific set of features and SVM parameters. Each individual job is processed as a single unit of work on a single processor. Each job will have a status of Open, Started, Done, or Expanded. Open indicates that it is waiting for a process to select it and evaluate it. Started indicates that it has been selected by a process and is being evaluated. Done indicates that it has been evaluated; that is, an n-fold cross validation was performed and an accuracy assigned to it. Expanded indicates that it was selected for expansion; that is, a new set of jobs was created using it as a seed. See Table 21 for a list of fields that are assigned to each Job.

Figure 14 shows the basic flow that is used for both the SVM parameter search and the feature selection processes. When a process starts up it first determines whether the procedure has already started. This is indicated by the existence of a status file. If the status file already exists, then it knows that the procedure has already started and that it needs to read the status file to catch up to the current status. If the status file does not exist, it then needs to create a new


one and seed it with the initial set of jobs. In the case of the feature selection process this would be a single job with all features selected.

Table 21 List of Fields Maintained for Each Job

Field Name         Description
Job ID             A unique number that is sequentially assigned to every new job created. The first one created will have JobID = 0.
Parent ID          The ID of the job that was expanded to create this job. For example, during feature selection, when the best next job is selected for expansion several new ones will be created, each one varying by just one feature. Those jobs will have the expanded job's ID assigned as their Parent ID.
C parameter        The SVM parameters that are to be used for this job.
γ parameter
A parameter
Features to use    The features that are to be used for this job.
Status             Open: indicates that this job is available to be evaluated. Started: this job is being evaluated, meaning a 5-fold cross validation is being performed. Done: this job has been evaluated, assigned an accuracy, and is now available for expansion. Expanded: this job was selected for expansion, meaning new jobs have been created with this job assigned as their parent.
Test Job           Yes/No. Some jobs are test jobs. These are tests of a previously evaluated job. The Parent ID in this case will indicate the job that is being tested. A test job's evaluation will perform a test against a separate validation dataset; if no validation data is available, then a 10-fold cross validation will be performed on the training data.
Accuracy           The accuracy assigned to this job, typically classification accuracy or class-weighted-equally accuracy. It gets assigned when the job is done being evaluated and the status is set to Done.
Processing Time    Number of CPU seconds required to perform the evaluation.
Avg. Pred. Probability   Average predicted probability assigned to predictions.

There is an inner and an outer loop in the procedure. The outer loop consists of locating the next best candidates to expand, expanding the candidates by creating new jobs for processing, and then processing the individual jobs in the inner loop. The outer loop continues until a


termination condition is met. In the case of feature selection this means that feature subsets have been reduced to just one feature. In the case of SVM parameter tuning there is a fixed number of expansions. The inner loop simply consists of the processing of jobs that are flagged as open.


Figure 14 Basic Process Flow.
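The outer/inner loop of Figure 14 can be sketched as follows. This is a simplified single-process illustration that ignores the status-file bookkeeping described above; the callbacks and dictionary keys are hypothetical, not the PICES implementation.

    def run_search(open_jobs, evaluate, expand, is_finished):
        """open_jobs: list of job dicts; evaluate(job) -> accuracy (e.g. a 5-fold
        cross validation); expand(job) -> new job dicts; is_finished(done) -> bool."""
        done = []
        while True:
            # Inner loop: evaluate every job currently flagged as open.
            while open_jobs:
                job = open_jobs.pop()
                job["accuracy"] = evaluate(job)
                job["status"] = "Done"
                done.append(job)
            # Outer loop: stop on the termination condition, otherwise expand
            # the best evaluated job that has not yet been expanded.
            candidates = [j for j in done if j["status"] == "Done"]
            if is_finished(done) or not candidates:
                return max(done, key=lambda j: j["accuracy"])
            best = max(candidates, key=lambda j: j["accuracy"])
            best["status"] = "Expanded"
            open_jobs.extend(expand(best))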


3.3 SVM Tuning

3.3.1 Correctness of Probability Prediction (CPP)

Correctness of probability prediction (CPP) is the concept that the probability assigned by the classifier to a prediction should approximately reflect the chances of that prediction being correct. Specifically, if there were 100 predictions assigned a probability of 85%, then 85 of them should have been assigned the correct class. This is one of the criteria used when selecting the probability parameter (A).

3.3.2 Criteria

The search for parameters is driven by three criteria: classification accuracy, processing time, and correctness of probability prediction (CPP). The first criterion, classification accuracy, is by far the most important. After classification accuracy comes processing time. Of the sets of SVM parameters that produce the best accuracy, or near-best accuracy, the user would want to select the ones that allow for the fastest training time. In the set of SVM parameters that produce the highest classification accuracies there can be a wide range of processing times; the one that runs the longest can easily take several times longer than the one that runs fastest. After selecting the set of SVM parameters that have the highest accuracies and fastest training times, one is then interested in the correctness of probability prediction (CPP). This criterion is impacted by the probability parameter A and is refined separately after first finding the C and γ parameters.

Figure 15 shows the classification accuracy response of the Nine Class Plankton dataset to the γ (Gamma) and C (Cost) parameters. This is the result of a grid search where γ ranges from 0.00001 to 5 by multiples of 1.3 and C ranges from 1 to 1,000 by multiples of 1.15, providing 50 values of each parameter for a total of 2,500 parameter combinations, each evaluated by performing a 5-fold cross validation. From this figure it can be observed that γ has the greatest influence over classification accuracy, while the impact of C is less. Figure 16 shows the processing time response to the γ and C parameters. The Z axis represents the number of seconds required to perform a


5-fold cross validation for the given parameters. It goes from the longest time at the base to the shortest at the top. Note that the shortest processing times occur at approximately the same parameter values as the highest classification accuracies. Figure 17 shows the classification accuracy response for the Forest Cover dataset with 300 examples per class, using the same range of the γ and C parameters as used in Figure 15. The areas of highest classification accuracy are different than those of the Nine Class Plankton dataset, but the behavior is still similar. In Figures 15 and 17 several local maxima are observed. This behavior was also noted in [ 59 ], which describes a multi-pass algorithm similar to what is implemented here.

Figure 15 Classification Accuracy Response to γ and C on Nine Class Plankton Dataset. Right chart shows top 2%.


Figure 16 Processing Time Response to γ and C on Nine Class Plankton Dataset. Right chart shows top 10 seconds. The data from this chart reflects the CPU time it takes to process a 10-fold cross validation.

Figure 17 Classification Accuracy Response to γ and C on the Forest Cover Dataset with 300 Examples per Class. Right chart shows top 3%.

3.3.3 General Flow

The parameter search is a modified grid search similar to the parameter search done in [ 25 ]. It involves multiple passes, with the first pass being coarse, focusing on the γ parameter only, with C and A both fixed at 1 and 100 respectively. Each successive pass is finer, with first the C parameter added and then the A parameter. Each parameter set is evaluated by performing a 5-


fold cross validation, using classification accuracy and then processing time as selection criteria. The following pass then performs localized searches around each of these candidates using a finer level of granularity. After the C and γ parameters are located, the A value is then added to the search, starting with a large range and a coarse granularity, with correctness of probability prediction (CPP) added as a third criterion to the search. The last grid search pass is used to test the best parameter sets located during the search. Using the three criteria listed above, the 10 best parameter sets are located and evaluated by performing a 10-fold cross validation on the training dataset. From these final 10 parameter sets the best parameter set is selected using the three criteria of classification accuracy, processing time, and correctness of probability prediction (CPP).

Figures 15 and 16 indicate the classification accuracy and processing time, respectively, that resulted from a grid search of the γ and C parameters across the Nine Class Plankton dataset. This shows typical behavior seen in other datasets, where both training accuracy and training time exhibit pseudo-convex-like behavior. This is the motivation for using a multi-level grid search, starting out coarse and getting finer with each pass. It also shows that accuracy is more sensitive to γ than to C. For this reason, the first pass of the grid search only tunes the γ parameter.

3.3.4 Detailed Implementation

This procedure utilizes the framework described in Section 3.2. In this case each Job represents a specific set of SVM parameters. When the SVM parameter tuning procedure first starts, it creates a set of jobs for the SVM parameters that are to be evaluated. The jobs then go into a queue and await execution. Execution in this case is a 5-fold cross validation using the specified set of features and SVM parameters. When evaluation of each job is completed it is flagged as done and becomes available for expansion. When a job is selected for expansion it is flagged as expanded and new jobs will be created.


The job-creation function creates a new job with the Status field set to Open, indicating the job is awaiting evaluation, and assigns it the SVM parameters specified. This function is used when parameter tuning is first started, to seed the initial set of parameters to be evaluated, and during expansions.

Table 22 provides the details of the SVM parameter tuning procedure. Each step describes the major events that occur during the search for SVM parameters. The first step is what creates the initial set of jobs; it corresponds to the Initialize step in Figure 14. The intermediate steps correspond to the processing and expansion steps of Figure 14. Step 10 is where the final SVM parameters are selected and corresponds to the last step of Figure 14.

The cluster that is being used for these experiments contains 64 processors. To try to maximize throughput, steps 3 through 9 create multiples of 64 jobs each. This is done by using the exponent and logarithm functions to calculate growth rates for the parameters being searched. For example, in step 6 only the probability parameter A is being varied, so to calculate a growth rate g such that 64 jobs are created, the equation g = exp( ln( A_max / A_min ) / 64 ) is used, where A_min and A_max represent the lower and upper range of A to be searched.

Table 22 SVM Parameter Tuning Steps

1   The first step is to create the initial set of jobs. These jobs will perform a very coarse search over the γ parameter while holding C and A constant at 1 and 100 respectively. The idea is to locate the approximate range of γ values where high classification accuracy can be found.
    for ( ... )


Table 22 (Continued)

2   Select Best Accuracy.
    a. Select the parameter set that produced the best classification accuracy and record its γ. Perform a finer search around the best γ found so far and a very coarse range of C.
    b. for ( ... )
       for ( ... )

3   Pick the job with the highest accuracy and perform a less coarse search around its γ and C parameters.
    a. Select the job with the highest accuracy and record its γ and C.
    b. for ( ... )
       for ( ... )

4   This step is similar to step 3, but an even finer search around the highest accuracy found.
    a. Select the job with the highest accuracy and record its γ and C.
    b. for ( ... )
       for ( ... )

5   Select the 64 jobs whose parameter combinations produced the highest accuracy and create a Test Job for each one. Each test job will perform a 10-fold cross validation.


Table 22 (Continued)

6   Select the Test Job with the highest accuracy. The γ and C parameters from this job will be selected as the tuned parameters, and from this point the focus will be to tune the probability parameter A.
    for ( A ... )

7   Select the job with the smallest difference between classification accuracy and average predicted probability. This must be one of the jobs that have the same γ and C parameters selected in step 6. The first refinement of the A parameter will now be done.
    for ( A ... )

8   This is a repeat of step 7, making this the second refinement of the A parameter. Select the job with the smallest difference between classification accuracy and average predicted probability.
    for ( A ... )

9   Similar to step 5, 64 test jobs will be created to select the A parameter. This is done by selecting the 64 jobs that have the smallest difference between classification accuracy and average predicted probability. The parameters used in these jobs will then be tested by performing a 10-fold cross validation.

10  The last step: the probability parameter A is now selected.
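As a worked illustration of the growth-rate computation described above (the bounds here are arbitrary example values, not the ranges used in the experiments): searching A from A_min = 10 to A_max = 1,000 with 64 jobs gives g = exp( ln(1000/10) / 64 ) ≈ 1.075, so job i evaluates A_i = A_min · g^i. A minimal sketch:

    import math

    def parameter_grid(lo, hi, count=64):
        """Geometric sequence of 'count' values from lo toward hi,
        using the growth rate g = exp(ln(hi/lo)/count)."""
        g = math.exp(math.log(hi / lo) / count)
        return [lo * g ** i for i in range(count)]

    print(parameter_grid(10.0, 1000.0)[:5])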


3.4 Feature Selection

There is one feature selection method used, Wrappers [ 17 ], for both the multi-class and binary combination processes. In the case of two classes it is used once for each possible combination of classes. It utilizes the best-first strategy, starting with all features selected and reducing down by one feature at a time. The procedure keeps track of a pool of feature combinations that have been evaluated but not expanded. From this pool it selects for expansion the feature combination that produced the highest classification accuracy from a 5-fold cross validation. Expansion is the process of taking a given feature combination and creating from it new feature combinations. There are two different types of expansions performed, Shrink and Grow, represented by the functions Shrink() and Grow() respectively. Shrink creates new feature combinations by removing each of the selected features in turn, while Grow adds each feature that is not currently selected. The goal is to reduce the number of features, so Shrink is the function primarily used. At every tenth expansion the grow expansion is used in addition to the shrink expansion. This is meant to make sure there is no loss of any features that may have performed badly earlier in the search process but will do better as the feature count is reduced or as other features that they worked poorly with are removed. In addition, on every tenth expansion one feature set is selected at random for shrink expansion. The specific details of the algorithm are explained below.

In this process, Job has the same definition as it did in SVM parameter tuning (see Section 3.3.4), except in feature selection each job specifies a unique set of features. Execution, or evaluation, refers to a five-fold cross validation on the training set using the specified set of features and SVM parameters. When evaluation of each job is completed it will be flagged as done and becomes available for expansion. When a job is selected for expansion it will be flagged as expanded and new jobs will be created. Table 23 describes major variables and functions used in the feature selection algorithm. Table 24 lists the major steps in the feature selection algorithm.


Table 23 Feature Selection Variables and Functions

OpenJobs     Set of all jobs that have not been evaluated yet. Each job represents a different set of features to be evaluated.
Evaluated    Set of all feature combinations that have already been evaluated. Feature sets that have been evaluated before are not evaluated again.
Expanded     Set of all jobs that have been expanded, where each job represents a feature subset.
Shrink(F)    An expansion that shrinks the number of features. It creates subsets of features derived from F by removing one feature at a time. Ex: given F = {0, 1, 2} and a maximum of 8 features (0 through 7), Shrink(F) = { {1, 2}, {0, 2}, {0, 1} }.
Grow(F)      An expansion that grows the number of features. It creates supersets of features derived from F by adding one feature at a time. Ex: Grow(F) = { {0, 1, 2, 3}, {0, 1, 2, 4}, {0, 1, 2, 5}, {0, 1, 2, 6}, {0, 1, 2, 7} }.


Table 24 Feature Selection Steps

1   Feature selection starts with one job, in which all features are selected. This occurs in the Initialize step of Figure 14.

2   Process all jobs that are flagged as open; that is, perform a 5-fold cross validation on each one. If any of them produce an accuracy that is greater than the best accuracy found so far, record it as the new best and reset the expansion counter to 0; otherwise increment the counter.

3   Process an expansion. This occurs in the expansion step of Figure 14. If more than 50 expansions have been made without improvement, the best-first search is done; go to step 4. Otherwise expand the job in Evaluated that has the highest accuracy and is not a member of Expanded; on every tenth expansion also select one job at random that has not been expanded. Go to step 2.


Table 24 (Continued)

4   The best-first part of feature selection is now done. From this point a 5-wide beam search is implemented that will continue until selection is reduced down to 1 feature. Consider the set of jobs in Evaluated that have the least number of features. If the number of features has been reduced down to 1, go to step 6. Otherwise expand the 5 jobs in that set with the highest accuracy that are not already in Expanded.

5   This step is the same as step 2. Process all jobs that are flagged as open; that is, perform a 5-fold cross validation on each one. Go to step 4.

6   The final feature subset needs to be selected from all the feature subsets that have been evaluated. From the set of all evaluated feature combinations, select the candidates using highest accuracy, smallest number of features, and fastest training time as criteria.
    a. For each feature count from 1 to the number of features in the dataset, select the best feature subset of that size by accuracy.
    b. For each feature subset selected in the previous step, perform a 10-fold cross validation.
    c. From the set of all subsets evaluated in the previous step, select the one with the highest classification accuracy, followed by the least number of features, followed by the fastest processing time. This feature subset will be the result of this feature selection process.
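The shrink/grow expansions and the best-first loop of Tables 23 and 24 can be sketched as follows. This is a simplified single-process illustration (the evaluate callback stands in for the 5-fold cross validation; all names are illustrative), and it omits the beam-search phase and the periodic grow and random expansions.

    def shrink(features):
        """All subsets obtained by removing one feature (Table 23)."""
        return [features - {f} for f in features]

    def grow(features, all_features):
        """All supersets obtained by adding one unselected feature (Table 23)."""
        return [features | {f} for f in all_features - features]

    def best_first_selection(all_features, evaluate, patience=50):
        """Best-first wrapper search: expand the best unexpanded subset until
        'patience' expansions pass without improving the best accuracy."""
        start = frozenset(all_features)
        evaluated = {start: evaluate(start)}
        expanded = set()
        best_acc, since_improvement = evaluated[start], 0
        while since_improvement <= patience:
            candidates = [f for f in evaluated if f not in expanded]
            if not candidates:
                break
            node = max(candidates, key=lambda f: evaluated[f])
            expanded.add(node)
            since_improvement += 1
            for child in shrink(node):
                child = frozenset(child)
                if child and child not in evaluated:
                    acc = evaluate(child)
                    evaluated[child] = acc
                    if acc > best_acc:
                        best_acc, since_improvement = acc, 0
        return max(evaluated, key=lambda f: evaluated[f])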


3.5 Merge the N Best Feature Sets

There is a danger of overfitting, which can occur when the feature set that results in the best accuracy is selected: there may not be enough features to allow the classifier to generalize to unseen data. This becomes especially problematic with just two classes. Feature selection on just two classes reduces down to a much smaller subset of features than would occur with multiple classes. To deal with this situation it is proposed that the N best feature combinations be selected. That is, after performing feature selection, the N best feature sets by accuracy are selected and the union of these feature sets is used as the final selected features. This increases the number of features selected, but the cardinality is still less than what is selected by MFS.

3.6 Experimental Procedure

Experiments were performed on a 64-processor cluster consisting of 8 nodes with two quad-core processors per node, sharing 32 GB of RAM on each node. Each of the 64 processors runs at 3.2 gigahertz. Both the SVM parameter search and feature selection, which are used by both the BFS and MFS procedures, take advantage of the multi-processor environment.

The purpose of these experiments was to demonstrate the advantage of the binary class combinations feature selection (BFS) process over the multi-class feature selection (MFS) process. Figure 18 shows the flow of experiments performed on each dataset. Each dataset is first divided into training and test data. The training data is used as input into both the MFS and BFS procedures. The test data is used to test the resultant classifiers from the MFS and BFS procedures. The outputs of both procedures are classifiers, which are then compared by training on the training data and testing against the test data, which was not used as part of either the MFS or BFS procedures. The classifiers define the SVM parameters and selected features to be used. In the case of the MFS procedure there will be one set of parameters and selected


features to be used for all binary class combinations. The BFS procedure will produce a classifier that has separate SVM parameters and selected features for each binary class combination.

Figure 18 Experiment Procedure Steps.

3.7 Unbalanced Datasets

Two datasets were unbalanced with respect to training example distribution among their classes, WFS and ETP2008. The WFS dataset, with 33 classes, has 1,558 examples in its largest class and 27 examples in its smallest class. The ETP2008 dataset has 13 classes with fewer than 100 examples in the training dataset, with the smallest class having only 16 examples. In order to test the premise that the smaller the number of training examples, the less likely they are to properly represent the class, three approaches were tried to mitigate the poor performance on the smaller classes. In this situation the feature selection procedure is


In this situation the feature selection procedure is more likely to eliminate features that could be useful if more class examples existed. Since the MFS procedure has to satisfy the needs of all binary class combinations, it is less likely to eliminate features that might prove useful, whereas the BFS procedure focuses on just two classes at a time and so is more likely to eliminate these features. This leads to the idea of being less aggressive in eliminating features when performing feature selection involving classes with a small number of examples. Three approaches are described below. Results for the WFS dataset using these methods are shown in Table 35.

1) A small change is made to the way features are selected. Usually, when two different feature sets produce classifiers with the same accuracy, the one with the smaller number of features is selected. In this case, the feature set with the larger number of features is selected instead. This results in a classifier that has an overall larger number of features and is more likely to generalize to unseen data.

2) In this method all features are used for the binary classifiers where one of the two classes has a limited number of training examples. The binary class combinations that do not involve the small classes still have features selected as described in the BFS procedure. Compared to the normal BFS classifier, this group produced classifiers that had a small drop in overall classification accuracy but an improvement in class weighted equally accuracy. The minimum class size threshold is computed as a fraction of the average class size; six thresholds were tried: 5%, 10%, 15%, 20%, 25%, and 30% of the average class size. Each one results in successively longer training times, but all are still less than for the MFS produced classifier. There was very little change in either overall or class weighted equally classification accuracy between the thresholds, indicating that a threshold of 50 examples, impacting 2 of the 33 classes, is as good a selection as a minimum threshold of 200 examples, impacting 10 of the 33 classes.


3) Here a merge process is used, in which the N best feature sets are used rather than just the best feature set. For example, Merge2Best indicates that for each binary class combination the two best feature sets are merged together. So for class pair AC, where the two best feature sets by classification accuracy were {1, 3, 5} and {1, 3, 7}, the feature set {1, 3, 5, 7} would be selected for that particular class combination. Of the three cases Merge2Best, Merge3Best, and Merge4Best, Merge2Best resulted in the best classification accuracies.

3.8 Adding a Class

The addition of a class to an already existing classifier requires that feature selection and parameter tuning be performed again. In the case of the MFS procedure this requires re-running the whole procedure from the beginning; as a result it will take at least as long as the original MFS procedure did. Assuming that one started with k classes, adding a class creates k additional binary class combinations, so that the MFS procedure would have to process all k(k+1)/2 class combinations for both parameter tuning and feature selection. The BFS procedure would only need to process the k additional class combinations, so that the potential speed up of SVM feature selection and parameter tuning is roughly (k+1)/2. For example, going from 8 to 9 classes, MFS must reprocess all 36 binary combinations while BFS only needs to process the 8 new ones.

3.9 Multi-Processor Implementation

The feature selection process was designed to be implemented in a multiprocessor environment. It can be run on multiple Windows and Unix based computers at the same time. The only requirement is that all instantiations of the process have access to a common directory that supports file locking. Communication between processes is managed through a lock file and a status file located in the common directory. The lock file is used to ensure consistency, while the status file is used to record process status, allowing for communication between processes and easy restart. All processes used in the procedure are shown in Figure 19.


Figure 19 (a) shows the procedure used to update the status file. The lock file is created with a simple open-exclusive operation, a function that is supported by most platforms and that allows only one process to create the file. Once the lock is held, the process reads the status file starting at the point it last read up to, updating any state variables as needed. For example, HighestAccuracy, the highest accuracy found so far amongst all the nodes searched on any processor, is updated if a newly read accuracy is greater than the current value. The final step writes any changed state variables back to the status file. Through this procedure the processes are able to keep their state variables in sync with each other. When a new process is started it reads the status file from the beginning and as a result obtains the current state. Figure 19 (b) shows an example of updating the HighestAccuracy variable, of which all processes must be aware. The decision to update it is made after the status file has been read, and the update to the status file occurs only if the variable changes. A minimal sketch of this protocol is shown below Figure 19.


Figure 19 Status File Update Procedure: (a) status file update procedure; (b) example of updating the HighestAccuracy variable.
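The following is a minimal sketch of the lock-file and status-file protocol described above, assuming a simple key=value record format. The file names, the record format, and the in-memory state dictionary are assumptions made for illustration; the essential point is that an exclusive-create open lets exactly one process hold the lock while it reads and appends to the shared status file.

import os
import time

LOCK = "FeatureSelection.lock"      # hypothetical file names in the shared directory
STATUS = "FeatureSelection.status"

# `state` starts as {"offset": 0} and holds each process's copy of the shared
# variables, including HighestAccuracy.
def update_highest_accuracy(new_accuracy, state):
    # Acquire the lock: O_CREAT | O_EXCL lets only one process create the file.
    while True:
        try:
            fd = os.open(LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(0.1)
    try:
        with open(STATUS, "a+") as f:
            # Read forward from where we last left off so our copy of the
            # shared state (including HighestAccuracy) is current.
            f.seek(state["offset"])
            for line in f:
                key, value = line.split("=", 1)
                state[key] = float(value)
            state["offset"] = f.tell()
            # Write (and broadcast) only if we really found a better value.
            if new_accuracy > state.get("HighestAccuracy", 0.0):
                state["HighestAccuracy"] = new_accuracy
                f.write(f"HighestAccuracy={new_accuracy}\n")
                state["offset"] = f.tell()
    finally:
        os.close(fd)
        os.remove(LOCK)               # release the lock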


CHAPTER 4: RESULTS

Experiments were performed on a 64 processor cluster consisting of 8 nodes with two quad-core processors per node sharing 32 GB of RAM. Each of the 64 processors runs at 2.66 gigahertz. The purpose of these experiments was to compare the performance of the multi-class feature selection (MFS) process and the binary class combination feature selection (BFS) process. A procedure was designed that starts with a dataset divided into training and test data and performs the two independent steps of MFS and the three independent steps of BFS. The binary class combination feature selection process consists of three major steps: parameter tuning for all classes combined, feature selection for each binary class combination, and parameter tuning for each binary class combination.

4.1 Results Showing Accuracy and Time Improvements

Tables 34 through 39 show the detailed results for each of the datasets. For each step in the feature selection process two rows are created, each showing the performance of the resultant classifier: one using prediction by voting and the other prediction by probability. The first two steps (first four rows) show the parameter tuning and feature selection for traditional multi-class feature selection (MFS). The last two steps (last four rows) show the last two steps of the binary class feature selection (BFS) process; the first step in BFS is the same as the first step in MFS, so the results of that step are reused. In each set of results two lines are highlighted in bold: one represents the result from MFS that has the highest accuracy and the other the result from BFS that has the highest accuracy.


Equations (37) through (39) show how the values of accuracy gain, speed up, and average number of features are computed. Accuracy gain, Equation (37), is the difference between the starting and ending classification accuracy divided by the starting accuracy. Equation (38) defines speed up, which is the original time divided by the new time. For example, if the MFS procedure takes 10,000 seconds to process and the BFS procedure takes 2,500 seconds to process, then there was a speed up of 10,000 / 2,500 = 4.0. A speed up greater than 1.0 means the new process is faster and a speed up less than 1.0 means it is slower.

AccuracyGain = (AccuracyEnd - AccuracyStart) / AccuracyStart   (37)

SpeedUp = TimeOriginal / TimeNew   (38)

Equation (39) calculates the average number of features for a BFS-derived classifier. It reflects the fact that each binary class SVM has its own set of features and its own set of training examples. The idea is that binary classifiers with a greater number of training examples should be weighted more than those with a smaller number of training examples. The premise for weighting by number of training examples comes from the fact that the more examples a given binary classifier has to work with, the longer it will take to train and, most likely, the longer it will take to predict, because it will have more examples and support vectors (SVs).

AvgNumFeatures = ( sum over binary classifiers b of f_b * n_b ) / ( sum over binary classifiers b of n_b ),
where f_b is the number of features selected for binary classifier b and n_b is its number of training examples.   (39)
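The three metrics translate directly into code. The sketch below is a minimal illustration; the per-pair record structure passed to avg_num_features is an assumption made for the example.

# Minimal sketch of Equations (37)-(39).
def accuracy_gain(start_acc, end_acc):
    # Equation (37): relative change in classification accuracy.
    return (end_acc - start_acc) / start_acc

def speed_up(original_time, new_time):
    # Equation (38): e.g. 10,000 s (MFS) / 2,500 s (BFS) = 4.0.
    return original_time / new_time

def avg_num_features(pairs):
    # Equation (39): per-pair feature counts weighted by the number of
    # training examples each binary SVM sees.
    # `pairs` is a list of (num_features, num_training_examples) tuples.
    total_examples = sum(n for _, n in pairs)
    return sum(f * n for f, n in pairs) / total_examples

print(speed_up(10_000, 2_500))   # 4.0, matching the example in the text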


4.2 Feature Selection Time Analysis

Tables 25 through 30 show the CPU and longest path times in seconds used for each step of the two feature selection methods, MFS and BFS. The top half of each table shows the processing time for the multi-class feature selection (MFS) steps, while the bottom half shows the processing time for the binary class combination feature selection (BFS). The time spent doing MFS parameter tuning, shown in the first row, is also included in the totals for the bottom (BFS) part. This is because, as mentioned earlier, MFS parameter tuning is also the first step in BFS. The first column provides a description of the step. The second column indicates the number of support vectors (SVs) created when building the classifier from the training data. The third column shows the number of features selected as a result of the processing step. The fourth column shows the total CPU time in seconds involved in the processing step; this includes the time spent performing the 5-fold cross validations used to evaluate specific SVM parameters and feature selections plus the overhead time required in managing the feature selection process. The fifth column shows the time spent performing the 5-fold cross validations; the difference between the fourth and fifth columns represents the overhead in managing the processing steps. The sixth column represents the longest time in CPU seconds spent by any of the 64 processors for the processing step the row represents. Both the top (MFS) and bottom (BFS) halves have totals representing the total amount of time required to perform their respective feature selection processes. Longest path time is the longest amount of time any one individual processor spent on a given task. For example, the Nine Class Plankton dataset required 216,054 CPU seconds divided amongst 64 processors to process the BFS procedure; the longest any one individual processor spent was 6,756 CPU seconds.


Table 25 Nine Class Plankton; Feature Selection and Parameter Tuning Times

Description            Number SVs  Number Features  Total CPU Time (s)  CPU Classifier Time (s)  Longest Path (s)
MFS Parameter tuning   3,668       73.0             45,050              36,678                   1,160
MFS Feature selection  3,319       43.0             253,649             249,305                  7,491
Total MFS time                                      298,699             285,983                  8,652
BFS Feature selection  5,846       19.0             158,022             145,501                  5,205
BFS Parameter tuning   6,503       19.0             12,982              9,807                    391
Total BFS time                                      216,054             191,986                  6,756
Speed Ups                                           1.38                1.49                     1.28

Table 26 WFS; Feature Selection and Parameter Tuning Times

Description            Number SVs  Number Features  Total CPU Time (s)  CPU Classifier Time (s)  Longest Path (s)
MFS Parameter tuning   11,650      82.0             254,386             221,682                  5,814
MFS Feature selection  11,222      45.0             4,794,304           4,765,748                77,708
Total MFS time                                      5,048,690           4,987,430                83,522
BFS Feature selection  12,044      15.7             1,213,275           1,035,906                19,900
BFS Parameter tuning   11,405      15.7             78,546              49,122                   1,317
Total BFS time                                      1,546,207           1,306,709                27,031
Speed Ups                                           3.27                3.82                     3.09


Table 27 ETP2008 Station 1; Feature Selection and Parameter Tuning Times

Description            Number SVs  Number Features  Total CPU Time (s)  CPU Classifier Time (s)  Longest Path (s)
MFS Parameter tuning   10,042      83.0             227,322             207,023                  9,361
MFS Feature selection  9,252       40.0             3,437,356           3,417,029                109,091
Total MFS time                                      3,664,678           3,624,052                118,452
BFS Feature selection  9,758       10.0             1,223,904           718,445                  38,706
BFS Parameter tuning   12,044      10.0             123,815             38,607                   3,900
Total BFS time                                      1,575,040           964,076                  51,967
Speed Ups                                           2.33                3.76                     2.28

Table 28 Forest Cover Dataset; 300/Class; Feature Selection and Parameter Tuning Times

Description            Number SVs  Number Features  Total CPU Time (s)  CPU Classifier Time (s)  Longest Path (s)
MFS Parameter tuning   1,550       54.0             5,438               4,222                    209
MFS Feature selection  1,406       26.0             18,281              17,779                   800
Total MFS time                                      23,719              22,001                   1,009
BFS Feature selection  1,365       12.9             11,540              9,924                    455
BFS Parameter tuning   1,615       12.9             2,573               1,609                    152
Total BFS time                                      19,551              15,755                   816
Speed ups                                           1.21                1.40                     1.24


Table 29 Forest Cover Dataset; 1,500/Class; Feature Selection and Parameter Tuning Times

Description            Number SVs  Number Features  Total CPU Time (s)  CPU Classifier Time (s)  Longest Path (s)
MFS Parameter tuning   6,580       54.0             123,113             96,209                   2,405
MFS Feature selection  5,832       32.0             381,848             375,217                  6,620
Total MFS time                                      504,960             471,426                  9,025
BFS Feature selection  5,597       13.7             197,036             186,747                  4,901
BFS Parameter tuning   6,719       13.7             45,322              34,044                   1,197
Total BFS time                                      365,471             317,000                  8,503
Speed ups                                           1.38                1.49                     1.06

Table 30 Letter Dataset; Feature Selection and Parameter Tuning Times

Description            Number SVs  Number Features  Total CPU Time (s)  CPU Classifier Time (s)  Longest Path (s)
MFS Parameter tuning   7,634       16.0             81,794              66,948                   1,979
MFS Feature selection  7,274       15.0             53,507              52,223                   1,272
Total MFS time                                      135,300             119,171                  3,251
BFS Feature selection  6,754       7.9              36,348              31,700                   619
BFS Parameter tuning   8,656       7.9              35,585              26,782                   655
Total BFS time                                      153,727             125,429                  3,253
Speed ups                                           0.88                0.95                     1.00

Table 31 Sat Image Dataset; Feature Selection and Parameter Tuning Times

Description            Number SVs  Number Features  Total CPU Time (s)  CPU Classifier Time (s)  Longest Path (s)
MFS Parameter tuning   1,396       36.0             3,537               2,869                    116
MFS Feature selection  1,400       17.0             7,982               7,730                    242
Total MFS time                                      11,519              10,599                   358
BFS Feature selection  1,290       12.3             7,359               6,648                    256
BFS Parameter tuning   1,420       12.3             2,581               1,970                    100
Total BFS time                                      13,477              11,488                   471
Speed ups                                           0.85                0.92                     0.76


Table 32 shows a summary of CPU and longest process times in seconds for all the datasets. The first column indicates the dataset. The second and third columns are CPU and longest path time in seconds for the MFS procedure. The fourth and fifth columns are CPU and longest path time in seconds for the BFS process. The sixth and seventh columns are the resulting speed ups achieved by the BFS process over the MFS process for CPU and longest path times. All the datasets except Letter and Sat Image had a speed up in CPU time with BFS. In the case of the WFS dataset, this meant a savings of 1,187 CPU hours. In terms of longest path times, all datasets except Letter and Sat Image had a speed up, with the WFS dataset having the best at 2.45 times and Forest Cover 1500 IPC having only a speed up of 1.06. In the case of the WFS dataset this meant that the user had to wait 11.3 hours less for the tuned classifier. Figure 20 shows a chart that indicates the number of feature combinations processed to reach a given feature count for the Nine Class Plankton dataset. There are two series plotted, one for the MFS procedure and one for the BFS procedure. The BFS series represents the average over all the different binary class combinations. Both methods approach a feature count of 43 with approximately the same number of feature combinations. At that point the MFS approach evaluates 5,000 feature subsets to reach 36 features, constantly finding slightly better combinations of features as per a 5-fold cross validation. This reflects the difficulty of finding a common set of features that will satisfy all the different class combinations. The BFS approach does not exhibit this behavior, reflecting the fact that each binary class combination is searched independently of the other combinations.


Table 32 Summary of CPU and Longest Path Times Required for Processing

Dataset                 MFS CPU (s)  MFS Longest Path (s)  BFS CPU (s)  BFS Longest Path (s)  Speed Up CPU  Speed Up Longest Path
Nine Class Plankton     298,699      8,652                 216,054      6,756                 1.38          1.28
WFS                     3,987,283    68,914                1,576,727    28,076                2.53          2.45
ETP2008 Station 1       3,664,678    118,452               1,575,040    51,967                2.33          2.28
Forest Cover 300 IPC    23,719       1,009                 19,551       815                   1.21          1.24
Forest Cover 1500 IPC   504,960      9,025                 365,471      8,503                 1.38          1.06
Letter                  135,300      3,251                 153,727      3,253                 0.88          1.00
Sat Image               11,519       357                   13,477       471                   0.85          0.76

Figures 20 and 21 are for the Nine Class Plankton dataset. They show the number of feature combinations evaluated and the number of CPU seconds consumed, respectively, to reduce down to a given feature count. There are two series in each chart: one for the MFS approach and the other for the BFS approach. The MFS approach produced a classifier that required 43 features, but the feature selection process did not switch over to the beam search step until it reached 36 features. The BFS approach produced a classifier that required a mean of 19 features across all the binary combinations, with the feature selection process switching over to the beam search at 25.8 features. Figures 22 and 23 are for the WFS dataset and show the number of feature combinations evaluated and the number of CPU seconds consumed, respectively, to reduce down to a given feature count. The MFS procedure produced a classifier that required 41 features, with the switch to the beam search occurring earlier at 58 features. The BFS procedure produced a classifier that requires only 15.7 features, with the switch to the beam search occurring at 23.5 features. With both the Nine Class and WFS datasets the MFS procedure requires a large number of feature combinations to be evaluated before the switch from best case next to beam search occurs. This is reflected in the steep slope that the MFS series exhibits in all four figures before the switch to the beam search occurs. Once the switch is made, both datasets quickly have the number of features reduced down to one feature.


In contrast, in both datasets the BFS series shows a consistent smooth rise that is not as steep as the MFS procedure's. This is where the BFS procedure saves processing time over the MFS procedure.

Figure 20 Nine Class Plankton Feature Combinations Evaluated to Reach a Given Feature Count. (Chart: combinations evaluated versus feature count, MFS and BFS series; axis-tick data not reproduced.)


Figure 21 Nine Class Plankton CPU Seconds Consumed to Reach a Given Feature Count. (Chart: CPU seconds consumed versus feature count, MFS and BFS series; axis-tick data not reproduced.)

Figure 22 WFS Feature Combinations Evaluated to Reach a Given Feature Count. (Chart: feature combinations evaluated versus feature count, MFS and BFS series; axis-tick data not reproduced.)


Figure 23 WFS CPU Seconds Consumed vs Feature Count. (Chart: CPU seconds consumed versus feature count, MFS and BFS series; axis-tick data not reproduced.)

Table 33 shows the number of parameter and feature combinations that were processed during the parameter tuning and feature selection steps. The first column provides the name of the dataset. The second and third columns show the number of combinations processed during the parameter tuning and feature selection steps of multi-class feature selection (MFS). The fourth and fifth columns show the number of combinations processed during the parameter tuning and feature selection steps of the binary class combination feature selection (BFS) process. The MFS feature selection procedure typically builds about twice as many binary SVMs as the BFS procedure.

Table 33 Number of Binary SVMs Built Performing Parameter Search and Feature Selection

Dataset                    MFS Parameter  MFS Feature Sel  BFS Parameter  BFS Feature Sel
Nine Classes Plankton      19,188         392,724          19,171         245,805
WFS                        281,424        9,784,896        280,825        3,609,407
ETP 2008 Station 1         791,505        20,681,595       789,904        9,228,177
Forest Cover 300/Class     11,151         115,290          11,174         62,509
Forest Cover 1,500/Class   11,193         130,368          11,170         70,105
Letter                     172,900        295,425          172,972        256,139


4.3 Classification Accuracy and Training Time Improvements

Tables 34 through 39 show the detailed accuracy and training time results for each of the datasets. For each step in the feature selection process two rows are created, each showing the performance of the resultant classifier: one using prediction by voting and the other prediction by probability. The first two steps (first four rows) show the parameter tuning and feature selection for traditional multi-class feature selection (MFS). The last two steps (last four rows) show the last two steps of the binary class feature selection (BFS) process. In each set of results, BFS and MFS, the selection method column is highlighted for the method that had the highest classification accuracy. If the two results are statistically different as per a McNemar's test [57], then the whole line is highlighted; in Table 34, for example, neither line is highlighted in this way because the results are not statistically different. All timing results, training time and test time, are in seconds of CPU time consumed.

Table 34 shows the detailed results for the Nine Class Plankton dataset. The best MFS procedure using voting had a classification accuracy of 90.29% while the best BFS classifier had a classification accuracy of 90.42%, reflecting an accuracy gain of 0.15%. The two classifiers are not statistically significantly different as per a McNemar's test [57]. There was a speed up of 2.2 in training time. MFS feature selection reduced the number of features to 43 while BFS feature selection reduced them from 73 to 19.
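To make the significance criterion concrete, the following is a minimal sketch of a McNemar's test [57] on two classifiers' predictions. The chi-square form with continuity correction is used here as an assumption; the exact variant applied in the experiments is not spelled out in this section.

from math import erf, sqrt

# b = number of test examples classifier A got right and classifier B got
# wrong; c = the reverse.  Returns True when the two classifiers differ
# significantly at level alpha.
def mcnemar_significant(b, c, alpha=0.05):
    if b + c == 0:
        return False                                # identical disagreement counts
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)          # 1 degree of freedom
    # Survival function of chi-square(1): P(X > chi2) = 2 * (1 - Phi(sqrt(chi2)))
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + erf(sqrt(chi2) / sqrt(2.0))))
    return p_value < alpha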


Table 34 Nine Class Plankton; Most Accurate Set of Features

Method  Description                          Sel. Meth.  Test Acc.  Wtd. Acc.  Train Time (s)  Test Time (s)  Avg. # Features
MFS     SVM parms tuned                      Voting      89.82%     89.82%     12.2            7.0            73.0
                                             Prob        89.96%     89.96%     11.8            7.1
        SVM parms tuned, Features Selected   Voting      90.29%     90.29%     7.8             3.6            43.0
                                             Prob        90.22%     90.22%     7.1             3.6
BFS     Features selected                    Voting      90.40%     90.40%     3.7             5.9            19.0
                                             Prob        90.00%     90.00%     3.6             6.4
        Features selected, SVM parms tuned   Voting      90.42%     90.42%     3.5             7.0            19.0
                                             Prob        90.29%     90.29%     3.9             7.0
Speed up                                                                        2.2             0.5

Table 35 shows the results for the WFS dataset. When comparing the BFS approach with the MFS approach, the BFS approach had a 1.19% increase in overall accuracy but only a 0.19% increase when classes are weighted equally; the results are statistically significantly different. There was a speed up in training time of 1.3 times. This particular dataset consists of 33 classes, and the classes are very unbalanced. In the training set the largest class consists of 1,558 examples while the smallest class contains just 27 examples. The smallest class, echinoderm bipinnaria, which only had 7 examples in the test set, went from 85.74% classification accuracy using the MFS generated classifier to 57.14% using the BFS generated classifier. The pteropod class, with only 38 examples in the test dataset, also takes a large hit in classification accuracy, going from 63.16% to 50.00%. It is worth noting that prediction by probability did very poorly compared to prediction by voting in all cases except the BFS produced classifier, where there was an improvement in both overall accuracy and class weighted equally accuracy. This is different than what was observed with the Nine Class Plankton dataset, where the difference between the two prediction methods was not significant. This is probably due to the unbalanced nature of the dataset, where there are several classes with few training examples.
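For reference, a minimal sketch of the two prediction modes compared in these tables follows. The simple averaging used for prediction by probability is an assumption standing in for the pairwise-coupling method actually used by the classifier; the pairwise dictionary of per-pair probability estimates is a hypothetical structure for illustration.

import numpy as np

# `pairwise` maps a class pair (i, j) to that binary SVM's estimate of
# P(class i | example) for the example being predicted.
def predict_by_voting(pairwise, num_classes):
    votes = np.zeros(num_classes)
    for (i, j), p_i in pairwise.items():
        votes[i if p_i >= 0.5 else j] += 1           # each binary SVM casts one vote
    return int(np.argmax(votes))

def predict_by_probability(pairwise, num_classes):
    prob = np.zeros(num_classes)
    for (i, j), p_i in pairwise.items():
        prob[i] += p_i                                # accumulate pairwise probabilities
        prob[j] += 1.0 - p_i
    return int(np.argmax(prob / (num_classes - 1)))   # average over the pairs each class is in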


Table 35 WFS; Most Accurate Set of Features

Method  Description                          Sel. Meth.  Test Acc.  Wtd. Acc.  Train Time (s)  Test Time (s)  Avg. # Features
MFS     SVM parms tuned                      Voting      76.45%     69.32%     44.8            29.0           82.0
                                             Prob        72.30%     71.83%     45.2            29.0
        SVM parms tuned, Features Selected   Voting      76.73%     69.18%     26.0            20.0           41.0
                                             Prob        67.75%     70.64%     26.0            19.7
BFS     Features selected                    Voting      77.45%     69.33%     18.8            52.6           15.7
                                             Prob        69.14%     66.35%     18.7            52.7
        Features selected, SVM parms tuned   Voting      77.35%     68.96%     18.8            63.6           15.7
                                             Prob        77.64%     69.31%     19.3            63.4
Speed Up                                                                        1.3             0.3

Table 36 shows the results for the ETP2008 dataset. There was a 2.42% gain in overall classification accuracy but a 3.90% loss in class weighted equally accuracy. At the same time there was a speed up of 1.3 in training time. A McNemar's test shows that the BFS produced classifier is statistically significantly different from the MFS classifier, as indicated by the BFS row using prediction by voting being bold. There were 13 classes with less than 100 examples in the training dataset, the smallest class having only 16 examples. Of the ten classes that had the greatest accuracy loss, eight had less than 100 training examples, while the ten classes with the largest accuracy gain all had more than 100 training examples.

Table 36 ETP2008 Station 1; Most Accurate Set of Features

Method  Description                          Sel. Meth.  Test Acc.  Wtd. Acc.  Train Time (s)  Test Time (s)  Avg. # Features
MFS     SVM parms tuned                      Voting      83.33%     77.54%     32.4            169.5          83.0
                                             Prob        72.09%     77.54%     32.5            169.8
        SVM parms tuned, Features Selected   Voting      83.53%     78.05%     18.4            116.6          40.0
                                             Prob        65.25%     77.31%     18.3            115.1
BFS     Features selected                    Voting      84.01%     75.73%     14.1            297.6          10.0
                                             Prob        71.38%     65.03%     13.9            302.0
        Features selected, SVM parms tuned   Voting      85.55%     75.00%     13.9            424.7          10.0
                                             Prob        85.66%     74.52%     13.8            430.1
Speed up                                                                        1.3             0.3


When comparing prediction by voting and prediction by probability, results similar to those noted with the WFS dataset were observed; that is, except for the BFS produced classifier, prediction by probability did very poorly compared to prediction by voting. As noted with the WFS dataset, this dataset is very unbalanced with respect to the number of training examples per class.

Table 37 shows the results for the Forest Cover dataset with 300 training examples per class. The BFS produced classifier has better overall accuracy as well as better class weighted equally accuracy. There is an accuracy gain of 6.0% and 1.5% for the BFS classifier over the MFS produced classifier for overall accuracy and class weighted equally accuracy, respectively. The BFS row in bold indicates that it is statistically significantly different from the MFS results.

Table 37 Forest Cover Dataset; 300/Class; Most Accurate Set of Features

Method  Description                          Sel. Meth.  Test Acc.  Wtd. Acc.  Train Time (s)  Test Time (s)  Avg. # Features
MFS     SVM parms tuned                      Voting      58.38%     72.83%     2.2             223.0          54.0
                                             Prob        57.51%     72.88%     2.0             221.4
        SVM parms tuned, Features Selected   Voting      58.94%     73.35%     0.9             133.9          26.0
                                             Prob        52.69%     71.54%     1.2             134.2
BFS     Features selected                    Voting      61.71%     74.23%     0.5             181.1          12.9
                                             Prob        55.16%     72.01%     0.5             203.2
        Features selected, SVM parms tuned   Voting      62.87%     74.22%     0.7             271.0          12.9
                                             Prob        62.70%     74.43%     1.0             270.4
Speed up                                                                        1.0             0.5

Table 38 shows results for the Forest Cover dataset with 1,500 training examples per class. The MFS procedure reduced the number of features from 54 to 32, while the BFS procedure reduced the feature set to a mean of 13.7, resulting in an additional 2.0% overall classification accuracy and 0.8% class weighted equally accuracy. The BFS row in bold indicates that it is statistically significantly different from the MFS results.


Table 38 Forest Cover Dataset; 1,500/Class; Most Accurate Set of Features

Method  Description                          Sel. Meth.  Test Acc.  Wtd. Acc.  Train Time (s)  Test Time (s)  Avg. # Features
MFS     SVM parms tuned                      Voting      69.59%     82.72%     38.8            1,348.6        54.0
                                             Prob        68.93%     82.56%     39.0            1,375.8
        SVM parms tuned, Features Selected   Voting      71.15%     84.18%     20.8            652.1          32.0
                                             Prob        69.33%     83.41%     21.6            725.1
BFS     Features selected                    Voting      70.84%     84.13%     9.5             1,191.5        13.7
                                             Prob        68.41%     82.47%     10.7            1,267.0
        Features selected, SVM parms tuned   Voting      72.56%     84.86%     16.2            1,772.7        13.7
                                             Prob        72.51%     84.97%     16.2            1,832.1
Speed up                                                                        1.3             0.4

Table 39 shows the results for the Letter dataset. There was a speed up of two times in training time but also a significant loss in classification accuracy. The Letter dataset started with a classification accuracy of 97.73% before any feature selection was done. The MFS procedure reduced from 16 down to 15 features, while the BFS procedure reduced down to 7.9 features. The best accuracy is with all features and tuned SVM parameters, indicating that perhaps all features are of good quality. It appears that this particular dataset does not require feature selection but just SVM parameter tuning. In [60] the authors used a mutual information maximization scheme to search for features to eliminate. Their results indicate that they could not locate a subset of features that performed as well as using all 16 features; their best classification accuracy, with all 16 features, was 87.68%, using 75% of the dataset for training and 25% for test. In [54] the authors used C4.5, building an ensemble of 200 classifiers, and achieved an accuracy rate of 100%. In [61] the authors implemented a class decision tree in which each node in the tree implements a classifier (SVM, three nearest neighbors, etc.) with features that are specific to the class the node is making a decision for. When considering all nodes, one feature was eliminated from the complete decision tree. They were able to improve classification accuracy from 91.66% to 95.05% using a 10-fold cross validation.


Table 39 Letter Dataset; Most Accurate Set of Features

Method  Description                          Sel. Meth.  Test Acc.  Wtd. Acc.  Train Time (s)  Test Time (s)  Avg. # Features
MFS     SVM parms tuned                      Voting      97.73%     97.72%     21.9            6.6            16
                                             Prob        97.63%     97.61%     21.3            6.9
        SVM parms tuned, Features Selected   Voting      97.65%     97.64%     18.7            6.1            15
                                             Prob        97.45%     97.44%     18.5            6.0
BFS     Features selected                    Voting      96.88%     96.86%     6.6             22.0           7.9
                                             Prob        96.35%     96.34%     6.6             21.7
        Features selected, SVM parms tuned   Voting      96.98%     96.96%     9.1             32.3           7.9
                                             Prob        96.98%     96.96%     9.2             33.0
Speed up                                                                        2.0             0.2

Table 40 shows the results for the Sat Image dataset. Similar to the Letter dataset, using all features produced the highest classification accuracy. After feature selection the BFS approach had slightly better overall and class weighted equally classification accuracy (89.50% versus 89.35% and 87.21% versus 87.02%, respectively), but not statistically significantly better. Table 41 shows results as reported in [19], a paper that implements binary class pairwise feature selection using Wrappers and the learning algorithms one nearest neighbor, three nearest neighbor, and a Bayes learner. Results for all features were not reported. Overall classification accuracy is reported, but not class weighted equally accuracy. The top three rows show results when global feature selection is performed across all class pairs, and the bottom three rows show the results when features are selected by binary class pair. When comparing global feature selection with pairwise feature selection, the one nearest neighbor algorithm using pairwise feature selection had slightly better classification accuracy, 87.30% versus 87.00%, and the other two learning algorithms had a small loss in accuracy. Feature reduction for the three learning algorithms (9.4, 9.0, and 7.4 features) was better than the 12.2 achieved by BFS in Table 40.


Table 40 Sat Image; Most Accurate Set of Features

Method  Description                          Sel. Meth.  Test Acc.  Wtd. Acc.  Train Time (s)  Test Time (s)  Avg. # Features
MFS     SVM parms tuned                      Voting      91.05%     89.12%     0.9             0.5            36.0
                                             Prob        90.80%     88.82%     0.9             0.5
        SVM parms tuned, Features Selected   Voting      89.35%     87.02%     0.7             0.4            17.0
                                             Prob        89.25%     86.82%     0.7             0.4
BFS     Features selected                    Voting      88.70%     85.99%     0.6             0.5            12.2
                                             Prob        88.35%     85.17%     0.6             0.5
        Features selected, SVM parms tuned   Voting      89.45%     87.16%     0.7             0.6            12.2
                                             Prob        89.50%     87.21%     0.7             0.7
Speed up                                                                        1.0             0.7

Table 41 Sat Image Dataset as Reported by [19], a Pairwise Feature Selection Paper

Description                          Classification Accuracy  Avg # Features
Global Feature Selection, 1-NN       87.00%                   22.7
Global Feature Selection, 3-NN       86.90%                   19.1
Global Feature Selection, Bayes      85.20%                   10.1
Pairwise Feature Selection, 1-NN     87.30%                   9.4
Pairwise Feature Selection, 3-NN     86.80%                   9.0
Pairwise Feature Selection, Bayes    85.00%                   7.4

Table 42 provides a summary for each dataset indicating accuracy improvements, speed ups, and processing times. It gives an overall view of how the binary feature selection (BFS) procedure performs. The parameter tuning and feature selection processes were performed on a 64 processor cluster. An accuracy gain occurred for all datasets except Letter, while training time improved for all datasets except Forest Cover with 300 examples per class. Bold indicates that the BFS results are statistically significantly different from the MFS results.


Table 42 Summary of Results

Dataset                   Acc. Gain over No F/S  Acc. Gain over MFS F/S  Train Speed Up  Test Speed Up  MFS Feature Sel Time  BFS Feature Sel Time  F/S Speed Up
Nine Classes Plankton     0.67%                  0.15%                   2.2             0.5            298,699               216,054               1.38
WFS                       1.56%                  1.18%                   1.3             0.3            5,048,690             1,546,207             3.27
ETP 2008 Station 1        2.67%                  2.42%                   1.3             0.3            3,664,678             1,575,040             2.33
Forest Cover 300/Class    7.40%                  5.99%                   1.0             0.5            23,719                19,551                1.21
Forest Cover 1,500/Class  4.27%                  1.98%                   1.3             0.4            504,960               365,471               1.38
Letter Dataset            -0.77%                 -0.69%                  2.0             0.2            135,300               153,727               0.88
Sat Image                 -1.70%                 0.17%                   1.0             0.7            11,519                13,477                0.85

4.4 Unbalanced Datasets

Table 43 shows the results of applying the three approaches described in Section 3.7 to the WFS dataset, plus the top results from Table 35. For convenience the top results from Table 35 are repeated at the top of the table, followed by the three approaches. The first column indicates which method is being employed. The second gives a description of the method with any applicable threshold. Rows that are in bold indicate that the results in that row are statistically different from the MFS results. The rest of the columns are the same as in Table 36. For group two, the description specifies the threshold as a percentage of the average class size. For the WFS dataset the average number of examples per class is 509.3, so the row that specifies 10% indicates that for any binary SVM where either class has less than (10% of 509.3) = 50.9 examples, all features are to be used. Of the three methods, the first had the best results. It selected a large number of features, 32.5, but still retained a faster training time than MFS. The MFS classifier that uses all features has the best class weighted equally classification accuracy, but at the cost of a 6.9% loss of overall accuracy compared to the BFS classifier. This results in a large number of false positives in many of the classes. The classifier produced by the first method, preference for a greater number of features, retained the overall classification accuracy and reduced the loss in class weighted equally classification accuracy from 3.6% to 1.5%.


This results in a reduction in the number of misclassified examples (false positives). Bold indicates that the results were statistically significantly different from the MFS results. The first row in Table 43 shows the results of using all features with SVM parameters tuned. This row has the best class weighted equally accuracy of any row but also the worst overall accuracy. The advantage of this row is that, without performing any feature selection, the best class weighted equally accuracy was achieved, but at a cost of a 5.77% accuracy loss compared to the BFS approach, which translates to 5.34% more examples being misclassified.

Table 43 WFS BFS Produced Classifier Where Minority Classes are Compensated (bold indicates statistically different from the MFS case)

Group  Description                          Test Acc.  Wtd Acc.  Train Time (s)  Test Time (s)  Avg # Features
       All Features, Parms Tuned            72.30%     71.83%    45.2            29.0           82.0
       MFS Parms Tuned, Features Selected   76.73%     69.18%    26.0            20.0           41.0
       BFS Features Selected, Parms Tuned   77.64%     69.31%    19.3            63.4           15.7
1      BFS prefer greater num of features   77.66%     70.75%    21.7            75.3           32.5
2      All features when < 5% examples      77.64%     69.31%    19.3            63.4           15.7
       All features when < 10% examples     77.45%     70.24%    19.4            70.0           20.8
       All features when < 15% examples     77.47%     70.53%    20.9            76.1           25.6
       All features when < 20% examples     77.19%     70.23%    20.5            77.2           28.2
       All features when < 25% examples     77.21%     70.35%    21.2            84.8           33.7
       All features when < 30% examples     77.28%     70.28%    22.0            88.2           39.1
3      Merge2Best                           77.73%     70.28%    19.9            78.7           17.8
       Merge3Best                           77.69%     70.10%    21.6            72.3           20.1
       Merge4Best                           77.30%     68.73%    22.1            76.2           22.5

Table 44 shows the results of applying the three approaches for dealing with unbalanced classes to the ETP2008 dataset. The best results were obtained with group two, using a threshold of 25% of the average class size. That is, binary classifiers that involve a class with fewer examples than 25% of the average class size utilize all features rather than the selected features (a sketch of this rule is shown below).
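A minimal sketch of the group-two compensation rule follows; the helper name and the structure of the class-size and per-pair selection tables are assumptions made for the example.

# For any binary class pair in which either class is smaller than a chosen
# fraction of the average class size, fall back to using all features
# instead of the BFS-selected subset for that pair.
def features_for_pair(class_a, class_b, class_sizes, selected, all_features,
                      fraction=0.10):
    threshold = fraction * (sum(class_sizes.values()) / len(class_sizes))
    if class_sizes[class_a] < threshold or class_sizes[class_b] < threshold:
        return all_features                      # too few examples: keep every feature
    return selected[(class_a, class_b)]          # otherwise use the BFS-selected subset

# For WFS the average class size is 509.3, so fraction=0.10 gives a
# threshold of about 51 examples, as described in the text.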


The overall accuracy had a 2.2% accuracy gain, and the class weighted equally accuracy loss was reduced from 3.9% to 1.1%. Depending on goals, this may be a more desirable selection: the class weighted equally accuracy has a small loss compared to the MFS approach, but there is a reduction in incorrect classifications (false positives). Bold indicates that the results were statistically significantly different from the MFS results.

Table 44 ETP2008 BFS Produced Classifier Where Minority Classes are Compensated

Group  Description                          Test Acc.  Wtd Acc.  Train Time (s)  Test Time (s)  Num Features
       All Features, Parms Tuned            83.33%     77.54%    32.4            169.5          83.0
       MFS Parms Tuned, Features Selected   83.53%     78.05%    18.4            116.6          40.0
       BFS Features Selected, Parms Tuned   85.55%     75.00%    14.0            424.7          10.0
1      BFS prefer greater num of features   84.99%     76.76%    19.3            789.6          48.8
2      All features when < 5% examples      85.55%     75.00%    14.0            424.7          10.0
       All features when < 10% examples     85.40%     76.17%    15.4            499.2          15.7
       All features when < 15% examples     85.50%     76.19%    16.2            536.8          19.3
       All features when < 20% examples     85.44%     77.10%    16.4            582.2          22.0
       All features when < 25% examples     85.35%     77.18%    17.7            649.1          26.9
       All features when < 30% examples     85.36%     76.89%    17.8            725.1          30.8
3      Merge2Best                           85.63%     75.75%    15.4            529.9          11.9
       Merge3Best                           85.60%     76.54%    15.3            474.8          13.8
       Merge4Best                           85.46%     76.35%    15.3            484.9          15.5

4.5 Adding a Class

Using the procedure explained in Section 3.8, an experiment was performed on the Nine Class Plankton dataset in which one of the classes was removed, leaving only 8 classes; the MFS and BFS procedures were then performed on both (see Table 45). The MFS procedure built 268,660 binary SVMs between the parameter tuning and feature selection steps, consuming a total of 157,785 CPU seconds, where the longest CPU path took 5,369 seconds. The BFS procedure consumed 122,946 CPU seconds with a longest path of 4,921 seconds.


When the removed class is added back, the MFS procedure has to be rerun from the beginning, not being able to take advantage of the CPU cycles already used for the first 8 classes, while the BFS procedure only needs to process the additional class combinations created by the addition of the one class. The BFS procedure required an additional 93,108 CPU seconds while the MFS procedure needed 298,699 seconds. This was a speed up of 3.2 times for the BFS procedure to add one additional class. With respect to longest path time, the MFS procedure required 5,387 seconds and the BFS procedure 1,860 seconds, reflecting a speed up of 2.9 times.

Table 45 Nine Class Plankton Dataset with Only 8 Classes

Description         CPU Search Time  CPU Classifier Time  Longest Path Time
MFS 8 Classes       157,785          150,840              5,369
MFS 9 Classes       298,699          285,983              5,387
BFS 8 Classes       122,946          108,482              4,921
BFS add one class   93,108           83,504               1,860
Speed Up            3.21             3.42                 2.90

Table 46 shows the results of adding one class at a time using data from the ETP2008 dataset. The 13 classes that had 600 or more examples were used for this experiment. The data was randomly split into training and test sets, with 70% of the data going to training (with a maximum of 800 examples per class) and the remainder into test. The classes were randomly ordered, with the first 5 classes (Copepod Calanoid, Copepod Nauplii, Detritus Molts, Detritus Snow, and Pteropod Creseis) used as the initial starting training library. Both the MFS and BFS procedures were performed on the initial training library, with the results shown in the first row of Table 46. The remaining classes were then added one at a time, with both the MFS and BFS procedures being performed each time; results are shown in the following rows. The processing time for the MFS procedure consistently grows with the number of classes, while the BFS procedure takes varying amounts of time. The processing time of BFS is a function of the number of training examples in the class being added, the total number of classes in the training library, and the difficulty of discriminating the class being added from the classes already in the training library.


The class Copepod Calanoid Eucalanus, which is similar to Copepod Calanoid and Copepod Oithona, required 46,769 seconds of processing time, compared to Tunicate Doliolid, which required only 27,380 seconds even though Tunicate Doliolid had more training examples and a greater number of classes in the existing training library. This can be attributed to the fact that Tunicate Doliolid was very easy to discriminate from the other classes in the training library. In general, the BFS procedure had a speed up of 4.69 to 19.39 times over that of the MFS procedure.

Table 46 ETP2008 Adding One Class at a Time

Class Name                    Class Count  Num Train  MFS Search (s)  BFS Search (s)  Speed Up  MFS Longest Path (s)  BFS Longest Path (s)  Speed Up
Starting 5 Classes            5            3,219      70,834          36,915          1.92      2,565                 1,668                 1.54
Eumalacostracan euphausiid    6            709        73,744          12,027          6.13      2,857                 605                   4.73
Copepod copilia               7            579        109,064         7,778           14.02     2,154                 366                   5.88
Noise                         8            801        101,341         8,607           11.77     3,445                 413                   8.34
Copepod Oithona               9            301        136,380         21,515          6.34      2,845                 793                   3.59
Copepod Calanoid Eucalanus    10           709        219,212         46,769          4.69      4,383                 1,331                 3.29
Copepod Oncaea                11           471        291,367         15,030          19.39     5,624                 723                   7.78
Protist Radiolarian           12           801        344,289         29,125          11.82     6,597                 1,104                 5.98
Tunicate Doliolid             13           801        483,045         27,380          17.64     16,638                1,354                 12.29


CHAPTER 5: DISCUSSION

The major benefits of the BFS approach have been shown to be a significant speed up in Wrapper feature selection time, a speed up in training times, and a reduction in the time to add or delete classes from existing classifiers, giving the user greater flexibility in managing existing classifiers. Four of the six datasets had a significant improvement in overall classification accuracy; one dataset, Nine Class Plankton, was slightly higher, and one dataset, Letter, had a loss of accuracy. Two of the datasets, which contained unbalanced class representation in the training data, did not do well with respect to class weighted equally classification accuracy, but other methods were proposed and shown to improve the class weighted equally accuracy while also maintaining the higher overall accuracy. Both the Forest Cover datasets, 300/class and 1,500/class, had significantly higher overall classification accuracy and class weighted equally accuracy. BFS requires a larger, higher quality training set than MFS. It will tend to make a much tighter fit than normal feature selection and as a result will not generalize as well. However, when the training set is low noise and representative, it will result in classification accuracy that is as good as MFS or better, as shown with the Forest Cover datasets. Wrapper based feature selection time is faster with BFS. The feature selection process for multiple classes is considerably longer than the combined time to do feature selection and parameter tuning for all the binary combinations, with the exception of the Letter dataset. The binary SVMs are more apt to reduce down to a smaller set of features (Figures 20 and 22), resulting in considerably fewer feature combinations needing evaluation (that is, training and testing).


This makes sense considering that the MFS method has to search for a set of features that performs well for all binary classifier pairs, whereas BFS focuses on one pair of classes at a time; hence the features for which it is looking need only satisfy the requirements of two classes. After tuning the SVM parameters by BFS, the number of support vectors (SVs) increases. This results in a longer training time, although it is still shorter than the training time resulting from MFS. The larger number of SVs can be attributed to the decision boundary between binary class combinations being less generalized (tighter fitting). As seen in the results from Tables 34 through 42, the prediction time for the classifiers using features selected by binary combination is longer compared to when the same set of features is used for all binary classifiers. This is a result of the implementation of the two classifiers. When a classifier involves more than two classes, training examples can end up being used in more than one binary classifier and can be a support vector for more than one binary class SVM. The implementation that uses one set of features for all binary classifiers is able to take advantage of this fact. Table 47 shows, for the ETP2008 dataset, the total number of support vectors and the net number of support vectors. The Total Num SVs column takes into account the fact that some training examples become support vectors for more than one binary SVM. The Net Num SVs column reflects the number of training examples that become support vectors; that is, if a given training example were used in 3 different binary SVMs it would still only be counted once. For example, the MFS produced classifier found 9,210 training examples that became support vectors, with each one on average being used by 5.5 binary SVMs, for a total of 50,378 support vectors. Since the MFS produced classifier uses the same feature subset for all its binary SVMs, it only needs to compute the dot product on the net number of support vectors and can reuse the results of these dot products for all the binary SVMs. The BFS produced classifier must compute dot products on its total number of SVs. This means that in the example shown in Table 47 the BFS approach needs to do 63,609 dot products for each prediction while the MFS approach only needs to perform 9,210 dot products for each prediction.
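The prediction-cost difference can be stated as a short calculation. The sketch below is an illustration only; the function name is an assumption, and the numbers come from Table 47.

# With a shared feature set (MFS) a kernel value is computed once per *net*
# support vector and reused by every binary SVM; with per-pair feature sets
# (BFS) it must be computed per *total* support vector.  One RBF kernel
# evaluation costs roughly `num_features` floating-point multiplications
# (see Equation (40) below).
def fp_mults_per_prediction(num_kernel_evals, num_features):
    return num_kernel_evals * num_features

mfs_cost = fp_mults_per_prediction(9_210, 40)    # 368,400 multiplications
bfs_cost = fp_mults_per_prediction(63_609, 10)   # 636,090 multiplications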


Equation (40) shows how a dot product (kernel value) using the RBF kernel is computed, where f is the number of features and x and y are the two feature vectors. It shows that each dot product requires on the order of f floating point multiplications, and each prediction therefore requires on the order of f * s floating point multiplications, where s is the number of SVs. The seventh column in Table 47, Num FP Ops, shows the number of floating point multiplications required for each prediction by the respective classifiers.

K(x, y) = exp( -gamma * sum_{i=1..f} (x_i - y_i)^2 )   (40)

Table 47 ETP2008 Station 1 Support Vector Comparison

Description               Training Time  Test Time  Total Num SVs  Net Num SVs  Number Features  Num FP Ops
MFS produced classifier   18.96          130.72     50,378         9,210        40.0             368,400
BFS produced classifier   14.81          428.89     63,609         12,044       10.0             636,090

The BFS process does not work well with all datasets, especially ones that already have a high classification accuracy because they have a good set of features. For example, the Letter dataset, which starts at 97.7%, loses 0.77% in classification accuracy with a training time speed up of 2.0 times. The CPU time required to perform feature selection also increased, from 37.6 hours to 42.7 hours. Once a particular pair of classes is processed by the BFS procedure, its resultant SVM parameters and feature selections can always be used in combination with other class combinations, since each binary combination of classes is processed independently of any other classes. For instance, with the WFS dataset, which has 33 classes, a user can always build a classifier from any combination of subsets of the WFS classes. This can allow the user far more flexibility in putting together classifiers. For example, if a particular class is not appearing in a set of data that a user wishes to classify, the user can simply remove that class from the classifier.


If, on the other hand, a new class appears in the data that a user wishes to classify that has not been processed by BFS before, the BFS process can be run on just the new combinations created, as described in Section 3.8.

This dissertation proposed selecting Support Vector Machine (SVM) parameters and performing feature selection by binary class combination rather than selecting a common set of parameters and features for all the binary class combinations that make up a multi-class SVM. Experimentation demonstrates that the time it takes to tune SVM parameters and perform feature selection for multi-class support vector machines can be reduced, in some cases to less than half the original time. At the same time, classification accuracy can be maintained and in some cases improved, and the training time of the resultant classifiers speeded up. Another benefit of this approach is that it gives the user greater flexibility in the addition and subtraction of classes from existing classifiers; that is, SVM parameter tuning and feature selection only need to be performed for the new class combinations created. This can benefit a user who has to frequently change the class makeup of existing classifiers, such as those in the marine science world. In the case of the Forest Cover dataset a significant improvement in classification accuracy was made, 4.27%. The WFS dataset had a 1.19% improvement in overall classification accuracy, 0.19% in class equalized accuracy, a 1.3 times speed up in training time, and a 3.09 times speed up in longest path feature selection time. The savings was often measured in hours and sometimes in days. In the case of one dataset, WFS, there were 40.5 fewer days of CPU time and 15.7 fewer hours of longest path time; it took 7.5 hours to process rather than 23.2 hours. The user has greater flexibility in modifying and maintaining existing classifiers due to the ability to reuse SVM parameters and feature selections from class combinations previously processed.


REFERENCES

[1] Harold V. Thurman and Alan P. Trujillo, Introductory Oceanography, 10th ed.: Prentice Hall, 2004.
[2] Tong Luo et al., "Active Learning to Recognize Multiple Types of Plankton," Journal of Machine Learning Research, vol. 6, pp. 589-613, December 2005.
[3] Tong Luo et al., "Active Learning to Recognize Multiple Types of Plankton," in International Conference on Pattern Recognition (ICPR), Cambridge, UK, August 2004.
[4] Andrew Remsen, Thomas L. Hopkins, and Scott Samson, "What you see is not what you catch: A comparison of concurrently collected net, Optical Plankton Counter, and Shadowed Image Particle Profiling Evaluation Recorder data from the northeast Gulf of Mexico," Deep Sea Research Part I: Oceanographic Research Papers, vol. 51, no. 1, pp. 129-151, 2004.
[5] Mark C. Benfield, Cabell S. Davis, Peter H. Wiebe, Scott M. Gallager, and Gregory Lough, "Video plankton recorder estimates of copepod, pteropod and larvacean distributions from a stratified region of Georges Bank with comparative measurements from a MOCNESS sampler," Deep Sea Research Part II: Topical Studies in Oceanography, vol. 43, no. 7-8, pp. 1925-1945, 1996.
[6] M. E. Sieracki et al., "Optical Plankton Imaging and Analysis Systems for Ocean Observation," in Venice, Italy, 2009, p. 11.
[7] C. S. Davis, S. M. Gallager, M. S. Berman, M. S. Haury, and J. R. Strickler, "The Video Plankton Recorder (VPR): Design and initial results," Archiv für Hydrobiologie, vol. 36, pp. 67-81, 1992.
[8] C. S. Davis, F. T. Thwaites, S. M. Gallager, and Q. Hu, "A three-axis fast-tow digital Video Plankton Recorder for rapid surveys of plankton taxa and hydrography," Limnology and Oceanography Methods, vol. 3, pp. 59-74, January 2005.
[9] Mark C. Benfield et al., "RAPID: Research on Automated Plankton Identification," Oceanography, vol. 20, no. 2, pp. 172-187, 2007.
[10] Scott Samson et al., "A system for high-resolution zooplankton imaging," IEEE Journal of Oceanic Engineering, vol. 26, no. 4, October 2001.


[11] J. Watson et al., "A holographic system for subsea recording and analysis of plankton and other marine particles (HOLOMAR)," OCEANS 2003 Proceedings, vol. 2, pp. 830-837, September 2003.
[12] Peter H. Wiebe and Mark C. Benfield, "From the Hensen net toward four-dimensional biological oceanography," Progress in Oceanography, vol. 56, no. 1, pp. 7-136, 2003.
[13] M. Blaschko et al., "Automated In Situ Identification of Plankton," in IEEE Workshop on Applications in Computer Vision, Breckenridge, Colorado, 2005.
[14] Gaby Gorsky et al., "Digital zooplankton image analysis using the ZooScan integrated system," Journal of Plankton Research, vol. 32, no. 3, pp. 285-303, 2010.
[15] Andrew Remsen, Scott Samson, Thomas Hopkins, and Kurt Kramer, "Observations of plankton and detrital particle distribution on the West Florida Shelf using SIPPER 2 and an automated classification system," Journal of Plankton Research, submitted 2010.
[16] Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi A. Zadeh, Feature Extraction, Foundations and Applications, Embedded Methods, 1st ed.: Springer, August 2006.
[17] Ron Kohavi and George H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, no. 1, pp. 273-324, December 1997.
[18] F. Bruno, A. Carvalho, Rodrigo Calvo, and Renato Porfirio Ishii, "Multiclass SVM Model Selection Using Particle Swarm Optimization," Proceedings of the Sixth International Conference on Hybrid Intelligent Systems, p. 31, 2006.
[19] Hugo Silva and Ana Fred, "Pairwise vs global multi-class wrapper feature selection," in Proceedings of the 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases (AIKED'07), vol. 6, Corfu Island, Greece, 2007, pp. 1-6.
[20] Xuewen Chen, Xiangyan Zeng, and Deborah van Alphen, "Multi-class feature selection for texture classification," vol. 27, no. 14, pp. 1685-1691, October 2006.
[21] Olivier Chapelle and Sathiya Keerthi, "Multi-Class Feature Selection with Support Vector Machines," Yahoo Research, Technical Report YR-2008-002, 2008.
[22] Andrew Walker Remsen, Evolution and Field Application of a Plankton Imaging System, College of Marine Science, University of South Florida, PhD Dissertation, 2008.


[23] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik, "Gene Selection for Cancer Classification using Support Vector Machines," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.
[24] Christopher J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, June 1998.
[25] Kurt A. Kramer, Lawrence O. Hall, Dmitry B. Goldgof, Andrew Remsen, and Tong Luo, "Fast support vector machines for continuous data," IEEE Trans Syst Man Cybern B Cybern, vol. 39, no. 4, pp. 989-1001, August 2009.
[26] Tong Luo et al., "Recognizing Plankton Images from the Shadow Image Particle Profiling Evaluation Recorder," IEEE Transactions on Systems, Man and Cybernetics Part B: Cybernetics, vol. 34, no. 4, pp. 1753-1762, August 2004.
[27] Paul DuBois, MySQL, 4th ed.: Addison Wesley, 2008.
[28] MySQL.org. (2010) MySQL 5.1 Reference Manual. [Online]. http://dev.mysql.com/doc/refman/5.1/en/index.html
[29] Committee on Evolution of the National Oceanographic Research Fleet, National Research Council, Science at Sea: Meeting Future Oceanographic Goals with a Robust Academic Research Fleet: The National Academic Press, 2009.
[30] M. K. Hu, Visual Pattern Recognition by Moment Invariants, February 1962.
[31] D. Zhang and G. Lu, "A Comparative Study on Shape Retrieval Using Fourier Descriptors with Different Shape Signatures," Journal of Visual Communication and Image Representation, vol. 14, no. 1, pp. 41-60, 2003.
[32] Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods, 1st ed. Cambridge, United Kingdom: Cambridge University Press, March 2000.
[33] Chih-Chung Chang and Chih-Jen Lin. (2001) LIBSVM: a library for support vector machines (version 2.3). [Online]. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[34] M. G. Genton, "Classes of kernels for machine learning: a statistics perspective," Journal of Machine Learning Research, vol. 2, pp. 229-312, March 2002.
[35] Vladimir N. Vapnik, The nature of statistical learning theory, 2nd ed.: Springer, 2000.
[36] V. Vapnik, Estimation of Dependences Based on Empirical Data: Springer, 2001.
[37] Stephen Boyd and Lieven Vandenberghe, Convex Optimization: Cambridge University Press, 2004.


[38] Harold W. Kuhn and Albert W. Tucker, "Nonlinear programming," in Proceedings of 2nd Berkeley Symposium, Berkeley: University of California Press, 1951, pp. 481-492.
[39] John C. Platt, "Probabilistic Outputs for Support Vector Machines and Comparison to Regularized Likelihood Methods," in Advances in Large Margin Classifiers, Cambridge, MA, USA: MIT Press, 1999, pp. 61-74.
[40] Ting-Fan Wu, Chih-Jen Lin, and Ruby C. Weng, "Probability estimates for multi-class classification by pairwise coupling," Journal of Machine Learning Research, vol. 5, pp. 975-1005, August 2004.
[41] Chih-Wei Hsu and Chih-Jen Lin, "A comparison of methods for multiclass Support Vector Machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415-425, March 2002.
[42] Kurt Kramer, Identifying Plankton from Grayscale Silhouette Images, Tampa, Florida: University of South Florida, Masters Thesis, 2005.
[43] Jack A. Blackard, Denis J. Dean, and Charles W. Anderson. (1998, August) Machine Learning Repository Forest Cover Data Set. [Online]. http://archive.ics.uci.edu/ml/datasets/Covertype
[44] David J. Slate. (1991, Jan.) University of California, Irvine, School of Information and Computer Sciences, Repository Letter Recognition Data Set. [Online]. http://archive.ics.uci.edu/ml/datasets/Letter+Recognition
[45] Kurt Kramer. (2008, December) University of South Florida, Research Cruise, R/V Knorr, Research in the Eastern Tropical Pacific (Dec 2008). [Online]. http://etpcruise2008.blogspot.com/
[46] Elizabeth Caporelli. (2008, Dec.) University National Oceanographic Laboratory System Knorr Ships Schedule. [Online]. http://strs.unols.org/Public/diu_schedule_view.aspx?ship_id=10037&year=2008
[47] N. G. Larson and K. Lawson, "CTD," in Encyclopedia of Ocean Sciences, Bellevue, Washington, USA: Academic Press, 2003, pp. 579-588.
[48] Jock A. Blackard and Denis J. Dean, "Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables," Computers and Electronics in Agriculture, vol. 24, no. 3, pp. 131-151, 2000.
[49] Jock A. Blackard, "Comparison of Neural Networks and Discriminant Analysis in Predicting Forest Cover Types," Colorado State University, Fort Collins, Colorado, Ph.D. Dissertation, Department of Forest Sciences, 1998.
[50] Jock A. Blackard and Denis J. Dean, "Comparative Accuracies of Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables," in Second Southern Forestry GIS Conference, University of Georgia, Athens, GA, 1998, pp. 198-199.


[51] Peter W. Frey and David J. Slate, "Letter Recognition Using Holland-Style Adaptive Classifiers," Machine Learning, vol. 6, no. 2, pp. 161-182, 1991.
[52] B. Y. Sun, D. S. Huang, and H. T. Fang, "Lidar signal denoising using least squares support vector machine," IEEE Signal Processing Letters, vol. 12, no. 2, pp. 101-104, February 2005.
[53] Stephen Kwek, Nathalie Japkowicz, and Rehan Akbani, "Applying support vector machines to imbalanced datasets," in Proceedings of the 15th European Conference on Machine Learning, 2004, pp. 39-50.
[54] Thomas G. Dietterich, "Machine learning research: Four current directions," AI Magazine, vol. 18, no. 4, pp. 97-136, 1997.
[55] A. Frank and A. Asuncion. (2010) UCI Machine Learning Repository. [Online]. http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/
[56] Richard J. Larsen and Morris L. Marx, An Introduction to Mathematical Statistics and Its Applications, 3rd ed.: Rowman & Littlefield, 2000.
[57] Thomas G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895-1923, October 1998.
[58] B. S. Everitt, The Analysis of Contingency Tables. London: Chapman and Hall, 1977.
[59] Carl Staelin, "Parameter selection for support vector machines," HP Laboratories Israel, Technion City, Haifa, 2002.
[60] Nojun Kwak and Chong-Ho Choi, "Input feature selection by mutual information based on Parzen window," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1667-1671, December 2002.
[61] Kazuaki Aoki and Mineichi Kudo, "Feature and Classifier Selection in Class Decision Trees," in Proceedings of the 2008 Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition, 2008, pp. 562-571.
[62] Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, and Klaus-Robert Müller, "Fisher Discriminant Analysis With Kernels," 1999.
[63] George Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289-1305, March 2003.
[64] Huan Liu and Hiroshi Motoda, Computational Methods of Feature Selection. Boca Raton: Chapman & Hall/CRC, 2007.
[65] Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, "Kernel Principal Component Analysis," in Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: The MIT Press, December 1998, pp. 327-352.


[66] Jason Weston, André Elisseeff, Bernhard Schölkopf, and Mike Tipping, "Use of the Zero-Norm with Linear Models and Kernel Methods," Journal of Machine Learning Research, vol. 3, pp. 1439-1461, March 2003.
[67] Lei Wang, "Feature Selection with Kernel Class Separability," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 9, pp. 1534-1546, September 2008.


APPENDICES


Appendix A Plankton Images

The images in this appendix are from the ETP2008 dataset. Each box contains one or more samples from a single class. They reflect the relative sizes of the images, but they are not drawn to any particular scale. For example, the images of gelatinous_tunicate_doliolid are larger than the images of larvaceans because they tend to be larger in the dataset. There are 55 separate classes shown in this appendix.

Figure A1 Images from ETP2008 Dataset. The figure's panels cover the following 55 classes:

crustacean_copepod_calanoid
crustacean_copepod_calanoid_eucalanus
crustacean_copepod_copilia
crustacean_copepod_eyes
crustacean_copepod_lateral
crustacean_copepod_macrosetella
crustacean_copepod_nauplii
crustacean_copepod_oithona
crustacean_copepod_oncaea
crustacean_eumalacostracan
crustacean_eumalacostracan_amphipod
crustacean_ostracod
detritus_molts
detritus_snow
echinoderm_plutei
elongate_chaetognath
elongate_polychaete
elongate_strands
fish
gelatinous_ctenophore
gelatinous_ctenophore_cydippid
gelatinous_hydromedusae
gelatinous_hydromedusae_blunt
gelatinous_hydromedusae_small
gelatinous_hydromedusae_solmundella
gelatinous_siphonophore
gelatinous_tunicate_doliolid
gelatinous_tunicate_pyrosome
larvacean
larvacean_house
larvacean_large
larvacean_tectillaria
larvae_doliolid
larvae_polychaete
larvae_tornaria
mollusc_pteropod_creseis
mollusc_pteropod_gymnosome
noctiluca
noise
other
phytoplankton_chaetoceros
phytoplankton_pyrocystis
protist_darkcenter
protist_diffuse
protist_knobby
protist_lobed
protist_lopsided
protist_multiple
protist_phage
protist_phi
protist_radiolarian
protist_spiny
protist_wisp
radiolarian_ribboncolony
radiolarian_roundcolony


Appendix B SIPPER Raw Data Format

The following describes the layout of the raw SIPPER data file as produced by the SIPPER 3 device. SIPPER 3 produces a continuous stream of 16-bit records that read from most significant bit (MSB) to least significant bit (LSB). Table B3 gives a detailed description of each bit. Each 16-bit record contains either image data or instrument data, as specified by bit 15. The two types of records are processed by separate decoding functions.

There are three basic types of image data records: Gray Scale, White Run Length, and Binary. The first two types are the most common. The third type, Binary, only occurs when SIPPER's internal buffer is getting full and needs to be written to disk before an overrun occurs.

1) Gray Scale records provide four grayscale 3-bit pixels that range from 0 to 7, where 0 represents white (background) and 7 represents black. These values are scaled to an 8-bit range as indicated in Table B1 to aid in compatibility with future versions of SIPPER, where 8-bit grayscale is envisioned. When data is stored in image files such as BMP images, the values are complemented such that 255 = 0 and 0 = 255.

Table B1 SIPPER 3 Grayscale Decoding Values.
3-Bit Value   8-Bit Scaled Value
0             0
1             36
2             73
3             109
4             146
5             182
6             219
7             255

2) White Run Length records are an implementation of a simple run-length compression algorithm. The majority of SIPPER data is white background. This record specifies the number of 4-pixel packages containing white that occur in a row. The count is specified in bits 11 through 0 and is multiplied by 4 to get the number of white background pixels that have occurred.


Appendix B (Continued)

3) Binary image data. This format is meant to help prevent buffer overflows. Since only white or black is being recorded, SIPPER can write four times more data in a given amount of time. The downside is that texture information is lost. In practice this situation rarely occurs; the most common cause is when bubbles pass through the sampling tube, which can occur when SIPPER is very near the surface, such as when first being deployed. In this case white pixels are mapped to 0 and black pixels to 255.

Instrument Data has two different formats, text and binary data. In practice only the text variation is used. Each 16-bit record contains a 6-bit sensor number. Table B2 contains a list of the sensor numbers that are currently in use.

Table B2 SIPPER File Sensor Number Descriptions.

Sensor Number 6, User Message: A user-provided description supplied via the SIPPER interface. It is written on the SIPPER disk at the beginning of the SIPPER file. The Disk Manager software, which is used to offload SIPPER files, reformats this data into the 16-bit records as described in Table B3.

Sensor Number 9, GPS Data: Has not been implemented as of this time. GPS data is currently being imported into the PICES database from text files provided by hosting research vessels.

Sensor Number 10, Flow Rate: This instrument produces both text and binary data. The text indicates the half turns of the flow meter, where there are 98 turns per meter. The binary data indicates flow rate in meters per second.

Sensor Number 16, Serial Port 0: CTD. Note that the CTD can have up to 4 external instruments, such as O2.

Sensor Number 17, Serial Port 1: Pitch and roll sensor. This is text-only data. Each line is separated by a line feed character. Ex: "R 1.15 P 16.18".

Sensor Number 18, Serial Port 2: Battery sensor. Provides voltage levels and status of SIPPER's 4 batteries. Ex: "1, 25.55, 26.61, 26.14, 25.79, LLLL" (the active battery, followed by 4 voltage readings, followed by the Live/Dead status of the 4 batteries). Batteries are labeled 1, 2, 3, 4.

Sensor Number 19, Serial Port 3: Unused.
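To make the text-format instrument records concrete, the following is a minimal sketch, not the PICES or Disk Manager implementation, of how the battery sensor line shown for sensor number 18 could be parsed. The function name is illustrative, and the meaning of any status character other than "L" is an assumption; the field order follows the example string in Table B2.

```python
def parse_battery_record(text):
    """Parse a battery sensor text record (sensor number 18, serial port 2).

    Example from Table B2: "1, 25.55, 26.61, 26.14, 25.79, LLLL"
    Fields: active battery, four voltage readings, Live/Dead status flags.
    """
    fields = [f.strip() for f in text.split(",")]
    active_battery = int(fields[0])              # which of batteries 1-4 is active
    voltages = [float(v) for v in fields[1:5]]   # one voltage reading per battery
    status = list(fields[5])                     # one flag per battery; "L" = live
    return active_battery, voltages, status

# Usage with the example string given in Table B2:
print(parse_battery_record("1, 25.55, 26.61, 26.14, 25.79, LLLL"))
# (1, [25.55, 26.61, 26.14, 25.79], ['L', 'L', 'L', 'L'])
```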


Appendix B (Continued)

Table B3 Data Payload Table.

Bits 15 through 12 are flag bits (bit 15 = Image, bit 14 = EOL / ASCII, bit 13 = RAW, bit 12 = Gray); bits 11 through 0 carry the data payload. The rows below give the payload meaning for each value of the top four bits of an image record (bit 15 = 1), followed by the two instrument-data layouts (bit 15 = 0).

8 (1000): Compressed, count of blocks of 4 white pixels. Ex: 0x312 = 786; 786 x 4 = 3144 white pixels.
9 (1001): (not described)
A (1010): Black and white binary pixels; bits 11 through 0 hold pixels 1 through 12, left to right.
B (1011): Four gray-level pixels (Gray Level pixel 1 through Gray Level pixel 4, 3 bits each).
C (1100): Compressed, count of blocks of 4 white pixels, followed by end of line.
D (1101): (not described)
E (1110): End of line encountered, so there are 4, 8, or 12 black-and-white pixels stored, incrementing left to right as above. The program will have to count to know which pixels are valid.
F (1111): 4 grayscale pixels, as above, followed by end of line.
Bit 15 = 0, bit 14 = 0: Sensor number (6 bits) followed by binary sensor data.
Bit 15 = 0, bit 14 = 1: Sensor number (6 bits) followed by sensor-related text. Some sensor numbers are defined; see Table B2.
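The record layout in Tables B1 and B3 can be summarized with a short decoding sketch. This is not the SIPPER extraction code used by PICES; it is a minimal Python illustration in which the placement of the 6-bit sensor number (assumed to occupy bits 13 through 8) and the handling of the undescribed type codes 0x9 and 0xD are assumptions, and all names are illustrative.

```python
GRAY_3BIT_TO_8BIT = [0, 36, 73, 109, 146, 182, 219, 255]  # Table B1 scaling values

def decode_record(record):
    """Classify one 16-bit SIPPER 3 record according to Table B3 (sketch only)."""
    if record & 0x8000 == 0:                       # bit 15 = 0: instrument data
        kind = "sensor-text" if record & 0x4000 else "sensor-binary"
        sensor_number = (record >> 8) & 0x3F       # 6-bit sensor number (assumed bits 13-8)
        return kind, sensor_number, record & 0xFF
    record_type = (record >> 12) & 0xF             # image record type nibble
    payload = record & 0x0FFF
    if record_type in (0x8, 0xC):                  # white run; 0xC also ends the line
        return "white-run", payload * 4            # blocks of 4 white pixels -> pixel count
    if record_type in (0xB, 0xF):                  # four 3-bit gray pixels; 0xF ends the line
        pixels = [GRAY_3BIT_TO_8BIT[(payload >> s) & 0x7] for s in (9, 6, 3, 0)]
        return "gray", pixels
    if record_type in (0xA, 0xE):                  # black-and-white pixels, one per bit
        return "binary", [(payload >> (11 - i)) & 1 for i in range(12)]
    return "unknown", payload                      # 0x9 and 0xD are not described above

# Example from Table B3: 0x8312 -> 0x312 = 786 blocks of 4 white pixels = 3144 pixels
assert decode_record(0x8312) == ("white-run", 3144)
```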


Appendix C Glossary

Active Learning: This is a concept of reducing the number of images that the user needs to manually classify in order to achieve a desired level of classification accuracy. In PICES this is implemented by sorting classified images by their break tie values in ascending order; the user would then be asked to manually classify the images that appear at the top of the list. Since these are the images that the classifier is having a hard time distinguishing, they are the images that will most likely have an impact on the decision boundary between classes. (A small sketch appears at the end of this appendix.)

Beam Search: A level-by-level search similar to a breadth-first search. The differences are that only the best N nodes are evaluated for each level, and once the search has processed a given level, it will not go back to that level again. In this way the search continues until there are no more levels to process. (A generic sketch is given below.)

Break Tie: This is the difference in probability between the two most likely classes predicted for an image. A small break tie, for example one of 0.5%, indicates that the classifier finds little difference between the two most likely classes.

Class: Also referred to as a label. Different types of plankton are considered classes. For example, Trichodesmium, Larvacean, and Copepods would be considered three different classes.

Classifier: A classifier predicts to which class an unknown image belongs. It is built using the parameters from a specified training model and labeled examples from the related training library from which to learn. Once a classifier is built (trained), it can be used anytime in the future to make predictions. If the training model parameters are changed or the related training library is modified, the user will need to rebuild the classifier for the changes to take place. When a classifier makes a prediction, it returns the class that it generated as most likely correct.
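The beam search described above can be sketched generically as follows. This is not the feature selection code used in this dissertation; it is a minimal illustration in which the node expansion and scoring functions are placeholders supplied by the caller.

```python
def beam_search(start_nodes, expand, score, beam_width, max_levels):
    """Generic beam search: keep only the best N (beam_width) nodes per level.

    expand(node) yields candidate nodes for the next level; score(node) returns
    a value where higher is better. A processed level is never revisited.
    """
    frontier = sorted(start_nodes, key=score, reverse=True)[:beam_width]
    best = max(frontier, key=score)
    for _ in range(max_levels):
        children = [child for node in frontier for child in expand(node)]
        if not children:               # no more levels to process
            break
        frontier = sorted(children, key=score, reverse=True)[:beam_width]
        best = max(frontier + [best], key=score)
    return best
```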


Appendix C (Continued)

Image Groups: PICES allows images to be organized by groups. These groups can span the entire PICES database, across cruises, stations, and deployments. The most common use of this is when images are harvested randomly. It can also be used to group images imported from a sub-directory structure. PICES will allow a user to View, Export, Classify, and Extract Feature Data by Image Group. An example would be to group all images that pertain to a study, allowing the user to quickly locate them in the future.

SVM: Support Vector Machine. A learning algorithm that learns from labeled data how to predict the proper classes to be assigned to unseen data. See [32] for a more detailed description.

Training Library: For purposes of this dissertation, a training library is the set of plankton images that are divided up into logical groups (see Appendix A for examples of groupings). These images are then used to train a learning algorithm, such as the one utilized in this dissertation, the support vector machine (SVM).

Training Model: For purposes of this dissertation, training model refers to the set of classes and parameters that are to be used. The user has the ability to maintain several training models for the same training library. Each one will consist of a list of classes, the features to be used, and support vector machine parameters. The training library may have many classes in it, but any one training model may reference only a few of these classes. Training models may also be set up such that several classes are grouped together to form a single logical class.

Validated Class: The class assigned by the user (expert) to a specific plankton image. In PicesCommander the user has the ability to validate the class of any plankton image displayed.

VPR: Video Plankton Recorder, a device used to collect imagery of marine plankton. Its purpose is similar to SIPPER but its implementation is very different.
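As an illustration of the Active Learning and Break Tie entries above, the following minimal sketch, which is not the PICES implementation, computes break tie values from predicted class probabilities and orders images so that the most ambiguous ones are presented for validation first. The image identifiers and probability values are made up for the example.

```python
def break_tie(probabilities):
    """Break tie = difference between the two highest class probabilities."""
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]

# Hypothetical classifier output: image identifier -> per-class probabilities.
predictions = {
    "img_001": [0.480, 0.475, 0.045],   # break tie of about 0.005: very ambiguous
    "img_002": [0.900, 0.050, 0.050],   # break tie of about 0.850: confident prediction
}

# Sort by ascending break tie: the most ambiguous images come first and are the
# best candidates to present to the user for validation.
to_validate = sorted(predictions, key=lambda img: break_tie(predictions[img]))
print(to_validate)   # ['img_001', 'img_002']
```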

