SCALING UP SUPPORT VECTOR MACHINES WITH APPLICATION TO PLANKTON RECOGNITION

by

Tong Luo

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Computer Science and Engineering, College of Engineering, University of South Florida.

Co-Major Professor: Lawrence O. Hall, Ph.D.
Co-Major Professor: Dmitry B. Goldgof, Ph.D.
Sudeep Sarkar, Ph.D.
Suresh Khator, Ph.D.
Thomas Sanocki, Ph.D.

Date of Approval: February 10, 2005

Keywords: machine learning, data mining, kernel machines, active learning, bit reduction

Copyright 2005, Tong Luo

ACKNOWLEDGEMENTS

This dissertation could not have been done without many people's support. I am indebted to my advisor, Dr. Lawrence O. Hall, who stimulated many of the ideas in my dissertation. His careful review helped to greatly improve the dissertation. I hope I can be as professional as he is in the future. I am also very grateful to my co-major professor, Dr. Dmitry Goldgof. He introduced the plankton recognition project to me and provided much research guidance. The work in this project finally became this dissertation. I would like to thank all the committee members and chair for their review of and comments on my dissertation: Dr. Sudeep Sarkar, Dr. Suresh Khator, Dr. Thomas Sanocki and Dr. Michael X. Weng. In particular, I learned a lot from the courses taught by Dr. Sarkar. He is a great teacher and researcher. I would like to thank my colleagues and friends in the vision lab: Kurt Kramer, Yan Qiu and Kevin Shallow. We had a lot of fun together. Also, I enjoyed the cooperation with Kurt. Most of all, I would like to thank my family. My parents, sister and brother-in-law visited me during the last semester. Their support kept me going and growing. My wonderful wife, Yang Wang, encouraged and supported me when I was busy and felt frustrated.
This dissertation could not have been done without her love.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1  INTRODUCTION
CHAPTER 2  BACKGROUND
  2.1  Machine learning problem
  2.2  Support vector machines
    2.2.1  Deriving the SVM
    2.2.2  Kernel
    2.2.3  Margin, VC dimension of a large margin hyperplane and structural risk minimization
    2.2.4  Primal and dual form of SVM
  2.3  Multi-class SVM
CHAPTER 3  RECOGNIZING PLANKTON IMAGES FROM THE SHADOW IMAGE PARTICLE PROFILING EVALUATION RECORDER
  3.1  Introduction
  3.2  Image gallery
  3.3  Feature computation
    3.3.1  Object detection and noise suppression
    3.3.2  Moment invariants
    3.3.3  Granulometric features
    3.3.4  Domain specific features
  3.4  Assigning probability values in support vector machines
  3.5  Feature selection
  3.6  Experiments
    3.6.1  Initial experiments
    3.6.2  Experiments with unidentifiable particles
    3.6.3  Feature selection
    3.6.4  Probability assignment experiments
  3.7  Conclusions
CHAPTER 4  ACTIVE LEARNING TO RECOGNIZE MULTIPLE TYPES OF PLANKTON
  4.1  Introduction
  4.2  Feature computation
    4.2.1  Weighted moment features
    4.2.2  Contour features
    4.2.3  Texture
    4.2.4  Other features
  4.3  Active learning approach with multi-class support vector machines
  4.4  Experiments
    4.4.1  Experiments with IPR=1, IIPC varied
    4.4.2  Varying the IPR
  4.5  Conclusion and discussion
CHAPTER 5  BIT REDUCTION SUPPORT VECTOR MACHINE
  5.1  Motivation
  5.2  Previous work
  5.3  Bit reduction SVM
    5.3.1  Bit reduction
    5.3.2  Weighted SVM
  5.4  Experiments
    5.4.1  Experiments with pure bit reduction
    5.4.2  Experiments with unbalanced bit reduction
    5.4.3  Summary and discussion
  5.5  Conclusion
CHAPTER 6  CONCLUSIONS AND FUTURE RESEARCH
  6.1  Conclusions
  6.2  Contributions
  6.3  Future research
REFERENCES
ABOUT THE AUTHOR
LIST OF TABLES

Table 3.1  Description of 29 features.
Table 3.2  Hu's moment invariants.
Table 3.3  10-fold cross validation accuracy on the initial 1285 image set.
Table 3.4  Confusion matrix of SVM (one-vs-one) from a 10-fold cross validation on 1285 SIPPER images with all 29 features.
Table 3.5  10-fold cross validation accuracy on the 6000 image set.
Table 3.6  Confusion matrix of SVM (one-vs-one) from a 10-fold cross validation on 6000 SIPPER images with all 29 features.
Table 3.7  Description of the 15 selected feature subset.
Table 3.8  Confusion matrix of SVM (one-vs-one) from a 10-fold cross validation on 6000 SIPPER images with the best 15-feature subset.
Table 3.9  Best parameters for log-likelihood loss function.
Table 4.1  The upper and lower boundary regions as a fraction of one half edge length.
Table 4.2  Inner and outer region boundaries.
Table 5.1  A 1-d example of bit reduction in BRSVM.
Table 5.2  Weighted examples after the aggregation step.
Table 5.3  Description of the nine data sets.
Table 5.4  BRSVM on the banana data set.
Table 5.5  BRSVM on the phoneme data set.
Table 5.6  BRSVM on the shuttle data set.
Table 5.7  BRSVM on the page data set.
Table 5.8  BRSVM on the pendigit data set.
Table 5.9  BRSVM on the letter data set.
Table 5.10  BRSVM on the plankton data set.
Table 5.11  BRSVM on the waveform data set.
Table 5.12  BRSVM on the satimage data set.
Table 5.13  BRSVM of UBR after 8-bit reduction on the phoneme data set.
Table 5.14  BRSVM of UBR after 10-bit reduction on the pendigit data set.
Table 5.15  BRSVM of UBR after 9-bit reduction on the plankton data set.
Table 5.16  BRSVM of UBR after 10-bit reduction on the waveform data set.
Table 5.17  BRSVM of UBR after 9-bit reduction on the satimage data set.
Table 5.18  Summary of BRSVM on all nine data sets.
Table 5.19  BRSVM after retraining on the nine data sets.
LIST OF FIGURES

Figure 3.1  Copepod in SIPPER I images.
Figure 3.2  Diatom in SIPPER I images.
Figure 3.3  Doliolid in SIPPER I images.
Figure 3.4  Larvacean in SIPPER I images.
Figure 3.5  Protoctista in SIPPER I images.
Figure 3.6  Trichodesmium in SIPPER I images.
Figure 3.7  Unidentifiable particles in SIPPER I images.
Figure 3.8  Feature selection on the training set.
Figure 3.9  Selected feature subsets on the validation set.
Figure 3.10  Rejection curve for both approaches.
Figure 4.1  Calanoid copepod in SIPPER II images.
Figure 4.2  Larvacean in SIPPER II images.
Figure 4.3  Marine snow in SIPPER II images.
Figure 4.4  Oithona in SIPPER II images.
Figure 4.5  Trichodesmium in SIPPER II images.
Figure 4.6  Contour frequency domains.
Figure 4.7  Source image.
Figure 4.8  The five frequency regions (r1 through r5) of the image.
Figure 4.9  Some misclassified images.
Figure 4.10  Comparison of active learning and random sampling in terms of accuracy and number of support vectors: initial training images per class are 10, one new labeled image added at a time.
Figure 4.11  The first five images labeled by BT in one run.
Figure 4.12  Comparison of active learning and random sampling in terms of accuracy and number of support vectors: initial training images per class are 50, one new labeled image added at a time.
Figure 4.13  Comparison of active learning and random sampling in terms of accuracy and number of support vectors: initial training images per class are 100, one new labeled image added at a time.
Figure 4.14  Comparison of active learning and random sampling in terms of accuracy and number of support vectors: initial training images per class are 200, one new labeled image added at a time.
Figure 4.15  Comparison of active learning and random sampling in terms of accuracy with different IPR: initial training images per class are 10.
Figure 4.16  Comparison of active learning and random sampling in terms of accuracy with different IPR: initial training images per class are 50.
Figure 4.17  Comparison of active learning and random sampling in terms of accuracy with different IPR: initial training images per class are 100.
Figure 4.18  Comparison of active learning and random sampling in terms of accuracy with different IPR: initial training images per class are 200.

ABSTRACT

Learning a predictive model for a large scale real-world problem presents several challenges: the choice of a good feature set and a scalable machine learning algorithm with small generalization error. A support vector machine (SVM), based on statistical learning theory, obtains good generalization by restricting the capacity of its hypothesis space. A SVM outperforms classical learning algorithms on many benchmark data sets. Its excellent performance makes it the ideal choice for pattern recognition problems. However, training a SVM involves constrained quadratic programming, which leads to poor scalability. In this dissertation, we propose several methods to improve a SVM's scalability. The evaluation is done mainly in the context of a plankton recognition problem.

One approach is called active learning, which selectively asks a domain expert to label a subset of examples from a large amount of unlabeled data. Active learning minimizes the number of labeled examples needed to build an accurate model and reduces the human effort in manually labeling the data. We propose a new active learning method, "Breaking Ties" (BT), for multi-class SVMs.
After developing a probability model for multi-class SVMs, "BT" selectively labels examples for which the difference in probabilities between the predicted most likely class and the second most likely class is smallest. This simple strategy required several times fewer labeled plankton images to reach a given recognition accuracy when compared to random sampling in our plankton recognition system.

To speed up a SVM's training and prediction, we show how to apply bit reduction to compress the examples into several bins. Weights are assigned to different bins based on the number of examples in the bin. Treating each bin as a weighted example, a SVM builds a model using the reduced set of weighted examples. Experimental results indicate bit reduction SVM (BRSVM) runs up to 245 times faster during the training phase and up to 33 times faster in the prediction phase. At a well-chosen compression ratio, it also beats random sampling in accuracy.

CHAPTER 1  INTRODUCTION

Applying machine learning to solve a real-world pattern recognition problem presents several challenges:

1. How do we produce a set of features that best summarizes the targeted patterns and differentiates between those patterns?
2. Can we find a good machine learning algorithm, which learns a model on a set of features extracted from the training data, to predict the targeted patterns correctly from future data?
3. Is this machine learning algorithm scalable to a large number of examples?

This dissertation addresses the above questions mainly in the context of plankton recognition. However, our methods can also be applied to general pattern recognition problems and other data sets. Among the three challenges, scalability is the main focus.

Knowledge of the distributions of underwater plankton helps to predict particle flux, fisheries recruitment and biomass production in the ocean.
Therefore, the first generation Shadow Image Particle Profiling Evaluation Recorder (SIPPER I) was developed to continuously sample plankton and suspended particles in the ocean [78]. It is capable of producing tens of thousands of images an hour. As a result, a plankton recognition system is necessary to identify those plankton automatically and avoid otherwise prohibitive manual image labeling. The images from SIPPER I are 1-bit, black and white images without clear contours. This situation brings us to the answer to the first question: how do we produce a set of good features that maximize the classification accuracy? Since contour features like Fourier descriptors were not stable for SIPPER I images, we chose several robust features, the moment invariants and granulometric features, which do not depend heavily on contour information. Also, several domain-specific features were designed to differentiate between several particular types of plankton. A wrapper approach was used to select the optimal subset of features.

After feature computation, another issue was to find a machine learning algorithm which generalizes well to future unseen data. Many traditional machine learning algorithms such as nearest neighbor [1], decision trees [73], neural networks [33], and Bayesian networks [68] often suffer from overfitting or underfitting due to the lack of an elegant way to control the tradeoff between empirical risk minimization (ERM) and generalization ability. Advances in statistical learning theory [90][91] introduced the VC dimension to measure the capacity of a hypothesis space, in which a learning algorithm searches for the optimal function for classification. Vapnik and Chervonenkis [91] indicated a better generalization bound can be achieved by restricting a learning algorithm's VC dimension.
A support vector machine (SVM) is developed to separate the data in a high-dimensional feature space using a large margin hyperplane. Such a large margin classifier has a low VC dimension, thus leading to good generalization ability. Therefore, we use a support vector machine to classify the feature vectors extracted from the SIPPER I images. Our experiments indicate a SVM performs better than a decision tree, a neural network and even ensembles of decision trees [11][12][4][26].

The focus of this dissertation is the third challenge: scaling up a support vector machine. As many applications generate massive labeled and unlabeled data sets, it is crucial for a SVM to handle large amounts of data. We tackle the scalability issues of a SVM as follows.

First, the high sampling rate of SIPPER allows us to obtain a large number of images without class labels. At the same time, the SVM classifier built on an initial small set of labeled images is expected to be more accurate if retrained with more labeled images. The time required is prohibitive for a person to manually label all the images for retraining. A smart sampling strategy is needed to selectively choose a small set of images to label, one which most helps to improve the accuracy. Such a smart selective sampling method is also called active learning. Most previous work [89][77] on active learning with SVMs focuses on two-class problems and cannot be directly extended to multi-class problems. We develop a simple active learning method, "Breaking Ties", for multi-class SVMs. Interpreting SVM outputs through a probability model, we choose to label the examples whose two largest class probabilities are close to each other. Breaking the potential tie helps improve the model's classification ability. Our experiments show "Breaking Ties" outperforms random sampling and a least-certain active learning method.
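The "Breaking Ties" selection rule can be sketched in a few lines. This is a minimal illustration, not the dissertation's implementation: it assumes the per-class probabilities have already been produced by the probability model, and the names `breaking_ties_select` and `pool` are hypothetical.

```python
import numpy as np

def breaking_ties_select(probs, k=1):
    """Return indices of the k unlabeled examples whose two largest
    class probabilities are closest together (the 'Breaking Ties' rule)."""
    probs = np.asarray(probs, dtype=float)
    # sort each row's class probabilities; margin = best minus second best
    ranked = np.sort(probs, axis=1)
    margins = ranked[:, -1] - ranked[:, -2]
    # the smallest margin is the most "tied", hence most informative, example
    return np.argsort(margins)[:k]

# toy pool of four unlabeled examples over three classes
pool = [[0.70, 0.20, 0.10],   # confident prediction
        [0.40, 0.38, 0.22],   # near tie
        [0.55, 0.30, 0.15],
        [0.34, 0.33, 0.33]]   # tightest tie, queried first
```

Querying `breaking_ties_select(pool, 2)` picks the two near-tie rows before the confident ones, which is exactly the behavior the chapter describes.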
Second, as the number of labeled examples accumulates, training a SVM becomes very time consuming and sometimes even prohibitive. A large number of support vectors resulting from a large data set makes the prediction slow too. Many methods [90][63][43][70][45][27][34][97] of speeding up the quadratic programming (QP) problem for a SVM have been well developed. However, training a SVM is still very slow for a very large data set. The reduced set method [15][16][64][82] has been proposed to use a small set of points called pre-images to approximate the support vectors, which in turn reduces the prediction time. However, solving the pre-image problem itself is time consuming. Believing there is not much room for improvement in speeding up the QP problem, we turn to the "data squashing" method [29] instead. This method compresses the large data set into several small bins. A model is fit by using only a representative example instead of all examples within a bin. The reduced training set results in significantly less training and prediction time. However, previous work on combining data squashing with SVMs [99][85][9] applied clustering algorithms to partition the data, while a clustering algorithm itself is relatively expensive computationally. Moreover, all but one previous experiment was done using the simplest kernel, a linear kernel. This casts doubt on whether those methods can be generalized to other, more complicated kernels. In this dissertation, we propose a very fast data squashing method: bit reduction. Bit reduction groups similar examples by reducing their resolution. The mean statistics of the examples in each bin are computed. We assign a weight to the mean according to the number of examples within the bin. A SVM trained with weighted examples is developed. We call this method bit reduction SVM (BRSVM).
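The squashing step just described can be sketched as follows. This is only an illustration of the binning and weighting idea, under stated assumptions: min-max scaling before quantization and the name `bit_reduce` are choices made here, not necessarily the dissertation's exact procedure (Chapter 5 gives the details).

```python
import numpy as np

def bit_reduce(X, bits):
    """Quantize each feature to `bits` bits after min-max scaling,
    then aggregate examples that fall in the same bin into a single
    weighted mean example (the 'data squashing' step of BRSVM)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    scaled = (X - lo) / np.where(hi > lo, hi - lo, 1.0)
    # reduce resolution: keep only the top `bits` bits of each feature
    codes = np.minimum((scaled * (1 << bits)).astype(int), (1 << bits) - 1)
    bins = {}
    for code, x in zip(map(tuple, codes), X):
        bins.setdefault(code, []).append(x)
    means = np.array([np.mean(v, axis=0) for v in bins.values()])
    weights = np.array([len(v) for v in bins.values()])
    return means, weights
```

Fewer bits means coarser bins, larger weights, and a smaller training set handed to the weighted SVM; the compression ratio is controlled entirely by `bits`.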
Given a good choice of compression ratio, BRSVM can achieve a significant speedup in training and prediction with a statistically insignificant loss in accuracy. It also outperforms random sampling at a well-chosen compression ratio.

The dissertation is organized as follows. Chapter 2 introduces the support vector machine and its generalization bound. Chapter 3 presents our systematic method to recognize underwater SIPPER I images: extracting robust image features, selecting an optimal subset of features, applying a SVM to classify image features and compare with other classifiers, and interpreting a SVM's outputs with a new probability model. Chapter 4 describes our active learning method "Breaking Ties" (BT) for multi-class SVMs to reduce human effort in labeling images. "BT" is applied to recognizing a new feature set of 3-bit, gray-level images from the advanced SIPPER (SIPPER II). Chapter 5 introduces bit reduction SVM (BRSVM), which employs a simple bit reduction method to compress the data and speed up the training and prediction. We conclude this dissertation and present future research directions in Chapter 6.

CHAPTER 2  BACKGROUND

2.1  Machine learning problem

Given m pairs of examples (x_1, y_1), (x_2, y_2), ..., (x_m, y_m), with x_i ∈ R^p, a machine learning algorithm learns a function f mapping the input x_i to the output y_i, where y_i = f(x_i). If y_i is a real value, this problem is called a regression problem. Otherwise, if y_i is an unordered discrete value, it is a classification problem. All chapters in this dissertation deal with classification problems. Therefore, we only discuss classification problems in the rest of this chapter.

A function learned by a machine learning algorithm is expected to generalize well to unseen data.
For example, given a new data point x_t which does not belong to the m pairs of examples, we want to predict its response y_t correctly by using f. Before we address the generalization ability of learning algorithms, we give some definitions in the following.

Definition 1  The given m pairs of examples are called training data, which are assumed to be independently identically distributed (iid) according to an unknown probability distribution P(x, y).

Definition 2  A hypothesis space is a function space where a learning algorithm searches for the optimal function f based on certain criteria.

Definition 3  A loss function defines the loss associated with the prediction f(x_i) when the true output is y_i. For example, a widely used loss function in classification problems is the 0-1 loss function, in which

    L(x_i, y_i, f) = { 0 : f(x_i) = y_i
                     { 1 : f(x_i) ≠ y_i

Definition 4  Generalization error is also called true risk or expected error. It is defined as follows:

    L(f) = ∫ L(x, y, f) dP(x, y)

Noting that P(x, y) is unknown, we cannot measure generalization error precisely.

Definition 5  Empirical loss is also called empirical risk or training error. It is the loss incurred on the training data and is usually measured by the average loss on the training data:

    L_emp = (1/m) Σ_{i=1}^m L(x_i, y_i, f)

A good machine learning algorithm should minimize the generalization error. While the generalization error is not measurable due to a lack of knowledge about the true data distribution P(x, y), a natural choice for learning is to minimize the empirical risk instead. However, a pure empirical risk minimization (ERM) algorithm only gives an accurate prediction on the training data and does not ensure equally good performance on unseen data. Overfitting is often observed in practice, such that a function that fits the training data well performs poorly on new data.
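Definitions 3 and 5 translate directly into code. The sketch below uses a hypothetical 1-d threshold classifier standing in for the learned f; the names `zero_one_loss` and `empirical_risk` are illustrative only.

```python
def zero_one_loss(x, y, f):
    """0-1 loss of Definition 3: no penalty for a correct prediction,
    unit penalty otherwise."""
    return 0 if f(x) == y else 1

def empirical_risk(data, f):
    """Empirical loss of Definition 5: the average loss over the m
    training pairs."""
    return sum(zero_one_loss(x, y, f) for x, y in data) / len(data)

# hypothetical learned classifier f: a simple threshold on a 1-d input
f = lambda x: 1 if x > 0.5 else -1
train = [(0.9, 1), (0.8, 1), (0.2, -1), (0.6, -1)]
```

On this toy set, f misclassifies only the last pair, so the empirical risk is 1/4; the generalization error of Definition 4, by contrast, cannot be computed because P(x, y) is unknown.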
To tackle this problem, statistical learning theory [91][92] indicates that restricting the capacity of a hypothesis space provides a bound between the generalization error and the empirical loss. Vapnik and Chervonenkis [92] proposed to use the Vapnik-Chervonenkis (VC) dimension to measure the capacity of a hypothesis space. We study the VC dimension by first introducing the concept of shattering.

Definition 6  In binary classification, a hypothesis space (function space) shatters a set of examples if for any class label assignment to these examples, there exists at least one hypothesis (function) within the hypothesis space that can separate them.

The shattering ability shows the classification ability of a hypothesis space. No matter how the class labels are arranged over those examples, a rich hypothesis space provides a hypothesis (function) making no error, that is, it shatters them.

Definition 7  In binary classification, the VC dimension is the maximum number of examples such that the hypothesis space (function space) can shatter them.

The VC dimension indicates the capacity of a hypothesis space. In an extreme case, a hypothesis space whose VC dimension is m can always provide a hypothesis (function) with no training error on m training examples. Intuitively, a very high-capacity hypothesis space has the ability to classify a very large number of examples correctly. On the other hand, a learning algorithm searching in such a hypothesis space is prone to overfitting.

Theorem 1  Given a hypothesis space with VC dimension h and a sample size m, the following generalization error bound holds with probability 1 − η:

    L(f) ≤ L_emp + sqrt( ( h (ln(2m/h) + 1) + ln(4/η) ) / m )    (2.1)

A small VC dimension gives a tight bound for the generalization error.
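The capacity term of Theorem 1 can be evaluated numerically. The helper name `vc_confidence` below is hypothetical; the sketch only illustrates how the term added to the empirical loss shrinks as the VC dimension h decreases or the sample size m grows.

```python
import math

def vc_confidence(h, m, eta):
    """Capacity term of the bound in Theorem 1: with probability
    1 - eta,  L(f) <= L_emp + sqrt((h*(ln(2m/h) + 1) + ln(4/eta)) / m)."""
    return math.sqrt((h * (math.log(2 * m / h) + 1)
                      + math.log(4 / eta)) / m)
```

For example, at a fixed m and confidence level, lowering h from 100 to 10 visibly tightens the bound, which is the quantitative motivation for restricting capacity.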
In this situation, the generalization error cannot deviate from the empirical loss very much, with large probability. In order to keep the generalization bound low, a learning algorithm should minimize the empirical loss in a hypothesis space with a small VC dimension.

2.2  Support vector machines

A support vector machine (SVM) minimizes the empirical loss and restricts the capacity of its hypothesis space simultaneously. A regularized form of a SVM can be written as:

    min_{f ∈ H}  (1/m) Σ_{i=1}^m L(x_i, y_i, f) + λ ‖f‖²_H    (2.2)

H is a hypothesis space in which a learning algorithm searches for an optimal function f. ‖f‖²_H is an L_2 norm defined on the hypothesis space. Minimizing this term amounts to restricting the capacity of the hypothesis space, which in turn results in a tight generalization bound. λ is the regularization constant that controls the tradeoff between the empirical loss and the capacity of the hypothesis space.

2.2.1  Deriving the SVM

A SVM uses the hinge loss as its loss function:

    L(x_i, y_i, f) = (1 − y_i f(x_i))_+ = { 0 : y_i f(x_i) > 1
                                          { 1 − y_i f(x_i) : otherwise

where (k)_+ = max(k, 0). As y_i ∈ {−1, +1}, a SVM tries to make y_i f(x_i) as large as possible. The value of y_i f(x_i) is defined as the margin. Intuitively, a large value of y_i f(x_i) brings strong confidence in making a correct classification. That is why a SVM is also called a large margin classifier. If y_i f(x_i) > 1, there is no loss. Otherwise, we pay 1 − y_i f(x_i) as the penalty.

In a SVM, the input data x_i is mapped to a high-dimensional feature space by using a function Φ. The hypothesis space H in a SVM is a set of hyperplanes in that feature space. A SVM works in a high-dimensional feature space because many linearly inseparable problems in a low-dimensional space become separable in a high-dimensional feature space.
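The hinge loss above is a one-line function. A minimal sketch (the name `hinge_loss` is illustrative):

```python
def hinge_loss(y, fx):
    """Hinge loss (1 - y*f(x))_+ : zero once the margin y*f(x)
    exceeds 1, and a linear penalty otherwise."""
    return max(0.0, 1.0 - y * fx)
```

Note that a correctly classified point with a small margin (say y = +1, f(x) = 0.5) still pays a penalty of 0.5, while a misclassified point (y = −1, f(x) = 1.0) pays 2.0; this is how the loss pushes for margins beyond 1, not merely for correct signs.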
A SVM's hypothesis space is in the form of a linear model:

    f(x_i) = ⟨w, Φ(x_i)⟩ + b    (2.3)

where (w, b) are the coefficients of the optimal hyperplane. Consequently, the L_2 norm of the SVM hypothesis space is ⟨w, w⟩. Substituting the loss function and L_2 norm of a SVM into Eq. (2.2), we get

    min_{w,b}  (1/m) Σ_{i=1}^m (1 − y_i(⟨w, Φ(x_i)⟩ + b))_+ + λ⟨w, w⟩    (2.4)

Eq. (2.4) can be further simplified by introducing slack variables ξ_i = (1 − y_i(⟨w, Φ(x_i)⟩ + b))_+, i = 1, ..., m. It leads to

    min_{w,b}  (1/m) Σ_{i=1}^m ξ_i + λ⟨w, w⟩    (2.5)

    subject to:  y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1 − ξ_i    (2.6)
                 λ > 0,  ξ_i ≥ 0    (2.7)

As a result, training a SVM is equivalent to searching for the optimal (w, b) in the above quadratic programming problem.

2.2.2  Kernel

According to the representer theorem [46], the function f with optimal (w, b) can be written as

    f(x) = Σ_{i=1}^m α_i ⟨Φ(x_i), Φ(x)⟩    (2.8)

Therefore, the optimal function f only depends on the inner product between two examples in the feature space. In other words, if we can define an inner product in the feature space, k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩, we only need to work with k(x_i, x_j) without explicitly calculating Φ(x_i), i = 1, ..., m. The function k which defines an inner product computation is called a kernel. The "kernel trick" saves many computations which would otherwise be carried out in a high-dimensional space. In an extreme case, we can work with a kernel k without knowing what the corresponding feature space is.

Definition 8  A kernel matrix (Gram matrix) K is positive semidefinite if

    Σ_i Σ_j c_i c_j K(x_i, x_j) ≥ 0

for any given c_i ∈ R.

If a kernel is positive semidefinite, there exists a feature mapping Φ such that ⟨Φ(x_i), Φ(x_j)⟩ = k(x_i, x_j). Therefore, any positive semidefinite kernel is a valid kernel. It implicitly determines a feature space in which an inner product is well defined.
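Definition 8 can be checked numerically for a finite set of points: a symmetric Gram matrix satisfies the condition exactly when all of its eigenvalues are non-negative. The sketch below, with hypothetical helper names, builds the Gram matrix of the Gaussian RBF kernel k(x, z) = exp(−g‖x − z‖²), a standard kernel that is always valid:

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """Numerical check of Definition 8: a symmetric Gram matrix is
    positive semidefinite iff all its eigenvalues are >= 0."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

def rbf_gram(X, g=1.0):
    """Gram matrix of the Gaussian RBF kernel k(x, z) = exp(-g*||x-z||^2)."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-g * d2)
```

An arbitrary symmetric matrix such as [[0, 1], [1, 0]] fails the check (eigenvalues ±1), so it cannot arise as a Gram matrix of any feature mapping; an RBF Gram matrix always passes.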
The feature space implicitly determined by a positive semidefinite kernel is called a Reproducing Kernel Hilbert Space (RKHS). More details about kernels and RKHS spaces can be found in [6][92][37]. There are three widely used kernels: the linear kernel, the polynomial kernel and the Gaussian RBF kernel.

    Linear kernel:      k(x_i, x_j) = ⟨x_i, x_j⟩    (2.9)
    Polynomial kernel:  k(x_i, x_j) = ⟨x_i, x_j⟩^d    (2.10)
    RBF kernel:         k(x_i, x_j) = exp(−g ‖x_i − x_j‖²)    (2.11)

Paper [36] is a good reference for many other kernels and their applications.

2.2.3  Margin, VC dimension of a large margin hyperplane and structural risk minimization

The margin of a hyperplane classifier is defined in [83] as follows.

Definition 9  For a hyperplane {⟨w, x⟩ + b = 0}, the margin of a point (x_i, y_i) is ρ(x_i, y_i), where

    ρ(x_i, y_i) = y_i(⟨w, x_i⟩ + b) / ‖w‖

The margin of (x_1, y_1), ..., (x_m, y_m) is the minimum of the individual margins:

    ρ = min_{i=1,...,m} ρ(x_i, y_i)

Geometrically, the margin of (x_i, y_i) is just the distance from x_i to the hyperplane. An example that is correctly classified has a positive margin; otherwise it has a negative margin. Intuitively, a large margin hyperplane has good generalization ability. Since training data are sampled from the unknown true data distribution P(x, y), future data can be taken as the training data with a certain amount of noise. As long as the amplitude of the noise is less than the margin, a large margin hyperplane will classify it correctly. If we assume the canonical form min_i |⟨w, x_i⟩ + b| = 1, the margin of the m examples is 1/‖w‖. Therefore, Eq.
(2.5) minimizes the empirical loss Σ_i ξ_i and maximizes the margin 1/‖w‖ by minimizing ⟨w, w⟩.

The good generalization ability of a large margin classifier is explained in [90] using VC dimension theory.

Theorem 2  Given a set of hyperplanes in canonical form, min_i |⟨w, x_i⟩ + b| = 1, with ‖w‖ ≤ Λ, the VC dimension h satisfies

    h ≤ R² Λ²

where R is the radius of the smallest sphere centered at the origin and containing all the examples.

Therefore, h is related to the margin. We can restrict the VC dimension of a large margin classifier by minimizing ‖w‖, as shown in Eq. (2.5). This principle is called structural risk minimization (SRM). The idea of SRM is to search through a set of hypothesis spaces with reduced capacity or VC dimension. In Eq. (2.5), a small value of ‖w‖ corresponds to a hypothesis space with a small capacity. In this way the generalization ability is improved. Please refer to [3][84] for more details about the VC dimension of a large margin hyperplane and SRM.

2.2.4  Primal and dual form of SVM

To be consistent with the conventional expression of a SVM, we introduce a scalar variable C and rewrite Eq. (2.5) as

    min  (1/2)⟨w, w⟩ + (C/m) Σ_{i=1}^m ξ_i    (2.12)

    subject to:  y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1 − ξ_i    (2.13)
                 C > 0,  ξ_i ≥ 0,  i = 1, ..., m    (2.14)

It is not hard to see that Eq. (2.12) is equivalent to Eq. (2.5). Eq. (2.12) is called the primal form of a SVM. Introducing the Lagrange multipliers α_i, Eq. (2.12) leads to

    L(α, w, b) = (1/2)⟨w, w⟩ + (C/m) Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i (y_i(⟨w, Φ(x_i)⟩ + b) − 1 + ξ_i)    (2.15)

    α_i ≥ 0,  i = 1, ..., m

where α is the vector (α_1, α_2, ..., α_m). From the theory of convex optimization [10], the optimal solution to Eq. (2.15) is at a saddle point: the minimum with respect to (w, b) and the maximum with respect to α. Take the partial derivatives of L(α, w, b):
    ∂L(α, w, b)/∂w = 0  and  ∂L(α, w, b)/∂b = 0    (2.16)

This leads to

    w = Σ_{i=1}^m α_i y_i Φ(x_i)    (2.17)

    Σ_{i=1}^m α_i y_i = 0    (2.18)

Substituting them into Eq. (2.15), the dual form of a SVM is as follows:

    maximize    Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j k(x_i, x_j)    (2.19)
    subject to  0 ≤ α_i ≤ C/m,  i = 1, ..., m
                Σ_{i=1}^m α_i y_i = 0

The dual problem of a SVM is simpler than its primal form. A quadratic programming (QP) solver can be used to solve it. The SVM's decision function is

    f(x) = Σ_i α_i y_i k(x_i, x) + b    (2.20)

The Karush-Kuhn-Tucker condition at the optimal solution is

    α_i [y_i(⟨w, Φ(x_i)⟩ + b) − 1 + ξ_i] = 0    (2.21)

The variable α_i is nonzero only when Eq. (2.22) is satisfied. In this case x_i contributes to the decision function and is called a support vector (SV).

    y_i(⟨w, Φ(x_i)⟩ + b) = 1 − ξ_i    (2.22)

Therefore, we get a sparse solution for the decision function, to which only SVs contribute.

2.3  Multi-class SVM

There are two main approaches to extending SVMs to multi-class classification: one-vs-all and one-vs-one.

1. One-vs-all: A set of binary SVMs is trained to separate each class from the rest. The drawback is that we are handling unbalanced data when building the binary SVMs. Moreover, each binary SVM is built on a totally different training set. There might be cases in which some binary SVMs conflict with each other on some examples. It is difficult to assign the class from just the real-valued outputs of every binary SVM.

2. One-vs-one: All possible pairs of classes are used to build binary SVMs. In the N-class case, we build N(N−1)/2 binary SVMs. When a new example is tested, all the binary SVMs vote to classify it. The one-vs-one approach needs to build more binary SVMs than the one-vs-all approach; however, each of its binary SVMs only learns on a fraction of the data, thus it can be time efficient on a large data set.
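The one-vs-one voting scheme can be sketched as follows. This is a minimal illustration, assuming a trained pairwise classifier is available as a callable; the names `one_vs_one_predict` and `binary_predict` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, binary_predict):
    """One-vs-one voting: query a binary classifier for each of the
    N(N-1)/2 class pairs and return the class with the most votes.
    `binary_predict(x, a, b)` returns a or b for the (a, b) machine."""
    votes = Counter(binary_predict(x, a, b)
                    for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]
```

With N = 5 classes the scheme queries 10 pairwise machines; each machine was trained on only two classes' data, which is the source of the time efficiency mentioned above.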
Hsu [40] compared the one-vs-all and one-vs-one approaches to handling multiclass problems in SVMs. They found the one-vs-one approach was much faster and more accurate than the one-vs-all approach on most data sets. We also compared the two approaches on our data sets in Chapter 3, and the one-vs-one approach was superior in our experiments.

CHAPTER 3
RECOGNIZING PLANKTON IMAGES FROM THE SHADOW IMAGE PARTICLE PROFILING EVALUATION RECORDER

We present a system to recognize underwater plankton images from the Shadow Image Particle Profiling Evaluation Recorder (SIPPER). The challenge of the SIPPER image set is that many images do not have clear contours. To address that, shape features that do not heavily depend on contour information were developed. A soft-margin support vector machine (SVM) was used as the classifier. We developed a way to assign probabilities after multiclass SVM classification. Our approach achieved approximately 90% accuracy on a collection of plankton images. On another, larger image set containing manually unidentifiable particles, it also provided 75.6% overall accuracy. The proposed approach was statistically significantly more accurate on the two data sets than a C4.5 decision tree and a cascade correlation neural network. The single SVM significantly outperformed ensembles of decision trees created by bagging and random forests on the smaller data set and was slightly better on the other data set. The 15-feature subset produced by our feature selection approach provided slightly better accuracy than using all 29 features. Our probability model gave a reasonable rejection curve on the larger data set.

3.1 Introduction

Recently, the Shadow Image Particle Profiling Evaluation Recorder (SIPPER) was developed to continuously sample plankton and suspended particles in the ocean [78].
The SIPPER uses high-speed digital line-scan cameras to record images of plankton and other particles, thus avoiding the extensive postprocessing necessary with analog video particle images. The large sampling aperture of the sensor, combined with its high imaging resolution (50 μm per pixel), means that it is capable of collecting tens of thousands of plankton images an hour. This soon overwhelms a scientist attempting to manually classify the images into recognizable plankton groups. Therefore, an automated plankton recognition system is necessary to solve the problem, or at the very least to help with the classification.

Tang [88] developed a plankton recognition system to classify plankton images from video cameras. Moment invariants and Fourier descriptor features were extracted from contour images. Also, granulometric features were computed from the gray-level images. Finally, a learning vector quantization neural network was used to classify examples. Tang [88] achieved 92% classification accuracy on a medium-size data set.

The project ADIAC (Automatic Diatom Identification and Classification) has been ongoing in Europe since 1998. Different feature sets and classifiers have been experimented with to recognize separate species of diatoms taken from photomicroscopes. Loke [51] and Ciobanu [20] studied some new contour features. Santos [79] extended the contour features to multiscale Gabor features together with texture features. Wilkinson [96] applied morphological operators to help extract both contour and texture information. Fischer [35] summarized these features and used ensembles of decision trees to classify the combined feature set. Greater than 90% overall accuracy was achieved on the diatom images. However, the images from previous work are of relatively good quality, or at least have clear contours.
Therefore, complicated contour features and texture information can be extracted easily. The SIPPER images, on the other hand, present several difficulties:

1. Many SIPPER images do not have clear contours, and some are partially occluded. Therefore, we cannot primarily depend on contour information to recognize the plankton.
2. The SIPPER image gallery includes many unidentifiable particles as well as many different types of plankton.
3. The SIPPER images in our experiments are binary, thus lacking texture information.

Figure 3.1. Copepod in SIPPER I images.

Tang [87] proposed several new features for SIPPER images and applied multilevel dominant eigenvector methods to select a best feature subset. A Gaussian classifier was employed to recognize the image features and validate the feature selection methods on selected identifiable plankton.

This chapter is organized as follows. Section 3.2 introduces the binary SIPPER images used in the experiments. In Section 3.3, we discuss the preprocessing of the images and the extraction of the features. Section 3.4 describes probability assignment in a multiclass support vector machine. We apply wrappers with backward elimination to select the best feature subset in Section 3.5, and experimental results for the system are detailed in Section 3.6. Finally, we summarize our work in Section 3.7.

3.2 Image gallery

The image gallery includes 7285 binary SIPPER images: 1285 images from five types of plankton were initially selected by marine scientists as our starting point. The other 6000 images were sampled from a deployment of SIPPER in the Gulf of Mexico. The 6000 images came from the five most abundant types of plankton plus manually unrecognizable particles. All the images were manually classified by marine scientists. Figures 3.1 to 3.7 show typical examples of plankton and unidentifiable particles from the SIPPER image set.
3.3 Feature computation

In the field of shape recognition, general features like invariant moments, Fourier descriptors, and granulometric features are widely used [23]. However, those general features are insufficient to capture the information contained in SIPPER images sampled from the Gulf of Mexico. Moreover, the SIPPER images have a lot of noise around or on the plankton, and many images do not have clear contours, making a direct implementation of contour features (e.g., the Fourier descriptor [100]) unstable. To solve this problem, we first preprocessed the images to suppress noise. We extracted only invariant moments and granulometric features, which are relatively stable with respect to noise and do not depend heavily on the contour image. To capture information specific to our SIPPER image set, domain knowledge was used to extract features such as size, convex ratio, and transparency ratio. There are 29 features in total, as shown in Table 3.1. The rest of this section provides more details about these features.

Figure 3.2. Diatom in SIPPER I images.
Figure 3.3. Doliolid in SIPPER I images.
Figure 3.4. Larvacean in SIPPER I images.
Figure 3.5. Protoctista in SIPPER I images.
Figure 3.6. Trichodesmium in SIPPER I images.
Figure 3.7. Unidentifiable particles in SIPPER I images.

Table 3.1. Description of the 29 features.

    Features                                               Number of features
    Moment invariants of the original image                        7
    Moment invariants of the contour image after closing           7
    Granulometric features                                         7
    Domain-specific features                                       8

3.3.1 Object detection and noise suppression

Marine scientists used specialized software to detect objects: a series of morphological dilations was performed to connect the nearest image pixels.
If the bounding box of the connected image pixels after dilation was bigger than 15 × 15, the original image was stored as an object. Otherwise, the image pixels were considered irrelevant and deleted.

There are many noise suppression methods [86]. In this work, we applied a simple method, connected component analysis, to eliminate noise pixels far from the object bodies. Under the eight-connectivity condition (that is, all eight neighbor pixels of a pixel are considered connected to it), if a pixel's connected path to the image body is longer than 4, it is regarded as noise and eliminated. In addition, a morphological closing operation with a 3 × 3 square window as the structuring element was used to obtain a roughly smooth image shape and to separate the holes inside the plankton body from the background [69]. This operation also helps to compute several domain-specific features described in Section 3.3.4 and to obtain a rough contour of the image.

3.3.2 Moment invariants

Moment features are widely used as general features in shape recognition. The standard central moments are computed as follows. Let (x̄, ȳ) be the center of the foreground pixels in the image; the (p + q)-order central moments are computed over every foreground pixel (x, y):

    μ(p, q) = Σ_x Σ_y (x − x̄)^p (y − ȳ)^q                                (3.1)

The central moments are then normalized by size, as shown in Eq. (3.2):

    η(p, q) = μ(p, q) / μ(0, 0)^((p+q)/2 + 1)                             (3.2)

Hu [41] introduced a way to compute the seven lower-order moment invariants from several nonlinear combinations of the central moments. Using the normalized central moments, we obtain scale-, rotation- and translation-invariant features. We computed the same 7 moment invariants on the whole object and on the contour image after a morphological closing operation, respectively. The seven moment features are computed as shown in Table 3.2.

Table 3.2. Hu's moment invariants.

    Feature number   Moment
    1   η(2,0) + η(0,2)
    2   (η(2,0) − η(0,2))² + 4η(1,1)²
    3   (η(3,0) − 3η(1,2))² + (3η(2,1) − η(0,3))²
    4   (η(3,0) + η(1,2))² + (η(2,1) + η(0,3))²
    5   (η(3,0) − 3η(1,2))(η(3,0) + η(1,2))[(η(3,0) + η(1,2))² − 3(η(2,1) + η(0,3))²]
        + (3η(2,1) − η(0,3))(η(2,1) + η(0,3))[3(η(3,0) + η(1,2))² − (η(2,1) + η(0,3))²]
    6   (η(2,0) − η(0,2))[(η(3,0) + η(1,2))² − (η(2,1) + η(0,3))²]
        + 4η(1,1)(η(3,0) + η(1,2))(η(2,1) + η(0,3))
    7   (3η(2,1) − η(0,3))(η(3,0) + η(1,2))[(η(3,0) + η(1,2))² − 3(η(2,1) + η(0,3))²]
        − (η(3,0) − 3η(1,2))(η(2,1) + η(0,3))[3(η(3,0) + η(1,2))² − (η(2,1) + η(0,3))²]

3.3.3 Granulometric features

Since the Hu moments only contain low-order information from the image, we also extracted granulometric features [55], which are robust measurements of higher-order information. Granulometric features were computed by performing a series of morphological openings with different sizes of structuring elements and recording the differences in size between the plankton before and after the openings. Granulometric features are relatively robust to noise and contain inherent information on shape distribution. Tang [88] found that granulometric features were the most important features in his experiments.

We applied 3 × 3, 5 × 5, 7 × 7, and 9 × 9 square windows as structuring elements and performed a series of morphological openings. The differences in size were then normalized by the original plankton size to obtain the granulometric features. Also, we applied 3 × 3, 5 × 5, and 7 × 7 square windows as structuring elements and performed a series of morphological closings; the differences in size were normalized in the same way. We did not apply a 9 × 9 square window for the closing because the SIPPER images are so small that most of them are diminished after the closing with a 7 × 7 structuring element.
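The central-moment pipeline of Eqs. (3.1) and (3.2), and the first two invariants of Table 3.2, can be sketched in NumPy as follows. This is a minimal illustration on a synthetic binary image, not the dissertation's implementation; `hu_first_two` is a hypothetical helper name.

```python
import numpy as np

def central_moment(img, p, q):
    """mu(p, q) of Eq. (3.1): img is a binary array with foreground == 1."""
    ys, xs = np.nonzero(img)
    xbar, ybar = xs.mean(), ys.mean()
    return ((xs - xbar) ** p * (ys - ybar) ** q).sum()

def normalized_moment(img, p, q):
    """eta(p, q) of Eq. (3.2): size-normalized central moment."""
    return central_moment(img, p, q) / central_moment(img, 0, 0) ** ((p + q) / 2 + 1)

def hu_first_two(img):
    """First two Hu invariants from Table 3.2."""
    e20, e02, e11 = (normalized_moment(img, *pq) for pq in [(2, 0), (0, 2), (1, 1)])
    return e20 + e02, (e20 - e02) ** 2 + 4 * e11 ** 2

img = np.zeros((32, 32), dtype=np.uint8)
img[8:24, 8:24] = 1            # a 16x16 square standing in for a plankton body
phi1, phi2 = hu_first_two(img)
```

For a filled square, phi1 is close to the continuous-domain value 1/6 regardless of the square's size or position (the normalization in Eq. (3.2) cancels scale, and the centering cancels translation), and phi2 vanishes by symmetry.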
The granulometric features are computed from Eq. (3.3):

    g_i = (# pixels changed after the i-th morphological operation) / (# pixels in the original image)      (3.3)

3.3.4 Domain-specific features

Moment invariants and granulometries only capture global information, which is insufficient to classify SIPPER images. Given advice from domain experts, we developed some domain-specific features to help classification. The domain-specific features include size, convex ratio, transparency ratio, eigenvalue ratio, and the ratio between the plankton's head and tail.

1. Size: the area of the plankton body, that is, the number of foreground pixels in the plankton image. The size features were extracted for both the original image and the contour image.

2. Convex ratio: We implemented a fast algorithm [7] to compute the convex hull of the plankton image. The convex ratio is the ratio between the plankton image size and the area of the convex hull; it carries information about the irregularity of the plankton boundary. We computed convex ratios with Eq. (3.4) for the original image and for the image after a morphological opening. The morphological opening eliminates the noise around the plankton, which may otherwise cause an incorrect convex hull.

    cr = (# pixels in the original image) / (# pixels in the convex hull)        (3.4)

3. Transparency ratio: the ratio between the area of the plankton image and the area of the plankton after filling all inside holes. The transparency ratio helps in recognizing transparent plankton. We computed transparency ratios with Eq. (3.5) for both the original image and the image after a morphological opening. The morphological closing was used to get rid of the noise inside the plankton body.

    tr = (# pixels in the original image) / (# pixels within the contour)        (3.5)

4. Eigenvalue ratio: We first computed the covariance matrix between the x and y coordinates of all pixels on the plankton body. The ratio between the two eigenvalues of the covariance matrix was then calculated as in Eq. (3.6). This ratio helps classify elongated plankton.

    er = min(f_1, f_2) / max(f_1, f_2)                                           (3.6)

where f_1, f_2 are the eigenvalues of cov(X, Y).

5. Ratio between the head and the tail: Some plankton, such as larvaceans, have a large head relative to their tail. We computed the ratio between the head and tail to differentiate them. To do this, we first rotated the image to make the axis with the larger eigenvalue parallel to the x-axis. Assuming the smallest and largest x values are 0 and T, respectively, we accumulated the number of foreground pixels along the x-axis from 0 to T/4 and from 3T/4 to T, respectively, and computed the ratio between the two counts as the head-to-tail ratio.

3.4 Assigning probability values in support vector machines

A probability associated with a classifier is often very useful: it indicates how much to believe the classification result. For example, the classifier could reject an example and leave it to an expert when the probability is very low. The classification probability will also be used to develop an active learning strategy for a multiclass SVM in Chapter 4.

Platt [71] introduced the sigmoid function as the probability model to fit P(y = 1|f) directly, where f is the decision function of the binary SVM. The parametric model is shown in Eq. (3.7):

    P(y = 1|f) = 1 / (1 + exp(Af + B))                                           (3.7)

where A and B are scalar values fit by maximum likelihood estimation. Platt tested the model on 3 data sets, including the UCI Adult data set and two other web classification data sets.
The sigmoid-model SVM had good classification accuracy and probability quality in his experiments.

Hastie et al. [39] proposed a method to estimate classification probabilities from a series of pairwise classifiers. Given the estimated probability P_pq for each binary classifier (the probability of being class p in the binary classifier for class p vs. class q), they minimize the average Kullback-Leibler distance between P_pq and P(p)/(P(p) + P(q)), where P(p) and P(q) are the probabilities of a given example belonging to classes p and q, respectively. An iterative algorithm is given to search for P(p). Following this line of work, Wu et al. [98] developed two new criteria for the goodness of the estimated probabilities and applied their method to multiclass SVMs. Their approach takes three steps to obtain the probability estimate. First, a grid search is used to determine the best SVM parameters (C, g) based on k-fold cross-validation accuracy. Second, with the optimal (C, g) found in the first step, A and B are fit individually for each binary SVM. Third, a constrained quadratic programming method is used to optimize the criteria they proposed.

However, this approach is time consuming. The second step involves estimating N(N−1) parameters for SVMs under the one-vs-one approach. The third step needs quadratic programming to solve for N variables for each example; on a data set with m examples, this step must run m times. Another issue is that the SVM parameters (C, g) are estimated based on accuracy and thus might not be good for probability estimation in the following two steps. In real-time plankton recognition, the probability computation needs to be fast, since the probability model must frequently be retrained as more plankton images are acquired on a cruise.

We developed a fast approximation method to compute the probability value while avoiding expensive parameter fitting.
By normalizing the real-valued output f(x) from each binary SVM, our probability model assumes the same A for all binary SVMs. Also, our approach can optimize the SVM parameters (C, g) together with the probability parameter A simultaneously, using a log-likelihood criterion.

1. We assume P(y = 1|f = 0) = P(y = −1|f = 0) = 0.5. This means that a point right on the decision boundary has a 0.5 probability of belonging to each class. We eliminate B in this way.

2. Since each binary SVM has a different margin, a crucial quantity in assigning the probability, it is not fair to assign a probability without considering the margin. Therefore, the decision function f(x) is normalized by the margin of its binary SVM. The probability model is shown in Eqs. (3.8) and (3.9). P_pq represents the probability output of the binary SVM on class p vs. class q, where class p is +1 and class q is −1. We add a negative sign before A to ensure that A is positive.

    P_pq(y = 1|f) = 1 / (1 + exp(−A f / ‖w‖))                    (3.8)
    P_pq(y = −1|f) = 1 − P_pq(y = 1|f) = P_qp(y = 1|f)           (3.9)

3. Assuming the P_pq, q = 1, 2, ..., are independent, the final probability for class p is computed as follows:

    P(p) = Π_{q ≠ p} P_pq(y = 1|f)                               (3.10)

P(p) is then normalized so that Σ_p P(p) = 1.

4. Output k = argmax_p P(p) as the prediction.

Although it is arguable whether P_pq and P_pk are really independent, since both are estimated using data from class p, the one-vs-one approach does not suffer much from this dependence: as a discriminative classifier, a binary SVM depends on the data from both the positive and the negative examples. For example, the support vectors from both sides determine an SVM's decision boundary, which in turn affects the probability estimation.
In the one-vs-one approach, no pair of binary SVMs overlaps in both its positive and its negative examples at the same time. Since there is only a weak dependence between P_pq and P_pk, Eq. (3.10) provides a reasonable approximation.

(A, C, g) are determined from the cost function L in Eq. (3.11), where t_i is the true class label of x_i:

    L = −Σ_i log P(t_i)                                          (3.11)

There are two ways of searching for the three parameters. The simple one is to search for (C, g) first based on k-fold cross-validation, then search for A based on L in Eq. (3.11) using the (C, g) determined in the first step. This method can provide good classification accuracy, since (C, g) are selected based on accuracy in the first step. We compare Platt's method and ours using this method. Both Platt's approach for two-class problems (searching for A and B) and our approach for multiclass problems (searching for A) try to minimize the log-likelihood loss function in Eq. (3.11). Since the loss function is not convex, we used a line search for the single parameter A to avoid local minima. We also compared this with a gradient descent search for A and B, as Platt proposed; the comparison is detailed in Section 3.6.4. After learning an SVM model and setting a rejection threshold p, we reject an example and leave it to be classified by a person if P(k) < p.

The other method is to employ a grid search to find the optimal (C, g, A) simultaneously, using the log-likelihood criterion in Eq. (3.11). This method will be explained for active learning in Chapter 4, where probability quality is very important. Both methods can be run in parallel to achieve a big speedup. If we want to update the probability model after adding more labeled images, we can fix C and g and only search for A; as a result, updating the probability model is very fast.
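The probability model of Eqs. (3.8) to (3.10) and the rejection rule can be sketched as follows. This is an illustrative stand-in, not the modified LIBSVM code: the (f, ‖w‖) pairs are assumed to come from already-trained binary SVMs, and all function names are hypothetical.

```python
import math

def pairwise_prob(f, w_norm, A):
    """Eq. (3.8): sigmoid of the margin-normalized decision value."""
    return 1.0 / (1.0 + math.exp(-A * f / w_norm))

def class_probabilities(classes, decisions, A=2.0):
    """Eqs. (3.9)-(3.10): combine pairwise outputs into per-class probabilities.

    decisions[(p, q)] = (f, w_norm) for the binary SVM "p (+1) vs. q (-1)";
    these stand in for the outputs of trained binary SVMs.
    """
    raw = {}
    for p in classes:
        prod = 1.0
        for q in classes:
            if q == p:
                continue
            if (p, q) in decisions:
                f, w = decisions[(p, q)]
                prod *= pairwise_prob(f, w, A)
            else:  # stored as (q, p): P_pq = 1 - P_qp, per Eq. (3.9)
                f, w = decisions[(q, p)]
                prod *= 1.0 - pairwise_prob(f, w, A)
        raw[p] = prod
    z = sum(raw.values())          # normalize so the probabilities sum to 1
    return {p: v / z for p, v in raw.items()}

def classify_with_rejection(probs, threshold):
    """Return argmax_p P(p), or None (defer to a human) if P(k) < threshold."""
    k = max(probs, key=probs.get)
    return k if probs[k] >= threshold else None

# Toy example with three classes and hand-picked (f, ||w||) pairs.
decisions = {('A', 'B'): (1.5, 1.0), ('A', 'C'): (0.8, 2.0), ('B', 'C'): (-0.5, 1.0)}
probs = class_probabilities(['A', 'B', 'C'], decisions)
```

Because B is eliminated and a single A is shared, updating the model as new labeled images arrive reduces to re-fitting one scalar, which is the speed argument made above.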
Moreover, normalizing f by its margin and assuming the same A for each binary SVM trades off some flexibility to gain a regularization effect and a speedup, since it restricts the otherwise large (N(N+1)) parameter space.

3.5 Feature selection

Feature selection helps reduce the feature computation time and increase the accuracy. There are two basic ways to do feature selection [24]. The filtering approach attempts to select a subset of features without applying a learning algorithm. It is fast, but seems unlikely to result in the best accuracy. The wrapper approach [47] selects a feature subset by applying the learning algorithm. It has the potential to result in very good accuracy, but is computationally expensive. A feature selection method specifically for SVMs was proposed recently: Weston [95] tried to minimize the generalization bound by minimizing the radius of the sphere enclosing all the training examples. The drawback of this approach is that the generalization bound is loose, and minimizing a loose bound may not yield a feature subset with good accuracy.

In our system, we applied the wrapper approach with backward elimination. Backward elimination starts with all the features and systematically eliminates them. The average accuracy from a five-fold cross-validation was used as the evaluation function. In our case, we start with 29 features, remove 1 feature from the feature set at a time, and get 29 different feature subsets with 28 features each. We evaluate the 29 feature subsets by running 5-fold cross-validation and choose the feature subset with the best average accuracy to explore. For instance, if the feature subset with the best average accuracy is M, we remove 1 more feature from M and get 28 feature subsets with 27 features each to add to the remaining candidate feature subsets. In this way, we can use various search strategies to explore those feature subsets and keep removing features. The algorithm halts if there is no improvement in accuracy for p successive feature subsets explored.

Best-first search (BFS), which is embedded in the wrapper approach, is used to explore the feature subset space. However, it tends to stop with many features, because BFS selects the most accurate nodes to explore and those nodes tend to have many features. In order to explore feature subsets with small numbers of features, greedy beam search (GBS) [38] was employed on the final feature subsets selected by BFS. GBS operates by expanding only the best q (beam width) leaf nodes, without any backtracking. It can quickly reduce the number of features to 1. To reduce the effect of overfitting, we randomly chose 20 percent of the data as a held-out data set and did the feature selection on the remaining data, while testing the selected feature subsets on the held-out data. The feature selection procedure is described below.

Feature selection algorithm
1: N = {a_1, a_2, ..., a_k}, S = {N}, T = ∅, q = constant, max = 0, where a_i is the i-th feature, k is the number of features, S is a sorted list, max is the maximum accuracy, and q is the beam width.
2: Compute the average accuracy f(N) from a 5-fold cross-validation on the training data using all features. max = f(N).
3: M = pop the set of features with the best average accuracy off S; however, randomly choose a feature subset candidate from S every 5 node expansions.
4: if f(M) > max then
5:     max = f(M)
6: else if max has not been changed in p expansions then
7:     Go to 15.
8: end if
9: for all a_i ∈ M do
10:     M_i = M − a_i
11:     Run a 5-fold cross-validation using the M_i subset of features and record the average accuracy f(M_i).
12:     Add M_i onto S, which is sorted in ascending order by f(M_i).
13: end for
14: Go to 3.
15: Remove all the elements from S except for the five most accurate feature subsets.
16: if S = ∅ then
17:     Stop.
18: end if
19: for all elements in S do
20:     M = pop the next element off S.
21:     for all a_i ∈ M do
22:         M_i = M − a_i
23:         Run a 5-fold cross-validation using the M_i subset of features and record the average accuracy f(M_i).
24:         Add M_i and f(M_i) onto T, which is a queue.
25:     end for
26: end for
27: Pick the q most accurate M_i from T and add them to S.
28: Go to 16.

After the selection algorithm, we have B_t (t = 1, 2, ..., 29), the combination of t features with the best average accuracy in 5-fold cross-validation. We then tested each B_t on the held-out data set and selected the feature subset with the fewest features and good accuracy.

3.6 Experiments

Several experiments were done to test our system. The LIBSVM [19] support vector machine software was modified and used in our experiments. LIBSVM applies sequential minimal optimization [70] in its optimizer and a one-vs-one approach for multiclass classification. We modified LIBSVM to produce a probabilistic output. For comparison, we also implemented a one-vs-all approach. In all experiments the Gaussian radial basis function, k(x, y) = exp(−g‖x − y‖²), was used as the kernel. The parameters C and g were chosen by 5-fold cross-validation using all the examples from each data set.
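The cross-validation grid search for (C, g) can be sketched as follows. This is a schematic stand-in, not the experimental harness used here: `cv_accuracy` and the toy `peak` surface replace actually training SVMs on k−1 folds, and the parameter grid values are illustrative.

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds for cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def grid_search(param_grid, cv_accuracy):
    """Pick the (C, g) pair with the best cross-validation accuracy.

    cv_accuracy(C, g) stands in for training an SVM with those parameters
    on k-1 folds and averaging the held-out accuracy over the k folds.
    """
    return max(param_grid, key=lambda cg: cv_accuracy(*cg))

def peak(C, g):
    """Toy accuracy surface that peaks at C=200, g=0.03 (the one-vs-one
    values reported in Section 3.6.1), used only to make the sketch run."""
    return -((C - 200) / 100) ** 2 - ((g - 0.03) * 10) ** 2

grid = [(C, g) for C in (50, 100, 200, 400) for g in (0.01, 0.03, 0.1)]
best_C, best_g = grid_search(grid, peak)
```

In practice the grid is evaluated in parallel (as noted for the probability-model search in Section 3.4), since each (C, g) cell is an independent k-fold training run.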
To evaluate the accuracy of SVMs, we compared them with a cascade correlation neural network [33], a C4.5 decision tree with the default pruning settings [73], and two ensembles of decision trees: bagging with unpruned decision trees [11] and random forests [12]. One hundred trees were built for each ensemble of decision trees.

3.6.1 Initial experiments

The first training set has a total of 1285 SIPPER images (50 μm resolution), which were selected by marine scientists. It contains images of 64 diatoms, 100 protoctista, 321 doliolids, 366 larvaceans, and 434 Trichodesmium. We used the C-SVM with parameters C = 200 and g = 0.03 for one-vs-one, and C = 64 and g = 0.08 for one-vs-all. Table 3.3 shows the average accuracy of the different learning algorithms from a 10-fold cross-validation. A paired-t test was used to compare the results at the 95% confidence level. The SVM one-vs-one approach is significantly more accurate than the other learning algorithms at the 95% confidence level. Also, the running times for one-vs-all and one-vs-one are 9 seconds and 2 seconds, respectively, on a Pentium 4 PC at 2.6 GHz. Therefore, the SVM one-vs-one approach outperforms the one-vs-all approach in both accuracy and running time on this data set.

Table 3.3. 10-fold cross-validation accuracy on the initial 1285-image set.

    Classifier               10-fold cross-validation accuracy
    C4.5 decision tree       82.2%
    Neural network           86.1%
    Bagging                  87.4%
    Random forests           88.2%
    SVM (one-vs-all)         86.5%
    SVM (one-vs-one)         90.0%

Table 3.4 shows the confusion matrix of the SVM one-vs-one approach from a 10-fold cross-validation experiment. The overall average accuracy is 90.0%. While we have greater than 84% accuracy on most plankton, we achieve only 79% accuracy on the diatom class.

Table 3.4. Confusion matrix of SVM (one-vs-one) from a 10-fold cross-validation on 1285 SIPPER images with all 29 features. P, Di, Do, L, and T represent Protoctista, Diatom, Doliolid, Larvacean, and Trichodesmium, respectively.

         as P    as Di   as Do   as L    as T
    P    84.4%   1.6%    9.4%    4.7%    0.0%
    Di   2.0%    79.0%   11.0%   6.0%    2.0%
    Do   0.8%    0.3%    92.8%   3.1%    0.0%
    L    0.8%    0.3%    4.4%    88.0%   6.6%
    T    0.0%    0.5%    0.2%    6.2%    93.1%

The reason is that we have only 64 diatom samples in our training set, and the SVM favors classes with more samples. For instance, assume there is an overlap in feature space between two classes, one with many examples and one with few. Most examples within that overlap are likely to come from the class with more examples. To minimize the overall hinge loss described in Section 2.2.1, the decision boundary is pushed away from the class with more examples and thus favors that class.

3.6.2 Experiments with unidentifiable particles

The second image set was collected from a deployment of SIPPER in the Gulf of Mexico. A set of 6000 images was selected from the five most abundant types of plankton, which account for 95% of the plankton samples in that run, plus manually unrecognizable particles. The five types of plankton are copepods, doliolids, larvaceans, protoctista, and Trichodesmium. The image quality in this training set is not as good as in the initial experiment. Apart from the shape of the image objects, some prior knowledge was used by marine scientists to label the images. Also, we have to classify unidentifiable particles in this experiment. There are a total of 6000 images: 1000 images of each plankton class and 1000 unidentifiable particles. We used the C-SVM with C = 200 and g = 0.032 for one-vs-one, and C = 216 and g = 0.114 for one-vs-all.

Table 3.5. 10-fold cross-validation accuracy on the 6000-image set.

    Classifier               10-fold cross-validation accuracy
    C4.5 decision tree       64.1%
    Neural network           70.4%
    Bagging                  74.2%
    Random forest            74.5%
    SVM (one-vs-all)         68.7%
    SVM (one-vs-one)         75.1%
Table 3.5 shows the average accuracy of the different classifiers from 10-fold cross validation. A paired-t test was used to compare the results at the 95% confidence level. The SVM one-vs-one approach is significantly more accurate than all the other learning algorithms except the two ensembles of decision trees. Also, the running times for the one-vs-one and one-vs-all approaches are 160 seconds and 610 seconds respectively on a Pentium 4 PC at 2.6 GHz. Therefore, the SVM one-vs-one approach outperforms the one-vs-all approach in both accuracy and running time on this data set.

Table 3.6 shows the confusion matrix of the SVM one-vs-one approach from a 10-fold cross validation. The overall average accuracy is 75.12%. The average accuracy over the five types of plankton is 78.56%. A significant number of larvaceans are confused with Trichodesmium. This observation disagrees with the first experiment, where we had high classification accuracy for both types of plankton. The reason is that some larvacean and Trichodesmium are linear objects. Domain experts have prior knowledge of the abundance of larvacean and Trichodesmium in some ocean areas, and they labeled the linear objects as larvacean or Trichodesmium when they knew the other plankton were less commonly found in the particular ocean areas examined. Therefore, there are many linear particles without significant features to differentiate between the two types of plankton in this training set, which results in lower classification accuracy on larvaceans and Trichodesmium.

Table 3.6. Confusion matrix of SVM (one-vs-one) from a 10-fold cross validation on 6000 SIPPER images with all 29 features. C, D, L, P, T, and U represent Copepod, Doliolid, Larvacean, Protoctista, Trichodesmium and Unidentifiable particles respectively.

        as C     as D     as L     as P     as T     as U
C      84.2%    0.6%     3.1%     1.0%     5.5%     5.6%
D       0.2%   82.9%     2.4%     8.7%     0.4%     5.4%
L       3.2%    1.9%    68.8%     1.4%    11.1%    13.6%
P       1.7%    5.3%     1.1%    84.4%     3.1%     4.4%
T       3.3%    0.6%     9.4%     1.8%    72.5%    12.4%
U       4.3%    3.1%    15.8%     5.4%    13.5%    57.9%

It is clear that the one-vs-one approach is superior to the one-vs-all approach on the two data sets. Therefore, we choose to use the one-vs-one approach in our system. In the rest of this section, the term SVMs refers by default to SVMs created with the one-vs-one approach.

3.6.3 Feature selection

Feature selection was tested on the larger training set described in Section 3.6.2. Although the single SVM seems superior to the other two single classifiers, there is no guarantee that this remains true after feature reduction. Therefore, we experimented with feature selection (the wrapper approach) on the SVM and its direct competitor, the cascade correlation neural net. We did not use the decision tree in the comparison because it is far less accurate than the SVM on this data set and thus unlikely to be the best. We did not choose the random forest because an ensemble of classifiers increases the complexity of the classifier while not providing better accuracy.

The data set was randomly divided into two parts: 80% for training and 20% for validation. In this way we have 1200 examples for validation, which makes the test result relatively stable, and 80% of the data in training, which is likely to provide a feature subset similar to the one obtained using all the data. We set the stopping criterion p to 150 and the beam width q to 5 in our experiment.

Figures 3.8 and 3.9 show the average accuracy from the 5-fold cross validation on the training data and the test accuracy on the validation data, respectively. The SVM provided better accuracy than the neural net on both the training set and the validation set when the number of features was greater than 4. To choose the smallest number of features for the SVM, McNemar's test [25] was applied on the validation set to compute the 95% confidence interval. When the number of features was less than 15, the accuracy fell outside the confidence interval. Therefore, we chose the 15-feature subset as the optimal feature subset; it provided slightly better accuracy than using all the features on the validation data set. Table 3.7 briefly describes the selected feature subset.

Table 3.7. Description of the 15 selected features.

Features                                    Number of           Number of
                                            original features   selected features
Moment invariants of the original image     7                   4
Moment invariants of the contour image      7                   1
Granulometric features                      7                   4
Domain specific features                    8                   6

A detailed description of the 15 selected features is as follows.

1. Moment invariants of the original images: The first 4 Hu moments were selected.

2. Moment invariants of the contour images: The first Hu moment was selected.

3. Granulometric features: Morphological openings with 3x3, 5x5 and 7x7 square windows were selected, along with a morphological closing with a 5x5 square window, for a total of four granulometric features.

Table 3.8. Confusion matrix of SVM (one-vs-one) from a 10-fold cross validation on 6000 SIPPER images with the best 15-feature subset.
C, D, L, P, T, and U represent Copepod, Doliolid, Larvacean, Protoctista, Trichodesmium and Unidentifiable particles respectively.

        as C     as D     as L     as P     as T     as U
C      84.5%    0.9%     3.1%     0.5%     5.6%     5.4%
D       0.7%   85.2%     1.1%     9.3%     0.4%     3.3%
L       4.3%    2.1%    67.2%     1.1%    12.5%    12.8%
P       1.8%    5.0%     0.7%    85.8%     3.0%     3.7%
T       4.5%    0.4%    10.0%     1.5%    72.5%    11.0%
U       5.1%    2.3%    15.6%     5.4%    13.4%    58.2%

4. Domain specific features: The convex ratio and transparency ratio computed on images after morphological opening were eliminated, leaving 6 domain specific features.

Only 1 moment invariant of the contour images was selected. This is reasonable because the contours of the plankton images were not stable, and hence the moment invariants of the contour images were not very helpful for classification. The eliminated convex ratio and transparency ratio for images after morphological opening appear to be redundant with the same features computed on the original images. Therefore, our feature selection approach seems to eliminate irrelevant and redundant features on this image set.

To test the overall effect of feature selection, we applied 10-fold cross validation on the whole 6000 image set. The confusion matrix is shown in Table 3.8. The overall average accuracy is 75.57%. The average accuracy over the five types of plankton is 79.04%. Both indicate that the best 15-feature subset performs slightly better than all 29 features. It is also certainly faster to compute the 15 features.

3.6.4 Probability assignment experiments

In this experiment, we compared our approach (line search for A) with Platt's approach extended to multiple classes (gradient descent search for A and B). We used the same training set as in the last experiment with the 15-feature subset.
Figure 3.8. Feature selection on the training set: the solid line represents the accuracy of the SVM and the dashed line the accuracy of the neural net.

Figure 3.9. Selected feature subsets on the validation set: the solid line represents the accuracy of the SVM and the dashed line the accuracy of the neural net.

To reduce the overfitting effect from parameter fitting, a 3-fold cross validation was applied to search for the best parameters in Platt's paper [71]. We used 3-fold cross validation for both approaches. Since the gradient descent search for A and B easily gets stuck in local minima, we varied the initialization several times to obtain the minimal loss. Table 3.9 describes the optimal parameters for both approaches. The gradient descent search provided parameters with a smaller loss. The line search for a single parameter A is definitely faster than a gradient descent search for A and B with different initializations.

Table 3.9. Best parameters for the log-likelihood loss function.

                                         A        B        log(L)
Line search for A                        87.0     -        4516.5
Gradient descent search for A and B      71.0     0.412    4496.4

To compare the different parameter sets, we drew a rejection curve from 10-fold cross validation using the best parameters for both approaches. The points on the rejection curve were sampled by varying the rejection threshold p, whose range is between 0 and 1. Figure 3.10 shows that our approach is at least as good as the MLE of A and B. It indicates that B = 0 is a reasonable assumption, at least for our data set.
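The single-parameter fit can be sketched as a simple grid line search. The sketch below uses Platt's parameterization P(y=1|f) = 1/(1 + exp(A*f + B)) with B fixed at 0 and minimizes the negative log-likelihood over candidate values of A. The (f, y) pairs are hypothetical decision values and labels, not the dissertation's data, and the loss here is a stand-in for the dissertation's loss function (3.11); note that under this sign convention the fitted A comes out negative, whereas the dissertation's parameterization evidently places the sign differently.

```python
import math

def neg_log_likelihood(A, data):
    # Negative log-likelihood of labels y under P(y=1|f) = 1/(1 + exp(A*f)),
    # i.e. Platt's sigmoid with B = 0.
    nll = 0.0
    for f, y in data:
        p = 1.0 / (1.0 + math.exp(A * f))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        nll -= math.log(p) if y == 1 else math.log(1.0 - p)
    return nll

# Hypothetical SVM decision values f with labels y (1 = positive class).
data = [(2.1, 1), (1.3, 1), (0.4, 1), (-0.2, 1),
        (-0.5, 0), (-1.1, 0), (-2.4, 0)]

# One-dimensional line search: evaluate the loss on a grid of A values.
candidates = [a / 100 for a in range(-1000, 1)]  # A in [-10, 0]
best_A = min(candidates, key=lambda A: neg_log_likelihood(A, data))
print(best_A, round(neg_log_likelihood(best_A, data), 3))
```

A real implementation would refine the grid around the minimum (or use golden-section search), but the one-dimensional search remains much cheaper than restarting a two-parameter gradient descent from several initializations.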
3.7 Conclusions

This chapter presented a plankton recognition system for binary SIPPER images. General features as well as domain specific features were extracted, and a support vector machine was used to classify examples. We also developed a way to assign a probability value after the multi-class SVM classification. We tested our system on two different data sets. The recognition rate exceeded 90% in one experiment and was over 75% on the more challenging data set with unidentifiable particles. An SVM was more accurate than a C4.5 decision tree [73] and a cascade correlation neural network [33] at the 95% confidence level on the two data sets. The single SVM was significantly more accurate than bagging [11] applied to decision trees and random forests [12] on the smaller data set, and insignificantly more accurate on the larger data set. The wrapper approach with backward elimination successfully reduced the number of features from 29 to 15 and allowed a classifier to be built with slightly better accuracy than using all the features. Our probability model for multi-class SVMs provided a reasonable rejection curve.

Figure 3.10. Rejection curve for both approaches: overall accuracy vs. rejection rate.

CHAPTER 4

ACTIVE LEARNING TO RECOGNIZE MULTIPLE TYPES OF PLANKTON

This chapter presents an active learning method to reduce domain experts' labeling effort in applying support vector machines to recognize underwater zooplankton from higher-resolution, new generation SIPPER II images. Most of the previous work on active learning with support vector machines deals only with two-class problems.
In this chapter, we propose an active learning approach, "Breaking Ties" [48], for multi-class support vector machines using the one-vs-one approach with a probability approximation. Experimental results indicate that our approach often requires significantly fewer labeled images to reach a given accuracy than the least certainty active learning method and random sampling. It can also run in batch mode with an accuracy comparable to labeling one image at a time and retraining.

4.1 Introduction

Recently, an advanced shadow image particle profiling evaluation recorder (SIPPER II) has been developed to produce 3-bit grayscale images at 25 micron resolution. SIPPER II uses high-speed digital line-scan cameras to continuously sample plankton and suspended particles in the ocean. The high sampling rate of SIPPER II makes it necessary to develop an automated plankton recognition system. For example, in a previous study, approximately 150,000 SIPPER images from a two-hour sampling deployment took over one month to classify manually [75]. Also, this automated system is expected to continuously evolve from a previous model to a more accurate model created by retraining after adding some new labeled images to the training set. Since it is impossible to manually label all the new images during the time they are acquired on the ship, active learning seems attractive.

Recently, active learning with SVMs has been developed and applied to a variety of applications [89][81][17][80][94][14][93][62][2][52][61][66][58][59]. We review the most representative and relevant work as follows. Tong and Koller [89], Schohn and Cohn [81], and Campbell et al. [17] independently developed a similar active learning approach for support vector machines (SVMs) in two-class problems.
Their approach, which we call "SIMPLE", labels the new examples closest to the decision boundary. Tong and Koller [89] used version spaces to analyze the hypothesis space of SVMs. They showed that "SIMPLE" approximately finds the examples which most dramatically reduce the version space. Compared to random sampling, "SIMPLE" reduced the number of labeled images in their experiments on text classification. Mitra et al. [58] argued that the greedy search employed in "SIMPLE" is not robust and proposed a confidence factor to measure the closeness of the current SVM to the optimal SVM. A random sampling factor was introduced when the confidence factor was low. Their proposed method performed better than "SIMPLE" in their experiments.

Roy and McCallum [77] used a different strategy to select a candidate example to label. Based on a probability model, they labeled the examples which could most reduce the posterior entropy on the unlabeled data set. This approach is called "CONF"; it amounts to improving the current classifier's classification confidence on the unlabeled data set. Although it was initially applied with naive Bayes classifiers, it can easily be extended to any classifier with probability outputs. For example, the probability outputs of SVMs can be roughly approximated by a sigmoid function [71].

Baram et al. [2] observed that there was no single winner among different active learning strategies on several data sets. They proposed to dynamically select from four learning algorithms: "SIMPLE", "CONF", random sampling, and sampling the examples furthest from the current labeled data set. The automatic selection was done by solving a multi-armed bandit problem through online learning. Brinker [14] and Park [66] independently proposed a similar selection method, named "combined" by Brinker [14], to label several examples at a time for two-class problems.
Based on "SIMPLE", it chooses to label examples which are close to the decision boundary and whose feature vectors have large angles to the previously selected candidates. A parameter was introduced to control the trade-off between the two criteria. Although Brinker did not give a way to set the optimal value of this parameter, "combined" performed better than "SIMPLE" in batch mode, namely labeling several images at a time, on several data sets.

Two things make our work different from previous approaches. First, the images sampled from the first generation SIPPER (SIPPER I) did not have clear contours. The low image quality resulted in many unidentifiable particles, which made it important to create robust image features and to handle unidentifiable particles [53]. The higher resolution SIPPER (SIPPER II) provides relatively better quality images with clear contours. Also, 3-bit gray-level images carry more texture information than binary images. As a result, handling many unidentifiable particles was no longer an issue, but new contour features and texture features are needed to help recognition.

Second, little previous work in active learning has been done with multi-class SVMs, which are required for plankton recognition. SVMs solve multi-class problems by building several two-class SVMs, and a new example usually has different distances to the decision boundaries of each of the two-class SVMs. It is hard to apply the "SIMPLE" approach because we do not know which distance to choose. A very recent paper [59] simply applied "SIMPLE" to each binary SVM in a multi-class SVM; for a multi-class problem with N binary SVMs, N examples were labeled at a time. However, this method is far from elegant. It does not provide a way to judge which example is best for all binary SVMs, and it is not unusual that an "informative" example for one binary SVM is useless for the other binary SVMs.
The "combined" method suffers from the same problem: it does not know which distance to minimize and which angle to maximize. "CONF" seems to be a natural solution for multi-class problems as long as we have a probability estimate for the output of a multi-class SVM. However, applying the "CONF" approach involves estimating the decision boundary after adding each unlabeled example to the training data in each round. If m is the number of unlabeled examples and c is the number of classes, "CONF" needs to train an SVM cm times to decide the next example to label. Although there are several heuristics to speed up this procedure, it is still extremely computationally expensive.

In this chapter, we develop a new image feature set [49], which adds contour features and texture features to the previous feature set in [53]. We also propose a new active learning strategy for one-versus-one multi-class SVMs and compare it with the least certainty method in [52]. Using the probability model for multi-class SVMs described in [53] and Chapter 3, we label the example for which the difference in probabilities between its most likely class and its second most likely class is smallest. We compare our approach with other methods, namely random sampling and least certainty, on the plankton recognition problem. To obtain the same classification accuracy, our approach required many fewer labeled examples than random sampling. It also outperformed the least certainty approach in terms of the number of examples needed to reach a given accuracy level. Our proposed method can run in batch mode, labeling up to 20 images at a time, with an accuracy comparable to labeling one image at a time and retraining.

This chapter is organized as follows. Section 4.2 describes the feature computation for grayscale SIPPER images.
Section 4.3 introduces our active learning approach for multi-class support vector machines using the probability model developed in Chapter 3. Experimental results for the system are presented in Section 4.4. Finally, we summarize our work and propose some ideas for future work in Section 4.5.

4.2 Feature computation

The advanced SIPPER (SIPPER II) improved on the previous generation SIPPER (SIPPER I) in both resolution and grayscale depth. The resolution went from 2048 to 4096 pixels per scan line and the grayscale depth from 1 bit to 3 bits. The higher resolution results in perceptually better defined images with clear contours. The 3-bit depth allows for 8 levels of grayscale, which gives the images texture that was lacking before. As a result of these improvements, the images are far easier for marine scientists to identify.

Given the new and improved data, 20 new features were created and added to the SIPPER I feature set, in four groups: 8 weighted moment features, 5 contour features, 5 texture features, and two other features, weighted size and weighted size divided by convex area. The 28 features for SIPPER I included invariant moments, granulometric features, size, convex ratio, transparency ratio, and eigenratio; they were described in detail in Chapter 3. In this chapter we present only the 20 new image features for SIPPER II. For purposes of display, the 3-bit grayscale images are rescaled to 8 bits. Figures 4.1 to 4.5 are typical examples of the images produced by SIPPER II.

Figure 4.1. Calanoid copepod in SIPPER II images.
Figure 4.2. Larvacean in SIPPER II images.
Figure 4.3. Marine snow in SIPPER II images.
Figure 4.4. Oithona in SIPPER II images.
Figure 4.5. Trichodesmium in SIPPER II images.
4.2.1 Weighted moment features

The weighted moments are the same as those originally developed for the binary data supplied by SIPPER I, except that the calculations are weighted by the grayscale intensity value. (\bar{x}, \bar{y}) is the weighted center of the foreground pixels in the image. The (p+q)-order weighted central moments \mu(p, q) are computed over every foreground pixel at (x, y):

\bar{x} = \frac{\sum_{x=1}^{H} \sum_{y=1}^{W} \frac{I(x,y)}{255}\, x}{\sum_{x=1}^{H} \sum_{y=1}^{W} \frac{I(x,y)}{255}}    (4.1)

\bar{y} = \frac{\sum_{x=1}^{H} \sum_{y=1}^{W} \frac{I(x,y)}{255}\, y}{\sum_{x=1}^{H} \sum_{y=1}^{W} \frac{I(x,y)}{255}}    (4.2)

\mu(p, q) = \sum_{x=1}^{H} \sum_{y=1}^{W} (x - \bar{x})^{p} (y - \bar{y})^{q}\, \frac{I(x,y)}{255}    (4.3)

where I(x, y) is the intensity value at (x, y), and W and H are the width and the height of the image respectively. Hu [41] introduced a way to compute the seven lower order moment invariants from several nonlinear combinations of the central moments. Using the normalized central moments, we get scale, rotation and translation invariant features. We computed the weighted moments in the same way as described in Section 3.3.2. The original moments alone would give us 56.98% accuracy with a 10-fold cross validation on SIPPER II images. Using the weighted moments, we were able to get 59.16% accuracy.

4.2.2 Contour features

Five contour features are produced. They are derived from a 1-D Fourier transform of the contour points plotted as complex numbers. The resulting array is divided into five frequency ranges, and the average magnitude of each range is calculated. The frequency range for each pixel in the resulting 1-D array is determined by computing its distance from the center of the array. Table 4.1 and Figure 4.6 show the range for each feature.

Table 4.1. The upper and lower boundary regions as a fraction of one half edge length.
Region #    Lower bound (LB)    Upper bound (UB)
1           0                   1/2
2           1/2                 3/4
3           3/4                 7/8
4           7/8                 15/16
5           15/16               1

Figure 4.6. Contour frequency domains.

The Fourier descriptor CVF[r], r = 1, ..., 5, is computed by Eq. (4.4):

CVF[r] = \frac{\sum_{x=1}^{L} F(x)\, R(x, r)}{PC(r)}    (4.4)

where L is the length of the contour in pixels, F(x) is the magnitude of the complex number at position x, R(x, r) is an indicator function which specifies whether edge pixel x is in region r, and PC(r) is the number of pixels in region r.

4.2.3 Texture

With the grayscale values that SIPPER II produces, features that reflect the texture of the image can be computed. A 2-D Fourier transform is performed on the original image. From the result of this transform, the energy of different frequency ranges is captured by computing the average magnitude of each range.

Figure 4.7. Source image.
Figure 4.8. The five frequency regions (r1 through r5) of the image.

Figures 4.7 and 4.8 show a typical plankton image and its Fourier transform. In Figure 4.8, the right half of the image represents the amplitude of the Fourier transform; the arcs in the left half indicate the boundaries of the regions. We only process half the Fourier domain since the two halves of the Fourier magnitude are mirror images of each other. These five regions yield five Fourier features. The value of each feature is the average Fourier amplitude over its respective region. The Fourier texture features TVF[r], r = 1, ..., 5, are calculated by Eq. (4.5):

TVF[r] = \frac{\sum_{x=1}^{H} \sum_{y=1}^{W} J(x, y)\, R(x, y, r)}{PC(r)}    (4.5)

Table 4.2.
Inner and outer region boundaries.

Region #    Lower bound (LB)    Upper bound (UB)
1           0                   1/2
2           1/2                 3/4
3           3/4                 9/10
4           9/10                19/20
5           19/20               1

where PC(r) is the number of pixels in region r, R(x, y, r) is the indicator function of whether the pixel at (x, y) is in region r, and J is the 2-D Fourier transform of the image. These five features alone enable a 58.34% accuracy to be obtained with a 10-fold cross validation on SIPPER images.

4.2.4 Other features

Two other features were developed: weighted size and weighted convex ratio. The weighted size is meant to reflect not just the size of the image in pixels but also the density of the image as indicated by each pixel's intensity value. Each pixel in the image is assigned a value in the range 0.0 to 1.0 (background to foreground):

\text{WeightedSize} = \sum_{x=1}^{H} \sum_{y=1}^{W} \frac{I(x, y)}{255}    (4.6)

The weighted convex ratio is computed as follows:

WCR = \frac{\text{WeightedSize}}{\text{convex hull area}}    (4.7)

4.3 Active learning approach with multi-class support vector machines

In [52], the least certainty active learning approach, which makes use of the estimated probability described in the last subsection, provides good performance in multi-class SVM classification. The idea can be traced back to [50], which uses "uncertainty sampling" to label the examples with the least classification certainty. We call the least certainty approach in [52] "LC". In this chapter, we propose another active learning approach, "Breaking Ties" (BT). The idea of "BT" is to improve the confidence of the multi-class classification. Recall that in a multi-class SVM with probability outputs, we assign x the class label argmax_p P(p). Suppose P(a) is the largest and P(b) is the second largest probability for example x, where a and b are class labels. "BT" tries to improve P(a) - P(b).
Intuitively, improving the value of P(a) - P(b) amounts to breaking the tie between P(a) and P(b), thus improving the classification confidence. The difference between "LC" and "BT" is that "LC" tries to improve the value of P(a) instead of P(a) - P(b). The two algorithms work as follows:

1. Start with an initial training set and an unclassified set of images.

2. Build a multi-class support vector machine from the current training set.

3. Compute the probabilistic outputs of the classification results for each image in the unclassified set. Suppose the class with the highest probability is a and the class with the second highest probability is b. Record the values of P(a) and P(b) for each unclassified image.

4. If LC: remove the image(s) with the smallest classification confidence from the unclassified set, obtain the label(s) from human experts and add them to the current training set.

5. If BT: remove the image(s) with the smallest value of P(a) - P(b) from the unclassified set, obtain the label(s) from human experts and add them to the current training set.

6. Go to 2.

4.4 Experiments

There were 8440 plankton images selected from the five most abundant types of plankton: 1688 images of each type. 1000 images (200 of each type) were randomly selected as the validation set used in the active learning experiments. The Libsvm [19] support vector machine software was modified to produce probabilistic outputs. In [76] it was argued that the one-vs-all approach is essentially as good as other voting algorithms; however, without post-processing the binary SVMs, we observed in our previous experiments that the one-vs-one approach provided better accuracy and less training time than the one-vs-all approach.
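The two selection rules in steps 4 and 5 above can be sketched in a few lines: each unlabeled image carries the class probabilities estimated by the current multi-class SVM; "LC" queries the image whose top probability P(a) is smallest, while "BT" queries the image whose margin P(a) - P(b) between the two most likely classes is smallest. The image names and probability vectors below are hypothetical.

```python
def top_two(probs):
    # Largest and second-largest class probabilities, P(a) and P(b).
    a, b = sorted(probs, reverse=True)[:2]
    return a, b

# Hypothetical per-image class probabilities from the current classifier.
unlabeled = {
    "img1": [0.40, 0.35, 0.15, 0.10],  # near-tie between the top two classes
    "img2": [0.34, 0.23, 0.22, 0.21],  # smallest top probability
    "img3": [0.70, 0.20, 0.05, 0.05],  # confidently classified
}

# LC: least certainty -- smallest P(a).
lc_pick = min(unlabeled, key=lambda k: top_two(unlabeled[k])[0])
# BT: breaking ties -- smallest P(a) - P(b).
bt_pick = min(unlabeled,
              key=lambda k: top_two(unlabeled[k])[0] - top_two(unlabeled[k])[1])
print(lc_pick, bt_pick)  # img2 img1
```

Note that the two rules can disagree, as here: "img2" has the least certain top class, but "img1" is the one whose top two classes are nearly tied.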
Also, given N classes and an update of the models with several more labeled examples, the one-vs-one approach only needs to update N binary SVMs built with a portion of the data, while the one-vs-all approach needs to update N binary SVMs built with all the labeled data. Therefore, the one-vs-one approach was used in our experiments. In all experiments the Gaussian radial basis function (RBF) was used as the kernel:

k(x, y) = \exp(-g \|x - y\|^{2})

where g is a scalar value.

The optimal feature subset was determined beforehand by the wrapper approach with backward elimination. This feature selection method was described in [53] and Chapter 3. With the best (g, C) parameters found by 5-fold cross validation, we applied the wrapper approach to feature selection. 80% of the images were used as training data and 20% were held out for validation. 5-fold cross validation was used to select the best feature subset for each number of features, and the best subsets were then tested on the validation set. As a result, 17 out of 49 features were selected, with slightly better 10-fold cross validation accuracy than using all 49 features. In all the active learning experiments, we used the best 17-feature subset instead of the 49-feature set. See [49] for more details about the selected features.

When active learning is applied in our system, only some initial training data are available. Therefore, the best parameter set for the probability model must be estimated from a small data set. The parameters (g, C, A) were optimized by performing a grid search across 1000 randomly selected images consisting of 200 images per class. We believe such parameters, though estimated from a relatively small set of data, are reasonably stable. A five-fold cross validation was used to evaluate each combination of parameters based on the loss function L from (3.11).
The parameters (g, C, A) were varied at fixed intervals in the grid space. Since the parameter combinations can be evaluated independently, the grid search ran very fast in parallel. The values g = 0.04096, C = 16, and A = 100 were found to produce the best results.

We ran a series of retrainings for the two active learning methods and random sampling on the training data, with N randomly selected images per class as the initial training set. Each experiment was performed 30 times and the average statistics were recorded. Instead of exhausting the entire unlabeled data set, we only labeled 750 more images in each experiment, because exhausting all the unlabeled data is not a fair criterion for comparing different sample selection algorithms. Active learning labels the most "informative" new examples, which are available at the beginning of the experiment. As more "informative" examples are labeled, only "garbage" examples are left unlabeled in the late stages of the experiment. By "garbage" examples we mean examples correctly classified by the current classifier and far from the decision boundary; such examples contribute nothing to improving the current classifier. In contrast, random sampling labels examples of average "informativeness" throughout the whole experiment, so it would surely catch up with active learning in the later stages, when active learning only has "garbage" examples left to label. Moreover, when the plankton recognition system is deployed on a cruise, the unlabeled images arrive as a stream. The nature of such an application prevents one from exhausting all the unlabeled images because of the prohibitive labeling work. Therefore, it makes more sense to compare the different algorithms in the early stage of the experiment, when the unlabeled data set is not exhausted.
To see the upper limit of classification accuracy, we built an SVM using all 7400 training images. Its prediction accuracy was 88.3% on the 1000 held-out images. Figures 4.9(a)-4.9(e) show several misclassified images.

(a) Calanoid copepod misclassified as oithona
(b) Larvacean misclassified as trichodesmium
(c) Marine snow misclassified as larvacean
(d) Oithona misclassified as calanoid copepod
(e) Trichodesmium misclassified as larvacean

Figure 4.9. Some misclassified images.

Several variations of the procedure described above were performed. We varied the number of initial labeled images per class (IIPC) to produce initial classifiers of different accuracy; in this way we could test active learning as the accuracy of the initial model varied. We also changed the number of images selected for labeling at each retraining step (IPR) to test how well active learning works in batch mode.

4.4.1 Experiments with IPR = 1, IIPC varied

Figures 4.10-4.14 show the experimental results of the active learning methods for different IIPC values. A paired-t test was used to determine whether there was a statistically significant difference between approaches. We used the standard error as the error bar because the denominator of the t test takes the form of a standard error. As shown in Figure 4.10, with only 10 images per class in the initial training sets, we started off with rather poor accuracy (64.6%). At p = 0.05, "BT" is statistically significantly more accurate than "LC", and both active learning methods are statistically significantly more accurate than random sampling. At 81% accuracy, random selection required approximately 1.7 times the number of newly labeled images as "BT". Figure 4.11 contains the first five images labeled by "BT" in one of the 30 runs. They appear to be relatively "hard" images.
Active learning is designed to label the most "informative" new images, thus helping improve the classifier. In SVMs, the decision boundary is represented by support vectors (SVs). Therefore, an effective active learning method is likely to find more SVs than random sampling. Figure 4.10 also shows the average number of SVs versus the number of images added to the initial training set over the 30 runs. Active learning produced many more SVs than random sampling. Also, the slope of both active learning curves is about 0.9, which means that 90% of the labeled images turned out to be SVs: our active learning approach efficiently captured support vectors. We note that a high slope of the support vector curve is not a sufficient condition for effective active learning, because there are many SVs that could be added to the current model and different SVs lead to different improvements.

Figure 4.10. Comparison of active learning and random sampling in terms of accuracy and number of support vectors: 10 initial training images per class, one newly labeled image added at a time. The error bars represent the standard errors.

Figure 4.11. The first five images labeled by BT in one run. The true class labels of the images, from left to right, are oithona, calanoid copepod, oithona, larvacean and larvacean.

Ideally, a very effective active learning method should find the SVs which improve the current model the most.
In contrast, an active learning method which always finds SVs that are misclassified by the current classifier and far from its decision boundary may perform very poorly, because such SVs are very likely to be noise. Therefore, we cannot compare active learning methods solely on the basis of slight differences in the support vector curve. With 50 IIPC in the initial training set, shown in Figure 4.12, we started with 77% accuracy. Compared with 10 IIPC, the accuracy of both active learning approaches improved faster than random sampling. At the 81% accuracy level, random sampling required about 2.5 times and 1.7 times the number of images compared with "BT" and "LC", respectively. The slopes of the support vector curves for active learning are higher than those for random sampling. Also, "BT" outperformed "LC", though not as clearly as with IIPC=10. In Figures 4.13 and 4.14, we started with more than 80% accuracy using 100 and 200 initial images per class, and active learning was very effective. Random sampling required more than 3 times the number of images to reach the same level of accuracy as either active learning approach. The two active learning methods captured many more SVs than random sampling. Also, our newly proposed active learning approach, "BT", required fewer images to reach a given accuracy than "LC" after 450 labeled images had been added; before that point, "BT" performed similarly to "LC". It makes sense that the accuracy of the initial classifier affects the performance of active learning and random sampling. Active learning greedily chooses the most "informative" examples based on the previous model, so a bad model may mislead the active learning approach into choosing non-informative examples which do not help improve the classifier.
Random sampling, in contrast, provides the classifier with examples of average "informativeness" whatever the initial classifier is. Therefore, if the initial classifier helps active learning choose examples that are more informative than average (random sampling), active learning will produce a more accurate classifier with fewer labeled examples. The better the initial classifier, the more labeling effort is saved.

Figure 4.12. Comparison of active learning and random sampling in terms of accuracy and number of support vectors: 50 initial training images per class, one newly labeled image added at a time. The error bars represent the standard errors.

Figure 4.13. Comparison of active learning and random sampling in terms of accuracy and number of support vectors: 100 initial training images per class, one newly labeled image added at a time. The error bars represent the standard errors.

Figure 4.14. Comparison of active learning and random sampling in terms of accuracy and number of support vectors: 200 initial training images per class, one newly labeled image added at a time. The error bars represent the standard errors.

When comparing the two active learning methods, "BT" outperformed "LC" under all four starting conditions. However, the difference in accuracy between them became insignificant as the initial classifier became more accurate. The explanation is that an accurate initial classifier leaves less room for improvement by active learning: "BT" improved the accuracy by more than 20% when IIPC=10, but boosted it by less than 4% when IIPC=200. Therefore, as the scale of the accuracy improvement shrank, the difference in accuracy between the two active learning methods became insignificant.

4.4.2 Varying the IPR

It is usually expected that more than one image is labeled and added to the training set at each retraining. For instance, it is convenient for an expert to label several images rather than one at a time.
Also, given a total of U newly labeled images, labeling k images at a time is approximately k times faster because it requires only U/k model updates. Although an incremental SVM training algorithm was proposed in [18] to reduce the retraining time, updating the model after every single labeled image is still very slow, especially when many images are to be labeled. Therefore, we would like active learning to remain effective even when several labeled images are added at a time. The active learning method "BT" is designed to add one "informative" example at a time; there is no guarantee that adding several examples at a time will still favor "BT". The reason is that adding one "informative" example updates the model, which in turn changes the criterion for the next "informative" example. Therefore, the most "informative" example set is different from simply grouping the several most "informative" individual examples together. However, such an optimal example set is very hard to compute. We therefore expect that grouping the several most "informative" examples together is a reasonable approximation of the optimal example set, or at least superior to randomly sampling several examples.

Figure 4.15. Comparison of active learning and random sampling in terms of accuracy with different IPR: 10 initial training images per class. Standard error bars are on the random sampling curve.

Figures 4.15 to 4.18 present the experimental results for "BT" with varying IPR for each IIPC. In all the experiments, the IPR was varied from 1 to 50.
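The batch approximation described above simply groups the k examples that the current model finds most "informative". A sketch, under the assumption that the classifier supplies class-probability estimates for each unlabeled example (function names and the toy probabilities are illustrative, not our implementation):

```python
def breaking_ties_score(probs):
    """Gap between the two largest class-probability estimates.

    A small gap means the classifier can barely break the tie between
    its top two classes, so the example is considered informative.
    """
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]

def select_batch(unlabeled_probs, k):
    """Approximate the optimal batch by the k smallest-gap examples."""
    ranked = sorted(range(len(unlabeled_probs)),
                    key=lambda i: breaking_ties_score(unlabeled_probs[i]))
    return ranked[:k]

# Three unlabeled examples with hypothetical class-probability estimates.
probs = [[0.70, 0.20, 0.10],   # confident -> large gap
         [0.40, 0.38, 0.22],   # near tie  -> most informative
         [0.55, 0.30, 0.15]]
batch = select_batch(probs, 2)  # indices of the two smallest gaps
```
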
We drew error bars only for random sampling because adding error bars to every "BT" curve would make the graphs too busy. We again used a paired t-test to compare "BT" with random sampling. To our surprise, classification accuracy with large IPRs is almost as good as with small IPRs, although a very large IPR (IPR=50) yields slightly less accurate classifiers than a small IPR in many cases. In all situations, an IPR as large as 50 is statistically significantly more accurate than random sampling at p = 0.05. These results indicate that our active learning

Figure 4.16. Comparison of active learning and random sampling in terms of accuracy with different IPR: 50 initial training images per class. Standard error bars are on the random sampling curve.

Figure 4.17. Comparison of active learning and random sampling in terms of accuracy with different IPR: 100 initial training images per class. Standard error bars are on the random sampling curve.

Figure 4.18.
Comparison of active learning and random sampling in terms of accuracy with different IPR: 200 initial training images per class. Standard error bars are on the random sampling curve.

approach "BT" can run in batch mode, where tens of examples are labeled at a time, achieving a speedup with at most a small compromise in accuracy.

4.5 Conclusion and discussion

This chapter presents an active learning approach to reduce domain experts' labeling effort in recognizing plankton from higher-resolution, new-generation SIPPER II images. It can be applied to any data set where examples are labeled over time and one wants to use the system as early as possible. The "Breaking Ties" active learning method is proposed and applied to a multi-class SVM, using the one-vs-one approach, on newly developed image features extracted from grayscale SIPPER images. The experimental results indicate that our proposed active learning approach successfully reduces the number of labeled images required to reach a given accuracy level when compared with random sampling. It also outperforms the least-certainty approach we proposed earlier in [52]. Our new approach can run in batch mode, labeling up to 50 images at a time, achieving a significant speedup with classification accuracy similar to labeling one image at a time. In the following, we discuss several issues of active learning in SVMs for further exploration. One critique of active learning is the overhead of searching for the next candidate to label. While random sampling simply selects an example at random, active learning must evaluate every unlabeled example. This overhead becomes significant when the unlabeled data set is very large.
A simple solution is random subset evaluation: each time we search for the next candidate example to label, instead of evaluating the entire unlabeled data set, we evaluate only a subset drawn at random from it. We note without proof that for IPR=1 it suffices to sample 59 examples: with 95% probability, the best candidate in a random subset of 59 is superior to 95% of the examples in the whole unlabeled set. See [83, chap. 6.5] for more detail. Another important issue is the change of the optimal kernel parameters. We can find the optimal kernel parameters for the initial labeled data set; as more labeled data are added, however, those kernel parameters are no longer optimal. Unless we can afford a held-out labeled data set, it is hard to tune the kernel parameters online. The key reason is that we have no good method to evaluate different kernel parameters while active learning proceeds: standard methods such as cross-validation and leave-one-out tend to fail because active learning introduces biased data samples. Such failures were observed and discussed in [2]. An important future direction is to find a good online performance evaluation method for active learning; otherwise, this may be one of the biggest bottlenecks to using SVMs as the base learner in active learning, because SVMs depend heavily on good kernel parameters. An effort toward solving this problem is reported in [2], where the classification entropy maximization (CEM) criterion was used to evaluate the performance of different active learners; that work shows CEM can help select the best active learner on several two-class data sets. An important issue omitted in most active learning + SVMs literature is running active learning in batch mode.
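The figure of 59 follows from a short calculation: the probability that none of n uniform random draws lands in the top 5% of the pool is 0.95^n, so we need the smallest n with 1 - 0.95^n >= 0.95. A sketch of that calculation:

```python
import math

def subset_size(top_fraction=0.05, confidence=0.95):
    """Smallest n such that the best of n uniform draws lies in the
    top `top_fraction` of the pool with probability >= `confidence`.

    P(all n draws miss the top) = (1 - top_fraction)^n, so we solve
    1 - (1 - top_fraction)^n >= confidence for the smallest integer n.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - top_fraction))

n = subset_size()  # the 95%/95% setting used in the text
```
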
Unless labeling an example is extremely expensive, it is always convenient and practical to run active learning in batch mode, namely labeling several examples at a time. As indicated in this chapter, the best candidate set to label may have to be found in a different way from a single best candidate point. "Combined" [14] works only for two-class problems; a criterion for the best set of data to label in multi-class SVMs needs to be addressed in future active learning work. At the very least, existing active learning methods need to be shown to work well in batch mode. Fortunately, our proposed active learning method works well in batch mode without requiring a new criterion for selecting a set of data to label.

CHAPTER 5 BIT REDUCTION SUPPORT VECTOR MACHINE

Support vector machines are very accurate classifiers and have been widely used in many applications. However, the training time and, to a lesser extent, the prediction time of support vector machines on very large data sets can be very long. This chapter presents a fast compression method to scale up support vector machines to large data sets. A simple bit reduction method is applied to reduce the cardinality of the data by weighting representative examples. We then train support vector machines on the weighted data. Experiments indicate that the bit reduction support vector machine produces a significant reduction in the time required for both training and prediction, with minimal loss in accuracy. It is also shown to be more accurate than random sampling when the data is not over-compressed.

5.1 Motivation

Support vector machines (SVMs) achieve high accuracy in many application domains, including this work on recognizing underwater zooplankton. However, scaling up SVMs to very large data sets is still an open problem.
Training an SVM requires solving a constrained quadratic programming problem, which usually takes O(m^3) computations, where m is the number of examples. Predicting a new example involves O(sv) computations, where sv is the number of support vectors and is usually proportional to m. As a consequence, an SVM's training time, and to a lesser extent its prediction time, on a very large data set can be quite long, making it impractical for some real-world applications. In plankton recognition, an SVM must make real-time or near-real-time predictions on underwater plankton sampled by in-situ imaging sensors in order to obtain the composition, abundance and distribution of plankton in a timely fashion. Also, fast retraining is often required as new plankton images are labeled by marine scientists and added to the training library on the ship. As we acquire a large number of plankton images, training an SVM with all labeled images becomes extremely slow. In this chapter, we propose a simple strategy to speed up the training and prediction procedures of an SVM: bit reduction. Bit reduction reduces the resolution of the input data and groups similar data into one bin. A weight is assigned to each bin according to the number of examples in it. This data reduction and aggregation step is very fast and scales linearly with the number of examples. An SVM is then built on a set of weighted examples which are the exemplars of their respective bins. Our experiments indicate that the bit reduction SVM (BRSVM) significantly reduces the training time and prediction time with minimal loss in accuracy. It outperforms random sampling on most data sets when the data are not over-compressed. We also find that on one high-dimensional data set bit reduction does not perform as well as random sampling, suggesting a limit on the performance of BRSVM for high-dimensional data sets.
The rest of this chapter is organized as follows. Section 5.2 reviews previous work on speeding up SVMs. Section 5.3 describes the bit reduction support vector machine (BRSVM). In Section 5.4, we describe experiments with BRSVM on nine data sets and analyze the results. Section 5.5 summarizes the approach.

5.2 Previous work

There are two main approaches to speeding up the training of SVMs. One approach is to find a fast algorithm to solve the quadratic programming (QP) problem of an SVM. "Chunking", introduced in [90], solves a QP problem on a subset of the data; it keeps only the support vectors from the subset and replaces the others with data that violate the Karush-Kuhn-Tucker (KKT) conditions. Using an idea similar to chunking, decomposition [63][43] puts a subset of the data into a "working set" and solves the QP problem by optimizing the coefficients of the data in the working set while keeping the other coefficients unchanged. In this way, a large QP problem is decomposed into a series of small QP problems, making it possible to train an SVM on large-scale problems. Sequential minimal optimization (SMO) [70] and its enhanced versions [45][27] take decomposition to the extreme: each working set contains only two examples, whose optimal coefficients can be solved for analytically. SMO is easy to implement, needs no third-party QP solver, and is widely used to train SVMs. Another way of solving large-scale QP problems [34][97] is to use a low-rank matrix to approximate the Gram matrix of an SVM; the QP optimization on the small matrix requires significantly less time than on the whole Gram matrix. The other main approach to speeding up SVM training comes from the idea of "data squashing", proposed in [29] as a general method for scaling up data mining algorithms. Data squashing divides massive data into a limited number of bins.
The statistics of the examples in each bin are computed, and a model is fit using only these statistics instead of all the examples within a bin. The reduced training set results in significantly less training time. Researchers have applied data squashing to SVMs. Several clustering algorithms [99][85][9] were used to partition the data and build an SVM from the statistics of each cluster. In [9], the SVM model built on the reduced set was used to predict on the whole training data; examples falling in the margin or misclassified were taken out of their original clusters and added back into the training data for retraining. However, both [99] and [85] assumed a linear kernel, which may not generalize well to other kernels. In [9], two experiments were done with a linear kernel and only one experiment used a third-order polynomial kernel. Moreover, it is not unusual for many examples to fall into the margin of an SVM model, especially with an RBF kernel; in such cases, retraining with all examples within the margin is computationally expensive. Following the idea of likelihood-based squashing [54][65], a likelihood squashing method was developed for SVMs by Pavlov and Chudova [67]. The likelihood squashing method assumes a probability model as the classifier: examples with similar probability p(x_i, y_i | θ) are grouped together and treated as a weighted exemplar. Pavlov and Chudova used a probabilistic interpretation of SVMs to perform the likelihood squashing; still, only a linear kernel was used in their experiments. Most work [15][16][64][82] on enabling fast prediction with SVMs has focused on reducing the number of SVs obtained.
Since the prediction time of an SVM depends on the number of support vectors, these methods search for a reduced set of vectors which approximates the decision boundary; prediction with the reduced set is faster than with all support vectors. However, reduced-set methods involve searching for a set of pre-images [82][83], a set of constructed examples used to approximate the solution of an SVM, and this search procedure is computationally expensive. Data squashing approaches seem promising and can be combined with fast QP solvers such as SMO for fast training and prediction. However, most work [85][9][99] on data squashing + SVMs requires clustering the data and/or linear kernels [85][99][67]. Clustering usually needs O(m^2) computations, while high-order kernels, such as the RBF kernel, are widely used and essential to many successful applications. Therefore, a fast squashing method and experiments with high-order kernels are necessary to apply data squashing + SVMs to real-world applications. In this chapter, we propose a simple and fast data compression method: the bit reduction SVM (BRSVM). It does not require any computationally expensive clustering algorithm and works well with RBF kernels, as shown in our experiments.

5.3 Bit reduction SVM

The bit reduction SVM (BRSVM) works by reducing the resolution of the examples and representing similar examples as a single weighted example. In this way the data size is reduced and training time is saved. It is simple and much faster than clustering. An even simpler data reduction method is random sampling, which subsamples the data without replacement.
Compared to weighted examples, random sampling suffers in theory from high estimation variance; see [21][72] for details of sampling theory. In spite of its high variance, random sampling has been shown to work very well in experiments [65][85]: it was as accurate as, or only slightly less accurate than, complicated data squashing methods.

5.3.1 Bit reduction

Bit reduction is a technique for reducing the data resolution. One example of its use is the bit reduction fuzzy c-means (BRFCM) method [44][32], which applied bit reduction to speed up the fuzzy c-means (FCM) [30][8] clustering algorithm. However, bit reduction in clustering does not consider the class labels of the examples in the same bin; in classification, only examples from the same class should be aggregated together. Also, a comparison with random sampling was omitted in [44][32]. There are three steps involved in bit reduction for an SVM: normalization, bit reduction and aggregation.

1. Normalization ensures equal resolution for each feature. Different features may have very different scales, so reducing precision equally along each feature may not be fair; for instance, features with small scales would become zero and irrelevant to classification after bit reduction. Therefore, each feature is normalized to zero mean and unit variance. To avoid losing too much information during quantization, an integer is used to represent each normalized feature value. The integer I(v) for a floating point value v is constructed as I(v) = int(Z * v), where Z is an arbitrary number used to scale v and the function int(k) returns the integer part of k. In this way the true value of v is kept and only I(v) is used in bit reduction. In our experiments, we used Z = 1000.

2. Bit reduction is performed on the integer I(v).
Given b, the number of bits to be reduced, I(v) is right-shifted and its precision is reduced. We slightly abuse notation here by letting I(v) on the right-hand side of Eq. (5.1) denote the value before bit reduction and I(v) on the left-hand side the value after bit reduction:

I(v) ← I(v) >> b    (5.1)

where k >> b shifts the integer k to the right by b bits. Given an r-dimensional example x_i = (x_i1, x_i2, ..., x_ir), its integer expression after bit reduction is (I(x_i1), I(x_i2), ..., I(x_ir)).

3. The aggregation step groups the examples of the same class whose integer expressions fall into the same bin. For each class, the mean of the examples within the same bin is computed as their representative; the weight of the representative equals the number of examples from that class. During the mean computation, the real values (x_i1, x_i2, ..., x_ir) are used.

Note that the bit reduction procedure loses data precision. A very large b results in too many examples falling in the same bin, and the mean statistic is not enough to capture the location information of so many examples. A small b does not provide enough data reduction, leaving training slow. The best number of bits to reduce, b, varies across data sets and can be found by trial and error. For instance, one can try different values of b to see how many training examples remain before building a classifier; the data miner can then choose b based on the size of data set an SVM can handle within a given time span. Alternatively, if retraining on the same type of data is needed frequently (e.g., as new data are acquired), cross-validation can be used to find the best b, which provides the desired speedup with minimal accuracy loss; that optimal b is then used for retraining on the same type of data. During bit reduction, it is very likely that a bin contains examples from several different classes.
Therefore, in the aggregation step, the mean statistic of the examples in the same bin is computed individually for each class. This at least alleviates the side effect of grouping examples from different classes into the same bin; as a result, one bin may contain weighted examples for multiple classes. Table 5.1 illustrates the bit reduction procedure for four 1-d examples with class labels y_i.

Table 5.1. A 1-d example of bit reduction in BRSVM (Z = 1000).
i | Example (x_i, y_i) | I(x_i) and its bit expression | I(x_i) after 2-bit reduction
1 | (0.008, 1) | 8 (1000)_2 | 2 (10)_2
2 | (0.009, 1) | 9 (1001)_2 | 2 (10)_2
3 | (0.010, 2) | 10 (1010)_2 | 2 (10)_2
4 | (0.011, 2) | 11 (1011)_2 | 2 (10)_2

The four examples from two classes are first scaled to integer values using Z = 1000. Then 2-bit reduction is performed by right-shifting each integer expression by 2 bits. All four examples end up with the same value, which means all four fall into one bin after a 2-bit reduction. Table 5.2 shows the weighted examples after the aggregation step.

Table 5.2. Weighted examples after the aggregation step.
i | New example (x_i, y_i) | Weight
1 | (0.0085, 1) | 2
2 | (0.0105, 2) | 2

Since all four examples are in the same bin, we aggregate them by class and compute the mean for each class using the original values x_i. The weight is computed by simply counting the number of examples from the same class. Although bit reduction is fast, a sloppy implementation of aggregation can easily cost O(m^2) computations, where m is the number of examples. We implemented a hash table for the aggregation step, as done in [32]. Universal hashing [22] was used as the hash function, and collisions were resolved by chaining. When inserting the bit-reduced integer values into the hash table, we used a list to record the occupied slots; the mean statistics were then computed by revisiting all occupied slots in the hash table.
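The whole compression pipeline for the 1-d example of Tables 5.1 and 5.2 can be sketched as follows; a plain dictionary stands in for the universal-hash table with chaining used in our implementation, and the function name is illustrative:

```python
from collections import defaultdict

def bit_reduce(examples, Z=1000, b=2):
    """Compress labeled 1-d examples into weighted exemplars.

    Each value is scaled to an integer I(v) = int(Z*v) and right-shifted
    by b bits; examples sharing the same (bin, class) pair are replaced
    by their mean, with a weight equal to the group size.
    """
    bins = defaultdict(list)
    for x, y in examples:
        key = (int(Z * x) >> b, y)  # bin id and class label
        bins[key].append(x)         # keep original values for the mean
    return [(sum(vals) / len(vals), y, len(vals))
            for (_, y), vals in sorted(bins.items())]

# The four examples of Table 5.1; all land in the same bin after a
# 2-bit reduction, yielding one weighted exemplar per class (Table 5.2).
data = [(0.008, 1), (0.009, 1), (0.010, 2), (0.011, 2)]
reduced = bit_reduce(data)
```
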
The average computational complexity of our implementation is 2m. See [22] for more detail about universal hashing.

5.3.2 Weighted SVM

Pavlov et al. [67] proposed a method to train a weighted SVM, although its description in [67] is concise and lacks significant details. Following their work, we describe in more detail how to train a weighted SVM in this subsection. Given examples x_1, x_2, ..., x_m with class labels y_i ∈ {-1, 1}, an SVM solves the following problem:

minimize (1/2)⟨w, w⟩ + (C/m) Σ_{i=1..m} ξ_i    (5.2)
subject to: y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0

where w is normal to the decision boundary (a hyperplane), C is the regularization constant that controls the trade-off between the empirical loss and the margin width, and the slack variable ξ_i represents the empirical loss associated with x_i. In the case of weighted examples, the empirical loss of x_i with weight β_i is simply β_i ξ_i. Intuitively, this can be interpreted as β_i identical copies of x_i: accumulating the loss of the β_i copies yields a loss of β_i ξ_i. Substituting β_i ξ_i for ξ_i in Eq. (5.2), we derive the primal problem of a weighted SVM:

minimize (1/2)⟨w, w⟩ + (C/m) Σ_{i=1..m} β_i ξ_i    (5.3)
subject to: y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., m

The constraints in Eq. (5.3) remain unchanged because the constraint for each of the β_i copies of x_i is identical, so the β_i identical constraints reduce to the single constraint shown in Eq. (5.3). Introducing the Lagrange multipliers α_i ≥ 0 for the margin constraints and μ_i ≥ 0 for the constraints ξ_i ≥ 0, Eq. (5.3) leads to

L(α, μ, w, b) = (1/2)⟨w, w⟩ + (C/m) Σ_{i=1..m} β_i ξ_i − Σ_{i=1..m} α_i ( y_i(⟨w, Φ(x_i)⟩ + b) − 1 + ξ_i ) − Σ_{i=1..m} μ_i ξ_i    (5.4)

where α is the vector (α_1, α_2, ..., α_m). Its saddle point solution can be computed by taking the partial derivatives of L.
  ∂L(α, w, b)/∂w = 0   and   ∂L(α, w, b)/∂b = 0                                    (5.5)

We get

  w = Σ_{i=1}^{m} α_i y_i φ(x_i)                                                   (5.6)

  Σ_{i=1}^{m} α_i y_i = 0                                                          (5.7)

Substituting these into Eq. (5.4), the dual form of a weighted SVM is as follows:

  maximize    Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j k(x_i, x_j)   (5.8)
  subject to: 0 ≤ α_i ≤ C μ_i / m,  i = 1, ..., m
              Σ_{i=1}^{m} α_i y_i = 0

The dual form of a weighted SVM is almost identical to that of a normal SVM except for the boundary condition α_i ≤ C μ_i / m, while in a normal SVM α_i ≤ C / m. Therefore, efficient solvers for a normal SVM such as SMO [70] can be used to solve a weighted SVM by modifying the boundary condition slightly.

5.4 Experiments

We experimented with BRSVM on nine data sets: banana, phoneme, shuttle, page, pendigit, letter, SIPPER II plankton images, waveform and satimage. Banana includes two-dimensional, banana-shaped data from two classes. It is a widely used benchmark data set and was originally introduced in [74]. Phoneme is provided in the ELENA repository [31]. Its aim is to differentiate between nasal and oral vowels from 5 harmonic attributes. The shuttle data set in the Statlog collection [57] has 9 numerical attributes and 58,000 examples. There are 7 classes. About 80% of the data belongs to class 1; therefore, the default accuracy is about 80%. Page, pendigit and letter come from the UCI machine learning repository [56]. The problem of the page data set is to classify the page layout of a document into: horizontal line, picture, vertical line and graphic. The goal of the pendigit data set is to recognize the 10 pen-based digits: 0–9. In the letter data set, we are to identify 26 capital letters from 16 image features computed from black-and-white images. The plankton data set was originally used in [52]. Its objective is to classify the five most abundant types of plankton from 17 selected image features from 3-bit plankton images.
Waveform was originally used in [13]. The problem is to predict 3 classes of waves from 21 attributes with added noise. Please see [13, pp. 49–55] for details. Satimage is from the Statlog repository. It is to predict the central pixel in 3×3 neighborhoods from a satellite image. Table 5.3 summarizes the characteristics of the nine data sets.

Table 5.3. Description of the nine data sets.

  Dataset    # of data   # of attributes   # of classes
  banana        5300            2                2
  phoneme       5404            5                2
  shuttle      58000            9                7
  page          5473           10                5
  pendigit     10992           16               10
  letter       20000           16               26
  plankton      8440           17                5
  waveform      5000           21                3
  satimage      6435           36                6

The Libsvm tool [19] for training support vector machines was modified and used in all experiments. The RBF kernel (k(x, y) = exp(−g‖x − y‖²)) was employed. The kernel parameter g and the regularization constant C were tuned by a 5-fold cross validation on the training data. g and C were searched for from all combinations of the values in (2^-10, 2^-9, ..., 2^4) and (2^-5, 2^-4, ..., 2^9) respectively. We used the same training and test separation as given by the original uses of the data sets. For those data sets which do not have a separate test set, we randomly selected 80% of the examples as the training set and 20% of the examples as the test set. Since all nine data sets have more than 5000 examples, 20% of the total data will have more than 1000 examples. We believe this provided a relatively stable estimation. We built SVMs on the training set with the optimal parameters and reported the accuracy on the test set. All our experiments were run on a Pentium 4 PC at 2.6 GHz with 1 GB memory under the Redhat 9.0 operating system.

5.4.1 Experiments with pure bit reduction

Tables 5.4–5.12 describe the experimental results from using BRSVM on the nine data sets. The last row of each table records the result of a SVM trained on the uncompressed data set.
The other rows present the results from BRSVM. The first column is the number of bits reduced. The second column is the compression ratio, which is defined as (# of examples after bit reduction) / (# of examples). We start off with 0-bit reduction, which may not correspond to a 1.0 compression ratio. The reason is that repeated examples are grouped together even when no bit is reduced. This results in compression ratios less than 1.0 at 0-bit reduction in some cases. The third column is the accuracy of BRSVM on the test set. McNemar's test [25] is used to check whether the BRSVM accuracy is statistically significantly different from the accuracy of a SVM built on the uncompressed data set. The number in italics indicates the difference is not statistically significant at the p = 0.05 level. The fourth column is the time for bit reduction plus BRSVM training time. The time required to do example aggregation is included in this training time. The fifth column is the prediction time on the test set. All of the timing results were recorded in seconds. The precision of the timing measurement was 0.01 seconds. The training and prediction speedup ratios are defined as (SVM training time) / (BRSVM training time) and (SVM prediction time) / (BRSVM prediction time) respectively. In the last column, the average accuracy of random sampling on the test set is listed for comparison. The subsampling ratio is set equal to the compression ratio of BRSVM. Since random sampling has a random factor, we ran it 50 times for each subsampling ratio and recorded the average statistics. This accuracy is listed in the last column of Tables 5.4–5.12, titled subsampling accuracy.

Table 5.4. BRSVM on the banana data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM.
  bit         compression   BRSVM      BRSVM           BRSVM             subsampling
  reduction   ratio         accuracy   training time   prediction time   accuracy
  0-1         1.000         0.902      2.59s           0.33s             0.902
  2           0.996         0.902      2.59s           0.33s             0.902
  3           0.987         0.902      2.59s           0.33s             0.902
  4           0.957         0.902      2.45s           0.31s             0.902
  5           0.842         0.902      1.99s           0.29s             0.902
  6           0.572         0.902      0.98s           0.23s             0.901
  7           0.245         0.903      0.21s           0.12s             0.895
  8           0.077         0.900      0.03s           0.05s             0.890
  9           0.024         0.890      0.02s           0.01s             0.865
  10          0.007         0.740      0.01s           0.01s             0.687
  SVM         1.000         0.902      2.58s           0.33s

The experimental results on the banana data set are shown in Table 5.4. As more bits are reduced, fewer examples are used in training; thus training time is reduced. Also, less training data results in a classifier with fewer support vectors. The prediction time is proportional to the total number of support vectors. Therefore, the prediction time of BRSVM is reduced accordingly. When 9 bits are reduced, BRSVM runs 129 times faster during training and 33 times faster during prediction than a normal SVM. Its accuracy is not statistically significantly different from a SVM built on all the data at the p = 0.05 level. BRSVM is as or more accurate than a SVM with random sampling up to 10-bit reduction.

Table 5.5. BRSVM on the phoneme data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM.

  bit         compression   BRSVM      BRSVM           BRSVM             subsampling
  reduction   ratio         accuracy   training time   prediction time   accuracy
  0           0.992         0.895      18.61s          1.03s             0.895
  1           0.984         0.895      18.59s          1.03s             0.895
  2           0.978         0.895      17.01s          1.02s             0.894
  3           0.973         0.895      15.60s          1.02s             0.894
  4           0.968         0.895      15.59s          1.03s             0.893
  5           0.963         0.895      15.59s          1.02s             0.892
  6           0.950         0.895      15.43s          1.02s             0.892
  7           0.891         0.895      14.21s          0.97s             0.890
  8           0.679         0.893      9.28s           0.83s             0.873
  9           0.303         0.846      2.01s           0.41s             0.824
  10          0.059         0.752      0.09s           0.09s             0.730
  SVM         1.000         0.895      17.51s          1.03s

Phoneme is another relatively low-dimensional data set with five attributes. Table 5.5 presents the experimental results of BRSVM on this data set.
When 8 bits are reduced, BRSVM runs 1.9 times faster during training and 1.2 times faster during prediction than a normal SVM. Its accuracy is not statistically significantly different from a SVM built on all the data at the p = 0.05 level. BRSVM is as or more accurate than random sampling when the compression ratio is larger than 0.059. Table 5.6 and Table 5.7 present the BRSVM experiments on the shuttle and the page data sets, which have 9 attributes and 10 attributes respectively. After 10-bit reduction, BRSVM achieves a 245.2 times speedup in training and a 2.4 times speedup in prediction with a loss of 1.2% accuracy on the shuttle data set. BRSVM is as or more accurate than random sampling when the compression ratio is greater than 0.006. On the page data set, the speedup ratios for BRSVM are 7.9 in training and 1.8 in prediction when the compression ratio is 0.187 after 9-bit reduction. It should be noted that BRSVM is 0.5% more accurate than a SVM. Because bit reduction eliminates noisy examples and has a regularization effect, it may improve the classification accuracy.

Table 5.6. BRSVM on the shuttle data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM.

  bit         compression   BRSVM      BRSVM           BRSVM             subsampling
  reduction   ratio         accuracy   training time   prediction time   accuracy
  0-1         1.000         0.999      22.57s          1.85s             0.999
  2           0.994         0.999      22.00s          1.83s             0.999
  3           0.904         0.999      19.03s          1.73s             0.999
  4           0.717         0.999      12.43s          1.59s             0.999
  5           0.484         0.999      8.07s           1.55s             0.999
  6           0.258         0.999      3.72s           1.37s             0.998
  7           0.108         0.999      1.40s           1.26s             0.998
  8           0.031         0.999      0.44s           1.08s             0.996
  9           0.014         0.996      0.19s           0.94s             0.994
  10          0.006         0.987      0.09s           0.78s             0.986
  11          0.003         0.835      0.06s           0.68s             0.982
  SVM         1.000         0.999      22.07s          1.85s

Table 5.7. BRSVM on the page data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM.
  bit         compression   BRSVM      BRSVM           BRSVM             subsampling
  reduction   ratio         accuracy   training time   prediction time   accuracy
  0-3         0.988         0.970      2.89s           0.16s             0.970
  4           0.987         0.970      2.89s           0.16s             0.970
  5           0.985         0.970      2.89s           0.16s             0.970
  6           0.971         0.970      2.75s           0.16s             0.970
  7           0.833         0.970      2.26s           0.14s             0.970
  8           0.465         0.974      1.37s           0.12s             0.969
  9           0.187         0.975      0.37s           0.09s             0.964
  10          0.073         0.723      0.06s           0.05s             0.952
  11          0.028         0.579      0.01s           0.04s             0.930
  SVM         1.000         0.970      2.92s           0.16s

Table 5.8. BRSVM on the pendigit data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM.

  bit         compression   BRSVM      BRSVM           BRSVM             subsampling
  reduction   ratio         accuracy   training time   prediction time   accuracy
  0-7         1.000         0.981      3.04s           2.23s             0.981
  8           0.999         0.981      3.03s           2.23s             0.981
  9           0.931         0.981      2.81s           2.17s             0.981
  10          0.400         0.977      0.99s           1.59s             0.977
  11          0.013         0.864      0.02s           0.12s             0.878
  SVM         1.000         0.981      3.02s           2.23s

Table 5.9. BRSVM on the letter data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM.

  bit         compression   BRSVM      BRSVM           BRSVM             subsampling
  reduction   ratio         accuracy   training time   prediction time   accuracy
  0-7         0.941         0.975      54.08s          25.81s            0.974
  8           0.937         0.975      53.98s          25.78s            0.974
  9           0.887         0.975      50.68s          25.39s            0.973
  10          0.490         0.966      21.89s          17.74s            0.956
  11          0.059         0.779      0.72s           2.60s             0.807
  SVM         1.000         0.975      57.73s          26.25s

Table 5.10. BRSVM on the plankton data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM.

  bit         compression   BRSVM      BRSVM           BRSVM             subsampling
  reduction   ratio         accuracy   training time   prediction time   accuracy
  0-8         0.995         0.889      24.02s          2.42s             0.886
  9           0.962         0.887      23.14s          2.31s             0.884
  10          0.362         0.829      2.79s           0.74s             0.854
  11          0.070         0.695      0.09s           0.12s             0.771
  SVM         1.000         0.887      24.23s          2.42s

Table 5.11. BRSVM on the waveform data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM.
  bit         compression   BRSVM      BRSVM           BRSVM             subsampling
  reduction   ratio         accuracy   training time   prediction time   accuracy
  0-9         1.000         0.859      3.86s           1.09s             0.859
  10          0.995         0.857      3.81s           1.09s             0.857
  11          0.151         0.845      0.11s           0.14s             0.844
  SVM         1.000         0.859      3.84s           1.09s

Table 5.12. BRSVM on the satimage data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM.

  bit         compression   BRSVM      BRSVM           BRSVM             subsampling
  reduction   ratio         accuracy   training time   prediction time   accuracy
  0-8         1.000         0.917      4.40s           2.62s             0.917
  9           0.990         0.917      4.27s           2.60s             0.916
  10          0.727         0.900      2.95s           2.05s             0.911
  11          0.126         0.734      0.23s           0.26s             0.871
  SVM         1.000         0.917      4.38s           2.62s

Tables 5.8–5.12 show the experimental results on the five relatively high dimensional data sets: pendigit, letter, plankton, waveform and satimage. BRSVM delivers a higher or equal accuracy compared to random sampling when the compression ratio is not very small. On the pendigit data set, BRSVM with 10-bit reduction results in a compression ratio of 0.400, which is 3.1 times faster in training and 1.4 times faster in prediction. It is also as accurate as random sampling. On the letter data set, BRSVM is a more accurate classifier than random sampling when the compression ratio is 0.490 after 10-bit reduction. When 10 bits are reduced, BRSVM has a speedup ratio of 2.6 for training and a speedup ratio of 1.5 for prediction with a loss of 0.9% in accuracy. Table 5.10 shows the experimental results on a relatively high dimensional data set, plankton. BRSVM is slightly more accurate than random sampling when the number of reduced bits is up to 9.
At the 10-bit reduction level, the compression ratio of BRSVM drops sharply from 0.962 to 0.362, resulting in a significant loss in accuracy. At the 11-bit reduction level, the accuracy of BRSVM is much lower than random sampling because of the low compression ratio. The same thing happens with the waveform data set. The compression ratio drops from 0.995 to 0.151 with an 11-bit reduction, and BRSVM loses 1.4% accuracy accordingly. The reason for this phenomenon is that when the compression ratio is small, it is very likely that many examples from different classes fall into the same bin and the number of examples is distributed far from uniformly among the bins. For instance, suppose bit reduction compresses the data into several bins and one bin has 80% of the examples, drawn from different classes. BRSVM uses the mean statistic as the representative for each class, which may not be able to capture the information about the decision boundary in this bin. Random sampling, on the other hand, selects the examples more uniformly. If 80% of the examples fall into one bin, random sampling will effectively sample four times more examples that reside in this bin than in all others together, and preserve the local information of the decision boundary much better than BRSVM. As a result, random sampling is likely to be as or more accurate than BRSVM when the compression ratio is very low. This tends to happen on high dimensional data sets. On the other hand, at a higher compression ratio, where examples from the same class fall into the same bin and the distributions of the number of examples in bins are not very skewed, BRSVM preserves the statistics of all examples while random sampling suffers from high sampling variance. Therefore, BRSVM is more accurate than random sampling when the compression ratio is relatively high.
On the highest dimensional data set, satimage, the speedup ratios of BRSVM are only 1.5 and 1.3 with a 1.7% accuracy loss at a compression ratio of 0.727 after a 10-bit reduction. Also, BRSVM is not as accurate as random sampling at this compression ratio. It should be noted that the compression ratios on some high-dimensional data sets (plankton and waveform) drop much faster than those on the previous four data sets. This phenomenon is caused by the "Curse of Dimensionality" [5]. The corresponding interpretation in our case is that the data in a high-dimensional space are sparse and far from each other. Bit reduction will either group very few data together or put too many data in the same bin. As a result, BRSVM on the high dimensional data (satimage) does not perform as well as on the relatively lower dimensional data sets.

5.4.2 Experiments with unbalanced bit reduction

We used a simple solution to get a better compression ratio: unbalanced bit reduction (UBR). UBR works by reducing a different number of bits for different attributes. For instance, if reduction of a bits results in very little compression while reduction of a+1 bits compresses the data too much, UBR randomly selects several attributes to reduce by a+1 bits while it applies an a-bit reduction to the rest of the attributes. In this way an intermediate compression ratio can be obtained. Since trying all attributes to get a desired compression ratio is time consuming, especially for high dimensional data sets, we use the following algorithm to choose the optimal number of attributes.

Unbalanced Bit Reduction
1: I_a and C_a are the data set and the compression ratio after reduction of a bits respectively. C_a is too large while C_{a+1} is too small. A = {a_1, a_2, ..., a_r} is the set of r attributes.
2: s = v = ⌊r/2⌋
3: if v = 0 then
4:   Stop.
5: end if
6: Randomly select s attributes from A and apply 1 more bit reduction on the s attributes; I_a is further compressed to I_{a,s} with compression ratio C_{a,s}.
7: if C_{a,s} > desired compression ratio range then
8:   v = ⌊v/2⌋, s = s + v, go to 3.
9: end if
10: if C_{a,s} < desired compression ratio range then
11:   v = ⌊v/2⌋, s = s − v, go to 3.
12: end if
13: Apply BRSVM on the reduced data set I_{a,s} with randomly selected s attributes 50 times and record the mean and the standard deviation of the compression ratio and the test accuracy over the 50 runs.

In this algorithm, a bits are reduced on all the attributes initially. The desired compression ratio is a range given by the user. Since one more bit reduction on all the attributes would compress the data too much, steps 2–12 determine the number of attributes s to be reduced by one more bit, which yields a compression ratio falling into the desired range. This algorithm can also be run in an interactive mode by asking the user to judge whether C_{a,s} is good enough at steps 7 and 10. Considering the random factor in selecting the s attributes, we run UBR 50 times and record the statistics in step 13. This provides more stable results because it experiments with BRSVM on compression ratios resulting from 1 more bit reduction on different combinations of s attributes.

We experimented with UBR on phoneme, pendigit, plankton, waveform and satimage, on which pure bit reduction did not result in ideal incremental compression ratios. Depending upon the application domain, a good compression ratio can be defined in different ways. In this chapter we define a good compression ratio as the minimum compression ratio with an accuracy within 1.2% of that obtained from a SVM trained on the uncompressed data set. In our UBR experiments, the unbalanced bit reduction algorithm was applied in interactive mode.
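Steps 2–12 above amount to a binary search over the number of attributes s. A minimal sketch of that search loop follows; the `ratio_of` callback and the function name are our own stand-ins, not the original implementation.

```python
def choose_s(num_attrs, ratio_of, lo, hi):
    """Binary-search the number of attributes s that should receive one
    extra bit of reduction so the compression ratio lands in [lo, hi].

    ratio_of(s) stands in for step 6: reduce a+1 bits on a random subset
    of s attributes (a bits on the rest) and return the resulting
    compression ratio C_{a,s}.  Returns the chosen s, or None when the
    search interval is exhausted (the Stop of steps 3-4).
    """
    s = v = num_attrs // 2                # step 2
    while v > 0:                          # step 3
        c = ratio_of(s)                   # step 6
        if lo <= c <= hi:                 # ratio in the desired range:
            return s                      # proceed to step 13 with this s
        v //= 2
        s = s + v if c > hi else s - v    # steps 7-8 / 10-11
    return None
```

For instance, with 10 attributes and a ratio that falls by 0.1 per extra reduced attribute, a target range of [0.35, 0.45] is reached at s = 6.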
Basically, the program asked one to decide whether C_{a,s} fell into the desired compression ratio range at steps 7 and 10. If the ratio was acceptable, the program proceeded to build SVMs on the reduced data set at step 13. We also ran random subsampling 50 times at the same compression ratio as UBR for comparison. We present the experimental results of UBR in Tables 5.13–5.17. The unbalanced bit reduction algorithm was applied to find an s which gave a good compression ratio. In the tables, the first column records s; the second column is the mean and the standard deviation (in parentheses) of the compression ratios from the 50 runs. The third column and the last column record the mean and the standard deviation of the accuracies over the 50 runs on the test set from BRSVM using UBR and from random sampling respectively. Assuming the accuracies of the 50 runs follow a normal distribution, we applied the t test to check whether the accuracy is statistically significantly different from the accuracy of a SVM built on the uncompressed data set. The number in italics indicates the difference is not statistically significant at the p = 0.05 level. The fourth and the fifth columns are the average training time and prediction time respectively. We will describe how the unbalanced bit reduction algorithm proceeds on the phoneme data set in detail while only presenting the experimental results from the other data sets. The pure bit reduction experiments on phoneme were recorded in Table 5.5. After 8-bit reduction, BRSVM gives a 0.679 compression ratio and a 1.9 times speedup in the training phase with a loss of 0.3% in accuracy. After 9-bit reduction, the compression ratio drops to 0.303 and the corresponding 4.9% accuracy loss could not be tolerated. Since we will accept up to a 1.2% accuracy loss, we applied UBR to search for a compression ratio between 0.679 and 0.303.
We hoped this could give more speedup than the 1.9 times from an 8-bit reduction. We first applied 8-bit reduction to the data and then used the unbalanced bit reduction algorithm to find an s which gives a good compression ratio. See Table 5.13 for the UBR results on the phoneme data set. Initially s = v = ⌊r/2⌋ = ⌊5/2⌋ = 2, where the number of attributes r is 5 on the phoneme data set. Since the compression ratio of 0.55 from 2 randomly selected attributes was very different from 0.679 (from a pure 8-bit reduction), we proceeded to step 13 to repeat the random selection 50 times with s = 2. We recorded the mean and standard deviation of the compression ratio and the test accuracy over the 50 runs. Using the average compression ratio from UBR as the sampling rate, we applied the random sampling method to build SVMs 50 times for comparison. Observing that the average accuracy of 0.888 on the test set from the 50 runs was only 0.7% less than 0.895, the accuracy from the uncompressed data, we ran the unbalanced bit reduction algorithm again to get an even lower compression ratio. At s = 3, the average accuracy is 1.5% less than 0.895. Therefore, we chose s = 2, which results in speedup ratios of 2.7 and 1.7 for the training and prediction phases, respectively. Also, BRSVM is 2.5% more accurate than random sampling at the given compression ratio.

Table 5.13. BRSVM of UBR after 8-bit reduction on the phoneme data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM. The number in parentheses is the standard deviation.
  # of attributes   compression     BRSVM            BRSVM           BRSVM             subsampling
  reduced (s)       ratio           accuracy         training time   prediction time   accuracy
  2                 0.550 (0.003)   0.888 (0.0027)   6.40s           0.67s             0.863 (0.0059)
  3                 0.467 (0.008)   0.880 (0.0049)   4.74s           0.59s             0.856 (0.0067)
  SVM               1.000           0.895            17.51s          1.03s

On the pendigit data set, Table 5.14 shows that after 10-bit reduction and s = 8 (UBR), we get speedup ratios of 12.1 and 3.0 in the training and prediction phases, respectively. The accuracy drops 1.1% compared to a SVM trained on the uncompressed data set.

Table 5.14. BRSVM of UBR after 10-bit reduction on the pendigit data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM. The number in parentheses is the standard deviation.

  # of attributes   compression     BRSVM            BRSVM           BRSVM             subsampling
  reduced (s)       ratio           accuracy         training time   prediction time   accuracy
  8                 0.146 (0.016)   0.970 (0.0035)   0.25s           0.74s             0.970 (0.0048)
  12                0.055 (0.008)   0.953 (0.0075)   0.08s           0.39s             0.954 (0.0116)
  SVM               1.000           0.981            3.02s           2.23s

From Table 5.15, we see UBR provides a compression ratio of 0.739 on the plankton data set at b = 9, s = 10. The corresponding training and prediction phases were 1.6 and 1.4 times faster respectively, with a 1.1% accuracy loss. BRSVM is just slightly more accurate than random sampling.

Table 5.15. BRSVM of UBR after 9-bit reduction on the plankton data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM. The number in parentheses is the standard deviation.
  # of attributes   compression     BRSVM            BRSVM           BRSVM             subsampling
  reduced (s)       ratio           accuracy         training time   prediction time   accuracy
  8                 0.814 (0.034)   0.881 (0.0040)   18.03s          1.94s             0.880 (0.0030)
  10                0.739 (0.039)   0.876 (0.0055)   15.03s          1.73s             0.875 (0.0039)
  12                0.638 (0.036)   0.866 (0.0059)   11.22s          1.50s             0.872 (0.0045)
  SVM               1.000           0.887            24.23s          2.42s

On the waveform data set, Table 5.16 indicates UBR has 9.8 and 4.2 times speedups at b = 10, s = 18 in the training and prediction phases respectively, with a 1.1% accuracy loss. We also experimented with UBR on the highest dimensional data set, satimage. When b = 9, s = 31, BRSVM speeds up the training and prediction phases by only 1.3 and 1.1 times with a 1.0% accuracy loss. Similar to the experiments on pure bit reduction, random sampling is more accurate than BRSVM with UBR on this data set.

Table 5.16. BRSVM of UBR after 10-bit reduction on the waveform data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM. The number in parentheses is the standard deviation.

  # of attributes   compression     BRSVM            BRSVM           BRSVM             subsampling
  reduced (s)       ratio           accuracy         training time   prediction time   accuracy
  15                0.469 (0.012)   0.853 (0.0059)   1.04s           0.52s             0.853 (0.0053)
  18                0.283 (0.007)   0.848 (0.0043)   0.39s           0.26s             0.847 (0.0058)
  SVM               1.000           0.859            3.84s           1.09s

Table 5.17. BRSVM of UBR after 9-bit reduction on the satimage data set. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM. The number in parentheses is the standard deviation.
  # of attributes   compression     BRSVM            BRSVM           BRSVM             subsampling
  reduced (s)       ratio           accuracy         training time   prediction time   accuracy
  27                0.885 (0.011)   0.913 (0.0025)   3.80s           2.45s             0.915 (0.0021)
  31                0.820 (0.011)   0.907 (0.0028)   3.41s           2.31s             0.913 (0.0028)
  33                0.785 (0.009)   0.903 (0.0028)   3.20s           2.22s             0.912 (0.0029)
  SVM               1.000           0.917            4.38s           2.62s

5.4.3 Summary and discussion

Table 5.18 summarizes the performance of BRSVM on all nine data sets. The second column is the optimal b and s resulting in a "good" compression ratio, at which BRSVM achieves significant speedup with an accuracy loss of less than 1.2%. The accuracy loss in the third column is defined as (accuracy of SVM − accuracy of BRSVM). The number in italics means the loss is not statistically significant. The speedups in the fourth and fifth columns are calculated as the speedup ratios in the previous experiments.

Table 5.18. Summary of BRSVM on all nine data sets. The accuracy in italics means it is not statistically significantly different from the accuracy of a SVM.

  Data set   Optimal b and s   Accuracy loss (BRSVM)   Speedup in training   Speedup in prediction
  banana     b=9,  s=0         1.2%                    129.0                 33.0
  phoneme    b=8,  s=2         0.7%                    2.7                   1.7
  shuttle    b=10, s=0         1.2%                    245.2                 2.4
  page       b=9,  s=0         −0.5%                   7.9                   1.8
  pendigit   b=10, s=8         1.1%                    12.1                  3.0
  letter     b=10, s=0         0.9%                    2.6                   1.5
  plankton   b=9,  s=10        1.1%                    1.6                   1.4
  waveform   b=10, s=18        0.9%                    13.0                  4.0
  satimage   b=9,  s=31        1.0%                    1.3                   1.1

BRSVM works well on the nine data sets. At a small accuracy loss (less than 1.5%), the training and prediction speedup ratios range from 1.3 and 1.1 on the data set with the highest dimension to 245.2 and 33.0 on the lower dimensional data sets. Although the accuracy loss is statistically significant on seven out of nine data sets, it is small (less than 1.2%) and potentially acceptable to save time on large data sets.
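The per-table significance markers in this chapter come from McNemar's test [25], which compares the disagreements of two classifiers on the same test set. A minimal standard-library sketch of the continuity-corrected version follows; the function name and interface are ours, not the dissertation's code.

```python
import math

def mcnemar(b, c):
    """Continuity-corrected McNemar statistic for two classifiers
    evaluated on the same test set.

    b: # of examples the first classifier got right and the second got wrong.
    c: # of examples the second classifier got right and the first got wrong.
    Returns (chi-square statistic, two-sided p-value) for 1 degree of
    freedom; the classifiers differ significantly when p < 0.05.
    """
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 df:
    # P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p
```

With b = 15 and c = 5, for example, the statistic is 4.05 and the p-value falls just under 0.05, so the difference would be marked significant.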
Pure bit reduction (s = 0) performs very well on the four data sets with up to 10 attributes: banana, phoneme, shuttle and page. It achieves up to a 245.2 times speedup in training and up to a 33.0 times speedup in prediction without much loss in accuracy on the four data sets. On one relatively high-dimensional data set, letter, BRSVM with pure bit reduction is 2.6 times faster in training and 1.5 times faster in prediction with a 0.9% loss in accuracy. BRSVM with pure bit reduction is more accurate than random sampling on five data sets. On the pendigit, plankton and waveform data sets with relatively high dimensional data, pure bit reduction fails to provide a very good compression ratio, hence making BRSVM not as effective as random sampling. The justification is as follows: a high compression ratio results in minimal speedup while a too-low compression ratio makes BRSVM less accurate. The best bit reduction and compression ratio vary across data sets. In our experiments, a high compression ratio is good for low-dimensional data sets while an intermediate compression ratio is desired for high-dimensional data sets. For instance, a 49% compression ratio is very good for BRSVM on the letter data set. As pure bit reduction fails to provide a compression ratio between 0.362 and 0.962 on the plankton data set, BRSVM is not as effective as random sampling there. When unbalanced bit reduction was introduced for these data sets, BRSVM obtained intermediate compression ratios, which result in better accuracies than random sampling and significant speedups. On the phoneme data set, UBR was used to search for a smaller compression ratio, which enables more speedup. As a result, at a good compression ratio, BRSVM provides fast training and prediction and is more accurate than random sampling on eight data sets.
On the highest dimensional data set, satimage, BRSVM is not as accurate as random sampling: at the optimal b = 9 and s = 31, the compression ratio of BRSVM is 0.885 and its corresponding accuracy is 90.7%, which is 0.6% less than that of random sampling. Although random sampling has higher variance in theory, it works fairly well in our experiments except for banana and phoneme, where random sampling is more than 2% less accurate than BRSVM. It performs only slightly worse than BRSVM on six out of nine data sets. This phenomenon was also observed in [65][85], where complicated data squashing strategies brought small or no advantages over random sampling. On satimage, the highest dimensional data set, random subsampling is slightly more accurate than BRSVM. Moreover, when a very low compression ratio (heavy compression) is needed for very fast training, random sampling outperforms BRSVM, especially on high dimensional data sets. In SVMs, only support vectors determine the decision function. Grouping support vectors together in the aggregation step may result in a less accurate classifier. To improve the accuracy, one could check all the original data using the BRSVM built on the weighted examples. Then those support vector candidates could be pulled out from the weighted examples. A SVM retrained on the support vector candidates plus the rest of the weighted examples is likely to be more accurate than BRSVM. A support vector candidate can be judged by checking if it falls into the margin of a SVM. The following algorithm describes this procedure.
Retraining BRSVM
1: brSVM is the classifier trained by BRSVM; R = {(r_1, w_1), (r_2, w_2), ..., (r_n, w_n)} is the reduced, weighted data after bit reduction, where r_i is the i-th exemplar and w_i is its weight; X = (x_1, x_2, ..., x_m) is the full data set without bit reduction; S = ∅
2: for i = 1 to m do
3:   Use brSVM to predict x_i
4:   if x_i falls within the SVM margin and x_i is grouped together with other examples in (r_k, w_k) then
5:     S ← S + (x_i, 1), w_k ← w_k − 1, recalculate r_k
6:   end if
7: end for
8: Retrain a SVM on S + R

Retraining may improve the classification accuracy; however, it also slows down the whole training phase because of the additional retraining time and the overhead of prediction and modification of R and S in steps 3–5. Table 5.19 records the experimental results using BRSVM with retraining on the nine data sets. The second column is the mean accuracy of a SVM using the uncompressed data from cross validation. The third column is the mean accuracy of BRSVM after retraining. The fourth column is the speedup ratio after retraining; it includes the time for BRSVM training, retraining, and the time to select support vector candidates and update S and R in steps 3–5. The last column is the prediction speedup of the classifier built with retraining. We also list the BRSVM statistics without retraining in parentheses for comparison.

Table 5.19. BRSVM after retraining on the nine data sets.
Data set    SVM accuracy   BRSVM+retraining accuracy   Speedup in training   Speedup in prediction
banana      0.902          0.890 (0.890)               6.2 (129.0)           2.4 (33.0)
phoneme     0.895          0.895 (0.888)               0.7 (2.7)             1.1 (1.7)
shuttle     0.999          0.999 (0.987)               1.0 (245.2)           1.1 (2.4)
page        0.970          0.972 (0.975)               0.8 (7.9)             1.1 (1.8)
pendigit    0.981          0.981 (0.970)               0.8 (12.1)            1.0 (3.0)
letter      0.975          0.975 (0.966)               0.4 (2.6)             1.1 (1.5)
plankton    0.887          0.887 (0.876)               0.5 (1.6)             1.0 (1.4)
waveform    0.859          0.858 (0.848)               1.0 (13.0)            1.3 (4.0)
satimage    0.917          0.917 (0.907)               0.3 (1.3)             1.0 (1.1)
(Values in parentheses are for BRSVM without retraining.)

On seven out of nine data sets, retraining improved the accuracy at the cost of more training time. The total training time increased significantly after retraining due to the second round of training and the overhead. On most data sets, the training speedup ratio is less than 1, which indicates that the total retraining time is longer than training a SVM on the uncompressed data set. Therefore, retraining is not very helpful in BRSVM because it reduces the speedup ratio significantly.

5.5 Conclusion

In this chapter, a bit reduction SVM is proposed to speed up SVMs' training and prediction. BRSVM groups similar examples together by reducing their resolution. Such a simple method reduces the training time and the prediction time of a SVM significantly in our experiments when bit reduction can compress the data well. It is more accurate than random sampling when the data set is not overcompressed. BRSVM tends to work better with relatively lower dimensional data sets, on which it is more accurate than random
sampling and also shows more significant speedups. Therefore, feature selection methods might be used to reduce the data dimensionality and potentially help BRSVM obtain further speedups. It should be noted that no feature reduction has been done on most of the data sets used in our experiments.

We can also conclude that if a very high speedup is desired, in which case a high compression ratio is required, random sampling may be a better choice. This tends to happen with high dimensional data.

CHAPTER 6
CONCLUSIONS AND FUTURE RESEARCH

6.1 Conclusions

We review and conclude this dissertation as follows.

1. A set of robust image features that do not require a precise contour was discussed in Chapter 3. These features, including moment invariants and granulometric features, are global and do not provide information about any local areas in an image. They have been used to extract information from 1-bit SIPPER I images, which do not have clear contours and lack texture information. Together with domain-specific features, our feature set provides a good recognition rate using a SVM as the classifier. The success of applying "imprecise", global features to the rather difficult SIPPER I images indicates the potential application of robust features to low-quality images. In our experiments, a single SVM using the one-vs-one approach was more accurate than a decision tree, a cascade correlation neural network, and ensembles of decision trees. The superior performance of a SVM validates its low generalization error bound from statistical learning theory. To select an optimal subset of features for a SVM, a wrapper approach with a best-first search and a beam search halved the total number of features needed, with a slight improvement in accuracy. This shows our feature selection method works well with moderate feature set sizes.

2. In Chapter 4, "Breaking Ties" (BT), an active learning method, was proposed to reduce the human effort in labeling large numbers of plankton images. Using a probabilistic interpretation of a multi-class SVM's outputs, "BT" selectively labels examples which were closest in probability between the most likely class and second most likely
class. Such a simple strategy required significantly fewer labeled images to reach a given accuracy than a least-certainty active learning method and random sampling. "BT" also worked effectively in batch mode, with an accuracy comparable to labeling one image at a time.

3. Chapter 5 presents the bit reduction SVM (BRSVM), which speeds up a SVM's training and prediction. Since many methods have been well developed for quickly solving the QP problem involved in a SVM, we think there is little room for further improvement in this direction. Data squashing is another potential way of scaling up SVMs by compressing a large data set into a small one. Most previous work on data squashing with SVMs applied a clustering algorithm to compress the data. However, clustering massive data is also slow. BRSVM uses a very simple strategy to compress data: bit reduction. Bit reduction reduces the resolution of the data and can be taken as a special case of a clustering algorithm. However, bit reduction is simple and very fast. Assuming m is the total number of training examples, an efficient implementation of bit reduction takes only 2m computations to perform the data partition and aggregation, while most clustering algorithms need O(m^2). Experimental results indicate that BRSVM achieved fast training and prediction for a SVM. When the compression ratio is not very small, BRSVM could speed up a SVM's training and prediction significantly and was more accurate than random sampling. It has been observed that on relatively high-dimensional data sets, a naive implementation of bit reduction sometimes did not provide a good compression ratio. A simple unbalanced bit reduction method helps relieve this situation.

6.2 Contributions

This dissertation addresses several challenges presented in a real-world plankton recognition problem and focuses on the scalability issue of applying a support vector machine to large-scale data sets.
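As a concrete illustration of the "Breaking Ties" selection rule summarized in item 2 above, the query step can be sketched as follows. This is a minimal sketch, not the dissertation's code: it assumes scikit-learn's SVC with Platt-style probability outputs, and the toy data and parameter values are invented for the example.

```python
# Illustrative sketch of "Breaking Ties": query the unlabeled examples
# whose two largest class probabilities are closest together.
import numpy as np
from sklearn.svm import SVC

def breaking_ties(clf, X_unlabeled, batch_size=1):
    """Return indices of the examples with the smallest gap between the
    most likely and second most likely class probabilities."""
    probs = clf.predict_proba(X_unlabeled)     # shape (n, n_classes)
    top2 = np.sort(probs, axis=1)[:, -2:]      # two largest per row
    margin = top2[:, 1] - top2[:, 0]           # most likely minus runner-up
    return np.argsort(margin)[:batch_size]     # smallest gaps first

# Toy three-class usage with a small labeled seed set.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X_lab = np.vstack([rng.normal(c, 1.0, (10, 2)) for c in centers])
y_lab = np.repeat([0, 1, 2], 10)
X_unl = np.vstack([rng.normal(c, 1.0, (50, 2)) for c in centers])

clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X_lab, y_lab)
query = breaking_ties(clf, X_unl, batch_size=5)  # indices to send to the expert
```

Labeling the queried examples, retraining, and repeating gives the active learning loop; `batch_size` controls how many images are labeled per round.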
Its contributions are summarized as follows:

1. We designed an automated plankton recognition system to recognize underwater zooplankton from SIPPER, which uses high-speed, line-scanned laser imaging sensors. Robust features were computed for the challenging SIPPER I images. A support vector machine was applied to classify those feature vectors. This system achieved approximately 90% accuracy on a collection of SIPPER I images. On another, larger image set containing manually unidentifiable particles, it also provided 75.6% overall accuracy. Our feature selection method reduced the number of features from 29 to 15 with slightly higher accuracy than using all the features.

2. We proposed "Breaking Ties", an active learning method for multi-class SVMs, to reduce the human effort in manually labeling a large amount of data. "Breaking Ties" uses a "smart" selective sampling strategy to label data. When applied to recognizing SIPPER II images, it required significantly fewer labeled images to reach a given accuracy than a least-certainty active learning method and random sampling. It ran effectively in batch mode, labeling up to 20 images at a time with an accuracy comparable to labeling one image at a time and retraining.

3. We developed a bit reduction SVM (BRSVM) to speed up a SVM's training and prediction. Compared to previous work in this direction, bit reduction is much simpler and faster. Also, we are among the few, if there are any others, who have experimented with data squashing on SVMs using a relatively complicated kernel (the RBF kernel). In our experiments, BRSVM performed very well on the nine data sets. It achieved up to a 245.2 times speedup in training and a 33.0 times speedup in prediction without much loss in accuracy on the respective data sets. The experiments indicate that as long as bit reduction compresses the data at a reasonable ratio, BRSVM outperforms random sampling.
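The bit reduction step described in contribution 3 above can be sketched as follows. This is a minimal, illustrative sketch under simplifying assumptions (min-max normalization, one pass over the data, NumPy), not the dissertation's implementation.

```python
# Sketch of bit reduction: quantize each normalized feature to b bits and
# pool the examples sharing a quantized key into one weighted exemplar
# (the bin mean, weighted by the bin count).
from collections import defaultdict
import numpy as np

def bit_reduce(X, b):
    lo, hi = X.min(axis=0), X.max(axis=0)
    scaled = (X - lo) / np.where(hi > lo, hi - lo, 1.0)           # into [0, 1]
    keys = np.minimum((scaled * (1 << b)).astype(int), (1 << b) - 1)
    bins = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):                    # one pass: partition
        bins[key].append(i)
    exemplars = np.array([X[idx].mean(axis=0) for idx in bins.values()])
    weights = np.array([len(idx) for idx in bins.values()], dtype=float)
    return exemplars, weights

X = np.random.default_rng(1).normal(size=(1000, 2))
R, w = bit_reduce(X, b=3)   # at most 2^3 * 2^3 = 64 weighted exemplars in 2-D
```

The weighted exemplars could then be handed to a weighted SVM trainer, for example scikit-learn's `SVC.fit(R, y_R, sample_weight=w)`, after each exemplar is assigned a label (for instance by reducing each class separately); those details are omitted to keep the sketch short.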
6.3 Future research

Having tackled several scalability issues of SVMs on large-scale data sets and in plankton recognition, it is important to explore future research directions to improve the current methods. We present some of them in the following.

1. When we dealt with SIPPER I images, there were many manually unidentifiable particles. It is hard to develop a feature set for them because their shapes and appearances vary a lot. The accuracy of a SVM dropped from 90% to 75.6% due to the inclusion of those particles. The higher-resolution SIPPER II images alleviated this problem, as many blurred images became much clearer at the higher resolution. However, this issue still remains because marine scientists only provided plankton images from a limited number of classes and other types of plankton were left unclassified. Consequently, a SVM needs to label images which do not belong to any type of plankton in the training data. A SVM is a discriminative learning algorithm, that is, it differentiates between given classes. Hence it cannot tell us whether a current image belongs to a new type of plankton. Generative learning methods such as a Bayesian network specify a probability model for the feature data, and thus are capable of indicating how likely a new example is to belong to a class. However, specifying a probability model is a strong assumption and it often leads to performance inferior to that of a SVM [28][42][60]. An interesting research question would be: can we build a learning algorithm which provides a SVM's classification accuracy and also detects new types of data?

2. In active learning, the kernel parameters of a SVM are usually predetermined and kept fixed as more data are labeled. However, such kernel parameters are no longer optimal. Unless we can afford a held-out, labeled data set, it is hard to tune the kernel parameters online.
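When a held-out labeled set is affordable, the tuning just mentioned could look like the following sketch. This is hypothetical code, not from the dissertation: the candidate gamma grid, data, and split are invented, and the point is only that the evaluation set is fixed and independent of the actively sampled pool.

```python
# Sketch of held-out kernel tuning during active learning: after each
# labeling round, candidate RBF widths are re-scored on a fixed,
# independently labeled held-out set (cross-validation on actively
# sampled data would be biased).
import numpy as np
from sklearn.svm import SVC

def tune_gamma(X_lab, y_lab, X_held, y_held, gammas=(0.01, 0.1, 1.0, 10.0)):
    """Return the candidate gamma with the best held-out accuracy."""
    scores = [SVC(kernel="rbf", gamma=g).fit(X_lab, y_lab).score(X_held, y_held)
              for g in gammas]
    return gammas[int(np.argmax(scores))]

# Toy usage: a labeled pool (as if gathered by active learning) plus a
# held-out evaluation set.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(4.0, 1.0, (60, 2))])
y = np.repeat([0, 1], 60)
perm = rng.permutation(120)
X, y = X[perm], y[perm]
best_gamma = tune_gamma(X[:60], y[:60], X[60:], y[60:])
```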
The key reason is that we do not have a good method to evaluate different kernel parameters while active learning proceeds. The standard methods like cross-validation and leave-one-out tend to fail because active learning brings in biased data samples. Such failures were observed and discussed in [2]. An important future direction is to find a good online performance evaluation and kernel tuning method for active learning. Also, as active learning in batch mode is often convenient and fast, active learning for multi-class SVMs in batch mode needs to be further explored. Recent work in batch-mode active learning for two-class problems can be found in [14].

3. BRSVM does not perform as well as random sampling on the highest dimensional data set (satimage) because the distribution of the number of examples across bins tends to be skewed on high dimensional data sets. For those data sets, BRSVM and random sampling have the potential to be used together. Instead of using one weighted exemplar for each bin, one can randomly sample several examples from each bin at a ratio proportional to the number of examples in the bin. Then several weighted exemplars would be used to represent the examples in the bin. This combination method can help when the example distribution is skewed across the bins, and has the potential to improve BRSVM on high dimensional data sets. The combination method is an interesting future research direction.

REFERENCES

[1] D. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms. Machine Learning, 6(1):37-66, 1991.
[2] Y. Baram, R. Yaniv, and K. Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, pages 255-291, 2004.
[3] P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning, pages 43-54, 1999.
[4] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: bagging, boosting and variants. Machine Learning, 1999.
[5] R. E. Bellman. Adaptive Control Processes. Princeton University Press, 1961.
[6] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, 1984.
[7] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer, 2001.
[8] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
[9] D. Boley and D. Cao. Training support vector machines using adaptive clustering. In SIAM International Conference on Data Mining, 2004.
[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2003.
[11] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[12] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[13] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
[14] K. Brinker. Incorporating diversity in active learning with support vector machines. In Proceedings of the Twentieth International Conference on Machine Learning, pages 59-66, 2003.
[15] C. J. C. Burges. Simplified support vector decision rules. In International Conference on Machine Learning, pages 71-77, 1996.
[16] C. J. C. Burges and B. Scholkopf. Improving the accuracy and speed of support vector machines. In Advances in Neural Information Processing Systems, volume 9, pages 375-381, 1997.
[17] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In Proceedings of the 17th International Conference on Machine Learning, pages 111-118, 2000.
[18] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, volume 13, pages 409-415, 2000.
[19] C. Chang and C. Lin. LIBSVM: a library for support vector machines (version 2.3), 2001.
[20] A. Ciobanu and H. du Buf. Automatic Diatom Identification, chapter Identification by contour profiling and Legendre polynomials, pages 167-186. World Scientific, 2002.
[21] W. G. Cochran. Sampling Techniques. John Wiley and Sons, Inc., 3rd edition, 1977.
[22] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2nd edition, 2001.
[23] L. F. Costa and R. M. C. Jr. Shape Analysis and Classification. CRC Press LLC, 2001.
[24] T. G. Dietterich. Machine learning research: four current directions. AI Magazine, 18(4):97-136, 1997.
[25] T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1924, 1998.
[26] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization. Machine Learning, 40(2):139-157, 2000.
[27] J. X. Dong and A. Krzyzak. A fast SVM training algorithm. International Journal of Pattern Recognition and Artificial Intelligence, 17(3):367-384, 2003.
[28] S. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems Magazine, 13(4):21-23, 1998.
[29] W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon. Squashing flat files flatter. Data Mining and Knowledge Discovery, pages 6-15, 1999.
[30] J. C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3:32-57, 1973.
[31] ELENA. ftp://ftp.dice.ucl.ac.be/pub/neural-nets/elena/database.
[32] S. Eschrich, J. Ke, L. Hall, and D. Goldgof. Fast accurate fuzzy clustering through data reduction. IEEE Transactions on Fuzzy Systems, 11(2):262-270, 2003.
[33] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems, volume 2, pages 524-532, 1990.
[34] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243-264, 2001.
[35] S. Fischer and H. Bunke. Automatic Diatom Identification, chapter Identification using classical and new features in combination with decision tree ensembles, pages 109-140. World Scientific, 2002.
[36] M. G. Genton. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2:299-312, 2001.
[37] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10:1455-1480, 1998.
[38] D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. The MIT Press, 2001.
[39] T. Hastie and R. Tibshirani. Classification by pairwise coupling. In Advances in Neural Information Processing Systems, volume 10, 1998.
[40] C. W. Hsu and C. J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415-425, 2002.
[41] M. K. Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, IT-8:179-187, 1962.
[42] T. Jebara. Machine Learning: Discriminative and Generative. Kluwer Academic Publishers, 2004.
[43] T. Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Machines, pages 169-184, 1999.
[44] J. Ke. Fast accurate fuzzy clustering through reduced precision. Master's thesis, University of South Florida, 1999.
[45] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt's SMO algorithm for SVM design. Neural Computation, 13:637-649, 2001.
[46] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33:82-95, 1971.
[47] J. Kohavi. Wrappers for feature subset selection. Artificial Intelligence, special issue on relevance, 97(1-2):273-324, 1997.
[48] K. Kramer. Personal communication, 2004.
[49] K. Kramer. Identifying plankton from grayscale silhouette images. Master's thesis, University of South Florida, 2005.
[50] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 3-12, 1994.
[51] R. E. Loke and H. du Buf. Automatic Diatom Identification, chapter Identification by curvature of convex and concave segments, pages 141-166. World Scientific, 2002.
[52] T. Luo, K. Kramer, D. Goldgof, L. Hall, S. Samson, A. Remsen, and T. Hopkins. Active learning to recognize multiple types of plankton. In 17th Conference of the International Association for Pattern Recognition, volume 3, pages 478-481, 2004.
[53] T. Luo, K. Kramer, D. Goldgof, L. Hall, S. Samson, A. Remsen, and T. Hopkins. Recognizing plankton images from the shadow image particle profiling evaluation recorder. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34(4):1753-1762, August 2004.
[54] D. Madigan, N. Raghavan, W. DuMouchel, M. Nason, C. Posse, and G. Ridgeway. Likelihood-based data squashing: a modeling approach to instance construction. Data Mining and Knowledge Discovery, 6(2):173-190, 2002.
[55] G. Matheron. Random Sets and Integral Geometry. John Wiley and Sons, New York, 1975.
[56] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1999.
[57] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification, 1994.
[58] P. Mitra, C. A. Murthy, and S. K. Pal. A probabilistic active support vector learning algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):413-418, 2004.
[59] P. Mitra, B. U. Shankar, and S. K. Pal. Segmentation of multispectral remote sensing images using active support vector machines. Pattern Recognition Letters, 25(9):1067-1074, 2004.
[60] K. Muller, A. Smola, G. Ratsch, B. Scholkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Seventh International Conference on Artificial Neural Networks, pages 999-1004, 1997.
[61] H. T. Nguyen and A. Smeulders. Active learning using pre-clustering. In Twenty-first International Conference on Machine Learning, 2004.
[62] T. Onoda, H. Murata, and S. Yamada. Relevance feedback with active learning for document retrieval. In Proceedings of the International Joint Conference on Neural Networks 2003, volume 3, pages 1757-1762, 2003.
[63] E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. In Proceedings of IEEE Neural Networks in Signal Processing, 1997.
[64] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines. In Advances in Kernel Methods: Support Vector Machines, 1999.
[65] A. Owen. Data squashing by empirical likelihood. Data Mining and Knowledge Discovery, pages 101-113, 2003.
[66] J. M. Park. Convergence and application of online active sampling using orthogonal pillar vectors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1197-1207, 2004.
[67] D. Pavlov, D. Chudova, and P. Smyth. Towards scalable support vector machines using squashing. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 295-299, 2000.
[68] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1997.
[69] I. Pitas. Digital Image Processing Algorithms and Applications. John Wiley and Sons, Inc., 2000.
[70] J. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning, pages 185-208. The MIT Press, 1999.
[71] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61-74, 2000.
[72] D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer, 1999.
[73] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[74] G. Ratsch, T. Onoda, and K. Muller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001.
[75] A. Remsen, T. L. Hopkins, and S. Samson. What you see is not what you catch: a comparison of concurrently collected net, optical plankton counter, and shadowed image particle profiling evaluation recorder data from the northeast Gulf of Mexico. Deep Sea Research Part I: Oceanographic Research Papers, 51:129-151, 2004.
[76] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.
[77] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning, pages 441-448, 2001.
[78] S. Samson, T. Hopkins, A. Remsen, L. Langebrake, T. Sutton, and J. Patten. A system for high-resolution zooplankton imaging. IEEE Journal of Oceanic Engineering, pages 671-676, 2001.
[79] L. M. Santos and H. du Buf. Automatic Diatom Identification, chapter Identification by Gabor features, pages 187-220. World Scientific, 2002.
[80] M. Sassano. An empirical study of active learning with support vector machines for Japanese word segmentation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 505-512, 2002.
[81] G. Schohn and D. Cohn. Less is more: active learning with support vector machines. In Proceedings of the 17th International Conference on Machine Learning, pages 839-846, 2000.
[82] B. Scholkopf, S. Mika, C. Burges, P. Knirsch, K. R. Muller, G. Ratsch, and A. Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000-1017, 1999.
[83] B. Scholkopf and A. J. Smola. Learning with Kernels. The MIT Press, 2002.
[84] J. Shawe-Taylor and P. L. Bartlett. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.
[85] Y. C. L. Shih, J. D. M. Rennie, and D. R. Karger. Text bundling: statistics-based data reduction. In Proceedings of the Twentieth International Conference on Machine Learning, pages 696-703, 2003.
[86] M. Sonka, V. Hlavac, and R. Boyle. Image Processing, Analysis and Machine Vision. Thomson Engineering, 2nd edition, 1998.
[87] X. Tang, F. Lin, S. Samson, and A. Remsen. Feature extraction for binary plankton image classification. Accepted by IEEE Journal of Oceanic Engineering, 2004.
[88] X. Tang, W. K. Stewart, L. Vincent, H. Huang, M. Marra, S. M. Gallager, and C. S. Davis. Automatic plankton image recognition. Artificial Intelligence Review, 12(1-3):177-199, 1998.
[89] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proceedings of ICML-00, 17th International Conference on Machine Learning, pages 999-1006, 2000.
[90] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 2001.
[91] V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):283-305, 1991.
[92] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 2000.
[93] L. Wang, K. L. Chan, and Z. H. Zhang. Bootstrapping SVM active learning by incorporating unlabelled images for image retrieval. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 629-634, 2003.
[94] M. K. Warmuth, G. Ratsch, M. Mathieson, J. Liao, and C. Lemmen. Support vector machines for active learning in the drug discovery process. Journal of Chemical Information Sciences, 43(2):667-673, 2003.
[95] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Neural Information Processing Systems, pages 668-674, 2000.
[96] M. H. F. Wilkinson, A. C. Jalba, E. R. Urbach, and J. B. T. M. Roerdink. Automatic Diatom Identification, chapter Identification by mathematical morphology, pages 221-244. World Scientific, 2002.
[97] C. K. I. Williams and M. Seeger. Using the Nystrom method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682-688. The MIT Press, 2001.
[98] T. F. Wu, C. J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5:975-1005, 2004.
[99] H. Yu, J. Yang, and J. Han. Classifying large data sets using SVMs with hierarchical clusters. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315, 2003.
[100] C. Zahn and R. Z. Roskies. Fourier descriptors for plane closed curves. IEEE Transactions on Computers, C-21:269-281, 1972.
ABOUT THE AUTHOR

Tong Luo received the Bachelor of Engineering degree in Electrical Engineering from Tsinghua University in 1997 and the Master of Engineering degree in Control Theory and Systems from the Institute of Automation, Chinese Academy of Sciences, in 2000. He is currently a Ph.D. student in the Department of Computer Science and Engineering at the University of South Florida. His research interests are in machine learning, data mining, and pattern recognition.