Identifying plankton from grayscale silhouette images

Material Information

Title:
Identifying plankton from grayscale silhouette images
Physical Description:
Book
Language:
English
Creator:
Kramer, Kurt A
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla.
Publication Date:

Subjects

Subjects / Keywords:
SIPPER
Feature selection
Feature calculation
Active learning
Support vector machine
SVM
Multi-class
Dissertations, Academic -- Computer Science and Engineering -- Masters -- USF   ( lcsh )
Genre:
government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Summary:
ABSTRACT: Utilizing a continuous silhouette image of marine plankton produced by a device called SIPPER, developed by the Marine Sciences Department, individual plankton images were extracted, features were derived, and classification was performed. Plankton recognition experiments were performed in Support Vector Machine parameter tuning, Fourier descriptors, and feature selection. Several groups of features were implemented: moments, granulometric, Fourier transform for texture, intensity histograms, Fourier descriptors for contour, convex hull, and Eigen ratio. The Fourier descriptors were implemented in three different flavors: sampling, averaging, and hybrid (a mix of sampling and averaging). The feature selection experiments utilized a modified WRAPPER approach, of which several flavors were explored, including Best Case Next, Forward and Backward, and Beam Search. Feature selection significantly reduced the number of features required for processing while maintaining the same level of classification accuracy. This resulted in reduced processing time for training and classification.
Thesis:
Thesis (M.S.C.S.)--University of South Florida, 2005.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Kurt A. Kramer.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 112 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001709529
oclc - 68912940
usfldc doi - E14-SFE0001402
usfldc handle - e14.1402
System ID:
SFS0025722:00001


Full Text

Identifying Plankton from Grayscale Silhouette Images

by

Kurt A. Kramer

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Dmitry Goldgof, Ph.D.
Lawrence O. Hall, Ph.D.
Scott Samson, Ph.D.

Date of Approval: October 27, 2005

Keywords: SIPPER, feature selection, feature calculation, active learning, Support Vector Machine, SVM, Multi-Class

Copyright 2005, Kurt A. Kramer

DEDICATION

I would like to dedicate this thesis to my father, Gustav K. Kramer. He believed in working hard, always doing the right thing, and being honest with himself and others at all times. These are qualities that, when practiced, lead to a successful and fulfilling life, which no one else I know has achieved as well as my dad.

ACKNOWLEDGMENTS

I wish to acknowledge my major professor, Dr. Goldgof, for all his near infinite patience and hand-holding these past three years. I would also like to acknowledge Dr. Hall for his vast amount of knowledge that I was able to tap into, and all the other fine professors that I have had the privilege to learn from here at USF. I would like to make special mention of Tong Luo, a recent Ph.D. graduate of USF with whom I have been working for the past three years; he has been very helpful and tolerant of me, and I have learned a great deal from him. Let’s not forget the people in Marine Sciences, such as Dr. Scott Samson and Andrew Remson. And most importantly, let’s not forget the plankton that made my thesis possible.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF EQUATIONS
ABSTRACT

CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Scope of Work
1.3 SIPPER
1.4 Feature Calculations
1.5 Feature Selection
1.6 Previous Work
1.7 Organization

CHAPTER 2 SUPPORT VECTOR MACHINE
2.1 Introduction
2.2 Probability Model
2.3 Parameter Tuning
2.4 Parameter Search Specifics

CHAPTER 3 FEATURE EXTRACTION
3.1 Introduction
3.2 Moments
3.3 Area Based Features
3.4 Granulometric Based Features
3.5 Contour Features
3.5.1 Contour Sampling
3.5.2 Average Over Frequency Domains
3.5.3 Hybrid Contour
3.5.4 Contour Summary
3.6 Texture
3.7 Grayscale Histogram
3.8 Other Features
3.8.1 Weighted Size
3.8.2 Width vs. Length
3.9 Results

CHAPTER 4 FEATURE SELECTION
4.1 Feature Selection Background
4.2 Detailed Description of the Algorithm
4.3 Experiments
4.4 Analyzing the Results
4.4.1 Selection by Highest Accuracy
4.4.2 Selection by Feature Usage in Best 200
4.4.3 Selection by Accuracy Impact
4.5 Feature Selection Utilizing 5 Frequency Domain Contour Features
4.6 Feature Selection Utilizing 16 Hybrid Contour Features

CHAPTER 5 CONCLUSION
5.1 Parameter Tuning
5.2 Feature Calculation
5.3 Feature Selection

REFERENCES

APPENDICES
Appendix A System Design
Appendix B User Classified Data Sets
Appendix C The Fourier Transform
Appendix D Glossary

LIST OF TABLES

Table 2-1 Parameter Search Winner
Table 3-1 Summary of Features
Table 3-2 Common Variables for Feature Calculations
Table 3-3 Size Plus Seven Basic Moment Feature Equations
Table 3-4 Moment Features Result Summary
Table 3-5 Granulometric Operations on a Long-arm Cnidaria (Jelly Fish)
Table 3-6 Example of Granulometric Features on Copepod Oithona and Echino Plutei
Table 3-7 Contour Approaches
Table 3-8 Upper and Lower Contour Frequency Ranges
Table 3-9 Contour Variables and Functions
Table 3-10 Hybrid Contour Features
Table 3-11 Summary of Contour Features Cross Validation Results
Table 3-12 Lower and Upper Frequency Bounds for Texture Features
Table 3-13 Texture Features Variables and Functions
Table 3-14 Histogram Equations
Table 3-15 Intensity Regions
Table 3-16 Summary of Cross Validation Results
Table 3-17 Size Only (0)
Table 3-18 Moments Not Weighted (0-7)
Table 3-19 Moments Weighted (31-38)
Table 3-20 Edge Moment Features (8-15)
Table 3-21 Moments and Weighted Moments Combined (8-15, 31-38)
Table 3-22 Granulometric Features (18-24)
Table 3-23 Fourier Contour Features Using 100 Sample Points, Normalized (73-172)
Table 3-24 Contour Features, 100 Sample Points, Not Normalized (74-172)
Table 3-25 Contour Utilizing 5 Frequency Domain Features (44-48)
Table 3-26 Contour 16 Hybrid (57-72)
Table 3-27 All Features Using 100 Sample Contour Points Normalized
Table 3-28 All Features Using 5 Average Frequency Domain Features (0-56)
Table 3-29 All Features Using Hybrid Mixed (0-43, 49-72)
Table 3-30 5 Fourier Texture Features (39-43)
Table 3-31 Intensity Histogram (49-55)
Table 3-32 Intensity Histogram (49-50, 52-55) Less 3rd One
Table 4-1 Listing of Expansions that Occur in Figure 4-1 with Explanations
Table 4-2 Feature Search Variables
Table 4-3 Best Accuracies Found by Feature Count, 5 Freq. Domain Contour Features
Table 4-4 Feature Usage in Best 200, Using 5 Averaging Contour Features
Table 4-5 Feature Descriptions
Table 4-6 Performance by Occurrences in Top 200
Table 4-7 Feature Descriptions Including Hybrid Contour
Table B-1 Classes Selected for Testing
Table B-2 List of Plankton Classes Classified by User
Table C-1 Frequency Assignments on a Transform of a 20 Element Array

LIST OF FIGURES

Figure 1-1 SIPPER Mounted in the Sled
Figure 1-2 Diagram of Line Scan Camera Layout
Figure 2-1 Parameter Tuning, Grid Search for C and Gamma
Figure 2-2 Parameter Tuning, Grid Search for Alpha
Figure 2-3 Parameter Tuning, Select Best Results
Figure 2-4 Parameter Grid Search Accuracy Results
Figure 2-5 Parameter Grid Search, Time to Perform 5 fold Cross Validation
Figure 2-6 Parameter Grid Search Slice, C=12, Varying Gamma (γ)
Figure 2-7 Parameter Grid Search Slice, Gamma (γ)=0.01507, Varying C
Figure 3-1 Central Moment Component Examples
Figure 3-2 Cnidaria Thimble Before any Processing
Figure 3-3 Convex Hull of Image in Figure 3-2, Area = 66,498 Pixels
Figure 3-4 Fill Hole Operation on Image in Figure 3-2
Figure 3-5 Contour Fourier, Low Pass Examples
Figure 3-6 Contour Frequency Domain
Figure 3-7 2D Fourier Transform of Image, Frequency Ranges Indicated
Figure 3-8 Feature Values Calculated for Regions 4 and 5
Figure 4-1 Example of Best First Next in (BFB)
Figure 4-2 Feature Search Algorithm
Figure 4-3 Expansion Algorithms for Best First Out/In
Figure 4-4 Determine Optimal Feature Set
Figure 4-5 Select Feature Set by Frequency in Top 200
Figure 4-6 Feature Search, Accuracy, Timings by Features Count, Using 5 avg. Freq. Domain Contour Features
Figure 4-7 Feature Selection, Accuracy, Support Vectors vs. Feature Count, Using 5 avg. Freq. Domain Contour Features
Figure 4-8 Search, Best 200, Using 5 Freq. Domain
Figure 4-9 Accuracy Impact vs. Usage Top 200
Figure 4-10 Feature Search, Best 200, Using 16 Hybrid Contour Features
Figure 4-11 Feature Search, 16 Hybrid Contour, Accuracy by Number of Features
Figure A-1 System Flow Chart
Figure B-1 Sample Chaetognath
Figure B-2 Sample Cnidaria Smallbell Longarms
Figure B-3 Sample Copepod Oithona
Figure B-4 Sample Echino Plutei
Figure B-5 Sample Larvacean
Figure B-6 Sample Marine Snow Dark
Figure B-7 Sample Marine Snow Light
Figure B-8 Sample Protist
Figure B-9 Sample Trichodesmium
Figure C-1 Fourier Mask, Locations 1, 2, 3 of 0 through 19 Bucket
Figure C-2 Fourier Mask, Last 3 Locations of Real Component
Figure C-3 Fourier Mask, Imaginary Component of First Three Locations
Figure C-4 Fourier Mask, Last 3 Locations of Imaginary Component

LIST OF EQUATIONS

Equation 1 Compact Value
Equation 2 Black & White Centroid
Equation 3 Weighted Centroid
Equation 4 q p Central Order Moments
Equation 5 Central Moments Normalized for Size
Equation 6 Contour Feature Value
Equation 7 Texture Feature Value for Region r
Equation 8 Histogram Feature Value
Equation 9 Weighted Size
Equation 10 One Dimensional Fourier Transform
Equation 11 Euler’s Identity

IDENTIFYING PLANKTON FROM GRAYSCALE SILHOUETTE IMAGES

KURT A. KRAMER

ABSTRACT

Utilizing a continuous silhouette image of marine plankton produced by a device called SIPPER, developed by the Marine Sciences Department, individual plankton images were extracted, features were derived, and classification was performed. Plankton recognition experiments were performed in Support Vector Machine parameter tuning, Fourier descriptors, and feature selection. Several groups of features were implemented: moments, granulometric, Fourier transform for texture, intensity histograms, Fourier descriptors for contour, convex hull, and Eigen ratio. The Fourier descriptors were implemented in three different flavors: sampling, averaging, and hybrid (a mix of sampling and averaging). The feature selection experiments utilized a modified WRAPPER approach, of which several flavors were explored, including Best Case Next, Forward and Backward, and Beam Search. Feature selection significantly reduced the number of features required for processing while maintaining the same level of classification accuracy. This resulted in reduced processing time for training and classification.

CHAPTER 1 INTRODUCTION

1.1 Motivation

Monitoring the plankton content of the world’s oceans is currently a very manually intensive affair. By using image processing and machine learning techniques, the task of sampling the contents of the oceans for plankton can be reduced from months per sampling to near real time.

The world's oceans are full of plankton. Plankton are particles that float along with the current. They may have the ability to move, but for the most part they go where the current takes them. There are two primary types of plankton: phytoplankton and zooplankton. Phytoplankton consists of plants, while zooplankton consists of animals.

Plankton is very important to life on earth. Some plankton produce oxygen while others consume it. Phytoplankton absorb as much CO2 as land based vegetation [18], potentially having a significant impact on global warming. Phytoplankton produce half the world’s oxygen [19]. Plankton is also very important to the food chain because it forms the base of the food chain in the oceans [22].

Given the facts above, the ability to monitor plankton, including their types, regions, and quantities, in a quick and efficient manner can be very useful. There is also a need to adapt quickly to different circumstances. Different locations and environmental conditions will require the classification of different groups of plankton.

1.2 Scope of Work

This thesis describes my contribution to a system that automates the identification of plankton. The Center for Ocean Technology of the College of Marine Sciences developed a device called SIPPER [2] (Shadowed Image Particle Profiling Evaluation Recorder). This device uses a line scan laser to produce a silhouette of all particles that flow through it. One minute’s worth of data collection can result in several thousand discrete particles that need to be classified. In a single day, several hundred minutes’ worth of data can be collected, requiring the classification of millions of particles. It is desired that this classification process be completed in near real time.

To accomplish the goals above, applications were developed and experiments were performed in the areas of feature extraction, feature selection, and active learning. These applications perform several tasks, including image and feature extraction, classification into appropriate plankton classes, training model maintenance, and active learning to aid in the creation of new training libraries. Experiments were done in feature selection using a modified WRAPPER [12] approach and in active learning to aid in the building of training libraries.

A comprehensive software system was designed that gives the user the ability to quickly create training models and classify large volumes of plankton images. See Appendix A for more details of the implemented system.

A Support Vector Machine (SVM) is the learning algorithm used. The specific implementation used is a modified version of libSVM [13]. It was modified by Tong Luo [1] of USF to include a confidence estimate, which is used by the active learning algorithms developed. Chapter 2 provides a more detailed discussion of the background and use of the SVM algorithm.

1.3 SIPPER

SIPPER [2], which stands for Shadowed Image Particle Profiling Evaluation Recorder, is the source of the imagery that is being processed. It was developed by the Center for Ocean Technology of the College of Marine Sciences at USF in St. Petersburg. It uses a line scan laser camera to take a cross section of all particles that flow through a 4” by 4” tube (see Figure 1-1 and Figure 1-2). This results in a continuous digital image that is 4 inches wide. Its purpose is to enable scientists to get an accurate count of the types of marine plankton in a region of water.

Figure 1-1 shows a picture of the sled on which the SIPPER device is mounted; it is towed behind a research vessel. Figure 1-2 shows a diagram of the line scan camera with some associated specifications. The particle flow area (Figure 1-2 B) is 4” wide and is imaged at 1024 pixels per inch. To keep the aspect ratio of the imagery as close to 1:1 as possible, the sled is towed at a rate such that one inch of travel equals 1024 scan lines, or approximately 1.25 miles per hour.

A sled that contains several instrument packages. SIPPER is the three-canister package at the top of the sled. It is in line with a rectangular tube that runs the length of the sled. Water passes from left to right through SIPPER. The sled is towed behind a research vessel at a speed that is compatible with the speed of the camera, such that the aspect ratio of the imagery should be 1 to 1. http://marine.usf.edu/sipper/instrument.htm

Figure 1-1 SIPPER Mounted in the Sled
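The tow speed quoted above can be checked with a short calculation. The sketch below assumes the scan rate of 23,000 lines per second listed in the Figure 1-2 specifications and the 1024 scan lines per inch stated in the text; the function name and constants are illustrative only and are not part of the SIPPER software.

```python
# Hedged sketch: tow speed needed for a 1:1 aspect ratio.
# Assumptions: 23,000 scan lines per second (Figure 1-2 specifications)
# and 1024 scan lines per inch of travel (Section 1.3).

INCHES_PER_MILE = 63_360
SECONDS_PER_HOUR = 3_600


def tow_speed_mph(lines_per_second: float, lines_per_inch: float) -> float:
    """Return the speed at which one inch of travel spans lines_per_inch scans."""
    inches_per_second = lines_per_second / lines_per_inch
    return inches_per_second * SECONDS_PER_HOUR / INCHES_PER_MILE


if __name__ == "__main__":
    print(f"{tow_speed_mph(23_000, 1024):.2f} mph")
```

Under these assumptions the result is roughly 1.28 miles per hour, which is consistent with the approximate figure of 1.25 miles per hour given in the text.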

This is a diagram of the line scan laser layout; the red sheet represents the light being projected. Water flows from right to left. As particles pass through, they block some of the light, creating a silhouette of the particle passing through.

Technical Specifications
4096 pixels per scan line
23,000 lines scanned per second
160 µm resolution particle size

Figure created by Mike Hall and Chad Lembke. http://www.mbari.org/rd/sensors/presentations/daly.pdf

Figure 1-2 Diagram of Line Scan Camera Layout

The data produced is a continuous image that is 4096 pixels wide. Each pixel holds a grayscale value from 0 to 7, where 0 is the background and all other values represent increasingly less transparent foreground. A value of 1 indicates very transparent and 7 completely opaque; the values in between represent intermediate degrees of transparency. Future versions of SIPPER may support 255 levels of grayscale. To accommodate this possible upgrade, the 8 levels of grayscale of SIPPER 2 are remapped to 255 levels (see section 3.1 for more details).

1.4 Feature Calculations

Improvements in SIPPER 2 over the previous version include 8 levels of grayscale versus two levels and an improvement in resolution from 512 pixels per inch to 1024. There were 28 new features created that take advantage of these hardware improvements. Taking advantage of the grayscale information provided by SIPPER 2, weighted moment and texture based features were added. The texture based features used both Fourier domain features and intensity histograms. A weighted transparency feature was also created. The intensity histogram based features in particular performed very well, resulting in a significant improvement in classification accuracy (see section 3.7).
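As an illustration of the grayscale remapping mentioned in the Figure 1-2 discussion, the following minimal sketch stretches the SIPPER 2 levels 0 through 7 onto the range 0 through 255. The actual mapping is described in section 3.1 and may differ; the linear scaling and the name `remap_gray` are assumptions made here for illustration.

```python
# Hedged sketch: linear remap of SIPPER 2 gray levels (0-7) to 0-255.
# The linear rule is an assumption; section 3.1 defines the real mapping.

def remap_gray(level: int, src_max: int = 7, dst_max: int = 255) -> int:
    """Map a gray level in [0, src_max] linearly onto [0, dst_max]."""
    if not 0 <= level <= src_max:
        raise ValueError(f"gray level {level} outside 0..{src_max}")
    return round(level * dst_max / src_max)


if __name__ == "__main__":
    # Background stays 0 and a fully opaque pixel maps to the top of the range.
    print([remap_gray(v) for v in range(8)])
```

A linear stretch keeps the background at 0 and preserves the ordering of transparency levels, so features computed on the remapped image rank pixels the same way as on the original 8 levels.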

The greater resolution provided by SIPPER 2 makes the use of contour features more attractive. These were implemented by computing a Fourier transform of the boundary of the plankton organism image to derive frequency information. Experiments were performed to show the importance of the different frequency ranges (see section 3.5). The lowest frequencies are shown to carry the greatest amount of information, while higher frequencies produced diminishing returns.

1.5 Feature Selection

The feature selection work was based on the WRAPPER approach by Kohavi [12]. The basic idea of the WRAPPER approach is to search feature space using the learning algorithm as the heuristic for grading any given feature combination. The search is done by implementing a Best First Search [16] algorithm (forward and backward) plus a Beam Search [17]. By performing feature selection it is possible to reduce the number of features being processed, resulting in faster execution times for the training and classification processes.

1.6 Previous Work

The contents of this thesis cover several different areas in which previous work has been done. The areas are plankton classification, feature extraction, and feature selection.

The work in this area continues with the 29 features that were developed for SIPPER 1 [5], which produced binary image data. The imagery produced by SIPPER 2 is superior to that produced by SIPPER 1. The resolution is higher, 1024 pixels per inch versus 512, and there are 8 levels of grayscale compared to just two. Because of the poor quality of the SIPPER 1 images, contour features proved unfeasible [5], whereas with SIPPER 2 contour features performed well (see section 3.5.4).

There have been several papers describing a device called the Video Plankton Recorder (VPR) [21, 22, 24, 25, 26]. This is a device that takes still images underwater utilizing a powerful strobe light. Some of its issues differ from those that arise with SIPPER and some are the same. The shared issues are a large amount of data that needs to be processed quickly and particles that are deformable and can be rotated arbitrarily in 3D space. There are three issues in particular that VPR has that SIPPER does not: 1) images may be in partial occlusion; 2) the background is very noisy (see the figures in [21, 22, 24, 25, 26] as compared to the ones in Appendix B); and 3) there is no guarantee that all particles are being captured. The biggest issues faced were the ability to identify discrete images and segment them out.

Xiaoou Tang [9] did extensive work with features similar to those used in this work, specifically the Fourier descriptors. All previous works dealt with 4, 5, or 6 plankton classes with just a few hundred images in each class. In the work covered in this thesis there are 9 classes used for tests involving 1200 images per class.

Xiaoou Tang [24] achieved 92% accuracy with 4 classes of plankton. The Fourier descriptors used were of the centroidal radius type [31], compared to the complex type [29] used in this work. There were 365 Fourier descriptors created, compared to the averaging over 5 frequency ranges used here. In another paper by Xiaoou Tang [25] an accuracy of 95% was reached using 6 classes, with as few as 40 images for one class and as many as 600 for another. The types of features implemented were similar to the ones implemented in this work, with the differences being in the details of implementation. Fourier descriptors were implemented in a drastically different way: Xiaoou Tang created a Fourier descriptor vector of 365 features, compared to the 5 and 16 (see Section 3.5) implemented in this work.
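To make the "complex type" of Fourier descriptor mentioned above concrete, the following minimal sketch treats each boundary point (x, y) as the complex number x + iy and takes a discrete Fourier transform of the sequence. It is an illustration under stated assumptions and not the thesis implementation: the function name, the naive O(n²) transform, and the normalization by the first harmonic are all choices made here, and the averaging over frequency ranges described in Section 3.5 is not shown.

```python
# Hedged sketch of complex-type Fourier contour descriptors.
# Assumptions (not from the thesis): naive DFT, drop the DC term for
# translation invariance, use magnitudes for rotation invariance, and
# divide by the first harmonic for scale invariance.
import cmath
import math


def complex_fourier_descriptors(contour):
    """contour: list of (x, y) boundary points in traversal order."""
    n = len(contour)
    z = [complex(x, y) for x, y in contour]
    coeffs = [
        sum(z[k] * cmath.exp(-2j * math.pi * u * k / n) for k in range(n))
        for u in range(n)
    ]
    scale = abs(coeffs[1])  # assumes the first harmonic is non-zero
    # Skip coeffs[0], which only encodes the position of the contour.
    return [abs(c) / scale for c in coeffs[1:]]


if __name__ == "__main__":
    n = 32
    ellipse = [
        (3 * math.cos(2 * math.pi * k / n), 2 * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]
    descriptors = complex_fourier_descriptors(ellipse)
    print(descriptors[0], descriptors[-1])
```

For the ellipse in the usage example, nearly all of the energy sits in the first and last harmonics, which mirrors the observation in Section 1.4 that the lowest frequencies carry most of the shape information.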

The paper by Kohavi [12] was used as the starting point of the feature selection work. This involves using the learning algorithm as the heuristic that drives a search through feature space. The search itself is a simple best case next search, with some small modifications. Other approaches used in previous works involved the filter approach. Tang [9] used the Karhunen-Loeve Transform (KLT) and Bhattacharyya distance measure methods for selecting features. This was more appropriate considering the large feature vectors used.

1.7 Organization

There are four chapters following the introduction: 2 The Support Vector Machine, 3 Feature Calculations, 4 Feature Selection, and 5 Active Learning. The Support Vector Machine (SVM) chapter reviews the support vector machine, describes the modifications made to it, and explains how to tune its parameters. The Feature Extraction chapter describes the 57 features that are extracted and gives a thorough review of Fourier descriptors. The Feature Selection chapter describes the process of selecting pertinent features for a given feature set and the variant of WRAPPER [12] that was implemented. The final chapter describes the active learning experiments performed.
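The wrapper idea described in Sections 1.5 and 1.6, in which the learning algorithm itself grades each candidate feature subset, can be sketched with a simple greedy forward search. This is a simplified stand-in and not the modified best first or beam search detailed in Chapter 4; the `evaluate` callback, the feature names, and the toy scoring function are hypothetical.

```python
# Hedged sketch of wrapper-style feature selection (greedy forward search).
# `evaluate` stands in for training the learner on a feature subset and
# returning its cross-validated accuracy; it is a hypothetical callback.

def greedy_forward_selection(features, evaluate):
    """Add one feature at a time while the learner's score keeps improving."""
    selected = frozenset()
    best_score = float("-inf")
    while True:
        candidates = [selected | {f} for f in features if f not in selected]
        if not candidates:
            break
        # Score every one-feature expansion of the current subset.
        score, subset = max((evaluate(c), sorted(c)) for c in candidates)
        if score <= best_score:
            break  # no expansion improves on the current subset
        selected, best_score = frozenset(subset), score
    return sorted(selected), best_score


if __name__ == "__main__":
    # Toy scorer: two informative features, a small cost per feature used.
    def toy_accuracy(subset):
        gain = 0.25 * ("size" in subset) + 0.15 * ("histogram" in subset)
        return 0.5 + gain - 0.01 * len(subset)

    print(greedy_forward_selection(["size", "histogram", "noise"], toy_accuracy))
```

In practice each call to `evaluate` means a full train and test cycle of the classifier, which is why the number of subsets visited matters so much for the search strategies compared later in the thesis.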


CHAPTER 2 SUPPORT VECTOR MACHINE

2.1 Introduction

The Support Vector Machine (SVM) is a learning algorithm that generates a classifier from labeled training data. The classifier can then be used to assign labels to unlabeled data. For purposes of this thesis it is treated as a black box and not modified. The primary concern is how to tune the parameters for different data sets.

2.2 Probability Model

Starting with a Support Vector Machine written by Chih-Chung Chang and Chih-Jen Lin [13], a confidence value for the class prediction was added [1]. This confidence value gives a gauge of how good the prediction is. Using the decision boundary that is drawn between each pair of classes, an estimated probability is calculated as to which of the two classes a test point belongs. Using the probabilities from all of the two-class SVMs, a normalized probability can be assigned to each class. The class that has the highest probability is the predicted class.

There are other uses for the probabilities. In Chapter 5, the Active Learning experiments use these values to determine which data points will have the greatest beneficial impact on existing training libraries.
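The step from two-class probabilities to one normalized probability per class can be illustrated with a minimal sketch. The combination rule used here (sum each class's pairwise probabilities, then divide by the grand total) is an assumption chosen for simplicity; it is not necessarily the method added in [1]. The function name `class_probabilities` and the input layout are hypothetical.

```python
def class_probabilities(pairwise):
    """Combine two-class probabilities into one normalized probability per class.

    pairwise[(a, b)] is the estimated probability that a test point belongs
    to class a rather than class b (so class b receives 1 - p).
    Assumption: summing and normalizing is a simple illustrative rule only.
    """
    scores = {}
    for (a, b), p in pairwise.items():
        scores[a] = scores.get(a, 0.0) + p
        scores[b] = scores.get(b, 0.0) + (1.0 - p)
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


def predict(pairwise):
    """Return the class with the highest normalized probability."""
    probs = class_probabilities(pairwise)
    return max(probs, key=probs.get)
```

With three classes there are three two-class SVMs, so the input holds three entries; the normalized values always sum to one and can be read as the confidence of the prediction.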


2.3 Parameter Tuning

There are three SVM parameters that are important to tune with respect to the data set being worked with: C, Gamma, and Alpha. These parameters have an impact on classification accuracy, training time, and classification time. Several criteria need to be considered when selecting appropriate values for them. The idea is to determine the best parameter settings for the data at hand.

The most important criterion is classification accuracy, but this needs to be balanced against the time it takes to train and classify. For instance, one set of parameter settings might produce 92.1% classification accuracy on a given test set while another gives only 91.9% on the same test set, but the first group of settings might take 20 minutes to train and classify while the other takes less than a minute.

Another important criterion to consider is the confidence of prediction. In a perfect world, data that is returned with 80% confidence would also be classified correctly 80% of the time. This would give extra meaning to the confidence value.

In summary, what needs to be accomplished is to balance the shortest possible processing time, the highest possible classification accuracy, and a minimal difference between classification accuracy and confidence of prediction.

The parameter tuning process consists of two simple grid searches. The first one is used to determine C and Gamma. The second one determines Alpha. Separating the two grid searches reduces the number of combinations evaluated. In both grid searches a 5 fold cross validation is performed. The classification accuracy, processing time, number of support vectors, average predicted probability, and C are recorded. These results are used to select the appropriate parameters.


Figure 2-1 shows the algorithm for the first grid search, where C and Gamma are determined, and Figure 2-2 shows the algorithm for the second grid search, where Alpha is determined. Figure 2-3 shows the criteria for selecting parameters from the results; it is used for both grid searches.

Figure 2-1 Parameter Tuning, Grid Search for C and Gamma. Two nested loops: C is stepped from 1 up to 1700 and, for each C, γ is stepped from 0.00001 up to 10. At each grid point a 5 fold cross validation (γ, C, 100) is performed and the results are recorded.

Figure 2-2 Parameter Tuning, Grid Search for Alpha. A single loop steps the parameter from 1 up to 1000. At each step a 5 fold cross validation is performed and the results are recorded.

Equation 1 Compact Value. The compact value is defined in terms of $\log(Prob_{KnownClass})$, the log of the probability assigned to the known class.

Figure 2-3 Parameter Tuning, Select Best Results

1) Select all results that have a 5 fold accuracy within 1 percent of the highest one found.
2) Select from the remaining results those that have a compact value (Equation 1) no greater than the smallest one found plus 0.5.
3) From the remaining results select the one result that has the smallest difference between 5 fold accuracy and average predicted probability.
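The three selection criteria of Figure 2-3 can be sketched as a filter over the recorded grid search results. This is a minimal illustration, not the code used for the experiments. The record layout (dictionaries with the keys `accuracy`, `compact`, and `avg_prob`, with accuracy and probability in percent) is hypothetical, and it is assumed that "the smallest one found" in step 2 refers to the results that survived step 1.

```python
def select_best(results):
    """Apply the three selection steps to a list of grid search records.

    Each record is a dict with 'accuracy' and 'avg_prob' in percent and a
    'compact' value (hypothetical layout chosen for this sketch).
    """
    # Step 1: keep results within 1 percent of the best 5 fold accuracy.
    best_accuracy = max(r["accuracy"] for r in results)
    step1 = [r for r in results if r["accuracy"] >= best_accuracy - 1.0]

    # Step 2: keep results whose compact value is no greater than the
    # smallest remaining compact value plus 0.5 (assumed interpretation).
    min_compact = min(r["compact"] for r in step1)
    step2 = [r for r in step1 if r["compact"] <= min_compact + 0.5]

    # Step 3: pick the result where accuracy and average predicted
    # probability are closest to each other.
    return min(step2, key=lambda r: abs(r["accuracy"] - r["avg_prob"]))
```

Ordering the steps this way means accuracy is never traded away by more than one percent, while the last step favors settings whose confidence values are most meaningful.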


Figure 2-4 and Figure 2-5 show 3D graphical results of the grid search on C and Gamma. Figure 2-4 indicates the 5 fold cross validation accuracy, while Figure 2-5 indicates the time needed to process a 5 fold cross validation. What is interesting and most useful is that at the points where accuracy tends to be greatest, processing time tends to be shortest. So when processing time is plotted with the longest times at the bottom and the shortest at the top, the two charts look very similar.

Figure 2-4 Parameter Grid Search, Accuracy Results

Figure 2-5 Parameter Grid Search, Time to Perform 5 fold Cross Validation


Looking at Figure 2-4, it becomes obvious that Gamma (γ) has the greatest influence on classification accuracy. The exceptions are when γ is very small, such as γ = 0.00001, and at the very peak of accuracy; in both cases the C parameter has the greater influence. Processing time seems to have an inverse relation with accuracy, which is quite handy: higher classification accuracy together with shorter processing time is exactly what is wanted.

Figure 2-6 and Figure 2-7 on pages 13 and 14 show slices through the grid search; one holds C at 12 while varying Gamma (γ), and the other holds Gamma (γ) at 0.01507 while C is varied. Each figure contains three charts: top, middle, and bottom. The top chart shows 5 fold cross validation accuracy, average predicted probability, and processing time in seconds; the middle chart shows the number of support points and processing time; and the bottom chart shows accuracy and support points. The top chart in both figures shows that the point where accuracy and predicted probability cross, C = 12 and γ = 0.01507, is also where accuracy is very close to the maximum detected. Processing time is also near its minimum at this point; see Table 2-1 on page 15 for the specifics.


Figure 2-6 Parameter Grid Search Slice, C=12, Varying Gamma (γ). Three charts titled "Parameter Tuning, C=12", each with Gamma on the horizontal axis (0.00001 to 10); 86.59% is labeled on the top chart. Top shows accuracy, predicted probability, and time. Middle shows support points and time. Bottom shows support points and accuracy.


Figure 2-7 Parameter Grid Search Slice, Gamma (γ)=0.01507, Varying C. Three charts titled "Parameter Tuning, Gamma = 0.01507", each with C on the horizontal axis (1 to 1000); 86.59% is labeled on the top chart. Top shows accuracy, predicted probability, and time. Middle shows support points and time. Bottom shows support points and accuracy.


In both Figure 2-6 and Figure 2-7 the middle charts show the correlation between the number of support points and processing time. This makes obvious sense, since the more support points there are, the greater the processing required to perform classification. What is of interest in these two figures is how the number of support points behaves as the respective parameters γ and C are varied. In Figure 2-7, as C increases the number of support points declines. In Figure 2-6, as γ increases the number of support points follows a parabolic curve.

2.4 Parameter Search Specifics

The data set used consisted of the nine classes specified in Table B-1, using 300 plankton images per class for a total of 2700 images. It was decided to use the 300 per class data set rather than the 1200 per class data set to keep the processing time more reasonable. For purposes of the parameter search all 57 features described in Chapter 3 are used. The parameters selected as a result of the parameter search are C = 12 and γ = 0.01507; all future tests with this data in the following chapters use these settings. There were 8230 different combinations of parameters processed over a period of 2.5 days on a 2.8 GHz Pentium based PC.

Looking at Table 2-1, the chosen parameters ranked 421 in terms of processing time, but the difference from the best time recorded was still only a little over half a second, while the accuracy ranked 36. Also, the difference between the predicted probability and the five fold cross validation accuracy is near its smallest at 0.91%.

Table 2-1 Parameter Search Winner

                          Rank   Value     Diff From Best
Accuracy                  36     86.59%    0.37%
Accuracy - Pred. Prob.    112    85.68%    0.91%
Processing Time (secs)    421    9.50      0.59
Compact                   332    0.5855    0.0522


CHAPTER 3 FEATURE EXTRACTION

3.1 Introduction

SIPPER 2 made improvements in both resolution and intensity over SIPPER 1. The resolution went from 512 to 1024 pixels per inch, while intensity went from one bit to three bits per pixel. The three bits per pixel give 8 levels of grayscale. With these improvements the images become perceptually easier to identify. The user reports that there are fewer unidentifiable particles than with SIPPER 1.

There were 29 features implemented with SIPPER 1, which can be grouped into invariant moments, granulometric, size, convex ratio, transparency ratio, and Eigen value ratio. To take advantage of the technical improvements in SIPPER 2, 28 new features are implemented. Some of these features are grayscale weighted versions of previous features, while others are new features that take advantage of the grayscale information and the greater resolution provided. The new features are divided into five logical groups: 8 weighted moments, 5 contour, 5 texture, 7 grayscale histogram, and 3 other. This results in a total of 57 features, which are described in greater detail in the following sections.

Using 3 bits per pixel, SIPPER 2 provides 8 levels of grayscale, 0 through 7. To allow for future improvements in SIPPER, the 8 levels provided are scaled to the range 0 through 255. This is done using the formula

$$GrayScale_{255} = \frac{255 \times GrayScale_{8}}{7}$$

rounded to the nearest integer, giving intensities 0, 36, 73, 109, 146, 182, 219, and 255. Zero represents background and all other values are interpreted as foreground, where each higher value represents a smaller degree of transparency, up to 255, which is completely opaque. See Appendix B (figures up to Figure B-9) for typical examples of the images produced by SIPPER 2.
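The scaling of the 8 SIPPER 2 gray levels onto the 0 through 255 range can be checked with a short sketch. The function name `scale_gray` is hypothetical; rounding to the nearest integer is assumed because it reproduces the eight intensities listed above.

```python
def scale_gray(level):
    """Scale a 3-bit gray level (0 through 7) onto the 0 through 255 range."""
    if not 0 <= level <= 7:
        raise ValueError("SIPPER 2 gray levels run from 0 through 7")
    # 255 * level / 7, rounded to the nearest integer (assumed rounding).
    return round(255 * level / 7)


scaled_levels = [scale_gray(level) for level in range(8)]
```

Here `scaled_levels` evaluates to the intensities 0, 36, 73, 109, 146, 182, 219, and 255.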


Table 3-1 gives a summary list of all the features implemented, where 0 through 28 are the original ones implemented with SIPPER 1. Features 29 through 56 represent the 28 additional features implemented for SIPPER 2. Sections 3.2 through 3.8 describe the features in detail. Table 3-2 lists variables that are common to all feature calculations.

Table 3-1 Summary of Features

Feature Num   Description
0-7           Moment Features Black/White
8-15          Moment Edge
16            Convex Hull Ratio
17-24         Morphological Based Features, Using Openings and Closings (granulometric)
25, 26        Eigen Head, Eigen ratio
27            Convex Area
28            Convex Area Ratio
29            Weighted Transparency
30            Weighted Size
31-38         Moment Weighted using Gray Scale Values
39-43         Texture, Fourier
44-48         Contour, using 1 Dimensional Fourier
49-55         Intensity Histogram Field
56            Height / Width

Table 3-2 Common Variables for Feature Calculations

I               Image being processed
H               Image height
W               Image width
I(x, y)         Intensity at x, y; 0 = background, >0 = foreground
c_x, c_y        Black and White Centroid (Equation 2)
wc_x, wc_y      Weighted Centroid (Equation 3)
PC(I)           Number of foreground pixels in image
IS(I)           Image size in number of pixels

3.2 Moments

Hu [8] introduced a way to compute the seven lower order moment invariants based on several nonlinear combinations of the central moments. Using the normalized central moments, features that are invariant to scale, rotation, and translation are computed.


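Table 3-3 shows the seven basic equations. All of these equations use the normalized central moments of Equation 5 as their components. Figure 3-1 demonstrates how each central order moment is extracted from the image.

$$\operatorname{sgn}(x) = \begin{cases} 1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases}$$

$$Size = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} \operatorname{sgn}(I(x,y)) \qquad WS = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} \frac{I(x,y)}{255}$$

$$c_x = \frac{\sum_{x=0}^{H-1} \sum_{y=0}^{W-1} \operatorname{sgn}(I(x,y)) \, x}{Size} \qquad c_y = \frac{\sum_{x=0}^{H-1} \sum_{y=0}^{W-1} \operatorname{sgn}(I(x,y)) \, y}{Size}$$

Equation 2 Black & White Centroid

$$wc_x = \frac{\sum_{x=0}^{H-1} \sum_{y=0}^{W-1} \frac{I(x,y)}{255} \, x}{WS} \qquad wc_y = \frac{\sum_{x=0}^{H-1} \sum_{y=0}^{W-1} \frac{I(x,y)}{255} \, y}{WS}$$

Equation 3 Weighted Centroid

$$\mu_{p,q} = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} (x - c_x)^p (y - c_y)^q \operatorname{sgn}(I(x,y))$$

$$w\mu_{p,q} = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} (x - wc_x)^p (y - wc_y)^q \frac{I(x,y)}{255}$$

Equation 4 Central Order Moments (black and white, and weighted)

$$\eta_{p,q} = \frac{\mu_{p,q}}{\mu_{0,0}^{\frac{p+q}{2} + 1}}$$

Equation 5 Central Moments Normalized for Size

Note that $\mu_{0,0}$ = size of image.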
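Equation 2 through Equation 5 can be sketched in a few lines. This is an illustration under stated assumptions, not the thesis implementation: the image is a list of rows of intensities in 0 through 255, x indexes rows and y indexes columns as in the equations, and the normalization exponent follows Hu's standard form, (p+q)/2 + 1. The function names are hypothetical.

```python
def pixel_weights(image, weighted=False):
    """Per-pixel weight: sgn(I) for black and white, I/255 for weighted."""
    if weighted:
        return [[value / 255.0 for value in row] for row in image]
    return [[1.0 if value > 0 else 0.0 for value in row] for row in image]


def centroid(weights):
    """Return (c_x, c_y, size); x is the row index, y the column index."""
    size = sum(sum(row) for row in weights)
    c_x = sum(x * w for x, row in enumerate(weights) for w in row) / size
    c_y = sum(y * w for row in weights for y, w in enumerate(row)) / size
    return c_x, c_y, size


def central_moment(weights, p, q):
    """Central moment of order (p, q) about the centroid (Equation 4)."""
    c_x, c_y, _ = centroid(weights)
    return sum((x - c_x) ** p * (y - c_y) ** q * w
               for x, row in enumerate(weights)
               for y, w in enumerate(row))


def normalized_moment(weights, p, q):
    """Central moment normalized for size (Equation 5, Hu's exponent assumed)."""
    mu_00 = central_moment(weights, 0, 0)
    return central_moment(weights, p, q) / mu_00 ** ((p + q) / 2 + 1)
```

Passing `weighted=True` to `pixel_weights` switches the same calculations from the black and white moments to the weighted moments.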


Figure 3-1 illustrates how the various components in the equations in Table 3-3 are derived. For example, moment number 1 uses the sum of $\eta_{0,2}$ and $\eta_{2,0}$. These two components come from the left column of Figure 3-1. Think of these components as masks, where a greater intensity indicates a higher value. Green represents positive values and red represents negative values.

Figure 3-1 Central Moment Component Examples


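Table 3-3 Size Plus Seven Basic Moment Feature Equations

Moment Num   Equation
0   $\mu_{0,0}$ (same as image size)
1   $\eta_{2,0} + \eta_{0,2}$
2   $(\eta_{2,0} - \eta_{0,2})^2 + 4\eta_{1,1}^2$
3   $(\eta_{3,0} - 3\eta_{1,2})^2 + (3\eta_{2,1} - \eta_{0,3})^2$
4   $(\eta_{3,0} + \eta_{1,2})^2 + (\eta_{2,1} + \eta_{0,3})^2$
5   $(\eta_{3,0} - 3\eta_{1,2})(\eta_{3,0} + \eta_{1,2})\left[(\eta_{3,0} + \eta_{1,2})^2 - 3(\eta_{2,1} + \eta_{0,3})^2\right] + (3\eta_{2,1} - \eta_{0,3})(\eta_{2,1} + \eta_{0,3})\left[3(\eta_{3,0} + \eta_{1,2})^2 - (\eta_{2,1} + \eta_{0,3})^2\right]$
6   $(\eta_{2,0} - \eta_{0,2})\left[(\eta_{3,0} + \eta_{1,2})^2 - (\eta_{2,1} + \eta_{0,3})^2\right] + 4\eta_{1,1}(\eta_{3,0} + \eta_{1,2})(\eta_{2,1} + \eta_{0,3})$
7   $(3\eta_{2,1} - \eta_{0,3})(\eta_{3,0} + \eta_{1,2})\left[(\eta_{3,0} + \eta_{1,2})^2 - 3(\eta_{2,1} + \eta_{0,3})^2\right] - (\eta_{3,0} - 3\eta_{1,2})(\eta_{2,1} + \eta_{0,3})\left[3(\eta_{3,0} + \eta_{1,2})^2 - (\eta_{2,1} + \eta_{0,3})^2\right]$

There are three groups of moment features, which will be referred to as black and white (binary), edge, and weighted moments [30]. All three use the same calculations except for how they deal with pixels. The black and white moments treat pixels as background and foreground, 0 and 1 (sgn(x)). The edge moments only process pixels that are on the contour of the image. The weighted moments utilize the intensity values, such that the higher the intensity, the greater the weight assigned to the pixel.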
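The seven invariants of Table 3-3 can be sketched as a function of the normalized central moments. The formulas below are Hu's standard invariants [8]; the function name `hu_invariants` and the dictionary layout keyed by (p, q) are hypothetical choices for this illustration.

```python
def hu_invariants(eta):
    """Compute Hu's seven moment invariants.

    eta maps (p, q) to the normalized central moment of that order and must
    contain the second and third order entries used below.
    """
    n20, n02, n11 = eta[(2, 0)], eta[(0, 2)], eta[(1, 1)]
    n30, n03 = eta[(3, 0)], eta[(0, 3)]
    n21, n12 = eta[(2, 1)], eta[(1, 2)]
    a = n30 + n12          # recurring sum eta30 + eta12
    b = n21 + n03          # recurring sum eta21 + eta03
    phi1 = n20 + n02
    phi2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    phi3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    phi4 = a ** 2 + b ** 2
    phi5 = ((n30 - 3 * n12) * a * (a ** 2 - 3 * b ** 2)
            + (3 * n21 - n03) * b * (3 * a ** 2 - b ** 2))
    phi6 = (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b
    phi7 = ((3 * n21 - n03) * a * (a ** 2 - 3 * b ** 2)
            - (n30 - 3 * n12) * b * (3 * a ** 2 - b ** 2))
    return [phi1, phi2, phi3, phi4, phi5, phi6, phi7]
```

The same function serves the black and white, edge, and weighted groups; only the pixels and weights that feed the normalized moments differ.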


Table 3-4 has a summary of 10 fold cross validation accuracies for the different groups of moment features. The black and white moments (not weighted) produce 37.34% accuracy (Table 3-18, page 41). By assigning a weight to each pixel that is derived from its intensity value, the moments can reflect the intensity of the image. Imagine two images that have the same shape but different textures, meaning their intensities vary differently. Without weighting they would appear to be the same to the classifier. These moments are called weighted, and have achieved an accuracy of 37.82% (Table 3-19, page 42). The edge moments result in an accuracy of 37.63% (Table 3-20, page 43). Looking at the feature selection experiments in Section 4.5, Figure 4-8, page 72, you will see that the weighted moments were the most successful of the three different moment calculations.

Table 3-4 Moment Features Result Summary

Description                                              Accuracy   Result Table   Page
Moments Not Weighted (0-7)                               37.34%     Table 3-18     41
Moments Weighted (31-38)                                 37.82%     Table 3-19     42
Edge Moment Features (8-15)                              37.69%     Table 3-20     43
Moments and Weighted Moments Combined (8-15, 31-38)      53.44%     Table 3-21     44

3.3 Area Based Features

These are features that utilize morphological operations. They include such things as the Convex-Hull ratio, Transparency, etc. There are two versions of these features, one that is based on black and white (binary) pixels and another that is weighted by intensity.

Figure 3-2 Cnidaria Thimble Before any Processing

$$ConvexHullRatio = \frac{IS(I)}{PC(FillHoles(ConvexHull(I)))}$$


This is a simple feature where the image size is divided by the area of the convex hull of the image. This can give an idea of how stable the contour of the image is. For example, a perfect rectangle would produce a feature value of 1, whereas the image in Figure 3-2 would produce a very small fraction.

Figure 3-3 Convex Hull of Image in Figure 3-2, Area = 66,498 Pixels

$$Transparency = \frac{IS(I)}{PC(FillHoles(Closing_3(I)))}$$

There are actually two flavors of this feature, a Black and White version and a Gray Scale version. They are features 17 and 29 respectively. Both of these features performed very well during feature selection; see Figure 4-8, page 72 and Figure 4-10, page 76. Both were in the top 200 feature sets searched by accuracy. Figure 3-4 shows the result of the Fill Holes operation on the image in Figure 3-2. A Closing operation with a structure size of 3 was performed before the Fill Holes operation. The resulting image had a new size of 20,105 pixels.

Figure 3-4 Fill Hole Operation on Image in Figure 3-2


3.4 Granulometric Based Features

There are 7 features in this group that were originally implemented with the SIPPER 1 implementation [7]. They are based on performing opening and closing operations [30] on the plankton image and comparing the change in pixel count. From these operations different characteristics of the image shape can be found. There are 4 features that use opening operations and 3 that use closing operations. The 4 features that use the opening operation utilize 3x3, 5x5, 7x7, and 9x9 structural elements [30], while the three that use the closing operation use 3x3, 5x5, and 7x7 structural elements.

These 7 features achieve 51.02% accuracy on a 10 fold cross validation (Table 3-22, page 45). Specifically, Copepod Oithona and Echino Plutei can be well recognized, with 75.75% and 75.42% accuracy respectively. Both of these classes have long slender arms. They differ from each other in the number of appendages and the size of their central body. See Table 3-6 for the results of the opening and closing operations.

Table 3-5 shows how the operations work on a Long-Arm-Cnidaria; the long thin tentacles disappear quickly with the opening operations and are merged together by the closing operations. This particular class is sensitive to both the closing and opening operations. The left column shows the equation used with each feature calculation, while the right column shows the resultant image. The fraction just below each image indicates the resultant feature value calculated.
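The opening based features can be illustrated with a small pure Python sketch. It assumes a binary image stored as a list of rows of 0 and 1, a square structural element of side 2k+1, and pixels outside the image treated as background; the feature is taken as the fraction of foreground pixels removed by the opening, (IS - PC(Open(I))) / IS. All function names are hypothetical, and this is not the thesis implementation.

```python
def erode(image, k=1):
    """Binary erosion with a (2k+1) x (2k+1) square structural element."""
    h, w = len(image), len(image[0])
    return [[int(all(0 <= r + dr < h and 0 <= c + dc < w and image[r + dr][c + dc]
                     for dr in range(-k, k + 1) for dc in range(-k, k + 1)))
             for c in range(w)] for r in range(h)]


def dilate(image, k=1):
    """Binary dilation with a (2k+1) x (2k+1) square structural element."""
    h, w = len(image), len(image[0])
    return [[int(any(0 <= r + dr < h and 0 <= c + dc < w and image[r + dr][c + dc]
                     for dr in range(-k, k + 1) for dc in range(-k, k + 1)))
             for c in range(w)] for r in range(h)]


def pixel_count(image):
    """Number of foreground pixels."""
    return sum(sum(row) for row in image)


def opening_feature(image, k=1):
    """Fraction of foreground pixels removed by an opening (erode, then dilate)."""
    size = pixel_count(image)
    opened = dilate(erode(image, k), k)
    return (size - pixel_count(opened)) / size
```

For a 3x3 body with a one pixel wide arm three pixels long, the 3x3 opening removes the arm and keeps the body, so the feature value is 3/12 = 0.25. A closing based feature would apply `dilate` followed by `erode` instead, which adds pixels and gives a negative value.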


Table 3-5 Granulometric Operations on a Long-arm Cnidaria (Jelly Fish)

Feature equation                          Feature value
$\frac{IS - PC(Open_3(I))}{IS}$           0.5368
$\frac{IS - PC(Open_5(I))}{IS}$           0.7778
$\frac{IS - PC(Open_7(I))}{IS}$           0.8535
$\frac{IS - PC(Open_9(I))}{IS}$           0.8750
$\frac{IS - PC(Close_3(I))}{IS}$          -0.1482
$\frac{IS - PC(Close_5(I))}{IS}$          -0.3956
$\frac{IS - PC(Close_7(I))}{IS}$          -0.6585

Table 3-6 shows the results of the granulometric feature calculations on an Oithona Copepod and an Echino Plutei. The Oithona is sensitive to the opening operations, as indicated by the arms disappearing, but not sensitive to the closing operations. The Echino Plutei is sensitive to both opening and closing operations, although not as sensitive to the closing as the Long-arm Cnidaria shown in Table 3-5.


Table 3-6 Example of Granulometric Features on Copepod Oithona and Echino Plutei

Feature equation                          Copepod Oithona   Echino Plutei
$\frac{IS - PC(Open_3(I))}{IS}$           0.0125            0.0847
$\frac{IS - PC(Open_5(I))}{IS}$           0.1029            0.2629
$\frac{IS - PC(Open_7(I))}{IS}$           0.3314            0.3064
$\frac{IS - PC(Open_9(I))}{IS}$           0.4673            0.3325
$\frac{IS - PC(Close_3(I))}{IS}$          -0.0058           -0.0317
$\frac{IS - PC(Close_5(I))}{IS}$          0.0650            -0.1180
$\frac{IS - PC(Close_7(I))}{IS}$          0.0535            -0.2177


It appears from the 3 images used in Table 3-5 and Table 3-6 that images with smooth surfaces have very small responses to the closing operation, as seen with the Oithona Copepod. The Echino Plutei, with its furry boundary, had a small response, while the Cnidaria, with its dangling tentacles, had a very large response.

3.5 Contour Features

The images produced by SIPPER 2 are greatly improved over those produced by SIPPER 1. Because of this it was decided to attempt to implement contour features based on Fourier descriptors. These Fourier descriptors are derived by performing a Fourier transform on a one dimensional array of data that represents the contour of the image. Three different approaches were tried for implementing these features; see Table 3-7 on page 28.

The basis of all three methods tried was plotting the edge of the image as a one dimensional array in which each element has a real and an imaginary component. For each edge pixel used, the row and column populate the real and imaginary components respectively. This takes advantage of the fact that a single frequency in the complex number plane traces a circle. When a Fourier transform is performed on an array that represents the edge or contour of an image, the frequencies captured in the resultant array reflect the deviations from a circle. Each location in the output array represents a different frequency, starting with the lowest frequency on the left, rising to the highest in the middle, and then going back down to the lowest frequency on the right.
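The relationship between a circular contour and the Fourier transform of its complex representation can be checked with a small sketch. It uses a direct (slow) discrete Fourier transform so that it needs nothing beyond the standard library; a real implementation would use an FFT library. The function names and the synthetic circle are hypothetical.

```python
import cmath
import math


def dft(values):
    """Direct discrete Fourier transform of a list of complex numbers."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(values))
            for k in range(n)]


def circle_contour(n_points, radius, center_row, center_col):
    """Contour of a circle; row is the real part, column the imaginary part."""
    center = complex(center_row, center_col)
    return [center + radius * cmath.exp(2j * math.pi * t / n_points)
            for t in range(n_points)]


contour = circle_contour(64, 10.0, 50.0, 40.0)
magnitudes = [abs(f) for f in dft(contour)]
```

For this perfect circle only two buckets are non-zero: bucket 0, which holds the position of the center, and the first frequency bucket, which holds the radius. Any energy in the remaining buckets of a real contour therefore measures how far the shape deviates from a circle.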


Figure 3-5 Contour Fourier, Low Pass Examples (panels: 1Hz, 2Hz, 3Hz, 5Hz, 10Hz, 15Hz, 20Hz, 50Hz). These images were generated by performing a low pass filter operation on the contour frequency. They indicate the amount of information that is included in just a few frequency buckets. This image had a total of 8506 contour points, for a total of 4,253 frequency buckets. This is a fairly complicated image, yet 50 Hz is able to capture a great deal of information about the image.


When working with the low frequencies (1 Hertz, 2 Hertz, etc.), each individual frequency contains a great deal of information about the image. If you look at the 1 Hz picture in Figure 3-5 you can see that the orientation, height vs. width, and size are all there. Each successive frequency provides more detail, with diminishing returns.

In the actual implementation, experiments were performed with three different approaches for Fourier descriptors. Table 3-7 lists the three approaches with a brief description; a detailed description of each approach follows the table. The approach that was actually implemented was not the best performing by itself compared to the other two approaches, but when implemented with all other features it performed just as well using a smaller number of features.

Table 3-7 Contour Approaches

Sampling    Sample 100 equally distant points along the contour, resulting in 100 Fourier descriptors.
Averaging   Transform all contour points, then break the resulting transform into 5 frequency ranges and take the average amplitude of each.
Hybrid      Transform all contour points. Use the three lowest frequencies discretely and divide the rest into frequency ranges.

3.5.1 Contour Sampling

In Xiaoou Tang's [9] paper on automatic plankton recognition, 360 sample points were chosen. Out of 10,800 test images a large number had fewer than 360 boundary points, but just about all the images had at least 100 points on the boundary. For this reason it was decided to use 100 sample points instead. These 100 Fourier descriptors result in 54.50% (Table 3-23, page 46) accuracy for normalized descriptors and 67.69% (Table 3-24, page 47) accuracy for non normalized descriptors. Fourier descriptors are normalized by dividing the magnitude of all buckets in the resultant frequency array by the magnitude of the first bucket (1 Hertz).
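The sampling approach can be sketched as follows: pick 100 equally spaced contour points, transform them, and optionally divide every magnitude by the magnitude of the 1 Hertz bucket. The direct transform is used only to keep the sketch free of external libraries, and the function names are hypothetical. The choice of index 1 as the 1 Hertz bucket is an assumption about the array layout.

```python
import cmath
import math


def sample_contour(contour, n_samples=100):
    """Pick n_samples points spaced equally along the contour."""
    step = len(contour) / n_samples
    return [contour[int(i * step)] for i in range(n_samples)]


def fourier_descriptors(contour, n_samples=100, normalize=True):
    """Magnitudes of the transform of the sampled contour.

    When normalize is True every magnitude is divided by the magnitude of
    bucket 1 (assumed to be the 1 Hertz bucket).
    """
    samples = sample_contour(contour, n_samples)
    n = len(samples)
    magnitudes = [abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                          for t, s in enumerate(samples)))
                  for k in range(n)]
    if normalize:
        magnitudes = [m / magnitudes[1] for m in magnitudes]
    return magnitudes
```

Because the number of samples is fixed, the transform size is the same for every image, which is what allows a transform plan to be built once and reused.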


There are advantages and disadvantages to this approach. Because the number of points used in the Fourier transform is always the same, 100 in our case, a transform plan can be pre-built and used for all images to be classified, so the more images there are to classify, the greater the time savings. Also, since only a sampling of the contour points is used and not the whole contour, there is a considerable reduction in the time spent performing the Fourier transform itself; for example, the image in Figure 3-5 has over 8000 edge pixels, but only 100 of them are used. The down side is that a lot of detail is lost; two large images that are alike in all respects, except that one has a smooth contour and the other a noisy contour, will produce similar feature values.

Another issue to consider is the number of features being generated: 100 features vs. 16 for the hybrid approach and 5 for averaging. A greater number of features adds to the processing time. Training, classification, and feature selection all require extra cycles to process the extra features. Feature selection especially will have a more difficult time. With over 100 features the WRAPPER approach would become impractical to use.

3.5.2 Average Over Frequency Domains

A Fourier transform is performed on the entire contour of the image. The result of the transform is used to generate 5 contour features, each one representing a range of frequencies. This is done by computing the average value of the magnitudes for each range (Figure 3-6). This is similar to the way the Texture Features are computed in Section 3.6. In this case, instead of bounding the regions with semi-circles around the center of the image, the region is derived by determining the distance from the center of the array. Table 3-8 shows the size of the frequency ranges as a fraction of 1; see Figure 3-6 for a graphical representation.


Table 3-8 Upper and Lower Contour Frequency Ranges

Region Number   Lower Bound (LB)   Upper Bound (UB)
1               0                  1/2
2               1/2                3/4
3               3/4                7/8
4               7/8                15/16
5               15/16              1

Figure 3-6 Contour Frequency Domain

Equation 6 is used for calculating the contour feature values. It uses the variables described in Table 3-9. $CFV(r)$ with r = 1 to 5 maps to features 44 through 48 respectively. These features achieved 47.52% accuracy (Table 3-25, page 48).

Table 3-9 Contour Variables and Functions

Variable      Description
$L$           Length of edge in pixels
$c$           Center position, $L/2$
$|F(x)|$      Magnitude of complex number (amplitude) at position x
$R(x, r)$     Indicator function, specifies whether position x is in region r. If $LB(r) \le \frac{|x - c|}{L} \le UB(r)$ then 1 else 0.
$PC(r)$       Number of pixels in region r
$CFV(r)$      Contour Feature Value for region r
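The averaging over the frequency ranges of Table 3-8 can be sketched as below. The sketch takes the list of transform magnitudes as input and returns one average per region. It assumes that the distance of a position from the center of the array is expressed as a fraction of the half length, so that it runs from 0 at the center to 1 at the ends; this normalization, the half open intervals, and the function name are assumptions made for the illustration.

```python
# Region bounds from Table 3-8, as fractions of the distance from the center.
BOUNDS = [(0.0, 1 / 2), (1 / 2, 3 / 4), (3 / 4, 7 / 8), (7 / 8, 15 / 16), (15 / 16, 1.0)]


def contour_feature_values(magnitudes):
    """Average magnitude per frequency region (one value for each of 5 regions)."""
    length = len(magnitudes)
    center = length / 2
    sums = [0.0] * len(BOUNDS)
    counts = [0] * len(BOUNDS)
    for x, magnitude in enumerate(magnitudes):
        # Assumed normalization: 0 at the center of the array, 1 at its ends.
        distance = abs(x - center) / center
        for r, (lower, upper) in enumerate(BOUNDS):
            is_last = r == len(BOUNDS) - 1
            if lower <= distance < upper or (is_last and distance == upper):
                sums[r] += magnitude
                counts[r] += 1
                break
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]
```

Since the lowest frequencies sit at the two ends of the transform array, region 5 collects the few lowest frequency buckets while region 1 averages over half of the array around the center.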


$$CFV(r) = \frac{\sum_{x=0}^{L-1} |F(x)| \, R(x, r)}{PC(r)}$$

Equation 6 Contour Feature Value

3.5.3 Hybrid Contour

The sampling method described in Section 3.5.1 performs considerably better than the averaging method described in Section 3.5.2. In this section an attempt is made to get the best of both, resulting in features that get the higher accuracy of sampling while producing a smaller number of features, like averaging. This resulted in 16 features.

The idea here is that the lowest individual frequencies capture the greatest amount of information, while individual higher frequencies are not as significant but, taken as an average over a domain, can contribute to classification accuracy. Table 3-10 gives a summary of the 16 features computed in this section.


Table 3-10 Hybrid Contour Features

0    1 Hz Left               First bucket in resultant Fourier transform
1    2 Hz Left               Second bucket in resultant Fourier transform
2    3 Hz Left
3    4 Hz Left
4    13/16 - 4 Hz            Avg. of amplitudes in left buckets that range from 13/16th to 4 Hz from center
5    12/16 - 13/16           Avg. of amplitudes in left buckets that range from 12/16th to 13/16th from center
6    10/16 - 12/16           Avg. of amplitudes in left buckets that range from 10/16th to 12/16th from center
7    Center - 10/16 Left
8    Center - 10/16 Right
9    10/16 - 12/16           Avg. of amplitudes in right buckets that range from 10/16th to 12/16th from center
10   12/16 - 13/16           Avg. of amplitudes in right buckets that range from 12/16th to 13/16th from center
11   13/16 - 1               Avg. of amplitudes in right buckets that range from 13/16th to 4 Hz from center
12   4 Hz Right
13   3 Hz Right
14   2 Hz Right
15   1 Hz Right              Last bucket in resultant Fourier transform

These 16 features result in 57.36% (Table 3-26, page 49) accuracy on a 10 fold cross validation, as compared to 57.69% (Table 3-24, page 47) for sampling points. Considering that there are only 16 features, this is a good result. The smaller number of features is preferred when dealing with the support vector machine, resulting in faster training and classification times.


3.5.4 Contour Summary

The averaging of frequency domains in Section 3.5.2 resulted in poor performance, 47.52% (Table 3-25, page 48), when compared to the other two contour methods, hybrid at 57.74% (Table 3-26, page 49) and 100 sample points at 66.38% (Table 3-24, page 47). When used with all the other features, however, averaging performed favorably compared to the other two: 90.37% (Table 3-28, page 51) vs. 86.13% (Table 3-27, page 50) for 100 sampled points and 90.11% (Table 3-29, page 52) for hybrid. In terms of processing time, the averaging method required less time for both training and classification; see Table 3-11.

Given that averaging was as accurate as hybrid, significantly more accurate than the sampling method, and at the same time used less processing time, it was decided to use this method. To confirm that the averaging method would perform as well as the hybrid, two feature selection experiments were performed, one where the data set used averaging features and the other with hybrid features. See Table 4-4 (page 71) and Figure 4-10 (page 76). These two feature selection searches did not find any combination of features in the hybrid set with better classification accuracy than the best feature set found with the averaging method.

Table 3-11 Summary of Contour Features Cross Validation Results

Description                                        Accuracy   Training Time   Classification Time   Result Table (page)
100 sample points normalized                       67.47%     837.4           75.9                  Table 3-23 (46)
100 sample points non normalized                   66.38%     1096.3          85.4                  Table 3-24 (47)
Average of 5 frequency domains                     47.52%     270.8           19.4                  Table 3-25 (48)
16 Hybrid, mixed avg. and sample                   57.74%     313.5           29.5                  Table 3-26 (49)
All Features using 100 sample contour points       88.13%     645.2           81.4                  Table 3-27 (50)
All Features using 5 avg. freq domains             90.37%     146.7           29.4                  Table 3-28 (51)
All Features using 16 hybrid                       90.11%     183.5           33.9                  Table 3-29 (52)


3.6 Texture

With the grayscale values that SIPPER 2 produces, features that reflect the texture of the image can be computed. A 2D Fourier transform is performed on the original image. Using the result of this transform, the energy of different frequency ranges is captured by computing the average magnitude for each of 5 different frequency ranges (see Table 3-12).

Figure 3-7 shows a typical plankton image and its Fourier transform. The semi-circle bands labeled R1 through R5 indicate the boundaries of the regions. Only half of the Fourier domain needs to be processed, since the two halves are mirror images of each other. These five regions result in five Fourier features. The value of each feature is the average magnitude over its respective region.

Figure 3-7 2D Fourier Transform of Image, Frequency Ranges Indicated


Table 3-12 Lower and Upper Frequency Bounds for Texture Features

Region Number   Lower Bound (LB)   Upper Bound (UB)
1               0                  1/2
2               1/2                3/4
3               3/4                9/10
4               9/10               19/20
5               19/20              1

Table 3-13 provides descriptions of some variables and functions that are needed for Equation 7. Using this equation, five features are computed.

Table 3-13 Texture Features Variables and Functions

Function        Description
$J$             Fourier transform of the image. This is a two dimensional matrix with the same dimensions as the original image. Each element in the matrix has both a real and an imaginary part.
$D$             Distance from upper left to centroid, $\sqrt{C_x^2 + C_y^2}$
$R(x, y, r)$    Indicator function that specifies whether the pixel at (x, y) is in region r. Returns 1 if true or 0 if false. Uses Table 3-12 and D.
$PC(r)$         Pixel count for region r

Using Equation 7, the five texture features can be computed. These five features result in 47.00% accuracy (Table 3-30, page 53) on a 10 fold cross validation of test data. The class for which accuracy was best was Marine Snow Light with 77.50% accuracy, while Marine Snow Dark had the worst with 6.08% accuracy. Looking at the Marine Snow Light images in Figure B-7, the two lightest intensities are used almost exclusively. This should result in a very small amount of energy, especially in the lower frequencies.

$$TFV(r) = \frac{1}{PC(r)} \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} R(x, y, r) \sqrt{J_{re}(x, y)^2 + J_{im}(x, y)^2}$$

Equation 7 Texture Feature Value for Region r

Figure 3-8 displays the calculated feature values for the two lowest frequency regions, 4 and 5. The height of each bar represents the mean value for that class, while the error bars represent plus or minus one standard deviation. Note the bar for Marine Snow Light; it is distinctly different from all the other classes.

Figure 3-8 Feature Values Calculated for Regions 4 and 5

3.7 Grayscale Histogram

Seven features are generated in this category. Each one represents the percentage of pixels in its intensity range relative to the total number of foreground pixels in the image. The 256 possible intensity values (0 to 255) are divided into 8 ranges of 32 values each, as specified in Table 3-15. This should allow for compatibility with future versions of SIPPER that might provide 256 levels of intensity.

Table 3-14 Histogram Equations

Function | Description
IR(z) | Indicates which intensity range the pixel value z is in. See Table 3-15.
IR(I(x, y), r) | Indicates that the intensity of the pixel at location (x, y) falls in intensity range r; 1 if true, else 0.
[I(x, y) > 31] | Indicates that the intensity of the pixel at location (x, y) is greater than 31; 1 if true, else 0.
HFV(r) | Histogram Feature Value for intensity range r. See Equation 8.

Table 3-15 Intensity Regions

Region | Intensity Range
Background | 0-31
1 | 32-63
2 | 64-95
3 | 96-127
4 | 128-159
5 | 160-191
6 | 192-223
7 | 224-255

$$HFV(r) = \frac{\sum_{x=0}^{H-1} \sum_{y=0}^{W-1} IR(I(x, y), r)}{\sum_{x=0}^{H-1} \sum_{y=0}^{W-1} [I(x, y) > 31]}$$

Equation 8 Histogram Feature Value

These 7 features achieved 67.69% accuracy in a 10-fold cross validation (Table 3-31, page 54). They are simple to compute and require little processing power, yet they obtained good accuracy compared to all other feature groups.
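
A minimal sketch of the histogram features, assuming the image is a list of rows of integer intensities in the range 0 to 255 and that range 0 (0-31) is background, as in Table 3-15. The function name `histogram_features` is hypothetical.

```python
def histogram_features(image):
    """Fraction of foreground pixels falling in each of the 7 intensity ranges."""
    counts = [0] * 8                      # index 0 is the background range 0-31
    for row in image:
        for value in row:
            counts[value // 32] += 1      # IR(z): 32 intensity values per range
    foreground = sum(counts[1:])          # pixels with intensity greater than 31
    if foreground == 0:
        return [0.0] * 7                  # no foreground: all features are zero
    return [counts[r] / foreground for r in range(1, 8)]


# Two rows of a toy image: two background pixels and six foreground pixels.
example = [[0, 40, 40, 255],
           [31, 100, 200, 32]]
features = histogram_features(example)
```

Because the denominator counts only foreground pixels, the seven values always sum to 1 for any image that contains at least one foreground pixel.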

3.8 Other Features

3.8.1 Weighted Size

Equation 9 is used for calculating Weighted Size. This feature is meant to reflect the density of the image as indicated by each pixel's intensity value. Each pixel in the image is assigned a value in the range 0.0 to 1.0 (background to foreground).

$$WeightedSize = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} \frac{Int(x, y)}{255}$$

Equation 9 Weighted Size

3.8.2 Width vs. Length

The procedure for calculating the Eigen value ratio described in [1] produces an image that is rotated so that its main axis of orientation is parallel to the x axis. Using this rotated image, the length and width of the image are determined. This is done by finding the difference between the first and last columns that contain a foreground pixel for one dimension, and the first and last rows for the other. The longer dimension is considered the length and the other the width. The feature is computed by dividing the width by the length.

3.9 Results

The following pages contain the results of ten-fold cross validations on various feature combinations. These results are referenced throughout the chapter to help clarify points. They are all against the nine-class test set described in Appendix B, with 1200 images per class. Each row shows how the plankton class in that row was classified. For example, in Table 3-17 (page 40), of the 1200 images in the Trich class, 34 were classified as Chaetognath, 182 as Cnidaria Smallbell Longarms, 110 as Copepod Oithona, and so on. In a perfect world only the diagonal would be populated.
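
As a small illustration of how these tables are read, the sketch below recomputes the per-class and overall accuracy from the counts of Table 3-17. The counts are copied from that table; the function names are hypothetical.

```python
# Counts from Table 3-17 (size feature only). Each row is the true class,
# each column the class assigned by the classifier.
TABLE_3_17 = [
    [109, 166, 135, 214, 57, 0, 25, 2, 492],   # Chaetognath
    [0, 221, 357, 424, 65, 0, 54, 0, 79],      # Cnidaria Smallbell Longarms
    [0, 74, 880, 136, 24, 0, 84, 0, 2],        # Copepod Oithona
    [0, 299, 274, 440, 89, 0, 58, 0, 40],      # Echino Plutei
    [1, 234, 401, 326, 61, 0, 71, 0, 106],     # Larvacean
    [7, 181, 467, 273, 49, 0, 49, 0, 174],     # Marine Snow Dark
    [0, 164, 632, 232, 33, 0, 88, 0, 51],      # Marine Snow Light
    [15, 195, 311, 252, 43, 0, 54, 2, 328],    # Protist
    [34, 182, 110, 206, 45, 0, 34, 0, 589],    # Trich
]


def per_class_accuracy(matrix):
    """Diagonal count divided by the row total, one value per true class."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Sum of the diagonal divided by the number of classified images."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total
```

The diagonal holds the correctly classified images, so the overall accuracy is the diagonal sum (2390) divided by the 10,800 test images, which reproduces the 22.13% reported for this table.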

Table 3-16 Summary of Cross Validation Results

Table | Description | Accuracy | Training Time (sec) | Classification Time (sec)
Table 3-17 | Size Only (0) | 22.13% | 527.6 | 20.5
Table 3-18 | Moments Not Weighted (0-7) | 37.34% | 450.2 | 22.9
Table 3-19 | Moments Weighted (31-38) | 37.82% | 453.5 | 24.6
Table 3-20 | Edge Moment Features (8-15) | 37.69% | 494.3 | 23.5
Table 3-21 | Moments and Weighted Moments Combined (8-15, 31-38) | 53.44% | 408.2 | 28.1
Table 3-22 | Granulometric Features (18-24) | 51.02% | 245.3 | 19.4
Table 3-23 | Fourier Contour Features Using 100 Sample Points, Normalized (73-172) | 67.47% | 837.4 | 75.9
Table 3-24 | Contour Features, 100 Sample Points, Not Normalized (74-172) | 66.38% | 1096.3 | 85.4
Table 3-25 | Contour Utilizing 5 Frequency Domain Features (44-48) | 47.52% | 270.8 | 19.4
Table 3-26 | Contour 16 Hybrid (57-72) | 57.74% | 313.4 | 29.5
Table 3-27 | All Features Using 100 Sample Contour Points Normalized | 86.83% | 645.2 | 81.4
Table 3-28 | All Features Using 5 Average Frequency Domain Features (0-56) | 90.37% | 146.7 | 29.4
Table 3-29 | All Features Using Hybrid Mixed (0-43, 49-72) | 90.11% | 183.5 | 33.9
Table 3-30 | 5 Fourier Texture Features (39-43) | 47.00% | 273.9 | 19.0
Table 3-31 | Intensity Histogram (49-55) | 67.69% | 119.3 | 17.81

Table 3-17 used size alone. Overall it did not do well, but for some classes, notably Copepod Oithona and Trichodesmium, it did well.

Table 3-17 Size Only (0)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 109 | 166 | 135 | 214 | 57 | 0 | 25 | 2 | 492
Longarms | 0 | 221 | 357 | 424 | 65 | 0 | 54 | 0 | 79
Oithona | 0 | 74 | 880 | 136 | 24 | 0 | 84 | 0 | 2
Echino Plutei | 0 | 299 | 274 | 440 | 89 | 0 | 58 | 0 | 40
Larvacean | 1 | 234 | 401 | 326 | 61 | 0 | 71 | 0 | 106
Snow Dark | 7 | 181 | 467 | 273 | 49 | 0 | 49 | 0 | 174
Snow Light | 0 | 164 | 632 | 232 | 33 | 0 | 88 | 0 | 51
Protist | 15 | 195 | 311 | 252 | 43 | 0 | 54 | 2 | 328
Trich | 34 | 182 | 110 | 206 | 45 | 0 | 34 | 0 | 589
Totals | 166 | 1716 | 3567 | 2503 | 466 | 0 | 517 | 4 | 1861

Chaetognath | 9.08% | 13.83% | 11.25% | 17.83% | 4.75% | 0.00% | 2.08% | 0.17% | 41.00%
Longarms | 0.00% | 18.42% | 29.75% | 35.33% | 5.42% | 0.00% | 4.50% | 0.00% | 6.58%
Oithona | 0.00% | 6.17% | 73.33% | 11.33% | 2.00% | 0.00% | 7.00% | 0.00% | 0.17%
Echino Plutei | 0.00% | 24.92% | 22.83% | 36.67% | 7.42% | 0.00% | 4.83% | 0.00% | 3.33%
Larvacean | 0.08% | 19.50% | 33.42% | 27.17% | 5.08% | 0.00% | 5.92% | 0.00% | 8.83%
Snow Dark | 0.58% | 15.08% | 38.92% | 22.75% | 4.08% | 0.00% | 4.08% | 0.00% | 14.50%
Snow Light | 0.00% | 13.67% | 52.67% | 19.33% | 2.75% | 0.00% | 7.33% | 0.00% | 4.25%
Protist | 1.25% | 16.25% | 25.92% | 21.00% | 3.58% | 0.00% | 4.50% | 0.17% | 27.33%
Trich | 2.83% | 15.17% | 9.17% | 17.17% | 3.75% | 0.00% | 2.83% | 0.00% | 49.08%

Accuracy 22.13%

Table 3-18 Moments Not Weighted (0-7)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 684 | 85 | 21 | 82 | 123 | 0 | 55 | 4 | 146
Longarms | 53 | 455 | 229 | 286 | 62 | 0 | 69 | 6 | 40
Oithona | 1 | 110 | 770 | 250 | 26 | 0 | 42 | 0 | 1
Echino Plutei | 1 | 59 | 50 | 977 | 43 | 2 | 5 | 39 | 24
Larvacean | 96 | 82 | 54 | 536 | 296 | 2 | 51 | 19 | 64
Snow Dark | 29 | 42 | 46 | 814 | 69 | 1 | 18 | 65 | 116
Snow Light | 148 | 145 | 140 | 295 | 126 | 0 | 297 | 33 | 16
Protist | 16 | 50 | 21 | 705 | 33 | 4 | 5 | 81 | 285
Trich | 189 | 68 | 30 | 293 | 57 | 2 | 34 | 55 | 472
Totals | 1217 | 1096 | 1361 | 4238 | 835 | 11 | 576 | 302 | 1164

Chaetognath | 57.00% | 7.08% | 1.75% | 6.83% | 10.25% | 0.00% | 4.58% | 0.33% | 12.17%
Longarms | 4.42% | 37.92% | 19.08% | 23.83% | 5.17% | 0.00% | 5.75% | 0.50% | 3.33%
Oithona | 0.08% | 9.17% | 64.17% | 20.83% | 2.17% | 0.00% | 3.50% | 0.00% | 0.08%
Echino Plutei | 0.08% | 4.92% | 4.17% | 81.42% | 3.58% | 0.17% | 0.42% | 3.25% | 2.00%
Larvacean | 8.00% | 6.83% | 4.50% | 44.67% | 24.67% | 0.17% | 4.25% | 1.58% | 5.33%
Snow Dark | 2.42% | 3.50% | 3.83% | 67.83% | 5.75% | 0.08% | 1.50% | 5.42% | 9.67%
Snow Light | 12.33% | 12.08% | 11.67% | 24.58% | 10.50% | 0.00% | 24.75% | 2.75% | 1.33%
Protist | 1.33% | 4.17% | 1.75% | 58.75% | 2.75% | 0.33% | 0.42% | 6.75% | 23.75%
Trich | 15.75% | 5.67% | 2.50% | 24.42% | 4.75% | 0.17% | 2.83% | 4.58% | 39.33%

Accuracy 37.34%

Table 3-19 Moments Weighted (31-38)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 774 | 127 | 10 | 48 | 115 | 14 | 24 | 9 | 79
Longarms | 58 | 539 | 162 | 224 | 115 | 32 | 32 | 27 | 11
Oithona | 0 | 153 | 458 | 529 | 12 | 40 | 2 | 4 | 2
Echino Plutei | 2 | 87 | 105 | 875 | 18 | 82 | 1 | 25 | 5
Larvacean | 91 | 145 | 19 | 293 | 327 | 218 | 5 | 19 | 83
Snow Dark | 23 | 75 | 42 | 433 | 114 | 324 | 18 | 14 | 157
Snow Light | 161 | 252 | 34 | 188 | 179 | 101 | 192 | 50 | 43
Protist | 16 | 129 | 47 | 535 | 108 | 142 | 8 | 38 | 177
Trich | 177 | 28 | 9 | 63 | 122 | 217 | 13 | 13 | 558
Totals | 1302 | 1535 | 886 | 3188 | 1110 | 1170 | 295 | 199 | 1115

Chaetognath | 64.50% | 10.58% | 0.83% | 4.00% | 9.58% | 1.17% | 2.00% | 0.75% | 6.58%
Longarms | 4.83% | 44.92% | 13.50% | 18.67% | 9.58% | 2.67% | 2.67% | 2.25% | 0.92%
Oithona | 0.00% | 12.75% | 38.17% | 44.08% | 1.00% | 3.33% | 0.17% | 0.33% | 0.17%
Echino Plutei | 0.17% | 7.25% | 8.75% | 72.92% | 1.50% | 6.83% | 0.08% | 2.08% | 0.42%
Larvacean | 7.58% | 12.08% | 1.58% | 24.42% | 27.25% | 18.17% | 0.42% | 1.58% | 6.92%
Snow Dark | 1.92% | 6.25% | 3.50% | 36.08% | 9.50% | 27.00% | 1.50% | 1.17% | 13.08%
Snow Light | 13.42% | 21.00% | 2.83% | 15.67% | 14.92% | 8.42% | 16.00% | 4.17% | 3.58%
Protist | 1.33% | 10.75% | 3.92% | 44.58% | 9.00% | 11.83% | 0.67% | 3.17% | 14.75%
Trich | 14.75% | 2.33% | 0.75% | 5.25% | 10.17% | 18.08% | 1.08% | 1.08% | 46.50%

Accuracy 37.82%

Table 3-20 Edge Moment Features (8-15)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 614 | 13 | 73 | 41 | 130 | 48 | 157 | 51 | 73
Longarms | 32 | 402 | 122 | 438 | 72 | 33 | 85 | 1 | 15
Oithona | 4 | 31 | 619 | 267 | 127 | 90 | 62 | 0 | 0
Echino Plutei | 1 | 76 | 27 | 822 | 76 | 155 | 16 | 24 | 3
Larvacean | 99 | 19 | 59 | 119 | 409 | 358 | 105 | 14 | 18
Snow Dark | 45 | 20 | 44 | 154 | 259 | 557 | 26 | 57 | 38
Snow Light | 163 | 63 | 68 | 256 | 279 | 121 | 223 | 7 | 20
Protist | 10 | 91 | 8 | 247 | 196 | 371 | 10 | 173 | 94
Trich | 208 | 142 | 53 | 127 | 145 | 156 | 57 | 60 | 252
Totals | 1176 | 857 | 1073 | 2471 | 1693 | 1889 | 741 | 387 | 513

Chaetognath | 51.17% | 1.08% | 6.08% | 3.42% | 10.83% | 4.00% | 13.08% | 4.25% | 6.08%
Longarms | 2.67% | 33.50% | 10.17% | 36.50% | 6.00% | 2.75% | 7.08% | 0.08% | 1.25%
Oithona | 0.33% | 2.58% | 51.58% | 22.25% | 10.58% | 7.50% | 5.17% | 0.00% | 0.00%
Echino Plutei | 0.08% | 6.33% | 2.25% | 68.50% | 6.33% | 12.92% | 1.33% | 2.00% | 0.25%
Larvacean | 8.25% | 1.58% | 4.92% | 9.92% | 34.08% | 29.83% | 8.75% | 1.17% | 1.50%
Snow Dark | 3.75% | 1.67% | 3.67% | 12.83% | 21.58% | 46.42% | 2.17% | 4.75% | 3.17%
Snow Light | 13.58% | 5.25% | 5.67% | 21.33% | 23.25% | 10.08% | 18.58% | 0.58% | 1.67%
Protist | 0.83% | 7.58% | 0.67% | 20.58% | 16.33% | 30.92% | 0.83% | 14.42% | 7.83%
Trich | 17.33% | 11.83% | 4.42% | 10.58% | 12.08% | 13.00% | 4.75% | 5.00% | 21.00%

Accuracy 37.69%

Table 3-21 Moments and Weighted Moments Combined (8-15, 31-38)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 825 | 3 | 10 | 0 | 125 | 28 | 93 | 54 | 62
Longarms | 19 | 744 | 212 | 113 | 9 | 20 | 46 | 21 | 16
Oithona | 1 | 93 | 851 | 180 | 12 | 37 | 9 | 15 | 2
Echino Plutei | 1 | 133 | 161 | 801 | 1 | 41 | 0 | 57 | 5
Larvacean | 76 | 10 | 8 | 4 | 672 | 333 | 39 | 1 | 57
Snow Dark | 26 | 34 | 65 | 23 | 232 | 610 | 52 | 28 | 130
Snow Light | 160 | 151 | 59 | 26 | 285 | 122 | 304 | 66 | 27
Protist | 0 | 177 | 71 | 149 | 14 | 255 | 8 | 376 | 150
Trich | 171 | 35 | 26 | 7 | 91 | 245 | 29 | 7 | 589
Totals | 1279 | 1380 | 1463 | 1303 | 1441 | 1691 | 580 | 625 | 1038

Chaetognath | 68.75% | 0.25% | 0.83% | 0.00% | 10.42% | 2.33% | 7.75% | 4.50% | 5.17%
Longarms | 1.58% | 62.00% | 17.67% | 9.42% | 0.75% | 1.67% | 3.83% | 1.75% | 1.33%
Oithona | 0.08% | 7.75% | 70.92% | 15.00% | 1.00% | 3.08% | 0.75% | 1.25% | 0.17%
Echino Plutei | 0.08% | 11.08% | 13.42% | 66.75% | 0.08% | 3.42% | 0.00% | 4.75% | 0.42%
Larvacean | 6.33% | 0.83% | 0.67% | 0.33% | 56.00% | 27.75% | 3.25% | 0.08% | 4.75%
Snow Dark | 2.17% | 2.83% | 5.42% | 1.92% | 19.33% | 50.83% | 4.33% | 2.33% | 10.83%
Snow Light | 13.33% | 12.58% | 4.92% | 2.17% | 23.75% | 10.17% | 25.33% | 5.50% | 2.25%
Protist | 0.00% | 14.75% | 5.92% | 12.42% | 1.17% | 21.25% | 0.67% | 31.33% | 12.50%
Trich | 14.25% | 2.92% | 2.17% | 0.58% | 7.58% | 20.42% | 2.42% | 0.58% | 49.08%

Accuracy 53.44%

Table 3-22 Granulometric Features (18-24)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 574 | 7 | 51 | 1 | 264 | 127 | 13 | 11 | 152
Longarms | 14 | 609 | 318 | 100 | 6 | 21 | 109 | 15 | 8
Oithona | 19 | 155 | 909 | 33 | 41 | 6 | 23 | 4 | 10
Echino Plutei | 1 | 61 | 50 | 905 | 1 | 18 | 25 | 72 | 67
Larvacean | 215 | 1 | 37 | 13 | 632 | 92 | 23 | 27 | 160
Snow Dark | 144 | 23 | 33 | 43 | 175 | 336 | 81 | 67 | 298
Snow Light | 25 | 163 | 80 | 50 | 67 | 76 | 677 | 43 | 19
Protist | 17 | 94 | 14 | 195 | 29 | 135 | 109 | 237 | 370
Trich | 79 | 12 | 62 | 46 | 189 | 87 | 10 | 84 | 631
Totals | 1088 | 1125 | 1554 | 1386 | 1404 | 898 | 1070 | 560 | 1715

Chaetognath | 47.83% | 0.58% | 4.25% | 0.08% | 22.00% | 10.58% | 1.08% | 0.92% | 12.67%
Longarms | 1.17% | 50.75% | 26.50% | 8.33% | 0.50% | 1.75% | 9.08% | 1.25% | 0.67%
Oithona | 1.58% | 12.92% | 75.75% | 2.75% | 3.42% | 0.50% | 1.92% | 0.33% | 0.83%
Echino Plutei | 0.08% | 5.08% | 4.17% | 75.42% | 0.08% | 1.50% | 2.08% | 6.00% | 5.58%
Larvacean | 17.92% | 0.08% | 3.08% | 1.08% | 52.67% | 7.67% | 1.92% | 2.25% | 13.33%
Snow Dark | 12.00% | 1.92% | 2.75% | 3.58% | 14.58% | 28.00% | 6.75% | 5.58% | 24.83%
Snow Light | 2.08% | 13.58% | 6.67% | 4.17% | 5.58% | 6.33% | 56.42% | 3.58% | 1.58%
Protist | 1.42% | 7.83% | 1.17% | 16.25% | 2.42% | 11.25% | 9.08% | 19.75% | 30.83%
Trich | 6.58% | 1.00% | 5.17% | 3.83% | 15.75% | 7.25% | 0.83% | 7.00% | 52.58%

Accuracy 51.02%

Table 3-23 Fourier Contour Features Using 100 Sample Points, Normalized (73-172)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 971 | 32 | 13 | 3 | 105 | 15 | 3 | 14 | 44
Longarms | 19 | 934 | 98 | 54 | 8 | 35 | 5 | 40 | 7
Oithona | 25 | 111 | 904 | 27 | 32 | 58 | 10 | 23 | 10
Echino Plutei | 1 | 39 | 25 | 957 | 5 | 17 | 21 | 121 | 14
Larvacean | 67 | 13 | 13 | 8 | 869 | 144 | 14 | 31 | 41
Snow Dark | 35 | 54 | 60 | 37 | 171 | 549 | 38 | 140 | 116
Snow Light | 23 | 22 | 42 | 76 | 41 | 82 | 801 | 64 | 49
Protist | 10 | 39 | 5 | 225 | 16 | 124 | 26 | 709 | 46
Trich | 75 | 42 | 46 | 58 | 131 | 85 | 16 | 154 | 593
Totals | 1226 | 1286 | 1206 | 1445 | 1378 | 1109 | 934 | 1296 | 920

Chaetognath | 80.92% | 2.67% | 1.08% | 0.25% | 8.75% | 1.25% | 0.25% | 1.17% | 3.67%
Longarms | 1.58% | 77.83% | 8.17% | 4.50% | 0.67% | 2.92% | 0.42% | 3.33% | 0.58%
Oithona | 2.08% | 9.25% | 75.33% | 2.25% | 2.67% | 4.83% | 0.83% | 1.92% | 0.83%
Echino Plutei | 0.08% | 3.25% | 2.08% | 79.75% | 0.42% | 1.42% | 1.75% | 10.08% | 1.17%
Larvacean | 5.58% | 1.08% | 1.08% | 0.67% | 72.42% | 12.00% | 1.17% | 2.58% | 3.42%
Snow Dark | 2.92% | 4.50% | 5.00% | 3.08% | 14.25% | 45.75% | 3.17% | 11.67% | 9.67%
Snow Light | 1.92% | 1.83% | 3.50% | 6.33% | 3.42% | 6.83% | 66.75% | 5.33% | 4.08%
Protist | 0.83% | 3.25% | 0.42% | 18.75% | 1.33% | 10.33% | 2.17% | 59.08% | 3.83%
Trich | 6.25% | 3.50% | 3.83% | 4.83% | 10.92% | 7.08% | 1.33% | 12.83% | 49.42%

Accuracy 67.47%

Table 3-24 Contour Features, 100 Sample Points, Not Normalized (74-172)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 984 | 21 | 6 | 2 | 70 | 17 | 3 | 19 | 78
Longarms | 23 | 936 | 94 | 51 | 4 | 23 | 8 | 21 | 40
Oithona | 25 | 166 | 806 | 43 | 32 | 67 | 34 | 11 | 16
Echino Plutei | 4 | 49 | 32 | 908 | 5 | 20 | 20 | 118 | 44
Larvacean | 98 | 11 | 24 | 6 | 799 | 117 | 20 | 13 | 112
Snow Dark | 26 | 62 | 90 | 28 | 143 | 513 | 43 | 129 | 166
Snow Light | 20 | 19 | 53 | 52 | 33 | 61 | 810 | 48 | 104
Protist | 12 | 41 | 4 | 166 | 16 | 86 | 40 | 714 | 121
Trich | 66 | 35 | 33 | 41 | 111 | 70 | 23 | 122 | 699
Totals | 1258 | 1340 | 1142 | 1297 | 1213 | 974 | 1001 | 1195 | 1380

Chaetognath | 82.00% | 1.75% | 0.50% | 0.17% | 5.83% | 1.42% | 0.25% | 1.58% | 6.50%
Longarms | 1.92% | 78.00% | 7.83% | 4.25% | 0.33% | 1.92% | 0.67% | 1.75% | 3.33%
Oithona | 2.08% | 13.83% | 67.17% | 3.58% | 2.67% | 5.58% | 2.83% | 0.92% | 1.33%
Echino Plutei | 0.33% | 4.08% | 2.67% | 75.67% | 0.42% | 1.67% | 1.67% | 9.83% | 3.67%
Larvacean | 8.17% | 0.92% | 2.00% | 0.50% | 66.58% | 9.75% | 1.67% | 1.08% | 9.33%
Snow Dark | 2.17% | 5.17% | 7.50% | 2.33% | 11.92% | 42.75% | 3.58% | 10.75% | 13.83%
Snow Light | 1.67% | 1.58% | 4.42% | 4.33% | 2.75% | 5.08% | 67.50% | 4.00% | 8.67%
Protist | 1.00% | 3.42% | 0.33% | 13.83% | 1.33% | 7.17% | 3.33% | 59.50% | 10.08%
Trich | 5.50% | 2.92% | 2.75% | 3.42% | 9.25% | 5.83% | 1.92% | 10.17% | 58.25%

Accuracy 66.38%

Table 3-25 Contour Utilizing 5 Frequency Domain Features (44-48)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 792 | 51 | 94 | 5 | 185 | 15 | 26 | 6 | 26
Longarms | 53 | 651 | 255 | 87 | 23 | 41 | 38 | 9 | 43
Oithona | 24 | 125 | 652 | 50 | 107 | 91 | 130 | 14 | 7
Echino Plutei | 1 | 60 | 60 | 778 | 10 | 47 | 45 | 162 | 37
Larvacean | 159 | 26 | 112 | 4 | 729 | 132 | 26 | 10 | 2
Snow Dark | 13 | 23 | 165 | 84 | 161 | 463 | 107 | 153 | 31
Snow Light | 40 | 66 | 227 | 108 | 83 | 102 | 406 | 125 | 43
Protist | 8 | 32 | 31 | 323 | 41 | 251 | 57 | 424 | 33
Trich | 137 | 164 | 89 | 103 | 92 | 116 | 103 | 159 | 237
Totals | 1227 | 1198 | 1685 | 1542 | 1431 | 1258 | 938 | 1062 | 459

Chaetognath | 66.00% | 4.25% | 7.83% | 0.42% | 15.42% | 1.25% | 2.17% | 0.50% | 2.17%
Longarms | 4.42% | 54.25% | 21.25% | 7.25% | 1.92% | 3.42% | 3.17% | 0.75% | 3.58%
Oithona | 2.00% | 10.42% | 54.33% | 4.17% | 8.92% | 7.58% | 10.83% | 1.17% | 0.58%
Echino Plutei | 0.08% | 5.00% | 5.00% | 64.83% | 0.83% | 3.92% | 3.75% | 13.50% | 3.08%
Larvacean | 13.25% | 2.17% | 9.33% | 0.33% | 60.75% | 11.00% | 2.17% | 0.83% | 0.17%
Snow Dark | 1.08% | 1.92% | 13.75% | 7.00% | 13.42% | 38.58% | 8.92% | 12.75% | 2.58%
Snow Light | 3.33% | 5.50% | 18.92% | 9.00% | 6.92% | 8.50% | 33.83% | 10.42% | 3.58%
Protist | 0.67% | 2.67% | 2.58% | 26.92% | 3.42% | 20.92% | 4.75% | 35.33% | 2.75%
Trich | 11.42% | 13.67% | 7.42% | 8.58% | 7.67% | 9.67% | 8.58% | 13.25% | 19.75%

Accuracy 47.52%

Table 3-26 Contour 16 Hybrid (57-72)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 848 | 17 | 26 | 3 | 156 | 49 | 19 | 35 | 47
Longarms | 31 | 777 | 153 | 71 | 18 | 65 | 44 | 24 | 17
Oithona | 13 | 98 | 790 | 22 | 71 | 101 | 100 | 2 | 3
Echino Plutei | 2 | 66 | 36 | 864 | 6 | 40 | 34 | 147 | 5
Larvacean | 144 | 21 | 22 | 4 | 801 | 155 | 37 | 8 | 8
Snow Dark | 23 | 35 | 60 | 35 | 185 | 603 | 138 | 85 | 36
Snow Light | 74 | 55 | 76 | 79 | 77 | 158 | 586 | 57 | 38
Protist | 3 | 48 | 16 | 226 | 60 | 170 | 37 | 606 | 34
Trich | 176 | 96 | 58 | 35 | 101 | 118 | 69 | 186 | 361
Totals | 1314 | 1213 | 1237 | 1339 | 1475 | 1459 | 1064 | 1150 | 549

Chaetognath | 70.67% | 1.42% | 2.17% | 0.25% | 13.00% | 4.08% | 1.58% | 2.92% | 3.92%
Longarms | 2.58% | 64.75% | 12.75% | 5.92% | 1.50% | 5.42% | 3.67% | 2.00% | 1.42%
Oithona | 1.08% | 8.17% | 65.83% | 1.83% | 5.92% | 8.42% | 8.33% | 0.17% | 0.25%
Echino Plutei | 0.17% | 5.50% | 3.00% | 72.00% | 0.50% | 3.33% | 2.83% | 12.25% | 0.42%
Larvacean | 12.00% | 1.75% | 1.83% | 0.33% | 66.75% | 12.92% | 3.08% | 0.67% | 0.67%
Snow Dark | 1.92% | 2.92% | 5.00% | 2.92% | 15.42% | 50.25% | 11.50% | 7.08% | 3.00%
Snow Light | 6.17% | 4.58% | 6.33% | 6.58% | 6.42% | 13.17% | 48.83% | 4.75% | 3.17%
Protist | 0.25% | 4.00% | 1.33% | 18.83% | 5.00% | 14.17% | 3.08% | 50.50% | 2.83%
Trich | 14.67% | 8.00% | 4.83% | 2.92% | 8.42% | 9.83% | 5.75% | 15.50% | 30.08%

Accuracy 57.74%

Table 3-27 All Features Using 100 Sample Contour Points Normalized

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 1127 | 2 | 0 | 0 | 23 | 19 | 15 | 0 | 14
Longarms | 6 | 1073 | 16 | 11 | 0 | 27 | 51 | 12 | 4
Oithona | 1 | 12 | 1128 | 3 | 7 | 13 | 30 | 0 | 6
Echino Plutei | 0 | 26 | 18 | 1053 | 1 | 4 | 34 | 62 | 2
Larvacean | 13 | 1 | 9 | 0 | 1018 | 80 | 29 | 0 | 50
Snow Dark | 23 | 32 | 11 | 9 | 49 | 880 | 107 | 27 | 62
Snow Light | 0 | 8 | 2 | 3 | 0 | 38 | 1132 | 17 | 0
Protist | 1 | 23 | 0 | 124 | 0 | 31 | 50 | 964 | 7
Trich | 20 | 26 | 17 | 2 | 58 | 37 | 31 | 6 | 1003
Totals | 1191 | 1203 | 1201 | 1205 | 1156 | 1129 | 1479 | 1088 | 1148

Chaetognath | 93.92% | 0.17% | 0.00% | 0.00% | 1.92% | 1.58% | 1.25% | 0.00% | 1.17%
Longarms | 0.50% | 89.42% | 1.33% | 0.92% | 0.00% | 2.25% | 4.25% | 1.00% | 0.33%
Oithona | 0.08% | 1.00% | 94.00% | 0.25% | 0.58% | 1.08% | 2.50% | 0.00% | 0.50%
Echino Plutei | 0.00% | 2.17% | 1.50% | 87.75% | 0.08% | 0.33% | 2.83% | 5.17% | 0.17%
Larvacean | 1.08% | 0.08% | 0.75% | 0.00% | 84.83% | 6.67% | 2.42% | 0.00% | 4.17%
Snow Dark | 1.92% | 2.67% | 0.92% | 0.75% | 4.08% | 73.33% | 8.92% | 2.25% | 5.17%
Snow Light | 0.00% | 0.67% | 0.17% | 0.25% | 0.00% | 3.17% | 94.33% | 1.42% | 0.00%
Protist | 0.08% | 1.92% | 0.00% | 10.33% | 0.00% | 2.58% | 4.17% | 80.33% | 0.58%
Trich | 1.67% | 2.17% | 1.42% | 0.17% | 4.83% | 3.08% | 2.58% | 0.50% | 83.58%

Accuracy 86.83%

Table 3-28 All Features Using 5 Average Frequency Domain Features (0-56)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 1146 | 1 | 0 | 0 | 20 | 14 | 8 | 0 | 11
Longarms | 6 | 1125 | 11 | 11 | 0 | 16 | 17 | 9 | 5
Oithona | 0 | 12 | 1165 | 6 | 4 | 5 | 0 | 0 | 8
Echino Plutei | 0 | 30 | 14 | 1086 | 1 | 3 | 5 | 57 | 4
Larvacean | 25 | 2 | 3 | 0 | 1090 | 43 | 2 | 0 | 35
Snow Dark | 32 | 31 | 12 | 9 | 65 | 885 | 78 | 23 | 65
Snow Light | 0 | 4 | 0 | 1 | 0 | 29 | 1154 | 12 | 0
Protist | 1 | 19 | 2 | 102 | 0 | 8 | 14 | 1050 | 4
Trich | 27 | 24 | 11 | 1 | 53 | 22 | 2 | 1 | 1059
Totals | 1237 | 1248 | 1218 | 1216 | 1233 | 1025 | 1280 | 1152 | 1191

Chaetognath | 95.50% | 0.08% | 0.00% | 0.00% | 1.67% | 1.17% | 0.67% | 0.00% | 0.92%
Longarms | 0.50% | 93.75% | 0.92% | 0.92% | 0.00% | 1.33% | 1.42% | 0.75% | 0.42%
Oithona | 0.00% | 1.00% | 97.08% | 0.50% | 0.33% | 0.42% | 0.00% | 0.00% | 0.67%
Echino Plutei | 0.00% | 2.50% | 1.17% | 90.50% | 0.08% | 0.25% | 0.42% | 4.75% | 0.33%
Larvacean | 2.08% | 0.17% | 0.25% | 0.00% | 90.83% | 3.58% | 0.17% | 0.00% | 2.92%
Snow Dark | 2.67% | 2.58% | 1.00% | 0.75% | 5.42% | 73.75% | 6.50% | 1.92% | 5.42%
Snow Light | 0.00% | 0.33% | 0.00% | 0.08% | 0.00% | 2.42% | 96.17% | 1.00% | 0.00%
Protist | 0.08% | 1.58% | 0.17% | 8.50% | 0.00% | 0.67% | 1.17% | 87.50% | 0.33%
Trich | 2.25% | 2.00% | 0.92% | 0.08% | 4.42% | 1.83% | 0.17% | 0.08% | 88.25%

Accuracy 90.37%

Table 3-29 All Features Using Hybrid, Mixed (0-43, 49-72)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 1149 | 2 | 0 | 0 | 17 | 17 | 2 | 0 | 13
Longarms | 4 | 1132 | 9 | 9 | 0 | 18 | 17 | 9 | 2
Oithona | 0 | 9 | 1170 | 5 | 5 | 5 | 0 | 0 | 6
Echino Plutei | 1 | 28 | 14 | 1084 | 1 | 2 | 6 | 60 | 4
Larvacean | 30 | 2 | 4 | 0 | 1073 | 48 | 2 | 0 | 41
Snow Dark | 24 | 30 | 14 | 8 | 69 | 887 | 78 | 26 | 64
Snow Light | 12 | 2 | 1 | 1 | 0 | 30 | 1145 | 9 | 0
Protist | 11 | 21 | 1 | 97 | 0 | 6 | 16 | 1044 | 4
Trich | 36 | 24 | 9 | 2 | 56 | 22 | 1 | 2 | 1048
Totals | 1267 | 1250 | 1222 | 1206 | 1221 | 1035 | 1267 | 1150 | 1182

Chaetognath | 95.75% | 0.17% | 0.00% | 0.00% | 1.42% | 1.42% | 0.17% | 0.00% | 1.08%
Longarms | 0.33% | 94.33% | 0.75% | 0.75% | 0.00% | 1.50% | 1.42% | 0.75% | 0.17%
Oithona | 0.00% | 0.75% | 97.50% | 0.42% | 0.42% | 0.42% | 0.00% | 0.00% | 0.50%
Echino Plutei | 0.08% | 2.33% | 1.17% | 90.33% | 0.08% | 0.17% | 0.50% | 5.00% | 0.33%
Larvacean | 2.50% | 0.17% | 0.33% | 0.00% | 89.42% | 4.00% | 0.17% | 0.00% | 3.42%
Snow Dark | 2.00% | 2.50% | 1.17% | 0.67% | 5.75% | 73.92% | 6.50% | 2.17% | 5.33%
Snow Light | 1.00% | 0.17% | 0.08% | 0.08% | 0.00% | 2.50% | 95.42% | 0.75% | 0.00%
Protist | 0.92% | 1.75% | 0.08% | 8.08% | 0.00% | 0.50% | 1.33% | 87.00% | 0.33%
Trich | 3.00% | 2.00% | 0.75% | 0.17% | 4.67% | 1.83% | 0.08% | 0.17% | 87.33%

Accuracy 90.11%

Table 3-30 5 Fourier Texture Features (39-43)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 639 | 132 | 181 | 40 | 50 | 9 | 15 | 45 | 89
Longarms | 99 | 648 | 180 | 52 | 32 | 15 | 87 | 66 | 21
Oithona | 16 | 188 | 699 | 67 | 141 | 28 | 45 | 14 | 2
Echino Plutei | 3 | 29 | 53 | 693 | 57 | 5 | 109 | 222 | 29
Larvacean | 37 | 51 | 300 | 82 | 523 | 44 | 27 | 16 | 120
Snow Dark | 92 | 152 | 291 | 111 | 248 | 73 | 77 | 42 | 114
Snow Light | 16 | 27 | 38 | 89 | 3 | 1 | 930 | 94 | 2
Protist | 25 | 107 | 30 | 346 | 18 | 1 | 243 | 368 | 62
Trich | 167 | 75 | 77 | 148 | 129 | 14 | 14 | 73 | 503
Totals | 1094 | 1409 | 1849 | 1628 | 1201 | 190 | 1547 | 940 | 942

Chaetognath | 53.25% | 11.00% | 15.08% | 3.33% | 4.17% | 0.75% | 1.25% | 3.75% | 7.42%
Longarms | 8.25% | 54.00% | 15.00% | 4.33% | 2.67% | 1.25% | 7.25% | 5.50% | 1.75%
Oithona | 1.33% | 15.67% | 58.25% | 5.58% | 11.75% | 2.33% | 3.75% | 1.17% | 0.17%
Echino Plutei | 0.25% | 2.42% | 4.42% | 57.75% | 4.75% | 0.42% | 9.08% | 18.50% | 2.42%
Larvacean | 3.08% | 4.25% | 25.00% | 6.83% | 43.58% | 3.67% | 2.25% | 1.33% | 10.00%
Snow Dark | 7.67% | 12.67% | 24.25% | 9.25% | 20.67% | 6.08% | 6.42% | 3.50% | 9.50%
Snow Light | 1.33% | 2.25% | 3.17% | 7.42% | 0.25% | 0.08% | 77.50% | 7.83% | 0.17%
Protist | 2.08% | 8.92% | 2.50% | 28.83% | 1.50% | 0.08% | 20.25% | 30.67% | 5.17%
Trich | 13.92% | 6.25% | 6.42% | 12.33% | 10.75% | 1.17% | 1.17% | 6.08% | 41.92%

Accuracy 47.00%

Table 3-31 Intensity Histogram (49-55)

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 725 | 86 | 6 | 0 | 249 | 42 | 0 | 3 | 89
Longarms | 39 | 923 | 57 | 45 | 7 | 22 | 45 | 62 | 0
Oithona | 0 | 71 | 1086 | 12 | 8 | 1 | 0 | 8 | 14
Echino Plutei | 0 | 27 | 42 | 1090 | 2 | 2 | 6 | 28 | 3
Larvacean | 111 | 25 | 16 | 0 | 786 | 30 | 0 | 12 | 220
Snow Dark | 243 | 256 | 91 | 24 | 203 | 129 | 67 | 9 | 178
Snow Light | 1 | 35 | 0 | 4 | 0 | 0 | 1147 | 13 | 0
Protist | 61 | 255 | 45 | 348 | 15 | 9 | 48 | 394 | 25
Trich | 22 | 21 | 38 | 0 | 60 | 23 | 0 | 6 | 1030
Totals | 1202 | 1699 | 1381 | 1523 | 1330 | 258 | 1313 | 535 | 1559

Chaetognath | 60.42% | 7.17% | 0.50% | 0.00% | 20.75% | 3.50% | 0.00% | 0.25% | 7.42%
Longarms | 3.25% | 76.92% | 4.75% | 3.75% | 0.58% | 1.83% | 3.75% | 5.17% | 0.00%
Oithona | 0.00% | 5.92% | 90.50% | 1.00% | 0.67% | 0.08% | 0.00% | 0.67% | 1.17%
Echino Plutei | 0.00% | 2.25% | 3.50% | 90.83% | 0.17% | 0.17% | 0.50% | 2.33% | 0.25%
Larvacean | 9.25% | 2.08% | 1.33% | 0.00% | 65.50% | 2.50% | 0.00% | 1.00% | 18.33%
Snow Dark | 20.25% | 21.33% | 7.58% | 2.00% | 16.92% | 10.75% | 5.58% | 0.75% | 14.83%
Snow Light | 0.08% | 2.92% | 0.00% | 0.33% | 0.00% | 0.00% | 95.58% | 1.08% | 0.00%
Protist | 5.08% | 21.25% | 3.75% | 29.00% | 1.25% | 0.75% | 4.00% | 32.83% | 2.08%
Trich | 1.83% | 1.75% | 3.17% | 0.00% | 5.00% | 1.92% | 0.00% | 0.50% | 85.83%

Accuracy 67.69%

Table 3-32 Intensity Histogram (49-50, 52-55) Less 3rd One

Class Names | Chaeto | Cnidaria Smallbell Longarms | Copepod Oithona | Echino Plutei | Larvacean | Marine Snow Dark | Marine Snow Light | Protist | Trich
Chaetognath | 695 | 106 | 8 | 0 | 250 | 42 | 0 | 3 | 96
Longarms | 31 | 962 | 56 | 63 | 3 | 17 | 55 | 13 | 0
Oithona | 0 | 83 | 1078 | 12 | 2 | 3 | 0 | 8 | 14
Echino Plutei | 0 | 30 | 44 | 1099 | 1 | 2 | 8 | 13 | 3
Larvacean | 117 | 33 | 36 | 1 | 715 | 34 | 1 | 14 | 249
Snow Dark | 226 | 265 | 95 | 27 | 198 | 134 | 72 | 5 | 178
Snow Light | 1 | 36 | 0 | 6 | 0 | 0 | 1154 | 3 | 0
Protist | 65 | 315 | 46 | 428 | 14 | 6 | 79 | 227 | 20
Trich | 22 | 22 | 40 | 0 | 49 | 27 | 0 | 5 | 1035
Totals | 1157 | 1852 | 1403 | 1636 | 1232 | 265 | 1369 | 291 | 1595

Chaetognath | 57.92% | 8.83% | 0.67% | 0.00% | 20.83% | 3.50% | 0.00% | 0.25% | 8.00%
Longarms | 2.58% | 80.17% | 4.67% | 5.25% | 0.25% | 1.42% | 4.58% | 1.08% | 0.00%
Oithona | 0.00% | 6.92% | 89.83% | 1.00% | 0.17% | 0.25% | 0.00% | 0.67% | 1.17%
Echino Plutei | 0.00% | 2.50% | 3.67% | 91.58% | 0.08% | 0.17% | 0.67% | 1.08% | 0.25%
Larvacean | 9.75% | 2.75% | 3.00% | 0.08% | 59.58% | 2.83% | 0.08% | 1.17% | 20.75%
Snow Dark | 18.83% | 22.08% | 7.92% | 2.25% | 16.50% | 11.17% | 6.00% | 0.42% | 14.83%
Snow Light | 0.08% | 3.00% | 0.00% | 0.50% | 0.00% | 0.00% | 96.17% | 0.25% | 0.00%
Protist | 5.42% | 26.25% | 3.83% | 35.67% | 1.17% | 0.50% | 6.58% | 18.92% | 1.67%
Trich | 1.83% | 1.83% | 3.33% | 0.00% | 4.08% | 2.25% | 0.00% | 0.42% | 86.25%

Accuracy 65.73%

CHAPTER 4
FEATURE SELECTION

4.1 Feature Selection Background

A total of 57 features are extracted from plankton images, but using all of them does not necessarily produce optimal classification accuracy; a subset of these features may produce better accuracy. Using fewer features has other advantages: (1) computation time is reduced for both training and classification (Figure 4-6, page 68), and (2) features that are determined never to be of use can be eliminated completely, reducing the computation necessary for feature extraction.

The user works with several different training models consisting of different combinations of classes. For each of these models a different combination of features may produce the best classification accuracy. For this reason the user needs the ability to run feature selection on their own, which adds another constraint that must be addressed: the time it takes to perform the feature selection process.

The test sets used in this section consist of nine classes with 300 images per class. Two feature selection experiments were performed in this section, exploring 7253 and 10035 feature combinations. Each took a little over 2 days on a single 2.8 GHz PC. Several methods are explored for determining an appropriate feature set from the results of the feature selection: beam search, next best case in, and next best case out.
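
To put the time constraint in perspective, the short calculation below compares the number of possible feature subsets with the number actually explored. It is only a back-of-the-envelope sketch: it rounds "a little over 2 days" down to exactly 2 days and assumes every evaluation costs the same, neither of which is stated in the text.

```python
FEATURES = 57
EVALUATED = 7253                  # combinations explored in the first experiment
SECONDS = 2 * 24 * 3600           # "a little over 2 days", rounded down to 2 days

subsets = 2 ** FEATURES - 1                   # all non-empty feature subsets
seconds_per_subset = SECONDS / EVALUATED      # average cost of one evaluation
years_exhaustive = subsets * seconds_per_subset / (365 * 24 * 3600)
```

With roughly 1.4 x 10^17 possible subsets, an exhaustive search is out of reach at this rate, which is why a guided search over a few thousand combinations is used instead.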

The feature search algorithm used was modeled on the wrapper approach by Ron Kohavi [12]. Whereas other feature selection techniques are unaware of the bias of the learning algorithm, WRAPPER treats the learning algorithm as a black box. The idea is to wrap the learning algorithm in a box with inputs and outputs. The inputs are a given set of features and the training data; the output is the classification accuracy that results from a 5-fold cross validation. WRAPPER then acts as a search algorithm, treating the different possible combinations of features as nodes in a tree. The accuracy returned from the learning algorithm drives the direction of the search.

The feature selection space is a tree structure in which a single node represents a set of features. The children of any given node are all feature sets that differ from it by one feature. What a child looks like depends on whether the search is Best First Back (BFB) or Best First Forward (BFF): with BFB a child node has all of the parent's features less one, while with BFF a child node has all of the parent's features plus one additional feature.

4.2 Detailed Description of the Algorithm

The feature selection space is like a tree with parents and children. A single node represents a specific set of selected features. The children of any given node are derived by adding or subtracting one or more features. There are two directions in which the search can go: 'best first back' (BFB) or 'best first forward' (BFF). With BFB the search goes from all features selected toward fewer features selected (see Figure 4-1), while BFF starts from all pair-wise feature combinations and moves toward all features selected.
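
The black-box evaluation described in Section 4.1 can be sketched as a function that takes a feature subset and returns a cross-validation accuracy. This is an illustration only: the nearest-centroid learner is a stand-in chosen because it needs no external library, not the SVM used in this work, and the names `wrapper_accuracy` and `nearest_centroid_fit_predict` are hypothetical.

```python
import random


def nearest_centroid_fit_predict(train_x, train_y, test_x):
    """Stand-in learner (NOT the thesis's SVM): assign the nearest class centroid."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

    return [predict(x) for x in test_x]


def wrapper_accuracy(features, data, labels,
                     learner=nearest_centroid_fit_predict, folds=5, seed=0):
    """Score one node of the search: k-fold accuracy using only `features`."""
    cols = sorted(features)
    rows = [[row[c] for c in cols] for row in data]   # keep the selected columns
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    correct = 0
    for k in range(folds):
        test_idx = order[k::folds]                    # every folds-th example
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        predicted = learner([rows[i] for i in train_idx],
                            [labels[i] for i in train_idx],
                            [rows[i] for i in test_idx])
        correct += sum(p == labels[i] for p, i in zip(predicted, test_idx))
    return correct / len(rows)
```

The search itself only ever sees the returned accuracy, so any learner with the same fit-and-predict shape could be swapped in without changing the search code.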

In practice neither BFB nor BFF goes all the way from all features to one feature, or from one feature to all features; that is, BFB never gets to the point where the next best case to try has only one feature selected. Given enough time, something less than infinity but longer than the average graduate program term, it could be possible, but it is not practical. To deal with this situation another termination condition is specified. In these experiments, if the number of expansions processed without finding a new highest accuracy exceeds a given threshold, the next best case part of the search is terminated. To finish the search, that is, to drive the number of selected features down to 1 (in the case of BFB), a 5-wide beam search [17] is started using the 5 best feature combinations found so far as the starting point.

Figure 4-1 and Table 4-1 illustrate a simple example of a BFB (Best First Back) search. There are three expansions and 7 nodes evaluated. Table 4-1 provides a narrative of each action that is performed during the search. As each node is evaluated, the results are stored: five-fold accuracy, processing time, selected features, the feature number that was removed, and the change in accuracy from the parent node. After the search terminates the results are used to select features.

Figure 4-1 Example of Best First Next in (BFB)
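
The beam search that finishes a BFB search can be sketched as follows. This is a minimal illustration under stated assumptions: `evaluate` is any function that maps a feature set to an accuracy, each level is generated by removing one feature from every node in the beam, and the function name `beam_search_backward` is hypothetical.

```python
def beam_search_backward(start_nodes, evaluate, width=5):
    """Drive the feature count down to 1, keeping the `width` best nodes per level.

    Returns a dict mapping every evaluated feature set to its accuracy.
    """
    beam = [frozenset(n) for n in start_nodes]
    results = {n: evaluate(n) for n in beam}
    while any(len(n) > 1 for n in beam):
        children = set()
        for node in beam:
            if len(node) > 1:
                for feature in node:
                    children.add(node - {feature})   # drop one feature
        for child in children:
            if child not in results:                 # never evaluate a set twice
                results[child] = evaluate(child)
        # Keep only the best `width` children for the next level.
        beam = sorted(children, key=lambda n: results[n], reverse=True)[:width]
    return results


# Toy scoring function: features 1 and 2 are useful, every feature has a small cost.
toy_score = lambda n: len(n & {1, 2}) - 0.1 * len(n)
toy_results = beam_search_backward([{0, 1, 2, 3}], toy_score)
toy_best = max(toy_results, key=toy_results.get)
```

Because only the best few nodes survive each level, the number of evaluations grows roughly linearly with the number of features removed rather than exponentially.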

Table 4-1 Listing of Expansions that Occur in Figure 4-1 with Explanations

Order | Selected Features | Accuracy | Comments
0 | 1,2,3 | 82.5% | The first node by default is the one with all features selected.
1 | 2,3 | 84.1% | Of the sub-nodes of '1,2,3', '2,3' has the best accuracy.
2 | 1,2 | 83.9% | Since '2,3' did not produce any children with higher accuracy, '1,2' becomes the next best node to expand. Note that the expansion only produces one child, since the feature set '2' has already been evaluated.
3 | 3 | 83.5% | This would be the node with the highest accuracy, but since it consists of only one feature, an expansion of this node would produce an empty set, which would then terminate the search. Note that in practice this situation has never actually occurred; usually the number of expansions exceeds a threshold and the search turns into a beam search.

Table 4-2 gives a list of variables and functions that are used in the algorithm described in Figure 4-3. Expand(N) in Table 4-2 is actually an operation that is performed on a single node. It is performed on the next best node that has not been expanded yet. Figure 4-3 gives a detailed description of what Expand(N) does. Figure 4-2 gives a detailed description of the feature search algorithm implemented.

Table 4-2 Feature Search Variables

Symbol | Description
F | Represents an individual node that consists of a set of features.
F_x | Specifies feature x. Example: F = {F_0, F_4, F_15, F_46} is the set of features with members feature 0, 4, 15, and 46.
F_all | A single node that includes all features {F_0, F_1, F_2, ..., F_56}.
O | Set of all open nodes, that is, nodes that have not been evaluated yet.
C | Set of all completed jobs that have not been expanded.
High(C) | Returns the node with the highest accuracy.
Expand(N) | Expands node N using the appropriate algorithm, Best First Out or Best First In, returning the set of children nodes.
Random(C) | Selects one node at random from the set of nodes in C.

Figure 4-2 Feature Search Algorithm

    HighestAccuracy ← 0
    NumberOfExpansions ← 0
    ExpansionsSinceNewHigh ← 0
    C ← ∅    // Completed jobs not yet expanded.
    if (Best First In) then
        O ← F_all
    else
        O ← {F_0, F_1, F_2, ..., F_56}
    while (ExpansionsSinceNewHigh < 150)
        for (F in O)    // Process each node in Open List set.
            Evaluate(F)    // Perform 5 fold cross validation
            C ← C + F
            O ← O − F
            if (Accuracy(F) > HighestAccuracy)
                HighestAccuracy ← Accuracy(F)
                ExpansionsSinceNewHigh ← 0
        E ← High(C)    // Select next node for expansion
        O ← O ∪ Expand(E)
        C ← C − E
        if ((NumberOfExpansions mod 5) = 0)
            E ← Random(C)
            O ← O ∪ Expand(E)
            C ← C − E
        NumberOfExpansions ← NumberOfExpansions + 1
        ExpansionsSinceNewHigh ← ExpansionsSinceNewHigh + 1
    BeamSearch (Starting with 5 nodes with highest accuracy)
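
The loop of Figure 4-2 can be turned into a small runnable sketch. Several details below are assumptions rather than part of the figure: nodes are represented as frozensets of feature numbers, `evaluate` stands for the 5-fold cross validation, the loop also stops when no nodes remain (which only happens on toy problems), the final beam search is omitted, and the two expansion functions follow the description of Expand(N) for Best First In and Best First Out. All function names are hypothetical.

```python
import random


def expand_best_first_in(node, all_features):
    """Best First In: each child drops one feature from `node`."""
    if len(node) <= 1:
        return set()                      # never generate the empty feature set
    return {node - {f} for f in node}


def expand_best_first_out(node, all_features):
    """Best First Out: each child adds one feature missing from `node`."""
    return {node | {f} for f in all_features - node}


def feature_search(all_features, start_nodes, evaluate, expand,
                   patience=150, seed=0):
    """Best-first search; returns {node: accuracy} for every node evaluated."""
    rng = random.Random(seed)
    all_features = frozenset(all_features)
    open_nodes = {frozenset(n) for n in start_nodes}   # O: not yet evaluated
    completed = set()                                  # C: evaluated, not expanded
    results = {}
    highest = 0.0
    expansions = 0
    since_new_high = 0

    def expand_node(node):
        completed.discard(node)
        open_nodes.update(c for c in expand(node, all_features)
                          if c not in results)

    while since_new_high < patience and (open_nodes or completed):
        for node in open_nodes:
            results[node] = evaluate(node)             # e.g. 5-fold cross validation
            completed.add(node)
            if results[node] > highest:
                highest = results[node]
                since_new_high = 0
        open_nodes.clear()
        expand_node(max(completed, key=results.get))   # High(C)
        if expansions % 5 == 0 and completed:
            # Random(C): an occasional random expansion; sorted for repeatability.
            expand_node(rng.choice(sorted(completed, key=sorted)))
        expansions += 1
        since_new_high += 1
    return results
```

A toy run such as `feature_search(range(4), [range(4)], score, expand_best_first_in)` with a cheap `score` function explores all 15 non-empty subsets; on the real 57-feature problem the patience threshold ends the search long before the space is exhausted.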

    Expand Best First In (N)
        E ← ∅
        for each F_i in N
            E ← E ∪ {N − F_i}
        return E

    Expand Best First Out (N)
        E ← ∅
        for each F_i in (F_all − N)
            E ← E ∪ {N + F_i}
        return E

Figure 4-3 Expansion Algorithms for Best First Out/In

4.3 Experiments

Two different feature searches were performed. Table 4-3 (page 66) and Table 4-6 (page 71) in section 4.5 used the 5 averaged contour features (section 3.5.2 on page 29), while Figure 4-10 (page 76) and Figure 4-11 (page 77) in section 4.6 used the 16 hybrid contour features (section 3.5.3 on page 31). Both feature searches used data from the same set of images. The two feature selections were performed to help determine whether the 16 hybrid contour features, which perform far better than the 5 average contour features when run standalone, would also help produce a superior overall accuracy when using all the features. The results showed that the 5 average contour features produce a feature set that is as good as the 16 hybrid features, with fewer features required in the final feature set.

The data used in these two feature selections is the same as that used in the feature calculation results in Chapter 3 (see Appendix B). The data is split into two files, a training set and a test set. The training set consists of 300 examples per class for a total of 2,700 images, and the test set consists of 1200 examples per class for a total of 10,800 examples. The feature search is driven by the training set, while the test set is used to verify a specific set of features.

PAGE 75

63 4.4 Analyzing the Results

There are several criteria that should be considered to determine the optimal set of features: classification accuracy, feature extraction time, training time, and classification time. Of these criteria accuracy is the most important, but given two different feature sets that result in similar accuracy the other factors should be considered; less processing time is always better as long as classification accuracy is not sacrificed. As a rule, fewer features means less processing time for both training and classification. Upon completion of the feature selection process there is a pool of all feature combinations evaluated. From this pool of results an appropriate set of features can be selected.

4.4.1 Selection by Highest Accuracy

1) Sort all results into groups by feature count, where feature count means the number of features used in the node.
2) For each group select the result that has the highest accuracy. (See Table 4-3, page 66)
3) For each group determine test accuracy by using a separate test set. The data used for feature selection will be used as the training data.
4) Select the group that has the highest accuracy. In Table 4-3 this would be the row with feature count = 35.
5) Knowing the highest accuracy, select the group with the smallest feature count that has statistically the same accuracy against the test set. McNemar's test [15] is used for this purpose. This is the row with feature count = 26 in Table 4-3.

Figure 4-4 Determine Optimal Feature Set

4.4.2 Selection by Feature Usage in Best 200

There were two other means for determining optimal features from the feature selection results. One involves determining the frequency of each feature's usage in the top 200 results; see Figure 4-8. The other method looks at the impact on accuracy every time a feature is added to or removed from a feature set. See Table 4-3 and Figure 4-6.
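Steps 4 and 5 of the highest-accuracy procedure can be sketched as follows. This is a hedged illustration rather than the thesis code: the continuity-corrected form of McNemar's statistic and the 3.841 critical value (chi-square, one degree of freedom, 0.05 level) are assumptions, since this section does not state the significance level used. The sample rows are copied from Table 4-3 (feature counts 23 through 35).

```python
CHI2_CRITICAL_95 = 3.841  # assumed threshold: chi-square, 1 d.o.f., p = 0.05

def mcnemar_statistic(only_a_wrong, only_b_wrong):
    """McNemar's statistic with continuity correction.

    only_a_wrong: test examples misclassified by classifier A but not by B.
    only_b_wrong: test examples misclassified by classifier B but not by A.
    """
    n = only_a_wrong + only_b_wrong
    if n == 0:
        return 0.0
    return (abs(only_a_wrong - only_b_wrong) - 1) ** 2 / n

def select_feature_count(rows, critical=CHI2_CRITICAL_95):
    """rows: {feature_count: (test_accuracy, mcnemar_vs_best)}.

    Returns (count with the highest test accuracy,
             smallest count whose statistic is below the critical value).
    """
    best = max(rows, key=lambda count: rows[count][0])
    smallest = min(count for count, (_, stat) in rows.items() if stat < critical)
    return best, smallest

# feature count -> (test accuracy in %, McNemar's statistic), from Table 4-3
TABLE_4_3 = {
    23: (87.54, 15.71), 24: (87.84, 5.53), 25: (87.66, 13.98),
    26: (88.01, 2.33), 27: (88.02, 2.52), 28: (88.12, 0.70),
    29: (88.13, 0.61), 30: (88.21, 0.03), 31: (88.13, 0.82),
    32: (88.22, 0.01), 33: (88.19, 0.54), 34: (88.17, 1.07),
    35: (88.24, 0.00),
}

best_count, smallest_count = select_feature_count(TABLE_4_3)
```

With the assumed threshold, the sketch picks feature count 35 as the most accurate row and 26 as the smallest row that is not significantly different, which agrees with the rows named in steps 4 and 5.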

PAGE 76

64 1) Sort all the results into accuracy order.
2) Select the top 200 results by accuracy.
3) Using these 200 results, build a histogram by frequency of feature occurrence.
4) Select the features with a minimum number of occurrences (see Table 4-4, page 71).

Figure 4-5 Select Feature Set by Frequency in Top 200

Figure 4-8 (page 72) and Figure 4-10 (page 76) show the usage of features in the 200 combinations with the highest accuracies. For example, in Figure 4-8 (page 72) feature number 3 was used in about 160 of the best 200, while feature number 29 was used in all 200. The idea is to get a feel for how useful a given feature is. Table 4-6 shows the results achieved by using the algorithm described in Figure 4-5. The best accuracy is found with a threshold of 90, giving a validation accuracy of 88.12% with 33 features. This compares favorably with the best results found in Table 4-3, which found its best accuracy of 88.07% with 35 features. The entry with 26 features in Table 4-3, which was statistically the same as that with 35 features, had 87.89% accuracy; by comparison, Table 4-6 has an entry using a threshold of 180 that used 24 features with a validation accuracy of 87.85%.
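The four steps of Figure 4-5 can be sketched with a counter over the top-ranked results. The function name, the `(accuracy, feature_subset)` record layout, and the tiny sample data are hypothetical and serve only to illustrate the procedure.

```python
from collections import Counter

def select_by_usage(results, top_n=200, min_occurrences=180):
    """Pick features by how often they occur in the most accurate results.

    results: iterable of (accuracy, feature_subset) pairs.
    Returns (usage histogram, sorted list of selected feature numbers).
    """
    # Steps 1 and 2: sort by accuracy and keep the top_n results.
    top = sorted(results, key=lambda r: r[0], reverse=True)[:top_n]
    # Step 3: histogram of feature occurrence across those results.
    usage = Counter(f for _, subset in top for f in subset)
    # Step 4: keep features that reach the minimum number of occurrences.
    selected = sorted(f for f, n in usage.items() if n >= min_occurrences)
    return usage, selected

# Hypothetical search results: (accuracy, feature subset)
sample = [(0.90, {1, 2}), (0.80, {1, 3}), (0.70, {1, 2, 4}), (0.10, {5})]
usage, selected = select_by_usage(sample, top_n=3, min_occurrences=2)
```

Because each feature is counted independently, this selection ignores interactions between features, which is the limitation the conclusion in section 5.3 points out.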

PAGE 77

65 4.4.3 Selection by Accuracy Impact

Whichever algorithm is used, expanding a node means creating new feature subsets by adding or removing a single feature, depending on whether BFF or BFB is in use, so the resulting change in accuracy can be recorded and attributed to that feature. This was done in an attempt to determine the relevance of each feature, similar to the Relief algorithm described in Kohavi's paper [12]. The relevance of a given feature was determined by calculating the difference in five fold cross validation accuracies between subsets that vary by that given feature. As various feature subsets are tried, the change in accuracy with respect to the parent subset is recorded. Using the accumulated deltas, the mean and variance are calculated for each feature. There are two criteria used here to select features: 1) those that produced the greatest gain in accuracy and 2) those that had the smallest variance.

4.5 Feature Selection Utilizing 5 Frequency Domain Contour Features

This section shows the results from performing feature search using the 5 averaging contour features. These include Table 4-3 through Table 4-6. The following section will show the same results when using the 16 hybrid contour features. This particular feature search explored 7,253 different feature combinations. Table 4-3 on page 66 is a summary of the best accuracies found by feature count. The combination that had the best test accuracy consisted of 35 features. The set with the fewest features that was not statistically significantly different using McNemar's test [15] had 26 features. The McNemar's column in Table 4-3 indicates the result of a McNemar's test between the feature set for a given row (feature count) and the feature set with the best accuracy (feature count = 35 in Table 4-3).

PAGE 78

66 Table 4-3 Best Accuracies Found by Feature Count, 5 Freq. Domain Contour Features

Feature Count   Five Fold Accuracy   Test Accuracy   McNemar's   Training Time   Classification Time
1               35.26%               36.15%          5235        2.41            5.03
2               41.41%               41.60%          4610        1.95            4.81
3               55.30%               55.16%          3147        1.53            4.61
4               63.22%               64.07%          2221        1.36            4.27
5               71.96%               72.82%          1262        0.97            3.69
6               75.74%               76.90%          840.3       1.05            3.61
7               77.70%               78.98%          639.7       0.94            3.50
8               79.04%               78.92%          632.1       0.91            3.83
9               81.00%               81.70%          385.3       0.69            3.74
10              82.04%               82.17%          353.4       0.72            3.98
11              82.44%               83.02%          292.9       0.70            3.99
12              83.78%               84.04%          212.9       0.67            4.14
13              84.48%               84.98%          144.3       0.92            4.33
14              85.22%               85.04%          153.4       0.83            4.78
15              85.52%               85.64%          107.3       0.72            5.05
16              85.70%               86.17%          75.58       0.67            5.06
17              86.26%               86.42%          63.08       0.69            5.27
18              86.67%               86.80%          46.03       0.64            5.41
19              87.15%               87.08%          32.37       0.95            5.78
20              87.26%               87.29%          24.14       0.70            6.03
21              87.26%               87.32%          23.56       0.70            6.44
22              87.22%               87.21%          26.83       0.70            6.67
23              87.48%               87.54%          15.71       0.80            7.33
24              87.59%               87.84%          5.53        0.78            6.95
25              87.52%               87.66%          13.98       0.80            7.03
26              87.89%               88.01%          2.33        0.75            7.25
27              87.93%               88.02%          2.52        0.84            7.36
28              88.00%               88.12%          0.70        0.86            7.59
29              88.15%               88.13%          0.61        0.99            7.70
30              88.15%               88.21%          0.03        0.92            7.77
31              88.15%               88.13%          0.82        0.92            8.09
32              88.11%               88.22%          0.01        0.94            7.97
33              88.15%               88.19%          0.54        0.98            8.19
34              88.11%               88.17%          1.07        0.97            8.19
35              88.07%               88.24%          0.00        0.97            8.33
36              88.04%               88.22%          0.13        1.00            8.47
37              88.04%               88.19%          0.08        1.08            8.62
38              88.04%               88.00%          2.87        1.06            8.69
39              88.00%               88.17%          0.23        1.16            9.08
40              88.04%               88.13%          0.55        1.09            8.92
41              87.96%               87.72%          11.12       1.06            9.08
42              88.00%               87.73%          10.92       1.13            9.17
43              87.96%               87.81%          8.30        1.09            9.28
44              87.96%               87.69%          12.32       1.23            9.41
45              87.93%               87.74%          10.33       1.06            9.55
46              87.85%               87.77%          9.16        1.33            9.72
47              87.82%               87.74%          10.18       1.27            10.33
48              87.78%               87.72%          10.80       1.36            10.03
49              87.78%               87.75%          9.69        1.41            10.14
50              87.78%               87.71%          11.16       1.56            10.38
51              87.67%               87.83%          5.16        1.48            10.36
52              87.56%               87.83%          5.11        1.80            10.44
53              87.33%               88.00%          1.60        1.78            10.67
54              87.26%               87.93%          2.62        1.58            10.64
55              87.11%               88.11%          0.45        1.66            10.80
56              86.85%               87.71%          7.67        1.77            10.84
57              86.59%               88.04%          0.98        1.66            11.05

PAGE 79

67 Figure 4-6 (page 68) is a graph that shows test accuracy, training time, and classification time with respect to feature count. Feature counts 35 and 26 (the highest test accuracy, and the smallest count whose accuracy is not significantly different) are specifically labeled. Figure 4-7 (page 69) is a graph that shows test accuracy and number of support vectors with respect to feature count.

Classification time falls as the feature count falls until the feature count equals 7, at which point classification time climbs significantly. In Figure 4-7 the number of support vectors falls as the feature count falls until the feature count equals 25, at which point it starts to climb; the rate at which it climbs grows significantly when the feature count is less than 7. Classification time is a function of the product of the number of features and the number of support vectors. As the number of features fell, classification time fell, but when the number of support vectors grew, classification time also grew.

In Figure 4-7 accuracy and the number of support vectors appear to be inversely related. As features are eliminated during the feature selection process, the support vector machine finds it easier to derive the decision boundary; this is reflected in Figure 4-6, where training time falls slowly and accuracy increases slightly. After the feature count drops below 26, the number of support vectors increases. At this point the support vector machine has a harder time deriving the decision boundary. Features that were helping have been eliminated, and as a result more support vectors are needed to define the decision boundary.
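The dependence of classification time on the product of feature count and support vector count can be seen in the shape of a kernel decision function. The sketch below assumes an RBF kernel and a single binary decision function; neither detail is stated in this section, and all names are illustrative.

```python
import math

def rbf_decision(x, support_vectors, dual_coefs, bias, gamma):
    """Evaluate sum_i coef_i * exp(-gamma * ||x - sv_i||^2) + bias.

    The outer loop runs once per support vector and the inner sum once per
    feature, so the work is proportional to len(support_vectors) * len(x).
    """
    total = bias
    for sv, coef in zip(support_vectors, dual_coefs):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, sv))
        total += coef * math.exp(-gamma * sq_dist)
    return total

def feature_operations(n_features, n_support_vectors):
    """Number of per-feature distance terms computed for one example."""
    return n_features * n_support_vectors

# Two support vectors in a two-feature space (hypothetical values).
value = rbf_decision([0.0, 0.0], [[0.0, 0.0], [1.0, 0.0]], [1.0, -1.0], 0.0, 1.0)
```

Under this cost model, removing features only shortens classification time while the support vector count does not grow faster than the feature count shrinks, which is consistent with the turning point described above.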

PAGE 80

68 Figure 4-6 Feature Search, Accuracy, Timings by Features Count, Using 5 avg. Freq. Domain Contour Features

PAGE 81

69 Figure 4-7 Feature Selection, Accuracy, Support Vectors vs. Feature Count, Using 5 avg. Freq. Domain Contour Features

PAGE 82

70 Table 4-4 shows how each individual feature performed in the feature search. The 200 feature combinations (nodes) with the highest training accuracy are used to determine useful features. The column labeled “Usage in Top 200” indicates how many times a given feature was used in the 200 best feature sets. The features that are used in all 200 are very useful features, while the ones that do not occur in any are not useful at all. Table 4-6 shows the test accuracy for a set of features that achieved a given frequency of occurrence in the top 200. Figure 4-8 (page 72) is a bar chart visualizing the occurrence in the top 200.

The column labeled “Impact on Accuracy” reflects the average change in accuracy that occurred when the given feature was added to an existing combination. This was tracked while performing the feature selection search. As features were removed from a given feature combination, the change in the training accuracy was recorded for the given feature. A positive value indicates that when that feature was removed the accuracy on average was reduced by that amount. The following column shows the standard deviation of the “Impact on Accuracy” column.
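The bookkeeping behind the “Impact on Accuracy” and standard deviation columns can be sketched as a small accumulator. The class name and the sample accuracies are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev

class ImpactTracker:
    """Accumulates per-feature accuracy deltas observed during the search."""

    def __init__(self):
        self._deltas = defaultdict(list)

    def record(self, feature, accuracy_with, accuracy_without):
        # Positive delta: accuracy dropped when the feature was removed.
        self._deltas[feature].append(accuracy_with - accuracy_without)

    def summary(self):
        """Return {feature: (mean impact, std dev)}; std dev is 0.0 for one sample."""
        out = {}
        for feature, deltas in self._deltas.items():
            spread = stdev(deltas) if len(deltas) > 1 else 0.0
            out[feature] = (mean(deltas), spread)
        return out

tracker = ImpactTracker()
tracker.record(41, 88.0, 86.0)   # hypothetical accuracies, in percent
tracker.record(41, 87.0, 86.0)
tracker.record(47, 87.0, 87.5)
impact = tracker.summary()
```

A feature with a large mean impact and a small spread is the kind of feature the two selection criteria of section 4.4.3 favor.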

PAGE 83

71 Table 4-4 Feature Usage in Best 200, Using 5 Averaging Contour Features

Feature Num   Feature Description   Usage in Top 200   Impact on Accuracy   Std Dev of Impact
0             Size                  10                 0.07%                0.09%
1             Moment1               0                  -0.06%               0.06%
2             Moment2               0                  -0.04%               0.04%
3             Moment3               158                0.15%                0.09%
4             Moment4               3                  0.06%                0.06%
5             Moment5               0                  0.00%                0.03%
6             Moment6               0                  0.00%                0.04%
7             Moment7               0                  -0.01%               0.02%
8             EdgeSize              184                0.19%                0.09%
9             EdgeMoment1           200                0.29%                0.11%
10            EdgeMoment2           0                  -0.01%               0.02%
11            EdgeMoment3           2                  -0.03%               0.03%
12            EdgeMoment4           2                  -0.02%               0.03%
13            EdgeMoment5           0                  -0.01%               0.01%
14            EdgeMoment6           0                  -0.01%               0.02%
15            EdgeMoment7           87                 0.00%                0.02%
16            Trans ConvexHull      118                0.37%                0.25%
17            Trans PixelCount      200                0.28%                0.13%
18            TransOpen3            200                0.48%                0.14%
19            TransOpen5            200                0.70%                0.22%
20            TransOpen7            0                  -0.06%               0.12%
21            TransOpen9            200                0.71%                0.22%
22            TransClose3           198                0.26%                0.12%
23            TransClose5           200                0.65%                0.24%
24            TransClose7           27                 0.11%                0.11%
25            EigenRatio            0                  0.00%                0.00%
26            EigenHead             9                  0.12%                0.20%
27            ConvexArea            0                  0.01%                0.06%
28            Trans Size            119                0.38%                0.25%
29            Trans Wtd.            200                0.94%                1.22%
30            Wtd. Size             88                 0.06%                0.09%
31            WtdMoment0            96                 0.05%                0.07%
32            WtdMoment1            198                0.69%                1.41%
33            WtdMoment2            141                0.08%                0.08%
34            WtdMoment3            126                0.12%                0.10%
35            WtdMoment4            161                0.13%                0.13%
36            WtdMoment5            5                  0.00%                0.05%
37            WtdMoment6            95                 0.02%                0.06%
38            WtdMoment7            0                  -0.08%               0.04%
39            Fourier0              5                  0.02%                0.08%
40            Fourier1              140                0.12%                0.12%
41            Fourier2              200                1.18%                0.18%
42            Fourier3              200                1.26%                0.76%
43            Fourier4              200                0.44%                0.15%
44            Cont Fourier0         200                0.41%                0.11%
45            Cont Fourier1         200                0.51%                0.15%
46            Cont Fourier2         200                0.73%                0.46%
47            Cont Fourier3         0                  -0.22%               0.04%
48            Cont Fourier4         200                0.41%                0.15%
49            IntensityHist1        200                0.43%                0.16%
50            IntensityHist2        200                0.47%                0.20%
51            IntensityHist3        0                  -0.06%               0.10%
52            IntensityHist4        200                0.59%                0.44%
53            IntensityHist5        200                1.03%                0.22%
54            IntensityHist6        200                1.77%                1.72%
55            IntensityHist7        200                0.92%                0.24%
56            Height vs. Width      200                0.71%                0.23%

PAGE 84

72 Figure 4-8 Search, Best 200, Using 5 Freq. Domain

Table 4-5 Feature Descriptions

Feature Num's   Description
0-7             Moments Black/White (Binary)
8-15            Moments Edge
16              Convex Hull Ratio
17-24           Granulometric
25,26           Eigen Head/Ratio
27              Convex Area
28              Convex Area Ratio
29              Weighted Transparency
30              Weighted Size
31-38           Moments Weighted by Intensity
39-43           Texture, Fourier
44-48           Contour, Fourier Averaging
49-55           Intensity Histogram
56              Height/Width

PAGE 85

73 Table 4-6 shows test accuracy for different levels of occurrence in the top 200. Each row reflects an increasing threshold of frequency of occurrence in the best 200. The first row (“Min 200” = 80) includes features that have a “Usage in Top 200” greater than or equal to 80. “Usage in Top 200” does not find a significantly better combination of features than “Highest Accuracy Found”.

Table 4-6 Performance by Occurrences in Top 200

Min 200   Test Accuracy   Training Time   Classification Time   Count   Feature Num's
80        88.07%          0.88            8.28                  35      3,8,9,15,16,17,18,19,21,22,23,28,29,30,31,32,33,34,35,37,40,41,42,43,44,45,46,48,49,50,52,53,54,55,56
90        88.12%          1.00            8.16                  33      3,8,9,16,17,18,19,21,22,23,28,29,31,32,33,34,35,37,40,41,42,43,44,45,46,48,49,50,52,53,54,55,56
100       88.09%          1.00            8.13                  31      3,8,9,16,17,18,19,21,22,23,28,29,32,33,34,35,40,41,42,43,44,45,46,48,49,50,52,53,54,55,56
110       88.09%          0.74            8.17                  31      3,8,9,16,17,18,19,21,22,23,28,29,32,33,34,35,40,41,42,43,44,45,46,48,49,50,52,53,54,55,56
120       87.94%          0.97            7.80                  29      3,8,9,17,18,19,21,22,23,29,32,33,34,35,40,41,42,43,44,45,46,48,49,50,52,53,54,55,56
130       87.84%          0.91            7.74                  28      3,8,9,17,18,19,21,22,23,29,32,33,35,40,41,42,43,44,45,46,48,49,50,52,53,54,55,56
140       87.84%          0.97            7.72                  28      3,8,9,17,18,19,21,22,23,29,32,33,35,40,41,42,43,44,45,46,48,49,50,52,53,54,55,56
150       87.78%          0.84            7.34                  26      3,8,9,17,18,19,21,22,23,29,32,35,41,42,43,44,45,46,48,49,50,52,53,54,55,56
160       87.80%          0.94            7.17                  25      8,9,17,18,19,21,22,23,29,32,35,41,42,43,44,45,46,48,49,50,52,53,54,55,56
170       87.85%          0.81            6.92                  24      8,9,17,18,19,21,22,23,29,32,41,42,43,44,45,46,48,49,50,52,53,54,55,56
180       87.85%          0.84            6.92                  24      8,9,17,18,19,21,22,23,29,32,41,42,43,44,45,46,48,49,50,52,53,54,55,56
190       87.71%          0.80            7.23                  23      9,17,18,19,21,22,23,29,32,41,42,43,44,45,46,48,49,50,52,53,54,55,56
200       87.19%          0.84            6.20                  21      9,17,18,19,21,23,29,41,42,43,44,45,46,48,49,50,52,53,54,55,56
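The “Count” and “Feature Num's” columns of Table 4-6 follow from applying each threshold to the “Usage in Top 200” column of Table 4-4. The sketch below copies that usage column into a list indexed by feature number; the function name is hypothetical.

```python
# "Usage in Top 200" column of Table 4-4, indexed by feature number 0..56.
USAGE_IN_TOP_200 = [
    10, 0, 0, 158, 3, 0, 0, 0,             # 0-7
    184, 200, 0, 2, 2, 0, 0, 87,           # 8-15
    118,                                    # 16
    200, 200, 200, 0, 200, 198, 200, 27,   # 17-24
    0, 9, 0, 119,                           # 25-28
    200, 88,                                # 29-30
    96, 198, 141, 126, 161, 5, 95, 0,      # 31-38
    5, 140, 200, 200, 200,                  # 39-43
    200, 200, 200, 0, 200,                  # 44-48
    200, 200, 0, 200, 200, 200, 200,        # 49-55
    200,                                    # 56
]

def features_at_least(threshold, usage=USAGE_IN_TOP_200):
    """Feature numbers whose usage in the top 200 is >= threshold."""
    return [f for f, n in enumerate(usage) if n >= threshold]

# Feature counts for thresholds 80, 90, ..., 200.
counts_by_threshold = {t: len(features_at_least(t)) for t in range(80, 201, 10)}
```

The resulting counts reproduce the “Count” column of Table 4-6, from 35 features at a threshold of 80 down to 21 features at a threshold of 200.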

PAGE 86

74 Figure 4-9 shows how accuracy impact and usage in the top 200 are highly related. Each column reflects a specific feature. The columns are ordered by each feature's average impact on 5 fold classification accuracy during feature selection. For example, the first column, labeled “-0.22%”, represents feature number 47 (see Table 4-4), which was not used in any of the best 200 feature combinations searched. The column labeled “0.38%” represents feature number 28, which is used in 119 of the best 200 feature combinations searched.

[Bar chart, Usage in top 200 vs Accuracy Impact: x-axis Accuracy Impact (-0.22% to 1.77%), y-axis Usage top 200 (0 to 200)]

Figure 4-9 Accuracy Impact vs. Usage Top 200

PAGE 87

75 4.6 Feature Selection Utilizing 16 Hybrid Contour Features

This feature selection experiment utilizes the 16 hybrid contour features rather than the 5 averaging contour features used in section 4.5. The results of this feature search showed that when the hybrid features are used in conjunction with all the other features, they did not result in better performance than the 5 averaging contour features, even though they perform far better when run standalone. The 16 hybrid features are 57 through 72. For the most part they perform well; but what is interesting is how the higher frequency buckets, that is the ones in the middle of the range, were in all 200 best groups (see Figure 4-10), while the lower frequency buckets were in the less than 120 group. The higher frequency buckets are an average of the magnitude across a frequency range, while the lower frequency buckets represent specific frequency buckets (see Table 3-10). The higher frequency buckets are probably generalizing better than the lower frequency buckets.

PAGE 88

76 Figure 4-10 Feature Search, Best 200, Using 16 Hybrid Contour Features

Table 4-7 Feature Descriptions Including Hybrid Contour

Feature Num's   Description
0-7             Moments Black/White (Binary)
8-15            Moments Edge
16              Convex Hull Ratio
17-24           Granulometric
25,26           Eigen Head/Ratio
27              Convex Area
28              Convex Area Ratio
29              Weighted Transparency
30              Weighted Size
31-38           Moments Weighted by Intensity
39-43           Texture, Fourier
49-55           Intensity Histogram
56              Height/Width
57-72           Contour Hybrid

PAGE 89

77 Figure 4-11 Feature Search, 16 Hybrid Contour, Accuracy by Number of Features

PAGE 90

78 CHAPTER 5 CONCLUSION

5.1 Parameter Tuning

Choosing the correct parameters for the SVM is very important. Both accuracy and processing time can be adversely affected or greatly improved. As shown in Section 2.3, Figure 2-1 and Figure 2-2, good parameters can achieve a higher training accuracy and at the same time reduce the time required to perform a 5 fold cross validation. Conversely, bad parameters lead to poor accuracy and at the same time longer processing times.

An interesting observation that can be made from Figure 2-4 is that there appear to be very few local maxima, and the two that are obvious are very close to each other with just a small difference between them. This is significant in that it might be useful to consider using a hill climbing strategy rather than a grid search in trying to locate optimal parameters. The grid search that produced these charts took over 2 days to run on a 2.8 GHz PC with 1 gigabyte of RAM. This would not be a practical search for users to run in the field as they create new models to meet their current environments.
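The hill climbing strategy suggested above was not implemented in the thesis; the following is only a sketch of what such a search could look like. It assumes two parameters searched on a log2 grid (for example the cost and kernel width of an RBF SVM) and a hypothetical `score` callable that would return the 5 fold cross validation accuracy. The toy scoring function has a single maximum; a real accuracy surface with several local maxima could trap this greedy search.

```python
def hill_climb(score, start, step=1, max_iters=1000):
    """Greedy hill climbing over a two-parameter grid.

    score: hypothetical callable (log2_c, log2_gamma) -> accuracy.
    start: (log2_c, log2_gamma) starting point.
    Returns (best point, best score, number of score evaluations).
    """
    current = start
    current_score = score(*current)
    evaluations = 1
    for _ in range(max_iters):
        neighbours = [(current[0] + dc, current[1] + dg)
                      for dc, dg in ((step, 0), (-step, 0), (0, step), (0, -step))]
        scored = [(score(c, g), (c, g)) for c, g in neighbours]
        evaluations += len(scored)
        best_score, best_point = max(scored)
        if best_score <= current_score:    # no neighbour improves: local maximum
            break
        current, current_score = best_point, best_score
    return current, current_score, evaluations

def toy_score(log2_c, log2_gamma):
    """Stand-in for cross validation accuracy; peaks at (3, -5)."""
    return -((log2_c - 3) ** 2 + (log2_gamma + 5) ** 2)

point, value, evaluations = hill_climb(toy_score, start=(0, 0))
```

On this toy surface the climb reaches the peak after a few dozen evaluations, compared with the 441 evaluations a 21 by 21 grid would need; whether a similar saving holds on real data would have to be tested.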

PAGE 91

79 5.2 Feature Calculation

There were a total of 57 features implemented that are logically grouped into various categories (Table 3-1, page 17). They all had varying degrees of success; some did very well, such as intensity histograms with 67.69% accuracy (Table 3-31, page 54), while others did poorly, such as the binary moments with 37.34% accuracy (Table 3-18, page 41). This does not mean that the features that perform poorly are of no use; when they are used in combination with other features, higher accuracy can be obtained. For example, the two moment features, binary and weighted, both performed poorly separately, with 37.34% accuracy (Table 3-18, page 41) and 37.82% accuracy (Table 3-19, page 42), but when used together achieved an accuracy of 53.44% (Table 3-21, page 44).

In some cases a group of features will perform well in general but a specific member of the group may not. For example, the intensity histogram features, consisting of 7 discrete features, had 67.69% accuracy, but when looking at their performance during feature selection (Table 4-4, page 71), the third one was not included in any of the best 200 feature combinations evaluated. The accuracy of the intensity histogram features without the third feature was 65.73% (Table 3-32, page 15) compared to 67.69% when it was included. This indicates that the third feature does have value, just not in the particular combinations that were searched during feature selection.

PAGE 92

80 There were three types of contour features implemented, which are referred to as averaging, sampling and hybrid, consisting of 5, 100, and 16 features respectively. Independently they achieved accuracies of 47.52% (Table 3-25, page 48), 67.47% (Table 3-24, page 47), and 57.74% (Table 3-26, page 49). When taken individually, the sampling with 100 features performed significantly better than the other two; but when each type of contour feature is combined with all the other features, accuracies of 90.37% (Table 3-28, page 51) for averaging, 86.13% (Table 3-27, page 50) for sampling, and 90.11% (Table 3-29, page 52) for hybrid were obtained. In this case the sampling, with 100 features, did not perform as well as either averaging or hybrid. The two feature selection experiments performed, one with averaging used for contour features and the other with hybrid used for contour features, did not find any feature combination where hybrid would perform significantly better than averaging. Given that averaging has only 5 features compared to the 16 that hybrid contains, it was decided to use the averaging contour features.

The Fourier texture features consist of 5 discrete features, 39 through 43. Each one represents a different frequency range (Figure 3-7, page 34), starting with the high frequency range for feature 39 down to the low frequency range for feature 43. As a group, an accuracy of 47.00% (Table 3-30, page 53) was obtained. The feature selection experiment (shown in Figure 4-8, page 72) shows that 3 of the features (41, 42, 43) were in all of the best 200 feature combinations evaluated, feature 40 was in 140, and feature 39 was in 5. As a group these are very successful features.
5.3 Feature Selection

By reducing the number of features, a reduction in the processing time required for classification and training is also obtained; but elimination of too many features makes it more difficult for the SVM to define the decision boundary, ultimately requiring more support vectors, which in turn requires more time for training and classification.

PAGE 93

81 As the feature count was reduced, test accuracy did not fall until a given point. In the case of Figure 4-6 (page 68), test accuracy was maintained until the feature count dropped below 26. In that same range a reduction in the time required for performing both training and classification was obtained. This correlates with the reduction of support vectors that also occurred in this same range (Figure 4-7, page 69). The fewest support vectors occurred at feature count 25. As the feature count dropped below 25, the number of support vectors increased, resulting in an increase in the processing time required for training and classification.

As the feature count dropped below 13, the number of support vectors started to grow at a more rapid rate. At the same time the test accuracy started to fall at a very rapid rate too. It appears the SVM is creating decision boundaries, with the features it has to work with, that are becoming more specific to the training data and not relevant to the test data. At the extreme end, that is where the feature count equals 1, the number of support vectors created is 2,602 out of 2,700 training examples and accuracy has fallen to 36.15% (Table 4-3, page 66).

Selecting features by Highest Accuracy (Section 4.4.1, page 63) proved to work best. Usage in Best 200 (Section 4.4.2, page 63) did not find the best feature combination but can be useful in understanding the usefulness of individual features. When selecting features that occur in the best 200, the interaction between features is being ignored. In Table 4-6 (page 73) the row with “Min 200” = 200 has 21 features. These are the features that were in all 200 best feature combinations evaluated. But there were other features in the 200 best feature combinations that did not occur in all 200 but may have been aiding.

PAGE 94

82 An interesting experiment might be to determine if using the features that occurred in the best 200 might be more applicable to other data sets, specifically in the situation where the user needs to classify a combination of classes different from the one feature selection has already been performed for. Rather than re-running feature selection for the new combination of classes, use the features from Table 4-6 for a given level of occurrence in the best 200. The idea is that these features might be more generalized than the ones which were selected by highest accuracy.

PAGE 95

83 REFERENCES

1 Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, Scott Samson, Andrew Remsen, Thomas Hopkins, Recognizing Plankton from Shadow Image Particle Profiling Evaluation Recorder, IEEE Trans. on Systems, Man and Cybernetics, Part B: Cybernetics, August 2004, vol. 34, no. 4.
2 Samson, S., Hopkins, T., Remsen, A., Langebrake, L., Sutton, T., Patten, J., 2001. A system for high resolution zooplankton imaging. IEEE Journal of Oceanic Engineering 26 (4), pages 671-676.
3 Samson, S.; Langebrake, L.; Lembke, C.; Patten, J., 1999. Design and initial results of high-resolution Shadowed Image Particle Profiling and Evaluation Recorder. OCEANS '99 MTS/IEEE, Riding the Crest into the 21st Century, Volume 1, 13-16 Sept. 1999, pages 58-63, vol. 1.
4 Andrew Remsen, Thomas Hopkins, Scott Samson, What you see is not what you catch: a comparison of concurrently collected net, optical plankton counter, and shadowed image particle profiling evaluation recorder data from northeast Gulf of Mexico. Deep Sea Research Part 1: Oceanographic Research Papers, 51, pages 129-151, August 2004.
5 Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, Scott Samson, Andrew Remsen, Thomas Hopkins, Learning to Recognize Plankton, IEEE International Conf. on Systems, Man and Cybernetics.
6 Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, Scott Samson, Andrew Remsen, Thomas Hopkins, Active learning to recognize multiple types of plankton. 17th conference of International Association for Pattern Recognition, vol. 3, pages 478-481, 2004a.
7 T. Luo, K. Kramer, D. Goldgof, L.O. Hall, S. Samson, A. Remsen, and T. Hopkins, Learning to Recognize Plankton, IEEE International Conference on SMC, 2003.
8 Hu, M.K., Visual Pattern Recognition by Moment Invariants, IRE Trans. Information Theory, February 1962, pages 179-187.
9 Xiaoou Tang, W. Kenneth Stewart, He Huang, Scott M. Gallager, Cabell S. Davis, Luc Vincent, and Marty Marra, Automatic Plankton Image Recognition, Artificial Intelligence Review, February 1998, vol. 12, no. 1-3, pages 177-199.
10 C.T. Zahn and R.Z. Roskies, "Fourier descriptors for plane closed curves", IEEE Trans. Computers, vol. C-21, March 1972, pp. 269-281.
11 G.H. Granlund, "Fourier Preprocessing for hand print character recognition", IEEE Trans. Computers, vol. C-21, Feb. 1972, pp. 195-201.
12 Ron Kohavi and George H. John, Wrappers for Feature Subset Selection, Artificial Intelligence archive, December 1997, vol. 97, pages 273-324.

PAGE 96

84 13 Chih-Chung Chang and Chih-Jen Lin, A Library for Support Vector Machines, libsvm, http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
14 Cristianini N. and Shawe-Taylor J., (2000), An Introduction to Support Vector Machines and other Kernel-based learning methods, Cambridge University Press, UK.
15 Thomas G. Dietterich, Approximate Statistical tests for Comparing Supervised Classification Learning Algorithms.
16 Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, Saddle River, NJ, 1995, page 91.
17 Puneet Gupta, David Doermann, Daniel DeMenthon, "Beam Search for Feature Selection in Automatic SVM Defect Classification," icpr p. 20212, 16th International Conference on Pattern Recognition (ICPR'02) Volume 2, 2002. (Beam Search Reference).
18 Paul G. Falkowski, The Ocean's Invisible Forest, Scientific American, August 2002, page 54.
19 John Roach, Source of Half Earth's Oxygen Gets Little Credit, National Geographic, June 7th 2004.
20 J. Platt, Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61-74, 2000.
21 C.S. Davis, S.M. Gallager, N.S. Berman, L.R. Haury and J.R. Strickler, The Video Plankton Recorder (VPR): design and initial results, Arch Hydrobol, Neith, vol. 36, pages 67-81, 1992.
22 X. Tang and W. K. Stewart, Plankton image classification using novel parallel-training learning vector quantization network, in Proc. IEEE/MTS OCEANS'96, Fort Lauderdale, FL, Sept. 1996.
23 Akiba, T., Kakui, Y., Electrotech. Lab., Hyogo, Development of an in situ zooplankton identification and counting system based on local auto-correlation masks, OCEANS '97, MTS/IEEE Conference Proceedings, 6-9 Oct 1997, vol. 1, pages 655-659.
24 Xiaoou Tang, Multiple competitive learning network fusion for object classification, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, vol. 28, no. 4, August 1998.
25 Xiaoou Tang, W. Kenneth Stewart, He Huang, Scott M. Gallager, Cabell S. Davis, Luc Vincent, and Marty Marra, Automatic plankton image recognition, Artificial Intelligence Review, vol. 12, pages 177-199, February 1998.
26 Qiao Hu, Cabell Davis, Automatic plankton image recognition with co-occurrence matrices and Support Vector Machine, MEPS vol. 295, June 23 2005.
27 C.S. Davis, S.M. Gallager, M. Marra, W.K. Stewart, Rapid visualization of plankton abundance and taxonomic composition using the Video Plankton Recorder, Deep Sea Research Part II: Topical Studies in Oceanography, 1996, vol. 43, no. 7, pages 1947-1970(24).

PAGE 97

85 28 Matthew B. Blaschko, Gary Holness, Marwan A. Mattar, Dimitri Lisin, Paul E. Utgoff, Allen R. Hanson, Howard Schultz, Edward M. Riseman, Michael E. Sieracki, William M. Balch, Ben Tupper, Automatic in situ identification of plankton, Seventh IEEE Workshops on Application of Computer Vision (WACV/MOTION'05), vol. 1.
29 I. Pitas, Digital Image Processing Algorithms and Applications, 2000, Fourier descriptors pages 334-336.
30 Kenneth R. Castleman, Digital Image Processing, Prentice Hall, 1996.
31 C. Zahn and R.Z. Roskies, Fourier descriptors for plane closed curves, IEEE Transactions Computers, vol. C-21, pages 269-281, March 1972.
32 R. Chellappa and R. Bagdazian, Fourier coding of image boundaries, IEEE Trans. Pattern Anal. Machine Intelligence, vol. PAMI-6, pages 102-105, January 1984.

PAGE 98

86 APPENDICES

PAGE 99

87 Appendix A System Design

Figure A-1 System Flow Chart

Figure A-1 shows the general flow of data in the application. The application consists of several separate programs written in C++ and Java. It was desired that all code be compatible with freely available development tools. With a few exceptions, all code can be compiled and run in both the Windows and UNIX worlds. Other concerns were performance, with respect to the number of plankton images processed per unit time, and at the same time it was wished to make use of the more current development tools available.

PAGE 100

Appendix A: (Continued) 88

To accomplish this, two different languages were used, Java and C++. Java was used for all GUI based applications and C++ was used for processing intensive tasks such as image extraction and feature calculation. Using this combination, applications were developed that can be compiled and run in both the Windows and UNIX environments.

The Java development environment used was JBuilder from Borland, which is a free development environment. The code produced is generic Java code that can be compiled by any Java compiler and executed by any valid Java virtual machine. The idea is that all GUI work will be done in Java. After any pertinent information is entered, the Java application would then run the appropriate binary, such as Image Extraction or Image Classification.

The applications that are written in C++ are meant for processor intensive tasks. To keep the application generic with respect to platforms (UNIX or Windows), all OS specific tasks were placed into two modules called OSservices.cpp and ImageIO.cpp. Functions such as CopyFile, DeleteFile, RenameFile, CreateDirectory, ReadImage, SaveImage, getCPUTime, etc. were implemented in these modules. In these two modules there are two different versions of each function implemented, one for UNIX and the other for Windows. All other modules use no OS specific functions. In this way the amount of work necessary to implement code for a different OS is minimized. Most of the C++ applications have been compiled and run in Windows, Solaris, and Linux.


Appendix B User Classified Data Sets

Table B-2 lists all the different plankton classes the user has classified for purposes of building training libraries. From this pool of classified images, the 9 classes listed in Table B-1 were selected for performing the feature selection and active learning experiments. These nine classes were selected because each had at least 1500 images. Each class was then further divided into test and validation sets with 1200 and 300 images, respectively.

Table B-1 Classes Selected for Testing

Class Name            Figure
Chaetognath           Figure B-1
Smallbell Longarms    Figure B-2
Copepod Oithona       Figure B-3
Echino Plutei         Figure B-4
Larvacean             Figure B-5
Marine Snow Dark      Figure B-6
Marine Snow Light     Figure B-7
Protist               Figure B-8
Trichodesmium         Figure B-9


Table B-2 List of Plankton Classes Classified by User

Class Name                  Num   Class Name                                   Num
artifact_lines              175   Ostracod                                     835
Chaetognath                2104   Polychaete                                    35
Cladoceran                  303   Protist_acanthametreon                       158
clipped_images              103   Protist_actinopoda                           147
Cnidaria_Clear_Bell         468   protist_all                                 1676
Cnidaria_misc_large          71   Protist_silicoflagellate                     702
Cnidaria_misc_small         282   Protist_spiny                                115
Cnidaria_Smallbell_Long    1564   Protist_thalassicola                         348
Cnidaria_Thimble           1048   Protist_unsorted                             181
Copepod                    2988   Pteropod                                      50
Copepod_Calanoid            989   Pteropod_creseis                              94
Copepod_Calocalanus         169   Pteropod_limacina                             80
Copepod_Copilia              20   Radiolarian_Chain                             24
Copepod_Corycaeus            39   salp_chain                                    55
Copepod_FuzzyAntenna         38   salp_ind                                     116
Copepod_Macrosetella         81   Shrimp                                       536
Copepod_Oithona            1719   Shrimp amphipod                                6
Copepod_Oncaea               59   Shrimp Lucifer                                24
Ctenophore                    9   Shrimp_phyllosome                             18
Ctenophore_Beroe             12   Shrimp_porcellanid                             5
Ctenophore_maybe             27   Shrimp_squilla                                30
Ctenophore_Ocyropsis         50   Shrimp_zoea                                    8
Ctenophore_venus              6   Siphonophore_all                             528
Diatom                      719   Siphonophore_eudoxid                           6
Dinoflagellate              324   Siphonophore_Large_calycophoran               23
Doliolid                    505   Siphonophore_physonect                        37
Doliolid_tail               183   Siphonophore_Praya                             3
Echino_Bipinnaria            66   Siphonophore_Small_calycophoran_none         160
Echino_Plutei              1798   Siphonophore_Small_calycophoran_tentacles    149
Egg                          15   Siphonophore_Sphaeronectes                   189
Eggs                         77   Small_unknown                                 56
Fish                         86   Tentacles                                     67
Larvacean                  2525   Trichomes                                    433
Larvacean_house             294   Trich_all                                   1964
Larvacean_no_membrane       967   trich_puffs                                  645
Larvacean_total            3371   trich_tufts                                  998
Marine_Snow_dark           1898   UNKNOWN                                       44
Marine_Snow_light          1924   out_of_focus                                  83

Total Images              37704


Figure B-1 Sample Chaetognath
Figure B-2 Sample Cnidaria Smallbell Longarms


Figure B-3 Sample Copepod Oithona
Figure B-4 Sample Echino Plutei
Figure B-5 Sample Larvacean


Figure B-6 Sample Marine Snow Dark
Figure B-7 Sample Marine Snow Light (note the artifact line through the middle of the image)
Figure B-8 Sample Protist


Figure B-9 Sample Trichodesmium


Appendix C The Fourier Transform

The magic behind the Fourier transform is the identity $e^{jx} = \cos x + j \sin x$, which the transform uses to capture frequency information. Since sine and cosine are 90 degrees out of phase with each other, the resulting Fourier descriptors are rotationally invariant.

\[ F(n) = \frac{1}{N} \sum_{k=0}^{N-1} f(k)\, e^{-j 2 \pi k n / N} \]
Equation 10 One Dimensional Fourier Transform

\[ e^{jx} = \cos x + j \sin x \]
Equation 11 Euler's Identity

The Fourier transform can be thought of as a series of masks, where each one represents a different frequency. In actuality there are two masks for each frequency, one that represents the real component and one that represents the imaginary component; see Figures C-1 through C-4. The real component mask is driven by the cosine function and the imaginary component mask by the sine function. This means that the two masks are 90 degrees out of phase with each other, or that the imaginary component leads the real component by 90 degrees.
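The one-dimensional transform can be evaluated directly from its definition, which makes the role of the two masks explicit. The sketch below assumes the conventional forward form with a 1/N normalization and a negative exponent, so that each term is f(k) multiplied by cos(angle) - j sin(angle). The function name dft and the alias Complex are illustrative choices, and this direct O(N^2) evaluation is not necessarily how the thesis code computed the transform.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Direct evaluation of F(n) = (1/N) * sum_{k=0}^{N-1} f(k) * exp(-j*2*pi*k*n/N).
std::vector<Complex> dft(const std::vector<Complex>& f) {
    assert(!f.empty());
    const std::size_t N = f.size();
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<Complex> F(N);
    for (std::size_t n = 0; n < N; ++n) {
        Complex sum(0.0, 0.0);
        for (std::size_t k = 0; k < N; ++k) {
            const double angle =
                twoPi * static_cast<double>(k * n) / static_cast<double>(N);
            // Euler's identity: exp(-j*angle) = cos(angle) - j*sin(angle).
            // The cosine term is the real mask, the sine term the imaginary mask.
            sum += f[k] * Complex(std::cos(angle), -std::sin(angle));
        }
        F[n] = sum / static_cast<double>(N);
    }
    return F;
}
```

With this normalization a constant input of ones produces a value of 1 in bucket 0 and values of approximately 0 in every other bucket.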


When each location in the resultant transform is examined independently, you see that it is the result of applying a given frequency mask. The frequency of each bucket is a function of its distance from the center of the array. Two masks are applied, one for the real component and the other for the imaginary component, each 90 degrees out of phase with the other. You might say that the real component (cosine) follows the imaginary component (sine) by 90 degrees. This is true for the first half of the array; at the midpoint there is a shift such that the real component then leads the imaginary component by 90 degrees. This is important to note because if you are going to use both the real and imaginary components on the input to the transform, you will need to include both halves of the resultant transform.

The next few tables and figures show a breakdown of the different components involved in a Fourier transform of a 20-element array. The idea is to see how using Euler's identity creates masks that capture information about image contours that are represented as an array.

Table C-1 Frequency Assignments on a Transform of a 20 Element Array

Bucket      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Frequency   -  1  2  3  4  5  6  7  8  9 10  9  8  7  6  5  4  3  2  1

In Table C-1, note how each bucket represents a different frequency, starting at 1 hertz, rising toward the middle of the array, and then falling back to 1.
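The frequency assignments of Table C-1 follow a simple rule: bucket k and bucket N - k carry the same frequency. The sketch below reproduces the table for any array length; the function names bucketFrequency and frequencyAssignments are hypothetical. Bucket 0, for which the table lists no frequency, comes out as 0 here because it holds the constant (zero-frequency) term.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Frequency represented by bucket k of an N-element transform.
// Buckets k and N - k share a frequency, so the value rises to N/2 and falls again.
std::size_t bucketFrequency(std::size_t k, std::size_t N) {
    assert(N > 0 && k < N);
    return (k <= N - k) ? k : N - k;
}

// Frequency of every bucket, in bucket order.
std::vector<std::size_t> frequencyAssignments(std::size_t N) {
    std::vector<std::size_t> freq(N);
    for (std::size_t k = 0; k < N; ++k) {
        freq[k] = bucketFrequency(k, N);
    }
    return freq;
}
```

For a 20-element array, frequencyAssignments(20) gives 1 for buckets 1 and 19, 9 for buckets 9 and 11, and 10 for bucket 10.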


Figure C-1 Fourier Mask, Locations 1, 2, 3 of 0 through 19
[Plot: Mask Value (-1.0 to 1.0) against Frequency Buckets (Hertz), 0 to 20, for N=1, N=2, N=3]

In Figure C-2, note how the masks are the same as those of the first three buckets, whereas the imaginary masks (Figure C-3, Figure C-4) do not exhibit the same behavior.

Figure C-2 Fourier Mask, Last 3 Locations of Real Component
[Plot: Mask Value (-1.0 to 1.0) against Frequency Buckets (Hertz), 0 to 20, for N=19, N=18, N=17]

Figure C-1 (left half) and Figure C-2 (right half) show the real component of the Fourier transform for buckets 1, 2, 3 and 17, 18, 19. As you can see, the left half and right half are symmetric to each other, with the lower frequencies on the ends and the higher frequencies in the middle. If you will only be using the real component in the input to the transform, then you will find that the magnitudes of the resultant transform are symmetric.


Figure C-3 Fourier Mask, Imaginary Component of First Three Locations
[Plot: Mask Value (-1.0 to 1.0) against Frequency Buckets (Hertz), 0 to 20, for N=1, N=2, N=3]

Note that the imaginary component leads the real component by 90 degrees.

Figure C-4 Fourier Mask, Last 3 Locations of Imaginary Component
[Plot: Mask Value (-1.0 to 1.0) against Frequency Buckets (Hertz), 0 to 20, for N=19, N=18, N=17]

In Figure C-4, note how the masks are 180 degrees out of phase with those of the first three buckets. This behavior differs from that of the real component.
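The behavior shown in the four mask figures can be checked numerically. The sketch below assumes the masks are cos(2*pi*n*k/N) for the real component and -sin(2*pi*n*k/N) for the imaginary component, which is what a negative exponent in the transform gives through Euler's identity; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Value at sample k of the real (cosine) mask for bucket n of an N-element transform.
double realMask(std::size_t n, std::size_t k, std::size_t N) {
    assert(N > 0);
    const double twoPi = 2.0 * std::acos(-1.0);
    return std::cos(twoPi * static_cast<double>(n * k) / static_cast<double>(N));
}

// Value at sample k of the imaginary (sine) mask for bucket n.
// The negative sign comes from exp(-j*x) = cos(x) - j*sin(x).
double imagMask(std::size_t n, std::size_t k, std::size_t N) {
    assert(N > 0);
    const double twoPi = 2.0 * std::acos(-1.0);
    return -std::sin(twoPi * static_cast<double>(n * k) / static_cast<double>(N));
}

// True when buckets n and N - n have identical real masks and
// sign-flipped (180 degrees out of phase) imaginary masks.
bool mirroredBucketsMatchFigures(std::size_t n, std::size_t N, double tol = 1e-9) {
    assert(n > 0 && n < N);
    for (std::size_t k = 0; k < N; ++k) {
        if (std::fabs(realMask(n, k, N) - realMask(N - n, k, N)) > tol) return false;
        if (std::fabs(imagMask(n, k, N) + imagMask(N - n, k, N)) > tol) return false;
    }
    return true;
}
```

For the 20-element case, mirroredBucketsMatchFigures returns true for n = 1, 2, and 3, that is, for the bucket pairs (1, 19), (2, 18), and (3, 17) plotted in the figures.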


What is interesting in Figure C-3 and Figure C-4 is that the two waves are 180 degrees out of phase with each other. Because of this, you cannot simply treat the resultant Fourier transform as symmetric: the corresponding buckets for the same frequency capture information that is 180 degrees apart. This is important to mention because many people are used to thinking of the Fourier transform as symmetric. That would be the case if only the real component were used on the input to the transform. Because of this characteristic, when capturing information about 1 Hz, which is in buckets 1 and 19, you need to include both buckets as separate features. This idea was used in all three of the Fourier descriptor methods tried.
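A small numerical example shows why both buckets are needed. The sketch below assumes a contour is encoded as complex samples x + jy and uses an ellipse purely as an illustrative test shape; neither the ellipse nor the function names come from the thesis. The transform is evaluated directly, with a 1/N normalization and a negative exponent assumed.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Magnitude of bucket n of the transform of f, evaluated from the definition.
double bucketMagnitude(const std::vector<Complex>& f, std::size_t n) {
    assert(!f.empty() && n < f.size());
    const std::size_t N = f.size();
    const double twoPi = 2.0 * std::acos(-1.0);
    Complex sum(0.0, 0.0);
    for (std::size_t k = 0; k < N; ++k) {
        const double angle =
            twoPi * static_cast<double>(k * n) / static_cast<double>(N);
        sum += f[k] * std::polar(1.0, -angle);  // exp(-j*angle)
    }
    return std::abs(sum) / static_cast<double>(N);
}

// N samples of an ellipse with semi-axes a and b, traced counter-clockwise
// and encoded as x + j*y (an illustrative contour, not thesis data).
std::vector<Complex> ellipseContour(std::size_t N, double a, double b) {
    assert(N > 0);
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<Complex> points(N);
    for (std::size_t k = 0; k < N; ++k) {
        const double t = twoPi * static_cast<double>(k) / static_cast<double>(N);
        points[k] = Complex(a * std::cos(t), b * std::sin(t));
    }
    return points;
}
```

For ellipseContour(20, 3.0, 1.0), bucket 1 has magnitude (a + b) / 2 = 2 while bucket 19 has magnitude (a - b) / 2 = 1, so the two 1 Hz buckets carry different information. If the imaginary part of every sample is set to zero, both buckets have the same magnitude, which is the symmetric case described above.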


Appendix D Glossary

Beam Search
A specific implementation of a “Best-First search” where a heuristic drives the search. It differs in that only the best N nodes are evaluated at each level, and once the search has processed a given level it does not go back to that level again. The search therefore continues until there are no more levels to process.

Class
Also referred to as a label. Different types of plankton are considered classes. For example, Trichodesmium, Larvacean, and Copepods would be considered three different classes.

SVM
Support Vector Machine, a learning algorithm that learns from labeled data how to predict the proper classes to be assigned to unseen data. See Chapter 2 for a more detailed description.

Training Library
For purposes of this thesis, a training library is the set of plankton images that are broken up into logical groups. See Appendix B for examples of groupings. These images are then used to train a learning algorithm such as the one utilized in this thesis, the support vector machine (SVM).

Training Model
For purposes of this thesis, training model refers to the set of classes and parameters that are to be used. The user has the ability to maintain several training models. Each one consists of a list of classes, the features to be used, and the support vector machine parameters. The training library may have several dozen classes in it, but any one training model may reference only a few of these classes. Training models also have the ability to group several classes together to form a single logical class.

VPR
Video plankton recorder, a device used to collect imagery of marine plankton. Its purpose is similar to that of SIPPER, but its implementation is very different.


Creator:
Kramer, Kurt A.
Title:
Identifying plankton from grayscale silhouette images [electronic resource] / by Kurt A. Kramer.
Publication:
[Tampa, Fla.] : University of South Florida, 2005.
Thesis:
Thesis (M.S.C.S.)--University of South Florida, 2005.
Bibliography:
Includes bibliographical references.
Format:
Text (Electronic thesis) in PDF format.
System Details:
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
General Note:
Title from PDF of title page.
Document formatted into pages; contains 112 pages.
Adviser:
Dr. Dmitry Goldgof.
Subjects / Keywords:
SIPPER
Feature selection
Feature calculation
Active learning
Support vector machine
SVM
Multi-class
Dissertations, Academic -- USF -- Computer Science and Engineering -- Masters
Host Item:
USF Electronic Theses and Dissertations.
URL:
http://digital.lib.usf.edu/?e14.1402