
Automatic red tide detection using MODIS satellite images


Material Information

Title:
Automatic red tide detection using MODIS satellite images
Physical Description:
Book
Language:
English
Creator:
Cheng, Weijian
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:

Subjects

Subjects / Keywords:
Karenia brevis
West Florida Shelf
Machine learning
Random forest
Support vector machine
Dissertations, Academic -- Computer Science -- Masters -- USF   ( lcsh )
Genre:
non-fiction   ( marcgt )

Notes

Abstract:
Red tides pose a significant economic and environmental threat in the Gulf of Mexico. Detecting red tide is important for understanding this phenomenon. In this thesis, machine learning approaches based on Random Forests, Support Vector Machines and K-Nearest Neighbors have been evaluated for red tide detection from MODIS satellite images. Detection results using machine learning algorithms were compared to ship-collected ground truth red tide data. This work has three major contributions. First, machine learning approaches outperformed two of the latest thresholding red tide detection algorithms based on bio-optical characterization by more than 10% in terms of F-measure and more than 4% in terms of area under the ROC curve. Machine learning approaches are effective in more locations on the West Florida Shelf. Second, the thresholds developed in recent thresholding methods were introduced as input attributes to the machine learning approaches, and this strategy improved the F-measures of the Random Forests and K-Nearest Neighbors approaches. Third, voting among the machine learning and thresholding methods achieved better performance than using machine learning alone, which implies that a combination of machine learning models and bio-characterization thresholding methods can be used to obtain effective red tide detection results.
Thesis:
Thesis (M.S.C.S.)--University of South Florida, 2009.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by Weijian Cheng.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 56 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 002064139
oclc - 567549103
usfldc doi - E14-SFE0003073
usfldc handle - e14.3073
System ID:
SFS0027389:00001




Full Text



Automatic Red Tide Detection using MODIS Satellite Images

by

Weijian Cheng

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Co-Major Professor: Dmitry B. Goldgof, Ph.D.
Co-Major Professor: Lawrence O. Hall, Ph.D.
Sudeep Sarkar, Ph.D.

Date of Approval: June 8, 2009

Keywords: Karenia brevis, West Florida Shelf, machine learning, random forest, support vector machine

Copyright 2009, Weijian Cheng

ACKNOWLEDGEMENTS

This research was partially supported by the USF Research and Graduate Education Investment thrusts in Computational tools for discovery and Sustainable Healthy Communities-Water. We thank the National Aeronautics and Space Administration (NASA) for the satellite data, and the Florida Fish and Wildlife Research Institute for providing the red tide ground truth data.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND AND PREVIOUS WORK
    2.1 Background
        2.1.1 Research region
        2.1.2 In situ data
        2.1.3 MODIS satellite data
        2.1.4 Machine learning models
            2.1.4.1 Support Vector Machines
            2.1.4.2 Random Forests
            2.1.4.3 K-Nearest Neighbors
    2.2 Review of prior work
        2.2.1 Thresholding methods
            2.2.1.1 The Chlorophyll anomaly method
            2.2.1.2 The backscattering detection method
        2.2.2 Machine learning methods
CHAPTER 3 AUTOMATED RED TIDE DETECTION ALGORITHMS
    3.1 Machine learning approaches
    3.2 Machine learning models with additional input features used in thresholding methods
    3.3 Combination of multiple algorithms
CHAPTER 4 EXPERIMENT SETUP
    4.1 Data set and pre-processing
    4.2 Accuracy assessment
    4.3 Algorithm implementation
CHAPTER 5 ALGORITHM PERFORMANCE
    5.1 Machine learning approaches
    5.2 Machine learning methods with and without approximated ground truth data for training
    5.3 Machine learning models with additional input features used in thresholding methods
    5.4 Combination of multiple algorithms by voting
    5.5 ROC analysis
    5.6 Confusion matrix
    5.7 Arithmetic means and geometric means for red tide detection accuracies
    5.8 An example
CHAPTER 6 CONCLUSION AND DISCUSSIONS
REFERENCES

LIST OF TABLES

Table 1. Notation on accuracy assessment
Table 2. F-measures of 3-Nearest Neighbors, Random Forests, Support Vector Machines, the Chlorophyll anomaly method and the backscattering method
Table 3. F-measures of machine learning methods with and without approximated ground truth data for training
Table 4. Confusion matrix for Support Vector Machines, without approximated ground truth
Table 5. Confusion matrix for Support Vector Machines, with approximated ground truth
Table 6. F-measures of machine learning models with additional input features introduced in the Chlorophyll anomaly method and the backscattering method
Table 7. F-measures of hybrid approaches by voting
Table 8. AUC for different methods
Table 9. Confusion matrix for Support Vector Machines
Table 10. Confusion matrix for Random Forests
Table 11. Confusion matrix for 3-Nearest Neighbors
Table 12. Confusion matrix for the Chlorophyll method
Table 13. Confusion matrix for the backscattering method
Table 14. Confusion matrix for weighted voting
Table 15. Confusion matrix for unweighted voting, N=2
Table 16. Arithmetic means for different approaches' accuracies
Table 17. Geometric means for different approaches' accuracies

LIST OF FIGURES

Figure 1. Study area: west Florida shelf
Figure 2. Measurement location for K. brevis cells on August 26, 2004
Figure 3. 7 channels from MODIS for the west Florida shelf on Nov 23, 2005
Figure 4. A maximum margin hyperplane, with dark and white dots representing different classes
Figure 5. A Decision tree example
Figure 6. K-Nearest Neighbors (K=3)
Figure 7. Flow chart of machine learning approaches
Figure 8. Flow chart of machine learning models with bio-optical features
Figure 9. Flow chart for ground truth approximation
Figure 10. ROC curves for different methods
Figure 11. The Chlorophyll anomaly algorithm for red tide detection on September 21, 2006
Figure 12. The backscattering algorithm for red tide detection on September 21, 2006
Figure 13. 3-Nearest Neighbors for red tide detection on September 21, 2006
Figure 14. Random Forests for red tide detection on September 21, 2006
Figure 15. Support Vector Machines for red tide detection on September 21, 2006
Figure 16. Weighted voting for red tide detection on September 21, 2006
Figure 17. Unweighted voting with N=2 (classifying a pixel as red tide as long as any 2 out of 5 algorithms classify it as red tide) for red tide detection on September 21, 2006

Automatic Red Tide Detection using MODIS Satellite Images

Weijian Cheng

ABSTRACT

Red tides pose a significant economic and environmental threat in the Gulf of Mexico. Detecting red tide is important for understanding this phenomenon. In this thesis, machine learning approaches based on Random Forests, Support Vector Machines and K-Nearest Neighbors have been evaluated for red tide detection from MODIS satellite images. Detection results using machine learning algorithms were compared to ship-collected ground truth red tide data. This work has three major contributions. First, machine learning approaches outperformed two of the latest thresholding red tide detection algorithms based on bio-optical characterization by more than 10% in terms of F-measure and more than 4% in terms of area under the ROC curve. Machine learning approaches are effective in more locations on the West Florida Shelf. Second, the thresholds developed in recent thresholding methods were introduced as input attributes to the machine learning approaches, and this strategy improved the F-measures of the Random Forests and K-Nearest Neighbors approaches. Third, voting among the machine learning and thresholding methods achieved better performance than using machine learning alone, which implies that a combination of machine learning models and bio-characterization thresholding methods can be used to obtain effective red tide detection results.

CHAPTER 1
INTRODUCTION

Toxic K. brevis blooms (commonly known as Florida's red tides) represent a serious problem for local fisheries and the tourism economy in Florida. Red tides are frequent along the West Florida Shelf, typically in late summer and fall. Traditional red tide detection methods, including station sampling and ship measurement, move slowly from position to position and cannot monitor red tide at larger scales in a timely manner. The cost of setting up such physical red tide detection points is also very high.

Since K. brevis blooms change the color of oceanic surface waters (Carder and Steward, 1985), it is possible to detect and monitor red tide blooms using satellite-based ocean color products (we refer to them as features) provided by remote sensing techniques. With a series of polar-orbiting ocean color satellite sensors, red tides can be monitored and studied in near real time every day over the entire eastern Gulf of Mexico, given cloud-free conditions. Previous research has shown that remote sensing has great potential for successful red tide prediction and monitoring (Millie et al., 1997; Kahru and Mitchell, 1998). However, before this technique can be used in an automated system that provides rapid detection and early warning to the public, we must develop reliable algorithms to differentiate red tide from other blooms and water disturbances in satellite imagery.

In this thesis, we evaluate machine learning methods based on K-Nearest Neighbors, Support Vector Machines and Random Forests using MODIS satellite data. Our methods take one MODIS satellite image of the Gulf of Mexico daily and automatically indicate whether each pixel contains red tide or not.

The remaining five chapters are organized as follows. Chapter 2 describes the data collection for our experiments, the three machine learning models we used in our study, and previous remote sensing red tide detection methods. Chapter 3 describes the algorithms and hybrid strategies used. Chapter 4 discusses our experimental setup. Chapter 5 describes and analyzes the results of the different methodologies. Finally, Chapter 6 contains the conclusions and a discussion.

CHAPTER 2
BACKGROUND AND PREVIOUS WORK

2.1 Background

2.1.1 Research region

We chose the West Florida Shelf as our study region. For this region, we obtained high-resolution MODIS satellite images, as well as in situ data provided by the Florida Fish and Wildlife Research Institute (FWRI). As shown in Figure 1, the study area extends from Key West in the south to the Big Bend area in the north, bounded by -87 to -80 degrees in longitude and 24 to 31 degrees in latitude. All of the in situ data we used are located within this region.

Figure 1. Study area: west Florida shelf. Both training and testing data are from this region

2.1.2 In situ data

From the cruises launched by FWRI, 17649 samples of seawater in our research region were collected during the years 2003 to 2007. Figure 2 shows an example of the measurement locations in the cruises. Each white point represents a location where the concentration of K. brevis cells was measured. One can see that the region covered is small and sparsely sampled.

Figure 2. Measurement location for K. brevis cells on August 26, 2004

2.1.3 MODIS satellite data

Remote sensing is the technique of acquiring information about an object using sensing devices without physical or intimate contact with the object (GIS Development, 2009). Remote sensing relies on electromagnetic radiation (EMR) to transfer information. EMR is a form of energy that produces observable effects when it strikes matter, with its spectrum of wavelengths spanning from $10^{-10}$ mm to $10^{10}$ mm. Sunlight that penetrates the water returns upwards and passes through the surface after scattering in the water (Yao, 1999). The amount of sunlight returned depends on the substances in the water, which gives us clues about those substances through the study of remotely sensed images.

Several ocean color sensors have been launched, and the reflectance values they record in different parts of the spectrum can be used to study the bio-physical composition of the seawater. Those sensors include CZCS (the Coastal Zone Color Scanner, from 1978 to 1986), SeaWiFS (the Sea-viewing Wide Field-of-view Sensor, from 1997 to present), and OCTS-MODIS (the Moderate Resolution Imaging Spectroradiometer, from 1999 to present) (Zhang, 2002). Among them, MODIS is the most comprehensive sensor, with the capacity to detect a wide spectral range of electromagnetic energy and to take measurements at different spatial resolutions (NASA, MODIS brochure, 2009).

The MODIS instrument has been designed and developed since mid-1995. MODIS has two space-flight units: the Protoflight Model (PFM) and the Flight Model 1 (FM1). PFM was launched with the Terra satellite on December 18, 1999, and FM1 was launched with Aqua on May 4, 2002 (NASA, MODIS Website: MODIS Components, 2009). Terra orbits the Earth from north to south across the equator in the morning, and Aqua passes south to north over the equator in the afternoon. Their orbits are timed so that the entire surface of the Earth is covered every 1 to 2 days (NASA, MODIS Website, 2009). MODIS provides data with high radiometric sensitivity (12 bit) in 36 spectral bands ranging in wavelength from 0.4 µm to 14.4 µm. Products for ocean research, including Normalized Water Leaving Radiance, Chlorophyll Fluorescence and Chlorophyll_a Pigment Concentration, can be calculated from those bands. These data help researchers better understand global dynamics and global processes, especially those occurring in the ocean. MODIS data not only improve the validation of global and interactive Earth system models, but also help predict global changes to assist policy makers in making decisions on environmental protection (NASA, MODIS Website, 2009).

Due to its consistently excellent performance, we chose MODIS as the remotely sensed measurement for this study. For every day on which clouds did not cover the whole region of interest, one MODIS image of the West Florida Shelf was used in our experiments.

It has been shown that the satellite features Chlorophyll (CHL), fluorescence line height (FLH) and particulate backscatter (BBP) may be related to red tide occurrences (Hu et al., 2003, 2005; Cannizzaro et al., 2008). Those three channels are computed from the normalized water leaving radiance (NLW) channels. Researchers have confirmed that the spectral reflectance at wavelengths between 450 nm and 520 nm is "sensitive to sedimentation, deciduous and coniferous forest cover discrimination and soil vegetation differentiation" (GIS Development, 2009); the reflectance between 520 nm and 590 nm is "green reflectance by healthy vegetation, vegetation vigour, rock-soil discrimination, turbidity and bathymetry in shallow waters" (GIS Development, 2009); the reflectance between 620 nm and 680 nm is "sensitive to Chlorophyll absorption: plant species discrimination, differentiation of soil and geological boundary" (GIS Development, 2009); and the reflectance between 770 nm and 860 nm is "sensitive to green biomass and moisture in vegetation, land and water contrast and landform and geomorphic studies" (GIS Development, 2009). Spectral reflectance at all these wavelengths might provide clues about red tide. We picked one NLW feature for each of the four wavelength ranges above. In our study we used NLW at wavelengths of 412 nm, 551 nm, 678 nm, and 869 nm, as well as CHL, FLH, and BBP at a wavelength of 551 nm. In our initial experiments, dropping any one of those features decreased the algorithms' accuracies. Examples of satellite images for each of the channels above are shown in Figure 3.

Figure 3. 7 channels from MODIS for the west Florida shelf on Nov 23, 2005. (a) CHL, (b) FLH, (c) NLW 412 nm, (d) NLW 551 nm, (e) NLW 678 nm, (f) NLW 869 nm and (g) BBP 551 nm

2.1.4 Machine learning models

Support Vector Machines, Random Forests and K-Nearest Neighbors algorithms have been widely applied to different machine learning problems (Y. Liao et al., 2002; M. Ankerst et al., 1999; D.R. Cutler et al., 2007; H. Byun et al., 2002). We applied the three methods in our study to recognize red tide in MODIS satellite images. Descriptions of the three algorithms follow.

2.1.4.1 Support Vector Machines

One of the motivations for Support Vector Machines is to classify objects that are not linearly separable. Linear models, including linear regression, have been applied successfully to some classification problems. Their obvious disadvantage is that they can only represent linear boundaries between classes, while many practical applications have nonlinear boundaries. One solution is to extend the linear models to ordinary linear models. However, the number of coefficients increases rapidly as the number of attributes in the data set grows, which makes the ordinary linear models impractical to solve. Overfitting is another problem for the ordinary linear models solution.

To solve those problems, a special kind of linear model was introduced: the maximum margin hyperplane. We try to separate two classes with a hyperplane. We define the margin of a linear classifier as the width by which the boundary can be widened before it hits a data point. The maximum margin hyperplane is the hyperplane giving the greatest separation between the two classes. Figure 4 shows an example of a maximum margin hyperplane. Support vectors are the data points having the minimum distance to the hyperplane.

Figure 4. A maximum margin hyperplane, with dark and white dots representing different classes

Each data point can be represented as

$$P_i = (a_i, d_i),$$

where $a_i$ is the attribute vector of the point and $d_i \in \{0, 1\}$ is the ground truth value of the class that point belongs to. A hyperplane separating two classes can be written as

$$x = w_0 + w_1 a_1 + w_2 a_2,$$

where $a_1$ and $a_2$ are the attribute values and $w_i$ ($i = 0, 1, 2$) are the weights to be learned. The maximum margin hyperplane can also be written in terms of the support vectors:

$$x = b + \sum_{i \text{ is support vector}} \alpha_i d_i \, a(i) \cdot a,$$

where $a(i)$ are the support vectors, $d_i$ is the class value of $a(i)$, $a$ is a test instance, and $b$ and $\alpha_i$ are parameters to be learned. The term $a(i) \cdot a$ is the dot product between the test instance and one of the support vectors. Finding the support vectors for a data set and the parameters $b$ and $\alpha_i$ is a standard optimization problem: constrained quadratic optimization (Witten et al., 2005).
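The support-vector form of the hyperplane can be illustrated with a short sketch. This is only an illustration, not the implementation used in this work: the function names, the two support vectors, and the values chosen for $\alpha_i$ and $b$ are hypothetical and hand-picked rather than learned, and the class values are assumed to be encoded as -1 and +1 so that the sign of $x$ gives the predicted class.

```python
def dot(u, v):
    """Dot product of two equal-length attribute vectors."""
    return sum(x * y for x, y in zip(u, v))


def decision_value(support_vectors, alphas, classes, b, a):
    """x = b + sum over support vectors of alpha_i * d_i * (a(i) . a)."""
    return b + sum(
        alpha * d * dot(sv, a)
        for sv, alpha, d in zip(support_vectors, alphas, classes)
    )


# Hypothetical, hand-picked parameters (not learned from any data).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
classes = [+1, -1]      # assumed +/-1 encoding of the class value d_i
alphas = [0.25, 0.25]
b = 0.0


def classify(a):
    """Predict the class of test instance a from the sign of x."""
    x = decision_value(support_vectors, alphas, classes, b, a)
    return 1 if x >= 0 else -1
```

With these made-up parameters the two support vectors evaluate to +1 and -1, so they sit exactly on the two edges of the margin, and any other instance is classified by which side of the hyperplane it falls on.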

Platt (Platt, 1999) described a Support Vector Machines training algorithm that reduced the computational complexity and accelerated the learning.

To model nonlinear class boundaries, we map the input features with the kernel method (Aizerman et al., 1964) to a transformed feature space, which allows the algorithm to fit the maximum margin hyperplane in the transformed space. Since the kernel can be nonlinear, the classifier can be nonlinear in the original space. Some common kernel functions include:

$$k(a(i), a) = (a(i) \cdot a)^n \quad \text{(Polynomial)}$$

$$k(a(i), a) = \exp\left(-\frac{\lVert a(i) - a \rVert^2}{2\sigma^2}\right) \quad \text{(Radial Basis Function)}$$

$$k(a(i), a) = \tanh(a(i) \cdot a + c) \quad \text{(Sigmoid)}$$

The strengths of Support Vector Machines include a lower likelihood of overfitting, since the decision boundary is controlled by the support vectors instead of all instances, and the ability to handle large feature spaces. Support Vector Machines have been successfully applied to many real-world problems including image classification, bioinformatics, and handwritten character recognition.

2.1.4.2 Random Forests

A Random Forest (RF) (Breiman, 2001) is an ensemble of decision trees. A decision tree is a common method in machine learning. It uses the structure of a tree to represent the data classification rules. Figure 5 shows an example of a decision tree.

Figure 5. A Decision tree example

[Figure 5: a decision tree with a test on CHL (branches CHL ≥ 1 and CHL < 1), a test on FLH (branches FLH ≥ 1 and FLH < 1), and leaves labeled Red tide, Non red tide and Non red tide.]

Each internal node corresponds to an attribute. Each leaf corresponds to a classification result determined by the values of the attributes along the path from the root. Each branch of the decision tree represents a possible outcome of the test on the value at the node. Usually, if the attribute is numeric, the test at a node determines whether its value is greater or less than a predetermined constant, giving a binary split.

To construct a decision tree, we select an attribute to place at the root node and split the instances into branches based on that attribute's value. This process is repeated for each branch, using the instances on that branch. The development of a node stops when all instances at the node belong to the same class, or when some other criterion is met. To determine which attribute to split on, we use the information gain (Witten et al., 2005), which is the difference between the information of all instances before the split and the sum of the information of each subset after the split. The attribute with the highest information gain is selected as the attribute on which to split the data.
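The split criterion can be made concrete with a small sketch. It assumes that "information" is measured as Shannon entropy in bits and that each subset is weighted by its share of the instances, which is the usual formulation; the function names and the CHL and FLH values are hypothetical and chosen only to give one perfect split and one uninformative split.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Gain of the binary split value < threshold versus value >= threshold."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    # Weighted sum of the information in each non-empty subset after the split.
    after = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - after


# Hypothetical toy data: four pixels with made-up CHL and FLH values.
labels = ["non red tide", "non red tide", "red tide", "red tide"]
chl = [0.2, 0.5, 1.5, 3.0]   # splitting at 1 separates the classes perfectly
flh = [0.5, 2.0, 0.5, 2.0]   # splitting at 1 leaves both subsets mixed

gain_chl = information_gain(chl, labels, 1.0)
gain_flh = information_gain(flh, labels, 1.0)
```

On this toy data the CHL split has the higher gain, so CHL would be chosen for the node.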

The advantages of decision trees include being simple to understand and interpret, requiring little data preparation, making no assumptions about the data, and being a white box model. To improve the classification accuracy of decision trees, researchers developed strategies to grow an ensemble of trees and let them vote for the most popular class. Such strategies include bagging (Breiman, 1996), which grows each tree from a random selection, with replacement, of the instances in the training set, and random split selection, which selects the split at each node at random from the K best splits.

Breiman (Breiman, 2001) developed the Random Forests algorithm to improve the accuracy of decision trees via an ensemble of decision trees. A Random Forest contains N decision trees. Assuming that M instances are available as training data, P% of them (typically P=100) are chosen randomly with replacement to train each individual decision tree. Each tree is constructed by randomly selecting K features at each internal node as the tree is created and selecting the best one to test at that node. The classification result is obtained by unweighted voting of all decision trees.

For many data sets, Random Forests produce very accurate classification results (Breiman, 2001). Other advantages include the ability to estimate the importance of attributes while the trees are grown and the ability to handle a large number of input attributes.

2.1.4.3 K-Nearest Neighbors

The K-Nearest Neighbors algorithm (Dasarathy, 1991) is a type of instance-based learning method that classifies instances based on their closest training examples. It assigns an object to the class most common among its K nearest neighbors.

That is, for an input vector $v$, its distance $d_i$ to every item $t_i$ in the training data is computed as

$$d_i = \lVert v - t_i \rVert.$$

Let $d_{i_1}, d_{i_2}, \ldots, d_{i_k}$ be the $k$ smallest distances. Then $v$ is classified as the majority class among $t_{i_1}, t_{i_2}, \ldots, t_{i_k}$. Figure 6 shows an example of how K-Nearest Neighbors works when K=3.

Figure 6. K-Nearest Neighbors (K=3). Instance A will be classified as class "2" since 2 of its 3 closest instances belong to class "2"

Typically we use the Euclidean distance to measure the distance between two instances in the feature space (Witten et al., 2005).

The K-Nearest Neighbors algorithm is effective when the training data set is of moderate size. It requires minimal training time. When new training data become available, we can add them to the original training set without rebuilding the classification model. This incremental property makes K-Nearest Neighbors efficient for applications with a dynamically changing training set. However, the classification time of the K-Nearest Neighbors method can be high when the training set is large. The K-Nearest Neighbors algorithm has been successfully used in protein structure prediction (Bondugula et al., 2005), optical music recognition (Fujinaga, 1996) and text categorization (E. Han et al., 1999).
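The classification rule of this section (Euclidean distance to every training item, then a majority vote among the K nearest) can be sketched in a few lines. This is an illustrative sketch rather than the implementation evaluated in this thesis; the function name and the two-dimensional training points are hypothetical, arranged so that 2 of the 3 nearest neighbors of instance A belong to class 2, as in the situation described for Figure 6.

```python
import math
from collections import Counter


def knn_classify(train, v, k=3):
    """Classify v as the majority class among its k nearest training items.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], v))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (features, class label).
train = [
    ((1.0, 2.0), 1),
    ((2.0, 3.0), 2),
    ((3.0, 2.5), 2),
    ((6.0, 6.0), 1),
    ((7.0, 5.0), 1),
]
instance_a = (2.0, 2.0)
predicted = knn_classify(train, instance_a, k=3)
```

Because there is no training step, adding new labeled data only means appending to `train`; the cost is paid at classification time, when every stored item must be compared with the query.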

2.2 Review of prior work

2.2.1 Thresholding methods

Remote sensing can provide synoptic and frequent observations of the surface ocean, and has proven effective for seawater classification and phytoplankton detection (Carder and Steward, 1985). Since the 1970s, many studies have worked on refining algorithms to estimate Chlorophyll using ocean color imagery. Some researchers (Gordon, Brown et al., 1983; Gordon, Clark et al., 1983; Gordon et al., 1980) estimated phytoplankton concentration using a function of two or three spectral bands provided by the satellite.

However, methods that look at a band ratio or threshold a single sensor channel are highly susceptible to errors when estimating biomass. In some areas, the amount of green pigment in plants is not the only factor that determines the water color. Optically complex environments are common in waters such as continental shelves and coastal waters. Dissolved organic matter (DOM) and sediments discharged by rivers, bottom reflection and intense phytoplankton blooms are also typical in those waters.

To overcome the challenges of differentiating red tide from other similar species, people have studied the statistical record of red tide appearance in satellite imagery. In recent years, researchers have developed two effective methods to detect red tide using remote sensing techniques: the Chlorophyll anomaly method and the backscattering method, discussed in the following sections.

2.2.1.1 The Chlorophyll anomaly method

Tester et al. (Tester et al., 1997) have shown that the K. brevis concentration must reach at least $10^5$ cells/L to be detected as a bloom from a single CZCS satellite image.

Another study by Antoine et al. (Antoine et al., 1996) also demonstrated that blooms of $10^5$ cells/L correspond to approximately 1 mg/m$^3$ of Chlorophyll. Moreover, laboratory studies have shown that $10^5$ cells/L of K. brevis contain around 1 mg/m$^3$ of Chlorophyll (Tomlinson et al., 2004).

Based on those findings on the relation between Chlorophyll and K. brevis concentration, NOAA's CoastWatch program has implemented an algorithm that uses the Chlorophyll anomaly to detect red tide. The algorithm defines the "Chlorophyll anomaly" as:

Chlorophyll anomaly of day x = Chlorophyll of day x - average Chlorophyll concentration from day (x-74) to day (x-14)

The Chlorophyll anomaly describes the difference between the current Chlorophyll level and the mean over an earlier 60-day period. The 14-day gap between the current day and the days used to compute the Chlorophyll mean avoids bias in the case of slowly changing blooms. In the satellite image, if the Chlorophyll anomaly of a pixel (approximately 1 square kilometer) is greater than 1 mg/m$^3$, the pixel is classified as red tide. The Chlorophyll anomaly method is the official red tide detection algorithm of NOAA's CoastWatch program.

The Chlorophyll anomaly method works well in the open ocean. In coastal water, however, the high concentration of colored dissolved organic matter absorbs the blue part of the spectrum, which leads to an inaccurate Chlorophyll estimate from the satellite. Moreover, if the red tide lasts longer than two and a half months in a certain area, the Chlorophyll anomaly method may not be able to detect it, since there is no significant difference between the current Chlorophyll value and the average over the previous 2 months.
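The anomaly rule can be sketched for a single pixel as follows. This is a minimal sketch under stated assumptions, not the CoastWatch implementation: the function names are hypothetical, the Chlorophyll series is assumed to hold one value in mg/m$^3$ per day with no cloud gaps, and the baseline is taken as the 60 daily values from day x-74 up to but not including day x-14.

```python
from statistics import mean


def chlorophyll_anomaly(chl_by_day, x, far=74, near=14):
    """Chlorophyll of day x minus the mean of the earlier 60-day window.

    chl_by_day is a list indexed by day. The window endpoints (day x-74
    inclusive, day x-14 exclusive) are an assumed convention.
    """
    start, stop = x - far, x - near
    if start < 0:
        raise ValueError("need at least 74 days of history before day x")
    return chl_by_day[x] - mean(chl_by_day[start:stop])


def is_red_tide(chl_by_day, x, threshold=1.0):
    """Flag the pixel when its anomaly exceeds 1 mg/m^3."""
    return chlorophyll_anomaly(chl_by_day, x) > threshold


# Hypothetical series: a quiet background followed by a sudden increase,
# and a bloom that has persisted for the entire history.
quiet_then_bloom = [0.5] * 80 + [2.0]
persistent_bloom = [2.0] * 81
```

The second series illustrates the limitation noted above: a bloom that has lasted through the whole baseline window produces an anomaly of zero and is not flagged.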

PAGE 25

17 2.2.1.2 The backscattering detection method Garver et al. (Garver et al ., 1994) have shown that due to other plants containing Chlorophyll, a detection method only using Chlorophyll may also create false alarms from non-toxic blooms. However, some sate llite channels can change when viewing harmful algae species. Thus, Cannizzaro et al. (Cannizzaro et al ., 2008) developed the following red tide detection algorithm. A pixel will be classified as red tide if and only if: Chlorophyll >1 and FLH>0.01 and BBP 551< More l's function, where Morel's function is 0.62 100.3**(0.0020.02*(0.5-0.25*log())) ChlorophyllChlorophyll (Morel 1988) The backscattering algorithm solved the problem of relying only on Chlorophyll value for red tide classification as in the Chlorophyll anomaly method. However, many events (i.e. storm or river discharge) can increase the sediment in the seawater, which increases the backscattering value of the wa ters. In those cases, the backscattering algorithm cannot detect the red tide due to the high individual backscattering values. 2.2.2 Machine learning methods Machine learning has been successfully applied for remote sensed information understanding. Wang (Wang, 1990) developed a fuzzy supervised classification method to determine land-use classes in Landsat MSS images, which included two steps of estimating fuzzy parameters from the traini ng data and a fuzzy partition of the spectral space. Kubat et al. (Kubat et al., 1998) ha ve used machine learning algorithms including decision tree and nearest neighbors approaches to detect oil spills using satellite radar data. Remotely sensed data have been clas sified using feed-forwa rd neural networks (Kiang, 1992, Hepner, 1990, Abuelgasim, 1995), decision trees (Friedl et al., 1997),


expert systems (Kruse et al., 1993) and rule-based systems (Warner et al., 1994) by different researchers.

M. Zhang et al. (M. Zhang et al., 2002) have developed a computer expert system to classify multi-band remotely sensed imagery for red tide recognition. Briefly, based on the spectral reflectance, an initial segmentation was performed using a fuzzy clustering algorithm (FCM) (Bezdek and Pal, 1992). The algorithm assumed that the desired number of classes was given; in addition, a partition distance metric, a fuzziness measure m, and a stopping criterion were chosen. Accordingly, the FCM algorithm partitioned the data set X into c classes, including the red tide class. Application of the FCM algorithm correctly recognized some of the red tides, but it relied on the correct selection of the number of clusters and had the problem of over-clustering (waters in the same category are classified as different kinds) or under-clustering (waters in different categories are classified as the same).

M. Zhang (M. Zhang 2000) used a knowledge-based system for multiple-class water detection using Coastal Zone Color Scanner images. In the first phase of the algorithm, a Neural Network system was applied to recognize whether the water was red tide or not. To improve the accuracy of red tide detection from the Neural Network, heuristic rules were used to reclassify the pixels whose recognition results from the Neural Network were similar.

H. Zhang (H. Zhang 2002) introduced a combined system with Fuzzy C Means Clustering and a Neural Network for red tide detection. Daily SeaWiFS data was clustered into 10 clusters. To recognize whether a cluster is red tide, the method used some cluster centers with known ground truth to train the Neural Network. With this


trained Neural Network model, other clusters were classified as red tide or non red tide. If a cluster is recognized as red tide or non red tide, all pixels in it are accordingly recognized as red tide or non red tide.

Yao (Yao, 1999) used a combined system with Fuzzy C Means Clustering, heuristic rules, and decision trees. Each day of SeaWiFS data was separated into 10 different clusters. To determine which class a cluster belonged to, the method first used heuristic rules to categorize that cluster. If a cluster could not be classified by a heuristic rule, it was classified by a decision tree. The decision tree was trained from some clusters whose ground truth was known. If a cluster was recognized as a certain water class, all pixels in it were accordingly recognized as that water class. This method depends heavily on pre-defined parameters such as the number of clusters. One cluster may contain both red tide and non red tide pixels, in which case neither the heuristic rule nor the decision tree can classify all pixels accurately.


CHAPTER 3
AUTOMATED RED TIDE DETECTION ALGORITHMS

The machine learning models introduced in Section 2.1.4 produce classification results from models built by learning from the training data. When the training data cover enough regions and different events, such as normal days, storms and river discharge, machine learning approaches may be able to learn the data patterns in different regions and events. Hence, the machine learning approaches can detect red tide in the whole West Florida Shelf, including both open ocean and coastal waters, regardless of wind and temperature. With all relevant input features shown in Section 2.1.3, the machine learning approaches can build nonlinear models based on the relationship between red tide occurrences and the values of satellite features. Such models avoid relying on only one or two features, which may cause an algorithm to fail when those features are contaminated by suspended sediments or colored dissolved organic matter.

3.1 Machine learning approaches

This category of algorithms includes three machine learning approaches whose input attributes are obtained directly from the satellite data (CHL, FLH, BBP at 551 nm, and NLW at 412 nm, 551 nm, 678 nm and 869 nm). No thresholding method is involved in this category. Figure 7 shows how this strategy works.
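To make the input/output contract of this category concrete, the following is a minimal K-Nearest Neighbors sketch over the seven satellite attributes listed above. It is an illustration only: the function name, the Euclidean distance, and the plain majority vote are assumptions, and the experiments in this thesis used the implementations described in Section 4.3, not this code.

```python
import math
from collections import Counter

# Assumed feature order: CHL, FLH, BBP551, NLW412, NLW551, NLW678, NLW869
# (each already normalized to the range 0..1).

def knn_predict(train_X, train_y, query, k=3):
    """Label a pixel by majority vote of its k nearest training pixels."""
    # Euclidean distance in feature space is an assumption of this sketch.
    ranked = sorted((math.dist(x, query), label)
                    for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

A query pixel is labeled "red tide" when at least two of its three nearest training pixels in feature space carry that label.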


Figure 7. Flow chart of machine learning approaches
[Flow chart: Inputs (CHL, FLH, BBP551, NLW412, 551, 678, 869) -> Machine learning models (Support Vector Machines, Random Forests or K-Nearest Neighbors) -> Output (red tide / non red tide)]

3.2 Machine learning models with additional input features used in thresholding methods

The recent thresholding methods shown in Section 2.2.1 used several combinations of traditional satellite attributes as thresholds to detect red tide. In order to utilize these bio-optical red tide detection thresholds, we used the values of Chlorophyll anomaly and BBP551 Morel's function, introduced in the Chlorophyll anomaly method (Tomlinson et al., 2004) and the backscatter method (Cannizzaro et al., 2008), as additional input features for the machine learning models. Figure 8 shows the flow chart of this approach.

Figure 8. Flow chart of machine learning models with bio-optical features
[Flow chart: Inputs (CHL, FLH, BBP551, NLW412, 551, 678, 869, Chlorophyll anomaly, BBP551 Morel's function) -> Machine learning models (Support Vector Machines, Random Forests, K-Nearest Neighbors) -> Output (red tide / non red tide)]


3.3 Combination of multiple algorithms

We also used unweighted voting and weighted voting to combine the prediction results generated by K-Nearest Neighbors, Random Forests, Support Vector Machines, the Chlorophyll anomaly method and the backscattering method.

Unweighted voting uses the votes of all the algorithms to decide which class a pixel belongs to. In our case, we used a threshold N. If at least N algorithms predict a pixel as red tide, the pixel is classified as red tide. Otherwise the pixel is classified as non red tide. We varied N from 1 to 5 in our experiments.

For weighted voting, each algorithm produces a weighted voting percentage between 0 and 1. For Random Forests, the weighted voting percentage is the percentage of trees that predict the pixel as red tide. For Support Vector Machines, it is the pixel's probability of belonging to the red tide class. For K-Nearest Neighbors, it is the percentage of red tide neighbors among the pixel's K nearest neighbors. For the Chlorophyll anomaly method, it is the linearly normalized distance between the pixel's Chlorophyll anomaly and its thresholding value. For Cannizzaro's method, it is the linearly normalized distance between the pixel's CHL, FLH and BBP and their thresholding values. A pixel is classified as red tide if the sum of the 5 weighted voting percentages is not less than 2.5.
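The two voting rules described above can be sketched as follows. The function names are hypothetical, and the per-algorithm weighted voting percentages are assumed to be already computed as values between 0 and 1; the sketch only shows the final decision step.

```python
def unweighted_vote(predictions, n):
    """Red tide if at least n of the algorithms predict red tide.

    predictions: one boolean per algorithm, True meaning "red tide".
    """
    return sum(predictions) >= n

def weighted_vote(percentages, cutoff=2.5):
    """Red tide if the weighted voting percentages sum to at least cutoff.

    percentages: one value in [0, 1] per algorithm (5 algorithms here).
    """
    return sum(percentages) >= cutoff
```

For example, with votes `[True, True, False, False, False]` a pixel is red tide for N=2 but not for N=3, which illustrates why small N favors detection and large N favors avoiding false alarms.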


CHAPTER 4
EXPERIMENT SETUP

4.1 Data set and pre-processing

For each day on which clouds did not cover the whole area of interest, one MODIS image of the West Florida Shelf was used. Ground truth red tide data for the West Florida Shelf was collected by ships of the Florida Fish and Wildlife Research Institute (FWRI). We have 17649 ground truth points from Jan 1, 2003 to Apr 20, 2007. Water with a K. brevis cell count higher than 15000 cells/liter is regarded as red tide water; otherwise it is non red tide.

Although we have 17649 ground truth points, because of clouds or mechanical failures of the satellite just 1969 of these points are associated with valid, concurrent, and co-located MODIS data. To get more ground truth data for our experiment, we developed the ground truth approximation strategy shown in Figure 9.


Figure 9. Flow chart for ground truth approximation
[Flow chart: Is a data point P bad due to cloud cover or other failure? If false, the MODIS data at P can be used directly. If true, let S be the set of MODIS data points that are within the 5×5 spatial neighborhood of P for two days before or after the occurrence of P. Depending on a test on S, either the MODIS data at P are abandoned, or P is labeled with the class that is the majority in S and assigned the satellite features of the data point with the closest spatial and temporal distance.]

After this process, we had 2695 ground truth red tide pixels and 8165 non red tide pixels across 832 days from Jan 1, 2003 to Apr 20, 2007.

To demonstrate the effectiveness of training machine learning models with the approximated ground truth data, Section 5.2 provides a summary of the F-measures of 3-Nearest Neighbors, Random Forests and Support Vector Machines with and without the approximated ground truth for algorithm training.

Each attribute was normalized to a value between 0 and 1 for each image by:

$x_{ij} = \dfrac{v_{ij} - \min_j(v_{ij})}{\max_j(v_{ij}) - \min_j(v_{ij})}$

where $v_{ij}$ is the original value of channel j for pixel i and $x_{ij}$ is its


value after normalization. $\max_j(v_{ij})$ and $\min_j(v_{ij})$ are the maximum and minimum values for channel j in the training set.

Some of the satellite features can reach a few extremely high and abnormal values (e.g. CHL > 90 or FLH > 0.51) due to error induced by the satellite. To filter out those extreme cases, we took all MODIS images from 2003 to 2007 and, for each channel j, sorted the data from high to low. We then set the maximum value of this channel, $\max_j(v_{ij})$, to the value ranked at the k-th position (k = round(number of all pixels × 0.3%)). Any $v_{ij}$ bigger than $\max_j(v_{ij})$ is normalized to 1.

4.2 Accuracy assessment

To understand how the algorithms work for both red tide and non red tide water, we used a confusion matrix as shown in Table 1.

Table 1. Notation on accuracy assessment

                Classified as red tide    Classified as non red tide
Red tide        A                         B
Non red tide    C                         D

A = number of pixels which are red tide and are classified as red tide
B = number of pixels which are red tide and are classified as non red tide
C = number of pixels which are non red tide and are classified as red tide
D = number of pixels which are non red tide and are classified as non red tide

TP=A/(A+B), or true positive rate, indicates an algorithm's sensitivity at red tide detection.


TN=D/(C+D), or true negative rate, indicates an algorithm's sensitivity at non red tide detection.

Similarly, we used the following notation for the false negative rate, FN=B/(A+B), and the false positive rate, FP=C/(C+D).

To describe an algorithm's overall accuracy, considering correct recognition of both red tide and non red tide cases, we used the F-measure (Witten et al., 2005), which is:

FM=2*A/(2*A+C+B)

In the following sections, we use the F-measure value as the major benchmark to evaluate each algorithm.

Besides F-measures, we use the arithmetic and geometric means to analyze the accuracies of the red tide detection algorithms:

AM=(TP+TN)/2 (Arithmetic mean)
GM=(TP+TN)^0.5 (Geometric mean)

We also use the receiver operating characteristic (ROC) curve (Provost and Fawcett, 1997; Provost et al., 1998) to evaluate the different methods in our experiments. The ROC curve presents an algorithm's two operating characteristics (true positive rate and false positive rate) as a 2-D graph with the false positive rate on the X-axis and the corresponding true positive rate on the Y-axis. This curve shows the cost and benefit of changing an algorithm's conditions. In our experiments, different true positive rates and false positive rates for each machine learning model or thresholding method are generated by varying their respective thresholds. For Random Forests, we varied the threshold on the percentage of trees that predict red tide. For Support Vector Machines, we varied the threshold on the probability of belonging to the red tide class. For the


Chlorophyll anomaly method, we varied its Chlorophyll anomaly threshold. For the backscattering method, we varied its CHL, FLH and BBP thresholds.

The area under the ROC curve (AUC) has been proposed as a single-number measure of algorithm performance. The AUC values for Random Forests, Support Vector Machines, the Chlorophyll anomaly method and the backscattering method are computed for comparison in our study.

4.3 Algorithm implementation

For our implementation of Support Vector Machines, we used C-SVM (Vapnik 2000) in the LIBSVM package (Chang et al., 2009). The radial basis function kernel was used.

In the experiment with Random Forests, we set the split criterion to the C4.5 style. The number of trees was 1000. The Random Forests experiments were done with the OpenDT (Banfield, 2003) system.

In the K-Nearest Neighbors experiment, we set the parameter K=3.

About 75% of the in situ cell count data were labeled as "non red tide". With such skewed data, machine learning methods may easily produce models that classify all pixels as "non red tide" to obtain high accuracy (Witten et al., 2005). To overcome this challenge, we randomly chose only B% of the majority class (non red tide) for training. All in situ data labeled as "red tide" were chosen. B was selected in the following way: we divided the training set into 5 chunks, using 4 chunks for training and 1 chunk for testing. Different values of B from 10 to 100 were tested on each of the 5 chunks. We used the B with the highest average F-measure on those 5 chunks for the whole training set.
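The under-sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the label strings, the fixed random seed, and rounding the kept count to the nearest integer are all hypothetical choices, not details taken from our implementation.

```python
import random

def undersample(samples, labels, b_percent, seed=0):
    """Keep every red tide sample and a random b_percent of the others."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    pairs = list(zip(samples, labels))
    minority = [p for p in pairs if p[1] == "red tide"]
    majority = [p for p in pairs if p[1] != "red tide"]
    # Number of majority-class (non red tide) samples to keep.
    keep = round(len(majority) * b_percent / 100)
    return minority + rng.sample(majority, keep)
```

With 25 red tide and 75 non red tide samples and B=40, the reduced training set holds all 25 red tide samples and 30 non red tide samples.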


For Support Vector Machines, after the percentage B was selected, we selected the regularization constant C by dividing the training set into 5 chunks, using 4 chunks for training and 1 chunk for testing. Different values of C from 0.5 to 4096 (doubled on each new experiment) were tested on each of the 5 chunks. For each training set, we used the C with the highest average F-measure on those 5 chunks for the whole training set.

Two thirds of the ground truth data were randomly selected for training, and from the remaining one third, only points measured in situ without approximation were used for testing. 3-Nearest Neighbors, Random Forests and Support Vector Machines used the same training and testing sets. This process of randomly selecting the training and testing data was repeated 30 times. We present the averaged results over the 30 testing sets for each machine learning method.
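The parameter search described above (a doubling grid for C, scored by the average F-measure over 5 chunks) can be sketched as follows. The `score` callback is hypothetical: it stands for training a model on the 4 training chunks with the candidate value and returning the F-measure on the held-out chunk. The round-robin way of forming the chunks is also an assumption, since the text does not say how the chunks were formed.

```python
def doubling_grid(start=0.5, stop=4096.0):
    """Candidate values 0.5, 1, 2, ..., 4096."""
    values, c = [], start
    while c <= stop:
        values.append(c)
        c *= 2
    return values

def five_chunks(items):
    """Split items into 5 chunks (round-robin; an assumption of this sketch)."""
    return [items[i::5] for i in range(5)]

def select_best(candidates, items, score):
    """Return the candidate with the highest average score over the 5 chunks.

    score(train, test, candidate) -> F-measure; a hypothetical callback.
    """
    chunks = five_chunks(items)

    def average(candidate):
        total = 0.0
        for i, test in enumerate(chunks):
            train = [x for j, chunk in enumerate(chunks) if j != i for x in chunk]
            total += score(train, test, candidate)
        return total / len(chunks)

    return max(candidates, key=average)
```

The same `select_best` helper could be reused for the percentage B by passing the candidate values of B instead of the grid for C.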


CHAPTER 5
ALGORITHM PERFORMANCE

5.1 Machine learning approaches

The three machine learning approaches had higher F-measures than all thresholding methods, as shown in Table 2. Using a two-sided Wilcoxon rank test (Wilcoxon, 1945) at a confidence level of 95% (one of the most common confidence levels for hypothesis testing), the F-measures of all methods are significantly different, except those of Support Vector Machines and Random Forests. The F-measures ranked as follows (from high to low): Support Vector Machines, Random Forests, 3-Nearest Neighbors, the backscattering method, and the Chlorophyll anomaly method.

Table 2. F-measures of 3-Nearest Neighbors, Random Forests, Support Vector Machines, the Chlorophyll anomaly method and the backscattering method

Methods                       F-measure
Support Vector Machines       0.590
Random Forests                0.581
3-Nearest Neighbors           0.562
Backscatter method            0.480
Chlorophyll anomaly method    0.463


5.2 Machine learning methods with and without approximated ground truth data for training

To investigate the effectiveness of training machine learning models with the approximated ground truth data described in Section 4.1, we compared the F-measures of 3-Nearest Neighbors, Random Forests, and Support Vector Machines with and without the approximated ground truth data for algorithm training. Using a two-sided Wilcoxon test at a confidence level of 95%, the F-measures of Random Forests and 3-Nearest Neighbors trained with approximated ground truth data improved with statistical significance. Support Vector Machines showed no statistically significant difference between training with and without approximated ground truth data. Results are shown in Table 3. Generally speaking, the approximated ground truth data increased the performance of the machine learning methods.

Table 3. F-measures of machine learning methods with and without approximated ground truth data for training

Machine learning methods    F-measure with approximated    F-measure without approximated    Statistical significance
                            ground truth data              ground truth data                 of F-measure change
Support Vector Machines     0.590                          0.591                             Not significant
Random Forests              0.581                          0.554                             Significant
3-Nearest Neighbors         0.562                          0.549                             Significant

As shown in Tables 4 and 5, Support Vector Machines without approximated ground truth data correctly detected more red tide than Support Vector Machines with


approximated ground truth data. However, Support Vector Machines without approximated ground truth data classified more non red tide pixels as red tide.

Table 4. Confusion matrix for Support Vector Machines, without approximated ground truth

                Classified as red tide    Classified as non red tide
Red tide        420                       176
Non red tide    405                       964

Table 5. Confusion matrix for Support Vector Machines, with approximated ground truth

                Classified as red tide    Classified as non red tide
Red tide        419                       186
Non red tide    395                       970

5.3 Machine learning models with additional input features used in thresholding methods

In our experiments, adding the input features developed in the Chlorophyll anomaly method and the backscattering method increased the F-measures of 3-Nearest Neighbors and Random Forests but decreased the F-measure of Support Vector Machines. Those differences are statistically significant for Support Vector Machines and 3-Nearest Neighbors but not for Random Forests.


Table 6. F-measures of machine learning models with additional input features introduced in the Chlorophyll anomaly method and the backscattering method

Machine learning methods    F-measure    F-measure with additional input features       Statistical significance
                                         (Chlorophyll anomaly, BBP551 Morel's function)  of additional features
Support Vector Machines     0.591        0.562                                          Significant
Random Forests              0.581        0.586                                          Not significant
3-Nearest Neighbors         0.562        0.571                                          Significant

5.4 Combination of multiple algorithms by voting

The F-measures of weighted voting and of unweighted voting with N of 2 and 3 were higher than that of Support Vector Machines (0.591), as shown in Table 7. Unweighted voting with N=2 achieved the best F-measure among all voting strategies, 0.607, higher than the 0.597 of weighted voting. The F-measure of unweighted voting with N=2 was significantly higher than that of Support Vector Machines under a two-sided Wilcoxon rank test at a confidence level of 95%. Unweighted voting with N=5 (a pixel is classified as red tide only if it receives red tide votes from all 5 algorithms) had the lowest F-measure, 0.271.


Table 7. F-measures of hybrid approaches by voting

Voting method             F-measure
Weighted voting           0.597
Unweighted voting, N=1    0.579
Unweighted voting, N=2    0.607
Unweighted voting, N=3    0.605
Unweighted voting, N=4    0.484
Unweighted voting, N=5    0.271

5.5 ROC analysis

Figure 10 shows the ROC curves of Random Forests, Support Vector Machines, the Chlorophyll anomaly method, and the backscattering method. Different true positive rates and false positive rates were generated by varying their thresholds, as discussed in Section 4.2.


Figure 10. ROC curves for different methods

The AUC was computed for all 4 algorithms, as shown in Table 8. The AUC of Random Forests was higher than that of the other methods, and Random Forests had a higher true positive rate than the other methods when the false positive rate was between 0.08 and 0.28. For an application that requires a false positive rate between 0.08 and 0.28, Random Forests can be a good algorithm to use.

Table 8. AUC for different methods

Method                        AUC
Random Forests                0.754
Support Vector Machines       0.747
Backscattering method         0.699
Chlorophyll anomaly method    0.629
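The construction of an ROC curve by varying a threshold, and the AUC derived from it, can be sketched as follows. The sketch assumes that each pixel has a single score (for example the percentage of trees voting red tide) and a boolean ground truth label; the function names and the trapezoidal integration are assumptions, and the code is not the procedure used to produce Table 8.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) for every score threshold.

    labels: True for red tide, False for non red tide.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a pixel is "red tide" if score >= t.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

A method that ranks every red tide pixel above every non red tide pixel reaches an AUC of 1, while a random ranking gives about 0.5; the values in Table 8 lie between these two extremes.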


5.6 Confusion matrix

Tables 9-15 show the confusion matrices for Support Vector Machines, Random Forests, 3-Nearest Neighbors, the Chlorophyll anomaly method, the backscattering method, weighted voting and unweighted voting with N=2. The Chlorophyll anomaly method and the backscattering method did not detect as many red tide pixels as the machine learning approaches or the voting methods.

Table 9. Confusion matrix for Support Vector Machines

                Classified as red tide    Classified as non red tide
Red tide        419                       186
Non red tide    395                       970

Table 10. Confusion matrix for Random Forests

                Classified as red tide    Classified as non red tide
Red tide        408                       197
Non red tide    389                       976

Table 11. Confusion matrix for 3-Nearest Neighbors

                Classified as red tide    Classified as non red tide
Red tide        427                       177
Non red tide    488                       877

Table 12. Confusion matrix for the Chlorophyll anomaly method

                Classified as red tide    Classified as non red tide
Red tide        255                       357
Non red tide    245                       1112

Table 13. Confusion matrix for the backscattering method

                Classified as red tide    Classified as non red tide
Red tide        242                       363
Non red tide    160                       1205


Table 14. Confusion matrix for weighted voting

                Classified as red tide    Classified as non red tide
Red tide        401                       204
Non red tide    336                       1029

Table 15. Confusion matrix for unweighted voting, N=2

                Classified as red tide    Classified as non red tide
Red tide        472                       132
Non red tide    477                       888

5.7 Arithmetic means and geometric means for red tide detection accuracies

Tables 16 and 17 show the arithmetic means and geometric means, respectively, for Support Vector Machines, Random Forests, 3-Nearest Neighbors, the Chlorophyll anomaly method, the backscattering method, weighted voting and unweighted voting with N=2 (classifying a pixel as red tide as long as any 2 out of the 5 algorithms classify it as red tide). They followed a pattern similar to that of the F-measures: unweighted voting with N=2 had the best accuracy benchmarks, followed by weighted voting, Support Vector Machines, Random Forests and then 3-Nearest Neighbors, while the Chlorophyll anomaly method and the backscattering method had the worst performance.
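The benchmarks of Section 4.2 can be recomputed from any of the confusion matrices above. The sketch below uses hypothetical function names and follows the expressions exactly as stated in Section 4.2, including GM=(TP+TN)^0.5; note that the conventional geometric mean would be (TP*TN)^0.5 instead, but the expression as stated is the one that agrees with the values listed in Table 17.

```python
def rates(a, b, c, d):
    """True positive and true negative rates from the Table 1 counts."""
    return a / (a + b), d / (c + d)

def f_measure(a, b, c):
    """FM = 2*A / (2*A + C + B)."""
    return 2 * a / (2 * a + c + b)

def arithmetic_mean(tp, tn):
    """AM = (TP + TN) / 2."""
    return (tp + tn) / 2

def gm_as_defined(tp, tn):
    """GM = (TP + TN)^0.5, as written in Section 4.2."""
    return (tp + tn) ** 0.5

# Counts from Table 9 (Support Vector Machines).
A, B, C, D = 419, 186, 395, 970
TP, TN = rates(A, B, C, D)
FM = f_measure(A, B, C)
AM = arithmetic_mean(TP, TN)
GM = gm_as_defined(TP, TN)
```

For the Support Vector Machines counts in Table 9, the resulting FM, AM and GM agree with the reported 0.590, 0.701 and 1.184 to within about 0.001.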


Table 16. Arithmetic means for different approaches' accuracies

Methods                       Arithmetic means
Unweighted voting, N=2        0.716
Weighted voting               0.708
Support Vector Machines       0.701
Random Forests                0.694
3-Nearest Neighbors           0.674
Backscatter method            0.641
Chlorophyll anomaly method    0.621

Table 17. Geometric means for different approaches' accuracies

Methods                       Geometric means
Unweighted voting, N=2        1.196
Weighted voting               1.190
Support Vector Machines       1.184
Random Forests                1.178
3-Nearest Neighbors           1.161
Chlorophyll anomaly method    1.115
Backscatter method            1.113

5.8 An example

Figures 11-17 show an example of the results from the different detection algorithms on September 21, 2006.


Figure 11. The Chlorophyll anomaly algorithm for red tide detection on September 21, 2006. White squares represent ground truth non red tide points. Blue squares represent ground truth red tide points. Pink pixels are those classified as red tide and green pixels are those classified as non red tide. Black pixels are those not available from the MODIS satellite or on the east coast of Florida. Pixels in mixed color represent the land.

The Chlorophyll anomaly algorithm detected most of the red tide occurrences, except in the region of latitude 27.0545 to 26.8363 and longitude -82.4454 to -82.5545 (rectangle 1) and the region of latitude 28.0090 to 27.5454 and longitude -83.2545 to -83.5454 (rectangle 2). The Chlorophyll anomaly method did not work in these regions because the Chlorophyll value was not anomalously high compared to the previous 2 months


average. This means that the Chlorophyll was also high during the previous months, which could be the case because there were few storms that fall. The Chlorophyll anomaly algorithm correctly classified most of the non red tide pixels, including two pixels in the Northwest Florida Shelf. However, the algorithm classified most of the shallow water (e.g. water in Tampa Bay) as red tide and created some false alarms.

Figure 12. The backscattering algorithm for red tide detection on September 21, 2006. White squares represent ground truth non red tide points. Blue squares represent ground truth red tide points. Pink pixels are those classified as red tide and green pixels are those classified as non red tide. Black pixels are those not available from the MODIS satellite or on the east coast of Florida. Pixels in mixed color represent the land.


The backscattering algorithm detected most of the red tide occurrences, except in the region of latitude 26.4818 to 26.2545 and longitude -82.0818 to -82.2545 (rectangle 3) and the region of latitude 28.0090 to 27.5454 and longitude -83.2545 to -83.5454 (rectangle 2). The failure in rectangle 3 occurred because the BBP values were higher than the values of Morel's function. A possible reason is that there was a high sediment load or that the sensor saw the bottom. Like the Chlorophyll anomaly algorithm, the backscattering algorithm correctly classified most of the non red tide pixels but incorrectly classified most of the shallow water pixels as red tide.
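The failure mode in rectangle 3 follows directly from the form of the backscattering rule in Section 2.2.1.2, which can be sketched as below. The function names are hypothetical, and the base-10 logarithm and the plus sign inside the bracket of Morel's function are assumptions of this sketch where the rule is not fully explicit; the thresholds are those stated in Section 2.2.1.2.

```python
import math

def morel_bbp(chl):
    """Morel's function: backscattering expected for a given Chlorophyll."""
    # Assumption: base-10 logarithm.
    return 0.3 * chl ** 0.62 * (0.002 + 0.02 * (0.5 - 0.25 * math.log10(chl)))

def backscatter_red_tide(chl, flh, bbp551):
    """Red tide iff CHL > 1, FLH > 0.01 and BBP551 is below Morel's function."""
    return chl > 1 and flh > 0.01 and bbp551 < morel_bbp(chl)
```

For a pixel with a Chlorophyll value of 10, Morel's function is about 0.0088, so a sediment-rich pixel with a BBP551 of 0.02 fails the third condition even when its Chlorophyll and FLH values are high.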


Figure 13. 3-Nearest Neighbors for red tide detection on September 21, 2006. White squares represent ground truth non red tide points. Blue squares represent ground truth red tide points. Pink pixels are those classified as red tide and green pixels are those classified as non red tide. Black pixels are those not available from the MODIS satellite or on the east coast of Florida. Pixels in mixed color represent the land.

The 3-Nearest Neighbors approach detected most of the red tide occurrences, including those in the regions where the Chlorophyll anomaly method and the backscattering method failed. It also correctly classified the coastal waters instead of classifying all of them as red tide. One problem of 3-Nearest Neighbors is its lack of regional homogeneity, while actual red tide occurrences generally have good regional homogeneity. It might be due to


the fact that the 3-Nearest Neighbors method is based on the votes of the 3 closest pixels in feature space.

Figure 14. Random Forests for red tide detection on September 21, 2006. White squares represent ground truth non red tide points. Blue squares represent ground truth red tide points. Pink pixels are those classified as red tide and green pixels are those classified as non red tide. Black pixels are those not available from the MODIS satellite or on the east coast of Florida. Pixels in mixed color represent the land.

The Random Forests approach detected most of the red tide occurrences, including those in the regions where the Chlorophyll anomaly method and the backscattering


method failed. It also correctly classified the coastal waters instead of classifying all of them as red tide. It had good regional homogeneity.

Figure 15. Support Vector Machines for red tide detection on September 21, 2006. White squares represent ground truth non red tide points. Blue squares represent ground truth red tide points. Pink pixels are those classified as red tide and green pixels are those classified as non red tide. Black pixels are those not available from the MODIS satellite or on the east coast of Florida. Pixels in mixed color represent the land.

The Support Vector Machines approach detected most of the red tide occurrences, including those in the regions where the Chlorophyll anomaly method and the backscattering method failed. It also correctly classified the coastal waters instead of classifying all of


them as red tide. It had good regional homogeneity. Compared with Random Forests, it produced fewer false alarms, that is, fewer non red tide pixels classified as red tide.

Figure 16. Weighted voting for red tide detection on September 21, 2006. White squares represent ground truth non red tide points. Blue squares represent ground truth red tide points. Pink pixels are those classified as red tide and green pixels are those classified as non red tide. Black pixels are those not available from the MODIS satellite or on the east coast of Florida. Pixels in mixed color represent the land.

The weighted voting approach detected most of the red tide occurrences, including those in the regions where the Chlorophyll anomaly method and the backscattering method failed. It had good regional homogeneity.


Figure 17. Unweighted voting with N=2 (classifying a pixel as red tide as long as any 2 out of the 5 algorithms classify it as red tide) for red tide detection on September 21, 2006. White squares represent ground truth non red tide points. Blue squares represent ground truth red tide points. Pink pixels are those classified as red tide and green pixels are those classified as non red tide. Black pixels are those not available from the MODIS satellite or on the east coast of Florida. Pixels in mixed color represent the land.

The unweighted voting approach with N=2 detected most of the red tide occurrences, including those in the regions where the Chlorophyll anomaly method and the backscattering method failed. It had good regional homogeneity. But like the Chlorophyll


anomaly method and the backscattering method, it classified many coastal waters as red tide.

The machine learning approaches are better than the previous thresholding approaches in terms of F-measure, with statistical significance under the Wilcoxon test at a confidence level of 95%. Support Vector Machines obtained the best F-measure among the machine learning approaches, while Random Forests obtained the best AUC. Adding the features used in thresholding methods to the machine learning approaches improved the F-measures of Random Forests and 3-Nearest Neighbors but decreased that of Support Vector Machines; the improvement for Random Forests had no statistical significance. Combining machine learning approaches and thresholding methods by voting can achieve a better F-measure than the machine learning approaches alone, with statistical significance. The performances in terms of the arithmetic and geometric means of detection accuracy follow a pattern similar to that of the F-measures.

The 3-Nearest Neighbors method can incrementally grow its training model with new incoming training data without rebuilding the model. This advantage is useful in red tide detection, since more and more ground truth red tide points may become available. However, at the current scale of the problem (around 10000 training points), both Random Forests and Support Vector Machines can rebuild their training models within 10 minutes, so this advantage is not significant yet.

In the case study of red tide detection on September 21, 2006, the machine learning algorithms detected more red tide than the Chlorophyll anomaly method and the backscattering method. Moreover, the Random Forests and Support Vector Machines approaches detected red tide in cases of rich sediment and low Chlorophyll and in regions with long-lasting red tide occurrences, where the backscattering method and the Chlorophyll


anomaly method did not work. The machine learning methods classified the coastal waters well instead of recognizing all of them as red tide.


CHAPTER 6
CONCLUSION AND DISCUSSIONS

We evaluated three machine learning approaches, based respectively on 3-Nearest Neighbors, Random Forests, and Support Vector Machines, for red tide detection using MODIS imagery. Random Forests and Support Vector Machines achieved F-measures 3% higher than 3-Nearest Neighbors and outperformed the Chlorophyll anomaly and the backscattering red tide detection methods by more than 10% in terms of F-measure, with statistical significance.

Thresholding methods focus on the biological and statistical relationship between red tide and remote sensing. This is the reason why thresholding methods can use a few satellite attributes to obtain acceptable red tide detection results. However, thresholding using 1-3 attributes performed worse than machine learning models using all 7 attributes, since the value of a single attribute can be inaccurate in the presence of suspended sediments and colored dissolved organic matter. Until the mechanism by which red tide occurrences change satellite attributes is fully understood, it would be a sound strategy to use all relevant satellite attributes in remote sensing based algorithms for red tide detection. Random Forests and Support Vector Machines might be implemented for red tide detection in the eastern Gulf of Mexico.

When adding the features introduced in the Chlorophyll anomaly method and the backscatter method as input features to Random Forests and Support Vector Machines,

PAGE 57

49 the F-measure of the Random Forests increa sed but the F-measure of Support Vector Machines decreased. Results of weighted voting and unweighted voting with N=2 outperformed the accuracy of Support Vector Machines in terms of F-measure with statistical significance. Those results indicate that features developed in the Chlorophyll anomaly method and the backscattering method represent certain in formation about red tide in MODIS data. Machine learning met hods can effectively extract statistical relationships in the data and their detection results can be improved with a good combination with bio-optical features. Future work includes selecting the best subset of satellite features and combining surr ounding environmental condition and image information to improve machine learning approaches’ accuracy.
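The voting combination of a machine learning detector with the two thresholding methods can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this work: it assumes that N is the minimum number of detectors that must agree before a pixel is labeled red tide, and the weights, the 0.5 threshold, and the per-pixel outputs are hypothetical.

```python
def unweighted_vote(predictions, n=2):
    """Label a pixel as red tide (1) when at least n detectors vote red tide."""
    return 1 if sum(predictions) >= n else 0


def weighted_vote(predictions, weights, threshold=0.5):
    """Label a pixel as red tide when the weighted share of positive votes
    reaches the threshold."""
    total = sum(weights)
    score = sum(w for p, w in zip(predictions, weights) if p == 1)
    return 1 if score / total >= threshold else 0


# Hypothetical per-pixel outputs in the order
# [Support Vector Machine, Chlorophyll anomaly, backscattering].
pixels = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 0]]

# Hypothetical weights that trust the machine learning detector most.
weights = [4, 1, 1]

unweighted = [unweighted_vote(p, n=2) for p in pixels]
weighted = [weighted_vote(p, weights) for p in pixels]
print(unweighted)  # [0, 1, 1, 0]
print(weighted)    # [1, 1, 0, 0]
```

With these assumed weights, the machine learning detector alone can carry a pixel, while the two thresholding methods together cannot; with N=2 and equal votes, any two detectors in agreement decide the label.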
