xml version 1.0 encoding UTF-8 standalone no
record xmlns http://www.loc.gov/MARC21/slim xmlns:xsi http://www.w3.org/2001/XMLSchema-instance xsi:schemaLocation http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd
leader nam Ka
controlfield tag 001 001498267
007 cr mnu|||uuuuu
008 041209s2004 flua sbm s000|0 eng d
datafield ind1 8 ind2 024
subfield code a E14-SFE0000537
Candade, Nivedita V.
Application of support vector machines and neural networks in digital mammography
h [electronic resource] :
a comparative study /
by Nivedita V. Candade.
[Tampa, Fla.] :
University of South Florida,
Thesis (M.S.B.E.)--University of South Florida, 2004.
Includes bibliographical references.
Text (Electronic thesis) in PDF format.
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
Title from PDF of title page.
Document formatted into pages; contains 101 pages.
ABSTRACT: Microcalcification (MC) detection is an important component of breast cancer diagnosis. However, visual analysis of mammograms is a difficult task for radiologists. Computer Aided Diagnosis (CAD) technology helps in identifying lesions and assists the radiologist in making his final decision. This work is a part of a CAD project carried out at the Imaging Science Research Division (ISRD), Digital Medical Imaging Program, Moffitt Cancer Research Center, Tampa, FL. A CAD system had been previously developed to perform the following tasks: (a) pre-processing, (b) segmentation and (c) feature extraction of mammogram images. Ten features covering spatial and morphological domains were extracted from the mammograms and the samples were classified as Microcalcification (MC) or False alarm (False Positive microcalcification/FP) based on a binary truth file obtained from a radiologist's initial investigation. The main focus of this work was two-fold: (a) to analyze these features, select the most significant features among them and study their impact on classification accuracy and (b) to implement and compare two machine-learning algorithms, Neural Networks (NNs) and Support Vector Machines (SVMs), and evaluate their performances with these features. The NN was based on the Standard Back Propagation (SBP) algorithm. The SVM was implemented using polynomial, linear and Radial Basis Function (RBF) kernels. A detailed statistical analysis of the input features was performed. Feature selection was done using the Stepwise Forward Selection (SFS) method. Training and testing of the classifiers were carried out using various training methods. Classifier evaluation was first performed with all the ten features in the model. Subsequently, only the features from SFS were used in the model to study their effect on classifier performance. Accuracy assessment was done to evaluate classifier performance. 
Detailed statistical analysis showed that the given dataset offered poor discrimination between classes and proved a very difficult pattern recognition problem. The SVM performed better than the NN in most cases, especially on unseen data. No significant improvement in classifier performance was noted with feature selection. However, with SFS, the NN showed improved performance on unseen data. The training time taken by the SVM was several orders of magnitude less than that of the NN. Classifiers were compared on the basis of their accuracy and parameters like sensitivity and specificity. Free Receiver Operating Characteristic (FROC) curves were used for evaluation of classifier performance. The highest accuracy observed was about 93% on training data and 76% on testing data with the SVM using Leave One Out (LOO) Cross Validation (CV) training. Sensitivity was 81% and 46% on training and testing data respectively for a threshold of 0.7. The NN trained using the 'single test' method showed the highest accuracy of 86% on training data and 70% on testing data with respective sensitivity of 84% and 50%. The threshold in this case was -0.2. However, FROC analyses showed overall superiority of the SVM, especially on unseen data. Both spatial and morphological domain features were significant in our model. Features were selected based on their significance in the model. However, when tested with the NN and SVM, this feature selection procedure did not show significant improvement in classifier performance. It was interesting to note that the model with interactions between these selected variables showed excellent testing sensitivity with the NN classifier (about 81%). Recent research has shown SVMs outperform NNs in classification tasks. SVMs show distinct advantages such as better generalization, increased speed of learning, ability to find a global optimum and ability to deal with linearly non-separable data. 
Thus, though NNs are more widely known and used, SVMs are expected to gain popularity in practical applications. Our findings show that the SVM outperforms the NN. However, its performance depends largely on the nature of data used.
Adviser: Qian, Wei.
Free Receiver Operating Characteristics (FROC).
x Biomedical Engineering
t USF Electronic Theses and Dissertations.
Application Of Support Vector Machines And Neural Networks In Digital Mammography: A Comparative Study
by
Nivedita V. Candade
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Biomedical Engineering
Department of Chemical Engineering
College of Engineering
University of South Florida
Major Professor: Wei Qian, Ph.D.
Barnali Dixon, Ph.D.
William E. Lee III, Ph.D.
Date of Approval: October 28, 2004
Keywords: breast cancer, microcalcifications, pattern recognition, feature selection, Free Receiver Operating Characteristics (FROC)
Copyright 2004, Nivedita V. Candade
ACKNOWLEDGEMENTS

I take this opportunity to express my gratitude to all the people who have helped me accomplish this work, directly or indirectly. I appreciate all the people who have contributed to my life, in the present and the past.

I thank my major professor, Dr. Wei Qian, Associate Professor, Imaging Science Research Division, Moffitt Cancer Research Center, for giving me an opportunity to work on this project. I am immensely grateful to his group, Dr. Xue Jun Sun and Dansheng Song, for helping me a great deal with my work. Both of them have been very involved in this project and have been more than willing to help me out.

I would like to thank Dr. Barnali Dixon for being an exceptional guide and advisor. She has not only helped me accomplish this work in many ways, but has also given me fantastic opportunities in my academic career and opened doors to prospects that I never dreamt of. I am greatly indebted to her for what I am today, for she has truly made a difference in my life. I thank her for being a great friend, philosopher and guide. Without her support and constructive criticism, this work would never have reached completion. I also thank Dr. William Lee for being on my committee, and for his time and comments.

I would like to thank some of my friends who have seen me through this work. I thank Dr. Arun Karpur, a close friend and a statistics pro, for his invaluable time and help with data analysis. He has never hesitated to help me out or teach me, and I really appreciate his time and patience. I would like to thank Sruthi for helping me with any question on my thesis. She has been my 24-hour 'help line' and a great friend. I wish to thank my friends and roommates Jyothi, Rashmi and Preksha for being my 'family' and for the great times I had in Tampa. I thank Bharath and Balaji for their help and concern and for cheering me up during all my adversities. I thank all of them, and Nishant, for always welcoming me with a smile whenever I gate-crashed at their place. This semester has not been easy, with hurricanes and health problems, and all these people have helped me pull through these obstacles. I thank my friends for being there for me and for moments that I will cherish throughout my life.

Finally, I would like to thank my Mom and Dad for being my pillars of strength, love and encouragement. They are the wind beneath my wings, and mere words cannot articulate my gratitude and respect for them. Good upbringing, education and unconditional love... these were the greatest gifts they could give me!

This space is probably not enough to remember and thank all the people who have been a part of my life. There is a part of me in two ends of the world, and it goes without saying that wherever I may go, whatever I may do, my family and friends back home will remain the most precious jewels in my treasure trove! I feel blessed and fortunate to have such wonderful people as part of my life's journey.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT
CHAPTER 1 INTRODUCTION
  1.1 Motivation
  1.2 Objectives and Approach
  1.3 Study Outline
CHAPTER 2 LITERATURE REVIEW
  2.1 Background: Mammography
  2.2 Types of mammography
    2.2.1 Screening mammography
    2.2.2 Diagnostic mammography
  2.3 Mammographic Abnormalities
    2.3.1 Mass
    2.3.2 Calcification
  2.4 Limitations of Mammograms
  2.5 Computer Aided Diagnosis (CAD) for Mammography
    2.5.1 Components of CAD
      2.5.1.1 Pre-processing
      2.5.1.2 Enhancement and Segmentation
      2.5.1.3 Feature extraction and Classification
    2.5.2 Feature Selection
    2.5.3 Limitations of CAD
CHAPTER 3 CLASSIFICATION ALGORITHMS
  3.1 Machine Learning Principles
  3.2 Neural Networks
    3.2.1 The Standard Back Propagation Algorithm (SBP)
    3.2.2 Over-learning and Generalization
  3.3 Support Vector Machines
    3.3.1 Structural Risk Minimization (SRM)
  3.4 Comparison of SVMs and NNs
CHAPTER 4 MATERIALS AND METHODS
  4.1 Schematic of Proposed CAD System
  4.2 Database Description
  4.3 Image Pre-processing and Segmentation
  4.4 Feature Description
    4.4.1 Spatial Domain Features
    4.4.2 Morphology Domain Features
    4.4.3 Boundary Definitions
  4.5 Data Analysis and Classification
    4.5.1 Data Analysis
    4.5.2 Feature Selection
    4.5.3 Classification
    4.5.4 Evaluation
CHAPTER 5 RESULTS AND DISCUSSION
  5.1 Statistical Analysis of Features
  5.2 Feature Selection using SFS
  5.3 Classification
CHAPTER 6 CONCLUSION AND FUTURE WORK
  6.1 Concluding Remarks
  6.2 Future Work
REFERENCES
LIST OF TABLES

Table 1 Summary Of BIRADS Classification Of Calcifications
Table 2 Input Features
Table 3 Sample Of Training Data Used In The Study
Table 4 Confusion Matrix
Table 5 Univariate Statistics Of Input Features For Both Classes
Table 6 Logistic Fit Of Outcome By Individual Predictors
Table 7 Forward Selection Results For Data: Main Effect Model
Table 8 Forward Selection Results For Data: Interaction Effect Model
Table 9 Choice Of SVM Kernel
Table 10 Accuracy And Confusion Matrix For Experiment #1
Table 11 Accuracy And Confusion Matrix For Experiment #2
Table 12 Accuracy And Confusion Matrix For Experiment #3
Table 13 Accuracy And Confusion Matrix For Experiment #4
Table 14 Accuracy And Confusion Matrix For Experiment #5
LIST OF FIGURES

Figure 1 Mammographic Anatomy Of The Breast ("Interactive Mammography Analysis Web Tutorial", 1999)
Figure 2 Views In Screening Mammography: Cranio-Caudal (CC) And Mediolateral Oblique (MLO) Views (Imaginis, 2004)
Figure 3 Views In Diagnostic Mammography. (Left) Cranio-Caudal (CC) And Mediolateral Oblique (MLO) Views, (Center) Latero Medial (LM) View, (Right) Medio Lateral (ML) View (Imaginis, 2004)
Figure 4 Descriptors For (Left) Shape, (Right) Margins (Imaginis, 2004)
Figure 5 Stages In A CAD Process
Figure 6 The Filter Model
Figure 7 The Wrapper Model
Figure 8 A Perceptron
Figure 9 Multilayer Perceptron
Figure 10 (Left) Several Feasible Hyperplanes, (Right) Optimal Separating Hyperplane
Figure 11 Kernel Mapping From Input Space To Feature Space
Figure 12 Schematic Of CAD System
Figure 13 Detailed Schematic Of Training And Testing Of SVM And NN Algorithms
Figure 14 Arch Distortion With Suspected Microcalcification, (Top Left) Raw Image, (Top Right) ROI, (Bottom Left) Section Of Segmented Image Including MCs And FPs
Figure 15 Image Containing Both The Arch Distortions And Faint Microcalcifications, (Top Left) Raw Image, (Top Right) ROI, (Bottom Left) Section Of Segmented Image, Includes MCs And FPs
Figure 16 Other Examples Of Microcalcifications Outlined By The Radiologist
Figure 17 Architecture Of NN
Figure 18 Histograms Of Individual Input Features
Figure 19 Plot Of PC-1 Vs PC-2
Figure 20 ROC Curve For Main Effect Model, C=0.693
Figure 21 ROC Curve For Interaction Effect Model, C=0.77
Figure 22 FROC Curves For (Left) Training And (Right) Testing Images, C=50 For Class 1 And C=10 For Class -1, RBF Kernel Radius=7; Polynomial Kernel, Degree=3
Figure 23 FROC Curve For NN Using All Ten Features: Single Test
Figure 24 FROC Curve For SVM Using All Ten Features: Single Test
Figure 25 Comparison Of FROC Curves For NN And SVM, (Left) Training Images, (Right) Testing Images
Figure 26 FROC Curve For NN Using All Ten Features: LOO
Figure 27 FROC Curve For SVM Using All Ten Features: LOO
Figure 28 Comparison Of FROC Curves For NN And SVM, (Left) Training Images, (Right) Testing Images
Figure 29 FROC Curve For NN Using All Ten Features: Training With Alternate Classes
Figure 30 FROC Curve For SVM Using All Ten Features: Training With Alternate Classes
Figure 31 Comparison Of FROC Curves For NN And SVM, (Top Left) Training Dataset, (Top Right) 12 Training Images, (Bottom Left) 10 Testing Images
Figure 32 Comparison Of FROC Results From Experiments 1, 2 And 3
Figure 33 FROC Curves For Training And Testing Images Using SFS Main Effect Variables
Figure 34 FROC Curves For Training And Testing Images Using SFS Interaction Effect Variables
Figure 35 Comparison Of FROC Graphs For Models With And Without Feature Selection
Figure 36 (Top) Raw Image With Suspicious ROI Outlined, (Bottom Left) Segmented Image, (Bottom Right) Image Showing MC Clusters With Reduced FPs
LIST OF ABBREVIATIONS

ACR     American College of Radiology
ACS     American Cancer Society
AFP     Average False Positive
ANN     Artificial Neural Network
BIRADS  Breast Imaging Reporting and Data System
CAD     Computer Aided Diagnosis
CC      Cranial Caudal
CV      Cross Validation
CWMF    Central Weighted Median Filters
DN      Digital Number
FCM     Fuzzy C-Means algorithm
FN      False Negative
FNR     False Negative Rate
FP      False Positive
FPR     False Positive Rate
FROC    Free Receiver Operating Characteristic
LL      Log Likelihood
LM      Latero Medial
LOO     Leave One Out
MC      Micro Calcification
ML      Medio Lateral
MLE     Maximum Likelihood Estimation
MLO     Medio Lateral Oblique
NN      Neural Network
OSH     Optimal Separating Hyperplane
PC      Principal Components
PCA     Principal Component Analysis
QP      Quadratic Programming
RBF     Radial Basis Function
ROC     Receiver Operating Characteristic
ROI     Region of Interest
SBE     Stepwise Backward Elimination
SBP     Standard Back Propagation
SFS     Stepwise Forward Selection
SMO     Sequential Minimal Optimization
SRM     Structural Risk Minimization
SSE     Sum of Square Error
SVM     Support Vector Machine
TN      True Negative
TNR     True Negative Rate
TP      True Positive
TPR     True Positive Rate
TSF     Tree Structured Filter
VC      Vapnik Chervonenkis dimension
WT      Wavelet Transformation
APPLICATION OF SUPPORT VECTOR MACHINES AND NEURAL NETWORKS IN DIGITAL MAMMOGRAPHY: A COMPARATIVE STUDY

Nivedita V. Candade

ABSTRACT

Microcalcification (MC) detection is an important component of breast cancer diagnosis. However, visual analysis of mammograms is a difficult task for radiologists. Computer Aided Diagnosis (CAD) technology helps in identifying lesions and assists the radiologist in making his final decision.

This work is a part of a CAD project carried out at the Imaging Science Research Division (ISRD), Digital Medical Imaging Program, Moffitt Cancer Research Center, Tampa, FL. A CAD system had been previously developed to perform the following tasks: (a) pre-processing, (b) segmentation and (c) feature extraction of mammogram images. Ten features covering spatial and morphological domains were extracted from the mammograms and the samples were classified as Microcalcification (MC) or False alarm (False Positive microcalcification/FP) based on a binary truth file obtained from a radiologist's initial investigation.

The main focus of this work was two-fold: (a) to analyze these features, select the most significant features among them and study their impact on classification accuracy and (b) to implement and compare two machine-learning algorithms, Neural Networks (NNs) and Support Vector Machines (SVMs), and evaluate their performances with these features.

The NN was based on the Standard Back Propagation (SBP) algorithm. The SVM was implemented using polynomial, linear and Radial Basis Function (RBF) kernels. A detailed statistical analysis of the input features was performed. Feature selection was done using the Stepwise Forward Selection (SFS) method. Training and testing of the classifiers were carried out using various training methods. Classifier evaluation was first performed with all the ten features in the model. Subsequently, only the features from SFS were used in the model to study their effect on classifier performance. Accuracy assessment was done to evaluate classifier performance.

Detailed statistical analysis showed that the given dataset offered poor discrimination between classes and proved a very difficult pattern recognition problem. The SVM performed better than the NN in most cases, especially on unseen data. No significant improvement in classifier performance was noted with feature selection. However, with SFS, the NN showed improved performance on unseen data. The training time taken by the SVM was several orders of magnitude less than that of the NN. Classifiers were compared on the basis of their accuracy and parameters like sensitivity and specificity. Free Receiver Operating Characteristic (FROC) curves were used for evaluation of classifier performance.

The highest accuracy observed was about 93% on training data and 76% on testing data with the SVM using Leave One Out (LOO) Cross Validation (CV) training. Sensitivity was 81% and 46% on training and testing data respectively for a threshold of 0.7. The NN trained using the 'single test' method showed the highest accuracy of 86% on training data and 70% on testing data with respective sensitivity of 84% and 50%. The threshold in this case was -0.2. However, FROC analyses showed overall superiority of the SVM, especially on unseen data.

Both spatial and morphological domain features were significant in our model. Features were selected based on their significance in the model. However, when tested with the NN and SVM, this feature selection procedure did not show significant improvement in classifier performance. It was interesting to note that the model with interactions between these selected variables showed excellent testing sensitivity with the NN classifier (about 81%).

Recent research has shown SVMs outperform NNs in classification tasks. SVMs show distinct advantages such as better generalization, increased speed of learning, ability to find a global optimum and ability to deal with linearly non-separable data. Thus, though NNs are more widely known and used, SVMs are expected to gain popularity in practical applications. Our findings show that the SVM outperforms the NN. However, its performance depends largely on the nature of data used.
CHAPTER 1 INTRODUCTION

Breast cancer is the second leading cause of cancer deaths in women today (after lung cancer). According to the World Health Organization, more than 1.2 million people will be diagnosed with breast cancer this year worldwide (Imaginis, 2004). Currently, approximately 3 million women in the US are living with the disease (Center, 2004). According to American Cancer Society (ACS) estimates, 215,990 cases of invasive breast cancer will be diagnosed in 2004. In the same year, it is also estimated that 1,450 men will be diagnosed with breast cancer. The 2004 estimates include nearly 40,580 deaths from breast cancer in the US alone. According to the National Cancer Institute, one out of eight women will develop breast cancer during her lifetime.

Breast cancer stages range from Stage 0 (a very early form of cancer) to Stage IV (advanced, metastatic breast cancer) (Imaginis, 2004). Early stage breast cancers are associated with higher survival rates than late stage cancers.

The key to surviving breast cancer is early detection and treatment. According to the ACS, when breast cancer is confined to the breast, the five-year survival rate is almost 100%. Breast cancer screening has been shown to reduce breast cancer mortality (Society, 2004). Currently, 63% of breast cancers are diagnosed at a localized stage, for which the five-year survival rate is 97%. The high survival rates associated with early detection of breast cancer can be attributed to the use of mammography screening as well as to high levels of awareness of the disease symptoms in the population.
1.1 Motivation

Mammography is used for breast cancer screening and diagnosis, for the detection and characterization of abnormalities that may be malignant (Association, 2002). Approximately 85% sensitivity (the proportion of diseased cases correctly detected as positive) is achieved with conventional film-screen mammography, though results are operator dependent and may vary with reader expertise. Considerable research has gone into finding techniques that can improve sensitivity and reduce variability among readers.

One method of reducing missed MCs, or the false-negative (FN) rate, in screening mammography is the double reading of mammograms (Anttinen I, 1993; Thurfjell E.L., 1994). Investigations of this method reported increases in cancer detection rates of as much as 15% (Hendee WR, 1999). However, this method is both time-consuming and not cost-effective.

The incorporation of computer algorithms to increase sensitivity in screening mammography has gained popularity in recent years (Chan HC, 1990; Kregelmeyer WP, 1994; Nishikawa RM, 1995; te Brake GM, 1998; Vyborny, 1994; Warren Burhenne LJ, 2000). Findings indicated the potential of Computer Aided Diagnosis (CAD) to reduce the false negative (FN) rate by 50%-70%. CAD systems use computerized algorithms for identifying suspicious regions of interest (ROIs).

The motivation behind CAD systems is to reduce both the False Positive Rate (FPR) and the False Negative Rate (FNR). When used as intended, CAD would be expected to increase the number of mammograms interpreted as positive to the extent that it points out abnormalities previously overlooked by the radiologist. On the other hand, the cost of missed or undetected abnormalities (FNs) is very high.

This work presents a part of a CAD scheme for the detection of microcalcifications in mammograms using NNs and SVMs. This would be an aid to a radiologist who would have already outlined suspected abnormalities. This system provides a classification scheme which would aid the radiologist in making a final diagnosis.

Research (Edwards DC, 2000; Woods KS, 1993; Zhang W, 1996) has shown that the use of classifiers based on Artificial NNs (ANNs or simply NNs) can improve the performance of a detection scheme. NNs (Hagan MT, 1996) have been successful in many applications, especially for clustering (Park, 2000) and pattern recognition (Gader PD, 1997). In recent years, the SVM (Chapelle O, 1999; Pontil M., 1998; Vapnik, 1995, 1998) has become an effective tool for pattern recognition, machine learning and data mining, because of its high generalization performance.

Given a set of points that all belong to one of two classes, an SVM can find the hyperplane that leaves the largest possible fraction of points of the same class on the same side, while maximizing the distance of either class from the hyperplane. This optimal separating hyperplane can minimize the risk of misclassifying examples of the test set. NNs, on the other hand, are based on the minimization of empirical risk, that is, the minimization of the number of misclassified vectors of the training set.

SVMs are attracting increasing attention because they rely on a solid statistical foundation and appear to perform quite effectively in many different applications (Lecun Y, 1995; M. Pontil, 1998; Osuna E, 1997). After training, the separating surface is expressed as a certain linear combination of a given kernel function centered at some of the data vectors (named support vectors). All the remaining vectors of the training set are effectively discarded, and the classification of new vectors is obtained solely in terms of the support vectors. SVMs also offer other advantages over multivariate classifiers. They avoid the optimization problems of NNs because SVM training is a convex programming problem, which guarantees finding a global solution.
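The statement that new vectors are classified solely in terms of the support vectors can be illustrated with a small sketch. The Python code below evaluates a decision value of the form f(x) = sum_i c_i K(s_i, x) + b over a handful of support vectors s_i with an RBF kernel. It is only an illustration under stated assumptions: the function names, the kernel width gamma, the two support vectors, their coefficients and the bias are all hypothetical values chosen by hand, and none of them comes from the classifiers trained in this study.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel exp(-gamma * ||u - v||^2); gamma is an assumed width parameter."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, coefficients, bias, kernel=rbf_kernel):
    """Return f(x) = sum_i coefficients[i] * K(support_vectors[i], x) + bias.

    coefficients[i] stands for the trained weight of the i-th support
    vector multiplied by its class label (+1 or -1).
    """
    weighted = sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coefficients))
    return weighted + bias


def svm_predict(x, support_vectors, coefficients, bias, threshold=0.0):
    """Assign class +1 when the decision value exceeds the threshold, else -1."""
    value = svm_decision(x, support_vectors, coefficients, bias)
    return 1 if value > threshold else -1


# Hypothetical two-feature model: values chosen by hand, not trained on any data.
SUPPORT_VECTORS = [(1.0, 1.0), (-1.0, -1.0)]
COEFFICIENTS = [1.0, -1.0]
BIAS = 0.0

if __name__ == "__main__":
    for point in [(0.9, 1.2), (-1.1, -0.8)]:
        print(point, svm_predict(point, SUPPORT_VECTORS, COEFFICIENTS, BIAS))
```

Only the stored support vectors enter the sum, so the cost of classifying a new sample grows with the number of support vectors rather than with the size of the full training set.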
SVMs are also much faster to evaluate than density estimators (such as maximum likelihood classifiers), because they use only the relevant data points rather than looping over every point regardless of its relevance to the decision boundary. Recent research has suggested that the SVM is superior to the NN (Burbidge R, 2001; Ding CH, 2001; Liang H, 2001). In this study, both algorithms were used to separate microcalcifications from false positive signals (false alarms), and their performance was evaluated.

1.2 Objectives and Approach

CAD systems consist primarily of the following processing stages: (a) pre-processing, (b) segmentation, (c) feature extraction and (d) classification. Stages (a)-(c) were part of previous work conducted on this dataset. The mammograms were first studied for abnormalities before they were given to the CAD system. Pre-processing was performed to reduce noise and artifacts and to enhance the image (Qian W, 1994). Segmentation was used to identify suspicious areas within the whole image. Feature extraction and selection is a crucial part of the CAD classification process and has a significant impact on classification accuracy. Ten features were extracted and given as inputs to the classification stage (Qian W, 2001). This work focused on classification (stage (d)) and feature selection. The database consisted of 22 mammograms, which included Cranial Caudal (CC) and Medio Lateral Oblique (MLO) view images of the breast.

The NN and SVM algorithms were implemented and their performance was evaluated. The NN was constructed using the MATLAB NN toolbox. The network used the Standard Back Propagation (SBP) algorithm for training. The SVM classifier was built in C using the LIBSVM toolbox (Chih-Chung Chang, 2001). LIBSVM is an integrated software package for support vector classification, regression and distribution estimation. It supports multi-class classification. The basic algorithm is a simplification of both Sequential Minimal Optimization (SMO) by Platt (Platt, 1999) and SVMLight by Joachims (Joachims, 1999). It is also a simplification of the modification of SMO by Keerthi et al (Keerthi, 1999). Several kernel options are supported by the classifier.

A detailed feature analysis was performed to evaluate the relationships between the input features and the outcome.
From this feature analysis, the most significant features were selected and tested with the classifiers. Free-response Receiver Operating Characteristic (FROC) curves were plotted for each experiment and compared. The classification algorithms were compared and the feature selection process was assessed.

1.3 Study Outline

This document is organized as follows. Chapter 2 discusses the medical background of breast cancer and reviews the literature on detection methods and CAD systems in mammography. Chapter 3 describes the classification algorithms studied. Chapter 4 describes the developed CAD module and the materials and methods used in this study. Results and discussion are presented in Chapter 5. Chapter 6 presents the conclusions and future work.
CHAPTER 2 LITERATURE REVIEW

Breast cancer is the second leading cause of cancer-related mortality in women, so it is crucial that it be detected in its early stages of development. Mammography has been used as a screening and diagnostic tool for the early detection of breast cancer. Screening mammography has proven to be effective for women 50-75 years of age (Kerlikowske K, 1995). A recent study showed that in women aged 40-49 years, screening mammography reduces breast cancer mortality by 16-18% (Rajkumar S, 1999). 80-85% of breast cancers are visible on a mammogram as a mass, a calcification or a combination of both (Mckenna RJ, 1994). CAD methods play an important role in improving diagnostic accuracy in mammogram interpretation. This chapter provides a background on mammography, types of mammograms and types of abnormalities, and an introduction to automated methods in breast cancer detection.

2.1 Background: Mammography

A mammogram is a test that is done to look for any abnormalities in a woman's breasts. The test uses an X-ray machine to take pictures of both breasts. With digital mammography, once the images are taken, they can be electronically manipulated. Digital mammography offers certain advantages over film mammography. Results can be obtained much faster, and the doctor can electronically manipulate the images (zoom in, magnify etc.) and transmit them to another site for viewing and printing (Systems, 2003).

Mammograms look for breast lumps and changes in breast tissue that may develop into problems over time. They can find abnormalities that a woman or a health care provider cannot feel during a physical examination. Breast lumps can be benign (non-cancerous) or malignant (cancerous). If a lump is found, a biopsy is done, in which a small amount of tissue is taken from the lump and the area around it. This tissue is then tested for cancer. Early detection of breast cancer increases the chances of a woman surviving the disease. Figure 1 shows a sample mammogram.

Figure 1 Mammographic Anatomy Of The Breast ("Interactive Mammography Analysis Web Tutorial", 1999)

2.2 Types of mammography

Two types of mammography exams are in practice today: screening and diagnostic.

2.2.1 Screening mammography

This is performed to detect breast cancer when it is too small to be felt by a physician or a patient. It is performed on women with no complaints or symptoms of breast cancer (Imaginis, 2004). The procedure involves taking x-ray images of two views of each breast. These views are typically from above (Cranial-Caudal view, CC) and from an angle (Medio Lateral Oblique view, MLO). The MLO is probably the most important and most common view taken, followed by the CC. These views are represented in Figure 2.
Figure 2 Views In Screening Mammography: Cranio-Caudal (CC) And Mediolateral Oblique (MLO) Views (Imaginis, 2004)

2.2.2 Diagnostic mammography

This is performed on a patient who has been evaluated as symptomatic by a physical exam or screening mammography. More views of the breast are usually taken than the two used in screening mammography, making it a more time-consuming and costly procedure. The objective here is to determine the exact size and location of the abnormality and to image the surrounding tissue and lymph nodes. Diagnostic mammography helps determine malignancy, following which a biopsy may be ordered. Biopsy is the only definitive way to ascertain breast cancer (Imaginis, 2004). Diagnostic mammography typically involves two additional views, the Latero Medial (LM) view and the Medio Lateral (ML) view, apart from the CC and MLO views discussed earlier. Further views may be taken depending on the nature of the problem.

Figure 3 Views In Diagnostic Mammography. (Left) Cranio-Caudal (CC) And Mediolateral Oblique (MLO) Views, (Center) Latero Medial (LM) View, (Right) Medio Lateral (ML) View (Imaginis, 2004)
2.3 Mammographic Abnormalities

A suspicious abnormality normally falls into one of three broad categories: (1) asymmetric density, (2) masses (including architectural distortion) and (3) calcifications (Imaginis, 2004). Masses often have distinguishing shape, size and margin characteristics. Likewise, calcifications can be characterized by their size, number, morphology, distribution and heterogeneity. These are the distinguishing characteristics on which a mammogram may be classified as benign or possibly malignant. Masses and calcifications are the most common features associated with cancer. They are discussed below.

2.3.1 Mass

Masses are three-dimensional lesions which may represent a localizing sign of breast cancer. A mass is a group of cells clustered together more densely than the surrounding tissue. A (non-cancerous) cyst may appear as a mass in a mammographic film. Masses can be caused by benign breast conditions or by breast cancer (Imaginis, 2004). They are characterized by their location, size, shape, margin characteristics, x-ray attenuation, effect on surrounding tissue, and other associated findings such as architectural distortion, associated calcifications and skin changes. A mass could be round, oval, lobular or irregular, or could show architectural distortion. Mass margins as defined by the Breast Imaging Reporting and Data System (BI-RADS) include: circumscribed, obscured, micro-lobulated, ill-defined and spiculated (Figure 4).

Figure 4 Descriptors For (Left) Shape, (Right) Margins (Imaginis, 2004)
2.3.2 Calcification

Microcalcifications are tiny (less than 1/50 of an inch, or 1/2 of a millimeter) specks of calcium that may be found in an area of rapidly dividing cells (Nagel Rufus H, 1998). Calcifications are important and common findings in mammograms. They may be intramammary, within and around the ducts, within the lobules, in vascular structures, or in interlobular connective tissue or fat. When many are seen in a cluster, they may indicate a small cancer. About half the cancers detected appear as these clusters. Microcalcifications are the most common mammographic sign of ductal carcinoma in situ (an early cancer confined to the breast ducts).

Most breast calcifications are benign. The term microcalcification is often used for calcifications found with malignancy, which are usually smaller, more numerous, clustered, and variously shaped (rods, branches, teardrops). Calcifications associated with benign conditions are usually larger, fewer in number, widely dispersed and round. These are termed macro-calcifications. In between are hard-to-tell calcifications that are often labeled indeterminate.

The number of calcifications that make up a cluster can be used as an indicator of benignity or malignancy. While the actual number itself is arbitrary, a minimum of four, five or six calcifications per cluster is considered to be of significance. The morphology of calcifications is considered to be the most important indicator in differentiating benign from malignant. As discussed earlier, round and oval shaped calcifications are more likely to be benign. Those associated with malignant processes resemble small fragments of broken glass and are rarely rounded or smooth (Imaginis, 2004).

The American College of Radiology (ACR) BI-RADS has classified findings of calcifications into three categories (Table 1):
(a) Typically benign
(b) Intermediate concern
(c) High probability of malignancy
Table 1 Summary Of BI-RADS Classification Of Calcifications

Category              Type of calcification          Characteristics
Typically benign      Skin                           Typical lucent center and polygonal shape
Typically benign      Vascular                       Parallel tracks or linear tubular calcifications that run along a blood vessel
Typically benign      Coarse or pop-corn like        Involuting fibroadenomas
Typically benign      Rod-shaped                     Large rod-like structures, usually larger than 1 mm
Typically benign      Round                          Smooth, round clusters
Typically benign      Punctate                       Round or oval calcifications
Typically benign      Spherical or lucent centered   Found in debris collected in ducts, in areas of fat necrosis
Typically benign      Rim or egg-shell               Found in wall of cysts
Typically benign      Milk of calcium                Calcium precipitates
Typically benign      Dystrophic                     Irregular in shape but usually large, more than 0.5 mm in size
Intermediate concern  Indistinct or amorphous        Appear round or flake shaped, small and hazy, uncertain morphology
High risk             Pleomorphic or heterogeneous   A cluster of these calcifications, irregular in shape and size and less than 0.5 mm, raises suspicion
High risk             Fine, linear or branching      Thin, irregular calcifications that appear linear from a distance

2.4 Limitations of Mammograms

Mammography can help detect breast cancer at an early stage, when the chances for successful treatment and survival are the greatest. Mammography can detect about 85% to 90% of breast cancers. However, mammographic films may be difficult for the radiologist to read, and in some cases abnormalities may be overlooked. False Negatives (FN) and False Positives (FP) are also possible. An FN means that cancer is actually present even though the mammogram looks normal. An FP occurs when the result indicates the presence of cancer even though this is not the case (4woman.gov, 2002). Younger women are more likely to have an FN mammogram because their breast tissue is denser, making cancer harder to spot. In such cases, where there is ambiguity in the results, a second interpretation would help the radiologist make a final decision.
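The FN and FP notions above are usually reported as rates. The following minimal Python sketch shows how sensitivity and the FN and FP rates follow from confusion counts. The function name `screening_rates` and the counts used are hypothetical and chosen only for illustration; they are not data from this study.

```python
def screening_rates(tp, fn, fp, tn):
    """Return sensitivity, FN rate, specificity and FP rate from confusion counts."""
    positives = tp + fn  # cases in which cancer is actually present
    negatives = fp + tn  # cases in which cancer is actually absent
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / positives
    specificity = tn / negatives
    return {
        "sensitivity": sensitivity,
        "fn_rate": 1.0 - sensitivity,
        "specificity": specificity,
        "fp_rate": 1.0 - specificity,
    }


# Hypothetical counts (assumption): 100 cancers, 85 of them detected,
# and 900 normal cases, 90 of them wrongly flagged.
rates = screening_rates(tp=85, fn=15, fp=90, tn=810)
```

With these illustrative counts the sensitivity is 0.85, which corresponds to the lower end of the 85% to 90% detection figure quoted above, and the FN rate is its complement.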
The CAD technology works as a "second reading" for radiologists, alerting them to areas on the image that require their attention. The following section describes the CAD system, its benefits and limitations, and its components in detail.

2.5 Computer Aided Diagnosis (CAD) for Mammography

CAD is a recent advance in the field of breast imaging. Studies on CAD technology estimate that for every 100,000 breast cancers currently detected with screening mammograms, CAD could result in the detection of an additional 20,500 breast cancers. In CAD, the computer marks abnormalities on the digitized films. After reviewing the results from CAD, the radiologist decides whether the marked area is indeed an abnormality that is of concern.

Mammograms are first loaded into a special processing unit that digitizes the mammogram images. The CAD unit incorporates special pattern recognition algorithms to highlight any detected breast abnormalities. In the meantime, the radiologist reviews the patient's mammogram and makes an interpretation. The radiologist then views the mammogram from the CAD system and modifies that interpretation if appropriate. CAD technology is designed to detect masses and calcifications in digital mammograms.

2.5.1 Components of CAD

The goal of the CAD system in this work is the detection of MCs and the reduction of false positive MCs on mammograms. The goal is also to achieve high sensitivity in order to detect MCs that a radiologist might miss. Clinical utility would depend strongly on the number of FPs per image, since radiologists must take extra time and care to read areas of the mammograms with FPs (Rufus H. Nagel, 1995). FPs can also reduce the confidence a radiologist has in using a CAD system. Therefore, it is important to reduce the number of computer FPs while maintaining high sensitivity.
There are many methods that can be used to classify MCs. Rule based methods (Chan HP, 1987; Davies DH, 1992) and NNs (Yoshida H, 1994; Zhang W, 1996) are two examples. The overall process involves several steps: pre-processing, segmentation, feature extraction and classification (Figure 5). Each module of the CAD process is discussed in the sections below, with emphasis on the classification and evaluation modules.

Figure 5 Stages In A CAD Process

2.5.1.1 Pre-processing

This module involves noise and artifact reduction, and intensity adjustment. Image enhancement is usually performed by noise reduction or contrast enhancement. An increase in contrast is essential in mammograms, especially for dense breasts (Ted C. Wang, 1998). Contrast between the malignant tissue and the normal dense tissue may be present in the mammogram but may not be discernible to the human eye. As a result, defining the characteristics of MCs is difficult (Ted C. Wang, 1998).

Conventional image processing techniques may not work well on mammographic images because of the large variation in feature size and shape (W. Morrow, 1992). There are two possible approaches to enhancing mammographic features. One is to suppress background noise and the other is to increase the contrast of suspicious areas.
Noise arising from the intrinsic characteristics of the imaging device and from the imaging process affects the detection sensitivity of CAD. Several types of filters have been reported (Qian W, 1994). Nonlinear filtering has proven more robust than linear filtering in preserving details of the image during noise reduction. Median filtering and selective median filtering locally adapt to the image gray scale using empirically derived threshold criteria (Lai SM, 1989). Selective median filtering is generally based on restricting the set of pixels within the selected window to those pixels whose difference in gray level is not greater than an empirically derived threshold. However, detail may be lost, since some pixels might be ignored within the filter window (Lai SM, 1989). Other methods such as straight line windowing (Chan HP, 1987) and hexagonal windows (Glatt A, 1992) have been introduced to non-linear filtering. Though these methods were more successful for noise suppression than linear approaches, they did not necessarily show significant improvements in image detail preservation.

Multi-stage filtering was introduced in order to combine the properties of single filters. The tree-structured nonlinear filter, a symmetric multistage filter combining the advantages of Central Weighted Median Filters (CWMFs) and of linear and curved windows, shows more robust characteristics for noise suppression and detail preservation. This filter is a three-stage filter designed with CWMFs as subfiltering blocks (Qian W, 1995; Qian W, 1999) applied to each pixel within the filter window (Bamberger RH, 1992; Qian W, 1994).

CWMFs are a class of median filters in which the basic principle involves replacing a pixel value with the median of the neighboring pixel values (Ko SJ, 1991). The weighted median filter is an extension of the median filter which gives more weight to some values within the window (Ko SJ, 1991), i.e. a weight coefficient is assigned to each position in the window. The filter output is the median of the sequence of pixel values in which the value at a position with weight coefficient n appears n times. As more emphasis is placed on the central weights, the filter's ability to suppress noise and preserve image details increases (Qian W, 1994).
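The weighting rule described above (a value with weight n is counted n times before the median is taken) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a 3x3 window, a weight of 1 for every position except the centre, an odd centre weight, and border pixels copied unchanged. The names `center_weighted_median` and `cwm_filter_3x3` and the sample patch are hypothetical and are not taken from the CAD system used in this work.

```python
def center_weighted_median(window, center_weight):
    """Median of an odd-length window whose middle value is counted center_weight times."""
    if len(window) % 2 == 0 or center_weight < 1 or center_weight % 2 == 0:
        raise ValueError("window length and center_weight must both be odd and positive")
    mid = len(window) // 2
    # Repeat the centre value so that it appears center_weight times in total.
    expanded = sorted(list(window) + [window[mid]] * (center_weight - 1))
    return expanded[len(expanded) // 2]


def cwm_filter_3x3(image, center_weight):
    """Apply a 3x3 centre-weighted median filter; border pixels are copied unchanged."""
    rows, cols = len(image), len(image[0])
    out = [list(row) for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [image[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = center_weighted_median(window, center_weight)
    return out


# Hypothetical 3x3 gray-level patch with an isolated bright spike at the centre.
patch = [[10, 12, 11],
         [13, 200, 12],
         [11, 10, 12]]
```

For this patch, centre weights of 1 (a plain median) and 3 both replace the spike with 12, while a centre weight of 9 outweighs the eight neighbours and leaves the value 200 unchanged. The centre weight therefore controls how closely the output follows the original centre pixel.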
The Tree Structured Filter (TSF) is a symmetric multistage filter that sequentially compares filtered and raw image data with the objective of obtaining more robust characteristics for noise suppression and detail preservation (Arce GR, 1989; Bauer PH, 1991). The TSF architecture consists of cascaded CWMFs (Qian W, 1994). Since noise is suppressed at each stage, the overall performance of the TSF is considered to be superior (Arce GR, 1989; Bauer PH, 1991).

2.5.1.2 Enhancement and Segmentation

Following noise suppression and artifact removal, image enhancement is performed to improve digital image quality. Enhancement algorithms based on the wavelet transformation (WT) are used, in which the data is cut up into different frequency components using mathematical functions called 'wavelets'. Each component is then studied with a resolution matched to its scale (Graps, 2004). This method has advantages over other enhancement techniques such as the Fourier transform in analyzing physical situations where the signal contains discontinuities and sharp spikes.

Segmentation is used to identify suspicious areas within the whole image. Mammographic lesions are extremely difficult to identify because their radiographic and morphological characteristics resemble those of normal breast tissue. As a mammogram is a projection image, lesions do not appear as isolated densities but are overlaid on parenchymal tissue patterns.

The fuzzy C-means (FCM) algorithm, which is based on fuzzy set theory, was used for soft segmentation. It allows for fuzzy pixel classification based on iterative approximation of local minima to global objective functions. It has two advantages over other segmentation approaches: it is unsupervised, and it is robust to missing and noisy data. This algorithm helps differentiate small suspicious regions.
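As a rough illustration of the soft segmentation idea, the sketch below applies the standard FCM update rules to one-dimensional gray levels. It is a minimal sketch under assumptions, not the segmentation code used in this work: the fuzzifier m = 2, the evenly spaced initial centers, the stopping tolerance and the name `fuzzy_c_means_1d` are all choices made here for illustration.

```python
def fuzzy_c_means_1d(values, n_clusters=2, m=2.0, n_iter=100, tol=1e-6):
    """Cluster scalar values with fuzzy C-means; returns (centers, memberships)."""
    lo, hi = min(values), max(values)
    if n_clusters < 2 or m <= 1.0 or lo == hi:
        raise ValueError("need n_clusters >= 2, m > 1 and non-constant data")
    # Assumed initialization: centers spread evenly over the data range.
    centers = [lo + (hi - lo) * i / (n_clusters - 1) for i in range(n_clusters)]
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership step: u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)).
        memberships = []
        for x in values:
            dists = [abs(x - c) for c in centers]
            if 0.0 in dists:
                # A point sitting exactly on a center belongs fully to it.
                zeros = dists.count(0.0)
                row = [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
            else:
                row = [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                       for d_i in dists]
            memberships.append(row)
        # Center step: mean of the data weighted by membership ** m.
        new_centers = []
        for i in range(n_clusters):
            weights = [row[i] ** m for row in memberships]
            new_centers.append(
                sum(w * x for w, x in zip(weights, values)) / sum(weights))
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    # Memberships are those computed in the last membership step.
    return centers, memberships


# Hypothetical gray levels: a dark background group and a bright group.
gray_levels = [1, 2, 3, 10, 11, 12]
fcm_centers, fcm_memberships = fuzzy_c_means_1d(gray_levels, n_clusters=2)
```

Each row of `fcm_memberships` sums to one, so every pixel carries a degree of membership in each region rather than a hard label. A hard segmentation can be obtained afterwards by assigning each pixel to its highest-membership cluster.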
2.5.1.3 Feature extraction and Classification

Feature extraction and selection is an important part of supervised classification. The number of features selected for breast cancer detection reported in the literature varies with the CAD approach employed. It is desirable to use an optimum number of features, since a large number of features would increase computational needs and make it difficult to define accurate decision boundaries in a high-dimensional space. Features in different domains (morphological, spatial, texture etc.) are extracted. In this process, the most important characteristics of the ROI are studied. The most important characteristics reported by radiologists are given below (Wouter J, 2000).

(a) Polymorphism vs. monomorphism: malignant MCs tend to be polymorphous, while benign clusters are mostly characterized by monomorphous calcifications of uniform size (Lanyi, 1988).
(b) Size and contrast: some benign calcifications have larger size and contrast compared to malignant calcifications.
(c) Branching vs. round and oval type: linear calcifications may be an indication of ductal carcinoma in situ, since such calcifications are located in the glandular ducts. Benign calcifications are mostly round or oval in shape and are often located in the lobules.
(d) Orientation: malignant calcifications often have shapes that are oriented toward the nipple (Lanyi, 1988).
(e) Number: a cluster with very few MCs is regarded as less suspicious. Five or more calcifications, each measuring less than 1 mm, in a volume of one cubic centimeter are considered to form a cluster (Popli, 2001).
(f) Location: about 48% of cancerous processes are located in the upper outer quadrant of the breast. Lesions located in this quadrant are more suspicious (Harris JR, 1991).

Several methods for feature extraction have been proposed in the literature.
The use of wavelet features and gray level statistical features was proposed by Songyang Yu et al (Songyang Yu, 2000). MCs are considered to be relatively high-frequency components buried in a background of low-frequency components and very high-frequency noise in the mammograms. Wavelets have a multiresolution property, since they are localized in both the space and frequency domains. This property makes them suitable for extracting MCs from low-frequency backgrounds and high-frequency noise. Spatial features which describe gray level statistics, such as median contrast (Kong, 1998) and normalized gray level value (Stetson PF, 1997), are used in combination with wavelet features to describe MCs. Huai Li et al (Huai Li, 1997) suggested a deterministic fractal approach to the enhancement of MCs. Since MCs can be characterized by different shapes and possess structures with high local self-similarity, these tissue patterns can be constructed by fractal models (Huai Li, 1997). Features in the morphological and spatial domains are most commonly used for MC detection. Once feature extraction is complete, these features are used for classification.

Several automated classification techniques have been investigated for the detection of MCs in mammograms. The k-Nearest Neighbor approach is a relatively simple and fast classification method (Wouter J, 2000). A statistical method based on the use of statistical models and the general framework of Bayesian image analysis was developed by Karssemeijer et al (Karssemeijer, 1993; N.Karssemeijer, 1991). Another method is based on a difference image technique, in which a signal suppressed image is subtracted from a signal enhanced image to remove structured background noise in the mammogram (Chan HP, 1987). Global and local thresholding were then used to extract potential MC signals. Yoshida et al (Yoshida H, 1994) used a decimated wavelet transform and supervised learning for the detection of MCs. Zheng et al (Zheng B, 1994) proposed a method for the detection of MCs using mixed feature-based NNs. A fuzzy logic based approach was proposed by Cheng et al (Cheng HD, 1998).
Issam El-Naqa et al (Issam ElNaqa, 2002) used the SVM to detect MCs based on finite image windows. Their approach relies on the capability of the SVM to automatically learn the relevant features for optimal detection. In their work, a sensitivity as high as 98% was achieved.
Recent studies have shown the superiority of the SVM over other techniques, suggesting that the SVM is a promising technique for MC classification. A detailed description of the NN and SVM approaches to MC/FP classification used in this work is given in Chapter 3.

2.5.2 Feature Selection

Feature selection is an important part of any classification scheme. The success of a classification scheme largely depends on the features selected and the extent of their role in the model. The objective of performing feature selection is three-fold: (a) improving the prediction performance of the predictors, (b) providing faster and more cost-effective predictors and (c) providing a better understanding of the processes that generated the data (Isabelle Guyon, 2003).

There are many benefits of variable and feature selection: it facilitates data visualization and understanding, reduces storage requirements, reduces training times and improves prediction performance. The discrimination power of the features used can be analyzed through this process. The goal is to eliminate a feature if it gives us little or no additional information beyond that subsumed by the remaining features (Daphne Koller, 1996). Only a few features may be useful or 'optimal', while most may contain irrelevant or redundant information that may degrade the classifier's performance. Irrelevant and correlated attributes are detrimental because they contribute noise and can interact counterproductively with a classifier induction algorithm (ChunNan Hsu, 2002).

The information about the class that is inherent in the features determines the accuracy of the model (Daphne Koller, 1996). Theoretically, having more features should give us more discriminating power. However, the real world provides us with many reasons why this is generally not the case.
Irrelevant and redundant features cause problems in this context, as they may confuse the learning algorithm by obscuring the distributions of the small set of truly relevant features for the task at hand. In light of this, a number of researchers have recently addressed the issue of feature subset selection in machine learning. As defined by John et al (John G, 1994), this work is often divided along two lines: filter and wrapper models.

In the filter model, feature selection is performed as a pre-processing step to induction (Figure 6). Induction refers to the classification algorithm.

Figure 6 The Filter Model

Methods using criteria such as correlation coefficients and entropy measures, which do not involve the inducer, come under the category of filter models.

Many researchers in machine learning found difficulties with this classical definition of the "optimal" feature subset and with the filter model. John et al (John G, 1994) point out that to measure the relevance of a given feature, one must take the existence and relevance of other features into account. In follow-up work, Kohavi et al (Kohavi 1995) consider that the optimality of a feature subset depends on both the specific induction algorithm and the training data at hand. This implies that an "optimal" feature subset for a given induction algorithm should be defined as the subset from which the induction algorithm can generate a hypothesis with the highest predictive accuracy. Feature selection should focus on finding features that are "useful" for improving the predictive accuracy rather than necessarily finding the "theoretically optimal" ones. Since the filter model ignores the effect of the feature subset on the performance of the classifier induction algorithm, an alternative method of feature selection called the wrapper model has been proposed.

The wrapper model "wraps" around the induction algorithm (Figure 7). The idea is to generate a set of candidate feature subsets, use the induction algorithm to generate a hypothesis for each candidate feature subset, and evaluate the candidate feature subsets by the classification performance of the resulting hypotheses. Methods such as Forward Selection and Backward Elimination come under this category.
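The wrapper idea described above can be sketched as a greedy forward search that calls an inducer on each candidate subset. The Python sketch below rests on loudly stated assumptions: a leave-one-out nearest-centroid classifier stands in for the induction algorithm, accuracy stands in for the selection criterion, and the toy data are invented for illustration. None of these is the logistic-regression procedure or the mammogram data used in this work, and the function names are hypothetical.

```python
def loo_nearest_centroid_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a nearest-centroid classifier on the given columns."""
    correct = 0
    for held in range(len(rows)):
        sums, counts = {}, {}
        for i, (row, label) in enumerate(zip(rows, labels)):
            if i == held:
                continue  # the held-out sample is not used for training
            total = sums.setdefault(label, [0.0] * len(subset))
            for k, j in enumerate(subset):
                total[k] += row[j]
            counts[label] = counts.get(label, 0) + 1
        best_label, best_dist = None, float("inf")
        for label in sorted(sums):
            dist = sum((rows[held][j] - sums[label][k] / counts[label]) ** 2
                       for k, j in enumerate(subset))
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += best_label == labels[held]
    return correct / len(rows)


def forward_selection(rows, labels, n_features,
                      score=loo_nearest_centroid_accuracy, min_gain=1e-9):
    """Greedy wrapper search: add the feature that most improves the inducer's score."""
    selected, best = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        scored = [(score(rows, labels, selected + [j]), j) for j in remaining]
        # Highest score wins; ties are broken in favour of the lower feature index.
        top_score, top_j = max(scored, key=lambda t: (t[0], -t[1]))
        if top_score - best <= min_gain:
            break  # no remaining feature meets the entry criterion
        selected.append(top_j)
        remaining.remove(top_j)
        best = top_score
    return selected, best


# Hypothetical toy data: column 0 separates the classes, columns 1 and 2 do not.
toy_rows = [[1.0, 5.0, 0.3], [1.2, 1.0, 0.9], [0.9, 4.0, 0.1], [1.1, 2.0, 0.7],
            [3.0, 4.5, 0.2], [3.2, 1.5, 0.8], [2.9, 5.5, 0.4], [3.1, 2.5, 0.6]]
toy_labels = ["FP"] * 4 + ["MC"] * 4
chosen, chosen_score = forward_selection(toy_rows, toy_labels, n_features=3)
```

On this toy data the search selects column 0 alone and then stops, because adding either remaining column cannot raise the score further. Every candidate subset triggers a full round of training and evaluation, which is where the computational cost of wrappers comes from.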
Figure 7 The Wrapper Model

The disadvantage of the wrapper model is that it can be prohibitively expensive, since a large number of training cycles is required to search for the best performing feature subset.

Wrappers address the real-world problem directly and hence optimize the desired criterion, but they are very time consuming. Filters, on the other hand, are much faster. Filters also provide a generic selection of variables that is not tuned for or by a given learning machine (Isabelle Guyon, 2003). Another justification is that filtering can be used as a pre-processing step to reduce space dimensionality and overcome overfitting.

Several feature selection techniques have been discussed in the literature. All these methods determine the relevance of the generated candidate feature subset to the classification task. There are five main types of evaluation functions (Dash M, 1997):

(a) distance (Euclidean distance measure)
(b) information (entropy, information gain, etc.)
(c) dependency (correlation coefficient)
(d) consistency (minimum features bias)
(e) classifier error rate (based on a classification algorithm)

The first four are filter models, while the last one comes under the wrapper model. Within the filter model, feature selection algorithms can be further categorized into two groups, namely feature weighting algorithms and subset search algorithms, depending on whether they evaluate the goodness of features individually or through feature subsets.

The distance measure calculates the physical distance (Dash M, 1997), where the main assumption is that instances of the same class must be closer to each other than to those of a different class.

Entropy is a measure of the uncertainty of a feature (Lei Yu, 2003). The entropy of a variable (or feature) X is defined in Equation 1.

$$H(X) = -\sum_{i} P(x_i)\,\log_2\big(P(x_i)\big) \tag{1}$$

The entropy of a variable X after observing the value of another variable Y is defined by Equation 2.

$$H(X \mid Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j)\,\log_2\big(P(x_i \mid y_j)\big) \tag{2}$$

where $P(x_i)$ are the prior probabilities of the values of X and $P(x_i \mid y_j)$ are the posterior probabilities of X after observing the values of Y.

Information gain (Quinlan, 1993) gives the amount by which the entropy of X decreases and reflects the additional information about X provided by Y (Equation 3).

$$IG(X \mid Y) = H(X) - H(X \mid Y) \tag{3}$$

According to Equation 3, a feature Y is regarded as more correlated to feature X than to feature Z if $IG(X \mid Y) > IG(Z \mid Y)$.

Another feature weighting criterion is the correlation measure, which measures the correlation between a feature and a class label. Pearson's correlation coefficient is given by Equation 4.

$$r_{X,Y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\,\sigma_X\,\sigma_Y} \tag{4}$$

Here $\bar{X}$ and $\bar{Y}$ are the sample means, $\sigma_X$ and $\sigma_Y$ are the sample standard deviations and n is the number of samples. A positive correlation implies a simultaneous increase in X and Y (Struble). A negative correlation indicates an increase in one variable as the other decreases. If $r_{X,Y}$ has a large magnitude, X and Y are strongly correlated and one of the attributes can be removed (Struble). On the other hand, variables that have a strong correlation with the outcome are retained in the model.

A limitation of all the methods listed above is that they may lead to the selection of a redundant subset of variables. Hence subset search methods are preferred over feature weighting methods.

Guyon et al (Isabelle Guyon, 2003) have shown that variables that are independently and identically distributed are not truly redundant. Noise reduction and better class separation can be obtained by adding variables that are presumably redundant. They have also shown that a variable that is completely useless by itself can provide a significant improvement in performance when taken with others. In other words, two variables that are useless by themselves can be useful together. Thus it can be better to select subsets of variables that together have good predictive power than to rank the variables according to their individual predictive power.

The wrapper methodology is based on using the prediction performance of a learning machine to assess the relative usefulness of subsets of variables. In practice, however, it is necessary to decide on a search strategy that is computationally advantageous and robust against overfitting. Greedy search strategies such as forward selection and backward elimination are the most popular, while genetic algorithms, best-first search and simulated annealing are among the others (Kohavi R, 1997).

In this work, the wrapper approach with logistic regression as the induction algorithm was used to find the best subset of features. The two most commonly used variable selection strategies are Stepwise Forward Selection (SFS) and Stepwise Backward Elimination (SBE). The SFS begins with no features in the model.
At each step, it enters the feature that contributes most to the discriminatory power of the model as measured by the likelihood ratio criterion. When none of the unselected features meets the entry criterion, the SFS process stops. The SBE on the other hand begins with all the features in the model and at
each step eliminates the feature that contributes least to the discriminatory power of the model. The process stops when all the remaining features meet the criterion to stay in the model. The SFS was used in this work, details of which are given in Chapter 4.

2.5.3 Limitations of CAD

Though the use of CAD is becoming widespread, a great deal of time and effort is required to digitize the films (Imaginis, 2004). Some radiologists also believe that the CAD technology marks a fairly high number of "normal" areas as abnormalities, leading to additional unnecessary and costly breast imaging and/or biopsies. In addition, the high cost of CAD technology may hinder its widespread use. A CAD system costs approximately $200,000, in addition to the cost of a mammography system. The price of mammograms may also rise from $10 to $15 per exam with the usage of CAD technology.

In spite of these limitations, studies continue to evaluate the advantages and disadvantages of CAD technology. The disadvantages stated above are weighed against the CAD system's ability to diagnose cancers early, which dramatically reduces long-term treatment costs.
CHAPTER 3 CLASSIFICATION ALGORITHMS

The focus of this work was to examine the suitability of using the NN and SVM algorithms in the detection of MCs in mammograms and study their impact on classification accuracy. Support Vector Machines (SVMs) and Neural Networks (NNs) are the mathematical structures, or models, that underlie learning. They are both machine learning techniques that learn patterns based on training data, fit the models to this training data and predict or classify unseen (or future) data. The active development of NN research started in the 1970s and that of SVMs started in the 1980s. Currently, both techniques are used widely even though SVMs demonstrate superior performance in various problems compared to NNs. The applications of SVMs are expected to expand even though NNs are more widely known. The following sections describe these algorithms in detail.

3.1 Machine Learning Principles

Learning tasks are usually divided into supervised, unsupervised and reinforcement learning (Hiep Van Khuu, 2003). We discuss the supervised learning procedure, which is an approach that uses examples to model input-output relationships. The input/output pairings typically reflect a functional relationship mapping inputs to outputs (Cristianini N, 2000). When there exists an underlying function between the inputs and outputs, it is referred to as the target function. The estimate of this target function is known as the solution of the learning problem. This is also called the decision function in the case of a classification problem (Cristianini N, 2000). The solution is chosen from a set of candidate functions that map the input to the output domain. This set of candidate functions is termed the hypotheses.
The quality of a learning algorithm is assessed in terms of the number of mistakes it makes during the training phase. However, it is not always possible to verify the validity of the training process, especially if the function we are trying to learn does not have a simple representation. Also, the training data are frequently noisy and the input-output mapping does not guarantee the existence of an underlying function. The fundamental problem of machine learning is to find a hypothesis that is not only consistent with the training data but also works well on unseen data. This is known as the generalization capability, which these algorithms try to optimize. It is possible that with a difficult training dataset, the hypothesis behaves like a rote learner, i.e. the data in the training dataset are correctly classified, but predictions on unseen data are uncorrelated. Hypotheses that become too complex in order to become consistent are said to overfit (Cristianini N, 2000).

The VC theory due to Vapnik and Chervonenkis gives better insight into choosing a hypothesis space and a hypothesis (Hiep Van Khuu, 2003). Assuming that the data are drawn from an unknown probability distribution P(x,y) and l(.) is some loss function signifying the error of a hypothesis, the risk of the hypothesis is given by Equation 5.

$$R[h] = \int l(h(x), y)\, dP(x, y) \tag{5}$$

where h(x) is the hypothesis function. The risk of the hypothesis over the training set is termed the empirical risk, given in Equation 6.

$$R_{emp}[h] = \frac{1}{n} \sum_{i=1}^{n} l(h(x_i), y_i) \tag{6}$$

The primary goal is to minimize the risk R[h], i.e. the expected error on unseen data, rather than only the empirical risk (error on training data). Unfortunately, the risk cannot be computed directly since the probability distribution is unknown. However, with probability at least $1 - \delta$, the risk is bounded by the inequality given in Equation 7.

$$R[h] \le R_{emp}[h] + \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) - \ln\frac{\delta}{4}}{n}} \tag{7}$$

where d is the VC dimension of the function class of h, and is a measure of the classifier's 'power'. 
This power does not depend on the choice of the training dataset and
hence is a true representation of the classifier's generalization performance. The VC dimension is the maximum number of data points a function can shatter given all possible labels. A complex function will have a higher VC dimension. This gives us a way to estimate the error on future data based only on the training error and the VC dimension of h. The goal is to choose a hypothesis that minimizes this bound rather than the empirical risk alone.

3.2 Neural Networks

The Artificial Neural Network (ANN) is an information processing system inspired by the biological nervous system. It is composed of a large number of highly interconnected processing elements called neurons. The principle of ANN learning systems is much the same as that of the biological neuron; it involves adjustments to the synaptic connections that exist between the neurons.

An artificial neuron is a device with many inputs and one output. The neuron has two modes of operation, the training mode and the testing mode. In the training mode, the neuron can be trained to fire (or not) for a particular set of input patterns. In the testing mode, when a pattern is presented at the input, the firing rule decides whether to fire the neuron or not. These neurons form the nodes of the NN. Each node is assigned a threshold and each interconnection between the nodes is assigned a weight that represents the strength of the connection between the neurons. The simplest NN has a set of inputs and one output. Figure 8 shows a 1-level NN, also called a perceptron.

Figure 8 A Perceptron
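As a small illustration of the firing rule just described, the sketch below computes the output of a single node as a thresholded weighted sum of its inputs. It is only a sketch under stated assumptions: the step activation, the function name `perceptron_output`, and the weights and threshold (chosen to realise a logical AND) are hypothetical and are not taken from the CAD system.

```python
def perceptron_output(x, w, T):
    """Fire (return 1) when the weighted sum of the inputs exceeds threshold T."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    # Assumed step activation: compare the weighted sum against the threshold.
    return 1 if weighted_sum - T > 0 else 0


# Hypothetical weights and threshold that make the node behave like logical AND.
w = [1.0, 1.0]
T = 1.5
patterns = [[0, 0], [0, 1], [1, 0], [1, 1]]
outputs = [perceptron_output(x, w, T) for x in patterns]
```

Only the pattern with both inputs active pushes the weighted sum (2.0) above the threshold, so the node fires for that pattern alone.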
In Figure 8, x refers to the inputs, w the weights, y the output and T the threshold of the node. The strength of the signals a node receives is calculated as the weighted sum of inputs

$$w_1 x_1 + w_2 x_2 + \dots + w_n x_n \tag{8}$$

If this value overcomes the threshold T of the node, then the signal is transmitted to other connected nodes. The value of the output of the node is decided by the activation function f, which decides whether the perceptron should fire or not. Thus, the output y is given as

$$y = f(w_1 x_1 + w_2 x_2 + \dots + w_n x_n - T) \tag{9}$$

Since Equation 9 can be interpreted as the equation of a line or a hyperplane, it classifies data $(x_1, x_2, \dots, x_n)$ into two classes, one above the plane and one below the plane (Hiep Van Khuu, 2003). A dataset is considered linearly separable if it requires only a single hyperplane to separate the two classes. If the dataset is not linearly separable, we need more than one hyperplane. Multiple hyperplanes are represented by introducing more nodes in another layer of a perceptron. This is known as a multi-layer perceptron network, as shown in Figure 9.

Figure 9 Multilayer Perceptron

The nodes between the input and the output layer are called hidden nodes and the layers are called hidden layers. Once the number of layers and the number of units in each layer have been established, the network's weights and thresholds must be set so as to minimize the prediction error made by the network. This is the role of the training algorithms. The training cases are run through the network and the output generated is compared to the
desired outputs or the targets. The differences are combined together by an error function to give the network error. The most common error function is the Sum of Square Error (SSE), where the individual errors of output units on each case are squared and summed together.

3.2.1 The Standard Back Propagation Algorithm (SBP)

The SBP is the most popular NN training algorithm. Other examples of training algorithms are conjugate gradient descent, Quasi-Newton, quick propagation etc. In BP, the gradient vector of the error surface is calculated. The vector points along the line of steepest descent from the current point, so a short move along it decreases the error. A sequence of such moves will eventually find a minimum of some sort (Statsoft Inc., 1984-2003). Large steps converge more quickly but might overstep the solution. Small steps would require a large number of iterations. The step size is defined by the learning rate of the algorithm. The algorithm progresses iteratively through a number of epochs, with the error between the target and actual outputs calculated for each epoch. This error is used to adjust the weights, and the process repeats. The initial weights are random and training stops at a set convergence criterion like a predefined number of epochs, or an acceptable level of SSE.

The BP algorithm consists of two phases: the Forward phase and the Backward phase. The feed-forward phase is where the inputs x are fed into the network. All outputs are computed using sigmoid (activation function) thresholding of the inner product of the corresponding weights and the input vectors. All the outputs at stage n are connected to all the inputs at stage n+1. Errors are then propagated backwards by apportioning them to each unit according to the amount of error the unit is responsible for (Anand, 1999). Let (x,t) denote a training example where x and t are vectors representing the inputs and targets respectively. $\eta$ is the learning rate. 
$n_i$, $n_o$ and $n_h$ are the numbers of input, output and hidden nodes respectively. The input from unit i to unit j is denoted as $x_{ji}$ and the corresponding weight is denoted by $w_{ji}$. The SBP algorithm is stated as follows (Anand, 1999):
(a) Create a feed-forward network with $n_i$ inputs, $n_o$ outputs and $n_h$ hidden units.
(b) Initialize all the weights to random values (say between -0.05 and 0.05).
(c) Until convergence, do:
    For each training sample (x,t), do:
    (a) Compute the output $o_u$ of every unit for instance x.
    (b) For each output unit k, calculate $\delta_k = o_k (1 - o_k)(t_k - o_k)$.
    (c) For each hidden unit h, calculate $\delta_h = o_h (1 - o_h) \sum_{k \in downstream(h)} w_{kh}\,\delta_k$.
    (d) Update each network weight $w_{ji}$ as $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$.
Thus the weights of the network are updated until the convergence criterion is met.

3.2.2 Over-learning and Generalization

One major problem of the above learning approach is that it does not minimize the error that we are actually interested in, the generalization error. In reality the network is trained to minimize the error on the training set. The most important manifestation of this problem is overfitting. A network with more weights models a more complex function, and is therefore prone to this problem. On the other hand, a network with fewer weights may not be sufficiently powerful to model the underlying function. For example, a network with no hidden layers models only a simple linear function. Thus, it is important to select the optimum number of hidden units. In view of this, a simple model is preferred to a highly complex network.

The performance of the NN also depends on other factors such as the nature of the datasets. It is of relevance here to mention the problem of having unbalanced datasets. Since a network minimizes the overall error, the proportion of the classes of data in the set is critical. A network trained with 1000 positive cases and 100 negative cases will bias its decision towards the positive case, as this allows the algorithm to lower the overall error. It
is also important that the training and testing data are representative of the underlying model.

3.3 Support Vector Machines

The foundations of SVMs were developed by Vapnik, and SVMs are gaining popularity due to many attractive features and promising empirical performance. The formulation of SVMs embodies the Structural Risk Minimization (SRM) principle, as opposed to the Empirical Risk Minimization (ERM) commonly employed with other statistical methods. SRM minimizes the upper bound on the generalization error, as against ERM which minimizes the error on the training data. Thus SVMs are known to generalize better. The SRM technique consists of finding the optimal separation surface between classes through the identification of the most representative training samples, called the support vectors. If the training dataset is not linearly separable, a kernel method is used to simulate a non-linear projection of the data into a higher dimensional space, where the classes are linearly separable. Here, we first introduce the foundation of SVMs, the linear learning machine. SVM kernels and other components are then explained.

3.3.1 Structural Risk Minimization (SRM)

A linear learning machine learns a linear classifier (Hiep Van Khuu, 2003) or hyperplane from the training data (Equation 10).

$$h(x) = w \cdot x + b, \quad w \in R^N,\ b \in R \tag{10}$$

The hyperplane divides the data so that all the points with the same label lie on the same side of the hyperplane. This amounts to finding w and b so that

$$y_i (w \cdot x_i + b) > 0 \tag{11}$$

It is possible to rescale w and b so that

$$y_i (w \cdot x_i + b) \ge 1 \tag{12}$$

This system of inequalities can have several solutions, as shown in Figure 10.
Figure 10 (Left) Several Feasible Hyperplanes, (Right) Optimal Separating Hyperplane

The SRM approach is based on minimizing both the terms in the RHS of Equation 7. The classifier that has the maximal margin to the training set is the preferred solution among all the feasible hyperplanes shown in Figure 10 (Left). This choice of hyperplane gives a tighter bound on the VC dimension and reduces the risk. Thus determining the classifier involves the Quadratic Optimization Problem (QP) of minimizing $\frac{1}{2}\|w\|^2$ under constraints (12). The N-dimensional vector w and the real number b then define the Optimal Separating Hyperplane (OSH).

This concept can be extended to the case when the classes are not linearly separable, i.e. when Equation 12 has no solution. A non-linear mapping $\Phi: R^N \rightarrow R^L$, which maps the input data to a high dimensional space (also called the feature space), is introduced. Here, L is usually much larger than N. We can then try to find a linear classifier in feature space.

Figure 11 Kernel Mapping From Input Space To Feature Space
The problem of finding a hyperplane in feature space is a reformulation of the linear case. Thus the problem is one of minimizing $\frac{1}{2}\|w\|^2$ subject to

$$y_i \big( (w \cdot \Phi(x_i)) + b \big) \ge 1, \quad i = 1, \dots, n \tag{13}$$

We introduce Lagrange multipliers $\alpha_i \ge 0$, $i = 1, \dots, n$, one for each constraint in Equation (13), and find the saddle point of the Lagrangian

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \Big( y_i \big( (w \cdot \Phi(x_i)) + b \big) - 1 \Big) \tag{14}$$

At the saddle point we have $\partial L / \partial b = 0$ and $\partial L / \partial w = 0$, which translate into

$$\sum_{i=1}^{n} \alpha_i y_i = 0 \quad \text{and} \quad w = \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i) \tag{15}$$

Substituting (15) in (14), we have the dual quadratic optimization problem:

Maximize

$$\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \big( \Phi(x_i) \cdot \Phi(x_j) \big) \tag{16}$$

subject to

$$\alpha_i \ge 0, \quad i = 1, \dots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{17}$$

In Equation (16), the inner product $\Phi(x_i) \cdot \Phi(x_j)$ can be replaced with a kernel function $K(x_i, x_j)$ that obeys Mercer's condition. Mercer's condition states that any positive semi-definite kernel $K(x_i, x_j)$ can be expressed as a dot product in a high-dimensional space. Thus we avoid translating the input data to feature space first and then finding their inner products. This is equivalent to mapping the feature vectors into a high-dimensional feature space before using a hyperplane classifier there (Figure 11). The use of kernels makes it possible to map the data implicitly into a feature space and to train a linear machine in such a space, potentially side-stepping the computational problems inherent in evaluating the feature map (Cristianini N, 2000).
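To make the kernel substitution concrete, the following sketch checks numerically that a degree-2 polynomial kernel evaluated in the input space equals an ordinary dot product after an explicit mapping into a six-dimensional feature space. The function names (`poly_kernel`, `phi`, `dot`) and the two sample points are assumptions made for illustration only; the explicit map shown is the standard one for 2-D inputs and degree 2, not something taken from the thesis.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = ((x . y) + 1)^d, evaluated in input space."""
    return (dot(x, y) + 1.0) ** d


def phi(x):
    """Explicit feature map for d = 2 and 2-D input: R^2 -> R^6."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


# Hypothetical sample points.
x, y = [1.0, 2.0], [3.0, -1.0]
k_implicit = poly_kernel(x, y)    # no feature map evaluated
k_explicit = dot(phi(x), phi(y))  # feature map evaluated, then inner product
```

Both routes give the same value, but the kernel route never builds the six-dimensional vectors, which is the saving that matters when L is very large.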
In a high dimensional feature space $R^L$, the hyperplane is defined by the L-dimensional vector w and the real number b (Hiep Van Khuu, 2003). L can however be very large, hence storing w explicitly is expensive and sometimes impossible. In equations (15) and (17) the vector w is defined by the input vectors that have non-zero Lagrange multipliers associated with them. The input vectors with non-zero multipliers are called the support vectors, and together they implicitly define the hyperplane. New data x are classified with

$$y = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i\, k(x_i, x) + b \right) \tag{18}$$

where l is the number of support vectors.

In this research, three kinds of kernels are studied. These kernels are mathematically defined in Equations 19-21 (Chang and Lin, 2003):

1. Polynomial kernel.
$$k(x, y) = ((x \cdot y) + 1)^d \tag{19}$$
2. RBF kernel.
$$k(x, y) = \exp(-\gamma \|x - y\|^2) \tag{20}$$
3. Linear kernel.
$$k(x, y) = x \cdot y \tag{21}$$

Here d is the degree of the polynomial and $\gamma > 0$ controls the width of the RBF. There is no theory regarding which kernel is the best for a given problem domain. It is important to select the appropriate kernel based on the specific application.

Training of SVMs requires the solution of a very large Quadratic Programming (QP) optimization problem, which is very time-consuming (Platt, 1999). Sequential Minimal Optimization (SMO) is an algorithm for training the SVM in which this large QP problem is broken down into a series of smallest possible QP problems, which are solved analytically. SMO can handle very large training datasets and considerably speeds up training times.

SMO solves the smallest possible optimization problem at every step. The smallest possible optimization problem involves two Lagrange multipliers. At every step, the SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new optimal values.
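The three kernels of Equations 19-21 and the decision rule of Equation 18 can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: the function names, the default values of `d` and `gamma`, and the two-point toy model (support vectors, multipliers and bias) are all hypothetical, and no training step such as SMO is shown.

```python
import math


def linear_kernel(x, y):
    """Equation 21: plain inner product."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, d=3):
    """Equation 19: ((x . y) + 1)^d, with an assumed default degree."""
    return (linear_kernel(x, y) + 1.0) ** d


def rbf_kernel(x, y, gamma=0.5):
    """Equation 20: exp(-gamma * ||x - y||^2), with an assumed default gamma."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def svm_predict(x, support_vectors, labels, alphas, b, kernel):
    """Equation 18: sign of the kernel expansion over the support vectors."""
    s = sum(a * y * kernel(sv, x)
            for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if s + b >= 0 else -1


# Hypothetical toy model: one support vector per class, sum(alpha_i * y_i) = 0.
svs = [[1.0, 1.0], [-1.0, -1.0]]
ys = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

pred_pos = svm_predict([2.0, 0.5], svs, ys, alphas, b, linear_kernel)
pred_neg = svm_predict([-0.5, -2.0], svs, ys, alphas, b, rbf_kernel)
```

Because the classifier only touches the data through `kernel`, switching between the linear, polynomial and RBF variants is a one-argument change, which is what makes a like-for-like kernel comparison straightforward.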
3.4 Comparison of SVMs and NNs

Both NNs and SVMs are based on the concept of linear learning models, using linear hyperplanes to classify data. For non-linear models, the approach of these two algorithms is different. SVMs use non-linear mappings to find a decision hyperplane in feature space. On the other hand, NNs use activation functions such as sigmoid, radial or Gaussian to handle non-linear data, so that the BP algorithm can compute the weight change depending on the error on the output. These activation functions in effect create a non-linear decision boundary classifying the input data into different classes.

SVMs minimize the structural risk (a bound on the error on unseen data) while NNs minimize only the empirical risk (error on training data). Hence, SVMs are known to generalize better, with a learned hypothesis function that approximates the true classification function more closely.

NNs are known to have longer training times since the learning process involves training with the dataset repeatedly to better learn the hypothesis function that will perform the classification task. NNs learn better the more times they get trained. SVMs, on the other hand, handle the data simultaneously without losing accuracy.

NNs may converge to local minima while SVMs find a global solution. The problem of overfitting in NNs might get them stuck at local optima, while with SVMs the bound on the true risk and the QP formulation ensure a global solution.

With a good understanding of the mathematical foundations of these algorithms, we explain our methodologies and results in Chapters 4 and 5.
CHAPTER 4 MATERIALS AND METHODS

This chapter presents the overall approach used in the CAD system. A description of the database and the features used is included. Detailed feature analysis as well as feature selection is performed prior to classification. Classification using the NN and SVM algorithms with and without feature selection is studied and evaluated.

4.1 Schematic of Proposed CAD System

The objective of this thesis project was three-fold:
(a) Analyze the input features and their importance in predicting the outcome. Perform feature selection by selecting the most significant features.
(b) Use all the ten features with the NN and SVM to classify MC and FP and compare their performances.
(c) Use the most significant features and their interactions (from step (a)) with the NN and SVM to classify MC and FP and compare their performances.

The overall procedure however consists of multiple steps: pre-processing, segmentation, feature extraction, classification and evaluation. The schematic of the entire procedure is shown in Figure 12.

Figure 12 Schematic Of CAD System
The 'Classification' and 'Evaluation' stages were the main focus of this work. Classification was performed using the NN and SVM algorithms, the schematic for which is given in Figure 13.

Figure 13 Detailed Schematic Of Training And Testing Of SVM And NN Algorithms
4.2 Database Description

The database consisted of 22 images of 60 micron resolution, all of which contained an abnormality marked out by a radiologist. These images included the craniocaudal (CC) and mediolateral oblique (MLO) views of each breast. Figures 14 and 15 show two examples of abnormal cases. The images shown are the raw mammogram, the Region of Interest (ROI) marked out by a radiologist and a portion of the image after segmentation. Figure 16 shows other examples of MCs that were identified by the radiologist in this database.

Figure 14 Arch Distortion With Suspected Microcalcification, (Top Left) Raw Image, (Top Right) ROI, (Bottom Left) Section Of Segmented Image Including MCs And FPs
Figure 15 Image Containing Both The Arch Distortions And Faint Microcalcifications, (Top Left) Raw Image, (Top Right) ROI, (Bottom Left) Section Of Segmented Image, Includes MCs And FPs
Figure 16 Other Examples Of Microcalcifications Outlined By The Radiologist (large circle = arch distortion, small circle = calcifications)

The above images consist of both MC and FP signals. According to Ema et al. (Takehiro Ema, 1995), false-positive MC signals in mammograms can be classified into four major categories: (a) MC-like noise patterns, (b) artifacts, (c) linear patterns and (d) FP signals appearing on ducts, step-like edges or ring patterns. These False Positive MCs vary with the database, but overall look like subtle MCs. However, careful observation would reveal their differences. Artifacts are caused by dust or scratches in films or noise
in the digitization process (Takehiro Ema, 1995). False positive MC signals have higher contrast than true MCs. The MC-like noise pattern is most commonly seen, while factors (b)-(d) mentioned above also contribute to false positive MCs.

4.3 Image Pre-processing and Segmentation

Though not a direct part of this thesis, image pre-processing and segmentation were performed prior to classification and are mentioned here for completeness. Image pre-processing was performed using TSFs that used cascaded CWMFs. Adaptive WT-based enhancement algorithms were developed for digitized CAD methods. Segmentation was performed using the fuzzy C-means algorithm.

4.4 Feature Description

Subsequent to image segmentation, feature extraction was performed. Ten features that cover the spatial and morphological domains and that are believed to be representative of the two classes were extracted from the segmented image. These features are listed in Table 2.

Table 2 Input Features

Feature No.   Feature                 Type of feature
3             Mean entropy            Spatial
4             Deviation of entropy    Spatial
7             Average foreground      Spatial
8             Deviation foreground    Spatial
9             Mean contrast           Spatial
5             Moment                  Morphological
6             Compactness             Morphological
1             Eccentricity            Describes the margins
2             Spread                  Describes the margins
10            Boundary gradient       Describes the margins

4.4.1 Spatial Domain Features

These features are extracted from the enhanced output image. They describe the entropy and gray levels of the image. Entropy refers to the disorder of a system. The entropy of a system is related to the amount of information it contains. Low entropy
images, such as those containing a lot of black sky, have very little contrast and large runs of pixels with the same or similar Digital Number values (Brien). An image that is perfectly flat will have an entropy of zero. On the other hand, high entropy images, such as an image of heavily cratered areas on the moon, have a great deal of contrast from one pixel to the next. In short, the entropy refers to the information content of the gray values. The entropy for each ROI can be calculated using Equation 22.

$$Entropy = -\sum_{i=0}^{255} rel[i] \, \ln(rel[i]) \tag{22}$$

where
rel[i] = histogram of the relative gray value frequencies
i = gray value of input image (0...255)

(a) Mean entropy: This is the average entropy value, given by Equation 23.

$$\overline{Entropy} = \frac{1}{n} \sum_{i=1}^{n} Entropy_i \tag{23}$$

(b) Deviation of entropy: Standard deviation of the entropy values from the mean entropy, given by Equation 24.

$$SD_{entropy} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Entropy_i - \overline{Entropy} \right)^2} \tag{24}$$

(c) Average foreground: This is the average gray level of the foreground in the enhanced image (Qian W, 2001), given by Equation 25.

$$Avg_{foreground} = \frac{1}{sum(pixel)_{foreground}} \sum_{(m,n) \in foreground} x(m,n) \tag{25}$$

(d) Deviation foreground: Standard deviation of the gray levels of the foreground in the enhanced image, given by Equation 26.

$$Stdev_{foreground} = \left( \sum_{(m,n) \in foreground} \left( x(m,n) - Avg_{foreground} \right)^2 \right)^{1/2} \tag{26}$$
(e) Mean contrast: Difference in gray level values of foreground and background, given by Equation 27.

$$Contrast = \frac{Avg_{foreground} - Avg_{background}}{Avg_{background}} \tag{27}$$

The above features are based on the fact that MC spots have different gray levels compared to the background tissues.

4.4.2 Morphology Domain Features

These features focus on the shape description. They are extracted from the segmented image.

(a) Compactness

Compactness is a dimensionless quantity that provides a measure of contour complexity versus the area enclosed (Gavrielides, 1996; Shen L, 1994). It is one of the most commonly used features in pattern recognition and classification techniques (Tembey, 2003). Compactness C can be defined by Equation 28.

$$C = \frac{(perimeter)^2}{4\pi \, (area)} \tag{28}$$

For a disc, C is at its minimum and equals 1. A larger value of compactness describes an irregular and elongated object, while a smaller value is representative of a more symmetric object (Tembey, 2003).

(b) Moment

The moment refers to the roughness of a contour and increases as the irregularity of the shape increases (Castleman, 1979; Tembey, 2003). It gives information regarding the shape roughness and is used to distinguish between the different shape categories of calcifications. For a two-dimensional image f(x,y), the moments $m_{pq}$ of order (p+q) are defined in Equation 29 (Qian W, 2001)
$$m_{pq} = \iint x^p y^q f(x,y)\, dx\, dy \quad \text{for } p, q = 0, 1, 2, \dots \tag{29}$$

while the central moments are defined as

$$\mu_{pq} = \iint (x - \bar{x})^p (y - \bar{y})^q f(x,y)\, dx\, dy \tag{30}$$

where $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$. For a binary image, the above formula can be rewritten as

$$\mu_{pq} = \sum_{(m,n)} (m - \bar{m})^p (n - \bar{n})^q, \quad \bar{m} = \frac{1}{N} \sum_{(m,n)} m, \quad \bar{n} = \frac{1}{N} \sum_{(m,n)} n \tag{31}$$

where the sums run over the N pixels (m,n) of the object.

4.4.3 Boundary Definitions

(a) Boundary gradient

This feature is obtained by calculating the gradient of each boundary pixel's 8-connected neighbors and taking the average of its neighbors' gradient values as its gradient. The gradient operators are represented by a pair of masks $H_1$ and $H_2$, which measure the gradient of the image u(m,n) in two orthogonal directions. By defining the bidirectional gradients as $g_1(m,n) = \langle U, H_1 \rangle_{m,n}$ and $g_2(m,n) = \langle U, H_2 \rangle_{m,n}$, i.e. the inner products of the masks with the image neighborhood centered at (m,n), the gradient vector magnitude and direction are given by Equation 32.

$$g(m,n) = \sqrt{g_1^2(m,n) + g_2^2(m,n)}, \qquad \theta_g(m,n) = \tan^{-1} \frac{g_2(m,n)}{g_1(m,n)} \tag{32}$$

Using the above formulae, the segmented image is first screened, all the boundary pixels of each calcification are labeled, and these pixels are then mapped back to the enhanced image to obtain their boundary pixel gradient. The gradient feature is based on the optimized algorithm, which uses an initially given value and an initially defined searching direction to find the
optimized convergence solution for the problem. The Sobel gradient operator was used for calculating the gradient description feature, with the masks defined as:

$$H_1 = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad H_2 = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \tag{33}$$

(b) Eccentricity

Eccentricity ($\varepsilon$) measures the degree to which an object's mass is concentrated along a particular axis. The range of values for $\varepsilon$ is [0-1], where 0 defines a circular object and 1 a linear object. It is defined in Equation 34.

$$\varepsilon = \frac{(m_{20} - m_{02})^2 + 4 m_{11}^2}{(m_{20} + m_{02})^2} \tag{34}$$

where $m_{pq}$ is the moment of order (p+q).

(c) Spread

Spread (S) is based on the central moments of the boundary pixels. It measures how unevenly an object's mass is distributed along its centroid and takes values in the range of [0-1]. Again, a lower value represents a circular object while a large value defines a linear and non-uniform object. Spread is defined in Equation 35 (Tembey, 2003).

$$S = \mu_{20} + \mu_{02} \tag{35}$$

where $\mu_{pq}$ is the central moment.

The samples described by these 10 features are labeled as MC or FP based on the truth file (marked by the radiologist). A sample of the training dataset used is shown in Table 3 below. Here '-1' stands for class FP and '1' for class MC.
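The moment-based shape features can be checked on toy shapes. The sketch below computes the binary-image central moments of Equation 31 and plugs them into the eccentricity ratio of Equation 34. It rests on assumptions that should be read as such: central moments are used in the ratio so that the value does not depend on where the object sits in the image, the object is given as a list of (m, n) pixel coordinates, and the function names and the two test shapes are hypothetical.

```python
def central_moment(pixels, p, q):
    """Central moment mu_pq of a binary object given as (m, n) pixel coordinates."""
    count = len(pixels)
    m_bar = sum(m for m, _ in pixels) / count
    n_bar = sum(n for _, n in pixels) / count
    return sum((m - m_bar) ** p * (n - n_bar) ** q for m, n in pixels)


def eccentricity(pixels):
    """Eccentricity ratio of Equation 34, evaluated with central moments (assumption).

    Undefined for a single pixel, where mu20 + mu02 is zero.
    """
    mu20 = central_moment(pixels, 2, 0)
    mu02 = central_moment(pixels, 0, 2)
    mu11 = central_moment(pixels, 1, 1)
    return ((mu20 - mu02) ** 2 + 4.0 * mu11 ** 2) / (mu20 + mu02) ** 2


# Hypothetical test shapes: a 5x5 filled square and a 9-pixel straight line.
square = [(m, n) for m in range(5) for n in range(5)]
line = [(m, 0) for m in range(9)]

ecc_square = eccentricity(square)
ecc_line = eccentricity(line)
```

The symmetric square sits at the compact end of the range and the straight line at the linear end, matching the interpretation of the [0-1] range given above.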
Table 3 Sample Of Training Data Used In The Study

4.5 Data Analysis and Classification

This was performed in three different stages:
1. Input feature analysis and feature selection using the Forward Selection method.
2. Use all the ten features with the NN and SVM and compare their performances.
3. Include the most significant features and their interactions (from Step 1) with the NN and SVM and compare their performances.

4.5.1 Data Analysis

The first step was a detailed analysis of the input data. Since the medical implications of these features (or domain knowledge) were not known precisely, we sought a statistical explanation for the effects of the predictors. Data analysis includes the univariate statistics as well as multiple regression analyses. Logistic regression is a form of regression that gives us an insight into the effects of the independent variables, their significance and the extent of their role in the model, and their relationship with the outcome. With this understanding we continue with training and classification using the NN and SVM.

Logistic regression is used to predict a dependent variable on the basis of independent variables (Hosmer, 1989). In logistic regression, the dependent variable is binary or dichotomous. The goal is to find the best fitting model to describe the relationship between the dichotomous characteristic of interest (outcome) and a set of independent (or predictor) variables (Cox, 1989). Logistic regression generates the coefficients of a
formula to predict a logit transformation of the probability of presence of the characteristic of interest (MedCalc, 2004).

$$logit(p) = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k \tag{36}$$

where p is the probability of the presence of the characteristic of interest. The logit transformation is defined as the logged odds (Equation 37), where the odds are given as:

$$odds = \frac{p}{1 - p} = \frac{\text{probability of presence of characteristic}}{\text{probability of absence of characteristic}} \quad \text{and} \quad logit(p) = \ln\left( \frac{p}{1 - p} \right) \tag{37}$$

Rather than choosing parameters that minimize the sum of squared errors (as in ordinary regression), estimation in logistic regression chooses parameters that maximize the likelihood of observing the sample values. This is called Maximum Likelihood Estimation (MLE), the method used to calculate the logit coefficients. MLE seeks to maximize the log likelihood (LL), which reflects how likely it is that the observed values of the dependent variable would be predicted from the observed values of the independent variables.

Logistic regression gives us the univariate effects of the variables on the outcome, i.e. it gives us an idea as to how each input feature affects the classification as MC/FP, as well as the strength of association between each input and the outcome.

4.5.2 Feature Selection

Feature selection was performed using the wrapper method explained in Section 2.5.2. The induction algorithm used in this case was logistic regression with Stepwise Forward Selection (SFS) as the search strategy. Logistic regression was used previously for data analysis to study the significance of each variable in our model. The same concept is extended to a procedure for selecting the best subset of features based on the likelihood ratio criterion. Variables are tested for individual significance (main effect model) and in combination (interaction effect model) by adding each variable stepwise
into the model. Note that the terms 'variables' and 'features' are used interchangeably.

The SFS was implemented in SAS. The algorithm starts out with no predictors (features) in the model. The test is based on the chi-square test, which is a nonparametric test of statistical significance. The initial chi-square reflects the error associated with the model when only the intercept is included, i.e. the initial chi-square is -2LL for the model which accepts the null hypothesis that all the predictors' coefficients are zero. This statistic is then compared with the corresponding -2LL for the model with the predictors included. The chi-square value represents twice the difference in log likelihoods between fitting a model with only an intercept term and a model with an intercept and a predictor (independent variable). This difference is compared with a chi-square distribution with degrees of freedom equal to the difference in the number of terms between the two models. If the difference is significant (Pr > ChiSq is less than 0.05), the null hypothesis that knowing the independents makes no difference in predicting the dependent is rejected, and the new variable is added to the model.

As stated earlier, it is important to study both the effects of individual independents and their interactions. An interaction effect is a change in the simple main effect of one variable over levels of the second. All possible two-way interactions are included to test for their significance. Only two-way interactions are used, since higher-order interactions are unlikely to reach significance given the limits of statistical power and sample size.

The main effect and the interaction effect models give the feature subsets selected as best under the likelihood ratio criterion. These feature subsets are tested with the NN and SVM algorithms. ROC curves for both these models are plotted and evaluated based on the c-statistic.
The c-statistic indicates the area under the ROC curve.
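A single forward-selection step as described above can be sketched as follows. This is a minimal illustration, not the SAS procedure used in the study: the function names and the -2LL values in the usage lines are hypothetical, and the p-value formula applies only to a test with one degree of freedom (one term added).

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1), as in Equation 37."""
    return math.log(p / (1.0 - p))

def lr_chi_square(neg2ll_reduced, neg2ll_full):
    """Likelihood-ratio chi-square: the drop in -2LL when a term is added."""
    return neg2ll_reduced - neg2ll_full

def chi_square_p_value_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if statistic <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(statistic / 2.0))

def should_enter(neg2ll_reduced, neg2ll_full, alpha=0.05):
    """True if adding the candidate term is significant at level alpha."""
    statistic = lr_chi_square(neg2ll_reduced, neg2ll_full)
    return chi_square_p_value_1df(statistic) < alpha

# Hypothetical -2LL values for a model without and with a candidate feature.
print(should_enter(100.0, 95.0))   # drop of 5.0 in -2LL
print(should_enter(100.0, 98.0))   # drop of 2.0 in -2LL
```

In a full forward selection this test would be repeated for every candidate not yet in the model, adding the most significant one at each step until no candidate passes.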
4.5.3 Classification

Classification was performed with (a) all ten features and (b) the features selected by the SFS procedure. Ideally, the extracted features are representative of the classes that they represent. Supervised classification involves two stages: training and testing. Three different training techniques were used: the single test method, Leave One Out (LOO) Cross Validation (CV) and the alternate class training method.

In the single test method, 12 images out of the 22 were used for training and 10 images for testing. The training images were selected after careful optimization of the training dataset. They were selected based on the NN's performance on the selected training set and on the remaining images, which formed the testing set. The images that gave the best performance on the testing set were used as the training images.

The size of the dataset, however, was small, and training with 12 images may not have produced the desired high accuracy. Also, the classifier may perform well on the training dataset but may not generalize well, i.e. may not produce good test results. Cross Validation is an alternate evaluation method to estimate how well the trained model is going to 'generalize', or perform on unseen data. This is done in order to avoid the possible bias introduced by relying on any one particular division into test and train components. The original set is partitioned in several different ways and an average score is computed over the different partitions. The extreme variant of this is to split p patterns into a training set of size p-1 and a test set of size 1. This is performed p times and the squared error on the left-out pattern is averaged over the iterations. This is called LOO CV. In this work, LOO was performed using the 12 images. In the first step of the procedure, the first 11 images were used for training and the last image for testing. In the next step, a different set of 11 images was used for training and the remaining one for testing. This procedure was carried out 12 times, so each image was used exactly once for testing.
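The LOO partitioning over the 12 training images can be sketched as below. This is only an illustration of the splitting scheme; `train_and_score` is a hypothetical callback standing in for whatever classifier is trained and evaluated in each pass.

```python
def leave_one_out_splits(image_ids):
    """Yield (training_images, held_out_image) for every image in turn."""
    for held_out in range(len(image_ids)):
        train = [img for i, img in enumerate(image_ids) if i != held_out]
        yield train, image_ids[held_out]

def loo_average_error(image_ids, train_and_score):
    """Average the error returned by train_and_score over all LOO passes.

    train_and_score(train_images, test_image) is a hypothetical callback
    that trains on the first argument and returns the squared error on
    the held-out image.
    """
    errors = [train_and_score(train, test)
              for train, test in leave_one_out_splits(image_ids)]
    return sum(errors) / len(errors)

# Twelve images give twelve passes, each with 11 training images.
for train, test in leave_one_out_splits(list(range(1, 13))):
    print(test, len(train))
```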
Training using alternate classes was done to account for the imbalance in the training data set, since the number of FPs exceeded the MCs by almost five times. Training was performed using equal numbers of FPs and MCs, and these classes were presented alternately to the classifiers.

The NN was implemented in MATLAB using the NN toolbox. A feed-forward back propagation network was used, which consists of forward and backward phases. The NN architecture consisted of 2 hidden layers with 13 units each, and an output layer with 1 unit. The transfer function for the hidden layers was 'tan-sigmoid' and for the output layer was 'linear'. The network was trained for 1000 epochs and the Sum of Squares Error (SSE) goal was set to 15. The architecture of the NN is shown in Figure 17. Weights are initialized with random values. In the forward phase, the training inputs are given to the network. As the NN learns, the value of the error decreases. The error is propagated back to the hidden layer in the backward phase, thus modifying the weights of the network.

Figure 17 Architecture Of NN

The SVM was implemented using LIBSVM Version 2.6. This SVM classifier uses the Sequential Minimal Optimization (SMO) algorithm. The goal is to construct a binary classifier to derive a decision function from the available samples with the least probability of misclassifying a future sample. Different kernel functions and parameters were experimented with. The kernels included the polynomial, RBF and linear kernels with their different parameters. Initially these kernels and their parameters were compared. However, during the CV process, the best parameters are chosen by nested cross-validation procedures. The data was highly unbalanced, i.e. the FPs outnumbered the MCs by almost five times. Thus the classes were weighted unequally to set the penalty for an MC higher than that for an FP. Also, the data was normalized and scaled before presenting it to the SVM, to ease mathematical calculations as well as reduce the effect of larger attributes.

The final output of the SVM was a continuous vector ranging between 0 and 1, a value closer to 0 indicating an FP and a value closer to 1 indicating an MC. The output of the NN varied between -1 and +1. A threshold was specified on the output. If the likelihood value was greater than the threshold, the predicted class would be '1' or MC; if less than the threshold, the predicted class would be '-1' or FP.

4.5.4 Evaluation

Evaluation of the classification algorithms was performed using two measures: accuracy and the confusion matrix. FROC curves were plotted by varying the threshold on the predicted output.

The confusion matrix (Kohavi, 1988) contains information about actual and predicted classifications done by a classification system. The following table shows the confusion matrix for a binary classifier:

Table 4 Confusion Matrix

                Predicted +1    Predicted -1
Actual +1       TP              FN
Actual -1       FP              TN
Where
TP = number of correct predictions that an instance is positive
FP = number of incorrect predictions that an instance is positive
TN = number of correct predictions that an instance is negative
FN = number of incorrect predictions that an instance is negative

Based on the above values, the following evaluation criteria are defined:

(a) Accuracy: proportion of the total number of predictions that were correct (Equation 38).

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (38) \]

(b) True Positive Rate (TPR): proportion of positive cases that were correctly identified (Equation 39).

\[ \mathrm{TPR} = \frac{TP}{TP + FN} \qquad (39) \]

(c) False Positive Rate (FPR): proportion of negatives that were incorrectly classified as positives (Equation 40).

\[ \mathrm{FPR} = \frac{FP}{FP + TN} \qquad (40) \]

(d) True Negative Rate (TNR): proportion of negatives that were correctly identified (Equation 41).

\[ \mathrm{TNR} = \frac{TN}{TN + FP} \qquad (41) \]

(e) False Negative Rate (FNR): proportion of positive cases that were incorrectly classified as negative (Equation 42).

\[ \mathrm{FNR} = \frac{FN}{FN + TP} \qquad (42) \]
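Equations 38 to 42 translate directly into code. The sketch below is illustrative only; the function name and the counts in the usage lines are hypothetical and are not taken from the study's data.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def confusion_metrics(tp, fn, fp, tn):
    """Evaluation criteria of Equations 38-42 from confusion-matrix counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),  # Equation 38
        "tpr": _ratio(tp, tp + fn),                      # Equation 39
        "fpr": _ratio(fp, fp + tn),                      # Equation 40
        "tnr": _ratio(tn, tn + fp),                      # Equation 41
        "fnr": _ratio(fn, fn + tp),                      # Equation 42
    }

# Hypothetical counts: 50 actual positives and 150 actual negatives.
for name, value in confusion_metrics(tp=40, fn=10, fp=20, tn=130).items():
    print(name, round(value, 3))
```

Note that TPR + FNR = 1 and FPR + TNR = 1, so each row of a rate-normalized confusion matrix sums to one.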
Accuracy alone is not an adequate measure of performance, especially in our case where the number of negative cases is much greater than the number of positive cases (Kubat M, 1998). Suppose there are 100 cases, 95 of which are negative and 5 positive. If the system classified all the cases as negative, the accuracy would be 95%, even though the classifier missed all the positive cases. Thus it is important to study the other parameters described above.

The FROC curve gives a graphical representation of these parameters for various thresholds on the output and encapsulates all the information contained in the confusion matrix. Here, the number of FPs/image is plotted on the x-axis and the TPR on the y-axis. Each threshold results in an (FP, TP) pair, and a series of such pairs is used to plot the FROC curve. In our case, the true positive fraction (TPF, i.e. the TPR) is the probability of correctly classifying a true MC as an MC. The false positive fraction (FPF, i.e. the FPR) is the probability of incorrectly classifying a false positive (or 'false alarm', to avoid term confusion) as an MC.

In medical diagnosis, these values are translated to produce two important indices of assessment: sensitivity and specificity. Sensitivity refers to the TPR, or the proportion of patients with cancer who test positive. Specificity refers to the TNR (or 1-FPR), or the proportion of patients without cancer who test negative. The position of the cut-off determines the number of TP, FP, TN and FN. As the sensitivity is increased, the specificity is sacrificed. Thus, an optimum cut-off needs to be chosen, for which the sensitivity and specificity values are acceptable. Here, the TNR and TPR refer to the specificity and sensitivity of the classification stage. The overall specificity and sensitivity are affected by their respective values in the segmentation stage.

Classification and evaluation based on the SVM and NN algorithms were carried out and their performances were compared.
Also, the performances of these algorithms using all features, the most significant ones and their interactions were compared.
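The construction of an FROC curve from thresholded classifier outputs can be sketched as follows. This is a simplified illustration under stated assumptions: labels are +1 (MC) and -1 (FP), each sample carries the identifier of the image it came from, and the scores in the usage lines are made up.

```python
def froc_points(scores, labels, image_ids, thresholds):
    """Return one (average FPs per image, TPR) pair per threshold.

    A sample is predicted as MC (+1) when its score exceeds the threshold.
    """
    n_images = len(set(image_ids))
    n_positive = sum(1 for y in labels if y == 1)
    if n_images == 0 or n_positive == 0:
        raise ValueError("need at least one image and one positive sample")
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == -1)
        points.append((fp / n_images, tp / n_positive))
    return points

# Made-up outputs for six samples drawn from two images.
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2]
labels = [1, -1, 1, -1, 1, -1]
images = [0, 0, 0, 1, 1, 1]
for afp, tpr in froc_points(scores, labels, images, [0.5, 0.1]):
    print(afp, tpr)
```

Lowering the threshold moves along the curve toward higher sensitivity at the cost of more false positives per image, which is the trade-off the cut-off selection has to balance.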
CHAPTER 5 RESULTS AND DISCUSSION

This chapter presents the results of various experiments conducted on the training and testing images. The results are presented as follows:
1. Detailed statistical analysis of input features
2. Logistic regression and Forward Selection
3. Classification results for different types of training methods with and without feature selection

5.1 Statistical Analysis of Features

The output of the segmentation and feature extraction process was a text file consisting of MC and FP cases. It was necessary to perform a detailed analysis of the input data due to the lack of complete domain knowledge. The answers we seek are: Are the input features related to the outcome? Is there a pattern? Three types of statistical analyses were performed: Univariate, Multivariate and Logistic Regression. Univariate analysis refers to the analysis of a single variable. This helps us get a 'feel' for the data by giving us an overall description of what we are working with.

Univariate analysis included histogram plots and feature statistics. The simplest way to visualize the input variables in each class is to create a frequency distribution of the data on each input variable (independent) or feature. Histograms give us an idea about the distribution of data in a dataset. The vertical axis of the histogram gives the number of counts of the data in each data range or bin, with the bins plotted on the horizontal axis. In this study, the histograms were plotted to give us an idea about the distribution of data values in each class.
Figure 18 Histograms Of Individual Input Features
[Ten panels, one per variable, each plotting frequency against bin for the MC and FP classes: x1 Eccentricity; x2 Spread; x3 Mean entropy; x4 Deviation of entropy; x5 Moment; x6 Compactness; x7 Average foreground; x8 Deviation foreground; x9 Mean contrast; x10 Boundary gradient.]
The histograms in Figure 18 show that most features are distributed in the same range for both the classes. This makes it impossible to use any one feature to distinguish between the two classes. Also, the distributions are heavily skewed (mostly to the right, in all cases except x3 and x5), i.e. the distribution of values is not symmetrical about the mean. Thus it is very difficult to estimate a "typical value" for the distribution. For instance, in a symmetric distribution, the typical value would be the center of the distribution. Data that is seriously skewed may be an indication that there are inconsistencies in the process or procedures. Further decisions need to be made to determine if the skew is actually appropriate ("Histogram"). Among all the features, x3 (mean entropy) and x5 (moment) have reasonably (though not significantly) different ranges of values.

Table 5 Univariate Statistics Of Input Features For Both Classes

Class FP, n=5500
Feature                  Mean        SD        Std. error mean   Upper 95% mean   Lower 95% mean
x1  Eccentricity         0.1213      0.1895    0.0025            0.1263           0.1162
x2  Spread               0.1789      0.1069    0.0014            0.1818           0.1761
x3  Mean entropy         0.1873      0.0576    0.0007            0.1888           0.1858
x4  Dev. entropy         0.0324      0.0216    0.0003            0.0332           0.0318
x5  Moment               0.0817      0.0319    0.0004            0.0826           0.0809
x6  Compactness          10.19       7.83      0.1057            10.4             9.98
x7  Avg. foreground      -25008.8    7458.7    100.58            -24811.61        -25205.97
x8  Dev. foreground      26.47       36.5      0.492             27.43            25.5
x9  Mean contrast        1048.12     1046.05   14.1              1075.7           1020.4
x10 Boundary gradient    175.94      467.26    6.3               188.3            163.59

Class MC, n=609
Feature                  Mean        SD        Std. error mean   Upper 95% mean   Lower 95% mean
x1  Eccentricity         0.1224      0.1391    0.0056            0.1335           0.1114
x2  Spread               0.1747      0.0475    0.0019            0.1785           0.1709
x3  Mean entropy         0.1569      0.0576    0.0023            0.1615           0.1523
x4  Dev. entropy         0.0219      0.0195    0.0007            0.0235           0.0204
x5  Moment               0.079       0.0296    0.0012            0.0814           0.0767
x6  Compactness          11.173      9.295     0.3766            11.912           10.433
x7  Avg. foreground      -25460.3    4128.71   167.3             -25131.76        -25788.89
x8  Dev. foreground      22.22       21.18     0.8586            23.91            20.54
x9  Mean contrast        1266        709.77    28.761            1322.58          1209.61
x10 Boundary gradient    170.5       465.952   18.881            207.584          133.42
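The standard error and confidence bounds in Table 5 can be checked from the mean, SD and sample size. The sketch below assumes a normal approximation with z = 1.96; the source does not state how the bounds were computed, so this is an assumption that happens to agree with the tabulated values to within rounding.

```python
import math

def mean_confidence_interval(mean, sd, n, z=1.96):
    """Return (standard error, lower bound, upper bound) for a sample mean.

    Uses the normal approximation mean +/- z * sd / sqrt(n); z = 1.96
    corresponds to a 95% interval and is an assumption, not a detail
    taken from the source.
    """
    std_error = sd / math.sqrt(n)
    return std_error, mean - z * std_error, mean + z * std_error

# x1 (eccentricity), class FP, values from Table 5.
se, lower, upper = mean_confidence_interval(0.1213, 0.1895, 5500)
print(round(se, 4), round(lower, 4), round(upper, 4))
```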
Table 5 gives the class-wise statistics. It is observed that the means of all the features have approximately the same values for both the classes. However, as observed in the histograms, the data are not symmetrically distributed. Thus studying just the mean holds little significance in this context. Variables x7 and x9 show very high standard deviations from the mean.

It is difficult to visualize this data in 10-dimensional space. Instead, for visual representation of these classes, we plot them in 2-dimensional space. The input data was reduced to two principal components (PCs), which account for 98% of the variance in the input data. Principal Component Analysis (PCA) was used here to transform the given dataset into a two-dimensional representation that retains most of the information contained in the original dataset. Here, PCA was only used to visualize class separability in a two-dimensional space and not for any other data analysis.

Figure 19 Plot Of PC-1 Vs PC-2

The above scatter plot shows completely overlapping points for both the classes. This indicates that the features are statistically not representative of the classes that they belong to.
The next step is to study the relationships between the input features and the outcome. The main interpretation of logistic regression results is to find the significant predictors of the outcome. A logistic fit of each predictor vs. the outcome was performed.

Table 6 Logistic Fit Of Outcome By Individual Predictors

Variable   Feature             Parameter estimate (β)   Std. Error   Chi-square   Prob > ChiSq
x1         Eccentricity        -0.034                   0.229        0.02         0.882
x2         Spread              -0.5759                  0.5818       0.98         0.3222
x3         Mean entropy        8.1535                   0.6845       141.88       <0.0001
x4         Dev. of entropy     29.26                    2.506        136.35       <0.0001
x5         Moment              2.74                     1.37         4            0.0455
x6         Compactness         -0.01                    0.004        6.54         0.0105
x7         Avg. foreground     0.00000902               0.0000061    2.16         0.142
x8         Dev. foreground     0.0051                   0.0018       7.9          0.0049
x9         Mean contrast       -0.0001536               0.0000328    21.97        <0.0001
x10        Boundary gradient   0.0000272                0.0000997    0.07         0.7851

In Table 6, β>0 for variable x3 (mean entropy). Since the coefficient for mean entropy is positive, the log odds (and therefore the probability) of MC increases with mean entropy. On the other hand, the β values for x7 and x10 are close to zero. This would imply that the strength of association of the features 'average foreground' and 'boundary gradient' with the outcome is very poor.

Interpretation of β: The parameter estimate (β) gives the increase in log odds of the outcome for a one unit increase in x, i.e. e^β represents the change in odds of the outcome when x increases by 1 unit. Given below is the interpretation of β:
(a) If β=0, the odds and probability are the same at all x levels (e^β = 1)
(b) If β>0, the odds and probability increase as x increases (e^β > 1)
(c) If β<0, the odds and probability decrease as x increases (e^β < 1)

The overall significance of the variables is tested using the Model Chi-square, which is derived from the likelihood of observing the actual data under the assumption that the model that has been fitted is accurate.
Twice the difference in log likelihoods between the model with the predictor and the model without the predictor is distributed as a chi-square with degrees of freedom equal to the number of predictors. Thus, chi-square tests are used to test whether the predictors are significant or not. If we assume a significance level of 0.05, any p-value (Prob > ChiSq) less than 0.05 is significant. In Table 6 above, x3 (mean entropy), x4 (dev. of entropy), x5 (moment), x6 (compactness), x8 (dev. foreground) and x9 (mean contrast) have p-values < 0.05, indicating that these features are significant (individually) in our model. These features are from both the spatial and morphological domains.

5.2 Feature Selection using SFS

The above univariate analysis gives an interpretation of individual features and their individual relationships with the outcome. The SFS gives the best feature subset, as explained in Section 4.5.2. The results of SFS with only the predictors included are shown in Table 7. This is the main effect model.

Table 7 Forward Selection Results For Data: Main Effect Model

Parameter   DF   Estimate    Standard error   Chi-square   Pr > ChiSq
Intercept   1    1.9213      0.5857           10.7593      0.001
x1          1    -0.5201     0.6946           0.5605       0.4541
x2          1    -15.7646    4.2701           13.6301      0.0002
x3          1    -14.4453    1.1339           162.2841     <.0001
x6          1    0.0462      0.0188           6.0598       0.0138
x7          1    -0.00002    7.59E-06         9.1834       0.0024
x9          1    0.000106    0.000043         6.1915       0.0128

It is observed that the SFS procedure has selected the following features as a good subset:
a) eccentricity
b) spread
c) compactness
d) mean entropy
e) average foreground
f) mean contrast

Features (a) and (b) are margin descriptors; (c) is a morphological feature, while features (d) to (f) are spatial domain features. Features in both the spatial and morphological domains have been selected, indicating that these features are significant and important in improving the discriminatory power of the model.
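The reading of a logistic coefficient as a change in odds, used when interpreting the parameter estimates of Tables 6 and 7, can be illustrated numerically. The helper names below are hypothetical; the two coefficients in the usage lines are the Table 6 estimates for x8 and x6.

```python
import math

def odds_ratio(beta, delta_x=1.0):
    """Multiplicative change in odds when x increases by delta_x units."""
    return math.exp(beta * delta_x)

def probability_from_logit(log_odds):
    """Invert the logit transformation to recover a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# A positive coefficient raises the odds, a negative one lowers them.
print(odds_ratio(0.0051))   # x8, dev. foreground: ratio above 1
print(odds_ratio(-0.01))    # x6, compactness: ratio below 1
print(probability_from_logit(0.0))
```

Because the odds ratio is tied to a one-unit change, its size depends on the scale of the feature, so coefficients of features measured in very different units are not directly comparable.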
SAS uses the c-statistic to determine the discriminating power of the logistic model. The c-statistic is the area under the ROC curve, which approaches one for a model that discriminates perfectly. The c-value for the main effect model was 0.693. The ROC curve for the main effect model with the predictors in Table 7 is shown in Figure 20.

Figure 20 ROC Curve For Main Effect Model, C=0.693
[Horizontal axis: 1-Specificity = FPR]

Once the main effect model has been constructed, two-way interactions are studied to assess the predictive effect of two independent variables on the outcome. All possible interactions between the ten input variables were added into the model and an SFS procedure was performed.

Results of SFS with all the two-way interactions included in the main effect model are shown in Table 8.
Table 8 Forward Selection Results For Data: Interaction Effect Model

Parameter   DF   Estimate     Standard Error   Chi-Square   Pr > ChiSq
Intercept   1    0.7675       0.7844           0.9574       0.3279
x1          1    -0.4961      1.8911           0.0688       0.7931
x3          1    -20.5268     1.8069           129.0604     <.0001
x1*x3       1    16.5311      4.4748           13.6475      0.0002
x7          1    0.000011     0.000028         0.1568       0.6921
x1*x7       1    0.000162     0.000059         7.5173       0.0061
x8          1    0.00385      0.00493          0.6082       0.4355
x9          1    -0.00243     0.000443         30.1192      <.0001
x3*x9       1    0.00991      0.00147          45.7245      <.0001
x7*x9       1    -9.61E-08    1.72E-08         31.4287      <.0001
x8*x9       1    -0.00001     2.71E-06         26.6355      <.0001

This model has chosen the following features and their interactions as the optimal subset:
a) eccentricity
b) mean entropy
c) average foreground
d) dev. foreground
e) mean contrast
f) (eccentricity*mean entropy)
g) (eccentricity*avg. foreground)
h) (mean entropy*mean contrast)
i) (avg. foreground*mean contrast)
j) (dev. foreground*mean contrast)

It is observed here that spatial domain features dominate in significance. Except for eccentricity, the remaining features are all in the spatial domain. New indices based on these two-way interactions can be considered to improve the discriminatory power of the features. The c-value of this model is 0.77, a considerable improvement over the 0.693 of the main effect model. The ROC curve for the model with the predictors of Table 8 is shown in Figure 21.
Figure 21 ROC Curve For Interaction Effect Model, C=0.77
[Horizontal axis: 1-Specificity = FPR]

This completes the data pre-processing section, where we analyzed in detail the feature statistics and the relationships between the predictors and the outcome, and performed feature selection (main effect and interaction effects) using SFS.

In summary, it is seen that the features taken individually are not very good predictors of the outcome. Feature statistics and univariate analysis are evidence of this observation. A feature selection procedure helps select the best subset of features that could increase the discriminatory power of our model. Features in both the spatial domain and morphological domain were selected during the feature selection process. However, the SFS procedure used logistic regression as the induction algorithm. This may not guarantee the best performance results from the NN and SVM. Further classification is performed with all ten features and with the features from the SFS, and the results are compared.
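The c-statistic reported for the two logistic models is the area under the ROC curve, which equals the fraction of (positive, negative) pairs in which the positive case receives the higher predicted score, with ties counted as one half. The sketch below computes it by direct pair counting; it is an illustration with made-up scores, not the SAS computation.

```python
def c_statistic(scores, labels):
    """Area under the ROC curve by pairwise concordance.

    labels use 1 for the positive class; any other value is negative.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))

# Made-up scores: three of the four positive/negative pairs are ordered correctly.
print(c_statistic([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts 0.693 and 0.77 in context.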
5.3 Classification

The parameters used with the NN classifier were presented in the Materials and Methods section. Despite the theoretical advantages that SVMs possess over NNs, the SVM requires a certain amount of model selection. The kernel parameter is one of the most important design choices for the SVM, since it implicitly defines the structure of the high dimensional feature space where a maximal margin hyperplane will be found. Thus the choice of the SVM kernel is crucial. We study two popular kernels (polynomial and RBF) with various parameters, to see which one best suits our case. These kernels were tested on the 12 training images and on 10 unseen images. The training dataset consisted of 3167 cases, of which 553 were MCs and 2614 were FPs.

Table 9 shows the results on training and testing data for various kernels.

Table 9 Choice Of SVM Kernel*

                         Training                            Testing
Kernel, parameter        Acc.   TPR    FNR    FPR    TNR     Acc.   TPR    FNR    FPR    TNR
RBF kernel, g=7 (equal weights for classes)
  c=1000                 0.85   0.12   0.88   0      1       0.93   0.16   0.84   0.05   0.95
  c=100                  0.89   0.39   0.61   0      1       0.91   0.25   0.75   0.08   0.92
  c=10                   0.88   0.35   0.65   0.01   0.99    0.89   0.34   0.66   0.1    0.9
  c=5                    0.88   0.32   0.68   0.01   0.99    0.9    0.3    0.7    0.09   0.91
RBF kernel (with c=50 for class 1 and c=10 for class -1)
  g=9                    0.88   0.81   0.19   0.11   0.89    0.82   0.43   0.57   0.18   0.82
  g=7                    0.87   0.8    0.2    0.12   0.88    0.81   0.48   0.52   0.18   0.82
  g=5                    0.86   0.76   0.24   0.12   0.88    0.81   0.45   0.55   0.18   0.82
Polynomial kernel (with c=50 for class 1 and c=10 for class -1)
  d=7                    0.8    0.5    0.5    0.14   0.87    0.81   0.54   0.46   0.19   0.82
  d=3                    0.8    0.49   0.51   0.13   0.87    0.81   0.55   0.45   0.19   0.81

* All numerical values rounded to two decimal places.

The confusion matrix entries (TPR, FNR, FPR, TNR) are interpreted as follows:

                Predicted +1    Predicted -1
Actual +1       TPR             FNR
Actual -1       FPR             TNR
It is very important to note the usage of the term 'False Positive'. In this work, FP is also one of the classes, i.e. it refers to a false positive MC signal. However, the FP that is evaluated in the confusion matrix above refers (in our case) to an FP MC signal misclassified as an MC.

From Table 9, it is observed that sensitivity (TPR) increases drastically when different weights are introduced for the classes. This is because our data is highly unbalanced, and a higher penalty for a positive case gives equal importance to this under-represented class.

It is observed that testing accuracy tends to decrease as the value of c (error penalty) decreases. A high value of the error penalty forces the SVM training to avoid classification errors, thus resulting in a larger search space for the QP optimizer. It is observed that some experiments fail to converge for very large values of c (c > 1000). An optimum value of c=10 is chosen.

The performance of the RBF kernel largely depends on the value of g, which is the radius of the RBF kernel. Training accuracy decreases slightly as g decreases (0.88, 0.87 and 0.86 for g = 9, 7 and 5). Though accuracy at g=9 was the highest, with a good sensitivity for training data, the TPR on testing data for g=7 is higher. It can be seen that the polynomial kernel performance on testing data is better (in terms of sensitivity). The FROC curves for the training and testing datasets for the RBF and polynomial kernels are shown in Figure 22.

Figure 22 FROC Curves For (Left) Training And (Right) Testing Images, C=50 For Class 1 And C=10 For Class -1, RBF Kernel Radius=7; Polynomial Kernel, Degree=3

Considering the overall performance (on training and testing images), the RBF kernel shows better results than the polynomial kernel. There is a drastic improvement in sensitivity of the RBF kernel for a threshold of 0.65. Sensitivity at this threshold on training images was about 80%, while specificity was about 88%. However, the sensitivity of the polynomial kernel on test data was 55%, as against 48% for the RBF kernel. Note that the horizontal axis in the above graph gives the average number of FPs/image and not the FP clusters. The RBF kernel with g=7 was used for further classification.

Once the initial kernel selection was performed, classification and evaluation were done. This step of analysis was broken into experiments, which are summarized below:
(a) Experiment #1: Ten features with NN and SVM using single test
(b) Experiment #2: Ten features with NN and SVM using LOO CV
(c) Experiment #3: Ten features with NN and SVM using alternate classes for training
(d) Experiment #4: Use features from forward selection results, main effect model, with NN and SVM
(e) Experiment #5: Use features from forward selection results, interaction effect model, with NN and SVM

Experiment #1

This experiment used 12 images for training and 10 images for testing. The SVM with RBF kernel (radius=7, error penalty c=50 for class 1 and c=10 for class -1) was used. The NN used the SBP algorithm with 2 hidden layers of 13 units each. The convergence criterion was set to an SSE of 15. All NNs were trained until there was no significant change in the SSE. The performances of the NN and SVM on training and testing images are given in Figures 23 and 24.
Figure 23 FROC Curve For NN Using All Ten Features, Single Test

Figure 24 FROC Curve For SVM Using All Ten Features, Single Test

Figure 25 Comparison Of FROC Curves For NN And SVM, (Left) Training Images, (Right) Testing Images

Table 10 Accuracy And Confusion Matrix For Experiment #1

             Training                            Testing
Algorithm    Acc.   TPR    FNR    FPR    TNR     Acc.   TPR    FNR    FPR    TNR
NN           0.86   0.84   0.16   0.13   0.87    0.7    0.5    0.5    0.3    0.7
SVM          0.87   0.8    0.2    0.12   0.88    0.81   0.48   0.52   0.18   0.82
It can be observed from the above graphs that the SVM shows a drastic improvement in sensitivity at a specific threshold (0.7). The accuracy of both the algorithms on training data is comparable. However, the SVM clearly outperforms the NN on testing (unseen) data. The overall accuracy on unseen data is 81% for the SVM and 70% for the NN, with respective specificities of 0.82 and 0.7. Thus, the average number of FPs per image is much lower for the SVM than for the NN.

Experiment #2

This experiment used 12 images and performed CV on these images. The 10 testing images were kept completely independent of the training dataset to evaluate the classifiers' true generalization capability.

The NN was trained using LOO CV. Here, training is performed by dividing the 12 images into several training and testing sets. In each pass, 11 images are used for training and 1 image for testing. At the end of the process, each image has been used exactly once for testing. LOO is generally used to find the parameters of the classifiers which result in the least generalization error. However, with the NN, the model was trained each time with the LOO training datasets.

The SVM, on the other hand, did not require training with each LOO training set. Here, the LOO CV was performed as a "grid-search" where pairs of (C, γ) are tried and the one with the best CV accuracy is picked. Trying exponentially growing sequences of C and γ is a practical method to identify good parameters (LIBSVM manual) (for instance, C = 2^-5, 2^-3, ..., 2^15; γ = 2^-15, 2^-13, ..., 2^3). Parameter selection was performed using various values of C and γ. The best (C, γ) pair was (2^7, 2^3), with a CV rate of 75.56%. Thus c=128, g=8 were used for this experiment.

The performance of these algorithms on training and testing data is shown in Figures 26, 27 and 28.
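The exponentially spaced parameter grid described for the SVM can be generated and searched as sketched below. The `cv_accuracy` argument is a hypothetical callback standing in for a cross-validation run of the SVM; the toy scoring function in the usage lines is made up purely so that the example runs, and it is constructed to peak at the (2^7, 2^3) pair reported in the text.

```python
import math

def exponent_grid(first_exponent, last_exponent, step=2):
    """Powers of two from 2**first_exponent to 2**last_exponent inclusive."""
    return [2.0 ** e for e in range(first_exponent, last_exponent + 1, step)]

def grid_search(cv_accuracy, c_values, gamma_values):
    """Return (best accuracy, C, gamma) over all pairs in the grid.

    cv_accuracy(C, gamma) is a hypothetical callback returning the
    cross-validation accuracy for one parameter pair.
    """
    best = None
    for c in c_values:
        for gamma in gamma_values:
            accuracy = cv_accuracy(c, gamma)
            if best is None or accuracy > best[0]:
                best = (accuracy, c, gamma)
    return best

c_values = exponent_grid(-5, 15)       # 2^-5, 2^-3, ..., 2^15
gamma_values = exponent_grid(-15, 3)   # 2^-15, 2^-13, ..., 2^3

# Toy stand-in for a CV run: highest at C = 2^7, gamma = 2^3.
def toy_score(c, gamma):
    return -abs(math.log2(c) - 7) - abs(math.log2(gamma) - 3)

print(grid_search(toy_score, c_values, gamma_values)[1:])
```

A coarse grid like this is commonly followed by a finer search around the best pair, since each grid point requires a full cross-validation run.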
Figure 26 FROC Curve For NN Using All Ten Features, LOO

Figure 27 FROC Curve For SVM Using All Ten Features, LOO

Figure 28 Comparison Of FROC Curves For NN And SVM, (Left) Training Images, (Right) Testing Images

Table 11 Accuracy And Confusion Matrix For Experiment #2

             Training                            Testing
Algorithm    Acc.   TPR    FNR    FPR    TNR     Acc.   TPR    FNR    FPR    TNR
NN           0.75   0.47   0.53   0.19   0.81    0.61   0.38   0.63   0.39   0.61
SVM          0.93   0.81   0.19   0.04   0.96    0.76   0.46   0.54   0.24   0.76

The SVM outperformed the NN in terms of accuracy and sensitivity on training data. The good overall accuracy of the SVM in this experiment shows the importance of performing CV to choose good model parameters. However, the performance of the NN on training data did not improve with this method, though generalization performance improved compared to the single test method. The performances of both the classifiers on unseen data were comparable.

Experiment #3

The number of MCs in the training set of 12 images was 553 and the number of FPs was 2614. Thus the number of FPs exceeded the MCs by almost five times. Initial observations showed that the classifiers were 'biased' toward the FP class because of the imbalance in the data. Since the dataset is highly unbalanced, it is desirable to study the results of using a balanced dataset with equal numbers of positive and negative cases. 553 FPs were randomly selected from the 2614 cases. The classifiers were presented with a pattern from class MC, and then a pattern from class FP (alternately). The performance results are given for three cases: (1) the training set, which contains equal numbers of MC and FP cases, (2) the training images, which are the 12 images from which these cases were picked, and (3) the testing images, which are the 10 unseen images. Results of this experiment are summarized below.

Figure 29 FROC Curve For NN Using All Ten Features, Training With Alternate Classes
Figure 30 FROC Curve For SVM Using All Ten Features, Training With Alternate Classes

Figure 31 Comparison Of FROC Curves For NN And SVM, (Top Left) Training Dataset, (Top Right) 12 Training Images, (Bottom Left) 10 Testing Images

Table 12 Accuracy And Confusion Matrix For Experiment #3

Algorithm   Training dataset                Training images                 Testing images
            Accuracy   Confusion Matrix     Accuracy   Confusion Matrix     Accuracy   Confusion Matrix
NN          0.94       1     0              0.65       0.99  0.01           0.51       0.59  0.41
                       0.12  0.88                      0.42  0.58                      0.49  0.51
SVM         0.93       0.98  0.02           0.54       0.94  0.06           0.4        0.73  0.27
                       0.12  0.88                      0.55  0.45                      0.61  0.39
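As a reading aid for the accuracy tables, the overall accuracy can be recovered from the confusion-matrix rates and the class counts. The sketch below assumes that each confusion matrix is normalized per class, with the first row for the MC class (sensitivity first) and the second row for the FP class (specificity second). That layout is an assumption about the tables, not something stated in the text, and `overall_accuracy` is a hypothetical helper.

```python
def overall_accuracy(sensitivity, specificity, n_mc, n_fp):
    """Accuracy as the class-count-weighted mean of the per-class rates."""
    return (sensitivity * n_mc + specificity * n_fp) / (n_mc + n_fp)


N_MC, N_FP = 553, 2614  # class counts in the 12 training images

# Experiment #3, training images: rates read from Table 12.
nn_training_images = overall_accuracy(0.99, 0.58, N_MC, N_FP)
svm_training_images = overall_accuracy(0.94, 0.45, N_MC, N_FP)

# Experiment #3, balanced training dataset: 553 cases of each class.
nn_balanced = overall_accuracy(1.0, 0.88, N_MC, N_MC)
```

Rounded to two decimals these give 0.65, 0.54 and 0.94, in agreement with the accuracies listed for the NN and SVM in Table 12. It also shows why a high sensitivity can coexist with a low overall accuracy: the FP class is almost five times larger, so specificity dominates the weighted mean.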
From Figure 29, it can be seen that the NN performed extremely well on the training dataset and the training images. The SVM showed a sharp increase in sensitivity for the training dataset and images. From the FROC curve on testing images, it is evident that the SVM outperformed the NN in terms of sensitivity. Thus, the SVM's capability to generalize is better than that of the NN.

However, it should be noted that this method may not be the most appropriate for training the classifiers, since the average number of false positives per image is very high (poor specificity) in spite of the high sensitivity. This means that a large number of '-1's are being misclassified as '1'. Also, the FPs in the training dataset are chosen randomly, so the chosen samples may not be representative of the FP class.

The above three experiments are summarized in the FROC graphs of Figure 32.

[Figure 32 panels: FROC curves for various training methods on the 12 training images and on the 10 testing images. Axes: AFP/image (horizontal), TPR (vertical). Series: NN-single test, SVM-single test, NN-CV, SVM-CV, NN-alternate classes, SVM-alternate classes.]

Figure 32 Comparison Of FROC Results From Experiments 1, 2 And 3
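The two quantities plotted in these FROC graphs, the true positive rate (TPR) and the average number of false positives per image (AFP/image), can be computed from thresholded classifier outputs roughly as follows. This is a sketch under assumptions: it treats every candidate as a (score, label) pair with label 1 for MC and -1 for FP, and it measures sensitivity relative to the MC candidates passed to the classifier. The function names and the toy data are hypothetical.

```python
def froc_point(detections, n_images, threshold):
    """Return (TPR, AFP per image) when scores >= threshold are called MC.

    detections: list of (score, label) pairs, label 1 = MC, -1 = FP.
    """
    n_mc = sum(1 for _, label in detections if label == 1)
    tp = sum(1 for s, label in detections if label == 1 and s >= threshold)
    fp = sum(1 for s, label in detections if label == -1 and s >= threshold)
    tpr = tp / n_mc if n_mc else 0.0
    return tpr, fp / n_images


def froc_curve(detections, n_images, thresholds):
    """One (TPR, AFP per image) point per threshold."""
    return [froc_point(detections, n_images, t) for t in thresholds]


# Toy data: six candidates pooled from two images.
example_detections = [(0.9, 1), (0.8, -1), (0.6, 1),
                      (0.4, -1), (0.3, 1), (0.2, -1)]
curve = froc_curve(example_detections, 2, [0.0, 0.5, 1.0])
```

Lowering the threshold moves along the curve toward higher TPR at the cost of more false positives per image, which is the trade-off discussed for the alternate-class training method.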
From the above graphs, we observe that the SVM trained with alternate classes shows the highest sensitivity on both training (98%) and testing (73%) images. However, the results indicate poor specificity (the average number of FPs per image is high). The SVM with parameters selected from the CV process showed a high sensitivity of 81% on training images with a low AFP/image value, but a sensitivity of only about 46% on the testing images. CV improved the performance of the NN on unseen data.

The SVM trained with the parameters in Experiment #1 had a sensitivity of 80% and a specificity of about 88% on training images. On testing images, the sensitivity was 48% and the specificity was 81%. This model showed good sensitivity as well as low AFP values on both training and testing images. Overall, the SVM outperformed the NN. In Experiments #1 and #3, though the two were comparable on the training data, the SVM was clearly superior on the testing (unseen) data.

Experiment #4

This experiment was performed with the feature selection results from the logistic regression. The SFS procedure selected the features eccentricity, spread, mean entropy, compactness, average foreground and mean contrast as significant. Only these features were used in the model, to study whether accuracy improved. Thus the input feature space was six-dimensional. Only the main variables (or effects) were considered first. Training and testing were done using the single test method.

Figure 33 FROC Curves For Training And Testing Images Using SFS Main Effect Variables
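The general shape of a stepwise forward selection loop is sketched below for illustration. It is not the procedure used in this work, which relied on logistic regression as the induction algorithm: here the criterion is a generic caller-supplied `score` function, and the feature weights and the `noise_a` and `noise_b` names are invented placeholders.

```python
def forward_selection(features, score, min_gain=0.0):
    """Greedily add the feature that most improves score(selected).

    score is a hypothetical callback that takes a list of feature names
    and returns a number where larger is better.
    """
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score - best_score <= min_gain:
            break  # no remaining feature improves the criterion
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# Invented weights: only the first two features carry information.
toy_weights = {"eccentricity": 0.3, "spread": 0.2}


def toy_score(names):
    return sum(toy_weights.get(n, 0.0) for n in names)


chosen = forward_selection(
    ["eccentricity", "spread", "noise_a", "noise_b"], toy_score)
```

In this toy run the two informative features are added in order of their contribution and the loop stops once no remaining feature raises the score.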
Table 13 Accuracy And Confusion Matrix For Experiment #4

Algorithm   Training                        Testing
            Accuracy   Confusion Matrix     Accuracy   Confusion Matrix
NN          0.67       0.73  0.27           0.61       0.68  0.32
                       0.34  0.66                      0.39  0.61
SVM         0.85       0.69  0.31           0.8        0.48  0.52
                       0.12  0.88                      0.19  0.81

The SVM again outperforms the NN in training and testing accuracy. For the SVM, a sensitivity of about 69% is seen at a specificity of 88% on training images (Table 13). The NN shows better sensitivity (68%) than the SVM (48%) on testing images in this case. If we consider an AFP value of 50, both the NN and the SVM have a testing sensitivity of about 50%.

Experiment #5

The interaction effects from the SFS procedure were added to the main effect model. The interactions that were added are discussed in Table 8. New variables were created by multiplying the corresponding variable values.

Figure 34 FROC Curves For Training And Testing Images Using SFS Interaction Effect Variables

Table 14 Accuracy And Confusion Matrix For Experiment #5

Algorithm   Training                        Testing
            Accuracy   Confusion Matrix     Accuracy   Confusion Matrix
NN          0.73       0.89  0.11           0.83       0.81  0.19
                       0.31  0.69                      0.17  0.83
SVM         0.84       0.7   0.3            0.76       0.57  0.43
                       0.13  0.87                      0.24  0.76
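Creating an interaction variable by multiplying the corresponding variable values, as done in Experiment #5, can be sketched as follows. The pair used here and the feature values are purely illustrative, since the interactions actually selected are those listed in Table 8. `add_interactions` is a hypothetical helper.

```python
def add_interactions(sample, pairs):
    """Return a copy of sample with one product term per feature pair.

    sample: dict mapping feature name to value.
    pairs: list of (name_a, name_b) tuples naming the interactions.
    """
    extended = dict(sample)
    for a, b in pairs:
        extended[f"{a}*{b}"] = sample[a] * sample[b]
    return extended


# Illustrative values and an illustrative pair, not taken from Table 8.
sample = {"eccentricity": 0.5, "spread": 4.0, "compactness": 2.0}
with_interactions = add_interactions(sample, [("eccentricity", "spread")])
```

The main effect variables are kept unchanged and each interaction adds one extra input dimension to the classifier.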
It is seen that the testing sensitivity of the SVM is poor in this case. The NN showed much higher sensitivity on testing data (81%) than the SVM (57%) (Table 14).

We now examine how these models with feature selection performed compared to the models using all ten features. Figure 35 summarizes this comparison.

[Figure 35 panels: comparison of performance on training data and on testing data, with and without feature selection. Axes: AFP/image (horizontal), TPR (vertical). Series: NN and SVM with all 10 features (single test), NN and SVM with FS interaction effect variables, NN and SVM with FS main effect variables.]

Figure 35 Comparison Of FROC Graphs For Models With And Without Feature Selection
The NN and SVM that used all the features showed a sensitivity of about 78% for an AFP value of 17 on the training images. However, the sensitivity on the testing images was around 33% and 47% respectively for an AFP value of 40 per image.

Model performance on training images did not improve with feature selection. However, with feature selection, the generalization capability of the NN classifier increased significantly. For an AFP value of 40 per image, the sensitivity is about 78% for interaction effect variables and 48% for main effect variables, a significant improvement over the 33% sensitivity obtained from the NN without feature selection.

The same feature selection variables did not improve the generalization performance of the SVM classifier. Its performance on training data was worse with feature selection than with all ten features included.

On unseen data, the NN with feature selection showed great improvement in performance. A reasonable explanation lies in the basis of NN algorithms. The feature selection procedure used logistic regression as the induction algorithm. A logistic regression model is identical to a NN with no hidden units if the logistic (sigmoidal) activation function is used (Bishop, 1995; Hastie T., 2001). In a NN with hidden units, each hidden unit computes a logistic regression (different for each hidden unit) and the output is therefore a weighted sum of logistic regression outputs. The weights (of the NN) or the coefficients (of logistic regression) are determined from the dataset by maximum likelihood estimation (Dreiseitl Stephan, 2003). However, the decision boundary for a NN can be non-linear, making the NN more flexible than logistic regression (Dreiseitl Stephan, 2003). The better results of the NN with FS, which used logistic regression as the induction algorithm, could be attributed to this similarity in mathematical principle.
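The equivalence described above can be illustrated numerically. The sketch below is a minimal illustration with arbitrary example weights, not the networks trained in this work: a single sigmoid output unit connected directly to the inputs computes the same function as a logistic regression model, and a network with one hidden layer forms a weighted sum of such units. All function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic (sigmoidal) activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_regression(x, coefficients, intercept):
    """P(class = 1 | x) under a logistic regression model."""
    return sigmoid(intercept + sum(c * xi for c, xi in zip(coefficients, x)))


def nn_no_hidden(x, weights, bias):
    """A network with no hidden units: one sigmoid output unit."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def nn_one_hidden_layer(x, hidden_units, output_weights, output_bias=0.0):
    """Weighted sum of sigmoid hidden units, each a (weights, bias) pair."""
    hidden = [nn_no_hidden(x, w, b) for w, b in hidden_units]
    return output_bias + sum(v * h for v, h in zip(output_weights, hidden))


# Arbitrary example values for a three-feature input.
x = [0.2, -1.0, 3.0]
w = [0.5, 0.25, -0.1]
b = 0.1
same_output = logistic_regression(x, w, b) == nn_no_hidden(x, w, b)
```

With identical weights and bias the two models return the same value for any input. The added flexibility of the NN comes only from the hidden layer, which combines several such logistic units.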
Figure 36 shows an example of an output image obtained with the SVM using all ten features.

Figure 36 (Top) Raw Image With Suspicious ROI Outlined, (Bottom Left) Segmented Image, (Bottom Right) Image Showing MC Clusters With Reduced FPs
CHAPTER 6 CONCLUSION AND FUTURE WORK

6.1 Concluding Remarks

In this work, we presented the use of SVM and NN algorithms for the detection of MCs in mammograms. The classifiers were trained with these techniques to test, at every location in the segmented mammogram, whether the detected signal was an MC or an FP. Ten features were originally used to represent the two classes.

Experimental results were obtained using a database of 22 images. A detailed statistical analysis of the dataset was performed prior to classification. It was observed that, based on statistics alone, it was difficult to characterize these classes. However, the SVM and NN algorithms, which are considered capable of modeling highly non-linear data, do show interesting results.

The classifiers were trained using different training methods: single test, cross validation and alternate class training. LOO CV was used with the SVM to perform parameter selection. With the parameter search, the accuracy of the SVM on training data improved from about 87% to 93%. However, performance on the testing set did not improve significantly. The single test SVM showed good results overall (training and testing). CV improved the performance of the NN on unseen images.

With the alternate class training method, the classifiers showed high sensitivities of about 95% and 65% (average for NN and SVM) on training and testing data respectively. Though the sensitivity was high, the average number of FPs per image was also high. Also, this method chose random cases of the FP class, which may not be ideally representative samples. Overall, in all the experiments that used all ten features, the SVM outperformed the NN. Though the algorithms gave comparable results on the training set, the SVM performed better on unseen data. The SVM with CV parameter selection showed the best performance, with far fewer FPs per image.

Feature selection was performed using the Stepwise Forward Selection method with logistic regression as the induction algorithm. The most significant features were selected and given to the classifiers. For the SVM, though the models with feature selection showed lower accuracy on the training data than the models that used all features, the testing sensitivities were comparable. Thus, the models with feature selection achieved the same generalization performance as those without feature selection. This helps us remove irrelevant and redundant features and achieve comparable testing performance with fewer features.

In particular, the sensitivity of the NN model on unseen data with interaction effects added was extremely high (around 78% for an AFP/image value of 40) compared to 33% for the NN model without these interaction terms. The NN with main effect model terms showed a sensitivity of around 42%. The improvement of the NN in accuracy and sensitivity on unseen data can be linked to its mathematical similarity with logistic regression, which was used as the inductor during the feature selection process. New variables that incorporate the interactions between significant features could be added to the analysis to improve the discriminatory as well as the generalization power of the classifiers.

In summary, the SVM outperformed the NN in almost all cases. The generalization capability of the SVM was clearly noteworthy. The training time taken by the SVM was also several orders of magnitude shorter. Thus we can say that, for this complex dataset, the SVM is better suited to our analysis.

6.2 Future Work

The most crucial part of future work would be to cluster the output to enhance clinical utility. This work only involves detection of MC spots and reduction of FP signals. However, these spots are considered suspicious when seen in clusters of four, five or six (Faculty of Medicine, 1999). Clusters have to be identified individually, as shown in Figure 36, for each image and threshold.

Feature selection improved the generalization performance of the NN. However, this was not the case with the SVM. Wrappers that use the SVM as the induction algorithm could be used to select features. However, all methods based on the wrapper approach are tuned for, and by, a given learning machine. The filter approach to feature selection could be a better alternative here, since it would provide a generic selection of variables not specific to any learning algorithm. A study comparing all these feature selection approaches would be worthwhile.

Also, a direct medical understanding of the features' effect on the class would make analysis of the results easier. Feature selection could then be performed based on domain knowledge alone.

Future work could include using a bigger database with more representative cases. A larger number of images and training samples would help establish our results and observations. Comparison with other techniques, such as decision trees and statistical classifiers, could also be performed.
REFERENCES

Histogram. from http://www.sytsma.com/tqmtools/hist.html
Interactive mammography analysis web tutorial (1999). from http://sprojects.mmi.mcgill.ca/mammography/anat.htm
4woman.gov. (March 2002). Mammograms. from http://www.4woman.gov/faq/mammography.htm
Anand. (1999). The backpropagation algorithm. from http://www.speech.sri.com/people/anand/771/html/node37.html
Anttinen I, P. M., Soiva M, Roiha M. (1993). Double reading of mammography screening films: One radiologist or two? Clin. Radiol., 48, 414-421.
Arce GR, F. R. (1989). Detail-preserving ranked-order based filters for image processing. IEEE Trans Acoust., Speech Signal processing, 37(1), 83-98.
Association, B. B. (Dec. 2002). Computer-aided detection (CAD) in mammography. Assessment Program, from http://bcbs.com/tec/vol17/17_17.html
Bamberger RH, S. M. (1992). A filter bank for the directional decomposition of images: Theory and design. IEEE Trans Signal Processing, 40, 882-893.
Bauer PH, Q. W. (1991). A 3D non-linear recursive digital filter for video image processing. IEEE Pacific Rim Conf. on Commun., Comput., and Signal Processing, 2, 494-497.
Bishop, C. (1995). Neural networks for pattern recognition: Oxford: Oxford University Press.
Brien, D. O. Image entropy. from http://www.astro.cornell.edu/research/projects/compression/entropy.html
Burbidge R, T. M., Buxton B, Holden S. (2001). Drug design by machine learning: Support vector machines for pharmaceutical data analysis. Comput. Chem., 26, 514.
Castleman, K. (1979). Digital image processing: Prentice Hall, Englewood Cliffs, Reading, MA.
Center, S. C. (2004). Breast cancer statistics. from http://cancer.stanfordhospital.com/healthInfo/cancerTypes/breast/
Chan HC, D. K., Vyborny CJ, et al. (1990). Improvement in radiologists' detection of clustered microcalcifications on mammograms: The potential of computer-aided diagnosis. Invest. Radiol., 25, 1102-1110.
Chan HP, D. K., Galhotra S, Vyborny CJ, MacMahon H, Jokich PM. (1987). Image feature analysis and computer aided analysis in digital radiography: Automated detection of microcalcifications in mammography. Med Phys, 14, 538-548.
Chapelle O, H. P., Vapnik VN. (1999). Support vector machines for histogram-based image classification. IEEE Trans Neural Networks, 10, 1055-1064.
Cheng HD, L. Y., Freimanis RI. (1998). A novel approach to microcalcification detection using fuzzy logic technique. IEEE Trans Medical Imaging, 17(3), 442-450.
Chih-Chung Chang, C.-J. L. (2001). LIBSVM: A library for support vector machines.
Chun-Nan Hsu, H.-J. H., Dietrich S. (Apr 2002). The ANNIGMA-wrapper approach to fast feature selection for neural nets. IEEE Trans Systems, Man and Cybernetics, 32(2), 207-212.
Cox, D.R. and Snell, E.J. (1989). The Analysis of Binary Data, Second Edition, London: Chapman and Hall.
Cristianini N, S.-T. J. (2000). An Introduction to Support Vector Machines: Cambridge University Press.
Daphne Koller, M. S. (1996). Toward optimal feature selection. Paper presented at the International Conference on Machine Learning.
Dash M, L. H. (1997). Feature selection for classification. Intelligent Data Analysis: An International Journal, 1(3).
Davies DH, D. D. (1992). The automatic computer detection of subtle calcifications in radiographically dense breasts. Phys. Med. Biol., 37, 1385-1390.
Ding CH, D. I. (2001). Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349-358.
Dreiseitl Stephan, O.-M. L. (Feb 2003). Logistic regression and artificial neural network classification models: A methodology review. Journal of Biomedical Informatics, 35(5-6), 352-359.
Edwards DC, K. M., Nagel R, Nishikawa RM and Papaioannou J. (2000). Using a bayesian neural network to optimally eliminate false-positive microcalcification detections in a CAD scheme. Proceedings of International Workshop on Digital Mammography, at press.
Faculty of Medicine, M. (1999). Tutorial 2: Calcifications. from http://sprojects.mmi.mcgill.ca/mammography/calcifications.htm
Gader PD, K. J., Krishnapuram R, Chiang JH, Mohamed MA. (1997). Neural and fuzzy methods in handwriting recognition. Computer, 30, 79-86.
Gavrielides, M. (1996). Shape analysis of mammographic calcification clusters. Unpublished Masters thesis, University of South Florida, Tampa.
Glatt A, L. H., Arnow T, Shelton D, Ravdin P. (1992). An application of weighted majority minimum range filters in the detection and sizing of tumors in mammograms. Paper presented at the SPIE Medical imaging VI: Image processing.
Graps, A. (2004). An introduction to wavelets. from http://www.amara.com/IEEEwave/IEEEwavelet.html
Hagan MT, D. H., Beale MH. (1996). Neural network design. Boston, Mass: PWS.
Harris JR, H. S., Hendersen IC, Kinne DW. (1991). Breast diseases: JB Lippincott Company, Philadelphia, PA.
Hastie T., T. R., Friedman J. (2001). The elements of statistical learning: Data mining, inference and prediction: New York: Springer; 2001.
Hendee WR, B. C., Hendrick E. (1999). Proposition: All mammograms should be double-read. Med Phys, 26, 115-118.
Hiep Van Khuu, H.-K. L., Jeng-Liang Tsai. (2003). Machine learning with neural networks and support vector machines (Unpublished).
Hosmer, D.W, Jr. and Lemeshow, S. (1989). Applied Logistic Regression, New York: John Wiley & Sons, Inc.
Huai Li, R. L. K., ShihChung B. Lo. (1997). Fractal modeling and segmentation for the enhancement of microcalcifications in digital mammograms. IEEE Trans Medical Imaging, 16(6), 785-798.
Imaginis. (Sept. 2004). Breast cancer diagnosis.
Imaginis. (Sept. 2004). Breast cancer screening/ prevention.
Imaginis. (Sept. 2004). General information on breast cancer. from http://imaginis.com/breasthealth/statistics.asp
Imaginis. (Sept. 2004). Tutorial 2: Calcifications.
Imaginis. (Sept. 2004). Tutorial: Masses.
Isabelle Guyon, A. E. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.
Issam El-Naqa, Y. Y., Miles Wernick, Nikolas Galatsanos, Robert Nishikawa. (Dec 2002). A support vector machine approach for detection of microcalcifications. IEEE Trans Medical Imaging, 21(12), 1552-1563.
Joachims, T. (1999). Making large-scale SVM learning practical. In C. B. B. Scholkopf, A. Smola (Ed.), Advances in kernel methods: support vector learning: MIT Press.
John G, K. R., Pfleger K. (1994). Irrelevant features and the subset selection problem. Paper presented at the Machine learning: Proceedings of the Eleventh International Conference, San Francisco, CA.
Karssemeijer, N. (1993). Recognition of clustered microcalcifications using a random field model, biomedical image processing and biomedical visualization. Paper presented at the SPIE Proc., San Jose, CA.
Keerthi, S., Shevade, S., Bhattacharyya, C., & Murthy, K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design. Paper presented at the Proceedings of the Tenth European Conference on Machine Learning.
Kerlikowske K, G. D., Rubin SM, et al. (1995). Efficacy of screening mammography: A meta-analysis. JAMA, 273, 149-154.
Ko SJ, L. Y. (Sept. 1991). Center weighted median filters and their applications to image enhancement. IEEE Trans Circ. Syst., 38, 984-993.
Kohavi, P. (1988). Glossary of terms. Machine Learning, 30(2-3), 271-274.
Kohavi R, J. G. (Dec 1997). Wrappers for feature selection. Artificial Intelligence, 97(12), 273-324.
Kohavi, S. (1995). Feature selection using the wrapper model: Overfitting and dynamic search space topology. Paper presented at the First International Conference on Knowledge Discovery and Data Mining.
Kong, H. (1998). Self-organizing tree map and its application in digital image processing. Unpublished Ph.D. thesis, Univ. of Sydney, Sydney, Australia.
Kregelmeyer WP, P. J., Bourland PD, Hillis A, Riggs MW, Nipper ML. (1994). Computer-aided mammographic screening for spiculated lesions. Radiology, 191, 331-337.
Kubat M, H. R., Matwin S. (1998). Detection of oil spills in satellite radar images of sea surface. Machine Learning, 30, 195-215.
Lai SM, L. X., Bischof WF. (1989). On techniques for detecting circumscribed masses in mammograms. IEEE Trans Medical Imaging, 8(4), 377-386.
Lanyi, M. (1988). Breast calcifications: Springer-Verlag, Berlin, Heidelberg.
Lecun Y, J. L., Bottou L, Brunot A, Cortes C, Denker JS, Drucker H, Guyoin I, Muller A, Sackinger E, Simard P and Vapnik V. (1995). Comparison of learning algorithms for handwritten digit recognition. Paper presented at the ICANN '95.
Lei Yu, H. L. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. Paper presented at the Proceedings of the Twentieth International Conference on Machine Learning (ICML2003), Washington DC.
Liang H, L. Z. (2001). Detection of delayed gastric emptying from electrogastrograms with support vector machine. IEEE Trans Biomed Eng, 48, 601-604.
M. Pontil, A. V. (1998). Support vector machines for 3d object recognition. IEEE Trans Pattern Analysis Machine Intelligence, 20, 637-646.
Mckenna RJ, S. (1994). The abnormal mammogram radiographic findings, diagnostic options, pathology, and stage of cancer diagnosis. Cancer (Suppl 1), 74, 244-255.
MedCalc. (2004). Logistic regression. from http://www.medcalc.be/manual/logistic_regression.php
N. Karssemeijer. (July 1991). A stochastic model for automated detection of calcifications in digital mammograms. Paper presented at the 12th Int. Conf. Information Processing Medical Imaging, Wye, U.K.
Nagel Rufus H, N. R. M., Papaioannou John, Doi Kunio. (Aug 1998). Analysis of methods for reducing false positives in the automated detection of clustered microcalcifications in mammograms. Med Phys, 25(8), 1502-1506.
Nishikawa RM, D. K., Geiger ML, et al. (1995). Computerized detection of clustered microcalcifications: Evaluation of performance on mammograms from multiple centers. RadioGraphics, 15, 445-452.
Osuna E, F. R., Girosi F. (1997). Training support vector machines: An application to face detection. Paper presented at the Computer Vision and Pattern Recognition.
Park, D. (2000). Centroid neural network for unsupervised competitive learning. IEEE Trans. Neural Networks, 11, 520-528.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In C. B. B. Scholkopf, A. J. Smola (Ed.), Advances in kernel methods: support vector learning (pp. 185-208): MIT Press.
Pontil M., V. A. (1998). Object recognition with support vector machines. IEEE Trans. On Pattern Analysis and Machine Intelligence, 20, 637-646.
Popli, M. (2001). Pictorial essay: Mammographic features of breast cancer, Ind J Radiol Imag (Vol. 11, pp. 175-179).
Qian W, C. L. (July 23-28, 1995). Hybrid m-channel wavelet transform method and application to medical image processing. Paper presented at the 37th annual meeting of the American Association of Physicists in Medicine, Boston, Mass.
Qian W, C. L., Kallergi M, Clark RA. (1994). Tree-structured nonlinear filters in digital mammography. IEEE Trans Medical Imaging, 13, 25-36.
Qian W, L. L., Clarke LP. (1999). Feature extraction for mass detection using digital mammography: Influence of wavelet analysis. Med Phys, 26, 402-408.
Qian W, S. D., Sun Xuejun, Clark Robert A. (June 2001). Multistage statistical order using neural network for false positive reduction in full field digital mammography. Paper presented at the 2001 IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, Baltimore, Maryland, USA.
Quinlan, J. (1993). C4.5: Programs for machine learning: Morgan Kaufmann.
Rajkumar S, H. L. (Nov. 1999). Screening mammography in women aged 40-49 years. Medicine, 78(6), 415.
Rufus H. Nagel, R. M. N., John Papaioannou, Kunio Doi. (1995). Analysis of methods for reducing false positives in the automated detection of clustered microcalcifications in mammograms. Med Phys, 25, 1502-1506.
Shen L, R. R., Leo Desautels JE. (1994). Application of shape analysis to mammographic calcifications. IEEE Trans Medical Imaging, 13, 263-274.
Society, A. C. (2004). Cancer prevention & early detection, facts & figures. from http://www.cancer.org/downloads/STT/CPED2004PWSecured.pdf
Songyang Yu, L. G. (2000). A CAD system for the automatic detection of clustered microcalcifications in digitized mammogram films. IEEE Trans Medical Imaging, 19(2), 115-126.
Statsoft Inc. (1984-2003). Neural networks. from http://www.statsoft.com/textbook/multilayerb
Stetson PF, S. F., Macovski A. (Aug 1997). Lesion contrast enhancement in medical ultrasound imaging. IEEE Trans Medical Imaging, 16, 416-425.
Struble, C. A. Dimension reduction and feature selection (Presentation): Dept. of Mathematics, Statistics and Computer Science, Marquette University.
Systems, G. M. (2003). Digital mammography. from http://www.hersource.com/breast/02/a-mammo.cfm
Takehiro Ema, K. D., Robert M. Nishikawa, Yulei Jiang, John Papaioannou. (February 1995). Image feature analysis and computer-aided diagnosis in mammography: Reduction of false-positive clustered microcalcifications using local edge-gradient analysis. Med Phys, 22(2), 161-169.
te Brake GM, K. N., Hendriks JH. (1998). Automated detection of breast carcinomas not detected in a screening program. Radiology, 207, 465-471.
Ted C. Wang, N. B. K. (Aug 1998). Detection of microcalcifications in digital mammograms using wavelets. IEEE Trans Medical Imaging, 17(4).
Tembey, M. (2003). Computer Aided Diagnosis for mammographic microcalcification clusters. Unpublished Masters thesis, University of South Florida, Tampa.
Thurfjell E.L., L. K. A., Taube A.A.S. (1994). Benefit of independent double reading in a population-based mammography screening program. Radiology, 191, 241-244.
Vapnik, V. (1995). The nature of statistical learning theory: Springer Verlag.
Vapnik, V. (1998). Statistical learning theory: John Wiley.
Vyborny, C. J. (1994). Can computers help radiologists read mammograms? Radiology, 191, 315-317.
W. Morrow, R. P., R. Rangayyan, J. Desautels. (June 1992). Region based contrast enhancement of mammograms. IEEE Trans Medical Imaging, 11, 392-406.
Warren Burhenne LJ, W. S., D'Orsi CJ, et al. (2000). The potential contribution of computer-aided detection to the sensitivity of screening mammography. Radiology, 215, 554-562.
Woods KS, D. C., Bowyer KW, Solka JL, Priebe CE and Kegelmeyer WP Jr. (1993). Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography. Int. J. Pattern Recog. Artificial Intell., 7, 1417-1436.
Wouter J, V. H., Karssemeijer Nico. (Nov. 2000). Automated classification of clustered microcalcifications into malignant and benign types. Med Phys, 27(11), 2600-2608.
Yoshida H, D. K., Nishikawa RM. (1994). Automatic detection of clustered microcalcifications in digital mammograms using wavelet transform techniques. Paper presented at the SPIE.
Zhang W, D. K., Giger ML, Nishikawa RM, Schmidt RA. (1996). An improved shift-invariant artificial neural network for computerized detection of clustered microcalcifications in digital mammograms. Med. Phys., 23, 595-601.
Zheng B, Q. W., Clarke LP. (June 1994). Artificial neural network for pattern recognition in mammography. Paper presented at the Proc. World Congress Neural Networks, San Diego, CA.