Three essays on analytical models to improve early detection of cancer

Material Information

Title:
Three essays on analytical models to improve early detection of cancer
Physical Description:
Book
Language:
English
Creator:
Gopalappa, Chaitra
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:

Subjects

Subjects / Keywords:
Colorectal cancer; disease progression; polyp progression; applied probability; bioinformatics; protein identification; microarray denoising
Dissertations, Academic -- Industrial & Management Systems Engineering -- Doctoral -- USF   ( lcsh )
Genre:
non-fiction   ( marcgt )

Notes

Abstract:
Development of approaches for early detection of cancer requires a comprehensive understanding of the cellular functions that lead to cancer, as well as implementing strategies for population-wide early detection. Cell functions are supported by proteins that are produced by active or expressed genes. Identifying cancer biomarkers, i.e., the genes that are expressed and the corresponding proteins present only in a cancer state of the cell, can lead to their use for early detection of cancer and for developing drugs. There are approximately 30,000 genes in the human genome producing over 500,000 proteins, thereby posing significant analytical challenges in linking specific genes to proteins and subsequently to cancer. Along with developing diagnostic strategies, effective population-wide implementation of these strategies is dependent on the behavior and interaction between entities that comprise the cancer care system, like patients, physicians, and insurance policies. Hence, obtaining effective early cancer detection requires developing models for a systemic study of cancer care. In this research, we develop models to address some of the analytical challenges in three distinct areas of early cancer detection, namely proteomics, genomics, and disease progression. The specific research topics (and models) are: 1) identification and quantification of proteins for obtaining biomarkers for early cancer detection (mixed-integer nonlinear programming (MINLP) and wavelet-based model), 2) denoising of gene values for use in identification of biomarkers (wavelet-based multiresolution denoising algorithm), and 3) estimation of disease progression time of colorectal cancer for developing early cancer intervention strategies (computational probability model and an agent-based simulation).
Thesis:
Dissertation (PHD)--University of South Florida, 2010.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by Chaitra Gopalappa.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains X pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E14-SFE0004563
usfldc handle - e14.4563
System ID:
SFS0027878:00001


Full Text

Three Essays on Analytical Models to Improve Early Detection of Cancer

by

Chaitra Gopalappa

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Industrial and Management Systems Engineering
College of Engineering
University of South Florida

Major Professor: Tapas K. Das, Ph.D.
Alex Savachkin, Ph.D.
Susana Lai-Yuen, Ph.D.
Rebecca Sutphen, M.D.
Selen Cremaschi, Ph.D.

Date of Approval: May 4, 2010

Keywords: disease progression and intervention, bioinformatics, healthcare systems, applied stochastic, computational probability, applied optimization, agent-based simulation, wavelet based signal processing

© Copyright 2010, Chaitra Gopalappa

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
PREFACE
CHAPTER 1 INTRODUCTION
  1.1 Evolution of Cancer Research
  1.2 Brief Description of Cell and Cause of Cancer
    1.2.1 Chromosomes, DNA, Genes, and Proteins
    1.2.2 Cancer Biomarkers
  1.3 Need for Analytical Models in Achieving Early Cancer Detection
    1.3.1 Research Topics Addressed in this Dissertation
CHAPTER 2 MODEL FOR COLORECTAL POLYP PROGRESSION
  2.1 Introduction
  2.2 Probability Model to Estimate Progression Rates
    2.2.1 Polyp Incidence Rates
    2.2.2 Probability Model to Estimate Progression Rates
      2.2.2.1 Estimating Pre-Cancer Progression Rate
      2.2.2.2 Estimating Post-Cancer Progression Rates
  2.3 Results: Estimated Incidence and Progression Rates
    2.3.1 Comparison of Results
  2.4 Validation of Progression Rates Estimated from Probability Model
    2.4.1 Simulation of the Indiana Population
      2.4.1.1 Screening Parameters
      2.4.1.2 Assigning Family History Status
      2.4.1.3 Results from the Simulation Model
    2.4.2 Simulation of Minnesota Study
  2.5 Concluding Remarks
    2.5.1 Discussion of Results
    2.5.2 Accuracy and Estimation Errors
  2.6 Future Research
CHAPTER 3 A DENOISING METHODOLOGY FOR MICROARRAY
  3.1 Introduction
  3.2 A Novel Denoising Methodology for Microarrays
    3.2.1 Overview of Wavelet Based Multiresolution Analysis
      3.2.1.1 Dual-Tree Complex Wavelet Transform
      3.2.1.2 Bivariate Shrinkage Thresholding
    3.2.2 Microarray Denoising Methodology
      3.2.2.1 Microarray Chip
      3.2.2.2 cRNA Extraction
      3.2.2.3 Hybridization of cRNAs to DNA Strands
      3.2.2.4 Strategy 1: A Separate Subimage for each Gene
      3.2.2.5 Strategy 2: Noise Characterization and Transformation
    3.2.3 Steps of the Denoising Procedure
  3.3 Numerical Validation using Simulated and Affymetrix Microarray Data
    3.3.1 Testing Denoising Performance on a Simulated Image
    3.3.2 Testing Denoising Performance on Affymetrix Microarray Data
      3.3.2.1 Analyzing CV of Probe Squares Across Multiple Scans of a Microarray
      3.3.2.2 Analyzing CV within Probe Squares of a Microarray
  3.4 Conclusions
CHAPTER 4 ANALYTICAL PROCESSING OF CHROMATOGRAMS
  4.1 Introduction
  4.2 Description of PF2D Chromatograms and Quantitative Challenges
  4.3 Methodology for Quantitative Processing of PF2D Chromatograms
    4.3.1 Continuous Wavelet Transforms for Baseline Correction, and Peak Identification and Quantification
    4.3.2 Optimization Model for Peak Alignment
      4.3.2.1 Transformation from Nonlinear to Linear
      4.3.2.2 Heuristics to Minimize Execution Time of Optimization Algorithm
  4.4 Results and Discussion
    4.4.1 Peak Detection
      4.4.1.1 Comparison of Peak Identification on PF2D Chromatograms
      4.4.1.2 Comparison of Peak Identification on MS Spectrum
    4.4.2 Peak Alignment
REFERENCES
ABOUT THE AUTHOR

LIST OF TABLES

Table 2.1 Notation used in the probability model
Table 2.2 Estimates of the Poisson parameter for each race $r$
Table 2.3 Percentage of population with family history of CRC given race, $P\{F > 0 \mid R = r\} \times 100$
Table 2.4 $P_{rf}\{p_5,\ a \le A \le b\}$: Percentage incidence of polyp ≤ 5 mm at age $[a, b]$, given $R = r$ and $F = f$, for polyp pathways
Table 2.5 $P_{rf}\{S,\ a \le A \le b\}$: Percentage incidence of in-situ CRC at age $[a, b]$, given $R = r$ and $F = f$, for non-polyp pathway
Table 2.6 Mean times to progress from event $i$ to event $j$, given $R = r$ and $F = f$ (the reciprocal of the progression rate from $i$ to $j$), on polyp pathways
Table 2.7 Mean times to progress from event $i$ to event $j$, given $R = r$ and $F = f$ (the reciprocal of the progression rate from $i$ to $j$), on non-polyp pathway
Table 2.8 One-step transition probabilities between stages: Comparing results presented in this research to literature presented in [1]
Table 2.9 Percentage of population compliant to screening [2]
Table 2.10 Screening sensitivity [3]
Table 2.11 Screening specificity [3]
Table 2.12 Simulated vs. actual Indiana CRC counts per 100,000 of population
Table 2.13 Simulated vs. actual Indiana values for stage at time of diagnosis as percentage of total CRC counts
Table 2.14 Estimated confidence interval on mean time to progress from polyp ≤ 5 mm to in-situ CRC in years according to family history
Table 2.15 Estimated proportion of $p_5$'s progressing to $S$
Table 2.16 CRC counts per 1000 population and stage at diagnosis - For annually screened group of Minnesota study
Table 2.17 CRC counts per 1000 population and stage at diagnosis - For biennially screened group of Minnesota study
Table 2.18 Stage at diagnosis as percentage of total CRC counts - For annual group of Minnesota study
Table 2.19 Stage at diagnosis as percentage of total CRC counts - For biennial group of Minnesota study
Table 3.1 PSNR value estimated by taking error, epsilon, as difference between individual values
Table 3.2 PSNR value estimated by taking error, epsilon, as difference between 75th percentile of probe square
Table 4.1 Sensitivity of detection of known peaks in the Aurum data

LIST OF FIGURES

Figure 1.1 Five-year relative survival in % for different primary sites of cancer
Figure 2.1 CRC pathways: Polyp incidence and stages of progression
Figure 2.2 Event of incidence of polyp ≤ 5 mm at time $t_1$, $p_5(t_1)$, and its progression to an event of incidence of in-situ CRC at time $t_2$, $S(t_2)$, with the corresponding ages at $t_1$ and $t_2$
Figure 2.3 Event of incidence of polyp ≤ 5 mm at time $t_1$, $p_5(t_1)$, and its progression to an event of prevalence of CRC at time $t_2$, $\tilde{C}(t_2)$, i.e., either $\tilde{S}(t_2)$, $\tilde{L}(t_2)$, $\tilde{R}(t_2)$, or $\tilde{D}(t_2)$, with the corresponding ages $A(t_1)$ and $A(t_2)$
Figure 2.4 Event of incidence of in-situ CRC at time $t_2$, $S(t_2)$, and its progression to an event of prevalence of invasive CRC at time $t_3$, $\widetilde{IC}(t_3)$, i.e., either $\tilde{L}(t_3)$, $\tilde{R}(t_3)$, or $\tilde{D}(t_3)$, with the corresponding ages $A(t_2)$ and $A(t_3)$
Figure 2.5 Event of incidence of polyp ≤ 5 mm at $t_1$, $p_5(t_1)$, and its progression to an event of incidence of in-situ CRC at $t_2$, $S(t_2)$, with age $a \le A(t_1) \le b$ and $c \le A(t_2) \le d$, for age intervals $[a, b]$ and $[c, d]$
Figure 2.6 One-step transition probabilities for polyp pathway 1 ($R$ = Caucasian, $F = 0$)
Figure 2.7 Flowchart of simulation Event 2: Incidence and progression of polyps
Figure 2.8 Flowchart of simulation Event 3: Screening
Figure 2.9 Comparing simulated versus actual CRC cases per 1000 population for annual group of the Minnesota study
Figure 2.10 Comparing simulated versus actual CRC cases per 1000 population for biennial group of the Minnesota study
Figure 3.1 Randomly varying intensities of adjacent probe squares depicts presence of large number of edges in a microarray image
Figure 3.2 Construction of a dyadic subimage, transformation of Poisson to normal, denoising and replacing denoised data to reconstruct microarray image
Figure 3.3 Distribution of probe squares showing impact of denoising
Figure 3.4 Proportion of probe squares in various ranges of R s CV for Scan 1 data
Figure 4.1 Sample PF2D signal
Figure 4.2 Peak identification and intensity quantification
Figure 4.3 Horizontal shift of peaks
Figure 4.4 Mexican hat wavelet
Figure 4.5 A wavelet at a fixed translation and at three different scales
Figure 4.6 A wavelet at a fixed scale and three different translations
Figure 4.7 Positive and negative contributions of convolution of the wavelet with a signal
Figure 4.8 Sample plot of wavelet coefficients
Figure 4.9 Location of peaks
Figure 4.10 Estimating area of overlapping peaks under Scenario 1
Figure 4.11 Estimating area of overlapping peaks under Scenario 2
Figure 4.12 Peak identification on section of chromatogram with several peaks
Figure 4.13 Peak identification on section of chromatogram with high noise
Figure 4.14 Sample results of peak identification: Increasing peaks selected by MassSpecWavelet to 172
Figure 4.15 Sample results of peak identification: Increasing peaks selected by MassSpecWavelet to 202
Figure 4.16 Peak alignment on PF2D data was visually validated
Figure 4.17 A portion of PF2D chromatograms illustrating peak alignments

THREE ESSAYS ON ANALYTICAL MODELS TO IMPROVE EARLY DETECTION OF CANCER

Chaitra Gopalappa

ABSTRACT

Development of approaches for early detection of cancer requires a comprehensive understanding of the cellular functions that lead to cancer, as well as implementing strategies for population-wide early detection. Cell functions are supported by proteins that are produced by active or expressed genes. Identifying cancer biomarkers, i.e., the genes that are expressed and the corresponding proteins present only in a cancer state of the cell, can lead to their use for early detection of cancer and for developing drugs. There are approximately 30,000 genes in the human genome producing over 500,000 proteins, thereby posing significant analytical challenges in linking specific genes to proteins and subsequently to cancer. Along with developing diagnostic strategies, effective population-wide implementation of these strategies is dependent on the behavior and interaction between entities that comprise the cancer care system, like patients, physicians, and insurance policies. Hence, obtaining effective early cancer detection requires developing models for a systemic study of cancer care.

In this research, we develop models to address some of the analytical challenges in three distinct areas of early cancer detection, namely proteomics, genomics, and disease progression. The specific research topics (and models) are: 1) identification and quantification of proteins for obtaining biomarkers for early cancer detection (mixed-integer nonlinear programming (MINLP) and wavelet-based model), 2) denoising of gene values for use in identification of biomarkers (wavelet-based multiresolution denoising algorithm), and 3) estimation of disease progression time of colorectal cancer for developing early cancer intervention strategies (computational probability model and an agent-based simulation).

PREFACE

My time at USF has been an exciting adventure and I am in deep gratitude to all who made it a memorable experience. My research as an engineer in such an interdisciplinary area would not have been successful had it not been for the collaborative effort and guidance of several dedicated researchers who have been my mentors over these years.

I would like to immensely thank my advisor Dr. Tapas Das for encouraging and guiding me to work in an area which I was always intrigued about, cancer research, but had never thought I could be involved in as an engineer. His dedication to research and hard work has been truly inspirational. This roller coaster ride would not have been exciting or successful had it not been for the committed support and guidance of so many people, who were always willing to share their knowledge and provide generous advice and guidance: Drs. Rebecca Sutphen, Eric Thomas, John Koomen, Steven Enkemann, Sean Yoder, and Steven Eschrich, and Ms. Tricia Holtje from Moffitt Cancer Center and Research Institute, to whom I am truly thankful for their sincere guidance in this area in which I had no prior knowledge; Drs. Selen Cremaschi from University of Tulsa and Seza Orcun from Purdue University for their committed support, and also for making me feel at home during my three month visit to Purdue to work on my research; Drs. Susana Lai-Yuen and Alex Savachkin whose excellent feedback at various phases of my dissertation helped shape my future tasks; Dr. Jose Zayas-Castro, Chair of Industrial Engineering, whose mentorship and willingness to help is truly inspiring; Drs. Brad Doebbeling and David Haggstrom from VA, Indiana University, for their guidance; Dr. Karen Liller from Graduate School, USF, for her support and encouragement towards interdisciplinary research while working on the Graduate Student Challenge Grant; my teammates on the Challenge grant Barbara Davila and Jael Rodriguez from Medical School and Dayna Martinez from Engineering, USF, without whose collaborative effort the proposal would not have been successful; Dr. Pekny from Purdue University and Lori Losee from Regenstrief foundation for their support; Jackie Stephens and Gloria Latter from Industrial Engineering for their warm support; and several people at USF whom I have met only briefly but with whom each discussion has helped me build my research knowledge. My time at USF would not have been so pleasantly memorable if not for my friends at USF with whom I have had several research discussions, student chapter activities, and social times: Diana Prieto, Wilkistar Otieno, Patricio Rocha, Qingwei Li, and Andres Uribe.

I had never thought I would venture this far into education, and I completely owe it to my mom, sister, and grandparents, for their love, support, sacrifices, and confidence in me. Amma, Thaata, Mummy, and Reshma, thank you for standing by me and guiding me, at the same time, helping me build my own character and personality. I am extremely proud to have you as my family. I cannot express enough my gratitude for the dedicated love, support, motivation, honest expression of difference of opinion, and sincere critiques of my best friend, study companion, hiking partner, and soulmate Vishnu Nanduri, that has kept me motivated and encouraged as I begin my research career.

CHAPTER 1
INTRODUCTION

Cancer is a result of abnormal behavior of cells with characteristics including uncontrolled growth of cells, and over time, invasion of adjacent tissues and sometimes metastasis to different locations of the body via the lymph or blood. Due to this timeline of cancer, early detection and intervention is crucial. Achieving early detection involves developing diagnostic tools, which requires a comprehensive understanding of the cellular functions that lead to cancer (cellular level), and also strategies for effective population-wide implementation of the developed tools (strategic level). Before looking into the cause of cancer and the need for analytical models to improve early detection, it is interesting to look into the evolution of our knowledge of cancer over the centuries.

1.1 Evolution of Cancer Research

The American Cancer Society in its recent article [4] has compiled several interesting facts of the history of cancer over the past centuries, like the earliest scientific cancer research, evolution of theories for the cause of cancer, and availability of treatment and survival. It has been noted that the earliest known description of cancer (the word cancer was not used and was coined only around 300 B.C. by Hippocrates) was in 1600 B.C. in an Egyptian textbook on trauma surgery, where it was described as a disease with no treatment. Since then, there has been a tremendous amount of research and treatment advances, and as of now, the article reports that more than 2 out of 3 people diagnosed with cancer survive at least 5 years. However, early detection is crucial for improved chances of survival, as can be seen by the survival rates per stage of diagnosis in Figure 1.1, which is a summary of the 1996-2006 cancer statistics reported by the National Cancer Institute's SEER (Surveillance, Epidemiology and End Results) Program [5]. The figure plots the 5-year relative survival for different primary sites of cancer, and as can be seen for all these sites, survival when diagnosed at late stage of cancer (distant) is much lower compared to early stage (local). Therefore, developing tools for early detection of cancer is essential.

Figure 1.1. Five-year relative survival in % for different primary sites of cancer

As in any disease, developing diagnosis tools for early detection and further developing effective treatments require an in-depth understanding of the cause of cancer. It is interesting to present here some of the theories for the cause of cancer with the evolution of knowledge over the centuries [4]. In 300 B.C. we had the humoral theory, where it was believed that cancer was caused by excess black bile, one of the four fluids or humors (blood, phlegm, yellow bile, and black bile) that were required to be in balance for a person to be healthy. In the late 1700's we had the lymph theory, where it was believed that cancer was composed of fermenting lymph fluid. In 1838 it was realized that cancer was made of cells and not fluid; however, it was not known at the time that it was the normal cells that become cancerous (blastema theory). Until the 1920s some believed that cancer was caused by trauma. It is interesting to point out here that the earliest mention of cancer in 1600 B.C. was in a trauma surgery book. It was only after the mid 20th century, with the discovery of the chemical structure of DNA, that we came to know cancer as we do today, as being caused by mutated genes. Without overlooking the foundation of knowledge laid by past research, though there have been thousands of years of work in this area, so much of what we know of cancer today is based on recent advancements, and as stated by the American Cancer Society in [4], "Scientists have learned more about cancer in the last 2 decades than has been learned in all the centuries preceding". It is interesting to point out the evolution of our knowledge from: a balance of four fluids would keep us healthy, to: it is a coordinated effort of several cell components, like 30,000 genes, and 500,000 proteins; thus, the vastness of data creating the need for analytical models.

1.2 Brief Description of Cell and Cause of Cancer

The cell is a functional unit of all organisms, and the nucleus of the cell contains genetic information in the form of DNA. The DNA is considered to be a blueprint that provides instructions for all cellular functions. The cell cycle is a sequence consisting of DNA replication, growth, and division of cells to form two new cells. Thus the genetic information is carried from cell to cell.

1.2.1 Chromosomes, DNA, Genes, and Proteins

Human cells have 23 pairs of chromosomes where each parent contributes one to each pair. A chromosome is a very long DNA molecule that carries portions of hereditary information. Genes are sections of DNA, carried on the chromosomes, that determine specific human characteristics. The human genome consists of around 30,000 genes, with different genes active or expressed in different parts of the body. Active genes produce or translate proteins, and more than 500,000 proteins are translated by the human genome. These proteins are the main components that drive the metabolic activities in a cell which lead to functioning of the different parts of the body. Cancer is a result of the dysfunction in cellular level metabolic activities.

1.2.2 Cancer Biomarkers

Identification of the gene or protein biomarkers of cancer, i.e., the specific genes that are expressed or the proteins that are present only in cancerous cells, could lead to their use as cancer diagnostic tools and further drug discovery. Identifying biomarkers of the earliest stage of cancer can lead to their use as a screening tool for early detection of cancer. Since response to treatment varies across individuals even for the same type of cancer, identifying biomarkers could also allow for a more individualized treatment approach. For example, it is possible that different sets of proteins could cause the same type of cancer; identifying the behavior of drugs on these protein sets could therefore lead to a more targeted treatment. It may be noted that engineering personalized rather than standardized medicines has been identified under one of the Grand Challenges for the 21st century ("Engineering Better Medicine") that was laid out by the National Academy of Engineering, thus underscoring the significance of identifying cancer biomarkers.

1.3 Need for Analytical Models in Achieving Early Cancer Detection

As is clear from the data in Figure 1.1, early detection of cancer is essential for improved chances of survival. Advancements in technology, biochemistry, and medicine have increased our knowledge of the cause of cancer. However, utilizing this knowledge to develop tools and achieve effective early detection is faced with several challenges, both biochemical and analytical, two of which can be broadly classified as follows.

1. Challenges in identification of gene and protein biomarkers (cellular level): The vastness of data and complexity of the cell create analytical hurdles in biomarker discovery. As an example, we consider the challenges faced in the general procedure for extracting biomarkers. The procedure includes translating the chemical presence of cell components (e.g., genes and proteins) in cell samples to numerical signals, processing signals to extract meaningful data (e.g., gene expression, or amount of protein), and processing data to extract cancer biomarkers. Some of the challenges include irreproducibility of sample, which arises due to factors like the same genes producing different proteins under different conditions of the cell, or proteins changing after production, thus resulting in proteins with different metabolic activities. Each time the sample is converted to a numerical signal, the vastness of data makes it infeasible to manually identify the values of the cell components, like all genes that are expressed and their levels of expression. Therefore, analytical models need to be developed to automatically identify the type and amount of the components present in a sample.

2. Effective population-wide implementation of early detection strategies (strategic level): Achieving effective population-wide early cancer detection extends beyond discovery of cancer screening tools. Utilizing the advances in screening tools requires development of effective screening strategies and, moreover, their implementation at a systemic level, i.e., in the general population. The system is comprised of independent decision making entities like the population, physicians, and insurance policies. Each entity has its own goals that could sometimes conflict with those of the others, and hence, achieving early cancer detection is subject to the behavior and interaction between these system entities. Therefore, developing effective population-wide early cancer detection strategies requires building a model of the entire cancer care system, using which implementation strategies can be developed and analyzed.

1.3.1 Research Topics Addressed in this Dissertation

As part of this dissertation, we developed mathematical models to address analytical challenges in the areas of disease progression (strategic level), genomics, and proteomics (cellular level). The remaining part of this chapter provides a brief description of the specific research topics of this dissertation, and Chapters 2, 3, and 4 will cover each of these topics in depth.

1. Probability model for estimating disease progression of colorectal cancer: Per the American Cancer Society, colorectal cancer (CRC) is the third most common cause of cancer related deaths in the United States. Experts estimate that about 85% of CRCs begin as precancerous polyps, early detection and treatment of which can significantly reduce the risk of CRC. Hence, it is imperative to develop population-wide intervention strategies for early detection of polyps. Development of such strategies requires precise values of the population-specific rates of incidence of polyps and their progression to the cancerous stage. There has been a considerable amount of research in recent years on developing screening based CRC intervention strategies. However, these are not supported by population-specific estimates of progression rates. This research addresses this need by developing a probability model that estimates polyp progression rates considering race and family history of CRC; note that it is ethically infeasible to obtain polyp progression rates through case studies. We use the estimated rates to simulate the progression of polyps in the population of the State of Indiana. The simulation also includes the screening procedure constructed as per the current screening guidelines for colorectal cancer, and the screening compliance by the population of Indiana.

2. Mathematical model to remove noise from gene data generated by microarrays: Microarray technology for measuring gene expression values has created significant opportunities for advances in disease diagnosis and treatment planning. However, random noise introduced by the sample preparation, hybridization, and scanning stages of microarray processing creates inaccuracy in the estimates of gene expression levels. Literature presents several methodologies for noise reduction, which can be broadly categorized as: 1) model based approaches for estimation and removal of hybridization noise, 2) approaches using commonly available image denoising tools, and 3) approaches involving control samples. In this research we present a novel methodology for identifying and removing hybridization and scanning noise from microarray images, using a dual-tree complex wavelet transform based multiresolution analysis coupled with bivariate shrinkage thresholding. The key features of our methodology include consideration of specific characteristics of microarray images and the noise distribution, and the ability to work with a single microarray without needing a control. Our methodology is first benchmarked on a fabricated dataset that mimics a real microarray probe dataset. Thereafter, our methodology is tested on datasets obtained from a number of Affymetrix GeneChip human genome HG-U133 Plus 2.0 arrays processed on the HCT-116 cell line. The results indicate an appreciable improvement in the quality of the microarray data.

3. Analytical processing of data for estimating protein values: Identification of protein biomarkers in blood and urine samples can provide a non-invasive screening tool for early detection of cancer. Developing such screening tools requires identifying all the proteins that are produced in the body and further identifying those that distinguish cancer cases from controls. Due to the large number of proteins produced by the human body (500,000), most of which are present in small amounts in blood and urine, the biochemical processing of the blood or urine samples needs to be accompanied by automatic mathematical algorithms that identify and quantify the proteins. Developing such mathematical algorithms is a challenging task due to the vastness of data, which is further complicated by the complex biochemical nature of the problem. In this research, we develop wavelet and optimization algorithms for analytical processing of the data to obtain protein information across several samples, using which biomarkers can be identified.
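The wavelet-shrinkage idea behind research topic 2 above can be illustrated with a deliberately simplified sketch. It is not the dissertation's method: a single-level one-dimensional Haar transform with soft thresholding stands in for the dual-tree complex wavelet transform and bivariate shrinkage, and the function names, sample values, and threshold are all hypothetical.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform (even-length input assumed)."""
    root2 = math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / root2 for a, b in pairs]   # local averages
    detail = [(a - b) / root2 for a, b in pairs]   # local differences
    return approx, detail


def inverse_haar_step(approx, detail):
    """Invert haar_step, interleaving the reconstructed pairs."""
    root2 = math.sqrt(2.0)
    out = []
    for s, d in zip(approx, detail):
        out.extend([(s + d) / root2, (s - d) / root2])
    return out


def soft_threshold(coeffs, threshold):
    """Shrink each coefficient toward zero by `threshold` (soft thresholding)."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def denoise(signal, threshold):
    """Transform, shrink only the detail coefficients, and reconstruct."""
    approx, detail = haar_step(signal)
    return inverse_haar_step(approx, soft_threshold(detail, threshold))


# Hypothetical intensities: two flat regions with small fluctuations.
smoothed = denoise([10.0, 10.2, 5.0, 4.8], threshold=0.5)
```

Small detail coefficients, which here represent the fluctuations within each pair, are set to zero, while the pairwise averages are kept, so each pair is replaced by its mean. A large difference within a pair would survive the shrinkage, which is the sense in which wavelet thresholding removes noise while retaining edges.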

PAGE 21

CHAPTER2 MODELFORCOLORECTALPOLYPPROGRESSION 2.1Introduction ColorectalcancerCRCisthethirdmostcommoncauseofcancerrelateddeaths intheU.S.MostCRCsbeginasprecancerouspolyp[6,7],referredtoasadenomacarcinomasequence[8].Employingeectivepopulation-widestrategiesforearly detectionandtreatmentatprecancerousstagescanleadtosignicantreductionin CRCmortalities.LiteraturepresentsaconsiderableamountofresearchondevelopingCRCscreeningstrategieswithvaryingtestsandtimelines,andexamining theirinuenceonmortalityrates.However,developingfeasibleinterventionstrategiesrequiresasystem-basedmodelofcancercarethatmustconsider,inadditionto screeningalternatives,variousotherinteractingelementsofthesystem,includingthe social-behavioraltraitsofthepeopleandthephysicians,andtheparametersofthe insurancepolicies.Twoimportantprocessesofthesystem-basedcancercaremodel are: polypincidence and polypprogression .Polypsfollowanaturalincidenceand progression,andupondiagnosis,drivethebehaviorandinteractionofthesystemelements.Thus,precisemodelsportrayingtheincidenceandtheprogressionprocesses arefundamentaltodevelopingeectiveinterventionstrategies.Inthisresearch,our attentionisfocusedondevelopingaprobabilitymodeltoestimateprogressionrates ofcolorectalpolyps. 9

PAGE 22

Theliteraturecontainsaconsiderablenumberofsimulationandmathematical modelsforCRCscreeningstrategies[9,10,11,12,13,14,15,1,16],andCISNET models[17].Allofthesecitedworkshaveanaturalhistorycomponentforthe incidenceandprogressionofpolyps,mostofwhicharemodeledusingvariantsof Markoviantechniques.ThemaininputsrequiredfortheseMarkovianmodelsarethe incidencerates ofpolyps,and progressionrates betweenstages,e.g.,theinverseofthe timethatpolypstaketoprogressfromadenomapre-malignanttocarcinoma.The incidencerateshavebeenestimatedbasedoncasestudyresultsinvolvingrandomized screeningandfollow-up.However,forprogressionrates,acasestudyapproachisnot feasible,sinceitisunethicaltokeepadiagnosedpolypunderobservationwithout treatment.Asaresult,mostoftheabovemodelsuseprogressionratesthatare derivedbasedonexpertopinion,obtainedeitherbyconveningapanelorbyutilizing thedatapresentedin[18]and[19]. Mathematicalmodels,incontrasttothemodelsbasedonexpertopinion,can incorporatecharacteristicslikeraceandfamilyhistoryofCRC,andhenceestimate population-specicprogressionrates.TheNationalAcademyofEngineering,under oneoftheGrandChallengesforthe21 st centuryEngineeringBetterMedicine notedtheneedforengineeringpersonalizedratherthanstandardizedmedicinessince peopledierinsusceptibilitytodiseaseandresponsetomedicine."Also,progression ratesinthepre-diagnosisphasemayvarybetweenpopulations,thusunderscoringthe needforpopulation-specicprogressionrates.Althoughliteraturepresentsnumerous modelsandcostbasedanalysisofCRCscreeningstrategies,andnumerousmodelsand casestudiesonincidencerates,mathematicalmodelstoestimatepolypprogression rateshavebeenlimited[20],[21].Thestudypresentedin[20]usesthedatafromthe nationalcolonoscopyscreeningdatabaseofGermanytodevelopastatisticalapproach toobtainannualtransitionrateandthe10yearcumulativeriskofCRCspecictosex 10

PAGE 23

andagegroups.Thetransitionratesareonlyforaftertheonsetofadvancedadenoma polyp 1cm.AMarkovmodeltoestimateprogressionratesforstagesaftertheonset ofcancer,andtheincidenceofcancerispresentedin[21].Themodelwasbuiltbased ontheresultsofacasestudyconductedonapopulationwithhigh-riskofCRC. Inthisresearch,wepresentaprobabilitymodelthatwasdevelopedforestimating polypprogressionrates,specictoraceandfamilyhistorystatus,fromtheincidence ofpolyptocarcinomaandbetweenstagesofcarcinoma.Notethat,estimatingprogressionrates,andthustimetoprogress,fromincidenceofpolyptocarcinomaareof vitalimportancefordevelopingearlypre-cancerinterventionstrategies.Thepolyp progressionpathwaysconsideredinourmodelaredepictedinFigure2.1andare describedasfollows. CRCPathways-Polypincidenceandstagesofprogression:MostCRCsoriginate asvisibleprecancerouspolypsandonlyasmallpercentageariseasatcarcinoma.Not allpolypsarepre-malignantandhenceonlysomeprogresstocarcinoma.WhileCRCs generallydevelopfrompolypsgreaterthan1cm,carcinomahasalsobeendiagnosed inpolypsbetween6mmand9mm[19].Inthisresearch,weconsiderthreepossible pathwaysforpolypprogressionbeforetheonsetofcancer.Figure2.1depictsthethree pathwaysandthefourstagesofcolorectalcancer.Polyp-pathways1and2referto theprogressiontypesthatbeginwithavisibleadenomapolypbeforeprogressingto cancer;thiswasadoptedfromthepathwayspresentedby[19].Non-polyppathway referstothecancersarisingfromatpolyps[22].Afterthein-situstage,thepolyp progressesthroughthethreestagesofinvasivecancer:local,regional,anddistant, whichwhenrelatedtoDukesclassicationofcancer[23],correspondtostagesA+B, C,andD,respectively.Notethat,whilemostcancersbeginaspre-cancerouspolyp, notallpolypsprogresstocancer,i.e,remaininthebenignstages.Hence,polyps canbecategorizedintoprogressiveandnon-progressive[19].Inwhatfollows,we 11

PAGE 24

presentourprobabilitymodel,resultsestimatedfromtheprobabilitymodel,the modelvalidation,andconcludingremarks. Figure2.1.CRCpathways:Polypincidenceandstagesofprogression 2.2ProbabilityModeltoEstimateProgressionRates Oneofthemaininputsrequiredtodeveloptheprogressionratemodelisthe incidencerate ofpolypsspecictopatient'sage,race,andfamilyhistorystatusofCRC. Inthissection,werstpresentourmethodforestimatingincidencerates,followed bythedevelopmentofaprobabilitymodelforestimatingprogressionrates.The probabilitymodelwasdevelopedbasedontheassumptionthatnon-polyppathway contributestowards15%ofcancers,and70%ofcancersfromthepolyppathways arisefrompathway1basedonexpertopinionscitedin[19]and[11].Allnotation usedinthemodelpresentedinthissectionaresummarizedinTable2.1. 2.2.1PolypIncidenceRates Weestimatedtheprobabilityofincidenceof polyp 5mm ,whichistherstvisible stageonthepolyppathways,byutilizingthecasestudiespresentedin[24]and 12


[25]. The study in [24] presents data on the number of positive sigmoidoscopy results (indicating polyps) in a population that had tested negative (indicating normal) three years back. Using this repeated test data, we estimated the probability of incidence of polyp ≤5 mm per year. Studies have shown that the rate of polyp incidence varies with age; however, [24] did not contain enough data to estimate the incidence probabilities based on age groups. Therefore, we used the statistics in [25], which presents age-based data of patients undergoing polyp detection and removal at the Rochester Methodist Hospital, Rochester, Minnesota. It has also been observed in the literature that a family history of CRC increases the risk of developing CRC [26] and [25]. Therefore, we considered population-specific incidence rates, which we estimated as follows.

Table 2.1. Notation used in the probability model
Symbol | Description
$p_5, \mathcal{S}, \mathcal{L}, \mathcal{R}, \mathcal{D}$ | events of incidence of polyp ≤5 mm, in-situ CRC, local CRC, regional CRC, distant CRC
$p_5^{t}, \mathcal{S}^{t}, \mathcal{L}^{t}, \mathcal{R}^{t}, \mathcal{D}^{t}$ | $p_5$, $\mathcal{S}$, $\mathcal{L}$, $\mathcal{R}$, and $\mathcal{D}$ at time $t$
$\tilde{\mathcal{S}}^{t}, \tilde{\mathcal{L}}^{t}, \tilde{\mathcal{R}}^{t}, \tilde{\mathcal{D}}^{t}$ | events of prevalence of in-situ CRC, local CRC, regional CRC, and distant CRC, at time $t$
$\tilde{\mathcal{C}}^{t}$ | event of prevalence of CRC (in-situ, local, regional, or distant) at time $t$
$\widetilde{\mathcal{IC}}^{t}$ | event of prevalence of invasive CRC (local, regional, or distant) at time $t$
$F$ | random variable denoting family history of CRC of a randomly selected individual
$R$ | random variable denoting race of a randomly selected individual
$A^{t}$ | random variable denoting age in years of an individual at time $t$
$L$ | random variable denoting length of life in years
$Z^{+}$ | set of positive integers
$N_S$ | number of stages from $p_5$ to $\mathcal{S}$
$N_L$ | number of stages from $\mathcal{S}$ to $\mathcal{L}$
$N_D$ | number of stages from the initial event ($p_5$ or $\mathcal{S}$) to distant CRC $\mathcal{D}$
$T^{j}_{i}$ | time to progress from event $i$ to $j$
$\lambda^{j}_{i \mid rf}$ | progression rate from event $i$ to event $j$ given $R = r$ and $F = f$
Let $p_5^{t_1}$ denote an event of incidence of polyp ≤5 mm at time $t_1$, and $A^{t_1} \in Z^{+}$ be the random variable denoting the age at $t_1$, where $Z^{+}$ denotes the set of positive integers. Let $R$ and $F$ be the random variables denoting the race and the status of family history of CRC, respectively, of a randomly selected individual. We consider $R = \{\text{Caucasian}, \text{African American}, \text{Other}\}$, and $F = \{0, >0\}$, i.e., $F$ has 2 outcomes, either no family history of CRC (event $F = 0$) or at least one case of CRC


in the family (event $F > 0$). We computed the joint probabilities of $p_5^{t_1}$ and $A^{t_1}$ in interval $[a, b]$, for given events of $F = f$ and $R = r$, as follows. Let the probability of $p_5^{t_1}$ given $R = r$ and $F = f$, i.e., $P\{p_5^{t_1} \mid R = r, F = f\}$, be denoted by $P_{rf}\{p_5^{t_1}\}$. Applying the definition of conditional probability, and since $p_5^{t_1}$ and $(R = r, F = f)$ are dependent events, we can write that

$$P_{rf}\{p_5^{t_1}\} = \frac{P\{p_5^{t_1}, R = r, F = f\}}{P\{R = r, F = f\}}, \qquad (2.1)$$

and also, since $p_5^{t_1}$ and $a \le A^{t_1} \le b$ are dependent events, we can write that

$$P_{rf}\{a \le A^{t_1} \le b,\ p_5^{t_1}\} = P_{rf}\{a \le A^{t_1} \le b \mid p_5^{t_1}\}\, P_{rf}\{p_5^{t_1}\}. \qquad (2.2)$$

The probability values for the elements on the right hand side of equations (2.1) and (2.2) are estimated using data from: [24] for probabilities of $p_5^{t_1}$; [25] for probability distributions based on $A^{t_1}$ and $F$; and [27] for probability distributions based on $R$. A detailed description of the estimation is presented below.

1. Estimating $P\{F = f, R = r\}$: Using the definition of conditional probability, since $F = f$ and $R = r$ are dependent events, we can write $P\{F = f, R = r\} = P\{F = f \mid R = r\}\, P\{R = r\}$. To compute $P\{F = f \mid R = r\}$, we consider that the number of CRCs per family (i.e., the family history status $F$) is Poisson distributed with mean $\lambda_r$ for a given race. We estimate $\lambda_r$ for each race as $\frac{\text{Number of CRC cases in the population (i.e., CRC prevalence count)}}{\text{Total population}} \times \text{Average family size}$. The CRC prevalence count for year 2006 for each race was obtained from the SEER (Surveillance Epidemiology and End Results) database [28], which presents CRC statistics of the U.S. population. The total population count in year 2006 for each race was obtained from the U.S. census data. The average family size


of the U.S. population is 3.20, as reported by census [29]. With the inclusion of second degree relatives, we assume that the average family size for all races is 7. Using the Poisson probability mass function, we compute $P\{F = f \mid R = r\} = \frac{\lambda_r^{f} e^{-\lambda_r}}{f!}$, and $P\{R = r\}$ can be easily computed using the U.S. census data. For equation (2.1), we compute $P\{F = 0, R = r\}$ as above, and $P\{F > 0, R = r\} = (1 - P\{F = 0 \mid R = r\})\, P\{R = r\}$. We present below the estimates of $\lambda_r$ and $P\{F > 0 \mid R = r\} \times 100$ in Tables 2.2 and 2.3.

Table 2.2. Estimates of the Poisson parameter $\lambda_r$
R = All Race | R = Caucasian | R = African American
0.026 | 0.028 | 0.018

Table 2.3. Percentage of population with family history of CRC given race, $P\{F > 0 \mid R = r\} \times 100$
R = All Race | R = Caucasian | R = African American
2.55 | 2.77 | 1.77

2. Estimating $P\{p_5^{t_1}, R = r, F = f\}$ and $P_{rf}\{a \le A^{t_1} \le b \mid p_5^{t_1}\}$: The article in [24] presents part of the results of a large-scale randomized Prostate, Lung, Colorectal, and Ovarian Screening Trial (PLCO) [30], which was conducted to test the effect of various screening tests on mortalities from the cancers. As part of the colorectal cancer study, initially, a population in the age group 55-74 was screened for colorectal polyps. On the population that tested negative (indicating normal), a repeated screening test was conducted three years after the initial screen test. The number of positive (indicating presence of polyp) screen results from the repeated test has been presented in [24]. The diagnosed polyps have been categorized into sizes <0.5, 0.5-0.9, and ≥1.0 cm. Since all polyps should have started as <0.5 cm with the event of incidence occurring during


one of the three years between tests, we estimate the probability of incidence of polyp ≤5 mm at an arbitrary year $t_1$, $P\{p_5^{t_1}\}$, as $\frac{1}{3} \times \frac{\text{Number of people tested positive}}{\text{Total tested}}$, where we multiply by $\frac{1}{3}$ assuming that there were equal numbers of incidences in each of the three years. However, note that age groups 40-54 and >74 were not part of the study population in [24], and hence the above estimate of $P\{p_5^{t_1}\}$ will only apply to the population in age 55-74. Therefore, to estimate $P\{p_5^{t_1}\}$ for the required population (i.e., age >40) we perform simple mathematical calculations using: (i) the percentage distribution of polyps across age groups from [25], and (ii) the percentage distribution of the U.S. population across age groups, taken from the U.S. census data.

Applying the definition of conditional probability, we compute $P\{p_5^{t_1}, R = r, F = f\} = P\{R = r, F = f \mid p_5^{t_1}\}\, P\{p_5^{t_1}\}$. Since, when given event $p_5^{t_1}$, we do not have data to determine dependence of events $F = f$ and $R = r$, we assume independence and compute $P\{R = r, F = f \mid p_5^{t_1}\}\, P\{p_5^{t_1}\} = P\{R = r \mid p_5^{t_1}\}\, P\{F = f \mid p_5^{t_1}\}\, P\{p_5^{t_1}\}$. We can estimate $P\{R = r \mid p_5^{t_1}\} = \frac{\text{Number of polyp cases in race } r}{\text{Total polyp cases}}$; however, since we did not have suitable data to determine this proportion, we approximated $P\{R = r \mid p_5^{t_1}\} = \frac{\text{Number of CRC cases in race } r}{\text{Total CRC cases}}$, data for which was obtained from the Indiana database [27]. We estimate $P\{F > 0 \mid p_5^{t_1}\} = \frac{\text{Number of polyp cases with family history of CRC}}{\text{Total polyp cases}} = 0.14$ using data presented in [25], and $P\{F = 0 \mid p_5^{t_1}\} = 0.86$. Also, $P_{rf}\{a \le A^{t_1} \le b \mid p_5^{t_1}\}$ is equated to the proportion of polyps in the respective age groups as presented in [25].

Note that we consider the minimum age for developing a polyp as 40 years, since risk of cancer below 40 is low based on discussions presented in [31] and [32]. The report by the American Cancer Society in [31] notes that 90% of CRCs are diagnosed in individuals above the age of 50. Also, the expert panel from the U.S. Multisociety Task Force on Colorectal Cancer suggests a starting screening age of


50 years and 40 years for individuals without and with a family history of colorectal polyps, respectively [32].

2.2.2 Probability Model to Estimate Progression Rates

Based on expert opinion [20], [9], we consider that the progression times for the following events of incidences: polyp ≤5 mm to in-situ CRC, in-situ to local CRC, local to regional CRC, and regional to distant CRC are exponentially distributed with event dependent parameters (progression rates). It may be noted that accurate estimation of the progression rate from the incidence of pre-cancerous polyp (polyp ≤5 mm) to carcinoma in-situ is crucial for developing effective pre-cancer intervention strategies.

Let $\lambda^{j}_{i \mid rf}$ denote the progression rate from event $i$ to event $j$ given $R = r$ and $F = f$. In what follows, we present models for estimating $\lambda^{j}_{i \mid rf}$ for the pre-cancer event (from incidence of polyp ≤5 mm to incidence of in-situ, considering polyp pathways 1 and 2, see Figure 2.1), and for post-cancer events (between incidence of different CRC stages, considering all pathways).

2.2.2.1 Estimating Pre-Cancer Progression Rate

The pre-cancer progression rate, which we will denote as $\lambda^{\mathcal{S}}_{p_5 \mid rf}$, refers to the inverse of the expected time to progress from incidence of the first stage of visible polyp ≤5 mm ($p_5$) to incidence of in-situ CRC ($\mathcal{S}$), given $R = r$ and $F = f$.

Note: Not all $p_5$ progress to $\mathcal{S}$. Those that do progress are called progressive polyps and the rest non-progressive. In this research, the estimation of $\lambda^{\mathcal{S}}_{p_5 \mid rf}$ considers cases of both progressive and non-progressive polyps. In other words, if the random value for the time to progress to in-situ, selected from the distribution exponential($\lambda^{\mathcal{S}}_{p_5 \mid rf}$), is such that it exceeds the natural life, then the polyp is considered non-progressive.
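The distinction between progressive and non-progressive polyps described in the note can be illustrated with a short simulation. This is a minimal sketch rather than the implementation used in this research: it assumes a fixed age at death (the model itself treats length of life as a random variable), and the function names and the example rate of 1/39 per year are hypothetical choices made only for illustration.

```python
import math
import random


def is_progressive(rate, age_at_polyp, age_at_death, rng):
    """Draw a time to in-situ CRC and report whether it falls within the lifetime.

    rate is the progression rate (1 / mean years from polyp to in-situ CRC).
    age_at_death is a fixed, hypothetical lifespan used only for illustration.
    """
    time_to_in_situ = rng.expovariate(rate)  # exponential progression time
    return age_at_polyp + time_to_in_situ < age_at_death


def progressive_fraction(rate, age_at_polyp, age_at_death, n_polyps, seed=0):
    """Estimate by simulation the share of polyps that progress before death."""
    rng = random.Random(seed)
    hits = sum(
        is_progressive(rate, age_at_polyp, age_at_death, rng)
        for _ in range(n_polyps)
    )
    return hits / n_polyps


def analytic_fraction(rate, age_at_polyp, age_at_death):
    """Closed form for the same quantity: P{T < remaining years of life}."""
    return 1.0 - math.exp(-rate * (age_at_death - age_at_polyp))
```

Under these assumptions, calling `progressive_fraction(1 / 39, 50, 80, 20000)` should land close to `analytic_fraction(1 / 39, 50, 80)`; a polyp whose sampled time exceeds the remaining lifetime is simply counted as non-progressive.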


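We now present the model for estimating $\lambda^{\mathcal{S}}_{p_5 \mid rf}$. Let $\mathcal{S}^{t_2}$ denote the event of incidence of in-situ CRC at time $t_2$. Recollect that $p_5^{t_1}$ denotes $p_5$ at time $t_1$, and $A^{t}$ denotes age at time $t$; then, following the polyp pathways in Figure 2.1, $t_2 > t_1$, as represented in Figure 2.2. For a population with $\mathcal{S}^{t_2}$, since events of $(p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta)$ are mutually exclusive (i.e., $\sum_{\alpha} \sum_{\beta > \alpha} P\{p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\} = 1$) and exhaustive, we can apply the total probability rule and write that

$$P\{\mathcal{S}^{t_2}\} = \sum_{\alpha} \sum_{\beta > \alpha} P\{\mathcal{S}^{t_2} \mid p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\}\, P\{p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\}. \qquad (2.3)$$

Figure 2.2. Event of incidence of polyp ≤5 mm at time $t_1$ ($p_5^{t_1}$) and its progression to an event of incidence of in-situ CRC at time $t_2$ ($\mathcal{S}^{t_2}$), with age at $t_1$ and $t_2$ as $\alpha$ and $\beta$, respectively

Let $T^{\mathcal{S}}_{p_5}$ be a random variable denoting the time to progress from $p_5$ to $\mathcal{S}$. Referring to Figure 2.2, $P\{\mathcal{S}^{t_2} \mid p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\}$ is equivalent to $P\{T^{\mathcal{S}}_{p_5} = \beta - \alpha \mid p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\}$; therefore, we can rewrite equation (2.3) as

$$P\{\mathcal{S}^{t_2}\} = \sum_{\alpha} \sum_{\beta > \alpha} P\{T^{\mathcal{S}}_{p_5} = \beta - \alpha \mid p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\}\, P\{p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\}. \qquad (2.4)$$

Applying conditional probability, equation (2.4) can be written as

$$P\{\mathcal{S}^{t_2}\} = \sum_{\alpha} \sum_{\beta > \alpha} P\{T^{\mathcal{S}}_{p_5} = \beta - \alpha \mid p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\}\, P\{p_5^{t_1} \mid A^{t_1} = \alpha, A^{t_2} = \beta\}\, P\{A^{t_1} = \alpha, A^{t_2} = \beta\}, \qquad (2.5)$$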


where $P\{p_5^{t_1} \mid A^{t_1} = \alpha, A^{t_2} = \beta\}$ can simply be written as $P\{p_5^{t_1} \mid A^{t_1} = \alpha\}$, since incidence of polyp ≤5 mm at time $t_1$ is only dependent on age at $t_1$ and not on age at any future time $t_2$. Using the estimate from equation (2.2), $P\{p_5^{t_1} \mid A^{t_1} = \alpha\}$ can be computed as $\frac{P\{p_5^{t_1},\ a \le A^{t_1} \le b\}}{b - a + 1} \cdot \frac{1}{P\{A = \alpha\}}$ for $a \le \alpha \le b$, i.e., by applying conditional probability and assuming a constant rate of incidence within each age interval. Note that the assumption is in accordance with that in the microsimulation model MISCAN-colon, which evaluates CRC screening policies [9], and whose input parameter values were based on expert estimates presented in meetings at the National Cancer Institute. Further, in equation (2.5), $P\{T^{\mathcal{S}}_{p_5} = \beta - \alpha \mid p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\}$ can be substituted as $\lambda^{\mathcal{S}}_{p_5} e^{-\lambda^{\mathcal{S}}_{p_5} (\beta - \alpha)}$, where $\lambda^{\mathcal{S}}_{p_5}$ denotes the progression rate from $p_5$ to $\mathcal{S}$. The remaining term on the right hand side of (2.5) can be estimated using population demographics from U.S. census data. Hence, $\lambda^{\mathcal{S}}_{p_5}$ can be estimated using (2.5) if the probability on the left hand side is available. However, $P\{\mathcal{S}^{t_2}\}$ is unknown, and is infeasible to estimate with the currently available data, as explained below.
Let us divide the in-situ CRC stage into a series of sequential events $\{s_0, s_1, s_2, \ldots, s_n\}$, where $s_i$ is an event indicating $i$ time units of cancer progression in the in-situ stage. To relate $\mathcal{S}$ to $s_i$, we need to consider smaller time units (e.g., a day), in which case $\mathcal{S}$ is equivalent to $s_0$, denoting the event of the epoch of incidence of in-situ. Note that, for a diagnosed case of in-situ CRC, it is not possible to determine the value of $i$, and hence we cannot obtain data related to the occurrence of each event $s_i$. Therefore, it is not feasible to estimate $P\{\mathcal{S}\}$. However, it is possible to estimate the probability of the event of prevalence of in-situ CRC, i.e., $P\{\cup_{i=1}^{n} s_i\}$, as equal to the proportion of people in stage in-situ CRC in a randomized screening trial (we will denote $P\{\cup_{i=1}^{n} s_i\}$ at an arbitrary time $t_2$ as $P\{\tilde{\mathcal{S}}^{t_2}\}$). Due to the unavailability of a suitable randomized study that can be used to estimate $P\{\tilde{\mathcal{S}}^{t_2}\}$, we estimated $P\{\tilde{\mathcal{C}}^{t_2}\}$, which is the probability of prevalence of CRC at $t_2$. That is,


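$P\{\tilde{\mathcal{C}}^{t_2}\} = P\{\tilde{\mathcal{S}}^{t_2} \cup \tilde{\mathcal{L}}^{t_2} \cup \tilde{\mathcal{R}}^{t_2} \cup \tilde{\mathcal{D}}^{t_2}\}$, where the events in the probability term on the right hand side of the equation denote the prevalences of in-situ CRC, local CRC, regional CRC, and distant CRC, respectively, at time $t_2$. Therefore, Figure 2.2 is modified to include the above changes and is presented as Figure 2.3. As illustrated by Scenarios 1 through 4 in the figure, for a randomly chosen individual at $t_2$, $\tilde{\mathcal{C}}^{t_2}$ corresponds to an event of prevalence of one of the CRC stages. To reflect the above changes we modify equation (2.5) as follows,

$$P\{\tilde{\mathcal{C}}^{t_2}\} = \sum_{\alpha} \sum_{\beta > \alpha} P\{T^{\mathcal{S}}_{p_5} \le \beta - \alpha \mid p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\}\, P\{p_5^{t_1} \mid A^{t_1} = \alpha\}\, P\{A^{t_1} = \alpha, A^{t_2} = \beta\}, \qquad (2.6)$$

Figure 2.3. Event of incidence of polyp ≤5 mm at time $t_1$ ($p_5^{t_1}$) and its progression to an event of prevalence of CRC at time $t_2$ ($\tilde{\mathcal{C}}^{t_2}$), i.e., either $\tilde{\mathcal{S}}^{t_2}$, $\tilde{\mathcal{L}}^{t_2}$, $\tilde{\mathcal{R}}^{t_2}$, or $\tilde{\mathcal{D}}^{t_2}$, with age at $t_1$ ($A^{t_1}$) and $t_2$ ($A^{t_2}$) as $\alpha$ and $\beta$, respectively

where the left hand side has been replaced with $P\{\tilde{\mathcal{C}}^{t_2}\}$. Also, we have $T^{\mathcal{S}}_{p_5} \le \beta - \alpha$ instead of $T^{\mathcal{S}}_{p_5} = \beta - \alpha$, since an occurrence of $\tilde{\mathcal{C}}$ at $t_2$ does not necessarily mean occurrence of $s_0$, i.e., $s_0 \subset \tilde{\mathcal{C}}$. Also, since $\mathcal{S}$ (or $s_0$) is the first of the set of chronological events that make up $\tilde{\mathcal{C}}$, the age at $\mathcal{S}$ is $\le \beta$, implying that $T^{\mathcal{S}}_{p_5} \le \beta - \alpha$. $P\{\tilde{\mathcal{C}}^{t_2}\}$ was estimated using data from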


[33], which presents screen results from CRC counseling and screening conducted by 10 health departments in 15 diverse counties in the state of North Carolina, as part of a pilot study on cancer coordination and control.

Referring to Figure 2.3, the upper bound on $T^{\mathcal{S}}_{p_5}$, i.e., $T^{\mathcal{S}}_{p_5} = \beta - \alpha$, is represented by Scenario 1, while a lower bound would be represented by Scenario 4. To quantify the lower bound, we can consider a value of one year; however, this may not be realistic, as can be explained with the following examples. For a combination of values for $\{A^{t_1} = \alpha, A^{t_2} = \beta\}$ consider an example of $\{A^{t_1} = 40, A^{t_2} = 43\}$. Referring to Scenario 4, if $T^{\mathcal{S}}_{p_5} = 1$, it will imply that the age at $\tilde{\mathcal{S}}^{t_2}$ is 41, and the age at $\tilde{\mathcal{D}}^{t_2}$ is 43, i.e., it takes 2 years to progress from in-situ to distant CRC. Similarly, if instead we consider an example of $\{A^{t_1} = 40, A^{t_2} = 70\}$, if $T^{\mathcal{S}}_{p_5} = 1$, it will imply that the age at $\tilde{\mathcal{S}}^{t_2}$ is 41, and the age at $\tilde{\mathcal{D}}^{t_2}$ is 70, i.e., it takes 29 years to progress from in-situ to distant CRC, while it took only 1 year to progress from polyp ≤5 mm to in-situ CRC. This is highly unlikely since the progression between cancer stages is faster compared to precancer stages. Therefore, in order to place a more realistic lower bound, we consider equal time to progress between stages, and referring to Scenario 4, we write $T^{\mathcal{S}}_{p_5} \ge \frac{N_S (\beta - \alpha)}{N_D}$, where $N_S$ and $N_D$ are the number of stages from polyp ≤5 mm to in-situ CRC and from polyp ≤5 mm to distant CRC, respectively. As an example, for polyp pathway 1 in Figure 2.1, $N_S = 3$ and $N_D = 7$. Note that the equal time between stages was an assumption made only for obtaining a more realistic lower bound than an arbitrary value of one year, and was not an assumption on the progression rate estimation.

The modified equation can be written as

$$P\{\tilde{\mathcal{C}}^{t_2}\} = \sum_{\alpha} \sum_{\beta > \alpha} P\left\{\frac{N_S (\beta - \alpha)}{N_D} \le T^{\mathcal{S}}_{p_5} \le \beta - \alpha \;\Big|\; p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\right\} P\{p_5^{t_1} \mid A^{t_1} = \alpha\}\, P\{A^{t_1} = \alpha, A^{t_2} = \beta\}, \qquad (2.7)$$


where

$$P\left\{\frac{N_S (\beta - \alpha)}{N_D} \le T^{\mathcal{S}}_{p_5} \le \beta - \alpha \;\Big|\; p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\right\} = \left[1 - e^{-\lambda^{\mathcal{S}}_{p_5} (\beta - \alpha)}\right] - \left[1 - e^{-\lambda^{\mathcal{S}}_{p_5} \frac{N_S (\beta - \alpha)}{N_D}}\right],$$

since the bounds on the progression time are equivalent to the event $\{T^{\mathcal{S}}_{p_5} \le \beta - \alpha\} \cap \{T^{\mathcal{S}}_{p_5} \ge \frac{N_S (\beta - \alpha)}{N_D}\}$. Note that the formulation in equation (2.7) implies that every individual with $p_5$ at $t_1$ will live through to time $t_2$; in reality this is not the case. Therefore, denoting $L \in Z^{+}$ as a random variable indicating length of life, we write (2.7) as

$$P\{\tilde{\mathcal{C}}^{t_2}\} = \sum_{\alpha} \sum_{\beta > \alpha} P\left\{\frac{N_S (\beta - \alpha)}{N_D} \le T^{\mathcal{S}}_{p_5} \le \beta - \alpha \;\Big|\; p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\right\} P\{p_5^{t_1} \mid A^{t_1} = \alpha\}\, P\{A^{t_1} = \alpha, A^{t_2} = \beta\}\, P\{L > \beta\}, \qquad (2.8)$$

where

$$P\left\{\frac{N_S (\beta - \alpha)}{N_D} \le T^{\mathcal{S}}_{p_5} \le \beta - \alpha \;\Big|\; p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\right\} = \left[1 - e^{-\lambda^{\mathcal{S}}_{p_5} (\beta - \alpha)}\right] - \left[1 - e^{-\lambda^{\mathcal{S}}_{p_5} \frac{N_S (\beta - \alpha)}{N_D}}\right].$$

In equation (2.8), the only unknown value is the progression rate $\lambda^{\mathcal{S}}_{p_5}$, which can be computed by iteratively incrementing its value to where it best fits (2.8). Before doing so, however, in order to obtain population-specific progression rates, we estimate $\lambda^{\mathcal{S}}_{p_5 \mid rf}$, which denotes $\lambda^{\mathcal{S}}_{p_5}$ given race $R = r$ and family history status $F = f$, as follows.
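The iterative search described for equation (2.8) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure actually coded for this research: the weights passed in `terms` stand for the product $P\{p_5^{t_1} \mid A^{t_1} = \alpha\}\, P\{A^{t_1} = \alpha, A^{t_2} = \beta\}\, P\{L > \beta\}$ and are hypothetical inputs, and the step size, search limit, and function names are arbitrary choices.

```python
import math


def window_probability(rate, gap, n_s, n_d):
    """P{gap * n_s / n_d <= T <= gap} for T ~ Exponential(rate)."""
    lower = gap * n_s / n_d
    return (1.0 - math.exp(-rate * gap)) - (1.0 - math.exp(-rate * lower))


def model_prevalence(rate, terms, n_s=3, n_d=7):
    """Right-hand side of the fitting equation for a trial rate.

    terms is a list of (alpha, beta, weight) tuples with beta > alpha; each
    weight is a hypothetical stand-in for the demographic and incidence factors.
    """
    return sum(
        weight * window_probability(rate, beta - alpha, n_s, n_d)
        for alpha, beta, weight in terms
    )


def fit_rate(target, terms, step=1e-4, max_rate=1.0):
    """Increase the rate in fixed steps until the model first reaches the target."""
    n_steps = int(max_rate / step)
    for i in range(1, n_steps + 1):
        rate = i * step
        if model_prevalence(rate, terms) >= target:
            return rate
    return None  # no rate in (0, max_rate] reaches the target
```

One design point worth noting: the window probability is not monotone in the rate (it rises from zero and eventually falls back towards zero), so a search that steps upward from a small rate returns the first crossing, which corresponds to the slower of the two candidate progression rates.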


Since events of $(R = r, F = f)$ are mutually exclusive (i.e., $\sum_{r} \sum_{f} P\{R = r, F = f\} = 1$) and exhaustive, applying the total probability rule we can write

$$P\{\tilde{\mathcal{C}}^{t_2}\} = \sum_{r} \sum_{f} P\{\tilde{\mathcal{C}}^{t_2} \mid R = r, F = f\}\, P\{R = r, F = f\} = \sum_{r} \sum_{f} P\{\tilde{\mathcal{C}}^{t_2}, R = r, F = f\}. \qquad (2.9)$$

Note that equation (2.8) for a given $R = r, F = f$ is equivalent to $P\{\tilde{\mathcal{C}}^{t_2} \mid R = r, F = f\}$ of equation (2.9). Therefore, we can write

$$P\{\tilde{\mathcal{C}}^{t_2}, R = r, F = f\} = \left[\sum_{\alpha} \sum_{\beta > \alpha} P_{rf}\left\{\frac{N_S (\beta - \alpha)}{N_D} \le T^{\mathcal{S}}_{p_5} \le \beta - \alpha \;\Big|\; p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\right\} P_{rf}\{p_5^{t_1} \mid A^{t_1} = \alpha\}\, P_{r}\{A^{t_1} = \alpha, A^{t_2} = \beta\}\, P_{r}\{L > \beta\}\right] P\{R = r, F = f\} \quad \forall r, \forall f, \qquad (2.10)$$

where

$$P_{rf}\left\{\frac{N_S (\beta - \alpha)}{N_D} \le T^{\mathcal{S}}_{p_5} \le \beta - \alpha \;\Big|\; p_5^{t_1}, A^{t_1} = \alpha, A^{t_2} = \beta\right\} = \left[1 - e^{-\lambda^{\mathcal{S}}_{p_5 \mid rf} (\beta - \alpha)}\right] - \left[1 - e^{-\lambda^{\mathcal{S}}_{p_5 \mid rf} \frac{N_S (\beta - \alpha)}{N_D}}\right].$$

Note that occurrences of $P_{rf}\{\cdot\}$ in (2.10) and henceforth represent $P\{\cdot \mid R = r, F = f\}$, and have been written as such for notational convenience. Applying conditional probability, we can write $P\{\tilde{\mathcal{C}}^{t_2}, R = r, F = f\} = P\{R = r, F = f \mid \tilde{\mathcal{C}}^{t_2}\}\, P\{\tilde{\mathcal{C}}^{t_2}\}$. Since not enough data is available to determine the dependence of events $F = f$ and $R = r$ when given $\tilde{\mathcal{C}}^{t_2}$, we assume independence and write $P\{\tilde{\mathcal{C}}^{t_2}, R = r, F = f\} = P\{R = r \mid \tilde{\mathcal{C}}^{t_2}\}\, P\{F = f \mid \tilde{\mathcal{C}}^{t_2}\}\, P\{\tilde{\mathcal{C}}^{t_2}\}$. As mentioned earlier, $P\{\tilde{\mathcal{C}}^{t_2}\}$ is estimated using data presented in [33]. We compute $P\{R = r \mid \tilde{\mathcal{C}}^{t_2}\} = \frac{\text{Number of CRC cases in race } r}{\text{Total number of CRC cases}}$, where the required numbers are obtained from the Indiana State Department of Health database [34]. We consider $P\{F > 0 \mid \tilde{\mathcal{C}}^{t_2}\} = 0.2$ based on the observations reported by the American Cancer Society in [35]. Note that estimates of $P\{R = r, F = f\}$ were earlier obtained for equation (2.1), the details of which are described in Section 2.2.1. Therefore, the only unknown element in (2.10) is $\lambda^{\mathcal{S}}_{p_5 \mid rf}$, which can hence be estimated for all values of $r$ and $f$.

2.2.2.2 Estimating Post-Cancer Progression Rates

This section discusses the estimation of progression rates between CRC stages, i.e., between stages in-situ, local, regional, and distant, with events of incidences denoted as $\mathcal{S}$, $\mathcal{L}$, $\mathcal{R}$, and $\mathcal{D}$, respectively. A model similar to that developed for estimating $\lambda^{\mathcal{S}}_{p_5 \mid rf}$ in equation (2.10) can be used in estimating the progression rates between the CRC events. For example, consider the event of incidence of in-situ at time $t_2$ ($\mathcal{S}^{t_2}$) and consider $\widetilde{\mathcal{IC}}^{t_3}$, the event of prevalence of invasive CRC at time $t_3$, i.e., $P\{\widetilde{\mathcal{IC}}^{t_3}\} = P\{\tilde{\mathcal{L}}^{t_3} \cup \tilde{\mathcal{R}}^{t_3} \cup \tilde{\mathcal{D}}^{t_3}\}$, as represented by Figure 2.4. We can estimate the progression rate from $\mathcal{S}$ to $\mathcal{L}$ given $R = r$ and $F = f$, denoted as $\lambda^{\mathcal{L}}_{\mathcal{S} \mid rf}$, by the following equation

$$P_{rf}\{\widetilde{\mathcal{IC}}^{t_3}\} = \sum_{\beta} \sum_{\gamma > \beta} P_{rf}\left\{\frac{N_L (\gamma - \beta)}{N_D} \le T^{\mathcal{L}}_{\mathcal{S}} \le \gamma - \beta \;\Big|\; \mathcal{S}^{t_2}, A^{t_2} = \beta, A^{t_3} = \gamma\right\} P_{rf}\{\mathcal{S}^{t_2} \mid A^{t_2} = \beta\}\, P_{r}\{A^{t_2} = \beta, A^{t_3} = \gamma\}\, P_{r}\{L > \gamma\} \quad \forall r, \forall f, \qquad (2.11)$$

where

$$P_{rf}\left\{\frac{N_L (\gamma - \beta)}{N_D} \le T^{\mathcal{L}}_{\mathcal{S}} \le \gamma - \beta \;\Big|\; \mathcal{S}^{t_2}, A^{t_2} = \beta, A^{t_3} = \gamma\right\} = \left[1 - e^{-\lambda^{\mathcal{L}}_{\mathcal{S} \mid rf} (\gamma - \beta)}\right] - \left[1 - e^{-\lambda^{\mathcal{L}}_{\mathcal{S} \mid rf} \frac{N_L (\gamma - \beta)}{N_D}}\right];$$

$T^{\mathcal{L}}_{\mathcal{S}}$ denotes the time to progress from $\mathcal{S}$ to $\mathcal{L}$, and $N_L$ and $N_D$ in the lower bound of $T^{\mathcal{L}}_{\mathcal{S}}$ now represent the number of stages from $\mathcal{S}$ to $\mathcal{L}$ and from $\mathcal{S}$ to $\mathcal{D}$, respectively.

Figure 2.4. Event of incidence of in-situ CRC at time $t_2$ ($\mathcal{S}^{t_2}$) and its progression to an event of prevalence of invasive CRC at time $t_3$ ($\widetilde{\mathcal{IC}}^{t_3}$), i.e., either $\tilde{\mathcal{L}}^{t_3}$, $\tilde{\mathcal{R}}^{t_3}$, or $\tilde{\mathcal{D}}^{t_3}$, with age at $t_2$ ($A^{t_2}$) and $t_3$ ($A^{t_3}$) as $\beta$ and $\gamma$, respectively

$P_{rf}\{\widetilde{\mathcal{IC}}\}$ at an arbitrary time $t_3$ can be estimated as $P_{rf}\{\widetilde{\mathcal{IC}}\} = P_{rf}\{\widetilde{\mathcal{IC}} \cap \tilde{\mathcal{C}}\} = P_{rf}\{\tilde{\mathcal{C}}\}\, P_{rf}\{\widetilde{\mathcal{IC}} \mid \tilde{\mathcal{C}}\}$. $P_{rf}\{\tilde{\mathcal{C}}\}$ is obtained as earlier from [33], and $P_{rf}\{\widetilde{\mathcal{IC}} \mid \tilde{\mathcal{C}}\}$ is obtained using CRC diagnosed data from the Indiana database [36]. Therefore, from equation (2.11), we can estimate $\lambda^{\mathcal{L}}_{\mathcal{S} \mid rf}$ if the only unknown, $P_{rf}\{\mathcal{S}^{t_2} \mid A^{t_2} = \beta\}$, can be computed, since the rest of the terms can be obtained similarly to the equivalent terms in equation (2.10). However, as explained in Section 2.2.2.1, it is infeasible to estimate $P\{\mathcal{S}\}$ using results from randomized screening trials; hence it is also not feasible to estimate $P_{rf}\{\mathcal{S}^{t_2} \mid A^{t_2} = \beta\}$ using screening trials. Note, however, that at this point in the model we have estimated $\lambda^{\mathcal{S}}_{p_5 \mid rf}$, the progression rate from $p_5$ to $\mathcal{S}$. Therefore, using the estimated values of $P_{rf}\{p_5^{t_1},\ a \le A^{t_1} \le b\}$ from Section 2.2.1 and $\lambda^{\mathcal{S}}_{p_5 \mid rf}$


from Section 2.2.2.1, we developed a model to estimate $P_{rf}\{\mathcal{S}^{t_2} \mid A^{t_2} = \beta\}$, which is explained below.

The schematic of the probability model developed for estimating $P_{rf}\{\mathcal{S}^{t_2} \mid A^{t_2} = \beta\}$ is presented in Figure 2.5, and can be interpreted as follows. Consider $p_5$ at time $t_1$ and its progression to $\mathcal{S}$ at $t_2$ $t_1$ [...] $m + k\}\Big]$ (2.12)


where $P_{rf}\{T^{\mathcal{S}}_{p_5} = m\} = \lambda^{\mathcal{S}}_{p_5 \mid rf}\, e^{-m \lambda^{\mathcal{S}}_{p_5 \mid rf}}$. Note that equation (2.12) was derived by a simple application of the total probability rule. For example, considering one specific age at $A^{t_2}$ as 50, we can write $P_{rf}\{\mathcal{S}^{t_2}, A^{t_2} = 50\} = \sum_{\alpha < 50} P_{rf}\{\mathcal{S}^{t_2}, A^{t_2} = 50 \mid p_5^{t_1}, A^{t_1} = \alpha\}\, P_{rf}\{p_5^{t_1}, A^{t_1} = \alpha\} = \sum_{\alpha < 50} P_{rf}\{T^{\mathcal{S}}_{p_5} = 50 - \alpha \mid p_5^{t_1}, A^{t_1} = \alpha\}\, P_{rf}\{p_5^{t_1}, A^{t_1} = \alpha\}$. Equation (2.12), however, has been written for an age interval, and has the variable $L$, which denotes the length of life of an individual.

$P_{rf}\{\mathcal{S}^{t_2} \mid A^{t_2} = \beta\}$, required for equation (2.11), can now be estimated from $P_{rf}\{\mathcal{S}^{t_2},\ c \le A^{t_2} \le d\}$ by applying conditional probability, and by considering a constant rate of incidence within each age interval, which is in accordance with the literature as explained earlier. Equation (2.11) can now be solved to obtain $\lambda^{\mathcal{L}}_{\mathcal{S} \mid rf}$. The progression rates from local to regional CRC and from regional to distant CRC can similarly be estimated by cyclically computing the probability of the event of incidence (similar to the computation of $P_{rf}\{\mathcal{S}^{t_2} \mid A^{t_2} = \beta\}$ using equation (2.12)) followed by estimation of the progression rates (similar to the estimation of $\lambda^{\mathcal{L}}_{\mathcal{S} \mid rf}$ using equation (2.11)). Note that, for stages past in-situ, $L$ also includes survival based on stage of cancer in addition to the natural life of an individual.

2.3 Results: Estimated Incidence and Progression Rates

In Tables 2.4 and 2.5, we present rates of $p_5$ for polyp pathways and rates of $\mathcal{S}$ for the non-polyp pathway, respectively, for different combinations of age, race, and family history status. For example, in Table 2.4, the percentage of incidence of polyp ≤5 mm at age $50 \le A \le 64$ with $R$ = Caucasian and $F > 0$ is 4.25. The mean times to progress from event $i$ to event $j$ given $R = r$ and $F = f$, i.e., $1 / \lambda^{j}_{i \mid rf}$, are presented in Tables 2.6 and 2.7, for polyp pathways and the non-polyp pathway, respectively. The


values in Tables 2.4 through 2.7 will serve as input for developing a model of polyp progression. Such a model is essential for developing CRC intervention strategies.

Table 2.4. $P_{rf}\{p_5,\ a \le A \le b\}$: Percentage incidence of polyp ≤5 mm at age $[a, b]$, given $R = r$ and $F = f$, for polyp pathways
Age Group [a, b] | All Race, F>0 | All Race, F=0 | Caucasian, F>0 | Caucasian, F=0 | African American, F>0 | African American, F=0
[40, 49] | 0.74 | 0.12 | 0.73 | 0.13 | 0.90 | 0.10
[50, 64] | 4.33 | 0.70 | 4.25 | 0.74 | 5.28 | 0.58
[65, 74] | 4.54 | 0.73 | 4.46 | 0.78 | 5.53 | 0.61
[75, ∞] | 2.13 | 0.34 | 2.09 | 0.36 | 2.59 | 0.29

Table 2.5. $P_{rf}\{\mathcal{S},\ a \le A \le b\}$: Percentage incidence of in-situ CRC at age $[a, b]$, given $R = r$ and $F = f$, for non-polyp pathway
Age Group [a, b] | All Race, F>0 | All Race, F=0 | Caucasian, F>0 | Caucasian, F=0 | African American, F>0 | African American, F=0
[40, 49] | 0.025 | 0.002 | 0.026 | 0.003 | 0.024 | 0.002
[50, 64] | 0.257 | 0.026 | 0.265 | 0.029 | 0.252 | 0.018
[65, 74] | 0.330 | 0.034 | 0.339 | 0.038 | 0.318 | 0.023
[75, ∞] | 0.344 | 0.040 | 0.346 | 0.044 | 0.332 | 0.026

Table 2.6. Mean times to progress from event $i$ to event $j$, given $R = r$ and $F = f$ ($1 / \lambda^{j}_{i \mid rf}$), on polyp pathways
event i → event j | All Races, F>0 | All Races, F=0 | Caucasian, F>0 | Caucasian, F=0 | African American, F>0 | African American, F=0
$p_5$ → in-situ (a) | 23 | 41.6 | 21.5 | 39 | 29 | 50
in-situ → local | 3.4 | 3.4 | 3.5 | 3.5 | 3.1 | 3.1
local → regional | 5 | 5 | 4.5 | 4.5 | 3.5 | 4
regional → distant | 0.95 | 0.95 | 0.95 | 0.95 | 0.88 | 0.9
(a): $\lambda^{\mathcal{S}}_{p_5 \mid rf}$ was estimated considering progressive and non-progressive polyps

2.3.1 Comparison of Results

The expected progression time estimates (i.e., Tables 2.6 and 2.7) can be used to compute the one-step transition probabilities needed to build Markov models, as in [1], for polyp progression. For example, probabilities for $R$ = Caucasian and $F = 0$ are depicted in Figure 2.6. We compared the transition probabilities derived from our model with those compiled by [1], who analyze the cost-effectiveness of screening for a population without a family history of cancer. Though the study in [1] is based


on the population of Israel, the reason for our comparison is only to check whether our estimates are within commonly observed ranges, and it is not meant as a validation. The polyp stages considered in [1] are low-risk polyps (<1 cm), high-risk polyps (≥1 cm), local CRC, regional CRC, and distant CRC, which, as seen in Figure 2.6, are slightly different from those in our model. Therefore, to obtain a rough comparison, we assumed equal progression time between stages $p_5$ and $\mathcal{S}$ (see pathway 1 in Figure 2.1), and computed the transition probability from $p_5$ to polyp ≥1 cm. As shown in Table 2.8, for similar stages (i.e., rows 2, 3, and 4), the transition probabilities obtained from our model are comparable to those assumed in [1]. Using our mathematical modeling approach of progression rate estimation, we can further compute population-specific transition probabilities to build Markov models for developing effective CRC intervention strategies.

Table 2.7. Mean times to progress from event $i$ to event $j$, given $R = r$ and $F = f$ ($1 / \lambda^{j}_{i \mid rf}$), on non-polyp pathway
event i → event j | All Races, F>0 | All Races, F=0 | Caucasian, F>0 | Caucasian, F=0 | African American, F>0 | African American, F=0
in-situ → local | 3.4 | 3.3 | 3.4 | 3.4 | 3.1 | 3.1
local → regional | 3.5 | 4.8 | 4 | 4.5 | 3.5 | 3.5
regional → distant | 0.9 | 0.95 | 0.9 | 0.95 | 0.9 | 0.9

Table 2.8. One-step transition probabilities between stages: Comparing results presented in this research to literature presented in [1]
Literature (Leshno et al.): From → To Stages | Transition Probability | This research: From → To Stages | Transition Probability
low-risk polyp → high-risk polyp | 0.02 | $p_5$ → polyp ≥1 cm | 0.035
high-risk polyp → local CRC | 0.05 (0.02-0.10) | polyp ≥1 cm → local CRC | 0.06
local CRC → regional CRC | 0.28 (0.20-0.35) | local CRC → regional CRC | 0.20
regional CRC → distant CRC | 0.63 (0.50-0.70) | regional CRC → distant CRC | 0.65

Figure 2.6. One-step transition probabilities for polyp pathway 1 ($R$ = Caucasian, $F = 0$)
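The conversion from a mean progression time to a one-step transition probability is not spelled out in the text. A plausible reading, offered here only as an assumption, is that with an exponentially distributed progression time and a one-year step, the probability of leaving a stage within one step is $1 - e^{-1/\text{mean}}$. The short sketch below applies that assumed formula to two of the Caucasian, $F = 0$ mean times from Table 2.6; the function and variable names are illustrative.

```python
import math


def one_step_probability(mean_years):
    """Probability of leaving a stage within one year, assuming an
    exponentially distributed sojourn time with the given mean."""
    return 1.0 - math.exp(-1.0 / mean_years)


# Mean progression times in years for R = Caucasian, F = 0 (Table 2.6).
mean_times = {
    "local -> regional": 4.5,
    "regional -> distant": 0.95,
}

# Rounded to two decimals for comparison with the tabulated probabilities.
probabilities = {
    transition: round(one_step_probability(mean), 2)
    for transition, mean in mean_times.items()
}
```

Under this assumption the two values come out at 0.20 and 0.65, which agree with the corresponding "This research" entries of Table 2.8.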


2.4 Validation of Progression Rates Estimated from Probability Model

In order to validate the progression rates estimated in Section 2.2, we used a simulation-based approach as follows. A simulation model was constructed such that it initially generates a population based on user-input demographics data of specific populations. For validation purposes, we considered two different populations: the population of the State of Indiana and the population of the clinical trial described in [37, 38, 39], which was conducted in the State of Minnesota. Further, the simulation model was built such that it executes the following three events every year for each person in the population: event 1, updating the age of each person, creating new births, and generating mortalities in the population; event 2, the natural incidence and progression of polyps using values presented in Tables 2.4 through 2.7; and event 3, screening based on the actual compliance rates of the corresponding population. Note that, based on the change in age through event 1, event 2 generates polyps and handles their natural progression until a successful screen through event 3 leads to the polyp's diagnosis. The number and stage of the new cases of polyps that are diagnosed each year are recorded. For validation, the simulated statistics on diagnosed cases of CRCs are compared with the actual statistics of the corresponding population. The simulation model was constructed in Repast [40], a Java agent-based modeling framework. The reason for using an agent-based approach is the ease of including the behavior and interaction between the system entities, including physician and insurance policies, which is a part of our future research for obtaining cancer intervention strategies. Simulation events 2 and 3, mentioned above, are described using flowcharts in Figures 2.7 and 2.8. We present below the details of the validation on the two populations.
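The yearly cycle of three events can be outlined in a few lines. The sketch below is written in Python purely for illustration and is not the Repast model built for this research: births are omitted, each person carries at most one polyp, stage-by-stage progression is reduced to a counter of years since incidence, and all probabilities are hypothetical parameters supplied by the caller.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    age: int
    alive: bool = True
    polyp_age: Optional[int] = None  # years since polyp incidence; None if no polyp
    diagnosed: bool = False


def step_year(person, rng, death_prob, incidence_prob, screen_prob, sensitivity):
    """Advance one person by one simulated year through the three events."""
    if not person.alive:
        return
    # Event 1: update age and apply mortality (births are omitted in this sketch).
    person.age += 1
    if rng.random() < death_prob:
        person.alive = False
        return
    # Event 2: natural incidence of a polyp, or one more year of progression.
    if person.polyp_age is None:
        if rng.random() < incidence_prob:
            person.polyp_age = 0
    else:
        person.polyp_age += 1
    # Event 3: screening; an undiagnosed polyp is found only if the person is
    # screened this year and the test detects it.
    if person.polyp_age is not None and not person.diagnosed:
        if rng.random() < screen_prob and rng.random() < sensitivity:
            person.diagnosed = True
```

A population run would call `step_year` once per person per simulated year and tally the newly diagnosed cases, which mirrors the recording step described above.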


Figure 2.7. Flowchart of simulation Event 2: Incidence and progression of polyps


Figure 2.8. Flowchart of simulation Event 3: Screening


2.4.1 Simulation of the Indiana Population

The actual proportion of the population in different race, sex, and age groups, as estimated from census data, was used to generate an initial sample population for the State of Indiana. The mortality and birth rates required for event 1 were also obtained from the census data. The incidence and progression rates required for event 2 were obtained from Tables 2.4 through 2.7. The screening rates required for event 3 were computed using the data obtained through a survey, which was conducted in year 2001 by the Indiana State Department of Health as part of the Behavioral Risk Factor Surveillance System [2]. The survey considered three screening options: FOBT (fecal occult blood test), colonoscopy, and sigmoidoscopy. The sensitivity and specificity of the screening tests were taken from the values used in an intervention study conducted in year 2006 [3], and were also used in the MISCAN-Colon microsimulation model [9]. The values of the screening parameters used in the simulation are summarized below.

2.4.1.1 Screening Parameters

The parameters used for the percentage of population compliant to screening, screening sensitivity, and screening specificity are summarized below in Tables 2.9, 2.10, and 2.11.

Table 2.9. Percentage of population compliant to screening [2]
Screening Type | Percentage Compliant
FOBT | 43
Sigmoidoscopy | 19
Colonoscopy | 19
Never Compliant | 19


Table 2.10. Screening sensitivity [3]
Stage | FOBT | Sigmoidoscopy | Colonoscopy
polyp <5 mm | 0.02 | 0.75 | 0.80
polyp 6-9 mm | 0.02 | 0.85 | 0.85
polyp >1 cm | 0.05 | 0.95 | 0.95
in-situ | 0.05 | 0.95 | 0.95
invasive CRC | 0.60 | 0.95 | 0.95

Table 2.11. Screening specificity [3]
FOBT | Sigmoidoscopy | Colonoscopy
0.98 | 0.95 | 0.90

2.4.1.2 Assigning Family History Status

For each person in the simulation, we need to determine the family history status. We assume that $F \sim \text{Poisson}(\lambda_r)$, as described earlier while estimating $P\{F = f, R = r\}$ in Section 2.2.1. Note that the CRC proportion (i.e., $\frac{\text{Number of CRC cases}}{\text{Total population}}$) in each race for the State of Indiana is equivalent to the National proportion. Also, the average family size for the State of Indiana is 3.05, which is approximately equal to the National average (see estimation of $P\{F = f, R = r\}$ in Section 2.2.1). Therefore, to generate random numbers in the simulation, we use $\lambda_r$ as presented in Section 2.2.1.

The simulation was run with a sample population of 10,000 for 30 trials. As presented in Tables 2.12 and 2.13, the confidence intervals (CI) of the simulated results, related to new cases of CRCs diagnosed over 5 years, were compared with the actual 2000-2004 diagnosed cases of CRC available on the Indiana State Department of Health website [27, 36]. Table 2.12 presents CRC counts per 100,000 of population. The large range in CI for the African American population can be attributed to its small percentage of the total population of Indiana. Table 2.13 presents the percentage distribution of CRCs among various stages at the time of diagnosis. As


can be seen in both tables, the actual values lie in between the simulated CI. Note that some of the actual values in Table 2.13 fall on the boundary of the simulated CI, which can be attributed to the fact that about 6.5% of actual CRC cases did not have a stage identifier (un-staged). It may be noted that Table 2.13 serves as a verification because the percentage distribution of diagnosed CRC in different stages was initially used in the probability model. However, Table 2.12 serves as a validation, as the simulated CRC results presented in the table are not equivalent to the cancer prevalence probabilities $P\{\tilde{\mathcal{C}}^{t_2}\}$ estimated using [33] that were initially used in the probability model. The difference between the two is that:

1. The cancer prevalence probabilities used in the probability model were estimated using data from clinical trial studies (not from the Indiana database). However, we compare the simulated diagnosed cancer counts with the actual diagnosed counts in Indiana.

2. The second difference is inherent in clinical trial rates versus actual diagnosed counts itself. The former statistic includes all cases of cancer in the population, since, in a clinical trial, all participants get screened (ignoring screen sensitivity, which is the same in both cases). However, the latter statistic does not include all cases, since, in an actual population, not everyone is compliant to screening.

Therefore, the accuracy of the simulated diagnosed cancer counts is dependent on the population's screening compliance rates as well as the polyp natural incidence rates and expected progression times estimated from the probability model. Since the compliance rates were computed from the Indiana population database, the results in Table 2.12 serve as a validation of the probability model.
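The text does not state how the simulated confidence intervals were computed from the 30 trials. One common choice, shown here only as an assumed approach, is a two-sided Student-t interval on the mean of the per-trial results. The function name is illustrative, and the default critical value of 2.045 is the 95% two-sided t quantile for 29 degrees of freedom.

```python
import math
import statistics


def confidence_interval_95(trial_values, t_critical=2.045):
    """Two-sided 95% CI for the mean of independent trial results.

    t_critical defaults to the Student-t quantile for 29 degrees of freedom,
    i.e., 30 trials; pass a different value for other trial counts.
    """
    n = len(trial_values)
    mean = statistics.mean(trial_values)
    half_width = t_critical * statistics.stdev(trial_values) / math.sqrt(n)
    return mean - half_width, mean + half_width
```

An observed count would then be judged consistent with the simulation when it lies between the two returned bounds.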


Table 2.12. Simulated vs. actual Indiana CRC counts per 100,000 of population

                                                       Simulated 95% CI
Stages                            Race                 Lower CI    Upper CI    Actual Indiana Counts
local+regional+distant            All Race             48.83       57.55       56.02
                                  Caucasian            52.50       61.97       57.68
                                  African American     31.97       56.53       47.93
in-situ+local+regional+distant    All Race             52.98       61.90       60.70

Table 2.13. Simulated vs. actual Indiana values for stage at time of diagnosis as percentage of total CRC counts

              Simulated 95% CI
CRC Stage     Lower CI    Upper CI    Actual Indiana Values
in-situ       6.02        9.04        7.70
local         34.99       41.88       34.94
regional      29.58       36.75       33.75
distant       17.99       23.76       17.12
un-staged     NA          NA          6.50

NA - Not Applicable

2.4.1.3 Results from the Simulation Model

The above simulation model was developed to validate the probability model. However, the combination of the probability model followed by the simulation, as constructed in this research, serves as a model in itself for obtaining certain polyp related estimates of interest. One such set of estimates is related to the progressive polyps. The simulated CIs on the maximum likelihood estimate of the exponential distribution parameter, i.e., on the mean time to progress (in years) from $p_5$ to $S$ given R = Caucasian, are presented in Table 2.14. It may be seen that our estimated value for the progression from polyp ≤ 5 mm to in-situ CRC (row 1 of the Table) compares well with expert opinion in [18], where an average time of approximately 10 years to progress from adenomatous polyp (mainly < 1 cm) to invasive CRC (local CRC and beyond) is suggested. Use of a mathematical modeling approach for estimating population-specific values, as in this research, allows us to quantify any variations across populations. See, for example, Table 2.14, which shows shorter progression time to cancer for a population with F > 0. We also obtained results for the


proportion of polyp ≤ 5 mm progressing to in-situ CRC (i.e., the proportion of progressive polyps), which are presented in Table 2.15 for R = Caucasian. Note that the proportion of progressive polyps in a population with F > 0 is approximately 1.8 times as much as that in a population with F = 0. Though it is known that a family history of CRC increases the life-time chances of cancer, such mathematical quantifications (Tables 2.14 and 2.15) of polyp progression could not be found in the literature.

Table 2.14. Estimated confidence interval on mean time to progress from polyp ≤ 5 mm to in-situ CRC (in years) according to family history

Family History (F)    Upper 95% CI    Lower 95% CI
All (a)               10.7            8.3
F = 0                 12.1            9.4
F > 0                 9.9             7.7

(a) includes F = 0 and F > 0

Table 2.15. Estimated proportion of $p_5$'s progressing to $S$

Family History (F)    Proportion (in %)
All (a)               20.9
F = 0                 19.5
F > 0                 34.2

(a) includes F = 0 and F > 0

2.4.2 Simulation of Minnesota Study

The authors in [37, 38, 39] present a clinical trial conducted in Minnesota, where a population in age group 50-80 years with no history of cancer was recruited and randomly divided into 3 groups. Groups 1 and 2 were subject to annual and biennial FOBT screening, respectively, and group 3 was a control group. The objective of the study was to identify the difference in CRC related mortality rates among the three groups, and hence analyze the effect of annual and biennial FOBT screening on mortalities. Phase I of the study was conducted from 1978 to 1982, and continued to Phase II from Feb 1986 to Feb 1992. Study groups 1 and 2 were simulated, separately, by utilizing the values for proportions of people in age ranges 50-59, 60-69, and 70-80


that were given by [37], as follows. The simulation first generated people between age 0-30 years, with proportions of people in age ranges 0-9, 10-19, and 20-30 equal to the proportions of people in age ranges 50-59, 60-69, and 70-80, respectively, of the actual study. The simulation was first run for 50 years so that the population reached the age group 50-80 years, and was then run for a further period of 14 years representing the timeline of the actual clinical trial. The mortality and birth rates (event 1) were kept at zero during the first 50 years. Event 2, i.e., polyp incidence and progression, was run during the entire period, with rates obtained from Tables 2.4 through 2.7. During the first 50 years, any symptomatic cases of CRC were removed from the simulation in order to remove existing diagnosed cases of cancer, and the proportions of population in the three age groups were adjusted. The exponential mean time to symptomatic was taken as per the times (preclinical to clinical) considered in the MISCAN-colon model [10], whose parameters, as mentioned earlier, were based on expert estimates presented in meetings at the National Cancer Institute. During the 14 year run that represented the actual study, study groups 1 and 2 were subject to annual and biennial screening, respectively, as per screening details and test sensitivity provided by [37] (event 3).

The simulated cases of CRC during 13 years of the study were compared with the actual cases given by [37], and the results (represented as CRC cases per 1000 population) are presented in Figures 2.9 and 2.10 for annual and biennial screening groups, respectively. Since the simulated screening intervals matched those in the clinical trial, the accuracy of the simulated CRC cases depends on the natural polyp progression, whose rates were estimated from the probability model. Moreover, tracking a population and comparing results over a 13 year period is a stronger analysis of the polyp progression, and therefore, Figures 2.9 and 2.10 can be used for validating the probability model.
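The simulation draws exponentially distributed waiting times (time to progress, time to symptomatic), and the maximum likelihood estimate of an exponential mean is simply the sample average. A minimal sketch of that relationship is given below; the mean of 10 years, the sample size, and the function names are illustrative assumptions, not values or code from the study.

```python
import random


def sample_progression_times(mean_years, n, rng):
    """Draw n exponential waiting times (in years) with the given mean."""
    # random.expovariate expects the rate, i.e. 1 / mean, not the mean itself.
    return [rng.expovariate(1.0 / mean_years) for _ in range(n)]


def mle_mean(times):
    """Maximum likelihood estimate of the exponential mean: the sample average."""
    return sum(times) / len(times)


# Hypothetical inputs: a 10-year mean and 20,000 simulated polyps.
rng = random.Random(0)
times = sample_progression_times(mean_years=10.0, n=20000, rng=rng)
estimate = mle_mean(times)
```

Repeating this over independent trials gives a set of estimates from which an interval on the mean time to progress could be formed.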


Figure 2.9. Comparing simulated versus actual CRC cases per 1000 population for annual group of the Minnesota study

As seen in the Figures, there is a good agreement between the actual and simulated values, with only slight deviations in the annual case. While the actual values for the biennial group fall within the simulated 95% confidence interval in most years, the actual values for the annual group are lower during Phase I of the study and higher during Phase II of the study. These deviations can be explained as follows.

1. Lack of data on the number of screenings: It is noted by [37] that not everyone participated in the scheduled number of screens, which was 11 for the annual group and 6 for the biennial group [38] over the entire duration of the study. For each person, let X = number of screens obtained during the study duration. Using the information provided for each group in the study, data could be extracted for the following features: for the annual group, the percentage of people with X ≥ 1, X ≥ 6, X ≥ 9, and X = 11; and for the biennial group, the percentage of people with X ≥ 1, X ≥ 3, X ≥ 5, and X = 6. Note that, when we compare the information available for X between the two groups, the biennial group has more information than the annual group. Under the biennial group, the extracted information


Figure 2.10. Comparing simulated versus actual CRC cases per 1000 population for biennial group of the Minnesota study

was used to obtain the probability that X = 1 or 2, i.e., $P\{X \le 2\}$, which was split uniformly between $P\{X = 1\}$ and $P\{X = 2\}$; similarly, $P\{X \le 4\}$ was split between $P\{X = 3\}$ and $P\{X = 4\}$. However, under the annual group, $P\{X \le 5\}$, $P\{X \le 8\}$, and $P\{X \le 10\}$ had to be split between 5, 3, and 2 values of X, respectively. This lack of information for the annual group can be considered to partly account for the difference between simulated and actual values. Also, under both groups, if an individual had participated in n screens, it was assumed that these were the first n of the scheduled screens, while in reality the screenings could have been spread over the period of study. We can expect that this assumption will cause a greater and more noticeable difference between simulated and actual values for the annual group than for the biennial group because, e.g., X = 5 under biennial screening is spread over only 6 possible screening schedules, while X = 5 under annual screening is spread over 11 possible screening schedules. From the assumption stated above and the lack of data for the annual screen, we can expect that there will be more than the actual number of screens during the initial period of study, and hence more CRCs


get diagnosed in the simulation. Further, we can also expect that, along with causing fewer than the actual number of screens during the latter period of study, increased screening in the beginning reduces the number of polyps that progress to CRC, thereby further reducing CRC incidence during the latter period. This is evident in Figure 2.9.

2. Lack of information on screening outside of the study: As mentioned in the study, participants could have obtained screening from a source outside the study during and in between the two phases, and yearly updates were obtained on any diagnosed cases of CRC. Since information on outside screening was not available, it was assumed that a person undergoes screening outside of the study in symptomatic cases only. The time to symptomatic was extracted from the MISCAN-colon parameters presented in [10]. Given this assumption, together with the fact that advanced stages of CRC tend to be more symptomatic than earlier stages, we can expect that, when comparing simulated minus actual CRC cases per stage at diagnosis, the value will be higher for the distant stage than for regional, and higher for regional than for local. This trend is evident in Tables 2.16 and 2.17, for annual and biennial groups, respectively, which present the CRC cases per 1000 population according to stage at diagnosis over the 13 year period. Note that, in the clinical trial, staging of CRC was done by using Dukes classification. We considered Dukes stages A and B as equivalent to local, and Dukes stages C and D as equivalent to regional and distant, respectively.

Based on our assumptions (1) that an individual undergoes the first n of the scheduled screens, and (2) that outside screening was done only in symptomatic cases, we can hypothesize that the percentages of diagnosed CRCs in stages regional and distant are higher during the latter period of study than in the initial period.
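The two data-handling assumptions in item 1 (splitting a grouped probability evenly over the screen counts it covers, and assigning a participant the first n scheduled screens) can be illustrated with a small sketch. The grouped probabilities and the schedule below are hypothetical placeholders, since the study's actual percentages are not reproduced in the text; only the group boundaries for the annual arm follow the description above.

```python
def split_uniformly(grouped_mass):
    """Spread each grouped probability evenly over the X values it covers.

    grouped_mass maps a tuple of screen counts, e.g. (9, 10), to the
    probability that X falls anywhere in that group.
    """
    pmf = {}
    for values, mass in grouped_mass.items():
        for x in values:
            pmf[x] = mass / len(values)
    return pmf


def attended_screens(n_screens, schedule):
    """Assumption from the text: a person attends the first n scheduled screens."""
    return schedule[:n_screens]


# Annual arm: groups of 5, 3 and 2 values of X, plus X = 0 and X = 11.
# The probabilities are made up for illustration.
annual_grouped = {
    (0,): 0.10,
    (1, 2, 3, 4, 5): 0.20,
    (6, 7, 8): 0.15,
    (9, 10): 0.10,
    (11,): 0.45,
}
annual_pmf = split_uniformly(annual_grouped)

# Hypothetical schedule of 11 annual screens, indexed by screen number.
annual_schedule = list(range(1, 12))
first_five = attended_screens(5, annual_schedule)
```

Under this scheme a participant with X = 5 is always screened at the first five opportunities, which is the mechanism the text gives for the simulation over-screening early and under-screening late.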


Table 2.16. CRC counts per 1000 population and stage at diagnosis - For annually screened group of Minnesota study

              Simulated 95% CI          Actual Counts
CRC Stage     Lower CI    Upper CI      Dukes CRC Staging    Actual Values
local         8.83        9.76          A+B                  13.3
regional      4.64        5.36          C                    5.6
distant       2.87        3.56          D                    2.3
un-staged     NA          NA            un-staged            2.1
Total CRC     16.34       18.68         Total CRC            23.3

Table 2.17. CRC counts per 1000 population and stage at diagnosis - For biennially screened group of Minnesota study

              Simulated 95% CI          Actual Counts
CRC Stage     Lower CI    Upper CI      Dukes CRC Staging    Actual Values
local         9.51        10.76         A+B                  11.6
regional      6.77        7.73          C                    6.1
distant       4.66        5.49          D                    3
un-staged     NA          NA            un-staged            2.1
Total CRC     20.94       23.98         Total CRC            22.8

The reasoning leading to the hypothesis can be explained as follows. Assumption 1 generates more than the actual number of screenings during the initial few years, thus causing fewer cases to reach a symptomatic stage. Consequently, the latter duration has fewer than the actual number of screens, causing more cases to reach a symptomatic stage. Since advanced stages of CRC are more symptomatic than earlier stages, we can thus hypothesize that the consequence of assumption 1 combined with assumption 2 causes diagnosis of a greater percentage of regional and distant cases during the latter period of study compared to the initial period. This hypothesis can be verified by the numbers in Tables 2.18 and 2.19. As speculated, the percentage of CRCs in regional and distant stages is higher at the end of year 13 than at the end of year 5.

2.5 Concluding Remarks

Precise values of polyp incidence and progression rates are crucial for developing population-wide CRC intervention strategies. Polyp incidence rates for population groups characterized by age, race, and family history of CRC were estimated by using


Table 2.18. Stage at diagnosis as percentage of total CRC counts - For annual group of Minnesota study

              Simulated 95% CI                                  Actual Values
              End of 5 Years          End of 13 Years           End of 13 Years
CRC Stage     Lower CI    Upper CI    Lower CI    Upper CI      Dukes CRC Staging    Actual Values
local         58.57       64.63       51.20       55.24         A+B                  57.08
regional      20.99       26.60       27.07       29.86         C                    24.03
distant       12.43       16.78       16.62       20.00         D                    9.87
un-staged     NA          NA          NA          NA            un-staged            9.01

Table 2.19. Stage at diagnosis as percentage of total CRC counts - For biennial group of Minnesota study

              Simulated 95% CI                                  Actual Values
              End of 5 Years          End of 13 Years           End of 13 Years
CRC Stage     Lower CI    Upper CI    Lower CI    Upper CI      Dukes CRC Staging    Actual Values
local         48.34       53.56       43.28       46.70         A+B                  50.88
regional      28.93       34.35       30.48       34.29         C                    26.75
distant       15.13       19.70       20.88       24.36         D                    13.16
un-staged     NA          NA          NA          NA            un-staged            9.21

data from the literature. The data sources included clinical CRC screening trials, population databases, and evidence-based reports from state health departments and national institutes. The natural progression timeline of polyps, on the other hand, could not be directly estimated using observed data, since it is infeasible to allow a diagnosed polyp to progress naturally without treatment. Hence, we developed a probability model to estimate population-specific rates of polyp progression. The probability model was constructed based on known concepts of the natural progression of polyps. Thereafter, using the model, data from the above mentioned sources were synthesized to estimate progression rates. These rates are characterized by race and family history of CRC and correspond to both progressive and non-progressive polyps. The estimated incidence and progression rates presented in Tables 2.4 through 2.7 were used to simulate the natural history of colorectal polyps for the population in the State of Indiana and a subset of the population in the State of Minnesota. The simulation results were used to validate the probability model. The simulation model also yielded (1) the expected time for progressive polyps to reach in-situ CRC from


the polyp ≤ 5 mm stage, and (2) the proportion of polyps reaching in-situ CRC (i.e., progressive polyps). Tables 2.14 and 2.15 present results for the progression time and proportion of the progressive polyps for populations with and without family history of CRC.

2.5.1 Discussion of Results

The polyp progression related values available in the literature are mainly experience based approximations and are not population-specific. Though the literature contains several data sources related to the incidence of either precancerous polyp or carcinoma, these data had not been synthesized to mathematically estimate population-specific polyp progression times as in Tables 2.6 and 2.7. Mathematical estimation will help identify and quantify any variations across populations, which are critical for developing early intervention strategies.

The probability model was developed to estimate rates considering both progressive and non-progressive polyps, and was subsequently used in the simulation model to obtain statistics of progressive polyps. Such an approach is significant since, while it is known that a family history of CRC increases the risk of cancer, quantification of the increased risk based on the proportion of progressive polyps, as in Table 2.15, has not been presented in the literature. Also, the risk based on progression time from polyp ≤ 5 mm to in-situ CRC had not been mathematically quantified, as in Table 2.14. Consideration of both progressive and non-progressive polyps also supports development of more comprehensive intervention strategies comprising resource needs and allocation.


2.5.2 Accuracy and Estimation Errors

Though our model estimates polyp progression rates specific to race and family history of CRC, for better accuracy of the estimates, it is essential that the model be expanded to consider other dependent factors. Examples of dependent factors include number and histology of polyps, classification of first and second degree relatives with CRC, and personal history of other medical conditions. However, inclusion of these factors in the model would require significant additional data, which is currently unavailable. Also, estimating progression rates between each of the pre-in-situ stages would help to simulate a more comprehensive natural progression of polyps. As in any estimated value, the progression rates presented in this manuscript could contain some error. While synthesizing data from various sources is beneficial for estimating the rates, the variation in the data acquisition processes across these sources could induce some estimation error. For example, the value of $P\{F > 0 \mid p_5^{t_1}\}$ was estimated based on diagnosed data at the Rochester Methodist hospital [25]. However, the value of $P\{F > 0 \mid \tilde{C}^{t_2}\}$ was estimated based on expert observation reported by the American Cancer Society. Since these values were based on either a large amount of data or long-term observations, we expect the error to be relatively small. Due to the unavailability of required data, some of the values were based on assumptions, hence creating room for estimation error. For example, based on expert opinion in the literature, the time to progress was assumed to follow an exponential distribution. Though we cannot empirically ascertain this assumption, based on the validation results, we believe that the exponential distribution is a good alternative. It may be noted that this research presents a model framework that has the potential to estimate population-specific progression rates, the accuracy of which can be improved


as more data becomes available. Such a model is significant for obtaining population-specific intervention strategies.

2.6 Future Research

Obtaining effective cancer intervention strategies encompasses not only development of screening strategies, but also analyzing factors pertaining to the availability of resources, such as the patient's access to physician and hospital, and effective dissemination of evidence based information to the population. For example, it would be useful to assess the population's compliance to screening guidelines based on features like the patient's knowledge of cancer screening tests and cancer risk factors. This knowledge can be related to the patient's access to information through interaction with their physician and/or through other sources. The model can then be used for a cost-effectiveness analysis of programs to increase risk awareness and its impact on reduction in cancer cases. It would also be interesting to model the impact of insurance policies under different system settings. Therefore, in addition to simulating the population entity, we need to include entities like physicians and insurance policies, and their interactions. Note that the current simulation model has been constructed as an agent-based model for the convenience of developing such a system-based simulation, which is part of our future research. Such a systems approach will allow for a more realistic analysis of feasible intervention strategies.

In summary, the probability model in the current state can be considered a base model that presents potential for use in developing population-specific intervention strategies. The work presented here can be used to support the need for collection of specific data required for analyzing and identifying more population-specific factors of interest. While the model developed in this manuscript has been specifically applied to colorectal cancer, most diseases follow a similar pattern, i.e., incidence


and progression. Following a similar procedure, but with disease specific modeling details, the current framework could be utilized for estimating progression and developing intervention strategies for other cancers as well. In addition, by inclusion of a transmission model in the current framework, cost-effectiveness analysis of prevention programs for infectious diseases like HIV/AIDS could be developed.


CHAPTER 3
A DENOISING METHODOLOGY FOR MICROARRAY

3.1 Introduction

Discovery of the human genome generated significant anticipation for better understanding of the roles played by the genes on cell behavior and the resulting impact on human health [41, 42]. Most diseases result from cell dysfunction, which can be traced to alteration in the structure of one or more genes (biomarkers). Alterations could be in the form of an abnormal increase or decrease in the expression (activity) levels of genes and their patterns. Identifying different patterns in biomarker genes could hold the key to disease diagnosis and individualized treatment planning, which is of vital interest as identified by the National Academy of Engineering under one of the Grand Challenges - Engineering Better Medicine. Hence, our ability to first accurately measure the expression levels of the genes is crucial towards identifying gene biomarkers. The microarray technology has revolutionized the field of genomics by offering the capability to measure expression levels of tens of thousands of genes simultaneously. However, during the process of gene expression estimation, noise from various sources gets added to the expression value. The noise generally originates during the phases of sample preparation, hybridization, and scanning [43]. The sample preparation noise initiates from the process of RNA amplification [44], and the hybridization noise refers to the randomness in the process of RNA binding to


the probes. The sources of scanning noise include leak of external light, variations in laser intensity, and presence of dirt [45, 43].

A microarray is a tiny chip (1.28 cm × 1.28 cm) which is divided into approximately a million squares, which we will refer to as probe squares. A number of such probe squares across the array are randomly allocated to each gene [44]. The complexity of the arrangement of microarray probes and the types of inherent noise render a significant challenge for denoising. Some of the existing methods for denoising microarray data can be found in [46, 47, 48, 49, 50, 51]. In [46] the authors develop statistical models for analyzing hybridization and cross-hybridization, and use measures of cross-hybridization to improve the quality of gene expression estimates. A probability model characterizing the nature of molecular-binding (hybridization) in affinity based biosensors is presented in [47]. The methodology developed in [48] employs a multiresolution approach, in which a 2-D stationary wavelet transform is applied across a microarray image. Various other methodologies [49, 50, 51] focus on denoising of microarrays involved in identifying differentially expressed genes. These methods generally require multiple arrays or two-color microarrays. Noise boundary models are developed in [49] using two replicate chips of normal tissues, and the resulting threshold boundaries are applied on the fold change obtained between cancer tissue and normal tissue. Two-color microarrays are denoised in [51] by considering the control and experimental sets as two component vector arrays and using multi-channel image processing techniques.

The disadvantages in the use of control sample based approaches of denoising are as follows.

1. Use of such methods, which require processing of up to two additional microarray chips, adds significant cost (over a thousand dollars) in disease diagnosis and treatment planning applications.


2. During microarray chip processing, different amounts of hybridization and scanning noise get added each time the process is performed, even under a controlled environment. Therefore, when two replicate chips of control samples are processed, the difference in their gene expressions is a result of noise that is specific to the images of the two chips. Hence, the difference in gene expressions of the control images cannot be used to derive a noise threshold for denoising other microarray images, since they are likely to contain different sets of noise.

3. In methods that use the difference in final gene expression values between control and case samples, it is difficult to differentiate noise from actual signal for the case of low expressed genes. That is, while a significant difference in high expressed genes can be easily identified, a significant difference in low expressed genes could be falsely classified as noise.

We present a novel and comprehensive methodology for removing hybridization and scanning noises from Affymetrix microarray images before the image data is processed by Affymetrix software to obtain final gene expression values. The method uses data from within the image that needs to be denoised and does not require control samples. Since noise arises from a variety of sources, the methodology uses a multiresolution analysis approach on the image to effectively isolate the noise at different frequencies. The image is decomposed using a dual tree complex wavelet transform, which is shown to have better properties with regard to shift invariance, directional selectivity, and perfect reconstruction, compared to transforms using real wavelet functions [52]. Natural images are generally known to contain a small number of edges, and hence, when such an image is decomposed at different frequencies, it results in a small number of large coefficients [53, 54]. This facilitates the process of separation of noise from significant data using thresholding [53]. It was noticed that a


similar direct use of the existing multiresolution denoising technique was ineffective when applied to microarray images. We identified two major features of microarray images that were the cause of the ineffectiveness: (1) the presence of numerous edges in microarray images results in a vast number of large coefficients during decomposition, hence hindering the noise separation, and (2) the non-Gaussian (Poisson) characteristic of the hybridization noise added to the denoising inefficiency, since most thresholding methods require the error to be Gaussian. A more detailed discussion of the topics of presence of numerous edges and Poisson noise in microarray images is presented in Section 3.2.2. To alleviate the above difficulties, we developed two forms of data transformations: (1) extraction of the probe squares that are assigned to a gene on a microarray, and construction of a separate dyadic subimage for each gene, and (2) for each subimage, Gaussian transformation of the Poisson noise. For each modified subimage data, we apply a dual tree complex wavelet transform followed by bivariate shrinkage thresholding. The thresholding technique underlying the bivariate shrinkage method considers the interdependencies of the detail coefficients in adjacent levels of decomposition, producing better performance [55]. Thereafter, the denoised version of the microarray image is created using the probe squares from the denoised subimages. Final expression values of the genes are obtained from the probes using Affymetrix software.

Since it is not possible to know the true values (ground truth) of the expression levels of the genes for a tissue sample, it is difficult to assess the performance of denoising on a microarray. We address this problem by constructing a sample data set mimicking a subimage for a gene, and use it to benchmark our methodology. We then implement our methodology on Affymetrix GeneChip human genome HG-U133 Plus 2.0 array [56] data sets, obtained by processing a sample from the HCT-116 cell line at the Microarray Core Facility at Moffitt Cancer Center and Research Institute. Each


HG-U133 Plus 2.0 array contains about 1.3 million probes and is used to measure expressions of the entire human genome (about 38,500 established genes). A chip is processed on the cell line and multiple scans of the chip are obtained. Using these multiple data sets we conduct statistical comparisons to establish the denoising performance.

In summary, the contributions of the methodology include (1) identification of distinguishing features between microarray images and natural images, i.e., features that were the cause of ineffective denoising, (2) obtaining a strategy for reconfiguring microarray images to resemble natural images, and (3) developing a strategy for estimating noise parameters from within the microarray image being denoised, which obviates the high cost of using images from multiple chips and hence eliminates the influence of different noise unique to the respective images. Also, estimating noise by the use of multiple instances (probe squares) of the same gene, from within the raw image, will avoid removal of valid data.

3.2 A Novel Denoising Methodology for Microarrays

Both hybridization and scanning noise in microarrays are inherently nonuniform across a microarray, and hence are localized in space. Also, due to the variety of their sources, they appear at different frequencies. These frequency and space localizations make microarray noise an ideal candidate for wavelet based multiresolution analysis. Prior to the presentation of our methodology, we provide, for unfamiliar readers, a brief outline of the 2-D wavelet decomposition technique and the properties of a dual tree complex wavelet transform.


3.2.1 Overview of Wavelet Based Multiresolution Analysis

The wavelet multiresolution approach to data representation has been a major breakthrough in the field of signal processing. Its applications range from data compression and data denoising to real-time process monitoring [57, 58]. Wavelets consist of basis functions, where a basis is made up of a scaling function $\phi$ and a mother wavelet $\psi$. The mother wavelet is a short wave and therefore has its energy localized in time, unlike sinusoids in Fourier transforms. A one-dimensional signal $g(t)$ can be represented by translations of a scaling function and translations and dilations of the mother wavelet as follows.

$$g(t) = \sum_{k \in Z} c_{j_0,k}\, \phi_{j_0,k}(t) + \sum_{j \ge j_0} \sum_{k \in Z} d_{j,k}\, \psi_{j,k}(t)$$

where $\phi_{j_0,k}(t) = 2^{j_0/2}\, \phi(2^{j_0} t - k)$, $\psi_{j,k}(t) = 2^{j/2}\, \psi(2^{j} t - k)$, and $j = j_0, j_0 + 1, \ldots$; $k \in Z$, where $Z$ is the set of integers. Translation is a shift in the location of the function along the axis and is represented by a change in the value of $k$. Dilations (or changes in frequency), represented by $j$, are attained by a change in the width of the wavelet function. Wavelet transforms use scaling (low pass filter) and wavelet (high pass filter) functions to decompose a signal at different resolutions (or scales) and obtain their scaling coefficients, $c_{j_0,k}$ (approximations), and wavelet coefficients, $d_{j,k}$ (details), respectively. Two-dimensional signals (images) are decomposed by passing the signal through low and high pass filters on the rows of the signal, the output of which is again passed through low and high pass filters on the columns. This leads to the creation of horizontal, vertical, and diagonal detail coefficients, indexed by $i$, as shown below.

$$g(x,y) = \sum_{k,l \in Z} c_{j_0,k,l}\, \phi_{j_0,k,l}(x,y) + \sum_{i} \sum_{j \ge j_0} \sum_{k,l \in Z} d^{\,i}_{j,k,l}\, \psi^{\,i}_{j,k,l}(x,y)$$


where $k \in Z$ and $l \in Z$ are translation indices along the $x$ and $y$ axes, respectively. The scaling function $\phi_{j_0,k,l}(x,y)$ is obtained by the tensor product of the scaling functions applied on the rows and columns respectively, which is written as $\phi(x) \otimes \phi(y)$. The detail functions are obtained as follows.

$$\psi^{H}_{j,k,l}(x,y) = \phi(x) \otimes \psi(y) \quad \text{(Horizontal Detail)}$$
$$\psi^{V}_{j,k,l}(x,y) = \psi(x) \otimes \phi(y) \quad \text{(Vertical Detail)}$$
$$\psi^{D}_{j,k,l}(x,y) = \psi(x) \otimes \psi(y) \quad \text{(Diagonal Detail)}$$

Since wavelets are localized in space and frequency, the wavelet coefficients obtained from decomposition contain a few large values and a large number of small values [53, 54]. This is an important property of wavelet decomposition and holds the key to the application of wavelets in signal denoising. Signals are decomposed using wavelets, and denoising is achieved by employing thresholding methods for identification of significant (large) coefficients, and for removal of noisy parts. The coefficients are then reconstructed to obtain a noise free signal.

Discrete wavelet transforms use wavelets that are real. After decomposition of an image, the approximation and detail coefficients obtained are each equal to the original length of the signal. Hence, the signal is downsampled to remove odd-numbered coefficients. This alters the shift invariance property. That is, if there is a shift in the input signal, wavelet coefficients at different scales undergo a major change in energy distribution. Use of real wavelets causes another problem related to directional selectivity. It is known that detail coefficients contain energy distributed in both positive and negative gradients. However, it is not possible to differentiate between the two orientations, as all gradients are obtained as output from a single filter [52]. Complex wavelets can be used to overcome these drawbacks. In complex wavelets, the phases vary approximately linearly with input shift and can be designed such that the magnitudes vary very slowly, thus making them approximately shift invariant [52]. Output from a complex wavelet contains complex coefficients, since the complex filters either emphasize positive frequencies and reject negative frequencies, or vice-versa, thus achieving directional selectivity. However, while using complex wavelets, if we decompose an image to more than one level, we cannot achieve perfect reconstruction [52]. Use of the dual tree complex wavelet transform (DTCWT) alleviates the above problem and at the same time achieves shift invariance and directional selectivity. In what follows, brief descriptions of DTCWT and a bivariate shrinkage thresholding technique that is used in our denoising methodology are presented.

3.2.1.1 Dual-Tree Complex Wavelet Transform

DTCWT uses the concept that shift invariance can be achieved for real DWT by eliminating down-sampling [52]. However, instead of eliminating down-sampling, DTCWT achieves shift invariance by using two parallel fully decimated trees, where the delays of the filters in the second tree are one sample offset from the first at level one and half a sample different at further levels [52, 59]. This construction also offers the perfect reconstruction property. The dual tree transform is interpreted as a complex transform by considering the outputs from the two trees as real and imaginary parts of complex wavelet coefficients. This also offers DTCWT the directional selectivity property [52].

3.2.1.2 Bivariate Shrinkage Thresholding

For any thresholding technique, the amount of noise removed is dependent on its ability to identify the coefficients that relate to noise. A feature that distinguishes


the Bivariate Shrinkage (BiShrink) method [55] from other potential thresholding techniques that are available in the literature is the consideration of the interscale dependency property of detail coefficients. It is well known that large/small values of the wavelet coefficients usually propagate through the scales [55, 53]. The BiShrink method generates models to identify this interdependency between adjacent scales. Using maximum likelihood estimators for the variance of noise and that of data at two adjacent scales, and assuming the noise to be Gaussian, the BiShrink method estimates noise-free values of the detail coefficients using maximum a posteriori (MAP) estimators ([55], Model 3). A study performed on DTCWT decomposed image coefficients [55] demonstrates that bivariate shrinkage exhibits better denoising performance compared to other thresholding methods.

3.2.2 Microarray Denoising Methodology

In multiresolution based denoising, the image is decomposed using the selected wavelet basis, followed by application of the thresholding strategy. However, microarray image denoising cannot follow a similar procedure due to the difficulties related to the large number of edges and non-Gaussian noise, and hence, two different strategies were developed. Before giving a detailed description of the two features of microarray images and the two developed strategies, we give a brief overview of a microarray chip, cRNA extraction, and hybridization of cRNAs to DNA strands.

3.2.2.1 Microarray Chip

A DNA microarray, for example, a GeneChip human genome HG-U133 Plus 2.0 array, is divided into over 1,300,000 minute squares called probes [56]. Each probe square is embedded with millions of copies of a short DNA strand that correspond to


a single gene. The 1.3 million probe squares are allocated to over 54,676 genes, with each gene represented in eleven randomly selected probe squares across the chip.

3.2.2.2 cRNA Extraction

RNAs are extracted from the cell sample and synthesized to obtain cDNA (complementary DNA) and further cRNA. Hence, all genes that are expressed in the cell will have their corresponding cRNAs in the extract, with the number of cRNAs being proportional to gene expression. The number of cRNAs in the extract is amplified to ensure proper hybridization. Molecules of a chemical called biotin are attached to the cRNA strands.

3.2.2.3 Hybridization of cRNAs to DNA Strands

The cRNA extract is poured over the microarray. If the cRNA finds a matching DNA strand (hence representing the same gene), the cRNA will bind (i.e., hybridize) to the strand. Therefore, the extent of hybridization is proportional to the expression of the gene. A fluorescent stain is then run over the array, which sticks to the biotin that acts as a molecular glue. The microarray is scanned to extract the intensity of glow, which is proportional to the extent of hybridization. The millions of strands of DNA in each probe square are represented by only a small matrix of pixel values, e.g., 7x7 in HG-U133 Plus 2.0. Each probe square is thus represented by a matrix of pixels whose intensities represent the extent of hybridization in the region, and hence the corresponding gene expression value. Thus, a microarray image is comprised of matrices of intensities of probe squares across the entire chip.

Figure 3.1. Randomly varying intensities of adjacent probe squares depict the presence of a large number of edges in a microarray image

3.2.2.4 Strategy 1: A Separate Subimage for each Gene

As mentioned above, eleven randomly selected probe squares are allocated to each of the 54,676 genes. Thus, it is highly likely that adjacent probe squares are allocated to different genes with varying levels of expression. The pixel intensities of a probe square are proportional to the extent of cRNA binding, and hence adjacent probe squares have varying intensities. Also note that these intensities vary among pixels even within a probe square [44]. Moreover, a mismatch square is placed below every probe square assigned to a gene (perfect match) [60]. These mismatch squares are designed to measure the extent of undesired bindings, and thus have intensities very different from those of the perfect match probes. As is evident from the above, microarrays, by design, have inherent variations between adjacent probe squares in addition to the variations within. Thus, microarray images have a very large number of edges, contrary to natural images. Figure 3.1 shows a schematic representation of the wide variations in intensity among adjacent probe squares in a microarray. Such images, with high intensity variations, when decomposed, generate a large number of high-value detail coefficients. Consequently, during thresholding, it becomes difficult to identify the large coefficients that correspond to noise. This results in poor denoising.

Our strategy involves separating the microarray image into multiple subimages, one for each gene, where each subimage is created as a collection of the probe squares assigned to a gene. The idea behind this strategy is that intensity variations across probe squares of the same gene are much smaller than those across probe squares of different genes. Hence, a subimage created for each gene will have far fewer edges than the original microarray image, thus leading to better denoising.

The subimages are obtained as follows. The library file (CDF file) of the microarray chip contains a list of all genes along with (i) the x and y coordinates of all corresponding probe squares, and (ii) the number of rows and columns of pixels for each probe square. The DAT file contains the pixel values of the microarray chip. Using the CDF file, the location of the first probe square of gene 1 is identified and the corresponding matrix of pixel values is extracted from the DAT file. Similarly, the matrices of pixel values of the remaining probe squares of gene 1 are extracted, and the matrices are placed sequentially row-wise to obtain the subimage in Figure 3.2. Some of the matrices are repeated to ensure that the subimages are dyadic, which is a requirement for wavelet based multiresolution analysis. Similar subimages are obtained for all genes. As depicted in Figure 3.2, the subimages are individually considered for denoising, after which the original probe square intensities on the microarray image are replaced with those extracted from the denoised subimages. Further analysis of the image, which involves determining the final gene values [60], is carried out using existing procedures of Affymetrix, namely GCOS and MAS. As shown in Figure 3.2, the subimage data, prior to decomposition, undergo another transformation, which is described next.
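The construction of a dyadic subimage can be illustrated with a short sketch. The code below is not the C implementation used in this research; it is a minimal Python/NumPy illustration, and the function names (`next_power_of_two`, `build_dyadic_subimage`) and the square-grid layout argument `grid` are assumptions introduced only for this example. It tiles the probe squares of a gene row-wise and then repeats the leading rows and columns until the side length is a power of two, which mirrors the example in Step 3 of Section 3.2.3 (nine 7 × 7 probe squares form a 21 × 21 array that is extended to 32 × 32).

```python
import numpy as np

def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def build_dyadic_subimage(probe_squares, grid):
    """Tile probe squares row-wise into a grid x grid block, then repeat the
    leading rows and columns so that the result has a dyadic side length.

    probe_squares: list of equally sized 2-D arrays (hypothetical input that
                   would come from the DAT file via the CDF coordinates).
    grid:          number of probe squares per row and per column (assumed
                   square layout, e.g. 3 for nine probe squares).
    """
    rows = [np.hstack(probe_squares[r * grid:(r + 1) * grid]) for r in range(grid)]
    block = np.vstack(rows)                      # e.g. 21 x 21 for 3 x 3 squares of 7 x 7
    size = next_power_of_two(max(block.shape))   # e.g. 32
    pad_rows = size - block.shape[0]
    pad_cols = size - block.shape[1]
    block = np.vstack([block, block[:pad_rows, :]])   # repeat first rows at the bottom
    block = np.hstack([block, block[:, :pad_cols]])   # repeat first columns at the right
    return block

# Nine constant 7 x 7 probe squares, used here only as placeholder data.
squares = [np.full((7, 7), float(i)) for i in range(9)]
subimage = build_dyadic_subimage(squares, grid=3)
print(subimage.shape)
```

Repeating existing rows and columns, rather than padding with zeros, avoids introducing an artificial edge at the border of the subimage, which is consistent with the motivation for this strategy.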

Figure 3.2. Construction of a dyadic subimage, transformation of Poisson to normal, denoising, and replacing denoised data to reconstruct the microarray image

3.2.2.5 Strategy 2: Noise Characterization and Transformation

The noise in gene expression values is a combination of sample preparation noise, hybridization noise, and scanning noise. Sample preparation noise arises from (i) the inherent inability of the experimental procedure to achieve a target cRNA amplification rate during cRNA extraction, and (ii) variation in the amplification rate among the cRNAs (a Poisson noise) [43]. However, sample preparation noise is small in proportion to the other two noise types. Hybridization noise is attributed to the probabilistic nature of hybridization and is proportional to the gene expression value (pixel intensity). Such an intensity dependent noise is called shot noise and is known to have a Poisson distribution. Scanning noise, on the other hand, is independent of expression values and may be induced by the presence of dirt, reflection of light, and other random causes. Readers are referred to [43] for a detailed characterization of the noise types. In what follows, we explain our strategy for estimating the noise parameters using pixel intensities from a single chip that needs to be denoised.

Isolation of sample preparation noise is not feasible using a single chip, since identifying the noise parameter requires the use of multiple replicates of the same sample. Moreover, since the magnitude of this error is small, we did not attempt to remove this noise.

The probabilistic variation in hybridization induces Poisson noise in gene expression values [43]. Hybridization noise arises from two types of binding errors: (i) a cRNA strand might not bind even when a matching DNA strand is present (missed binding), and (ii) a cRNA might bind to a non-matching DNA strand (false binding). Therefore, the number of binding errors per pixel is Poisson distributed. Note that Poisson distributed random variables take non-negative values, and therefore we consider the number of binding errors as equal to the absolute value of the difference between the number of false bindings and the number of missed bindings. Therefore, for each pixel, the Poisson noise has either been added to (when false binding is the dominating type of binding error) or subtracted from (when missed binding is the dominating type of binding error) the actual intensity. As is clear from the above, the variation among the pixel intensities within a probe square is also induced by variation in hybridization. Therefore, we estimate the hybridization noise (Poisson parameter $\lambda$) from the absolute value of intensity variations among the pixels of a probe square.

Also, since the denoising method using BiShrink thresholding requires that all error is $N(\mu, \sigma^2)$, with $\mu = 0$, we use a normal approximation of the Poisson distribution. This approximation can be justified as follows. Since each probe square is represented by a small number of pixels, each pixel reflects a collection of millions of DNA strands. If we divide these DNA strands into several groups and consider the number of binding errors for each group, then the sum of the binding errors over all groups of the pixel would be normally distributed based on the central limit theorem. Therefore, we can write that hybridization noise $\sim N(\mu, \sigma^2)$ with,

$\mu = \lambda$ and $\sigma^2 = \lambda$. The scanning noise, due to the nature of its sources, is assumed to be Gaussian with mean zero.

In what follows, we describe the estimation of the mean of the hybridization noise, and then apply a standard normal transformation since, as mentioned earlier, BiShrink thresholding requires a noise with mean zero. Since hybridization noise is proportional to intensity, noise estimation was carried out separately for each probe square within a subimage. Let $p_n$ represent the array of pixel intensities of the $n$th probe square of a subimage. Each pixel intensity value $p_n(x,y)$, where $x$ and $y$ are array coordinates, is modeled as follows:

$$p_n(x,y) = E_n + \eta_n^{H}(x,y) + \eta_n^{S}(x,y), \qquad (3.1)$$

where $E_n$ denotes the true expression value of the gene as measured by the $n$th probe square, $\eta_n^{H}(x,y)$ denotes the normal transformation of hybridization noise with $\mu = \lambda_n$ and $\sigma^2 = \lambda_n$, and $\eta_n^{S}(x,y)$ denotes the normally distributed scanning noise with parameters $(0, \sigma_n)$. Note that the hybridization and scanning noises are independent of each other: hybridization noise is based on incorrect hybridization, while scanning noise is caused by external features like the presence of dirt. We take the median of the pixel values of the $n$th probe square, denoted by $\hat{E}_n$, as an estimate of $E_n$. Define $d_n(x,y)$ as

$$d_n(x,y) = p_n(x,y) - \hat{E}_n. \qquad (3.2)$$

Then we can write that $d_n(x,y)$ contains two normally distributed noise components, the sum of which also has a normal distribution with parameters $\mu = \lambda_n$ and $\sigma = \sqrt{\lambda_n + \sigma_n}$. Note that, as explained earlier, the hybridization noise is a positive value which is either added to or subtracted from $E_n$. Positive values of $d_n(x,y)$ denote pixels where noise has been added, while negative values denote pixels where noise has

been subtracted. Therefore, to estimate the value of $\lambda$ using a maximum likelihood estimate principle, we use the absolute value of $d_n(x,y)$, and write that

$$\lambda_n = \frac{1}{N} \sum_{x} \sum_{y} |d_n(x,y)|, \qquad (3.3)$$

where $N$ is the size of the array. Note that $d_n(x,y)$ contains a combination of hybridization noise and scanning noise. While scanning noise is considered to have zero mean, the hybridization noise has a mean of $\lambda_n$. Therefore, in order to obtain a total noise mean of zero, we apply a standard normal transformation of $d_n(x,y)$, denoted by $d_n^{T}(x,y)$, as follows:

$$d_n^{T}(x,y) = \frac{|d_n(x,y)| - \lambda_n}{\sqrt{\lambda_n}}. \qquad (3.4)$$

Therefore, $d_n^{T}(x,y)$ now contains hybridization noise and scanning noise with a total mean of zero. Finally, we obtain the modified individual pixel values, denoted by $p_n^{M}(x,y)$, as follows. From $p_n(x,y)$, we remove $d_n(x,y)$, which contained a non-zero mean from hybridization error. By retaining the sign of $d_n(x,y)$, hence representing the direction in which hybridization noise was included, we replace $d_n(x,y)$ by $d_n^{T}(x,y)$, and write that

$$p_n^{M}(x,y) = p_n(x,y) - d_n(x,y) + \mathrm{sign}\big(d_n(x,y)\big)\, |d_n^{T}(x,y)|. \qquad (3.5)$$

Therefore, $p_n^{M}(x,y)$ now contains the true expression value along with the normally distributed hybridization and scanning noises, with a total mean noise of zero.
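The transformation in Equations 3.2 through 3.5 can be sketched for a single probe square as follows. This is a minimal Python/NumPy illustration under stated assumptions and not the code used in this research: the function name `transform_probe_square` is hypothetical, and the guard for a probe square with no intensity variation (where the estimated Poisson parameter is zero and Equation 3.4 is undefined) is an assumption added for the example.

```python
import numpy as np

def transform_probe_square(p):
    """Apply the Poisson-to-Gaussian transformation to one probe square.

    p: 2-D array of pixel intensities of the n-th probe square.
    Returns the modified pixel values and the estimated Poisson parameter.
    """
    p = np.asarray(p, dtype=float)
    e_hat = np.median(p)                 # median as the estimate of E_n
    d = p - e_hat                        # Eq. 3.2: deviation from the estimate
    lam = np.mean(np.abs(d))             # Eq. 3.3: mean absolute deviation
    if lam == 0:
        # Assumption for this sketch: a constant probe square is left unchanged.
        return p.copy(), lam
    d_t = (np.abs(d) - lam) / np.sqrt(lam)        # Eq. 3.4: standardized noise
    p_m = p - d + np.sign(d) * np.abs(d_t)        # Eq. 3.5: keep direction of noise
    return p_m, lam

# Small placeholder probe square (a real one would be, e.g., 7 x 7).
probe = np.array([[10.0, 12.0],
                  [14.0, 20.0]])
modified, lam = transform_probe_square(probe)
print(lam)
print(modified)
```

Because $p_n(x,y) - d_n(x,y)$ equals $\hat{E}_n$ by Equation 3.2, each modified pixel is the probe-square median plus a signed, rescaled deviation.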

3.2.3 Steps of the Denoising Procedure

1. Begin with the DAT file of the microarray image and the CDF file of the corresponding chip. The DAT file contains the pixel values, i.e., the raw image data. The CDF file is the library file that contains a list of genes, the coordinates of all of their corresponding perfect match and mismatch probe squares, and the number of rows and columns of each probe square.

2. Using the information on the coordinates and the number of rows and columns from the CDF file, extract the intensity values of all perfect match (PM) probe squares of a gene from the DAT file. In this research, some of the C codes available at [61], a public software repository, were used for the purpose.

3. Copy onto a text file the data arrays that represent the probe squares, to form the rows and columns of a subimage of maximum possible dyadic size. For example, a gene on the GeneChip human genome HG-U133 Plus 2.0 array is typically represented by 11 PM probe squares, where each probe square is, suppose, a 7 × 7 array of intensity values. Therefore, forming a subimage of 3 × 3 probe squares will give us an array of intensity values of size 21 × 21. In order to make this dyadic, i.e., 32 × 32, the first 11 rows and columns are repeated to form the last 11 rows and columns, respectively. Since the subimage thus created involves only the first 9 out of the 11 PM probe squares of the gene, the last 9 PM probe squares are used to create another dyadic subimage in a similar manner (note that 7 probes are included in both subimages). This allows us to denoise all the probe squares.

4. Since Poisson noise is proportional to the pixel data intensity, conduct on each probe square the Gaussian transformation of Poisson noise as given in Equation (3.5) to create two modified subimages.

5. Denoise each of the two modified subimages by applying a sequence of DT-CWT decomposition, Bivariate Shrinkage, and reconstruction, using the software available at [62].

6. Replace the data in the DAT file that corresponds to the PM probe squares of the gene by using the denoised intensity values from the subimages.

7. Repeat steps 2 through 6 for the mismatch (MM) probe squares of the gene.

8. Repeat steps 2 through 7 with PM and MM squares of all remaining genes.

9. Create the CEL file by taking the 75th percentile value of each probe square, and process the CEL file using Affymetrix software to arrive at the final expression values of the genes.

The denoising methodology was coded in C, and a C-MATLAB interface was used to access MATLAB programs from [62]. The processing of all programs was carried out using the high performance computing resources provided by the Research Computing Department at USF [63].

3.3 Numerical Validation using Simulated and Affymetrix Microarray Data

In this section we present the results of the tests used to measure the performance of our denoising methodology. Peak Signal-to-Noise Ratio (PSNR) is one of the commonly adopted measures of performance for techniques used in denoising of natural

images [64, 53]. PSNR measurement generally involves denoising an image that was created by adding a known quantity of Gaussian noise to a clean image. The value of PSNR, calculated in decibels, is inversely related to the noise that remains after denoising, and hence a higher value represents better denoising. The PSNR is calculated as follows:

$$\mathrm{PSNR} = 20 \log_{10}\left(\frac{256}{\varepsilon}\right), \quad \text{where} \quad \varepsilon = \sqrt{\frac{1}{N} \sum_{k=1}^{N} (s_k - d_k)^2},$$

$s_k$ denotes the $k$th pixel intensity of the clean image, $d_k$ is the corresponding value of the denoised image, and $N$ denotes the total number of pixels. Clearly, a clean image is essential in establishing the performance of a denoising methodology using PSNR. Due to the unavailability of a clean microarray image, we first tested our methodology on a simulated image as described below.

3.3.1 Testing Denoising Performance on a Simulated Image

Since our methodology separately denoises each subimage created for a gene, we fabricated a clean dyadic subimage of size 32 × 32 pixels. As described earlier in the example in Step 3 of Section 3.2.3, the simulated subimage consisted of 9 probe squares. We also simulated another subimage with 16 probe squares (obtaining 28 × 28 pixels and repeating four rows and columns to get 32 × 32). Similar denoising performance was noted in both subimages, and the one presented here is for the 16 probe square subimage. The pixel intensities within a probe square were kept constant. Mimicking the actual values of a subimage from a real microarray data set, different intensity values for the probe squares across the simulated subimage were chosen. Two separate sets of noise having Poisson and Gaussian distributions were added to the simulated subimage, which was subsequently denoised using our methodology. Poisson noise, proportional to the pixel intensity of the probe squares, and Gaussian

noise, with different values of the variance parameter for different trials, were added to the clean image using functions available within the MATLAB software.

The PSNR values for different error combinations are presented in Tables 3.1 and 3.2. PSNR in Table 3.1 was obtained by considering $k$ in $s_k$ and $d_k$ as an individual pixel index, while in Table 3.2 $k$ is a probe square index and $s_k$ and $d_k$ represent the 75th percentile of the pixel intensities of the $k$th probe square. The different image types that are considered in the tables (see column 1) are as follows: (i) noisy image, where the PSNR calculation is based on pixel values of the clean and noise-added (without denoising) images; (ii) denoised image without Poisson transformation, where PSNR is calculated using clean and denoised images and the denoising is carried out without including Gaussian transformation of the Poisson noise; and (iii) denoised image with Poisson transformation, where PSNR is calculated using clean and denoised images and denoising is done using the complete methodology, i.e., including Gaussian transformation of the Poisson noise.

Table 3.1. PSNR value estimated by taking error, $\varepsilon$, as difference between individual values. (Parameter of the Poisson noise added is proportional to intensity of corresponding probe square.)

The following observations are made from Tables 3.1 and 3.2. In both tables, the higher PSNR values for the denoised images (with/without using Poisson transformation) indicate a significant improvement compared to the PSNR values of the noisy images. According to the literature on

denoising studies [64, 53], the PSNR values obtained here can be considered high, indicating a desirable denoising performance. A comparison of the PSNR values obtained without and with Gaussian transformation of the Poisson noise (2nd and 3rd rows of the numbers, respectively) shows a significant difference, thus clearly establishing the importance of including the transformation strategy in our methodology. In what follows, we present results from application of our methodology on a set of Affymetrix microarray data.

Table 3.2. PSNR value estimated by taking error, $\varepsilon$, as difference between 75th percentile of probe square

3.3.2 Testing Denoising Performance on Affymetrix Microarray Data

Having established the performance of our methodology on a simulated data set, we extended our testing to data sets obtained from Affymetrix GeneChip human genome HG-U133 Plus 2.0 arrays, processed on the HCT-116 colorectal cancer cell line at the Microarray Core Facility of Moffitt Cancer Center and Research Institute. Due to the unavailability of clean images we could not compute PSNR to measure the extent of noise removal. Hence, we adopted a strategy of testing noise reduction through the coefficient of variation (CV) of the original and denoised data sets. The premise of the strategy is that multiple instances of a data set with random noise are likely to have a higher CV compared to that from corresponding data sets from which some of the noise has been removed. However, it may be noted that though a

reduction of CV is indicative of noise removal, it cannot be directly translated into the extent of noise removed.

An assessment of denoising performance should ideally be based on gene expression values obtained before and after denoising. However, to obtain the gene expression values, the data sets (pixel intensities) are processed using Affymetrix GCOS or MAS5 software. These software packages perform data transformations, like background correction, before converting pixel intensities into gene expression values. As a result, the gene expression values will reflect the impact of our denoising method as well as that of the transformations performed in GCOS or MAS5. Thus, it would be difficult to estimate the effect of our denoising methodology alone when gene expression values are used. Hence, instead, we used the pixel intensities. We conducted two different denoising performance measurement tests as described below.

3.3.2.1 Analyzing CV of Probe Squares Across Multiple Scans of a Microarray

In this test we collected three different scans of a single chip, resulting in data sets containing the same hybridization noise but different scanning noise. The data sets were denoised individually using our methodology, and subsequently CEL files were created by representing each probe square with a single value equal to the 75th percentile of its pixel intensities. For each probe square, the CV of its values across the three scans was computed from the CEL file data. The above procedure for CV calculation for each probe square was also repeated using the data sets prior to denoising (i.e., using the original data). In order to assess denoising performance, we divided the range of CV (0 to 1) into multiple subdivisions. We chose finer subdivisions near the lower range of CV values and wider divisions towards the higher end, since a majority of the probe squares had a relatively smaller CV. The number of probe squares under each

Figure 3.3. Distribution of probe squares showing impact of denoising

CV range was counted for both original and denoised data sets and plotted as shown in the top plot of Figure 3.3. The shift in the histogram of denoised probes (indicated by dotted bars) towards the lower CV range (leftwards) indicates the reduction in noise after denoising. The bottom plot of Figure 3.3 shows the difference between the denoised and original number of probe squares under different CV ranges. A positive rectangle for a CV range implies that the number of probe squares in the denoised image is greater than the number of probe squares in the original image. The figure shows four distinct zones of CV values: [0-0.07]; (0.07-0.12]; (0.12-0.16]; (0.16-1.0];

we henceforth refer to these as zones 1, 2, 3, and 4, respectively. The following are interpretations of the data observed in these zones.

1. Zone 1 shows a total increase of 42,289 probe squares in the denoised set. Clearly, these probe squares had higher CV values (zones 2 through 4) in the original set. This is indicative of noise reduction.

2. Zone 4, with the highest CV values, shows that the denoised set has far fewer probe squares (9,998 fewer) than the original set, which indicates that after denoising many probe squares in this zone have reduced their CV values and thus migrated to the left (zones 3 through 1).

3. Probe distributions in zones 2 and 3 are the resultant of the migrations described for zones 1 and 4.

4. Zones 1 and 4 indicate a CV reduction for 52,287 probes (42,289 + 9,998). However, the total number of probes that experienced a CV reduction is likely to be higher, since it is difficult to assess the exact number. In perspective of the total number of probe squares in a microarray (approximately one million), this number (52,287) might seem small. However, the bottom plot of Figure 3.3 shows that a vast majority of the probe squares have a CV lower than 0.24. Thus it can be concluded that the original data had a small amount of noise, some of which has been removed.

From the above, we can conclude that our methodology is capable of reducing noise introduced during the scanning process. Tests performed on two other data sets yielded similar results. However, as mentioned earlier, the extent of noise removal could not be ascertained.
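The multiple-scan comparison can be summarized in a short sketch. The Python/NumPy code below is only an illustration and not the procedure's actual implementation: the function names, the use of the sample standard deviation (`ddof=1`), and the tiny placeholder data are assumptions made for this example. It computes the CV of each probe square value across scans and counts probe squares in the four CV zones before and after denoising.

```python
import numpy as np

# Zone boundaries taken from the four CV zones discussed in the text.
ZONE_EDGES = np.array([0.0, 0.07, 0.12, 0.16, 1.0])

def cv_across_scans(values):
    """CV of each probe square across scans.

    values: array of shape (n_scans, n_probe_squares), one value per probe
            square per scan (e.g. the 75th percentile of its pixels).
    Assumes strictly positive mean intensities.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0, ddof=1)   # sample standard deviation (assumption)
    return std / mean

def zone_counts(cv):
    """Number of probe squares in each CV zone.

    np.histogram uses half-open bins [a, b) except for the last bin, so a CV
    exactly on a boundary falls in the higher zone, unlike the (a, b] zones
    in the text; this only matters for values exactly on an edge.
    """
    counts, _ = np.histogram(cv, bins=ZONE_EDGES)
    return counts

def zone_shift(original, denoised):
    """Denoised minus original probe-square counts per zone."""
    return zone_counts(cv_across_scans(denoised)) - zone_counts(cv_across_scans(original))

# Placeholder data: three scans of two probe squares.
original = np.array([[100.0, 100.0], [120.0, 100.0], [80.0, 100.0]])
denoised = np.array([[100.0, 100.0], [105.0, 100.0], [95.0, 100.0]])
print(zone_shift(original, denoised))
```

A positive entry for zone 1 together with a negative entry for zone 4 corresponds to the leftward migration of probe squares described above.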

Figure 3.4. Proportion of probe squares in various ranges of $R_s^{CV}$ for Scan 1 data

3.3.2.2 Analyzing CV within Probe Squares of a Microarray

In order to assess the performance of our methodology in removing noise introduced during hybridization, we compared, between original and denoised data sets, the variation across pixel values of each probe square in a data set obtained from a single scan. The CV across each probe square $s$ was calculated from the original and denoised DAT files, and these values were used to obtain the percentage reduction in CV ($R_s^{CV}$) due to denoising as

$$R_s^{CV} = \frac{CV_s^{orig} - CV_s^{denoi}}{CV_s^{orig}} \times 100.$$

Figure 3.4 presents a pie chart that shows the proportions of the probe squares in various ranges of $R_s^{CV}$ values for the Scan 1 data set. Data sets from scans 2 and 3 yielded very similar results. It is apparent from the figure that almost all probe squares have had a high reduction in CV. However, due to the nature of the wavelet based denoising approach, the reduction in CV values cannot be attributed completely to the reduction in hybridization error. A better testing strategy would be to hybridize multiple chips from a single sample, scan them to obtain the DAT files, and denoise them separately. Comparing the CV of each probe square using data from the CEL files of these multiple chips, in a manner similar to that presented above for the test

using multiple scans, will give us a better estimate of hybridization noise reduction. Due to the immediate unavailability of such a data set, we were unable to perform the stated testing strategy.

3.4 Conclusions

A novel multiresolution analysis based methodology, using a dual-tree complex wavelet transform and bivariate shrinkage thresholding, for removing hybridization and scanning noise from microarray data is presented. Two specific microarray data features that are not conducive to wavelet based denoising are identified, for which specific data transformation strategies are developed. The aforementioned features are the presence of an excessive number of edges in microarray images, and the non-Gaussian (Poisson) distribution of the hybridization noise. The strategies involve creation and denoising of separate subimages for each gene instead of the complete microarray image, and Gaussian transformation of the Poisson noise prior to denoising. A comprehensive approach to denoising microarray data, as presented here, is not available in the open literature. We believe that our approach to removing noise from microarrays will increase the reliability of gene expression values for use in disease diagnosis and treatment planning.

Testing the performance of any image denoising methodology requires a noisy image along with its clean base image. Since it is not possible to have a clean base image for a microarray, testing of our methodology was first carried out using data simulated to mimic microarray probes. When noise was introduced to the simulated data, our methodology was able to remove it to a significant extent, as evidenced by PSNR (peak signal-to-noise ratio) values. The testing was then continued with image data sets obtained from Affymetrix microarrays processed on colorectal cancer cell samples. Due to the unavailability of a clean base image, coefficient of variation (CV) based

comparison strategies were developed to obtain measures of denoising performance. A considerable extent of quality improvement was noticed for the tested Affymetrix data sets.

Though our methodology is capable of identifying hybridization and scanning noise with the use of a single chip, identification of sample preparation noise still remains an open challenge. The reduction of the portion of sample preparation noise which is due to randomness in the amplification rate is dependent on the techniques used to control or maintain a constant rate. This is considered beyond the scope of our current research. However, it is possible to extend our methodology to identify the portion of sample preparation noise which is due to the probabilistic nature of amplification. This could involve creating subimages with probe squares of the same gene from multiple chips. Since the distribution of this noise is also Poisson, like that of hybridization, probe squares from the same location on the multiple chips will have Poisson noise of magnitude equal to the sum of the two individual Poisson noise types. We believe, therefore, that the application of our current methodology will offer desirable performance. However, since the sample preparation noise is only a small portion of the total noise, the requirement of multiple chips may not be economically justified, except in cases where multiple chips are naturally involved in the experiment.

CHAPTER 4
ANALYTICAL PROCESSING OF CHROMATOGRAMS

4.1 Introduction

The proteome, the set of proteins translated by the genome, is the main component that drives the metabolic activities in a cell. Hence, the proteome is anticipated to hold the key to the pathway of diseases like cancer and therefore to play an important role in cancer diagnosis and treatment. However, application of this theory into practice requires identification and quantification of the proteins produced in the cell, which is a challenging task because of the complexity of the proteome. While an organism has a constant genome, its proteome varies from cell to cell. Proteins are produced or translated by genes that are active or expressed, and different genes are expressed in different parts of the body. The 35,000 genes of the human genome are estimated to translate more than ten times as many proteins, where the type and quantity of protein produced is based on the coded information contained in the gene. In extreme cases, an individual gene could have a coding capacity of about 1000 proteins [65]. Due to variations in the translation process, the same gene could produce different forms of the protein (protein isoforms) that differ in their function of metabolic activities [66]. Further, the structure of the protein could change after production (post-translational modifications (PTM)) [67], hence creating proteins with functions that are different from the original. This complex structure of the proteome creates challenges in identifying and quantifying proteins produced in a

cell, and thus several proteins and their functions are still unknown. Identifying protein biomarkers, i.e., the sets and quantity of proteins present in the cell only during a disease state, will help in diagnosis and further in drug discovery for the respective disease. Since early diagnosis of cancer leads to better chances of survival, identifying protein biomarkers related to early stages of cancer is essential.

Advancements in the area of protein profiling, a technique for detecting proteins in tissue, blood, or urine samples [68, 69], have led to the identification of a number of protein biomarkers. However, thus far, very few have been identified; e.g., in ovarian cancer, CA125 is the only clinically used biomarker for diagnosis, with a sensitivity of only 50% for early stages of cancer [70, 71]. Other biomarkers that have been identified lack the required specificity and hence cannot be applied to population-wide screening. Along with the complex structure of the proteome, the current lack of efficient biomarkers can be attributed to challenges such as chemical separation of the proteins [72]. The separation, generated using inherent differences in the properties of proteins, is required for identifying and measuring the quantity of each protein. Incomplete separation causes low quantity (i.e., low abundance) proteins to be masked by those in high abundance, and hence several significant proteins might yet remain unidentified [73]. Another significant challenge arises during the analytical quantification of protein values from the plots, called chromatograms, generated during protein profiling. A chromatogram is a visual representation of the protein separation, where the measures of certain chemical components in a sample that are reflective of the separated proteins are plotted. In this research, we develop a methodology for addressing challenges in protein identification and quantification from chromatogram data. Before giving a detailed description of a chromatogram and discussing the research problem, we briefly explain some of the protein profiling techniques.

Most protein profiling techniques, e.g., 2-dimensional electrophoresis (2DE), matrix assisted laser desorption and ionization (MALDI), and surface-enhanced laser desorption and ionization (SELDI), use protein characteristics like hydrophobicity, ionization mass, and electrophoresis [67, 74, 75] to separate proteins. However, small differences in hydrophobicity or large ionization mass between different proteins hinder protein separation [67]. In this research, we develop a methodology to process chromatograms generated by Beckman Coulter ProteomeLab Protein Fractionation-2 Dimension (PF2D). PF2D follows a two dimensional approach, where separation in the first dimension is based on the varying isoelectric point (pI) of proteins and the sample is separated into fractions based on pH. In the second dimension the proteins in the fractions are further separated using the property of hydrophobicity [76]. This approach is found to account, to some extent, for post-translational modifications of proteins and to improve reproducibility of protein detection, which are weaknesses of the other profiling techniques [67, 76]. Also, the increased capacity of PF2D to separate proteins provides an additional leverage in analyzing low abundance proteins, which otherwise would have been masked by high abundance proteins. The use of PF2D in proteomics has been well-established technically [77, 78, 79, 76, 80], and has been shown to be promising for biomarker research [81, 82]. However, the task of quantitative processing of the chromatograms is complicated because, while high abundance proteins can be easily quantified from these plots, identifying and quantifying low abundance proteins is a challenge.

The rest of this chapter is structured as follows: description of the PF2D chromatogram, quantitative challenges in its processing, and current literature are discussed in Section 4.2; the methodology for quantitative processing is presented in Section 4.3; and the results of the methodology are presented in Section 4.4.

4.2 Description of PF2D Chromatograms and Quantitative Challenges

A sample chromatogram output generated from the PF2D second dimensional separation, which was applied on a fraction from the first dimension separation, is presented in Figure 4.1. Approximately 17 such fractions, separated based on pH, are obtained for one blood or urine sample; however, the number can be varied based on requirements. During the second dimensional separation, proteins in the fractions are detected by the amount of UV absorbance and, as in Figure 4.1, the absorbance unit (AU) is plotted against retention time [76]. Interpreting the plot, the spikes or peaks are indicative of the presence of proteins, and based on the effectiveness of separation, each peak represents one or more proteins. The area underneath the peak is equivalent to the amount of the protein present in the sample. Analytical processing of the chromatogram involves identifying the presence and location of peaks, and estimating the area of each peak. Further, by comparing equivalent peaks across samples, peaks that distinguish cancer cases from controls (potential biomarkers) can be obtained. To identify the protein content of the required peaks, the corresponding portion of the fraction can be chemically analyzed. This feature of PF2D, which allows future analysis of fractions of interest, is an advantage over other protein profiling techniques [76, 67]. One of the significant reasons for the lack of effective protein biomarkers can be attributed to the challenges in the analytical processing of chromatograms. We explain below the steps and challenges in PF2D chromatogram processing.

1. Baseline Correction: During the generation of the chromatogram, ideally, the presence of proteins should reflect as a positive value and absence as a value of zero. However, due to difficulties in setting the correct baseline, the signal drifts to either side of zero, as can be seen in Figure 4.1. Hence, the first task is to identify the baseline for the generated signal.

Figure 4.1. Sample PF2D signal

2. Peak Identification: Figure 4.2 presents an enlarged image of a section of Figure 4.1. Note that, though we can visually identify peaks in the signal, it is a tedious and manually infeasible task, as each sample contains thousands of peaks and obtaining biomarkers requires analyzing hundreds of samples. Therefore, developing a methodology for mathematical identification of peaks is essential. While large peaks, like B and D in Figure 4.2, can be easily identified using mathematical tools, it is a challenging task to differentiate smaller valid peaks, like C and E, from noise. Since most proteins are present in small quantities in blood and urine, successful identification of small peaks would be essential for detection of potential low abundance biomarker proteins [83].

3. Intensity Quantification: Obtaining biomarkers involves identifying proteins that are produced in distinguishably different quantities across cases and controls. Hence, the third step involves quantifying the area of each peak, which is equivalent to the quantity of the proteins represented by the peak. Estimating the area is a challenging task due to the presence of overlapping peaks, e.g., A and B in Figure 4.2, which are caused by incomplete separation of proteins.

Figure 4.2. Peak identification and intensity quantification

Since each peak is most likely representative of a set of proteins rather than an individual protein, it might seem fair to group peaks A and B and estimate the area under the combined peaks, which is a relatively easier task. However, this will be undesirable for identifying distinguishable proteins across cases and controls, the reason for which can be explained as follows. As seen in Figure 4.2, most of the small peaks overlap with the larger peaks. Hence, if these peaks are grouped, when comparing peaks across cases and controls, the high value of the large peaks will mask any significant difference that exists across the smaller peaks. Hence, developing an algorithm that identifies and quantifies all peaks is essential.

4. Peak Alignment: Ideally, identical peaks, i.e., those that represent the same set of proteins across samples, should be aligned in time since they are detected based on protein properties. However, as highlighted in Figure 4.3, due to chemical causes [79], there is a horizontal shift in the occurrence of peaks in sample 1 when compared to identical peaks in sample 2. Hence, for comparison

across cases and controls, identical peaks across all samples need to be aligned along the x-axis (time).

Figure 4.3. Horizontal shift of peaks

Currently, 32 Karat is the software that is mainly used for quantitative processing of PF2D chromatograms [76]. The drawback of the software is that it is not completely automatic: steps like setting targets for aligning peaks and peak identification are manual, and alignment of multiple samples is not possible. While mathematical methodologies for processing of PF2D chromatograms are limited, the literature presents methodologies for other protein profiling techniques, where the quantitative steps involved in processing are similar to those explained above. The most commonly used protein profiling techniques are mass spectrometry (MS) based, where the proteins are detected based on mass-to-charge ratio (m/z), unlike pH or hydrophobicity in PF2D. The spectral signals generated by MS techniques contain the number of ions, indicating intensity of proteins, versus the corresponding m/z values. Current methodologies for quantitative processing of SELDI and MALDI based MS signals have been reviewed in [84] and [85], respectively, and can be summarized as follows. Methodologies for baseline correction include dividing the signal into windows

and scaling to the local minimum in each window [86, 87], using a global moving average [88], or linear interpolation [89, 90, 91]. Some methodologies smooth the signal to remove noise before peak identification, by applying a moving average filter [92, 93] or a Gaussian filter [90, 94]. The crest of a peak is identified as a local maximum and considered a valid peak if its signal-to-noise ratio (SNR) is greater than a user-defined threshold [90, 93, 94, 95, 96]; some methods additionally use a baseline threshold and ignore all peaks below this value [88]. The SNR of a peak is usually obtained as the local average divided by the local standard deviation. Other peak identification techniques, along with SNR, use a baseline value to identify the beginning and end of peaks and apply a threshold for peak width [97], or estimate slopes and use a threshold for both sides of the crest to determine a valid peak [89]. While the above techniques are successful in identifying high abundance peaks, identification of low abundance peaks depends on the surrounding signal region, which creates inconsistencies in the results. Since noise varies along the signal, small peaks in the vicinity of little or no noise, or in the presence of large peaks, might have a low SNR value, even equal to or lower than that of a noise peak in other parts of the signal. Hence, it is not possible to set an optimal threshold. The methodologies in [98, 99, 95, 96, 86] use wavelet based signal processing tools, extracting the signal at multiple frequencies/scales to identify peaks.

The methods presented in the literature have been successful in automatically identifying certain peaks and, as presented in [84] and [85], based on identification of known proteins, the method in [98] performs better than other techniques. However, the current methods do not include techniques for identification of small peaks or overlapping peaks. Most methods use thresholds to differentiate noise from valid peaks, and it has been seen that low thresholds increase the false discovery rate, while high thresholds miss several valid peaks. In this research, our main focus is the identification and quantification of small peaks and overlapping peaks. Since blood circulates throughout the body, it contains proteins from all parts of the body. There are about 500,000 proteins produced in the human body, most of which are present in small amounts in blood and urine, and several of which are yet to be discovered. Therefore, identification of the small peaks that represent low abundance proteins is essential for obtaining potential biomarker proteins [83]. Also, due to incomplete protein separation, most of the small peaks overlap with, and occur on the shoulder of, large peaks. If these overlaps are not separated, then during the process of identifying peaks that distinguish cancer cases from controls, any significant difference in the small proteins will be masked by the larger proteins. Hence, in this research, we develop continuous wavelet transform based methods to address challenges in peak identification and quantification, especially for peaks that have low abundance or overlap.

Peak alignment is a task encountered in several areas of biochemistry, and the literature presents a considerable number of methods based on time warping or sequence alignment, several of which perform peak alignment prior to peak detection. The authors in [100] present a method for peak alignment of PF2D chromatograms. The method is derived from dynamic time warping based on dynamic programming, which was originally developed for speech recognition and has been used for alignment of chromatograms in other biochemistry techniques [101, 102, 48]. In such methods, the signal is usually divided into a number of sections, and each section is stretched or shortened by shifting its ends within a slack parameter and interpolating the signal points in between. The shift that provides the best correlation coefficient with the target signal is retained. Such a method changes the profile of the peak and therefore its area, which is not suitable in our application as we are interested in identifying proteins that have distinguishing quantities (areas) across cases and controls (biomarkers). Other similar methods based on time warping techniques are presented

in [103, 104]. A scale-space representation, similar in concept to a wavelet function but using a Gaussian function, for peak alignment of multiple spectra is presented in [105]. By obtaining Dirac components as the sum of weighted distances between peaks from all spectra samples and convolving with the Gaussian function, the common peak location across spectra is obtained as the local maximum of the resulting function. However, its performance with overlapping peaks and low abundance peaks depends on setting appropriate coefficient weights derived from prior knowledge of biology, which is not feasible to obtain in our application. In this research, we first perform peak identification and quantification, followed by alignment of the identified peaks using a simple model based on mixed integer-nonlinear programming. We follow this sequence because individual peak quantification is essential for obtaining biomarkers, and also to avoid any changes in profile, especially in low abundance peaks, that might result if alignment is conducted first.

4.3 Methodology for Quantitative Processing of PF2D Chromatograms

In this research, we developed a mathematical model using continuous wavelet transforms for baseline correction, peak identification, and quantification, and an optimization model for peak alignment. In what follows, we briefly discuss wavelet transforms and explain in detail the methodologies for the quantitative processing.

4.3.1 Continuous Wavelet Transforms for Baseline Correction, and Peak Identification and Quantification

A wavelet is a short wave; note that in Chapter 3 we briefly discussed discrete wavelet transforms for removing noise from microarray images. For chromatogram processing, we use continuous wavelet transforms (CWT), defined as the convolution of the normalized form of a function $\psi(t)$, called the mother wavelet, with the data

signal $x(t)$ to obtain wavelet transforms $T(a,b)$ at scale $a$ and translation $b$ [106]. We will discuss later in this section the advantage of using continuous over discrete wavelet transforms for chromatogram processing. The mathematical representation of CWT is written as

$$T(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi\!\left(\frac{t-b}{a}\right) dt. \qquad (4.1)$$

Scale can be referred to as the stretched/squeezed form of a wavelet, and translation as the position of the center of the wavelet on the x-axis. By varying the values of $a$, the signal can be decomposed or represented at different frequencies, and the value of the transform at varying $b$ can be obtained. In this research, we use the Mexican hat wavelet function (Figure 4.4), which is written as [106]

$$\psi\!\left(\frac{t-b}{a}\right) = \left[1 - \left(\frac{t-b}{a}\right)^{2}\right] e^{-0.5\left(\frac{t-b}{a}\right)^{2}}. \qquad (4.2)$$

Figure 4.4. Mexican hat wavelet

By using such a peak-shaped wavelet function, identification of peaks in a chromatogram can be achieved by utilizing the values of $T(a,b)$, which is explained as follows. As an illustration of scaling and translation, we depict in Figure 4.5 a wavelet at a fixed translation $b$ and at three scales $a_1$, $a_2$, and $a_3$ ($a_1$ …

Figure 4.5. A wavelet at a fixed translation and at three different scales

Figure 4.6. A wavelet at a fixed scale and three different translations

seen that the wavelet at scale $a_2$ and translation $b_2$ is the best match for representing peak A, and mathematically it will be seen that $T(a_2,b_2)$ attains the maximum value among all $a$ and all $b$ in the neighborhood of the peak. We can explain this mathematical behavior by considering the positive or negative yield of the convolution $x(t)\,\psi\!\left(\frac{t-b}{a}\right)$ in Equation 4.1 using Figure 4.7, as follows. For a wavelet with fixed $a$ and $b$, Figure 4.7 indicates with a '+' or '-' sign the positive or negative value resulting from the convolution at the different time segments $t$ marked by the vertical lines along the x-axis. The time segments where the wavelet and signal are both positive or both negative result in a positive value for the convolution, and time segments where they are of opposite signs yield a negative value [106]. Hence, in our example in Figures 4.5 and 4.6, considering the different values of $a$ and $b$, it can be seen that $T(a_2,b_2)$ will provide

the maximum value. Sample values of $T(a,b)$, also commonly referred to as wavelet coefficients, at five scales and all translations for the length of the signal are plotted in Figure 4.8. As seen in the figure, the use of CWT provides continuous translations, thus generating continuous coefficients, which also appear as peaks with the use of suitable wavelet functions like the Mexican hat used here. Hence, the local maxima of the coefficients at each scale represent the presence of peaks, and for each peak $k$, the best match scale $\hat{a}_k$ and translation $\hat{b}_k$ are obtained as the scale and translation of the local maximum $T_k(\hat{a}_k,\hat{b}_k)$, where $b_k$ is in the neighborhood of $k$. From the set of identified local maxima, we further develop techniques to differentiate between noise and actual peaks, which is explained later in this section. Further, since the baseline of the signal changes slowly, thus making it monotonic around a peak, the convolution with a compact symmetric wavelet (e.g., Mexican hat) provides an automatic baseline correction [98]. Notice that the baseline correction is visible in Figure 4.8, where the coefficients take a zero baseline while the original signal had a baseline around 1500. Note that, in the case of clear peaks, i.e., no overlaps, the value of $T_k(\hat{a}_k,\hat{b}_k)$ (see Equation 4.1) is proportional to the area underneath the corresponding signal peak $k$, and the location of the crest of $k$ is $\hat{b}_k$. However, in the case of overlapping peaks, neither $T_k(\hat{a}_k,\hat{b}_k)$ nor $\hat{b}_k$ gives an estimate of the peak quantity or location, respectively. Therefore, in this research, we also develop a methodology for quantifying overlapping peaks.

Figure 4.7. Positive and negative contributions of convolution of the wavelet with a signal
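As an illustration of Equations 4.1 and 4.2, the following minimal Python sketch evaluates the transform by direct summation on a synthetic signal. It is not the implementation used in this research: the function names (`mexican_hat`, `cwt_mexican_hat`), the unit sampling interval, the synthetic Gaussian peak, and the restriction of the search to the interior of the signal (the sketch does not treat edge effects) are all assumptions made for illustration.

```python
import numpy as np

def mexican_hat(u):
    """Mexican hat mother wavelet, psi(u) = (1 - u^2) * exp(-u^2 / 2)."""
    return (1.0 - u**2) * np.exp(-0.5 * u**2)

def cwt_mexican_hat(x, scales):
    """Direct-sum approximation of Equation 4.1 with unit sample spacing.

    Returns T of shape (len(scales), len(x)); T[i, b] is the coefficient
    at scale scales[i] and translation b.
    """
    x = np.asarray(x, dtype=float)
    t = np.arange(x.size)
    T = np.empty((len(scales), x.size))
    for i, a in enumerate(scales):
        # u[b, t] = (t - b) / a for every translation b (rows) and sample t (columns)
        u = (t[None, :] - t[:, None]) / a
        T[i] = mexican_hat(u) @ x / np.sqrt(a)
    return T

# Synthetic example (assumed values): one Gaussian peak on a flat baseline of 1500.
t = np.arange(1000)
signal = 1500.0 + 100.0 * np.exp(-0.5 * ((t - 500) / 8.0) ** 2)
scales = np.arange(2, 21)
T = cwt_mexican_hat(signal, scales)

# Search the interior only; the sketch does not treat edge effects.
interior = T[:, 300:700]
i, j = np.unravel_index(np.argmax(interior), interior.shape)
best_scale, best_b = scales[i], 300 + j
print("best scale:", best_scale, "best translation:", best_b)
```

On this synthetic example the largest interior coefficient falls at the peak crest, and coefficients far from the peak are close to zero even though the signal sits on a baseline of 1500, which is the baseline-correcting behavior described above. The direct summation builds an N x N array per scale, so it is only practical for short signals.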

Figure 4.8. Sample plot of wavelet coefficients

Using the above knowledge of CWT, we describe below the steps for identification and quantification of peaks, including techniques for small peaks and overlapping peaks.

1. Obtain CWT coefficients $T(a,b)$, $\forall a, \forall b$. The resulting $T$ is a two-dimensional matrix of size $M \times N$, where $M$ is the number of scales and $N$ is the length of the signal.

2. Using slopes, identify local maxima at each scale to obtain a matrix $P$ of size $M \times N$. If a local maximum is found at $T(a,b)$, $P(a,b) = T(a,b)$; else $P(a,b) = 0$.

3. Link local maxima (non-zero $P(a,b)$'s) across scales: As discussed earlier, the optimal scale $\hat{a}$ for a peak corresponds to the maximum value across scales, i.e., $P(\hat{a},b) = \max_a P(a,b)$. However, for any peak in the signal, its local maxima across scales do not occur at the same translation, i.e., if $b_k^1$ and $b_k^2$ are the translations of the local maxima corresponding to a peak $k$ at scales 1 and 2, respectively, then $b_k^1 \neq b_k^2$ for most peaks. Therefore, to identify $\hat{a}_k$, the non-zero values of $P(a,b)$ across all scales that correspond to peak $k$ need to be first linked. To achieve this, beginning at the lowest scale (we start with 2 since

scale 1 mostly represents noise), perform the following steps at each scale. Note that the concept of linking local maxima is similar to obtaining ridges in [98]; however, there are changes in the procedure that we adopt.

(a) For each $(\bar{a},\bar{b})$ where $P(\bar{a},\bar{b}) \neq 0$, using a window size $w$, find a matrix index $(a,b)$ such that $a \in 1:\bar{a}-1$, $b \in \bar{b}-w:\bar{b}+w$, and $P(a,b) \neq 0$, i.e., find a local maximum in any of the previous scales in the neighborhood $\bar{b} \pm w$. The search is done in more than one previous scale since, based on features of the signal, the local maxima might not occur at all scales. Note that, instead of setting a fixed value for $w$ as in the literature, we estimate a peak-dependent value, which is explained at the end of this section.

(b) If found, suppose at only one index $(\bar{a}-1,\bar{b}-4)$, i.e., $P(\bar{a}-1,\bar{b}-4) \neq 0$, then set $P(\bar{a},\bar{b}-4) = P(\bar{a},\bar{b})$ and $P(\bar{a},\bar{b}) = 0$, i.e., shift the location of the local maximum at scale $\bar{a}$ to align with that of the previous scales, indicating that they belong to the same peak. If found at more than one index in the neighborhood $\bar{b} \pm w$, move it to the nearest location.

(c) If not found, i.e., $P(a,b) = 0\ \ \forall a \in 1:\bar{a}-1,\ b \in \bar{b}-w:\bar{b}+w$, then retain the value of $P(\bar{a},\bar{b})$, thus indicating that it is a new peak.

4. Location of peaks: Each non-zero column $\bar{b}$ in the modified $P$ indicates the presence of a signal peak. For peak $k$, set the optimal scale as $\hat{a}_k$, where $P_k(\hat{a}_k,\bar{b}_k) = \max_{a_k} P_k(a_k,\bar{b}_k)$, and for now, consider the quantity of $k$ as $Q_k = P_k(\hat{a}_k,\bar{b}_k)$. Set the location of the crest of $k$ as $L_k = \bar{b}_k$, thus corresponding to the location of the local maximum at the lowest scale, for the following reason. For clear non-overlapping peaks, the location will be equal to the original best fit $\hat{b}_k$ in $T$, i.e., corresponding to $T_k(\hat{a}_k,\hat{b}_k) = P_k(\hat{a}_k,\bar{b}_k)$. However, in the case of overlapping peaks, column $\hat{b}_k$ will be skewed away from the location of

the peak crest. Now, consider the translation of the wavelet at a point $b_1$ that corresponds to the crest of a peak in the signal $x(t)$. At lower scales, where the width of the wavelet is very small, it is most likely that the wavelet is completely contained within the region of the signal peak for translations $b_1 \pm n$ for a small $n$. Hence, the signs of the convolution along the different segments of the wavelet (recollect Figure 4.7) will be the same at all $b_1 \pm n$, thus resulting in a maximum value of $T$ at the translation where $x(t)$ has its maximum profile, i.e., at $b_1$. Therefore, the location of the local maximum at the lower scales is most likely to correspond to the crest, or very close to the crest, of the signal peak. See Figure 4.9, which contains a sample plot of $T(a,b)$, $x(t)$ (signal), and $P(a,b)$ for a better understanding of the relationship between the variables. Note that the figure shows only 10 scales for ease of illustration. Due to the overlapped peaks, the location of $T_k(\hat{a}_k,\hat{b}_k)$ is skewed away from the peak crest; however, the location corresponding to the translation at the lowest scale, $\bar{b}_k$, provides a better estimate.

Figure 4.9. Location of peaks

5. Identification of valid peaks: For each peak $k$ perform the following steps.

(a) Estimate the width $W_k$ of the peak formed by the wavelet coefficients at scale $\hat{a}_k$ by searching around $\hat{b}_k$. Identify the highest scale at which the peak is recorded, $H_k$, as the highest non-zero scale in $P_k(a_k,\bar{b}_k)$.

(b) Retain $k$ as a valid peak by setting thresholds for $W_k$, $Q_k/W_k$, and $H_k$.

6. Quantification of overlapping peaks: Peak quantity is obtained by utilizing the knowledge of the positive and negative contributions to $T$ arising from the convolution (refer to Figure 4.7 for the convolution contributions), and is explained as follows. Notice that, for any clear peak $k$, while $\hat{b}_k$ (i.e., the translation that provides the maximum transform $T_k(\hat{a}_k,\hat{b}_k)$) is equivalent to the location of the peak crest, $T_k(\hat{a}_k,b_k) \le 0$ when $b_k$ corresponds to the peak troughs. Let $b_k = s_k^a$ and $b_k = e_k^a$ denote two translations, referring to start and end at scale $a$ in the neighborhood of peak $k$, such that $s_k^a$ … $0$, $T_k(a_k,e_k^a) \le 0$, and $T_k(a_k,e_k^a-1) > 0$. Now consider two peaks $i$ and $j$ that overlap; since they are almost always not of the same frequency, $\hat{a}_i \neq \hat{a}_j$, and let $\hat{a}_i < \hat{a}_j$. Due to the mathematical form of the convolution, the functions $T_i(\hat{a}_i, s_i^{\hat{a}}:\hat{b}_i:e_i^{\hat{a}})$ and $T_j(\hat{a}_j, s_j^{\hat{a}}:\hat{b}_j:e_j^{\hat{a}})$ will be such that there is some amount of overlap of the regions $s_i^{\hat{a}}:e_i^{\hat{a}}$ and $s_j^{\hat{a}}:e_j^{\hat{a}}$. The overlaps could either be complete, i.e., $s_j^{\hat{a}} \le s_i^{\hat{a}} \le e_i^{\hat{a}} \le e_j^{\hat{a}}$ (scenario 1), or partial, i.e., either $s_j^{\hat{a}} \le s_i^{\hat{a}} \le e_j^{\hat{a}} \le e_i^{\hat{a}}$ or $s_i^{\hat{a}} \le s_j^{\hat{a}} \le e_i^{\hat{a}} \le e_j^{\hat{a}}$ (scenario 2). Scenario 1 will most likely occur in cases when the smaller peak $i$ is a low abundance peak occurring on the shoulder of a large peak, e.g., as in Figure 4.10, which presents the signal and its wavelet coefficients $T(a,b)$ at 25 scales. Utilizing the above knowledge and identifying overlaps of scenario 1, estimate

$Q_i = T_i(\hat{a}_i,\hat{b}_i)$ and $Q_j = T_j(\hat{a}_j,\hat{b}_j) - T_i(\hat{a}_i,\hat{b}_i)$. Scenario 2 represents incomplete separation of two protein sets, where the overlap will cause the coefficient to be maximum at a scale larger than the actual scale of the larger peak $j$ (e.g., see Figure 4.11). Therefore, by identifying overlaps of scenario 2, and by searching through scales below $\hat{a}_j$, obtain a scale $a$ where $s_j^a$ …

Figure 4.10. Estimating area of overlapping peaks under Scenario 1

Figure 4.11. Estimating area of overlapping peaks under Scenario 2
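Steps 2 to 4 of the procedure can be sketched in Python as follows. This is a simplified illustration, not the procedure as implemented in this research: it uses a fixed window `w` rather than a peak-dependent window, it keeps only positive local maxima, and when two maxima at one scale are moved onto the same column it keeps the larger one. The function names and the small synthetic coefficient matrix are hypothetical; row 0 of `T` stands for the lowest scale retained.

```python
import numpy as np

def local_maxima_matrix(T):
    """Step 2: P[a, b] = T[a, b] at positive local maxima along b, else 0."""
    T = np.asarray(T, dtype=float)
    P = np.zeros_like(T)
    mid = T[:, 1:-1]
    # rising slope on the left, non-rising slope on the right
    mask = (mid > T[:, :-2]) & (mid >= T[:, 2:]) & (mid > 0)
    P[:, 1:-1][mask] = mid[mask]
    return P

def link_across_scales(P, w):
    """Step 3: move each maximum onto the nearest maximum of any lower scale within +/- w."""
    P = P.copy()
    M, N = P.shape
    for a in range(1, M):
        for b in np.flatnonzero(P[a]):
            lo, hi = max(0, b - w), min(N, b + w + 1)
            # columns in the window holding a maximum at some previous scale
            cols = lo + np.flatnonzero(np.any(P[:a, lo:hi] != 0, axis=0))
            if cols.size == 0:
                continue  # step 3(c): no match, keep as a new peak
            target = cols[np.argmin(np.abs(cols - b))]
            if target != b:
                # step 3(b): shift the maximum to the matched column
                P[a, target] = max(P[a, target], P[a, b])
                P[a, b] = 0.0
    return P

def locate_peaks(P_linked):
    """Step 4: one peak per non-zero column; returns (L_k, best scale index, Q_k)."""
    peaks = []
    for b in np.flatnonzero(np.any(P_linked != 0, axis=0)):
        a_hat = int(np.argmax(P_linked[:, b]))
        peaks.append((int(b), a_hat, float(P_linked[a_hat, b])))
    return peaks

# Hypothetical 3-scale coefficient matrix: one peak drifting across columns 5, 6, 4
# and a second peak that appears only at the highest scale.
T = np.zeros((3, 12))
T[0, 5], T[1, 6], T[2, 4], T[2, 10] = 1.0, 3.0, 2.0, 1.5
peaks = locate_peaks(link_across_scales(local_maxima_matrix(T), w=2))
print(peaks)
```

In this toy matrix the three drifting maxima are collected in the column of the lowest-scale maximum, and the isolated maximum at the highest scale is retained as a separate peak.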

Note that the algorithm presented in the above steps identifies low abundance peaks as well. For example, unlike the signal-to-noise ratio (SNR) used in most methods in the literature, the thresholds for peak identification used in this research ($W_k$, $Q_k/W_k$, and $H_k$) depend only on the peak under review and not on the neighborhood signal. Since noise varies along sections of the signal, using the characteristics of the peak alone helps obtain a better identification. Also, for each peak $k$ at each scale $a$, the search window $w$ for linking local maxima is set as $w = s_k^a : e_k^a$, where the translations $b_k = s_k^a$ and $b_k = e_k^a$ are as explained in Step 6 of the algorithm. Notice that, as $a$ increases, $e_k^a - s_k^a$ increases or remains the same, and $s_k^a \le s_k^{a-1} \le e_k^{a-1} \le e_k^a$. Therefore, instead of setting a constant window size, determining the value based on the features of the peak, as discussed above, provides a better search criterion. While incomplete chemical separation of proteins creates overlapping peaks, using wavelet features to identify and quantify each of these overlapping peaks provides as much protein separation as possible. Such separation allows for better identification of small significant biomarker peaks where, otherwise, the non-separation would have caused the large peaks to mask significant differences across small peaks.

4.3.2 Optimization Model for Peak Alignment

Alignment of peaks is achieved by developing an optimization algorithm. The objective of the algorithm is to minimize the magnitude distance between two signals, since misaligned peaks are likely to have a larger magnitude distance. Note that, in the context of peak alignment, a signal is an array whose length equals the length of the original chromatogram. An array element takes a value equal to the peak intensity if its index corresponds to the location of an identified peak (obtained using $Q$ and $L$ from Section 4.3.1), and takes a zero value at all other indices. Distance is estimated as the sum of

magnitude differences between elements of the two signals. The optimization model developed for peak alignment is as follows. Let $A$ and $B$ be two signals, where the objective is to align $B$ against $A$, i.e., for every peak in $A$ we need to find a matching peak from $B$. Let $B^m$ = Aligned $B$. The optimization model for the alignment is formulated as follows.

Objective:

$$\text{Minimize} \quad \sum_x \left|A_x - B^m_x\right| + \sum_y u_y$$

$$B^m_x = \sum_y B_y\,\delta_{y,x} \quad \forall x \qquad (4.3)$$

$$\sum_x \delta_{y,x} \le 1 \quad \forall y \qquad (4.4)$$

$$\sum_y \delta_{y,x} \le 1 \quad \forall x \qquad (4.5)$$

$$\left(\sum_y \delta_{y,x}\,y - \sum_y \delta_{y,j}\,y\right) \sum_y \delta_{y,x} \ge 0 \quad \forall x, \forall j\ \ldots \qquad (4.6)$$

$$\ldots \qquad (4.8)$$

where $\Delta$ = maximum possible shift on the x-axis (time axis),

$$\delta_{y,x} \in \{0,1\}\ \ \forall y, \forall x; \qquad B^m_x \in \mathbb{R}\ \ \forall x; \qquad u_y \in \mathbb{R}\ \ \forall y \qquad (4.9)$$

The objective function minimizes 'distance + unmatched', where 'distance' equals the sum of differences between aligned peak intensities and 'unmatched' equals the sum of peak intensities in $B_y$ that did not find a matching peak in $A_x$. Constraint 4.3 tracks peak shifts by using binary variables, Constraints 4.4 and 4.5 avoid duplication of peaks, Constraint 4.6 ensures that the sequential order of the peaks is maintained,

Constraint 4.7 keeps track of peaks in $B_x$ that did not find a matching peak in $A_x$, and Constraint 4.8 places a bound on the peak shift. Since peaks were expressed over the pH scale, the shift of peaks is limited to a range of neighboring pH, and hence a search beyond that range is not required.

4.3.2.1 Transformation from Nonlinear to Linear

Note that Constraint 4.6 contains a nonlinear term. Since solving a nonlinear model is much more difficult than solving a linear model, for computational efficiency we linearize Constraint 4.6 as follows.

$$\sum_y \delta_{y,x} = \alpha_x \quad \forall x \qquad (4.10)$$

$$\alpha_x + \beta_x = 1 \quad \forall x \qquad (4.11)$$

$$\sum_y \delta_{y,x}\,y - \sum_y \delta_{y,j}\,y + 100000\,\beta_x \ge 0 \quad \forall x, \forall j\ \ldots$$

described as follows. Note that, for any given peak, the distance of shift is limited because its occurrence is based on chemical property. Therefore we adopt the following two-step procedure of breaking down the signal into multiple smaller fragments:

1. Every sample has a small set of common high abundance proteins. Therefore, in the first step, we consider a signal as an array of high abundance proteins only. The optimization algorithm is applied to align these high abundance proteins across two samples.

2. The points of alignment in step 1 are used as breakpoints to divide the entire dataset into multiple smaller signal fragments. Each signal fragment consists of an array of a small number of peaks. Therefore, in step 2, the algorithm is applied to each set of signal fragments separately, and the peaks within each fragment are aligned.

4.4 Results and Discussion

Our algorithm was tested on PF2D chromatograms obtained from urine samples of ovarian cancer patients. The samples were obtained as part of the Tampa Bay Ovarian Cancer Coalition (TBOCC), a population-based study conducted by a team from Moffitt Cancer Center and Research Institute, Tampa. The study is being performed in the Tampa Bay metropolitan region of the State of Florida, where about 400 blood and urine samples will be prospectively obtained from ovarian cancer patients and also from matching controls. The cancer cases and controls are matched based on demographics and risk factors like age, menopausal status, and race. This study, to our knowledge, is the largest prospective collection of preoperative population-based samples in the US.

4.4.1 Peak Detection

For comparison of peak detection capacity, peaks in PF2D chromatograms were identified using our algorithm and MassSpecWavelet [98], which is a peak detection method also based on continuous wavelet transforms. Note that MassSpecWavelet was originally developed for SELDI-TOF mass spectrometry based spectra, whose protein separation procedure differs from that of PF2D (as explained earlier in Section 4.2), although with similar challenges in protein identification. We use MassSpecWavelet for comparison as it has been shown to perform comparatively better than the other algorithms [85] for peak identification. MassSpecWavelet differentiates between noise and valid peaks based on two main thresholds, signal-to-noise ratio (SNR) and amplitude of the local maximum of the peak. The SNR is estimated as the ratio of the local maximum to the local noise level, where the noise level within a local window around the peak is estimated as the 95th percentile of the coefficients at the first scale. Since MassSpecWavelet was developed for MS spectra, we also compare the peak detection capacity of the two algorithms on MS spectra. We present below the comparison of results on PF2D chromatograms, followed by MS spectra. Note that all visual verifications mentioned below were based on the expert opinion of biochemists and oncologists.

4.4.1.1 Comparison of Peak Identification on PF2D Chromatograms

The thresholds in MassSpecWavelet and our algorithm were varied to obtain different numbers of peak selections, and the ensuing results with approximately the same number of selections in both algorithms were compared. While large peaks are identified by both algorithms, it was found that our algorithm performs better in identification of small peaks and overlapping peaks. A sample result is presented by considering

Figure 4.12. Peak identification on section of chromatogram with several peaks

two sections of the same chromatogram, one with several valid peaks (Figure 4.12) and the other with a lot of noise (Figure 4.13). The blue lines indicate the locations of peaks identified by our algorithm and the red lines those identified by MassSpecWavelet, where the numbers of peaks identified are 145 and 142, respectively. Notice that the section of the chromatogram shown in Figure 4.12 contains several small peaks and overlapping peaks, several of which, as circled, are not identified by MassSpecWavelet. Moreover, MassSpecWavelet identifies several false peaks in the tail end of the chromatogram, which usually has just a few valid peaks (Figure 4.13), and also misses some valid peaks, which is more clearly seen in the enlarged portion between data points 6500 and 7100. Note that the false detection of peaks by our algorithm is much lower, as can be seen in the figure. Though the chemical content of the small circled peaks that were missed by MassSpecWavelet in Figure 4.12 is not confirmed, it may be noted that such automatic identification of all peak-like appearances is essential, the reason

for which can be explained as follows. Since there are about 500,000 proteins, most of which are present in small amounts in urine and blood, obtaining the undiscovered small biomarker proteins requires identification of peaks that distinguish cancer cases from controls. The vast number of small peaks makes it impossible to manually identify distinguishing peaks; moreover, the distinguishing characteristics might be present in the form of a pattern or set, rather than individual peaks, hidden from the naked eye. Hence, the identification of distinguishing peaks requires the use of a mathematical model, and therefore an automatic algorithm that identifies the location and quantifies the area of all peak-like appearances.

Figure 4.13. Peak identification on section of chromatogram with high noise

In order to test the capacity of MassSpecWavelet to identify small and overlapping peaks, its thresholds of peak detection were further lowered. As can be seen in Figures 4.14 and 4.15, where the thresholds in MassSpecWavelet have been reduced to increase the number of peaks to 172 and 202, respectively, MassSpecWavelet again misses the small and overlapping peaks.

Figure 4.14. Sample results of peak identification: Increasing peaks selected by MassSpecWavelet to 172

4.4.1.2 Comparison of Peak Identification on MS Spectrum

Both algorithms (ours and MassSpecWavelet) were applied to the MALDI-TOF mass spectrometry (MS) based Aurum data [107], publicly available at https://proteomecommons.org/. The advantage of using the Aurum data is that it provides the list of known valid peaks. Hence, we estimated the peak detection sensitivity of both algorithms, and compared sensitivities corresponding to thresholds that provide an equivalent number of identified peaks in both algorithms. The number of peaks and sensitivity results are presented in Table 4.1. For our algorithm, all thresholds were kept constant except for $Q_k/W_k$, which was varied to identify different numbers of peaks.
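A sensitivity figure of this kind could be computed along the following lines. The matching rule is an assumption, since the text does not state how a detected peak is matched to a known peak: here a known peak counts as detected when some detected location lies within a tolerance `tol` of it. The function name and the example values are hypothetical.

```python
def detection_sensitivity(known, detected, tol):
    """Percentage of known peak locations with a detected peak within +/- tol."""
    if not known:
        raise ValueError("known peak list is empty")
    # a known peak is a hit if any detected location is close enough to it
    hits = sum(any(abs(k - d) <= tol for d in detected) for k in known)
    return 100.0 * hits / len(known)

# Hypothetical m/z locations, for illustration only.
known_peaks = [1000.5, 1500.2, 2000.0, 2500.7]
detected_peaks = [1000.6, 1499.9, 2300.0]
sensitivity = detection_sensitivity(known_peaks, detected_peaks, tol=0.5)
print(sensitivity)
```

With this rule, lowering a detection threshold can only keep or raise the sensitivity, so a comparison between algorithms is meaningful only at a similar number of identified peaks, as done in the text.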

Figure 4.15. Sample results of peak identification: Increasing peaks selected by MassSpecWavelet to 202

In the table, rows 1 to 3 present results where MassSpecWavelet's threshold for SNR was kept constant while amplitude was varied to identify different numbers of peaks. Rows 4 to 6 present results where SNR was varied and amplitude was slightly tuned so as to obtain an equivalent of 195 peaks to match that obtained by our algorithm. As can be seen, our algorithm provides better sensitivity in all cases.

4.4.2 Peak Alignment

Results of peak alignment across samples were visually validated on the PF2D chromatograms. A sample alignment is explained in Figure 4.16, and a larger section of the alignment is presented in Figure 4.17. As explained in Figure 4.16, the objective of the optimization model was to align peaks in Sample 2 against peaks in Sample 1. The dark blue lines indicate the original locations of the identified peaks using

$L$ from Section 4.3.1, and the light blue lines indicate the shifted locations output from the optimization model in Section 4.3.2. Notice that, in the example shown in the figure, visually we can say that peaks A and B represent the same protein, hence confirming the results from the optimization model, where the light blue line corresponds to the crest of peak B, indicating that peak A has been aligned with peak B. Similar visual confirmation of alignments was conducted on several sections of multiple datasets.

Table 4.1. Sensitivity of detection of known peaks in the Aurum data

Our Algorithm                        MassSpecWavelet
Identified peaks   Sensitivity (%)   Identified peaks   Sensitivity (%)   Thresholds: SNR, Amplitude
155                82.7              151                65.4              …, 0.01
209                84.6              210                69.2              …, 0.0075
305                90.4              305                75.0              …, 0.0049
195                81.8              197                67.3              …, 0.007
195                81.8              201                67.3              …, 0.007
195                81.8              195                67.3              …, 0.0075

Figure 4.16. Peak alignment on PF2D data was visually validated
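For very small signals, the alignment objective of Section 4.3.2 can be explored by exhaustive search instead of a mixed integer solver. The sketch below is an illustration under stated assumptions, not the optimization model as solved in this research: each peak of B is either moved by at most `max_shift` indices or left unmatched, in which case its intensity is added to the objective, and peak order is preserved by requiring strictly increasing target indices. The function name `align_bruteforce` and the example arrays are hypothetical, and the search is exponential in the number of peaks.

```python
from itertools import product
import numpy as np

def align_bruteforce(A, B, max_shift):
    """Return (aligned B, cost) minimizing sum|A - Bm| + intensity of unmatched B peaks."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    n = A.size
    peaks = [int(y) for y in np.flatnonzero(B)]  # original peak locations in B
    options = []
    for y in peaks:
        lo, hi = max(0, y - max_shift), min(n - 1, y + max_shift)
        options.append([None] + list(range(lo, hi + 1)))  # None = leave unmatched
    best_cost, best_Bm = np.inf, None
    for choice in product(*options):
        placed = [x for x in choice if x is not None]
        # strictly increasing targets keep peak order and forbid duplicates
        if any(x2 <= x1 for x1, x2 in zip(placed, placed[1:])):
            continue
        Bm = np.zeros(n)
        unmatched = 0.0
        for y, x in zip(peaks, choice):
            if x is None:
                unmatched += B[y]
            else:
                Bm[x] = B[y]
        cost = np.abs(A - Bm).sum() + unmatched
        if cost < best_cost:
            best_cost, best_Bm = cost, Bm
    return best_Bm, best_cost

# Hypothetical peak arrays: B's two peaks sit one and two indices to the right of A's.
A = [0, 0, 5.0, 0, 0, 0, 3.0, 0, 0, 0]
B = [0, 0, 0, 4.5, 0, 0, 0, 0, 2.5, 0]
Bm, cost = align_bruteforce(A, B, max_shift=2)
print(Bm.tolist(), cost)
```

In this example both peaks of B are shifted onto the peaks of A and their intensities are left unchanged, which reflects the reason given earlier for preferring peak shifting over time warping.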

Figure 4.17. A portion of PF2D chromatograms illustrating peak alignments

REFERENCES

[1] M. Leshno, Z. Halpern, and N. Arber. Cost-effectiveness of colorectal cancer screening in the average risk population. Health Care Management Science, 6:165-174, 2003.
[2] Epidemiology Resource Center/Data Analysis Team, Indiana State Department of Health. Indiana health behavior risk factors 2001 state survey data. http://www.in.gov/isdh/reports/brfss/2001/index.htm, August 2002. Last accessed Feb 03, 2009.
[3] I. Vogelaar, M. van Ballegooijen, D. Schrag, R. Boer, S. J. Winawer, J. D. F. Habbema, and A. G. Zauber. How much can current interventions reduce colorectal cancer mortality in the U.S. Cancer, 107:1624-1633, October 2006.
[4] American Cancer Society. The history of cancer. http://www.cancer.org/docroot/CRI/content/CRI_2_6x_the_history_of_cancer_72.asp, March 2010. Last accessed May 14, 2010.
[5] S. F. Altekruse, C. L. Kosary, M. Krapcho, N. Neyman, R. Aminou, W. Waldron, J. Ruhl, N. Howlader, Z. Tatalovich, H. Cho, A. Mariotto, M. P. Eisner, D. R. Lewis, K. Cronin, H. S. Chen, E. J. Feuer, D. G. Stinchcomb, and B. K. Edwards (eds). SEER cancer statistics review, 1975-2007. National Cancer Institute, Bethesda, MD.
[6] B. Morson. The polyp-cancer sequence in the large bowel. Proceedings of Royal Society of Medicine, 67:451-457, 1974.
[7] A. I. Neugut, J. S. Jacobson, and I. DeVivo. Epidemiology of colorectal adenomatous polyps. Cancer Epidemiology, Biomarkers and Prevention, 2:159-176, March/April 1993.
[8] A. Leslie, F. A. Carey, N. R. Pratt, and R. J. C. Steele. The colorectal adenoma-carcinoma sequence. British Journal of Surgery, 89:845-860, 2002.
[9] F. Loeve, R. Boer, G. J. Oortmarssen, M. V. Ballegooijen, and J. D. F. Habbema. The MISCAN-COLON simulation model for the evaluation of colorectal cancer screening. Computers and Biomedical Research, 32:13-33, 1999.

[10] F. Loeve, M. L. Brown, R. Boer, M. van Ballegooijen, G. J. van Oortmarssen, and J. D. F. Habbema. Endoscopic colorectal cancer screening: a cost-saving analysis. J Natl Cancer Inst, 92:557-563, April 2000.
[11] S. Roberts, L. Wang, R. Klein, R. Ness, and R. Dittus. Development of a simulation model of colorectal cancer. ACM Transactions on Modeling and Computer Simulation, 18, December 2007.
[12] P. R. Harper and S. K. Jones. Mathematical models for the early detection and treatment of colorectal cancer. Health Care Management Science, 8:101-109, 2005.
[13] R. K. Khandker, J. D. Dulski, J. B. Kilpatrick, R. P. Ellis, J. B. Mitchell, and W. B. Baine. A decision model and cost-effectiveness analysis of colorectal cancer screening and surveillance guidelines for average-risk adults. International Journal of Technology Assessment in Health Care, 16:799-810, 2000.
[14] J. L. Wagner, S. Tunis, M. Brown, A. Ching, and R. Almeida. The cost-effectiveness of colorectal cancer screening in average-risk adults. In: Young GP, Rozen P, Levin B, eds. Prevention and early detection of colorectal cancer. Philadelphia: WB Saunders, pages 321-356, 1996.
[15] R. T. Clemen and C. J. Lacke. Analysis of colorectal cancer screening regimens. Health Care Management Science, 4:257-267, 2001.
[16] S. Vijan, E. W. Hwang, T. P. Hofer, and R. A. Hayward. Which colon cancer screening test? A comparison of costs, effectiveness, and compliance. The American Journal of Medicine, 111:593-601, December 2001.
[17] National Cancer Institute (NCI). CISNET - Cancer Intervention and Surveillance Modeling Network. http://cisnet.cancer.gov/colorectal/profiles.html, http://cisnet.cancer.gov/publications/Colorectal, 2007. Last accessed Feb 03, 2009.
[18] S. J. Winawer, R. H. Fletcher, L. Miller, F. Godlee, M. H. Stolar, C. D. Mulrow, S. H. Woolf, S. N. Glick, T. G. Ganiats, J. H. Bond, L. Rosen, J. G. Zapka, S. J. Olsen, F. M. Giardiello, J. E. Sisk, R. Van Antwerp, C. Brown-Davis, D. A. Marciniak, and R. J. Mayer. Colorectal cancer screening: Clinical guidelines and rationale. Gastroenterology, 112:594-642, 1997.
[19] F. Loeve, R. Boer, A. G. Zauber, M. van Ballegooijen, G. J. van Oortmarssen, S. J. Winawer, and J. D. F. Habbema. National polyp study data: Evidence for regression of adenomas. Int. Journal of Cancer, 111:633-639, 2004.
[20] H. Brenner, M. Hoffmeister, C. Stegmaier, G. Brenner, L. Altenhofen, and U. Haug. Risk of progression of advanced adenomas to colorectal cancer by age and sex: estimates based on 840149 screening colonoscopies. Gut, 56:1585-1589, 2007.

[21] J-M. Wong, M-F. Yen, M-S. Lai, S. W. Duffy, R. A. Smith, and T. H-H. Chen. Progression rates of colorectal cancer by Duke's stage in a high-risk group: Analysis of selective colorectal cancer screening. The Cancer Journal, 10:160-169, May/June 2004.
[22] R. M. Soetikno, T. Kaltenbach, R. V. Rouse, W. Park, A. Maheshwari, T. Sato, S. Matsui, and S. Friedland. Prevalence of nonpolypoid (flat and depressed) colorectal neoplasms in asymptomatic and symptomatic adults. Journal of American Medical Association, 299:1027-1035, March 2008.
[23] C. E. Dukes. The classification of cancer of the rectum. Journal of Pathological Bacteriology, 35:323-332, 1932.
[24] R. L. Schoen, P. F. Pinsky, J. L. Weissfeld, R. S. Bresalier, T. Church, P. Prorok, and J. K. Gohagan. Results of repeat sigmoidoscopy 3 years after a negative examination. Journal of the American Medical Association, 290:41-48, July 2003.
[25] G. C. Harewood and G. O. Lawlor. Incident rates of colonic neoplasia according to age and gender. J Clinical Gastroenterology, 39:894-899, December 2005.
[26] M. Noe, P. Schroy, R. Babayan, M-F. Demierre, and A. C. Geller. Increased cancer risk for individuals with a family history of prostate cancer, colorectal cancer, and melanoma and their associated recommendations and practices. Cancer Causes Control, 19:1-12, 2008.
[27] Epidemiology Resource Center, Indiana State Department of Health, and Division of Chronic/Communicable Disease. Cancer incidence and mortality in Indiana. http://www.in.gov/isdh/reports/cancerinc/2004/section2c.htm, 2004. Last accessed Feb 03, 2009.
[28] National Cancer Institute. Surveillance epidemiology and end results. http://seer.cancer.gov/faststats/selections.phpOutput, 2006. Last accessed Feb 03, 2009.
[29] U.S. Census Bureau. 2006-2008 American Community Survey 3-year estimates. http://factfinder.census.gov/servlet/ACSSAFFFacts. Last accessed Feb 03, 2009.
[30] National Cancer Institute. Prostate, lung, colorectal and ovarian cancer screening trial (PLCO). http://prevention.cancer.gov/programsresources/groups/ed/programs/plco. Last accessed Feb 03, 2009.
[31] American Cancer Society. Cancer facts and figures 2008. Atlanta: American Cancer Society, 2008.

[32] S. J. Winawer, R. H. Fletcher, D. Rex, J. Bond, R. Burt, J. Ferrucci, T. Ganiats, T. Levin, S. Woolf, D. Johnson, L. Kirk, S. Litin, and C. Simmang. Colorectal cancer screening and surveillance: Clinical guidelines and rationale, update based on new evidence. Gastroenterology, 124:544-560, 2003.
[33] L. Jenkins, D. Bradshaw, P. Cannon, J. Gierisch, and W. Freas. Colorectal cancer screening in local health departments - a pilot project of the North Carolina Advisory Committee on Cancer Coordination and Control and the North Carolina Division of Public Health. 2003.
[34] Epidemiology Resource Center, Indiana State Department of Health, and Division of Chronic/Communicable Disease. Cancer incidence and mortality in Indiana. http://www.in.gov/isdh/reports/cancerinc/2004/section2r.htm, 2004. Last accessed Feb 03, 2009.
[35] American Cancer Society. Cancer facts and figures 2008-2010. Atlanta: American Cancer Society, 2008.
[36] Epidemiology Resource Center, Indiana State Department of Health, and Division of Chronic/Communicable Disease. Cancer incidence and mortality in Indiana. http://www.in.gov/isdh/reports/cancerinc/2004/section2s.htm, 2004. Last accessed Feb 03, 2009.
[37] J. S. Mandel, J. H. Bond, T. R. Church, D. C. Snover, G. M. Bradley, L. M. Schuman, and F. Ederer. Reducing mortality from colorectal cancer by screening for fecal occult blood. The New England Journal of Medicine, 328:1365-1371, May 1993.
[38] J. S. Mandel, T. R. Church, F. Ederer, and J. H. Bond. Colorectal cancer mortality: Effectiveness of biennial screening for fecal occult blood. J. of National Cancer Institute, 91:434-437, March 1999.
[39] V. A. Gilbertsen, R. McHugh, L. Schuman, and S. E. Williams. The earlier detection of colorectal cancers. Cancer, 45:2899-2901, June 1980.
[40] Argonne National Lab (ANL). Repast - Recursive Porous Agent Simulation Toolkit. http://repast.sourceforge.net/, 2007. Last accessed Feb 03, 2009.
[41] G. Piatetsky-Shapiro and P. Tamayo. Microarray data mining: Facing the challenges. SIGKDD Explorations, 5.
[42] B. H. Mecham, D. Z. Wetmore, Z. Szallasi, Y. Sadovsky, I. Kohane, and T. J. Mariani. Increased measurement accuracy for sequence-verified microarray probes. Physiol Genomics, 18:308-315, 2004.

PAGE 120

[43]Y.Tu,G.Stolovitzky,andU.Klein.Quantitativenoiseanalysisforgene expressionmicroarray. ProceedingsoftheNationalAcademyofSciences 99:14031{14036,2002. [44]Aymetrix.http://www.aymetrix.com/corporate/media /genechip essentials/index.ax.Lastaccessed-May,2007. [45]R.Dror.Noisemodelsingenearrayanalysis. Reportinfulllmentofthe areaexamrequirementintheMITDepartmentofofElectricalEngineeringand ComputerScience ,June2001. [46]H.Vikalo,B.Hassibi,andA.Hassibi.Astatisticalmodelformicroarrays, optimalestimationalgorithms,andlimitsofperformance. IEEETransactions onSignalProcessing ,54:2444{2455,2006. [47]A.Hassibi,S.Zahedi,R.Navid,R.W.Dutton,andT.H.Lee.Biologicalshotnoiseandquantum-limitedsignal-to-noiseratioinanity-basedbiosensors. JournalofAppliedPhysics ,97,2005. [48]X.H.Wang,R.S.H.Istepanian,andY.H.Song.Microarrayimageenhancementbydenoisingusingstationarywavelettransform. IEEETransactionson Nanobioscience ,2:184{189,2003. [49]V.M.Aris,M.J.Cody,J.Cheng,J.J.Dermody,P.Soteropoulos,M.Recce, andP.P.Tolias.Noiselteringandnon-parametricanalysisofmicroarraydata underscoresdisriminatingmarkersoforal,prostate,lung,ovarianandbreast cancer. BMCBioinformatics ,5,2004. [50]X.CaiandG.B.Giannakis.Identifyingdierentiallyexpressedgenesinmicroarrayexperimentswithmodel-basedvarianceestimation. IEEETransactionson SignalProcessing ,54:2418{2426,2006. [51]R.Lukac,K.N.Plataniotis,B.Smolka,andA.N.Venetsanopoulos.Amultichannelorder-statistictechniquesforcdnamicroarrayimageprocessing. IEEE TransactionsonNanobioscience ,3,2004. [52]N.Kingsbury.Imageprocessingwithcomplexwavelets. Phil.Trans.R.Soc. Lond.A ,1999. [53]J.K.Romberg,H.Choi,andR.G.Baraniuk.Bayesiantree-structuredimage modelingusingwavelet-domainhiddenmarkovmodels. IEEETransactionson ImageProcessing ,10:1056{1068,2001. [54]F.Abramovich,T.C.Bailey,andT.Sapatinas.Waveletanalysisanditsstatisticalapplications. JournaloftheRoyalStatisticalSociety ,49,2000. 108

PAGE 121

[55]L.SendurandI.W.Selesnick.Bivariateshrinkagefunctionsforwavelet-based denoisisngexploitinginterscaledependency. IEEETransactionsonSignalProcessing ,50,2002. [56]Aymetrix.Datasheet:Genechiphumangenomearrays.Technicalreport, Aymetrix,2003-2004.Lastaccessed-January,2008. [57]S.G.Mallat.Atheoryformultiresolutionsignaldecomposition:Thewavelet representation. IEEETransactionsonPatternAnalysisandMachineIntelligence ,11,1989. [58]R.Ganesan,T.K.Das,andV.Venkataraman.Wavelet-basedmultiscalestatisticalprocessmonitoring:Aliteraturereview. IIETransactions ,36,2004. [59]N.G.Kingsbury.Complexwaveletsforshiftinvariantanalysisandlteringof signals. JournalofAppliedandComputationalHarmonicAnalysis ,10:234{ 253,2001. [60]Aymetrix.Statisticalalgorithmsdescriptiondocument. AymetrixInc.,CA, USA ,2002. [61]MottCancerCenter&ResearchInstituteandtheUniversityofSouthFlorida. SRCsofwarelibrary.http://morden.csee.usf.edu/software/.LastaccessedJanuary,2008. [62]I.Selesnick,S.Cai,K.Li,L.Sendur,andA.F.Abdelnour.Matlabimplementationofwavelettransforms.http://taco.poly.edu/WaveletSoftware/ denoise.html.Lastaccessed-January,2008. [63]UniversityofSouthFlorida.Researchcomputing.http://rc.usf.edu/.Last accessed-January,2008. [64]L.SendurandI.W.Selesnick.Bivariateshrinkagewithlocalvarianceestimation. IEEESignalProcessingLetters ,9:438{441,2002. [65]TheProteomicsCenteratChildren'sHospitalBoston.Introductiontoproteomics.http://www.childrenshospital.org/cfapps/research/data admin/ Site602/mainpageS602P0.html.Lastaccessed-Nov16th,2008. [66]G.Black.Mechanismsofalternativepre-messengerrnasplicing. AnnualReview ofBiochemistry ,72:291{336,2003. [67]SchlautmanJ.D.,W.Rozek,R.Stetler,R.L.Mosley,H.E.Gendelman,and P.Ciborowski.Multidimensionalproteinfractionationusingproteomelabpf 2dforprolingamyotrophiclateralsclerosisimmunity:Apreliminaryreport. ProteomeSci ,6,2008. 109

PAGE 122

[68]J.JessaniandB.F.Cravatt.Thedevelopmentandapplicationofmethodsfor activity-basedproteinproling. CurrentOpinioninChemicalBiology ,8:54{59, 2004. [69]H.J.Issaq,T.D.Veenstra,T.P.Conrads,andD.Felschow.Theseldi-tofms approachtoproteomics:Proteinprolingandbiomarkeridentication. BiochemicalandBiophysicalResearchCommunications ,292:587592,2002. [70]2007.http://www.emedicine.com/med/fulltopic/topic1698Garcia,AA.OvarianCancer.eMedicinefromWebMD.Accessedonline,11/01/2008. [71]R.Sutphen,Y.Xu,G.D.Wilbanks,J.Fiorica,E.C.GrendysJr.,J.P.LaPolla, H.Arango,M.S.Homan,M.Martino,K.Wakeley,D.Grin,R.BBlanco, A.B.Cantor,Y-J.Xiao,andJ.P.Krischer.Lysophospholipidsarepotential biomarkersofovariancancer. CancerEpidemiolBiomarkersPrev ,13:1185{ 1191,2004. [72]S.BodovitzandT.Joos.Theproteomicsbottleneck:strategiesforpreliminary validationofpotentialbiomarkersanddrugtargets. TrendsinBiotechnology 22:4{7,2004. [73]J.L.Luque-GarciaandT.A.Neubert.Samplepreparationforserum/plasma prolingandbiomarkeridenticationbymassspectrometry. JournalofChromatographyA ,1153:259{276,2007. [74]H.J.An,S.Miyamoto,K.S.Lancaster,C.Kirmiz,B.Li,K.S.Lam,G.S. Leiserowitz,andC.B.Lebrilla.Prolingofglycansinserumforthediscoveryof potentialbiomarkersforovariancancer. JournalofProteomeResearch ,5:1626{ 1635,2006. [75]N.Dossat,A.Mang,J.Solassol,W.Jacot,MaudelondeT.DaursJ-P.Lhermitte,L.,andN.Molinari.Comparisonofsupervisedclassicationmethodsfor proteinprolingincancerdiagnosis. CancerInformatics ,3:295{305,2007. [76]M.H.SimoniamandE.Betgovargez.Proteomeanalysisofhumanplasmawith theproteomelabPF2DsystemA1936A.2003. [77]A.Schramm,O.Apostolov,B.Sitek,K.Pfeier,K.Sthler,H.E.Meyer, W.Havers,andA.Eggert.Proteomics:techniquesandapplicationsincancerresearch. KlinPadiatr ,215,2003. [78]J.Reinders,U.Lewandrowski,J.Moebius,Y.Wagner,andA.Sickmann.Challengesinmass-spectrometrybasedproteomics. Proteomics ,4:3686{3703, 2004. 110

PAGE 123

[79]Y.K.Shin,H-J.Lee,J.S.Lee,andY-K.Paik.Proteomicanalysisofmammalian basicproteinsbyliquid-basedtwo-dimensionalcolumnchromatography. Proteomics ,6:1143{1150,2006. [80]Beckmancoulter:Fromtissuestotargets:Proteomelabpf2dproteinfractionationsystem,br9436a.Technicalreport,BeckmanCoulter,Inc.,2003. [81]H.J.Lee,M-J.Kang,E-Y.Lee,S.Y.Cho,H.Kim,andY.K.Paik.Applicationofapeptide-basedpf2dplatformforquantitativeproteomicsindisease biomarkerdiscovery. Proteomics ,8:3371{3381,2008. [82]D.G.Ward,N.Suggett,Y.Cheng,W.Wei,H.Johnson,L.J.Billingham, T.Ismail,M.J.O.Wakelam,P.J.Johnson,andA.Martin.Identication ofserumbiomarkersforcoloncancerbyproteomicanalysis. BrJCancer 94:1898{1905,2006. [83]H.B.Burke.Proteomics:Analysisofspectraldata. CancerInformatics ,1:15{ 24,2005. [84]A.Cruz-Marcelo,R.Guerra,amdLiY.Vannucci,M.,C.C.Lau,andT.K. Man.Comparisonofalgorithmsforpre-processingofseldi-tofmassspectrometrydata. Bioinformatics ,24:2129{2136,2008. [85]C.Yang,Z.He,andW.Yu.Comparisonofpublicpeakdetectionalgorithms formaldimassspectrometrydataanalysis. BMCBioinformatics ,10,2009. [86]J.S.Morris,K.R.Coombes,J.Koomen,K.A.Baggerly,andR.Kobayashi. Featureextractionandquanticationformassspectrometryinbiomedicalapplicationsusingthemeanspectrum. Bioinformatics ,21:1764{1775,2005. [87]P.Du,R.Sudha,M.B.Prystowsky,andR.H.Angeletti.Datareductionof isotope-resolvedlc-msspectra. Bioinformatics ,23:1394{1400,2007. [88]J.W.H.Wong,G.Cagney,andH.M.Cartwright.Specalignprocessingand alignmentofmassspectradatasets. Bioinformatics ,21:2088{2090,2005. [89]D.Mantini,F.Petrucci,D.Pieragostino,P.DelBoccio,M.D.Nicola,C.D. Ilio,G.Federici,P.Sacchetta,S.Comani,andA.Urbani.Limpic:acomputationalmethodfortheseparationofproteinmalditof-mssignalsfromnoise. Bioinformatics ,8,2007. [90]Y.Yasui,M.Pepe,M.L.Thompson,B.L.Adam,G.L.Wright,Y.Qu,J.D. Potter,M.Winget,M.Thornquist,andZ.Feng.Adata-analyticstrategyfor proteinbiomarkerdiscovery:prolingofhigh-dimensionalproteomicdatafor cancerdetection. Biostatistics ,4:449{463,2003. 111

PAGE 124

[91]M.Bellew,M.Coram,M.Fitzgibbon,M.Igra,T.Randolph,P.Wang,D.May, J.Eng,R.Fang,C.W.Lin,J.Z.Chen,D.Goodlett,J.Whiteaker,A.Paulovich, andM.McIntosh.Asuiteofalgorithmsforthecomprehensiveanalysisof complexproteinmixturesusinghigh-resolutionlc-ms. Bioinformatics ,22:1902{ 1909,2006. [92]Proteinchipsoftware3.1operationmanual.Technicalreport,CiphergenBiosystemsInc.,2002. [93]X.Li,R.Gentleman,X.Lu,Q.Shi,J.D.Iglehart,HarrisL.,andA.Miron. SELDI-TOFmassspectrometryproteindata ,volume1.SpringerNewYork, 2005. [94]C.A.Smith,E.J.Want,G.O.Maille,R.Abagyan,andG.Siuzdak.Xcms: processingmassspectrometrydataformetaboliteprolingusingnonlinearpeak alignment,matching,andidentication. AnalyticalChemistry ,78:779{787, 2006. [95]K.R.Coombes,S.Tsavachidis,J.S.Morris,K.A.Baggerly,M.C.Hung, andH.M.Kuerer.Improvedpeakdetectionandquanticationofmassspectrometrydataacquiredfromsurface-enhancedlaserdesorptionandionization bydenoisingspectrawiththeundecimateddiscretewavelettransform. Proteomics ,5:4107{4117,2005. [96]Y.V.Karpievitch,E.G.Hill,A.J.Smolka,J.S.Morris,K.R.Coombes,K.A. Baggerly,andJ.S.Almeida.Prepms:Tofmsdatagraphicalpreprocessingtool. Bioinformatics ,23:264{265,2007. [97]M.Katajamaa,J.Miettinen,andM.Oresic.Mzmine:Toolboxforprocessing andvisualizationofmassspectrometrybasedmolecularproledata. Bioinformatics ,22:634{636,2006. [98]P.Du,W.A.Kibbe,andS.M.Lin.Improvedpeakdetectioninmassspectrumbyincorporatingcontinuouswavelettransform-basedpatternmatching. Bioinformatics ,22:20592065,2006. [99]E.Lange,C.Gropl,K.Reinert,O.Kohlbacher,andA.Hildebrandt.Highaccuracypeakpickingofproteomicsdatausingwavelettechniques. Pacic SymposiumonBiocomputing ,11:243{254,2006. [100]S.Toppo,A.Roveri,M.P.Vitale,M.Zaccarin,E.Serain,E.Apostolidis, M.Gion,M.Maiorino,andF.Ursini.Mpa:Amultiplepeakalignmentalgorithmtoperformmultiplecomparisonsofliquid-phaseproteomicproles. Proteomics ,8:250{253,2008. 112

PAGE 125

[101]N.V.Nielsen,J.M.Cartensen,andJ.Smedsgaard.Aligningofsingleand multiplewavelengthchromatographicprolesforchemometricdataanalysis usingcorrelationoptimisedwarping. JournalofChromatographyA ,805:17{35, 1998. [102]D.Bylund,R.Danielsson,G.Malmquist,andK.Markides.Chromatographic alignmentbywarpinganddynamicprogrammingasapre-processingtoolfor parafacmodellingofliquidchromatographymassspectrometrydata. Journal ofChromatographyA ,961:237{244,2002. [103]R.J.O.Torgrip,M.Aberg,B.Karlberg,andS.P.Jacobsson.Peakalignment usingreducedsetmapping. JournalofChemometrics ,17:573{582,2003. [104]A.M.vanNederkassel,M.Daszykowski,P.H.CEilers,andY.V.Heyden.A comparisonofthreealgorithmsforchromatogramsalignment. JournalofChromatographyA ,1118:199{210,2006. [105]W.Yu,X.Li,J.Liu,B.Wu,K.R.Williams,andH.Zhao.Multiplepeakalignmentinsequentialdataanalysis:Ascale-space-basedapproach. IEEE/ACM TransactionsOnComputationalBiologyandBioinformatics ,3,2006. [106]P.S.Addison. TheIllustratedWaveletTransformHandbook .Instituteof Physics,BristolandPhiladelphia,2002. [107]J.A.Falkner,D.M.Veine,M.Kachman,A.Walker,J.R.Strahler,andP.C. Andrews.Validatedmaldi-tof/tofmassspectraforproteinstandards. Journal oftheAmericanSocietyforMassSpectrometry ,18:850{855,2007. 113

PAGE 126

ABOUT THE AUTHOR

Chaitra Gopalappa received her Ph.D. in Industrial Engineering in 2010 and a Masters in Industrial Engineering in 2006 from University of South Florida. She received a B.E. in Industrial Engineering and Management from Visveswaraiah Technological University, India, in 2002. Her research areas of interest include disease progression and intervention, bioinformatics, and health care systems engineering. Her methodological areas of interest include applied stochastic modeling, computational probability, applied optimization, and wavelet based signal processing.

Chaitra was a lead investigator in an interdisciplinary challenge grant awarded as part of 'The 2008-2009 Graduate Student Challenge Grants: Building Research Partnerships Across Disciplines, University of South Florida'. She was a visiting research assistant at Purdue University March 08-May 08, as part of Cancer Care Engineering project, Discovery Park, that was funded by Regenstrief Foundation, Indiana. Chaitra will join as a post-doctoral fellow in the Division of HIV/AIDS Prevention of Centers for Disease Control and Prevention at Atlanta, Georgia, starting August 2010.


Creator:
Gopalappa, Chaitra.
Title:
Three essays on analytical models to improve early detection of cancer [electronic resource] / by Chaitra Gopalappa.
Publication:
[Tampa, Fla] : University of South Florida, 2010.
Notes:
Title from PDF of title page.
Document formatted into pages; contains X pages.
Dissertation (PHD)--University of South Florida, 2010.
Includes bibliographical references.
Text (Electronic thesis) in PDF format.
Mode of access: World Wide Web.
System requirements: World Wide Web browser and PDF reader.
Advisor:
Tapas K. Das, Ph.D.
Subjects / Keywords:
Colorectal cancer; disease progression; polyp progression; applied probability; bioinformatics; protein identification; microarray denoising
Dissertations, Academic -- Industrial & Management Systems Engineering -- Masters -- USF
Host Item:
USF Electronic Theses and Dissertations.
URL:
http://digital.lib.usf.edu/?e14.4563