
Statistical learning and Behrens-Fisher distribution methods for heteroscedastic data in microarray analysis

Material Information

Title:
Statistical learning and Behrens-Fisher distribution methods for heteroscedastic data in microarray analysis
Physical Description:
Book
Language:
English
Creator:
Manandhr_Shrestha, Nabin
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:

Subjects

Subjects / Keywords:
Genes
False Discovery Rate
Multiple Testing
Correlation
Classification
Dissertations, Academic -- Mathematics and Statistics -- Doctoral -- USF   ( lcsh )
Genre:
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: The aim of the present study is to identify the genes that are differentially expressed between two conditions and to use them to predict the class of new samples from microarray data. Microarray data analysis poses many challenges to statisticians because of its high dimensionality and small sample size, the so-called "small n, large p" problem. Microarray data have been studied extensively by statisticians and geneticists. The data are usually assumed to follow a normal distribution with equal variances in the two conditions, but this assumption does not hold in general. Because the number of replications is very small, the sample estimates of the variances are not appropriate for testing. We therefore use a Bayesian approach to approximate the variances in the two conditions. Because the number of genes to be tested is usually large and the test is repeated thousands of times, there is also a multiplicity problem: when the hypothesis test is applied gene by gene to several thousand genes, there is a great chance of selecting false genes as differentially expressed, even when the significance level is set very small. For the test to be reliable, the probability of selecting true positives should be high. To control the false positives arising from multiple comparisons, we apply the False Discovery Rate (FDR) correction, in which the p-value of each gene is compared with its corresponding threshold. A gene is then said to be differentially expressed if its p-value is less than the threshold. We have developed a new method of selecting informative genes based on the Bayesian version of the Behrens-Fisher distribution, which assumes unequal variances in the two conditions.
Since the assumption of equal variances fails in most situations, and equal variance is a special case of unequal variance, we address the problem of finding differentially expressed genes in the unequal variance case. We have found that the developed method selects the truly expressed genes in simulated data, and we have compared it with recent methods such as Fox and Dimmic's t-test method and Tusher and Tibshirani's SAM method, among others. The next step of this research is to check whether the genes selected by the proposed Behrens-Fisher method are useful for the classification of samples. Combining the Behrens-Fisher gene selection method with other statistical learning methods, we have obtained better classification results. The reason is the method's ability to select genes using both prior knowledge and the data. In microarray data, because of the small sample size and the large number of variables, the variance estimates obtained from the sample are unreliable, in the sense that the sample covariance matrix is not positive definite and hence not invertible. We have therefore derived the Bayesian version of the Behrens-Fisher distribution to remove that insufficiency. The efficiency of the method has been demonstrated by applying it to three real microarray data sets and calculating the misclassification error rates on the corresponding test sets. Moreover, we have compared our results with other popular methods found in the literature, such as the Nearest Shrunken Centroid and Support Vector Machines methods. We have studied the classification performance of different classifiers before and after taking the correlation between the genes into account. The classification performance improved significantly once the correlation was accounted for. The classification performance of the different classifiers has been measured by the misclassification rates and the confusion matrix.
Another problem in the multiple testing of a large number of hypotheses is the correlation among the test statistics. We have taken the correlation between the test statistics into account. If there is no correlation, the shape of the normalized histogram of the test statistics is not affected. As shown by Efron, the degree of correlation among the test statistics either widens or shrinks the tails of the histogram of the test statistics. Thus the usual rejection region obtained from the significance level is not sufficient. The rejection region should be redefined accordingly, and it depends on the degree of correlation. The effect of the correlation on selecting the appropriate rejection region has also been studied.
Thesis:
Dissertation (Ph.D.)--University of South Florida, 2010.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by Nabin Manandhr_Shrestha.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains X pages.
General Note:
Includes vita.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E14-SFE0003506
usfldc handle - e14.3506
System ID:
SFS0027821:00001


Full Text

PAGE 1

for Heteroscedastic Data in Microarray Analysis
by
Nabin K. Manandhar Shrestha
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Mathematics and Statistics
College of Arts and Sciences
University of South Florida
Major Professor: Kandethody M. Ramachandran, Ph.D.
G. S. Ladde, Ph.D.
Marcus M. McWaters, Ph.D.
Tapas K. Das, Ph.D.
Date of Approval: March 29, 2010
Keywords: Genes, False Discovery Rate, Multiple Testing, Correlation, Classification
© Copyright 2010, Nabin K. Manandhar Shrestha

PAGE 3

Special thanks go to Professor Gordon Fox for his kind willingness to chair the defense of my dissertation. Last but not least, I am thankful to my parents, my uncle Badri K. Shrestha, my wife Sharmila and my son Sangam Shrestha for their constant support and encouragement.

PAGE 4

List of Figures
Abstract
1 Microarrays and Selection of Differentially Expressed Genes
1.1 Introduction
1.2 Gene Expression and Main Question of Interest
1.3 What is microarray and how does it work?
1.4 Oligonucleotide microarray
1.5 cDNA Microarray
1.6 Measuring the Expression Levels of a Gene
1.7 Statistical Methods for Differentially Expressed Genes
1.8 Methods for cDNA Data
1.9 Methods for Oligonucleotide Data
1.10 Assessing the Reliability of Tests
1.11 Conclusion
2 Behrens-Fisher distribution for selecting Differentially expressed Genes
2.1 Summary
2.2 Introduction
2.3 Multiple Testing
2.4 Measures of Erroneous Rejection of Null Hypotheses
2.5 Two-Sample Permutation Test

PAGE 5

2.7 Sampling Distribution for Non-homogeneous Variance
2.8 Bayesian Approach
2.9 Test Statistic
2.10 Calculation of Prior d.f. and Prior Variance
2.11 Estimation of Hyperparameters
2.12 Simulation and Result
2.13 Power Comparisons
2.14 Comparison of DE Genes using the Window Method with Other Methods
2.15 Comparison of Proposed Method with other Methods
2.16 Selection of Differentially Expressed Genes in the Golub Data
2.17 Transformation and Test of Normality Assumptions
2.18 Conclusion
3 Support Vector Machines and Other classification Methods
3.1 Summary
3.2 Introduction
3.3 Assumptions on Classification Methods
3.4 Support Vector Machines (SVM) Methods
3.5 Kernel Matrix and Kernel Tricks
3.6 High Dimensional Feature Space for Large p Small n
3.7 Nearest Shrunken Centroids Method
3.8 Weighted Voting Method
3.9 Dudoit's Multi-class Classification Method
3.10 Gene Selection by Behrens-Fisher Statistic
3.11 Choosing the number of genes required for classification
3.12 Simulation and Results
3.13 Datasets Pre-processing and Filtering
3.14 MLL Leukemia Data
3.15 Golub Leukemia Data
3.16 Subtypes of Pediatric Acute Lymphoblastic Leukemia

PAGE 6

4 Selection of Differentially Expressed Genes in Correlated Statistics
4.1 Summary
4.2 Introduction
4.3 Effects of Correlations
4.4 Application
4.5 Computation of Effect of Correlations
4.6 Conditional and Unconditional p-values
4.7 Dispersion Parameter
4.8 Conclusion
5 Performance of a Classifier taking into account of Correlated Genes
5.1 Performance of Classifiers
5.2 McNemar's Test and Confusion Matrix
5.3 Filtering the Highly Correlated Genes
5.4 Classification using SVM
5.5 Application
5.6 Confusion Matrices
5.7 Binary Accuracy Measures
5.8 Classification using the genes selected by the Correlated test Statistics
5.9 Conclusion
6 Future Research
References
About the Author (End Page)

PAGE 7

2.2 Table of Power Comparison
2.3 Comparison of BF test with other tests based on the proportion of actually DE genes selected in Window Method
2.4 Comparison of BF test with other tests based on the proportion of actually DE genes selected in Similar Variance Method
2.5 Comparison of BF test with other tests based on the proportion of actually DE genes selected in Resampling Method
2.6 Comparison of three methods (I = Window Method, II = Similar Variance Method, III = Resampling Method) according to Proportion of truly DE genes selected, FDR and FNR
2.7 Dependence of Variance on Expression level
2.8 Comparison of Differentially expressed genes in Golub data controlling the FDR at the level 0.05
3.1 Accuracy Rate for the Simulated Test data
3.2 Comparison of Classification Performance on MLL Leukemia Data
3.3 Comparison Classification of Golub Data using all the genes
3.4 Comparison Classification of Golub Data using BF selected genes
3.5 Comparison of Classification Performance on Golub Data
3.6 Comparison of Classification Performance on ALL-7 Data
3.7 Confusion matrix for the MLL-3 training data by Dudoit method, and Golub Test Data by NSC Method
3.8 Confusion matrix for the ALL-7 test data by SCRDA method

PAGE 8

4.2 Effect of Correlation on the standard deviations of the middle and tail count of the transformed test statistics
4.3 Effect of correlation on p-value on test statistics in Golub data
5.1 Contingency table for misclassification
5.2 Confusion Matrix for misclassification
5.3 Classification of Golub Data taking into Correlation among the genes
5.4 Comparison of Classification Performance on Golub Data after taking into correlation between the genes. Different percentage points were used to determine the optimal cut-point
5.5 Comparison of Classification Performance on ALL-7 Data using the genes Selected by the BF method and taking into correlation between the genes. The classes were compared using one versus another
5.6 Confusion Matrices of Test samples of Golub Data by Weighted Vote, SVM and Dudoit methods
5.7 Confusion Matrix of MLL data applying the Dudoit Classification method after taking correlation
5.8 Confusion Matrix of MLL data applying the Weighted Vote Classification method after taking correlation
5.9 Confusion Matrix of MLL data applying the SVM Classification method after taking correlation
5.10 Contingency table for misclassification of Golub Test Data
5.11 Contingency table for misclassification of MLL Test Data
5.12 Confusion matrix derived accuracy measures, N = a + b + c + d
5.13 Accuracy measures for the Wt. Vote, SVM, Logistic Discrimination and Dudoit Multi-class Classifiers
5.14 Classification of Golub Data taking into Correlated test statistics

PAGE 9

2.2 Power of the Proposed Method at m = n = 3
2.3 Power of the Proposed Method at m = n = 5
2.4 Graph of DE genes
2.5 Histograms of Unprocessed Data Golub Data
2.6 Histogram of log-2 transformed Golub Data
2.7 QQ-Plot of Golub Data
2.8 Q-Q plot of some of the genes before Yeo-Johnson Transformation
2.9 Histogram of the T-values obtained by the BF Method
2.10 Q-Q plot of some of the genes after Yeo-Johnson Transformation
3.1 Histogram of Pre-processed AML Data
3.2 Histogram of ALL, MLL and AML Training and Testing Data after Pre-processing
3.3 Cross-Validation Error of Nearest Shrunken Centroid Method for MLL data
3.4 Error Comparison for Golub data
4.1 Histogram and the fitted Cauchy density curve for the BF test statistics
4.2 Histogram of transformed test-statistics
4.3 Density, g(), of correlations between the genes

PAGE 14

1. Sense strand
2. Anti-sense strand

Genes are transcribed into RNA called messenger RNA (mRNA) and are translated to form proteins. The RNA in the intermediate stage is called mRNA because it is used as a platform to form proteins from DNA and to pass the information to proteins

PAGE 15

1. Transcription: The process of converting DNA into the mRNA.
2. Translation: The process of converting mRNA into protein.

The information transfer from DNA to mRNA and mRNA to protein is called the "Central Dogma of Biology."

1.2 Gene Expression and Main Question of Interest

PAGE 16

1. cDNA microarray
2. Oligonucleotide microarray

1.4 Oligonucleotide microarray

Oligonucleotide arrays are used to measure the abundance of mRNA transcripts for many genes simultaneously. Between 11 and 20 perfect match (PM) and mismatch (MM) probe pairs are used to measure the expression level of each gene. Here is an example

PAGE 17

5'...........AGGGTGCCCCTTTGAAA......3' (sense strand)
3'...........TCCCACGGGGAAACTTT.....5' (antisense strand)
to
5'...........AGGGUGCCCCUUUGAAA......3'
3'...........UCCCACGGGGAAACUUU......5'
transcription into mRNA

In the sequence, the only difference between DNA and RNA is that the base thymine (T) is replaced by the base uracil (U).

Examples of PM and MM probes:
PM probe: ATGATCTCGAATAGCGTGCGCGAAT
MM probe: ATGATCTCGAATTGCGTGCGCGAAT

This is an example of a probe pair. Between 11 and 20 such probe pairs of 25 bases are chosen in each spot of the microarray to get the expression of a gene. The polymer so formed is also called a probe for a gene. The two probes are called complementary to each other. The only difference between the PM probe and the MM probe is that the middle base (A in the above example) is flipped to its complementary base T. Then a simple or weighted/robust average of the difference PM - MM over all probe pairs is taken. This is called the average difference, which measures the abundance of mRNA in the oligonucleotide Affymetrix data. The human genome has about $2.8 \times 10^9$ base pairs and it encodes at least 40,000 genes.
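The probe-pair construction and the simple average difference described above can be sketched in a few lines of Python. This is a minimal illustration, not the Affymetrix algorithm: the function names `mismatch_probe` and `average_difference` are hypothetical, and only the unweighted average is shown.

```python
# Complement of each DNA base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm: str) -> str:
    """Build the MM probe by flipping the middle base of the PM probe
    to its complementary base (hypothetical helper)."""
    mid = len(pm) // 2  # index 12 for a 25-base probe, i.e. the 13th base
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

def average_difference(pm_intensities, mm_intensities):
    """Simple (unweighted) average of PM - MM over all probe pairs."""
    diffs = [p - m for p, m in zip(pm_intensities, mm_intensities)]
    return sum(diffs) / len(diffs)

pm = "ATGATCTCGAATAGCGTGCGCGAAT"  # PM probe from the example in the text
mm = mismatch_probe(pm)
```

Applied to the PM probe from the text, `mismatch_probe` reproduces the MM probe listed there.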

PAGE 18

The advantage of cDNA microarrays is that they can be prepared directly from the isolated clones. Once the set of corresponding PCR products has been generated, microarrays can be created in multiple versions containing the entire set of cDNA sequences, resulting in large-scale arrays for the identification of differentially expressed genes of interest. They are less expensive to prepare. The cross-hybridization between homologous sequences is problematic for cDNA microarrays. The advantage of oligonucleotide microarrays is that they can be synthesized either in plates or directly on solid surfaces, making them easier to prepare. They are expensive, and the probes in such an array can be designed to represent unique gene sequences so that the cross-hybridization between related gene sequences is minimized to a degree dependent upon the completeness of the available sequence information.

Microarrays are applied in different situations. They help to find the genes that are differentially expressed in different conditions. So, they have enormous importance in the fields of biomedicine and genomics. For example, as mentioned in the paper of Efron et al. [12], some cancer patients have severe life-threatening reactions to radiation treatment. So, it is important to recognize the basis of this sensitivity, so that such

PAGE 19

To measure the expression level, the hybridized microarray is excited by the laser and scanned at the wavelengths suitable for the detection of both red and green intensities. From the fluorescence intensities and color of each spot/gene, we measure the expression level of the gene in the spot. Suppose that for gene g,
$R_g$ = expression level (fluorescent intensity) of gene g in the query sample,
$G_g$ = expression level (fluorescent intensity) of gene g in the reference sample, and
$T_g = R_g / G_g$

PAGE 20

Similarly, for two samples with replication, we measure the expression levels of the green and red intensities of the same gene at more than one spot, and we get the values of expression for two different conditions.

Quality filtering is implemented to produce intensity measurements of good quality. For this, one uses multiple spot replication slides. Multiple spotting of target DNA on a slide provides a means to assess the quality of the data for a gene on that slide. Suppose each gene is spotted p times on the slide. For each spot, a ratio of Cy3 and Cy5 intensity is calculated as $m = Cy3/Cy5$. Let $CV = \sigma/\bar{m}$ be the coefficient of variation (standard deviation divided by mean) of the set of ratios $m_1, m_2, \ldots, m_p$ on the multiple spots. The quality of the data on the expression level of each gene is inversely related to its CV. For each gene, a window subset containing the 50 genes whose mean intensities are closest to the gene of interest is constructed. The CV of each gene is calculated and the CVs are sorted in increasing order. If the CV of the gene of interest falls within the top 10% of the CVs, then we discard this gene as having poor quality.

1.7 Statistical Methods for Differentially Expressed Genes

PAGE 21

Early analysis of microarray data (Schena et al., [38]) relied on fold change cut-offs to identify differentially expressed genes. Typically a fold change equal to

PAGE 22

Let $m_g = \log_2 (R_g / G_g)$

A slightly more sophisticated approach involves calculating the mean and the standard deviation of the distribution of the ratio $m_g$ and defining the global fold change difference and confidence. The ratio-intensity plot reveals that the data has more variability at lower intensities and less variability at higher intensities. So, using a sliding window at each gene, we can assess the local structure of the data to determine whether the gene is differentially expressed. Let $\bar{x}_g$ and $s_g$ be the mean and standard deviation of the log-2 ratio of gene g calculated by taking all the genes within the window; then

$z_g = \frac{m_g - \bar{x}_g}{s_g}$

is normally distributed with mean zero and standard deviation 1. Declare the gene as differentially expressed if $|z_g| > 1.96$. At lower intensities, $s_g$ will be bigger, which allows changes to be identified where the data is more variable.

Kerr et al. [22] introduced the use of ANOVA models that accounted for array, dye, and treatment effects for cDNA arrays. In this fashion, normalization was accomplished intrinsically without preliminary data manipulation. The model they proposed can be written as

$y_{ijkg} = \mu + A_i + T_j + D_k + G_g + (AG)_{ig} + (TG)_{jg} + \epsilon_{ijkg}$   (1.8.3)
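The sliding-window z-score described in this section can be illustrated with a short Python sketch. It is only a rough rendering of the idea under stated assumptions: genes are ordered by an intensity value, the window is taken as the `half_width` nearest neighbours on each side (a hypothetical parameter, the text does not give the window size for this step), and the gene itself is included in its window.

```python
import statistics

def window_z_scores(log_ratios, intensities, half_width=25):
    """z-score of each gene's log-2 ratio relative to the genes whose
    intensities are closest to it (assumed window definition)."""
    # Rank genes by intensity so that a window holds neighbours in intensity.
    order = sorted(range(len(log_ratios)), key=lambda g: intensities[g])
    z = [0.0] * len(log_ratios)
    for pos, g in enumerate(order):
        lo = max(0, pos - half_width)
        hi = min(len(order), pos + half_width + 1)
        window = [log_ratios[order[k]] for k in range(lo, hi)]
        mean = statistics.fmean(window)
        sd = statistics.stdev(window)  # needs at least two genes per window
        z[g] = (log_ratios[g] - mean) / sd if sd > 0 else 0.0
    return z

# Toy data: gene 5 has a much larger log ratio than its neighbours.
ratios = [0.0, 0.1, -0.1, 0.05, -0.05, 3.0, 0.02, -0.02, 0.08, -0.08]
intens = list(range(10))
zs = window_z_scores(ratios, intens, half_width=10)
flagged = [g for g, value in enumerate(zs) if abs(value) > 1.96]
```

Here the cut-off 1.96 is the one stated in the text; with a real data set the window would be much narrower than the whole array.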

PAGE 23

where the $\bar{x}$ are the means and $s$ is the pooled sample standard deviation.

Because of the small sample sizes it is hard to verify the underlying assumption of normality, but with the common technologies this assumption is reasonable for the logarithm of the expression levels [1], [13], [11]. Baldi and Long [1] proposed a Bayesian version of the t-statistic using priors. More specifically, if $x^c_1, x^c_2, \ldots, x^c_{n_c}$ and $y^t_1, y^t_2, \ldots, y^t_{n_t}$ are the log-transformed expression levels in the control and treatment conditions respectively, they assumed that the data have been transformed into a form such that the normality assumption holds. So $x \sim N(\mu, \sigma^2)$. The priors for the means and variances have been chosen as

where IG is the scaled inverse gamma pdf with degrees of freedom $\nu_0 > 0$ and scale $\sigma_0 > 0$

PAGE 24

Other researchers used the log-normal and gamma-gamma models to detect the DE genes (Newton et al., [29]). Since a small variance gives rise to a large t-statistic, the empirical Bayes (EB) method of analyzing microarray data was used by Efron et al. [12] without making any distributional assumption about the data. They slightly tuned the t-statistic by adding a suitable constant, $a_0$, which is generally taken as the 90th percentile of the standard deviations, to the denominator of the Z-score obtained as in the t-test,

$Z = \frac{D_i}{a_0 + S_i}$

The densities $f_0(z)$ and $f_1(z)$ were estimated from the data without assuming any distribution, so it was called the Empirical Bayes approach. The posterior probabilities can be written as

$p_1(z) = 1 - \frac{p_0 f_0(z)}{f(z)}$

PAGE 25

and the priors $p_0$ and $p_1$ were estimated by

$p_1 \geq 1 - \min_z \frac{f(z)}{f_0(z)}$

Because the estimate of the sample variance is not reliable for small samples, especially in the case of microarray data, Tusher et al. (2001) [48] proposed the significance analysis of microarrays (SAM) method to stabilize the effect of large variances arising from genes with low expression. In this method, the SAM statistic is the same as in (1.9.4), but a small fudge factor $s_0$ is added in the denominator. The value of the fudge factor is chosen so that the coefficient of variation of the t-statistics is constant as a function of the standard deviations. Generally, the 90th percentile of the $s_i$ is used as the fudge factor. After estimating the fudge factor, the t-scores are ranked in decreasing order, i.e. $t_{(1)} \geq t_{(2)} \geq \cdots \geq t_{(p)}$, where p is the number of genes. To select the threshold such that genes with t-scores greater and/or smaller than the threshold are declared differentially expressed, SAM uses permutations of the columns (arrays). The first $n_1$ of the permuted columns are taken to be from condition 1 and the remaining columns are taken to be from condition 2. We permute the arrays, say, B times. In each permutation $b = 1, 2, \ldots, B$, the t-scores thus obtained are ranked, say $t^b_{(1)} \geq t^b_{(2)} \geq \cdots \geq t^b_{(p)}$. The expected order statistic of gene g is calculated by

$t_E(g) = \frac{1}{B} \sum_{b=1}^{B} t^b_{(g)}$
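The SAM steps above (a t-like score with a fudge factor, followed by averaging the ranked scores over column permutations) can be sketched as follows. This is an illustrative sketch, not the reference SAM implementation: the function names are hypothetical, the score is assumed to be (mean of condition 2 minus mean of condition 1) divided by (pooled standard error plus $s_0$), and the fudge factor is passed in rather than estimated.

```python
import math
import random
import statistics

def sam_scores(data, n1, s0):
    """Score each gene (row); the first n1 columns are condition 1."""
    scores = []
    for row in data:
        g1, g2 = row[:n1], row[n1:]
        n2 = len(g2)
        m1, m2 = statistics.fmean(g1), statistics.fmean(g2)
        ss = sum((v - m1) ** 2 for v in g1) + sum((v - m2) ** 2 for v in g2)
        s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
        scores.append((m2 - m1) / (s + s0))  # fudge factor s0 in denominator
    return scores

def expected_order_statistics(data, n1, s0, n_perm=200, seed=0):
    """Average the ranked scores over random permutations of the columns."""
    rng = random.Random(seed)
    n_cols = len(data[0])
    totals = [0.0] * len(data)
    for _ in range(n_perm):
        cols = list(range(n_cols))
        rng.shuffle(cols)
        permuted = [[row[c] for c in cols] for row in data]
        ranked = sorted(sam_scores(permuted, n1, s0), reverse=True)
        for rank, d in enumerate(ranked):
            totals[rank] += d
    return [t / n_perm for t in totals]

# Two toy genes, three arrays per condition.
data = [[1, 2, 3, 7, 8, 9], [5, 5.5, 4.5, 5, 4.8, 5.2]]
observed = sam_scores(data, n1=3, s0=0.5)
expected = expected_order_statistics(data, n1=3, s0=0.5, n_perm=50)
```

In practice the observed ranked scores are compared with `expected` to pick the threshold; that comparison is not shown here.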

PAGE 26

1. Proportion of truly differentially expressed (DE) genes
2. Distribution of true differences
3. Measurement variability
4. Sample size.

What is the relation between sample size and FDR? Any statistical testing procedure applied on a gene-by-gene basis is characterized as follows:

PAGE 27

Pawitan et al. [32] have shown that, if we declare the top $(1 - p_0) \times 100\%$ of genes as DE, then FDR = FNR. For a microarray test, generally one expects FDR = FNR. To answer how the sample size affects the different rates (i.e. FDR, FNR, sensitivity), they proved that, as the sample size increases, the FDR decreases and the sensitivity (power) increases. FNR and sensitivity total 1, so the FNR decreases as the sample size increases. Given a true proportion of non-DE genes, the paper discusses the different rates as a function of the critical values. It then discusses the effect of sample size on the different rates. It also discusses the different rates as a function of the percentage of significant genes (obtained according to the top ranking genes). When one declares a small proportion of top genes as differentially expressed, it can lead to low sensitivity and large FNR. The sensitivity of a test increases with the number of samples per group. One can get high sensitivity by declaring a larger proportion of top genes, but the small FNR comes at the price of a high FDR. So, declaring a few top genes as DE is not by itself a solution to the problem of gene selection. What should one expect if one declares genes on the basis of critical values, as with the t-statistic? When one declares genes on the basis of a two-sided test and critical values, as in the t-test, then again the problem is the high FDR. In this case one must have a big enough sample size to get high sensitivity and low FDR. It does not work for small sample sizes such as 5 samples per group, but works well for 30 samples per group. Furthermore, it depends on the true proportion $p_0$ of non-DE genes. If $p_0$ is 0.99, this method does not produce a satisfactory result even when the sample size is large. On the other hand, if $p_0$ is small, say 0.9, then it works well. If one chooses the method of fold change, one must choose a fold change of at least 3 times the standard deviation to get a better result.

PAGE 30

There are two inherent problems in microarray experiments: first, the number of replications is very small, and second, the number of genes to be tested is usually large and the test is to be repeated thousands of times. Since a small sample size is very common in microarray studies, the sample estimates of variances are not appropriate for the testing. The variance of genes depends on the expression level [48], [35], [1]. Therefore, we have to consider the Bayesian approach to approximate the variances in two conditions. To remove the second defect, arising from multiple comparison, we use the False Discovery Rate (FDR) correction [54]. Applying the hypothesis test repeatedly gene by gene for several thousands of genes, there is a great chance of selecting false genes as differentially expressed, even though the significance level is set very small. For the test to be reliable, the probability of selecting true positives should be high. To control the false positive rate, we have applied the FDR correction, in which the p-value for each gene is compared with its corresponding threshold. A gene is then said to be differentially expressed if the p-value is less than the threshold.

The Behrens-Fisher problem arises when one seeks to make inferences about the means of two normal populations without assuming the variances are equal. In many practical situations the populations do not have the same variances. Although Satterthwaite's t-test deals with the unequal variances case, the degrees of freedom are small and hence it does not produce a better estimate of the variances [1]. Here, we

PAGE 31

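Let $x = (x_1, x_2, \ldots, x_m)$ and $y = (y_1, y_2, \ldots, y_n)$ be two independent samples from two normal populations with means $\mu_x$ and $\mu_y$ and equal variances $\sigma_x^2 = \sigma_y^2 = \sigma^2$ respectively. Let $\bar{x}$ and $\bar{y}$ be the means and $s_x^2$ and $s_y^2$ the variances of x and y respectively, and let

$s^2 = \frac{(m-1)s_x^2 + (n-1)s_y^2}{m+n-2}$

be the pooled variance.

Now, let the samples x and y be from normal distributions with unequal variances $\sigma_x^2$ and $\sigma_y^2$ respectively. In this case neither a pivotal statistic nor an exact confidence interval procedure exists [23]. We can take a statistic

$t = \frac{\bar{y} - \bar{x}}{\sqrt{s_x^2/m + s_y^2/n}}$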
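To make the contrast between the equal-variance and unequal-variance statistics concrete, the sketch below computes both for one gene. It assumes the standard pooled t-statistic and the standard Welch statistic with Satterthwaite's degrees of freedom; the function names are hypothetical and no p-values are computed.

```python
import math
import statistics

def pooled_t(x, y):
    """Equal-variance t-statistic and its degrees of freedom m + n - 2."""
    m, n = len(x), len(y)
    s2 = ((m - 1) * statistics.variance(x)
          + (n - 1) * statistics.variance(y)) / (m + n - 2)
    t = (statistics.fmean(y) - statistics.fmean(x)) / math.sqrt(s2 * (1 / m + 1 / n))
    return t, m + n - 2

def welch_t(x, y):
    """Unequal-variance statistic with Satterthwaite's approximate df."""
    m, n = len(x), len(y)
    vx = statistics.variance(x) / m
    vy = statistics.variance(y) / n
    t = (statistics.fmean(y) - statistics.fmean(x)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
    return t, df

x = [1.0, 2.0, 3.0]   # toy control replicates
y = [2.0, 4.0, 9.0]   # toy treatment replicates, much more variable
t_pooled, df_pooled = pooled_t(x, y)
t_welch, df_welch = welch_t(x, y)
```

With equal group sizes the two statistics coincide, but the Satterthwaite degrees of freedom (about 2.3 here, against 4 for the pooled test) show how little information three replicates carry when the variances differ.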

PAGE 32

                Declared non-DE    Declared DE    Total
True non-DE     m0 - V             V              m0
True DE         p - m0 - S         S              p - m0
Total           p - R              R              p

Table 2.1: Multiple testing Procedure in Simultaneous Hypothesis Testing

While testing the simultaneous hypotheses, one gets the numbers of hypotheses shown in the table. As in testing a single hypothesis, the idea here is to control the number of false positives, V. This number is a random variable whose value differs from one test to another. Let $\alpha$ be the type-I error that one makes when testing a gene i. Then, $\alpha$ = probability of rejecting $H_0$ when in fact $H_0$ is true. In terms of the above hypothesis, $\alpha$ = probability of selecting a false gene as differentially expressed. Such a gene is called a false positive. So, the expected number of false positives from a set of p genes is $p\alpha$. In other words, since p is a very big integer, the number of false positives in the experiment is very big, even though we choose $\alpha$ very small.

This means that the probability of not selecting a false gene as differentially expressed is $1 - \alpha$. In other words, the probability of making the right decision for a gene is $1 - \alpha$. Hence, assuming the tests are independent, the probability of making the correct decision for all p genes = (probability of making the correct decision for gene 1) · (probability of making the correct decision for gene 2) ··· (probability of making the correct decision for gene p), which is $(1 - \alpha)^p$. From this we see that, as p increases, the probability of making the correct decision decreases. So, the probability of at least one false positive somewhere is:

Type I Error $= 1 - (1 - \alpha)^p$   (2.3.2)
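A short numerical illustration of equation (2.3.2) and of the expected number of false positives $p\alpha$ may help. The function names below are hypothetical, and the calculation assumes independent tests, as in the derivation above.

```python
def expected_false_positives(alpha, p):
    """Expected number of false positives among p true-null tests."""
    return alpha * p

def familywise_error(alpha, p):
    """Probability of at least one false positive in p independent tests."""
    return 1 - (1 - alpha) ** p

# The error probability grows quickly with the number of genes tested.
for p in (1, 10, 100, 1000):
    print(p, round(familywise_error(0.05, p), 4))
```

Even at $\alpha = 0.01$, testing 10,000 genes gives about 100 expected false positives, which is why a per-gene level cannot be used unadjusted.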

PAGE 33

The above equation (2.3.3) is called the Šidák correction for multiple testing. This means that if we want to achieve the global significance level $\alpha_g$, we have to set the significance level for each gene as

$\alpha = 1 - (1 - \alpha_g)^{1/p}$

or,

$\alpha = \frac{\alpha_g}{p}$   (2.3.5)

This means that, instead, we have to set the significance level as $\alpha_g$ divided by the number of genes (tests to be performed). In other words, if the error measure is FWER, then the probability that at least one false positive gene will be selected by the rule when we set $\alpha = \alpha_g / p$ does not exceed $\alpha_g$. From the above, the significance levels are the same for each gene, regardless of their p-values. To avoid the situation

PAGE 34

1. Choose the global significance level $\alpha$.
2. Order the genes according to their p-values in ascending order.
3. Compare the p-value ($p_i$) of the i-th gene in the ordered list with the threshold $\alpha_i = \frac{\alpha}{p - i + 1}$.
4. Report the gene i in the order as significantly expressed if $p_i < \alpha_i$
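The single-step corrections of this section and the step-down procedure listed above can be sketched together. The thresholds follow the formulas in the text; stopping at the first gene that fails its threshold is the usual Holm step-down convention and is assumed here, since the text of step 4 is cut off. All function names are hypothetical.

```python
def sidak_level(alpha_global, p):
    """Per-gene level that gives the global level for p independent tests."""
    return 1 - (1 - alpha_global) ** (1 / p)

def bonferroni_level(alpha_global, p):
    """Per-gene level: global level divided by the number of tests."""
    return alpha_global / p

def step_down_select(p_values, alpha):
    """Compare the i-th smallest p-value with alpha / (p - i + 1) and
    stop at the first failure (assumed Holm convention)."""
    p = len(p_values)
    order = sorted(range(p), key=lambda g: p_values[g])
    selected = []
    for i, g in enumerate(order, start=1):
        if p_values[g] < alpha / (p - i + 1):
            selected.append(g)
        else:
            break
    return selected

p_values = [0.001, 0.04, 0.012, 0.5]   # toy p-values for four genes
selected = step_down_select(p_values, alpha=0.05)
```

For these four toy p-values the thresholds are 0.0125, 0.0167, 0.025 and 0.05, so the genes with p-values 0.001 and 0.012 are reported and the procedure stops at 0.04.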
PAGE 35

False Discovery Rate (FDR)

The FWER is defined as the probability of at least one false positive:

$FWER = Prob(V \geq 1)$.

The FDR is defined as

$FDR = E\left[\frac{V}{R} \,\Big|\, R > 0\right] Prob[R > 0]$.
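One common way to control the FDR defined above is the Benjamini-Hochberg step-up rule, in which the i-th smallest p-value is compared with $iq/p$. The sketch below assumes that rule; the text does not spell out which FDR procedure it uses, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, q):
    """Return the indices of genes declared DE at FDR level q using the
    Benjamini-Hochberg step-up rule (assumed procedure)."""
    p = len(p_values)
    order = sorted(range(p), key=lambda g: p_values[g])
    k = 0  # largest rank i whose p-value is at most i * q / p
    for i, g in enumerate(order, start=1):
        if p_values[g] <= i * q / p:
            k = i
    return sorted(order[:k])

p_values = [0.001, 0.04, 0.012, 0.5]   # toy p-values for four genes
de_genes = benjamini_hochberg(p_values, q=0.05)
```

Unlike a step-down rule, the step-up rule keeps every gene ranked at or below the largest rank that passes, even if an earlier rank failed.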

PAGE 36

Storey and Tibshirani [43] noted that an adjustment is only necessary when there are positive findings, i.e. there are cases when the null hypotheses are rejected. They proposed a modified version of the FDR, called the positive false discovery rate (pFDR):

$pFDR = E\left[\frac{V}{R} \,\Big|\, R > 0\right]$

Since the number of false positives is unknown, we have to estimate V in order to estimate the pFDR. Suppose we have a data set of N genes in two conditions replicated $n_1$ and $n_2$ times. Then change the class labels by permuting each gene B times. Suppose that, on average over the B permuted data sets, $\bar{R}$ genes have p-values smaller than the threshold $\alpha$ (where $\alpha$ is the gene-wise significance level). Then the estimate of the pFDR is:

PAGE 37

If any one of the above assumptions fails, then it is likely that the t-test cannot give an accurate result. Furthermore, the sample size in a microarray experiment is usually too small. So, the random assignment of the observed intensities of each gene i to the two different conditions provides the basis for drawing statistical inference about the effect of the treatment over the normal condition. The argument goes as follows: Consider a single gene i. If there is no difference between the normal and treatment conditions (i.e., the gene is expressed similarly in both conditions), then every random assignment of the observed values to the two conditions would have an equal chance of being observed in the study.

For example, let there be 3 values under the normal condition and 2 values under the treatment condition for gene i. Then there are $C(5, 2) = 10$ different ways altogether of assigning the five values to the two conditions. These are the types of data sets one would expect to observe if gene i is equally expressed under the two conditions. Since we have permuted the labels of each gene in the two conditions, we expect that the mean of two
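The counting argument in the example (3 normal values, 2 treatment values, $C(5,2) = 10$ relabelings) can be turned into an exact permutation test. The sketch below assumes the absolute difference of group means as the test statistic, which the text does not specify; the function name is hypothetical.

```python
from itertools import combinations
import statistics

def permutation_p_value(normal, treatment):
    """Exact two-sample permutation p-value using the absolute difference
    of group means (assumed statistic). Returns (p_value, n_relabelings)."""
    pooled = list(normal) + list(treatment)
    k = len(treatment)
    observed = abs(statistics.fmean(treatment) - statistics.fmean(normal))
    count = total = 0
    # Enumerate every way of choosing which k values are labeled "treatment".
    for idx in combinations(range(len(pooled)), k):
        chosen = set(idx)
        t = [pooled[i] for i in idx]
        c = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(statistics.fmean(t) - statistics.fmean(c)) >= observed - 1e-12:
            count += 1
    return count / total, total

p_value, n_relabelings = permutation_p_value([1.0, 1.2, 0.9], [3.0, 3.4])
```

With only 10 possible relabelings the smallest attainable p-value is 1/10, so with such small groups a single gene can never reach the 5% level by permutation alone.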

PAGE 38

With this p-value, reject the null hypothesis at the 5% significance level, i.e. if p-value < 0.05.

2.6 Significance Analysis of Microarrays (SAM)

where $\bar{x}_1(i)$ and $\bar{x}_2(i)$ are the average levels of expression for gene i in classes 1 and 2 respectively. If there are $n_1$ and $n_2$ replications of gene i in the two classes,

Compared to the t-statistic, the SAM score adds a "fudge" term, $s_0$, in the denominator

PAGE 39

For the samples $x = (x_i) \overset{iid}{\sim} N(\mu, \sigma^2)$ and $y = (y_j) \overset{iid}{\sim} N(\mu + \delta, \tau^2)$, where $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$:

PAGE 40

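The density of x is

$f(x) = \frac{1}{(2\pi\sigma^2)^{m/2}} \exp\left[-\frac{1}{2\sigma^2}\left\{(m-1)s_x^2 + m(\bar{x} - \mu)^2\right\}\right]$   (2.8.7)

Similarly, the density of y is,

$f(y) = \frac{1}{(2\pi\tau^2)^{n/2}} \exp\left[-\frac{1}{2\tau^2}\left\{(n-1)s_y^2 + n(\bar{y} - (\mu + \delta))^2\right\}\right]$   (2.8.8)

The frequentist approach requires a large number of observations before making probabilistic judgements about the outcomes. In microarray data the sample size is small, so the sample variance underestimates the population variance and gives an unreliable estimate of it. Classical statistics is directed towards the use of sample information in making inferences. In Bayesian analysis, however, one combines the information obtained from the sample with other relevant aspects of the problem in order to make the best decision. So, we use Bayesian analysis for estimating the variances. For microarray data the conjugate prior seems to be more suitable and flexible because of its convenient form. Assuming the independence of the location parameter $\mu$ and the scale parameter $\sigma^2$ as in [13], the joint prior for $\mu$ and $\sigma^2$ is the product

$p(\mu, \sigma^2) = p(\mu)\,p(\sigma^2)$   (2.8.9)

Similarly, the joint prior for $\mu + \delta$ and $\tau^2$ is

$p(\mu + \delta, \tau^2) = p(\mu + \delta)\,p(\tau^2)$   (2.8.10)

Finally, the joint prior for $\mu, \sigma^2, \mu + \delta, \tau^2$ is

prior $= p(\mu, \sigma^2, \mu + \delta, \tau^2) = p(\mu)\,p(\sigma^2)\,p(\mu + \delta)\,p(\tau^2)$   (2.8.11)

Since $(\bar{x}, s_x^2)$ and $(\bar{y}, s_y^2)$ are sufficient statistics for $(\mu, \sigma^2)$ and $(\mu + \delta, \tau^2)$ respectively,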


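where,

Let us assume that the priors for $\mu$ and $\mu+\delta$ are flat priors, i.e., $p(\mu) = 1$ and $p(\mu+\delta) = 1$, and that the priors for $\sigma^2$ and $\tau^2$ are scaled inverse-$\chi^2$ distributions as in [13]. The pdf of the scaled inverse chi-square distribution with df $= \nu$ and scale $= \sigma^2$ is given by

$f(x; \nu, \sigma^2) = \frac{(\nu\sigma^2/2)^{\nu/2}}{\Gamma(\nu/2)}\, x^{-(\nu/2+1)} \exp\left(-\frac{\nu\sigma^2}{2x}\right)$

so that

$p(\sigma^2) \propto \sigma^{-(\nu_0+2)} \exp\left(-\frac{1}{2\sigma^2}\nu_0\sigma_0^2\right)$ (2.8.14)

$p(\tau^2) \propto \tau^{-(\eta_0+2)} \exp\left(-\frac{1}{2\tau^2}\eta_0\tau_0^2\right)$ (2.8.15)

where $(\nu_0, \eta_0, \sigma_0^2, \tau_0^2)$ are the hyper-parameters that should be estimated from the data. Hence, from equations (2.8.13), (2.8.14) and (2.8.15), the marginal posterior density of $(\mu, \sigma^2, \tau^2)$ is $p(\mu, \sigma^2, \tau^2 \mid x, y) \propto$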


Now, 2)eu(A 2R10euu(m+0+2 2)2du 2R10euu(m+0) 21du 2 2 230


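$I_1 \propto \left[1 + \frac{m(\bar{x}-\mu)^2}{v_m\sigma_m^2}\right]^{-\frac{v_m+1}{2}}$ (2.8.17)

where $v_m = m + \nu_0 - 1$ and $v_m\sigma_m^2 = (m-1)s_x^2 + \nu_0\sigma_0^2$. Similarly, we get the other factor in (2.8.16) as

$I_2 \propto \left[1 + \frac{n(\bar{y}-(\mu+\delta))^2}{w_n\tau_n^2}\right]^{-\frac{w_n+1}{2}}$ (2.8.18)

where $w_n = n + \eta_0 - 1$ and $w_n\tau_n^2 = (n-1)s_y^2 + \eta_0\tau_0^2$.

Substituting (2.8.17) and (2.8.18) in (2.8.16), we get

$p(\delta \mid x, y) \propto \int_{-\infty}^{\infty} \left[1 + \frac{m(\bar{x}-\mu)^2}{v_m\sigma_m^2}\right]^{-\frac{m+\nu_0}{2}} \left[1 + \frac{n(\bar{y}-(\mu+\delta))^2}{w_n\tau_n^2}\right]^{-\frac{n+\eta_0}{2}} d\mu$

$= k \int_{-\infty}^{\infty} \left[1 + \frac{m(\bar{x}-\mu)^2}{v_m\sigma_m^2}\right]^{-\frac{v_m+1}{2}} \left[1 + \frac{n(\bar{y}-(\mu+\delta))^2}{w_n\tau_n^2}\right]^{-\frac{w_n+1}{2}} d\mu$ (2.8.19)

This is the pdf of the Behrens-Fisher distribution, where $k$ is a normalizing constant involving Beta functions of $v_m/2$ and $w_n/2$.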


(2m m+2n n)1 2 n=p n=p n=p withpdff(j;2;2)=kZ1h1+(cossin)2 2h1+(sin+cos)2 2d;(2.9.20) where,=Bysin+Bxcos;=BycosBxsin


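Due to the complexity of the pdf of the BF-distribution as given in (2.9.20), it is very hard to compute the corresponding probabilities, especially because of the possibility of fractional degrees of freedom. In addition, there are no uniformly most powerful unbiased tests for all sample sizes for the BF-problem [4]. Because of this, various types of approximations are available in the literature [31], [4]. Because it is simple to apply and R code (www.r-project.org) is available for computing t-values even for fractional degrees of freedom, we use Patil's approximation [31] in this work as follows.

Let

$f_1 = \frac{w_n}{w_n-2}\cos^2\theta + \frac{v_m}{v_m-2}\sin^2\theta$,

$f_2 = \frac{w_n^2}{(w_n-2)^2(w_n-4)}\cos^4\theta + \frac{v_m^2}{(v_m-2)^2(v_m-4)}\sin^4\theta$,

$b = 4 + \frac{f_1^2}{f_2}$, $\quad a^2 = \frac{(b-2)}{b} f_1$,

$\cos^2\theta = \frac{\tau_n^2/n}{\tau_n^2/n + \sigma_m^2/m}$, $\quad \sin^2\theta = 1 - \cos^2\theta$.

Then, the statistic

$\frac{B}{a} \sim t(b)$. (2.9.21)

That is, $B$ has approximately a t-distribution with $b$ degrees of freedom ($b \geq 1$; $b$ may not be an integer [5]) and scale parameter $a$. This statistic $B$ can also be denoted as $B \sim t(0, a^2, b)$. It was noted [31] that the formula (2.9.21) is valid only for $v_m, w_n \geq 5$ and works quite well for $v_m, w_n \geq 7$.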
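The scale and degrees of freedom of this approximation are simple to compute. The sketch below is illustrative only and is not the R code referred to above. The function names `posterior_scale` and `patil_approximation` are hypothetical, and the expressions for $f_1$, $f_2$ and $b$ are the moment-matching form usually attributed to Patil (matching the variance and fourth cumulant of a scaled t-distribution), which is assumed here because those formulas are only partly legible in this text. The posterior quantities follow the relations $v_m = m + \nu_0 - 1$ and $v_m\sigma_m^2 = (m-1)s_x^2 + \nu_0\sigma_0^2$; the sample variances in the example are made up.

```python
import math


def posterior_scale(n_rep, sample_var, prior_df, prior_var):
    """Posterior degrees of freedom v and scale variance for one condition.

    Uses v = n + prior_df - 1 and
    v * scale = (n - 1) * sample_var + prior_df * prior_var.
    """
    v = n_rep + prior_df - 1
    scale = ((n_rep - 1) * sample_var + prior_df * prior_var) / v
    return v, scale


def patil_approximation(v_m, w_n, sigma2_m, tau2_n, m, n):
    """Return (a, b) such that B / a is approximately t with b d.f.

    Assumed moment-matching form; needs v_m and w_n above 4 so that the
    fourth moments exist.
    """
    if min(v_m, w_n) <= 4:
        raise ValueError("v_m and w_n must exceed 4")
    x_part = sigma2_m / m
    y_part = tau2_n / n
    cos2 = y_part / (x_part + y_part)
    sin2 = 1.0 - cos2
    f1 = w_n / (w_n - 2) * cos2 + v_m / (v_m - 2) * sin2
    f2 = (w_n ** 2 * cos2 ** 2 / ((w_n - 2) ** 2 * (w_n - 4))
          + v_m ** 2 * sin2 ** 2 / ((v_m - 2) ** 2 * (v_m - 4)))
    b = 4.0 + f1 ** 2 / f2
    a = math.sqrt(f1 * (b - 2.0) / b)
    return a, b


# Hypothetical gene: m = n = 5 replicates, prior d.f. 4 in both conditions,
# prior variances 0.7 and 0.3, made-up sample variances 0.5 and 0.9.
v_m, sigma2_m = posterior_scale(5, 0.5, 4, 0.7)
w_n, tau2_n = posterior_scale(5, 0.9, 4, 0.3)
a, b = patil_approximation(v_m, w_n, sigma2_m, tau2_n, 5, 5)
print(v_m, w_n, a, b)
```

A two-sided p-value would then be a Student-t tail probability with $b$ degrees of freedom evaluated at $|B|/a$; `scipy.stats.t.sf` accepts fractional degrees of freedom and could serve for that step. When one sample carries all of the variance, the sketch returns $a = 1$ and $b$ equal to that sample's degrees of freedom, as one would expect.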


22[(m1)s2x+m(x)2+020]gd2d(2.11.24) Writing2=u;m+02 2=a;(m1)s2x+020=band=v; 2u[b+m(xv)2]gdudv=Z1u=01 2u[b+m(xv)2]gdvdu=Z1u=01 2u[m(xv)2]gdvdu mdu=r mZ1u=01 Substitutingb mb 2aZ1z=0za5 2exp(z)dz=r mb 2aK(a)(2.11.26)35


m(m1)s2x+020 2m+02 2K(m+02 2)(2.11.27) Dierentiatingwithrespectto20andequatingitto0,weget@P @20=020=(m1)s2x Similarly,theestimateof20isobtainedas^20=Pnj=1(yjy)2 Forestimatingthenumberofsimilarlyexpressedgenes,wehavethefollowingthreemethods: Foreachgeneg,thepriorvariancesandpriormeanswillbedierent.Wehave36


Similarly, for $q$ genes with similar variances, each having $n$ replicates in the treatment condition, the prior degrees of freedom for the variance is given by

$\hat{\eta}_0 = q(n-1)$ (2.11.31)

The prior variances for the control and treatment conditions are calculated as the sample variances of similar genes, i.e., the means of the variances of similar genes in these two conditions respectively.

$\hat{\sigma}_0^2 =$

where,
$\bar{x}_k$ is the mean response of gene $k$ in the control condition
$\bar{y}_k$ is the mean response of gene $k$ in the treatment condition
$x_{k,i}$ is the response $i$ of gene $k$ in the control condition
$y_{k,i}$ is the response $i$ of gene $k$ in the treatment condition

Another method is to choose genes by the absolute difference of their variances in the control and treatment groups, using some cut-off values, so that we have a certain level of confidence in choosing them. For a gene $g$, let $s_{cg}^2$ and $s_{tg}^2$ be the sample variances in the control and treatment conditions respectively. Then choose $p_g =$


The algorithm for selecting the differentially expressed genes consists of the follow-


2($V_{\max} - V_{\min}$)

where $V_{\max} = \max_g$ Variances and $V_{\min} = \min_g$ Variances


1. Out of 10,000 genes, simulate 9,800 genes from a normal distribution with mean $\mu \sim U[0,1]$ and variance $\sigma^2 \sim \text{Inv-}\chi^2(df = 20, scale = 2)$.
2. Without loss of generality, simulate the first 100 genes from a normal distribution with mean $U[5,10] + \frac{1}{2}$ and variance $\sigma^2 \sim \text{Inv-}\chi^2(df = 20, scale = 4)$ in the control condition, and mean $U[5,10] - \frac{1}{2}$ and variance $\tau^2 \sim \text{Inv-}\chi^2(df = 18, scale = 6)$ in the treatment condition.
3. Simulate another 100 genes from a normal distribution with mean $U[5,10] - \frac{1}{2}$ and variance $\sigma^2 \sim \text{Inv-}\chi^2(df = 18, scale = 6)$ in the control condition, and mean $U[5,10] + \frac{1}{2}$ and variance $\tau^2 \sim \text{Inv-}\chi^2(df = 18, scale = 4)$ in the treatment condition.

In the simplest case, when we choose $\nu_0 = \eta_0 = 0$, the BF statistic reduces to the two-sample t-test. The parameters $\nu_0$ and $\eta_0$ represent the degree of confidence in the background variances $\sigma_0^2$ and $\tau_0^2$ versus the empirical variances of control and treatment respectively. We have chosen the values of these two variances in a wide range beginning from 0. The values of $\nu_0$ and $\eta_0$ are increased. In each calculation of the variances of similar genes in the two conditions, we have taken the window size equal to the number of replicates in each condition, i.e., taking $p/2$ genes in the control group immediately above and below the gene under consideration. The mean is taken of those ordered variances corresponding to the gene of interest. Since the Behrens-Fisher distribution does not give good results unless $v_m$ and $w_n$ are greater than 5, and gives very good results when they exceed 7 [31], we have chosen the values of the prior degrees of freedom such that $v_m$ and $w_n$ exceed 7. We have chosen as differentially expressed those genes whose p-values are smaller than the ranked values of the statistic given by the FDR criterion. Furthermore, we have chosen the level of significance $\alpha = 0.05$. The numbers of genes found differentially expressed by our method and in the equal variance case are compared in Table 2.2. We have run the data 5 times and averaged the number of genes selected as differentially expressed each time. In comparing the fold change method with the BF method, we have found that all the genes associated with the large-fold change are not necessarily statistically


The power of a test is defined as the probability of rejecting a false null hypothesis. If $\beta$ is the probability that a true alternative hypothesis is falsely rejected, then the power of the test is given by $1 - \beta$. The power depends on the sample size, the sample standard deviation and the difference between the null and alternative hypothesized means. We have chosen the standardized effect, $(\mu_1 - \mu_2)/\sigma$

We see from the graph of power in Fig. 2.1 and the power comparison in Table 2.2 that, at small values of $\nu_0$, the equal variance test has higher power than our method. But eventually our proposed method seems to give better results, because we have considered the unequal variances in the two samples. For small values of $m$ and $\nu_0$, the corresponding values of $v_m = m + \nu_0 - 1$ are small, and the method does not give appropriate values unless $v_m$ exceeds 7. The same holds for small values of $w_n$. So, we have to choose the values of $\nu_0$ or $\eta_0$ and $m$ or $n$ such that $v_m$ and $w_n$ are at least 5. Holding $\nu_0 = \eta_0 = 4$, we see that the difference in power between the EV method and the new method increases significantly as the sample size increases from 2 to 5. The difference in power between the proposed method and the EV method is less pronounced for small sample sizes and small values of the prior degrees of freedom, but it becomes more distinct as the prior degrees of freedom increase. We have chosen the prior variances $\sigma_0^2 = 0.7$ and $\tau_0^2 = 0.3$.


In this simulated study, we have used the proportion of genes that are actually differentially expressed and detected by our method as the criterion for comparing the proposed method. This method is not suitable for small values of $m$ or $n$ and $p$ or $q$, which make $v_m$ and $w_n$ small. This is seen clearly from the table (Table 2): for $m = n = 2$, whatever values of the priors $\nu_0$ and $\eta_0$ are chosen, only about 10% of the genes selected are actually differentially expressed, although the method reports a large number of genes as differentially expressed. We have found that the genes marked as differentially expressed by the fold-change (fold change = 2) method are almost 32% of the total number of genes included in the simulated study. This result is highly unacceptable, as we have assumed only 2% of the genes to be differentially expressed. At the other extreme, we ran the two-sample t-test and found that it selected few genes as differentially expressed as the sample size was increased to 4 and the window size was 20. After that it selected many genes, and the result improved when we took a sample size of 10. In this case, it selected almost the same genes as our method. So, this method failed for small sample sizes, because it could not select the genes that are actually differentially expressed.

2.14 Comparison of DE Genes using the Window Method with Other Methods


m = n   $\nu_0 = \eta_0$   Power (BF)   Power (EV)
2       16        0.5617153    0.4532222
3       16        0.7179875    0.6035196
4       16        0.820609     0.7177347
5       16        0.8865956    0.8020212
2       8         0.4969753    0.4067113
3       8         0.6431226    0.5496432
4       8         0.7496756    0.6657278
5       8         0.8262618    0.7566512
2       4         0.3985896    0.3444951
3       4         0.5350779    0.4867308
4       4         0.6495701    0.6098946
5       4         0.8395813    0.7105832
2       2         0.3057716    0.2803869
3       2         0.4221825    0.429857
4       2         0.545814     0.5631983
5       2         0.6547838    0.6739527

Table 2.2: Table of Power Comparison

then the equal variance test seemed better in terms of the proportion of actual genes selected. This is natural because our method does not give better results for small values of $v_m$ or $w_n$. However, for greater values of sample size and window size, our method outperforms its equal variance Bayesian test counterpart. We have seen from the results that almost all of the actually DE genes were selected by the proposed method when $m = n = 10$ and $p = q = 8$. But the proportion of genes selected compared to the actual set of DE genes is just 75.7%.

The maximum proportion of genes selected is 88.7%, when $m = n = 10$ and $p = q = 2$. In this case all of the genes that are actually DE are selected. Our method and the equal variance test selected almost the same number of genes on average. Unless $p = q = 8$ and for $m = n = 2$, both the equal variance and unequal variance methods selected an enormous number of genes, most of them being bogus (false positives). The proportion of true genes is very low, about 11%. When $m = n = 3$ and $p = q = 2$, our method selected 45 genes, of which 20 are actually DE, but the equal variance method selected about 134 genes, of which only 58 are actually DE. Hence the proportions of true genes are 0.44 and 0.42 respectively. Taking the sample sizes constant, we found


m=n  p=q  | DE genes                          | Common in actual | Proportion
          |                  EV       BF      | BF      EV       | BF    EV
2    2    | 3421     1       157.5    239     | 21.5    23       | 0.09  0.15
2    4    | 3430.5   0.5     544.5    263     | 35.5    61       | 0.13  0.11
2    8    | 3439.67  0       1145.67  733.33  | 85.33   119      | 0.12  0.1
2    20   | 3459     0       1693     1051    | 110     138      | 0.1   0.08
2    100  | 3374.5   0       2126.5   1187    | 121     154      | 0.1   0.07
3    2    | 3347     0.33    134.33   45      | 20      57.67    | 0.44  0.43
3    4    | 3392     0.33    344      226     | 79      111      | 0.35  0.32
3    8    | 3314.33  0       569.33   357.33  | 107.67  133.67   | 0.3   0.23
3    20   | 3371     0       832      479.67  | 133     158      | 0.28  0.19
3    100  | 3320.5   0.5     1032     568     | 135     165      | 0.24  0.16
4    2    | 3293.33  0.5     165      109.67  | 74.67   104      | 0.68  0.63
4    4    | 3317.4   0.33    279.4    204.4   | 114.8   141.4    | 0.56  0.51
4    8    | 3353.33  1.2     409      282     | 131     156.67   | 0.46  0.38
4    20   | 3378.5   0.67    568      364     | 151.5   175.5    | 0.42  0.31
4    100  | 3311     1       648.5    389     | 165     184      | 0.42  0.28
5    2    | 3332.33  1       208      167.33  | 131.33  158      | 0.78  0.76
5    4    | 3273     36.33   260      227.33  | 153.67  166.67   | 0.68  0.64
5    8    | 3267     23      320.5    240.5   | 155     176      | 0.64  0.55
5    20   | 3295     36.5    424.33   307     | 168.67  186.67   | 0.55  0.44
5    100  | 3251.67  51.67   453.33   326.67  | 174     184.33   | 0.53  0.41
10   2    | 3169     28.67   226.5    225.5   | 200     200      | 0.89  0.88
10   4    | 3133.67  215     245.33   242     | 198     199.33   | 0.82  0.81
10   8    | 3145.33  212.33  264.33   263.67  | 199.67  199.67   | 0.76  0.76
10   20   | 3168.33  217     269.67   264.67  | 200     200      | 0.76  0.74
10   100  | 3094     211.5   266.5    268     | 199     199.5    | 0.74  0.75

Table 2.3: Comparison of BF test with other tests based on the proportion of actually DE genes selected in Window Method.

Samples | DE genes     | Common in actual | Proportion
        | EV SAM       | EV SAM           | EV SAM
        | 22 24 11 15  | 15 15 11 7       | 0.68 0.63 1    0.47
4,4     | 24 23 20 26  | 19 19 19 3       | 0.79 0.83 0.95 0.12
5,5     | 22 22 22 29  | 20 20 20 19      | 0.91 0.91 0.91 0.66
6,6     | 23 21 20 33  | 20 19 19 19      | 0.87 0.9  0.95 0.58
10,10   | 25 23 24 36  | 20 19 20 20      | 0.8  0.83 0.83 0.56
12,12   | 23 26 23 29  | 20 20 20 20      | 0.87 0.77 0.87 0.69
15,15   | 20 20 51 34  | 20 20 20 20      | 1    1    0.39 0.59
3,4     | 29 28 16 30  | 18 18 16 12      | 0.62 0.64 1    0.4
4,6     | 24 25 24 27  | 20 20 20 17      | 0.83 0.8  0.83 0.63
10,15   | 26 26 34 37  | 20 20 20 20      | 0.77 0.77 0.59 0.54

Table 2.4: Comparison of BF test with other tests based on the proportion of actually DE genes selected in Similar Variance Method.


Samples | Common in Actual DE | Genes Selected as DE | Propn. of Genes Selected
        | EV SAM              | EV SAM               | EV SAM
        | 73  63  29  15      | 75  63  29  115      | 0.97 1    1    0.13
3,3     | 144 139 136 79      | 147 142 141 183      | 0.98 0.98 0.97 0.43
4,4     | 179 184 175 128     | 182 190 179 254      | 0.98 0.97 0.98 0.51
5,5     | 195 193 191 159     | 200 197 206 278      | 0.98 0.98 0.93 0.57
6,6     | 197 198 198 188     | 203 203 231 325      | 0.97 0.98 0.86 0.58
8,8     | 199 199 199 198     | 212 209 270 345      | 0.94 0.96 0.74 0.57
10,10   | 200 200 200 200     | 209 206 332 342      | 0.96 0.97 0.60 0.59
12,12   | 200 200 200 200     | 207 207 246 337      | 0.97 0.97 0.81 0.59
4,3     | 168 170 151 94      | 173 173 153 230      | 0.97 0.98 0.99 0.41
4,5     | 186 186 183 144     | 189 192 190 285      | 0.98 0.97 0.96 0.51
5,6     | 198 198 198 162     | 201 201 215 323      | 0.99 0.98 0.92 0.50
6,8     | 198 199 198 196     | 207 207 222 325      | 0.96 0.96 0.89 0.60

Table 2.5: Comparison of BF test with other tests based on the proportion of actually DE genes selected in Resampling Method.

that, the number of genes selected by all three tests by Method 1 is proportional to the window size. But most of them were bogus, i.e., genes that seem to be differentially expressed but are not really. Even though the window size is small, the proportion of genes selected by these methods increased as the window size decreased. This means that, for a fixed sample size, small values of the priors $\nu_0$ and $\eta_0$ are preferred.

To see the performance of Method II, we have simulated 1000 genes as in the other two methods. Because it selects a huge number of genes with similar variance according to our cut-off criterion, the computation is time consuming (which may not be good for a large number of genes and a low-memory computer). This method compares with the equal variance method and SAM, as seen from the proportion and the number of truly differentially expressed genes selected. In this case we have introduced only 20 genes as differentially expressed. All three tests, BF, EV and SAM, were quite competitive in this method, but EV and BF are relatively more competitive. We see that SAM selects very few genes when the sample sizes are small, but as the sample size was increased to 15, it selected more genes. On the other hand, both the BF and the EV methods selected almost the same number of genes even though the sample size was increased. The result is shown in Table 2.4.

Table 2.5 shows the number of genes selected by Method III. We have compared


m = n | Proportion of DE Genes | FDR              | FNR
      | I     II    III        | I     II    III  | I     II    III
3     | 0.35  0.68  0.98       | 0.65  0.32  0.02 | 0.61  0.25  0.28
4     | 0.56  0.79  0.98       | 0.44  0.21  0.02 | 0.43  0.05  0.11
5     | 0.67  0.91  0.98       | 0.33  0.09  0.03 | 0.24  0     0.03
6     | 0.74  0.87  0.97       | 0.26  0.13  0.03 | 0.13  0     0.02
10    | 0.82  0.8   0.96       | 0.18  0.2   0.04 | 0.01  0     0

Table 2.6: Comparison of three methods (I = Window Method, II = Similar Variance Method, III = Resampling Method) according to Proportion of truly DE genes selected, FDR and FNR.

2.16 Selection of Differentially Expressed Genes in the Golub Data

1. Set floor of 100 units and ceiling of 16,000 units.
2. Filter the low quality genes.
3. Transform the data by using base-10 logarithm.

The threshold was set with a floor of 100 units and a ceiling of 16,000 units. A ceiling of 16,000 units was chosen because it is at this level that we observe the fluorescence


The histograms in Figure 2.5 show the raw expression data before and after logarithmic transformation. The histograms in Fig. 2.6 show the overall expression level of the Golub data in the training and the test samples after the logarithmic transformation. So, from the histograms we see that the overall expression patterns in each of the samples from the two conditions are symmetrical and normal after transformation.

2.17 Transformation and Test of Normality Assumptions

$SW = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$ (2.17.34)

where $\bar{x}$ is the mean of the samples, the $a_i$'s are constants given by

$(a_1, a_2, \ldots, a_n) = \frac{m^T V^{-1}}{(m^T V^{-1} V^{-1} m)^{1/2}}$

and $m = (m_1, m_2, \ldots, m_n)$, where the $m_i$'s are expected values of order statistics and $V$ is the covariance matrix of these order statistics. If the p-value of the test statistic $SW$ is


This test can be used instead of the Kolmogorov-Smirnov test (Lilliefors test), because some studies have shown that this test has better power in many situations than other goodness-of-fit tests of the composite hypothesis of normality (Shapiro, Wilk and Chen 1960). This test is not affected by ties in the values. In this test, we were 95% confident that the genes satisfy the normality assumption. Furthermore, those genes which did not satisfy the normality assumption were transformed to normality using the Yeo-Johnson transformation [53]. It is given by

$\psi(\lambda, x) = \begin{cases} \{(x+1)^{\lambda} - 1\}/\lambda, & (x \geq 0, \lambda \neq 0); \\ \log(x+1), & (x \geq 0, \lambda = 0); \\ -\{(-x+1)^{2-\lambda} - 1\}/(2-\lambda), & (x < 0, \lambda \neq 2); \\ -\log(-x+1), & (x < 0, \lambda = 2). \end{cases}$ (2.17.35)

where $\lambda$ is a parameter that is estimated from the data by the maximum likelihood method, assuming the data come from a normal population.

Fig. 2.8 shows the Q-Q plot of some of the genes in the Golub data before the Yeo-Johnson transformation. We can see from the graph that the distributions of some of the genes are not normal. The Yeo-Johnson transformation is similar to the Box-Cox transformation, except that the former is also useful for transforming data with both positive and negative values. We can see from Fig. 2.10 that the genes satisfy normality after transformation.

In the following, we have computed the mean and variance of the first few genes of the Golub data [16]. We see that the variance depends on the mean expression level. But after the estimation of prior variances, the posterior variance is stabilized, and this has an effect on selecting the genes that are actually expressed.

Differentially expressed genes were selected using the Behrens-Fisher (BF) distribution. To overcome the false genes which arise due to multiple testing, we have controlled the family-wise error rate (FWER) at the level $\alpha = 0.05$ using the Bonferroni


Var Before    Var After
0.22596073    0.281732
0.24048058    0.281736
0.10450397    0.281698
0.15503013    0.281712
0.25394972    0.28174
0.39634255    0.282684
0.54154905    0.282492
0.25132857    0.282435
3.35897426    0.28178
2.93944803    0.28182
2.73528247    0.281739
2.71219933    0.281805
0.48562603    0.281714
0.16032256    0.281698
0.10416494    0.281733
0.23132494    0.282428
0.94761702    0.281934
0.33217269    0.281762
0.39854292    0.28178
0.27957232    0.281747
0.3516393     0.281767
0.3902128     0.281778
0.35655727    0.281769
0.40192783    0.281781
0.05802834    0.281685

Table 2.7: Dependence of Variance on Expression level


Method   No. of Genes | Common Genes     No. of genes
BF       713          | BF, SAM          713
SAM      841          | BF, LIMMA        670
LIMMA    678          | SAM, LIMMA       678
FD       703          | FD, BF           658
All 4    655          | BF, LIMMA, SAM   670

Table 2.8: Comparison of differentially expressed genes in Golub data controlling the FDR at the level $\alpha = 0.05$

2.18 Conclusion

In this chapter, we derived an expression for the posterior Behrens-Fisher distri-


The above five methods are a brief summary of the classification methods we will use for the classification of samples. More about these methods can be found in [16], [14], [47], and [49]. In these methods, the non-homogeneity of the samples has not been used. So, one would expect that if the non-homogeneity were taken into account, the classification method would separate the data more accurately into their true classes.

3.4 Support Vector Machines (SVM) Methods


from the training samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $y_i = \pm 1$ are binary class labels, that correctly classifies the training samples and maximizes the margin. The class of a new sample $x$ is determined by $\text{sign}[f(x)]$. All the training samples are classified correctly if $y_i(b + w'x_i) \geq 1$ for all $i$

are called the canonical hyperplanes, and the distance $1/\|w\|$ between one of the canonical hyperplanes and the separating hyperplane (3.4.1) is the margin. So, the optimization problem can be rephrased as $\min_w \|w\|$


and the corresponding decision rule for a new sample $x$ is:

Using the nonlinear basis functions $\phi(x_i)$, one can map the input space into a high dimensional feature space. Then the samples are classified by linear boundaries in the feature space using the kernel $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, which correspond to nonlinear boundaries in the input space. The separating hyperplane in the feature space is

$f(x) = \hat{b} + \sum_{i=1}^{N} \alpha_i y_i K(x_i, x)$ (3.4.4)

Different kernels are obtained by $K(y, x) = \phi'(y)\phi(x)$

where $\phi(y)$ and $\phi(x)$ can be any linear or nonlinear transformations of $y$ and $x$ and must satisfy Mercer's conditions [21]. But for this work, we use simple linear


The basic idea of high dimensional comes from the fact that the training pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$ are separated by maximizing the margin between the support


Let $\phi$ be a feature map that maps each point $x$ into $\phi(x)$. Then the decision function in the feature space is given by $D(x) = w \cdot \phi(x) + b$

where $K(x, y)$ is a symmetric function that satisfies Mercer's condition

$\sum_{i=1}^{n}\sum_{j=1}^{n} K(x, y)\, h_i(x)\, h_j(y) \geq 0$

for all $h_i$, $h_j$, and natural number $M$. This means that for any kernel function $K(x, y)$ satisfying Mercer's condition there exists a feature map $g$ such that the inner product in the feature space is generated by $g$.
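Mercer's condition can be probed numerically on a finite set of points: for a valid kernel the quadratic form $\sum_i \sum_j h_i h_j K(x_i, x_j)$ should never be negative. The sketch below is only an illustration under stated assumptions. The helper names, the sample points and the choice of kernels (linear, Gaussian, and a deliberately invalid negated inner product) are hypothetical and are not taken from this work, and a random search over coefficient vectors is a necessary check, not a proof.

```python
import math
import random


def linear_kernel(x, y):
    """Ordinary inner product; corresponds to the identity feature map."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def negated_linear_kernel(x, y):
    """Symmetric, but not a Mercer kernel: its quadratic form is <= 0."""
    return -linear_kernel(x, y)


def min_quadratic_form(kernel, points, trials=200, seed=0):
    """Smallest value of sum_ij h_i h_j K(x_i, x_j) over random vectors h."""
    rng = random.Random(seed)
    gram = [[kernel(p, q) for q in points] for p in points]
    n = len(points)
    smallest = float("inf")
    for _ in range(trials):
        h = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        value = sum(h[i] * gram[i][j] * h[j]
                    for i in range(n) for j in range(n))
        smallest = min(smallest, value)
    return smallest


# Hypothetical two-dimensional samples.
points = [(0.0, 1.0), (1.0, 0.5), (-1.0, 2.0), (2.0, -1.0)]
for name, k in [("linear", linear_kernel), ("rbf", rbf_kernel),
                ("negated linear", negated_linear_kernel)]:
    print(name, min_quadratic_form(k, points))
```

The first two kernels give non-negative values up to rounding error, while the negated inner product produces negative values and therefore cannot generate an inner product in any feature space.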


and undecided if $x = 0$.

Hence, we see that the points are linearly separable in the feature space although they are not linearly separable in the input space. For this we only need to know the kernel functions, not the actual transformation. Some of the important kernels are:

Note that each random variable $X$ represents a sample. So, in the case of gene expression, the number of components of $X$ is very large.

3.7 Nearest Shrunken Centroids Method


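where $k$ is the correct group of the sample and $\hat{k}$ is the assignment made to that sample by the classifier. If the class conditional densities, $f_k(x)$, and the class priors, $\pi_k$, of class $k$ are known, then the Bayesian optimal rule of classification for a new sample $x$ is to minimize the risk

$R(\hat{k} \mid x) = \sum_{k=1}^{G} L(k, \hat{k}) \Pr(G = k \mid X = x)$ (3.7.6)

Then the classification rule is: $\hat{k} = \arg\max_k f_k(x)\pi_k$.

$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu_k)'\Sigma_k^{-1}(x-\mu_k)\right]$

Then, assuming an equal variance-covariance matrix for each class, $\Sigma_k = \Sigma$, the linear discriminant score is

$D_k^l(x) = x'\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k'\Sigma^{-1}\mu_k + \log\pi_k$ (3.7.7)

and, assuming unequal classwise covariance matrices, the quadratic discriminant score is

$D_k^q(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)'\Sigma_k^{-1}(x-\mu_k) + \log\pi_k$ (3.7.8)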


In the case of DNA microarray data, the number of covariates (genes), $p$, which is of the order of several thousands, is much greater than the number of samples, $n_k$, generally within hundreds, in each class. So, the sample covariance matrix is singular, and it gives an unreliable estimate of the covariance matrix because of its high variability. Friedman [14] introduced regularized discriminant analysis, in which the unequal variances are shrunk towards the common variance using regularization, thus increasing the performance of the RDA classifier. The microarray problem is thus unique and challenging. Since the expression levels of most of the genes are the same in two different treatment samples, those genes contribute little to classification. Thus it is important to identify the genes that actually contribute to the classification. Assuming that genes which have common class means do not contribute to the classification, Tibshirani et al. [47] proposed the Nearest Shrunken Centroids (NSC) method. They used the shrinkage parameter, $\Delta$, for thresholding and declared a gene non-contributing if the shrunken centroid for gene $g$ in class $k$, $\bar{x}'_{gk}$, shrinks to the overall mean $\bar{x}_g$

where, $s_g^2 =$


In the classification process, the genes $g$ for which each of the shrunken class means $\bar{x}'_{gk}$ shrinks to the overall mean $\bar{x}_g$ in each class $k = 1, 2, \ldots, G$ have the same $(x_g - \bar{x}'_{gk})^2$ values. So, the numerator of the above discriminant score (3.7.9) can be replaced by the squared differences of only those genes $g$ for which $\bar{x}'_{gk} \neq \bar{x}_g, \; \forall k = 1, 2, \ldots, G$

where $\mu_{gi}$ and $\sigma_{gi}$ are the mean and standard deviation of gene $g$ in class $i$, $i = 1, 2$. The larger the absolute value $|w_g|$ is, the more important gene $g$ is for prediction. The genes are ranked by their $|w_g|$'s and the top ones are selected. These top selected genes are the marker or informative genes.


This measure is also called the signal-to-noise ratio. This weighting factor reflects the correlation between the expression level of gene $g$ and the class distinction. The parameter $b_g$ is calculated as

$b_g = \frac{\mu_{g1} + \mu_{g2}}{2}$

which is the average of the mean expression levels of the two classes. Hence we define the parameters $(w_g, b_g)$ for each informative gene in the training set. For a new sample $x$, with $x_g$ being the normalized log expression level of gene $g$, we calculate the votes cast by each of the genes in the "informative" set. The vote of a gene $g$ is

$v_g = w_g(x_g - b_g)$

A positive vote indicates that the sample belongs to class 1 and a negative vote indicates that it belongs to class -1. Then the total vote for the sample to be in class 1 is obtained by adding $V_1 = \sum_g \max(v_g, 0)$, and the total vote for the sample to be in class -1 is $V_2 = \sum_g \max(-v_g, 0)$. The sample is then assigned to the class corresponding to the higher total vote. Generally, we take the 5% most positive and the 5% most negative genes as the "informative" genes in the training set. But this number is a free parameter and depends on the user.
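The voting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results in this chapter. The function names and the toy expression values are hypothetical, the weight is assumed to be the usual signal-to-noise ratio $w_g = (\mu_{g1} - \mu_{g2})/(\sigma_{g1} + \sigma_{g2})$, the vote is assumed to be $v_g = w_g(x_g - b_g)$, and ties are arbitrarily sent to class -1. The selection of the informative genes is omitted, so every gene supplied casts a vote.

```python
from statistics import mean, stdev


def fit_weighted_voting(class1, class2):
    """Estimate (w_g, b_g) per gene from two lists of training samples.

    Each sample is a list of expression values, one per gene.
    """
    n_genes = len(class1[0])
    params = []
    for g in range(n_genes):
        x1 = [sample[g] for sample in class1]
        x2 = [sample[g] for sample in class2]
        m1, m2 = mean(x1), mean(x2)
        w = (m1 - m2) / (stdev(x1) + stdev(x2))  # signal-to-noise ratio
        b = (m1 + m2) / 2.0                      # midpoint of class means
        params.append((w, b))
    return params


def predict(params, sample):
    """Return (label, V1, V2), where label is 1 or -1."""
    votes = [w * (x - b) for (w, b), x in zip(params, sample)]
    v1 = sum(max(v, 0.0) for v in votes)    # total vote for class 1
    v2 = sum(max(-v, 0.0) for v in votes)   # total vote for class -1
    label = 1 if v1 > v2 else -1            # assumption: ties go to -1
    return label, v1, v2


# Toy training data: two genes, three samples per class (made-up values).
class1 = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5]]
class2 = [[1.0, 4.0], [2.0, 5.0], [1.5, 6.0]]
params = fit_weighted_voting(class1, class2)
print(predict(params, [6.5, 1.0]))
print(predict(params, [1.0, 5.5]))
```

In this toy example the first new sample collects all of its votes for class 1 and the second for class -1. The relative margin $|V_1 - V_2|/(V_1 + V_2)$ could be added as a confidence measure if a "no call" option is desired.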


$\frac{1}{(n-G)}\sum_{k=1}^{G}(n_k-1)(\sigma_k)_g^2$


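$(\sigma_{gk})^{-n_k}(2\pi)^{-n_k/2}\exp\left[-\frac{1}{2\sigma_{gk}^2}\left\{(n_k-1)s_{gk}^2 + n_k(\bar{x}_{gk}-\mu_{gk})^2\right\}\right]$ (3.10.14)

where $\bar{x}_{gk}$ and $s_{gk}^2$ are the sample mean and sample variance of gene $g$ in class $k$ respectively. Assuming the independence of the location parameter $\mu_{gk}$ and the scale parameter $\sigma_{gk}^2$, the joint prior for $\mu_{gk}$ and $\sigma_{gk}^2$ can be written as

$p(\mu_{gk}, \sigma_{gk}^2) = p(\mu_{gk})\,p(\sigma_{gk}^2)$ (3.10.15)

Assume that the priors for $\mu_{g1}$ and $\mu_{g2}$ are flat priors and the priors for $\sigma_{g1}^2$ and $\sigma_{g2}^2$ are scaled inverse-$\chi^2$ distributions, i.e., $p(\sigma_{g1}^2) = \text{Inv-}\chi^2(\sigma_{g1}^2; \nu_0, \sigma_0^2)$ and $p(\sigma_{g2}^2) = \text{Inv-}\chi^2(\sigma_{g2}^2; \eta_0, \tau_0^2)$, where $(\nu_0, \eta_0, \sigma_0^2, \tau_0^2)$ are the hyper-parameters that should be estimated from the data.

Let $\delta_g = \mu_{g2} - \mu_{g1}$. Then the statistic, called the BF-statistic,

$B = \frac{\delta_g - (\bar{x}_{g2} - \bar{x}_{g1})}{\left(\frac{\sigma_{g1}^2}{n_1} + \frac{\sigma_{g2}^2}{n_2}\right)^{1/2}}$ (3.10.16)

$= B_{x_2}\cos\theta - B_{x_1}\sin\theta$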


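degrees of freedom [42]. That is,

with pdf

$\frac{B}{a} \sim t(b)$ (3.10.17)

where,

$f_1 = \frac{v_{n_2}}{v_{n_2}-2}\cos^2\theta + \frac{v_{n_1}}{v_{n_1}-2}\sin^2\theta$,

$f_2 = \frac{v_{n_2}^2}{(v_{n_2}-2)^2(v_{n_2}-4)}\cos^4\theta + \frac{v_{n_1}^2}{(v_{n_1}-2)^2(v_{n_1}-4)}\sin^4\theta$,

$b = 4 + \frac{f_1^2}{f_2}$, $\quad a^2 = \frac{(b-2)}{b} f_1$

That is, $B$ has approximately a t-distribution with $b$ degrees of freedom ($b \geq 1$) and scale parameter $a$. This statistic $B$ can also be denoted as $B \sim t(0, a^2, b)$, and the approximation is valid only for $v_{n_1}, v_{n_2} \geq 5$.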


1. Out of 10,000 genes, simulate 9,800 genes from a normal distribution with mean $\mu \sim U[0,1]$ and variance $\sigma^2 \sim \text{Inv-}\chi^2(df = 20, scale = 2)$.
2. Without loss of generality, simulate the first 100 genes from a normal distribution with mean $U[5,10] + \frac{1}{2}$ and variance $\sigma^2 \sim \text{Inv-}\chi^2(df = 20, scale = 4)$ in the control condition, and mean $U[5,10] - \frac{1}{2}$ and variance $\tau^2 \sim \text{Inv-}\chi^2(df = 18, scale = 6)$ in the treatment condition.
3. Simulate another 100 genes from a normal distribution with mean $U[5,10] - \frac{1}{2}$ and variance $\sigma^2 \sim \text{Inv-}\chi^2(df = 18, scale = 6)$ in the control condition, and mean $U[5,10] + \frac{1}{2}$ and variance $\tau^2 \sim \text{Inv-}\chi^2(df = 18, scale = 4)$ in the treatment condition.
4. Repeat steps 1 to 3 to simulate another two classes, but replace the scales by 3 and 6 respectively.
5. The combined simulated data will have 10,000 genes and 4 classes.

In the 4 classes, we have 15 and 10, 13 and 5, 30 and 12, and 23 and 12 training and test samples respectively. The BF statistic was used to select the genes. It selected most of the genes that were marked as differentially expressed in the simulation. We used these selected genes in the SVM classifier and in the Weighted Voting (WV) method.

The accuracy rate (AR) of a classifier is defined as

$AR = \frac{\text{No. of correctly classified samples}}{\text{Total no. of Samples}}$ (3.12.18)

The support vector machines misclassified 3.1 samples on average, whereas the WV method misclassified 0.037 of the samples on average. The multiclass


Method   No. of Genes used   No. of Classes   Accuracy
SVM      203                 4                .92
WV       203                 4                .97
Dudoit   203                 4                .84
SCRDA    203                 4                .84

Table 3.1: Accuracy Rate for the Simulated Test data

3.13 Datasets Pre-processing and Filtering


Method    | Training Errors | Test Errors   | Average Error | No. of Genes
          | ALL  MLL  AML   | ALL  MLL  AML | Train  Test   |
Dudoit    | 0    1    1     | 0    0    0   | 2      0      | 23
Wt. Vote  | 0    1    1     | 0    0    0   | 2      0      | 17
SVM       | 0    0    0     | 0    0    0   | 0      0      | 17
NSC       | 0    4    0     | 0    0    0   | 4      0      | 12

Table 3.2: Comparison of Classification Performance on MLL Leukemia Data.

From Table 3.3 we see that, taking all the genes into consideration, there is one AML sample that was misclassified as an ALL sample by the SVM method. Similarly, the Weighted Vote and Multiclass methods misclassified two AML samples as ALL samples, one of which is the same sample misclassified by the SVM method.


Method       | Training Errors | Test Errors | Total Errors
             | ALL  AML        | ALL  AML    |
             | 0    0          | 0    1      | 1
SVM          | 0    0          | 0    2      | 2
Multi-class  | 0    0          | 0    2      | 2

Table 3.3: Comparison Classification of Golub Data using all the genes

Method       | Training Errors | Test Errors | Total Errors
             | ALL  AML        | ALL  AML    |
             | 0    0          | 0    1      | 1
SVM          | 0    0          | 0    1      | 1
Multi-class  | 0    0          | 1    1      | 2

Table 3.4: Comparison Classification of Golub Data using BF selected genes

So, from the comparison of the above two tables, we see that, even though we use all the genes for the classifiers, the classification performance is not improved. In fact, the performance improved when we took the genes selected by the BF method.


Method  | Training Errors | Test Errors | No. of Genes
        | ALL  AML        | ALL  AML    |
        | 1    0          | 1    0      | 33
SVM     | 0    0          | 0    0      | 33
NSC     | 0    0          | 1    0      | 18

Table 3.5: Comparison of Classification Performance on Golub Data.

3.16 Subtypes of Pediatric Acute Lymphoblastic Leukemia

In this 7-class case, the genes selected by the BF method have shown promising results compared with the nearest shrunken centroid (NSC) method and the shrunken centroid regularized discriminant analysis (SCRDA) method of Guo et al. (2007). In the NSC and SCRDA methods, the overall mean of each class is shrunk by a shrinking factor. We have used the genes selected by the proposed BF statistic for the Weighted Voting method. For the Dudoit method, we have used the 150 genes selected by the BF statistic. The classification performed by both the weighted vote and the Dudoit methods is better than that of the other two methods. The overall error for the weighted vote method in the training samples is 0.0093, whereas it is 0.1209 and 0.0651 in the NSC and SCRDA methods respectively. Similarly, the overall test error in the weighted voting method is 0.0625, whereas it is 0.3482 and 0.1160 in the NSC and SCRDA methods respectively. Another important fact to note is that the number of genes selected for the classification by the BF method is also comparable to both of these methods.

Set    Method  | Classes                                        | Total Error | No. of Genes
               | bcr  e2a  hyp  mll  t.all  tel.aml  other      |             |
Train  W.Vote  | 0    0    0    0    0      0        2          | 2           | 370
Train  Dudoit  | 2    0    1    0    0      1        16         | 20          | 150
Train  NSC     | 9    0    7    8    0      0        2          | 26          | 185
Train  SCRDA   | 0    0    2    0    0      0        12         | 14          | 543
Test   W.Vote  | 0    0    3    0    0      0        4          | 7           | 370
Test   Dudoit  | 1    0    1    0    0      1        3          | 6           | 150
Test   NSC     | 6    0    21   6    0      0        6          | 39          | 185
Test   SCRDA   | 1    0    2    0    0      0        10         | 13          | 543

Table 3.6: Comparison of Classification Performance on ALL-7 Data.

Golub Data   | Predicted
True         | ALL  AML
ALL          | 20   0
AML          | 1    13

Table 3.7: Confusion matrix for the MLL-3 training data by Dudoit method, and Golub Test Data by NSC Method.


Table 3.8: Confusion matrix for the ALL-7 test data by SCRDA method.


where $G_0$ is a believed null cumulative distribution function (cdf) of the $t_i$ values and $\Phi$ is the standard normal cdf. Since microarray experiments usually presuppose that most of the genes are not differentially expressed between two conditions, we expect that the $N(0,1)$ density curve fits the center of the histogram of $z_i$ values. If there is no correlation among the test statistics, $t_i$, then after the transformation we expect the $z_i$ values to follow the standard normal density. If correlation among the $t_i$ is present, then it has the effect of widening or narrowing the standard normal density curve of the $z_i$ values. This has a serious effect on the tail counts, which are the differentially expressed genes. Because this is a data-driven method, it has no other assumption except that the data are independent and identically distributed and the test statistics are independent.

4.3 Effects of Correlations

and $\pi_{kl}(i,j) = \Pr(z_i \in Z_k \text{ and } z_j \in Z_l)$; $\quad \pi_{kl\cdot} = \sum_{i \neq j} \pi_{kl}(i,j) / \{p(p-1)\}$

all the $\pi_k(i)$ values are determined by the standard normal density, $\phi(z)$, around the point $z[k]$. The following result was proved by Efron [11].


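where $C_0$ is the multinomial covariance matrix that would apply if the z-values were independent,

and $C_1 = \left(1 - \frac{1}{p}\right)\delta$

where "diag" represents the $p \times p$ diagonal matrix whose elements are $\pi_{kl}$ and $\pi_{\cdot} = (\pi_{1\cdot}, \pi_{2\cdot}, \ldots, \pi_{K\cdot})'$.

and, $Cov(y_k, y_l) = E(y_k y_l) - E(y_k)E(y_l) = p(p-1)\pi_{kl\cdot} - p^2\pi_{k\cdot}\pi_{l\cdot}$ (4.3.8)

and, $Var(y_k) = Cov(y_k, y_k) = p(\pi_{k\cdot} - \pi_{k\cdot}^2) + p(p-1)(\pi_{kk\cdot} - \pi_{k\cdot}^2)$ (4.3.9)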


$z[k]^2 - 2\rho_{ij}z[k]z[l] + z[l]^2$

From the above lemma, if $z_i$ and $z_j$ are independent, then $\pi_{kl}(i,j) = \pi_k(i)\pi_l(j)$. This means all the elements of $\delta$ in (4.3.6) are zero, thus $C_1 = 0$. So, $Cov(y) = C_0$ if the statistics are not correlated. Furthermore, the amount of correlation between the t-scores determines the size of $\delta$ and increases $Cov(y)$ above $C_0$. The elements $\delta_{kl}$ of $\delta$ can be estimated by

$\hat{\delta}_{kl} = \int_{-1}^{1}$

where $\rho$ is the correlation matrix and $g(\rho)$ is the empirical density of the $p(p-1)/2$ correlations, $\rho_{ij}$, of genes.

Let us define $Y_0 = \text{No. of } \{z_i \in [-a, a]\}$ and $Y_1 = \text{No. of } \{z_i \geq b\}$ (4.3.12)

where $a$ and $b$ are suitable positive numbers (cut-offs) which cover the central and tail parts of the density of $z_i$. We will analyze the effect of correlations of the $z_i$ on $sd(Y_0)$ and $sd(Y_1)$, and on $Cor(Y_0, Y_1)$, in the application section.

We will use the above lemma to find $sd(Y_0)$, $sd(Y_1)$ and $Cor(Y_0, Y_1)$.

4.4 Application


Distribution  | Kolmogorov-Smirnov | Anderson-Darling | Chi-Square
              | Statistic  Rank    | Statistic  Rank  | Statistic  Rank
Cauchy        | 0.05333    1       | 12.205     1     | 117.45     3
Log-logistic  | 0.05534    2       | 12.285     2     | 81.852     1
Johnson (SU)  | 0.06687    3       | 14.46      3     | 98.901     2

Table 4.1: Goodness-of-Fit test of the test-statistics obtained by the BF method for the Golub Data

Let us take, in particular, $a = 0.75$ and $b = 3.0$ in equation (4.3.12), as suggested by Efron, 2007. Since, for each gene, lying in the interval $[-0.75, 0.75]$ can be considered a success, we model $Y_0$ as approximately binomial with $n$ = No. of genes and success probability $\Phi(0.75) - \Phi(-0.75)$. Similarly, $Y_1$ is approximated by the binomial with the probability of lying in $[3, \infty)$ equal to $1 - \Phi(3)$. The covariance between the genes is obtained by using Lemma 1,


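where $\pi_{kl\cdot}$ is given by (4.3.3) and $\pi_{k\cdot}$ and $\pi_{l\cdot}$ are given by (4.3.2), with suitable modification to cover the intervals. In particular, if $z_i$ and $z_j$ are independent, then $\pi_{kl\cdot} = \pi_{k\cdot}\pi_{l\cdot} = [\Phi(0.75) - \Phi(-0.75)][1 - \Phi(3)]$. There are $p = 1761$ genes remaining after preprocessing, normalization and transformation to normality.

$sd(Y_0) = \sqrt{1761(0.5467453)(1 - 0.5467453)} = 20.89$, $\quad sd(Y_1) = \sqrt{1761(1 - 0.9986501)(0.9986501)} = 1.54$

$Cor(Y_0, Y_1) = \frac{Cov(Y_0, Y_1)}{(20.89)(1.54)} = \frac{1761(1760)(0.5467453)(1 - 0.9986501) - (1761)^2(0.5467453)(1 - 0.9986501)}{(20.89)(1.54)} = \frac{-1.2997}{32.1706} = -0.04$ (4.4.16)

Now, let the transformed statistics $z_i$ and $z_j$ be correlated. The average of all correlations between the genes is 0, as seen from Figure 4.3. Using the results of the above Lemma for the variable $Y_0$, we get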


Similarly, $Sd(Y_1)$ and $Cor(Y_0, Y_1)$ are obtained as

$Sd(Y_1) = 21.54$

$Cor(Y_0, Y_1) = \frac{Cov(Y_0, Y_1)}{(217.56)(21.54)} = \frac{-2288.787}{(217.56)(21.54)} = -0.488$ (4.4.19)

4.5 Computation of Effect of Correlations
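The figures quoted in Section 4.4 can be re-derived numerically. The short Python sketch below is offered as a check, not as the computation actually run for this chapter. It assumes the binomial model for $Y_0$ and $Y_1$ with $p = 1761$, $a = 0.75$ and $b = 3$, uses $Cov(Y_0, Y_1) = p(p-1)\pi_0\pi_1 - p^2\pi_0\pi_1$ for the independent case, and takes the correlated-case covariance of $-2288.787$ (the negative sign is an assumption, chosen to agree with the negative correlation) and the standard deviations 217.56 and 21.54 as given. All variable names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


p = 1761          # genes left after preprocessing
a, b = 0.75, 3.0  # central and tail cut-offs

pi0 = normal_cdf(a) - normal_cdf(-a)   # P(z in [-a, a])
pi1 = 1.0 - normal_cdf(b)              # P(z >= b)

# Independent case: binomial standard deviations and covariance.
sd_y0 = math.sqrt(p * pi0 * (1.0 - pi0))
sd_y1 = math.sqrt(p * pi1 * (1.0 - pi1))
cov_indep = p * (p - 1) * pi0 * pi1 - p ** 2 * pi0 * pi1
cor_indep = cov_indep / (sd_y0 * sd_y1)

# Correlated case: values taken as given from the text above.
cor_correlated = -2288.787 / (217.56 * 21.54)

print(round(sd_y0, 2), round(sd_y1, 2))
print(round(cov_indep, 4), round(cor_indep, 3))
print(round(cor_correlated, 3))
```

Under independence the covariance simplifies to $-p\,\pi_0\pi_1$, which is why it is small relative to the variances. The standard deviations of both counts are roughly ten times larger once correlation between genes is allowed for.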


                   Independent   Correlated
$sd(Y_0)$          20.89         217.56
$sd(Y_1)$          1.54          21.54
$cor(Y_0, Y_1)$    -0.04         -0.488

Table 4.2: Effect of Correlation on the standard deviations of the middle and tail count of the transformed test statistics

4.6 Conditional and Unconditional p-values

where $p$ is the number of hypotheses to be tested. Because the shape of the histogram of z values varies more for correlated test statistics than it does for independent test statistics, the p-value used to test the null hypotheses should be smaller than the actual p-value obtained under the assumption of independent test statistics. The larger variation among the histograms of correlated test statistics can cause misleading tail counts, and thus more genes are declared to be differentially expressed than the actual number of differentially expressed genes.

The conditional p-value is estimated by permuting the samples and calculating the number of $z_i$'s that lie inside a small interval $(C - \epsilon, C + \epsilon)$ around $C$, where,

In particular, we find the value of $X = \max_i |z_i|$, $i = 1, 2, \ldots, p$, say $x$, and $C$ from the original non-permuted samples. Permutation of the samples is repeated many times, say $B$. In each permutation $b$, the following are calculated: (1) the proportion of genes, $C_b$, lying within $(C - 0.05, C + 0.05)$; (2) $Y_b = 1$ if $x_b > x$ and $Y_b = 0$


$\frac{P(Y_b = 1 \mid C_b)}{1 - P(Y_b = 1 \mid C_b)} = \exp(\alpha + \beta C_b)$. Finally, the estimate of the conditional p-value at the point $C = c$ is given by

$\hat{p}_x(c) = \frac{\exp(\hat{\alpha} + \hat{\beta}c)}{1 + \exp(\hat{\alpha} + \hat{\beta}c)}$ (4.6.22)

Similarly, the unconditional p-value, $P_x$, is estimated as the proportion of all permutations whose maximum test statistic exceeds the observed value $x$. Those genes with p-value less than or equal to the adjusted p-value are taken as the differentially expressed genes. In the Golub data, the permutation was repeated 5000 times to calculate the conditional p-value and the unconditional p-value.

Table 4.3: Effect of correlation on p-value on test statistics in Golub data


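Let us define

$$Y_0 = \text{number of } z_i \in [-x_0, x_0], \qquad P_0 = 2\Phi(x_0) - 1, \qquad Q_0 = \sqrt{2}\, x_0\, \varphi(x_0),$$

$$\hat{P}_0 = Y_0 / p, \qquad \hat{A} = (P_0 - \hat{P}_0) / Q_0,$$

where $\varphi$ is the standard normal density function and $\Phi$ is the standard normal cumulative distribution function. For the Golub data with $x_0 = 2$, we got the following:

$$\hat{P}_0 = Y_0 / 1761 = 0.716, \qquad \hat{A} = 1.561.$$

In Figure 4.2, the cut-off value of the $z$ statistics corresponding to the Behrens-Fisher test statistics at $\alpha = 0.05$ is 1.24. Because of the correlation, we need to adjust the value of $z$ so that the correlation among the test statistics affects the inference as little as possible. Taking the conditional p-value for the transformed $z_i$ data, the cut-off value is 1.574. Hence, we declare those genes as differentially expressed for which $|z_i| > 1.574$. Using this cut-off value, we got 32 + 64 = 96 genes as differentially expressed. Hence, there are 96 genes that are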


Summary: In this chapter, we introduce the correlation between the genes and study its effect on the performance of different classifiers. Furthermore, we show how to set the cut-off value when filtering the highly correlated genes. We have used McNemar's test to test the hypothesis that two classifiers perform equally well. The assumptions of this test are that the data are independent and that they are classified into two classes. Since the confusion matrix is generally used to assess the performance of classifiers in the multiclass classification task, we have calculated confusion matrices for different data and used these matrices to measure the performance of the classifiers.

5.1 Performance of Classifiers

1. The form of the classifier is too complex and over-fits the data. This tends to arise when the ratio of parameters to cases exceeds some desirable limit and the classifier begins to fit the random noise in the training data. This will lead to poor generalization. This is why simple models often outperform complex models on novel data.

2. The form of the classifier is too simple or has an inappropriate structure. For example, classes may not be linearly separable, or important predictors may have been excluded. This will reduce accuracy.

3. If some of the training class labels are incorrect or 'fuzzy', there will be problems. Most classifiers assume that class membership in the training data is known without error. Obviously, if class definitions are ambiguous, it becomes more difficult to apply a classification procedure, and any measure of accuracy is likely to be compromised. More importantly, mislabeled cases can influence the structure of the classifier, leading to bias or classifier degradation. For example, a misclassified sample may be an outlier for some predictors, but its influence will depend on the classifier. In discriminant analysis, outliers may have large effects because of their contribution to the covariance matrix.

4. Training samples may be unrepresentative. If they are, this leads to bias and poor performance when the classifier is applied to new samples. Careful sampling designs should avoid this problem, but such bias may be unavoidable if there is significant unrecognised regional and temporal variability.

5. Unequal class sizes in the training and test samples may also affect the classifier's performance.

5.2 McNemar's Test and Confusion Matrix


Classi1 2 Pred-1 b d

From (5.3.2), it can be seen that the larger the $D_g$, the further the classes are separated, and therefore the better the discriminant ability of gene $g$. The correlation


where $S_k(g) = (x_{g1}, x_{g2}, \ldots, x_{gk})$ is the vector of components of gene $g$ in class $k$ only. Based on $r_{g,h}$, the correlation coefficient between a single gene $g$ and the subset $S$ is determined as

$$r_{g,S} = \max_{h \in S} |r_{g,h}|. \qquad (5.3.4)$$

A high value of $r_{g,S}$ indicates that $g$ is highly correlated with a certain gene in $S$ and therefore carries redundant information.

Finally, it is desirable to select the genes that can individually separate the classes well and have small correlation with the genes in the subset $S$. The final score assigned to each gene $g$ is defined as

$$R_{g,S} = D_g - |r_{g,S}| \qquad (5.3.5)$$

where $D_g$ is normalized such that it is in the same range as $|r_{g,S}|$. So, the final score given by (5.3.5) depends on both the discriminant ability $D_g$ and the correlation score $r_{g,S}$. We take only those genes for which the final score, $R$, is greater than 0.5.

5.4 Classification using SVM


In two-class classification, the linear SVM fits a model

$$f(x) = b_0 + \sum_{j=1}^{p} b_j x_j$$

that minimizes

$$\sum_{i=1}^{n} \big(1 - y_i f(x_i)\big)_+ + \lambda \sum_{j=1}^{p} b_j^2.$$

The classification decision is then made according to $\mathrm{sign}[f(x)]$. The weakness of the SVM is that it only estimates $\mathrm{sign}[p(x) - \tfrac{1}{2}]$, where $p(x) = P(C = 1 \mid X = x)$ is the conditional probability of a point being in class 1 given that $X = x$. In multi-class classification, the one-vs-rest scheme is often used: given $K$ classes, the problem is divided into a series of $K$ one-vs-rest problems, and each one-vs-rest problem is addressed by a different class-specific SVM classifier (e.g. class 1 vs. not class 1). A new sample then takes the class of the classifier with the largest real-valued output, $c = \arg\max_{k = 1, 2, \ldots, K} f_k$, where $f_k$ is the real-valued output of the $k$th SVM classifier. Recently, Lee and Lee (2002) extended the two-class SVM to the multiclass case in a more direct way and estimate $\arg\max_k P(C = k \mid X = x)$ directly, but the method still lacks estimates of $P(C = k \mid X = x)$ themselves.

5.5 Application

Binary Classification

In this section we apply the logistic regression method to the binary classification of the Golub data. The genes were selected using the Behrens-Fisher method as proposed in Chapter 2. The logistic discrimination suffered greatly from the high dimension of the data, so we could not perform the logistic discrimination using all


We have used the Weighted Vote method, Support Vector Machines and the Multi-class method of Dudoit for assigning the class labels of the Golub data. In each of these methods, we used the same 25 genes obtained as above after filtering the highly correlated genes. The results are shown in Table 5.3.

Errors        Train         Test          Total   % Error
              ALL   AML     ALL   AML
Wt. Vote      0     0       1     1       2       5.89
SVM           0     0       2     1       3       8.82
Multiclass    0     0       0     1       1       2.94

Table 5.3: Classification of Golub Data taking into account the Correlation among the genes


In all the gene selection processes, I have used the Behrens-Fisher method to select the differentially expressed genes between one kind of sample and the rest of the samples. In multiclass classification, there are two approaches: One versus All and One by One.

The idea of the binary classification rule can be extended to multiclass classification. In the One versus All method, the samples are classified by each classifier into two classes. Suppose there are C classes. We construct C binary classifiers. The kth classifier is trained to separate the kth class from the remaining classes. If a sample is assigned to two or more classes, the class of that sample is decided by majority vote. In the One by One method, we construct all possible binary classifiers. As in the One versus All method, if a sample is assigned to more than one class, the class of that sample is decided by majority vote.

Cut-point   Score (R) >    Wt. Vote   SVM   Dudoit Multiclass   No. of genes
50%         -0.1389128     0,2        1,1   0,2                 639
55%         -0.0628013     0,2        1,2   0,2                 639
60%         0.01331021     0,2        0,3   0,2                 625
65%         0.089421       1,2        0,3   0,2                 526
70%         0.165532       0,2        0,3   0,2                 418
75%         0.2416447      0,2        0,3   0,2                 300
80%         0.317756       0,2        0,3   0,2                 161
85%         0.393867       0,2        0,3   0,2                 89
87%         0.425267       0,2        0,3   0,1                 47
88%         0.431419       0,2        0,3   0,1                 32
89%         0.458178       0,2        0,3   0,1                 27
90%         0.469979       0,2        0,3   0,1                 25
91%         0.514972       1,2        1,3   2,1                 23
95%         0.5460907      2,3        1,3   2,2                 16

Table 5.4: Comparison of Classification Performance on Golub Data after taking into account the correlation between the genes. Different percentage points were used to determine the optimal cut-point.
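The score thresholds compared in Table 5.4 come from a filter that trades discriminant ability against redundancy. The sketch below illustrates one way such a filter could be written. It assumes that the final score is the normalized discriminant ability minus the largest absolute correlation with the already selected genes, and it uses the ordinary Pearson correlation over all samples. Both are assumptions made for illustration, and the gene names, expression values and `d_norm` scores are hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def redundancy(g, selected, expr):
    """Largest absolute correlation between gene g and any selected gene."""
    return max(abs(pearson(expr[g], expr[h])) for h in selected)

def final_score(g, selected, expr, d_norm):
    """Assumed score: normalized discriminant ability minus redundancy."""
    return d_norm[g] - redundancy(g, selected, expr)

# Hypothetical toy data: expression of three genes over four samples.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],     # perfectly correlated with g1
    "g3": [1.0, -1.0, -1.0, 1.0],   # uncorrelated with g1
}
d_norm = {"g2": 0.9, "g3": 0.6}     # hypothetical normalized D_g values
selected = ["g1"]                   # subset S chosen so far

scores = {g: final_score(g, selected, expr, d_norm) for g in ("g2", "g3")}
kept = [g for g in ("g2", "g3") if scores[g] > 0.5]   # cut-point of 0.5
```

In this toy case the gene with the higher discriminant score is dropped because it duplicates a gene already in the subset, which is the behaviour the filter is meant to produce.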


I selected the genes using the proposed Behrens-Fisher (BF) statistic. I took one class of training samples against the rest of the training samples and selected the genes while controlling the FDR at the level 0.05. Then, I took account of the correlation between the selected genes as in Liu et al. [27]. This procedure was repeated for each of the 7 classes. Finally, the common genes selected between BCR vs. rest and E2A vs. rest were used to classify the BCR test samples against the E2A test samples, and so on.

                      No. of   No. of Samples     No. of   Cost     No. of
                      genes    Train     Test     SV                Misclassifications
BCR vs. E2A           6        18,9      9,6      5        0.1167   0,0
E2A vs. HYP           12       42,18     22,9     4        0.083    0,1
HYP vs. MLL           2        14,42     6,22     5        0.5      1,2
MLL vs. T.ALL         7        28,14     15,6     5        0.143    0,1
T.ALL vs. TEL.AML     10       28,52     15,27    5        0.1      0,0
TEL.AML vs. Others    9        52,52     27,27    13       0.111    2,3

Table 5.5: Comparison of Classification Performance on ALL-7 Data using the genes selected by the BF method and taking into account the correlation between the genes. The classes were compared using one versus another.
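The FDR control step used in the gene selection above compares each ordered p-value with its own threshold. A minimal sketch of the Benjamini-Hochberg step-up rule [3] is given below. It is assumed here that this is the variant of FDR control intended, and the function name and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q (step-up rule)."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # The rank-th smallest p-value is compared with rank * q / m.
        if p_values[i] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis up to the largest rank that passed.
    return sorted(order[:k_max])

# Hypothetical p-values for five genes.
p_values = [0.001, 0.045, 0.012, 0.5, 0.02]
selected_genes = benjamini_hochberg(p_values, q=0.05)
```

With thousands of genes the thresholds for the smallest p-values become very strict, which is why only a small number of genes survive in each comparison.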


                  Wt. Vote        SVM             Dudoit
Predicted Class   True class      True class      True class
                  ALL   AML       ALL   AML       ALL   AML
ALL               19    1         19    2         20    1
AML               1     13        1     12        0     13

Table 5.6: Confusion Matrices of Test samples of Golub Data by Weighted Vote, SVM and Dudoit methods

5.6 Confusion Matrices

Table 5.7 shows the confusion matrix for the MLL 3-class data described in Chapter 2.

               Training Sample       Test Sample
Pred. Class    True Class            True Class
               ALL   MLL   AML       ALL   MLL   AML
ALL            20    3     0         2     1     0
MLL            0     12    2         2     2     0
AML            0     2     18        0     0     8

Table 5.7: Confusion Matrix of MLL data applying the Dudoit Classification method after taking correlation

               Training Sample       Test Sample
Pred. Class    True Class            True Class
               ALL   MLL   AML       ALL   MLL   AML
ALL            20    1     0         3     1     0
MLL            0     15    0         1     2     0
AML            0     1     20        0     0     8

Table 5.8: Confusion Matrix of MLL data applying the Weighted Vote Classification method after taking correlation

               Training Sample       Test Sample
Pred. Class    True Class            True Class
               ALL   MLL   AML       ALL   MLL   AML
ALL            19    1     1         3     1     0
MLL            0     12    0         0     1     0
AML            1     4     9         1     1     8

Table 5.9: Confusion Matrix of MLL data applying the SVM Classification method after taking correlation

From the tables of confusion matrices, we see that the classification performance of the classifiers is greatly increased even with a small number of genes. In each of the methods, fewer than 10 genes are used for classification. So, in terms of cost, the methods are suitable for classification.

Now we are going to calculate the assessments of different classifiers. If two classifiers are used to classify the test samples after they were trained on the same training sets, the numbers of misclassifications by each method (called errors) are calculated and arranged in a contingency table. For comparing the performance of the Dudoit multi-class method and the Support Vector Machines method on the classification of the Golub test samples, we get the contingency table 5.10. The McNemar statistic is $M = 0.5 < \chi^2_{(1),\,0.05} = 3.84$. So we do not reject the hypothesis that the two methods have the same error rate.

              SVM
Method        1     2
              0     29

Table 5.10: Contingency table for misclassification of Golub Test Data

The performance of the classifiers is given in Tables 3.7 and 3.8 of Chapter 3. In Chapter 3, we had trained the classifiers using the leave-one-out method.
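The McNemar statistic reported for Table 5.10 can be reproduced from the two off-diagonal (discordant) cells of the contingency table. The sketch below assumes the continuity-corrected form of the statistic, which is the form that gives 0.5 for the discordant counts 2 and 0. The function name is illustrative.

```python
def mcnemar_statistic(b, c):
    """McNemar statistic with continuity correction from discordant counts."""
    if b + c == 0:
        return 0.0   # no discordant pairs, so no evidence of a difference
    return (abs(b - c) - 1) ** 2 / (b + c)

CHI2_1_CRIT_05 = 3.84   # chi-square(1) critical value at the 0.05 level

# Off-diagonal cells of the Golub contingency table (Table 5.10).
m_golub = mcnemar_statistic(2, 0)
same_error_rate_not_rejected = m_golub < CHI2_1_CRIT_05
```

Because only the discordant cells enter the statistic, the test has very little power when the two classifiers disagree on just a handful of test samples, as is the case here.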


              SVM
Method        8     2
              1     4

Table 5.11: Contingency table for misclassification of MLL Test Data

5.7 Binary Accuracy Measures

Correct Classification Rate (CCR)   (a + d)/N
Misclassification Rate              (b + c)/N
Sensitivity                         a/(a + c)
Specificity                         d/(b + d)
False Positive Rate                 b/(b + d)
False Negative Rate                 c/(a + c)

Table 5.12: Confusion matrix derived accuracy measures, N = a + b + c + d

                                    Wt. Vote   SVM     Logistic Discrimination   Dudoit Multi-class
Correct Classification Rate (CCR)   0.941      0.912   0.971                     0.853
Misclassification Rate              0.059      0.089   0.029                     0.147
Sensitivity                         0.95       0.95    1                         0.85
Specificity                         0.929      0.857   0.928                     0.857
False Positive Rate                 0.071      0.143   0.072                     0.143
False Negative Rate                 0.05       0.05    0                         0.15

Table 5.13: Accuracy measures for the Wt. Vote, SVM, Logistic Discrimination and Dudoit Multi-class Classifiers
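The measures of Table 5.12 are simple ratios of the four cells of a 2 x 2 confusion matrix. The sketch below evaluates them for the Weighted Vote test-sample matrix of Table 5.6, under the assumption that a and d are the correctly classified ALL and AML counts and b and c are the two kinds of error. The function name is hypothetical.

```python
def accuracy_measures(a, b, c, d):
    """Accuracy measures of Table 5.12 from the cells of a 2x2 confusion matrix."""
    n = a + b + c + d
    return {
        "ccr": (a + d) / n,                  # correct classification rate
        "misclassification": (b + c) / n,
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "false_positive": b / (b + d),
        "false_negative": c / (a + c),
    }

# Weighted Vote on the Golub test samples: a = 19, b = 1, c = 1, d = 13.
wt_vote = accuracy_measures(19, 1, 1, 13)
rounded = {name: round(value, 3) for name, value in wt_vote.items()}
```

The rounded values agree with the first column of Table 5.13, which suggests the cell layout assumed here matches the one used for the tables.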


Errors        Train         Test          Total   % Error
              ALL   AML     ALL   AML
Wt. Vote      0     0       0     0       0       0
SVM           0     0       0     0       0       0
Multiclass    3     0       0     0       0       8.82

Table 5.14: Classification of Golub Data taking into account Correlated test statistics


In microarray data, outliers are also present because of measurement error and other sources. The expression levels of such outliers have not been accounted for here. Before analyzing such data, one has to identify and remove the outliers, since their presence might distort the results. So, I will work on identifying and removing the outliers in my future work.

Finally, since dimension reduction is extremely important for the correct classification of samples using microarray data, I will work on this aspect as well in my future research.


References

[1] Baldi, P. and Long, A.: A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized t-test and Statistical Inferences of Gene Changes; Bioinformatics, Vol. 17, 2001.
[2] Baudat, G. and Anouar, F.: Kernel Based Methods and Function Approximation; Proceedings of IJCNN, 2001.
[3] Benjamini, Y. and Hochberg, Y.: Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society B, Vol. (57), 1995.
[4] Best, D. and Rayner, L.: Welch's Approximate Solution for the Behrens-Fisher Problem; Technometrics, Vol. 29 (II), 1987.
[5] Box, G. and Tiao, C.: Bayesian Inference in Statistical Analysis; Addison-Wesley, 1973.
[6] Cawley, G. and Talbot, N.: Fast Exact Leave-One-Out Cross-Validation of Sparse Least Squares Support Vector Machines, Neural Networks, 2004.
[7] Chen, X., Zehnbauer, B., Gnirke, A. and Kwok, P.: Fluorescence energy transfer detection as a homogeneous DNA diagnostic method, Proc Natl Acad Sci USA, Vol (94), pp. 56-61, 1997.
[8] DeRisi, J. et al.: Use of a cDNA microarray to analyse gene expression patterns in human cancer, Nature Genetics, 1996.
[9] Duda, R., Hart, H. and Stork, S.: Pattern Classification; Wiley Interscience, 2003.


[10] Dudoit, S., Fridlyand, J. and Speed, T.: Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data, Journal of the American Statistical Association, Vol. (97), pp. 77-87, 2002.
[11] Efron, B.: Correlation and Large Scale Simultaneous Significance Testing, Journal of American Statistical Association, Vol 107, 2007.
[12] Efron, B., Tibshirani, R., Storey, J. and Tusher, V.: Empirical Bayes Analysis of a Microarray Experiment, Journal of American Statistical Association, 2001.
[13] Fox, R. and Dimmic, M.: A two-sample t-test for Microarray data; BMC Bioinformatics, Vol. 7, 2006.
[14] Friedman, J. H.: Regularized discriminant Analysis. Journal of the American Statistical Association. 84:165-173, 1989.
[15] Gelman, A., Carlin, J., Stern, H. and Rubin, D.: Bayesian Data Analysis; 2nd edition, Chapman & Hall, 2004.
[16] Golub, T., Slonim, D., Tamayo, P. et al.: Molecular Classification of Cancer: Class Discovery and Class Predictions by Gene Expression Monitoring; Science, Vol. 286, 1999.
[17] Gottardo, R. and Pannucci, J. et al.: Statistical analysis of Microarray Data: A Bayesian Approach; Biostatistics, Vol. 1, 2002.
[18] Guo, Y., Hastie, T. and Tibshirani, R.: Regularized Linear Discriminant Analysis and its Application in Microarrays; Biostatistics, Advanced Access, 2007.
[19] Guyon, I., Weston, J., Barnhill, S. and Vapnik, V.: Gene selection for cancer classification using support vector machines, Machine Learning, Vol (46), pp. 389-422, 2002.
[20] Hastie, T., Tibshirani, R. and Friedman, J.: The Elements of Statistical Learning theory; Springer-Verlag, 2001.


[21] Herbrich, Ralf: Learning Kernel Classifiers, The MIT Press, 2002.
[22] Kerr, M., Martin, M. and Churchill, G. A.: Analysis of Variance for gene expression microarray data; Journal of Computational Biology, Vol. 7, 2000.
[23] Kim, S. and Cohen, A.: On the Behrens-Fisher Problem: A Review; Journal of Educational and Behavioral Statistics, Vol. 23, 1998.
[24] Li, F., Yang, Y.: Analysis of recursive gene selection approaches from microarray data; Bioinformatics, Vol. (21), 2005.
[25] Li, H., Jiang, T. and Zhang, K.: Efficient and robust feature extraction by maximum margin criterion; IEEE Trans Neural Network, Vol. (17), 2006.
[26] Liao and Chin: Logistic Regression for Disease Classification using Microarray Data: Model selection in a Large p and Small n case; Bioinformatics, 2007.
[27] Liu, Y. and Zheng, Y.: FS SFS: A novel feature selection method for support vector machines, Pattern Recognition Society, 2005.
[28] Mukherjee, S.: Classifying microarray data using support vector machines. Understanding and Using Microarray Analysis Techniques: A Practical Guide, Springer-Verlag, Heidelberg, 166-185, 2001.
[29] Newton, M. A. and Kendziorski, C.: Parametric Empirical Bayes Methods for Microarrays; The analysis of gene expression data: methods and software, New York: Springer Verlag, 2003.
[30] Pan, W.: A Comparative Review of Statistical Methods for Discovering Differentially Expressed Genes in Replicated Microarray Experiments, Bioinformatics, Vol. 18 (4), pp. 546-554, 2001.
[31] Patil, V. H.: Approximation to the Behrens-Fisher Distributions; Biometrika, Vol. 1, 1965.


[32] Pawitan, Y., Michiels, S., Koscielny, S., Gusnanto, A. and Ploner, A.: False Discovery Rate, Sensitivity and Sample Size for Microarray Studies; Bioinformatics, Vol. 21, 2005.
[33] Pomeroy, J. et al.: Classification, Subtype Discovery, and Prediction of Outcome in Pediatric Acute Lymphoblastic Leukemia by Gene Expression Profiling, Cancer Cell, Vol (1), pp. 133-143, March, 2002.
[34] Ramaswamy, S. et al.: Multiclass cancer diagnosis using tumor gene expression signatures, PNAS, Vol. (26), pp. 49-54, 2001.
[35] Sartor, M., Tomlinson, C. et al.: Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments; BMC Bioinformatics, Vol. 7, 2006.
[36] Satoshi, N. and Satoru, K.: Recursive gene selection based on maximum margin criterion: a comparison with SVM-RFE; BMC Bioinformatics, Vol. (7), 2006.
[37] Shi, J.: Significance levels for studies with correlated test statistics, Biostatistics, Vol. (9), pp. 458-466, 2008.
[38] Schena, M.: Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray; Science, Vol. 270, 1995.
[39] Schwender, H.: Identifying Differentially Expressed Genes with siggenes; A Bioconductor Package, Version 2.1, 2004.
[40] Shapiro, S. and Wilk, M.: An analysis of variance test for normality (complete samples), Biometrika, Vol 52, 3 and 4, pages 591-611, 1965.
[41] Smyth, G. K.: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments; Statistical Applications in Genetics and Molecular Biology, Vol. 3, 2004.


[42] Shrestha, N., Ramachandran, K.: Behrens-Fisher Distribution for Selecting Differentially Expressed genes; Neural, Parallel and Scientific Computations, Vol 16 (147-164), 2002.
[43] Storey, J., Tibshirani, R.: Estimating False Discovery Rate under dependence with applications to DNA microarrays, Technical Reports, Stanford University, 2001.
[44] Tan, A. et al.: Simple Decision Rules for Classifying human cancers from gene expression profiles; Bioinformatics, Vol. (20), 2005.
[45] Tang, Suganthan, Yao: Gene selection for microarray data based on least squares support vector machine; BMC Bioinformatics, 2006.
[46] Thomas, J., Olson, J., Tapscott, S. and Zhao, L.: An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles, Genome Res, Vol. (11), pp. 1227-1236, 2001.
[47] Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G.: Class prediction by Nearest Shrunken Centroids with Application to DNA microarrays; Statistical Sciences, 2003.
[48] Tusher, V., Tibshirani, R. and Chu, G.: Significance Analysis of Microarrays applied to the Ionizing Radiation Response; PNAS, Vol. 98, 2001.
[49] Vapnik, V.: Statistical Learning Theory. John Wiley and Sons, 1998.
[50] Watson, J. and Crick, F.: A structure for Deoxyribose Nucleic Acid; Nature, Vol. 171, 1953.
[51] Westfall and Young: Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment, Wiley Series in Probability and Statistics, 1995.


[52] Yekutieli, D., Benjamini, Y.: Resampling-based false discovery rate controlling multiple test procedures for the correlated test statistics; Journal of Statistical Planning and Inference, Vol. 82, 1999.
[53] Yeo, In-Kwon and Johnson, R.: A new family of power transformations to improve normality or symmetry. Biometrika, 87, 954-959, 2000.
[54] Zhang, A.: Advanced Analysis of Gene Expression Microarray Data; World Scientific, 2006.
[55] Zhu, J., Hastie, T.: Classification of gene microarrays by penalized logistic regression. Biostatistics, 2004, Vol. 5, pp. 427-443.


The author is interested in applying statistical knowledge to biological and genetic data, especially in microarray data analysis. He is currently working on finding differentially expressed genes and on the classification of different kinds of cancer using microarray data.


Catalog Record

Identifier:
E14-SFE0003506
Creator:
Manandhr_Shrestha, Nabin.
Title:
Statistical learning and behrens fisher distribution methods for heteroscedastic data in microarray analysis [electronic resource] / by Nabin Manandhr_Shrestha.
Publisher:
[Tampa, Fla] : University of South Florida, 2010.
Notes:
Title from PDF of title page.
Document formatted into pages; contains X pages.
Includes vita.
Dissertation (Ph.D.)--University of South Florida, 2010.
Includes bibliographical references.
Text (Electronic dissertation) in PDF format.
Mode of access: World Wide Web.
System requirements: World Wide Web browser and PDF reader.
ABSTRACT: The aim of the present study is to identify the genes that are differentially expressed between two conditions and to use them to predict the class of new samples from microarray data. Microarray data analysis poses many challenges to statisticians because of its high dimensionality and small sample size, dubbed the "small n, large p" problem. Microarray data have been extensively studied by many statisticians and geneticists. The data are generally assumed to follow a normal distribution with equal variances in the two conditions, but this is not true in general. Since the number of replications is very small, the sample estimates of the variances are not appropriate for testing. Therefore, we consider a Bayesian approach to approximate the variances in the two conditions. Because the number of genes to be tested is usually large and the test is repeated thousands of times, there is a multiplicity problem. To correct for multiple comparisons, we use the False Discovery Rate (FDR) correction. When the hypothesis test is applied gene by gene for several thousand genes, there is a great chance of selecting false genes as differentially expressed, even though the significance level is set very small. For the test to be reliable, the probability of selecting true positives should be high. To control the false positive rate, we have applied the FDR correction, in which the p-value for each gene is compared with its corresponding threshold. A gene is then said to be differentially expressed if its p-value is less than the threshold. We have developed a new method of selecting informative genes based on the Bayesian version of the Behrens-Fisher distribution, which assumes unequal variances in the two conditions.
Since the assumption of equal variances fails in most situations, and equal variance is a special case of unequal variance, we have tried to solve the problem of finding differentially expressed genes in the unequal variance case. We have found that the developed method selects the truly expressed genes in simulated data, and we have compared this method with recent methods such as Fox and Dimmic's t-test method and Tusher and Tibshirani's SAM method, among others. The next step of this research is to check whether the genes selected by the proposed Behrens-Fisher method are useful for the classification of samples. Using the genes selected by the proposed method, which combines the Behrens-Fisher gene selection method with some other statistical learning methods, we have found better classification results. The reason behind this is the capability of selecting genes based on knowledge of both the prior and the data. In the case of microarray data, because of the small sample size and the large number of variables, the variances obtained from the sample are not reliable, in the sense that the estimate is not positive definite and not invertible. So, we have derived the Bayesian version of the Behrens-Fisher distribution to remove that insufficiency. The efficiency of this method has been demonstrated by applying it to three real microarray data sets and calculating the misclassification error rates on the corresponding test sets. Moreover, we have compared our results with some of the other popular methods found in the literature, such as the Nearest Shrunken Centroid and Support Vector Machines methods. We have studied the classification performance of different classifiers before and after taking into account the correlation between the genes. The classification performance of the classifiers improved significantly once the correlation was accounted for. The classification performance of the different classifiers has been measured by the misclassification rates and the confusion matrix.
Another problem in the multiple testing of a large number of hypotheses is the correlation among the test statistics. We have taken the correlation between the test statistics into account. If there were no correlation, the shape of the normalized histogram of the test statistics would not be affected. As shown by Efron, the degree of correlation among the test statistics either widens or shrinks the tails of the histogram of the test statistics. Thus the usual rejection region obtained from the significance level is not sufficient. The rejection region should be redefined accordingly, depending on the degree of correlation. The effect of the correlation on selecting the appropriate rejection region has also been studied.
Advisor:
Kandethody M. Ramachandran, Ph.D.
Subjects / Keywords:
Genes
False Discovery Rate
Multiple Testing
Correlation
Classification
Dissertations, Academic -- Mathematics and Statistics -- Doctoral -- USF
Host Item:
USF Electronic Theses and Dissertations.
URL:
http://digital.lib.usf.edu/?e14.3506