USF Libraries
USF Digital Collections

Mixture distributions with application to microarray data analysis


Material Information

Title:
Mixture distributions with application to microarray data analysis
Physical Description:
Book
Language:
English
Creator:
Lynch, O'Neil
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla.
Publication Date:

Subjects

Subjects / Keywords:
Likelihood ratio test
Modified likelihood
Penalized likelihood
Asymptotic chi-square distribution
Consistency
Dissertations, Academic -- Mathematics and Statistics -- Doctoral -- USF   ( lcsh )
Genre:
bibliography   ( marcgt )
non-fiction   ( marcgt )

Notes

Summary:
ABSTRACT: The main goal in analyzing microarray data is to determine the genes that are differentially expressed across two types of tissue samples or samples obtained under two experimental conditions. In this dissertation we proposed two methods to determine differentially expressed genes. For the penalized normal mixture model (PMMM) to determine genes that are differentially expressed, we penalized both the variance and the mixing proportion parameters simultaneously. The variance parameter was penalized so that the log-likelihood will be bounded, while the mixing proportion parameter was penalized so that its estimates are not on the boundary of its parametric space. The null distribution of the likelihood ratio test statistic (LRTS) was simulated so that we could perform a hypothesis test for the number of components of the penalized normal mixture model. In addition to simulating the null distribution of the LRTS for the penalized normal mixture model, we showed that the maximum likelihood estimates were asymptotically normal, which is a first step that is necessary to prove the asymptotic null distribution of the LRTS. This result is a significant contribution to the field of normal mixture models. The modified p-value approach for detecting differentially expressed genes was also discussed in this dissertation. The modified p-value approach was implemented so that a hypothesis test for the number of components can be conducted by using the modified likelihood ratio test. In the modified p-value approach we penalized the mixing proportion so that the estimates of the mixing proportion are not on the boundary of its parametric space. The null distribution of the LRTS was simulated so that the number of components of the uniform beta mixture model can be determined. Finally, both modified methods, the penalized normal mixture model and the modified p-value approach, were applied to simulated and real data.
Thesis:
Dissertation (Ph.D.)--University of South Florida, 2009.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by O'Neil Lynch.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 109 pages.
General Note:
Includes vita.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 002317272
oclc - 659882175
usfldc doi - E14-SFE0003022
usfldc handle - e14.3022
System ID:
SFS0027339:00001


Full Text

PAGE 1

by O'Neil Lynch
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Mathematics and Statistics
College of Arts and Sciences
University of South Florida
Co-Major Professor: Kandethody Ramachandran, Ph.D.
Co-Major Professor: Wonkuk Kim, Ph.D.
Chris Tsokos, Ph.D.
Tapas Das, Ph.D.
Date of Approval: September 4, 2008
Keywords: Likelihood ratio test; Modified likelihood; Penalized likelihood; Asymptotic chi-square distribution; Consistency
© Copyright 2009, O'Neil Lynch

PAGE 3

I also would like to thank the other members of my committee, namely Dr. C. Tsokos and Dr. T. K. Das, who all have given of their time and have always encouraged me to do my best. The staff at the Department of Mathematics and Statistics is deserving of my thanks for the help and advice they have given me over the years.
I owe thanks to my family and a large number of friends for the encouragement and complete understanding in my quest to excel in the field of Statistics.

PAGE 4

List of Tables ... vi
Abstract ... viii
1 Introduction ... 1
2 Microarray Data and Some Statistical Analysis ... 4
2.1 DNA Microarray Experiments ... 4
2.1.1 Genetic Background ... 4
2.1.2 cDNA Microarray Experiment ... 5
2.1.3 Oligonucleotide Microarray Experiment ... 6
2.2 Some Statistical Challenges With Analyzing Microarray Data ... 7
2.3 Methods of Analyzing Microarray Data ... 8
2.3.1 Cluster Analysis ... 8
2.3.2 The T-Test ... 10
2.3.3 Regression Model ... 10
2.3.4 Significance Analysis of Microarrays (SAM) ... 11
2.3.5 Mixture Model Method (MMM) ... 13
2.3.6 A Mixture Model Approach Using P-Value ... 15
2.4 Conclusion ... 18
3 Finite Mixture Distribution ... 19
3.1 Definition and Preliminary ... 20
3.1.1 Examples of Mixture Distributions ... 21

PAGE 5

3.1.3 Comparison of Two Groups: Iris Data ... 24
3.2 Parameter Estimation ... 31
3.2.1 Expectation Maximization Algorithm ... 31
3.2.2 Robust Parameter Estimation ... 38
3.2.3 Penalized Maximum Likelihood Estimator for Normal Mixture Models ... 41
3.3 Estimating the Number of Components g ... 44
3.3.1 Simulation Approach ... 45
3.3.2 Modified Likelihood Ratio Test for Homogeneity in Finite Mixture Models ... 48
3.3.3 Regularity Conditions ... 52
3.4 Box-Cox transformation ... 53
3.5 Conclusion ... 54
4 The Penalized Modified Likelihood for Normal Mixture Model ... 56
4.1 Penalized Modified Likelihood ... 57
4.2 Parameter Estimation of Penalized Modified Likelihood ... 58
4.2.1 Inverse Gamma Penalty Function for $\sigma^2$ ... 60
4.2.2 Inverse Chi-Square Penalty Function for $\sigma^2$ ... 62
4.3 Consistency and Asymptotic Normality ... 64
4.3.1 Preliminary Results ... 65
4.3.2 Main results ... 73
4.4 Conclusion ... 75
5 Penalized Modified Likelihood Approach to Microarray Data Analysis ... 76
5.1 Penalized Modified Mixture Model (PMMM) ... 77
5.2 PMMM Simulated Null Distribution ... 77
5.3 Simulating Microarray Data ... 81
5.4 Application of PMMM to Simulated Microarray Data ... 81
5.5 Application of PMMM to the Rat data ... 86
5.6 Conclusion ... 90

PAGE 6

6.1 The modified P-Value Approach ... 91
6.2 Simulated Null Distribution of the Modified P-Value Approach ... 95
6.3 Application of Modified P-Value to Simulated Microarray Data ... 96
6.4 Application of Modified P-Value to simulated Prostate Data ... 98
6.5 Application of Modified P-Value to the Prostate Data ... 99
6.6 Conclusion ... 101
7 Summary and Concluding Remarks ... 102
References ... 105
About the Author ... End Page

PAGE 7

3.2 Q-Q plots of Sepal lengths for versicolor flowers ... 26
3.3 Q-Q plots of Sepal lengths for virginica flowers ... 26
3.4 Histogram and known mixture distribution ... 27
3.5 Histogram and estimated mixture distribution ... 28
3.6 Histogram with known and estimated mixture distribution ... 29
3.7 Graphical representations of two component normal with equal variance ... 30
3.8 Histogram, heteroscedastic and homoscedastic fit for simulated data from the mixture $0.5\,\phi(y \mid 4, 1) + 0.5\,\phi(y \mid 8, 1)$ ... 41
3.9 Histogram, heteroscedastic and homoscedastic fit for simulated data from the mixture $0.5\,\phi(y \mid 4, 1) + 0.5\,\phi(y \mid 8, 2)$ ... 42
5.1 Histograms of z, Z and fitted models for the simulated data ... 83
5.2 The likelihood ratio statistic as a function of Z value for the simulated data ... 85
5.3 The values of FDR from our method and SAM for the simulated data ... 85
5.4 Histograms of z, Z and fitted models for the rat data ... 87
5.5 The likelihood ratio curve for the rat data ... 89
5.6 The values of FDR from our method and SAM for the rat data ... 89
6.1 Various Beta Distributions ... 94
6.2 Histogram of p-value and beta mixture distributions for 2000 simulated genes ... 97

PAGE 8

6.4 Histogram of p-value and beta mixture distributions for the prostate data ... 100

PAGE 9

5.1 Mean, variance and percentiles for the penalized modified likelihood, based on 1000 replicates for each sample for testing the hypothesis a 1-component against 2-components ... 79
5.2 Mean, variance and percentiles for the penalized modified likelihood, based on 1000 replicates for each sample for testing the hypothesis 2-components against 3-components ... 80
5.3 Hypothesis test for the number of components for the fitted normal mixture models of z and Z, for the simulated microarray data ... 82
5.4 Values of TP, FP and FDR from PMMM for the simulated data ... 84
5.5 Values of TP, FP and FDR from SAM for the simulated data ... 84
5.6 Hypothesis test for the number of components for the fitted normal mixture models of z and Z, for the rat data ... 86
5.7 Values of TP, FP and FDR from PMMM for the rat data ... 88
5.8 Values of TP, FP and FDR from SAM for the rat data ... 88
6.1 Mean, variance and percentiles for the likelihood ratio test for the modified p-value, based on 1000 replicates for each sample for testing the hypothesis a uniform against a uniform and one beta distribution ... 95

PAGE 10

6.3 Hypothesis test for the number of components for the fitted uniform-beta mixture models for the prostate data ... 99

PAGE 11

O'Neil Lee Lynch
ABSTRACT
The main goal in analyzing microarray data is to determine the genes that are differentially expressed across two types of tissue samples or samples obtained under two experimental conditions. In this dissertation we proposed two methods to determine differentially expressed genes. For the penalized normal mixture model (PMMM) to determine genes that are differentially expressed, we penalized both the variance and the mixing proportion parameters simultaneously. The variance parameter was penalized so that the log-likelihood will be bounded, while the mixing proportion parameter was penalized so that its estimates are not on the boundary of its parametric space. The null distribution of the likelihood ratio test statistic (LRTS) was simulated so that we could perform a hypothesis test for the number of components of the penalized normal mixture model. In addition to simulating the null distribution of the LRTS for the penalized normal mixture model, we showed that the maximum likelihood estimates were asymptotically normal, which is a first step that is necessary to prove the asymptotic null distribution of the LRTS. This result is a significant contribution to the field of normal mixture models.
The modified p-value approach for detecting differentially expressed genes was also discussed in this dissertation. The modified p-value approach was implemented so that a hypothesis test for the number of components can be conducted by using the modified likelihood ratio test. In the modified p-value approach we penalized the mixing proportion so that the estimates of the mixing proportion are not on the boundary of its

PAGE 13

Many of the methods used for such analysis, including the method of identifying genes with fold changes, are known to be unreliable because in such methods the statistical variability of the data is not properly addressed [8]. While various parametric methods and tests such as the two-sample t-test [12] and regression model have been applied for microarray data analysis, strong parametric assumptions made in these methods as well as strong dependency on large sample sets restrict the reliability of such techniques in microarray problems where only a small number of replications are available. The nonparametric statistical methods, including the Empirical Bayes (EB) method [14], the significance analysis for microarray data (SAM [39]) and mixture model method (MMM) [27, 42, 25], have been applied to microarray data analysis. It is claimed and argued that the new extensions of the MMM are among the available methods producing biologically-meaningful results [27, 43].
In this dissertation we extended the mixture model method (MMM) by penalizing the mixing proportions and the component variances simultaneously. The mixing proportion was penalized so that a modified likelihood ratio test similar to that of Chen et al. (2001, 2004) for testing the number of components of the fitted normal mixture model can be implemented. The variance was penalized so that the log-likelihood is bounded resulting

PAGE 14

This dissertation is organized as follows. Chapter 2 describes in some detail the genetic background of DNA and two of the leading microarray experiments, cDNA and oligonucleotide. In Chapter 2 we also discuss some of the statistical challenges in analyzing microarray data and give a description of some of the methods used to analyze microarray data. The methods discussed are (1) cluster analysis, (2) the t-test, (3) regression analysis, (4) significance analysis of microarrays (SAM), (5) the mixture model method (MMM) and (6) a p-value approach for detecting differentially expressed microarray data.
In Chapter 3 we present the theory of finite mixture methods and discuss how the parameters can be estimated by (1) the expectation maximization (EM) algorithm and (2) robust parameter estimation, which is of interest if the data contain outliers. One of the challenges of finite mixture distributions is to determine the number of components; therefore we discuss some techniques used to determine the number of components, namely AIC, BIC, simulation and the modified likelihood ratio test. The Box-Cox transformation for distinguishing skewed distributions from commingled distributions is also presented in Chapter 3.
The penalized modified approach is discussed in Chapter 4. The estimators of the parameters of the penalized normal mixture model when both the variance and the mixing proportion are simultaneously penalized are illustrated. The evaluation of the estimators for the two penalty functions for the variance, the inverse gamma and inverse chi-square distributions, is addressed. The asymptotic property, namely asymptotic normality, of the normal mixture model is also proved in Chapter 4. Chapter 5 discusses the applications of the penalized/modified approach of the normal mixture model to detecting differentially expressed genes and illustrates its applications to simulated and

PAGE 15

Chapter 6 discusses the modified p-value approach for detecting differentially expressed genes. Similar to the work done in Chapter 5, we applied our method to simulated and real data. The motivation for modifying the p-value approach of Allison et al. was that the MLE of the mixing proportion was on the boundary point of its parametric space; therefore we applied the technique of Chen et al., that is, we applied a penalty function for the mixing proportion so that the MLE of the mixing proportion will not be on the boundary points of its parametric space. The conclusions of this study are summarized in Chapter 7.

PAGE 16

In cells, genes consist of a long strand of DNA that contains a promoter, which controls the activity of a gene. Additionally, all living cells contain chromosomes, that is, large pieces of DNA containing hundreds or thousands of genes, each of which specifies the composition and structure of a protein (or several related proteins). The workhorse molecules of the cell are proteins, polymers of amino acids which are responsible for cellular structure, producing energy and important biomolecules like DNA and proteins, and for reproducing the cell chromosomes. The cohort of chromosomes is almost the same in every cell in an organism, and contains the same repertoire of proteins. However, cells have remarkably distinct properties, such as the difference between human eye cells, hair cells, and liver cells; these distinctions are the result of differences in the abundance,

PAGE 17

When a gene is active, the coding and non-coding sequence is copied in a process called transcription, producing messenger RNA (mRNA), which is a copy of the gene's information. The mRNA, a small and relatively unstable nucleic acid polymer, can then direct the synthesis of proteins through the genetic code. However, mRNAs can also be used directly, for example as part of the ribosome. The resulting molecules from the gene expression, mRNA or protein, are known as gene products. There is therefore a logical connection between the state of a cell and the details of its protein and mRNA composition.
Whereas it remains difficult to measure the abundance of a cell's proteins, the DNA microarray makes it possible to quickly and efficiently measure the relative representation of each mRNA species in the total cellular mRNA population, or in more familiar terms, to measure gene expression levels. There are several types of microarray systems including the cDNA microarray (Schena et al., 1995; DeRisi et al., 1997; Hughes et al., 2001) and the oligonucleotide array (Lockhart et al., 1996).
2.1.2 cDNA Microarray Experiment
In the third step the cDNA is synthesized, a procedure that also involves labeling the isolated mRNA from the biological samples. Usually in the most current cDNA microarray experiments, cDNAs from the experimental and reference samples are labeled with red-fluorescent dye, Cy5, and green-fluorescent dye, Cy3, respectively, mixed and hybridized on the slide. There are several different labeling methods including Primer

PAGE 18

Fourth, the labeled probe cDNA is hybridized to the target cDNA on the microarray. That is, if a particular gene is expressed in the target cell, so that the cDNAs corresponding to this gene are found in the target cDNA pool, these cDNAs will bind with the complementary cDNA probes printed on a specific spot on the microarray. Hybridization refers to the binding ability of two complementary DNA strands by the base-pairing rule, thus reforming the DNA double helix.
Finally, the hybridization results are imaged and analyzed using a fluorescent microscope, and the log(red/green) intensities of mRNA hybridization at each site are measured. The result is tens of thousands of gene expressions, typically ranging from -4 to 4, each of which is a measure of the expression level of a gene in the experimental sample relative to the reference sample. Positive values indicate higher expression in the target versus the reference, and vice versa for negative values.
2.1.3 Oligonucleotide Microarray Experiment

PAGE 19

Since statisticians are primarily interested in genes that are differentially expressed across two different experimental conditions, which may refer to samples drawn from two types of tissues, tumors or cell lines, or at two time points during important biological processes, we need to make an adjustment for the type I error rate when doing simultaneous hypothesis tests. This adjustment is done by means of the Bonferroni method, to deal with multiple comparisons. If we use $\alpha$ as the significance level, then the test- or gene-specific significance level for a two-sided test is therefore
$$\alpha^{*} = \frac{\alpha}{2n}.$$
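The Bonferroni adjustment described above can be sketched in a few lines. This is a minimal illustration only: the function name `bonferroni_level` and its arguments are hypothetical, and `n_tests` is assumed to be the number of simultaneous tests (one per gene).

```python
def bonferroni_level(alpha, n_tests, two_sided=True):
    """Per-test significance level under a Bonferroni correction.

    alpha     : overall (family-wise) significance level
    n_tests   : number of simultaneous hypothesis tests (assumed: one per gene)
    two_sided : if True, the level is additionally split between both tails
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    divisor = 2 * n_tests if two_sided else n_tests
    return alpha / divisor


# Example: 1000 genes tested at an overall level of 0.05.
per_gene = bonferroni_level(0.05, 1000)
```

With thousands of genes the per-gene level becomes very small, which is one reason the later sections look at alternatives to gene-by-gene testing.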

PAGE 20

1. Is there statistically significant evidence that any of the genes under study exhibit a difference in expression across the conditions?
2. What is the best estimate of the number of genes for which there is a true difference in gene expression?
3. What is the confidence interval around that particular estimate?
4. If we set some threshold for which we expect particular genes to be interesting and worthy of follow-up study, what proportion of those genes are likely to be genes for which there is a real difference in expression and what proportion are likely to be false leads?
5. What proportion of those genes not declared "interesting" are likely to be genes for which there is a real difference in expression (i.e., misses or false negatives)?
In analyzing microarray data the assumptions made are: (1) For each gene, the measurements of gene expression have a finite population mean and variance; (2) For each gene under study, there is a measure of expression level available for each sample and this measure has sufficient reliability and validity (i.e. the measurements of the expression levels are a true reflection of the true state of nature); (3) The most important assumption that is made is that gene expression levels across the two groups are independent, which implies that we may be able to evaluate the likelihood function, which will become important later in this dissertation.
2.3 Methods of Analyzing Microarray Data
2.3.1 Cluster Analysis

PAGE 21

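where
$$x_k = \begin{cases} 1 & \text{for } 1 \le k \le m, \\ 0 & \text{for } m+1 \le k \le m+n. \end{cases}$$
Using the data construction of equation (2.3.1) for $Y_{ik}$ we will now present the t-test and regression models used in microarray data analysis.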

PAGE 22

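Let the sample means and variances of $Y_{ik}$ for gene $i$ under the two conditions be
$$\bar{Y}_{i(1)} = \frac{\sum_{k=1}^{m} Y_{ik}}{m} \quad\text{and}\quad s^2_{i(1)} = \frac{\sum_{k=1}^{m} (Y_{ik} - \bar{Y}_{i(1)})^2}{m-1},$$
with $\bar{Y}_{i(2)}$ and $s^2_{i(2)}$ defined analogously over $k = m+1, \dots, m+n$. The t-statistic is
$$Z_i = \frac{\bar{Y}_{i(1)} - \bar{Y}_{i(2)}}{\sqrt{s^2_{i(1)}/m + s^2_{i(2)}/n}}.$$
Under the normality assumption for $Y_{ik}$, $Z_i$ approximately has a t-distribution with degrees of freedom $d_j = s^2_{i(1)} \cdots$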
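The two-sample t-statistic for a single gene can be sketched as follows. This is a hedged illustration, not the author's code: the function name `two_sample_t` is hypothetical, the variances use the unbiased divisors m-1 and n-1, and the degrees of freedom use the Welch-Satterthwaite approximation, which is an assumption here because the source's degrees-of-freedom expression is incomplete.

```python
import math


def two_sample_t(y1, y2):
    """Unequal-variance t-statistic for one gene.

    y1 : expression levels under condition 1 (length m >= 2)
    y2 : expression levels under condition 2 (length n >= 2)
    Returns (z, df), where df is the Welch-Satterthwaite approximation
    (an assumption, not taken from the source text).
    """
    m, n = len(y1), len(y2)
    mean1 = sum(y1) / m
    mean2 = sum(y2) / n
    # Unbiased sample variances.
    var1 = sum((y - mean1) ** 2 for y in y1) / (m - 1)
    var2 = sum((y - mean2) ** 2 for y in y2) / (n - 1)
    se2 = var1 / m + var2 / n
    z = (mean1 - mean2) / math.sqrt(se2)
    df = se2 ** 2 / ((var1 / m) ** 2 / (m - 1) + (var2 / n) ** 2 / (n - 1))
    return z, df


# Example with three replicates per condition.
z_stat, dof = two_sample_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

With only a handful of replicates per condition the degrees of freedom are small, which illustrates why the t-test has little power in typical microarray designs.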

PAGE 23

This test statistic compares well with that of the t-test. In the case of the t-test the test statistic is
$$Z_i = \frac{\bar{Y}_{i(1)} - \bar{Y}_{i(2)}}{\sqrt{s^2_{i(1)}/m + s^2_{i(2)}/n}}$$
where $\bar{Y}_{i(1)}$, $\bar{Y}_{i(2)}$, $s^2_{i(1)}$ and $s^2_{i(2)}$ are defined as in (2.3.3) and (2.3.4). Note that the two tests are the same as $m, n \to \infty$; however, in microarray data analysis both $m$ and $n$ are small, which makes the t-test better because of the unbiasedness of its variance estimator.
Note that the strong parametric assumptions that need to be made to use both the t-test and the regression approach are often violated in microarray data analysis. Therefore, the Significance Analysis of Microarrays (SAM) is an important method developed for microarray data analysis that seeks to overcome these strong parametric assumptions.
2.3.4 Significance Analysis of Microarrays (SAM)

PAGE 24

$$\cdots\;\frac{\cdots}{m+n-2}\left\{\sum_{k=1}^{m}(Y_{ik}-\bar{Y}_{i(1)})^2+\sum_{k=m+1}^{m+n}(Y_{ik}-\bar{Y}_{i(2)})^2\right\} \qquad (2.3.9)$$
where $m$ and $n$ are the numbers of measurements in conditions 1 and 2 respectively.
In order to compare values of $d(i)$ across all genes, the distribution of $d(i)$ should be independent of the level of gene expression. At low expression levels, variance in $d(i)$ can be high because of small values of $s(i)$. To ensure that the variance of $d(i)$ is independent of the gene expression, a small positive constant $s_0$ (exchangeability factor) was added to the denominator of equation (2.3.8). The coefficient of variation of $d(i)$ was computed as a function of $s(i)$ in moving windows across the data. The value for $s_0$ was chosen to minimize the coefficient of variation.
To minimize the effects of potential confounders between the conditions, the data were analyzed by taking $B$ sets of permutations. For each permutation $b$ the statistics $d^b_i$ and the corresponding order statistics $d^b_{(1)} \le d^b_{(2)} \le \dots \le d^b_{(N)}$ were computed. The expected relative difference is
$$\bar{d}_{(i)} = \frac{\sum_{b} d^b_{(i)}}{B}.$$
To identify potentially significant changes in expression levels, they used a scatter plot of the observed relative difference $d_{(i)}$ versus the expected relative difference $\bar{d}_{(i)}$. For a fixed threshold $\Delta$, starting at the origin and moving up to the right, find the first $i = i_1$ such that $d_{(i)} - \bar{d}_{(i)} > \Delta$. All genes past $i_1$ are called "significant positive". Similarly, start at the origin, move down to the left and find the first $i = i_2$ such that $\bar{d}_{(i)} - d_{(i)} > \Delta$. All genes past $i_2$ are called "significant negative". For each $\Delta$ the upper cutoff point $\mathrm{cut}_{up}(\Delta)$ was defined as the smallest $d_{(i)}$ among the significant positive genes, and the lower cutoff point $\mathrm{cut}_{low}(\Delta)$ was defined similarly.
To determine the number of falsely significant genes generated by SAM, the total number of falsely significant genes corresponding to each permutation was computed by counting the number of genes that exceeded the cutoffs $\mathrm{cut}_{up}(\Delta)$ and $\mathrm{cut}_{low}(\Delta)$. The estimated number of falsely significant genes was the median (or 90th percentile) of the number of genes called significant from the $B$ sets of permutations. Such genes are called
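The relative difference used by SAM can be sketched for a single gene as follows. This is a minimal sketch under stated assumptions: the function name `sam_relative_difference` is hypothetical, and the numerator (difference of group means) and the factor (1/m + 1/n) in the pooled scale follow the published SAM definition of Tusher et al. rather than anything visible in this excerpt; choosing `s0` by minimizing the coefficient of variation is not shown.

```python
import math


def sam_relative_difference(y1, y2, s0=0.0):
    """SAM-style relative difference d(i) for one gene.

    y1, y2 : expression levels under conditions 1 and 2 (lengths m and n)
    s0     : small positive constant added to the denominator
             (0.0 reproduces the pooled-variance t-type statistic)
    """
    m, n = len(y1), len(y2)
    mean1 = sum(y1) / m
    mean2 = sum(y2) / n
    # Within-group sums of squares, each around its own group mean.
    ss = sum((y - mean1) ** 2 for y in y1) + sum((y - mean2) ** 2 for y in y2)
    # Pooled gene-specific scatter s(i).
    s = math.sqrt((1.0 / m + 1.0 / n) / (m + n - 2) * ss)
    return (mean1 - mean2) / (s + s0)


# The constant s0 shrinks the statistic, most strongly when s(i) is small.
d_plain = sam_relative_difference([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
d_shrunk = sam_relative_difference([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], s0=0.5)
```

In a full analysis this function would be applied to every gene for the observed labels and again for each of the B permuted labelings, after which the sorted values are averaged across permutations to obtain the expected relative differences.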

PAGE 25

where $\pi_0$ is the true proportion of equivalently expressed (EE) genes in the data set and TP is the number of total (true) positives discovered from the test statistic, that is, TP is the total number of genes claimed to be differentially expressed (DE).
2.3.5 Mixture Model Method (MMM)
Consider the situation where, for each gene $i$, $i = 1, 2, \dots, N$, we have expression levels $Y_{i(1)} = (Y_{i1}, \dots, Y_{im})$ from $m$ microarrays under condition 1, and $Y_{i(2)} = (Y_{i,m+1}, \dots, Y_{i,m+n})$ from $n$ arrays under condition 2. Here we need to assume that both $m$ and $n$ are even integers; the reason will become obvious later.
The goal is to identify genes such that $(Y_{i1}, \dots, Y_{im})$ and $(Y_{i,m+1}, \dots, Y_{i,m+n})$ have different means. This appears to be a two-sample comparison; however, microarray data have small $m$ and $n$ with a large $N$, which renders traditional statistical tests such as the t-test or rank-based nonparametric tests ineffective. One alternative is to draw statistical inference based on the distributions of quantities related to $(Y_{i1}, \dots, Y_{im})$ or $(Y_{i,m+1}, \dots, Y_{i,m+n})$, for $1 \le i \le N$, to take advantage of the large population size $N$.
The model assumes a nonparametric approach for gene expression data:
$$Y_{i(1)} = \mu_{i(1)} + \varepsilon_{i(1)}, \qquad Y_{i(2)} = \mu_{i(2)} + \varepsilon_{i(2)}$$

PAGE 26

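which does not depend on $\mu_{i(1)}$ and $\mu_{i(2)}$ since its mean is 0. Furthermore, suppose that
$$Z_i = \frac{\sum_{k=1}^{m} Y_{ik}/m - \sum_{k=m+1}^{m+n} Y_{ik}/n}{\sqrt{s^2_{i(1)}/m + s^2_{i(2)}/n}}.$$
The hypothesis is of the form
$$H_0: f_0 = f_1, \ \text{there is no gene with altered expression}; \qquad H_1: f_0 \ne f_1, \ \text{otherwise} \qquad (2.3.13)$$
and is valid only if the random errors are independent and their distribution is symmetric about 0. Since $m, n > 1$, we can estimate $s^2_{i(1)}$ and $s^2_{i(2)}$ using the sample variances
$$s^2_{i(1)} = \frac{\sum_{k=1}^{m}(Y_{ik} - \bar{Y}_{i(1)})^2}{m-1} \cdots$$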

PAGE 27

To test the null hypothesis $H_0$ that $Z$ is from $f_0$ (which is equivalent to testing the hypothesis (2.3.13)), we can construct a likelihood ratio test (LRT) based on the following statistic:
$$LR(Z) = \frac{f_0(Z)}{f_1(Z)}.$$
A large value of $LR(Z)$ gives no evidence against $H_0$, whereas a too small value of $LR(Z)$ leads to rejecting $H_0$. With the normal mixture model, it is possible to numerically determine the rejection region. For any given false positive rate $\alpha$, we can use the bisection method [29] to solve
$$\alpha = \int_{LR(z) < c} f_0(z)\,dz \cdots$$
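The numerical search for the rejection region can be sketched as follows. This is an illustration under stated assumptions rather than the method as implemented in the dissertation: the densities `null_density` (standard normal) and `alt_density` (a two-component normal mixture) are hypothetical stand-ins for fitted models, the integral is approximated by a midpoint sum on a fixed grid, and the bisection searches for the cutoff c at which the null probability of {LR(z) < c} equals the chosen false positive rate.

```python
import math


def normal_pdf(z, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def null_density(z):
    # Hypothetical null density f0 (assumption for illustration).
    return normal_pdf(z)


def alt_density(z):
    # Hypothetical fitted density of Z: a two-component normal mixture.
    return 0.8 * normal_pdf(z) + 0.2 * normal_pdf(z, mu=3.0)


def rejection_cutoff(f0, f1, alpha, lo=-10.0, hi=10.0, steps=4000, iters=50):
    """Bisection for c such that the f0-probability of {LR(z) < c} is about alpha."""
    dz = (hi - lo) / steps
    grid = [lo + (k + 0.5) * dz for k in range(steps)]
    mass = [f0(z) * dz for z in grid]          # midpoint-rule null probabilities
    ratio = [f0(z) / f1(z) for z in grid]      # likelihood ratio on the grid

    def size(c):
        # Approximate null probability of the rejection region {LR < c}.
        return sum(w for w, r in zip(mass, ratio) if r < c)

    c_lo, c_hi = 0.0, max(ratio)
    for _ in range(iters):
        c_mid = 0.5 * (c_lo + c_hi)
        if size(c_mid) < alpha:
            c_lo = c_mid
        else:
            c_hi = c_mid
    return c_hi, size(c_hi)


cutoff, achieved = rejection_cutoff(null_density, alt_density, alpha=0.05)
```

Because the grid is finite, the achieved size only matches the target up to the probability mass of one grid cell; a finer grid tightens the match at the cost of more evaluations.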
PAGE 28

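The mixture model is a $g$-component mixture of beta distributions $\beta(r_j, s_j)$ for $j = 1, \dots, g$ with the parameters $r_j$ and $s_j$, where the beta distribution is defined as follows
$$\beta(y \mid r, s) = \frac{\Gamma(r+s)}{\Gamma(r)\,\Gamma(s)}\, y^{r-1}(1-y)^{s-1} \cdots$$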
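The beta density and a finite mixture of beta components can be evaluated as in the sketch below. The function names and the example weights are hypothetical; a Beta(1, 1) component is used to represent the uniform distribution that p-values follow under the null hypothesis.

```python
import math


def beta_pdf(y, r, s):
    """Beta(r, s) density at y, evaluated on the log scale for stability."""
    if not 0.0 < y < 1.0:
        return 0.0
    log_norm = math.lgamma(r + s) - math.lgamma(r) - math.lgamma(s)
    return math.exp(log_norm + (r - 1.0) * math.log(y) + (s - 1.0) * math.log(1.0 - y))


def beta_mixture_pdf(y, weights, params):
    """Density of a g-component beta mixture.

    weights : mixing proportions (non-negative, summing to one)
    params  : list of (r_j, s_j) pairs, one per component
    """
    return sum(w * beta_pdf(y, r, s) for w, (r, s) in zip(weights, params))


# Hypothetical uniform-plus-beta mixture: 80% null genes, 20% from Beta(2, 2).
density_at_half = beta_mixture_pdf(0.5, [0.8, 0.2], [(1.0, 1.0), (2.0, 2.0)])
```

In applications the non-uniform component would typically concentrate near zero, reflecting small p-values for differentially expressed genes; Beta(2, 2) is used here only because its density is easy to check by hand.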

PAGE 29

The estimate of the number of genes for which there is a difference in gene expression is evaluated as $N(1 - \hat{p}_1)$, where $\hat{p}_1$ is the MLE of $p_1$. Let $T$ be some threshold below which the results for particular genes are declared "interesting" and worthy of follow-up study; the proportion of those genes that are likely to be genes for which there is a real difference is
$$P(\bar{H}_{0,i} \mid y_i \le T) = 1 - P(H_{0,i} \mid y_i \le T) = \cdots$$

PAGE 30

The Significance Analysis of Microarrays (SAM) and the Mixture Model Method (MMM) presented in this chapter use a t-type statistic to determine the number of differentially expressed genes. However, the MMM has one advantage in that it is a nonparametric approach. The MMM determines the distributions under the null and alternative and then uses these distributions to determine the number of differentially expressed genes by means of a likelihood ratio test.
The p-value approach of Allison relies on parametric assumptions that are made to determine the p-values. If the p-values are not valid, then their distribution under the null hypothesis may not be uniform on the interval [0, 1]. In discussing the modified p-value approach presented in Chapter 6, we are aware that the t-test used to determine the p-values must be valid for the modified p-value approach to be valid. However, for this dissertation we assume all the assumptions are satisfied with respect to the modified p-value approach.
In addition to the methods used to analyze microarray data, we presented the biological background that the reader needs in order to fully understand the challenges statisticians have in the analysis of microarray data.

PAGE 31

K. Pearson (1894) was the first to study a mixture of two normal distributions, where he modeled the mixing of different crab species. Mixture models have become popular because: (1) they provide a simple mechanism to incorporate extra variation and correlation in the model, (2) they add model flexibility and (3) they are a natural approach for modeling data that arise in multiple stages or when populations are composed of subpopulations. In addition, the theory, applications, history and importance of mixture models have been discussed in journal articles, monographs and textbooks. Everitt and Hand (1981), Titterington, Smith and Makov (1985), Bohning (1999), and McLachlan and Peel (2000) provided models, statistical methods and references for finite mixture problems.

PAGE 32

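where $f_{i1}(y_i \mid \theta_1), \dots, f_{ig}(y_i \mid \theta_g)$ are $g$ density functions and $\pi_1, \dots, \pi_g$ are called mixing proportions, satisfying the following properties: $0 \le \pi_j \le 1$ and $\sum_{j=1}^{g} \pi_j = 1$. The densities $f_{ij}(y)$ for $j = 1, \dots, g$ may be continuous or discrete, or a combination of both.
From Definition 3.1.1 we observe that a finite mixture distribution is the marginal distribution of a random variable which follows different distributions in different sub-populations of a general population. Therefore, if a population $S$ is defined as
$$S = \{S_1, S_2, \dots, S_g\}, \quad\text{such that}\quad S_j \cap S_k = \emptyset,\ j \ne k.$$
Furthermore, let $X$ represent the statistic in each sub-population, i.e.,
$$X = \begin{cases} x_1, & \text{if in } S_1, \\ x_2, & \text{if in } S_2, \\ \ \vdots & \ \vdots \\ x_g, & \text{if in } S_g. \end{cases}$$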

PAGE 34

This implies that the mean and variance of Examples (3.1.2), (3.1.3) and (3.1.4) are given as follows. For Example (3.1.2) we have
$$E(Y) = E(E(Y \mid X)) = E(X) = \sum_{j=1}^{g} \pi_j \lambda_j$$

PAGE 36

The summary statistics are given in Table 3.1. For these data we have no evidence that the data are not normally distributed, because the Kolmogorov-Smirnov tests for normality resulted in a p-value > 0.5 for both groups. The Q-Q plots are displayed in Figures 3.2 and 3.3. Additionally, the assumption of equal variance is satisfied because the p-value for the F-test is 0.148.

Table 3.1: Summary statistics of data.

Species      Sepal Mean   Sepal Std. Dev.
Versicolor   5.94         0.516
Virginica    6.59         0.636

and represented graphically in Figure 3.4. However, when a two-component mixture of normals with equal variance was fitted to the data, the following fitted distribution was obtained (Figure 3.5):
$$0.83\,N(6.08, 0.526^2) + 0.17\,N(7.13, 0.526^2)$$
Figure 3.6 shows the comparison of the fitted mixture model with equal variance and

PAGE 38

Figure 3.2: Q-Q plots of Sepal lengths for versicolor flowers
Figure 3.3: Q-Q plots of Sepal lengths for virginica flowers

PAGE 39

Dotted lines to the left and right represent the known distributions of versicolor and virginica respectively. The known mixture structure is $0.5\,N(5.94, 0.516^2) + 0.5\,N(6.59, 0.636^2)$.

PAGE 40

In reality the estimated mixture distribution obtained for the illustrative example may be symmetric. The distribution may be bimodal or multimodal in the case where we have more than two components.
Figure 3.5: Histogram and estimated mixture distribution
Dotted lines to the left and right represent the fitted distributions of versicolor and virginica respectively. The fitted model is given by $0.83\,N(6.08, 0.526^2) + 0.17\,N(7.13, 0.526^2)$.

PAGE 41

Dotted line represents the fitted mixture model while the bold line is the known mixture structure.
1. They are symmetric as well as skewed.
2. Unimodal as well as multimodal.

PAGE 43

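additionally, the observed log-likelihood is given by:
$$l(\Psi \mid y) = \sum_{i=1}^{n} \ln\Big(\sum_{j=1}^{g} \pi_j f_{ij}(y_i \mid \theta_j)\Big). \qquad (3.2.3)$$
We now need to maximize the log-likelihood $l(\Psi \mid y)$ with respect to $\Psi$. This is done by using the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) as an alternative to the Newton-Raphson method, which involves the calculation of first and second derivatives of $l(\Psi \mid y)$. The EM algorithm was developed for missing observations; in our case we consider the component membership as missing. This can be seen if we define indicators $Z_{ij}$, $i = 1, \dots, n$, $j = 1, \dots, g$, such that
$$Z_{ij} = \begin{cases} 1 & \text{if observation } i \text{ belongs to component } j, \\ 0 & \text{otherwise.} \end{cases}$$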

PAGE 44

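and the log-likelihood of the complete data is
$$l(\Psi \mid y, z) = \sum_{i=1}^{n}\sum_{j=1}^{g} z_{ij}\,[\ln \pi_j + \ln f_{ij}(y_i \mid \theta_j)]. \qquad (3.2.5)$$
It is therefore obvious that maximizing $l(\Psi \mid y, z)$ ("the complete log-likelihood") is easier than maximizing $l(\Psi \mid y)$ ("the observed log-likelihood"). Note that (3.2.2) and (3.2.3) are referred to as the observed-data likelihood and observed log-likelihood respectively, while (3.2.4) and (3.2.5) are referred to as the complete-data likelihood and complete log-likelihood respectively. Instead of maximizing $l(\Psi \mid y, z)$ we maximize $E(l(\Psi \mid y, z) \mid y)$, which is interpreted intuitively as replacing the missing observations $z_{ij}$ by their expected values.
The EM algorithm acts iteratively, in the sense that, starting from a "first guess estimate" (starting value) $\Psi^{(0)}$ for $\Psi$, a series of estimates $\Psi^{(t)}$ is constructed, which converges to the MLE $\hat{\Psi}$ of $\Psi$:
$$\Psi^{(0)} \to \Psi^{(1)} \to \dots \to \Psi^{(t)} \to \Psi^{(t+1)} \to \dots \to \Psi^{(\infty)} = \hat{\Psi}$$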

PAGE 45

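Note the E-step requires only the calculation of
$$E[Z_{ij} \mid y, \Psi^{(t)}] = P(Z_{ij} = 1 \mid y_i, \Psi^{(t)}) = \frac{f_i(y_i \mid Z_{ij} = 1)\,P(Z_{ij} = 1)}{\sum_{k=1}^{g} f_i(y_i \mid Z_{ik} = 1)\,P(Z_{ik} = 1)} = \tau_{ij}(\Psi^{(t)})$$
where the $\tau_{ij}(\Psi^{(t)})$ are called the posterior probabilities and the $\pi_j$ are called the prior probabilities. Note the E-step reduces to calculating all the posterior probabilities $\tau_{ij}(\Psi^{(t)})$ for $i = 1, \dots, n$, $j = 1, \dots, g$.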

PAGE 46

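we first maximize with respect to $\pi_j$. This requires maximization of
$$\sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij}(\Psi^{(t)}) \ln \pi_j = \sum_{i=1}^{n}\sum_{j=1}^{g-1} \tau_{ij}(\Psi^{(t)}) \ln \pi_j + \sum_{i=1}^{n} \tau_{ig}(\Psi^{(t)}) \ln\Big[1 - \sum_{j=1}^{g-1} \pi_j\Big].$$
Setting
$$\frac{\partial}{\partial \pi_j}\Big\{\sum_{i=1}^{n}\sum_{j=1}^{g-1} \tau_{ij}(\Psi^{(t)}) \ln \pi_j + \sum_{i=1}^{n} \tau_{ig}(\Psi^{(t)}) \ln\Big[1 - \sum_{j=1}^{g-1} \pi_j\Big]\Big\} = 0$$
we have that
$$\frac{\sum_{i=1}^{n} \tau_{ij}(\Psi^{(t)})}{\pi_j} = \frac{\sum_{i=1}^{n} \tau_{ig}(\Psi^{(t)})}{\pi_g}.$$
Note that
$$1 = \sum_{j=1}^{g} \pi_j^{(t+1)} = \sum_{j=1}^{g} \pi_g^{(t+1)}\,\frac{\sum_{i=1}^{n} \tau_{ij}(\Psi^{(t)})}{\sum_{i=1}^{n} \tau_{ig}(\Psi^{(t)})}$$
and since $\sum_{j=1}^{g} \tau_{ij}(\Psi^{(t)}) = 1$, therefore
$$1 = \pi_g^{(t+1)}\,\frac{n}{\sum_{i=1}^{n} \tau_{ig}(\Psi^{(t)})},$$
hence $\pi_g^{(t+1)}$ is given by
$$\pi_g^{(t+1)} = \frac{\sum_{i=1}^{n} \tau_{ig}(\Psi^{(t)})}{n}.$$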

PAGE 47

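Note that the updated mixture component probabilities are the average posterior probabilities. The M-step also requires the maximization of
$$\sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij}(\Psi^{(t)}) \ln f_{ij}(y_i \mid \theta_j) \qquad (3.2.8)$$
with respect to $\theta$. This maximization step is often non-trivial. In such cases, the EM algorithm is doubly iterative. Below are some examples when the M-step is trivial (c.f. [40]).
Example 3.2.3
From (3.2.8), and for simplicity we let $\tau_{ij}(\Psi^{(t)}) = \tau_{ij}$, then we have
$$\sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij} \ln f_{ij}(y_i \mid \lambda_j) = \sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij} \ln \frac{e^{-\lambda_j}\lambda_j^{y_i}}{y_i!}$$
therefore
$$\frac{\partial}{\partial \lambda_j}\Big\{\sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij}(-\lambda_j + y_i \ln \lambda_j)\Big\} = 0,\ \forall j \iff \lambda_j = \frac{\sum_{i=1}^{n} \tau_{ij}\,y_i}{\sum_{i=1}^{n} \tau_{ij}}.$$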
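One EM iteration for a Poisson mixture, where the M-step has the closed form derived above, might look like the following sketch. The function names, the starting values and the toy counts are hypothetical; the E-step computes the posterior probabilities by Bayes' rule, and the M-step updates the mixing proportions as average posterior probabilities and each rate as a posterior-weighted mean.

```python
import math


def poisson_pmf(y, lam):
    # Evaluated on the log scale: exp(-lam + y*log(lam) - log(y!)).
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def em_step_poisson(y, weights, lams):
    """One EM iteration for a g-component Poisson mixture."""
    n, g = len(y), len(weights)
    # E-step: posterior probability that observation i belongs to component j.
    post = []
    for yi in y:
        joint = [weights[j] * poisson_pmf(yi, lams[j]) for j in range(g)]
        total = sum(joint)
        post.append([p / total for p in joint])
    # M-step: closed-form updates.
    col = [sum(post[i][j] for i in range(n)) for j in range(g)]
    new_weights = [c / n for c in col]
    new_lams = [sum(post[i][j] * y[i] for i in range(n)) / col[j] for j in range(g)]
    return new_weights, new_lams


counts = [0, 1, 1, 2, 8, 9, 10, 12]           # toy data, two apparent groups
w_new, lam_new = em_step_poisson(counts, [0.5, 0.5], [1.0, 9.0])
```

A convenient sanity check on the updates is that the mixture mean implied by the new parameters equals the sample mean of the data after every iteration.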

PAGE 48

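$$\cdots\;\frac{1}{2\sigma^2}(y_i - \mu_j)^2 \;\propto\; \sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij}\Big[-\frac{\ln(\sigma^2)}{2} - \frac{(y_i - \mu_j)^2}{2\sigma^2}\Big]$$
$$\frac{\partial}{\partial \mu_j}\Big\{\sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij}\Big[\frac{\ln(\sigma^2)}{2} + \frac{(y_i - \mu_j)^2}{2\sigma^2}\Big]\Big\} = 0,\ \forall j \iff \mu_j = \frac{\sum_{i=1}^{n} \tau_{ij}\,y_i}{\sum_{i=1}^{n} \tau_{ij}}$$
Also
$$\frac{\partial}{\partial \sigma^2}\Big\{\sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij}\Big[\frac{\ln(\sigma^2)}{2} + \frac{(y_i - \mu_j)^2}{2\sigma^2}\Big]\Big\} = 0,\ \forall j$$
$$\iff \sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij}\Big[\frac{1}{\sigma^2} - \frac{(y_i - \mu_j)^2}{\sigma^4}\Big] = 0 \iff \sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij} = \sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij}\,\frac{(y_i - \mu_j)^2}{\sigma^2}$$
$$\iff \sigma^2 = \frac{\sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij}\,(y_i - \mu_j)^2}{\sum_{i=1}^{n}\sum_{j=1}^{g} \tau_{ij}}$$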
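The closed-form updates for a normal mixture with a common variance can be combined into a complete EM loop, sketched below. The function names, starting values and toy data are hypothetical, and no convergence test or safeguard against degenerate components is included; the point is only to show the E-step and the three M-step updates working together.

```python
import math


def normal_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(y, weights, means, var):
    """Observed-data log-likelihood of a common-variance normal mixture."""
    return sum(
        math.log(sum(w * normal_pdf(yi, mu, var) for w, mu in zip(weights, means)))
        for yi in y
    )


def em_step_normal(y, weights, means, var):
    """One EM iteration for a g-component normal mixture with common variance."""
    n, g = len(y), len(weights)
    # E-step: posterior membership probabilities.
    post = []
    for yi in y:
        joint = [weights[j] * normal_pdf(yi, means[j], var) for j in range(g)]
        total = sum(joint)
        post.append([p / total for p in joint])
    # M-step: average posteriors, weighted means, pooled weighted variance.
    col = [sum(post[i][j] for i in range(n)) for j in range(g)]
    new_weights = [c / n for c in col]
    new_means = [sum(post[i][j] * y[i] for i in range(n)) / col[j] for j in range(g)]
    new_var = sum(
        post[i][j] * (y[i] - new_means[j]) ** 2 for i in range(n) for j in range(g)
    ) / n
    return new_weights, new_means, new_var


data = [3.1, 3.9, 4.2, 4.8, 5.0, 7.1, 7.9, 8.3, 8.8, 9.0]   # toy data
params = ([0.5, 0.5], [4.0, 8.0], 1.0)                        # starting values
trace = [log_likelihood(data, *params)]
for _ in range(25):
    params = em_step_normal(data, *params)
    trace.append(log_likelihood(data, *params))
```

The recorded log-likelihood values should never decrease from one iteration to the next, which is the defining monotonicity property of EM and a useful check on any implementation.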

PAGE 49

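Similar to Example (3.2.4), we can show that the mean is given by
$$\mu_j = \frac{\sum_{i=1}^{n} \tau_{ij}\,y_i}{\sum_{i=1}^{n} \tau_{ij}}$$
$$\sum_{i=1}^{n} \ln\Big\{\sum_{j=1}^{g} \pi_j \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\Big[-\frac{1}{2\sigma_j^2}(y_i - \mu_j)^2\Big]\Big\}$$
$$= \sum_{i=2}^{n} \ln\Big\{\sum_{j=2}^{g} \pi_j \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\Big[-\frac{1}{2\sigma_j^2}(y_i - \mu_j)^2\Big] + \pi_1 \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\Big[-\frac{1}{2\sigma_1^2}(y_i - \mu_1)^2\Big]\Big\}$$
$$\quad + \ln\Big\{\sum_{j=2}^{g} \pi_j \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\Big[-\frac{1}{2\sigma_j^2}(y_1 - \mu_j)^2\Big] + \pi_1 \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\Big[-\frac{1}{2\sigma_1^2}(y_1 - \mu_1)^2\Big]\Big\} \cdots$$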

PAGE 50

There are several factors affecting the convergence of the EM algorithm to the maximum likelihood estimates. These factors are:
1. the initial estimates, which can affect the convergence greatly, and
2. the presence of statistical outliers, defined to be those observations that are substantially different from the distributions of the mixture components.
The EM algorithm assigns each observation to one of the components with the sample's posterior probability as its weight. Although an outlying sample is inconsistent with the distributions of all the defined components, it may still have a large posterior probability for one or more of the components. Therefore the iteration may converge to erroneous solutions.
A common approach to eliminating the influence of outliers in the EM algorithm is to apply a chi-square threshold test. This test eliminates observations with distances greater than some threshold value. These observations are considered to be outliers and are subsequently excluded from updating the parameter estimates. This chi-square threshold $\chi^2$ for a given probability is defined as the square distance between the sample $y \in \Re$

PAGE 51

It should be noted that the EM algorithm first estimates the posterior probabilities of each sample belonging to each of the component distributions, and then computes the parameter estimates using these posterior probabilities as weights. With this method, each sample is assumed to come from all components. Robust estimation attempts to circumvent this problem by including the typicality of the sample with respect to the component densities when updating the estimates in the EM algorithm.

A measure of typicality is incorporated in the parameter estimation of the mixture density if we assume that each component density $f_j(y_i\mid\mu_j,\sigma^2)$ is a member of the family of symmetric densities with mean $\mu_j$ and variance $\sigma^2$, i.e. 221=2fsfj(xjj;2)g;

PAGE 52

$$\begin{cases}\tfrac{1}{2}s^2, & 0\le s\le k\\ ks-\tfrac{1}{2}k^2, & s>k.\end{cases}$$

PAGE 53

The dotted and bold lines represent the heteroscedastic and homoscedastic models respectively.

PAGE 54

The dotted and bold lines represent the heteroscedastic and homoscedastic models respectively.

and then fit the simulated data with equal and unequal variances. The model with unequal variances seems to be a better fit when the data simulated with unequal variances was fitted with unequal variances, as opposed to when it was fitted using equal variance (Figure 3.8). For the results for the data that was simulated using equal variance, see Figure 3.9.

This example shows that we attain a better fit to our data if the data is heteroscedastic; hence fitting a heteroscedastic mixture model is vital. Ciuperca et al. considered mixture

PAGE 55

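where
$$f_{ij}(y_i\mid\theta_j)=\frac{1}{\sqrt{2\pi\sigma_j^2}}\exp\Big\{-\frac{(y_i-\mu_j)^2}{2\sigma_j^2}\Big\},\qquad j=1,\ldots,g,$$
such that $0\le\pi_j\le1$, $\sum_{j=1}^{g}\pi_j=1$, $\sigma_j>0$, and the true parameters are defined as $\theta_0\in\Theta$.

In their analysis the maximum likelihood (ML) framework was used to estimate the parameters of the mixture, with likelihood function given by
$$\tilde L(\theta\mid y)=f_n(Y_1,\ldots,Y_n\mid\theta)=\prod_{i=1}^{n}f_1(Y_i\mid\theta).\qquad(3.2.13)$$
The likelihood function (3.2.13) is unbounded on $\Theta$: if one of the variance parameters in the denominator of (3.2.13) approaches 0 as $\mu_j$ approaches $y_i$ (c.f. Example 3.2.5), then the likelihood is unbounded.

They circumvented this problem by considering a penalized likelihood function defined as
$$L_n(\theta\mid y)=f_n(Y_1,\ldots,Y_n\mid\theta)\prod_{j=1}^{g}h(\sigma_j)\qquad(3.2.14)$$
where the penalty function $h$ was chosen so that $L_n$ is bounded over the parameter space. The penalty function was assumed to satisfy the following conditions:

(1) $\lim_{\sigma_j\to0}$ …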

PAGE 56

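The consistency of the estimator was also a concern. In order to prove consistency they required that $h$ also satisfy the following conditions:

(2) $h(\sigma)$ is many-to-one from $(0,\infty)$ onto $(0,G]$, $G>0$;
(3) $h$ is strictly increasing in an open interval $(0,\epsilon)$ of the origin which has a non-null measure;
(4) $h$ is continuously differentiable on $(0,\infty)$.

3.3 Estimating the Number of Components g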

PAGE 57

from a univariate normal mixture distribution for the hypothesis $H_0$ versus $H_1$; see Everitt et al. (1981). Without loss of generality we assume that the distribution under the null hypothesis $H_0$ is normal, that is, the number of components is $g=k=1$, and that the distribution under the alternative is a two-component mixture of normal distributions, that is, $g=k=2$. Note that the distribution of $\ln L(k+1)-\ln L(k)$ …

PAGE 58

McLachlan further stated that Wolfe's approximation to the null distribution of $2(\ln L(k+1)-\ln L(k))$ was not applicable in the heteroscedastic case (i.e. where the component variances were unequal). McLachlan evaluated the empirical distribution function of $2(\ln L(k+1)-\ln L(k))$ by constructing 500 replicates with a sample size of $n=100$ generated under $H_0$, using normal component densities having unequal variances under $H_1$. When Wolfe's approximation was applied, the resulting chi-squared distribution was $\chi^2_4$; however, the $\chi^2_6$ distribution function was found to provide a much better fit. Furthermore, the simulated null distribution of $2(\ln L(k+1)-\ln L(k))$ had mean and variance equal to 5.96 and 13.86 respectively, which further supports the conclusion that the $\chi^2_6$ distribution function characterizes the empirical null distribution.

In the case of heteroscedasticity the regression approach of Thode et al. (1988) is more appropriate for remedying the aforementioned situation of unequal variances. The approach is to fit a regression model as a function of the sample size $n$, using different sample sizes, which results in regressed degrees of freedom of the form $f=\beta_0+\beta_1\ldots$

PAGE 59

The regression technique of Thode et al. (1988) was presented to determine the degrees of freedom of the asymptotic distribution of the likelihood ratio test. Thode et al. found the empirical null distribution of the likelihood ratio test for the sample sizes 15, 20, 25, 40, 50, 75, 80, 100, 150, 250, 500 and 1,000. However, their approach did not account for skewness, which was addressed by MacLean et al. (1976). For each sample size, percentile points and moments were evaluated using 2,500 normal samples. Thode et al. also used an iterative procedure to determine the maximum likelihood estimates of the normal mixture distribution. They applied the random starting point method of Thode, Finch and Mendell (1987), using five random starting points so that the global maximum, rather than a local maximum, of the likelihood is achieved for the parameters of the normal mixture model.

Thode et al. mentioned that since the regularity conditions do not hold in the case of a mixture of normal distributions, the asymptotic distribution is not chi-squared with 2 degrees of freedom. They therefore found the means and variances for the sample sizes 15, 20, 25, 40, 50, 75, 80, 100, 150, 250, 500 and 1,000. Note that the mean of a chi-squared random variable is equal to its degrees of freedom, and the variance is twice the degrees of freedom. They also estimated the asymptotic distribution of the likelihood ratio test by regressing the mean, variance and simulated percentiles of the LRT against various functions of the sample size $n$. Thode et al. further divided the 2,500 samples generated for each of the sample sizes into 5 subsamples of size 500 each, applied the goodness-of-fit test described in Draper and Smith (1981), and considered a regression model as a function of $(1/n)^t$ for $t=0.125, 0.25, 0.50, 1, 2$ and $3$. The regression model is
$$E(Y_{PNs})=a_{P,t}+b_{P,t}(1/n)^t,\qquad(3.3.17)$$
where $Y_{PNs}$ is the $P$th percentile of the $s$th subsample of size $n$. From model (3.3.17) they fitted regression models for $t=0.125, 0.25, 0.50, 1, 2$ and $3$ and found that the intercepts estimated for the various powers $t$ were essentially the same, indicating the convergence of the asymptotic distribution. However, Thode et al. concluded that the …

PAGE 60

In the next section we describe the approach of Chen et al. that was used to determine the exact null distribution of the likelihood ratio statistic in the case where there is equal variance in each component of a mixture of normal distributions. For our purposes this approach was modified to account for differences in the variances of the components. It should be stated that the method of Chen cannot be applied directly to the problem of heteroscedasticity, that is, to the case where the variances differ between components, which is the case used in this dissertation. Therefore, the asymptotic distribution of the penalized modified likelihood method used in this dissertation will be estimated using the regression model of Thode et al. (1988). The theoretical distribution of the penalized modified likelihood ratio statistic in the case of unequal variances for each component is an open problem which I hope to solve in the near future. The next section describes the method of Chen, which is the method we modified in this dissertation to account for heteroscedasticity (unequal component variances).

3.3.2 Modified Likelihood Ratio Test for Homogeneity in Finite Mixture Models

Chen and Kalbfleisch (1996), Chen (1998) and Chen et al. (2001, 2002) suggest a modification of the likelihood by incorporating a penalty term that forces certain estimates away from the boundary of the parameter space. The likelihood ratio statistic based on the modified estimators is shown, in many instances, to yield relatively simpler

PAGE 61

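We consider a finite mixture distribution with probability density function as defined in (3.1.1), i.e.,
$$f(Y\mid\Psi)=\sum_{j=1}^{g}\pi_jf(y\mid\theta_j)$$
where $f(y\mid\theta_j)$ is a probability density function with parameter $\theta_j\in\Theta$. Let $\theta_1,\ldots,\theta_g\in\Theta$ be the support points of $f(y\mid\theta_j)$ and let $\pi_1,\ldots,\pi_g$ be the corresponding weights with $\pi_j\ge0$ and $\sum\pi_j=1$. If we consider $g=2$ then we have $\pi f(y\mid\theta_1)+(1-\pi)f(y\mid\theta_2)$ where $\pi\in[0,1]$ and $\theta_1\le\theta_2$. We wish to test the hypothesis
$$H_0:\theta_1=\theta_2\quad\text{versus}\quad H_1:\theta_1\ne\theta_2$$
using the modified log likelihood
$$l_n(\pi,\theta_1,\theta_2)=\tilde l_n(\Psi\mid y)+C\ln\{4\pi(1-\pi)\}\qquad(3.3.18)$$
where $C$ is a positive constant and
$$\tilde l_n(\Psi\mid y)=\sum_{i=1}^{n}\ln\Big(\sum_{j=1}^{2}\pi_jf(y_i\mid\theta_j)\Big)\qquad(3.3.19)$$
is the ordinary log likelihood. The purpose of the "penalty term" $C\ln\{4\pi(1-\pi)\}$ in (3.3.18) is to restore regularity to the problem by avoiding estimates of $\pi$ on or near the boundary. The modified likelihood ratio statistic is thus
$$R_n=2\{l_n(\hat\pi,\hat\theta_1,\hat\theta_2)-l_n(1/2,\hat\theta,\hat\theta)\}\qquad(3.3.20)$$
and the null distribution is given by
$$\tfrac{1}{2}\chi^2_0+\tfrac{1}{2}\chi^2_1.$$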
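The effect of the penalty term can be seen in a small sketch of the modified log likelihood for $g=2$. The kernel $f(y\mid\theta)$ is taken to be $N(\theta,1)$ purely for illustration, and the function names and the default `C=1.0` are assumptions, not values from the dissertation.

```python
import math


def kernel(y, theta):
    """Illustrative kernel f(y | theta): a normal density with mean theta and unit variance."""
    return math.exp(-0.5 * (y - theta) ** 2) / math.sqrt(2.0 * math.pi)


def ordinary_loglik(y, pi, theta1, theta2):
    """Ordinary log likelihood of the two-component mixture, as in (3.3.19)."""
    return sum(
        math.log(pi * kernel(yi, theta1) + (1.0 - pi) * kernel(yi, theta2)) for yi in y
    )


def modified_loglik(y, pi, theta1, theta2, C=1.0):
    """Ordinary log likelihood plus the penalty C * ln(4 * pi * (1 - pi)); needs 0 < pi < 1."""
    return ordinary_loglik(y, pi, theta1, theta2) + C * math.log(4.0 * pi * (1.0 - pi))
```

The penalty is zero at $\pi=1/2$ and tends to $-\infty$ as $\pi$ approaches 0 or 1. When $\theta_1=\theta_2$ the ordinary log likelihood does not depend on $\pi$, so the modified log likelihood is maximized at $\pi=1/2$, which is why the null fit in (3.3.20) is evaluated there.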

PAGE 62

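where $G(\theta)$ is a discrete cumulative distribution function (called the mixing distribution) with a finite number of support points. The class of all finite mixing distributions with $g$ support points is
$$M_g=\Big\{G(\theta)=\sum_{j=1}^{g}\pi_jI(\theta_j\le\theta):\ \theta_1,\ldots,\theta_g;\ \sum_{j=1}^{g}\pi_j=1,\ \pi_j\ge0\Big\}\qquad(3.3.22)$$
where $I(\cdot)$ is an indicator function and $g=1,2,\ldots$. The class of all finite mixing distributions is $M=\bigcup_{g\ge1}M_g$.

We consider the test with null hypothesis $g=1$ versus the alternative $g\ge2$; more precisely, we consider a test of the hypothesis $G\in M_1$ versus $G\in M_g$, $g\ge2$. Furthermore, let $\hat G_0$ and $\hat G_1$ denote the estimates under the null and alternative hypotheses respectively; the modified likelihood ratio statistic for testing $G\in M_1$ against $G\in M_g$, $g\ge2$, is then given by
$$R_n=2\{l_n(\hat G_1)-l_n(\hat G_0)\}.$$
The Theorems below summarize the above arguments.

Theorem 3.3.1 … $\tfrac{1}{2}\chi^2_0+\tfrac{1}{2}\chi^2_1$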

PAGE 63

2 221+

For the modified p-value approach the assumption that the mixing distribution is from the exponential family has been violated, since the beta distribution is not of the exponential family. For this reason, the asymptotic null distribution of the likelihood ratio statistic will be determined by simulation.

Note that the asymptotic null distribution is necessary so that we can carry out a hypothesis test to determine the number of components of the mixture model.

PAGE 64

where $\theta_{0j}$ $(j=1,2)$ are distinct interior points of $\Theta$ and $0<\pi_0<1$. All expectations and probabilities are with respect to this null distribution. We also assume that the distance between two mixing distributions $G$ and $Q$ is measured by the supremum distance, i.e.,
$$|G-Q|=\sup_{\theta}|G(\theta)-Q(\theta)|.$$
Condition 1

PAGE 65

In the next section we describe the Box-Cox transformation that is used to distinguish skewed from commingled distributions in mixture models. Note that in this dissertation we did not account for skewness, as was also the case for the regression method of Thode et al. (1988). However, it is important for the reader to be aware that a mixture of other distributions can be transformed toward a mixture of normal distributions if the need arises.

3.4 Box-Cox transformation

PAGE 66

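$$z=\begin{cases}\dfrac{r}{\lambda}\Big[\Big(\dfrac{y}{r}+1\Big)^{\lambda}-1\Big], & \text{when }\lambda\ne0\\[6pt] r\ln\Big(\dfrac{y}{r}+1\Big), & \text{when }\lambda=0\end{cases}\qquad(3.4.26)$$
The scale parameter $r$ is necessary only to ensure that every $y/r+1$ is positive in the sample; however, it slightly affects the distribution of $Y$. MacLean et al. (1976) suggested using a fixed value of $r$ because, while simultaneous estimation of $r$ and $\lambda$ might improve the approximation to normality, it might exacerbate convergence problems.

In the case of a 2-component normal mixture model given by
$$f(y)=\pi N(\mu_1,\sigma)+(1-\pi)N(\mu_2,\sigma)\qquad(3.4.27)$$
the MLEs of the parameters $\pi,\mu_1,\mu_2,\sigma$ and $\lambda$ are estimated iteratively by maximizing the log likelihood function, which up to an additive constant is
$$l(y)=\sum_{i=1}^{n}\ln\Big(\frac{y_i}{r}+1\Big)^{\lambda-1}-n\ln\sigma+\sum_{i=1}^{n}\ln\Big[\pi\exp\Big\{-\frac{(z_i-\mu_1)^2}{2\sigma^2}\Big\}+(1-\pi)\exp\Big\{-\frac{(z_i-\mu_2)^2}{2\sigma^2}\Big\}\Big]$$
where $z=\dfrac{r}{\lambda}\Big[\Big(\dfrac{y}{r}+1\Big)^{\lambda}-1\Big]$.

Note that after the Box-Cox transformation has been applied the data is either normally distributed or a mixture of normal distributions; see MacLean et al. (1976) for details.

3.5 Conclusion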

PAGE 67

One of the many challenges for researchers in the field of mixture distributions is to determine the number of components. The model selection criteria BIC and AIC were discussed; however, for mixture distributions there has not been any theoretical justification for their use. Simulation and the modified likelihood ratio test are methods that have no such theoretical drawback. All three approaches were discussed in this chapter, with the modified likelihood ratio test used in this dissertation to determine the number of components for the mixture models. Note that the asymptotic null distribution for the modified likelihood ratio test is obtained by means of simulation.

In some cases researchers may not be able to distinguish commingled distributions from distributions that are skewed. In this situation, one available method is the likelihood ratio test for distinguishing skewness from commingled distributions, which uses the Box-Cox transformation to eliminate skewness under each of the hypotheses to be tested. Throughout this dissertation we assume that a mixture distribution is distinguishable from a skewed distribution; therefore we need not apply the Box-Cox transformation.

PAGE 68

In this chapter one of our major contributions is the building of a model with both of the above-mentioned capabilities, that is, we penalize both the mixing proportion and the variance parameters simultaneously. We are therefore able to fit normal mixture models with unequal variances and to conduct a test of hypothesis for the number of components that characterizes the model.

First, estimators for the parameters of the penalized modified likelihood approach will be derived. These estimators are necessary so that we can implement the expectation maximization algorithm when simulating the null distribution of the likelihood ratio test statistic (LRTS) for the penalized normal mixture model. Another major contribution of this dissertation is the proof of asymptotic normality of the MLEs (estimators) for the penalized normal mixture model. This is a first step toward determining the asymptotic null distribution of the likelihood ratio test statistic (which is an open problem).

PAGE 69

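where
$$f_{ij}(y_i)=\frac{1}{\sqrt{2\pi\sigma_j^2}}\exp\Big\{-\frac{(y_i-\mu_j)^2}{2\sigma_j^2}\Big\},\qquad j=1,\ldots,g,$$
such that $0\le\pi_j\le1$, $\sum_{j=1}^{g}\pi_j=1$, $\sigma_j>0$, and the true parameters are defined as $\theta_0\in\Theta$. The penalized modified likelihood for a $g$-component normal mixture model is given by
$$L_n(\theta\mid y)=\prod_{i=1}^{n}\sum_{j=1}^{g}\pi_jf_{ij}(y_i\mid\theta_j)\ \prod_{j=1}^{g}h(\sigma_j)\ \prod_{j=1}^{g}(g\pi_j)^{C}\qquad(4.1.3)$$
for the observed data, where $C$ is a positive constant that controls the level of modification of the mixing proportion $\pi_j$ (the last term of equation (4.1.3)). The function $h$, as mentioned in the previous chapter, was chosen so that $L_n$ is bounded over the parameter space. The penalty function $h$ was assumed to satisfy the following conditions:

(1) $\lim_{\sigma\to0}\ldots$, so that the maximum argument of $L_n$, that is, the penalized MLE $\arg\max_{\theta\in\Theta}L_n$, exists.

The consistency of the estimator was also a concern. In order to prove consistency it was required that $h$ also satisfy the following conditions:

(2) $h(\sigma)$ is many-to-one from $(0,\infty)$ onto $(0,G]$, $G>0$,
(3) $h$ is strictly increasing in an open interval $(0,\epsilon)$ of the origin which has a non-null measure,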

PAGE 70

(4) $h$ is continuously differentiable on $(0,\infty)$.

In this dissertation we consider two distributions that satisfy the aforementioned conditions on the penalty function $h(\cdot)$ for the variance. These distributions are (1) the inverse gamma and (2) the inverse chi-square distributions.

In the next section we derive the estimators for the penalized modified likelihood for normal mixture models. These estimators are vital because in chapter 5 we use them in the expectation maximization algorithm to evaluate the log likelihood, which is then used in the simulation of the asymptotic null distribution of the modified likelihood ratio test; see section 5.2 of chapter 5.

4.2 Parameter Estimation of Penalized Modified Likelihood

Similar to the approach in section 3.2, we need only maximize the expectation of the log-likelihood
$$Q(\theta\mid\theta^{(t)})=E\big[l_n(\theta\mid y,Z)\ \big|\ y,\theta^{(t)}\big].$$

PAGE 71

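Now we maximize with respect to $\pi_j$; therefore we consider the last term of equation (4.2.6). Since
$$\sum_{j=1}^{g}\Big[\sum_{i=1}^{n}\tau^{(t)}_{ij}\ln\pi_j+C\ln(g\pi_j)\Big]\propto\sum_{j=1}^{g}\Big[\sum_{i=1}^{n}\tau^{(t)}_{ij}\ln\pi_j+C\ln(\pi_j)\Big],\qquad(4.2.7)$$
taking the derivative of equation (4.2.7) with respect to $\pi_j$, subject to $\sum_{j=1}^{g}\pi_j=1$, and equating to 0 we get
$$\frac{\sum_{i=1}^{n}\tau^{(t)}_{ij}+C}{\pi_j}=\frac{\sum_{i=1}^{n}\tau^{(t)}_{ig}+C}{\pi_g}\ \Rightarrow\ \pi_j=\pi_g\,\frac{\sum_{i=1}^{n}\tau^{(t)}_{ij}+C}{\sum_{i=1}^{n}\tau^{(t)}_{ig}+C}\ \Rightarrow\ \pi_g=\frac{\sum_{i=1}^{n}\tau^{(t)}_{ig}+C}{n+gC},$$
where the last step sums over $j$ and uses $\sum_{j=1}^{g}\sum_{i=1}^{n}\tau^{(t)}_{ij}=n$.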

PAGE 72

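$$\pi_j^{(t+1)}=\frac{\sum_{i=1}^{n}\tau^{(t)}_{ij}+C}{n+gC}\qquad(4.2.8)$$
For a normal mixture, i.e. when $f_{ij}$ is normally distributed, we have that
$$\ln f_{ij}(y_i\mid\mu_j,\sigma_j^2)\propto-\frac{\ln(\sigma_j^2)}{2}-\frac{(y_i-\mu_j)^2}{2\sigma_j^2},$$
therefore the estimate for $\mu_j$ is given by
$$\mu_j^{(t+1)}=\frac{\sum_{i=1}^{n}\tau^{(t)}_{ij}y_i}{\sum_{i=1}^{n}\tau^{(t)}_{ij}}.$$
We now turn our attention to maximizing with respect to the variance of the normal mixture model when the inverse gamma function is used as the penalty function.

4.2.1 Inverse Gamma Penalty Function for $\sigma^2$

The inverse gamma penalty is
$$h(\sigma_j)\propto(\sigma_j^2)^{-(\alpha+1)}\exp\Big(-\frac{\beta}{\sigma_j^2}\Big),\qquad\ln h(\sigma_j)\propto-(\alpha+1)\ln(\sigma_j^2)-\frac{\beta}{\sigma_j^2}.$$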

PAGE 73

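$$\sum_{i=1}^{n}\tau^{(t)}_{ij}\Big\{-\frac{\ln(\sigma_j^2)}{2}-\frac{(y_i-\mu_j)^2}{2\sigma_j^2}\Big\}-(\alpha+1)\ln(\sigma_j^2)-\frac{\beta}{\sigma_j^2}.\qquad(4.2.10)$$
Thus we have
$$\sum_{i=1}^{n}\tau^{(t)}_{ij}\Big\{-\frac{1}{2\sigma_j^2}+\frac{(y_i-\mu_j)^2}{2\sigma_j^4}\Big\}-\frac{\alpha+1}{\sigma_j^2}+\frac{\beta}{\sigma_j^4}=0\ \Rightarrow\ \frac{\sum_{i=1}^{n}\tau^{(t)}_{ij}(y_i-\mu_j)^2+2\beta}{\sigma_j^2}=\sum_{i=1}^{n}\tau^{(t)}_{ij}+2(\alpha+1)$$
$$\Rightarrow\ (\sigma_j^2)^{(t+1)}=\frac{\sum_{i=1}^{n}\tau^{(t)}_{ij}(y_i-\mu_j)^2+2\beta}{\sum_{i=1}^{n}\tau^{(t)}_{ij}+2(\alpha+1)}.$$
To estimate the null parameters we maximize the log likelihood under the null, which is given by
$$\sum_{i=1}^{n}\ln f_{ij}(y_i\mid\mu,\sigma^2)+\sum_{j=1}^{g}\ln h(\sigma).$$
Setting
$$\frac{\partial}{\partial\mu}\Big\{\sum_{i=1}^{n}\Big[-\frac{\ln(\sigma^2)}{2}-\frac{(y_i-\mu)^2}{2\sigma^2}\Big]\Big\}=0$$
gives $\sum_{i=1}^{n}(y_i-\mu)=0$, therefore
$$\hat\mu=\frac{\sum_{i=1}^{n}y_i}{n}.$$
Under the null hypothesis the estimate of $\sigma^2$ is evaluated as follows. Using the inverse gamma penalty term, that is $h(\sigma)\propto(\sigma^2)^{-(\alpha+1)}\exp(-\beta/\sigma^2)$, we have $\ln h(\sigma)\propto-(\alpha+1)\ln(\sigma^2)-\beta/\sigma^2$. We want the derivative with respect to $\sigma^2$ of
$$\sum_{i=1}^{n}\Big[-\frac{\ln(\sigma^2)}{2}-\frac{(y_i-\mu)^2}{2\sigma^2}\Big]-(\alpha+1)\ln(\sigma^2)-\frac{\beta}{\sigma^2}$$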

PAGE 74

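$$\ldots=0\ \Rightarrow\ \frac{\sum_{i=1}^{n}(y_i-\hat\mu)^2+2\beta}{\sigma^2}=n+2(\alpha+1)\ \Rightarrow\ \hat\sigma^2=\frac{\sum_{i=1}^{n}(y_i-\hat\mu)^2+2\beta}{n+2(\alpha+1)}\qquad(4.2.13)$$
The other penalty term of interest is the inverse chi-square distribution, which is used in the next section to evaluate the MLE of the variance parameter of the normal mixture model.

4.2.2 Inverse Chi-Square Penalty Function for $\sigma^2$

22j; 22j 2(yij)2 22j): 24j=0)Pni=1(t)ij(yij)2 22j=nXi=1(t)ij+(=2+1)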

PAGE 75

Toestimatethenullparameterswemaximizetheloglikelihoodunderthenull,givenbynXi=1lnfij(yij;2)+gXj=1lnh(): 2(yi)2 22),thereforelnh()/(=2+1)ln(2)1 22.Thereforetakingthederivativew.r.t.2ofequation(4.2.15)andequationtozero,nXi=1ln(2) 2(yi)2 22(4.2.15) wehavenXi=11 24=0)Pni=1(yi)2 22=n+(=2+1))^2=Pni=1(yi)2+1=2
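The penalized M-step of this section can be sketched in a few lines, given the posterior probabilities from the E-step. The sketch covers the mixing proportion update (4.2.8), the weighted mean, and the variance update under the inverse gamma penalty. The names `penalized_m_step`, `alpha` and `beta` (the shape and scale of the inverse gamma penalty) and all default values are assumptions made for illustration, not values taken from the dissertation.

```python
def penalized_m_step(y, tau, C=1.0, alpha=1.0, beta=1.0):
    """Penalized M-step for a heteroscedastic normal mixture.

    y    : list of n observations
    tau  : n x g list of posterior probabilities from the E-step
    C    : penalty constant for the mixing proportions
    alpha, beta : hypothetical shape and scale of the inverse gamma penalty
    """
    n, g = len(y), len(tau[0])
    col = [sum(tau[i][j] for i in range(n)) for j in range(g)]
    # Mixing proportions: (sum_i tau_ij + C) / (n + g*C), kept off the boundary by C > 0.
    pi = [(col[j] + C) / (n + g * C) for j in range(g)]
    # Means: posterior-weighted averages (the penalties do not involve the means).
    mu = [sum(tau[i][j] * y[i] for i in range(n)) / col[j] for j in range(g)]
    # Variances: the 2*beta term in the numerator keeps each estimate strictly positive.
    sigma2 = [
        (sum(tau[i][j] * (y[i] - mu[j]) ** 2 for i in range(n)) + 2.0 * beta)
        / (col[j] + 2.0 * (alpha + 1.0))
        for j in range(g)
    ]
    return pi, mu, sigma2
```

Even when a component is effectively fitted to a single observation, so that its unpenalized variance estimate would be zero, the penalized estimate stays at $2\beta/(\sum_i\tau_{ij}+2(\alpha+1))>0$, which is what bounds the likelihood.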

PAGE 76

Consequently G. Ciuperca et al. (2003) focused their study of the asymptotic properties on a neighbourhood of the origin of the parameters $\sigma_j$. In this section we apply their idea to prove that there exists a constant $\epsilon>0$, not dependent on $n$, such that the probability that the penalized modified likelihood $L_n$ is maximized by a $\sigma_j\in[0,\epsilon)$ is zero. Similar to the approach in G. Ciuperca et al. (2003), from (4.1.3) we consider $L_n$ and extend its definition to …, i.e., … Let $\theta_0$ be the true value of the parameter and let us define the Banach space $H=L^1(f_1(y;\theta_0))$

PAGE 78

where=Qgj=1j 2]tobew()=h() 2G:

PAGE 79

Ifminj=1;:::;gj2(0;),thenbytakingthedenitionofwandtheassumption(3)onhintoaccount,wehaveEH[e]
PAGE 80

For2letusdenethefollowingfunctions WehavethefollowingLemmaLemma4.3.4 Letusintroducetworesultswhichwillbeusefultocharacterizethespeedofcon-vergenceofthepenalizedestimator.First,notethatsinceg=Pg1j=1j,thevectorcontains3g1parameters=(1;:::;g1;1;:::;g;1;:::;g)T Letusdeneu(Yj)=f1(Yj)gYj=1h(j)1=ngYj=1(gj)C=n

PAGE 81

and@lnu(Yj)

PAGE 83

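and if derivatives of equation (4.3.23) are taken with respect to $\theta$ (interchanging derivative and integral, which can usually be done) we have
$$\int\frac{\partial}{\partial\theta}f_1(Y\mid\theta)\,dY=\int f_1'(Y\mid\theta)\,dY=0$$
and
$$\int\frac{\partial^2}{\partial\theta^2}\ldots$$
Strong consistency of the penalized MLE is stated by means of the following two Theorems. They follow the structure of the Theorems proved by Wald (1949) for the classical MLE over a compact set.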

PAGE 84

inordertohaveanon-singularinformationmatrixI(0)=EHh@lnf1(0)

PAGE 85

2Rn(+n)(4.3.26) ThevectorRn(+)hasthecomponentsRn(+n)k=(+n0)TBk(+n0);k=1;:::;3g1 whereBkisasquarematrixwithelementsBk(i;j)=@3lnLn(+n)

PAGE 86

2nTn(+n)i1:(4.3.27) Letusnowfocusontherstterminthebracketof(4.3.27).BymeansofLemma4.3.5,applicationofthecentrallimitTheoremonthesetofrandomvariables@lnu(Yij0)=@l1in;l=1;:::;3g1leadsto1 Concerningthetermsinthesecondfactorof(4.3.27),@2lnLn(0)=@2isequaltonXi=11 Forthesecondofthetwo,sinceh(3)(0l)=h(0l)isbounded,wehave1

PAGE 87

The main result of this chapter is the proof of asymptotic normality (see section 4.3). This proof is a vital first step toward establishing the asymptotic null distribution of the likelihood ratio test used to determine the number of components of a normal mixture model with unequal variance parameters. The proof of the asymptotic null distribution of the likelihood ratio test itself is left for future work. Since that distribution remains an open problem, we use simulation in chapter 5 to determine the asymptotic null distribution of the likelihood ratio test.

PAGE 88

Wei Pan et al. (2002, 2003) used BIC as a criterion for model selection, to determine the number of components of the normal mixture model. However, there is no theoretical justification for the use of either the BIC or the AIC model selection criterion for mixture models. Therefore we used the modified likelihood ratio test proposed by Chen et al. [6,7] to test the hypotheses: a 1-component model (null hypothesis) versus at least a 2-component model (alternative hypothesis), and a 2-component model (null hypothesis) versus at least a 3-component model (alternative hypothesis). However, the modified likelihood ratio test of Chen et al. [6,7] is not applicable in the heteroscedastic case (that is, a mixture of normals with unequal variances); hence we simulate the null distribution of the penalized modified likelihood ratio test (c.f. chapter 3 page 45, where the simulation

PAGE 89

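where
$$f_{ij}(y_i)=\frac{1}{\sqrt{2\pi\sigma_j^2}}\exp\Big\{-\frac{(y_i-\mu_j)^2}{2\sigma_j^2}\Big\},\qquad j=1,\ldots,g,$$
satisfying $0\le\pi_j\le1$, $\sum_{j=1}^{g}\pi_j=1$, $\sigma_j>0$, and the true parameters are defined as $\theta_0\in\Theta$. Furthermore, as in chapter 3 we let
$$M_g=\Big\{G(\theta)=\sum_{j=1}^{g}\pi_jf_j(x_i\mid\theta_j):\ \theta_1,\ldots,\theta_g;\ \sum_{j=1}^{g}\pi_j=1,\ \pi_j\ge0\Big\}\qquad(5.1.3)$$
denote the class of all mixture probability density functions with at most $g$ components.

5.2 PMMM Simulated Null Distribution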

PAGE 90

The simulation of the null distribution is done as follows. In the case of hypothesis (5.2.4) we simulated 1000 replicates of standard normal $N(0,1)$ samples of sizes 100, 250, 500, 750 and 1000. We then fitted 2-component normal mixtures with unequal variances for each of the sample sizes and calculated the penalized modified log likelihood ratio test (PMLRT) statistic, defined as
$$R_n=2\{\ln L_n(\hat\pi,\hat\mu_1,\hat\mu_2,\hat\sigma_1,\hat\sigma_2)-\ln L_n(1/2,\hat\mu,\hat\mu,\hat\sigma,\hat\sigma)\}\qquad(5.2.6)$$
where $L_n$ is defined in (5.1.1). A linear regression equation was fitted using the 5 values of the PMLRT to determine the degrees of freedom as a function of the sample size $n$ (see section 3.3 of chapter 3). The degrees of freedom of the simulated chi-squared null distribution as a function of $n$ for hypothesis (5.2.4) are given by
$$f=3.1+10.2\,n^{-0.5}.\qquad(5.2.7)$$
Table 5.1 shows the mean, variance and percentiles of the PMLRT for the sample sizes 100, 250, 500, 750 and 1000 for hypothesis (5.2.4). The percentiles in brackets are those of the chi-squared distribution with degrees of freedom given by equation (5.2.7), while the percentiles not in brackets are the ordered simulated percentiles of the PMLRT. We can see from Table 5.1 that the simulated 50th, 75th, 90th and 95th percentiles for sample sizes 100, 250, 500, 750 and 1000 are relatively close to those of the chi-squared distribution, suggesting good agreement between our simulated and theoretical distributions.

Note that the regressed degrees of freedom are generally not integers; therefore a gamma distribution with first (shape) parameter $f/2=1.55+5.10\,n^{-1/2}$ and second parameter 0.5 was used. This was done because the chi-squared distribution $\chi^2_f$ with $f$ degrees of freedom is a special case of the gamma distribution $G(f/2,1/2)$ with parameters $f/2$ and $1/2$.

In the case of the hypothesis (5.2.5) we simulated 1000 replicates from the normal

PAGE 91

Sample size     100            250            500            750            1000
Mean            4.00           3.90           3.71           3.33           3.29
Variance        8.05           8.04           7.71           7.03           7.02
Percentiles
50%             3.20 (3.45)    3.22 (3.08)    3.05 (2.90)    2.70 (2.81)    2.66 (2.76)
75%             5.59 (5.51)    5.33 (5.04)    5.13 (4.80)    4.28 (4.69)    4.40 (4.63)
90%             8.01 (7.93)    7.90 (7.37)    7.23 (7.08)    6.82 (6.95)    6.87 (6.87)
95%             9.74 (9.65)    9.40 (9.04)    9.25 (8.72)    8.79 (8.56)    8.58 (8.50)

of sample sizes 100, 250, 500, 750 and 1000. We fitted 2-component and 3-component normal mixture distributions to data simulated from the normal mixture models of (5.2.8) and evaluated the PMLRT. The PMLRT for hypothesis (5.2.5) is given by
$$R_n=2\{\ln L_n(G_3(\hat\theta))-\ln L_n(G_2(\hat\theta))\}\qquad(5.2.9)$$
where $G_2(\hat\theta)$ and $G_3(\hat\theta)$ are the estimates under the null and alternative hypotheses of (5.2.5) respectively. The linear regression equation for the degrees of freedom as a function of $n$ was determined to be
$$f=4.89+11.84\,n^{-1/2}+0.09\,I,\qquad(5.2.10)$$
where
$$I=\begin{cases}1 & \text{if means are from }0.5\,\phi(y\mid0,1)+0.5\,\phi(y\mid2,1)\\ 0 & \text{if means are from }0.2\,\phi(y\mid0,1)+0.8\,\phi(y\mid2,1).\end{cases}$$
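The regression step behind (5.2.7) can be sketched by regressing the simulated means in Table 5.1 on $n^{-1/2}$, using the fact that the mean of a chi-squared variable equals its degrees of freedom. That the means were the response in this regression is an assumption made for illustration; the helper name `ols` is hypothetical.

```python
def ols(x, y):
    """Ordinary least squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Sample sizes and simulated PMLRT means as reported in Table 5.1.
sizes = [100, 250, 500, 750, 1000]
means = [4.00, 3.90, 3.71, 3.33, 3.29]

intercept, slope = ols([n ** -0.5 for n in sizes], means)
```

Under this assumption the fitted intercept and slope agree closely with the coefficients in (5.2.7), and the intercept estimates the degrees of freedom as $n\to\infty$.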

PAGE 92

Table 5.2 illustrates that the simulated 50th, 75th, 90th and 95th percentiles for sample sizes 100, 250, 500, 750 and 1000 are relatively close to those of the chi-squared distribution, suggesting good agreement between our simulated and theoretical distributions.

Note that a gamma distribution with first (shape) parameter $2.44+5.92\,n^{-1/2}+0.045\,I$ and second parameter 0.5 is equivalent to $\chi^2_f$, where $f=4.89+11.84\,n^{-1/2}+0.09\,I$.

Sample size     100             250             500             750             1000

Simulated results for 0.5 φ(y|0,1) + 0.5 φ(y|2,1)
Mean            6.11            5.84            5.60            5.33            5.26
Variance        11.92           11.72           11.04           10.69           10.53
Percentiles
50%             5.42 (5.50)     5.03 (5.07)     4.84 (4.85)     4.71 (4.75)     4.55 (4.69)
75%             8.46 (8.03)     7.65 (7.50)     7.34 (7.23)     7.02 (7.12)     7.07 (7.05)
90%             10.60 (10.86)   11.10 (10.25)   9.95 (9.94)     9.74 (9.81)     9.56 (9.72)
95%             11.78 (12.82)   12.19 (12.17)   12.47 (11.84)   11.82 (11.69)   11.29 (11.60)

Simulated results for 0.2 φ(y|0,1) + 0.8 φ(y|2,1)
Mean            6.00            5.75            5.47            5.25            5.24
Variance        12.03           12.13           11.85           10.84           10.63
Percentiles
50%             5.67 (5.41)     5.09 (4.98)     4.90 (4.76)     4.67 (4.66)     4.59 (4.60)
75%             7.67 (7.92)     7.31 (7.39)     7.15 (7.12)     7.13 (7.01)     7.00 (6.94)
90%             9.83 (10.73)    10.60 (10.13)   10.09 (9.82)    9.45 (9.68)     9.65 (9.60)
95%             12.44 (12.69)   12.03 (12.03)   11.90 (11.70)   11.10 (11.55)   11.20 (11.46)
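Because the regressed degrees of freedom are not integers, a p-value for an observed PMLRT value can be read from the gamma form of the chi-squared distribution. The sketch below is one possible way to do this using only the standard library: it evaluates the gamma distribution function by its power series and plugs in the degrees of freedom from (5.2.10). The function names are hypothetical, and the series is intended only for the moderate statistic values that arise here.

```python
import math


def gamma_cdf(x, shape, rate):
    """Gamma(shape, rate) distribution function via the regularized incomplete gamma series."""
    if x <= 0.0:
        return 0.0
    z = rate * x
    term = 1.0 / shape
    total = term
    k = 1
    # Series: z^a e^{-z} / Gamma(a) * sum_k z^k / (a (a+1) ... (a+k)).
    while term > 1e-15 * total and k < 10000:
        term *= z / (shape + k)
        total += term
        k += 1
    return total * math.exp(-z + shape * math.log(z) - math.lgamma(shape))


def degrees_of_freedom(n, indicator):
    """Regressed degrees of freedom from equation (5.2.10)."""
    return 4.89 + 11.84 * n ** -0.5 + 0.09 * indicator


def p_value(statistic, n, indicator):
    """Upper tail probability of chi-squared with f d.f., computed as Gamma(f/2, 1/2)."""
    f = degrees_of_freedom(n, indicator)
    return 1.0 - gamma_cdf(statistic, f / 2.0, 0.5)
```

For example, `p_value(12.82, 100, 1)` evaluates the tail probability at the bracketed 95th percentile reported in Table 5.2 for $n=100$, so it should be close to 0.05.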

PAGE 93

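The data for the equally expressed (EE) genes are simulated from $N(\mu_{i1},\sigma^2_{i1})$ for $k=1,\ldots,m$ and $N(\mu_{i2},\sigma^2_{i2})$ for $k=m+1,\ldots,m+n$, where $\mu_{i1}=\mu_{i2}\sim N(0,2)$, and $\sigma_{i1}$ and $\sigma_{i2}$ are generated from Gamma(2,4), respectively. Note that the $\sigma_{i1}$ and $\sigma_{i2}$ generated in this way take different values for each gene and also differ between genes. The data for the DE genes were generated similarly. However, in this case, $\mu_{i1}$ and $\mu_{i2}$ were generated from $N(0,2)$ separately. The variances $\sigma_{i1}$ and $\sigma_{i2}$ are generated in the same way as in the EE gene case.

5.4 Application of PMMM to Simulated Microarray Data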

PAGE 94

where $Y_{i(1)}=(Y_{i1},\ldots,Y_{im})$ are gene expression levels from $m$ microarrays under condition 1, and $Y_{i(2)}=(Y_{i,m+1},\ldots,Y_{i,m+n})$ are from $n$ arrays under condition 2 of a microarray experiment. Note that $m$ and $n$ are assumed to be even and $p_i$ ($q_i$) is a column vector containing a random permutation of $m/2$ 1's and $m/2$ $-1$'s ($n/2$ 1's and $n/2$ $-1$'s).

The hypothesis to be tested is
$$H_0:f_0=f_1;\ \text{there is no gene with altered expression}\qquad H_1:f_0\ne f_1;\ \text{otherwise}\qquad(5.4.13)$$
We therefore fitted 1-, 2- and 3-component normal mixture models and calculated $R_n$ as defined in (5.2.6) and (5.2.9) respectively to determine the distributions of $f_0$ and $f$. Table 5.3 displays the results of the hypothesis test for the number of components for $f_0$ and $f$. The choice for both $f_0$ and $f$ is the 2-component normal mixture model, stated below:
$$f_0(z)=0.01\,\phi(z\mid0.287,0.05673^2)+0.99\,\phi(z\mid0.004558,0.40812^2);$$
$$f(z)=0.20\,\phi(z\mid2.442,0.43703^2)+0.80\,\phi(z\mid0.0062961,0.48583^2).$$

1 vs. 2 component     2 vs. 3 component

Our main interest in applying the PMMM approach is to determine which genes are differentially expressed; therefore the median number of false positives (FP) were

PAGE 96

        Median FP    Mean FP    TP     FDR%
0.07    0            0.069      196    0.00
0.10    0            0.138      196    0.00
0.15    0            0.310      196    0.00
0.30    2            1.655      201    1.00
0.35    3            3.828      203    1.48
0.40    5            5.621      205    2.44
0.45    14           13.931     210    6.67
0.60    26           25.724     221    11.76
0.70    43           43.207     231    18.61
0.90    68           66.966     248    27.42
1.00    104          103.517    270    38.52

Table 5.5: Values of TP, FP and FDR from SAM for the simulated data

        Median FP    Mean FP    TP     FDR%
0.49    3.71         6.496      195    1.90
0.47    4.64         6.496      197    2.36
0.45    6.50         8.352      198    3.28
0.43    7.42         10.208     200    3.71
0.42    8.35         12.064     203    4.11
0.37    11.14        14.848     206    5.41
0.32    23.20        27.840     211    11.00
0.28    33.41        40.832     221    15.12
0.25    45.47        55.680     230    19.77
0.20    69.60        82.592     246    28.29
0.16    107.65       118.042    268    40.17

PAGE 97

Figure 5.3: The values of FDR from our method and SAM for the simulated data

PAGE 98

Table 5.6 presents the results of the test of hypothesis to determine the number of components of the normal mixture models for $f_0$ and $f$. We choose the 2-component normal mixture model for both $f_0$ and $f$, which are stated below:

1 vs. 2 component     2 vs. 3 component

PAGE 100

        Median FP    Mean FP    TP     FDR%
0.07    0            0.03       94     0.00
0.10    0            0.07       103    0.00
0.15    0            0.28       113    0.00
0.30    3            3.17       144    2.08
0.35    8            8.75       168    4.76
0.40    12           12.44      178    6.74
0.45    29           27.86      215    13.49
0.60    44           46.17      248    17.74
0.70    65           65.96      288    22.57
0.90    95           95.59      323    29.41
1.00    134          133.83     368    36.41

Table 5.8: Values of TP, FP and FDR from SAM for the rat data

        Median FP    Mean FP    TP     FDR%
0.94    4.2          16.73      80     5.23
0.88    9.1          23.71      101    8.97
0.78    11.2         29.98      149    9.96
0.68    19.5         45.32      149    13.10
0.63    25.1         62.76      168    14.94
0.58    34.2         76.70      198    17.26
0.54    49.5         97.62      238    20.80
0.50    57.2         109.47     259    22.08
0.46    75.7         135.27     301    25.13
0.42    93.1         167.35     336    27.70
0.38    132.5        221.04     420    31.54

PAGE 101

Figure 5.6: The values of FDR from our method and SAM for the rat data

PAGE 102

In this dissertation we have not determined the theoretical null distribution of the likelihood ratio test for the hypotheses: a one-component normal mixture ($H_0$) against a two-component normal mixture ($H_a$), and a two-component normal mixture ($H_0$) against a three-component normal mixture ($H_a$). Hence we simulated the null distribution of the likelihood ratio test. In chapter 3, section 3.3, the simulation of the null distribution was explained and the degrees of freedom for the chi-square statistic were determined by the regression approach of Thode et al. Since the $\chi^2_f$ distribution with $f$ degrees of freedom is equivalent to a gamma distribution with parameters $f/2$ and 0.5, the gamma distribution was used to determine the P-values for the hypotheses stated above.

The results of the penalized modified likelihood approach were compared to those of SAM. For simulated data the penalized modified likelihood approach outperformed SAM in terms of the false discovery rate (FDR) (see tables 5.4 and 5.5): the false discovery rates for the penalized modified likelihood approach were lower than those of SAM. In the case of real data the penalized modified likelihood approach outperformed SAM for true positives (TP) less than or equal to 300. With TP $\le300$, the false discovery rates of the penalized modified likelihood approach were lower than those of SAM (see tables 5.7 and 5.8).

PAGE 103

221+1 220; 2 221+

PAGE 104

To test hypothesis (6.1.1), Allison et al. (2002) used bootstrapping to determine the number of components in the mixture model of uniform and beta distributions. The asymptotic null distribution of the likelihood ratio test for the p-value approach of Allison et al. can be determined by simulation if we penalize the mixing proportion. The penalization of the mixing proportion is important because under hypothesis (6.1.1) the model can easily be misinterpreted as being uniform if the estimates of the mixing proportion lie on the boundary of the parameter space, that is, $p_j=0$. Therefore we modified the p-value approach of Allison et al. (2002) by adding a penalty term for the mixing proportions, as was done in Chen et al. The addition of this penalty term keeps the parameter estimate of the mixing proportion $p_j$ off the boundary of the parameter space (i.e. $p_j=0$), hence circumventing the non-identifiability problem.

Let
$$\beta(y\mid r,s)=\frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}\,y^{r-1}(1-y)^{s-1};$$
the resulting modified log likelihood function is
$$l_g=\sum_{i=1}^{n}\ln\Big[p_1\beta(y_i\mid1,1)+\sum_{j=2}^{g-1}p_j\beta(y_i\mid r_j,s_j)\Big]+C\sum_{j=1}^{g}\ln(g\,p_j),\qquad(6.1.3)$$
where $\sum_{j=1}^{g}p_j=1$, $p_j\ge0$ and $y$ represents the p-value from a valid statistical test. From

PAGE 105

Note that in the approach of Allison et al. (2002) the parameters $r_2$, $s_2$ and $p_2$ are not identifiable under the null, as was mentioned earlier, and the null hypothesis lies on the boundary of the parameter space ($p_2=0$).

With the addition of the penalty term $C\sum_{j=1}^{g}\ln(g\,p_j)$ we may be able to apply Theorem 3.3.1 of chapter 3, so that the resulting null distribution is
$$\tfrac{1}{2}\chi^2_0+\tfrac{1}{2}\chi^2_1.$$
With this done we now need to estimate the parameters $p_2$ and $s_2$, noting that the parameter $s_2$ characterizes the behaviour of the p-values close to zero.

Examining Figure 6.1 we observe that the beta distributions that aptly describe the behaviour under the alternative hypothesis, that is, where the distribution of p-values tends to cluster close to zero, are given by $\beta(y\mid1,10)$ and $\beta(y\mid0.5,5)$. For this reason, if we wish to apply the mixture of uniform-beta distributions for the p-value approach we can implement a uniform-beta mixture of the form
$$p_1\beta(y_i\mid1,1)+\sum_{j=2}^{g-1}p_j\beta(y_i\mid1,s_j);$$

PAGE 106

However, the asymptotic null distribution is not $\tfrac{1}{2}\chi^2_0+\tfrac{1}{2}\chi^2_1$

PAGE 107

where $l_g$ is defined in equation (6.1.5). Table 6.1 displays the mean, variance and percentiles for the sample sizes stated above. From the results in Table 6.1 we see that the simulated asymptotic null distribution is consistent with a $\chi^2$ distribution, because the variance is approximately twice the mean. Additionally, the simulated percentiles are close to those of the $\chi^2_f$ distribution shown in brackets, where $f$ is the regressed degrees of freedom stated below. The regression equation for the degrees of freedom as a function of the sample size $n$ was evaluated using the means for each sample size. The regression equation was found to be
$$f=1.32+4.01\,n^{-1/2}\qquad(6.2.8)$$

Sample size     100            250            500            750            1000
Mean            1.69           1.63           1.54           1.49           1.37
Variance        3.52           3.29           3.17           2.84           2.80
Percentiles
50%             0.97 (1.12)    0.96 (0.98)    0.89 (0.91)    0.94 (0.88)    0.78 (0.86)
75%             2.57 (2.38)    2.31 (2.17)    2.19 (2.07)    2.11 (2.02)    1.96 (1.99)
90%             4.21 (4.11)    4.22 (3.84)    3.85 (3.70)    3.66 (3.64)    3.32 (3.60)
95%             5.45 (5.44)    5.32 (5.13)    5.00 (4.98)    5.01 (4.91)    4.58 (4.87)

The percentiles of $\chi^2_f=G(f/2,0.5)$, $f=1.32+4.01\,n^{-1/2}$, are in brackets.

PAGE 108

Table 6.2 shows results similar to those in Table 6.1. However, the degrees of freedom for the hypothesis 2-component versus 3-component for the modified p-value approach are
$$f=3.69+7.27\,n^{-1/2}\qquad(6.2.10)$$

Sample size     100             250             500            750            1000
Mean            4.42            4.14            3.99           3.96           3.93
Variance        8.47            8.16            8.13           7.84           7.73
Percentiles
50%             3.94 (3.77)     3.43 (3.50)     3.36 (3.37)    3.49 (3.31)    3.17 (3.28)
75%             5.96 (5.91)     5.24 (5.57)     5.37 (5.40)    5.19 (5.33)    5.46 (5.28)
90%             8.01 (8.39)     8.67 (8.00)     7.96 (7.80)    7.70 (7.71)    7.83 (7.66)
95%             9.51 (10.16)    10.46 (9.73)    9.12 (9.51)    9.48 (9.41)    9.37 (9.36)

The percentiles of $\chi^2_f=G(f/2,0.5)$, $f=3.69+7.27\,n^{-1/2}$, are in brackets.

6.3 Application of Modified P-Value to Simulated Microarray Data

implying that the number of differentially expressed genes is 200, which compares well with the number of simulated differentially expressed genes, which is 200. If $r\ne1$,

PAGE 109

yielding 219 differentially expressed genes, which was outperformed by the model that fixed $r=1$. The graphs for both models are shown in Figure 6.2, where model (6.3.11) is the solid line and model (6.3.12) is the dotted line.

Under the model where $r=1$, suppose we believe that the genes whose p-values are less than 0.10 are interesting and worthy of follow-up. The estimated proportion of these genes that are likely to be false leads is (see section 2.3 page 18 for details about the formula)
$$\frac{0.89999416\times0.10}{0.89054436\times0.10+0.10000584\,I_{0.10}(1,25.997)}=49\%$$
where $I_a(r,s)$ is the cumulative beta distribution with parameters $r$ and $s$, evaluated at $a$. This proportion is 0.490, implying that there exists a 49% chance that any randomly

PAGE 110

$$\frac{0.10000584\,[1-I_{0.10}(1,25.997)]}{0.89054436\,(1-0.10)+0.10000584\,[1-I_{0.10}(1,25.997)]}=0.008.$$

6.4 Application of Modified P-Value to Simulated Prostate Data

PAGE 111

With the MLE for $p$ and $s$ we have that the estimated proportion of genes that are likely to be false leads, if we assume a threshold value of 0.10, is
$$\frac{0.9\times0.10}{0.9\times0.10+0.1\,I_{0.10}(1,105)}=47\%;$$
$$\frac{0.1\,[1-I_{0.10}(1,105)]}{0.9\,(1-0.10)+0.1\,[1-I_{0.10}(1,105)]}=0.000002.$$

6.5 Application of Modified P-Value to the Prostate Data

1 vs. 2 component     2 vs. 3 component
8.54 (P < 0.01)       1.67 (P > 0.05)

PAGE 112

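$$\frac{0.696\times0.10}{0.696\times0.10+0.162\,I_{0.10}(1,90)+0.142\,I_{0.10}(1,15)}=20.2\%;$$
$$\frac{0.162\,[1-I_{0.10}(1,90)]+0.142\,[1-I_{0.10}(1,15)]}{0.696\,(1-0.10)+0.162\,[1-I_{0.10}(1,90)]+0.142\,[1-I_{0.10}(1,15)]}=0.045.$$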
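The two proportions used in this chapter can be sketched directly, because for $r=1$ the cumulative beta distribution has the closed form $I_t(1,s)=1-(1-t)^s$. The function names below are hypothetical; the weights and shape parameters in the usage lines are the estimates quoted in the text, with the uniform weight taken as $1-0.10000584$ for the simulated microarray data.

```python
def beta1_cdf(t, s):
    """Cumulative distribution of Beta(1, s) at t, which equals 1 - (1 - t)^s."""
    return 1.0 - (1.0 - t) ** s


def false_lead_proportion(p_uniform, beta_components, t):
    """Share of p-values below threshold t that come from the uniform (null) component."""
    below = p_uniform * t + sum(p * beta1_cdf(t, s) for p, s in beta_components)
    return p_uniform * t / below


def miss_proportion(p_uniform, beta_components, t):
    """Share of p-values above threshold t that come from the beta (alternative) components."""
    above_alt = sum(p * (1.0 - beta1_cdf(t, s)) for p, s in beta_components)
    return above_alt / (p_uniform * (1.0 - t) + above_alt)


# Estimates quoted in the text: (weight, s) pairs for the Beta(1, s) components.
simulated_microarray = (0.89999416, [(0.10000584, 25.997)])
simulated_prostate = (0.9, [(0.1, 105.0)])
prostate = (0.696, [(0.162, 90.0), (0.142, 15.0)])
```

Calling `false_lead_proportion(*prostate, 0.10)` and `miss_proportion(*prostate, 0.10)` gives values that round to the 20.2% and 0.045 reported above, and the other two parameter sets give values that round to the figures reported for the simulated data.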

PAGE 114

The penalization technique for the mixing proportion used in this dissertation was introduced by Chen et al., who penalized the mixing proportion so that the asymptotic null distribution can be determined theoretically. In the non-parametric approach of Wei Pan, heteroscedastic normal mixture models were used without addressing the unboundedness of the likelihood. Ciuperca et al. addressed the unboundedness of the likelihood in the variance parameters by adding a penalty function for the variances. We combined both techniques so that we addressed the non-identifiability of the parameters under the null hypothesis and the unboundedness of the log likelihood simultaneously. Our approach, the penalized modified likelihood approach, is therefore an important contribution to the area of mixture models.

The proof that the penalized maximum likelihood estimators are asymptotically normal was presented in this dissertation. Asymptotic normality is an important property needed to establish the asymptotic null distribution of the penalized modified likelihood ratio statistic, which was not proven in this dissertation. Since we did not prove the asymptotic null distribution of the penalized modified likelihood ratio test, we simulated the asymptotic null distribution and used the regression method of Thode et al. to

The penalized modified likelihood approach for mixtures of normal distributions with unequal variances was then used to determine the number of components for the null and alternative distributions for simulated and real-world data. The results of the penalized modified likelihood approach were then compared to those of SAM. The penalized modified method was found to outperform SAM.

In addition to the modified likelihood approach, we studied the p-value approach for detecting differentially expressed genes in microarray data introduced by Allison et al. We modified the p-value approach of Allison et al. by penalizing the mixing proportion. An argument similar to that presented above for penalizing the mixing proportion in the normal mixture model applies. However, we made one simple modification by fixing the parameter r = 1 of the beta distribution \(\beta(r, s)\), because we observed that the distribution \(\beta(1, s)\) describes the behaviour of the alternative hypothesis, where the distribution of p-values tends to be closer to zero. The asymptotic distribution of the modified likelihood ratio statistic was not proven in this dissertation. Therefore, the alternative approach of simulating the asymptotic null distribution was used. The regression method of Thode et al. was used to determine the degrees of freedom of the asymptotic null distribution of the modified likelihood ratio statistic.

Allison et al. used the bootstrap approach to determine the number of components of a mixture of beta distributions, whereas we simulated the asymptotic null distribution. Furthermore, Allison et al. did not state the empirical null distribution for the likelihood ratio test used to determine the number of components of the mixture of uniform and beta distributions. By using simulation, we determined that the asymptotic null distribution is a \(\chi^2\) distribution, where the degrees of freedom were determined from the regression approach of Thode et al.
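The simulation idea described here can be illustrated with a minimal sketch for the uniform plus Beta(1, s) mixture of p-values. Several pieces are assumptions made only for illustration: the penalty is taken in the form C log(4 λ0 λ1) used by Chen et al. for two-component mixtures, the maximization is a coarse grid search rather than the estimation procedure of the dissertation, and the degrees of freedom are estimated by matching the mean of the simulated statistics (the mean of a chi-square variable equals its degrees of freedom), which is a stand-in for, not an implementation of, the regression method of Thode et al. All function names and grid values are hypothetical.

```python
import math
import random


def penalized_loglik(pvals, lam1, s, C=1.0):
    """Log-likelihood of (1 - lam1) * Uniform(0, 1) + lam1 * Beta(1, s),
    plus the assumed mixing-proportion penalty C * log(4 * lam0 * lam1)."""
    lam0 = 1.0 - lam1
    loglik = sum(math.log(lam0 + lam1 * s * (1.0 - p) ** (s - 1.0)) for p in pvals)
    return loglik + C * math.log(4.0 * lam0 * lam1)


# Coarse illustrative grids. The point lam1 = 0.5, s = 1 reproduces the null
# fit exactly (density 1, penalty 0), so the statistic is never negative.
LAM_GRID = [i / 20 for i in range(1, 20)]
S_GRID = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]


def modified_lrt(pvals, C=1.0):
    """Twice the gap between the best penalized log-likelihood on the grid and
    the null value, which is 0 when every p-value is uniform."""
    best = max(penalized_loglik(pvals, lam1, s, C)
               for lam1 in LAM_GRID for s in S_GRID)
    return 2.0 * best


def simulate_null(n_pvals=100, n_sims=50, seed=0, C=1.0):
    """Statistics computed on data sets of purely uniform p-values."""
    rng = random.Random(seed)
    return [modified_lrt([rng.random() for _ in range(n_pvals)], C)
            for _ in range(n_sims)]


if __name__ == "__main__":
    stats = simulate_null()
    # Moment-matching estimate of the degrees of freedom (stand-in technique).
    print("estimated degrees of freedom:", sum(stats) / len(stats))
```

An observed statistic can then be compared with the simulated values, for example by counting how many simulated statistics exceed it.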
In the future I hope to prove the theoretical asymptotic null distribution of the penalized modified normal mixture model. Although the proof for the asymptotic null distribution of mixtures of beta distributions will be more challenging, it is worth my focused attention. Furthermore, an interesting problem that needs serious consideration is that of using mixtures of t-distributions instead of mixtures of normals with unequal variances to describe the distributions of the null and alternative hypotheses used in the

[1] Akaike, H. (1973), Information theory and an extension of the maximum likelihood principle, 2nd International Symposium on Information Theory (eds. B. N. Petrov and F. Csaki), pp. 267-281, Akademiai Kiado, Budapest.
[2] Allison, D. B., Gadbury, G. L., Heo, M. (2002), A mixture model approach for the analysis of microarray gene expression data, Comput. Statist. & Data Analysis, Vol. 39, pp. 1-20.
[3] Ben-Dor, A., Shamir, R., and Yakhini, Z. (1999), Clustering gene expression patterns, J. Comput. Biol., Vol. 6, pp. 281-297.
[4] Benjamini, Y., and Hochberg, Y. (1995), Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Statist. Soc., Vol. 57, pp. 289-300.
[5] Bohning, D. (1999), Computer Assisted Analysis of Mixture, New York: Marcel Dekker.
[6] Chen, H., Chen, J., and Kalbfleisch, J. D. (2001), A Modified Likelihood Ratio Test for Homogeneity in Finite Mixture Models, J. R. Statist. Soc., Vol. 63, Part 1, pp. 19-29.
[7] Chen, H., Chen, J., and Kalbfleisch, J. D. (2004), Testing for a finite mixture model with two components, J. R. Statist. Soc., Vol. 66, Part 1, pp. 95-115.
[8] Chen, Y., Dougherty, E., and Bittner, M. (1997), Ratio-based decisions and the quantitative analysis of cDNA microarray images, J. Biomedical Optics, Vol. 2, pp. 364-367.

[9] Ciuperca, G., Ridolfi, A. and Idier, J. (2003), Penalized Maximum Likelihood Estimator for Normal Mixtures, Scandinavian Journal of Statistics, Vol. 30, pp. 45-59.
[10] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion), J. R. Statist. Soc. B, Vol. 39, pp. 1-38.
[11] DeRisi, J., Iyer, V. and Brown, P. O. (1997), Exploring the metabolic and genetic control of gene expression on a genomic scale, Science, Vol. 278, pp. 680-685.
[12] Devore, J., and Peck, R. (1977), Statistics: Exploration and Analysis of Data, 3rd edition. Pacific Grove, CA: Duxbury Press.
[13] Efron, B., Tibshirani, R. J. (1993), An Introduction to the Bootstrap, London: Chapman and Hall.
[14] Efron, B., Tibshirani, R. J., Tusher, V. (2001), Empirical Bayes Analysis of a Microarray Experiment, J. of the Amer. Stat. Ass., Vol. 96, pp. 1151-1160.
[15] Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998), Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci., Vol. 95, pp. 14863-14868.
[16] Everitt, B. S., and Hand, D. J. (1981), Finite Mixture Distributions, New York: Chapman and Hall.
[17] Fraley, C. and Raftery, A. E. (1998), How many clusters? Which clustering methods? Answer via model-based cluster analysis, The Computer Journal, Vol. 41, pp. 578-588.
[18] Gaasterland, T., and Bekiranov, S. (2000), Making the most of microarray data, Nat. Genet., Vol. 24, pp. 204-206.
[19] Hathaway, R. J. (1985), A constrained EM algorithm for univariate normal mixtures, J. Stat. Comp. Simulation, Vol. 23, pp. 795-800.
[20] Hughes, T. R., Mao, M., Jones, A. R., Burchard, J., Marton, M. J., Shannon, K. W., Lefkowitz, S. M., Ziman, M., Schelter, J. M., Meyer, M. R., Kobayashi, S., Davis,

[21] Kiefer, J., Wolfowitz, J. (1956), Consistency of the maximum-likelihood estimator in the presence of infinitely many incidental parameters, Ann. Math. Stat., Vol. 27, pp. 888-906.
[22] Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H. and Brown, E. L. (1996), Expression monitoring by hybridization to high-density oligonucleotide arrays, Nature Biotechnology, Vol. 14, pp. 1675-1996.
[23] MacLean, C. J., Morton, N. E., Elston, R. C., and Yee, S. (1976), Skewness in commingled distributions, Biometrics, Vol. 51, No. 3, pp. 1461-1468.
[24] McLachlan, G. J., and Peel, D. (2000), Finite Mixture Models, New York: Wiley.
[25] Najarian, K., Zaheri, M., Rad, A. A., Najarian, S. and Dargahi, J. (2007), A novel Mixture Model Method for identification of differentially expressed genes from DNA microarray data, Bioinformatics, Vol. 5, pp. 201-211.
[26] Pan, W. (2002), A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments, Bioinformatics, Vol. 18, pp. 546-554.
[27] Pan, W., Lin, J., Le, C. (2003), A Mixture Model Approach to Detecting Differentially Expressed Genes with Microarray Data, Functional & Integrative Genomics, Vol. 3, pp. 117-124.
[28] Pan, W., Lin, J. and Le, C. (2002), How many replication of Arrays are required to detect Gene Expression change in Microarray Experiment? A Mixture Model Approach, Division of Biostatistics, University of Minnesota. Available at http://www.biostat.umn.edu/cgi-bin/rrs?.

[29] Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (1992), Numerical Recipes in C, The Art of Scientific Computing, 2nd ed. New York: Cambridge University Press.
[30] Schena, M., Shalon, D., Davis, R. W. and Brown, P. O. (1995), Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science, Vol. 270, pp. 467-470.
[31] Schwarz, G. (1978), Estimating the dimension of a model, Annals of Statistics, Vol. 6, pp. 461-464.
[32] Storey, J. D. (2003), The positive false discovery rate: a Bayesian interpretation and the q-value, The Annals of Statistics, Vol. 31, No. 6, pp. 2013-2035.
[33] Tadjudin, S., and Landgrebe, D. A. (2000), Robust Parameter Estimation For Mixture Model, IEEE Transactions on Geoscience and Remote Sensing, Vol. 38, No. 1, pp. 439-445.
[34] Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., and Golub, T. R. (1999), Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation, Proc. Natl. Acad. Sci., Vol. 96, pp. 2907-2912.
[35] Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., and Church, G. M. (1999), Systematic determination of genetic network architecture, Proc. Natl. Acad. Sci., Vol. 22, pp. 281-285.
[36] Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985), Statistical Analysis of Finite Mixture Distributions, New York: Wiley.
[37] Thode, H. C., Finch, S. J. and Mendell, N. R. (1988), Simulated Percentage Points for the Null Distribution of the Likelihood Ratio Test for a Mixture of Two Normals, Biometrics, Vol. 44, No. 4, pp. 1195-1201.
[38] Thomas, J. G., Olson, J. M., Tapscott, S. J. and Zhao, L. P. (2001), An Efficient and Robust Statistical Modeling Approach to Discover Differentially Expressed Genes Using Genomic Expression Profiles, Genome Research, Vol. 11, pp. 1227-1236.

[39] Tusher, V., Tibshirani, R. J., Chu, G. (2001), Significance Analysis of Microarrays Applied to the Ionizing Radiation Response, Proc. Nat. Acad. Sci., Vol. 98, pp. 5116-5121.
[40] Verbeke, G. (2000), Inference for mixed populations, Biostatistical Center, Catholic University of Leuven.
[41] Wald, A. (1949), Note on the consistency of the maximum likelihood estimate, Ann. Math. Statist., Vol. 20, pp. 595-601.
[42] Zhang, S., Jiao, S. (2007), The t-mixture model approach for detecting differentially expressed genes in microarrays, Functional & Integrative Genomics.
[43] Zhao, Y., Pan, W. (2003), Modified nonparametric approaches to detecting differentially expressed genes in replicated microarray experiments, Bioinformatics, Vol. 19, pp. 1046-1054.

His graduate studies at the University of South Florida made him realize that teaching and research were his true calling. O'Neil's teaching skills were further enhanced while serving as a Teaching Assistant in the Department of Mathematics and Statistics (USF). He was also a Research Assistant in the Department of Epidemiology and Biostatistics (USF) and volunteered at the Moffitt Cancer Center for 2 years.

O'Neil is a reggae fanatic and his favorite dancehall reggae artist is Buju Banton. He looks forward to the Olympic Games to see how well the Jamaican sprinters will perform. He also enjoys playing and watching soccer and cricket.


xml version 1.0 encoding UTF-8 standalone no
record xmlns http:www.loc.govMARC21slim xmlns:xsi http:www.w3.org2001XMLSchema-instance xsi:schemaLocation http:www.loc.govstandardsmarcxmlschemaMARC21slim.xsd
leader nam 2200397Ka 4500
controlfield tag 001 002317272
005 20100831180756.0
007 cr bnu|||uuuuu
008 100831s2009 flua ob 000 0 eng d
datafield ind1 8 ind2 024
subfield code a E14-SFE0003022
035
(OCoLC)659882175
040
FHM
c FHM
049
FHMM
090
QA36 (Online)
1 100
Lynch, O'Neil.
0 245
Mixture distributions with application to microarray data analysis
h [electronic resource] /
by O'Neil Lynch.
260
[Tampa, Fla.] :
b University of South Florida,
2009.
500
Title from PDF of title page.
Document formatted into pages; contains 109 pages.
Includes vita.
502
Dissertation (Ph.D.)--University of South Florida, 2009.
504
Includes bibliographical references.
516
Text (Electronic dissertation) in PDF format.
520
ABSTRACT: The main goal in analyzing microarray data is to determine the genes that are differentially expressed across two types of tissue samples or samples obtained under two experimental conditions. In this dissertation we proposed two methods to determine differentially expressed genes. For the penalized normal mixture model (PMMM) to determine genes that are differentially expressed, we penalized both the variance and the mixing proportion parameters simultaneously. The variance parameter was penalized so that the log-likelihood will be bounded, while the mixing proportion parameter was penalized so that its estimates are not on the boundary of its parametric space. The null distribution of the likelihood ratio test statistic (LRTS) was simulated so that we could perform a hypothesis test for the number of components of the penalized normal mixture model. In addition to simulating the null distribution of the LRTS for the penalized normal mixture model, we showed that the maximum likelihood estimates were asymptotically normal, which is a first step that is necessary to prove the asymptotic null distribution of the LRTS. This result is a significant contribution to field of normal mixture model.The modified p-value approach for detecting differentially expressed genes was also discussed in this dissertation. The modified p-value approach was implemented so that a hypothesis test for the number of components can be conducted by using the modified likelihood ratio test. In the modified p-value approach we penalized the mixing proportion so that the estimates of the mixing proportion are not on the boundary of its parametric space. The null distribution of the (LRTS) was simulated so that the number of components of the uniform beta mixture model can be determined. Finally, for both modified methods, the penalized normal mixture model and the modified p-value approach were applied to simulated and real data.
538
Mode of access: World Wide Web.
System requirements: World Wide Web browser and PDF reader.
590
Co-Advisor: Kandethody Ramachandran, Ph.D.
Co-Advisor: Wonkuk Kim, Ph.D.
653
Likelihood ratio test
Modified likelihood
Penalized likelihood
Asymptotic chi-square distribution
Consistency
690
Dissertations, Academic
z USF
x Mathematics and Statistics
Doctoral.
773
t USF Electronic Theses and Dissertations.
4 856
u http://digital.lib.usf.edu/?e14.3022