Johnson's system of distributions and microarray data analysis

Citation
Johnson's system of distributions and microarray data analysis

Material Information

Title:
Johnson's system of distributions and microarray data analysis
Creator:
George, Florence
Place of Publication:
[Tampa, Fla.]
Publisher:
University of South Florida
Publication Date:
2007
Language:
English

Subjects

Subjects / Keywords:
Gene expression data
Differentially expressed genes
Transformed distributions
Baye's formula
Mixture model approach
Dissertations, Academic -- Mathematics -- Doctoral -- USF ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Abstract:
ABSTRACT: Microarray technology permits us to study the expression levels of thousands of genes simultaneously. The technique has a wide range of applications, including identification of genes that change their expression in cells due to disease or drug stimuli. The dissertation addresses statistical methods for the selection of differentially expressed genes in two experimental conditions. We propose two different methods for the selection of differentially expressed genes. The first method is a classical approach, where we consider a common distribution for the summary measure of equally expressed genes. To estimate this common distribution, the Johnson system of distributions is used. The advantage of using the Johnson system is that there is no need for a parametric assumption about the gene expression data. In contrast to other classical methods, the proposed method shares information across the genes by assuming a common distribution for the summary measure of equally expressed genes. The second method is gene selection using a mixture model approach and Bayes' theorem. This approach also uses the Johnson system of distributions to estimate the distribution of the summary measure. The Johnson system of distributions has the flexibility to cover a wide variety of distributional shapes. This system provides a unique distribution corresponding to each pair of mathematically possible values of skewness and kurtosis. The significant flexibility of the Johnson system is very useful in characterizing complicated data sets such as microarray data. In this dissertation we propose a novel algorithm for the estimation of the four parameters of the Johnson system.
Thesis:
Dissertation (Ph.D.)--University of South Florida, 2007.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 86 pages.
General Note:
Includes vita.
Statement of Responsibility:
by Florence George.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
001917312 ( ALEPH )
181596712 ( OCLC )
E14-SFE0002040 ( USFLDC DOI )
e14.2040 ( USFLDC Handle )

Postcard Information

Format:
Book

Downloads

This item has the following downloads:


Full Text
controlfield tag 001 001917312
003 fts
005 20071119165920.0
006 m||||e|||d||||||||
007 cr mnu|||uuuuu
008 071119s2007 flu sbm 000 0 eng d
datafield ind1 8 ind2 024
subfield code a E14-SFE0002040
035
(OCoLC)181596712
040
FHM
c FHM
049
FHMM
090
QA36 (ONLINE)
1 100
George, Florence.
0 245
Johnson's system of distributions and microarray data analysis
h [electronic resource] /
by Florence George.
260
[Tampa, Fla.] :
b University of South Florida,
2007.
502
Dissertation (Ph.D.)--University of South Florida, 2007.
504
Includes bibliographical references.
516
Text (Electronic dissertation) in PDF format.
538
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
500
Title from PDF of title page.
Document formatted into pages; contains 86 pages.
Includes vita.
590
Advisor: Kandethody M. Ramachandran, Ph.D.
653
Gene expression data.
Differentially expressed genes.
Transformed distributions.
Baye's formula.
Mixture model approach.
690
Dissertations, Academic
z USF
x Mathematics
Doctoral.
773
t USF Electronic Theses and Dissertations.
4 856
u http://digital.lib.usf.edu/?e14.2040



PAGE 1

by Florence George

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Mathematics
College of Arts and Sciences
University of South Florida

Major Professor: Kandethody M. Ramachandran, Ph.D.
Chris P. Tsokos, Ph.D.
Marcus McWaters, Ph.D.
George Yanev, Ph.D.

Date of Approval: June 15, 2007

Keywords: Gene expression data, Differentially expressed genes, Transformed distributions, Baye's formula, Mixture model approach

© Copyright 2007, Florence George

PAGE 4

My eyes turn misty with mixed feelings of gratitude and pain when I think of the late professor, Dr. A. N. V. Rao, who introduced me to the field of Microarrays. His principle, see good in everything and be a constructive part of the whole, has been soul-stirring guidance all through my research. I have a royal bouquet of thanks for the distinguished and well-known professor Dr. Chris P. Tsokos for his threadbare analysis of my work, furnishing me with detailed and constructive comments.

I gratefully cherish the energizing support, advice and encouragement of Dr. McWaters and Dr. George Yanev. My recompense is thanks to Dr. Li Lihua and Dr. Lancaster, H. Lee Moffitt Cancer Center and Research Institute, who gave me access to microarray data and let me have firsthand experience in their laboratory.

I cannot but thank my husband Johnet, our daughter Sheris and our entire extended family, who provided a refreshing but loving home environment for a dissertation like this. Finally, but most fittingly, I wish to thank my parents who formed part of my vision.

PAGE 5

List of Tables  v
Abstract  vii
1 Introduction  1
2 Microarrays  3
  2.1 Genetic Background  3
  2.2 Gene Expression  4
  2.3 Microarray Technology  5
  2.4 cDNA Microarrays  5
  2.5 Oligonucleotide Microarrays  8
3 Johnson System of Distributions and a MLE-Least Square Approach to Parameter Estimation  10
  3.1 Introduction  10
  3.2 The Johnson Translation System  11
  3.3 Johnson Subsystem Identification  12
  3.4 Probability Density Function of Johnson System  19
  3.5 Parameter Estimation of Johnson System  21
    Moment Matching  21
    Percentile Matching  23
    Quantile Estimators  25

PAGE 6

  3.7 Comparison of Estimation Methods  31
  3.8 Summary  40
4 Methods for Gene Selection and Application of Johnson's Distribution  41
  4.1 Introduction  41
  4.2 Data  41
    Ovarian Cancer Data  41
    Simulated Data  43
  4.3 Classical Methods for Gene Selection  44
    Fold change  44
    T-test  44
    Wilcoxon Test  45
    SAM  45
  4.4 Gene Selection using Johnson's Family of Distributions  52
  4.5 Summary  63
5 Mixture Model Approach for Gene Selection using Johnson's System of Distributions  64
  5.1 Introduction  64
  5.2 EBARRAYS  64
  5.3 EBAM  67
  5.4 A Mixture Model Approach using Johnson Distribution and Baye's Formula  68
  5.5 Results  71
  5.6 Summary  81
6 Conclusion  82
References  83

PAGE 8

2.2 Sample part of microarray chip: (http://www.gene-chips.com)  7
3.1 Chart for Johnson subsystem identification  13
3.2 Examples of Johnson SB family. For all cases =0 and =100  15
3.3 Examples of Johnson SU family. For all cases =0 and =100  16
3.4 Examples of Johnson SL family. For all cases =0 and =100  17
3.5 Estimated and True Johnson SB distribution (=1, =1, =10, =50)  34
3.6 Estimated and True Johnson SU distribution (=1, =0.5, =10, =10)  35
3.7 Estimated and True Johnson SL distribution (=0, =2, =10, =50)  39
4.1 SAM plot using ovarian cancer microarray data  49
4.2 SAM plot of FDR using ovarian cancer microarray data  51
4.3 Histogram of t-values (left) and m-values (right)  53
4.4 No. of genes selected vs No. of True Positives; Sample size - 15 vs 10  56
4.5 No. of genes selected vs No. of True Positives; Sample size - 20 vs 20  57
4.6 No. of genes selected vs No. of True Positives; Sample size - 33 vs 22  58
4.7 Receiver Operating Characteristic curve; Sample size - 15 vs 10  60
4.8 Receiver Operating Characteristic curve; Sample size - 20 vs 20  61
4.9 Receiver Operating Characteristic curve; Sample size - 33 vs 22  62
5.1 Posterior probability of genes; Genes representing green (or gray) points have posterior probability > 0.8  69
5.2 Mixture model approaches - No. of genes selected vs No. of True Positives; Number of genes - 2000; Sample size - 15 vs 10  72

PAGE 9

5.4 Mixture model approaches - No. of genes selected vs No. of True Positives; Number of genes - 2000; Sample size - 33 vs 22  74
5.5 Comparison of all methods discussed - No. of genes selected vs No. of True Positives; Number of genes - 2000; Sample size - 15 vs 10  75
5.6 Comparison of all methods discussed - No. of genes selected vs No. of True Positives; Number of genes - 2000; Sample size - 33 vs 22  76
5.7 ROC curve - Mixture model approaches; Number of genes - 2000; Sample size - 15 vs 10  78
5.8 ROC curve - Mixture model approaches; Number of genes - 2000; Sample size - 20 vs 20  79
5.9 ROC curve - Mixture model approaches; Number of genes - 2000; Sample size - 33 vs 22  80

PAGE 10

3.2 Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SB family based on 20 samples of size 2000  33
3.3 Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SU family based on 20 samples of size 2000  36
3.4 Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SL family based on 20 samples of size 2000  38
4.1 Ovarian cancer data  42
4.2 SAM - No. of significant genes and estimated false positives for different choices of delta  47
4.3 No. of genes selected by different methods  55
5.1 No. of genes selected by different methods  71

PAGE 11

Florence George

The Johnson system of distributions has the flexibility of covering a wide variety of distributional shapes. This system provides a unique distribution corresponding to each pair of mathematically possible values of skewness and kurtosis. The significant flexibility of the Johnson system is very useful in characterizing complicated data sets like microarray data. In this dissertation we propose a novel algorithm for the estimation of the four parameters of the Johnson system.

PAGE 12

While simultaneous measurement of thousands of gene expression levels provides a potential source of profound knowledge, the success of microarray technology depends heavily on statistical analysis. Careful statistical thinking and analysis are required to find the underlying structure in the data. The unprecedented amounts of data produced by microarrays raise new challenges for statisticians, who must perform inference on a scale never before attempted. Recently, statisticians and researchers in bioinformatics have focused much attention on the development of statistical methods to identify differentially expressed genes, with special emphasis on methods that identify genes that are differentially expressed between two conditions. This work focuses on the development of statistical methods that are suitable for differential

PAGE 13

The dissertation is organized as follows. Chapter 2 introduces microarrays. In Chapter 3, Johnson's system of distributions and the methods for estimating its parameters are discussed, and we develop a new algorithm for estimating the parameters of the Johnson system. Chapter 4 discusses the classical methods of gene selection and proposes a new approach for gene selection using Johnson's system of distributions. In Chapter 5, we propose a mixture model approach using Bayes' theorem and Johnson's system of distributions for the selection of differentially expressed genes.

PAGE 14

Microarrays produce huge amounts of data. Careful statistical thinking and analysis are required to find the underlying structure in the data. The unprecedented amounts of data produced by microarrays raise new challenges for statisticians to be able to perform inference on a scale never before conducted.

2.1 Genetic Background

PAGE 16

There are currently two platforms/types of DNA microarrays that are commercially available: glass DNA microarrays, which involve the micro-spotting of prefabricated cDNA fragments on a glass slide, and high-density oligonucleotide microarrays, often referred to as a "chip", which involve in situ oligonucleotide synthesis.

2.4 cDNA Microarrays

PAGE 17

Through a denaturing process the double-stranded DNA molecules in the sample are unzipped into two single-stranded molecules. The microarray chip itself also contains single strands of genes that will attract the single strands from the sample. The single strands from the sample will bind with the single strands on the microarray chip to re-form the DNA double helix. This is called hybridization. In a cDNA microarray experiment, the test sample is labeled with a dye and the reference sample is labeled with a dye of a different color. The reference sample serves as a control to which the gene expression in the test sample is compared. For example, if we wanted to determine which genes are expressed in a tumor sample, we could use a tissue sample from a healthy individual as the reference sample. We would then compare the expression level of each gene in the tumor sample to the expression level of the same gene in the reference sample. Suppose the tumor sample had been labeled with a red dye and the reference sample with a green dye. Then a red spot on the microarray would indicate that the gene corresponding to that spot is expressed at a higher level in the tumor sample than in the reference sample. Similarly, a green spot would indicate that the gene is expressed at a lower level in the tumor sample.

After hybridization, an image of the array with hybridized fluorescent dyes must be acquired. Microarray images can be produced using a number of devices, such as a laser scanner or a camera. The goal is to measure, for each spot on the array, the relative fluorescence intensities from each dye hybridized with its target. A high fluorescence level indicates that multiple copies of a gene have bound to the chip and that the gene is active in the cell. Similarly, a low fluorescence level indicates low activity of the gene in the cell. By quantifying the fluorescence level, gene activity can be compared across different samples.

PAGE 19

A sample part of a microarray chip is shown in Figure 2.2. Each spot represents a gene and there are thousands of genes on a chip. The intensity and color of each spot encode information on a specific gene from the tested sample. A more detailed description of genes, genetic analysis and microarray technology can be found in [1], [32], [21].

The power of microarray technology lies in its potential to survey the entire genome in one experiment. Microarrays are increasingly applied in biological and medical research to address a wide range of problems. Here we list some of these applications. The completion of the human genome means that we can search for the genes directly associated with different diseases. This new knowledge will enable better treatments, cures and even preventive tests to be developed. Clinical medicine will become more

PAGE 20

Although it is preferable for the statistician to have a hand in the experimental design, the statistician often comes into a microarray analysis project once the data have been collected. The statistician's job is to use the numerical gene expression levels to make claims about the populations of interest. An important and common question in microarray experiments is the identification of differentially expressed genes. In the chapters following the next chapter we will review available techniques and propose some new methods for the selection of differentially expressed genes.

PAGE 21

In 1949, Johnson derived a system of curves [16], [17] that has the flexibility of covering a wide variety of shapes. This system has the practical and theoretical advantages of being able to transform these curves to the normal distribution. The Johnson system is able to closely approximate many of the standard continuous distributions through one of the three functional forms and is thus highly flexible. The Johnson system provides one distribution corresponding to each pair of mathematically possible values of skewness and kurtosis.

The significant flexibility of the Johnson system of distributions is very useful in characterizing complicated data sets like microarray data. In all parametric approaches

PAGE 22

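\[ Z = \gamma + \delta\, f\!\left(\frac{X-\xi}{\lambda}\right) \tag{3.2.1} \]

where $f(\cdot)$ denotes the transformation function, $Z$ is a standard normal random variable, $\gamma$ and $\delta$ are shape parameters, $\lambda$ is a scale parameter and $\xi$ is a location parameter. Without loss of generality, it is assumed that $\delta > 0$ and $\lambda > 0$. The first transformation proposed by Johnson defines the lognormal system of distributions denoted by SL:

\[ Z = \gamma + \delta \ln\!\left(\frac{X-\xi}{\lambda}\right),\; X > \xi \;=\; \gamma^{*} + \delta \ln(X-\xi),\; X > \xi \tag{3.2.2} \]

SL curves cover the lognormal family.

The bounded system of distributions SB is defined by

\[ Z = \gamma + \delta \ln\!\left(\frac{X-\xi}{\xi+\lambda-X}\right),\; \xi < X < \xi + \lambda \tag{3.2.3} \]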
PAGE 23

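The unbounded system of distributions SU is defined by

\[ Z = \gamma + \delta \ln\!\left[\frac{X-\xi}{\lambda} + \left\{\left(\frac{X-\xi}{\lambda}\right)^{2} + 1\right\}^{1/2}\right],\; -\infty < X < \infty \tag{3.2.4} \]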
PAGE 24

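If we define the variable $Y = (X-\xi)/\lambda$ then 3.2.1 becomes,

\[ Z = \gamma + \delta f(Y) \]

where $f(\cdot)$ denotes the transformation. For a set of parameter values, the realization of the Johnson random variable $X$ can be obtained by applying the inverse translation,

\[ X = \xi + \lambda f^{-1}\!\left(\frac{Z-\gamma}{\delta}\right) \tag{3.3.9} \]

where $Z$ is a standard normal variable.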
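The inverse translation in Equation 3.3.9 can be sketched in a few lines of Python. This is a minimal illustration and not code from the dissertation: the function names `johnson_inverse` and `sample_johnson` are hypothetical, and the inverse transformations used ($e^{z}$ for SL, $\sinh z$ for SU, $(1+e^{-z})^{-1}$ for SB) are the standard inverses of the three Johnson transformations.

```python
import math
import random


def johnson_inverse(z, family):
    """Inverse f^{-1}(z) of the Johnson transformation for one family."""
    if family == "SL":
        return math.exp(z)
    if family == "SU":
        return math.sinh(z)
    if family == "SB":
        return 1.0 / (1.0 + math.exp(-z))
    raise ValueError(f"unknown family: {family}")


def sample_johnson(n, gamma, delta, xi, lam, family, seed=0):
    """Draw n Johnson variates as X = xi + lam * f^{-1}((Z - gamma) / delta)."""
    rng = random.Random(seed)
    return [
        xi + lam * johnson_inverse((rng.gauss(0.0, 1.0) - gamma) / delta, family)
        for _ in range(n)
    ]
```

For example, `sample_johnson(1000, 1.0, 1.0, 10.0, 50.0, "SB")` returns values that all lie strictly between `xi` and `xi + lam`, as expected for the bounded family.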

PAGE 25

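The transformation $f(y)$ is $\ln y$ for SL, $\ln[y + (y^2+1)^{1/2}]$ for SU and $\ln[y/(1-y)]$ for SB; the corresponding inverse $f^{-1}(z)$ is $e^{z}$ for SL, $(e^{z}-e^{-z})/2$ for SU and $(1+e^{-z})^{-1}$ for SB.

Several examples of the density curves of Johnson SB, SU and SL are displayed in Figures 3.2 to 3.4. These graphs give an idea of the flexibility of the Johnson system to cover a wide variety of shapes.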
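Density curves of this kind can be evaluated numerically. The sketch below assumes the standard Johnson density, $p(x) = \frac{\delta}{\lambda\sqrt{2\pi}}\, g'(y)\exp\{-\tfrac12[\gamma+\delta g(y)]^2\}$ with $y=(x-\xi)/\lambda$; the function name `johnson_pdf` is hypothetical and the parameter values in the usage note are chosen only for illustration.

```python
import math


def johnson_pdf(x, gamma, delta, xi, lam, family):
    """Density of a Johnson SL, SU or SB variable at x (zero outside the support)."""
    y = (x - xi) / lam
    if family == "SL":
        if y <= 0.0:
            return 0.0
        g, dg = math.log(y), 1.0 / y
    elif family == "SU":
        g, dg = math.asinh(y), 1.0 / math.sqrt(y * y + 1.0)
    elif family == "SB":
        if not 0.0 < y < 1.0:
            return 0.0
        g, dg = math.log(y / (1.0 - y)), 1.0 / (y * (1.0 - y))
    else:
        raise ValueError(f"unknown family: {family}")
    z = gamma + delta * g
    return delta / (lam * math.sqrt(2.0 * math.pi)) * dg * math.exp(-0.5 * z * z)
```

Evaluating `johnson_pdf` on a grid over the support (for instance the interval from 0 to 100 for an SB curve with location 0 and scale 100) gives the points needed to draw one curve; a simple midpoint sum over that grid should come out close to 1.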

PAGE 29

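Proof:

Let $X$ be a standardized discrete random variable with probability function $p(x)$.

Consider the following function in $a$, $b$ and $c$,

\[ G(a,b,c) = \mu_0 a^2 + 2\mu_1 ab + 2\mu_2 ac + \mu_2 b^2 + 2\mu_3 bc + \mu_4 c^2 = \sum_{x} (a + bx + cx^2)^2\, p(x) \ge 0 \]

Consequently its discriminant

\[ \begin{vmatrix} \mu_0 & \mu_1 & \mu_2 \\ \mu_1 & \mu_2 & \mu_3 \\ \mu_2 & \mu_3 & \mu_4 \end{vmatrix} \]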

PAGE 30

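Hence Equation 3.3.12 is equivalent to

\[ \beta_2 - \beta_1 - 1 \ge 0 \tag{3.3.13} \]

The proof is similar for a continuous random variable [36].

3.4 Probability Density Function of Johnson System

With $y = (x-\xi)/\lambda$, then for the SL family, the pdf is

\[ p(y) = \frac{\delta}{\sqrt{2\pi}}\,\frac{1}{y}\,\exp\left\{-\tfrac{1}{2}\left[\gamma + \delta \ln(y)\right]^2\right\}; \]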
PAGE 31

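In general the pdf of $X$ is given by,

\[ p(x) = \frac{\delta}{\lambda\sqrt{2\pi}}\; g'\!\left(\frac{x-\xi}{\lambda}\right) \exp\left\{-\tfrac{1}{2}\left[\gamma + \delta\, g\!\left(\frac{x-\xi}{\lambda}\right)\right]^2\right\} \tag{3.4.17} \]

for all $x \in H$, where

$g'(y) = 1/y$ for the SL family, $1/(y^2+1)^{1/2}$ for the SU family, $1/[y(1-y)]$ for the SB family

and

$g(y) = \ln(y)$ for the SL family, $\ln[y + (y^2+1)^{1/2}]$ for the SU family, $\ln[y/(1-y)]$ for the SB family.

The support $H$ of the distribution is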

PAGE 32

and sample skewness and kurtosis are:

\[ \hat\beta_1 = \frac{m_3^2}{m_2^3}, \qquad \hat\beta_2 = \frac{m_4}{m_2^2} \]

The moment matching technique for fitting a Johnson distribution to the sample uses the location of the point $(\hat\beta_1, \hat\beta_2)$ in Figure 3.1 to identify the appropriate functional form among systems 3.2.2-3.2.4. The principle of moment matching prescribes that the first $k$ sample moments should be equal to the corresponding population moments of the fitted theoretical distribution. The resulting system of $k$ nonlinear equations, which depends on the $k$ parameters, is then solved to obtain the parameter estimates for the fitted distribution. The estimation of parameters using the method of moments is discussed in detail by Draper [5].

For the SL distribution, the first three moments in terms of the parameters in 3.2.2 are,

PAGE 33

2212=e22213=e9 2231(3.5.23) ForSUdistribution, 2z22r(e(z)=e(z)=)rdz(3.5.24) fromwhichitfollowsthattherstfourmomentsofXare 2sinh2=1 22(!1)(!cosh2+1)3=1 43!1 2(!1)2[!(!+2)sinh3+3sinh]4=1 8(!1)2[!2(!4+2!3+3!23)cosh4+4!2(!+2)cosh2+3(2!+1)](3.5.25) where!=e2and==. Therthmomentofy=x ,whenXfollowsaJohnsonSBis0r(y)=1 2z2(1+e(z)=)rdz(3.5.26) Thisintegralisnoteasytoevaluatedirectly.Johnsonhasevaluated01(y)directlyandthehigherordermomentsareobtainedusingthefollowingsteps. Whenr=1,01(y)=1 2z2(1+e(z)=)1dz(3.5.27) whichcanbeevaluateddirectlyusingaresultduetoMordell[22].Thehighermo-mentscanbeobtainedusingtherelation, r@0r

PAGE 34

and

where $\Phi(\cdot)$ is the standard normal distribution function and $F$ is the target distribution function. Once the functional form $f(\cdot)$ among systems 3.2.2-3.2.4 has been identified, the method of percentile matching attempts to solve the $k$ equations

\[ z_j = \gamma + \delta f\!\left(\frac{\hat{x}_j - \xi}{\lambda}\right), \quad 1 \le j \le k \tag{3.5.30} \]

where $\hat{x}_j$ is an estimator of the quantile $x_j$ based on sample data.

Slifker and Shapiro [29] introduced a selection rule, which is a function of four percentiles, for selecting one of the three families and giving estimates of the parameters. The fit parameters for the transformation are calculated by solving the transformation equation for the chosen distribution type at the four selected percentiles, as explained below. Choose any fixed value z (0
PAGE 35

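\[ \hat\delta = \frac{2z}{\cosh^{-1}\left[\frac{1}{2}\left(\frac{m}{p}+\frac{n}{p}\right)\right]} \tag{3.5.32} \]

\[ \hat\gamma = \hat\delta \sinh^{-1}\left[\frac{\frac{n}{p}-\frac{m}{p}}{2\left(\frac{m}{p}\frac{n}{p}-1\right)^{1/2}}\right] \tag{3.5.33} \]

\[ \hat\lambda = \frac{2p\left(\frac{m}{p}\frac{n}{p}-1\right)^{1/2}}{\left(\frac{m}{p}+\frac{n}{p}-2\right)\left(\frac{m}{p}+\frac{n}{p}+2\right)^{1/2}} \tag{3.5.34} \]

\[ \hat\xi = \frac{x_z + x_{-z}}{2} + \frac{p\left(\frac{n}{p}-\frac{m}{p}\right)}{2\left(\frac{m}{p}+\frac{n}{p}-2\right)} \tag{3.5.35} \]

The parameter estimates for the SB distribution are

\[ \hat\delta = \frac{z}{\cosh^{-1}\left(\frac{1}{2}\left[\left(1+\frac{p}{m}\right)\left(1+\frac{p}{n}\right)\right]^{1/2}\right)}, \quad (\delta > 0) \tag{3.5.36} \]

\[ \hat\gamma = \hat\delta \sinh^{-1}\left[\frac{\left(\frac{p}{n}-\frac{p}{m}\right)\left[\left(1+\frac{p}{m}\right)\left(1+\frac{p}{n}\right)-4\right]^{1/2}}{2\left(\frac{p}{m}\frac{p}{n}-1\right)}\right] \tag{3.5.37} \]

\[ \hat\lambda = \frac{p\left[\left\{\left(1+\frac{p}{m}\right)\left(1+\frac{p}{n}\right)-2\right\}^{2}-4\right]^{1/2}}{\frac{p}{m}\frac{p}{n}-1} \tag{3.5.38} \]

\[ \hat\xi = \frac{x_z + x_{-z}}{2} - \frac{\hat\lambda}{2} + \frac{p\left(\frac{p}{n}-\frac{p}{m}\right)}{2\left(\frac{p}{m}\frac{p}{n}-1\right)} \tag{3.5.39} \]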
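As a rough illustration of the percentile method for the SU case, the sketch below applies Equations 3.5.32-3.5.35 to four quantiles. It assumes the usual Slifker-Shapiro definitions $m = x_{3z} - x_{z}$, $n = x_{-z} - x_{-3z}$ and $p = x_{z} - x_{-z}$, with the SU family indicated when $mn/p^2 > 1$; these definitions fall in a portion of the text that did not survive extraction, so treat them as an assumption. The function name `fit_su_percentile` is hypothetical.

```python
import math


def fit_su_percentile(x_m3z, x_mz, x_z, x_3z, z):
    """Slifker-Shapiro style SU estimates from the quantiles at -3z, -z, z, 3z."""
    m = x_3z - x_z    # assumed definition of m
    n = x_mz - x_m3z  # assumed definition of n
    p = x_z - x_mz    # assumed definition of p
    if m * n / p ** 2 <= 1.0:
        raise ValueError("discriminant mn/p^2 must exceed 1 for the SU family")
    mp, n_p = m / p, n / p
    root = math.sqrt(mp * n_p - 1.0)
    delta = 2.0 * z / math.acosh(0.5 * (mp + n_p))
    gamma = delta * math.asinh((n_p - mp) / (2.0 * root))
    lam = 2.0 * p * root / ((mp + n_p - 2.0) * math.sqrt(mp + n_p + 2.0))
    xi = 0.5 * (x_z + x_mz) + p * (n_p - mp) / (2.0 * (mp + n_p - 2.0))
    return gamma, delta, xi, lam
```

When the four inputs are exact quantiles of an SU distribution, these formulas return the generating parameters up to rounding error; with sample quantiles they give starting values of the kind used by the iterative methods discussed later.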

PAGE 36

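\[ \hat\delta = \frac{2z}{\ln(m/p)} \tag{3.5.40} \]

\[ \hat\gamma^{*} = \hat\delta \ln\left[\frac{m/p - 1}{p\,(m/p)^{1/2}}\right] \tag{3.5.41} \]

\[ \hat\xi = \frac{x_z + x_{-z}}{2} - \frac{p}{2}\,\frac{m/p+1}{m/p-1} \tag{3.5.42} \]

Quantile Estimators

Let $p_n = (n - \tfrac{1}{2})/n$, where $n$ is the sample size. Denote the quantile of the standard normal distribution corresponding to the cumulative probability $p_n$ by $z_n$. For example, if $n = 100$, then $p_n = 0.995$, so that $z_n = 2.5758$. Choose five quantiles $x_p$, $x_k$, $x_0$, $x_m$, $x_n$ from the data corresponding to the standard normal quantiles $z = -z_n, -\tfrac{1}{2}z_n, 0, \tfrac{1}{2}z_n, z_n$. The general form of the Johnson system can be written

\[ z = \gamma + \delta \ln f(y) \tag{3.5.43} \]

where $f(y) = y$ for SL, $f(y) = y + (1+y^2)^{1/2}$ for SU, $f(y) = y/(1-y)$ for SB and $y = (x-\xi)/\lambda$. Wheeler uses the fact that any ratio of differences of these quantiles, expressed in terms of $\omega = e^{(z-\gamma)/\delta}$, does not depend on $\xi$ or $\lambda$. The parameter estimates for SU curves are:

\[ \hat\delta = \frac{\tfrac{1}{2} z_n}{\ln b} \tag{3.5.45} \]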

PAGE 37

2tu+[(1 2tu)21]1=2andtu=xnxp wherea2=1tb2 ForSBcurvestheparameterestimatesare: ^=1 2zn=lnb(3.5.47) whereb=1 2tb+[(1 2tb)21]1=2andtb=(xmx0)(xnxp) (xnxm)(x0xp) wherea=tb2 ForSLcurves, ^=zn wheret=xnx0 (xnxm)(x0xp)(3.5.50) isused.Itislessthan1forsU,equalto1forSLandgreaterthan1forSB.3.6MLE-LeastSquareApproachfortheParameterEstimationofJohnsonSystem

PAGE 38

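The probability density functions of the members of the Johnson family are known. Let us first consider the SU and SB families of the Johnson system. Writing $g_i = g\!\left(\frac{x_i-\xi}{\lambda}\right)$ and using the general form of Johnson densities as given in Equation 3.4.17, the likelihood function is,

\[ L(x) = \frac{\delta^n}{\lambda^n (2\pi)^{n/2}} \prod_{i=1}^{n} g'\!\left(\frac{x_i-\xi}{\lambda}\right) \exp\left\{-\tfrac{1}{2}\sum_{i=1}^{n} (\gamma + \delta g_i)^2\right\} \tag{3.6.51} \]

The log-likelihood is,

\[ \log L = n\log\delta - n\log\lambda - \tfrac{n}{2}\log(2\pi) + \sum_{i=1}^{n} \log g'\!\left(\frac{x_i-\xi}{\lambda}\right) - \tfrac{1}{2}\sum_{i=1}^{n} (\gamma + \delta g_i)^2 \tag{3.6.52} \]

Setting the partial derivative with respect to $\delta$ to zero,

\[ \frac{n}{\delta} - \delta \sum g_i^2 - \gamma \sum g_i = 0 \tag{3.6.53} \]

which can be written as,

\[ \delta^2 \sum g_i^2 + \gamma\delta \sum g_i - n = 0 \tag{3.6.54} \]

Setting the partial derivative with respect to $\gamma$ to zero,

\[ n\gamma + \delta \sum g_i = 0 \tag{3.6.55} \]

which yields,

\[ \hat\gamma = -\frac{\delta \sum g_i}{n} \]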

PAGE 39

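\[ \hat\delta^{2} = \frac{n}{\sum \left[g\!\left(\frac{x_i-\xi}{\lambda}\right)\right]^2 - \frac{1}{n}\left[\sum g\!\left(\frac{x_i-\xi}{\lambda}\right)\right]^2} \tag{3.6.58} \]

\[ = \frac{1}{\mathrm{var}(g)} \tag{3.6.59} \]

where $\bar{g}$ is the mean and $\mathrm{var}(g)$ is the variance of the values of $g$ defined in Equation 3.4.20.

The partial derivatives of the log-likelihood with respect to $\xi$ and $\lambda$ are not simple. Storer [31] details a lengthy strategy for obtaining the solutions for these parameters. In the maximum likelihood estimation method, Kamziah A. Kudus et al. [18] applied Newton-Raphson iteration to maximize the log-likelihood of the Johnson distribution. They observed that for some samples the log-likelihood function does not have a local maximum with respect to the parameters $\xi$ and $\lambda$. This non-regularity of the likelihood function caused occasional non-convergence of the Newton-Raphson iteration that was used to maximize the log-likelihood [13].

We apply the method of least squares to estimate the parameters $\xi$ and $\lambda$. Consider Equation 3.3.9, which is $x = \xi + \lambda f^{-1}\!\left(\frac{z-\gamma}{\delta}\right)$. For fixed values of $\gamma$ and $\delta$, this equation may be considered as a linear equation with parameters $\xi$ and $\lambda$.

The sum of squares of errors is,

\[ S(\xi,\lambda) = \sum \left[x - \xi - \lambda f^{-1}\!\left(\frac{z-\gamma}{\delta}\right)\right]^2 \tag{3.6.60} \]

To find the values of $\xi$ and $\lambda$ that minimize $S(\xi,\lambda)$, take the partial derivatives of $S(\xi,\lambda)$ with respect to $\xi$ and $\lambda$ and equate them to zero. We then obtain the following two equations, called normal equations,

\[ \sum x = n\xi + \lambda \sum f^{-1}\!\left(\frac{z-\gamma}{\delta}\right) \tag{3.6.61} \]

\[ \sum x f^{-1}\!\left(\frac{z-\gamma}{\delta}\right) = \xi \sum f^{-1}\!\left(\frac{z-\gamma}{\delta}\right) + \lambda \sum \left[f^{-1}\!\left(\frac{z-\gamma}{\delta}\right)\right]^2 \]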

PAGE 40

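Solving the normal equations we get,

\[ \hat\lambda = \frac{n \sum x f^{-1}\!\left(\frac{z-\gamma}{\delta}\right) - \sum f^{-1}\!\left(\frac{z-\gamma}{\delta}\right) \sum x}{n \sum \left[f^{-1}\!\left(\frac{z-\gamma}{\delta}\right)\right]^2 - \left[\sum f^{-1}\!\left(\frac{z-\gamma}{\delta}\right)\right]^2} \tag{3.6.62} \]

and

\[ \hat\xi = \bar{x} - \hat\lambda\, \mathrm{mean}\!\left[f^{-1}\!\left(\frac{z-\gamma}{\delta}\right)\right] \tag{3.6.63} \]

where $\bar{x}$ is the mean of the x-quantiles and $\bar{z}$ is the mean of the z-quantiles used in the above equations. We start with some initial values of $\xi$ and $\lambda$. These initial values may be taken as the estimates obtained by any one of the previous methods. Then estimates of $\gamma$ and $\delta$ are calculated using Equations 3.6.57 and 3.6.59. Once the estimates of $\gamma$ and $\delta$ are obtained, Equations 3.6.63 and 3.6.62 can be used to revise the estimates of $\xi$ and $\lambda$.

Now we may repeat these steps with Equations 3.6.62, 3.6.63, 3.6.57 and 3.6.59, each time using the most recent estimates. Keep track of the Residual Sum of Squares (RSS) and after a few steps choose the estimate with the minimum RSS value.

The algorithm can be summarized in the following steps.

1. Assign initial values for $\xi$ and $\lambda$. The initial values may be derived by the percentile or quantile method.

2. Estimate $\gamma$ and $\delta$ using Equations 3.6.57 and 3.6.59.

3. Estimate $\xi$ and $\lambda$ using Equations 3.6.62 and 3.6.63.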

PAGE 41

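5. Repeat steps 2 to 4, each time using the most recent estimates of the parameters. Choose the estimates with minimum RSS.

For the SL family, we will consider the transformation in Equation 3.2.2, so that there are only 3 parameters included. The probability density function can be given by,

\[ p(x) = \frac{\delta}{\sqrt{2\pi}\,(x-\xi)} \exp\left\{-\tfrac{1}{2}\left[\gamma^{*} + \delta\ln(x-\xi)\right]^2\right\} \tag{3.6.64} \]

The likelihood function is,

\[ L = \frac{\delta^n}{(2\pi)^{n/2} \prod (x_i-\xi)} \exp\left\{-\tfrac{1}{2}\sum \left[\gamma^{*} + \delta\ln(x_i-\xi)\right]^2\right\} \tag{3.6.65} \]

Setting the partial derivative of the log-likelihood with respect to $\delta$ to zero we get,

\[ \frac{n}{\delta} - \delta \sum [\ln(x_i-\xi)]^2 - \gamma^{*} \sum [\ln(x_i-\xi)] = 0 \tag{3.6.66} \]

which can be written as,

\[ \delta^2 \sum [\ln(x_i-\xi)]^2 + \gamma^{*}\delta \sum \ln(x_i-\xi) - n = 0 \tag{3.6.67} \]

Setting the partial derivative of the log-likelihood with respect to $\gamma^{*}$ to zero,

\[ n\gamma^{*} + \delta \sum \ln(x_i-\xi) = 0 \]

which gives,

\[ \hat\gamma^{*} = -\frac{1}{n}\,\delta \sum \ln(x_i-\xi) \tag{3.6.70} \]

Using Equation 3.6.70 in Equation 3.6.67 and solving for $\delta$,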

PAGE 42

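where $g = \ln(x-\xi)$. The estimates in Equations 3.6.70 and 3.6.72 depend on $\xi$.

To estimate $\xi$, as before, we will use the method of least squares in the equation

\[ x = \xi + f^{-1}\!\left(\frac{z-\gamma^{*}}{\delta}\right) \]

The sum of squares of errors is,

\[ S(\xi) = \sum \left[x - \xi - f^{-1}\!\left(\frac{z-\gamma^{*}}{\delta}\right)\right]^2 \]

To find the value of $\xi$ that minimizes $S(\xi)$ we obtain,

\[ \frac{dS}{d\xi} = -2 \sum \left(x - \xi - f^{-1}\!\left(\frac{z-\gamma^{*}}{\delta}\right)\right) \]

Setting this derivative equal to zero, we have,

\[ \hat\xi = \bar{x} - \mathrm{mean}\!\left[f^{-1}\!\left(\frac{z-\gamma^{*}}{\delta}\right)\right] \]

Here the same situation arises as in the case of the SU and SB distributions: the estimate of $\xi$ depends on $\gamma^{*}$ and $\delta$ and vice versa. So we start with some initial value of $\xi$ to estimate $\gamma^{*}$ and $\delta$, then use these estimated values to estimate $\xi$. Repeat this procedure, keeping track of the RSS, and choose the estimate with the least RSS.

3.7 Comparison of Estimation Methods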

PAGE 43

The mean and the Mean Square Error (MSE) of the estimated values for the SB family are listed in Table 3.2 for comparison purposes. It can be observed that the averages of the estimates are close to the true values of the parameters and that, in general, the MSEs of the estimates are smaller for the proposed method than for the other methods. This may be because the MLE-least squares approach uses all the available data, while the quantile method uses only five quantiles and the percentile method uses only four quantiles. The major disadvantage of moment matching is the vulnerability of the sample third and fourth moments to outliers.
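The MLE-least squares iteration being compared here can be sketched for the SU family as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names are hypothetical, the shape step uses $\hat\delta = 1/\sqrt{\mathrm{var}(g)}$ and $\hat\gamma = -\bar{g}/\sqrt{\mathrm{var}(g)}$ (variance with divisor $n$), and the normal quantiles paired with the ordered data use the plotting positions $(i + 0.5)/n$, a choice the text does not specify.

```python
import math
from statistics import NormalDist, fmean


def mle_shape(x, xi, lam):
    """Shape step: gamma and delta for fixed xi, lam, with g = asinh((x - xi) / lam)."""
    g = [math.asinh((v - xi) / lam) for v in x]
    g_bar = fmean(g)
    sd = math.sqrt(fmean([(v - g_bar) ** 2 for v in g]))
    return -g_bar / sd, 1.0 / sd


def ls_location_scale(x_sorted, z, gamma, delta):
    """Least squares step: regress ordered x on u = sinh((z - gamma) / delta)."""
    u = [math.sinh((zi - gamma) / delta) for zi in z]
    n = len(u)
    su, sx = sum(u), sum(x_sorted)
    suu = sum(ui * ui for ui in u)
    sxu = sum(xv * ui for xv, ui in zip(x_sorted, u))
    lam = (n * sxu - su * sx) / (n * suu - su * su)
    xi = sx / n - lam * su / n
    rss = sum((xv - xi - lam * ui) ** 2 for xv, ui in zip(x_sorted, u))
    return xi, lam, rss


def fit_su_mle_ls(x, xi0, lam0, n_iter=10):
    """Alternate the two steps and keep the iterate with the smallest RSS."""
    xs = sorted(x)
    n = len(xs)
    nd = NormalDist()
    z = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]  # assumed plotting positions
    xi, lam = xi0, lam0
    best = None
    for _ in range(n_iter):
        gamma, delta = mle_shape(xs, xi, lam)
        xi, lam, rss = ls_location_scale(xs, z, gamma, delta)
        if best is None or rss < best[0]:
            best = (rss, gamma, delta, xi, lam)
    return best[1:]
```

The starting values `xi0` and `lam0` would typically come from the percentile or quantile method. Because every order statistic enters both steps, the sketch reflects the point made above that this approach uses all of the data rather than four or five quantiles.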

PAGE 44

     Parameter   True Value   Percentile method   Quantile method   New Approach
1                             0.998 (0.167)       1.063 (0.409)     0.997 (0.026)
                              1.001 (0.059)       1.024 (0.083)     0.997 (0.026)
                              10.047 (0.085)      9.982 (0.131)     9.93 (0.08)
                              10.049 (5.92)       10.402 (14.37)    10.57 (4.99)
2                             0.503 (0.009)       0.503 (0.0493)    0.494 (0.007)
                              0.505 (0.003)       0.519 (0.023)     0.507 (0.001)
                              9.11 (4.038)        9.97 (0.077)      10.004 (0.004)
                              10.005 (0.285)      10.094 (1.614)    9.868 (2.056)
3                             1.032 (0.065)       1.01 (0.015)      1.016 (0.017)
                              0.507 (0.0039)      0.5006 (0.0013)   0.509 (0.002)
                              9.698 (.488)        10.001 (0.001)    10.001 (0.001)
                              10.355 (4.63)       10.085 (0.69)     9.86 (0.70)
4                             0.558 (0.287)       0.539 (0.136)     0.561 (0.165)
                              1.013 (0.191)       1.024 (0.108)     1.055 (0.115)
                              9.82 (1.097)        9.94 (0.55)       9.91 (0.52)
                              10.31 (15.4)        10.30 (8.2)       9.83 (0.50)

Table 3.2: Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SB family based on 20 samples of size 2000

PAGE 47

     Parameter   True Value   Percentile method   Quantile method   New Approach
1                             0.04 (0.32)         0.015 (0.05)      0.015 (0.05)
                              1.41 (3.3)          2.08 (0.34)       2.05 (0.29)
                              10.24 (8.9)         10.1 (1.5)        10.1 (1.4)
                              12.3 (99.9)         10.5 (12.6)       10.3 (10.1)
2                             0.82 (2.9)          0.52 (0.11)       0.51 (0.09)
                              2.47 (3.23)         2.08 (0.45)       2.06 (0.37)
                              11.51 (64.6)        10.06 (2.79)      10.04 (2.59)
                              12.07 (56.5)        10.35 (12.6)      10.25 (11.22)
3                             -0.003 (0.003)      0.005 (0.002)     0.003 (0.002)
                              1.033 (0.006)       0.99 (0.003)      0.99 (0.002)
                              10.03 (.43)         10.05 (0.25)      10.06 (0.25)
                              10.45 (1.43)        9.82 (0.7)        9.75 (0.73)
4                             0.514 (0.009)       0.488 (0.006)     0.487 (0.007)
                              1.008 (0.006)       0.999 (0.006)     0.996 (0.006)
                              10.243 (1.203)      9.95 (0.9)        9.94 (1.05)
                              10.06 (0.96)        10.06 (1.13)      10.02 (1.43)

Table 3.3: Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SU family based on 20 samples of size 2000

PAGE 48

We have tested this with many other combinations of parameter values, and the results are similar. In Figures 3.5 through 3.7, even though for a particular family another method may seem to approximate as well as the proposed method, the curve resulting from the MLE-least squares method follows the true curve closely in all three cases (SB, SU and SL) of the Johnson distribution.

PAGE 49

     Parameter   True Value   Percentile method   Quantile method   New Approach
1                             -1.353 (0.051)      -1.29 (0.027)     1.303 (0.04)
     ( ; )       (1,10)       1.012 (0.006)       0.97 (0.008)      1.012 (0.008)
                              -0.98 (0.14)        0.53 (0.057)      0.53 (0.057)
2                             -2.24 (0.04)        -2.26 (0.01)      -2.21 (0.07)
     ( ; )       (0,10)       0.98 (0.003)        0.98 (0.002)      0.98 (0.007)
                              0.18 (0.41)         0.22 (0.36)       0.33 (0.28)
3                             -6.53 (22.9)        -5.26 (18.13)     -5.47 (12.36)
     ( ; )       (1,10)       3.18 (2.28)         2.66 (3.66)       2.87 (1.42)
                              -0.503 (15.28)      0.72 (18.3)       0.504 (7.17)
4                             -3.78 (3.26)        -3.45 (0.99)      -3.45 (1.63)
     ( ; )       (1,10)       2.06 (0.35)         1.88 (0.35)       1.97 (0.21)
                              -0.13 (4.12)        0.43 (4.41)       0.29 (1.67)

Table 3.4: Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SL family based on 20 samples of size 2000

PAGE 53

            1       2       3       4       5       6      ...    55
Response    YES     YES     NO      YES     NO      YES    ...    NO
Gene ID
1           7.56    7.39    7.12    7.49    8.09    8.07   ...    8.50
2           8.39    8.14    8.05    8.36    9.01    9.22   ...    9.53
3           7.44    7.16    7.04    7.28    7.90    8.11   ...    8.56
4           8.93    8.62    8.39    8.77    9.26    9.46   ...    9.71
5           8.29    7.89    7.73    7.96    8.58    8.79   ...    9.00
6           8.74    8.28    8.17    8.53    8.98    9.38   ...    9.81
7           11.24   11.24   10.93   11.06   11.73   12.17  ...    11.89
8           11.63   11.50   11.20   11.50   12.12   12.49  ...    12.32
9           11.85   11.99   11.59   11.78   12.61   12.85  ...    12.82
...         ...     ...     ...     ...     ...     ...    ...    ...
22283       3.77    3.79    3.68    3.72    3.92    3.88   ...    3.85

Table 4.1: Ovarian cancer data

cancers. Currently, the standard treatment protocol used in the initial management of advanced-stage ovarian cancer is primary cytoreductive surgery followed by primary platinum-based chemotherapy. However, approximately 30% of patients with advanced-stage disease do not demonstrate a complete response to primary platinum-based therapy. Identifying genes that are expressed significantly differently in the two groups could provide some insight for the precise diagnosis of response to the treatment and help medical specialists choose an alternative therapy when needed. The ovarian cancer tissue samples involved in this study were collected from the tumor banks at the H. Lee Moffitt Cancer Center & Research Institute and Duke University Medical Center. Affymetrix U133A GeneChip arrays were used to measure expression of 22,283 genes in advanced-stage serous ovarian cancers from 55 patients who underwent primary surgery followed by platinum-based chemotherapy. Expression values are calculated using the robust multi-array average (RMA) algorithm [15] implemented in the Bioconductor (http://www.bioconductor.org) extensions to the R statistical programming environment [14]. Gene expressions were compared between patients who demonstrated a complete response to platinum-based therapy and those who did not, to identify differentially expressed genes.

PAGE 55

where $d_1 = 1$ for $0 \le j \le r_1$; $d_1 = 0$ for $r_1 + 1 \le j \le r_1 + r_2$ and $d_2 = 0$ for $0 \le j \le r_1$; $d_2 = 1$ for $r_1 + 1 \le j \le r_1 + r_2$

PAGE 56

follows a Student's t-distribution with $r_1 + r_2 - 2$ degrees of freedom. Here, $\bar{x}_{1i}$ and $\bar{x}_{2i}$ are the mean expression levels of gene $i$ in the $r_1$ replicated samples of condition 1 and the $r_2$ replicated samples of condition 2 respectively; $S_1^2$ and $S_2^2$ are the sample variances of gene $i$ under these two conditions.

If $t_i$ exceeds the threshold value for a specific confidence level (e.g. 99%), the expression levels of gene $i$ under conditions 1 and 2 are considered to be different.

Because in the t-test the distance between the population means is normalized by the empirical standard deviations, it has the potential to address some of the shortcomings of the fold ratio approach.

Wilcoxon Test

PAGE 57

where $\bar{x}_{1i}$, $\bar{x}_{2i}$ are the average expression levels of gene $i$ in the two conditions respectively and $s_0$ is a small positive constant. The value of $s_0$ is chosen as the one which minimizes the coefficient of variation of the statistic $d_i$. The resulting statistic is called the "observed d value". To determine the significance of this value, SAM estimates the "expected" d value that would arise if there were no difference between the specimen classes. This is done by permuting, or randomly changing, the class labels without changing the data and recalculating the SAM value for each probe set [24]. After a large number of permutations, the result estimates the value that would be obtained if the difference in gene expression were due to chance alone. This is the "expected d value". The significance of the observed differential gene expression can be estimated by comparing the observed and expected d values. A user-defined threshold or "delta" (i.e., observed d value - expected d value) can be adjusted to select genes for which the observed d value exceeds the expected d value by more than delta (upregulated genes) or falls below it by more than delta (downregulated genes). The greater the "delta", the greater the stringency of the result and the lower the false discovery rate. For each delta value, the SAM output consists of a gene or probe set list and an associated false discovery rate. The false discovery rate is estimated from the distribution of expected and observed d values. Table 4.2 lists some choices of delta and the corresponding numbers of significant genes and estimated false positives for the ovarian cancer data in Section 4.2.
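The observed and expected d values described above can be sketched as follows. The code is illustrative only: the function names are hypothetical, the gene-specific scatter $s_i$ is taken to be the pooled standard error of the difference in means (an assumption, since the defining equation did not survive extraction), the sign convention is condition 0 minus condition 1, and the expected values are computed as averages of the ordered d statistics over random label permutations.

```python
import math
import random


def d_statistics(data, labels, s0):
    """d_i = (mean of group 0 - mean of group 1) / (s_i + s0) for each gene (row)."""
    out = []
    for row in data:
        a = [v for v, lab in zip(row, labels) if lab == 0]
        b = [v for v, lab in zip(row, labels) if lab == 1]
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
        # Assumed form of s_i: pooled standard error of the mean difference.
        s = math.sqrt((1.0 / len(a) + 1.0 / len(b)) * ss / (len(a) + len(b) - 2))
        out.append((mean_a - mean_b) / (s + s0))
    return out


def expected_d(data, labels, s0, n_perm=200, seed=0):
    """Average the ordered d statistics over random permutations of the labels."""
    rng = random.Random(seed)
    totals = [0.0] * len(data)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        for k, d in enumerate(sorted(d_statistics(data, perm, s0))):
            totals[k] += d
    return [t / n_perm for t in totals]
```

Genes would then be called significant when the sorted observed d values differ from the corresponding expected values by more than the chosen delta. With `s0 = 0` the statistic reduces to the pooled two-sample t statistic.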

PAGE 58

delta    No. of significant genes    Estimated false positives
0.2      13303                       9241.5
0.4      11179                       7187
0.6      7804                        3335
0.8      4088                        835
0.81     3641                        793
0.82     3402                        692.5
0.83     3089                        585.5
0.84     2804                        500
0.85     2417                        340
0.851    2178                        292
0.852    2164                        288
0.853    2159                        287
0.854    2144                        283
0.855    2142                        283
0.856    0                           0

Table 4.2: SAM - No. of significant genes and estimated false positives for different choices of delta.
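As a small worked example, the entries of Table 4.2 can be turned into a rough false discovery rate per delta by dividing the estimated false positives by the number of significant genes. This plain ratio is an assumption made for illustration; SAM's own reported rate may include further adjustments. The names `table_4_2` and `estimated_fdr` are hypothetical.

```python
# (delta, number of significant genes, estimated false positives) from Table 4.2
table_4_2 = [
    (0.2, 13303, 9241.5), (0.4, 11179, 7187), (0.6, 7804, 3335),
    (0.8, 4088, 835), (0.81, 3641, 793), (0.82, 3402, 692.5),
    (0.83, 3089, 585.5), (0.84, 2804, 500), (0.85, 2417, 340),
    (0.851, 2178, 292), (0.852, 2164, 288), (0.853, 2159, 287),
    (0.854, 2144, 283), (0.855, 2142, 283), (0.856, 0, 0),
]

def estimated_fdr(n_significant, n_false_positive):
    """Rough FDR: estimated false positives / genes called significant."""
    if n_significant == 0:
        return None  # undefined when no gene is selected
    return n_false_positive / n_significant

fdr_by_delta = {delta: estimated_fdr(sig, fp) for delta, sig, fp in table_4_2}

for delta, fdr in fdr_by_delta.items():
    print(delta, "n/a" if fdr is None else round(fdr, 3))
```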

PAGE 63

where $a_0$ is a shrinkage parameter which depends on the $s_j$ values. The value $m_j$ is the m-value for gene $j$. The constant $a_0$ in the denominator of Equation 4.4.5 can lead to the reduction of the overall variance of the $m_j$, giving the tests more power on average. This has the added effect of dampening large values of t-statistics that arise from small variance of genes. We have taken $a_0$ as the median of the $s_j$ values. Figure 4.3 shows the histograms of the t-values and the m-values. This figure shows the shrinkage of the t-values when the m-formula is used.

The idea of modifying estimators of variance has been presented by others in similar contexts. The SAM t-test [33] adds a small constant to the gene-specific variance estimate in order to stabilize the small variances. In SAM, the fudge factor $s_0$ is chosen as the value which minimizes the coefficient of variation of the SAM statistic $d_i$. The regularized t-test
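To illustrate the dampening effect of $a_0$, here is a minimal sketch that computes ordinary t-values next to shrunken m-values. Equation 4.4.5 is not reproduced in this excerpt, so the form $m_j = (\bar{x}_{1j} - \bar{x}_{2j}) / (a_0 + s_j)$, with $s_j$ the pooled standard error of the mean difference, is an assumption; only the choice of $a_0$ as the median of the $s_j$ comes from the text. Function names and data are hypothetical.

```python
from statistics import mean, median, variance

def diff_and_se(g1, g2):
    """Mean difference and its pooled standard error for one gene."""
    n1, n2 = len(g1), len(g2)
    pooled = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    return mean(g1) - mean(g2), (pooled * (1 / n1 + 1 / n2)) ** 0.5

def t_and_m_values(genes):
    """genes: list of (group1_values, group2_values) pairs, one per gene."""
    stats = [diff_and_se(g1, g2) for g1, g2 in genes]
    a0 = median(se for _, se in stats)          # shrinkage constant
    t_vals = [d / se for d, se in stats]        # ordinary t-statistic
    m_vals = [d / (a0 + se) for d, se in stats]  # assumed m-statistic
    return t_vals, m_vals, a0

genes = [
    ([5.00, 5.01, 5.02], [4.90, 4.91, 4.92]),  # tiny variance, tiny difference
    ([8.0, 9.0, 10.0], [5.0, 6.0, 7.0]),       # large difference
    ([6.0, 7.5, 6.6], [6.4, 7.0, 6.9]),        # no real difference
]
t_vals, m_vals, a0 = t_and_m_values(genes)
```

In this toy input the first gene has the largest t-value purely because its variance is tiny, while the m-value ranks the second gene, which has the large mean difference, first.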

PAGE 65

As in summarizing the gene-by-gene information, one can decide to treat each gene individually or borrow strength across the genes to estimate the distribution $f_0$. If we calculate the distribution individually for each gene, then one essentially treats each gene as a different experiment. The family null hypothesis $H_0$ for a microarray study states that none of the genes is differentially expressed. Under this $H_0$, it is plausible to assume that the m-values derived from permutations of group labels are drawn independently from a common distribution with some probability density function, say, $f_0$. In other words, we can pool the permutation m-values across the genes and across permutations. The pooling of permutation test results across genes offers a refined basis for testing for differential expression that is free of the limitation imposed by applying the test to each gene in isolation [26].

To estimate the distribution $f_0$ we make use of the fact that any continuous distribution can be approximated by the Johnson system. Any method detailed in the previous chapter can be used to fit a Johnson system to the pooled statistics and to estimate the parameters of the system. In the following discussions we use Wheeler's quantile method for fitting the Johnson system. An unbounded Johnson distribution is fitted to form the common null distribution with parameters $\hat{\gamma} = 0.1059483$, $\hat{\delta} = 2.690686$, $\hat{\xi} = 0.05775453$ and

PAGE 66

$\hat{\lambda} = 1.301382$. Now the p-values of the calculated m-values can be obtained using the estimated distribution $f_0$. The genes with p-value < 0.01 are considered differentially expressed genes.

The results from the ovarian cancer data are listed in Table 4.3. The table shows the number of genes selected as differentially expressed out of 22,283 genes in the ovarian cancer data.

Method      No. of genes
Johnson     537
SAM         2166
t           738
Wilcoxon    972

Table 4.3: No. of genes selected by different methods

Because the true set of differentially expressed genes is not known for the ovarian cancer data, it is not possible to compare the results obtained for the real data. In order to assess the effectiveness of the proposed methodology and to obtain a quantitative evaluation of the gene selection methods, the simulated data explained in Section 4.2 is used. Results from the simulated data are displayed in Figures 4.4 to 4.9 to compare the methods described in this chapter.

In Figures 4.4 to 4.6, the number of truly differentially expressed genes is plotted against the number of genes selected, for different sample sizes. It can be noted from these figures that, for any selection of genes, the proposed method using the Johnson distribution provides the highest number of truly differentially expressed genes. It can also be observed from these figures that the top genes, which are the most significant, are selected by all the methods. After some point, one can observe more false positives in the other methods than in the proposed method using the Johnson system. However, asymptotically all the methods behave similarly. In the proposed method, information across the genes is used to estimate $f_0$, while t, Wilcoxon and SAM consider gene-by-gene information. This may be the reason for having fewer false positives when the proposed method is used.
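The p-value step above can be sketched as follows. The unbounded Johnson (SU) distribution has the closed-form CDF $F(x) = \Phi(\gamma + \delta \sinh^{-1}((x - \xi)/\lambda))$, so no numerical integration is needed. Two assumptions are made here: the four reported estimates are taken, in order, as $\gamma$, $\delta$, $\xi$, $\lambda$, and the p-value is taken to be two-sided; neither is confirmed by the text. The function names are hypothetical.

```python
from math import asinh
from statistics import NormalDist

# Fitted SU null parameters reported in the text (symbol order is assumed).
GAMMA, DELTA, XI, LAMBDA = 0.1059483, 2.690686, 0.05775453, 1.301382

def johnson_su_cdf(x, gamma=GAMMA, delta=DELTA, xi=XI, lam=LAMBDA):
    """Johnson SU CDF: Phi(gamma + delta * asinh((x - xi) / lam))."""
    z = gamma + delta * asinh((x - xi) / lam)
    return NormalDist().cdf(z)

def two_sided_p_value(m):
    """Two-sided p-value of an m-value under the fitted null (assumed form)."""
    f = johnson_su_cdf(m)
    return 2 * min(f, 1 - f)

def select_genes(m_values, alpha=0.01):
    """Indices of genes whose p-value falls below alpha."""
    return [j for j, m in enumerate(m_values) if two_sided_p_value(m) < alpha]

m_values = [0.0, 3.0, -3.0, 0.5]  # hypothetical m-values
selected = select_genes(m_values)
```

Because the same fitted null is shared by all genes, the per-gene cost is a single CDF evaluation.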

PAGE 70

Number of Equally Expressed genes. This is the same as the probability of Type I error, denoted by $\alpha$. The false negative rate is the proportion of Differentially Expressed genes that were erroneously reported as Equally Expressed. More specifically,

\[ \text{False Negative Rate} = \frac{\text{Number of false negatives}}{\text{Number of Differentially Expressed genes}}. \]

This is the same as the probability of Type II error. It is equal to 1 minus the power of the test.

A method whose ROC curve lies below another one is preferred [20], as the curve plots the Type II error against the Type I error. A method which has a better ROC curve, in this sense, will produce top lists with more differentially expressed genes (DEGs), fewer non-DEGs and, consequently, will leave out fewer DEGs. For any fixed Type I error, the Type II error will be lowest for the lowest ROC curve.

In Figures 4.7 to 4.9, the ROC curves of the methods discussed in this chapter are given. The false positive rate and the false negative rate both range from 0 to 1, as both are probabilities. As expected, when either the false positive rate or the false negative rate is close to one, the curves converge. The comparison between methods should be based on the part of the curves where the false positive rate and the false negative rate are close to zero (that is, near the origin). The ROC curve of the proposed method using the Johnson distribution lies below the ROC curves of the other methods near the origin, showing that the proposed method is better than the other methods. Better performance can also be observed as the sample size increases. For example, in Figure 4.9, where the sample sizes are 33 vs 22, when the false positive rate is 0.05, the false negative rate is 0.05 in the Johnson ROC. But in Figure 4.7, where the sample sizes are 15 vs 10, the minimum possible values of the false positive rate and the false negative rate are approximately 0.1 each.
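The two error rates, and the ROC points built from them, are simple to compute on simulated data where the truly differentially expressed genes are known. The sketch below is illustrative only; the function names and the example inputs are hypothetical.

```python
def error_rates(selected, truly_de, n_genes):
    """Return (false positive rate, false negative rate) for one gene list."""
    selected, truly_de = set(selected), set(truly_de)
    false_pos = len(selected - truly_de)   # EE genes reported as DE
    false_neg = len(truly_de - selected)   # DE genes reported as EE
    n_ee = n_genes - len(truly_de)
    return false_pos / n_ee, false_neg / len(truly_de)

def roc_points(scores, truly_de):
    """(FPR, FNR) after selecting the top k genes by score, for k = 0..n."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [error_rates(order[:k], truly_de, len(scores))
            for k in range(len(scores) + 1)]

# 10 genes, of which genes 0-3 are truly DE; the method selected 0, 1, 2 and 5.
fpr, fnr = error_rates([0, 1, 2, 5], [0, 1, 2, 3], 10)

scores = [9.1, 8.4, 7.7, 1.2, 0.5, 6.0, 0.9, 0.3, 0.8, 0.1]
curve = roc_points(scores, [0, 1, 2, 3])
```

With this convention (false negative rate against false positive rate) every curve runs from (0, 1) to (1, 0), and a lower curve is better.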

PAGE 76

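Let $p$ denote the fraction of genes that are differentially expressed (DE). Then $1 - p$ denotes the fraction of genes equivalently expressed (EE). An EE gene $j$ presents data $x_j = (x_{j1}, x_{j2}, \ldots, x_{jI})$ according to a distribution

\[ f_0(x_j) = \int \left( \prod_{i=1}^{I} f_{obs}(x_{ji} \mid \mu) \right) \pi(\mu) \, d\mu \tag{5.2.1} \]

Alternatively, if gene $j$ is differentially expressed, the data $x_j = (x_{jg_1}, x_{jg_2})$ are governed by the distribution

\[ f_1(x_j) = f_0(x_{jg_1}) \, f_0(x_{jg_2}) \tag{5.2.2} \]

owing to the fact that different mean values govern the different subsets $x_{jg_1}$ and $x_{jg_2}$ of samples, where $g_1$ contains indices from samples in group 1 and $g_2$ contains indices from samples in group 2. The marginal distribution of the data becomes

\[ p f_1(x_j) + (1 - p) f_0(x_j) \tag{5.2.3} \]

With estimates of $p$, $f_0$ and $f_1$, the posterior probability of differential expression is calculated by Bayes' rule as

\[ \frac{p f_1(x_j)}{p f_1(x_j) + (1 - p) f_0(x_j)} \]

In EBarrays, two model specifications of the general mixture model in Equation 5.2.3 are considered, namely, the Gamma-Gamma (GG) and Lognormal-Normal (LNN) models. In the GG model, the observation components are from a Gamma distribution having shape parameter $\alpha > 0$ and a mean value $\mu_j$, thus with scale parameter $\lambda_j = \alpha / \mu_j$,

\[ f_{obs}(x \mid \mu_j) = \frac{\lambda_j^{\alpha} x^{\alpha - 1} \exp(-\lambda_j x)}{\Gamma(\alpha)} \]

for measurements $x > 0$. The prior distribution $\pi(\mu_j)$ for $\mu_j$ is chosen as an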

PAGE 77

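where

\[ K = \frac{\nu^{\alpha_0} \, \Gamma(I\alpha + \alpha_0)}{\Gamma^{I}(\alpha) \, \Gamma(\alpha_0)} \]

In the Lognormal-Normal (LNN) model, the gene-specific mean $\mu_j$ is a mean for the log-transformed measurements, which are presumed to have a normal distribution with common variance $\sigma^2$. A conjugate prior for the $\mu_j$ is normal with some underlying mean $\mu_0$ and variance $\tau_0^2$. Integrating as in 5.2.1, the density $f_0(\cdot)$ for an $n$-dimensional input becomes Gaussian with mean vector $\boldsymbol{\mu}_0 = (\mu_0, \mu_0, \ldots, \mu_0)^t$ and exchangeable covariance matrix $\Sigma_n = (\sigma^2) I_n + (\tau_0^2) M_n$, where $I_n$ is an $n \times n$ identity matrix and $M_n$ is an $n \times n$ matrix of ones [19].

For both models, the method of maximum marginal likelihood is used to obtain estimates of the unknown hyperparameters and the mixing proportion $p$. The marginal log likelihood is a sum over genes $j$ of the logarithms of the terms in 5.2.3,

\[ l(\theta) = \sum_j \log \left[ p f_1(x_j) + (1 - p) f_0(x_j) \right] \tag{5.2.7} \]

The maximum likelihood estimates of the parameters are obtained via the Expectation-Maximization (EM) algorithm. This empirical Bayes hierarchical modeling approach is implemented in R as the EBarrays package.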
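To make the mixture calculation concrete, the sketch below applies Bayes' rule to the two-component mixture in 5.2.3 and runs EM updates for the mixing proportion $p$ alone. It is deliberately simplified and is not the EBarrays implementation: $f_0$ and $f_1$ are held fixed as two hypothetical normal densities on a scalar score, whereas EBarrays also estimates the hyperparameters of the GG or LNN model. The 0.8 cutoff is used only for illustration.

```python
from statistics import NormalDist

def posterior_de(p, f0, f1):
    """Bayes' rule: P(DE | x) = p*f1 / (p*f1 + (1 - p)*f0)."""
    return p * f1 / (p * f1 + (1 - p) * f0)

def em_mixing_proportion(f0_vals, f1_vals, p=0.5, n_iter=200):
    """EM for the mixing proportion p with f0 and f1 held fixed."""
    for _ in range(n_iter):
        # E-step: posterior probability of DE for every gene
        post = [posterior_de(p, f0, f1) for f0, f1 in zip(f0_vals, f1_vals)]
        # M-step: p is the average posterior probability
        p = sum(post) / len(post)
    return p

# Hypothetical densities and per-gene scores (assumptions for illustration).
null, alt = NormalDist(0.0, 1.0), NormalDist(3.0, 1.0)
scores = [-0.5, 0.2, 0.1, -1.0, 0.7, 3.2, 2.8, -0.3, 0.4, 3.5]
f0_vals = [null.pdf(s) for s in scores]
f1_vals = [alt.pdf(s) for s in scores]

p_hat = em_mixing_proportion(f0_vals, f1_vals)
posteriors = [posterior_de(p_hat, f0, f1) for f0, f1 in zip(f0_vals, f1_vals)]
selected = [j for j, q in enumerate(posteriors) if q > 0.8]
```

Each EM iteration cannot decrease the marginal log likelihood in 5.2.7, which is why the simple averaging update converges.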

PAGE 78

EBAM uses Bayes' theorem to obtain posterior probabilities of any gene being differentially expressed.

and

PAGE 79

The value of $p_0$ is chosen in such a way that all posterior probabilities are positive [7]. We make use of the estimated value of $p_0$ used in EBAM [7] under the same criterion. Now we are able to calculate the posterior probability using Equation 5.3.10. The genes with posterior probabilities greater than 0.8 are chosen as the differentially

PAGE 80

Figure 5.1 shows posterior probabilities against m-values. The green spots correspond to the genes with posterior probability > 0.8.

Both EBarrays and EBAM share information among genes. One drawback of EBarrays is the assumption of a parametric model for gene expressions and hence a good chance of violation of assumptions. EBarrays assumes that the gene expressions

PAGE 82

Method           No. of genes
Johnson          537
SAM              2166
t                738
Wilcoxon         972
Bayes-Johnson    543
EBarrays-GG      191
EBarrays-LNN     177
EBAM             223

Table 5.1: No. of genes selected by different methods

5.5 Results

Because the true set of differentially expressed genes is not known for the ovarian cancer data, it is not possible to compare the results obtained for the real data. In order to assess the effectiveness of the proposed methodology and to obtain a quantitative evaluation of the gene selection methods, the simulated data explained in Section 4.2 is used. A comparison of the methods discussed in this chapter is presented in Figures 5.2 to 5.9. The number of truly differentially expressed genes is plotted against the number of genes selected in Figures 5.2 to 5.4. We can observe that for any number of genes selected (x-coordinate), the proposed method using the Johnson system of distributions gives more true positives than the other methods. Results for the simulated data from all methods discussed in this dissertation are shown in

PAGE 88

Number of Equally Expressed genes. This is the same as the probability of Type I error, denoted by $\alpha$. The false negative rate is the proportion of Differentially Expressed genes that were erroneously reported as Equally Expressed. More specifically,

\[ \text{False Negative Rate} = \frac{\text{Number of false negatives}}{\text{Number of Differentially Expressed genes}}. \]

This is the same as the probability of Type II error. It is equal to 1 minus the power of the test. A method whose ROC curve lies below another one is preferred [20], as the curve plots the Type II error against the Type I error. A method which has a better ROC curve, in this sense, will produce top lists with more differentially expressed genes (DEGs), fewer non-DEGs and, consequently, will leave out fewer DEGs. It can be observed from Figures 5.7 to 5.9 that the proposed approach using Johnson's system of distributions is better than the other methods discussed here.

In Figures 5.5 and 5.6 the number of truly differentially expressed genes is plotted against the number of genes selected, for a comparison of all methods discussed in the dissertation. It can be noted that, for any fixed number of genes selected, the proposed methods using the Johnson distribution give the highest number of truly differentially expressed genes.
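The kind of curve described for these figures (truly differentially expressed genes against the number of genes selected) can be computed from any ranking of genes when the truth is known, as in the simulated data. The following is a minimal sketch with hypothetical scores and function name; it does not reproduce the dissertation's simulation.

```python
def true_positives_by_list_size(scores, truly_de):
    """Entry k-1 is the number of truly DE genes among the top k by score."""
    truly_de = set(truly_de)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    counts, hits = [], 0
    for j in order:
        hits += j in truly_de  # True counts as 1
        counts.append(hits)
    return counts

# Six genes; genes 0, 2 and 5 are truly DE in this toy example.
scores = [5.1, 0.3, 4.2, 0.9, 3.7, 1.5]
curve = true_positives_by_list_size(scores, [0, 2, 5])
```

Plotting one such curve per method on common axes gives the comparison described in the text: at any list size, a higher curve means more true positives.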

PAGE 93

The dissertation also discusses the Johnson system of distributions in addition to the methods of gene selection. We have developed a new algorithm for the estimation of parameters, which can be applied to all three families of Johnson distributions.

An important application of microarray data is to classify biological samples or predict clinical or other outcomes. This is an interesting statistical problem for future research. We also see a strong need for a better statistic that measures the variation in the expression levels of genes more accurately. This is another interesting problem for future research.

PAGE 94

[1] Alberts B., Johnson A., Lewis J., Raff M., Roberts K., Walter P. (2002). Molecular Biology of the Cell, Garland Publishing, 4th ed.
[2] Bailin Zhou and John P. McTague (1996). Can. J. For. Research 26; 928-935.
[3] Baldi, P. and Long, A.D. (2001). A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17, 509-519.
[4] Chakravarti, Laha, and Roy (1967). Handbook of Methods of Applied Statistics, Volume I, John Wiley and Sons, pp. 392-394.
[5] Draper J. Properties of Distributions Resulting from Certain Simple Transformations of the Normal Distribution. Biometrika, Vol. 39, No. 3/4 (1952), pp. 290-301.
[6] Devore J., Peck R. Statistics: the Exploration and Analysis of Data, 3rd edn, Duxbury Press, Pacific Grove CA; 1997.
Griffiths A.J.F., Miller J.H., Suzuki D.T., Lewontin R.C. and Gelbart W.M. (2000). An Introduction to Genetic Analysis, Freeman, New York.
[7] Efron B. Robbins, Empirical Bayes and Microarrays. The Annals of Statistics 2003; 31: 366-378.
[8] Efron B., Tibshirani R., Storey J.D., Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 2001; 96: 1151-1160.

PAGE 95

[9] Efron B., Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 2002; 23: 70-86.
[10] Golub T.R., Slonim D.K., Tamayo P. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
[11] Hahn J. Gerald and Shapiro S. Samuel (1967). Statistical Models in Engineering, John Wiley and Sons.
[12] Hill I.D., Hill R., Holder R.L. Fitting Johnson Curves by Moments. Applied Statistics, Vol. 25, No. 2 (1976), pp. 180-189.
[13] Hosking J.R.M., Wallis J.R. and Wood E.F. Estimation of the generalized extreme-value distribution by the method of probability-weighted moments. Technometrics, 27(3): 251-261.
[14] Ihaka R., Gentleman R. "R: A language for data analysis and graphics", J. Comput. Graph. Stat., 5, 1996: 299-314.
[15] Irizarry R.A., Hobbs B., Collin F., Beazer-Barclay Y.D., Antonellis K.J., Scherf U. et al. "Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data", Biostatistics, 4(2), 2003: 249-264.
[16] Johnson, N.L. (1949). Systems of frequency curves generated by methods of translation. Biometrika 36, 149-176.
[17] Johnson N.L., Kotz S. and Balakrishnan N. (1994). Continuous Univariate Distributions, volume 1 and volume 2.
[18] Kamziah A. Kudus, Ahmad M.I., Jarin L. Nonlinear regression approach to estimating Johnson SB parameters for diameter data. Can. J. For. Res. (1999), 29: 310-314.

PAGE 96

[19] Kendziorski, C.M., M.A. Newton, H. Lan, and M.N. Gould. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine 2003; 22: 3899-3914.
[20] Lonnstedt I., Speed T.P. Replicated Microarray Data. Stat Sinica 2002, 12: 31-46.
[21] Mei-Ling Ting Lee (2004). Analysis of Gene Expression Data, Kluwer Academic Publishers.
[22] Mordell, L.J. The value of the definite integral, Quarterly Journal of Mathematics, Oxford, 48 (1920): 329-342.
[23] Ola Larsson, Claes Wahlestedt and James Timmons. Considerations when using the significance analysis of microarrays (SAM) algorithm.
[24] Pan, W.; Lin, J.; Le, C. A Mixture Model Approach to Detecting Differentially Expressed Genes with Microarray Data. Functional & Integrative Genomics 2003, Volume 3, Number 3.
[25] Parmigiani G., E.S. Garrett, R.A. Irizarry, S.L. Zeger (2003). The Analysis of Gene Expression Data. Springer, New York.
[26] Pierre Baldi, Anthony D. Long (2001). A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes, Bioinformatics vol. 17 no. 6.
[27] Pearson E.S., Mathematical contributions to the theory of evolution, XIX, second supplement to a memoir on skew variation. Phil. Trans. Roy. Soc. (A), Vol. 216 (1916), p. 432.
[28] Schmeiser, B.W. (1977). Methods for modelling and generating probabilistic components in digital computer simulation when the standard distributions are not adequate: A Survey. Proceedings of the 1977 Winter Simulation Conference, Highland, Sargent and Schmidt (eds), 51-55. The Institute of Electrical and Electronics Engineers, Piscataway, New Jersey.

PAGE 97

[29] Slifker J., Shapiro S. The Johnson System: selection and parameter estimation. Technometrics 1980; 22: 239-247.
[30] Snedecor, George W. and Cochran, William G. (1989). Statistical Methods, Eighth Edition, Iowa State University Press.
[31] Storer. Adaptive Estimation by Maximum Likelihood-Fitting of Johnson Distributions, unpublished Ph.D. thesis, School of Industrial and Systems Engineering, Georgia Institute of Technology (1987).
[32] Strachan T., Read A.P. (1999). Human Molecular Genetics, Wiley Publications, 2nd ed.
[33] Tusher V., Tibshirani R., Chu C. Significance Analysis of microarrays applied to transcriptional response to ionizing radiation. Proceedings of the National Academy of Sciences 2001; 98: 5116-5121.
[34] Vladimir P. Savchuk and Chris P. Tsokos. Bayesian Statistical Methods with Applications to Reliability, World Federation Publishers, Inc., 1996.
[35] Wheeler R. Quantile Estimators of Johnson Curve Parameters. Biometrika, vol. 67, No. 3, Dec 1980, pp. 725-728.
[36] Wilkins J.E. A note on Skewness and Kurtosis. The Annals of Mathematical Statistics, vol. 15, No. 3 (Sep 1944), pp. 333-335.

