
Johnson's system of distributions and microarray data analysis

Material Information

Title:
Johnson's system of distributions and microarray data analysis
Physical Description:
Book
Language:
English
Creator:
George, Florence
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla.
Publication Date:

Subjects

Subjects / Keywords:
Gene expression data
Differentially expressed genes
Transformed distributions
Baye's formula
Mixture model approach
Dissertations, Academic -- Mathematics -- Doctoral -- USF   ( lcsh )
Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: Microarray technology permits us to study the expression levels of thousands of genes simultaneously. The technique has a wide range of applications, including identification of genes that change their expression in cells due to disease or drug stimuli. This dissertation addresses statistical methods for the selection of differentially expressed genes in two experimental conditions. We propose two different methods for the selection of differentially expressed genes. The first method is a classical approach, where we consider a common distribution for the summary measure of equally expressed genes. To estimate this common distribution, the Johnson system of distributions is used. The advantage of using the Johnson system is that there is no need for a parametric assumption about the gene expression data. In contrast to other classical methods, the proposed method shares information across genes by assuming a common distribution for the summary measure of equally expressed genes. The second method is gene selection using a mixture model approach and Bayes' theorem. This approach also uses the Johnson system of distributions to estimate the distribution of the summary measure. The Johnson system of distributions has the flexibility to cover a wide variety of distributional shapes. This system provides a unique distribution corresponding to each pair of mathematically possible values of skewness and kurtosis. This flexibility is very useful in characterizing complicated data sets such as microarray data. In this dissertation we propose a novel algorithm for the estimation of the four parameters of the Johnson system.
Thesis:
Dissertation (Ph.D.)--University of South Florida, 2007.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Florence George.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 86 pages.
General Note:
Includes vita.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001917312
oclc - 181596712
usfldc doi - E14-SFE0002040
usfldc handle - e14.2040
System ID:
SFS0026358:00001


Full Text
controlfield tag 001 001917312
003 fts
005 20071119165920.0
006 m||||e|||d||||||||
007 cr mnu|||uuuuu
008 071119s2007 flu sbm 000 0 eng d
datafield ind1 8 ind2 024
subfield code a E14-SFE0002040
035
(OCoLC)181596712
040
FHM
c FHM
049
FHMM
090
QA36 (ONLINE)
1 100
George, Florence.
0 245
Johnson's system of distributions and microarray data analysis
h [electronic resource] /
by Florence George.
260
[Tampa, Fla.] :
b University of South Florida,
2007.
502
Dissertation (Ph.D.)--University of South Florida, 2007.
504
Includes bibliographical references.
516
Text (Electronic dissertation) in PDF format.
538
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
500
Title from PDF of title page.
Document formatted into pages; contains 86 pages.
Includes vita.
590
Advisor: Kandethody M. Ramachandran, Ph.D.
653
Gene expression data.
Differentially expressed genes.
Transformed distributions.
Baye's formula.
Mixture model approach.
690
Dissertations, Academic
z USF
x Mathematics
Doctoral.
773
t USF Electronic Theses and Dissertations.
4 856
u http://digital.lib.usf.edu/?e14.2040



PAGE 1

by
Florence George

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Mathematics
College of Arts and Sciences
University of South Florida

Major Professor: Kandethody M. Ramachandran, Ph.D.
Chris P. Tsokos, Ph.D.
Marcus McWaters, Ph.D.
George Yanev, Ph.D.

Date of Approval: June 15, 2007

Keywords: Gene expression data, Differentially expressed genes, Transformed distributions, Baye's formula, Mixture model approach

Copyright 2007, Florence George

PAGE 4

My eyes turn misty with mixed feelings of gratitude and pain when I think of the late professor, Dr. A. N. V. Rao, who introduced me to the field of microarrays. His principle, "see good in everything and be a constructive part of the whole", has been soul-stirring guidance all through my research. I have a royal bouquet of thanks for the distinguished and well-known professor Dr. Chris P. Tsokos for his threadbare analysis of my work, furnishing me with detailed and constructive comments.

I gratefully cherish the energizing support, advice and encouragement of Dr. McWaters and Dr. George Yanev. My recompense is thanks to Dr. Li Lihua and Dr. Lancaster, H. Lee Moffitt Cancer Center and Research Institute, who gave me access to microarray data and let me have firsthand experience in their laboratory.

I cannot but thank my husband Johnet, our daughter Sheris and our entire extended family, who provided a refreshing but loving home environment for a dissertation like this. Finally, but most fittingly, I wish to thank my parents who formed part of my vision.

PAGE 5

List of Tables
Abstract
1 Introduction
2 Microarrays
  2.1 Genetic Background
  2.2 Gene Expression
  2.3 Microarray Technology
  2.4 cDNA Microarrays
  2.5 Oligonucleotide Microarrays
3 Johnson System of Distributions and a MLE-Least Square Approach to Parameter Estimation
  3.1 Introduction
  3.2 The Johnson Translation System
  3.3 Johnson Subsystem Identification
  3.4 Probability Density Function of Johnson System
  3.5 Parameter Estimation of Johnson System
    Moment Matching
    Percentile Matching
    Quantile Estimators

PAGE 6

  3.7 Comparison of Estimation Methods
  3.8 Summary
4 Methods for Gene Selection and Application of Johnson's Distribution
  4.1 Introduction
  4.2 Data
    Ovarian Cancer Data
    Simulated Data
  4.3 Classical Methods for Gene Selection
    Fold change
    T-test
    Wilcoxon Test
    SAM
  4.4 Gene Selection using Johnson's Family of Distributions
  4.5 Summary
5 Mixture Model Approach for Gene Selection using Johnson's System of Distributions
  5.1 Introduction
  5.2 EBARRAYS
  5.3 EBAM
  5.4 A Mixture Model Approach using Johnson Distribution and Baye's Formula
  5.5 Results
  5.6 Summary
6 Conclusion
References

PAGE 8

2.2 Sample part of microarray chip (http://www.gene-chips.com)
3.1 Chart for Johnson subsystem identification
3.2 Examples of Johnson SB family. For all cases $\xi = 0$ and $\lambda = 100$
3.3 Examples of Johnson SU family. For all cases $\xi = 0$ and $\lambda = 100$
3.4 Examples of Johnson SL family. For all cases $\xi = 0$ and $\lambda = 100$
3.5 Estimated and True Johnson SB distribution (parameter values 1, 1, 10, 50)
3.6 Estimated and True Johnson SU distribution (parameter values 1, 0.5, 10, 10)
3.7 Estimated and True Johnson SL distribution (parameter values 0, 2, 10, 50)
4.1 SAM plot using ovarian cancer microarray data
4.2 SAM plot of FDR using ovarian cancer microarray data
4.3 Histogram of t-values (left) and m-values (right)
4.4 No. of genes selected vs No. of True Positives; Sample size - 15 vs 10
4.5 No. of genes selected vs No. of True Positives; Sample size - 20 vs 20
4.6 No. of genes selected vs No. of True Positives; Sample size - 33 vs 22
4.7 Receiver Operating Characteristic curve; Sample size - 15 vs 10
4.8 Receiver Operating Characteristic curve; Sample size - 20 vs 20
4.9 Receiver Operating Characteristic curve; Sample size - 33 vs 22
5.1 Posterior probability of genes; Genes representing green (or gray) points have posterior probability > 0.8
5.2 Mixture model approaches - No. of genes selected vs No. of True Positives; Number of genes - 2000; Sample size - 15 vs 10

PAGE 9

5.4 Mixture model approaches - No. of genes selected vs No. of True Positives; Number of genes - 2000; Sample size - 33 vs 22
5.5 Comparison of all methods discussed - No. of genes selected vs No. of True Positives; Number of genes - 2000; Sample size - 15 vs 10
5.6 Comparison of all methods discussed - No. of genes selected vs No. of True Positives; Number of genes - 2000; Sample size - 33 vs 22
5.7 ROC curve - Mixture model approaches; Number of genes - 2000; Sample size - 15 vs 10
5.8 ROC curve - Mixture model approaches; Number of genes - 2000; Sample size - 20 vs 20
5.9 ROC curve - Mixture model approaches; Number of genes - 2000; Sample size - 33 vs 22

PAGE 10

3.2 Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SB family based on 20 samples of size 2000
3.3 Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SU family based on 20 samples of size 2000
3.4 Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SL family based on 20 samples of size 2000
4.1 Ovarian cancer data
4.2 SAM - No. of significant genes and estimated false positives for different choices of delta
4.3 No. of genes selected by different methods
5.1 No. of genes selected by different methods

PAGE 11

Florence George

The Johnson system of distributions has the flexibility to cover a wide variety of distributional shapes. This system provides a unique distribution corresponding to each pair of mathematically possible values of skewness and kurtosis. This flexibility is very useful in characterizing complicated data sets such as microarray data. In this dissertation we propose a novel algorithm for the estimation of the four parameters of the Johnson system.

PAGE 12

While simultaneous measurement of thousands of gene expression levels provides a potential source of profound knowledge, success of the microarray technology depends heavily on statistical analysis. Careful statistical thinking and analysis are required to find the underlying structure in the data. The unprecedented amounts of data produced by microarrays raise new challenges for statisticians to be able to perform inference on a scale never before conducted. Recently, statisticians and researchers in bioinformatics have focused much attention on the development of statistical methods to identify differentially expressed genes, with special emphasis on those methods that identify genes that are differentially expressed between two conditions. This work focuses on the development of statistical methods that are suitable for differential

PAGE 13

The dissertation is organized as follows. Chapter 2 introduces microarrays. In Chapter 3, Johnson's system of distributions and the methods for estimation of its parameters are discussed, and we develop a new algorithm for estimating the parameters of the Johnson system. Chapter 4 discusses the classical methods of gene selection and proposes a new approach for gene selection using Johnson's system of distributions. In Chapter 5, we propose a mixture model approach using Bayes' theorem and Johnson's system of distributions for the selection of differentially expressed genes.

PAGE 14

Microarrays produce huge amounts of data. Careful statistical thinking and analysis are required to find the underlying structure in the data. The unprecedented amounts of data produced by microarrays raise new challenges for statisticians to be able to perform inference on a scale never before conducted.

2.1 Genetic Background

PAGE 16

There are currently two platforms/types of DNA microarrays that are commercially available: glass DNA microarrays, which involve the microspotting of prefabricated cDNA fragments on a glass slide, and high-density oligonucleotide microarrays, often referred to as a "chip", which involve in situ oligonucleotide synthesis.

2.4 cDNA Microarrays

PAGE 17

Through a denaturing process the double stranded DNA molecules in the sample are unzipped into two single stranded molecules. The microarray chip itself also contains single strands of genes that will attract the single strands from the sample. The single strand from the sample will bind with the single strands on the microarray chip to re-form the DNA double helix. This is called hybridization. In a cDNA microarray experiment, the test sample is labeled with a dye and the reference sample is labeled with a dye of a different color. The reference sample serves as a control to which the gene expression in the test sample is compared. For example, if we wanted to determine which genes are expressed in a tumor sample, we could use a tissue sample from a healthy individual as the reference sample. We would then compare the expression level of each gene in the tumor sample to the expression level of the same gene in the reference sample. Suppose the tumor sample had been labeled with a red dye and the reference sample had been labeled with a green dye. Then a red spot on the microarray would indicate that the gene corresponding to that spot is expressed at a higher level in the tumor sample than in the reference sample. Similarly, a green spot would indicate that the gene is expressed at a lower level in the tumor sample.

After hybridization, an image of the array with hybridized fluorescent dyes must be acquired. Microarray images can be produced using a number of devices, like a laser scanner or a camera. The goal is to measure, for each spot on the array, the relative fluorescence intensities from each dye hybridized with its target. A high fluorescence level indicates that multiple copies of a gene have bound to the chip and the gene has activity in the cell. Similarly, a low fluorescence level indicates low activity of the gene in the cell. By quantifying the fluorescence level the gene activity can be compared across different samples.

PAGE 19

A sample part of a microarray chip is shown in Figure 2.2. Each spot represents a gene and there are thousands of genes on a chip. The intensity and color of each spot encode information on a specific gene from the tested sample. A more detailed description of genes, genetic analysis and microarray technology can be found in [1], [32], [21].

The power of microarray technology lies in its potential to survey the entire genome in one experiment. Microarrays are increasingly applied in biological and medical research to address a wide range of problems. Here we list some of these applications. The completion of the human genome means that we can search for the genes directly associated with different diseases. This new knowledge will enable better treatments, cures and even preventive tests to be developed. Clinical medicine will become more

PAGE 20

Although it is preferable for the statistician to have a hand in the experimental design, the statistician often comes into a microarray analysis project once the data have been collected. The statistician's job is to use the numerical gene expression levels to make claims about the populations of interest. An important and common question in microarray experiments is the identification of differentially expressed genes. In the chapters following the next chapter we will review available techniques and propose some new methods for the selection of differentially expressed genes.

PAGE 21

In 1949, Johnson derived a system of curves [16], [17] that has the flexibility of covering a wide variety of shapes. This system has the practical and theoretical advantages of being able to transform these curves to the normal distribution. The Johnson system is able to closely approximate many of the standard continuous distributions through one of the three functional forms and is thus highly flexible. The Johnson system provides one distribution corresponding to each pair of mathematically possible values of skewness and kurtosis.

The significant flexibility of the Johnson system of distributions is very useful in characterizing complicated data sets like microarray data. In all parametric approaches

PAGE 22

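\[ Z = \gamma + \delta\, f\!\left(\frac{X-\xi}{\lambda}\right) \tag{3.2.1} \]
where $f(\cdot)$ denotes the transformation function, $Z$ is a standard normal random variable, $\gamma$ and $\delta$ are shape parameters, $\lambda$ is a scale parameter and $\xi$ is a location parameter. Without loss of generality, it is assumed that $\delta > 0$ and $\lambda > 0$. The first transformation proposed by Johnson defines the lognormal system of distributions denoted by $S_L$:
\[ Z = \gamma + \delta \ln\!\left(\frac{X-\xi}{\lambda}\right) = \gamma^{*} + \delta \ln(X-\xi), \qquad X > \xi \tag{3.2.2} \]
where $\gamma^{*} = \gamma - \delta\ln\lambda$. $S_L$ curves cover the lognormal family.

The bounded system of distributions $S_B$ is defined by
\[ Z = \gamma + \delta \ln\!\left(\frac{X-\xi}{\xi+\lambda-X}\right), \qquad \xi < X < \xi+\lambda \tag{3.2.3} \]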
PAGE 23

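The unbounded system of distributions $S_U$ is defined by
\[ Z = \gamma + \delta \ln\!\left[\frac{X-\xi}{\lambda} + \left\{\left(\frac{X-\xi}{\lambda}\right)^{2} + 1\right\}^{1/2}\right], \qquad -\infty < X < \infty \tag{3.2.4} \]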
PAGE 24

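If we define the variable $Y = (X-\xi)/\lambda$, then 3.2.1 becomes
\[ Z = \gamma + \delta\, f(Y) \]
where $f(\cdot)$ denotes the transformation. For a set of parameter values, a realization of the Johnson random variable $X$ can be obtained by applying the inverse translation,
\[ X = \xi + \lambda\, f^{-1}\!\left(\frac{Z-\gamma}{\delta}\right) \tag{3.3.9} \]
where $Z$ is a standard normal variable.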
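The inverse translation in Equation 3.3.9 can be sketched in a few lines of Python. This is only an illustration: the function names `f_inverse` and `johnson_sample` are hypothetical, and the inverse forms used (exponential for SL, hyperbolic sine for SU, logistic for SB) assume that $f$ is the logarithmic transformation of Equations 3.2.2 to 3.2.4 with the logarithm included in $f$.

```python
import math
import random


def f_inverse(u, family):
    """Inverse of the Johnson transformation f (assumed to include the log)."""
    if family == "SL":
        return math.exp(u)                    # f(y) = ln(y)
    if family == "SU":
        return math.sinh(u)                   # f(y) = ln(y + sqrt(y^2 + 1))
    if family == "SB":
        return 1.0 / (1.0 + math.exp(-u))     # f(y) = ln(y / (1 - y))
    raise ValueError("family must be 'SL', 'SU' or 'SB'")


def johnson_sample(n, gamma, delta, xi, lam, family, seed=None):
    """Draw n Johnson variates as X = xi + lam * f^{-1}((Z - gamma) / delta)."""
    rng = random.Random(seed)
    return [
        xi + lam * f_inverse((rng.gauss(0.0, 1.0) - gamma) / delta, family)
        for _ in range(n)
    ]
```

For example, `johnson_sample(2000, 1.0, 1.0, 10.0, 50.0, "SB", seed=1)` would give a bounded sample lying between 10 and 60.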

PAGE 25

\[ f^{-1}(z) = (1 + e^{-z})^{-1} \quad \text{for the } S_B \text{ family} \]

Several examples of the density curves of Johnson SB, SU and SL are displayed in Figures 3.2 to 3.4. These graphs give an idea of the flexibility of the Johnson system to cover a wide variety of shapes.

PAGE 29

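Proof: Let $X$ be a standardized discrete random variable with probability mass function $p(x)$ and moments $\mu_r = \sum_x x^r p(x)$. Consider the following function in $a$, $b$ and $c$,
\[ G(a,b,c) = \mu_0 a^2 + 2\mu_1 ab + 2\mu_2 ac + \mu_2 b^2 + 2\mu_3 bc + \mu_4 c^2 = \sum_x (a + bx + cx^2)^2\, p(x) \ge 0. \]
Consequently its discriminant
\[ \begin{vmatrix} \mu_0 & \mu_1 & \mu_2 \\ \mu_1 & \mu_2 & \mu_3 \\ \mu_2 & \mu_3 & \mu_4 \end{vmatrix} \ge 0. \]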

PAGE 30

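Hence Equation 3.3.12 is equivalent to
\[ \beta_2 - \beta_1 - 1 \ge 0 \tag{3.3.13} \]
The proof is similar for a continuous random variable [36].

3.4 Probability Density Function of Johnson System

If $y = (x-\xi)/\lambda$, then for the SL family the pdf is
\[ p(y) = \frac{\delta}{\sqrt{2\pi}}\,\frac{1}{y}\,\exp\!\left\{-\frac{1}{2}\left[\gamma + \delta\ln(y)\right]^{2}\right\}, \qquad y > 0 \]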
PAGE 31

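In general the pdf of $X$ is given by,
\[ p(x) = \frac{\delta}{\lambda\sqrt{2\pi}}\; g'\!\left(\frac{x-\xi}{\lambda}\right) \exp\!\left\{-\frac{1}{2}\left[\gamma + \delta\, g\!\left(\frac{x-\xi}{\lambda}\right)\right]^{2}\right\} \tag{3.4.17} \]
for all $x \in H$, where
\[ g(y) = \begin{cases} \ln(y) & \text{for the SL family} \\ \ln\!\left(y + \sqrt{1+y^{2}}\right) & \text{for the SU family} \\ \ln\!\left(\dfrac{y}{1-y}\right) & \text{for the SB family} \end{cases} \]
and
\[ g'(y) = \begin{cases} \dfrac{1}{y} & \text{for the SL family} \\ \dfrac{1}{\sqrt{y^{2}+1}} & \text{for the SU family} \\ \dfrac{1}{y(1-y)} & \text{for the SB family} \end{cases} \]
The support $H$ of the distribution is
\[ H = \begin{cases} (\xi, \infty) & \text{for the SL family} \\ (-\infty, \infty) & \text{for the SU family} \\ (\xi, \xi+\lambda) & \text{for the SB family} \end{cases} \]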

PAGE 32

and sample skewness and kurtosis are:
\[ \hat\beta_1 = \frac{m_3^{2}}{m_2^{3}}, \qquad \hat\beta_2 = \frac{m_4}{m_2^{2}} \]
The moment matching technique for fitting a Johnson distribution to the sample uses the location of the point $(\hat\beta_1, \hat\beta_2)$ in Figure 3.1 to identify the appropriate functional form among systems 3.2.2-3.2.4. The principle of moment matching prescribes that the first $k$ sample moments should be equal to the corresponding population moments of the fitted theoretical distribution. The resulting system of $k$ nonlinear equations, which will be dependent on the $k$ parameters, is then solved to obtain the parameter estimates for the fitted distribution. The estimation of parameters using the method of moments is discussed in detail by Draper [5].

For the SL distribution, the first three moments in terms of the parameters in 3.2.2 are,

PAGE 33

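\[ \mu'_1 = e^{\frac{1}{2\delta^{2}} - \frac{\gamma^{*}}{\delta}}, \qquad \mu'_2 = e^{\frac{2}{\delta^{2}} - \frac{2\gamma^{*}}{\delta}}, \qquad \mu'_3 = e^{\frac{9}{2\delta^{2}} - \frac{3\gamma^{*}}{\delta}} \tag{3.5.23} \]
where the moments are taken about $\xi$ and $\gamma^{*}$ is the shape parameter of the three-parameter form in 3.2.2.

For the SU distribution,
\[ \mu'_r(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}z^{2}}\; 2^{-r}\left(e^{(z-\gamma)/\delta} - e^{-(z-\gamma)/\delta}\right)^{r} dz \tag{3.5.24} \]
from which it follows that the first four moments of $y = (x-\xi)/\lambda$ are
\[ \begin{aligned}
\mu'_1 &= -\omega^{1/2}\sinh\Omega \\
\mu_2 &= \tfrac{1}{2}(\omega-1)(\omega\cosh 2\Omega + 1) \\
\mu_3 &= -\tfrac{1}{4}\,\omega^{1/2}(\omega-1)^{2}\left[\omega(\omega+2)\sinh 3\Omega + 3\sinh\Omega\right] \\
\mu_4 &= \tfrac{1}{8}(\omega-1)^{2}\left[\omega^{2}(\omega^{4}+2\omega^{3}+3\omega^{2}-3)\cosh 4\Omega + 4\omega^{2}(\omega+2)\cosh 2\Omega + 3(2\omega+1)\right]
\end{aligned} \tag{3.5.25} \]
where $\omega = e^{\delta^{-2}}$ and $\Omega = \gamma/\delta$.

The $r$th moment of $y = (x-\xi)/\lambda$, when $X$ follows a Johnson SB distribution, is
\[ \mu'_r(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}z^{2}} \left(1 + e^{-(z-\gamma)/\delta}\right)^{-r} dz \tag{3.5.26} \]
This integral is not easy to evaluate directly. Johnson evaluated $\mu'_1(y)$ directly, and the higher order moments are obtained using the following steps.

When $r = 1$,
\[ \mu'_1(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}z^{2}} \left(1 + e^{-(z-\gamma)/\delta}\right)^{-1} dz \tag{3.5.27} \]
which can be evaluated directly using a result due to Mordell [22]. The higher moments can be obtained using the relation,
\[ \mu'_{r+1}(y) = \mu'_r(y) + \frac{\delta}{r}\,\frac{\partial \mu'_r(y)}{\partial\gamma} \]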

PAGE 34

where $\Phi(\cdot)$ is the standard normal distribution function and $F$ is the target distribution function. Once the functional form $f(\cdot)$ among systems 3.2.2-3.2.4 has been identified, the method of percentile matching attempts to solve the $k$ equations
\[ z_j = \gamma + \delta\, f\!\left(\frac{\hat{x}_j - \xi}{\lambda}\right), \qquad 1 \le j \le k \tag{3.5.30} \]
where $\hat{x}_j$ is an estimator of the quantile $x_j$ based on sample data.

Slifker and Shapiro [29] introduced a selection rule which is a function of four percentiles for selecting one of the three families and for giving estimates of the parameters. The fit parameters for the transformation are calculated by solving the transformation equation for the chosen distribution type at the four selected percentiles as explained below. Choose any fixed value z (0
PAGE 35

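\[ \hat\delta = \frac{2z}{\cosh^{-1}\!\left[\frac{1}{2}\left(\frac{m}{p} + \frac{n}{p}\right)\right]} \tag{3.5.32} \]
\[ \hat\gamma = \hat\delta\,\sinh^{-1}\!\left[\frac{\frac{n}{p} - \frac{m}{p}}{2\left(\frac{m}{p}\frac{n}{p} - 1\right)^{1/2}}\right] \tag{3.5.33} \]
\[ \hat\lambda = \frac{2p\left(\frac{m}{p}\frac{n}{p} - 1\right)^{1/2}}{\left(\frac{m}{p} + \frac{n}{p} - 2\right)\left(\frac{m}{p} + \frac{n}{p} + 2\right)^{1/2}} \tag{3.5.34} \]
\[ \hat\xi = \frac{x_z + x_{-z}}{2} + \frac{p\left(\frac{n}{p} - \frac{m}{p}\right)}{2\left(\frac{m}{p} + \frac{n}{p} - 2\right)} \tag{3.5.35} \]

The parameter estimates for the SB distribution are
\[ \hat\delta = \frac{z}{\cosh^{-1}\!\left(\frac{1}{2}\left[\left(1 + \frac{p}{m}\right)\left(1 + \frac{p}{n}\right)\right]^{1/2}\right)}, \qquad (\delta > 0) \tag{3.5.36} \]
\[ \hat\gamma = \hat\delta\,\sinh^{-1}\!\left[\frac{\left(\frac{p}{n} - \frac{p}{m}\right)\left[\left(1 + \frac{p}{m}\right)\left(1 + \frac{p}{n}\right) - 4\right]^{1/2}}{2\left(\frac{p}{m}\frac{p}{n} - 1\right)}\right] \tag{3.5.37} \]
\[ \hat\lambda = \frac{p\left[\left\{\left(1 + \frac{p}{m}\right)\left(1 + \frac{p}{n}\right) - 2\right\}^{2} - 4\right]^{1/2}}{\frac{p}{m}\frac{p}{n} - 1} \tag{3.5.38} \]
\[ \hat\xi = \frac{x_z + x_{-z}}{2} - \frac{\hat\lambda}{2} + \frac{p\left(\frac{p}{n} - \frac{p}{m}\right)}{2\left(\frac{p}{m}\frac{p}{n} - 1\right)} \tag{3.5.39} \]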
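As a numerical check of the SU estimates in Equations 3.5.32 to 3.5.35, the sketch below evaluates them on exact quantiles of a known SU distribution. The function name is hypothetical, and the definitions $m = x_{3z} - x_z$, $n = x_{-z} - x_{-3z}$ and $p = x_z - x_{-z}$ are an assumption taken from the usual statement of the Slifker and Shapiro method, since they do not survive in the text above.

```python
import math


def slifker_shapiro_su(x3z, xz, xmz, xm3z, z):
    """SU estimates (gamma, delta, xi, lam) from the quantiles at 3z, z, -z, -3z."""
    # Assumed definitions of m, n and p (Slifker and Shapiro's construction).
    m = x3z - xz
    n = xmz - xm3z
    p = xz - xmz
    mp, np_ = m / p, n / p
    if mp * np_ <= 1.0:
        raise ValueError("mn/p^2 must exceed 1 for the SU family")
    root = math.sqrt(mp * np_ - 1.0)
    delta = 2.0 * z / math.acosh(0.5 * (mp + np_))                 # (3.5.32)
    gamma = delta * math.asinh((np_ - mp) / (2.0 * root))          # (3.5.33)
    lam = 2.0 * p * root / ((mp + np_ - 2.0) * math.sqrt(mp + np_ + 2.0))  # (3.5.34)
    xi = 0.5 * (xz + xmz) + p * (np_ - mp) / (2.0 * (mp + np_ - 2.0))      # (3.5.35)
    return gamma, delta, xi, lam
```

When the four inputs are the exact quantiles $\xi + \lambda\sinh((kz-\gamma)/\delta)$ for $k = 3, 1, -1, -3$, the formulas return the generating parameters; with sample quantiles they give the percentile-method estimates.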

PAGE 36

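\[ \hat\delta = \frac{2z}{\ln(m/p)} \tag{3.5.40} \]
\[ \hat\gamma^{*} = \hat\delta\,\ln\!\left[\frac{\frac{m}{p} - 1}{p\left(\frac{m}{p}\right)^{1/2}}\right] \tag{3.5.41} \]
\[ \hat\xi = \frac{x_z + x_{-z}}{2} - \frac{p}{2}\,\frac{\frac{m}{p} + 1}{\frac{m}{p} - 1} \tag{3.5.42} \]

Quantile Estimators

$p_n = (n - \tfrac{1}{2})/n$, where $n$ is the sample size. Denote the quantile of the standard normal distribution corresponding to the cumulative probability $p_n$ by $z_n$. For example, if $n = 100$, then $p_n = 0.995$, so that $z_n = 2.5758$. Choose five quantiles $x_p$, $x_k$, $x_0$, $x_m$, $x_n$ from the data corresponding to the standard normal quantiles $z = -z_n, -\tfrac{1}{2}z_n, 0, \tfrac{1}{2}z_n, z_n$. The general form of the Johnson system can be written
\[ z = \gamma + \delta \ln f(y) \tag{3.5.43} \]
where $f(y) = y$ for SL, $f(y) = y + (1+y^{2})^{1/2}$ for SU, $f(y) = y/(1-y)$ for SB and $y = (x-\xi)/\lambda$. Wheeler uses the fact that any ratio of differences of quantiles, of the form $(x_i - x_j)/(x_k - x_l)$, where $\omega = e^{(z-\gamma)/\delta}$, does not depend on $\xi$ or $\lambda$. The parameter estimates for SU curves are:
\[ \hat\delta = \frac{\tfrac{1}{2}z_n}{\ln b} \tag{3.5.45} \]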

PAGE 37

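where $b = \tfrac{1}{2}t_u + \left[\left(\tfrac{1}{2}t_u\right)^{2} - 1\right]^{1/2}$ and $t_u = \dfrac{x_n - x_p}{x_m - x_k}$,

[...]

where $a^{2} = \dfrac{1 - t b^{2}}{t - b^{2}}$.

For SB curves the parameter estimates are:
\[ \hat\delta = \frac{\tfrac{1}{2}z_n}{\ln b} \tag{3.5.47} \]
where $b = \tfrac{1}{2}t_b + \left[\left(\tfrac{1}{2}t_b\right)^{2} - 1\right]^{1/2}$ and $t_b = \dfrac{(x_m - x_0)(x_n - x_p)}{(x_n - x_m)(x_0 - x_p)}$,

[...]

where $a = \dfrac{t - b^{2}}{1 - t b^{2}}$.

For SL curves,
\[ \hat\delta = \frac{z_n}{\ln t} \]
where $t = \dfrac{x_n - x_0}{x_0 - x_p}$.

[...]
\[ \frac{[\ldots]}{(x_n - x_m)(x_0 - x_p)} \tag{3.5.50} \]
is used. It is less than 1 for SU, equal to 1 for SL and greater than 1 for SB.

3.6 MLE-Least Square Approach for the Parameter Estimation of Johnson System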

PAGE 38

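The probability density functions of the members of the Johnson family are known. Let us first consider the SU and SB families of the Johnson system. Using the general form of Johnson densities as given in Equation 3.4.17, and writing $g_i = g\!\left(\frac{x_i-\xi}{\lambda}\right)$ and $g'_i = g'\!\left(\frac{x_i-\xi}{\lambda}\right)$, the likelihood function is,
\[ L(x) = \frac{\delta^{n}}{\lambda^{n}(2\pi)^{n/2}} \prod_{i=1}^{n} g'_i \; \exp\!\left\{-\frac{1}{2}\sum_{i=1}^{n}\left(\gamma + \delta g_i\right)^{2}\right\} \tag{3.6.51} \]
The log-likelihood is,
\[ \log L = n\log\delta - n\log\lambda - \frac{n}{2}\log(2\pi) + \sum_{i=1}^{n}\log g'_i - \frac{1}{2}\sum_{i=1}^{n}\left(\gamma + \delta g_i\right)^{2} \tag{3.6.52} \]
Setting the partial derivative with respect to $\delta$ to zero,
\[ \frac{n}{\delta} - \delta\sum_{i=1}^{n} g_i^{2} - \gamma\sum_{i=1}^{n} g_i = 0 \tag{3.6.53} \]
which can be written as,
\[ \delta^{2}\sum_{i=1}^{n} g_i^{2} + \gamma\delta\sum_{i=1}^{n} g_i - n = 0 \tag{3.6.54} \]
Setting the partial derivative with respect to $\gamma$ to zero,
\[ n\gamma + \delta\sum_{i=1}^{n} g_i = 0 \tag{3.6.55} \]
which yields,
\[ \hat\gamma = -\frac{\delta\sum_{i=1}^{n} g_i}{n} = -\delta\,\bar{g} \]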

PAGE 39

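\[ \hat\delta^{2} = \frac{n}{\sum_{i=1}^{n} g_i^{2} - \frac{1}{n}\left[\sum_{i=1}^{n} g_i\right]^{2}} \tag{3.6.58} \]
\[ \phantom{\hat\delta^{2}} = \frac{1}{\mathrm{var}(g)} \tag{3.6.59} \]
where $g_i = g\!\left(\frac{x_i-\xi}{\lambda}\right)$, $\bar{g}$ is the mean and $\mathrm{var}(g)$ is the variance of the values of $g$ defined in Equation 3.4.20.

The partial derivatives of the log-likelihood with respect to $\xi$ and $\lambda$ are not simple. Storer [31] details a lengthy strategy for obtaining the solutions for these parameters. In the maximum likelihood estimation method, Kamziah A. Kudus et al. [18] applied Newton-Raphson iteration to maximize the log-likelihood of the Johnson distribution. They observed that for some samples the log-likelihood function does not have a local maximum with respect to the parameters $\xi$ and $\lambda$. This non-regularity of the likelihood function caused occasional non-convergence of the Newton-Raphson iteration that was used to maximize the log-likelihood [13].

We apply the method of least squares to estimate the parameters $\xi$ and $\lambda$. Consider Equation 3.3.9, which is $x = \xi + \lambda f^{-1}\!\left(\frac{z-\gamma}{\delta}\right)$. For fixed values of $\gamma$ and $\delta$, this equation may be considered as a linear equation with parameters $\xi$ and $\lambda$.

The sum of squares of errors is,
\[ S(\xi,\lambda) = \sum_{i=1}^{n}\left[x_i - \xi - \lambda f^{-1}\!\left(\frac{z_i-\gamma}{\delta}\right)\right]^{2} \tag{3.6.60} \]
To find the values of $\xi$ and $\lambda$ that minimize $S(\xi,\lambda)$, take the partial derivatives of $S(\xi,\lambda)$ with respect to $\xi$ and $\lambda$ and equate them to zero. We then obtain the following two equations, called the normal equations,
\[ \sum_{i=1}^{n} x_i = n\xi + \lambda\sum_{i=1}^{n} f^{-1}\!\left(\frac{z_i-\gamma}{\delta}\right) \tag{3.6.61} \]
\[ \sum_{i=1}^{n} x_i f^{-1}\!\left(\frac{z_i-\gamma}{\delta}\right) = \xi\sum_{i=1}^{n} f^{-1}\!\left(\frac{z_i-\gamma}{\delta}\right) + \lambda\sum_{i=1}^{n}\left[f^{-1}\!\left(\frac{z_i-\gamma}{\delta}\right)\right]^{2} \]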

PAGE 40

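Solving the normal equations we get,
\[ \hat\lambda = \frac{n\sum x_i f^{-1}\!\left(\frac{z_i-\gamma}{\delta}\right) - \sum f^{-1}\!\left(\frac{z_i-\gamma}{\delta}\right)\sum x_i}{n\sum\left[f^{-1}\!\left(\frac{z_i-\gamma}{\delta}\right)\right]^{2} - \left[\sum f^{-1}\!\left(\frac{z_i-\gamma}{\delta}\right)\right]^{2}} \tag{3.6.62} \]
and
\[ \hat\xi = \bar{x} - \hat\lambda\;\mathrm{mean}\!\left[f^{-1}\!\left(\frac{z-\gamma}{\delta}\right)\right] \tag{3.6.63} \]
where $\bar{x}$ is the mean of the x-quantiles and $\bar{z}$ is the mean of the z-quantiles used in the above equations. We start with some initial values of $\xi$ and $\lambda$. These initial values may be taken as the estimates obtained by any one of the previous methods. Then estimates of $\gamma$ and $\delta$ are calculated using Equations 3.6.57 and 3.6.59. Once the estimates of $\gamma$ and $\delta$ are obtained, Equations 3.6.63 and 3.6.62 can be used to revise the estimates of $\xi$ and $\lambda$.

Now we may repeat these steps with Equations 3.6.62, 3.6.63, 3.6.57 and 3.6.59, each time using the most recent estimates. Keep track of the Residual Sum of Squares (RSS) and after a few steps choose the estimate with the minimum RSS value.

The algorithm can be summarized in the following steps.

1. Assign initial values for $\xi$ and $\lambda$. The initial values may be derived by the percentile or quantile method.
2. Estimate $\gamma$ and $\delta$ using Equations 3.6.57 and 3.6.59.
3. Estimate $\xi$ and $\lambda$ using Equations 3.6.62 and 3.6.63.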

PAGE 41

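5. Repeat steps 2 to 4, each time using the most recent estimates of the parameters. Choose the estimates with minimum RSS.

For the SL family, we will consider the transformation in Equation 3.2.2, so that only the three parameters $\gamma^{*}$, $\delta$ and $\xi$ are included. The probability density function can be given by,
\[ p(x) = \frac{\delta}{\sqrt{2\pi}\,(x-\xi)}\; e^{-\frac{1}{2}\left[\gamma^{*} + \delta\ln(x-\xi)\right]^{2}} \tag{3.6.64} \]
The likelihood function is,
\[ L = \frac{\delta^{n}}{(2\pi)^{n/2}\prod_{i=1}^{n}(x_i-\xi)}\; e^{-\frac{1}{2}\sum\left[\gamma^{*} + \delta\ln(x_i-\xi)\right]^{2}} \tag{3.6.65} \]
Setting the partial derivative of the log-likelihood with respect to $\delta$ to zero we get,
\[ \frac{n}{\delta} - \delta\sum\left[\ln(x_i-\xi)\right]^{2} - \gamma^{*}\sum\ln(x_i-\xi) = 0 \tag{3.6.66} \]
which can be written as,
\[ \delta^{2}\sum\left[\ln(x_i-\xi)\right]^{2} + \gamma^{*}\delta\sum\ln(x_i-\xi) - n = 0 \tag{3.6.67} \]
Setting the partial derivative of the log-likelihood with respect to $\gamma^{*}$ to zero,
\[ n\gamma^{*} + \delta\sum\ln(x_i-\xi) = 0 \]
which gives,
\[ \hat\gamma^{*} = -\delta\,\frac{1}{n}\sum\ln(x_i-\xi) \tag{3.6.70} \]
Using Equation 3.6.70 in Equation 3.6.67 and solving for $\delta$,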

PAGE 42

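where $g = \ln(x-\xi)$. The estimates in Equations 3.6.70 and 3.6.72 depend on $\xi$.

To estimate $\xi$, as before, we will use the method of least squares in the equation
\[ x = \xi + f^{-1}\!\left(\frac{z-\gamma^{*}}{\delta}\right) \]
The sum of squares of errors is,
\[ S(\xi) = \sum_{i=1}^{n}\left[x_i - \xi - f^{-1}\!\left(\frac{z_i-\gamma^{*}}{\delta}\right)\right]^{2} \]
To find the value of $\xi$ that minimizes $S(\xi)$ we obtain,
\[ \frac{dS}{d\xi} = -2\sum_{i=1}^{n}\left(x_i - \xi - f^{-1}\!\left(\frac{z_i-\gamma^{*}}{\delta}\right)\right) \]
Setting this derivative equal to zero, we have,
\[ \hat\xi = \bar{x} - \mathrm{mean}\!\left[f^{-1}\!\left(\frac{z-\gamma^{*}}{\delta}\right)\right] \]
Here the same situation arises as in the case of the SU and SB distributions: the estimate of $\xi$ depends on $\gamma^{*}$ and $\delta$ and vice versa. So we start with some initial value of $\xi$ to estimate $\gamma^{*}$ and $\delta$, then use these estimated values to estimate $\xi$. Repeat this procedure, keeping track of the RSS, and choose the estimate with the least RSS.

3.7 Comparison of Estimation Methods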

PAGE 43

The mean and the Mean Square Error (MSE) of the estimated values for the SB family are listed in Table 3.2 for comparison purposes. It can be observed that the averages of the estimates are close to the true values of the parameters and that, in general, the MSEs of the estimates are smaller for the proposed method than for the other methods. This may be because the MLE-least squares approach uses all the available data, while the quantile method uses only five quantiles and the percentile method uses only four quantiles. The major disadvantage of moment matching is the vulnerability of the sample third and fourth moments to outliers.
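The MLE-least squares iteration of Section 3.6 can be sketched for the SU family as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name is hypothetical, the z-quantiles are paired with the sorted data through the plotting positions $(i - \tfrac{1}{2})/n$, and the RSS computed in each pass stands in for the step that is missing from the algorithm summary.

```python
import math
from statistics import NormalDist, fmean, pvariance


def fit_su_mle_ls(data, xi0, lam0, n_iter=20):
    """Alternate closed-form MLE of (gamma, delta) with least squares for (xi, lam).

    Returns (rss, gamma, delta, xi, lam) for the pass with the smallest RSS.
    """
    x = sorted(data)
    n = len(x)
    # Assumption: z-quantiles taken at plotting positions (i - 0.5)/n.
    z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    xbar = fmean(x)
    xi, lam = xi0, lam0
    best = None
    for _ in range(n_iter):
        # MLE step: for SU, g(y) = asinh(y); delta = 1/sd(g), gamma = -delta * mean(g).
        g = [math.asinh((v - xi) / lam) for v in x]
        delta = 1.0 / math.sqrt(pvariance(g))
        gamma = -delta * fmean(g)
        # Least squares step: regress x on u = f^{-1}((z - gamma)/delta) = sinh(...).
        u = [math.sinh((zi - gamma) / delta) for zi in z]
        ubar = fmean(u)
        sxu = sum((a - xbar) * (b - ubar) for a, b in zip(x, u))
        suu = sum((b - ubar) ** 2 for b in u)
        lam = sxu / suu
        xi = xbar - lam * ubar
        # Residual sum of squares for the current estimates.
        rss = sum((a - xi - lam * b) ** 2 for a, b in zip(x, u))
        if best is None or rss < best[0]:
            best = (rss, gamma, delta, xi, lam)
    return best
```

The initial values `xi0` and `lam0` would normally come from the percentile or quantile method. The SU family is used here because $g$ is defined for every real argument, so no domain checks are needed during the iteration; an SB version would have to keep every observation inside $(\xi, \xi + \lambda)$.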

PAGE 44

Case  Parameter  True Value  Percentile method  Quantile method   New Approach
1                            0.998 (0.167)      1.063 (0.409)     0.997 (0.026)
                             1.001 (0.059)      1.024 (0.083)     0.997 (0.026)
                             10.047 (0.085)     9.982 (0.131)     9.93 (0.08)
                             10.049 (5.92)      10.402 (14.37)    10.57 (4.99)
2                            0.503 (0.009)      0.503 (0.0493)    0.494 (0.007)
                             0.505 (0.003)      0.519 (0.023)     0.507 (0.001)
                             9.11 (4.038)       9.97 (0.077)      10.004 (0.004)
                             10.005 (0.285)     10.094 (1.614)    9.868 (2.056)
3                            1.032 (0.065)      1.01 (0.015)      1.016 (0.017)
                             0.507 (0.0039)     0.5006 (0.0013)   0.509 (0.002)
                             9.698 (.488)       10.001 (0.001)    10.001 (0.001)
                             10.355 (4.63)      10.085 (0.69)     9.86 (0.70)
4                            0.558 (0.287)      0.539 (0.136)     0.561 (0.165)
                             1.013 (0.191)      1.024 (0.108)     1.055 (0.115)
                             9.82 (1.097)       9.94 (0.55)       9.91 (0.52)
                             10.31 (15.4)       10.30 (8.2)       9.83 (0.50)

Table 3.2: Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SB family based on 20 samples of size 2000

PAGE 47

Case  Parameter  True Value  Percentile method  Quantile method   New Approach
1                            0.04 (0.32)        0.015 (0.05)      0.015 (0.05)
                             1.41 (3.3)         2.08 (0.34)       2.05 (0.29)
                             10.24 (8.9)        10.1 (1.5)        10.1 (1.4)
                             12.3 (99.9)        10.5 (12.6)       10.3 (10.1)
2                            0.82 (2.9)         0.52 (0.11)       0.51 (0.09)
                             2.47 (3.23)        2.08 (0.45)       2.06 (0.37)
                             11.51 (64.6)       10.06 (2.79)      10.04 (2.59)
                             12.07 (56.5)       10.35 (12.6)      10.25 (11.22)
3                            -0.003 (0.003)     0.005 (0.002)     0.003 (0.002)
                             1.033 (0.006)      0.99 (0.003)      0.99 (0.002)
                             10.03 (.43)        10.05 (0.25)      10.06 (0.25)
                             10.45 (1.43)       9.82 (0.7)        9.75 (0.73)
4                            0.514 (0.009)      0.488 (0.006)     0.487 (0.007)
                             1.008 (0.006)      0.999 (0.006)     0.996 (0.006)
                             10.243 (1.203)     9.95 (0.9)        9.94 (1.05)
                             10.06 (0.96)       10.06 (1.13)      10.02 (1.43)

Table 3.3: Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SU family based on 20 samples of size 2000

PAGE 48

We have tested this with many other combinations of parameter values, and the results are similar. In Figures 3.5 through 3.7, even though for a particular family another method may seem to approximate as well as the proposed method, the curve resulting from the MLE-least squares method closely follows the true curve in all three cases (SB, SU and SL) of the Johnson distribution.

PAGE 49

Case  Parameter            True Value  Percentile method  Quantile method   New Approach
1                                      -1.353 (0.051)     -1.29 (0.027)     1.303 (0.04)
      $(\gamma, \lambda)$  (1, 10)     1.012 (0.006)      0.97 (0.008)      1.012 (0.008)
                                       -0.98 (0.14)       0.53 (0.057)      0.53 (0.057)
2                                      -2.24 (0.04)       -2.26 (0.01)      -2.21 (0.07)
      $(\gamma, \lambda)$  (0, 10)     0.98 (0.003)       0.98 (0.002)      0.98 (0.007)
                                       0.18 (0.41)        0.22 (0.36)       0.33 (0.28)
3                                      -6.53 (22.9)       -5.26 (18.13)     -5.47 (12.36)
      $(\gamma, \lambda)$  (1, 10)     3.18 (2.28)        2.66 (3.66)       2.87 (1.42)
                                       -0.503 (15.28)     0.72 (18.3)       0.504 (7.17)
4                                      -3.78 (3.26)       -3.45 (0.99)      -3.45 (1.63)
      $(\gamma, \lambda)$  (1, 10)     2.06 (0.35)        1.88 (0.35)       1.97 (0.21)
                                       -0.13 (4.12)       0.43 (4.41)       0.29 (1.67)

Table 3.4: Mean and (Mean Square Error - MSE) of parameter estimates for Johnson SL family based on 20 samples of size 2000

PAGE 53

           1      2      3      4      5      6      ...   55
Response   YES    YES    NO     YES    NO     YES    ...   NO
Gene ID
1          7.56   7.39   7.12   7.49   8.09   8.07   ...   8.50
2          8.39   8.14   8.05   8.36   9.01   9.22   ...   9.53
3          7.44   7.16   7.04   7.28   7.90   8.11   ...   8.56
4          8.93   8.62   8.39   8.77   9.26   9.46   ...   9.71
5          8.29   7.89   7.73   7.96   8.58   8.79   ...   9.00
6          8.74   8.28   8.17   8.53   8.98   9.38   ...   9.81
7          11.24  11.24  10.93  11.06  11.73  12.17  ...   11.89
8          11.63  11.50  11.20  11.50  12.12  12.49  ...   12.32
9          11.85  11.99  11.59  11.78  12.61  12.85  ...   12.82
...        ...    ...    ...    ...    ...    ...    ...   ...
22283      3.77   3.79   3.68   3.72   3.92   3.88   ...   3.85

Table 4.1: Ovarian cancer data

cancers. Currently, the standard treatment protocol used in the initial management of advanced-stage ovarian cancer is primary cytoreductive surgery followed by primary platinum-based chemotherapy. However, approximately 30% of patients with advanced stage disease do not demonstrate a complete response to primary platinum-based therapy. Identifying genes which are expressed significantly differently in the two groups could provide some insight for the precise diagnosis of response to the treatment and help the medical specialists to choose an alternate therapy when needed. The ovarian cancer tissue samples involved in this study were collected from the tumor banks at the H. Lee Moffitt Cancer Centre & Research Institute and Duke University Medical Center. Affymetrix U133A GeneChip arrays were used to measure expression of 22,283 genes in advanced stage serous ovarian cancers from 55 patients who underwent primary surgery followed by platinum-based chemotherapy. Expression values are calculated using the robust multi-array (RMA) algorithm [15] implemented in the Bioconductor (http://www.bioconductor.org) extensions to the R statistical programming environment [14]. Gene expressions were compared between patients who demonstrated a complete response to platinum-based therapy and those who did not, to identify differentially expressed genes.

PAGE 55

where $d_1 = 1$ for $0 \le j \le r_1$; $d_1 = 0$ for $r_1+1 \le j \le r_1+r_2$ and $d_2 = 0$ for $0 \le j \le r_1$; $d_2 = 1$ for $r_1+1 \le j \le r_1+r_2$

PAGE 56

follows a Student's t-distribution, with $r_1 + r_2 - 2$ degrees of freedom. Here, $\bar{x}_{1i}$ and $\bar{x}_{2i}$ are the mean expression levels of gene $i$ in the $r_1$ replicated samples of condition 1 and the $r_2$ replicated samples of condition 2 respectively; $S_1^2$ and $S_2^2$ are the sample variances of gene $i$ under these two conditions.

If $t_i$ exceeds the threshold value for a specific confidence level (e.g. 99%), the expression levels of gene $i$ at conditions 1 and 2 will then be considered to be different.

Because in the t-test the distance between the population means is normalized by the empirical standard deviations, this has the potential for addressing some of the shortcomings of the fold ratio approach.

Wilcoxon Test

PAGE 57

where $\bar{x}_{1i}$, $\bar{x}_{2i}$ are the average gene expression levels of gene $i$ in the two conditions respectively and $s_0$ is a small positive constant. The value of $s_0$ is chosen as the one which minimizes the coefficient of variation of the statistic $d_i$. The resulting value of $d_i$ is called the "observed d value". To determine the significance of this value, SAM estimates the "expected" d value if there were no difference between the specimen classes. This is done by permuting, or randomly changing, the class labels without changing the data and recalculating the SAM value for each probe set [24]. After a large number of permutations, the result estimates the value that would be obtained if the difference in gene expression were due to chance alone. This is the "expected d value". The significance of the observed differential gene expression can be estimated by comparing the observed and expected d values. A user-defined threshold or "delta" (i.e., observed d value - expected d value) can be adjusted to select genes for which the observed d value exceeds (for upregulated genes) or falls below (for downregulated genes) the expected d value by more than delta. The greater the "delta", the greater the stringency of the result and the lower the false discovery rate. For each delta value, the SAM output consists of a gene or probe set list and an associated false discovery rate. The false discovery rate is estimated from the distribution of expected and observed d values. Table 4.2 lists some choices of delta and the corresponding numbers of significant genes and estimated false positives for the ovarian cancer data in Section 4.2.
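The permutation scheme described above can be sketched as follows. The formula for $d_i$ does not survive in the text, so this sketch assumes the usual SAM form $d_i = (\bar{x}_{1i} - \bar{x}_{2i})/(s_i + s_0)$ with $s_i$ the pooled standard error of the difference; the function names and the fixed value of `s0` are illustrative, and the selection of $s_0$ by minimizing the coefficient of variation is not shown.

```python
import math
import random


def d_statistics(x1, x2, s0):
    """SAM-style statistic for each gene; x1 and x2 hold one row of values per gene."""
    n1, n2 = len(x1[0]), len(x2[0])
    out = []
    for a, b in zip(x1, x2):
        m1, m2 = sum(a) / n1, sum(b) / n2
        ss = sum((v - m1) ** 2 for v in a) + sum((v - m2) ** 2 for v in b)
        # Assumed pooled standard error of the difference in means.
        s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
        out.append((m1 - m2) / (s + s0))
    return out


def expected_d(x1, x2, s0, n_perm=100, seed=0):
    """Average the ordered d values over random permutations of the class labels."""
    rng = random.Random(seed)
    n1 = len(x1[0])
    rows = [list(a) + list(b) for a, b in zip(x1, x2)]
    cols = list(range(len(rows[0])))
    totals = [0.0] * len(rows)
    for _ in range(n_perm):
        rng.shuffle(cols)  # relabel the samples, leaving the data unchanged
        p1 = [[r[c] for c in cols[:n1]] for r in rows]
        p2 = [[r[c] for c in cols[n1:]] for r in rows]
        for k, d in enumerate(sorted(d_statistics(p1, p2, s0))):
            totals[k] += d
    return [t / n_perm for t in totals]
```

Genes would then be called significant when their ordered observed d value differs from the corresponding expected d value by more than delta. With `s0 = 0` the statistic reduces to the pooled-variance two-sample t statistic.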

PAGE 58

Delta    No. of significant genes    Estimated false positives
0.2      13303                       9241.5
0.4      11179                       7187
0.6      7804                        3335
0.8      4088                        835
0.81     3641                        793
0.82     3402                        692.5
0.83     3089                        585.5
0.84     2804                        500
0.85     2417                        340
0.851    2178                        292
0.852    2164                        288
0.853    2159                        287
0.854    2144                        283
0.855    2142                        283
0.856    0                           0
Table 4.2: SAM - No. of significant genes and estimated false positives for different choices of delta.
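As a rough reading aid for Table 4.2, the ratio of estimated false positives to significant genes can be computed for a few rows. The sketch below assumes that this plain ratio is an adequate stand-in for the false discovery rate reported by SAM; the function name is hypothetical.

```python
# (delta, number of significant genes, estimated false positives),
# copied from a few rows of Table 4.2
sam_rows = [
    (0.2, 13303, 9241.5),
    (0.8, 4088, 835.0),
    (0.855, 2142, 283.0),
    (0.856, 0, 0.0),
]


def estimated_fdr(n_significant, n_false_positives):
    """Estimated false positives divided by the number of genes called."""
    if n_significant == 0:
        return None  # undefined when no gene is selected
    return n_false_positives / n_significant


for delta, n_sig, n_fp in sam_rows:
    print(delta, n_sig, estimated_fdr(n_sig, n_fp))
```

Under this assumption the ratio is about 0.69 at delta = 0.2 and about 0.13 at delta = 0.855.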

PAGE 63

where a0 is a shrinkage parameter which depends on the sj values. The value mj is the m value for gene j. The constant a0 in the denominator of Equation 4.4.5 can reduce the overall variance of the mj, giving the tests more power on average. It has the added effect of dampening large values of the t-statistic that arise from genes with small variance. We have taken a0 as the median of the sj values. Figure 4.3 shows the histograms of the t-values and the m-values, illustrating the shrinkage of the t-values when the m-formula is used.

The idea of modifying estimators of variance has been presented by others in similar contexts. The SAM t-test [33] adds a small constant to the gene-specific variance estimate in order to stabilize the small variances. In SAM, the fudge factor s0 is chosen as the value which minimizes the coefficient of variation of the SAM statistic di. The regularized t-test
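The shrinkage effect of a0 described above can be illustrated with a short Python sketch. It assumes that sj is the pooled standard error of the difference in group means and that the m-value has the form (difference in means) / (sj + a0), with a0 the median of the sj values as stated in the text; the exact form of Equation 4.4.5 is not reproduced here, and the function names are hypothetical.

```python
from statistics import mean, median, variance


def pooled_se(x1, x2):
    """Pooled standard error of the difference between two group means."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (pooled_var * (1.0 / n1 + 1.0 / n2)) ** 0.5


def t_and_m_values(group1, group2):
    """group1[j] and group2[j] hold the values of gene j in each condition.

    Assumes every gene has a non-zero pooled standard error.
    """
    diffs = [mean(a) - mean(b) for a, b in zip(group1, group2)]
    s = [pooled_se(a, b) for a, b in zip(group1, group2)]
    a0 = median(s)  # shrinkage constant taken as the median of the s_j
    t = [d / sj for d, sj in zip(diffs, s)]
    m = [d / (sj + a0) for d, sj in zip(diffs, s)]
    return t, m, a0


g1 = [[1.0, 1.1, 0.9], [5.0, 5.5, 4.5], [2.0, 2.01, 1.99]]
g2 = [[2.0, 2.1, 1.9], [5.2, 5.6, 4.9], [2.05, 2.06, 2.04]]
t_vals, m_vals, a0 = t_and_m_values(g1, g2)
```

In this toy input the third gene has a tiny variance, so its t-value is large in magnitude even though the difference in means is only 0.05; the corresponding m-value is much smaller.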

PAGE 65

As in summarizing the gene-by-gene information, one can decide to treat each gene individually or borrow strength across the genes to estimate the distribution f0. If we calculate the distribution individually for each gene, then we essentially treat each gene as a different experiment. The family null hypothesis H0 for a microarray study states that none of the genes is differentially expressed. Under this H0, it is plausible to assume that the m-values derived from permutations of group labels are drawn independently from a common distribution with some probability density function, say, f0. In other words, we can pool the permutation m-values across the genes and across permutations. The pooling of permutation test results across genes offers a refined basis for testing for differential expression that is free of the limitation imposed by applying the test to each gene in isolation [26].

To estimate the distribution f0 we make use of the fact that any continuous distribution can be approximated by the Johnson system. Any method detailed in the previous chapter can be used to fit a Johnson system to the pooled statistics and to estimate the parameters of the system. In the following discussion we use Wheeler's quantile method for fitting the Johnson system. An unbounded Johnson distribution is fitted to form the common null distribution, with parameter estimates 0.1059483, 2.690686, 0.05775453 and

PAGE 66

1.301382. Now the p-values of the calculated m-values can be obtained using the estimated distribution f0. The genes with p-value < 0.01 are considered differentially expressed.

The results from the ovarian cancer data are listed in Table 4.3. The table shows the number of genes selected as differentially expressed out of the 22,283 genes in the ovarian cancer data.

Method      No. of genes
Johnson     537
SAM         2166
t           738
Wilcoxon    972
Table 4.3: No. of genes selected by different methods

Because the true set of differentially expressed genes is not known for the ovarian cancer data, the accuracy of the results obtained for the real data cannot be assessed. In order to assess the effectiveness of the proposed methodology and to obtain a quantitative evaluation of gene selection methods, the simulated data explained in Section 4.2 is used.

Results from simulated data are displayed in Figures 4.4 to 4.9 to compare the methods described in this chapter. In Figures 4.4 to 4.6, the number of truly differentially expressed genes is plotted against the number of genes selected, for different sample sizes. It can be noted from these figures that, for any number of genes selected, the proposed method using the Johnson distribution provides the highest number of truly differentially expressed genes. It can also be observed from these figures that the top genes, which are the most significant, are selected by all the methods. Beyond some point, the other methods produce more false positives than the proposed method using the Johnson system. However, asymptotically all the methods behave similarly. In the proposed method, information across the genes is used to estimate f0, while t, Wilcoxon and SAM consider gene-by-gene information. This may be the reason for having fewer false positives when the proposed method is used.
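The p-value step described earlier in this section can be sketched as follows. The sketch rests on two assumptions that are not stated in the text: the four fitted values are read, in the order reported, as the parameters (gamma, delta, xi, lambda) of the unbounded (SU) Johnson distribution, and the p-value is taken as two-sided. The function names are hypothetical.

```python
import math
from statistics import NormalDist

# ASSUMPTION: reported estimates interpreted, in order, as gamma, delta, xi, lambda
GAMMA, DELTA, XI, LAMBDA = 0.1059483, 2.690686, 0.05775453, 1.301382


def johnson_su_cdf(x, gamma=GAMMA, delta=DELTA, xi=XI, lam=LAMBDA):
    """Johnson SU CDF: Phi(gamma + delta * asinh((x - xi) / lambda))."""
    z = gamma + delta * math.asinh((x - xi) / lam)
    return NormalDist().cdf(z)


def two_sided_p_value(m):
    """Two-sided tail probability of an m-value under the fitted null f0."""
    f = johnson_su_cdf(m)
    return 2.0 * min(f, 1.0 - f)


def select_genes(m_values, alpha=0.01):
    """Indices of genes whose p-value falls below alpha."""
    return [j for j, m in enumerate(m_values) if two_sided_p_value(m) < alpha]
```

For example, `select_genes([0.0, 5.0, -5.0, 0.3])` keeps only the two extreme m-values under these assumed parameters.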

PAGE 70

Number of Equally Expressed genes. This is the same as the probability of Type I error, denoted by α. The false negative rate is the proportion of Differentially Expressed genes that were erroneously reported as Equally Expressed. More specifically,

False Negative Rate = (Number of false negatives) / (Number of Differentially Expressed genes).

This is the same as the probability of Type II error. It is equal to 1 minus the power of the test.

A method whose ROC curve lies below another one is preferred [20], as the curve represents the Type I and Type II errors. A method which has a better ROC curve, in this sense, will produce top lists with more differentially expressed genes (DEGs), fewer non-DEGs and, consequently, will leave out fewer DEGs. For any fixed Type I error, the Type II error will be lowest for the lowest ROC curve.

In Figures 4.7 to 4.9, the ROC curves of the methods discussed in this chapter are given. The false positive rate and the false negative rate both range from 0 to 1, as both are probabilities. As expected, when either the false positive rate or the false negative rate is close to one, the curves converge. The comparison between methods should be based on the part of the curves where the false positive rate and the false negative rate are close to zero (that is, near the origin). The ROC curve of the proposed method using the Johnson distribution lies below the ROC curves of the other methods near the origin, showing that the proposed method is better than the other methods. Better performance can also be observed as the sample size increases. For example, in Figure 4.9, where the sample sizes are 33 vs 22, when the false positive rate is 0.05, the false negative rate is 0.05 in the Johnson ROC. But in Figure 4.7, where the sample sizes are 15 vs 10, the possible minimum values of both the false positive rate and the false negative rate are approximately 0.1 each.
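The two error rates that make up these ROC curves can be computed directly when the truth is known, as in the simulated data. The sketch below is a minimal illustration with hypothetical function names; it assumes genes are selected when the absolute value of a score reaches a threshold.

```python
def error_rates(selected, truly_de, n_genes):
    """Return (false positive rate, false negative rate).

    selected: indices reported as differentially expressed
    truly_de: indices that are truly differentially expressed
    """
    selected, truly_de = set(selected), set(truly_de)
    n_de = len(truly_de)
    n_ee = n_genes - n_de                      # equally expressed genes
    false_pos = len(selected - truly_de)
    false_neg = len(truly_de - selected)
    return false_pos / n_ee, false_neg / n_de


def roc_points(scores, truly_de, thresholds):
    """One (FPR, FNR) point per threshold on the absolute score."""
    points = []
    for cut in thresholds:
        selected = [j for j, s in enumerate(scores) if abs(s) >= cut]
        points.append(error_rates(selected, truly_de, len(scores)))
    return points
```

With ten genes, four of them truly differentially expressed, `error_rates([0, 1, 2, 5], [0, 1, 2, 3], 10)` counts one false positive among six equally expressed genes and one false negative among four differentially expressed genes.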

PAGE 76

Let $p$ denote the fraction of genes that are differentially expressed (DE). Then $1 - p$ denotes the fraction of genes equivalently expressed (EE). An EE gene $j$ presents data $x_j = (x_{j1}, x_{j2}, \ldots, x_{jI})$ according to a distribution

$$f_0(x_j) = \int \left( \prod_{i=1}^{I} f_{obs}(x_{ji} \mid \mu) \right) \pi(\mu)\, d\mu \qquad (5.2.1)$$

Alternatively, if gene $j$ is differentially expressed, the data $x_j = (x_{jg_1}, x_{jg_2})$ are governed by the distribution

$$f_1(x_j) = f_0(x_{jg_1})\, f_0(x_{jg_2}) \qquad (5.2.2)$$

owing to the fact that different mean values govern the different subsets $x_{jg_1}$ and $x_{jg_2}$ of samples, where $g_1$ contains indices from samples in group 1 and $g_2$ contains indices from samples in group 2. The marginal distribution of the data becomes

$$p f_1(x_j) + (1 - p) f_0(x_j) \qquad (5.2.3)$$

With estimates of $p$, $f_0$ and $f_1$, the posterior probability of differential expression is calculated by Bayes' rule as

$$\frac{p f_1(x_j)}{p f_1(x_j) + (1 - p) f_0(x_j)}$$

In EBarrays two model specifications of the general mixture model in Equation 5.2.3 are considered, namely, the Gamma-Gamma (GG) and Lognormal-Normal (LNN) models. In the GG model, the observation components are from a Gamma distribution having shape parameter $\alpha > 0$ and a mean value $\mu_j$, thus with scale parameter $\lambda_j = \alpha / \mu_j$, for measurements $x > 0$. The prior distribution $\pi(\mu_j)$ for $\mu_j$ is chosen as an

PAGE 77

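where

$$K = \frac{\nu^{\alpha_0}\, \Gamma(I\alpha + \alpha_0)}{\Gamma^{I}(\alpha)\, \Gamma(\alpha_0)}$$

In the Lognormal-Normal (LNN) model, the gene-specific mean $\mu_j$ is a mean for the log-transformed measurements, which are presumed to have a normal distribution with common variance $\sigma^2$. A conjugate prior for the $\mu_j$ is normal with some underlying mean $\mu_0$ and variance $\tau_0^2$. Integrating as in 5.2.1, the density $f_0(\cdot)$ for an $n$-dimensional input becomes Gaussian with mean vector $\tilde{\mu}_0 = (\mu_0, \mu_0, \ldots, \mu_0)^t$ and exchangeable covariance matrix $\Sigma_n = (\sigma^2) I_n + (\tau_0^2) M_n$, where $I_n$ is an $n \times n$ identity matrix and $M_n$ is an $n \times n$ matrix of ones [19].

For both models, the method of maximum marginal likelihood is used to obtain estimates of the unknown hyperparameters and the mixing proportion $p$. The marginal log likelihood is a sum over genes $j$ of the logarithms of the terms in 5.2.3,

$$l(\theta) = \sum_j \log\left[ p f_1(x_j) + (1 - p) f_0(x_j) \right] \qquad (5.2.7)$$

The maximum likelihood estimates of the parameters are obtained via the Expectation-Maximization (EM) algorithm. This empirical Bayes hierarchical modeling approach is implemented in R as the EBarrays package.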
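To make the roles of the mixing proportion and the posterior probability concrete, the following Python sketch runs EM for p alone. It is a simplification under a stated assumption: the densities f0 and f1 have already been evaluated for every gene (that is, the hyperparameters are treated as known), whereas EBarrays also estimates the hyperparameters. The function names are hypothetical.

```python
import math


def posterior_de(p, f1, f0):
    """Posterior probability of differential expression by Bayes' rule."""
    num = p * f1
    return num / (num + (1.0 - p) * f0)


def marginal_log_likelihood(p, f1_vals, f0_vals):
    """Sum over genes of log[p * f1 + (1 - p) * f0]."""
    return sum(math.log(p * a + (1.0 - p) * b) for a, b in zip(f1_vals, f0_vals))


def em_mixing_proportion(f1_vals, f0_vals, p=0.5, n_iter=500, tol=1e-12):
    """EM updates for p with f0 and f1 held fixed (assumption)."""
    for _ in range(n_iter):
        # E-step: posterior probability of DE for every gene
        post = [posterior_de(p, a, b) for a, b in zip(f1_vals, f0_vals)]
        # M-step: the new p is the average posterior probability
        new_p = sum(post) / len(post)
        if abs(new_p - p) < tol:
            return new_p
        p = new_p
    return p
```

Each update cannot decrease the marginal log likelihood, which gives a simple sanity check when experimenting with the function.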

PAGE 78

EBAM uses Bayes' theorem to obtain the posterior probability that a gene is differentially expressed.

PAGE 79

The value of p0 is chosen in such a way that all posterior probabilities are positive [7]. We make use of the estimated value of p0 used in EBAM [7] under the same criterion. Now we are able to calculate the posterior probability using Equation 5.3.10. The genes with posterior probabilities greater than 0.8 are chosen as the differentially expressed genes.

PAGE 80

Figure 5.1 shows the posterior probabilities against the m-values. The green spots correspond to the genes with posterior probability > 0.8.

Both EBarrays and EBAM share information among genes. One drawback of EBarrays is the assumption of a parametric model for gene expressions, and hence a good chance that the assumptions are violated. EBarrays assumes that the gene expressions

PAGE 82

Method           No. of genes
Johnson          537
SAM              2166
t                738
Wilcoxon         972
Bayes-Johnson    543
EBArrays-GG      191
EBArrays-LNN     177
EBAM             223
Table 5.1: No. of genes selected by different methods

5.5 Results

Because the true set of differentially expressed genes is not known for the ovarian cancer data, the accuracy of the results obtained for the real data cannot be assessed. In order to assess the effectiveness of the proposed methodology and to obtain a quantitative evaluation of gene selection methods, the simulated data explained in Section 4.2 is used. A comparison of the methods discussed in this chapter is presented in Figures 5.2 to 5.9. The number of truly differentially expressed genes is plotted against the number of genes selected in Figures 5.2 to 5.4. We can observe that, for any number of genes selected (x-coordinate), the proposed method using the Johnson system of distributions gives more true positives than the other methods. Results for the simulated data from all methods discussed in this dissertation are shown in

PAGE 88

Number of Equally Expressed genes. This is the same as the probability of Type I error, denoted by α. The false negative rate is the proportion of Differentially Expressed genes that were erroneously reported as Equally Expressed. More specifically,

False Negative Rate = (Number of false negatives) / (Number of Differentially Expressed genes).

This is the same as the probability of Type II error. It is equal to 1 minus the power of the test. A method whose ROC curve lies below another one is preferred [20], as the curve represents the Type I and Type II errors. A method which has a better ROC curve, in this sense, will produce top lists with more differentially expressed genes (DEGs), fewer non-DEGs and, consequently, will leave out fewer DEGs. It can be observed from Figures 5.7 to 5.9 that the proposed approach using Johnson's system of distributions is better than the other methods discussed here.

In Figures 5.5 and 5.6 the number of truly differentially expressed genes is plotted against the number of genes selected, for a comparison of all methods discussed in the dissertation. It can be noted that, for any fixed number of genes selected, the proposed methods using the Johnson distribution yield the highest number of truly differentially expressed genes.

PAGE 93

The dissertation also discusses the Johnson system of distributions in addition to the methods of gene selection. We have developed a new algorithm for the estimation of parameters, which can be applied to all three families of the Johnson distribution.

An important application of microarray data is to classify biological samples or predict clinical or other outcomes. This is an interesting statistical problem for future research. We also see a need for a better statistic that can measure the variation in the expression levels of genes more accurately. This is another interesting problem for future research.

PAGE 94

[1] Alberts B., Johnson A., Lewis J., Raff M., Roberts K., Walter P. (2002). Molecular biology of the cell, Garland Publishing, 4th ed.
[2] Bailin Zhou and John P. McTague (1996). Can. J. For. Research 26; 928-935.
[3] Baldi, P. and Long, A.D. (2001). A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17, 509-519.
[4] Chakravarti, Laha, and Roy (1967). Handbook of Methods of Applied Statistics, Volume I, John Wiley and Sons, pp. 392-394.
[5] Draper J. Properties of Distributions Resulting from Certain Simple Transformations of the Normal Distribution. Biometrika, Vol. 39, No. 3/4 (1952), pp. 290-301.
[6] Devore J., Peck R. Statistics: the Exploration and Analysis of Data, 3rd edn, Duxbury Press, Pacific Grove CA; 1997.
Griffiths A.J.F, Miller J.H, Suzuki D.T, Lewontin R.C. and Gelbart W.M. (2000). An introduction to genetic analysis, Freeman, New York.
[7] Efron B. Robbins, Empirical Bayes and Microarrays. The Annals of Statistics 2003; 31: 366-378.
[8] Efron B., Tibshirani R., Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment, Journal of the American Statistical Association 2001; 96: 1151-1160.

PAGE 95

[9] Efron B., Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 2002; 23: 70-86.
[10] Golub T.R., Slonim D.K., Tamayo P. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
[11] Hahn J. Gerald and Shapiro S. Samuel (1967). Statistical Models in Engineering, John Wiley and Sons.
[12] Hill I.D., Hill R., Holder R.L. Fitting Johnson Curves by Moments. Applied Statistics, vol. 25, No. 2 (1976), pp. 180-189.
[13] Hosking J.R.M., Wallis J.R. and Wood E.F. Estimation of the generalized extreme-value distribution by the method of probability-weighted moments. Technometrics, 27(3): 251-261.
[14] Ihaka R, Gentleman R, "A language for data analysis and graphics", J. Comput. Graph. Stat., 5, 1996: 299-314.
[15] Irizarry RA, Hobbs B, Collin F., Beazer-Barclay YD, Antonellis KJ, Scherf U et al. "Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data" Biostatistics, 4(2), 2003: 249-264.
[16] Johnson, N.L. (1949). Systems of frequency curves generated by methods of translation. Biometrika 36, 149-176.
[17] Johnson N.L, Kotz S. and Balakrishnan N. (1994). Continuous Univariate Distributions, volume 1 and volume 2.
[18] Kamziah A. Kudus, Ahmad M.I, Jarin L. Nonlinear regression approach to estimating Johnson SB parameters for diameter data. Can. J. For. Res. (1999), 29: 310-314.

PAGE 96

[19] Kendziorski, C.M., M.A. Newton, H. Lan, and M.N. Gould. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine 2003; 22: 3899-3914.
[20] Lonnstedt I, Speed TP; Replicated Microarray data. Stat Sinica 2002, 12: 31-46.
[21] Mei-Ling Ting Lee (2004). Analysis of gene expression data, Kluwer Academic Publishers.
[22] Mordell, L.J. The value of the definite integral, Quarterly Journal of Mathematics, Oxford, 48 (1920): 329-342.
[23] Ola Larson, Claes Wahlestedt and James Timmons. Considerations when using the significance analysis of microarrays (SAM) algorithm.
[24] Pan, W.; Lin, J.; Le, C. A Mixture Model Approach to Detecting Differentially Expressed Genes with Microarray Data. Functional & Integrative Genomics 2003, volume 3, Number 3.
[25] Parmigiani G., E.S. Garrett, R.A. Irizarry, S.L. Zeger (2003). The Analysis of Gene Expression Data. Springer, New York.
[26] Pierre Baldi, Anthony D. Long (2001). A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes, Bioinformatics vol. 17 no. 6.
[27] Pearson E.S., Mathematical contributions to the theory of evolution, XIX, second supplement to a memoir on skew variation. Phil. Trans. Roy. Soc (A). Vol. 216 (1916), p. 432.
[28] Schmeiser, B.W. (1977). Methods for modelling and generating probabilistic components in digital computer simulation when the standard distributions are not adequate: A Survey. Proceedings of the 1977 Winter Simulation Conference, Highland, Sargent and Schmidt (eds), 51-55. The Institute of Electrical and Electronics Engineers, Piscataway, New Jersey.

PAGE 97

[29] Slifker J, Shapiro S. The Johnson System: selection and parameter estimation. Technometrics 1980; 22: 239-247.
[30] Snedecor, George W. and Cochran, William G. (1989), Statistical Methods, Eighth Edition, Iowa State University Press.
[31] Storer. Adaptive Estimation by Maximum Likelihood-Fitting of Johnson Distributions, unpublished Ph.D. thesis, School of Industrial and Systems Engineering, Georgia Institute of Technology (1987).
[32] Strachan T., Read A.P. (1999). Human Molecular Genetics, Wiley publications, 2nd ed.
[33] Tusher V., Tibshirani R., Chu C. Significance Analysis of microarrays applied to transcriptional response to ionizing radiation. Proceedings of the National Academy of Sciences 2001; 98: 5116-5121.
[34] Vladimir P. Savchuk and Chris P. Tsokos, Bayesian Statistical Methods with Applications to Reliability, World Federation Publishers, Inc., 1996.
[35] Wheeler R. Quantile Estimators of Johnson curve Parameters. Biometrika vol. 67 No. 3 Dec 1980 pp 725-728.
[36] Wilkins J.E. A note on Skewness and kurtosis. The Annals of Mathematical Statistics, vol. 15, No. 3 (Sep 1944) pp 333-335.