USF Libraries
USF Digital Collections

Performance oriented scheduling with power constraints


Material Information

Title:
Performance oriented scheduling with power constraints
Physical Description:
Book
Language:
English
Creator:
Hayes, Brian C
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla.
Publication Date:

Subjects

Subjects / Keywords:
Software
Reordering
Optimization
Compiler
Assembly
Dissertations, Academic -- Computer Engineering -- Masters -- USF   ( lcsh )
Genre:
government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: Current technology trends continue to increase the power density of modern processors at an exponential rate. The increasing transistor density has significantly impacted cooling and power requirements and, if left unchecked, the power barrier will adversely affect performance gains in the near future. In this work, we investigate the problem of instruction reordering for improving both performance and power requirements. Recently, a new scheduling technique, called Force Directed Instruction Scheduling, or FDIS, has been proposed in the literature for use in high level synthesis as well as instruction reordering [15, 16, 6]. This thesis extends the FDIS algorithm by adding several features, such as control instruction handling and register renaming, in order to obtain better performance and power reduction. Experimental results indicate that performance improvements up to 24.62% and power reduction up to 23.98% are obtained on a selected set of benchmark programs.
Thesis:
Thesis (M.S.C.P.)--University of South Florida, 2005.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Brian C. Hayes.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 46 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001681049
oclc - 62715586
usfldc doi - E14-SFE0001073
usfldc handle - e14.1073
System ID:
SFS0025394:00001




Full Text

Performance Oriented Scheduling with Power Constraints

by

Brian C. Hayes

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering, Department of Computer Science and Engineering, College of Engineering, University of South Florida

Major Professor: N. Ranganathan, Ph.D.
Rafael Perez, Ph.D.
Srinivas Katkoori, Ph.D.

Date of Approval: March 30, 2005

Keywords: software, reordering, optimization, compiler, assembly

© Copyright 2005, Brian C. Hayes

DEDICATION

I would like to dedicate this work to my wonderful family for all of the help and support they have provided throughout my life, as well as my loving fiancée, Loren.

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to Dr. Ranganathan for all of his guidance and assistance throughout the course of this work. I appreciate his willingness to become my major professor and provide his knowledge and experience to allow me to succeed. Without him, none of this would have been possible. I would also like to thank Dr. Rafael Perez and Dr. Srinivas Katkoori for serving on my thesis committee.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1  INTRODUCTION
  1.1  Thesis Organization
CHAPTER 2  RELATED WORK
  2.1  SimpleScalar
  2.2  SimplePower
  2.3  Wattch
  2.4  General Software Power Reduction Techniques
    2.4.1  Reducing Switching Activity
    2.4.2  Pattern Matching
    2.4.3  Memory Access Reduction
  2.5  Instruction Buffering and Compiler Optimization
    2.5.1  Hybrid Approaches
    2.5.2  VLIW
  2.6  Other Miscellaneous Techniques
  2.7  Classical Force Directed Scheduling
  2.8  FD-ISLP
CHAPTER 3  FORCE DIRECTED INSTRUCTION SCHEDULING
  3.1  FDIS vs. FDS
  3.2  Power Characterization
    3.2.1  Operand Power Table
  3.3  FDIS Algorithm
  3.4  Basic Blocks
    3.4.1  Inter-Block Scheduling
  3.5  Self and Successor Forces
CHAPTER 4  EXPERIMENTAL RESULTS
  4.1  Overview
  4.2  Results
    4.2.1  Power Results
    4.2.2  Performance Results
CHAPTER 5  CONCLUSIONS AND FUTURE RESEARCH
REFERENCES

LIST OF TABLES

Table 2.1.  Register Renaming Power Savings in Memory
Table 3.1.  Sample Power Dissipation Table
Table 4.1.  Switch Capacitance Reduction/Power Savings using Register Renaming
Table 4.2.  Switch Capacitance Reduction/Power Savings for FDIS
Table 4.3.  Switch Capacitance Reduction/Power Savings for FDIS with Register Renaming
Table 4.4.  Clock Cycle Reduction/Performance Improvement

LIST OF FIGURES

Figure 1.1.  CPU Power Trends from 1985-1999 [19]
Figure 1.2.  CPU Cooling Cost vs. Power Dissipation [19]
Figure 2.1.  SimpleScalar Simulators and SimpleScalar Organization [21]
Figure 2.2.  Power Analysis Scenarios for Wattch [4]
Figure 2.3.  Initial Schedule with Assigned Time Frames
Figure 2.4.  As Soon As Possible Schedule
Figure 2.5.  As Late As Possible Schedule
Figure 3.1.  Block Diagram of FDIS Algorithm
Figure 3.2.  Sample Basic Block Segmentation of Source Code
Figure 3.3.  Calculation of Self-Forces
Figure 3.4.  Calculation of Successor Forces
Figure 4.1.  Flowchart of Experimental Runs

Performance Oriented Scheduling with Power Constraints

Brian C. Hayes

ABSTRACT

Current technology trends continue to increase the power density of modern processors at an exponential rate. The increasing transistor density has significantly impacted cooling and power requirements and, if left unchecked, the power barrier will adversely affect performance gains in the near future. In this work, we investigate the problem of instruction reordering for improving both performance and power requirements. Recently, a new scheduling technique, called Force Directed Instruction Scheduling, or FDIS, has been proposed in the literature for use in high level synthesis as well as instruction reordering [15, 16, 6]. This thesis extends the FDIS algorithm by adding several features, such as control instruction handling and register renaming, in order to obtain better performance and power reduction. Experimental results indicate that performance improvements up to 24.62% and power reduction up to 23.98% are obtained on a selected set of benchmark programs.

CHAPTER 1
INTRODUCTION

Moore's Law, which has correctly predicted the exponential performance gains seen in the realm of computer architecture, currently predicts the number of transistors on a single integrated chip will double every 18 months. This prediction has held for almost 30 years. With ever increasing transistor density, power density is also increasing (see Figure 1.1). Large power densities not only degrade integrated circuits, they also waste precious power, especially in portable applications where battery technology has not increased at the same rate. In addition, the cost to cool such integrated circuits becomes exponentially expensive, as evidenced in Figure 1.2. Until recently, many approaches still maintained performance as the primary design objective, with a secondary consideration to power consumption. However, it is now realized that power must also become a primary objective in order to maintain performance.

As the technology field begins to create faster and smaller integrated circuits, the problem of power, heat, and cooling will become more problematic. Therefore, new techniques must be developed to reduce power consumption, while still maintaining performance. Current research now focuses on finding ways to reduce power at all levels, from the hardware to the software. In industry, a great deal of effort has been invested in hardware research in order to design for power. As a result, many power efficient designs have been developed. However, this optimization alone will not solve the power issue. Software architects/programmers traditionally have not spent a great deal of time in attempting to create efficient code and certainly have not addressed how to reduce power. Therefore, there is still a great deal of work in the software area in the realm of power reduction. Although there is much to be done in all areas, power aware software lags greatly when compared to its hardware counterparts. This paper intends to provide one additional technique at the software level to assist in addressing this imminent power problem, as well as introduce an additional performance gain.

Figure 1.1. CPU Power Trends from 1985-1999 [19]

Figure 1.2. CPU Cooling Cost vs. Power Dissipation [19]

The force directed scheduling based algorithm presented in this paper specifically addresses the issue of reducing the amount of power needed to execute a given program. However, the algorithm's primary objective is to improve performance while gaining a power savings. Force directed scheduling has been successfully applied in hardware design and is one of the primary techniques used in the field. However, few attempts have been made to apply this method in software scheduling. The FDIS algorithm works by using the inter-instruction power cost as a scheduling metric. Specifically, the algorithm attempts to minimize the power utilized by the CPU to switch from one instruction to the next successive instruction. Before scheduling, the algorithm segments the assembly level code into basic blocks, upon which the scheduler can act. For each basic block (defined as a segment of code executed in sequence without jumps or branches), the algorithm attempts to reorder the instructions within the basic block such that the cost to switch between instructions is minimized. Since the data dependency graph and critical path of the program are calculated ahead of time, the algorithm specifically excludes those instructions along the critical path from being rescheduled so that the performance of the program is not negatively affected. Therefore, the algorithm produces a new version of the code that runs no longer than the original version and, ideally, uses no more power than the original code. However, since the algorithm is only locally optimized, these statements cannot be theoretically proved, but they have thus far held under all experiments performed. In general, most instructions are not contained within the critical path and therefore provide a significant opportunity to minimize power.

Using the techniques and algorithms developed in this paper, average performance gains of 2.8% were seen, with a maximum of over 26%. In addition, power savings in excess of 3% was also achieved, with as much as 24% attained. These results are significant in that this technique does not preclude the use of other, previously developed techniques. Furthermore, the operations upon the source can be performed quickly and efficiently.

1.1 Thesis Organization

The organization of the paper is as follows. Section 2 contains the related work, including a thorough description of the two major bodies of work which this paper uses to validate these results. Section 3 describes the actual algorithm and a discussion of how the results were to be obtained. Section 4 contains the actual results of the experiments performed and the analysis of these results. Finally, Section 5 summarizes the results and provides a basis for future research.

CHAPTER 2
RELATED WORK

Over the past 10-15 years, there have been many techniques and approaches to reduce power consumption at all levels. However, it has been a relatively recent phenomenon that so much effort is being placed into this area of research, as power and heat are the two main obstacles to face within the next 10 years.

2.1 SimpleScalar

SimpleScalar was written in 1992 as part of the Multiscalar project at the University of Wisconsin [21]. As seen in Figure 2.1, the SimpleScalar simulation suite contains the ability to simulate program binaries using a wide range of simulations. In this work, the SimpleScalar framework was used to translate the benchmarks into assembly level code to be simulated in the SimplePower framework.

2.2 SimplePower

SimplePower is an execution-driven, cycle-accurate RT level energy estimation tool that uses transition sensitive energy models as the basic framework. SimplePower can also provide the energy consumed in the memory system and on-chip buses using analytical energy models [22]. SimplePower is one of the few tools to estimate the power consumption at the RT level. One of the downsides to the SimplePower toolset is that it only works on the integer subset of the SimpleScalar toolset, which limits the code that can be simulated. The simulator estimates the switch capacitance of the processor datapath, memory, and on-chip buses [22]. The only items that are not estimated are the control unit, the clock generator, and the distribution network. SimplePower consists of 5 main components: the SimplePower core, the RTL power estimation interface, the technology dependent switch capacitance tables, the cache/bus simulator, and the loader.

SimpleScalar baseline simulator models:

Simulator      Description                               Lines of Code   Simulation Speed
sim-safe       Simple functional simulator               320             6 MIPS
sim-fast       Speed-optimized functional simulator      780             7 MIPS
sim-profile    Dynamic program analyzer                  1,300           4 MIPS
sim-bpred      Branch predictor simulator                1,200           5 MIPS
sim-cache      Multilevel cache memory simulator         1,400           4 MIPS
sim-fuzz       Random instruction generator and tester   2,300           2 MIPS
sim-outorder   Detailed microarchitectural model         3,900           0.3 MIPS

Figure 2.1. SimpleScalar Simulators and SimpleScalar Organization [21]

Table 2.1. Register Renaming Power Savings in Memory

Benchmark       Original (nF)   Relabeled (nF)   Reduction
acker.c         17,742.8        15,784.5         11.04%
bsrch.c         0.936           0.843            9.94%
bubble.c        1,033.8         870.5            15.79%
fib.c           166,117.1       146,017.5        12.10%
hanoi.c         690.9           600.7            13.05%
heap.c          338.6           302.7            10.61%
matmult.c       10,075          8,372            16.90%
perm.c          2,459.6         2,284.5          7.12%
queens.c        1,398.7         1,306.0          6.62%
quick.c         194.8           168.6            13.47%
sieve.c         1,902.2         1,659.5          12.76%
Average                                          11.76%
Std. Deviation                                   3.04%

Besides the optional outputs, SimplePower provides the register file final status, the total number of cycles in execution, the number of transitions in the buses, the switch capacitance statistics for each pipeline stage, the switch capacitance statistics for different functional units, and the total switch capacitance [22]. Comparing each of the functional unit models to HSPICE, the average error was found to be within 15% for all the units [22].

The simulator, as described above, simulates code compiled, assembled, and linked using the SimpleScalar tools gcc (compiler), gas (assembler), and gld (linker/loader), respectively. The SimplePower framework also contains utilities that actually contain power reduction techniques not found in the SimpleScalar toolset. Specifically, the SimplePower compiler outputs a register renaming mapping during the simulation process that can be used to implement register renaming. Register renaming is an accepted power reduction technique and is oftentimes incorporated in hardware. However, this technique implemented in software can give systems without this hardware feature the power reduction savings. According to [22], this particular implementation of register relabeling can save on average 12% in the memory system (see Table 2.1).

2.3 Wattch

In 2000, Brooks et al. presented a new type of simulator [4], similar in nature to SimplePower. This simulator, like SimplePower, is based off of the SimpleScalar framework and is cycle accurate. The goal of the simulator is to provide the accuracy of low-level power simulators, like QuickPower and PowerMill, but with high-level functionality so that different power architectures can be evaluated without being fully implemented.

Wattch can model various types of functional units within a CPU, including register files, queues, caches, memories, reorder buffers, and branch predictors, among others. Unlike SimplePower, Wattch also models clock generation, including clock wiring, capacitive loads, and clock buffers. In addition, Wattch can model hardware as well as software optimizations, so that a complete overview can be obtained. Microarchitectural tradeoffs, compiler optimizations and hardware optimizations can all be evaluated using this simulator, as seen in Figure 2.2. According to [4], Wattch is accurate to within 10% of other low-level power simulators, yet it is orders of magnitude faster than its low-level counterparts.

2.4 General Software Power Reduction Techniques

While SimpleScalar, SimplePower, and FDS all provide a foundation for the algorithm presented in this paper, many other techniques using other approaches are currently being developed to solve the power problem. In terms of software approaches, Tiwari [12] outlined three main techniques that software can use to reduce power consumption. They include: reducing switching activity, generating code using pattern matching, and reducing memory accesses.

2.4.1 Reducing Switching Activity

Reducing switching activity is one of the primary techniques upon which FDIS works, as FDIS attempts to reduce the inter-instruction power cost. This inter-instruction power cost is related to the previous instruction and the current instruction being executed, which are the same variables that affect switching activity. By minimizing the number of transistor state changes, the power lost due to switching activity can also be minimized.
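As a rough, hedged illustration of this idea (not part of the original work), the switching activity caused by two consecutive instructions can be approximated by the Hamming distance between their binary encodings, i.e., the number of bits that must toggle. The Python sketch below uses made-up 32-bit words rather than real MIPS encodings.

    # Illustrative sketch (not from the thesis): approximate switching activity
    # between consecutive instructions by the Hamming distance of their encodings.
    # The 32-bit values below are placeholders, not real MIPS instruction words.

    def hamming_distance(word_a: int, word_b: int) -> int:
        """Count the bits that change when word_a is followed by word_b."""
        return bin(word_a ^ word_b).count("1")

    def total_switching(encodings) -> int:
        """Sum the bit toggles over a straight-line sequence of instruction words."""
        return sum(hamming_distance(a, b) for a, b in zip(encodings, encodings[1:]))

    if __name__ == "__main__":
        # Two hypothetical orderings of the same three instruction words.
        order_1 = [0x21490010, 0x70E2B802, 0x214A0014]
        order_2 = [0x21490010, 0x214A0014, 0x70E2B802]
        print("order 1 toggles:", total_switching(order_1))
        print("order 2 toggles:", total_switching(order_2))

Under this simplified model, a reordering that lowers the total toggle count is the software analogue of reducing switching activity in hardware.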

Figure 2.2. Power Analysis Scenarios for Wattch [4]
(Scenario A: Microarchitectural Tradeoff; Scenario B: Compiler Optimization; Scenario C: Hardware Optimization)

2.4.2 Pattern Matching

Pattern matching is performed by assigning costs to specific code patterns and then covering the entire program with the code patterns, while minimizing the overall cost. This forces the compiler to utilize efficient code patterns that exist within the program, but are not recognized in a sequential compile of a user-defined program. In addition, this technique may reduce the requirements placed on the programmer to develop power efficient source code.

2.4.3 Memory Access Reduction

The final area of software power reduction is the reduction of memory accesses. This is one area where software optimizations can actually have a greater impact than those optimizations performed in hardware. In [12], an optimized memory system showed a 4% decrease in power when optimized in hardware, but a 23-62% decrease when the software was optimized. In the realm of overall power, this can be significant. In the 486DX2 processor, register accesses had a power cost of approximately 300 mA [12]. Memory reads, on the other hand, have a power cost of nearly 430 mA, while memory writes take almost 530 mA [12]. Memory writes in the 486DX2 processor also incur an additional power cost since the system utilized a write-through cache coherency model and, therefore, a write could occur in every level of the memory hierarchy. These additional writes could significantly increase the cost of write operations and are simply unavoidable in the hardware. Therefore, in cases where hardware optimizations have little effect, software optimizations, which have also been proven beneficial in [10], can greatly reduce overall power consumption.

2.5 Instruction Buffering and Compiler Optimization

In [13], the authors present a technique of inserting a mini-cache between the CPU and the main cache. According to [13], the data and instruction cache is one of the primary power consumers of the CPU. In [5] and [20], it was determined that the instruction cache in the StrongARM SA-110 processor from DEC consumed 27% of the total power dissipated, and even in the Pentium Pro processor, the cache still consumed 14%. In general, memory systems consume a great deal of power to store required data. Caches, because of the large amount of data throughput, also have a great deal of switching activity, as data enters and leaves the cache to go to and from the CPU and memory. The goal of [13] is to reduce the switching activity and extraneous instruction fetching to reduce power consumption.

Others have developed similar techniques, such as a filter cache in [9] and a loop buffer in [17], to reduce power. [13] also introduces another cache, referred to as the L-cache, which is accessible by the compiler. As a result, a smart compiler can analyze the code and inform the hardware what to load into the L-cache to further reduce power. Such software/hardware communication may be the approach taken in the future to maximize the power reduction.

2.5.1 Hybrid Approaches

While many techniques work at the hardware or software level, some techniques work at both levels. Many architectures contain parallel mechanisms, such as a Tomasulo implementation, to increase performance of a system. In Tomasulo, there are many extra registers and datapaths that consume a large amount of power. One idea presented in [11] is to have a hybrid paradigm, where parallelism that can be seen at the compiler level is implemented in software, allowing hardware parallel schemes to be put to sleep during the execution of compiler optimized code, and only utilizing hardware implemented parallelism where necessary. Although performance becomes a concern, [1] demonstrated that if done properly, power can be reduced while maintaining performance when simply optimizing at the software level.

2.5.2 VLIW

Very Long Instruction Words, or VLIW, is an attempt to minimize power by changing the instruction set architecture to allow for a single fetch that retrieves multiple instructions, as opposed to using one fetch per instruction. The principle behind this idea is to reduce the number of times the memory is accessed, and the compiler can package instructions in the most efficient method possible, using power optimization techniques. However, this concept also leads to a more complex control unit and compiler, as well as the need for a larger bus width and registers. Finally, the portability of such a scheme is very low, since the number of instructions packaged and how they are packaged varies from architecture to architecture. Although a novel idea, the practicality of VLIW has significantly reduced any attempts to implement such a scheme into any such system.

[3] and [2] have demonstrated the theoretical benefits of having a VLIW architecture with power optimization, but to date, the refinement has remained in the research community.

2.6 Other Miscellaneous Techniques

In general, higher frequencies result in more switching activity and switch capacitances. Therefore, separate from the explosion in the number of transistors, the burst in frequency has also led to increased power dissipation. In addition, higher voltage leads to a reduction in delay, in relation to a lower voltage system. Performance can be expressed as directly proportional to the frequency and the voltage. Power is also directly proportional to voltage and frequency. Many research facilities, including Intel, are attempting to find ways to reduce voltage and/or frequency whenever performance is not a top priority. For example, composing a letter in a word processor does not require the fastest processor. Therefore, the frequency and voltage could be reduced, as there would be a significant power savings without a visible decrease in performance.

2.7 Classical Force Directed Scheduling

Scheduling, in particular, is a difficult problem to solve. Many areas, such as design automation and optimization, require scheduling in order to efficiently and effectively accomplish the required tasks. In 1987, Paulin and Knight [15] introduced a scheduling technique for automatic data path synthesis called force directed scheduling (FDS), based on the principle of a spring (see Equation 2.1).

    F = Kx                                                    (2.1)

Later, in 1989 [16], Paulin and Knight introduced an updated version of the technique for behavioral synthesis of application specific integrated circuits (ASICs). Since then, there have been many improvements and updates made to the algorithm in general. Today, FDS is one of the primary scheduling algorithms in behavioral synthesis and has expanded to many areas such as dynamic power optimization [8] and, with the introduction of FDIS, software power reduction.

The ability of the algorithm to be applied to so many areas lies in the inherent trait of the basic blocks upon which FDS acts. FDS can be used in virtually any application in which there is a need to schedule many basic blocks, subject to a dependency graph. In the case of design automation, FDS is particularly well suited for the design process. The design automation process begins with a description of a circuit, usually specified in the form of a hardware description language, such as Verilog or VHDL. From there, a control dependency graph (CDG) and a data dependency graph (DDG) can be constructed. It is at this point where operations (either single or multiple) are mapped to nodes within the two graphs and become the basic blocks that the FDS algorithm will schedule. After mapping the basic blocks to the type of functional unit needed, the algorithm can use the dependency graphs to iteratively schedule the functional units onto the physical units and create the datapaths as specified in the graphs.

Scheduling the basic blocks is one of the most difficult tasks, as virtually all basic blocks are connected to other basic blocks; few are ever isolated. This dependency must be realized and maintained during running of the algorithm. Physical constraints, such as the number of wires to connect the basic blocks, must also be taken into account for a realistic schedule that can map to an actual chip. Furthermore, performance is also an item that FDS attempts to address. All units connected together should be as close together as possible to reduce delays and to minimize routing resources used.

To accomplish the task, FDS works in the following manner. Once the data dependency graphs have been created and time frames have been assigned (see Figure 2.3), the ASAP (as soon as possible) schedule (see Figure 2.4) and the ALAP (as late as possible) schedule (see Figure 2.5) can be created. This is done using the critical path of the graphs as the latency restrictor and then calculating, for each operation/basic block, the earliest and latest start times. An operation can be assigned to any of the time frames between the operation's assigned time frame in the ASAP schedule and its time frame in the ALAP schedule.
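As a hedged illustration of how the ASAP and ALAP schedules can be derived, the Python sketch below computes earliest and latest start times for a small, made-up dependency graph, assuming unit-latency operations and a latency bound equal to the critical path length; none of the names or values come from the thesis figures.

    # Hedged sketch of ASAP/ALAP computation for a small dependency graph,
    # assuming unit-latency operations and a latency bound equal to the
    # critical path length. The graph below is illustrative only.

    deps = {            # op -> set of ops it depends on
        "a": set(), "b": set(), "c": {"a"}, "d": {"b", "c"}, "e": {"d"},
    }

    def asap(deps):
        times = {}
        def visit(op):
            if op not in times:
                times[op] = 0 if not deps[op] else 1 + max(visit(p) for p in deps[op])
            return times[op]
        for op in deps:
            visit(op)
        return times

    def alap(deps, latency):
        succs = {op: {s for s in deps if op in deps[s]} for op in deps}
        times = {}
        def visit(op):
            if op not in times:
                times[op] = latency if not succs[op] else min(visit(s) for s in succs[op]) - 1
            return times[op]
        for op in deps:
            visit(op)
        return times

    if __name__ == "__main__":
        asap_times = asap(deps)
        critical_path = max(asap_times.values())
        alap_times = alap(deps, critical_path)
        for op in sorted(deps):
            # Mobility = ALAP - ASAP; the op may be placed in any step in between.
            print(op, asap_times[op], alap_times[op], alap_times[op] - asap_times[op])

An operation whose mobility is zero lies on the critical path and has only one legal time step; every other operation has a range of candidate time steps, which is exactly what the probability and force calculations below operate on.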

Figure 2.3. Initial Schedule with Assigned Time Frames

Figure 2.4. As Soon As Possible Schedule

Figure 2.5. As Late As Possible Schedule

Therefore, the probability of an operation being scheduled outside the ASAP and ALAP schedules is zero, and the probability for any time step between the two schedules is equal to:

    Prob(Operation) = 1 / (ALAP - ASAP + 1)                   (2.2)

assuming uniform probability.

Taking the summation of the probabilities of each operation type for a given time step i will result in a distribution graph that reflects the concurrency of a particular operation type at that point in time (see Equation 2.3). This value is important for future calculations, as DG(i) is used as the spring constant in the calculation of forces.

    DG(i) = Σ_{Operations of a given type} Prob(Operation, i) (2.3)

For each operation, the force required to change the probability of the operation at a particular time step is given by

    Force(i) = DG(i) * x(i)                                   (2.4)

where x(i) is equal to the change in the operation's probability.

Self-force is the force required to assign an operation to a particular time step j. Therefore, this is calculated by summing the force required to change the probability over all time steps, as given by Equation 2.5:

    SelfForce(j) = Σ_{i=t}^{b} Force(i)                       (2.5)

where the time frame for the specified operation falls between t and b.

The final step of the FDS algorithm before scheduling an operation is to determine the predecessor and successor forces. Whenever an operation is scheduled to a particular time step, the possibilities for other operations are reduced and thereby affect the probabilities of the remaining operations. Therefore, the force required to change the probabilities of these operations should also be added to the self force of the operation to determine the total force, as given by Equation 2.6:

    TotalForce = SelfForce + PSForce                          (2.6)

Overall, the summary of the force directed scheduling algorithm is as follows:

    Repeat until all operations are scheduled
        Find ASAP schedule
        Find ALAP schedule
        Calculate DG(i), using Equation 2.3
        Calculate SelfForce(j), using Equation 2.5
        Find TotalForce, using Equation 2.6
        Schedule the operation with the lowest force
    End Repeat
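The Python sketch below (an illustration added here, not part of the original work) wires Equations 2.2 through 2.5 together for a single operation type; the three operations and their time frames are made up, and DG(i) plays the role of the spring constant K.

    # Hedged sketch of Equations 2.2-2.5 for a single operation type.
    # The time frames below are illustrative, not taken from the thesis.

    ops = {              # operation -> (ASAP step, ALAP step)
        "op1": (0, 1),
        "op2": (0, 2),
        "op3": (1, 2),
    }

    def prob(op, step):
        """Equation 2.2: uniform probability over the operation's time frame."""
        asap_t, alap_t = ops[op]
        return 1.0 / (alap_t - asap_t + 1) if asap_t <= step <= alap_t else 0.0

    def dg(step):
        """Equation 2.3: distribution graph, summed over operations of one type."""
        return sum(prob(op, step) for op in ops)

    def self_force(op, target_step):
        """Equations 2.4-2.5: force of pinning `op` to `target_step`."""
        asap_t, alap_t = ops[op]
        total = 0.0
        for step in range(asap_t, alap_t + 1):
            new_prob = 1.0 if step == target_step else 0.0
            delta_x = new_prob - prob(op, step)        # change in probability
            total += dg(step) * delta_x                # Force(i) = DG(i) * x(i)
        return total

    if __name__ == "__main__":
        for step in (0, 1, 2):
            print(f"op2 pinned to step {step}: self force = {self_force('op2', step):+.2f}")

A negative self force indicates that pinning the operation to that time step relieves contention relative to the current distribution, which is why the algorithm repeatedly schedules the operation with the lowest total force.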

2.8 FD-ISLP

In 2003, Dongale [6] developed an algorithm based on force directed scheduling for creating power optimized code. The algorithm, called Force Directed Instruction Scheduling for Low Power, or FD-ISLP, was proposed; however, several issues were not addressed and thus the results were not reflective of the real picture. In this work, a basic block is defined as a segment of code which contains no branch or jump instructions, with the exception of the last instruction. During the scheduling phase, jump or branch instructions must remain as the last instruction. Otherwise, the semantics of the code are compromised. While there was a provision to ensure such instructions did not move within the FD-ISLP algorithm, the developed mechanism did not guarantee that the branch instructions were not moved. As a result, the schedule may result in some code segments not being executed. The algorithm presented in this work, called FDIS, is a modified version of FD-ISLP that has corrected the scheduling error, as well as provided additional improvements, such as register renaming, to the scheduling paradigm, providing better performance and power reduction.

In addition to correcting the scheduling error within the FD-ISLP algorithm, other significant changes were also made. Most notable was the inclusion of register renaming. This well known technique can be successfully incorporated into the algorithm and provide a relatively significant performance boost. Furthermore, an operand power table was also included in the algorithm to provide a more complete power characterization. In the original FD-ISLP algorithm, power values were obtained for the instructions only, without any consideration for the operands. Therefore, two instructions of the same type would be given preference over two instructions with the same operands. Although this may generally prove to be correct, such an assumption cannot be made. Therefore, a power analysis was performed and incorporated into the updated algorithm to more fully characterize an instruction. While no significant effect on the overall performance was seen in the final results, as explained in Section 3.2.1, such an update may prove useful in specialized cases. For example, if a basic block contained nothing but a single instruction type (i.e., a list of add instructions), the FD-ISLP algorithm would calculate the same power cost for all of the instructions and would be forced to keep the original schedule. However, in FDIS, the operands can be considered in order to take advantage of the same operand appearing in multiple instructions.

An important analysis we provide is that force directed methods result in clock cycle reduction and thus act as performance improving algorithms; the power savings depend on the reduction in the number of clock cycles, besides the inter-instruction and common operand detection used for power optimization.

CHAPTER 3
FORCE DIRECTED INSTRUCTION SCHEDULING

Conceptually, the technique used in this paper is quite similar to the classical version of force directed scheduling. Instead of basic blocks consisting of circuits to be mapped to functional units, our basic blocks consist of instructions that are mapped to functional units within the computer, such as adders, multipliers, and dividers. Essentially, the algorithm developed is a level of abstraction above the classical version of force directed scheduling. Rather than determining the schedule for placement on an ASIC, this algorithm determines the schedule of execution for instructions on a given instruction set architecture.

3.1 FDIS vs. FDS

The primary difference between the two algorithms is how the self and successor forces are calculated. Whereas in force directed scheduling the primary metric for optimization is concurrency of operations, the goal of FDIS is power reduction and performance enhancement. Since the power needed to execute an instruction is difficult to reduce, the concept behind FDIS is to reduce the power cost associated with moving from the current instruction to the next sequential instruction. For example, switching from an add operation to a multiply operation may require a significant amount of switching activity (1) on the datapath in order to begin the multiply instruction, because of the required accesses. However, switching from one add operation to another, with only one operand changing, will require less power. Therefore, in FDIS, the "spring constant" is the cost of moving an instruction from one time frame to the next, in relation to the inter-instruction power increase or decrease.

(1) The power dissipated from this switching activity will be referred to as the inter-instruction power cost.

Table 3.1. Sample Power Dissipation Table

Inst.   addu  addi  beq  bgtz  bne  j   jal  la
addu    0     6     7    5     3    5   6    2
addi    6     0     0    2     3    7   5    9
beq     7     0     0    5     0    3   6    1
bgtz    5     2     5    0     4    5   2    5
bne     3     3     0    4     0    0   6    8
j       5     7     3    5     0    0   6    1
jal     6     5     6    2     6    6   0    7
la      2     9     1    5     8    1   7    0

3.2 Power Characterization

Inter-instruction power could vary from one computer to the next, depending on the computer architecture and organization. The power values are not easily calculated, but the relationships between instructions must be known to the FDIS algorithm in order to accurately determine the self and successor forces. Therefore, the algorithm not only requires the source code, but also a matrix relating the inter-instruction power cost for all instruction types.

In Table 3.1, a sample power dissipation table, or PDT, is given. It can be seen that the actual power cost associated between two instructions is not important. Only the relationship between the two instructions is needed. For example, from Table 3.1, switching from an addi to an addu instruction has a cost factor of 6, while the cost factor between an addi and a bne instruction is 3. Therefore, the addu instruction transition will have an inter-instruction power cost twice that of a bne transition when the preceding instruction is addi. Note that the table is symmetric, indicating that for this particular computer organization, simply reversing the order of instructions does not yield a power savings.

A subtle requirement for the algorithm is to know the entire instruction set of the source program and that the entire instruction set is characterized in the PDT. Otherwise, the algorithm may encounter an unknown instruction or may not be able to calculate all of the self and successor forces for all of the instructions. However, the characterization only needs to be performed once for a particular instruction set architecture and associated organization.
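A minimal sketch of how such a table can be stored and queried is shown below; the cost factors are copied from Table 3.1 (restricted to four mnemonics), while the helper names are illustrative rather than part of the implemented tool.

    # Minimal sketch of a power dissipation table (PDT) lookup, using a few of
    # the relative cost factors shown in Table 3.1. Only the instruction
    # mnemonics are needed; operands are ignored at this stage.

    PDT = {
        "addu": {"addu": 0, "addi": 6, "beq": 7, "bne": 3},
        "addi": {"addu": 6, "addi": 0, "beq": 0, "bne": 3},
        "beq":  {"addu": 7, "addi": 0, "beq": 0, "bne": 0},
        "bne":  {"addu": 3, "addi": 3, "beq": 0, "bne": 0},
    }

    def inter_instruction_cost(prev_mnemonic: str, next_mnemonic: str) -> int:
        """Relative cost of executing next_mnemonic immediately after prev_mnemonic."""
        return PDT[prev_mnemonic][next_mnemonic]

    def sequence_cost(mnemonics) -> int:
        """Total inter-instruction cost of a straight-line instruction sequence."""
        return sum(inter_instruction_cost(a, b) for a, b in zip(mnemonics, mnemonics[1:]))

    if __name__ == "__main__":
        # From Table 3.1: addi -> addu costs 6, addi -> bne costs 3.
        print(inter_instruction_cost("addi", "addu"))   # 6
        print(sequence_cost(["addi", "bne", "addu"]))   # 3 + 3 = 6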

Once the characterization is complete, it can be stored and reused for future iterations of the algorithm. This inherent trait of the algorithm also allows for theoretical analysis, as the characterization can easily be modified to adapt to a theoretical model of a computer, assuming the characterization can be determined.

3.2.1 Operand Power Table

Overall, the power dissipation table does not fully characterize an instruction. Instead, the PDT simply characterizes the operation. That is, it does not consider the operands. While the PDT considers the switching activity that occurs as a result of control lines, no data is collected about switching activity on the address lines. In order to more fully characterize all instructions, an operand power table was created using a similar means to that used to generate the PDT. While the PDT used the same operands and simply changed the operation, the operand power table was developed by selecting a single operation, such as an add instruction, and changing the operands to see the overall effect. Since the SimpleScalar architecture contains 32 registers, a 32x32 matrix was developed to characterize the cost of switching from one operand to the next. The concept behind this approach was to sum the PDT value, as well as the cost value of switching from one register to the next (2). The percentage of the total switching activity that the address lines use was unknown, so experimental runs were performed to find what fractional weight gave an increased power reduction or performance gain. For example, the operand value could be given a weight of 25% so that the PDT value would carry 4 times the weight. This was done because it was believed that the operation switching activity is the primary cost factor. In reality, it was found that the operation power value is the dominant power value, as all attempts to incorporate an operand power value either had no effect on the overall schedule or negatively impacted the schedule. As a result, this scheme was not incorporated into the final version of the algorithm.

(2) For a given instruction, there can be up to 3 registers. Therefore, the algorithm could potentially sum the PDT value with 3 operand cost values (one for each register location).
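For completeness, a hedged sketch of the weighting experiment described above follows; the 25% operand weight, the tiny operand table, and the instruction pair are illustrative placeholders, and, as noted, this scheme was not kept in the final algorithm.

    # Hedged sketch of the operand-weighting experiment described above.
    # The 25% weight and the operand cost values are illustrative placeholders.

    OPERAND_WEIGHT = 0.25          # operand cost weight; the PDT value carries the rest

    def operand_cost(prev_regs, next_regs, operand_table):
        """Sum the per-position register switching costs (up to 3 register slots)."""
        return sum(operand_table[a][b] for a, b in zip(prev_regs, next_regs))

    def combined_cost(pdt_value, prev_regs, next_regs, operand_table):
        """Blend the operation-level PDT cost with the weighted operand cost."""
        op_cost = operand_cost(prev_regs, next_regs, operand_table)
        return (1.0 - OPERAND_WEIGHT) * pdt_value + OPERAND_WEIGHT * op_cost

    if __name__ == "__main__":
        # Made-up 32x32-style table restricted to registers 2, 3 and 5.
        operand_table = {2: {2: 0, 3: 4, 5: 6}, 3: {2: 4, 3: 0, 5: 2}, 5: {2: 6, 3: 2, 5: 0}}
        # addi $5,$2,16 followed by addu $3,$5,$2 (hypothetical pair), PDT cost 6.
        print(combined_cost(6, prev_regs=[5, 2], next_regs=[3, 5], operand_table=operand_table))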

Figure 3.1. Block Diagram of FDIS Algorithm
(Flow: source code is compiled to assembly code, the code is divided into basic blocks, a DDG is constructed for each basic block, and FDIS is run on each basic block to produce scheduled assembly code; an instruction set architecture (ISA) characterization run through the SimplePower simulator supplies the power dissipation table (PDT) used by FDIS.)

3.3 FDIS Algorithm

The following is a pseudo-code version of the FDIS algorithm (3). As mentioned before, the algorithm requires the assembly level source code and the power dissipation table for the desired computer organization. Figure 3.1 is a block diagram of the algorithm.

(3) It is assumed that the algorithm is able to determine the instruction set from the PDT.

FDIS(SourceCode, PDT)
(01)  Segment code into basic blocks
(02)  For each basic block
(03)      Create data dependency graph (DDG)
(04)      Unscheduled = number of instructions
(05)      Find ASAP schedule
(06)      Find ALAP schedule
(07)      For all instructions
(08)          If ALAP - ASAP == 0
(09)              Schedule instruction
(10)              Unscheduled--
(11)          End If
(12)      End For
(13)      While Unscheduled != 0
(14)          Update ASAP schedule
(15)          Update ALAP schedule
(16)          For all unscheduled instructions
(17)              For all time steps
(18)                  Assume instruction assigned to time step
(19)                  Calculate self force using PDT
(20)                  Calculate successor forces using PDT
(21)                  Calculate total force
(22)              End For
(23)          End For
(24)          Find instruction with minimum force
(25)          Schedule instruction
(26)          Unscheduled--
(27)      End While
(28)  End For
      Return scheduled source code

The original force directed scheduling algorithm can be categorized into five steps or stages. It can be observed that many of these same steps are completed in FDIS. In FDS, the first stage is to calculate the ASAP and ALAP schedules. Using the critical path as a guide, each instruction has an earliest and latest time that it can be scheduled in order for the program to complete on time. Therefore, in lines 5 and 6 of the FDIS algorithm, the ASAP and ALAP schedules are determined. Step 2 updates the distribution graph so that the self and successor forces can be calculated in steps 3 and 4. Since the distribution graph is not vitally important to the implementation of FDIS, these three steps are merged and completed in lines 7-23. Step 5 consists of scheduling the operation with the lowest force to the selected time frame. Similarly, FDIS performs this operation in lines 24-25. Finally, both FDS and FDIS loop until all operations/instructions are scheduled. Lines 26-28 handle this loop overhead.

3.4 Basic Blocks

The FDIS algorithm works upon basic blocks only. A basic block is defined as a block of sequential instructions where no breaks or branches occur within the code segment. The only exception is that a break or branch instruction may occur as the last instruction of the code segment. Therefore, the first task of the algorithm is to divide the code into basic blocks, upon which the algorithm will operate. Figure 3.2 is a sample segmentation of source code.

For a given basic block, a data dependency graph can be constructed to reflect the dependencies within that basic block. Since there is only one entry and exit point for any given basic block, as long as the data dependencies within that basic block are maintained, correct code execution will result, regardless of the final schedule.
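A hedged sketch of this DDG construction step is given below; it assumes a simplified instruction format in which the first register operand is treated as the defined register, which is not strictly correct for stores and branches, and it ignores memory dependences and addressing-mode registers. It is meant only to show the kind of def/use bookkeeping involved, not the implemented tool.

    # Hedged sketch of building a data dependency graph for one basic block.
    # Simplification: the first register operand is treated as the destination,
    # which is inaccurate for stores/branches; memory dependences are ignored.

    def parse(instr):
        """Split 'addi $5, $2, 16' into its mnemonic and register operands."""
        mnemonic, _, rest = instr.partition(" ")
        regs = [tok for tok in rest.replace(",", " ").split() if tok.startswith("$")]
        return mnemonic, regs

    def build_ddg(block):
        """Return {instruction index: set of earlier indices it depends on}."""
        deps = {i: set() for i in range(len(block))}
        last_writer = {}                       # register -> index of latest definition
        for i, instr in enumerate(block):
            _, regs = parse(instr)
            dest, sources = (regs[0], regs[1:]) if regs else (None, [])
            for reg in sources:                # read-after-write dependences
                if reg in last_writer:
                    deps[i].add(last_writer[reg])
            if dest is not None:
                if dest in last_writer:        # ordering on repeated definitions/uses
                    deps[i].add(last_writer[dest])
                last_writer[dest] = i
        return deps

    if __name__ == "__main__":
        block = ["lw $2, 16($3)", "addi $5, $2, 16", "sw $5, 16($4)", "mult $7, $4, $11"]
        print(build_ddg(block))    # {0: set(), 1: {0}, 2: {1}, 3: set()}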

    ...
    lw   $2, 16($3)
    addi $5, $2, 16
    sw   $5, 16($4)          Basic Block n
    mult $7, $4, $11
    j    JUMP1
    -----------------------------------
JUMP3:
    sw   $8, 24($8)          Basic Block n+1
    bne  $15, $13
    -----------------------------------
    add  $6, $2, $3
    lw   $2, 8($3)
    sw   $6, 16($4)          Basic Block n+2
    div  $17, $14, $13
    j    JUMP4
    -----------------------------------
    sw   $16, -8($5)
    sw   $4, 32($10)
    add  $1, $2, $3          Basic Block n+3
    lw   $5, 0($10)
    ...

Figure 3.2. Sample Basic Block Segmentation of Source Code
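The following Python sketch (illustrative only) mirrors the segmentation shown in Figure 3.2: a block is closed after a control-transfer instruction and a new block is opened at a label. The set of control-transfer mnemonics is an assumption rather than the full MIPS set, and the sample listing is made up.

    # Hedged sketch of the basic-block segmentation step illustrated in Figure 3.2.
    # A block ends after a branch/jump instruction and a new block starts at a
    # label; the mnemonic set below is an assumption, not an exhaustive list.

    CONTROL = {"j", "jal", "jr", "beq", "bne", "bgtz", "bltz", "blez", "bgez"}

    def split_basic_blocks(lines):
        blocks, current = [], []
        for line in lines:
            stripped = line.strip()
            if not stripped:
                continue
            if stripped.endswith(":"):               # label: close the previous block
                if current:
                    blocks.append(current)
                current = [stripped]
                continue
            current.append(stripped)
            mnemonic = stripped.split()[0]
            if mnemonic in CONTROL:                  # branch/jump: block ends here
                blocks.append(current)
                current = []
        if current:
            blocks.append(current)
        return blocks

    if __name__ == "__main__":
        listing = [
            "lw $2, 16($3)", "addi $5, $2, 16", "sw $5, 16($4)", "mult $7, $4, $11",
            "j JUMP1", "JUMP3:", "sw $8, 24($8)", "bne $15, $13, JUMP2",
            "add $6, $2, $3", "lw $2, 8($3)",
        ]
        for i, block in enumerate(split_basic_blocks(listing)):
            print(f"basic block {i}: {block}")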

3.4.1 Inter-Block Scheduling

Due to the implementation style of the algorithm, it is clear that intra-block schedules will be optimized. However, this does not guarantee a globally optimal schedule. Theoretically, it may be possible to change the order of execution of two or more basic blocks and create an even more optimal schedule. Creating a globally optimal schedule using FDIS would require the algorithm to not only create optimal intra-block schedules, but also inter-block schedules. Such an idea, in practice however, is not possible with FDIS. In order to use FDIS to schedule basic blocks, an inter-block schedule would require another level of abstraction, where the algorithm would attempt to work on "super-blocks" that contain blocks as the basic unit (as opposed to blocks that contain instructions as the basic unit). However, there is no power characterization (a PDT) to use to calculate the self and successor forces of an entire block of code. In other words, it is not logical to make a generalization about what it means to categorize the cost of switching from one block to the next. Although developing an alternative algorithm to perform inter-block scheduling may be a cause for future work, it is not believed that significant power reduction will be achieved at the inter-block level. Correct code execution must be maintained, and handling multiple branches to and from a basic block reduces its ability to be reordered.

3.5 Self and Successor Forces

Although FDS and FDIS are similar, the calculation of self and successor forces is somewhat different. This is primarily due to the domain to which the algorithm is being applied. As mentioned previously, the algorithm uses the power dissipation table in the calculation of the self and successor forces. To do so, the algorithm first (in lines 7 through 12) finds any instruction that has a mobility (ALAP - ASAP) equal to zero and assigns that instruction to the one and only allowable time step. From there, the algorithm loops through each unscheduled time step, scheduling one of the unscheduled instructions to the unscheduled time frame. Therefore, when the last time step is scheduled, every time frame is assigned a particular instruction, from which a new schedule can be obtained. In order for the algorithm to select an instruction to schedule, it must first calculate the self and successor forces and find the minimum. Figure 3.3 shows how the self-force is calculated.

Every unscheduled instruction is a possible instruction that can be placed in the current time step. Therefore, the algorithm must run for each unscheduled instruction before it can make a decision about which instruction has the minimum force. If ASAP = ALAP for any instruction, then the current time step is the last time step that the instruction can be scheduled into. Therefore, there is no need to calculate any forces. That instruction must be scheduled in the current time step. If this is not the case, the algorithm calculates the probabilities for each time step (see Equation 2.2).

Once completed, the algorithm tentatively assigns the current instruction to the current time step. From Equation 2.5, the self-force is equal to the sum of the forces at each time step. As previously mentioned, the spring constant in FDIS is the value in the power dissipation table (PDT). Calculating the force for the current time step is straightforward, since it is assumed that the current instruction is being scheduled in that time slot and the instruction in the previous time step is already known, since it was previously scheduled. Therefore, the power dissipation can easily be looked up in the PDT. However, the power dissipation values for the other time steps are not yet known, since those time steps are to be scheduled after the current time step. For all time steps after the current time step that require a self force calculation, the self force is calculated using all of the unscheduled instructions as potential instructions to be scheduled just before the time step for which the self force is being calculated. The average of the self-forces is then taken and used as the self-force for that particular time step. Once the calculations of self-forces are completed, successor forces must be calculated as well. There are no predecessor calculations to be made in FDIS, since the time steps previous to the current time step are already scheduled and cannot be altered. Therefore, predecessor values in FDIS are always zero. Calculation of successor forces is quite similar to the self-force calculation and is shown in Figure 3.4. Successor forces are those forces incurred in moving an instruction to a different time slot. As a result, the first task is to ensure that all instructions still have mobility. If ASAP = ALAP, then the instruction cannot be moved and the successor force is infinite. There is no need to continue further. However, if all instructions do have mobility, then the probabilities for each instruction can be found. To find the complete successor force, the sum of all of the instructions' forces from their current time slots must be determined. Just as was the case with the self-force, the calculation of the successor force for the current time step is easy, since the power dissipation value can be looked up in the table. Even the successor force in the next time step can be determined, since it is assumed that the instruction whose successor force is being calculated will be scheduled in the current time step. However, similar to the self-force, any time steps beyond the current+1 time step must use the average successor force. Once the successor force is calculated, it can be added to the self-force, and a final scheduling decision for the current time step can be made.

For all unscheduled instructions
    tmp_ts = current time step
    If (ASAP == ALAP)
        Schedule instruction to current time step
        Break
    Else
        Calculate probabilities
    End If
    Assume instruction assigned to current time step
    For (i = ASAP; i <= ALAP; i++)
        If (i == ASAP)
            index1 = PDT index of instruction in previous time step
            index2 = PDT index of current instruction
            x = change in probability
            self_force += x * PDT[index1][index2]
        Else
            index2 = PDT index of current instruction
            counter = 0
            Temp = 0.0
            For all other unscheduled instructions j
                If (ASAP <= tmp_ts && ALAP >= tmp_ts)
                    index1 = PDT index of instruction j
                    x = change in probability
                    Temp += x * PDT[index1][index2]
                    counter++
                End If
            End For
            self_force = self_force + Temp / counter
            tmp_ts++
        End If
    End For
End For

Figure 3.3. Calculation of Self-Forces

For all other instructions k
    tmp_ts = current time step
    If (ASAP == ALAP)
        Instruction cannot be moved
        Break
    Else
        Calculate probabilities
    End If
    Assume instruction is moved
    For (i = ASAP; i <= ALAP; i++)
        If (i == ASAP)
            index1 = PDT index of instruction in previous time step
            index2 = PDT index of instruction k
            x = change in probability
            succ_force += x * PDT[index1][index2]
        Else If (i == ASAP + 1)
            index1 = PDT index of current instruction
            index2 = PDT index of instruction k
            x = change in probability
            succ_force += x * PDT[index1][index2]
        Else
            index2 = PDT index of instruction k
            counter = 0
            Temp = 0
            For all other instructions j
                If (ASAP <= tmp_ts && ALAP >= tmp_ts)
                    index1 = PDT index of instruction j
                    x = change in probability
                    Temp += x * PDT[index1][index2]
                    counter++
                End If
            End For
            succ_force = succ_force + Temp / counter
            tmp_ts++
        End If
    End For
End For

Figure 3.4. Calculation of Successor Forces

CHAPTER 4
EXPERIMENTAL RESULTS

4.1 Overview

To determine the actual power savings of the algorithm, testing was performed in four stages. In the first stage, all of the benchmarks were compiled using the SimpleScalar compiler, assembler, and loader in order to create a SimplePower executable. The resulting binary file is a baseline version of the original files, where no optimization has been performed. Therefore, this simulation provided the raw power consumption. The second stage entailed compiling the benchmarks using the SimplePower version of the SimpleScalar compiler, which included register renaming. This stage provided a baseline as to the power conservation that can be achieved with only register renaming, when compared to Run 1. The final two stages consisted of repeating the previous two stages, but first applying the new FDIS algorithm to the code. Figure 4.1 provides a flowchart of the different runs of the experiment performed. The results provide insight into the power savings of the algorithm. Repeating stage one with the FDIS algorithm provides the raw power savings of the algorithm. Repeating the second stage reveals the true benefit of the algorithm. Even when code has been modified to incorporate register renaming, the FDIS algorithm still provides an additive power savings. This relationship validates the ability of the FDIS algorithm to be used in conjunction with other techniques to achieve maximum results.

4.2 Results

Tables 4.1, 4.2, and 4.3 summarize the power results achieved from Runs 1-4 (see Figure 4.1). For each benchmark, the original switch capacitance (in nF), the reduced switch capacitance (in nF), and the percentage improvement are shown (1).

(1) Power consumption is a function of switch capacitance and voltage. Therefore, all comparisons or derivations made using switch capacitance will correlate to the actual power savings seen for a given system.

Figure 4.1. Flowchart of Experimental Runs
(Run 1: standard SimpleScalar compile fed to the SimplePower simulator; Run 2: modified SimpleScalar compile fed to the simulator; Runs 3 and 4: the same two compiles with the FDIS algorithm applied before simulation; all runs output results.)

Table 4.4 reflects the performance improvement of the FDIS algorithm with register renaming. Similar to the power tables, Table 4.4 outlines the number of cycles each benchmark required before and after optimization, as well as the percentage reduction.

4.2.1 Power Results

From the tables, it can be seen that a significant power savings was seen in almost all of the benchmarks. The only exception was the binary search (bsrch.c) benchmark. Because of its small and simple structure, few opportunities exist within the code to perform optimization. However, this specialized benchmark demonstrates that in cases where few optimization decisions exist, the algorithm does not incorrectly select scheduling decisions that negatively affect performance or power.

Although the binary search benchmark only yielded a minute improvement, the quick sort benchmark (quick.c) yielded the maximum improvement. With power and performance improvements of over 24%, the benchmark had an order of magnitude gain over the other benchmarks. Much of this improvement can be explained by the nature of the quick sort algorithm. The algorithm contains a simple code structure executed a number of times. From Table 4.1, register renaming alone can achieve the results seen from the FDIS algorithm, indicating that removal of stalls, through the use of either register renaming or reordering of instructions to remove conflicts, is the primary source of the improvement. As a result, all tables contain an average with and without the quick sort benchmark. This is done in order to remove the skew of the quick sort results on the overall average.

Tables 4.1 and 4.2 represent the overall independent performance of a register renaming scheme and the FDIS algorithm. Table 4.3 reflects the overall combined efforts of the two techniques. Although some of the benchmarks had a slight decrease in the overall improvement, most of the benchmarks had a significant improvement. One especially positive result was the permutation benchmark (perm.c). Individually, the register renaming and FDIS techniques only introduced a 0.50% and 0.16% improvement, respectively. Combined, however, the two techniques resulted in a 3.00% improvement. Overall, the register renaming scheme increased the improvement of the FDIS algorithm by over 0.50%. Although this value does not appear to be significant, 0.50% is nearly a 20% improvement over the FDIS only version.

Table 4.1. Switch Capacitance Reduction/Power Savings using Register Renaming

Benchmark            Original (nF)   Rescheduled (nF)   Reduction
bsrch.c              11500           11493              0.06%
bubble.c             12365628        11709060           5.31%
hanoi.c              6341659         6206131            2.14%
heap.c               3598277         3587242            0.31%
matmult.c            110309          108584             1.56%
perm.c               24001185        23880623           0.50%
quick.c              2077771         1578328            24.04%
Average w/ quick.c                                      4.85%
Average w/o quick.c                                     1.65%

Table 4.2. Switch Capacitance Reduction/Power Savings for FDIS

Benchmark            Original (nF)   Rescheduled (nF)   Reduction
bsrch.c              11500           11485              0.13%
bubble.c             12365628        11335292           8.33%
hanoi.c              6341659         6186881            2.44%
heap.c               3598277         3511171            2.42%
matmult.c            110309          107036             2.96%
perm.c               24001185        23961281           0.16%
quick.c              2077771         1522770            26.70%
Average w/ quick.c                                      6.17%
Average w/o quick.c                                     2.74%
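For reference, the reduction percentages in these tables follow directly from the raw switch capacitance values; the short check below recomputes the quick.c row of Table 4.2 (the helper name is illustrative).

    # How the reduction percentages in Tables 4.1-4.4 follow from the raw numbers,
    # checked here against the quick.c row of Table 4.2.

    def reduction_percent(original: float, rescheduled: float) -> float:
        return 100.0 * (original - rescheduled) / original

    if __name__ == "__main__":
        print(f"{reduction_percent(2077771, 1522770):.2f}%")   # 26.71%, ~26.70% in Table 4.2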

Table 4.3. Switch Capacitance Reduction/Power Savings for FDIS with Register Renaming

Benchmark            Original (nF)   Rescheduled (nF)   Reduction
bsrch.c              11500           11491              0.08%
bubble.c             12365628        11303461           8.58%
hanoi.c              6341659         6170045            2.70%
heap.c               3598277         3522002            2.11%
matmult.c            110309          106823             3.16%
perm.c               24001185        23280466           3.00%
quick.c              2077771         1566066            24.62%
Average w/ quick.c                                      6.33%
Average w/o quick.c                                     3.27%

4.2.2 Performance Results

Table 4.4 represents the overall clock cycle reduction of the combined FDIS and register renaming scheme. Comparing the average results of the cycle reduction (2.76%) to the average power reduction seen in Table 4.3 (3.27%), the two values are relatively close. Furthermore, an analysis of the two tables demonstrates a correlation between reduced number of clock cycles and reduced power consumption. This phenomenon reflects that much of the power savings attained in Tables 4.1, 4.2, and 4.3 is a result of reduced clock cycles. Although some power improvement was gained through the minimization of the inter-instruction cost, as in the case of the binary search algorithm, where there was a small power reduction despite the lack of a performance gain, virtually all of the savings must be attributed to clock cycle reduction. This correlation does not affect the net result of the algorithm, but such results do explain the intrinsic properties of the FDIS technique.

In subsection 3.2.1, operand characterization was discussed. The idea behind creating an operand power table was to consider the actual operands in calculating the overall power cost of an instruction, rather than just the operation itself. Experimentation performed in this area proved futile, as no characterization created a positive effect on the overall power or performance of the benchmarks.

Table 4.4. Clock Cycle Reduction/Performance Improvement

Benchmark            Original (Cycles)   Rescheduled (Cycles)   Improvement
bsrch.c              390                 390                    0.00%
bubble.c             391398              364899                 6.77%
hanoi.c              220820              217772                 1.38%
heap.c               105380              103159                 2.10%
matmult.c            4005                3861                   3.59%
perm.c               751789              731626                 2.68%
quick.c              67543               51341                  23.98%
Average w/ quick.c                                              5.79%
Average w/o quick.c                                             2.76%

The realization that the cycle reduction accounts for virtually all of the gains seen validates this result. Minimizing the switching activity on the address bus will not reduce the number of cycles and, therefore, will not translate into a power or performance gain.

Another interesting side note to the observed results is the ability of the algorithm to optimize for power, but actually schedule for performance. This intrinsic trait is especially beneficial in today's processing environment. Using power as a scheduling metric to yield a performance improved schedule ensures that the schedule attained reduces power while improving performance. Such approaches may very well be the paradigm upon which new algorithms are built.

CHAPTER 5
CONCLUSIONS AND FUTURE RESEARCH

Providing increasing performance, while reducing the power requirements to reach the desired performance goals, is a critical factor facing the computing industry. Future solutions will likely involve the use of many techniques, at all levels of abstraction. As a result, newly developed techniques should not only produce significant results, but should also be combinable with other techniques, in order to achieve maximum performance while using minimal power.

In this paper, we presented a new instruction scheduling technique based on classical force directed scheduling. The technique attempts to reduce power and, in the process, also attempts to improve performance. The technique is combinable with other techniques, such as register renaming. Through the use of this algorithm, in combination with register renaming, an average power reduction of 3.25% was obtained with a reduction of up to 26% seen for selected benchmarks. In addition, a performance gain of 2.8% with a maximum of 24% was attained.

In order to more fully mature the algorithm and its capabilities, many areas of research remain. The running time of the FDIS algorithm is O(n^3). Ideally, the running time should be an order of magnitude less. Therefore, future research should concentrate on running time reduction. Additionally, a more thorough analysis should be performed on inter-block scheduling. Currently, no information is available to determine how close the FDIS schedule is to the globally optimal solution. This determination should further validate the results seen in this work. A more thorough analysis should also be conducted in order to determine if any other intrinsic properties could be exploited by FDIS. Finally, more experimentation should be performed on a wider range of benchmarks, including those that are larger and contain more instruction types, such as floating-point instructions.

REFERENCES

[1] N. Vijaykrishnan, A. Parikh, M. Kandemir, and M. J. Irwin. Instruction Scheduling based on Energy and Performance Constraints. In Proceedings, IEEE Computer Society Workshop on VLSI, pages 37-42, Orlando, Florida, April 2000.

[2] N. Vijaykrishnan, A. Parikh, M. Kandemir, and M. J. Irwin. VLIW Scheduling for Energy and Performance. In Proceedings, IEEE Computer Society Workshop on VLSI, pages 111-117, Orlando, FL, April 2001.

[3] J. Lee, C. Lee, and T. Hwang. Compiler Optimization on Instruction Scheduling for Low Power. In Proceedings, The 13th International Symposium on System Synthesis, pages 55-60, Madrid, Spain, September 2000.

[4] V. Tiwari, D. Brooks, and M. Martonosi. Wattch: A Framework for Architectural-Level Power Analysis and Optimizations. In Proceedings, International Symposium on Computer Architecture, pages 83-94, Vancouver, Canada, June 2000.

[5] D. Dobberpuhl. The design of a high-performance low-power microprocessor. In Proceedings, Int. Symp. Low Power Electronics and Design, pages 11-16, Monterey, CA, August 1996.

[6] P. Dongale. Force-Directed Instruction Scheduling for Low Power. University of South Florida, Department of Computer Science and Engineering, 2003.

[7] M. Kandemir, G. Esakkimuthu, N. Vijaykrishnan, and M. J. Irwin. Memory System Energy: Influence of Hardware-Software Optimizations. In Proceedings, The 2000 International Symposium on Low Power Electronics and Design, pages 244-246, Rapallo, Italy, July 2000.

[8] S. Gupta and S. Katkoori. Force-directed scheduling for dynamic power optimization. In Proceedings, IEEE Computer Society Annual Symposium on VLSI, pages 68-73, Pittsburgh, PA, April 2002.

[9] M. Gupta, J. Kin, and W. Mangione-Smith. The filter cache: An energy efficient memory structure. In Proceedings, IEEE Symp. Microarchitecture, pages 184-193, Research Triangle Park, NC, Dec 1997.

[10] M. Irwin, W. Ye, M. Kandemir, and N. Vijaykrishnan. Influence of Compiler Optimizations on System Power. IEEE Transactions on VLSI Systems, 9:801-804, 2001.

[11] L. John, M. Valluri, and H. Hanson. Exploiting Compiler-Generated Schedules for Energy Savings in High-Performance Processors. In ISLPED '03, pages 414-419, Seoul, Korea, August 2003.

[12] V. Tiwari, S. Malik, and A. Wolfe. Compilation Techniques for Low Energy: An Overview. In Proceedings, Design Automation Conference, pages 38-39, San Diego, CA, October 1994.

[13] N. Bellas, et al. Architectural and Compiler Techniques for Energy Reduction in High-Performance Microprocessors. IEEE Transactions on VLSI Systems, 8:317-326, 2000.

[14] P. Ong and R. Yan. Power-Conscious Software Design - a framework for modeling software on hardware. In Digest of Technical Papers, IEEE Symposium on Low Power Electronics, pages 36-37, San Diego, CA, October 1994.

[15] P. G. Paulin and J. P. Knight. Force-directed scheduling in automatic data path synthesis. In Proceedings, 24th Design Automation Conference, pages 195-202, Miami Beach, FL, July 1987.

[16] P. G. Paulin and J. P. Knight. Force-Directed Scheduling for the Behavioral Synthesis of ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 8:661-679, 1989.

[17] R. Bajwa, M. Hiraki, H. Kojima, et al. Instruction buffering to reduce power in processors for signal processing. IEEE Transactions on VLSI Systems, 5:417-424, 1997.

[18] M. J. Irwin, R. Chen, R. Mehta, R. M. Owens, and D. Ghosh. Techniques for low energy software. In Proceedings, International Symposium on Low Power Electronics and Design, pages 72-75, Monterey, CA, August 1997.

[19] D. Carmean, S. Gunther, F. Binns, and J. Hall. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, pages 1-17, 2001.

[20] D. Grunwald, S. Manne, and A. Klauser. Pipeline gating: Speculation control for energy reduction. In Proceedings, Int. Symp. Computer Architecture, pages 132-141, Barcelona, Spain, June 1998.

[21] E. Larson, T. Austin, and D. Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. Computer, 35:59-67, 2002.

[22] M. Kandemir, W. Ye, N. Vijaykrishnan, and M. J. Irwin. The Design and Use of SimplePower: A Cycle-Accurate Energy Estimation Tool. In Proceedings, Design Automation Conference, pages 340-345, Los Angeles, CA, June 2000.

