USF Libraries
USF Digital Collections

Force-directed instruction scheduling for low power


Material Information

Title:
Force-directed instruction scheduling for low power
Physical Description:
Book
Language:
English
Creator:
Dongale, Prashant
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla.
Publication Date:

Subjects

Subjects / Keywords:
optimization
estimation
Dissertations, Academic -- Computer Science -- Masters -- USF   ( lcsh )
Genre:
government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Summary:
ABSTRACT: The increasing need for low-power computing devices has led to efforts to optimize power in all the components of a system. It is possible to achieve significant power optimization at the software level through instruction reordering during the compilation phase. In this thesis, we have designed and implemented a novel instruction scheduling technique, called FD-ISLP, aimed at reducing software power consumption. In the proposed approach for instruction scheduling, we modify the force-directed scheduling technique used in high-level synthesis of VLSI circuits to derive a latency-constrained algorithm that reorders the instructions in a basic block of assembly code in application software to reduce the power consumed by its execution. The scheduling algorithm takes the data dependency graph (DDG) for a given basic block and a power dissipation table (PDT), which is generated by characterizing the instruction set architecture. We model power, commonly referred to as software power in the literature, as a force to be minimized by relating the inter-instruction power cost to the spring constant, k, and the change in instruction probability to the displacement, x, in the force equation f = k * x. The salient feature of our algorithm is that it accounts for the global effect of any tentative scheduling decision, which keeps the solution from being trapped in a local minimum. The power estimates are obtained using a tool set called SimplePower. Experimental results indicate that our technique achieves an average of 12.68% savings in power consumption over the original source code for the selected benchmark programs.
Thesis:
Thesis (M.S.C.S.)--University of South Florida, 2003.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Prashant Jayawant Dongale.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 56 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001447470
oclc - 54089561
notis - AJN3914
usfldc doi - E14-SFE0000210
usfldc handle - e14.210
System ID:
SFS0024906:00001




Full Text

PAGE 1

Force-Directed Instruction Scheduling for Low Power
by
Prashant Dongale
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
Department of Computer Science and Engineering
College of Engineering
University of South Florida
Major Professor: N. Ranganathan, Ph.D.
Murali Varanasi, Ph.D.
Dewey Rundus, Ph.D.
Date of Approval: October 24, 2003
Keywords: optimization, estimation
© Copyright 2003, Prashant Dongale

PAGE 2

DEDICATION

This work is dedicated to Aai, Papa, Archana and Prasad.

PAGE 3

ACKNOWLEDGEMENTS

I would like to thank Dr. N. Ranganathan for giving me this opportunity to work with him. I owe him a lot for having given direction to my thoughts and unlimited encouragement. Without his insights and patient guidance this work would not have been possible. He has been a steadfast support throughout my Master's program. I also wish to thank Dr. Murali Varanasi and Dr. Dewey Rundus for the time they took to review this thesis and for their helpful comments.

I am very grateful to my loving Father, Mother, and my siblings, Archana and Prasad, without whose moral support this work would never have reached completion. I am indebted to them for being a perpetual source of inspiration and motivation for me. I cannot forget the help and support of all my friends and colleagues who have helped in every step of the way. Special thanks to my roommates, Ayush and Sanky, for their support all the way through.

PAGE 4

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
  1.1 Sources of Software Power Consumption
  1.2 Power Optimization at Different Levels
  1.3 Thesis Contribution
  1.4 Thesis Organization
CHAPTER 2 RELATED WORK
  2.1 Software Power Estimation
    2.1.1 Gate-Level Power Estimation
    2.1.2 Architecture-Level Power Estimation
    2.1.3 Bus Switching Activity
    2.1.4 Instruction-Level Power Analysis
  2.2 Software Power Optimization
    2.2.1 Instruction-Level Optimization
    2.2.2 Compiler-Level Optimization
      2.2.2.1 Instruction Scheduling
      2.2.2.2 Energy Cost Based Code Generation
      2.2.2.3 Reducing Memory Accesses
      2.2.2.4 Source Code Transformations
      2.2.2.5 Instruction Packing, Dual Memory Loads and Operand Swapping
    2.2.3 System-Level Power Optimization
  2.3 Instruction Scheduling for Low Power
    2.3.1 Gray Code Addressing and Cold Scheduling
    2.3.2 Instruction Scheduling as a TSP
    2.3.3 Instruction Scheduling to Reduce Off-Chip Power
    2.3.4 Minimize Peak Power Dissipation
    2.3.5 Energy and Performance Instruction Scheduling
  2.4 Classical Force-Directed Scheduling
  2.5 Our Work
  2.6 Summary

PAGE 5

CHAPTER 3 LOW POWER FORCE-DIRECTED INSTRUCTION SCHEDULING
  3.1 Modifications to the FDS Algorithm
  3.2 ISA Power Characterization
  3.3 Basic Blocks of Assembly Code
  3.4 Data Dependency Graph Construction
  3.5 Low-Power Force-Directed Instruction Scheduling Algorithm
  3.6 Summary
CHAPTER 4 EXPERIMENTAL RESULTS
  4.1 Experimental Procedure
    4.1.1 Cross-Compilation
    4.1.2 Basic-Block Partitioning and DDG Extraction
    4.1.3 PDT Generation
    4.1.4 FD-ISLP
  4.2 SimplePower
  4.3 Power Savings
  4.4 Summary
CHAPTER 5 CONCLUSIONS
REFERENCES

PAGE 6

LIST OF TABLES

Table 2.1 Bit Switching in Binary and Gray Code Representation
Table 3.1 Probabilities of Instruction 2 for Assignment (2, 2)
Table 3.2 Possible Assignments for Instruction 2
Table 3.3 Probabilities of Instruction 3 for Assignment (2, 2)
Table 3.4 Possible Assignments for Instruction 3
Table 4.1 Instruction Set Architecture
Table 4.2 Benchmark Set
Table 4.3 Power Savings Results for Benchmarks
Table 4.4 Execution Time of Benchmarks
Table 4.5 Power Savings for Variable Sized Input Vectors
Table 4.6 Power Consumption Breakdown for Binary Search Benchmark
Table 4.7 Power Savings Breakdown for Quick Sort
Table 4.8 Power Savings Breakdown for Matrix Multiplication
Table 4.9 Power Savings Breakdown for Heap Sort
Table 4.10 Power Savings Comparison
Table 4.11 Savings in Running Time

PAGE 7

LIST OF FIGURES

Figure 1.1 Trends in CPU Power Consumption
Figure 1.2 The Cost of Removing Heat from a Microprocessor
Figure 1.3 Flow Diagram for Power Optimization Methodology
Figure 2.1 Taxonomy of Software-Power Related Works
Figure 2.2 Steps Involved in Instruction-Level Power Analysis
Figure 2.3 Unscheduled Data Flow Graph
Figure 2.4 ASAP Schedule for DFG in Figure 2.3
Figure 2.5 ALAP Schedule for DFG in Figure 2.3
Figure 3.1 PDT Generation Example
Figure 3.2 Basic Block Partitioning Algorithm
Figure 3.3 Basic Block Generation Example
Figure 3.4 DDG Generation Example
Figure 3.5 FD-ISLP Algorithm
Figure 3.6 An Example of a DDG
Figure 3.7 Partial Schedule for the DDG in Figure 3.6
Figure 4.1 Power Optimization and Validation Framework

PAGE 8

FORCE-DIRECTED INSTRUCTION SCHEDULING FOR LOW POWER
Prashant Dongale

ABSTRACT

The increasing need for low-power computing devices has led to efforts to optimize power in all the components of a system. It is possible to achieve significant power optimization at the software level through instruction reordering during the compilation phase. In this thesis, we have designed and implemented a novel instruction scheduling technique, called FD-ISLP, aimed at reducing software power consumption. In the proposed approach for instruction scheduling, we modify the force-directed scheduling technique used in high-level synthesis of VLSI circuits to derive a latency-constrained algorithm that reorders the instructions in a basic block of assembly code in application software to reduce the power consumed by its execution. The scheduling algorithm takes the data dependency graph (DDG) for a given basic block and a power dissipation table (PDT), which is generated by characterizing the instruction set architecture. We model power, commonly referred to as software power in the literature, as a force to be minimized by relating the inter-instruction power cost to the spring constant, $k$, and the change in instruction probability to the displacement, $x$, in the force equation $f = k \cdot x$. The salient feature of our algorithm is that it accounts for the global effect of any tentative scheduling decision, which keeps the solution from being trapped in a local minimum. The power estimates are obtained using a tool set called SimplePower. Experimental results indicate that our technique achieves an average of 12.68% savings in power consumption over the original source code for the selected benchmark programs.

PAGE 9

CHAPTER 1
INTRODUCTION

The traditional concerns for system design have been performance, reliability and cost. However, with the advent of and tremendous growth in the need for personal computing devices and wireless communication systems, power consumption is becoming a major concern in the design of modern systems. Low power has become a key feature for most portable devices. Reducing the power consumption of such devices extends the battery lifetime and hence the device up-time. The power consumption of Intel processors over a period of 15 years is shown in Figure 1.1 (source: [10]). As can be seen from the trend, the maximum power consumption tends to increase by a factor of more than 2X every four years [10].

Figure 1.1. Trends in CPU Power Consumption

Modern processors running at high clock rates consume a large amount of power, and the cost associated with cooling and packaging such devices is huge. The core power consumption needs to be dissipated through packaging, which requires increasingly expensive cooling and packaging technologies.

PAGE 10

The cost of removing heat from processors having different levels of power consumption is shown in Figure 1.2 (source: [10]). It can be observed from the figure that as power consumption increases, the cost associated with the cooling solution increases in a non-linear fashion. This stresses the importance of limiting the power consumption to a threshold defined by the cost structure of the platform. Therefore, it is important to reduce the power consumption, as the resulting heat limits the feasible packaging and performance of VLSI circuits and systems.

Figure 1.2. The Cost of Removing Heat from a Microprocessor

There is also the issue of system reliability. High power systems often run hot, which aggravates silicon failure mechanisms. Every 10 °C increase in temperature roughly doubles the failure rate of the components [35]. In this regard, peak power becomes an important issue, which needs to be kept under control. Also, there are environmental issues that call for low power systems.

All the issues discussed above emphasize the design of systems with reduced power consumption, although the motivations for power reduction differ from application to application. In the next section we will look at the sources of power consumption that are most influenced by software.

PAGE 11

1.1 Sources of Software Power Consumption

Software significantly impacts the following parts of the CPU, which are major contributors to CPU power consumption: the memory system, the system buses, and the data paths.

The memory system accounts for a major fraction of CPU power consumption in portable computers, especially for memory-intensive DSP applications such as video processing. Accessing memory locations involves switching on the highly capacitive data and address lines to the memory, the decode logic, and the word and data lines within the memory. Also, the memory access patterns of a program largely determine the cache performance of a system. Low cache performance, caused by a high number of cache misses, leads to expensive memory accesses. Compact machine code can also help to reduce memory accesses by reducing the probability of cache misses [13].

System buses such as the address, instruction and data buses have high load capacitances. The sequence of instruction op-codes determines the switching activity on an instruction bus. Similarly, switching activity on the data and address buses is influenced by the sequence of data and instruction accesses, respectively. Data-related switching is hard to predict at compile time because most input data to a program is provided only at execution time.

Data paths to arithmetic logic units and floating point units constitute a substantial fraction of the logic power dissipation in a CPU. They are demanding of power, especially if all operational units or pipeline stages are activated at all times. The selection of instructions and the scheduling of operations are examples of software design decisions that can affect data path power.
1.2 Power Optimization at Different Levels

Power optimizations can be achieved at different levels, such as the circuit, logic, architectural, system and software levels. At the circuit level there are techniques such as reordering of gates and transistor sizing. In [30], the authors give a method to optimize the power of logic gates based on transistor reordering. At the logic level, different techniques are used for combinational and sequential circuits. Combinational circuits use path balancing and factorization techniques. In order to reduce spurious transitions, the delays of all paths that converge at each gate should be roughly equal. Path balancing involves adding unit-delay buffers to the inputs of gates to make the delays equal. Factorization involves finding common subexpressions and reusing them, which can reduce the transistor count.

PAGE 12

Sequential circuits use techniques such as encoding, retiming and precomputation to optimize power. Retiming is a well-known optimization technique that repositions the flip-flops in a sequential circuit so as to minimize the required clock period. A retiming method that targets power optimization has been presented in [25]. In precomputation, the output logic values of a circuit are selectively precomputed one clock cycle before they are required, and these values are used to turn off the load signal to the registers whose output does not change in the next cycle. This helps to reduce the power dissipation of the circuit.

During behavioral synthesis, high-level specifications are modified by specific transformations, which lead to power reduction. The most important transformations are those which reduce the number of control steps, the amount of resources required and the switched capacitance. Power optimization can also be achieved at the software level. Software impacts, to a great extent, the power consumption of the system on which it runs. The most fundamental level at which software power can be controlled is the instruction level, because different software instructions cause switching at different hardware modules and hence consume variable amounts of power. For power optimization at the instruction level, there are several techniques such as instruction scheduling, reducing memory accesses, energy cost driven code generation, and instruction packing. The order of instructions can have an impact on power since it determines internal switching in the CPU. Therefore, selecting a proper schedule of instructions can reduce power. Memory operands are more expensive than register operands in terms of power consumption. Therefore, optimal register allocation that reduces the number of memory operands can significantly reduce power. If the power costs of individual instructions are known, an appropriate choice of instructions during code generation can lead to a reduction in power cost. Software-controlled voltage and frequency scaling are other examples of techniques used at the software level for dynamic power optimization.

1.3 Thesis Contribution

Since, as discussed in the previous section, the nature of the instructions and their ordering in the given software influence the power consumption of a system, it is increasingly important to analyze and optimize the system power consumption from a software point of view. Just as logic gates are the fundamental units in digital hardware circuits, instructions can be the

PAGE 13

fundamental units in software. Therefore, accurate modeling and analysis at the instruction level is essential for quantifying power at higher levels of software abstraction. We are using the instruction model presented in [49] for our work.

In this thesis, we describe a new instruction scheduling algorithm for software power optimization. Our approach to instruction scheduling is based on the classical force-directed scheduling (FDS) technique presented by Paulin and Knight [29]. FDS has been effectively applied in behavioral and high-level synthesis of VLSI circuits in the field of VLSI design automation. A modification of the FDS algorithm for power optimization during behavioral synthesis, referred to as the FDS-LP algorithm, is described in [12] and [11]. Applying the force-directed scheduling technique to instruction reordering at the compiler level, however, is addressed for the first time in this work.

The classical FDS is a latency-constrained scheduling technique that attempts to minimize the resource requirements by maximizing the concurrency of operations. The FDS algorithm presented by Paulin and Knight in [29] optimizes the resource requirements during synthesis, but does not address the issue of power consumption. We have modified FDS to target power optimization during instruction scheduling, so that it can be applied at the compiler level. The motivation for using FDS in instruction scheduling for low power lies in the ability to model power consumption as a force for optimization. We refer to our proposed force-directed instruction scheduling algorithm for low power as FD-ISLP. The force in the force equation, $f = k \cdot x$, is expressed as the product of the spring constant $k$ and the displacement $x$. We model the inter-instruction power cost as $k$ and the change in instruction probability as $x$, while modeling power as $f$, which can be formulated as the following Equation 1.1.

$$\text{Power} = \text{Inter-instruction power cost} \times \Delta(\text{Instruction probability}) \qquad (1.1)$$

The flow diagram for power optimization during instruction scheduling, and its context within the compilation process, is shown in Figure 1.3. The input source code is compiled and the derived assembly code is further divided into basic blocks. The given instruction set architecture (ISA) is power characterized to build a power dissipation table (PDT). Such a table for an $n$-instruction ISA is an $n \times n$ matrix, where each entry $(i, j)$ of the table represents the power dissipation when

PAGE 14

instruction $i$ is followed by instruction $j$. A data dependency graph (DDG) is constructed for each basic block. The Low-Power Force-Directed Instruction Scheduling (FD-ISLP) algorithm tries to reschedule each basic block of the source program for low power. FD-ISLP takes as input a DDG for the current basic block and the PDT. During each iteration of the algorithm, it finds all the instructions that can be scheduled in that time-step, using the DDG. It tentatively assigns an instruction to a time-step and computes the force (i.e. power cost) associated with that assignment. Finally, the instruction having the least force value is chosen to be scheduled in the current time-step. These steps are repeated until all the instructions in the given basic block are scheduled.

[Figure 1.3. Flow Diagram for Power Optimization Methodology. Boxes: Source Code; Compile; Assembly Code; Divide into Basic Blocks; Construct Data Dependency Graph (DDG); Instruction Set Architecture (ISA); Power Dissipation Table (PDT); Proposed FD-ISLP Algorithm; Low-Power Scheduled Assembly Code.]
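
The loop described above (find the ready instructions from the DDG, evaluate a cost for each tentative choice, keep the cheapest) can be sketched as follows. This is a deliberately simplified greedy variant, not the FD-ISLP force computation: it scores a candidate only by the PDT entry for the previously scheduled instruction, whereas the thesis's force involves instruction probabilities over time-steps. The opcodes, the PDT values and the names `PDT`, `sequence_cost` and `greedy_schedule` are all invented for illustration.

```python
# Hypothetical power dissipation table (PDT): PDT[(a, b)] is the cost of
# executing opcode b immediately after opcode a. All values are invented.
PDT = {
    ("add", "add"): 1.0, ("add", "lw"): 2.5, ("add", "mul"): 1.8,
    ("lw", "add"): 2.4, ("lw", "lw"): 1.2, ("lw", "mul"): 3.0,
    ("mul", "add"): 1.7, ("mul", "lw"): 3.1, ("mul", "mul"): 1.1,
}


def sequence_cost(order, opcode):
    """Sum of PDT entries over consecutive instruction pairs."""
    return sum(PDT[(opcode[a], opcode[b])] for a, b in zip(order, order[1:]))


def greedy_schedule(opcode, deps):
    """opcode: {instr_id: opcode}; deps: {instr_id: ids that must precede it}.

    Assumes the dependency graph is acyclic.
    """
    order, done = [], set()
    while len(order) < len(opcode):
        # Instructions whose predecessors have all been scheduled.
        ready = sorted(i for i in opcode
                       if i not in done and deps.get(i, set()) <= done)
        if order:
            prev = opcode[order[-1]]
            pick = min(ready, key=lambda i: PDT[(prev, opcode[i])])
        else:
            pick = ready[0]  # nothing scheduled yet: take the lowest id
        order.append(pick)
        done.add(pick)
    return order


# Toy basic block: two loads feed two adds, which feed a multiply.
opcode = {1: "lw", 2: "add", 3: "lw", 4: "add", 5: "mul"}
deps = {2: {1}, 4: {3}, 5: {2, 4}}
original = [1, 2, 3, 4, 5]
reordered = greedy_schedule(opcode, deps)
print(reordered, sequence_cost(original, opcode), sequence_cost(reordered, opcode))
```

On this toy block the greedy choice groups the two loads and the two adds, which lowers the summed pair cost relative to the original order. A purely local choice like this can still be trapped by an early decision, which is the weakness the abstract says the global force formulation is meant to avoid.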

PAGE 15

The SimpleScalar ISA was considered, and the power simulations were performed using the SimplePower simulator for 0.35 micron technology. Five benchmark programs were used to verify the effectiveness of the scheduling algorithm.

1.4 Thesis Organization

The rest of the thesis is organized as follows. Chapter 2 discusses related work in software power optimization, with a special emphasis on instruction-level techniques, and also the classical FDS algorithm, which forms the basis of our work. Chapter 3 discusses the proposed instruction scheduling algorithm for low power. Chapter 4 reports the simulation environment, the experimental results and their analysis. Chapter 5 presents conclusions and the future directions that can be explored to augment our work.

PAGE 16

CHAPTER 2
RELATED WORK

Software directs much of the activity of the hardware in microprocessor, micro-controller and digital signal processor based systems. Consequently, software can have a substantial effect on the power consumption of a system. In light of this, there is a clear need to consider the power consumption of the system from a software point of view. Also, the ability to evaluate software in terms of power consumption can help to verify that a particular design meets its power constraints. The first step towards software power optimization is software power estimation. A number of techniques have been proposed for software power estimation and optimization, some of which are briefly discussed in this chapter. The taxonomy of the software power related works is presented in Figure 2.1. The related works are classified based on the different levels of the system at which they are applied (for estimation: gate, architecture and instruction level; for optimization: instruction, compiler and system level).

2.1 Software Power Estimation

Software power is estimated using different approaches: gate-level estimation, architecture-level estimation, switching activity on buses and instruction-level analysis [31].

2.1.1 Gate-Level Power Estimation

In [26], [7], the authors have presented a gate-level power estimation model of a processor running a program. It is an accurate method but requires a detailed gate-level description of the processor. Usually, such a detailed description is not available to software developers, and even if it is available, this method is too slow because of its high computational complexity.

PAGE 17

[Figure 2.1. Taxonomy of Software-Power Related Works. Top-level branches: Estimation (Gate-Level, Architecture-Level, Instruction-Level) and Optimization (Instruction-Level, Compiler-Level, System-Level).]

2.1.2 Architecture-Level Power Estimation

At the architecture level, power estimation is less accurate but much faster than at the gate level. This approach requires a model of the processor at the component level (such as the ALU, register file, etc.) and the power dissipation estimates for those components. It also needs knowledge of which components are activated by the individual instructions being executed. Finally, the execution of a program is simulated to identify the components that are active in each execution cycle, and the power dissipation of each component is added up to get the power estimate. Such an approach is implemented in [33], [22]. Mizuno et al. [24] have proposed an approach to the power estimation of an embedded hardware/software co-design system at the architectural level. The distinctive feature

PAGE 18

of this approach is the construction of a precise energy model at the architectural level for each component of the embedded system, taking into account data transfer as well as functional performance.

2.1.3 Bus Switching Activity

It is also important to estimate the power based on the bus switching activity of a processor. Modeling bus switching activity requires knowledge of the bus architecture of the processor, the binary op-codes of the instructions, the input data environment, and the manner in which code and data are mapped into the address space. For a given set of input instructions, switching statistics can be found by simulating an execution and monitoring the number of bit switches on each of the buses. Su et al. [36] have presented a cold scheduling algorithm for instruction scheduling that targets bus switching activity. In this work, only instruction and address related switching are considered, due to the unpredictability of the data values.

2.1.4 Instruction-Level Power Analysis

Instruction-Level Power Analysis (ILPA) [39] presents a methodology for developing and validating an instruction-level power model for a given processor. This method measures the current drawn by the processor as it repeatedly executes certain instructions.

In this model, each instruction in the instruction set architecture is associated with a fixed energy cost called the base energy cost. The base power cost is calculated as the product of the supply voltage $V$ and the average current drawn by the processor while running a loop with several instances of the same instruction. The base power cost times the number of non-overlapped cycles required to execute the instruction is its base energy cost. The variation in base energy due to different address values and operands is also computed. During program execution there are certain inter-instruction effects, which are also quantified. The inter-instruction effects take into account the effect of circuit state, the effect of resource constraints, e.g. pipeline stalls and write buffer stalls [16], and the effect of cache misses. The circuit state overhead for a pair of instructions is defined as the difference between the actual cost of the pair and the average of the base costs of the individual instructions [41].
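
A minimal sketch of this instruction-level model, under stated assumptions: the numeric costs and execution counts are made up, the units are arbitrary, and the function names `circuit_state_overhead` and `program_energy` are hypothetical. The total is formed as the text describes it: base costs weighted by execution counts, plus pairwise circuit-state overheads weighted by pair counts, plus any other inter-instruction effects such as stalls and cache misses.

```python
def circuit_state_overhead(pair_cost, base_a, base_b):
    """Overhead of a pair: measured pair cost minus the average base cost."""
    return pair_cost - (base_a + base_b) / 2.0


def program_energy(base_cost, exec_count, overhead, pair_count, other_effects):
    """Base costs times counts, plus pair overheads times counts, plus other effects."""
    base = sum(base_cost[i] * exec_count[i] for i in exec_count)
    inter = sum(overhead[p] * pair_count[p] for p in pair_count)
    return base + inter + sum(other_effects)


# Invented characterization data for a two-opcode program.
base_cost = {"add": 2.0, "lw": 3.0}
exec_count = {"add": 10, "lw": 5}
overhead = {("add", "lw"): circuit_state_overhead(3.0, 2.0, 3.0)}
pair_count = {("add", "lw"): 4}
other_effects = [1.5]  # e.g. a lumped stall or cache-miss penalty

total = program_energy(base_cost, exec_count, overhead, pair_count, other_effects)
print(total)
```

Keeping the overhead in a table indexed by ordered pairs is what makes reordering attractive: the base term does not depend on instruction order, so only the pair term changes when a block is rescheduled.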
The overall energy cost of a program is computed as the sum of the base energy costs of all participating instructions and their inter-instruction effects.

PAGE 19

[Figure 2.2. Steps Involved in Instruction-Level Power Analysis. Boxes: Assembly/Machine Code; Divide into Basic Blocks; Base Cost Data; Inter-instruction Effects; Basic Block Cost Estimation; Execution Profiling; Cache Penalty; Final Program Power Cost.]

The total energy cost, $E_p$, of a program can be given by the following equation [41]:

$$E_p = \sum_{i} (B_i \times N_i) + \sum_{i,j} (O_{i,j} \times N_{i,j}) + \sum_{k} E_k \qquad (2.1)$$

where, for each instruction $i$, $B_i$ is the base cost and $N_i$ is the number of times it will be executed; for every pair of consecutive instructions $(i, j)$, $O_{i,j}$ is the circuit state overhead and $N_{i,j}$ is the number of times the pair will be executed; and $E_k$ is the energy of the other inter-instruction effects that would occur during program execution. The overall flow of the ILPA methodology is shown in Figure 2.2. This technique has been verified with two commercial microprocessors: the Intel 486DX2 [40] and the Fujitsu SPARClite 934 [37].

Mehta et al. [22], [21] have proposed a method that is based on a simulation of the microprocessor and of the effect of the instruction set in the microprocessor model. This work suggests that lower level simulation can provide an estimate of the current drawn, from which the power consumption of each instruction is calculated. Chakrabarti et al. [5] have built a model of each basic module for the HC11

PAGE 20

micro-controller. The model has been implemented using hardware description languages. Also, black box models [8] or other kinds of models, where it is possible to make a current or power measurement, can be used. Once the modules activated by each instruction are determined, the power consumption can be computed as the sum of the power consumption of each active module. Sarta et al. [44] have presented an instruction-level power analysis technique that tries to relate the power dissipation to the executed instructions and their operand values. This model is shown to be very accurate for processors in which the operand distribution strongly influences the power consumption.

2.2 Software Power Optimization

Software power can be controlled at the following levels: the instruction, compiler and system levels.

2.2.1 Instruction-Level Optimization

Instruction set optimization is an important issue in the hardware/software co-design domain. Efforts at instruction set optimization are presented by Sato et al. [32], Binh et al. [3] and Huang et al. [15]. The works of Sato and Binh are both related to the PEAS-I system (Practical Environment for ASIP development version-I). PEAS-I defines an extended set of instructions by starting with a core instruction set. For each possible instruction, alternative implementations are defined. The optimal selection of an implementation is based on power and area constraints. Binh [3] extended the instruction selection formulation to account for pipeline hazards.

2.2.2 Compiler-Level Optimization

Compiler-level power optimization techniques can be further classified into the following categories: instruction scheduling for low power, selection of the least expensive instructions during code generation, reducing the frequency or cost of memory accesses, source code transformations, and exploiting the power reduction features of the hardware.

PAGE 21

2.2.2.1 Instruction Scheduling

Scheduling algorithms can be classified as transformational or constructive. A transformational algorithm starts with a default schedule and applies a set of transformations that are allowed by the dependency constraints to obtain a new schedule. On the other hand, a constructive algorithm builds a schedule by adding new operations to the existing schedule in each iteration. Instruction scheduling for low power aims at reducing the circuit state overhead. From the previous section, it is clear that circuit state overhead is the energy dissipated due to switching from the execution of one type of instruction to another. Given the circuit state costs for all possible instruction pairs, it is possible to compute the power consumption of a sequence of instructions. Instructions can be reordered [38] to have a lower circuit state overhead without a considerable penalty in run time. The experimental results show that the power savings in DSPs due to instruction scheduling are much more significant than in general purpose architectures. This is attributed to the fact that circuit state overhead constitutes a substantial part of the total power dissipation for DSPs.
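
The claim that the cost of a sequence follows from pairwise circuit-state costs, and that a dependency-respecting reordering can lower it, can be checked by brute force on a very small block. The sketch below is an illustration only: the pair costs are invented, the function names are hypothetical, and exhaustive enumeration is factorial in the block size, so it is not one of the scheduling algorithms surveyed in this chapter.

```python
from itertools import permutations


def overhead(order, opcode, pair_cost):
    """Total circuit-state overhead of consecutive instruction pairs."""
    return sum(pair_cost[(opcode[a], opcode[b])] for a, b in zip(order, order[1:]))


def respects(order, deps):
    """True if every instruction appears after all of its predecessors."""
    pos = {ins: k for k, ins in enumerate(order)}
    return all(pos[pre] < pos[ins] for ins, pres in deps.items() for pre in pres)


def best_order(opcode, deps, pair_cost):
    """Cheapest dependency-respecting permutation (first one found on ties)."""
    valid = (p for p in permutations(sorted(opcode)) if respects(p, deps))
    return min(valid, key=lambda p: overhead(p, opcode, pair_cost))


# Invented pair costs: switching between opcode types costs more than repeating one.
pair_cost = {("lw", "lw"): 0.2, ("lw", "add"): 1.0,
             ("add", "lw"): 1.0, ("add", "add"): 0.1}
opcode = {1: "lw", 2: "add", 3: "lw", 4: "add"}
deps = {2: {1}, 4: {3}}  # each add needs the load before it

best = best_order(opcode, deps, pair_cost)
print(best, overhead((1, 2, 3, 4), opcode, pair_cost), overhead(best, opcode, pair_cost))
```

With these numbers the alternating order lw, add, lw, add pays the type-switch cost three times, while grouping the loads and then the adds pays it once.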
Su et al. [36] have proposed a cold scheduling algorithm to reduce the switching activity along the control path. In [49], the authors have presented another scheduling algorithm by formulating the instruction scheduling problem as a traveling salesman problem (TSP). The algorithm uses minimum spanning tree and simulated annealing techniques for optimization. Instruction scheduling to reduce the off-chip power was accomplished by Tomiyama et al. [43]. This work tries to minimize the power consumption by reducing the bit switching on the off-chip buses. A scheduling algorithm to reduce the peak power dissipation is presented in [42]. Energy and performance constraints for scheduling were combined by Parikh et al. [28].

2.2.2.2 Energy Cost Based Code Generation

The traditional approach selects instructions based on their code size or execution time. The low-power approach proposes the selection of instructions based on their energy cost during code generation. It is observed that energy based and running time based code generators produce almost the same code for the 486DX2. However, this observation may not hold true for many application-specific processors such as DSP cores. In such cases aggressive power management can be obtained at some overhead in program running time.

PAGE 22

2.2.2.3 Reducing Memory Accesses

It has been observed that the power consumption of instructions involving memory access is much higher than that of instructions accessing only registers. For cache misses the cost is much higher, due to multiple cache fill cycles and energy dissipation in the external memory system. Consequently, a large amount of energy can be saved by reducing the memory accesses, which can be achieved through optimum register allocation during the compilation phase.

2.2.2.4 Source Code Transformations

Simunic et al. [34] have presented an optimization methodology at three levels of abstraction: algorithmic, data and instruction level. Algorithmic optimizations include implementing different algorithms having the same functionality and comparing their algorithmic efficiency (e.g. the number of operations). The data and instruction-level optimizations have also shown a significant amount of power savings. The problem of generating compact and efficient code for embedded DSP processors has been addressed in [17]. The improvements in execution efficiency and code size are achieved by exploiting the parallelism in the instruction set architecture. These improvements have also led to a reduction in power consumption. In [45], a loop unrolling technique to reduce power consumption has been successfully applied to DSP processors. In this work, the objective was to reduce the total number of comparisons, which resulted in low use of the ALU and in power savings. Hong et al. [14] have developed a divide-and-conquer compilation method to minimize the number of operations for general DSP computations.

2.2.2.5 Instruction Packing, Dual Memory Loads and Operand Swapping

DSPs have a special architectural feature that supports packing certain pairs of instructions into a single instruction. The packed instruction gives the same functionality as the unpacked instruction pair; however, it consumes considerably lower power than the individual instructions executed sequentially. The use of dual data memories can also reduce the power consumption. A special dual load instruction can load two operands from two different memory banks in one cycle.
Thus, memory allocation techniques that make efficient use of the dual memory banks can save a considerable amount of memory access power. It is observed that for the Fujitsu DSP, appropriate

PAGE 23

swapping of two multiplication operands could lead to up to 30% reduction in the multiplication power cost [20]. Under the software-controlled power management technique, certain unneeded parts of the CPU can be powered down. This is done by setting appropriate bits inside the system control register. The observations in [41] show that this technique can save significant power for the specific processors which support the necessary feature of powering down individual modules.

2.2.3 System-Level Power Optimization

At the system level, several memory optimization techniques have been proposed [2], [19]. In [18], [9], the authors have presented hardware-software partitioning techniques for constructing power-optimized processors. Voltage scaling is another technique proposed for power optimization. It is achieved through the use of a dynamically variable voltage supply [27], [6].

2.3 Instruction Scheduling for Low Power

In this section, we will look in detail at existing instruction scheduling techniques for low power.

2.3.1 Gray Code Addressing and Cold Scheduling

Su et al. [36] have proposed two techniques for low power, Gray code addressing and cold scheduling. Gray code has a one-bit difference in representation between consecutive numbers. Due to temporal and spatial locality, most program instructions access consecutively addressed memory locations. Therefore, the use of Gray code addressing can significantly reduce the number of bit switches along the address buses, saving a considerable amount of power. Table 2.1 shows the number of bit switches for a stream of 16 numbers in the binary addressing and Gray code addressing schemes. Let $B = \langle b_{n-1}, b_{n-2}, \ldots, b_0 \rangle$ be the binary representation and $G = \langle g_{n-1}, g_{n-2}, \ldots, g_0 \rangle$ be the Gray code representation of the same memory address. Then the following formula gives the conversion rule between the binary and Gray code representations, taking $b_n = 0$.

$$g_i = b_{i+1} \oplus b_i \qquad (i = n-1, n-2, \ldots, 0) \qquad (2.2)$$
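
The conversion rule and the switched-bit totals of Table 2.1 can be checked with a few lines of Python. This is a sketch for illustration: the function names are hypothetical, and the bitwise form `b ^ (b >> 1)` is the standard reflected Gray code, which matches the code words listed in the table.

```python
def binary_to_gray(b):
    """Reflected Gray code: each output bit is the XOR of two adjacent input bits."""
    return b ^ (b >> 1)


def switched_bits(values):
    """Total number of bit flips between consecutive values in a stream."""
    return sum(bin(x ^ y).count("1") for x, y in zip(values, values[1:]))


addresses = list(range(17))  # decimal 0..16, the rows of Table 2.1
binary_switches = switched_bits(addresses)
gray_switches = switched_bits([binary_to_gray(a) for a in addresses])

print(format(binary_to_gray(6), "05b"))  # 00101, as in the table row for 6
print(binary_switches, gray_switches)    # 31 16
```

Each step of a consecutive address stream flips exactly one bit in Gray code, so the Gray total equals the number of transitions (16), while binary counting pays for every carry.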

Table 2.1. Bit Switching in Binary and Gray Code Representation

Decimal         Binary   Gray Code
0               00000    00000
1               00001    00001
2               00010    00011
3               00011    00010
4               00100    00110
5               00101    00111
6               00110    00101
7               00111    00100
8               01000    01100
9               01001    01101
10              01010    01111
11              01011    01110
12              01100    01010
13              01101    01011
14              01110    01001
15              01111    01000
16              10000    11000
Switched bits   31       16

The other technique, called cold scheduling, is a software approach based on a traditional list scheduling algorithm modified to reduce the switching activity along the control path. Traditional instruction scheduling algorithms aim at improving performance. They schedule the instructions to reduce pipeline stalls, avoid pipeline hazards or improve resource utilization [36]. These algorithms need to be modified to attain the new objective of low power consumption.

The basic steps followed in the list scheduling algorithm are as follows.

1. Divide the code into basic blocks. A basic block is a piece of code which has only one entry point and one exit point.
2. Form a Control Dependency Graph (CDG) and/or a Data Dependency Graph (DDG) for each basic block.
3. Schedule the instructions in each basic block, satisfying the resource constraints and the dependency constraints.

Cold scheduling is an implementation of the list scheduling algorithm with a heuristic to reduce the switching activity. Let $B = I_1, I_2, \ldots, I_n$ be the instruction sequence in a basic block $B$. $S(I_i, I_j)$ denotes the number of bit switches when instruction $I_i$ is immediately followed by $I_j$. The total switching activity in a basic block is defined as:

$BS = \sum_{i=1}^{n-1} S(I_i, I_{i+1})$ (2.3)

Then the goal of the cold scheduling algorithm is to minimize the block switching activity, $BS$. For a program consisting of $k$ basic blocks $B_1, B_2, \ldots, B_k$, the cost function of the cold scheduler is given by Equation 2.4:

$\mathit{Cost} = \sum_{i=1}^{k} w_i \cdot BS_i$ (2.4)

where $w_1, w_2, \ldots, w_k$ are the weights based on the execution frequency of each basic block and $BS_1, BS_2, \ldots, BS_k$ are the total numbers of bit switches of each of the basic blocks.

2.3.2 Instruction Scheduling as a TSP

Choi et al. [49] have formulated the instruction reordering problem as a precedence-constrained Hamiltonian path problem for directed acyclic graphs (DAG) and as a traveling salesman problem (TSP), both of which are NP-hard. For power optimization, minimum spanning tree (MST) and simulated annealing (SA) mechanisms are used. The scheduling technique uses a power dissipation table (PDT), which is generated by power simulations. A PDT for an instruction set with n instructions is an (n x n) matrix, where each entry (i, j) is the power consumption due to the execution of instruction $I_i$ followed by instruction $I_j$. The scheduling algorithm uses a control flow and data dependency graph (CDG) for each basic block and the PDT. The problem of instruction reordering for low power is transformed into finding the tour of lowest cost (TSP). An initial solution is found as an MST using Prim's algorithm and is further refined by local search heuristics such as 2-optimal and 3-optimal, using SA.

2.3.3 Instruction Scheduling to Reduce Off-Chip Power

Tomiyama et al. [43] have proposed an instruction scheduling technique to reduce the power consumed for off-chip driving. The power dissipated by the off-chip drivers is proportional to the
number of bit switches along the wires of the off-chip buses. The scheduling algorithm schedules the instructions in each basic block such that the binary representations of two consecutive instructions in a memory block differ in fewer bits. This reduces the number of transitions on the data bus and hence the power consumption.

2.3.4 Minimize Peak Power Dissipation

In [42], an instruction scheduling approach is proposed which focuses on reducing the peak power dissipation as opposed to the power consumption. The peak power dissipation can be reduced by reducing the occurrences of current spikes during program execution. This approach tries to reduce such current spikes by limiting the amount of energy that can be dissipated in a given cycle. The scheduling algorithm is based on list scheduling. Once an instruction is ready to be scheduled and assigned to a functional unit (FU), the scheduler queries its power dissipation value. The scheduler then adds the value to the power dissipation of the current cycle. If the total exceeds the predefined threshold, the scheduler stops scheduling for the current cycle and begins the next cycle with the instruction that caused the violation in the previous cycle.

2.3.5 Energy and Performance Instruction Scheduling

Parikh et al. [28] have proposed performance-oriented, energy-oriented and combined approaches for instruction scheduling. The performance-oriented approach uses the traditional list scheduling algorithm with delay as the cost parameter. The energy-oriented approach also uses list scheduling, but uses the circuit-state effect (inter-instruction effect [39]) as the cost parameter. Each node of the DAG to be scheduled has a base cost and the edges denote the circuit-state effect. At each step, the scheduler picks the next node with the least circuit-state effect. In the combined approach, the scheduling decision is based on one parameter and the other parameter is considered only in the case of a tie; e.g. in the Energy-with-Performance approach, the decisions are based on the circuit-state effect and the delay parameter is used only when there is a tie between the circuit-state values.
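As an illustration of the cold scheduling cost function of Section 2.3.1 (Equations 2.3 and 2.4), the following minimal sketch evaluates the block switching activity and the weighted program cost. It assumes that S(I_i, I_j) can be approximated by the Hamming distance between two instruction encodings; the 8-bit encodings, the weights and all function names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two encodings (assumed model of S)."""
    return bin(a ^ b).count("1")


def block_switching(encodings):
    """BS of Equation 2.3: sum of S over consecutive instruction pairs."""
    return sum(hamming(x, y) for x, y in zip(encodings, encodings[1:]))


def program_cost(blocks, weights):
    """Cost of Equation 2.4: execution-frequency-weighted sum of BS values."""
    return sum(w * block_switching(b) for b, w in zip(blocks, weights))


# Hypothetical 8-bit instruction encodings for two basic blocks.
block_a = [0b00001111, 0b00001110, 0b11110000]
block_b = [0b10101010, 0b01010101]

bs_a = block_switching(block_a)
bs_b = block_switching(block_b)
cost = program_cost([block_a, block_b], weights=[10, 1])
print(bs_a, bs_b, cost)
```

A scheduler in the spirit of cold scheduling would compare such costs for the dependency-legal orderings of a block and prefer the ordering with the smaller value, giving frequently executed blocks more influence through their weights.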

2.4 Classical Force-Directed Scheduling

The Force-Directed Scheduling (FDS) algorithm, widely used in high-level synthesis, is an example of a constructive approach to scheduling. A constructive algorithm builds a schedule by adding new operations to the existing schedule in each iteration.

Our approach to instruction scheduling is based on the work by Paulin and Knight [29]. In the following section, we look into the details of FDS. FDS is a latency-constrained algorithm. Its intent is to reduce the resource requirements by balancing the concurrency of the operations assigned to the resources without violating the latency constraints. The algorithm is iterative, with one operation scheduled in each iteration [29]. In this version of the algorithm, all the operations are assumed to execute in a single control step.

The time frame of each operation is defined as the control steps in which the operation can be scheduled. The As Soon As Possible (ASAP) and As Late As Possible (ALAP) schedules determine the time frames. An ASAP schedule is one in which every operation is scheduled in the earliest possible time-step allowed by the dependency constraints. An ALAP schedule, on the other hand, assigns each operation to the latest possible time-step. The mobility of an operation is defined as the difference between its ASAP time-stamp, $t^S$, and its ALAP time-stamp, $t^L$. The width of the time frame of an operation is equal to its mobility plus 1. The operation probability is a function that is zero outside the time frame and is equal to the reciprocal of the frame width inside it.

Consider the data flow graph presented in Figure 2.3 with a unit-delay multiplier and a unit-delay adder. From the ASAP and ALAP schedules shown in Figure 2.4 and Figure 2.5, operation 1 has zero mobility. Hence, $p_1(1) = 1$ and $p_1(2) = p_1(3) = p_1(4) = 0$. Similar considerations apply to operation 2. Operation 3 has mobility 1 since its time frame is [1,2]. Consequently, $p_3(1) = p_3(2) = 0.5$ and $p_3(3) = p_3(4) = 0$.

The type distribution is the sum of the probabilities of the operations that are implementable by a specific resource type from the resource set. The type distribution indicates the concurrency of similar operations in a given time-step. The following equation describes the type distribution, where the sum runs over all operations $i$ of type $k$:

$q_k(l) = \sum_{i \,:\, \mathrm{type}(i) = k} p_i(l)$ (2.5)
where $q_k(l)$ indicates the type distribution in time-step $l$ for operation type $k$. Thus, the type distribution for the multiplier (i.e. $k = 1$) in time-step 1 (i.e. $l = 1$) is $q_1(1) = 1 + 1 + 0.5 = 2.5$. The type distribution for each operation type is calculated for every time-step.

Figure 2.3. Unscheduled Data Flow Graph

In Force-Directed Scheduling, the decision about the assignment of an operation to a time-step is based on the concept of force. The force exerted by an elastic spring is proportional to the displacement of its end points, the spring constant being the proportionality factor. The analogy of mechanical force is applied to the operation assignment by comparing the displacement with the change in probability and the spring constant with the type distribution. A force consists of two components: the self-force and the predecessor-successor force.

The self-force is the sum of the forces relating an operation to all schedule steps in its time frame. The self-force for assigning operation $i$ with time frame $[t_i^S, t_i^L]$ and mobility $\mu_i$ to the time-step $l$ is given as:

$\text{self-force}(i, l) = q_k(l) - \frac{1}{\mu_i + 1} \sum_{m = t_i^S}^{t_i^L} q_k(m)$ (2.6)

The assignment of one operation to a specific time-step may reduce the time frames of other operations, which are linked with it by dependency relations. The effect on such operations is
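considered by the predecessor-successor force. Let $[t_i^S, t_i^L]$ be the initial time frame and $[\tilde{t}_i^S, \tilde{t}_i^L]$ be the reduced time frame, with reduced mobility $\tilde{\mu}_i = \tilde{t}_i^L - \tilde{t}_i^S$. Then, the following equation gives the predecessor-successor force:

$\text{ps-force}(i, l) = \frac{1}{\tilde{\mu}_i + 1} \sum_{m = \tilde{t}_i^S}^{\tilde{t}_i^L} q_k(m) - \frac{1}{\mu_i + 1} \sum_{m = t_i^S}^{t_i^L} q_k(m)$ (2.7)

The total force for an assignment is calculated as the sum of the self-force and all ps-forces.

$\text{total-force} = \text{self-force} + \sum \text{ps-force}$ (2.8)

Figure 2.4. ASAP Schedule for DFG in Figure 2.3

Figure 2.5. ALAP Schedule for DFG in Figure 2.3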
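The quantities used by classical FDS (operation probability, type distribution and self-force) can be traced with a small sketch. The code below is an illustration under stated assumptions, not the original implementation: the three multiplications and their time frames [1,1], [1,1] and [1,2] are chosen to match the probabilities quoted in the text for operations 1 to 3, they are treated as the only operations of that type, and all function names are hypothetical.

```python
def probability(frame, step):
    """Uniform probability 1/width inside the time frame, 0 outside it."""
    lo, hi = frame
    return 1.0 / (hi - lo + 1) if lo <= step <= hi else 0.0


def type_distribution(frames, step):
    """Sum of the probabilities of all same-type operations in one time-step."""
    return sum(probability(f, step) for f in frames.values())


def self_force(frames, op, step):
    """Sum over the time frame of type distribution times probability change."""
    lo, hi = frames[op]
    force = 0.0
    for m in range(lo, hi + 1):
        after = 1.0 if m == step else 0.0      # probability once op is fixed
        delta = after - probability(frames[op], m)
        force += type_distribution(frames, m) * delta
    return force


# Assumed time frames of the multiplications (operation id -> [ASAP, ALAP]).
mult_frames = {1: (1, 1), 2: (1, 1), 3: (1, 2)}

q1 = type_distribution(mult_frames, 1)
f_step1 = self_force(mult_frames, 3, 1)
f_step2 = self_force(mult_frames, 3, 2)
print(q1, f_step1, f_step2)
```

With these frames, placing operation 3 in the crowded time-step 1 yields a positive force and placing it in time-step 2 yields a negative one, so the least-force rule moves it away from the step where two multiplications are already fixed.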

In each iteration of the FDS algorithm, the time frames, probabilities and forces are calculated, and the operation with the least force is scheduled in the corresponding time-step. The process is continued until all the operations are scheduled. The computational complexity of the algorithm is cubic in the number of operations.

In [12], [11], the authors have presented a modification to the original FDS, referred to as FDS-LP, for dynamic power optimization at the behavioral level. The FDS-LP algorithm optimizes the dynamic power by reducing the switched capacitance inside the resources. The force-directed technique has been used by relating the dynamic power to the force and establishing the analogies between their parameters. Sharing of the same functional module by two or more similar operations leads to switching activity inside that module. The resulting switched capacitance and the possibility of such a resource sharing combination are evaluated. To model the dynamic power as a force, the switched capacitance due to a resource sharing combination is related to the spring constant, $k$, and the probability of selecting such a combination is related to the displacement, $x$, in the force equation $f = k \cdot x$. Thus, the product of the switched capacitance associated with a resource sharing combination and its selection probability determines its power cost. This power cost acts as a force associated with the combination.

2.5 Our Work

The work presented here describes a new instruction scheduling algorithm based on the classical force-directed approach. A Force-Directed Scheduling (FDS) algorithm was proposed by Paulin and Knight [29] for high-level synthesis. We have modified FDS to optimize power through instruction scheduling. The advantage of using FDS is that it tries to find a globally optimized solution. In each iteration of the algorithm, FDS takes into account the global effect of every tentative assignment and thus avoids getting trapped in a locally optimized solution [12]. As seen in the previous section, most of the existing instruction scheduling techniques for power optimization use the list scheduling approach. The list scheduling heuristic suffers from the shortcoming that it can be trapped in a local minimum of the cost function and therefore may not find a globally optimal solution. This motivates us to use FDS, which aims at a globally optimized solution, in our work.

It should be noted that our approach to instruction scheduling is based on the force-directed approach; however, the model itself is quite different in terms of its parameter modeling, the role of the parameters in the force equation, and the level (i.e. compiler level) at which the model is applied. We model software power as a force for optimization. In our model, the spring constant, $k$, of the force equation is related to the inter-instruction power cost, and the displacement, $x$, is related to the change in instruction probability. Thus, our model differs from the other force-directed models described in the previous section.

2.6 Summary

This chapter describes the existing work in architecture-level and software-level power estimation and optimization. We also present, in detail, the Force-Directed Scheduling algorithm and the steps involved, such as time frame computation, probabilities, and force calculations. The FDS technique accounts for the global effect of an assignment during the scheduling process. Hence, FDS provides the advantage of avoiding locally optimized solutions. Also, none of the existing instruction scheduling methods for power optimization at the compiler level or software level have used the force-directed approach. This is the motivation behind using a modified FDS to schedule instructions for power optimization at the compiler level in our work.

CHAPTER 3
LOW POWER FORCE-DIRECTED INSTRUCTION SCHEDULING

We present a new instruction scheduling method for power optimization called Low-Power Force-Directed Instruction Scheduling (FD-ISLP). FD-ISLP is a latency-constrained algorithm that reorders the instructions in a basic block so that the total power consumption of the basic block is reduced. A given assembly code is divided into basic blocks, and a data dependency graph (DDG) is constructed for each of the basic blocks. SimplePower is used for power characterization of the ISA to obtain a power dissipation table (PDT). FD-ISLP takes as input a DDG for the basic block to be scheduled for low power and the PDT. The flow of the power optimization methodology is presented in Figure 1.3.

In each iteration of the algorithm, one instruction is scheduled based on the force value. Power is modeled as a force by relating the inter-instruction power cost to the spring constant, $k$, and the change in instruction probability to the displacement, $x$, in the force equation $f = k \cdot x$. The power cost associated with the assignment of an instruction to a particular time-step is determined by the change in probabilities and the inter-instruction costs of the instruction itself and the following instructions whose time frames get reduced. Every instruction that can be scheduled in the current time-step is tentatively assigned to that time-step and the power cost is computed for this assignment. Finally, the instruction having the least power cost is selected to be scheduled in that time-step. The process is continued until all the instructions in a given basic block are scheduled.

The proposed approach is based on the FDS algorithm by Paulin and Knight [29], which was explained in detail in Chapter 2.

3.1 Modifications to the FDS Algorithm

FDS has typically been applied to operation scheduling in high-level synthesis. We apply FDS at the compiler level for reordering instructions for power optimization. Our approach aims
at optimizing the software power, whereas the original FDS targets resource optimization. Hence, the parameters of the force equation modeled in our algorithm are different from those modeled in FDS. The mechanical force given in Equation 1.1 is repeated here for clarity.

$f = k \cdot x$ (3.1)

FDS is a resource optimization algorithm, which tries to reduce the resource requirements of a design. Therefore, FDS models the resources as a force to be optimized. Resource optimization is done by increasing the operation concurrency, which is the ability to schedule operations of a similar type in the same control step. We have modeled software power as a force to be optimized. We relate the spring constant $k$ in the force equation to the inter-instruction power cost. The inter-instruction power cost for a pair of instructions is the amount of power consumed when the two instructions are executed one after the other. The original FDS relates $k$ to the type distribution, which is indicative of the concurrency of similar operations for specific resources. The type distribution is the sum of the probabilities of the operations that are implemented by a specific resource type from the resource set. We model the displacement $x$ as the change in instruction probability. The instruction probability can be defined as the probability with which an instruction can be scheduled in a particular time-step, assuming a uniform probability distribution. The instruction probability of an instruction is a function of its time frame: it is equal to the inverse of the length of the time frame for the time-steps within the time frame and is zero for the other time-steps. FDS models $x$ as the change in mobility of operations. The mobility of an operation is defined as the difference between its ASAP time-stamp and its ALAP time-stamp.

3.2 ISA Power Characterization

The first step involves profiling the power characteristics of the instruction set architecture to generate a power dissipation table (PDT). The PDT, for an ISA consisting of n instructions, is an n x n matrix. Each entry PDT[i,j] represents the average power consumed when instruction i is immediately followed by instruction j. We term this value the inter-instruction power cost. We use an execution-driven, cycle-accurate, RT-level power estimation tool, the SimplePower toolset [50],
for all the power simulations. SimplePower uses a transition-sensitive energy model for an in-order 5-stage pipelined datapath. To generate each entry in the PDT, the corresponding instruction pair, together with a nop instruction, is repeated 200,000 times in an assembly program. The assembly code is then run through the simulator 10 times to obtain the total energy dissipation. The power cost for the corresponding pair is then averaged from the total. Repeating the same instruction pair instead of using a loop avoids any corruption that could arise due to loop overheads. Figure 3.1 shows the details of the PDT generation.

_start:
        nop
        move $2,$3
        addu $2,$3,$5
        nop
        move $2,$3
        addu $2,$3,$5
        nop
        move $2,$3
        addu $2,$3,$5
        ... (repeated 200,000 times)

Figure 3.1. PDT Generation Example

3.3 Basic Blocks of Assembly Code

The input, a C language program, is cross-compiled for the SimpleScalar architecture using the SimpleScalar toolset [4], version 2.0. The generated assembly language code is divided into basic blocks. A basic block is a sequence of consecutive instructions in which the flow of control enters at the beginning and leaves at the end without halt or possibility of branching except at the end. The algorithm [1] used for basic block partitioning is presented in Figure 3.2. The proposed approach, FD-ISLP, is an intra-basic-block scheduling algorithm that schedules the instructions only within a basic block. Figure 3.3 shows an example of basic blocks for the Quicksort benchmark program.
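Leader-based basic block partitioning can be sketched in a few lines. The code below is a simplified illustration rather than the parser used in this work: it assumes one instruction or label per line, treats the mnemonics in the hypothetical `BRANCHES` set as the only control transfers, takes the last operand of a branch as its target label, and drops the labels from the emitted blocks. The input program is invented for the example.

```python
BRANCHES = {"j", "jal", "beq", "bne"}  # assumed set of control-transfer mnemonics


def partition(lines):
    """Split assembly lines into basic blocks using the leader rules."""
    # Pass 1: separate labels from instructions and remember label positions.
    instrs, label_at = [], {}
    for line in lines:
        line = line.strip()
        if line.endswith(":"):
            label_at[line[:-1]] = len(instrs)   # label names the next instruction
        elif line:
            instrs.append(line)

    # Pass 2: collect leaders (first instruction, branch targets, fall-throughs).
    leaders = {0} if instrs else set()
    for i, ins in enumerate(instrs):
        parts = ins.replace(",", " ").split()
        if parts[0] in BRANCHES:
            target = label_at.get(parts[-1])
            if target is not None and target < len(instrs):
                leaders.add(target)
            if i + 1 < len(instrs):
                leaders.add(i + 1)

    # Pass 3: each block runs from one leader up to the next leader.
    order = sorted(leaders)
    return [instrs[s:e] for s, e in zip(order, order[1:] + [len(instrs)])]


code = [
    "lw $2,0($fp)",
    "beq $2,$0,$L4",
    "addu $2,$2,1",
    "j $L5",
    "$L4:",
    "subu $2,$2,1",
    "$L5:",
    "sw $2,0($fp)",
]
blocks = partition(code)
for number, block in enumerate(blocks, start=1):
    print("BB", number, block)
```

Whether a call such as `jal` should end a basic block is a design choice; it is treated as a branch here only because the leader rules speak of conditional and unconditional branches in general.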

Input: A sequence of three-address instructions
Output: A list of basic blocks with each three-address instruction in exactly one basic block

1. First, determine the set of leaders; a leader is the first instruction in each basic block. We use the following rules:
   (a) The first instruction is a leader.
   (b) Any instruction that is a target of a conditional or unconditional branch is a leader.
   (c) Any instruction that follows a conditional or unconditional branch is a leader.
2. For each leader, its basic block consists of the leader and all the instructions up to but not including the next leader or the end of the program.

Figure 3.2. Basic Block Partitioning Algorithm

3.4 Data Dependency Graph Construction

For each basic block (BB), a data dependency graph (DDG) is constructed representing the dependency relations among the instructions in the basic block. The DDG representing a basic block is a directed acyclic graph (DAG). Each instruction in a BB is represented as a node in the DDG, and the dependency between a pair of instructions is represented as a directed edge pointing towards the dependent instruction. Figure 3.4 shows a basic block and its DDG representation.

3.5 Low-Power Force-Directed Instruction Scheduling Algorithm

The FD-ISLP algorithm takes two parameters as input, a DDG representing a basic block and the PDT, and returns a power-optimized schedule. Pseudo-code for the algorithm is presented in Figure 3.5. The algorithm starts by finding the ASAP schedule and the ALAP schedule (lines 3-4) for the input DDG. In the ASAP schedule, each instruction is scheduled at the earliest possible time-step under the dependency constraints, and in the ALAP schedule, each instruction is scheduled at the latest possible time-step. The latency constraint is determined as the critical path length of the DDG. For every instruction node in the DDG, the time frame is computed from the ASAP and ALAP time-stamps (line 7). The instructions having a time frame equal to zero are the critical instructions
and have to be scheduled in specific time-steps. Hence, they are scheduled at their ASAP time-step (line 9). In each iteration of the algorithm, all the unscheduled instructions are considered. Each unscheduled instruction is tentatively assigned to every possible time-step in its time frame allowed by the data dependencies (line 16). Then, the power cost associated with this assignment is determined by computing the self-force and the predecessor-successor forces (lines 17-18).

The total force for the assignment is computed as the sum of the self-force and the ps-forces (line 19). The force computations are repeated for all the unscheduled instructions for all feasible time-steps. The tentative assignment that has the least force value among all is chosen and the corresponding instruction is scheduled in that time-step (line 23). The ASAP and ALAP schedules are updated and the time frames are recomputed. The above procedure is repeated until all the instructions have been scheduled.

Figure 3.3. Basic Block Generation Example (assembly listing of basic blocks BB 9 to BB 11)

The algorithm can be explained with an example of a DDG and its partial schedule, shown in Figure 3.6 and Figure 3.7, respectively. In Figure 3.6, the number pair [i,j] next to each instruction node in the DDG represents its ASAP and ALAP time-stamps; e.g. instruction 1 can be scheduled as early as the first time-step and as late as the fourth time-step. Similarly, instruction 5 can be scheduled as early as the second time-step and as late as the fifth time-step. As can be seen from the figure, instruction 6 has both its ASAP and ALAP time-stamps equal to 6 and
hence, its time frame is equal to 0. Therefore, instruction 6 is a critical instruction and needs to be scheduled in time-step 6.

Basic block:
        subu $sp,$sp,24
        sw   $fp,16($sp)
        move $fp,$sp
        sw   $4,24($fp)
        sw   $5,28($fp)
        li   $2,0x00000001
        sw   $2,0($fp)

Figure 3.4. DDG Generation Example

In Figure 3.7, a number [k] alongside instructions 1 and 6 indicates the time-step in which they have been scheduled, and the number pair [i,j] next to all other instruction nodes represents the ASAP and ALAP time-stamps for each one of them. The scheduled instructions are shown with two concentric circles. Thus, instructions 1 and 6 have been assigned to time-steps 1 and 6, respectively, and the ASAP and ALAP time-stamps of all the instructions have been updated. The time frame of instruction 2 is [2,4]. Assuming a uniform probability distribution, instruction 2 can be scheduled in any of the time-steps 2, 3, and 4 with an equal probability of 0.33. If instruction 2 is scheduled in time-step 2, the power cost of the instruction pair (1,2) should be considered (because instruction 1 has already been scheduled in time-step 1). Similarly, for scheduling instruction 2 in time-steps 3 and 4, the power costs of instruction pairs (3,2) and (5,2) need to be considered, respectively.

Input: Data dependency graph, DDG, and power dissipation table, PDT
Output: Low power schedule, LP-schedule

(01) Algorithm FD-ISLP(Graph DDG, PDT)
(02) Begin
(03)   ASAP-schedule(G)
(04)   ALAP-schedule(G)
(05)   unscheduled ← total-inst(G)
(06)   for each inst ∈ G
(07)     Determine-time-frame(inst)
(08)     if (inst.time-frame = 0) then
(09)       LP-schedule[inst.ASAP] ← inst
(10)       unscheduled ← unscheduled - 1
(11)   end for
(12)   while (unscheduled) do
(13)     for each inst ∈ G
(14)       if (inst.scheduled = FALSE) then
(15)         for each tstep ∈ time-frame(inst)
(16)           Make a tentative assignment (inst, tstep)
(17)           Compute self-force(inst, tstep)
(18)           Compute ps-force(inst, tstep)
(19)           total-force(inst, tstep) ← self-force(inst, tstep) + ps-force(inst, tstep)
(20)         end for
(21)     end for
(22)     min-force(inst, tstep) ← minimum-total-force(inst, tstep)
(23)     LP-schedule[tstep] ← inst
(24)     inst.scheduled ← TRUE
(25)     Update-ASAP(G, inst)
(26)     Update-ALAP(G, inst)
(27)     Compute time frames
(28)     unscheduled ← unscheduled - 1
(29)   end while
(30)   return LP-schedule
(31) End

Figure 3.5. FD-ISLP Algorithm

Figure 3.6. An Example of a DDG

Figure 3.7. Partial Schedule for the DDG in Figure 3.6

If instruction 2 is tentatively assigned to time-step 2, then the force computations are done as follows. The tentative assignment changes the scheduling probabilities of instruction 2 in time-steps 2, 3 and 4 to 1.0, 0 and 0, respectively. The changes in the instruction probabilities are summarized in Table 3.1. Table 3.2 lists all the possible assignments for instruction 2. In the first assignment, the inter-instruction cost of the pair (1,2) should be considered because instruction 2 follows instruction 1. If
instruction 2 is assigned to time-step 3, the dependency constraints allow only instruction 3 to be scheduled in time-step 2 (because instruction 1 is already scheduled in time-step 1). In this case, the inter-instruction cost of the pair (3,2) should be considered. Similarly, for the third assignment, the inter-instruction cost of the pair (5,2) should be used. From the discussion in Section 3.1, force is modeled as the product of the inter-instruction power cost and the change in instruction probabilities. The self-force is the sum of the forces relating an instruction to all schedule steps in its time frame. Hence, the self-force for the assignment of instruction 2 to time-step 2 is given as below:

[equation not recoverable from the source text]

Similarly, the self-forces for the assignment of instruction 2 to time-steps 3 and 4 are given by the following two equations, respectively:

[equations not recoverable from the source text]

Table 3.1. Probabilities of Instruction 2 for Assignment (2,2)

Time-step           1   2             3          4          5   6
Before assignment   0   0.33          0.33       0.33       0   0
After assignment    0   1.00          0          0          0   0
Change              0   (1.00-0.33)   (0-0.33)   (0-0.33)   0   0

Table 3.2. Possible Assignments for Instruction 2

Time-step      1   2   3   4   5   6
Assignment 1   1   2               6
Assignment 2   1   3   2           6
Assignment 3   1   3   5   2       6

The predecessor-successor forces are computed by evaluating the variation of the self-forces of the predecessors/successors due to the restrictions on their time frames [23]. Instructions 2 and 3 have the same time frame [2,4]. Therefore, the assignment of instruction 2 to time-step 2 puts restrictions on the time frame of its successor, instruction 3. The possible assignments of instruction 3 before the tentative assignment of instruction 2 are listed in Table 3.4. The tentative assignment of instruction 2 to time-step 2 implies the assignment of instruction 3 to time-step 3 or 4. The changes in the probabilities of instruction 3 are summarized in Table 3.3. In this case, the ps-force consists only of the successor force, because the predecessor of instruction 2, i.e. instruction 1, has already
been scheduled and the assignment does not change its time frame. Therefore, the ps-force for this assignment is computed as follows (refer to Equation 2.7):

[equation not recoverable from the source text]

Table 3.3. Probabilities of Instruction 3 for Assignment (2,2)

Time-step           1   2          3             4             5   6
Before assignment   0   0.33       0.33          0.33          0   0
After assignment    0   0          0.50          0.50          0   0
Change              0   (0-0.33)   (0.50-0.33)   (0.50-0.33)   0   0

Table 3.4. Possible Assignments for Instruction 3

Time-step      1   2   3   4   5   6
Assignment 1   1   3               6
Assignment 2   1   2   3           6
Assignment 3   1   2   4   3       6

The total force associated with the assignment (2,2) is computed as the sum of the self-force and the ps-forces.

Similarly, the force computations are done for all other unscheduled instructions for their feasible time-steps. The tentative assignment with the lowest force is selected and that assignment is finalized. The same procedure is repeated until all the instructions in the given basic block are scheduled. An intuitive analysis of the algorithm shows that it promotes instruction assignments that result in the least force with their neighbors. Consequently, these assignments result in the least force for the entire basic block, leading to power savings.

The time complexity of the proposed approach is $O(n^3)$, where n is the number of instruction nodes in the DDG. The cubic time complexity of FD-ISLP can be explained as follows: the algorithm schedules one instruction in each iteration and hence goes through n iterations to schedule the n instructions. In each iteration, all the unscheduled instructions are considered. And for every
instruction, forces are computed for each feasible time-step allowed by the data dependencies. The most time-intensive steps in the algorithm are computing the self-forces and the ps-forces.

3.6 Summary

The problem of instruction scheduling for power optimization has been successfully modeled using the force-directed scheduling technique. The software power is modeled as the force to be optimized, by relating the spring constant, $k$, to the inter-instruction power cost and the displacement, $x$, to the change in instruction probability. By considering the global effect during each step of the algorithm, FD-ISLP avoids being trapped in locally optimal solutions. The effectiveness of our new algorithm is presented in the next chapter.
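The first steps of FD-ISLP (ASAP and ALAP schedules, time frames, critical instructions and the probability change caused by a tentative assignment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this work: every instruction is assumed to take one time-step, the nodes are assumed to be listed in program order (which is a topological order of a basic block's DDG), and the six-node graph and all function names are hypothetical. The power force itself is left out because it requires PDT values.

```python
def time_frames(nodes, edges):
    """Return {node: (ASAP, ALAP)}; an edge (u, v) means v depends on u."""
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)
    asap = {}
    for n in nodes:                       # program order is topological
        asap[n] = 1 + max((asap[p] for p in preds[n]), default=0)
    latency = max(asap.values())          # critical path length
    alap = {}
    for n in reversed(nodes):
        alap[n] = min((alap[s] for s in succs[n]), default=latency + 1) - 1
    return {n: (asap[n], alap[n]) for n in nodes}


def probabilities(frame, latency):
    """Uniform instruction probability per time-step 1..latency."""
    lo, hi = frame
    return [1.0 / (hi - lo + 1) if lo <= t <= hi else 0.0
            for t in range(1, latency + 1)]


def probability_change(frame, step, latency):
    """Change in probabilities when the instruction is fixed to one step."""
    before = probabilities(frame, latency)
    after = probabilities((step, step), latency)
    return [a - b for a, b in zip(after, before)]


# Hypothetical DDG: chain 1 -> 2 -> 3 -> 6, with 4 and 5 feeding only 6.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (3, 6), (4, 6), (5, 6)]

frames = time_frames(nodes, edges)
critical = [n for n in nodes if frames[n][0] == frames[n][1]]
change = probability_change(frames[4], 2, latency=4)
print(frames, critical, change)
```

In this example the chain fixes the latency at four time-steps, its instructions have zero mobility, and fixing the mobile instruction 4 to time-step 2 raises its probability there while lowering it in the other steps of its frame, which is the displacement term of the force model.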

CHAPTER 4
EXPERIMENTAL RESULTS

This chapter presents detailed experimental results to verify the effectiveness of the FD-ISLP algorithm described in the previous chapter. The implementation details of the proposed algorithm and the benchmark programs used to validate its efficiency are described. The methodology for the power optimization approach is presented in Figure 4.1. Each module is separately coded and tested. For integration, the output of each step was made compatible with the next step; e.g. the data dependency graph (DDG) generation module creates an adjacency matrix representation, and the next step of scheduling accepts this representation and the power dissipation table (PDT) to generate a low-power schedule. We also present a brief discussion of SimplePower, the tool that we have used for our power simulations.

Figure 4.1. Power Optimization and Validation Framework (C source code for the benchmark programs, SimpleScalar cross-compiler, assembly code, division into basic blocks, DDG extraction for each basic block, FD-ISLP algorithm with the PDT obtained from SimplePower characterization of the SS instruction set, low-power scheduled assembly code, and SimplePower simulation to compare the power consumption of the unscheduled and scheduled code)

4.1 Experimental Procedure

The various steps involved in the proposed power optimization approach are discussed below.

4.1.1 Cross-Compilation

The power simulator, SimplePower, emulates an integer subset of the SimpleScalar [4] instruction set architecture (SS-ISA). Therefore, the input benchmark C programs are cross-compiled using the ssbig-na-sstrix-gcc compiler to obtain the SimpleScalar (SS) assembly code.

4.1.2 Basic-Block Partitioning and DDG Extraction

The code for parsing the SS assembly code was written in C, and it outputs a list of basic blocks for a given assembly language code. The details of the basic block partitioning algorithm can be found in Section 3.3. The output from this step is passed to the next module, which extracts
the data dependency graph (DDG) from each basic block. The DDG extraction module is also implemented in C, and the DDG is represented as an adjacency-matrix data structure.

4.1.3 PDT Generation

A subset of the SS-ISA was required for our simulations. It consists of 15 instructions, which are listed in Table 4.1. This subset of the instruction set architecture covers all the benchmark programs that we have used. The power characterization of the instructions under consideration is done in this step. The SimplePower toolset was used to obtain the power consumption of assembly programs, each of which represents an instruction pair. The details of PDT generation are described in Section 3.2. For its force computations, the FD-ISLP algorithm uses the inter-instruction power cost values from the PDT.

Table 4.1. Instruction Set Architecture

No.   Instruction   Brief Description                    Example
1.    addu          Integer addition unsigned            addu $sp,$sp,24
2.    subu          Integer subtraction unsigned         subu $sp,$sp,24
3.    lw            Load word                            lw $31,20($fp)
4.    sw            Store word                           sw $31,20($sp)
5.    j             Jump to absolute address             j label
6.    jal           Jump to absolute address and link    jal label
7.    beq           Branch if equal                      beq $2,$0,$L4
8.    bne           Branch if not equal                  bne $2,$3,$L5
9.    sll           Shift left logical                   sll $2,$3,2
10.   slt           Set less than                        slt $2,$3,$2
11.   sra           Shift right arithmetic               sra $2,$3,1
12.   mflo          Move from LO register                mflo $2
13.   move          Move register contents               move $2,$4
14.   la            Load address                         la $4,a
15.   li            Load immediate                       li $5,0x00000064

4.1.4 FD-ISLP

The low-power force-directed instruction scheduling algorithm was coded in C. The algorithm schedules each block of the code separately using the DDG and PDT information to optimize its power consumption. The details of the scheduling algorithm and the force computations are presented in Section 3.5. Each scheduled basic block is then written to an output assembly file. Once the entire assembly file is generated, SimplePower is used to simulate its power consumption.

The proposed scheduling technique has been implemented and tested successfully with various C benchmark programs. The benchmark programs used for this purpose are listed in Table 4.2. To demonstrate the effectiveness of our approach, we also find the power consumption of the original benchmark programs and compare the power numbers. It was observed that the scheduling technique could save power for all the benchmarks. We also present a comparison of our results with the work of Choi et al. [49]. The comparison shows that our technique outperforms the other technique for all the cases under consideration. All the experiments were run on a Sun Sparc Ultra 10 workstation, running SunOS, with a 100 MHz processor and 128 MB RAM.

Table 4.2. Benchmark Set

No.   Benchmark   Brief Description
1.    bisrch.c    Binary search the same n numbers
2.    bubble.c    Bubble sort n random numbers
3.    fir.c       Finite impulse response filter
4.    hanoi.c     Tower of Hanoi for 1 to n disks
5.    heap.c      Heap sort the same n numbers
6.    matm.c      n x n Matrix multiplication
7.    perm.c      Permutation of n numbers
8.    queen.c     Find all the solutions for the n-Queens problem
9.    quick.c     Quick sort the same n numbers
10.   rsa.c       RSA cryptographic algorithm

4.2 SimplePower

SimplePower is a framework that can be used for evaluating the effect of high-level algorithmic, architectural, and compilation trade-offs on power. It is an execution-driven, cycle-accurate, RT-level power estimation tool. It is based on the architecture of a five-stage pipelined datapath. The instruction set architecture is a subset of the instruction set (the integer part) of SimpleScalar, which is a suite of publicly available tools to simulate modern microprocessors. The five stages of the pipelined datapath are IF (instruction fetch), ID (instruction decode), EXE (execution), MEM (memory access), and WB (write back). At each clock cycle, the SimplePower core simulates the execution of all active instructions and calls the corresponding power estimation interfaces for all
activated functional units. The cache simulator simulates the status of the instruction cache and data cache. The bus simulator snoops the instruction cache address bus, the instruction cache data bus, the data cache address bus, and the data cache data bus. It records the total number of accesses and the number of transitions on these buses.

This tool evaluates the energy considering the system as a whole rather than just as a sum of parts, and it supports both compiler and architectural experimentation. In the energy estimation procedure we have used, the source C code (benchmark.c) is compiled by the SimpleScalar version of gcc (ssbig-na-sstrix-gcc), which generates SimpleScalar assembly code (benchmark.s). The SimpleScalar assembler gas and the loader/linker gld produce SimpleScalar executables that are then loaded into SimplePower main memory and executed by SimplePower. SimplePower provides the total number of cycles in execution, the number of transitions on the on-chip buses, switch capacitance statistics for each pipeline stage, switch capacitance statistics for the different functional units, and the total switch capacitance. All the simulations were performed assuming 0.35 micron technology (simpower35).

4.3 Power Savings

Tables 4.3, 4.4 and 4.5 provide a summary of the results obtained on the benchmark programs. In Table 4.3, the second column indicates the benchmark programs that are provided with the toolset. The third and fourth columns indicate the total power consumption (in units of nF) of the original (unscheduled) benchmark code and the FD-ISLP scheduled code, respectively. The last column shows the power savings achieved using our approach. The maximum power savings obtained is 30.87% and the minimum savings is 3.07%, with an average of 12.68% over all the benchmarks. The power consumption readings were obtained by averaging the power simulations for 10 different sets of 100 randomly generated input patterns. For matrix multiplication, the size of the matrix is indicated with a number enclosed in parentheses alongside the name of the benchmark; e.g. the results in Table 4.3 show that a matrix of size 10 x 10 was used.
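The reduction column of Table 4.3 is consistent with the relative difference between the unscheduled and scheduled readings. The short sketch below recomputes it for a few rows; the formula is an assumption inferred from the tabulated numbers, and the helper name `reduction_percent` is illustrative.

```python
def reduction_percent(unscheduled: float, scheduled: float) -> float:
    """Relative saving of the scheduled code, in percent of the original."""
    return 100.0 * (unscheduled - scheduled) / unscheduled


# (unscheduled, FD-ISLP) readings copied from Table 4.3.
readings = {
    "Binary search": (11.50, 10.84),
    "Bubble sort": (12370.26, 11990.49),
    "Heap sort": (3599.52, 3302.92),
    "Matrix multiplication (10)": (110.36, 97.85),
}

for name, (before, after) in readings.items():
    print(f"{name}: {reduction_percent(before, after):.2f}%")
```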
Table 4.3. Power Savings Results for Benchmarks

No.  Benchmark                    Total Power        Total Power     Power
                                  Unscheduled (nF)   FD-ISLP (nF)    Reduction (%)
1.   Binary search                        11.50            10.84        5.74
2.   Bubble sort                       12370.26         11990.49        3.07
3.   FIR filter                           95.73            84.63       11.59
4.   Hanoi                             63449.17         56996.36       10.17
5.   Heap sort                          3599.52          3302.92        8.24
6.   Matrix multiplication (10)          110.36            97.85       11.34
7.   Permutation                       24010.55         20572.24       14.32
8.   Queen's Problem                   11924.95         10376.14       12.98
9.   Quick sort                         2078.56          1445.91       30.44
10.  RSA encryption                  6867677.30       5599904.07       18.46

Table 4.4 shows the number of CPU cycles required to execute the corresponding benchmark program. It can be observed that our algorithm saves CPU cycles of execution for all the benchmark programs. Thus, FD-ISLP results in power savings without any performance overhead. The results also support the notion that performance improvement can also lead to power reduction.

Table 4.4. Execution Time of Benchmarks

No.  Benchmark                    Total Cycles    Total Cycles
                                  Unscheduled     FD-ISLP
1.   Binary search                        390             354
2.   Bubble sort                       391398          383230
3.   FIR filter                          3062            2875
4.   Hanoi                             220820          202204
5.   Heap sort                         105380           99170
6.   Matrix multiplication (10)          4005            3852
7.   Permutation                       751789          670746
8.   Queen's Problem                   469459          425735
9.   Quick sort                         67543           44754
10.  RSA encryption                 212619184       177855947

It was observed that varying the sizes of the input vectors results in almost the same amount of power savings for the codes scheduled using FD-ISLP. The experiments were performed with sets of 100, 500, 1000 and 5000 randomly generated input patterns. The results obtained with these experiments are summarized in Table 4.5. For the matrix multiplication benchmark, the number enclosed in parentheses indicates the size of the input matrix.

Table 4.5. Power Savings for Variable Sized Input Vectors

No.  Benchmark               Power Reduction (%) for Input Size
                             100          500          1000         5000
1.   Binary search           5.74         6.04         6.72         5.94
2.   Bubble sort             3.07         3.12         3.17         3.11
3.   Heap sort               8.24         8.12         8.47         8.38
4.   Matrix multiplication   11.34 (10)   11.30 (25)   11.62 (33)   12.09 (70)
5.   Quick sort              30.44        30.67        30.52        31.26

As described earlier, the target machine has a 5-stage pipelined architecture: instruction fetch, instruction decode, execution, memory and write back stages. Therefore, the total power consumption was split into the consumption of each of the stages, and the savings were computed for each of the stages individually. Table 4.6 shows the breakdown of the total power into its components for the Binary search benchmark. It can be observed that the execution stage of the pipeline consumes the highest amount of energy. Tables 4.7, 4.8 and 4.9 summarize the power savings breakdown for the Quick sort, Matrix multiplication and Heap sort benchmarks, respectively. The results show that our algorithm could save power in all the stages of the pipeline. This can be explained as follows: in every iteration, the algorithm schedules an instruction to a time-step such that this assignment has the least inter-instruction power cost with its neighbors (both the predecessor and the successor instructions), considering all the possible combinations. This reduces the switching activity over all the stages of the 5-stage pipeline. It can also be observed that FD-ISLP results in maximum power savings for the execution stage.

Table 4.6. Power Consumption Breakdown for Binary Search Benchmark

Input Size   Pipeline Stages
             IF (%)   ID (%)   Execution (%)   Mem (%)   WB (%)
100           3.68    14.59        40.97        10.20    30.56
500           3.62    14.47        41.50        10.01    30.39
1000          3.38    15.71        42.18         9.26    29.48
5000          3.44    15.73        41.73         9.42    29.69

Table 4.7. Power Savings Breakdown for Quick Sort

Input Size   Pipeline Stages
             IF (%)   ID (%)   Execution (%)   Mem (%)   WB (%)
100          20.89    29.37        32.54        28.02    30.03
500          21.45    30.04        32.86        27.59    30.52
1000         21.26    29.85        32.73        28.61    30.46
5000         21.79    30.69        33.71        28.86    31.40
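The explanation given for the savings across all pipeline stages (an instruction is placed where its inter-instruction power cost with its predecessor and successor is smallest) can be illustrated with a small Python sketch. This is not the FD-ISLP algorithm: it ignores data dependencies and simply brute-forces the order of a few mutually independent instructions. The table `PDT`, its opcodes and its cost values are hypothetical and chosen only for illustration.

```python
from itertools import permutations

# Hypothetical power dissipation table (PDT): cost of issuing opcode `b`
# immediately after opcode `a`. Illustrative numbers, not measured values.
PDT = {
    ("add", "add"): 1.0, ("add", "lw"): 3.0, ("add", "mul"): 2.0,
    ("lw", "add"): 3.0, ("lw", "lw"): 1.5, ("lw", "mul"): 4.0,
    ("mul", "add"): 2.0, ("mul", "lw"): 4.0, ("mul", "mul"): 1.2,
}


def sequence_cost(seq):
    """Sum of inter-instruction costs over all adjacent pairs in `seq`."""
    return sum(PDT[(a, b)] for a, b in zip(seq, seq[1:]))


def best_order(independent):
    """Cheapest ordering of instructions assumed to have no dependencies.

    Exhaustive search; only sensible for a handful of instructions.
    """
    return min(permutations(independent), key=sequence_cost)


if __name__ == "__main__":
    original = ["add", "lw", "add", "lw"]
    reordered = best_order(original)
    print(original, sequence_cost(original))
    print(list(reordered), sequence_cost(reordered))
```

With these illustrative costs, alternating `add` and `lw` pays the expensive transition three times, while grouping like opcodes pays it once, which is the kind of switching-activity reduction the paragraph describes.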


Table 4.8. Power Savings Breakdown for Matrix Multiplication

Input Size   Pipeline Stages
             IF (%)   ID (%)   Execution (%)   Mem (%)   WB (%)
100           2.82     4.01        16.38         8.61    11.95
500           3.06     3.94        16.30         9.15    11.45
1000          3.22     4.56        17.41         8.83    12.26
5000          3.31     4.43        17.68         9.12    12.57

Table 4.9. Power Savings Breakdown for Heap Sort

Input Size   Pipeline Stages
             IF (%)   ID (%)   Execution (%)   Mem (%)   WB (%)
100           9.16     7.87         8.61         8.57     7.98
500           9.02     7.93         8.50         8.43     7.67
1000          9.54     8.12         8.86         8.70     7.92
5000          9.36     8.03         8.63         8.59     7.69

Finally, we present the comparison of our results with the most recent work in instruction scheduling, by Choi et al. [49]. Power estimation was done using SimplePower, following the same procedure as described earlier. The results of the comparison are summarized in Table 4.10. It can be observed that our algorithm performs better than the other technique for all the benchmarks under consideration. The average percentage improvement in the power savings is noted to be 20.24%. A comparison of the savings in execution time of the two techniques is presented in Table 4.11. It can be observed that the FD-ISLP schedule takes fewer CPU cycles than the schedule generated by Choi et al. The results indicate that the average percentage improvement in the performance speed-up is 26.89%.

Table 4.10. Power Savings Comparison

No.  Benchmark     Power Consumption (nF), scheduled/unscheduled   Percentage
                   Choi et al.           FD-ISLP                   Improvement
1.   Bubble sort   11627.65/11948.92     11990.49/12370.26         14.55%
2.   FIR filter    91.89/100.44          84.63/95.73               36.03%
3.   Heap sort     3373.01/3615.12       3302.92/3599.52           24.96%
4.   Quick sort    1919.85/2711.44       1445.91/2078.56            5.44%
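The two averages quoted above are consistent with plain arithmetic means of the "Percentage Improvement" columns of Tables 4.10 and 4.11. A short Python check, assuming that is how they were obtained (the variable names are illustrative):

```python
from statistics import mean

# "Percentage Improvement" columns as printed in Tables 4.10 and 4.11.
power_improvement = [14.55, 36.03, 24.96, 5.44]   # Table 4.10
time_improvement = [3.47, 61.48, 18.59, 24.00]    # Table 4.11

avg_power = mean(power_improvement)
avg_time = mean(time_improvement)

if __name__ == "__main__":
    print(f"average power improvement: {avg_power:.3f}%")
    print(f"average time improvement:  {avg_time:.3f}%")
```

Both means fall within 0.01 of the quoted figures of 20.24% and 26.89%.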


Table 4.11. Savings in Running Time

No.  Benchmark     Execution Time (cycles), scheduled/unscheduled   Percentage
                   Choi et al.        FD-ISLP                       Improvement
1.   Bubble sort   360675/368100      383230/391398                  3.47%
2.   FIR filter    2790/2900          2875/3062                     61.48%
3.   Heap sort     96329/101136       99170/105380                  18.59%
4.   Quick sort    611170/85502       44754/67543                   24.00%

4.4 Summary

In this chapter, we have discussed in detail the experimental procedure followed to validate the effectiveness of our algorithm. We have briefly described the simulation tool SimplePower, with which we performed all the power simulations. We then presented the results and the power savings compared to the original code as well as to the work of Choi et al. From the results, it can be concluded that the problem of instruction scheduling for low power can be successfully formulated using the Force-Directed approach.


CHAPTER 5
CONCLUSIONS

Power optimization has become a desirable feature for computer systems for several reasons, such as the high cooling and packaging costs and the reduced reliability associated with high power consumption. The ability of software to direct much of the activity of the hardware has led to the effort to optimize software power. We have presented a novel instruction scheduling technique, FD-ISLP, for software power optimization. Our approach to instruction scheduling is based on the Force-Directed approach. We have modeled the power consumption as a force to be optimized by relating the spring constant to the inter-instruction power cost and the displacement to the change in instruction probability. FD-ISLP saves software power without causing any performance overhead. The results show that our technique offers an average of 12.68% reduction in power consumption over the original code for the given benchmark suite.

In the experiments that we performed, we restricted the benchmark programs to those using only the integer instruction subset of the SimpleScalar instruction set architecture, listed in Table 4.1. Further experiments need to be carried out for the entire instruction set architecture, including the floating point instructions, which requires the power characterization of the other instructions. Further, the time complexity of the original FDS algorithm proposed in [29] is n [remainder of expression illegible]. Recently, a few works towards improving the time complexity of the FDS algorithm have been reported in [46], [47] and [48]. These works need to be investigated, and a better implementation of FD-ISLP in terms of run-time needs to be explored. FD-ISLP is an intra-basic block scheduling algorithm. The effect of inter-basic block scheduling using force-directed scheduling can also be explored in the future.
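The force analogy summarized above (f = k * x, with k the inter-instruction power cost and x the change in instruction probability) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the FD-ISLP implementation: it assumes a uniform probability over an instruction's time frame [asap, alap], as in classical force-directed scheduling, takes the per-time-step cost `k` as a given input, and omits predecessor and successor forces. The function names and the example costs are hypothetical.

```python
def probabilities(asap, alap):
    """Uniform probability of the instruction occupying each step of its time frame."""
    width = alap - asap + 1
    return {step: 1.0 / width for step in range(asap, alap + 1)}


def self_force(asap, alap, target, k):
    """Force of tentatively fixing the instruction at `target`.

    Computed as the sum over time steps of k[step] * x[step], where x[step]
    is the change in probability: (1 - p) at `target`, (-p) elsewhere.
    `k` maps each time step to an assumed inter-instruction power cost.
    """
    force = 0.0
    for step, prob in probabilities(asap, alap).items():
        x = (1.0 if step == target else 0.0) - prob
        force += k[step] * x
    return force


def least_force_step(asap, alap, k):
    """Time step in [asap, alap] with the smallest force."""
    return min(range(asap, alap + 1), key=lambda t: self_force(asap, alap, t, k))


if __name__ == "__main__":
    k = {1: 4.0, 2: 1.0, 3: 2.5}  # hypothetical per-step power costs
    for t in (1, 2, 3):
        print(t, round(self_force(1, 3, t, k), 3))
    print("chosen step:", least_force_step(1, 3, k))
```

In this example the instruction is drawn to the step with the lowest cost, because fixing it there raises the probability where k is small and lowers it where k is large.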


REFERENCES

[1] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. Addison Wesley, 1988.
[2] R. Bajwa, M. Hiraki, and H. Kojima. Instruction buffering to reduce power in processors for signal processing. IEEE Transactions on VLSI Systems, pages 417–424, Dec. 1997.
[3] N. Binh, M. Imai, A. Shiomi, and N. Hikichi. An instruction set optimization algorithm for pipelined ASIPs. IEICE Transactions on Fundamentals of Electronics, Communications, and Computer Sciences, pages 1707–1714, 1995.
[4] D. Burger and T. Austin. Technical report: SimpleScalar Tool Set. University of Wisconsin-Madison Computer Science Department, 1997.
[5] C. Chakrabarti and D. Gaitonde. Instruction level power model of microcontrollers. In Proceedings of the International Symposium on Circuits and Systems, pages 76–79, 1999.
[6] A. Chandrakasan, V. Gutnik, and T. Xanthopoulos. Data driven signal processing: An approach for energy efficient computing. In Proceedings of ISLPED, pages 347–250, Aug. 1996.
[7] T. Chou and K. Roy. Accurate estimation of power dissipation in CMOS sequential circuits. IEEE Transactions on Very Large Scale Integration (VLSI), 1996.
[8] V. Dalal and C. Ravikumar. Software power optimization in embedded systems. In Fourteenth International Conference on VLSI Design, pages 254–259, 2001.
[9] B. Dave, G. Lakshminarayan, and N. Jha. Cosyn: Hardware-software cosynthesis for heterogeneous distributed embedded systems. IEEE Transactions on VLSI Systems, 7, 1999.
[10] S. Gunther, F. Binns, D. Carmean, and J. Hall. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, First Quarter 2001.
[11] S. Gupta. Force-directed scheduling for dynamic power optimization. Master's thesis, University of South Florida, 2003.
[12] S. Gupta and S. Katkoori. Force-directed scheduling for dynamic power optimization. In Proceedings of IEEE Computer Society Annual Symposium on VLSI, pages 68–73, April 2003.
[13] A. Hasegawa, I. Kawasaki, K. Yamada, S. Yoshioka, and P. Biswas. SH3: High code density, low power. IEEE Micro, pages 11–19, 1995.
[14] I. Hong, M. Potkonjak, and R. Karri. Power optimization using divide-and-conquer techniques for minimization of the number of operations. ACM Transactions on Design Automation of Electronic Systems, 4(4):405–429, 1999.


[15] I. Huang and A. Despain. Synthesis of instruction sets for pipelined microprocessors. In Proceedings of Design Automation Conference, pages 5–11, 1994.
[16] Intel Corporation. i486 Microprocessor, Hardware Reference Manual, 1990.
[17] V. Jain, S. Rele, S. Pande, and J. Ramanujam. Code restructuring for improving real time response through code speed, size trade-offs on limited memory embedded DSPs. In Languages and Compilers for Parallel Computing, pages 459–463, 1999.
[18] D. Kirowski and M. Potkonjak. System-level synthesis of low-power hard real-time systems. In Design Automation Conference, pages 697–702, 1997.
[19] C. Kulkarni, F. Katthoor, and H. D. Man. Code transformations for low power caching in embedded multimedia processors. In International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, pages 23–26, March 1998.
[20] M. Lee. Power analysis and low-power scheduling techniques for embedded DSP software. Fujitsu Scientific and Technical Journal, 31:215–229, 1995.
[21] H. Mehta, R. Owens, M. Irvin, R. Chen, and D. Ghosh. Techniques for low energy software. In Proceedings of International Symposium on Low Power Electronics and Design, pages 72–75, Aug. 1997.
[22] H. Mehta, R. M. Owens, and M. J. Irwin. Instruction level power profiling. In ICASSP, pages 3326–3329, 1996.
[23] G. Micheli. Synthesis and Optimization of Digital Circuits. McGraw Hill International, 1994.
[24] H. Mizuno, H. Kobayashi, T. Onoye, and I. Shirakawa. Power estimation at architecture level for embedded systems. In IEEE International Symposium on Circuits and Systems, May 2002.
[25] J. Monteiro, S. Devadas, and A. Ghosh. Retiming sequential circuits for low power. In Proc. of the International Conference on Computer-Aided Design, pages 398–402, November 1993.
[26] F. N. Najm. A survey of power estimation techniques in VLSI circuits. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1994.
[27] L. Nielsen, C. Niessen, J. Sparso, and K. V. Berkel. Low power operation using self-timed circuits and adaptive scaling of the supply voltage. IEEE Trans. on VLSI Systems, pages 391–397, Dec. 1994.
[28] A. Parikh, M. Kandemir, N. Vijaykrishnan, and M. Irwin. Instruction scheduling based on energy and performance constraints. In IEEE Computer Society Annual Workshop on VLSI, 2000.
[29] P. Paulin and J. Knight. Force-directed scheduling for behavioral synthesis of ASICs. IEEE Transactions on Computer Aided Design, 8(6):661–679, 1989.
[30] S. Prasad and K. Roy. Circuit optimization for minimization of power consumption under delay constraints. In Proc. of the International Workshop on Low Power Design, pages 15–20, April 1994.


[31] K. Roy and M. Johnson. Software design for low power. NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, NATO ASI Series, chapter , August 1996.
[32] J. Sato, A. Alomary, Y. Honma, T. Nakata, A. Shiomi, N. Hikichi, and M. Imai. A hardware/software codesign system for ASIP development. IEICE Transactions on Fundamentals of Electronics, Communications, and Computer Sciences, pages 483–490, 1994.
[33] T. Sato, Y. Ootaguro, M. Nagamatsu, and H. Tago. Evaluation of architecture level power estimation for CMOS RISC processors. In Proc. of the Symposium on Low Power Electronics, pages 44–45, 1995.
[34] T. Simunic, G. D. Micheli, L. Benini, and M. Hans. Source code optimization and profiling of energy consumption in embedded systems. In ISSS, pages 193–199, 2000.
[35] C. Small. Shrinking devices put the squeeze on system packaging. EDN 39, 4:41–46, Feb. 1994.
[36] C.-L. Su, C.-Y. Tsui, and A. M. Despain. Low power architecture design and compilation techniques for high-performance processors. Technical Report ACAL-TR-94-01, University of Southern California, ACAL, February 1994.
[37] V. Tiwari, T. Lee, and D. Maheshwari. Power analysis of the SPARClite MB86934. Technical Report FLA-CAD-94-01, Fujitsu Labs of America, Aug. 1994.
[38] V. Tiwari, S. Malik, and A. Wolfe. Compilation techniques for low energy: An overview. In Proceedings of Symposium Low-Power Electronics, 1994.
[39] V. Tiwari, S. Malik, and A. Wolfe. Power analysis of embedded software: a first step towards software power minimization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2(4):437–445, 1994.
[40] V. Tiwari, S. Malik, and A. Wolfe. Power analysis of Intel 486DX2. Technical Report CEM94-5, Princeton University, Dept. of Electrical Eng., June 1994.
[41] V. Tiwari, S. Malik, A. Wolfe, and M. Lee. Instruction level power analysis and optimization of software. Journal of VLSI Signal Processing, pages 1–18, 1996.
[42] M. Toburen, T. Conte, and M. Reilly. Technical report: Instruction scheduling for low power dissipation in high performance microprocessors. North Carolina State University, May 1998.
[43] H. Tomiyama, T. Ishihara, A. Inoue, and H. Yasuura. Instruction scheduling for power reduction in processor-based system design. In Proceedings of Design, Automation and Test in Europe, pages 855–860, February 1998.
[44] D. Trifone, D. Sarta, and G. Ascia. A data dependent approach to instruction level power estimation. In Proceedings of IEEE Alessandro Volta Memorial Workshop on Low Power Design, pages 182–190, March 1999.
[45] W. tsong Shuie. Retargetable compilation for low power hardware/software codesign. In Proceedings of Ninth International Symposium on CODES, pages 254–259, 2001.


[46] W. Verhaegh, P. Lippens, E. Aarts, and J. Korst. Improved force-directed scheduling. In Proceedings of the European Conference on Design Automation (EDAC), pages 430–435, Feb. 1991.
[47] W. Verhaegh, P. Lippens, E. Aarts, J. Korst, J. van Meerbergen, and A. van der Werf. Efficiency improvements for force-directed scheduling. In International Conference on Computer-Aided Design, Digest of Technical Papers, IEEE/ACM, pages 286–291, 1992.
[48] W. Verhaegh, P. Lippens, E. Aarts, J. Korst, J. van Meerbergen, and A. van der Werf. Improved force-directed scheduling in high-throughput digital signal processing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 14:945–960, Aug. 1995.
[49] K. won Choi and A. Chatterjee. Efficient instruction-level optimization methodology for low-power embedded systems. In International Symposium on Systems Synthesis, 2001.
[50] W. Ye, N. Vijaykrishnan, M. T. Kandemir, and M. J. Irwin. The design and use of SimplePower: a cycle-accurate energy estimation tool. In Design Automation Conference, pages 340–345, 2000.


Author: Dongale, Prashant.
Title: Force-directed instruction scheduling for low power [electronic resource] / by Prashant Jayawant Dongale.
Publication: [Tampa, Fla.] : University of South Florida, 2003.
Thesis (M.S.C.S.)--University of South Florida, 2003.
Includes bibliographical references.
Text (Electronic thesis) in PDF format.
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
Title from PDF of title page.
Document formatted into pages; contains 56 pages.
Adviser: Ranganathan, N.
Keywords: optimization; estimation.
Subject: Dissertations, Academic -- Computer Science -- Masters -- USF.
Series: USF Electronic Theses and Dissertations.
URL: http://digital.lib.usf.edu/?e14.210