USF Libraries
USF Digital Collections

Architecture and compiler support for leakage reduction using power gating in microprocessors

Material Information

Title:
Architecture and compiler support for leakage reduction using power gating in microprocessors
Physical Description:
Book
Language:
English
Creator:
Roy, Soumyaroop
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:
2010

Subjects

Subjects / Keywords:
Compiler Directed Power Gating
Microarchitectural Techniques
Embedded Microprocessors
Multithreading
Multiprocessing
Multicore
Niagara
CGMT
FGMT
SMT
GCC
SUIF
MachineSUIF
M5
Dissertations, Academic -- Computer Science & Engineering -- Doctoral -- USF (lcsh)
Genre:
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: Power gating is a technique commonly used for runtime leakage reduction in digital CMOS circuits. In microprocessors, power gating can be implemented by using sleep transistors to selectively deactivate circuit modules when they are idle during program execution. In this dissertation, a framework for power gating arithmetic functional units in embedded microprocessors with architecture and compiler support is proposed. During compile time, program regions are identified where one or more functional units are idle and sleep instructions are inserted into the code so that those units can be put to sleep during program execution. Subsequently, when their need is detected during the instruction decode stage, they are woken up with the help of hardware control signals. For a set of benchmarks from the MiBench suite, leakage energy savings of 27% and 31% are achieved (based on a 70 nm PTM model) in the functional units of a processor, modeled on the ARM architecture, with and without floating point units, respectively. Further, the impact of traditional performance-enhancing compiler optimizations on the amount of leakage savings obtained with this framework is studied through analysis and simulations. Based on the observations, a leakage-aware compilation flow is derived that improves the effectiveness of this framework. It is observed that, through the use of various compiler optimizations, an additional savings of around 15% and even up to 9X leakage energy savings in individual functional units is possible. Finally, in the context of multi-core processors supporting multithreading, three different microarchitectural techniques, for different multithreading schemes, are investigated for state-retentive power gating of register files. In an in-order core, when a thread gets blocked due to a memory stall, the corresponding register file can be placed in a low leakage state. 
When the memory stall gets resolved, the register file is activated so that it may be accessed again. The overhead due to wake-up latency is completely hidden in two of the schemes, while it is hidden for the most part in the third. Experimental results on multiprogrammed workloads comprised of SPEC 2000 integer benchmarks show that, in an 8-core processor executing 64 threads, the average leakage savings in the register files, modeled in FreePDK 45 nm MTCMOS technology, are 42% in coarse-grained multithreading, while they are between 7% and 8% in fine-grained and simultaneous multithreading. The contributions of this dissertation represent a significant advancement in the quest for reducing leakage energy consumption in microprocessors with minimal degradation in performance.
Thesis:
Dissertation (PHD)--University of South Florida, 2010.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by Soumyaroop Roy.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains X pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E14-SFE0004634
usfldc handle - e14.4634
System ID:
SFS0027949:00001


Full Text
Local Note:
Advisor: Nagarajan Ranganathan, Ph.D.
Type of File:
Text (Electronic thesis) in PDF format.
Host Item:
USF Electronic Theses and Dissertations.
Electronic Location:
http://digital.lib.usf.edu/?e14.4634



Architecture and Compiler Support for Leakage Reduction Using Power Gating in Microprocessors

by

Soumyaroop Roy

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Co-Major Professor: Nagarajan Ranganathan, Ph.D.
Co-Major Professor: Srinivas Katkoori, Ph.D.
Hao Zheng, Ph.D.
Sanjukta Bhanja, Ph.D.
Natasha Jonoska, Ph.D.

Date of Approval:
June 8, 2010

Keywords: Compiler Directed Power Gating, Microarchitectural Techniques, Embedded Microprocessors, Multithreading, Multiprocessing, Multicore, Niagara, CGMT, FGMT, SMT, GCC, SUIF, MachineSUIF, M5

Copyright 2010, Soumyaroop Roy

DEDICATION

To my family:
Parents, Alpana and Snehansu, and sister, Shreya

ACKNOWLEDGEMENTS

I would like to take this opportunity to thank my doctoral advisors Dr. Nagarajan Ranganathan and Dr. Srinivas Katkoori for their valuable guidance and support throughout my graduate school life. I would particularly like to thank them for their patience and belief in me and my work. Their knowledge, work ethics, and the professional standards they adhere to have been very instrumental in inspiring me to put my heart and mind into solving the problems that are presented in this dissertation. I would also like to thank Dr. Hao Zheng, Dr. Sanjukta Bhanja, and Dr. Natasha Jonoska for taking the time to be in my doctoral committee and providing valuable suggestions to improve this manuscript. I am extremely grateful to the CSE department for providing me financial assistance throughout all my years. I would like to thank Dr. Rangachar Kasturi, Dr. Larry Hall, Dr. Ken Christensen, Dr. Sudeep Sarkar, Dr. Dmitry Goldgof, Dr. Rafael Perez, Yvette Blanchard, Alex Dashner, Theresa Collins, Catherine Burton, and others for all their help and support and making my stay at USF a remarkable and unforgettable experience. I would also like to thank my past and present colleagues at Computer Architecture and Nano VLSI Systems (CANS) Research Group, Himanshu, Koustav, Upavan, Ransford, Mahalingam, Ziad, Michael, Saurabh, Yue, Vidyasangkar, Elizabeth, Haiqiong, Sameer, Pradeep, Hari, Vyas, Jared, Narendra, and others for a fun and intellectually stimulating atmosphere in the labs. I would also like to thank other colleagues in the CSE department and friends, Sergiy, Matt, Kurt, Olya, Satrajit, Ravi, Anand, Vladimir, Pedro, Miguel, Daladier, Puneet, Noelia, Jessica, Jason, Erin, Tatiana, Vaso, Neil, Sara, Keri, and others. I am also very grateful to the efficient CSE tech support team, Daniel, Peter, Sridhar, and Brian and the engineering computing team for their help with IT related problems. I would also like to thank Korey Sewell of M5, Glenn Holloway of MachineSUIF, and members of the processor architecture laboratory, EPFL.

Finally, I would like to thank my mother, Alpana, father, Snehansu, and sister, Shreya, for playing an extremely important part in making me the person that I am today. They provided me their unconditional love and support in my decision to travel several thousand miles westward from home for my doctoral studies and throughout my studies. I am also very thankful to my cousin, Saurabh, for being a beacon of brilliance since we were very young.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1 INTRODUCTION
  1.1 Power and Energy Concern in Computing Systems
  1.2 Power Consumption in Digital CMOS Circuits
  1.3 Motivation for this Dissertation
  1.4 Contributions of this Dissertation
  1.5 Outline of Dissertation

CHAPTER 2 BACKGROUND AND RELATED WORK
  2.1 Subthreshold Leakage Reduction Techniques
  2.2 Power Gating
    2.2.1 Performance Aspects of Power Gating
  2.3 Pipelined Microprocessors
    2.3.1 Pipelining Basics
    2.3.2 Classification of Instruction Types
    2.3.3 In-Order Processors
  2.4 Hardware Multithreading
  2.5 Related Work
    2.5.1 Context and Significance of this Dissertation

CHAPTER 3 A FRAMEWORK FOR POWER GATING FUNCTIONAL UNITS IN EMBEDDED MICROPROCESSORS
  3.1 Power Gating in Microprocessors
  3.2 Proposed Framework for Leakage Reduction Using Power-Gating
    3.2.1 Hardware Component
      3.2.1.1 Embedded Processor Architecture with Power Gating Support
      3.2.1.2 Design of Power-Gated Functional Units
    3.2.2 Software Component
      3.2.2.1 Identifying Potential Power-Gating Regions in the CFG
      3.2.2.2 Subgraphs Enclosed Within Loops
      3.2.2.3 Insertion of Sleep Instructions
      3.2.2.4 Time Complexity
  3.3 Experimental Setup and Results
    3.3.1 Energy Component Calculations
    3.3.2 Cycle-Accurate Simulation
  3.4 Discussion

CHAPTER 4 IMPACT OF COMPILER OPTIMIZATIONS ON POWER GATING
  4.1 Motivation
  4.2 Impact of Compiler Optimizations on Power Gating
    4.2.1 Intraprocedural Optimizations
      4.2.1.1 Dominator Optimizations
      4.2.1.2 Loop Optimizations
      4.2.1.3 Machine Dependent Optimizations
    4.2.2 Interprocedural Optimizations
  4.3 Compiler-Directed Power Gating
    4.3.1 Architecture Support for Power Gating
    4.3.2 Insertion of Sleep Instructions
    4.3.3 Policies for Handling C Standard Library Routines
    4.3.4 Proposed Leakage Aware Compilation Flow
  4.4 Experimental Setup and Results
    4.4.1 Optimization Configurations
    4.4.2 Results
      4.4.2.1 Susan Benchmarks
      4.4.2.2 Epwic Benchmarks
      4.4.2.3 Mpeg2 Benchmarks
    4.4.3 Impact of Policies for Handling Standard Library Routines
  4.5 Conclusions

CHAPTER 5 STATE-RETENTIVE POWER GATING OF REGISTER FILES IN MULTICORES
  5.1 Motivation
  5.2 Register File Power Gating in CGMT Processors
    5.2.1 Power Gating Control During Fetch Miss
    5.2.2 Power Gating Control During Load Misses
  5.3 Register File Power Gating in FGMT Processors
    5.3.1 Power Gating Control During Fetch Miss
    5.3.2 Power Gating Control During Load Miss
  5.4 Register File Power Gating in SMT Processors
  5.5 Summary of the Proposed Techniques
  5.6 Experimental Setup and Results
    5.6.1 Integer Register File Characterization
    5.6.2 Processor Configurations and Workload Details
    5.6.3 Results
  5.7 Discussion

CHAPTER 6 CONCLUSIONS AND FUTURE DIRECTIONS

REFERENCES
LIST OF PUBLICATIONS
ABOUT THE AUTHOR

LIST OF TABLES

Table 2.1 Specification of ALU instruction type
Table 2.2 Specification of memory or load/store instruction type
Table 2.3 Specification of branch instruction type
Table 3.1 Latencies of functional units
Table 3.2 Average energy components of functional units
Table 3.3 ARM processor configuration
Table 3.4 Benchmark details
Table 4.1 Description of the legends in Figure 4.1
Table 4.2 Procedure inlining parameters
Table 4.3 Benchmark details
Table 4.4 Optimization configurations
Table 4.5 Description of metrics
Table 5.1 A summary of the proposed techniques
Table 5.2 Register file leakage states
Table 5.3 SPEC 2000 integer benchmarks
Table 5.4 Multi-core processor parameters
Table 5.5 Memory access latencies
Table 5.6 L1 D-cache and I-cache parameters
Table 5.7 L2 cache size (in MB)
Table 5.8 L2 cache set associativity
Table 5.9 L2 cache MSHR count

LIST OF FIGURES

Figure 1.1 Components of power consumption in digital CMOS circuits
Figure 1.2 Contributions of this dissertation
Figure 2.1 Subthreshold leakage reduction approaches
Figure 2.2 Power gating options
Figure 2.3 The GENERIC (GNR) pipeline
Figure 2.4 Scalar vs. superscalar pipeline
Figure 2.5 Multithreading approaches
Figure 2.6 Taxonomy of works on leakage reduction in microprocessors
Figure 3.1 Proposed framework for power gating
Figure 3.2 Modified ARM architecture
Figure 3.3 Example of activation and deactivation of functional units
Figure 3.4 A possible format of the sleep instruction
Figure 3.5 A sample piece of code and its CFG
Figure 3.6 LHT for CFG in Figure 3.5
Figure 3.7 Superimposed voltage and power graphs as functions of time
Figure 3.8 Total leakage energy savings in integer benchmarks
Figure 3.9 Total leakage energy savings in floating point benchmarks
Figure 3.10 Fraction of total simulation cycles for which the integer units were power gated
Figure 3.11 Fraction of total simulation cycles for which the floating point units were power gated
Figure 4.1 Impact of a few compiler optimizations on power gating
Figure 4.2 Example to illustrate the impact of global common subexpression elimination on the usage of functional units
Figure 4.3 Example to illustrate the impact of partial redundancy elimination on the usage of functional units
Figure 4.4 Examples of loop optimizations that improve the opportunities for power gating
Figure 4.5 Examples of machine dependent optimizations in ARM that can improve opportunities for power gating
Figure 4.6 Example illustrating the impact of interprocedural optimizations on power gating opportunities of the integer multiplier
Figure 4.7 Framework for compiler-directed power gating of functional units with code optimizations
Figure 4.8 GCC Compiler Pipeline
Figure 4.9 Gate level schematic of the Sleep Control Register (SCR)
Figure 4.10 Assembly and machine code formats of the sleep instruction
Figure 4.11 Components of a loop
Figure 4.12 Proposed compilation flow to generate leakage optimized code with compiler-directed power gating
Figure 4.13 Leakage savings and sleep overhead for SusanE
Figure 4.14 Leakage savings and sleep overhead for SusanC
Figure 4.15 Impact of weak strength reduction on leakage savings of the integer multiplier for SusanC
Figure 4.16 Leakage savings and sleep overhead for Epwic
Figure 4.17 Leakage savings and sleep overhead for Mpeg2Encode
Figure 4.18 Impact of various policies for handling the standard library routines during insertion of sleep instructions
Figure 5.1 Schematic view of the proposed approach for power gating register files in in-order cores that support multithreading
Figure 5.2 Intermediate strength power gating applied during a cache miss
Figure 5.3 Timing details for putting register files to sleep following an instruction fetch miss
Figure 5.4 Wake-up details of T1's register file if the pipeline is busy when its fetch miss completes
Figure 5.5 Wake-up details of T1's register file if the pipeline is idle after its fetch miss completes
Figure 5.6 Timing details for putting a register file to sleep following a data load miss
Figure 5.7 Timing details for waking up T1's register file from sleep after its load miss completes and it gets ready to run
Figure 5.8 Timing details for putting a thread's register file in and out of low leakage state following a fetch miss in FGMT
Figure 5.9 Timing details for putting a thread's register file in and out of low leakage state following a data miss in FGMT
Figure 5.10 Schematic view of a pipeline organization to support SMT in in-order cores
Figure 5.11 A pathological case during a fetch miss in an FGMT core
Figure 5.12 Average instructions per cycle (IPC) count for CGMT approach
Figure 5.13 Average instructions per cycle (IPC) count for FGMT approach
Figure 5.14 Average instructions per cycle (IPC) count for SMT approach
Figure 5.15 Average IRF leakage energy savings for CGMT cores
Figure 5.16 Average IRF leakage energy savings for FGMT cores
Figure 5.17 Data read miss latency per thread for FGMT cores
Figure 5.18 Instruction fetch miss latency per thread for FGMT cores
Figure 5.19 L2 read miss latency per thread for FGMT cores
Figure 5.20 Average IRF leakage energy savings for SMT cores

Architecture and Compiler Support for Leakage Reduction Using Power Gating in Microprocessors

Soumyaroop Roy

ABSTRACT

Power gating is a technique commonly used for runtime leakage reduction in digital CMOS circuits. In microprocessors, power gating can be implemented by using sleep transistors to selectively deactivate circuit modules when they are idle during program execution. In this dissertation, a framework for power gating arithmetic functional units in embedded microprocessors with architecture and compiler support is proposed. During compile time, program regions are identified where one or more functional units are idle and sleep instructions are inserted into the code so that those units can be put to sleep during program execution. Subsequently, when their need is detected during the instruction decode stage, they are woken up with the help of hardware control signals. For a set of benchmarks from the MiBench suite, leakage energy savings of 27% and 31% are achieved (based on a 70 nm PTM model) in the functional units of a processor, modeled on the ARM architecture, with and without floating point units, respectively. Further, the impact of traditional performance-enhancing compiler optimizations on the amount of leakage savings obtained with this framework is studied through analysis and simulations. Based on the observations, a leakage-aware compilation flow is derived that improves the effectiveness of this framework. It is observed that, through the use of various compiler optimizations, an additional savings of around 15% and even up to 9X leakage energy savings in individual functional units is possible. Finally, in the context of multi-core processors supporting multithreading, three different microarchitectural techniques, for different multithreading schemes, are investigated for state-retentive power gating of register files. In an in-order core, when a thread gets blocked due to a memory stall, the corresponding register file can be placed in a low leakage state. When the memory stall gets resolved, the register file is activated so that it may be accessed again. The overhead due to wake-up latency is completely hidden in two of the schemes, while it is hidden for the most part in the third. Experimental results on multiprogrammed workloads comprised of SPEC 2000 integer benchmarks show that, in an 8-core processor executing 64 threads, the average leakage savings in the register files, modeled in FreePDK 45 nm MTCMOS technology, are 42% in coarse-grained multithreading, while they are between 7% and 8% in fine-grained and simultaneous multithreading. The contributions of this dissertation represent a significant advancement in the quest for reducing leakage energy consumption in microprocessors with minimal degradation in performance.
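
The compile-time step summarized in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's CFG-based algorithm: it assumes a straight-line instruction sequence, a hypothetical `BREAK_EVEN` threshold (in instructions) below which gating is assumed not to pay off, and made-up opcode and unit names. It scans for idle stretches of each functional unit and places a `sleep` marker at the start of every stretch that is long enough. Wake-up is left to the hardware at decode, so no wake instruction is emitted.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from opcode to the functional unit it needs.
UNIT_OF: Dict[str, str] = {"add": "alu", "sub": "alu", "mul": "mult", "fadd": "fpu"}

# Assumed minimum idle length (in instructions) for gating to pay off.
BREAK_EVEN = 3


def idle_regions(code: List[str], unit: str) -> List[Tuple[int, int]]:
    """Return (start, length) of maximal stretches where `unit` is unused."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, op in enumerate(code):
        if UNIT_OF.get(op) == unit:
            # The unit is needed here, so any open idle stretch ends.
            if start is not None:
                regions.append((start, i - start))
                start = None
        elif start is None:
            start = i
    if start is not None:
        regions.append((start, len(code) - start))
    return regions


def insert_sleeps(code: List[str], units: List[str]) -> List[str]:
    """Insert 'sleep <unit>' before each idle stretch of at least BREAK_EVEN."""
    sleeps_at: Dict[int, List[str]] = {}
    for unit in units:
        for start, length in idle_regions(code, unit):
            if length >= BREAK_EVEN:
                sleeps_at.setdefault(start, []).append(f"sleep {unit}")
    out: List[str] = []
    for i, op in enumerate(code):
        out.extend(sleeps_at.get(i, []))
        out.append(op)
    return out


program = ["mul", "add", "add", "sub", "add", "mul", "add"]
annotated = insert_sleeps(program, ["mult", "alu"])
print(annotated)
```

In this toy program only the multiplier has an idle stretch that reaches the assumed threshold, so a single sleep marker is placed right after the first `mul`.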

CHAPTER 1
INTRODUCTION

Advances in integrated circuit (IC) technology have helped the semiconductor industry to keep pace with Moore's law for over five decades. Due to technology scaling, the minimum feature size has continued to shrink while the chip density as well as the transistor performance have continued to improve. This scaling trend has multiplied the complexity of VLSI circuits which, in turn, has increased the importance of power considerations in chip design. In high-performance microprocessors, power density limits have restricted the upward scaling of clock frequencies to achieve greater performance. In embedded microprocessors, high power consumption impacts the engineering feasibility of battery-powered portable devices as well as their reliability. Designers have to make a tradeoff between the size of the battery packs and the operating life of the devices. These issues have forced the designers to pursue low-power design methodologies in an aggressive manner.

1.1 Power and Energy Concern in Computing Systems

The market for embedded computing systems is proliferating at a tremendous rate. It is reported that more than one billion cell phones are sold each year and this market is ever expanding [1]. It is estimated that media devices such as cell phones, video cameras, and digital televisions perform more computations than desktops, laptops, and data centers and at power consumption rates that are orders of magnitude lower than those of the latter computation platforms. Efficient embedded processors and DSPs consume about 250 pJ per operation [2], while laptop processors consume about 20 nJ per operation [3], indicating that embedded processors are about 80X more energy efficient than laptop processors. This is of paramount importance because embedded processors feature in portable and handheld devices and the packaging technology restricts the maximum power dissipation of these devices to about 1 W [1]. Moreover, the battery life of a handheld device is one of its most prominent features that impacts its success as a product in the market. Therefore, low power and energy efficient design techniques are of paramount importance in the design of embedded processors.

In the domain of high-performance microprocessors, until recently, technology scaling was making it possible to build increasingly complex processor architectures with larger on-chip caches operating at higher clock frequencies. In recent years, however, increased concerns about power density and thermal effects have emerged as fundamental barriers that have severely restricted the upward scaling of clock frequencies for further performance improvement [4]. Apart from the benefits offered by technology scaling, advances in architectural design techniques have further improved the performance of microprocessors. Superscalar CPU architectures with multiple functional units were developed so that several instructions could be executed simultaneously within a single clock cycle. Deeper pipelines and dynamic scheduling to allow out-of-order execution of instruction streams within a single thread are employed to exploit instruction-level parallelism in the program. However, several complex hardware units such as branch predictors, issue logic, reorder buffers, etc., are needed to implement out-of-order execution, which in turn requires higher power and die area budgets. It has been reported that, with the same process technology, a new microprocessor design with performance improvement of 1.5x to 1.7x results in 2x to 3x increase in the die area [5] and 2x to 2.5x increase in the power consumption [6]. Thus, power efficiency has become the epicenter of all design efforts from an architectural standpoint as well.

While the CPU performance has been measured in terms of the execution throughput of a single thread, lately, an alternate metric, referred to as throughput performance, has been gaining more prominence. Throughput performance is defined as the number of threads that can complete execution per unit time by utilizing multiple CPU cores to perform more computations in parallel. A survey of commercially available multi-core processors can be found in [7]. As power dissipation continues to be an increasingly difficult challenge, there has been a shift in the paradigm in terms of CPU design. Instead of building a large and complex out-of-order processor, the designers are building multiple simple in-order processors within the same chip area. Each of those simple cores could further support the simultaneous execution of multiple threads resident within the core. Such multi-core systems are commercially available and are applied in high-end servers, gaming platforms, and embedded processors. Niagara [8] and Niagara2 [9] are multi-core general purpose microprocessors from Sun Microsystems used in high end servers that feature up to eight in-order cores. While each core in Niagara is capable of executing four threads, each core in Niagara2 is capable of executing eight threads simultaneously. Intel's Larrabee architecture [10] for visual computing uses in-order CPU cores that support an extended version of the x86 instruction set. Each core supports execution of two hardware threads. The number of CPU cores is implementation-dependent. MIPS 1004K coherent processing system [11] is comprised of 1-4 multi-threaded cores, where each core is capable of executing two hardware threads simultaneously.

1.2 Power Consumption in Digital CMOS Circuits

Power consumption in digital CMOS circuits can be classed into two major categories: dynamic power and static power [12] (Figure 1.1). While dynamic power is due to the activity in the circuit block and the switching frequency, static power is due to the fact that transistors are imperfect switches. The major component of dynamic power is contributed by the charging and discharging of the gate capacitances in the circuit during signal switching. The other component of dynamic power, short-circuit power, is a result of conducting paths between the voltage supply and ground for a brief period during which a logic gate makes a transition. During that period both the pull-down and the pull-up networks are ON. The primary component of static power is due to the subthreshold drain-source leakage, I_subth, which is dependent exponentially on the threshold voltage, V_T. As V_T is reduced linearly, I_subth increases exponentially. The other two components of static power are due to gate leakage currents and junction leakage currents. In contemporary CMOS technologies, subthreshold leakage is of paramount concern to circuit designers.

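As a compact reference for the components named above, the standard first-order textbook expressions can be written as follows. They are not taken from the dissertation. The symbols $\alpha$ (switching activity), $C_L$ (switched capacitance), $f$ (clock frequency), $n$ (subthreshold slope factor), and $v_{th} = kT/q$ (thermal voltage) are assumptions introduced here for illustration only.

```latex
\begin{align}
P_{\text{total}}   &= P_{\text{dynamic}} + P_{\text{static}} \\
P_{\text{dynamic}} &\approx \alpha\, C_L\, V_{DD}^{2}\, f + P_{\text{short-circuit}} \\
P_{\text{static}}  &\approx V_{DD}\,\bigl(I_{subth} + I_{gate} + I_{junction}\bigr) \\
I_{subth}          &\propto \exp\!\left(-\frac{V_T}{n\, v_{th}}\right)
\end{align}
```

The last relation is the one that matters for the rest of the argument: a linear reduction in $V_T$ produces an exponential increase in $I_{subth}$.
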
Traditionally, dynamic power has vastly dominated the overall power consumption of CMOS circuits. However, in sub-90 nm technologies, leakage power has emerged as a significant component of the total power consumed. This is because, in order to keep pace with the scaling trends in technology, the supply voltage, V_DD, was lowered to avoid excessive power density of the chip. This resulted in lowering the threshold voltage, V_T, in order to maintain the circuit performance. This caused the subthreshold leakage current, I_subth, to increase exponentially, thereby increasing the static power drastically. It has been pointed out in [13] that there has been a 3-5× increase in subthreshold and gate leakage currents per generation due to threshold voltage and gate-oxide scaling. Therefore, leakage power has become an important design aspect in low-power CMOS circuits. The main focus of this dissertation is the reduction of the subthreshold leakage component of static or leakage power in microprocessors.

Figure 1.1 Components of power consumption in digital CMOS circuits. Power consumed due to subthreshold leakage dominates the static power dissipation component in the contemporary technologies and is targeted in this dissertation.
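
To give a feel for the exponential dependence discussed in Section 1.2, the following Python sketch evaluates the textbook relation I_subth ∝ exp(−V_T / (n · v_th)) for a few threshold voltages. The slope factor of 1.5, the thermal voltage of 26 mV (roughly room temperature), and the sample V_T values are illustrative assumptions, not figures from the dissertation or from [13].

```python
import math

N_SLOPE = 1.5        # assumed subthreshold slope factor (dimensionless)
V_THERMAL = 0.026    # thermal voltage kT/q at roughly 300 K, in volts


def relative_leakage(v_t: float, v_t_ref: float) -> float:
    """Subthreshold leakage at threshold v_t relative to leakage at v_t_ref."""
    return math.exp((v_t_ref - v_t) / (N_SLOPE * V_THERMAL))


# Lower V_T in equal 50 mV steps and watch leakage grow geometrically.
for v_t in (0.40, 0.35, 0.30, 0.25):
    ratio = relative_leakage(v_t, 0.40)
    print(f"V_T = {v_t:.2f} V -> {ratio:8.1f}x leakage")
```

Under these assumed values, each 50 mV reduction in V_T multiplies the subthreshold leakage by roughly 3.6, which shows how equal linear steps in V_T compound into large leakage increases.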

PAGE 19

1.3MotivationforthisDissertation Theissuesdiscussedaboveformthemainmotivationfortheworksreportedinthisdissertation. IthasbeenreportedthattheleakagecomponentofIntel'sXeonTulsaprocessor[14]in65nm technologyisabout30%ofthetotalchippower.Thecoreleakageisabout37%.Theleakage componentsofNiagara[8]andNiagara2[9]microprocessorsbySunMicrosystems,in90nmand 65nmtechnologies,respectively,arebetween25%and30%ofthetotalchippower.Insub-45nm technologies,theleakagepowercomponentinmicroprocessorsareprojectedtobeatleast50%of theirdynamicpowercomponent[15]. Inviewofthesefacts,itisimportanttoinvestigatetechiquestoreduceleakageinmicroprocessors.Inthisdissertation,architecturaltechniquestoreduceleakageenergyinarithmeticfunctional unitsandregisterles,bothofwhicharecorecomponentsofamicroprocessor,areproposed.While acompiler-directedframeworkwitharchitecturalsupportispresentedforreducingleakageenergy infunctionalunitsinembeddedprocessors,purelymicroarchitecturaltechniquesareproposedfor reducingleakageinregisterlesingeneral-purposemultithreadedprocessors. 1.4ContributionsofthisDissertation Themaincontributionofthisdissertationisasetofmethodologiesproposedatthearchitectureleveltoeffectivelyusepowergating,acircuit-levelleakagereductiontechnique,forreducing leakageinembeddedandgeneralpurposemicroprocessors.Thethemeandthecontributionsofthis dissertationareshowninFigure1.2. Abriefdescriptionofthecontributionsare: € AFrameworkforPowerGatingFunctionalUnitsinEmbeddedMicroprocessorswithArchitectureandCompilerSupport :Inthiswork,aframeworkisdevelopedtopowergatearithmeticfunctionalunitsinembeddedmicroprocessorswitharchitectureandcompilersupport. Theproposedframeworkincludesanefcientalgorithmforidletimeestimation,appropriate insertionofsleepinstructionswithinthecode,andamethodforreactivatingthesleepingunits thateliminatestheneedforhavingexplicitwakeupinstructions. 5

PAGE 20

Figure1.2Contributionsofthisdissertation.Thethemeofthedisserationisarchitecturallevel leakagereductioninmicroprocessors.Contributions1and2areforembeddedmicroprocessors, whilecontribution3isinthecontextofmulti-coreprocessorsforgeneralpurposeapplications. € ACompilationFlowtoEnhanceLeakageReductionAchievedinFunctionalUnitsbyCompilerdirectedPowerGating :Inthecontextofthecompiler-directedpowergatingframeworkdiscussedabove,theimpactoftraditionalperformance-enhancingcompileroptimizationsonthe amountofleakagesavingsobtainedisstudiedthroughanalysisandsimulations.Basedonthe observationsmade,aleakage-awarecompilationowisderivedthatimprovestheeffectivenessoftheframework. € MicroarchitecturalTechniquesforPowerGatingRegisterFilesinMulti-coreProcessors :A classofmicroarchitecturaltechniquesforne-grainedstate-retentivepowergatingininteger registerlestosaveleakageenergyisproposedformulti-coreprocessorsfeaturingin-order coresthatsupporthardwaremultithreading.Instate-retentivepowergating,theregisters retaintheirstatesthroughthepowergatingperiod.Inanin-ordercore,whenathreadgets 6

PAGE 21

blocked due to a memory stall, the corresponding register file can be placed in a low leakage state through power gating for leakage reduction. When the memory stall gets resolved, the register file is activated for subsequent accesses. While state-retentive power gating in single cores has been studied in the literature, it is being investigated for multi-core architectures for the first time in this work.

1.5 Outline of Dissertation

The remainder of this dissertation is organized as follows. Chapter 2 describes the background and a comprehensive literature survey related to the specific problems being addressed in this dissertation. More specifically, it first gives a short tutorial on power gating, the circuit-level leakage reduction technique that underlies all the microarchitectural support modeled in this work. Following that, a short tutorial on computer architecture is presented, which describes the organization of a typical pipelined microprocessor. In Chapter 3, a framework for power gating arithmetic functional units in embedded microprocessors with architecture and compiler support is presented. This framework comprises two components: a hardware component and a software component. The hardware component includes a library of arithmetic functional units redesigned with sleep transistors and a microarchitectural model of the embedded processor along with ISA support for controlling the sleep states of these units. The software component comprises extensions made to a compiler infrastructure, which include an efficient algorithm for identifying program regions where functional units are idle so that sleep instructions may be inserted into the code to put those units to sleep during program execution. In Chapter 4, the impact of traditional performance-enhancing compiler optimizations on the amount of leakage savings obtained with the framework described in Chapter 3 is studied through analysis and simulations. Based on the observations, a leakage-aware compilation flow is derived that improves the effectiveness of this framework. In Chapter 5, purely microarchitectural techniques are investigated to perform state-retentive power gating of register files in multi-core processors supporting hardware multithreading. The techniques proposed are based on the fact that in an in-order core, when a thread gets blocked due to a pipeline stall, the corresponding register file can be placed in a low leakage state. When the stall
gets resolved, the register file is activated so that it may be accessed again. Finally, some concluding remarks and the challenges going forward are discussed in Chapter 6.
CHAPTER 2
BACKGROUND AND RELATED WORK

In this chapter, some fundamental concepts that form the basis of the research work presented in this dissertation are described. More specifically, we discuss the basic design techniques for reducing subthreshold leakage in CMOS circuits and describe power gating, a circuit-level technique used for reducing subthreshold leakage, in detail. Following that, we present the basics of pipelined microprocessors and discuss the support for enabling hardware multithreading in such processors. A detailed description of the various related works on leakage power reduction in microprocessors, applied at the architecture and microarchitecture levels, is also presented in this chapter.

2.1 Subthreshold Leakage Reduction Techniques

Subthreshold leakage in CMOS circuits is due to the inability of transistors to act as perfect on/off switches. Therefore, leakage reduction can be achieved by [12]:

• Increasing the resistance in the leakage path
• Reducing the voltage over the leakage path

All the fundamental leakage reduction techniques fall under one of the above two categories. This is shown in Figure 2.1. Most of the techniques fall in the former category because the latter is hard to achieve. The ability to change the supply voltage to a circuit requires either a variable voltage supply mechanism or voltage bias circuits that could enable multiple supply voltages. Since these are hard to achieve in complex designs because of a variety of design issues [12], most of the state-of-the-art circuit level leakage reduction techniques used by designers fall under the former category. These techniques [12, 16] are briefly described next:
Figure 2.1 Subthreshold leakage reduction approaches. In this dissertation, power gating is the underlying circuit level leakage reduction technique used in all the architecture level solutions proposed to reduce leakage in microprocessors.

• Transistor Stacking: Stacking of transistors has a super-linear impact on leakage reduction due to drain-induced barrier lowering (DIBL). DIBL is a deep-submicron effect and is related to a reduction of the threshold voltage as a function of the drain voltage. For a given circuit block, the ideal way to impose the maximum transistor stacking would be to control the inputs of each input gate independently. However, since this is not possible and the only controllable inputs are the primary inputs to the circuit block, the next best thing is to find the primary input pattern to this block that minimizes leakage during standby mode [17-19]. Although the leakage savings achieved by this technique are very limited, its advantages are that it has negligible impact on performance and that its implementation is very simple.

• Power Gating: The ideal way to eliminate subthreshold leakage is to disconnect the circuit block from the supply rails. But that would require the availability of perfect on-off switches. Since such switches do not exist in contemporary CMOS technologies, switches are used as large resistors between the virtual supply rails of the circuit block and the global supply rails. Up to three orders of magnitude of leakage reduction can be achieved by this technique. Since power gating is the underlying circuit level leakage reduction technique in all the architectural leakage reduction solutions proposed in this dissertation, we discuss power gating in more detail in Section 2.2.

• Body Biasing: An alternative approach to power gating is to decrease the leakage current by increasing the threshold voltage of the transistors in the circuit block. Each transistor has a fourth terminal, which can be used to increase the threshold voltage by reverse biasing. Since the subthreshold leakage current depends exponentially on the threshold voltage, this technique can be very effective in minimizing leakage in circuits. Moreover, this technique does not come with a performance penalty and it does not even change the circuit topology. Although this technique looks very attractive due to these attributes, a few of its drawbacks are critical when applying it in real designs. First of all, the leakage reduction achieved by this technique is much lower (by up to two orders of magnitude) than that achieved by power gating. Secondly, it requires more sophisticated triple-well technology if both NMOS and PMOS transistors need to be controlled [12].

2.2 Power Gating

Figure 2.2 shows the schematic view of the power gating technique and the various options available to implement it. If the power gating device, also known as the sleep transistor, is inserted between VDD and the pull-up network of the circuit block, it is called a header device. On the other hand, if it is inserted between the ground rail and the pull-down network of the circuit, it is called a footer device. Leakage current in the circuit block reduces for the following two reasons:

• Increased resistance in the leakage path: The header and the footer sleep devices act as large resistors to the leakage path during standby mode.
• Source biasing introduced by stacking effect: The additional devices in series with the circuit pull-up and pull-down networks introduce a stacking effect, which increases the threshold of the transistors in the stack.
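Both effects can be read off a standard first-order model of the subthreshold current. The expression below is a textbook form quoted only as a reminder; it is not taken from this dissertation, and the symbols $I_0$, $n$, $\phi_t$, $\eta$, and $\gamma'$ are assumptions of this sketch rather than the author's notation.

```latex
I_{\mathrm{subth}} \;=\; I_0 \,
  \exp\!\left(\frac{V_{GS} - V_T}{n\,\phi_t}\right)
  \left(1 - \exp\!\left(-\frac{V_{DS}}{\phi_t}\right)\right),
\qquad
V_T \;\approx\; V_{T0} + \gamma' V_{SB} - \eta\, V_{DS},
\qquad
\phi_t = \frac{kT}{q}
```

Under this model, when a sleep transistor is off the virtual rail drifts away from the global rail, so the logic transistors above it see a negative $V_{GS}$, a positive $V_{SB}$ (which raises $V_T$), and a smaller $V_{DS}$ (which weakens DIBL). All three changes enter the exponent, which is consistent with the large reductions attributed to power gating above.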
Figure 2.2 Power gating options: (a) footer + header, (b) footer only, (c) header only. Since NMOS transistors are more area-efficient than PMOS transistors, the footer only option (b) has the minimum area overhead amongst all the three options. The footer + header option (a) suffers from the maximum area overhead but is the most effective in achieving leakage reduction.

In Figure 2.2, three power gating options are shown:

• Footer and header: In this configuration (Figure 2.2(a)), both the header and the footer devices are used and are simultaneously turned off when the circuit block is idle. This configuration has the maximum area penalty because of the two sleep devices but achieves the maximum leakage reduction by ensuring that the stacking effect is enforced independent of the input patterns to the circuit block.

• Footer only: In principle, it is sufficient to use only a single power gating device to achieve leakage reduction. In the footer-only configuration (Figure 2.2(b)), only a footer device (an NMOS transistor) is used. The area overhead is the minimum in this case.

• Header only: In this configuration (Figure 2.2(c)), only a header device (a PMOS transistor) is used.

Most often, when a single power gating device is selected, an NMOS sleep transistor is preferred over a PMOS one because an NMOS transistor's ON-resistance is smaller than that of a PMOS transistor for the same transistor width. Since subthreshold current, Isubth, varies exponentially with respect to the threshold voltage, VT, the savings achieved by the power gating technique are
maximized when the technology supports both high VT and low VT transistors. While the latter can be used for logic to achieve lower delays, the former act as very effective power gating devices. When power gating is implemented using multiple threshold devices, it is often called Multiple Threshold CMOS (MTCMOS) technology [20-22].

2.2.1 Performance Aspects of Power Gating

When the circuit block is idle, the sleep transistor (in footer configuration) is placed in the cut-off mode, thereby introducing a large resistance in the standby leakage path between the supply and the ground. The circuit is then said to be in the sleep or inactive state. During this state, the virtual ground charges up to a steady state value that is determined by the resistive divider formed by the other transistors in the stack. To bring the circuit back to the active state, the virtual ground is restored to its nominal value by placing the sleep transistor in the saturation mode. Since this requires discharging the virtual ground node to actual ground, there is a wake-up latency associated with it. Moreover, since the deactivation and activation of the circuit block involve discharging and charging the output capacitances of the internal circuit nodes, they restrict how often the circuit block can be transitioned between the two states while still achieving overall energy savings. The period of time that the circuit block should be kept in the sleep state before bringing it back to the active state, so that the leakage energy savings equal the dynamic energy overhead incurred in circuit activation, is known as the breakeven period [23].

2.3 Pipelined Microprocessors

Pipelining is a very effective implementation technique for improving system throughput without requiring massive replication of hardware. It was first employed in the design of high-end mainframes in the 1960s: the IBM 7030 [24] and the CDC 6600 [25]. Almost all modern microprocessors employ pipelining. In this section, we present a short tutorial on concepts related to pipelined microprocessors.
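Before moving on to pipelining, the breakeven period defined in Section 2.2.1 can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the overhead is a fixed energy per sleep/wake transition and that leakage power is constant in each state; the function names and all numeric values are hypothetical and are not measurements from this work.

```python
import math


def breakeven_cycles(overhead_energy_j, leak_active_w, leak_sleep_w, clock_period_s):
    """Smallest whole number of sleep cycles whose leakage savings cover the overhead.

    overhead_energy_j : dynamic energy spent on one deactivate/activate pair (assumed constant)
    leak_active_w     : leakage power of the unit when it is not gated
    leak_sleep_w      : residual leakage power while gated
    """
    saved_power = leak_active_w - leak_sleep_w
    if saved_power <= 0:
        raise ValueError("gating must reduce leakage power")
    breakeven_time = overhead_energy_j / saved_power  # seconds
    return math.ceil(breakeven_time / clock_period_s)


def gating_pays_off(idle_cycles, overhead_energy_j, leak_active_w, leak_sleep_w, clock_period_s):
    """True if an idle interval is at least as long as the breakeven period."""
    return idle_cycles >= breakeven_cycles(
        overhead_energy_j, leak_active_w, leak_sleep_w, clock_period_s
    )


# Hypothetical numbers: 45 pJ overhead, 1 mW leakage when active, none when gated, 10 ns clock.
cycles = breakeven_cycles(45e-12, 1e-3, 0.0, 10e-9)
```

With these made-up values the breakeven time is 45 ns, so an idle interval shorter than five cycles would cost more energy than it saves.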
2.3.1 Pipelining Basics

In the context of instruction set processors, pipelining involves partitioning the processor design into multiple stages such that each stage performs only a part of the computation required by each instruction. A typical instruction cycle can be functionally partitioned into the following five generic computations [26]:

1. Instruction fetch (IF)
2. Instruction decode (ID)
3. Operand(s) fetch (OF)
4. Instruction execution (EX)
5. Operand store (OS)

A typical instruction cycle starts with the fetching (IF) of a new instruction from the memory to be executed in the processor core. Following this, it is decoded (ID) so that the work to be performed by the instruction may be determined. Depending on the type of the instruction, one or more operands may be fetched (OF) from the register file or the memory. Once the necessary operands are available, the instruction is executed (EX) in the appropriate functional units of the processor. Finally, the results generated by the computation in the EX stage are stored back (OS) to the register file or the memory. Since memory access latencies are multiple orders of magnitude longer than the latencies involved in the tasks done within the processor core, caches are universally employed between the processor datapath and the memory to speed up memory accesses. A cache that stores instructions is called an instruction cache (I-cache), while one that stores data is called a data cache (D-cache). Figure 2.3 shows the GENERIC (GNR) pipeline [26].

2.3.2 Classification of Instruction Types

Computations performed by instructions in a typical computer may be categorized into three generic tasks: (i) arithmetic operation, (ii) data movement, and (iii) instruction sequencing. Based on these tasks, in a typical modern processor architecture, instructions are classified into three types:
Figure 2.3 The GENERIC (GNR) pipeline. It has the five generic stages: instruction fetch (IF), instruction decode (ID), operand fetch (OF), instruction execute (EX), and operand store (OS). It also shows the components that are typically accessed at each of these stages. This view of the pipeline is shown because, in this dissertation, architectural techniques are proposed to reduce leakage power in the arithmetic functional units and the register file when they are idle during program execution. It should be noted that this figure does not depict the physical organization of a pipelined processor core.

1. ALU instructions: These instructions perform arithmetic and logical operations.
2. Memory or load/store instructions: These instructions are used to move data between registers and the memory.
3. Branch instructions: These instructions control instruction sequencing based on the control flow of the program.

The semantics of each of the three instruction types can be specified based on the sequence of subcomputations performed by that instruction type. The semantics of all the instruction types enumerated above are defined in Tables 2.1, 2.2, and 2.3.
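The five generic stages and the three instruction types can be captured in a small data structure, which is convenient when reasoning about which resources an instruction touches. This is a minimal sketch only; the mnemonics in `MEMORY_OPS` and `BRANCH_OPS` are hypothetical examples and do not come from any particular instruction set discussed here.

```python
from enum import Enum


class Stage(Enum):
    """The five generic subcomputations of the GNR pipeline."""
    IF = "instruction fetch"
    ID = "instruction decode"
    OF = "operand fetch"
    EX = "instruction execute"
    OS = "operand store"


class InstrType(Enum):
    """Instruction types, named after the generic task each performs."""
    ALU = "arithmetic operation"
    MEMORY = "data movement"
    BRANCH = "instruction sequencing"


# Hypothetical mnemonics, chosen only for illustration.
MEMORY_OPS = {"ld", "st"}
BRANCH_OPS = {"b", "beq", "bne"}


def classify(mnemonic: str) -> InstrType:
    """Map a mnemonic to one of the three instruction types."""
    op = mnemonic.lower()
    if op in MEMORY_OPS:
        return InstrType.MEMORY
    if op in BRANCH_OPS:
        return InstrType.BRANCH
    # Anything that is neither a memory nor a branch operation is treated as ALU.
    return InstrType.ALU
```

A compiler pass or simulator could use such a classification as a first step towards deciding which functional units a given instruction needs.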
Table 2.1 Specification of ALU instruction type

Generic subcomputation | Integer | Floating point
IF | Fetch instruction (access I-cache) | Fetch instruction (access I-cache)
ID | Decode instruction | Decode instruction
OF | Access integer register file | Access floating point register file
EX | Perform integer ALU operation | Perform floating point operation
OS | Write back to integer register file | Write back to floating point register file

Table 2.2 Specification of memory or load/store instruction type

Generic subcomputation | Load instruction | Store instruction
IF | Fetch instruction (access I-cache) | Fetch instruction (access I-cache)
ID | Decode instruction | Decode instruction
OF | Access integer register file for base address; generate effective address (base + offset); read from memory (access D-cache) | Access integer register file for the base address and the integer or floating point register file for the register operand
EX | |
OS | Write back to register file | Generate effective address (base + offset); write into memory (access D-cache)

Table 2.3 Specification of branch instruction type

Generic subcomputation | Unconditional branch | Conditional branch
IF | Fetch instruction (access I-cache) | Fetch instruction (access I-cache)
ID | Decode instruction | Decode instruction
OF | Access integer register file for the base address; generate effective address (base + offset); read from memory (access D-cache) | Access integer register file for the base address; generate effective address (base + offset)
EX | | Evaluate branch condition
OS | Update program counter (PC) with target address | If the branch condition evaluates to true, update program counter (PC) with target address

2.3.3 In-Order Processors

Pipelined processor designs are classed into two categories, scalar and superscalar, according to the number of instructions that can be processed in each of the pipeline stages. A scalar pipeline is one in which at most one instruction may be processed at a time. A superscalar pipeline, on the other hand, may process more than one instruction at the same time. Figure 2.4 shows examples of a scalar and a superscalar pipeline.

Figure 2.4 Scalar vs. superscalar pipeline.

Processors are further classified, according to how the instructions from a program are sequenced within the pipeline, as in-order and out-of-order processors. In an in-order processor, instructions enter the pipeline, advance synchronously through all the pipeline stages, and finish in program order. Since this imposes a lockstep fashion on the way the instructions advance through the pipeline, such pipelines are also called lockstep or rigid pipelines. The drawback of such pipelines is that when one instruction gets stalled, all the following instructions also get stalled, which puts severe restrictions on the instruction throughput. In contrast, an out-of-order processor allows trailing instructions to bypass a stalled leading instruction, thereby allowing out-of-order execution of instructions. For this reason, the hardware complexity of out-of-order processors is significantly higher than that of in-order processors. Most high-performance processors, where performance has traditionally been of foremost importance, are out-of-order processors, while most embedded processors, where power efficiency is of paramount importance, are in-order processors. In this dissertation, all the processor models considered are in-order processors.

2.4 Hardware Multithreading

In this section, we present an overview of the various hardware multithreading approaches that have become increasingly prevalent in modern high performance microprocessors. In the previous section, the discussion of pipelined processors was restricted to single-threaded support. In Chapter 5, microarchitectural techniques to reduce leakage in register files are proposed in the context of hardware multithreaded multi-core processors.
Hardware multithreading is an approach which enables a processor to support the simultaneous execution of multiple threads. A processor that supports hardware multithreading is called a multithreaded processor. Multithreading approaches are categorized according to how they are implemented in hardware [26, 27] and are briefly described here:

• Coarse-Grained Multithreading (CGMT): In this approach (Figure 2.5(a)), a thread uses all CPU resources until a long latency event, such as a cache miss or a long latency operation, occurs. Such an event causes a context switch and another ready thread is switched in, which runs until it encounters a long latency event. This implementation has a context switch latency associated with it. This is because, depending on the pipeline stage where the long latency event is detected (e.g., an I-cache miss happens in the IF stage but a D-cache miss happens in the MEM stage in a MIPS pipeline [26]), the instructions in the preceding stages are squashed, while the instructions in the succeeding stages are allowed to finish before the next thread can be run. Each thread context has a private copy of the register file, instruction fetch buffers, if any, and control logic state, while the rest of the CPU resources are shared. This approach is also known as the blocked multithreading technique.

• Fine-Grained Multithreading (FGMT): In this category (Figure 2.5(b)), thread context switching happens at the boundary of one or more clock cycles for ready threads (i.e., threads that are not blocked due to long latency events). Each thread context has a private copy of the register file and control logic state, while the rest of the CPU resources are shared. Instruction fetch buffers may or may not be shared. FGMT is also known as interleaved multithreading.

• Simultaneous Multithreading (SMT): In SMT (Figure 2.5(c)), instructions from two or more threads are scheduled simultaneously on different functional units during the same cycle. SMT typically works on superscalar processors that have hardware to support simultaneous execution of two or more instructions in a single cycle. Each thread context has a private copy of the register file, instruction fetch buffers, interstage buffers, and control logic state. The rest of the resources are shared.

• Chip Multiprocessing (CMP): In CMP (Figure 2.5(d)), multiple single-threaded processor cores are instantiated on a die such that they share only the L2 cache and the system interfaces. Each core executes instructions from a different thread independently of the other threads, interacting only through shared memory.

Figure 2.5 Multithreading approaches. (a) Coarse-grained multithreading (CGMT) with 2 threads and pipeline width of 1; (b) fine-grained multithreading (FGMT) with 2 threads and pipeline width of 1; (c) simultaneous multithreading (SMT) with 2 threads and pipeline width of 2; (d) chip multiprocessing (CMP) with 2 threads and pipeline width of 1 per core. A 5-stage MIPS pipeline model is shown.

2.5 Related Work

In microprocessors, the methods at the architecture level utilize the circuit-level leakage reduction techniques discussed earlier. In such approaches, the microarchitectural subsystems are equipped with multiple leakage reduction techniques that enable putting those subsystems in and out of low-leakage mode. The design of the interface to the controls of those techniques divides these approaches into two distinct categories: microarchitectural approaches and compiler-directed architectural approaches. In a microarchitectural approach, the logic to regulate those controls is implemented in hardware. In a compiler-directed approach, the interface to such controls is included in the instruction set in the form of special instructions or hardware directives, which the compiler inserts into the code. During program execution, these instructions regulate the leakage in the idle subsystems of the processor.

Earlier works [28, 29] on architectural level leakage reduction have concentrated primarily on the memory subsystems (particularly on caches), since they contribute up to 50% of the total leakage of the system. The various works on leakage reduction in microprocessors (Figure 2.6) are classified as (i) microarchitecture level techniques, and (ii) compiler level techniques with architectural support. A circuit level technique based on reverse body-bias was used for leakage reduction in the commercial Intel XScale microprocessor [30]. An analytical energy model to achieve leakage energy optimization in functional units for superscalar processors is described in [31]. The static energy in the integer functional units is reduced by employing a dual threshold voltage domino logic design technique. Power gating of execution units is investigated by Hu et al. [23], in which the activation and deactivation of the functional units are guided by branch prediction decisions. The techniques discussed in [23, 31] use specifically designed control blocks that monitor the sleep periods of functional units. These control blocks involve significant dynamic power overhead and need to be avoided in the design of embedded processors.
Several techniques have been explored at the compiler level to identify opportunities for power gating. In [32], Zhang et al. investigate the use of input vector control methods and dynamic profiling for leakage reduction. The approach in [33] uses static code analysis to identify power gating opportunities in the program and uses dynamic profiling information to direct the insertion of power gating instructions into the code. The method does not consider power gating opportunities in nested loop structures and requires special architectural support to remove power gating instructions inserted too close to each other, which can cause significant performance degradation. In [34], dataflow analysis is used to estimate functional unit requirements in the basic blocks of the program and identify opportunities to insert sleep and wakeup instructions at the compiler level. The use of dataflow analysis instead of dynamic profiling results in a failure to accurately estimate the resource requirements in both single-level and nested loops in the program. Dynamic profiling provides the runtime characteristics of a program, which can be effectively used in identifying power gating opportunities. Further, the framework in [34] assumes that every branch in the program is equally likely to be taken (or not taken). This assumption is too conservative, since it is well known that the behavior of branch instructions is predictable in a runtime environment, even more so for the branches that represent loop exit conditions. In [35], Seki et al. presented a fine-grained power gating scheme for the MIPS R3000, which was incorporated in a prototype chip, Geyser-0, in 90 nm CMOS technology. Komoda et al. [36] proposed a technique similar to that in [34] but with interprocedural analysis. Roy et al. proposed a framework in [37] that uses both static code analysis

Figure 2.6 Taxonomy of works on leakage reduction in microprocessors (the works, unless stated otherwise, target functional units).
and dynamic profiling to identify potential subgraphs in the program during which the units can be kept deactivated. More recently, power gating has also been used as a primary power management technique in modern commercial processors [38, 39].

Although there is significant work reported in the literature on leakage power reduction in functional units, very few works have tried to address leakage reduction in register files. A multi-banked register file design to improve access speed and reduce total power is presented in [40], while low-leakage register files with dynamic controls have been proposed in [41, 42]. A state-retentive register file designed for the ARM processor was fabricated using 65-nm technology [43] just to study the leakage aspects of a register file in general.

2.5.1 Context and Significance of this Dissertation

The earlier works on leakage reduction in the functional units of microprocessors are primarily focused on superscalar processors. In this work, the existing body of work in this area is supplemented by a novel framework for compiler-directed power gating, which is particularly geared towards embedded systems. Further, we also investigate the impact of compiler optimizations on the opportunities for power gating and derive a leakage-aware compilation flow, which helps generate code to achieve maximal leakage energy reduction during execution. Finally, while the related works described in the earlier section on leakage reduction techniques applied to register files represent important contributions in this field, their application in the context of multi-core processors has not been addressed in any other work prior to the research effort described in this dissertation.
CHAPTER 3
A FRAMEWORK FOR POWER GATING FUNCTIONAL UNITS IN EMBEDDED MICROPROCESSORS

In this chapter, we present a new framework for power gating the functional units in embedded system microprocessors without degradation in performance. The proposed framework includes an efficient algorithm for idle time estimation, appropriate insertion of sleep instructions within the code, and a method for reactivating the sleeping units only when needed, without the use of wakeup instructions. We introduce the notion of loop hierarchy trees (LHT) to represent the partial ordering of the nested loops within the program. From the control flow graph (CFG) representation of the source program, a forest of loop hierarchy trees is constructed and is used to identify the maximal sub-graphs representing the long idle periods for the functional units. For each sub-graph thus identified, a sleep instruction is introduced in the program with a list of corresponding functional units to be deactivated. When an instruction is decoded, the functional units needed for that instruction are automatically activated by the control unit such that the units are ready before the instruction reaches the execute stage. This eliminates the need for wakeup instructions to be inserted into the object code, reducing the overheads. In our implementation, the ARM processor architecture was modified and resynthesized to include power gating by developing a CMOS cell library of functional units with the above capabilities. Experimental results are reported for a set of 12 benchmarks chosen from the MiBench suite, which indicate that, on average, our technique reduces the leakage energy in functional units by 31.1% for integer benchmarks and 26.8% for floating point benchmarks.
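To illustrate what a forest of loop hierarchy trees might look like as a data structure, the following minimal sketch derives parent links from loop nesting. It assumes the loops have already been detected and are given as sets of basic-block identifiers, and that any two loops are either nested or disjoint; the function name and the input format are hypothetical, and the construction actually used in this work is not reproduced here.

```python
def build_loop_forest(loops):
    """Return a parent map describing the nesting of loops.

    loops: dict mapping a loop name to the set of basic-block ids in its body.
    The parent of a loop is the smallest loop that strictly contains it;
    loops with no enclosing loop are roots (parent None).
    """
    parent = {}
    for name, body in loops.items():
        # All loops whose body is a strict superset of this loop's body.
        enclosing = [other for other, other_body in loops.items()
                     if other != name and body < other_body]
        # The innermost enclosing loop is the one with the fewest blocks.
        parent[name] = min(enclosing, key=lambda o: len(loops[o])) if enclosing else None
    return parent


# Hypothetical example: L3 nested in L2 nested in L1, and a separate top-level loop L4.
example_loops = {
    "L1": {1, 2, 3, 4, 5},
    "L2": {2, 3},
    "L3": {3},
    "L4": {7, 8},
}
forest = build_loop_forest(example_loops)
```

Each root of the resulting forest corresponds to an outermost loop, so walking a tree from the root downwards visits loops from the longest-running region to the most deeply nested one.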
3.1 Power Gating in Microprocessors

In microprocessors, power gating is used to reduce leakage in parts of the processor which are not required for sustained periods of time during program execution. However, as discussed earlier, the implementation of power gating using sleep transistors incurs some latency overhead, called activation latency, while activating the circuit. This is the time that the circuit takes to become electrically stable after the power supply has been restored to the circuit. Moreover, the activation and deactivation of the circuit result in dynamic power overhead. The minimum period for which the circuit should remain turned off such that the savings in leakage energy equal the dynamic energy overhead is called the breakeven period. The activation latency and the breakeven period are important factors to be taken into account in the implementation of power gating.

In this work, we investigate the above issues in detail and propose a framework for leakage reduction in the functional units of the datapaths in embedded microprocessor cores. The framework uses both static code analysis and dynamic profiling to extract program characteristics useful in determining power gating opportunities for leakage reduction. The identification and analysis of loop hierarchies within code segments is an important objective in the framework. Thus, we focus on the iterative code structures in the programs for detection of long idle regions for functional units. The switches for power gating the functional units are provided at the circuit level and are controlled with special instructions inserted within the code during compile time based on idle time behavior analysis. The salient features of the proposed framework are:

1. The approach uses both static code analysis and dynamic profiling information to identify power gating opportunities in the program, as well as to direct the insertion of power gating instructions into the code.
2. The breakeven period and the activation latency are used in determining power gating opportunities within the code segments.
3. We introduce the notion of loop hierarchy trees (LHT) to represent the partial ordering of the nested loops within the program. From the control flow graph (CFG) representation of the source program, a forest of loop hierarchy trees (LHT) is constructed capturing the partial
ordering of the nested loops within the program, which is then used to identify the maximal sub-graphs representing the long idle periods for the functional units. For each sub-graph thus identified, a sleep instruction is introduced in the program with a list of corresponding functional units to be deactivated.
4. The proposed technique inserts only sleep instructions into the code, but does not insert any wakeup instructions. This is achieved by limiting the activation latency of the functional units to a single clock cycle and by extending the decode unit so that, whenever an instruction is decoded, the required functional units are activated just before the instruction reaches the execute stage. This saves significant instruction overhead in implementing power gating.

A cell library consisting of a set of new cells with power gating capability was designed and verified using 70 nm CMOS technology. Extensive simulations were performed using the ARM processor as the target architecture model based on the above cell library characterization. Experimental results on the MiBench embedded benchmark suite [44] indicate that significant leakage reduction is possible using the proposed approach.

3.2 Proposed Framework for Leakage Reduction Using Power-Gating

The proposed framework for leakage reduction shown in Figure 3.1 consists of two major components. The hardware component consists of (i) a library of functional units with power gating capability, and (ii) an embedded processor architecture which supports power gating for functional units. The software component consists of extensions at the compiler level to (i) perform source code analysis, (ii) identify idle times for various functional units, and (iii) insert sleep instructions for those functional units at appropriate locations in the source code. The application source programs are profiled dynamically for run-time characterization, and this information is used to accurately predict long idle times for the functional units so that they can be turned off to save leakage. Finally, in order to study the impact of power gating on microprocessor performance, a cycle-accurate simulation of the proposed power gating methodology was performed and a number of application programs were used for testing.
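The sleep-only scheme in feature 4 can be pictured as a small state machine: a sleep instruction clears the power bits of the units it names, and decoding any instruction sets the bits of the units that instruction needs. The sketch below is illustrative only; the class and method names are hypothetical, and the choice of one bit per gated unit (1 = powered, 0 = asleep) and the bit ordering are assumptions of this example.

```python
from enum import IntFlag


class Unit(IntFlag):
    """One bit per power-gated functional unit (bit order is an assumption)."""
    SHIFTER = 0b10000
    IMUL = 0b01000
    FPADD = 0b00100
    FPMUL = 0b00010
    FPDIVSQRT = 0b00001


ALL_UNITS = 0b11111


class PowerState:
    """Tracks which gated units are currently powered."""

    def __init__(self, powered=ALL_UNITS):
        self.powered = int(powered)

    def sleep(self, units):
        """Effect of a sleep instruction: clear the bits of the listed units."""
        self.powered &= ~int(units) & ALL_UNITS

    def on_decode(self, needed):
        """Effect of decoding an instruction: power up whatever it needs."""
        self.powered |= int(needed)

    def bits(self):
        return format(self.powered, "05b")


# Hypothetical sequence: only the FP divide/sqrt unit is asleep to begin with.
state = PowerState(0b11110)
state.sleep(Unit.IMUL | Unit.FPADD | Unit.FPMUL)
after_sleep = state.bits()
state.on_decode(Unit.IMUL)  # e.g. an integer multiply reaches decode
after_mul = state.bits()
```

Because the wake-up is a side effect of decode, no explicit wakeup instruction appears anywhere in the instruction stream, which is the point of the design.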
Figure 3.1 Proposed framework for power gating

3.2.1 Hardware Component

The hardware component consists of a library of functional units with sleep transistors and an architecture with control capabilities for those sleep transistors. The architecture and the design of functional units are described here.

3.2.1.1 Embedded Processor Architecture with Power-Gating Support

The target embedded processor architecture, as shown in Figure 3.2, is based on the popular ARMv7 processor core (www.arm.com). The ARMv7 processor core has an integer ALU, a barrel shifter, and an integer multiplier. Since the processor does not have a floating point unit (FPU), floating point (FP) operations are emulated using software macros constructed out of integer instructions. Since a significant fraction of embedded systems applications (multimedia applications) operate on FP numbers, an FPU is included in the target architecture. The FPU shown in Figure 3.2 consists of an adder (ADD), a multiplier (MUL), and a division and square root unit (D/S). The functional units are designed such that the activation latency is limited to one clock cycle. The details of the integer and FP functional units and their power gated versions are given in Section 3.2.1.2.

Figure 3.2 Modified ARM architecture

A Sleep Control Register (SCR) is added to the instruction decode logic, which drives the sleep controls of the various functional units, as shown in Figure 3.2. The activation and deactivation of the functional units is illustrated in Figure 3.3. The deactivation is carried out using a sleep instruction. The functional units that need to be deactivated are specified as operands to the sleep instruction, as determined during program analysis. For each functional unit in the sleep instruction, a '0' is written at the corresponding cell in the SCR. Figure 3.3(a) shows a sequence of instructions. The contents of the SCR after each instruction is decoded are shown to its right. Before the sleep instruction is decoded, the contents of the SCR are "11110", indicating that only the FP division and square root unit is in power gated mode. The operands passed to the sleep instruction are the integer
multiplier, FP adder, and FP multiplier. When this instruction is decoded, the cells driving the sleep controls for each of these functional units are written with a '0' (indicated by the gray colored cells next to the sleep instruction in the figure), initiating their deactivation process.

Figure 3.3 Example of activation and deactivation of functional units: (a) instruction sequence with SCR contents; (b) pipeline timing showing the one-cycle wakeup period.

The activation of the functional units takes place at the decode stage. The functional unit gets activated during the cycle following the one in which the instruction enters the operand fetch stage in the pipeline. This is shown in Figure 3.3(b). In the example in Figure 3.3(a), when a subsequent mul instruction is decoded, a '1' is written at the SCR position corresponding to the integer multiplier (indicated by the gray colored cell next to the mul instruction in the figure), initiating the restoration of the power supply to the integer multiplier. Therefore, by the time the instruction enters the execute stage, the integer multiplier is active and ready to perform computation. This aspect of our design obviates the need for separate wakeup instructions, thereby reducing the number of overhead instructions significantly.

The ARMv7 architecture reserves some instruction formats for instruction extensions in its later implementations. For example, when bit 6 of the original ARM multiply instruction is changed
from '0' to '1', the instruction is ignored and does not even generate the undefined instruction trap. This option can be used to define a new instruction requiring minimal changes in the control unit. Based on this knowledge, a possible format of the sleep instruction is shown in Figure 3.4(a).

[Figure 3.4: A possible format of the sleep instruction. (a) Generic format, with the Cond and opvec fields. (b) Sample sleep instruction: sleep imul, fpadd, fpmul.]

The generic format of the sleep instruction is obtained by altering bit 6 in the original multiply instruction format from '0' to '1', as highlighted by the gray field in the figure. Bits 21 to 17, indicated as opvec in the figure, correspond to the functional units that are passed as operands. A '1' in that bit vector indicates that the functional unit needs to be deactivated, while a '0' indicates that the sleep control should be left unaltered. Thus, the sleep instruction in Figure 3.4(b) with integer multiplier, FP adder, and FP multiplier translates to the bit vector "01110" as shown.

3.2.1.2 Design of Power-Gated Functional Units

The functional units are redesigned according to the specifications of the ARM processor. The integer functional unit comprises an ALU, a barrel shifter, and a Booth's multiplier. Since the ALU is frequently used, no power gating is incorporated into it. The floating point unit comprises an adder, a multiplier, and a divide and square root unit, as in [45]. Structural VHDL descriptions were used to synthesize the functional units using a cell library characterized for latency and power in the 70 nm Predictive Technology Model (PTM) [46].

There is a tradeoff between the latency of the functional units and the leakage savings achieved. Narrower sleep transistors yield higher leakage savings at the expense of increased latency of the functional units. On the other hand, to incur negligible degradation in the latency of the functional units, the width of the sleep transistors has to be increased. In the results reported later in Section 3.3, the leakage savings indicated correspond to keeping the latency degradation minimal or negligible (less than 2%). The leakage savings can be significantly increased if we allow further tradeoff with latency. The area overhead of the power gated functional units is 11% in terms of the width of the NMOS transistors. Based on the specifications of the area optimized ARM cores, the clock period was assumed to be 10 ns for the purpose of characterization. The latencies of the functional units (both regular and power gated versions) for the ARM core, in terms of the number of cycles, are listed in Table 3.1.

Table 3.1 Latencies of functional units

Functional unit       Pipeline stages   Latency (cycles)
Integer ALU           0                 1
Barrel Shifter        0                 1
Integer Multiplier    1                 16
FP Adder              4                 6
FP Multiplier         3                 10
FP Div-Sqrt Unit      4                 19

3.2.2 Software Component

The main task of the software component is to analyze the program behavior and predict regions in the program where certain functional units are expected to be idle during the execution of the program, so that those functional units may be power gated. This translates to finding maximal subgraphs in the CFG of a program pertaining to the idleness of each functional unit. Once these subgraphs are found, sleep instructions should be inserted at the entry so that the functional units are deactivated when the program control enters these subgraphs during its execution.

The control flow semantics of a program are represented in the form of a control flow graph (CFG), in which the vertices represent the basic blocks and the edges represent the transfer of control flow between the basic blocks. A basic block is defined as a straight-line code sequence with no control instruction. In the example in Figure 3.5, the code sequence multiplies two matrices

[Figure 3.5: Control flow graph of the matrix multiplication example, with loop tests on i, j, and k and the inner-loop statement sum = sum + A[i,k]*B[k,j]; figure content truncated.]
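
Returning briefly to the hardware mechanism of Section 3.2.1.1, the SCR update rules can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: the unit names and their left-to-right order in the bit vector are assumptions chosen so that the two bit strings quoted in the text ("11110" for the SCR and "01110" for opvec) are reproduced, and the helper names are hypothetical.

```python
# Assumed left-to-right order of the SCR / opvec cells (hypothetical).
UNITS = ("shifter", "imul", "fpadd", "fpmul", "fpdivsqrt")


def opvec(units_to_sleep):
    """Bit vector of a sleep instruction: '1' means deactivate this unit."""
    return "".join("1" if u in units_to_sleep else "0" for u in UNITS)


def decode_sleep(scr, vec):
    """Decoding a sleep instruction writes '0' into the SCR cell of every
    unit selected in opvec; the other cells are left unaltered."""
    return "".join("0" if v == "1" else s for s, v in zip(scr, vec))


def decode_use(scr, unit):
    """Decoding an instruction that needs `unit` writes '1' into its SCR
    cell, which starts the one-cycle wakeup."""
    i = UNITS.index(unit)
    return scr[:i] + "1" + scr[i + 1:]


scr = "11110"                                # only the FP div-sqrt unit is gated
vec = opvec({"imul", "fpadd", "fpmul"})      # sleep imul, fpadd, fpmul
scr_after_sleep = decode_sleep(scr, vec)
scr_after_mul = decode_use(scr_after_sleep, "imul")
print(vec, scr_after_sleep, scr_after_mul)
```

Under these assumptions the sleep instruction clears three cells and the later mul instruction sets only the integer multiplier cell again, which is why no separate wakeup instruction is needed.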

3.2.2.1 Identifying Potential Power-Gating Regions in the CFG

While generating the CFG for the source program, each vertex is annotated with the functional unit requirement of the basic block represented by that vertex. Consider a subgraph of the run-time trace of a program. The run-time trace is formed by tracing the edges of the CFG during the execution of the program. Let vertex $u$ represent a basic block that does not use a particular functional unit, say $r$. Let $P$ represent the set of all unique paths starting at vertex $u$ and terminating at the earliest vertices such that the basic blocks represented by them use the functional unit $r$. Let $t_{clk}$ be the clock period and $N(p)$ be the number of clock cycles required to execute the instructions in path $p \in P$. Then the total time spent in executing the instructions in $p$ is given by

$$t_p = N(p)\, t_{clk} \qquad (3.1)$$

Let $L(p)$, the length of the path $p$, be described in terms of the number of instructions in $p$. If the maximum IPC (instructions per cycle) count, which is determined by the fetch and decode width of a processor, is $\alpha$, then

$$L(p) \le \alpha\, N(p) \qquad (3.2)$$

The inequality in Equation (3.2) indicates that the IPC count for a code segment can be smaller than $\alpha$ on account of cache misses, branch mispredictions, pipeline flushes, etc. Eliminating $N(p)$ from Equation (3.1) and Equation (3.2), we obtain

$$t_p \ge \frac{1}{\alpha}\, L(p)\, t_{clk} \qquad (3.3)$$

It can be noted from Equation (3.3) that, for an IPC count of at most 1, the execution time for the instructions in $p$ is lower bounded by the product of the number of instructions in $p$ and the clock period.

Now, let the leakage energy saved per unit time by keeping the functional unit $r$ deactivated be $\epsilon_r$, and let the dynamic energy overhead in activation and deactivation be $\delta_r$. The breakeven period, $t_{min}$, is given by

$$t_{min} = \frac{\delta_r}{\epsilon_r} \qquad (3.4)$$

So, for a path $p$ to generate energy savings, the time spent in executing the instructions in $p$ should be at least $t_{min}$, i.e., $t_p \ge t_{min}$. This is satisfied if

$$\frac{1}{\alpha}\, L(p)\, t_{clk} \ge t_{min} = \frac{\delta_r}{\epsilon_r} \;\Longrightarrow\; L(p) \ge \frac{\alpha\, \delta_r}{\epsilon_r\, t_{clk}} = L_{th} \qquad (3.5)$$

The quantity $\alpha\,\delta_r / (\epsilon_r t_{clk})$ is called the threshold length and is denoted by $L_{th}$. Equation (3.5) states that if the number of instructions in a path $p$ is at least the threshold length for functional unit $r$, the energy savings obtained due to the low leakage state of $r$ would be greater than the energy required to activate $r$, thereby resulting in overall energy savings. If the relation specified by Equation (3.5) is satisfied for all $p \in P$, then

$$t_p \ge t_{min}, \quad \forall p \in P \qquad (3.6)$$

which ensures that the leakage energy saved over all the paths in $P$ exceeds the dynamic energy overhead incurred, thereby giving overall savings.

Although the above approach ensures energy savings in every path in the subgraph, it is a rather conservative approach. During the execution of the program, some of the paths in $P$ may be greater in length and/or traversed more frequently than the rest. Thus, there is a possibility that the combined energy savings achieved along those paths exceed the combined energy losses incurred along the rest of the paths, still resulting in overall energy savings. If $n_p$ is the number of times the path $p \in P$ is traversed, then, using Equation (3.5) and taking $\alpha = 1$ (as for the single-issue processor considered in this chapter), the leakage energy savings can be expressed as

$$E_{savings} = \sum_{p \in P} n_p \left( L(p)\, \epsilon_r\, t_{clk} - \delta_r \right) \qquad (3.7)$$
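
The difference between the per-path condition and the aggregate savings can be sketched numerically. The following is a minimal illustration, not part of the original framework: it assumes an IPC bound of 1, writes `saved_per_cycle` for the leakage energy saved per clock cycle (the saving per unit time multiplied by the clock period) and `overhead` for the activation plus deactivation energy, and uses made-up energy values and path counts.

```python
import math


def threshold_length(overhead, saved_per_cycle, alpha=1.0):
    """Threshold length in instructions, following Equation (3.5)."""
    return alpha * overhead / saved_per_cycle


def energy_savings(paths, overhead, saved_per_cycle):
    """Aggregate savings over all paths, following Equation (3.7).

    `paths` is a list of (length_in_instructions, traversal_count) pairs.
    """
    return sum(n * (length * saved_per_cycle - overhead) for length, n in paths)


# Hypothetical numbers, in nJ and nJ/cycle.
overhead = 0.05
saved_per_cycle = 1e-4
l_th = threshold_length(overhead, saved_per_cycle)

# One long, frequently traversed path and one short path below the threshold.
paths = [(2000, 10), (100, 3)]
every_path_ok = all(length >= l_th for length, _ in paths)
total = energy_savings(paths, overhead, saved_per_cycle)
print(l_th, every_path_ok, total)
```

With these illustrative values the short path fails the per-path test, yet the total is still positive, which is the situation the less conservative formulation is meant to capture.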

If $P' \subseteq P$ is the set of paths whose length is at least $L_{th}$, i.e., $P' = \{\, p \mid p \in P \text{ and } L(p) \ge L_{th} \,\}$, then Equation (3.7) can be rewritten as

$$E_{savings} = \sum_{p \in P'} n_p \left( L(p)\, \epsilon_r\, t_{clk} - \delta_r \right) - \sum_{p \in P - P'} n_p \left( \delta_r - L(p)\, \epsilon_r\, t_{clk} \right) \qquad (3.8)$$

where $\epsilon_r$ is the leakage energy saved per unit time while $r$ is deactivated and $\delta_r$ is the dynamic energy overhead of activating and deactivating $r$. From Equation (3.8), it can be observed that if subgraphs are identified in the program CFG which can result in program execution paths of length at least $L_{th}$, such that (i) a functional unit is not required in those paths, and (ii) the paths are also traversed more often, we find potential regions in the program CFG which are good candidates for power gating.

3.2.2.2 Subgraphs Enclosed Within Loops

For the purpose of identifying regions in a program, not only the functions but also the loops within the functions are considered. Considering only program regions enclosed within entire functions would provide too coarse a granularity for finding power gating opportunities. Moreover, since the execution of loops usually varies dynamically with input data, a program usually spends a variable fraction of execution time in subgraphs enclosed within loops. In this context, we propose and use the notion of loop hierarchy trees (LHTs). An LHT is essentially a tree data structure that captures the partial ordering of nested loops within a function in the source program. At compile time, the LHTs are used to determine long intervals between successive uses of the functional units, thus identifying opportunities for applying power gating to the functional units. During the static code analysis phase, for each function in the source program, a forest of LHTs is created. Each vertex of an LHT denotes a loop in the source program, and its children denote the loops that are nested immediately within that loop. Each vertex is annotated with the functional unit requirement of the loop corresponding to that vertex. Figure 3.6 shows the LHT for the example in Figure 3.5. Loop $l_1$ has 2 nested loops at the same level, $l_2$ and $l_4$. In the LHT, this translates to $l_1$ being the root of the tree, and $l_2$ and $l_4$ being its child nodes. Since loop $l_2$ further has $l_3$ as a nested loop, $l_3$ is a child node of $l_2$ in the LHT. In the remainder of this article, we use the term child loop with respect to a parent loop to denote a loop that is immediately nested in the parent loop.

[Figure 3.6: LHT for the CFG in Figure 3.5, with vertices l1, l2, l3, and l4.]

An essential property of the LHT is that it captures a partial ordering of the loops in a program. With respect to the example in Figure 3.6, loop $l_1$ comprises a set of basic blocks that is a superset of the set of basic blocks that form loops $l_2$ or $l_4$. Therefore, the functional unit requirements of loops $l_2$ and $l_4$ are subsets of the functional unit requirements of loop $l_1$. If loop $l_1$ does not require a functional unit, say $r$, it implies that neither $l_2$ nor $l_4$ requires $r$. Therefore, if $r$ is deactivated at the entry of $l_1$, it will remain deactivated during the executions of loops $l_2$ and $l_4$ as well. A similar observation can be made for loops $l_2$ and $l_3$. However, since loops $l_2$ and $l_4$ do not share any vertices of the CFG, their functional unit requirements are independent of each other.
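
The subset property of the LHT can be illustrated with a small sketch. The tree shape below follows Figure 3.6 as described in the text; the class name, the field names, and the functional units assigned to each loop are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    """One LHT vertex. `own_units` holds the units used by basic blocks
    that belong to this loop but to none of its child loops."""
    name: str
    own_units: set
    children: list = field(default_factory=list)

    def res(self):
        """Functional unit requirement of the whole loop, children included."""
        units = set(self.own_units)
        for child in self.children:
            units |= child.res()
        return units


# Tree shape from Figure 3.6; the unit assignments are made up.
l3 = Loop("l3", {"imul"})
l2 = Loop("l2", {"fpdivsqrt"}, [l3])
l4 = Loop("l4", {"shifter"})
l1 = Loop("l1", set(), [l2, l4])

for loop in (l1, l2, l3, l4):
    print(loop.name, sorted(loop.res()))
```

Because `res` is built bottom-up from the children, the requirement of a child is always a subset of the requirement of its parent, while siblings such as l2 and l4 remain independent.
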
During the dynamic profiling of the source program, the vertices in the CFG of the program are annotated with the corresponding basic block execution counts, and the vertices in the LHT are annotated with the corresponding loop execution counts. The execution count of a loop is the number of times the loop is entered during the dynamic profiling of the program. This is the same as the execution count of the entry basic block of the loop. Consider that $G_i = (V_i, E_i)$ is the CFG of the $i$th function in the source program, where $V_i$ is the set of vertices corresponding to the basic blocks in the function and $E_i$ is the set of directed edges indicating the flow of control between the basic blocks. Let $l_i$ be a loop in $G_i$, such that $S(l_i) \subseteq V_i$ denotes the set of vertices in $l_i$ and $C_i$ denotes the set of child loops of $l_i$. The total length of the loop $l_i$, which is defined as the total number of instructions executed over all iterations of $l_i$ during the dynamic profiling stage, can be represented by the recursive relation

$$L_{tot}(l_i) = \sum_{c_i \in C_i} L_{tot}(c_i) + \sum_{v \,\in\, S(l_i) - \bigcup_{c_i \in C_i} S(c_i)} L_{bb}(v)\, g(v) \qquad (3.9)$$

where $L_{tot}(c_i)$ is the total length of nested loop $c_i$, $L_{bb}(v)$ is the length of the basic block corresponding to vertex $v$, and $g(v)$ is the execution count of the basic block corresponding to vertex $v$. In words, the total length of a loop is calculated as the sum of the number of instructions executed over all iterations of its child loops and the number of instructions executed as part of the basic blocks in the loop which are not part of the child loops. Then, the average length of loop $l_i$ is defined as the average number of instructions executed during one iteration of $l_i$ and is given by

$$L_{avg}(l_i) = \frac{L_{tot}(l_i)}{f(l_i)} \qquad (3.10)$$

where $f(l_i)$ is the execution count of loop $l_i$. Similarly, the average length of any of the child loops $c_i$ of $l_i$ is given by

$$L_{avg}(c_i) = \frac{L_{tot}(c_i)}{f(c_i)} \qquad (3.11)$$

where $f(c_i)$ is the execution count of loop $c_i$. Substituting Equation (3.10) and Equation (3.11) in Equation (3.9), we get the following recursive relation for $L_{avg}(l_i)$:

$$L_{avg}(l_i) = \frac{1}{f(l_i)} \left[ \sum_{c_i \in C_i} L_{avg}(c_i)\, f(c_i) + \sum_{v \,\in\, S(l_i) - \bigcup_{c_i \in C_i} S(c_i)} L_{bb}(v)\, g(v) \right] \qquad (3.12)$$

The average lengths of the loops are used to make decisions about inserting sleep instructions in cases where the loop requires a functional unit but only a subset of the nested loops within that loop have the same functional unit requirement. The $L_{avg}$ values for all the loops in the CFG can be calculated by running a breadth first search from the root of each tree in the LHTs. For the functions that are called from within a loop, the entire function is considered as a basic block in the above formulation in Equation (3.10). Also, each function in the source program is separately analyzed for insertion of sleep instructions.

3.2.2.3 Insertion of Sleep Instructions

To find the locations in the program to insert sleep instructions, a depth first traversal of the nodes in each LHT is performed starting at its root. This traversal is done once for each functional unit and is terminated as soon as it is found that the entire loop corresponding to the node does not use the functional unit. Since dynamic profiling at the basic block level can only acquire the total execution count of each loop in the program, we normalize the execution count of a child loop with respect to its parent loop to quantify the iterative degree of the child loop. The iterative degree of a loop is the number of executions of the loop per execution of its parent loop. We define the normalized average length of loop $x$, $L_{norm}(x)$, as

$$L_{norm}(x) = L_{avg}(x)\, \frac{f(x)}{f(parent(x))} \qquad (3.13)$$

where $f(x)$ is the execution count of loop $x$ and $f(parent(x))$ is the execution count of the parent loop of loop $x$. Referring to the CFG in Figure 3.5, loop $l_3$ does not have a division operation, while its parent loop, $l_2$, has a division operation. Therefore, loop $l_3$ becomes a potential subgraph for power gating the divider. However, the amount of savings obtained by keeping the divider powered down depends on the number of times loop $l_3$ executes each time loop $l_2$ is executed. It can be observed that loop $l_3$ iterates $m^2 n$ times, whereas loop $l_2$ iterates $m^2$ times. Therefore, if $m = 8$ and $n = 4$, $f(l_3) = 8^2 \times 4 = 256$, while $f(l_2) = 8^2 = 64$. On the other hand, if $m = 4$ and $n = 16$, $f(l_3) = 4^2 \times 16 = 256$, but $f(l_2) = 4^2 = 16$. Thus, although the total number of times loop $l_3$ is executed is the same (256) in both cases, the latter case is more favorable for power gating the divider, since there are 4 times as many instructions executed between two successive division operations in this case compared to the former. Thus, the normalization process considers nested loops that are small but iterative enough to be potential subgraphs for power gating.
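
The recursive definitions in Equations (3.9), (3.10), and (3.13) can be sketched as follows. This is an illustrative model only: the class and function names are hypothetical, the inner loop body is assumed to be a single basic block of 5 instructions, and the blocks that belong to l2 alone are left out for simplicity. Only the execution counts are taken from the example in the text.

```python
class Loop:
    """LHT vertex with profile data. `blocks` lists (L_bb, g) pairs for the
    basic blocks of this loop that are not part of any child loop."""

    def __init__(self, name, f, blocks, parent=None):
        self.name, self.f, self.blocks = name, f, blocks
        self.parent, self.children = parent, []
        if parent is not None:
            parent.children.append(self)


def l_tot(loop):
    """Total length, Equation (3.9)."""
    own = sum(length * count for length, count in loop.blocks)
    return own + sum(l_tot(child) for child in loop.children)


def l_avg(loop):
    """Average length, Equation (3.10)."""
    return l_tot(loop) / loop.f


def l_norm(loop):
    """Normalized average length, Equation (3.13)."""
    return l_avg(loop) * loop.f / loop.parent.f


def norm_length_of_l3(m, n, body_len=5):
    """L_norm of the inner loop l3 for the matrix sizes used in the text."""
    l2 = Loop("l2", f=m * m, blocks=[])
    l3 = Loop("l3", f=m * m * n, blocks=[(body_len, m * m * n)], parent=l2)
    return l_norm(l3)


print(norm_length_of_l3(8, 4), norm_length_of_l3(4, 16))
```

Under these assumptions the two cases give 20.0 and 80.0 instructions, a ratio of 4, in line with the observation that the second case leaves four times as many instructions between successive uses of the divider.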

Algorithm 1 INSERT-SLEEP(F): Algorithm to insert sleep instructions
1: S ← ∅
2: for all tree T ∈ F do
3:     for all functional unit r ∈ R do
4:         INSPECT(root(T), r)
5:     end for
6: end for
7: UNIQIFY(S)

Algorithm 1 represents the pseudocode for the routine INSERT-SLEEP. The set $S$ contains the sleep instruction locations. It is set to the empty set at the beginning (line 1). In lines 2-6, the two for loops iterate over all the LHTs in the LHT forest $F$ for each functional unit in $R$ and call the routine INSPECT at the root of each tree. After all the LHTs are inspected, in line 7, the routine UNIQIFY is called to uniqify all the sleep instructions in set $S$. It should be noted that the normalized lengths of all the loops in the program, $L_{norm}()$, and the threshold number of instructions for each functional unit, $L_{th}()$, are already known before the routine INSERT-SLEEP is called.

Algorithm 2 INSPECT(x, r): Algorithm to inspect loops
1: if r ∉ res(x) then
2:     if L_norm(x) ≥ L_th then
3:         S ← S ∪ {⟨e(x), r⟩}
4:     end if
5: else
6:     for all y ∈ C_x do
7:         INSPECT(y, r)
8:     end for
9: end if

Algorithm 2 represents the pseudocode for the routine INSPECT, which takes a vertex of an LHT, $x$, and a functional unit, $r$, as its arguments. In lines 1-4, it checks whether $r$ is used in $x$ or not. The functional unit requirement for loop $x$ is given by $res(x)$. If $r$ is not used in loop $x$, it checks the normalized length of $x$, $L_{norm}(x)$. If $L_{norm}(x)$ is at least the threshold number of instructions, $L_{th}$, the location is marked (added to the set $S$) for insertion of a sleep instruction for deactivating functional unit $r$. The location is a two element tuple consisting of the basic block leading to the loop, $e(x)$, and the functional unit, $r$. If, however, $r$ is used in the loop $x$, it calls INSPECT recursively on all the child vertices of $x$ (lines 5-8). In other words, the loops nested in $x$ are inspected to explore the possibility of power gating $r$.

Algorithm 3 UNIQIFY(S): Algorithm to uniqify the sleep instructions
1: Sort the elements in S
2: for all unique locations do
3:     Merge all the FUs to form one sleep instruction
4: end for

The routine UNIQIFY, shown in Algorithm 3, replaces separate sleep instructions which might have been inserted for each functional unit at the same location with a single sleep instruction with all those functional units as its operands. In line 1, the elements in $S$ are sorted in ascending order by their locations. In lines 2-4, a unique sleep instruction is generated for all the functional units which have the same location.

3.2.2.4 Time Complexity

Since the routine INSPECT implements a DFS traversal in $T$, its worst-case time complexity is $O(|V| + |E|)$. However, since for a rooted tree $|E| = |V| - 1$, the worst-case time complexity of INSPECT is $O(|V|)$. Note that $|V|$ denotes the number of loops in a function. Therefore, the worst-case time complexity of INSPECT is linear in the number of loops in the function corresponding to the LHT $T$. The time complexity of routine UNIQIFY is $O(|S| \log |S|)$, where $|S|$ is the number of sleep instructions inserted. The routine INSERT-SLEEP makes calls to INSPECT for all the LHTs for each functional unit. Since there is one LHT for each function in the program, the worst-case time complexity of lines 1-6 in INSERT-SLEEP is $O(n|R|)$, where $n$ is the total number of loops in the entire program across all functions and $|R|$ is the total number of power gating enabled functional units. However, $|R|$ is constant, since the number of functional units is constant. Thus, the worst-case time complexity of lines 1-6 in INSERT-SLEEP is $O(n)$, which is linear in the number of loops in the program. However, the number of sleep instructions that can be inserted is at most the number of loops present for each functional unit and, therefore, $|S| \le n|R|$. Thus, the worst-case time complexity of the call to the routine UNIQIFY in line 7 of INSERT-SLEEP is $O(n \log n)$. Therefore, the worst-case time complexity of routine INSERT-SLEEP is $O(n \log n)$.
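
The three routines can be sketched in Python as below. This is one possible rendering of Algorithms 1-3 under stated assumptions, not the implementation used in this work: the `Loop` class, the entry block labels, the functional unit requirements, and the normalized lengths are all hypothetical, and the two thresholds are the values listed for those units in Table 3.2.

```python
class Loop:
    """LHT vertex: res is the functional unit requirement of the loop,
    entry is the label of the basic block leading into the loop, e(x)."""

    def __init__(self, name, res, entry, children=()):
        self.name, self.res, self.entry = name, set(res), entry
        self.children = list(children)


def inspect(x, r, l_norm, l_th, S):
    """Algorithm 2: mark e(x) if x does not use r and is long enough,
    otherwise descend into the child loops of x."""
    if r not in x.res:
        if l_norm[x.name] >= l_th[r]:
            S.append((x.entry, r))
    else:
        for y in x.children:
            inspect(y, r, l_norm, l_th, S)


def uniqify(S):
    """Algorithm 3: sort by location and merge the units per location."""
    merged = {}
    for location, unit in sorted(S):
        merged.setdefault(location, []).append(unit)
    return [(location, tuple(units)) for location, units in merged.items()]


def insert_sleep(forest, units, l_norm, l_th):
    """Algorithm 1: inspect every tree once per functional unit."""
    S = []
    for root in forest:
        for r in units:
            inspect(root, r, l_norm, l_th, S)
    return uniqify(S)


# Hypothetical LHT with the shape of Figure 3.6.
l3 = Loop("l3", {"imul"}, "bb_l3")
l4 = Loop("l4", set(), "bb_l4")
l2 = Loop("l2", {"imul", "fpdivsqrt"}, "bb_l2", [l3])
l1 = Loop("l1", {"imul", "fpdivsqrt"}, "bb_l1", [l2, l4])

l_norm = {"l1": 5000, "l2": 2000, "l3": 600, "l4": 1000}   # made-up values
l_th = {"imul": 530, "fpdivsqrt": 523}                      # from Table 3.2

sleeps = insert_sleep([l1], ["imul", "fpdivsqrt"], l_norm, l_th)
print(sleeps)
```

In this example the divider is put to sleep before l3, and a single merged sleep instruction covering both units is placed before l4.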

3.3 Experimental Setup and Results

3.3.1 Energy Component Calculations

Figure 3.7 shows the instantaneous power graph superimposed on the instantaneous voltage graph at the virtual ground for a footer sleep transistor configuration. The graph depicts the significant time intervals for the calculation of the various energy components of a power gated functional unit. $V_{vrgnd}$ refers to the voltage at the virtual ground. $P_{inst}$ refers to the instantaneous power dissipated by the circuit. At time $t_0$, when the sleep transistor is switched OFF, $V_{vrgnd}$ rises to $V_{dd}$ by time $t_1$. From time $t_1$ to $t_2$, all the capacitances in the circuit reach their final charge. During this interval, the circuit still dissipates instantaneous power, which slowly approaches the steady state leakage power in its OFF state, $P_{L_{OFF}}$. Thus, the overhead energy, $E_{t_0-t_2}$, in deactivating the circuit is the total energy dissipated during the interval $t_0$ to $t_2$, and is given by the area under the curve for $P_{inst}$ during the interval $t_0$ to $t_2$ (indicated by the shaded area in dark gray in Figure 3.7):

$$E_{t_0-t_2} = E_{t_0-t_1} + E_{t_1-t_2} \qquad (3.14)$$

[Figure 3.7: Superimposed voltage and power graphs as functions of time. The voltage at the virtual ground, V_vrgnd, and the instantaneous power, P_inst, are plotted against time through the phases OFF switch (transient), OFF (idle), ON switch (transient), and ON (active), with time points t0 to t5 and steady state leakage levels P_LOFF and P_LON.]

It is only after time $t_2$ that the circuit starts to dissipate $P_{L_{OFF}}$, which is the steady state leakage power dissipation of the circuit in sleep state. The overhead energy required while activating the circuit is considered next. At time $t_3$, the sleep transistor is switched ON, and $V_{vrgnd}$ falls to $V_{gnd}$ by time $t_4$. From time $t_4$ to $t_5$, all the capacitances reach their final charge. The overhead energy in activating the circuit, denoted by $E_{t_3-t_5}$, is the total energy dissipated during the interval $t_3$ to $t_5$ (indicated by the shaded area in dark gray in Figure 3.7):

$$E_{t_3-t_5} = E_{t_3-t_4} + E_{t_4-t_5} \qquad (3.15)$$

After time $t_5$, the circuit starts to dissipate $P_{L_{ON}}$. This is the steady state leakage power dissipation of the powered circuit when it is idle.

The calculations of the energy components of the library components are performed using HSPICE simulations with the 70 nm PTM files. The activation and deactivation overhead energies (areas indicated by dark gray shades) are calculated using piecewise linear approximation of the curves. The dynamic energy components are computed by averaging the power reported by HSPICE over pseudorandomly generated inputs. During the calculation of the dynamic energy components, the sleep transistors were turned on. The energy components of each functional unit are estimated as the sum of the energy components of its constituent component instances. For a functional unit $r$, the dynamic energy overhead incurred during activation and deactivation of $r$, $\delta_r$, is calculated as

$$\delta_r = E_{t_0-t_2} + E_{t_3-t_5} \qquad (3.16)$$

Table 3.2 shows the energy values and the threshold lengths that are estimated for each of the functional units.
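
The piecewise linear approximation of the overhead energies amounts to summing trapezoids under the sampled power curve. The sketch below shows the idea only; the sample points are invented for illustration and do not come from the HSPICE runs, and the function name is hypothetical.

```python
import math


def energy_piecewise_linear(samples):
    """Area under a piecewise linear power curve.

    `samples` is a list of (time_ns, power_mW) points in increasing time
    order; the result is in mW*ns, i.e. picojoules.
    """
    return sum(
        0.5 * (p0 + p1) * (t1 - t0)
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    )


# Invented samples for the two transients of Figure 3.7.
deactivation = [(0.0, 4.0), (2.0, 1.0), (6.0, 0.5)]     # t0, t1, t2
activation = [(10.0, 0.5), (11.0, 6.0), (14.0, 2.0)]    # t3, t4, t5

e_t0_t2 = energy_piecewise_linear(deactivation)   # Equation (3.14)
e_t3_t5 = energy_piecewise_linear(activation)     # Equation (3.15)
overhead = e_t0_t2 + e_t3_t5                      # Equation (3.16)
print(e_t0_t2, e_t3_t5, overhead)
```

Each transient contributes two trapezoids here, one per sub-interval, mirroring the two terms on the right-hand sides of Equations (3.14) and (3.15).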

Table 3.2 Average energy components of functional units

Functional unit    Leakage,          Leakage,     Leakage,     Overhead,    Overhead,      Dynamic          Lth
                   no power gating   OFF          ON           activation   deactivation   component
                   (nJ/cycle)        (nJ/cycle)   (nJ/cycle)   (nJ)         (nJ)           (nJ/operation)
Int ALU*           1.26E-04          -            -            -            -              -                -
Barrel Shifter     3.01E-04          1.64E-04     3.12E-04     5.68E-02     2.59E-04       6.13E-02         415
Int Multiplier     2.61E-04          1.46E-05     2.65E-04     6.07E-02     2.34E-04       7.23E-02         530
FP Adder           9.18E-04          4.93E-04     9.20E-04     1.91E-01     8.82E-04       2.16E-01         453
FP Multiplier      8.13E-04          4.23E-04     8.17E-04     1.66E-01     6.73E-04       1.93E-01         428
FP Div-Sqrt Unit   1.60E-03          8.74E-04     1.64E-03     3.80E-01     1.34E-03       4.16E-01         523

* Since the integer ALU is not power gated, its entries are omitted from the cells pertaining to the power gated version.
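
As a rough cross-check, the threshold length of Equation (3.5) can be recomputed from two rows of Table 3.2. The sketch below rests on assumptions that the text does not spell out: that the IPC bound is 1, that the leakage energy saved per cycle is the difference between the "no power gating" and "OFF" leakage columns, and that the overhead is the sum of the activation and deactivation columns. Because the tabulated values are rounded, only approximate agreement with the Lth column should be expected.

```python
# (leakage without gating, leakage OFF, activation, deactivation), in nJ
# or nJ/cycle, copied from two rows of Table 3.2.
ROWS = {
    "Barrel Shifter": (3.01e-4, 1.64e-4, 5.68e-2, 2.59e-4),
    "FP Multiplier": (8.13e-4, 4.23e-4, 1.66e-1, 6.73e-4),
}


def threshold_length(no_gating, off, activation, deactivation):
    """Overhead energy divided by the energy saved per cycle (assumed model)."""
    saved_per_cycle = no_gating - off
    return (activation + deactivation) / saved_per_cycle


estimates = {unit: threshold_length(*row) for unit, row in ROWS.items()}
for unit, value in estimates.items():
    print(f"{unit}: about {value:.1f} instructions")
```

Under these assumptions the estimates land within a few instructions of the tabulated thresholds of 415 and 428.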

3.3.2 Cycle-Accurate Simulation

We used the SimpleScalar-ARM distribution [47] for modeling the proposed embedded architecture and the MiBench embedded benchmark suite [44] for experimentation. We chose two benchmarks from each of the categories of applications in MiBench, which are: Automotive and Industrial Control, Network, Security, Consumer Devices, Office Automation, and Telecommunications. The object code for the program was disassembled using the gcc tools for ARM, available with the distribution. Static code analysis was performed on the CFG generated for the functions in the source program. Although sleep instructions were not inserted into the standard library functions, they were analyzed for their functional unit requirements. The profiling tool in the SimpleScalar distribution, sim-profile, was used to perform dynamic profiling of the program, and sim-outorder was used to perform cycle-accurate simulation after the insertion of the sleep instructions.

The embedded processor configuration is based on the Intel StrongARM SA-1 processor (Table 3.3). All the experiments were performed on a set of benchmarks from the MiBench suite, whose details are tabulated in Table 3.4. For each benchmark, the leakage energy consumption during its execution on a processor with power gated functional units was compared with the leakage energy consumption on a processor without any power gated functional units. The savings in leakage energy for the integer benchmarks and the floating point benchmarks are plotted in Figure 3.8 and Figure 3.9, respectively.

The leakage energy savings achieved for the integer benchmarks range from 25.6% to 37.4%, resulting in an average of 31.1%. For the FP benchmarks, the leakage energy savings range from 22.6% to 30.4%, resulting in an average of 26.8%. Further, the fraction of total simulation cycles for which each of the units was power gated is shown in Figure 3.10 (integer units) and Figure 3.11 (floating point units). Since there was less than 0.1% performance degradation in terms of the additional cycles required to execute the sleep instructions, those numbers are not included in the table.

Table 3.3 ARM processor configuration [44]

Fetch Queue (instructions)                      2
Branch Predictor                                Not-taken
Fetch & Decode Width                            1
Issue Width                                     1
Instruction L1 Cache                            16K, 32-way
Data L1 Cache                                   16K, 32-way
L2 Cache                                        None
Memory (bus width, first block latency)         4-byte, 12 cycle

Table 3.4 Benchmark details

Benchmark name   Category     Dynamic instruction count   Benchmark type
bitcount         Automotive   49.6                        Integer
qsort            Automotive   43.6                        Integer
jpeg decode      Consumer     6.7                         Integer
lame             Consumer     97.2                        Floating Point
dijkstra         Network      64.9                        Integer
patricia         Network      103.9                       Integer
rsynth           Office       57.9                        Floating Point
ispell           Office       8.4                         Integer
fft              Telecomm.    52.7                        Floating Point
fft-inverse      Telecomm.    65.8                        Floating Point
rijndael         Security     30.7                        Integer
sha              Security     13.6                        Integer

[Figure 3.8: Total leakage energy savings in integer benchmarks (bitcount, qsort, jpegdec, dijkstra, patricia, ispell, rijndael, sha), plotted as a percentage. These numbers are reported considering only the integer units (the shifter and the integer multiplier).]

[Figure 3.9: Total leakage energy savings in floating point benchmarks (lame, rsynth, fft, fft-inv), plotted as a percentage. These numbers are reported considering both integer and floating point functional units.]

[Figure 3.10: Fraction of total simulation cycles for which the integer units (shifter and integer multiplier) were power gated, plotted as a percentage for each benchmark. Since both integer and floating point benchmarks use integer units, these numbers are reported for all the benchmarks.]

[Figure 3.11: Fraction of total simulation cycles for which the floating point units (FP adder, FP multiplier, FP divide and square root) were power gated, plotted as a percentage. These numbers are reported only for the floating point benchmarks (lame, rsynth, fft, fft-inv) because integer benchmarks do not use floating point units.]

3.4 Discussion

In this chapter, a framework for power gating the functional units in the datapaths of embedded processor architectures, in order to achieve leakage reduction at the compiler level, was described. The proposed scheme is based on the fact that most embedded processors operate with longer clock cycles than superscalar processors. The longer clock cycle helps in activating the needed functional units by using control signals from the decode stage. The use of more sophisticated techniques, such as dual sleep transistors and charge recycling devices, in the design of the functional units needs to be investigated for improving the power gating opportunities during program execution.

CHAPTER 4
IMPACT OF COMPILER OPTIMIZATIONS ON POWER GATING

In this chapter, we investigate the impact of compiler optimization techniques on power gating implemented with compiler and architectural support to reduce leakage in the arithmetic functional units of microprocessors. Power gating can be implemented using sleep transistors to selectively deactivate functional units in the datapath that would remain idle for sustained periods of time during program execution. During compile time, the idle times for various functional units can be identified and sleep instructions can be inserted within the code to deactivate those units during program execution. They can be subsequently awakened through the use of hardware control signals when their need is detected during the instruction decode stage. In this context, the effectiveness of power gating for leakage reduction depends on the ability of the compiler to identify the program regions in which all or some of the functional units are expected to remain idle for periods long enough to provide leakage savings, considering the overheads involved. However, the usage of resources in a program is dependent not only on its source code description but also on the code transformations frequently performed by the compiler for improving the performance of the executable binary. Therefore, the effectiveness of compiler-directed power gating could be largely dependent on the compiler optimizations performed during code generation. In this chapter, we develop a leakage aware compilation flow based on a comprehensive study of the combined effects of various compiler optimization techniques on power gating through analysis and simulations. While GCC is used as the compiler framework for our study, the embedded processor is modeled based on the ARM architecture, modified to include the hardware support needed for power gating. SimpleScalar-ARM is used for evaluating energy and performance results for the benchmarks chosen from the MiBench and MediaBench suites. Experimental results indicate that, through the use of different compiler optimization techniques, the leakage energy savings due to power gating in individual functional units could be increased by 15% in benchmarks from MediaBench, while the same for benchmarks from MiBench could be increased by up to 9 times.

4.1 Motivation

An important task in compiler-directed power gating is to identify program regions in which functional units are expected to be idle for long periods of time. However, the idleness or usage of functional units is directly influenced by the code transformations performed by the compiler during code generation. Such transformations are often performed by a compiler to improve the performance of the program executable. The plots in Figure 4.1 show the impact of certain compiler optimizations for performance improvement on power gating of the integer multiplier in an embedded microprocessor. Table 4.1 describes the legends used in the plots. The benchmarks Dijkstra and Sha are integer benchmarks, whereas the rest are floating point benchmarks. For each benchmark, the fraction of idle cycles for which the multiplier is kept deactivated during the entire program runtime is plotted in Figure 4.1(a). It can be seen that the strength reduction optimizations, which remove redundant integer multiplications in the program, improve power gating of the multiplier significantly. The savings in the total energy dissipated due to leakage in the functional units is shown in Figure 4.1(b). This suggests that compiler optimizations play a critical role in the effectiveness of compiler-directed power gating techniques. Therefore, it is of immense interest to investigate the effect of a larger spectrum of performance-improving code optimizations on the opportunities for power gating of functional units.

Table 4.1 Description of the legends in Figure 4.1

Legend   Description
unopt    No optimizations
sccp     Sparse Conditional Constant Propagation (SCCP) [49]
lcm      SCCP + Lazy Code Motion (LCM) [50]
wsr      SCCP + LCM + Weak Strength Reduction (WSR) [51]
osr      SCCP + LCM + Operator Strength Reduction (OSR) [52] + WSR

[Figure 4.1: Impact of a few compiler optimizations on power gating [48], for the benchmarks Dijkstra, Sha, Fft, FftI, SusanE, and Mpeg2E. (a) Fraction of idle cycles for which the integer multiplier is power gated (unopt, sccp, lcm, wsr, osr), as a percentage. (b) Leakage energy saved due to power gating after code optimizations over that in the unoptimized case (sccp, lcm, wsr, osr), as a percentage.]

In [53], Kandemir et al. studied in detail the effects of a few high level compiler optimizations on dynamic power and energy consumption in microprocessors. Since leakage power has become a critical aspect of low power VLSI design in recent years, it is important to study the factors that influence the leakage power consumption of a system directly or indirectly. In this work, we use the open source production quality compiler, GNU Compiler Collection (GCC) [54], as the compiler framework to study the influence of performance enhancing compiler optimizations on
compiler-directed power gating in embedded processors. The main contribution of our work is a leakage aware compilation flow that can be used to selectively apply compiler optimization techniques on an application with an objective to improve power gating opportunities of the functional units in an embedded microprocessor during runtime, thereby increasing leakage energy savings. The Simplescalar-ARM distribution [55] is used to perform extensive simulations on a set of floating point benchmarks from Mibench [44] and MediaBench [56] suites.

4.2 Impact of Compiler Optimizations on Power Gating

In this section, the analyses of various compiler optimizations are presented from the perspective of power gating [57]. The optimizations that apply locally to a procedure are categorized as intraprocedural optimizations and those that apply across procedures are categorized as interprocedural optimizations.

4.2.1 Intraprocedural Optimizations

The optimizations discussed here are performed on the control flow graph of a procedure and they deal with the removal of redundant arithmetic operations within the code. Such optimizations directly influence the opportunities for keeping one or more functional units deactivated in a program region and, therefore, can be extremely effective in generating code that is conducive to power gating.

4.2.1.1 Dominator Optimizations

These optimizations use the dominance information [57] of the control flow graph for the procedure to perform various optimization tasks. Common dominator optimizations include dead code elimination, copy and constant propagation, common subexpression elimination (CSE), etc.

Figures 4.2 and 4.3 illustrate the impact of CSE techniques on the power gating opportunities in a procedure. In Figure 4.2(a), the expression a/b (procedure f() writes to a and b) is fully redundant in the basic block 5 because it is computed in both of its predecessor blocks, 3 and 4. However, the expression a*b is partially available in basic block 5 because it is not computed in basic block 4. Global CSE (GCSE) moves the computation of a/b to basic block 2 (the loop header) (Figure 4.2(b)). When loop invariant code motion (described later in Section 4.2.1.2) is performed (Figure 4.2(c)), the expression is moved out of the loop, thereby making the divider idle in the loop. Another optimization, known as partial redundancy elimination (PRE), makes the expression a*b fully redundant in basic block 5 by introducing the statement u = a*b in basic block 4 (Figure 4.3(a)). GCSE can now move the expression a*b to basic block 2 (Figure 4.3(b)). When code motion and weak strength reduction (described in Section 4.2.1.3) are performed subsequently, this expression is moved out of the loop, thereby making even the multiplier idle in the loop (Figure 4.3(c)). It should be noted, however, that GCSE and PRE are not always performed by the compiler because the former may increase register pressure, while the latter may increase code size. If the IEEE or ISO floating point precision rules are relaxed during compilation, such arithmetic transformations can be applied to floating point types as well.

Figure 4.2 Example to illustrate the impact of global common subexpression elimination on the usage of functional units. (a) Original CFG. The loop has multiply and divide operations. (b) After global common subexpression elimination. The expression t = a/b is moved to basic block 2. (c) After loop invariant code motion. The expression t = a/b is moved out of the loop. Now the loop does not have any divide operations.

Figure 4.3 Example to illustrate the impact of partial redundancy elimination on the usage of functional units. (a) After partial redundancy elimination. The expression u = a*b is computed in basic blocks 3 and 4. (b) After global common subexpression elimination. The expression u = a*b is moved to basic block 2. (c) After loop invariant code motion and weak strength reduction. The expression u = a*b is moved out of the loop and 2*u is replaced with u + u. Now the loop does not have any multiply operations.

4.2.1.2 Loop Optimizations

The optimizations described in this category are performed on loops to remove redundant operations from the body of the loops. Figure 4.4 illustrates how loop invariant code motion and strength reduction on induction variables are effective in doing so, thereby improving power gating opportunities in the loops of a procedure. In Figure 4.4(a), the expression c[i]*k evaluates to the same value in every iteration of the inner for-loop. Therefore, this computation is moved out of that loop and a new temporary t is used to hold the result of that computation, which is subsequently used in the inner for-loop. Therefore, a sleep instruction, putting the multiplier to sleep, can be inserted before the entry of the inner for-loop. In Figure 4.4(b), the results of the computation i*k across the iterations of the loop form an arithmetic progression with a common difference of k. This is because k is constant across all the iterations of the loop, while i gets incremented by 1 in each iteration of the loop. The transformed code, therefore, does not have any multiply operation and the multiplier can be turned off at the entry of the loop.
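The two loop transformations described above can be tried out with a small C sketch. The function names, the int element type, and the reduction of c[i] to a scalar parameter c_i are illustrative assumptions and are not taken from the original figures; the comments only mark where a sleep instruction for the multiplier could be placed.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: one integer multiply (i * k) in every iteration. */
void offset_with_multiply(int *a, const int *b, size_t n, int k)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + (int)i * k;
}

/* Strength-reduced form: t follows the arithmetic progression
 * 0, k, 2k, ... so the loop body needs only additions.
 * Assumption: n * k fits in an int, so t never overflows. */
void offset_strength_reduced(int *a, const int *b, size_t n, int k)
{
    assert(a != NULL && b != NULL);
    int t = 0;
    /* a sleep instruction for the integer multiplier could go here */
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + t;
        t = t + k;
    }
}

/* Loop invariant code motion for one outer iteration:
 * c_i * k is computed once, outside the inner loop. */
void add_scaled_hoisted(int *a, const int *b, size_t n, int c_i, int k)
{
    assert(a != NULL && b != NULL);
    int t = c_i * k;   /* the only multiply */
    /* a sleep instruction for the integer multiplier could go here */
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + t;
}
```

Calling the first two functions on the same input should fill both output arrays identically, which is the property that lets the compiler drop the multiply from the loop body.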
4.2.1.3 Machine Dependent Optimizations

The primary purpose of these optimizations is to improve the pipeline efficiency by performing local code transformations with the knowledge of the target machine. Figure 4.5 illustrates two peephole optimizations that eliminate the integer multiply instruction from a code sequence, thereby increasing opportunities for power gating. Figure 4.5(a) describes the generation of complex addressing operands while performing a load operation. The sequence of instructions on the left column (to load an element from an array storing elements of size 4 bytes) is replaced by a single instruction on the right in which the base address and the offset can be directly specified. Figure 4.5(b) illustrates weak strength reduction [51, 58] in which multiplication with a constant operand (119) may be replaced by a sequence of addition, subtraction, and shift instructions. This transformation, however, should be carried out only if it is estimated during compilation that it would be beneficial (in terms of both performance and power) to do so considering the overhead of fetching the extra instructions and executing them.

Figure 4.4 Examples of loop optimizations that improve the opportunities for power gating. In (a), the multiply operation is moved from the inner for-loop to the outer one. The multiplier can now be turned off at the entry of the inner for-loop. In (b), the multiply operation is eliminated in the for-loop by replacing it with an equivalent series of add operations. If IEEE or ISO floating point precision rules are relaxed, strength reduction can also be applied to eliminate floating point multiply operations.

(a) Loop invariant code motion

    Original code:
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                a[j] = b[j] + c[i] * k;

    Transformed code:
        for (i = 0; i < n; i++) {
            t = c[i] * k;
            /* Turn off Multiplier */
            for (j = 0; j < n; j++)
                a[j] = b[j] + t;
        }

(b) Strength reduction

    Original code:
        for (i = 0; i < n; i++)
            a[i] = b[i] + i * k;

    Transformed code:
        t = 0;
        /* Turn off Multiplier */
        for (i = 0; i < n; i++) {
            a[i] = b[i] + t;
            t = t + k;
        }

Figure 4.5 Examples of machine dependent optimizations in ARM that can improve opportunities for power gating. In the examples above, the multiply instructions are eliminated in the optimized codes.

(a) Loads with complex addressing

    Original code          Optimized code
    mov r1, #4             ldr r3, [r0, r1, asl #2]
    mul r3, r2, r1
    add r1, r0, r3
    ldr r3, [r4]

(b) Weak strength reduction

    Original code          Optimized code
    mov r1, #119           mov r1, r2, asl #4
    mul r3, r2, r1         add r1, r1, r2
                           mov r3, r1, asl #3
                           sub r3, r3, r1

4.2.2 Interprocedural Optimizations

These optimizations rely on the analyses performed by the compiler by considering the entire compilation unit as a whole. All the procedures are analyzed and a call graph is created. In a call graph, each vertex represents a procedure and an edge (u, v) implies that there is a call from procedure u to procedure v. Interprocedural analyses are performed on the call graph to compute the set of variables and memory locations that are modified and referenced by each procedure. This makes it possible for more effective intraprocedural optimizations to be performed. We discuss two optimizations in this category, interprocedural constant propagation and procedure inlining, and their impact on power gating (Figure 4.6).

Figure 4.6 Example illustrating the impact of interprocedural optimizations on power gating opportunities of the integer multiplier.

(a) Original source in which the procedure, scale(), performs a multiply operation on its first formal argument, s, and the sum of the elements in the array pointed to by x.

    int scale(int s, int* x) {
        sum = 0;
        /* Turn off Int. Mult. */
        for (i = 0; i < n; i++)
            sum = sum + x[i];
        /* Turn on Int. Mult. */
        return (s * sum);
    }
    void foo(void) {
        ...
        for (i = 0; i < m; i++)
            b[i] = scale(5, a[i]);
        ...
    }

(b) After interprocedural constant propagation is performed on (a), the first formal argument is removed and the argument s is replaced with the constant value of 5 in the return statement. Weak strength reduction eliminates the multiply operation.

    int scale(int* x) {
        /* Turn off Int. Mult. */
        sum = 0;
        for (i = 0; i < n; i++)
            sum = sum + x[i];
        return ((sum << 2) + sum);
    }
    void foo(void) {
        ...
        for (i = 0; i < m; i++)
            b[i] = scale(a[i]);
        ...
    }

(c) After procedure inlining is performed on (a), the for loop within the procedure scale() becomes a nested loop in the for loop in the caller procedure foo(). Weak strength reduction eliminates the multiply operation. The instruction to turn off the integer multiplier can now be inserted before entering the outer for-loop.

    void foo(void) {
        ...
        /* Turn off Int. Mult. */
        for (i = 0; i < m; i++) {
            sum = 0;
            for (j = 0; j < n; j++)
                sum = sum + a[i][j];
            b[i] = (sum << 2) + sum;
        }
        ...
    }

The original code (Figure 4.6(a)) has two procedures, scale() and foo(). The first procedure computes the sum of the elements in the array pointed to by its second formal argument, x, and then scales the computed value by its first formal argument, s. The second procedure calls scale() for every row of an (m × n) array, a, with a constant scaling parameter, 5, and stores the values returned by scale() in array, b. In this case, the integer multiplier can be turned off at the entry of the for-loop and turned back on at the exit of the for-loop in scale(). However, since scale() is called m times, the performance and energy overheads involved in power gating increase as m increases.

Figure 4.6(b) shows the interprocedural constant propagation technique conceptually. The procedure definition for scale() is reduced to one with only a single argument. The argument s in the body of the procedure is replaced with the constant 5. Weak strength reduction can now replace (5 * sum) with ((sum << 2) + sum), thereby eliminating the multiply operation completely. Interprocedural constant propagation is beneficial in the cases where macros are used in the source code that assume constant values after preprocessing. When procedure inlining is performed (Figure 4.6(c)), the body of scale() is integrated into foo(), following which the multiply can be replaced with shift and add operations. Thus, only a single sleep instruction is required at the entry of the outer for-loop putting the integer multiplier to sleep. This significantly reduces the performance and energy overheads in implementing power gating.
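The shift-and-add rewrites used in Figures 4.5(b) and 4.6 can be verified with a short C sketch. The function names are hypothetical, and an unsigned 32-bit type is assumed here (instead of the int used in the figures) so that the shifts stay well defined for every input.

```c
#include <assert.h>
#include <stdint.h>

/* 5 * sum rewritten as (sum << 2) + sum, as in Figure 4.6(b) and (c). */
uint32_t times5(uint32_t sum)
{
    return (sum << 2) + sum;
}

/* 119 * x following the four-instruction sequence of Figure 4.5(b). */
uint32_t times119(uint32_t x)
{
    uint32_t r1 = x << 4;    /* mov r1, r2, asl #4 : r1 = 16x  */
    r1 = r1 + x;             /* add r1, r1, r2     : r1 = 17x  */
    uint32_t r3 = r1 << 3;   /* mov r3, r1, asl #3 : r3 = 136x */
    r3 = r3 - r1;            /* sub r3, r3, r1     : r3 = 119x */
    return r3;
}

/* Compares both rewrites against a real multiply for x in [0, limit). */
void check_rewrites(uint32_t limit)
{
    for (uint32_t x = 0; x < limit; x++) {
        assert(times5(x) == 5u * x);
        assert(times119(x) == 119u * x);
    }
}
```

The constant 119 costs four single-cycle instructions in place of a move and a multiply, which is why the text recommends applying the rewrite only when the compiler estimates a net benefit.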
4.3 Compiler-Directed Power Gating

Figure 4.7 describes the framework for compiler-directed power gating of functional units along with code optimizations. A compiler usually has multiple intermediate representations (IR) to represent a program during the entire compilation process. The source code of an application is translated into a high-level IR by the compiler front-end. High-level transformations, which include interprocedural optimizations, are performed at the high-level IR. Most of the machine-independent intraprocedural optimizations, like dominator and loop optimizations, are performed on an intermediate-level IR. Finally, the machine-dependent optimizations are performed on a low-level IR. All the optimizations are organized as compiler passes which can be selectively turned ON or OFF with the help of optimization flags specified during compilation. The insertion of the sleep instructions is performed after all of the code optimizations have been performed and is, therefore, shown as part of machine-dependent optimizations. After code generation, the assembly code is assembled and linked to generate a machine executable that is simulated on a cycle accurate simulator for performance and leakage energy evaluation. In this work, we use the GNU Compiler Collection (GCC 4.2.1) [54], GNU Binary Utilities (Binutils 2.17) [59], and GNU C Library (Glibc 2.3.6) [60] to build the compiler toolchain for the ARM Linux platform. The Simplescalar ARM distribution [55] is used to perform cycle-accurate simulation for performance and power calculations.

Figure 4.7 Framework for compiler-directed power gating of functional units with code optimizations. Phases shown: source files, IR translation, interprocedural optimizations (constant propagation, function inlining), dominator optimizations (dead code removal, algebraic reassociation, redundancy elimination, etc.), loop optimizations (loop invariant code motion, induction variable optimizations), machine dependent optimizations (instruction scheduling, peephole optimizations) with insertion of power gating instructions, and code generation.

The internal compiler pipeline in GCC [61] is shown in Figure 4.8. GCC uses three intermediate representations: GENERIC, GIMPLE, and RTL. The compiler front end parses the C source code and converts it into GENERIC representation. This is lowered into GIMPLE representation, where all tree-based code transformations are performed. Lower level optimizations, including all or most of the machine-dependent optimizations (instruction scheduling, peephole optimizations, etc.), are performed at the RTL level. The control flow graph (CFG) information for a procedure is retained till late in the RTL passes. Since insertion of power gating instructions into the code requires its control flow information, the pass to do so is added right before the CFG is purged and the procedure is described only as an instruction list.

Figure 4.8 GCC Compiler Pipeline [62]. The pass to insert power gating instructions is added as an RTL optimization pass. Elements shown: C source, front end, GENERIC, GIMPLE, middle end (tree optimizations), RTL, back end (RTL optimizations), and assembly.


4.3.1 Architecture Support for Power Gating

The architecture support for power gating is based on the ARM processor architecture and is proposed in [37]. The instruction set architecture (ISA) provides an explicit sleep instruction, whose argument is an immediate integer that encodes the list of functional units that are to be deactivated. The ISA, however, does not provide any explicit wakeup instructions. The activation of the functional units is carried out automatically by the decode logic of the ARM pipeline for those functional units that are needed by the decoded instruction. The processor core has an integer ALU, which cannot be power gated; and a barrel shifter, an integer multiplier (IMUL), a floating point adder (FADD), a floating point multiplier (FMUL), and a floating point divide and square root unit (FDSQ), all of which can be power gated. The library of functional units with power gating support is designed for 1 cycle wakeup latency for a clock period of 10 ns (100 MHz clock) and is characterized for latency and power [37]. The barrel shifter, however, is not equipped with any power gating support in this work because of the frequent usage of shift instructions in the code. Shift instructions are generated during strength reduction of multiply operations with constant operands [51] and during generation of load and store instructions with complex addressing modes [54].

As shown in Figure 4.9, a sleep control register (SCR), comprising SR flip-flops, is added to the decode logic of the ARM core. The outputs of these flip-flops drive the gates of the sleep transistors of the functional units. Since NMOS transistors are used for the sleep transistors, a '0' at their gate switches them OFF while a '1' switches them ON. In Figure 4.9, the contents of the SCR indicate that while IMUL and FDSQ are in active mode, FADD and FMUL are in sleep mode. When a sleep instruction is decoded, signal slp is asserted and bits 11-8 of the instruction are used to reset the flip-flops. When an instruction requiring a particular functional unit is decoded, a wakeup signal (wkp_*) is asserted to set the flip-flop. The wakeup signals are datapath flags that are generated by the decode logic to latch the operands in the input latches of the functional units.

The rst input to each flip-flop is gated with the output of that flip-flop to prevent unnecessary switching at the output of the corresponding AND gate. Therefore, the switching at the outputs of the AND gates and the SR flip-flops carries out the switching ON or OFF of the functional units. If the output of an AND gate switches, it will switch the content of the corresponding flip-flop from '1' to '0'. The functional unit whose sleep transistors the AND gate is controlling is, thereby, put to sleep. Similarly, if a wakeup (wkp_*) signal is asserted and the output of the corresponding flip-flop switches from '0' to '1', the functional unit whose sleep transistors it is controlling is woken up from sleep. Thus, the dynamic power overhead involved in the implementation of the SCR logic is linearly proportional to the number of times the functional units are switched ON and OFF. However, this overhead power is negligible when compared to the power overhead involved in activation and deactivation of the functional units.

Figure 4.9 Gate level schematic of the Sleep Control Register (SCR). SCR is the extension to the ARM decode logic that enables the support to power gate functional units. Each flip-flop has a set input driven by a wakeup signal (wkp_imul, wkp_fadd, wkp_fmul, wkp_fdsq) and a rst input derived from slp and instruction bits instr[8], instr[9], instr[10], and instr[11], respectively; the outputs q provide the sleep control for IntMult, FPAdd, FPMult, and FPDSQT.
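The behaviour of the SCR described above can be approximated by a small software model. This is a behavioural sketch only, not the gate level design of [37]: the type and function names are hypothetical, all units are assumed to be active after reset, and the toggle counter is an illustrative stand-in for the switching activity that determines the dynamic power overhead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per gateable unit, in the order of instruction bits 8..11. */
enum { FU_IMUL = 1u << 0, FU_FADD = 1u << 1, FU_FMUL = 1u << 2, FU_FDSQ = 1u << 3 };
#define FU_ALL 0xFu

typedef struct {
    unsigned awake;    /* bit = 1: sleep transistor ON, unit active */
    unsigned toggles;  /* number of flip-flop transitions so far */
} scr_t;

static unsigned count_bits(unsigned v)
{
    unsigned n = 0;
    for (; v != 0; v &= v - 1)
        n++;
    return n;
}

/* Assumption: every unit starts in active mode. */
void scr_reset(scr_t *s)
{
    assert(s != NULL);
    s->awake = FU_ALL;
    s->toggles = 0;
}

/* A sleep instruction is decoded: slp is asserted and instruction
 * bits 11-8 select the flip-flops to reset. */
void scr_on_sleep(scr_t *s, uint32_t instr)
{
    unsigned arg = (unsigned)((instr >> 8) & FU_ALL);
    unsigned rst = arg & s->awake;   /* rst is gated with the flip-flop output */
    s->toggles += count_bits(rst);
    s->awake &= ~rst;
}

/* An instruction needing the units in `needed` is decoded (wkp_* asserted). */
void scr_on_wakeup(scr_t *s, unsigned needed)
{
    unsigned set = needed & FU_ALL & ~s->awake;
    s->toggles += count_bits(set);
    s->awake |= set;
}
```

In this model, repeating a sleep request for units that are already asleep leaves the toggle count unchanged, which mirrors the purpose of gating the rst input with the flip-flop output.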


Figure 4.10 Assembly and machine code formats of the sleep instruction.

    Assembly format:      slp /* 4 bit argument */
    Machine code format:  bits 31-20 = 07F, bits 19-12 = XX, bits 11-8 = Arg, bits 7-0 = F0
    Arg bit 0: IntMult    Arg bit 1: FPAdd    Arg bit 2: FPMult    Arg bit 3: FPDSQT

The assembler support for translating the sleep instructions into machine code is added to the GNU ARM assembler, which is part of the Binutils package. The format of the machine code for the sleep instruction is chosen from the domain of exceptional opcodes described in the ARM reference manual and is shown in Figure 4.10. The assembly opcode for the sleep instruction is slp. It takes a four-bit immediate integer as an argument that encodes the list of the functional units that are to be deactivated. As shown in the figure, bits 0-3 of the immediate integer argument are for deactivating IMUL, FADD, FMUL, and FDSQ, respectively. The machine code for the slp instruction has bits 7-0 as 'F0' and bits 31-20 as '07F'. Instruction bits 11-8 are reserved for the four-bit immediate integer argument that is passed to the sleep instruction. The definitions of the semantics of the slp instruction and the extensions to the semantics of the rest of the instructions (wakeup logic) are added to the machine description file. The machine definition for the Simplescalar ARM port maps each instruction to a functional unit in the processor core. This information is used to implement the wakeup logic. The ARM processor configuration used is that of the StrongArm processor core and is tabulated in Table 3.3.

4.3.2 Insertion of Sleep Instructions

The sleep instructions are inserted at discrete regions in a procedure. These regions are the entry block of the procedure itself and the basic blocks that are predecessors to the entry blocks of the loops in that procedure. For this reason, the pass to insert power gating instructions first identifies the loops in the procedure, and then probes them for the usage of functional units. GCC provides a comprehensive library for representing loops and analyzing them. A loop tree is defined as a tree in which each node represents a loop in the procedure and the children of a node, say L, represent the loops immediately contained inside L. The loop tree captures the nesting structure of the loops within the procedure. The important attributes of a loop involved in performing the task of inserting sleep instructions are loop header, loop latch, and loop body. Figure 4.11 illustrates these components. Loop header is the basic block that forms the entry block for the loop. Loop latch is a back edge that connects a basic block from inside the loop to the loop header. Finally, loop body is the set of basic blocks that are dominated by its header, and are reachable from its latch against the direction of edges in the CFG. The shaded region in Figure 4.11 indicates the loop body. Natural loops, defined as the loops that have one loop header and possibly multiple loop latches, are the most commonly occurring loop structures in the C programming language and are, therefore, of interest to us.

Figure 4.11 Components of a loop (loop header, loop body, loop latch, and latch tail).

Algorithm 4 toplevel-driver()
1: Construct the call graph
2: Perform interprocedural optimizations
3: for each procedure x ∈ callgraph in postorder sequence do
4:     expand(x)
5: end for

Algorithm 5 expand(f)
1: Perform intraprocedural optimizations on f
2: pgi-insert(f)
3: Output the assembly code for f

Algorithm 6 pgi-insert(f)
1: /* Gather functional usage information */
2: FU(f) ← ∅
3: for each basic block b ∈ CFG do
4:     FU(b) ← ∅
5:     for each instruction i ∈ b do
6:         if i is a CALL instruction then
7:             FU(b) ← FU(b) ∪ usage(callee)
8:         else
9:             FU(b) ← FU(b) ∪ FU(i)
10:        end if
11:    end for
12:    FU(f) ← FU(f) ∪ FU(b)
13: end for
14: Compute the loop hierarchy tree, L, in f
15: for each loop l ∈ L do
16:     FU(l) ← ∪ FU(b), ∀ b ∈ l
17: end for
18: /* Insert power gating instructions */
19: for each region r ∈ L in preorder sequence do
20:     if FU(r) ⊂ FU(parent(r)) then
21:         /* r needs fewer functional units than parent(r) */
22:         Insert sleep(FU(parent(r)) − FU(r)) at the entry of r
23:     end if
24: end for

Algorithm 7 usage(f)
1: /* look up the call graph for f */
2: if f ∈ callgraph then
3:     return FU(f) − ∪ FU(l), ∀ loop l ∈ f
4: end if
5: /* f is a standard library procedure */
6: return functional usage based on policy P1, P2, or P3

Algorithms 4-7 describe the technique of inserting power gating instructions during the compilation process. The top level procedure that drives the entire compilation process is described in Algorithm 4. First, the call graph is constructed after the source file is parsed. Then, interprocedural analysis and optimizations are performed across the procedures. Finally, the procedures are expanded in the postorder sequence of the vertices. This is done to enable transfer of data from a callee procedure to its caller, which may lead to better optimization opportunities during code generation of the caller procedure. From the perspective of the task of inserting power gating instructions, this ensures that a callee procedure is inspected for power gating opportunities before any of its callers. Algorithm 5 describes the expansion steps for a procedure, which involve performing the desired intraprocedural optimizations, inserting power gating instructions, and generating its assembly code.

Algorithm 6 describes the technique of inserting sleep instructions within a procedure. The instructions in a basic block are inspected (lines 3-13) to compute the functional unit usage of that basic block. If an instruction is not a CALL instruction, then its functional unit usage is computed by parsing the instruction expression. Otherwise, the value returned by the procedure usage() (Algorithm 7) is used. The loop tree is computed (line 14) and the functional unit usage of all the loop bodies is determined (lines 15-17). Next, the loop tree is traversed (lines 18-24) in preorder sequence for regions whose entries can be gated with sleep instructions. The root of the loop tree represents the entire procedure.

Algorithm 7 describes the routine that returns a set of functional units corresponding to the callee procedure symbol that is passed to it as an argument. If the callee procedure is part of the compilation unit then the call graph has an entry for it. In this case, the functional unit usage of the region of the code that is not part of any loops in the callee is returned. This is because any functional unit that is needed in this region will be needed as many times as the procedure is called.
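The insertion step of Algorithm 6 (lines 19-24) can be sketched in C with functional unit sets stored as bit masks. This is a minimal sketch under stated assumptions and not the GCC pass itself: the region array, the type names, and the helper functions are hypothetical, the loop tree is assumed to be given in preorder with parent indices, the procedure entry is compared against the set of all gateable units, and the don't-care bits 19-12 of the slp encoding are assumed to be zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned fu_set;   /* bit i set: unit i is used */
enum { FU_IMUL = 1, FU_FADD = 2, FU_FMUL = 4, FU_FDSQ = 8, FU_ALL = 15 };

typedef struct {
    int    parent;   /* index of the enclosing region, -1 for the procedure */
    fu_set fu;       /* units used by the region, nested loops included */
    fu_set sleep;    /* output: units to put to sleep at entry, 0 = no slp */
} region;

/* Lines 3-13 for one basic block: union of the per-instruction usage. */
fu_set block_usage(const fu_set *instr_fu, size_t n)
{
    fu_set fu = 0;
    for (size_t i = 0; i < n; i++)
        fu |= instr_fu[i];
    return fu;
}

/* Lines 19-24. Regions are stored in preorder, so a parent always has a
 * smaller index than its children. Returns the number of sleep
 * instructions that would be inserted. */
size_t pgi_insert(region *r, size_t n)
{
    size_t inserted = 0;
    for (size_t i = 0; i < n; i++) {
        assert(r[i].parent < (int)i);
        /* Assumption: the procedure entry is compared against all units. */
        fu_set outer = (r[i].parent < 0) ? (fu_set)FU_ALL : r[r[i].parent].fu;
        assert((r[i].fu & ~outer) == 0);   /* FU(r) lies within FU(parent(r)) */
        r[i].sleep = outer & ~r[i].fu;     /* FU(parent(r)) - FU(r) */
        if (r[i].sleep != 0)
            inserted++;
    }
    return inserted;
}

/* slp word: bits 31-20 = 0x07F, bits 11-8 = argument, bits 7-0 = 0xF0.
 * Assumption: the don't-care bits 19-12 are written as 0. */
uint32_t encode_slp(fu_set units)
{
    return (UINT32_C(0x07F) << 20)
         | ((uint32_t)(units & FU_ALL) << 8)
         | UINT32_C(0xF0);
}
```

For a procedure that uses IMUL and FADD and contains a loop that uses only FADD, the sketch puts FMUL and FDSQ to sleep at the procedure entry and IMUL at the loop entry; a loop that needs the same units as its parent gets no sleep instruction.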
If the callee has loops, then those loops already have sleep instructions inserted at their entries which turn off functional units that are not required within their bodies. Finally, if it is determined that the callee procedure does not belong to the compilation unit, then it must be a C standard library call and the return value is determined by the policy that is used to handle the C standard library routines. This is described in the next subsection. Although the technique described here does not use dynamic profiling information of the applications to direct the insertion of sleep instructions in the code, the techniques that do so ([33, 37]) can be readily incorporated into the compilation framework described in this work.

4.3.3 Policies for Handling C Standard Library Routines

The most frequently used standard library routines in the benchmarks from MediaBench and MiBench are those defined in stdio.h, stdlib.h, string.h, and math.h. Based on the Glibc implementation of the C standard library, while the routines defined in math.h use FP units commonly, those defined in the first three headers do not use any FP units (except the string to FP type conversion routines such as atof(), etc.). Moreover, a large number of the low level file I/O routines, memory allocation/deallocation routines, and string manipulation routines do not even use the integer multiplier. Considering this knowledge about the standard library routines, in this work, we apply three policies to handle them. These policies are as follows:

1. P1 - It is assumed that none of the standard library routines use any of the functional units.
2. P2 - It is assumed that the routines belonging to the math library use all the functional units, whereas the other routines do not use any functional units.
3. P3 - The routines defined in the math library are profiled for their functional unit usage and a lookup table is created with this information. This lookup table is referred to during the power gating stage. For the rest of the routines, it is assumed that they do not use any functional units.

4.3.4 Proposed Leakage Aware Compilation Flow

Based on the work and discussions in the preceding subsections, we propose a leakage aware compilation flow that is shown in Figure 4.12. The application source, which is comprised of one or more compilation units (set of procedures that the compiler generates code for in a single run), is the input to this flow. A call graph is constructed for each compilation unit, which is inspected for procedure inlining. The decision to inline a procedure primarily depends on its impact on the increase in code size. GCC has built-in heuristics to decide if a procedure should be inlined or not. The relevant parameters that control these heuristics and their default values are tabulated in Table 4.2. A procedure that is declared static and is called only once in the compilation unit is always considered for inlining because it is not likely to impact the code size. However, for the rest of the procedures, only those procedures that have at least one functional unit that can be power gated in their bodies are considered for inlining. As mentioned earlier in Section 4.2.2, this has a significant potential to reduce the number of overhead sleep instructions during program execution. Although not explicitly shown in the figure, interprocedural constant propagation is also performed during this stage.

Figure 4.12 Proposed compilation flow to generate leakage optimized code with compiler-directed power gating. Each vertex in a call graph is a procedure, which is described as a control flow graph. Each vertex in a control flow graph is a basic block. The regions enclosed in grey rectangles correspond to the optimization phases shown in the framework in Figure 4.7. The labels for the intraprocedural optimization passes, PA-F, are used to annotate the optimization configurations studied in the experimental section (Table 4.4).

Table 4.2 Procedure inlining parameters [54]

    Parameter               Description                                        Value
    max-inline-insns-auto   Size of procedures considered for inlining in      90
                            number of instructions (counted in GCC's
                            internal representation)
    large-function-insns    Limit specifying large procedures in number of     2700
                            instructions (counted in GCC's internal
                            representation)
    large-function-growth   Relative growth of large procedures after          200%
                            inlining with respect to original size
    inline-call-cost        Relative cost of a call instruction compared to    16
                            a simple arithmetic instruction (e.g. add)

Once all the procedures are inspected for procedure inlining, they are picked in a postorder sequence of the call graph (explained earlier in Section 4.3.2) for intraprocedural optimizations and assembly code generation. The decision to perform floating point optimizations is taken based on the knowledge of the applications. If an application has floating point operations, then the IEEE or ISO floating point precision rules are relaxed so that arithmetic optimizations can be performed on floating point data types as well. Simple dominator-based arithmetic optimizations like dead code elimination, constant and copy propagation, local subexpression elimination, and full redundancy elimination are performed. Following this, if there are loops in which one or more functional units are being used, then code motion is performed on the loop bodies. As mentioned earlier in Section 4.2.1.1, global common subexpression elimination (GCSE) and partial redundancy elimination (PRE) are techniques that may not always result in performance optimized code because the former increases register pressure, while the latter has the potential to increase code size. However, these optimizations may aid code motion in removing requirements of functional units out of loops (illustrated in Section 4.2.1) and, therefore, are evaluated if code motion alone is not entirely effective in removing computations out of the loops. Induction variable optimizations are performed further if there are any multiply operations in the loops (for both integer and floating point types). Finally, the backend optimizations, peephole transformations, and instruction scheduling are performed before inserting sleep instructions and finally generating assembly code.

4.4 Experimental Setup and Results

For the purpose of experimentation, we report results for the following benchmarks from Mibench and Mediabench suites:

• Susan (Smallest Univalue Segment Assimilating Nucleus) is a set of image processing benchmarks from the automotive category in MiBench. It comprises three programs: edge detection SusanE, corner detection SusanC, and image smoothing SusanS.
• Epwic (Embedded Predictive Wavelet Image Color) is part of MediaBench which implements a wavelet image encoder (Epwic) and a decoder (Unepwic).
• Mpeg2 (Moving Pictures Experts Group) benchmarks comprise Mpeg2E and Mpeg2D, which are video encoding and decoding programs, respectively. These benchmarks are also part of the Mediabench suite.

All the benchmarks described above, except SusanS, are floating point benchmarks, with Mpeg2D having the least composition of floating point operations (less than 0.5% in C0 configuration). Table 4.3 tabulates, for each benchmark built in C0, the number of procedures and the number of simulation cycles reported by Simplescalar-ARM for the StrongARM configuration. All the experiments were performed on a Linux server with four 2.6 GHz AMD processors and 16 GB RAM running Red Hat Enterprise Linux AS release 4. The code modifications to GCC, including the pass to insert sleep instructions, resulted in less than 1300 additional lines in the source code. The simulation time of the benchmarks with Simplescalar ranged from a few seconds, for SusanC benchmark, to several minutes for Epwic benchmark.

Table 4.3 Benchmark details (in C0 configuration)

    Benchmark   Number of procedures   Number of cycles (Millions)
    SusanC      19                     4.05
    SusanE      19                     8.13
    Epwic       130                    1745.60
    Unepwic     130                    179.69
    Mpeg2E      95                     549.02
    Mpeg2D      114                    28.89

    The configurations are described in Table 4.4.

4.4.1 Optimization Configurations

Table 4.4 Optimization configurations

    Optimization   Description                                        Pass labels
    label                                                             wrt Figure 4.12
    C0             Base configuration (only machine specific          PF
                   instruction scheduling is performed)
    C1             C0 + Machine dependent peephole optimizations      PE-F
    C2             C1 + All dominator optimizations (dead code        PA, PE-F
                   elimination, copy and constant propagation,
                   full and partial redundancy elimination, local
                   and global common subexpression elimination)
    C3             C1 + Simple dominator optimizations (except        PA-B, PE-F
                   GCSE and PRE) + Loop invariant code motion
    C4             C3 + All dominator optimizations + Loop            PA-C, PE-F
                   invariant code motion
    C5             C4 + Induction variable optimizations              PA-F

    The optimizations in each of the configurations above may be performed with interprocedural and floating point optimizations.


Table 4.4 describes the various optimization configurations defined for the purpose of experimentation in this work. The third column of this table lists the labels of the intraprocedural optimization passes from Figure 4.12 for each of these configurations. C0 is the base configuration in which only machine specific instruction scheduling is performed. In C1, machine dependent peephole optimizations are also enabled. In C2, all dominator optimizations are performed along with those in C1. In C3, simple dominator optimizations (excluding GCSE and PRE) and loop invariant code motion are performed along with those in C1. More aggressive dominator optimizations, including GCSE and PRE, are added to C3 to define C4. In C5, induction variable optimizations are performed in addition to the ones performed in C4. All the intraprocedural optimizations are performed after interprocedural optimizations are performed except for the Susan benchmarks. The floating point arithmetic optimization flag -ffast-math is turned on for all benchmarks except SusanE and Mpeg2D.

4.4.2 Results

The library of functional units developed in [37] is used for the leakage energy calculations for all the benchmarks. The leakage savings for the units in sleep mode is about 50%, which indicates that the theoretical upper bound on the maximum achievable savings is 50%. Table 4.5 enumerates the metrics that are computed and reported in this section to demonstrate the effectiveness of power gating. All the results described next are reported for policy P3 used to handle the C standard library routines. Also, the legend 'num-cyc' in all the plots indicates 'num-cyc-nopg', as described in Table 4.5. The FP optimizations are also turned on for all the benchmarks in configurations C1-C5.

Table 4.5 Description of metrics

    Metric label     Description
    num-cyc-nopg     Number of simulation cycles without any sleep instructions
    num-cyc-pg       Number of simulation cycles with sleep instructions
    cyc-ovh          Performance penalty incurred in executing the sleep
                     instructions (in percentage), computed as
                     (num-cyc-pg / num-cyc-nopg) − 1
    unit-bsy         Number of cycles for which unit is busy
    unit-engy-nopg   Leakage energy for unit without power gating
    unit-engy-pg     Leakage energy for unit with power gating
    unit-sav         Percentage savings in leakage energy in unit due to power
                     gating, computed as 1 − (unit-engy-pg / unit-engy-nopg)
    total-sav        Percentage savings in total leakage energy across all
                     units due to power gating, computed as
                     1 − (Σ_units unit-engy-pg / Σ_units unit-engy-nopg)

4.4.2.1 Susan Benchmarks

Figures 4.13(a) and 4.13(b) show the statistics for SusanE. For this benchmark, the integer multiplier is again the most frequently used unit (busy for 17.3% of the cycles in C0), whereas the FP units are busy for roughly 1% of the cycles. Again, in C2-C5, large leakage savings are observed for the integer multiplier when compared to those in C0. In these configurations, the savings increase by 9X (from 3.3% to almost 30%). This can be attributed to the fact that the number of busy cycles for this unit reduces by 4X (0.24 vs. 1.00) in C2 and by almost 5X (0.21 vs. 1.00) in C5. In C5, strength reduction on induction variables is performed, which is very effective in replacing multiply operations with equivalent add or subtract operations. The savings in the FP units are almost uniform across all the configurations. The total leakage savings across all the functional units increases by less than 10% because the leakage component of the integer multiplier is a small fraction of those of the FP units.
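The metrics of Table 4.5 reduce to a few ratios, sketched below in C. The function names are hypothetical, the results are returned as fractions rather than percentages, and cyc-ovh is taken here as the relative increase in cycle count caused by the sleep instructions, which is an assumption about the intended sign convention.

```c
#include <assert.h>
#include <stddef.h>

/* cyc-ovh: relative increase in cycles due to the sleep instructions. */
double cyc_ovh(double num_cyc_nopg, double num_cyc_pg)
{
    assert(num_cyc_nopg > 0.0);
    return num_cyc_pg / num_cyc_nopg - 1.0;
}

/* unit-sav: 1 - (unit-engy-pg / unit-engy-nopg). */
double unit_sav(double engy_nopg, double engy_pg)
{
    assert(engy_nopg > 0.0);
    return 1.0 - engy_pg / engy_nopg;
}

/* total-sav: 1 - (sum of unit-engy-pg / sum of unit-engy-nopg). */
double total_sav(const double *engy_nopg, const double *engy_pg, size_t n_units)
{
    double sum_nopg = 0.0, sum_pg = 0.0;
    for (size_t i = 0; i < n_units; i++) {
        sum_nopg += engy_nopg[i];
        sum_pg   += engy_pg[i];
    }
    assert(sum_nopg > 0.0);
    return 1.0 - sum_pg / sum_nopg;
}
```

With made-up energies of {1.0, 15.0} without gating and {0.5, 13.5} with gating, the first unit saves half of its leakage while the total savings come to only 0.125. This is the same effect noted above for SusanE, where a unit with a small leakage share contributes little to the total.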
Figure 4.13 Leakage savings and sleep overhead for SusanE. (a) Leakage savings in the integer multiplier and across all the functional units along with the performance overhead across the optimization configurations (imul-sav, tot-sav, cyc-ovh; normalized value wrt C0 for C1-C5). (b) Total number of execution cycles (without sleep instructions) and the number of busy cycles for the integer multiplier across the optimization configurations (num-cyc, imul-bsy; normalized value wrt C0 for C1-C5).

For SusanC, the integer multiplier is busy 14% of the cycles, whereas the FP units are used in less than 2% of the cycles in C0. In the rest of the configurations (all built with floating point optimizations), significant improvements in the leakage savings are observed for the integer multiplier and the FP divider. Figures 4.14(a) and 4.14(b) show the statistics for SusanC. In C2-C5, the number of busy cycles for the integer multiplier drops by 8X (0.12 vs. 1.00) to 12X (0.08 vs. 1.00) giving an increase in savings by 7% to 2.1X. For the FP divider, the leakage savings improve by more than 2.2X in configurations C3-C5. This is because the number of busy cycles for this unit drops by almost 6X (0.17 vs. 1.00). The savings in the rest of the units are uniform over all the optimization configurations. The total leakage savings (across all the functional units) is 1.7X that in C0. The overhead in terms of the number of extra cycles executed increases by 1.3X in C5 but is below 0.6% of the total number of execution cycles. The number of execution cycles for this benchmark reduces by 20% in C2-C5 wrt that in C0. Figure 4.15 illustrates the impact of weak strength reduction on the leakage savings achieved for the integer multiplier across all the optimization configurations. For C5, the savings increase by almost 2X when weak strength reduction is performed.

Figure 4.14 Leakage savings and sleep overhead for SusanC. (a) Leakage savings in the integer multiplier, FP divider, and across all the functional units along with the performance overhead across the optimization configurations (imul-sav, fdsq-sav, tot-sav, cyc-ovh; normalized value wrt C0 for C1-C5). (b) Total number of execution cycles (without sleep instructions), the number of busy cycles for the integer multiplier and the FP divider across all the optimization configurations (num-cyc, imul-bsy, fdsq-bsy; normalized value wrt C0 for C1-C5).

Figure 4.15 Impact of weak strength reduction on leakage savings of the integer multiplier for SusanC (leakage energy savings in % for C0-C5, series NoWSR and WSR). NoWSR indicates that no weak strength reduction is performed.

SusanS is an integer benchmark that spends almost the entire time in a loop performing integer multiplications on memory elements. Therefore, during the entire time, the integer multiplier is awake, whereas the floating point units are asleep. Interprocedural optimizations are not effective on any of the benchmarks from the Susan set owing to the non-modular description of the source code. Therefore, interprocedural optimizations are not performed on them.

4.4.2.2 Epwic Benchmarks

Figure 4.16 shows the plots for Epwic benchmark. In this benchmark, the FP divider and FP multiplier trade off leakage savings with each other as the various optimizations are performed. The number of busy cycles for the FP divider drops by 10%-16% in C2-C5, while that for the FP multiplier increases by 5%-6%. This gets reflected in the leakage savings for these two units. While the savings in the FP divider increases by 16% (28% vs. 32.5%) in C5, the savings for the FP multiplier decreases by 5% (30% vs. 28.5%). Since the leakage in the FP divider is almost twice as much as that in the FP multiplier, increased savings in the FP divider contribute towards higher total savings across all the units (which increases by 5%-9%). The performance improves steadily as we go from C1 to C5IPO (C5IPO is C5 configuration with interprocedural optimizations). The cycle overhead in C5IPO reduces by more than 20% compared to that in C0.

Figure 4.16 Leakage savings and sleep overhead for Epwic. (a) Leakage savings in the FP multiplier, FP divider, and across all the functional units along with the performance overhead for various optimization configurations (fmul-sav, fdsq-sav, tot-sav, cyc-ovh; normalized value wrt C0 for C1-C5 and C5IPO). C5IPO indicates that interprocedural optimizations are performed in C5. (b) Total number of execution cycles (without sleep instructions) and the number of busy cycles for the FP multiplier and FP divider for the optimization configurations (num-cyc, fmul-bsy, fdsq-bsy; normalized value wrt C0 for C1-C5 and C5IPO).

Since Unepwic shares the majority of the source files with Epwic, a similar tradeoff is observed between the FP divider and the FP multiplier as in Epwic. However, the leakage savings in the FP adder and the FP multiplier are

PAGE 92

muchlower(8%and12%)thanthosein Epwic .Therefore,thetotalsavingsin Unepwic islower whencomparedtothosein Epwic 4.4.2.3Mpeg2Benchmarks 0.8 0.85 0.9 0.95 1 1.05 1.1 1.15 1.2 fadd-sav fmul-sav tot-sav cyc-ovh Normalized value wrt C0C1 C2 C3 C4 C5 C5IPO (a)LeakagesavingsintheFPadder,FPmultiplier,andacrossallthefunctionalunits alongwiththeperformanceoverheadforvariousoptimizationcongurations.C5IPO indicatesthatinterproceduraloptimizationsareperformedinC5 0.7 0.75 0.8 0.85 0.9 0.95 1 1.05 1.1 num-cyc fadd-bsy fmul-bsy Normalized value wrt C0C1 C2 C3 C4 C5 C5IPO (b)Totalnumberofexecutioncycles(withoutsleepinstructions)andthenumberof busycyclesfortheFPadderandFPmultiplierfortheoptimizationcongurationsFigure4.17Leakagesavingsandsleepoverheadfor Mpeg2Encode 78

In Mpeg2E, the FP multiplier is the most frequently used unit (busy for 4.4% of the total cycles in C0), whereas the FP divider is the least frequently used unit (busy for less than 0.01% in C0). The integer multiplier and the FP adder are busy for 3.3% and 2.6% of the cycles in C0. When interprocedural optimizations are performed in C5 (C5IPO), the leakage savings in the FP adder increase by almost 14% (from 36% to 41%). This is a significant improvement in savings considering that the maximum achievable savings is less than 50%. The number of cycles for which the FP adder is busy drops by 14% in C5 and C5IPO. Also, in C5IPO, the total simulation cycles for this benchmark reduce significantly (by more than 20%) and the number of overhead cycles due to power gating also drops by more than 15%. For the FP multiplier, a 16% increase in leakage savings is observed in C2 (from 38% to 44%). Across all the optimization configurations, the leakage savings in the integer multiplier and the FP divider are almost uniform (31% and almost 50%, respectively). This is shown in the plots in Figure 4.17.

In Mpeg2D, the functional units other than the integer ALU are used very infrequently. Due to this, the leakage savings in all the units are almost uniform (almost 50%) for all the optimization configurations.
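
The relative improvements quoted for Mpeg2E follow directly from the absolute savings percentages. As a quick check (a minimal sketch; the helper name relative_change is an illustrative assumption and not part of the original framework), the two figures can be reproduced as:

```python
def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old


# Absolute leakage savings (in percent) quoted in the text for Mpeg2E.
fp_adder_gain = relative_change(36.0, 41.0)       # FP adder with C5IPO
fp_multiplier_gain = relative_change(38.0, 44.0)  # FP multiplier in C2

print(f"FP adder:      {fp_adder_gain:.1%}")
print(f"FP multiplier: {fp_multiplier_gain:.1%}")
```

The two values come out at about 13.9% and 15.8%, which matches the "almost 14%" and "16%" stated in the text.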

4.4.3 Impact of Policies for Handling Standard Library Routines

The impact of the three policies that are used to handle the C standard library routines on total leakage savings and overhead cycles in C5 (with interprocedural and FP optimizations) is shown in Figure 4.18. These values are normalized with respect to those in P3. For the Susan benchmarks, there is no difference in either the leakage savings or the overhead cycles across the different policies because the math routines are not used in the most frequently executed regions. However, for the Epwic and Mpeg2 benchmarks, both the leakage savings and the overhead cycles reduce in policy P2 with respect to those in P3 because none of the functional units are put to sleep in the regions where the math library routines are called. The leakage savings for P1 are higher than those in P2 but lower than those in P3 because the functional units that are put to sleep right before a math routine are woken up while that routine is executing, before they have stayed asleep for the break-even number of cycles. This results in lower leakage savings in those units. The most often used math routines in these benchmarks are exp(), log(), and sqrt(), none of which use the FP divider unit, which has the largest component of leakage (more than 45%) among all the units. Therefore, the leakage savings in P1 are more than in P2.

[Figure 4.18: Impact of various policies for handling the standard library routines during insertion of sleep instructions. (a) Total leakage savings for all the benchmarks in C5. (b) The overhead cycles for all the benchmarks in C5. Both panels show policies P1, P2, and P3 for SusanC, SusanE, Mpeg2E, Mpeg2D, Epwic, and Unepwic, normalized with respect to P3.]

Integer benchmarks like dijkstra, patricia, quicksort, and sha do not have any integer multiply operations. Therefore, when peephole optimizations (particularly, weak strength reduction) are performed, all overhead multiply operations (introduced by the compiler to compute array addresses) are eliminated. The benchmarks Rsynth, Fft, and Ffti spend almost the entire time performing arithmetic operations on memory elements and, therefore, none of the compiler optimizations are able to eliminate such operations.

4.5 Conclusions

In this work, we studied the impact of various compiler optimizations on the opportunities for power gating various functional units in an embedded processor core. To perform this study, we used the ARM Linux toolchain comprising the GNU Compiler Collection (GCC), which is an open-source production-quality compiler, along with the SimpleScalar-ARM processor simulator. We built a set of benchmarks for embedded systems chosen from the MiBench and MediaBench suites in various predefined optimization configurations and performed extensive simulations with the SimpleScalar-ARM tool to evaluate the effect of compiler optimizations on power gating for functional units. The results of the experiments indicate that the optimizations that remove computation redundancy from the program, or that assist such optimizations, are instrumental in improving power gating opportunities in a program. In quantitative terms, depending on the benchmark, the leakage energy savings due to power gating in individual functional units can be increased by 15% to 9 times through the use of different compiler optimization techniques. The penalty of executing the sleep instructions is at most 0.1% for all benchmarks, except for SusanC, for which the penalty is at most 0.6%. Therefore, the energy overhead involved in executing the sleep instructions is minimal compared to the leakage savings achieved in the functional units. With shrinking technologies and the ever increasing demand for portable devices, reducing the total energy consumption of processors will continue to be one of the primary goals for processor designers and compiler writers. This work thoroughly investigates reducing the energy dissipated by processors due to leakage power at the compiler level with the help of architectural support. Since the leakage component in CMOS circuits is increasing with every technology generation, it is imperative that compilers generate code that not only improves performance but also reduces energy consumption in microprocessors. The work presented in this paper serves as an important step towards that requirement.

CHAPTER 5
STATE-RETENTIVE POWER GATING OF REGISTER FILES IN MULTI-CORES

In this chapter, we investigate state-retentive power gating of register files for leakage reduction in multi-core processors supporting multithreading. In an in-order core, when a thread gets blocked due to a memory stall, the corresponding register file can be placed in a low-leakage state through power gating. When the memory stall gets resolved, the register file is activated so that it can be accessed again. Since the contents of the register file are not lost and are available again on wake-up, this is referred to as state-retentive power gating of register files. While state-retentive power gating in single cores has been studied in the literature, it is investigated for multi-core architectures for the first time in this work. We propose specific techniques to implement state-retentive power gating for three different multi-core processor configurations based on the multithreading model: (i) coarse-grained multithreading, (ii) fine-grained multithreading, and (iii) simultaneous multithreading. The proposed techniques can be implemented as design extensions within the control units of the in-order cores. Each technique uses two different low-leakage states: one with low leakage savings and low wake-up latency, and one with high leakage savings and high wake-up latency. The overhead due to wake-up latency is completely avoided in two techniques, while it is hidden for the most part in the third approach, either by overlapping the wake-up process with the thread context switching latency or by executing instructions from other threads that are ready for execution. The proposed techniques were evaluated through simulations with multiprogrammed workloads comprised of SPEC 2000 integer benchmarks. Experimental results show that, in an 8-core processor executing 64 threads, the average leakage savings were 42% in coarse-grained multithreading, while they were between 7% and 8% for fine-grained and simultaneous multithreading.

5.1 Motivation

The motivation for our work arises from the simple observation that, in an in-order CPU, when an instruction from a particular thread encounters a pipeline stall, no further instructions from that thread (i.e., the following instructions in program order) may be executed till that stall gets resolved. Thus, the thread gets blocked and the hardware units that are private to that thread could be placed in a low-leakage state. When the pipeline stall is resolved, the thread becomes ready to run and those hardware units need to be brought back to the active state from the low-leakage state so that they are functional again.

As discussed earlier, the datapath components that are replicated to support hardware multithreading are the register files and buffer structures such as instruction fetch buffers, load-store buffers, and, in some cases, pipeline registers. Among these, the register files are the largest in area and at the same time are the leakiest. For example, the SPARC architecture uses windowed integer register files with eight windows [8]. In the Niagara processor, each thread requires 128 registers for the eight windows (with 16 registers per window) and another 32 global registers, which makes a total of 160 registers per thread. Since each SPARC core supports four hardware threads, there are a total of 640 registers in each SPARC core. Each register is 64 bits in size and there are additional bits for implementing error correcting codes (ECC). This makes each integer register file in the Niagara processor a 5 KB storage structure. If this is compared with the L1 data cache, which is private to each core in the Niagara processor and is 8 KB in size, the register file has more than 60% of the storage size of the L1 data cache. Thus, placing parts of the register files in a low-leakage state during pipeline stalls appears to be a very attractive option for saving overall leakage energy in a processor core.
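
The register file sizing above can be checked with a few lines of arithmetic. The following is a minimal sketch; the constant names are illustrative, and only the 64 data bits per register are counted (the ECC bits mentioned in the text are left out, which is an assumption of this sketch).

```python
# Niagara integer register file sizing, using the numbers quoted in the text.
WINDOWS = 8
REGS_PER_WINDOW = 16
GLOBAL_REGS = 32
THREADS_PER_CORE = 4
BITS_PER_REG = 64            # data bits only; ECC bits are not counted here
L1_DCACHE_BYTES = 8 * 1024   # private 8 KB L1 data cache per core

regs_per_thread = WINDOWS * REGS_PER_WINDOW + GLOBAL_REGS
regs_per_core = regs_per_thread * THREADS_PER_CORE
rf_bytes = regs_per_core * BITS_PER_REG // 8
rf_to_l1_ratio = rf_bytes / L1_DCACHE_BYTES

print(regs_per_thread, regs_per_core, rf_bytes, f"{rf_to_l1_ratio:.1%}")
```

This gives 160 registers per thread, 640 registers per core, 5120 bytes (5 KB) of register storage, and a ratio of 62.5% with respect to the 8 KB L1 data cache, in line with the "more than 60%" stated above.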

Another observation is that when multiple CPU cores are required to be accommodated to build a multi-core processor, the caches that are private to each core are shrunk in size to fit the chip within a given area limit. This results in an increase in the cache miss rates experienced by each core. For instance, as mentioned earlier, each core in the Niagara processor features an 8 KB private L1 data cache, which results in average miss rates of around 10% [8]. However, having four threads to run on a single core very effectively hides the latencies of stalls due to misses in the L1 and L2 caches. Thus, as the number of cores and the number of threads per core are increased, the fraction of time for which the integer register file of each thread stays idle due to memory stalls also increases.

It is important that any approach to reduce leakage based on the stalls incurred by hardware threads meet two important requirements:

1. The leakage reduction technique that is chosen to put the register file in a low-leakage state should be state retentive so that, when it is brought back to the active state, its contents are intact.

2. Every dynamic leakage reduction technique has a performance overhead when transitioning between the active and the low-leakage states. Thus, it is important to ensure that the overhead does not negatively impact the overall performance of the processor.

In this work, we apply state-retentive power gating to save leakage in integer register files during memory stalls in multithreaded processor cores. Figure 5.1 illustrates the schematic view of this approach for a core which supports execution of four hardware threads. The fundamental idea is that when a memory stall (cache miss) is detected, the running thread either gets blocked or gets switched out and, therefore, its register file is placed in a low-leakage state. Eventually, when the stall gets resolved (following a cache line fill), the register file is put back in the active state.

[Figure 5.1: Schematic view of the proposed approach for power gating register files in in-order cores that support multithreading. When a thread (specified by tid) experiences a cache miss, its register file is placed in a low-leakage state. When the thread gets ready to run again after the cache miss completes, the register file is activated. The existing control unit in the core may be extended to incorporate this scheme very easily.]

An intermediate strength power gating technique presented in [63] is applied to characterize a 32-entry 64-bit integer register file for leakage savings. The technique is also fine-grained in that the wake-up latencies are between three and five clock cycles (for a clock frequency of 1 GHz). We ensure that the wake-up latency associated with this technique is effectively hidden by virtue of more than one thread sharing the CPU pipeline. Further, the register files have two distinct low-leakage states: one with lower leakage savings and lower wake-up latency, and the other with higher leakage savings and higher wake-up latency. Depending on the duration of the stall and the time between when the stall gets resolved and when the register file is accessed again, it is placed in one of those low-leakage states. This is elaborated in Figure 5.2. The register file is designed to have two low-leakage states, sleep1 and sleep2. When an L1 miss is incurred by an instruction from a particular thread, that thread's register file is placed in the sleep1 state. If the L1 fill request further experiences an L2 miss, then the register file is placed in the higher leakage savings state, sleep2. When the L2 miss completes, the register file is brought back to the sleep1 state. The wake-up latency, t_w2, is overlapped with the L1 fill latency and, thus, gets hidden. If, however, an L2 miss is not experienced, it continues to stay in the sleep1 state. When the L1 miss completes, the register file is brought back to the active state. The wake-up latency, t_w1, is hidden very effectively due to multiple threads running on the core.

The main contributions of this work are as follows:

• We propose specific techniques to implement state-retentive power gating for three different multi-core processor configurations based on the multithreading model: (i) coarse-grained multithreading (CGMT), (ii) fine-grained multithreading (FGMT), and (iii) simultaneous multithreading (SMT).

• The proposed techniques can be implemented as design extensions to the control units of the in-order processor core, with negligible control overhead.

• The overhead due to wake-up latency is completely avoided in two techniques, while it is hidden for the most part in the third approach, either by overlapping the wake-up process with the thread context switching latency or by executing instructions from other threads that are ready for execution.

5.2 Register File Power Gating in CGMT Processors

In the CGMT approach, whenever a thread is switched, there is a multiple-cycle penalty incurred due to the context switching process. The penalty is due to either squashing (or rolling back) of instructions from the pipeline or draining of the pipeline following an event that triggers the context switch. For instance, when a thread encounters a data load miss, all the instructions in the pipeline following the load instruction are squashed before a ready thread can be switched in. Conversely, in the case of an instruction fetch miss, all the leading instructions in the pipeline are allowed to finish before the next ready thread is switched in. In both cases, bubbles are inserted into the pipeline, which negatively affects the pipeline performance. A direct approach to avoid this switch penalty is to have copies of the pipeline registers at each stage, which results in increased area and complexity of the CPU core. However, for short pipelines, the context switch penalty is only a few cycles and the small improvement in performance does not justify the additional hardware (e.g., IBM Northstar/Pulsar [26]).

[Figure 5.2: Intermediate strength power gating applied during a cache miss. When an L1 cache miss occurs, the register file is placed in the lower savings state (sleep1). Further, if the L2 access also results in a miss, then it is placed in a higher savings state (sleep2). Otherwise, it stays in the sleep1 state. The wake-up latency t_w2 is overlapped with the L1 fill latency. The wake-up latency t_w1 is overlapped during thread context switching. The total leakage savings is the shaded/colored area in the figure.]

In this section, we describe the timing details involved in putting an integer register file in and out of the low-leakage state following a memory stall. The wake-up latency of the register file is completely overlapped with the thread switch latency discussed above. We consider a CGMT model in which thread context switching happens only during instruction fetch misses (referred to as fetch misses in the remainder of the text) and data load misses (referred to as load misses in the remainder of the text). Also, the CPU pipeline is modeled around a MIPS pipeline with instruction fetch (IF), instruction decode (ID), execute (EX), memory (MEM), and writeback (WB) stages. However, the register file is read during the first cycle in the EX-stage and the operands are then dispatched to the arithmetic functional units.

5.2.1 Power Gating Control During Fetch Miss

Figure 5.3 illustrates the scenario and explains the timing details of putting a register file to sleep following a fetch miss. The figure shows the state of the pipeline during clock cycle c, when thread T1 is running while thread T2 is in the ready state. The scenario described in this figure assumes that:

1. T2 was switched out earlier following a fetch miss and was eventually put back in the ready state after its fetch miss completed.

2. T2's register file is currently in a low-leakage state and needs 3 cycles to wake up before it can be accessed.

3. All the stages of the pipeline are busy executing T1's instructions, I_{k-4} to I_k.

[Figure 5.3: Timing details for putting register files to sleep following an instruction fetch miss.]

While fetching I_k, an instruction fetch miss is encountered, following which T1 starts to drain in cycle (c+1). This is done so that instructions I_{k-4} to I_{k-1} finish before a thread context switch happens. During this draining period, T1's register file needs to be active so that reads and writes may be performed on it. Assuming that no other instruction in the pipeline gets stalled, the last instruction, I_{k-1}, finishes in cycle (c+3). During this cycle, T2's register file is signaled to wake up. In cycle (c+4), thread T2 gets switched in and starts to fetch instruction I_j, while T1's register file is put to sleep. By the time I_j reaches the EX-stage in cycle (c+6) and accesses T2's register file, it is already in the active state. The wake-up latency is overlapped with the context switching latency and, therefore, the pipeline does not incur any stalls due to the unavailability of the register file.

Eventually, as shown in Figure 5.4, T1's fetch miss completes at cycle (c+m) and T1 switches into the ready state. Then, we can consider waking up T1's register file.
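
The cycle bookkeeping of this fetch-miss scenario can be sketched as a small timing model. This is only an illustration under stated assumptions: the five-stage pipeline advances one stage per cycle with no other stalls, a register file signaled in cycle s with a wake-up latency of L cycles is taken to be usable from cycle s + L, and the function and key names are hypothetical.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")


def cycle_in_stage(fetch_cycle: int, stage: str) -> int:
    """Cycle in which an instruction fetched at fetch_cycle occupies stage (no stalls)."""
    return fetch_cycle + STAGES.index(stage)


def fetch_miss_switch(c: int = 0, wake_latency: int = 3) -> dict:
    """Timeline of a CGMT context switch from T1 to T2 after a fetch miss in cycle c."""
    # I_k misses in IF at cycle c, so I_{k-1} was fetched at c-1 and is the
    # last T1 instruction that has to drain from the pipeline.
    drain_done = cycle_in_stage(c - 1, "WB")
    wake_signal = drain_done                # T2's register file is signaled here
    rf_active = wake_signal + wake_latency  # assumed: usable wake_latency cycles later
    t2_fetch = drain_done + 1               # T2 is switched in and fetches I_j
    first_access = cycle_in_stage(t2_fetch, "EX")
    return {
        "drain_done": drain_done,
        "wake_signal": wake_signal,
        "rf_active": rf_active,
        "t2_fetch": t2_fetch,
        "first_access": first_access,
        "stall_cycles": max(0, rf_active - first_access),
    }


timeline = fetch_miss_switch(c=0, wake_latency=3)
print(timeline)
```

With c = 0 and a 3-cycle wake-up latency, the model reproduces the cycles given in the text: draining ends and the wake-up is signaled in cycle 3, T2 fetches in cycle 4, and the first register file access in cycle 6 finds the register file active, so no stall cycles are incurred. Under the same simplified model, a hypothetical 5-cycle wake-up latency would expose two stall cycles.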

[Figure 5.4: Wake-up details of T1's register file if the pipeline is busy when its fetch miss completes. Resuming from the IF-stage means that the instruction is either fetched again or placed in the instruction buffer from the cache line following a cache line fill.]

Two possible scenarios occur:

1. T2 is currently running.

2. T2 is switched out and the pipeline is currently idle.

In scenario 1, there is no need to wake up T1's register file right away because T1 will get to run only after T2 gets switched out following a memory stall (assume a fetch stall again). This situation is shown in Figure 5.4, where at cycle (c+n) T2 finishes draining following a fetch stall and gets switched out. T1's register file is signaled to wake up during this cycle. During the next cycle, T1 resumes execution from I_k, which would need to access the register file in the EX-stage at the earliest. This allows for the 3-cycle wake-up period of T1's register file.

In scenario 2 (Figure 5.5), however, T1 gets to run right after the fetch miss completes because T2 is switched out and the pipeline is idle. Therefore, T1's register file is signaled to wake up at cycle (c+m). Since T1 resumes execution (from instruction I_k) at cycle (c+m+1), T1's register file does not get accessed till cycle (c+m+3), when I_k reaches the EX-stage.

[Figure 5.5: Wake-up details of T1's register file if the pipeline is idle after its fetch miss completes.]

In scenario 1, T1's register file stays in a low-leakage state for (n-5) cycles, while in scenario 2 it stays in a low-leakage state for (m-5) cycles.

5.2.2 Power Gating Control During Load Misses

During data store misses, the thread need not stall as long as the result of the store instruction can be placed in a store buffer (unless the store is part of a specialized atomic instruction). However, during a data load miss, the thread gets stalled and it starts the transition process towards being switched out. At the same time, its register file is placed in a low-leakage state. If a newly switched-in thread always starts from the IF-stage (as in Niagara), then the wake-up latency of the register file can be hidden by the number of cycles that the instruction takes to traverse from the IF-stage to the EX-stage. However, if the load instruction that encounters the load miss is placed in a load buffer in the MEM-stage so that it may resume execution as soon as the load miss gets processed, then efforts are needed to hide the wake-up latency. Load instructions write into the register file during the W-stage and, therefore, the register file needs to be in the active state before it can be written into.

Figure 5.6 shows the register file sleep strategy following a load miss. As in Figure 5.3, this figure also shows the state of the pipeline during clock cycle c, when thread T1 is running and thread T2 is in the ready state. The scenario described in this figure assumes that:

1. T2 was switched out earlier following a fetch miss and was eventually put back in the ready state after its fetch miss completed.

2. T2's register file is currently in a low-leakage state and needs 3 cycles to wake up before it can be accessed.

3. All the stages of the pipeline are busy executing T1's instructions, I_{k-4} to I_k.

[Figure 5.6: Timing details for putting a register file to sleep following a data load miss. An instruction marked pending implies that it is placed in the load buffer.]

Following a load miss encountered by thread T1 while executing instruction I_{k-3} in the MEM-stage, instructions I_{k-2} to I_k are squashed in cycle (c+1), while I_{k-3} is placed in the load buffer. Since T2 will be switched in to be executed during the next cycle, its register file is signaled to wake up, while T1's register file is put to sleep. In cycle (c+2), T2 resumes execution by fetching instruction I_j, which does not access the register file till cycle (c+4), thereby giving it an adequate number of cycles to become active.

Eventually, when T1's load miss completes during a later cycle, say (c+m), T1 transitions from the switched-out state to the ready state. Again, the decision to wake up its register file depends on whether T2 is running (the condition shown in Figure 5.7) or is switched out. In the former case, the register file is signaled to wake up when T2 eventually encounters a stall (a load miss this time) and gets switched out (cycle (c+n) in Figure 5.7). In the following cycle, cycle (c+n+1) as shown in the figure, T1 resumes execution from I_{k-2} in the IF-stage and I_{k-3} in the W-stage. Since a load instruction needs to write into the register file in the W-stage, it is stalled for three cycles to allow for the 3-cycle wake-up latency needed by the register file. This timing also coincides with the earliest cycle in which T1's register file needs to be accessed by I_{k-2} (when it reaches the EX-stage). If, however, there is a read-after-write (RAW) data dependency between I_{k-3} and I_{k-2}, then the result of the load is forwarded to the functional unit that needs it as an operand to execute instruction I_{k-2}.

5.3 Register File Power Gating in FGMT Processors

In contrast to the CGMT approach, the FGMT and SMT approaches do not typically suffer from multiple-cycle thread switch penalties. This is because, in these approaches, each pipeline stage processes one or more instructions from multiple threads. If an instruction from a thread encounters a stall, no further instructions from that thread are fetched to be dispatched to the pipeline. Instead, instructions from one or more of the ready threads are fetched and processed. The policy to select a thread to fetch from may vary across designs. For instance, Niagara uses a round-robin (RR) policy to select one thread among a list of ready threads, while Niagara 2 implements a least recently fetched (LRF) policy to do the same. As long as there are ready threads available to the CPU, no bubbles are inserted into the pipeline. This very idea is utilized to hide the wake-up latency of the register files when they are transitioning from the low-leakage state to the active state.
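
The two thread-selection policies mentioned above can be illustrated with a short sketch. This is a simplified model for intuition only, not a description of the actual Niagara or Niagara 2 select logic; the function names, the use of thread ids 0..N-1, and the tie-breaking rule are all assumptions of this sketch.

```python
from typing import Optional, Sequence, Set


def select_round_robin(ready: Set[int], last: int, num_threads: int) -> Optional[int]:
    """Pick the first ready thread after `last` in cyclic order, or None if none is ready."""
    for offset in range(1, num_threads + 1):
        tid = (last + offset) % num_threads
        if tid in ready:
            return tid
    return None


def select_least_recently_fetched(
    ready: Set[int], last_fetch_cycle: Sequence[int]
) -> Optional[int]:
    """Pick the ready thread whose last fetch is oldest (ties go to the lower thread id)."""
    if not ready:
        return None
    return min(ready, key=lambda tid: (last_fetch_cycle[tid], tid))


# Thread 1 is stalled on a miss; threads 0, 2, and 3 are ready.
ready_threads = {0, 2, 3}
print(select_round_robin(ready_threads, last=0, num_threads=4))
print(select_least_recently_fetched(ready_threads, last_fetch_cycle=[10, 7, 9, 3]))
```

In both policies a stalled thread is simply skipped, so as long as the ready set is non-empty a fetch slot is never wasted. Returning None models the case in which a bubble would have to be inserted.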

5.3.1 Power Gating Control During Fetch Miss

A CPU that is designed to support the FGMT approach fetches an instruction from a new ready thread each cycle and dispatches it to the pipeline. Figure 5.8 illustrates the timing details of putting a thread's register file in and out of the low-leakage state following an instruction fetch miss. It is assumed that the CPU has 4 hardware threads, T1 to T4, and that a round-robin fetch policy is implemented. The pipeline contents are shown for clock cycle c. Instructions from all four threads are currently being processed by the different stages in the pipeline when an instruction from T1 encounters a fetch miss. Therefore, in the next cycle, (c+1), one of the ready threads is selected (T2 in the figure) and an instruction from that thread is fetched. The decision to assert the sleep control to T1's register file depends on whether there are any instructions belonging to T1 in the pipeline that need to access the register file. For instance, as shown in the figure, one of T1's instructions is in the W-stage in cycle c. Therefore, it is imperative that the register file be active till that instruction finishes writing into the register file. In this case, T1's register file is put to sleep at the end of cycle (c+1).

[Figure 5.7: Timing details for waking up T1's register file from sleep after its load miss completes and it gets ready to run. Resuming from the W-stage indicates that the result of the load is filled into the load buffer from the cache line.]

[Figure 5.8: Timing details for putting a thread's register file in and out of the low-leakage state following a fetch miss in FGMT.]

Eventually, when T1's fetch miss completes at cycle (c+m), it becomes ready to run. T1's register file is signaled to wake up right away. Assuming that it is indeed selected by the thread scheduler in the next cycle, i.e., cycle (c+m+1), it will access T1's register file in cycle (c+m+3) at the earliest, thereby providing for the 3-cycle wake-up latency.

5.3.2 Power Gating Control During Load Miss

In FGMT processors, when a load miss is detected for an instruction from a thread, all the instructions following the load instruction in the pipeline are squashed (or rolled back to the instruction buffer). However, since in each cycle a different thread is dispatched to the pipeline in an FGMT approach, the number of instructions squashed is expected to be much smaller than that in the case of CGMT. The load instruction itself may also be squashed or it may be marked pending at the MEM-stage (in a load buffer). The choice of implementation here impacts the strategy applied to wake up the register file for a stalled thread when its load miss completes. In the former case (as in Niagara), the wake-up latency of the register file is overlapped with the number of cycles that the instruction takes to reach the EX-stage from the IF-stage (described earlier in Section 5.2.2). However, in the latter case (shown in Figure 5.9), the writeback to the register file is deferred for additional cycles (2 cycles in the figure) to account for the wake-up latency. If there is a RAW hazard between the two T1 instructions in the pipeline, then data is forwarded from the load buffer to the consuming instruction when it reaches the EX-stage.

[Figure 5.9: Timing details for putting a thread's register file in and out of the low-leakage state following a data miss in FGMT.]

5.4 Register File Power Gating in SMT Processors

We model a generic simultaneous multithreading in-order core processor, in which each pipeline stage is capable of processing multiple instructions from distinct threads during the same clock cycle. To support this capability, each pipeline stage is equipped with stage buffers for each thread context. Once an instruction is processed by the stage, it is placed in the stage buffer for its thread context for the next stage to process it in a subsequent clock cycle. However, if an instruction gets stalled at a stage due to a multicycle latency operation, like an integer multiplication or a memory stall, it is marked blocked or unavailable till the operation finishes. Along with that, all the instructions from that thread in all the preceding stages are also marked blocked. Each stage picks up instructions from only ready threads to process during a clock cycle.

[Figure 5.10: Schematic view of a pipeline organization to support SMT in in-order cores. Each pipeline stage has stage buffers for each thread context. Instructions from these stage buffers are chosen to be processed by the next stage. During a pipeline stall, the thread is marked blocked in all the preceding stages (indicated by the circular shape). Only instructions from the ready threads are picked and processed by each stage.]

Figure 5.10 shows the schematic view of this structure. It shows a snapshot of two back-to-back stages of the pipeline, S_{k-1} and S_k. S_{k-1} processes instructions from threads T3 and T4 and places them in their respective thread buffers in that stage. S_k, on the other hand, picks up instructions from threads T1 and T4 from S_{k-1}'s stage buffers, processes them, and places them in their respective thread buffers in this stage. Thread T2 is shown to have been marked as unavailable (the circular shape in the figure) or stalled in S_{k-1} because it is currently performing a multicycle latency operation. Also, an instruction from T3 cannot be processed by S_k since the buffer for thread T3 is full in this stage.

In this SMT core, placing the register file in and out of the low-leakage state during both fetch misses and data misses is very similar. Once a miss is detected, the thread is marked blocked in all the stages in the pipeline so that the register file may be put to sleep immediately. Note that this is different from the regular case, in which the thread is marked blocked only in the preceding stages. When the miss completes, the register file is signaled to wake up, but the thread is marked ready only after a few additional cycles to account for the wake-up latency of the register file. As long as there are instructions from other ready threads in the pipeline, the additional blocked cycles do not result in any performance degradation.

5.5 Summary of the Proposed Techniques

A summary of the proposed techniques for CGMT, FGMT, and SMT cores is presented in Table 5.1. In the CGMT approach, the wake-up latency of the register files is overlapped with the latency of thread context switching. It can also be observed that, since no more than one thread is active at a time, the register files for all the other threads, irrespective of whether they are stalled or ready, may be kept in low-leakage states. Thus, as the number of threads is increased in a CGMT approach, it is expected that the leakage savings in the register files also increase linearly. On the other hand, in the FGMT and SMT approaches, multiple threads are simultaneously active in the CPU. Therefore, the leakage savings achieved in these approaches are not expected to scale with the number of hardware thread contexts supported by the CPU. Instead, they are expected to be proportional to the fraction of the time that the threads spend waiting on memory stalls.
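
As a recap of the register-file side of these techniques, the two-level sleep policy described in Section 5.1 (Figure 5.2) can be sketched as a small state machine. This is a minimal illustration, not the control logic of the proposed hardware: the event names are hypothetical labels, unknown events are assumed to leave the state unchanged, and the wake-up to the active state is tied to the L1 fill for simplicity, whereas the CGMT scheme defers it until the thread is switched in.

```python
from enum import Enum
from typing import Iterable, List


class RFState(Enum):
    ACTIVE = "active"
    SLEEP1 = "sleep1"  # lower savings, shorter wake-up latency
    SLEEP2 = "sleep2"  # higher savings, longer wake-up latency


# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS = {
    (RFState.ACTIVE, "l1_miss"): RFState.SLEEP1,
    (RFState.SLEEP1, "l2_miss"): RFState.SLEEP2,
    (RFState.SLEEP2, "l2_fill"): RFState.SLEEP1,
    (RFState.SLEEP1, "l1_fill"): RFState.ACTIVE,
}


def step(state: RFState, event: str) -> RFState:
    """Advance the register file state; events with no transition are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(events: Iterable[str], state: RFState = RFState.ACTIVE) -> List[RFState]:
    """Return the sequence of states visited, starting with the initial state."""
    trace = [state]
    for event in events:
        state = step(state, event)
        trace.append(state)
    return trace


# An L1 miss that also misses in L2, followed by the two fills.
for s in run(["l1_miss", "l2_miss", "l2_fill", "l1_fill"]):
    print(s.value)
```

The trace for this event sequence is active, sleep1, sleep2, sleep1, active, which mirrors the sequence described for a miss that goes all the way to memory.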

Table 5.1 A summary of the proposed techniques

Control feature: Sleep after a fetch miss
  CGMT: Wait till the pipeline drains.
  FGMT: Wait till there are no instructions from the target thread in the pipeline.
  SMT:  Immediately; block all the instructions belonging to this thread in the entire pipeline.

Control feature: Wake up after the fetch miss completes
  CGMT: When the thread gets switched in (its state changes from ready to run).
  FGMT: As soon as the fetch miss completes.
  SMT:  As soon as the fetch miss completes.

Control feature: Performance degradation due to wake-up overhead (fetch miss)
  CGMT: Zero cycles; the thread resumes execution from the IF-stage.
  FGMT: Zero cycles; the thread resumes execution from the IF-stage.
  SMT:  Best case is zero cycles, when there are instructions available from other threads; otherwise, the worst case is the wake-up latency in cycles.

Control feature: Sleep after a load miss
  CGMT: Immediately; any following instructions in the pipeline are squashed.
  FGMT: Immediately; any following instructions from that thread in the pipeline are squashed.
  SMT:  Immediately; block all the instructions belonging to this thread in the entire pipeline.

Control feature: Wake up after the load miss completes
  CGMT: When the thread gets switched in (its state changes from ready to run).
  FGMT: As soon as the load miss completes.
  SMT:  As soon as the load miss completes.

Control feature: Performance degradation due to wake-up overhead (load miss)
  CGMT: Zero cycles; either the load instruction or the instruction following the load resumes execution from the IF-stage.
  FGMT: Zero cycles; either the load instruction or the instruction following the load resumes execution from the IF-stage.
  SMT:  Best case is zero cycles, when there are instructions available from other threads; otherwise, the worst case is the wake-up latency in cycles.


Figure 5.11 A pathological case during a fetch miss in an FGMT core. All the other threads are switched out and only instructions belonging to T1 are in the pipeline.

Further, in the CGMT and FGMT cores, the latency of bringing a register file from a low-leakage state to the active state can be overlapped completely, so no performance overhead is incurred. For the SMT cores, however, performance degradation can occur when there are not enough ready threads in the core and keeping a thread blocked for the additional cycles inserts bubbles into the pipeline. The likelihood of this scenario can be reduced by increasing the number of threads that the SMT core is able to support.

Furthermore, the amount of savings achieved is also influenced by the number of ready threads available. One pathological case for the FGMT approach is shown in Figure 5.11, where threads T2-T4 are all stalled and only instructions from thread T1 are in the pipeline. Following a fetch miss in cycle c, T1's register file may not be put to sleep till all of T1's instructions are drained. In the example depicted in the figure, we have to wait four additional cycles to do so. If this stall were resolved in 10 cycles (a typical L2 hit latency for a CPU clock frequency of 1 GHz), the register file could sleep for only 6 of the 10 stall cycles, so the savings achieved by this technique would be reduced by 40%.
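The 40% figure above follows from simple arithmetic: four of the ten stall cycles are spent draining the pipeline before the register file can sleep. A minimal sketch of that calculation is shown below; the function name `savings_reduction` is hypothetical, and the model deliberately ignores wake-up latency and break-even effects.

```python
def savings_reduction(stall_cycles, drain_cycles):
    """Fraction of the potential sleep time lost to pipeline draining.

    stall_cycles: length of the memory stall, in cycles.
    drain_cycles: cycles spent waiting for the thread's instructions to
                  leave the pipeline before the register file may sleep.
    """
    if stall_cycles <= 0:
        raise ValueError("stall_cycles must be positive")
    if drain_cycles < 0:
        raise ValueError("drain_cycles must not be negative")
    # The register file cannot lose more sleep time than the stall lasts.
    lost_cycles = min(drain_cycles, stall_cycles)
    return lost_cycles / stall_cycles


# Pathological FGMT case from Figure 5.11: 4 drain cycles, 10-cycle stall.
print(savings_reduction(10, 4))  # 0.4
```

Longer stalls are hurt much less: with the 30-cycle physical memory latency used later in this chapter, the same four drain cycles would cost about 13% of the potential sleep time.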


5.6 Experimental Setup and Results

In this section, we describe the experimental setup used in this work to evaluate the effectiveness of the proposed techniques in multi-core processors.

5.6.1 Integer Register File Characterization

For the purpose of estimating leakage characteristics and the latency of a register file with intermediate-strength power gating, we consider a 32-entry 64-bit flip-flop based register file (without any error correction code bits) with two read ports and one write port in 45 nm FreePDK technology [64]. The layout design of a D-flip-flop from the Nangate 45 nm open cell library [65] was extended to include the two read ports and one write port, and a SPICE netlist with parasitics was extracted using Calibre from Mentor Graphics. The characterization of the register file was performed using SPICE simulations. The average access latency was 0.893 ns. The leakage states sleep1 and sleep2 reduce leakage by 36% and 52%, respectively. Their wake-up latencies, for a clock of 1 GHz, were computed to be 3 cycles and 5 cycles. The breakeven periods were shorter than the wake-up latencies. These details are outlined in Table 5.2.

Table 5.2 Register file leakage states

Leakage State   Normalized Leakage   Wakeup Latency (1 GHz Clock)
active          1                    -
sleep1          0.64                 3 cycles
sleep2          0.48                 5 cycles

5.6.2 Processor Configurations and Workload Details

We used the M5 simulator [66] for modeling the multi-core processors featuring multithreaded CPU cores. The M5 simulator supports four different CPU models to provide simulation platforms for functional and detailed simulations. Among them, O3CPU models a detailed out-of-order processor core and InOrderCPU models a detailed in-order processor core. The in-order code has some default support for both CGMT and SMT models. It was further extended to provide comprehensive support to model the multithreading approaches described in this chapter. We run all our detailed simulations for the DEC Alpha ISA in syscall emulation mode.

Table 5.3 SPEC2000 integer benchmarks

Name      Dynamic Instruction Count (Millions)
vpr       17.6
gap       44.8
vortex    88.3
twolf     91.9
eon       94.0
crafty    94.4
gcc       96.8
mcf       188.6
perlbmk   200.6
parser    267.8
gzip      601.9

- The dynamic instruction counts are for the small reduced (smred) input sets [67].
- Benchmark bzip2 is not shown because it does not have a smred input set.

Table 5.3 enumerates the integer benchmarks from the SPEC2000 benchmark suite and their dynamic instruction counts for the small reduced (smred) input sets [67]. The binaries are Tru64 binaries (COFF version 3.11-10) built with optimization levels O2 and O3. Multiprogrammed workloads for each processor core are created by choosing the required number of benchmarks (all distinct) at random. Simulations are run till the first thread finishes execution.

We configure a number of multi-core processors comprising in-order CPU cores by varying the number of cores, the number of hardware contexts that each core supports, and a number of L1 and L2 cache parameters. The processor parameters are tabulated in Table 5.4. The number of cores is either two, four, or eight. The number of hardware threads that each core supports is scaled from two to eight in case of CGMT, from three to eight in case of FGMT, and from four to eight in case of SMT. We cap the number of threads per core at eight because the cost growth for supporting additional hardware threads is linear up to around eight threads but is superlinear after that [68].

Table 5.4 Multi-core processor parameters

Parameter                              CGMT          FGMT          SMT
Clock Speed                            1 GHz         1 GHz         1 GHz
Number of Cores                        2, 4, and 8   2, 4, and 8   2, 4, and 8
Number of Contexts                     2-8           3-8           4-8
Pipeline Bandwidth (in insts/cycle)    1             1             2
Functional Units                       1 int ALU,    1 int ALU,    2 int ALU,
                                       1 int Mult    1 int Mult    1 int Mult
Load/Store/Fetch Buffers               1 per thread  1 per thread  1 per thread
Fetch Select Policy                    Round-robin   Round-robin   Round-robin

The in-order cores have simple specifications. In case of CGMT and FGMT, the cores can process at most one instruction each cycle in each of their pipeline stages, while, in the case of SMT, the cores can process two. Therefore, we provide two integer ALUs to each core in case of SMT but only one integer ALU to the cores in case of CGMT and FGMT. The count of integer multipliers, however, is the same (one) for all the cores. The execution latencies of the integer ALU and integer multiplier are 1 cycle and 3 cycles, respectively. Furthermore, we model a fully pipelined integer multiplier so that integer multiply instructions that are not data dependent on each other may be issued every clock cycle. The clock frequency is uniform (1 GHz) across all the processor configurations. Each core has one load buffer, one store buffer, and one fetch buffer per thread. The policy to select a thread to fetch instructions from is set to round-robin.

The cache parameters are tabulated in Tables 5.5, 5.6, 5.7, 5.8, and 5.9. We set the cache parameters based on the specifications of the Niagara series of processors. The hit latencies for the private L1 caches (both I-cache and D-cache) and the shared L2 cache are set to 1 cycle and 10 cycles, respectively.

Table 5.5 Memory access latencies

Memory Unit          Access Latency
L1 D-cache           1 cycle
L1 I-cache           1 cycle
L2 cache (shared)    10 cycles
Physical Memory      30 cycles

Table 5.6 L1 D-cache and I-cache parameters

Size                 2 cores: 64 KB    4 cores: 32 KB    8 cores: 16 KB
Set Associativity    2 TCs: 2          3-4 TCs: 4        5-8 TCs: 8
MSHRs                as many as the number of TCs

TCs - Thread Contexts

Table 5.7 L2 cache size (in MB)

Number of Threads    2 cores   4 cores   8 cores
2                    2         3         4
3-4                  3         4         6
5-8                  4         6         8

Table 5.8 L2 cache set associativity

Number of Threads    2 cores   4 cores   8 cores
2                    4         6         8
3-4                  6         8         12
5-8                  8         12        16

Table 5.9 L2 cache MSHR count

Number of Threads    2 cores   4 cores   8 cores
2                    4         6         8
3-4                  6         8         12
5-8                  8         12        16

The physical memory access latency is set to 30 cycles. The cache line size for each cache is set to 64 bytes. As the number of cores is increased, the private L1 caches are reduced in size to make room for larger shared L2 caches. Therefore, while the L1 cache size is scaled down from 64 KB to 16 KB as the number of cores is increased from 2 to 8, the size of the L2 cache is scaled upward between 2 MB and 8 MB based on the number of cores and the number of thread contexts (Tables 5.6 and 5.7).
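The cache scaling rules in Tables 5.6 through 5.9 amount to a lookup keyed by the number of cores and the number of thread contexts per core. The sketch below shows one way to encode them; the function and dictionary names are hypothetical, and the table values reflect a reading of the tables in this section rather than any simulator configuration file.

```python
# Column index for the number of cores (2, 4, or 8) in Tables 5.7-5.9.
CORE_COLUMN = {2: 0, 4: 1, 8: 2}

L1_SIZE_KB = {2: 64, 4: 32, 8: 16}           # Table 5.6, by number of cores
L1_ASSOC = [2, 4, 8]                         # Table 5.6, by thread bucket
L2_SIZE_MB = [[2, 3, 4], [3, 4, 6], [4, 6, 8]]       # Table 5.7
L2_ASSOC = [[4, 6, 8], [6, 8, 12], [8, 12, 16]]      # Table 5.8
L2_MSHRS = [[4, 6, 8], [6, 8, 12], [8, 12, 16]]      # Table 5.9


def thread_bucket(threads):
    """Map a thread-context count to its table row: 2, 3-4, or 5-8."""
    if threads == 2:
        return 0
    if 3 <= threads <= 4:
        return 1
    if 5 <= threads <= 8:
        return 2
    raise ValueError("thread contexts per core must be between 2 and 8")


def cache_config(cores, threads):
    """Return the cache parameters for one processor configuration."""
    if cores not in CORE_COLUMN:
        raise ValueError("number of cores must be 2, 4, or 8")
    row = thread_bucket(threads)
    col = CORE_COLUMN[cores]
    return {
        "l1_size_kb": L1_SIZE_KB[cores],
        "l1_assoc": L1_ASSOC[row],
        "l1_mshrs": threads,      # one outstanding L1 miss per thread
        "l2_size_mb": L2_SIZE_MB[row][col],
        "l2_assoc": L2_ASSOC[row][col],
        "l2_mshrs": L2_MSHRS[row][col],
        "line_size_bytes": 64,
    }
```

For instance, `cache_config(8, 8)` yields 16 KB 8-way L1 caches with 8 MSHRs and an 8 MB 16-way shared L2 cache with 16 MSHRs, which matches the largest configuration described above.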


When the number of cores and the number of thread contexts increase, the set associativity for both L1 and L2 caches is increased to reduce the number of conflict misses (Tables 5.6 and 5.8). The number of Miss Status Handling Registers (MSHRs) in the L1 caches is also increased with the number of thread contexts to allow at most one outstanding L1 cache miss per thread (Table 5.6). The MSHR count for the L2 cache is dependent on both the number of cores and the number of thread contexts (Table 5.9). During the simulations, the caches are warmed up for the first 100,000 cycles.

5.6.3 Results

The instructions per cycle (IPC) counts for the workloads on the different multi-core processor configurations are plotted in Figures 5.12, 5.13, and 5.14. The IPC for CGMT cores ranges between 0.41 and 0.52, while for FGMT cores the IPC is between 0.61 and 0.68. This marked difference is primarily due to the fact that the FGMT approach is very effective in hiding stalls due to both long latency events (e.g., cache misses) and short latency events (branch resolution, data dependency resolution, etc.). CGMT, however, switches threads to hide stalls only due to long latency events. Moreover, the thread switch penalty in CGMT cores may be more than one cycle, whereas, in FGMT cores, this penalty is exactly one cycle as long as there are ready threads available.

The IPC counts for the SMT cores are in the range of 0.91 to 1.22 because the pipeline width for the SMT cores is double that of the CGMT and FGMT cores. For the same number of threads, the IPC reduces as the number of cores is increased because the L1 cache sizes become smaller. It can also be observed that while the IPC counts for the CGMT and FGMT cores tend to saturate as the number of thread contexts is increased to eight, the IPC for SMT cores increases more linearly, indicating that these cores could support more threads before their performance levels out.

The leakage savings in the integer register files in CGMT cores are shown in Figure 5.15. As expected, the savings in the CGMT processors scale very well with the number of thread contexts. For 2 thread contexts, the savings are in the range of 0.9% to 2.9%, while, for 8 thread contexts, the savings are between 22% and 42%. This is because, in a CGMT approach, only a single thread context is active at a time in the entire pipeline till it experiences a long latency stall. Therefore, the register files for the rest of the thread contexts, irrespective of whether they are ready or stalled, may be put to sleep.

Figure 5.12 Average instructions per cycle (IPC) count for CGMT approach (x-axis: number of thread contexts; y-axis: instructions per cycle; series: 2-core, 4-core, 8-core)

In contrast to this, in the FGMT and SMT approaches, instructions belonging to different thread contexts are processed by different pipeline stages. Therefore, the savings do not scale with the number of thread contexts but instead are proportional only to the fraction of the time that is spent by the threads waiting on memory stalls. For FGMT cores, the leakage savings range from 0.8% to 2.02% for 3 thread contexts and from 3.09% to 7.8% for eight thread contexts (Figure 5.16).

The total latencies of L1 D-cache read misses, L1 I-cache read misses, and L2 read misses (normalized over the total number of CPU cycles) averaged over the number of threads are plotted in Figures 5.17, 5.18, and 5.19.

For SMT cores, the leakage savings range from 1.02% to 2.23% for 4 thread contexts and from 2.97% to 7.27% for eight thread contexts (Figure 5.20). The degradation in performance due to the proposed technique in SMT cores was calculated by counting the number of cycles when an instruction could not be processed by a pipeline stage because the register file was not awake. For SMT cores, the degradation was 0.023% in case of an 8-core processor with each core executing 8 threads.

5.7 Discussion

With continued scaling in modern VLSI circuit fabrication technology, leakage power and heat dissipation limits have driven not only circuit designers to devise newer techniques to reduce power dissipation in circuits but also CPU designers to change the paradigm of designing processors to improve performance. Hardware multithreading and multi-core processors are outcomes of this new design paradigm. The work presented in this chapter leverages existing circuit level techniques to reduce leakage in such processors. In this work, we synchronize the sleep of a register file private to a thread with the unavailability of that thread and the wake-up with the readiness of that thread to run. This is because the integer register file is accessed very frequently by integer applications.

Figure 5.13 Average instructions per cycle (IPC) count for FGMT approach (x-axis: number of thread contexts; y-axis: instructions per cycle; series: 2-core, 4-core, 8-core)
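Because the register file sleeps exactly while its thread is unavailable, the leakage savings for one thread can be estimated to first order from the cycles the thread spends blocked and the normalized leakage values in Table 5.2. The sketch below is a simplified accounting under stated assumptions, not the methodology used to produce the reported results: the function name and interval format are hypothetical, the register file is assumed to leak at the full active rate while waking up, and transition energy is ignored.

```python
# Normalized leakage and wake-up latency per state, taken from Table 5.2.
NORMALIZED_LEAKAGE = {"active": 1.0, "sleep1": 0.64, "sleep2": 0.48}
WAKEUP_CYCLES = {"sleep1": 3, "sleep2": 5}


def leakage_savings(total_cycles, blocked_intervals, state="sleep1"):
    """Return (fractional leakage savings, total wake-up cycles).

    blocked_intervals: list of (start, end) cycle pairs, end exclusive,
    during which the thread is unavailable and its register file sleeps.
    Assumption: wake-up starts at the end of each interval, and the
    register file leaks at the full active rate while it wakes up.
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    sleep_cycles = 0
    previous_end = 0
    for start, end in sorted(blocked_intervals):
        if start < previous_end or end <= start or end > total_cycles:
            raise ValueError("intervals must be disjoint and inside the run")
        sleep_cycles += end - start
        previous_end = end
    saved_per_cycle = 1.0 - NORMALIZED_LEAKAGE[state]
    savings = sleep_cycles * saved_per_cycle / total_cycles
    wakeup_cycles = len(blocked_intervals) * WAKEUP_CYCLES[state]
    return savings, wakeup_cycles


# A thread blocked for 40 of 1000 cycles, across two separate misses.
savings, wakeups = leakage_savings(1000, [(100, 130), (500, 510)], "sleep1")
print(f"{savings:.2%} saved, {wakeups} wake-up cycles")
```

Under this accounting the savings grow with the fraction of time a thread is blocked, which is consistent with the observation that FGMT and SMT savings track memory stall time rather than the number of thread contexts.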


Figure 5.14 Average instructions per cycle (IPC) count for SMT approach (x-axis: number of thread contexts; y-axis: instructions per cycle; series: 2-core, 4-core, 8-core)

In the future, we would like to extend this work to in-order cores with floating point register files. Floating point applications use both integer and floating point register files and, therefore, the patterns of accesses to these units may provide more opportunities for power gating the register files. Consequently, the fundamental approach that is at the core of the techniques proposed in this work will result in conservative savings if directly applied to floating point register files. Also, using a partitioned register file for each thread could have further advantages. Instead of waking up the entire register file at the same time, only the partitions that need to be accessed can be woken up more urgently than the rest of the partitions.


Figure 5.15 Average IRF leakage energy savings for CGMT cores (x-axis: number of thread contexts; y-axis: leakage savings (%); series: 2-core, 4-core, 8-core)

Figure 5.16 Average IRF leakage energy savings for FGMT cores (x-axis: number of thread contexts; y-axis: leakage savings (%); series: 2-core, 4-core, 8-core)


Figure 5.17 Data read miss latency per thread for FGMT cores (x-axis: number of thread contexts; y-axis: data read miss latency per thread (% of total cycles); series: 2-core, 4-core, 8-core)

Figure 5.18 Instruction fetch miss latency per thread for FGMT cores (x-axis: number of thread contexts; y-axis: fetch miss latency per thread (% of total cycles); series: 2-core, 4-core, 8-core)


Figure 5.19 L2 read miss latency per thread for FGMT cores (x-axis: number of thread contexts; y-axis: miss latency per thread (% of total cycles); series: 2-core, 4-core, 8-core)

Figure 5.20 Average IRF leakage energy savings for SMT cores (x-axis: number of thread contexts; y-axis: leakage savings (%); series: 2-core, 4-core, 8-core)


CHAPTER 6
CONCLUSIONS AND FUTURE DIRECTIONS

Energy and power considerations in the world of computing, be it in embedded processors, personal computers, or server farms, constitute the newest dimension that has challenged computer architects and circuit designers to think of newer design paradigms for such systems. The research efforts reported in this dissertation represent a solid contribution to addressing the issue of reducing leakage energy in both embedded and general-purpose microprocessors. Here we list some future directions to extend the work presented in this dissertation:

• Investigating more aggressive compiler techniques: In the work described in Chapter 3 and Chapter 4, the algorithms proposed to identify idle program regions where functional units may be turned off target finding large subgraphs that are enclosed within functions and loops. This is because the breakeven periods of the functional units were two orders of magnitude higher than the processor clock period. However, more recently, newer technologies and more sophisticated circuit design techniques have brought down the breakeven periods to 2X-10X of the microprocessor clock period. This makes it important to investigate more aggressive techniques at the compiler level to be able to power gate units in a more fine-grained manner.

• Designing partitioned functional units and register files: In this work, power gating of the units has been considered at the module level. Either the entire unit is active or it is put to sleep for leakage reduction. However, a unit may be designed as partitioned blocks so that one or more of those partitions may be put to sleep when they are idle. This will entail some challenges at the circuit level because (i) partitioned designs usually have more area overhead than non-partitioned designs, and (ii) powering on a circuit block induces noise in the power rails of its neighboring blocks.

• Modeling full system power: In this work, power modeling has been done only for the specific components in which leakage power reduction has been targeted. Going forward, it would be important to model entire systems and analyze the power consumption and energy efficiency in the context of operating systems that manage such hardware systems. This includes modeling off-chip caches, if any, graphics processing units (GPUs), interconnection networks, physical memory, secondary storage devices, and other peripheral devices.

• Supplementing ACPI for power management: The Advanced Configuration and Power Interface (ACPI) [69] specification establishes industry common standard interfaces for operating system (OS)-directed power management of entire systems. Currently, the ACPI specifies active and standby power management of the microprocessor as a single unified unit. However, the techniques presented in this dissertation address reducing standby power in the subcomponents of the microprocessor even when the core itself is active. An exciting future work would be to extend the current standard to incorporate the leakage power reduction techniques presented in this dissertation.
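The role of the breakeven period in the first direction above can be illustrated with a small filter over idle regions: a unit is worth gating only when its idle period exceeds the breakeven period. This is a hedged sketch, not the region-identification algorithm of Chapters 3 and 4; the function name and the idle lengths are hypothetical, and the breakeven values of 100 and 10 cycles merely stand in for "two orders of magnitude" and "2X-10X" of the clock period.

```python
def gateable_regions(idle_lengths, breakeven_cycles):
    """Keep only the idle regions long enough to repay the gating overhead.

    idle_lengths: idle period of a functional unit in each candidate
                  program region, in clock cycles.
    breakeven_cycles: minimum idle time for which gating saves energy.
    """
    return [length for length in idle_lengths if length > breakeven_cycles]


# Hypothetical idle periods found by a compiler pass, in cycles.
idle = [5, 40, 250, 1200]

print(gateable_regions(idle, 100))  # [250, 1200]
print(gateable_regions(idle, 10))   # [40, 250, 1200]
```

With the shorter breakeven period, a region that was previously too small becomes a candidate, which is why finer-grained compiler analysis becomes worthwhile as breakeven periods shrink.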


REFERENCES

[1] W. J. Dally, J. Balfour, D. Black-Shaffer, J. Chen, R. C. Harting, V. Parikh, J. Park, and D. Sheffield. Efficient Embedded Computing. Computer, 41(7):27-32, 2008.
[2] T. R. Halfhill. MIPS Threads the Needle. Microprocessor Report, 20(2):1-8, 2006.
[3] E. Grochowski and M. Annavaram. Energy per Instruction Trends in Intel Microprocessors. Technology@Intel Magazine, pages 1-8, 2006.
[4] S. Borkar. Design Challenges of Technology Scaling. IEEE Micro, 19:23-29, 1999.
[5] F. J. Pollack. New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies. In MICRO 32: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, page 2, 1999.
[6] E. Grochowski et al. Best of Both Latency and Throughput. In ICCD '04: Proceedings of the IEEE International Conference on Computer Design, pages 236-243, 2004.
[7] A. Sodan et al. Parallelism via Multithreaded and Multicore CPUs. IEEE Computer, 43(3):24-32, 2010.
[8] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-Way Multithreaded Sparc Processor. IEEE Micro, 25(2):21-29, 2005.
[9] P. Kongetira, K. Aingaran, and K. Olukotun. Implementation of an 8-core, 64-thread, Power-Efficient SPARC Server on a Chip. IEEE JSSC, 43(1):6-20, 2008.
[10] L. Seiler et al. Larrabee: A Many-Core X86 Architecture for Visual Computing. ACM Trans. Graph., 27(3):1-15, 2008.


[11] MIPS. MIPS32 1004K(TM) CPU Family Software User's Manual. http://www.mips.com, 2009.
[12] J. Rabaey. Low Power Design Essentials. Springer, 2009.
[13] R. K. Krishnamurthy, S. K. Mathew, M. A. Anders, S. K. Hsu, H. Kaul, and S. Borkar. High-Performance and Low-Voltage Challenges for Sub-45nm Microprocessor Circuits. Intl. Conf. ASIC, pages 283-286, 2005.
[14] S. Rusu et al. A 65nm Dual-Core Multithreaded Xeon Processor with 16MB L3 Cache. pages 17-25, 2007.
[15] S. Li et al. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 469-480. ACM, 2009.
[16] D. Duarte, Y. F. Tsai, N. Vijaykrishnan, and M. J. Irwin. Evaluating Run-Time Techniques for Leakage Power Reduction. Proc. 7th ASP-DAC, pages 31-38, 2002.
[17] M. Johnson, D. Somasekhar, and K. Roy. Models and Algorithms for Bounds on Leakage in CMOS Circuits. IEEE TCAD Integrated Circuits and Systems, pages 714-725, 1999.
[18] S. Narendra, V. De, D. Antoniadis, A. Chandrakasan, and S. Borkar. Scaling of stack effect and its application for leakage reduction. In ISLPED '01: Proceedings of the 2001 International Symposium on Low Power Electronics and Design, pages 195-200, New York, NY, USA, 2001. ACM.
[19] Y. Ye, S. Borkar, and V. De. A New Technique for Standby Leakage Reduction in High-Performance Circuits. Digest of Technical Papers, Symp. on VLSI Circuits, pages 40-41, 1998.
[20] K. Roy. Leakage Power Reduction in Low-Voltage CMOS Design. IEEE International Conference on Electronics, Circuits and Systems, pages 167-173, 1998.
[21] J. T. Kao and A. P. Chandrakasan. Dual-Threshold Voltage Techniques for Low-Power Digital Circuits. IEEE J. of Solid-State Circuits, 35:1009-1018, 2000.


[22] S. Mutoh, T. Douskei, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada. 1-V Power Supply High-Speed Digital Circuit Technology with Multi-Threshold Voltage CMOS. IEEE J. of Solid-State Circuits, pages 847-854, 1995.
[23] Z. Hu et al. Microarchitectural Techniques for Power Gating of Execution Units. International Symposium on Low Power Electronics and Design, pages 32-37, 2004.
[24] W. Buchholz. Planning a Computer System: Project Stretch. McGraw-Hill, Inc., Hightstown, NJ, USA, 1962.
[25] J. E. Thornton. Parallel operation in the control data 6600. In AFIPS '64 (Fall, part II): Proceedings of the October 27-29, 1964, Fall Joint Computer Conference, Part II: Very High Speed Computer Systems, pages 33-40, New York, NY, USA, 1965. ACM.
[26] J. P. Shen and M. H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors, 1st ed. McGraw-Hill Science/Engineering/Math, 2004.
[27] T. Ungerer, B. Robič, and J. Šilc. A Survey of Processors with Explicit Multithreading. ACM Computing Surveys, 35(1):29-63, 2003.
[28] S. Kaxiras, Z. Hu, and M. Martonosi. Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power. International Symposium on Computer Architecture, pages 240-251, 2001.
[29] K. Flautner et al. Drowsy Caches: Simple Techniques for Reducing Leakage Power. International Symposium on Computer Architecture, pages 148-157, 2002.
[30] L. Clark, S. Demmons, N. Deutscher, and F. Ricci. Standby Power Management for a 0.18 µm Microprocessor. International Symposium on Low Power Electronics and Design, pages 7-12, 2002.
[31] S. Dropsho, V. Kursun, D. H. Albonesi, S. Dwarkadas, and E. G. Friedman. Managing Static Leakage Energy in Microprocessor Functional Units. International Symposium on Microarchitecture, pages 321-332, 2002.


[32] W. Zhang et al. Compiler Support for Reducing Leakage Energy Consumption. Design Automation and Test in Europe, pages 1146-1147, 2003.
[33] S. Rele et al. Optimizing Static Power Dissipation by Functional Units in Superscalar Processors. International Conference on Compiler Construction, pages 261-274, 2002.
[34] Y. You, C. Lee, and J. K. Lee. Compiler Analysis and Supports for Leakage Power Reduction on Microprocessors. ACM Transactions on Design Automation of Electronic Systems, pages 147-164, 2006.
[35] N. Seki et al. A Fine-Grain Dynamic Sleep Control Scheme in MIPS R3000. In Computer Design, 2008. ICCD 2008. IEEE International Conference on, pages 612-617, 2008.
[36] N. Komoda et al. Compiler Directed Fine Grain Power Gating for Leakage Reduction in Microprocessor Functional Units. Workshop on Optimizations for DSP and Embedded Systems, pages 42-51, 2009.
[37] S. Roy, N. Ranganathan, and S. Katkoori. A Framework for Power Gating Functional Units in Embedded Microprocessors. IEEE Transactions on VLSI Systems, 17:1640-1649, 2009.
[38] S. Rusu et al. Power Reduction Techniques for an 8-core Xeon Processor. In ESSCIRC, 2009. ESSCIRC '09. Proceedings of, pages 340-343, 2009.
[39] R. Kumar and G. Hinton. A Family of 45nm IA Processors. In Solid-State Circuits Conference - Digest of Technical Papers, 2009. ISSCC 2009. IEEE International, pages 58-59, 2009.
[40] T. Saito et al. Design of Superscalar Processor with Multi-Bank Register File. In Circuits and Systems, 2005. ISCAS 2005. IEEE International Symposium on, pages 3507-3510, 2005.
[41] A. Agarwal, R. Kaushik, and R. K. Krishnamurthy. A Leakage-Tolerant Low-Leakage Register File with Conditional Sleep Transistor. In SOC Conference, 2004. Proceedings. IEEE International, pages 241-244, 2004.
[42] J. Lingling et al. Reduce Register Files Leakage Through Discharging Cells. In Computer Design, 2006. ICCD 2006. International Conference on, pages 114-119, 2006.


[43] H. O. Kim et al. Supply Switching with Ground Collapse for Low-Leakage Register Files in 65-nm CMOS. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 18(3):505-509, 2010.
[44] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. IEEE Annual Workshop on Workload Characterization, pages 3-14, 2001.
[45] M. Leeser and X. Wang. Variable Precision Floating Point Division and Square Root. Workshop HPEC, pages 47-48, 2004.
[46] W. Zhao and Y. Cao. Predictive Technology Model for Nano-CMOS Design Exploration. ACM JETC, 3:1-17, 2007.
[47] D. Burger and T. Austin. The Simplescalar Tool Set, version 2.0. Technical report, Tech. Rep. TR-97-1342, University of Wisconsin-Madison, 1997.
[48] S. Roy, N. Ranganathan, and S. Katkoori. Exploration of Compiler Optimization Techniques for Enhancing Power Gating. International Symposium on Circuits and Systems, pages 1004-1007, 2009.
[49] M. N. Wegman and F. K. Zadeck. Constant Propagation with Conditional Branches. ACM Transactions on Programming Languages and Systems, pages 231-236, 1991.
[50] L. Rolaz. An Implementation of Lazy Code Motion for Machine SUIF. Technical report, Swiss Federal Institute of Technology, 2003.
[51] P. Briggs and T. J. Harvey. Multiplication by Integer Constants. Technical report, Rice University, 1994.
[52] K. D. Cooper, L. T. Simpson, and C. A. Vick. Operator Strength Reduction. ACM Transactions on Programming Languages and Systems, pages 603-625, 2001.
[53] M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and W. Ye. Influence of Compiler Optimizations on System Power. IEEE Transactions on VLSI Systems, 9:801-804, 2001.


[54] GNU Project. GCC, the GNU Compiler Collection. http://gcc.gnu.org/
[55] D. Burger and T. Austin. The Simplescalar Tool Set, version 2.0. Technical report, TR-97-1342, University of Wisconsin-Madison, 1997.
[56] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. International Symposium on Microarchitecture, page 330, 1997.
[57] R. Morgan. Building an Optimizing Compiler. Digital Press, 1998.
[58] T. Granlund and P. Montgomery. Division by Invariant Integers using Multiplication. Proc. ACM SIGPLAN Conf. on PLDI, pages 61-72, 1994.
[59] GNU Project. GNU Binutils. http://www.gnu.org/software/binutils/
[60] GNU Project. GNU C Library. http://www.gnu.org/software/libc/
[61] GNU Project. GNU Compiler Collection (GCC) Internals. http://gcc.gnu.org/onlinedocs/gccint/
[62] D. Novillo. GCC Internals. http://www.airs.com/dnovillo/, 2007.
[63] H. Singh et al. Enhanced Leakage Reduction Techniques using Intermediate Strength Power Gating. IEEE Trans. Very Large Scale Integr. Syst., 15(11):1215-1224, 2007.
[64] J. E. Stine et al. FreePDK v2.0: Transitioning VLSI Education Towards Nanometer Variation-Aware Designs. In Microelectronic Systems Education, 2009. MSE '09. IEEE International Conference on, pages 100-103, 2009.
[65] Nangate. Nangate 45nm Open Cell Library. http://www.nangate.com/openlibrary, 2008.
[66] N. L. Binkert et al. The M5 Simulator: Modeling Networked Systems. IEEE Micro, 26(4):52-60, 2006.
[67] A. J. KleinOsowski and D. J. Lilja. MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research. IEEE Comput. Archit. Lett., 1(1):7, 2002.


[68] J. Burns and J. L. Gaudiot. SMT Layout Overhead and Scalability. IEEE Trans. Parallel Distrib. Syst., 13(2):142-155, 2002.
[69] Hewlett-Packard Corp., Intel Corp., Microsoft Corp., Phoenix Technologies Ltd., and Toshiba Corp. Advanced Configuration and Power Interface Specification, Revision 4.0a. http://www.acpi.info/, 2010.


LIST OF PUBLICATIONS

• S. Roy, S. Katkoori, N. Ranganathan. A Compiler Based Leakage Reduction Technique by Power-Gating Functional Units in Embedded Microprocessors. International Conference on VLSI Design, January 2007, Page(s): 215-220.

• S. Roy, N. Ranganathan, S. Katkoori. Exploration of Compiler Optimization Techniques for Enhancing Power Gating. International Symposium on Circuits and Systems, May 2009, Page(s): 1004-1007.

• S. Roy, N. Ranganathan, S. Katkoori. Compiler Directed Power Gating in Embedded Microprocessors. International Conference on Computer Design, October 2009, Page(s): 35-40.

• S. Roy, N. Ranganathan, S. Katkoori. A Framework for Power-Gating Functional Units in Embedded Microprocessors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 17, Issue 11, November 2009, Page(s): 1640-1649.

• S. Roy, N. Ranganathan, S. Katkoori. Impact of Compiler Optimization Techniques on Power Gating for Leakage Reduction. IEEE Transactions on Computers. Revised manuscript under review, 2010.

• S. Roy, N. Ranganathan, S. Katkoori. State-Retentive Power Gating of Register Files in Multi-core Processors. IEEE Transactions on Computers. Under review, 2010.


ABOUT THE AUTHOR

Soumyaroop Roy received his Bachelor of Engineering degree in Electronics and Communication Engineering in 2001 from Birla Institute of Technology, Mesra, Ranchi, India and his Master of Science degree in Computer Engineering in 2006 from University of South Florida (USF), Tampa, FL. He is currently pursuing his Doctoral degree in Computer Science and Engineering at USF and has accepted a Senior Design Engineering position in the Architecture Performance Modeling group at AMD, Austin, TX. His research interests are in architecture and compiler methodologies for low-power design of microprocessors, architecture level performance and power modeling, and low-power VLSI design. From 2001 to 2004, he was a Software Engineer with the NCVHDL group at Cadence Design Systems India in Noida. He has taught several courses at the Computer Science and Engineering Department at USF, including Operating Systems, Data Structures, Foundations of Engineering, and Logic Design, and has served as a teaching assistant in numerous other courses. He received a Provost's Commendation for Outstanding Teaching by a Graduate Teaching Assistant at USF in 2010 and the 2010 Sypris Best Teaching Assistant Award at the Department of Computer Science and Engineering, USF. He is also a recipient of the 2009 IEEE Computer Society Richard E. Merwin scholarship. He is a student member of the IEEE and the IEEE Computer Society.