USF Libraries
USF Digital Collections

Leakage power driven behavioral synthesis of pipelined ASICs

Material Information

Title:
Leakage power driven behavioral synthesis of pipelined ASICs
Physical Description:
Book
Language:
English
Creator:
Gopalan, Ranganath
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla.
Publication Date:
2005

Subjects

Subjects / Keywords:
MTCMOS
Speedup
Simulated annealing
Clique-partitioning
Data initiation intervals
Dissertations, Academic -- Computer Engineering -- Masters -- USF   ( lcsh )
Genre:
government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Abstract:
Traditional approaches to power optimization during high-level synthesis have targeted single-cycle designs, where only one input is being processed by the datapath at any given time. Throughput of large single-cycle designs can be improved by means of pipelining. In this work, we present a framework for the high-level synthesis of pipelined datapaths with low leakage power dissipation. We explore the effect of pipelining on the leakage power dissipation of data-flow intensive designs. An algorithm for minimization of leakage power during behavioral pipelining is presented. The transistor-level leakage reduction technique employed here is based on Multi-Threshold CMOS (MTCMOS) technology. Pipelined allocation of functional units and registers is performed considering fixed data introduction intervals. Our algorithm uses simulated annealing to perform scheduling, allocation, and binding for obtaining pipelined datapaths that have low leakage dissipation.
Thesis:
Thesis (M.S.C.P.)--University of South Florida, 2005.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Ranganath Gopalan.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 57 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001681056
oclc - 62726135
usfldc doi - E14-SFE0001064
usfldc handle - e14.1064
System ID:
SFS0025385:00001




Full Text
Adviser: Dr. Srinivas Katkoori.
http://digital.lib.usf.edu/?e14.1064



Leakage Power Driven Behavioral Synthesis of Pipelined ASICs

by

Ranganath Gopalan

A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science in Computer Engineering
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Srinivas Katkoori, Ph.D.
Nagarajan Ranganathan, Ph.D.
Soontae Kim, Ph.D.

Date of Approval: March 4, 2005

Keywords: MTCMOS, speedup, simulated annealing, clique-partitioning, data initiation intervals

© Copyright 2005, Ranganath Gopalan

DEDICATION

To my family and friends.

ACKNOWLEDGEMENTS

I would like to profoundly thank my major professor, Dr. Srinivas Katkoori, for never being short on encouragement and support. He has been an immense source of sound advice and judgement, which helped bring this thesis to fruition. I would like to thank Dr. N. Ranganathan and Dr. Soontae Kim for being on my committee and providing me with valuable feedback for my thesis and future work. I would like to acknowledge the team at Tech Support for their help and assistance. I would also like to acknowledge the help of my colleagues Vyas, Hariharan, and Narender. Finally, I am deeply grateful to my very good friends both near and far, for their help and support at all times.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION AND RELATED WORK
1.1 Leakage reduction techniques
1.2 High level leakage reduction
1.3 Power reduction during functional pipelining
CHAPTER 2 PRELIMINARIES
2.1 Behavioral synthesis
2.2 Pipelining during behavioral synthesis
2.3 The AUDI system
2.4 AUDI front end
2.5 MTCMOS binding
2.6 Simulation-based architectural leakage estimation
CHAPTER 3 PROPOSED APPROACH
3.1 Functional resource optimization
3.1.1 Simulated annealing
3.1.2 Generation of neighbour states
3.1.3 Evaluation of neighbour-state cost
3.1.4 Cost function characterization
3.2 Register allocation & optimization
CHAPTER 4 EXPERIMENTAL RESULTS
CHAPTER 5 CONCLUSIONS AND FUTURE WORK
REFERENCES

LIST OF TABLES

Table 2.1 Controller for FIR filter
Table 3.1 Leakage and settling constants table for 16-bit adder activity combinations ( 0 = 3)
Table 4.1 Benchmarks used for analysis
Table 4.2 FIR filter leakage & area analysis
Table 4.3 EWF filter leakage & area analysis
Table 4.4 IIR filter leakage & area analysis
Table 4.5 AR filter leakage & area analysis
Table 4.6 FFT leakage & area analysis
Table 4.7 Running times for SA-algorithm
Table 4.8 Speedup analysis for various benchmarks

LIST OF FIGURES

Figure 1.1 Current components of transistor leakage (reproduced from [1])
Figure 1.2 MTCMOS XOR gate
Figure 1.3 LECTOR technique
Figure 1.4 Pipelines with various initiation rates
Figure 2.1 ASAP schedule of FIR filter
Figure 2.2 Resource allocation of FIR filter
Figure 2.3 Resource bindings for FIR filter
Figure 2.4 RTL datapath of FIR filter
Figure 2.5 Procedure for generic pipeline allocation
Figure 2.6 AUDI high level synthesis flow
Figure 2.7 MTCMOS Macrocell implementation
Figure 3.1 Placement of operations in space-time matrix
Figure 3.2 Pipeline placement of operations ( 0 = 2)
Figure 3.3 Simulated annealing procedure for FU leakage optimization
Figure 3.4 Binding procedure for SA
Figure 3.5 Generation of new states
Figure 3.6 Activity table example
Figure 3.7 Activity combinations for various data-initiation rates
Figure 3.8 Cost Evaluation procedure
Figure 3.9 Clique partitioning for leakage power optimization
Figure 4.1 Total area consumption of datapath (AR filter)
Figure 4.2 Leakage power profile comparison (AR filter)

LEAKAGE POWER DRIVEN BEHAVIORAL SYNTHESIS OF PIPELINED ASICS

Ranganath Gopalan

ABSTRACT

Traditional approaches to power optimization during high-level synthesis have targeted single-cycle designs, where only one input is being processed by the datapath at any given time. Throughput of large single-cycle designs can be improved by means of pipelining. In this work, we present a framework for the high-level synthesis of pipelined datapaths with low leakage power dissipation. We explore the effect of pipelining on the leakage power dissipation of data-flow intensive designs. An algorithm for minimization of leakage power during behavioral pipelining is presented. The transistor-level leakage reduction technique employed here is based on Multi-Threshold CMOS (MTCMOS) technology. Pipelined allocation of functional units and registers is performed considering fixed data introduction intervals. Our algorithm uses simulated annealing to perform scheduling, allocation, and binding for obtaining pipelined datapaths that have low leakage dissipation. We have developed fully pre-characterized RT-level leakage libraries for efficient derivation of the cost functions and fast, accurate simulations of the synthesized designs. Results show an average leakage power reduction of 38.2% for various benchmarks, and an average area overhead of 6.2% over unoptimized pipelined designs. However, when a latency of 1, 2, or 3 is introduced to the schedule length, area optimizations in the range of 3.89-4.6% are observed. The total leakage reduction, however, drops by around 2.8-3.4% for these cases.

CHAPTER 1
INTRODUCTION AND RELATED WORK

Power management is increasingly becoming a major driving force behind the methodologies governing the design and development of high performance VLSI systems. Until recently, speed and reliability formed the key considerations in the pre-Deep Sub Micron (DSM) era during the conception, development, and tape-out of large ASIC systems. This was due to the perception that optimizing large designs for power lowers their performance, so the trade-off did not favour power optimization during the design flow. However, with the rapid advancements in technology coupled with increasingly large performance requirements, the power issue has matured to become a significant bottleneck towards reliable design of large ASIC systems [2]. Intelligent algorithms for minimizing power dissipation of CMOS circuits, during various stages of design, are gaining considerable attention as technology advances into the DSM era.

At the logic and device levels, power dissipation consists of two major components, namely dynamic dissipation and static dissipation. Dynamic power, also known as switching power, is caused by the charging and discharging of capacitances in signal paths and is a major concern for circuits manufactured using transistors with large feature sizes. For high-speed circuits, the switching power can be very large, as given in Equation 1.1.

$$P_{dynamic} = C_L V_{DD}^2 f_m \qquad (1.1)$$

where $C_L$ is the load capacitance, $V_{DD}$ is the supply voltage, and $f_m$ is the frequency of the system. As seen from Equation 1.1, the supply voltage $V_{DD}$ is a major contributor to the dynamic as well as the overall power dissipation of CMOS circuits. Modern CMOS

processes generally employ very low $V_{DD}$ supply voltages. $V_{DD}$ scaling provides for a substantial reduction in dynamic power, due to the quadratic dependence on $V_{DD}$. However, this consequentially increases the transistor delay ($t_d$) and reduces the overall performance of the circuit, as given in Equation 1.2.

$$t_d = K \frac{C_L}{V_{DD}} \qquad (1.2)$$

where

$$K = \left[ \frac{x - 0.1}{1 - x} + \frac{1}{2} \ln(19 - 20x) \right] \frac{1}{1 - x} \qquad (1.3)$$

$$x = \frac{V_{tn}}{V_{DD}} \;\text{ or }\; \frac{V_{tp}}{V_{DD}} \qquad (1.4)$$

where $V_{tn}$ is the threshold voltage for the NMOS transistor, and $V_{tp}$ is the threshold voltage for the PMOS transistor. The above equation reduces to

$$T_{delay} \propto \frac{C_L \, V_{DD}}{K \, (V_{DD} - V_{th})^{\alpha}} \qquad (1.5)$$

where $C_L$ is the load capacitance, $K$ is a process-dependent constant value, $V_{th}$ is the threshold voltage, and $\alpha$ is a factor dependent on the channel length [3]. The delay of CMOS gates suffers as the $V_{DD}$ supply voltage is reduced to as low as 1 V in the DSM regime. Designers offset this delay issue by employing transistors with lower threshold voltage $V_{th}$, which provide faster switching and improved propagation delay times. But $V_{th}$ scaling is not without ills either, as it can lead to an exponential increase in the sub-threshold leakage current, as given in Eqn. 1.6

$$I_{sub} = A \, e^{\frac{q}{n' k T} \left( V_{gs} - V_T - \gamma' V_s + \eta V_{ds} \right)} \left( 1 - e^{-\frac{q V_{ds}}{k T}} \right) \qquad (1.6)$$

where

$$A = \mu_0 \, C_{ox} \, \frac{W_{eff}}{L_{eff}} \left( \frac{kT}{q} \right)^2 e^{1.8}$$

$C_{ox}$ is the gate oxide capacitance per unit area. $\mu_0$ is the zero bias mobility. $n'$ is the sub-threshold swing coefficient of the transistor. $V_T$ is the zero bias threshold voltage, $V_s$ is the source voltage, $V_{gs}$, the gate-source voltage, and

$V_{ds}$, the drain-source voltage. $\gamma'$ is the linearized body effect coefficient and $\eta$ is the Drain Induced Barrier Lowering (DIBL) coefficient [4].

In this work, we will primarily focus on the specific problem of leakage power reduction for high performance VLSI systems. For completeness, the various current components of transistor leakage are illustrated in Fig. 1.1 and summarized below [1]:

1. Reverse-biased PN diode leakage: An effect that occurs due to the reverse-biased pn junctions between the drain/source and the substrate of the CMOS transistor. Band-to-Band Tunneling also occurs due to the bias across the pn junction, causing electrons to tunnel from the p-region to the n-region. This phenomenon usually occurs in highly and unevenly doped silicon junction regions.

2. Sub-threshold leakage: When the gate-source voltage ($V_{gs}$) is reduced below the threshold voltage ($V_{TH}$), a weak inversion condition is formed in the source-drain channel of the transistor. This inversion region gives rise to a diffusion current that flows due to minority carriers. This weak-inversion current is negligible in devices with larger $V_{TH}$; however, as shown in Eqn. 1.6, it becomes significant for modern low-$V_{TH}$ devices. An offshoot of sub-threshold leakage is Drain Induced Barrier Lowering (DIBL) [5], which happens when the threshold voltage is further reduced by barrier lowering caused by the source-drain interaction near the channel surface, in short-channel devices.

3. Gate-oxide Tunneling Leakage: When the gate-oxide thickness is reduced, the increased effect of the gate voltage causes electrons to tunnel from the gate to the substrate and vice versa, leading to gate leakage current. Two mechanisms associated with this phenomenon are Fowler-Nordheim Tunneling and direct tunneling [5].

4. Hot-carrier gate leakage: This occurs from electrons or holes gaining sufficient potential to cross the interface barrier and enter into the oxide layer. This effect is also known as hot-electron injection.

5. Gate induced drain leakage (GIDL): When the gate-drain junction is reverse-biased, the high field effect of the gate causes the drain region under the gate to be depleted of

minority charge carriers. This causes an inversion condition in the drain which can lead to avalanche multiplication and Band-To-Band-Tunnelling (BTBT) [5] effects. This gives rise to a drain to substrate current, which is dependent on the doping level of the drain.

6. Punchthrough: This occurs when the depletion region around the drain extends towards the source, causing a current. When the drain is at a high voltage, this tends to reduce the channel length and is irrespective of the gate voltage. This causes an increase in the total sub-threshold current.

[Figure 1.1. Current components of transistor leakage (reproduced from [1])]

As the technology scales into Deep Sub Micron (DSM) design regimes (< 90 nm), the leakage component of power begins to command a greater proportion of total power dissipation than switching power [6]. Large systems exhibiting significant standby periods will be plagued by this issue, as they tend to dissipate huge amounts of leakage power. This problem escalates as more and more computing applications move into the wireless

domain. Rapid depletion of battery power and long-term damage to circuitry can result if these problems are overlooked.

1.1 Leakage reduction techniques

To reduce leakage power in low-voltage systems, specialized techniques have to be devised that can effectively curb the aforementioned effects. Leakage reduction can be performed at the process level, which may involve either channel engineering or numerous techniques involving doping variations [1]. It can also be accomplished at the circuit level, which involves specialized re-designing of digital CMOS circuits or implementing design methodologies that are targeted towards leakage management. Circuit-level design techniques lend an appreciable degree of freedom to designers, towards the devising and development of automated leakage reduction strategies that can be readily integrated in design tools.

Although there exist a number of leakage reduction techniques in the literature, we will discuss some of the techniques that have been dealt with more recently. These are summarized as follows:

1. Multi-Threshold CMOS: In this technique, an extra transistor known as the sleep transistor is inserted in series between $V_{DD}$ and the rest of the functional circuitry. Usually, subsystems are put to `sleep' by turning off $V_{DD}$ by gating the sleep transistor off. These subsystems are generally functional units such as full-adders, multipliers, floating-point ALUs, decode-units etc., which may be idle at certain points in time during the operation of the entire system. When a functional unit becomes idle or is not performing any useful work at that time, control signals known as the sleep signals turn off the sleep transistors and isolate the unit from its $V_{DD}$ supply. This results in a large reduction in the sub-threshold leakage current flowing through the OFF stacks of transistors. This technique was presented by Mutoh [7] and is illustrated in Figure 1.2. The disadvantage of this technique is the delay penalty incurred due to frequent `sleep' and `wake-up' transitions of the functional circuitry. In MTCMOS, sleep

transistors are generally of a larger $V_{TH}$ than that of the rest of the transistors in the circuit. The sleep transistor also has to be of an appropriate size, since it has to supply large current through the active transistors and the load capacitance of the circuit. This drives up the area overhead associated with the sleep transistor sizing, which is necessary to limit the performance loss.

[Figure 1.2. MTCMOS XOR gate]

2. Dual Threshold CMOS: In a logic circuit, the critical path dictates the maximum timing constraint and the available slack that can be imposed on other paths that are not critical [8]. Low-$V_{TH}$ transistors allow for faster switching, but dissipate more leakage power compared to high-$V_{TH}$ transistors, which have less switching speed but significantly lower leakage dissipation. Thus, transistors (cells) along the critical path can be low-$V_{TH}$ transistors (cells), while the slacks in timing that exist in non-critical paths can be utilized by employing high-$V_{TH}$ transistors (cells). This technique has the advantage of little to no area overhead, while also providing very little compromise in the timing of the logic circuit.

3. Transistor stacking: Halter and Najm [9] establish that leakage dissipation is an input-dependent condition. Different inputs to the same logic can result in different leakage currents. In fact, the input vectors to a logic circuit can be ordered in terms of increasing leakage power. By varying the inputs, a phenomenon known as transistor

stacking takes place. In this approach, a number of transistors either in the p-network or in the n-network are turned off due to the input vector combination. When more than one transistor in a stack of series-connected transistors is turned off, the sub-threshold leakage current is seen to reduce substantially [10]. A minimal leakage input vector is one that maximizes the stack of off transistors. Many techniques have been explored that attempt to determine leakage minimizing input vector combinations, also known in literature as Input Vector Control (IVC) [11].

[Figure 1.3. LECTOR technique]

4. Self-stacking Dual Transistor method: This technique, also known in literature by its acronym LECTOR (for Leakage Controlling Transistors), was proposed by Hanchate and Ranganathan [12]. A pair of p and n transistors, whose sources are shorted to their counterpart gates, is placed in series with the p and n network. These form a pair of self-stacking transistors which, though invisible to the rest of the functionality, serve to greatly increase the resistance to the flow of leakage current. Figure 1.3 illustrates this technique. The authors present a few circuit-level methods for managing the potential area overhead that can be incurred in this method. However, it is a technique that does not require any threshold modifications nor does it incur any delay penalties, while providing substantial gains in leakage reduction.
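The input-vector idea in item 3 can be illustrated with a small sketch. It assumes a toy leakage model of a k-input NAND gate: `nand_leakage`, `i_off`, and `stack_factor` are hypothetical names and constants chosen for illustration, not values from the cited works. The search itself is a plain exhaustive scan, which is only practical for small input counts.

```python
from itertools import product


def nand_leakage(inputs, i_off=1.0, stack_factor=10.0):
    """Toy leakage (arbitrary units) of a k-input NAND gate for one input vector.

    i_off and stack_factor are hypothetical constants, not measured data.
    """
    n_off = sum(1 for bit in inputs if bit == 0)  # series NMOS devices that are off
    if n_off == 0:
        # Output is low: every parallel PMOS device is off and leaks on its own.
        return i_off * len(inputs)
    # Output is high: the NMOS stack leaks, and each extra off device cuts the current.
    return i_off / stack_factor ** (n_off - 1)


def min_leakage_vector(n_inputs, leakage_fn):
    """Exhaustively search all 2**n_inputs vectors for the lowest leakage."""
    return min(product((0, 1), repeat=n_inputs), key=leakage_fn)


best = min_leakage_vector(3, nand_leakage)
print(best, nand_leakage(best))
```

Under this toy model the all-zero vector wins because it turns off the whole series stack, which matches the qualitative stacking argument above.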

1.2 High level leakage reduction

Although leakage is a transistor-level phenomenon, the characteristics of various leakage reduction techniques enable optimizations to be performed at various levels of design abstraction. Powell et al. [13] present a novel approach to reduction of leakage power in cache memories using `intelligent' caches that can dynamically identify portions that are unused and apply supply voltage gating to reduce leakage in those portions. A popular method that substantially reduces sub-threshold leakage by turning off stacks of series-connected transistors by controlling the inputs is presented in [14]. Chen et al. [15] use genetic algorithms to determine optimal low-leakage input vectors that can minimize leakage in various components through transistor stacking.

Sundarajan and Parhi [16] present an approach which uses the dual-$V_{TH}$ technique for combinational logic. Static timing analysis of various logic blocks is performed to determine critical path and non-critical path timings. Using dual-$V_{TH}$ transistors, they then balance all non-critical path delays so as to match up with the critical path delay. The final delay-balanced design is well optimized for leakage, while also satisfying the user-specified timing constraint.

Most of these techniques perform optimizations at the logic and physical levels. In our work, we consider the behavioral level of abstraction for our optimization strategy. Behavioral or high level synthesis is a process during which the following take place: an algorithmic behavior of a system, usually specified in a high-level language such as VHDL or C, is resolved into a Control-Data-Flow Graph (CDFG). The CDFG contains information on the various data and control dependencies within the behavior. This intermediate description is then analyzed by a synthesis system, which performs the tasks of operation scheduling, resource allocation, and hardware binding, to obtain a register transfer level (RTL) description. This RT-level description can then be processed by logic and layout synthesis tools to produce physical layouts at specified CMOS technologies. Techniques for leakage reduction, when applied during behavioral synthesis, can result in RT-level solutions with significantly reduced leakage power.
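The slack-driven dual-$V_{TH}$ idea described in this section can be sketched in a few lines. This is a simplified greedy illustration under stated assumptions, not the delay-balancing algorithm of [16]: gate names, delays, and the helper names `critical_delay` and `assign_high_vth` are hypothetical, and each high-$V_{TH}$ cell is assumed to add one delay unit. It uses `graphlib`, available in Python 3.9 and later.

```python
from graphlib import TopologicalSorter


def critical_delay(delay, preds):
    """Longest-path delay of a gate DAG; preds maps each gate to its fan-in gates."""
    arrival = {}
    for gate in TopologicalSorter(preds).static_order():
        arrival[gate] = delay[gate] + max((arrival[p] for p in preds[gate]), default=0.0)
    return max(arrival.values())


def assign_high_vth(low_delay, high_delay, preds, t_max):
    """Greedily move gates to high-V_TH cells while the timing constraint holds."""
    delay = dict(low_delay)
    high = set()
    for gate in sorted(low_delay):
        trial = {**delay, gate: high_delay[gate]}
        if critical_delay(trial, preds) <= t_max:
            delay = trial  # the slower cell fits in the available slack
            high.add(gate)
    return high


# Hypothetical four-gate netlist: a and b feed c, which feeds d.
preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
low = {"a": 2.0, "b": 1.0, "c": 2.0, "d": 2.0}
high = {gate: d + 1.0 for gate, d in low.items()}  # assumed high-V_TH delay penalty
print(assign_high_vth(low, high, preds, t_max=6.0))
```

In this example only gate `b` lies off the critical path a-c-d, so it is the only gate that can absorb the slower high-$V_{TH}$ cell without violating the timing constraint.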

Khouri and Jha [17] presented a technique to reduce leakage power during behavioral synthesis using the Dual-$V_{th}$ technique. They provide a gate-level leakage analysis procedure, using which they develop a pre-characterized gate library for leakage estimation. Their optimization strategy uses a prioritization algorithm to identify frequently idle operations in a CDFG, and bind them to resources that will later on become non-critical path elements capable of being instantiated with high-$V_{TH}$ cells. Critical path operations are made up of low-$V_{TH}$ cells, and thus the final design does not incur any area or delay penalty.

The low-level leakage reduction technique employed in this work is based on Multi-Threshold CMOS technology (MTCMOS). Gopalakrishnan and Katkoori [18] proposed an approach for leakage power optimization during behavioral synthesis using MTCMOS. Functional units and registers are bound in such a way that their idle times are maximized and contiguous. The functional units with maximal idle time and potentially minimal area-overhead penalties are then bound to MTCMOS technology.

Traditionally, leakage power reduction strategies have targeted regular single-cycle datapath synthesis. In such cases, leakage optimized designs have shown lower levels of performance than unoptimized designs. To date we have not seen much work done with regard to leakage power reduction during pipeline synthesis. We deal with functional pipelining during behavioral synthesis and its effects on the leakage power dissipation of the system, and how pipelining can be easily adapted to synthesize systems that have both low leakage dissipation and high throughput. This is one of the first works that actively emphasizes the impact of pipelining during low leakage power synthesis.
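The notion of idle times that are "maximized and contiguous" can be made concrete with a short sketch. It is an illustration only, not the binding procedure of [18]; the function name `idle_profile`, the unit name, and the busy steps are hypothetical.

```python
def idle_profile(busy_steps, schedule_length):
    """Return {unit: (idle_steps, longest_contiguous_idle_run)}.

    busy_steps maps a functional unit name to the set of 1-based control
    steps in which it executes an operation.
    """
    profile = {}
    for unit, busy in busy_steps.items():
        longest = run = 0
        for step in range(1, schedule_length + 1):
            run = 0 if step in busy else run + 1  # extend or reset the idle run
            longest = max(longest, run)
        profile[unit] = (schedule_length - len(busy), longest)
    return profile


# Hypothetical bindings of the same three additions over a 6-step schedule.
scattered = idle_profile({"ADD_1": {1, 3, 5}}, 6)
packed = idle_profile({"ADD_1": {1, 2, 3}}, 6)
print(scattered, packed)
```

Both bindings leave the adder idle for three steps, but only the packed one gives a single three-step idle window, which means fewer sleep and wake-up transitions for an MTCMOS unit.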
1.3 Power reduction during functional pipelining

Functional pipelining is the method of segmenting a data-flow graph of a behavioral description into several stages, where each stage contains the mapped hardware resources to execute the operations of the segmented sub-graphs. The results of each stage are stored in registers or latches generally known as stage latches or stage registers. The characteristic feature of this technique is that successive tasks are initiated before the results of the

previous tasks are obtained. This results in an increase in the throughput of the circuit, compared to that of a non-pipelined datapath obtained from behavioral synthesis, where consecutive results are obtained only after a latency which is equal to or greater than the length of the critical path of the circuit. However, there is an inherent increase in the register and functional resource cost, due to implementation of the pipeline stages.

[Figure 1.4. Pipelines with various initiation rates. Panels: DII = 1 input/clock; DII = 1 input/2 clocks; DII = 1 input/3 clocks; flow-based high level synthesis]

Pipeline synthesis has been widely explored in the literature, and many behavioral synthesis systems have incorporated functional pipelining algorithms within their synthesis flow. Sehwa [19] is the first synthesis system that generates pipelined datapaths from behavioral descriptions. It schedules a data-flow graph with feasibility constraints to determine a minimal-cost maximal-performance pipeline. It performs scheduling and allocation simultaneously to determine optimal-performance designs, and considers a fixed pipeline latency for synthesis purposes. HAL [20] uses the force-directed scheduling algorithm to perform loop winding. PLS [21] uses a heuristic-based list scheduling algorithm, and performs forward and backward scheduling for minimal delay and optimized resource usage. It considers loops with inter-iteration dependencies. SODAS-DSP [22] uses an iterative/constructive type scheduler and a two-pass allocation approach for the synthesis of pipelines. While most works synthesize pipelines with fixed data initiations, Jun and Hwang [23] consider the problem of synthesis of pipelines supporting variable data initiation intervals.
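The throughput gain from overlapping successive tasks, described at the start of this section, can be illustrated with a short calculation. The sketch assumes that both datapaths run at the same clock, that the non-pipelined datapath accepts a new input only once every `latency` cycles, and that `total_cycles` and `speedup` are illustrative names rather than anything defined in the thesis.

```python
def total_cycles(n_inputs, latency, dii):
    """Clock cycles needed to produce results for n_inputs data sets."""
    # The first result appears after `latency` cycles; each later input
    # enters `dii` cycles after the previous one.
    return latency + (n_inputs - 1) * dii


def speedup(n_inputs, latency, dii):
    """Speedup of a pipeline over a non-pipelined datapath of the same latency."""
    non_pipelined = total_cycles(n_inputs, latency, latency)
    return non_pipelined / total_cycles(n_inputs, latency, dii)


# Example: a 5-step schedule processing 100 inputs at different initiation intervals.
for dii in (1, 2, 3, 5):
    print(dii, total_cycles(100, 5, dii), round(speedup(100, 5, dii), 2))
```

For long input streams the speedup approaches `latency / dii`, which is why a smaller data initiation interval pays off in throughput at the cost of extra registers and functional units.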

In pipelined systems, the throughput is dependent on the data introduction rate and the clock frequency. A fully pipelined system is one that has a data introduction rate of one input per clock cycle. Even if a pipeline with a data introduction rate of two is clocked at the same frequency as a fully pipelined system, its throughput will be effectively half that of the fully pipelined system. Figure 1.4 shows pipelines with data introduction rates of 1, 2 and 3, and also a regular single-cycle datapath system. Each box depicts the time interval between successive data initiations, known as a data window. It also depicts the resource sharing possibilities between operations within a data window. Operations executing at concurrent times cannot share resources amongst each other, whereas they can share with those that switch at all other non-concurrent times.

A general notion is that pipelined systems consume less power than non-pipelined systems. This is due to the fact that implementation of stage registers can allow the individual components to function at a lower rate [24]. This can provide the flexibility for the system supply voltage to be dropped substantially, reducing the power dissipation by a factor of about 2-2.5. In the same work, the authors describe a method to reduce power by having parallel pipelines for a single-flow circuit. Here, consecutive data initiations are switched on different parallel paths, so as to reduce power while also maintaining the same throughput. A shortcoming of this approach is the amount of registers and interconnect required to implement the input switching logic and output multiplexing logic.
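The resource-sharing rule for data windows stated above can be sketched as a modulo computation: two operations scheduled in control steps that are congruent modulo the data initiation interval execute concurrently (on different inputs) and therefore need separate units. The sketch assumes single-cycle operations; `min_units` and the FIR-like schedule (five multiplications in step 1, four additions in steps 2 to 5) are illustrative assumptions.

```python
from collections import Counter


def min_units(schedule, dii):
    """Minimum functional units per operation type for a given initiation interval.

    schedule maps an operation name to (op_type, control_step).
    """
    load = Counter()
    for op_type, step in schedule.values():
        load[(op_type, step % dii)] += 1  # ops in the same slot run concurrently
    units = {}
    for (op_type, _slot), count in load.items():
        units[op_type] = max(units.get(op_type, 0), count)
    return units


# Hypothetical FIR-like schedule: multiplications in step 1, a chain of additions after.
schedule = {f"m{i}": ("mul", 1) for i in range(1, 6)}
schedule.update({f"a{i}": ("add", i + 1) for i in range(1, 5)})

for dii in (1, 2, 5):
    print(dii, min_units(schedule, dii))
```

With one input per clock every addition needs its own adder, while widening the data window lets additions in non-concurrent slots share a unit.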
Dynamic power, which is a particularly dominant component of power in ASICs that are fabricated at the sub-micron level (0.35 μm and above), can be considerably reduced by these methods. Chang [25] presents a method of minimization of dynamic power in functionally pipelined datapaths with conditional branches. A resource binding strategy is formulated and solved as a network flow problem, where an optimal binding is determined so as to minimize the total switching activity. Heo et al. [26] explore the effect of pipelining on the power consumption, and attempt to determine a logic depth within each pipeline stage such that the overall power is reduced due to pipelining. A loop pipelining algorithm which makes use of rotation scheduling to minimize power while also reducing schedule length was presented in [27]. Kim et al. [28] minimize power in pipelines

by scheduling the operations such that ops with common inputs share the same functional units to minimize input transitions.

These works address the issue of dynamic power reduction; however, sub-threshold leakage has become an important consideration due to the shrinking of transistor sizes and usage of lower supply voltages. Therefore, algorithms are needed that are capable of synthesizing leakage-optimized pipeline designs, giving the designer the advantages of low supply voltages, increased throughput and lower leakage power dissipation. In our work, we will attempt to address this particular issue.

The rest of this thesis is organized as follows: Chapter 2 gives a preliminary overview of basic behavioral synthesis and describes the steps that are to be taken into consideration for pipelined designs. Leakage power optimization in pipelined designs is a limiting case of similar problems when dealing with regular high level synthesis. Hence the assumptions that are to be considered when performing optimization during pipelining are also described in this chapter. Chapter 3 gives a detailed description of our simultaneous scheduling, allocation, and binding approach. Chapter 4 describes the experimental procedure and results. Chapter 5 concludes this thesis and gives details of the future work.

CHAPTER 2
PRELIMINARIES

Before beginning the explanation of our approach, we will give a detailed overview of the various preliminaries involved in behavioral and pipeline synthesis, as well as the underlying techniques and assumptions being employed in our approach. We will begin by giving an overview of behavioral synthesis and its various aspects.

2.1 Behavioral synthesis

A behavioral synthesis system converts an algorithm, usually specified in a high-level language, to a more elaborate structural-level description, known as a Register Transfer Level (RTL) description. This process consists of many steps, which may or may not be independent of each other. In the initial step, a behavioral description is resolved into an intermediate graph-level specification. This graph captures the complete data-flow and control-flow behavior of the system, and is an intermediate language bridging a high-level language and a synthesis system. Hence the process of behavioral synthesis (hereafter referred to as high-level synthesis) can also be referred to as behavioral compilation. Typically, graph descriptions for high level synthesis are of three types:

1. Data-flow graph (DFG): which describes the data-dependence of various operations on each other. It stores the predecessor and successor information of operations, which will be used in the scheduling phase. It contains the information regarding the number and type of functional units that will be needed during synthesis. Each operation generally consists of two inputs and one output, which can be varied for catering to different user-specific requirements.

2. Control-flow graph (CFG): which describes the control-flow of the behavioral description and adheres strictly to the precedence conditions present in the data-flow

graph. Information on advanced sequential constructs such as branching and looping is captured in the control-flow graph.

3. Control-Data-flow graph (CDFG): which is a more comprehensive representation format, and is a combination of DFG and CFG. This format represents the behavioral nature of many real-world designs. Many architecture modelling languages such as VHDL and System-C are resolved into CDFGs before compiler-level optimizations are made.

In this work, we consider a data-flow graph $G(V, E)$ where $V$ represents the set of operations in the graph, and $E$ represents the edges between them. Graphs for high-level synthesis can be either data-flow intensive or control-flow intensive (CFI) or both. In data-flow intensive designs, there are few control-flow changes, and most of the functionality is characterized by computationally intensive operations present in the graph. Control-flow intensive designs contain many advanced sequential constructs such as conditionals and loops that affect the control-flow, making the design of the controller more complicated. In this work, we consider datapath intensive designs over control-flow intensive designs due to the following reasons:

1. Design of the controller is simpler in this case.

2. Controller leakage power has a very low contribution, and the bulk of the leakage power dissipated is concentrated in the datapath in this case.

The data-flow graph extracted from the high level design is then analyzed during the steps of scheduling, allocation, and binding for the realization of the RTL. These steps are explained below:

1. Scheduling: is the process of assigning operations in a data-flow graph to specific time-steps, such that data dependencies are satisfied and certain user-specified system constraints are met (such as resources, latency, area, or power). Typical scheduling algorithms available in literature are As-Soon-As-Possible (ASAP), As-Late-As-Possible (ALAP), Force-directed scheduling (FDS), etc. The output of the scheduling

phase is a time-stamp for each functional operation, and lifetime information for each edge in the data-flow graph.

2. Resource allocation: refers to partitioning the scheduled data-flow graph such that operations with non-overlapping execution lifetimes belong to the same partition. Generic graph partitioning algorithms such as clique partitioning, the Left-edge algorithm (LEA), etc., are used in this phase for resource allocation.

3. Binding: the phase during which these partitions are bound to hardware instances. A typical goal of the binding phase is to maximize the utilization of each hardware instance, which is related to the area consumed by the datapath. Binding decisions typically affect the various characteristics of the final datapath, such as area, power, crosstalk, etc. Hence, it is necessary to have an intelligent binding algorithm that gives an optimal solution matching the needs of the designer.

[Figure 2.1. ASAP schedule of FIR filter]

As an example, we illustrate the high-level synthesis of a Finite Impulse Response (FIR) filter. Figure 2.1 shows the ASAP schedule of the FIR filter data-flow graph. Figure 2.2 gives the various partitions formed during the allocation phase. The binding of operations to functional instances and edges to registers is shown in Figure 2.3. The

final RTL datapath resulting after the binding phase is illustrated in Figure 2.4. The FSM controller for this datapath is illustrated in Table 2.1.

[Figure 2.2. Resource allocation of FIR filter]

The FIR filter consists of 5 multiplier operations and 4 adder operations. The critical path has a length of 5, which is also the latency of the ASAP schedule. The number of allocated multipliers and adders is 5 and 1 respectively. In the binding phase, multiplexers are generated which form the data-steering logic and are switched by the control signals generated by the controller. The final RTL datapath containing functional units, registers, and multiplexer logic is then obtained.

Having given an overview of high-level synthesis and its important aspects, we now turn our attention to functional pipelining during behavioral synthesis.

2.2 Pipelining during behavioral synthesis

There are generally two forms of pipelining known in the literature:

1. Structural pipelining: operations in the data-flow graph are bound to pipelined hardware instances. However, the execution of the data-flow graph as such is not pipelined. This is also known as hierarchical pipelining.

2. Functional pipelining: there is no hierarchical pipelining and there are no pipelined resources; the data-flow graph as a whole is pipelined. In our work, we consider the problem of functional pipelining.

[Figure 2.3. Resource bindings for FIR filter. Multiplier instances M0, M1, M2, M3, M4 are bound to op1, op2, op4, op6, op8; adder instance A0 is bound to op3, op5, op7, op9; registers R0 to R9 hold the edges yout, r0 to r7, x0 to x4, and c1 to c5]

A functionally pipelined datapath is segmented into $N$ linear stages [29], where each stage contains the required resources to execute the relevant operations of that stage. The number of stages in a pipelined data-flow graph depends on the schedule length or latency $\lambda$ of the design, and on the data introduction interval ($\delta_0$), also known as the pipeline latency, as follows,

\[
\text{pipeline stages } N = \left\lceil \frac{\text{schedule length}}{\text{data initiation interval}} \right\rceil = \left\lceil \frac{\lambda}{\delta_0} \right\rceil
\]

[Figure 2.4. RTL datapath of FIR filter]

The data introduction interval is a global constraint representing the time interval between two consecutive input vectors. It is generally a constant term and smaller than the latency $\lambda$. We shall take the example of a scheduling algorithm such as ASAP scheduling, where the schedule is obtained by topologically sorting the vertices of the data-flow graph. This ASAP schedule for an FIR filter is already shown in Figure 2.1.

Consider a data-flow graph containing $k$ resource types, which may be adder, multiplier, comparator, etc. In the FIR example, $k = 2$, where the two resource types are an $n$-bit adder and an $n$-bit multiplier. For each resource type, given a fixed pipeline latency (say

$\delta_0 \in \{1, 2, 3, 4, \ldots, d\}$, where $d$ is the maximum allowed data introduction interval), the operations that execute at time-steps $i \cdot \delta_0 + l$ (where $i$ is an integer in the interval $\{1, 2, \ldots, N\}$ and $l$ is an integer in the interval $\{1, 2, \ldots, \delta_0 - 1\}$) occur concurrently, and cannot share resources amongst one another. Operations that do not execute concurrently are thus compatible and typically can share the same functional resource. From this information, a compatibility graph $S$ is built individually for functional resources and for registers, and is partitioned into a minimal number of cliques. Pseudo-code of this allocation procedure is presented in Figure 2.5.

Table 2.1. Controller for FIR filter

  States   Control bus
  S0       000000000000000000
  S1       111111111101111100
  S2       111110000010000000
  S3       100000000000000000
  S4       100000000000000010
  S5       100000000000000001
  S6       100000000000000011
  S7       000000000000000000

  Pipeline allocation: PAlloc[$G(V_n, T_V, \delta_0)$]
    for i in {v_0, v_1, v_2, ..., v_n} of V
      for j in {v_0, v_1, v_2, ..., v_n} of V
        if (v_i != v_j)
          if (T_{v_i} mod \delta_0) != (T_{v_j} mod \delta_0)
            S(i, j, 1)
          else
            S(i, j, 0)
        else
          S(i, j, 1)
    Clique_Partition{S}

Figure 2.5. Procedure for generic pipeline allocation

These cliques are then mapped to functional resources and registers. The pipeline controller differs from the regular controller in that it has only $\delta_0$ states. However, the number of control signals per control state is much higher in pipelined systems than in regular systems.
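The stage-count formula and the allocation procedure of Figure 2.5 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, and a simple first-fit grouping stands in for the clique partitioner, so it is not the partitioning heuristic used in AUDI. The schedule is the ASAP schedule of the FIR filter (all five multiplications in step 1, the four additions in steps 2 to 5).

```python
import math
from collections import defaultdict

def pipeline_stages(latency, delta0):
    """Number of pipeline stages: ceil(latency / data introduction interval)."""
    return math.ceil(latency / delta0)

def compatible(t_a, t_b, delta0):
    """Two distinct operations can share a unit only if their time steps
    differ once the schedule is folded modulo delta0 (rule of Figure 2.5)."""
    return (t_a % delta0) != (t_b % delta0)

def allocate(schedule, op_type, delta0):
    """First-fit grouping into cliques, done separately per resource type.

    ASSUMPTION: first-fit replaces the clique partitioner for brevity.
    schedule: op name -> time step; op_type: op name -> resource type.
    Returns resource type -> list of cliques (one clique per instance).
    """
    cliques = defaultdict(list)
    for op in sorted(schedule, key=schedule.get):
        for clique in cliques[op_type[op]]:
            if all(compatible(schedule[op], schedule[o], delta0) for o in clique):
                clique.append(op)
                break
        else:
            # No existing instance can take this operation: open a new one.
            cliques[op_type[op]].append([op])
    return dict(cliques)

# ASAP schedule of the FIR filter example.
SCHEDULE = {"op1": 1, "op2": 1, "op4": 1, "op6": 1, "op8": 1,
            "op3": 2, "op5": 3, "op7": 4, "op9": 5}
OP_TYPE = {op: ("add" if op in {"op3", "op5", "op7", "op9"} else "mult")
           for op in SCHEDULE}

if __name__ == "__main__":
    result = allocate(SCHEDULE, OP_TYPE, delta0=2)
    print(pipeline_stages(5, 2), len(result["mult"]), len(result["add"]))
```

With a data introduction interval of 2 the sketch groups the additions into two instances and keeps five multiplier instances, which agrees with the pipelined placement discussed for this example in Chapter 3.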

2.3 The AUDI system

For this work, we make use of AUDI [18], a behavioral synthesis system, as the framework in which our algorithms are integrated and experiments are carried out. AUDI (also known as AUtomatic Design Instantiation) is currently being developed by the VLSI research group at the University of South Florida. It is an interconnect-centric behavioral synthesis system, and can currently produce low-power and leakage-power optimized designs. The AUDI high-level synthesis flow is shown in Figure 2.6.

[Figure 2.6. AUDI high-level synthesis flow. Blocks shown: behavioral VHDL; VHDL2AIF frontend; AUDI Intermediate Format (AIF); CDFG generation; scheduling (ASAP scheduling, force-directed scheduling); resource allocation (clique partitioning); binding (power and crosstalk considerations, MTCMOS component library); register transfer level design (datapath, controller); structural VHDL generation; FASL leakage simulation library]

Behavioral VHDL is used as the input language to the system, and is converted to a CDFG-like language using a VHDL-to-AIF translator. This CDFG format is referred to as the AUDI Intermediate Format (AIF), and is the input format for the AUDI system. The AIF format always consists of: 2-input/1-output operations, control-flow indicators such as if and while, and memory read-write operations.

The operations contained in AIF are scheduled using one of the many available scheduling algorithms such as ASAP, Force-directed scheduling, etc. This is followed by the allocation phase, where these operations are partitioned into a minimal number of maximal-size mutually-exclusive partitions using a clique partitioner.

Multiplexors, which form the data-steering logic primarily used in AUDI, are generated during the binding phase. The partitioned cliques are mapped to hardware-level instances, and a structural netlist of functional units, registers, multiplexors, and their interconnections is output in structural VHDL.

Many circuit-level considerations make their way into the binding phase to influence the final design netlist. Designs may be bound so as to minimize power, leakage, cross-talk, delay, and other such parameters. The type of technology and the device-level characteristics can determine how binding is performed during high-level synthesis. Power can be minimized by using intelligent scheduling and binding solutions, and work regarding this issue has been done in [30], [31], [32], [33], [34]. Leakage power minimization during behavioral synthesis has been dealt with in [17], [18].

In AUDI, low-leakage binding is performed with Multi-threshold CMOS (MTCMOS) using an area-efficient Knapsack algorithm [35]. Our work determines an optimal binding solution for pipelined datapaths and uses the approaches in [18] and [35] to obtain an optimal MTCMOS datapath.

Before we give an overview of the MTCMOS binding solution that is integrated in AUDI, we describe the VHDL-to-AIF AUDI frontend that was developed as part of this work.

2.4 AUDI frontend

The AUDI system was conceptualized for RTL synthesis, using VHDL as its main target language. The system takes as input code written in behavioral VHDL and provides as output a netlist in structural VHDL. The frontend of AUDI (also known as VHDL2AIF) processes the behavioral VHDL and converts it to the AUDI Intermediate Format (AIF). VHDL2AIF is a high-level translator and was written using Lex, Yacc, and C. AIF is the CDFG input format for the schedulers used in AUDI. Some of the important features of VHDL2AIF are:

1. Full IEEE-754 Standard 32-bit floating point representation.
2. Floating-point intensive operations such as SINE, COSINE, LOG, REAL, FP-add, FP-mult, FP-divide are fully supported.
3. Support for memory intensive operations (using the op-codes MEMREAD and MEMWRITE).
4. Support for signal-indexing with integer and variable indexes.
5. Full support for nested loops and nested conditionals.
6. Support for WAIT statements.
7. Stable support for large ASIC benchmarks (> 5K lines).

It consists of two parts: a lexical analyser, which resolves the input VHDL file into a stream of discrete units called tokens; and a Yacc parser, which reads in these tokens, validates them against a pre-specified grammar, and executes an action which is coded in C. The grammar conforms to the existing VHDL-93 standard language syntax. Currently, the tool can handle only single-process architectures and single entities. Some of the VHDL features that are handled in the translator are given below:

1. Package declarations: Constant, Type, and Subtype declarations are handled and inline expanded wherever they are encountered in the architecture body.

2. Array operations: Arrays are instantiated as memory in AUDI. Operations on arrays are replaced with equivalent MEMREAD {Memory_read(array, index)} and MEMWRITE {Memory_write(array, index)} operations.
3. Real and floating-point operations: Pre-compiled FP benchmarks are employed for performing floating-point operations [36]. Wherever floating-point intensive operations are encountered, the translator substitutes them with simple atomic ops such as FADD, FSUB, FMUL, FDIV, which are later instantiated by the VHDL2AIF floating-point library.
4. Signals and variables: Signals and variables are synthesized as registers in AUDI.
5. Conditionals: Nested conditionals which include if, else, end if, elsif, and exit when are fully supported.
6. Loops: Nested For and While loops with either a definite or an indefinite number of loop iterations are handled.
7. Wait statements: While indefinite wait and wait on are not supported, wait for and wait until are supported. These are translated into simple controller wait cycles.

Currently, the benchmarks that have been compiled using VHDL2AIF are NAVIFIND, a system for a mobile tour guide; RECOG, a coherent GPS signal receiver; and a cancer detection benchmark performing a deconvolution fast Fourier transform.

We will now give an overview of the MTCMOS binding solution that is integrated in AUDI.

2.5 MTCMOS binding

At the physical level, the functional modules are selectively bound to MTCMOS instances. The MTCMOS design methodology is described in Figure 2.7. Each MTCMOS instance consists of a sleep transistor placed in series with its $V_{DD}$ rail and in series with a macrocell or standard-cell implementation of the module. The delay of the instance is a function of its corresponding sleep transistor width. As this width increases, the delay of

the instance decreases. However, the leakage power dissipated by the sleep transistor also increases, due to the dependency of the sub-threshold leakage current on the $(W/L)$ ratio. When a sleep transistor is sized beyond a certain level, the sub-threshold leakage power it dissipates when it is turned OFF during idle time becomes larger.

[Figure 2.7. MTCMOS macrocell implementation (sleep transistor in series with an 8-bit multiplier)]

In this work, we consider a parameterized MTCMOS component library [18], which consists of characterized functions to choose the appropriate sleep transistor width for a required delay margin. The sleep transistor generally has a higher $V_{TH}$ than the other transistors present in the module. An active-high sleep signal gates the sleep transistor off and isolates the module from the $V_{DD}$ rail, thus causing a large reduction in the sub-threshold leakage current flowing through the off stacks of transistors. This technique provides large reductions during standby mode and during times when a functional unit becomes idle. In AUDI, modules that have large idle times after scheduling are allocated and bound to these MTCMOS functional units. The sleep signals are generated by the controller circuitry and generally cause a small increase in the dynamic power of the circuit. Thus MTCMOS can offer a level of leakage power reduction comparable to other techniques, though it bears the additional burdens of area overhead and delay penalty. However, MTCMOS is a simpler design technique and more readily applicable to the high-level synthesis of large circuit designs, whereas other techniques may need long precomputation times for determination and application.

2.6 Simulation-based architectural leakage estimation

For large datapaths, leakage estimation through HSpice becomes very slow, due to its long simulation times. There is a need for fast and accurate leakage estimation algorithms so as to determine the efficacy and validity of leakage optimization algorithms. For this purpose, we make use of a tool called FASL (Fast Architectural Simulator for Leakage power), which is a register-transfer leakage simulation library and is compatible with the Cadence NCLaunch HDL simulation tools [37]. This library consists of pre-characterized leaf-cell components (namely full-adder, nand, and, or, not, xor, and so on) that are exhaustively characterized for leakage power dissipation. The simulation model utilized in FASL is explained as follows. The leakage dissipation of a leaf cell is primarily divided into two regions: a transient region and a steady-state region. In the transient region, the leakage power dissipated by the cell is temporally dependent on both the current input and the previous input to the cell, while in the steady-state region, it depends only on the current input. Using automated scripts, the threshold time, that is, the time after the inputs have changed for which this transient condition exists, is determined for each leaf cell. Thus, the instantaneous leakage power of a leaf cell is given as

\[
P^{T_i}_{leakage} =
\begin{cases}
P^{transient}_{leakage}[\text{current input}, \text{previous input}] & T_0 \le T_i \le T_{threshold} \\
P^{steady\ state}_{leakage}[\text{current input}] & T_i > T_{threshold}
\end{cases}
\]

Since all leaf-cell level components are fully characterized, a simulation of the hierarchical description results in the simulation of the leaf cells. Thus the total leakage power is calculated as a summation of all the individual leaf-cell leakage powers.
The leaf cells are of two categories: non-MTCMOS cells and MTCMOS cells. For the MTCMOS cells, the optimum sleep transistor width determined as in Section 2.5 is used in the design, and extensive characterization is performed for both 'sleep' mode and 'wake-up' mode. Thus a complete simulation library for leakage power is available for simulating the datapaths obtained from synthesis. The accuracy of the simulation is within 2% of the results obtained from HSpice, while the run-times are several orders of magnitude shorter.
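The two-region leakage model described in this section can be illustrated with a small Python sketch. It is not FASL code: the class, the function names, and all numeric values below are hypothetical, and the lookup tables stand in for the pre-characterized leaf-cell data.

```python
from dataclasses import dataclass

@dataclass
class LeafCell:
    """Hypothetical container for one pre-characterized leaf cell."""
    threshold: float  # time after an input change during which leakage is transient
    transient: dict   # (current_input, previous_input) -> leakage power
    steady: dict      # current_input -> leakage power

def leaf_leakage(cell, t_since_change, current, previous):
    """Instantaneous leakage of one leaf cell under the two-region model."""
    if t_since_change <= cell.threshold:
        # Transient region: depends on both the current and the previous input.
        return cell.transient[(current, previous)]
    # Steady-state region: depends on the current input only.
    return cell.steady[current]

def total_leakage(observations):
    """Sum of leaf-cell leakages; each observation is
    (cell, time since last input change, current input, previous input)."""
    return sum(leaf_leakage(c, t, cur, prev) for c, t, cur, prev in observations)

# Illustrative, made-up characterization data for a 1-input cell.
INVERTER = LeafCell(
    threshold=2.0,
    transient={(0, 0): 1.0, (0, 1): 1.4, (1, 0): 1.6, (1, 1): 1.2},
    steady={0: 1.0, 1: 1.2},
)

if __name__ == "__main__":
    print(total_leakage([(INVERTER, 1.0, 1, 0), (INVERTER, 5.0, 1, 0)]))
```

The first observation falls inside the transient window and uses the input pair, while the second is past the threshold time and uses only the current input.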

To summarize, we described the process of behavioral synthesis, which consists of the steps of scheduling, allocation, and binding. These steps determine the quality of the final RTL datapath, and intelligent algorithms for these steps can greatly optimize user parameters for the RTL such as latency, power, and area. We provided allocation algorithms for pipelining data-flow graphs; these algorithms can be integrated into existing synthesis systems for the generation of pipelined designs. We also described the MTCMOS methodology for leakage power reduction, which is integrated into our behavioral synthesis system known as AUDI. Finally, we described FASL, a leakage estimation tool that makes physical-level leakage power information available at the RT level, so as to drive the behavioral synthesis procedure towards optimizing the datapaths for minimal leakage power. FASL is used extensively for the experiments related to our approach, which are detailed in the next chapter.

CHAPTER 3
PROPOSED APPROACH

In pipelined datapaths, the number of functional units executing in a given clock cycle is much higher than in non-pipelined datapaths. In non-pipelined synthesis, datapath component types exhibit varying idle times for each component instance. Due to the implementation of pipelined stages, the cycle lifetime of each component instance is greatly reduced, which also reduces its idle periods. This reduction in idle periods also reduces the number of times functional units or registers can be put in standby mode. Hence the potential for leakage power reduction is lower in pipelined datapaths than in non-pipelined datapaths. It should be noted that when the data introduction interval $\delta_0$ is one clock cycle (one input per clock cycle), all functional units and registers are active at all times. In this case, standby leakage power is at its minimum, since no functional unit has idle states at any time. Any form of leakage reduction through MTCMOS becomes useless, due to the absence of idle states. The notion of low-leakage scheduling, allocation, and binding for pipelined circuits becomes relevant only when the data introduction interval $\delta_0$ becomes greater than one clock cycle. Based on these observations, the following approaches are proposed. In this work, leakage optimization is performed for both functional resource allocation and register allocation.

3.1 Functional resource optimization

In this section, we discuss our leakage optimization algorithm for functional resource allocation. In traditional techniques, a data-flow graph is scheduled using a generic scheduler such as ASAP or FDS. This gives a set of timestamps for operations, which are then analysed by the allocation procedure. This timestamp set is then resolved into an operation compatibility graph and a register compatibility graph, which are then partitioned using generic graph partitioning algorithms such as clique partitioning or left-edge (LEA), etc.

Assuming, as before, that there are $k$ resource types, $k \in \{k_1, k_2, k_3, \ldots, k_n\}$, the allocation phase outputs an $I \times \lambda$ resource-allocation table, where $I$ is the number of instances allocated for the resource type $k_i$ and $\lambda$ is the latency of the schedule. The allocation table for the ASAP schedule of the FIR filter shown in Figure 2.1 is illustrated below.

  Multiplier   M1    M2    M3    M4    M5
  T1           op1   op2   op4   op6   op8
  T2 to T5     idle  idle  idle  idle  idle

  Adder        A1
  T1           idle
  T2           op3
  T3           op5
  T4           op7
  T5           op9

Figure 3.1. Placement of operations in space-time matrix

From Figure 3.1, it can be seen that multipliers $M_1$, $M_2$, $M_3$, $M_4$, $M_5$ all have 1 active state (during time step $T_1$) and 4 standby states (during $T_2$, $T_3$, $T_4$, $T_5$). The adder instance $A_1$ has only 1 idle state ($T_1$) while its remaining states are active. From this, we can intuitively gauge that the 5 multipliers can be bound to MTCMOS, since they all experience long idle times. We can also bind the adder $A_1$ to MTCMOS, due to the presence of 1 idle state. But this may prove to be counter-productive, as the area overhead and delay of the adder may be increased just for the sake of the 1 idle state in the adder.

Let us now consider the pipelined case, where $\lambda$ of the allocation table is equal to the data introduction interval $\delta_0$. This case is illustrated in Figure 3.2. Here, we can see that each of the multipliers $\{M_1, M_2, M_3, M_4, M_5\}$ has 1 idle and 1 active state. Hence, to minimize leakage power, each of them has to be bound to an MTCMOS instance. However, we see that none of the adders is in standby; hence by pipelining we have reduced the area overhead of adders due to MTCMOS and also the possible dynamic and

leakage power by-product due to the sleep transistor. While the adder allocations have been optimized, the multiplier allocations are unoptimized. Such an arrangement is due to the schedule obtained from ASAP. A good schedule would produce an optimal allocation for the bulky multipliers with MTCMOS binding. Our work provides an approach targeting this kind of problem.

[Figure 3.2. Pipeline placement of operations ($\delta_0 = 2$)]

The premise for our optimization approach is framed as follows:

A resource instance (say an adder) may have a number of operations bound to it. There may also be some finite idle periods that emerge within this instance after all operations are bound. If there are many such instance bindings, where resources contain a lot of idle times as well as operations, the following will hold:

- the number of resource allocations is high
- the number of MTCMOS-bound instances is high (increased area overhead and controller overhead)
- delay and capacitive power dissipation increase (due to frequent sleep and wake-up transitions)

Existing allocation heuristics tend to be unaware of such problems. It can be considered infeasible to bind resources to MTCMOS if they are not idle for more than a single clock period in a data introduction interval. Since the sleep transistor itself dissipates leakage power, the leakage power saved through the application of sleep transistors needs to be greater than the leakage power dissipated by the MTCMOS instance. Our algorithm accounts for this fact, and makes modifications to the schedule such that fewer modules are considered for MTCMOS binding. We propose a

scheduling, allocation, and binding algorithm that uses simulated annealing for searching and discovering optimal solutions to this problem.

3.1.1 Simulated annealing

Simulated annealing is a meta-heuristic [38] that mimics the annealing process of metals. A metal is heated to a very high temperature, such that its atoms become free to move about. The temperature is then slowly reduced and the metal is gradually cooled, until it crystallizes and a thermal equilibrium is reached. When this state is achieved, the atoms become very rigid, and further cooling achieves no improvement in the metal's equilibrium. In the same way, an initial solution is taken as the starting solution and its cost is taken as the initial cost. Simulated annealing (SA) thereafter generates new solutions by making either random moves or pre-computed ones. The gain in cost of the new solutions is then evaluated at each iteration. At higher temperatures, positive-gain moves are accepted, while negative-gain moves are accepted with a probability that depends on the energy of the system at that time. The number of negative-gain moves accepted reduces as the temperature decreases. Simulated annealing accepts non-improving moves (known as hill climbing) so as to escape from local minima.

Devadas et al. [39] use simulated annealing for synthesizing datapaths with low area and latency. The scheduling, allocation, and binding problem is modelled as a two-dimensional placement problem. Operations are placed in a two-dimensional matrix of rows and columns. Scheduling an operation in the data-flow graph is equivalent to placing it in a row, which corresponds to a time step. The columns correspond to possible instances of a resource. The binding step is equivalent to placing operations in columns. Figure 3.1 shows an example of this matrix, which is unique for different resource types.
The matrix contains an operation placement after an As-Soon-As-Possible (ASAP) schedule is performed on the FIR filter example. A move can be interchanging the positions of two operations in the matrix, finding a new position for an operation in the matrix, or interchanging the inputs between symmetric operations. The cost function is dependent on total hardware area and execution time. They also describe a framework for synthesizing

pipelined datapaths using simulated annealing and producing solutions with reduced area and minimum latency.

Prabhakaran et al. [40] present a simultaneous scheduling, allocation, and floorplanning algorithm which uses simulated annealing to discover datapath solutions with minimized interconnect power dissipation. The cost function is dependent on the transition activity and the distance between modules on the floorplan for satisfaction of latency constraints.

Pipeline synthesis has a smaller search space than non-pipeline synthesis, and hence is a limiting case of non-pipeline synthesis. The basic simulated annealing procedure used in our approach is shown as pseudo-code in Figure 3.3. The data-flow graph is first subjected to both ASAP and ALAP scheduling to determine the individual operation mobilities. We select the initial ASAP schedule as the starting solution on which improvements are made. A certain user-specified data introduction interval $\delta_0$ is considered for operation binding. The operations are then bound in order of their schedule times and their mobilities, as shown in Figure 3.4. This binding is then cost-evaluated to give an initial cost for the SA procedure. A move is selected from a pre-computed set of moves and is then made by an operation irrespective of its resource type.

3.1.2 Generation of neighbour states

At the beginning of every iteration, a set $M(i, j)$ containing all operation moves possible from the current state is created. Here $i$ denotes the operation under consideration, and $j$ is the number of positions that $i$ can move within its mobility range. A move is then made, and the schedule is altered as directed. The resulting scheduled data-flow graph is then operated on by the ALLOCATE_BIND procedure, and this binding solution is evaluated for cost. The gain of this solution is determined, and as per the selection-rejection heuristic of SA, the current schedule is saved. Our algorithm assumes that there is no change in latency $\lambda$ and that $\lambda$ is equal to the length of the critical path of the data-flow graph.
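The mobility computation and the move set described above can be sketched as follows. Several things here are assumptions rather than the thesis implementation: the chained-adder FIR graph is a hypothetical data-flow graph chosen to be consistent with the ASAP schedule of Figure 3.1, all function names are illustrative, the repair of successor schedules after a move is omitted, and the acceptance rule is the standard Metropolis criterion with gain defined as old cost minus new cost.

```python
import math
import random

# HYPOTHETICAL 5-tap FIR data-flow graph: operation -> predecessor operations.
PREDS = {1: [], 2: [], 3: [1, 2], 4: [], 5: [3, 4],
         6: [], 7: [5, 6], 8: [], 9: [7, 8]}

def asap_schedule(preds):
    """Earliest time step of each operation (unit-delay operations)."""
    times = {}
    def visit(op):
        if op not in times:
            times[op] = 1 + max((visit(p) for p in preds[op]), default=0)
        return times[op]
    for op in preds:
        visit(op)
    return times

def alap_schedule(preds, latency):
    """Latest time step of each operation for a fixed latency."""
    succs = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    times = {}
    def visit(op):
        if op not in times:
            times[op] = min((visit(s) for s in succs[op]), default=latency + 1) - 1
        return times[op]
    for op in preds:
        visit(op)
    return times

def move_set(asap, alap, sched):
    """All (operation, shift) pairs that keep the operation inside its
    [ASAP, ALAP] range. Successor repair after a move is not modelled here."""
    return [(op, t - sched[op])
            for op in sched
            for t in range(asap[op], alap[op] + 1)
            if t != sched[op]]

def accept(gain, temperature, rng=random):
    """Metropolis-style rule (assumed): improving moves are always taken,
    worsening moves are taken with probability exp(gain / temperature)."""
    return gain > 0 or rng.random() < math.exp(gain / temperature)

if __name__ == "__main__":
    asap = asap_schedule(PREDS)
    alap = alap_schedule(PREDS, latency=max(asap.values()))
    print({op: alap[op] - asap[op] for op in PREDS})  # mobilities
    print(move_set(asap, alap, asap))
```

In this graph only the three later multiplications have non-zero mobility, so every move from the ASAP starting solution shifts a multiplication to a later step, which is the kind of rescheduling that changes the activity patterns of the multiplier instances.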

SA_Optim[$G(V, E)$]
1  $ASAP_G \leftarrow$ ASAP scheduling performed on $G(V, E)$
2  $ALAP_G \leftarrow$ ALAP scheduling performed on $G(V, E)$
3  $\mu(G) \leftarrow ALAP_G - ASAP_G$
4  Select starting solution $x_{start} \leftarrow ASAP_G$
5  $B_{x_{start}} \leftarrow$ allocate_bind($x_{start}$)
6  Initial cost $\leftarrow$ cost($B_{x_{start}}$)
7  Current solution $S \leftarrow B_{x_{start}}$
8  Initial temperature $T \leftarrow T_0$
9  while cost is changing
10   $I \leftarrow$ number of iterations
11   while $I > 0$
12     new solution $S' \leftarrow$ generate_neighbor($S$)
13     $\Delta C \leftarrow$ cost($B_S$) $-$ cost($B_{S'}$)
14     if ($\Delta C < 0$) or (random(0, 1)
ALLOCATE_BIND[$G(V, E)$]
1  $\forall V \in G(V, E)$, determine $\mu_{MAX} = \max(ALAP_{G(V)} - ASAP_{G(V)})$
2  for $i \in \{1, 2, 3, \ldots, \mu_{MAX}\}$
3    for $j \in \{1, 2, 3, \ldots, V\}$
4      for $k \in R_x$
5        if $\mu\{j\} = i$
6          $B_k\{l, (sched(x) \bmod \delta_0)\} = j$
7        endif
8      endfor
9    endfor
10 endfor

Figure 3.4. Binding procedure for SA

GENERATE_NEIGHBOUR($S$)
1  Create move set $M(i, j)$
2  Pick a move $m$ at random from $M(i, j)$
3  $sched(i) = sched(i) + j$
4  modify_schedule[successor($i$)]

Figure 3.5. Generation of new states

2. It should have minimal standby leakage dissipation and consequently minimal overall leakage dissipation.
3. It should have minimal transitions between 'sleep' and 'wake-up' states, to minimise circuit delay and glitching capacitance.
4. It should have fewer MTCMOS instances, to minimize the area overhead due to MTCMOS and the delay of functional instances.

In the next section, we discuss the derivation of our cost function, which accounts for the above criteria.

3.1.4 Cost function characterization

The aim of our SA-based approach is to produce optimal RTL datapaths with low leakage power and minimal area overhead. However, the MTCMOS technique imposes a significantly large area overhead on the datapath. When there are very few idle states in

a resource, the gain in leakage power reduction is offset to a good extent by the leakage dissipated by the sleep transistor. Hence, we consider it necessary to minimize the usage of sleep transistors to optimize area, and also to optimize the resource binding such that each instance has either many idle states or no idle states.

[Figure 3.6. Activity table example (8-bit xor, 8-bit subtractor, 8-bit adder, 8-bit multiplier)]

Each MTCMOS instance goes through several 'sleep' and 'wake-up' transitions. An instance with such frequent activity is always found to be in its transition region of leakage dissipation. Here leakage power is slightly higher due to the glitching characteristics of the transistor. The threshold time is the time after which this leakage is seen to settle down. This requires the instance to have long idle times so as to have settled leakage dissipation. Instances with long contiguous idle times show lower leakage profiles than instances with frequent transition activity.

These considerations are taken into account in our cost factor derivation experiments. We use a sample DFG such as an FIR filter for our analysis. We synthesized the RTLs in such a way as to force the binding of the operations to match our desired activity combinations. The RTL is then simulated using FASL, and the leakage power profile of the functional instance under observation is extracted using a script.

We split the leakage cost factors into two tables: a Leakage_Constant table and a Settling_Constant table. Initially we normalize all the weights in the tables to $1/2^{\delta_0}$ before beginning our iterations. After a set of iterations, we update the leakage dissipation

constants. The cost evaluation procedure is described in Figure 3.8. The number of distinct combinations $N_C$ for a data initiation rate $\delta_0$ is $2^{\delta_0}$.

[Figure 3.7. Activity combinations for various data-initiation rates (panels: data initiation rate 2, data initiation rate 3, data initiation rate 4)]

If $\delta_0$ is 3, there will be 8 possible combinations. Hence the initial weight of all the combinations is 0.125. We then perform RTL synthesis and simulation for two combinations at a time. The average leakage powers for both combinations are compared. If combination A has a better average leakage power than combination B, the weightage for combination A is increased by 0.5. This increase is noted, as the value of increase or decrease for this combination will be (1/8)*0.5 the next time. If this combination has a lower leakage dissipation than a combination (say C), then its weightage reduces by (1/8)*0.5, and the weightage of combination C increases by 0.5 or (1/8)*0.5, depending on whether it was already increased. This comparison process continues until all the combinations are evaluated against each other, and is repeated for several iterations. The iterations are repeated for several trials and random input vector sets (in our case, we used 500 vectors), and the average weightages are recorded. Table 3.1 was obtained after extensive characterization on a 16-bit adder. This table contains the Leakage_Constant table and

the Settling_Constant table. When this table is looked up, the value of $L_c(1 - W_l)$ is added to the total cost.

COST[$G(V, E)$]
  for all instances $r \in \{1, 2, 3, \ldots, r_n\}$ of resource type $k_i \in \{k_1, k_2, k_3, \ldots, k_n\}$
    Determine Active-Idle combination $Comb_r$
    Match $Comb_r$ with Leakage_Cost table
    $cost_C(Comb_r) \leftarrow L_c (1 - W_L)$
    $cost \leftarrow cost + C(Comb_r)$
    Match $Comb_r$ with Settling_Cost table
    $cost_S(Comb_r) \leftarrow S_c (1 - W_S)$
    $cost \leftarrow cost + S(Comb_r)$
  endfor
  return cost

Figure 3.8. Cost evaluation procedure

Table 3.1. Leakage and settling constants table for 16-bit adder activity combinations ($\delta_0 = 3$)

  Comb.   L_c      W_l    S_c      W_s
  1       2.6628   0.73   1.1289   0.69
  2       2.7359   0.66   1.8178   0.56
  3       2.7528   0.64   1.8256   0.53
  4       2.7715   0.62   1.4687   0.46
  5       2.7559   0.54   1.4687   0.46
  6       2.8100   0.33   1.4812   0.34
  7       2.7731   0.52   1.4799   0.45
  8       2.3842   0.78   1.2112   0.78

For the Settling_Constant table, we consider RTL components such as the multiplier, adder, and subtractor individually. We synthesize RTLs in the same way as before. We then observe the leakage profile of each RTL component as per its activity table, and average the leakage power in the transient region only. Frequently active components will always be in their transient regions, and their leakage profile will be greater than that of components that experience long idle times. Based on this, we compare the transient profiles for all combinations and vary the weightage in the same way as explained before.
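The cost evaluation of Figure 3.8 can be sketched in Python using the constants of Table 3.1. One detail is an assumption: the text does not state how an active/idle pattern maps to a combination number, so the sketch uses a hypothetical binary encoding (idle = 0, active = 1, first time step most significant, numbering from 1). All function names are illustrative.

```python
# Table 3.1 (16-bit adder, delta0 = 3): combination -> (L_c, W_l, S_c, W_s)
TABLE_3_1 = {
    1: (2.6628, 0.73, 1.1289, 0.69),
    2: (2.7359, 0.66, 1.8178, 0.56),
    3: (2.7528, 0.64, 1.8256, 0.53),
    4: (2.7715, 0.62, 1.4687, 0.46),
    5: (2.7559, 0.54, 1.4687, 0.46),
    6: (2.8100, 0.33, 1.4812, 0.34),
    7: (2.7731, 0.52, 1.4799, 0.45),
    8: (2.3842, 0.78, 1.2112, 0.78),
}

def combination_index(activity):
    """HYPOTHETICAL mapping from an active/idle pattern to a table row:
    binary value of the pattern (True = active) plus one."""
    index = 0
    for active in activity:
        index = index * 2 + (1 if active else 0)
    return index + 1

def instance_cost(activity, table=TABLE_3_1):
    """Leakage term L_c * (1 - W_l) plus settling term S_c * (1 - W_s)."""
    l_c, w_l, s_c, w_s = table[combination_index(activity)]
    return l_c * (1 - w_l) + s_c * (1 - w_s)

def total_cost(instances, table=TABLE_3_1):
    """Sum of the per-instance costs, as in the loop of Figure 3.8."""
    return sum(instance_cost(activity, table) for activity in instances)

if __name__ == "__main__":
    always_idle = (False, False, False)
    mixed = (True, False, True)
    print(instance_cost(always_idle), instance_cost(mixed))
```

Under this assumed encoding, rows 1 and 8 correspond to an instance that is always idle or always active, which is consistent with those rows carrying the highest weights in Table 3.1.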

3.2 Register allocation & optimization

The schedule obtained from the SA-based algorithm is then analyzed by the register allocation phase. A lifetime analysis for all the edges is conducted, and the compatibility of each edge is determined. We note that edges that have a lifetime of more than 1 time step are split into sub-edges. This compatibility analysis is the same as the functional resource compatibility algorithm shown in Figure 2.5, with the vertices replaced by edges.

The compatibility graph is then partitioned using a clique-partitioning heuristic proposed by Tseng and Siewiorek [41]. The objective of this algorithm is to form cliques in such a way that the idle times of registers are contiguous and maximized. It aggregates the edge bindings to completely utilize a register's time cycle. Once the aggregation is completed based on the schedule, the remaining edges are bound such that edges that are temporally closer to each other are present within the same register binding.

In our work, we use a modified version of the clique-partitioning based method presented in [18]. The algorithm, which supports pipelining, is explained in Figure 3.9.

Modified Clique Partitioning Algorithm:
Procedure clique_partition_low_leakage
  while (N is not empty) do
    x <- select_node();
    if x is not connected to any other node
      add it to clique
    else
      form set Y with all neighbours of x
      for each node i in Y
        determine a set of nodes M in Y incompatible with i
      endfor
      for each node y in M that excludes minimum nodes in Y
        calculate the effect on sleep time cost from y
        calculate the effect on transition cost from y
        update the set of nodes with maximum cost factor
      endfor
      merge nodes with max cost factor y into set x
    endif
end procedure

Figure 3.9. Clique partitioning for leakage power optimization
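The lifetime analysis at the start of this section (splitting edges into sub-edges and applying the modulo rule of Figure 2.5) can be illustrated with a short sketch. It covers only the compatibility step, not the leakage-aware clique partitioning of Figure 3.9. The lifetime convention (an edge occupies a register during an inclusive range of time steps), the function names, and the example lifetimes are all assumptions made for illustration.

```python
from collections import Counter

def split_into_sub_edges(lifetimes):
    """Split each edge, alive during [first, last] inclusive, into one
    sub-edge per time step. Returns a list of (edge name, time step)."""
    return [(edge, t)
            for edge, (first, last) in lifetimes.items()
            for t in range(first, last + 1)]

def sub_edges_compatible(a, b, delta0):
    """Figure 2.5 rule applied to sub-edges: distinct sub-edges can share a
    register only if their time steps differ modulo delta0."""
    return a == b or (a[1] % delta0) != (b[1] % delta0)

def registers_needed(lifetimes, delta0):
    """Minimum number of cliques under the rule above. Sub-edges in the same
    slot (time step modulo delta0) are mutually incompatible, and sub-edges in
    different slots are always compatible, so the answer is the size of the
    most crowded slot."""
    slots = Counter(t % delta0 for _, t in split_into_sub_edges(lifetimes))
    return max(slots.values(), default=0)

# HYPOTHETICAL edge lifetimes: edge -> (first step alive, last step alive)
LIFETIMES = {"r0": (1, 1), "r1": (1, 3), "r2": (2, 2)}

if __name__ == "__main__":
    print(registers_needed(LIFETIMES, delta0=2))
    print(registers_needed(LIFETIMES, delta0=3))
```

An edge that lives longer than the data introduction interval overlaps with its own copy from the next input, which is why splitting it into sub-edges makes it occupy more than one register in the pipelined case.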

To summarize, we described approaches for leakage power optimization of pipelined datapaths using MTCMOS. Since MTCMOS imposes significant area, delay, and glitching considerations on the synthesis procedure, we developed cost functions that take these considerations into account so as to maximize leakage power optimization and minimize area and capacitive effects. We provided approaches for both functional resource optimization and register optimization.

CHAPTER 4
EXPERIMENTAL RESULTS

We synthesized pipelined datapaths for six linear DSP benchmarks, which are shown in Table 4.1. The RTL simulations were performed using the FASL architectural simulation library. The run-times for FASL were considerably short, being typically the run-times reported by the Cadence NCLaunch VHDL simulator. The simulations were run on a Sun UltraSparc II machine with a dual 200 MHz CPU and 256 MB of memory.

We compare the datapaths generated by our approach to unoptimized datapaths synthesized using force-directed scheduling and binding using regular clique partitioning. In Tables 4.2, 4.3, 4.4, 4.5, and 4.6 we report the leakage power reductions provided by our approach for various values of the data initiation rate ($\delta_0$). In these tables, we also report the average area reductions provided by our approach. Since our approach uses simulated annealing, we provide an analysis of the average running time of our approach in Table 4.7.

Table 4.1. Benchmarks used for analysis

  Benchmark                              Operations       Latency ($\lambda$)
  Finite Impulse Response (FIR) filter   5*, 4+           5
  Elliptic Wave filter (EWF)             7*, 26+          14
  Infinite Impulse Response (IIR) filter 5*, 4+           4
  Auto Regression (AR) filter            16*, 12+         8
  Fast Fourier Transform (FFT)           16*, 25+, 7-     6

For each simulation run, we simulated with 1000 random test vectors to observe the leakage reduction. The current FASL library has simulation support for the 100 nm Berkeley Predictive Technology Models, and extensive characterization has been performed for this generation. The clock period was kept at 50 ns to fully observe the effects of the MTCMOS leakage transistors. For each benchmark, we ran simulation runs from $\delta_0 = 2$ up to $\delta_0 = 6$.


As noted before, it is only from a data initiation interval (DII) of 2 that our algorithm performs any optimization through the usage of a standby leakage technique such as MTCMOS.

Table 4.2. FIR filter leakage & area analysis

DII   Regular                  SA-based                 Leakage     Area
      Leakage    Area          Leakage    Area          reduction   overhead
      (μW)       ( 2 )         (μW)       ( 2 )         (%)         (%)
2     0.01941    140727.67     0.01364    151928.23     29.72       7.95
3     0.01841    138768.96     0.013784   111322.49     25.12       -19.77
4     0.02407    96923.78      0.01794    110040.66     25.46       13.53
5     0.01489    94654.53      0.01108    106581.12     25.58       12.61

Table 4.3. EWF filter leakage & area analysis

DII   Regular                  SA-based                 Leakage     Area
      Leakage    Area          Leakage    Area          reduction   overhead
      (μW)       ( 2 )         (μW)       ( 2 )         (%)         (%)
2     0.03781    254996.51     0.03245    272959.93     14.17       7.04
3     0.05430    240736.48     0.03026    227423.48     44.27       -5.53
4     0.06793    233283.65     0.02702    283322.18     60.22       21.44
5     0.06583    187234.59     0.04085    173279.12     37.94       -7.45
6     0.09305    147085.09     0.04134    172924.48     55.57       17.56
7     0.10307    180952.28     0.04249    211773.85     58.77       17.03
8     0.09553    140014.73     0.04200    163622.51     56.03       16.86
9     0.08028    137864.93     0.05388    163764.76     32.88       18.78
10    0.12168    138199.39     0.05586    165383.15     54.07       19.66
11    0.20879    137984.42     0.05995    164570.98     71.28       19.26
12    0.20763    137864.90     0.05853    160621.64     71.81       16.50

Table 4.4. IIR filter leakage & area analysis

DII   Regular                  SA-based                 Leakage     Area
      Leakage    Area          Leakage    Area          reduction   overhead
      (μW)       ( 2 )         (μW)       ( 2 )         (%)         (%)
2     0.01689    137264.06     0.01390    149942.93     17.70       9.23
3     0.02756    136069.75     0.02141    154207.84     22.31       13.33
4     0.03563    134015.48     0.02609    152118.56     26.77       13.54

Table 4.5. AR filter leakage & area analysis

DII   Regular                  SA-based                 Leakage     Area
      Leakage    Area          Leakage    Area          reduction   overhead
      (μW)       ( 2 )         (μW)       ( 2 )         (%)         (%)
2     0.05005    462236.62     0.04744    385632.81     5.21        -16.57
3     0.08714    373625.06     0.05334    424518.06     38.78       13.62
4     0.09851    290627.65     0.06460    213779.62     34.42       -26.44
5     0.08660    368872.00     0.04060    420956.72     53.11       14.12
6     0.12258    207223.70     0.05315    238991.09     56.64       15.33
7     0.15931    203903.62     0.06653    236505.41     58.23       15.98
8     0.12515    199795.06     0.05584    232221.79     55.39       16.23

Table 4.6. FFT leakage & area analysis

DII   Regular                  SA-based                 Leakage     Area
      Leakage    Area          Leakage    Area          reduction   overhead
      (μW)       ( 2 )         (μW)       ( 2 )         (%)         (%)
2     0.07016    453783.03     0.06115    417946.59     12.83       -7.89
3     0.09974    403099.18     0.08181    339320.12     17.96       -15.82
4     0.12847    278065.46     0.07344    245987.60     42.83       -11.53
5     0.06097    274339.40     0.03505    242233.36     42.51       -11.70
6     0.07164    267173.50     0.03442    300249.57     51.95       12.38

From the tables, we notice that our approach obtains, on average, a leakage power reduction of 38.2%. This ranges from as low as 5% in some cases to as high as 71% in a few cases. However, we notice that when the schedule latency of the design is equal to its critical path length, there is a definite area overhead incurred by our approach. Though for some of the cases we noticed substantial area reductions ranging from 8% to 26%, for most of the other cases the SA-based approach was not able to optimize area. The overall area overhead of the approach was around 6.2%.

We attribute this to a lack of sufficient operation mobility in the designs we considered. By increasing the latency of the schedule, we impose a finite mobility on all the operations, giving the SA-based approach more freedom to optimize area. We notice quite good improvements in area for the resulting synthesized designs. We provide an analysis for the AR filter benchmark in Figure 4.1 and Figure 4.2.

We obtain an area reduction in the range of 3.89% to 4.6% when the latency is increased by 1, 2, and 3 time steps. However, we also kept a watch on the leakage optimization profiles during this latency increase. We notice that there is a small decrease in the leakage power optimization due to the improved area optimizations.

Figure 4.1. Total area consumption of datapath (AR filter)

We notice that as we move towards a latency increase of 3 time steps, we obtain smaller leakage reductions even though area is becoming increasingly optimized. Hence it is not feasible to use very high values of latency for moderately large benchmarks. A latency increase of up to 30% of the schedule length provides a good area-leakage tradeoff. Any further increase can be counter-productive.

The average running times of our SA-based algorithm are indicated in Table 4.7. This running time is dependent on the mobility of the operations and the size of the benchmark. If the mobilities are low, only a finite number of moves can be made at every temperature iteration. We notice that the running time remains nearly the same for all values of the data initiation interval.

Figure 4.2. Leakage power profile comparison (AR filter)

In summary, we provide an indication of the typical improvements in throughput that can be afforded by our approach. We notice that the leakage power dissipated by regular HLS datapaths remains lower than the leakage power dissipated in pipelined datapaths. The regular HLS datapaths considered are MTCMOS-optimized datapaths generated using the approaches presented in [18]. We determine the pipeline latency for which the synthesized datapath has a leakage profile as near as possible to the leakage profile of the regular datapath. We then compute the improvement in throughput, which is given in Table 4.8.
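The mobility argument used in this chapter to explain the area results can be made concrete with the standard ASAP/ALAP definition of mobility from high-level synthesis. The sketch below assumes unit-delay operations and a small hypothetical data flow graph; it is not one of the benchmarks, and the thesis does not specify how mobility is computed in its SA-based algorithm.

```python
def asap(ops, preds):
    """Earliest start step of each operation (ops given in topological order)."""
    start = {}
    for op in ops:
        start[op] = max((start[p] + 1 for p in preds[op]), default=1)
    return start


def alap(ops, succs, latency):
    """Latest start step of each operation for a given schedule latency."""
    start = {}
    for op in reversed(ops):
        start[op] = min((start[s] - 1 for s in succs[op]), default=latency)
    return start


def mobility(ops, preds, latency):
    """Mobility = ALAP step minus ASAP step, assuming unit-delay operations."""
    succs = {op: [] for op in ops}
    for op in ops:
        for p in preds[op]:
            succs[p].append(op)
    early = asap(ops, preds)
    late = alap(ops, succs, latency)
    return {op: late[op] - early[op] for op in ops}


# Hypothetical graph: a1 = m1 + m2, a2 = a1 + m3. Critical path is 3 steps.
ops = ["m1", "m2", "m3", "a1", "a2"]
preds = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
print(mobility(ops, preds, latency=3))  # latency equal to the critical path
print(mobility(ops, preds, latency=4))  # latency relaxed by one time step
```

With the latency equal to the critical path length, every operation on the critical path has zero mobility, so an annealing move can only relocate m3. Relaxing the latency by one step gives every operation a nonzero mobility, which matches the observation that a longer schedule leaves more room to optimize area.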


Table 4.7. Running times for SA-algorithm

Benchmark     Running time (s)
FIR filter    73
IIR filter    65
AR filter     133
EWF filter    182
FFT filter    144

Table 4.8. Speedup analysis for various benchmarks

Benchmark   Regular     SA-based    DII   Speedup
FIR         0.008339    0.01364     2     2.5
IIR         0.008377    0.01390     2     2
EWF         0.02222     0.02702     4     3.5
AR          0.02943     0.04060     5     1.6
FFT         0.02523     0.03505     5     1.2
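The speedup column of Table 4.8 is consistent with the ratio of the schedule latency in Table 4.1 to the chosen data initiation interval. This relation is inferred from the tabulated numbers and is not stated as a formula in the text. The sketch below is a minimal check of that inference; the function name is hypothetical.

```python
def pipeline_speedup(latency, dii):
    """Throughput gain of a pipelined datapath that accepts a new input
    every `dii` steps over a non-pipelined one that accepts a new input
    every `latency` steps (assumed interpretation of Table 4.8)."""
    return latency / dii


# (latency from Table 4.1, DII from Table 4.8)
benchmarks = {
    "FIR": (5, 2),
    "IIR": (4, 2),
    "EWF": (14, 4),
    "AR": (8, 5),
    "FFT": (6, 5),
}

speedups = {name: pipeline_speedup(lat, dii) for name, (lat, dii) in benchmarks.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.1f}")
```

Under this reading, the EWF filter gains the most because its long schedule (14 steps) tolerates a comparatively short initiation interval before its leakage departs from that of the regular datapath.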


CHAPTER 5
CONCLUSIONS AND FUTURE WORK

In this work, we have presented an approach for the behavioral synthesis of pipelined datapaths with low leakage power. Our approach uses simulated annealing and provides the flexibility of varying the data introduction intervals during synthesis. Our algorithm determines a schedule and binding that has lower area and reduced leakage. The algorithm is also MTCMOS-aware and thus provides a datapath that has a very selective MTCMOS binding. The area overheads incurred by our approach are also low.

Although our algorithm is fully extendable towards regular high level synthesis, we have chosen to present it in a limiting case such as pipelining. Any standby leakage reduction technique other than MTCMOS is also fully supported by this method. Our algorithm attempts to balance the factors of performance and power. Since standby leakage reduction techniques can incur a performance loss, we provide an approach that reduces leakage within a performance-enhancing technique such as functional pipelining.

The following are the main contributions of this work:

1. A pipelining controller module for linear data flow graphs, which can support the low-power and crosstalk optimized designs synthesized in AUDI.
2. A scheduling, allocation and binding algorithm with completely characterized MTCMOS based cost functions for the behavioral synthesis of pipelined datapaths.
3. A front end for the AUDI system (also known as VHDL2AIF) which can support a vast portion of descriptions written in behavioral VHDL.

This work was conceived to provide a comprehensive framework for the synthesis of pipelined designs. However, we have identified some key components of this work which are part of the future work.


Currently, the FASL system handles RTL structures that are synthesized from linear data flow graphs. Work is under way to realise FASL2 as a complete RTL system and backend for AUDI supporting large control-flow intensive (CFI) designs.

The pipelining module implemented in AUDI currently fully supports linear behaviors. We are now integrating the ability to generate controllers for control-flow intensive designs. An approach to determine an optimal FSM array for handling branching and looping is currently under study.

Once the CFI controller is developed, we will perform modification and analysis using our SA-based method on large CFI designs from the USF VLSI research group that are concentrated in mobile and wireless areas, since our approach primarily targets this area for low power designs with increased performance.

For linear behaviors, controller leakage power forms a small percentage and can generally be ignored. But for CFI based designs, the controller leakage power becomes larger and cannot be disregarded. Our algorithm currently does not perform controller leakage reduction. However, a study is underway to determine an activity-driven logic synthesis mechanism and use this mechanism to drive the high level synthesis procedure to generate efficient and optimized controllers.

VHDL2AIF in its current state is a behavioral VHDL to AUDI Intermediate Format (AIF) converter capable of performing simple compiler level optimizations. When completed, VHDL2AIF will be capable of generating highly optimized CDFGs with improved control and data-flow for high level synthesis. Loops in the body of a design are a difficult construct to pipeline, as the source vertex of a loop is not only executed for every initiation, it is also executed for every iteration of the loop in case of any loop-carried dependencies. Currently underway is a study to determine how to apply transformations to loops so as to reduce this problem during pipeline synthesis. Previous work regarding this problem has been done in [42].


REFERENCES

[1] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand. "Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits". Proceedings of the IEEE, 91(2):305-327, Feb 2003.
[2] D. Melinak. "Power integrity comes home to roost at 90 nm". Technical report, www.elecdesign.com, February 2005.
[3] J. T. Kao and A. P. Chandrakasan. "Dual-Threshold Voltage Techniques for Low-Power Digital Circuits". IEEE Journal on Solid State Circuits, 35(7):1009-1018, July 2000.
[4] K. Roy. "Leakage power reduction in low-voltage CMOS design". In Proceedings of the IEEE International Conference on Circuits and Systems, pages 167-173, Sep 1998.
[5] Y. Taur and T. H. Ning. Fundamentals of Modern VLSI Devices. New York: Cambridge University Press, 1998.
[6] S. Borkar. "Design challenges of technology scaling". IEEE Micro, 19(4):23-29, Aug 1999.
[7] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu and J. Yamada. "1V Power supply High Speed Digital Circuit Technology with Multi-Threshold Voltage CMOS". IEEE Journal on Solid State Circuits, 30(8):847-854, Aug 1995.
[8] L. Wei, Z. Chen, M. Johnson, K. Roy, Y. Ye, and V. De. "Design and optimization of dual-voltage circuits for low voltage low power applications". IEEE Transactions on Very Large Scale Integrated Systems, 7(1):16-24, March 1999.
[9] J. Halter and F. Najm. "A gate-level leakage power reduction method for ultra-low-power CMOS circuits". In Proceedings of the IEEE Custom Integrated Circuits conference, pages 475-478, 1997.
[10] M. C. Johnson, D. Somasekhar, L. Chiou and K. Roy. "Leakage control with efficient use of transistor stacks in single threshold CMOS". IEEE Transactions on Very Large Scale Integrated Systems, 10(1):1-5, Feb 2002.
[11] A. Abdollahi, F. Fallah, and M. Pedram. "Leakage current reduction in CMOS VLSI circuits by input vector control". IEEE Transactions on Very Large Scale Integrated Systems, 12(2):140-154, Feb 2004.
[12] N. Hanchate and N. Ranganathan. "LECTOR: A Technique for Leakage Reduction in CMOS circuits". IEEE Transactions on Very Large Scale Integrated Systems, 12(2):196-205, Feb 2004.


[13] M. Powell, S. Yuang, B. Falsafi, K. Roy, and T. N. Vijaykumar. "Gated-Vdd: a circuit technique to reduce leakage in deep sub-micron cache memories". In Proceedings of the International Symposium on Low Power Electronics and Design, pages 90-95, July 2000.
[14] Y. Ye, S. Borkar, and V. De. "A new technique for standby leakage reduction in high performance circuits". In Digest of Tech. Papers Symposium VLSI Circuits, pages 40-41, June 1998.
[15] Z. Chen, M. Johnson, L. Wei, and K. Roy. "Estimation of standby leakage power in CMOS circuits considering accurate modelling of transistor stacks". In Proceedings of Low Power Electronics and Design, pages 239-244, Aug 1998.
[16] V. Sundararajan and K. K. Parhi. "Low power synthesis of dual threshold voltage CMOS VLSI circuits". In Proceedings of the International Symposium of Low Power Electronics and Design, pages 139-144, Aug 1999.
[17] K. S. Khouri and N. K. Jha. "Leakage power analysis and reduction during behavioral synthesis". IEEE Transactions on Very Large Scale Integrated Systems, 10(6):876-885, Dec 2002.
[18] C. Gopalakrishnan and S. Katkoori. "Resource allocation and binding approach for low leakage power". In Proceedings of 16th International Conference on VLSI Design, pages 297-302, 2003.
[19] N. Park and A. C. Parker. "Sehwa: A software package for synthesis of pipelines from behavioral specifications". IEEE Transactions on Very Large Scale Integrated Systems, 7(3):356-370, Mar 1988.
[20] P. G. Paulin and J. P. Knight. "Force-directed scheduling for the behavioral synthesis of ASICs". IEEE Transactions on Computer-Aided Design of Circuits and Systems, 8(6):661-679, June 1989.
[21] C. Hwang, Y. Hsu, and Y. Lin. "PLS: A scheduler for pipeline synthesis". IEEE Transactions on Computer-Aided Design of Circuits and Systems, 12(9):1279-1286, Sep 1993.
[22] H. Jun and S. Hwang. "Design of a pipelined datapath synthesis system for digital signal processing". IEEE Transactions on Very Large Scale Integration Systems, 2(3):292-303, Sep 1994.
[23] H. Jun and S. Hwang. "Automatic synthesis of dynamically configured pipelines supporting variable data initiation intervals". IEEE Transactions on Very Large Scale Integration Systems, 4(2):279-285, June 1996.
[24] A. Chandrakasan, S. Sheng, and R. W. Broderson. "Low-Power CMOS Digital Design". IEEE Journal on Solid State Circuits, 27(4):473-484, January 1992.
[25] J. Chang and M. Pedram. "Module assignment for low power". In European Design Automation Conference EURO-DAC 96, pages 376-381, Sep 1996.


[26] S. Heo and K. Asanovic. "Power-optimal pipelining in deep submicron technology". In Proceedings of the 2004 International Symposium on Low Power Electronics and Design, pages 218-223, Aug 2004.
[27] T. Z. Yu, F. Chen, and E. H.-M. Sha. "Loop scheduling algorithms for power reduction". In ICASSP, pages 3073-3076, May 1998.
[28] D. Kim, D. Shin, and K. Choi. "Low power pipelining of Linear Systems: A common operand centric approach". In International Symposium on Low Power Electronics and Design, pages 225-230, 2001.
[29] Giovanni De Micheli. Synthesis and Optimization of Digital Circuits. McGraw Hill Inc., 1994.
[30] A. Raghunathan and N. K. Jha. "Behavioral Synthesis for Low Power". In IEEE International Conference on Computer Design, pages 318-322, Oct 1994.
[31] A. K. Murugavel and N. Ranganathan. "A game theoretic approach for power optimisation during behavioral synthesis". IEEE Transactions on Very Large Scale Integrated Systems, 11(6):1031-1043, Dec 2003.
[32] S. P. Mohanty and N. Ranganathan. "A framework for energy and transient power minimization during behavioral synthesis". IEEE Transactions on Very Large Scale Integrated Systems, 12(6):562-572, June 2004.
[33] R. San Martin and J. P. Knight. "Optimising power in ASIC behavioral synthesis". IEEE Design & Test of Computers, 13(2):58-70, Summer 1996.
[34] N. Kumar, S. Katkoori, L. Rader, and R. Vemuri. "Profile-driven behavioral synthesis for low power VLSI systems". IEEE Design & Test of Computers, 12(3):70-83, Autumn 1995.
[35] C. Gopalakrishnan and S. Katkoori. "Knapbind: an area efficient binding algorithm for low leakage datapaths". In Proceedings of 21st International Conference on Computer Design, pages 430-435, 2003.
[36] P. R. Panda and N. D. Dutt. "1995 high level synthesis design repository". In Proceedings of the Eighth International Symposium on System Synthesis, pages 170-174, Sep 1995.
[37] C. Gopalakrishnan and S. Katkoori. "An architectural leakage power simulator for VHDL structural descriptions". In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, pages 211-212, Feb 2003.
[38] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. "Optimization by simulated annealing". Science, 220(4598):671-680, 13 May 1983.
[39] S. Devadas and A. R. Newton. "Algorithms for hardware allocation in datapath synthesis". IEEE Transactions on Computer-Aided Design of Circuits and Systems, 8(7):768-781, July 1989.


[40] P. Prabhakaran, P. Bannerjee, J. Crenshaw, and M. Sarrafzadeh. "Simultaneous scheduling, allocation and floorplanning for interconnect power optimization". In Proceedings of the 12th International Conference on VLSI design, pages 423-427, Jan 1999.
[41] C. Tseng and D. P. Siewiorek. "Automated synthesis of datapaths in digital systems". IEEE Transactions on Computer-Aided Design of Circuits and Systems, 5(3):379-395, Mar 1986.
[42] M. Rim and R. Jain. "Valid transformations: a new class of loop transformations for high level synthesis and pipelined scheduling applications". IEEE Transactions on Parallel and Distributed Computing, 7(4):399-410, Apr 1996.